Recognizing Text Similarity

Ozlem Uzuner, Randall Davis & Boris Katz
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139


The Problem: There are a variety of circumstances under which it would be useful to be able to determine that two documents contain similar text, including detecting plagiarism and copyright infringement, and filtering and organizing documents returned as matches to a query by a search engine. The vast amount of digital information available on the Web makes it necessary to deal with all of these issues. The ease of copying facilitates both plagiarism and copyright infringement, while the volume of information available increases the difficulty of finding the right information quickly.

Motivation: Automatic text similarity detectors can help identify plagiarism and copyright infringement and help reduce the abuse and misuse of electronic content. In addition, they can make information discovery more intuitive and less time-consuming.

Related Work in Text Similarity Recognition: Existing text similarity detection systems recognize verbatim similarities between documents but do not pay attention to similarity in expression. SCAM [4, 5], developed in the Stanford Digital Library, looks for verbatim copies of text documents by fingerprinting documents and checking these fingerprints against a repository of previously known fingerprints. SCAM looks for overlaps between verbatim text strings to identify partial similarity. We want to detect non-verbatim similarity by measuring similarity of expression. We are particularly interested in identifying documents that are paraphrases of each other and that express the same content in the same way.

Related Work in Rhetorical Structure Theory: The main idea of Rhetorical Structure Theory (RST) [2] is to model the discourse structure of a text with a hierarchical tree diagram that uses rhetorical relations such as sequence, contrast and elaboration. In most texts, rhetorical relations are indicated by lexical cues. Contrast, for example, is usually signalled by cue phrases like “however”, “but” and “in contrast”.
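The cue-phrase observation can be illustrated with a minimal sketch. It is not the method of any of the cited systems; the table `CUE_PHRASES` and the function `find_cued_relations` are hypothetical names introduced here, and only the three contrast cues come from the text.

```python
import re

# Hypothetical cue-phrase table. Only the "contrast" cues are taken from the
# text; a real system would need cues for other relations as well.
CUE_PHRASES = {
    "contrast": ["however", "but", "in contrast"],
}


def find_cued_relations(text, cues=CUE_PHRASES):
    """Return (relation, cue, character offset) for every cue phrase in text."""
    hits = []
    for relation, phrases in cues.items():
        for phrase in phrases:
            # Whole-word, case-insensitive match of the cue phrase.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((relation, phrase, match.start()))
    # Report hits in the order they occur in the text.
    return sorted(hits, key=lambda hit: hit[2])
```

Applied to the second sentence of the fragment in Figure 1 ("But did that change actually begin a decade earlier ..."), this flags "But" as a possible contrast cue. A surface match is only a hint: a cue word can occur without signalling the relation, which is why the text says "usually".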
Corston-Oliver [1] and Marcu [3] have demonstrated that they can build rhetorical structure trees by identifying the nuclei and satellites of clauses and sentences, and the relationships between them.

Approach: We propose to extend the rhetorical structure analysis from clauses and sentences to paragraphs and larger syntactic constructs. This way we can measure the similarity of the overall organization of documents to each other as well as measuring the similarity of their expression and content. To illustrate our approach in more detail, we look at an example fragment from ABC News, shown in Figure 1.

A Decade of Warnings
Did Rabbi’s 1990 Assassination Mark the Birth of Islamic Terror in America?

Aug. 16 - The headlines that followed the Sept. 11 attacks on the World Trade Center and Pentagon screamed tragically of a “changed world.” But did that change actually begin a decade earlier, with the assassination of a radical rabbi in New York City? In his newly released book, The Cell, 20/20’s John Miller teams up with investigative reporters Michael Stone and Chris Mitchell and gives a blow-by-blow account of terrorist events leading up to Sept. 11. They trace a trail of terror back to 1990, and ask whether U.S. intelligence agencies could have done something to prevent the Sept. 11 attacks.

Figure 1: Fragment of ABC News Article

Figure 2 shows the syntactic relations and Figure 3 shows the rhetorical structure of the first few sentences of the fragment in Figure 1.

A Decade of Warnings
[rabbi] ⇐ relates-to [assassination]
[1990] ⇐ describes [assassination]
[assassination] ⇐ is-subject-of [mark]
[birth] ⇐ is-object-of [mark]
[birth] ⇐ relates-to [terror]
[Islamic] ⇐ describes [terror]
[America] ⇐ is-location-of [terror]
...

Figure 2: Syntactic Relations extracted from Figure 1....
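To make the relations of Figure 2 concrete, the sketch below reads such a listing into a set of triples and compares two sets. The source does not say how relations are compared, so the Jaccard overlap used here is purely an assumption for illustration, and `parse_relations` and `relation_overlap` are hypothetical names.

```python
import re

# Matches entries of the form "[left] ⇐ relation [right]", as listed in Figure 2.
RELATION_ENTRY = re.compile(r"\[([^\]]+)\]\s*⇐\s*([\w-]+)\s*\[([^\]]+)\]")


def parse_relations(listing):
    """Parse a Figure 2 style listing into a set of (left, relation, right) triples."""
    triples = set()
    for left, relation, right in RELATION_ENTRY.findall(listing):
        # Lowercase the words so that "Islamic" and "islamic" compare equal.
        triples.add((left.lower(), relation, right.lower()))
    return triples


def relation_overlap(a, b):
    """Jaccard overlap of two triple sets (an assumed measure, not from the source)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two documents that share many triples score close to 1 and documents that share none score 0. Exact triple matching would miss paraphrases that use different words, so this only shows the data structure and is not a paraphrase detector.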
