Abstract
In many modern information retrieval applications, a common problem is the existence of multiple documents covering similar information, as with multiple news stories about an event or a sequence of events. A particular challenge for text summarization is to summarize the similarities and differences in information content among these documents. The approach described here exploits recent progress in information extraction to represent salient units of text and their relationships. By exploiting meaningful relations between units, based on an analysis of text cohesion and the context in which the comparison is desired, the summarizer can pinpoint similarities and differences and align text segments. In evaluation experiments, these techniques for exploiting cohesion relations yield summaries that (i) help users complete a retrieval task more quickly, (ii) improve alignment accuracy over baselines, and (iii) improve identification of topic-relevant similarities and differences.
Mani, I., Bloedorn, E. Summarizing Similarities and Differences Among Related Documents. Information Retrieval 1, 35–67 (1999). https://doi.org/10.1023/A:1009930203452
DOI: https://doi.org/10.1023/A:1009930203452