Open-Domain Multi-Document Summarization via Information Extraction: Challenges and Prospects

Ji, Heng; Favre, Benoit; Lin, Wen-Pin; Gillick, Dan; Hakkani-Tur, Dilek; Grishman, Ralph

doi:10.1007/978-3-642-28569-1_9

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Information Extraction (IE) and Summarization share the same goal: extracting and presenting the relevant information in a document. While IE was a primary element of early abstractive summarization systems, it has been left out of more recent extractive systems. However, extracting facts and recognizing entities and events should provide useful information to those systems and help resolve semantic ambiguities that they cannot tackle. This paper explores novel approaches to taking advantage of cross-document IE for multi-document summarization. We propose multiple approaches to IE-based summarization and analyze their strengths and weaknesses. One of them, re-ranking the output of a high-performing summarization system with IE-informed metrics, leads to improvements in both manually evaluated content quality and readability.
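
The re-ranking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the chapter's method: the abstract does not define its IE-informed metrics, so the `Candidate` class, the `fact_coverage` score (a plain substring match against extracted facts), the additive combination, and the `weight` parameter are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    base_score: float  # score from the underlying summarizer (assumed given)


def fact_coverage(text: str, facts: list[str]) -> float:
    """Fraction of extracted facts whose surface string appears in the text.

    Hypothetical stand-in for an IE-informed metric.
    """
    if not facts:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for fact in facts if fact.lower() in lowered)
    return hits / len(facts)


def rerank(candidates: list[Candidate], facts: list[str],
           weight: float = 0.5) -> list[Candidate]:
    """Sort candidates by base score plus a weighted coverage bonus (assumed form)."""
    return sorted(
        candidates,
        key=lambda c: c.base_score + weight * fact_coverage(c.text, facts),
        reverse=True,
    )


# Toy data, invented for this example only.
facts = ["Acme Corp", "acquired", "Beta Inc"]
candidates = [
    Candidate("Shares rose sharply on Monday.", base_score=0.60),
    Candidate("Acme Corp acquired Beta Inc on Monday.", base_score=0.55),
]
best = rerank(candidates, facts)[0]
```

In this toy case the second candidate starts with a lower base score but covers all three facts, so it moves to the top after re-ranking. A real system would presumably match entities, relations, and events rather than raw strings.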

Notes

1. http://www.nist.gov/speech/tests/ace/

Acknowledgements

The first author and the third author were supported by the U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053, the U.S. NSF CAREER Award under Grant IIS-0953149, and the PSC-CUNY Research Program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Author information

Authors and Affiliations

Computer Science Department, Queens College and Graduate Center, City University of New York, New York, NY, USA
Heng Ji & Wen-Pin Lin
LIF, Aix-Marseille Université, Marseille, France
Benoit Favre
Computer Science Department, University of California, Berkeley, CA, USA
Dan Gillick
Speech Labs, Microsoft, Mountain View, CA, USA
Dilek Hakkani-Tur
Computer Science Department, New York University, New York, NY, USA
Ralph Grishman

Corresponding author

Correspondence to Heng Ji.

Editor information

Editors and Affiliations

Universite Sorbonne Nouvelle, LATTICE-CNRS, Ecole Normale Superieure, rue d'Ulm 45, Paris, 75005, France
Thierry Poibeau
Information & Communication Technologies, Universitat Pompeu Fabra, C/ Tanger 122-140, Barcelona, 08018, Spain
Horacio Saggion
Institute for Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Jakub Piskorski
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2, Helsinki, 00014, Finland
Roman Yangarber

About this chapter

Cite this chapter

Ji, H., Favre, B., Lin, WP., Gillick, D., Hakkani-Tur, D., Grishman, R. (2013). Open-Domain Multi-Document Summarization via Information Extraction: Challenges and Prospects. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_9

DOI: https://doi.org/10.1007/978-3-642-28569-1_9
Published: 12 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science, Computer Science (R0)
