Multilingual Statistical News Summarization

  • Mijail Kabadjov
  • Josef Steinberger
  • Ralf Steinberger

Abstract

In this chapter we present a generic approach for summarizing clusters of multilingual news articles such as those produced by the Europe Media Monitor (EMM) system. Our approach uses robust statistical techniques together with multilingual tools for named entity recognition and disambiguation to produce entity-centered summaries. We ran experiments on the TAC 2008 and 2009 data sets (English corpora for summarization research) and obtained very promising results: at TAC 2009 our runs attained top rank for linguistic quality and second best for overall responsiveness. We also ran a small-scale evaluation on languages other than English, which demonstrates the multilinguality of our approach and also provides evidence that contradicts the pervasive assumption “if it works for English, it works for any language”. Finally, we present an online system, currently under development, that will eventually incorporate all elements of the summarization approach discussed here, and we show sample output summaries in various languages.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Mijail Kabadjov (1)
  • Josef Steinberger (1)
  • Ralf Steinberger (1)

  1. EC Joint Research Centre, Ispra (VA), Italy
