Exploring Clustering for Multi-document Arabic Summarisation

  • Mahmoud El-Haj
  • Udo Kruschwitz
  • Chris Fox
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7097)

Abstract

This paper explores clustering for multi-document Arabic summarisation. For evaluation we use an Arabic version of the DUC-2002 dataset that we previously generated using Google Translate. We examine how sentence-level clustering can be applied to multi-document summarisation and to redundancy elimination within that process. We vary the parameter settings, including the cluster size and the selection model applied in the extractive summarisation process. The automatically generated summaries are evaluated using the ROUGE metric as well as precision and recall, and the results are compared with the top five systems in the DUC-2002 multi-document summarisation task.
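To make the idea of sentence-level clustering for extractive summarisation concrete, the following is a minimal sketch rather than the authors' system. It assumes whitespace tokenisation, raw term-frequency vectors, cosine similarity, a k-means loop seeded with the first k sentences, and a selection model that keeps the sentence closest to each cluster centroid. None of these choices is stated in the abstract, and all function names (`vectorise`, `cluster_sentences`, `summarise`) are hypothetical.

```python
import math
from collections import Counter


def vectorise(sentence):
    """Term-frequency vector; whitespace tokenisation is an assumption."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def cluster_sentences(sentences, k, iterations=10):
    """k-means style clustering; the first k sentences seed the centroids."""
    vectors = [vectorise(s) for s in sentences]
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every sentence to its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return vectors, centroids, assignment


def summarise(sentences, k):
    """Keep one sentence per cluster: the one closest to the centroid."""
    vectors, centroids, assignment = cluster_sentences(sentences, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            best = max(members, key=lambda i: cosine(vectors[i], centroids[c]))
            chosen.append(best)
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    docs = [
        "the storm hit the coast",
        "markets fell sharply today",
        "the storm hit the coast hard",
        "stock markets fell sharply",
    ]
    for sentence in summarise(docs, k=2):
        print(sentence)
```

Keeping a single sentence per cluster is one simple way to obtain redundancy elimination, since near-duplicate sentences tend to fall into the same cluster. In this sketch, the number of clusters `k` and the choice of which cluster member to keep are the knobs loosely corresponding to the parameter settings mentioned above.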

Keywords

Machine Translation; Statistical Machine Translation; Computational Linguistics; Parallel Corpus; Arabic Version

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mahmoud El-Haj (1)
  • Udo Kruschwitz (1)
  • Chris Fox (1)
  1. Computer Science and Electronic Engineering, University of Essex, United Kingdom
