EM Clustering Algorithm for Automatic Text Summarization

  • Yulia Ledeneva
  • René García Hernández
  • Romyna Montiel Soto
  • Rafael Cruz Reyes
  • Alexander Gelbukh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7094)


Automatic text summarization has emerged as a technique for accessing only the useful information in a text. To assess the quality of the automatic summaries produced by a system, DUC 2002 (Document Understanding Conference) developed a standard set of human summaries, called the gold collection, for 567 news documents. At this conference, only five systems outperformed the baseline heuristic in the single-document extractive summarization task. So far, some approaches have obtained good results by combining different strategies with language-dependent knowledge. In this paper, we present a competitive method based on an EM clustering algorithm that improves the quality of automatic summaries while using practically no language-dependent knowledge. We also present a comparison of this method with three text models.
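
To make the idea of EM-based sentence clustering concrete, the following is a minimal sketch and not the authors' implementation. It assumes a unigram term-frequency text model, a diagonal-covariance Gaussian mixture, a deterministic initialization, and a selection rule that keeps the sentence best explained by each cluster. None of these choices is stated in the abstract, and all function names (`tokenize`, `vectorize`, `log_density`, `em_cluster`, `summarize`) are hypothetical.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def vectorize(sentences):
    """Build term-frequency vectors over the unigram vocabulary (assumed text model)."""
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for s in sentences:
        v = [0.0] * len(vocab)
        for w in tokenize(s):
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors


def log_density(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(x, mean, var)
    )


def em_cluster(X, k, iters=30, var_floor=1e-2):
    """Fit a k-component diagonal Gaussian mixture with EM.

    Returns (responsibilities, means, variances).
    """
    n, d = len(X), len(X[0])
    # Assumption: deterministic initialization from evenly spaced sentences.
    means = [list(X[(j * n) // k]) for j in range(k)]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each sentence.
        resp = []
        for x in X:
            logs = [
                math.log(weights[j]) + log_density(x, means[j], variances[j])
                for j in range(k)
            ]
            top = max(logs)  # subtract the maximum for numerical stability
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means and variances from the posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-9  # guard against empty clusters
            weights[j] = nj / (n + k * 1e-9)
            means[j] = [
                sum(r[j] * x[c] for r, x in zip(resp, X)) / nj for c in range(d)
            ]
            variances[j] = [
                max(
                    var_floor,
                    sum(r[j] * (x[c] - means[j][c]) ** 2 for r, x in zip(resp, X)) / nj,
                )
                for c in range(d)
            ]
    return resp, means, variances


def summarize(sentences, k):
    """Keep, for each non-empty cluster, the sentence it explains best."""
    X = vectorize(sentences)
    resp, means, variances = em_cluster(X, k)
    chosen = set()
    for j in range(k):
        members = [
            i for i, r in enumerate(resp)
            if max(range(k), key=lambda c: r[c]) == j
        ]
        if members:
            best = max(members, key=lambda i: log_density(X[i], means[j], variances[j]))
            chosen.add(best)
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(sentences, k)` on a list of sentence strings returns at most `k` sentences in document order. Replacing `vectorize` with an n-gram or maximal-frequent-sequence representation would be the natural place to swap in a different text model.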


Automatic text summarization, extractive summarization, EM clustering algorithm, text models, n-grams, maximal frequent sequences





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Yulia Ledeneva (1)
  • René García Hernández (1)
  • Romyna Montiel Soto (2)
  • Rafael Cruz Reyes (2)
  • Alexander Gelbukh (3)
  1. Unidad Académica Profesional Tianguistenco, Universidad Autónoma del Estado de México, México
  2. Laboratorio de Reconocimiento de Patrones, Instituto Tecnológico de Toluca, Metepec, México
  3. Centro de Investigación en Computación, Instituto Politécnico Nacional, México
