Abstract
Automatic text summarization has emerged as a technique for accessing only the useful information in a text. To assess the quality of the automatic summaries produced by a system, the Document Understanding Conference (DUC) 2002 developed a standard set of human summaries, called the gold collection, for 567 single news documents. At that conference, only five systems outperformed the baseline heuristic in the single-document extractive summarization task. So far, some approaches have obtained good results by combining different strategies with language-dependent knowledge. In this paper, we present a competitive method based on an EM clustering algorithm that improves the quality of automatic summaries while using practically no language-dependent knowledge. A comparison of this method with three text models is also presented.
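To make the general idea concrete, the following is a minimal sketch of extractive summarization by EM clustering, not the method evaluated in the paper. It assumes a bag-of-words sentence representation, a mixture-of-multinomials model, a fixed initialisation, and a "most representative sentence per cluster" selection rule; the function names `tokenize`, `em_cluster`, and `summarize` and the `smoothing` parameter are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def em_cluster(sentences, k, iterations=20, smoothing=0.1):
    """Soft-cluster sentences with EM on a mixture of multinomials.

    Returns one list of k responsibilities per sentence.
    """
    docs = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    # Assumed initialisation: sentence i starts in cluster i mod k.
    resp = [[1.0 if j == i % k else 0.0 for j in range(k)]
            for i in range(len(docs))]

    for _ in range(iterations):
        # M-step: re-estimate cluster priors and smoothed word distributions.
        log_prior, log_word = [], []
        for j in range(k):
            weight = sum(r[j] for r in resp)
            log_prior.append(math.log((weight + smoothing)
                                      / (len(docs) + k * smoothing)))
            counts = Counter()
            for r, d in zip(resp, docs):
                for w, c in d.items():
                    counts[w] += r[j] * c
            total = sum(counts.values()) + smoothing * len(vocab)
            log_word.append({w: math.log((counts[w] + smoothing) / total)
                             for w in vocab})

        # E-step: recompute responsibilities with a log-sum-exp normalisation.
        for i, d in enumerate(docs):
            scores = [log_prior[j]
                      + sum(c * log_word[j][w] for w, c in d.items())
                      for j in range(k)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            norm = sum(exps)
            resp[i] = [e / norm for e in exps]
    return resp


def summarize(sentences, k):
    """Pick, for each cluster, the sentence with the highest responsibility."""
    resp = em_cluster(sentences, k)
    chosen = {max(range(len(sentences)), key=lambda i: resp[i][j])
              for j in range(k)}
    # Keep the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(sentences, k)` on a list of sentence strings returns at most `k` of them in document order. Because the sketch relies only on tokenization and word counts, it uses no language-specific resources, which is in the spirit of the abstract's claim. The actual features and text models used in the paper are not described here.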
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ledeneva, Y., Hernández, R.G., Soto, R.M., Reyes, R.C., Gelbukh, A. (2011). EM Clustering Algorithm for Automatic Text Summarization. In: Batyrshin, I., Sidorov, G. (eds) Advances in Artificial Intelligence. MICAI 2011. Lecture Notes in Computer Science, vol 7094. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25324-9_26
DOI: https://doi.org/10.1007/978-3-642-25324-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25323-2
Online ISBN: 978-3-642-25324-9
eBook Packages: Computer Science (R0)