Abstract
This paper investigates a new approach to Single Document Summarization (SDS) based on a machine learning ranking algorithm. Machine learning techniques allow summaries to be adapted to user needs and to corpus characteristics, and these desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is then trained to combine these scores globally. We believe that the classification criterion is not well suited to SDS and propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion aims to reduce the relative misordering of sentences within a document. The features we use are either taken from the state of the art or built upon word clusters. These clusters are groups of words that often co-occur, and they can serve to expand a query or to enrich the representation of the sentences of a document. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We compare it with several non-learning baseline systems and with a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems and that the ranking algorithm outperforms the classifier. The size of the performance gap between the two learning algorithms depends on the nature of the data sets. We explain this by the different separability hypotheses that the two learning algorithms make about the data.
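The contrast between a classification criterion and a ranking criterion can be illustrated with a minimal sketch. It is not the authors' algorithm: the linear scorer, the pairwise exponential loss, the plain gradient descent, the toy feature values, and every name in the code (score, misordered_pairs, train_ranker, docs) are assumptions chosen for illustration. Each sentence is a vector of feature scores with a relevance label, and training only tries to place relevant sentences above irrelevant ones within the same document.

```python
import math


def score(weights, features):
    """Linear combination of the per-feature scores of one sentence."""
    return sum(w * f for w, f in zip(weights, features))


def split(doc):
    """Separate the sentences of one document into relevant and irrelevant."""
    relevant = [x for x, label in doc if label == 1]
    irrelevant = [x for x, label in doc if label == 0]
    return relevant, irrelevant


def misordered_pairs(weights, doc):
    """Count (relevant, irrelevant) pairs of one document in which the
    relevant sentence is not scored strictly above the irrelevant one."""
    relevant, irrelevant = split(doc)
    return sum(
        1
        for r in relevant
        for i in irrelevant
        if score(weights, r) <= score(weights, i)
    )


def train_ranker(docs, n_features, steps=200, lr=0.1):
    """Gradient descent on a pairwise exponential loss (an assumed criterion).

    For each within-document pair, exp(s(irrelevant) - s(relevant)) is an
    upper bound on the 0/1 indicator that the pair is misordered.
    """
    weights = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for doc in docs:
            relevant, irrelevant = split(doc)
            for r in relevant:
                for i in irrelevant:
                    pair_loss = math.exp(score(weights, i) - score(weights, r))
                    for k in range(n_features):
                        grad[k] += pair_loss * (i[k] - r[k])
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical toy data: two documents, two feature scores per sentence,
# label 1 for a sentence that belongs in the summary and 0 otherwise.
docs = [
    [([0.9, 0.2], 1), ([0.4, 0.1], 0), ([0.3, 0.8], 0)],
    [([0.3, 0.1], 1), ([0.1, 0.0], 0), ([0.2, 0.9], 0)],
]

weights = train_ranker(docs, n_features=2)
for doc in docs:
    print(misordered_pairs(weights, doc))
```

The toy data is built so that no single linear threshold over these two features separates all relevant sentences from all irrelevant ones across both documents, while a linear scorer can still order the sentences of each document correctly. That is one way to read the different separability hypotheses mentioned above: the ranking criterion only constrains comparisons made within a document.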
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Amini, M.R., Usunier, N., Gallinari, P. (2005). Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_11
DOI: https://doi.org/10.1007/978-3-540-31865-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science, Computer Science (R0)