Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms

Amini, Massih R.; Usunier, Nicolas; Gallinari, Patrick

doi:10.1007/978-3-540-31865-1_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper investigates a new approach to Single Document Summarization (SDS) based on a machine learning ranking algorithm. Using machine learning techniques for this task makes it possible to adapt summaries to user needs and to corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is trained to make a global combination of these scores. We believe that the classification criterion for training a classifier is not well suited to SDS and propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use here are either based on the state of the art or built upon word-clusters. These clusters are groups of words that often co-occur with each other; they can serve to expand a query or to enrich the representation of the sentences of the documents. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We compare it with different non-learning baseline systems and with a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems, and that the ranking algorithm outperforms the classifier. The difference in performance between the two learning algorithms depends on the nature of the datasets. We explain this by the different separability hypotheses that the two learning algorithms make about the data.
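
The ranking criterion described in the abstract can be illustrated with a minimal sketch. It assumes each sentence is a vector of feature scores, each document carries binary relevance labels, and a linear scorer is trained by gradient descent on a pairwise exponential loss. That loss, the function names (`score`, `misordered_pairs`, `train_ranker`) and the toy data are assumptions chosen for illustration. The abstract does not specify the authors' algorithm, so this should not be read as their implementation.

```python
import math


def score(weights, features):
    """Linear combination of a sentence's feature scores."""
    return sum(w * f for w, f in zip(weights, features))


def misordered_pairs(weights, sentences, labels):
    """Count (relevant, irrelevant) sentence pairs within one document
    where the irrelevant sentence scores at least as high."""
    pos = [x for x, y in zip(sentences, labels) if y == 1]
    neg = [x for x, y in zip(sentences, labels) if y == 0]
    return sum(
        1
        for xp in pos
        for xn in neg
        if score(weights, xp) <= score(weights, xn)
    )


def train_ranker(documents, n_features, lr=0.1, epochs=200):
    """Gradient descent on the pairwise exponential loss
    sum over pairs of exp(-(s(x_relevant) - s(x_irrelevant))).
    Pairs are formed only inside a document, never across documents."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for sentences, labels in documents:
            pos = [x for x, y in zip(sentences, labels) if y == 1]
            neg = [x for x, y in zip(sentences, labels) if y == 0]
            for xp in pos:
                for xn in neg:
                    diff = [a - b for a, b in zip(xp, xn)]
                    pair_loss = math.exp(-score(weights, diff))
                    for k in range(n_features):
                        grad[k] -= pair_loss * diff[k]
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


if __name__ == "__main__":
    # Toy data (hypothetical): two documents, two feature scores per sentence.
    documents = [
        ([[0.9, 0.2], [0.1, 0.8], [0.4, 0.3]], [1, 0, 0]),
        ([[0.2, 0.1], [0.7, 0.6], [0.1, 0.5]], [0, 1, 0]),
    ]
    w = train_ranker(documents, n_features=2)
    for sentences, labels in documents:
        print(misordered_pairs(w, sentences, labels))
```

A classifier would instead try to separate all relevant sentences from all irrelevant ones across the whole corpus with a single threshold. The pairwise loss only asks that, inside each document, relevant sentences score above irrelevant ones.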

Author information

Authors and Affiliations

Computer Science Laboratory of Paris 6, 8 Rue du Capitaine Scott, 75015, Paris, France
Massih R. Amini, Nicolas Usunier & Patrick Gallinari

Editor information

Editors and Affiliations

Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, Spain
David E. Losada
Departamento de Ciencias de la Computación e Inteligencia Artificial E.T.S.I. Informática y de Telecomunicación, Universidad de Granada, 18071, Granada, Spain
Juan M. Fernández-Luna

About this paper

Cite this paper

Amini, M.R., Usunier, N., Gallinari, P. (2005). Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_11

DOI: https://doi.org/10.1007/978-3-540-31865-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science, Computer Science (R0)
