Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms

  • Conference paper
Advances in Information Retrieval (ECIR 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Abstract

This paper investigates a new approach to Single Document Summarization (SDS) based on a Machine Learning ranking algorithm. The use of machine learning techniques for this task allows one to adapt summaries to user needs and to the characteristics of the corpus. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is trained to make a global combination of these scores. We believe that the classification criterion used to train a classifier is not well suited to SDS and propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use here are either based on the state of the art or built upon word-clusters. These clusters are groups of words which often co-occur with each other, and can serve to expand a query or to enrich the representation of the sentences of the documents. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We compare it with several non-learning baseline systems and with a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems, and that the ranking algorithm outperforms the classifier. The difference in performance between the two learning algorithms depends on the nature of the datasets. We explain this by the different separability hypotheses that the two learning algorithms make about the data.
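
The contrast between the two training criteria can be illustrated with a small sketch. It is not the algorithm of the paper: the linear combination, the fixed threshold, the function names (`combine`, `misordering_rate`, `classification_error`) and the toy feature values are all assumptions made for illustration. The sketch only shows that a scoring function can order the sentences of a document perfectly while still misclassifying some of them against a global threshold.

```python
from typing import List, Sequence


def combine(features: Sequence[Sequence[float]],
            weights: Sequence[float]) -> List[float]:
    """Score each sentence as a weighted sum of its feature scores.

    A linear combination is an assumption of this sketch.
    """
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def misordering_rate(scores: Sequence[float],
                     relevant: Sequence[bool]) -> float:
    """Ranking-style criterion for one document.

    Fraction of (relevant, irrelevant) sentence pairs in which the
    irrelevant sentence scores at least as high as the relevant one.
    """
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    if not pos or not neg:
        return 0.0
    bad = sum(1 for p in pos for n in neg if n >= p)
    return bad / (len(pos) * len(neg))


def classification_error(scores: Sequence[float],
                         relevant: Sequence[bool],
                         threshold: float = 0.0) -> float:
    """Classification-style criterion.

    Fraction of sentences whose thresholded score disagrees with the label.
    The threshold value is hypothetical.
    """
    wrong = sum(1 for s, r in zip(scores, relevant)
                if (s > threshold) != bool(r))
    return wrong / len(scores)


# Toy document: four sentences, two feature scores each (made-up values).
features = [[2.0, 1.0], [1.0, 1.0], [1.0, 0.5], [0.5, 0.5]]
relevant = [True, True, False, False]
weights = [1.0, 1.0]

scores = combine(features, weights)
print(scores)
print(misordering_rate(scores, relevant))
print(classification_error(scores, relevant))
```

Here both relevant sentences score above both irrelevant ones, so the misordering rate is zero, yet every score is above the threshold and half of the sentences are misclassified. For an extractive summary only the relative order of sentences within a document matters, which is the motivation the abstract gives for the ranking criterion.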

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Amini, M.R., Usunier, N., Gallinari, P. (2005). Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_11

  • DOI: https://doi.org/10.1007/978-3-540-31865-1_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25295-5

  • Online ISBN: 978-3-540-31865-1

  • eBook Packages: Computer Science, Computer Science (R0)
