Using Annotated Suffix Tree Similarity Measure for Text Summarisation

  • Conference paper
Analysis of Large and Complex Data

Abstract

The paper describes an attempt to improve TextRank, an algorithm for unsupervised text summarisation. TextRank has two main stages. The first stage represents a text as a weighted directed graph in which nodes stand for single sentences and edges connect consecutive sentences and are weighted by sentence similarity. The second stage applies the PageRank algorithm, unchanged, to this graph. The nodes with the highest ranks form the summary of the text. We focus on the first stage, and in particular on measuring sentence similarity. Mihalcea and Tarau suggest employing the common scheme: use the vector space model (VSM), so that every text is a vector in the space of words or stems, and compute the cosine similarity between these vectors. Our idea is to replace this scheme with the annotated suffix tree (AST) model for sentence representation. The AST model overcomes several limitations of the VSM, such as its dependence on the size of the vocabulary and the length of sentences, and its need for stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing probabilities of matched concurrencies. To test the method on Russian texts, we built our own collection of newspaper articles in which some sentences are highlighted as being more important. On this collection, the AST similarity measure achieves a slight improvement over the cosine similarity measure.
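The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_similarity`, `rank_sentences`, `summarise`), whitespace tokenisation, the damping factor of 0.85 and the fixed iteration count are all assumptions made for illustration. For simplicity the sketch links every pair of distinct sentences rather than only consecutive ones. The `similarity` parameter marks the place where a different measure, such as an AST-based one, could be substituted for the cosine baseline.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sentences as bag-of-words vectors (VSM baseline)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, similarity=cosine_similarity,
                   damping=0.85, iterations=50):
    """Stages 1 and 2: build a weighted sentence graph, then run weighted PageRank."""
    n = len(sentences)
    if n == 0:
        return []
    # Assumption: every pair of distinct sentences is linked; the edge
    # weight is whatever the supplied similarity function returns.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(w[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarise(sentences, k=2, **kwargs):
    """Return the k highest-ranked sentences in their original order."""
    scores = rank_sentences(sentences, **kwargs)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarise(sentences, k=2)` on a list of sentence strings returns the two top-ranked sentences. Passing `similarity=some_other_measure` leaves the ranking stage untouched, which mirrors the paper's choice to change only the similarity measure.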

References

  • Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of International Joint Conference on Natural Language Processing (pp. 543–551).

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7 (pp. 107–117).

  • Cruz, F., Troyano, J. A., & Enríquez, F. (2006). Supervised TextRank. In Advances in natural language processing (pp. 632–639). Berlin/Heidelberg: Springer.

  • Document Understanding Conference. Retrieved October 20, 2014, http://www-nlpir.nist.gov/ (Web source)

  • Enhanced Annotated Suffix Tree. Retrieved January 15, 2015, https://pypi.python.org/pypi/EAST/0.2.2/ (Web source)

  • Erkan, G., & Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1), 457–479.

  • Garg, N., Favre, B., Riedhammer, K., & Hakkani-Tur, D. (2009). ClusterRank: a graph based method for meeting summarization. In Interspeech, ISCA (pp. 1499–1502).

  • Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and computational biology. Cambridge: Cambridge University Press.

  • Hahn, U., & Mani, I. (2000). The challenges of automatic summarization. Computer, 33(11), 29–36.

  • Mihalcea, R., & Tarau, P. (2004). TextRank: bringing order into text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 404–411).

  • Pampapathi, R., Mirkin, B., & Levene, M. (2008). A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1), 309–338.

  • Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Acknowledgements

The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2014–2015 (grant No 15-05-0041) and supported within the framework of a subsidy granted to the HSE by the Government of the Russian Federation for the implementation of the Global Competitiveness Program.

Author information

Corresponding author

Correspondence to Maxim Yakovlev.

Copyright information

© 2016 Springer International Publishing Switzerland

Cite this paper

Yakovlev, M., Chernyak, E. (2016). Using Annotated Suffix Tree Similarity Measure for Text Summarisation. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_9
