Using Annotated Suffix Tree Similarity Measure for Text Summarisation

Yakovlev, Maxim; Chernyak, Ekaterina

doi:10.1007/978-3-319-25226-1_9

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

This paper describes an attempt to improve TextRank, an algorithm for unsupervised text summarisation. TextRank has two main stages. The first stage represents a text as a weighted directed graph in which nodes stand for single sentences and edges, weighted with sentence similarity, connect consecutive sentences. The second stage applies the PageRank algorithm, unchanged, to this graph. The nodes that receive the highest ranks form the summary of the text. We focus on the first stage, and in particular on measuring sentence similarity. Mihalcea and Tarau suggest employing the common scheme: use the vector space model (VSM), so that every text is a vector in the space of words or stems, and compute the cosine similarity between these vectors. Our idea is to replace this scheme with the annotated suffix tree (AST) model for sentence representation. The AST model overcomes several limitations of the VSM, such as dependence on the size of the vocabulary and on the length of sentences, and the need for stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing probabilities of matched concurrencies. To test the method on Russian texts, we built our own collection of newspaper articles in which some sentences are highlighted as being more important. On this collection, the AST similarity measure achieves a slight improvement over the cosine similarity measure.
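The two stages described in the abstract can be illustrated with a minimal Python sketch of the cosine-similarity baseline. It is not the authors' implementation: the function names (`cosine_similarity`, `build_graph`, `pagerank`, `summarise`), whitespace tokenisation without stemming, the damping factor of 0.85, the fixed iteration count, and the choice to link every pair of sentences are all assumptions made for illustration. The `similarity` parameter marks the point where a different measure, such as an AST-based one, could be substituted.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sentences as bag-of-words vectors.

    Assumption: lower-cased whitespace tokens, no stemming or lemmatisation.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(sentences, similarity=cosine_similarity):
    """Stage 1: weights[i][j] holds the similarity of sentences i and j.

    Assumption: every pair of distinct sentences is linked.
    """
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = similarity(sentences[i], sentences[j])
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Stage 2: weighted PageRank computed by simple fixed-count iteration."""
    n = len(weights)
    if n == 0:
        return []
    out_sum = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for j in range(n):
            # Each node i passes rank to j in proportion to the edge weight.
            incoming = sum(
                ranks[i] * weights[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        ranks = updated
    return ranks


def summarise(sentences, k=2, similarity=cosine_similarity):
    """Return the k highest-ranked sentences in their original order."""
    ranks = pagerank(build_graph(sentences, similarity))
    top = sorted(range(len(sentences)), key=lambda i: ranks[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    text = [
        "the cat sat on the mat",
        "the cat lay on the mat",
        "stock prices fell sharply today",
    ]
    print(summarise(text, k=2))
```

Because `summarise` takes the similarity function as an argument, comparing two measures on the same collection only requires swapping that one argument while the ranking stage stays fixed.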

References

Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of International Joint Conference on Natural Language Processing (pp. 543–551).
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7 (pp. 107–117).
Cruz, F., Troyano, J. A., & Enríquez, F. (2006). Supervised TextRank. In Advances in natural language processing (pp. 632–639). Berlin/Heidelberg: Springer.
Document Understanding Conference. Retrieved October 20, 2014, http://www-nlpir.nist.gov/ (Web source)
Enhanced Annotated Suffix Tree. Retrieved January 15, 2015, https://pypi.python.org/pypi/EAST/0.2.2/ (Web source)
Erkan, G., & Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1), 457–479.
Garg, N., Favre, B., Riedhammer, K., & Hakkani-Tür, D. (2009). ClusterRank: a graph based method for meeting summarization. In Interspeech, ISCA (pp. 1499–1502).
Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and computational biology. Cambridge: Cambridge University Press.
Hahn, U., & Mani, I. (2000). The challenges of automatic summarization. Computer, 33(11), 29–36.
Mihalcea, R., & Tarau, P. (2004). TextRank: bringing order into text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 404–411).
Pampapathi, R., Mirkin, B., & Levene, M. (2008). A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1), 309–338.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Acknowledgements

The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2014–2015 (grant No 15-05-0041) and supported within the framework of a subsidy granted to the HSE by the Government of the Russian Federation for the implementation of the Global Competitiveness Program.

Author information

Authors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Maxim Yakovlev & Ekaterina Chernyak

Corresponding author

Correspondence to Maxim Yakovlev.

Editor information

Editors and Affiliations

Jacobs University Bremen, Bremen, Germany
Adalbert F.X. Wilhelm
Institute of Medical Systems Biology, Universität Ulm, Ulm, Baden-Württemberg, Germany
Hans A. Kestler

About this paper

Cite this paper

Yakovlev, M., Chernyak, E. (2016). Using Annotated Suffix Tree Similarity Measure for Text Summarisation. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_9

DOI: https://doi.org/10.1007/978-3-319-25226-1_9
Published: 04 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics (R0)
