Abstract
This paper describes an attempt to improve TextRank, an algorithm for unsupervised text summarisation. TextRank has two main stages. The first stage represents a text as a weighted directed graph in which nodes stand for single sentences and edges, weighted by sentence similarity, connect consecutive sentences. The second stage applies the PageRank algorithm, unchanged, to this graph. The nodes that receive the highest ranks form the summary of the text. We focus on the first stage, and in particular on measuring sentence similarity. Mihalcea and Tarau suggest a common scheme: use the vector space model (VSM), so that every text is a vector in the space of words or stems, and compute the cosine similarity between these vectors. Our idea is to replace this scheme with the annotated suffix tree (AST) model for sentence representation. The AST model overcomes several limitations of the VSM, such as its dependence on the size of the vocabulary and the length of sentences, and its need for stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing probabilities of matched concurrencies. To test the method on Russian texts, we built our own collection of newspaper articles in which some sentences are highlighted as being more important. On this collection, the AST similarity measure achieves a slight improvement over the cosine similarity measure.
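The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, the graph is simplified to link every pair of sentences symmetrically, and the similarity is the plain VSM cosine over raw word counts with no stemming. The `similarity` parameter marks the place where an AST-based measure would be substituted.

```python
import math
import re
from collections import Counter


def cosine_similarity(s1, s2):
    """VSM baseline: cosine between raw word-count vectors (no stemming)."""
    a = Counter(re.findall(r"\w+", s1.lower()))
    b = Counter(re.findall(r"\w+", s2.lower()))
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a square weight matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        ranks = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * ranks[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return ranks


def summarise(sentences, k=2, similarity=cosine_similarity):
    """Stage 1: build the similarity graph. Stage 2: rank nodes, keep top k."""
    n = len(sentences)
    # Assumption: every pair of distinct sentences is linked (a simplification).
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    ranks = pagerank(weights)
    top = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:k]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    text = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "dogs bark loudly",
    ]
    print(summarise(text, k=2))
```

Because the similarity function is passed in as an argument, swapping the cosine measure for another one leaves the graph construction and ranking code untouched, which mirrors the paper's focus on the first stage only.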
References
Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of International Joint Conference on Natural Language Processing (pp. 543–551).
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7 (pp. 107–117).
Cruz, F., Troyano, J. A., & Enríquez, F. (2006). Supervised TextRank. In Advances in natural language processing (pp. 632–639). Berlin/Heidelberg: Springer.
Document Understanding Conference. Retrieved October 20, 2014, http://www-nlpir.nist.gov/ (Web source)
Enhanced Annotated Suffix Tree. Retrieved January 15, 2015, https://pypi.python.org/pypi/EAST/0.2.2/ (Web source)
Erkan, G., & Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1), 457–479.
Garg, N., Favre, B., Riedhammer, K., & Hakkani-Tur, D. (2009). ClusterRank: a graph based method for meeting summarization. In Interspeech, ISCA (pp. 1499–1502).
Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and computational biology. Cambridge: Cambridge University Press.
Hahn, U., & Mani, I. (2000). The challenges of automatic summarization. Computer, 33(11), 29–36.
Mihalcea, R., & Tarau, P. (2004). TextRank: bringing order into text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 404–411).
Pampapathi, R., Mirkin, B., & Levene, M. (2008). A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1), 309–338.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Acknowledgements
The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2014–2015 (grant No 15-05-0041) and supported within the framework of a subsidy granted to the HSE by the Government of the Russian Federation for the implementation of the Global Competitiveness Program.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Yakovlev, M., Chernyak, E. (2016). Using Annotated Suffix Tree Similarity Measure for Text Summarisation. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_9
DOI: https://doi.org/10.1007/978-3-319-25226-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics (R0)