International Conference on Analysis of Images, Social Networks and Texts

Analysis of Images, Social Networks and Texts, pp. 308–319

Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 542)

Abstract

We present an improved implementation of the annotated suffix tree method for text analysis (abbreviated as the AST-method). Annotated suffix trees extend the original suffix tree data structure: each node is labeled with the occurrence frequency of the corresponding substring in the input text collection. They have a range of applications in text analysis, such as language-independent computation of a matching score for a keyphrase against a text collection. In our enhanced implementation, new algorithms and data structures (suffix arrays in place of the traditional but heavyweight suffix trees) yield an implementation superior to the previous ones in both memory consumption (10 times less memory) and runtime. We describe an open-source statistical text analysis software package, called “EAST”, which implements this enhanced annotated suffix tree method. The EAST package also includes an adaptation of a distributional synonym extraction algorithm that supports the Russian language and improves the results of keyphrase matching.
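The core idea, that the substring frequencies annotating a suffix tree can also be read off a suffix array, can be illustrated with a minimal sketch. This is not the EAST implementation: the function names `build_suffix_array` and `substring_frequency` are hypothetical, the construction is a naive sort rather than a linear-time algorithm, and a single string stands in for a text collection.

```python
def build_suffix_array(text):
    """Naive construction (illustrative only): sort suffix start
    positions by the lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, suffix_array, pattern, upper):
    """Binary search over the suffix array, comparing only the first
    len(pattern) characters of each suffix.
    upper=False: first index whose prefix is >= pattern.
    upper=True:  first index whose prefix is >  pattern."""
    lo, hi = 0, len(suffix_array)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = text[start:start + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def substring_frequency(text, suffix_array, pattern):
    """Number of occurrences of pattern in text, i.e. the number of
    suffixes that start with pattern. In an annotated suffix tree this
    is the frequency stored at the node reached by reading pattern."""
    if not pattern:
        return 0
    first = _bound(text, suffix_array, pattern, upper=False)
    last = _bound(text, suffix_array, pattern, upper=True)
    return last - first


text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(substring_frequency(text, sa, "ana"))  # 2
```

All suffixes sharing a given prefix occupy one contiguous interval of the suffix array, so the frequency is the interval length and no explicit tree nodes need to be stored. That is the intuition behind the memory savings, under the assumptions stated above.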

Keywords

Text analysis · Algorithms on strings · Annotated suffix trees · Suffix arrays · Synonym extraction

Notes

Acknowledgments

This research, carried out in 2015, was supported by “The National Research University ‘Higher School of Economics’ Academic Fund Program” grant (No. 15-05-0041). The financial support from the Government of the Russian Federation within the framework of the implementation of the 5–100 Programme Roadmap of the National Research University – Higher School of Economics is acknowledged.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Computer Science Faculty, National Research University Higher School of Economics, Moscow, Russia
