Programming and Computer Software, Volume 40, Issue 5, pp 288–295

Texterra: A framework for text analysis

  • D. Yu. Turdakov (corresponding author)
  • N. A. Astrakhantsev
  • Ya. R. Nedumov
  • A. A. Sysoev
  • I. A. Andrianov
  • V. D. Mayorov
  • D. G. Fedorenko
  • A. V. Korshunov
  • S. D. Kuznetsov
Article

Abstract

This paper describes a framework for fast text analysis developed as part of the Texterra project. Texterra provides a scalable solution for fast text processing based on novel methods that exploit knowledge extracted from the Web and from text documents. The paper presents details of the project, use cases, and evaluation results for the developed tools.

Keywords

text analysis; natural language processing; Wikipedia; computational linguistics; machine learning; knowledge bases; semantic ontologies; information search; terminology extraction

Copyright information

© Pleiades Publishing, Ltd. 2014

Authors and Affiliations

  • D. Yu. Turdakov (1, corresponding author)
  • N. A. Astrakhantsev (1)
  • Ya. R. Nedumov (1)
  • A. A. Sysoev (1)
  • I. A. Andrianov (1)
  • V. D. Mayorov (1)
  • D. G. Fedorenko (1)
  • A. V. Korshunov (1)
  • S. D. Kuznetsov (1)

  1. Institute for System Programming, Russian Academy of Sciences, Moscow, Russia
