Abstract
This paper describes a framework for fast text analysis developed as part of the Texterra project. Texterra provides a scalable solution for fast text processing based on novel methods that exploit knowledge extracted from the Web and from text documents. Details of the project, use cases, and evaluation results for the developed tools are presented.
Additional information
Original Russian Text © D.Yu. Turdakov, N.A. Astrakhantsev, Ya.R. Nedumov, A.A. Sysoev, I.A. Andrianov, V.D. Mayorov, D.G. Fedorenko, A.V. Korshunov, S.D. Kuznetsov, 2014, published in Proceedings of the Institute for System Programming of RAS, 2014, vol. 26, issue 1, pp. 421–438.
Cite this article
Turdakov, D.Y., Astrakhantsev, N.A., Nedumov, Y.R. et al. Texterra: A framework for text analysis. Program Comput Soft 40, 288–295 (2014). https://doi.org/10.1134/S0361768814050090