Investigation of features for extraction of named entities from texts in Russian

Mozharova, V. A.; Lukashevich, N. V.

doi:10.3103/S0005105517030049

Text Processing Automation
Published: 19 August 2017

Volume 51, pages 127–134, (2017)
Automatic Documentation and Mathematical Linguistics

Abstract

This paper examines features for extracting named entities from Russian texts within machine-learning approaches: features of the token itself (lexeme), as well as vocabulary, contextual, cluster, and two-stage features. The contribution of each feature to the quality of named entity extraction is studied. A conditional random field (CRF) classifier serves as the machine learning method in the experiments. Feature contributions are compared on two open collections using the F-measure.
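
The feature families named in the abstract can be illustrated with a minimal sketch. It is not the authors' feature set: the concrete features, the `gazetteer` set, the `clusters` mapping, and the function names are all hypothetical, chosen only to show what token, vocabulary, contextual, and cluster features might look like as input to a CRF. Two-stage features are left out because the abstract does not describe how they are built. The `f_measure` helper uses the standard balanced F-measure (harmonic mean of precision and recall), assumed here to be the variant meant.

```python
from typing import Dict, List, Set, Union

FeatureValue = Union[str, bool]


def token_features(
    tokens: List[str],
    i: int,
    gazetteer: Set[str],       # hypothetical vocabulary of known entity words (lowercased)
    clusters: Dict[str, str],  # hypothetical word -> cluster id mapping (lowercased keys)
) -> Dict[str, FeatureValue]:
    """Build a feature dictionary for tokens[i] (illustrative only)."""
    tok = tokens[i]
    feats: Dict[str, FeatureValue] = {
        # Features of the token itself
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_upper": tok.isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        # Vocabulary feature: membership in a word list
        "in_gazetteer": tok.lower() in gazetteer,
        # Cluster feature: id of the word's cluster, or a placeholder
        "cluster": clusters.get(tok.lower(), "UNK"),
    }
    # Contextual features: properties of the immediate neighbours
    for offset in (-1, 1):
        j = i + offset
        key = f"{offset:+d}"  # "-1" or "+1"
        if 0 <= j < len(tokens):
            feats[f"{key}:lower"] = tokens[j].lower()
            feats[f"{key}:is_capitalized"] = tokens[j][:1].isupper()
        else:
            feats[f"{key}:pad"] = True  # sentence boundary marker
    return feats


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Calling `token_features(["Владимир", "Путин", "посетил", "Казань"], 3, {"казань"}, {})` yields one dictionary for the last token. A list of such dictionaries per sentence is the kind of input that common CRF toolkits accept, though the paper's actual toolkit and feature encoding are not stated in the abstract.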

Author information

Authors and Affiliations

Department of Computational Mathematics and Cybernetics, Moscow State University, Moscow, 119991, Russia
V. A. Mozharova
Scientific Research Computational Center, Moscow State University, Moscow, 119991, Russia
N. V. Lukashevich

Corresponding author

Correspondence to V. A. Mozharova.

Additional information

Original Russian Text © V.A. Mozharova, N.V. Lukashevich, 2017, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2017, No. 5, pp. 14–21.

About this article

Cite this article

Mozharova, V.A., Lukashevich, N.V. Investigation of features for extraction of named entities from texts in Russian. Autom. Doc. Math. Linguist. 51, 127–134 (2017). https://doi.org/10.3103/S0005105517030049

Received: 07 December 2016
Published: 19 August 2017
Issue Date: June 2017
DOI: https://doi.org/10.3103/S0005105517030049
