Abstract
This paper examines features used in machine-learning approaches to extracting named entities from Russian texts: features of the token itself (lexeme), as well as vocabulary, contextual, cluster, and two-stage features. The contribution of each feature to the quality of named entity extraction is studied. A conditional random field (CRF) classifier is used as the machine learning method in the experiments described. The contributions of the features are compared on two open collections using the F-measure.
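As an illustration of what token and contextual features can look like in practice, the following is a minimal sketch of a per-token feature dictionary of the kind commonly passed to CRF toolkits. The function names (`token_features`, `sentence_features`), the feature names, the window size, and the specific features chosen are all assumptions made for this example; they are not the feature set used in the paper, and vocabulary, cluster, and two-stage features are not shown.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i] (hypothetical feature set)."""
    token = tokens[i]
    # Features of the token itself.
    features: Dict[str, object] = {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_upper": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "suffix3": token[-3:].lower(),
    }
    # Contextual features: the same kind of information for neighbouring tokens.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"{offset:+d}:lower"] = tokens[j].lower()
            features[f"{offset:+d}:is_capitalized"] = tokens[j][:1].isupper()
        else:
            # Mark positions that fall outside the sentence.
            features[f"{offset:+d}:pad"] = True
    return features


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features(["Анна", "живет", "в", "Москве"]):
        print(feats)
```

Each token is described by its own surface properties plus those of its neighbours within the window, so a sequence model can use capitalization or the preceding word as evidence for an entity label.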
Additional information
Original Russian Text © V.A. Mozharova, N.V. Lukashevich, 2017, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2017, No. 5, pp. 14–21.
About this article
Cite this article
Mozharova, V.A., Lukashevich, N.V. Investigation of features for extraction of named entities from texts in Russian. Autom. Doc. Math. Linguist. 51, 127–134 (2017). https://doi.org/10.3103/S0005105517030049