Investigation of features for extraction of named entities from texts in Russian

  • Text Processing Automation
Automatic Documentation and Mathematical Linguistics

Abstract

This paper examines features used in machine-learning approaches to extracting named entities from Russian texts: features of the token itself (lexeme), as well as vocabulary, contextual, cluster, and two-stage features. The contribution of each feature to the quality of named entity extraction is studied. A CRF classifier serves as the machine learning method in the experiments described here. Feature contributions are compared on two open collections using the F-measure.
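
As a rough illustration of the feature types named in the abstract, the sketch below builds a per-token feature dictionary of the kind commonly passed to a CRF toolkit, and computes the F-measure over entity spans. The function names, the toy gazetteer, and the one-token context window are assumptions made for this example; they are not taken from the paper, and cluster and two-stage features are left out.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy gazetteer; the vocabularies used in the paper are not given here.
GAZETTEER: Set[str] = {"москва", "россия"}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for tokens[i]: token, vocabulary, and contextual features."""
    tok = tokens[i]
    feats: Dict[str, object] = {
        "lower": tok.lower(),                      # token feature
        "is_capitalized": tok[:1].isupper(),       # token feature
        "in_gazetteer": tok.lower() in GAZETTEER,  # vocabulary feature
    }
    # Contextual features: neighbouring tokens within an assumed window of 1.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


Span = Tuple[int, int, str]  # (start, end, entity type)


def f_measure(gold: Set[Span], predicted: Set[Span]) -> float:
    """Balanced F-measure (F1) over exact-match entity spans."""
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Calling `token_features(["Путин", "посетил", "Москва"], 2)` yields a dictionary whose `in_gazetteer` entry is true and whose `prev_lower` entry is `"посетил"`. Exact span matching is only one possible scoring convention; the evaluation protocol of the two collections may differ.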

Author information

Corresponding author

Correspondence to V. A. Mozharova.

Additional information

Original Russian Text © V.A. Mozharova, N.V. Lukashevich, 2017, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2017, No. 5, pp. 14–21.

About this article

Cite this article

Mozharova, V.A., Lukashevich, N.V. Investigation of features for extraction of named entities from texts in Russian. Autom. Doc. Math. Linguist. 51, 127–134 (2017). https://doi.org/10.3103/S0005105517030049

  • DOI: https://doi.org/10.3103/S0005105517030049
