Introducing Baselines for Russian Named Entity Recognition

  • Rinat Gareev
  • Maksim Tkachenko
  • Valery Solovyev
  • Andrey Simanovsky
  • Vladimir Ivanov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7816)

Abstract

Current research in Named Entity Recognition (NER) deals mostly with English. Although interest in multilingual Information Extraction is growing, only a few works report results for Russian. This paper introduces quality baselines for the Russian NER task. We propose a corpus manually annotated with organization and person names; its main purpose is to provide a gold standard for evaluation. We implemented and evaluated two approaches to NER: knowledge-based and statistical. The knowledge-based approach comprises several components: dictionary matching, pattern matching, and rule-based search for lexical representations of entity names within a document. We assembled a set of linguistic resources and evaluated their impact on performance. For the data-driven approach we used our own implementation of a linear-chain CRF with a rich set of features. The performance of both systems is promising (62.17% and 75.05% F1 measure), although neither employs morphological or syntactic analysis.
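To make two of the terms above concrete, the following is a minimal sketch of dictionary matching and of the F1 measure. It is not the authors' implementation: the greedy longest-match strategy, the exact-span scoring criterion, the function names, and the toy sentence and gazetteer are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# (start token index, end token index exclusive, entity label)
Span = Tuple[int, int, str]


def dictionary_match(tokens: List[str],
                     gazetteer: Dict[Tuple[str, ...], str]) -> Set[Span]:
    """Greedy longest-match lookup of token sequences in a gazetteer.

    Assumption: the abstract only says "dictionary matching"; the
    longest-match policy here is one common way to do it.
    """
    max_len = max((len(key) for key in gazetteer), default=0)
    spans: Set[Span] = set()
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                spans.add((i, i + n, gazetteer[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1, counting a prediction as correct only when
    both boundaries and label match a gold span (an assumed criterion)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, not taken from the paper's corpus.
tokens = ["Ivan", "Petrov", "joined", "Acme", "Bank", "in", "Kazan"]
gazetteer = {("Ivan", "Petrov"): "PER", ("Acme",): "ORG"}
gold = {(0, 2, "PER"), (3, 5, "ORG")}

predicted = dictionary_match(tokens, gazetteer)
score = f1_score(gold, predicted)
```

In this toy case the gazetteer knows only "Acme", so the organization span is too short and counts as an error under exact-span scoring; this is the kind of boundary mistake that pattern matching and rules are typically added to reduce.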

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Rinat Gareev (1)
  • Maksim Tkachenko (2)
  • Valery Solovyev (1)
  • Andrey Simanovsky (3)
  • Vladimir Ivanov (4)

  1. Kazan Federal University, Russia
  2. St Petersburg State University, Russia
  3. HP Labs, Russia
  4. National University of Science and Technology "MISIS", Russia
