Morphological Analyzer and Generator for Russian and Ukrainian Languages

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 542)

Abstract

pymorphy2 is a morphological analyzer and generator for the Russian and Ukrainian languages. It uses large, efficiently encoded lexicons built from OpenCorpora and LanguageTool data. A set of linguistically motivated rules was developed to enable morphological analysis and generation of out-of-vocabulary words observed in real-world documents. For Russian, pymorphy2 provides state-of-the-art morphological analysis quality. The analyzer is implemented in the Python programming language with optional C++ extensions. Emphasis is put on ease of use, documentation, and extensibility. The package is distributed under a permissive open-source license, encouraging its use in both academic and commercial settings.

Notes

  1. https://bitbucket.com/kmike/pymorphy.

  2. https://github.com/kmike/pymorphy2.

  3. https://www.python.org/.

  4. http://pymorphy2.readthedocs.org.

  5. http://pypy.org/.

  6. http://opencorpora.org/?page=export.

  7. Conversion utilities: https://github.com/dchaplinsky/LT2OpenCorpora.

  8. https://languagetool.org/.

  9. https://code.google.com/p/dawgdic/.

  10. https://github.com/kmike/DAWG.

  11. https://code.google.com/p/marisa-trie/.

  12. pymorphy2 encodes words as UTF-8 before storing them in the DAFSA, so in practice there are more nodes than shown in Fig. 1. This is an implementation detail.

  13. https://tech.yandex.ru/mystem/.

  14. https://github.com/kmike/microcorpus.

  15. http://kmike.ru/links/aist2015/pymorphy2-mystem3.

  16. Anonymized results: http://ru-eval.ru/tables_index.html.

  17. http://bnkorpus.info.
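
Note 12 says that words are UTF-8 encoded before they go into the DAFSA, which adds nodes beyond those drawn per character. The minimal sketch below illustrates why, assuming the automaton has one transition per byte (an assumption made here for illustration; it is not stated in this excerpt). The helper name `utf8_transitions` and the sample words are hypothetical and are not taken from the paper or from pymorphy2.

```python
# Illustrative only: compares the number of characters in a word with the
# number of UTF-8 bytes, i.e. the path length in a byte-keyed automaton.
# Assumption: one transition per byte. Sample words are arbitrary.

def utf8_transitions(word: str) -> int:
    """Return the number of bytes needed to spell `word` in UTF-8."""
    return len(word.encode("utf-8"))


if __name__ == "__main__":
    for word in ["кот", "кіт", "cat"]:
        # Cyrillic letters take two bytes each in UTF-8, ASCII letters one.
        print(word, len(word), utf8_transitions(word))
```

Under that assumption, a three-letter Cyrillic word needs a path of six byte transitions, while a three-letter ASCII word needs only three.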

Author information

Corresponding author

Correspondence to Mikhail Korobov.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Korobov, M. (2015). Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds) Analysis of Images, Social Networks and Texts. AIST 2015. Communications in Computer and Information Science, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_31

  • DOI: https://doi.org/10.1007/978-3-319-26123-2_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26122-5

  • Online ISBN: 978-3-319-26123-2

  • eBook Packages: Computer Science, Computer Science (R0)
