Abstract
pymorphy2 is a morphological analyzer and generator for the Russian and Ukrainian languages. It uses large, efficiently encoded lexicons built from OpenCorpora and LanguageTool data. A set of linguistically motivated rules was developed to enable morphological analysis and generation of out-of-vocabulary words observed in real-world documents. For Russian, pymorphy2 provides state-of-the-art morphological analysis quality. The analyzer is implemented in the Python programming language with optional C++ extensions. Emphasis is put on ease of use, documentation, and extensibility. The package is distributed under a permissive open-source license, encouraging its use in both academic and commercial settings.
Notes
- 7. Conversion utilities: https://github.com/dchaplinsky/LT2OpenCorpora.
- 12. pymorphy2 encodes words to UTF-8 before putting them into the DAFSA, so in practice there are more nodes than shown in Fig. 1. This is an implementation detail.
- 16. Anonymized results: http://ru-eval.ru/tables_index.html.
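The note on UTF-8 encoding can be illustrated with a minimal sketch. It uses a plain, unminimized trie rather than a real DAFSA, so the absolute node counts are not those of pymorphy2; the helper `count_trie_nodes` and the two sample words are hypothetical and chosen only for illustration. The point is that each Cyrillic letter occupies two bytes in UTF-8, so a byte-level structure needs more nodes than a character-level drawing suggests.

```python
def count_trie_nodes(keys):
    """Count the nodes (including the root) of a plain trie built over `keys`.

    Each key is any iterable of hashable symbols: a str yields characters,
    a bytes object yields integer byte values.
    """
    root = {}
    count = 1  # the root node
    for key in keys:
        node = root
        for symbol in key:
            if symbol not in node:
                node[symbol] = {}
                count += 1
            node = node[symbol]
    return count


# Hypothetical sample lexicon: two three-letter Russian words.
words = ["кот", "кит"]

# Character-level trie, as one would draw it in a figure.
char_nodes = count_trie_nodes(words)

# Byte-level trie over the UTF-8 encoding of the same words.
byte_nodes = count_trie_nodes([w.encode("utf-8") for w in words])

print(char_nodes)  # 6
print(byte_nodes)  # 10
```

A real DAFSA would also merge the shared suffixes, which this sketch deliberately does not do, but the byte-level version still ends up with more nodes than the character-level one.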
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Korobov, M. (2015). Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds) Analysis of Images, Social Networks and Texts. AIST 2015. Communications in Computer and Information Science, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_31
DOI: https://doi.org/10.1007/978-3-319-26123-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26122-5
Online ISBN: 978-3-319-26123-2
eBook Packages: Computer Science, Computer Science (R0)