Recovering Word Forms by Context for Morphologically Rich Languages

Journal of Mathematical Sciences

In this work, we focus on “sentence-level unlemmatization,” the task of generating a grammatical sentence given a lemmatized one; this task is usually easy for humans but may present problems for machine learning models. We treat this setting as a machine translation problem and, as a first attempt, apply a sequence-to-sequence model to the texts of Russian Wikipedia articles, quantitatively evaluate the effect of different training set sizes, and achieve a BLEU score of 67.3 using the largest training set available. We discuss preliminary results and the flaws of traditional machine translation evaluation methods for this task and suggest directions for future research.
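
To make the machine translation framing concrete, the following is a minimal sketch of how parallel training pairs for unlemmatization might be built: the lemmatized sentence serves as the source and the original sentence as the target. It is an illustration only, not the authors' pipeline. The names `TOY_LEMMAS`, `toy_lemmatize`, `make_pair`, and `build_corpus` are hypothetical, and the tiny lemma table stands in for a real morphological analyzer.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lemma table (an assumption for illustration);
# a real setup would call a morphological analyzer instead.
TOY_LEMMAS = {"мыла": "мыть", "раму": "рама"}


def toy_lemmatize(token: str) -> str:
    """Return the lemma of a token, falling back to the lowercased token."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def make_pair(
    sentence: str,
    lemmatize: Callable[[str], str] = toy_lemmatize,
) -> Tuple[str, str]:
    """Build one (source, target) pair: lemmatized text -> original text."""
    tokens = sentence.split()  # whitespace tokenization, for simplicity
    source = " ".join(lemmatize(t) for t in tokens)
    target = " ".join(tokens)
    return source, target


def build_corpus(sentences: List[str]) -> List[Tuple[str, str]]:
    """Turn raw sentences into a parallel corpus, skipping empty lines."""
    return [make_pair(s) for s in sentences if s.strip()]


pairs = build_corpus(["мама мыла раму"])
for source, target in pairs:
    print(source, "->", target)
```

A sequence-to-sequence model would then be trained to map each source string to its target, exactly as in translation between two languages. Whitespace tokenization and lowercasing of the source side are simplifying assumptions of this sketch.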

Author information

Corresponding author

Correspondence to A. M. Alekseev.

Additional information

Published in Zapiski Nauchnykh Seminarov POMI, Vol. 499, 2021, pp. 129–136.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alekseev, A.M., Nikolenko, S.I. Recovering Word Forms by Context for Morphologically Rich Languages. J Math Sci 273, 527–532 (2023). https://doi.org/10.1007/s10958-023-06518-7

  • DOI: https://doi.org/10.1007/s10958-023-06518-7
