Evaluation of Sentence Embedding Models for Natural Language Understanding Problems in Russian

  • Dmitry Popov
  • Alexander Pugachev
  • Polina Svyatokum
  • Elizaveta Svitanko
  • Ekaterina Artemova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11832)

Abstract

We investigate the performance of sentence embedding models on several natural language understanding tasks for the Russian language. Our comparison covers multiple choice question answering, next sentence prediction, and paraphrase identification. We employ FastText embeddings as a baseline and compare them to ELMo and BERT embeddings. We conduct two series of experiments, using both unsupervised (i.e., based on a similarity measure only) and supervised approaches to the tasks. Finally, we present datasets for multiple choice question answering and next sentence prediction in Russian.
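To make the unsupervised (similarity-only) setup concrete, the following Python sketch embeds each sentence by averaging pretrained word vectors, as in the FastText baseline, and selects the answer candidate with the highest cosine similarity to the question. This is a minimal illustration under those assumptions; embed_sentence, pick_answer, and word_vectors are illustrative names, not the authors' implementation.

    import numpy as np

    # word_vectors stands in for any pretrained model mapping token -> np.ndarray
    # (e.g., FastText vectors); all names here are illustrative.

    def embed_sentence(sentence, word_vectors, dim=300):
        # Average the vectors of in-vocabulary tokens; zero vector as a fallback.
        vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def cosine(u, v):
        # Cosine similarity, guarding against zero-norm embeddings.
        norm = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v) / norm if norm else 0.0

    def pick_answer(question, candidates, word_vectors):
        # Return the index of the candidate most similar to the question.
        q = embed_sentence(question, word_vectors)
        return int(np.argmax([cosine(q, embed_sentence(c, word_vectors))
                              for c in candidates]))

The supervised series of experiments described in the abstract would instead feed the sentence embeddings (or similarity features derived from them) into a trained classifier; the sketch above covers only the similarity-only series.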

Keywords

Multiple choice question answering · Next sentence prediction · Paraphrase identification · Sentence embedding

Acknowledgements

This project was supported within the framework of the HSE University Basic Research Program and by the Russian Academic Excellence Project “5–100”.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Dmitry Popov¹
  • Alexander Pugachev¹
  • Polina Svyatokum¹
  • Elizaveta Svitanko¹ (corresponding author)
  • Ekaterina Artemova¹

  1. National Research University Higher School of Economics, Moscow, Russia
