Semantic Recommendation System for Bilingual Corpus of Academic Papers

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1357)


We tested four methods of making document representations cross-lingual for the task of semantic search for similar papers, using a corpus of papers from three Russian conferences on NLP: Dialogue, AIST and AINL. The pipeline consisted of three stages: preprocessing; word-by-word vectorisation using models obtained with various methods of mapping vectors from two independent vector spaces into a common one; and search for the most similar papers based on the cosine similarity of text vectors. The four methods can be grouped into two approaches: 1) aligning two pretrained monolingual word embedding models ourselves with a bilingual dictionary (for example, with the VecMap algorithm) and 2) using pre-aligned cross-lingual word embedding models (MUSE). To find out which approach benefits the task more, we manually evaluated the results and calculated the average precision of recommendations for each method. MUSE turned out to have the highest search relevance, but the other methods produced more recommendations in a language other than that of the target paper.
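The last two pipeline stages can be illustrated with a minimal sketch. It assumes that a text vector is the plain average of the word vectors of its tokens, which the abstract does not confirm, and that both languages already share one embedding space. The two-dimensional vectors, the paper identifiers and all function names are invented for illustration and are not taken from the paper.

```python
import math


def doc_vector(tokens, embeddings):
    """Average the vectors of all tokens present in the shared embedding space."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known words, so the document cannot be represented
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(target_id, doc_vectors, top_n=5):
    """Rank all other papers by cosine similarity to the target paper."""
    target = doc_vectors[target_id]
    scored = [
        (cosine(target, vec), doc_id)
        for doc_id, vec in doc_vectors.items()
        if doc_id != target_id
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


def precision(recommended, relevant):
    """Share of recommended papers that an annotator marked as relevant."""
    if not recommended:
        return 0.0
    return sum(1 for d in recommended if d in relevant) / len(recommended)


# Toy cross-lingual space (hypothetical values): translations lie close together.
embeddings = {
    "parser": [1.0, 0.0],
    "парсер": [0.9, 0.1],
    "sentiment": [0.0, 1.0],
    "тональность": [0.1, 0.9],
}

# Toy preprocessed corpus (hypothetical identifiers and tokens).
papers = {
    "en_parsing": ["parser", "parser"],
    "ru_parsing": ["парсер"],
    "en_sentiment": ["sentiment"],
    "ru_sentiment": ["тональность"],
}

doc_vectors = {pid: doc_vector(tokens, embeddings) for pid, tokens in papers.items()}
top = recommend("en_parsing", doc_vectors, top_n=2)
print(top)
print(precision(top, relevant={"ru_parsing"}))
```

In this toy setup the Russian parsing paper is ranked first for the English parsing paper, which is the kind of cross-language recommendation the evaluation looks for. Averaging `precision` over many target papers would give one plausible reading of the reported metric, though the exact formula used in the paper is not stated here.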


Semantic similarity; Semantic search; Scientific literature search; Document representations; Cross-lingual representations



Copyright information

© Springer Nature Switzerland AG 2021

Authors and Affiliations

  1. National Research University Higher School of Economics, Moscow, Russia
  2. University of Oslo, Oslo, Norway
  3. Skolkovo Institute of Science and Technology (Skoltech), Moscow, Russia
