Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology

  • Gennady Shtekh
  • Polina Kazakova
  • Nikita Nikitinsky
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)


Evaluating Cross-Language Information Retrieval (CLIR) models is difficult because collecting and assessing a substantial amount of data for CLIR evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are now available. In the present paper we address this problem by proposing a strict workflow for transforming machine translation datasets into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting a representative subsample from the initial large corpus of documents that is suitable for further manual assessment. We also hypothesize, and then show through a number of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm on the automatically assessed sample can in fact be treated as a reasonable metric.
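As a rough illustration of what "automatically obtained relevance assessments" could look like, the sketch below assumes a document-aligned parallel corpus in which each source-language document serves as a query and its aligned translation is the only relevant document. This is an assumption made for illustration; the abstract does not spell out how relevance is derived. Mean reciprocal rank is used only as an example metric, and the names build_qrels and mean_reciprocal_rank, as well as the document ids, are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_qrels(aligned_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Assumption: the aligned translation is the sole relevant document."""
    return {src_id: {tgt_id} for src_id, tgt_id in aligned_pairs}


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of 1/rank of the first relevant document, 0 if none is retrieved."""
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, relevant in qrels.items():
        for rank, doc_id in enumerate(rankings.get(query_id, []), start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Hypothetical aligned pairs: (source-language id, target-language id).
pairs = [("en-1", "ru-1"), ("en-2", "ru-2"), ("en-3", "ru-3")]
qrels = build_qrels(pairs)

# Hypothetical ranked output of some CLIR system, one list per query.
rankings = {
    "en-1": ["ru-1", "ru-2", "ru-3"],  # translation found at rank 1
    "en-2": ["ru-3", "ru-2", "ru-1"],  # translation found at rank 2
    "en-3": ["ru-1", "ru-2"],          # translation not retrieved
}
mrr = mean_reciprocal_rank(rankings, qrels)
print(f"MRR = {mrr:.3f}")
```

Under this assumption no manual judging is needed to score a system, which is what makes comparison against a manually assessed subsample informative.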


Cross-language information retrieval; Document-level information retrieval; CLIR evaluation; CLIR datasets; Parallel corpora; Information retrieval methodology



We would like to acknowledge the hard work and commitment of Ivan Menshikh throughout this study. We are also thankful to Anna Potapenko for offering very useful comments on the present paper, and to Konstantin Vorontsov for encouragement and support.

The present research was supported by the Ministry of Education and Science of the Russian Federation under the unique research ID RFMEFI57917X0143.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Gennady Shtekh (2)
  • Polina Kazakova (2)
  • Nikita Nikitinsky (1)
  1. Integrated Systems, Moscow, Russia
  2. National University of Science and Technology MISIS, Moscow, Russia
