Abstract
Evaluating the performance of Cross-Language Information Retrieval (CLIR) models is difficult, since collecting and assessing a substantial amount of data for CLIR system evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are now available. In the present paper we address this problem by suggesting a strict workflow for transforming machine translation datasets into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting a representative subsample from the initial large corpus of documents so that it is appropriate for further manual assessment. We also hypothesize, and then demonstrate through a number of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm on the automatically assessed sample can in fact be treated as a reasonable metric.
Notes
- 1. Here we define a document-level information retrieval system as a type of information retrieval system in which users query not with short keyword phrases but with full-text example documents.
- 2. Nonetheless, the case of document-level information retrieval somewhat simplifies the evaluation procedure: there is no need for example queries or for ground-truth relevance measures between queries and documents; only document-to-document relevance is required.
- 3. For simplicity, in the present paper we discuss the case of a bilingual dataset. However, the approaches described here could easily be generalized to the case of multiple languages.
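To make the idea of document-to-document relevance in a bilingual dataset concrete, the following is a minimal sketch and not the workflow proposed in the paper. It assumes that, in a document-aligned parallel corpus, the aligned translation of a document is its only relevant document. It then scores a retrieval run with mean reciprocal rank, a standard metric chosen here purely for illustration. The names `build_qrels` and `mean_reciprocal_rank`, the document ids, and the rankings are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def build_qrels(pairs: Sequence[Tuple[str, str]]) -> Dict[str, str]:
    """Map each source-language document id to the id of its aligned translation.

    Assumption (not taken from the paper): the aligned translation is treated
    as the single relevant document for its source-language counterpart.
    """
    return {src_id: tgt_id for src_id, tgt_id in pairs}


def mean_reciprocal_rank(
    qrels: Dict[str, str], rankings: Dict[str, List[str]]
) -> float:
    """Average of 1/rank of the relevant document over all query documents.

    A query document whose relevant document is missing from its ranking
    contributes 0 to the sum.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, relevant_id in qrels.items():
        ranked = rankings.get(query_id, [])
        if relevant_id in ranked:
            total += 1.0 / (ranked.index(relevant_id) + 1)
    return total / len(qrels)


# Hypothetical toy data: three aligned English-French document pairs.
pairs = [("en_1", "fr_1"), ("en_2", "fr_2"), ("en_3", "fr_3")]
qrels = build_qrels(pairs)

# Hypothetical output of some CLIR system: ranked target-language documents
# returned for each source-language query document.
rankings = {
    "en_1": ["fr_1", "fr_2", "fr_3"],  # relevant document at rank 1
    "en_2": ["fr_3", "fr_2", "fr_1"],  # relevant document at rank 2
    "en_3": ["fr_1", "fr_2"],          # relevant document not retrieved
}

mrr = mean_reciprocal_rank(qrels, rankings)
print(f"MRR = {mrr:.3f}")
```

Under these assumptions no manual relevance judgments are needed: the alignment of the parallel corpus itself plays the role of the ground truth.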
Acknowledgements
We would like to acknowledge the hard work and commitment of Ivan Menshikh throughout this study. We are also thankful to Anna Potapenko for offering very useful comments on the present paper, and to Konstantin Vorontsov for encouragement and support.
The present research was supported by the Ministry of Education and Science of the Russian Federation under the unique research id RFMEFI57917X0143.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Shtekh, G., Kazakova, P., Nikitinsky, N. (2018). Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_9
DOI: https://doi.org/10.1007/978-3-030-00794-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science (R0)