Abstract
Bilingual word vectors have been widely exploited in cross-language information retrieval research. However, most of this research focuses on similar language pairs, and very few studies explore the impact of bilingual word vectors on cross-language information retrieval for distant language pairs. This paper systematically analyzes the retrieval performance of several European languages (English, German, Italian, French, Finnish, Dutch) and Asian languages (Chinese, Japanese) on the ad hoc task of the CLEF 2002–2003 campaigns. Genetic proximity is used to visually represent the relationships between languages and to compare their cross-lingual retrieval performance in various settings. The results show that differences in language vocabulary dramatically affect retrieval performance. At the same time, the term-by-term translation retrieval method performs slightly better than the simple vector addition retrieval method. This indicates that the translation-based retrieval model can still maintain its advantage under the new semantic scheme.
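The two retrieval strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional embeddings, the toy documents, and all function names below are hypothetical, and the term-frequency scoring in the translation variant is a simple stand-in for whatever monolingual ranking model the paper uses. The sketch assumes source and target words already live in one shared bilingual vector space.

```python
import math

# Hypothetical shared bilingual space: English (source) and German (target).
SRC_EMB = {"cat": [1.0, 0.0], "music": [0.0, 1.0]}
TGT_EMB = {"katze": [0.9, 0.1], "hund": [0.6, 0.4], "musik": [0.1, 0.9]}
DOCS = {"d1": ["katze", "katze", "hund"], "d2": ["musik", "musik"]}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def add_vectors(words, emb):
    """Simple vector addition: sum the embeddings of all known words."""
    known = [emb[w] for w in words if w in emb]
    return [sum(col) for col in zip(*known)] if known else None


def rank_by_addition(query, docs, src_emb, tgt_emb):
    """Rank documents by cosine between summed query and document vectors."""
    q = add_vectors(query, src_emb)
    scores = {}
    for doc_id, words in docs.items():
        d = add_vectors(words, tgt_emb)
        scores[doc_id] = cosine(q, d) if q and d else 0.0
    return sorted(scores, key=scores.get, reverse=True)


def translate_term(word, src_emb, tgt_emb):
    """Term-by-term translation: nearest target word in the shared space."""
    if word not in src_emb:
        return None
    return max(tgt_emb, key=lambda t: cosine(src_emb[word], tgt_emb[t]))


def rank_by_translation(query, docs, src_emb, tgt_emb):
    """Translate each query term, then score documents by term frequency."""
    translated = [translate_term(w, src_emb, tgt_emb) for w in query]
    translated = [t for t in translated if t is not None]
    scores = {
        doc_id: sum(words.count(t) for t in translated)
        for doc_id, words in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In this toy setting, calling `rank_by_addition(["cat"], DOCS, SRC_EMB, TGT_EMB)` and `rank_by_translation(["cat"], DOCS, SRC_EMB, TGT_EMB)` both place `d1` first. On real collections the two methods can diverge, which is the comparison the paper reports.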
Acknowledgement
The work described in this paper was supported by the National Natural Science Foundation of China under Project No. 61876062, the Scientific Research Fund of Hunan Provincial Education Department of China under Project No. 16K030, the Hunan Provincial Natural Science Foundation of China under Project No. 2017JJ2101, and the Hunan Provincial Innovation Foundation for Postgraduate under Project No. CX2018B671.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, Y., Zhou, D. (2019). Research on Cross-Language Retrieval Using Bilingual Word Vectors in Different Languages. In: Cheng, X., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2019. Communications in Computer and Information Science, vol 1058. Springer, Singapore. https://doi.org/10.1007/978-981-15-0118-0_35
DOI: https://doi.org/10.1007/978-981-15-0118-0_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0117-3
Online ISBN: 978-981-15-0118-0
eBook Packages: Computer Science, Computer Science (R0)