Research on Cross-Language Retrieval Using Bilingual Word Vectors in Different Languages

  • Conference paper
Data Science (ICPCSEE 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1058)

Abstract

Bilingual word vectors have been widely exploited in cross-language information retrieval research. However, most current research focuses on similar language pairs, and very few studies explore the impact of using bilingual word vectors for cross-language information retrieval in distant language pairs. This paper systematically analyzes the retrieval performance of various European languages (English, German, Italian, French, Finnish, Dutch) as well as Asian languages (Chinese, Japanese) in the ad hoc task of the CLEF 2002–2003 campaigns. Genetic proximity is used to visually represent the relationships between languages and to compare their cross-lingual retrieval performance in various settings. The results show that differences in language vocabulary dramatically affect retrieval performance. At the same time, the term-by-term translation retrieval method performs slightly better than the simple vector addition retrieval method. This indicates that the translation-based retrieval model can still maintain its advantage under the new semantic scheme.
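
The two retrieval strategies compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the vocabularies `EN` and `DE`, the document collection `docs`, and all function names are hypothetical, and plain term overlap stands in for whatever monolingual ranking function a real system would apply after translation. The sketch assumes both languages already share one embedding space.

```python
import math

# Toy shared bilingual embedding space (hypothetical 3-d vectors, not real data).
EN = {"cat": [0.9, 0.1, 0.0], "house": [0.0, 0.2, 0.9]}
DE = {"katze": [0.8, 0.2, 0.1], "haus": [0.1, 0.1, 0.9], "hund": [0.6, 0.6, 0.1]}


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def add_vectors(words, space):
    """Sum the vectors of all in-vocabulary words (simple vector addition)."""
    dim = len(next(iter(space.values())))
    total = [0.0] * dim
    for word in words:
        if word in space:
            total = [t + x for t, x in zip(total, space[word])]
    return total


def translate_term_by_term(query, src, tgt):
    """Replace each in-vocabulary query term with its nearest target-language word."""
    translated = []
    for word in query:
        if word in src:
            translated.append(max(tgt, key=lambda t: cosine(src[word], tgt[t])))
    return translated


def rank_by_addition(query, docs, src, tgt):
    """Rank documents by cosine between the summed query and summed document vectors."""
    q = add_vectors(query, src)
    scores = {d: cosine(q, add_vectors(tokens, tgt)) for d, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def rank_by_translation(query, docs, src, tgt):
    """Translate the query term by term, then rank documents by term overlap."""
    translated = translate_term_by_term(query, src, tgt)
    scores = {d: sum(tokens.count(t) for t in translated) for d, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical target-language collection of tokenized documents.
docs = {"d1": ["katze", "hund"], "d2": ["haus"]}

print(translate_term_by_term(["cat", "house"], EN, DE))
print(rank_by_addition(["cat"], docs, EN, DE))
print(rank_by_translation(["house"], docs, EN, DE))
```

The addition approach never leaves the vector space, so every word in a document contributes to its score, while the translation approach commits to one target word per query term and then behaves like ordinary monolingual retrieval.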

Notes

  1. http://www.wikipedia.org

  2. http://www.clef-initiative.eu/

  3. https://www.elinguistics.net/

  4. https://translate.google.com/

  5. https://github.com/taku910/mecab

  6. https://github.com/fxsjy/jieba

  7. https://fasttext.cc

  8. https://github.com/rlitschk/UnsupCLIR

Acknowledgement

The work described in this paper was supported by the National Natural Science Foundation of China under Project No. 61876062, the Scientific Research Fund of the Hunan Provincial Education Department of China under Project No. 16K030, the Hunan Provincial Natural Science Foundation of China under Project No. 2017JJ2101, and the Hunan Provincial Innovation Foundation for Postgraduate under Project No. CX2018B671.

Author information

Corresponding author

Correspondence to Dong Zhou.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, Y., Zhou, D. (2019). Research on Cross-Language Retrieval Using Bilingual Word Vectors in Different Languages. In: Cheng, X., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2019. Communications in Computer and Information Science, vol 1058. Springer, Singapore. https://doi.org/10.1007/978-981-15-0118-0_35

  • DOI: https://doi.org/10.1007/978-981-15-0118-0_35

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-0117-3

  • Online ISBN: 978-981-15-0118-0

  • eBook Packages: Computer Science, Computer Science (R0)