Information Retrieval, Volume 16, Issue 1, pp 1–29

Leveraging comparable corpora for cross-lingual information retrieval in resource-lean language pairs

  • Azadeh Shakery
  • ChengXiang Zhai

Abstract

Cross-language information retrieval (CLIR) has so far been studied under the assumption that rich linguistic resources, such as bilingual dictionaries or parallel corpora, are available. However, creating such high-quality resources is labor-intensive, and they are not always at hand. In this paper we investigate the feasibility of using only comparable corpora for CLIR, without relying on other linguistic resources. Comparable corpora are text documents in different languages that cover similar topics and are often naturally attainable (e.g., news articles published in different languages during the same time period). We adapt an existing cross-lingual word association mining method and incorporate it into a language modeling approach to cross-language retrieval. We investigate different strategies for estimating the target query language models. Our evaluation results on the TREC Arabic–English cross-lingual data show that the proposed method is effective for the CLIR task, demonstrating that it is feasible to perform cross-lingual information retrieval with just comparable corpora.

Keywords

Cross-language information retrieval, Comparable corpora, Probabilistic propagation, Language models

Notes

Acknowledgments

We thank the anonymous reviewers for their constructive comments, which have helped to improve the framing of the paper, the discussion of related work, and the completeness of the evaluation.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
  2. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
