Cross-Lingual Information Retrieval System for Indian Languages

  • Jagadeesh Jagarlamudi
  • A. Kumaran
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5152)


This paper describes our attempt to build a Cross-Lingual Information Retrieval (CLIR) system as part of the Indian language sub-task of the main Adhoc monolingual and bilingual track in the CLEF competition. The task in this track was to retrieve relevant documents from an English corpus in response to a query expressed in one of several Indian languages, including Hindi, Tamil, Telugu, Bengali and Marathi. Groups participating in this track were required to submit an English to English monolingual run and a Hindi to English bilingual run, with optional runs in the remaining languages. Our submission consisted of a monolingual English run and a Hindi to English cross-lingual run.

We used a word alignment table, learnt by a Statistical Machine Translation (SMT) system trained on aligned parallel sentences, to map a query in the source language into an equivalent query in the language of the document collection. The relevant documents are then retrieved using a language-modeling-based retrieval algorithm. On the CLEF 2007 data set, our official cross-lingual performance was 54.4% of the monolingual performance; in post-submission experiments we found that it can be significantly improved, up to 76.3% of the monolingual performance.
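The two-stage pipeline described above (query translation through a word alignment table, followed by language-model retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the alignment entries, the probabilities, the toy documents, the `top_k` cut-off, and the choice of Jelinek-Mercer smoothing are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
from collections import Counter

# Hypothetical alignment table: P(target word | source word).
# Source words are romanized toy examples; probabilities are invented.
ALIGNMENT = {
    "nadi": {"river": 0.7, "stream": 0.3},
    "pradushan": {"pollution": 0.9, "contamination": 0.1},
}

def translate_query(source_terms, table, top_k=2):
    """Map each source term to its top_k target words, weighted by alignment probability."""
    weighted = {}
    for term in source_terms:
        candidates = sorted(table.get(term, {}).items(),
                            key=lambda kv: kv[1], reverse=True)[:top_k]
        for word, prob in candidates:
            weighted[word] = weighted.get(word, 0.0) + prob
    return weighted

def score(weighted_query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Weighted query log-likelihood under a smoothed unigram document model.

    Jelinek-Mercer smoothing is an assumption here, chosen for brevity.
    """
    doc_counts = Counter(doc_tokens)
    total = 0.0
    for word, weight in weighted_query.items():
        p_doc = doc_counts[word] / len(doc_tokens) if doc_tokens else 0.0
        p_coll = coll_counts[word] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p > 0:  # words absent from the whole collection are skipped for every document
            total += weight * math.log(p)
    return total

def rank(weighted_query, docs, lam=0.5):
    """Return (doc_id, score) pairs sorted from best to worst."""
    coll_counts = Counter(tok for toks in docs.values() for tok in toks)
    coll_len = sum(len(toks) for toks in docs.values())
    scored = [(doc_id, score(weighted_query, toks, coll_counts, coll_len, lam))
              for doc_id, toks in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy English collection (invented for this example).
DOCS = {
    "d1": "river pollution rises in the northern river basin".split(),
    "d2": "cricket team wins the final match".split(),
    "d3": "stream water quality and contamination report".split(),
}

if __name__ == "__main__":
    query = translate_query(["nadi", "pradushan"], ALIGNMENT)
    for doc_id, s in rank(query, DOCS):
        print(doc_id, round(s, 3))
```

Keeping several weighted translations per source word, rather than only the single best one, lets a document match on any plausible translation; whether the authors used this particular weighting is not stated in the abstract.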


Machine Translation · Mean Average Precision · Indian Language · Query Word · Source Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jagadeesh Jagarlamudi (1)
  • A. Kumaran (1)
  1. Multilingual Systems Research, Microsoft Research India, Bangalore, India
