Improving Cross-Language Information Retrieval by Transliteration Mining and Generation

  • Conference paper
Multilingual Information Access in South Asian Languages

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7536)

Abstract

The retrieval performance of Cross-Language Information Retrieval (CLIR) systems depends on the coverage of the translation lexicon they use. Unfortunately, most translation lexicons do not provide good coverage of proper nouns and common nouns, which are often the most information-bearing terms in a query. As a consequence, many queries cannot be translated without a substantial loss of information, and the retrieval performance of the CLIR system is less than satisfactory for those queries. However, proper nouns and common nouns very often appear in their transliterated forms in the target-language document collection. In this work, we study two techniques that leverage this fact to address the problem: Transliteration Mining and Transliteration Generation. The first technique attempts to mine the transliterations of out-of-vocabulary query terms from the document collection, whereas the second generates the transliterations. We systematically study the effectiveness of both techniques in the context of the Hindi-English and Tamil-English ad hoc retrieval tasks at FIRE 2010. The results of our study show that both techniques are effective in addressing the problem posed by out-of-vocabulary terms, with the Transliteration Mining technique giving better results than Transliteration Generation.
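
To make the mining idea concrete, here is a minimal Python sketch. It is not the method used in the paper, which this page does not describe. It assumes that an out-of-vocabulary query term has already been roughly romanized by some earlier step, and it simply ranks words from the target collection's vocabulary by string similarity. The function names, the 0.75 threshold, and the sample vocabulary are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in scoring function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mine_transliterations(romanized_term, vocabulary, threshold=0.75, top_k=3):
    """Return up to top_k vocabulary words that resemble the romanized term.

    Assumption: a real system would score candidates with a trained
    transliteration similarity model; plain string similarity is used
    here only to show the shape of the mining step.
    """
    scored = [(similarity(romanized_term, word), word) for word in vocabulary]
    kept = [(score, word) for score, word in scored if score >= threshold]
    # Highest score first; ties broken alphabetically for determinism.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in kept[:top_k]]


if __name__ == "__main__":
    # Hypothetical vocabulary drawn from an English document collection.
    collection_vocabulary = ["tendulkar", "chennai", "cricket", "tender"]
    print(mine_transliterations("tendulakar", collection_vocabulary))
```

The point of the sketch is the contrast drawn in the abstract: mining can only return strings that actually occur in the document collection, whereas a generation approach would produce a candidate spelling whether or not the collection contains it.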



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saravanan, K., Udupa, R., Kumaran, A. (2013). Improving Cross-Language Information Retrieval by Transliteration Mining and Generation. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_29

  • DOI: https://doi.org/10.1007/978-3-642-40087-2_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40086-5

  • Online ISBN: 978-3-642-40087-2

  • eBook Packages: Computer Science, Computer Science (R0)
