Improving Cross-Language Information Retrieval by Transliteration Mining and Generation

Saravanan, K.; Udupa, Raghavendra; Kumaran, A.

doi:10.1007/978-3-642-40087-2_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7536)

Abstract

The retrieval performance of Cross-Language Information Retrieval (CLIR) systems is a function of the coverage of the translation lexicon they use. Unfortunately, most translation lexicons do not provide good coverage of proper nouns and common nouns, which are often the most information-bearing terms in a query. As a consequence, many queries cannot be translated without a substantial loss of information, and the retrieval performance of the CLIR system is less than satisfactory for those queries. However, proper nouns and common nouns very often appear in their transliterated forms in the target-language document collection. In this work, we study two techniques that leverage this fact to address the problem: Transliteration Mining and Transliteration Generation. The first technique attempts to mine the transliterations of out-of-vocabulary query terms from the document collection, whereas the second generates the transliterations. We systematically study the effectiveness of both techniques in the context of the Hindi-English and Tamil-English ad hoc retrieval tasks at FIRE 2010. The results of our study show that both techniques are effective in addressing the problem posed by out-of-vocabulary terms, with Transliteration Mining giving better results than Transliteration Generation.
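
As a rough illustration of the mining idea only, the following Python sketch picks, from terms that actually occur in a target-language collection, the one closest to a rough romanization of an out-of-vocabulary query term. It is a minimal sketch under stated assumptions, not the method of the paper: the function names, the 0.8 threshold, the toy vocabulary, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all hypothetical, and the rough romanization is assumed to be given.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (an assumption, not the paper's model)."""
    return SequenceMatcher(None, a, b).ratio()


def mine_transliteration(
    romanized_oov: str,
    collection_terms: Iterable[str],
    threshold: float = 0.8,  # hypothetical cut-off chosen for this toy example
) -> Optional[str]:
    """Return the collection term most similar to the OOV term, or None.

    The result is always a term that occurs in the collection, which is
    the property that distinguishes mining from free generation.
    """
    best = max(
        collection_terms,
        key=lambda term: similarity(romanized_oov, term),
        default=None,
    )
    if best is not None and similarity(romanized_oov, best) >= threshold:
        return best
    return None


# Toy target-language vocabulary (hypothetical data).
vocabulary = ["tender", "cricket", "tendulkar", "sachin"]

# A rough romanization of an OOV query term, assumed to be available.
mined = mine_transliteration("tendulakar", vocabulary)
print(mined)
```

A generation approach would instead emit a candidate string directly, which may or may not match any indexed term; restricting candidates to the collection sidesteps that mismatch at the cost of needing a good similarity model.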

Author information

Authors and Affiliations

Multilingual Systems Research, Microsoft Research India, Bangalore, India
K. Saravanan, Raghavendra Udupa & A. Kumaran

Editor information

Editors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
Indian Institute of Technology, Bombay, India
Pushpak Bhattacharyya
IBM Research New Delhi, India
L. Venkata Subramaniam & Danish Contractor
NLE Lab - ELiRF, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso

About this paper

Cite this paper

Saravanan, K., Udupa, R., Kumaran, A. (2013). Improving Cross-Language Information Retrieval by Transliteration Mining and Generation. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_29

DOI: https://doi.org/10.1007/978-3-642-40087-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)
