Transliteration Equivalence Using Canonical Correlation Analysis

Udupa, Raghavendra; Khapra, Mitesh M.

doi:10.1007/978-3-642-12275-0_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5993)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We address the problem of Transliteration Equivalence, i.e., determining whether a pair of words in two different languages (e.g., Auden and ऑडेन) are transliterations of the same name. This problem is at the heart of Mining Name Transliterations (MINT) from various sources of multilingual text data, including parallel, comparable, and non-comparable corpora and multilingual news streams. MINT is useful in several cross-language tasks, including Cross-Language Information Retrieval (CLIR), Machine Translation (MT), and Cross-Language Named Entity Retrieval. We propose a novel approach to Transliteration Equivalence using language-neutral representations of names. The key idea is to treat the transliterations of a name in two languages as two views of the same semantic object and to compute a low-dimensional common feature space using Canonical Correlation Analysis (CCA). The similarity of two names in this common feature space forms the basis for classifying the pair as transliterations. We show that our approach outperforms state-of-the-art baselines in the CLIR task for Hindi-English (3 collections) and Tamil-English (2 collections).
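
The idea of projecting both views into a common space and comparing names there can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the chapter: the character-bigram features, the six toy Hindi-English name pairs, the regularisation constant `reg`, the choice `k=5`, and all function names are hypothetical. The sketch fits a regularised CCA with NumPy and scores name pairs by cosine similarity in the projected space.

```python
import numpy as np

def char_bigrams(name):
    # Character bigrams with boundary markers (an assumed feature choice).
    padded = "^" + name + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bigram_vocab(names):
    grams = sorted({g for name in names for g in char_bigrams(name)})
    return {g: i for i, g in enumerate(grams)}

def featurize(names, vocab):
    # One row per name, one column per bigram, entries are counts.
    X = np.zeros((len(names), len(vocab)))
    for row, name in enumerate(names):
        for g in char_bigrams(name):
            if g in vocab:
                X[row, vocab[g]] += 1.0
    return X

def inv_sqrt(C):
    # Inverse square root of a symmetric positive-definite matrix.
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T

def fit_cca(X, Y, k, reg=1e-2):
    # Regularised CCA via SVD of the whitened cross-covariance matrix.
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    # Projection matrices for each view and the top-k canonical correlations.
    return mx, Wx @ U[:, :k], my, Wy @ Vt[:k].T, s[:k]

def similarity(u, v):
    # Cosine similarity in the common feature space.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy training pairs (illustrative only, not data from the chapter).
english = ["auden", "mohan", "kamal", "gita", "delhi", "bharat"]
hindi = ["ऑडेन", "मोहन", "कमल", "गीता", "दिल्ली", "भारत"]

vx, vy = bigram_vocab(english), bigram_vocab(hindi)
X, Y = featurize(english, vx), featurize(hindi, vy)
mx, A, my, B, corrs = fit_cca(X, Y, k=5)

U = (X - mx) @ A  # English names in the common space
V = (Y - my) @ B  # Hindi names in the common space
sims = np.array([[similarity(u, v) for v in V] for u in U])
best = sims.argmax(axis=1)  # most similar Hindi name for each English name

for i, j in enumerate(best):
    print(english[i], "->", hindi[j], round(sims[i, j], 3))
```

With only six pairs and `k=5`, the projection essentially memorises the training pairs, so this shows the mechanics rather than generalisation. A realistic setup would fit on a much larger list of known transliteration pairs, pick `k` well below the number of pairs, and classify unseen pairs by thresholding the similarity score.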

Author information

Authors and Affiliations

Microsoft Research, India
Raghavendra Udupa
Indian Institute of Technology, Bombay
Mitesh M. Khapra

Editor information

Editors and Affiliations

Adaptive Information Cluster, Dublin City University, Dublin, 9, Ireland
Cathal Gurrin
The Open University, Walton Hall, MK7 6HF, Milton Keynes, UK
Yulan He
Microsoft Research Ltd, 7 JJ Thomson Avenue, CB3 0FB, Cambridge, UK
Gabriella Kazai
Department of Computer Science, University of Essex, Wivenhoe Park, CO4 3SQ, Colchester, UK
Udo Kruschwitz
The Open University, Walton Hall, Milton Keynes, UK
Suzanne Little
University of London, London, UK
Thomas Roelleke
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Department of Computing Science, University of Glasgow, 17 Lilybank Gardens, G12 8QQ, Glasgow, UK
Keith van Rijsbergen

Cite this paper

Udupa, R., Khapra, M.M. (2010). Transliteration Equivalence Using Canonical Correlation Analysis. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_10

DOI: https://doi.org/10.1007/978-3-642-12275-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12274-3
Online ISBN: 978-3-642-12275-0
eBook Packages: Computer Science, Computer Science (R0)