Abstract
Cross-lingual information retrieval is difficult for many Indian languages because of a lack of resources. Google Translate provides an easy way to translate from Indian languages into English, but because of lexicon limitations most out-of-vocabulary words are transliterated letter by letter together with their suffixes, which produces an unusually long string. The resulting string often does not match its intended translation, which hurts retrieval. We propose an approach that extracts the correct word from such strings using word segmentation together with approximate string matching based on the Soundex algorithm and Levenshtein distance. We evaluate our approach on three Indian languages and find an average improvement of 5.8% MAP on the FIRE-2010 dataset.
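The combination of segmentation and approximate matching described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prefix-based segmentation, the tiny vocabulary, and the input string "pakisthanmein" (a made-up transliteration of a word plus suffix) are all assumptions for illustration. The sketch tries every prefix of the long string as a candidate stem, keeps vocabulary words that share the prefix's Soundex code, and picks the one with the smallest Levenshtein distance.

```python
from typing import Iterable, Optional

# Letter-to-digit table for American Soundex.
_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}


def soundex(word: str) -> str:
    """American Soundex: first letter followed by three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def recover_word(oov: str, vocabulary: Iterable[str],
                 min_len: int = 3) -> Optional[str]:
    """Hypothetical helper: treat each prefix of the transliterated string
    (longest first) as a candidate stem and return the vocabulary word that
    shares a Soundex code with some prefix and is closest in edit distance."""
    oov = oov.lower()
    by_code = {}
    for w in vocabulary:
        by_code.setdefault(soundex(w), []).append(w.lower())
    best = None  # (distance, word)
    for end in range(len(oov), min_len - 1, -1):
        prefix = oov[:end]
        for cand in by_code.get(soundex(prefix), []):
            d = levenshtein(prefix, cand)
            if best is None or d < best[0]:
                best = (d, cand)
    return best[1] if best else None


# Assumed example: a stem with a spelling variant plus a fused suffix.
print(recover_word("pakisthanmein", ["pakistan", "pakistani", "package"]))
```

Grouping the vocabulary by Soundex code first keeps the more expensive edit-distance computation restricted to phonetically plausible candidates. A real system would presumably draw the vocabulary from the target collection's index and might weight or threshold the distances differently.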
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chheda, P., Faruqui, M., Mitra, P. (2012). Handling OOV Words in Indian-language – English CLIR. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_45
DOI: https://doi.org/10.1007/978-3-642-28997-2_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science (R0)