Language Identification and Disambiguation in Indian Mixed-Script

  • Bhumika Gupta
  • Gaurav Bhatt
  • Ankush Mittal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9581)

Abstract

This paper proposes an algorithm that segregates the words of a mixed-script query by language (Hindi, English, Bengali and Gujarati) and suggests relevant replacements for misspelled or unknown words, producing a query in which the original language of each word is known. First, each word is matched directly against the dictionaries of each language, transliterated into English. For words that do not match, the Levenshtein algorithm is used to shortlist, from all the dictionaries, the candidate words closest to the given spelling. Finally, to achieve a higher level of generalization, we use probabilities of doublets and triplets of words occurring together, computed from a training database. These probabilities determine how relevant each candidate is in the given text, allowing us to pick the most relevant match.
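The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy dictionaries, the training sentences, the `max_dist` cutoff, the tie-breaking rule, and all function names are assumptions made for the example, and only doublets (bigrams) are used where the paper also uses triplets.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy dictionaries of romanized words (assumption, not the paper's data).
DICTIONARIES = {
    "en": {"what", "is", "the", "name", "song"},
    "hi": {"kya", "hai", "naam", "gaana", "tera"},
}


def candidates(word, max_dist=2):
    """Return (candidate, language, distance) triples for one query word."""
    # Stage 1: direct dictionary match.
    exact = [(word, lang, 0) for lang, vocab in DICTIONARIES.items() if word in vocab]
    if exact:
        return sorted(exact)
    # Stage 2: shortlist the closest spellings across all dictionaries.
    scored = [(w, lang, levenshtein(word, w))
              for lang, vocab in DICTIONARIES.items() for w in vocab]
    best = min(dist for _, _, dist in scored)
    if best > max_dist:  # assumed cutoff: too far from every known word
        return []
    return sorted(c for c in scored if c[2] == best)


def train_bigrams(sentences):
    """Count adjacent word pairs (doublets) in a training corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def disambiguate(query, bigrams):
    """Label each word with a language, replacing unknown words where possible."""
    result = []
    for word in query.lower().split():
        cands = candidates(word)
        if not cands:
            result.append((word, "unk"))
            continue
        prev = result[-1][0] if result else None
        # Stage 3: prefer the candidate seen most often after the previous word.
        # Ties fall back to the first candidate in sorted order (an assumption).
        best = max(cands, key=lambda c: bigrams[(prev, c[0])])
        result.append((best[0], best[1]))
    return result


bigrams = train_bigrams(["tera naam kya hai", "what is the name"])
print(disambiguate("tera nam kya hai", bigrams))
print(disambiguate("the nam", bigrams))
```

In this sketch the misspelled `nam` is one edit away from both `naam` and `name`, so spelling alone cannot decide; the preceding word (`tera` versus `the`) selects the candidate through the bigram counts. Raw counts stand in for the probabilities mentioned in the abstract.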

Keywords

Mixed-script, Transliteration, Similarity matching, Supervised machine learning, Information retrieval

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. College of Engineering Roorkee, Roorkee, India
  2. Indian Institute of Technology, Roorkee, India
  3. Graphic Era University, Dehradun, India
