Abstract
Code-Mixing (CM) is a natural phenomenon observed in many multilingual societies and is becoming the preferred medium of expression and communication in online and social media fora. In spite of this, current Question Answering (QA) systems do not support CM and are designed to work with a single interaction language only. This assumption makes it inconvenient for multilingual users to interact naturally with a QA system, especially when they do not know the right word in the target language. In this paper, we present WebShodh - an end-to-end web-based Factoid QA system for CM languages. We demonstrate our system with two CM language pairs: Hinglish (Matrix language: Hindi, Embedded language: English) and Tenglish (Matrix language: Telugu, Embedded language: English). The lack of language resources such as annotated corpora, POS taggers or parsers for CM languages poses a major challenge for automated processing and analysis. In view of this resource scarcity, we assume only the existence of bilingual dictionaries from the matrix languages to English and use them to lexically translate the question into English. We then use this loosely translated question for downstream analysis such as Answer Type (AType) prediction, answer retrieval and ranking. In our evaluation, the system achieves a Mean Reciprocal Rank (MRR) of 0.37 for Hinglish and 0.32 for Tenglish. We have hosted the system online and plan to use it to collect more CM question and answer data for further improvement.
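As a rough illustration of two ideas mentioned in the abstract, the sketch below shows word-by-word lexical translation through a bilingual dictionary and the computation of MRR. It is a minimal sketch and not the WebShodh implementation: the function names, the toy dictionary entries, and the assumption that romanized tokens match dictionary keys directly (with no transliteration or language identification step) are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def lexical_translate(question: str, bilingual_dict: Dict[str, str]) -> str:
    """Translate word by word; tokens missing from the dictionary
    (for example, embedded English words) pass through unchanged."""
    tokens = question.lower().split()
    return " ".join(bilingual_dict.get(tok, tok) for tok in tokens)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """ranks[i] is the 1-based rank of the first correct answer for
    question i, or None if no correct answer was returned."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical toy dictionary (romanized Hindi -> English), illustration only.
toy_dict = {"bharat": "india", "ki": "of", "kya": "what", "hai": "is"}

print(lexical_translate("Bharat ki capital kya hai", toy_dict))
# -> india of capital what is

print(mean_reciprocal_rank([1, 2, None, 4]))
# -> 0.4375
```

The translated output keeps the source word order, which is why the abstract describes it as a loose translation: it is meant to feed later steps such as answer type prediction and retrieval, not to read as fluent English.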
Notes
- 1.
Mixing of Spanish-English, Hindi-English, Telugu-English, Portuguese-Spanish and French-Japanese language pairs respectively.
- 2.
Hindi is one of the most widely spoken languages in India, with 370 million native speakers, and is an official language along with English. Telugu is the most widely spoken Dravidian language in South India, with about 70 million native speakers.
- 3.
- 4.
- 5.
This video was recorded in real time to demonstrate that the system is fast enough for practical use.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chandu, K.R., Chinnakotla, M., Black, A.W., Shrivastava, M. (2017). WebShodh: A Code Mixed Factoid Question Answering System for Web. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_9
DOI: https://doi.org/10.1007/978-3-319-65813-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65812-4
Online ISBN: 978-3-319-65813-1
eBook Packages: Computer Science, Computer Science (R0)