Bilingual Code-Mixing in Indian Social Media Texts for Hindi and English

  • Rajesh Kumar
  • Pardeep Singh
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 712)


Code mixing (CM) is an important but challenging problem in Natural Language Processing (NLP). Although several techniques for handling code-mixed text exist, relatively little work has been done in this area so far. In this paper we discuss the various approaches used for code mixing and classify existing code-mixing algorithms according to their techniques. Most people do not always write in Unicode, that is, in a single language, when chatting on Facebook, Gmail, Twitter, etc. For readers who do not understand Hindi, it is very difficult to work out the meaning of code-mixed sentences. For correctly spelled Hindi words we use a converter from Hindi words to English words. However, most words are not correct dictionary words, and code-mixed sentences also contain short forms, abbreviations, phonetic typing, etc. We therefore use character n-gram pruning, one of the most popular and successful techniques in NLP, together with dictionary-based approaches for language identification of social media text. This paper proposes a scheme that improves translation by removing phonetic typing, abbreviations, shortcuts, Hindi words and emotions.
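The combination of dictionary lookup and character n-gram pruning for word-level language identification can be illustrated with a minimal sketch. It is not the scheme proposed in the paper: the two toy lexicons, the function names, the `_` padding character and the pruning rule (keep only n-grams seen at least `min_count` times) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy lexicons; a real system would load full dictionaries.
ENGLISH_WORDS = ("the", "this", "that", "there", "movie",
                 "good", "going", "doing", "today", "friend")
HINDI_WORDS = ("hai", "nahi", "bahut", "accha", "mera",
               "tera", "kya", "kyun", "ghar", "dost")  # romanized Hindi


def char_ngrams(word, n_min=2, n_max=3):
    """Return character n-grams of a word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def build_pruned_profile(words, min_count=2):
    """Count n-grams over a lexicon and prune those seen fewer than min_count times."""
    counts = Counter(g for w in words for g in char_ngrams(w))
    return {g for g, c in counts.items() if c >= min_count}


EN_PROFILE = build_pruned_profile(ENGLISH_WORDS)
HI_PROFILE = build_pruned_profile(HINDI_WORDS)


def identify_language(word):
    """Dictionary lookup first; fall back to n-gram profile overlap."""
    w = word.lower()
    if w in ENGLISH_WORDS:
        return "en"
    if w in HINDI_WORDS:
        return "hi"
    grams = char_ngrams(w)
    en_score = sum(g in EN_PROFILE for g in grams)
    hi_score = sum(g in HI_PROFILE for g in grams)
    if en_score == hi_score:
        return "unknown"  # no evidence either way
    return "en" if en_score > hi_score else "hi"


def tag_sentence(sentence):
    """Tag each whitespace-separated token with a language label."""
    return [(tok, identify_language(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    print(tag_sentence("mera friend bahut good hai"))
```

The dictionary handles correctly spelled words, while the pruned n-gram profiles give a fallback for phonetic or creative spellings that no dictionary lists. With lexicons this small the fallback is only indicative; realistic profiles would be built from much larger word lists.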


Code mixing (CM); Natural language processing (NLP); Creative typing (CT); Abbreviation (A); Contracted (C); Code switching (CS); Language identification (LID)



Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  1. Department of Computer Science and Engineering, National Institute of Technology Hamirpur, Hamirpur, India
