A Modular Approach for Social Media Text Normalization

  • Palak Rehan
  • Mukesh Kumar
  • Sarbjeet Singh
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 701)


The normalized data is the backbone of various Natural Language Processing (NLP), Information Retrieval (IR), data mining, and Machine Translation (MT) applications. Thus, we propose an approach to normalize the colloquial and breviate text being posted on the social media like Twitter, Facebook, etc. The proposed approach for text normalization is based upon Levenshtein distance, demetaphone algorithm, and dictionary mappings. The standard dataset named lexnorm 1.2, containing English tweets is used to validate the proposed modular approach. Experimental results are compared with existing unsupervised approaches. It has been found that modular approach outperforms other exploited normalization techniques by achieving 83.6% of precision, recall, and F-scores. Also 91.1% of BLUE scores have been achieved.


  1. 1.
    Toutanova, K., Moore, R.C.: Pronunciation modeling for improved spelling correction. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 02, pp. 144–151, Philadelphia, USA (2002)Google Scholar
  2. 2.
    Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S., Basu, A.: Investigation and modeling of the structure of texting language. Int. J. Doc. Anal. Recogn. 10, 157–174 (2007)CrossRefGoogle Scholar
  3. 3.
    Cook, P., Stevenson, S.: An unsupervised model for text message normalization. In: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pp. 71–78. Association for Computational Linguistics, Boulder, USA, June (2009)Google Scholar
  4. 4.
    Yang, Y., Eisenstein, J.: A log-linear model for unsupervised text normalization. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pp. 61–72, Seattle, USA, Oct 2013Google Scholar
  5. 5.
    Gouws, S., Hovy, D., Metzler, D.: Unsupervised mining of lexical variants from noisy text. In: Proceedings of the First workshop on Unsupervised Learning in NLP, pp. 82–90, Edinburgh, Scotland (2011)Google Scholar
  6. 6.
    Saloot, M.A., Idris, N., Shuib, L., Raj, R.G., Aw, A.: Toward tweets normalization using maximum entropy. In Proceedings of the ACL 2015 Workshop on Noisy User-generated Text, pp. 19–27. Association for Computational Linguistics, Beijing, China, 31 July 2015 (2015)Google Scholar
  7. 7.
    Min, W., Mott, B., Lester, J., Cox, J.: Ncsu_sas_wookhee: a deep contextual long-short term memory model for text normalization. In: proceedings of WNUT, Beijing, China (2015)Google Scholar
  8. 8.
    Modupe, A., Celik, T., Marivate, V., Diale, M.: Semi-supervised probabilistics approach for normalising informal short text messages. In: Conference on Information Communication Technology and Society (ICTAS). IEEE (2017)Google Scholar
  9. 9.
    Han, B., Baldwin, T.: Lexical normalisation of short text messages: makn sens a# twitter. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 368–378. Association for Computational Linguistics, Portland, Oregon, June (2011)Google Scholar

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. 1.Computer Science & Engineering DepartmentUniversity Institute of Engineering and Technology, Panjab UniversityChandigarhIndia

Personalised recommendations