A Modular Approach for Social Media Text Normalization

Rehan, Palak; Kumar, Mukesh; Singh, Sarbjeet

doi:10.1007/978-981-10-7563-6_20

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 701)

Abstract

Normalized data is the backbone of various Natural Language Processing (NLP), Information Retrieval (IR), data mining, and Machine Translation (MT) applications. We therefore propose an approach to normalize the colloquial and abbreviated text posted on social media such as Twitter and Facebook. The proposed approach is based on Levenshtein distance, the demetaphone algorithm, and dictionary mappings. The standard dataset lexnorm 1.2, which contains English tweets, is used to validate the proposed modular approach. Experimental results are compared with existing unsupervised approaches. The modular approach outperforms the other normalization techniques considered, achieving 83.6% precision, recall, and F-score, along with a BLEU score of 91.1%.
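
To make the ingredients named in the abstract more concrete, the following is a minimal sketch of how a dictionary mapping and a Levenshtein-distance fallback could be chained to normalize a token. It is not the authors' implementation: the names `SLANG_MAP`, `VOCABULARY`, `normalize_token`, and the `max_distance` threshold are hypothetical, the word lists are toy data, and the phonetic matching step is omitted entirely.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy resources; a real system would load far larger lists.
SLANG_MAP = {"u": "you", "2moro": "tomorrow", "gr8": "great"}
VOCABULARY = ["tomorrow", "great", "see", "you", "weather", "the"]


def normalize_token(token: str, max_distance: int = 2) -> str:
    """Try the dictionary mapping first, then the closest in-vocabulary word."""
    lower = token.lower()
    if lower in SLANG_MAP:
        return SLANG_MAP[lower]
    if lower in VOCABULARY:
        return lower
    best = min(VOCABULARY, key=lambda w: levenshtein(lower, w))
    # Leave the token untouched when nothing is close enough (assumed threshold).
    return best if levenshtein(lower, best) <= max_distance else token


def normalize(text: str) -> str:
    """Normalize a whitespace-tokenized string token by token."""
    return " ".join(normalize_token(t) for t in text.split())


print(normalize("see u 2moro"))
print(normalize_token("wether"))
```

Checking the dictionary before the edit-distance search keeps known abbreviations such as "u" from being matched to an unrelated short word; whether the chapter orders its modules this way is not stated in the abstract.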

Author information

Authors and Affiliations

Computer Science & Engineering Department, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Palak Rehan, Mukesh Kumar & Sarbjeet Singh

Corresponding author

Correspondence to Palak Rehan.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, PVP Siddhartha Institute of Technology, Vijayawada, Andhra Pradesh, India
Suresh Chandra Satapathy
Departamento de Engenharia Mecânica, Universidade do Porto, Porto, Portugal
Joao Manuel R.S. Tavares
Department of Electronics and Communication Engineering, SRMGPC, Lucknow, Uttar Pradesh, India
Vikrant Bhateja
School of Computer Application, KIIT University, Bhubaneswar, Odisha, India
J. R. Mohanty

About this paper

Cite this paper

Rehan, P., Kumar, M., Singh, S. (2018). A Modular Approach for Social Media Text Normalization. In: Satapathy, S., Tavares, J., Bhateja, V., Mohanty, J. (eds) Information and Decision Sciences. Advances in Intelligent Systems and Computing, vol 701. Springer, Singapore. https://doi.org/10.1007/978-981-10-7563-6_20

DOI: https://doi.org/10.1007/978-981-10-7563-6_20
Published: 14 April 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7562-9
Online ISBN: 978-981-10-7563-6
eBook Packages: Engineering, Engineering (R0)