
A customizable pipeline for social media text normalization

  • Original Article
Social Network Analysis and Mining

Abstract

Social networks continually generate text-based data that encapsulate vast amounts of knowledge. However, the non-standard terms and misspellings in texts originating from social networks pose a crucial challenge for natural language processing and machine learning systems that attempt to mine this knowledge. To address this problem, we propose a sequential, modular, and hybrid pipeline for social media text normalization. In the first phase, text preprocessing techniques and social media-specific vocabularies gathered from publicly available sources are used to transform, with high precision, out-of-vocabulary terms into in-vocabulary terms. A sequential language model, generated using the partially normalized texts from the first phase, is then used to normalize short, high-frequency, ambiguous terms. Next, a supervised learning module normalizes terms based on a manually annotated training corpus. Finally, a tunable, distributed language model-based backoff module at the end of the pipeline enables further customization of the system to specific domains of text. We performed intrinsic evaluations of the system on a publicly available domain-independent dataset from Twitter, and our system obtained an F-score of 0.836, outperforming other benchmark systems for the task. We further performed brief, task-oriented evaluations to illustrate the customizability of the system to domain-specific tasks and the effects of normalization on downstream applications. The modular design enables easy customization of the system to distinct types of domain-specific social media text, in addition to its off-the-shelf application to generic social media text.
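The sequential, modular design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LEXICON`, `VOCABULARY`, `lexicon_module`, `make_backoff_module`, and `normalize` are hypothetical, the word lists are toy data, and the backoff step substitutes plain string similarity (Python's `difflib`) for the distributed language model the paper uses. The language model and supervised modules are omitted.

```python
import difflib
from typing import Callable, Dict, List

Tokens = List[str]
Module = Callable[[Tokens], Tokens]

# Hypothetical toy lexicon; the paper gathers its vocabularies from public sources.
LEXICON: Dict[str, str] = {"u": "you", "2moro": "tomorrow", "gr8": "great"}

# Hypothetical in-vocabulary word list consulted by the backoff stand-in.
VOCABULARY: List[str] = [
    "see", "you", "tomorrow", "great", "definitely", "the", "movie", "was",
]


def lexicon_module(tokens: Tokens) -> Tokens:
    """First-phase stand-in: replace tokens found in the lexicon, keep the rest."""
    return [LEXICON.get(token, token) for token in tokens]


def make_backoff_module(cutoff: float) -> Module:
    """Final-phase stand-in with a tunable threshold.

    Assumption: string similarity replaces the paper's distributed
    language model, purely to keep the sketch self-contained.
    """
    def backoff(tokens: Tokens) -> Tokens:
        normalized = []
        for token in tokens:
            if token in VOCABULARY:
                normalized.append(token)
                continue
            matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
            # Leave the token unchanged when nothing is similar enough.
            normalized.append(matches[0] if matches else token)
        return normalized
    return backoff


def normalize(text: str, modules: List[Module]) -> str:
    """Run each module in order; every module maps a token list to a token list."""
    tokens = text.lower().split()
    for module in modules:
        tokens = module(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    pipeline = [lexicon_module, make_backoff_module(cutoff=0.8)]
    print(normalize("see u 2moro", pipeline))
```

Because every module shares the same token-list interface, a domain-specific module can be added, removed, or reordered without touching the others, which is the kind of customization the abstract attributes to the modular design. Raising `cutoff` makes the backoff stand-in more conservative.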


Notes

  1. Available at: http://aspell.net/ Accessed: September 18, 2016.

  2. https://noisy-text.github.io/norm-shared-task.html Accessed: July 29, 2016.

  3. www.noslang.com Accessed: July 29, 2016.

  4. www.hlt.utdallas.edu/~yangl/data/Text_Norm_Data_Release_Fei_Liu Accessed: July 29, 2016.

  5. http://trec.nist.gov/data/tweets/ Accessed: September 16, 2016.

  6. http://demeter.inf.ed.ac.uk/cross/publications.html Accessed: September 16, 2016.

  7. http://diego.asu.edu/Publications/Drugchatter.html Accessed: September 16, 2016.

  8. The list is available at: http://www.tysto.com/uk-us-spelling-list.html Accessed: September 15, 2016.

  9. This is not a real Tweet. It will be used as an example throughout the rest of the paper.

  10. https://www.tensorflow.org/ Accessed: September 20, 2016.

  11. https://kheafield.com/code/kenlm/ Accessed: August 15, 2017.

  12. We also experimented with the spell-checker Aspell for generating potential mappings at this step, but the approach resulted in a small increase in recall with significant drops in precision.

  13. To perform medical domain-specific normalization, we added vocabulary from http://bio.nlplab.org/. Accessed: August 15, 2017.

  14. The dataset was built from the misspellings available at: http://diego.asu.edu/drugstats/drugstats.php. Accessed: January 26, 2017.

  15. We leave out an additional system that was submitted to the shared task (F-score: 0.7264), but for which no description was available.


Author information


Corresponding author

Correspondence to Abeed Sarker.


About this article


Cite this article

Sarker, A. A customizable pipeline for social media text normalization. Soc. Netw. Anal. Min. 7, 45 (2017). https://doi.org/10.1007/s13278-017-0464-z


  • DOI: https://doi.org/10.1007/s13278-017-0464-z
