Abstract
Social networks continuously generate text-based data that encapsulate vast amounts of knowledge. However, the non-standard terms and misspellings in texts originating from social networks pose a crucial challenge for natural language processing and machine learning systems that attempt to mine this knowledge. To address this problem, we propose a sequential, modular, and hybrid pipeline for social media text normalization. In the first phase, text preprocessing techniques and social media-specific vocabularies gathered from publicly available sources are used to transform, with high precision, out-of-vocabulary terms into in-vocabulary terms. A sequential language model, generated using the partially normalized texts from the first phase, is then used to normalize short, high-frequency, ambiguous terms. Next, a supervised learning module normalizes terms based on a manually annotated training corpus. Finally, a tunable, distributed language model-based backoff module at the end of the pipeline enables further customization of the system to specific domains of text. We performed intrinsic evaluations of the system on a publicly available domain-independent dataset from Twitter, and our system obtained an F-score of 0.836, outperforming other benchmark systems for the task. We further performed brief, task-oriented evaluations to illustrate the customizability of the system to domain-specific tasks and the effects of normalization on downstream applications. The modular design enables easy customization of the system to distinct types of domain-specific social media text, in addition to its off-the-shelf application to generic social media text.
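The sequential, modular design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon, the in-vocabulary set, the candidate mappings, and all function names below are hypothetical, and the language model and supervised modules are omitted. It only shows how independent stages might be chained and how a tunable backoff stage could be switched on or off.

```python
from typing import Callable, Dict, List

Tokens = List[str]
Stage = Callable[[Tokens], Tokens]

# Hypothetical toy resources; the paper gathers its vocabularies from public sources.
SLANG_LEXICON: Dict[str, str] = {"u": "you", "gr8": "great", "2moro": "tomorrow"}
IN_VOCABULARY = {"you", "great", "tomorrow", "see", "are", "so"}


def lexicon_stage(tokens: Tokens) -> Tokens:
    """High-precision stage: replace tokens found in a social media lexicon."""
    return [SLANG_LEXICON.get(tok.lower(), tok) for tok in tokens]


def make_backoff_stage(candidates: Dict[str, str], enabled: bool = True) -> Stage:
    """Build a final backoff stage that only touches out-of-vocabulary tokens.

    `candidates` stands in for mappings that a domain-specific model might
    propose (an assumption for illustration); `enabled` makes the stage tunable.
    """
    def stage(tokens: Tokens) -> Tokens:
        if not enabled:
            return list(tokens)
        return [
            tok if tok.lower() in IN_VOCABULARY else candidates.get(tok.lower(), tok)
            for tok in tokens
        ]
    return stage


def run_pipeline(tokens: Tokens, stages: List[Stage]) -> Tokens:
    """Apply each stage in order; later stages see earlier stages' output."""
    for stage in stages:
        tokens = stage(tokens)
    return tokens


if __name__ == "__main__":
    stages = [lexicon_stage, make_backoff_stage({"sooo": "so"})]
    print(run_pipeline("u are sooo gr8".split(), stages))
```

Because each stage has the same signature, a stage can be added, removed, or replaced with a domain-specific variant without changing the rest of the pipeline.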
Notes
Available at: http://aspell.net/ Accessed: September 18, 2016.
https://noisy-text.github.io/norm-shared-task.html Accessed: July 29, 2016.
www.noslang.com Accessed: July 29, 2016.
www.hlt.utdallas.edu/~yangl/data/Text_Norm_Data_Release_Fei_Liu Accessed: July 29, 2016.
http://trec.nist.gov/data/tweets/ Accessed: September 16, 2016.
http://demeter.inf.ed.ac.uk/cross/publications.html Accessed: September 16, 2016.
http://diego.asu.edu/Publications/Drugchatter.html Accessed: September 16, 2016.
The list is available at: http://www.tysto.com/uk-us-spelling-list.html Accessed: September 15, 2016.
This is not a real Tweet. It will be used as an example throughout the rest of the paper.
https://www.tensorflow.org/ Accessed: September 20, 2016.
https://kheafield.com/code/kenlm/ Accessed: August 15, 2017.
We also experimented with the spell-checker Aspell for generating potential mappings at this step, but the approach resulted in a small increase in recall with significant drops in precision.
To perform medical domain-specific normalization, we added vocabulary from http://bio.nlplab.org/. Accessed: August 15, 2017.
The dataset was built from the misspellings available at: http://diego.asu.edu/drugstats/drugstats.php. Accessed: January 26, 2017.
We leave out an additional system that was submitted to the shared task (F-score: 0.7264), but for which no description was available.
Cite this article
Sarker, A. A customizable pipeline for social media text normalization. Soc. Netw. Anal. Min. 7, 45 (2017). https://doi.org/10.1007/s13278-017-0464-z
DOI: https://doi.org/10.1007/s13278-017-0464-z