A customizable pipeline for social media text normalization

Sarker, Abeed

doi:10.1007/s13278-017-0464-z

Original Article
Published: 09 September 2017

Social Network Analysis and Mining, volume 7, article number 45 (2017)

Abeed Sarker (ORCID: orcid.org/0000-0001-7358-544X)

Abstract

Social networks are persistently generating text-based data that encapsulate vast amounts of knowledge. However, the presence of non-standard terms and misspellings in texts originating from social networks poses a crucial challenge for natural language processing and machine learning systems that attempt to mine this knowledge. To address this problem, we propose a sequential, modular, and hybrid pipeline for social media text normalization. In the first phase, text preprocessing techniques and social media-specific vocabularies gathered from publicly available sources are used to transform, with high precision, out-of-vocabulary terms into in-vocabulary terms. A sequential language model, generated using the partially normalized texts from the first phase, is then utilized to normalize short, high-frequency, ambiguous terms. A supervised learning module is employed to normalize terms based on a manually annotated training corpus. Finally, a tunable, distributed language model-based backoff module at the end of the pipeline enables further customization of the system to specific domains of text. We performed intrinsic evaluations of the system on a publicly available domain-independent dataset from Twitter, and our system obtained an F-score of 0.836, outperforming other benchmark systems for the task. We further performed brief, task-oriented evaluations to illustrate the customizability of the system to domain-specific tasks and the effects of normalization on downstream applications. The modular design enables easy customization of the system to distinct types of domain-specific social media text, in addition to its off-the-shelf application to generic social media text.
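The sequential, modular design described in the abstract can be illustrated with a minimal sketch. This is not the system evaluated in the article: only two of the stages are shown, the lexicon and vocabulary are hypothetical toy data, and the backoff stage substitutes simple string similarity (Python's `difflib.SequenceMatcher`) for the distributed language model used in the paper. All function names and the threshold value are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Set

# A stage maps a token list to a (possibly rewritten) token list.
Stage = Callable[[List[str]], List[str]]

# Hypothetical toy resources; the paper gathers its vocabularies from public sources.
TOY_LEXICON: Dict[str, str] = {"u": "you", "gr8": "great", "pls": "please"}
TOY_VOCABULARY: Set[str] = {"see", "you", "tomorrow", "great", "please", "call", "me"}


def make_lexicon_stage(lexicon: Dict[str, str]) -> Stage:
    """First-phase style stage: high-precision dictionary replacement."""
    def stage(tokens: List[str]) -> List[str]:
        return [lexicon.get(tok, tok) for tok in tokens]
    return stage


def make_backoff_stage(vocabulary: Set[str], threshold: float) -> Stage:
    """Backoff stage with a tunable threshold.

    Assumption: string similarity stands in for the distributed
    language model similarity described in the abstract.
    """
    def stage(tokens: List[str]) -> List[str]:
        out = []
        for tok in tokens:
            if tok in vocabulary:
                out.append(tok)
                continue
            # Pick the most similar in-vocabulary word (ties broken alphabetically).
            score, best = max(
                (SequenceMatcher(None, tok, word).ratio(), word) for word in vocabulary
            )
            # Replace only when the match clears the threshold; otherwise keep the token.
            out.append(best if score >= threshold else tok)
        return out
    return stage


def normalize(text: str, stages: List[Stage]) -> str:
    """Run the stages in order, each seeing the previous stage's output."""
    tokens = text.lower().split()
    for stage in stages:
        tokens = stage(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    pipeline = [make_lexicon_stage(TOY_LEXICON), make_backoff_stage(TOY_VOCABULARY, 0.8)]
    print(normalize("pls call me tomorow", pipeline))  # please call me tomorrow
```

Because each stage shares one signature, a domain-specific variant could be assembled by swapping the vocabulary or raising the threshold without touching the other stages, which is the kind of customization the modular design is meant to allow.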

Notes

Available at: http://aspell.net/ Accessed: September 18, 2016.
https://noisy-text.github.io/norm-shared-task.html Accessed: July 29, 2016.
www.noslang.com Accessed: July 29, 2016.
www.hlt.utdallas.edu/~yangl/data/Text_Norm_Data_Release_Fei_Liu Accessed: July 29, 2016.
http://trec.nist.gov/data/tweets/ Accessed: September 16, 2016.
http://demeter.inf.ed.ac.uk/cross/publications.html Accessed: September 16, 2016.
http://diego.asu.edu/Publications/Drugchatter.html Accessed: September 16, 2016.
The list is available at: http://www.tysto.com/uk-us-spelling-list.html Accessed: September 15, 2016.
This is not a real Tweet. It will be used as an example throughout the rest of the paper.
https://www.tensorflow.org/ Accessed: September 20, 2016.
https://kheafield.com/code/kenlm/ Accessed: August 15, 2017.
We also experimented with the spell-checker Aspell for generating potential mappings at this step, but the approach resulted in a small increase in recall with significant drops in precision.
To perform medical domain-specific normalization, we added vocabulary from http://bio.nlplab.org/. Accessed: August 15, 2017.
The dataset was built from the misspellings available at: http://diego.asu.edu/drugstats/drugstats.php. Accessed: January 26, 2017.
We leave out an additional system that was submitted to the shared task (F-score: 0.7264), but for which no description was available.
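
The precision, recall, and F-score figures mentioned in these notes can be made concrete with a small sketch. The scoring definition below is an assumption (token-level, one-to-one aligned, counting a normalization as correct only when a changed token matches the gold form); the article's exact scoring procedure is not given here, and the example tokens are invented.

```python
from typing import List, Tuple


def normalization_scores(
    source: List[str], predicted: List[str], gold: List[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for aligned token sequences."""
    if not (len(source) == len(predicted) == len(gold)):
        raise ValueError("token sequences must be aligned one-to-one")

    # Tokens the system changed, tokens that needed changing, and correct changes.
    proposed = sum(s != p for s, p in zip(source, predicted))
    needed = sum(s != g for s, g in zip(source, gold))
    correct = sum(s != p and p == g for s, p, g in zip(source, predicted, gold))

    precision = correct / proposed if proposed else 0.0
    recall = correct / needed if needed else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    source = ["u", "r", "gr8", "lol", "ok"]
    gold = ["you", "are", "great", "lol", "ok"]

    # A conservative system misses one term but makes no wrong changes.
    conservative = ["you", "r", "great", "lol", "ok"]
    # An aggressive system fixes everything but also rewrites correct tokens.
    aggressive = ["you", "are", "great", "laugh", "okay"]

    print(normalization_scores(source, conservative, gold))
    print(normalization_scores(source, aggressive, gold))
```

In this toy case the aggressive system reaches full recall but lower precision and a lower F-score than the conservative one, which mirrors the trade-off reported for the spell-checker experiment in the note above.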

Author information

Authors and Affiliations

Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Abeed Sarker

Corresponding author

Correspondence to Abeed Sarker.

About this article

Cite this article

Sarker, A. A customizable pipeline for social media text normalization. Soc. Netw. Anal. Min. 7, 45 (2017). https://doi.org/10.1007/s13278-017-0464-z

Received: 25 September 2016
Revised: 31 August 2017
Accepted: 01 September 2017
Published: 09 September 2017
DOI: https://doi.org/10.1007/s13278-017-0464-z
