Abstract
Human civilizations have practiced the art of writing across continents and throughout history. To speed up the writing process, the art of shorthand (brachygraphy) came into existence. Writing on today’s social media platforms is no exception. Brachygraphy re-emerged in the early 2000s in the form of microtext, which facilitates faster typing without compromising semantic clarity. This paper focuses on microtext approaches predominantly found in social media and explains the relevance of microtext normalization for natural language processing tasks in English. The review introduces brachygraphy and traces how it has evolved into microtext in today’s social media-dominant society. We propose a comprehensive classification of microtext normalization techniques into syntax-based (syntactic), probability-based (probabilistic) and phonetic-based approaches, and we review the application areas, strategies and challenges of microtext normalization. The review shows that there is a compelling similarity between brachygraphy and microtext even though they emerged centuries apart. This paper represents the first attempt to connect brachygraphy to current texting language and to show its impact in social media. It also discusses how, in the future, microtext will likely comprise both words and images together, which will expand the horizon of human creative power. We conclude the review with some considerations on future directions.
Notes
It includes English, before and after the advent of print.
www.en.wikipedia.org/wiki/List_of_Latin_abbreviations (accessed on 15 July 2019)
http://americanhistory.si.edu/collections/search/object/nmah_849951 (accessed on 15 July 2019)
The Project Gutenberg website http://www.gutenberg.org/
The US Conference of Catholic Bishops website: http://www.usccb.org
The Project Gutenberg website: http://www.gutenberg.org/
A Chinese version of Twitter at www.weibo.com
Available at www.comp.nus.edu.sg/~nlp/corpora.html
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Informed Consent
Informed consent was not required as no humans or animals were involved.
Human and Animal Rights
This article does not contain any studies with human or animal subjects performed by any of the authors.
About this article
Cite this article
Satapathy, R., Cambria, E., Nanetti, A. et al. A Review of Shorthand Systems: From Brachygraphy to Microtext and Beyond. Cogn Comput 12, 778–792 (2020). https://doi.org/10.1007/s12559-020-09723-7
DOI: https://doi.org/10.1007/s12559-020-09723-7