Investigation and modeling of the structure of texting language

Abstract

Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language and is referred to as texting language (TL). The presence of intentional misspellings significantly decreases the accuracy of existing spell-checking techniques on TL words. In this work, we formally investigate the nature and types of compression used in SMS texts, and develop a Hidden Markov Model based word model for TL. The model parameters have been estimated through standard machine learning techniques from a word-aligned parallel corpus of SMS and standard English. The accuracy of the model in correcting TL words is 57.7%, which is almost a threefold improvement over the performance of Aspell. The use of a simple bigram language model results in a 35% reduction of the relative word-level error rate.
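The idea of pairing a word model with a bigram language model can be illustrated with a minimal sketch. The Python code below is not the authors' HMM, which models how a single word is compressed. It only shows, under assumed toy numbers, how a per-word score P(TL token | word) might be combined with bigram probabilities P(word | previous word) and decoded with the Viterbi algorithm. The names `CHANNEL`, `BIGRAM`, `FLOOR` and `decode`, and every probability value, are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical word-model scores: CHANNEL[token][word] ~ P(token | word).
# All values are invented for illustration, not estimated from any corpus.
CHANNEL = {
    "c": {"see": 0.4, "sea": 0.4},
    "u": {"you": 0.6},
    "2": {"to": 0.3, "two": 0.3, "too": 0.3},
    "nite": {"night": 0.5},
}

# Hypothetical bigram language model: BIGRAM[(prev, word)] ~ P(word | prev).
BIGRAM = {
    ("<s>", "see"): 0.05, ("<s>", "sea"): 0.01,
    ("see", "you"): 0.30, ("sea", "you"): 0.001,
    ("you", "to"): 0.02, ("you", "two"): 0.005, ("you", "too"): 0.04,
    ("to", "night"): 0.01, ("two", "night"): 0.001, ("too", "night"): 0.0005,
}

# Assumed floor probability for unseen bigrams (a crude stand-in for smoothing).
FLOOR = 1e-6


def decode(tokens):
    """Return the most probable standard-word sequence for a list of TL tokens."""
    # best[word] = (log score of the best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for tok in tokens:
        # Tokens without a word-model entry are passed through unchanged.
        candidates = CHANNEL.get(tok, {tok: 1.0})
        nxt = {}
        for word, p_channel in candidates.items():
            scored = [
                (
                    score
                    + math.log(BIGRAM.get((prev, word), FLOOR))
                    + math.log(p_channel),
                    path + [word],
                )
                for prev, (score, path) in best.items()
            ]
            nxt[word] = max(scored, key=lambda s: s[0])
        best = nxt
    return max(best.values(), key=lambda s: s[0])[1]


print(decode(["c", "u", "2", "nite"]))
```

With these toy numbers, the bigram context matters: looking only at "u 2", the candidate "too" scores higher than "to", but the following token "nite" makes the path through "to" the best one overall. This is the kind of contextual disambiguation a language model adds on top of a word-level model.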

Author information

Corresponding author

Correspondence to Monojit Choudhury.

About this article

Cite this article

Choudhury, M., Saraf, R., Jain, V. et al. Investigation and modeling of the structure of texting language. IJDAR 10, 157–174 (2007). https://doi.org/10.1007/s10032-007-0054-0

Keywords

  • Texting language
  • SMS
  • Hidden Markov Model
  • Text correction
  • Spell checking