Language Resources and Evaluation, Volume 47, Issue 2, pp 299–335

Creating a live, public short message service corpus: the NUS SMS corpus

  • Tao Chen
  • Min-Yen Kan
Original Paper


Short Message Service (SMS) messages are short messages sent from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data have not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors’ concerns. Our live project collects new SMS message submissions, checks their quality, and adds valid messages. We release the resultant corpus as XML and as SQL dumps, along with monthly corpus statistics. We opportunistically collect as much metadata about the messages and their senders as possible, so as to enable different types of analyses. To date, we have collected more than 71,000 messages, focusing on English and Mandarin Chinese.


SMS corpus, Corpus creation, English, Chinese, Crowdsourcing, Mechanical Turk, Zhubajie



We would like to thank the many colleagues who made valuable suggestions on the SMS collection, including Jesse Prabawa Gozali, Ziheng Lin, Jun-Ping Ng, Kazunari Sugiyama, Yee Fan Tan, Aobo Wang and Jin Zhao. The authors gratefully acknowledge the China-Singapore Institute of Digital Media’s support of this work through the “Co-training NLP systems and Language Learners” grant R 252-002-372-490.



Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. School of Computing, National University of Singapore, Singapore, Singapore
