Creating a live, public short message service corpus: the NUS SMS corpus

Abstract

Short Message Service (SMS) messages are short messages sent from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data have not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors’ concerns. Our live project collects new SMS message submissions, checks their quality, and adds valid messages. We release the resultant corpus as XML and as SQL dumps, along with monthly corpus statistics. We opportunistically collect as much metadata about the messages and their senders as possible, so as to enable different types of analyses. To date, we have collected more than 71,000 messages, focusing on English and Mandarin Chinese.

Notes

  1. http://www.itu.int/ITU-D/ict/material/FactsFigures2010.pdf.

  2. http://www.ctia.org/media/press/body.cfm/prid/2021.

  3. http://12321.cn/pdf/sms1102.pdf.

  4. http://latimesblogs.latimes.com/technology/2009/05/invented-text-messaging.html.

  5. http://www.wisitech.com/blog/?p=57.

  6. In terms of publications in IEEE, ACM and ACL between June 2010 and June 2011.

  7. http://www.ayman-naaman.net/2010/04/21/how-many-characters-do-you-tweet.

  8. In contrast, our 2004 corpus was collected locally within the University in Singapore, and is thus not representative of general worldwide SMS use.

  9. http://www.alpes4science.org.

  10. Available at http://www.demo.inty.net/app6.html. Although the corpus is not directly downloadable as a file, we still consider it public, as all of the messages are displayed on a single web page.

  11. http://www.mediensprache.net/archiv/corpora/sms_os_h.pdf.

  12. http://www0.hku.hk/linguist/research/bodomo/MPC/SMS_glossed.pdf.

  13. http://www.cel.iitkgp.ernet.in/~monojit/sms.html.

  14. http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus.

  15. http://mirror.wikileaks.info/wiki/911.

  16. https://www.mturk.com/mturk/welcome.

  17. http://www.shorttask.com.

  18. http://www.zhubajie.com.

  19. According to the China Witkey Industrial White Paper 2011 conducted by iResearch.

  20. Corpora@uib.no.

  21. http://www.corpus4u.org.

  22. http://www.52nlp.cn.

  23. http://bbs.hiapk.com.

  24. http://bbs.gfan.

  25. Now replaced by Nokia Suite: http://www.comms.ovi.com/m/p/ovi/suite/English.

  26. http://code.google.com/p/android-sms.

  27. http://mail.google.com.

  28. https://play.google.com/store/search?q=pname:edu.nus.sms.collection.

  29. Hence a Gmail account is a prerequisite to this collection method.

  30. However, we do not encourage users to edit messages, since we feel it may destroy their originality.

  31. http://www.sonarproject.nl.

  32. Since we replace sensitive data with pre-defined codes in the anonymization process, the unique token count of the original messages is likely to be higher than what we calculated.

  33. Exchange rates on 21 April 2012, when most payments were made: 1 SGD = 0.8015 USD, 1 CNY = 0.1585 USD.

  34. In fact, these were some of Ipeirotis’ suggestions to ameliorate the problem, so credit is due to him.

  35. http://economictimes.indiatimes.com/tech/internet/idg-backed-chinese-website-zhubajie-to-list-in-us-in-3-years/articleshow/9478731.cms.

  36. As of 18 June 2012.

  37. http://www.smashingmagazine.com/2010/03/15/showcase-of-web-design-in-china-from-imitation-to-innovation-and-user-centered-design.

  38. Via the Corpora List, corpus4u forum (Chinese), the 52nlp blog (Chinese).

  39. hiapk and gfan.

  40. Through additional personal contacts, we obtained a further 1,433 English and 1,996 Chinese SMS messages.

  41. http://www.smspourlascience.be/index.php?page=14.

  42. http://www.smspourlascience.be/index.php?page=16.

References

  1. Bach, C., & Gunnarsson, J. (2010). Extraction of trends in SMS text. Master’s thesis, Lund University.

  2. Back, M. D., Küfner, A. C., & Egloff, B. (2010). The emotional timeline of September 11, 2001. Psychological Science, 21(10), 1417–1419.

  3. Back, M. D., Küfner, A. C. P., & Egloff, B. (2011). Automatic or the people?: Anger on September 11, 2001, and lessons learned for the analysis of large digital data sets. Psychological Science, 22(6), 837–838.

  4. Barasa, S. (2010). Language, mobile phones and internet: A study of SMS texting, email, IM and SNS chats in computer mediated communication (CMC) in Kenya. Ph.D. thesis, Leiden University.

  5. Bodomo, A. B. (2010). The grammar of mobile phone written language, Chap. 7 (pp. 110–198). Hershey: IGI Global.

  6. Callison-Burch, C., & Dredze, M. (2010). Creating speech and language data with Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, CSLDAMT ’10, Stroudsburg, PA, pp. 1–12. Association for Computational Linguistics.

  7. Chilton, L. B., Horton, J. J., Miller, R. C., & Azenkot, S. (2010). Task search in a human computation market. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’10, pp. 1–9. New York, NY: ACM press.

  8. Choudhury, M., Saraf, R., Jain, V., Sarkar, S., & Basu, A. (2007). Investigation and modeling of the structure of texting language. In Proceedings of the IJCAI-workshop on analytics for noisy unstructured text data (pp. 63–70).

  9. Crystal, D. (2008). Txtng: The Gr8 Db8. Oxford: Oxford University Press.

  10. Denby, L. (2010). The language of twitter: Linguistic innovation and character limitation in short messaging. Undergraduate thesis, University of Leeds.

  11. Deumert, A., & Oscar Masinyana, S. (2008). Mobile language choices–the use of English and Isixhosa in text messages (sms): Evidence from a bilingual South African sample. English World-Wide, 29(2), 117–147.

  12. DiPalantino, D., Karagiannis, T., & Vojnovic, M. (2011). Individual and collective user behavior in crowdsourcing services. Technical report, Microsoft Research.

  13. Dürscheid, C., & Stark, E. (2011). SMS4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland, Chap. 5. Oxford: Oxford University Press.

  14. Elizondo, J. (2011). Not 2 Cryptic 2 DCode: Paralinguistic restitution, deletion, and non-standard orthography in text messages. Ph.D. thesis, Swarthmore College.

  15. Elvis, F. W. (2009). The sociolinguistics of mobile phone sms usage in Cameroon and Nigeria. The International Journal of Language Society and Culture, (28), 25–41.

  16. Fairon, C., & Paumier, S. (2006). A translated corpus of 30,000 French SMS. In Proceedings of language resources and evaluation.

  17. Gao, Q., & Vogel, S. (2010). Consensus versus expertise: A case study of word alignment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, CSLDAMT ’10, Stroudsburg, PA, pp. 30–34. Association for Computational Linguistics.

  18. Gibbon, D., & Kul, M. (2008). Economy strategies in restricted communication channels: A study of Polish short text messages. In Proceedings of 5th Internationale Tagung Perspektiven der Jugendspracheforschung.

  19. Grinter, R., & Eldridge, M. (2003). Wan2tlk?: Everyday text messaging. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 441–448. ACM.

  20. Herring, S., & Zelenkauskaite, A. (2009). Symbolic capital in a virtual heterosexual market: Abbreviation and insertion in Italian iTV SMS. Written Communication, 26(1), 27.

  21. How, Y. (2004). Analysis of sms efficiency. Undergraduate thesis, National University of Singapore.

  22. How, Y., & Kan, M. (2005). Optimizing predictive text entry for short message service on mobile phones. In Proceedings of HCII. Lawrence Erlbaum Associates.

  23. Hutchby, I., & Tanna, V. (2008). Aspects of sequential organization in text message exchange. Discourse & Communication, 2(2), 143–164.

  24. Ipeirotis, P. G. (2010a). Demographics of Mechanical Turk. New York University, Working Paper No: CEDER-10-01.

  25. Ipeirotis, P. G. (2010b). Analyzing the Amazon Mechanical Turk marketplace. XRDS, 17, 16–21.

  26. Yang, J., Adamic, L. A., & Ackerman, M. S. (2008). Competing to share expertise: The taskcn knowledge sharing community. In Proceedings of the international AAAI conference on weblogs and social media.

  27. Jonsson, H., Nugues, P., Bach, C., & Gunnarsson, J. (2010). Text mining of personal communication. In 2010 14th international conference on intelligence in next generation networks, pp. 1–5. IEEE.

  28. Ju, Y., & Paek, T. (2009). A voice search approach to replying to SMS messages in automobiles. In Proceedings of Interspeech (pp. 1–4). Citeseer.

  29. Kasesniemi, E.-L., & Rautiainen, P. (2002). Mobile culture of children and teenagers in Finland. In Perpetual contact, New York, NY, pp. 170–192. Cambridge: Cambridge University Press.

  30. Lexander, K. V. (2011). Names U ma puce : Multilingual texting in Senegal. Working paper.

  31. Ling, R. (2005). The sociolinguistics of SMS: An analysis of SMS use by a random sample of Norwegians. Mobile Communications Engineering: Theory and Applications, 26(3), 335–349.

  32. Ling, R., & Baron, N. S. (2007). Text messaging and IM. Journal of Language and Social Psychology, 26(3), 291–298.

  33. Liu, F., Zhang, L., & Gu, J. (2007). The application of knowledge management in the internet—Witkey mode in China. International journal of knowledge and systems sciences, 4(4), 32–41.

  34. Liu, W., & Wang, T. (2010). Index-based online text classification for sms spam filtering. Journal of Computers, 5(6), 844–851.

  35. Mason, W., & Watts, D. J. (2009). Financial incentives and the "performance of crowds". In Proceedings of the ACM SIGKDD workshop on human computation (pp. 77–85). HCOMP ’09, New York, NY. ACM.

  36. Munro, R., & Manning, C. D. (2012). Short message communications: users, topics, and in-language processing. In Proceedings of the 2nd ACM symposium on computing for development, ACM DEV ’12, New York, NY, pp. 4:1–4:10. ACM.

  37. Ogle, T. (2005). Creative uses of information extracted from SMS messages. Undergraduate thesis, The University of Sheffield.

  38. Piao, C., Han, X., & Jing, X. (2009). Research on web2.0-based anti-cheating mechanism for witkey e-commerce. In Second international symposium on electronic commerce and security, 2009. ISECS ’09, Volume 2 (pp. 474–478).

  39. Pietrini, D. (2001). X’6 :-(?": The sms and the triumph of informality and ludic writing. Italienisch 46, 92–101.

  40. Quinn, A. J., & Bederson, B. B. (2009). A taxonomy of distributed human computation. Technical report, University of Maryland, College Park.

  41. Resnik, P., Buzek, O., Hu, C., Kronrod, Y., Quinn, A., & Bederson, B. B. (2010). Improving translation via targeted paraphrasing. In Proceedings of the 2010 conference on empirical methods in natural language processing, EMNLP ’10, Stroudsburg, PA, pp. 127–137. Association for Computational Linguistics.

  42. Rettie, R. (2007). Texters not talkers: Phone call aversion among mobile phone users. PsychNology Journal, 5(1), 33–57.

  43. Ritter, A., Cherry, C., & Dolan, W. B. (2011). Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing, EMNLP ’11, Stroudsburg, PA, pp. 583–593. Association for Computational Linguistics.

  44. Schlobinski, P., Fortmann, N., Groß, O., Hogg, F., Horstmann, F., & Theel, R. (2001). Simsen. eine pilotstudie zu sprachlichen und kommunikativen aspekten in der sms-kommunikation. Networx.

  45. Segerstad, Y. (2002). Use and adaptation of written language to the conditions of computer-mediated communication. Ph.D. thesis, University of Gothenburg.

  46. Shortis, T. (2001). ‘New literacies’ and emerging forms: Text messaging on mobile phones. In International literacy and research network conference on learning.

  47. Sotillo, S. (2010). SMS texting practices and communicative intention, chapter 16 (pp. 252–265). Hershey: IGI Global.

  48. Sun, Y., Wang, N., & Peng, Z. (2011). Working for one penny: Understanding why people would like to participate in online tasks with low payment. Computers in Human Behavior, 27(2), 1033–1041.

  49. Tagg, C. (2009). A corpus linguistics study of SMS text messaging. Ph.D. thesis, University of Birmingham.

  50. Thurlow, C., & Brown, A. (2003). Generation Txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1(1), 30.

  51. Walkowska, J. (2009). International joint conference intelligent information systems (IIS 2009). Recent advances in intelligent information systems, Warsaw, 2009. ISBN: 978-83-60434-59-8.

  52. Wang, A., Chen, T., & Kan, M.-Y. (2012a). Re-tweeting from a linguistic perspective. In Proceedings of the second workshop on language in social media, Montréal, Canada, pp. 46–55. Association for Computational Linguistics. Accessed Mar 2012.

  53. Wang, A., Hoang, C., & Kan, M.-Y. (2012b). Perspectives on crowdsourcing annotations for natural language processing. Language Resources and Evaluation. doi:10.1007/s10579-012-9176-1.

  54. Yang, J., Adamic, L. A., & Ackerman, M. S. (2008). Crowdsourcing and knowledge sharing: Strategic user behavior on taskcn. In Proceedings of the 9th ACM conference on electronic commerce, EC ’08, New York, NY, pp. 246–255. ACM.

  55. Zhang, L., & Zhang, H. (2011). Research of crowdsourcing model based on case study. In 8th international conference on service systems and service management (ICSSSM), 2011, pp. 1–5.

  56. Zhou, K.-l., Lv, Q., Zhang, Y.-h., Pan, J.-s., & Qian, P.-d. (2007). Towards evaluating Chinese character digital input system. Journal of Chinese Information Processing, 21(1), 67–73.

  57. Žic Fuchs, M., & Tudman Vukovic, N. (2008). Communication technologies and their influence on language: Reshuffling tenses in Croatian SMS text messaging. Jezikoslovlje, 2(9.1-2), 109–122.

Acknowledgments

We would like to thank the many colleagues who made valuable suggestions on the SMS collection, including Jesse Prabawa Gozali, Ziheng Lin, Jun-Ping Ng, Kazunari Sugiyama, Yee Fan Tan, Aobo Wang and Jin Zhao. The authors gratefully acknowledge the China-Singapore Institute of Digital Media’s support of this work through the “Co-training NLP systems and Language Learners” grant R 252-002-372-490.

Author information

Corresponding author

Correspondence to Tao Chen.

About this article

Cite this article

Chen, T., Kan, MY. Creating a live, public short message service corpus: the NUS SMS corpus. Lang Resources & Evaluation 47, 299–335 (2013). https://doi.org/10.1007/s10579-012-9197-9

Keywords

  • SMS corpus
  • Corpus creation
  • English
  • Chinese
  • Crowdsourcing
  • Mechanical Turk
  • Zhubajie