Approaches of Anonymisation of an SMS Corpus

  • Namrata Patel
  • Pierre Accorsi
  • Diana Inkpen
  • Cédric Lopez
  • Mathieu Roche
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7816)


This paper presents two anonymisation methods for processing an SMS corpus. The first is an unsupervised approach called Seek&Hide; the implemented system uses several dictionaries and rules to predict whether an SMS needs anonymisation. The second is a supervised approach based on machine learning techniques. We evaluate both approaches and propose a way to use them together: an SMS is checked by a human expert only when the two methods disagree in their predictions. This greatly reduces the cost of anonymising the corpus.
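The combination scheme described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `FIRST_NAMES`, `dictionary_predictor`, `capitalisation_predictor`, `Routing` and `route` are hypothetical and do not come from the paper. The two toy predictors merely stand in for Seek&Hide and for the trained classifier; only the routing rule (automatic label on agreement, human review on disagreement) follows the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# A predictor maps an SMS text to True ("needs anonymisation") or False.
Predictor = Callable[[str], bool]

# Hypothetical toy dictionary; a stand-in for the dictionaries mentioned in the abstract.
FIRST_NAMES = {"marie", "pierre", "diana"}


def dictionary_predictor(sms: str) -> bool:
    """Toy stand-in for a dictionary/rule-based predictor (not Seek&Hide itself)."""
    tokens = (tok.strip(".,;:!?").lower() for tok in sms.split())
    return any(tok in FIRST_NAMES for tok in tokens)


def capitalisation_predictor(sms: str) -> bool:
    """Toy stand-in for a trained classifier: flags any capitalised non-initial word."""
    words = sms.split()
    return any(word[:1].isupper() for word in words[1:])


@dataclass
class Routing:
    # Messages on which both predictors agree, with the agreed label.
    auto_labelled: List[Tuple[str, bool]] = field(default_factory=list)
    # Messages on which the predictors disagree; these go to a human expert.
    needs_review: List[str] = field(default_factory=list)


def route(messages: Iterable[str], first: Predictor, second: Predictor) -> Routing:
    """Accept the label when both predictors agree; otherwise queue for review."""
    result = Routing()
    for sms in messages:
        a, b = first(sms), second(sms)
        if a == b:
            result.auto_labelled.append((sms, a))
        else:
            result.needs_review.append(sms)
    return result


if __name__ == "__main__":
    corpus = [
        "slt marie tu viens ce soir",
        "On se voit demain avec Pierre",
        "ok a demain",
    ]
    routing = route(corpus, dictionary_predictor, capitalisation_predictor)
    print("auto:", routing.auto_labelled)
    print("review:", routing.needs_review)
```

In this sketch only the first message reaches the review queue, because the lowercase name is caught by the dictionary lookup but not by the capitalisation heuristic. The saving in manual effort depends on how often the two real methods agree, which is what the paper's evaluation measures.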


Natural Language Processing, Text Message, Confusion Matrix, Short Message Service, Decision Tree Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Namrata Patel (1)
  • Pierre Accorsi (1)
  • Diana Inkpen (2)
  • Cédric Lopez (3)
  • Mathieu Roche (1)

  1. LIRMM – CNRS, Univ. Montpellier 2, France
  2. Univ. of Ottawa, Canada
  3. Objet Direct – VISEO, France
