Approaches of Anonymisation of an SMS Corpus

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

This paper presents two methods for anonymising an SMS corpus. The first, Seek&Hide, is an unsupervised approach: the implemented system uses several dictionaries and rules to predict whether an SMS needs to be anonymised. The second is a supervised approach based on machine learning techniques. We evaluate both approaches and propose a way to use them together: an SMS is checked by a human expert only when the two methods disagree on their prediction. This greatly reduces the cost of anonymising the corpus.
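
The combination rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `rule_predict` is a hypothetical stand-in for a dictionary-and-rule system such as Seek&Hide, `model_predict` is a hypothetical stand-in for a trained classifier, and the name list and sample messages are invented for illustration. Only the `triage` logic (accept when both methods agree, send to a human when they disagree) reflects what the abstract states.

```python
from typing import Callable, Iterable, List, Tuple

# A predictor returns True when it believes the SMS needs anonymisation.
Predictor = Callable[[str], bool]

# Hypothetical dictionary of first names (assumption, for illustration only).
FIRST_NAMES = {"marie", "paul", "lea"}


def rule_predict(sms: str) -> bool:
    """Stand-in for a dictionary/rule-based system (not the real Seek&Hide)."""
    tokens = (tok.strip(".,!?;:").lower() for tok in sms.split())
    return any(tok in FIRST_NAMES for tok in tokens)


def model_predict(sms: str) -> bool:
    """Stand-in for a supervised classifier: flags a capitalised word
    anywhere except at the start of the message (assumption)."""
    words = sms.split()
    return any(word[:1].isupper() for word in words[1:])


def triage(
    corpus: Iterable[str], first: Predictor, second: Predictor
) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Split a corpus into automatically decided messages and messages
    that need a human expert because the two predictors disagree."""
    decided: List[Tuple[str, bool]] = []
    to_review: List[str] = []
    for sms in corpus:
        a, b = first(sms), second(sms)
        if a == b:
            decided.append((sms, a))   # both methods agree: keep their label
        else:
            to_review.append(sms)      # disagreement: defer to a human expert
    return decided, to_review


# Invented sample messages.
corpus = ["Salut Marie, tu viens ?", "ok a demain", "rdv chez paul ce soir"]
decided, to_review = triage(corpus, rule_predict, model_predict)
print("decided automatically:", decided)
print("needs human review:", to_review)
```

In this sketch the human workload is the size of `to_review` rather than the size of the whole corpus, which is the source of the cost reduction the abstract claims.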

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Patel, N., Accorsi, P., Inkpen, D., Lopez, C., Roche, M. (2013). Approaches of Anonymisation of an SMS Corpus. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_7

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science, Computer Science (R0)
