Approaches of Anonymisation of an SMS Corpus

Patel, Namrata; Accorsi, Pierre; Inkpen, Diana; Lopez, Cédric; Roche, Mathieu

doi:10.1007/978-3-642-37247-6_7

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Included in the following conference series: International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents two methods for anonymising an SMS corpus. The first is an unsupervised approach called Seek&Hide: the implemented system uses several dictionaries and rules to predict whether an SMS needs to be anonymised. The second is a supervised approach based on machine learning techniques. We evaluate both approaches and propose a way to use them together: an SMS is checked by a human expert only when the two methods disagree on their prediction. This greatly reduces the cost of anonymising the corpus.
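The combination described in the abstract can be sketched as a simple triage step. The following Python is a minimal illustration only, not the authors' implementation: the name dictionary, the function names, and the capitalisation heuristic standing in for the trained classifier are all hypothetical and are not taken from Seek&Hide or from the paper.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical toy dictionary; the real system uses several dictionaries and rules.
FIRST_NAMES = {"marie", "pierre", "diana"}


def rule_based_predict(sms: str) -> bool:
    """Dictionary-based stand-in: flag an SMS if any token is a known first name."""
    tokens = (tok.strip(".,!?;:").lower() for tok in sms.split())
    return any(tok in FIRST_NAMES for tok in tokens)


def learned_predict(sms: str) -> bool:
    """Placeholder for a trained classifier (assumption): flag an SMS if a
    token other than the first one starts with an uppercase letter."""
    return any(tok[:1].isupper() for tok in sms.split()[1:])


def triage(
    messages: Iterable[str],
    first: Callable[[str], bool],
    second: Callable[[str], bool],
) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Accept the shared prediction when both methods agree;
    queue the SMS for a human expert when they disagree."""
    agreed: List[Tuple[str, bool]] = []
    needs_review: List[str] = []
    for sms in messages:
        a, b = first(sms), second(sms)
        if a == b:
            agreed.append((sms, a))
        else:
            needs_review.append(sms)
    return agreed, needs_review


corpus = [
    "slt Marie tu viens ce soir",  # both stand-ins flag it
    "ok a demain",                 # neither flags it
    "rdv chez pierre a 8h",        # only the dictionary flags it
]
agreed, needs_review = triage(corpus, rule_based_predict, learned_predict)
print(agreed)
print(needs_review)
```

In this sketch only the third message reaches the human reviewer, which is where the cost saving claimed in the abstract would come from: the expert's workload is limited to the messages on which the two predictors disagree.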

Author information

Authors and Affiliations

LIRMM – CNRS, Univ. Montpellier 2, France
Namrata Patel, Pierre Accorsi & Mathieu Roche
Univ. of Ottawa, Canada
Diana Inkpen
Objet Direct – VISEO, France
Cédric Lopez

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Patel, N., Accorsi, P., Inkpen, D., Lopez, C., Roche, M. (2013). Approaches of Anonymisation of an SMS Corpus. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_7

DOI: https://doi.org/10.1007/978-3-642-37247-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)
