Anonymization of Unstructured Data via Named-Entity Recognition

  • Fadi Hassan
  • Josep Domingo-Ferrer
  • Jordi Soria-Comas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11144)


The anonymization of structured data has been widely studied in recent years. However, anonymizing unstructured data (typically text documents) remains a highly manual task and needs more attention from researchers. The main difficulty when dealing with unstructured data is that no database schema is available that can be used to measure privacy risks. In fact, confidential data and quasi-identifier values may be spread throughout the documents to be anonymized. In this work we propose to use a named-entity recognition tagger based on machine learning. The ultimate aim is to build a system capable of detecting all attributes that have privacy implications (identifiers, quasi-identifiers and sensitive attributes). In particular, we present a proof of concept focused on the detection of confidential attributes. We consider a case study in which confidential values to be detected are disease names in medical diagnoses. Once these confidential attribute values are located, one can use standard statistical disclosure control techniques for structured data to control disclosure risk.
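The pipeline outlined in the abstract (tag tokens, locate confidential values, then protect them) can be illustrated with a minimal sketch. It is not the authors' implementation. The feature set in `token_features`, the BIO tag scheme, the `DISEASE` label and the placeholder-masking step are all assumptions made for illustration, and the tags are supplied by hand here rather than predicted by a trained conditional random field. Masking is also only the simplest stand-in for the statistical disclosure control techniques the abstract refers to.

```python
from typing import Dict, List, Tuple


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Hypothetical per-token features of the kind a CRF tagger could consume."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "BOS": i == 0,                 # beginning of sentence
        "EOS": i == len(tokens) - 1,   # end of sentence
    }
    # Context features from the neighbouring tokens, when they exist.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def bio_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, label) spans."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def mask_spans(tokens: List[str], tags: List[str],
               placeholder: str = "[{label}]") -> List[str]:
    """Replace every tagged span with a single placeholder token."""
    out: List[str] = []
    cursor = 0
    for start, end, label in bio_spans(tags):
        out.extend(tokens[cursor:start])
        out.append(placeholder.format(label=label))
        cursor = end
    out.extend(tokens[cursor:])
    return out


# Hand-labelled example; a trained tagger would produce `tags` instead.
tokens = ["Patient", "has", "diabetes", "mellitus", "."]
tags = ["O", "O", "B-DISEASE", "I-DISEASE", "O"]
masked = mask_spans(tokens, tags)
```

In a real system the feature dictionaries would be fed to a sequence labeller trained on annotated diagnoses, and the located values could then be generalized or otherwise treated with a disclosure control method rather than simply masked.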


Keywords: Anonymization, Unstructured data, Named-entity recognition, Conditional random fields


Acknowledgments and Disclaimer

The following funding sources are gratefully acknowledged: European Commission (projects H2020-644024 “CLARUS” and H2020-700540 “CANVAS”), Government of Catalonia (ICREA Acadèmia Prize to J. Domingo-Ferrer) and Spanish Government (projects TIN2014-57364-C2-1-R “SmartGlacis” and TIN2015-70054-REDC). The views in this paper are the authors’ own and do not necessarily reflect the views of UNESCO or any of the funders.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Fadi Hassan (1)
  • Josep Domingo-Ferrer (1)
  • Jordi Soria-Comas (1)
  1. Department of Computer Science and Mathematics, CYBERCAT-Center for Cybersecurity Research of Catalonia, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Tarragona, Catalonia
