De-identification of Unstructured Clinical Data for Patient Privacy Protection

  • Stephane M. Meystre


The adoption of Electronic Health Record (EHR) systems is growing rapidly in the United States and in Europe. As a result, very large quantities of patient clinical information are becoming available in electronic format, with tremendous potential but also growing concern about breaches of patient confidentiality. Secondary use of clinical information is essential to fulfil the promises of high-quality healthcare, improved healthcare management, and effective clinical research. De-identification of patient information has been proposed as a way both to facilitate secondary use of clinical information and to protect patient confidentiality. Most clinical information found in the EHR is unstructured and represented as narrative text, and manual de-identification of clinical text is tedious and costly. Automated approaches based on Natural Language Processing have been implemented and evaluated, allowing for much faster de-identification than manual approaches. This chapter introduces clinical text de-identification in general, and then focuses on recent efforts and studies at the U.S. Veterans Health Administration. It covers the origins and definition of text de-identification in the United States and Europe, and discusses text anonymization. It also presents methods applied for text de-identification, examples of clinical text de-identification applications, and U.S. Veterans Health Administration clinical text de-identification efforts.
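
As a rough illustration of what automated, pattern-based de-identification of narrative text can look like, the following minimal Python sketch replaces a few identifier types with category tags. It is not the method of any system discussed in the chapter: the pattern set, the tag names, and the function name deidentify are assumptions made for this example, and real systems rely on much larger rule sets, dictionaries, and machine learning models.

```python
import re

# Hypothetical, deliberately tiny pattern set (an assumption for this sketch).
# It covers only US-style dates, phone numbers, and Social Security numbers.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def deidentify(text: str) -> str:
    """Replace every match of each pattern with a bracketed category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = "Seen on 03/14/2015; call 801-555-0123. SSN 123-45-6789."
    print(deidentify(note))
```

Patterns like these cannot catch identifiers without a regular surface form, such as patient and provider names, which is one reason pattern matching is usually only one component of a de-identification system.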


Keywords: Veteran Health Administration; Electronic Health Record; Regular Expression; Clinical Note; Conditional Random Field



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
