De-identification of Unstructured Clinical Data for Patient Privacy Protection

Abstract

The adoption of Electronic Health Record (EHR) systems is growing rapidly in the United States and in Europe. As a result, very large quantities of patient clinical information are becoming available in electronic format, bringing tremendous potential but also growing concern about breaches of patient confidentiality. Secondary use of clinical information is essential to fulfil the promises of high-quality healthcare, improved healthcare management, and effective clinical research. De-identification of patient information has been proposed as a way both to facilitate secondary use of clinical information and to protect patient confidentiality. Most clinical information found in the EHR is unstructured and represented as narrative text, and manual de-identification of clinical text is tedious and costly. Automated approaches based on Natural Language Processing have been implemented and evaluated, and they allow much faster de-identification than manual approaches. This chapter introduces clinical text de-identification in general, and then focuses on recent efforts and studies at the U.S. Veterans Health Administration. It covers the origins and definition of text de-identification in the United States and Europe, and discusses text anonymization. It also presents methods applied for text de-identification, examples of clinical text de-identification applications, and the clinical text de-identification efforts of the U.S. Veterans Health Administration.

Author information

Corresponding author

Correspondence to Stephane M. Meystre.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Meystre, S.M. (2015). De-identification of Unstructured Clinical Data for Patient Privacy Protection. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_26

  • DOI: https://doi.org/10.1007/978-3-319-23633-9_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23632-2

  • Online ISBN: 978-3-319-23633-9

  • eBook Packages: Computer Science (R0)
