De-identification of Unstructured Clinical Data for Patient Privacy Protection

Meystre, Stephane M.

doi:10.1007/978-3-319-23633-9_26

Abstract

The adoption of Electronic Health Record (EHR) systems is growing rapidly in the United States and in Europe. As a result, very large quantities of patient clinical information are becoming available in electronic format, with tremendous potential but also growing concern about breaches of patient confidentiality. Secondary use of clinical information is essential to fulfil the promises of high-quality healthcare, improved healthcare management, and effective clinical research. De-identification of patient information has been proposed as a way to both facilitate secondary use of clinical information and protect patient confidentiality. Most clinical information found in the EHR is unstructured and represented as narrative text, and manual de-identification of clinical text is tedious and costly. Automated approaches based on Natural Language Processing have been implemented and evaluated, allowing for much faster de-identification than manual approaches. This chapter introduces clinical text de-identification in general, and then focuses on recent efforts and studies at the U.S. Veterans Health Administration. It covers the origins and definition of text de-identification in the United States and Europe, and discusses text anonymization. It also presents methods applied for text de-identification, examples of clinical text de-identification applications, and the de-identification efforts of the U.S. Veterans Health Administration.

Author information

Authors and Affiliations

Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
Stephane M. Meystre

Corresponding author

Correspondence to Stephane M. Meystre.

Editor information

Editors and Affiliations

IBM Research - Ireland, Mulhuddart, Dublin, Ireland
Aris Gkoulalas-Divanis
Cardiff University, Cardiff, United Kingdom
Grigorios Loukides

About this chapter

Cite this chapter

Meystre, S.M. (2015). De-identification of Unstructured Clinical Data for Patient Privacy Protection. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_26

DOI: https://doi.org/10.1007/978-3-319-23633-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23632-2
Online ISBN: 978-3-319-23633-9
eBook Packages: Computer Science, Computer Science (R0)
