Abstract
The adoption of Electronic Health Record (EHR) systems is growing rapidly in the United States and in Europe. As a result, very large quantities of patient clinical information are becoming available in electronic format, with tremendous potential but also growing concern about breaches of patient confidentiality. Secondary use of clinical information is essential to fulfil the promises of high-quality healthcare, improved healthcare management, and effective clinical research. De-identification of patient information has been proposed as a way both to facilitate secondary use of clinical information and to protect patient confidentiality. Most clinical information in the EHR is unstructured and represented as narrative text, and manual de-identification of clinical text is tedious and costly. Automated approaches based on Natural Language Processing have been implemented and evaluated, and allow for much faster de-identification than manual approaches. This chapter introduces clinical text de-identification in general and then focuses on recent efforts and studies at the U.S. Veterans Health Administration. It covers the origins and definition of text de-identification in the United States and Europe, a discussion of text anonymization, methods applied for text de-identification, examples of clinical text de-identification applications, and the U.S. Veterans Health Administration's clinical text de-identification efforts.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Meystre, S.M. (2015). De-identification of Unstructured Clinical Data for Patient Privacy Protection. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_26
DOI: https://doi.org/10.1007/978-3-319-23633-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23632-2
Online ISBN: 978-3-319-23633-9
eBook Packages: Computer Science (R0)