Overview of Patient Data Anonymization

Gkoulalas-Divanis, Aris; Loukides, Grigorios

doi:10.1007/978-1-4614-5668-1_2

Part of the book series: SpringerBriefs in Electrical and Computer Engineering

Abstract

This chapter presents an overview of anonymization techniques that can be used to protect different types of patient data. In Section 2.1, we discuss anonymization principles that can prevent identity and sensitive information disclosure in demographic data publishing, as well as algorithms for enforcing these principles. Subsequently, in Section 2.2, we turn our attention to diagnosis code anonymization, which, somewhat surprisingly, has not been considered by the medical informatics community. After motivating the need for anonymizing diagnosis codes, we present several related principles and transformation models that have been proposed by the data management community. We then review anonymization algorithms, detailing how they use these principles and models. Last, in Section 2.3, we examine genomic data, which are also susceptible to privacy attacks.

Notes

1. ICD-9 codes are described in the International Classification of Diseases, Ninth Revision – Clinical Modification, http://www.cdc.gov/nchs/icd/icd9cm.htm
2. The identifier is used only for reference and may be omitted if this is clear from the context.
3. http://hapmap.ncbi.nlm.nih.gov/
4. Minor Allele Frequencies (MAFs) are the frequencies at which the less common allele occurs in a given population.

Author information

Authors and Affiliations

IBM Research - Ireland, Damastown Industrial Estate, Mulhuddart, Ireland
Aris Gkoulalas-Divanis
The Parade, Cardiff University, Cardiff, UK
Grigorios Loukides

About this chapter

Cite this chapter

Gkoulalas-Divanis, A., Loukides, G. (2013). Overview of Patient Data Anonymization. In: Anonymization of Electronic Medical Records to Support Clinical Analysis. SpringerBriefs in Electrical and Computer Engineering. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-5668-1_2

DOI: https://doi.org/10.1007/978-1-4614-5668-1_2
Published: 13 September 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-5667-4
Online ISBN: 978-1-4614-5668-1
eBook Packages: Engineering, Engineering (R0)
