Medical Data Privacy Handbook

pp 179-200

Methods to Mitigate Risk of Composition Attack in Independent Data Publications

  • Jiuyong Li, School of Information Technology and Mathematical Sciences, University of South Australia
  • Sarowar A. Sattar, School of Information Technology and Mathematical Sciences, University of South Australia
  • Muzammil M. Baig, InterSect Alliance International Pty Ltd
  • Jixue Liu, School of Information Technology and Mathematical Sciences, University of South Australia
  • Raymond Heatherly, Department of Biomedical Informatics, Vanderbilt University
  • Qiang Tang, APSIA group, SnT, University of Luxembourg
  • Bradley Malin, Departments of Biomedical Informatics and EE and CS, Vanderbilt University


Data publication is a simple and cost-effective approach for data sharing across organizations. Data anonymization is a central technique in privacy-preserving data publication. Many methods have been proposed to anonymize individual datasets and multiple datasets released by the same data publisher. In real life, a dataset is rarely isolated, and two datasets published by two organizations may contain records of the same individuals. For example, patients might have visited two hospitals for follow-up or specialized treatment of a disease, and their records are independently anonymized and published. Although each published dataset on its own poses a small privacy risk, the intersection of the two datasets may severely compromise the privacy of the individuals. An attack that uses the intersection of datasets published by different organizations is called a composition attack. Some research has studied methods for anonymizing data to prevent composition attacks on independent data releases, where one data publisher has no knowledge of the records held by another data publisher. In this chapter, we discuss two exemplar methods, a randomization-based approach and a generalization-based approach, for mitigating the risk of composition attacks. In the randomization method, noise is added to the original values to make it difficult for an adversary to pinpoint an individual’s record in a published dataset. In the generalization method, the potentially identifying attributes of a group of records are generalized to the same values so that the individuals in the group are indistinguishable. We discuss and experimentally demonstrate the strengths and weaknesses of both types of methods. We also present a mixed data publication framework in which a small proportion of the records are managed and published centrally while the other records are managed and published locally by different organizations, to reduce the risk of composition attacks and improve the overall utility of the data.
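
The intersection idea behind a composition attack can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the two toy releases, the attribute names (age, zip, diseases), and the helper functions are hypothetical. The sketch also assumes that the adversary knows the victim's age and ZIP code and knows that the victim appears in both releases.

```python
# Hypothetical toy data: each publisher releases equivalence groups, i.e.
# generalized quasi-identifiers plus the set of sensitive values in the group.
hospital_a = [
    {"age": (30, 39), "zip": "50**", "diseases": {"flu", "asthma", "diabetes"}},
    {"age": (40, 49), "zip": "50**", "diseases": {"cancer", "flu", "hepatitis"}},
]
hospital_b = [
    {"age": (35, 44), "zip": "506*", "diseases": {"diabetes", "cancer", "gastritis"}},
    {"age": (45, 54), "zip": "506*", "diseases": {"flu", "hiv", "ulcer"}},
]


def matches(group, age, zip_code):
    """True if the victim's quasi-identifiers fall inside the generalized group."""
    low, high = group["age"]
    prefix = group["zip"].rstrip("*")
    return low <= age <= high and zip_code.startswith(prefix)


def candidate_diseases(release, age, zip_code):
    """Union of sensitive values over every group the victim could belong to."""
    candidates = set()
    for group in release:
        if matches(group, age, zip_code):
            candidates |= group["diseases"]
    return candidates


def composition_attack(releases, age, zip_code):
    """Intersect the candidate sets obtained from each independent release."""
    sets = [candidate_diseases(r, age, zip_code) for r in releases]
    return set.intersection(*sets)


# Assumed adversary knowledge: the victim is 37, lives in ZIP 5061,
# and visited both hospitals.
victim_age, victim_zip = 37, "5061"
in_a = candidate_diseases(hospital_a, victim_age, victim_zip)
in_b = candidate_diseases(hospital_b, victim_age, victim_zip)
exposed = composition_attack([hospital_a, hospital_b], victim_age, victim_zip)
```

In this toy setting each release taken alone leaves three candidate diseases for the victim, but the two candidate sets share only one value, so the intersection identifies the victim's disease. Real releases are larger and the overlap is usually less clear-cut; the sketch only shows why independently anonymized datasets can compose badly.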