To Release or Not to Release: Evaluating Information Leaks in Aggregate Human-Genome Data

  • Xiaoyong Zhou
  • Bo Peng
  • Yong Fuga Li
  • Yangyi Chen
  • Haixu Tang
  • XiaoFeng Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6879)


The rapid progress of human genome studies has led to a strong demand for aggregate human DNA data (e.g., allele frequencies and test statistics), whose public dissemination, however, has been impeded by privacy concerns. Prior research shows that it is possible to identify the presence of some participants in a study from such data and, in some cases, even to fully recover their DNA sequences. A critical issue, therefore, is how to evaluate this risk on individual data-sets and determine when they are safe to release. In this paper, we report our research, which makes the first attempt to address this issue. We first identified the space of the aggregate-data-release problem by examining common types of aggregate data and the typical threats they face. Then, we performed an in-depth study of different attack scenarios on different types of data, which sheds light on several fundamental questions in this problem domain. In particular, we found that attacks on aggregate data are difficult in general, as the adversary often does not have enough information and needs to solve NP-complete or NP-hard problems. On the other hand, we acknowledge that the attacks can succeed under some circumstances, particularly when the solution space of the problem is small. Based on this understanding, we propose a risk-scale system and a methodology for determining when to release an aggregate data-set and when not to. We also used real human-genome data to verify our findings.
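The notion of a "solution space" can be made concrete with a toy example. The sketch below is an illustration only, not the paper's methodology: it assumes a data-set of binary haplotype sequences (one row per sequence, one column per SNP) for which only the per-SNP allele counts are released, and counts how many data-sets are consistent with those counts. The function names and the binary encoding are assumptions made for this example.

```python
from itertools import product
from math import comb


def count_consistent_datasets(num_sequences, allele_counts):
    """Closed-form count of binary matrices (rows = sequences, columns = SNPs)
    whose column sums equal the released allele counts.

    Each column can be filled independently, so the total is the product of
    binomial coefficients C(num_sequences, count) over all SNPs.
    """
    total = 1
    for count in allele_counts:
        total *= comb(num_sequences, count)
    return total


def brute_force_count(num_sequences, allele_counts):
    """Enumerate every binary matrix and count those matching the released
    allele counts. Only feasible for tiny inputs; used as a sanity check."""
    num_snps = len(allele_counts)
    target = list(allele_counts)
    matches = 0
    for cells in product((0, 1), repeat=num_sequences * num_snps):
        rows = [cells[i * num_snps:(i + 1) * num_snps]
                for i in range(num_sequences)]
        column_sums = [sum(row[j] for row in rows) for j in range(num_snps)]
        if column_sums == target:
            matches += 1
    return matches


if __name__ == "__main__":
    # Three sequences, two SNPs, released allele counts 1 and 2.
    print(count_consistent_datasets(3, [1, 2]))
    print(brute_force_count(3, [1, 2]))
```

Under these assumptions the count grows quickly with the number of sequences and SNPs, so allele counts alone leave many candidate data-sets. An allele count of 0 or of `num_sequences` contributes a factor of 1, which is a small instance of the case where the solution space collapses and recovery becomes easier.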







Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Xiaoyong Zhou (1)
  • Bo Peng (1)
  • Yong Fuga Li (1)
  • Yangyi Chen (1)
  • Haixu Tang (1)
  • XiaoFeng Wang (1)

  1. Indiana University, Bloomington, USA
