Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases

  • Fei Yu
  • Michal Rybar
  • Caroline Uhler
  • Stephen E. Fienberg
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8744)


Following the publication of an attack on genome-wide association studies (GWAS) data proposed by Homer et al., considerable attention has been given to developing methods for releasing GWAS data in a privacy-preserving way. Here, we develop an end-to-end differentially private method for solving regression problems with convex penalty functions and selecting the penalty parameters by cross-validation. In particular, we focus on penalized logistic regression with elastic-net regularization, a method widely used in GWAS analyses to identify disease-causing genes. We show how a differentially private procedure for penalized logistic regression with elastic-net regularization can be applied to the analysis of GWAS data and evaluate our method’s performance.
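
For orientation, the two objects named in the abstract can be written out in their standard textbook form. This is a sketch under assumed notation, not the paper's own formulation: the symbols $x_i$, $y_i$, $\beta$, $\lambda$, $\alpha$, $\mathcal{M}$, and $\varepsilon$ are introduced here for illustration, and the elastic-net parameterization shown is one common convention that may differ from the one the authors use.

```latex
% Elastic-net penalized logistic regression (assumed notation):
% n individuals, SNP covariates x_i in R^d, case/control labels y_i in {-1, +1}.
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^d}
    \frac{1}{n} \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-y_i \, x_i^{\top} \beta)\bigr)
    + \lambda \Bigl( \alpha \,\lVert \beta \rVert_1
    + \frac{1 - \alpha}{2} \,\lVert \beta \rVert_2^2 \Bigr),
  \qquad \lambda > 0, \; \alpha \in [0, 1].

% epsilon-differential privacy: a randomized mechanism M satisfies it if,
% for all databases D and D' differing in one individual's record
% and all measurable sets S of outputs,
\Pr\bigl[\mathcal{M}(D) \in S\bigr]
  \le e^{\varepsilon} \, \Pr\bigl[\mathcal{M}(D') \in S\bigr].
```

In this parameterization, $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression, the two special cases listed among the keywords; $\lambda$ and $\alpha$ are the kind of penalty parameters that would be chosen by cross-validation.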


Differential privacy, genome-wide association studies (GWAS), logistic regression, elastic-net, ridge regression, lasso, cross-validation, single nucleotide polymorphism (SNP)




  1. Austin, E., Pan, W., Shen, X.: Penalized regression and risk prediction in genome-wide association studies. Statistical Analysis and Data Mining 6(4) (August 2013)
  2. Cho, S., et al.: Elastic-net regularization approaches for genome-wide association studies of rheumatoid arthritis. BMC Proceedings 3(suppl. 7), S25 (2009)
  3. Homer, N., et al.: Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4(8), e1000167 (2008)
  4. Couzin, J.: Whole-genome data not anonymous, challenging assumptions. Science 321(5894), 1278 (2008)
  5. Zerhouni, E.A., Nabel, E.G.: Protecting aggregate genomic data. Science 322(5898), 44 (2008)
  6. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006)
  7. Uhler, C., Slavkovic, A.B., Fienberg, S.E.: Privacy-preserving data sharing for genome-wide association studies. Journal of Privacy and Confidentiality 5(1), 137–166 (2013)
  8. Johnson, A., Shmatikov, V.: Privacy-preserving data exploration in genome-wide association studies. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1079–1087 (2013)
  9. Yu, F., et al.: Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies. Journal of Biomedical Informatics (February 2014)
  10. Kifer, D., Smith, A., Thakurta, A.: Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research - Proceedings Track 23, 25.1–25.40 (2012)
  11. Chaudhuri, K., Vinterbo, S.A.: A stability-based validation procedure for differentially private machine learning. In: Advances in Neural Information Processing Systems, pp. 1–19 (2013)
  12. Chaudhuri, K., Monteleoni, C., Sarwate, A.D.: Differentially private empirical risk minimization. JMLR 12(7), 1069–1109 (2011)
  13. Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28(5), 1302–1338 (2000)
  14. Wright, F.A., et al.: Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics 23(19), 2581–2588 (2007)
  15. Malaspinas, A.S., Uhler, C.: Detecting epistasis via Markov bases. Journal of Algebraic Statistics 2(1), 36–53 (2010)
  16. Gómez, E., Gómez-Villegas, M.A., Marín, J.M.: A multivariate generalization of the power exponential family of distributions. Communications in Statistics - Theory and Methods 27(3), 589–600 (1998)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Fei Yu (1)
  • Michal Rybar (2)
  • Caroline Uhler (2)
  • Stephen E. Fienberg (1)
  1. Carnegie Mellon University, Pittsburgh, USA
  2. Institute of Science and Technology Austria, Klosterneuburg, Austria
