Privacy-Preserving Gradient Descent for Distributed Genome-Wide Analysis

Part of the Lecture Notes in Computer Science book series (LNSC, volume 12973)


Genome-wide analysis, which provides insight into complex diseases, plays an important role in biomedical data analytics. It usually involves large-scale human genomic data and may therefore disclose sensitive information about individuals. While existing studies have addressed data exfiltration by external malicious actors, this work focuses on the emerging identity tracing attack, which occurs when a dishonest insider attempts to re-identify obtained DNA samples. We propose a framework named \(\upsilon \textsc {Frag}\) to facilitate privacy-preserving data sharing and computation in genome-wide analysis. \(\upsilon \textsc {Frag}\) mitigates privacy risks by using vertical fragmentation to disrupt the genetic architecture on which the adversary relies for re-identification. The fragmentation significantly reduces the overall amount of information the adversary can obtain. Notably, it does not sacrifice the capability of genome-wide analysis: we prove that it preserves the correctness of gradient descent, the most popular optimization approach for training machine learning models. We also evaluate the efficiency of \(\upsilon \textsc {Frag}\) through experiments on a large-scale, real-world dataset. Our experiments demonstrate that \(\upsilon \textsc {Frag}\) outperforms not only secure multiparty computation (MPC) and homomorphic encryption (HE) protocols, with a speedup of more than 221x for training neural networks, but also noise-based differential privacy (DP) solutions and traditional non-private algorithms in most settings.
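
The claim that vertical fragmentation preserves gradient descent can be illustrated with a minimal sketch. It assumes a plain linear least-squares model, which is not the setting or the protocol of the paper, and it applies no privacy protection to the exchanged partial predictions. The function names `centralized_gradient` and `fragmented_gradient` are hypothetical. The point is only that a prediction over all features is the sum of per-fragment partial predictions, so each fragment holder can compute the gradient for its own columns.

```python
from typing import List

Matrix = List[List[float]]


def centralized_gradient(X: Matrix, y: List[float], w: List[float]) -> List[float]:
    """Gradient of the mean squared error (1/2m) * sum (x.w - y)^2 on the full data."""
    m = len(X)
    residual = [sum(xi * wi for xi, wi in zip(row, w)) - yi for row, yi in zip(X, y)]
    return [sum(residual[i] * X[i][j] for i in range(m)) / m for j in range(len(w))]


def fragmented_gradient(
    fragments: List[Matrix], y: List[float], weights: List[List[float]]
) -> List[List[float]]:
    """Same gradient, computed per vertical fragment (assumed linear model)."""
    m = len(y)
    # Each fragment holder computes partial predictions from its own columns only.
    partials = [
        [sum(xi * wi for xi, wi in zip(row, w_k)) for row in X_k]
        for X_k, w_k in zip(fragments, weights)
    ]
    # Summing the partial predictions recovers the full prediction, hence the residual.
    residual = [sum(p[i] for p in partials) - y[i] for i in range(m)]
    # Each holder needs only the residual and its own columns for its gradient slice.
    return [
        [sum(residual[i] * X_k[i][j] for i in range(m)) / m for j in range(len(w_k))]
        for X_k, w_k in zip(fragments, weights)
    ]
```

Concatenating the per-fragment slices gives the centralized gradient up to floating-point rounding. For nonlinear models the same decomposition holds for the first layer only, so this sketch should not be read as the paper's proof.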

  • DOI: 10.1007/978-3-030-88428-4_20
  • Chapter length: 22 pages


  1. There have been around 26 million tests sold in 2019 [24].

  2.

  3. Genetic architecture refers to the underlying genetic basis and its variational properties that are responsible for broad-sense heritability [26].

  4. For readability, a table of notations is included as Appendix A.

  5. Only the performance with \(m = 60,000\) in the LAN setting is reported in [22].


  1. vFrag.

  2. 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526(7571), 68 (2015)

  3. Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. ACM (2016)

  4. Angermueller, C., Pärnamaa, T., Parts, L., Stegle, O.: Deep learning for computational biology. Mol. Syst. Biol. 12(7), 878 (2016)

  5. Bogdanov, D., Kamm, L., Laur, S., Sokk, V.: RMIND: a tool for cryptographically secure statistical analysis. IEEE Trans. Dependable Secure Comput. 15, 481–495 (2016)

  6. Cormode, G., Jha, S., Kulkarni, T., Li, N., Srivastava, D., Wang, T.: Privacy at scale: local differential privacy in practice. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1655–1658 (2018)

  7. Das, S., et al.: Next-generation genotype imputation service and methods. Nat. Genet. 48(10), 1284–1287 (2016)

  8. Erlich, Y., Narayanan, A.: Routes for breaching and protecting genetic privacy. Nat. Rev. Genet. 15(6), 409 (2014)

  9. Erlich, Y., Shor, T., Pe’er, I., Carmi, S.: Identity inference of genomic data using long-range familial searches. Science 362(6415), 690–694 (2018)

  10. Erlingsson, Ú., Pihur, V., Korolova, A.: RAPPOR: randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067. ACM (2014)

  11. Guennebaud, G., Jacob, B., et al.: Eigen v3 (2010).

  12. Hagestedt, I., et al.: MBeacon: privacy-preserving beacons for DNA methylation data. In: NDSS (2019)

  13. Han, S., Ng, W.K., Wan, L., Lee, V.C.: Privacy-preserving gradient-descent methods. IEEE Trans. Knowl. Data Eng. 22(6), 884–899 (2009)

  14. Hartmann, V., West, R.: Privacy-preserving distributed learning with secret gradient descent. arXiv preprint arXiv:1906.11993 (2019)

  15. Hintjens, P.: ZeroMQ: Messaging for Many Applications. O’Reilly Media Inc., Sebastopol (2013)

  16. Hu, Y., Niu, D., Yang, J., Zhou, S.: FDML: a collaborative machine learning framework for distributed features. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2232–2240 (2019)

  17. Jagadeesh, K.A., Wu, D.J., Birgmeier, J.A., Boneh, D., Bejerano, G.: Deriving genomic diagnoses without revealing patient genomes. Science 357(6352), 692–695 (2017)

  18. Jia, J., Salem, A., Backes, M., Zhang, Y., Gong, N.Z.: MemGuard: defending against black-box membership inference attacks via adversarial examples. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 259–274 (2019)

  19. Johnson, A., Shmatikov, V.: Privacy-preserving data exploration in genome-wide association studies. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1079–1087. ACM (2013)

  20. Lian, X., Huang, Y., Li, Y., Liu, J.: Asynchronous parallel stochastic gradient for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2737–2745 (2015)

  21. Marees, A.T., et al.: A tutorial on conducting genome-wide association studies: quality control and statistical analysis. Int. J. Methods Psychiatric Res. 27(2), e1608 (2018)

  22. Mohassel, P., Zhang, Y.: SecureML: a system for scalable privacy-preserving machine learning. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 19–38. IEEE (2017)

  23. Ralph, P., Coop, G.: The geography of recent genetic ancestry across Europe. PLoS Biol. 11(5), e1001555 (2013)

  24. Regalado, A.: MIT technology review.

  25. Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., Backes, M.: ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246 (2018)

  26. Timpson, N.J., Greenwood, C.M., Soranzo, N., Lawson, D.J., Richards, J.B.: Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat. Rev. Genet. 19(2), 110 (2018)

  27. Visscher, P.M., et al.: 10 years of GWAS discovery: biology, function, and translation. Am. J. Human Genet. 101(1), 5–22 (2017)

  28. Wang, K., Zhang, J., Bai, G., Ko, R., Dong, J.S.: It’s not just the site, it’s the contents: intra-domain fingerprinting social media websites through CDN bursts. In: Proceedings of the Web Conference 2021, pp. 2142–2153 (2021)

  29. Wang, S., Pi, A., Zhou, X.: Scalable distributed DL training: batching communication and computation. In: Proceedings of AAAI (2019)

  30. Wang, S., et al.: HEALER: homomorphic computation of exact logistic regression for secure rare disease variants analysis in GWAS. Bioinformatics 32(2), 211–218 (2015)

  31. Wang, Y., Huang, Z., Mitra, S., Dullerud, G.E.: Differential privacy in linear distributed control systems: entropy minimizing mechanisms and performance tradeoffs. IEEE Trans. Control Netw. Syst. 4(1), 118–130 (2017)

  32. Wang, Y.-X., Lei, J., Fienberg, S.E.: On-average KL-privacy and its equivalence to generalization for max-entropy mechanisms. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds.) PSD 2016. LNCS, vol. 9867, pp. 121–134. Springer, Cham (2016).

  33. Xing, E.P., Ho, Q., Xie, P., Wei, D.: Strategies and principles of distributed machine learning on big data. Engineering 2(2), 179–195 (2016)

  34. Yang, J., Lee, S.H., Goddard, M.E., Visscher, P.M.: GCTA: a tool for genome-wide complex trait analysis. Am. J. Human Genet. 88(1), 76–82 (2011)

  35. Yu, F., Fienberg, S.E., Slavković, A.B., Uhler, C.: Scalable privacy-preserving data sharing methodology for genome-wide association studies. J. Biomed. Inform. 50, 133–141 (2014)

  36. Yuan, J., Yu, S.: Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Trans. Parallel Distrib. Syst. 25(1), 212–221 (2014)

  37. Zhang, Y., Bai, G., Li, X., Curtis, C., Chen, C., Ko, R.K.L.: PrivColl: practical privacy-preserving collaborative machine learning. In: Chen, L., Li, N., Liang, K., Schneider, S. (eds.) ESORICS 2020. LNCS, vol. 12308, pp. 399–418. Springer, Cham (2020).

  38. Zhang, Y., Bai, G., Li, X., Nepal, S., Ko, R.K.: Confined gradient descent: Privacy-preserving optimization for federated learning. arXiv preprint arXiv:2104.13050 (2021)

  39. Zhang, Y., Bai, G., Zhong, M., Li, X., Ko, R.: Differentially private collaborative coupling learning for recommender systems. IEEE Intell. Syst. 36, 16–24 (2020)

  40. Zhang, Y., Zhao, X., Li, X., Zhong, M., Curtis, C., Chen, C.: Enabling privacy-preserving sharing of genomic data for GWASs in decentralized networks. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 204–212. ACM (2019)


We thank our shepherd Erman Ayday and the anonymous reviewers for their insightful comments to improve this manuscript. This work is partly supported by the University of Queensland under the UQ Cyber Initiative Strategic Research Seed Funding 4018264-01-299-21-618071.

Author information

Corresponding author

Correspondence to Guangdong Bai.

Appendix A Notation Table

Table 3 summarizes the notations defined in this paper.

Table 3. Notation table

Appendix B Functionalities in Genome-Wide Analysis

In this section, we briefly introduce the functionalities commonly used in the analysis.

Summary Statistics. Summary statistics are used to summarize the observations on the genome-wide data. Commonly used summary statistics include the missingness statistics (\(U_{i, miss }/n\), where \(U_{i, miss }\) is the number of missing SNPs of the \(i^{th}\) sample), the allele frequency (\(c/2m\), where \(c\) is the total allele count for each SNP), and the Hardy-Weinberg equilibrium (\(p^2+2pq+q^2=1\), where \(p^2\) is the frequency of the homozygous dominant genotype, \(2pq\) is the frequency of the heterozygous genotype, and \(q^2\) is the frequency of the homozygous recessive genotype) [21].
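
As a minimal sketch of these three statistics, the following assumes genotypes are coded as 0, 1, or 2 copies of the counted allele, with `None` marking a missing call, and stored as one row per sample. The coding, the function names, and the choice to compute allele frequency over observed samples only are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1, or 2 copies of the counted allele; None = missing


def missingness_per_sample(genotypes: List[List[Genotype]]) -> List[float]:
    """U_miss / n for each sample, where n is the number of SNPs (columns)."""
    return [sum(g is None for g in row) / len(row) for row in genotypes]


def allele_frequency(genotypes: List[List[Genotype]], snp: int) -> float:
    """c / 2m for one SNP, with m taken as the number of non-missing samples."""
    observed = [row[snp] for row in genotypes if row[snp] is not None]
    return sum(observed) / (2 * len(observed))


def hwe_expected_counts(
    genotypes: List[List[Genotype]], snp: int
) -> Tuple[float, float, float]:
    """Expected genotype counts (m*p^2, m*2pq, m*q^2) under Hardy-Weinberg equilibrium."""
    m = sum(row[snp] is not None for row in genotypes)
    p = allele_frequency(genotypes, snp)
    q = 1.0 - p
    return (m * p * p, m * 2 * p * q, m * q * q)
```

Comparing the expected counts with the observed genotype counts (for example with a goodness-of-fit test) is the usual way the equilibrium is checked in quality control.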

Basic Association Analysis. The basic association analysis for GWAS examines each SNP individually. If one variant (i.e., one allele) is more frequent in individuals with a disease, the variant is said to be associated with the disease. Commonly used statistics include the standard \({{\chi }^2}\) test and the Cochran-Armitage test, both of which are performed with respect to each SNP.
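
A per-SNP \({{\chi }^2}\) test can be sketched as follows, assuming the allelic form of the test on a 2x2 table of allele counts (cases versus controls, alternative versus reference allele). The function name and argument layout are hypothetical, and the p-value uses the closed form for one degree of freedom.

```python
import math
from typing import Tuple


def allelic_chi_square(
    case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int
) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value (1 d.f.) for one SNP.

    Assumes every row and column total is non-zero; otherwise an expected
    count is zero and the statistic is undefined.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Running this once per SNP gives the genome-wide scan; in practice the resulting p-values are then corrected for multiple testing.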

Genetic Relationship Matrix (GRM). The GRM was developed to address the missing heritability problem by estimating the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait [34]. The genetic relationship between individuals \(\beta \) and \(\zeta \) can be estimated by \(\frac{1}{n}\sum _{i=1}^n \frac{(x_{\beta i}-2p_i)(x_{\zeta i}-2p_i)}{2p_i(1-p_i)}\), where \(x_{\beta i}\) is the genotype of the \(i^{th}\) SNP of the \(\beta ^{th}\) individual, and \(p_i\) is the frequency of the reference allele.
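
The estimator above translates directly into code. The sketch below assumes complete genotypes coded as 0, 1, or 2 and allele frequencies strictly between 0 and 1 (monomorphic SNPs would divide by zero). The function names `grm_entry` and `grm` are hypothetical.

```python
from typing import List


def grm_entry(x_beta: List[int], x_zeta: List[int], p: List[float]) -> float:
    """Genetic relationship between two individuals over n SNPs.

    x_beta, x_zeta: genotype vectors of the two individuals.
    p: reference allele frequency of each SNP, each with 0 < p_i < 1.
    """
    n = len(p)
    return sum(
        (xb - 2 * pi) * (xz - 2 * pi) / (2 * pi * (1 - pi))
        for xb, xz, pi in zip(x_beta, x_zeta, p)
    ) / n


def grm(genotypes: List[List[int]], p: List[float]) -> List[List[float]]:
    """Full relationship matrix: one entry for every pair of individuals."""
    return [[grm_entry(a, b, p) for b in genotypes] for a in genotypes]
```

Each term of the sum depends on a single SNP, so the sum can be accumulated fragment by fragment when the SNP columns are split across parties.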

Classification Models Such as Neural Networks. Machine/deep learning algorithms, such as various NNs, are commonly used in genome-wide analysis. For example, they can be used to fit the effects of all the SNPs as random effects to estimate the total amount of phenotypic variance [34], or applied in genotype clustering and ethnicity prediction [4].

The first three functionalities are simpler to parallelize than machine learning algorithms, because the statistics with respect to each SNP can be computed directly on the vertically partitioned dataset. Therefore, this work focuses on the last one.
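
To make the parallelization argument concrete, the sketch below splits the SNP columns into vertical fragments and computes allele frequencies on each fragment independently. It assumes complete genotypes coded as 0, 1, or 2, and the helper names are hypothetical. Because each statistic touches only one column, the per-fragment results simply concatenate to the full-dataset result.

```python
from typing import List

Matrix = List[List[int]]  # one row per sample, one column per SNP


def allele_frequencies(genotypes: Matrix) -> List[float]:
    """c / 2m for every SNP column, with m samples and no missing calls assumed."""
    m = len(genotypes)
    n = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2 * m) for j in range(n)]


def vertical_fragments(genotypes: Matrix, boundaries: List[int]) -> List[Matrix]:
    """Split SNP columns at the given boundaries.

    For example, boundaries [0, 2, 5] yield columns 0-1 and columns 2-4.
    """
    return [
        [row[a:b] for row in genotypes]
        for a, b in zip(boundaries, boundaries[1:])
    ]


def frequencies_from_fragments(fragments: List[Matrix]) -> List[float]:
    """Each fragment is processed on its own; the results are concatenated."""
    return [f for fragment in fragments for f in allele_frequencies(fragment)]
```

No communication between fragment holders is needed for such per-SNP statistics, which is why they are the easy case compared with training a model over all SNPs jointly.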

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Zhang, Y., Bai, G., Li, X., Curtis, C., Chen, C., Ko, R.K.L. (2021). Privacy-Preserving Gradient Descent for Distributed Genome-Wide Analysis. In: Bertino, E., Shulman, H., Waidner, M. (eds) Computer Security – ESORICS 2021. ESORICS 2021. Lecture Notes in Computer Science, vol 12973. Springer, Cham.

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88427-7

  • Online ISBN: 978-3-030-88428-4

  • eBook Packages: Computer Science, Computer Science (R0)