
Homomorphic Training of 30,000 Logistic Regression Models

  • Conference paper
  • First published online in Applied Cryptography and Network Security (ACNS 2019)

Part of the book series: Lecture Notes in Computer Science, volume 11464


Abstract

In this work, we demonstrate the use of the CKKS homomorphic encryption scheme to train a large number of logistic regression models simultaneously, as needed to run a genome-wide association study (GWAS) on encrypted data. Our implementation can train more than 30,000 models (each with four features) in about 20 minutes. To that end, we rely on an iterative Nesterov procedure similar to the one used by Kim, Song, Kim, Lee, and Cheon to train a single model [14]. We adapt this method to train many models simultaneously using the SIMD capabilities of the CKKS scheme. We also performed a thorough validation of this iterative method and evaluated its suitability both as a generic method for computing logistic regression models and specifically for GWAS.
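For readers unfamiliar with the underlying (plaintext) training procedure, the Nesterov-accelerated gradient descent used in [14] can be sketched roughly as follows. This is our own illustration, not the authors' code: the function names, the fixed learning rate, and the standard momentum sequence are assumptions, and the homomorphic version additionally replaces the sigmoid by a polynomial approximation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nesterov_logreg(X, y, iters=100, lr=0.5):
    """Nesterov-accelerated gradient descent for logistic regression.

    X: (n, f) design matrix (include a column of ones for the bias term);
    y: (n,) labels in {0, 1}.  Returns the weight vector w.
    """
    n, f = X.shape
    w = np.zeros(f)          # current model
    v = w.copy()             # "lookahead" point
    lam = 0.0                # Nesterov momentum sequence
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ v) - y) / n      # logistic-loss gradient at v
        w_next = v - lr * grad                     # plain gradient step
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam**2)) / 2.0
        # Momentum: extrapolate from the new iterate toward the step direction.
        v = w_next + ((lam - 1.0) / lam_next) * (w_next - w)
        w, lam = w_next, lam_next
    return w
```

In the homomorphic setting this same iteration is run on all 30,000+ models at once, with each model occupying a different slot position of the packed ciphertexts.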

Notes

  1. The last three datasets are much smaller than we would like. Nonetheless, they contain features with strong signal and others with very weak signal, so we can still use them to evaluate the GWAS setting.

  2. The LRT measures how much more likely we are to observe the training data if the true probability distribution of the \(y_i\)’s is what we compute in the model, vs. the probability of observing the same training data under the null hypothesis in which the \(y_i\)’s are independent of the \(\varvec{x}_i\)’s.

  3. This form of initialization differs from the description in [14], but it is consistent with the code shared online by the authors.

  4. Another “hidden” dimension is the slot index \(t=1,\ldots ,N\) in each ciphertext, but since our computation is completely SIMD, we can ignore that dimension.

  5. Our logistic regression procedure uses a power-of-two cyclotomic field for efficiency.
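The likelihood-ratio test described in Note 2 can be sketched in a few lines of plaintext Python, assuming the fitted model is given as a weight vector `w`; the function names are our own illustration.

```python
import numpy as np

def log_likelihood(p, y):
    # Bernoulli log-likelihood of labels y under per-record probabilities p.
    eps = 1e-12  # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def lrt_statistic(X, y, w):
    """Likelihood-ratio test statistic: twice the log-likelihood gap between
    the fitted logistic model and the null model in which the y_i's are
    independent of the x_i's (a constant success probability mean(y))."""
    p_model = 1.0 / (1.0 + np.exp(-(X @ w)))
    p_null = np.full(len(y), y.mean())
    return 2.0 * (log_likelihood(p_model, y) - log_likelihood(p_null, y))
```

Under the null hypothesis this statistic is asymptotically chi-squared distributed, which is what turns it into the per-model p-values used to rank SNPs in the GWAS setting.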

References

  1. Bonte, C., Vercauteren, F.: Privacy-preserving logistic regression training. BMC Med. Genom. 11(Suppl 4), 86 (2018). https://doi.org/10.1186/s12920-018-0398-y

  2. Bontempi, G., Pozzolo, A.D., Caelen, O., Johnson, R.A.: Credit card fraud detection. Technical report, Université Libre de Bruxelles (2015)

  3. Bootwala, A.: Titanic for Binary logistic regression. https://www.kaggle.com/azeembootwala/titanic/home

  4. Brakerski, Z., Gentry, C., Vaikuntanathan, V.: Fully homomorphic encryption without bootstrapping. In: Innovations in Theoretical Computer Science (ITCS 2012) (2012). http://eprint.iacr.org/2011/277

  5. Bubeck, S.: ORF523: Nesterov’s accelerated gradient descent (2013). https://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent. Accessed January 2019

  6. Chen, H., et al.: Logistic regression over encrypted data from fully homomorphic encryption. BMC Med. Genom. 11(Suppl 4), 81 (2018). https://doi.org/10.1186/s12920-018-0397-z

  7. Cheon, J.H., Kim, A., Kim, M., Song, Y.: Homomorphic encryption for arithmetic of approximate numbers. In: Takagi, T., Peyrin, T. (eds.) ASIACRYPT 2017. LNCS, vol. 10624, pp. 409–437. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70694-8_15

  8. Crawford, J.L.H., Gentry, C., Halevi, S., Platt, D., Shoup, V.: Doing real work with FHE: the case of logistic regression. In: Brenner, M., Rohloff, K. (eds.) Proceedings of the 6th Workshop on Encrypted Computing and Applied Homomorphic Cryptography, WAHC@CCS 2018, pp. 1–12. ACM (2018). https://eprint.iacr.org/2018/202

  9. Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st ACM Symposium on Theory of Computing - STOC 2009, pp. 169–178. ACM (2009)

  10. Halevi, S., Shoup, V.: HElib - an implementation of homomorphic encryption. https://github.com/shaih/HElib/. Accessed January 2019

  11. Han, K., Hong, S., Cheon, J.H., Park, D.: Efficient logistic regression on large encrypted data. Cryptology ePrint Archive, Report 2018/662 (2018). https://eprint.iacr.org/2018/662

  12. Integrating Data for Analysis, Anonymization and SHaring (iDASH). https://idash.ucsd.edu/

  13. Kennedy, R.L., Fraser, H.S., McStay, L.N., Harrison, R.F.: Early diagnosis of acute myocardial infarction using clinical and electrocardiographic data at presentation: derivation and evaluation of logistic regression models. Eur. Heart J. 17(8), 1181–1191 (1996). https://github.com/kimandrik/IDASH2017/tree/master/IDASH2017/data/edin.txt

  14. Kim, A., Song, Y., Kim, M., Lee, K., Cheon, J.H.: Logistic regression model training based on the approximate homomorphic encryption. BMC Med. Genom. 11(4), 83 (2018)

  15. Kim, M., Song, Y., Wang, S., Xia, Y., Jiang, X.: Secure logistic regression based on homomorphic encryption: design and evaluation. JMIR Med. Inf. 6(2), e19 (2018). https://doi.org/10.2196/medinform.8805. https://eprint.iacr.org/2018/074

  16. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Springer, New York (2004). https://doi.org/10.1007/978-1-4419-8853-9

  17. Pozzolo, A.D., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166, December 2015

  18. Sikorska, K., Lesaffre, E., Groenen, P.J., Eilers, P.H.: GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. BMC Bioinf. 14, 166 (2013)

  19. Logistic regression. https://en.wikipedia.org/wiki/Logistic_regression#Discussion. Accessed January 2017


Author information

Correspondence to Flavio Bergamaschi.


Appendices

A Corrections in the Literature

During our work we encountered two minor bugs/inconsistencies in the literature. We have notified the relevant authors and document these issues here:

  • The Matlab code used in the iDASH 2017 competition had a bug in the way it computed the recall values, computing them as \(\frac{false~positive + true~positive}{false~negative + true~positive}\) instead of \(\frac{true~positive}{false~negative + true~positive}\).

  • Some of the mean-squared-error (MSE) results reported in [14] seem inconsistent with their accuracy values: for the Edinburgh dataset, they report an accuracy of 86% but an MSE of only 0.00075. We note that 86% accuracy implies an MSE of at least \(0.14 \cdot (0.5)^2=0.035\) (likely a typo).
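The recall bug can be made concrete with a few lines of Python; this is a sketch of the two formulas above, not the Matlab code itself.

```python
def recall(tp, fn):
    # Correct definition: true positives over all actual positives.
    return tp / (tp + fn)

def buggy_recall(tp, fp, fn):
    # The formula the Matlab code effectively computed: (FP + TP) / (FN + TP).
    return (fp + tp) / (fn + tp)
```

For example, with 8 true positives, 4 false positives, and 2 false negatives, the correct recall is 0.8, while the buggy formula yields 1.2; the bug can even push "recall" above 1 whenever there are any false positives.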

B The Datasets that We Used

Recall that we tested the iterative procedure against a few different datasets, to ensure that it is not “tailored” too much to the characteristics of just one type of data. We had some difficulty finding public datasets that we could use for this evaluation; eventually we converged on the following four:

  • The iDASH 2018 dataset, as provided by the organizers of the competition, is meant to correlate various genetic markers with the risk of developing cancer. It consists of 245 records, each with a binary condition (cancer or not), three covariates (age, weight, and height), and 10643 markers (SNPs). The last 120 records were missing the covariates, so we ran our procedure after replacing each missing covariate with the average of that covariate over the other records.

  • A credit card dataset [2] attempts to correlate credit-card fraud with observed characteristics of the transaction. This dataset has 984 records, each with thirty columns.

  • The Edinburgh dataset [13] correlates the condition of Myocardial Infarction (heart attack) in patients who presented to the emergency room in the Edinburgh Royal Infirmary in Scotland with various symptoms and test results (e.g., ST elevation, New Q waves, Hypoperfusion, depression, vomiting, etc.). The same dataset was also used to evaluate the procedure of Kim et al. [14]. The data includes 1253 records, each with nine features.

  • The Titanic dataset [3], consisting of 892 records with sixteen features, correlating passengers’ survival in that disaster with various characteristics such as gender, age, fare, etc.

The first dataset comes with a distinction between SNPs and clinical variables, but the other three have just the condition variable and all the rest. We had to decide which of the features (if any) to use as covariates. We note that whatever feature we designate as a covariate will be present in all the models, so choosing a feature with very high signal will make the predictive power of all the models very similar. We therefore typically chose as covariate the feature least correlated with the condition. We also ran the same test with no covariates, and the results were very similar.
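The covariate-selection heuristic described above (pick the feature least correlated with the condition) can be sketched as follows; the function name and the use of Pearson correlation as the correlation measure are our own illustration.

```python
import numpy as np

def least_correlated_feature(X, y):
    """Return the index of the column of X with the smallest absolute
    Pearson correlation with the binary condition vector y."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return int(np.argmin(corrs))
```

The feature at the returned index would serve as the covariate shared across all models, with every remaining feature getting its own model.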

C Model Evaluation Figures

Fig. 2. Accuracy/recall of the Matlab LR models for the iDASH 2018 dataset, ordered according to the p-value order of the iterative procedure (top) or the semi-parallel algorithm (bottom).


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Bergamaschi, F., Halevi, S., Halevi, T.T., Hunt, H. (2019). Homomorphic Training of 30,000 Logistic Regression Models. In: Deng, R., Gauthier-Umaña, V., Ochoa, M., Yung, M. (eds) Applied Cryptography and Network Security. ACNS 2019. Lecture Notes in Computer Science(), vol 11464. Springer, Cham. https://doi.org/10.1007/978-3-030-21568-2_29


  • DOI: https://doi.org/10.1007/978-3-030-21568-2_29


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21567-5

  • Online ISBN: 978-3-030-21568-2

