Abstract
In this work, we demonstrate the use of the CKKS homomorphic encryption scheme to train a large number of logistic regression models simultaneously, as needed to run a genome-wide association study (GWAS) on encrypted data. Our implementation can train more than 30,000 models (each with four features) in about 20 min. To that end, we rely on an iterative Nesterov procedure similar to the one used by Kim, Song, Kim, Lee, and Cheon to train a single model [14]. We adapt this method to train many models simultaneously using the SIMD capabilities of the CKKS scheme. We also performed a thorough validation of this iterative method and evaluated its suitability both as a generic method for computing logistic regression models and specifically for GWAS.
Notes
- 1.
The last three datasets are much smaller than we would like. Nonetheless, they contain features with strong signal and others with very weak signal, so we can still use them to evaluate the GWAS setting.
- 2.
The LRT measures how much more likely we are to observe the training data if the true probability distribution of the \(y_i\)’s is what we compute in the model, vs. the probability of observing the same training data under the null hypothesis, in which the \(y_i\)’s are independent of the \(\varvec{x}_i\)’s.
- 3.
This form of initialization differs from the description in [14], but it is consistent with the code shared online by the authors.
- 4.
Another “hidden” dimension is the slots \(t=1,\ldots ,N\) in each ciphertext, but since our computation is completely SIMD, we can ignore that dimension.
- 5.
Our logistic regression procedure uses a power of two cyclotomic field for efficiency.
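The likelihood-ratio test described in Note 2 can be sketched in a few lines of Python. This is an illustrative computation under our own naming (not the paper's code): the null model is the constant empirical label mean, since under the null hypothesis the \(y_i\)'s are independent of the \(\varvec{x}_i\)'s.

```python
import numpy as np

def log_likelihood(y, p):
    # log-likelihood of binary labels y under predicted probabilities p
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def lrt_statistic(y, p_model):
    # Under the null hypothesis the y_i's are independent of the x_i's,
    # so the best the null model can do is the constant label mean.
    p_null = np.full_like(p_model, y.mean())
    return 2 * (log_likelihood(y, p_model) - log_likelihood(y, p_null))
```

A model whose predicted probabilities track the labels yields a positive statistic, while a model that predicts the constant label mean yields exactly zero.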
References
Bonte, C., Vercauteren, F.: Privacy-preserving logistic regression training. BMC Med. Genom. 11(Suppl 4), 86 (2018). https://doi.org/10.1186/s12920-018-0398-y
Bontempi, G., Pozzolo, A.D., Caelen, O., Johnson, R.A.: Credit card fraud detection. Technical report, Université Libre de Bruxelles (2015)
Bootwala, A.: Titanic for Binary logistic regression. https://www.kaggle.com/azeembootwala/titanic/home
Brakerski, Z., Gentry, C., Vaikuntanathan, V.: Fully homomorphic encryption without bootstrapping. In: Innovations in Theoretical Computer Science (ITCS 2012) (2012). http://eprint.iacr.org/2011/277
Bubeck, S.: ORF523: Nesterov’s accelerated gradient descent (2013). https://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent. Accessed January 2019
Chen, H., et al.: Logistic regression over encrypted data from fully homomorphic encryption. BMC Med. Genom. 11(Suppl 4), 81 (2018). https://doi.org/10.1186/s12920-018-0397-z
Cheon, J.H., Kim, A., Kim, M., Song, Y.: Homomorphic encryption for arithmetic of approximate numbers. In: Takagi, T., Peyrin, T. (eds.) ASIACRYPT 2017. LNCS, vol. 10624, pp. 409–437. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70694-8_15
Crawford, J.L.H., Gentry, C., Halevi, S., Platt, D., Shoup, V.: Doing real work with FHE: the case of logistic regression. In: Brenner, M., Rohloff, K. (eds.) Proceedings of the 6th Workshop on Encrypted Computing and Applied Homomorphic Cryptography, WAHC@CCS 2018, pp. 1–12. ACM (2018). https://eprint.iacr.org/2018/202
Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st ACM Symposium on Theory of Computing - STOC 2009, pp. 169–178. ACM (2009)
Halevi, S., Shoup, V.: HElib - an implementation of homomorphic encryption. https://github.com/shaih/HElib/. Accessed January 2019
Han, K., Hong, S., Cheon, J.H., Park, D.: Efficient logistic regression on large encrypted data. Cryptology ePrint Archive, Report 2018/662 (2018). https://eprint.iacr.org/2018/662
Integrating Data for Analysis, Anonymization and SHaring (iDASH). https://idash.ucsd.edu/
Kennedy, R.L., Fraser, H.S., McStay, L.N., Harrison, R.F.: Early diagnosis of acute myocardial infarction using clinical and electrocardiographic data at presentation: derivation and evaluation of logistic regression models. Eur. Heart J. 17(8), 1181–1191 (1996). https://github.com/kimandrik/IDASH2017/tree/master/IDASH2017/data/edin.txt
Kim, A., Song, Y., Kim, M., Lee, K., Cheon, J.H.: Logistic regression model training based on the approximate homomorphic encryption. BMC Med. Genom. 11(Suppl 4), 83 (2018)
Kim, M., Song, Y., Wang, S., Xia, Y., Jiang, X.: Secure logistic regression based on homomorphic encryption: design and evaluation. JMIR Med. Inf. 6(2), e19 (2018). https://doi.org/10.2196/medinform.8805. https://eprint.iacr.org/2018/074
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Springer, New York (2004). https://doi.org/10.1007/978-1-4419-8853-9
Pozzolo, A.D., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166, December 2015
Sikorska, K., Lesaffre, E., Groenen, P.J., Eilers, P.H.: GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. BMC Bioinf. 14, 166 (2013)
Logistic regression. https://en.wikipedia.org/wiki/Logistic_regression#Discussion. Accessed January 2017
Appendices
A Corrections in the Literature
During our work we encountered two minor bugs/inconsistencies in the literature. We have notified the relevant authors and document these issues here:
-
The Matlab code used in the iDASH 2017 competition had a bug in the way it computed recall: it computed \(\frac{false~positive + true~positive}{false~negative + true~positive}\) instead of \(\frac{true~positive}{false~negative + true~positive}\).
-
Some of the mean-squared-error (MSE) results reported in [14] seem inconsistent with their accuracy values: for the Edinburgh dataset, they report an accuracy of 86% but an MSE of only 0.00075. We note that 86% accuracy implies an MSE of at least \(0.14 \cdot (0.5)^2=0.035\), since each misclassified example contributes a squared error of at least \((0.5)^2\) (the reported value is likely a typo).
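The correct and buggy recall formulas from the first item above can be compared directly. The function names here are ours for illustration, not from the competition code; note that the buggy formula can exceed 1, which recall never can.

```python
def recall(tp, fp, fn):
    # correct: recall = TP / (TP + FN)
    return tp / (tp + fn)

def buggy_recall(tp, fp, fn):
    # the bug: (FP + TP) / (FN + TP), which can exceed 1
    return (fp + tp) / (fn + tp)
```

For example, with 8 true positives, 4 false positives, and 2 false negatives, the correct recall is 0.8 while the buggy formula reports 1.2.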
B The Datasets that We Used
Recall that we tested the iterative procedure against a few different datasets, to ensure that it is not “tailored” too closely to the characteristics of just one type of data. We had some difficulty finding public datasets that we could use for this evaluation; eventually we converged on the following four:
-
The iDASH 2018 dataset, as provided by the organizers of the competition, is meant to correlate various genetic markers with the risk of developing cancer. It consists of 245 records, each with a binary condition (cancer or not), three covariates (age, weight, and height), and 10643 markers (SNPs). The last 120 records were missing the covariates, so we ran our procedure after replacing each missing covariate with the average of that covariate over the other records.
-
A credit card dataset [2] attempts to correlate credit-card fraud with observed characteristics of the transaction. This dataset has 984 records, each with thirty columns.
-
The Edinburgh dataset [13] correlates the condition of Myocardial Infarction (heart attack) in patients who presented to the emergency room in the Edinburgh Royal Infirmary in Scotland with various symptoms and test results (e.g., ST elevation, New Q waves, Hypoperfusion, depression, vomiting, etc.). The same dataset was also used to evaluate the procedure of Kim et al. [14]. The data includes 1253 records, each with nine features.
-
The Titanic dataset [3], consisting of 892 records with sixteen features, correlates passengers’ survival in that disaster with various characteristics such as gender, age, fare, etc.
The first dataset comes with a distinction between SNPs and clinical variables, but the other three have only the condition variable, with no such distinction among the remaining features, so we had to decide which of the features (if any) to use as covariates. We note that whatever feature we designate as a covariate will be present in all the models, so choosing a feature with a very strong signal would make the predictive power of all the models very similar. We therefore typically chose as the covariate the feature that is least correlated with the condition. We also ran the same test with no covariates, and the results were very similar.
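The two preprocessing choices above, mean-imputing missing covariates and selecting the least-correlated feature as the covariate, can be sketched as follows. This is a minimal numpy illustration under our own function names, not the code used in the paper:

```python
import numpy as np

def impute_column_means(X):
    # replace NaN covariate entries by the column mean of the observed entries
    col_means = np.nanmean(X, axis=0)
    missing = np.isnan(X)
    X = X.copy()
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    return X

def least_correlated_feature(X, y):
    # index of the feature least correlated (in absolute value) with the condition
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return int(np.argmin(corrs))
```

Designating the least-correlated feature as the shared covariate keeps the per-model predictive power varied, as discussed above.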
C Model Evaluation Figures
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Bergamaschi, F., Halevi, S., Halevi, T.T., Hunt, H. (2019). Homomorphic Training of 30,000 Logistic Regression Models. In: Deng, R., Gauthier-Umaña, V., Ochoa, M., Yung, M. (eds) Applied Cryptography and Network Security. ACNS 2019. Lecture Notes in Computer Science(), vol 11464. Springer, Cham. https://doi.org/10.1007/978-3-030-21568-2_29
Print ISBN: 978-3-030-21567-5
Online ISBN: 978-3-030-21568-2