Abstract
In this work, we demonstrate the use of the CKKS homomorphic encryption scheme to train a large number of logistic regression models simultaneously, as needed to run a genome-wide association study (GWAS) on encrypted data. Our implementation can train more than 30,000 models (each with four features) in about 20 min. To that end, we rely on an iterative Nesterov procedure similar to the one used by Kim, Song, Kim, Lee, and Cheon to train a single model [14]. We adapt this method to train many models simultaneously using the SIMD capabilities of the CKKS scheme. We also performed a thorough validation of this iterative method and evaluated its suitability both as a generic method for computing logistic regression models and specifically for GWAS.
Notes
- 1.
The last three datasets are much smaller than we would like. Nonetheless, they contain features with strong signal and others with very weak signal, so we can still use them to evaluate the GWAS setting.
- 2.
The LRT measures how much more likely we are to observe the training data if the true probability distribution of the \(y_i\)’s is what we compute in the model, vs. the probability of observing the same training data under the null hypothesis, in which the \(y_i\)’s are independent of the \(\varvec{x}_i\)’s.
- 3.
This form of initialization differs from the description in [14], but it is consistent with the code shared online by the authors.
- 4.
Another “hidden” dimension is the slots \(t=1,\ldots ,N\) in each ciphertext, but since our computation is completely SIMD, we can ignore that dimension.
- 5.
Our logistic regression procedure uses a power of two cyclotomic field for efficiency.
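The likelihood-ratio test described in Note 2 can be sketched in a few lines of Python. This is an illustrative computation under our own naming (not the paper's code): the null model is the constant empirical label mean, since under the null hypothesis the \(y_i\)'s are independent of the \(\varvec{x}_i\)'s.

```python
import numpy as np

def log_likelihood(y, p):
    # log-likelihood of binary labels y under predicted probabilities p
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def lrt_statistic(y, p_model):
    # Under the null hypothesis the y_i's are independent of the x_i's,
    # so the best the null model can do is the constant label mean.
    p_null = np.full_like(p_model, y.mean())
    return 2 * (log_likelihood(y, p_model) - log_likelihood(y, p_null))
```

A model whose predicted probabilities track the labels yields a positive statistic, while a model that predicts the constant label mean yields exactly zero.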
References
Bonte, C., Vercauteren, F.: Privacy-preserving logistic regression training. BMC Med. Genom. 11(Suppl 4), 86 (2018). https://doi.org/10.1186/s12920-018-0398-y
Bontempi, G., Pozzolo, A.D., Caelen, O., Johnson, R.A.: Credit card fraud detection. Technical report, Université Libre de Bruxelles (2015)
Bootwala, A.: Titanic for Binary logistic regression. https://www.kaggle.com/azeembootwala/titanic/home
Brakerski, Z., Gentry, C., Vaikuntanathan, V.: Fully homomorphic encryption without bootstrapping. In: Innovations in Theoretical Computer Science (ITCS 2012) (2012). http://eprint.iacr.org/2011/277
Bubeck, S.: ORF523: Nesterov’s accelerated gradient descent (2013). https://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent. Accessed January 2019
Chen, H., et al.: Logistic regression over encrypted data from fully homomorphic encryption. BMC Med. Genom. 11(Suppl 4), 81 (2018). https://doi.org/10.1186/s12920-018-0397-z
Cheon, J.H., Kim, A., Kim, M., Song, Y.: Homomorphic encryption for arithmetic of approximate numbers. In: Takagi, T., Peyrin, T. (eds.) ASIACRYPT 2017. LNCS, vol. 10624, pp. 409–437. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70694-8_15
Crawford, J.L.H., Gentry, C., Halevi, S., Platt, D., Shoup, V.: Doing real work with FHE: the case of logistic regression. In: Brenner, M., Rohloff, K. (eds.) Proceedings of the 6th Workshop on Encrypted Computing and Applied Homomorphic Cryptography, WAHC@CCS 2018, pp. 1–12. ACM (2018). https://eprint.iacr.org/2018/202
Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st ACM Symposium on Theory of Computing - STOC 2009, pp. 169–178. ACM (2009)
Halevi, S., Shoup, V.: HElib - an implementation of homomorphic encryption. https://github.com/shaih/HElib/. Accessed January 2019
Han, K., Hong, S., Cheon, J.H., Park, D.: Efficient logistic regression on large encrypted data. Cryptology ePrint Archive, Report 2018/662 (2018). https://eprint.iacr.org/2018/662
Integrating Data for Analysis, Anonymization and SHaring (iDASH). https://idash.ucsd.edu/
Kennedy, R.L., Fraser, H.S., McStay, L.N., Harrison, R.F.: Early diagnosis of acute myocardial infarction using clinical and electrocardiographic data at presentation: derivation and evaluation of logistic regression models. Eur. Heart J. 17(8), 1181–1191 (1996). https://github.com/kimandrik/IDASH2017/tree/master/IDASH2017/data/edin.txt
Kim, A., Song, Y., Kim, M., Lee, K., Cheon, J.H.: Logistic regression model training based on the approximate homomorphic encryption. BMC Med. Genom. 11(Suppl 4), 83 (2018)
Kim, M., Song, Y., Wang, S., Xia, Y., Jiang, X.: Secure logistic regression based on homomorphic encryption: design and evaluation. JMIR Med. Inf. 6(2), e19 (2018). https://doi.org/10.2196/medinform.8805. https://eprint.iacr.org/2018/074
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Springer, New York (2004). https://doi.org/10.1007/978-1-4419-8853-9
Pozzolo, A.D., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166, December 2015
Sikorska, K., Lesaffre, E., Groenen, P.J., Eilers, P.H.: GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. BMC Bioinf. 14, 166 (2013)
Logistic regression. https://en.wikipedia.org/wiki/Logistic_regression#Discussion. Accessed January 2017
Appendices
A Corrections in the Literature
During our work we encountered two minor bugs/inconsistencies in the literature. We have notified the relevant authors and document these issues here:
-
The Matlab code used in the iDASH 2017 competition had a bug in the way it computed recall: it computed \(\frac{false~positive + true~positive}{false~negative + true~positive}\) instead of \(\frac{true~positive}{false~negative + true~positive}\).
-
Some of the mean-squared-error (MSE) results reported in [14] seem inconsistent with their accuracy values: for the Edinburgh dataset, they report an accuracy of 86% but an MSE of only 0.00075. We note that 86% accuracy implies an MSE of at least \(0.14 \cdot (0.5)^2=0.035\), since each misclassified example contributes a squared error of at least \((0.5)^2\) (the reported value is likely a typo).
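The correct and buggy recall formulas from the first item above can be compared directly. The function names here are ours for illustration, not from the competition code; note that the buggy formula can exceed 1, which recall never can.

```python
def recall(tp, fp, fn):
    # correct: recall = TP / (TP + FN)
    return tp / (tp + fn)

def buggy_recall(tp, fp, fn):
    # the bug: (FP + TP) / (FN + TP), which can exceed 1
    return (fp + tp) / (fn + tp)
```

For example, with 8 true positives, 4 false positives, and 2 false negatives, the correct recall is 0.8 while the buggy formula reports 1.2.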
B The Datasets that We Used
Recall that we tested the iterative procedure against a few different datasets, to ensure that it is not “tailored” too closely to the characteristics of just one type of data. We had some difficulty finding public datasets that we could use for this evaluation; eventually we converged on the following four:
-
The iDASH 2018 dataset, as provided by the organizers of the competition, is meant to correlate various genetic markers with the risk of developing cancer. It consists of 245 records, each with a binary condition (cancer or not), three covariates (age, weight, and height), and 10643 markers (SNPs). The last 120 records were missing the covariates, so we ran our procedure after replacing each missing covariate with the average of that covariate over the other records.
-
A credit card dataset [2] attempts to correlate credit-card fraud with observed characteristics of the transaction. This dataset has 984 records, each with thirty columns.
-
The Edinburgh dataset [13] correlates the condition of Myocardial Infarction (heart attack) in patients who presented to the emergency room in the Edinburgh Royal Infirmary in Scotland with various symptoms and test results (e.g., ST elevation, New Q waves, Hypoperfusion, depression, vomiting, etc.). The same dataset was also used to evaluate the procedure of Kim et al. [14]. The data includes 1253 records, each with nine features.
-
The Titanic dataset [3], consisting of 892 records with sixteen features, correlates passengers’ survival in that disaster with various characteristics such as gender, age, fare, etc.
The first dataset comes with a distinction between SNPs and clinical variables, but the other three have only the condition variable, with no such distinction among the remaining features, so we had to decide which of the features (if any) to use as covariates. We note that whatever feature we designate as a covariate will be present in all the models, so choosing a feature with a very strong signal would make the predictive power of all the models very similar. We therefore typically chose as the covariate the feature that is least correlated with the condition. We also ran the same test with no covariates, and the results were very similar.
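The two preprocessing choices above, mean-imputing missing covariates and selecting the least-correlated feature as the covariate, can be sketched as follows. This is a minimal numpy illustration under our own function names, not the code used in the paper:

```python
import numpy as np

def impute_column_means(X):
    # replace NaN covariate entries by the column mean of the observed entries
    col_means = np.nanmean(X, axis=0)
    missing = np.isnan(X)
    X = X.copy()
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    return X

def least_correlated_feature(X, y):
    # index of the feature least correlated (in absolute value) with the condition
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return int(np.argmin(corrs))
```

Designating the least-correlated feature as the shared covariate keeps the per-model predictive power varied, as discussed above.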
C Model Evaluation Figures
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Bergamaschi, F., Halevi, S., Halevi, T.T., Hunt, H. (2019). Homomorphic Training of 30,000 Logistic Regression Models. In: Deng, R., Gauthier-Umaña, V., Ochoa, M., Yung, M. (eds) Applied Cryptography and Network Security. ACNS 2019. Lecture Notes in Computer Science(), vol 11464. Springer, Cham. https://doi.org/10.1007/978-3-030-21568-2_29
Print ISBN: 978-3-030-21567-5
Online ISBN: 978-3-030-21568-2