
1 Introduction

We motivate our study with the following biomarker discovery application. Current cancer treatment based on doctors’ empirical knowledge can be described as “one-size-fits-all”: almost all patients diagnosed with the same cancer receive similar treatment. As a result, some patients are likely to be under-treated while others are over-treated. Worse still, not all patients benefit from the treatment, and some may suffer severe side effects. By contrast, personalized medicine aims to treat patients individually, with different drugs at the right dose [1]. To achieve personalized cancer treatment, we need biomarkers (i.e. a set of genes) to predict a patient’s response to anticancer drugs (e.g. sensitivity or resistance). With the advent of bioinformatics technology, we can apply data mining and statistical methods to discover biomarkers from genomic data. The Cancer Genome Project (CGP) [2] and the Cancer Cell Line Encyclopedia (CCLE) [3] are examples of analyses that discover biomarkers by relating genomic features derived from human tumor samples to drug responses. A typical genomic feature input is a gene expression profile, a vector recording the degrees of activation of different genes; the number of genes can reach tens of thousands. A patient’s response is usually measured by the GI50 value (the log of the drug concentration for 50% growth inhibition) [4]. Given n gene expression profiles (e.g. from n patients) of dimension m (\(n \ll m\)), the task is to perform regression analysis between gene expression profiles and GI50 values. Among existing approaches, elastic net regression was found to be the most accurate predictor [5].

Why encrypted by different keys? Due to the huge volume of medical records and DNA information, there is an increasing trend for medical units to use a third-party cloud system to store the records and to leverage its massive computational power to analyze the data. It is well recognized that genomic information such as DNA is particularly sensitive and must be well protected [6]. The privacy of gene expression was overlooked until Schadt et al. pointed out that gene contents can be inferred from expression profiles alone [7]. Even worse, some expression data is strongly correlated with important personal indexes such as body mass index and insulin levels, so an entire profile could be derived and linked to a specific individual. Therefore, gene expression profiles stored in the cloud should also be encrypted. As medical units need to retrieve expression profiles when implementing personalized treatment, profiles from different medical units are encrypted using different keys, so that the details of one unit’s records are not leaked to other units.

It is important for different medical units to combine their datasets in order to increase n (the number of patients) for accurate prediction. Collaborative data mining on encrypted data is a promising way for medical units to “share” data for more accurate prediction without jeopardizing its privacy. The problem tackled in this paper is to design a privacy-preserving elastic net protocol that discovers biomarkers from GI50 values and gene expression profiles encrypted under different keys. Our goal is to ensure that the cloud learns nothing about the patients’ expression profiles beyond what is revealed by the final result of elastic net regression.

Difficulties of the problem: There is no existing privacy-preserving solution for elastic net or lasso (another popular linear regression model) [8]; most prior work was designed for ordinary least squares (OLS) and ridge regression. The difficulty lies in the fact that, unlike OLS and ridge regression, the state-of-the-art solvers for elastic net (e.g. glmnet [9]) are iterative algorithms that require the information of one training sample in each iteration. It is not clear how to perform these iterations when all data records are encrypted. Other existing solutions (e.g. Least Angle Regression (LARS) [10], computing the Euclidean projection onto a closed convex set [11], and proximal stochastic dual coordinate descent [12]) suffer from a similar problem.

Ideas of our proposed solution: Instead of solving the elastic net problem directly with iterative algorithms, we exploit the fact that elastic net regression can be reduced to a support vector machine (SVM) [13]: a solution identical to that of glmnet [9], up to a tolerance level, can be obtained with an SVM solver. Our main idea is to transform the encrypted training dataset of elastic net into that of SVM, from which we compute the Gram matrix. The Gram matrix is then used as input to a modern SVM solver. Once the SVM solution is obtained, we reconstruct the solution to elastic net. We ensure that the cloud server cannot recover patients’ expression profiles from the Gram matrix, for which we provide a security proof in this paper.

Roughly speaking, there are two ways to achieve privacy-preserving SVM. One is the perturbation-based approach: data sent to the cloud is perturbed by a random transformation [14], which considers only one user (i.e. medical unit). The other is the cryptography-based approach, using tools such as secret sharing [15], Oblivious Transfer (OT) [15, 16] and Fully Homomorphic Encryption (FHE) [15, 17]. The cryptography-based approach provides a higher level of privacy than the perturbation-based approach, but incurs higher computation and communication overhead. Most previous work focused on distributed databases [15, 16, 18,19,20], while we consider a centralized outsourced encrypted database under multiple keys. Liu et al. proposed a secure protocol based on FHE for outsourced encrypted SVM [17], but it requires the users to be online during the whole process. It has been proved that completely non-interactive multiparty computation cannot be achieved in the single-server setting when user-server collusion may exist [21]. Thus, we need at least two non-colluding servers [22] if we want to keep the medical units offline. This two-server model is considered practical in the community (e.g. [22, 23]). For example, we can set up two cloud servers belonging to Amazon Web Services (AWS) and Google Cloud Platform (GCP) respectively; considering the consequences of legal action for breach of contract and loss of reputation, it is reasonable to assume that they will not collude. Following [24], each user can secret-share its data among the two non-colluding servers, which then compute on the shares interactively and send the shares of the result to the users to reconstruct the final output. Although the secret-sharing-based approach is better in terms of computation cost, it incurs higher communication cost [25] and cannot deal with data encrypted under multiple keys.
Moreover, oblivious transfer focuses on the single-key setting, which does not suit the case of multiple keys. Consequently, we focus on the homomorphic encryption based approach in this paper. There indeed exists a multikey FHE primitive that allows computation on data encrypted under multiple keys [26]. However, its efficiency is still far from practical, and it requires interaction among all the medical units during the decryption phase. Peter et al. proposed a scheme that transforms ciphertexts under different keys into ciphertexts under the same key [27], incurring a huge amount of interaction between the servers. To reduce communication overhead, proxy re-encryption [30] can be utilized to transform the ciphertexts [28, 29]. However, the amount of interaction is still heavy, because these schemes use partially homomorphic encryption: if the underlying cryptosystem is additively homomorphic, the two servers must work jointly to compute multiplications, and vice versa. To further reduce the communication overhead, we utilize a framework that enables an additively homomorphic encryption scheme to support one multiplication [31]. We choose the BCP cryptosystem [32] as the underlying additively homomorphic encryption and modify it to support multikey additive homomorphism. In this way, we remove the need to transform the ciphertexts to a common key, which is a must in [27,28,29]. To remove the constraint that medical units be online during the decryption phase, we divide each medical unit’s secret key s into two shares \(s_1\) and \(s_2\) and distribute them to the servers; the final decryption is obtained after two rounds of partial decryption.

To summarize, our contributions include the following:

  1. We construct a homomorphic cryptosystem that supports one multiplication under a single key and multiple additions under both a single key and different keys. Compared with the BCP cryptosystem, our scheme only doubles the encryption time. With a 1024-bit security parameter, an addition takes less than 1 ms while a multiplication takes about 16 ms. The ciphertext size increases linearly from 6138 B to 26 KB as the number of involved users grows from 2 to 100. Overall, the proposed scheme is practical.

  2. We propose the first privacy-preserving protocol that solves elastic net on gene expression profiles encrypted under different encryption keys for cancer biomarker discovery, which encourages cooperation between medical units. Through the reduction from elastic net to SVM, we demonstrate how to train SVM securely based on the Gram matrix; the solution to elastic net is then reconstructed from the solution to SVM. Moreover, our solution allows users (medical units) to stay offline except during the initialization phase.

  3. We evaluate our scheme on a real database for drug sensitivity in cancer cell lines [33]. Moreover, our scheme can also be used to solve lasso, via a similar reduction from lasso to SVM [34].

2 Model Description

In this paper, we propose a collaborative model for privacy-preserving biomarker discovery for anticancer drugs using encrypted expression profiles extracted from the tumor samples of patients. As shown in Fig. 1, the involved parties are patients, medical units, a certified institution, and the cloud.

Fig. 1. System model for genomic biomarker discovery through collaborative data mining.

Fig. 2. Dataset transformation from elastic net to SVM

  1. Patients (P). Cancer patients go to the medical units to receive personalized treatment. We list six patients here, labeled \(\{ P_1,P_2, \cdots , P_6 \}\).

  2. Medical Units (MUs). There are different MUs (e.g. cancer hospitals, tumor research centers) in our model. Each MU extracts tumor samples from its patients, observes the effect of 72 h of anticancer drug treatment on them, and uploads the GI50 values to the cloud. Each MU also sends the tumor samples to the certified institution.

  3. Certified Institution (CI). The CI is responsible for gene expression profiling. It encrypts the gene expression profiles from different MUs with different encryption keys and sends the encrypted profiles to the cloud. Only the MU that holds the correct private key can decrypt its encrypted profiles.

  4. Cloud (C). It consists of two non-colluding servers \(S_1\) and \(S_2\), which are responsible for storage and massive computation.

Threat model: The CI is a trusted party. \(S_1\) and \(S_2\) are both honest-but-curious and non-colluding. Collusion might exist between an MU and \(S_1\); however, no medical unit will collude with \(S_2\). We consider two types of potential attacks: (i) an attacker at one MU tries to learn the expression profiles of other MUs; (ii) an attacker at \(S_1\) or \(S_2\) in the cloud aims at recovering gene expression profiles by observing the input, intermediate results, or final results.

3 Preliminaries

3.1 Elastic Net Regression

Let the input dataset be \(\{(x_i,y_i)\}_{i=1}^n\), where each \(x_i\in R^m\) is a column vector representing a gene expression profile and \(y_i \in R\) is the GI50 value. Let \(X \in R^{n \times m}\) be the matrix containing all gene expression profiles (the i-th row of X is \(x_i^T\)) and let the column vector \(y \in R^n\) (whose i-th element is \(y_i\)) be the responses. The goal of linear regression analysis is to find a column vector \(\beta \in R^m\) such that \(y_i\) can be approximated by \(\tilde{y}_i=\beta ^T x_i\). Ordinary least squares (OLS) regression minimizes the residual sum of squares \(\min _{\beta }||X \beta -y||_2^2\). In some situations OLS is not a good solution, for example when m is large or the columns of X are highly correlated. One way to handle this is to introduce a penalization term: ridge regression uses the \(l_2\)-norm penalty (\(||\beta ||_2^2 \)), while lasso regression uses the \(l_1\)-norm penalty (\(|\beta |_1\)). Ridge regression cannot produce a sparse model; by contrast, owing to the nature of the \(l_1\) penalty, lasso can. Nevertheless, lasso has limitations: it selects at most n variables in the \(n \ll m\) case, and from a group of correlated variables it picks out only one, regardless of which (a robustness issue, since we want to identify all related variables). In our application, since \(n \ll m\) and genes may be highly correlated, lasso is not ideal, and the elastic net penalty (\(\lambda _1|\beta |_1+\lambda _2||\beta ||_2^2\)), a convex combination of the lasso and ridge penalties [10], is introduced. It performs well with \(n \ll m\) and correlated variables. Elastic net regression can be represented as follows.

$$\begin{aligned} \min _{\beta \in R^m}||X \beta -y||_2^2+ \lambda ||\beta ||_2^2 \qquad such\ that\ |\beta |_1 \le t \end{aligned}$$
(1)

where \(\lambda > 0\) is the \(l_2\)-regularization constant and \(t > 0\) is the \(l_1\)-norm budget.
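Although our protocol solves (1) via the SVM reduction described later, the plaintext problem itself is easy to sketch. The toy example below (our own illustrative code, not part of the protocol) solves the Lagrangian form of (1), with penalties \(\lambda _1|\beta |_1+\lambda _2||\beta ||_2^2\), by coordinate descent with soft thresholding, and shows the sparsity induced in the \(n \ll m\) regime:

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for the Lagrangian form of (1):
    min ||X b - y||_2^2 + lam2 * ||b||_2^2 + lam1 * |b|_1."""
    n, m = X.shape
    b = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(m):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual without feature j
            z = X[:, j] @ r_j
            # soft thresholding: this is what produces exact zeros
            b[j] = np.sign(z) * max(abs(z) - lam1 / 2, 0.0) / (col_sq[j] + lam2)
    return b

rng = np.random.default_rng(0)
n, m = 20, 50                       # n << m, as in the biomarker setting
X = rng.normal(size=(n, m))
beta_true = np.zeros(m)
beta_true[:3] = [2.0, -1.5, 1.0]    # only a few "biomarker" genes matter
y = X @ beta_true + 0.01 * rng.normal(size=n)

beta = elastic_net_cd(X, y, lam1=1.0, lam2=0.1)
print(np.count_nonzero(beta))       # sparse, unlike a ridge solution
```

Running these per-coordinate updates requires repeated access to the raw entries of X, which is exactly what the encrypted setting forbids; this motivates the reduction in Sect. 3.3.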

3.2 Support Vector Machine with Squared Hinge Loss

Given a dataset \(\{(x_i,y_i)\}_{i=1}^n\) where \(x_i \in R^m\) and \(y_i \in \{+1,-1\}\), we aim to find a separating hyperplane \(w^Tx+b=0\) (\(w \in R^m\)) that classifies the training samples into two classes. Many eligible separating hyperplanes may exist; for the sake of robustness, the support vector machine maximizes the margin (\(\frac{1}{||w||}\)) between the two classes, which is equivalent to minimizing \(||w||^2\). However, the training dataset is sometimes linearly inseparable. One solution is to allow the SVM to make mistakes on some samples. We use the squared hinge loss \(max(0,1-y_i(w^Tx_i+b))^2\), which is to be minimized, to measure the error on sample \(x_i\). The linear SVM with squared hinge loss can thus be represented as follows.

$$\begin{aligned} \min _{w}\frac{1}{2}w^Tw+C\sum _{i=1}^n max(0,1-y_i(w^Tx_i+b))^2 \end{aligned}$$
(2)

where C is the penalty parameter of the error term. The above is the primal form of SVM, which is often solved in its dual form:

$$\begin{aligned} \min _{\alpha _i\ge 0} \quad f(\alpha )=\alpha ^TQ\alpha +\frac{1}{2C}\sum _{i=1}^n\alpha _i^2-2\sum _{i=1}^n\alpha _i \end{aligned}$$
(3)

where \(\alpha \in R^n\) and each \(\alpha _i\) is the coefficient for \(x_i\). Q is an \(n \times n\) matrix with \(Q_{ij}=y_iy_jx_i^Tx_j\), and the Gram matrix K is defined by \(K_{ij}=x_i^Tx_j\). Once we obtain \(\alpha \) by solving (3), we can compute \(w=\sum _{i=1}^n\alpha _ix_iy_i\).
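As a plaintext illustration of why the Gram matrix suffices, the dual (3) can be minimized from K alone. Below is a toy sketch in which a generic box-constrained optimizer stands in for a modern SVM solver (this is not the solver used in our experiments, and the data is made up):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; no bias term, so the plane passes the origin.
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

K = X @ X.T                          # Gram matrix K_ij = <x_i, x_j>
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j <x_i, x_j>

def f(a):                            # the dual objective (3)
    return a @ Q @ a + np.sum(a**2) / (2 * C) - 2 * np.sum(a)

res = minimize(f, np.zeros(len(y)), bounds=[(0, None)] * len(y))
alpha = res.x
w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
print(np.sign(X @ w))                # matches the labels y on this toy data
```

Note that only K and y enter the optimization; the samples themselves are needed just to reconstruct w afterwards.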

3.3 Reduction from Elastic Net to SVM

Zhou et al. demonstrated that elastic net regression can be reduced to SVM [13]. They do not include a bias term b (i.e. they assume the separating hyperplane passes through the origin). After a series of transformations, (1) and (3) can be changed to (4) and (5) respectively. We omit the transformation steps; please refer to [13] for details.

$$\begin{aligned} \min _{\hat{\beta }_i>0}||\hat{Z}\hat{\beta }||_2^2+\lambda \sum _{i=1}^{2m} \hat{\beta }_i^2 \qquad \sum _{i=1}^{2m} \hat{\beta }_i=1 \end{aligned}$$
(4)

where \(\hat{Z}=[\hat{X}_1, -\hat{X}_2] \in R^{n \times 2m}\), \(\hat{X}_1=X-\frac{1}{t}yI^T\) and \(\hat{X}_2=X+\frac{1}{t}yI^T\) (\(I \in R^{m}\) is the all-ones vector).

$$\begin{aligned} \min _{\alpha _i>0} ||Z(\frac{\alpha }{|\alpha ^*|_1})||_2^2+\lambda \sum _{i=1}^{2m}(\frac{\alpha _i}{|\alpha ^*|_1})^2 \qquad \sum _{i=1}^{2m} \frac{\alpha _i}{|\alpha ^*|_1} =1 \end{aligned}$$
(5)

where the i-th column of Z is \(y_i\hat{x}_i\), \(C=\frac{1}{2\lambda }\), and \(\alpha ^*\) is the optimal solution. Comparing (4) and (5), we notice that they have similar forms, with two differences. First, the class labels in elastic net are real-valued but binary in SVM. As shown in Fig. 2, to transform the training dataset X of elastic net into that of SVM, we compute \(\hat{X}_1\) by subtracting \(\frac{1}{t}y\) from each column of X and \(\hat{X}_2\) by adding \(\frac{1}{t}y\) to each column of X, then concatenate \(\hat{X}_1\) and \(-\hat{X}_2\) and transpose the result. The first m training samples of SVM are of class \(+1\), and the remaining m are of class \(-1\). Second, the two problems have different scales. The optimal solution \(\hat{\beta }^*\) can be expressed in terms of the optimal solution \(\alpha ^*\) as \(\hat{\beta }^*=\frac{\alpha ^*}{|\alpha ^*|_1}\). Finally, the optimal solution \(\beta \) to elastic net (see (1)) can be recovered from \(\hat{\beta }\) as \(\beta = t \times (\hat{\beta }_{1\cdots m}-\hat{\beta }_{m+1\cdots 2m})\), where t is the \(l_1\)-norm budget and \(\hat{\beta }_{i \cdots j}\) denotes the vector of elements of \(\hat{\beta }\) from index i to j.
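A plaintext sketch of this dataset transformation (the helper name is ours):

```python
import numpy as np

def elastic_net_to_svm(X, y, t):
    """Plaintext version of the dataset transformation of Sect. 3.3."""
    n, m = X.shape
    X1 = X - (1.0 / t) * y[:, None]    # subtract y/t from every column of X
    X2 = X + (1.0 / t) * y[:, None]    # add y/t to every column of X
    Z = np.hstack([X1, -X2])           # \hat{Z} in R^{n x 2m}
    svm_X = Z.T                        # 2m SVM samples, each of dimension n
    svm_y = np.concatenate([np.ones(m), -np.ones(m)])
    return svm_X, svm_y

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # n = 2 profiles, m = 3 genes
y = np.array([0.5, -0.5])
svm_X, svm_y = elastic_net_to_svm(X, y, t=1.0)
print(svm_X.shape)   # (6, 2): 2m samples of dimension n
```

In the protocol, the same column-wise additions and subtractions are carried out homomorphically on the encrypted entries.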

4 Our Scheme

Fully homomorphic encryption can compute arbitrary polynomial functions over encrypted data, but its high computational complexity and communication cost preclude practical use. By focusing only on the operations of interest to the target application, more practical homomorphic encryption schemes become possible. For example, Zhou and Wornell [35] proposed an integer vector encryption scheme that supports addition, linear transformation and weighted inner product on ciphertexts. Nevertheless, the reduction from elastic net to SVM changes the training dataset: one gene expression profile of a patient across all genes (i.e. one row) is a training sample of elastic net, but one training sample of SVM (see Fig. 2) can be considered the expression values of a particular gene among all patients (i.e. one column). Therefore, if we encrypt whole gene expression profiles with the integer vector encryption, there is no way to construct ciphertexts for the training dataset of SVM. As a result, we restrict our attention to cryptosystems that encrypt one element of the profile at a time instead of the whole profile. Recall that the Gram matrix serves as the input for training SVM (see Sect. 3.2) and that the basic operation for computing it is the dot product of two samples; this requires the ciphertext to support one multiplication and multiple additions. Indeed, the BGN cryptosystem [36] can compute one multiplication on ciphertexts using bilinear maps, but it does not support multikey homomorphism. In our setting of collaborative data mining in the cloud, the training dataset of elastic net is horizontally partitioned (different units hold different records with the same set of attributes), while the training dataset of SVM is vertically partitioned (after the transformation, records are partitioned into parts with different attributes).
In order to train SVM on the encrypted training dataset, we thus need a cryptosystem that supports one multiplication under a single key and multiple additions under both a single key and different keys. In this paper, we let medical units stay offline except during the initialization phase. Specifically, we use secret sharing to authorize one server (i.e. \(S_1\)) to decrypt the encrypted Gram matrix without knowing the secret key of any medical unit.

4.1 Building Blocks

Framework to Enable One Multiplication on Ciphertexts. Catalano and Fiore [31] proposed a framework that enables existing additively homomorphic encryption schemes (e.g. Paillier, ElGamal) to compute one multiplication on ciphertexts. We use \(E(\,)\) to denote the underlying additively homomorphic encryption. The idea is to transform a ciphertext \(E(x_{ij})\) into a “multiplication friendly” form \(\mathcal {E}(x_{ij})=(x_{ij}-b_{ij}, E(b_{ij}))\), where \(b_{ij}\) is a random number. Given two “multiplication friendly” ciphertexts \(\mathcal {E}(x_{11})=(x_{11}-b_{11},E(b_{11}))\) and \(\mathcal {E}(x_{21})=(x_{21}-b_{21},E(b_{21}))\), we compute their product as \(\mathcal {E}(x_{11}x_{21})=(\alpha _1,\beta _1,\beta _2)\), where

$$\begin{aligned} \alpha _1= & {} E[(x_{11}-b_{11})(x_{21}-b_{21})]E(b_{11})^{x_{21}-b_{21}}E(b_{21})^{x_{11}-b_{11}}\nonumber \\= & {} E(x_{11}x_{21}-b_{11}b_{21}) \end{aligned}$$
(6)
$$\begin{aligned} \beta _1= & {} E(b_{11}) \qquad \beta _2=E(b_{21}) \end{aligned}$$
(7)

To decrypt \(\mathcal {E}(x_{11}x_{21})\), we add \(b_{11}b_{21}\) to the decryption of the \(\alpha \) component, where \(b_{11}\) and \(b_{21}\) are retrieved from \(\beta _1\) and \(\beta _2\). Adding two ciphertexts after a multiplication works by adding the \(\alpha \) components and concatenating the \(\beta \) components, so the \(\beta \) component grows linearly with the number of additions performed after a multiplication. To remove this constraint, two non-colluding servers are used to store \(\mathcal {E}(x_{ij})=(x_{ij}-b_{ij}, E(b_{ij}))\) and \(b_{ij}\) respectively. In this way, \(S_1\) can discard the \(\beta \) component after performing a multiplication, because \(S_2\) operates on the \(b_{ij}\)’s in plaintext; the ciphertext then contains only the \(\alpha \) component. A nice property of this framework is that it inherits the multikey homomorphism of the underlying additively homomorphic encryption.
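A toy sketch of this framework, using textbook Paillier with insecurely small parameters as the underlying \(E(\,)\); for brevity the randomizer b is kept locally here, whereas in the protocol it is held by \(S_2\):

```python
import math
import random

# Toy textbook Paillier as the underlying additively homomorphic E().
# Parameters are far too small for real use; illustration only.
p, q = 47, 59
N, N2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, N)

def E(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2

def D(c):
    return (((pow(c, lam, N2) - 1) // N) * mu) % N

def mf_encrypt(x):
    """'Multiplication friendly' form (x - b, E(b)); b returned only so
    this self-contained demo can finish the decryption."""
    b = random.randrange(N)
    return ((x - b) % N, E(b), b)

x1, x2 = 17, 23
d1, Eb1, b1 = mf_encrypt(x1)
d2, Eb2, b2 = mf_encrypt(x2)

# alpha = E(x1*x2 - b1*b2), computed homomorphically as in (6).
alpha = (E(d1 * d2) * pow(Eb1, d2, N2) * pow(Eb2, d1, N2)) % N2
product = (D(alpha) + b1 * b2) % N
print(product)  # 391 == 17 * 23
```

The homomorphic steps mirror (6): an encryption of \((x_1-b_1)(x_2-b_2)\) is combined with scalar multiples of \(E(b_1)\) and \(E(b_2)\), so the cross terms cancel and only \(b_1b_2\) remains to be added after decryption.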

Fig. 3. The BCP cryptosystem

Multikey Homomorphism of the BCP Cryptosystem. The BCP cryptosystem (also known as the modified Paillier cryptosystem) is an additively homomorphic encryption scheme under a single key [32]. We briefly review it in Fig. 3 and discuss how to modify it to support multikey homomorphism at the expense of expanding the ciphertext size. Suppose \(E(m_a)=(C^{(1)}_{m_a},C^{(2)}_{m_a})\) is encrypted under key \(s_a\) and \(E(m_b)=(C^{(1)}_{m_b},C^{(2)}_{m_b})\) under key \(s_b\). Then \(E^{(ab)}(m_a+m_b)\), where \(E^{(ab)}\) denotes a ciphertext related to keys \(s_a\) and \(s_b\), can be computed as

$$\begin{aligned} E^{(ab)}(m_a+m_b)=(C^{(1)}_{m_a}, C^{(1)}_{m_b}, C^{(2)}_{m_a}C^{(2)}_{m_b}) \end{aligned}$$
(11)

The ciphertext size depends only on the number of involved medical units (i.e. keys). In this example there are two MUs with keys \(s_a\) and \(s_b\), so the sum of their ciphertexts is a 3-tuple; if n MUs cooperate, the sum of their ciphertexts is an \((n+1)\)-tuple. To decrypt \(E^{(ab)}(m_a+m_b)\), both secret keys \(s_a\) and \(s_b\) are required:

$$\begin{aligned} t=\frac{C^{(2)}_{m_a}C^{(2)}_{m_b}}{(C^{(1)}_{m_a})^{s_a}(C^{(1)}_{m_b})^{s_b}} \qquad m_a+m_b=\frac{(t-1) \ mod \ N^2}{N} \end{aligned}$$
(12)

Incorporating the above modified BCP cryptosystem into the framework that enables additively homomorphic encryption to support one multiplication, we obtain our final encryption scheme \(\mathcal {E}_{BCP}\).
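The multikey addition (11) and joint decryption (12) can be sketched with a BCP-style toy implementation (insecurely small safe primes; not a faithful rendering of the full scheme in [32]):

```python
import math
import random

# Toy BCP-style scheme over Z*_{N^2}; p, q are safe primes (demo sizes only).
p, q = 23, 47
N, N2 = p * q, (p * q) ** 2
a = random.randrange(2, N2)
while math.gcd(a, N) != 1:          # ensure a is a unit mod N^2
    a = random.randrange(2, N2)
g = pow(a, 2, N2)                   # squaring yields a suitable-order base

def keygen():
    s = random.randrange(2, N * N // 2)
    return s, pow(g, s, N2)         # secret key s, public key h = g^s

def enc(m, h):
    r = random.randrange(1, N * N // 2)
    return (pow(g, r, N2), (pow(h, r, N2) * (1 + m * N)) % N2)

s_a, h_a = keygen()
s_b, h_b = keygen()
C1a, C2a = enc(5, h_a)              # m_a = 5 under key s_a
C1b, C2b = enc(7, h_b)              # m_b = 7 under key s_b

# Multikey addition (11): the sum ciphertext is a 3-tuple tied to both keys.
ct = (C1a, C1b, (C2a * C2b) % N2)

# Joint decryption (12): both secret keys are needed.
t = (ct[2] * pow(pow(ct[0], s_a, N2) * pow(ct[1], s_b, N2), -1, N2)) % N2
print((t - 1) // N)                 # 12 = m_a + m_b
```

The blinding factors \(h_a^{r_a}\) and \(h_b^{r_b}\) cancel against \(C^{(1)}\) raised to the respective secret keys, leaving \((1+m_aN)(1+m_bN)\equiv 1+(m_a+m_b)N \pmod{N^2}\).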

Gram Matrix Computation. The Gram matrix K is defined by \(K_{ij}=\langle x_i,x_j\rangle =x_i^Tx_j\), where \(x_i\) and \(x_j\) are any two training samples (see Sect. 3.2). Recall that the original training dataset X of elastic net regression is transformed into the training dataset \(\hat{X}\) of SVM during the reduction (see Sect. 3.3); we use \(\hat{X}=\{\hat{x}_i\}_{i=1}^{2m}\) to denote the transformed dataset. After this transformation, the horizontally partitioned dataset of elastic net becomes a vertically partitioned dataset of SVM. The Gram matrix \(K(\hat{X})\) of \(\hat{X}\) is computed as follows.

$$\begin{aligned} K(\hat{X})=\left[ \begin{array}{cccc} \langle \hat{x}_1,\hat{x}_1\rangle &{} \langle \hat{x}_1,\hat{x}_2\rangle &{} \cdots &{} \langle \hat{x}_1,\hat{x}_{2m}\rangle \\ \langle \hat{x}_2,\hat{x}_1\rangle &{} \langle \hat{x}_2,\hat{x}_2\rangle &{} \cdots &{} \langle \hat{x}_2,\hat{x}_{2m}\rangle \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \langle \hat{x}_{2m},\hat{x}_1\rangle &{} \langle \hat{x}_{2m},\hat{x}_2\rangle &{} \cdots &{} \langle \hat{x}_{2m},\hat{x}_{2m}\rangle \end{array}\right] \end{aligned}$$
(13)

For ease of description, we first consider the case of two medical units, denoted MU\(_A\) and MU\(_B\). Assume the cloud stores n gene expression profiles, of which \(n_A\) records are from MU\(_A\) and \(n_B\) records are from MU\(_B\). Then in the transformed SVM dataset, the first \(n_A\) elements of each training sample are encrypted under key \(s_A\) and the remaining \(n_B\) elements under key \(s_B\). Consider two SVM training samples \(\hat{x}_1\) and \(\hat{x}_2\) whose ciphertexts are \(\mathcal {E}_{BCP}(\hat{x}_1)=(\hat{x}_1-b_1,E(b_1))\) and \(\mathcal {E}_{BCP}(\hat{x}_2)=(\hat{x}_2-b_2, E(b_2))\); their dot product \(\langle \hat{x}_1,\hat{x}_2\rangle \) can be computed as follows.

$$\begin{aligned} \langle \hat{x}_1,\hat{x}_2\rangle =\sum _{i=1}^{n_A}\hat{x}_{1i}\hat{x}_{2i}+\sum _{i=n_A+1}^{n}\hat{x}_{1i}\hat{x}_{2i} \end{aligned}$$
(14)

Suppose the ciphertexts of \(\sum _{i=1}^{n_A}\hat{x}_{1i}\hat{x}_{2i}\) and \(\sum _{i=n_A+1}^{n}\hat{x}_{1i}\hat{x}_{2i}\) are \(\alpha _A\) and \(\alpha _B\) respectively; \(S_1\) computes \(\alpha _A\) and \(\alpha _B\) as follows. Computing \(\alpha _A\) or \(\alpha _B\) requires only single-key homomorphism.

$$\begin{aligned} \alpha _{A}= & {} E(\sum _{i=1}^{n_A}\hat{x}_{1i}\hat{x}_{2i}-b_{1i}b_{2i})=(C^{(1)}_A,C^{(2)}_A) \end{aligned}$$
(15)
$$\begin{aligned} \alpha _{B}= & {} E(\sum _{i=n_A+1}^{n_A+n_B}\hat{x}_{1i}\hat{x}_{2i}-b_{1i}b_{2i})=(C^{(1)}_B,C^{(2)}_B) \end{aligned}$$
(16)

As \(\alpha _A\) and \(\alpha _B\) are encrypted under different keys, adding them together requires multikey homomorphism.

$$\begin{aligned} E(\langle \hat{x}_1,\hat{x}_2\rangle -\langle \hat{b}_1,\hat{b}_2\rangle )=\alpha _{A}+\alpha _{B}=(C^{(1)}_A,C^{(1)}_B,C^{(2)}_AC^{(2)}_B) \end{aligned}$$
(17)
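In plaintext terms, (14)-(17) split the dot product into per-key partial sums, each computable with single-key homomorphism; only their combination needs the multikey addition. A trivial numeric illustration:

```python
import numpy as np

# Plaintext view of (14): the dot product splits into per-key partial sums.
n_A, n_B = 3, 2                       # records from MU_A and MU_B (n = 5)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([5.0, 4.0, 3.0, 2.0, 1.0])

part_A = x1[:n_A] @ x2[:n_A]          # computable under MU_A's key alone
part_B = x1[n_A:] @ x2[n_A:]          # computable under MU_B's key alone
print(part_A + part_B == x1 @ x2)     # True; combining them needs (17)
```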

Keep Medical Units Offline. We leverage \(\mathcal {E}_{BCP}\)’s proxy re-encryption property, inherited from the underlying BCP cryptosystem (see (10)). To keep MU\(_A\) and MU\(_B\) offline, we split the secret key of each involved medical unit into two shares: \(s_A=s_{A_1}+s_{A_2}\) and \(s_B=s_{B_1}+s_{B_2}\). \(S_1\) holds \(s_{A_1}\) and \(s_{B_1}\); \(S_2\) holds \(s_{A_2}\) and \(s_{B_2}\). To compute \(\langle \hat{x}_1,\hat{x}_2\rangle \), \(S_1\) first decrypts (17) partially.

$$\begin{aligned}&{C^{(1)}_A}^{\prime }=C^{(1)}_A \qquad {C^{(1)}_B}^{\prime }=C^{(1)}_B \end{aligned}$$
(18)
$$\begin{aligned}&{C^{(2)}_A}^{\prime }{C^{(2)}_B}^{\prime }=\frac{C^{(2)}_AC^{(2)}_B}{{(C^{(1)}_A)}^{s_{A_1}}(C^{(1)}_B)^{s_{B_1}}} \end{aligned}$$
(19)

Then \(S_1\) sends \(C^{(1)}_A\) and \(C^{(1)}_B\) to \(S_2\), which computes and returns \((C^{(1)}_A)^{s_{A_2}}\), \((C^{(1)}_B)^{s_{B_2}}\) and \(\langle \hat{b}_1,\hat{b}_2\rangle \) to \(S_1\). Finally, \(S_1\) can fully decrypt \(E(\langle \hat{x}_1,\hat{x}_2\rangle -\langle \hat{b}_1,\hat{b}_2\rangle )\) and obtain \(\langle \hat{x}_1,\hat{x}_2\rangle \) in plaintext.

$$\begin{aligned} \qquad {C^{(2)}_A}^{\prime \prime }{C^{(2)}_B}^{\prime \prime }= & {} \frac{{C^{(2)}_A}^{\prime }{C^{(2)}_B}^{\prime }}{({C^{(1)}_A}^{\prime })^{s_{A_2}}({C^{(1)}_B}^{\prime })^{s_{B_2}}} \end{aligned}$$
(20)
$$\begin{aligned} \langle \hat{x}_1,\hat{x}_2\rangle= & {} \frac{({C^{(2)}_A}^{\prime \prime }{C^{(2)}_B}^{\prime \prime }-1)\ mod\ N^2}{N}+\langle \hat{b}_1,\hat{b}_2\rangle \end{aligned}$$
(21)

The above shows how to compute \(\langle \hat{x}_1,\hat{x}_2\rangle \) from the two ciphertexts \(\mathcal {E}_{BCP}(\hat{x}_1)\) and \(\mathcal {E}_{BCP}(\hat{x}_2)\). Similarly, we can compute each element of the Gram matrix \(K_{ij}=\langle \hat{x}_i,\hat{x}_j\rangle =\hat{x}_i^T\hat{x}_j\) from the ciphertexts \(\mathcal {E}_{BCP}(\hat{x}_i)\) and \(\mathcal {E}_{BCP}(\hat{x}_j)\). Since the Gram matrix in (13) is symmetric, we need only compute its upper triangular half. In the end, \(S_1\) obtains the Gram matrix K in plaintext. If there are more than two medical units, we can easily extend (14), (15) and (17) to handle them. The ciphertext size of \(\langle \hat{x}_1,\hat{x}_2\rangle \) increases linearly with the number of involved medical units, and so does the communication overhead during the decryption phase.
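The two-round partial decryption (18)-(21) can be sketched on a single BCP-style toy ciphertext (insecurely small parameters; for brevity we decrypt one value and omit the \(\langle \hat{b}_1,\hat{b}_2\rangle \) correction):

```python
import math
import random

# Two-round partial decryption of a BCP-style toy ciphertext: the secret
# key s is split as s = s1 + s2, so neither server alone can decrypt.
p, q = 23, 47
N, N2 = p * q, (p * q) ** 2
a = random.randrange(2, N2)
while math.gcd(a, N) != 1:
    a = random.randrange(2, N2)
g = pow(a, 2, N2)

s = random.randrange(2, N * N // 2)   # a medical unit's secret key
h = pow(g, s, N2)
s1 = random.randrange(1, s)           # share held by S_1
s2 = s - s1                           # share held by S_2

m = 42
r = random.randrange(1, N * N // 2)
C1, C2 = pow(g, r, N2), (pow(h, r, N2) * (1 + m * N)) % N2

# Round 1: S_1 strips its share, as in (19).
C2_p = (C2 * pow(pow(C1, s1, N2), -1, N2)) % N2
# Round 2: S_2 raises C1 to s2; S_1 finishes the decryption, as in (20)-(21).
C2_pp = (C2_p * pow(pow(C1, s2, N2), -1, N2)) % N2
print((C2_pp - 1) // N)               # 42, with neither server holding s
```

In the two-key case of (18)-(21), the same exponentiations are simply repeated for each key's shares.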

4.2 Our Construction

Given the encrypted gene expression profiles \(\mathcal {E}_{BCP}(X)\) from multiple medical units, the cloud runs privacy-preserving elastic net on them to discover biomarkers that predict a patient’s response to anticancer drugs. As it is not clear how to design a privacy-preserving protocol based on the iterative algorithms for elastic net, we resort to the reduction to shift our attention from elastic net to SVM. In Algorithm 1, we first demonstrate how to transform the encrypted dataset of elastic net to that of SVM (see Sect. 3.3). This transformation is easy on a plaintext dataset; once the data is encrypted, we must rely on the homomorphic properties of our cryptosystem. Next, we compute the encrypted Gram matrix \(\mathcal {E}_{BCP}(K)\) of the transformed training dataset (see Sect. 4.1). The Gram matrix serves as an intermediate dataset from which the SVM model can be generated correctly without breaching the privacy of patients’ gene expression profiles. To keep the medical units offline, we authorize \(S_1\) to decrypt \(\mathcal {E}_{BCP}(K)\). Based on K, we train SVM and obtain the solution \(\alpha \). Finally, we use \(\alpha \) to reconstruct \(\beta \), the solution to elastic net.

Algorithm 1

Model Assessment. In Algorithm 1, there are two parameters: the \(l_1\)-norm constraint t and the \(l_2\)-regularization parameter \(\lambda \). It is not known beforehand which t and \(\lambda \) are best for the elastic net; the predictive power of the derived solution varies with the combination \((t,\lambda )\). We perform a “grid search” over t and \(\lambda \) using k-fold cross validation [37] to assess the goodness-of-fit of our model under different parameters. The grid search is straightforward: we specify ranges for t and \(\lambda \) and try various pairs \((t,\lambda )\). As a complete grid search might be time-consuming, we recommend using a coarse grid first; once a “better” region is identified, we conduct a finer search on that region. We divide our training dataset \(\mathcal {E}_{BCP}(X)\) into k subsets satisfying \(\mathcal {E}_{BCP}(X)=\mathcal {E}_{BCP}(X_1) \cup \cdots \cup \mathcal {E}_{BCP}(X_k)\) and \(\mathcal {E}_{BCP}(X_i) \cap \mathcal {E}_{BCP}(X_j)=\emptyset \ (i \ne j)\). Each time, we use \(\mathcal {E}_{BCP}(X_i)\), \(i \in [1,k]\), as the validation set and the remaining \(k-1\) subsets as the training set. To measure regression performance, we choose the Root Mean Squared Error (RMSE); an RMSE closer to 0 indicates a more useful regression model. In k-fold cross validation, we compute the average of the k RMSE values. Suppose there are d samples in the validation set, the predicted GI50 value of gene expression profile \(x_i\) is \(\tilde{y}_i\), and the true value is \(y_i\); then the RMSE is computed as

$$\begin{aligned} RMSE=\sqrt{\frac{1}{d}\sum _{i=1}^{d}(\tilde{y}_i-y_i)^2} \end{aligned}$$
(22)
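Once the predicted values are decrypted, Eq. (22) and its k-fold average are direct to compute; a minimal Python sketch (variable names are illustrative):

```python
import math

def rmse(y_pred, y_true):
    """Root Mean Squared Error over d validation samples (Eq. 22)."""
    d = len(y_true)
    return math.sqrt(sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)) / d)

def mean_rmse(fold_results):
    """k-fold cross validation averages k RMSE values.
    fold_results: list of (y_pred, y_true) pairs, one per fold."""
    return sum(rmse(p, t) for p, t in fold_results) / len(fold_results)
```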

Recall that each gene expression profile is encrypted; we can compute the ciphertext of the predicted GI50 value \(\tilde{y}_i\) as \(\mathcal {E}_{BCP}(\tilde{y}_i)=(\beta ^T(x_i-b_i),\beta ^TE(b_i))\), where \(\beta \) is the solution to the elastic net. To obtain \(\tilde{y}_i\) in plaintext, the two non-colluding servers work together: \(S_1\) reveals \(\beta \) to \(S_2\), \(S_2\) returns \(\beta ^Tb_i\) to \(S_1\), and \(S_1\) computes \(\tilde{y}_i=\beta ^T(x_i-b_i)+\beta ^Tb_i=\beta ^Tx_i\). For each \((t,\lambda )\) pair, \(S_1\) computes the RMSE from the predicted values \(\tilde{y}_i\). Finally, \(S_1\) picks the optimal \((t,\lambda )\), i.e. the pair achieving the smallest RMSE, and obtains the optimal solution \(\beta ^{*}\).
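The reconstruction \(\tilde{y}_i=\beta ^T(x_i-b_i)+\beta ^Tb_i\) is easy to sanity-check on toy integers (a plaintext re-enactment only: E is omitted, and in the real protocol \(S_1\) never sees the blind b):

```python
import random

def predict_two_server(beta, x):
    """Toy re-enactment of the blinded prediction step: the blinds cancel,
    so the two servers' shares always sum to beta^T x."""
    b = [random.randint(-100, 100) for _ in x]                # blinding vector
    blinded = [xi - bi for xi, bi in zip(x, b)]               # x - b, seen by S_1
    s1_share = sum(bj * cj for bj, cj in zip(beta, blinded))  # beta^T (x - b)
    s2_share = sum(bj * bi for bj, bi in zip(beta, b))        # beta^T b, from S_2
    return s1_share + s2_share                                # = beta^T x
```

Because the blinds cancel exactly, the result matches the unblinded dot product \(\beta ^Tx\) regardless of the random b.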

5 Security Analysis

We consider the honest-but-curious model, meaning that all the medical units, \(S_1\), and \(S_2\) follow our protocol but may try to gather information about the inputs of the MUs. There might also exist collusion between a medical unit and \(S_1\). We analyze the security of our model with the Real/Ideal paradigm and the Composition Theorem [38]. The main idea is to use a simulator in the ideal world to simulate the view of a semi-honest adversary in the real world. If the view in the real world is computationally indistinguishable from the view in the ideal world, the protocol is believed to be secure. According to the Composition Theorem, the entire scheme is secure if each step is proved to be secure. Due to space limitations, the proof of Theorem 1 is given in the Appendix.

Theorem 1

In Algorithm 1, it is computationally infeasible for \(S_1\) to distinguish the gene expression profiles encrypted under multiple keys as long as \(\mathcal {E}_{BCP}\) is semantically secure and the two servers are non-colluding.

Theorem 2

No encryption scheme is secure against known-sample attack if dot products are revealed.

Proof: We define a known-sample attack as one in which the attacker obtains the plaintexts of a set of records of the encrypted database without knowing the correspondence between the plaintexts and the encrypted records. According to [39], no encryption scheme is secure against known-sample attacks if distance information is revealed. As distance computation can be decomposed into dot products, revealing dot products is equivalent to revealing distances. Given n encrypted samples of dimension m, if an attacker knows the plaintexts of m linearly independent samples, the attacker can obtain the plaintext of any encrypted sample even without the decryption key. The idea is to construct m linear equations whose unique solution is the desired sample.
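The proof idea can be made concrete in a few lines of NumPy: treat the m known plaintexts as the rows of a matrix A; the revealed dot products of a hidden sample x with those rows form the vector Ax, and solving the resulting linear system recovers x (a toy illustration with made-up numbers):

```python
import numpy as np

def recover_target(known_samples, dot_products):
    """Known-sample attack of Theorem 2: given m linearly independent known
    plaintexts A (m x m) and the revealed dot products A @ x of a hidden
    sample x, the unique solution of the linear system is x itself."""
    A = np.asarray(known_samples, dtype=float)
    d = np.asarray(dot_products, dtype=float)
    return np.linalg.solve(A, d)

# Toy check: the hidden sample is recovered exactly from its dot products,
# without any decryption key.
A = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
x = np.array([5.0, -2.0, 7.0])
```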

Fortunately, the following theorem shows that it is impossible for an attacker to exploit Theorem 2 to launch this attack.

Theorem 3

Even with the Gram matrix K known, \(S_1\) cannot reconstruct the gene expression profile of a patient, since it is impossible for an attacker to collect enough SVM training samples to launch the attack described in Theorem 2.

Proof: According to Sect. 3.3, one gene expression profile of a patient across all genes (i.e. one row) is a training sample of the elastic net, whereas a training sample of the SVM consists of the expression values of a particular gene across all patients (i.e. one column). If \(S_1\) colludes with \(MU_1\), this brings only a minor advantage: the elements contributed by \(MU_1\) to each training sample are revealed. Unless the attacker cracks our cryptosystem and obtains all the private keys of the involved medical units, he cannot set up the linear equations needed to launch a known-sample attack.

6 Experimental Evaluation

The configuration of our PC is Windows 7 Enterprise 64-bit with an Intel(R) Core(TM) i5 CPU (4 cores, 3.4 GHz) and 16 GB of memory. We use a public database for drug sensitivity in cancer cell lines [33]. For platform independence, we implement our scheme in Java with an open-source IDE. We use the BigInteger class to process big numbers, which offers all the basic operations we need, and the SecureRandom class to produce cryptographically strong random numbers. To generate safe primes, we use the probablePrime method of the BigInteger class; the probability that a BigInteger returned by this method is composite does not exceed \(2^{-100}\). The performance of our scheme depends heavily on the size of the modulus N and on the number of additions and multiplications performed. During the initialization phase, the public and private key pairs are generated. The runtime of key-pair generation varies with the bit length of N, as it depends largely on the random number generator; for a typical value of 1024 bits, generating one key pair takes about 2 s on average. We first compare the encryption time of two training samples under the BCP Cryptosystem and under our proposed cryptosystem \(\mathcal {E}_{BCP}\). We vary the dimension of each sample from 1000 to 10000, with the bit length of the modulus N set to 1024 and 1536 respectively (following the same setting as in [27]). As shown in Fig. 4, the encryption time scales linearly with the dimension. We use E(x) and \((x-b,E(b))\), where b is a random number, to denote the ciphertext of x under the BCP and \(\mathcal {E}_{BCP}\) cryptosystems respectively. Leveraging the framework proposed in [31], the encryption time of \(\mathcal {E}_{BCP}\) is roughly double that of the BCP Cryptosystem; the additional time is spent generating the random number b. Moreover, we measure the time to compute the dot product of two encrypted training samples.
We focus on a vertically partitioned SVM dataset. To facilitate understanding, we encrypt the first half of a sample (belonging to Alice) under secret key \(s_A\) and the second half (belonging to Bob) under secret key \(s_B\). We show the runtime of dot product computation on the ciphertexts in Fig. 5. Again, the time to calculate a dot product increases linearly with the sample dimension. For vectors of dimension m, one dot product comprises m multiplications and \(m-1\) additions, of which one addition is multikey homomorphic; a multikey homomorphic addition takes only 1 ms. For operations under a single key, addition is much faster than multiplication: with a 1024-bit modulus, the total runtime of the additions is under 1 s, while that of the multiplications grows from 16 s to 185 s as the dimension increases from 1000 to 10000. Multiplications are therefore the bottleneck of dot product computation. Decrypting one encrypted dot product takes 285 ms and 572 ms with and without secret sharing respectively. Recall that the multikey homomorphism property is achieved at the expense of an expanded ciphertext size; we therefore also measure the effect of the number of involved users on the ciphertext size. As shown in Fig. 6, the ciphertext size increases linearly from 6138 B to 26 KB as the number of involved users grows from 2 to 100.
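The ciphertext form \((x-b,E(b))\) can be mimicked with any additively homomorphic scheme E. The sketch below is a textbook Paillier instance with toy parameters (an illustrative stand-in, not the BCP Cryptosystem, and far too small to be secure); it shows why homomorphic additions are cheap modular multiplications while the single supported multiplication is dominated by a modular exponentiation:

```python
import math
import random

# Textbook Paillier with tiny parameters -- NOT the BCP Cryptosystem and
# not secure; it only illustrates the (x - b, E(b)) ciphertext shape.
p, q = 293, 433
N = p * q
N2 = N * N
lam = math.lcm(p - 1, q - 1)
g = N + 1

def enc(m):
    """E(m) = g^m * r^N mod N^2 for a random r coprime to N."""
    r = random.randrange(2, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(2, N)
    return (pow(g, m, N2) * pow(r, N, N2)) % N2

def dec(c):
    """Standard Paillier decryption with g = N + 1."""
    u = pow(c, lam, N2)
    return ((u - 1) // N) * pow(lam, -1, N) % N

def add(c1, c2):
    """Homomorphic addition: E(m1) * E(m2) = E(m1 + m2), one cheap
    modular multiplication."""
    return (c1 * c2) % N2

def scal_mul(c, k):
    """E(m)^k = E(k * m): the modular exponentiation that dominates the
    cost of the one supported multiplication."""
    return pow(c, k, N2)

def enc_blinded(x):
    """Ciphertext in the (x - b, E(b)) form, with b a random blind."""
    b = random.randrange(N)
    return ((x - b) % N, enc(b))
```

Here `add` corresponds to the fast single-key additions measured above, and `scal_mul` to the expensive multiplications that dominate the dot product runtime.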

Fig. 4. Encryption time

Fig. 5. Dot product time

Fig. 6. Ciphertext size

The public database for drug sensitivity used in this paper consists of 1002 cancer cell lines and 265 anticancer drugs. For each drug, GI50 values of around 300 to 1000 cell lines are available. The gene expression profiling contains the RMA-normalized expression values of 17737 genes across 1018 cell lines. We preprocess the data in MATLAB, keeping the cell lines that appear in both the gene expression profiles and the GI50 values. For example, for the drug PD-0325901 we obtain 843 expression profiles with GI50 values. As our cryptosystem only supports operations over the integer domain, we also need to preprocess the values. Specifically, we first select a system parameter p representing the number of decimal digits kept for the fractional part of an expression value, and multiply each expression value by \(10^p\) to obtain an integer. Each element of the Gram matrix must then be divided by \(10^{2p}\) to remove the effect of the scaling. After running our privacy-preserving elastic net, we successfully pick out 165 genomic biomarkers.
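The scaling step can be checked in a few lines (p follows the text; the values and variable names are made up for illustration):

```python
def to_fixed(value, p):
    """Scale a real expression value to an integer, keeping p fractional
    decimal digits."""
    return round(value * 10 ** p)

def descale_gram_entry(entry, p):
    """A Gram entry is a sum of products of two scaled values, so it
    carries a factor of 10^(2p) that must be divided out."""
    return entry / 10 ** (2 * p)

p = 3
a, b = 1.234, 5.678                        # two expression values
dot_scaled = to_fixed(a, p) * to_fixed(b, p)
dot = descale_gram_entry(dot_scaled, p)    # ~ a * b, up to rounding
```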

Comparison with existing schemes: We focus on the homomorphic encryption based schemes [17, 27,28,29], whose setting is an outsourced encrypted database under multiple keys. According to the experimental evaluation in [31], using their framework to enable one multiplication on additively homomorphic ciphertexts outperforms BGV homomorphic encryption [40] in terms of ciphertext size and the time for encryption, decryption, and homomorphic operations. As shown in the experiments above, our modification of the BCP Cryptosystem for multikey homomorphism only doubles the encryption time. Therefore, addition and multiplication run more efficiently in our scheme than in [17], which moreover handles only two users (i.e. keys) and requires the users to be online, whereas we keep the users offline. Under the two non-colluding servers model, the schemes in [27,28,29] can also compute additions and multiplications. Their main drawback is that they must transform ciphertexts under multiple keys into ciphertexts under the same key, which is a heavy workload for the cloud server; moreover, computing a multiplication incurs interactions between the two servers. By contrast, our cryptosystem computes multiplications without interaction.

7 Discussion and Conclusions

In practical scenarios, a gene expression profile typically has a dimension on the order of \(10^4\), while we can only collect hundreds of patients’ profiles for different cancers. Storing the Gram matrix K in memory thus requires on the order of \(10^8\) entries in our case, a substantial memory cost. Keerthi et al. [41] proposed restricting the support vectors to a subset of basis vectors \(\mathcal {J} \subset \{1,\cdots ,n\}\) to reduce the memory requirement; this method needs \(O(|\mathcal {J}|n)\) space, where \(|\mathcal {J}| \ll n\). However, the \(\alpha \) derived this way can differ from the one obtained using the entire Gram matrix K, so it trades accuracy for efficiency. For genomic biomarker discovery, picking out accurate biomarkers clearly matters more, so it makes sense to keep the full Gram matrix in memory. Furthermore, each element of the Gram matrix can be calculated independently, so we can utilize existing parallel computing frameworks to accelerate its computation.
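Since the entries of K are independent, the computation parallelizes over rows; a minimal sketch (Python threads purely for illustration, whereas our actual implementation is in Java):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def gram_row(Z, i):
    """Row i of the Gram matrix: each entry is an independent dot product."""
    return Z @ Z[i]

def parallel_gram(Z, workers=4):
    """Compute K row by row in parallel; no entry depends on another, so
    any parallel framework (threads here) can be substituted."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda i: gram_row(Z, i), range(len(Z))))
    return np.vstack(rows)
```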

To conclude, by assuming the existence of two non-colluding servers, we proposed a privacy-preserving collaborative model that conducts elastic net regression, through a reduction to SVM, on encrypted gene expression profiles and GI50 values of anticancer drugs. To compute the Gram matrix on ciphertexts, we constructed a cryptosystem that supports one multiplication under a single key and multiple additions under both a single key and different keys. Besides, we use secret sharing to allow one of the cloud servers to obtain the Gram matrix. Our scheme keeps the medical units offline and is proved secure in the semi-honest model, even if a medical unit colludes with one cloud server. The experimental results highlight the practicability of our scheme. The proposed protocol can also be applied to other applications that use the elastic net or lasso for linear regression. Our future work is to extend the scheme to malicious adversaries (where either \(S_1\) or \(S_2\) is malicious); one promising direction is to use commitment schemes [42] and zero-knowledge protocols.