1 Introduction

Research on preserving data privacy in outsourced databases has been spotlighted with the development of cloud computing. Because a data owner (DO) outsources his/her databases to a cloud and allows the cloud to manage them, the DO can reduce the database management cost by flexibly using the cloud's resources [1,2,3]. The cloud not only maintains the databases, but also provides an authorized user (AU) with querying services on the outsourced databases.

However, because the data are private assets of the DO and may include sensitive information such as financial records, they should be protected against adversaries, including the cloud server itself. Therefore, the databases of the DO should be encrypted before being outsourced to the cloud. In addition, a user's query should be protected from the adversaries because the query may contain the private information of the user [4,5,6,7,8,9,10]. Thus, a vital challenge in cloud computing is to protect both data privacy and query privacy among the data owner, the users, and the cloud. However, during query processing, the cloud can derive sensitive information about the actual data items and users by observing data access patterns, even if the data and the query are encrypted [11,12,13,14,15,16,17]. In addition, it is very challenging to process a query on the encrypted data without decrypting it.

Meanwhile, the k nearest neighbor (kNN) query, one of the most typical query types, has been widely used as a baseline technique in many fields, such as data mining and location-based services. The kNN query finds the k neighbors that are closest to a given query point. However, a kNN result is closely related to the interests and preferences of a user. Therefore, secure kNN (SkNN) query processing algorithms that preserve both data privacy and query privacy have been proposed [18,19,20,21,22,23,24]. However, the algorithms in [18, 19] are insecure because they are vulnerable to chosen-plaintext and known-plaintext attacks. In addition, the DO must be heavily involved in the query processing [19,20,21]. Furthermore, the algorithms in [18,19,20,21] do not hide data access patterns from the cloud. The algorithms in [22, 24] guarantee the confidentiality of both the outsourced databases and a user's query while hiding data access patterns, but they suffer from a high query processing cost.

To solve these problems, in this paper we propose a privacy-preserving kNN query processing algorithm via secure two-party computation on the encrypted database. Our algorithm preserves both data privacy and query privacy while hiding data access patterns. For this, we propose efficient and secure protocols based on Yao's garbled circuit [25] and a data packing technique. To enhance the performance of our kNN query processing algorithm, we also propose a parallel kNN query processing algorithm using improved secure protocols based on an encrypted random value pool. To verify the security of our algorithms, we provide formal security proofs of our privacy-preserving kNN query processing algorithms. Through a performance analysis, we verify that our proposed algorithms outperform the existing ones on both a synthetic dataset and a real dataset. Our contributions can be summarized as follows:

  • We present a framework for outsourcing both encrypted databases and encrypted indexes.

  • We propose new secure protocols (e.g., ESSED, GSCMP, GSPE) in order to preserve data privacy and query privacy while hiding data access patterns.

  • We propose an encrypted random value pool to minimize the computational cost of secure protocols.

  • We propose a new privacy-preserving parallel kNN query processing algorithm which can support efficient query processing.

  • We also present an extensive experimental evaluation of our algorithms with various parameter settings.

The rest of the paper is organized as follows: Section 2 introduces background and related work. Section 3 presents system architecture and secure protocols. Section 4 proposes our privacy-preserving kNN query processing algorithm. Section 5 proposes our parallel kNN query processing algorithm. Section 6 shows the security proof of our privacy-preserving kNN algorithms under semi-honest model. Section 7 presents the performance analysis of our kNN query processing algorithms. Finally, Sect. 8 concludes this paper.

2 Background and related work

2.1 Background

Importance of hiding data access patterns The data access pattern is one of the most important factors for privacy preservation in cloud computing. If an attacker possesses the order of data accesses or their frequency, he/she can infer the original data by using data access patterns. Therefore, hiding data access patterns is as important as encrypting data. First, in location-based service (LBS), one of the well-known queries is to find a nearby point of interest (POI) with a current user's location. For data protection, POI data are indexed, encrypted, and outsourced using a spatial index structure. For query protection, a user's location is encrypted and used for query processing. Because the query and POI data are encrypted, the exact location is not exposed to an attacker. However, by observing accesses to an index structure, an attacker can obtain data access patterns. By using data access patterns, the attacker can know where a query issuer is located and when he/she is in a specific area. As a result, if a user continuously issues queries while moving, an attacker can obtain his/her personal information, such as his/her moving trajectory and preference.

Second, in a healthcare service, a data mining technique for classifying patients based on their health information and symptom is widely used. For classifying patients, the service finds the most similar disease by getting accesses to the previously generated disease classification table. Because the patient’s health information is encrypted, an attacker cannot obtain sensitive information. However, an attacker can acquire data access patterns by repeatedly accessing the disease classification table with fake patients. As a result, by using the data access pattern, an attacker can infer what kind of disease an actual patient has when the patient information is given. Therefore, hiding data access patterns is very essential for privacy preservation in cloud computing.

Paillier cryptosystem The Paillier cryptosystem [26] is an additively homomorphic and probabilistic asymmetric encryption scheme for public key cryptography. The public key pk for encryption is given by (N, g), where N is the product of two large prime numbers p and q, and g is a generator in \({ }Z_{{N^{2} }}^{*}\). Here, \({ }Z_{{N^{2} }}^{*}\) denotes the multiplicative group of integers modulo N^2. The secret key sk for decryption is given by (p, q). Let E(·) denote the encryption function and D(·) denote the decryption function. The Paillier cryptosystem has the following properties.

  (1) Homomorphic addition The product of two ciphertexts E(m1) and E(m2) results in the encryption of the sum of m1 and m2 (Eq. 1).

$$ E\left( {m_{1} + m_{2} } \right) = E\left( {m_{1} } \right) \times E\left( {m_{2} } \right) {\text{mod}}\; N^{2} $$
(1)
  (2) Homomorphic multiplication The m2-th power of ciphertext E(m1) results in the encryption of the product of m1 and m2 (Eq. 2).

    $$ E\left( {m_{1} \times m_{2} } \right) = E\left( {m_{1} } \right)^{{m_{2} }} {\text{ mod}} \;N^{2} $$
    (2)
  (3) Semantic security Encrypting the same plaintext with the same public key results in distinct ciphertexts (Eq. 3).

    $$ m_{1} = m_{2} { \nRightarrow }E\left( {m_{1} } \right) = E\left( {m_{2} } \right) $$
    (3)

Therefore, an adversary cannot infer any information about the plaintexts.
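The three properties above can be checked with a toy implementation. The sketch below is a minimal, insecure Paillier variant (tiny key size, g = N + 1, a Fermat primality test) meant only to illustrate Eqs. (1)–(3); all function names are ours, not from the paper, and a real deployment would use a vetted library and 2048-bit keys.

```python
import math, random

def keygen(bits=256):
    # Toy key generation: two primes p, q with N = p*q and g = N + 1.
    def prime(b):
        while True:
            n = random.getrandbits(b) | (1 << b - 1) | 1
            if all(pow(a, n - 1, n) == 1 for a in (2, 3, 5, 7, 11, 13)):
                return n
    p, q = prime(bits // 2), prime(bits // 2)
    N = p * q
    lam = math.lcm(p - 1, q - 1)
    return (N, N + 1), (lam, N)

def encrypt(pk, m):
    # c = g^m * r^N mod N^2; fresh randomness r gives semantic security (Eq. 3).
    N, g = pk
    r = random.randrange(1, N)
    return (pow(g, m, N * N) * pow(r, N, N * N)) % (N * N)

def decrypt(pk, sk, c):
    N, _ = pk
    lam, _ = sk
    # L(x) = (x - 1) / N, then multiply by the modular inverse of L(g^lam).
    L = lambda x: (x - 1) // N
    mu = pow(L(pow(N + 1, lam, N * N)), -1, N)
    return (L(pow(c, lam, N * N)) * mu) % N

pk, sk = keygen()
N = pk[0]
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
assert decrypt(pk, sk, (c1 * c2) % (N * N)) == 42   # Eq. 1: homomorphic addition
assert decrypt(pk, sk, pow(c1, 3, N * N)) == 45     # Eq. 2: homomorphic multiplication
assert encrypt(pk, 15) != encrypt(pk, 15)           # Eq. 3: probabilistic encryption
```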

Yao’s garbled circuit Yao’s garbled circuits [25] allow two parties holding inputs x and y, respectively, to evaluate a function f(x, y) without leaking any information about the inputs beyond what is implied by the function output. One party generates an encrypted version of a circuit to compute f. The other party obliviously evaluates the output of the circuit without learning any intermediate values. Therefore, Yao's garbled circuit provides a high security level. Another benefit of Yao's garbled circuit is that it can provide high efficiency if a function can be realized with a reasonably small circuit.
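As a concrete illustration, the following sketch garbles a single AND gate: the generator encrypts each output-wire label under a hash of the two input-wire labels, and the evaluator, holding exactly one label per input wire, recovers only the matching output label. This is a simplified classroom construction (a zero-byte tag instead of point-and-permute, and no oblivious transfer for input delivery), not the optimized circuits used in practice.

```python
import os, hashlib, random

def garble_and_gate():
    # Two random 16-byte labels per wire; a 16-byte zero tag marks the
    # table row that decrypts correctly under the evaluator's labels.
    labels = {w: (os.urandom(16), os.urandom(16)) for w in ("a", "b", "out")}
    table = []
    for a in (0, 1):
        for b in (0, 1):
            pad = hashlib.sha256(labels["a"][a] + labels["b"][b]).digest()
            plain = labels["out"][a & b] + bytes(16)
            table.append(bytes(x ^ y for x, y in zip(plain, pad)))
    random.shuffle(table)  # hide which row encodes which input pair
    return labels, table

def evaluate(table, la, lb):
    # The evaluator holds one label per input wire and learns only the
    # output label, never the plaintext bits behind la and lb.
    pad = hashlib.sha256(la + lb).digest()
    for row in table:
        plain = bytes(x ^ y for x, y in zip(row, pad))
        if plain[16:] == bytes(16):
            return plain[:16]
    raise ValueError("no row decrypted")

labels, table = garble_and_gate()
assert evaluate(table, labels["a"][1], labels["b"][1]) == labels["out"][1]  # 1 AND 1
assert evaluate(table, labels["a"][1], labels["b"][0]) == labels["out"][0]  # 1 AND 0
```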

2.2 Related work

The typical kNN query processing schemes on encrypted databases are as follows. Wong et al. [18] processed a kNN query by devising an encryption scheme that supports distance comparison on the encrypted data. However, the scheme is vulnerable to chosen-plaintext attacks [27, 28] and cannot hide the data access pattern from the cloud. Yiu et al. [19] proposed a kNN query processing algorithm using the R-tree index [29] encrypted by AES [30]. However, the scheme has the drawback that most of the computation is performed on the user side rather than in the cloud. In addition, the data access pattern is not preserved because the user hierarchically requests the required R-tree nodes from the cloud. Hu et al. [20] proposed a kNN query processing algorithm using a provably secure privacy homomorphism encryption method. However, the user is in charge of index traversal during query processing. In addition, the scheme is known to be vulnerable to chosen-plaintext attacks and leaks the data access patterns. Zhu et al. [21] proposed a kNN query processing scheme that considers untrusted users. Because a user does not hold an encryption key, the data owner must encrypt the query. In addition, the cloud can learn the identifiers of the query result, which implies the leakage of the data access pattern.

Elmehdwi et al. [22] proposed the SkNNm scheme over the encrypted database. To the best of our knowledge, this is the first work that guarantees both data privacy and query privacy while hiding the data access pattern [14] at the same time. In addition, the data owner and the user do not participate in the query processing. However, the query processing cost of this scheme is extremely high because the scheme considers all of the encrypted data and makes use of secure protocols that take the encrypted binary representation of the data as inputs. Zhou et al. [23] proposed an asymmetric scalar-product-preserving encryption (ASPE) scheme based on Wong et al.'s work [18]. By using random asymmetric splitting with additional artificial dimensions, the scheme can resist known-plaintext attacks [28, 31]. In this scheme, the query issuers are fully trusted and the decryption key is partially revealed to them. However, the scheme cannot hide the data access pattern. Most recently, Kim et al. [24] proposed a kNN query processing scheme (SkNNI) using an encrypted index. The algorithm guarantees the confidentiality of both the data and the user query while hiding data access patterns. By filtering unnecessary data using a secure index mechanism, the algorithm provides better performance than SkNNm. However, the algorithm still requires a high computation cost because it uses secure protocols that take the encrypted binary representation of the data as inputs.

3 System architecture and secure protocols

3.1 System architecture

The typical types of adversaries are semi-honest and malicious [32]. In this paper, we consider the clouds as insider adversaries, who have more authority than outside attackers. In the semi-honest adversarial model, the cloud correctly follows the given protocol, but may try to obtain additional information beyond what it is allowed to learn. In the malicious adversarial model, the cloud can deviate from the protocol specification. However, protocols secure against malicious adversaries are inefficient, whereas protocols under the semi-honest model are practical and can be used as building blocks for designing protocols against malicious adversaries. Therefore, following earlier work [22, 24], we also adopt the semi-honest adversarial model. A secure protocol under the semi-honest adversarial model can be defined as follows.

Definition 1

Secure protocol Let \(\mathop \prod \nolimits_{i}^{ } \left( \pi \right)\) be an execution image of the protocol π at the Ci side and let ai and bi be the input and the output of the protocol π, respectively. Then, π is secure if \(\mathop \prod \nolimits_{i}^{ } \left( \pi \right)\) is computationally indistinguishable from the simulated image \(\mathop \prod \nolimits_{i}^{s} \left( \pi \right)\).

The system consists of four components: a data owner (DO), an authorized user (AU), and two clouds (CA and CB). The DO owns the original database (T) of n records [33,34,35]. A record ti (1 ≤ i ≤ n) consists of m attributes, where m is the number of data dimensions, and the jth attribute value of ti is denoted as ti,j (1 ≤ j ≤ m). The DO partitions T by using the kd-tree structure [36, 37] to provide indexing on T. The reason why we use the kd-tree as the index structure is to hide data access patterns. Using a space filling curve (e.g., Hilbert curve) to partition data items into blocks (nodes) can guarantee data locality, but it cannot guarantee that data items are evenly distributed over blocks. As a result, an attacker may identify a specific block based on the number of data items stored in it. In contrast, partitioning data items into blocks with the kd-tree makes the blocks indistinguishable from one another, because data items are evenly distributed over blocks even if the data are skewed. Meanwhile, while traversing the kd-tree structure in a hierarchical way, an attacker can learn which block is relevant to the query, which results in the leakage of data access patterns. To tackle this problem, our algorithm accesses only the leaf nodes of the kd-tree during the query processing step, rather than traversing the tree structure hierarchically.

Henceforth, a node refers to a leaf node. Let h denote the level of the constructed kd-tree and F be the fanout of each leaf node. A node is denoted by nodez (1 ≤ z ≤ 2^(h−1)), where 2^(h−1) is the total number of leaf nodes. The region information of nodez is represented by the lower bound lbz,j and the upper bound ubz,j (1 ≤ z ≤ 2^(h−1), 1 ≤ j ≤ m). Each node stores the identifiers (ids) of the data located inside the node region. To preserve data privacy, the DO encrypts T attribute-wise using the public key (pk) of the Paillier cryptosystem [26] before outsourcing the database. Therefore, the DO generates E(ti,j) for 1 ≤ i ≤ n and 1 ≤ j ≤ m by encrypting ti,j. The DO also encrypts the region information of all kd-tree nodes to support efficient query processing. Specifically, lb and ub of each node are encrypted attribute-wise such that E(lbz,j) and E(ubz,j) are generated for 1 ≤ z ≤ 2^(h−1) and 1 ≤ j ≤ m. We assume that CA and CB are non-colluding and semi-honest (or honest-but-curious) clouds. Thus, they correctly perform the given protocols and do not exchange unpermitted data. However, they may try to obtain additional information from the intermediate data while executing their own protocols. Although this assumption is not new, it has been used in related problem domains (for example, in [38]), as mentioned in earlier works [22, 24, 39].

In this paper, we consider privacy-preserving kNN query processing which retrieves k nearest data items that are closest to the given query. To support kNN query processing over the encrypted database, a secure multi-party computation (SMC) is required for privacy-preserving kNN query processing algorithm [40]. A secure multi-party computation can be defined as follows.

Definition 2

Secure multi-party computation A given number of participants p1, p2, …, pn (n ≥ 2) each hold private data d1, d2, …, dn, respectively. The participants want to compute the value of a public function on the private data, F(d1, d2, …, dn), while keeping their own inputs secret.

According to Definition 2, the proposed algorithm uses two clouds (i.e., CA and CB) because at least two parties are required for secure computation. Existing studies, such as Elmehdwi et al.'s work [22], Zhou et al.'s work [23], Wong et al.'s work [18] and Kim et al.'s work [24], also use two-party computation to support privacy-preserving kNN query processing. Thus, we do not consider a single-party computation model because it is vulnerable to semi-honest adversaries.

The DO outsources both the encrypted database and its encrypted index to the CA with pk, while the DO sends sk to the CB. The encrypted index includes the region information of each node in ciphertext and the ids of data located inside the node in plaintext. The DO also sends pk to AUs to enable them to encrypt a kNN query. When requesting a query, an AU first generates E(qj) by encrypting a query q attribute-wise for 1 ≤ j ≤ m. CA and CB cooperatively process the query and return a query result to the AU without data leakage.

As an example, assume that the DO has sixteen data items in two-dimensional space (x-axis and y-axis), as depicted in Fig. 1. The data items are partitioned into four kd-tree nodes: node1, node2, node3, and node4. To clarify the relationship between data items and nodes, we suppose that no data item lies on the boundary of a node. To outsource the database, the DO encrypts each data item and the region information of each node attribute-wise. The ith data item di is represented as <xi, yi> in two-dimensional space. Therefore, di can be encrypted to <E(xi), E(yi)> by using the Paillier cryptosystem. For example, d1 = <2, 1> is encrypted as E(d1) = <E(2), E(1)>, and the encrypted index is shown in Fig. 1.

Fig. 1
figure 1

Data items and kd-tree in two-dimensional space
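A median-split partitioning such as the one described above can be sketched as follows. The sixteen sample points and the `kd_partition` helper are illustrative assumptions, not the exact data of Fig. 1; the point is that median splits keep the leaves equally full even for skewed data.

```python
# Minimal kd-tree leaf partitioning: split alternately on x and y at the
# median until each block holds at most `fanout` points, so points are
# spread evenly over leaves even when the data are skewed.
def kd_partition(points, fanout=4, depth=0):
    if len(points) <= fanout:
        return [points]
    axis = depth % 2                      # alternate x-axis / y-axis
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kd_partition(pts[:mid], fanout, depth + 1) +
            kd_partition(pts[mid:], fanout, depth + 1))

points = [(2, 1), (1, 3), (3, 4), (4, 2), (6, 1), (7, 3), (8, 4), (5, 2),
          (2, 6), (1, 8), (3, 7), (4, 5), (6, 6), (7, 8), (8, 5), (5, 7)]
leaves = kd_partition(points)
assert len(leaves) == 4 and all(len(node) == 4 for node in leaves)
# Each leaf would then be outsourced as attribute-wise Paillier ciphertexts
# together with its encrypted lower/upper bounds (lb, ub).
```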

3.2 Enhanced secure protocols

Our kNN query processing algorithm is constructed using several secure protocols. We adopt four secure protocols from the literature [22, 24, 39]: secure multiplication (SM), secure bit-not (SBN), CoMPare-S (CMP-S), and secure minimum from a set of n values (SMINn). All of the protocols except SBN use the SMC technique between CA and CB, while SBN can be executed solely at the CA side. In addition, we propose three new secure protocols: enhanced secure squared Euclidean distance (ESSED), garbled circuit-based secure compare (GSCMP), and garbled circuit-based secure point enclosure (GSPE). Both GSCMP and GSPE rely on Yao's garbled circuits [25], described in Sect. 2.1, which provide a high security level together with high efficiency when the function can be realized with a reasonably small circuit [39]. Because our protocols do not take the encrypted binary representation of the data as inputs, contrary to the existing protocols [22, 24], they provide a low computation cost.

ESSED protocol: Suppose that there are two m-dimensional vectors \(\vec{X} = \left\{ {x_{1} ,x_{2} ,x_{3} , \ldots ,x_{m} } \right\}\) and \(\vec{Y} = \left\{ {y_{1} ,y_{2} ,y_{3} , \ldots ,y_{m} } \right\}\). The goal of the ESSED (enhanced secure squared Euclidean distance) protocol is to securely compute \(E\left( {\left| {\vec{X} - \vec{Y}} \right|^{2} } \right)\), where \(\left| {\vec{X} - \vec{Y}} \right|\) denotes the Euclidean distance between \(\vec{X}\) and \(\vec{Y}\). Note that \(\left| {\vec{X} - \vec{Y}} \right|^{2} = \mathop \sum \nolimits_{i = 1}^{m} \left( {x_{i} - y_{i} } \right)^{2} .\)

We utilize a data packing technique to enhance the efficiency of a secure protocol. Specifically, we pack λ σ-bit data values to generate one packed value. The overall procedure of ESSED is as follows. First, CA generates random numbers rj for 1 ≤ j ≤ m and packs them to obtain R using Eq. (4).

$$ R = \sum\limits_{j = 1}^{m} {r_{j} \times 2^{\sigma (m - j)} } $$
(4)

Then, CA generates E(R) by encrypting R. Second, CA calculates E(xj − yj) attribute-wise and packs these results to obtain E(v) using Eq. (5). Then, CA computes E(v) = E(v) × E(R) and sends E(v) to CB.

$$ E\left( v \right) \, = \mathop \prod \limits_{j = 1}^{m} E\left( {x_{j} - y_{j} } \right)^{{2^{{\sigma \left( {m - j} \right)}} }} $$
(5)

Third, CB acquires [x1 − y1 + r1 | … | xm − ym + rm] by decrypting E(v). CB obtains xj − yj + rj for 1 ≤ j ≤ m by unpacking v through v × 2^(−σ(m−j)). CB also calculates (xj − yj + rj)^2 attribute-wise and stores their sum into d. CB encrypts d and sends E(d) to CA. Finally, CA obtains \(E\left( {\left| {\vec{X} - \vec{Y}} \right|^{2} } \right)\) by eliminating the randomized values using Eq. (6).

$$ E\left( {\left| {\vec{X} - \vec{Y}} \right|^{2} } \right) = E\left( d \right) \times \mathop \prod \limits_{j = 1}^{m} \left( {E\left( {x_{j} - y_{j} } \right)^{ - 2rj} \times E(r_{j}^{2} )^{ - 1} } \right) $$
(6)

Our ESSED outperforms the existing distance computation protocol, i.e., the data packing-based secure squared Euclidean distance (DPSSED) [39]. Table 1 shows the difference between the existing DPSSED and our ESSED in terms of the number of encryptions. Our ESSED requires only one encryption on the CB side, while the existing DPSSED requires m encryptions. Therefore, our ESSED requires a total of two encryptions, whereas the existing DPSSED requires a total of m + 1 encryptions. In addition, our ESSED calculates the randomized distance in plaintext on the CB side, while the existing DPSSED computes the sum of the squared Euclidean distances over all attributes on ciphertext on the CA side. Therefore, the number of computations on encrypted data in our ESSED is greatly reduced compared with the existing DPSSED.

Table 1 Comparison between the DPSSED and our ESSED in terms of the number of encryptions
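The packing arithmetic of ESSED can be illustrated in plaintext (ciphertext operations omitted). The slot width `SIGMA`, the sample vectors, and the masks below are our assumptions; in the actual protocol the packing of Eq. (5) and the de-randomization of Eq. (6) happen over Paillier ciphertexts.

```python
SIGMA = 16                      # assumed bit width per packed slot
m = 3
x = [10, 200, 35]
y = [4, 120, 60]
r = [7, 11, 13]                 # CA's random masks

# CA packs the masked differences x_j - y_j + r_j into one value (Eqs. 4 and 5).
v = sum((x[j] - y[j] + r[j]) % (1 << SIGMA) << SIGMA * (m - 1 - j) for j in range(m))

# CB unpacks slot by slot, squares, and sums (step three of ESSED).
d = 0
for j in range(m):
    slot = (v >> SIGMA * (m - 1 - j)) & ((1 << SIGMA) - 1)
    if slot >= 1 << SIGMA - 1:          # interpret the slot as signed
        slot -= 1 << SIGMA
    d += slot * slot

# CA removes the randomization: d - sum(2*r_j*(x_j - y_j) + r_j^2)  (Eq. 6).
dist = d - sum(2 * r[j] * (x[j] - y[j]) + r[j] * r[j] for j in range(m))
assert dist == sum((x[j] - y[j]) ** 2 for j in range(m))   # squared Euclidean distance
```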

GSCMP protocol: Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, GSCMP (garbled circuit-based secure CoMPare) protocol returns the result as follows.

$$ {\text{GSCMP}}\left( {E\left( u \right), E\left( v \right)} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\; u \le v} \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$

The main difference between GSCMP and CMP-S is that GSCMP receives encrypted data as inputs, while CMP-S receives the randomized plaintext. Furthermore, in the case of CMP-S, plaintext is returned as a result, whereas GSCMP encrypts the result of CMP-S and sends it to CA. Through this, GSCMP can protect the data access patterns. The overall procedure of the GSCMP is as follows.

First, CA generates two random numbers ru and rv and encrypts them. CA computes E(m1) = E(u)^2 × E(ru) and E(m2) = E(v)^2 × E(1) × E(rv). For given input values u and v, Yao's garbled circuit returns one if u < v, returns zero if u > v, and returns a random value if u = v. To avoid returning a random value, our GSCMP protocol calculates u′ = 2 × u and v′ = 2 × v + 1 in Eqs. (7) and (8) while preserving the inequality. For example, when u = v = 3, GSCMP calculates u′ = 6 and v′ = 7. Therefore, our GSCMP protocol avoids returning a random value when u = v.

$$ E\left( {m_{1} } \right) = E\left( u \right)^{2} \times E\left( {ru} \right) = E\left( {2 \times u + ru} \right) $$
(7)
$$ E\left( {m_{2} } \right) = E\left( {\text{v}} \right)^{2} \times E\left( 1 \right) \times E\left( {rv} \right) = E\left( {2 \times v + rv + 1} \right) $$
(8)

Second, CA randomly chooses one functionality between F0: u ≥ v and F1: u < v. The selected functionality is oblivious to CB. Then, CA sends data to CB, depending on the selected functionality. If F0: u ≥ v is selected, CA sends < E(m2), E(m1) > to CB. If F1: u < v is selected, CA sends < E(m1), E(m2) > to CB.

Third, CB obtains < m2, m1 > by decrypting < E(m2), E(m1) > if F0: u ≥ v is selected. If F1: u < v is selected, CB obtains < m1, m2 > by decrypting < E(m1), E(m2) > .

Fourth, CA generates a garbled circuit consisting of two ADD circuits and one CMP circuit. Here, an ADD circuit takes two integers as input and outputs their sum, while a CMP circuit takes two integers u and v as input and outputs one if u < v, and zero otherwise. If F0: u ≥ v is selected, CA puts –rv and –ru into the first and second ADD gates, respectively. If F1: u < v is selected, CA puts –ru and –rv into the first and second ADD gates.

Fifth, if F0: u ≥ v is selected, CB puts m2 and m1 into the first and second ADD gates, respectively. If F1: u < v is selected, CB puts m1 and m2 into the first and second ADD gates.

Sixth, the first ADD gate adds its two input values and puts the output result1 into the CMP gate. Similarly, the second ADD gate puts the output result2 into the CMP gate. Seventh, the CMP gate outputs α = 1 if result1 < result2 is true, and α = 0 otherwise. The output of the CMP gate is returned to CB. Then, CB encrypts α and sends E(α) to CA. Since E(α) is an encrypted value, CA cannot identify the data received from CB. If CA received α from CB, CA could know which data are relevant to the query, which could lead to the exposure of the data access patterns. Therefore, it is necessary that CB sends E(α) rather than α to CA.

Finally, when the selected functionality is F0: u ≥ v, CA computes E(α) = SBN(E(α)) and returns the final E(α). If E(α) is E(1), u is less than or equal to v.
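The control flow of GSCMP can be summarized with a plaintext simulation, in which encryption, the garbled circuit, and SBN are replaced by their cleartext effects; the variable names are ours.

```python
import random

def gscmp_sim(u, v):
    # Plaintext simulation of the GSCMP flow (ciphertexts omitted).
    ru, rv = random.randrange(1000), random.randrange(1000)
    m1 = 2 * u + ru            # Eq. (7): doubling avoids the tie u' == v'
    m2 = 2 * v + 1 + rv        # Eq. (8)
    f = random.randrange(2)    # CA picks F0 (u >= v) or F1 (u < v) at random
    if f == 0:                 # F0: CB receives <m2, m1>; CA de-masks with <-rv, -ru>
        left, right = m2 - rv, m1 - ru
    else:                      # F1: CB receives <m1, m2>; de-masked with <-ru, -rv>
        left, right = m1 - ru, m2 - rv
    alpha = 1 if left < right else 0     # the CMP gate
    if f == 0:                           # SBN flips alpha for functionality F0
        alpha = 1 - alpha
    return alpha                         # 1 iff u <= v

assert gscmp_sim(3, 3) == 1   # equality handled by the doubling trick
assert gscmp_sim(2, 7) == 1
assert gscmp_sim(7, 2) == 0
```

Note that the returned flag is independent of which functionality was drawn, so CB learns nothing about the true comparison from the circuit output alone.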

GSPE protocol: Suppose that E(p) is an encrypted value of a point p and E(range) is a set of the encrypted values containing the E(range.lbj) and the E(range.ubj) for 1 ≤ j ≤ m (m is the data dimension). When E(p) and E(range) are given as inputs, GSPE (garbled circuit-based secure point enclosure) protocol returns the result as follows.

$$ {\text{GSPE}}\left( {E\left( p \right), E\left( {range} \right)} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\;range.lb \le p \le range.ub} \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$

If pj ≤ range.ubj and pj ≥ range.lbj for every dimension j, the point p is inside the range. To securely compare a point with a range, the GSPE protocol needs to add random values for all data dimensions. However, as the number of data dimensions increases, the number of data encryptions also increases. The GSPE protocol reduces the number of data encryptions by using a packing technique that transforms the m-dimensional data into one packed value.

The overall procedure of the GSPE is shown in Algorithm 1. First, CA generates two random numbers raj and rbj for 1 ≤ j ≤ 2m (lines 1–2). CA obtains PA and PB by packing raj and rbj, respectively, using Eq. (9) for 1 ≤ j ≤ 2m (line 3).

$$ {\text{PA}} = \mathop \sum \limits_{j = 1}^{2m} ra_{j} \times 2^{{\sigma \left( {2m - j} \right)}} ,\quad {\text{PB}} = \mathop \sum \limits_{j = 1}^{2m} rb_{j} \times 2^{{\sigma \left( {2m - j} \right)}} $$
(9)

Here, σ means the maximum bit length required to represent a data value. Then, CA generates E(PA) and E(PB) by encrypting PA and PB (line 4). Second, CA computes E(μj) = E(pj)^2 and E(ωj) = E(range.lbj)^2 for 1 ≤ j ≤ m. CA also computes E(δj) = E(pj)^2 × E(1) and E(τj) = E(range.ubj)^2 × E(1) for 1 ≤ j ≤ m (lines 5–8). Third, CA randomly chooses one of two functionalities, F0: u ≥ v and F1: u < v. Then, CA performs encrypted data packing for E(μj), E(τj), E(ωj) and E(δj) by using homomorphic multiplication and addition (Eqs. 1 and 2), depending on the selected functionality (lines 8–18). CA sends E(PA) and E(PB) to CB (line 19). Fourth, CB obtains PA and PB by decrypting E(PA) and E(PB) (line 20). CB stores xj ← \({\text{PA}} \times 2^{{ - \sigma \left( {2m - j} \right)}}\) for 1 ≤ j ≤ 2m by unpacking PA, while CB stores yj ← \({\text{PB}} \times 2^{{ - \sigma \left( {2m - j} \right)}}\) for 1 ≤ j ≤ 2m by unpacking PB (lines 21–23). Here, xj (or yj) is one of μj, τj, ωj, and δj.

Fifth, CA generates two ADD gates and one CMP gate (line 24). CA puts –raj and –rbj into the first and the second ADD gates, respectively, for 1 ≤ j ≤ 2m (lines 25–26). CB puts xj and yj into the first and the second ADD gates, respectively, for 1 ≤ j ≤ 2m (line 27). When –raj, –rbj, xj and yj are given, the result of the CMP gate, α′ = <α1′, α2′, …, α2m′>, is returned to CB (lines 28–29). Sixth, CB encrypts α′ and sends E(α′) to CA (line 30). Seventh, CA performs E(αj) = SBN(E(αj)) for 1 ≤ j ≤ 2m when the selected functionality is F0: u ≥ v (lines 32–33). Then, CA computes E(result) = SMR(E(result), E(αj)) for 1 ≤ j ≤ 2m, where the initial value of E(result) is E(1) (lines 31, 34). When all of the E(αj) for 1 ≤ j ≤ 2m are E(1), E(result) remains E(1). Finally, GSPE returns E(result) (line 35).

Algorithm 1
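Stripping away the encryption and the packing transport, the comparison logic of GSPE reduces to 2m doubled comparisons followed by an AND chain, as the following plaintext sketch (our naming) shows.

```python
def gspe_sim(p, lb, ub):
    # Plaintext sketch of GSPE: 2m comparisons, one per bound per dimension,
    # then the AND of all flags (the SMR chain of lines 31-34).
    m = len(p)
    alphas = []
    for j in range(m):
        alphas.append(1 if 2 * p[j] < 2 * ub[j] + 1 else 0)   # p_j <= ub_j
        alphas.append(1 if 2 * lb[j] < 2 * p[j] + 1 else 0)   # lb_j <= p_j
    result = 1
    for a in alphas:          # E(result) = SMR(E(result), E(alpha_j))
        result *= a
    return result

assert gspe_sim((3, 4), (0, 0), (5, 5)) == 1   # inside the range
assert gspe_sim((6, 4), (0, 0), (5, 5)) == 0   # outside on the x-axis
assert gspe_sim((5, 5), (0, 0), (5, 5)) == 1   # boundary counts as inside
```

The doubling in each comparison mirrors Eqs. (7) and (8): it removes the ambiguous u = v case without changing the order of any pair.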

3.3 Secure protocols using encrypted random value pool

While processing a query in our secure system, CB decrypts the received ciphertexts. Thus, we need to prevent CB from extracting meaningful information while executing secure protocols. For this, CA generates a random value r from ZN and encrypts it using the Paillier cryptosystem. Then, CA adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing E(m + r) = E(m) × E(r). Because m + r is independent of m, CB cannot obtain meaningful information by decryption. However, in the Paillier cryptosystem, the process of adding a random value to the ciphertext leads to performance degradation because both encryption and decryption require a much higher computational cost than the other operations on encrypted data. Therefore, we propose an encrypted random value pool to reduce the computational cost of ciphertext generation. First, in a preprocessing phase, we generate random ciphertexts and store them in an encrypted random value pool. Second, while processing a query, CA selects a random ciphertext from the encrypted random value pool whenever a secure protocol is called. Therefore, CA not only prevents CB from extracting meaningful information while executing a secure protocol, but also reduces the cost of generating encrypted random values. We apply the encrypted random value pool to the SM protocol [22] and our GSCMP protocol. In SM and GSCMP, CA generates two encrypted random values before sending the ciphertext to CB. According to Table 2, we can reduce the number of encryption operations by 67% by using the encrypted random value pool.

Table 2 Comparison of the number of encryption operations in secure protocols
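A minimal sketch of the pool idea, under our own naming and with a deliberately tiny modulus: the expensive Paillier encryptions of random values are produced in a preprocessing phase and merely popped at query time.

```python
import random
from collections import deque

class EncryptedRandomPool:
    # Preprocessing-phase pool of Paillier encryptions of random values.
    # Toy parameters only; a real deployment uses a 2048-bit N and a CSPRNG.
    def __init__(self, N, size):
        self.N, self.N2 = N, N * N
        self.pool = deque()
        for _ in range(size):                       # preprocessing phase
            r = random.randrange(1, N)
            s = random.randrange(1, N)
            # Paillier with g = N + 1: E(r) = (1 + r*N) * s^N mod N^2
            c = (1 + r * self.N) * pow(s, N, self.N2) % self.N2
            self.pool.append((r, c))

    def draw(self):
        # Query phase: pop a ready-made (r, E(r)) instead of encrypting online.
        return self.pool.popleft()

# Example: mask a ciphertext E(m) as E(m + r) = E(m) * E(r) mod N^2 at almost
# no cost, since E(r) was produced ahead of time.
N = 3259 * 4931                                    # tiny illustrative modulus
pool = EncryptedRandomPool(N, size=8)
r, enc_r = pool.draw()
assert len(pool.pool) == 7
```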

Secure multiplication protocol using encrypted random value pool (SMR): Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, SMR protocol returns the result as follows.

$$ {\text{SMR}}\left( {E\left( u \right), E\left( v \right)} \right) = E\left( {u \times v} \right) $$

SMR protocol is shown in Algorithm 2. When two encrypted values E(u) and E(v) are given as inputs, CA selects two random ciphertexts E(ra) and E(rb) from the encrypted random value pool (line 1). The rest of the SMR protocol is the same as the existing SM protocol (lines 2–6).

Algorithm 2

Garbled secure compare protocol using encrypted random value pool (GSCR): Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, the GSCR protocol returns the same result as our GSCMP:

$$ {\text{GSCR}}\left( {E\left( u \right), E\left( v \right)} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\, u \le v} \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$

The only difference between GSCR and GSCMP is that GSCR selects a random ciphertext from the encrypted random value pool instead of generating an encrypted random value on the fly.

4 kNN query processing algorithm

In this section, we present our kNN query processing algorithm (SkNNG) that uses Yao’s garbled circuit [25]. The algorithm consists of three phases: encrypted kd-tree search, kNN retrieval, and kNN result refinement.

4.1 Candidate node search phase

In the encrypted kd-tree search phase, CA securely extracts all of the data from the node containing a query point while hiding the data access patterns. The procedure of the encrypted kd-tree search phase is shown in Algorithm 3. First, CA securely finds the nodes that include the query by executing E(αz) = GSPE(E(q), E(nodez)) for 1 ≤ z ≤ #_of_node, where #_of_node denotes the total number of kd-tree leaf nodes (lines 1–2). The GSPE results for all nodes are stored in E(α) = {E(α1), E(α2), …, E(α#_of_node)}. By utilizing GSPE, our kNN query processing algorithm achieves better performance than the existing algorithms [22, 24] because we can avoid the operations related to the SBD protocol, which causes a high computation overhead. Then, we perform lines 8–24 of the index search algorithm in [24]. Second, CA generates E(α′) by shuffling E(α) using a random shuffling function π and sends E(α′) to CB (lines 3–4).

[Algorithm 3]

Third, CB obtains α′ by decrypting E(α′), counts the number of entries with α′ = 1, and stores the count in c (lines 5–6). Here, c denotes the number of nodes related to the query. Fourth, CB creates c node groups (NGs) (lines 7–11). CB assigns to each NG one node with α′ = 1 and #_of_node/c − 1 nodes with α′ = 0. Then, CB obtains NG′ by randomly shuffling the ids of the nodes in each NG and sends NG′ to CA. Fifth, CA obtains NG* by shuffling the ids of the nodes in each NG′ using π−1 (lines 12–13). Finally, CA accesses one datum in each node of each NG* and performs E(t′i,j) = SMR(E(nodez.datas,j), E(αz)), where 1 ≤ s ≤ FanOut and 1 ≤ j ≤ m (lines 14–22). Here, E(αz) is the GSPE result corresponding to nodez. If a node contains fewer data items than FanOut, SMR is performed with E(max) instead of E(nodez.datas,j), where max is the largest value in the domain. When CA has accessed one datum from every node in an NG*, CA computes E(candcnt,j) = \({ }\mathop \prod \nolimits_{i = 1}^{num} E(t^{\prime}_{i,j} )\), where num denotes the total number of nodes in the selected NG*.

As a result, the data items in the node related to the query are securely extracted without revealing the data access patterns [5, 14] because the searched nodes are not revealed. By repeating these steps, all of the data in the node are safely stored in E(candi,j) for 1 ≤ i ≤ cnt and 1 ≤ j ≤ m, where cnt denotes the total number of data items extracted during the index search. Figure 2 shows an example of the candidate node search phase. The example uses the data items and the kd-tree structure shown in Fig. 1. First, CA performs GSPE between E(q) and E(Nodei.Range) for 1 ≤ i ≤ #_of_node. CA stores the GSPE results in E(α) and sends them to CB. For example, in Fig. 2, CA performs GSPE({<E(0), E(0)>, <E(4), E(5)>}, <E(6), E(1)>) between E(Node1.Range) and E(q), and stores the GSPE result, i.e., E(0), into E(α1) for Node1. Second, CA shuffles the sequence of {<Node1, E(α1)>, <Node2, E(α2)>, …, <Noden, E(αn)>} and replaces the shuffled node ids with new ids so as to hide the original node ids from CB. To recover the original sequence of node ids later, CA records the shuffled sequence. For example, in Fig. 2, the original sequence {<Node1, E(0)>, <Node2, E(1)>, <Node3, E(0)>, <Node4, E(0)>} is shuffled to {<Node4, E(0)>, <Node1, E(0)>, <Node2, E(1)>, <Node3, E(0)>}. Then, CA renames Node4, Node1, Node2, and Node3 as PN1, PN2, PN3, and PN4, respectively. As a result, the shuffled sequence is {<PN1, E(0)>, <PN2, E(0)>, <PN3, E(1)>, <PN4, E(0)>}, which CA sends to CB. Third, CB receives the shuffled sequence and decrypts it. For example, in Fig. 2, CB receives {<PN1, E(0)>, <PN2, E(0)>, <PN3, E(1)>, <PN4, E(0)>} and obtains {<PN1, 0>, <PN2, 0>, <PN3, 1>, <PN4, 0>} by decrypting it. To generate the node groups (NGs), CB counts the number of 1s in the sequence. Each NG has one seed node whose αp equals 1, where 1 ≤ p ≤ #_of_node. Therefore, there are as many NGs as there are 1s in the sequence. The nodes whose αp equals 0, where 1 ≤ p ≤ #_of_node, are evenly assigned to the generated NGs. CB then sends the generated NGs to CA. For example, CB counts the 1s in the sequence {<PN1, 0>, <PN2, 0>, <PN3, 1>, <PN4, 0>} and generates NG1 with the seed PN3. The nodes <PN1, 0>, <PN2, 0>, and <PN4, 0> are assigned to NG1. CB sends NG1 = {PN3, PN1, PN2, PN4} to CA. Fourth, CA recovers the original node ids from the received NGs by using the recorded shuffled sequence of node ids. For example, in Fig. 2, CA obtains NG1′ = {Node2, Node4, Node1, Node3} as the original node ids by using both the received NG1 = {PN3, PN1, PN2, PN4} and the shuffled sequence of node ids. Fifth, CA performs the SMR protocol between E(α) and the encrypted data items in each node group, and builds a candidate set by summing all of the SMR results. In Fig. 2, CA performs SMR(E(0), E(Node1.Data)), SMR(E(1), E(Node2.Data)), SMR(E(0), E(Node3.Data)), and SMR(E(0), E(Node4.Data)). The SMR results are E(0), E(d5), E(0), and E(0); CA sums them and stores E(d5) into the candidate set. Sixth, for each NG, the SMR protocol and the summation of results are performed as many times as the number of data items in each node. For example, in Fig. 2, CA performs the summation of results four times. As a result, CA obtains {E(d5), E(d6), E(d7), E(d8)} as the candidate set.
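Ignoring the encryption, the extraction step above reduces to a masked sum: multiplying each node's data by its indicator α and adding the results leaves exactly the matching node's values, since only one α per group equals 1. A plaintext sketch of this arithmetic (the node contents below are illustrative, not the paper's exact dataset):

```python
# Plaintext view of the masked extraction (lines 14-22 of Algorithm 3):
# cand[s][j] = sum over nodes z of alpha[z] * node[z].data[s][j].
alpha = [0, 1, 0, 0]                  # GSPE result: only Node2 contains q
nodes = [                              # FanOut = 2 data items per node, m = 2
    [(1, 1), (2, 3)],                  # Node1
    [(5, 2), (6, 4)],                  # Node2  <- the matching node
    [(7, 7), (8, 6)],                  # Node3
    [(9, 9), (9, 8)],                  # Node4
]
fanout, m = 2, 2
cand = [tuple(sum(alpha[z] * nodes[z][s][j] for z in range(len(nodes)))
              for j in range(m))
        for s in range(fanout)]
assert cand == [(5, 2), (6, 4)]        # only Node2's items survive
```

In the protocol the multiplications are SMR calls and the sum is a product of ciphertexts, so the same cancellation happens without anyone seeing which α equals 1.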

Fig. 2. An example of the candidate node search phase

4.2 kNN retrieval phase

In the kNN retrieval phase, we retrieve the k nearest neighbors of the given query by partially utilizing the SkNNm scheme [22]. However, we only consider E(candi) for 1 ≤ i ≤ cnt, which were extracted in the index search phase, whereas SkNNm considers all of the encrypted data. In addition, we use our efficient secure protocols, which require relatively low computation costs, instead of the existing expensive secure protocols such as SBD (secure bit decomposition) [22, 24]. The overall procedure of the kNN retrieval algorithm is as follows. First, the algorithm calculates the distance between each encrypted data item and the encrypted query without decrypting either of them. Second, the algorithm finds the minimum distance (distmin) among the calculated distances. It cannot know which data item has distmin, owing to the semantic security of the Paillier cryptosystem. Third, to obtain the encrypted data item with distmin, the algorithm subtracts each calculated distance from distmin; the data item with distmin yields E(zero) as the result of the subtraction. Here, E(zero) is the only value that is not changed by the homomorphic multiplication of the Paillier cryptosystem. Therefore, the algorithm can distinguish the nearest neighbor from the others, while an attacker cannot determine which data item has the minimum distance. Fourth, to hide the original data items from the attacker, the algorithm homomorphically multiplies each subtraction result by a random value. It also shuffles the sequence of the multiplication results to hide the data access patterns. Finally, by using our secure protocols, the algorithm finds the nearest neighbor and repeats the above process until the k nearest neighbors are found.
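Stripped of the encryption, one round of this retrieval reduces to the following arithmetic, including the Eq. (10)-style update d = V·max + (1 − V)·d that disables the found neighbor. This is a plaintext sketch with our own names; it assumes all candidate distances are distinct for simplicity:

```python
import random

# Plaintext analogue of the retrieval rounds: find the minimum distance,
# flag it via blinded differences, extract the matching candidate by a
# masked sum, then set its distance to max so it is not picked again.
def knn_rounds(cands, q, k, max_dist=10**6):
    dist = [sum((c - s) ** 2 for c, s in zip(cand, q)) for cand in cands]
    result = []
    for _ in range(k):
        d_min = min(dist)                                # SMINn
        tau = [(d_min - d) * random.randrange(1, 100)    # blinded differences
               for d in dist]
        V = [1 if t == 0 else 0 for t in tau]            # CB flags the zero entry
        result.append(tuple(sum(V[i] * cands[i][j] for i in range(len(cands)))
                            for j in range(len(q))))     # masked extraction
        dist = [V[i] * max_dist + (1 - V[i]) * dist[i]   # Eq. (10)-style update
                for i in range(len(dist))]
    return result

# Fig. 3's running example: q = (6, 1), candidates d5..d8.
print(knn_rounds([(5, 2), (6, 4), (8, 3), (9, 4)], (6, 1), 2))
# -> [(5, 2), (8, 3)]
```

The random factor on each difference is what lets CB test for zero without learning the actual distance gaps.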

The pseudocode of the kNN retrieval phase is shown in Algorithm 4. First, using ESSED, CA securely calculates the squared Euclidean distance E(disti) between the query and E(candi) for 1 ≤ i ≤ cnt (lines 1–2). Second, CA performs SMINn to find the minimum value E(distmin) among E(disti) for 1 ≤ i ≤ cnt (lines 3–4). Third, CA calculates E(τi) = E(distmin) × E(disti)N−1 for 1 ≤ i ≤ cnt and computes E(τiʹ) = E(τi)ri. CA obtains E(β) by shuffling E(τʹ) using a random shuffling function π and sends E(β) to CB (lines 4–9). For example, E(τʹ) is calculated as {E(0), E(−r)}, where r denotes a random number; the E(0) corresponds to E(distmin). Assuming that π reverses the order of the data, CA sends E(β) = {E(−r), E(0)} to CB. Fourth, after decrypting E(β), CB sets E(Ui) = E(1) if βi = 0 and E(Ui) = E(0) otherwise, and sends E(U) to CA (lines 10–13). Fifth, CA obtains E(V) by shuffling E(U) using π−1. Then, CA performs the SMR protocol on E(Vi) and E(candi,j) to obtain E(Vʹi,j) (lines 14–17). Sixth, by computing E(nns,j) = \(\mathop \prod \limits_{i = 1}^{cnt} E(V^{\prime}_{i,j} )\) for 1 ≤ j ≤ m, CA can securely extract the datum corresponding to E(distmin) (line 18). Finally, to prevent the selected result from being selected again in a later round, CA securely updates the distance of the selected result to E(max) by computing Eq. (10) (lines 19–22).

$$ E\left( {d_{i} } \right) \, = {\text{ SMR}}\left( {E\left( {V_{i} } \right),E\left( {max} \right)} \right) \, \times {\text{ SMR}}\left( {{\text{SBN}}\left( {E\left( {V_{i} } \right)} \right),E\left( {d_{i} } \right)} \right) $$
(10)

Because only the selected result has E(Vi) = E(1), the E(disti) corresponding to the datum selected in the current round becomes E(max), while the other values remain the same. This procedure is repeated for k rounds to find the kNN result.

[Algorithm 4]

Figure 3 shows an example of the kNN retrieval phase. The example uses the data items and the kd-tree structure shown in Fig. 1. For simplicity and clarity, the shuffling function π is omitted. First, CA calculates the distances using ESSED and stores the ESSED results in E(disti) for 1 ≤ i ≤ cnt (①). In Fig. 3, CA performs ESSED(E(d5), <E(6), E(1)>) for E(d5) and E(q), and stores the ESSED result, i.e., E(2), into E(dist5). Second, the minimum value is computed using SMINn and the SMINn result is stored in E(distmin) (②). In Fig. 3, CA performs SMINn(E(2), E(9), E(8), E(18)) to obtain the minimum distance among E(dist5), E(dist6), E(dist7), and E(dist8), and stores the SMINn result, i.e., E(2), into E(distmin). Third, to obtain the encrypted data item with the minimum distance to the given query, CA computes E(distmin − disti) and stores the result in E(\(\tau_{i}\)) for 1 ≤ i ≤ cnt (③). When distmin is the same as disti, E(0) is stored in E(\(\tau_{i}\)). In Fig. 3, for E(dist5), CA computes E(2 − 2) and stores the result, i.e., E(0), into E(\(\tau_{5}\)). Fourth, to protect the value of E(\(\tau_{i}\)) for 1 ≤ i ≤ cnt, CA homomorphically multiplies E(\(\tau_{i}\)) by a random value (④). In Fig. 3, for E(\(\tau_{6}\)), CA computes E(−7 × 3) with the random value 3 and stores the result, i.e., E(−21), into E(\(\tau_{6} ^{\prime}\)). Fifth, CB returns E(1) as E(Vi) if E(\(\tau_{i} {^{\prime}}\)) is E(0) for 1 ≤ i ≤ cnt; otherwise, it returns E(0) as E(Vi) (⑤). Sixth, CA obtains the nearest neighbor by performing SMR between E(Vi) and E(candi) for 1 ≤ i ≤ cnt and merging the SMR results (⑥–⑦). In Fig. 3, CA performs SMR(E(1), E(5)), SMR(E(0), E(6)), SMR(E(0), E(8)), and SMR(E(0), E(9)) for the x-axis and SMR(E(1), E(2)), SMR(E(0), E(4)), SMR(E(0), E(3)), and SMR(E(0), E(4)) for the y-axis. CA merges E(5), E(0), E(0), and E(0) for the x-axis and E(2), E(0), E(0), and E(0) for the y-axis. As a result, CA obtains <E(5), E(2)> as the nearest neighbor. Seventh, by using Eq. (10), CA sets the distance of the found nearest neighbor to the maximum value so that it avoids finding the same nearest neighbor again in the next round (⑧). In Fig. 3, CA performs SMR(E(1), E(max)) × SMR(E(0), E(2)) for E(d5) and stores E(max) into E(dist5). Finally, CA repeats the above process until the k nearest neighbors are found (②–⑦).

Fig. 3. An example of the kNN retrieval phase

4.3 kNN result refinement phase

As mentioned in [24], the result of the kNN retrieval phase may not be accurate because the candidates are extracted from only one leaf node in the index search phase. Therefore, the kNN result refinement is necessary to confirm the correctness of the current query result. Specifically, assuming that the squared Euclidean distance between the kth closest result E(nnk) and the query is distk, the neighboring kd-tree nodes need to be searched to find data with a shorter distance than distk. For this reason, we use the concept of the shortest point (sp) defined in [24]. The sp is the point in a given node whose distance to a given point p is the smallest among all points in the node. To find the sp in each node, we use the following properties described in [24]. (i) If both the lower bound (lb) and the upper bound (ub) of the region are less than p, the ub becomes the sp of the region. (ii) If both the lb and the ub of the region are greater than p, the lb becomes the sp of the region. (iii) If p is between the lb and the ub of the region, p itself is the sp of the region. Since these properties can be applied to each dimension independently, our kNN result refinement phase partially utilizes that of the existing algorithms [19, 21]. However, to reduce the computation cost, we do not use the existing expensive protocols, such as SBD, SSED, SCMP, and SPE [22, 24].
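Properties (i)–(iii) together say that, dimension by dimension, the sp is just the query clamped to the node's bounding box. In plaintext (a sketch with our own names):

```python
# Per dimension: ub if p lies above the region, lb if below, p itself if inside.
# This is exactly a clamp of p into [lb, ub].
def shortest_point(p, lb, ub):
    return tuple(min(max(pj, lj), uj) for pj, lj, uj in zip(p, lb, ub))

# A node covering [0,4] x [0,5] and the query (6, 1) from the running example:
assert shortest_point((6, 1), (0, 0), (4, 5)) == (4, 1)
```

The secure version computes the same selection obliviously with GSCMP, SMR, and SBN, so that neither cloud learns which of the three cases applied.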

[Algorithm 5]

The procedure of the kNN result refinement phase is shown in Algorithm 5. First, CA computes E(distk) = ESSED(E(q), E(nnk)) to obtain the squared Euclidean distance between the query and the kth closest result, which is returned from the kNN retrieval phase (line 1). Second, for each node, CA performs GSCMP on E(qj) and E(nodez.lbj) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m and stores the result in E(ψ1). CA also performs GSCMP on E(qj) and E(nodez.ubj) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m and stores the result in E(ψ2) (lines 2–5). When the value of E(qj) is equal to or less than E(lbj) (resp. E(ubj)), E(ψ1) (resp. E(ψ2)) has the value E(1). Then, CA obtains E(ψ3) by computing E(ψ1) × E(ψ2) × SMR(E(ψ1), E(ψ2))N−2 so as to acquire the result of the bit-xor operation between E(ψ1) and E(ψ2) (line 6). Note that the exponent “−2” is equivalent to “N − 2” under ZN. Third, CA securely obtains the shortest point of each node, that is, E(spz,j), by computing SMR(E(ψ3), E(qj)) × SMR(SBN(E(ψ3)), f(E(lbz,j), E(ubz,j))) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m. Here, f(E(lbz,j), E(ubz,j)) is obtained by computing SMR(E(ψ1), E(lbz,j)) × SMR(SBN(E(ψ1)), E(ubz,j)) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m (lines 7–10). Fourth, CA calculates E(spdistz), the squared Euclidean distance between the query and E(spz), for 1 ≤ z ≤ numnode by using ESSED. In addition, CA securely updates the E(spdistz) of the nodes that were already retrieved in the index search phase to E(max) by computing E(spdistz) = SMR(E(αz), E(max)) × SMR(SBN(E(αz)), E(spdistz)) (lines 11–13). Here, E(αz) is the value returned by GSPE in the index search phase. Then, CA performs E(αz) = GSCMP(E(spdistz), E(distk)) (line 14). If E(spdistz) is less than E(distk), the corresponding nodez is assigned E(αz) = E(1). The nodes with E(αz) = E(1) need to be retrieved for the kNN result refinement. The number of nodes to expand depends on how many E(αz) become E(1). If the number of 1s among the E(αz) is c, then c node groups are created in CB and CA extracts the data of each node group. Therefore, cnt becomes c × fanout.

Because the E(spdistz) values of the nodes retrieved in the index search phase are E(max), those nodes are safely pruned. Fifth, CA securely extracts the data stored in the nodes with E(αz) = E(1) by performing the index search using E(α) and appends them to E(nn) (lines 15–16). Then, CA executes the kNN search phase on E(nn) to obtain the final kNN result E(resulti) for 1 ≤ i ≤ k (line 17). In the running example, the final result becomes {E(nn1), E(nn5)} because the squared Euclidean distance of E(nn5) is E(4). Sixth, CA returns the decrypted result to AU in cooperation with CB, to reduce the computation overhead on the AU side. To do this, CA computes E(γi,j) = E(resulti,j) × E(ri,j) for 1 ≤ i ≤ k and 1 ≤ j ≤ m by using a random value ri,j. Then, CA sends E(γi,j) to CB and ri,j to AU (lines 18–22). CB then decrypts E(γi,j) and sends the decrypted value to AU (lines 23–26). Finally, AU obtains the actual kNN result by computing γi,j − ri,j in plaintext (lines 27–29).
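The bit manipulations in this phase rest on two arithmetic identities that Paillier can evaluate homomorphically: ψ1 XOR ψ2 = ψ1 + ψ2 − 2·ψ1·ψ2 (the E(ψ1) × E(ψ2) × SMR(E(ψ1), E(ψ2))^(N−2) term, since −2 is N − 2 mod N), and the oblivious select b·x + (1 − b)·y (the SMR/SBN pattern). A plaintext sketch of both (names ours):

```python
# psi1 XOR psi2 = psi1 + psi2 - 2*psi1*psi2; over ciphertexts, "-2" becomes
# the exponent N - 2 mod N.
def bit_xor(a, b):
    return a + b - 2 * a * b

# Oblivious select: returns x when bit == 1 and y when bit == 0.
# Homomorphically this is SMR(E(bit), E(x)) * SMR(SBN(E(bit)), E(y)).
def select(bit, x, y):
    return bit * x + (1 - bit) * y

assert [bit_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
assert (select(1, 7, 3), select(0, 7, 3)) == (7, 3)
```

Both the shortest-point computation (ψ3 selecting between qj and f(lb, ub)) and the distance update in lines 11–13 are instances of this select pattern.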

5 Parallel kNN query processing algorithm

5.1 Parallel encrypted kd-tree search phase

In the parallel encrypted kd-tree search phase, CA simultaneously extracts all of the data from the node containing a query point. To extend the encrypted kd-tree search phase to a parallel environment, we use a thread pool that stores tasks in a queue so that threads can process the tasks in parallel. The procedure of the parallel encrypted kd-tree search phase is shown in Algorithm 6. First, CA generates a queue-based thread pool (line 1). Whenever a thread in the pool is available, it processes a task in FIFO order. Second, CA pushes the tasks GSPE(E(q), E(nodei)) for 1 ≤ i ≤ #_of_node to the thread pool. The GSPE results are stored in E(α) = {E(α1), E(α2), …, E(α#_of_node)} (lines 2–3). Third, CA generates E(α′) by shuffling E(α) using a random shuffling function π and sends E(α′) to CB (lines 4–5). Fourth, CB performs the same procedure as in Algorithm 3 (line 6). Fifth, CA obtains NG* by shuffling the ids of the nodes in each NG′ using π−1 (line 7). Finally, CA accesses one datum in each node of each NG* and pushes both E(t′i,j) = SMR(E(nodez.datas,j), E(αz)) and E(candcnt+s,j) = E(candcnt+s,j) × E(t′i,j) to the thread pool, where 1 ≤ s ≤ FanOut and 1 ≤ j ≤ m (lines 8–18).
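A minimal sketch of this queue-based task submission, with Python's ThreadPoolExecutor standing in for the pool and a placeholder gspe function operating on plaintexts (the real GSPE runs on ciphertexts; all names here are ours):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the GSPE protocol: returns 1 iff the (plaintext) query falls
# inside the node's bounding box. The real protocol does this obliviously.
def gspe(q, node):
    lb, ub = node
    return int(all(l <= x <= u for x, l, u in zip(q, lb, ub)))

nodes = [((0, 0), (4, 5)), ((5, 0), (9, 5)), ((0, 6), (4, 9)), ((5, 6), (9, 9))]
q = (6, 1)

# Line 1: create the pool; lines 2-3: push one GSPE task per leaf node.
# map() queues the tasks and idle threads consume them in FIFO order.
with ThreadPoolExecutor(max_workers=4) as executor:
    alpha = list(executor.map(lambda nd: gspe(q, nd), nodes))

assert alpha == [0, 1, 0, 0]   # only the node containing q is flagged
```

Since each GSPE call touches a different node, the tasks are independent and parallelize without synchronization beyond collecting the results.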

[Algorithm 6]

5.2 Parallel kNN retrieval phase

In the parallel kNN retrieval phase, we retrieve the k closest data items to the query in parallel by partially utilizing the SkNNm scheme [22]. We consider E(candi) for 1 ≤ i ≤ cnt, which were extracted in the parallel index search phase. The procedure of the parallel kNN retrieval phase is shown in Algorithm 7. First, using ESSED, CA simultaneously calculates the squared Euclidean distances E(di) between the query and E(candi) for 1 ≤ i ≤ cnt (lines 1–2). Second, CA performs SMINn to find the minimum value E(dmin) among E(di) for 1 ≤ i ≤ cnt (lines 3–4). Third, CA simultaneously calculates both E(τi) = E(dmin) × E(di)N−1 and E(τiʹ) = E(τi)ri for 1 ≤ i ≤ cnt (lines 5–7). CA obtains E(β) by shuffling E(τʹ) using a random shuffling function π and sends E(β) to CB (lines 8–9). Fourth, after decrypting E(β), CB sets E(Ui) = E(1) if βi = 0 and E(Ui) = E(0) otherwise, and sends E(U) to CA (line 10). Fifth, CA obtains E(V) by shuffling E(U) using π−1 (line 11). Sixth, instead of the SM protocol, CA simultaneously performs the SMR protocol on E(Vi) and E(candi,j) to obtain E(Vʹi,j) (lines 12–16). Seventh, by computing E(nns,j) = \(\mathop \prod \nolimits_{i = 1}^{cnt} E(V^{\prime}_{i,j} )\) for 1 ≤ j ≤ m, CA extracts the datum corresponding to E(dmin) (lines 17–18). Finally, CA simultaneously updates the distance of the selected result to E(max) by computing Eq. (11) (lines 19–24).

$$ E\left( {d_{i} } \right) \, = {\text{ SMR}}\left( {E\left( {V_{i} } \right), \, E\left( {max} \right)} \right) \, \times {\text{ SMR}}\left( {{\text{SBN}}\left( {E\left( {V_{i} } \right)} \right), \, E\left( {d_{i} } \right)} \right) $$
(11)
[Algorithm 7]

5.3 Parallel kNN result refinement phase

In the parallel kNN result refinement phase, CA checks in parallel whether the kNN results are correct. If not, CA performs both the index search phase and the kNN retrieval phase again. The procedure of the parallel kNN result refinement phase is shown in Algorithm 8. First, CA computes E(distk) = ESSED(E(q), E(nnk)) to obtain the squared Euclidean distance between the query and the kth closest result, which is returned from the kNN retrieval phase (line 1). Second, CA simultaneously finds the nodes that are closer than distk by using both the SMR and GSCR protocols (lines 2–16). Third, CA performs lines 15–22 of Algorithm 5 (line 17). Fourth, CB decrypts E(γi,j) and sends the decrypted value to AU (lines 18–21). Finally, AU obtains the actual kNN result by computing γi,j − ri,j in plaintext (lines 22–24).

[Algorithm 8]

6 Security proof under semi-honest model

As mentioned above, the proposed privacy-preserving kNN algorithm operates under a semi-honest attack model. Therefore, the security proof of the proposed algorithm is given from the three viewpoints of CA, CB, and AU (the authorized user), which are the acting parties other than the data owner. In addition, the following lemmas are used in our security proof.

Lemma 1

If a random element r is uniformly distributed on ZN and independent from any variable x \(\in\) ZN, then r ± x is also uniformly random and independent from x.
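Lemma 1 can be checked empirically for a toy modulus: shifting a fixed x by every residue r of ZN hits every residue of ZN exactly once, so a uniform r yields a uniform x + r (a sketch; the modulus and values are ours):

```python
from collections import Counter

N, x = 11, 7
# r uniform over Z_N  =>  (x + r) mod N takes every value in Z_N exactly once,
# so the blinded value reveals nothing about x.
counts = Counter((x + r) % N for r in range(N))
assert len(counts) == N
assert all(c == 1 for c in counts.values())
```

This is exactly why CB, which only ever sees m + r in Theorem 2, learns nothing about m.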

Lemma 2

The Paillier cryptosystem is semantically secure based on the composite residuosity class problem [26].

Theorem 1

The proposed privacy-preserving kNN algorithm is secure from the perspective of CA under the semi-honest model.

Proof

CA holds the encrypted database and the encrypted index. However, since CA does not hold the decryption key, the encrypted database and the encrypted index are not exposed. The data cannot be inferred even under a frequency-based attack because the same plaintext yields different ciphertexts (Lemma 2). In addition, since all values that our secure protocols return from CB are encrypted, no information is exposed by the data received from CB. Even though the query is received from the user, it cannot be inferred because the query remains encrypted. □

Theorem 2

The proposed privacy-preserving kNN algorithm is secure from the perspective of CB under the semi-honest model [32].

Proof

CB decrypts the ciphertexts received from CA. Because CA hides the original data by adding a random integer before passing them to CB, CB cannot infer meaningful data from the decrypted plaintext (Lemma 1). □

Theorem 3

The proposed privacy-preserving kNN algorithm is secure from the perspective of AU under the semi-honest model.

Proof

AU encrypts his/her query using the public key and sends the encrypted query to CA. This protects the user's preferences and personal information. Since the query results received from CA and CB do not include information about the owner's other data, it is impossible to infer the original data. □

Theorem 4

The proposed privacy-preserving kNN algorithm is secure even though c and cnt are exposed to CA under the semi-honest model.

Proof

CA can obtain both c and cnt in Algorithms 3 and 5. Here, c is the number of nodes relevant to the query and cnt equals c × fanout (i.e., F). Initially, c equals one in the candidate search phase, and in the result refinement phase c ranges from zero to the total number of leaf nodes. However, because the upper and lower bounds of all nodes are encrypted and the node ids are hidden through grouping and shuffling, it is impossible to know which nodes are related to the query. Therefore, even if c and cnt are exposed as plaintext to CA, an attacker cannot know which nodes are related to the query, resulting in no additional information leakage. □

Theorem 5

The proposed privacy-preserving kNN algorithm is secure even though c is exposed to CB under the semi-honest model.

Proof

CB can obtain c in Algorithms 3 and 5. Here, c denotes the number of nodes related to the query. Because CB does not know the fanout (F) of the kd-tree, cnt is not disclosed to CB. In addition, because the order of the node ids is changed through shuffling, it is impossible to infer which nodes are related to the query. Therefore, even if c is exposed to CB as plaintext, an attacker cannot know which nodes are related to the query, resulting in no additional information leakage. □

According to Theorems 1, 2, 3, 4, and 5, the original data, the index, and the query are protected by the Paillier cryptosystem (Lemma 2), and, even when decrypted, the original data cannot be inferred because of the added random values (Lemma 1). Hence, we prove that the proposed privacy-preserving kNN algorithm guarantees data protection, query protection, and query result protection, while hiding the data access patterns.

In addition, the proposed parallel kNN algorithm also operates under the semi-honest attack model, and its security proof is likewise given from the three viewpoints of CA, CB, and AU. Because the procedure of the proposed parallel kNN algorithm is the same as that of the proposed privacy-preserving kNN algorithm except for the use of multiple threads, the parallel algorithm is secure from the perspectives of CA, CB, and AU under the semi-honest attack model. Therefore, we prove that the proposed parallel kNN algorithm guarantees data protection, query protection, and query result protection, while hiding the data access patterns [4, 5, 14].

7 Performance analysis

In this section, we compare the proposed privacy-preserving kNN algorithm (SkNNG) with the existing algorithms, SkNNm [22] and SkNNI [24], which can hide data access patterns. We used the Paillier cryptosystem to encrypt a database for both schemes [22, 24]. We implemented our algorithm and the existing ones using C++. Experiments were performed on a Linux machine with an Intel Xeon E3-1220v3 4-Core 3.10 GHz and 32 GB RAM running Ubuntu 14.04.2. In addition, we compare the proposed parallel algorithm (SkNNPG) with the parallel version of SkNNm (SkNNpm) and that of SkNNI (SkNNPI). Experiments for parallel algorithms were performed on a Linux machine with an Intel Xeon CPU E5-2630v4 2.20 GHz and 64 GB RAM running Ubuntu 14.04.2.

We conduct the performance analysis using both a synthetic dataset and a real dataset. For the synthetic dataset, we randomly generated 30 k records with six attributes. For the real dataset, we use the Chess dataset available at http://archive.ics.uci.edu/ml/datasets [41], which consists of 28,056 records with six attributes. The parameters for our experiments are listed in Table 3. We use 512 bits for the encryption key size (K) and set the default value of k to 10. Each query was generated by selecting a random integer from the data domain.

Table 3 Parameters used in our experiments

7.1 Performance using a synthetic dataset

Figure 4 shows the performance of both SkNNI and SkNNG in terms of the height of the kd-tree (h). When the number of data items is fixed, the fanout (F) decreases as h increases because the total number of leaf nodes is 2h−1. Therefore, it is important to choose an appropriate kd-tree height (h) for the given number of data items. If h is too high for the given number of data items, the number of leaf nodes to be searched increases, and thus the cost of accessing leaf nodes in the candidate node search phase increases. On the contrary, if h is too low, the number of data items to be searched increases, and thus the cost of calculating the distances between the data items and the query in the kNN retrieval phase increases. Both the existing algorithm (SkNNI) and the proposed algorithm (SkNNG) show near-optimal performance when h = 7. In particular, the performance of the existing algorithm is greatly affected by the height of the kd-tree because it uses secure protocols based on an encrypted binary array, which require a high computation cost. However, the proposed algorithm is less affected by h because it uses secure protocols based on the garbled circuit, which require a low computation cost. As a result, we set h to 7 in our experiments.
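The trade-off can be made concrete: with n data items and height h, the kd-tree has 2^(h−1) leaves, so each leaf holds roughly n / 2^(h−1) items. A small sketch of this arithmetic (the paper's exact node-capacity bookkeeping may differ):

```python
n = 30_000                                   # synthetic dataset size
for h in range(5, 10):
    leaves = 2 ** (h - 1)                    # number of kd-tree leaf nodes
    print(f"h={h}: {leaves:4d} leaves, ~{n // leaves} items per leaf")
# Higher h -> more leaves to probe with GSPE in the candidate search phase;
# lower h -> more per-leaf distance computations in the kNN retrieval phase.
```

At h = 7 there are 64 leaves of roughly 468 items each, which balances the two costs in our setting.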

Fig. 4. Performance with varying kd-tree depth for synthetic data

Figures 5 and 6 show the performance of the three algorithms on a single machine. With varying n, our SkNNG shows 30.2 and 6.1 times better performance than SkNNm and SkNNI, respectively. With varying k, our SkNNG shows 33.2 and 4.9 times better performance than SkNNm and SkNNI, respectively. SkNNG outperforms SkNNm because it reduces the computation cost by pruning unnecessary data with the kd-tree, whereas SkNNm considers all of the data. In addition, SkNNG outperforms SkNNI because our algorithm uses efficient secure protocols based on both Yao’s garbled circuit and the data packing technique. First, if a function can be realized with a reasonably small circuit, Yao’s garbled circuit provides a high degree of efficiency. Because our secure protocols, i.e., GSCMP and GSPE, do not take the encrypted binary representation of the data as inputs, contrary to the existing protocols used in [22, 24], our circuits remain reasonably small. As a result, SkNNG achieves a low computation cost by using GSCMP and GSPE. Second, our ESSED protocol requires only one encryption operation thanks to the data packing technique, while the alternative protocol (i.e., DESSED) needs m encryption operations. Moreover, ESSED calculates the randomized distance in plaintext, while the alternative computes the sum of the squared Euclidean distances over all attributes on ciphertexts. Therefore, SkNNG greatly reduces the computation cost by using ESSED.

Fig. 5. Performance with varying n for synthetic data

Fig. 6. Performance with varying k for synthetic data

Figures 7, 8, and 9 show the performance of the three parallel algorithms. In Fig. 7, when the number of threads is 2, 4, 6, 8, and 10, the query processing time of SkNNPG is 3309, 2009, 1572, 1136, and 994 s, respectively; the query processing time of SkNNPG decreases as the number of threads grows. In addition, our SkNNPG shows 12 and 7 times better performance on average than SkNNpm and SkNNPI, respectively. In Fig. 8, when the number of data items is 5 k, 10 k, 15 k, 20 k, 25 k, and 30 k, the query processing time of SkNNPG is 125, 241, 353, 462, 537, and 640 s, respectively. SkNNPI and SkNNPG show better performance than SkNNpm because they use the index-based data filtering technique. In Fig. 9, when k is 5, 10, 15, and 20, the query processing time of SkNNPG is 586, 1173, 1773, and 2308 s, respectively. Our SkNNPG shows 10 and 5.2 times better performance on average than SkNNpm and SkNNPI, respectively. SkNNPG outperforms SkNNpm and SkNNPI because it uses secure protocols that are efficient in a parallel environment, i.e., SMR and GSCR.

Fig. 7. Performance with varying # of threads for synthetic data

Fig. 8. Performance with varying n for synthetic data in parallel

Fig. 9. Performance with varying k for synthetic data in parallel

Privacy-preserving kNN query processing algorithms generally use homomorphic encryption to provide data privacy and query privacy. Therefore, they inevitably require a high computational cost, and their search time complexity is linear. As a result, the existing algorithms handle about 10,000 data items in their performance evaluations [22, 24]. Following the existing work, we evaluate the privacy-preserving kNN query processing algorithms with the number of data items ranging from 5000 to 30,000. Our performance evaluation shows that the proposed algorithm (SkNNG) has linear time complexity, but its slope is lower than that of the existing algorithms (SkNNm, SkNNI). This is because the proposed algorithm performs data filtering using the kd-tree structure and uses Yao’s garbled circuit, which avoids the encrypted binary array.

To show that the proposed algorithm can handle a very large dataset (e.g., 1 million items), we evaluate it with 300,000, 600,000, and 1,000,000 data items. In this evaluation, we exclude the existing algorithms because they cannot work on a very large dataset, owing to both their extremely long execution times and the absence of parallel versions. Instead of the six-dimensional data in Table 3, we use two-dimensional data because the main memory available for our experiment is limited. Figure 10 shows the performance of the proposed parallel algorithm on these very large datasets. The proposed algorithm requires 452, 834, and 1341 s for 300,000, 600,000, and 1,000,000 data items, respectively. The experiment shows that the running time of the proposed algorithm is linear in the number of data items. Thus, we infer that the proposed algorithm can handle a very large dataset with linear time complexity.

Fig. 10 Performance with varying n for a very large dataset

7.2 Performance using a real dataset

According to Fig. 11, the performances of both SkNNI and SkNNG are best when h is 7 or 8, so we set h to 7 in our experiment. Figure 12 shows the performance of the three algorithms on a single machine. With varying k, our SkNNG shows 22.3 and 5.9 times better performance than SkNNm and SkNNI, respectively. SkNNG outperforms SkNNm because it reduces the computation cost by pruning unnecessary data with the kd-tree, whereas SkNNm considers all the data. In addition, SkNNG outperforms SkNNI because it uses efficient secure protocols based on both Yao's garbled circuit and the data packing technique.
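To illustrate the pruning effect, the following is a minimal plaintext kd-tree kNN sketch. It is our own simplification: it omits all encryption, whereas SkNNG performs the equivalent filtering inside secure protocols over the encrypted kd-tree.

```python
import heapq

def build(points, depth=0):
    """Build a kd-tree: (point, split axis, left subtree, right subtree)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1), build(points[mid + 1:], depth + 1))

def knn(node, q, k, heap=None):
    """k nearest neighbors of q; heap holds (-squared_distance, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, axis, left, right = node
    d = sum((a - b) ** 2 for a, b in zip(point, q))
    heapq.heappush(heap, (-d, point))          # max-heap of the k best so far
    if len(heap) > k:
        heapq.heappop(heap)
    near, far = (left, right) if q[axis] <= point[axis] else (right, left)
    knn(near, q, k, heap)
    # Prune: visit the far subtree only if it can still beat the k-th best.
    if len(heap) < k or (q[axis] - point[axis]) ** 2 < -heap[0][0]:
        knn(far, q, k, heap)
    return heap

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
best = sorted((-d, p) for d, p in knn(tree, (9, 2), 2))  # (distance, point)
```

The pruning test on the splitting axis is what lets whole subtrees be skipped, which is the plaintext analogue of the roughly 90% of data items that SkNNG never touches.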

Fig. 11 Performance with varying kd-tree depth for real data

Fig. 12 Performance with varying k for real data

Figures 13 and 14 show the performance of the three parallel algorithms. In Fig. 13, when the number of threads is 2, 4, 6, 8, and 10, the query processing time of SkNNPG is 1659, 977, 745, 624, and 536 s, respectively; that is, the query processing time of SkNNPG decreases as the number of threads grows. In addition, our SkNNPG shows 13.3 and 4.1 times better performance on average than SkNNPm and SkNNPI, respectively. In Fig. 14, when k is 5, 10, 15, and 20, the query processing time of SkNNPG is 277, 536, 765, and 1022 s, respectively. Our SkNNPG shows 12.1 and 3.7 times better performance on average than SkNNPm and SkNNPI, respectively. SkNNPG outperforms SkNNPm and SkNNPI because it uses secure protocols designed for a parallel environment, i.e., SMR and GSCR.
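The thread scaling in Fig. 13 can be summarized with a short sketch (timings from the text above; computing speedup relative to the 2-thread run is our own choice of baseline):

```python
# SkNNPG query processing times from Fig. 13: threads -> seconds.
times = {2: 1659, 4: 977, 6: 745, 8: 624, 10: 536}

# Speedup relative to the 2-thread baseline; growth is sublinear,
# reflecting the serial portions of the secure protocols.
speedup = {t: times[2] / s for t, s in times.items()}
```

For instance, going from 2 to 10 threads (a 5x increase) yields roughly a 3.1x speedup.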

Fig. 13 Performance with varying # of threads for real data

Fig. 14 Performance with varying k for real data in parallel

7.3 Discussion

In this section, we clarify the differences between the existing privacy-preserving kNN query processing algorithms [22, 24] and our algorithm, and we highlight the advantages of our algorithm. In Table 4, we analyze the privacy-preserving kNN query processing algorithms in terms of secure protocol, index structure, and random value pool.

Table 4 Comparison of privacy-preserving kNN query processing algorithms

Impact of secure protocol with low computational cost Secure protocols are crucial for privacy-preserving query processing in cloud computing. Because we target privacy-preserving query processing based on the Paillier cryptosystem, which incurs high computational cost, the secure protocols must be made as efficient as possible. First, Elmehdwi et al.'s algorithm proposed secure protocols, such as SM, SBD, SMIN, and SMINn, for kNN query processing. It protects data privacy and query privacy by using the Paillier cryptosystem, and it uses arithmetic operations to protect the original data and to hide data access patterns. However, its drawback is an excessively high computational cost because the SBD, SMIN, and SMINn protocols take an encrypted binary array as input. For example, to perform the SMIN protocol between E(8) and E(7), the clouds first transform each encrypted decimal value into an encrypted binary array: E(8)(10) = {E(1), E(0), E(0), E(0)}(2) and E(7)(10) = {E(0), E(1), E(1), E(1)}(2). The clouds then run the SMIN protocol on these two arrays. Consequently, SMIN requires high computational cost because it performs encrypted operations as many times as the bit length of the data domain. SBD and SMINn require high computational cost for the same reason. Second, Kim et al.'s algorithm proposed secure protocols, SCMP and SPE, which are used during index search to find kNN candidates. However, because both SCMP and SPE also take an encrypted binary array as input, they require high computational cost for the same reason as SMIN. Meanwhile, our algorithm proposes GSCMP and GSPE, which perform only one Paillier arithmetic operation. This is because, by utilizing Yao's garbled circuit, they take an encrypted decimal value as input rather than an encrypted binary array. As a result, GSCMP and GSPE require low computational cost.
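The cost gap can be made concrete with a toy textbook Paillier instance (tiny, insecure parameters of our own choosing; this is an illustration, not the paper's implementation):

```python
import random
from math import gcd

# Toy textbook Paillier cryptosystem -- demo parameters only, NOT secure.
p, q = 293, 433
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                            # valid because g = n + 1

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:                       # r must lie in Z*_n
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Additive homomorphism: E(a) * E(b) mod n^2 decrypts to a + b.
assert dec(enc(8) * enc(7) % n2) == 15

# SMIN-style input: each value is first bit-decomposed, so comparing
# 8 and 7 over a 4-bit domain costs one ciphertext per bit ...
bits_8 = [enc(b) for b in (1, 0, 0, 0)]   # 8 = 1000 in binary, MSB first
bits_7 = [enc(b) for b in (0, 1, 1, 1)]   # 7 = 0111 in binary, MSB first

# ... whereas GSCMP/GSPE take a single encrypted decimal per operand.
e8, e7 = enc(8), enc(7)
```

With a realistic 32-bit domain, the bit-decomposed representation multiplies the number of ciphertexts, and hence the number of expensive modular exponentiations, by the domain's bit length.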

Impact of using an index structure over the encrypted database Because Elmehdwi et al.'s algorithm does not use an index structure for data filtering, it must process all of the data items, which degrades performance. To solve this problem, both Kim et al.'s algorithm and ours use an encrypted kd-tree as an index structure and thereby achieve a performance gain. Our experiment shows that our algorithm accesses only 10% of all the data items on average, owing to data filtering with the kd-tree. Table 5 compares the privacy-preserving kNN algorithms in terms of the number of data items accessed.

Table 5 Comparison of privacy-preserving kNN algorithms in terms of the number of data items accessed

Elmehdwi et al.'s algorithm accesses N × k data items in total, whereas both Kim et al.'s algorithm and ours access N/F + F × k × (c + 1) data items in total. Here, N is the number of data items, F is the fanout of the kd-tree, and c is the number of node groups in the kNN result refinement phase. Using Eq. (12), we show that both Kim et al.'s algorithm and ours outperform Elmehdwi et al.'s algorithm in terms of the number of data items accessed. Here, the term 1/(F × k) can be ignored because it is small enough. Because F × (c + 1) is the number of data items in the leaf nodes selected for the privacy-preserving kNN query processing, which is always a subset of N, Eq. (12) holds. Hence, by using the index structure over the encrypted database, our algorithm and Kim et al.'s algorithm outperform Elmehdwi et al.'s algorithm.

$$ \begin{aligned} N \times k & { } \ge \frac{N}{F} + F \times k \times \left( {c + 1} \right) \\ & \Leftrightarrow k \ge \frac{1}{F} + \frac{{F \times k \times \left( {c + 1} \right)}}{N} \\ & \Leftrightarrow 1 \ge \frac{1}{F \times k} + \frac{{F \times \left( {c + 1} \right)}}{N} \\ & \approx 1 \ge \frac{{F \times \left( {c + 1} \right)}}{N} \\ \end{aligned} $$
(12)
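The two access counts can be compared directly; the parameter values below (N, F, k, c) are our own illustrative assumptions, not the paper's experimental settings:

```python
# Access counts from the comparison above.
def full_scan_accesses(N, k):
    return N * k                      # Elmehdwi et al.: no index, all items

def kdtree_accesses(N, F, k, c):
    return N // F + F * k * (c + 1)   # index traversal + refinement phase

# Illustrative parameters: 10,000 items, fanout 25, k = 10, 3 node groups.
N, F, k, c = 10_000, 25, 10, 3
```

With these values the indexed algorithms access 1400 items versus 100,000 for the full scan, and the Eq. (12) condition F × (c + 1) ≤ N clearly holds.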

Impact of encrypted random value pool for parallelism In our secure system, we use two-party computation for the parallel kNN query processing algorithm. Thus, we need to prevent CB from extracting meaningful information while executing the secure protocols. For this, CA generates a random value r from ZN and encrypts it with the Paillier cryptosystem. Then, CA adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing \(E\left( {m + r} \right) = E\left( m \right) \times E\left( r \right)\). Because m + r is independent of m, CB cannot obtain meaningful information through decryption. However, adding a random value to a ciphertext in the Paillier cryptosystem degrades performance because both encryption and decryption require higher computational cost than the other encrypted operations. As shown in Table 2, in the Secure Multiplication protocol, both Elmehdwi et al.'s work and Kim et al.'s work require three encryptions: two encryptions of random values at CA and one encryption of the multiplication result at CB. Meanwhile, our algorithm requires only one encryption, for the multiplication result at CB, because it selects pre-encrypted random values from the random value pool instead of encrypting them at CA. Similarly, in the secure compare protocol, Elmehdwi et al.'s work requires log2 D encryptions, where D is the data domain, while Kim et al.'s work requires three encryptions: two encryptions of random values at CA and one encryption of the comparison result at CB. Meanwhile, our algorithm requires only one encryption, for the comparison result at CB, by using the random value pool. Therefore, our algorithm reduces the encryption cost by using the encrypted random value pool.
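The blinding step and the pool can be sketched with a toy Paillier instance (tiny, insecure parameters of our own choosing; the paper's system uses full-size keys, and the pool size here is arbitrary):

```python
import random
from math import gcd

# Toy Paillier setup -- demo parameters only, NOT secure.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
mu = pow(lam, -1, n)                  # valid because g = n + 1

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:             # r must lie in Z*_n
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Offline: CA fills a pool of encrypted random values in advance.
pool = [(r, enc(r)) for r in (random.randrange(n) for _ in range(8))]

# Online: blinding E(m) costs one modular multiplication and zero fresh
# encryptions at CA, because E(r) is drawn from the pool.
E_m = enc(42)                 # stands in for a ciphertext CA already holds
r, E_r = random.choice(pool)
blinded = E_m * E_r % n2      # E(m + r): the value CB is allowed to see
```

Moving the encryptions of the random values offline is exactly what removes two of the three per-call encryptions counted in Table 2.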

8 Conclusion and future work

Due to privacy issues, a database needs to be encrypted before being outsourced to the cloud. However, most of the existing kNN algorithms are insecure in that they disclose data access patterns during query processing. To solve this problem, we proposed a new privacy-preserving kNN query processing algorithm based on secure two-party computation. To achieve a high degree of efficiency in query processing, we also proposed a parallel kNN query processing algorithm using an encrypted random value pool. Our algorithms protect data privacy, query privacy, and data access patterns. In our performance analysis, our algorithms showed about 4–30 times better performance than the existing algorithms in terms of query processing cost. As future work, we plan to extend our algorithms to support other types of queries, such as Top-k queries and kNN classification. In addition, to the best of our knowledge, privacy-preserving kNN algorithms based on homomorphic encryption have generally been studied for low-dimensional data spaces because of their high computational cost [22, 24]. To deal with high-dimensional data spaces, we will extend our algorithm by using a data dimensionality reduction technique [42, 43].