1 Introduction

Research on preserving data privacy in outsourced databases has been spotlighted with the development of cloud computing. Because a data owner (DO) outsources his/her databases to a cloud and allows the cloud to manage them, the DO can reduce the database management cost by flexibly using the cloud's resources [1,2,3]. The cloud not only maintains the databases, but also provides an authorized user (AU) with querying services on the outsourced databases.

However, because the data are private assets of the DO and may include sensitive information such as financial records, they should be protected against adversaries, including the cloud server itself. Therefore, the databases of the DO should be encrypted before being outsourced to the cloud. In addition, a user's query should be protected from the adversaries because the query may contain the private information of the user [4,5,6,7,8,9,10]. Thus, a vital challenge in cloud computing is to protect both data privacy and query privacy among the data owner, the users, and the cloud. However, during query processing, the cloud can derive sensitive information about the actual data items and users by observing data access patterns, even if the data and the query are encrypted [11,12,13,14,15,16,17]. In addition, it is very challenging to process a query on the encrypted data without decrypting it.

Meanwhile, the k nearest neighbor (kNN) query, one of the most typical query types, has been widely used as a baseline technique in many fields, such as data mining and location-based services. The kNN query finds the k neighbors that are closest to a given query point. However, a kNN result is closely related to the interests and preferences of a user. Therefore, secure kNN (SkNN) query processing algorithms that preserve both data privacy and query privacy have been proposed [18,19,20,21,22,23,24]. However, the algorithms in [18, 19] are insecure because they are vulnerable to chosen-plaintext and known-plaintext attacks. In addition, the DO must be heavily involved in the query processing [19,20,21]. Furthermore, the algorithms in [18,19,20,21] do not hide data access patterns from the cloud. The algorithms in [22, 24] guarantee the confidentiality of both the outsourced databases and a user's query while hiding data access patterns, but they suffer from a high query processing cost.

To solve these problems, in this paper we propose a privacy-preserving kNN query processing algorithm via secure two-party computation on the encrypted database. Our algorithm preserves both data privacy and query privacy while hiding data access patterns. For this, we propose efficient and secure protocols based on Yao's garbled circuit [25] and a data packing technique. To enhance the performance of our kNN query processing algorithm, we also propose a parallel kNN query processing algorithm using improved secure protocols based on an encrypted random value pool. To verify the security of our algorithms, we provide formal security proofs of our privacy-preserving kNN query processing algorithms. Through a performance analysis, we verify that our proposed algorithms outperform the existing ones on both a synthetic dataset and a real dataset. Our contributions can be summarized as follows:

  • We present a framework for outsourcing both encrypted databases and encrypted indexes.

  • We propose new secure protocols (e.g., ESSED, GSCMP, GSPE) in order to preserve data privacy and query privacy while hiding data access patterns.

  • We propose an encrypted random value pool to minimize the computational cost of secure protocols.

  • We propose a new privacy-preserving parallel kNN query processing algorithm which can support efficient query processing.

  • We also present an extensive experimental evaluation of our algorithms with various parameter settings.

The rest of the paper is organized as follows: Section 2 introduces background and related work. Section 3 presents system architecture and secure protocols. Section 4 proposes our privacy-preserving kNN query processing algorithm. Section 5 proposes our parallel kNN query processing algorithm. Section 6 shows the security proof of our privacy-preserving kNN algorithms under semi-honest model. Section 7 presents the performance analysis of our kNN query processing algorithms. Finally, Sect. 8 concludes this paper.

2 Background and related work

2.1 Background

Importance of hiding data access patterns The data access pattern is one of the most important factors for privacy preservation in cloud computing. If an attacker possesses the order of data accesses or their frequency, he/she can infer the original data by using data access patterns. Therefore, hiding data access patterns is as important as encrypting data. First, in location-based service (LBS), one of the well-known queries is to find a nearby point of interest (POI) with a current user's location. For data protection, POI data are indexed, encrypted, and outsourced using a spatial index structure. For query protection, a user's location is encrypted and used for query processing. Because the query and POI data are encrypted, the exact location is not exposed to an attacker. However, by observing accesses to an index structure, an attacker can obtain data access patterns. By using data access patterns, the attacker can know where a query issuer is located and when he/she is in a specific area. As a result, if a user continuously issues queries while moving, an attacker can obtain his/her personal information, such as his/her moving trajectory and preference.

Second, in a healthcare service, a data mining technique for classifying patients based on their health information and symptom is widely used. For classifying patients, the service finds the most similar disease by getting accesses to the previously generated disease classification table. Because the patient’s health information is encrypted, an attacker cannot obtain sensitive information. However, an attacker can acquire data access patterns by repeatedly accessing the disease classification table with fake patients. As a result, by using the data access pattern, an attacker can infer what kind of disease an actual patient has when the patient information is given. Therefore, hiding data access patterns is very essential for privacy preservation in cloud computing.

Paillier cryptosystem The Paillier cryptosystem [26] is an additively homomorphic and probabilistic asymmetric encryption scheme for public key cryptography. The public key pk for encryption is given by (N, g), where N is the product of two large prime numbers p and q, and g is a generator in \({ }Z_{{N^{2} }}^{*}\). Here, \({ }Z_{{N^{2} }}^{*}\) denotes the multiplicative group of integers modulo N^2. The secret key sk for decryption is given by (p, q). Let E(·) denote the encryption function and D(·) denote the decryption function. The Paillier cryptosystem has the following properties.

  (1) Homomorphic addition The product of two ciphertexts E(m1) and E(m2) results in the encryption of the sum of m1 and m2 (Eq. 1).

$$ E\left( {m_{1} + m_{2} } \right) = E\left( {m_{1} } \right) \times E\left( {m_{2} } \right) {\text{mod}}\; N^{2} $$
(1)
  (2) Homomorphic multiplication The m2-th power of ciphertext E(m1) results in the encryption of the product of m1 and m2 (Eq. 2).

    $$ E\left( {m_{1} \times m_{2} } \right) = E\left( {m_{1} } \right)^{{m_{2} }} {\text{ mod}} \;N^{2} $$
    (2)
  (3) Semantic security Encrypting the same plaintext with the same public key results in distinct ciphertexts (Eq. 3).

    $$ m_{1} = m_{2} { \nRightarrow }E\left( {m_{1} } \right) = E\left( {m_{2} } \right) $$
    (3)

Therefore, an adversary cannot infer any information about the plaintexts.
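The three properties above can be checked with a toy implementation. The sketch below is a minimal, insecure Paillier variant (tiny key size, g = N + 1, a Fermat primality test) meant only to illustrate Eqs. (1)–(3); all function names are ours, not from the paper, and a real deployment would use a vetted library and 2048-bit keys.

```python
import math, random

def keygen(bits=256):
    # Toy key generation: two primes p, q with N = p*q and g = N + 1.
    def prime(b):
        while True:
            n = random.getrandbits(b) | (1 << b - 1) | 1
            if all(pow(a, n - 1, n) == 1 for a in (2, 3, 5, 7, 11, 13)):
                return n
    p, q = prime(bits // 2), prime(bits // 2)
    N = p * q
    lam = math.lcm(p - 1, q - 1)
    return (N, N + 1), (lam, N)

def encrypt(pk, m):
    # c = g^m * r^N mod N^2; fresh randomness r gives semantic security (Eq. 3).
    N, g = pk
    r = random.randrange(1, N)
    return (pow(g, m, N * N) * pow(r, N, N * N)) % (N * N)

def decrypt(pk, sk, c):
    N, _ = pk
    lam, _ = sk
    # L(x) = (x - 1) / N, then multiply by the modular inverse of L(g^lam).
    L = lambda x: (x - 1) // N
    mu = pow(L(pow(N + 1, lam, N * N)), -1, N)
    return (L(pow(c, lam, N * N)) * mu) % N

pk, sk = keygen()
N = pk[0]
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
assert decrypt(pk, sk, (c1 * c2) % (N * N)) == 42   # Eq. 1: homomorphic addition
assert decrypt(pk, sk, pow(c1, 3, N * N)) == 45     # Eq. 2: homomorphic multiplication
assert encrypt(pk, 15) != encrypt(pk, 15)           # Eq. 3: probabilistic encryption
```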

Yao’s garbled circuit Yao’s garbled circuits [25] allow two parties holding inputs x and y, respectively, to evaluate a function f(x, y) without leaking any information about the inputs beyond what is implied by the function output. One party generates an encrypted version of a circuit to compute f. The other party obliviously evaluates the output of the circuit without learning any intermediate values. Therefore, Yao's garbled circuit provides a high security level. Another benefit of Yao's garbled circuit is that it can provide high efficiency if a function can be realized with a reasonably small circuit.
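As a concrete illustration, the following sketch garbles a single AND gate: the generator encrypts each output-wire label under a hash of the two input-wire labels, and the evaluator, holding exactly one label per input wire, recovers only the matching output label. This is a simplified classroom construction (a zero-byte tag instead of point-and-permute, and no oblivious transfer for input delivery), not the optimized circuits used in practice.

```python
import os, hashlib, random

def garble_and_gate():
    # Two random 16-byte labels per wire; a 16-byte zero tag marks the
    # table row that decrypts correctly under the evaluator's labels.
    labels = {w: (os.urandom(16), os.urandom(16)) for w in ("a", "b", "out")}
    table = []
    for a in (0, 1):
        for b in (0, 1):
            pad = hashlib.sha256(labels["a"][a] + labels["b"][b]).digest()
            plain = labels["out"][a & b] + bytes(16)
            table.append(bytes(x ^ y for x, y in zip(plain, pad)))
    random.shuffle(table)  # hide which row encodes which input pair
    return labels, table

def evaluate(table, la, lb):
    # The evaluator holds one label per input wire and learns only the
    # output label, never the plaintext bits behind la and lb.
    pad = hashlib.sha256(la + lb).digest()
    for row in table:
        plain = bytes(x ^ y for x, y in zip(row, pad))
        if plain[16:] == bytes(16):
            return plain[:16]
    raise ValueError("no row decrypted")

labels, table = garble_and_gate()
assert evaluate(table, labels["a"][1], labels["b"][1]) == labels["out"][1]  # 1 AND 1
assert evaluate(table, labels["a"][1], labels["b"][0]) == labels["out"][0]  # 1 AND 0
```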

2.2 Related work

The typical kNN query processing schemes on encrypted databases are as follows. Wong et al. [18] processed a kNN query by devising an encryption scheme that supports distance comparison on the encrypted data. However, the scheme is vulnerable to chosen-plaintext attacks [27, 28] and cannot hide the data access pattern from the cloud. Yiu et al. [19] proposed a kNN query processing algorithm using the R-tree index [29] encrypted by AES [30]. However, the scheme has the drawback that most of the computation is performed on the user side rather than in the cloud. In addition, the data access pattern is not preserved because the user hierarchically requests the required R-tree nodes from the cloud. Hu et al. [20] proposed a kNN query processing algorithm using a provably secure privacy homomorphism encryption method. However, the user is in charge of index traversal during query processing. In addition, the scheme is known to be vulnerable to chosen-plaintext attacks and leaks the data access patterns. Zhu et al. [21] proposed a kNN query processing scheme that considers untrusted users. Because a user does not hold an encryption key, the data owner must encrypt the query. In addition, the cloud can learn the identifiers of the query result, which implies the leakage of the data access pattern.

Elmehdwi et al. [22] proposed the SkNNm scheme over the encrypted database. To the best of our knowledge, this is the first work that guarantees both data privacy and query privacy while hiding the data access pattern [14] at the same time. In addition, the data owner and the user do not participate in the query processing. However, the query processing cost of this scheme is extremely high because the scheme considers all of the encrypted data and makes use of secure protocols that take the encrypted binary representation of the data as inputs. Zhou et al. [23] proposed an asymmetric scalar-product-preserving encryption (ASPE) scheme based on Wong et al.'s work [18]. By using random asymmetric splitting with additional artificial dimensions, the scheme can resist known-plaintext attacks [28, 31]. In this scheme, the query issuers are fully trusted and the decryption key is partially revealed to them. However, the scheme cannot hide the data access pattern. Most recently, Kim et al. [24] proposed a kNN query processing scheme (SkNNI) using an encrypted index. The algorithm guarantees the confidentiality of both the data and the user query while hiding data access patterns. By filtering unnecessary data using a secure index mechanism, the algorithm provides better performance than SkNNm. However, the algorithm still requires a high computation cost because it uses secure protocols that take the encrypted binary representation of the data as inputs.

3 System architecture and secure protocols

3.1 System architecture

The typical types of adversaries are semi-honest and malicious [32]. In this paper, we consider the clouds as insider adversaries, who have more authority than outside attackers. In the semi-honest adversarial model, the cloud correctly follows the given protocol, but may try to obtain additional information beyond what it is allowed to learn. In the malicious adversarial model, the cloud can deviate from the protocol specification. However, protocols secure against malicious adversaries are inefficient, whereas protocols under the semi-honest model are practical and can be used as building blocks for designing protocols against malicious adversaries. Therefore, following earlier work [22, 24], we also adopt the semi-honest adversarial model. A secure protocol under the semi-honest adversarial model can be defined as follows.

Definition 1

Secure protocol Let \(\mathop \prod \nolimits_{i}^{ } \left( \pi \right)\) be an execution image of the protocol π at the Ci side and let ai and bi be the input and the output of the protocol π, respectively. Then, π is secure if \(\mathop \prod \nolimits_{i}^{ } \left( \pi \right)\) is computationally indistinguishable from the simulated image \(\mathop \prod \nolimits_{i}^{s} \left( \pi \right)\).

The system consists of four components: a data owner (DO), an authorized user (AU), and two clouds (CA and CB). The DO owns the original database (T) of n records [33,34,35]. A record ti (1 ≤ i ≤ n) consists of m attributes, where m is the number of data dimensions, and the jth attribute value of ti is denoted as ti,j (1 ≤ j ≤ m). The DO partitions T by using the kd-tree structure [36, 37] to provide indexing on T. The reason why we use the kd-tree as the index structure is to hide data access patterns. Using a space filling curve (e.g., Hilbert curve) to partition data items into blocks (nodes) can guarantee data locality, but it cannot guarantee that data items are evenly distributed over blocks. As a result, an attacker may identify a specific block based on the number of data items stored in it. In contrast, partitioning data items into blocks with the kd-tree makes the blocks indistinguishable from one another, because data items are evenly distributed over blocks even if the data are skewed. Meanwhile, while traversing the kd-tree structure in a hierarchical way, an attacker can learn which block is relevant to the query, which results in the leakage of data access patterns. To tackle this problem, our algorithm accesses only the leaf nodes of the kd-tree during the query processing step, rather than traversing the tree structure hierarchically.

Henceforth, a node refers to a leaf node. Let h denote the level of the constructed kd-tree and F be the fanout of each leaf node. A node is denoted by nodez (1 ≤ z ≤ 2^(h−1)), where 2^(h−1) is the total number of leaf nodes. The region information of nodez is represented by the lower bound lbz,j and the upper bound ubz,j (1 ≤ z ≤ 2^(h−1), 1 ≤ j ≤ m). Each node stores the identifiers (ids) of the data located inside the node region. To preserve data privacy, the DO encrypts T attribute-wise using the public key (pk) of the Paillier cryptosystem [26] before outsourcing the database. Therefore, the DO generates E(ti,j) for 1 ≤ i ≤ n and 1 ≤ j ≤ m by encrypting ti,j. The DO also encrypts the region information of all kd-tree nodes to support efficient query processing. Specifically, lb and ub of each node are encrypted attribute-wise such that E(lbz,j) and E(ubz,j) are generated for 1 ≤ z ≤ 2^(h−1) and 1 ≤ j ≤ m. We assume that CA and CB are non-colluding and semi-honest (or honest-but-curious) clouds. Thus, they correctly perform the given protocols and do not exchange unpermitted data. However, they may try to obtain additional information from the intermediate data while executing their own protocols. Although this assumption is not new, it has been used in related problem domains (for example, in [38]), as mentioned in earlier works [22, 24, 39].

In this paper, we consider privacy-preserving kNN query processing which retrieves k nearest data items that are closest to the given query. To support kNN query processing over the encrypted database, a secure multi-party computation (SMC) is required for privacy-preserving kNN query processing algorithm [40]. A secure multi-party computation can be defined as follows.

Definition 2

Secure multi-party computation A given number of participants p1, p2, …, pn (n ≥ 2) each hold private data d1, d2, …, dn, respectively. The participants want to compute the value of a public function on the private data, F(d1, d2, …, dn), while keeping their own inputs secret.

According to Definition 2, the proposed algorithm uses two clouds (i.e., CA and CB) because at least two parties are required for secure computation. Existing studies, such as Elmehdwi et al.'s work [22], Zhou et al.'s work [23], Wong et al.'s work [18] and Kim et al.'s work [24], also use two-party computation to support privacy-preserving kNN query processing. Thus, we do not consider a single-party computation model because it is vulnerable to semi-honest adversaries.

The DO outsources both the encrypted database and its encrypted index to the CA with pk, while the DO sends sk to the CB. The encrypted index includes the region information of each node in ciphertext and the ids of data located inside the node in plaintext. The DO also sends pk to AUs to enable them to encrypt a kNN query. When requesting a query, an AU first generates E(qj) by encrypting a query q attribute-wise for 1 ≤ j ≤ m. CA and CB cooperatively process the query and return a query result to the AU without data leakage.

As an example, assume that the DO has sixteen data items in two-dimensional space (x-axis and y-axis), as depicted in Fig. 1. The data items are partitioned into four kd-tree nodes: node1, node2, node3, and node4. To clarify the relationship between data items and nodes, we suppose that no data item lies on the boundary of a node. To outsource the database, the DO encrypts each data item and the region information of each node attribute-wise. The ith data item di is represented as <xi, yi> in two-dimensional space. Therefore, di can be encrypted to <E(xi), E(yi)> by using the Paillier cryptosystem. For example, d1 = <2, 1> is encrypted as E(d1) = <E(2), E(1)>, and the encrypted index is shown in Fig. 1.

Fig. 1
figure 1

Data items and kd-tree in two-dimensional space
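A median-split partitioning such as the one described above can be sketched as follows. The sixteen sample points and the `kd_partition` helper are illustrative assumptions, not the exact data of Fig. 1; the point is that median splits keep the leaves equally full even for skewed data.

```python
# Minimal kd-tree leaf partitioning: split alternately on x and y at the
# median until each block holds at most `fanout` points, so points are
# spread evenly over leaves even when the data are skewed.
def kd_partition(points, fanout=4, depth=0):
    if len(points) <= fanout:
        return [points]
    axis = depth % 2                      # alternate x-axis / y-axis
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kd_partition(pts[:mid], fanout, depth + 1) +
            kd_partition(pts[mid:], fanout, depth + 1))

points = [(2, 1), (1, 3), (3, 4), (4, 2), (6, 1), (7, 3), (8, 4), (5, 2),
          (2, 6), (1, 8), (3, 7), (4, 5), (6, 6), (7, 8), (8, 5), (5, 7)]
leaves = kd_partition(points)
assert len(leaves) == 4 and all(len(node) == 4 for node in leaves)
# Each leaf would then be outsourced as attribute-wise Paillier ciphertexts
# together with its encrypted lower/upper bounds (lb, ub).
```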

3.2 Enhanced secure protocols

Our kNN query processing algorithm is constructed using several secure protocols. We adopt four secure protocols from the literature [22, 24, 39]: secure multiplication (SM), secure bit-not (SBN), CoMPare-S (CMP-S), and secure minimum from a set of n values (SMINn). All of the protocols except SBN use the SMC technique between CA and CB, while SBN can be executed solely at the CA side. In addition, we propose three new secure protocols: enhanced secure squared Euclidean distance (ESSED), garbled circuit-based secure compare (GSCMP), and garbled circuit-based secure point enclosure (GSPE). Both GSCMP and GSPE rely on Yao's garbled circuits [25], described in Sect. 2.1, which provide a high security level together with high efficiency when the function can be realized with a reasonably small circuit [39]. Because our protocols do not take the encrypted binary representation of the data as inputs, contrary to the existing protocols [22, 24], they provide a low computation cost.

ESSED protocol: Suppose that there are two m-dimensional vectors \(\vec{X} = \left\{ {x_{1} ,x_{2} ,x_{3} , \ldots ,x_{m} } \right\}\) and \(\vec{Y} = \left\{ {y_{1} ,y_{2} ,y_{3} , \ldots ,y_{m} } \right\}\). The goal of the ESSED (enhanced secure squared Euclidean distance) protocol is to securely compute \(E\left( {\left| {\vec{X} - \vec{Y}} \right|^{2} } \right)\), where \(\left| {\vec{X} - \vec{Y}} \right|\) denotes the Euclidean distance between \(\vec{X}\) and \(\vec{Y}\). Note that \(\left| {\vec{X} - \vec{Y}} \right|^{2} = \mathop \sum \nolimits_{i = 1}^{m} \left( {x_{i} - y_{i} } \right)^{2} .\)

We utilize a data packing technique to enhance the efficiency of a secure protocol. Specifically, we pack λ σ-bit data values to generate one packed value. The overall procedure of ESSED is as follows. First, CA generates random numbers rj for 1 ≤ j ≤ m and packs them to obtain R using Eq. (4).

$$ R = \sum\limits_{j = 1}^{m} {r_{j} \times 2^{\sigma (m - j)} } $$
(4)

Then, CA generates E(R) by encrypting R. Second, CA calculates E(xj − yj) attribute-wise and packs these results to obtain E(v) using Eq. (5). Then, CA computes E(v) = E(v) × E(R) and sends E(v) to CB.

$$ E\left( v \right) \, = \mathop \prod \limits_{j = 1}^{m} E\left( {x_{j} - y_{j} } \right)^{{2^{{\sigma \left( {m - j} \right)}} }} $$
(5)

Third, CB acquires [x1 − y1 + r1 | … | xm − ym + rm] by decrypting E(v). CB obtains xj − yj + rj for 1 ≤ j ≤ m by unpacking v through v × 2^(−σ(m−j)). CB also calculates (xj − yj + rj)^2 attribute-wise and stores their sum into d. CB encrypts d and sends E(d) to CA. Finally, CA obtains \(E\left( {\left| {\vec{X} - \vec{Y}} \right|^{2} } \right)\) by eliminating the randomized values using Eq. (6).

$$ E\left( {\left| {\vec{X} - \vec{Y}} \right|^{2} } \right) = E\left( d \right) \times \mathop \prod \limits_{j = 1}^{m} \left( {E\left( {x_{j} - y_{j} } \right)^{ - 2rj} \times E(r_{j}^{2} )^{ - 1} } \right) $$
(6)

Our ESSED outperforms the existing distance computation protocol, i.e., the data packing-based secure squared Euclidean distance (DPSSED) [39]. Table 1 shows the difference between the existing DPSSED and our ESSED in terms of the number of encryptions. Our ESSED requires only one encryption on the CB side, while the existing DPSSED requires m encryptions. Therefore, our ESSED requires a total of two encryptions, whereas the existing DPSSED requires a total of m + 1 encryptions. In addition, our ESSED calculates the randomized distance in plaintext on the CB side, while the existing DPSSED computes the sum of the squared Euclidean distances over all attributes on ciphertext on the CA side. Therefore, the number of computations on encrypted data in our ESSED is greatly reduced compared with the existing DPSSED.

Table 1 Comparison between the DPSSED and our ESSED in terms of the number of encryptions
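The packing arithmetic of ESSED can be illustrated in plaintext (ciphertext operations omitted). The slot width `SIGMA`, the sample vectors, and the masks below are our assumptions; in the actual protocol the packing of Eq. (5) and the de-randomization of Eq. (6) happen over Paillier ciphertexts.

```python
SIGMA = 16                      # assumed bit width per packed slot
m = 3
x = [10, 200, 35]
y = [4, 120, 60]
r = [7, 11, 13]                 # CA's random masks

# CA packs the masked differences x_j - y_j + r_j into one value (Eqs. 4 and 5).
v = sum((x[j] - y[j] + r[j]) % (1 << SIGMA) << SIGMA * (m - 1 - j) for j in range(m))

# CB unpacks slot by slot, squares, and sums (step three of ESSED).
d = 0
for j in range(m):
    slot = (v >> SIGMA * (m - 1 - j)) & ((1 << SIGMA) - 1)
    if slot >= 1 << SIGMA - 1:          # interpret the slot as signed
        slot -= 1 << SIGMA
    d += slot * slot

# CA removes the randomization: d - sum(2*r_j*(x_j - y_j) + r_j^2)  (Eq. 6).
dist = d - sum(2 * r[j] * (x[j] - y[j]) + r[j] * r[j] for j in range(m))
assert dist == sum((x[j] - y[j]) ** 2 for j in range(m))   # squared Euclidean distance
```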

GSCMP protocol: Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, GSCMP (garbled circuit-based secure CoMPare) protocol returns the result as follows.

$$ {\text{GSCMP}}\left( {E\left( u \right), E\left( v \right)} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\; u \le v} \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$

The main difference between GSCMP and CMP-S is that GSCMP receives encrypted data as inputs, while CMP-S receives the randomized plaintext. Furthermore, in the case of CMP-S, plaintext is returned as a result, whereas GSCMP encrypts the result of CMP-S and sends it to CA. Through this, GSCMP can protect the data access patterns. The overall procedure of the GSCMP is as follows.

First, CA generates two random numbers ru and rv and encrypts them. CA computes E(m1) = E(u)^2 × E(ru) and E(m2) = E(v)^2 × E(1) × E(rv). For given input values u and v, Yao's garbled circuit returns one if u < v, returns zero if u > v, and returns a random value if u = v. To avoid returning a random value, our GSCMP protocol calculates u′ = 2 × u and v′ = 2 × v + 1 in Eqs. (7) and (8) while preserving the inequality. For example, when u = v = 3, GSCMP calculates u′ = 6 and v′ = 7. Therefore, our GSCMP protocol avoids returning a random value when u = v.

$$ E\left( {m_{1} } \right) = E\left( u \right)^{2} \times E\left( {ru} \right) = E\left( {2 \times u + ru} \right) $$
(7)
$$ E\left( {m_{2} } \right) = E\left( {\text{v}} \right)^{2} \times E\left( 1 \right) \times E\left( {rv} \right) = E\left( {2 \times v + rv + 1} \right) $$
(8)

Second, CA randomly chooses one functionality between F0: u ≥ v and F1: u < v. The selected functionality is oblivious to CB. Then, CA sends data to CB, depending on the selected functionality. If F0: u ≥ v is selected, CA sends < E(m2), E(m1) > to CB. If F1: u < v is selected, CA sends < E(m1), E(m2) > to CB.

Third, CB obtains < m2, m1 > by decrypting < E(m2), E(m1) > if F0: u ≥ v is selected. If F1: u < v is selected, CB obtains < m1, m2 > by decrypting < E(m1), E(m2) > .

Fourth, CA generates a garbled circuit consisting of two ADD circuits and one CMP circuit. Here, an ADD circuit takes two integers as input and outputs their sum, while a CMP circuit takes two integers u and v as input and outputs one if u < v, and zero otherwise. If F0: u ≥ v is selected, CA puts –rv and –ru into the first and second ADD gates, respectively. If F1: u < v is selected, CA puts –ru and –rv into the first and second ADD gates.

Fifth, if F0: u ≥ v is selected, CB puts m2 and m1 into the first and second ADD gates, respectively. If F1: u < v is selected, CB puts m1 and m2 into the first and second ADD gates.

Sixth, the first ADD gate adds its two input values and puts the output result1 into the CMP gate. Similarly, the second ADD gate puts the output result2 into the CMP gate. Seventh, the CMP gate outputs α = 1 if result1 < result2 is true, and α = 0 otherwise. The output of the CMP gate is returned to CB. Then, CB encrypts α and sends E(α) to CA. Since E(α) is an encrypted value, CA cannot identify the data received from CB. If CA received α from CB, CA could know which data are relevant to the query, which could lead to the exposure of the data access patterns. Therefore, it is necessary that CB sends E(α) rather than α to CA.

Finally, when the selected functionality is F0: u ≥ v, CA computes E(α) = SBN(E(α)) and returns the final E(α). If E(α) is E(1), u is less than or equal to v.
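The control flow of GSCMP can be summarized with a plaintext simulation, in which encryption, the garbled circuit, and SBN are replaced by their cleartext effects; the variable names are ours.

```python
import random

def gscmp_sim(u, v):
    # Plaintext simulation of the GSCMP flow (ciphertexts omitted).
    ru, rv = random.randrange(1000), random.randrange(1000)
    m1 = 2 * u + ru            # Eq. (7): doubling avoids the tie u' == v'
    m2 = 2 * v + 1 + rv        # Eq. (8)
    f = random.randrange(2)    # CA picks F0 (u >= v) or F1 (u < v) at random
    if f == 0:                 # F0: CB receives <m2, m1>; CA de-masks with <-rv, -ru>
        left, right = m2 - rv, m1 - ru
    else:                      # F1: CB receives <m1, m2>; de-masked with <-ru, -rv>
        left, right = m1 - ru, m2 - rv
    alpha = 1 if left < right else 0     # the CMP gate
    if f == 0:                           # SBN flips alpha for functionality F0
        alpha = 1 - alpha
    return alpha                         # 1 iff u <= v

assert gscmp_sim(3, 3) == 1   # equality handled by the doubling trick
assert gscmp_sim(2, 7) == 1
assert gscmp_sim(7, 2) == 0
```

Note that the returned flag is independent of which functionality was drawn, so CB learns nothing about the true comparison from the circuit output alone.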

GSPE protocol: Suppose that E(p) is an encrypted value of a point p and E(range) is a set of the encrypted values containing the E(range.lbj) and the E(range.ubj) for 1 ≤ j ≤ m (m is the data dimension). When E(p) and E(range) are given as inputs, GSPE (garbled circuit-based secure point enclosure) protocol returns the result as follows.

$$ {\text{GSPE}}\left( {E\left( p \right), E\left( {range} \right)} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\;range.lb \le p \le range.ub} \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$

If pj ≤ range.ubj and pj ≥ range.lbj for every dimension j, the point p is inside the range. To securely compare a point with a range, the GSPE protocol needs to add random values for all data dimensions. However, as the number of data dimensions increases, the number of data encryptions also increases. The GSPE protocol reduces the number of data encryptions by using a packing technique that transforms the m-dimensional data into one packed value.

The overall procedure of the GSPE is shown in Algorithm 1. First, CA generates two random numbers raj and rbj for 1 ≤ j ≤ 2m (lines 1–2). CA obtains PA and PB by packing raj and rbj, respectively, using Eq. (9) for 1 ≤ j ≤ 2m (line 3).

$$ {\text{PA}} = \mathop \sum \limits_{j = 1}^{2m} ra_{j} \times 2^{{\sigma \left( {2m - j} \right)}} ,\quad {\text{PB}} = \mathop \sum \limits_{j = 1}^{2m} rb_{j} \times 2^{{\sigma \left( {2m - j} \right)}} $$
(9)

Here, σ means the maximum bit length required to represent a data value. Then, CA generates E(PA) and E(PB) by encrypting PA and PB (line 4). Second, CA computes E(μj) = E(pj)^2 and E(ωj) = E(range.lbj)^2 for 1 ≤ j ≤ m. CA also computes E(δj) = E(pj)^2 × E(1) and E(τj) = E(range.ubj)^2 × E(1) for 1 ≤ j ≤ m (lines 5–8). Third, CA randomly chooses one of two functionalities, F0: u ≥ v and F1: u < v. Then, CA performs encrypted data packing for E(μj), E(τj), E(ωj) and E(δj) by using homomorphic multiplication and addition (Eqs. 1 and 2), depending on the selected functionality (lines 8–18). CA sends E(PA) and E(PB) to CB (line 19). Fourth, CB obtains PA and PB by decrypting E(PA) and E(PB) (line 20). CB stores xj ← \({\text{PA}} \times 2^{{ - \sigma \left( {2m - j} \right)}}\) for 1 ≤ j ≤ 2m by unpacking PA, while CB stores yj ← \({\text{PB}} \times 2^{{ - \sigma \left( {2m - j} \right)}}\) for 1 ≤ j ≤ 2m by unpacking PB (lines 21–23). Here, xj (or yj) is one of μj, τj, ωj, and δj.

Fifth, CA generates two ADD gates and one CMP gate (line 24). CA puts –raj and –rbj into the first and the second ADD gates, respectively, for 1 ≤ j ≤ 2m (lines 25–26). CB puts xj and yj into the first and the second ADD gates, respectively, for 1 ≤ j ≤ 2m (line 27). When –raj, –rbj, xj and yj are given, the result of the CMP gate, α′ = <α1′, α2′, …, α2m′>, is returned to CB (lines 28–29). Sixth, CB encrypts α′ and sends E(α′) to CA (line 30). Seventh, CA performs E(αj) = SBN(E(αj)) for 1 ≤ j ≤ 2m when the selected functionality is F0: u ≥ v (lines 32–33). Then, CA computes E(result) = SMR(E(result), E(αj)) for 1 ≤ j ≤ 2m, where the initial value of E(result) is E(1) (lines 31, 34). When all of the E(αj) for 1 ≤ j ≤ 2m are E(1), E(result) remains E(1). Finally, GSPE returns E(result) (line 35).

Algorithm 1
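Stripping away the encryption and the packing transport, the comparison logic of GSPE reduces to 2m doubled comparisons followed by an AND chain, as the following plaintext sketch (our naming) shows.

```python
def gspe_sim(p, lb, ub):
    # Plaintext sketch of GSPE: 2m comparisons, one per bound per dimension,
    # then the AND of all flags (the SMR chain of lines 31-34).
    m = len(p)
    alphas = []
    for j in range(m):
        alphas.append(1 if 2 * p[j] < 2 * ub[j] + 1 else 0)   # p_j <= ub_j
        alphas.append(1 if 2 * lb[j] < 2 * p[j] + 1 else 0)   # lb_j <= p_j
    result = 1
    for a in alphas:          # E(result) = SMR(E(result), E(alpha_j))
        result *= a
    return result

assert gspe_sim((3, 4), (0, 0), (5, 5)) == 1   # inside the range
assert gspe_sim((6, 4), (0, 0), (5, 5)) == 0   # outside on the x-axis
assert gspe_sim((5, 5), (0, 0), (5, 5)) == 1   # boundary counts as inside
```

The doubling in each comparison mirrors Eqs. (7) and (8): it removes the ambiguous u = v case without changing the order of any pair.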

3.3 Secure protocols using encrypted random value pool

While processing a query in our secure system, CB decrypts the received ciphertexts. Thus, we need to prevent CB from extracting meaningful information while executing secure protocols. For this, CA generates a random value r from ZN and encrypts it using the Paillier cryptosystem. Then, CA adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing E(m + r) = E(m) × E(r). Because m + r is independent of m, CB cannot obtain meaningful information by decryption. However, in the Paillier cryptosystem, the process of adding a random value to the ciphertext leads to performance degradation because both encryption and decryption require a much higher computational cost than the other operations on encrypted data. Therefore, we propose an encrypted random value pool to reduce the computational cost of ciphertext generation. First, in a preprocessing phase, we generate random ciphertexts and store them in an encrypted random value pool. Second, while processing a query, CA selects a random ciphertext from the encrypted random value pool whenever a secure protocol is called. Therefore, CA not only prevents CB from extracting meaningful information while executing a secure protocol, but also reduces the cost of generating encrypted random values. We apply the encrypted random value pool to the SM protocol [22] and our GSCMP protocol. In SM and GSCMP, CA generates two encrypted random values before sending the ciphertext to CB. According to Table 2, we can reduce the number of encryption operations by 67% by using the encrypted random value pool.

Table 2 Comparison of the number of encryption operations in secure protocols
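A minimal sketch of the pool idea, under our own naming and with a deliberately tiny modulus: the expensive Paillier encryptions of random values are produced in a preprocessing phase and merely popped at query time.

```python
import random
from collections import deque

class EncryptedRandomPool:
    # Preprocessing-phase pool of Paillier encryptions of random values.
    # Toy parameters only; a real deployment uses a 2048-bit N and a CSPRNG.
    def __init__(self, N, size):
        self.N, self.N2 = N, N * N
        self.pool = deque()
        for _ in range(size):                       # preprocessing phase
            r = random.randrange(1, N)
            s = random.randrange(1, N)
            # Paillier with g = N + 1: E(r) = (1 + r*N) * s^N mod N^2
            c = (1 + r * self.N) * pow(s, N, self.N2) % self.N2
            self.pool.append((r, c))

    def draw(self):
        # Query phase: pop a ready-made (r, E(r)) instead of encrypting online.
        return self.pool.popleft()

# Example: mask a ciphertext E(m) as E(m + r) = E(m) * E(r) mod N^2 at almost
# no cost, since E(r) was produced ahead of time.
N = 3259 * 4931                                    # tiny illustrative modulus
pool = EncryptedRandomPool(N, size=8)
r, enc_r = pool.draw()
assert len(pool.pool) == 7
```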

Secure multiplication protocol using encrypted random value pool (SMR): Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, SMR protocol returns the result as follows.

$$ {\text{SMR}}\left( {E\left( u \right), E\left( v \right)} \right) = E\left( {u \times v} \right) $$

SMR protocol is shown in Algorithm 2. When two encrypted values E(u) and E(v) are given as inputs, CA selects two random ciphertexts E(ra) and E(rb) from the encrypted random value pool (line 1). The rest of the SMR protocol is the same as the existing SM protocol (lines 2–6).

Algorithm 2

Garbled secure compare protocol using encrypted random value pool (GSCR): Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, the GSCR protocol returns the same result as our GSCMP:

$$ {\text{GSCR}}\left( {E\left( u \right), E\left( v \right)} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\, u \le v} \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$

The only difference between GSCR and GSCMP is that GSCR selects a random ciphertext from the encrypted random value pool instead of generating an encrypted random value on the fly.

4 kNN query processing algorithm

In this section, we present our kNN query processing algorithm (SkNNG) that uses Yao’s garbled circuit [25]. The algorithm consists of three phases: encrypted kd-tree search, kNN retrieval, and kNN result refinement.

4.1 Candidate node search phase

In the encrypted kd-tree search phase, CA securely extracts all of the data from the node containing a query point while hiding the data access patterns. The procedure of the encrypted kd-tree search phase is shown in Algorithm 3. First, CA securely finds the nodes that include the query by executing E(αz) = GSPE(E(q), E(nodez)) for 1 ≤ z ≤ #_of_node, where #_of_node denotes the total number of kd-tree leaf nodes (lines 1–2). The GSPE results for all nodes are stored in E(α) = {E(α1), E(α2), …, E(α#_of_node)}. By utilizing GSPE, our kNN query processing algorithm achieves better performance than the existing algorithms [22, 24] because we can avoid the operations related to the SBD protocol, which causes a high computation overhead. Then, we perform lines 8–24 of the index search algorithm in [24]. Second, CA generates E(α′) by shuffling E(α) using a random shuffling function π and sends E(α′) to CB (lines 3–4).

[Algorithm 3]

Third, CB obtains α′ by decrypting E(α′), counts the number of entries with α′ = 1, and stores the count in c (lines 5–6). Here, c denotes the number of nodes related to the query. Fourth, CB creates c node groups (NGs) (lines 7–11). CB assigns to each NG one node with α′ = 1 and #_of_node/c − 1 nodes with α′ = 0. Then, CB obtains NG′ by randomly shuffling the ids of the nodes in each NG and sends NG′ to CA. Fifth, CA obtains NG* by shuffling the ids of the nodes in each NG′ using π−1 (lines 12–13). Finally, CA accesses one datum in each node of each NG* and performs E(t′i,j) = SMR(E(nodez.datas,j), E(αz)), where 1 ≤ s ≤ FanOut and 1 ≤ j ≤ m (lines 14–22). Here, E(αz) is the GSPE result corresponding to nodez. If a node contains fewer data items than FanOut, SMR is performed with E(max) instead of E(nodez.datas,j), where max is the largest value in the domain. When CA has accessed one datum from every node in an NG*, CA computes E(candcnt,j) = \({ }\mathop \prod \nolimits_{i = 1}^{num} E(t^{\prime}_{i,j} )\), where num denotes the total number of nodes in the selected NG*.

As a result, the data items in the node related to the query are securely extracted without revealing the data access patterns [5, 14] because the searched nodes are not revealed. By repeating these steps, all of the data in the node are safely stored in E(candi,j) for 1 ≤ i ≤ cnt and 1 ≤ j ≤ m, where cnt denotes the total number of data items extracted during the index search. Figure 2 shows an example of the candidate node search phase. The example uses the data items and the kd-tree structure shown in Fig. 1. First, CA performs GSPE between E(q) and E(Nodei.Range) for 1 ≤ i ≤ #_of_node. CA stores the GSPE results in E(α) and sends them to CB. For example, in Fig. 2, CA performs GSPE({<E(0), E(0)>, <E(4), E(5)>}, <E(6), E(1)>) between E(Node1.Range) and E(q), and stores the GSPE result, i.e., E(0), into E(α1) for Node1. Second, CA shuffles the sequence of {<Node1, E(α1)>, <Node2, E(α2)>, …, <Noden, E(αn)>} and replaces the shuffled node ids with new ids so as to hide the original node ids from CB. To recover the original sequence of node ids later, CA records the shuffled sequence. For example, in Fig. 2, the original sequence {<Node1, E(0)>, <Node2, E(1)>, <Node3, E(0)>, <Node4, E(0)>} is shuffled to {<Node4, E(0)>, <Node1, E(0)>, <Node2, E(1)>, <Node3, E(0)>}. Then, CA renames Node4, Node1, Node2, and Node3 as PN1, PN2, PN3, and PN4, respectively. As a result, the shuffled sequence is {<PN1, E(0)>, <PN2, E(0)>, <PN3, E(1)>, <PN4, E(0)>}, which CA sends to CB. Third, CB receives the shuffled sequence and decrypts it. For example, in Fig. 2, CB receives {<PN1, E(0)>, <PN2, E(0)>, <PN3, E(1)>, <PN4, E(0)>} and obtains {<PN1, 0>, <PN2, 0>, <PN3, 1>, <PN4, 0>} by decrypting it. To generate the node groups (NGs), CB counts the number of 1s in the sequence. Each NG has one seed node whose αp equals 1, where 1 ≤ p ≤ #_of_node. Therefore, there are as many NGs as there are 1s in the sequence. The nodes whose αp equals 0, where 1 ≤ p ≤ #_of_node, are evenly assigned to the generated NGs. CB then sends the generated NGs to CA. For example, CB counts the 1s in the sequence {<PN1, 0>, <PN2, 0>, <PN3, 1>, <PN4, 0>} and generates NG1 with the seed PN3. The nodes <PN1, 0>, <PN2, 0>, and <PN4, 0> are assigned to NG1. CB sends NG1 = {PN3, PN1, PN2, PN4} to CA. Fourth, CA recovers the original node ids from the received NGs by using the recorded shuffled sequence of node ids. For example, in Fig. 2, CA obtains NG1′ = {Node2, Node4, Node1, Node3} as the original node ids by using both the received NG1 = {PN3, PN1, PN2, PN4} and the shuffled sequence of node ids. Fifth, CA performs the SMR protocol between E(α) and the encrypted data items in each node group, and builds a candidate set by summing all of the SMR results. In Fig. 2, CA performs SMR(E(0), E(Node1.Data)), SMR(E(1), E(Node2.Data)), SMR(E(0), E(Node3.Data)), and SMR(E(0), E(Node4.Data)). The SMR results are E(0), E(d5), E(0), and E(0); CA sums them and stores E(d5) into the candidate set. Sixth, for each NG, the SMR protocol and the summation of results are performed as many times as the number of data items in each node. For example, in Fig. 2, CA performs the summation of results four times. As a result, CA obtains {E(d5), E(d6), E(d7), E(d8)} as the candidate set.
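Ignoring the encryption, the extraction step above reduces to a masked sum: multiplying each node's data by its indicator α and adding the results leaves exactly the matching node's values, since only one α per group equals 1. A plaintext sketch of this arithmetic (the node contents below are illustrative, not the paper's exact dataset):

```python
# Plaintext view of the masked extraction (lines 14-22 of Algorithm 3):
# cand[s][j] = sum over nodes z of alpha[z] * node[z].data[s][j].
alpha = [0, 1, 0, 0]                  # GSPE result: only Node2 contains q
nodes = [                              # FanOut = 2 data items per node, m = 2
    [(1, 1), (2, 3)],                  # Node1
    [(5, 2), (6, 4)],                  # Node2  <- the matching node
    [(7, 7), (8, 6)],                  # Node3
    [(9, 9), (9, 8)],                  # Node4
]
fanout, m = 2, 2
cand = [tuple(sum(alpha[z] * nodes[z][s][j] for z in range(len(nodes)))
              for j in range(m))
        for s in range(fanout)]
assert cand == [(5, 2), (6, 4)]        # only Node2's items survive
```

In the protocol the multiplications are SMR calls and the sum is a product of ciphertexts, so the same cancellation happens without anyone seeing which α equals 1.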

Fig. 2. An example of the candidate node search phase

4.2 kNN retrieval phase

In the kNN retrieval phase, we retrieve the k nearest neighbors of the given query by partially utilizing the SkNNm scheme [22]. However, we only consider E(candi) for 1 ≤ i ≤ cnt, which were extracted in the index search phase, whereas SkNNm considers all of the encrypted data. In addition, we use our efficient secure protocols, which require relatively low computation costs, instead of the existing expensive secure protocols such as SBD (secure bit decomposition) [22, 24]. The overall procedure of the kNN retrieval algorithm is as follows. First, the algorithm calculates the distance between each encrypted data item and the encrypted query without decrypting either of them. Second, the algorithm finds the minimum distance (distmin) among the calculated distances. It cannot know which data item has distmin, owing to the semantic security of the Paillier cryptosystem. Third, to obtain the encrypted data item with distmin, the algorithm subtracts each calculated distance from distmin; the data item with distmin yields E(zero) as the result of the subtraction. Here, E(zero) is the only value that is not changed by the homomorphic multiplication of the Paillier cryptosystem. Therefore, the algorithm can distinguish the nearest neighbor from the others, while an attacker cannot determine which data item has the minimum distance. Fourth, to hide the original data items from the attacker, the algorithm homomorphically multiplies each subtraction result by a random value. It also shuffles the sequence of the multiplication results to hide the data access patterns. Finally, by using our secure protocols, the algorithm finds the nearest neighbor and repeats the above process until the k nearest neighbors are found.
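Stripped of the encryption, one round of this retrieval reduces to the following arithmetic, including the Eq. (10)-style update d = V·max + (1 − V)·d that disables the found neighbor. This is a plaintext sketch with our own names; it assumes all candidate distances are distinct for simplicity:

```python
import random

# Plaintext analogue of the retrieval rounds: find the minimum distance,
# flag it via blinded differences, extract the matching candidate by a
# masked sum, then set its distance to max so it is not picked again.
def knn_rounds(cands, q, k, max_dist=10**6):
    dist = [sum((c - s) ** 2 for c, s in zip(cand, q)) for cand in cands]
    result = []
    for _ in range(k):
        d_min = min(dist)                                # SMINn
        tau = [(d_min - d) * random.randrange(1, 100)    # blinded differences
               for d in dist]
        V = [1 if t == 0 else 0 for t in tau]            # CB flags the zero entry
        result.append(tuple(sum(V[i] * cands[i][j] for i in range(len(cands)))
                            for j in range(len(q))))     # masked extraction
        dist = [V[i] * max_dist + (1 - V[i]) * dist[i]   # Eq. (10)-style update
                for i in range(len(dist))]
    return result

# Fig. 3's running example: q = (6, 1), candidates d5..d8.
print(knn_rounds([(5, 2), (6, 4), (8, 3), (9, 4)], (6, 1), 2))
# -> [(5, 2), (8, 3)]
```

The random factor on each difference is what lets CB test for zero without learning the actual distance gaps.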

The pseudocode of the kNN retrieval phase is shown in Algorithm 4. First, using ESSED, CA securely calculates the squared Euclidean distance E(disti) between the query and E(candi) for 1 ≤ i ≤ cnt (lines 1–2). Second, CA performs SMINn to find the minimum value E(distmin) among E(disti) for 1 ≤ i ≤ cnt (lines 3–4). Third, CA calculates E(τi) = E(distmin) × E(disti)N−1 for 1 ≤ i ≤ cnt and computes E(τiʹ) = E(τi)ri. CA obtains E(β) by shuffling E(τʹ) using a random shuffling function π and sends E(β) to CB (lines 4–9). For example, E(τʹ) is calculated as {E(0), E(−r)}, where r denotes a random number; the E(0) corresponds to E(distmin). Assuming that π reverses the order of the data, CA sends E(β) = {E(−r), E(0)} to CB. Fourth, after decrypting E(β), CB sets E(Ui) = E(1) if βi = 0 and E(Ui) = E(0) otherwise, and sends E(U) to CA (lines 10–13). Fifth, CA obtains E(V) by shuffling E(U) using π−1. Then, CA performs the SMR protocol on E(Vi) and E(candi,j) to obtain E(Vʹi,j) (lines 14–17). Sixth, by computing E(nns,j) = \(\mathop \prod \limits_{i = 1}^{cnt} E(V^{\prime}_{i,j} )\) for 1 ≤ j ≤ m, CA can securely extract the datum corresponding to E(distmin) (line 18). Finally, to prevent the selected result from being selected again in a later round, CA securely updates the distance of the selected result to E(max) by computing Eq. (10) (lines 19–22).

$$ E\left( {d_{i} } \right) \, = {\text{ SMR}}\left( {E\left( {V_{i} } \right),E\left( {max} \right)} \right) \, \times {\text{ SMR}}\left( {{\text{SBN}}\left( {E\left( {V_{i} } \right)} \right),E\left( {d_{i} } \right)} \right) $$
(10)

Because only the selected result has E(Vi) = E(1), the E(disti) corresponding to the datum selected in the current round becomes E(max), while the other values remain the same. This procedure is repeated for k rounds to find the kNN result.

[Algorithm 4]

Figure 3 shows an example of the kNN retrieval phase. The example uses the data items and the kd-tree structure shown in Fig. 1. For simplicity and clarity, the shuffling function π is omitted. First, CA calculates the distances using ESSED and stores the ESSED results in E(disti) for 1 ≤ i ≤ cnt (①). In Fig. 3, CA performs ESSED(E(d5), <E(6), E(1)>) for E(d5) and E(q), and stores the ESSED result, i.e., E(2), into E(dist5). Second, the minimum value is computed using SMINn and the SMINn result is stored in E(distmin) (②). In Fig. 3, CA performs SMINn(E(2), E(9), E(8), E(18)) to obtain the minimum distance among E(dist5), E(dist6), E(dist7), and E(dist8), and stores the SMINn result, i.e., E(2), into E(distmin). Third, to obtain the encrypted data item with the minimum distance to the given query, CA computes E(distmin − disti) and stores the result in E(\(\tau_{i}\)) for 1 ≤ i ≤ cnt (③). When distmin is the same as disti, E(0) is stored in E(\(\tau_{i}\)). In Fig. 3, for E(dist5), CA computes E(2 − 2) and stores the result, i.e., E(0), into E(\(\tau_{5}\)). Fourth, to protect the value of E(\(\tau_{i}\)) for 1 ≤ i ≤ cnt, CA homomorphically multiplies E(\(\tau_{i}\)) by a random value (④). In Fig. 3, for E(\(\tau_{6}\)), CA computes E(−7 × 3) with the random value 3 and stores the result, i.e., E(−21), into E(\(\tau_{6} ^{\prime}\)). Fifth, CB returns E(1) as E(Vi) if E(\(\tau_{i} {^{\prime}}\)) is E(0) for 1 ≤ i ≤ cnt; otherwise, it returns E(0) as E(Vi) (⑤). Sixth, CA obtains the nearest neighbor by performing SMR between E(Vi) and E(candi) for 1 ≤ i ≤ cnt and merging the SMR results (⑥–⑦). In Fig. 3, CA performs SMR(E(1), E(5)), SMR(E(0), E(6)), SMR(E(0), E(8)), and SMR(E(0), E(9)) for the x-axis and SMR(E(1), E(2)), SMR(E(0), E(4)), SMR(E(0), E(3)), and SMR(E(0), E(4)) for the y-axis. CA merges E(5), E(0), E(0), and E(0) for the x-axis and E(2), E(0), E(0), and E(0) for the y-axis. As a result, CA obtains <E(5), E(2)> as the nearest neighbor. Seventh, by using Eq. (10), CA sets the distance of the found nearest neighbor to the maximum value so that it avoids finding the same nearest neighbor again in the next round (⑧). In Fig. 3, CA performs SMR(E(1), E(max)) × SMR(E(0), E(2)) for E(d5) and stores E(max) into E(dist5). Finally, CA repeats the above process until the k nearest neighbors are found (②–⑦).

Fig. 3. An example of the kNN retrieval phase

4.3 kNN result refinement phase

As mentioned in [24], the result of the kNN retrieval phase may not be accurate because the candidates are extracted from only one leaf node in the index search phase. Therefore, the kNN result refinement is necessary to confirm the correctness of the current query result. Specifically, assuming that the squared Euclidean distance between the kth closest result E(nnk) and the query is distk, the neighboring kd-tree nodes need to be searched to find data with a shorter distance than distk. For this reason, we use the concept of the shortest point (sp) defined in [24]. The sp is the point in a given node whose distance to a given point p is the smallest among all points in the node. To find the sp in each node, we use the following properties described in [24]. (i) If both the lower bound (lb) and the upper bound (ub) of the region are less than p, the ub becomes the sp of the region. (ii) If both the lb and the ub of the region are greater than p, the lb becomes the sp of the region. (iii) If p is between the lb and the ub of the region, p itself is the sp of the region. Since these properties can be applied to each dimension independently, our kNN result refinement phase partially utilizes that of the existing algorithms [19, 21]. However, to reduce the computation cost, we do not use the existing expensive protocols, such as SBD, SSED, SCMP, and SPE [22, 24].
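Properties (i)–(iii) together say that, dimension by dimension, the sp is just the query clamped to the node's bounding box. In plaintext (a sketch with our own names):

```python
# Per dimension: ub if p lies above the region, lb if below, p itself if inside.
# This is exactly a clamp of p into [lb, ub].
def shortest_point(p, lb, ub):
    return tuple(min(max(pj, lj), uj) for pj, lj, uj in zip(p, lb, ub))

# A node covering [0,4] x [0,5] and the query (6, 1) from the running example:
assert shortest_point((6, 1), (0, 0), (4, 5)) == (4, 1)
```

The secure version computes the same selection obliviously with GSCMP, SMR, and SBN, so that neither cloud learns which of the three cases applied.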

[Algorithm 5]

The procedure of the kNN result refinement phase is shown in Algorithm 5. First, CA computes E(distk) = ESSED(E(q), E(nnk)) to obtain the squared Euclidean distance between the query and the kth closest result, which is returned from the kNN retrieval phase (line 1). Second, for each node, CA performs GSCMP on E(qj) and E(nodez.lbj) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m and stores the result in E(ψ1). CA also performs GSCMP on E(qj) and E(nodez.ubj) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m and stores the result in E(ψ2) (lines 2–5). When the value of E(qj) is equal to or less than E(lbj) (resp. E(ubj)), E(ψ1) (resp. E(ψ2)) has the value E(1). Then, CA obtains E(ψ3) by computing E(ψ1) × E(ψ2) × SMR(E(ψ1), E(ψ2))N−2 so as to acquire the result of the bit-xor operation between E(ψ1) and E(ψ2) (line 6). Note that the exponent “−2” is equivalent to “N − 2” under ZN. Third, CA securely obtains the shortest point of each node, that is, E(spz,j), by computing SMR(E(ψ3), E(qj)) × SMR(SBN(E(ψ3)), f(E(lbz,j), E(ubz,j))) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m. Here, f(E(lbz,j), E(ubz,j)) is obtained by computing SMR(E(ψ1), E(lbz,j)) × SMR(SBN(E(ψ1)), E(ubz,j)) for 1 ≤ z ≤ numnode and 1 ≤ j ≤ m (lines 7–10). Fourth, CA calculates E(spdistz), the squared Euclidean distance between the query and E(spz), for 1 ≤ z ≤ numnode by using ESSED. In addition, CA securely updates the E(spdistz) of the nodes that were already retrieved in the index search phase to E(max) by computing E(spdistz) = SMR(E(αz), E(max)) × SMR(SBN(E(αz)), E(spdistz)) (lines 11–13). Here, E(αz) is the value returned by GSPE in the index search phase. Then, CA performs E(αz) = GSCMP(E(spdistz), E(distk)) (line 14). If E(spdistz) is less than E(distk), the corresponding nodez is assigned E(αz) = E(1). The nodes with E(αz) = E(1) need to be retrieved for the kNN result refinement. The number of nodes to expand depends on how many E(αz) become E(1). If the number of 1s among the E(αz) is c, then c node groups are created in CB and CA extracts the data of each node group. Therefore, cnt becomes c × fanout.

Because the E(spdistz) values of the nodes retrieved in the index search phase are E(max), those nodes are safely pruned. Fifth, CA securely extracts the data stored in the nodes with E(αz) = E(1) by performing the index search using E(α) and appends them to E(nn) (lines 15–16). Then, CA executes the kNN search phase on E(nn) to obtain the final kNN result E(resulti) for 1 ≤ i ≤ k (line 17). In the running example, the final result becomes {E(nn1), E(nn5)} because the squared Euclidean distance of E(nn5) is E(4). Sixth, CA returns the decrypted result to AU in cooperation with CB, to reduce the computation overhead on the AU side. To do this, CA computes E(γi,j) = E(resulti,j) × E(ri,j) for 1 ≤ i ≤ k and 1 ≤ j ≤ m by using a random value ri,j. Then, CA sends E(γi,j) to CB and ri,j to AU (lines 18–22). CB then decrypts E(γi,j) and sends the decrypted value to AU (lines 23–26). Finally, AU obtains the actual kNN result by computing γi,j − ri,j in plaintext (lines 27–29).
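The bit manipulations in this phase rest on two arithmetic identities that Paillier can evaluate homomorphically: ψ1 XOR ψ2 = ψ1 + ψ2 − 2·ψ1·ψ2 (the E(ψ1) × E(ψ2) × SMR(E(ψ1), E(ψ2))^(N−2) term, since −2 is N − 2 mod N), and the oblivious select b·x + (1 − b)·y (the SMR/SBN pattern). A plaintext sketch of both (names ours):

```python
# psi1 XOR psi2 = psi1 + psi2 - 2*psi1*psi2; over ciphertexts, "-2" becomes
# the exponent N - 2 mod N.
def bit_xor(a, b):
    return a + b - 2 * a * b

# Oblivious select: returns x when bit == 1 and y when bit == 0.
# Homomorphically this is SMR(E(bit), E(x)) * SMR(SBN(E(bit)), E(y)).
def select(bit, x, y):
    return bit * x + (1 - bit) * y

assert [bit_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
assert (select(1, 7, 3), select(0, 7, 3)) == (7, 3)
```

Both the shortest-point computation (ψ3 selecting between qj and f(lb, ub)) and the distance update in lines 11–13 are instances of this select pattern.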

5 Parallel kNN query processing algorithm

5.1 Parallel encrypted kd-tree search phase

In the parallel encrypted kd-tree search phase, CA simultaneously extracts all of the data from the node containing a query point. To extend the encrypted kd-tree search phase to a parallel environment, we use a thread pool that stores tasks in a queue so that threads can process the tasks in parallel. The procedure of the parallel encrypted kd-tree search phase is shown in Algorithm 6. First, CA generates a queue-based thread pool (line 1). Whenever a thread in the pool is available, it processes a task in FIFO order. Second, CA pushes the tasks GSPE(E(q), E(nodei)) for 1 ≤ i ≤ #_of_node to the thread pool. The GSPE results are stored in E(α) = {E(α1), E(α2), …, E(α#_of_node)} (lines 2–3). Third, CA generates E(α′) by shuffling E(α) using a random shuffling function π and sends E(α′) to CB (lines 4–5). Fourth, CB performs the same procedure as in Algorithm 3 (line 6). Fifth, CA obtains NG* by shuffling the ids of the nodes in each NG′ using π−1 (line 7). Finally, CA accesses one datum in each node of each NG* and pushes both E(t′i,j) = SMR(E(nodez.datas,j), E(αz)) and E(candcnt+s,j) = E(candcnt+s,j) × E(t′i,j) to the thread pool, where 1 ≤ s ≤ FanOut and 1 ≤ j ≤ m (lines 8–18).
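A minimal sketch of this queue-based task submission, with Python's ThreadPoolExecutor standing in for the pool and a placeholder gspe function operating on plaintexts (the real GSPE runs on ciphertexts; all names here are ours):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the GSPE protocol: returns 1 iff the (plaintext) query falls
# inside the node's bounding box. The real protocol does this obliviously.
def gspe(q, node):
    lb, ub = node
    return int(all(l <= x <= u for x, l, u in zip(q, lb, ub)))

nodes = [((0, 0), (4, 5)), ((5, 0), (9, 5)), ((0, 6), (4, 9)), ((5, 6), (9, 9))]
q = (6, 1)

# Line 1: create the pool; lines 2-3: push one GSPE task per leaf node.
# map() queues the tasks and idle threads consume them in FIFO order.
with ThreadPoolExecutor(max_workers=4) as executor:
    alpha = list(executor.map(lambda nd: gspe(q, nd), nodes))

assert alpha == [0, 1, 0, 0]   # only the node containing q is flagged
```

Since each GSPE call touches a different node, the tasks are independent and parallelize without synchronization beyond collecting the results.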

[Algorithm 6]

5.2 Parallel kNN retrieval phase

In the parallel kNN retrieval phase, we retrieve the k closest data items to the query in parallel by partially utilizing the SkNNm scheme [22]. We consider E(candi) for 1 ≤ i ≤ cnt, which were extracted in the parallel index search phase. The procedure of the parallel kNN retrieval phase is shown in Algorithm 7. First, using ESSED, CA simultaneously calculates the squared Euclidean distances E(di) between the query and E(candi) for 1 ≤ i ≤ cnt (lines 1–2). Second, CA performs SMINn to find the minimum value E(dmin) among E(di) for 1 ≤ i ≤ cnt (lines 3–4). Third, CA simultaneously calculates both E(τi) = E(dmin) × E(di)N−1 and E(τiʹ) = E(τi)ri for 1 ≤ i ≤ cnt (lines 5–7). CA obtains E(β) by shuffling E(τʹ) using a random shuffling function π and sends E(β) to CB (lines 8–9). Fourth, after decrypting E(β), CB sets E(Ui) = E(1) if βi = 0 and E(Ui) = E(0) otherwise, and sends E(U) to CA (line 10). Fifth, CA obtains E(V) by shuffling E(U) using π−1 (line 11). Sixth, instead of the SM protocol, CA simultaneously performs the SMR protocol on E(Vi) and E(candi,j) to obtain E(Vʹi,j) (lines 12–16). Seventh, by computing E(nns,j) = \(\mathop \prod \nolimits_{i = 1}^{cnt} E(V^{\prime}_{i,j} )\) for 1 ≤ j ≤ m, CA extracts the datum corresponding to E(dmin) (lines 17–18). Finally, CA simultaneously updates the distance of the selected result to E(max) by computing Eq. (11) (lines 19–24).

$$ E\left( {d_{i} } \right) \, = {\text{ SMR}}\left( {E\left( {V_{i} } \right), \, E\left( {max} \right)} \right) \, \times {\text{ SMR}}\left( {{\text{SBN}}\left( {E\left( {V_{i} } \right)} \right), \, E\left( {d_{i} } \right)} \right) $$
(11)
[Algorithm 7]

5.3 Parallel kNN result refinement phase

In the parallel kNN result refinement phase, CA checks in parallel whether the kNN results are correct. If not, CA performs both the index search phase and the kNN retrieval phase again. The procedure of the parallel kNN result refinement phase is shown in Algorithm 8. First, CA computes E(distk) = ESSED(E(q), E(nnk)) to obtain the squared Euclidean distance between the query and the kth closest result, which is returned from the kNN retrieval phase (line 1). Second, CA simultaneously finds the nodes that are closer than distk by using both the SMR and GSCR protocols (lines 2–16). Third, CA performs lines 15–22 of Algorithm 5 (line 17). Fourth, CB decrypts E(γi,j) and sends the decrypted value to AU (lines 18–21). Finally, AU obtains the actual kNN result by computing γi,j − ri,j in plaintext (lines 22–24).

[Algorithm 8]

6 Security proof under semi-honest model

As mentioned above, the proposed privacy-preserving kNN algorithm operates under a semi-honest attack model. Therefore, the security proof of the proposed algorithm is given from the three viewpoints of CA, CB, and AU (the authorized user), which are the acting parties other than the data owner. In addition, the following lemmas are used in our security proof.

Lemma 1

If a random element r is uniformly distributed on ZN and independent from any variable x \(\in\) ZN, then r ± x is also uniformly random and independent from x.
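Lemma 1 can be checked empirically for a toy modulus: shifting a fixed x by every residue r of ZN hits every residue of ZN exactly once, so a uniform r yields a uniform x + r (a sketch; the modulus and values are ours):

```python
from collections import Counter

N, x = 11, 7
# r uniform over Z_N  =>  (x + r) mod N takes every value in Z_N exactly once,
# so the blinded value reveals nothing about x.
counts = Counter((x + r) % N for r in range(N))
assert len(counts) == N
assert all(c == 1 for c in counts.values())
```

This is exactly why CB, which only ever sees m + r in Theorem 2, learns nothing about m.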

Lemma 2

The Paillier cryptosystem is semantically secure based on the composite residuosity class problem [26].

Theorem 1

The proposed privacy-preserving kNN algorithm is secure from the perspective of CA under the semi-honest model.

Proof

CA holds the encrypted database and the encrypted index. However, since CA does not hold the decryption key, the encrypted database and the encrypted index are not exposed. The data cannot be inferred even under a frequency-based attack because the same plaintext yields different ciphertexts (Lemma 2). In addition, since all values that our secure protocols return from CB are encrypted, no information is exposed by the data received from CB. Even though the query is received from the user, it cannot be inferred because the query remains encrypted. □

Theorem 2

The proposed privacy-preserving kNN algorithm is secure from the perspective of CB under the semi-honest model [32].

Proof

CB decrypts the ciphertexts received from CA. Because CA hides the original data by adding a random integer before passing them to CB, CB cannot infer meaningful data from the decrypted plaintext (Lemma 1). □

Theorem 3

The proposed privacy-preserving kNN algorithm is secure from the perspective of AU under the semi-honest model.

Proof

AU encrypts his/her query using the public key and sends the encrypted query to CA. This protects the user's preferences and personal information. Since the query results received from CA and CB do not include information about the owner's other data, it is impossible to infer the original data. □

Theorem 4

The proposed privacy-preserving kNN algorithm is secure even though c and cnt are exposed to CA under the semi-honest model.

Proof

CA can obtain both c and cnt in Algorithms 3 and 5. Here, c is the number of nodes relevant to the query and cnt equals c × fanout (i.e., F). Initially, c equals one in the candidate search phase, and in the result refinement phase c ranges from zero to the total number of leaf nodes. However, because the upper and lower bounds of all nodes are encrypted and the node ids are hidden through grouping and shuffling, it is impossible to know which nodes are related to the query. Therefore, even if c and cnt are exposed as plaintext to CA, an attacker cannot know which nodes are related to the query, resulting in no additional information leakage. □

Theorem 5

The proposed privacy-preserving kNN algorithm is secure even though c is exposed to CB under the semi-honest model.

Proof

CB can obtain c in Algorithms 3 and 5. Here, c denotes the number of nodes related to the query. Because CB does not know the fanout (F) of the kd-tree, cnt is not disclosed to CB. In addition, because the order of the node ids is changed through shuffling, it is impossible to infer which nodes are related to the query. Therefore, even if c is exposed to CB as plaintext, an attacker cannot know which nodes are related to the query, resulting in no additional information leakage. □

According to Theorems 1, 2, 3, 4, and 5, the original data, the index, and the query are protected by the Paillier cryptosystem (Lemma 2), and, even when decrypted, the original data cannot be inferred because of the added random values (Lemma 1). Hence, we prove that the proposed privacy-preserving kNN algorithm guarantees data protection, query protection, and query result protection, while hiding the data access patterns.

In addition, the proposed parallel kNN algorithm also operates under the semi-honest attack model, and its security proof is likewise given from the three viewpoints of CA, CB, and AU. Because the procedure of the proposed parallel kNN algorithm is the same as that of the proposed privacy-preserving kNN algorithm except for the use of multiple threads, the parallel algorithm is secure from the perspectives of CA, CB, and AU under the semi-honest attack model. Therefore, we prove that the proposed parallel kNN algorithm guarantees data protection, query protection, and query result protection, while hiding the data access patterns [4, 5, 14].

7 Performance analysis

In this section, we compare the proposed privacy-preserving kNN algorithm (SkNNG) with the existing algorithms, SkNNm [22] and SkNNI [24], which can hide data access patterns. We used the Paillier cryptosystem to encrypt a database for both schemes [22, 24]. We implemented our algorithm and the existing ones using C++. Experiments were performed on a Linux machine with an Intel Xeon E3-1220v3 4-Core 3.10 GHz and 32 GB RAM running Ubuntu 14.04.2. In addition, we compare the proposed parallel algorithm (SkNNPG) with the parallel version of SkNNm (SkNNpm) and that of SkNNI (SkNNPI). Experiments for parallel algorithms were performed on a Linux machine with an Intel Xeon CPU E5-2630v4 2.20 GHz and 64 GB RAM running Ubuntu 14.04.2.

We conduct the performance analysis using both a synthetic dataset and a real dataset. For the synthetic dataset, we randomly generated 30 k records with six attributes. For the real dataset, we use the Chess dataset available at http://archive.ics.uci.edu/ml/datasets [41], which consists of 28,056 records with six attributes. The parameters for our experiments are listed in Table 3. We use 512 bits for the encryption key size (K) and set the default value of k to 10. Each query was generated by selecting a random integer from the data domain.

Table 3 Parameters used in our experiments

7.1 Performance using a synthetic dataset

Figure 4 shows the performance of both SkNNI and SkNNG in terms of the height of the kd-tree (h). When the number of data items is fixed, the fanout (F) decreases as h increases because the total number of leaf nodes is 2h−1. Therefore, it is important to choose an appropriate kd-tree height (h) for the given number of data items. If h is too high for the given number of data items, the number of leaf nodes to be searched increases, and thus the cost of accessing leaf nodes in the candidate node search phase increases. On the contrary, if h is too low, the number of data items to be searched increases, and thus the cost of calculating the distances between the data items and the query in the kNN retrieval phase increases. Both the existing algorithm (SkNNI) and the proposed algorithm (SkNNG) show near-optimal performance when h = 7. In particular, the performance of the existing algorithm is greatly affected by the height of the kd-tree because it uses secure protocols based on an encrypted binary array, which require a high computation cost. However, the proposed algorithm is less affected by h because it uses secure protocols based on the garbled circuit, which require a low computation cost. As a result, we set h to 7 in our experiments.
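The trade-off can be made concrete: with n data items and height h, the kd-tree has 2^(h−1) leaves, so each leaf holds roughly n / 2^(h−1) items. A small sketch of this arithmetic (the paper's exact node-capacity bookkeeping may differ):

```python
n = 30_000                                   # synthetic dataset size
for h in range(5, 10):
    leaves = 2 ** (h - 1)                    # number of kd-tree leaf nodes
    print(f"h={h}: {leaves:4d} leaves, ~{n // leaves} items per leaf")
# Higher h -> more leaves to probe with GSPE in the candidate search phase;
# lower h -> more per-leaf distance computations in the kNN retrieval phase.
```

At h = 7 there are 64 leaves of roughly 468 items each, which balances the two costs in our setting.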

Fig. 4. Performance with varying kd-tree depth for synthetic data

Figures 5 and 6 show the performance of the three algorithms on a single machine. With varying n, our SkNNG shows 30.2 and 6.1 times better performance than SkNNm and SkNNI, respectively. With varying k, our SkNNG shows 33.2 and 4.9 times better performance than SkNNm and SkNNI, respectively. SkNNG outperforms SkNNm because it reduces the computation cost by pruning unnecessary data with the kd-tree, whereas SkNNm considers all of the data. In addition, SkNNG outperforms SkNNI because our algorithm uses efficient secure protocols based on both Yao’s garbled circuit and the data packing technique. First, if a function can be realized with a reasonably small circuit, Yao’s garbled circuit provides a high degree of efficiency. Because our secure protocols, i.e., GSCMP and GSPE, do not take the encrypted binary representation of the data as inputs, contrary to the existing protocols used in [22, 24], our circuits remain reasonably small. As a result, SkNNG achieves a low computation cost by using GSCMP and GSPE. Second, our ESSED protocol requires only one encryption operation thanks to the data packing technique, while the alternative protocol (i.e., DESSED) needs m encryption operations. Moreover, ESSED calculates the randomized distance in plaintext, while the alternative computes the sum of the squared Euclidean distances over all attributes on ciphertexts. Therefore, SkNNG greatly reduces the computation cost by using ESSED.

Fig. 5. Performance with varying n for synthetic data

Fig. 6. Performance with varying k for synthetic data

Figures 7, 8, and 9 show the performance of the three parallel algorithms. In Fig. 7, when the number of threads is 2, 4, 6, 8, and 10, the query processing time of SkNNPG is 3309, 2009, 1572, 1136, and 994 s, respectively; the query processing time of SkNNPG decreases as the number of threads grows. In addition, our SkNNPG shows 12 and 7 times better performance on average than SkNNpm and SkNNPI, respectively. In Fig. 8, when the number of data items is 5 k, 10 k, 15 k, 20 k, 25 k, and 30 k, the query processing time of SkNNPG is 125, 241, 353, 462, 537, and 640 s, respectively. SkNNPI and SkNNPG show better performance than SkNNpm because they use the index-based data filtering technique. In Fig. 9, when k is 5, 10, 15, and 20, the query processing time of SkNNPG is 586, 1173, 1773, and 2308 s, respectively. Our SkNNPG shows 10 and 5.2 times better performance on average than SkNNpm and SkNNPI, respectively. SkNNPG outperforms SkNNpm and SkNNPI because it uses secure protocols that are efficient in a parallel environment, i.e., SMR and GSCR.

Fig. 7. Performance with varying # of threads for synthetic data

Fig. 8. Performance with varying n for synthetic data in parallel

Fig. 9. Performance with varying k for synthetic data in parallel

Privacy-preserving kNN query processing algorithms generally use homomorphic encryption to provide data privacy and query privacy. Therefore, they inevitably require a high computational cost, and their search time complexity is linear. As a result, the existing algorithms handle about 10,000 data items in their performance evaluations [22, 24]. Following the existing work, we evaluate the privacy-preserving kNN query processing algorithms with the number of data items ranging from 5000 to 30,000. Our performance evaluation shows that the proposed algorithm (SkNNG) has linear time complexity, but its slope is lower than that of the existing algorithms (SkNNm, SkNNI). This is because the proposed algorithm performs data filtering using the kd-tree structure and uses Yao’s garbled circuit, which avoids the encrypted binary array.

To show that the proposed algorithm can handle a very large dataset (e.g., 1 million items), we evaluate it with 300,000, 600,000, and 1,000,000 data items. In this evaluation, we exclude the existing algorithms because they cannot work on a very large dataset, owing to both their extremely long execution times and the absence of parallel versions. Instead of the six-dimensional data in Table 3, we use two-dimensional data because the main memory available for our experiment is limited. Figure 10 shows the performance of the proposed parallel algorithm on these very large datasets. The proposed algorithm requires 452, 834, and 1341 s for 300,000, 600,000, and 1,000,000 data items, respectively. The experiment shows that the running time of the proposed algorithm is linear in the number of data items. Thus, we infer that the proposed algorithm can handle a very large dataset with linear time complexity.

Fig. 10 Performance with varying n for a very large dataset

7.2 Performance using a real dataset

According to Fig. 11, the performances of both SkNNI and SkNNG are best when h is 7 or 8, so we set h to 7 in our experiment. Figure 12 shows the performance of the three algorithms on a single machine. With varying k, our SkNNG shows 22.3 and 5.9 times better performance than SkNNm and SkNNI, respectively. SkNNG outperforms SkNNm because it reduces the computation cost by pruning unnecessary data with the kd-tree, whereas SkNNm considers all the data. In addition, SkNNG outperforms SkNNI because it uses efficient secure protocols based on both Yao's garbled circuit and the data packing technique.
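To illustrate the pruning effect, the following is a minimal plaintext kd-tree kNN sketch. It is our own simplification: it omits all encryption, whereas SkNNG performs the equivalent filtering inside secure protocols over the encrypted kd-tree.

```python
import heapq

def build(points, depth=0):
    """Build a kd-tree: (point, split axis, left subtree, right subtree)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1), build(points[mid + 1:], depth + 1))

def knn(node, q, k, heap=None):
    """k nearest neighbors of q; heap holds (-squared_distance, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, axis, left, right = node
    d = sum((a - b) ** 2 for a, b in zip(point, q))
    heapq.heappush(heap, (-d, point))          # max-heap of the k best so far
    if len(heap) > k:
        heapq.heappop(heap)
    near, far = (left, right) if q[axis] <= point[axis] else (right, left)
    knn(near, q, k, heap)
    # Prune: visit the far subtree only if it can still beat the k-th best.
    if len(heap) < k or (q[axis] - point[axis]) ** 2 < -heap[0][0]:
        knn(far, q, k, heap)
    return heap

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
best = sorted((-d, p) for d, p in knn(tree, (9, 2), 2))  # (distance, point)
```

The pruning test on the splitting axis is what lets whole subtrees be skipped, which is the plaintext analogue of the roughly 90% of data items that SkNNG never touches.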

Fig. 11 Performance with varying kd-tree depth for real data

Fig. 12 Performance with varying k for real data

Figures 13 and 14 show the performance of the three parallel algorithms. In Fig. 13, when the number of threads is 2, 4, 6, 8, and 10, the query processing time of SkNNPG is 1659, 977, 745, 624, and 536 s, respectively; that is, the query processing time of SkNNPG decreases as the number of threads grows. In addition, our SkNNPG shows 13.3 and 4.1 times better performance on average than SkNNPm and SkNNPI, respectively. In Fig. 14, when k is 5, 10, 15, and 20, the query processing time of SkNNPG is 277, 536, 765, and 1022 s, respectively. Our SkNNPG shows 12.1 and 3.7 times better performance on average than SkNNPm and SkNNPI, respectively. SkNNPG outperforms SkNNPm and SkNNPI because it uses secure protocols designed for a parallel environment, i.e., SMR and GSCR.
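The thread scaling in Fig. 13 can be summarized with a short sketch (timings from the text above; computing speedup relative to the 2-thread run is our own choice of baseline):

```python
# SkNNPG query processing times from Fig. 13: threads -> seconds.
times = {2: 1659, 4: 977, 6: 745, 8: 624, 10: 536}

# Speedup relative to the 2-thread baseline; growth is sublinear,
# reflecting the serial portions of the secure protocols.
speedup = {t: times[2] / s for t, s in times.items()}
```

For instance, going from 2 to 10 threads (a 5x increase) yields roughly a 3.1x speedup.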

Fig. 13 Performance with varying # of threads for real data

Fig. 14 Performance with varying k for real data in parallel

7.3 Discussion

In this section, we clarify the differences between the existing privacy-preserving kNN query processing algorithms [22, 24] and our algorithm, and we highlight the advantages of our algorithm. In Table 4, we analyze the privacy-preserving kNN query processing algorithms in terms of secure protocol, index structure, and random value pool.

Table 4 Comparison of privacy-preserving kNN query processing algorithms

Impact of secure protocol with low computational cost Secure protocols are crucial for privacy-preserving query processing in cloud computing. Because we target privacy-preserving query processing based on the Paillier cryptosystem, which incurs high computational cost, the secure protocols must be made as efficient as possible. First, Elmehdwi et al.'s algorithm proposed secure protocols, such as SM, SBD, SMIN, and SMINn, for kNN query processing. It protects data privacy and query privacy by using the Paillier cryptosystem, and it uses arithmetic operations to protect the original data and to hide data access patterns. However, its drawback is an excessively high computational cost because the SBD, SMIN, and SMINn protocols take an encrypted binary array as input. For example, to perform the SMIN protocol between E(8) and E(7), the clouds first transform each encrypted decimal value into an encrypted binary array: E(8)(10) = {E(1), E(0), E(0), E(0)}(2) and E(7)(10) = {E(0), E(1), E(1), E(1)}(2). The clouds then run the SMIN protocol on these two arrays. Consequently, SMIN requires high computational cost because it performs encrypted operations as many times as the bit length of the data domain. SBD and SMINn require high computational cost for the same reason. Second, Kim et al.'s algorithm proposed secure protocols, SCMP and SPE, which are used during index search to find kNN candidates. However, because both SCMP and SPE also take an encrypted binary array as input, they require high computational cost for the same reason as SMIN. Meanwhile, our algorithm proposes GSCMP and GSPE, which perform only one Paillier arithmetic operation. This is because, by utilizing Yao's garbled circuit, they take an encrypted decimal value as input rather than an encrypted binary array. As a result, GSCMP and GSPE require low computational cost.
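The cost gap can be made concrete with a toy textbook Paillier instance (tiny, insecure parameters of our own choosing; this is an illustration, not the paper's implementation):

```python
import random
from math import gcd

# Toy textbook Paillier cryptosystem -- demo parameters only, NOT secure.
p, q = 293, 433
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                            # valid because g = n + 1

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:                       # r must lie in Z*_n
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Additive homomorphism: E(a) * E(b) mod n^2 decrypts to a + b.
assert dec(enc(8) * enc(7) % n2) == 15

# SMIN-style input: each value is first bit-decomposed, so comparing
# 8 and 7 over a 4-bit domain costs one ciphertext per bit ...
bits_8 = [enc(b) for b in (1, 0, 0, 0)]   # 8 = 1000 in binary, MSB first
bits_7 = [enc(b) for b in (0, 1, 1, 1)]   # 7 = 0111 in binary, MSB first

# ... whereas GSCMP/GSPE take a single encrypted decimal per operand.
e8, e7 = enc(8), enc(7)
```

With a realistic 32-bit domain, the bit-decomposed representation multiplies the number of ciphertexts, and hence the number of expensive modular exponentiations, by the domain's bit length.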

Impact of using an index structure over the encrypted database Because Elmehdwi et al.'s algorithm does not use an index structure for data filtering, it must process all of the data items, which degrades performance. To solve this problem, both Kim et al.'s algorithm and ours use an encrypted kd-tree as an index structure and thereby achieve a performance gain. Our experiment shows that our algorithm accesses only 10% of all the data items on average, owing to data filtering with the kd-tree. Table 5 compares the privacy-preserving kNN algorithms in terms of the number of data items accessed.

Table 5 Comparison of privacy-preserving kNN algorithms in terms of the number of data items accessed

Elmehdwi et al.'s algorithm accesses N × k data items in total, whereas both Kim et al.'s algorithm and ours access N/F + F × k × (c + 1) data items in total. Here, N is the number of data items, F is the fanout of the kd-tree, and c is the number of node groups in the kNN result refinement phase. Using Eq. (12), we show that both Kim et al.'s algorithm and ours outperform Elmehdwi et al.'s algorithm in terms of the number of data items accessed. Here, the term 1/(F × k) can be ignored because it is small enough. Because F × (c + 1) is the number of data items in the leaf nodes selected for the privacy-preserving kNN query processing, which is always a subset of N, Eq. (12) holds. Hence, by using the index structure over the encrypted database, our algorithm and Kim et al.'s algorithm outperform Elmehdwi et al.'s algorithm.

$$ \begin{aligned} N \times k & { } \ge \frac{N}{F} + F \times k \times \left( {c + 1} \right) \\ & \Leftrightarrow k \ge \frac{1}{F} + \frac{{F \times k \times \left( {c + 1} \right)}}{N} \\ & \Leftrightarrow 1 \ge \frac{1}{F \times k} + \frac{{F \times \left( {c + 1} \right)}}{N} \\ & \approx 1 \ge \frac{{F \times \left( {c + 1} \right)}}{N} \\ \end{aligned} $$
(12)
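The two access counts can be compared directly; the parameter values below (N, F, k, c) are our own illustrative assumptions, not the paper's experimental settings:

```python
# Access counts from the comparison above.
def full_scan_accesses(N, k):
    return N * k                      # Elmehdwi et al.: no index, all items

def kdtree_accesses(N, F, k, c):
    return N // F + F * k * (c + 1)   # index traversal + refinement phase

# Illustrative parameters: 10,000 items, fanout 25, k = 10, 3 node groups.
N, F, k, c = 10_000, 25, 10, 3
```

With these values the indexed algorithms access 1400 items versus 100,000 for the full scan, and the Eq. (12) condition F × (c + 1) ≤ N clearly holds.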

Impact of encrypted random value pool for parallelism In our secure system, we use two-party computation for the parallel kNN query processing algorithm. Thus, we need to prevent CB from extracting meaningful information while executing the secure protocols. For this, CA generates a random value r from ZN and encrypts it with the Paillier cryptosystem. Then, CA adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing \(E\left( {m + r} \right) = E\left( m \right) \times E\left( r \right)\). Because m + r is independent of m, CB cannot obtain meaningful information through decryption. However, adding a random value to a ciphertext in the Paillier cryptosystem degrades performance because both encryption and decryption require higher computational cost than the other encrypted operations. As shown in Table 2, in the Secure Multiplication protocol, both Elmehdwi et al.'s work and Kim et al.'s work require three encryptions: two encryptions of random values at CA and one encryption of the multiplication result at CB. Meanwhile, our algorithm requires only one encryption, for the multiplication result at CB, because it selects pre-encrypted random values from the random value pool instead of encrypting them at CA. Similarly, in the secure compare protocol, Elmehdwi et al.'s work requires log2 D encryptions, where D is the data domain, while Kim et al.'s work requires three encryptions: two encryptions of random values at CA and one encryption of the comparison result at CB. Meanwhile, our algorithm requires only one encryption, for the comparison result at CB, by using the random value pool. Therefore, our algorithm reduces the encryption cost by using the encrypted random value pool.
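The blinding step and the pool can be sketched with a toy Paillier instance (tiny, insecure parameters of our own choosing; the paper's system uses full-size keys, and the pool size here is arbitrary):

```python
import random
from math import gcd

# Toy Paillier setup -- demo parameters only, NOT secure.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
mu = pow(lam, -1, n)                  # valid because g = n + 1

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:             # r must lie in Z*_n
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Offline: CA fills a pool of encrypted random values in advance.
pool = [(r, enc(r)) for r in (random.randrange(n) for _ in range(8))]

# Online: blinding E(m) costs one modular multiplication and zero fresh
# encryptions at CA, because E(r) is drawn from the pool.
E_m = enc(42)                 # stands in for a ciphertext CA already holds
r, E_r = random.choice(pool)
blinded = E_m * E_r % n2      # E(m + r): the value CB is allowed to see
```

Moving the encryptions of the random values offline is exactly what removes two of the three per-call encryptions counted in Table 2.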

8 Conclusion and future work

Due to privacy issues, a database needs to be encrypted before being outsourced to the cloud. However, most of the existing kNN algorithms are insecure in that they disclose data access patterns during query processing. To solve this problem, we proposed a new privacy-preserving kNN query processing algorithm based on secure two-party computation. To achieve a high degree of efficiency in query processing, we also proposed a parallel kNN query processing algorithm using an encrypted random value pool. Our algorithms protect data privacy, query privacy, and data access patterns. In our performance analysis, our algorithms showed about 4–30 times better performance than the existing algorithms in terms of query processing cost. As future work, we plan to extend our algorithms to support other types of queries, such as Top-k queries and kNN classification. In addition, to the best of our knowledge, privacy-preserving kNN algorithms based on homomorphic encryption have generally been studied for low-dimensional data spaces because of their high computational cost [22, 24]. To deal with high-dimensional data spaces, we will extend our algorithm by using a data dimensionality reduction technique [42, 43].