Privacy-preserving kNN query processing algorithms via secure two-party computation over encrypted database in cloud computing

As studies on privacy-preserving database outsourcing have gained attention in cloud computing, databases need to be encrypted before being outsourced to the cloud. Accordingly, a couple of privacy-preserving kNN query processing algorithms have been proposed over encrypted databases. However, the existing algorithms are either insecure or inefficient. Therefore, in this paper we propose a privacy-preserving kNN query processing algorithm via secure two-party computation on the encrypted database. Our algorithm preserves both data privacy and query privacy while hiding data access patterns. For this, we propose efficient and secure protocols based on Yao's garbled circuit. To achieve a high degree of efficiency in query processing, we also propose a parallel kNN query processing algorithm using an encrypted random value pool. Through our performance analysis, we verify that our proposed algorithms outperform the existing ones in terms of query processing cost.


Introduction
Research on preserving data privacy in outsourced databases has been spotlighted with the development of cloud computing. Because a data owner (DO) outsources his/her databases and allows a cloud to manage them, the DO can reduce database management costs by flexibly using the cloud's resources [1][2][3]. The cloud not only maintains the databases, but also provides an authorized user (AU) with querying services on the outsourced databases.
However, because the data are private assets of the DO and may include sensitive information such as financial records, they should be protected against adversaries, including the cloud server. Therefore, the databases of the DO should be encrypted before being outsourced to the cloud. In addition, a user's query should be protected from the adversaries because the query may contain the private information of the user [4][5][6][7][8][9][10]. Therefore, a vital challenge in cloud computing is to protect both data privacy and query privacy among the data owner, the users, and the cloud. However, during query processing, the cloud can derive sensitive information about the actual data items and users by observing data access patterns, even if the data and the query are encrypted [11][12][13][14][15][16][17]. In addition, it is very challenging to process a query on the encrypted data without having to decrypt it.
Meanwhile, the k nearest neighbor (kNN) query, one of the most typical query types, has been widely used as a baseline technique in many fields, such as data mining and location-based services. The kNN query finds the k neighbors that are closest to a given query. However, a kNN result is closely related to the interest and preference of a user. Therefore, research on secure kNN query processing algorithms (SkNN) that preserve both the data privacy and the query privacy has been conducted [18][19][20][21][22][23][24]. However, the existing algorithms in [18,19] are insecure because they are vulnerable to chosen- and known-plaintext attacks. In addition, the DO should be heavily involved in the query processing [19][20][21]. Furthermore, the algorithms in [18][19][20][21] do not protect data access patterns from the cloud. The algorithms in [22,24] guarantee the confidentiality of both the outsourced databases and a user's query while hiding data access patterns. However, they suffer from high query processing cost.
To solve these problems, in this paper we propose a privacy-preserving kNN query processing algorithm via secure two-party computation on the encrypted database. Our algorithm preserves both data privacy and query privacy while hiding data access patterns. For this, we propose efficient and secure protocols based on Yao's garbled circuit [25] and a data packing technique. To enhance the performance of our kNN query processing algorithm, we also propose a parallel kNN query processing algorithm using improved secure protocols based on an encrypted random value pool. To verify the security of our algorithms, we provide formal security proofs of our privacy-preserving kNN query processing algorithms. Through the performance analysis, we verify that our proposed algorithms outperform the existing ones for both a synthetic dataset and a real dataset. Our contributions can be summarized as follows:
• We present a framework for outsourcing both encrypted databases and encrypted indexes.
• We propose new secure protocols (e.g., ESSED, GSCMP, GSPE) to preserve data privacy and query privacy while hiding data access patterns.
• We propose an encrypted random value pool to minimize the computational cost of secure protocols.
• We propose a new privacy-preserving parallel kNN query processing algorithm which supports efficient query processing.
• We also present an extensive experimental evaluation of our algorithms with various parameter settings.
The rest of the paper is organized as follows: Section 2 introduces background and related work. Section 3 presents system architecture and secure protocols. Section 4 proposes our privacy-preserving kNN query processing algorithm. Section 5 proposes our parallel kNN query processing algorithm. Section 6 shows the security proof of our privacy-preserving kNN algorithms under semi-honest model. Section 7 presents the performance analysis of our kNN query processing algorithms. Finally, Sect. 8 concludes this paper.

Background
Importance of hiding data access patterns The data access pattern is one of the most important factors for privacy preservation in cloud computing. If an attacker learns the order or frequency of data accesses, he/she can infer the original data from the access patterns. Therefore, hiding data access patterns is as important as encrypting data. First, in location-based services (LBS), one of the well-known queries is to find a nearby point of interest (POI) based on the user's current location. For data protection, POI data are indexed, encrypted, and outsourced using a spatial index structure. For query protection, a user's location is encrypted and used for query processing. Because the query and POI data are encrypted, the exact location is not exposed to an attacker. However, by observing accesses to the index structure, an attacker can obtain data access patterns. By using data access patterns, the attacker can learn where a query issuer is located and when he/she is in a specific area. As a result, if a user continuously issues queries while moving, an attacker can obtain his/her personal information, such as his/her moving trajectory and preferences.
Second, in a healthcare service, a data mining technique for classifying patients based on their health information and symptom is widely used. For classifying patients, the service finds the most similar disease by getting accesses to the previously generated disease classification table. Because the patient's health information is encrypted, an attacker cannot obtain sensitive information. However, an attacker can acquire data access patterns by repeatedly accessing the disease classification table with fake patients. As a result, by using the data access pattern, an attacker can infer what kind of disease an actual patient has when the patient information is given. Therefore, hiding data access patterns is very essential for privacy preservation in cloud computing.
Paillier cryptosystem The Paillier cryptosystem [26] is an additive homomorphic and probabilistic asymmetric encryption scheme for public key cryptography. The public key pk for encryption is given by (N, g), where N is the product of two large prime numbers p and q, and g is a generator in Z* N². Here, Z* N² denotes the multiplicative group of integers modulo N². The secret key sk for decryption is given by (p, q). Let E(·) denote the encryption function and D(·) denote the decryption function. The Paillier cryptosystem has the following properties.
(1) Homomorphic addition The product of two ciphertexts E(m 1 ) and E(m 2 ) results in the encryption of the sum of m 1 and m 2 , i.e., E(m 1 ) × E(m 2 ) mod N² = E(m 1 + m 2 ) (Eq. 1).
(2) Homomorphic multiplication The m 2 th power of a ciphertext E(m 1 ) results in the encryption of the product of m 1 and m 2 , i.e., E(m 1 )^m 2 mod N² = E(m 1 × m 2 ) (Eq. 2).
(3) Semantic security Encrypting the same plaintext with the same public key results in distinct ciphertexts, i.e., E(m 1 ) ≠ E(m 1 ) with overwhelming probability (Eq. 3).
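The three properties can be checked concretely with a toy implementation of the cryptosystem; the tiny hard-coded primes and helper names below are ours for illustration only (real deployments use keys of 2048 bits or more):

```python
# Toy Paillier demo of the three properties above (insecure parameters, demo only).
import math, random

p, q = 293, 433                      # small primes; illustration only
N = p * q
N2 = N * N
g = N + 1                            # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, N)                 # valid because g = N + 1

def E(m):
    """Encrypt m with fresh randomness r (probabilistic encryption)."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(g, m, N2) * pow(r, N, N2)) % N2

def D(c):
    """Decrypt via L(c^lam mod N^2) * mu mod N, where L(x) = (x - 1) / N."""
    return (((pow(c, lam, N2) - 1) // N) * mu) % N

m1, m2 = 17, 25
assert D((E(m1) * E(m2)) % N2) == m1 + m2     # homomorphic addition (Eq. 1)
assert D(pow(E(m1), m2, N2)) == m1 * m2       # homomorphic multiplication (Eq. 2)
assert E(m1) != E(m1)                          # semantic security (Eq. 3)
```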
Therefore, an adversary cannot infer any information about the plaintexts. Yao's garbled circuit Yao's garbled circuits [25] allow two parties holding inputs x and y, respectively, to evaluate a function f(x, y) without leaking any information about the inputs beyond what is implied by the function output. One party generates an encrypted version of a circuit to compute f. The other party obliviously evaluates the output of the circuit without learning any intermediate values. Therefore, Yao's garbled circuit provides a high security level. Another benefit of Yao's garbled circuit is that it can provide high efficiency if the function can be realized with a reasonably small circuit.

Related work
The typical kNN query processing schemes on encrypted databases are as follows. Wong et al. [18] processed a kNN query by devising an encryption scheme that supports distance comparison on the encrypted data. However, the scheme is vulnerable to chosen-plaintext attacks [27,28] and cannot hide the data access pattern from the cloud. Yiu et al. [19] proposed a kNN query processing algorithm using the R-tree index [29] encrypted by AES [30]. However, the scheme has a drawback that most of the computation is performed at the user side rather than in the cloud. In addition, the data access pattern is not preserved because the user hierarchically requests the required R-tree nodes from the cloud. Hu et al. [20] proposed a kNN query processing algorithm by using a provably secure privacy homomorphism encryption method. However, the user is in charge of index traversal during the query processing. In addition, the scheme is known to be vulnerable to chosen-plaintext attacks and leaks the data access patterns. Zhu et al. [21] proposed a kNN query processing scheme that considers untrusted users. Because a user does not hold an encryption key, the data owner should encrypt the query. In addition, the cloud can learn the identifiers of the query result, which implies the leakage of the data access pattern.
Elmehdwi et al. [22] proposed the SkNN m scheme over the encrypted database. To the best of our knowledge, this is the first work that guarantees both the data privacy and the query privacy while hiding the data access pattern [14] at the same time. In addition, the data owner and the user do not participate in the query processing. However, the query processing cost of this scheme is extremely high because the scheme considers all of the encrypted data and makes use of secure protocols that take the encrypted binary representation of the data as inputs. Zhou et al. [23] proposed an asymmetric scalar-product-preserving encryption (ASPE) scheme based on Wong et al.'s work [18]. By using random asymmetric splitting with additional artificial dimensions, the scheme can resist known-plaintext attacks [28,31]. In this scheme, the query issuers are fully trusted and the decryption key is partially revealed to the query issuers. However, the scheme cannot hide the data access pattern. Most recently, Kim et al. [24] proposed a kNN query processing scheme (SkNN I ) by using an encrypted index. The algorithm guarantees the confidentiality of both the data and the user query while hiding data access patterns. By filtering unnecessary data using a secure index mechanism, the algorithm provides better performance than SkNN m . However, the algorithm still requires a high computation cost because it uses secure protocols that take the encrypted binary representation of the data as inputs.
System architecture and secure protocols

System architecture
The typical types of adversaries are semi-honest and malicious [32]. In this paper, we consider the clouds as insider adversaries who have more privileges than outside attackers. In the semi-honest adversarial model, the cloud correctly follows the given protocol, but may try to obtain additional information that is not allowed to it. In the malicious adversarial model, the cloud can deviate from the protocol specification. However, protocols secure against malicious adversaries are inefficient, whereas protocols designed for semi-honest adversaries are practical and can be used as building blocks for protocols against malicious adversaries. Therefore, following the earlier work [22,24], we also adopt the semi-honest adversarial model. A secure protocol under the semi-honest adversarial model can be defined as follows.

Definition 1 Secure protocol Let ∏ i (π) be an execution image of the protocol π at the C i side, and let a i and b i be the input and the output of C i in the protocol π, respectively. Then, π is secure if ∏ i (π) is computationally indistinguishable from the simulated image ∏ i s (π).
The system consists of four components: the data owner (DO), the authorized user (AU), and two clouds (C A and C B ). The DO owns the original database (T) of n records [33][34][35]. A record t i (1 ≤ i ≤ n) consists of m attributes, where m means the number of data dimensions, and the jth attribute value of t i is denoted as t i,j (1 ≤ j ≤ m). The DO partitions T by using the kd-tree structure [36,37] to provide indexing on T. The reason why we use the kd-tree structure as an index structure is to hide data access patterns. Using a space filling curve (e.g., Hilbert curve) for partitioning data items into blocks (nodes) can guarantee data locality, but it cannot guarantee that data items are evenly distributed over blocks. As a result, an attacker may infer a specific block based on the number of data items stored in it, whereas using the kd-tree structure for partitioning data items into blocks makes an attacker unable to distinguish one block from another. This is because data items are evenly distributed into blocks in the kd-tree structure even if the data items are skewed. Meanwhile, while traversing the kd-tree structure in a hierarchical way, an attacker can know which block is relevant to the query, which results in the leakage of data access patterns. To tackle this problem, our algorithm accesses only the leaf nodes of the kd-tree during the query processing step, rather than traversing the tree structure in a hierarchical way.
Henceforth, a node refers to a leaf node. Let h denote the level of the constructed kd-tree and F be the fanout of each leaf node. A node is denoted by node z (1 ≤ z ≤ 2 h−1 ), where 2 h−1 is the total number of leaf nodes. The region information of node z is represented as the lower bound lb z,j and the upper bound ub z,j (1 ≤ z ≤ 2 h−1 , 1 ≤ j ≤ m). Each node stores the identifiers (ids) of the data located inside the node region. To preserve the data privacy, the DO encrypts T attribute-wise using the public key (pk) of the Paillier cryptosystem [26] before outsourcing the database. Therefore, the DO generates E(t i,j ) for 1 ≤ i ≤ n and 1 ≤ j ≤ m by encrypting t i,j . The DO also encrypts the region information of all kd-tree nodes to support efficient query processing. Specifically, lb and ub of each node are encrypted attribute-wise such that E(lb z,j ) and E(ub z,j ) are generated with 1 ≤ z ≤ 2 h−1 and 1 ≤ j ≤ m. We assume that C A and C B are non-colluding and semi-honest (or honest-but-curious) clouds. Thus, they correctly perform the given protocols and do not exchange unpermitted data. However, they may try to obtain additional information from the intermediate data while executing their own protocols. This assumption is not new, as mentioned in the earlier works [22,24,39], and has been used in related problem domains (for example, in [38]).
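The even-partitioning argument can be illustrated with a plain (unencrypted) sketch of the leaf-level split; the function names and the toy dataset are ours, assuming median splits on alternating axes:

```python
# Sketch of the kd-tree partitioning described above: median splits on
# alternating axes yield leaves with (near-)equal numbers of points even
# when the data are skewed. Names (build_leaves, bounds) are illustrative.
def build_leaves(points, depth=0, levels=3, axis=0):
    """Return the leaf blocks of a kd-tree with `levels` levels (2**(levels-1) leaves)."""
    if depth == levels - 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    nxt = (axis + 1) % len(points[0])
    return (build_leaves(pts[:mid], depth + 1, levels, nxt) +
            build_leaves(pts[mid:], depth + 1, levels, nxt))

def bounds(leaf):
    """Per-dimension (lb, ub) region information of a leaf."""
    m = len(leaf[0])
    return [(min(p[j] for p in leaf), max(p[j] for p in leaf)) for j in range(m)]

# Skewed 2-D data still splits into 4 leaves of 4 points each (h = 3).
data = [(x, x * x % 7) for x in range(16)]
leaves = build_leaves(data, levels=3)
assert len(leaves) == 4 and all(len(l) == 4 for l in leaves)
```

In the paper's setting, the DO would then encrypt each leaf's (lb, ub) pair attribute-wise before outsourcing.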
In this paper, we consider privacy-preserving kNN query processing which retrieves k nearest data items that are closest to the given query. To support kNN query processing over the encrypted database, a secure multi-party computation (SMC) is required for privacy-preserving kNN query processing algorithm [40]. A secure multi-party computation can be defined as follows.

Definition 2 Secure multi-party computation A given number of participants p 1 , p 2 , …, p n (n ≥ 2) each hold private data d 1 , d 2 , …, d n , respectively. The participants want to compute the value of a public function F(d 1 , d 2 , …, d n ) on the private data while keeping their own inputs secret.
According to Definition 2, the proposed algorithm uses two clouds (e.g., C A and C B ) because at least two parties are required for secure computation. Existing studies, such as Elmehdwi et al. [22], also use two-party computation to support privacy-preserving kNN query processing. Thus, we do not consider a single-party computation model because it is vulnerable to semi-honest adversaries.
The DO outsources both the encrypted database and its encrypted index to the C A with pk, while the DO sends sk to the C B . The encrypted index includes the region information of each node in ciphertext and the ids of data located inside the node in plaintext. The DO also sends pk to AUs to enable them to encrypt a kNN query. When requesting a query, an AU first generates E(q j ) by encrypting a query q attribute-wise for 1 ≤ j ≤ m. C A and C B cooperatively process the query and return a query result to the AU without data leakage.
As an example, assume that the DO has sixteen data items in two-dimensional space (x-axis and y-axis), as depicted in Fig. 1 (data items and kd-tree in two-dimensional space). The data items are partitioned into four nodes of the kd-tree: node 1 , node 2 , node 3 , and node 4 . To clarify the relationship between data items and nodes, we suppose that there is no data item on the boundary of a node. To outsource the database, the DO encrypts each data item and its region information attribute-wise. The ith data item d i is represented as < x i , y i > in two-dimensional space. Therefore, d i can be encrypted to < E(x i ), E(y i ) > by using the Paillier cryptosystem. For example, d 1 = < 2,1 > is encrypted as E(d 1 ) = < E(2), E(1) > , and the encrypted index is shown in Fig. 1.

Enhanced secure protocols
Our kNN query processing algorithm is constructed using several secure protocols. We adopt four secure protocols from the literature [22,24,39]: secure multiplication (SM), secure bit-not (SBN), CoMPare-S (CMP-S), and secure minimum from a set of n values (SMIN n ). All of the protocols except the SBN protocol use the SMC technique between C A and C B , while the SBN protocol can be executed solely at the C A side. In addition, we propose three new secure protocols: enhanced secure squared Euclidean distance (ESSED), garbled circuit-based secure compare (GSCMP), and garbled circuit-based secure point enclosure (GSPE). For both GSCMP and GSPE, we use Yao's garbled circuits [25], which allow two parties holding inputs x and y, respectively, to evaluate a function f(x, y) without leaking any information about the inputs beyond what is implied by the function output. One party generates an encrypted version of a circuit to compute f. The other party obliviously evaluates the output of the circuit without learning any intermediate values. Therefore, Yao's garbled circuit provides a high security level. Another advantage of Yao's garbled circuit is that it provides high efficiency if the function can be realized with a reasonably small circuit [39]. Because our protocols do not take the encrypted binary representation of the data as inputs, contrary to the existing protocols [22,24], they can provide a low computation cost.
ESSED protocol: Suppose that there are two m-dimensional vectors X = (x 1 , …, x m ) and Y = (y 1 , …, y m ). We utilize a data packing technique to enhance the efficiency of the secure protocol. Specifically, we pack λ number of σ-bit data instances to generate a packed value. The overall procedure of ESSED is as follows. First, C A generates random numbers r j for 1 ≤ j ≤ m and packs them to obtain R using Eq. (4).
Then, C A generates E(R) by encrypting R. Second, C A calculates E(x j − y j ) attribute-wise and packs these results to obtain E(v) using Eq. (5). Third, C A sends the randomized packed value to C B ; C B decrypts it, computes the randomized distance in plaintext, and returns its encryption to C A . Finally, C A computes the encrypted squared Euclidean distance E(∑ j (x j − y j )²) by eliminating the randomized values using Eq. (6).
Our ESSED outperforms the existing distance computation protocol, i.e., the data packing-based secure squared Euclidean distance (DPSSED) [39]. Table 1 shows the difference between the existing DPSSED and our ESSED in terms of the number of encryptions. Our ESSED requires only one encryption on the C B side, while the existing DPSSED requires m encryptions. Therefore, our ESSED requires a total of two encryptions, whereas the existing DPSSED requires a total of m + 1 encryptions. In addition, our ESSED calculates the randomized distance in plaintext on the C B side, while the existing DPSSED computes the sum of the squared Euclidean distances over all attributes on ciphertext on the C A side. Therefore, the number of computations on encrypted data in our ESSED is greatly reduced compared with the existing DPSSED.
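The packing idea that both DPSSED and ESSED rely on can be sketched in isolation; the slot width σ = 16 and the helper names are our illustrative assumptions:

```python
# Sketch of the data packing idea behind ESSED: several values of sigma bits
# each are packed into a single integer, so one encryption (and one homomorphic
# addition) covers all slots at once. sigma must be wide enough that per-slot
# sums cannot overflow into the neighboring slot.
SIGMA = 16  # bits per slot (assumed width)

def pack(values, sigma=SIGMA):
    out = 0
    for j, v in enumerate(values):
        assert 0 <= v < (1 << sigma), "slot overflow"
        out |= v << (sigma * j)
    return out

def unpack(packed, count, sigma=SIGMA):
    mask = (1 << sigma) - 1
    return [(packed >> (sigma * j)) & mask for j in range(count)]

diffs = [3, 7, 2]           # e.g. attribute-wise x_j - y_j
rands = [11, 5, 9]          # the random pad R of Eq. (4)
v = pack(diffs); R = pack(rands)
# A single addition on packed values adds every slot simultaneously, which is
# why randomizing E(v) with E(R) needs only one homomorphic operation.
assert unpack(v + R, 3) == [d + r for d, r in zip(diffs, rands)]
```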
GSCMP protocol: Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, GSCMP (garbled circuit-based secure CoMPare) protocol returns the result as follows.
The main difference between GSCMP and CMP-S is that GSCMP receives encrypted data as inputs, while CMP-S receives the randomized plaintext. Furthermore, in the case of CMP-S, plaintext is returned as a result, whereas GSCMP encrypts the result of CMP-S and sends it to C A . Through this, GSCMP can protect the data access patterns. The overall procedure of the GSCMP is as follows.
First, C A generates two random numbers ru and rv and encrypts them. Note that a secure comparison may return a random value if u = v. To avoid returning a random value, our GSCMP protocol calculates u' = 2 × u and v' = 2 × v + 1 in Eqs. (7) and (8) while maintaining the inequality relation. For example, when u = v = 3, GSCMP calculates u' = 6 and v' = 7. Therefore, our GSCMP protocol avoids returning a random value when u = v.
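The tie-breaking map can be checked in isolation; a quick exhaustive test (names are ours) confirms that it preserves every strict inequality and removes the ambiguous case:

```python
# The doubling trick above in isolation: u' = 2u and v' = 2v + 1 preserves
# every strict inequality and turns the ambiguous case u == v into the
# deterministic u' < v', so the protocol never has to emit a random answer.
def tie_free(u, v):
    return 2 * u, 2 * v + 1

for u in range(8):
    for v in range(8):
        up, vp = tie_free(u, v)
        assert up != vp                      # ties are impossible after the map
        assert (up >= vp) == (u > v)         # u' >= v'  iff  u > v
        assert (up < vp) == (u <= v)         # u' <  v'  iff  u <= v (incl. u == v)
```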
Second, C A randomly chooses one functionality between F 0 : u ≥ v and F 1 : u < v. The selected functionality is oblivious to C B . Then, C A sends data to C B , depending on the selected functionality.
Fourthly, C A generates a garbled circuit consisting of two ADD circuits and one CMP circuit. Here, ADD circuit takes two integers u and v as input, and outputs u + v while CMP circuit takes two integers u and v as input, and outputs 1 if u < v, zero otherwise. If F 0 : u ≥ v is selected, C A puts -rv and -ru into the first and second ADD gates, respectively. If F 1 : u < v is selected, C A puts -ru and -rv into the first and second ADD gates.
Fifthly, if F 0 : u ≥ v is selected, C B puts m 2 and m 1 into the first and second ADD gates, respectively. If F 1 : u < v is selected, C B puts m 1 and m 2 into the first and second ADD gates.
Sixthly, the first ADD gate adds its two input values and puts the output result 1 into the CMP gate. Similarly, the second ADD gate puts the output result 2 into the CMP gate. Seventhly, the CMP gate outputs α = 1 if result 1 < result 2 is true, and α = 0 otherwise. The output of the CMP gate is returned to C B . Then, C B encrypts α and sends E(α) to C A . Since E(α) is an encrypted value, C A cannot identify the data received from C B . If C A received α in plaintext from C B , C A could know which data is relevant to the query, which can lead to the exposure of the data access patterns. Therefore, it is necessary that C B returns the comparison result to C A only in encrypted form. Finally, when the selected functionality is F 1 , C A adjusts E(α) homomorphically (e.g., by computing E(1 − α)) so that the final result is consistent regardless of the chosen functionality.
GSPE protocol: Suppose that E(p) is an encrypted value of a point p and E(range) is a set of encrypted values containing E(range.lb j ) and E(range.ub j ) for 1 ≤ j ≤ m (m is the data dimension). When E(p) and E(range) are given as inputs, the GSPE (garbled circuit-based secure point enclosure) protocol returns the result as follows.
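A plaintext walk-through of the ADD-ADD-CMP pipeline (a simulation of the gate logic only, not a real garbled evaluation; variable names are ours) shows that the random masks cancel and never influence the comparison output:

```python
# Plaintext walk-through of the ADD-ADD-CMP circuit: C_B feeds in the masked
# values it decrypted, C_A feeds in the negated masks, the two ADD gates cancel
# the masks, and CMP sees only u' and v'. The point of the simulation is that
# the output is independent of the random masks ru, rv.
import random

def gscmp_circuit(u_prime, v_prime, ru, rv):
    m1, m2 = u_prime + ru, v_prime + rv   # what C_B holds after decryption
    result1 = m2 + (-rv)                  # first ADD gate: recovers v'
    result2 = m1 + (-ru)                  # second ADD gate: recovers u'
    return 1 if result1 < result2 else 0  # CMP gate

u, v = 3, 3
u_prime, v_prime = 2 * u, 2 * v + 1       # tie-free encoding from Eqs. (7)-(8)
outs = {gscmp_circuit(u_prime, v_prime, random.randrange(1000), random.randrange(1000))
        for _ in range(100)}
assert outs == {0}                        # masks never leak into the result
```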
To securely compare a point with a range, the GSPE protocol needs to add random values for every data dimension. However, as the number of data dimensions increases, the number of data encryptions also increases. The GSPE protocol reduces the number of data encryptions by using a packing technique that transforms the m-dimensional data into one packed value.
The overall procedure of the GSPE is shown in Algorithm 1. First, C A generates two random numbers ra j and rb j for 1 ≤ j ≤ 2m (lines 1-2). C A obtains PA and PB by packing ra j and rb j , respectively, using Eq. (9) for 1 ≤ j ≤ 2m (line 3).
Fifthly, C A generates two ADD gates and one CMP gate (line 24). C A puts −ra j and −rb j into the first and the second ADD gates, respectively, for 1 ≤ j ≤ 2m (lines 25-26). C B puts x j and y j into the first and the second ADD gates, respectively, for 1 ≤ j ≤ 2m (line 27). When −ra j , −rb j , x j , and y j are given, the result of the CMP gate, α′ = < α 1 ′, α 2 ′, …, α 2m ′ > , is returned to C B (lines 28-29). Sixthly, C B encrypts α′ and sends E(α′) to C A (line 30). Seventhly, C A derives the final enclosure result from E(α′).

Secure protocols using encrypted random value pool
While processing a query in our secure system, C B decrypts the received ciphertext. Thus, we need to prevent C B from extracting meaningful information while executing secure protocols. For this, C A generates a random value r from Z N and encrypts it by using the Paillier cryptosystem. Then, C A adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing E(m) × E(r) = E(m + r). Because m ± r is independent of m, C B cannot obtain any meaningful information by decryption. However, in the Paillier cryptosystem, the process of adding a random value to the ciphertext leads to performance degradation because both encryption and decryption operations require a higher computational cost than other operations on encrypted data. Therefore, we propose an encrypted random value pool to reduce the computational cost of ciphertext generation. First, in a preprocessing phase, we generate random ciphertexts and store them in an encrypted random value pool. Second, while processing a query at C A , a random ciphertext is selected from the encrypted random value pool whenever a secure protocol is called. Therefore, C A not only prevents C B from extracting meaningful information while executing a secure protocol, but also reduces the cost of generating encrypted random values. We apply the encrypted random value pool to the SM protocol [22] and our GSCMP protocol. In SM and GSCMP, C A generates two encrypted random values before sending the ciphertext to C B . According to Table 2, we can reduce the number of encryption operations by 67% by using the encrypted random value pool.
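A minimal sketch of the pool follows, assuming a stand-in encryption function since the Paillier details are orthogonal here (all names are ours):

```python
# Minimal sketch of the encrypted random value pool: ciphertexts of random
# values are produced offline, and the online protocol just pops one instead
# of paying for a fresh Paillier encryption. `encrypt` stands in for E(.).
import random
from collections import deque

def make_pool(encrypt, modulus, size):
    """Preprocessing phase: encrypt `size` random values up front."""
    return deque((r := random.randrange(modulus), encrypt(r)) for _ in range(size))

def next_random(pool):
    """Online phase: O(1) draw, no encryption on the critical path."""
    return pool.popleft()

# Stand-in encryption so the sketch is self-contained.
fake_encrypt = lambda m: ("enc", m)
pool = make_pool(fake_encrypt, modulus=2**32, size=4)
r, enc_r = next_random(pool)
assert enc_r == fake_encrypt(r) and len(pool) == 3
```

In practice the pool must be replenished offline and each ciphertext used only once, since reusing a pad would correlate two protocol runs.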
Secure multiplication protocol using encrypted random value pool (SMR): Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, SMR protocol returns the result as follows.
SMR protocol is shown in Algorithm 2. When two encrypted values E(u) and E(v) are given as inputs, C A selects two encrypted random values E(r a ) and E(r b ) from the encrypted random value pool (line 1). The rest of the SMR protocol is the same as the previous SM protocol (lines 2-6).
Garbled secure compare protocol using encrypted random value pool (GSCR): Suppose that E(u) and E(v) are the encrypted values of u and v, respectively. When E(u) and E(v) are given as inputs, GSCR protocol returns the result which is the same as our GSCMP as follows.
The difference between GSCR and GSCMP is that GSCR selects a random ciphertext from the encrypted random value pool, instead of generating and encrypting an arbitrary random value.

kNN query processing algorithm
In this section, we present our kNN query processing algorithm (SkNN G ) that uses Yao's garbled circuit [25]. The algorithm consists of three phases: encrypted kd-tree search, kNN retrieval, and kNN result refinement.

Candidate node search phase
In the encrypted kd-tree search phase, C A securely extracts all of the data from the node containing the query point while hiding the data access patterns. The procedure of the encrypted kd-tree search phase is shown in Algorithm 3. First, C A securely finds the nodes that include the query by executing E(α z ) = GSPE(E(q), E(node z )) for 1 ≤ z ≤ #_of_node, where #_of_node means the total number of kd-tree leaf nodes (lines 1-2). The result of GSPE for all nodes is stored in E(α) = {E(α 1 ), E(α 2 ), …, E(α #_of_node )}. By utilizing GSPE, our kNN query processing algorithm can obtain better performance than the existing algorithms [22,24] because we can avoid the operations related to the SBD protocol that cause high computation overhead. Then, we perform lines 8-24 of the index search algorithm in [24]. Second, C A generates E(α′) by shuffling E(α) using a random shuffling function π and sends E(α′) to C B (lines 3-4).
Third, C B obtains α′ by decrypting E(α′), counts the number of entries with α′ = 1, and stores the count into c (lines 5-6). Here, c means the number of nodes to which the query is related. Fourthly, C B creates c node groups (NG) (lines 7-11). C B assigns to each NG one node with α′ = 1 and #_of_node/c − 1 nodes with α′ = 0. Then, C B obtains NG′ by randomly shuffling the ids of the nodes in each NG and sends NG′ to C A . Fifthly, C A obtains NG* by shuffling the ids of the nodes in each NG′ using π⁻¹ (lines 12-13). Finally, C A accesses one data item in each node of each NG* and performs E(t′ i,j ) = SMR(E(node z .data s,j ), E(α z )), where 1 ≤ s ≤ FanOut and 1 ≤ j ≤ m (lines 14-22). Here, E(α z ) is the result of GSPE corresponding to node z . If a node has fewer data items than FanOut, SMR is performed by using E(max) instead of E(node z .data s,j ). Here, max is the largest value in the domain. When C A accesses one datum from every node in an NG*, it computes the homomorphic sum of the num SMR results, where num means the total number of nodes in the selected NG*. As a result, the data items in the node related to the query are securely extracted without revealing the data access patterns [5,14] because the searched nodes are not revealed. By repeating these steps, all of the data in the node are safely stored into E(cand i,j ) for 1 ≤ i ≤ cnt and 1 ≤ j ≤ m, where cnt means the total number of data items extracted during the index search. Figure 2 shows an example of the candidate node search phase. The example uses the data items and their kd-tree structure represented in Fig. 1. First, C A performs GSPE between E(q) and E(Node i .Range) for 1 ≤ i ≤ #_of_node. C A stores the GSPE result into E(α) and sends it to C B . For example, in Fig. 2, C A stores the GSPE result, i.e., E(0), into E(α 1 ) for Node 1 .
Second, C A shuffles the sequence of {< Node 1 , E(α 1 ) > , < Node 2 , E(α 2 ) > , …, < Node n , E(α n ) >} and changes the shuffled node ids into new ids so as to hide the original node ids from C B . To recover the original sequence of node ids, C A records their shuffled sequence. For example, in Fig. 2, the original sequence of {< Node 1 , E(0) > , < Node 2 , E(1) > , < Node 3 , E(0) > , < Node 4 , E(0) >} is shuffled to {< Node 4 , E(0) > , < Node 1 , E(0) > , < Node 2 , E(1) > , < Node 3 , E(0) >}. Then, C A changes Node 4 , Node 1 , Node 2 , and Node 3 into PN 1 , PN 2 , PN 3 , and PN 4 , respectively. As a result, the shuffled sequence is {< PN 1 , E(0) > , < PN 2 , E(0) > , < PN 3 , E(1) > , < PN 4 , E(0) >}. Third, to generate node groups, i.e., NGs, C B counts how many 1s are in the decrypted sequence. Each NG has one seed node whose α′ p equals 1, where 1 ≤ p ≤ #_of_node. Therefore, there exist the same number of NGs as the number of 1s in the sequence. Nodes whose α′ p equals 0, where 1 ≤ p ≤ #_of_node, are evenly assigned to the generated NGs. Then C B sends the generated NGs to C A . For example, C B counts the 1s in the sequence of {< PN 1 , 0 > , < PN 2 , 0 > , < PN 3 , 1 > , < PN 4 , 0 >} and generates NG 1 with the seed PN 3 . The nodes < PN 1 , 0 > , < PN 2 , 0 > , and < PN 4 , 0 > are assigned into NG 1 . C B sends NG 1 = {PN 3 , PN 1 , PN 2 , PN 4 } to C A . Fourth, C A obtains the original node ids from the received NGs by using the recorded shuffled sequence of node ids. For example, in Fig. 2, C A obtains NG 1 ′ = {Node 2 , Node 4 , Node 1 , Node 3 } as the original node ids by using both the received NG 1 = {PN 3 , PN 1 , PN 2 , PN 4 } and the shuffled sequence of node ids. Fifth, C A performs the SMR protocol between E(α) and the encrypted data items in the node group, and makes a candidate set by summarizing all the results of the SMR. In Fig. 2, C A performs SMR over the data items of NG 1 ′ and obtains the candidate set.
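C B 's grouping step can be sketched as follows, assuming the decrypted bits are available as a mapping from (pseudo) node ids; the function name is ours:

```python
# Sketch of C_B's node-group construction from the decrypted bits: every node
# with alpha' = 1 seeds its own group, and the alpha' = 0 nodes are dealt out
# evenly so each group ends up with #_of_node / c members in total.
def make_node_groups(alpha):
    """alpha: dict node_id -> 0/1. Returns a list of node-id groups."""
    seeds = [nid for nid, a in alpha.items() if a == 1]
    fillers = [nid for nid, a in alpha.items() if a == 0]
    groups = [[s] for s in seeds]
    for i, nid in enumerate(fillers):
        groups[i % len(groups)].append(nid)
    return groups

# Example from Fig. 2: only PN3 contains the query, so c = 1.
alpha = {"PN1": 0, "PN2": 0, "PN3": 1, "PN4": 0}
groups = make_node_groups(alpha)
assert groups == [["PN3", "PN1", "PN2", "PN4"]]
```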

kNN retrieval phase
In the kNN retrieval phase, we retrieve the k nearest neighbors of the given query by partially utilizing the SkNN m scheme [22]. However, we only consider E(cand i ) for 1 ≤ i ≤ cnt, which are extracted in the index search phase, whereas SkNN m considers all the encrypted data. In addition, we use our efficient secure protocols, which require relatively low computation costs, instead of the existing expensive secure protocols such as SBD (secure bit decomposition) [22,24]. The overall procedure of the kNN retrieval algorithm is as follows. First, the algorithm calculates the distances between the encrypted data items and the encrypted query without decrypting either of them. Second, the algorithm finds the minimum distance (dist min ) among the calculated distances; owing to the semantic security of the Paillier cryptosystem, the cloud cannot know which data item has dist min . Third, to obtain the encrypted data item with dist min , the algorithm subtracts each calculated distance from dist min , so that the data item with dist min yields E(zero) as the subtraction result. Here, E(zero) is the only value that is not changed by the homomorphic multiplication of the Paillier cryptosystem. Therefore, the algorithm can distinguish the nearest neighbor from the others, while an attacker cannot determine which data item has the minimum distance. Fourth, to hide the original data items from the attacker, the algorithm multiplies each subtraction result homomorphically by a random value. It also shuffles the sequence of the multiplication results in order to hide the data access patterns. Finally, by using our secure protocols, the algorithm finds the nearest neighbor and repeats the above process until k nearest neighbors are found.
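The zero-preservation argument in steps 3 and 4 can be checked on a toy Paillier instance. This is a minimal sketch with tiny, insecure demo primes (the paper's key size is 512 bits), and h_sub/h_scale are our names for the homomorphic operations:

```python
import random
from math import gcd

# Toy Paillier instance (insecure demo parameters).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                            # g = n + 1  =>  L(g^lam) = lam mod n

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def h_sub(c1, c2):        # E(a) * E(b)^(n-1) = E(a - b mod n)
    return (c1 * pow(c2, n - 1, n2)) % n2

def h_scale(c, k):        # E(a)^k = E(k * a mod n)
    return pow(c, k, n2)

# Steps 3-4: subtract each distance from dist_min, then blind with a random k.
dist_min, dists = 2, [2, 9, 8, 18]              # the distances of Fig. 3
blinded = [h_scale(h_sub(enc(dist_min), enc(d)), random.randrange(2, n))
           for d in dists]
# Only the minimum-distance entry decrypts to 0; the rest look random.
```

Since k·0 = 0 for any blinding factor k, E(zero) survives the randomization, which is exactly why the minimum can be flagged without revealing the other distances.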
The pseudocode of the kNN retrieval phase is shown in Algorithm 4. First, using ESSED, C A securely calculates the squared Euclidean distances E(dist i ) between the query and E(cand i ) for 1 ≤ i ≤ cnt (lines 1-2). Second, C A performs SMIN n to find the minimum distance E(dist min ) (line 3). Third, C A computes E(τ′ i ) by subtracting each distance from E(dist min ) and randomizing the result, obtains E(β) by shuffling E(τ′) using a random shuffling function π, and sends E(β) to C B (lines 4-9). For example, E(τ′) is calculated as {E(0), E(-r)}, where r means a random number; the E(0) corresponds to E(dist min ). Fourth, assuming that π shuffles the data in a reverse way, E(β) becomes {E(-r), E(0)}; after decrypting E(β), C B sets E(U i ) to E(1) if β i = 0 and to E(0) otherwise, and sends E(U) to C A (lines 10-13). Fifth, C A obtains E(V) by shuffling E(U) using π −1. Then, C A performs the SMR protocol on E(V i ) and E(cand i,j ) to obtain E(V′ i,j ) (lines 14-17). Sixth, by computing E(nn s,j ) = ∏ cnt i=1 E(V′ i,j ) for 1 ≤ j ≤ m, C A extracts the datum corresponding to E(dist min ). Finally, C A updates the distance of the selected datum by using Eq. (5). Because only the selected result has E(V i ) = E(1), the E(dist i ) corresponding to the datum selected in the current round becomes E(max), while the other values remain the same. This procedure is repeated for k rounds to find the kNN result. Figure 3 shows an example of the kNN retrieval phase, using the data items and the kd-tree structure shown in Fig. 1. For simplicity and clarity, the shuffling function π is omitted. First, C A calculates the distances using ESSED and stores the results into E(dist i ) for 1 ≤ i ≤ cnt (①). In Fig. 3, C A performs ESSED(E(d 5 ), < E(6), E(1) >) for E(d 5 ) and E(q), and stores the ESSED result, i.e., E(2), into E(dist 5 ). Second, the minimum value is calculated using SMIN n and stored into E(dist min ) (②). In Fig. 3, C A performs SMIN n (E(2), E(9), E(8), E(18)) to obtain the minimum distance among E(dist 5 ), E(dist 6 ), E(dist 7 ), and E(dist 8 ), and stores the SMIN n result, i.e., E(2), into E(dist min ). Third, to obtain the encrypted data item with the minimum distance to the given query, C A performs E(dist min − dist i ) and stores the result into E(τ i ) for 1 ≤ i ≤ cnt (③). When dist min is the same as dist i , E(0) is stored into E(τ i ).
In Fig. 3, for E(dist 5 ), C A performs E(2 − 2) and stores the result, i.e., E(0), into E(τ 5 ). Fourth, to protect the value of E(τ i ) for 1 ≤ i ≤ cnt, C A performs the homomorphic multiplication of E(τ i ) by a random value (④). In Fig. 3, for E(τ 6 ), C A performs E(− 7 × 3) with the random value 3 and stores the result, i.e., E(− 21). Fifth, C B decrypts the randomized values and returns E(V), in which only the minimum-distance item has E(V i ) = E(1) (⑤). Sixth, C A obtains the nearest neighbor by performing SMR between E(d i ) and E(V i ) for 1 ≤ i ≤ cnt and merging the results of the SMR (⑥-⑦). In Fig. 3, C A performs SMR(E(1), E(5)), SMR(E(0), E(6)), SMR(E(0), E(8)), and SMR(E(0), E(9)) for the x-axis, and SMR(E(1), E(2)), SMR(E(0), E(4)), SMR(E(0), E(3)), and SMR(E(0), E(4)) for the y-axis. C A merges E(5), E(0), E(0), and E(0) for the x-axis, and E(2), E(0), E(0), and E(0) for the y-axis. As a result, C A obtains < E(5), E(2) > as the nearest neighbor. Seventh, by using Eq. (5), C A sets the distance of the found nearest neighbor to the maximum value so that it can avoid finding the same nearest neighbor again in the next round (⑧). In Fig. 3, C A performs SMR(E(1), E(max)) × SMR(E(0), E(2)) for E(d 5 ) and stores E(max) into E(dist 5 ). Finally, C A repeats the above process until k nearest neighbors are found (②-⑦).

kNN result refinement phase
As mentioned in [24], the result of the kNN retrieval phase may not be accurate because candidates are extracted from only one leaf node in the index search phase. Therefore, kNN result refinement is necessary to confirm the correctness of the current query result. Specifically, assuming that the squared Euclidean distance between the kth closest result E(nn k ) and the query is dist k , the neighboring kd-tree nodes need to be searched to find data with a shorter distance than dist k . For this reason, we use the concept of the shortest point (sp) defined in [24]. The sp is the point in a given node whose distance to a given point p is the smallest among all points in the node. To find the sp in each node, we use the following properties described in [24]. (i) If both the lower bound (lb) and the upper bound (ub) of the region are less than p, the ub becomes the sp of the region. (ii) If both the lb and the ub of the region are greater than p, the lb becomes the sp of the region. (iii) If p is between the lb and the ub of the region, p itself is the sp of the region. Since these properties can be applied to each dimension, our kNN result refinement phase partially utilizes that of the existing algorithms [19,21]. However, to reduce the computation cost, we do not use the existing expensive protocols, such as SBD, SSED, SCMP, and SPE [22,24].
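In plaintext, properties (i)-(iii) amount to clamping each query coordinate into the node's range (a sketch of the logic only; the protocol evaluates this obliviously with GSCMP, SMR, and SBN over ciphertexts):

```python
def shortest_point(q, lb, ub):
    """Per-dimension closest point of the box [lb, ub] to the query q:
    (i) q above ub -> ub, (ii) q below lb -> lb, (iii) inside -> q itself."""
    return [min(max(qj, lbj), ubj) for qj, lbj, ubj in zip(q, lb, ub)]

def sq_dist(a, b):
    """Squared Euclidean distance (what ESSED computes on ciphertexts)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# A node whose sp is farther from q than the current kth distance is pruned.
q = [6, 1]
lb, ub = [2, 3], [5, 8]          # illustrative bounds of a kd-tree leaf region
sp = shortest_point(q, lb, ub)   # clamps to [5, 3]
```

If sq_dist(sp, q) exceeds dist k, no point anywhere in that node can beat the current kth neighbor, so the node need not be expanded.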
The procedure of the kNN result refinement phase is shown in Algorithm 5. First, C A computes E(dist k ) = ESSED(E(q), E(nn k )) to obtain the squared Euclidean distance between the query and the kth closest result, which is returned from the kNN retrieval phase (line 1). Second, for each node, C A performs GSCMP with E(q j ) and E(node z .lb j ) for 1 ≤ z ≤ num node and 1 ≤ j ≤ m and stores the result in E(ψ 1 ). C A also performs GSCMP with E(q j ) and E(node z .ub j ) for 1 ≤ z ≤ num node and 1 ≤ j ≤ m and stores the result into E(ψ 2 ) (lines 2-5). When the value of E(q j ) is equal to or less than E(lb j ) (resp. E(ub j )), E(ψ 1 ) (resp. E(ψ 2 )) has the value E(1). Then, C A obtains E(ψ 3 ) by computing E(ψ 1 ) × E(ψ 2 ) × SMR(E(ψ 1 ), E(ψ 2 ))^(N−2) so as to acquire the result of the bit-xor operation between E(ψ 1 ) and E(ψ 2 ) (line 6). Note that "−2" is equivalent to "N−2" under Z N . Third, C A securely obtains the shortest point of each node, that is, E(sp z,j ), by computing SMR(E(ψ 3 ), E(q j )) × SMR(SBN(E(ψ 3 )), f(E(lb z,j ), E(ub z,j ))) for 1 ≤ z ≤ num node and 1 ≤ j ≤ m. Here, f(E(lb z,j ), E(ub z,j )) is obtained by computing SMR(E(ψ 1 ), E(lb z,j )) × SMR(SBN(E(ψ 1 )), E(ub z,j )) for 1 ≤ z ≤ num node and 1 ≤ j ≤ m (lines 7-10). Fourth, C A calculates E(spdist z ), the squared Euclidean distance between the query and E(sp z ), for 1 ≤ z ≤ num node by using ESSED. In addition, C A securely updates the E(spdist z ) of the nodes retrieved in the index search phase to E(max) by computing E(spdist z ) = SMR(E(α z ), E(max)) × SMR(SBN(E(α z )), E(spdist z )) (lines 11-13). Here, E(α z ) is the value returned by GSPE in the index search phase. Then, C A performs E(α z ) = GSCMP(E(spdist z ), E(dist k )) (line 14). If E(spdist z ) is less than E(dist k ), the corresponding node z is assigned E(α z ) = E(1). The nodes with E(α z ) = E(1) need to be retrieved for kNN result refinement.
The number of nodes to expand depends on how many E(α z ) become E(1). If the number of 1s in E(α z ) is c, then c node groups are created at C B , and C A extracts the data of each node group. Therefore, cnt becomes c × fanout.
Because the E(spdist z ) of the nodes already retrieved in the index search phase are E(max), they are safely pruned. Fifth, C A securely extracts the data stored in the nodes with E(α z ) = E(1) by performing the index search using E(α) and appends them to E(nn) (lines 15-16). Then, C A executes the kNN search phase on E(nn) to obtain the final kNN result E(result i ) for 1 ≤ i ≤ k (line 17). For the example in Fig. 3, the final result becomes {E(nn 1 ), E(nn 5 )} because the squared Euclidean distance of E(nn 5 ) is E(4). Sixth, C A returns the decrypted result to the AU in cooperation with C B to reduce the computation overhead at the AU side. To do this, C A computes E(γ i,j ) = E(result i,j ) × E(r i,j ) for 1 ≤ i ≤ k and 1 ≤ j ≤ m by using a random value r i,j . Then, C A sends E(γ i,j ) to C B and r i,j to the AU (lines 18-22). C B decrypts E(γ i,j ) and sends the decrypted values to the AU (lines 23-26). Finally, the AU obtains the actual kNN result by computing γ i,j − r i,j in plaintext (lines 27-29).

Parallel encrypted kd-tree search phase
In the parallel encrypted kd-tree search phase, C A simultaneously extracts all of the data from a node containing the query point. To extend the encrypted kd-tree search phase to a parallel environment, we use a thread pool, which stores tasks in a queue so that threads can process the tasks in parallel. The procedure of the parallel encrypted kd-tree search phase is shown in Algorithm 6. First, C A generates a queue-based thread pool (line 1). If a thread in the pool is available, it processes a task in FIFO manner. Second, C A pushes the tasks, i.e., GSPE(E(q), E(node i )) for 1 ≤ i ≤ #_of_node, to the thread pool. The results of the GSPE protocol are stored in E(α) = {E(α 1 ), E(α 2 ), …, E(α #_of_node )} (lines 2-3). Third, C A generates E(α′) by shuffling E(α) using a random shuffling function π and sends E(α′) to C B (lines 4-5). Fourth, C B performs the same procedure as in Algorithm 3 (line 6). Fifth, C A obtains NG* by restoring the ids of the nodes in each NG′ using π −1 (line 7). Finally, C A accesses one datum in each node for each NG* and pushes both E(t′ i,j ) = SMR(E(node z .data s,j ), E(α z )) and E(cand cnt+s,j ) = E(cand cnt+s,j ) × E(t′ i,j ) to the thread pool, where 1 ≤ s ≤ FanOut and 1 ≤ j ≤ m (lines 8-18).
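The queue-based thread-pool pattern above can be sketched with Python's `concurrent.futures`; here `gspe` is a plaintext stand-in for the encrypted GSPE call, purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def gspe(q, node_region):
    """Stand-in for GSPE: 1 if the query point lies inside the node's region."""
    return int(all(lb <= qj <= ub for qj, (lb, ub) in zip(q, node_region)))

def parallel_gspe(q, nodes, n_threads=4):
    # Submitted tasks are queued and consumed by the pool's workers in FIFO order.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(gspe, q, node) for node in nodes]
        return [f.result() for f in futures]   # alpha_1..alpha_n, in node order

# Four illustrative leaf regions partitioning [0,9] x [0,9].
nodes = [[(0, 4), (0, 4)], [(5, 9), (0, 4)], [(0, 4), (5, 9)], [(5, 9), (5, 9)]]
alpha = parallel_gspe([6, 1], nodes)
```

Collecting results via the futures list keeps the output aligned with the node order, which matters because E(α) must later be shuffled with a recorded π.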

Parallel kNN retrieval phase
In the parallel kNN retrieval phase, we simultaneously retrieve the k closest data items from the query by partially utilizing the SkNN m scheme [22]. We consider E(cand i ) for 1 ≤ i ≤ cnt, which are extracted in the parallel index search phase. The procedure of the parallel kNN retrieval phase is shown in Algorithm 7. First, using ESSED, C A simultaneously calculates the squared Euclidean distances E(d i ) between the query and E(cand i ) for 1 ≤ i ≤ cnt (lines 1-2). Second, C A performs SMIN n to find the minimum distance E(d min ) and computes the randomized differences E(τ′ i ). Third, C A obtains E(β) by shuffling E(τ′) using a random shuffling function π and sends E(β) to C B (lines 8-9). Fourth, after decrypting E(β), C B creates E(U) and sends it to C A (line 10). Fifth, C A obtains E(V) by shuffling E(U) using π −1 (line 11). Sixth, instead of using the SM protocol, C A simultaneously performs the SMR protocol with E(V i ) and E(cand i,j ) to obtain E(V′ i,j ) (lines 12-16). Seventh, by computing E(nn s,j ) = ∏ cnt i=1 E(V′ i,j ) for 1 ≤ j ≤ m, C A simultaneously extracts the datum corresponding to E(d min ) (lines 17-18). Finally, C A simultaneously updates the distance of the selected result to E(max) by computing Eq. (11) (lines 19-24).

Parallel kNN result refinement phase
In the parallel kNN result refinement phase, C A simultaneously checks whether the kNN results are correct. If not, C A performs both the index search phase and the kNN retrieval phase again. The procedure of the parallel kNN result refinement phase is shown in Algorithm 8. First, C A computes E(dist k ) = ESSED(E(q), E(nn k )) to obtain the squared Euclidean distance between the query and the kth closest result, which is returned from the kNN retrieval phase (line 1). Second, C A simultaneously finds the nodes that are closer to the query than dist k by using both the SMR and GSCR protocols (lines 2-16). Third, C A performs lines 15-22 of Algorithm 5 (line 17). Fourth, C B decrypts E(γ i,j ) and sends the decrypted values to the AU (lines 18-21). Finally, the AU obtains the actual kNN result by computing γ i,j − r i,j in plaintext (lines 22-24).

Security proof under semi-honest model
As mentioned above, the proposed privacy-preserving kNN algorithm assumes a semi-honest attack model. Therefore, the security proof of the proposed algorithm is performed from the three viewpoints of C A , C B , and the AU (authorized user), which are the acting parties other than the data owner. In addition, the following lemmas are used in our security proof.

Lemma 1 If a random element r is uniformly distributed on Z N and independent
from any variable x ∈ Z N , then r ± x is also uniformly random and independent from x.
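Lemma 1 can be checked exhaustively for a toy modulus (N = 35 here is purely illustrative):

```python
N = 35
x = 12   # any fixed secret value in Z_N

# As r ranges uniformly over Z_N, (x + r) mod N hits every residue exactly
# once, so the masked value is uniform and reveals nothing about x.
masked = {(x + r) % N for r in range(N)}
```

The same bijection argument holds for r − x and x − r, which is why C A can safely hand additively blinded values to C B for decryption.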

Lemma 2 The Paillier cryptosystem is semantically secure based on the composite residuosity class problem [26].

Theorem 1 The proposed privacy-preserving kNN algorithm is secure from the perspective of C A under the semi-honest model.
Proof C A owns the encrypted database and the encrypted index. However, since it does not own the decryption key, the contents of the encrypted database and index are not exposed. Data cannot be inferred even by a frequency-based attack because the same plaintext is encrypted into different ciphertexts (Lemma 2). In addition, since all values that our secure protocols return from C B are encrypted, no information is exposed from the data received from C B . Even though the query is received from the user, it cannot be inferred because the query remains encrypted. □

Theorem 2
The proposed privacy-preserving kNN algorithm is secure from the perspective of C B under the semi-honest model [32].
Proof C B decrypts the ciphertexts received from C A . Because C A hides the original data by adding a random integer before they are passed to C B , C B cannot infer meaningful data from the decrypted plaintext (Lemma 1). □

Theorem 3 The proposed privacy-preserving kNN algorithm is secure from the perspective of AU under the semi-honest model.
Proof The AU encrypts his/her query using the public key and sends the encrypted query to C A . This protects the user's preferences and personal information. Since the query results received from C A and C B do not reveal information about the owner's data, it is impossible to infer the original data. □

Theorem 4
The proposed privacy-preserving kNN algorithm is secure even though c and cnt are exposed to C A under the semi-honest model.
Proof C A can obtain both c and cnt in Algorithms 3 and 5. Here, c is the number of nodes relevant to the query and cnt equals c × fanout (i.e., c × F). In the candidate search phase, c initially equals one, and in the result refinement phase, c ranges from zero to the total number of leaf nodes. However, because the upper and lower bounds of all nodes are encrypted and the node ids are hidden through grouping and shuffling, it is impossible to know which nodes are related to the query. Therefore, even if c and cnt are exposed as plaintext to C A , an attacker cannot know which nodes are related to the query, resulting in no additional information leakage. □

Theorem 5
The proposed privacy-preserving kNN algorithm is secure even though c is exposed to C B under the semi-honest model.
Proof C B can obtain c in Algorithms 3 and 5. Here, c denotes the number of nodes related to the query. Because C B does not know the fanout (F) of the kd-tree, cnt is not disclosed to C B . In addition, because the order of node ids is changed through shuffling, it is impossible to infer which nodes are related to the query. Therefore, even if c is exposed to C B as plaintext, an attacker cannot know which nodes are related to the query, resulting in no additional information leakage. □
According to Theorems 1, 2, 3, 4, and 5, the original data, the index, and the query are protected through the Paillier cryptosystem (Lemma 2), and even when intermediate values are decrypted, the original data cannot be inferred from the randomized values (Lemma 1). Through this, we prove that the proposed privacy-preserving kNN algorithm can guarantee data protection, query protection, and query result protection, while hiding data access patterns.
In addition, the proposed parallel kNN algorithm also assumes a semi-honest attack model, and its security proof is performed from the three viewpoints of C A , C B , and the AU. Because the procedure of the proposed parallel kNN algorithm is the same as that of the proposed privacy-preserving kNN algorithm except for the use of multiple threads, the parallel algorithm is also secure from the perspectives of C A , C B , and the AU under the semi-honest attack model. Therefore, the proposed parallel kNN algorithm can also guarantee data protection, query protection, and query result protection, while hiding data access patterns [4,5,14].

Performance analysis
In this section, we compare the proposed privacy-preserving kNN algorithm (SkNN G ) with the existing algorithms SkNN m [22] and SkNN I [24], which can hide data access patterns. We used the Paillier cryptosystem to encrypt the database in all schemes [22,24]. We implemented our algorithm and the existing ones in C++. Experiments were performed on a Linux machine with an Intel Xeon E3-1220v3 4-core 3.10 GHz CPU and 32 GB RAM running Ubuntu 14.04.2. In addition, we compare the proposed parallel algorithm (SkNN PG ) with the parallel versions of SkNN m (SkNN pm ) and SkNN I (SkNN PI ). Experiments for the parallel algorithms were performed on a Linux machine with an Intel Xeon E5-2630v4 2.20 GHz CPU and 64 GB RAM running Ubuntu 14.04.2.
We conduct the performance analysis using both a synthetic dataset and a real dataset. For the synthetic dataset, we randomly generated 30 k records with six attributes. For the real dataset, we use the Chess dataset available at http://archive.ics.uci.edu/ml/datasets [41], which consists of 28,056 records with six attributes. The parameters for our experiments are listed in Table 3. We use 512 bits for the encryption key size (K) and set the default value of k to 10. Each query was generated by selecting random integers from the data domain. Figure 4 shows the performance of both SkNN I and SkNN G in terms of the height of the kd-tree (h). When the number of data items is fixed, the fanout (F) decreases as h increases because the total number of leaf nodes is 2^(h−1). Therefore, it is important to choose an appropriate height h of the kd-tree depending on the number of data items. If h is too high for the given number of data items, the number of leaf nodes to be searched increases, and so does the cost of accessing leaf nodes in the candidate node search phase. On the contrary, if h is too low, the number of data items to be searched increases, and so does the cost of calculating distances between data items and the query in the kNN retrieval phase. Both the existing algorithm (SkNN I ) and the proposed algorithm (SkNN G ) show near-optimal performance when h = 7. In particular, the performance of the existing algorithm is greatly affected by h because it uses secure protocols based on an encrypted binary array, which require a high computation cost. However, the proposed algorithm is relatively less affected by h because it uses secure protocols based on the garbled circuit, which require a low computation cost.
As a result, we set h to 7 in our experiments. Figures 5 and 6 show the performance of the three algorithms on a single machine. With varying n, our SkNN G shows 30.2 and 6.1 times better performance than SkNN m and SkNN I , respectively. With varying k, our SkNN G shows 33.2 and 4.9 times better performance than SkNN m and SkNN I , respectively. Our SkNN G outperforms SkNN m because it reduces the computation cost by pruning out unnecessary data with the kd-tree, instead of considering all the data as SkNN m does. In addition, our SkNN G outperforms SkNN I because our algorithm uses efficient secure protocols based on both Yao's garbled circuit and the data packing technique. First, if a function can be realized with a reasonably small circuit, Yao's garbled circuit provides a high degree of efficiency. Because our secure protocols, i.e., GSCMP and GSPE, do not take the encrypted binary representation of the data as input, contrary to the existing protocols used in [22,24], their circuits remain reasonably small. As a result, our SkNN G can provide a low computation cost by using GSCMP and GSPE. Second, our ESSED protocol requires only one encryption operation by using the data packing technique, while the alternative protocol (i.e., DESSED) needs m encryption operations. Moreover, ESSED calculates the randomized distance in plaintext, while the alternative computes the sum of the squared Euclidean distances over all attributes on ciphertexts. Therefore, our SkNN G can greatly reduce the computation cost by using ESSED. Figures 7, 8, and 9 show the performance of the three parallel algorithms. In Fig. 7, when the number of threads is 2, 4, 6, 8, and 10, the query processing time of SkNN PG is 3309, 2009, 1572, 1136, and 994 s, respectively. The query processing time of SkNN PG decreases as the number of threads grows. In addition, our SkNN PG shows 12 and 7 times better performance on average than SkNN pm and SkNN PI , respectively. In Fig. 8, the query processing time of SkNN PG increases linearly as the number of data items grows from 5 to 30 k. In Fig. 9, when k is 5, 10, 15, and 20, the query processing time of SkNN PG is 586, 1173, 1773, and 2308 s, respectively. Our SkNN PG shows 10 and 5.2 times better performance on average than SkNN pm and SkNN PI , respectively. Our SkNN PG outperforms SkNN pm and SkNN PI because it uses efficient secure protocols suited to a parallel environment, i.e., SMR and GSCR.
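The cost advantage of data packing in ESSED can be sketched as packing fixed-width fields into a single plaintext before one Paillier encryption; the 20-bit field width and the pack/unpack helpers are our illustrative assumptions, not the paper's exact encoding:

```python
W = 20  # bits per packed field; must exceed the bit length of any attribute

def pack(values):
    """Pack m attribute values into one integer: v_1 | v_2 | ... in W-bit slots."""
    out = 0
    for v in values:
        out = (out << W) | v
    return out

def unpack(packed, m):
    """Recover the m fields from a packed integer."""
    mask = (1 << W) - 1
    vals = []
    for _ in range(m):
        vals.append(packed & mask)
        packed >>= W
    return vals[::-1]

# One Paillier encryption of pack(record) replaces m separate encryptions,
# which is where ESSED saves cost over the per-attribute DESSED variant.
record = [6, 1, 9]
packed = pack(record)
```

The plaintext must stay below the Paillier modulus N, so the number of fields times W bounds how many attributes one ciphertext can carry.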

Performance using a synthetic dataset
Privacy-preserving kNN query processing algorithms generally use homomorphic encryption to provide data privacy and query privacy. Therefore, it is inevitable that they require a high computational cost and that their search time complexity is linear. Accordingly, the existing algorithms handle about 10,000 data items in their performance evaluations [22,24]. Following the existing algorithms, we evaluate the privacy-preserving kNN query processing algorithms when the number of data items ranges from 5000 to 30,000. Our performance evaluation shows that the proposed algorithm (SkNN G ) has linear time complexity, but its slope is lower than that of the existing algorithms (SkNN m , SkNN I ). This is because the proposed algorithm performs data filtering using the kd-tree structure and uses Yao's garbled circuit, which does not require the encrypted binary array.
To show that the proposed algorithm can handle a very large dataset (e.g., 1 million items), we evaluate its performance when the number of data items is 300,000, 600,000, and 1,000,000. We exclude the existing algorithms from this evaluation because they cannot handle such a large dataset, due to both their extremely long execution times and the nonexistence of their parallel versions. Instead of the six-dimensional data in Table 3, we use two-dimensional data because the main memory available for our experiment is limited. Figure 10 shows the performance of the proposed parallel algorithm on the very large dataset. The proposed algorithm requires 452, 834, and 1341 s when the number of data items is 300,000, 600,000, and 1,000,000, respectively. The experiment shows that the running time of the proposed algorithm is linear in the number of data items. Thus, we infer that the proposed algorithm can handle a very large dataset with a linear time complexity.

Performance using a real dataset
According to Fig. 11, the performance of both SkNN I and SkNN G is best when h is 7 or 8, so we set h to 7 in this experiment. Figure 12 shows the performance of the three algorithms on a single machine. With varying k, our SkNN G shows 22.3 and 5.9 times better performance than SkNN m and SkNN I , respectively. SkNN G outperforms SkNN m because it reduces the computation cost by pruning out unnecessary data with the kd-tree, instead of considering all the data as SkNN m does. In addition, SkNN G outperforms SkNN I because it uses efficient secure protocols based on both Yao's garbled circuit and the data packing technique. Figures 13 and 14 show the performance of the three parallel algorithms. In Fig. 13, when the number of threads is 2, 4, 6, 8, and 10, the query processing time of SkNN PG is 1659, 977, 745, 624, and 536 s, respectively. The query processing time of SkNN PG decreases as the number of threads grows. In addition, our SkNN PG shows 13.3 and 4.1 times better performance on average than SkNN pm and SkNN PI , respectively. In Fig. 14, when k is 5, 10, 15, and 20, the query processing time of SkNN PG is 277, 536, 765, and 1022 s, respectively. Our SkNN PG shows 12.1 and 3.7 times better performance on average than SkNN pm and SkNN PI , respectively. Our SkNN PG outperforms SkNN pm and SkNN PI because it uses efficient secure protocols suited to a parallel environment, i.e., SMR and GSCR.

Discussion
In this section, we not only clarify the differences between the existing privacy-preserving kNN query processing algorithms [22,24] and our algorithm, but also highlight the advantage of our algorithm. In Table 4, we analyze the privacy-preserving kNN query processing algorithms, in terms of secure protocol, index structure, and random value pool.
Impact of a secure protocol with low computational cost Secure protocols are very important for privacy-preserving query processing in cloud computing. We should make the secure protocols as efficient as possible because we target privacy-preserving query processing based on the Paillier cryptosystem, which consumes a high computational cost. First, Elmehdwi et al.'s algorithm proposed secure protocols such as SM, SBD, SMIN, and SMIN n for kNN query processing. It can protect both data privacy and query privacy by using the Paillier cryptosystem, and it uses arithmetic operations to protect the original data and to hide data access patterns. However, its drawback is an excessively high computational cost because the SBD, SMIN, and SMIN n protocols take an encrypted binary array as input. For example, to perform the SMIN protocol between E(8) and E(7), the clouds first transform each encrypted decimal value into an encrypted binary array: E(8) becomes {E(1), E(0), E(0), E(0)} and E(7) becomes {E(0), E(1), E(1), E(1)}. After that, the clouds perform the SMIN protocol between the two arrays. As a result, the SMIN requires a high computational cost because it performs encrypted operations as many times as the bit length of the data domain. The SBD and SMIN n require a high computational cost for the same reason. Second, Kim et al.'s algorithm proposed secure protocols such as SCMP and SPE, which are used in the index search to find kNN candidates. However, because both SCMP and SPE also take an encrypted binary array as input, they require a high computational cost for the same reason as the SMIN. Meanwhile, our algorithm proposes GSCMP and GSPE, which perform only one Paillier arithmetic operation. This is because they take an encrypted decimal value as input, rather than an encrypted binary array, by utilizing Yao's garbled circuit.
As a result, the GSCMP and GSPE require a low computational cost. Impact of using an index structure over the encrypted database Because Elmehdwi et al.'s algorithm does not use an index structure for data filtering, it must process all of the data items, which leads to performance degradation. To solve this problem, both Kim et al.'s algorithm and our algorithm use an encrypted kd-tree as the index structure, and both achieve a performance enhancement as a result. Our experiments show that our algorithm searches only 10% of all the data items on average, owing to data filtering with the kd-tree. Table 5 compares the privacy-preserving kNN algorithms in terms of the number of data items accessed.
Elmehdwi et al.'s algorithm accesses N × k data items in total, whereas both Kim et al.'s algorithm and our algorithm access N/F + F × k × (c + 1) data items in total. Here, N is the number of data items, h is the height of the kd-tree, F is the fanout of the kd-tree, and c is the number of node groups in the kNN result refinement phase. We show that both Kim et al.'s algorithm and our algorithm are better than Elmehdwi et al.'s algorithm, in terms of the number of data items accessed, by using Eq. (12). Here, the N/F term can be ignored because 1/F is much smaller than k. Because F × (c + 1) means the number of data items in the leaf nodes selected for the privacy-preserving kNN query processing, which is always a subset of the N data items, Eq. (12) holds. Thus, by using an index structure over the encrypted database, our algorithm and Kim et al.'s algorithm are shown to be better than Elmehdwi et al.'s algorithm.
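For intuition, the two access-count formulas can be compared with illustrative values; the parameter choices below (F derived from h = 7, c = 3) are our assumptions, not figures from Table 3:

```python
def accesses_linear(N, k):
    # Elmehdwi et al.: every round scans all N items, over k rounds.
    return N * k

def accesses_indexed(N, F, k, c):
    # Index-based: N/F node checks plus F*k*(c+1) data accesses
    # over the selected leaf nodes (c refinement node groups).
    return N // F + F * k * (c + 1)

# h = 7 gives 2^(h-1) = 64 leaves, so F is roughly N / 64 for N = 30,000.
N, F, k, c = 30_000, 469, 10, 3
print(accesses_linear(N, k))      # 300000
print(accesses_indexed(N, F, k, c))  # 63 + 18760 = 18823
```

Even with a generous c, the indexed count stays far below N × k, matching the claim that F × (c + 1) covers only a subset of the N data items.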

Impact of the encrypted random value pool for parallelism In our secure system, we use two-party computation for the parallel kNN query processing algorithm. Thus, we need to prevent C B from extracting meaningful information while executing the secure protocols. For this, C A generates a random value r from Z N and encrypts r by using the Paillier cryptosystem. Then, C A adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing E(m + r) = E(m) × E(r). Because m ± r is independent from m, C B cannot obtain meaningful information by decryption. However, adding a random value to the ciphertext in the Paillier cryptosystem leads to performance degradation because both the encryption and decryption operations require a higher computation cost than the other encrypted operations, as shown in Table 2. Meanwhile, our algorithm requires only one encryption for the result of comparison at C B by using the random value pool. Therefore, our algorithm can reduce the computation cost for encryption by using the encrypted random value pool.
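The pool idea can be sketched on a toy Paillier instance (tiny, insecure primes; the `mask` helper is our name): the (r, E(r)) pairs are produced offline, so the online masking step costs only one ciphertext multiplication instead of a fresh encryption.

```python
import random
from math import gcd

# --- toy Paillier (insecure demo parameters; the paper uses 512-bit keys) ---
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                            # g = n + 1  =>  L(g^lam) = lam mod n

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# --- encrypted random value pool: (r, E(r)) pairs generated offline by C_A ---
pool = [(r, enc(r)) for r in (random.randrange(1, n) for _ in range(16))]

def mask(c):
    """Online masking: one multiplication, no fresh encryption."""
    r, er = pool.pop()
    return r, (c * er) % n2                     # E(m) * E(r) = E(m + r)

r, masked = mask(enc(42))
# C_B decrypts only the blinded value m + r; C_A later subtracts r.
```

The pool shifts the expensive exponentiations of encryption into an offline phase, which is what makes the per-query masking cheap enough for the parallel setting.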

Conclusion and future work
Due to privacy issues, a database needs to be encrypted before being outsourced to the cloud. However, most of the existing kNN algorithms are insecure in that they disclose data access patterns during query processing. To solve this problem, we proposed a new privacy-preserving kNN query processing algorithm via secure two-party computation. To achieve a high degree of efficiency in query processing, we also proposed a parallel kNN query processing algorithm using an encrypted random value pool. Our algorithms can protect data privacy and query privacy while hiding data access patterns. In our performance analysis, our algorithms showed about 4-30 times better performance than the existing algorithms in terms of the query processing cost. As future work, we plan to extend our algorithms to support other types of queries, such as Top-k and kNN classification. In addition, to the best of our knowledge, privacy-preserving kNN algorithms with homomorphic encryption have generally been studied for low-dimensional data spaces because of their high computational cost [22,24]. To deal with high-dimensional data spaces, we will extend our algorithm by using a data dimensionality reduction technique [42,43].