Abstract
Given the morass of available data, ranking and best match queries are often used to find records of interest. As such, kNN queries, which give the k closest matches to a query point, are of particular interest, and have many applications. We study this problem in the context of the financial sector, wherein an investment portfolio database is queried for matching portfolios. Given the sensitivity of the information involved, our key contribution is to develop a secure kNN computation protocol that can enable the computation kNN queries in a distributed multiparty environment while taking domain semantics into account. The experimental results show that the proposed protocols are extremely efficient.
Keywords
 kNN queries
 kNN classification
 Privacy
 Distributed computation
The work of Shafiq is supported by HEC grant under the PAKUS Science and Technology Cooperation Program and by HEC NRPU grant. The work of Adam is supported by the National Academies of Sciences, Engineering, and Medicine under the PAKUS Science and Technology Cooperation Program. The work of Vaidya and Asif is supported in part by the National Science Foundation Grant CNS1422501 and the National Institutes of Health Award R01GM118574. The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research.
Download conference paper PDF
1 Introduction
Nearest neighbor (NN) queries are an extremely important data analysis tool, and have been used in numerous domains. Indeed, they have been identified (in the form of kNN classification) as one of the top 10 algorithms in data mining [17], though they can also be used for other applications such as regression, content retrieval, and structure prediction. While the typical use of kNN does not worry about the sensitivity of the data, kNN is also applicable in many cases where the data may be private, and the organization interested in querying is different from the organization holding the data.
Consider the financial environment, wherein we have several organizations (such as Ameritrade, Charles Schwab, etc.) which possess financial data about individuals, including their current stock positions, transactional history, etc. Now, a regulating agency such as the SEC may be interested in finding individuals who have a certain stock position, or have indulged in particular type of transactional behavior. Since perfect match is often difficult, best match queries are used to find the closest individuals of interest. Alternatively, a financial advisor service might want to provide recommendations based on similar stock positions or transactional behavior. Typically, since financial data is extremely sensitive, the organizations may be unwilling or even (legally) unable to allow unfettered access to this data. However, in certain cases, there may be a lot of value associated to be obtained through computing the best match queries. For example, one can identify trading behavior of investors from their portfolio structures as shown by [3]; this kind of information is invaluable for numerous organizations e.g. State Exchange Commission (SEC) in the United States. Similar problems exist in the field of medicine, finance, and homeland security.
In this paper, we address this specific problem. We consider the scenario where several organizations possess independent portfolio databases, each record of which contains financial stock positions for a single portfolio. Together, these databases comprise the global database which contains the portfolios of all entities, though no third party exists which knows this global database in its entirety. Another organization, called the querier, would like to query this global database to retrieve the k portfolios that are the most similar to a particular query portfolio that it possesses. All of the organizations would like to protect the privacy of their information, while still enabling the computation. There has been some work addressing this problem in the past, especially in the context of outsourcing [7, 16]. Our proposed solution improves on the state of the art by providing a way to incorporate the domain semantics and is significantly more efficient. Our solution is also applicable in the outsourcing environment where an organization may want to outsource its database in encrypted form and still enable best match queries. Furthermore, our solution can be extended to provide topk results based only on private ranking criteria (without reference to a specific query point) in an even more efficient fashion. It is worth noting that while the problem has been formulated in the context of financial domain, our approach is quite general and can be used to solve kNN and topk query problems in any domain. Overall, our key contributions are:

1.
We introduce the notion of semantic distance which is useful in taking domain semantics into account while computing knearest neighbor queries.

2.
We propose an extremely efficient multiparty protocol to compute kNN queries that is robust to semihonest adversaries.

3.
We show how the protocol can be adapted to the outsourced data model and used for kNN based classification without leaking any additional information.
2 Problem Statement
In this paper, we build a protocol for overcoming the privacy problem for situations where organizations (or people) are interested in finding the best matches for a query over distributed data. As discussed earlier, we formulate the problem in the context of the financial domain i.e., finding investment portfolios from a distributed database, which best match a given query portfolio in a dynamic and semantic aware environment, while also providing confidentiality and securityprivacy guarantees to the parties involved in the protocol. We consider a database \(\mathcal {P}\in \mathbb {R}^{N\times M}\) is horizontally distributed among n parties \(P_1,\dots ,P_n\) such that for all \(i\in [n]\) and \(N_i\in \mathbb {N}\) the database fragment \(\mathcal {P}_i\in \mathbb {R}^{N_i\times M}\) is kept by \(P_i\) where \(\sum _{i\in [n]}N_i=N\). Each party \(P_i\) collects the same features of information but for different entities. These parties could be banks, hedgefunds, mutualfunds, or other institutions. Another party Q is interested in performing a kNN query, which also incorporates structure and semantics, on this distributed data without revealing its query, while the parties owning data being queried want to avoid disclosure of their data except for the legitimate output of the query. Thus, the problem can be formally defined as follows:
Definition 1 (Distributed Secure kNN query: DSkNN)
Given a database \(\mathcal {P}\in \mathbb {R}^{N\times M}\), which is horizontally distributed among n parties with party \(P_i\) having fragment \(\mathcal {P}_i~\forall i\in [n]\), a querier Q wants to privately find semantic distance (Definition 3) based kNN in \(\mathcal {P}\) to its query \(q\in \mathbb {R}^{M}\) for \(N,M,n\in \mathbb {N}\), where \(n\ge 2\).
Definition 2 (Security/Privacy)
A protocol \(\varPi \) computes DSkNN query securely if it reveals nothing but kNN records to Q with leakage \(\mathcal {L}\) to all other parties with negligible probability in security parameter – a function, \(\mu \left( m\right) \), is negligible in m if \(\mu (m)<\frac{1}{p(m)}\) for all sufficiently large m and polynomial \(p\left( m\right) \).
Hence Definition 2 ensures that data owners do not learn anything about the query beyond what they already know or infer from leakage, and Q learns nothing about \(\mathcal {P}\) beyond what it already knows and can infer from the output and leakage.
Adversarial Model: We assume all parties to be noncolluding and semihonest (i.e., honestbutcurious) adversaries, who communicate over a secure channel. However, restriction on collusion among the data owners can be relaxed to the collusion of at least \(\alpha \) data owners, where \(\alpha \le n\) (total number of data owners). We can accomplish this (without any major change in the proposed protocol) by employing additive homomorphic encryption with threshold \(\alpha \), which for decryption will require \(\alpha \) parties each performing partial decryption on an encrypted message.
3 Proposed Approach
We first introduce a notion of semantic awareness for distance metrics that can capture the desired level of granularity and structure for measuring similarity. For example, standard distance metrics such as Euclidean distance fails to capture structural and semantic information such as the industry or sector a stock belongs to, market capitalization, risk and type of the stock, etc. Consider an investor who would like to find similar portfolios, while incorporating portfolio structure and/or commodity relationships with regards to a particular categorization model (e.g., the industry classification of stocks [1]). Such a categorization model would typically be built by domain experts. We assimilate the categorization model into the distance metric, denoted sem_dist, which we term as the semantic distance. This enables the integration of semantic information representing the true interest of the querier, while evaluating similarity among portfolios (or records). We now formalize the notion of semantic aware distance and then discuss how it will be calculated in a secure (and privacypreserving) manner.
Definition 3
Semantic Distance between two points \(X,Y\in \mathbb {R}^M\) is Euclidean distance between their linear projection in space \(\mathbb {R}^l\), where \(l,M\in \mathbb {N}\) and the projection is guided by a given categorization model (map, W), and is formalized as follows.
Here, \(map:S\rightarrow 2^{\mathcal {C}}\) specifies the category \(c_l\in \mathcal {C}\) to which the stock \(s_j \in S\) should be mapped with \(\mathcal {C}\) and S being set of all the categories and set of all stocks respectively. \(C\subseteq \mathcal {C}\) is the set of categories, for which sem_dist is to be calculated, whereas \(w_{lj}\) in \({W}=[w_{lj}]_{\mathcal {C} \times M}\) gives the number of units of \(c_l\) equivalent to one unit of \(s_j\). We also define signeddistance(X,Y) at a category level (\(c_l\)) to be \( D_l= \sum \limits _{j:c_l\in map(s_j)}{ w_{lj}\left( x_jy_j\right) }\).
Tuple (map, W) defines categorization model. map gives the relationship among commodities e.g., Industrial Classification Benchmark (ICB) [1] provides a classification for stocks based on the sector and industry; equivalently map could specify the categorization based on market capitalization or some other type. W, here, could denote a weight factor to estimate equivalent worth of a stock in an industry or sector. In general, sem_dist allows for a richer query specification, which is very helpful. For example, sem_dist allows accounting for risk and/or diversity of each portfolio, while calculating distance between a portfolio and a query portfolio. Here we show, using an example, the effectiveness of semantic distance. Consider Table 1a, which contains three portfolios \(p_1, p_2\) and \(p_3\). Each portfolio specifies the number of stocks of AAV, RDC, ICD, GTT and NOW held in it. The stocks in Table 1a are from Oil and IT sectors. Table 1b gives the conversion factor per share of a stock to equivalent dollar value in a sector. Now, consider an investor who wants to find a portfolio from Table 1a, which is the closest in terms of its value at sector level. If Euclidean distance is directly used, then the results are not meaningful, as can be seen from the results in Table 1c; i.e., according to Euclidean distance, \(p_2\) and \(p_3\) are equally close (similar) to \(p_1\), whereas we notice that in term of the value of portfolio at sector level, \(p_1\) should be closer to \(p_2\) as compared to \(p_3\), since in contrast to \(p_3\), where only 15 shares of stocks are held in IT sector, \(p_1\) and \(p_2\) both hold 10 shares of stocks in Oil sector and 5 shares of stocks in IT sector. On the other hand if we use semantic distance then the results corroborate with our intuition and the semantic meaning of the query asked by the investor; this can be seen from the calculated distances in Table 1d. Though the calculated Euclidean distance is spatially correct, it fails to capture a lot of domain, structural and semantic information.
We stress that the proposed protocol is also able to calculate secure DSkNN query based on simple Euclidean distance measure. In semantics distance formulation, this can be accomplished by a \(map:S\rightarrow \mathcal {C}\), which is bijective, and setting W to identity matrix of dimension \(S\times S\). This will essentially calculate the Euclidean distance between two points.
3.1 DSkNN Query Protocol
Before presenting the details of the approach to compute the k closest portfolios as per the problem definition above, we first present the underlying assumptions, the notation used, and a few preliminaries. N, S and \(\mathcal {C}\) are known to all the parties including querier. Furthermore, each data owner (\(P_i\)) knows the size (\(N_j~\forall j\in [n]\)) of all databases fragments \(\mathcal {P}_j\). The database can be viewed as a matrix. In rest of the paper by parties/party we mean parties/party owning data, whereas Q is referred to as querier. Additive homomorphic encryption (AHE) e.g., Paillier allows addition of two encrypted values and multiplication of encrypted value with a plaintext value. The plain text values on which AHE operates come from \(\mathbb {N}\); let us say for a given security parameter \(\lambda \) AHE accepts plaintext values from \(P(\lambda )\) such that \(\forall x \in P(\lambda ),~x< \varLambda \), where \(\varLambda \in \mathbb {N}\). We divide \(P(\lambda )\) into two halves where lower half is positive and upper half is negative (i.e. contains additive inverses of lower half). Whenever a text is decrypted it is converted to equivalent negative or positive value; additive inverse of an encrypted value x i.e., \(E_{pk}[x]\), is \(E_{pk}[x]^{\varLambda 1}\). As for the decimal values we can decide for a precision up to a decimal point d, then multiply each plaintext value with \(10^d\) and convert it to an integer value. \(\lambda \in \mathbb {N}\) is picked in such a manner that the finally computed plaintext value in encrypted form is always within the range. We also employ garbled circuit [19] for secure comparison.
The basic idea in DSkNN is for data owners to encrypt the portfolio database and send it to Q, who calculates signeddistances w.r.t. its query, q, in encrypted form (Algorithm 1), and uses them to collaboratively calculate semantic distances in form of random shares with a data owner (Algorithm 1). A distributed rank query is then carried out to identify the indices of k portfolios with the smallest distances (Algorithm 2). Finally, Q retrieves portfolios, corresponding to the indices identified above, from the portfolio database.
We now discuss the details. In Algorithm 1 we outline the algorithm for DSkNN, where a party \(P_t\) is picked at random from data owners to initiate the protocol. \(P_t\) can be picked by each party generating a random number \(r_i\) from [n] and then calculating \(\sum _{i\in [n]}r_i \mod (n+1)\) using secure sum [6]. \(P_t\) generates publicprivate (pk, sk) key pair of AHE and sends it to all data owners and pk to Q. \(P_t\) also picks two parties \(P_l\) and \(P_{l'}\) randomly and lets all the parties know who they are. Next, every \(P_i\) first permutes its database, and creates encrypted shares \(EP_i=E_{pk}[\mathcal {P}_i^{\pi _i}\mathcal {R}_i]\) and \(ER_i=E_{pk}[\mathcal {R}_i]\). Note that these are homomorphically encrypted additive random shares of \(\mathcal {P}_i\). Now \(P_i\) sends \(EP_i\) to \(P_l\) and \(ER_i\) to \(P_{l'}\). This ensures that every database is split into two parts and thus prevents the leakage of any information to other parties or to Q. \(P_l\) and \(P_{l'}\) put all of these shares together and permute them using a common random seed \(\tilde{s}\) (which can be done by having \(P_l\) and \(P_{l'}\) each pick a random number and send it to the other and then compute the XOR of both random numbers). These encrypted permuted shares (denoted EP and ER) are then sent to Q. At this point, Q also randomly permutes the received shares of the database to avoid linkage attack by data owners. Q can reconstruct the database by adding together the received shares of database in encrypted form. Q then proceeds to calculate signeddistance in encrypted form according to its specified (map, W) and C. This steps consists of addition of encrypted values and their multiplication by values in plaintext (weights), which can be done in encrypted form thanks to the additive homomorphism.
Next, these signed distances need to be squared. This is accomplished by Q generating random numbers, r and g for each signeddistance, \(v_{i,l}\). Q sets its random share for the squared signeddistance to be \(v_{i,l}^1=r^2+g\), and sends \(P_t\), \(E_{pk}[v_{i,l}r]\) and \(E_{pk}[2rv_{i,l}g]\). \(P_t\) decrypts the received encrypted messages, converts them to appropriate negative or positive values as explained in preliminaries and sets its share to be \(v_{i,l}^{2}=(v_{i,l}r)^2+2rv_{i,l}g\). It is obvious that \((v_{i,l})^2=v_{i,l}^1+v_{i,l}^2\). Summing all shares of squared signeddistances of a portfolio will give the share of square of semantic distance for the portfolio. Thus Q and \(P_t\) can compute their shares for the distance for each portfolio since sem_dist \((p_i,q)^2=\sum _{l=1}^{C}v_{i,l}^1+\sum _{l=1}^{C}v_{i,l}^2\). At this point, the square of semantic distance between each portfolio and the query has been randomly split between Q and \(P_t\). For the sake of efficiency we do not compute the square root of squared semantic distance. However, this does not impact correctness of the protocol. Henceforth, Q and \(P_t\) engage in an interactive protocol to compute the k smallest distances corresponding to kNN portfolios. To find the ksmallest entries from the split distance vectors we develop a novel protocol kSmallest that can accomplish this both securely and efficiently. We first present the simpleksmallest (SKS) protocol, that efficiently computes the ksmallest entries without worrying about security. For a given a vector V containing unique values, v, and \(\mathcal {S}=\{\}\), the ksmallest values can be found as follows:

1: Set \(\mu =\sum _{v\in V}\frac{v}{V}\) and divide V into \(V_g=\{v\in V:v >\mu \}\) and \(V_{le}=V\setminus V_g\)

2: If \(V_{le} > k\), set \( V = V_{le}\) and go to step 1

3: If \(V_{le} \le k\) then set \(V =V_g\), \(\mathcal {S}=\mathcal {S}\cup V_{le}\) and \(k=kV_{le}\)

4: if \(k \ne 0\) go to step 1, terminates otherwise
SKS terminates, since each iteration reduces size of V. Note that, only the correct distances are added to the output in step 3 since the distances in \(V_{le}\) are guaranteed to be smaller than the ones in \(V_g\). SKS works very well for our problem setting and can easily be extended to be secure. Though any point in V instead of arithmetic mean can be used to split V without affecting correctness of the algorithm, choice of arithmetic mean as a split point is quite effective as long as subsets (of different sizes) of data are not highly skewed to the left for small values of k. This assumption does hold in real world data. Specifically, we show through empirical analysis that portfolio distances for real world stock market data [3] are but slightly skewed to left. We used portfolio data of hundred thousand individuals, which was collected over the period of three years from Swedish stock market [3]. We calculated the mean distance and variance for the mean distance over samples of various sizes (i.e., number of portfolios). Figure 1a depicts the percentile for mean distance and average percentiles for mean distances. It can be seen that mean distance is consistently at percentile 60. Figure 1b depicts the variance for the above calculated percentile for mean distance, and the average variance, which asserts that percentile for mean distance does not vary much. The complexity of SKS for such a distribution will be O(V) regardless of value for k. In the case where data is highly skewed or follows exponential distribution or leakage function is different a randomly picked data points can be used as a split point instead.
Now we focus on devising a secure and distributed SKS so that it can be carried out on random shares of distances without violating the privacy. It is easy to see that if the first step of SKS can be performed in a secure and distributed form (note, in our case, \(D_Q\) and \(D_t\) together give V), the remaining steps can be performed locally at Q and \(P_t\).
SKS requires the distances in vector to be unique i.e. \(\forall i \in [N]\), \(D_Q[i]+D_t[i]\) is unique. This is necessary, not only to guarantee that the protocol terminates but also to ensure security. In essence, if distances were nonunique, it could have been possible that all of the distances were same, thus resulting in \(V_{le} = V\) for all iterations. Since, in our case, uniqueness does not generally hold; therefore, we use a perturbation mechanism to achieve uniqueness. This can be accomplished by scaling the distance of portfolio \(p_i\) by N, and translating it by i. In the protocol, ksmallest, it is carried out as follows: Q multiplies \(D_Q[i]\) with N while \(P_t\) multiplies \(D_t[i]\) with N and adds i to it (lines 1–2 of Algorithm 2) (Note: \(D_Q=D_t=\mathcal {P}=N\)), which will together gives us \(N\times (D_Q[i]+D_t[i])+i\). Next we need to devise a secure and distributed protocol to compute mean and identify indices of \(D_Q\) or \(D_t\) for which distances (\(D_Q[i]+D_t[i]\)) are greater than the mean. If we let mean distance to be \(\mu = \mu _Q+\mu _t\),
where \(\mu _Q=\sum _i D_Q[i]/N\) and \(\mu _t=\sum _i D_t[i]/N\) then
The above observation tells us that result of comparing the distance for a record against the mean distance can be equivalently obtained by comparing the difference between random share and the mean of random shares. Note that since Q and \(P_t\) can locally compute this difference, the parties can simply use a secure comparison [19] (the garbled circuit approach) to compute the first step of SKS in a secure manner. Furthermore, Since we are using a finite integer field, it is possible that \(\mu _Q\) and \(\mu _t\) are fractional, and hence outside the field. To avoid this we use the output of floorfunction on \(\mu _Q\) and \(\mu _t\) and employ the following comparison instead: \(D_Q[i]  \lfloor \mu _Q \rfloor > D_t[i] + \lfloor \mu _t \rfloor \), but it does not affect the performance of Algorithm 2 since \(0\le \mu (\lfloor \mu _Q \rfloor + \lfloor \mu _t \rfloor )<2\). Since the remaining steps are local, both parties can calculate ksmallest entries securely and identify their corresponding indices in \(D_Q\) or \(D_t\).
Result Extraction: Using above found indices Q can identify and obtain kNN portfolios from the encrypted database. Let I contain the indices of ksmallest distances and \(\forall j\in I,~\hat{t}_j\) and \(\tilde{t}_j\) be the corresponding records in EP and ER then for all j, Q asks \(P_t\) for decryption of \(E_{pk}[\varphi _j]=E_{pk}[\hat{t}_j]\otimes E_{pk}[\tilde{t}_j]\otimes \gamma _j\), where \(\gamma _j\) is uniformly picked vector of size S from an appropriate domain and \(\otimes \) gives coordinatewise product of two vectors. It is straightforward to compute \(E_{pk}[\varphi _j]\) for homomorphically encrypted values and the original record \(t_j\) from \(\varphi _j\) i.e., \(t_j=\varphi _j\gamma _j\). Thus completing the protocol for computing sem_dist based kNN for horizontally fragmented database in a privacy preserving fashion.
3.2 Extensions
We can easily extend the protocol devised above to work for the outsourced data model. It can also be used for kNN classification. Both of these are briefly described below.
Outsourcing Case: Our protocol can very simply be applied for the case where the computation of data owners is transferred to the cloud in a secure manner. Parties can pick noncolluding, semihonest and untrusted servers \(C_1\) and \(C_2\) to take responsibilities of \(P_l\) and \(P_{l'}\) respectively except for creation random shares of their databases and their encryption. All the responsibilities of \(P_t\) for distance and ksmallest computation along with decryption for result retrieval phase are handed to one of the servers. Once responsibilities have been assigned to \(C_1\) and \(C_2\), following the protocol stated in DSkNN will compute kNN securely in cloud.
kNN Classification Case: The proposed protocol also has the ability to carry out kNN classification with a very small modification. Let us say there are G classes with labels \(\{1,2,\dots ,G\}\). We append each database fragment with G new columns and name them \({1,2,\dots ,G}\). For each row with class label g only column g of the appended G columns will have the value 1, and value 0 for the others. Now all the steps outlined in DSkNN are carried on the database with appended columns, except for the result retrieval step; furthermore, appended columns are not used for kNN computations. Once Q has identified kNN records in encrypted database, it computes a vector \(\mathcal {G}\), where \(\forall g\in [G],~\mathcal {G}[g]\) contains k minus the sum of values in column g of kNN records i.e., k minus the number of votes for each class; thus smaller the value \(\mathcal {G}[g]\), higher the number of votes for class g. Next, Q permutes \(\mathcal {G}\), creates random shares of values in \(\pi (\mathcal {G})\) and send them to \(P_t\), after which both Q and \(P_t\) follow ksmallest protocol with \(k=1\). At the end of ksmallest Q is able to identify the class of its instance q.
4 Complexity Analysis
Let N be the number of portfolios in database \(\mathcal {P}\) horizontally distributed among parties \(P_1,\dots P_{n}\), where each record is of dimension \(M=S\). The asymptotic computational complexity of DSkNN is \(O(N^2)=O(NM, N\times C+ N^2)\) since in the worst case there will be O(NM) encryption and \(O(N\times C)\) decryption operations along with \(O(N^2)\) secure comparison by data owners, whereas querier will perform \(O(N\times C)\) arithmetic operations on encrypted values and \(O(N^2)\) secure comparisons; furthermore, in most of the application scenarios \(M,C,k \ll N\).
It is important to note that in real world data for portfolios require only O(N) comparison to find k smallest entries as shown in Fig. 1 and is explained in Sect. 6. Moreover, C would also be much smaller as compared to S because thousands of stocks are traded in the market. So for all practical purposes asymptotically complexity for our problem will be O(NM) Following the same reasoning as above the asymptotic communication complexity of DSkNN will also be O(NM).
With respect to the communication complexity, it may appear that the cost of transferring the entire database over is excessive. While this is true in terms of the communication itself, both the monetary and time cost of doing this is negligible, since currently available bandwidth and speed are quite high e.g., currently ISPs are providing 1000 Mbps connection to residential users and small businesses, which allows an encrypted database of million rows and ten attributes to be transferred in matter of few seconds. On the other hand, in many cases cost and the time required for secure operations are significantly higher than that of required for data transfer. Additionally, many of the secure protocols including [7, 12] require transferring complete database between/among the parties. Therefore, we believe that this cost is reasonable.
5 Security Analysis
In this section we analyze the security of DSkNN under the framework of Definition 2. We want to show the following:
DSkNN is secure if probabilistic polynomial time simulators \(\mathcal {S}_i(m,\mathcal {P}_i,\mathcal {L}_{P})\) and \(\mathcal {S}_Q(m,q,(map,W),\mathcal {L}_Q,\mathcal {O}_k)\) can respectively simulate the view of \(P_i,~\forall i \in [n]\) and Q during the execution of DSkNN.
This means that if \(\mathcal {S}_i\) is provided with \(P_i\)’s input (\(m,\mathcal {P}_i\)) and leakage \(\mathcal {L}_{P}\) (which gives \((J,j_1,\dots ,j_J,N_1,\dots ,N_n,C)\)), and \(\mathcal {S}_Q\) with Q’s input (m, q, (map, W)) and output (\(\mathcal {O}_k\) i.e. kNN records) along with \(\mathcal {L}_Q\) (which gives \((J,j_1,\dots ,j_P,N,S)\)) then these simulators will have the same view as their respective parties; thus asserting that DSkNN reveals no extra information and does fulfill the security Definition 2. In the output of leakage functions J is the total number of iterations taken in Algorithm 2 corresponding to \(\mathcal {P}\) and q, \(j_\ell \) is the percentile of mean distance in \(\ell ^{th}\) iteration, whereas rest of the symbols are same as defined previously.
Let us analyze \(\mathcal {S}_j\) for the situation where \(j=t\) and also \(j \in \{l,l'\}\) (\(l,l'\) and t are as per specification in Algorithm 1) since such a party, \(P_j\), will receive the biggest set of intermediate messages, in all other cases parties receive less information. \(P_j\)’s view consists of its input (\(m,\mathcal {P}_j\)), random shares (\(\mathcal {D}_i \in \mathbb {N}^{N_i\times S},~\forall i\in [n]\)) of database fragments, random shares (\(\mathcal {H}_R \in \mathbb {N}^{N\times C}\)) of distances at category level, random seeds (\(s_j,\tilde{s}\)), randomtape (\(r_j\in \{0,1\}^{p(m)}\)) and \(\mathcal {L}_{P}\). \(P_j\)’s view can easily be generated by \(\mathcal {S}_j\): based on \(m,\mathcal {P}_j\) and \(\mathcal {L}_P\) that are provided to \(\mathcal {S}_j\), it can generate \(\mathcal {D}_i^j \leftarrow \mathbb {N}^{N_i\times S}\), \(\mathcal {H}_R^j \leftarrow \mathbb {N}^{N\times C}\), \((s_j',\tilde{s}_j')\leftarrow \mathbb {N}^2\) and \(r_j'\leftarrow \{0,1\}^{p(m)}\), \(\mathcal {L}_{P_j}\) using uniform distribution. Thus \(P_j\)’s view, \((m, \mathcal {P}_j,\mathcal {D}_1,\dots ,\mathcal {D}_n, \mathcal {H}_R, s_j,\tilde{s}_j,r_j,\mathcal {L}_{P})\), is computationally indistinguishable from \(\mathcal {S}_j\)’s view, \((m, \mathcal {P}_j,\mathcal {D}_1^j,\dots ,\mathcal {D}_n^j, \mathcal {H}_R^j, s_j', \tilde{s}_j', r_j',\mathcal {L}_P)\), in polynomial time, otherwise pseudorandom generator, which is assumed to be secure, can be broken which is used to create random shares and seeds. It is straightforward that for all other cases a party’s view will consist of less information than that of \(P_j\)’s view; hence \(\forall i\in [n],~\mathcal {S}_i\) will be able to generate \(P_i\)’s view.
The case for \(\mathcal {S}_Q\), is also very similar in that respective inputs, output, and leakage is provided to \(\mathcal {S}_Q\), except for the difference that Q receives encrypted database \(EP\in \mathbb {N}^{N\times S}\) instead of random share of a database, but since AHE is (semantically) secure – meaning EP is computationally indistinguishable from \(EP'\leftarrow \mathbb {N}^{N\times S}\) (i.e. generated uniformly) – \(\mathcal {S}_Q\) can generate a view using m, q, (map, W) and \(\mathcal {L}_Q\) that is indistinguishable from Q’s view. Thus proving that DSkNN is secure.
Note that the defined leakage reveals information, usually known in our application scenario. If one wants to hide this information then following is one way to accomplish this. Instead of mean distance, randomly picked distances can be used for the purpose of comparison to find ksmallest distances; dummy portfolios with sentinel values can be added to hide size of database; extra columns can be added for dummy coordinates, mapping to which can be provided through a secure and modified bloom filter. C can be hidden by adding dummy signeddistances with value zero. Though such measures will stop the leakage, they will significantly reduce the efficiency of the protocol.
6 Experimental Evaluation
We implemented DSkNN in Java. The platform used for testing is asymmetric in terms of its computational power. The querier machine had a 2.2 GH corei7 processor and 16 GB RAM whereas each of the database owners was a Xeon E52680 v2, with 10 cores running at 2.80 GHz, and 96 GB RAM. For AHE and garbled circuit we employed the implementation available at [2] with key size 1024 and [9] with a key size of 512 respectively. The default values for parameters are set based upon domain semantics. Specifically, even though 2k3k stocks are traded on the stock exchange, only a few hundred of them are most often traded; thus we set S to be 100. C is set to be 10 because ICB [1] classification taxonomy segregates stocks to 10 industries at the top level. As for k, it is set to 1, which represents the worst case for Algorithm 2. Lastly, N is set to be 1000. For each experiment, only one parameter is varied, while keeping the rest constant. Experiments described below were carried out with synthetic data. The results on real data are described later.
Figures 3a, b and c report time taken for distance computation by Q and \(P_t\), with varying N, S and C respectively. Time for all of these experiments grows linearly except for \(P_t\) w.r.t. S. This is because of the fact that distance computation time for \(P_t\) depends on C and N, but not on S. Let us now look at the performance of Algorithm 2, which only depends upon N. Figure 2a plots the computation time taken by Algorithm 2 for varying values of N. Again, the computation time scales linearly w.r.t. N. Figure 2b plots the computation time with respect to varying k. It is interesting to note that the time taken is roughly constant, and thus the time taken by our approach is actually independent of k.
We also compare our work with Elmehdwi et al. [7] for outsourcing case. Figure 1c compares the complete time taken by DSkNN and SRkNN [7]. For the sake of fair comparison results are computed for same parameters and equivalent processing power. It can be seen that DSkNN outperforms existing state of the art by an order of magnitude. Additionally, our implementation is in Java and uses threading only for decryption at \(P_t\), whereas implementation in [7] uses the openMP parallelization framework. Thus with an equivalent implementation, our results can be further improved.
Performance on real data: We obtained Swiss stock market data for year 2009–2011, which is a collection of portfolios of around 100k individuals for 300 stocks; the data was previously used in [3]. We only evaluated the performance of ksmallest protocol because time for distance computation is independent of data distribution. We randomly picked a subset of the data and choose one portfolio from it as the query portfolio, and computed the number of actual comparisons required by the ksmallest protocol for \(k=1\). Figure 2c plots the number of comparisons carried out for different values of N (the number of portfolios) along with a reference line for 2N. The two lines are almost perfectly in lockstep, which demonstrates the efficiency and suitability of our algorithm for real world data. However, in the worst case, it is still possible that in each iteration only one distance will be removed resulting in \(O(N^2)\) total comparisons.
7 Related Work
Privacypreserving data mining has received a lot of attention [15]. Given the numerous practical applications of privacypreserving kNN search, various protocols have been developed to address this problem. [5, 14] present solutions to the problem of computing kNN, where the data is fragmented among different parties, while also preserving privacy. [4] uses a semitrusted third party to find best k matches. In [12] Qi et al. introduce a singlestep protocol for kNN search, whereas [8] proposed a secure kNN searching protocol based on PIR for locationbased services. However, none of the above work is appropriate for computation over encrypted data. [10] solves recommendation problem using SelfOrganizing Map for clustering and kNN based collaborative filtering, but reveals query to data owner. Zang et al. in [20] employ homomorphic encryption for finding kNN in distributed setting, but in contrast to our work it reveals distances, partial access pattern to the parties. In [18], the query along with kNN distances is exposed and the output is less accurate. [11] makes use of untrusted third party and reveals query to parties. Although semantic distance can be applied here, the categorization model will be revealed to data owners. Shaneck et al. [13] provide a solution that reveals partial access pattern while being slower than our proposed protocol. Not only is our protocol straightforwardly extensible to provide outsourcing and kNN classification, but it also allows for incorporation of semantic distance, while still being comparatively very efficient as compared to state of the art [7].
8 Conclusion and Future Work
In this paper we have presented a secure approach to computing knearest neighbor queries for horizontally distributed data. Our approach is an order of magnitude faster than the existing state of the art. It is also applicable in the outsourcing environment, and can be used to compute topk queries, as well as kNN based classification. In the future, we plan to develop solutions that are resilient to stronger adversaries, some of which may collude as well.
References
Industry classification benchmark (2015). http://www.icbenchmark.com/ICBDocs/Structure_Defs_English.pdf. Accessed 06 Feb 2016
Pailliar encryption implementation (2015). https://code.google.com/p/thep/
Bohlin, L., Rosvall, M.: Stock portfolio structure of individual investors infers future trading behavior. PLoS One 9(7), e103006 (2014)
Boneh, D., Gentry, C., Halevi, S., Wang, F., Wu, D.J.: Private database queries using somewhat homomorphic encryption. In: Jacobson, M., Locasto, M., Mohassel, P., SafaviNaini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 102–118. Springer, Heidelberg (2013). doi:10.1007/9783642389801_7
Burkhart, M., Dimitropoulos, X.: Fast privacypreserving topk queries using secret sharing. In: Proceedings of 19th International Conference on Computer Communications and Networks (ICCCN), pp. 1–7 (2010)
Clifton, C., Kantarcioglu, M., Lin, X., Vaidya, J., Zhu, M.: Tools for privacy preserving distributed data mining. SIGKDD Explor. 4(2), 28–34 (2003)
Elmehdwi, Y., Samanthula, B.K., Jiang, W.: Secure \(k\)nearest neighbor query over encrypted data in outsourced environment. In: The 30th International Conference on Data Engineering, Chicago, USA, pp. 664–675, March 2014
Ghinita, G., Kalnis, P., Khoshgozaran, A., Shahabi, C., Tan, K.L.: Private queries in location based services: anonymizers are not necessary. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 121–132. ACM (2008)
Henecka, W., Schneider, T.: Faster secure twoparty computation with less memory. In: Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pp. 437–446. ACM (2013)
Kaleli, C., Polat, H.: Privacypreserving sombased recommendations on horizontally distributed data. Knowl.Based Syst. 33, 124–135 (2012)
Kantarcıoǧlu, M., Clifton, C.: Privately computing a distributed knn classifier. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202, pp. 279–290. Springer, Heidelberg (2004). doi:10.1007/9783540301165_27
Qi, Y., Atallah, M.J.: Efficient privacypreserving knearest neighbor search. In: The 28th International Conference on Distributed Computing Systems, ICDCS 2008, pp. 311–319. IEEE (2008)
Shaneck, M., Kim, Y., Kumar, V.: Privacy preserving nearest neighbor search. In: Shaneck, M., Kim, Y., Kumar, V. (eds.) Machine Learning in Cyber Trust, pp. 247–276. Springer, Heidelberg (2009)
Vaidya, J., Clifton, C.: Privacypreserving kth element score over vertically partitioned data. IEEE Trans. Knowl. Data Eng. 21, 253–258 (2009)
Vaidya, J., Clifton, C., Zhu, M.: Privacy Preserving Data Mining. Advances in Information Security, vol. 19. Springer, Heidelberg (2006)
Wong, W.K., Cheung, D.W.l., Kao, B., Mamoulis, N.: Secure kNN computation on encrypted databases. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 139–152. ACM (2009)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Xiong, L., Chitti, S., Liu, L.: Mining multiple private databases using a kNN classifier. In: Proceedings of the 2007 ACM Symposium on Applied Computing, pp. 435–440. ACM (2007)
Yao, A.C.: Protocols for secure computation. In: Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, pp. 160–164 (1982)
Zhang, F., Zhao, G., Xing, T.: Privacypreserving distributed knearest neighbor mining on horizontally partitioned multiparty data. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds.) ADMA 2009. LNCS (LNAI), vol. 5678, pp. 755–762. Springer, Heidelberg (2009). doi:10.1007/9783642033483_80
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 IFIP International Federation for Information Processing
About this paper
Cite this paper
Asif, H., Vaidya, J., Shafiq, B., Adam, N. (2017). Secure and Efficient kNN Queries. In: De Capitani di Vimercati, S., Martinelli, F. (eds) ICT Systems Security and Privacy Protection. SEC 2017. IFIP Advances in Information and Communication Technology, vol 502. Springer, Cham. https://doi.org/10.1007/9783319584690_11
Download citation
DOI: https://doi.org/10.1007/9783319584690_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 9783319584683
Online ISBN: 9783319584690
eBook Packages: Computer ScienceComputer Science (R0)