Privacy-Preserving Range Queries from Keyword Queries

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9149)

Abstract

We consider the problem of a client performing privacy-preserving range queries to a server’s database. We propose a cryptographic model for the study of such protocols, by expanding previous well-studied models of keyword search and private information retrieval to the range query type and to incorporate a multiple-occurrence attribute column in the database table.

Our first two results are 2-party privacy-preserving range query protocols, where either (a) the value domain is linear in the number of database records and the database size is only increased by a small constant factor; or (b) the value domain is exponential (thus, essentially of arbitrarily large size) in the number of database records and the database size is increased by a factor logarithmic in the value domain size. Like all previous work in private information retrieval and keyword search, these protocols still have server time complexity linear in the number of database payloads.

We discuss how to adapt these results to a 3-party model where encrypted data is outsourced to a third party (i.e., a cloud server). The result is a private database retrieval protocol satisfying a highly desirable tradeoff of privacy and efficiency properties; most notably: (1) no unintended information is leaked to clients or servers, and the information leaked to the third party is characterized as ‘access pattern’ on encrypted data; (2) for each query, all parties run in time only logarithmic in the number of database records and linear in the answer size; (3) the protocol’s query runtime is practical for real-life applications.

1 Introduction

The recent computing trend of outsourcing big data in the cloud for simplified and efficient application deployment is being embraced in government, as well as other areas, including finance, information technology, etc. In government, large databases are needed in many contexts (e.g., no-fly lists, metadata of communication records, etc.). In finance, banks and other financial institutions need to store huge data volumes and compute over them on a daily basis. In information technology, web and social networks collect huge amounts of data from computer users, which is then made available for different uses and computations. To facilitate and guarantee success for all of these applications, databases are very useful data management tools, and cloud storage and computing provide tremendous efficiency and utility for users, as exemplified by the increasingly successful database-as-a-service application paradigm (see, e.g., [13]). On the other hand, cloud storage and computing paradigms are also accompanied by privacy risks (see, e.g., [21]). To mitigate these risks, database-management systems can use privacy-preserving database retrieval protocols that allow users to submit queries and receive results in a way that clients learn nothing about the contents of a database except the results of their queries, and servers do not learn which queries are submitted. The research literature has attempted to address these issues, by studying private database retrieval protocols in limited database and query models and with limited efficiency properties. In this paper we partially address some of these limitations, by using a practical database model, and proposing protocols in both a client-server model and a 3-party model, where servers can outsource data to a third party (in encrypted form). In these models, practical and privacy-preserving database retrieval protocols for basic query types, such as keyword queries, have recently been shown to be possible. In this paper, we attempt to show that practical and privacy-preserving database retrieval protocols are possible for a more complex query type: range queries.

Previous Work. The security and cryptography literature contains a significant amount of research in the private information retrieval (PIR) [6, 18, 20] and keyword search (KS) [3, 5, 10] areas. Both areas consider rather theoretical data models, as we now discuss. In PIR, a database is modeled as a string of n bits, and the query value is an index \(i\in \{1,\ldots ,n\}\). In KS, early data models were also somewhat restrictive; for instance, [10] only admitted a single matching record per query. The inefficiency of the server runtime in PIR and KS protocols has been well documented (see, e.g., [24]). Some results attempted to use a third party and make the PIR query subprotocol more efficient but require a practically inefficient preprocessing phase [8]. Recently, however, some results on provably privacy-preserving and practical keyword queries in a practical database model and in an outsourced-data scenario were concurrently shown by [7, 17], where significant efficiency is achieved by provably limiting the privacy loss to encrypted data “access-pattern” information, only leaked to the cloud server.

The literature also contains a significant amount of work on range queries or range computations on encrypted data. Some papers (starting with [4, 22]) focus on encrypting messages, on which one can later perform range query computations. These approaches offer interesting provable security properties but make heavy use of asymmetric cryptography techniques and seem hard to translate into practical protocols for databases. Promising approaches to achieve at least some limited amount of privacy (with tradeoffs against efficiency) on range queries in an outsourced database setting have also been shown (see, e.g., [14] and follow-up work), typically based on variants of “bucketization” approaches. The primitive of order-preserving encryption gives rise to elegant and efficient range query protocols in the “database-as-a-service” model (see, e.g., [1] and follow-up work), but constructions of order-preserving encryption are still not very efficient and especially come with static leakage on the encrypted data to the server holding it [2]. Overall, the question of designing provably privacy-preserving range queries in a practical data model, even in the outsourced-data scenario, seems to still deserve more attention from the security community.

Our Contribution. We study range queries in a more practical (outsourced or not) database model, capturing record payloads, possibly equal attribute values across different database records, and multiple answers to a given query. In this model, we define suitable correctness, privacy and efficiency requirements.

We then design two range query protocols in the 2-party model, which satisfy desired privacy properties (i.e., the server learns no information about the query range other than the number of matching records, and the client learns no information about the database other than matching database records) in our data model. Our first protocol works for linear-size value domains by only increasing database size by a small constant, and our second protocol works for exponential-size (thus, essentially arbitrary-size) value domains while increasing database size by a factor logarithmic in the value domain size. These protocols are constructed directly from any KS protocol and, like previous PIR and KS protocols, have server time complexity linear in the database size, a drawback dealt with in our next result.

Our third protocol transforms any of our 2-party range query protocols into a 3-party protocol, where the third party can be a cloud server, based on any 3-party KS protocol (like the one in [7], only based on any pseudo-random function, implemented as a block cipher). In this protocol, both server and third party run queries in logarithmic time and the following privacy properties provably hold: the server learns nothing about the query range, the client learns nothing about the database in addition to the matching database records, and the third party learns nothing about the query range or the database content, other than the repeating of queries from the client and repeated access to the encrypted data structures received by the server at initialization. This solves the problem of achieving provable privacy (against a semi-honest adversary) and efficient server runtime at the cost of a ‘third-party’-server and some leakage to the third party characterized as ‘access-pattern’ to encrypted data. We stress that this protocol has efficient running time not only in an asymptotic sense, but in a sense that makes it ready for real-life applications (where such form of leakage to the third party is tolerable). In our implementation of a computationally similar protocol, we reached our main performance goal of achieving response time to be less than 1 order of magnitude slower than commercial non-private protocols like MySQL. Our protocol solves a number of technical challenges using simple and practical techniques, including a reduction step via an intermediate rank database and a ‘lazy’ database value shifting approach. The privacy loss traded for such a practicality property was already studied in [15, 16], who also proposed simple techniques to mitigate leakage to the cloud server in the form of ‘access-pattern’ to encrypted data, at least in the case of keyword queries. 
(Here, note that in the presence of such leakage, neither the client nor the server learns anything new, and the cloud server does not statically learn anything about the plain database content.) We believe that one appropriate mitigation technique needed for such solutions could be based on Oblivious RAM (an active area started in [12]), and it is plausible that dedicated Oblivious RAM techniques in the 3-party model may nullify or mitigate any such leakage based on ‘access-pattern’ over encrypted data. This is indeed a promising direction as, while years ago Oblivious RAM was considered inefficient, recent advances (see, e.g., [23]) have made it significantly less inefficient. In all our protocols, we only consider privacy against a semi-honest adversary corrupting at most one party (i.e., an adversary that follows the protocol and then attempts to violate the privacy of one of the parties).

2 Models and Requirements

Data and Query Models. We model a database as an n-row, 2-column matrix \(D=(A_1,A_2)\), where each column is associated with an attribute, denoted as \(A_j\), for \(j=1,2\), and each entry is denoted as \(A_j(i)\). The first column \(A_1\) is a value attribute, where entries are values in a domain \(Dom\) with a total order \(\le \), and the last column \(A_2\) is a payload attribute, where entries can be arbitrary binary strings. The database schema, assumed to be publicly known to all parties, includes parameter n, the security parameter, and the description of the attribute value domain. A database row is also called a record, and all records are assumed to have the same length \(\ell _r\) (if data is not already in this form, techniques from [9] are used to efficiently achieve this property), where \(\ell _r\) is constant with respect to n.

A query q is modeled to contain one or more query values from the respective attribute domains. We mainly consider range queries, defined as:
$$ \text{SELECT } * \text{ FROM } main \text{ WHERE } \text{attribute}\_\text{name} \in [v_0,v_1], $$
where \(v_0,v_1\) are the query values. A valid response (to a range query) consists of all payloads \(A_2(i)\), for \(i\in [1,n]\), such that \(A_1(i)\in [v_0,v_1]\), and we say that these payloads (or records) match the query. We also discuss KS queries, defined as:
$$ \text{SELECT } * \text{ FROM } main \text{ WHERE } \text{attribute}\_\text{name} = v, $$
where v is the query value. A valid response (to a keyword query) consists of all payloads \(A_2(i)\), for \(i\in [1,n]\), such that \(A_1(i)=v\).
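
As a plain, non-private reference point, the two query types can be sketched in Python over a database stored as a list of (value, payload) pairs. The function names and the sample data below are hypothetical and chosen only for illustration; the sketch specifies what a valid response is, not how a privacy-preserving protocol computes it.

```python
# Non-private reference semantics for the two query types.
# D is a list of (A_1(i), A_2(i)) pairs; names and data are illustrative only.

def range_query(D, v0, v1):
    """Return all payloads A_2(i) such that A_1(i) lies in [v0, v1]."""
    return [payload for value, payload in D if v0 <= value <= v1]


def keyword_query(D, v):
    """Return all payloads A_2(i) such that A_1(i) equals v."""
    return [payload for value, payload in D if value == v]


# The value 3 occurs twice: the model allows repeated attribute values.
D = [(3, b"rec-a"), (1, b"rec-b"), (3, b"rec-c"), (4, b"rec-d")]
print(range_query(D, 2, 3))
print(keyword_query(D, 1))
```

Any RQ protocol satisfying the correctness requirement must return to C exactly the payloads that `range_query` returns on the same inputs.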

Participant Models. We consider the following efficient (i.e., running in probabilistic polynomial-time in a common security parameter \(1^\sigma \)) participants. The client is the party, denoted as C, that is interested in retrieving data from the database. The server is the party, denoted as S, holding the database (in the clear), and is interested in allowing clients to retrieve data. The third party, denoted as TP, helps the client to carry out the database retrieval functionality and the server to satisfy efficiency requirements during the associated protocol. By 2-party model we denote the participant model that includes C, S and no third party. By 3-party model we denote the participant model that includes C, S, and TP. (See Figs. 1 and 2 for a comparison of the two participant models.)

Fig. 1. Structure of our 2-party RQ protocol

Fig. 2. Structure of our 3-party RQ protocol

Range Query Protocols. In the above data, query, and participant models, we consider a (static-data) range query (briefly, RQ) protocol that extends the KS protocol, as defined in [10] (in turn, an evolution of the PIR protocol, as defined in [18]), in that it considers range queries instead of keyword queries, and it allows the attribute column to have multiple occurrences of the same value. (We can also extend the model so as to incorporate databases that contain multiple attributes.) Specifically, we define an RQ protocol as a pair \((\mathsf {Init}, \mathsf {Query})\) of subprotocols, as follows. The initialization subprotocol \(\mathsf {Init}\) is used to set up data structures and cryptographic keys before C’s queries are executed. The query subprotocol \(\mathsf {Query}\) allows C to make a single query to retrieve (possibly multiple) matching database records. We also define an RQ protocol execution as a sequence of executions of subprotocols \((\mathsf {Init},\mathsf {Query}_1,\ldots ,\mathsf {Query}_q)\), for some q polynomial in the security parameter, where all subprotocols are run on inputs provided by the involved parties (i.e., a database from S and query values from C). We would like to build RQ protocols that satisfy the following (informal) list of requirements:
  1. Correctness: the RQ protocol allows a client to obtain all payloads from the current database associated with records that match its issued query; more specifically, for any RQ protocol execution, and any inputs provided by the participants, in any execution of a \(\mathsf {Query}\) subprotocol, the probability that C obtains all records in the current database that match C’s query value input to this subprotocol, is 1.

  2. Privacy: informally speaking, the RQ protocol preserves privacy of database content and query values, ideally only revealing what is leaked by system parameters known to all parties and by the intended functionality output (i.e., all payloads in matching records to C); more specifically, we require the subprotocols in an RQ protocol execution to not leak information beyond the following:

    • \(\mathsf {Init}\): all system parameters, including the database schema and a security parameter, will be known to all participants; in the 3-party model, an additional string eds (for encrypted data structures) will be known to TP, will be encrypted under one or more keys unknown to TP and its length is known from quantities in the database schema;

    • \(\mathsf {Query}\), based on query range \(qr=[v_0,v_1]\) and the database D: all payloads \(\{p(i):i=i(1),\ldots ,i(m(qr))\}\) such that \(A_1(i)\in [v_0,v_1]\), for \(i=i(1),\ldots ,i(m(qr))\), will be obtained by C, as a consequence of the correctness requirement; in the 2-party model, the value m(qr) will be known to S; in the 3-party model, the value m(qr), all bits in eds read by TP according to the instructions in the \(\mathsf {Query}\) protocol, and which previous executions of \(\mathsf {Query}\) used the same query value v, will be known to TP.

  3. Efficiency: the protocol should have low time, communication and round complexity, as a function of system parameters, including the number n of database records.


Given the characterization of intended leakage in the above privacy definition, a formal privacy definition can be derived using known definition techniques from simulation-based security and composable security frameworks often used in the cryptography literature.

Similarly as noted for keyword queries in [7], we observe that the communication exchanged in each execution of any subprotocol \(\mathsf {Query}\) has to leak an upper bound on the value m(qr), i.e., the number of matching records, to S in the 2-party model, and to the coalition of TP and S in the 3-party model. Accordingly, we target the design of protocols that may leak m(qr) to S in the 2-party model. In the 3-party model, different RQ protocols could leak m(qr) only to S, or only to TP, or somehow split this leakage between S and TP. Having to choose between one of these options, we made the practical consideration that privacy against S (i.e., the data owner) is typically of greater interest than privacy against TP (i.e., the cloud server helping C retrieve data from S) in many applications, and therefore we focused in this paper on seeking protocols that leak m(qr) to TP and nothing at all to S. Moreover, in the 3-party model, we made a definitional choice of leaking patterns of repeated access to encrypted data to TP; this is not due to a theoretical limitation, but seems a well-characterized privacy leakage, which, depending on the application at hand, either is a small price to pay towards achieving very efficient time-complexity requirements on S and TP, or can be reduced by using separate techniques.

With respect to efficiency, although we design protocols with low time, round and communication complexity, we focus our discussions on the communication complexity of the query subprotocols, and on the running time of S in the 2-party model and of S and TP in the 3-party model.

Background: Keyword Search Protocols. A random function R is a function that is chosen with distribution uniform across all possible functions with some pre-defined input and output domains. A keyed function \(F(k,\cdot )\) is a pseudo-random function (PRF, first defined in [11]) if, after key k is randomly chosen, no efficient algorithm allowed to query an oracle function O can distinguish whether O is \(F(k,\cdot )\) or O is a random function R (over the same input and output domain), with probability greater than 1/2 plus a negligible quantity. A KS protocol is a protocol between two parties A, having as input a keyword \(v\in \{1,\ldots ,n\}\), and B, having as input a 2-column database represented as \(D=(A_1,A_2)\). The protocol consists of a private retrieval of the value(s) \(A_2(i)\) such that \(A_1(i)=v\), returned to A (thus, without revealing any information about i to B or about \(A_2(1),\ldots ,A_2(i-1),A_2(i+1),\ldots ,A_2(n)\) to A). Several KS protocols have been presented in the cryptographic literature, starting with [18], using number-theoretic hardness assumptions (see also [5, 7, 10]).
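
To make the notion of a keyed function \(F(k,\cdot )\) concrete, the following minimal Python sketch instantiates it with HMAC-SHA256 from the standard library. This instantiation is an assumption made here purely for illustration (HMAC is commonly modeled as a PRF); the paper only requires some PRF, and the function name `prf` and the 8-byte input encoding are hypothetical.

```python
import hashlib
import hmac
import secrets

def prf(key: bytes, x: int) -> bytes:
    """Illustrative stand-in for F(k, x): HMAC-SHA256 over an 8-byte encoding of x."""
    return hmac.new(key, x.to_bytes(8, "big"), hashlib.sha256).digest()


key = secrets.token_bytes(32)       # key k chosen at random

# Deterministic under a fixed key: the same input always maps to the same output,
# so a party holding k can recompute the tag of a keyword.
assert prf(key, 5) == prf(key, 5)

# Without k, the outputs are intended to look like those of a random function.
print(prf(key, 5).hex())
```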

3 Range Queries in the Two-Party Model

We describe two RQ protocols for range queries in this model: the first protocol, presented in Sect. 3.1, works for ranges with elements in any linear-size domain; the second protocol, presented in Sect. 3.2, works for ranges with elements in any exponential-size (in practice, arbitrarily large) domain.

3.1 A Range Query Protocol for Linear-Size Domains

Our first 2-party RQ protocol considers range values in linear-size domains (that is, where the domain size is equal to the number of database records). This protocol follows the general structure outlined in Fig. 1 and satisfies the following

Theorem 1

Consider a database with n records and domain \(Dom=[0,n-1]\). Assuming the existence of a 2-party privacy-preserving KS protocol \(\pi _0=(\mathsf {Init}_0,\mathsf {Query}_0)\), there exists (constructively) a 2-party privacy-preserving RQ protocol \(\pi _1=(\mathsf {Init}_1,\mathsf {Query}_1)\) for such a database, satisfying:
  1. correctness;

  2. privacy against C (i.e., it only leaks the matching records to C);

  3. privacy against S (i.e., it only leaks the number of matching records to S);

  4. communication complexity of \(\mathsf {Query}_1\) on a queried range qr is O(m(qr)) times the communication complexity of \(\mathsf {Query}_0\);

  5. the S-time complexity in \(\mathsf {Query}_1\) on a queried range qr is O(m(qr)) times the S-time complexity in \(\mathsf {Query}_0\) plus O(n).

     

We prove Theorem 1 by describing RQ protocol \(\pi _1\) and its properties.

The RQ Protocol \(\pi _1\): Basic Definitions. Let \(Dom\) be a value domain with a total order \(\le \) defined on it. We say that \(Dom\) is a linear-size domain if it holds that \(|Dom|\le n\). Given a list U of (not necessarily distinct) values \(u_1,\ldots ,u_n\in Dom\), we say that a value \(v\in Dom\) has lower U-rank r, also denoted as \(Lrank(U,v)=r\), if there are r values in U strictly smaller than v. We say that a value \(v\in Dom\) has upper U-rank r, also denoted as \(Urank(U,v)=r\), if there are \(n-r\) values in U strictly larger than v. Let \(sU=(u_{h(0)},\ldots ,u_{h(n-1)})\) denote the list obtained from U by sorting its n elements. These definitions directly imply the following:

Fact 1

Given values \(v_0,v_1\in Dom\) such that \(v_0\le v_1\), it holds that
  1. \(U\cap [v_0,v_1]=\emptyset \) if and only if \(Lrank(U,v_0)\ge Urank(U,v_1)\).

  2. \(U\cap [v_0,v_1]\ne \emptyset \) if and only if \(u_{h(a)},\ldots ,u_{h(b)}\in [v_0,v_1]\), for \(a=Lrank(U,v_0)\) and \(b=Urank(U,v_1)-1\).
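
The rank definitions and Fact 1 can be checked on a small example. The following Python sketch computes both ranks by binary search on the sorted list \(sU\) (the use of `bisect` is an implementation choice made here, not part of the paper), and the sample list U is hypothetical.

```python
import bisect

def lrank(sU, v):
    """Lower U-rank: number of values in U strictly smaller than v."""
    return bisect.bisect_left(sU, v)


def urank(sU, v):
    """Upper U-rank r: exactly n - r values in U are strictly larger than v."""
    return bisect.bisect_right(sU, v)


def values_in_range(sU, v0, v1):
    """Values of U in [v0, v1]: positions a, ..., b of sU, with b = Urank - 1."""
    a, b = lrank(sU, v0), urank(sU, v1) - 1
    return sU[a:b + 1]          # empty exactly when Lrank >= Urank


U = [3, 1, 3, 0, 4]             # hypothetical attribute values, n = 5
sU = sorted(U)                  # [0, 1, 3, 3, 4]
print(values_in_range(sU, 1, 3))
print(values_in_range(sU, 2, 2))
```

In particular, the number of matching values equals \(Urank(U,v_1)-Lrank(U,v_0)\), which is the only rank information the protocol below intends C to learn.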

     

The RQ Protocol \(\pi _1\): An Informal Description. A first approach in our protocol goes as follows. At initialization S splits database D into two databases: a rank database rD and a payload database pD. At query time, C asks S for the lower rank of \(v_0\) and the upper rank of \(v_1\), where \([v_0,v_1]\) denotes the range queried by C. Because in this protocol we consider only linear-size value domains, S can store at initialization the lower rank and the upper rank of each value in the domain in rD; thus, it suffices for C to perform keyword queries to rD to retrieve the lower and upper rank values. Given these retrieved values, C can compute how many attribute values (if any) are in \([v_0,v_1]\) (i.e., the upper rank minus the lower rank), and then perform as many keyword queries to pD to retrieve the records matching the queried range. As written so far, the protocol satisfies our desired correctness and efficiency properties, but not the privacy property, as C learns the two rank values associated with the queried range’s endpoints. We fix this problem by requiring S to randomize the rank values by a random shift of the attribute values, a variation of an idea first used in [8] to improve the efficiency of keyword queries in a 3-party model. Thus, the ranks received by C will be randomly distributed, conditioned on the fact that the difference between them remains the same, and C is entitled to know this difference because of the correctness requirement.

The RQ Protocol \(\pi _1\): A Formal Description. Protocol \(\pi _1\) uses a KS protocol \(\pi _0=(\mathsf {Init}_0,\mathsf {Query}_0)\) for a 2-column database, which can be obtained from protocol \(\pi _1\) in [7] or protocol 2 in [10]. Both these protocols use the KS protocol in [5], which in turn is based on any semi-private PIR protocol (e.g., [18]).

\(\mathsf {Init}_1\). On input database \(D=(A_1,A_2)\), S sets U as the list \((A_1(1),\ldots ,A_1(n))\), and builds an associated rank database \(rD=(rA_1,rA_{2})\) and an associated payload database \(pD=(pA_1,pA_{2})\), computed as follows.

For each \(i=1,\ldots ,n\),

  1. \(rA_1(i)=i\),

  2. \(rA_2(i)=(Lrank(U,i),Urank(U,i))\), and

for each \(i=0,\ldots ,n\),

  1. \(pA_1(i)=i\), and

  2. \(pA_2(i)=A_2(j)\) where \(j\in \{1,\ldots ,n\}\) satisfies \(Lrank(U,A_1(j))=i\).

     
\(\mathsf {Query}_1\). Let \(\gg _n\) denote the operation ‘right shift modulo n’. On input query range \(qr=[v_0,v_1]\), where \(v_0,v_1\in Dom\), from C, and all quantities computed during \(\mathsf {Init}_1\), the following steps are run:
  1. If \(v_0>v_1\) then C sends failure symbol \(\perp \) to S and halts;

  2. S randomly chooses value \(s\in \{0,\ldots ,n-1\}\);

  3. for \(i=1,\ldots ,n\),
       S sets \(Lr'(U,i)=Lrank(U,i) \gg _n s\), \(Ur'(U,i)=Urank(U,i) \gg _n s\);
       S sets \(rA'_2(i)=(Lr'(U,i),Ur'(U,i))\);

  4. S runs \(\mathsf {Init}_0\) on input 2-column database \(rD=(rA_1,rA'_{2})\);

  5. for \(j=0,1\): C and S run \(\mathsf {Query}_0\), where C uses \(v_j\) as query value and S provides \((rA_1,rA'_2)\) as a 2-column database; at the end of the protocol, C computes the payload \(rA'_2(i(j))=(Lr'(U,i(j)),Ur'(U,i(j)))\) such that \(rA_1(i(j))=v_j\);

  6. if \(Lr'(U,i(0))= Ur'(U,i(1))\) then
       C sends failure symbol \(\perp \) to S and halts;

  7. for \(i=0,\ldots ,n\),
       S sets \(pA'_1(i)=pA_1(i) \gg _n s\);

  8. S runs \(\mathsf {Init}_0\) on input 2-column database \(pD=(pA'_1,pA_{2})\);

  9. for \(j=Lr'(U,i(0)),\ldots ,Ur'(U,i(1))-1\), possibly cycling from \(n-1\) to 0: C and S run \(\mathsf {Query}_0\), where C uses j as query value and S provides \((pA'_1,pA_2)\) as a 2-column database; at the end of the protocol, C computes the payload \(pA_2(i(j))\) such that \(pA'_1(i(j))=j\).
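
The data flow of \(\mathsf {Init}_1\) and \(\mathsf {Query}_1\) can be illustrated with a minimal Python simulation. It is only a sketch under several assumptions made here: each execution of the KS protocol \(\pi _0\) is replaced by a plain dictionary lookup (so the sketch provides no privacy at all), rD is indexed by the values of \(Dom=[0,n-1]\), pD is indexed by positions in sorted order, and the function names are hypothetical. Because shifts are taken modulo n as in the text, the sketch also assumes that the queried range matches fewer than n records.

```python
import bisect
import random

def init1(D):
    """S's setup: build the rank database rD and the payload database pD."""
    n = len(D)
    order = sorted(range(n), key=lambda j: D[j][0])   # record indices, sorted by value
    sU = [D[j][0] for j in order]                     # sorted attribute values
    # rD: domain value v -> (Lrank(U, v), Urank(U, v)), for every v in [0, n-1]
    rD = {v: (bisect.bisect_left(sU, v), bisect.bisect_right(sU, v)) for v in range(n)}
    # pD: position in sorted order -> payload
    pD = {pos: D[j][1] for pos, j in enumerate(order)}
    return rD, pD


def query1(rD, pD, v0, v1, rng=random):
    """Simulate Query_1; dictionary lookups stand in for executions of Query_0."""
    n = len(pD)
    if v0 > v1:
        return None                                   # step 1: failure symbol
    s = rng.randrange(n)                              # step 2: random shift
    rD_s = {v: ((lo + s) % n, (hi + s) % n) for v, (lo, hi) in rD.items()}  # step 3
    pD_s = {(pos + s) % n: p for pos, p in pD.items()}                      # step 7
    lo = rD_s[v0][0]                                  # step 5: shifted lower rank of v0
    hi = rD_s[v1][1]                                  # step 5: shifted upper rank of v1
    if lo == hi:
        return []                                     # step 6: no matching record
    out, j = [], lo
    while True:                                       # step 9: cycle from n-1 to 0
        out.append(pD_s[j])
        j = (j + 1) % n
        if j == hi:
            return out


D = [(3, "d"), (1, "b"), (1, "c"), (0, "a"), (4, "e")]   # hypothetical records
rD, pD = init1(D)
print(query1(rD, pD, 1, 3))
```

Whatever shift s is drawn, C sees two ranks whose difference modulo n equals the number of matching records, and then issues that many lookups into the shifted payload database.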

     
Properties of \(\pi _1\). We now show that \(\pi _1\) satisfies the correctness, privacy and efficiency properties defined in the 2-party model.

Correctness. First of all, note that by the test in step 1, we can assume that \(v_0\le v_1\), which implies that \(Lrank(U,i(0))\le Urank(U,i(1))\).

By the correctness property of the KS protocol \(\pi _0\), at the end of step 5 of \(\mathsf {Query}_1\), C can compute the shifted lower rank \(Lr'(U,i(0))\) of \(v_0\) and the shifted upper rank \(Ur'(U,i(1))\) of \(v_1\). As both values are obtained as a shift, by the same random number s, of \(Lrank(U,i(0))\) and \(Urank(U,i(1))\), respectively, it holds that \(Lr'(U,i(0))= Ur'(U,i(1))\) if and only if \(Lrank(U,i(0))= Urank(U,i(1))\). Using item 1 of Fact 1, this implies that if \(U\cap [v_0,v_1]=\emptyset \), it will hold that \(Lrank(U,i(0))= Urank(U,i(1))\) and thus \(Lr'(U,i(0))= Ur'(U,i(1))\), and then C will halt in step 6 of \(\mathsf {Query}_1\), without receiving any payload from S. On the other hand, if \(U\cap [v_0,v_1]\ne \emptyset \), at the end of step 9 of \(\mathsf {Query}_1\), by the correctness property of the KS protocol \(\pi _0\), C computes the payload \(pA_2(i(j))\) such that \(pA'_1(i(j))=j\), for all \(j=Lr'(U,i(0)),\ldots , Ur'(U,i(1))-1\), possibly cycling from \(n-1\) to 0. Using item 2 of Fact 1, this implies that C receives all payloads corresponding to values \(A_1(i)\) in the range \([v_0,v_1]\).

Privacy. We show that \(\pi _1\) satisfies our privacy requirement when the adversary corrupts either S or C.

When the adversary corrupts S, privacy (i.e., corrupting S does not provide the adversary any new information about C’s range query \([v_0,v_1]\) other than system parameters and the number of matching payloads) can be proved by using the analogous privacy property of the KS protocol \(\pi _0\). First of all, we observe that \(\mathsf {Query}_1\) in protocol \(\pi _1\) consists of 2 executions of \(\mathsf {Query}_0\) (one per range endpoint, in step 5) followed by either no further execution of \(\mathsf {Query}_0\) (resulting in no payload received by C) or by \(m(qr)=Urank(U,v_1)-Lrank(U,v_0)\) additional executions of \(\mathsf {Query}_0\) (resulting in \(m(qr)>0\) payloads received by C). Thus, given the number \(m(qr)\ge 0\) of payloads received by C, an efficient simulator for the view obtained by S is obtained by suitably calling the efficient simulator for the view by S in the KS protocol \(\pi _0\).

When the adversary corrupts C, privacy (i.e., corrupting C does not provide the adversary with any information about S’s database D other than system parameters and what is intended by the correctness requirement) can be proved by using the analogous privacy property of protocol \(\pi _0\). Here, the proof is similar to the previous case: given the number \(m(qr)\ge 0\) of payloads received by C, a simulator for C’s view is obtained by suitably calling the simulator for C’s view in the KS protocol \(\pi _0\).

Efficiency. As \(\mathsf {Query}_1\) essentially consists of running \(\mathsf {Query}_0\) \(m(qr)+2\) times, the communication complexity (resp., S-time complexity) of \(\mathsf {Query}_1\) is O(m(qr)) times the communication complexity (resp., S-time complexity) of \(\mathsf {Query}_0\). Thus, the communication complexity is desirably linear in the number of matching records (and can be sub-linear in the number n of total database records). Analogously, the S-time complexity of \(\mathsf {Query}_1\) is O(m(qr)) times the S-time complexity of \(\mathsf {Query}_0\) plus O(n). Here, note that the S-time complexity of \(\mathsf {Query}_1\) is linear in n already for small values of m(qr), as so is the S-time complexity of \(\mathsf {Query}_0\). This inefficiency is a major and known drawback of all 2-party model solutions for protocols like PIR and KS, and therefore, of protocols \(\pi _0\) and \(\pi _1\). Indeed, this motivated our study of RQ protocols in the 3-party model in Sect. 4.

3.2 A Range Query Protocol for Exponential-Size Domains

Our second 2-party RQ protocol considers range values in exponential-size (which means, practically speaking, arbitrarily large) domains. This protocol follows the general structure outlined in Fig. 1 and satisfies the following

Theorem 2

Consider a database with n records and domain \(Dom=[0,2^d-1]\), for some d polynomial in n. Assuming the existence of a 2-party privacy-preserving KS protocol \(\pi _0=(\mathsf {Init}_0,\mathsf {Query}_0)\), there exists (constructively) a 2-party privacy-preserving RQ protocol \(\pi _2=(\mathsf {Init}_2,\mathsf {Query}_2)\) for such a database, satisfying:
  1. correctness;

  2. privacy against C (i.e., it only leaks the matching records to C);

  3. privacy against S (i.e., it only leaks the number of matching records to S);

  4. communication complexity of \(\mathsf {Query}_2\) on a queried range qr is O(m(qr)) times the communication complexity of \(\mathsf {Query}_0\);

  5. the S-time complexity in \(\mathsf {Query}_2\) on a queried range qr is O(m(qr)) times the S-time complexity in \(\mathsf {Query}_0\) plus O(dn).

     

We prove Theorem 2 by describing RQ protocol \(\pi _2\) and its properties.

The RQ Protocol \(\pi _2\): Basic Definitions. Let \(Dom\) be a value domain with a total order \(\le \) defined on it. We say that \(Dom\) is an exponential-size domain if it holds that \(|Dom|\le 2^d\le 2^{p(n)}\), for some polynomial p. For simplicity, we restrict to the case where \(Dom\) is the d-dimensional hypercube, i.e., \(Dom=[0,2^d-1]\), but note that our results can be extended to any exponential-size domain.

We define the set \(cI(Dom)\) of canonical intervals for \(Dom\) by the following recursion: first, add \(Dom\) into \(cI(Dom)\); then, split \(Dom\) into \(Dom_0\), containing the first half of its elements, and \(Dom_1\), containing the second half; then, for \(i=0,1\), generate \(cI(Dom_i)\), the set of canonical intervals for \(Dom_i\); finally, add \(cI(Dom_0),cI(Dom_1)\) to \(cI(Dom)\). The recursion stops when \(Dom\) contains a single element.

An interval \([a,b]\subseteq Dom\) is a border interval in \(Dom\) if there exists an interval \(I\in cI(Dom)\) such that either a is the first element in I or b is the last element in I. The following fact directly follows from the above definitions of border and canonical intervals.

Fact 2

For every interval \([a,b]\subseteq Dom\), either \([a,b]\) is a border interval in \(Dom\), or there exists c such that \([a,c]\) and \([c+1,b]\) are border intervals in \(Dom\).

For any interval \([a,b]\subseteq Dom\), intervals \([a,c_1], [c_1+1,c_2], \ldots , [c_t+1,b]\) are said to cover \([a,b]\). We note that a border interval is covered by at most \(d-1\) canonical intervals. This, together with Fact 2, implies the following.

Fact 3

For every interval \([a,b]\subseteq Dom\), there exists a set of \(\le 2(d-1)\) canonical intervals covering \([a,b]\).

We note that results similar to Fact 3 have already been studied in other papers (see, e.g., [19]), but we could not find range query protocols based on them with provable privacy properties.
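The definitions above can be illustrated with a short Python sketch. It is not taken from the paper: the function names `canonical_intervals` and `canonical_cover` are hypothetical, intervals are represented as pairs `(lo, hi)`, and the cover is computed with the usual recursive halving, which is assumed here to be one valid way of obtaining the cover promised by Fact 3.

```python
def canonical_intervals(lo, hi):
    """cI([lo, hi]) following the recursion in the text.

    Assumes hi - lo + 1 is a power of two, so both halves have equal size.
    """
    if lo == hi:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return [(lo, hi)] + canonical_intervals(lo, mid) + canonical_intervals(mid + 1, hi)


def canonical_cover(a, b, lo, hi):
    """Canonical intervals of [lo, hi] that tile [a, b] from left to right.

    Returns an empty list when a > b (nothing to cover).
    """
    if a > b:
        return []
    if a == lo and b == hi:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    left = canonical_cover(a, min(b, mid), lo, mid)
    right = canonical_cover(max(a, mid + 1), b, mid + 1, hi)
    return left + right


# Example with d = 3, i.e. Dom = [0, 7].
print(canonical_cover(1, 6, 0, 7))  # [(1, 1), (2, 3), (4, 5), (6, 6)]
```

In this example the interval [1,6] needs 4 = 2(d-1) canonical intervals, so the bound of Fact 3 is met with equality. For d = 1 the expression 2(d-1) equals 0, so the sketch is only meaningful for d at least 2.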

The RQ Protocol \(\pi _2\). We would like to construct \(\pi _2=(\mathsf {Init}_2,\mathsf {Query}_2)\) as an extension of \(\pi _1=(\mathsf {Init}_1,\mathsf {Query}_1)\), based on the above notions of canonical intervals and interval covering.

At initialization S again splits database D into two databases: a rank database rD and a payload database pD. This time, however, since we consider exponential-size value domains (as opposed to the linear-size value domains used for \(\pi _1\)), S cannot store at initialization the lower rank and the upper rank of each domain value in rD. Thus, instead of storing all domain elements in rD, we store all attribute values \(u_1,\dots ,u_n\) in D and, for each interval \([u_{i-1}+1,u_i-1]\), we consider the set of canonical intervals covering it, as guaranteed by Fact 3, and store each one of these intervals in rD. Note that in each of these latter intervals, each domain value has the same lower and upper ranks, so we only need to store a single copy of these two values in rD as well. Thus, in \(rD=(rD_1,rD_2)\), the column \(rD_1\) contains the following:
  1. the attribute values \(u_1,\dots ,u_n\) in D;
  2. each one of the canonical intervals covering every interval \([u_{i-1}+1,u_i-1]\), where \(u_1,\dots ,u_n\) are the attribute values in D.

After this modification, the remaining computations in \(\mathsf {Init}_2\), including the computation of the lower/upper ranks, continue as in \(\pi _1\). Because of Fact 3, this modified initialization increases the size of rD by a multiplicative factor of at most \(2(d-1)\).
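As a rough illustration of how the keyword column \(rD_1\) could be assembled, the following Python sketch lists the attribute values and the canonical intervals tiling each gap between consecutive values. Several points are assumptions rather than statements from the paper: the names `build_rd1` and `canonical_cover`, the tagged-tuple encoding of keywords, the removal of repeated attribute values before computing gaps, and the inclusion of the gaps before the smallest and after the largest attribute value. The rank column \(rD_2\) is omitted.

```python
def canonical_cover(a, b):
    """Tile [a, b] with maximal aligned power-of-two blocks (empty if a > b).

    Inside Dom = [0, 2**d - 1] these blocks are canonical intervals.
    """
    cover = []
    while a <= b:
        size = 1
        # Double the block while it stays aligned at a and inside [a, b].
        while a % (2 * size) == 0 and a + 2 * size - 1 <= b:
            size *= 2
        cover.append((a, a + size - 1))
        a += size
    return cover


def build_rd1(attribute_values, d):
    """Hypothetical keyword column rD_1 for Dom = [0, 2**d - 1]."""
    top = 2 ** d - 1
    us = sorted(set(attribute_values))  # assumption: duplicates collapse to one keyword
    keys = [("value", u) for u in us]
    # Sentinels -1 and top + 1 add the leading and trailing gaps (an assumption).
    bounds = [-1] + us + [top + 1]
    for prev, nxt in zip(bounds, bounds[1:]):
        for lo, hi in canonical_cover(prev + 1, nxt - 1):
            keys.append(("interval", lo, hi))
    return keys


# d = 3 and attribute values 2 and 5 (5 occurring twice):
# the gaps are [0,1], [3,4] and [6,7].
for key in build_rd1([5, 2, 5], 3):
    print(key)
```

Each gap contributes at most \(2(d-1)\) interval keywords for d at least 2, which matches the size increase stated above.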

At query time, denoting as \([v_0,v_1]\) the range queried by C, the computation of the shifted lower/upper ranks continues as in \(\mathsf {Query}_1\). However, C not only asks S for the lower rank of \(v_0\) and the upper rank of \(v_1\) by 2 KS queries as in \(\pi _1\), but also makes KS queries on all canonical intervals that contain \(v_0\) and all canonical intervals that contain \(v_1\). Here, note that if \(v_0\) (resp., \(v_1\)) is different from all attribute values, then exactly one of the canonical intervals containing \(v_0\) (resp., \(v_1\)) was included in rD during initialization. Thus, only one of the KS queries associated with \(v_0\) and only one of the KS queries associated with \(v_1\) will be successfully completed, returning to C ranks for either an attribute value \(u_i\) or a canonical interval containing the query range value. From now on, protocol \(\mathsf {Query}_2\) continues exactly as \(\mathsf {Query}_1\). That is, C can use the obtained ranks to generate the m(qr) keyword queries to database pD, and obtain m(qr) matching records.
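The claim that exactly one rank lookup succeeds per endpoint can be checked on a toy instance. The Python sketch below is only an illustration: the helper names are hypothetical, plain set membership stands in for the private KS queries, and the stored keywords are hard-coded for d = 3 with attribute values 2 and 5, assuming that the gaps before the first and after the last attribute value are also covered at initialization.

```python
def intervals_containing(v, d):
    """The d + 1 canonical intervals of [0, 2**d - 1] containing v, smallest first."""
    result = []
    for k in range(d + 1):
        lo = (v >> k) << k  # clear the k low-order bits of v
        result.append((lo, lo + (1 << k) - 1))
    return result


# Hypothetical contents of rD_1 for d = 3 and attribute values {2, 5}:
# the gaps [0,1], [3,4], [6,7] are tiled by the intervals below.
STORED_VALUES = {2, 5}
STORED_INTERVALS = {(0, 1), (3, 3), (4, 4), (6, 7)}


def successful_lookups(v, d):
    """Keywords for endpoint v that would match an entry of rD_1."""
    hits = []
    if v in STORED_VALUES:
        hits.append(("value", v))
    for lo, hi in intervals_containing(v, d):
        if (lo, hi) in STORED_INTERVALS:
            hits.append(("interval", lo, hi))
    return hits


print(intervals_containing(4, 3))  # [(4, 4), (4, 5), (4, 7), (0, 7)]
print(successful_lookups(4, 3))    # [('interval', 4, 4)]
```

In this toy instance every domain value from 0 to 7 yields exactly one successful lookup, either on the value itself or on a single stored canonical interval.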

Properties of \(\pi _2\). The proofs that \(\pi _2\) satisfies the correctness, privacy and efficiency properties defined in the 2-party model are obtained by extending the analogous proofs for \(\pi _1\), using the properties of the KS protocol \(\pi _0\). In particular, the correctness property of \(\pi _2\) is shown by additionally using Facts 2 and 3. The privacy and the communication complexity properties are not significantly affected by the modifications in \(\pi _2\) with respect to \(\pi _1\). The S-time complexity changes by observing that rD is larger in \(\pi _2\) by a multiplicative factor of d.

4 Range Queries in the Three-Party Model

We show a 3-party RQ protocol by extending the 2-party protocol in Sect. 3.1. Our protocol follows the general structure outlined in Fig. 2 and satisfies the following

Theorem 3

Consider a database with n records and domain \(Dom=[0,n-1]\). Assuming the existence of a pseudo-random function, there exists (constructively) a 3-party privacy-preserving RQ protocol \(\pi _3=(\mathsf {Init}_3,\mathsf {Query}_3)\) for such a database, satisfying:
  1. correctness;
  2. privacy against C (i.e., it only leaks the matching records to C);
  3. privacy against S (i.e., it does not leak anything to S);
  4. privacy against TP (i.e., it only leaks the number of matching records, the repetition of query values and the repeated access to initialization encrypted data structures);
  5. communication complexity of \(\mathsf {Query}_3\) on a queried range qr is O(m(qr));
  6. the TP-time complexity in \(\mathsf {Query}_3\) on a queried range qr is \(O(m(qr)\log n)\).

Remark: on Exponential-Size Domains. We stated Theorem 3 for linear-size value domains, and established it by transforming the 2-party protocol \(\pi _1\) into the 3-party model. By a very similar transformation, we can adapt the 2-party protocol \(\pi _2\) into the 3-party model, and obtain a similar result for exponential-size value domains.

Our RQ Protocol in the 3-Party Model: An Informal Description. Briefly speaking, our protocol \(\pi _3\) is obtained by performing the following two main modifications in the 3-party model to protocol \(\pi _1\) (which was designed in the 2-party model): (1) the KS protocol in the 2-party model is replaced by a KS protocol in the 3-party model [7], that was constructed starting from any pseudo-random function; and (2) the shifts performed to the entire databases \(rD_2\) and \(pD_1\) in protocol \(\pi _1\) are now replaced by a ‘lazy shifting’ technique, according to which shifts are performed only to database entries which are used in the protocol. We note that the first modification replaces the use of asymmetric cryptography protocols with only symmetric cryptography techniques, and the second modification eliminates linear-time computations from S during the query subprotocol.

In fact, we can use the following simplified version of the KS protocol in the 3-party model from [7], by assuming that each keyword query will have at most 1 matching record (which was shown to be the case in \(\pi _1\)). First of all, S encrypts both the attribute column and the payload column in its database, where the attribute column is encrypted using deterministic encryption, via a pseudo-random permutation (which can be built from any pseudo-random function). As a result, the encrypted attribute column is searchable by TP using a conventional search data structure (e.g., a binary search tree). Later, S sends the encrypted database to TP and C sends its query values encrypted using the same pseudo-random permutation used by S (with key unknown to TP). Finally, TP can search for such a value in the search data structure over the encrypted attribute values and return the matching record to C.
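The message flow of this simplified KS protocol can be sketched in Python as follows. This is an illustrative sketch and not the construction of [7]: HMAC-SHA256 is used as a stand-in for the pseudo-random permutation (it is deterministic, which is what searching needs, but it is a pseudo-random function rather than a permutation), the payload encryption is a toy nonce-plus-keystream scheme standing in for probabilistic encryption, a sorted list with binary search replaces the search tree, and all function names are hypothetical. It also assumes that C holds the payload key.

```python
import bisect
import hashlib
import hmac
import secrets


def token(key, keyword):
    """Deterministic search token for a keyword (stand-in for f_k)."""
    return hmac.new(key, keyword.encode(), hashlib.sha256).digest()


def _keystream(key, nonce, length):
    """PRF-based keystream of the requested length (toy construction)."""
    blocks = []
    for counter in range((length + 31) // 32):
        msg = nonce + counter.to_bytes(4, "big")
        blocks.append(hmac.new(key, msg, hashlib.sha256).digest())
    return b"".join(blocks)[:length]


def seal(key, payload):
    """Toy randomized encryption: fresh nonce, then XOR with the keystream."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(payload))
    return nonce + bytes(p ^ s for p, s in zip(payload, stream))


def unseal(key, blob):
    """Inverse of seal for the holder of the payload key."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def server_init(k_attr, k_pay, records):
    """S: encrypt both columns; sorting by token lets TP binary-search."""
    return sorted((token(k_attr, kw), seal(k_pay, p)) for kw, p in records)


def tp_search(table, query_token):
    """TP: look up a token; at most one matching record is assumed."""
    i = bisect.bisect_left(table, (query_token,))
    if i < len(table) and table[i][0] == query_token:
        return table[i][1]
    return None


# S and C share the keys; TP only ever sees tokens and sealed payloads.
k_attr, k_pay = secrets.token_bytes(32), secrets.token_bytes(32)
table = server_init(k_attr, k_pay, [("alice", b"rec-1"), ("bob", b"rec-2")])
hit = tp_search(table, token(k_attr, "bob"))
print(unseal(k_pay, hit))  # b'rec-2'
```

Because tokens are deterministic, TP can tell when the same keyword is queried twice, which is consistent with the repetition of query values listed as leakage to TP in Theorem 3.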

Given the above 3-party KS protocol, our 3-party RQ protocol \(\pi _3\) works as follows. The following high-level structure of \(\pi _1\) remains in \(\pi _3\): specifically, S constructs a rank database rD and a payload database pD, and C will perform keyword queries first based on rD and later based on pD. In \(\pi _3\), however, S sends encrypted versions of rD and pD to TP, and from then on, C only performs keyword queries to TP. Specifically, while the payload columns of rD and pD are encrypted using conventional probabilistic encryption, the attribute columns of rD and pD are encrypted using deterministic encryption, based on a pseudo-random permutation, which makes attribute column values searchable by TP.

To encrypt an attribute value \(v\in Dom\), S randomly chooses \(v_0\) and an initial shift is such that \(v_0+is=v\text{ mod } n\), and returns ciphertext \((f_k(v_0),is)\), where f is the pseudo-random permutation, and k is a key known to C and S but not to TP. An interesting property of such ciphertexts is that TP can compute a ‘lazy shift’ of v over its encryption by any random next shift ns, by returning \((f_k(v_0),cs)\), where the current shift is \(cs=is+ns\text{ mod } n\). Such ciphertexts will be used by S to encrypt lower and upper ranks in rD before sending them to TP. Then, after C’s keyword query to (the encrypted version of) database rD held by TP, such encrypted ranks will not be directly returned to C (or otherwise this may leak some information to C across multiple queries). Instead, TP and C will run a 2-party secure function evaluation (using Yao’s protocol [25]), where C provides key k as input, TP provides the encrypted ranks and the current shift cs as input, and the output returned to TP will be the encrypted queries for (the encrypted version of) database pD. Then, by using the current shift as input to the secure function evaluation protocol, TP obtains encrypted keyword queries, each of them being used to search across the first ciphertext component of all encrypted attribute values in pD, exactly as done in the above 3-party KS protocol.
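The arithmetic behind the ‘lazy shift’ ciphertexts can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the pseudo-random permutation over \([0,n-1]\) is simulated by ranking the domain elements by their HMAC-SHA256 tags, which is only practical for small n, the function names are hypothetical, and the decryption step is shown in the clear whereas the protocol would perform it inside the secure function evaluation.

```python
import hashlib
import hmac
import secrets


def toy_prp(key, n):
    """Toy keyed permutation of range(n): order elements by their HMAC tag."""
    def tag(x):
        return hmac.new(key, x.to_bytes(8, "big"), hashlib.sha256).digest()

    order = sorted(range(n), key=tag)
    forward = {x: y for y, x in enumerate(order)}
    inverse = {y: x for x, y in forward.items()}
    return forward, inverse


def encrypt(forward, n, v):
    """S: pick a random initial shift `is_` and v0 with v0 + is_ = v mod n."""
    is_ = secrets.randbelow(n)
    v0 = (v - is_) % n
    return forward[v0], is_


def lazy_shift(ciphertext, n, ns):
    """TP: shift the hidden value by ns without knowing the key."""
    c, cs = ciphertext
    return c, (cs + ns) % n


def decrypt(inverse, n, ciphertext):
    """Key holder: recover v0 + cs mod n (done inside the SFE in the protocol)."""
    c, cs = ciphertext
    return (inverse[c] + cs) % n


n = 100
forward, inverse = toy_prp(secrets.token_bytes(32), n)
ct = encrypt(forward, n, 42)
shifted = lazy_shift(ct, n, 70)
print(decrypt(inverse, n, shifted))  # 12, i.e. (42 + 70) mod 100
```

The shift only touches the second component, so TP never needs the key k and the first component stays usable as a search token.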

Practical Performance of Our 3-Party Protocol. In our implementation, the S and TP processes and an instance of MySQL server version 5.5.28 were running on a Dell PowerEdge R710 server with two Intel Xeon X5650 2.66 GHz processors, 48 GB of memory, 64-bit Ubuntu 12.04.1 operating system, and connected to a Dell PowerVault MD1200 disk array with twelve 2 TB 7.2K RPM SAS drives in RAID6 configuration. The C process was running on a Dell PowerEdge R810 server with two Intel Xeon E7-4870 2.40 GHz processors, 64 GB of memory, 64-bit Red Hat Enterprise Linux Server release 6.3 operating system, and connected to the Dell PowerEdge R710 server via switched Gigabit Ethernet.

The 3-party protocol that we implemented was somewhat different from those discussed in this paper, because it was developed under more complex and specific project requirements. However, by protocol analysis, we have noted that these differences are not expected to significantly affect the practical performance of the protocols. Accordingly, we briefly report on the performance of our implemented protocols, as a useful indication of the performance of the protocols described here.

In our implementation, we have noted practical efficiency and scalability of our 3-party protocols, and were able to achieve query latency of no more than one order of magnitude slower than a comparable non-private protocol for the same task (specifically, a MySQL protocol for range queries over same-size value domains and database size). This result was achieved, with minor differences, over both linear-size and exponential-size value domains. A similar performance result was presented in [7] in the same implementation environment for keyword queries. In achieving such a result for range queries, our approach of constructing range query protocols from keyword query protocols was critical. This is especially the case when considering that the dominating performance factor in all our range query protocols is given by the performance of one keyword query for each of the records matching the range query. Performance numbers (where time is measured in milliseconds and communication in bytes) for range queries matching \(1\,\%\) of the database records are captured in Figs. 3 and 4.

Fig. 3. Time and communication performance for different database sizes.

Fig. 4. Time performance as a function of database size.

The most challenging aspect in our performance analysis was the scalability of the initialization procedure, where we observed the following results: the initialization of the 3-party protocol for linear-size value domains, based on a transformation of the 2-party protocol \(\pi _1\), does achieve satisfactory scalability properties; however, the initialization phase of the 3-party protocol for exponential-size value domains, based on a transformation of the 2-party protocol \(\pi _2\), does not achieve satisfactory scalability properties, especially as the logarithm of the domain size grows. Although the initialization procedure is typically a one-time procedure, we still consider the following an interesting open problem: designing a 3-party privacy-preserving range query protocol that achieves scalable performance on both query latency and initialization.

Acknowledgments

Many thanks to Euthimios Panagos and Aditya Naidu for helping on performance evaluation. Most of this work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D13PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation hereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

  1. Agrawal, R., Kiernan, J., Srikant, R., Xu, Y.: Order-preserving encryption for numeric data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, pp. 563–574 (2004)
  2. Boldyreva, A., Chenette, N., O’Neill, A.: Order-preserving encryption revisited: improved security analysis and alternative solutions. In: Rogaway, P. (ed.) CRYPTO 2011. LNCS, vol. 6841, pp. 578–595. Springer, Heidelberg (2011)
  3. Boneh, D., Di Crescenzo, G., Ostrovsky, R., Persiano, G.: Public key encryption with keyword search. In: Cachin, C., Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 506–522. Springer, Heidelberg (2004)
  4. Boneh, D., Waters, B.: Conjunctive, subset, and range queries on encrypted data. In: Vadhan, S.P. (ed.) TCC 2007. LNCS, vol. 4392, pp. 535–554. Springer, Heidelberg (2007)
  5. Chor, B., Gilboa, N., Naor, M.: Private information retrieval by keywords. IACR Cryptology ePrint Archive (1998)
  6. Chor, B., Kushilevitz, E., Goldreich, O., Sudan, M.: Private information retrieval. J. ACM 45(6), 965–981 (1998)
  7. Di Crescenzo, G., Cook, D., McIntosh, A., Panagos, E.: Practical private information retrieval from a time-varying, multi-attribute, and multiple-occurrence database. In: Atluri, V., Pernul, G. (eds.) DBSec 2014. LNCS, vol. 8566, pp. 339–355. Springer, Heidelberg (2014)
  8. Di Crescenzo, G., Ishai, Y., Ostrovsky, R.: Universal service-providers for private information retrieval. J. Cryptology 14(1), 37–74 (2001)
  9. Di Crescenzo, G., Shallcross, D.: On minimizing the size of encrypted databases. In: Atluri, V., Pernul, G. (eds.) DBSec 2014. LNCS, vol. 8566, pp. 364–372. Springer, Heidelberg (2014)
  10. Freedman, M.J., Ishai, Y., Pinkas, B., Reingold, O.: Keyword search and oblivious pseudorandom functions. In: Kilian, J. (ed.) TCC 2005. LNCS, vol. 3378, pp. 303–324. Springer, Heidelberg (2005)
  11. Goldreich, O., Goldwasser, S., Micali, S.: How to construct random functions. J. ACM 33(4), 792–807 (1986)
  12. Goldreich, O., Ostrovsky, R.: Software protection and simulation on oblivious RAMs. J. ACM 43(3), 431–473 (1996)
  13. Hacigümüs, H., Iyer, B.R., Li, C., Mehrotra, S.: Executing SQL over encrypted data in the database-service-provider model. In: SIGMOD Conference, pp. 216–227 (2002)
  14. Hore, B., Mehrotra, S., Tsudik, G.: A privacy-preserving index for range queries. In: (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 – September 3, pp. 720–731 (2004)
  15. Islam, M.S., Kuzu, M., Kantarcioglu, M.: Access pattern disclosure on searchable encryption: ramification, attack and mitigation. In: NDSS (2012)
  16. Islam, M.S., Kuzu, M., Kantarcioglu, M.: Inference attack against encrypted range queries on outsourced databases. In: Fourth ACM Conference on Data and Application Security and Privacy, CODASPY’14, San Antonio, TX, USA, March 03–05, pp. 235–246 (2014)
  17. Jarecki, S., Jutla, C.S., Krawczyk, H., Rosu, M.-C., Steiner, M.: Outsourced symmetric private information retrieval. In: ACM Conference on Computer and Communications Security, pp. 875–888 (2013)
  18. Kushilevitz, E., Ostrovsky, R.: Replication is not needed: single database, computationally-private information retrieval. In: FOCS, pp. 364–373 (1997)
  19. Li, J., Omiecinski, E.R.: Efficiency and security trade-off in supporting range queries on encrypted databases. In: Jajodia, S., Wijesekera, D. (eds.) Data and Applications Security 2005. LNCS, vol. 3654, pp. 69–83. Springer, Heidelberg (2005)
  20. Ostrovsky, R., Skeith III, W.E.: A survey of single-database private information retrieval: techniques and applications. In: Okamoto, T., Wang, X. (eds.) PKC 2007. LNCS, vol. 4450, pp. 393–411. Springer, Heidelberg (2007)
  21. Samarati, P., De Capitani di Vimercati, S.: Data protection in outsourcing scenarios: issues and directions. In: Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security, ASIACCS 2010, Beijing, China, April 13–16, pp. 1–14 (2010)
  22. Shi, E., Bethencourt, J., Chan, H.T., Song, D.X., Perrig, A.: Multi-dimensional range query over encrypted data. In: 2007 IEEE Symposium on Security and Privacy (S&P 2007), 20–23 May 2007, Oakland, California, USA, pp. 350–364 (2007)
  23. Stefanov, E., van Dijk, M., Shi, E., Fletcher, C.W., Ren, L., Yu, X., Devadas, S.: Path ORAM: an extremely simple oblivious RAM protocol. In: 2013 ACM SIGSAC Conference on Computer and Communications Security, CCS 2013, Berlin, Germany, November 4–8, pp. 299–310 (2013)
  24. Wang, S., Ding, X., Deng, R.H., Bao, F.: Private information retrieval using trusted hardware. In: Gollmann, D., Meier, J., Sabelfeld, A. (eds.) ESORICS 2006. LNCS, vol. 4189, pp. 49–64. Springer, Heidelberg (2006)
  25. Yao, A.C.-C.: How to generate and exchange secrets (extended abstract). In: FOCS, pp. 162–167 (1986)

Copyright information

© IFIP International Federation for Information Processing 2015

Authors and Affiliations

  1. Applied Communication Sciences, Basking Ridge, USA
