## Abstract

Data-as-a-service (DaaS) is a cloud computing service that emerged as a viable option to businesses and individuals for outsourcing and sharing their collected data with other parties. Although the cloud computing paradigm provides great flexibility to consumers with respect to computation and storage capabilities, it imposes serious concerns about the confidentiality of the outsourced data as well as the privacy of the individuals referenced in the data. In this paper we formulate and address the problem of querying encrypted data in a cloud environment such that query processing is confidential and the result is differentially private. We propose a framework where the data provider uploads an encrypted index of her anonymized data to a DaaS service provider that is responsible for answering range count queries from authorized data miners for the purpose of data mining. To satisfy the confidentiality requirement, we leverage attribute-based encryption to construct a secure *k*d-tree index over the differentially private data for fast access. We also utilize the exponential variant of the ElGamal cryptosystem to efficiently perform homomorphic operations on encrypted data. Experiments on real-life data demonstrate that our proposed framework preserves data utility, can efficiently answer range queries, and is scalable with increasing data size.

## Notes

- 1.
PopData: https://www.popdata.bc.ca/.

- 2.
Statistical Data Integration Involving Commonwealth Data: http://statistical-data-integration.govspace.gov.au/.

## References

- 1.
Agrawal R, Kiernan J, Srikant R, Xu Y (2004) Order preserving encryption for numeric data. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD), pp 563–574

- 2.
Bache K, Lichman M (2013) UCI machine learning repository. School of Information and Computer Sciences, University of California, Irvine

- 3.
Barbaro M, Zeller TJ (2006) A face is exposed for AOL searcher no. 4417749

- 4.
Barouti S, Aljumah F, Alhadidi D, Debbabi M (2014) Secure and privacy-preserving querying of personal health records in the cloud. In: Data and applications security and privacy XXVIII (LNCS), vol 8566, pp 82–97

- 5.
Bayer R, McCreight E (1970) Organization and maintenance of large ordered indices. In: Proceedings of the ACM SIGFIDET workshop on data description, access and control, pp 107–141

- 6.
Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517

- 7.
Bethencourt J, Sahai A, Waters B (2007) Ciphertext-policy attribute-based encryption. In: Proceedings of the IEEE symposium on security and privacy. IEEE Computer Society, Washington, DC, pp 321–334

- 8.
Blum A, Dwork C, McSherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS). ACM, pp 128–138

- 9.
Boneh D, Boyen X, Shacham H (2004) Short group signatures. In: Advances in cryptology—CRYPTO 2004. Volume 3152 of lecture notes in computer science, pp 41–55

- 10.
Boneh D, Franklin M (2003) Identity-based encryption from the Weil pairing. SIAM J Comput 32(3):586–615

- 11.
Boneh D, Lynn B, Shacham H (2001) Short signatures from the Weil pairing. In: Proceedings of the 7th international conference on the theory and application of cryptology and information security: advances in cryptology (ASIACRYPT). Springer, pp 514–532

- 12.
Boneh D, Sahai A, Waters B (2011) Functional encryption: definitions and challenges. In: Proceedings of TCC, pp 253–273

- 13.
Boneh D, Waters B (2007) Conjunctive, subset, and range queries on encrypted data. In: Proceedings of the 4th conference on theory of cryptography (TCC), pp 535–554

- 14.
Bösch C, Hartel P, Jonker W, Peter A (2014) A survey of provably secure searchable encryption. ACM Comput Surv 47(2):18:1–18:51

- 15.
Bösch C, Tang Q, Hartel P, Jonker W (2012) Selective document retrieval from encrypted database. In: Proceedings of ISC, pp 224–241

- 16.
Chen R, Xiao Q, Zhang Y, Xu J (2015) Differentially private high-dimensional data publication via sampling-based inference. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’15, pp 129–138

- 17.
Comer D (1979) Ubiquitous B-tree. ACM Comput Surv 11(2):121–137

- 18.
Cormode G, Procopiuc C, Srivastava D, Shen E, Yu T (2012) Differentially private spatial decompositions. In: Proceedings of the IEEE 28th international conference on data engineering (ICDE). IEEE Computer Society, pp 20–31

- 19.
Cramer R, Gennaro R, Schoenmakers B (1997) A secure and optimally efficient multi-authority election scheme. In: Proceedings of the 16th annual international conference on theory and application of cryptographic techniques (EUROCRYPT). Springer, pp 103–118

- 20.
Dagher GG, Mohler J, Milojkovic M, Marella PB (2018) Ancile: privacy-preserving framework for access control and interoperability of electronic health records using blockchain technology. Sustain Cities Soc (SCS) 39:283–297

- 21.
Damiani E, Vimercati SDC, Jajodia S, Paraboschi S, Samarati P (2003) Balancing confidentiality and efficiency in untrusted relational DBMSs. In: Proceedings of the 10th ACM conference on computer and communications security (CCS). ACM, pp 93–102

- 22.
de Berg M, Cheong O, van Kreveld M, Overmars M (2008) Computational geometry: algorithms and applications, 3rd edn. Springer, Berlin

- 23.
Dillon T, Wu C, Chang E (2010) Cloud computing: issues and challenges. In: Proceedings of the 24th IEEE conference on advanced information networking and applications (AINA), pp 27–33

- 24.
Dong C, Russello G, Dulay N (2008) Shared and searchable encrypted data for untrusted servers. In: Proceedings of DBSec, pp 127–143

- 25.
Dwork C (2006) Differential privacy. In: Proceedings of the international colloquium on automata, languages, and programming (ICALP), pp 1–12

- 26.
Emekci F, Agrawal D, Abbadi A, Gulbeden A (2006) Privacy preserving query processing using third parties. In: Proceedings of the 22nd international conference on data engineering (ICDE), pp 27–36

- 27.
Friedman A, Schuster A (2010) Data mining with differential privacy. In: Proceedings of the 16th ACM SIGKDD. KDD ’10, pp 493–502

- 28.
Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey of recent developments. ACM Comput Surv 42(4):14:1–14:53

- 29.
Fung BCM, Wang K, Yu PS (2007) Anonymizing classification data for privacy preservation. IEEE Trans Knowl Data Eng (TKDE) 19(5):711–725

- 30.
Ge T, Zdonik S (2007) Answering aggregation queries in a secure system model. In: Proceedings of the 33rd international conference on very large data bases (PVLDB). VLDB Endowment, pp 519–530

- 31.
Ge T, Zdonik S (2007) Fast, secure encryption for indexing in a column-oriented DBMS. In: Proceedings of the IEEE 23rd international conference on data engineering (ICDE), pp 676–685

- 32.
Gentry C (2009) Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st annual ACM symposium on theory of computing (STOC). ACM, pp 169–178

- 33.
Giannotti F, Lakshmanan L, Monreale A, Pedreschi D, Wang H (2013) Privacy-preserving mining of association rules from outsourced transaction databases. IEEE Syst J (ISJ) 7(3):385–395

- 34.
Goldreich O (2004) Foundations of cryptography, vol 2. Cambridge University Press, Cambridge

- 35.
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD). ACM, pp 47–57

- 36.
Hacigümüş H, Iyer B, Li C, Mehrotra S (2002) Executing SQL over encrypted data in the database-service-provider model. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD). ACM, pp 216–227

- 37.
Hacigümüş H, Iyer B, Mehrotra S (2004) Efficient execution of aggregation queries over encrypted relational databases. In: Proceedings of the database systems for advanced applications (DASFAA), pp 125–136

- 38.
Hore B, Mehrotra S, Tsudik G (2004) A privacy-preserving index for range queries. In: Proceedings of the 13th international conference on very large data bases (PVLDB). VLDB Endowment, pp 720–731

- 39.
Hu H, Xu J, Ren C, Choi B (2011) Processing private queries over untrusted data cloud through privacy homomorphism. In: Proceedings of the IEEE 27th international conference on data engineering (ICDE), pp 601–612

- 40.
Jarecki S, Jutla C, Krawczyk H, Rosu M, Steiner M (2013) Outsourced symmetric private information retrieval. In: Proceedings of the ACM SIGSAC conference on computer & communications security (CCS), pp 875–888

- 41.
Joux A (2000) A one round protocol for tripartite Diffie–Hellman. In: Proceedings of the 4th international symposium on algorithmic number theory (ANTS), pp 385–394

- 42.
Kamara S, Mohassel P, Raykova M (2011) Outsourcing multi-party computation. IACR Cryptology ePrint Archive 2011:272

- 43.
Kifer D, Lin B-R (2010) Towards an axiomatization of statistical privacy and utility. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS). ACM, pp 147–158

- 44.
Lai J, Deng RH, Li Y (2011) Fully secure cipertext-policy hiding CP-ABE. In: Proceedings of the 7th international conference on information security practice and experience (ISPEC). Springer, Berlin, pp 24–39

- 45.
McSherry FD (2009) Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the 2009 ACM SIGMOD international conference on management of data, pp 19–30

- 46.
Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 493–501

- 47.
Narayanan A, Shmatikov V (2008) Robust de-anonymization of large sparse datasets. In: Proceedings of the IEEE symposium on security and privacy (SP), pp 111–125

- 48.
Popa RA, Redfield CMS, Zeldovich N, Balakrishnan H (2011) CryptDB: protecting confidentiality with encrypted query processing. In: Proceedings of the 23rd ACM symposium on operating systems principles (SOSP). ACM, pp 85–100

- 49.
Salzberg S (1994) C4.5: programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach Learn 16(3):235–240

- 50.
Shabtai A, Elovici Y, Rokach L (2012) A survey of data leakage detection and prevention solutions. SpringerBriefs in computer science. Springer, Berlin

- 51.
Shmueli E, Tassa T, Wasserstein R, Shapira B, Rokach L (2012) Limiting disclosure of sensitive data in sequential releases of databases. Inf Sci 191:98–127

- 52.
Song DX, Wagner D, Perrig A (2000) Practical techniques for searches on encrypted data. In: Proceedings of the 2000 IEEE symposium on security and privacy (S&P)

- 53.
Tysowski P, Hasan M (2013) Hybrid attribute- and re-encryption-based key management for secure and scalable mobile applications in clouds. IEEE Trans Cloud Comput (TCC) 1(2):172–186

- 54.
Wang C, Cao N, Li J, Ren K, Lou W (2010) Secure ranked keyword search over encrypted cloud data. In: Proceedings of the IEEE 30th international conference on distributed computing systems (ICDCS). IEEE Computer Society, pp 253–262

- 55.
Wang H, Lakshmanan LVS (2006) Efficient secure query evaluation over encrypted XML databases. In: Proceedings of the 32nd international conference on very large data bases (PVLDB). VLDB Endowment, pp 127–138

- 56.
Wang P, Ravishankar C (2013) Secure and efficient range queries on outsourced databases using \(\widehat{R}\)-trees. In: Proceedings of the IEEE 29th international conference on data engineering (ICDE), pp 314–325

- 57.
Wang S, Agrawal D, El Abbadi A (2011) A comprehensive framework for secure query processing on relational data in the cloud. In: Proceedings of the 8th VLDB international conference on secure data management (SDM). Springer, pp 52–69

- 58.
Wang Y (2014) Privacy-preserving data storage in cloud using array BP-XOR codes. IEEE Trans Cloud Comput 3(4):425–435

- 59.
Wang Z-F, Dai J, Wang W, Shi B-L (2004) Fast query over encrypted character data in database. In: Proceedings of the 1st international conference on computational and information science (CIS). Springer, pp 1027–1033

- 60.
Wang Z-F, Wang W, Shi B-L (2005) Storage and query over encrypted character and numerical data in database. In: Proceedings of the 5th international conference on computer and information technology (CIT). IEEE Computer Society, pp 77–81

- 61.
Williams P, Sion R, Carbunar B (2008) Building castles out of mud: Practical access pattern privacy and correctness on untrusted storage. In: Proceedings of the 15th ACM conference on computer and communications security (CCS), pp 139–148

- 62.
Wong RC-W, Li J, Fu AW-C, Wang K (2006) (\(\alpha \), k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 754–759

- 63.
Wong WK, Cheung DW-l, Kao B, Mamoulis N (2009) Secure KNN computation on encrypted databases. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD). ACM, pp 139–152

- 64.
Xiao Y, Xiong L, Yuan C (2010) Differentially private data release through multidimensional partitioning. In: Proceedings of the 7th VLDB conference on secure data management (SDM). Springer, pp 150–168

- 65.
Yi X, Paulet R, Bertino E, Xu G (2016) Private cell retrieval from data warehouses. IEEE Trans Inf Forensics Secur 11(6):1346–1361

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### A. Security analysis

The proposed framework is sound since all adversaries are non-colluding and semi-honest, according to our adversarial model. In the rest of this section, we focus on proving that the protocol is confidentiality-preserving. We also illustrate the accessibility of the keys in the framework, and show that all keys are properly distributed between the parties.

**Privacy by simulation** Goldreich [34] defines the security of a protocol in the semi-honest adversarial model as follows.

### Definition 7

(*Privacy w.r.t. semi-honest behavior*) [34]. Let \(f: (\{0,1\}^*)^m \mapsto (\{0,1\}^*)^m\) be an m-ary deterministic polynomial-time functionality, where \(f_i(x_1,\ldots ,x_m)\) is the *i*th element of \(f(x_1,\ldots ,x_m)\). Let \(\Pi \) be an m-party protocol for computing *f*. The view of the *i*th party during an execution of \(\Pi \) over \(x = (x_1,\ldots ,x_m)\) is \(\textsf {view} _i^\Pi (x) = (x_i, r_i , m_{i,1}, \ldots , m_{i,t})\), where \(r_i\) equals the contents of the *i*th party’s internal random tape, and \(m_{i,j}\) represents the *j*th message that it received. For \(I = \{i_1,\ldots ,i_l\} \subseteq \{1,\ldots ,m\}\), \(\textsf {view} _I^\Pi (x) = (I,\textsf {view} _{i_1}^\Pi (x),\ldots ,\textsf {view} _{i_l}^\Pi (x))\). We say that \(\Pi \) securely computes *f* in the presence of static semi-honest adversaries if there exists a probabilistic polynomial-time algorithm (simulator) *S* such that for every \(I \subseteq \{1,\ldots ,m\}\):

\[\{S(I, (x_{i_1},\ldots ,x_{i_l}), f_I(x))\}_{x \in (\{0,1\}^*)^m} \overset{c}{\equiv } \{\textsf {view} _I^\Pi (x)\}_{x \in (\{0,1\}^*)^m}\]

where \(f_I(x) = (f_{i_1}(x),\ldots ,f_{i_l}(x))\) and \(\overset{c}{\equiv }\) denotes computational indistinguishability.\(\square \)

According to Definition 7, it is sufficient to show that we can effectively simulate the view of each party during the execution of the \(\textsf {SecDM} \) protocol given the input, output and *acceptable* leaked information of that party, in order to prove that our protocol is secure. We achieve that by simulating each message *received* by a party in each algorithm. If we can simulate the input messages of each party in the protocol based only on its input and output, and the party is not able to recognize that it is dealing with a simulator, that means the protocol does not leak anything to that party since it would have been able to compute its output from its input without the need to be involved in the protocol in the first place.

First, we define the concepts *query distribution* and *query processing threshold*.

### Definition 8

(*Query distribution*) The distribution of the data mining queries, denoted by *U*, is the set of all possible queries, where each query consists of \(k_c + 2 \times k_n\) integers, each of which maps to a value in the domain of a categorical or numerical attribute.
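As a rough illustration of Definition 8, the sketch below flattens a query into \(k_c + 2 \times k_n\) integers: one identifier per categorical attribute and a (min, max) pair of range identifiers per numerical attribute. The function name `encode_query`, its parameters, and the convention that identifiers start at 1 (with 1 denoting the taxonomy root) are assumptions made for this example, not the paper's actual encoding routine.

```python
from typing import List, Optional, Tuple


def encode_query(
    categorical: List[Optional[int]],
    numerical: List[Optional[Tuple[int, int]]],
    max_range_id: List[int],
) -> List[int]:
    """Flatten a query into k_c + 2*k_n integers (hypothetical encoding).

    categorical[i]  : taxonomy node identifier, or None if unconstrained
    numerical[j]    : (min_id, max_id) range identifiers, or None if unconstrained
    max_range_id[j] : highest range identifier of numerical attribute j
    """
    if len(numerical) != len(max_range_id):
        raise ValueError("one max_range_id is required per numerical attribute")

    encoded: List[int] = []
    # Unconstrained categorical attributes map to the taxonomy root (identifier 1).
    for node_id in categorical:
        encoded.append(1 if node_id is None else node_id)
    # Unconstrained numerical attributes map to the full range [1, highest id].
    for bounds, top in zip(numerical, max_range_id):
        lo, hi = (1, top) if bounds is None else bounds
        if not 1 <= lo <= hi <= top:
            raise ValueError("range identifiers out of bounds")
        encoded.extend([lo, hi])
    return encoded
```

With two categorical attributes and one numerical attribute, `encode_query([3, None], [(2, 4)], [5])` yields four integers, matching \(k_c + 2 \times k_n = 2 + 2 \times 1\).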

### Definition 9

(*Query processing threshold*) Query processing threshold, denoted by \(\alpha \), is the maximum number of queries allowed to be processed on a *k*d-tree before the latter is replaced by a new shuffled and re-encrypted *k*d-tree submitted by the data provider to the service provider.
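The threshold in Definition 9 can be pictured as a simple counter kept alongside the hosted index. The following is a minimal sketch under that reading; the class `IndexSlot` and its method names are hypothetical and only model the bookkeeping, not the shuffling or re-encryption performed by the data provider.

```python
class IndexSlot:
    """Tracks how many queries were processed on the current kd-tree index."""

    def __init__(self, alpha: int) -> None:
        if alpha < 1:
            raise ValueError("alpha must be at least 1")
        self.alpha = alpha      # query processing threshold
        self.processed = 0      # queries answered on the current index
        self.generation = 0     # number of index replacements so far

    def needs_replacement(self) -> bool:
        # The index must be replaced once alpha queries have been processed.
        return self.processed >= self.alpha

    def record_query(self) -> None:
        if self.needs_replacement():
            raise RuntimeError("threshold reached: a new index is required")
        self.processed += 1

    def install_new_index(self) -> None:
        # Models the data provider submitting a shuffled, re-encrypted index.
        self.generation += 1
        self.processed = 0
```

Setting `alpha=1` in this sketch forces a replacement after every query, which corresponds to the most conservative setting discussed for the search pattern.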

### Definition 10

(*Privacy-preserving data outsourcing framework*) Let \(\mathcal {F}\) be a framework that enables a service provider (cloud) to answer queries from data miners on hosted (outsourced) data. \(\mathcal {F}\) is a privacy-preserving framework if the following properties hold:

- 1.
*Correctness* For any user query \(u \in U\), the cloud returns \(res_u\) to the data miner such that \(res_u\) is the correct answer to *u*.

- 2.
*Data confidentiality* A semi-honest adversary \(\mathcal {E}\), statically corrupting the service provider, cannot learn anything more about the hosted data from an accepted transcript of \(\mathcal {F}\) than she could given only the total number of numerical and categorical attributes, and the size of each attribute’s domain.

- 3.
*Query confidentiality* A semi-honest adversary \(\mathcal {E}\), statically corrupting the service provider, cannot learn anything about the query.

- 4.
*Differentially private output* For all \(u \in U\), \(res_u\) satisfies differential privacy.

### Definition 11

(\(\alpha \)-*privacy-preserving data outsourcing framework*) An outsourcing framework \(\mathcal {F}\) is \(\alpha \)-privacy-preserving if it satisfies all properties in Definition 10 except that the cloud learns the search pattern of at most \(\alpha \) queries.

### Theorem A.1

SecDM, as specified in Protocols 4.1–4.7, is an \(\alpha \)-privacy-preserving data outsourcing framework.

### Proof

We proved in Sect. 1 Property 1 (correctness) and Property 4 (differentially private output).

To prove Property 2 (data confidentiality) and Property 3 (query confidentiality), we build a simulator \(\mathcal {S}\) that generates a view that is statistically indistinguishable from the view of \(\mathcal {E}\) in real execution. As per Definition 7, the view of the service provider consists mainly of the messages it receives from the other parties. Although we have 8 algorithms, the service provider receives messages from the protocol only in Algorithm 3, Line 2 (encrypted index from data provider) and Algorithm 5, Line 4 (encrypted query from data miner). All other steps do not need to be simulated because they either do not involve the service provider at all (e.g., the steps in Algorithms 1, 2 and 4), or involve ciphertext operations (e.g., the steps in Algorithms 6 and 7), which are inherently secure given the security of the cryptosystems used (ABE and ElGamal).

**Discussion** The threshold parameter \(\alpha \) can range between 1 and \(\infty \). To better understand the impact of revealing \(\alpha \) queries to \(\mathcal {S}\), we analyze the security when \(\alpha = 1\) and \(\alpha > 1\).

**Case 1** \(\alpha = 1\). This represents the highest security level of our protocol, where one system query is executed per *k*d-tree. Since the *k*d-tree index is constructed by Algorithm 4.1 as a *balanced* tree and since each path contains all attributes, no correlation can be established between any two attributes, and the attributes are protected when evaluated for splitting the *k*-dimensional space. As for the data mining query, the service provider cannot determine what attributes are included in the query, nor know what values or ranges the data miner is interested in. Since Algorithm 4.6 yields how many leaf nodes (equivalence classes) are identified, it reveals how general the query is: the more leaf nodes a query identifies, the more general it is. Revealing the number of identified leaf nodes, however, does not help the service provider guess the final result of the query, since it cannot access the encrypted noisy counts.

Although setting \(\alpha \) to 1 provides the highest security w.r.t. query search pattern, it is impractical due to the cost of reconstructing the *k*d-tree. We refer the reader to *solution construction scalability* in Sect. 6.2.2 for more details about the cost of reconstructing the *k*d-tree.

**Case 2** \(\alpha > 1\). While our proposed framework supports *confidential access* to the data, executing multiple queries on the same *k*d-tree index reveals the search pattern of the queries, where the service provider is able to determine the number of leaf nodes that overlap between the queries. Let *u* and \(u'\) be two user queries that satisfy the same set of leaf nodes \(l = \{l_1, \ldots , l_r\}\), and let *collision set* denote the set of all unique queries that could satisfy *l*. The size of the collision set can be determined as follows:

where \(|l_i.Range(\hat{A}_j)|\) denotes the size of the range of attribute \(\hat{A}_j\) in the equivalence class represented by leaf node \(l_i\). Note that since the noisy counts are encrypted using ElGamal, and the position of the attributes in the tree is hidden and shuffled every time the *k*d-tree is constructed, disclosing the search pattern on the differentially private data reveals nothing about the final (noisy) result of each query, nor about the attributes/values in each query. The smaller the value of \(\alpha \), the less overlap between queries is revealed. Several techniques have been proposed in the literature to address the problem of private search pattern, such as [61]; however, this problem is outside the scope of this paper.
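To make the search-pattern leakage concrete, the sketch below computes what a service provider could observe when several queries run on the same index: the number of leaf nodes shared by each pair of queries. The function `pairwise_overlap` and the opaque query labels are illustrative assumptions; the sketch models only the leaf identifiers reached, not their encrypted contents.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlap(leaf_sets: Dict[str, Set[int]]) -> Dict[Tuple[str, str], int]:
    """Return, for every pair of queries, how many leaf nodes both reached.

    leaf_sets maps an opaque query label to the set of leaf identifiers
    visited while that query traversed the (same) kd-tree index.
    """
    return {
        (a, b): len(leaf_sets[a] & leaf_sets[b])
        for a, b in combinations(sorted(leaf_sets), 2)
    }


# Three queries processed on one index; the labels carry no query content.
observed = {"q1": {0, 1, 2}, "q2": {2, 3}, "q3": {0, 1, 2}}
overlaps = pairwise_overlap(observed)
```

In this example `q1` and `q3` reach exactly the same leaves, so an observer learns that they fall in the same collision set, but nothing in the sketch exposes attribute names, values, or counts.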

Note that each time the data provider generates a shuffled and re-encrypted *k*d-tree, a different ACP-ABE master secret key MSK should be used to prevent the service provider from processing new queries on the old tree.

In our model, we assume the data miner can have access to the entire differentially private dataset. The data privacy is guaranteed by differential privacy. Therefore, there is no need to simulate the view of the data miner.

Moreover, since our framework returns differentially private results for each count query in a deterministic way, any repetition of queries will leak no extra information about the data. Also, since count query results are differentially private, our framework is also protected against background knowledge attacks.
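The point about repeated queries can be sketched as follows: noise is drawn once, when the counts are published, and every later query is a deterministic function of those fixed noisy counts. The use of Laplace noise with scale \(1/\epsilon\) per leaf, and the names `publish_noisy_counts` and `answer`, are assumptions for this example rather than a description of the paper's anonymization algorithm.

```python
import random
from typing import List, Optional, Sequence


def publish_noisy_counts(
    true_counts: Sequence[int], epsilon: float, seed: Optional[int] = None
) -> List[float]:
    """Add Laplace noise once per leaf count (assumed mechanism, sensitivity 1)."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    return [
        c + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for c in true_counts
    ]


def answer(noisy_counts: Sequence[float], leaf_ids: Sequence[int]) -> float:
    """Answer a count query from the already-published noisy counts."""
    return sum(noisy_counts[i] for i in leaf_ids)


noisy = publish_noisy_counts([120, 45, 300], epsilon=1.0, seed=42)
first = answer(noisy, [0, 2])
second = answer(noisy, [0, 2])  # same query, same answer: no fresh noise
```

Because `answer` never draws new randomness, averaging repeated answers cannot cancel the noise, which is the intuition behind the claim that repetition leaks no extra information.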

The proposed protocol in this paper involves the composition of secure subprotocols in which all intermediate outputs from one subprotocol are inputs to the next subprotocol. These intermediate outputs are either simulated given the final output and the local input for each party or computed as random shares. Using the composition theorem (Goldreich [34]), it can be shown that if each subprotocol is secure, then the resulting composition is also secure.

**Key accessibility** Protecting the data distributed between different parties from unauthorized access is an essential part of securing the \(\textsf {SecDM} \) framework. We must ensure that all keys are properly distributed such that no party can decrypt any data it is not supposed to have access to in plaintext. Table 4 illustrates the accessibility of each key by each party in \(\textsf {SecDM} \).

Observe that the data provider is the generator of all encryption keys in the system and maintains full control over them. The service provider, on the other hand, has no access to Exponential ElGamal’s private key, \(\mathbb {G}.x\), that would have allowed her to fully decrypt the contents of each leaf node in the *k*d-tree index. Moreover, not having access to the ACP-ABE master secret key \(\mathbb {A}.MSK\) prevents the service provider from being able to determine the access structures of the ciphertexts in each internal node of the *k*d-tree index. As for the user (data miner), not having access to \(\mathbb {A}.MSK\) prevents her from bypassing authentication and creating her own system count queries.

### B. Correctness analysis

The correctness proof is twofold. First, we prove that Algorithm 4.6 identifies all the leaf nodes satisfying the user count query *u*. Second, we prove that Algorithm 4.7 produces the exact total count answer to *u*, and the answer is differentially private.

### Proposition 5

Given a user count query \(u = \mathcal {P}_1 \wedge \cdots \wedge \mathcal {P}_m\), Algorithm 4.6 produces a set \(\mathcal {R}\) containing all leaf nodes satisfying *u*.

### Proof

To prove the correctness of Algorithm 4.6 we prove *partial correctness* and *termination*.

*1. Partial correctness* We provide a proof by induction.

**Basis** When *u* includes no predicate for any of the attributes in \(\hat{D}\), then each categorical attribute in \(SK_u\) is assigned the value 1 (the identifier of the root node of the corresponding taxonomy tree), whereas for each numerical attribute \(\hat{A}_i \in \hat{D}\), \(\hat{A}_i^{min}=1\) (the lowest range identifier) and \(\hat{A}_i^{max}\) is assigned the highest range identifier in \(\Omega (\hat{A}_i)\). When \(SK_u\) is used to traverse the *k*d-tree index, all internal nodes will be traversed until the leaf nodes are reached. That is, if the current node *v* is internal, \(\mathbb {A}.\textsf {Dec} (v.CT_{left},SK_u)\) and \(\mathbb {A}.\textsf {Dec} (v.CT_{right},SK_u)\) will always be true because the attributes in \(SK_u\) will always satisfy the access structure in \(v.CT_{left}\) and \(v.CT_{right}\), and pointers to the left child node and right child node will always be obtained.

**Induction step** Assume that traversing the *k*d-tree index using \(SK_u\) produces the correct set of leaf nodes \(\mathcal {R}\) satisfying *u*. We show that if a new predicate \(\mathcal {P}=(\hat{A}_i \, \vdash \,\, s_i)\) is added to *u* such that \(\acute{u} = u + \mathcal {P}\), then traversing the *k*d-tree index using \(SK_{\acute{u}}\) produces the correct set of leaf nodes \(\acute{\mathcal {R}}\) satisfying \(\acute{u}\). We observe that \(\acute{\mathcal {R}} \subseteq \mathcal {R}\). To complete the proof in this step, we assume that \(\mathcal {P}\) corresponds to a categorical attribute; the same reasoning applies to a numerical attribute’s predicate. When *v* is an internal node and \(v.split\_dim = \hat{A}_i\), if \(s_i.ID \le v.split\_value\) then \(\mathbb {A}.\textsf {Dec} (v.CT_{right},SK_{\acute{u}})\) will evaluate to *false*, and no recursive call of procedure *traverseIndex* over node *v*.*rc* will be executed. This behavior is correct because in this case, the subtree whose root is *v*.*rc* includes only leaf nodes that do not satisfy \(\mathcal {P}\), and hence there is no need to search the subtree rooted at *v*.*rc*. The same logic can be used to reason about the case when \(s_i.ID > v.split\_value\).

*2. Termination* Each recursive call on a child node partitions the space of the parent node in half. This shows that the algorithm strictly moves from one level to a lower level in the *k*d-tree index while reducing the search space by half until all leaf nodes satisfying *u* are reached. \(\square \)
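The pruning argument in the proof of Proposition 5 can be followed on a plaintext analogue of the traversal. In the sketch below, an ordinary comparison against `split_value` stands in for the attribute-based decryption test that the protocol performs on \(v.CT_{left}\) and \(v.CT_{right}\); the `Node` class and `traverse_index` function are hypothetical simplifications, not the paper's Algorithm 4.6.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    leaf_id: Optional[int] = None      # set only on leaf nodes
    split_dim: Optional[str] = None    # attribute tested at an internal node
    split_value: Optional[int] = None  # identifier used as the split point
    lc: Optional["Node"] = None        # left child (values <= split_value)
    rc: Optional["Node"] = None        # right child (values > split_value)


def traverse_index(v: Node, query: Dict[str, Tuple[float, float]], out: List[int]) -> None:
    """Collect the leaves whose region intersects the query ranges."""
    if v.leaf_id is not None:
        out.append(v.leaf_id)
        return
    # An attribute without a predicate covers its whole domain (the basis case).
    lo, hi = query.get(v.split_dim, (float("-inf"), float("inf")))
    if lo <= v.split_value:   # stands in for a successful Dec on CT_left
        traverse_index(v.lc, query, out)
    if hi > v.split_value:    # stands in for a successful Dec on CT_right
        traverse_index(v.rc, query, out)


def build_demo_tree() -> Node:
    """Two-level balanced tree splitting on A, then on B."""
    left = Node(split_dim="B", split_value=5, lc=Node(leaf_id=0), rc=Node(leaf_id=1))
    right = Node(split_dim="B", split_value=5, lc=Node(leaf_id=2), rc=Node(leaf_id=3))
    return Node(split_dim="A", split_value=2, lc=left, rc=right)


def run(query: Dict[str, Tuple[float, float]]) -> List[int]:
    found: List[int] = []
    traverse_index(build_demo_tree(), query, found)
    return found
```

Calling `run({})` reaches every leaf, and each added predicate can only shrink the result, mirroring the observation that \(\acute{\mathcal {R}} \subseteq \mathcal {R}\).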

### Proposition 6

Given a set of leaf nodes \(\mathcal {R}\) generated by a system count query \(SK_u\) and a set of attribute distribution tokens \(\mathcal {N}\), the output of Algorithm 4.7 is the exact noisy count answer corresponding to \(SK_u\).

### Proof

To prove the correctness of Algorithm 4.7, we prove *partial correctness* and *termination*.

*1. Partial correctness* We provide a proof by induction.

**Basis** When \(\mathcal {N}=\emptyset \), the inner loop will never be executed. In this case, procedure *compTCount* will go through all the leaf nodes in \(\mathcal {R}\) and add together all corresponding noisy counts by utilizing the homomorphic addition property of Exponential ElGamal. This is correct because if no *ADT* token was originally generated, then the user query is an *exact* query, and \(100\%\) of the noisy count of each leaf node in \(\mathcal {R}\) must be used.
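The homomorphic addition used in the basis case can be demonstrated with a toy version of exponential ElGamal, where a message \(m\) is encrypted as \((g^r, g^m h^r)\) and multiplying ciphertexts component-wise adds the underlying messages. The parameters below (modulus 2579, base 2, integer counts, a brute-force decryption bound) are assumptions chosen so the example runs instantly; they are far too small to be secure and are not the parameters of the paper.

```python
import random
from typing import Tuple

P = 2579  # toy prime modulus (insecure, illustration only)
G = 2     # base element; its multiplicative order modulo P exceeds 1000

Ciphertext = Tuple[int, int]


def keygen(rng: random.Random) -> Tuple[int, int]:
    """Return (private key x, public key h = G^x mod P)."""
    x = rng.randrange(1, P - 1)
    return x, pow(G, x, P)


def encrypt(m: int, h: int, rng: random.Random) -> Ciphertext:
    """Encrypt m in the exponent: (G^r, G^m * h^r)."""
    r = rng.randrange(1, P - 1)
    return pow(G, r, P), (pow(G, m, P) * pow(h, r, P)) % P


def add(c: Ciphertext, d: Ciphertext) -> Ciphertext:
    """Component-wise product of ciphertexts encrypts the sum of the messages."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P


def decrypt(c: Ciphertext, x: int, max_m: int = 1000) -> int:
    """Recover G^m, then search for m by brute force up to max_m."""
    shared = pow(c[0], x, P)
    g_m = (c[1] * pow(shared, P - 2, P)) % P  # divide by shared (Fermat inverse)
    acc = 1
    for m in range(max_m + 1):
        if acc == g_m:
            return m
        acc = (acc * G) % P
    raise ValueError("message exceeds the brute-force bound")


rng = random.Random(7)
x, h = keygen(rng)
encrypted_counts = [encrypt(m, h, rng) for m in (12, 30, 7)]
total = encrypted_counts[0]
for ct in encrypted_counts[1:]:
    total = add(total, ct)
```

Decrypting `total` with the private key recovers the sum of the three counts without any individual count having been decrypted along the way, which is the property the service provider relies on when summing leaf counts.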

**Induction step** Assume that for \(\mathcal {N}=\{ADT_1,\ldots ,ADT_l\}\), procedure *compTCount* computes the exact noisy count answer to the user count query *u*. We show that if a new token \(ADT_{l+1}\) for numerical attribute \(\hat{A}_i\) is added such that \(\acute{\mathcal {N}} = \mathcal {N} \cup \{ADT_{l+1}\} = \{ADT_1, \ldots , ADT_{l+1}\}\), where \(\acute{\mathcal {N}}\) corresponds to the system count query \(SK_{\acute{u}}\), then procedure *compTCount* computes the exact noisy count answer to the user count query \(\acute{u}\). Without loss of generality, we assume that the set of leaf nodes \(\mathcal {R}\) remains the same. Since \(ADT_{l+1}\) is for numerical attribute \(\hat{A}_i\), \(ADT_{l+1}.value\) represents, by definition, the percentage of the partial intersection between query \(\acute{u}\) and attribute \(\hat{A}_i\). If \(\acute{u}\) is a *generic* query, then not all leaf nodes in \(\mathcal {R}\) will contain a tag that corresponds to \(ADT_{l+1}.tag\). However, the noisy count of each leaf node *l* containing a tag that matches \(ADT_{l+1}.tag\) must be adjusted by multiplying *l*.*NCount* with \(ADT_{l+1}.value\).

*2. Termination* We denote by *n* the initial number of leaf nodes in \(\mathcal {R}\). If \(n > 0\), then we enter the outer loop. We also denote by *m* the initial number of *ADT* tokens in \(\mathcal {N}\). If \(m > 0\) then we enter the inner loop such that after each iteration, the variable *m* is decreased by one, and it keeps strictly decreasing until \(m=0\) where the inner loop terminates. Similarly, the outer loop will terminate as *n* keeps strictly decreasing until it reaches 0; at that stage, the algorithm terminates. \(\square \)
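The two nested loops analysed in the proof of Proposition 6 can be written out on plaintext values. This is a simplified sketch, assuming that each leaf carries a noisy count and a set of tags and that each *ADT* token carries a tag and a fraction in \([0,1]\); the names `Leaf`, `ADT` and `comp_t_count` are hypothetical, and the protocol itself performs the corresponding operations on encrypted counts.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class Leaf:
    ncount: float                        # noisy count stored in the leaf
    tags: FrozenSet[str] = frozenset()   # tags of partially covered ranges


@dataclass(frozen=True)
class ADT:
    tag: str      # identifies the range the token applies to
    value: float  # fraction of that range intersected by the query


def comp_t_count(leaves: Sequence[Leaf], tokens: Sequence[ADT]) -> float:
    """Sum the leaf counts, scaling a leaf by every token whose tag it carries."""
    total = 0.0
    for leaf in leaves:                  # outer loop over the set R
        share = leaf.ncount
        for token in tokens:             # inner loop over the set N
            if token.tag in leaf.tags:
                share *= token.value     # keep only the intersected fraction
        total += share
    return total


leaves = [Leaf(10.0, frozenset({"age:2"})), Leaf(20.0)]
exact = comp_t_count(leaves, [])
partial = comp_t_count(leaves, [ADT("age:2", 0.5)])
```

With no tokens the result is the plain sum of the counts (the basis case); adding one token scales only the leaf that carries the matching tag (the induction step). Both loops iterate over finite sequences, so the procedure terminates.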

## About this article

### Cite this article

Dagher, G.G., Fung, B.C.M., Mohammed, N. *et al.* \({\textsf {SecDM}}\): privacy-preserving data outsourcing framework with differential privacy. *Knowl Inf Syst* **62**, 1923–1960 (2020). https://doi.org/10.1007/s10115-019-01405-7

### Keywords

- Cloud computing
- Data outsourcing
- Search on encrypted data
- Differential privacy