\({\textsf {SecDM}}\): privacy-preserving data outsourcing framework with differential privacy

Abstract

Data-as-a-service (DaaS) is a cloud computing service that emerged as a viable option to businesses and individuals for outsourcing and sharing their collected data with other parties. Although the cloud computing paradigm provides great flexibility to consumers with respect to computation and storage capabilities, it imposes serious concerns about the confidentiality of the outsourced data as well as the privacy of the individuals referenced in the data. In this paper we formulate and address the problem of querying encrypted data in a cloud environment such that query processing is confidential and the result is differentially private. We propose a framework where the data provider uploads an encrypted index of her anonymized data to a DaaS service provider that is responsible for answering range count queries from authorized data miners for the purpose of data mining. To satisfy the confidentiality requirement, we leverage attribute-based encryption to construct a secure kd-tree index over the differentially private data for fast access. We also utilize the exponential variant of the ElGamal cryptosystem to efficiently perform homomorphic operations on encrypted data. Experiments on real-life data demonstrate that our proposed framework preserves data utility, can efficiently answer range queries, and is scalable with increasing data size.

Notes

  1.

    PopData: https://www.popdata.bc.ca/.

  2.

    Statistical Data Integration Involving Commonwealth Data: http://statistical-data-integration.govspace.gov.au/.

  3.

    MIRACL: https://certivox.org/display/EXT/MIRACL.

References

  1.

    Agrawal R, Kiernan J, Srikant R, Xu Y (2004) Order preserving encryption for numeric data. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD), pp 563–574

  2.

    Bache K, Lichman M (2013) UCI machine learning repository. School of Information and Computer Sciences, University of California, Irvine

  3.

    Barbaro M, Zeller TJ (2006) A face is exposed for AOL searcher no. 4417749

  4.

    Barouti S, Aljumah F, Alhadidi D, Debbabi M (2014) Secure and privacy-preserving querying of personal health records in the cloud. In: Data and applications security and privacy XXVIII (LNCS), vol 8566, pp 82–97

  5.

    Bayer R, McCreight E (1970) Organization and maintenance of large ordered indices. In: Proceedings of the ACM SIGFIDET workshop on data description, access and control, pp 107–141

  6.

    Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517

  7.

    Bethencourt J, Sahai A, Waters B (2007) Ciphertext-policy attribute-based encryption. In: Proceedings of the IEEE symposium on security and privacy. IEEE Computer Society, Washington, DC, pp 321–334

  8.

    Blum A, Dwork C, McSherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS). ACM, pp 128–138

  9.

    Boneh D, Boyen X, Shacham H (2004) Short group signatures. In: Advances in cryptology—CRYPTO 2004. Volume 3152 of lecture notes in computer science, pp 41–55

  10.

    Boneh D, Franklin M (2003) Identity-based encryption from the Weil pairing. SIAM J Comput 32(3):586–615

  11.

    Boneh D, Lynn B, Shacham H (2001) Short signatures from the Weil pairing. In: Proceedings of the 7th international conference on the theory and application of cryptology and information security: advances in cryptology (ASIACRYPT). Springer, pp 514–532

  12.

    Boneh D, Sahai A, Waters B (2011) Functional encryption: definitions and challenges. In: Proceedings of TCC, pp 253–273

  13.

    Boneh D, Waters B (2007) Conjunctive, subset, and range queries on encrypted data. In: Proceedings of the 4th conference on theory of cryptography (TCC), pp 535–554

  14.

    Bösch C, Hartel P, Jonker W, Peter A (2014) A survey of provably secure searchable encryption. ACM Comput Surv 47(2):18:1–18:51

  15.

    Bösch C, Tang Q, Hartel P, Jonker W (2012) Selective document retrieval from encrypted database. In: Proceedings of ISC, pp 224–241

  16.

    Chen R, Xiao Q, Zhang Y, Xu J (2015) Differentially private high-dimensional data publication via sampling-based inference. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’15, pp 129–138

  17.

    Comer D (1979) Ubiquitous B-tree. ACM Comput Surv 11(2):121–137

  18.

    Cormode G, Procopiuc C, Srivastava D, Shen E, Yu T (2012) Differentially private spatial decompositions. In: Proceedings of the IEEE 28th international conference on data engineering (ICDE). IEEE Computer Society, pp 20–31

  19.

    Cramer R, Gennaro R, Schoenmakers B (1997) A secure and optimally efficient multi-authority election scheme. In: Proceedings of the 16th annual international conference on theory and application of cryptographic techniques (EUROCRYPT). Springer, pp 103–118

  20.

    Dagher GG, Mohler J, Milojkovic M, Marella PB (2018) Ancile: privacy-preserving framework for access control and interoperability of electronic health records using blockchain technology. Sustain Cities Soc (SCS) 39:283–297

  21.

    Damiani E, Vimercati SDC, Jajodia S, Paraboschi S, Samarati P (2003) Balancing confidentiality and efficiency in untrusted relational DBMSs. In: Proceedings of the 10th ACM conference on computer and communications security (CCS). ACM, pp 93–102

  22.

    de Berg M, Cheong O, van Kreveld M, Overmars M (2008) Computational geometry: algorithms and applications, 3rd edn. Springer, Berlin

  23.

    Dillon T, Wu C, Chang E (2010) Cloud computing: issues and challenges. In: Proceedings of the 24th IEEE conference on advanced information networking and applications (AINA), pp 27–33

  24.

    Dong C, Russello G, Dulay N (2008) Shared and searchable encrypted data for untrusted servers. In: Proceedings of DBSec, pp 127–143

  25.

    Dwork C (2006) Differential privacy. In: Proceedings of the international colloquium on automata, languages, and programming (ICALP), pp 1–12

  26.

    Emekci F, Agrawal D, Abbadi A, Gulbeden A (2006) Privacy preserving query processing using third parties. In: Proceedings of the 22nd international conference on data engineering (ICDE), pp 27–36

  27.

    Friedman A, Schuster A (2010) Data mining with differential privacy. In: Proceedings of the 16th ACM SIGKDD. KDD ’10, pp 493–502

  28.

    Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey of recent developments. ACM Comput Surv 42(4):14:1–14:53

  29.

    Fung BCM, Wang K, Yu PS (2007) Anonymizing classification data for privacy preservation. IEEE Trans Knowl Data Eng (TKDE) 19(5):711–725

  30.

    Ge T, Zdonik S (2007) Answering aggregation queries in a secure system model. In: Proceedings of the 33rd international conference on very large data bases (PVLDB). VLDB Endowment, pp 519–530

  31.

    Ge T, Zdonik S (2007) Fast, secure encryption for indexing in a column-oriented DBMS. In: Proceedings of the IEEE 23rd international conference on data engineering (ICDE), pp 676–685

  32.

    Gentry C (2009) Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st annual ACM symposium on theory of computing (STOC). ACM, pp 169–178

  33.

    Giannotti F, Lakshmanan L, Monreale A, Pedreschi D, Wang H (2013) Privacy-preserving mining of association rules from outsourced transaction databases. IEEE Syst J (ISJ) 7(3):385–395

  34.

    Goldreich O (2004) Foundations of cryptography, vol 2. Cambridge University Press, Cambridge

  35.

    Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD). ACM, pp 47–57

  36.

    Hacigümüş H, Iyer B, Li C, Mehrotra S (2002) Executing SQL over encrypted data in the database-service-provider model. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD). ACM, pp 216–227

  37.

    Hacigümüş H, Iyer B, Mehrotra S (2004) Efficient execution of aggregation queries over encrypted relational databases. In: Proceedings of the database systems for advanced applications (DASFAA), pp 125–136

  38.

    Hore B, Mehrotra S, Tsudik G (2004) A privacy-preserving index for range queries. In: Proceedings of the 13th international conference on very large data bases (PVLDB). VLDB Endowment, pp 720–731

  39.

    Hu H, Xu J, Ren C, Choi B (2011) Processing private queries over untrusted data cloud through privacy homomorphism. In: Proceedings of the IEEE 27th international conference on data engineering (ICDE), pp 601–612

  40.

    Jarecki S, Jutla C, Krawczyk H, Rosu M, Steiner M (2013) Outsourced symmetric private information retrieval. In: Proceedings of the ACM SIGSAC conference on computer & communications security (CCS), pp 875–888

  41.

    Joux A (2000) A one round protocol for tripartite Diffie–Hellman. In: Proceedings of the 4th international symposium on algorithmic number theory (ANTS), pp 385–394

  42.

    Kamara S, Mohassel P, Raykova M (n.d.) Outsourcing multi-party computation. IACR Cryptology ePrint Archive 2011:272

  43.

    Kifer D, Lin B-R (2010) Towards an axiomatization of statistical privacy and utility. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS). ACM, pp 147–158

  44.

    Lai J, Deng RH, Li Y (2011) Fully secure cipertext-policy hiding cp-abe. In: Proceedings of the 7th international conference on information security practice and experience (ISPEC). Springer, Berlin, pp 24–39

  45.

    McSherry FD (2009) Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the 2009 ACM SIGMOD international conference on management of data, pp 19–30

  46.

    Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 493–501

  47.

    Narayanan A, Shmatikov V (2008) Robust de-anonymization of large sparse datasets. In: Proceedings of the IEEE symposium on security and privacy (SP), pp 111–125

  48.

    Popa RA, Redfield CMS, Zeldovich N, Balakrishnan H (2011) CryptDB: protecting confidentiality with encrypted query processing. In: Proceedings of the 23rd ACM symposium on operating systems principles (SOSP). ACM, pp 85–100

  49.

    Salzberg S (1994) C4.5: programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach Learn 16(3):235–240

  50.

    Shabtai A, Elovici Y, Rokach L (2012) A survey of data leakage detection and prevention solutions. SpringerBriefs in computer science. Springer, Berlin

  51.

    Shmueli E, Tassa T, Wasserstein R, Shapira B, Rokach L (2012) Limiting disclosure of sensitive data in sequential releases of databases. Inf Sci 191:98–127

  52.

    Song DX, Wagner D, Perrig A (2000) Practical techniques for searches on encrypted data. In: Proceedings of the 2000 IEEE symposium on security and privacy (S&P)

  53.

    Tysowski P, Hasan M (2013) Hybrid attribute- and re-encryption-based key management for secure and scalable mobile applications in clouds. IEEE Trans Cloud Comput (TCC) 1(2):172–186

  54.

    Wang C, Cao N, Li J, Ren K, Lou W (2010) Secure ranked keyword search over encrypted cloud data. In: Proceedings of the IEEE 30th international conference on distributed computing systems (ICDCS). IEEE Computer Society, pp 253–262

  55.

    Wang H, Lakshmanan LVS (2006) Efficient secure query evaluation over encrypted XML databases. In: Proceedings of the 32nd international conference on very large data bases (PVLDB). VLDB Endowment, pp 127–138

  56.

    Wang P, Ravishankar C (2013) Secure and efficient range queries on outsourced databases using \(\widehat{R}\)-trees. In: Proceedings of the IEEE 29th international conference on data engineering (ICDE), pp 314–325

  57.

    Wang S, Agrawal D, El Abbadi A (2011) A comprehensive framework for secure query processing on relational data in the cloud. In: Proceedings of the 8th VLDB international conference on secure data management (SDM). Springer, pp 52–69

  58.

    Wang Y (2014) Privacy-preserving data storage in cloud using array BP-XOR codes. IEEE Trans Cloud Comput 3(4):425–435

  59.

    Wang Z-F, Dai J, Wang W, Shi B-L (2004) Fast query over encrypted character data in database. In: Proceedings of the 1st international conference on computational and information science (CIS). Springer, pp 1027–1033

  60.

    Wang Z-F, Wang W, Shi B-L (2005) Storage and query over encrypted character and numerical data in database. In: Proceedings of the 5th international conference on computer and information technology (CIT). IEEE Computer Society, pp 77–81

  61.

    Williams P, Sion R, Carbunar B (2008) Building castles out of mud: Practical access pattern privacy and correctness on untrusted storage. In: Proceedings of the 15th ACM conference on computer and communications security (CCS), pp 139–148

  62.

    Wong RC-W, Li J, Fu AW-C, Wang K (2006) (\(\alpha \), k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 754–759

  63.

    Wong WK, Cheung DW-l, Kao B, Mamoulis N (2009) Secure KNN computation on encrypted databases. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD). ACM, pp 139–152

  64.

    Xiao Y, Xiong L, Yuan C (2010) Differentially private data release through multidimensional partitioning. In: Proceedings of the 7th VLDB conference on secure data management (SDM). Springer, pp 150–168

  65.

    Yi X, Paulet R, Bertino E, Xu G (2016) Private cell retrieval from data warehouses. IEEE Trans Inf Forensics Secur 11(6):1346–1361

Author information

Corresponding author

Correspondence to Gaby G. Dagher.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A. Security analysis

The proposed framework is sound since all adversaries are non-colluding and semi-honest, according to our adversarial model. In the rest of this section, we focus on proving that the protocol is confidentiality-preserving. We also illustrate the accessibility of the keys in the framework, and show that all keys are properly distributed between the parties.

Privacy by simulation Goldreich [34] defines the security of a protocol in the semi-honest adversarial model as follows.

Definition 7

(Privacy w.r.t. semi-honest behavior) [34]. Let \(f: (\{0,1\}^*)^m \mapsto (\{0,1\}^*)^m\) be an m-ary deterministic polynomial-time functionality, where \(f_i(x_1,\ldots ,x_m)\) is the ith element of \(f(x_1,\ldots ,x_m)\). Let \(\Pi \) be an m-party protocol for computing f. The view of the ith party during an execution of \(\Pi \) over \(x = (x_1,\ldots ,x_m)\) is \(\textsf {view} _i^\Pi (x) = (x_i, r_i , m_{i,1}, \ldots , m_{i,t})\), where \(r_i\) equals the contents of the ith party’s internal random tape, and \(m_{i,j}\) represents the jth message that it received. For \(I = \{i_1,\ldots ,i_l\} \subseteq \{1,\ldots ,m\}\), let \(f_I(x) = (f_{i_1}(x),\ldots ,f_{i_l}(x))\) and \(\textsf {view} _I^\Pi (x) = (I,\textsf {view} _{i_1}^\Pi (x),\ldots ,\textsf {view} _{i_l}^\Pi (x))\). We say that \(\Pi \) securely computes f in the presence of static semi-honest adversaries if there exists a probabilistic polynomial-time algorithm (simulator) S such that for every \(I \subseteq \{1,\ldots ,m\}\):

$$\begin{aligned} \{S(I,(x_{i_1},\ldots ,x_{i_l}),f_I(x))\}_{x \in (\{0,1\}^*)^m} \overset{c}{\equiv } \{\textsf {view} _I^\Pi (x)\}_{x \in (\{0,1\}^*)^m} \end{aligned}$$

where \(\overset{c}{\equiv }\) denotes computational indistinguishability.\(\square \)

According to Definition 7, it is sufficient to show that we can effectively simulate the view of each party during the execution of the \(\textsf {SecDM} \) protocol given the input, output and acceptable leaked information of that party, in order to prove that our protocol is secure. We achieve that by simulating each message received by a party in each algorithm. If we can simulate the input messages of each party in the protocol based only on its input and output, and the party is not able to recognize that it is dealing with a simulator, that means the protocol does not leak anything to that party since it would have been able to compute its output from its input without the need to be involved in the protocol in the first place.

First, we define the concepts query distribution and query processing threshold.

Definition 8

(Query distribution) The distribution of the data mining queries, denoted by U, is the set of all possible queries, where each query consists of \(k_c + 2 \times k_n\) integers, each of which maps to a value in the domain of a categorical or numerical attribute.
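As a rough illustration of Definition 8, the following Python sketch flattens a query into \(k_c + 2 \times k_n\) integers: one identifier per categorical attribute and a (min, max) pair of range identifiers per numerical attribute. The `Schema` class, the attribute names and the function `encode_query` are hypothetical and not part of \(\textsf {SecDM} \). The defaults for unconstrained attributes (identifier 1 for a categorical attribute, the full identifier range for a numerical one) follow the convention stated in the basis case of the correctness proof in Appendix B.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Schema:
    """Hypothetical schema; names and domain sizes are illustrative only."""
    categorical: Dict[str, int]  # attribute -> number of taxonomy-node identifiers
    numerical: Dict[str, int]    # attribute -> number of range identifiers


def encode_query(schema: Schema,
                 cat_pred: Optional[Dict[str, int]] = None,
                 num_pred: Optional[Dict[str, Tuple[int, int]]] = None) -> List[int]:
    """Flatten a query into k_c + 2 * k_n integers."""
    cat_pred = cat_pred or {}
    num_pred = num_pred or {}
    encoded: List[int] = []
    for name, size in schema.categorical.items():
        value = cat_pred.get(name, 1)  # 1 = taxonomy root, i.e. no predicate
        if not 1 <= value <= size:
            raise ValueError(f"{name}: identifier {value} outside 1..{size}")
        encoded.append(value)
    for name, size in schema.numerical.items():
        low, high = num_pred.get(name, (1, size))  # full range when no predicate
        if not 1 <= low <= high <= size:
            raise ValueError(f"{name}: range ({low}, {high}) outside 1..{size}")
        encoded.extend((low, high))
    return encoded


schema = Schema(categorical={"job": 5, "sex": 3}, numerical={"age": 10})
unconstrained = encode_query(schema)
constrained = encode_query(schema, {"job": 3}, {"age": (2, 4)})
```

With \(k_c = 2\) and \(k_n = 1\), both encodings contain \(2 + 2 \times 1 = 4\) integers.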

Definition 9

(Query processing threshold) Query processing threshold, denoted by \(\alpha \), is the maximum number of queries allowed to be processed on a kd-tree before the latter is replaced by a new shuffled and re-encrypted kd-tree submitted by the data provider to the service provider.

Definition 10

(Privacy-preserving data outsourcing framework) Let \(\mathcal {F}\) be a framework that enables a service provider (cloud) to answer queries from data miners on hosted (outsourced) data. \(\mathcal {F}\) is a privacy-preserving framework if the following properties hold:

  1.

    Correctness For any user query \(u \in U\), the cloud returns \(res_u\) to the data miner such that \(res_u\) is the correct answer to u.

  2.

    Data confidentiality A semi-honest adversary \(\mathcal {E}\), statically corrupting the service provider, cannot learn anything more about the hosted data from an accepted transcript of \(\mathcal {F}\) than she could given only the total number of numerical and categorical attributes, and the size of each attribute’s domain.

  3.

    Query confidentiality A semi-honest adversary \(\mathcal {E}\), statically corrupting the service provider, cannot learn anything about the query.

  4.

    Differentially private output For all \(u \in U\), \(res_u\) satisfies differential privacy.

Definition 11

(\(\alpha \)-privacy-preserving data outsourcing framework) An outsourcing framework \(\mathcal {F}\) is \(\alpha \)-privacy-preserving if it satisfies all properties in Definition 10, except that the cloud learns the search pattern of at most \(\alpha \) queries.

Theorem A.1

SecDM, as specified in Protocols 4.1–4.7, is an \(\alpha \)-privacy-preserving data outsourcing framework.

Proof

We proved in Sect. 1 Property 1 (correctness) and Property 4 (differentially private output).

To prove Property 2 (data confidentiality) and Property 3 (query confidentiality), we build a simulator \(\mathcal {S}\) that generates a view that is statistically indistinguishable from the view of \(\mathcal {E}\) in a real execution. As per Definition 7, the view of the service provider consists mainly of the messages it receives from the other parties. Although we have 8 algorithms, the service provider receives messages from the protocol only in Algorithm 3, Line 2 (encrypted index from the data provider) and Algorithm 5, Line 4 (encrypted query from the data miner). The steps in the remaining algorithms do not need to be simulated because they either do not involve the service provider at all (e.g., the steps in Algorithms 1, 2 and 4), or involve only ciphertext operations (e.g., the steps in Algorithms 6 and 7), whose security follows from the security of the underlying cryptosystems (ABE and ElGamal).

Discussion The threshold parameter \(\alpha \) can range between 1 and \(\infty \). To better understand the impact of revealing the search pattern of \(\alpha \) queries to the service provider, we analyze the security when \(\alpha = 1\) and when \(\alpha > 1\).

Case 1: \(\alpha = 1\). This represents the highest security level of our protocol, where one system query is executed per kd-tree. Since the kd-tree index is constructed by Algorithm 4.1 as a balanced tree and since each path contains all attributes, no correlation can be established between any two attributes, and the attributes are protected when evaluated for splitting the k-dimensional space. As for the data mining query, the service provider cannot determine which attributes are included in the query, nor what values or ranges the data miner is interested in. Algorithm 4.6 does reveal how many leaf nodes (equivalence classes) a query identifies, which indicates how general the query is: the more leaf nodes a query identifies, the more general it is. Revealing the number of identified leaf nodes, however, does not help the service provider better guess the final result of the query, since it cannot decrypt the noisy counts.

Although setting \(\alpha \) to 1 provides the highest security w.r.t. query search pattern, it is impractical due to the cost of reconstructing the kd-tree. We refer the reader to solution construction scalability in Sect. 6.2.2 for more details about the cost of reconstructing the kd-tree.

Case 2: \(\alpha > 1\). While our proposed framework supports confidential access to the data, executing multiple queries on the same kd-tree index reveals the search pattern of the queries: the service provider is able to determine the number of leaf nodes that overlap between the queries. Let u and \(u'\) be two user queries that are satisfied by the same set of leaf nodes \(l = \{l_1, \ldots , l_r\}\), and let the collision set denote the set of all unique queries that could be satisfied by l. The size of the collision set can be determined as follows:

$$\begin{aligned} |\textsf {collision set} (l)| = \prod _{i=1}^{r}\prod _{j=1}^{k}|l_i.Range(\hat{A}_j)|~:~ \hat{A}_j \text{ is } \text{ numerical }, \end{aligned}$$

where \(|l_i.Range(\hat{A}_j)|\) denotes the size of the range of attribute \(\hat{A}_j\) in the equivalence class represented by leaf node \(l_i\). Note that because the noisy counts are encrypted using ElGamal, and because the position of the attributes in the tree is hidden and reshuffled every time the kd-tree is constructed, disclosing the search pattern on the differentially private data reveals nothing about the final (noisy) result of each query, nor about the attributes or values in each query. The smaller the value of \(\alpha \), the less overlap between queries is revealed. Several techniques have been proposed in the literature to address the problem of private search patterns, such as [61]; however, this problem is outside the scope of this paper.
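To make the collision set formula concrete, the sketch below evaluates the double product for a small hand-made set of leaf nodes. It assumes that each leaf stores, per numerical attribute, an inclusive pair of integer range identifiers, and that the size of a range is the number of identifiers it covers; both are assumptions of this illustration, and the attribute names are hypothetical.

```python
from math import prod
from typing import Dict, List, Tuple

# A leaf maps each numerical attribute to an inclusive (low, high) pair of
# range identifiers. This representation is an assumption of the sketch.
Leaf = Dict[str, Tuple[int, int]]


def range_size(bounds: Tuple[int, int]) -> int:
    """Number of range identifiers covered by an inclusive interval."""
    low, high = bounds
    return high - low + 1


def collision_set_size(leaves: List[Leaf]) -> int:
    """Product over all leaves and all numerical attributes of the range sizes."""
    return prod(range_size(b) for leaf in leaves for b in leaf.values())


leaves = [
    {"age": (1, 4), "income": (2, 3)},  # contributes 4 * 2
    {"age": (5, 6), "income": (1, 1)},  # contributes 2 * 1
]
size = collision_set_size(leaves)
```

Here the product is \((4 \times 2) \times (2 \times 1) = 16\), so 16 distinct queries would be indistinguishable from one another by the leaf nodes they reach.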

Note that each time the data provider generates a shuffled and re-encrypted kd-tree, a different ACP-ABE master secret key MSK should be used to prevent the service provider from processing new queries on the old tree.
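The interplay between the threshold \(\alpha \) and the re-keying requirement can be sketched as simple bookkeeping. The class below is a hypothetical model, not part of the protocols: the paper does not specify which party counts queries, and a random byte string merely stands in for a freshly generated ACP-ABE master secret key.

```python
import secrets
from dataclasses import dataclass, field


def fresh_key() -> bytes:
    """Placeholder for generating a new ACP-ABE master secret key."""
    return secrets.token_bytes(32)


@dataclass
class IndexLifecycle:
    """Tracks how many queries the current kd-tree has served (illustrative)."""
    alpha: int                      # query processing threshold
    epoch: int = 0                  # number of kd-tree replacements so far
    processed: int = 0              # queries served by the current kd-tree
    msk: bytes = field(default_factory=fresh_key)

    def record_query(self) -> bool:
        """Count one query; return True if the kd-tree was replaced afterwards."""
        self.processed += 1
        if self.processed >= self.alpha:
            self.rotate()
            return True
        return False

    def rotate(self) -> None:
        """Model a shuffled, re-encrypted kd-tree issued under a new key."""
        self.msk = fresh_key()
        self.epoch += 1
        self.processed = 0


index = IndexLifecycle(alpha=3)
first_key = index.msk
replaced = [index.record_query() for _ in range(3)]
```

With \(\alpha = 3\), the third query triggers a replacement; with \(\alpha = 1\), every query would.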

In our model, we assume the data miner can have access to the entire differentially private dataset. The data privacy is guaranteed by differential privacy. Therefore, there is no need to simulate the view of the data miner.

Moreover, since our framework returns differentially private results for each count query in a deterministic way, any repetition of queries will leak no extra information about the data. Also, since count query results are differentially private, our framework is also protected against background knowledge attacks.

The proposed protocol in this paper involves the composition of secure subprotocols in which all intermediate outputs from one subprotocol are inputs to the next subprotocol. These intermediate outputs are either simulated given the final output and the local input for each party or computed as random shares. Using the composition theorem (Goldreich [34]), it can be shown that if each subprotocol is secure, then the resulting composition is also secure.

Table 4 Key accessibility w.r.t. all parties in \(\textsf {SecDM} \) framework

Key accessibility Protecting the data distributed between different parties from unauthorized access is an essential part of securing the \(\textsf {SecDM} \) framework. We must ensure that all keys are properly distributed such that no party can decrypt any data it is not supposed to have access to in plaintext. Table 4 illustrates the accessibility of each key by each party in \(\textsf {SecDM} \).

Observe that the data provider is the generator of all encryption keys in the system and maintains full control over them. The service provider, on the other hand, has no access to Exponential ElGamal’s private key, \(\mathbb {G}.x\), that would have allowed her to fully decrypt the contents of each leaf node in the kd-tree index. Moreover, not having access to the ACP-ABE master secret key \(\mathbb {A}.MSK\) prevents the service provider from being able to determine the access structures of the ciphertexts in each internal node of the kd-tree index. As for the user (data miner), not having access to \(\mathbb {A}.MSK\) prevents her from bypassing authentication and creating her own system count queries.

B. Correctness analysis

The correctness proof is twofold. First, we prove that Algorithm 4.6 identifies all the leaf nodes satisfying the user count query u. Second, we prove that Algorithm 4.7 produces the exact total count answer to u, and the answer is differentially private.

Proposition 5

Given a user count query \(u = \mathcal {P}_1 \wedge \cdots \wedge \mathcal {P}_m\), Algorithm 4.6 produces a set \(\mathcal {R}\) containing all leaf nodes satisfying u.

Proof

To prove the correctness of Algorithm 4.6 we prove partial correctness and termination.

1. Partial correctness We provide a proof by induction.

Basis When u includes no predicate for any of the attributes in \(\hat{D}\), then each categorical attribute in \(SK_u\) is assigned the value 1 (the identifier of the root node of the corresponding taxonomy tree), whereas for each numerical attribute \(\hat{A}_i \in \hat{D}\), \(\hat{A}_i^{min}=1\) (the lowest range identifier) and \(\hat{A}_i^{max}\) is assigned the highest range identifier in \(\Omega (\hat{A}_i)\). When \(SK_u\) is used to traverse the kd-tree index, all internal nodes will be traversed until the leaf nodes are reached. That is, if the current node v is internal, \(\mathbb {A}.\textsf {Dec} (v.CT_{left},SK_u)\) and \(\mathbb {A}.\textsf {Dec} (v.CT_{right},SK_u)\) will always be true because the attributes in \(SK_u\) will always satisfy the access structure in \(v.CT_{left}\) and \(v.CT_{right}\), and pointers to the left child node and right child node will always be obtained.

Induction step Assume that traversing the kd-tree index using \(SK_u\) produces the correct set of leaf nodes \(\mathcal {R}\) satisfying u. We show that if a new predicate \(\mathcal {P}=(\hat{A}_i \, \vdash \,\, s_i)\) is added to u such that \(\acute{u} = u + \mathcal {P}\), then traversing the kd-tree index using \(SK_{\acute{u}}\) produces the correct set of leaf nodes \(\acute{\mathcal {R}}\) satisfying \(\acute{u}\). We observe that \(\acute{\mathcal {R}} \subseteq \mathcal {R}\). To complete the proof of this step, we assume that \(\mathcal {P}\) corresponds to a categorical attribute; the same reasoning applies to a numerical attribute’s predicate. When v is an internal node and \(v.split\_dim = \hat{A}_i\), if \(s_i.ID \le v.split\_value\) then \(\mathbb {A}.\textsf {Dec} (v.CT_{right},SK_{\acute{u}})\) will evaluate to false, and no recursive call of procedure traverseIndex over node v.rc will be executed. This behavior is correct because, in this case, the subtree rooted at v.rc contains only leaf nodes that do not satisfy \(\mathcal {P}\), and hence there is no need to search it. The same logic can be used to reason about the case when \(s_i.ID > v.split\_value\).

2. Termination Each recursive call on a child node partitions the space of the parent node in half. This shows that the algorithm strictly moves from one level to a lower level in the kd-tree index while reducing the search space by half until all leaf nodes satisfying u are reached. \(\square \)
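The pruning argument in this proof can be illustrated with a plaintext analogue of the traversal. In the sketch below, plain comparisons stand in for the ABE decryptions \(\mathbb {A}.\textsf {Dec} (v.CT_{left},SK_u)\) and \(\mathbb {A}.\textsf {Dec} (v.CT_{right},SK_u)\), so it shows only the control flow and none of the confidentiality guarantees. The `Node` layout, the leaf labels and the restriction to numerical range predicates are assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Query = Dict[str, Tuple[int, int]]  # attribute -> inclusive (min, max) range


@dataclass
class Node:
    split_dim: Optional[str] = None  # None marks a leaf node
    split_value: int = 0
    lc: Optional["Node"] = None      # left child: values <= split_value
    rc: Optional["Node"] = None      # right child: values > split_value
    label: str = ""                  # illustrative leaf identifier


def traverse_index(v: Node, query: Query,
                   found: Optional[List[str]] = None) -> List[str]:
    """Collect the labels of all leaf nodes that can satisfy the query."""
    if found is None:
        found = []
    if v.split_dim is None:
        found.append(v.label)
        return found
    bounds = query.get(v.split_dim)  # None: no predicate on this attribute
    # These two tests play the role of the two ABE decryption attempts.
    go_left = bounds is None or bounds[0] <= v.split_value
    go_right = bounds is None or bounds[1] > v.split_value
    if go_left and v.lc is not None:
        traverse_index(v.lc, query, found)
    if go_right and v.rc is not None:
        traverse_index(v.rc, query, found)
    return found


a, b, c, d = (Node(label=s) for s in "ABCD")
root = Node("age", 5, Node("income", 3, a, b), Node("income", 3, c, d))
all_leaves = traverse_index(root, {})
after_one = traverse_index(root, {"age": (1, 5)})
after_two = traverse_index(root, {"age": (1, 5), "income": (4, 6)})
```

Each added predicate can only shrink the result set, mirroring the observation \(\acute{\mathcal {R}} \subseteq \mathcal {R}\) in the induction step.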

Proposition 6

Given a set of leaf nodes \(\mathcal {R}\) generated by a system count query \(SK_u\) and a set of attribute distribution tokens \(\mathcal {N}\), the output of Algorithm 4.7 is the exact noisy count answer corresponding to \(SK_u\).

Proof

To prove the correctness of Algorithm 4.7, we prove partial correctness and termination.

1. Partial correctness We provide a proof by induction.

Basis When \(\mathcal {N}=\emptyset \), the inner loop will never be executed. In this case, procedure compTCount will go through all the leaf nodes in \(\mathcal {R}\) and add together all corresponding noisy counts by utilizing the homomorphic addition property of Exponential ElGamal. This is correct because if no ADT token was originally generated, then the user query is an exact query, and \(100\%\) of the noisy count of each leaf node in \(\mathcal {R}\) must be used.

Induction step Assume that for \(\mathcal {N}=\{ADT_1,\ldots ,ADT_l\}\), procedure compTCount computes the exact noisy count answer to the user count query u. We show that if a new token \(ADT_{l+1}\) for numerical attribute \(\hat{A}_i\) is added such that \(\acute{\mathcal {N}} = \mathcal {N} \cup \{ADT_{l+1}\} = \{ADT_1, \ldots , ADT_{l+1}\}\), where \(\acute{\mathcal {N}}\) corresponds to the system count query \(SK_{\acute{u}}\), then procedure compTCount computes the exact noisy count answer to the user count query \(\acute{u}\). Without loss of generality, we assume that the set of leaf nodes \(\mathcal {R}\) remains the same. Since \(ADT_{l+1}\) is for numerical attribute \(\hat{A}_i\), \(ADT_{l+1}.value\) represents, by definition, the percentage of the partial intersection between query \(\acute{u}\) and attribute \(\hat{A}_i\). If \(\acute{u}\) is a generic query, then not all leaf nodes in \(\mathcal {R}\) will contain a tag that corresponds to \(ADT_{l+1}.tag\). However, the noisy count of each leaf node l containing a tag that matches \(ADT_{l+1}.tag\) must be adjusted by multiplying l.NCount by \(ADT_{l+1}.value\), which is the adjustment performed by the inner loop of compTCount.

2. Termination We denote by n the initial number of leaf nodes in \(\mathcal {R}\). If \(n > 0\), then we enter the outer loop. We also denote by m the initial number of ADT tokens in \(\mathcal {N}\). If \(m > 0\) then we enter the inner loop such that after each iteration, the variable m is decreased by one, and it keeps strictly decreasing until \(m=0\) where the inner loop terminates. Similarly, the outer loop will terminate as n keeps strictly decreasing until it reaches 0; at that stage, the algorithm terminates. \(\square \)
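The two nested loops discussed in this proof can be sketched with a toy Exponential ElGamal implementation. The code is an illustration under several assumptions rather than Algorithm 4.7 itself: the group (the multiplicative group modulo the Mersenne prime \(2^{31}-1\) with generator 7) is far too small for real use, ADT values are modelled as integer percentages, the adjustment is done by raising the ciphertext to that percentage, unmatched tokens are applied as 100 so that all leaves share one denominator, and decryption recovers the plaintext by bounded brute-force search. All function and field names are hypothetical.

```python
import secrets
from dataclasses import dataclass
from typing import List, Set, Tuple

P = 2**31 - 1  # toy prime modulus, insecure by design
G = 7          # generator of the multiplicative group modulo P
Cipher = Tuple[int, int]


def keygen() -> Tuple[int, int]:
    x = secrets.randbelow(P - 2) + 1           # private key in 1..P-2
    return x, pow(G, x, P)                     # (x, h = g^x)


def enc(h: int, m: int) -> Cipher:
    r = secrets.randbelow(P - 2) + 1
    return pow(G, r, P), (pow(G, m, P) * pow(h, r, P)) % P


def add(a: Cipher, b: Cipher) -> Cipher:
    """Homomorphic addition: Enc(m1) * Enc(m2) = Enc(m1 + m2)."""
    return (a[0] * b[0]) % P, (a[1] * b[1]) % P


def scale(a: Cipher, k: int) -> Cipher:
    """Multiply the plaintext by a public integer k: Enc(m)^k = Enc(k * m)."""
    return pow(a[0], k, P), pow(a[1], k, P)


def dec(x: int, c: Cipher, bound: int = 1_000_000) -> int:
    gm = (c[1] * pow(c[0], P - 1 - x, P)) % P  # g^m = c2 * c1^(-x)
    acc = 1
    for m in range(bound + 1):                 # brute-force discrete log
        if acc == gm:
            return m
        acc = (acc * G) % P
    raise ValueError("plaintext exceeds search bound")


@dataclass
class Leaf:
    tags: Set[str]   # tags carried by the leaf node
    ncount: Cipher   # encrypted noisy count


def comp_t_count(leaves: List[Leaf], tokens: List[Tuple[str, int]], h: int) -> Cipher:
    """Encrypted sum of counts, each scaled by the percentage of matching tokens."""
    total = enc(h, 0)
    for leaf in leaves:                        # outer loop over the leaf set
        c = leaf.ncount
        for tag, pct in tokens:                # inner loop over the ADT tokens
            c = scale(c, pct if tag in leaf.tags else 100)
        total = add(total, c)
    return total


x, h = keygen()
leaves = [Leaf({"age:3"}, enc(h, 40)), Leaf({"age:4"}, enc(h, 25))]
exact = dec(x, comp_t_count(leaves, [], h))
partial = dec(x, comp_t_count(leaves, [("age:3", 50)], h))
```

With no tokens the result is the plain sum \(40 + 25 = 65\). With one token covering 50% of the first leaf, the decrypted value is \(40 \times 50 + 25 \times 100 = 4500\), which corresponds to \(45\) after dividing by the common denominator 100.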

About this article

Cite this article

Dagher, G.G., Fung, B.C.M., Mohammed, N. et al. \({\textsf {SecDM}}\): privacy-preserving data outsourcing framework with differential privacy. Knowl Inf Syst 62, 1923–1960 (2020). https://doi.org/10.1007/s10115-019-01405-7

Keywords

  • Cloud computing
  • Data outsourcing
  • Search on encrypted data
  • Differential privacy