Highly distributed and privacy-preserving queries on personal data management systems

  • Regular Paper
  • The VLDB Journal

Abstract

Personal data management system (PDMS) solutions are flourishing, boosted by smart disclosure initiatives and new regulations. PDMSs allow users to easily store and manage data directly generated by their devices or resulting from their (digital) interactions. Users can then leverage the power of their PDMS to benefit from their personal data, for their own good and in the interest of the community. The PDMS paradigm thus brings exciting perspectives by unlocking novel usages, but also raises security issues. An effective approach, considered in several recent works, is to keep user data distributed on personal platforms, secured locally using hardware and/or software security mechanisms. This paper goes beyond the local security issues and addresses the important question of securely querying this massively distributed personal data. To this end, we propose DISPERS, a fully distributed PDMS peer-to-peer architecture. DISPERS allows users to securely and efficiently share and query their personal data, even in the presence of malicious nodes. We consider three increasingly powerful threat models and derive, for each, a security requirement that must be fulfilled to reach a lower bound in terms of sensitive data leakage: (1) hidden communications, (2) random dispersion of data and (3) collaborative proofs. These requirements are incremental and, respectively, resist spied, leaking or corrupted nodes. We show that the expected security level can be guaranteed with near certainty and experimentally validate the efficiency of the proposed protocols, which allow for an adjustable trade-off between the security level and its cost.

Notes

  1. This paper is based on previous studies [38, 39, 40]. In [40], we showed that the execution of a P2P query can indeed rely exclusively on data processor nodes if and only if they are selected in a verifiable random way, which cannot be influenced by corrupted nodes. [39] is a demonstration of the DISPERS architecture and applications. [38] is a PhD manuscript. It includes implementation details of the protocols proposed in this paper and describes a proof-of-concept implementation of the most advanced protocol in the Cozy Cloud product [19].

  2. Issues related to statistical databases (e.g., inferences from results [71], authorized queries, query replay) or to network security (e.g., message drop/delay, routing table poisoning [72]) are complementary to this work and fall outside its scope (see Sects. 8 and 9).

  3. This is possible with distributive aggregate expressions, i.e., expressions whose computation can be distributed over several data processors.

  4. The interested reader can refer to the DISPERS demonstration [39] and the associated video (see https://tinyurl.com/dispers-hrc) for more qualitative aspects, and to [38] for a practical implementation in Cozy Cloud [19].

References

  1. Allard, T., Anciaux, N., Bouganim, L., Guo, Y., et al.: Secure Personal Data Servers: a Vision Paper. PVLDB, 3(1-2), (2010)

  2. Allard, T., Nguyen, B., Pucheral, P.: MET\({}_{\text{A}}\)P: revisiting Privacy-Preserving Data Publishing using secure devices. Distributed and Parallel Databases, 32(2), (2014)

  3. Alvim, M. S., Chatzikokolakis, K., Palamidessi, C., Pazii, A.: Local Differential Privacy on Metric Spaces: Optimizing the Trade-Off with Utility. In IEEE CSF, (2018)

  4. Anciaux, N., Bonnet, P., Bouganim, L., Nguyen, B., et al.: Personal Data Management Systems: The security and functionality standpoint. Information Systems, 80, (2018)

  5. Anciaux, N., Bouganim, L., Pucheral, P., Guo, Y., et al.: MILo-DB: a personal, secure and portable database machine. Distributed and Parallel Databases, 32(1), (2014)

  6. Anciaux, N., Bouganim, L., Pucheral, P., Popa, I. S., et al.: Personal Database Security and Trusted Execution Environments: A Tutorial at the Crossroads. PVLDB, 12(12), (2019)

  7. Aumann, Y., Lindell, Y.: Security against covert adversaries: Efficient protocols for realistic adversaries. J. Cryptol., 23(2), (2010)

  8. Backes, M., Druschel, P., Haeberlen, A., Unruh, D.: CSAR: A Practical and Provable Technique to Make Randomized Systems Accountable. In NDSS, (2009)

  9. Bater, J., Elliott, G., Eggen, C., Goel, S., et al.: SMCQL: Secure Query Processing for Private Data Networks. PVLDB, 10(6), (2017)

  10. Bellet, A., Guerraoui, R., Taziki, M., Tommasi, M.: Personalized and Private Peer-to-Peer Machine Learning. In AISTATS, (2018)

  11. Blond, S. L., Manils, P., Abdelberi, C., Kâafar, M. A., et al.: One bad apple spoils the bunch: Exploiting P2P applications to trace and profile tor users. In USENIX LEET, (2011)

  12. Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., et al.: Practical Secure Aggregation for Privacy-Preserving Machine Learning. In ACM CCS, (2017)

  13. Carpentier, R., Popa, I. S., Anciaux, N.: Reducing data leakage on personal data management systems. In IEEE EuroS&P, (2021)

  14. Carpentier, R., Thiant, F., Sandu Popa, I., Anciaux, N., et al.: An Extensive and Secure Personal Data Management System using SGX. In EDBT, (2022)

  15. Castro, M., Druschel, P., Ganesh, A., Rowstron, A., et al.: Secure routing for structured peer-to-peer overlay networks. ACM SIGOPS Operating Systems Review, 36(SI), (2002)

  16. Castro, M., Liskov, B.: Practical Byzantine Fault Tolerance. In OSDI, (1999)

  17. Cormode, G., Kulkarni, T., Srivastava, D.: Answering Range Queries Under Local Differential Privacy. PVLDB, 12(10), (2019)

  18. Corrigan-Gibbs, H., Boneh, D.: Prio: Private, robust, and scalable computation of aggregate statistics. In NSDI, (2017)

  19. Cozy Cloud. A smart personal cloud to gather all your data. (see https://cozy.io/en), (2021)

  20. De Montjoye, Y.-A., Shmueli, E., Wang, S. S., Pentland, A. S.: OpenPDS: Protecting the privacy of metadata through safeanswers. PloS one, 9(7), (2014)

  21. Dingledine, R., Mathewson, N., Syverson, P.: Tor: The second-generation onion router. In USENIX SSYM, (2004)

  22. Douceur, J.: The Sybil attack. In Int. Workshop on Peer-to-Peer Systems, (2002)

  23. European Commission. Proposal for a regulation on european data governance (data governance act), com/2020/767. [eur-lex], 25 (October 2020). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52020PC0767

  24. European Parliament. General Data Protection Regulation. (see https://gdpr-info.eu/), (2018)

  25. Faruki, P., Bharmal, A., Laxmi, V., Ganmoor, V., et al.: Android security: A survey of issues, malware penetration, and defenses. IEEE Communications Surveys Tutorials, 17(2), (2015)

  26. Gulati, M., Smith, M. J., Yu, S.-Y.: Security enclave processor for a system on a chip, (2014). US Patent 8,832,465

  27. Gupta, P., Li, Y., Mehrotra, S., Panwar, N., et al.: Obscure: Information-Theoretic Oblivious and Verifiable Aggregation Queries. volume 12, (2019)

  28. Hayek, R., Raschia, G., Valduriez, P., Mouaddib, N.: Summary management in P2P systems. In EDBT, (2008)

  29. Heiser, G., Elphinstone, K.: L4 Microkernels: The Lessons from 20 Years of Research and Deployment. ACM Trans. Comput. Syst., 34(1), (2016)

  30. Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301), (1963)

  31. Joung, Y., Yang, L., Fang, C.: Keyword search in DHT-based peer-to-peer networks. IEEE Journal on Selected Areas in Communications, 25(1), (2007)

  32. Kermarrec, A., Taïani, F.: Want to scale in centralized systems? Think P2P. J. Internet Services and Applications, 6(1), (2015)

  33. Ladjel, R., Anciaux, N., Pucheral, P., Scerri, G.: A Manifest-Based Framework for Organizing the Management of Personal Data at the Edge of the Network. In ISD, (2019)

  34. Ladjel, R., Anciaux, N., Pucheral, P., Scerri, G.: Trustworthy Distributed Computations on Personal Data Using Trusted Execution Environments. In TrustCom, (2019)

  35. Lallali, S., Anciaux, N., Popa, I. S., Pucheral, P.: Supporting secure keyword search in the personal cloud. Information Systems, 72, (2017)

  36. Lamport, L., Shostak, R., Pease, M.: The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst., 4(3), (1982)

  37. Lee, S., Wong, E. L., Goel, D., Dahlin, M., et al.: \(\pi \)box: A platform for privacy-preserving apps. In NSDI, (2013)

  38. Loudet, J.: Distributed and Privacy-Preserving Personal Queries on Personal Clouds. PhD thesis, Versailles University, (2019)

  39. Loudet, J., Popa, I. S., Bouganim, L.: DISPERS: Securing Highly Distributed Queries on Personal Data Management Systems. PVLDB, 12(12), (2019)

  40. Loudet, J., Popa, I. S., Bouganim, L.: SEP2P: Secure and Efficient P2P Personal Data Processing. In EDBT, (2019)

  41. Maiyya, S., Zakhary, V., Amiri, M. J., Agrawal, D., et al.: Database and Distributed Computing Foundations of Blockchains. In SIGMOD, (2019)

  42. Maymounkov, P., Mazieres, D.: Kademlia: A peer-to-peer information system based on the xor metric. In Int. Workshop on Peer-to-Peer Systems, (2002)

  43. Menezes, A., van Oorschot, P. C., Vanstone, S. A.: Handbook of Applied Cryptography. (1996)

  44. Merkle, R. C.: A Digital Signature Based on a Conventional Encryption Function. In CRYPTO, volume 293, (1987)

  45. Mirval, J., Bouganim, L., Popa, I. S.: Practical fully-decentralized secure aggregation for personal data management systems. In SSDBM, (2021)

  46. MyData Global. Empowering individuals by improving their right to self-determination regarding their personal data. (see https://mydata.org), (2020)

  47. Nanni, M., Andrienko, G. L., Barabási, A., Boldrini, C., et al.: Give more data, awareness and control to individual citizens, and they will help COVID-19 containment. Trans. Data Priv., 13(1), (2020)

  48. Nextcloud. The self-hosted productivity platform that keeps you in control. (see https://nextcloud.com), (2021)

  49. Nilsson, A., Bideh, P. N., Brorsson, J.: A survey of published attacks on Intel SGX. CoRR, arXiv:2006.13598, (2020)

  50. Nithyanand, R., Starov, O., Gill, P., Zair, A., et al.: Measuring and mitigating as-level adversaries against tor. In NDSS, (2016)

  51. Özsu, M. T., Valduriez, P.: Principles of Distributed Database Systems, 4th Edition. Springer, (2020)

  52. Pinto, S., Santos, N.: Demystifying Arm TrustZone: A Comprehensive Survey. ACM Comput. Surv., 51(6), (2019)

  53. Popa, I. S., That, D. H. T., Zeitouni, K., Borcea, C.: Mobile participatory sensing with strong privacy guarantees using secure probes. GeoInformatica, 25(3), (2021)

  54. Popa, R. A., Blumberg, A. J., Balakrishnan, H., Li, F. H.: Privacy and accountability for location-based aggregate statistics. In CCS, (2011)

  55. Priebe, C., Vaswani, K., Costa, M.: EnclaveDB: A Secure Database Using SGX. In IEEE S&P, (2018)

  56. Rabin, M. O.: Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance. J. ACM, 36(2), (1989)

  57. Ratnasamy, S., Francis, P., Handley, M., Karp, R. M., et al.: A scalable content-addressable network. In ACM SIGCOMM, (2001)

  58. Reed, M. G., Syverson, P. F., Goldschlag, D. M.: Anonymous connections and onion routing. IEEE Journal on Selected Areas in Communications, 16(4), (1998)

  59. Reynolds, P., Vahdat, A.: Efficient peer-to-peer keyword searching. In Middleware, (2003)

  60. Sabt, M., Achemlal, M., Bouabdallah, A.: Trusted Execution Environment: What It is, and What It is Not. In TrustCom/BigDataSE/ISPA (1), (2015)

  61. Saleh, E., Alsa’deh, A., Kayed, A., Meinel, C.: Processing over encrypted data: between theory and practice. ACM SIGMOD Record, 45(3), (2016)

  62. Secure Data Hub. Output Confidentiality Rules. (see https://www.casd.eu/wp/wp-content/uploads/Output_Confidentiality_Rules.pdf), (2021)

  63. Shamir, A.: How to Share a Secret. Commun. ACM, 22(11), (1979)

  64. Skobeltsyn, G., Luu, T., Zarko, I. P., Rajman, M., et al.: Web text retrieval with a P2P query-driven index. In SIGIR, (2007)

  65. Solid. All of your data, under your control. (see https://solidproject.org/), (2021)

  66. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., et al.: Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM, 31(4), (2001)

  67. Tang, C., Dwarkadas, S.: Hybrid global-local indexing for efficient peer-to-peer information retrieval. In NSDI, (2004)

  68. Tang, C., Xu, Z., Dwarkadas, S.: Peer-to-peer information retrieval using self-organizing semantic overlay networks. In ACM SIGCOMM, (2003)

  69. To, Q., Nguyen, B., Pucheral, P.: Private and Scalable Execution of SQL Aggregates on a Secure Decentralized Architecture. ACM Trans. Database Syst., 41(3), (2016)

  70. Tomàs, J. C., Amann, B., Travers, N., Vodislav, D.: RoSeS: a continuous query processor for large-scale RSS filtering and aggregation. In ACM CIKM, (2011)

  71. Unnikrishnan, J., Naini, F. M.: De-anonymizing private data by matching statistics. In IEEE Allerton, (2013)

  72. Urdaneta, G., Pierre, G., Steen, M. V.: A survey of DHT security techniques. ACM Computing Surveys (CSUR), 43(2), (2011)

  73. Volgushev, N., Schwarzkopf, M., Getchell, B., Varia, M., et al.: Conclave: Secure multi-party computation on big data. In EuroSys, (2019)

  74. Wang, Q., Borisov, N.: Octopus: A Secure and Anonymous DHT Lookup. In ICDCS, (2012)

  75. Yang, Y., Dunlap, R., Rexroad, M., Cooper, B. F.: Performance of full text search in structured and unstructured peer-to-peer systems. In INFOCOM, (2006)

  76. Zhang, Z., Wang, T., Li, N., He, S., et al.: CALM: Consistent Adaptive Local Marginal for Marginal Release under Local Differential Privacy. In ACM CCS, (2018)

  77. Zheng, K., Mou, W., Wang, L.: Collect at Once, Use Effectively: Making Non-interactive Locally Private Learning Possible. In ICML, volume 70, (2017)

  78. Have I Been Pwned. Check if you have an account that has been compromised. (see https://haveibeenpwned.com/). Last accessed July 2022

Acknowledgements

This research was partially supported by the ANR PersoCloud grant ANR-16-CE39-0014 and by the PEPR iPoP.

Author information

Corresponding author

Correspondence to Luc Bouganim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Background on cryptography

Symmetric encryption is computationally efficient but requires a symmetric encryption key \( k_{sym} \) known beforehand by both parties. In contrast, asymmetric encryption is a computationally demanding operation that relies on a pair of keys: the private key, \( k_{priv} \), and its matching public key, \( k_{pub} \). To avoid man-in-the-middle attacks, \( k_{pub} \) must be certified. Hybrid encryption uses asymmetric encryption to securely exchange a symmetric encryption key and thus combines the advantages of both encryption schemes. To ensure forward secrecy [43], a new symmetric key is used for each communication session. The widely used TLS protocol is based on hybrid encryption and also provides integrity and authentication of the communicating parties.

A cryptographic hash function [43], referred to as \(\mathtt {hash}()\), is a one-way function that maps data of arbitrary size to a fixed-size bit string (e.g., 256 bits), is collision resistant, and provides a uniform distribution of its outputs.
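As an illustration of these properties, the following minimal Python sketch uses SHA-256 from the standard library as one possible instantiation of \(\mathtt {hash}()\). The helper name `hash_hex` and the sample inputs are assumptions made for this example only.

```python
import hashlib

def hash_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical inputs of very different sizes.
d1 = hash_hex(b"profile: age=42")
d2 = hash_hex(b"profile: age=43")
d3 = hash_hex(b"x" * 10_000)

# The output size is fixed (64 hex characters = 256 bits) whatever the input size.
print(len(d1) * 4, len(d3) * 4)
# Changing a single character of the input yields a different digest.
print(d1 != d2)
```

The digest length does not depend on the input length, and two inputs differing in one character produce different digests.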

A digital signature [43] can be used to prove that a data item d was produced by an entity E (authentication) and has not been altered (integrity). A signature contains the encryption of \(\mathtt {hash}(d)\) using \({ k_{priv} }_E\) and the certificate of \({ k_{pub} }_E\), \( cert_E \). Anyone can verify a signature by checking the certificate, decrypting the encrypted hash with \({ k_{pub} }_E\), and finally comparing the result with \(\mathtt {hash}(d)\) (recomputed by the verifier).
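The sign-then-verify principle can be sketched with textbook RSA in a few lines of Python. This is a toy under loud assumptions: the key is built from two small Mersenne primes, there is no padding, the hash is simply reduced modulo the RSA modulus, and certificate checking is omitted. It is not secure and not the scheme used by DISPERS. All names (`sign`, `verify`, `digest`, `P`, `Q`, `N`, `E`, `D`) are hypothetical.

```python
import hashlib

# Toy textbook-RSA key pair (illustration only, NOT secure).
P, Q = 2**61 - 1, 2**89 - 1            # two known Mersenne primes
N = P * Q                              # public modulus
E = 65537                              # public exponent: k_pub = (N, E)
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent: k_priv = (N, D); needs Python 3.8+

def digest(d: bytes) -> int:
    """hash(d) as an integer, reduced modulo N (a toy shortcut, not real padding)."""
    return int.from_bytes(hashlib.sha256(d).digest(), "big") % N

def sign(d: bytes) -> int:
    """'Encrypt' hash(d) with the private key."""
    return pow(digest(d), D, N)

def verify(d: bytes, signature: int) -> bool:
    """'Decrypt' the signature with the public key and compare with a recomputed hash(d)."""
    return pow(signature, E, N) == digest(d)
```

With these definitions, `verify(d, sign(d))` holds for any byte string `d`, whereas a signature checked against altered data is rejected.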

Shamir’s Secret Sharing Scheme (SSSS) [63] consists in dividing some data d into n shares \(d_1, \dots , d_n\) in such a way that: (i) knowledge of any t (\(t\le n\)) or more shares makes d easily computable; but (ii) knowledge of any \(t - 1\) or fewer shares leaves d protected (it reveals no information about d). t is called the threshold value (see Sect. 8) and is set so that the scheme tolerates the failure of \(n - t\) shareholders. The low, polynomial complexity of SSSS (i.e., Lagrange interpolation) for both secret decomposition and reconstruction makes it an ideal solution for a fully distributed system like DISPERS, in which any PDMS node has to securely store its profile in the DHT or can be selected as an actor node (Profile Sampler or Target Finder) to recompose a secret. Note that DISPERS employs the basic SSSS and does not require more advanced (and much costlier) operations such as string matching on secret shares or order-preserving secret sharing, e.g., as used in [27].
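The (t, n) threshold property can be made concrete with a minimal Python sketch of the basic scheme over a prime field. The field modulus `PRIME`, the function names `split` and `reconstruct`, and the encoding of the secret as an integer are assumptions of this sketch, not details taken from the DISPERS implementation.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an assumption of this sketch

def split(secret: int, n: int, t: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares such that any t of them reconstruct it."""
    assert 0 <= secret < PRIME and 1 <= t <= n
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]

    def poly(x: int) -> int:
        y = 0
        for c in reversed(coeffs):      # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        return y

    # Share i is the point (i, poly(i)); x = 0 is never handed out.
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, with `shares = split(s, 5, 3)`, any three shares passed to `reconstruct` return `s`, so the loss of up to \(n - t = 2\) shareholders is tolerated.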

Anonymous communications can be obtained using the onion routing technique [58]. The sender selects all the routers and asymmetrically encrypts the message “in layers,” like an onion. Each router decrypts one layer and thereby discovers the next router, until the message reaches its destination.

A Merkle Hash Tree (MHT) [44] is a tree data structure in which leaf labels are hashes of data blocks \(d_1, \dots , d_n\), and the remaining tree nodes are labeled with the hash of their children’s labels. The root of the tree is digitally signed, which makes it possible to check the integrity of any data block by computing the intermediate hashes from the leaf up to the root and verifying that the computed root matches the signed one. MHTs are particularly useful to check the integrity of a given block \(d_i\) without disclosing the other data blocks: only the intermediate hashes in the MHT are revealed.
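A minimal Python sketch of this integrity check is given here. It assumes SHA-256 as the hash function and duplicates the last node of a level that has an odd number of nodes (one common convention among several). The signature of the root is omitted, and the names `build_levels`, `proof` and `verify` are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, from the leaf hashes up to the single root."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # odd level: duplicate the last node
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes needed to recompute the root from leaf `index`."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1                     # neighbour within the same pair
        path.append((level[sibling], sibling < index))  # (hash, sibling is on the left)
        index //= 2
    return path

def verify(block: bytes, path: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one block and its sibling hashes only."""
    node = h(block)
    for sibling, is_left in path:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root
```

The verifier receives a single block and a logarithmic number of hashes. The other blocks are never disclosed, and any alteration of the checked block changes the recomputed root.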

A verifiable random number generation protocol is a protocol that allows n nodes to produce a random value R, while guaranteeing that none of the n nodes can choose or influence the value of R. This is possible provided that at least one of the n nodes is honest. A version of this protocol is described in detail in our previous work [40] and is adapted from [8], which includes a formal proof. It roughly unfolds as follows: (i) each node selects a random value \(r_i\) and commits to it by sending \(\mathtt {hash}(r_i)\) to a coordinator; (ii) the list of hash values, L, is disclosed by the coordinator to the n nodes; (iii) each node then checks that \(\mathtt {hash}(r_i) \in L\) and, if so, sends \(r_i\) and a signature of L back to the coordinator.

R is finally obtained by computing the XOR of the n individual random values. An attacker controlling \(n - 1\) nodes cannot influence R since these nodes cannot change their \(r_i\), committed with \(\mathtt {hash}(r_i)\). Thus, the random value of a single honest node is enough to obtain a truly random final value.
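The commit-then-reveal structure of steps (i) to (iii) and the final XOR can be sketched in Python as follows. This is a single-process simulation under simplifying assumptions: messages are plain function calls, the signatures of L are omitted, and the coordinator is assumed to check each revealed value against its commitment. The names `Node`, `commit` and `coordinate` are hypothetical and do not come from [40] or [8].

```python
import hashlib
import secrets
from functools import reduce

def commit(r: bytes) -> bytes:
    """Commitment to r: its SHA-256 hash."""
    return hashlib.sha256(r).digest()

class Node:
    def __init__(self) -> None:
        self.r = secrets.token_bytes(32)         # (i) private random value r_i

    def commitment(self) -> bytes:
        return commit(self.r)                    # (i) hash(r_i) sent to the coordinator

    def reveal(self, L: list[bytes]) -> bytes:
        # (iii) reveal r_i only if the node's own commitment appears in L
        if self.commitment() not in L:
            raise ValueError("commitment missing from L")
        return self.r

def coordinate(nodes: list[Node]) -> bytes:
    L = [n.commitment() for n in nodes]          # (i) collect, (ii) disclose L
    revealed = [n.reveal(L) for n in nodes]      # (iii) collect the r_i
    for c, r in zip(L, revealed):                # a revealed value must match its commitment
        if commit(r) != c:
            raise ValueError("revealed value does not match commitment")
    # R is the bitwise XOR of all revealed values.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), revealed)
```

A node that tries to reveal a value other than the one it committed to is detected by the commitment check, which is why fixing the commitments before any value is revealed prevents the last revealer from steering R.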

Background on distributed hash tables

A Distributed Hash Table (DHT) in a P2P network [51] offers an optimized solution to the problem of locating the node storing a specific data item. The DHT offers a basic interface allowing nodes to store data, i.e., \(\mathtt {store(key,value)}\), or to search for certain data, i.e., \(\mathtt {lookup(key) \rightarrow value}\). DHT proposals share the concepts of keyspace or DHT virtual space (e.g., a 256-bit string obtained by hashing the key or the node ID with the SHA256 algorithm), space partitioning (mapping space partitions to nodes, generally using a distance function), and overlay network (routing tables and strategies that allow reaching a node, given its ID). For instance, the virtual space is represented as a multi-dimensional space in CAN [57], as a ring in Chord [66] or as a binary tree in Kademlia [42], and is uniformly divided among the nodes in the network. Thus, each node is responsible for the storage of all the \(\mathtt {(key,~value)}\) pairs whose key falls in the subspace it manages. The \(\mathtt {store}\) and \(\mathtt {lookup}\) operations are fully distributed: DHTs do not require any central coordination. They are scalable, fault tolerant and provide a uniform distribution of the data.
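The \(\mathtt {store}\)/\(\mathtt {lookup}\) interface and the space partitioning can be illustrated by a toy, in-process Python simulation of a ring-shaped keyspace. It is only a sketch: there is no network and no overlay routing, the responsible node is found with a local sorted list, and the successor rule used for partitioning is one possible choice. The class `ToyDHT` and its method names are hypothetical.

```python
import bisect
import hashlib

def to_id(s: str) -> int:
    """Map a key or a node name to a point of the 256-bit virtual space."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")

class ToyDHT:
    """Ring partitioning: a key is managed by the first node ID at or after hash(key)."""

    def __init__(self, node_names: list[str]) -> None:
        self.ring = sorted((to_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage: dict[str, dict[str, str]] = {n: {} for n in node_names}

    def responsible_node(self, key: str) -> str:
        # Wrap around to the first node when hash(key) exceeds the largest node ID.
        pos = bisect.bisect_left(self.ids, to_id(key)) % len(self.ring)
        return self.ring[pos][1]

    def store(self, key: str, value: str) -> None:
        self.storage[self.responsible_node(key)][key] = value

    def lookup(self, key: str):
        return self.storage[self.responsible_node(key)].get(key)
```

Because node IDs and keys are both produced by the hash function, the keys spread over the nodes without any central directory. A real DHT replaces the local sorted list with routing tables, so that the responsible node is reached in a bounded number of hops.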

About this article

Cite this article

Bouganim, L., Loudet, J. & Sandu Popa, I. Highly distributed and privacy-preserving queries on personal data management systems. The VLDB Journal 32, 415–445 (2023). https://doi.org/10.1007/s00778-022-00753-1

  • DOI: https://doi.org/10.1007/s00778-022-00753-1
