
MET𝔸P: revisiting Privacy-Preserving Data Publishing using secure devices


Abstract

The goal of Privacy-Preserving Data Publishing (PPDP) is to generate a sanitized (i.e. harmless) view of sensitive personal data (e.g. a health survey), to be released to some agencies or simply the public. However, traditional PPDP practices all make the assumption that the process is run on a trusted central server. In this article, we argue that the trust assumption on the central server is far too strong. We propose Met𝔸P, a generic fully distributed protocol, to execute various forms of PPDP algorithms on an asymmetric architecture composed of low-power secure devices and a powerful but untrusted infrastructure. We show that this protocol is both correct and secure against honest-but-curious or malicious adversaries. Finally, we provide an experimental validation showing that this protocol can support PPDP processes scaling up to nation-wide surveys.




Notes

  1. http://datalossdb.org/.

  2. For example, the European directive 95/46/EC.

  3. Roughly speaking, a smart token is the combination of a tamper-resistant smart-card microcontroller with a mass-storage (gigabyte-sized) NAND Flash chip.

  4. http://www.healthecard.co.uk.

  5. http://www.gematik.de.

  6. http://www.lifemedid.com/.

  7. http://www-smis.inria.fr/DMSP/home.php.

  8. http://www.freedomboxfoundation.org/.

  9. http://tinyurl.com/ArmTrustzoneReport.

  10. In order to avoid ambiguity we use the term protocol to denote the distributed implementation of an algorithm.

  11. http://www.commoncriteriaportal.org/.

  12. For example, a recipient (e.g., a research team) may use its own server, the computing facilities of a larger institution (e.g., a hospital), or even outsource these services to a third party (e.g., a Cloud provider), assuming this third party formally commits to providing the same privacy guarantees as those imposed on the recipient itself.

  13. The detection of an attack puts the attacker in an awkward position. If the data leak is revealed publicly, participants are likely to refuse to take part in further studies, causing irreversible political/financial damage, and they may even launch a class action.

  14. In [46], the (d,γ)-Privacy model is shown to be equivalent to the ε-Indistinguishability model [15].

  15. The degree of parallelism is determined by the number of secure devices that connect together during the sanitization phase, every token being eligible to participate in this phase, including those that did not participate in the collection phase. In the extreme case where secure devices connect one after the other, the processing is simply performed sequentially.

  16. A similar tradeoff occurs in the query outsourcing approach, where an untrusted host must be able to execute queries over encrypted data (e.g., [26]).

  17. Revealing the true or fake nature of a record does not endanger privacy unless the recipient is able to link it to the corresponding decrypted record. At this point of the protocol, it is linked to the encrypted record only; Sect. 6.1.2 focuses on attacks that aim at linking it to its corresponding decrypted record.

  18. Informally speaking, a MAC can be seen as a cryptographic hash whose output depends on a secret key.

  19. Due to space reasons, in this paper we present sketches of proofs. The complete proofs of all theorems can be found in [2].

  20. Due to space reasons, in this paper and in Appendix A, we present sketches of the safety properties’ implementation. The complete description of all implementations can be found in [2].

  21. Our implementation of Mondrian publishes the sanitized release in a form equivalent to two tables, one containing the raw quasi-identifiers and the other the sensitive data, which can be (lossy) joined through a class identifier. This form of release, called Anatomy, was proposed in [49] to increase utility.

  22. http://ipums.org/.

  23. In general, the identifier can be implemented simply by letting secure devices generate a random number. It has to be large enough with respect to the number of tuples to collect to make collisions improbable, so that in the rare collision cases the recipient simply keeps one of the colliding tuples. For example, around 5 billion 64-bit random numbers have to be generated to reach a 50% collision probability.

  24. Many common statistical computations can be built from a simple count primitive [9].

References

  1. Agrawal, S., Haritsa, J.R.: A framework for high-accuracy privacy-preserving mining. In: Proceedings of the 21st International Conference on Data Engineering, ICDE’05, pp. 193–204. IEEE Comput. Soc., Washington (2005)

  2. Allard, T.: Sanitizing microdata without leak: a decentralized approach. Ph.D. thesis, University of Versailles (2011)

  3. Allard, T., Anciaux, N., Bouganim, L., Guo, Y., Le Folgoc, L., Nguyen, B., Pucheral, P., Ray, I., Ray, I., Yin, S.: Secure personal data servers: a vision paper. Proc. VLDB Endow. 3, 25–35 (2010)

  4. Allard, T., Nguyen, B., Pucheral, P.: Safe realization of the generalization privacy mechanism. In: Proceedings of the 9th International Conference on Privacy, Security and Trust, PST’11, pp. 16–23 (2011)

  5. Allard, T., Nguyen, B., Pucheral, P.: Sanitizing microdata without leak: combining preventive and curative actions. In: Proceedings of the 7th International Conference on Information Security Practice and Experience, ISPEC’11, pp. 333–342. Springer, Berlin (2011)

  6. Anciaux, N., Bouganim, L., Guo, Y., Pucheral, P., Vandewalle, J.-J., Yin, S.: Pluggable personal data servers. In: Proceedings of the 2010 International Conference on Management of Data, SIGMOD’10, pp. 1235–1238. ACM, New York (2010)

  7. Bajaj, S., Sion, R.: TrustedDB: a trusted hardware based database with privacy and data confidentiality. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD’11, pp. 205–216. ACM, New York (2011)

  8. Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, New York (1994)

  9. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’05, pp. 128–138. ACM, New York (2005)

  10. Boldyreva, A., Chenette, N., O’Neill, A.: Order-preserving encryption revisited: improved security analysis and alternative solutions. In: Proceedings of the 31st Annual Conference on Advances in Cryptology, CRYPTO’11, pp. 578–595. Springer, Berlin (2011)

  11. Cao, J., Karras, P., Kalnis, P., Tan, K.-L.: SABRE: a Sensitive Attribute Bucketization and REdistribution framework for t-closeness. VLDB J. 20, 59–81 (2011)

  12. Chan, H., Hsiao, H.-C., Perrig, A., Song, D.: Secure distributed data aggregation. Found. Trends Databases 3(3), 149–201 (2011)

  13. Chen, B.-C., Kifer, D., LeFevre, K., Machanavajjhala, A.: Privacy-preserving data publishing. Found. Trends Databases 2(1–2), 1–167 (2009)

  14. Cormode, G.: Personal privacy vs population privacy: learning to attack anonymization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’11, pp. 1253–1261. ACM, New York (2011)

  15. Dwork, C.: Differential privacy. In: Proceedings of the 33rd International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 4052, pp. 1–12. Springer, Berlin (2006)

  16. Eurosmart: Smart USB token (white paper). Eurosmart (2008)

  17. Fischlin, M., Pinkas, B., Sadeghi, A.-R., Schneider, T., Visconti, I.: Secure set intersection with untrusted hardware tokens. In: Proceedings of the 11th International Conference on Topics in Cryptology: CT-RSA 2011, CT-RSA’11, pp. 1–16. Springer, Berlin (2011)

  18. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42, 14 (2010)

  19. Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with low information loss. In: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB’07, pp. 758–769. VLDB Endowment (2007)

  20. Giesecke & Devrient: Portable security token. http://www.gd-sfs.com/portable-security-token. Accessed 27 June 2012

  21. Goldreich, O.: Foundations of cryptography: a primer. Found. Trends Theor. Comput. Sci. 1(1), 1–116 (2005)

  22. Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game. In: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC’87, pp. 218–229. ACM, New York (1987)

  23. Gordon, L.A., Loeb, M.P., Lucyshyn, W., Richardson, R.: 2006 CSI/FBI Computer Crime and Security Survey. Computer Security Institute, Hudson (2006)

  24. Goyal, V., Ishai, Y., Mahmoody, M., Sahai, A.: Interactive locking, zero-knowledge PCPs, and unconditional cryptography. In: Rabin, T. (ed.) Advances in Cryptology—CRYPTO 2010. Lecture Notes in Computer Science, vol. 6223, pp. 173–190. Springer, Berlin (2010)

  25. Goyal, V., Ishai, Y., Sahai, A., Venkatesan, R., Wadia, A.: Founding cryptography on tamper-proof hardware tokens. In: Micciancio, D. (ed.) Theory of Cryptography. Lecture Notes in Computer Science, vol. 5978, pp. 308–326. Springer, Berlin (2010)

  26. Hacigümüş, H., Iyer, B., Li, C., Mehrotra, S.: Executing SQL over encrypted data in the database-service-provider model. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD’02, pp. 216–227. ACM, New York (2002)

  27. Hazay, C., Lindell, Y.: Constructions of truly practical secure protocols using standard smartcards. In: Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS’08, pp. 491–500. ACM, New York (2008)

  28. IDC: IDC defines the personal portable security device market. http://tinyurl.com/IDC-PPSD. Accessed 27 June 2012

  29. Järvinen, K., Kolesnikov, V., Sadeghi, A.-R., Schneider, T.: Embedded SFE: offloading server and network using hardware tokens. In: Proceedings of the 14th International Conference on Financial Cryptography and Data Security, FC’10, pp. 207–221. Springer, Berlin (2010)

  30. Jiang, W., Clifton, C.: A secure distributed framework for achieving k-anonymity. VLDB J. 15, 316–333 (2006)

  31. Jurczyk, P., Xiong, L.: Distributed anonymization: achieving privacy for both data subjects and data providers. In: IFIP WG 11.3 Working Conference on Data and Applications Security, pp. 191–207. Springer, Berlin (2009)

  32. Katz, J.: Universally composable multi-party computation using tamper-proof hardware. In: Proceedings of the 26th Annual International Conference on Advances in Cryptology, EUROCRYPT’07, pp. 115–128. Springer, Berlin (2007)

  33. Kifer, D.: Attacks on privacy and deFinetti’s theorem. In: Proceedings of the 35th SIGMOD International Conference on Management of Data, SIGMOD’09, pp. 127–138. ACM, New York (2009)

  34. Kifer, D., Lin, B.-R.: Towards an axiomatization of statistical privacy and utility. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’10, pp. 147–158. ACM, New York (2010)

  35. Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: Proceedings of the 2011 International Conference on Management of Data, SIGMOD’11, pp. 193–204. ACM, New York (2011)

  36. Kifer, D., Machanavajjhala, A.: A rigorous and customizable framework for privacy. In: Proceedings of the 31st Symposium on Principles of Database Systems, PODS’12, pp. 77–88. ACM, New York (2012)

  37. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE’06, p. 25. IEEE Comput. Soc., Washington (2006)

  38. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of the 23rd IEEE International Conference on Data Engineering, ICDE’07, pp. 106–115 (2007)

  39. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE’06, p. 24. IEEE Comput. Soc., Washington (2006)

  40. Machanavajjhala, A., Gehrke, J., Götz, M.: Data publishing against realistic adversaries. Proc. VLDB Endow. 2(1), 790–801 (2009)

  41. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’04, pp. 223–228. ACM, New York (2004)

  42. Mohammed, N., Fung, B.C.M., Wang, K., Hung, P.C.K.: Privacy-preserving data mashup. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT’09, pp. 228–239. ACM, New York (2009)

  43. Mohammed, N., Fung, B.C.M., Hung, P.C.K., Lee, C.-K.: Centralized and distributed anonymization for high-dimensional healthcare data. ACM Trans. Knowl. Discov. Data 4, 18 (2010)

  44. Mohammed, N., Chen, R., Fung, B.C.M., Yu, P.S.: Differentially private data release for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’11, pp. 493–501. ACM, New York (2011)

  45. Pandey, O., Rouselakis, Y.: Property preserving symmetric encryption. In: Proceedings of the 31st Annual International Conference on Theory and Applications of Cryptographic Techniques, EUROCRYPT’12, pp. 375–391. Springer, Berlin (2012)

  46. Rastogi, V., Suciu, D., Hong, S.: The boundary between privacy and utility in data anonymization. CoRR (2006). arXiv:cs/0612103

  47. Rastogi, V., Suciu, D., Hong, S.: The boundary between privacy and utility in data publishing. In: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB’07, pp. 531–542. VLDB Endowment (2007)

  48. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)

  49. Xiao, X., Tao, Y.: Anatomy: simple and effective privacy preservation. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB’06, pp. 139–150. VLDB Endowment (2006)

  50. Xue, M., Papadimitriou, P., Raïssi, C., Kalnis, P., Pung, H.K.: Distributed privacy preserving data collection. In: Proceedings of the 16th International Conference on Database Systems for Advanced Applications—Volume Part I, DASFAA’11, pp. 93–107. Springer, Berlin (2011)

  51. Yao, A.C.: Protocols for secure computations. In: Proceedings of the 23rd Annual Symposium on Foundations of Computer Science, SFCS’82, pp. 160–164. IEEE Comput. Soc., Washington (1982)

  52. Zhang, N., Zhao, W.: Distributed privacy preserving information sharing. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB’05, pp. 889–900. VLDB Endowment (2005)

  53. Zhong, S., Yang, Z., Wright, R.N.: Privacy-enhancing k-anonymization of customer data. In: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’05, pp. 139–147. ACM, New York (2005)

  54. Zhong, S., Yang, Z., Chen, T.: k-anonymous data collection. Inf. Sci. 179, 2948–2963 (2009)


Author information


Corresponding author

Correspondence to Tristan Allard.

Additional information

Communicated by Elena Ferrari.

Appendices

Appendix A: Malicious recipient

A.1 Protecting \(\mathcal{T}^{s}\)

Forge actions precluded

The forge tampering action allows the attacker to forge tuples to be sanitized. The harmful effects of such an attack are obvious; for example, within the αβ-Algorithm, the recipient could forge all the fake records, launch the sanitization, and identify and remove them from the sanitized release. The Origin safety property (Definition 2) states that each collected tuple must be accompanied by a signature binding the encrypted record to its security information and guaranteeing their authenticity: the recipient is then unable to forge authentic s-tuples.

Definition 2

(Origin safety property)

In order to respect the Origin safety property, each c-tuple embeds a signature (e.g., a randomized MAC), denoted σ, which is the result of signing the c-tuple’s encrypted record concatenated to its security information: \(\forall t^{c}\in\mathcal{T}^{c}\): \(t^{c}.\sigma\leftarrow\mathrm{MAC}(t^{c}.e \,\|\, t^{c}.\zeta)\), where ∥ denotes the concatenation operator. Each s-tuple embeds the signature of its corresponding c-tuple as a proof of authenticity.
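To make Definition 2 concrete, here is a minimal Python sketch of the Origin check, assuming a hypothetical MAC key shared by the secure devices only (never revealed to the recipient), HMAC-SHA256 as the MAC, and a length-prefixed encoding of the concatenation; the randomization of the MAC is omitted for brevity.

```python
import hashlib
import hmac

DEVICE_KEY = b"secret shared by the secure devices only"  # assumption: never leaves the devices

def sign_ctuple(e: bytes, zeta: bytes) -> bytes:
    """sigma <- MAC(e || zeta): binds the encrypted record e to its
    security information zeta (the length prefix disambiguates '||')."""
    msg = len(e).to_bytes(4, "big") + e + zeta
    return hmac.new(DEVICE_KEY, msg, hashlib.sha256).digest()

def check_origin(e: bytes, zeta: bytes, sigma: bytes) -> bool:
    """Run inside a secure device: an s-tuple forged by the recipient,
    which does not know DEVICE_KEY, cannot pass this check."""
    return hmac.compare_digest(sigma, sign_ctuple(e, zeta))
```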

Copy actions precluded

Copy actions allow the recipient to artificially increase the number of tuples to be sanitized in \(\mathcal{T}^{s}\) and to identify their corresponding sanitized tuples in \(\mathcal{T}^{r}\) based on the number of copies. For example, within the αβ-Algorithm, the recipient may store a single fake tuple in \(\mathcal{T}^{s}\), copy it \((n_{\mathcal{F}}-1)\) times (where \(n_{\mathcal{F}}\) denotes the number of fake records), launch the sanitization, and finally remove from the sanitized tuples \(\mathcal{T}^{r}\) those that appear \(n_{\mathcal{F}}\) times: only true tuples would remain.

Copy actions are actually twofold. With intra-partition copy actions, one or more s-tuples are copied several times into their own partition, while with inter-partition copy actions the destination partition differs from the source partition. The safety properties in charge of detecting intra-/inter-partition copy actions are based on (1) assigning to each tuple a unique identifier (thus making the detection of duplicates within a partition trivial), and (2) organizing the set of tuples to be sanitized such that each identifier is authorized to appear in a single partition only, thus forcing the duplicates, if any, to be part of the same partition, where detection is trivial as stated in (1). The Identifier Unicity safety property (Definition 3) precludes intra-partition copy actions by requiring that each tuple identifier be unique within a given partition (see the implementation sketches below for an example of such an identifier).

Definition 3

(Identifier Unicity safety property)

Let \(t^{s}\in\mathcal{T}_{i}^{s}\) be an s-tuple in the partition \(\mathcal{T}_{i}^{s}\), and let \(t^{s}.\zeta.\theta\) denote \(t^{s}\)’s identifier. Partition \(\mathcal{T}_{i}^{s}\) respects the Identifier Unicity safety property if for every pair of s-tuples \(t_{j}^{s},t_{k}^{s}\in\mathcal{T}_{i}^{s}\), \(t_{j}^{s}.\zeta.\theta=t_{k}^{s}.\zeta.\theta \Rightarrow j=k\).

Inter-partition copies are more difficult to detect. In addition to being unique in its own partition, each s-tuple must appear in a single partition only. To this end, we define for each partition the set of identifiers it is supposed to contain, called the partition’s TID-Set and denoted \(\mathcal{T}_{i}^{s}.\varTheta\). The Mutual Exclusion safety property (Definition 4) ensures that no two partitions’ TID-Sets overlap, and the Membership safety property (Definition 5) that each identifier appears in the partition to which it is supposed to belong. As a result, Mutual Exclusion and Membership together guarantee that each identifier actually appears within a single partition (as stated in Lemma 1).

Definition 4

(Mutual Exclusion safety property)

Partitions respect Mutual Exclusion if for every pair of partitions \(\mathcal{T}_{i}^{s},\mathcal{T}_{j}^{s}\subset\mathcal{T}^{s}\), \(i\neq j\Rightarrow \mathcal{T}_{i}^{s}.\varTheta\cap \mathcal{T}_{j}^{s}.\varTheta=\varnothing\).

Definition 5

(Membership safety property)

A partition \(\mathcal{T}_{i}^{s}\) respects Membership if for every s-tuple \(t_{j}^{s}\in\mathcal{T}_{i}^{s}\), \(t_{j}^{s}.\zeta.\theta\in\mathcal{T}_{i}^{s}.\varTheta\).

Lemma 1

Enforcing the Identifier Unicity, Mutual Exclusion, and Membership safety properties together is necessary and sufficient to guarantee the absence of any (intra- or inter-partition) copy action.

Proof

We start by showing the sufficiency of these properties. First, Identifier Unicity is by itself sufficient to preclude intra-partition copy actions (recall that the authenticity of an s-tuple and its identifier is guaranteed by the Origin safety property). Second, assume that a given s-tuple \(t^{s}\) has been copied into two distinct partitions. The TID-Set of at most one of them contains \(t^{s}\)’s identifier, because otherwise Mutual Exclusion would be contradicted. Consequently, there must be one partition whose TID-Set does not contain \(t^{s}\)’s identifier, which contradicts the Membership safety property. As a result, Membership and Mutual Exclusion together are sufficient to preclude inter-partition copy actions.

We now show the necessity of these properties. First, since a distinct identifier is assigned to each s-tuple, the absence of intra-partition copies immediately results in the satisfaction of the Identifier Unicity property. Second, the absence of inter-partition copies implies that the partitioning is correct, so that (1) the Mutual Exclusion property is satisfied in that TID-Sets do not overlap (recall that a distinct identifier is assigned to each s-tuple), and (2) the Membership property is satisfied in that each s-tuple appears in the partition whose TID-Set contains its identifier. □
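The three properties reduce to simple set checks. The sketch below assumes a hypothetical in-memory representation in which an s-tuple is a dict exposing its identifier under t["theta"] and a TID-Set is a Python set; per Lemma 1, a dataset passing all three checks is free of copy actions.

```python
def check_identifier_unicity(partition) -> bool:
    """Definition 3: no two s-tuples of a partition share an identifier."""
    thetas = [t["theta"] for t in partition]
    return len(thetas) == len(set(thetas))

def check_membership(partition, tid_set) -> bool:
    """Definition 5: every identifier belongs to the partition's TID-Set."""
    return all(t["theta"] in tid_set for t in partition)

def check_mutual_exclusion(tid_sets) -> bool:
    """Definition 4: TID-Sets are pairwise disjoint."""
    seen = set()
    for ts in tid_sets:
        if seen & ts:  # some identifier is authorized in two partitions
            return False
        seen |= ts
    return True
```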

Delete actions precluded

The delete tampering action is based on sanitizing at least two versions of \(\mathcal{T}^{s}\), where one version contains some tuples that do not appear in the other (i.e., deleted tuples). The deleted subset of s-tuples corresponds to the subset of sanitized tuples that no longer appear in \(\mathcal{T}^{r}\). For example, within Mondrian, the recipient could obtain the equivalence classes corresponding to a set of s-tuples (based on their quasi-identifiers), then obtain a second version of the classes where one tuple has been deleted; the sensitive value that no longer appears in \(\mathcal{T}^{r}\) can then be matched to the quasi-identifier of the deleted s-tuple.

Loosely speaking, the set of tuples to be sanitized is affected by a delete action if the set of tuples associated with a partition changes over time. In other words, a delete has occurred if the TID-Set of at least one partition has been associated with at least two distinct sets of tuples. To avoid such actions, the content of each partition must not change during the whole protocol: it must be made invariant. We define the Invariance safety property independently of the data structure to be made invariant.

Definition 6

(Invariance safety property)

Let L be a set of character strings containing the labels designating the data structures to be made invariant; L is known by both the recipient and the secure devices. Let \(l_{0}\in L\) be a label, \(b_{0}\in\{0,1\}^{*}\) be an arbitrary bitstring, and \(\Pr[(l_{0},b_{0})]\) denote the probability that at least one secure device receives the couple \((l_{0},b_{0})\). We say that \((l_{0},b_{0})\) respects the Invariance property if for all bitstrings \(b_{i}\in\{0,1\}^{*}\) received by any secure device, \(\Pr[(l_{0},b_{i})]=1\) if \(b_{i}=b_{0}\), and \(\Pr[(l_{0},b_{i})]=0\) otherwise.

For example, the set of tuples to be sanitized is invariant if the couple (“\(\mathcal{T}^{s}\)”,\(\mathcal{T}^{s}\)) respects the Invariance safety property, “\(\mathcal{T}^{s}\)” being its label and \(\mathcal{T}^{s}\) its actual bitstring representation.

Implementation sketches

The implementations of the Origin and Identifier Unicity safety properties are straightforward: when receiving a partition to sanitize, the secure device simply checks the signatures of the s-tuples and the absence of duplicate identifiers (footnote 23).

The other properties are harder to check because they concern the complete dataset; we base their implementation on a summary of the dataset.

First, the summary contains, for each partition, a cryptographic hash of its s-tuples. Making the set of tuples to be sanitized invariant consists in (1) letting the recipient send the summary to an absolute majority of a designated population of secure devices (e.g., the top 1% most available secure devices), so that only one summary can exist, and (2) when sanitizing a partition, letting the secure device check that the summary was indeed sent to the required number of secure devices and that the actual hash of its s-tuples is consistent with the hash announced in the summary. The content of each partition thus satisfies Invariance.

Second, the TID-Sets are added to the summary, expressed as ranges (for each partition, the min and max TIDs) and ordered (either ascending or descending). A linear scan of the summary is then sufficient for a secure device to assert that the summary meets Mutual Exclusion. Since the summary is invariant and its consistency is checked when sanitizing partitions, the recipient is forced to produce a single, mutually exclusive partitioning: Mutual Exclusion is guaranteed. The Membership property is now trivial to check: when sanitizing a partition, the secure device checks that the actual TIDs indeed belong to the announced TID-Set.
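The sketch below illustrates these summary checks under simplifying assumptions: each summary entry is a (tid_min, tid_max, hash) triple, ranges are sorted in ascending order, and s-tuples are hashed in a canonical order (here, sorted by identifier); the signatures attesting that the summary was distributed to the designated population are omitted.

```python
import hashlib

def hash_partition(partition) -> str:
    """Canonical hash of a partition's s-tuples (sorted by identifier)."""
    blob = b"".join(t["e"] for t in sorted(partition, key=lambda t: t["theta"]))
    return hashlib.sha256(blob).hexdigest()

def check_summary_mutual_exclusion(summary) -> bool:
    """Linear scan: TID ranges must be sorted and pairwise disjoint."""
    return all(lo <= hi < nlo <= nhi
               for (lo, hi, _), (nlo, nhi, _) in zip(summary, summary[1:]))

def check_partition_vs_summary(partition, entry) -> bool:
    """Invariance (announced content hash) and Membership (TIDs in range)."""
    lo, hi, h = entry
    return (hash_partition(partition) == h
            and all(lo <= t["theta"] <= hi for t in partition))
```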

A.2 Execution safety of the construction function

Most global publishing models (if not all) share a common feature: their privacy guarantees are based on a set of one or more counts (footnote 24). For example, the (d,γ)-Privacy guarantees depend on the number of true and fake records in the final release, k-Anonymity’s on the number of records per equivalence class, and ℓ-Diversity’s on the distribution of sensitive values in each class. This recurrent need motivates the design of a secure counting protocol between the secure devices and the recipient, which we call Secure Count. Secure Count is a generic algorithm; when used to check the safety of the construction function’s execution, it takes as input the set of tuples to be sanitized \(\mathcal{T}^{s}\) and outputs the correct corresponding set of counts (we sketch a possible implementation below). Checking these counts means asserting that \(\mathtt{G}(\mathtt{Secure\ Count}(\mathcal{T}^{s}))=\mathtt{True}\), where G is an algorithm-dependent check based on an algebraic manipulation of the counts. For example, for Mondrian the number of records in each class is counted and compared to k; for Bucketization the distinct sensitive values within each class are counted and the compliance of their distribution with the chosen ℓ-Diversity criterion is checked; and for the αβ-Algorithm the numbers of fake and true records are counted and their proportion is checked against the expected α and β. Note that such counts are not private; they were already known to the recipient, which needed them to build sanitization information respecting the privacy parameters of the algorithm.
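As an illustration, two possible G checks are sketched below: the Mondrian check compares per-class counts to k, and the Bucketization check enforces distinct ℓ-diversity (one of several possible ℓ-diversity criteria); both operate on the plain counts output by Secure Count.

```python
def g_mondrian(class_counts, k: int) -> bool:
    """k-Anonymity: every equivalence class holds at least k records."""
    return all(c >= k for c in class_counts)

def g_bucketization_distinct(class_value_counts, l: int) -> bool:
    """Distinct l-diversity: every class exhibits at least l distinct
    sensitive values. class_value_counts is a list holding, for each
    class, a dict {sensitive value: count}."""
    return all(len(vc) >= l for vc in class_value_counts)

# e.g., g_mondrian([7, 5, 9], k=5) is True; g_mondrian([7, 3], k=5) is False
```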

Although counting tuples is a powerful primitive, it is not sufficient for all possible algorithms: there exist algorithms whose sanitization information is not fully decided by the recipient. While this is not the case for the equivalence classes of our Mondrian and Bucketization instances, it occurs for the αβ-Algorithm: the true/fake nature of tuples is set by the secure devices during the collection phase and must consequently be checked (by devices too) to produce a sound count of true/fake tuples. For such algorithms, the unconstrained nature of the construction function precludes any generic checking mechanism. Fortunately, although full genericity cannot be reached here, the other safety properties defined above, together with their implementations, form a versatile toolkit that can be used for these algorithm-dependent checks. For example, within the αβ-Algorithm, the algorithm-dependent checks consist in checking (1) the authenticity of the true/fake nature of tuples (by also signing the true/fake bit in the Origin safety property), (2) the absence of any duplicate record within the set of tuples to be sanitized (by setting the tuple identifier to be its record’s deterministic MAC), and (3) the absence of any duplicate record outside the tuples to be sanitized (by splitting the allowed domains of collected true and fake tuples based on the Invariance safety property). As a result, we define below the Safe Construction Function safety property, in charge of guaranteeing the execution safety of the construction function. The precise location of the enforcement of Safe Construction Function within the execution sequence is given in Appendix A.3. For the moment, it is sufficient to observe that the counts check must occur after the protection of the set of tuples to be sanitized has been guaranteed (so that Secure Count inputs a safe dataset), and before the final step of disclosing the sanitized records (so that the recipient does not obtain sanitized tuples before the complete sequence of checks has been performed).

Definition 7

(Safe Construction Function safety property)

The Safe Construction Function property is respected if both the counts check and the algorithm-dependent checks have succeeded.

Implementing Secure Count

A naive count scheme could consist in using locks to synchronize secure devices on a shared counter: each participating secure device would lock the counter, increment it, and eventually sign it. However, due to the possibly high number of secure devices connecting concurrently, this approach would suffer from prohibitive blocking delays. In [2] we propose a lock-free count scheme. It consists in gathering, on the recipient, unit count increments sent by the secure devices, and in summing them up while ensuring the absence of tampering by the recipient. In case the set of counts overwhelms the resources of a secure device, [2] additionally proposes a scalable sum scheme inspired by traditional parallel tree-based sum schemes.
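The following is a minimal sketch in the spirit of that lock-free scheme (not the actual scheme of [2]): each device emits, per counted tuple, a unit increment authenticated under a device-side key, and any device can later sum the increments gathered by the recipient while rejecting forged or replayed ones. Detecting suppressed increments is left to the continuity checks of Appendix A.3.

```python
import hashlib
import hmac

DEVICE_KEY = b"device-side key"  # assumption: known to the secure devices only

def unit_increment(tid: bytes) -> tuple:
    """Emitted by a secure device for one counted tuple."""
    return tid, hmac.new(DEVICE_KEY, b"count|" + tid, hashlib.sha256).digest()

def secure_sum(increments) -> int:
    """Run inside a secure device: verify each increment and reject
    duplicates, so the recipient can neither forge nor replay them."""
    seen, total = set(), 0
    for tid, tag in increments:
        expected = hmac.new(DEVICE_KEY, b"count|" + tid, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected) or tid in seen:
            raise ValueError("tampered or replayed increment")
        seen.add(tid)
        total += 1
    return total
```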

A.3 Execution sequence

As illustrated in Fig. 9, the Met𝔸P(wm) sanitization phase consists in the following steps (a driver sketch follows the list):

  • S1: Apply Invariance to the partitioned set of tuples to be sanitized \(\mathcal{T}^{s}\);

  • S2: Assert Mutual Exclusion on the partitions of \(\mathcal{T}^{s}\);

  • S3: Process each partition (check the Origin, Identifier Unicity, and Membership safety properties, and return the encrypted sanitized records and the Secure Counts);

  • S4: Check the counts for the Safe Construction Function safety property;

  • S5: Apply Unlinkability to the set of encrypted sanitized records \(\mathcal{T}^{v_{o}}\);

  • S6: Finally decrypt the shuffled sanitized records to yield \(\mathcal{T}^{r}\).
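The ordering constraints (S1 and S2 in parallel, then S3, then S4 and S5 in parallel, then S6 last) can be pictured with the hypothetical driver below; every sX_* function is a stub standing for the corresponding distributed task.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs standing for the distributed tasks S1-S6 described above.
def s1_apply_invariance(ts): pass
def s2_assert_mutual_exclusion(ts): pass
def s3_process_partition(p): return p        # sanitized records + Secure Counts
def s4_check_counts(tvo): pass
def s5_shuffle(tvo): return tvo
def s6_decrypt(tvo): return tvo

def sanitization_phase(ts_partitions):
    with ThreadPoolExecutor() as pool:
        # S1 || S2
        for f in [pool.submit(s1_apply_invariance, ts_partitions),
                  pool.submit(s2_assert_mutual_exclusion, ts_partitions)]:
            f.result()
        tvo = [s3_process_partition(p) for p in ts_partitions]  # S3
        # S4 || S5
        f4 = pool.submit(s4_check_counts, tvo)
        f5 = pool.submit(s5_shuffle, tvo)
        f4.result()
        return s6_decrypt(f5.result())  # S6: decryption only after all checks
```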

This execution sequence is composed of the generic steps of Met𝔸P(mal) (where S1 and S2 can be performed in parallel, as well as S4 and S5). Specific instances whose sanitization information is fully decided by the recipient follow it as-is; the other instances additionally insert into it the algorithm-dependent checks of the Safe Construction Function property (see Appendix A.2). It only remains to protect the sequence’s integrity (complete execution in the correct order) in order to cover all possible attacks (active and passive) from a recipient with weakly-malicious intent.

Execution sequence integrity

Checking the completeness and order of the execution sequence is critical (e.g., the final decryption must not occur before the safety properties have been checked). The Execution Sequence Integrity safety property is respected if and only if the execution flows through the complete sequence of safety properties in the correct order. Observe that an adversarial recipient essentially aims at obtaining records: it necessarily executes the first step (the initial c-tuples collection) and the last step (the final records disclosure). Completeness is consequently guaranteed if the steps in between are all executed, and correctness if they are executed in the correct order. To this end, secure devices embed the expected execution sequence of the algorithm instantiated under Met𝔸P and control that the actual sequence matches it. Checking the match between the two sequences amounts to checking that, at each execution step, the recipient is able to prove (in the most general meaning of the term) that the immediately preceding step(s) was performed on the expected data structure; by induction, a valid proof for the current step then demonstrates that all previous steps were performed. Hence, when a secure device connects, the recipient sends to it the current execution step and the set of proofs binding the immediately preceding step to the data structure on which it was performed. The secure device checks the validity of the proofs, performs the task associated with the execution step, and finally returns the corresponding proof (in addition to the output of the task).

Definition 8

(Execution Sequence Integrity safety property)

Let S be the pre-installed set of execution steps, let \(s\in S\) denote the current execution step indicated to the connecting secure device by the recipient, and let s.P denote the set of proofs required for executing s, as specified by s.VALID (the detailed list of steps and proofs is given below). Finally, let \(s.P_{rec}\) denote the set of proofs received from the recipient for step s. The Execution Sequence Integrity safety property is respected if \(\forall s\in S\): \(s.\mathtt{VALID}(s.P, s.P_{rec})=\mathtt{True}\).

The list of proofs required for each step of the generic Met𝔸P(mal) execution sequence, together with the corresponding VALID functions, is the following (a simplified proof-chain sketch follows the list):

  • S1 and S2: No proof required because no execution step precedes them.

  • S3: The Invariance and Mutual Exclusion proofs are two signatures emitted by secure devices, binding, for each property, its label (e.g., “Invariance” and “Mutual Exclusion”) to the hash of the summary to which the property was applied or checked. The set of proofs for S3 is thus the signatures plus the summary. The VALID function consists, for each connecting secure device, in verifying the signatures (correct labels and correct summary hash) and checking the consistency of the summary with the partition to sanitize (TID-Set and hash of its content).

  • S4: The secure device that checks the counts asserts that all the partitions have been counted, based on the partitions announced in the summary and on the partitions covered by the counts (secure counts include this information): this is S4’s VALID function. The set of proofs for S4 consists of the counts and the summary (with its Invariance signature).

  • S5: The set of encrypted sanitized tuples is first protected against tampering (in the same way as \(\mathcal{T}^{s}\) was). The set of proofs for the first shuffling level and the VALID function are thus the same as S3’s, but applied to \(\mathcal{T}^{v_{o}}\). For the next shuffling levels, secure devices only need to check that the actual shuffling circuit is the expected one. The set of proofs is thus the expected shuffling circuit (e.g., computed from \(\mathcal{T}^{v_{o}}\)’s summary) and a signature binding each tuple to its current position in the circuit. The VALID function consists in verifying the tuple signatures and in checking that each set of tuples shuffled together is consistent with the shuffling circuit.

  • S6: Finally, secure devices check the completeness of the shuffling circuit (i.e., the position of each tuple is at the end of the circuit), the signature obtained in S4 proving that the counts have been checked, and the continuity between \(\mathcal{T}^{s}\) and \(\mathcal{T}^{v_{o}}\), based on the number of tuples they contain (e.g., computed from their summaries or counted by the Secure Count scheme): this is the VALID function. The set of proofs consists of the shuffling circuit and the (signed) positions of the tuples, S4’s signature, and the number of tuples of \(\mathcal{T}^{s}\) and \(\mathcal{T}^{v_{o}}\) (e.g., computed from their summaries).
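A simplified sketch of the proof chaining: each proof is modeled as a device-side MAC binding a step label to the hash of the structure it was performed on, and VALID verifies the proofs of the immediately preceding step(s). This collapses a real-world detail: in the actual sequence, different steps bind different structures (summaries, counts, shuffling circuit).

```python
import hashlib
import hmac

DEVICE_KEY = b"device-side key"  # assumption: known to the secure devices only

def proof(step: str, data_hash: bytes) -> bytes:
    """Proof emitted by a device after performing `step` on the data
    structure identified by data_hash."""
    return hmac.new(DEVICE_KEY, step.encode() + b"|" + data_hash,
                    hashlib.sha256).digest()

# Predecessors that must be proven before each step runs (S1/S2: none).
PRECEDING = {"S3": ("S1", "S2"), "S4": ("S3",), "S5": ("S3",), "S6": ("S4", "S5")}

def valid(step: str, received: dict, data_hash: bytes) -> bool:
    """Simplified VALID: every required predecessor proof must verify;
    by induction, a valid chain covers the whole prefix of the sequence."""
    return all(
        hmac.compare_digest(received.get(p, b""), proof(p, data_hash))
        for p in PRECEDING.get(step, ())
    )
```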

Appendix B: Extension to a Malicious Hard recipient

B.1 Typicality property

The Typicality property is in charge of deterring the Malicious Hard recipient from performing attacks based on injecting forged tuples into the dataset. As explained in Sect. 6.3, the effectiveness of such attacks depends on the number of tuples forged by the recipient (based on the cryptographic keys of the cluster(s) it has broken) and injected into the dataset. The Typicality property consequently thwarts these attacks by requiring that the participations of all clusters be similar, where similarity can be instantiated in various ways (see below).

Definition 9

(Typicality)

Let \(\mathcal{P}\) denote a set of cluster participations (e.g., in the full dataset), and let T denote the statistical typicality test used by the secure devices (e.g., a standard-deviation analysis). \(\mathcal{P}\) respects the Typicality safety property if \(\mathtt{T}(\mathcal{P})=1\).

Secure devices enforce the Typicality property by counting the participations of the clusters (using the Secure Count protocol sketched above) and then checking that the set of counts satisfies the typicality test T. When the value of each count can be pre-defined (e.g., by fixing the number of tuples per cluster to be collected), T checks that each count equals the expected value. As a result, the Malicious Hard recipient is bounded by the size of the cluster in the number of tuples it can forge. This may be sufficient protection (e.g., for the αβ-Algorithm), or it may still let the attacker cause severe privacy damage (e.g., for Mondrian). When the participations cannot be pre-defined (e.g., within the k-Anonymity running example, typicality can be defined inside each equivalence class), T is instantiated as a traditional statistical analysis, e.g., an outlier detection measure [8]. Section 7 shows the feasibility of the Typicality property both in terms of cost and of detection effectiveness.
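Both flavors of T can be sketched as follows: an exact test when the participations are pre-defined and, as one possible statistical instantiation, a z-score outlier test in the spirit of [8].

```python
from statistics import mean, stdev

def typicality_fixed(counts, expected: int) -> bool:
    """Pre-defined participations: every cluster count must be as expected."""
    return all(c == expected for c in counts)

def typicality_zscore(counts, z_max: float = 3.0) -> bool:
    """Statistical test: no cluster's participation deviates from the mean
    by more than z_max standard deviations (outlier detection, cf. [8])."""
    if len(counts) < 2:
        return True
    m, s = mean(counts), stdev(counts)
    return s == 0 or all(abs(c - m) <= z_max * s for c in counts)
```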

B.2 Adapting the Secure Count and the shuffle to clusters

We adapt the implementations of both the Secure Count and the shuffle following the same two-step pattern. As a first step, the standard implementation is performed within each cluster; no modification is needed here. As a second step, the local (intra-cluster) results must be merged to yield the global (inter-cluster) result: the counts obtained by all the clusters must be summed up, and the tuples shuffled within each cluster must be shuffled with the tuples of all the other clusters. The difficulty in the second step comes from the lack of trust a secure device can have in results originating from any cluster other than its own: it is unable to validate the signatures of the others, and is consequently unable to distinguish between data legitimately produced (e.g., the count of another cluster with its signature) and data injected by the recipient (e.g., a cheated count without any valid signature). We propose to deter such cheats by requiring that secure devices participating in the second step simply not reveal their cluster to the recipient: a cheat is detected if the connected device receives cheated data claimed to originate from its own cluster but carrying an invalid signature. At each connection, the probability of detecting the cheat is \(1/n_{c}\), and the recipient needs to send the cheat to at least \(n_{c}\) secure devices (to have each cluster compute its result). This worst case thus results in a lower-bound detection probability of \(1-((n_{c}-1)/n_{c})^{n_{c}}\). Note, however, that this worst case is unrealistic (e.g., in practice, in order to contact 500 clusters the recipient has to send the cheat to approximately 3400 secure devices). Let us now informally sketch the adaptation of the implementations of both the Secure Count and the shuffle.

The Secure Count takes as input the complete set of tuples and outputs the corresponding set of counts. We adapt it following the two-step pattern described above: first, the count local to each cluster is computed as usual by the Secure Count; second, the global count is computed by letting each cluster download the local counts and sum them. A malicious recipient may attempt to increase the global counts by sending forged counts, the rationale being to artificially reach the count required by the Safe Construction Function or Typicality safety properties. However, since the secure devices do not reveal their cluster to the recipient, and since the recipient must obtain all the local counts (each cluster needs the global count later in the protocol), the high number of devices to which a cheated set of counts must be sent makes the detection probability reach highly deterring values, as shown in Fig. 14.

Fig. 14: Evolution of the detection probability

We follow a similar approach when adapting shuffling to clusters: first, the shuffling local to each cluster is performed as usual, and second, global shuffling is performed on the outputs of the local shufflings. The second step basically consists in letting each connected device (1) download a partition made of one tuple per cluster (previously shuffled locally), and (2) return the single tuple originating from its own cluster. During this step, as for the adaptation of the Secure Count, the devices do not reveal their cluster to the recipient: the returned tuple could originate from any cluster. A malicious recipient may cheat the second step by forming partitions that do not contain tuples from all the clusters (e.g., replacing some tuples chosen at random by a random bitstring): the shuffling would thus be incomplete. This cheat differs from the Secure Count cheat described above in that the recipient need not send a cheated partition to all the clusters: it can use a cheated partition to get one result tuple and then use the non-cheated version of the partition to get the other tuples. The first protection measure is thus to make the partitions, cheated or not, invariant (for each cluster). The second protection measure lies in hiding the connected devices’ clusters, as for the Secure Count. Indeed, in order to obtain the results corresponding to the valid tuples of a partition, the recipient will have to send the latter to all the corresponding clusters. This results in a high number of devices to which a cheated partition must be sent, and consequently a detection probability that quickly reaches sufficiently deterring values.
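A hypothetical sketch of this second step from a single device's viewpoint, assuming each tuple carries a tag computed under a per-cluster MAC key: the device scans the downloaded partition for the tuple of its own cluster and raises an alarm when that tuple is missing or carries an invalid tag (re-encryption for unlinkability is omitted).

```python
import hashlib
import hmac

MY_CLUSTER_KEY = b"key shared within this device's cluster only"  # assumption

def cluster_tag(blob: bytes) -> bytes:
    return hmac.new(MY_CLUSTER_KEY, blob, hashlib.sha256).digest()

def global_shuffle_step(partition) -> bytes:
    """partition: list of (blob, tag) pairs, one tuple per cluster, each
    previously shuffled locally. Only this cluster's tuple verifies here."""
    for blob, tag in partition:
        if hmac.compare_digest(tag, cluster_tag(blob)):
            return blob  # returned to the recipient (re-encryption omitted)
    # Our cluster's tuple is absent or forged: the cheat is detected.
    raise RuntimeError("cheat detected: missing or invalid cluster tuple")
```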

Figure 14 shows the exponential growth of the detection probability with respect to the number of devices receiving a cheat (the x-axis is in log scale). We plot the evolution of the detection probability for four numbers of clusters (\(n_{c}=5\), \(n_{c}=50\), \(n_{c}=500\), \(n_{c}=5000\)) with respect to the number of secure devices that receive the cheat; the lower bound (LB) is also represented. We focus on the part of the curves above the asymptotic limit (AL) computed by:

$$\lim_{n_c \to +\infty}1-\bigl((n_c-1)/n_c\bigr)^{n_c} = 1-1/e $$

which is approximately 0.63.
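The lower bound and the figure of roughly 3400 devices for 500 clusters quoted above can be reproduced numerically; the device estimate below uses a coupon-collector argument, which is an assumption about how the paper's approximation was obtained.

```python
def detection_lower_bound(n_c: int) -> float:
    """1 - ((n_c - 1) / n_c) ** n_c, which tends to 1 - 1/e (about 0.632)."""
    return 1 - ((n_c - 1) / n_c) ** n_c

def expected_devices_to_cover_all(n_c: int) -> float:
    """Coupon-collector estimate: about n_c * H(n_c) random connections are
    needed on average before every one of the n_c clusters is contacted."""
    return n_c * sum(1 / k for k in range(1, n_c + 1))

print(round(detection_lower_bound(500), 3))       # 0.632
print(round(expected_devices_to_cover_all(500)))  # 3396, i.e. roughly 3400
```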



Cite this article

Allard, T., Nguyen, B. & Pucheral, P. MET𝔸P: revisiting Privacy-Preserving Data Publishing using secure devices. Distrib Parallel Databases 32, 191–244 (2014). https://doi.org/10.1007/s10619-013-7122-x

