M\(^2\)M: A General Method to Perform Various Data Analysis Tasks from a Differentially Private Sketch

  • Conference paper
Security and Trust Management (STM 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13867)


Abstract

Differential privacy is the standard privacy definition for performing analyses over sensitive data. Yet, its privacy budget bounds the number of tasks an analyst can perform with reasonable accuracy, which makes it challenging to deploy in practice. This can be alleviated by private sketching, where the dataset is compressed into a single noisy sketch vector which can be shared with the analysts and used to perform arbitrarily many analyses. However, the algorithms to perform specific tasks from sketches must be developed on a case-by-case basis, which is a major impediment to their use. In this paper, we introduce the generic moment-to-moment (\(\textrm{M}^2\textrm{M}\)) method to perform a wide range of data exploration tasks from a single private sketch. Among other things, this method can be used to estimate empirical moments of attributes, the covariance matrix, counting queries (including histograms), and regression models. Our method treats the sketching mechanism as a black-box operation, and can thus be applied to a wide variety of sketches from the literature, widening their ranges of applications without further engineering or privacy loss, and removing some of the technical barriers to the wider adoption of sketches for data exploration under differential privacy. We validate our method with data exploration tasks on artificial and real-world data, and show that it can be used to reliably estimate statistics and train classification models from private sketches.
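
For concreteness, the snippet below illustrates the kind of private sketch the paper builds on: the records are mapped through a random feature map, averaged, and released once with calibrated noise. The random Fourier feature map, the Gaussian-mechanism noise calibration, and all names are illustrative assumptions rather than the paper's own mechanism (bounded-DP setting, with the dataset size n treated as public).

```python
import numpy as np

def rff_features(X, Omega):
    """Random Fourier feature map Phi(x) = [cos(Omega^T x), sin(Omega^T x)] / sqrt(m),
    so every record satisfies ||Phi(x)||_2 = 1."""
    m = Omega.shape[1]
    proj = X @ Omega                                    # (n, m)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(m)

def private_sketch(X, Omega, epsilon, delta, rng):
    """Release a single noisy sketch z = (sum_i Phi(X_i) + xi) / n.

    Noise is calibrated with the standard Gaussian mechanism: adding or removing
    one record changes the feature sum by at most ||Phi(x)||_2 = 1 (L2 sensitivity),
    so xi ~ N(0, sigma^2 I) with sigma = sqrt(2 ln(1.25/delta)) / epsilon gives
    (epsilon, delta)-DP for the sum (valid for epsilon < 1).
    """
    n = X.shape[0]
    Phi = rff_features(X, Omega)
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    xi = rng.normal(scale=sigma, size=Phi.shape[1])
    return (Phi.sum(axis=0) + xi) / n                   # sketch vector, shape (2m,)

# The curator computes the sketch once; analysts only ever see z (and Omega).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))                          # sensitive dataset, n=5000, d=2
Omega = rng.normal(size=(2, 100))                       # m=100 random frequencies
z = private_sketch(X, Omega, epsilon=0.5, delta=1e-5, rng=rng)
```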

F. Houssiau and V. Schellekens—These authors contributed equally.


Notes

  1.

    We consider only unbounded DP for conciseness; the private sketches from Sect. 2.3 extend in a straightforward manner to the bounded DP setting, in which case no noise needs to be added to the denominator in (2).



Author information

Corresponding author: Florimond Houssiau.

Appendices

A Proof of Theorem 1

Let \(J_\varSigma \), the left-hand side of the inequality, denote the mean squared error between the empirical mean \(\overline{f}\) and the estimate \(\widetilde{f}\) obtained from the sketch. Writing \(X=(X_1,\dots ,X_n)\), we have

$$
\begin{aligned}
J_\varSigma &= \mathbb{E}_{X,\xi}\left[\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \Big\langle a, \frac{1}{n}\Big(\sum_{i=1}^n \varPhi(X_i) + \xi\Big)\Big\rangle\right)^{\!2}\right] \\
&= \mathbb{E}_{X,\xi}\left[\left(\frac{1}{n}\sum_{i=1}^n \big(f(X_i)-\langle a,\varPhi(X_i)\rangle\big) - \frac{1}{n}\langle a, \xi\rangle\right)^{\!2}\right] \\
&\overset{(i)}{=} \mathbb{E}_{X}\left[\left(\frac{1}{n}\sum_{i=1}^n \big(f(X_i)-\langle a,\varPhi(X_i)\rangle\big)\right)^{\!2}\right] + \frac{1}{n^2}\,\mathbb{E}_{\xi}\left[\langle a,\xi\rangle^2\right] \\
&\overset{(ii)}{=} \frac{n(n-1)}{n^2}\,\big(\mathbb{E}_{X}[f(X)]-\langle a,\mathbb{E}_{X}[\varPhi(X)]\rangle\big)^2 + \frac{n}{n^2}\,\mathbb{E}_{X}\left[\big(f(X) - \langle a,\varPhi(X)\rangle\big)^2\right] + \|a\|_2^2\,\frac{\mathbb{V}[\xi]}{n^2}
\end{aligned}
$$

where (i) uses the independence of \(\xi \) and \(X\) together with \(\mathbb {E}\left[ \xi \right] = 0\), and (ii) uses the fact that the samples \((X_i)_{1\le i\le n}\) are drawn independently (\(\mathbb {V}[\cdot ]\) denotes the variance of a random variable): expanding the square, the \(n(n-1)\) cross terms factor into the squared expectation, while the \(n\) diagonal terms give the second moment. Finally, Jensen's inequality (since \(x\mapsto x^2\) is convex) gives \(\left( \mathbb {E}_{X}\left[ f(X)\right] -\langle a,\mathbb {E}_{X}\left[ \varPhi (X)\right] \rangle \right) ^2 \le \mathbb {E}_{X}\left[ \left( f(X) - \langle a,\varPhi (X)\rangle \right) ^2\right] \), which concludes the proof.
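
As a sanity check on the derivation above, the equality in step (ii) can be verified numerically. The snippet below is a small Monte Carlo illustration with an arbitrary toy target \(f\), feature map \(\varPhi \), fixed coefficients \(a\), and i.i.d. Gaussian noise \(\xi \); all of these choices are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: scalar data X ~ N(0,1), feature map Phi(x) = (x, x^2), target f(x) = |x|.
def Phi(x):
    return np.stack([x, x**2], axis=-1)

def f(x):
    return np.abs(x)

n = 50                                  # sample size
a = np.array([0.3, 0.4])                # fixed sketch coefficients
sigma_xi = 0.5                          # noise std per coordinate, so V[xi] = sigma_xi**2

def squared_error_once():
    X = rng.normal(size=n)
    xi = rng.normal(scale=sigma_xi, size=2)
    sketch_estimate = a @ ((Phi(X).sum(axis=0) + xi) / n)   # <a, (1/n)(sum_i Phi(X_i) + xi)>
    return (f(X).mean() - sketch_estimate) ** 2

# Left-hand side: Monte Carlo estimate of J_Sigma over fresh draws of X and xi.
J_mc = np.mean([squared_error_once() for _ in range(100_000)])

# Right-hand side of step (ii), with the expectations over X themselves estimated numerically.
Xbig = rng.normal(size=1_000_000)
bias_sq = (f(Xbig).mean() - a @ Phi(Xbig).mean(axis=0)) ** 2
second_moment = np.mean((f(Xbig) - Phi(Xbig) @ a) ** 2)
J_closed = (n * (n - 1) / n**2) * bias_sq + (n / n**2) * second_moment \
           + (a @ a) * sigma_xi**2 / n**2

print(J_mc, J_closed)   # the two values should agree up to Monte Carlo error
```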

B \(\textrm{M}^2\textrm{M}\) Learning Procedure

[Figure a in the original: pseudocode of the \(\textrm{M}^2\textrm{M}\) learning procedure, rendered as an image.]
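
Since the procedure itself appears only as an image in the original, the snippet below sketches, as a rough stand-in, the kind of regularized least-squares fit that Theorem 1 motivates: choose coefficients \(a\) such that \(\langle a, \varPhi (x)\rangle \approx f(x)\) on analyst-chosen proposal points, while keeping \(\|a\|_2\) small to control the noise term \(\|a\|_2^2\,\mathbb {V}[\xi ]/n^2\). The proposal distribution, the regularization weight, and all names are assumptions, not the authors' exact algorithm.

```python
import numpy as np

def m2m_coefficients(f, feature_map, proposal_points, reg):
    """Fit coefficients a such that <a, Phi(x)> approximates f(x) on points drawn
    from an analyst-chosen proposal distribution, by ridge regression. The ridge
    penalty also keeps ||a||_2 small, which controls the ||a||_2^2 V[xi] / n^2
    noise term appearing in Theorem 1.
    """
    Phi = feature_map(proposal_points)                 # (N, k) design matrix
    y = f(proposal_points)                             # (N,) target values
    k = Phi.shape[1]
    # Regularized normal equations: (Phi^T Phi + N * reg * I) a = Phi^T y.
    return np.linalg.solve(Phi.T @ Phi + len(y) * reg * np.eye(k), Phi.T @ y)

def estimate_from_sketch(a, z):
    """Estimate the empirical mean (1/n) sum_i f(X_i) as <a, z> (Theorem 1)."""
    return a @ z

# Example (reusing rff_features, Omega and z from the illustration after the
# abstract): estimate the empirical mean of the first attribute from the sketch.
# proposal = np.random.default_rng(2).normal(size=(20_000, 2))
# a = m2m_coefficients(lambda P: P[:, 0], lambda P: rff_features(P, Omega), proposal, reg=1e-3)
# print(estimate_from_sketch(a, z))
```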


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Houssiau, F., Schellekens, V., Chatalic, A., Annamraju, S.K., de Montjoye, YA. (2023). M\(^2\)M: A General Method to Perform Various Data Analysis Tasks from a Differentially Private Sketch. In: Lenzini, G., Meng, W. (eds) Security and Trust Management. STM 2022. Lecture Notes in Computer Science, vol 13867. Springer, Cham. https://doi.org/10.1007/978-3-031-29504-1_7

  • DOI: https://doi.org/10.1007/978-3-031-29504-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-29503-4

  • Online ISBN: 978-3-031-29504-1

  • eBook Packages: Computer Science, Computer Science (R0)
