
When Random Sampling Preserves Privacy

  • Kamalika Chaudhuri
  • Nina Mishra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4117)

Abstract

Many organizations, such as the U.S. Census Bureau, publicly release samples of data that they collect about private citizens. These datasets are first anonymized using various techniques, and then a small sample is released so as to enable “do-it-yourself” calculations. This paper investigates the privacy of the second step of this process: sampling. We observe that rare values (values that occur with low frequency in the table) can be problematic from a privacy perspective. To our knowledge, this is the first work that quantitatively examines the relationship between the number of rare values in a table and the privacy of a released random sample. If we require ε-privacy (where the larger ε is, the worse the privacy guarantee) with probability at least 1 − δ, we say that a value is rare if it occurs in at most \(\tilde{O}(\frac{1}{\epsilon})\) rows of the table (ignoring log factors). If there are no rare values, then we establish a direct connection between the sample size that is safe to release and privacy. Specifically, if we select each row of the table with probability at most ε, then the sample is O(ε)-private with high probability. In the case that there are t rare values, the sample is \(\tilde{O}(\epsilon \delta /t)\)-private with probability at least 1 − δ.
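As an illustration only, the two ingredients named in the abstract (counting rare values and selecting each row independently with a fixed probability) can be sketched in Python. The function names, the `threshold` parameter, and the toy table are assumptions made for this sketch and do not come from the paper. In particular, the abstract gives the rare-value cutoff only as \(\tilde{O}(\frac{1}{\epsilon})\), with constants and log factors unspecified, so the caller must supply a concrete threshold. The code does not certify any privacy guarantee.

```python
import random
from collections import Counter


def rare_values(column, threshold):
    """Return the set of values that occur in at most `threshold` rows.

    `threshold` is a hypothetical stand-in for the paper's O~(1/eps)
    cutoff, whose constants and log factors are not given in the abstract.
    """
    counts = Counter(column)
    return {value for value, count in counts.items() if count <= threshold}


def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p (0 <= p <= 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so p = 0 keeps nothing
    # and p = 1 keeps every row.
    return [row for row in rows if rng.random() < p]


# Toy single-column table: "C" appears in only 3 of 1000 rows.
table = ["A"] * 500 + ["B"] * 497 + ["C"] * 3
eps = 0.1          # per-row selection probability
threshold = 10     # assumed cutoff, roughly 1/eps without log factors

rare = rare_values(table, threshold)
sample = bernoulli_sample(table, eps, seed=0)
```

Here `rare` contains only `"C"`, which is the kind of low-frequency value the abstract flags as problematic, and `sample` is a random subset whose expected size is `eps * len(table)`.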

Keywords

Sampling Frequency; Failure Probability; Frequent Itemsets; Private Data; Good Sample
(These keywords were added by machine and not by the authors.)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kamalika Chaudhuri (1)
  • Nina Mishra (2)
  1. Computer Science Department, UC Berkeley, Berkeley, USA
  2. Computer Science Department, University of Virginia, Charlottesville, USA
