Abstract
This article reviews the theory of data privacy and confidentiality in statistics and computer science in order to modernize the theory of anonymization. The review yields mathematical definitions of identity disclosure and attribute disclosure that apply even to synthetic data. Differential privacy is also clarified as a method for bounding the accuracy of population inference. The bound is derived from the Hammersley-Chapman-Robbins inequality, and it leads to an intuitive selection of the privacy budget \(\epsilon\) of differential privacy.
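As a rough illustration of how such a bound can arise, the following sketch combines the standard form of the Hammersley-Chapman-Robbins inequality with the likelihood-ratio constraint of \(\epsilon\)-differential privacy. The notation is assumed here for illustration and is not taken from the article: \(D\) and \(D'\) are neighboring databases, \(p_D\) is the density of the released output \(X\), \(g\) is the quantity to be inferred, and \(T\) is any estimator that is unbiased for \(g\). The article's own derivation may differ in setting and detail.

\[
\operatorname{Var}_{D}\bigl(T(X)\bigr) \;\ge\; \frac{\bigl(g(D') - g(D)\bigr)^{2}}{\operatorname{E}_{D}\!\left[\left(\dfrac{p_{D'}(X)}{p_{D}(X)} - 1\right)^{2}\right]}
\qquad \text{(Hammersley-Chapman-Robbins)}
\]

\[
e^{-\epsilon} \le \frac{p_{D'}(x)}{p_{D}(x)} \le e^{\epsilon}
\;\;\Longrightarrow\;\;
\left|\frac{p_{D'}(x)}{p_{D}(x)} - 1\right| \le e^{\epsilon} - 1
\;\;\Longrightarrow\;\;
\operatorname{Var}_{D}\bigl(T(X)\bigr) \;\ge\; \frac{\bigl(g(D') - g(D)\bigr)^{2}}{\bigl(e^{\epsilon} - 1\bigr)^{2}}
\]

Under these assumptions, a smaller \(\epsilon\) forces a larger minimum variance on any unbiased estimator, which is one way to read \(\epsilon\) as a limit on the accuracy of inference rather than as an abstract budget.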
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers JP18H00835 and JP20H00576.
Cite this article
Hoshino, N. A firm foundation for statistical disclosure control. Jpn J Stat Data Sci 3, 721–746 (2020). https://doi.org/10.1007/s42081-020-00086-9
DOI: https://doi.org/10.1007/s42081-020-00086-9