Assessing the privacy of randomized vector-valued queries to a database using the area under the receiver operating characteristic curve

Health Services and Outcomes Research Methodology

Abstract

As the amount of data generated continues to increase, individuals’ privacy is a growing concern. As a result, a large body of research has addressed methods of statistical disclosure control. Some of these methods release a randomized version of the data rather than the actual data. While such methods offer a layer of protection, private information can still be disclosed, and quantifying the level of privacy they provide is often difficult. A method for assessing privacy using the receiver operating characteristic (ROC) curve, based on ideas related to differential privacy, was proposed previously. However, that method was demonstrated only for univariate randomized releases. Here, the ROC-based privacy measure is extended to the release of randomized vectors.

Acknowledgments

This project was partially supported by Award Number K01MH087219 from the National Institute of Mental Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health.

Author information

Correspondence to Ofer Harel.

About this article

Cite this article

Matthews, G.J., Harel, O. Assessing the privacy of randomized vector-valued queries to a database using the area under the receiver operating characteristic curve. Health Serv Outcomes Res Method 12, 141–155 (2012). https://doi.org/10.1007/s10742-012-0093-y

  • DOI: https://doi.org/10.1007/s10742-012-0093-y
