Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems

  • Conference paper
Privacy in Statistical Databases (PSD 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3050)

Abstract

This paper provides an overview of methods for masking microdata so that the data can be placed in public-use files. It divides the methods according to whether or not they have been demonstrated to provide analytic properties. For methods that have been shown to provide one or two sets of analytic properties in the masked data, we indicate where the data may have limitations for most analyses and how re-identification might, or can, be performed. We cover several methods for producing synthetic data, along with possible computational extensions for better automating the creation of the underlying statistical models. We finish with background on analysis-specific and general information-loss metrics to stimulate research.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Winkler, W.E. (2004). Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems. In: Domingo-Ferrer, J., Torra, V. (eds) Privacy in Statistical Databases. PSD 2004. Lecture Notes in Computer Science, vol 3050. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-25955-8_18

  • DOI: https://doi.org/10.1007/978-3-540-25955-8_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22118-0

  • Online ISBN: 978-3-540-25955-8

  • eBook Packages: Springer Book Archive
