A Survey of Inference Control Methods for Privacy-Preserving Data Mining

  • Josep Domingo-Ferrer
Part of the Advances in Database Systems book series (ADBS, volume 34)

Inference control in databases, also known as Statistical Disclosure Control (SDC), is about protecting data so that they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond. SDC has important applications in several areas, such as official statistics, health statistics, and e-commerce (sharing of consumer data). Since data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users. In this chapter, we survey the current state of the art in SDC methods for protecting individual data (microdata). We discuss several information loss and disclosure risk measures and analyze several ways of combining them to assess the performance of the various methods. Finally, topics that need more research are identified and possible directions are suggested.

Keywords

Privacy; inference control; statistical disclosure control; statistical disclosure limitation; statistical databases; microdata

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Josep Domingo-Ferrer, Dept. of Computer Engineering and Mathematics, Rovira i Virgili University of Tarragona, Spain
