Anonymization Methods for Taxonomic Microdata

  • Josep Domingo-Ferrer
  • Krish Muralidhar
  • Guillem Rufian-Torrell
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7556)

Abstract

Microdata sets often contain attributes which are neither numerical nor ordinal, but take nominal values from a taxonomy, ontology or classification (e.g. diagnosis in a medical data set about patients, economic activity in an economic data set, etc.). Such data sets must be anonymized if they are transferred outside the data collector’s premises (e.g. hospital or national statistical office), say, for research purposes. The literature on microdata anonymization methods for nominal data is relatively limited. Multiple imputation is a common choice for such data, but it becomes computationally problematic when nominal attributes can take many different values. In this paper, we provide anonymization methods for data sets which include nominal taxonomic attributes with many possible values.

We show how two anonymization methods originally designed for numerical attributes, data shuffling and microaggregation, can be adapted to taxonomic attributes. The adaptation relies on a hierarchy-aware numerical mapping of nominal categories, which we call marginality. The resulting adapted methods circumvent the computational problems of multiple imputation and take the semantics of the taxonomy into account.
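As a rough illustration of the general idea (map nominal categories to numbers, then apply a numerical method), the following minimal sketch performs fixed-size univariate microaggregation on a numerical score per category. It is not the paper's algorithm: the abstract does not define marginality, so the `score` dictionary is a hypothetical stand-in supplied by hand, and the function name `microaggregate`, the parameter `k`, and the example categories are assumptions made for illustration only.

```python
from statistics import mean


def microaggregate(records, score, k):
    """Replace each category by a representative of its group of >= k records.

    records: list of nominal category labels (one per record)
    score:   dict mapping each category to a number; a hypothetical
             stand-in for a hierarchy-aware mapping such as marginality
    k:       minimum group size
    """
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records and k >= 1")

    # Sort record indices by the numerical score of their category.
    order = sorted(range(len(records)), key=lambda i: score[records[i]])

    # Cut the sorted sequence into consecutive groups of size k.
    groups = [order[i:i + k] for i in range(0, len(order), k)]

    # Merge a short trailing group into the previous one so that
    # every group holds at least k records.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())

    masked = list(records)
    for group in groups:
        centre = mean(score[records[i]] for i in group)
        # Representative: the category in the group whose score is
        # closest to the group mean (the mean itself is not a category).
        rep = min((records[i] for i in group),
                  key=lambda c: abs(score[c] - centre))
        for i in group:
            masked[i] = rep
    return masked


# Hypothetical scores; in practice they would come from the taxonomy.
example_score = {"flu": 1.0, "pneumonia": 2.0, "asthma": 3.0,
                 "fracture": 8.0, "sprain": 9.0}
example_records = ["flu", "fracture", "pneumonia", "sprain", "asthma"]
example_masked = microaggregate(example_records, example_score, k=2)
```

With this design every released category value is shared by at least `k` records, which is the property microaggregation is meant to provide; how well the result respects the taxonomy depends entirely on the quality of the numerical mapping.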

Keywords

Statistical disclosure control; Hierarchical attributes; Classification; Taxonomic data; Nominal data; Data anonymization

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Josep Domingo-Ferrer (1)
  • Krish Muralidhar (2)
  • Guillem Rufian-Torrell (1)
  1. Department of Computer Engineering and Maths, Universitat Rovira i Virgili, UNESCO Chair in Data Privacy, Tarragona, Spain
  2. Gatton College of Business & Economics, University of Kentucky, Lexington, USA
