Abstract
The exploitation of microdata compiled by statistical agencies is of great interest to the data mining community. However, such data often include sensitive information that can be directly or indirectly related to individuals. Hence, an appropriate anonymisation process is needed to minimise the risk of disclosing identities and/or confidential data. Many anonymisation methods have been developed for numerical data, but approaches to the anonymisation of non-numerical values (e.g. categorical, textual) are scarce and shallow. Since the utility of this kind of information is closely tied to the preservation of its meaning, this work uses the notion of semantic similarity to enable a semantically coherent anonymisation. The knowledge modelled in ontologies serves as the basis for semantic operators that enable accurate management and transformation of categorical attributes. These operators are then used in three anonymisation mechanisms: Semantic Recoding, Semantic and Adaptive Microaggregation and Semantic Resampling. The three algorithms are compared in terms of semantic utility, disclosure risk and runtime, with encouraging results.
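As a rough illustration of what an ontology-based semantic operator might look like, the sketch below computes a path-length distance between categorical values in a small is-a taxonomy and then picks the term that minimises the summed distance to a group of values. The taxonomy `PARENT`, the functions `path_distance` and `semantic_centroid`, and the choice of edge counting as the distance are all assumptions made for this example; the abstract does not specify the chapter's actual operators, which may differ.

```python
from collections import Counter

# Hypothetical toy is-a taxonomy (child -> parent); not taken from the chapter.
PARENT = {
    "flu": "respiratory disease",
    "pneumonia": "respiratory disease",
    "respiratory disease": "disease",
    "gastritis": "digestive disease",
    "digestive disease": "disease",
    "disease": None,
}


def ancestors(term):
    """Return the path from term up to the root, term included."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def path_distance(a, b):
    """Count the is-a edges on the shortest path between a and b."""
    up_a = ancestors(a)
    depth_in_b = {t: i for i, t in enumerate(ancestors(b))}
    for i, t in enumerate(up_a):
        if t in depth_in_b:  # first shared term is the lowest common ancestor
            return i + depth_in_b[t]
    raise ValueError("terms share no common ancestor")


def semantic_centroid(values):
    """Pick the taxonomy term with the smallest summed distance to all values.

    Assumption: any term of the taxonomy may act as the representative.
    Ties are broken alphabetically to keep the result deterministic.
    """
    counts = Counter(values)
    return min(
        sorted(PARENT),
        key=lambda c: sum(n * path_distance(c, v) for v, n in counts.items()),
    )


if __name__ == "__main__":
    print(path_distance("flu", "pneumonia"))
    print(semantic_centroid(["flu", "pneumonia", "gastritis"]))
```

A representative chosen this way could, for instance, stand in for all the values of a group during microaggregation, so that each record is replaced by a term that stays semantically close to the original.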
Acknowledgments
This work has been supported by the Spanish Ministry of Science and Innovation (through projects ICWT TIN2012-32757, ARES-CONSOLIDER INGENIO 2010 CSD2007-00004 and BallotNext IPT-2012-0603-430000) and by the Government of Catalonia under grants 2009 SGR 1135 and 2009 SGR-01523. Dr. Martínez was supported with research grants by the Universitat Rovira i Virgili and Ministerio de Educación y Ciencia (Spain).
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Martínez, S., Valls, A., Sánchez, D. (2015). Semantic Anonymisation of Categorical Datasets. In: Navarro-Arribas, G., Torra, V. (eds) Advanced Research in Data Privacy. Studies in Computational Intelligence, vol 567. Springer, Cham. https://doi.org/10.1007/978-3-319-09885-2_7
DOI: https://doi.org/10.1007/978-3-319-09885-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09884-5
Online ISBN: 978-3-319-09885-2
eBook Packages: Engineering, Engineering (R0)