Data Privacy with \(R\)

  • Daniel Abril
  • Guillermo Navarro-Arribas
  • Vicenç Torra
Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 567)

Abstract

Privacy Preserving Data Mining (PPDM) is an application field of growing relevance. Its goal is to study mechanisms that allow confidential data to be disseminated for data mining tasks while preserving individuals' private information. Because the \(R\) language is widely used in the statistics and data mining communities, it is a good environment in which to research, develop, and test privacy techniques aimed at data mining. In this chapter we present several PPDM protection techniques, together with the evaluation of their information loss and disclosure risk, and outline some tools in \(R\) that help introduce practitioners to this field.

Keywords

privacy preserving data mining; microdata protection; masking methods; information loss; disclosure risk; record linkage

Notes

Acknowledgments

This work was partially supported by the Spanish MICINN through the projects COPRIVACY (TIN2011-27076-C03-03), N-KHRONOUS (TIN2010-15764), and ARES (CONSOLIDER INGENIO 2010 CSD2007-00004), and by the EC (FP7/2007-2013) through Data without Boundaries (grant agreement number 262608). The first author's contribution was carried out as part of the Computer Science Ph.D. program of the Universitat Autònoma de Barcelona (UAB).

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Daniel Abril (1, 2)
  • Guillermo Navarro-Arribas (3)
  • Vicenç Torra (1)
  1. Institut d’Investigació en Intel·ligència Artificial, Consejo Superior de Investigaciones Científicas, Campus de la UAB, Catalonia, Spain
  2. UAB, Universitat Autònoma de Barcelona, Barcelona, Spain
  3. Department of Information and Communications Engineering, Universitat Autònoma de Barcelona, Catalonia, Spain
