COCOA: A Synthetic Data Generator for Testing Anonymization Techniques

  • Vanessa Ayala-Rivera
  • A. Omar Portillo-Dominguez
  • Liam Murphy
  • Christina Thorpe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9867)

Abstract

Conducting extensive testing of anonymization techniques is critical to assess their robustness and identify the scenarios where they are most suitable. However, access to real microdata is highly restricted, and the microdata that is publicly available is usually anonymized or aggregated, which reduces its value for testing purposes. In this paper, we present a framework (COCOA) for the generation of realistic synthetic microdata that allows users to define multi-attribute relationships in order to preserve the functional dependencies of the data. We demonstrate how COCOA strengthens the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show that COCOA is practical for generating large datasets.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Vanessa Ayala-Rivera (1)
  • A. Omar Portillo-Dominguez (1)
  • Liam Murphy (1)
  • Christina Thorpe (1)

  1. Lero@UCD, School of Computer Science, University College Dublin, Dublin, Ireland
