COCOA: A Synthetic Data Generator for Testing Anonymization Techniques

  • Conference paper
Privacy in Statistical Databases (PSD 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9867)

Abstract

Extensive testing of anonymization techniques is critical to assess their robustness and to identify the scenarios in which they are most suitable. However, access to real microdata is highly restricted, and the microdata that is publicly available is usually anonymized or aggregated, which reduces its value for testing purposes. In this paper, we present a framework (COCOA) for generating realistic synthetic microdata that allows users to define multi-attribute relationships in order to preserve the functional dependencies of the data. We show how COCOA strengthens the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show that COCOA is practical for generating large datasets.
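
The idea of preserving functional dependencies while generating synthetic records can be illustrated with a minimal Python sketch. This is not COCOA's implementation or API; the lookup tables, attribute names, and the helpers `generate_records` and `holds_dependency` are hypothetical and chosen only for illustration. The sketch assumes that one attribute (country) is fully determined by another (city), and that a third (salary) is drawn from a range tied to occupation.

```python
import random

# Hypothetical lookup tables (illustrative only, not taken from the paper).
CITY_TO_COUNTRY = {"Dublin": "Ireland", "Cork": "Ireland", "Lyon": "France"}
SALARY_RANGE_BY_OCCUPATION = {
    "nurse": (30_000, 55_000),
    "engineer": (45_000, 90_000),
    "clerk": (22_000, 38_000),
}


def generate_records(n, seed=0):
    """Generate n synthetic records that respect the relationships above."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    cities = sorted(CITY_TO_COUNTRY)
    occupations = sorted(SALARY_RANGE_BY_OCCUPATION)
    records = []
    for _ in range(n):
        city = rng.choice(cities)
        occupation = rng.choice(occupations)
        low, high = SALARY_RANGE_BY_OCCUPATION[occupation]
        records.append({
            "age": rng.randint(18, 65),        # independent attribute
            "city": city,
            "country": CITY_TO_COUNTRY[city],  # functional dependency: city -> country
            "occupation": occupation,
            "salary": rng.randint(low, high),  # range depends on occupation
        })
    return records


def holds_dependency(records, lhs, rhs):
    """Return True if every value of lhs maps to exactly one value of rhs."""
    seen = {}
    for record in records:
        if seen.setdefault(record[lhs], record[rhs]) != record[rhs]:
            return False
    return True


records = generate_records(1000)
```

Calling `holds_dependency(records, "city", "country")` checks that the generated data still satisfies the dependency, which is the kind of property a test dataset for anonymization techniques would need to retain.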

Acknowledgments

This work was supported, in part, by Science Foundation Ireland grant 10/CE/I1855 to Lero, the Irish Software Research Centre (www.lero.ie).

Corresponding author

Correspondence to Vanessa Ayala-Rivera.

Appendices

Appendix A PCA Constituent Metrics

Tables 1, 2, and 3 list the constituent metrics used in the principal component analysis (PCA) of the datasets generated by COCOA (discussed in Sect. 4.2).

Table 1. Constituent metrics for PC analysis of German credit domain.
Table 2. Constituent metrics for PC analysis of adult domain.
Table 3. Constituent metrics for PC analysis of insurance domain.

Appendix B Structure of the Irish census and insurance domains

Tables 4 and 5 list the attributes and the type of generators used for producing data for the Irish census and insurance domains, respectively (discussed in Sect. 3.5).

Table 4. Irish census domain structure.
Table 5. Insurance domain structure.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ayala-Rivera, V., Portillo-Dominguez, A.O., Murphy, L., Thorpe, C. (2016). COCOA: A Synthetic Data Generator for Testing Anonymization Techniques. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_13

  • DOI: https://doi.org/10.1007/978-3-319-45381-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45380-4

  • Online ISBN: 978-3-319-45381-1

  • eBook Packages: Computer Science, Computer Science (R0)
