COCOA: A Synthetic Data Generator for Testing Anonymization Techniques

Ayala-Rivera, Vanessa; Portillo-Dominguez, A. Omar; Murphy, Liam; Thorpe, Christina

doi:10.1007/978-3-319-45381-1_13

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9867)

Included in the following conference series: International Conference on Privacy in Statistical Databases

Abstract

Extensive testing of anonymization techniques is critical to assess their robustness and to identify the scenarios where they are most suitable. However, access to real microdata is highly restricted, and the microdata that is publicly available is usually anonymized or aggregated, which reduces its value for testing purposes. In this paper, we present a framework (COCOA) for generating realistic synthetic microdata that allows users to define multi-attribute relationships in order to preserve the functional dependencies of the data. We demonstrate how COCOA strengthens the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show that COCOA is practical for generating large datasets.
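To make the idea of multi-attribute relationships concrete, the following is a minimal Python sketch, not COCOA's actual implementation or API. It assumes two hypothetical dependencies invented purely for illustration: salary is drawn from a range determined by occupation, and marital status is constrained by age. The attribute names, value ranges, and the functions `generate_record` and `generate_dataset` are all assumptions.

```python
import random

# Hypothetical dependency table (an assumption for illustration):
# each occupation maps to a plausible salary range.
SALARY_RANGE = {
    "nurse": (30_000, 60_000),
    "engineer": (45_000, 110_000),
    "teacher": (28_000, 65_000),
}


def generate_record(rng):
    """Draw one synthetic record whose dependent attributes respect the rules above."""
    age = rng.randint(16, 90)
    occupation = rng.choice(sorted(SALARY_RANGE))

    # Salary depends on occupation: sample only within that occupation's range.
    low, high = SALARY_RANGE[occupation]
    salary = rng.randint(low, high)

    # Marital status depends on age: an illustrative rule that minors are single.
    if age < 18:
        marital_status = "single"
    else:
        marital_status = rng.choice(["single", "married", "divorced", "widowed"])

    return {
        "age": age,
        "occupation": occupation,
        "salary": salary,
        "marital_status": marital_status,
    }


def generate_dataset(n, seed=0):
    """Generate n records reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [generate_record(rng) for _ in range(n)]
```

Sampling independent attributes first and dependent ones afterwards keeps every record consistent with the declared relationships, and the fixed seed makes a test scenario repeatable across runs of an anonymization algorithm.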

Acknowledgments

This work was supported, in part, by Science Foundation Ireland grant 10/CE/I1855 to Lero, the Irish Software Research Centre (www.lero.ie).

Author information

Authors and Affiliations

Lero@UCD, School of Computer Science, University College Dublin, Dublin, Ireland
Vanessa Ayala-Rivera, A. Omar Portillo-Dominguez, Liam Murphy & Christina Thorpe

Corresponding author

Correspondence to Vanessa Ayala-Rivera.

Editor information

Editors and Affiliations

Universitat Rovira i Virgili, Tarragona, Spain
Josep Domingo-Ferrer
University of Zagreb, Zagreb, Croatia
Mirjana Pejić-Bach

Appendices

Appendix A PCA Constituent Metrics

Tables 1, 2, and 3 list the constituent metrics used in the principal component analysis (PCA) of the datasets generated by COCOA (discussed in Sect. 4.2).

Table 1. Constituent metrics for PC analysis of German credit domain.

Table 2. Constituent metrics for PC analysis of adult domain.

Table 3. Constituent metrics for PC analysis of insurance domain.

Appendix B Structure of the Irish Census and Insurance Domains

Tables 4 and 5 list the attributes and the type of generators used for producing data for the Irish census and insurance domains, respectively (discussed in Sect. 3.5).

Table 4. Irish census domain structure.

Table 5. Insurance domain structure.

About this paper

Cite this paper

Ayala-Rivera, V., Portillo-Dominguez, A.O., Murphy, L., Thorpe, C. (2016). COCOA: A Synthetic Data Generator for Testing Anonymization Techniques. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_13

DOI: https://doi.org/10.1007/978-3-319-45381-1_13
Published: 31 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45380-4
Online ISBN: 978-3-319-45381-1
eBook Packages: Computer Science, Computer Science (R0)
