Abstract
Conducting extensive testing of anonymization techniques is critical to assess their robustness and identify the scenarios where they are most suitable. However, access to real microdata is highly restricted, and the microdata that is publicly available is usually anonymized or aggregated, which reduces its value for testing purposes. In this paper, we present a framework (COCOA) for the generation of realistic synthetic microdata that allows users to define multi-attribute relationships in order to preserve the functional dependencies of the data. We show how COCOA is useful to strengthen the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show that COCOA is practical for generating large datasets.
Acknowledgments
This work was supported, in part, by Science Foundation Ireland grant 10/CE/I1855 to Lero - the Irish Software Research Centre (www.lero.ie).
Appendices
Appendix A PCA Constituent Metrics
Tables 1, 2, and 3 list the constituent metrics used to perform the PCA of the datasets generated by COCOA (discussed in Sect. 4.2).
Appendix B Structure of the Irish census and insurance domains
Tables 4 and 5 list the attributes and the type of generators used for producing data for the Irish census and insurance domains, respectively (discussed in Sect. 3.5).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ayala-Rivera, V., Portillo-Dominguez, A.O., Murphy, L., Thorpe, C. (2016). COCOA: A Synthetic Data Generator for Testing Anonymization Techniques. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_13
DOI: https://doi.org/10.1007/978-3-319-45381-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45380-4
Online ISBN: 978-3-319-45381-1
eBook Packages: Computer Science, Computer Science (R0)