Abstract
Quasi-identifiers (QIDs) are attribute combinations that can be used to recover hidden personally identifying information from an anonymised dataset. Typically, the information drawn from such QIDs can then be combined with other publicly accessible datasets to discover sensitive information (e.g. medical conditions, financial status, or criminal history). Research on data anonymisation has therefore proposed various algorithms to discover and transform quasi-identifiers efficiently in order to prevent re-identification. However, existing algorithms are inefficient and fail to prevent re-identification attacks on large, real-world, high-dimensional datasets. This paper presents a quasi-identifier discovery algorithm that combines parallelism with an efficient search technique to find all minimal quasi-identifiers in a given dataset. As a further step, we present an adversary model based on the enumeration problem of discovering unique column combinations in a dataset. We demonstrate that our quasi-identifier discovery algorithm is secure against re-identification attacks under this adversary model, even for large high-dimensional datasets that change dynamically. Our empirical results show that our algorithm not only scales well to large high-dimensional datasets but also exploits its parallelisability on GPU (Graphics Processing Unit) architectures to prevent re-identification even in the presence of a powerful adversary equipped with similar high-performance computing power. Furthermore, our results show that the proposed GPU algorithm offers a speedup of up to 100x over the algorithm's CPU version.
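To make the notion of a minimal quasi-identifier concrete, the following is a minimal sequential sketch, not the parallel GPU algorithm presented in the paper. It assumes that a quasi-identifier can be approximated by a unique column combination (a set of columns whose values distinguish every row), as in the adversary model described above, and it enumerates candidates level by level with a brute-force search. The function names and the toy table are hypothetical and serve illustration only.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values on the columns `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, n_cols):
    """Levelwise brute-force search for all minimal unique column combinations.

    Candidates are visited by increasing size, so any unique combination that
    contains an earlier result is a non-minimal superset and is skipped.
    """
    found = []
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            if any(set(f) <= set(cols) for f in found):
                continue  # superset of a known minimal combination
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Hypothetical toy table with columns (zip, birth_year, sex, diagnosis_code).
rows = [
    ("10115", 1980, "F", "A"),
    ("10115", 1980, "M", "B"),
    ("10117", 1980, "F", "A"),
    ("10117", 1975, "F", "B"),
]

print(minimal_uccs(rows, 4))  # prints [(0, 3), (0, 1, 2)]
```

In this toy table no single column identifies a row, but the pair (zip, diagnosis_code) and the triple (zip, birth_year, sex) each do. The search visits up to 2^n column subsets for n columns, which illustrates why exhaustive enumeration becomes costly on high-dimensional data and why parallel evaluation of candidates is attractive.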
Copyright information
© 2021 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2021). A Parallel Quasi-identifier Discovery Scheme for Dependable Data Anonymisation. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems L. Lecture Notes in Computer Science, vol 12930. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-64553-6_1
DOI: https://doi.org/10.1007/978-3-662-64553-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-64552-9
Online ISBN: 978-3-662-64553-6
eBook Packages: Computer Science, Computer Science (R0)