Abstract
This paper presents the first part of a study on the automated processing of personal data for the purpose of anonymization and analysis. This part is a survey: it analyzes the state of research in the area and systematizes the available results. The analysis covers a wide range of anonymization issues; it yields a systematic picture of the state of research and substantiates the choice of direction for further study. First, the main terms and concepts used in connection with the anonymization of personal data are defined, including their relation to the legislation of the Russian Federation. The research directions are grouped into four sections: anonymization methods, implementation issues, applications of anonymized data processing, and deanonymization issues. For each group of anonymization methods (randomization, grouping, data distribution, and application control), the main algorithms are described and their advantages and disadvantages are analyzed. Implementation issues concern the usefulness of anonymized data, the limits of applicability of universal algorithms, and reliability with respect to maintaining the anonymity of personal data subjects. Among the applications that have created demand for the processing of anonymized data, medical, biological, and genetic research and law enforcement are discussed. The final part mentions the most notable cases of deanonymization and gives a short review of mass media reports.
ACKNOWLEDGMENTS
This work was carried out using the infrastructure of the shared computer center “High-Performance Computations and Big Data” of the Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences.
Ethics declarations
The authors declare that they have no conflicts of interest.
Cite this article
Borisov, A.V., Bosov, A.V. & Ivanov, A.V. Application of Computer Simulation to the Anonymization of Personal Data: State-of-the-Art and Key Points. Program Comput Soft 49, 232–246 (2023). https://doi.org/10.1134/S0361768823040047
DOI: https://doi.org/10.1134/S0361768823040047