1 Introduction

Data privacy refers to the rights of individuals over their personal information. That is, it is concerned with what data is collected, for which purpose, and how it is handled. In recent years, data privacy has become a major issue due to the growth of data generation, as well as the interest of third parties, such as businesses or researchers, in collecting and exploiting that information. It is important to differentiate it from the concept of data security, whose main objective is to protect personal information from being accessed by unauthorised third parties or attackers. Data security is, however, a prerequisite for data privacy [1].

One way to ensure data privacy is through anonymisation, a process that makes an individual non-identifiable within a set of individuals. It is a Privacy Enhancing Technique (PET) that results from transforming personal data to irreversibly prevent identification, and it comprises a set of techniques to manipulate the information so that data subjects (i.e., the persons to whom the data refers) become less identifiable. The robustness of each anonymisation technique can be analysed in terms of different criteria, such as whether it is possible to identify a single person, whether different records regarding the same individual can be linked, or how much information can be inferred about the data subject [2]. As a result, once the data is properly anonymised it cannot be linked back to the individual, and therefore it is no longer considered personal data according to the General Data Protection Regulation (GDPR): “The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person”.

It is important to highlight the differences between anonymisation and other PETs, such as pseudonymisation and encryption, since these terms are often confused: pseudonymisation protects an individual by replacing their identifiers with a pseudonym, making the individual less (but still) identifiable. It is a reversible process, so it does not remove the re-identification risk, since the mapping between the real identifiers and the pseudonyms still exists. Article 25 of the GDPR highlights the role of pseudonymisation “as a technical and organisational measure to help enforce data minimisation principles and compliance with Data Protection by Design and by Default obligations” [3].
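To make the distinction concrete, the minimal Python sketch below (an illustrative example with hypothetical record and column names, not a production design) shows pseudonymisation as a reversible mapping between real identifiers and random pseudonyms: whoever holds the mapping table can undo the process.

```python
import secrets

# Hypothetical records keyed by a direct identifier (the person's name).
records = [
    {"name": "Alice Smith", "city": "Vigo", "salary": 32000},
    {"name": "Bob Jones", "city": "Oslo", "salary": 41000},
]

pseudonym_map = {}  # real identifier -> pseudonym; must itself be kept secret


def pseudonymise(rows):
    """Replace the direct identifier with a random pseudonym."""
    out = []
    for row in rows:
        pseudo = pseudonym_map.setdefault(row["name"], secrets.token_hex(8))
        out.append({**row, "name": pseudo})
    return out


def re_identify(rows):
    """Reverse the process: possible for whoever holds the mapping table."""
    reverse = {v: k for k, v in pseudonym_map.items()}
    return [{**row, "name": reverse[row["name"]]} for row in rows]


pseudo_rows = pseudonymise(records)  # names replaced by opaque tokens
original = re_identify(pseudo_rows)  # original names recovered from the mapping
```

Because the mapping table still exists, the pseudonymised records remain personal data under the GDPR; only when every means of re-identification is removed does the process move towards anonymisation.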

On the other hand, encryption is a security measure that provides confidentiality in a communication channel or for data at rest (data stored on persistent physical storage), preventing the disclosure of information to unauthorised parties. The goal of encryption is not to make the data subject less identifiable: the original data is always available to any entity that has access to the encryption key (or that can break the encryption protocol and recover the original information), and therefore the possibility of identifying the subject remains [2]. In addition, a key management system must be in place to protect and manage the encryption keys, which introduces complexity into the system.
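The short sketch below, which relies on the third-party `cryptography` package (an assumed dependency chosen for illustration, not one prescribed by this chapter), makes the same point in code: anyone with access to the key recovers the fully identifying record exactly.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # must be protected by a key management system
cipher = Fernet(key)

record = b"Alice Smith;+34 600 000 000;Vigo"
token = cipher.encrypt(record)   # confidential at rest or in transit

# Any party holding the key recovers the original, identifying data unchanged.
assert cipher.decrypt(token) == record
```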

The main advantage of anonymisation techniques over other PETs such as encryption is that they do not involve key management and, depending on the technique, may require fewer computational resources. However, data anonymisation is an irreversible process and only provides privacy, meaning that other security properties (e.g., confidentiality or integrity) must be implemented through other means. In addition, adequate anonymisation allows data processing without the need to comply with data privacy regulations, reducing the organisational cost of using and exploiting the data.

However, ensuring that data is properly anonymised requires multiple factors to be taken into account: lowering the re-identification risk to a specific extent is not always possible without also losing the utility of the data. Moreover, the anonymisation process requires a deep analysis of the original dataset to find the anonymisation procedure that best fits our needs. A procedure that was suitable for a particular dataset might not work for a second dataset, since the nature and scope of the data will differ, as will the later use of the anonymised data. In addition, it is also necessary to consider that additional datasets might become available in the future, which could be cross-referenced with the anonymised data, affecting the overall re-identification risk.

Despite how useful data anonymisation can be from a compliance perspective, it can be a daunting task, especially when trying to minimise the risks and ensure that the data is being properly anonymised. This chapter explores how anonymisation can be used as a regulatory compliance tool, addressing common issues that arise when introducing data anonymisation in a Big Data context. The complexity of a data anonymisation procedure, and its relationship with GDPR compliance, is explored in Sect. 2. Finally, the specific challenges of data anonymisation in a Big Data context are analysed in Sect. 3, highlighting the differences with a Small Data environment.

2 Anonymisation as Regulatory Compliance Tool

With regard to the GDPR, encryption and pseudonymisation are considered security measures that need to be implemented to allow the processing of personal data. On the other hand, anonymisation makes the individuals within a particular dataset non-identifiable, and therefore anonymised data is no longer considered personal data.

However, ensuring that data anonymisation is correctly applied is a challenging task. The use of data anonymisation techniques implies, in most cases, a certain loss of data utility, as it typically relies on modifying the data values in order to make them less unique. While privacy aims to avoid the disclosure of sensitive data and the possibility of making certain deductions from a given dataset, utility describes the analytical value of the data. In other words, utility seeks to find real-world correlations in a certain dataset, and the goal of privacy is to hide those correlations [4]. When applying anonymisation operations, data privacy improves, but there is a risk of reducing the analytical value of the dataset. For this reason, it is necessary to find a suitable trade-off between privacy and utility, as illustrated in Fig. 19.1, where some utility is sacrificed to reach an acceptable level of privacy.

Fig. 19.1 Trade-off between privacy and utility during an anonymisation process

D’Acquisto et al. distinguish two anonymisation approaches to find this trade-off point [5]: utility-first and privacy-first anonymisation. In the former, an anonymisation method with a heuristic parameter and utility-preservation properties is run, and the risk of disclosure is measured afterwards. In the latter, by contrast, an upper bound on the re-identification disclosure risk and/or the attribute disclosure risk is set beforehand, and the utility of the resulting data is assessed afterwards.
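The two workflows can be sketched as follows (a toy, hedged illustration: the generalisation step, the uniqueness-based risk measure and the risk bound are all made-up choices for the example, not part of [5]).

```python
import pandas as pd

df = pd.DataFrame({
    "postcode": ["36201", "36202", "36310", "36311", "08001", "08002"],
    "age":      [34, 35, 51, 52, 28, 29],
    "salary":   [32000, 35000, 41000, 39000, 30000, 31000],
})
QIS = ["postcode", "age"]


def generalise(data, level):
    """Toy generalisation: truncate the postcode and widen the age bucket."""
    out = data.copy()
    out["postcode"] = out["postcode"].str[: max(1, 5 - level)] + "*" * min(level, 4)
    out["age"] = (out["age"] // (5 * level)) * (5 * level)
    return out


def disclosure_risk(data):
    """Share of records that are unique on the quasi-identifiers."""
    sizes = data.groupby(QIS).size()
    return (sizes == 1).sum() / len(data)


# Utility-first: pick a heuristic parameter, anonymise, measure the risk afterwards.
anon_uf = generalise(df, level=1)
print("utility-first residual risk:", disclosure_risk(anon_uf))

# Privacy-first: fix an upper bound on the risk, then generalise until it is met.
RISK_BOUND = 0.0
level, anon_pf = 0, df
while disclosure_risk(anon_pf) > RISK_BOUND:
    level += 1
    anon_pf = generalise(df, level)
print("privacy-first generalisation level:", level)
```

In the utility-first run the residual risk is whatever the heuristic parameter happens to yield (here about a third of the records remain unique), whereas the privacy-first loop keeps generalising until the chosen bound is met, accepting whatever utility is left.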

Regardless of which anonymisation approach is followed, it is essential to ensure that the individuals are not identifiable after the process. To verify this property, it is necessary to analyse the data and classify it into direct identifiers (attributes that unequivocally identify a data subject, such as the name, telephone number or ID card number), quasi-identifiers (attributes that by themselves do not reveal an identity, but can pose a privacy risk when combined with others, e.g., postal code or birth date) and sensitive data (attributes that should be preserved because they hold value for later analysis, such as a medical condition or salary).
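In practice, a useful first step is to record this classification explicitly, so that every later decision (what to drop, what to generalise, what to preserve) refers to it. The snippet below is a minimal sketch with hypothetical attribute names.

```python
# Hypothetical attribute classification for a customer dataset.
ATTRIBUTE_CLASSES = {
    "direct_identifiers": ["name", "phone_number", "id_card_number"],
    "quasi_identifiers":  ["postcode", "birth_date", "gender"],
    "sensitive":          ["diagnosis", "salary"],
}


def drop_direct_identifiers(rows):
    """Remove direct identifiers; quasi-identifiers still need further treatment."""
    to_drop = set(ATTRIBUTE_CLASSES["direct_identifiers"])
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
```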

It has been proven that removing the direct identifiers from a dataset is not enough to preserve privacy: an attacker might have access to additional data sources or background knowledge that could lead to the re-identification of the anonymised individuals [6]. De Montjoye et al. [7] demonstrated that 95% of the individuals in an anonymised dataset of fifteen months of mobility data (containing records from around 1.5 million individuals) could be identified using only four spatio-temporal points. This shows that simple approaches to anonymisation are not enough and that more complex solutions are needed.

The GDPR sets a high standard for data to be considered truly anonymous, since anonymity implies that data protection rules no longer apply. In Recital 26, the GDPR states that the organisation should consider not only whether the individual is re-identifiable, but also “all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments” [3].

The provision above can be better understood with a real example: in October 2018, a Danish service was fined 1.2 million kroner (around €160,000) for not deleting users’ data once it was no longer needed for the company’s business activity [8]. The company argued that it could keep the data since it was properly anonymised and could no longer be considered personal data: the dataset was being anonymised by deleting the names from the database, but other data, such as telephone numbers, were neither removed nor masked.

The company’s assumption that anonymised data is no longer considered personal data was correct. However, it failed to analyse the GDPR requirements for data to be considered anonymous. In its Opinion 05/2014 on Anonymisation Techniques, the Article 29 Working Party clearly states that “removing directly identifying elements in itself is not enough to ensure that identification of the data subject is no longer possible. It will often be necessary to take additional measures to prevent identification, once again depending on the context and purposes of the processing for which the anonymised data are intended” [2].

According to the Article 29 Working Party, an effective anonymisation solution should satisfy the following criteria, preventing any party from:

  • Singling out an individual in a dataset (i.e., isolating some records that could point out the identity of the person).

  • Linking two records within a dataset that belong to the same person, even though her identity remains unknown.

  • Inferring any information in such a dataset by deducing the value of an attribute from the values of a set of other attributes.

Therefore, re-identification does not only mean retrieving a person’s name, but also being able to “single out an individual in a dataset”. Coming back to our previous example, the issues with the company’s anonymisation solution are now clear: by keeping direct identifiers such as telephone numbers in the dataset, it satisfied neither the singling-out nor the linkability criterion. First, telephone numbers can be considered direct identifiers, since each person has a different telephone number, singling out every individual in the dataset. Second, a telephone number can easily be linked to a natural person by using other datasets and publicly available information. Therefore, the data could not be considered anonymous, and the company was not fully compliant with the GDPR.

This example demonstrates how complex it is to make data anonymisation GDPR compliant: basic anonymisation techniques are insufficient to guarantee the privacy of the data subjects, and the residual risk retained by the anonymised data must be analysed, both to ensure an adequate level of privacy protection and to verify that the anonymisation solution meets the requirements stated above.

Therefore, the anonymisation process must be adapted on a case-by-case basis, ideally adopting a risk-based approach. An analysis of the re-identification risk must be performed to assess whether the anonymisation solution meets the criteria or whether further measures are needed [2]. Moreover, this analysis must be performed continuously, since the risk is subject to change and new datasets might be published that allow cross-referencing the anonymised data: for instance, in Sweden taxpayers’ information is publicly available [9], while in other countries it is not. This circumstance might change in the future, making previous assumptions about data linkability erroneous. Likewise, a dataset containing information about both Swedish and Spanish citizens cannot be anonymised following the same procedure for both groups, since the re-identification risks might differ [10].

As a result, the analysis of the re-identification risk is a useful procedure that allows us to identify the residual risk of the anonymised data, as well as to ease the GDPR compliance process. The re-identification risk can be assessed using multiple techniques, mainly focused on analysing the uniqueness of the data subjects (the first criterion for a good anonymisation solution) but not taking into account other properties such as linkability or inference. To address this issue, Adkinson et al. [11] propose a dynamic risk-based approach and a set of privacy metrics that consider both the re-identification risk and the presence of an adversary with certain background knowledge trying to disclose information from a dataset.
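A basic, uniqueness-oriented assessment can be sketched as follows (an illustrative metric only, not the method of [11]): records are grouped into equivalence classes by their quasi-identifier values, a record in a class of size one is singled out, and the expected re-identification risk of a record in a class of size k is taken as 1/k.

```python
import pandas as pd


def reidentification_report(df, quasi_identifiers):
    """Uniqueness-based risk summary over the given quasi-identifiers."""
    # Size of the equivalence class each record belongs to.
    class_sizes = df.groupby(quasi_identifiers).size().rename("k").reset_index()
    k = df.merge(class_sizes, on=quasi_identifiers, how="left")["k"]
    risk = 1.0 / k
    return {
        "records_singled_out": int((k == 1).sum()),  # singling-out criterion
        "max_record_risk": float(risk.max()),        # worst-case record (1/k)
        "avg_record_risk": float(risk.mean()),       # dataset-level average
        "k_anonymity": int(k.min()),                 # smallest class size
    }


df = pd.DataFrame({"postcode":  ["36201", "36201", "08001"],
                   "gender":    ["F", "F", "M"],
                   "diagnosis": ["flu", "asthma", "flu"]})
print(reidentification_report(df, ["postcode", "gender"]))
# -> one record singled out, average per-record risk of roughly 0.67, k-anonymity of 1
```

Such a report covers only the uniqueness dimension; as noted above, metrics that also model linkability, inference and adversarial background knowledge are needed for a complete picture.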

To summarise, anonymisation is a non-trivial process that needs to be performed cautiously to achieve GDPR compliance. It is essential to understand that basic anonymisation procedures are usually not sufficient for most real-world applications and datasets, and that it is necessary not just to anonymise the data but also to estimate the remaining privacy risks. It is therefore essential to perform a continuous analysis of the re-identification risk of the anonymised dataset, taking into account the uniqueness, linkability, and inference properties of the anonymisation solution applied. Furthermore, the anonymised datasets need to guarantee an adequate balance between data privacy and utility, preserving the analytical value of the data, which requires analysing different approaches to decide the appropriate trade-off point between privacy and utility for the intended data processing purpose.

3 Anonymisation at Large Scale: Challenges and Solutions

The rapid increase in the amount of data that businesses and researchers handle nowadays is a consequence of technological advances in fields such as Cloud Computing, the Internet of Things (IoT) and Machine Learning, together with the increase in computational power and the lower cost of data storage. This has led to the concept of Big Data: large volumes of data whose complexity and velocity hinder the use of conventional technologies [12]. This definition highlights three properties of Big Data, also known as the 3 V’s: volume, velocity, and variety. Later studies extended this definition by adding the properties of veracity, variability, visualisation, and value [13]. The term veracity refers to the reliability or quality of the data, that is, its truthfulness. Variability alludes to the non-homogeneity of the data. Finally, the value property describes the remaining utility of the data after its processing. The opposite of Big Data is known as Small Data, where the amount of data and its complexity are much lower, making it easier to process.

As introduced earlier, ensuring data privacy while preserving some utility is a hard challenge. However, the increase in data volume, as well as the complexity, variety, and velocity typical of a Big Data scenario, introduce even more complications. Firstly, due to the large volume of data, computational efficiency may be a critical issue when selecting an anonymisation technique or privacy model [5]. Furthermore, evaluating the re-identification risks in a Big Data context, as well as measuring the utility and information loss, is also computationally complex. Secondly, aside from the computational issues derived from working with a huge amount of data, other problems arise in Big Data scenarios: the variety of the data also plays an important role, since most of the current privacy-preserving algorithms, such as clustering-based ones, are designed for homogeneous data. These algorithms work well in Small Data scenarios; in Big Data, however, the information is usually heterogeneous. Therefore, traditional data privacy techniques fall short, and there is a lack of scalable and efficient privacy algorithms.

Classical anonymisation techniques such as k-anonymity [6], l-diversity [14] or t-closeness [15] are not completely adequate to ensure Big Data privacy, as in many cases the data to be anonymised is unstructured or arrives as a live stream. However, other well-known techniques such as differential privacy [16] can be adapted more easily to a Big Data context [17]. This method introduces an intermediary between the data processor and the database, which acts as a privacy guard. The data processor does not get direct access to the full version of the data. Instead, the privacy guard evaluates the privacy risk (according to factors such as the sensitivity of the query to be executed and/or the size of the dataset) and introduces some distortion into the information retrieved from the database, proportional to the current privacy risk. Differential privacy benefits from larger datasets: if the dataset is large, less noise is needed to protect privacy [17]. However, one drawback of this technique is that the amount of noise to be introduced for sensitive queries (with a high privacy risk) is large, and retaining the utility of the data may be challenging or impossible.
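The core of the mechanism can be sketched in a few lines (a minimal, hedged illustration of the Laplace mechanism for a counting query; the epsilon value and the query are arbitrary choices, and a real deployment would also track the cumulative privacy budget across queries).

```python
import numpy as np


def dp_count(records, predicate, epsilon):
    """Answer 'how many records satisfy predicate?' with differential privacy.

    A counting query has sensitivity 1 (adding or removing one individual
    changes the count by at most 1), so Laplace noise with scale
    sensitivity / epsilon is added to the true answer.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


# The privacy guard returns a noisy answer instead of exposing the raw data.
salaries = [32000, 35000, 41000, 39000, 30000, 31000]
print(dp_count(salaries, lambda s: s > 34000, epsilon=0.5))
```

With a large dataset the same absolute noise distorts the answer proportionally less, which is why differential privacy benefits from volume; conversely, a small epsilon (a tight budget for a sensitive query) means more noise and less utility, as noted above.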

Regarding privacy models, the main difference with respect to anonymisation techniques is that they do not specify the set of transformations to be performed on the data, but rather the conditions that a dataset must satisfy to keep the disclosure risk under control [18]. According to Soria-Comas et al. [19], a privacy model needs to satisfy three properties to be usable in a Big Data environment: composability, computability, and linkability. A privacy model is composable if its privacy guarantees are preserved for a dataset resulting from merging several datasets, each of which individually satisfies the guarantees of the model. Computability refers to the computational cost of enforcing the model. Finally, linkability is the ability to link records relating to the same individual. In their work, they evaluate k-anonymity and differential privacy in terms of these properties.

While k-anonymity is not composable (the combination of two k-anonymous datasets is not guaranteed to preserve k-anonymity), differential privacy is strongly composable (combining two differentially private datasets increases the risk of disclosure, but differential privacy is still preserved). In terms of linkability, with k-anonymity it is at least possible to link the groups of k-anonymous records to which individuals belong. With differential privacy, datasets are not linkable if noise addition is used. Finally, the computability property cannot be directly compared, as the performance of each privacy model depends on the method used to anonymise the data.
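The linkability point can be illustrated with a small, hypothetical example (made-up tables and column names): two releases generalised to the same quasi-identifier levels can still be joined at the level of their equivalence classes, even though no individual record is identified.

```python
import pandas as pd

# Two hypothetical k-anonymised releases sharing generalised quasi-identifiers.
health = pd.DataFrame({"postcode":  ["362**", "362**", "080**"],
                       "age_band":  ["30-39", "30-39", "20-29"],
                       "diagnosis": ["flu", "asthma", "flu"]})
income = pd.DataFrame({"postcode":      ["362**", "080**"],
                       "age_band":      ["30-39", "20-29"],
                       "median_salary": [33500, 30500]})

# Groups (not individuals) are linked across the two releases.
linked = health.merge(income, on=["postcode", "age_band"], how="inner")
print(linked)
```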

Therefore, there is no absolute solution to protect data privacy in a Big Data context, since each method has its advantages and flaws, and the selection of the privacy model should be analysed on a case-by-case basis. The existing techniques, practices, and methodologies for data privacy protection are ineffective in Big Data if they are not used in an integrated manner [20]. In any case, any solution dealing with sensitive data should analyse the privacy risks in order to address the possible data privacy challenges correctly.

Another relevant issue associated with Big Data privacy is the complexity of anonymising data in real time, as in many cases the data has to be processed as soon as it arrives at the system. This is known as stream processing, and it occurs in many Big Data scenarios where the data is generated by many sources at high speed. Unlike batch processing, where the data is collected into batches and then fed into the system, in stream processing the data is dynamic and has a temporal dimension. Therefore, there is a maximum acceptable delay between the in-flowing data and the processed output [21].

Stream processing adds certain difficulties when it comes to performing a prior analysis of the data to select the best anonymisation strategy. Since the data arrives at the system in portions, the information is always incomplete, and performing a correct privacy risk assessment and utility evaluation is not an easy task. Moreover, traditional k-anonymity schemes are designed for static datasets and are therefore not suitable for streaming contexts. Furthermore, these techniques assume that each person appears in the dataset only once, an assumption that cannot be made in a streaming context [21]. These challenges become especially hard in the context of Big Data streams.
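The tension between accumulating enough records and respecting the delay bound can be seen in a toy buffered-release scheme (a deliberately simplified sketch, far simpler than the CASTLE-style algorithms reviewed below; the generalisation rules and parameters are made up).

```python
import time


class StreamAnonymiser:
    """Buffer incoming tuples and release them in generalised batches of >= k.

    A batch is also flushed when the oldest buffered tuple exceeds the maximum
    acceptable delay, even if fewer than k tuples have arrived; a real algorithm
    would then have to suppress or further generalise those tuples.
    """

    def __init__(self, k=3, max_delay_s=5.0):
        self.k = k
        self.max_delay_s = max_delay_s
        self.buffer = []  # (arrival_time, record) pairs

    def _generalise(self, records):
        # Toy generalisation: truncate the postcode, bucket the age per decade.
        return [{"postcode": r["postcode"][:3] + "**",
                 "age_band": f"{(r['age'] // 10) * 10}-{(r['age'] // 10) * 10 + 9}",
                 "diagnosis": r["diagnosis"]}
                for r in records]

    def push(self, record):
        """Add a record; return a released (generalised) batch, or None."""
        self.buffer.append((time.monotonic(), record))
        oldest_arrival = self.buffer[0][0]
        expired = time.monotonic() - oldest_arrival > self.max_delay_s
        if len(self.buffer) >= self.k or expired:
            batch = [r for _, r in self.buffer]
            self.buffer = []
            return self._generalise(batch)
        return None
```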

Some solutions have been proposed to address the issues that arise when anonymising data in such scenarios. Sakpere et al. [22] reviewed the state of the art of existing methods to anonymise Big Data streams, which are briefly explored below.

Li et al. developed in 2007 a perturbative method to achieve streaming data privacy based on adding random noise to the incoming data [23]. However, this method has certain flaws: it can only handle numerical data, and the large amount of artificial noise complicates the analysis of the anonymised dataset.

Other proposed methods are based on tree structures, such as SKY (Stream K-anonYmity) [24], SWAF (Sliding Window Anonymisation Framework) [25] and KIDS (K-anonymIsation Data Stream) [26]. The main disadvantages of these methods are their time and space complexity for data streams, the risk of re-identification if the data hierarchy used for the generalisation process is discovered by an attacker, and their use for the anonymisation of numerical values, which becomes considerably complicated due to the difficulty of finding a suitable hierarchy for the specialisation tree [27].

Clustering (grouping of similar data points) algorithms can also be useful for data anonymisation in a streaming context. Some examples are CASTLE (Continuously Anonymising STreaming data via adaptive cLustEring) [21], B-CASTLE [28], which is an improvement of CASTLE, FAANST (Fast Anonymising Algorithm for Numerical STreaming data) [29] and FADS (Fast clustering-based k-Anonymisation approach for Data Streams) [30]. Of these, FADS is the best choice for anonymisation in streaming due to its low time and space complexity. However, since it handles tuples sequentially, it is not suitable for Big Data streams. Mohammadian et al. introduced a new method, FAST (Fast Anonymization of Big Data Streams), based on FADS, which uses parallelism to increase the effectiveness of FADS and make it applicable to Big Data streams [31].

Last but not least, it is important to take into account that the growth of Big Data in recent years has facilitated cross-referencing information from different databases, increasing the risk of re-identification. Databases containing information that might at first seem innocuous can be matched with other publicly available datasets to re-identify users [32]. For example, it has been shown that a 5-digit zip code, birth date and gender alone can identify 80% of the population of the United States [33]. The inherent properties of Big Data, such as its high volume and complexity, further aggravate this problem.

In brief, the characteristics of Big Data make it especially difficult to preserve privacy according to the GDPR. Most of the traditional anonymisation algorithms are ineffective and/or inefficient in such scenarios, and further research in the field of Big Data privacy is necessary. Moreover, the velocity of the streaming data (usually also present in Big Data scenarios) introduces additional complications. Even though some anonymisation algorithms have been developed to address this issue, there are still unsolved challenges.

4 Conclusions

Throughout this chapter, we introduced some of the existing data privacy protection approaches, highlighting the main differences between them. Different real-world examples were provided to emphasise the importance of the correct application of the GDPR, as well as the consequences of a wrong anonymisation approach, which are critical for both individuals and companies. Anonymisation is a process that, to be performed correctly, requires a deep analysis and constant monitoring of the re-identification risks. In addition, we presented the privacy-utility problem and the metrics that can be used to monitor the re-identification risk of a dataset.

Furthermore, we introduced Big Data and stream processing, together with the difficulties they pose for anonymisation. Different methods for data anonymisation in Big Data scenarios were reviewed, highlighting their strengths and drawbacks, as well as their applicability in different contexts.