1 Introduction

Data privacy refers to the rights of individuals over their personal information. That is, it is concerned with what data is collected, for which purpose, and how it is handled. In recent years, data privacy has become a major issue due to the growth of data generation, as well as the interest of third parties, such as businesses or researchers, in collecting and exploiting that information. It is important to differentiate it from the concept of data security, whose main objective is to protect personal information from being accessed by unauthorised third parties or attackers. Data security is, however, a prerequisite for data privacy [1].

One way to ensure data privacy is through anonymisation, a process that makes an individual non-identifiable within a set of individuals. It is a Privacy Enhancing Technique (PET) that results from transforming personal data to irreversibly prevent identification, and it comprises a set of techniques to manipulate the information so that data subjects (i.e., the persons to whom the data refers) become less identifiable. The robustness of each anonymisation technique can be analysed in terms of different criteria, such as whether it is possible to identify a single person, whether different records regarding the same individual can be linked, or how much information can be inferred about the data subject [2]. As a result, once the data is properly anonymised it cannot be linked back to the individual, and therefore it is no longer considered personal data according to the General Data Protection Regulation (GDPR): “The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person”.

It is important to highlight the differences between anonymisation and other PETs, such as pseudonymisation and encryption, since these terms are often confused: pseudonymisation protects an individual by replacing their identifiers with a pseudonym, making the individual less (but still) identifiable. It is a reversible process, so it does not remove the re-identification risk, since the mapping between the real identifiers and the pseudonyms still exists. Article 25 of the GDPR highlights the role of pseudonymisation “as a technical and organisational measure to help enforce data minimisation principles and compliance with Data Protection by Design and by Default obligations” [3].
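To make the distinction concrete, the minimal Python sketch below (an illustrative example with hypothetical record and column names, not a production design) shows pseudonymisation as a reversible mapping between real identifiers and random pseudonyms: whoever holds the mapping table can undo the process.

```python
import secrets

# Hypothetical records keyed by a direct identifier (the person's name).
records = [
    {"name": "Alice Smith", "city": "Vigo", "salary": 32000},
    {"name": "Bob Jones", "city": "Oslo", "salary": 41000},
]

pseudonym_map = {}  # real identifier -> pseudonym; must itself be kept secret


def pseudonymise(rows):
    """Replace the direct identifier with a random pseudonym."""
    out = []
    for row in rows:
        pseudo = pseudonym_map.setdefault(row["name"], secrets.token_hex(8))
        out.append({**row, "name": pseudo})
    return out


def re_identify(rows):
    """Reverse the process: possible for whoever holds the mapping table."""
    reverse = {v: k for k, v in pseudonym_map.items()}
    return [{**row, "name": reverse[row["name"]]} for row in rows]


pseudo_rows = pseudonymise(records)  # names replaced by opaque tokens
original = re_identify(pseudo_rows)  # original names recovered from the mapping
```

Because the mapping table still exists, the pseudonymised records remain personal data under the GDPR; only when every means of re-identification is removed does the process move towards anonymisation.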

On the other hand, encryption is a security measure that provides confidentiality in a communication channel or for data at rest (data stored on persistent physical storage), preventing the disclosure of information to unauthorised parties. The goal of encryption is not to make the data subject less identifiable: the original data is always available to any entity that has access to the encryption key (or that can break the encryption protocol and recover the original information), and therefore the possibility of identifying the subject remains [2]. In addition, a key management system must be in place to protect and manage the encryption keys, which introduces complexity into the system.
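The short sketch below, which relies on the third-party `cryptography` package (an assumed dependency chosen for illustration, not one prescribed by this chapter), makes the same point in code: anyone with access to the key recovers the fully identifying record exactly.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # must be protected by a key management system
cipher = Fernet(key)

record = b"Alice Smith;+34 600 000 000;Vigo"
token = cipher.encrypt(record)   # confidential at rest or in transit

# Any party holding the key recovers the original, identifying data unchanged.
assert cipher.decrypt(token) == record
```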

The main advantage of anonymisation techniques over other PETs such as encryption is that they do not involve key management and, depending on the technique, may require fewer computational resources. However, data anonymisation is an irreversible process and only provides privacy, meaning that other security properties (e.g., confidentiality or integrity) must be implemented through other means. In addition, adequate anonymisation allows data processing without the need to comply with data privacy regulations, reducing the organisational cost of using and exploiting the data.

However, ensuring that data is properly anonymised requires multiple factors to be taken into account: lowering the re-identification risk to a specific extent is not always possible without also losing the utility of the data. Moreover, the anonymisation process requires a deep analysis of the original dataset to find the anonymisation procedure that best fits our needs. A procedure that was suitable for a particular dataset might not work for a second dataset, since the nature and scope of the data will differ, as will the later use of the anonymised data. In addition, it is also necessary to consider that additional datasets might become available in the future, which could be cross-referenced with the anonymised data, affecting the overall re-identification risk.

Despite how useful data anonymisation can be from a compliance perspective, it can be a daunting task, especially when trying to minimise the risks and ensure that the data is being properly anonymised. This chapter explores how anonymisation can be used as a regulatory compliance tool, addressing common issues that arise when introducing data anonymisation in a Big Data context. The complexity of a data anonymisation procedure, and its relationship with GDPR compliance, is explored in Sect. 2. Finally, the specific challenges of data anonymisation in a Big Data context are analysed in Sect. 3, highlighting the differences with a Small Data environment.

2 Anonymisation as Regulatory Compliance Tool

With regard to the GDPR, encryption and pseudonymisation are considered security measures that need to be implemented to allow the processing of personal data. On the other hand, anonymisation makes the individuals within a particular dataset non-identifiable, and therefore anonymised data is no longer considered personal data.

However, ensuring that data anonymisation is correctly applied is a challenging task. The use of data anonymisation techniques implies, in most cases, a certain loss of data utility, as it typically relies on modifying the data values in order to make them less unique. While privacy aims to avoid the disclosure of sensitive data and the possibility of making certain deductions from a given dataset, utility describes the analytical value of the data. In other words, utility seeks to find real-world correlations in a certain dataset, and the goal of privacy is to hide those correlations [4]. When applying anonymisation operations, data privacy improves, but there is a risk of reducing the analytical value of the dataset. For this reason, it is necessary to find a suitable trade-off between privacy and utility, as illustrated in Fig. 19.1, where some utility is sacrificed to reach an acceptable level of privacy.

Fig. 19.1 Trade-off between privacy and utility during an anonymisation process

D’Acquisto et al. distinguish two anonymisation approaches to find this trade-off point [5]: utility-first and privacy-first anonymisation. In the former, an anonymisation method with a heuristic parameter and utility-preservation properties is run, and the risk of disclosure is measured afterwards. In the latter, by contrast, an upper bound on the re-identification disclosure risk and/or the attribute disclosure risk is set beforehand, and the utility of the resulting data is assessed afterwards.
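The two workflows can be sketched as follows (a toy, hedged illustration: the generalisation step, the uniqueness-based risk measure and the risk bound are all made-up choices for the example, not part of [5]).

```python
import pandas as pd

df = pd.DataFrame({
    "postcode": ["36201", "36202", "36310", "36311", "08001", "08002"],
    "age":      [34, 35, 51, 52, 28, 29],
    "salary":   [32000, 35000, 41000, 39000, 30000, 31000],
})
QIS = ["postcode", "age"]


def generalise(data, level):
    """Toy generalisation: truncate the postcode and widen the age bucket."""
    out = data.copy()
    out["postcode"] = out["postcode"].str[: max(1, 5 - level)] + "*" * min(level, 4)
    out["age"] = (out["age"] // (5 * level)) * (5 * level)
    return out


def disclosure_risk(data):
    """Share of records that are unique on the quasi-identifiers."""
    sizes = data.groupby(QIS).size()
    return (sizes == 1).sum() / len(data)


# Utility-first: pick a heuristic parameter, anonymise, measure the risk afterwards.
anon_uf = generalise(df, level=1)
print("utility-first residual risk:", disclosure_risk(anon_uf))

# Privacy-first: fix an upper bound on the risk, then generalise until it is met.
RISK_BOUND = 0.0
level, anon_pf = 0, df
while disclosure_risk(anon_pf) > RISK_BOUND:
    level += 1
    anon_pf = generalise(df, level)
print("privacy-first generalisation level:", level)
```

In the utility-first run the residual risk is whatever the heuristic parameter happens to yield (here about a third of the records remain unique), whereas the privacy-first loop keeps generalising until the chosen bound is met, accepting whatever utility is left.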

Regardless of which anonymisation approach is followed, it is essential to ensure that the individuals are not identifiable after the process. To verify this property, it is necessary to analyse the data and classify it into direct identifiers (attributes that unequivocally identify a data subject, such as the name, telephone number or ID card number), quasi-identifiers (attributes that by themselves do not reveal an identity, but can pose a privacy risk when combined with others, e.g., postal code or birth date) and sensitive data (attributes that should be preserved because they hold value for later analysis, such as a medical condition or salary).
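In practice, a useful first step is to record this classification explicitly, so that every later decision (what to drop, what to generalise, what to preserve) refers to it. The snippet below is a minimal sketch with hypothetical attribute names.

```python
# Hypothetical attribute classification for a customer dataset.
ATTRIBUTE_CLASSES = {
    "direct_identifiers": ["name", "phone_number", "id_card_number"],
    "quasi_identifiers":  ["postcode", "birth_date", "gender"],
    "sensitive":          ["diagnosis", "salary"],
}


def drop_direct_identifiers(rows):
    """Remove direct identifiers; quasi-identifiers still need further treatment."""
    to_drop = set(ATTRIBUTE_CLASSES["direct_identifiers"])
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
```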

It has been proven that removing the direct identifiers from a dataset is not enough to preserve privacy: an attacker might have access to additional data sources or background knowledge that could lead to the re-identification of the anonymised individuals [6]. De Montjoye et al. [7] demonstrated that 95% of the individuals in an anonymised dataset of fifteen months of mobility data (containing records from around 1.5 million individuals) could be identified using only four spatio-temporal points. This shows that simple approaches to anonymisation are not enough and that more complex solutions are needed.

The GDPR sets a high standard for data to be considered truly anonymous, since anonymity implies that data protection rules no longer apply. In Recital 26, the GDPR states that the organisation should consider not only whether the individual is re-identifiable, but also “all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments” [3].

The provision above can be better understood with a real example: in October 2018, a Danish service was fined 1.2 million kroner (around €160,000) for not deleting users’ data once it was no longer needed for the company’s business activity [8]. The company argued that it could keep the data since it was properly anonymised and could no longer be considered personal data: the dataset was being anonymised by deleting the names from the database, but other data, such as telephone numbers, were neither removed nor masked.

The company’s assumption that anonymised data is no longer considered personal data was correct. However, it failed to analyse the GDPR requirements for data to be considered anonymous. In its Opinion 05/2014 on Anonymisation Techniques, the Article 29 Working Party clearly states that “removing directly identifying elements in itself is not enough to ensure that identification of the data subject is no longer possible. It will often be necessary to take additional measures to prevent identification, once again depending on the context and purposes of the processing for which the anonymised data are intended” [2].

According to the Article 29 Working Party, an effective anonymisation solution should satisfy the following criteria, preventing any party from:

  • Singling out an individual in a dataset (i.e., isolating some records that could point out the identity of the person).

  • Linking two records within a dataset that belong to the same person, even though her identity remains unknown.

  • Inferring any information in such a dataset by deducing the value of an attribute from the values of a set of other attributes.

Therefore, re-identification does not only mean retrieving a person’s name, but also being able to “single out an individual in a dataset”. Coming back to our previous example, the issues with the company’s anonymisation solution are now clear: by keeping direct identifiers such as telephone numbers in the dataset, it satisfied neither the singling-out nor the linkability criterion. First, telephone numbers can be considered direct identifiers, since each person has a different telephone number, singling out every individual in the dataset. Second, a telephone number can easily be linked to a natural person by using other datasets and publicly available information. Therefore, the data could not be considered anonymous, and the company was not fully compliant with the GDPR.

This example demonstrates how complex it is to make data anonymisation GDPR compliant: basic anonymisation techniques are insufficient to guarantee the privacy of the data subjects, and the residual risk retained by the anonymised data must be analysed, both to ensure an adequate level of privacy protection and to verify that the anonymisation solution meets the requirements stated above.

Therefore, the anonymisation process must be adapted on a case-by-case basis, ideally adopting a risk-based approach. An analysis of the re-identification risk must be performed to assess whether the anonymisation solution meets the criteria or whether further measures are needed [2]. Moreover, this analysis must be performed continuously, since the risk is subject to change and new datasets might be published that allow cross-referencing the anonymised data: for instance, in Sweden taxpayers’ information is publicly available [9], while in other countries it is not. This circumstance might change in the future, making previous assumptions about data linkability erroneous. Likewise, a dataset containing information about both Swedish and Spanish citizens cannot be anonymised following the same procedure for both groups, since the re-identification risks might differ [10].

As a result, the analysis of the re-identification risk is a useful procedure that allows us to identify the residual risk of the anonymised data, as well as to ease the GDPR compliance process. The re-identification risk can be assessed using multiple techniques, mainly focused on analysing the uniqueness of the data subjects (the first criterion for a good anonymisation solution) but not taking into account other properties such as linkability or inference. To address this issue, Adkinson et al. [11] propose a dynamic risk-based approach and a set of privacy metrics that consider both the re-identification risk and the presence of an adversary with certain background knowledge trying to disclose information from a dataset.
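A basic, uniqueness-oriented assessment can be sketched as follows (an illustrative metric only, not the method of [11]): records are grouped into equivalence classes by their quasi-identifier values, a record in a class of size one is singled out, and the expected re-identification risk of a record in a class of size k is taken as 1/k.

```python
import pandas as pd


def reidentification_report(df, quasi_identifiers):
    """Uniqueness-based risk summary over the given quasi-identifiers."""
    # Size of the equivalence class each record belongs to.
    class_sizes = df.groupby(quasi_identifiers).size().rename("k").reset_index()
    k = df.merge(class_sizes, on=quasi_identifiers, how="left")["k"]
    risk = 1.0 / k
    return {
        "records_singled_out": int((k == 1).sum()),  # singling-out criterion
        "max_record_risk": float(risk.max()),        # worst-case record (1/k)
        "avg_record_risk": float(risk.mean()),       # dataset-level average
        "k_anonymity": int(k.min()),                 # smallest class size
    }


df = pd.DataFrame({"postcode":  ["36201", "36201", "08001"],
                   "gender":    ["F", "F", "M"],
                   "diagnosis": ["flu", "asthma", "flu"]})
print(reidentification_report(df, ["postcode", "gender"]))
# -> one record singled out, average per-record risk of roughly 0.67, k-anonymity of 1
```

Such a report covers only the uniqueness dimension; as noted above, metrics that also model linkability, inference and adversarial background knowledge are needed for a complete picture.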

To summarise, anonymisation is a non-trivial process that needs to be performed cautiously to achieve GDPR compliance. It is essential to understand that basic anonymisation procedures are usually not sufficient for most real-world applications and datasets, and that it is necessary not just to anonymise the data but also to estimate the remaining privacy risks. It is therefore essential to perform a continuous analysis of the re-identification risk of the anonymised dataset, taking into account the uniqueness, linkability, and inference properties of the anonymisation solution applied. Furthermore, the anonymised datasets need to guarantee an adequate balance between data privacy and utility, preserving the analytical value of the data, which requires analysing different approaches to decide the appropriate trade-off point between privacy and utility for the intended data processing purpose.

3 Anonymisation at Large Scale: Challenges and Solutions

The rapid increase in the amount of data that businesses and researchers handle nowadays is a consequence of technological advances in fields such as Cloud Computing, the Internet of Things (IoT) and Machine Learning, together with the increase in computational power and the lower cost of data storage. This has led to the concept of Big Data: large volumes of data whose complexity and velocity hinder the use of conventional technologies [12]. This definition highlights three properties of Big Data, also known as the 3 V’s: volume, velocity, and variety. Later studies extended this definition by adding the properties of veracity, variability, visualisation, and value [13]. The term veracity refers to the reliability or quality of the data, that is, its truthfulness. Variability alludes to the non-homogeneity of the data. Finally, the value property describes the remaining utility of the data after its processing. The opposite of Big Data is known as Small Data, where the amount of data and its complexity are much lower, making it easier to process.

As introduced earlier, ensuring data privacy while preserving some utility is a hard challenge. However, the increase in data volume, as well as the complexity, variety, and velocity typical of a Big Data scenario, introduce even more complications. Firstly, due to the large volume of data, computational efficiency may be a critical issue when selecting an anonymisation technique or privacy model [5]. Furthermore, evaluating the re-identification risks in a Big Data context, as well as measuring the utility and information loss, is also computationally complex. Secondly, aside from the computational issues derived from working with a huge amount of data, other problems arise in Big Data scenarios: the variety of the data also plays an important role, since most of the current privacy-preserving algorithms, such as clustering-based ones, are designed for homogeneous data. These algorithms work well in Small Data scenarios; in Big Data, however, the information is usually heterogeneous. Therefore, traditional data privacy techniques fall short, and there is a lack of scalable and efficient privacy algorithms.

Classical anonymisation techniques such as k-anonymity [6], l-diversity [14] or t-closeness [15] are not completely adequate to ensure Big Data privacy, as in many cases the data to be anonymised is unstructured or arrives as a live stream. However, other well-known techniques such as differential privacy [16] can be adapted more easily to a Big Data context [17]. This method introduces an intermediary between the data processor and the database, which acts as a privacy guard. The data processor does not get direct access to the full version of the data. Instead, the privacy guard evaluates the privacy risk (according to factors such as the sensitivity of the query to be executed and/or the size of the dataset) and introduces some distortion into the information retrieved from the database, proportional to the current privacy risk. Differential privacy benefits from larger datasets: if the dataset is large, less noise is needed to protect privacy [17]. However, one drawback of this technique is that the amount of noise to be introduced for sensitive queries (with a high privacy risk) is large, and retaining the utility of the data may be challenging or impossible.
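The core of the mechanism can be sketched in a few lines (a minimal, hedged illustration of the Laplace mechanism for a counting query; the epsilon value and the query are arbitrary choices, and a real deployment would also track the cumulative privacy budget across queries).

```python
import numpy as np


def dp_count(records, predicate, epsilon):
    """Answer 'how many records satisfy predicate?' with differential privacy.

    A counting query has sensitivity 1 (adding or removing one individual
    changes the count by at most 1), so Laplace noise with scale
    sensitivity / epsilon is added to the true answer.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


# The privacy guard returns a noisy answer instead of exposing the raw data.
salaries = [32000, 35000, 41000, 39000, 30000, 31000]
print(dp_count(salaries, lambda s: s > 34000, epsilon=0.5))
```

With a large dataset the same absolute noise distorts the answer proportionally less, which is why differential privacy benefits from volume; conversely, a small epsilon (a tight budget for a sensitive query) means more noise and less utility, as noted above.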

Regarding privacy models, the main difference with respect to anonymisation techniques is that they do not specify the set of transformations to be performed on the data, but rather the conditions that a dataset must satisfy to keep the disclosure risk under control [18]. According to Soria-Comas et al. [19], a privacy model needs to satisfy three properties to be usable in a Big Data environment: composability, computability, and linkability. A privacy model is composable if its privacy guarantees are preserved for a dataset resulting from merging several datasets, each of which individually satisfies the guarantees of the model. Computability refers to the computational cost of enforcing the model. Finally, linkability is the ability to link records relating to the same individual. In their work, they evaluate k-anonymity and differential privacy in terms of these properties.

While k-anonymity is not composable (the combination of two k-anonymous datasets is not guaranteed to preserve k-anonymity), differential privacy is strongly composable (combining two differentially private datasets increases the risk of disclosure, but differential privacy is still preserved). In terms of linkability, with k-anonymity it is at least possible to link the groups of k-anonymous records to which individuals belong. With differential privacy, datasets are not linkable if noise addition is used. Finally, the computability property cannot be directly compared, as the performance of each privacy model depends on the method used to anonymise the data.
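The linkability point can be illustrated with a small, hypothetical example (made-up tables and column names): two releases generalised to the same quasi-identifier levels can still be joined at the level of their equivalence classes, even though no individual record is identified.

```python
import pandas as pd

# Two hypothetical k-anonymised releases sharing generalised quasi-identifiers.
health = pd.DataFrame({"postcode":  ["362**", "362**", "080**"],
                       "age_band":  ["30-39", "30-39", "20-29"],
                       "diagnosis": ["flu", "asthma", "flu"]})
income = pd.DataFrame({"postcode":      ["362**", "080**"],
                       "age_band":      ["30-39", "20-29"],
                       "median_salary": [33500, 30500]})

# Groups (not individuals) are linked across the two releases.
linked = health.merge(income, on=["postcode", "age_band"], how="inner")
print(linked)
```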

Therefore, there is no absolute solution to protect data privacy in a Big Data context, since each method has its advantages and flaws, and the selection of the privacy model should be analysed on a case-by-case basis. The existing techniques, practices, and methodologies for data privacy protection are ineffective in Big Data if they are not used in an integrated manner [20]. In any case, any solution dealing with sensitive data should analyse the privacy risks in order to address the possible data privacy challenges correctly.

Another relevant issue associated with Big Data privacy is the complexity of anonymising data in real time, as in many cases the data has to be processed as soon as it arrives at the system. This is known as stream processing, and it occurs in many Big Data scenarios where the data is generated by many sources at high speed. Unlike batch processing, where the data is collected into batches and then fed into the system, in stream processing the data is dynamic and has a temporal dimension. Therefore, there is a maximum acceptable delay between the in-flowing data and the processed output [21].

Stream processing adds certain difficulties when it comes to performing a prior analysis of the data to select the best anonymisation strategy. Since the data arrives at the system in portions, the information is always incomplete, and performing a correct privacy risk assessment and utility evaluation is not an easy task. Moreover, traditional k-anonymity schemes are designed for static datasets and are therefore not suitable for streaming contexts. Furthermore, these techniques assume that each person appears in the dataset only once, an assumption that cannot be made in a streaming context [21]. These challenges become especially hard in the context of Big Data streams.
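The tension between accumulating enough records and respecting the delay bound can be seen in a toy buffered-release scheme (a deliberately simplified sketch, far simpler than the CASTLE-style algorithms reviewed below; the generalisation rules and parameters are made up).

```python
import time


class StreamAnonymiser:
    """Buffer incoming tuples and release them in generalised batches of >= k.

    A batch is also flushed when the oldest buffered tuple exceeds the maximum
    acceptable delay, even if fewer than k tuples have arrived; a real algorithm
    would then have to suppress or further generalise those tuples.
    """

    def __init__(self, k=3, max_delay_s=5.0):
        self.k = k
        self.max_delay_s = max_delay_s
        self.buffer = []  # (arrival_time, record) pairs

    def _generalise(self, records):
        # Toy generalisation: truncate the postcode, bucket the age per decade.
        return [{"postcode": r["postcode"][:3] + "**",
                 "age_band": f"{(r['age'] // 10) * 10}-{(r['age'] // 10) * 10 + 9}",
                 "diagnosis": r["diagnosis"]}
                for r in records]

    def push(self, record):
        """Add a record; return a released (generalised) batch, or None."""
        self.buffer.append((time.monotonic(), record))
        oldest_arrival = self.buffer[0][0]
        expired = time.monotonic() - oldest_arrival > self.max_delay_s
        if len(self.buffer) >= self.k or expired:
            batch = [r for _, r in self.buffer]
            self.buffer = []
            return self._generalise(batch)
        return None
```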

Some solutions have been proposed to address the issues that arise when anonymising data in such scenarios. Sakpere et al. [22] reviewed the state of the art of existing methods to anonymise Big Data streams, which are briefly explored below.

Li et al. developed in 2007 a perturbative method to achieve streaming data privacy based on adding random noise to the incoming data [23]. However, this method has certain flaws: it can only handle numerical data, and the large amount of artificial noise complicates the analysis of the anonymised dataset.

Other proposed methods are based on tree structures, such as SKY (Stream K-anonYmity) [24], SWAF (Sliding Window Anonymisation Framework) [25] and KIDS (K-anonymIsation Data Stream) [26]. The main disadvantages of these methods are their time and space complexity for data streams, the risk of re-identification if the data hierarchy used for the generalisation process is discovered by an attacker, and their use for the anonymisation of numerical values, which becomes considerably complicated due to the difficulty of finding a suitable hierarchy for the specialisation tree [27].

Clustering (grouping of similar data points) algorithms can also be useful for data anonymisation in a streaming context. Some examples are CASTLE (Continuously Anonymising STreaming data via adaptive cLustEring) [21], B-CASTLE [28], which is an improvement of CASTLE, FAANST (Fast Anonymising Algorithm for Numerical STreaming data) [29] and FADS (Fast clustering-based k-Anonymisation approach for Data Streams) [30]. Of these, FADS is the best choice for anonymisation in streaming due to its low time and space complexity. However, since it handles tuples sequentially, it is not suitable for Big Data streams. Mohammadian et al. introduced a new method, FAST (Fast Anonymization of Big Data Streams), based on FADS, which uses parallelism to increase the effectiveness of FADS and make it applicable to Big Data streams [31].

Last but not least, it is important to take into account that the growth of Big Data in recent years has facilitated cross-referencing information from different databases, increasing the risk of re-identification. Databases containing information that might at first seem innocuous can be matched with other publicly available datasets to re-identify users [32]. For example, it has been shown that a 5-digit zip code, birth date and gender alone can identify 80% of the population of the United States [33]. The inherent properties of Big Data, such as its high volume and complexity, further aggravate this problem.

In brief, the characteristics of Big Data make it especially difficult to preserve privacy according to the GDPR. Most of the traditional anonymisation algorithms are ineffective and/or inefficient in such scenarios, and further research in the field of Big Data privacy is necessary. Moreover, the velocity of the streaming data (usually also present in Big Data scenarios) introduces additional complications. Even though some anonymisation algorithms have been developed to address this issue, there are still unsolved challenges.

4 Conclusions

Throughout this chapter, we introduced some of the existing data privacy protection approaches, highlighting the main differences between them. Different real-world examples were provided to emphasise the importance of the correct application of the GDPR, as well as the consequences of a wrong anonymisation approach, which are critical for both individuals and companies. Anonymisation is a process that, to be performed correctly, requires a deep analysis and constant monitoring of the re-identification risks. In addition, we presented the privacy-utility problem and the metrics that can be used to monitor the re-identification risk of a dataset.

Furthermore, we introduced Big Data and stream processing, together with the difficulties they pose for anonymisation. Different methods for data anonymisation in Big Data scenarios were reviewed, highlighting their strengths and drawbacks, as well as their applicability in different contexts.