Encyclopedia of Big Data

Living Edition
Editors: Laurie A. Schintler, Connie L. McNeely

Anonymization Techniques

  • Mick Smith
  • Rajeev Agrawal
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-32001-4_9-1

Keywords

Noise addition; Differential privacy; Personally identifiable information; Patient health statistics

Introduction

Personal information is constantly being collected on individuals as they browse the internet or share data electronically. This collection has only intensified with the emergence of the Internet of Things and the connectivity of many electronic devices. As more data is disseminated into the world, interconnected patterns emerge that link one data record to the next. The massive data sets that result are of great value to businesses and data scientists alike. To properly protect the privacy of the individuals they describe, it is necessary to de-identify, or anonymize, the data. In other words, personally identifiable information (PII) needs to be encrypted or altered so that a person's sensitive data remains indiscernible to outside sources yet readable to pre-approved parties. Some popular anonymization techniques include noise addition, differential privacy, k-anonymity, l-diversity, and t-closeness.

The need to anonymize data has grown alongside the availability of big data. Cheaper storage, improved processing capabilities, and a greater diversity of analysis techniques have created an environment in which big data can thrive. This has allowed organizations to collect massive amounts of data on their customer or client base. That information can then be fed into a variety of business intelligence applications to improve the efficiency of the collecting organization. For instance, a hospital can collect various patient health statistics over a series of visits. This information could include vital sign measurements, family history, frequency of visits, test results, or any other health-related metric. All of this data could be analyzed to provide the patient with an improved plan of care and treatment, ultimately improving the patient's overall health and the facility's ability to provide a diagnosis.

However, the benefits that can be realized from the analysis of massive amounts of data come with the responsibility of protecting the privacy of the entities whose data is collected. Before the data is released, or in some instances analyzed, the sensitive personal information needs to be altered. The challenge lies in choosing a method that achieves anonymity while preserving the integrity of the data.

Noise Addition

The premise of noise addition is that adding noise to a data set makes the data ambiguous enough that individual subjects cannot be identified. The noise refers to skewing an attribute so that it is displayed as a value within a range. For instance, instead of publishing one static value for a person's age, it could be adjusted by ±2 years. If the subject's age is displayed as 36, the observer would not know the exact value, only that the age lies between 34 and 38. The challenge with this technique is identifying the appropriate amount of noise: there must be enough to mask the true attribute value while preserving the data mining relationships that exist within the data set. A minimal sketch of this idea appears below.
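The following sketch illustrates noise addition on an age attribute, assuming the uniform ±2-year perturbation used in the example above. The function name and data are hypothetical and chosen purely for illustration.

```python
import random

def add_uniform_noise(ages, spread=2):
    """Perturb each age by a random integer offset in [-spread, +spread],
    so the published value only reveals a range around the true age."""
    return [age + random.randint(-spread, spread) for age in ages]

original = [36, 42, 29, 51]
perturbed = add_uniform_noise(original)
print(perturbed)  # e.g., [34, 43, 31, 50]; each value lies within 2 years of the truth
```

In practice the size of the perturbation must be tuned so that aggregate statistics and mining relationships in the data remain usable.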

Differential Privacy

Differential privacy is similar to noise addition in that the original data is altered slightly to prevent de-identification. However, it is done in such a way that if the same query is run on two databases that differ in only one row, the information contained in that row is not discernible from the results. Cynthia Dwork provides the following definition:

A randomized function K gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(K),
$$ \Pr\left[K(D_1)\in S\right]\le \exp(\varepsilon)\times \Pr\left[K(D_2)\in S\right] $$

As an example, think of a database containing the incomes of 75 people in a neighborhood whose average income is $75,000. If one person were to leave the neighborhood and the average income dropped to $74,000, it would be easy to infer the income of the departing individual. To prevent this, the query mechanism adds just enough random noise that the released averages before and after the departure do not reveal the change, while the computational utility of the data is largely maintained. How much noise to add, and whether an exponential or Laplacian mechanism should be used, remains the subject of ongoing research and discussion. A sketch of the Laplace mechanism follows.
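The sketch below is a simplified illustration of a Laplace mechanism applied to the neighborhood-income query described above, not a production implementation. The incomes, the privacy budget epsilon, and the assumed upper bound on any single income are hypothetical values chosen for the example.

```python
import random

def laplace_sample(scale):
    """Draw from Laplace(0, scale): the difference of two independent
    exponential variates with mean `scale` is Laplace-distributed."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_mean(values, epsilon, upper_bound):
    """Release the mean with epsilon-differential privacy, assuming every
    value lies in [0, upper_bound]. Changing one record can shift the mean
    by at most upper_bound / n, so Laplace noise with scale
    sensitivity / epsilon hides any single individual's contribution."""
    sensitivity = upper_bound / len(values)
    true_mean = sum(values) / len(values)
    return true_mean + laplace_sample(sensitivity / epsilon)

# 75 hypothetical residents whose true average income is $75,000.
incomes = [74_000] * 74 + [149_000]
print(private_mean(incomes, epsilon=0.5, upper_bound=200_000))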

K-Anonymity

In the k-anonymity algorithm, two common methods for anonymizing data are suppression and generalization. With suppression, the values of a categorical variable, such as name, are removed entirely from the data set. With generalization, quantitative variables, such as age or height, are replaced with a range. This in turn makes each record in a data set indistinguishable from at least k−1 other records. One of the major drawbacks to k-anonymity is that it may still be possible to infer identity if certain characteristics are already known. As a simple example, consider a data set that contains credit decisions from a bank (Table 1). The names have been omitted, the ages categorized, and the last two digits of each zip code removed.
Table 1  K-anonymity credit example

Age    | Gender | Zip   | Credit decision
-------|--------|-------|----------------
18–25  | M      | 149** | Yes
18–25  | M      | 148** | No
32–39  | F      | 149** | Yes
40–47  | M      | 149** | Yes
25–32  | F      | 148** | No
32–39  | M      | 149** | Yes

This deliberately simple example demonstrates the weakness of k-anonymity against a homogeneity attack. If it were known that a 23-year-old man living in zip code 14999 was in this data set, the credit decision for that particular individual could be inferred. A sketch of generalization, suppression, and the resulting value of k appears below.
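The following sketch shows, under illustrative assumptions, how suppression (dropping names), generalization (age bands and truncated zip codes), and a check of the resulting k might look in code. The record layout, band widths, and helper names are hypothetical rather than part of any standard library.

```python
from collections import Counter

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"age": 23, "gender": "M", "zip": "14999", "credit": "Yes"},
    {"age": 21, "gender": "M", "zip": "14805", "credit": "No"},
    {"age": 35, "gender": "F", "zip": "14901", "credit": "Yes"},
    {"age": 44, "gender": "M", "zip": "14902", "credit": "Yes"},
    {"age": 27, "gender": "F", "zip": "14811", "credit": "No"},
    {"age": 38, "gender": "M", "zip": "14903", "credit": "Yes"},
]

def generalize(record):
    """Suppress the direct identifier (name is simply never included) and
    generalize quasi-identifiers: map age to a band, mask the last two zip digits."""
    band_low = 18 + ((record["age"] - 18) // 7) * 7
    return {
        "age": f"{band_low}-{band_low + 7}",
        "gender": record["gender"],
        "zip": record["zip"][:3] + "**",
        "credit": record["credit"],
    }

def k_value(table, quasi_ids=("age", "gender", "zip")):
    """k is the size of the smallest group of records sharing identical
    quasi-identifier values; larger k means stronger anonymity."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in table)
    return min(groups.values())

anonymized = [generalize(r) for r in RAW]
print(k_value(anonymized))  # prints 1: this coarse generalization alone does not reach k >= 2
```

As the printed result suggests, achieving a useful k usually requires coarser generalization or the suppression of entire records.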

L-Diversity

L-diversity can be viewed as an extension of k-anonymity in which the goal is to anonymize specific sensitive values of a data record. In the previous example, the sensitive information would be the credit decision. As with k-anonymity, generalization and suppression techniques are used to mask the true values of the target variable. The authors of the l-diversity principle, Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, define it as follows:

A q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse.

The concept of "well-represented" has been defined in three ways: distinct l-diversity, entropy l-diversity, and recursive (c, l)-diversity. A criticism of the l-diversity model is that it does not hold up well when the sensitive attribute has only a small number of possible values. As an example, consider the credit decision table above. If that table were extended to 1,000 records and 999 of them had a decision of "yes," l-diversity would not be able to form sufficiently diverse equivalence classes. The distinct variant can be checked with a few lines of code, as sketched below.
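A minimal sketch of the distinct l-diversity check follows; the table contents and field names are illustrative. It simply counts how many distinct sensitive values appear in each q*-block and reports the smallest count.

```python
from collections import defaultdict

def distinct_l(table, quasi_ids, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    q*-block (group of records sharing the same quasi-identifier values).
    The table satisfies distinct l-diversity when this number is >= l."""
    blocks = defaultdict(set)
    for row in table:
        blocks[tuple(row[q] for q in quasi_ids)].add(row[sensitive])
    return min(len(values) for values in blocks.values())

# Hypothetical generalized records in the spirit of Table 1.
table = [
    {"age": "18-25", "gender": "M", "zip": "149**", "credit": "Yes"},
    {"age": "18-25", "gender": "M", "zip": "149**", "credit": "No"},
    {"age": "32-39", "gender": "F", "zip": "149**", "credit": "Yes"},
    {"age": "32-39", "gender": "F", "zip": "149**", "credit": "Yes"},
]

l = distinct_l(table, quasi_ids=("age", "gender", "zip"), sensitive="credit")
print(l)  # prints 1: the second block holds only "Yes", so the table is not 2-diverse
```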

T-Closeness

Continuing the refinement of de-identification techniques, t-closeness is an extension of l-diversity. The goal of t-closeness is to create equivalence classes whose sensitive-attribute distributions approximate the distribution of those attributes in the initial database. Privacy can be considered a measure of information gain, and t-closeness takes this into account by assessing an observer's prior and posterior beliefs about the contents of a data set as well as the influence of the sensitive attribute. As with l-diversity, this approach hides the sensitive values within a data set while maintaining association through "closeness." The algorithm uses a distance metric known as the Earth Mover's Distance to measure the level of closeness, which takes into consideration the semantic interrelatedness of the attribute values. The distance calculation differs by data type, however, with numerical, equal, and hierarchical ground distances. A sketch of the equal-distance case appears below.
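The sketch below illustrates the equal-distance case, where the Earth Mover's Distance between two categorical distributions reduces to half the L1 distance. The distributions and the threshold t are illustrative assumptions based on the credit example.

```python
def emd_equal(p, q):
    """Earth Mover's Distance under the equal-distance ground metric,
    which reduces to half the L1 distance between two distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def satisfies_t_closeness(block_dist, table_dist, t):
    """An equivalence class satisfies t-closeness when the distance between
    its sensitive-attribute distribution and the whole table's is <= t."""
    return emd_equal(block_dist, table_dist) <= t

# Illustrative distributions of the credit decision attribute.
overall = {"Yes": 4 / 6, "No": 2 / 6}   # whole table, as in Table 1
block = {"Yes": 1.0}                    # one equivalence class, all "Yes"
print(emd_equal(block, overall))                      # 0.333...
print(satisfies_t_closeness(block, overall, t=0.2))   # False: the class is too skewed
```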

Conclusion

To be effective, each anonymization technique should protect against the following risks: singling out, linkability, and inference. Singling out is the process of isolating data that could identify an individual. Linkability occurs when two or more records in a data set can be linked to either an individual or a group of individuals. Finally, inference is the ability to determine the value of the anonymized data from the values of other elements within the set. An anonymization approach that can mitigate these risks can be considered robust and will reduce the possibility of re-identification. Each of the techniques presented addresses these risks differently. The following table outlines their respective performance (Table 2):
Table 2  Anonymization algorithm comparison

Technique            | Singling out | Linkability | Inference
---------------------|--------------|-------------|----------
Noise addition       | At risk      | Possibly    | Possibly
K-anonymity          | Not at risk  | At risk     | At risk
L-diversity          | Not at risk  | At risk     | Possibly
T-closeness          | Not at risk  | At risk     | Possibly
Differential privacy | Possibly     | Possibly    | Possibly

For instance, unlike k-anonymity, l-diversity and t-closeness are not subject to inference attacks that exploit the homogeneity of, or background knowledge about, the data set. Similarly, the three generalization techniques (k-anonymity, l-diversity, and t-closeness) all permit differing levels of association to be made due to the clustering nature of each approach.

As with any aspect of data collection, sharing, publishing, and marketing, there is the potential for malicious activity. However, the benefits that can be achieved from the analysis of such data cannot be overlooked. It is therefore extremely important to mitigate these risks through the use of effective de-identification techniques so as to protect sensitive personal information. As data becomes more abundant and accessible, it becomes increasingly important to continuously modify and refine existing anonymization techniques.

Further Reading

  1. Dwork, C. (2006). Differential privacy. In Automata, languages and programming. Berlin: Springer.
  2. Li, N., Li, T., & Venkatasubramanian, S. (2007). t-Closeness: Privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering.
  3. Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (2007). l-Diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 3.
  4. Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5).
  5. Article 29 Data Protection Working Party. (2014). Opinion 05/2014 on anonymisation techniques. http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf. Retrieved 29 Dec 2014.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. North Carolina A&T State University, Greensboro, USA
  2. Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, USA