1 Introduction

In this paper, we limit ourselves to the problem of user re-identification from a dataset. We focus on two specific questions: given a set of records with no obvious information that would allow a single person to be easily identified from the dataset, (i) can we verify that no one is easily identifiable from the data (and, if someone is, find the corresponding records), and (ii) if some individuals are easy to identify, can we modify the data so as to “blend them in” while retaining the key characteristics of the data statistics?

The approach we take is to consider the data fields (over a set of records) as separate entities and try to build clusters of records based on metric proximity: if the records have similar values across several vector elements, they are likely to be grouped together. We then assume that if such a group is large enough and the records inside that group have been “stirred” enough, identification of a single individual becomes impossible. One of the main assumptions in this paper is that whatever further processing is performed on the anonymized data relies on these “group statistics/properties” being as intact as possible. The proposed “stirring” of the data implies that global data statistics and structures are preserved, while local ones are disturbed. In other words, we want to preserve the underlying manifold structure (in terms of the clusters of data that it is composed of) as much as possible, while locally shuffling the data around.

In order to clarify some of the notions presented, we introduce here an example data set of health-related information in Table 1. We consider for the purpose of this example that this represents the full health records from a certain medical institution.

Table 1. Example of Health Data records from a medical institution.

The records from Table 1 show no obviously identifying information when single fields are considered. Nevertheless, relationships between the non-sensitive fields in this data can probably make it relatively easy to identify some individuals: within a zip code, the nationality and the age allow someone to restrict the set of possible individuals dramatically. The last individual in the table is even more striking, as her age, nationality and zip code surely make her stand out from the rest. In such a situation, the approaches proposed in this paper seek to “blend in” this last individual from Table 1 with the rest of the records, as well as to make sure that all the records get “shuffled” (regarding each of the fields) so as to anonymize the data. In effect, we want to actively modify the data values, not by changing their nature (as hashing the values would, for example) or by omitting them, but by replacing them with realistic values (belonging to the same category/set) in a way that preserves some of the information.

2 High-Level Motivation for Data Privacy

In this section, we give a high-level description of the problem tackled in this paper. The next sections then describe the proposed means of addressing it. In an ideal situation, data mining and classification or partition of data, in particular for health and medical data [6], can be made in an unambiguous manner, meaning that, for example, a classification of the data can be made and the number of border cases is minimal. Applying algorithms that increase the privacy (or the entropy) of a system distorts this in some known manner, in terms of the direct effects on the data fields. For example, \(\kappa \)-anonymity [2] and \(\ell \)-diversity [8] reduce the distribution and number of unique values in the discrete-valued cases; differential privacy [4] adds noise in the continuous-valued cases (for example, speeds or distances). A privacy function distorts a system such that classification and/or mining either cannot be made or becomes difficult to make in a reasonable manner [5].

In this work, we consider the case of privacy functions that modify the data in such a way as to avoid changing the “format” of the data, and thus the underlying space in which the data lies. Indeed, another way to consider this is that privacy functions usually alter the underlying space or the topology of the space rather than moving the elements themselves. This altering of the topology in the best case involves continuous (in the sense of metric preserving or homotopy preserving) stretching and shrinking, but may also include non-continuous tearing and creasing of the space such that the resolution of the original metric function is no longer possible. The challenge here is then to avoid this deformation of the underlying space, by attempting to shuffle and move the data around in the best manner (with regard to increasing the privacy and minimising the distortions on the data). The work in this paper is aimed at this problem: proposing several practical solutions to the identification problem from Table 1. We first define, in the next section, some notation and assumptions on the structure of the data, and then look at the problem of moving a single sample (or a group of them) back into bigger clusters. We then tackle the problem of increasing the privacy for the samples within a cluster so as to make sure that re-identification, even within a cluster, is more difficult.

3 Methodology for Data Anonymization in a Data Clustering Context

3.1 Some Notational Details

Traditionally in the data privacy literature, one defines a table \(\mathbf {T}\) of N records as \(\mathbf {T} = \left\{ \mathbf {t}_{1},\dots ,\mathbf {t}_{N}\right\} \), with attributes \(\left\{ A_{1},\dots ,A_{d}\right\} \). We have that \(\mathbf {T}\subseteq \varOmega \), the set of all possible records (samples), and \(\mathcal {A}=\left\{ A_{1},\dots ,A_{d}\right\} \) is the set of all possible d attributes (in this case, all possible attributes are used in table \(\mathbf {T}\)). Typically, one denotes the value of a certain attribute \(A_{j}\) for sample \(\mathbf {t}_{i}\) as \(\mathbf {t}_{i}[A_{j}]\). In this paper, and for the developments below, we take the liberty of denoting by \(\mathbb {X}^{(j)}\) the set of all possible values for a certain attribute \(A_{j}\). Referring to Table 1 for our example case, if \(A_{j}\) is the attribute for the Zip Code of the patients, this means that \(\mathbb {X}^{(j)}\) represents the set of all possible Zip Codes (possibly limited to the existing ones that make sense within the context of this table, e.g. limited to a country).

We then assume that it is possible to define a distance function \(d^{(j)}:\mathbb {X}^{(j)}\times \mathbb {X}^{(j)}\longrightarrow \mathbb {R}_{+}\) over this set \(\mathbb {X}^{(j)}\). Note that the metric space \(\mathcal {X}^{(j)}=(\mathbb {X}^{(j)},d^{(j)})\) defined by these two entities need not be Euclidean. Some considerations on such distance functions over non-Euclidean spaces are detailed in Sect. 3.2. Departing slightly from the data privacy notation, we denote by \(\mathbf {T}=\left[ \mathbf {t}_{1},\dots ,\mathbf {t}_{N}\right] ^{T}\) the matrix of N samples holding the health records. A record \(\mathbf {t}_{i}\) is now defined as \(\mathbf {t}_{i}=\left[ a_{i,1},a_{i,2},\dots ,a_{i,d}\right] ,\,a_{i,j}\in \mathbb {X}^{(j)}\), with \(\mathbb {X}^{(j)}\) the set considered as part of the metric space \(\mathcal {X}^{(j)}=(\mathbb {X}^{(j)},d^{(j)})\).

With this extended notation, we can see the column \(\left[ a_{1,j},\dots ,a_{N,j}\right] ^{T}\in \left( \mathbb {X}^{(j)}\right) ^{N\times 1}\) as a discrete random variable (or, more precisely, a set of realizations of the underlying random variable) over \(\mathcal {X}^{(j)}\). The following section discusses the previous assumption of being able to define a distance function over a potentially non-Euclidean space.

3.2 Distance Functions Over Non-Euclidean Spaces

The argument for considering distances over non-Euclidean spaces in this work is that it is possible to tweak and modify such non-Euclidean distances so that their distribution and properties will be “close enough” to those of the original Euclidean distance. Most of the developments in this paper rely on having “meaningful and consistent” distance functions across all the dimensions, so that they can at least be compared, even if this means re-mapping the distribution of their values.

More formally, let us assume that we have two metric spaces \(\mathcal {X}^{(i)}=(\mathbb {X}^{(i)},d^{(i)})\) and \(\mathcal {X}^{(j)}=(\mathbb {X}^{(j)},d^{(j)})\), with \(\mathcal {X}^{(i)}\) the canonical Euclidean space (i.e. \(\mathbb {X}^{(i)}=\mathbb {R}^{d}\) and \(d^{(i)}\) the distance induced by the Euclidean norm) and \(\mathcal {X}^{(j)}\) a non-Euclidean metric space endowed with a non-Euclidean metric. Drawing samples uniformly from the set \(\mathbb {X}^{(j)}\), we form \(\mathbf {x}^{(j)}=\left[ x^{(j)}_{1},\dots ,x^{(j)}_{n}\right] \), a set of values (realizations of the underlying random variable), with \(x^{(j)}_{l}\in \mathbb {X}^{(j)}\). Denoting then by \(f_{d^{(j)}}\) the distribution of pairwise distances over all the samples in \(\mathbf {x}^{(j)}\), we assume that it is possible to modify the distribution of the values of the non-Euclidean metric \(d^{(j)}\) (into a distribution \(f^{\text {map}}_{d^{(j)}}\)) such that

$$\begin{aligned} \lim _{n\rightarrow \infty }f^{\text {map}}_{d^{(j)}}=f_{d^{(i)}}, \end{aligned}$$
(1)

where \(f_{d^{(i)}}\) is the distribution of the Euclidean distances \(d^{(i)}\) over the Euclidean space \(\mathcal {X}^{(i)}\) and \(f^{\text {map}}_{d^{(j)}}\) is a non-linear transformation of the original distribution \(f_{d^{(j)}}\) by a certain function.

The limit here is over n as the distribution \(f_{d^{(j)}}\) is considered to be estimated using a limited number n of realizations of the random variables, and we are interested in the limit case where we can “afford” to draw as many samples as needed to be as close to the Euclidean metric as possible. That is, we want to make sure that the non-Euclidean metric behaves over its non-Euclidean space as a Euclidean metric would over a Euclidean space. This assumption is “theoretically reasonable”, as it comes down to being able to transform a distribution \(f_{d^{(j)}}\) into another, \(f^{\text {map}}_{d^{(j)}}\), given both. While this may not be simple, or even possible, using linear transformation tools, most Machine Learning techniques are able to fit a continuous input to a different continuous output (this is basically the well-known Universal Function Approximator property [3]). Note that with such tools the mapping will not be perfect (as we work with discrete versions of the distributions) and will not result in the equality case from Eq. 1. Nevertheless, we assume in this paper that this is sufficient for our needs. With this assumption in mind, we come to the problem of addressing Differential Privacy approaches as a Vector Quantization matter.
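As an illustration only, one simple way to realise such a re-mapping on finite samples is rank-based quantile matching: each non-Euclidean distance value is replaced by the value found at the same empirical quantile of a reference sample of Euclidean distances. This choice of technique is an assumption of the sketch below and is not prescribed by the method above; the function name `quantile_map` and the toy values are hypothetical.

```python
from bisect import bisect_right

def quantile_map(source_distances, reference_distances):
    """Map each source distance to the reference value at the same empirical quantile.

    Hypothetical helper: rank-based matching is an assumption of this sketch.
    """
    src_sorted = sorted(source_distances)
    ref_sorted = sorted(reference_distances)
    n_src, n_ref = len(src_sorted), len(ref_sorted)
    mapped = []
    for value in source_distances:
        # Rank of `value` in the source sample (number of values <= value).
        rank = bisect_right(src_sorted, value)
        # Matching position in the reference sample: ceil(rank * n_ref / n_src) - 1,
        # computed with integers to avoid floating-point rounding.
        idx = (rank * n_ref + n_src - 1) // n_src - 1
        mapped.append(ref_sorted[idx])
    return mapped

# Toy values (not data from the paper): integer-valued non-Euclidean distances
# and a reference sample of Euclidean distances.
non_euclidean = [0, 1, 1, 2, 3]
euclidean_reference = [0.2, 0.5, 0.9, 1.4, 2.0]
mapped = quantile_map(non_euclidean, euclidean_reference)
```

The mapping is monotone, so the ordering of distances is kept while their empirical distribution is pulled towards that of the reference sample. Tied source values receive the same mapped value, which is one reason the equality of Eq. 1 is not reached on discrete data.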

4 Considering Differential Privacy as a Vector Quantization Problem

Using the previous notations introduced, Differential Privacy aims at finding sets or clusters (groups) \(C_{l}\) of samples

$$\begin{aligned} C_{l} =&\left\{ \mathbf {t}_{i},\mathbf {t}_{j}\in \mathbf {T}|\forall i,j\in [\![1,N]\!],i\ne j, \forall k\in [\![1,d]\!],\right. \nonumber \\&\left. d^{(k)}(a_{i,k}, a_{j,k})\le \varepsilon _{k}\right\} , \end{aligned}$$
(2)

with \(\varepsilon _{k}\) the maximum radius of the cluster \(C_{l}\) along dimension k (each dimension can thus have a separate radius). The total number of clusters C is here determined by the choices made for these maximum radii, i.e. the \(\varepsilon _{k}\). Intuitively, these \(C_{l}\) are clusters of samples that are “not too distant from each other”. If all the metric spaces (across all dimensions) were Euclidean, Eq. 2 would simply define the sets of samples that have pairwise Euclidean distance smaller than a certain epsilon. In this respect, we are considering similarity between groups of samples as defined by cluster density across all dimensions. In our case, we generalize this definition by potentially having a different distance function for each dimension, each bounded by its own \(\varepsilon _{k}\). Note that samples that are “alone” in their cluster basically represent outliers in terms of the data they hold (and thus, individuals): they might be very easy to identify/recognize among the rest of the records, as they do not “fit with others”. This observation can be generalized to clusters that have “few samples” in the ball they define. “Few” has to be defined in this case; it is directly related to the \(\kappa \) in \(\kappa \)-anonymity.
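The membership condition of Eq. 2 can be sketched as a simple check over all pairs of records and all dimensions. The sketch below is a minimal illustration under stated assumptions: records are tuples, and the two per-dimension distance functions and the toy records are hypothetical stand-ins, not the distances or data used in the paper.

```python
from itertools import combinations

def within_radii(records, distances, radii):
    """Return True if every pair of records is within radii[k] on every dimension k (Eq. 2)."""
    for t_i, t_j in combinations(records, 2):
        for k, (dist, eps) in enumerate(zip(distances, radii)):
            if dist(t_i[k], t_j[k]) > eps:
                return False
    return True

# Hypothetical per-dimension distances: absolute difference for a numeric field,
# 0/1 mismatch for a categorical field.
def numeric_distance(a, b):
    return abs(a - b)

def mismatch_distance(a, b):
    return 0 if a == b else 1

distances = [numeric_distance, mismatch_distance]
radii = [5, 0]  # one epsilon_k per dimension

# Toy records (age, category); not taken from Table 1.
group = [(28, "A"), (29, "A"), (31, "A")]
is_cluster = within_radii(group, distances, radii)
is_cluster_with_outlier = within_radii(group + [(50, "B")], distances, radii)
```

A record that cannot join any group without violating one of the \(\varepsilon _{k}\) bounds ends up alone, which is exactly the outlier situation discussed above.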

Denoting by m the previous “few”, if \(|C_{l1}|\le m\), there are not enough records in the cluster: they might be easy to identify, or represent too obvious a group. The goal is then to modify as few dimensions as possible (so as to minimize distortion of the data) to bring these records into the nearest cluster \(C_{l2}\) that satisfies \(|C_{l2}|>m\), or such that \(|C_{l1}|+|C_{l2}|>m\). To find which nearby cluster is the most fitting for such a lonesome sample, we rely on centroids. We thus need to calculate a centroid (or representative) of each cluster such that \(|C_{l}|>m\). Note that as the sets \(\mathbb {X}^{(j)}\) across which the data is defined do not necessarily have any implied order, we have to use solely the distances between samples to calculate the most fitting centroid.

This comes to determining the centroid \(c_{l}\) of cluster \(C_{l}\) with only inter-records distances (pairwise distances for all samples within one cluster):

$$\begin{aligned} c^{(j)}_{l} = \arg \min _{a_{k,j}\in \mathbb {X}^{(j)}}\left[ \sum _{a_{i,j}\in C_{l}}d^{(j)}(a_{i,j},a_{k,j})\right] , \end{aligned}$$
(3)

where \(c^{(j)}_{l}\) denotes the j-th coordinate of the centroid \(c_{l}\) of cluster \(C_{l}\). Eq. 3 abuses notation in the summation index to keep the expression light: the summation is made over the j-th coordinate \(a_{i,j}\) of all the samples \(\mathbf {t}_{i}\) in cluster \(C_{l}\), which avoids formally defining the set of samples in the cluster. From Eq. 3, it can be seen that the centroid coordinates are picked from the sets \(\mathbb {X}^{(j)}\), and not calculated as some mean value over the samples present in the cluster. A mean would not make sense for discrete \(\mathbb {X}^{(j)}\), so this definition is more practical for the general case.

We do not discuss in this paper the algorithmic means of finding such centroids based on this definition from Eq. 3. With the centroids of each cluster estimated, we can then decide how to move samples that are lonesome and too easy to identify.
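Although efficient algorithms are out of scope here, for a small, finite candidate set Eq. 3 can be evaluated by exhaustive search. The following is a naive sketch under that assumption; the function names, the candidate sets and the toy cluster are hypothetical, and the cost grows with the size of each \(\mathbb {X}^{(j)}\).

```python
def centroid_coordinate(cluster_values, candidates, dist):
    """Pick the candidate minimising the summed distance to the cluster values (Eq. 3)."""
    return min(candidates, key=lambda c: sum(dist(a, c) for a in cluster_values))

def centroid(cluster, candidate_sets, distances):
    """Compute the centroid one coordinate at a time."""
    return tuple(
        centroid_coordinate([t[j] for t in cluster], candidate_sets[j], distances[j])
        for j in range(len(distances))
    )

# Hypothetical per-dimension distances and candidate sets X^(j).
def numeric_distance(a, b):
    return abs(a - b)

def mismatch_distance(a, b):
    return 0 if a == b else 1

distances = [numeric_distance, mismatch_distance]
candidate_sets = [range(0, 120), ["A", "B", "C"]]

# Toy cluster of (age, category) records; not taken from Table 1.
cluster = [(28, "A"), (29, "A"), (34, "B")]
c = centroid(cluster, candidate_sets, distances)
```

Because the search runs over the candidate sets rather than over the cluster members, the returned coordinates need not coincide with any existing record. Ties are broken by taking the first minimiser in iteration order, which is an arbitrary choice of this sketch.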

4.1 Moving Samples to Nearby Clusters

The task of moving a sample (or a small enough group of them) into a near cluster first requires the determination of the most suitable cluster for each of these samples.

Identifying the Most Suitable Cluster. Intuitively, and in order to preserve data as much as possible, the most suitable cluster \(C_{l}\) for this application is such that the total distortion, approximated in this case by how much the outlier \(\mathbf {t}_{o}\) is moved across all dimensions, is minimal. Thus, denoting by \(\mathbf {t}_{o}=\left[ a_{o,1},\dots ,a_{o,d}\right] \) an outlier, by \(\mathcal {C}=\left\{ C_{k}\right\} _{1\le k\le C}\) the set of all the clusters (which have a sufficient number of samples in them), and by \(d_{\text {map}}^{(j)}\) the mapped version of the distance function \(d^{(j)}\) (so that the distribution of its values matches that of a Euclidean metric, see Sect. 3.2), we get

$$\begin{aligned} C_{l}=\arg \min _{C_{k}\in \mathcal {C}}\left[ \sum _{j=1}^{d}d_{\text {map}}^{(j)}\left( a_{o,j},c_{k}^{(j)}\right) \right] . \end{aligned}$$
(4)

One argument for using the mapped distances \(d_{\text {map}}^{(j)}\) in this determination of the suitable cluster, is that we have to make a decision over all the dimensions at once, regarding the distortion generated by moving the outlier into a cluster. Therefore, in order to quantify this distortion across all dimensions at once, it is important that the distances are all within similar ranges and following similar distributions (otherwise, some dimensions will be “favoured” by the sum, possibly unjustly). Actual weighting of the distances in order to artificially favour some dimensions is the subject of further work.
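The selection rule of Eq. 4 can be sketched as a minimisation over the cluster centroids. In the sketch below, the mapped distances are hypothetical placeholders (a rescaled absolute difference and a 0/1 mismatch) chosen only so that both dimensions fall in a comparable range; they are not the mapped distances of Sect. 3.2, and the toy outlier and centroids are not data from the paper.

```python
def nearest_cluster(outlier, centroids, mapped_distances):
    """Return the index of the centroid with the smallest summed mapped distance (Eq. 4)."""
    def total_distortion(c):
        return sum(d(a, c_j) for d, a, c_j in zip(mapped_distances, outlier, c))
    return min(range(len(centroids)), key=lambda k: total_distortion(centroids[k]))

# Hypothetical mapped distances, both roughly within [0, 1].
def scaled_numeric_distance(a, b):
    return abs(a - b) / 100

def mismatch_distance(a, b):
    return 0 if a == b else 1

mapped_distances = [scaled_numeric_distance, mismatch_distance]

# Toy outlier and centroids of the sufficiently large clusters.
outlier = (70, "B")
centroids = [(29, "A"), (65, "C"), (40, "B")]
best = nearest_cluster(outlier, centroids, mapped_distances)
```

With these toy values the third centroid is selected even though the second is closer in age, because the categorical mismatch dominates the sum. This illustrates why the relative scaling of the per-dimension distances matters. When several clusters tie, this sketch returns the first one.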

Moving the Sample to the Decided Cluster. Once the most suitable cluster \(C_{l}\) for outlier \(\mathbf {t}_{o}\) has been determined (note that there might not be a unique solution to this cluster determination), the problem is to move the outlier within that cluster so as to modify the actual values of the outlier as little as possible. We identify three ways to do this in practice, out of which the first is probably the best in terms of low distortion, but also the most difficult — and thus probably not achievable in real cases. In all the following three cases, the following steps are applied:

$$\begin{aligned} \forall k \in [\![1,d]\!],\quad a_{o,k}= {\left\{ \begin{array}{ll} a_{o,k}^{\text {new}} &{} \text {if}\,d^{(k)}(a_{o,k},c_{l}^{(k)})>\max \limits _{a_{i,k}\in C_{l}}\left[ d^{(k)}(a_{i,k},c_{l}^{(k)})\right] ,\\ a_{o,k} &{} \text {otherwise (unchanged)}, \end{array}\right. } \end{aligned}$$
(5)

where \(a_{o,k}^{\text {new}}\) is the new value to be given to the k-th coordinate of the outlier \(\mathbf {t}_{o}\), and \(\max _{a_{i,k}\in C_{l}}\left[ d^{(k)}(a_{i,k},c_{l}^{(k)})\right] \) is the maximum intra-cluster distance between cluster elements and the centroid \(c_{l}\) of the cluster \(C_{l}\). Thus, we ensure that modifying this specific dimension does not alter the intra-cluster distances too much. We then propose three approaches to determine the new \(a_{o,k}^{\text {new}}\) value: (a) Setting it to the centroid value, and adding some noise; (b) Setting it to the centroid value only; (c) Setting it to an existing cluster element value.

(a) Centroid and Noise. In this case, we set the new value of the outlier coordinate \(a_{o,k}^{\text {new}}\) as

$$\begin{aligned} a_{o,k}^{\text {new}} = c_{l}^{(k)}+r, \end{aligned}$$
(6)

where r is randomly drawn from a distribution chosen such that the distribution of the distances from the cluster samples to the cluster centroid is not modified too much. More precisely, with \(f_{d_{C_{l}}}\) the distribution of the distances between the samples in cluster \(C_{l}\) and its centroid, and \(f_{d_{C_{l}}}^{\text {new}}\) the same distribution after modifying the outlier coordinate \(a_{o,k}\), we want to make sure that \(\text {KL}(f_{d_{C_{l}}}, f_{d_{C_{l}}}^{\text {new}})\le \varepsilon \), where \(\text {KL}(\cdot ,\cdot )\) stands for the Kullback-Leibler divergence between the two distributions [7]. In practice, other measures could be used instead, such as the Earth Mover's Distance [1, 9]. As noted above, this approach, although probably very desirable, is rather difficult to achieve in practice, as drawing the noise value r in the way described above is difficult.

(b) Flattening to Centroid. This case is a direct simplified version of the previous one. Here, we set the new value of \(a_{o,k}^{\text {new}}\) as

$$\begin{aligned} a_{o,k}^{\text {new}} = c_{l}^{(k)}. \end{aligned}$$
(7)

While this approach has the clear advantage of being simple, it might lead to moving the outlier “too close” to the centroid. Remember here that the centroid is likely not an actual sample from the data, and we are thus inserting into this cluster a sample with previously unseen coordinates.

(c) Flattening to a Cluster Element Value. Finally, this third approach is probably a good compromise of simplicity of execution and low distortion of the data. Here, we set the new value of \(a_{o,k}^{\text {new}}\) as

$$\begin{aligned} a_{o,k}^{\text {new}} = a_{i,k}\,\text {with}\,a_{i,k}\,\text {drawn at random from}\, C_{l}. \end{aligned}$$
(8)

In this case, we thus draw from the existing sample values from this cluster for this specific dimension. This ensures that we avoid disturbing the existing samples within the cluster too much, while effectively moving the outlier within the cluster. Once we have the outlier(s) moved back into the most appropriate cluster, we can assume that the isolated individuals these samples were representing are no longer as easy to re-identify as before. We can move on to the second part of this work: anonymizing the data within a cluster.
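The conditional update of Eq. 5, combined with options (b) and (c), can be sketched as follows. This is a minimal illustration under stated assumptions: records are tuples, the centroid is given, and the function name `move_outlier`, its `mode` parameter, the distances and the toy values are all hypothetical. Option (a) is left out, since drawing a suitable noise term is the difficult part noted above.

```python
import random

def move_outlier(outlier, cluster, centroid, distances, rng, mode="element"):
    """Apply Eq. 5: only coordinates farther from the centroid than any cluster member change."""
    moved = list(outlier)
    for k, dist in enumerate(distances):
        max_intra = max(dist(t[k], centroid[k]) for t in cluster)
        if dist(outlier[k], centroid[k]) > max_intra:
            if mode == "centroid":
                # Option (b), Eq. 7: flatten to the centroid coordinate.
                moved[k] = centroid[k]
            else:
                # Option (c), Eq. 8: draw an existing value from the cluster.
                moved[k] = rng.choice([t[k] for t in cluster])
    return tuple(moved)

# Hypothetical per-dimension distances.
def numeric_distance(a, b):
    return abs(a - b)

def mismatch_distance(a, b):
    return 0 if a == b else 1

distances = [numeric_distance, mismatch_distance]

# Toy cluster, centroid and outlier; not taken from Table 1.
cluster = [(28, "A"), (29, "A"), (34, "B")]
cluster_centroid = (29, "A")
outlier = (70, "B")

rng = random.Random(0)  # seeded only to make the sketch reproducible
moved_b = move_outlier(outlier, cluster, cluster_centroid, distances, rng, mode="centroid")
moved_c = move_outlier(outlier, cluster, cluster_centroid, distances, rng, mode="element")
```

In this toy case only the first coordinate is modified: the outlier's category is already no farther from the centroid than the farthest cluster member, so Eq. 5 leaves it unchanged.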

4.2 Anonymizing the Data Within a Cluster

The idea behind this approach is to provide some methods to anonymize the data (by modifying its inherent values) while retaining, as in the previous sections, the structure of it — in the same data clustering sense as in the rest of the paper. For this, we propose two approaches that aim at anonymizing the data within a cluster (and not the whole cluster in itself), so that samples (individuals) within a cluster cannot be re-identified easily. The two approaches are relatively destructive on purpose, in order to provide a means of destroying “intelligently” the data for some specific data fields. The first one relies on flattening the required dimensions: if a specific data field is deemed sensitive, it can be summarized, for a single cluster, by a single value. The second approach is a lot less destructive, and tries to preserve the overall cluster statistics as much as possible, by randomizing the values within a cluster for all the samples so that the cluster remains similar.

Flattening Dimensions. Referring back to Table 1, it is for example likely that the data in the “Sensitive” field, namely the Condition, would need to be modified before this data is released. For such cases where destructive data alterations are desirable or even needed, it would be possible to replace the sensitive values by empty or unusable ones. This would effectively destroy some of the data statistics and structures within the cluster considered. But in the cases where one would want to preserve some of this information in order to keep some structure within the cluster, the question becomes: how do we modify this data so that it is as close to destroying it as possible, while maintaining the cluster structure/statistics? The proposed straightforward way to do this is to “flatten” the sensitive field (dimension) to the value of the centroid. The effect of collapsing a specific dimension is illustrated on Fig. 1. In effect, what happens for each cluster \(C_{l}\) is

$$\begin{aligned} \forall k\in \mathcal {S},\,\forall \mathbf {t}_{i}\in C_{l}, a_{i,k}=c_{l}^{(k)}, \end{aligned}$$
(9)

where \(\mathcal {S}\) is the set of the considered sensitive fields to be anonymized “destructively”. While this approach effectively destroys the data structure within the cluster to some extent, there is a risk that the cluster already looks as in Fig. 1 initially; this could happen if the initial clustering of the data samples is already efficient. The flattening procedure proposed here would then have no effect, and one could argue that the anonymization is not carried out.

Fig. 1. Example of collapsing of one dimension to the centroid value. This obviously breaks cluster distribution and distances to the centroid.

This is unlikely to happen for all clusters at the same time, although this is obviously highly data-dependent. For this reason, we propose the second method, also considered as destructive regarding the data values, but “safer” in this respect.
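The flattening step of Eq. 9 amounts to overwriting the sensitive coordinates of every record in a cluster with the corresponding centroid coordinates. A minimal sketch, assuming records are tuples and the centroid has already been computed (the function name and toy values are hypothetical):

```python
def flatten_dimensions(cluster, centroid, sensitive):
    """Replace every sensitive coordinate by the centroid coordinate (Eq. 9)."""
    return [
        tuple(centroid[k] if k in sensitive else value for k, value in enumerate(t))
        for t in cluster
    ]

# Toy cluster of (age, category) records and its centroid; not taken from Table 1.
cluster = [(28, "A"), (29, "A"), (34, "B")]
cluster_centroid = (29, "A")
sensitive = {1}  # indices of the fields in S

flattened = flatten_dimensions(cluster, cluster_centroid, sensitive)
```

If every record already carried the centroid value on the sensitive dimension, the output would equal the input, which is the no-effect case discussed above.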

Shuffling Data Around. This second method is about preserving the intra-cluster data structure as much as possible, while still modifying the sample values as much as possible. It is the more computationally costly of the two approaches. In this case, we shuffle the samples (on one dimension at a time only) around the centroid. In effect, for a cluster \(C_{l}\),

$$\begin{aligned} \forall k\in \mathcal {S},\,\forall \mathbf {t}_{i}\in C_{l}, a_{i,k}=a_{i,k}^{\text {new}}\,\text {s.t.}\,\text {KL}(f_{d_{C_{l}}}, f_{d_{C_{l}}}^{\text {new}})\le \varepsilon , \end{aligned}$$
(10)

where, as before, \(f_{d_{C_{l}}}\) is the distribution of the distances between the samples in cluster \(C_{l}\) and its centroid, and \(f_{d_{C_{l}}}^{\text {new}}\) the same distribution after modifying the coordinate \(a_{i,k}\), and \(a_{i,k}^{\text {new}}\in \mathbb {X}^{(k)}\) is the new value for the coordinate k of sample \(\mathbf {t}_{i}\) in \(C_{l}\). This approach is illustrated on Fig. 2, where one can see that the overall effect is to “shuffle around” within the cluster, while preserving the distances between the samples in the cluster and the cluster centroid.

Fig. 2. Example of re-distributing the samples within a cluster (or adding noise to them in a controlled fashion): The distribution of the distances to the centroid is preserved and the overall cluster structure is preserved.

Note in this case that we do not try to preserve explicitly the pairwise distances between the samples within a cluster. Such distances will, at the whole cluster level, be preserved somewhat in any case, by preserving the distances to the centroid.
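As a hedged illustration of the constraint in Eq. 10, the sketch below uses the simplest feasible choice: permuting the values of one dimension among the records of a cluster. Assuming the centroid is held fixed and the distance distribution is taken on that dimension only, a permutation leaves the set of distances to the centroid unchanged, so the divergence is zero. This is only one special case of the constraint, not the general procedure; the histogram binning, the smoothing constant and all names are assumptions of the sketch.

```python
import math
import random

def histogram(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin also includes edges[-1]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts  # values outside the edges are ignored

def kl_divergence(p_counts, q_counts, smoothing=1e-9):
    """Discrete KL(p || q) from bin counts, with additive smoothing to avoid log(0)."""
    p_total = sum(p_counts) + smoothing * len(p_counts)
    q_total = sum(q_counts) + smoothing * len(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = (p_c + smoothing) / p_total
        q = (q_c + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl

def shuffle_dimension(cluster, k, rng):
    """Permute the k-th coordinate among the records of one cluster."""
    column = [t[k] for t in cluster]
    rng.shuffle(column)
    return [t[:k] + (v,) + t[k + 1:] for t, v in zip(cluster, column)]

# Toy cluster of (age, category) records; the centroid coordinate is assumed fixed.
cluster = [(28, "A"), (29, "A"), (34, "B"), (31, "A")]
centroid_age = 29
edges = [0, 2, 4, 6]

rng = random.Random(0)  # seeded only to make the sketch reproducible
shuffled = shuffle_dimension(cluster, 0, rng)

before = histogram([abs(t[0] - centroid_age) for t in cluster], edges)
after = histogram([abs(t[0] - centroid_age) for t in shuffled], edges)
kl_shuffled = kl_divergence(before, after)

# For comparison: moving the farthest record onto the centroid changes the distribution.
perturbed = [(28, "A"), (29, "A"), (29, "B"), (31, "A")]
kl_perturbed = kl_divergence(
    before, histogram([abs(t[0] - centroid_age) for t in perturbed], edges)
)
```

A more general procedure would draw new values from \(\mathbb {X}^{(k)}\) and accept them only when the divergence stays below \(\varepsilon \), using a check such as `kl_divergence` as the acceptance test.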

5 Conclusions and Future Work

In this paper, we propose an early version of a data anonymization framework, focusing on making individual re-identification difficult, while preserving cluster/group statistics and structure, over any type of data field (provided it can be abstracted as a metric space). We first develop the means of identifying outliers in terms of clustering the data, and propose ways to modify the data so as to “push back” such an outlier into the rest of the crowd. We then propose several methods to “stir” the data within a cluster, effectively modifying the data values completely, but retaining the internal structure of the clusters. This allows for further data processing at a somewhat global level, while ensuring the privacy of individuals. As can be noted, some of the data alterations proposed in this work are relatively computationally heavy, and currently require many iterations to converge to an acceptable solution (e.g. the case from Sect. 4.2 where one shuffles data around within a cluster so as to minimize the distortions on the distance distributions). Current and future work will focus on developing efficient algorithms to perform the proposed anonymization tasks, and on experimenting with the proposed framework over large data sets composed of very different data fields.