
1 Introduction

We live in an era where the world is more connected than ever before and where everything, from smartphones and smart vehicles to smart homes and smart cities, is digitized, continually generating a tremendous amount of information. With this information at hand, many concerns arise; one in particular is the critical exposure of individuals’ privacy, putting their anonymity at risk [1, 2]. Several anonymization techniques have been developed in the literature to preserve privacy. Whether they are generalization-based techniques [3,4,5] that alter the original values or bucketization-based techniques [6,7,8,9,10] that preserve privacy by splitting the dataset into sensitive and non-sensitive tables to hide the link between their values, they all assume a trade-off between privacy and utility. This trade-off is required to keep the dataset suitable for analysis while preserving individuals’ anonymity, but it leaves anonymization vulnerable and unable to cope with all sorts of attacks [11,12,13,14]. It is indeed difficult to produce a completely anonymous dataset without losing utility. There are several reasons for this, notably the difficulty of presuming the adversary’s prior belief and her/his ability to gain insights after looking at the anonymized dataset. Besides, a dataset in which several tuples relate to the same individual may expose significant correlations between identifying and sensitive values. An adversary can use her/his knowledge of such correlations [11, 13], or use these correlations as foreground knowledge [15], to breach individuals’ privacy. To cope with this particular problem, safe grouping is proposed in [16, 17] to ensure that an individual’s tuples are grouped in one and only one quasi-identifying group (QI-group) that is l-diverse, respects a minimum diversity for identifying attribute values, and in which all individuals have an equal number of tuples. (k, l)-diversity [18] is another technique that uses generalization to associate k distinct individuals with l-diverse QI-groups. While these techniques are useful in dealing with the correlation problem on bulk datasets, they provide no proof of effectiveness in anonymizing data streams, where data must be protected on the fly before being stored in an anonymized dataset and the anonymization technique has only a partial view of the dataset, limited to the batch of tuples undergoing anonymization.

Let us consider a car rental example scenario where each smart vehicle triggers an event between two peers in the form of a transaction to be stored in a dataset for analysis. Transactions are generated continuously as long as customers are driving their vehicles, thus forming a data stream. In this scenario, we assume that the anonymization must be performed on the stream of tuples generated by the data source to output an anonymized dataset in the form shown in Fig. 1.

Fig. 1. Rental data stream anonymized

The released 2-diverse dataset is divided into two separate tables to hide the link between identifying and sensitive values, as in [6, 7, 19]. Within a QI-group, an identifying value cannot be associated with a sensitive value with probability higher than 1/2. The problem arises when identifying and sensitive values correlate across QI-groups [16, 18, 20] (e.g., the first two QI-groups in Fig. 1(b)), implying that these values belong to the same individual.

In this paper, we extend the work in [16, 17] to address the correlation problem in the anonymization of transactional data streams, where data changes dynamically and its distribution is imbalanced. We propose (k, l)-clustering, which continuously groups k distinct individuals into l-diverse QI-groups and ensures that these individuals remain grouped together in future releases of QI-groups. (k, l)-clustering keeps track of incoming identifying values to safely release them across the QI-groups. It is a bucketization technique that prevents attribute disclosure while releasing trustworthy information. Our contributions in this paper include:

  • defining the privacy properties required to bound the correlations in a data stream.

  • proposing a novel clustering approach to enforce the aforementioned privacy properties.

The remainder of this paper is organized as follows. In Sect. 2, we review work related to the anonymization of data streams. In Sect. 3, we introduce basic concepts and definitions. We present our privacy model in Sect. 4 and describe the (k, l)-clustering approaches. Section 5 evaluates the performance of our algorithm by adapting two clustering techniques to data streams.

2 Related Work

In [21], Cao et al. extend the definition of k-anonymity to data streams and propose CASTLE, a clustering-based algorithm that publishes k-anonymized clusters within an acceptable delay. An extension of CASTLE is presented in [22] to reduce the number of tuples in the clusters and to maximize the utility of the anonymized dataset. In another work [23], FAANST is proposed to anonymize numerical data streams. FADS, an anonymization algorithm proposed in [24, 25], scales well in time and space and adds constraints on the size of the clusters and their reuse strategy. While these techniques extend privacy solutions based on k-anonymity and l-diversity to transactional data streams, they do not take into account the correlation of identifying and sensitive values across the QI-groups. Moreover, several studies [11, 13, 18, 20] have shown that correlation attacks can be launched not only on bucketization techniques but on generalization-based techniques as well.

A work similar to ours is presented in [26], where the authors include background knowledge in their anonymization algorithm to deal with strong adversaries. They propose a hierarchical agglomerative algorithm to prevent attribute and identity disclosure. However, the authors only address correlations known to the adversary, whereas we consider that correlations can be mined from the dataset itself and used as foreground knowledge to link individuals to their sensitive values. Alternatively, in [20], the authors present a sequential bottom-up anonymization algorithm, KSAA, that uses generalization to protect against background knowledge attacks on different anonymized views of the same original dataset. KSAA clusters tuples and generates QI-groups satisfying the privacy model in the current view; in a second step, it checks whether the privacy constraint still holds when several views are joined together. In contrast, our clustering algorithm is applied to a stream of tuples on the fly, where three requirements must be met: low retention of tuples, balanced memory usage, and acceptable runtime. In [27], the authors propose a generalization-based microaggregation algorithm for stream k-anonymity that meets a maximum delay constraint without preserving the order of incoming tuples in the published stream, as in [21]. They then improve the preservation of the original order of the tuples by using steered microaggregation, adding the timestamp as an artificial attribute. Similar to [21], we do not publish the timestamp attribute due to privacy constraints; however, we use it for experimental purposes.

On another front, several notable works [29,30,31] have applied differential privacy [32] to streaming data. In this work, we choose a bucketization technique that publishes trustworthy information. We particularly extend previous works [16, 17] to address correlations in the data stream in data sharing scenarios.

3 Preliminary Definitions

In this section, we present the basic concepts and definitions to be used in the remainder of this paper.

Definition 1

(Tuple - t). In a relational dataset, a tuple t is a finite ordered list of values \(\{v_{1}, v_{2}, ..., v_{b}\}\) where, given a set of attributes \(\{A_{1}, ..., A_{b}\}\), \(\forall i\ (1 \le i \le b)\), \(v_{i} = t[A_{i}]\) refers to the value of attribute \(A_{i}\) in t. We categorize attributes as follows:

  • \(Identifier\ (A^{id})\) is an attribute whose value is linked to an individual in a given dataset, e.g., a social security number pseudonymized so that it uniquely represents an individual without explicitly identifying her/him.

  • \({Sensitive\ attribute\,(A^{s})}\) reveals critical and sensitive information about a certain individual and must not be directly linked to individuals’ identifying values in data sharing, publishing or releasing scenarios.

  • \(Time\text{-}stamp\ (A^{ts})\) indicates the arrival time of a tuple, i.e., its position in the stream. The time-stamp is considered identifying and can be used to expose individuals’ privacy in a transactional data stream. Here, we do not publish the time-stamp; we use it instead for evaluating the utility of our anonymization technique.

Definition 2

(Data Stream - S). A data stream \(S = t_{1}, t_{2}, ...\) is a continuously growing dataset composed of an infinite series of tuples received one at each instant. Let U be the set of individuals of a specific population. \(\forall u \in U\), we denote by \(S_u\) the set of tuples in S related to the individual u, where \(\forall t \in S_u\), \(t[A^{id}] = v_{id}\).

Definition 3

(Cluster - C). Let \( S^\prime \subset S \) be a set of tuples in S. A cluster C over \(S^\prime \) is defined as a set of tuples \(\{t_1, ..., t_n\}\) and a centroid \(V_{id}\) consisting of a set of identifying values such that, \(\forall t \in C, t[A^{id}] \in {V}_{id}\). We use the notation \({V}_{id}(C)\) to denote the centroid \({V}_{id}\) of C.
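To make these definitions concrete, the following Java types mirror Definitions 1 and 3; they are a minimal sketch for exposition, and the class and field names are our own assumptions, not the paper’s actual implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative types for Definitions 1 and 3 (assumed names).
class Tuple {
    final String id;        // identifying value t[A^id]
    final String sensitive; // sensitive value t[A^s]
    final long timestamp;   // arrival time t[A^ts], kept for evaluation only

    Tuple(String id, String sensitive, long timestamp) {
        this.id = id; this.sensitive = sensitive; this.timestamp = timestamp;
    }
}

class Cluster {
    final Set<String> centroid = new HashSet<>(); // V_id(C)
    final List<Tuple> tuples = new ArrayList<>();

    // A tuple may join the cluster only if its identifier is in the centroid.
    void add(Tuple t) {
        if (!centroid.contains(t.id))
            throw new IllegalArgumentException("t[A^id] not in V_id(C)");
        tuples.add(t);
    }
}
```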

Table 1. Notations

Definition 4

(Equivalence class/QI-group) [1]. A quasi-identifier group (QI-group) is defined as a subset \(QI_j, j = 1, 2, ...\) of released tuples in \(S^{*} = \bigcup _{j=1}^{\infty }QI_j\) such that, for any \(j_1 \ne j_2\), \(QI_{j_1} \cap QI_{j_2} = \emptyset \).

We retain the QI-group terminology for compatibility with the broader anonymization literature, although in our setting a group can include identifying as well as quasi-identifying attributes (Table 1).

4 Privacy Preservation

We work under the assumption that the anonymization of the data stream will continuously release l-diverse QI-groups, and these QI-groups, if joined together, will not expose unsafe correlations between identifying and sensitive values. We define two types of adversaries, passive and active.

  • A passive adversary has no prior knowledge of the individuals or of the correlations between their identifying and sensitive values in the dataset. She/he is able, however, to extract foreground knowledge from the anonymized dataset that can be used to breach privacy; for example, knowing individuals’ renting patterns might allow linking their identifying values to their identities and tracking them in the anonymized dataset.

  • An active adversary is equipped with knowledge about the individuals and the correlations between their identifying and sensitive values before having access to the anonymized dataset. She/he can use that background knowledge to provoke a privacy breach. In our renting example, knowing the true identity of an individual in plain text (e.g., full name) alongside her/his location patterns might allow linking that identity to her/his identifying value in the stream, thus exposing her/him in the anonymized dataset.

4.1 Privacy Model

Given a stream S and two user-defined constants \(l\ge 2\) and \(k\ge 2\), we say that an anonymization technique safely anonymizes S if it produces a stream \(S^*\) that satisfies the following properties:

Property 1

(Safe release of QI-groups). Provides safe correlation of identifying and sensitive values across the released QI-groups such that the intersection of any QI-groups in \(S^*\) on their identifying attribute \(A^{id}\) yields either k identifying values or none. Formally,

\(\forall v_{id} \in \mathcal {D}(A^{id})\), if \(v_{id} \in \pi _{A^{id}}QI_1 \cap ... \cap \pi _{A^{id}}QI_j\), then there exists a set of identifying values \(V_{id} \subseteq \mathcal {D}(A^{id})\) such that \(V_{id} = \{v_{id}, v_{id_1}, ..., v_{id_{k-1}}\}\) and \(V_{id} = \pi _{A^{id}}QI_1 \cap ... \cap \pi _{A^{id}}QI_j\). In other words,

$$ \pi _{A^{id}}QI_1 \cap ... \cap \pi _{A^{id}}QI_j = \begin{cases} V_{id} & \text {if } \exists \, v_{id} \in \pi _{A^{id}}QI_1 \cap ... \cap \pi _{A^{id}}QI_j \\ \emptyset & \text {otherwise} \end{cases} \quad (1) $$

Less formally, the identifying values that are grouped together in a QI-group must always remain grouped together throughout the entire anonymized stream.
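To make Property 1 operational, the following Java sketch (building on the illustrative types above) checks pairwise that the \(A^{id}\) projections of released QI-groups are either the same set of k identifying values or disjoint; the class and method names are assumptions for exposition.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a verifier for Property 1: the A^id projections of any two
// released QI-groups must either be identical sets of k values or disjoint.
final class SafeReleaseCheck {
    static boolean satisfiesProperty1(List<Set<String>> idProjections, int k) {
        for (int i = 0; i < idProjections.size(); i++) {
            for (int j = i + 1; j < idProjections.size(); j++) {
                Set<String> inter = new HashSet<>(idProjections.get(i));
                inter.retainAll(idProjections.get(j));
                boolean disjoint = inter.isEmpty();
                boolean sameKValues = inter.size() == k
                        && inter.equals(idProjections.get(i))
                        && inter.equals(idProjections.get(j));
                if (!disjoint && !sameKValues) return false;
            }
        }
        return true;
    }
}
```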

Property 2

(l-diverse QI-groups). Ensures that all the anonymized and released QI-groups are l-diverse. Formally,

\(\forall v_{id} \in \mathcal {D}(A^{id}), \forall QI \in S^*, Pr(v_{id}, v_s |QI) \le 1/l\).
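As a concrete reading of Property 2, a QI-group is l-diverse when no sensitive value accounts for more than a 1/l fraction of its tuples, which caps every identifier-to-sensitive association probability at 1/l. A minimal sketch, reusing the Tuple type from Sect. 3:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an l-diversity check: the most frequent sensitive value may
// occur in at most |QI|/l of the group's tuples.
final class DiversityCheck {
    static boolean isLDiverse(List<Tuple> qiGroup, int l) {
        Map<String, Integer> counts = new HashMap<>();
        for (Tuple t : qiGroup) counts.merge(t.sensitive, 1, Integer::sum);
        int max = counts.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        return max * l <= qiGroup.size();   // max/|QI| <= 1/l
    }
}
```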

Property 3

(Safe correlation of identifying values). Prohibits linking correlated identifying values in the same QI-group to their corresponding sensitive values, which would result in an inherent violation of l-diversity [16,17,18]. Formally,

\(\forall v_{id_1}, v_{id_2} \in \pi _{A^{id}}QI_j\), \(f(v_{id_1}, QI_j) = f(v_{id_2}, QI_j)\), where \(f(v_{id_i}, QI_j)\) is a function that returns the number of occurrences of \(v_{id_i}\) in \(QI_j\).

Property 3 hides frequent correlations of identifying values within the same QI-group. It handles cases in which an adversary might otherwise link an individual to her/his sensitive value or narrow down the possibilities for other individuals.
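A minimal sketch of the equal-occurrence condition in Property 3, again reusing the Tuple type; a QI-group passes when all of its identifying values occur equally often:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a Property 3 check: within one QI-group, every identifying
// value must occur the same number of times, so occurrence frequencies
// leak no correlation between identifiers and sensitive values.
final class EqualOccurrenceCheck {
    static boolean hasEqualOccurrences(List<Tuple> qiGroup) {
        Map<String, Integer> freq = new HashMap<>();
        for (Tuple t : qiGroup) freq.merge(t.id, 1, Integer::sum);
        return freq.values().stream().distinct().count() <= 1;
    }
}
```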

4.2 (kl)-Clustering for Privacy Preservation

To enforce our privacy properties, we propose a (k, l)-clustering technique that groups tuples into clusters with disjoint centroids and releases, from these clusters, l-diverse QI-groups containing k distinct identifying values. In brief, our clustering technique works as follows:

  • It creates centroids containing k distinct identifying values: \(\forall QI_i, QI_j\), two QI-groups released from a cluster C, \(\pi _{A^{id}}QI_i = \pi _{A^{id}}QI_j = V_{id}(C)\) where \(|V_{id}(C)| = k\).

  • It ensures that an identifying value exists in one and only one centroid: \(\forall C_1, C_2\), \(V_{id}(C_1) \cap V_{id}(C_2) = \emptyset \).

  • It releases QI-groups from a cluster C such that, for every QI-group QI created from a subset of tuples in C, \(\forall t \in QI\), \(t[A^{id}] \in V_{id}(C)\).

(k, l)-clustering is a bucketization technique that releases l-diverse QI-groups created from a subset of clusters having disjoint centroids. It ensures safe correlation of identifying and sensitive values across the QI-groups: once k identifying values are grouped in a QI-group, they remain grouped together in future releases of QI-groups throughout the anonymized stream. The clustering can be done in two ways, unsupervised and supervised, as defined below.

  • Unsupervised (k, l)-clustering has no prior knowledge of the distribution of identifying values in the original dataset. Clustering is done on a first-come, first-served basis, inspired by bottom-up agglomerative clustering algorithms [26]. Unsupervised (k, l)-clustering creates cluster centroids and groups tuples accordingly, based on their identifying values and the privacy constants k and l.

  • Supervised (k, l)-clustering has a partial or full view of the distribution of identifying values in the original dataset. Thus, unlike unsupervised clustering, clusters are created from a predefined set of centroids \(\mathcal {V}=\{V^1_{id}, ..., V^m_{id}\}\) fed to the clustering technique prior to the anonymization. Hence, identifying and sensitive values that are highly correlated are grouped together in the same cluster, reducing the chances of having these values anonymized/suppressed to meet the privacy properties.

As shown in Fig. 2(c), ‘Allen_U1’ and ‘Cathy_U3’ are grouped together in three QI-groups because they occur most frequently in the incoming stream. In Fig. 2(b), however, ‘Allen_U1’ is grouped with ‘Betty_U2’ and ‘Cathy_U3’ with ‘David_U4’, owing to the order of their tuples in the data stream.

Fig. 2. Applying unsupervised and supervised (k, l)-clustering on a data stream with (k, l) = (2, 2)

Lemma 1

Given a transactional stream S, safe clustering ensures the safe release of QI-groups in the published version \(S^{*}\).

Proof

Since (k, l)-clustering is applied, \(\forall QI_i, QI_j\), two QI-groups released from a cluster C, \(\pi _{A^{id}}QI_i = \pi _{A^{id}}QI_j = V_{id}(C)\) where \(|V_{id}(C)| = k\). Moreover, since (k, l)-clustering ensures that an identifying value exists in one and only one centroid, \(\forall C_1, C_2\), two distinct clusters over \(S^{*}\), \(V_{id}(C_1) \cap V_{id}(C_2) = \emptyset \), which can be written as \(\pi _{A^{id}}QI_1 \cap \pi _{A^{id}}QI_2 = \emptyset \), where \(QI_1, QI_2\) are two QI-groups released respectively from \(C_1\) and \(C_2\). Hence, the intersection of any QI-groups in \(S^{*}\) on the identifying attribute yields either k identifying values or none.

4.3 (kl)-Clustering Algorithm

In this section, we present our (k, l)-clustering algorithm applied to a transactional data stream. The main idea is to process incoming tuples on the fly while guaranteeing the safe release of l-diverse QI-groups. The algorithm requires two privacy constants k and l, the stream S, and a set of centroids \(\mathcal {V}\), and it outputs an anonymized data stream. It is composed of two main steps: safe clustering and tuple assignment.

4.4 Safe Clustering

The function assigns tuples to their corresponding clusters based on their identifying values.

$$ t_p \text { is assigned to } \begin{cases} C_e & \text {if } \exists \, V_{id}(C_e) \in \mathcal {V} \text { where } t_{p}[A^{id}] \in V_{id}(C_e) \\ C_q \text { where } |V_{id}(C_q)| < k & \text {otherwise} \end{cases} $$
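Since the original algorithm listing is not reproduced here, the following Java sketch illustrates the assignment rule above under stated assumptions: a map from each seen identifying value to its unique cluster, and a list of open clusters whose centroids still hold fewer than k values. In the supervised variant, the map would be pre-populated from the predefined centroid set \(\mathcal {V}\) before any tuple arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of the safe-clustering step: route each incoming tuple
// to the cluster whose centroid already contains its identifying value;
// otherwise grow an open cluster whose centroid holds fewer than k values.
final class SafeClustering {
    final Map<String, Cluster> byId = new HashMap<>(); // v_id -> its unique cluster
    final List<Cluster> open = new ArrayList<>();      // clusters with |V_id| < k
    final int k;

    SafeClustering(int k) { this.k = k; }

    Cluster assign(Tuple t) {
        Cluster c = byId.get(t.id);          // case 1: a centroid already owns v_id
        if (c == null) {                     // case 2: attach v_id to an open cluster
            if (open.isEmpty()) open.add(new Cluster());
            c = open.get(0);
            c.centroid.add(t.id);
            byId.put(t.id, c);
            if (c.centroid.size() == k) open.remove(0); // centroid complete
        }
        c.add(t);
        return c;
    }
}
```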

4.5 Tuple Assignment

It assigns a tuple \(t_{p}\) to the selected cluster \(C_{sel}\) as follows: within a given cluster, tuples are distributed over multiple sub-groups, and a sub-group must contain at least k distinct identifying values before its l-diversity is verified.

After processing the entire stream, the algorithm publishes all sub-groups that are neither l-diverse nor of size k (i.e., those stored in the temp structure) by suppressing their identifying values. This guarantees the privacy constraints but impacts the utility of the dataset.
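A hedged sketch of this release logic, reusing the check sketches from Sect. 4.1: tuples accumulate in a pending sub-group and are published as a QI-group once the sub-group covers k distinct identifying values and satisfies Properties 2 and 3; otherwise they remain retained in the temp structure until the stream ends.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of tuple assignment inside one cluster: a pending
// sub-group is released as a QI-group only when it covers all k
// identifying values of the centroid and passes the diversity checks.
final class TupleAssignment {
    final List<Tuple> temp = new ArrayList<>();   // pending sub-group
    final int k, l;

    TupleAssignment(int k, int l) { this.k = k; this.l = l; }

    List<Tuple> offer(Tuple t) {                  // returns a QI-group or null
        temp.add(t);
        Set<String> ids = new HashSet<>();
        for (Tuple u : temp) ids.add(u.id);
        if (ids.size() >= k
                && DiversityCheck.isLDiverse(temp, l)
                && EqualOccurrenceCheck.hasEqualOccurrences(temp)) {
            List<Tuple> qiGroup = new ArrayList<>(temp);
            temp.clear();
            return qiGroup;                       // safe to publish
        }
        return null;                              // retained in the queue
    }
}
```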

5 Experiments

In this section, we evaluate the efficiency of our unsupervised and supervised (k, l)-clustering techniques by conducting a set of experiments detailed hereinafter. The algorithm is implemented in Java and tested on a PC with a 2.20 GHz Intel Core i7 CPU and 8.0 GB of RAM.


To simulate a data stream scenario, we used a rental transaction dataset composed of 109,763 tuples, where each tuple is associated with a timestamp used only for evaluation purposes. We assume that exactly one tuple arrives at each time instant; as a result, timestamps range from 1 to |S|. The dataset contains 2,374 distinct identifying values.

We designed two sets of experiments to examine the effectiveness of our approach in terms of utility:

  • Evaluating the percentage of suppressed identifying values.

  • Evaluating the delay-retention of tuples in the queue before being released in QI-groups.

5.1 Percentage of Suppressed Identifying Values

As previously stated, after processing the stream over a specified interval of time, our algorithm suppresses the identifying values in the QI-groups that are neither l-diverse nor of size k.

Using unsupervised (k, l)-clustering, we vary the value of k from 3 to 8 and examine the percentage of suppressed values, with l set to 3. For high values of k, the percentage of suppressed values increases, reaching almost 60% for k = 8, as shown in Fig. 3. Here, we cluster identifying values based on their order of arrival, and the k individuals clustered together might not have the same distribution over the stream. Therefore, as k increases, it becomes more difficult to form QI-groups, leading to an increase in the amount of suppressed values. Hence, we did not evaluate the unsupervised approach for values of k higher than 8.

Using supervised (k, l)-clustering, we ensure that the most frequent identifying values are clustered and then grouped together in the QI-groups. Consequently, we suppress fewer identifying values and thus obtain better utility, as shown in Fig. 3, where the percentage of suppressed values reaches only 1% for k = 20.

Fig. 3. Percentage of suppressed values for \(l=3\) while varying k for both unsupervised and supervised (k, l)-clustering approaches

5.2 Retention of Tuples

A tuple is retained in the queue if it remains (a) in a sub-group that has not reached size k or (b) in the temporary sub-group of the corresponding cluster.

For each set of {k, l} values, we measure the retention delay of each tuple in memory and then compute the average delay over all tuples. This value is chosen as the delay constraint \(\delta \) defined in [28].
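A minimal sketch of how \(\delta \) can be derived, assuming per-tuple (arrival, release) timestamp pairs are recorded during the run; the class name is an assumption for illustration.

```java
import java.util.List;

// Sketch: delta is the mean retention delay (release time minus arrival
// time) over all tuples; tuples retained longer than delta count as
// delayed or outdated.
final class DelayConstraint {
    static double delta(List<long[]> arrivalAndRelease) { // {arrival, release} pairs
        if (arrivalAndRelease.isEmpty()) return 0;
        double sum = 0;
        for (long[] p : arrivalAndRelease) sum += p[1] - p[0];
        return sum / arrivalAndRelease.size();
    }
}
```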

We consider a tuple that remains in memory longer than the specified delay \(\delta \) a “delayed or outdated tuple”; \(\delta \) varies slightly with k. We applied our algorithm to the same rental dataset as before, adopting both approaches, as shown in Fig. 4. The delay constraint can be chosen depending on the data stream application’s requirements regarding the availability of anonymized tuples, as stated in [28].

Fig. 4. Percentage of published tuples for both approaches before \(\delta \)

6 Conclusion

In this paper, we have defined new privacy properties to address the correlation problem in the anonymization of transactional data streams. A bucketization-based technique, entitled (k, l)-clustering, is proposed to enforce these privacy properties. (k, l)-clustering processes incoming tuples on the fly: it continuously groups k distinct individuals into l-diverse QI-groups and ensures that these individuals remain grouped together in future releases of QI-groups. We evaluated our algorithm in terms of utility by considering two approaches, supervised and unsupervised. We showed, through a set of experiments, that both approaches cope well with the streaming nature of the data while respecting the privacy constraints. The supervised approach yielded better results because it has a partial or full view of the distribution of identifying values in the dataset.