Two privacy-preserving approaches for data publishing with identity reservation

Many approaches have been proposed for publishing useful information while preserving data privacy. Among them, the privacy models of identity-reserved (k, l)-anonymity and identity-reserved (α, β)-anonymity have been proposed to handle the situation where an individual could have multiple records. However, the two models fail to prevent attribute disclosure. To this end, we propose two new privacy models: enhanced identity-reserved l-diversity and enhanced identity-reserved (α, β)-anonymity. Moreover, to implement the two privacy models we design a general anonymization algorithm, called DAnonyIR, which uses a clustering technique and calls different decision functions, and which can decrease the information loss caused by generalization. Further, we compare DAnonyIR under our two privacy models with the existing generalization method GeneIR under identity-reserved (k, l)-anonymity and identity-reserved (α, β)-anonymity, respectively. The experimental results show that our two approaches provide stronger privacy preservation, and that their information loss and relative error ratio of query answering are less than those of GeneIR.


Introduction
Hospitals and other organizations often need to publish microdata (e.g. medical data or census data) for the purposes of scientific research and knowledge-based decision-making [1], for example, disease analysis and prediction. These data are often stored in a table in the form of D(explicit identifier, quasi-identifier, sensitive attributes, other attributes) [2], where the explicit identifier (ID) can clearly identify individuals (e.g. name and social security number); the quasi-identifier (QI) is a set of attributes that can potentially identify an individual, such as zip it satisfies the EIR l-diversity and EIR (α, β)-anonymity are changed to the problems of minimum hitting set and the highest frequency of sensitive values, respectively. (2) We define the distances between two individuals, between an individual and an equivalence class, and between two equivalence classes. (3) We design a general anonymization algorithm DAnonyIR with clustering techniques to make a dataset satisfy different identity-reserved privacy models by calling different decision functions. (4) We conduct extensive experiments to show the vulnerability of IR (k, l)-anonymity and IR (α, β)-anonymity, and to show that our algorithm DAnonyIR under EIR l-diversity and EIR (α, β)-anonymity outperforms the existing algorithm GeneIR [13] under IR (k, l)-anonymity and IR (α, β)-anonymity with respect to information loss and relative error ratio of query answering, respectively. At present, GeneIR is the only algorithm for achieving IR (k, l)-anonymity and IR (α, β)-anonymity. We also show that DAnonyIR can achieve these two privacy models, and compare them with our enhanced approaches.
The rest of this paper is organized as follows. Section 2 recaps the basic concepts and notations we will use in this paper. Section 3 introduces the concepts of the reasoning set of an equivalence class and its reasoning space, and proposes two privacy models for identity reservation. Section 4 defines some distance concepts about individuals and equivalence classes and designs a general anonymization algorithm based on clustering techniques for different privacy models with identity reservation. Section 5 analyses our methods experimentally. Section 6 discusses the related work. Finally, Sect. 7 concludes this paper and points out directions for further study.

Preliminaries
This section recaps some fundamental privacy models, generalization operations, and information metrics, which are necessary for developing our work.

Privacy models with identity reservation
Consider an original data table D = {ID, A_1, ..., A_d, A_s} in which there are no duplicate records, where A_1, ..., A_d are QI attributes. For the convenience of reference, Table 1 summarizes the meanings of symbols used in the paper.
In privacy-preserving data publishing, for every privacy model π, there is a corresponding anonymization approach that transforms the original data table into an anonymous table satisfying π. The privacy models k-anonymity, distinct l-diversity, and (α, k)-anonymity all assume that an individual has only one record. For these privacy models, the anonymous tables are in the form of D*(QI, A_s), where QI is an anonymous version of the original QI obtained by applying the anonymization approach to QI in the original table D. For the problem of identity-reserved anonymity, we need to keep the information about which records belong to the same individual, so the explicit identifier should not be directly deleted and we use numbers to identify different individuals. The published anonymous table of D is in the form of D*(Id_num, QI, A_s), where the values of Id_num are numbers, which denote

The first one is the equivalence class [3], which is the elementary unit of an anonymous table. Formally, we have:

The second is the definitions of k-anonymity [3], distinct l-diversity [10], and (α, k)-anonymity [11], which are suitable for the situation where an individual has only one record. Distinct l-diversity and (α, k)-anonymity can prevent identity disclosure and attribute disclosure, whereas k-anonymity can only prevent identity disclosure [2].

Definition 2.2 (k-anonymity, distinct l-diversity, (α, k)-anonymity) Given an original data table D, the published anonymous table D
The following are the definitions of IR k-anonymity, IR (k, l)-anonymity, and IR (α, β)-anonymity [13], which are obtained by extending k-anonymity, distinct l-diversity, and (α, k)-anonymity, respectively, to the scenario where an individual could have multiple records. Obviously, |R(s_j)|/|Q| ≤ β, the condition for IR (α, β)-anonymity, is equivalent to |P(s_j)|/|Q| ≤ β because s_j appears in any individual at most once.

Data generalization
In order to satisfy the requirement of a given privacy model π, the values of the quasi-identifier in the original table usually need to be generalized. The idea is to replace a specific value by a more general value. Although the generalization operation reduces the data quality of the original table, it can still retain its semantic information to some extent. Therefore, the method attracts increasing attention in the field of privacy-preserving data publishing. An anonymization algorithm first needs to satisfy the given privacy model, and then tries to preserve data quality (when an equivalence class satisfies the given privacy model, it is unnecessary to generalize its attributes' values to higher levels) and to spend as little time as possible. In general, the stronger the privacy preservation of a privacy model is, the worse its data quality is and the more runtime it needs. Many existing approaches generalize the values of attributes in the quasi-identifier according to predefined taxonomy trees [10-23], which are given by domain experts. For example, Fig. 1 shows the taxonomy trees for the categorical attribute Postcode and the numeric attribute Age. Wang et al. [24] pointed out that the taxonomy tree restricts the choice of data generalization and causes some unnecessary information loss. If the values of Postcode in two records are 10076 and 10085, respectively, we can generalize them to {10076, 10085} in order to put the two records into one equivalence class. However, they are generalized to 100* according to the taxonomy tree. If the values of Age in two records are 34 and 35, respectively, we can generalize them to [34,35] to form an equivalence class. However, they are generalized to [30,39] according to the taxonomy tree. Thus, Wang et al. [24] supplied a different scheme of data generalization, covering both numeric and categorical attributes.
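As a minimal sketch of the taxonomy-free generalization just described (not the implementation of Wang et al. [24]), the following Python functions merge two numeric values into the tightest covering interval and two categorical values into their set union. The function names are hypothetical and chosen only for illustration.

```python
def generalize_numeric(a, b):
    """Tightest interval covering two values or intervals, each given as (low, high)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def generalize_categorical(x, y):
    """Smallest set covering two categorical values, each given as a set."""
    return frozenset(x) | frozenset(y)


# The two examples from the text: Age 34 and 35, Postcode 10076 and 10085.
age = generalize_numeric((34, 34), (35, 35))
postcode = generalize_categorical({"10076"}, {"10085"})
print(age)               # (34, 35)
print(sorted(postcode))  # ['10076', '10085']
```

With the taxonomy trees of Fig. 1, the same two pairs would instead be widened to [30,39] and 100*, which is the unnecessary loss the text refers to.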

Information metrics
Data generalization causes information loss, so various metrics have been proposed for calculating how much information is lost. For generalization with a taxonomy tree, normalized certainty penalty (NCP) [21,25] and generalized loss metric (GLM) [26,27] are the two main metrics. They are the same for numerical attributes but different for categorical ones. That is, for a numerical attribute A and an interval I = [l, u] from the domain [L, U] of A, used to generalize A's value, the information loss associated with I is defined as follows:

$$\frac{u - l}{U - L}$$

For a categorical attribute A, where T is its taxonomy tree and a node p in T is used to generalize A's value, the information loss associated with p is defined as follows (for NCP and GLM, respectively):

$$\frac{|u_p|}{|u|}, \qquad \frac{|u_p| - 1}{|u| - 1}$$

where u_p is the set of leaf nodes of the subtree rooted at p in T and u is the set of all the leaf nodes in T. For example, a record has values 32 and 10073 for the Age and Postcode attributes, respectively, whose taxonomy trees are shown in Fig. 1. Assume that 32 and 10073 are generalized to [30,34] and 1007*. The information loss for Age by using NCP and GLM is the same: (34 − 30)/(39 − 30) = 4/9. Note that [30,34] actually contains 30, 31, 32, 33, and 34, i.e. five different numbers, while 34 − 30 = 4. Similarly, [30,39] actually contains ten different numbers, while 39 − 30 = 9. The information loss for the categorical attribute Postcode is 4/7 by using NCP, while it is (4 − 1)/(7 − 1) = 1/2 by using GLM. Here 1007* is equal to {10070, 10073, 10076, 10077}, which contains four different values, and the set of all the leaf nodes in the taxonomy tree of Postcode is {10070, 10073, 10076, 10077, 10085, 10086, 10087}, which contains seven different values. We can see that the definitions of GLM are consistent between numeric and categorical attributes, while those of NCP are not. So GLM is more suitable.
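The arithmetic of this example can be checked with a short Python sketch. The function names are hypothetical, and the formulas are the ones implied by the worked numbers in the text: (u − l)/(U − L) for numeric attributes, |u_p|/|u| for NCP and (|u_p| − 1)/(|u| − 1) for GLM on categorical attributes.

```python
from fractions import Fraction


def loss_numeric(l, u, L, U):
    """NCP and GLM coincide for an interval [l, u] taken from the domain [L, U]."""
    return Fraction(u - l, U - L)


def ncp_categorical(n_covered_leaves, n_all_leaves):
    """NCP for a taxonomy node covering n_covered_leaves of n_all_leaves leaves."""
    return Fraction(n_covered_leaves, n_all_leaves)


def glm_categorical(n_covered_leaves, n_all_leaves):
    """GLM for the same node; a leaf (one covered value) gives loss 0."""
    return Fraction(n_covered_leaves - 1, n_all_leaves - 1)


age_loss = loss_numeric(30, 34, 30, 39)   # Age 32 generalized to [30,34]
ncp_postcode = ncp_categorical(4, 7)      # 1007* covers 4 of 7 leaves
glm_postcode = glm_categorical(4, 7)
print(age_loss, ncp_postcode, glm_postcode)  # 4/9 4/7 1/2
```

Exact fractions are used only so that the printed values match the text; floating-point arithmetic would serve equally well in practice.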
NCP and GLM consider only the information loss between an original value and a generalized value of an attribute. When we create an equivalence class by repeatedly adding the records of individuals until it satisfies the given privacy model, it is necessary to compute the information loss from the current equivalence class to the later equivalence class obtained by adding the records of an individual. As a result, we need to consider the information loss caused by moving from one generalized value to another, so a definition of the information loss between a generalized value (or original value) and another generalized value needs to be given.
For the generalization method without predefined taxonomy trees, Wang et al. [24] presented information metrics for categorical and numerical attributes, as given by Eqs. (6) and (7), respectively. However, the results of Eqs. (6) and (7) are not normalized, i.e. all results except 0 are greater than 1. Also, it is unreasonable to express the information loss as a ratio (the number related to the later generalized value divided by the number related to the original value or the earlier generalized value), because the denominator varies and so the standard is not uniform. In this paper, based on GLM, we improve the information metrics for numeric and categorical attributes, as described in Sect. 4.1.
For a record r, the value on a numeric attribute A is r[A] = [a, b], which is the original value or a generalized value, and the (later) generalized value is r*[A]. Then the information loss of r on attribute A is given by

For a record r, the value on a categorical attribute A is the set r[A], which is the original value or a generalized value, and the (later) generalized value is the set r*[A]. Then the information loss of r on attribute A is

Enhanced privacy models with identity reservation
In this section, we first give an example to show that although IR (k, l)-anonymity and IR (α, β)-anonymity can prevent identity disclosure, they fail to prevent attribute disclosure. We then propose two privacy models, EIR l-diversity and EIR (α, β)-anonymity, which prevent not only identity disclosure but also attribute disclosure.

(3,3)-anonymity or IR (0.4, 0.6)-anonymity, obtained by the anonymization approach GeneIR [13] according to predefined taxonomy trees. If an attacker knows Mike's QI information (i.e. {M, 36, 10085}), obtained from some public information (e.g. a voter list), and knows that Mike is in the published table, then the attacker can infer that Mike is in equivalence class Q_2 (see Table 3). Because Q_2 contains three different individuals, the attacker cannot know which one corresponds to Mike, and thus identity disclosure is prevented. However, the attacker knows that Mike has hypertension because every individual in Q_2 has hypertension. Obviously, the privacy leakage is 100% in this case, so attribute disclosure happens. Why does this attribute disclosure happen? It is because the restrictions on l and β in IR (k, l)-anonymity and IR (α, β)-anonymity do not take into account that some records in an equivalence class belong to the same individual, so these models cannot prevent attribute disclosure. To address this issue, in the following we give two enhanced privacy models, called EIR l-diversity and EIR (α, β)-anonymity. They require that, for each equivalence class, any set of records from different individuals satisfies l-diversity, and that the percentage of any sensitive value in any set of records from different individuals is not more than β, respectively.

Definition 3.1 (Single-record/multi-record individual) Given an original data table D, for ∀p_i ∈ P, if p_i has one record, then p_i is called a single-record individual; if p_i has several records, then p_i is called a multi-record individual and these records are called related records.
For example, in Table 2, r_1 and r_2 belong to the same individual, so they are related. From the example discussed at the beginning of this section, we know that the attacker cannot tell which individual in Q_2 corresponds to Mike. However, the attacker can infer that Mike has a disease with a certain probability. If the probability is very high, Mike's privacy is leaked. The probability of a disease is the maximum percentage of the disease in any set of records from different individuals. The attacker reasons over the four sets consisting of records from different individuals in Q_2, i.e. {r_1, r_4, r_5}, {r_1, r_4, r_6}, {r_2, r_4, r_5}, and {r_2, r_4, r_6}. Related records are never in the same set, and we need to prevent privacy leakage in every such set.

Definition 3.2 (Reasoning set) For an equivalence class Q, let P_Q = {p_1, p_2, ..., p_{n_Q}} be the set of individuals. A reasoning set of Q is a set of records which map to different individuals from p_1 to p_{n_Q}.
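All reasoning sets of Q constitute the reasoning space. The size of the reasoning space is determined by the multi-record individuals, i.e. it is the product, over the individuals in P_Q, of the number of records that each individual has in Q. For Q_2 above, this gives 2 × 1 × 2 = 4 reasoning sets.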
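The reasoning space of Q_2 can be enumerated with a few lines of Python. This is an illustrative sketch: the grouping of records by individual is taken from the four reasoning sets listed above, and the variable names are hypothetical.

```python
from itertools import product
from math import prod

# Records of Q_2 grouped by individual: r1 and r2 are related, as are r5 and r6.
records_by_individual = [["r1", "r2"], ["r4"], ["r5", "r6"]]

# A reasoning set takes exactly one record from every individual.
reasoning_space = [set(choice) for choice in product(*records_by_individual)]

# The size is the product of the record counts of the individuals.
size = prod(len(records) for records in records_by_individual)
print(size)  # 4
```

Single-record individuals contribute a factor of 1, so only multi-record individuals enlarge the reasoning space.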

Anonymization
In this section, first we will discuss how to calculate the information loss for numeric and categorical attributes, a record, and a data table, then define various concepts of distances, and finally propose an anonymization algorithm and analyse its correctness and time and space complexity.

Information metric used in our approach
We extend GLM to the definition of information metrics between a generalized value (or original value) and another generalized value for numeric and categorical attributes.
When r[A] is the original value, we have b − a = 0, which is consistent with GLM/NCP for a numeric attribute. When [a, b] is a generalized value, Loss(r

Definition 4.2 (Information metric for a categorical attribute) Let the value domain of a categorical attribute A be the set X, and let the value of a record r on A be r[A], which is the original value or a generalized value, and its (later) generalized value on A be r*[A]. Then the information loss of r on categorical attribute

In our anonymization approach, we need to suppress the records of an individual p_i in two cases: when no equivalence class satisfies the given privacy requirement after p_i is added to it, or when adding p_i to the equivalence class whose distance to p_i is minimum would seriously decrease the data quality of that equivalence class. Suppressing a value of an attribute means replacing it by a special mark (i.e. *, a special generalization), indicating that the replaced value is not disclosed; its information loss is 1.
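Eqs. (11) and (12) are not reproduced in this text, so the following Python sketch assumes one natural GLM-style form for them: the growth of the interval width divided by the domain width for a numeric attribute, and the growth of the value set divided by |X| − 1 for a categorical attribute. This assumed form reduces to GLM when the starting value is an original value and gives loss 1 for suppression, as the text requires; the function names are hypothetical.

```python
def loss_numeric(old, new, domain):
    """Assumed form of Eq. (11): ((b* - a*) - (b - a)) / (U - L)."""
    (a, b), (a_new, b_new), (low, high) = old, new, domain
    return ((b_new - a_new) - (b - a)) / (high - low)


def loss_categorical(old, new, domain_size):
    """Assumed form of Eq. (12): (|r*[A]| - |r[A]|) / (|X| - 1)."""
    return (len(new) - len(old)) / (domain_size - 1)


# Original value 32 generalized to [30,34] within the Age domain [30,39].
age_loss = loss_numeric((32, 32), (30, 34), (30, 39))
# Original postcode 10073 generalized to a set of four of the seven postcodes.
postcode_loss = loss_categorical({"10073"}, {"10070", "10073", "10076", "10077"}, 7)
# Suppression corresponds to generalizing to the whole domain, giving loss 1.
suppressed_loss = loss_numeric((32, 32), (30, 39), (30, 39))
print(postcode_loss, suppressed_loss)  # 0.5 1.0
```

Because the loss is a difference rather than a ratio, the loss from an original value to a later generalized value equals the sum of the losses of the intermediate generalization steps.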

Definition 4.3 (Information loss for a record)
The information loss of a record r generalized to r* is

where Loss(r) is calculated by Eq. (11) if A_i is a numeric attribute; otherwise, it is calculated by Eq. (12).

Definition 4.4 (Information loss for a data table)
Given an original data table D, the information loss of D generalized to its published anonymous table

The normalized information loss of D generalized to D* is

In the worst case, we suppress all records.

Definition of distances
Some approaches consider the problem of data anonymization as a clustering problem satisfying k-anonymity or l-diversity [18,19,24]. They gave definitions of the distance between two records, the distance between a record and an equivalence class, and the distance between two equivalence classes. In this subsection, we also give some definitions of distances from the information loss perspective, including the distance between two individuals, the distance between an individual and an equivalence class, and the distance between two equivalence classes. Different from the concepts of distances in [18,19,24], we consider that an individual could have multiple records, and use Eqs. (11) and (12) to calculate the information loss for numeric attributes and categorical attributes, respectively.

Definition 4.5 (Distance between two individuals)
Given the individuals p_1 and p_2, let r_{p_1} and r_{p_2} be the original identity elements of p_1 and p_2, respectively. The information loss caused by generalizing the records in R(p_1) and R(p_2) to r_{p_1 p_2} is called the distance between p_1 and p_2, defined as follows: where r_{p_1 p_2} = δ(r_{p_1}, r_{p_2}) is the identity element of the equivalence class Q_{p_1 p_2}, which is formed by generalizing the records in R(p_1) and R(p_2).
To make R(p_1) ∪ R(p_2) become an equivalence class, we need to generalize these records to r_{p_1 p_2}. In Eq. (16), the first product term denotes the information loss caused by generalizing the records in R(p_1) to r_{p_1 p_2} and the second is the information loss caused by generalizing the records in R(p_2) to r_{p_1 p_2}.
For example, in Table 2, r_5 and r_6 both belong to individual 4, whose original identity element is {F, 33, 10087, null}, and r_8 and r_9 both belong to individual 6, whose original identity element is {F, 34, 10070, null}. If we put the records of individuals 4 and 6 into an equivalence class, the identity element of the equivalence class is {F, [33,34], {10070, 10087}, null}, i.e. the records of individuals 4 and 6 are all generalized to {F, [33,34], {10070, 10087}, null}. Then the information loss caused by the generalization is the distance between individuals 4 and 6.
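The distance between individuals 4 and 6 can be sketched in Python under stated assumptions, since Eqs. (11), (12), and (16) are not reproduced in this text. The sketch assumes that the record loss is the sum of per-attribute losses, that each "product term" is the number of records of an individual times the loss of one record, that the Age domain is [30,39], that Postcode has the seven values of the taxonomy tree, and that Sex has two values. All function names are hypothetical.

```python
def loss_numeric(old, new, width):
    """Assumed Eq. (11): growth of the interval width over the domain width."""
    return ((new[1] - new[0]) - (old[1] - old[0])) / width


def loss_categorical(old, new, domain_size):
    """Assumed Eq. (12): growth of the value set over (domain size - 1)."""
    return (len(new) - len(old)) / (domain_size - 1)


def merge(x, y):
    """Identity element covering two (sex, age, postcode) identity elements."""
    age = (min(x[1][0], y[1][0]), max(x[1][1], y[1][1]))
    return (x[0] | y[0], age, x[2] | y[2])


def record_loss(old, new):
    """Assumed record loss: sum of the losses on Sex, Age, and Postcode."""
    return (loss_categorical(old[0], new[0], 2)
            + loss_numeric(old[1], new[1], 9)
            + loss_categorical(old[2], new[2], 7))


def distance(elem1, n1, elem2, n2):
    """Assumed Eq. (16): record count times record loss, summed over both sides."""
    merged = merge(elem1, elem2)
    return n1 * record_loss(elem1, merged) + n2 * record_loss(elem2, merged)


ind4 = (frozenset({"F"}), (33, 33), frozenset({"10087"}))  # records r5, r6
ind6 = (frozenset({"F"}), (34, 34), frozenset({"10070"}))  # records r8, r9
d = distance(ind4, 2, ind6, 2)
print(round(d, 3))  # 1.111
```

Under these assumptions each of the four records loses 1/9 on Age and 1/6 on Postcode, and the total agrees with the distance 1.111 that the worked example later reports between individual 4 and the class containing only individual 6.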

Definition 4.6 (Distance between individual and equivalence class)
Let r_q be the identity element of equivalence class Q. If individual p ∉ Q.Id_num, then the information loss caused by generalizing the records in R(p) and Q to r_{pq} is the distance between p and Q, given by: where r_p is the identity element of p and r_{pq} = δ(r_p, r_q) is the identity element of the equivalence class Q_{pq}, which is formed by generalizing the records in R(p) and Q.
To make R(p) ∪ Q become an equivalence class, we need to generalize these records to r_{pq}. In Eq. (17), the first product term denotes the information loss caused by generalizing the records in R(p) to r_{pq} and the second is the information loss caused by generalizing the records in Q to r_{pq}.

Definition 4.7 (Distance between two equivalence classes) Let r_{q_1} and r_{q_2} be the identity elements of equivalence classes Q_1 and Q_2, respectively. The information loss caused by generalizing the records in Q_1 and Q_2 to r_{q_{12}} is the distance between Q_1 and Q_2, given by: where r_{q_{12}} is the identity element of the equivalence class Q_{1,2}, which is formed by generalizing the records in Q_1 and Q_2.
To make Q_1 ∪ Q_2 become an equivalence class, we need to generalize these records to r_{q_{12}}. In Eq. (18), the first product term denotes the information loss caused by generalizing the records in Q_1 to r_{q_{12}} and the second is the information loss caused by generalizing the records in Q_2 to r_{q_{12}}.

Algorithm
Wang et al. [24] presented a clustering algorithm for data anonymization achieving l-diversity. In this subsection, by improving their method, we propose the heuristic greedy clustering algorithm DAnonyIR, as shown in Algorithm 1. It generalizes the original data table to an anonymous table which satisfies the given privacy requirement for identity reservation. To explain our algorithm, first we need the following concept:

DAnonyIR The whole clustering algorithm DAnonyIR is shown in Algorithm 1. Its input is the original data table, the QI attributes, and some parameters of the privacy requirement π. The output is an anonymous table. The basic idea of the algorithm is as follows: while D ≠ ∅, we try to create an equivalence class Q from D; if Q satisfies π, we add it to D*; if D = ∅ and Q still does not satisfy π, the individuals in Q are residual and we use function Handle to deal with them.

Firstly, on line 1, we preprocess the original data, that is, we recode the explicit identifier of D with numbers. As the information about which records belong to the same individual needs to be kept, the explicit identifier of each individual is replaced with a distinct number. On lines 2-19, we repeatedly create equivalence classes until D = ∅. The process of creating an equivalence class is shown on lines 3-15. First, we randomly select an individual p from D and initialize the equivalence class Q with the records of p, i.e. Q = R(p), where r_q is the identity element of Q, r_q[QI] are the values of p on the QI attributes, and r_q[A_s] = null. Now Q contains only one individual and does not satisfy π, so we set SatFlag = False, where the variable SatFlag denotes whether Q satisfies π. While Q does not satisfy π and D ≠ ∅, we repeatedly perform lines 7-14. On lines 7 and 8, we get the individual p′ from D and the equivalence class Q′ from D* whose distances to Q are minimum, respectively. Because the current Q does not satisfy π, we need to add more individuals. We can add p′ to Q, or combine Q with Q′. In order to reduce the information loss, we select the way that causes less information loss: if adding p′ causes less information loss, we add the records of p′ to Q; otherwise, we merge Q with Q′ and update the identity element of Q. On line 14, we call function SatPriIR to judge whether Q satisfies π. On lines 16 and 18, if SatFlag = True, i.e. Q satisfies π, we add Q to D*.

On lines 20-26, after lines 2-19 have been executed and D = ∅, if the last equivalence class Q does not satisfy π, then for every individual in Q we call function Handle to decide whether to add its records to an equivalence class or to suppress them, because putting these individuals directly into D* would lead to privacy leakage. After that, we obtain the set D* of equivalence classes which are not yet generalized. Then, for every Q_i ∈ D* and every record in Q_i, we substitute its values on the QI attributes with those of the identity element of Q_i, so the anonymous table satisfying π is obtained.
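The control flow described above can be sketched in Python as follows. This is a simplified illustration, not Algorithm 1 itself: identity elements, generalization, and Handle are omitted, and `dist` and `satisfies` are caller-supplied stand-ins for the distance definitions and for SatPriIR. The toy data (one age per individual, a size-based privacy test) are hypothetical.

```python
import random


def danonyir_sketch(individuals, dist, satisfies, seed=0):
    """Greedy clustering loop; returns (published classes, residual individuals)."""
    rng = random.Random(seed)
    remaining = set(individuals)   # plays the role of D
    published = []                 # plays the role of D*
    while remaining:
        p = rng.choice(sorted(remaining))
        remaining.remove(p)
        q = {p}
        while not satisfies(q) and remaining:
            p_near = min(sorted(remaining), key=lambda x: dist({x}, q))
            q_near = min(published, key=lambda c: dist(c, q), default=None)
            if q_near is None or dist({p_near}, q) <= dist(q_near, q):
                remaining.remove(p_near)      # add the nearest individual
                q.add(p_near)
            else:
                published.remove(q_near)      # merge with the nearest class
                q |= q_near
        if satisfies(q):
            published.append(q)
        else:
            return published, q               # residual: to be passed to Handle
    return published, set()


# Hypothetical toy data: one age per individual; a class "satisfies" with >= 3 members.
ages = {1: 30, 2: 31, 3: 32, 4: 36, 5: 37, 6: 38, 7: 39}


def toy_dist(a, b):
    members = a | b
    spread = max(ages[i] for i in members) - min(ages[i] for i in members)
    return len(members) * spread


classes, residual = danonyir_sketch(ages, toy_dist, lambda q: len(q) >= 3)
```

Whatever the random choices, every individual ends up in exactly one published class or in the residual set, and every published class satisfies the supplied predicate.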
SatPriIR For function SatPriIR, the procedure differs from one privacy model with identity reservation to another. For IR (k, l)-anonymity, IR (α, β)-anonymity, EIR l-diversity, and EIR (α, β)-anonymity, it is substituted with SatPriIR_kl, SatPriIR_αβ, SatPriIR_El, and SatPriIR_Eαβ, respectively. The privacy models EIR l-diversity and EIR (α, β)-anonymity are proposed in this paper, so we describe the functions SatPriIR_El and SatPriIR_Eαβ in detail, as shown in Algorithms 2 and 4, respectively. For SatPriIR_kl, we scan Q once to obtain the number n_Q of individuals and the number m_Q of sensitive values appearing in Q.

SatPriIR_El
In Algorithm 2, on line 1, we get the collection of subsets Ψ, in which S(p_i) is the set of sensitive values of the individual p_i. On line 2, we set a variable ξ to store the single-element sets in Ψ. From lines 3 to 8, we find all single-element sets, delete them from Ψ and add them to ξ. On lines 9-16, if ξ = ∅, we directly call function BHS [28] to get all minimal hitting sets of Ψ, and find a minimum hitting set h. Function BHS is described in Algorithm 3. According to Theorem 3.1, if |h| ≥ l, then Q satisfies EIR l-diversity and we return True; otherwise, we return False. On lines 17 and 18, if ξ ≠ ∅ and |ξ| ≥ l, then the cardinality of any minimum hitting set is not less than l, because ξ is contained in every minimal hitting set and hence in every minimum hitting set, so we return True. From lines 19 to 31, we consider the remaining case, 0 < |ξ| < l. On lines 20 to 24, we use ξ to further simplify Ψ according to the definition of a hitting set. Ψ is divided into two parts: one contains the sets whose intersection with ξ is not ∅ and the other contains the sets whose intersection with ξ is ∅. ξ is the only minimum hitting set of the first part. We need to find a minimum hitting set h of the second part; h ∪ ξ is then a minimum hitting set of the whole Ψ.

BHS We use an example to explain Algorithm 3. Let

On line 1, we transform Ψ to

where xy (or x · y) and x + y denote the AND and OR results of x and y, respectively. For any hitting set of Ψ, e.g. {x_1, x_4, x_5}, we have Π · x_1 x_4 x_5 = 0. On lines 2 and 3, when Π contains only one conjunctive item, say Π = x_1 x_3, we return x_1 + x_3, i.e. {x_1} and {x_3} are minimal hitting sets, because Π · x_1 = 0 and Π · x_3 = 0. In this example, Π has 8 conjunctive items, so the algorithm executes line 5. We simplify Π with the absorption law A + AB = A and obtain

On lines 6 and 7, if every conjunctive item in Π contains literal s, then s is the only minimal hitting set, because Π · s = 0. In this example, x_4 is a single-literal item, and line 9 is executed. We have sig = x_4, which is contained in all hitting sets. We delete x_4 from Π, and select the literal x_1, whose frequency in Π is the highest, to accelerate convergence. We have

The first part consists of the minimal hitting sets containing x_1 and the second part consists of the ones which do not contain x_1. For BHS(x_3 x_5 + x_5 x_7), because both conjunctive items contain x_5, we return x_5.
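The link between minimal hitting sets and EIR l-diversity can be illustrated with a brute-force Python sketch. It is not the Boolean algorithm BHS of [28]: it simply enumerates candidate sets by increasing size, which is exponential in the number of sensitive values and only suitable for small examples. The function names are hypothetical, and the example Ψ is an assumed reconstruction chosen so that its minimal hitting sets match those reported in the worked example ({Leukaemia, Syphilis} and {Heart, Syphilis}).

```python
from itertools import combinations


def minimal_hitting_sets(psi):
    """All minimal hitting sets of a collection of sets, found by brute force."""
    universe = sorted(set().union(*psi))
    found = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            c = set(candidate)
            hits_all = all(c & s for s in psi)
            # Smaller hitting sets were found earlier, so any non-minimal
            # candidate contains one of them and is skipped.
            if hits_all and not any(h <= c for h in found):
                found.append(c)
    return found


def satisfies_eir_l_diversity(psi, l):
    """Check |h| >= l for a minimum hitting set h, the criterion quoted from Theorem 3.1."""
    hitting_sets = minimal_hitting_sets(psi)
    return min(len(h) for h in hitting_sets) >= l


# Assumed sensitive-value sets S(p_i) of the two individuals in the class.
psi = [{"Leukaemia", "Heart"}, {"Syphilis"}]
hs = minimal_hitting_sets(psi)
print([sorted(h) for h in hs])  # [['Heart', 'Syphilis'], ['Leukaemia', 'Syphilis']]
print(satisfies_eir_l_diversity(psi, 3))  # False
```

A reasoning set picks one sensitive value per individual, so the smallest number of distinct sensitive values that any reasoning set can contain equals the size of a minimum hitting set of Ψ; this is why a class whose individuals all share one value has a minimum hitting set of size 1.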

B H S(x
So the final return result is

In fact, when the records of an individual are added to Q, we do not call SatPriIR_kl, SatPriIR_El, SatPriIR_αβ, or SatPriIR_Eαβ to check whether the current Q satisfies the privacy requirement π. Because individuals are added to Q one by one (when we combine an equivalence class Q′ with Q, we may regard the individuals of Q′ as being added to Q one by one), we can use incremental methods to check whether Q satisfies π, denoted by SatPriIRInc_kl, SatPriIRInc_El, SatPriIRInc_αβ, and SatPriIRInc_Eαβ, respectively. SatPriIRInc_El and SatPriIRInc_Eαβ are shown in Algorithms 5 and 6, respectively. For SatPriIRInc_kl, when an individual p is added to Q, we only need to update the number n_Q of individuals and the number m_Q of sensitive values appearing in Q according to p. For SatPriIRInc_αβ, the process of updating MaxRecNum and MaxSenNum is the same as in SatPriIRInc_Eαβ.

Algorithm 5 SatPriIRInc_El(Q, H Q , p, l)
Input: the set of records Q; all minimal hitting sets

SatPriIRInc_El
The idea of SatPriIRInc_El is introduced by [28]. For the set of records Q,

SatPriIRInc_Eαβ When an individual p is added to Q, Algorithm 6 is used to check whether Q now satisfies EIR (α, β)-anonymity. We only need to check whether |R(p_i)| is greater than MaxRecNum and whether the numbers of occurrences of s_1, s_2, ..., s_r in Q are now greater than MaxSenNum, in order to decide whether to update MaxRecNum and MaxSenNum.
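The incremental idea can be illustrated with a small Python sketch. It is a simplified stand-in rather than Algorithm 6: it tracks only the β-condition as described informally in Sect. 3 (the percentage of any sensitive value in any set of records from different individuals must not exceed β), and it ignores the α-condition, whose definition is not reproduced in this text. The class and attribute names are hypothetical, as are the sensitive values other than Hypertension.

```python
from collections import Counter


class BetaTracker:
    """Keeps the counters needed to re-check the beta-condition in O(r) per individual."""

    def __init__(self, beta):
        self.beta = beta
        self.n_individuals = 0
        self.holders = Counter()   # sensitive value -> number of individuals holding it
        self.max_holders = 0       # plays a role similar to MaxSenNum

    def add(self, sensitive_values):
        """Add one individual and return whether the beta-condition holds afterwards."""
        self.n_individuals += 1
        for s in set(sensitive_values):
            self.holders[s] += 1
            self.max_holders = max(self.max_holders, self.holders[s])
        # Worst case over reasoning sets: every holder of s contributes s.
        return self.max_holders <= self.beta * self.n_individuals


# Every individual has hypertension, as in the attribute-disclosure example.
tracker = BetaTracker(beta=0.6)
results = [tracker.add(values) for values in
           [{"Hypertension", "Diabetes"}, {"Hypertension"}, {"Hypertension", "Heart"}]]
print(results)  # [False, False, False]
```

Only the counters touched by the newly added individual are updated, which is what makes the check incremental.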
Handle Function Handle is shown in Algorithm 7. For every residual individual p′ in D, we need to decide whether to add its records to an equivalence class or to suppress them. On lines 1 and 2, we set two variables MinDis and Min. The initial value of MinDis is a large value. From lines 3 to 10, we find the equivalence class Q_min that still satisfies the privacy requirement π after merging p′ and whose distance to p′ is minimum (the check of π is skipped for IR (k, l)-anonymity and EIR l-diversity, because an equivalence class that satisfies either of these two privacy models still satisfies it after the records of an individual are added). If the distance MinDis is less than the information loss caused by suppressing R(p′), we merge p′ into Q_min; otherwise, we suppress R(p′).
When D = ∅ and the equivalence class Q still does not satisfy the privacy requirement π, we call function Handle to deal with the individuals in Q. In Algorithm 7, for every residual individual p′ we find Q_min. If p′ is added to Q_min, we need to generalize the records of p′ and Q_min to r_{p′q_min}, which is the identity element of the equivalence class Q_{p′q_min} formed by generalizing the records in p′ and Q_min; the identity element is then updated as r_{q_min} = δ(r_{q_min}, r_{p′}), and MinDis is the information loss caused by this generalization. Should we generalize or suppress p′? We select the way that causes less information loss. That is, suppression is selected only if the information loss it causes is less than that caused by generalization.
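The merge-or-suppress decision can be sketched in Python as follows. This is an illustration under assumptions, not Algorithm 7: candidate classes are passed in with precomputed distances and privacy flags, and the suppression loss is assumed to be the number of records of p′ times the number of QI attributes, since each suppressed value has loss 1. The function name, the class names, and the numbers in the usage lines are hypothetical.

```python
def handle_sketch(n_records, n_qi_attributes, candidates):
    """Decide what to do with one residual individual.

    candidates: list of (class_name, distance_to_individual, still_satisfies_pi).
    Returns ("merge", class_name) or ("suppress", None).
    """
    # Assumption: suppressing a record costs 1 per QI attribute.
    suppression_loss = n_records * n_qi_attributes
    feasible = [(d, name) for name, d, ok in candidates if ok]
    if feasible:
        min_dis, q_min = min(feasible)
        if min_dis < suppression_loss:
            return ("merge", q_min)
    return ("suppress", None)


# Two records, three QI attributes: suppression would cost 6.
print(handle_sketch(2, 3, [("Q1", 7.5, True), ("Q2", 4.2, True)]))  # ('merge', 'Q2')
# One record: suppression costs 3, which is cheaper than merging at distance 3.5.
print(handle_sketch(1, 3, [("Q1", 3.5, True)]))                     # ('suppress', None)
```

Classes that would violate π after the merge are filtered out before the minimum is taken, mirroring the feasibility check described for lines 3 to 10.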
We try to create an equivalence class Q from D. We randomly select an individual p from D; assume that individual 6 is selected. Then D = D − R(6) and Q = R(6). The identity element r_q of Q is {F, 34, 10070, null}. Because Q contains only individual 6 and does not satisfy EIR 3-diversity, we set SatFlag = False. Since !SatFlag = True and D ≠ ∅, we continue adding individuals to Q. In our example, the individual whose distance to Q is minimum is the one with Id_num = 7. The distance between individual 7 and Q is computed as follows: r_{7q} is the identity element of Q ∪ R(7), which is {F, [33,34], {10070, 10073}, null}. Thus where r_7 is the identity element of individual 7, which is {F, 33, 10073, null}. The distances between individuals 1, 2, 3, 4, 5 and Q are 5.556, 1.500, 4.167, 1.111, and 1.833, respectively.

By using SatPriIRInc_El,
The collection of all minimal hitting sets of the previous Ψ is {{Leukaemia}, {Heart}}, and the collection of all minimal hitting sets of the current Ψ is {{Leukaemia, Syphilis}, {Heart, Syphilis}}, whose cardinalities are both less than 3. Q does not satisfy EIR 3-diversity and D ≠ ∅, so we continue adding individuals to Q. The distance of individual 4 to Q is minimum, and it can be calculated as follows: r_{4q} is the identity element of Q ∪ R(4), which is {F, [33,34], {10070, 10073, 10087}, null}, and r_4 is the identity element of individual 4, which is {F, 33, 10087, null}, and thus The distances between individuals 1, 2, 3, 5 and Q are 7.501, 2.278, 5.834, and 2.722, respectively.
By using SatPriIRInc_El, the collection of all minimal hitting sets of Ψ is {{Leukaemia, Syphilis, Hypertension}, {Leukaemia, Syphilis, Diabetes}, {Leukaemia, Syphilis, Diabetes}, {Heart, Syphilis, Diabetes}} by

The published anonymous table D*, as shown in Table 4, satisfies EIR 3-diversity, and it can prevent identity disclosure and attribute disclosure. We again take Mike as an example. Assume that an attacker knows Mike's QI information (i.e. {M, 36, 10085}) and knows that Mike is in the published table D*; then the attacker can infer that Mike is in equivalence class Q_2. There are three different individuals in Q_2, and the attacker cannot know which one corresponds to Mike. Thus, identity disclosure is prevented. In Q_2, there are two reasoning sets, i.e. {r_1, r_3, r_4, r_7} and {r_2, r_3, r_4, r_7}. Every reasoning set contains at least three different sensitive values, so the attacker cannot know which disease corresponds to Mike, and so attribute disclosure is prevented.
Similarly, we use DAnonyIR_Eαβ with α = 0.4 and β = 0.6 to anonymize data D in Table 2. Assume that the first selected individuals are 6 and 3 when creating equivalence classes Q_1 and Q_2, respectively. The anonymous table is also shown in Table 4, and it satisfies EIR (0.4, 0.6)-anonymity. We again take Mike as an example. According to the background

Analysis of algorithm
In this section, we first analyse the correctness of algorithm DAnonyIR and then give its time and space complexity analysis.

Analysis of time complexity
For our algorithm DAnonyIR, given an original data table D, let |D.Id_num| = n (the ID attribute of D has been substituted with Id_num), |QI| = d, the number of equivalence classes |D*| = e, and |Q.Id_num| = q. Let m be the size of the domain of the sensitive attribute, r be the average number of sensitive values (i.e. the average number of records) of an individual, and h be the number of all minimal hitting sets of an equivalence class.
We check whether Q satisfies privacy requirement π. The time is O(nr) for SatPriIR_kl, SatPriIR_αβ, and SatPriIR_Eαβ, which scan the data once with some simple comparisons. For SatPriIR_El, we first scan the data to obtain Ψ and the number of single-record individuals, as shown on lines 1-8 of Algorithm 2, and the time is O(nr + n). Q contains some single-record individuals, and we assume that their number is s. If s ≥ l, algorithm SatPriIR_El returns True and ends. Otherwise, we call BHS to compute all minimal hitting sets and then find a minimum hitting set. In the worst case, the time of BHS is O(2^m), because we may select every sensitive value to divide the Boolean formula Π into two parts, as shown on line 14 of Algorithm 3. So the time of SatPriIR_El is O(nr + n + 2^m), i.e. O(nr + 2^m).
In fact, we use the incremental methods SatPriIRInc_kl, SatPriIRInc_El, SatPriIRInc_αβ, or SatPriIRInc_Eαβ to check whether Q satisfies π. When an individual is added to Q, we only consider how the added individual influences the privacy. The time is O(r) for SatPriIRInc_kl, SatPriIRInc_αβ, and SatPriIRInc_Eαβ. For SatPriIRInc_El, we need to consider hr hitting sets when an individual is added to Q, so the time is O(hr).
For Handle, it scans D* to check whether each equivalence class satisfies π after adding an individual, and computes the corresponding distances in QI. For IR (k, l)-anonymity and EIR l-diversity, an equivalence class still satisfies the privacy model after an individual is added to it, so no check is needed. For IR (α, β)-anonymity and EIR (α, β)-anonymity, we call SatPriIRInc_αβ and SatPriIRInc_Eαβ to check it, and the time is O(r). Most of the time is spent finding the equivalence class whose distance to the individual is minimum, which takes O(ed). So the time of Handle is O(ed) for IR (k, l)-anonymity and EIR l-diversity, and O(ed + r) for IR (α, β)-anonymity and EIR (α, β)-anonymity.
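To make the O(hr) incremental step concrete, the following sketch updates a collection of minimal hitting sets when one individual is added. It is an illustration under assumptions, not the paper's SatPriIRInc_El: the function name is hypothetical, and the update rule (keep a hitting set that already intersects the new individual's values, otherwise extend it by each of the r new values, then drop non-minimal candidates) is the standard incremental construction. The final pruning pass costs more than O(hr) in this naive form.

```python
def add_individual(hitting_sets, new_values):
    """Update all minimal hitting sets after adding an individual's sensitive values."""
    new_values = frozenset(new_values)
    candidates = set()
    for h in hitting_sets:
        if h & new_values:
            # h already hits the new individual's value set.
            candidates.add(h)
        else:
            # Otherwise extend h by each new value: at most h*r candidates overall.
            candidates.update(h | {v} for v in new_values)
    # Keep only candidates that contain no other candidate as a proper subset.
    return {c for c in candidates if not any(o < c for o in candidates)}


# Start from the empty class, whose only hitting set is the empty set.
sets = {frozenset()}
for values in [{"Leukaemia", "Heart"}, {"Syphilis"}, {"Hypertension", "Diabetes"}]:
    sets = add_individual(sets, values)  # values per individual are assumed
print(min(len(h) for h in sets))  # smallest hitting-set size after three additions
```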
The time complexity analysis of our algorithm DAnonyIR is shown as follows.
(1) On line 1, we recode the explicit identifier of D with numbers, and the execution time is O(n).
(2) On lines 2-19, the while loop runs e + 1 times (the last equivalence class obtained may not satisfy π). Each iteration creates an equivalence class, and its cost is as follows. Because the first individual in the equivalence class is randomly selected, we need to scan D and D* q − 1 times and compute the corresponding distances in QI. Since |D.Id_num| + |D*.Id_num| ≤ n, one scan of D and D*, including the distance computations needed to find the individual and the equivalence class with minimum distance to Q (lines 7 and 8 in Algorithm 1), takes time dn. When an individual is added to Q or an equivalence class Q′ is combined with Q, we call the incremental method to check whether Q satisfies privacy requirement π. In the first case, the time is O(r) for SatPriIRInc_kl, SatPriIRInc_αβ, and SatPriIRInc_Eαβ; in the latter case, we can add the individuals of Q′ to Q one by one, and the time is O(qr). For SatPriIRInc_El, the time is O(hr) and O(qhr) for the two cases, respectively. So the time of this step is O((e + 1)(q − 1)(dn + qr)) for DAnonyIR_kl, DAnonyIR_αβ, and DAnonyIR_Eαβ, and O((e + 1)(q − 1)(dn + qhr)) for DAnonyIR_El.
(3) When Q does not satisfy π and D = ∅, we need to run the while loop on lines 21-25. The number of residual individuals is less than q, because the residual individuals could not form an equivalence class that satisfies π, so the loop is executed at most q times. Every iteration calls function Handle. The time of the while loop is O(edq) for IR (k, l)-anonymity and EIR l-diversity, and O(q(ed + r)) for IR (α, β)-anonymity and EIR (α, β)-anonymity.
(4) We scan every Q and substitute the values in QI with Q's identity element. The time is O(eqd).
When d increases, the time of DAnonyIR increases linearly. If the parameter l increases, an equivalence class needs more individuals to satisfy IR (k, l)-anonymity or EIR l-diversity, so e decreases but q increases. From the time complexity analysis, it follows that the runtime increases. As parameter α (or β) increases, q decreases and e increases, so the runtime decreases.
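The loop structure behind this cost analysis can be sketched as follows. This is a simplified illustration, not Algorithm 1: it omits the merging of already published classes from D* and the privacy re-check in Handle, and the parameters `distance` and `satisfies` are hypothetical stand-ins for the QI distance and the decision function for π.

```python
import random


def greedy_cluster(individuals, distance, satisfies, seed=0):
    """Grow each class from a random seed until `satisfies` holds."""
    rng = random.Random(seed)
    remaining = list(individuals)
    classes = []
    while remaining:
        # The first member of a class is picked at random (no distance computation).
        q = [remaining.pop(rng.randrange(len(remaining)))]
        # Each further member costs one scan of the remaining individuals.
        while not satisfies(q) and remaining:
            nearest = min(remaining, key=lambda p: distance(p, q))
            remaining.remove(nearest)
            q.append(nearest)
        if satisfies(q) or not classes:
            classes.append(q)
        else:
            # Residual individuals: attach each one to its closest existing class.
            for p in q:
                min(classes, key=lambda c: distance(p, c)).append(p)
    return classes


# Toy usage: one numeric QI attribute and a purely size-based requirement.
ages = [33, 34, 36, 52, 54, 55, 70]
dist = lambda p, q: abs(p - sum(q) / len(q))
classes = greedy_cluster(ages, dist, satisfies=lambda q: len(q) >= 3)
print(classes)
```

The inner loop makes the q − 1 scans per class visible: every member after the first requires a pass over the remaining data.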

Analysis of space complexity
We can store the records of an individual in one node of a linked list to save memory, because the QI attributes of all records of an individual are the same. A node is represented by a C++ struct, which contains |QI| + 1 variables and an array used to store the sensitive values of the individual. We need O((1 + d + r)n) units to store the data, where an individual needs 1 + d + r units for 1 explicit identifier, d values in QI, and r sensitive values. When the published anonymous table D* is output, we transform it back into relational form.
When we check whether Q satisfies privacy requirement π, O(m) units are used to store the frequency of each sensitive value appearing in Q for SatPriIR_kl, SatPriIR_αβ, and SatPriIR_Eαβ. For SatPriIR_El, we first scan Q to obtain Ψ, and need O(qr) units to store it. In the worst case, we call BHS to compute all minimal hitting sets and need O(2^m qr) units to store them, because every sensitive value may be selected to divide the Boolean formula Π into two parts, as shown on line 14 of Algorithm 3.
In fact, we use the incremental methods SatPriIRInc_kl, SatPriIRInc_El, SatPriIRInc_αβ, or SatPriIRInc_Eαβ to check whether Q satisfies π. We still need O(m) units to store the frequency of each sensitive value appearing in Q for SatPriIRInc_kl, SatPriIRInc_αβ, and SatPriIRInc_Eαβ. For SatPriIRInc_El, we need O(hm) units to store all the minimal hitting sets related to Q.
The space complexity analysis of our algorithm DAnonyIR is shown as follows.
When d increases, the space required by DAnonyIR increases linearly. If the parameter l increases, an equivalence class needs more individuals to satisfy IR (k, l)-anonymity or EIR l-diversity, so e decreases but q increases. Because eq is invariable, this has no influence on the space of DAnonyIR_kl. For DAnonyIR_El, when the size of an equivalence class increases, h increases, and so does the space that DAnonyIR_El requires. As parameter α (or β) increases, q decreases and e increases. Because eq is invariable, this has no influence on the space of DAnonyIR_αβ or DAnonyIR_Eαβ.
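A per-individual node of this kind might look as follows. The sketch uses a Python dataclass and a dictionary in place of the C++ struct and linked list mentioned above; the class name, field names, and sample records are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndividualNode:
    id_num: int                      # recoded explicit identifier (1 unit)
    qi: List[object]                 # d quasi-identifier values, stored once
    sensitive: List[str] = field(default_factory=list)  # r sensitive values

    def units(self) -> int:
        """Storage units used by this individual: 1 + d + r."""
        return 1 + len(self.qi) + len(self.sensitive)


def group_records(records, d):
    """Collapse relational records (id, qi_1..qi_d, sensitive) into one node per individual."""
    nodes = {}
    for rec in records:
        id_num, qi, s = rec[0], list(rec[1:1 + d]), rec[1 + d]
        node = nodes.setdefault(id_num, IndividualNode(id_num, qi))
        node.sensitive.append(s)
    return list(nodes.values())


# Hypothetical records: individual 6 has two diseases, individual 7 has one.
records = [
    (6, "F", 34, 10070, "Leukaemia"),
    (6, "F", 34, 10070, "Heart"),
    (7, "F", 33, 10073, "Syphilis"),
]
nodes = group_records(records, d=3)
print([n.units() for n in nodes])  # [6, 5]
```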

Experimental analysis
IR (k, l)-anonymity and IR (α, β)-anonymity were proposed to solve the anonymization problem for multiple records with identity reservation. To the best of our knowledge, there is no further research on IR (k, l)-anonymity and IR (α, β)-anonymity, and the generalization algorithm GeneIR [13] is the only algorithm for achieving the two privacy models. So we can only use GeneIR to benchmark our approaches DAnonyIR_El and DAnonyIR_Eαβ. We implement GeneIR for IR (k, l)-anonymity and IR (α, β)-anonymity as described in [13], denoted by GeneIR_kl and GeneIR_αβ, respectively. To avoid the influence caused by different algorithms, we also compare DAnonyIR_kl and DAnonyIR_αβ with our DAnonyIR_El and DAnonyIR_Eαβ, respectively.
In an anonymous table, the equivalence classes that satisfy IR (k, l)-anonymity (IR (α, β)-anonymity) but do not satisfy EIR l-diversity (EIR (α, β)-anonymity) are called vulnerable equivalence classes. These vulnerable equivalence classes may cause privacy leakage: IR (k, l)-anonymity does not ensure that an attacker can only narrow the sensitive value of an individual down to one of l sensitive values (it may be one of fewer than l values), whereas EIR l-diversity does. Likewise, IR (α, β)-anonymity does not ensure that an attacker infers the sensitive value of an individual with probability at most β (the probability may exceed β), whereas EIR (α, β)-anonymity does.
Besides information loss, we also study the utility of the anonymized data based on the accuracy of query answering, because it is the basis of statistical analysis and many data mining applications (e.g. association rule mining and decision trees). The type of aggregation queries is defined as follows [23,29]:

SELECT COUNT(*) FROM D WHERE pred(A_1) AND … AND pred(A_λ) AND pred(A_s)

where A_j is a QI attribute. The query has predicates on the λ randomly selected QI attributes and on the sensitive attribute A_s. Given a query, the precise result prec is computed from the original table, and the estimated result est is obtained from the anonymized table by summing, over the equivalence classes Q_i ∈ D*, the count num_{A_s}^{Q_i} weighted by p. Here p is the percentage of the intersection of pred(A_j) and the generalized value on attribute A_j in Q_i, and num_{A_s}^{Q_i} is the number of individuals in Q_i that satisfy pred(A_s). The relative error ratio is defined as |est − prec|/prec.
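The estimate and the relative error ratio can be sketched as below. Several details are assumptions made only for illustration: QI predicates are integer ranges, generalized values are integer intervals, p for one attribute is the fraction of the interval covered by the predicate, and the per-attribute fractions are multiplied. The data layout and function names are hypothetical.

```python
def overlap_fraction(pred, gen):
    """Fraction of the generalized integer interval `gen` covered by the range `pred`."""
    lo, hi = max(pred[0], gen[0]), min(pred[1], gen[1])
    return max(0, hi - lo + 1) / (gen[1] - gen[0] + 1)


def estimate(classes, preds, sens_pred):
    """Estimated COUNT(*) over an anonymized table.

    classes:   list of dicts with "qi" (attribute -> interval) and
               "sensitive" (one set of sensitive values per individual)
    preds:     attribute -> (low, high) range predicate on a QI attribute
    sens_pred: set of sensitive values accepted by pred(A_s)
    """
    est = 0.0
    for cls in classes:
        p = 1.0
        for attr, pred in preds.items():
            p *= overlap_fraction(pred, cls["qi"][attr])
        # Individuals in this class with at least one matching sensitive value.
        num = sum(1 for values in cls["sensitive"] if values & sens_pred)
        est += p * num
    return est


def relative_error(est, prec):
    return abs(est - prec) / prec


table = [{"qi": {"age": (30, 39)},
          "sensitive": [{"Flu"}, {"Heart"}, {"Flu", "HIV"}]}]
est = estimate(table, {"age": (35, 39)}, {"Flu"})
print(est, relative_error(est, prec=2))  # 1.0 0.5
```

Because the sensitive attribute is not generalized, only the factor p introduces error, which is why wider generalized intervals push est away from prec.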
The purposes of our experiments are as follows.
(1) We use the percentage, v, of vulnerable equivalence classes in an anonymous table to show the vulnerability of IR (k, l)-anonymity or IR (α, β)-anonymity, i.e. v = Num_vec/Num_ec, where Num_vec and Num_ec are the numbers of vulnerable equivalence classes and of all equivalence classes in an anonymized table, respectively.
(2) In terms of data quality (information loss and accuracy of query answering) and runtime, we analyse the performance of DAnonyIR_El compared with DAnonyIR_kl and GeneIR_kl, and the performance of DAnonyIR_Eαβ compared with DAnonyIR_αβ and GeneIR_αβ.
To compare IR (k, l)-anonymity with EIR l-diversity conveniently, we set parameter k = l, because EIR l-diversity ensures that there are at least l different individuals in an equivalence class according to Definition 3.3. The algorithms are implemented in C++ and run on a computer with a four-core 3.2GHz CPU and 8GB RAM running Windows. The dataset includes 456,767 records of 49,384 different patients. We have used the following attributes of the dataset: Month of Birth, Year of Birth, Gender, Race, EducYear, Marry, Poverty, HISPANX, and Diagnosis, where EducYear denotes the years of education, Marry denotes the marital status, Poverty denotes the economic condition, and HISPANX denotes whether an individual is Hispanic. In our experiments, Diagnosis is the sensitive attribute. Because our approach considers a single sensitive attribute, we handle multiple sensitive attributes as follows: (1) if a sensitive attribute A_s is among the other attributes (we have explained how to handle other attributes in footnote 1), we may choose not to publish it, and (2) if the sensitive attribute A_s has to be published, we treat it as the sensitive attribute to protect, and the remaining attributes are treated as QI or other attributes. The detailed description of the dataset is shown in Table 5. In order to show how the results change with |QI|, we vary |QI| from 3 to 8. When |QI| = d (d ∈ {3, . . ., 8}), QI contains the first d attributes.

The vulnerability of IR anonymity
In this subsection, we discuss the vulnerability of IR (k, l)-anonymity and IR (α, β)-anonymity. The experimental results for IR (k, l)-anonymity and IR (α, β)-anonymity are shown in Figs. 2 and 3, respectively. We can see that GeneIR_kl and DAnonyIR_kl (GeneIR_αβ and DAnonyIR_αβ) both generate many vulnerable equivalence classes, and the percentage of vulnerable equivalence classes for DAnonyIR_kl (DAnonyIR_αβ) is higher than that for GeneIR_kl (GeneIR_αβ). GeneIR repeats the following process: it randomly selects an attribute in QI to generalize according to the predefined taxonomy tree, then checks D to obtain the equivalence classes that satisfy IR (k, l)-anonymity or IR (α, β)-anonymity, and moves them to D*. In contrast, DAnonyIR uses set generalization and considers the distance among individuals in an equivalence class. If an equivalence class satisfies IR (k, l)-anonymity or IR (α, β)-anonymity, we move the equivalence class to D*. An equivalence class obtained

The analysis of data quality
In this subsection, we analyse the data quality from information loss and accuracy of query answering.

Information loss
Figures 4 and 5 show the information loss exhibited by the DAnonyIR and GeneIR algorithms under different values of parameters l, α, β, and |QI|. We can see that GeneIR is worse than our DAnonyIR approach, because GeneIR makes equivalence classes very large and thus causes much unnecessary information loss. To satisfy different privacy models with identity reservation, only the SatPriIR functions differ in DAnonyIR. If an equivalence class Q satisfies the given privacy model, we do not add any further individual to Q. Although an equivalence class needs more individuals to satisfy EIR l-diversity (EIR (α, β)-anonymity) than IR (k, l)-anonymity (IR (α, β)-anonymity), the increase is very small. So the results of our DAnonyIR algorithms are much closer to one another than to those of GeneIR.
When l or |QI| increases, the information loss increases in GeneIR_kl, DAnonyIR_kl, and DAnonyIR_El, as shown in Fig. 4a, b, respectively. When QI is fixed (i.e. |QI| = 6) and l increases, the number of individuals and the number of records in each equivalence class both increase for GeneIR_kl, DAnonyIR_kl, and DAnonyIR_El, and the attributes of each record are more likely to receive more general values. Therefore, the information loss increases. When |QI| increases and l is fixed (i.e. l = 6), the number of attributes that need to be generalized increases. That is, more attributes are generalized when creating equivalence classes, so the information loss increases. From Fig. 4, we can see that the increase caused by |QI| is sharper than the increase caused by l, because increasing l only generalizes records further on the same attributes, while increasing |QI| adds more generalized attributes to each record.
When α or β increases, the information loss decreases in GeneIR_αβ, DAnonyIR_αβ, and DAnonyIR_Eαβ, as shown in Fig. 5a, b, respectively. When QI and β (α) are fixed (i.e. |QI| = 6 and β = 0.4 (α = 0.6)) and α (β) increases, the number of records in each equivalence class decreases, so the information loss decreases. From Fig. 5c, when |QI| increases and α and β are fixed (i.e. α = 0.6 and β = 0.4), we can see that the information loss increases, because the number of attributes that need to be generalized increases.

Aggregate query answering
The accuracy of query answering for GeneIR_kl, DAnonyIR_kl, and DAnonyIR_El is shown in Fig. 6. When QI and λ are fixed (i.e. |QI| = 6 and λ = 1) and l increases, the size of an equivalence class increases. Thus, a more general value is needed for every attribute, and the relative error ratio increases for these algorithms, as shown in Fig. 6a.
The accuracy of query answering for GeneIR_αβ, DAnonyIR_αβ, and DAnonyIR_Eαβ is shown in Fig. 7. When QI and λ are fixed (i.e. |QI| = 6 and λ = 1) and α (with β = 0.4) or β (with α = 0.6) increases, the size of an equivalence class decreases. Thus, a less general value is needed for every attribute, and the relative error ratio decreases for these algorithms, as shown in Fig. 7a, b.
In order to show the influence of the query dimension on the relative error ratio, we set λ = |QI|. From Figs. 6b and 7c, we can see that the relative error ratio decreases as the query dimension increases. Therefore, the anonymized data perform better for queries with a larger query dimension. When l is fixed (i.e. l = 6) or α and β are fixed (i.e. α = 0.6 and β = 0.4), as the query dimension λ increases, the precise result prec decreases, and the estimated result est obtained from the anonymized table is closer to prec. Therefore, the relative error ratio decreases.
From Fig. 6 (Fig. 7), we can see that the relative error ratios of DAnonyIR_kl and DAnonyIR_El (DAnonyIR_αβ and DAnonyIR_Eαβ) are less than that of GeneIR_kl (GeneIR_αβ), and the relative error ratio of DAnonyIR_El (DAnonyIR_Eαβ) is close to that of DAnonyIR_kl (DAnonyIR_αβ). Because DAnonyIR and GeneIR do not generalize the sensitive attribute, Σ_{Q_i ∈ D*} num_{A_s}^{Q_i} is invariable according to Eq. (19). When every equivalence class becomes larger, est moves further away from prec. Although an equivalence class needs more individuals to satisfy EIR l-diversity (EIR (α, β)-anonymity) than IR (k, l)-anonymity (IR (α, β)-anonymity), the increase is very small.

The analysis of efficiency
The runtimes exhibited by these algorithms under different values of parameters l, α, β, and |QI| are shown in Figs. 8 and 9. GeneIR can quickly find equivalence classes and check whether they satisfy IR (k, l)-anonymity or IR (α, β)-anonymity after each generalization operation, so its runtime is less than that of DAnonyIR. For GeneIR, multiple equivalence classes may be generated by scanning the dataset once or several times, while for DAnonyIR, one equivalence class is generated by scanning the dataset q times, where q is the number of individuals in the equivalence class. Therefore, the runtime of GeneIR is far less than that of DAnonyIR.
For DAnonyIR_kl and DAnonyIR_El, when l or |QI| increases, the runtime increases, as shown in Fig. 8, which is consistent with the time complexity analysis of DAnonyIR. For each equivalence class Q, the first individual is randomly selected without any distance computation, so little time is spent. For every other individual in the equivalence class, we need to scan D and D* to get the individual or equivalence class, respectively, with minimum distance to Q. In Fig. 8a, when l increases with fixed QI (i.e. |QI| = 6), the number of equivalence classes decreases. That is, the number of individuals in each equivalence class increases, so the runtime increases.
For DAnonyIR_αβ and DAnonyIR_Eαβ, when α or β increases, the runtime decreases, as shown in Fig. 9a, b, which is consistent with the time complexity analysis of DAnonyIR. When QI and β (α) are fixed (i.e. |QI| = 6 and β = 0.4 (α = 0.6)) and α (β) increases, the number of records in each equivalence class decreases. That is, the number of equivalence classes increases. So the algorithm needs fewer calculations, and the runtime decreases. From Fig. 9c, we can see that the runtime increases with |QI|, which is consistent with the time complexity analysis of DAnonyIR. When |QI| increases but α and β are fixed (i.e. α = 0.6 and β = 0.4), more attributes are considered in calculating distances, so the algorithm needs more time.
When α (β) increases, the runtime of GeneIR_αβ decreases, because the constraint is loosened and GeneIR_αβ needs fewer generalization operations. As |QI| increases, GeneIR_αβ needs to generalize more attributes in order to obtain equivalence classes, so its runtime increases.

Comprehensive analysis
From these experimental results, we can see that the percentage of vulnerable equivalence classes is fairly high. That is, if we use IR (k, l)-anonymity and IR (α, β)-anonymity, there are many equivalence classes that could cause privacy leakage. Although GeneIR_kl and GeneIR_αβ can achieve anonymization quickly, they are worse than our DAnonyIR in terms of data quality and privacy preservation. Compared with DAnonyIR_kl and DAnonyIR_αβ, our DAnonyIR_El and DAnonyIR_Eαβ have higher information loss and relative error ratio for query answering, and spend more time. However, the increments in these aspects are small and acceptable, because DAnonyIR_El and DAnonyIR_Eαβ supply stronger privacy preservation and the anonymization process is offline. Therefore, our enhanced privacy models and the DAnonyIR algorithm are suitable for anonymizing static datasets just once in an offline manner. However, when anonymization needs to take place frequently for data streams and execution time plays a major role, these approaches are inappropriate. In that setting, the two previous privacy models do not reach an adequate privacy level and suffer from great information loss under GeneIR, while the runtime of our DAnonyIR approach is not acceptable. Therefore, an appropriate approach for anonymizing data streams will be proposed in our future work.

Related work
In this section, we first discuss the privacy models and their anonymization approaches for static relational datasets with one-time anonymization. Then we discuss the development of privacy preservation for publishing dynamic relational datasets and data streams. We also discuss some privacy-preserving approaches for data types other than relational data. Finally, we show the main characteristics of our proposed approaches. An overall comparison of various anonymization approaches is shown in Table 6, where IDis=identity disclosure, ADis=attribute disclosure, AOpe=anonymous operation, MSen=multiple sensitive attributes, MRec=multiple records, DUpd=data update, DType=data type, Gen(T)=generalization with …

k-Anonymity was proposed by Samarati and Sweeney [3,14] in 1998. There exist many anonymization methods to implement k-anonymity, such as bottom-up generalization [15,16], top-down specialization [17], and anonymization by clustering [18][19][20]. Bottom-up generalization starts from the original data, which violates k-anonymity, and greedily selects a generalization operation according to a search metric until all equivalence classes satisfy k-anonymity. In contrast, top-down specialization starts from the most general state, in which all values are generalized to the most general values of their taxonomy trees. The specialization process terminates when no specialization can be performed without violating k-anonymity. In order to decrease the information loss caused by generalization, clustering-based anonymization approaches were proposed to achieve k-anonymity. They transform the problem into a clustering problem, i.e. finding a set of clusters (equivalence classes), each of which contains at least k records. The records in a cluster are as similar as possible, which ensures that less distortion is required when the records in a cluster are modified to have the same QI value.
Due to its simplicity, k-anonymity remains one of the most widely used models in the literature. It can protect against identity disclosure, but cannot prevent attribute disclosure. As a result, l-diversity was proposed in [10]. It requires that every equivalence class contains at least l "well-represented" sensitive values, which can be defined in diverse ways, e.g. distinct l-diversity and entropy l-diversity (the entropy of sensitive values in each equivalence class should be at least log l). IR (k, l)-anonymity refers to distinct l-diversity, so we also extend it to EIR l-diversity. There are numerous methods for achieving l-diversity [21,24]. Ghinita et al. [21] proposed a fast data anonymization method with low information loss for achieving l-diversity, which first maps the multi-dimensional QI attributes to a 1-dimensional space, then partitions the space so that each group covers a variety of sensitive values, and finally generalizes the QI attributes in each group. Wang et al. [24] argued that traditional data generalization based on predefined taxonomy trees often causes unnecessary information loss, so they proposed more flexible strategies for data generalization by set generalization and presented a clustering algorithm to implement l-diversity. Wang et al. [22] gave a privacy template in the form of (QI → s, h) (meaning that the confidence of inferring the sensitive value s from any group on QI is no more than h) and proposed an algorithm to minimally suppress a table to satisfy a set of privacy templates.
Furthermore, Wong et al. [11] extended k-anonymity to (α, k)-anonymity, which limits the confidence of the implications from the QI to a sensitive value to within α in order to protect the sensitive information from being inferred by strong implications, and proposed a bottom-up generalization algorithm to achieve (α, k)-anonymity. In order to prevent the skewness attack and the similarity attack, which are forms of attribute disclosure, Li et al. [12] proposed the t-closeness model. A skewness attack can happen when the distribution of sensitive values in an equivalence class is skewed (some values appear with high frequency, while others appear with low frequency), and a similarity attack can happen when the sensitive values in an equivalence class are semantically similar. t-Closeness requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table. They also revised the Incognito algorithm [16], which is a bottom-up generalization method proposed for k-anonymity, to achieve t-closeness. Cao et al. [23] pointed out that there was no anonymization algorithm tailored for t-closeness, and they proposed the SABRE approach for distribution-aware microdata anonymization based on the t-closeness principle. The approach first greedily partitions a table into buckets of similar sensitive values and then redistributes the tuples of each bucket into dynamically determined equivalence classes. Furthermore, Goryczka et al. [30] considered the collaborative data publishing problem for anonymizing horizontally partitioned data from multiple data providers. They introduced the concept of m-privacy, which guarantees that the anonymized data satisfy a given privacy constraint (e.g. k-anonymity and l-diversity) against any group of up to m colluding data providers. They also presented a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of anonymized data with efficiency.
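The two instantiations of "well-represented" mentioned above can be checked for a single equivalence class as in the following sketch. The function names are illustrative, the natural logarithm is used on both sides of the entropy test, and a small tolerance is added as an assumption to absorb floating-point rounding.

```python
import math
from collections import Counter


def distinct_l_diverse(values, l):
    """Distinct l-diversity: at least l different sensitive values in the class."""
    return len(set(values)) >= l


def entropy_l_diverse(values, l, tol=1e-12):
    """Entropy l-diversity: entropy of the sensitive values is at least log l."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l) - tol


skewed = ["Flu"] * 9 + ["HIV"]
print(distinct_l_diverse(skewed, 2))  # True: two distinct values
print(entropy_l_diverse(skewed, 2))   # False: the distribution is too skewed
```

The example shows why the entropy variant is stricter: a class dominated by one value passes the distinct test but fails the entropy test.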
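As a brief illustration of the (α, k)-anonymity requirement described above, the sketch below checks a table given as a list of equivalence classes, each represented only by its list of sensitive values. It assumes the general form of the model, in which the frequency bound α applies to every sensitive value; the function name and data layout are hypothetical.

```python
from collections import Counter


def alpha_k_anonymous(classes, alpha, k):
    """Each class needs at least k records, and no sensitive value may
    account for more than a fraction alpha of the records in its class."""
    for values in classes:
        if len(values) < k:
            return False
        if max(Counter(values).values()) / len(values) > alpha:
            return False
    return True


table = [["Flu", "HIV", "Heart", "Flu"], ["Flu", "HIV", "Heart", "Cancer"]]
print(alpha_k_anonymous(table, alpha=0.5, k=4))  # True
print(alpha_k_anonymous(table, alpha=0.4, k=4))  # False: "Flu" is 50% of the first class
```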

Single record and multiple sensitive attributes
The above approaches consider a data table with only one sensitive attribute, and they cannot be applied directly to the case of multiple sensitive attributes. Yang and Wang [31] proved that if the minimum class coverage ϕ_min in an equivalence class satisfies ϕ_min ≥ l, then the equivalence class satisfies multiple sensitive attributes l-diversity (MSA l-diversity), which requires that the values of every sensitive attribute satisfy l-diversity in any equivalence class. They also gave an anonymization approach with generalization based on minimum selected degree first. To preserve privacy against the proximity attack (similarity attack), Zhang et al. [32] defined the (ε⁺, δ)^k-dissimilarity privacy model for scalable big data with multiple sensitive attributes, which requires that the size of any equivalence class Q is at least k, and that any sensitive vector in Q is dissimilar to at least δ × (|Q| − 1) (0 ≤ δ ≤ 1) other sensitive vectors. Parameter k controls the size of Q to prevent identity disclosure, and δ constrains the number of ε⁺-neighbours that each sensitive vector can have in order to combat the proximity attack. They also proposed a clustering anonymization approach. Abdalaal et al. [33] assumed that adversaries can launch attacks by joining the quasi-identifiers with some non-membership knowledge to link individuals with the sensitive values. They proposed MSA-diversity, which ensures that the probability of mapping an individual to a sensitive value is bounded by 1/(l − i) under i bits of non-membership knowledge, but its strict grouping condition results in excessive information loss.

Multiple records and single sensitive attribute
However, all the above approaches assume that each record in a data table represents a distinct owner. In fact, an individual often has multiple records in real life whenever there is a 1:N relationship between an individual and the sensitive attribute. For example, a student has grades in different courses, a patient suffers from different diseases, and a person has multiple hobbies. The relation among different sensitive values that belong to the same individual is very important for researchers and decision-makers. So we need to keep it by recoding the ID with numbers instead of removing the explicit identifying information. In this case, anonymity models that assume an individual has only one record in a data table are inadequate: they may underprotect the data and cause privacy leakage.
At present, there exist some approaches to handle the situation in which an individual could have multiple records. (X, Y)-anonymity, introduced by Wang and Fung [34], specifies that each value in X is linked to at least k distinct values in Y. It provides a flexible way to specify different types of privacy requirements. For example, we can specify k-anonymity with respect to patients by letting X be the QI attributes and Y be the explicit identifier of an individual; in this case, several records may represent the same record owner (individual). In order to maintain such a correlation, Tong et al. [13] proposed three privacy models with identity reservation (i.e. IR k-anonymity, IR (k, l)-anonymity, and IR (α, β)-anonymity) and presented an anonymization method, GeneIR, which uses bottom-up generalization with predefined taxonomy trees to implement these privacy models. They first recode the ID of database D with numbers and group the records with the same QI values. If an equivalence class satisfies the given privacy model π, the group is moved from D to D*. The following step is then executed repeatedly: select an attribute in QI to generalize up one level in its taxonomy tree and check D to obtain the equivalence classes that satisfy π, until D does not satisfy π or no further generalization can be made. If there are residual individuals in D, every residual individual is added to the closest equivalence class.
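The (X, Y)-anonymity condition can be illustrated with a short check over a table of records. The sketch assumes rows are dictionaries and that X and Y are given as lists of column names; the function name and the sample columns are hypothetical.

```python
from collections import defaultdict


def xy_anonymous(rows, x_cols, y_cols, k):
    """Every X value must be linked to at least k distinct Y values."""
    linked = defaultdict(set)
    for row in rows:
        x = tuple(row[c] for c in x_cols)
        y = tuple(row[c] for c in y_cols)
        linked[x].add(y)
    return all(len(ys) >= k for ys in linked.values())


# Two records of patient 1 share the same QI value and count only once.
rows = [
    {"zip": "100**", "id": 1, "disease": "Flu"},
    {"zip": "100**", "id": 1, "disease": "Heart"},
    {"zip": "100**", "id": 2, "disease": "HIV"},
]
print(xy_anonymous(rows, ["zip"], ["id"], k=2))  # True
print(xy_anonymous(rows, ["zip"], ["id"], k=3))  # False
```

With X as the QI attributes and Y as the individual identifier, counting distinct Y values rather than records is what distinguishes this from plain k-anonymity when an individual has several records.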

Anonymization for dynamic datasets and data streams
In this subsection, we discuss the anonymization approaches for dynamic datasets and data streams, which almost all consider the scenario with single record and single sensitive attribute.

Anonymization for dynamic datasets
When data are dynamically updated with record insertions and/or deletions, re-publication is needed. Anonymizing datasets statically (i.e. each release is individually anonymous) may cause privacy leakage, because an attacker can compare different releases and eliminate some possible sensitive values for a victim [2]. Byun et al. [35] pioneered an anonymization technique with generalization that enables privacy-preserving continuous data publishing after new records have been inserted. It guarantees that every release satisfies distinct l-diversity, and makes sure that a new anonymized table to be released does not create any inference channel with respect to the previously released tables (called dynamic l-diversity). Nevertheless, this approach supports only insertions. Xiao and Tao [36] proposed the m-invariance privacy model and an anonymization method with generalization to address both record insertions and deletions. A sequence of releases D*_1, …, D*_p satisfies m-invariance if every equivalence class Q in D*_i (1 ≤ i ≤ p) is m-unique (i.e. Q contains at least m records and all records in Q have different sensitive values) and all equivalence classes in D*_1, …, D*_p containing record r have the same set of sensitive values. Li and Zhou [37] extended m-invariance to m-distinct to address external updates (the dataset is updated with record insertions and/or deletions) and internal updates (the attribute values of each record are dynamically updated).
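The two conditions of m-invariance stated above can be checked as in the following sketch. The data layout is an assumption made for illustration: each release is a list of equivalence classes, and each class maps a record identifier to its sensitive value. The function names are hypothetical.

```python
def is_m_unique(sens_values, m):
    """At least m records, all with different sensitive values."""
    return len(sens_values) >= m and len(set(sens_values)) == len(sens_values)


def is_m_invariant(releases, m):
    """Every class is m-unique, and each record always appears in classes
    with the same set of sensitive values across all releases."""
    signature = {}
    for release in releases:
        for cls in release:
            if not is_m_unique(list(cls.values()), m):
                return False
            sig = frozenset(cls.values())
            for rid in cls:
                # setdefault stores the first signature seen for this record.
                if signature.setdefault(rid, sig) != sig:
                    return False
    return True


r1 = [{1: "Flu", 2: "HIV"}]
r2 = [{1: "Flu", 3: "HIV"}]      # record 2 deleted, record 3 inserted
r3 = [{1: "Flu", 4: "Cancer"}]   # record 1 now sees a different value set
print(is_m_invariant([r1, r2], m=2))  # True
print(is_m_invariant([r1, r3], m=2))  # False
```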

Anonymization for data streams
Data streams are continuous, transient, and usually unbounded. Cao et al. [27] argued that anonymizing data streams differs from anonymizing dynamic datasets because different inferences can arise in the two settings. Anonymizing a dynamic dataset requires multiple releases of a table. Inference is possible there because multiple anonymized tables contain some of the same records, whereas this inference cannot be carried out on an anonymized data stream because each record in the stream is anonymized only once. The possible inferences on an anonymized data stream arise because the attacker is able to inspect the sequence of anonymized tuples given in output. Because of the characteristics of data streams, an algorithm for streaming data can scan the data only in one pass and executes in a pipeline manner, and it must offer strong guarantees on the maximum allowed delay between incoming data and the corresponding anonymized output. Efficiency therefore plays an important role in anonymizing data streams. Cao et al. [27] presented k_s-anonymity for privacy-preserving publishing of data streams in which an individual may have multiple records, and gave a clustering algorithm that anonymizes data streams and ensures the freshness of the anonymized data by satisfying specified delay constraints. However, they put the different records of the same individual in different equivalence classes. Hence, they lose the relation among the sensitive attribute values that belong to the same individual. Furthermore, Guo and Zhang [38] improved the clustering-based algorithm of Cao et al. for data streams by considering the time constraints on tuple publication and cluster reuse, in order to accelerate the anonymization process and reduce the information loss.

Anonymization for other data types except relational data
There are some studies on anonymizing set-valued data, trajectory data, and social networks. Terrovitis et al. [39] presented k^m-anonymity for set-valued or transaction data, which guarantees that an adversary who knows at most m items cannot distinguish any transaction among at least k transactions. They proposed two heuristic anonymization algorithms, which greedily identify itemsets that violate the anonymity requirement and choose generalization rules that fix the corresponding problems. (h, k, p)-coherence, introduced by Xu et al. [40], ensures that an attacker who knows at most p items can only narrow a transaction down to a group of at least k transactions, of which no more than h% share a common private item. They gave an algorithm that achieves (h, k, p)-coherence by suppression while preserving as much information as possible. Cao et al. [41] proposed the ρ-uncertainty privacy model, which does not allow an attacker who knows any subset of a transaction t to infer a sensitive item α ∈ t with confidence higher than ρ, and presented an algorithm that combines generalization and suppression to transform a data table so that it satisfies ρ-uncertainty. Chen et al. [42] studied the privacy problem of trajectory data. They proposed the (K, C)_L-privacy model, which requires any subsequence q of any adversary's L-knowledge to be shared by either 0 or at least K records in a trajectory database and the confidence of inferring any sensitive value from q to be at most C, and showed that the proposed suppression method can significantly improve the data utility of anonymous trajectory data. Liu and Terzi [43] proposed k-degree anonymity for anonymizing social networks, which requires that every vertex has at least k − 1 other vertices sharing the same degree, and gave an algorithm that ensures k-degree anonymity by modifying the graph structure. Moreover, Casas-Roma et al. [8] devised an efficient algorithm for k-degree anonymity in large networks.
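Of the models surveyed above, k-degree anonymity has the most compact definition, so it serves as a small illustration. The sketch below only verifies the property on an undirected simple graph given as an edge list; it does not modify the graph as the algorithm of Liu and Terzi [43] does, and the name `is_k_degree_anonymous` is an assumption.

```python
from collections import Counter


def is_k_degree_anonymous(nodes, edges, k):
    """True if every degree value that occurs is shared by at least k vertices."""
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    vertices_per_degree = Counter(degree.values())
    return all(count >= k for count in vertices_per_degree.values())


# Hypothetical graphs: a 4-cycle (all degrees equal 2) and a 3-vertex path.
cycle_nodes, cycle_edges = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)]
path_nodes, path_edges = [0, 1, 2], [(0, 1), (1, 2)]
```

In the path, the middle vertex is the only one with degree 2, so an attacker who knows a victim's degree can re-identify it.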

Characteristics of our approaches
In this paper, we deal with static relational data with multiple records per individual and a single sensitive attribute. It is significant to study identity reservation for multiple records. Two privacy models, EIR l-diversity and EIR (α, β)-anonymity, are proposed to overcome the weakness of IR (k, l)-anonymity and IR (α, β)-anonymity for identity reservation (i.e. they fail to prevent attribute disclosure). At present, many anonymization approaches use generalization by predefined taxonomy trees, which restricts the generalized range and causes some unnecessary information loss. Therefore, Wang et al. [24] presented set generalization and gave a clustering algorithm, l-clustering, to implement l-diversity. Inspired by this method, we present the heuristic greedy clustering algorithm DAnonyIR for achieving EIR l-diversity and EIR (α, β)-anonymity. We can also use DAnonyIR to anonymize a database so that it satisfies IR (k, l)-anonymity and IR (α, β)-anonymity by calling different decision functions.
Our approach differs from l-clustering in the following two aspects. (1) Our approach considers the case of an individual with multiple records, so the definitions of some distances are different. We introduce our own concepts of the distance between two individuals, the distance between an individual and an equivalence class, and the distance between two equivalence classes by using different information metrics for numeric and categorical attributes. (2) Our approach is used to achieve EIR l-diversity, EIR (α, β)-anonymity, IR (k, l)-anonymity, and IR (α, β)-anonymity, while l-clustering is used for l-diversity. We need to set different decision functions for different privacy models. Experimental results have shown that, in terms of data quality, our DAnonyIR is superior to the existing GeneIR, which handles multiple records with generalization by predefined taxonomy trees. Our approaches are only used to anonymize static relational datasets with multiple records and a single sensitive attribute. In future work, it will be interesting to extend our approaches to anonymize datasets with multiple sensitive attributes, dynamic datasets, data streams, and other data types.

Conclusions
In this paper, we have argued that IR (k, l)-anonymity and IR (α, β)-anonymity are insufficient to prevent privacy leakage. Thus, we proposed enhanced versions of these two privacy models, called EIR l-diversity and EIR (α, β)-anonymity. Moreover, we have designed a general anonymization algorithm, called DAnonyIR, which uses a clustering technique to transform the dataset so that it satisfies different identity-reserved privacy models by calling different decision functions. Compared with the existing approaches, i.e. GeneIR_kl and GeneIR_αβ [15], respectively, our DAnonyIR_El and DAnonyIR_Eαβ provide stronger privacy preservation, and their information loss and relative error ratio of query answering are less than those of GeneIR_kl and GeneIR_αβ, although our approaches need more runtime. To avoid the influence caused by different algorithms, we also compared our enhanced approaches with DAnonyIR_kl and DAnonyIR_αβ, respectively, and found that our approaches are very close to DAnonyIR_kl and DAnonyIR_αβ in terms of information loss, relative error ratio of query answering, and runtime.
Our EIR l-diversity and EIR (α, β)-anonymity are suitable for the anonymization of relational data in which an individual could have multiple records. Our DAnonyIR algorithm is performed just once over static datasets in an offline manner, and the clustering result is not optimal. In future, it is worth extending our approaches to find an optimal clustering result by analysing its average time complexity, and to address privacy leakage caused by relations among attributes, attackers' stronger background knowledge, multiple sensitive attributes, and the publishing of dynamic datasets and data streams. We will also consider privacy preservation of distributed data [30] and of other sorts of data, including set-valued data [38], trajectory data [42], and social networks [43].
* of D satisfies (1) k-anonymity if any equivalence class in D* contains at least k records; (2) distinct l-diversity if any equivalence class in D* contains at least l different sensitive values; and (3) (α, k)-anonymity if, for any equivalence class Q in D*, Q satisfies k-anonymity and the percentage of any sensitive value in Q is less than or equal to α.
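The three conditions above translate into short checks. In this minimal sketch, an anonymous table is assumed to be a list of equivalence classes, each given as the list of sensitive values of its records (one entry per record); the function names are illustrative, and k ≥ 1 is assumed so that no class is empty.

```python
from collections import Counter


def is_k_anonymous(classes, k):
    """Every equivalence class contains at least k records."""
    return all(len(q) >= k for q in classes)


def is_distinct_l_diverse(classes, l):
    """Every equivalence class contains at least l different sensitive values."""
    return all(len(set(q)) >= l for q in classes)


def is_alpha_k_anonymous(classes, alpha, k):
    """k-anonymity plus a cap alpha on the share of any sensitive value."""
    if not is_k_anonymous(classes, k):
        return False
    return all(max(Counter(q).values()) / len(q) <= alpha for q in classes)


# Hypothetical table with two equivalence classes of three records each.
classes = [["flu", "flu", "hiv"], ["flu", "cancer", "hiv"]]
```

These checks count records only, so they treat two records of the same individual as if they belonged to two people; this is the gap that the identity-reserved models address.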

Fig. 1 Taxonomy trees for Postcode and Age. (a) Postcode. (b) Age

Definition 4.1 (Information metric for a numeric attribute) Let the value domain of a numeric attribute A be [L, U]. Let the value of a record r on attribute A be r[A] = [a, b], which is the original value or a generalized value, and let its (later) generalized value on A be r*[A] = [a*, b*]. Then the information loss of r on numeric attribute A from r[A] to r*[A] is
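As an illustration of the kind of quantity Definition 4.1 describes, the sketch below computes the growth in interval width normalized by the domain width, i.e. ((b* − a*) − (b − a)) / (U − L). This normalized-range form is an assumption chosen because it is a common metric for numeric generalization; it may differ from the expression intended by the definition, and the name `numeric_info_loss` is hypothetical.

```python
def numeric_info_loss(current, generalized, domain):
    """Assumed metric: widening of the interval, normalized by the domain width.

    current = (a, b), generalized = (a_star, b_star), domain = (L, U).
    """
    a, b = current
    a_star, b_star = generalized
    low, high = domain
    # The generalized interval must cover the current one and stay in the domain.
    if not (low <= a_star <= a <= b <= b_star <= high):
        raise ValueError("generalized interval must cover the current interval")
    if high == low:
        return 0.0  # a single-valued domain cannot lose information
    return ((b_star - a_star) - (b - a)) / (high - low)
```

For example, under this assumed metric, generalizing the age 36 to the interval [30, 40] on a domain [0, 100] widens the value by 10 out of 100 units.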

Definition 4.8 (Optimal clustering) Given an original data table D and a privacy model with identity reservation π, an optimal clustering of D is a partition $P = \{Q_1, \ldots, Q_e\}$ such that $\bigcap_{i=1}^{e} Q_i = \emptyset$, $\bigcup_{i=1}^{e} Q_i \subseteq D$, and $Q_i$ ($i = 1, \ldots, e$) satisfies π after $Q_i$ is generalized. The published anonymous table D* consists of these generalized $Q_i$, and Loss(D, D*) is minimal.
in which every set is a minimum hitting set, and the cardinality is equal to 3. Q satisfies EIR 3-diversity. Algorithm 1 executes line 17; the first equivalence class is added to D*, i.e. D* = {{r8, r9, r10, r5, r6}}. D = {r1, r2, r3, r4, r7} and D ≠ ∅. We try to create another equivalence class from D. Randomly select an individual p from D. Assume that individual 3 is selected; then D = D − R(3) and Q = R(3). We have r_q = {M, 36, 10086, null}. Q contains only one individual and does not satisfy 3-diversity. Also D ≠ ∅. Thus we continue to add individuals to Q. The next individual is 1, whose distance to Q is 0.500 and is minimum. The distance between Q and the first equivalence class in D* is 8.778. Therefore, Q = Q ∪ R(1), and r_q = {M, 36, {10085, 10086}, null}. Then R(2) and R(5) are added consecutively to Q, and Q satisfies EIR 3-diversity. D* = D* ∪ {Q} = {{r5, r6, r8, r9, r10}, {r1, r2, r3, r4, r7}}. Now D = ∅ and SatFlag = True, i.e. there are no residual individuals. Algorithm 1 executes line 27. For every equivalence class Q in D*, we substitute its values on the QI attributes with Q's identity element.

Fig. 7 The accuracy of query answering in GeneIR_αβ, DAnonyIR_αβ, and DAnonyIR_Eαβ. (a) |QI| = 6, β = 0.4, and λ = 1. (b) |QI| = 6, α = 0.6, and λ = 1. (c) α = 0.6 and β = 0.4

The time of DAnonyIR_kl and DAnonyIR_El is close because both use the DAnonyIR algorithm and differ only in the function SatPriIR, which is SatPriIRInc_kl and SatPriIRInc_El, respectively. Both functions are incremental and consider only how one added individual influences privacy. Similarly, the time of DAnonyIR_αβ and DAnonyIR_Eαβ is close.

Table 1
Summary of notations
rea: a reasoning set in Q

different individuals. Some notions directly related to our approaches are presented in the following.
Given an original data table D, for an equivalence class Q in the published anonymous table D*, let |Q| be the number of records, $m_Q$ be the number of sensitive values, and $n_Q$ be the number of individuals, i.e. … D* satisfies (1) identity-reserved (IR) k-anonymity if any equivalence class Q contains at least k different individuals, i.e. $n_Q \ge k$; (2) identity-reserved (IR) (k, l)-anonymity if any equivalence class Q contains at least k different individuals and l different sensitive values, i.e. $m_Q \ge l$ and $n_Q \ge k$; and (3) identity-reserved (IR) (α, β)-anonymity if for any equivalence class Q, $|R(p_i)|/|Q| \le \alpha$, $\forall p_i \in Q.Id\_num$, and $|R(s_j)|/|Q| \le \beta$, $\forall s_j \in Q.A_s$, where $\alpha, \beta \in (0, 1)$.
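The three identity-reserved conditions can be sketched as checks on a single equivalence class. The sketch assumes that a class is a list of (individual_id, sensitive_value) records, that |R(p_i)| is the number of records of individual p_i in Q, and that |R(s_j)| is the number of records in Q carrying sensitive value s_j; the function names are illustrative.

```python
from collections import Counter


def class_counts(eq_class):
    """Records per individual and records per sensitive value in one class."""
    records_per_individual = Counter(pid for pid, _ in eq_class)
    records_per_value = Counter(value for _, value in eq_class)
    return records_per_individual, records_per_value


def is_ir_k_anonymous(eq_class, k):
    individuals, _ = class_counts(eq_class)
    return len(individuals) >= k  # n_Q >= k


def is_ir_kl_anonymous(eq_class, k, l):
    individuals, values = class_counts(eq_class)
    return len(individuals) >= k and len(values) >= l  # n_Q >= k, m_Q >= l


def is_ir_alpha_beta_anonymous(eq_class, alpha, beta):
    individuals, values = class_counts(eq_class)
    size = len(eq_class)  # |Q|
    if size == 0:
        return False
    return (max(individuals.values()) / size <= alpha
            and max(values.values()) / size <= beta)


# Hypothetical class: 5 records, 3 individuals, 3 sensitive values.
q = [(1, "flu"), (1, "asthma"), (2, "flu"), (3, "hiv"), (3, "flu")]
```

A table satisfies a model when every one of its equivalence classes passes the corresponding check.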
Table 2 is a patient table in which an individual has one or several records, where {Gender, Age, Postcode} is the quasi-identifier set QI. Table 3 is a published table which satisfies IR

Table 2
A patient table in which an individual could have multiple records

Definition 3.5 (Hitting set) [28] Let $\Psi = \{X_1, X_2, \ldots, X_t\}$ be a collection of subsets of a finite set X. If $H \subseteq X$ and $H \cap X_i \neq \emptyset$, $\forall X_i \in \Psi$, H is called a hitting set of Ψ. If there is no $H' \subset H$ such that $H'$ is a hitting set of Ψ, then H is called a minimal hitting set of Ψ. If the cardinality of H is smallest, then H is called a minimum hitting set of Ψ. For example, given $X = \{x_1, x_2, x_3, x_4\}$ and $\Psi = \{\{x_2, x_3\}, \{x_2, x_4\}, \{x_1, x_2\}\}$, $\{x_1, x_3, x_4\}$ and $\{x_2\}$ are minimal hitting sets of Ψ, where $\{x_2\}$ is a minimum hitting set of Ψ.

Given an equivalence class Q, let $P_Q = \{p_1, p_2, \ldots, p_{n_Q}\}$ be the set of individuals in Q. Let $\Psi = \{S(p_1), S(p_2), \ldots, S(p_{n_Q})\}$ and let H be a minimum hitting set of Ψ. If $|H| \ge l$, then Q satisfies EIR l-diversity.
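The definitions above can be exercised with a brute-force sketch that enumerates candidate sets by increasing size. It is exponential in the number of distinct values and is meant only to make the definitions concrete; it is not the BHS procedure used in the paper, and the function names are assumptions.

```python
from itertools import combinations


def minimal_hitting_sets(collection):
    """All minimal hitting sets of a collection of sets (brute force)."""
    universe = sorted(set().union(*collection))
    found = []
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            h = set(candidate)
            hits_all = all(h & x for x in collection)
            # Smaller hitting sets were found earlier, so a superset of any
            # of them cannot be minimal.
            if hits_all and not any(m <= h for m in found):
                found.append(h)
    return found


def satisfies_eir_l_diversity(sensitive_sets, l):
    """sensitive_sets holds S(p) for every individual p in the class."""
    hitting = minimal_hitting_sets(sensitive_sets)
    if not hitting:  # some S(p) is empty, so no hitting set exists
        return False
    return min(len(h) for h in hitting) >= l


psi = [{"x2", "x3"}, {"x2", "x4"}, {"x1", "x2"}]
```

On `psi`, which is the example from Definition 3.5, the sketch returns the two sets {x2} and {x1, x3, x4}.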

diversity. Theorem 3.2 Given an equivalence class Q, let
anonymity, and return True; otherwise, return False. For SatPriIR_αβ, the algorithm is similar to SatPriIR_Eαβ, and the difference is on line 12 in Algorithm 4. According to Definition 2.3 (3), if MaxRecNum/|Q| ≤ α and MaxSenNum/|Q| ≤ β, then Q satisfies IR (α, β)-anonymity, and return True; otherwise, return False.
). On lines 25 and 26, we call BHS to get all minimal hitting sets of the current Ψ, and then find a minimum hitting set $h'$. If $|h'| + |\xi| \ge l$, return True; otherwise, return False. In fact, we divide Ψ

Input: the collection of sets Ψ;
Output: all minimal hitting sets of Ψ;
1: transform Ψ to Boolean formula Π with disjunctive normal form;
2: if Π only contains a conjunctive item, i.e. $\Pi = s_1 s_2 \cdots s_\theta$ then
…
if there are single literal items in Π, i.e. $s_1, s_2, \cdots, s_\vartheta$ then
10: $sig = s_1 s_2 \cdots s_\theta$;
11: delete $s_1, s_2, \cdots, s_\vartheta$ from Π;
12: end if
13: get the literal $s'$ whose frequency of appearance in Π is highest;
14: $sig \cdot (s' \cdot BHS(\Pi_1) + BHS(\Pi_2))$, where $\Pi_1$ and $\Pi_2$ are the results of deleting from Π the conjunctive items which contain $s'$, and $s'$ itself, respectively;

and $\{x_1, x_4, x_5\}$, $\{x_3, x_4, x_5, x_6\}$, and $\{x_3, x_4, x_6, x_7\}$ are minimal hitting sets.

SatPriIR_Eαβ For Algorithm 4, from lines 2 to 10, we obtain MaxRecNum, which denotes the maximum number of records of individuals in Q, and count the number of occurrences of every sensitive value appearing in Q. On line 11, we find MaxSenNum, which is the maximum value in $\{Num_{s_1}, \ldots, Num_{s_{m_Q}}\}$, where $s_1, \ldots, s_{m_Q}$ are the sensitive values appearing in Q and $Num_{s_j}$ is the number of occurrences of $s_j$. By Theorem 3.2, if MaxRecNum/|Q| ≤ α and MaxSenNum/$n_Q$ ≤ β, Q satisfies EIR (α, β)-anonymity, so return True; otherwise, return False.

Input: the set of records Q; parameters α and β;
…
11: MaxSenNum = $\max\{Num_{s_1}, \ldots, Num_{s_{m_Q}}\}$;
12: if MaxRecNum/|Q| ≤ α and MaxSenNum/$n_Q$ ≤ β then

… $S(p_2), \ldots, S(p_{n_Q})\}$; individual p added to Q with $S(p) = \{s_1, s_2, \ldots, s_r\}$; parameter l;
$H_{Qp} = (x_1 x_2 \cdots x_{h_1} + y_1 y_2 \cdots y_{h_2} + \cdots + z_1 z_2 \cdots z_{h_t})(s_1 + s_2 + \cdots + s_r)$;
… $\{y_1, y_2, \ldots, y_{h_2}\}, \ldots, \{z_1, z_2, \ldots, z_{h_t}\}\}$.

For convenience, we represent $H_Q$ with a Boolean formula. When an individual p is added to Q, we use Algorithm 5 to get all minimal hitting sets $H_{Qp}$ of $\Psi \cup S(p)$. If $Q = \emptyset$, then $H_{Qp} = s_1 + s_2 + \cdots + s_r$, i.e. $H_{Qp} = \{s_1, s_2, \ldots, s_r\}$; otherwise, we use the method shown on line 4 to get $H_{Qp}$ and simplify it with Boolean algebra. Then find a minimum hitting set h from $H_{Qp}$. If $|h| \ge l$, return True; otherwise, return False.
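The incremental update described above, multiplying the current formula by (s_1 + ... + s_r) and simplifying, can be sketched with sets instead of Boolean formulas. Each product term becomes a frozenset, and the simplification step becomes the absorption rule that drops every term containing another term. The function names are assumptions, and the empty class is represented by the single empty term so that the Q = ∅ case needs no special handling.

```python
def add_individual(min_hitting_sets, sensitive_values):
    """Minimal hitting sets after adding one individual's sensitive values.

    min_hitting_sets: iterable of frozensets (use [frozenset()] for empty Q).
    """
    # Distribute: every old term is extended by each value of the new individual.
    candidates = {h | {s} for h in min_hitting_sets for s in sensitive_values}
    # Absorption: keep only terms that contain no other term as a proper subset.
    return [h for h in candidates
            if not any(other < h for other in candidates)]


def satisfies_l_after_adding(min_hitting_sets, l):
    """True if the smallest current hitting set has at least l values."""
    return min(len(h) for h in min_hitting_sets) >= l


# Rebuild the example collection one individual at a time.
state = [frozenset()]
for s_p in [{"x2", "x3"}, {"x2", "x4"}, {"x1", "x2"}]:
    state = add_individual(state, s_p)
```

Because only the previous minimal hitting sets and the new individual's values are needed, the check does not have to revisit the individuals already in the class.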

Algorithm 6
SatPriIRInc_Eαβ(Q, SenNum_Q, MaxRecNum, MaxSenNum, p, α, β)
Input: the set of records Q; the array SenNum_Q contains the number of occurrences of every sensitive value in Q; MaxRecNum is the maximum number of records of individuals in Q; MaxSenNum is the maximum number of occurrences of sensitive values in Q; individual p added to Q with $S(p) = \{s_1, s_2, \ldots, s_r\}$; parameters α and β;

Input: an individual ($p'$);
Output: the set D* of equivalence classes without generalization;
1: MinDis = MaxValue;
2: Min = 0;
3: for $\forall Q_i \in D^*$ do
4:   if SatPriIR($Q_i \cup R(p')$, parameters) then
5:     if Dist($p'$, $Q_i$) < MinDis then
6:       MinDis = Dist($p'$, $Q_i$);
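The inputs listed for SatPriIRInc_Eαβ suggest that the running counts are carried from call to call, so adding an individual only touches that individual's values. The sketch below shows such bookkeeping under two assumptions that are not confirmed by the listing: each sensitive value is counted once per individual who has it, and the final test is MaxRecNum/|Q| ≤ α and MaxSenNum/n_Q ≤ β. The class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClassState:
    n_records: int = 0        # |Q|
    n_individuals: int = 0    # n_Q
    max_rec_num: int = 0      # MaxRecNum
    max_sen_num: int = 0      # MaxSenNum
    sen_num: Counter = field(default_factory=Counter)  # SenNum_Q


def absorb_individual(state, record_count, sensitive_values):
    """Update the running counts when one individual joins the class."""
    state.n_records += record_count
    state.n_individuals += 1
    state.max_rec_num = max(state.max_rec_num, record_count)
    for s in set(sensitive_values):  # assumed: one count per individual
        state.sen_num[s] += 1
        state.max_sen_num = max(state.max_sen_num, state.sen_num[s])


def satisfies_eir_alpha_beta(state, alpha, beta):
    if state.n_records == 0:
        return False
    return (state.max_rec_num / state.n_records <= alpha
            and state.max_sen_num / state.n_individuals <= beta)
```

Each update costs time proportional to the number of sensitive values of the added individual, independent of the size of the class.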

Table 4
EIR 3-diverse or EIR (0.4, 0.6)-anonymous table

knowledge of an attacker, Mike is inferred to be in equivalence class Q_2. Because the percentage of any individual's records in Q_2 is not more than 0.4, Q_2 contains multiple individuals. The attacker cannot know which one corresponds to Mike, so identity disclosure is prevented. As shown above, there are two reasoning sets in Q_2. In every reasoning set of Q_2, the percentage of any sensitive value is not more than 0.6. So the attacker cannot determine which sensitive disease corresponds to Mike with probability greater than 0.6, and attribute disclosure is prevented.
We add every individual in Q to the equivalence class whose distance to the individual is smallest and which still satisfies π after being combined with the individual; otherwise, we suppress the individual. If the given original table D does not satisfy π, we move all individuals to the equivalence class Q and Q still does not satisfy π. Then the individuals in Q are residual, and algorithm DAnonyIR calls the Handle function to deal with them. Finally, all individuals in D are suppressed.
If the given original table D satisfies privacy model π, algorithm DAnonyIR can transform D to D* which satisfies π. When D ≠ ∅, we try to create an equivalence class Q. Firstly, we randomly select an individual and add its records to Q. If Q does not satisfy π and D ≠ ∅, we continue to add individuals to Q until Q satisfies π or D = ∅. If Q satisfies π, we add it to the published anonymous table D*. If Q does not satisfy π and D = ∅, then the individuals in Q are residual.
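The clustering loop described above can be sketched as follows. This is a simplified outline, not the DAnonyIR algorithm itself: the distance function `dist` and the decision function `satisfies` are supplied by the caller, the comparison against classes already in D* is omitted, and residual individuals are returned to the caller instead of being passed to a Handle function. All names are assumptions.

```python
import random


def greedy_clusters(individuals, dist, satisfies, seed=0):
    """Grow equivalence classes greedily; return (published, residual)."""
    rng = random.Random(seed)
    remaining = list(individuals)
    published, residual = [], []
    while remaining:
        # Start a new class from a randomly selected individual.
        cluster = [remaining.pop(rng.randrange(len(remaining)))]
        # Keep adding the nearest individual until the model is satisfied.
        while not satisfies(cluster) and remaining:
            nearest = min(remaining, key=lambda p: dist(p, cluster))
            remaining.remove(nearest)
            cluster.append(nearest)
        if satisfies(cluster):
            published.append(cluster)
        else:
            residual = cluster  # D is empty and the class is still unsafe
    return published, residual


# Hypothetical individuals: (age, sensitive value), one record each.
people = [(30, "flu"), (31, "hiv"), (32, "flu"),
          (60, "cancer"), (61, "flu"), (62, "hiv")]


def age_distance(p, cluster):
    return abs(p[0] - sum(q[0] for q in cluster) / len(cluster))


def two_values(cluster):
    return len({s for _, s in cluster}) >= 2
```

Swapping the `satisfies` argument is the counterpart of calling a different decision function for a different privacy model.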

Table 5
Detailed description of the dataset used in our experiment
Attribute    Distinct values

All experiment results are the mean values of those from 50 experiments. To compare our DAnonyIR_El with DAnonyIR_kl and GeneIR_kl, parameters l and |QI| are varied to show variation trends with respect to the percentage of vulnerable equivalence classes, data quality, and runtime of these algorithms. To analyse the performance of DAnonyIR_Eαβ, DAnonyIR_αβ, and GeneIR_αβ, parameters α, β, and |QI| are varied. We use a real-world dataset, which appeared in the INFORMS data mining contest 2008, in which each patient has one or multiple diagnosis records.