Abstract
Transaction data are increasingly used in applications, such as marketing research and biomedical studies. Publishing these data, however, may risk privacy breaches, as they often contain personal information about individuals. Approaches to anonymizing transaction data have been proposed recently, but they may produce excessively distorted and inadequately protected solutions. This is because these approaches do not consider privacy requirements that are common in real-world applications in a realistic and flexible manner, and attempt to safeguard the data only against either identity disclosure or sensitive information inference. In this paper, we propose a new approach that overcomes these limitations. We introduce a rule-based privacy model that allows data publishers to express fine-grained protection requirements for both identity and sensitive information disclosure. Based on this model, we also develop two anonymization algorithms. Our first algorithm works in a top-down fashion, employing an efficient strategy to recursively generalize data with low information loss. Our second algorithm uses sampling and a combination of top-down and bottom-up generalization heuristics, which greatly improves scalability while maintaining low information loss. Extensive experiments show that our algorithms significantly outperform the state-of-the-art in terms of retaining data utility, while achieving good protection and scalability.
Notes
We assume that these methods are applied to non-sensitive items. Otherwise, \(56\) itemsets would receive protection.
Specialization is the reverse operation of generalization [26].
This is the statistically best strategy an attacker can follow [9].
We assume that \(k^m\)-anonymity is applied to public items.
Recall from Sect. 4.1.2 that \(\varTheta \) does not contain rules that are always protected.
These itemsets represent individuals’ sensitive information, which is unknown to an attacker. A similar assumption was made in [87], for individuals’ sensitive values.
We employ two hierarchies, for simplicity. Using a single hierarchy containing all items of \(\mathcal I \) is possible, and it requires trivial changes in the process of generating PS-rules.
We do not consider empty \(T_\mathcal{P }\) and \(T_\mathcal{S }\subseteq T_\mathcal{N }\), where \(T_\mathcal{S }\) is the set of sensitive items in \(T\), as they are unlikely to lead to meaningful privacy attacks. Our approach can be trivially modified to deal with such transactions.
References
Health Insurance Portability and Accountability Act of 1996. United States Public Law
Abul O, Bonchi F, Giannotti F (2010) Hiding sequential and spatiotemporal patterns. TKDE 22(12):1709–1723
Abul O, Bonchi F, Nanni M (2008) Never walk alone: uncertainty for anonymity in moving objects databases. In: ICDE, pp 376–385
Aggarwal CC (2005) On k-anonymity and the curse of dimensionality. In: VLDB, pp 901–909
Aggarwal CC, Li Y, Yu PS (2011) On the hardness of graph anonymization. In: ICDM, pp 1002–1007
Aggarwal CC, Yu PS (2008) Privacy-preserving data mining: models and algorithms. Springer, Berlin
Agrawal R, Johnson CM (2007) Securing electronic health records without impeding the flow of information. Int J Med Inform 76(5–6):471–479
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499
Bacchus F, Grove AJ, Halpern JY, Koller D (1996) From statistical knowledge bases to degrees of belief. Artif Intell 87(1–2):75–143
Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K (2007) Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: PODS, pp 273–282
Barbaro M, Zeller T (2006) A face is exposed for AOL searcher no. 4417749. New York Times, Aug
Basu S, Mooney RJ, Pasupuleti KV, Ghosh J (2001) Evaluating the novelty of text-mined rules using lexical knowledge. In: KDD, pp 233–238
Blum A, Dwork C, McSherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: PODS, pp 128–138
Bonchi F, Lakshmanan LVS (2011) Trajectory anonymity in publishing personal mobility data. SIGKDD Explor 13(1):30–42
Brickell J, Shmatikov V (2008) The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: KDD, pp 70–78
Cao J, Karras P, Raïssi C, Tan K (2010) \(\rho \)-Uncertainty: inference-proof transaction anonymization. PVLDB 3(1):1033–1044
Centers for Medicare and Medicaid Services (2011) International classification of diseases, 9th revision, clinical modification (ICD-9-CM). http://www.cdc.gov/nchs/icd/icd9cm.htm. Accessed 20 March 2011
Chen K, Liu L (2011) Geometric data perturbation for privacy preserving outsourced data mining. Knowl Inf Syst 29(3):657–695
Chen R, Mohammed N, Fung BCM, Desai BC, Xiong L (2011) Publishing set-valued data via differential privacy. PVLDB 4(11):1087–1098
Cormode G (2011) Personal privacy vs population privacy: learning to attack anonymization. In: KDD, pp 1253–1261
Cormode G, Li N, Li T, Srivastava D (2010) Minimizing minimality and maximizing utility: analyzing method-based attacks on anonymized data. PVLDB 3(1):1045–1056
Dwork C (2006) Differential privacy. In: ICALP, pp 1–12
Dwork C (2008) Differential privacy: a survey of results. In: TAMC, pp 1–19
Friedman A, Schuster A (2010) Data mining with differential privacy. In: KDD, pp 493–502
Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey on recent developments. ACM Comput Surv 42(4):1–53
Fung BCM, Wang K, Chen R, Yu PS (2005) Top-down specialization for information and privacy preservation. In: ICDE, pp 205–216
Ghinita G, Kalnis P, Tao Y (2011) Anonymous publication of sensitive transactional data. IEEE TKDE 23(2):161–174
Ghinita G, Karras P, Kalnis P, Mamoulis N (2007) Fast data anonymization with low information loss. In: VLDB, pp 758–769
Ghinita G, Tao Y, Kalnis P (2008) On the anonymization of sparse high-dimensional data. In: ICDE, pp 715–724
Gkoulalas-Divanis A, Loukides G (2011) Revisiting sequential pattern hiding to enhance utility. In: KDD, pp 1316–1324
Gkoulalas-Divanis A, Verykios VS (2009) Hiding sensitive knowledge without side effects. Knowl Inf Syst 20(3):263–299
Gkoulalas-Divanis A, Verykios VS, Mokbel MF (2009) Identifying unsafe routes for network-based trajectory privacy. In: SDM, pp 942–953
Hagerup T, Rüb C (1990) A guided tour of chernoff bounds. Inf Process Lett 33(6):305–308
Hay M, Li C, Miklau G, Jensen D (2009) Accurate estimation of the degree distribution of private networks. In: ICDM, pp 169–178
He Y, Naughton JF (2009) Anonymization of set-valued data via top-down, local generalization. PVLDB 2(1):934–945
Iyengar VS (2002) Transforming data to satisfy privacy constraints. In: KDD, pp 279–288
Karr AF, Kohnen CN, Oganian A, Reiter JP, Sanil AP (2006) A framework for evaluating the utility of data altered to protect confidentiality. Am Stat 60(3):224–232
Khoshgozaran A, Shahabi C, Shirani-Mehr H (2011) Location privacy: going beyond k-anonymity, cloaking and anonymizers. Knowl Inf Syst 26(3):435–465
Kifer D, Machanavajjhala A (2011) No free lunch in data privacy. In: SIGMOD, pp 193–204
Korolova A, Kenthapadi K, Mishra N, Ntoulas A (2009) Releasing search queries and clicks privately. In: WWW, pp 171–180
LeFevre K, DeWitt DJ, Ramakrishnan R (2008) Workload-aware anonymization techniques for large-scale datasets. TODS 33(3):1–47
LeFevre K, DeWitt DJ, Ramakrishnan R (2006) Mondrian multidimensional k-anonymity. In: ICDE, p 25
Li J, Liu J, Baig MM, Wong RC (2011) Information based data anonymization for classification utility. DKE 70(12):1030–1045
Li N, Li T, Venkatasubramanian S (2007) t-closeness: Privacy beyond k-anonymity and l-diversity. In: ICDE, pp 106–115
Li T, Li N (2006) Optimal k-anonymity with flexible generalization schemes through bottom-up searching. In: ICDMW, pp 518–523
Liu B, Hsu W, Chen S (1997) Using general impressions to analyze discovered classification rules. In: KDD, pp 31–36
Liu B, Hsu W, Wang K, Chen S (1999) Visually aided exploration of interesting association rules. In: PAKDD, pp 380–389
Liu J, Wang K (2010) Anonymizing transaction data by integrating suppression and generalization. In: PAKDD, pp 171–180
Loukides G, Denny JC, Malin B (2010) The disclosure of diagnosis codes can breach research participants’ privacy. J Am Med Inform Assoc 17(3):322–327
Loukides G, Gkoulalas-Divanis A, Malin B (2010) Anonymization of electronic medical records for validating genome-wide association studies. Proc Natl Acad Sci 17(107):7898–7903
Loukides G, Gkoulalas-Divanis A, Malin B (2011) COAT: Constraint-based anonymization of transactions. Knowl Inf Syst 28(2):251–282
Loukides G, Gkoulalas-Divanis A, Shao J (2010) Anonymizing transaction data to eliminate sensitive inferences. In: DEXA, pp 400–415
Machanavajjhala A, Gehrke J, Götz M (2009) Data publishing against realistic adversaries. PVLDB 2(1):790–801
Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M (2006) l-diversity: privacy beyond k-anonymity. In: ICDE, p 24
Machanavajjhala A, Kifer D, Abowd JM, Gehrke J, Vilhuber L (2008) Privacy: theory meets practice on the map. In: ICDE, pp 277–286
Medforth N, Wang K (2011) Privacy risk in graph stream publishing for social network data. In: ICDM, pp 437–446
Meyerson A, Williams R (2004) On the complexity of optimal k-anonymity. In: PODS, pp 223–228
Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: KDD, pp 493–501
Mohammed N, Fung BCM, Debbabi M (2009) Walking in the crowd: anonymizing trajectory data for pattern analysis. In: CIKM, pp 1441–1444
Mohammed N, Fung BCM, Hung PCK, Lee C (2009) Anonymizing healthcare data: a case study on the blood transfusion service. In: KDD, pp 1285–1294
Mohammed N, Fung BCM, Hung PCK, Lee C (2010) Centralized and distributed anonymization for high-dimensional healthcare data. TKDD 4(4):18
Narayanan A, Shmatikov V (2008) Robust de-anonymization of large sparse datasets. In: IEEE S&P, pp 111–125
Nergiz ME, Atzori M, Clifton C (2007) Hiding the presence of individuals from shared databases. In: SIGMOD, pp 665–676
Nergiz ME, Clifton C, Nergiz AE (2009) Multirelational k-anonymity. TKDE 21(8):1104–1117
Texas Dept. of State Health Services (2008) User manual of texas hospital inpatient discharge public use data file. http://www.dshs.state.tx.us/THCIC/
Oliveira SRM, Zaïane OR (2003) Protecting sensitive knowledge by data sanitization. In: ICDM, pp 613–616
Qiu L, Li Y, Wu X (2008) Protecting business intelligence and customer privacy while outsourcing data mining tasks. Knowl Inf Syst 17(1):99–120
Samarati P (2001) Protecting respondents identities in microdata release. TKDE 13(9):1010–1027
Srikant R, Vu Q, Agrawal R (1997) Mining association rules with item constraints. In: KDD, pp 67–73
Sweeney L (2002) k-Anonymity: a model for protecting privacy. IJUFKS 10(5):557–570
Tai C, Yu P, Yang D, Chen M (2011) Privacy-preserving social network publication against friendship attacks. In: KDD, pp 1262–1270
Teng Z, Du W (2009) A hybrid multi-group approach for privacy-preserving data mining. Knowl Inf Syst 19(2):133–157
Terrovitis M, Mamoulis N (2008) Privacy preservation in the publication of trajectories. In: MDM, pp 65–72
Terrovitis M, Mamoulis N, Kalnis P (2008) Privacy-preserving anonymization of set-valued data. PVLDB 1(1):115–125
Terrovitis M, Mamoulis N, Kalnis P (2011) Local and global recoding methods for anonymizing set-valued data. VLDB J 20(1):83–106
Velardi P, Cucchiarelli A, Petit M (2007) A taxonomy learning method and its application to characterize a scientific web community. TKDE 19(2):180–191
Verykios V, Elmagarmid AK, Bertino E, Saygin Y, Dasseni E (2004) Association rule hiding. TKDE 16(4):434–447
Wang K, Fung BCM, Yu PS (2007) Handicapping attacker’s confidence: an alternative to k-anonymization. Knowl Inf Syst 11(3):345–368
Wang K, Fung BCM, Yu PS (2005) Template-based privacy preservation in classification problems. In: ICDM, pp 466–473
Wang K, Xu Y, Fu A, Wong RCW (2009) FF-anonymity: when quasi-identifiers are missing. In: ICDE, pp 1136–1139
Wong RCW, Fu A, Wang K, Pei J (2007) Minimality attack in privacy preserving data publishing. In: VLDB, pp 543–554
Wong RCW, Fu A, Wang K, Pei J (2009) Anonymization-based attacks in privacy-preserving data publishing. TODS 34(2):1–46
Wong RCW, Li J, Fu A, Wang K (2006) \((\alpha ,k)\)-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing. In: KDD, pp 754–759
Xiao X, Tao Y (2006) Anatomy: simple and effective privacy preservation. In: VLDB, pp 139–150
Xiao X, Tao Y (2006) Personalized privacy preservation. In: SIGMOD, pp 229–240
Xiao X, Tao Y (2007) M-invariance: towards privacy preserving re-publication of dynamic datasets. In: SIGMOD, pp 689–700
Xiao X, Tao Y, Koudas N (2010) Transparent anonymization: Thwarting adversaries who know the algorithm. TODS 35(2):1–48
Xiao X, Wang G, Gehrke J (2010) Differential privacy via wavelet transforms. In: ICDE, pp 225–236
Xu J, Wang W, Pei J, Wang X, Shi B, Fu AW-C (2006) Utility-based anonymization using local recoding. In: KDD, pp 785–790
Xu Y, Fung BCM, Wang K, Fu AW, Pei J (2008) Publishing sensitive transactions for itemset utility. In: ICDM, pp 1109–1114
Xu Y, Wang K, Fu AW-C, Yu PS (2008) Anonymizing transaction databases for publication. In: KDD, pp 767–775
Yakut I, Polat H (2012) Privacy-preserving hybrid collaborative filtering on cross distributed data. Knowl Inf Syst 30(2):405–433
Ying X, Wu X (2011) On link privacy in randomizing social networks. Knowl Inf Syst 28(3):645–663
Zhang L, Jajodia S, Brodsky A (2007) Information disclosure under realistic assumptions: privacy versus optimality. In: CCS, pp 573–583
Zhou B, Pei J (2011) The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks. Knowl Inf Syst 28(1):47–77
Acknowledgments
We would like to thank the handling editor and anonymous reviewers for their insightful comments, as well as the authors of [74] and [91] for providing the implementations of Apriori and Greedy, respectively. The first author was supported by a Research Fellowship from the Royal Academy of Engineering.
Appendix
Proof of Theorem 1
We observe that the NP-hard problem of optimal privacy-constrained anonymization (OPC) [51] can be reduced to our problem. OPC seeks to transform \(\mathcal D \) into a \(\tilde{\mathcal{D }}\) that has minimum \(UL(\tilde{\mathcal{D }})\), using the set-based item generalization model, w.r.t. a set of utility constraints \(\fancyscript{U}\), a set of privacy constraints \(\fancyscript{P}\), and parameters \(k\) and \(s\). We can map an instance of OPC to an instance of our problem in polynomial time by constructing \(\varTheta \) so that, for each \(p\in \fancyscript{P}\), it contains a PS-rule whose antecedent comprises exactly the items in \(p\), and by setting \(s=0\,\%\) (disallowing suppression) and \(c=1\). Then, \(\tilde{\mathcal{D }}\) is a solution to our problem if and only if it is a solution to the OPC problem. \(\square \)
Proof of Theorem 2
Assume that a \(\tilde{\mathcal{D }}\) in which the PS-rules in \(\varTheta \) are protected can be constructed from \(\mathcal D \), for given \(k\) and \(c\), when \(sup(I \cup J,\mathcal D )> N \times c\), for a rule \(I\rightarrow J\) in \(\varTheta \). Then, from Definition 4, we have \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J,\tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})}\le c\). Since \(sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})\ge sup(I \cup J,\mathcal D )\) from the way \(\varPhi \) works, and \(sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\le N\), we have \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\ge \frac{sup(I\cup J,\mathcal D )}{N}>c\), which contradicts our assumption and proves the theorem. \(\square \)
Proof of Theorem 3
Consider a dataset \(\tilde{\mathcal{D }}\) constructed using a generalization function \(\varPhi \) that maps each and every \(i\in \mathcal P \) to the same \(\tilde{i}\in \tilde{\mathcal{P }}\). Since we have \(sup(\tilde{i},\tilde{\mathcal{D }})=N \ge k\) for all rules in \(\varTheta \), these rules satisfy Condition (1) of Definition 4 for \(\tilde{\mathcal{D }}\). Also, as \(sup(\bigcup _{\forall i \in I_q}\varPhi (i) \cup J_q, \tilde{\mathcal{D }})\le N\times c\) and \(sup(\tilde{i},\tilde{\mathcal{D }})=N\) hold, we have \(\frac{sup(\bigcup _{\forall i \in I_q}\varPhi (i) \cup J_q, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I_q}\varPhi (i),\tilde{\mathcal{D }})}\le c\). So the rules in \(\varTheta \) also satisfy Condition (2) of Definition 4, and are all protected in \(\tilde{\mathcal{D }}\). Thus, \(\tilde{\mathcal{D }}\) is a generalized dataset that is constructed from \(\mathcal D \) as required. \(\square \)
Proof of Theorem 4
Assume that \(UL(\tilde{\mathcal{D _1}})<UL(\tilde{\mathcal{D _2}})\). For each generalized item \(\tilde{i}\) in both \(C_1\) and \(C_2\), we have \(UL(\tilde{i},\tilde{\mathcal{D _1}})=UL(\tilde{i},\tilde{\mathcal{D _2}})\). Thus, for our assumption to hold, there must be sets of generalized items \(C_1^{\prime }= C_1{\setminus }C_2\) and \(C_2^{\prime }=C_2{\setminus }C_1\) such that \(\sum _{\tilde{i_x}\in {C_1^{\prime }}}UL(\tilde{i_x},\tilde{\mathcal{D _1}})<\sum _{\tilde{i_y}\in {C_2^{\prime }}}UL(\tilde{i_y},\tilde{\mathcal{D _2}})\), or equivalently \(\sum _{\forall \tilde{i_x}\in {C_1^{\prime }}}((2^{|\tilde{i_x}|}-1)\times w(\tilde{i_x})\times sup(\tilde{i_x},\tilde{\mathcal{D _1}}))<\sum _{\forall \tilde{i_y}\in {C_2^{\prime }}}((2^{|\tilde{i_y}|}-1)\times w(\tilde{i_y})\times sup(\tilde{i_y},\tilde{\mathcal{D _2}}))\). However, this cannot be true, because for each \(\tilde{i_x}\in C_1^{\prime }\) we have \(|\tilde{i_x}|=\sum _{\forall \tilde{i_q} \in desc(\tilde{i_x})}|\tilde{i_q}|\) (by definition), \(sup(\tilde{i_x},\tilde{\mathcal{D _1}})\ge \sum _{\forall \tilde{i_q} \in desc(\tilde{i_x})}sup(\tilde{i_q}, \tilde{\mathcal{D _2}})\) (by B-Split), and \(w(\tilde{i_x})\ge \sum _{\forall \tilde{i_q}\in desc(\tilde{i_x})}w(\tilde{i_q})\) (the condition given in Theorem 4). Thus, we cannot have \(UL(\tilde{\mathcal{D _1}})<UL(\tilde{\mathcal{D _2}})\). \(\square \)
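The utility-loss measure used in the proof above can be illustrated with a small sketch, which evaluates \(UL(\tilde{i},\tilde{\mathcal{D }})=(2^{|\tilde{i}|}-1)\times w(\tilde{i})\times sup(\tilde{i},\tilde{\mathcal{D }})\) and sums it over the generalized items of a cut. The cut, weights, and transactions below are hypothetical, and the encoding of transactions as sets of generalized items is an assumption for illustration.

```python
def support(gen_item, dataset):
    """Number of transactions that contain the generalized item."""
    return sum(1 for t in dataset if gen_item in t)

def utility_loss(cut, weights, dataset):
    """UL of a generalized dataset: sum of UL(i~, D~) over the cut's items."""
    return sum((2 ** len(g) - 1) * weights[g] * support(g, dataset)
               for g in cut)

# Hypothetical cut: (a,b) generalizes items a and b; (c) stays as is.
g_ab, g_c = frozenset({"a", "b"}), frozenset({"c"})
cut = [g_ab, g_c]
weights = {g_ab: 0.5, g_c: 1.0}
dataset = [{g_ab, g_c}, {g_ab}]          # transactions over generalized items

print(utility_loss(cut, weights, dataset))
# (2^2 - 1) * 0.5 * 2 + (2^1 - 1) * 1.0 * 1 = 4.0
```

The condition of Theorem 4, \(w(\tilde{i_x})\ge \sum _{\tilde{i_q}\in desc(\tilde{i_x})}w(\tilde{i_q})\), then guarantees that splitting a generalized item into its descendants never increases this sum.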
Proof of Theorem 5
Consider an arbitrary generalized dataset \(\tilde{\mathcal{D }}\) that is constructed from \(\mathcal D \) using a set-based generalization function \(\varPhi \). We prove that \(I\rightarrow J\) is protected in \(\tilde{\mathcal{D }}\) by showing that it satisfies both conditions of Definition 4. From \(sup(I,\mathcal D )\le sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\) and condition (1) of Theorem 5, we get \(sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\ge k\). Thus, \(I\rightarrow J\) satisfies condition (1) of Definition 4. Also, from \(sup(J,\mathcal D )\ge sup(\bigcup _{\forall i \in I}\varPhi (i) \cup J,\tilde{\mathcal{D }})\) and condition (2) of Theorem 5, we have \(sup(\bigcup _{\forall i \in I}\varPhi (i) \cup J,\tilde{\mathcal{D }})\le c\times k\). By combining the latter relation with \(sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\ge k\), we get \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i) \cup J,\tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})}\le c\). Thus, \(I\rightarrow J\) also satisfies condition (2) of Definition 4. \(\square \)
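The two protection conditions of Definition 4 that the proof verifies can be sketched directly. The sketch below assumes a simple encoding in which a transaction is a set containing generalized public items (frozensets of original items) and plain sensitive items; the dataset, rule, and parameter values are hypothetical.

```python
# Definition 4's two conditions for a PS-rule I -> J under generalization Phi:
#   (1) sup(U Phi(i), D~) >= k
#   (2) sup(U Phi(i) U J, D~) / sup(U Phi(i), D~) <= c

def sup(itemset, dataset):
    """Support: number of transactions containing every element of itemset."""
    return sum(1 for t in dataset if itemset <= t)

def is_protected(antecedent_gen, consequent, dataset, k, c):
    """antecedent_gen holds the generalized items Phi(i), i in I."""
    s_i = sup(antecedent_gen, dataset)
    if s_i < k:
        return False                      # violates condition (1)
    s_ij = sup(antecedent_gen | consequent, dataset)
    return s_ij / s_i <= c                # condition (2)

g_ab = frozenset({"a", "b"})              # a and b generalized together
D = [{g_ab, "flu"}, {g_ab, "hiv"}, {g_ab}, {g_ab, "flu"}]

print(is_protected({g_ab}, {"hiv"}, D, k=2, c=0.5))   # True: 1/4 <= 0.5
print(is_protected({g_ab}, {"flu"}, D, k=2, c=0.25))  # False: 2/4 > 0.25
```

Theorem 5 states sufficient conditions on the original dataset \(\mathcal D \) under which this check succeeds for every set-based generalization \(\varPhi \).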
Proof of Theorem 6
Assume that \(I\rightarrow J\) is protected in \(\tilde{\mathcal{D _1}}\) but not in \(\tilde{\mathcal{D _2}}\), and that no item \(i \in I\) maps to \(\tilde{i_l}\) or \(\tilde{i_r}\). By the construction of \(C_2\), no such \(i\) is mapped to \(\tilde{i}\) either. Thus, we have \(sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _1}})=sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _2}})\), since, other than \(\tilde{i_l}\) and \(\tilde{i_r}\), each generalized item has the same support in \(\tilde{\mathcal{D _1}}\) and \(\tilde{\mathcal{D _2}}\). This implies \(\frac{sup(\bigcup _{\forall i\in I}\varPhi (i)\cup J,\tilde{\mathcal{D _1}})}{sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _1}})}=\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J,\tilde{\mathcal{D _2}})}{sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _2}})}\). Thus, \(I\rightarrow J\) is protected in \(\tilde{\mathcal{D _2}}\), which contradicts our assumption and proves the theorem. \(\square \)
Proof of Theorem 7
Assume that \(I^{\prime }\rightarrow J\) is not protected in \(\tilde{\mathcal{D }}\). Due to condition (1), we have \(sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})\ge k\) and \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\le c\), and, from conditions (2) and (3), we have \(sup(\bigcup _{\forall i^{\prime }\in I^{\prime }}\varPhi (i^{\prime }),\tilde{\mathcal{D }})=sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\), since all items in \(I^{\prime }\) are contained in \(I\) and all items in \(I\) are mapped to \(\bigcup _{\forall i \in I}\varPhi (i)\). Thus, \(I^{\prime }\rightarrow J\) is protected according to Definition 4, which contradicts our assumption. \(\square \)
Proof of Theorem 8
Assume that \(I\rightarrow J^{\prime }\) is not protected in \(\tilde{\mathcal{D }}\). Since \(sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})\ge k\) holds from condition (1), \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J^{\prime }, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}>c\) must hold for our assumption to be true. From \(sup(J^{\prime }, \tilde{\mathcal{D }}_{\tilde{I}})=sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J^{\prime }, \tilde{\mathcal{D }})\), \(sup(J, \tilde{\mathcal{D }}_{\tilde{I}})=sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})\), and condition (2), we get \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J^{\prime }, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\le \frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\le c\), which contradicts our assumption. \(\square \)
Proof of Theorem 9
Let \(X_1,\ldots ,X_n\) be independent Poisson trials with \(Pr(X_i=1)=p_i\), and let \(\bar{X}=\frac{1}{n}\sum _{\forall i\in [1,n]}X_i\) and \(\mu =E[\bar{X}]\). For any \(\gamma >0\), the additive Chernoff bound \(Pr(|\bar{X}-\mu |\ge \gamma )\le 2\times e^{-2n\gamma ^2}\) holds [46]. By setting \(n=|\mathcal D _s|\), \(\gamma =\epsilon \), \(\bar{X}=\frac{sup(I,\mathcal D _s)}{|\mathcal D _s|}\), and \(\mu =\frac{sup(I,\mathcal D )}{|\mathcal D |}\), we get \(Pr(|\frac{sup(I,\mathcal D _s)}{|\mathcal D _s|}-\frac{sup(I,\mathcal D )}{|\mathcal D |}|\ge \epsilon )\le 2\times e^{-2|\mathcal D _s|\epsilon ^2}\). Requiring \(2\times e^{-2|\mathcal D _s|\epsilon ^2}\le \delta \) and solving for \(|\mathcal D _s|\), we get \(|\mathcal D _s|\ge \frac{ln(\frac{2}{\delta })}{2\times \epsilon ^2}\), which proves the theorem. \(\square \)
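The sample-size bound of Theorem 9, \(|\mathcal D _s|\ge \frac{ln(2/\delta )}{2\epsilon ^2}\), is easy to evaluate numerically. The sketch below computes the smallest qualifying sample size; the \((\epsilon ,\delta )\) values are illustrative.

```python
import math

def min_sample_size(eps, delta):
    """Smallest |D_s| satisfying |D_s| >= ln(2/delta) / (2 * eps^2),
    so that an itemset's relative support in the sample deviates from
    its relative support in D by eps or more with probability <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# A sample of about a thousand transactions suffices for a 5% deviation
# bound that fails with probability at most 1%:
print(min_sample_size(0.05, 0.01))  # 1060
```

Notably, the bound depends only on \(\epsilon \) and \(\delta \), not on \(|\mathcal D |\), which is what makes the sampling-based phase of Algorithm 5 scale to large datasets.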
Theorem 10
The time complexity of Algorithm 4 is \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\) and the space complexity is \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\).
Proof
We first examine the time complexity of B-Split, Update, Replace, and Check. B-Split requires \(O\left( |\mathcal P |^2+|\mathcal P |\right)\approx O(|\mathcal P |^2)\) time, where \(|\mathcal P |\) is the size of the largest possible generalized item, since it examines all pairs of items mapped to this generalized item and then assigns all public items to the created seeds. Update scans \(\tilde{\mathcal{D }}\) once to replace \(\tilde{i}\) with \(\tilde{i_l}\) and \(\tilde{i_r}\), hence it runs in \(O(|\mathcal P |\times N)\) time, where \(N\) is the number of transactions in \(\tilde{\mathcal{D }}\). Replace has the same time complexity as Update, and Check needs \(O(|\varTheta |\times (N+|\mathcal P |))\) time: \(O(N)\) time to form \(I\rightarrow J\) and determine whether it is protected, \(O(|\mathcal P |\times (log(|\mathcal P |)+|\varTheta |))\) time to construct \(\varTheta ^{\prime }\) using Find-rules in the worst case, in which all rules have at least one item mapped to \(\tilde{i}\) in their antecedent, and \(O(|\varTheta ^{\prime }|\times N)\) time to compute the support and confidence of these rules. Since Algorithm 4 performs up to \(\lfloor \frac{2\times |\mathcal P |-1}{2}\rfloor \approx |\mathcal P |\) recursive calls, one for each edge in the generalization tree, it needs \(O\left(|\mathcal P |\times \left(|\mathcal P |^2 + |\mathcal P |\times N + |\varTheta |\times (N+|\mathcal P |)\right)\right)\) time, which can be approximated as \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\) in the worst case, when \(|\varTheta |=(2^{|\mathcal P |}-1)\times |\mathcal S |\), assuming that \(N>|\mathcal P |\). Note that \(|\varTheta |=(2^{|\mathcal P |}-1)\times |\mathcal S |\) when \(\varTheta \) contains all possible rules and no rule whose antecedent coincides with that of another rule has a consequent that is a superset of the other rule's consequent (such a rule would be redundant, due to its lower confidence).
We then examine the space complexity of Update, Replace, and Check. Update and Replace need to store \(O(N\times |\mathcal I |)\) items each. The Rule-tree used in Check is created once and stores \(O(2^{|\mathcal P |}\times |\mathcal S |)\) items, as inserting a rule adds one item as a tree node and \(|\mathcal S |\) items in the consequent-list. \(Q\) needs to store \(O(|\mathcal P |^2)\) items when the generalization tree is the tallest one, hence Algorithm 4 requires \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\) space. \(\square \)
Theorem 11
The time complexity of Algorithm 5 is \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\) and the space complexity is \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\).
Proof
The time complexity of the sample-based partitioning phase of Algorithm 5 is \(O(h\times (|\mathcal P |^2+|\mathcal P |\times N+|\varTheta |\times (|\mathcal P |+|\mathcal D _s|)))\), where \(h\) is the height of the generalization tree \(\mathcal G \) returned by Sample-Based-Partition. This is because the latter function calls Split and Update, whose cost was examined in Theorem 10, as well as Check, whose cost is \(O(|\varTheta |\times (|\mathcal P |+|\mathcal D _s|))\). The top-down cut revision phase needs \(O(h^{\prime }\times (|\mathcal P |^2+|\mathcal P |\times N + |\mathcal P |\times (log(|\mathcal P |)+|\varTheta |)+ |\varTheta |\times (|\mathcal P |+N)))\) time, where \(h^{\prime }\) is the number of executions of the while loop of step \(9\) in Algorithm 5. Finally, steps 24–29 need \(O(h^{\prime \prime }\times (|\varTheta |\times (|\mathcal P |+N)+|\mathcal P |+N))\) time, assuming that all nodes of \(\mathcal G \) are merged to their ancestors that lie \(h^{\prime \prime }\) levels above them, since Check and Merge-siblings take \(O(|\varTheta |\times (|\mathcal P |+N))\) and \(O(|\mathcal P |+N)\) time, respectively. It also holds that \(h+h^{\prime }\le |\mathcal P |\) and \(h^{\prime \prime }\le |\mathcal P |\), as \(\mathcal G \) has up to \(|\mathcal P |\) levels. Thus, Algorithm 5 takes \(O(|\mathcal P |\times (|\mathcal P |^2+|\mathcal P |\times N+|\varTheta |\times (|\mathcal P |+N)))\) time, or approximately \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\), in the worst case, when \(|\varTheta |=O(2^{|\mathcal P |}\times |\mathcal S |)\), assuming that \(N>|\mathcal P |\). The space complexity of Update and Check was examined in Theorem 10, and storing \(\mathcal G \) requires \(O(|\mathcal P |^2)\) items. Thus, Algorithm 5 requires \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\) space. \(\square \)
COUNT() Queries and \(ARE\) Computation
Consider an SQL-like COUNT() query
\(\mathbf{Q:}\) SELECT COUNT( \(\tilde{T_n}\) (or \(T_n\) )) FROM \(\tilde{\mathcal{D }}\) (or \(\mathcal D \) )
WHERE \(\tilde{I}J\) supports \(\tilde{T_n}\) in \(\tilde{\mathcal{D }}\) (or \(IJ\) supports \(T_n\) in \(\mathcal D \))
where \(\tilde{I}\) and \(I\) are itemsets comprised of public items in \(\tilde{\mathcal{D }}\) and \(\mathcal D \), respectively, and \(J\) is an itemset comprised of sensitive items. Since data recipients have access only to \(\tilde{\mathcal{D }}\), they need to estimate an answer for \(Q\). This can be done by computing the probability \(P(\tilde{T_n},Q)\) that \(\tilde{T_n}\), the anonymized version of a transaction \(T_n\) in \(\mathcal D \), satisfies \(Q\). Assume that a generalized item \(\tilde{i_m}\) in \(\tilde{T_n}\) is interpreted as any possible non-empty subset of the items mapped to it with equal probability, and that there are no correlations among generalized items [27, 29, 42]. \(P(\tilde{T_n},Q)\) is given by \(\displaystyle \Pi _{\forall \tilde{i_m} \in \tilde{T_n}}\frac{2^{|\tilde{i_m}|-c(I)}}{2^{|\tilde{i_m}|}-1}\), where \(c\) is a function with range \([0,|\tilde{i_m}|]\) that, given \(I\), returns the number of items in \(I\) that are mapped to \(\tilde{i_m}\) in \(\tilde{\mathcal{D }}\). An approximate answer \(e(Q)\) to \(Q\) is then derived by summing the corresponding probabilities across all transactions \(\tilde{T_n}\) in \(\tilde{\mathcal{D }}\) that support \(J\).
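The estimation scheme above can be sketched as follows. The sketch assumes a hypothetical encoding in which each anonymized transaction is a pair of (generalized public items, sensitive items), takes the product only over generalized items that contain items of \(I\) (i.e., with \(c(I)\ge 1\)), and returns probability zero if some item of \(I\) is not mapped at all; the dataset and query are illustrative.

```python
def p_supports(gen_items, I):
    """Probability that a random interpretation of the transaction's
    generalized items supports itemset I: each generalized item is read
    as a uniformly random non-empty subset of its mapped items, and
    distinct generalized items are interpreted independently."""
    p = 1.0
    remaining = set(I)
    for g in gen_items:
        c = len(remaining & g)                     # items of I mapped to g
        if c:
            p *= 2 ** (len(g) - c) / (2 ** len(g) - 1)
            remaining -= g
    return p if not remaining else 0.0             # some item of I unmapped

def estimate_count(dataset, I, J):
    """e(Q): sum of probabilities over transactions that support J."""
    return sum(p_supports(gens, I)
               for gens, sensitive in dataset if J <= sensitive)

g1 = frozenset({"a", "b"})                         # a, b generalized together
D = [(frozenset({g1}), {"hiv"}),                   # (generalized, sensitive)
     (frozenset({g1}), {"flu"})]

# Q: how many transactions contain public item a and sensitive item hiv?
print(estimate_count(D, I={"a"}, J={"hiv"}))       # 2^(2-1)/(2^2-1) = 2/3
```

The relative error \(ARE\) of a workload then compares \(e(Q)\) to the exact answer obtained on the original dataset \(\mathcal D \).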
We considered two workloads \(W_1\) and \(W_2\), each comprised of 1,000 queries whose items were selected uniformly at random. Queries in \(W_1\) involve a \(2\)-itemset in \(\mathcal P \), whereas those in \(W_2\) retrieve a \(3\)-itemset comprised of \(2\) public items and \(1\) sensitive item.
Loukides, G., Gkoulalas-Divanis, A. & Shao, J. Efficient and flexible anonymization of transaction data. Knowl Inf Syst 36, 153–210 (2013). https://doi.org/10.1007/s10115-012-0544-3