Efficient and flexible anonymization of transaction data

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Transaction data are increasingly used in applications, such as marketing research and biomedical studies. Publishing these data, however, may risk privacy breaches, as they often contain personal information about individuals. Approaches to anonymizing transaction data have been proposed recently, but they may produce excessively distorted and inadequately protected solutions. This is because these approaches do not consider privacy requirements that are common in real-world applications in a realistic and flexible manner, and attempt to safeguard the data only against either identity disclosure or sensitive information inference. In this paper, we propose a new approach that overcomes these limitations. We introduce a rule-based privacy model that allows data publishers to express fine-grained protection requirements for both identity and sensitive information disclosure. Based on this model, we also develop two anonymization algorithms. Our first algorithm works in a top-down fashion, employing an efficient strategy to recursively generalize data with low information loss. Our second algorithm uses sampling and a combination of top-down and bottom-up generalization heuristics, which greatly improves scalability while maintaining low information loss. Extensive experiments show that our algorithms significantly outperform the state-of-the-art in terms of retaining data utility, while achieving good protection and scalability.


Notes

  1. We assume that these methods are applied to non-sensitive items. Otherwise, \(56\) itemsets would receive protection.

  2. Specialization is the reverse operation of generalization [26].

  3. This is the statistically best strategy an attacker can follow [9].

  4. We assume that \(k^m\)-anonymity is applied to public items.

  5. Recall from Sect. 4.1.2 that \(\varTheta \) does not contain rules that are always protected.

  6. These itemsets represent individuals’ sensitive information, which is unknown to an attacker. A similar assumption was made in [87], for individuals’ sensitive values.

  7. http://www.octopus.com.hk.

  8. We employ two hierarchies for simplicity. Using a single hierarchy containing all items of \(\mathcal I \) is possible, and it requires only trivial changes to the process of generating PS-rules.

  9. We do not consider empty \(T_\mathcal{P }\) and \(T_\mathcal{S }\subseteq T_\mathcal{N }\), where \(T_\mathcal{S }\) is the set of sensitive items in \(T\), as they are unlikely to lead to meaningful privacy attacks. Our approach can be trivially modified to deal with such transactions.

References

  1. Health Insurance Portability and Accountability Act of 1996, United States Public Law

  2. Abul O, Bonchi F, Giannotti F (2010) Hiding sequential and spatiotemporal patterns. TKDE 22(12):1709–1723

  3. Abul O, Bonchi F, Nanni M (2008) Never walk alone: uncertainty for anonymity in moving objects databases. In: ICDE, pp 376–385

  4. Aggarwal CC (2005) On k-anonymity and the curse of dimensionality. In: VLDB, pp 901–909

  5. Aggarwal CC, Li Y, Yu PS (2011) On the hardness of graph anonymization. In: ICDM, pp 1002–1007

  6. Aggarwal CC, Yu PS (2008) Privacy-preserving data mining: models and algorithms. Springer, Berlin

  7. Agrawal R, Johnson CM (2007) Securing electronic health records without impeding the flow of information. Int J Med Inform 76(5–6):471–479

  8. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499

  9. Bacchus F, Grove AJ, Halpern JY, Koller D (1996) From statistical knowledge bases to degrees of belief. Artif Intell 87(1–2):75–143

  10. Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K (2007) Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: PODS, pp 273–282

  11. Barbaro M, Zeller T (2006) A face is exposed for aol searcher no. 4417749. New York Times, Aug

  12. Basu S, Mooney RJ, Pasupuleti KV, Ghosh J (2001) Evaluating the novelty of text-mined rules using lexical knowledge. In: KDD, pp 233–238

  13. Blum A, Dwork C, McSherry F, Nissim K (2005) Practical privacy: the sulq framework. In: PODS, pp 128–138

  14. Bonchi F, Lakshmanan LVS (2011) Trajectory anonymity in publishing personal mobility data. SIGKDD Explor 13(1):30–42

  15. Brickell J, Shmatikov V (2008) The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: KDD, pp 70–78

  16. Cao J, Karras P, Raïssi C, Tan K (2010) \(\rho \)-Uncertainty: inference-proof transaction anonymization. PVLDB 3(1):1033–1044

  17. Centers for Medicare and Medicaid Services (2011) International classification of diseases, 9th revision, clinical modification (ICD-9-CM). http://www.cdc.gov/nchs/icd/icd9cm.htm. Accessed 20 March 2011

  18. Chen K, Liu L (2011) Geometric data perturbation for privacy preserving outsourced data mining. Knowl Inf Syst 29(3):657–695

  19. Chen R, Mohammed N, Fung BCM, Desai BC, Xiong L (2011) Publishing set-valued data via differential privacy. PVLDB 4(11):1087–1098

  20. Cormode G (2011) Personal privacy vs population privacy: learning to attack anonymization. In: KDD, pp 1253–1261

  21. Cormode G, Li N, Li T, Srivastava D (2010) Minimizing minimality and maximizing utility: analyzing method-based attacks on anonymized data. PVLDB 3(1):1045–1056

  22. Dwork C (2006) Differential privacy. In: ICALP, pp 1–12

  23. Dwork C (2008) Differential privacy: a survey of results. In: TAMC, pp 1–19

  24. Friedman A, Schuster A (2010) Data mining with differential privacy. In: KDD, pp 493–502

  25. Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey on recent developments. ACM Comput Surv 42(4):1–53

  26. Fung BCM, Wang K, Chen R, Yu PS (2005) Top-down specialization for information and privacy preservation. In: ICDE, pp 205–216

  27. Ghinita G, Kalnis P, Tao Y (2011) Anonymous publication of sensitive transactional data. IEEE TKDE 23(2):161–174

  28. Ghinita G, Karras P, Kalnis P, Mamoulis N (2007) Fast data anonymization with low information loss. In: VLDB, pp 758–769

  29. Ghinita G, Tao Y, Kalnis P (2008) On the anonymization of sparse high-dimensional data. In: ICDE, pp 715–724

  30. Gkoulalas-Divanis A, Loukides G (2011) Revisiting sequential pattern hiding to enhance utility. In: KDD, pp 1316–1324

  31. Gkoulalas-Divanis A, Verykios VS (2009) Hiding sensitive knowledge without side effects. Knowl Inf Syst 20(3):263–299

  32. Gkoulalas-Divanis A, Verykios VS, Mokbel MF (2009) Identifying unsafe routes for network-based trajectory privacy. In: SDM, pp 942–953

  33. Hagerup T, Rüb C (1990) A guided tour of chernoff bounds. Inf Process Lett 33(6):305–308

  34. Hay M, Li C, Miklau G, Jensen D (2009) Accurate estimation of the degree distribution of private networks. In: ICDM, pp 169–178

  35. He Y, Naughton JF (2009) Anonymization of set-valued data via top-down, local generalization. PVLDB 2(1):934–945

  36. Iyengar VS (2002) Transforming data to satisfy privacy constraints. In: KDD, pp 279–288

  37. Karr AF, Kohnen CN, Oganian A, Reiter JP, Sanil AP (2006) A framework for evaluating the utility of data altered to protect confidentiality. Am Stat 60(3):224–232

  38. Khoshgozaran A, Shahabi C, Shirani-Mehr H (2011) Location privacy: going beyond k-anonymity, cloaking and anonymizers. Knowl Inf Syst 26(3):435–465

  39. Kifer D, Machanavajjhala A (2011) No free lunch in data privacy. In: SIGMOD, pp 193–204

  40. Korolova A, Kenthapadi K, Mishra N, Ntoulas A (2009) Releasing search queries and clicks privately. In: WWW, pp 171–180

  41. LeFevre K, DeWitt DJ, Ramakrishnan R (2008) Workload-aware anonymization techniques for large-scale datasets. TODS 33(3):1–47

  42. LeFevre K, DeWitt DJ, Ramakrishnan R (2006) Mondrian multidimensional k-anonymity. In: ICDE, p 25

  43. Li J, Liu J, Baig MM, Wong RC (2011) Information based data anonymization for classification utility. DKE 70(12):1030–1045

  44. Li N, Li T, Venkatasubramanian S (2007) t-closeness: Privacy beyond k-anonymity and l-diversity. In: ICDE, pp 106–115

  45. Li T, Li N (2006) Optimal k-anonymity with flexible generalization schemes through bottom-up searching. In: ICDMW, pp 518–523

  46. Liu B, Hsu W, Chen S (1997) Using general impressions to analyze discovered classification rules. In: KDD, pp 31–36

  47. Liu B, Hsu W, Wang K, Chen S (1999) Visually aided exploration of interesting association rules. In: PAKDD, pp 380–389

  48. Liu J, Wang K (2010) Anonymizing transaction data by integrating suppression and generalization. In: PAKDD, pp 171–180

  49. Loukides G, Denny JC, Malin B (2010) The disclosure of diagnosis codes can breach research participants’ privacy. J Am Med Inform Assoc 17(3):322–327

  50. Loukides G, Gkoulalas-Divanis A, Malin B (2010) Anonymization of electronic medical records for validating genome-wide association studies. Proc Natl Acad Sci 17(107):7898–7903

  51. Loukides G, Gkoulalas-Divanis A, Malin B (2011) COAT: Constraint-based anonymization of transactions. Knowl Inf Syst 28(2):251–282

  52. Loukides G, Gkoulalas-Divanis A, Shao J (2010) Anonymizing transaction data to eliminate sensitive inferences. In: DEXA, pp 400–415

  53. Machanavajjhala A, Gehrke J, Götz M (2009) Data publishing against realistic adversaries. PVLDB 2(1):790–801

  54. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M (2006) l-diversity: privacy beyond k-anonymity. In: ICDE, p 24

  55. Machanavajjhala A, Kifer D, Abowd JM, Gehrke J, Vilhuber L (2008) Privacy: theory meets practice on the map. In: ICDE, pp 277–286

  56. Medforth N, Wang K (2011) Privacy risk in graph stream publishing for social network data. In: ICDM, pp 437–446

  57. Meyerson A, Williams R (2004) On the complexity of optimal k-anonymity. In: PODS, pp 223–228

  58. Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: KDD, pp 493–501

  59. Mohammed N, Fung BCM, Debbabi M (2009) Walking in the crowd: anonymizing trajectory data for pattern analysis. In: CIKM, pp 1441–1444

  60. Mohammed N, Fung BCM, Hung PCK, Lee C (2009) Anonymizing healthcare data: a case study on the blood transfusion service. In: KDD, pp 1285–1294

  61. Mohammed N, Fung BCM, Hung PCK, Lee C (2010) Centralized and distributed anonymization for high-dimensional healthcare data. TKDD 4(4):18

  62. Narayanan A, Shmatikov V (2008) Robust de-anonymization of large sparse datasets. In: IEEE S&P, pp 111–125

  63. Nergiz ME, Atzori M, Clifton C (2007) Hiding the presence of individuals from shared databases. In: SIGMOD’07, pp 665–676

  64. Nergiz ME, Clifton C, Nergiz AE (2009) Multirelational k-anonymity. TKDE 21(8):1104–1117

  65. Texas Dept. of State Health Services (2008) User manual of Texas hospital inpatient discharge public use data file. http://www.dshs.state.tx.us/THCIC/

  66. Oliveira SRM, Zaïane OR (2003) Protecting sensitive knowledge by data sanitization. In: ICDM, pp 613–616

  67. Qiu L, Li Y, Wu X (2008) Protecting business intelligence and customer privacy while outsourcing data mining tasks. Knowl Inf Syst 17(1):99–120

  68. Samarati P (2001) Protecting respondents identities in microdata release. TKDE 13(9):1010–1027

  69. Srikant R, Vu Q, Agrawal R (1997) Mining association rules with item constraints. In: KDD, pp 67–73

  70. Sweeney L (2002) k-Anonymity: a model for protecting privacy. IJUFKS 10(5):557–570

  71. Tai C, Yu P, Yang D, Chen M (2011) Privacy-preserving social network publication against friendship attacks. In: KDD, pp 1262–1270

  72. Teng Z, Du W (2009) A hybrid multi-group approach for privacy-preserving data mining. Knowl Inf Syst 19(2):133–157

  73. Terrovitis M, Mamoulis N (2008) Privacy preservation in the publication of trajectories. In: MDM, pp 65–72

  74. Terrovitis M, Mamoulis N, Kalnis P (2008) Privacy-preserving anonymization of set-valued data. PVLDB 1(1):115–125

  75. Terrovitis M, Mamoulis N, Kalnis P (2011) Local and global recoding methods for anonymizing set-valued data. VLDB J 20(1):83–106

  76. Velardi P, Cucchiarelli A, Petit M (2007) A taxonomy learning method and its application to characterize a scientific web community. TKDE 19(2):180–191

  77. Verykios V, Elmagarmid AK, Bertino E, Saygin Y, Dasseni E (2004) Association rule hiding. TKDE 16(4):434–447

  78. Wang K, Fung BCM, Yu PS (2007) Handicapping attacker’s confidence: an alternative to k-anonymization. Knowl Inf Syst 11(3):345–368

  79. Wang K, Fung BCM, Yu PS (2005) Template-based privacy preservation in classification problems. In: ICDM, pp 466–473

  80. Wang K, Xu Y, Fu A, Wong RCW (2009) FF-anonymity: when quasi-identifiers are missing. In: ICDE, pp 1136–1139

  81. Wong RCW, Fu A, Wang K, Pei J (2007) Minimality attack in privacy preserving data publishing. In: VLDB, pp 543–554

  82. Wong RCW, Fu A, Wang K, Pei J (2009) Anonymization-based attacks in privacy-preserving data publishing. TODS 34(2):1–46

  83. Wong RCW, Li J, Fu A, Wang K (2006) Alpha-k-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing. In: KDD, pp 754–759

  84. Xiao X, Tao Y (2006) Anatomy: simple and effective privacy preservation. In: VLDB, pp 139–150

  85. Xiao X, Tao Y (2006) Personalized privacy preservation. In: SIGMOD, pp 229–240

  86. Xiao X, Tao Y (2007) M-invariance: towards privacy preserving re-publication of dynamic datasets. In: SIGMOD, pp 689–700

  87. Xiao X, Tao Y, Koudas N (2010) Transparent anonymization: Thwarting adversaries who know the algorithm. TODS 35(2):1–48

  88. Xiao X, Wang G, Gehrke J (2010) Differential privacy via wavelet transforms. In: ICDE, pp 225–236

  89. Xu J, Wang W, Pei J, Wang X, Shi B, Fu AW-C (2006) Utility-based anonymization using local recoding. In: KDD, pp 785–790

  90. Xu Y, Fung BCM, Wang K, Fu AW, Pei J (2008) Publishing sensitive transactions for itemset utility. In: ICDM, pp 1109–1114

  91. Xu Y, Wang K, Fu AW-C, Yu PS (2008) Anonymizing transaction databases for publication. In: KDD, pp 767–775

  92. Yakut I, Polat H (2012) Privacy-preserving hybrid collaborative filtering on cross distributed data. Knowl Inf Syst 30(2):405–433

  93. Ying X, Wu X (2011) On link privacy in randomizing social networks. Knowl Inf Syst 28(3):645–663

  94. Zhang L, Jajodia S, Brodsky A (2007) Information disclosure under realistic assumptions: privacy versus optimality. In: CCS, pp 573–583

  95. Zhou B, Pei J (2011) The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks. Knowl Inf Syst 28(1):47–77

Acknowledgments

We would like to thank the handling editor and anonymous reviewers for their insightful comments, as well as the authors of [74] and [91] for providing the implementations of Apriori and Greedy, respectively. The first author was supported by a Research Fellowship from the Royal Academy of Engineering.

Author information

Corresponding author

Correspondence to Aris Gkoulalas-Divanis.

Appendix

Proof of Theorem 1

We observe that the NP-hard problem of optimal privacy-constrained anonymization (OPC) [51] can be reduced to our problem. The OPC problem seeks to transform \(\mathcal D \) into a \(\tilde{\mathcal{D }}\) that has minimum \(UL(\tilde{\mathcal{D }})\), using the set-based item generalization model, w.r.t. two sets of constraints \(\fancyscript{U}\) (on utility) and \(\fancyscript{P}\) (on privacy), and parameters \(k\) and \(s\). We can map an instance of OPC to an instance of our problem in polynomial time as follows: for each \(p\in \fancyscript{P}\), we add to \(\varTheta \) a PS-rule whose antecedent contains exactly the items in \(p\), and we set \(s=0\,\%\) to disallow suppression and \(c=1\). Then, \(\tilde{\mathcal{D }}\) is a solution to our problem if and only if it is a solution to the OPC problem. \(\square \)

Proof of Theorem 2

Assume that a \(\tilde{\mathcal{D }}\) in which the PS-rules in \(\varTheta \) are protected can be constructed from \(\mathcal D \), for given \(k\) and \(c\), when \(sup(I \cup J,\mathcal D )> N \times c\), for a rule \(I\rightarrow J\) in \(\varTheta \). Then, from Definition 4, we have \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J,\tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})}\le c\). Since \(sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})\ge sup(I \cup J,\mathcal D )\) from the way \(\varPhi \) works, and \(sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\le N\), we have \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\ge \frac{sup(I\cup J,\mathcal D )}{N}\). Thus, \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}>c\), which contradicts our assumption and proves the theorem true. \(\square \)
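The two conditions of Definition 4 used throughout these proofs (antecedent support of at least \(k\), and confidence of at most \(c\)) can be sketched as a small check. This is an illustrative Python sketch, not the paper's implementation; the encoding of generalized transactions and the `phi` mapping are assumptions made here for concreteness.

```python
def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in dataset if s <= set(t))

def is_protected(I, J, phi, gen_dataset, k, c):
    """Check a PS-rule I -> J against the two conditions of Definition 4.

    Assumed encoding: `phi` maps each public item to its generalized item,
    and each transaction in `gen_dataset` holds generalized public items
    together with (un-generalized) sensitive items.
    """
    gen_I = {phi[i] for i in I}
    sup_I = support(gen_I, gen_dataset)
    if sup_I < k:                       # condition (1): support of antecedent >= k
        return False
    sup_IJ = support(gen_I | set(J), gen_dataset)
    return sup_IJ / sup_I <= c          # condition (2): confidence <= c
```

For example, with `phi = {"a": "(a,b)", "b": "(a,b)"}` and four transactions, two of which contain the sensitive item `s1`, the rule `a -> s1` is protected for `k=2, c=0.6` but not for `c=0.4`.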

Proof of Theorem 3

Consider a dataset \(\tilde{\mathcal{D }}\) constructed using a generalization function \(\varPhi \) that maps each and every \(i\in \mathcal P \) to the same \(\tilde{i}\in \tilde{\mathcal{P }}\). Since we have \(sup(\tilde{i},\tilde{\mathcal{D }})=N \ge k\) for all rules in \(\varTheta \), these rules satisfy Condition (1) of Definition 4 for \(\tilde{\mathcal{D }}\). Also, as \(sup(\bigcup _{\forall i \in I_q}\varPhi (i) \cup J_q, \tilde{\mathcal{D }})\le N\times c\) and \(sup(\tilde{i},\tilde{\mathcal{D }})=N\) hold, we have \(\frac{sup(\bigcup _{\forall i \in I_q}\varPhi (i) \cup J_q, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I_q}\varPhi (i),\tilde{\mathcal{D }})}\le c\). So the rules in \(\varTheta \) also satisfy Condition (2) of Definition 4, and are all protected in \(\tilde{\mathcal{D }}\). Thus, \(\tilde{\mathcal{D }}\) is a generalized dataset that is constructed from \(\mathcal D \) as required. \(\square \)

Proof of Theorem 4

Assume that \(\small UL(\tilde{\mathcal{D _1}})<UL(\tilde{\mathcal{D _2}})\). For each generalized item \(\tilde{i}\) in both \(C_1\) and \(C_2\), we have \(\small UL(\tilde{i},\tilde{\mathcal{D _1}})=UL(\tilde{i},\tilde{\mathcal{D _2}})\). Thus, there must be sets of generalized items \(C_1^{\prime }= C_1{\setminus }C_2\) and \(C_2^{\prime }=C_2{\setminus }C_1\) such that \(\small \sum _{\tilde{i_x}\in {C_1^{\prime }}}UL(\tilde{i_x},\tilde{\mathcal{D _1}})<\sum _{\tilde{i_y}\in {C_2^{\prime }}}UL(\tilde{i_y},\tilde{\mathcal{D _2}})\) or \(\small \sum _{\forall \tilde{i_x}\in {C_1^{\prime }}}((2^{|\tilde{i_x}|}-1)\times w(\tilde{i_x})\times sup(\tilde{i_x},\tilde{\mathcal{D _1}}))<\sum _{\forall \tilde{i_y}\in {C_2^{\prime }}}((2^{|\tilde{i_y}|}-1)\times w(\tilde{i_y})\times sup(\tilde{i_y},\tilde{\mathcal{D _2}}))\), in order for our assumption to hold. However, this cannot be true because for each \(\tilde{i_x}\in C_1^{\prime }\) we have \(\small |\tilde{i_x}|=\sum _{\forall \tilde{i_q} \in desc(\tilde{i_x})}|\tilde{i_q}|\) (by definition), \(\small sup(\tilde{i_x},\tilde{\mathcal{D _1}})\ge \sum _{\forall \tilde{i_q} \in desc(\tilde{i_x})}sup(\tilde{i_q}, \tilde{\mathcal{D _2}})\) (by B-Split), and \(\small w(\tilde{i_x})\ge \sum _{\forall \tilde{i_q}\in desc(\tilde{i_x})}w(\tilde{i_q})\) (the condition given in Theorem 4). Thus, we cannot have \(UL(\tilde{\mathcal{D _1}})<UL(\tilde{\mathcal{D _2}})\). \(\square \)
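The utility loss expression manipulated in this proof, \(UL(\tilde{i},\tilde{\mathcal{D }})=(2^{|\tilde{i}|}-1)\times w(\tilde{i})\times sup(\tilde{i},\tilde{\mathcal{D }})\), can be computed directly. A minimal Python sketch, representing each generalized item as a frozenset of the original items mapped to it (this representation is an assumption, not the paper's data structure):

```python
def item_ul(gen_item, weight, gen_dataset):
    """UL of one generalized item: (2^{|i~|} - 1) * w(i~) * sup(i~, D~)."""
    sup = sum(1 for t in gen_dataset if gen_item in t)
    return (2 ** len(gen_item) - 1) * weight * sup

def dataset_ul(cut, weights, gen_dataset):
    """Total utility loss of a generalization cut (a set of generalized items),
    i.e. the sum of item_ul over every generalized item in the cut."""
    return sum(item_ul(g, weights[g], gen_dataset) for g in cut)
```

For instance, a generalized item covering two original items, with weight 1.0 and support 2, contributes \((2^2-1)\times 1.0\times 2=6\) to the total.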

Proof of Theorem 5

Consider an arbitrary generalized dataset \(\tilde{\mathcal{D }}\) that is constructed from \(\mathcal D \) using a set-based generalization function \(\varPhi \). In the following, we prove that \(I\rightarrow J\) is protected in \(\tilde{\mathcal{D }}\) by showing that it satisfies both conditions of Definition 4. From \(sup(I,\mathcal D )\le sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\) and condition (1) of Theorem 5, we get that \(sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\ge k\). Thus, \(I\rightarrow J\) satisfies condition (1) of Definition 4. Also, from \(sup(J,\mathcal D )\ge sup(\bigcup _{\forall i \in I}\varPhi (i) \cup J,\tilde{\mathcal{D }})\) and condition (2) of Theorem 5, we have that \(sup(\bigcup _{\forall i \in I}\varPhi (i) \cup J,\tilde{\mathcal{D }})\le c\times k\). By combining the latter relation with \(sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\ge k\), we get \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i) \cup J,\tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})}\le c\). Thus, \(I\rightarrow J\) also satisfies condition (2) of Definition 4. \(\square \)

Proof of Theorem 6

Assume that \(I\rightarrow J\) is protected in \(\tilde{\mathcal{D _1}}\), not protected in \(\tilde{\mathcal{D _2}}\) and there is no item \(i \in I\) that maps to \(\tilde{i_l}\) or \(\tilde{i_r}\). By the construction of \(C_2\), \(i\) is not mapped to \(\tilde{i}\) either. Thus, we have \(\small sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _1}})=\small sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _2}})\), since other than \(\tilde{i_l}\) and \(\tilde{i_r}\), the support of each generalized item in \(\tilde{\mathcal{D _2}}\) and \(\tilde{\mathcal{D _1}}\) is the same. This implies \(\frac{sup(\bigcup _{\forall i\in I}\varPhi (i)\cup J,\tilde{\mathcal{D _1}})}{sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _1}})}=\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J,\tilde{\mathcal{D _2}})}{sup(\bigcup _{\forall i\in I}\varPhi (i),\tilde{\mathcal{D _2}})}\). Thus, \(I\rightarrow J\) is protected in \(\tilde{\mathcal{D _2}}\), which contradicts our assumption and proves the theorem true. \(\square \)

Proof of Theorem 7

Assume that \(I^{\prime }\rightarrow J\) is not protected in \(\tilde{\mathcal{D }}\). Due to condition (1), we have \(\small sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})\ge k\) and \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\le c\), and, from conditions (2) and (3), we have \(\small sup(\bigcup _{\forall i^{\prime }\in I^{\prime }}\varPhi (i^{\prime }),\tilde{\mathcal{D }})=sup(\bigcup _{\forall i \in I}\varPhi (i),\tilde{\mathcal{D }})\) since all items in \(I^{\prime }\) are contained in \(I\) and all items in \(I\) are mapped to \(\bigcup _{\forall i \in I}\varPhi (i)\). Thus, \(I^{\prime }\rightarrow J\) is protected by Definition 4, which contradicts our assumption. \(\square \)

Proof of Theorem 8

Assume that \(I\rightarrow J^{\prime }\) is not protected in \(\tilde{\mathcal{D }}\). \(\small sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})\ge k\) holds from condition (1), hence \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J^{\prime }, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}>c\) must hold for our assumption to be true. From \(\small sup(J^{\prime }, \tilde{\mathcal{D }}_{\tilde{I}})=sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J^{\prime }, \tilde{\mathcal{D }})\), \(sup(J, \tilde{\mathcal{D }}_{\tilde{I}})=sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})\normalsize \), and condition (2), we get \(\frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J^{\prime }, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\le \frac{sup(\bigcup _{\forall i \in I}\varPhi (i)\cup J, \tilde{\mathcal{D }})}{sup(\bigcup _{\forall i \in I}\varPhi (i), \tilde{\mathcal{D }})}\le c\), which contradicts our assumption. \(\square \)

Proof of Theorem 9

Let \(X_1,\ldots ,X_n\) be independent Poisson trials with probability \(Pr(X_i)=p_i\). Let also \(X=\frac{1}{n}\sum _{\forall i\in [1,n]}X_i\) and \(\mu =E[X]\). For any \(\gamma >0\), the additive Chernoff bound \(Pr(|X-\mu |\ge \gamma )\le (2\times e^{-2n\gamma ^2})\) holds. By setting \(|\mathcal D _s|=n\), \(\gamma =\epsilon \), \(X=\frac{sup(I,\mathcal D _s)}{|\mathcal D _s|}\), and \(\mu =\frac{sup(I,\mathcal D )}{|\mathcal D |}\), we get \(Pr(|\frac{sup(I,\mathcal D _s)}{|\mathcal D _s|}-\frac{sup(I,\mathcal D )}{|\mathcal D |}|\ge \epsilon )\le (2\times e^{-2|\mathcal D _s|\epsilon ^2})\). After setting \(\delta \ge 2\times e^{-2|\mathcal D _s|\epsilon ^2}\) and solving for \(|\mathcal D _s|\), we get \(|\mathcal D _s|\ge \frac{ln(\frac{2}{\delta })}{2\times \epsilon ^2}\), which proves the theorem true. \(\square \)
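The bound \(|\mathcal D _s|\ge \frac{ln(\frac{2}{\delta })}{2\times \epsilon ^2}\) translates into a concrete minimum sample size. A small Python helper (a sketch of the arithmetic, not code from the paper):

```python
import math

def min_sample_size(epsilon, delta):
    """Smallest sample size |D_s| for which the additive Chernoff bound
    of Theorem 9 guarantees that the relative support of an itemset I in
    the sample deviates from that in D by at least epsilon with
    probability at most delta: |D_s| >= ln(2/delta) / (2 * epsilon^2)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
```

For example, \(\epsilon =0.05\) and \(\delta =0.01\) require roughly 1,060 transactions, independently of the size of \(\mathcal D \); this is what makes the sampling-based algorithm scalable.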

Theorem 10

The time complexity of Algorithm 4 is \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\) and the space complexity is \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\).

Proof

We first examine the time complexity of B-Split, Update, Replace, and Check. B-Split requires \(O\left( |\mathcal P |^2+|\mathcal P |\right)\approx O(|\mathcal P |^2)\) time, where \(|\mathcal P |\) is the size of the largest possible generalized item, since it examines all pairs of items mapped to this generalized item and then assigns all public items to the created seeds. Update scans \(\tilde{\mathcal{D }}\) once to replace \(\tilde{i}\) with \(\tilde{i_l}\) and \(\tilde{i_r}\), hence it runs in \(O(|\mathcal P |\times N)\) time, where \(N\) is the number of transactions in \(\tilde{\mathcal{D }}\). Replace has the same time complexity as Update, and Check needs \(O(|\varTheta |\times (N+|\mathcal P |))\) time: \(O(N)\) time to form \(I\rightarrow J\) and determine whether it is protected, \(O(|\mathcal P |\times (log(|\mathcal P |)+|\varTheta |))\) time to construct \(\varTheta ^{\prime }\) using Find-rules in the worst case, in which all rules have at least one item mapped to \(\tilde{i}\) in their antecedent, and \(O(|\varTheta ^{\prime }|\times N)\) time to compute the support and confidence of these rules. Since Algorithm 4 performs up to \(\lfloor \frac{2\times |\mathcal P |-1}{2}\rfloor \approx |\mathcal P |\) recursive calls, one for each edge in the generalization tree, it needs \(O\left(|\mathcal P |\times \left(|\mathcal P |^2 + |\mathcal P |\times N + |\varTheta |\times (N+|\mathcal P |)\right)\right)\) time, which can be approximated as \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\) in the worst case, when \(|\varTheta |=(2^{|\mathcal P |}-1)\times |\mathcal S |\) and assuming that \(N>|\mathcal P |\). Note that \(|\varTheta |=(2^{|\mathcal P |}-1)\times |\mathcal S |\) when \(\varTheta \) contains all possible rules and there is no rule whose antecedent is the same as that of another rule and whose consequent is a superset of the other rule's consequent (otherwise, the former rule would be redundant due to its lower confidence).

We then examine the space complexity of Update, Replace, and Check. Update and Replace need to store \(O(N\times |\mathcal I |)\) items each. The Rule-tree used in Check is created once and stores \(O(2^{|\mathcal P |}\times |\mathcal S |)\) items, as inserting a rule adds one item as a tree node and \(|\mathcal S |\) items in the consequent-list. \(Q\) needs to store \(O(|\mathcal P |^2)\) items when the generalization tree is the tallest one, hence Algorithm 4 requires \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\) space. \(\square \)

Theorem 11

The time complexity of Algorithm 5 is \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\) and the space complexity is \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\).

Proof

The time complexity of the sample-based partitioning phase of Algorithm 5 is \(O(h\times (|\mathcal P |^2+|\mathcal P |\times N+|\varTheta |\times (|\mathcal P |+|\mathcal D _s|)))\), where \(h\) is the height of the generalization tree \(\mathcal G \) returned by Sample-Based-Partition. This is because the latter function calls Split and Update, whose cost was examined in Theorem 10, as well as Check, whose cost is \(O(|\varTheta |\times (|\mathcal P |+|\mathcal D _s|))\). The top-down cut revision phase needs \(O(h^{\prime }\times (|\mathcal P |^2+|\mathcal P |\times N + |\mathcal P |\times (log(|\mathcal P |)+|\varTheta |)+ |\varTheta |\times (|\mathcal P |+N)))\) time, where \(h^{\prime }\) is the number of executions of the while loop of step \(9\) in Algorithm 5. Finally, steps 24–29 need \(O(h^{\prime \prime }\times (|\varTheta |\times (|\mathcal P |+N)+|\mathcal P |+N))\) time, assuming that all nodes of \(\mathcal G \) are merged to their ancestors that lie \(h^{\prime \prime }\) levels above them, since Check and Merge-siblings take \(O(|\varTheta |\times (|\mathcal P |+N))\) and \(O(|\mathcal P |+N)\) time, respectively. It also holds that \(h^{\prime }+h\le |\mathcal P |\) and \(h^{\prime \prime }\le |\mathcal P |\), as \(\mathcal G \) has up to \(|\mathcal P |\) levels. Thus, Algorithm 5 takes \(O(|\mathcal P |\times (|\mathcal P |^2+|\mathcal P |\times N+|\varTheta |\times (|\mathcal P |+N)))\) time, or approximately \(O(2^{|\mathcal P |}\times |\mathcal S |\times N)\), in the worst case when \(|\varTheta |=O(2^{|\mathcal P |}\times |\mathcal S |)\) and assuming that \(N>|\mathcal P |\). The space complexity of Update and Check was examined in Theorem 10, and the cost of storing \(\mathcal G \) is \(O(|\mathcal P |^2)\) items. Thus, Algorithm 5 requires \(O(2^{|\mathcal P |}\times |\mathcal S |+N\times |\mathcal I |)\) space. \(\square \)

COUNT() Queries and \(ARE\) Computation

Consider an SQL-like COUNT() query

\(\mathbf{Q:}\) SELECT COUNT(\(\tilde{T_n}\)) FROM \(\tilde{\mathcal{D }}\) WHERE \(\tilde{T_n}\) supports \(\tilde{I}J\)

(or, over the original data, SELECT COUNT(\(T_n\)) FROM \(\mathcal D \) WHERE \(T_n\) supports \(IJ\))

where \(\tilde{I}\) and \(I\) are itemsets comprised of public items in \(\tilde{\mathcal{D }}\) and \(\mathcal D \), respectively, and \(J\) is an itemset comprised of sensitive items. Since data recipients only have access to \(\tilde{\mathcal{D }}\), they need to estimate an answer for \(Q\). This can be performed by computing the probability \(P(\tilde{T_n},Q)\) that \(\tilde{T_n}\), the anonymized version of a transaction \(T_n\) in \(\mathcal D \), satisfies \(Q\). Assume that a generalized item \(\tilde{i_m}\) in \(\tilde{T_n}\) is interpreted as any possible subset of the items mapped to it with equal probability, and that there are no correlations among generalized items [27, 29, 42]. \(P(\tilde{T_n} ,Q)\) is given by \(\displaystyle \Pi _{\forall \tilde{i_m} \in \tilde{T_n}}\frac{2^{|\tilde{i_m}|-c(I)}}{2^{|\tilde{i_m}|}-1}\), where \(c\) is a function that, given \(I\), returns the number of items in \(I\) that are mapped to \(\tilde{i_m}\) in \(\tilde{\mathcal{D }}\) (so that \(c(I)\in [0,|\tilde{i_m}|]\)). An approximate answer \(e(Q)\) to \(Q\) is then derived by summing the corresponding probabilities across all transactions \(\tilde{T_n}\) in \(\tilde{\mathcal{D }}\) that support \(J\).
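The per-transaction probability above can be sketched in a few lines of Python. Assumptions made here: each generalized item is represented as a frozenset of the original items mapped to it, only generalized public items are passed in, and the product is taken only over generalized items that actually contain items of \(I\) (items with \(c(I)=0\) do not constrain the query):

```python
def query_probability(gen_transaction, I):
    """Estimate P(T~_n, Q): the probability that the interpretation of the
    generalized transaction supports the public itemset I, assuming each
    generalized item i~_m is uniformly interpreted as any non-empty subset
    of the items mapped to it."""
    I = set(I)
    p = 1.0
    for gm in gen_transaction:
        c = len(gm & I)                 # c(I): items of I mapped to i~_m
        if c > 0:
            # 2^{|i~_m| - c} subsets keep all c items; 2^{|i~_m|} - 1 non-empty subsets
            p *= 2 ** (len(gm) - c) / (2 ** len(gm) - 1)
    return p
```

For a transaction containing the single generalized item \(\{a,b\}\) and the query itemset \(\{a\}\), this yields \(2^{2-1}/(2^2-1)=2/3\).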

We considered two workloads, \(W_1\) and \(W_2\), each comprised of 1,000 queries whose items were selected uniformly at random. Queries in \(W_1\) involve a \(2\)-itemset in \(\mathcal P \), whereas those in \(W_2\) retrieve a \(3\)-itemset comprised of \(2\) public items and \(1\) sensitive item.
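The \(ARE\) statistic itself is not spelled out above; a common definition, which we assume here, is the average over the workload of \(\frac{|act(Q)-e(Q)|}{act(Q)}\), where \(act(Q)\) is the actual count and \(e(Q)\) the estimate:

```python
def average_relative_error(actual, estimated):
    """ARE over a query workload: the mean relative error of the estimates.
    Assumes every actual count is non-zero (zero counts are typically
    handled with a small sanity constant in the literature)."""
    return sum(abs(a - e) / a
               for a, e in zip(actual, estimated)) / len(actual)
```

Lower \(ARE\) means the anonymized data answer the workload more accurately; an \(ARE\) of 0.1 corresponds to a 10 % average deviation from the true counts.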


About this article

Cite this article

Loukides, G., Gkoulalas-Divanis, A. & Shao, J. Efficient and flexible anonymization of transaction data. Knowl Inf Syst 36, 153–210 (2013). https://doi.org/10.1007/s10115-012-0544-3
