A transversal hypergraph approach for the frequent itemset hiding problem

Abstract

We propose a methodology for hiding all sensitive frequent itemsets in a transaction database. The methodology relies on a novel technique that enumerates the minimal transversals of a hypergraph in order to induce the ideal border between frequent and sensitive itemsets. The ideal border is then used to formulate an integer linear program (ILP) that determines whether a feasible sanitized database attaining the ideal border exists. The solution of the program identifies the set of transactions that need to be modified (sanitized) so that the hiding is achieved with maximum accuracy. If no solution exists, we modify the ILP by relaxing its constraints so that the sanitized database is still guaranteed to preserve privacy, with minimum effect on data quality. Experimental evaluation of the proposed approach on a number of real datasets shows that the produced sanitized databases exhibit higher accuracy than the solutions of other well-known approaches.
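To make the notion of minimal transversals concrete, the following is a minimal sketch of the classical Berge multiplication scheme for enumerating the minimal transversals (minimal hitting sets) of a small hypergraph. It is not the novel technique proposed in the paper, and the function name `minimal_transversals` and the toy data are assumptions introduced only for illustration. The example relies on a well-known connection between borders and transversals: the minimal infrequent itemsets are the minimal transversals of the hypergraph formed by the complements of the maximal frequent itemsets.

```python
def minimal_transversals(hyperedges):
    """Enumerate minimal transversals by Berge multiplication (illustrative sketch).

    hyperedges: iterable of sets of vertices.
    Returns a set of frozensets, each an inclusion-minimal set of vertices
    that intersects every hyperedge.
    """
    # The empty set is the only minimal transversal of the empty hypergraph.
    transversals = {frozenset()}
    for edge in hyperedges:
        edge = frozenset(edge)
        candidates = set()
        for t in transversals:
            if t & edge:
                # t already hits the new edge, so it stays a transversal.
                candidates.add(t)
            else:
                # Otherwise extend t with each vertex of the new edge.
                for v in edge:
                    candidates.add(t | {v})
        # Discard every candidate that strictly contains another candidate.
        transversals = {
            c for c in candidates
            if not any(other < c for other in candidates)
        }
    return transversals


# Hypothetical toy data: items a..d with maximal frequent itemsets {a,b} and {b,c}.
items = {"a", "b", "c", "d"}
maximal_frequent = [{"a", "b"}, {"b", "c"}]

# Hyperedges are the complements of the maximal frequent itemsets.
complements = [items - m for m in maximal_frequent]

# The minimal transversals are the minimal infrequent itemsets.
minimal_infrequent = minimal_transversals(complements)
print(sorted(sorted(s) for s in minimal_infrequent))
```

In this toy instance, `{d}` and `{a, c}` are the minimal itemsets not contained in any maximal frequent itemset. Berge multiplication is simple but can generate many intermediate candidates, so it is suitable only for small hypergraphs.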

Acknowledgments

The authors wish to thank the anonymous referees for their valuable comments that improved the final presentation of the paper.

Author information

Corresponding author

Correspondence to Elias C. Stavropoulos.

About this article

Cite this article

Stavropoulos, E.C., Verykios, V.S. & Kagklis, V. A transversal hypergraph approach for the frequent itemset hiding problem. Knowl Inf Syst 47, 625–645 (2016). https://doi.org/10.1007/s10115-015-0862-3

Keywords

  • Privacy-preserving data mining
  • Hiding frequent itemsets
  • Transversal hypergraph generation