On mining approximate and exact fault-tolerant frequent itemsets

Abstract

Robust frequent itemset mining has attracted much attention because many applications need to find frequent patterns in noisy data. In this paper, we focus on a variant of robust frequent itemsets in which a small number of “faults” is allowed in each item and each supporting transaction. This problem is challenging because computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximate version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms while the error is acceptable. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms.

Notes

  1. We use a machine different from the one in the conference version [21] and observe no significant difference in the performance.

  2. http://www.cais.ntu.edu.sg/~vivek/pubs/ftfim09/.

  3. http://lpsolve.sourceforge.net/5.5/

  4. http://fimi.cs.helsinki.fi/data/.

  5. This dataset can be downloaded from http://www-personal.umich.edu/~mejn/netdata/.

References

  1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’98, pp 94–105

  2. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’93, pp 207–216

  3. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, VLDB ’94, pp 487–499

  4. Bansal N, Korula N, Nagarajan V, Srinivasan A (2012) Solving packing integer programs via randomized rounding with alterations. Theory Comput 8(24):533–565

  5. Besson J, Pensa RG, Robardet C, Boulicaut JF (2005) Constraint-based mining of fault-tolerant patterns from Boolean data. In: Proceedings of the international conference on knowledge discovery in inductive databases, pp 55–71

  6. Briest P, Krysta P, Vöcking B (2011) Approximation techniques for utilitarian mechanism design. SIAM J Comput 40(6):1587–1622

  7. Calders T, Goethals B (2005) Depth-first non-derivable itemset mining. In: Proceedings of the SIAM international conference on data mining, SDM ’05, pp 250–261

  8. Cheng H, Yu PS, Han J (2008) Approximate frequent itemset mining in the presence of random noise. In: Soft computing for knowledge discovery and data mining, pp 363–389

  9. Cong G, Tung AKH, Xu X, Pan F, Yang J (2004) FARMER: finding interesting rule groups in microarray datasets. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’04, pp 143–154

  10. Dourisboure Y, Geraci F, Pellegrini M (2009) Extraction and classification of dense implicit communities in the web graph. ACM Trans Web 3(2):7:1–7:36

  11. Gupta R, Fang G, Field B, Steinbach M, Kumar V (2008) Quantitative evaluation of approximate frequent pattern mining algorithms. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08, pp 301–309

  12. Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86

  13. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’00, pp 1–12

  14. Hochbaum DS (1997) Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In: Approximation algorithms for NP-hard problems, pp 94–143

  15. Koh JL, Yo PW (2005) An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’05, pp 568–575

  16. Kolliopoulos SG, Young NE (2005) Approximation algorithms for covering/packing integer programs. J Comput Syst Sci 71(4):495–505

  17. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1:1–1:58

  18. Krysta P (2005) Greedy approximation via duality for packing, combinatorial auctions and routing. In: Proceedings of the international symposium on mathematical foundations of computer science, MFCS ’05, pp 615–627

  19. Lee G, Peng SL, Lin YT (2009) Proportional fault-tolerant data mining with applications to bioinformatics. Inf Syst Front 11(4):461–469

  20. Liu J, Paulsen S, Sun X, Wang W, Nobel A, Prins J (2006) Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In: Proceedings of the SIAM international conference on data mining, SDM ’06, pp 405–416

  21. Liu S, Poon CK (2014) On mining proportional fault-tolerant frequent itemsets. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’14, pp 342–356

  22. Liu X, Li J, Wang L (2010) Modeling protein interacting groups by quasi-bicliques: complexity, algorithm, and application. IEEE ACM Trans Comput Biol Bioinform 7(2):354–364

  23. Pei J, Tung AKH, Han J (2001) Fault-tolerant frequent pattern mining: problems and challenges. In: Proceedings of the international workshop on research issues on data mining and knowledge discovery, pp 7–12

  24. Poernomo AK, Gopalkrishnan V (2007) Mining statistical information of frequent fault-tolerant patterns in transactional databases. In: Proceedings of the IEEE international conference on data mining, ICDM ’07, pp 272–281

  25. Poernomo AK, Gopalkrishnan V (2009) Efficient computation of partial-support for mining interesting itemsets. In: Proceedings of the SIAM international conference on data mining, SDM ’09, pp 1014–1025

  26. Poernomo AK, Gopalkrishnan V (2009) Towards efficient mining of proportional fault-tolerant frequent itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09, pp 697–706

  27. Raghavan P (1988) Probabilistic construction of deterministic algorithms: approximating packing integer programs. J Comput Syst Sci 37(2):130–143

  28. Raghavan P, Tompson CD (1987) Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica 7(4):365–374

  29. Seppänen JK, Mannila H (2004) Dense itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’04, pp 683–688

  30. Sim K, Li J, Gopalkrishnan V, Liu G (2006) Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In: Proceedings of the IEEE international conference on data mining, ICDM ’06, pp 1059–1063

  31. Srinivasan A (1999) Improved approximation guarantees for packing and covering integer programs. SIAM J Comput 29(2):648–670

  32. Wang SS, Lee SY (2002) Mining fault-tolerant frequent patterns in large databases. In: Proceedings of the international computer symposium

  33. Wang X, Borgelt C, Kruse R (2005) Fuzzy frequent pattern discovering based on recursive elimination. In: Proceedings of the international conference on machine learning and applications, pp 391–396

  34. Yang C, Fayyad U, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01, pp 194–203

  35. Zeng JJ, Lee G, Lee CC (2008) Mining fault-tolerant frequent patterns efficiently with powerful pruning. In: Proceedings of the ACM symposium on applied computing, pp 927–931

Acknowledgements

This work was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/FDS11/E02/15) and partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 122512). The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions.

Author information

Corresponding author

Correspondence to Chung Keung Poon.

Additional information

This paper is an extended version of [21].

Appendix

FTFI mining has many applications. We present one example from the World Wide Web (WWW). An interesting and important topic in the study of the WWW is the discovery of implicit local structures, i.e., cyber-communities, in order to better understand sociological behavior and the ever-growing range of phenomena on the Web.

To this end, the WWW is commonly modeled as a directed Web graph \(G(V, E)\) where each vertex represents a web page and each arc represents a hyperlink from one web page to another. A cyber-community is then a subgraph with dense connections from one subset of vertices (representing web pages) to another subset (representing interests, etc.). Dourisboure et al. [10] showed that relatively dense subgraphs, rather than complete subgraphs, capture larger and more meaningful communities in the Web.

Fig. 8

A cyber-community in the Political Blogs Dataset modeled as a collection of transactions. There are 7 columns representing the itemset \(X = \{a,b,c,d,e,f,g\}\) and 111 rows representing the set of supporting transactions Y. A mark indicates the absence of an item in the corresponding transaction. With a lower minimum support threshold, frequent itemsets \(X_1 = \{a,b,c,d,e\}\) and \(X_2 = \{d,e,f,g\}\) together with their supporting transactions are marked as rectangles at the top-left corner and top/bottom-right corners, respectively.

In the language of frequent itemset mining, one can view a vertex v as an item \(i_v\) and the set of out-neighbors of v as a transaction \(t_v\), i.e., \(t_v=\{i_u \mid u \in V \text{ and } (v,u) \in E\}\). A fault-tolerant frequent pattern (i.e., an itemset together with its supporting transactions, where each item appears in most of the transactions and each transaction contains most of the items) then represents a cyber-community in which most (but not necessarily all) of a set of web pages share many (but not necessarily all) of a set of common interests or authorities.
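The vertex-to-transaction mapping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `graph_to_transactions` and the toy edge list are assumptions introduced here.

```python
from collections import defaultdict


def graph_to_transactions(edges):
    """Map each vertex v to the transaction t_v made of its out-neighbors.

    `edges` is an iterable of (v, u) pairs, one per arc v -> u.
    A vertex with no outgoing arc never becomes a key, so it
    contributes no (empty) transaction.
    """
    transactions = defaultdict(set)
    for v, u in edges:
        transactions[v].add(u)  # item i_u belongs to transaction t_v
    return dict(transactions)


# Hypothetical toy Web graph: pages 1-3 link to pages "a", "b", "c".
toy_edges = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (3, "c")]
toy_db = graph_to_transactions(toy_edges)
```

In this toy graph, pages "a", "b" and "c" have no outgoing arcs, so they appear only as items and never as transactions.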

We tested our algorithms on the Political Blogs Dataset (see Note 5) to demonstrate the power of fault-tolerant frequent itemset mining. The Political Blogs Dataset is a directed network of 1490 weblogs on US politics with 19,090 hyperlinks between them. Each blog in the dataset has an attribute describing its political leaning as either liberal or conservative.

After modeling the dataset as a transactional database as described above and removing the empty transactions (corresponding to vertices with no outgoing edges), we are left with 1065 transactions. We applied our exact and (approximate) iterative insertion-based FTFI mining algorithms (described in Sects. 4 and 3.2.2, respectively) with a minimum support threshold of \(\sigma = 10\%\) and proportional relaxation parameters \((\alpha _p, \beta _p) = (0.3, 0.2)\). Both algorithms extracted 198 FTFIs. Under the classic frequent itemset definition (i.e., without any fault tolerance), only 59 frequent itemsets are found.
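For concreteness, a relative threshold of \(\sigma = 10\%\) over 1065 transactions corresponds to an absolute support count of 107 if the threshold is rounded up; the rounding convention is an assumption, since this appendix does not state it. The sketch below computes that count and the classic (fault-free) support of an itemset. Both helper names are hypothetical.

```python
def min_support_count(sigma_percent, num_transactions):
    """Smallest transaction count reaching sigma_percent of the database.

    Integer arithmetic (ceiling division) avoids floating-point surprises.
    Rounding up is an assumed convention.
    """
    return (sigma_percent * num_transactions + 99) // 100


def exact_support(itemset, transactions):
    """Classic support: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


threshold = min_support_count(10, 1065)
```

Under this convention an itemset is classically frequent when `exact_support(itemset, db) >= threshold`.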

To further illustrate the advantage of FTFIs over standard frequent itemsets, we look at a randomly selected fault-tolerant frequent pattern \((X, Y)\) where \(X = \{a,b,c,d,e,f,g\}\) and \(|Y| = 111\); see Fig. 8. It turns out that all weblogs in X have the same political leaning, i.e., liberal, and 108 out of the 111 transactions in Y share this liberal leaning.

On the other hand, the frequent itemset mining algorithm (with no fault tolerance) was unable to discover X at the same minimum support of 10%. Even when the minimum support is lowered to 5%, it can only identify some fragments of X such as \(X_1 = \{a,b,c,d,e\}\) and \(X_2 = \{d,e,f,g\}\). See Fig. 8.
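The contrast between exact and fault-tolerant support can be reproduced on toy data. The sketch below only verifies whether a given pair \((X, Y)\) satisfies two proportional fault bounds: each transaction in Y misses at most a fraction of the items in X, and each item in X is missing from at most a fraction of the transactions in Y. The parameter names `row_tol` and `col_tol`, the floor-based rounding, and the toy data are assumptions; this appendix does not say which of \((\alpha _p, \beta _p)\) governs which bound. Finding the largest Y for a given X, which is the NP-hard part, is not attempted.

```python
import math


def is_fault_tolerant_pattern(itemset, transactions, row_tol, col_tol):
    """Check both proportional fault bounds for a candidate pattern (X, Y).

    row_tol: max fraction of items of X that one transaction may miss.
    col_tol: max fraction of transactions of Y that one item may be missing from.
    Names and floor rounding are illustrative assumptions.
    """
    items = list(itemset)
    rows = [set(t) for t in transactions]
    max_row_faults = math.floor(row_tol * len(items))
    max_col_faults = math.floor(col_tol * len(rows))
    rows_ok = all(sum(i not in t for i in items) <= max_row_faults for t in rows)
    cols_ok = all(sum(i not in t for t in rows) <= max_col_faults for i in items)
    return rows_ok and cols_ok


# Hypothetical toy data: five transactions, each missing at most one item of X.
X = {"a", "b", "c", "d"}
Y = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c", "d"},
    {"b", "c", "d"},
]

exact_supporters = sum(1 for t in Y if X <= t)  # transactions containing all of X
tolerant = is_fault_tolerant_pattern(X, Y, row_tol=0.3, col_tol=0.2)
```

Here only one of the five transactions contains X exactly, yet the pair passes both fault bounds, which mirrors how a fault-free miner sees only fragments of a pattern that a fault-tolerant one recovers whole.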

About this article

Cite this article

Liu, S., Poon, C.K. On mining approximate and exact fault-tolerant frequent itemsets. Knowl Inf Syst 55, 361–391 (2018). https://doi.org/10.1007/s10115-017-1079-4

Keywords

  • Data mining
  • Mining methods and algorithms
  • Frequent itemsets
  • Fault tolerance
  • Approximate support count