Abstract
Robust frequent itemset mining has attracted much attention because many applications need to find frequent patterns in noisy data. In this paper, we focus on a variant of robust frequent itemsets in which a small number of “faults” is allowed in each item and each supporting transaction. This problem is challenging because computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximation version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms while the error is acceptable. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms.
Notes
We use a machine different from the one in the conference version [21] and observe no significant difference in the performance.
This dataset can be downloaded from http://www-personal.umich.edu/~mejn/netdata/.
References
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’98, pp 94–105
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’93, pp 207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, VLDB ’94, pp 487–499
Bansal N, Korula N, Nagarajan V, Srinivasan A (2012) Solving packing integer programs via randomized rounding with alterations. Theory Comput 8(24):533–565
Besson J, Pensa RG, Robardet C, Boulicaut JF (2005) Constraint-based mining of fault-tolerant patterns from Boolean data. In: Proceedings of the international conference on knowledge discovery in inductive databases, pp 55–71
Briest P, Krysta P, Vöcking B (2011) Approximation techniques for utilitarian mechanism design. SIAM J Comput 40(6):1587–1622
Calders T, Goethals B (2005) Depth-first non-derivable itemset mining. In: Proceedings of the SIAM international conference on data mining, SDM ’05, pp 250–261
Cheng H, Yu PS, Han J (2008) Approximate frequent itemset mining in the presence of random noise. In: Soft computing for knowledge discovery and data mining, pp 363–389
Cong G, Tung AKH, Xu X, Pan F, Yang J (2004) FARMER: finding interesting rule groups in microarray datasets. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’04, pp 143–154
Dourisboure Y, Geraci F, Pellegrini M (2009) Extraction and classification of dense implicit communities in the web graph. ACM Trans Web 3(2):7:1–7:36
Gupta R, Fang G, Field B, Steinbach M, Kumar V (2008) Quantitative evaluation of approximate frequent pattern mining algorithms. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08, pp 301–309
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’00, pp 1–12
Hochbaum DS (1997) Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In: Approximation algorithms for NP-hard problems, pp 94–143
Koh JL, Yo PW (2005) An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’05, pp 568–575
Kolliopoulos SG, Young NE (2005) Approximation algorithms for covering/packing integer programs. J Comput Syst Sci 71(4):495–505
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1:1–1:58
Krysta P (2005) Greedy approximation via duality for packing, combinatorial auctions and routing. In: Proceedings of the international symposium on mathematical foundations of computer science, MFCS ’05, pp 615–627
Lee G, Peng SL, Lin YT (2009) Proportional fault-tolerant data mining with applications to bioinformatics. Inf Syst Front 11(4):461–469
Liu J, Paulsen S, Sun X, Wang W, Nobel A, Prins J (2006) Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In: Proceedings of the SIAM international conference on data mining, SDM ’06, pp 405–416
Liu S, Poon CK (2014) On mining proportional fault-tolerant frequent itemsets. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’14, pp 342–356
Liu X, Li J, Wang L (2010) Modeling protein interacting groups by quasi-bicliques: complexity, algorithm, and application. IEEE ACM Trans Comput Biol Bioinform 7(2):354–364
Pei J, Tung AKH, Han J (2001) Fault-tolerant frequent pattern mining: problems and challenges. In: Proceedings of the international workshop on research issues on data mining and knowledge discovery, pp 7–12
Poernomo AK, Gopalkrishnan V (2007) Mining statistical information of frequent fault-tolerant patterns in transactional databases. In: Proceedings of the IEEE international conference on data mining, ICDM ’07, pp 272–281
Poernomo AK, Gopalkrishnan V (2009) Efficient computation of partial-support for mining interesting itemsets. In: Proceedings of the SIAM international conference on data mining, SDM ’09, pp 1014–1025
Poernomo AK, Gopalkrishnan V (2009) Towards efficient mining of proportional fault-tolerant frequent itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09, pp 697–706
Raghavan P (1988) Probabilistic construction of deterministic algorithms: approximating packing integer programs. J Comput Syst Sci 37(2):130–143
Raghavan P, Thompson CD (1987) Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica 7(4):365–374
Seppänen JK, Mannila H (2004) Dense itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’04, pp 683–688
Sim K, Li J, Gopalkrishnan V, Liu G (2006) Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In: Proceedings of the IEEE international conference on data mining, ICDM ’06, pp 1059–1063
Srinivasan A (1999) Improved approximation guarantees for packing and covering integer programs. SIAM J Comput 29(2):648–670
Wang SS, Lee SY (2002) Mining fault-tolerant frequent patterns in large databases. In: Proceedings of the international computer symposium
Wang X, Borgelt C, Kruse R (2005) Fuzzy frequent pattern discovering based on recursive elimination. In: Proceedings of the international conference on machine learning and applications, pp 391–396
Yang C, Fayyad U, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01, pp 194–203
Zeng JJ, Lee G, Lee CC (2008) Mining fault-tolerant frequent patterns efficiently with powerful pruning. In: Proceedings of the ACM symposium on applied computing, pp 927–931
Acknowledgements
This work was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/FDS11/E02/15) and partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 122512). The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions.
Additional information
This paper is an extended version of [21].
Appendix
FTFI mining has many applications. We present one example from the World Wide Web (WWW). An interesting and important topic in the study of the WWW is the discovery of implicit local structures, i.e., cyber-communities, in order to better understand sociological behavior and the ever-growing phenomena on the Web.
To this end, the WWW is commonly modeled as a directed Web graph G(V, E) where each vertex represents a web page and each arc represents a hyperlink from one web page to another. A cyber-community is then a subgraph with dense connections from one subset of vertices (representing web pages) to another subset (representing interests, etc.). Dourisboure et al. [10] showed that relatively dense subgraphs, rather than complete subgraphs, capture larger and more meaningful communities in the Web.
In the language of frequent itemset mining, one can view a vertex v as an item \(i_v\) and the set of out-neighbors of v as a transaction \(t_v\), i.e., \(t_v=\{i_u \mid u \in V \text { and } (v,u) \in E\}\). Then a fault-tolerant frequent pattern (i.e., an itemset together with its supporting transactions, where each item appears in most of the transactions and each transaction contains most of the items) represents a cyber-community in which most (but not necessarily all) of a set of web pages share many (but not necessarily all) of a set of common interests or authorities.
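The mapping from a Web graph to a transactional database can be sketched as follows. This is only an illustrative sketch: the function name `graph_to_transactions` and the toy edge list are hypothetical and are not taken from the paper or from the Political Blogs data.

```python
from collections import defaultdict


def graph_to_transactions(edges):
    """Map each vertex v to the transaction t_v, the set of its out-neighbors.

    `edges` is an iterable of (v, u) pairs, one per arc v -> u.
    """
    transactions = defaultdict(set)
    for v, u in edges:
        transactions[v].add(u)
    return dict(transactions)


# Hypothetical toy graph: pages p1..p3 link to pages a, b, c.
edges = [
    ("p1", "a"), ("p1", "b"),
    ("p2", "a"), ("p2", "b"), ("p2", "c"),
    ("p3", "c"),
]
transactions = graph_to_transactions(edges)
```

Because a vertex only receives an entry when it has at least one outgoing arc, vertices such as `a`, `b` and `c` in the toy graph produce no transaction at all, so the result contains no empty transactions.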
We tested our algorithms on the Political Blogs dataset (see footnote 5) to demonstrate the power of fault-tolerant frequent itemset mining. The Political Blogs dataset is a directed network of 1490 weblogs on US politics with 19,090 hyperlinks between these weblogs. Each blog in the dataset has an attribute describing its political leaning as either liberal or conservative.
After modeling it as a transactional database as described above and removing the empty transactions (corresponding to vertices with no outgoing edges), we are left with 1065 transactions. We applied our exact and (approximate) iterative insertion-based FTFI mining algorithms (described in Sects. 4 and 3.2.2, respectively) with a minimum support threshold of \(\sigma = 10\%\) and proportional relaxation parameters \((\alpha _p, \beta _p) = (0.3, 0.2)\). Both algorithms extracted 198 FTFIs. Under the classic frequent itemset definition (i.e., without any fault tolerance), only 59 frequent itemsets are found.
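To make the threshold and the tolerance conditions concrete, the sketch below computes the minimum support count for a relative threshold and checks whether a candidate pair of an itemset and a set of transactions satisfies proportional tolerances. It rests on assumptions that are not stated in this appendix: support is taken to mean "at least \(\sigma\) times the number of transactions", each transaction may miss at most a fraction `row_tol` of the items, and each item may be absent from at most a fraction `item_tol` of the transactions. The names `min_support_count`, `is_ft_pattern`, `row_tol` and `item_tol` are hypothetical, and we deliberately do not state which of \(\alpha _p\) and \(\beta _p\) plays which role; the formal definitions are in the main text of the paper.

```python
import math
from fractions import Fraction


def min_support_count(n_transactions, sigma):
    """Smallest transaction count that reaches the relative threshold sigma.

    sigma is passed as a string or Fraction so the arithmetic stays exact.
    """
    return math.ceil(Fraction(sigma) * n_transactions)


def is_ft_pattern(items, rows, transactions, row_tol, item_tol):
    """Check the two proportional tolerance conditions (assumed form).

    items: candidate itemset; rows: ids of candidate supporting transactions;
    transactions: dict mapping a row id to its set of items.
    """
    items = set(items)
    rows = list(rows)
    # Row condition: every transaction misses at most row_tol * |items| items.
    for r in rows:
        if len(items - transactions[r]) > row_tol * len(items):
            return False
    # Item condition: every item is absent from at most item_tol * |rows| rows.
    for i in items:
        absent = sum(1 for r in rows if i not in transactions[r])
        if absent > item_tol * len(rows):
            return False
    return True


# 10% of the 1065 transactions is 106.5, so 107 transactions are needed.
threshold = min_support_count(1065, "0.1")

# Hypothetical toy data: every transaction misses exactly one of five items.
toy = {
    "t1": {"a", "b", "c", "d"},
    "t2": {"a", "b", "c", "e"},
    "t3": {"a", "b", "d", "e"},
    "t4": {"a", "c", "d", "e"},
    "t5": {"b", "c", "d", "e"},
}
X = {"a", "b", "c", "d", "e"}
tolerant = is_ft_pattern(X, toy, toy, Fraction(1, 5), Fraction(1, 5))
strict = is_ft_pattern(X, toy, toy, 0, 0)
```

In the toy data no transaction contains all five items, so the strict check fails, while a tolerance of one fifth in both directions accepts all five transactions as supporters.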
To further illustrate the advantage of FTFIs over standard frequent itemsets, we look at a randomly selected fault-tolerant frequent pattern (X, Y) where \(X = \{a,b,c,d,e,f,g\}\) and \(|Y| = 111\) (see Fig. 8). It turns out that all weblogs in X have the same political leaning, i.e., liberal, and 108 of the 111 transactions in Y are also liberal.
On the other hand, the frequent itemset mining algorithm (with no fault tolerance) was unable to discover X at the same minimum support of 10%. Even when the minimum support is lowered to 5%, it can only identify some fragments of X such as \(X_1 = \{a,b,c,d,e\}\) and \(X_2 = \{d,e,f,g\}\). See Fig. 8.
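The fragmentation effect described above can be reproduced on a small example with the classic (fault-free) support count. The data below are hypothetical and only mimic the situation qualitatively; `exact_support` is an illustrative helper, not code from the paper.

```python
def exact_support(itemset, transactions):
    """Count transactions that contain every item of `itemset` (no faults)."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical toy database: most transactions miss one item of X.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"b", "c", "d"},
    {"b", "c", "d"},
]

X = {"a", "b", "c", "d"}
X1 = {"a", "b", "c"}
X2 = {"b", "c", "d"}

support_X = exact_support(X, transactions)
support_X1 = exact_support(X1, transactions)
support_X2 = exact_support(X2, transactions)
```

With a minimum support count of 3, the exact miner in this toy setting reports the overlapping fragments `X1` and `X2` but not `X` itself, even though every transaction contains at least three of the four items of `X`.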
Cite this article
Liu, S., Poon, C.K. On mining approximate and exact fault-tolerant frequent itemsets. Knowl Inf Syst 55, 361–391 (2018). https://doi.org/10.1007/s10115-017-1079-4