## Abstract

Robust frequent itemset mining has attracted much attention because many applications need to find frequent patterns in noisy data. In this paper, we focus on a variant of robust frequent itemsets in which a small number of “faults” is allowed in each item and each supporting transaction. This problem is challenging because computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximation version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms while the error remains acceptable. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms.


## Notes

- 1.
We use a machine different from the one in the conference version [21] and observe no significant difference in the performance.

- 5.
This dataset can be downloaded from http://www-personal.umich.edu/~mejn/netdata/.

## References

- 1.
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’98, pp 94–105

- 2.
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’93, pp 207–216

- 3.
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, VLDB ’94, pp 487–499

- 4.
Bansal N, Korula N, Nagarajan V, Srinivasan A (2012) Solving packing integer programs via randomized rounding with alterations. Theory Comput 8(24):533–565

- 5.
Besson J, Pensa RG, Robardet C, Boulicaut JF (2005) Constraint-based mining of fault-tolerant patterns from Boolean data. In: Proceedings of the international conference on knowledge discovery in inductive databases, pp 55–71

- 6.
Briest P, Krysta P, Vöcking B (2011) Approximation techniques for utilitarian mechanism design. SIAM J Comput 40(6):1587–1622

- 7.
Calders T, Goethals B (2005) Depth-first non-derivable itemset mining. In: Proceedings of the SIAM international conference on data mining, SDM ’05, pp 250–261

- 8.
Cheng H, Yu PS, Han J (2008) Approximate frequent itemset mining in the presence of random noise. In: Soft computing for knowledge discovery and data mining, pp 363–389

- 9.
Cong G, Tung AKH, Xu X, Pan F, Yang J (2004) FARMER: finding interesting rule groups in microarray datasets. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’04, pp 143–154

- 10.
Dourisboure Y, Geraci F, Pellegrini M (2009) Extraction and classification of dense implicit communities in the web graph. ACM Trans Web 3(2):7:1–7:36

- 11.
Gupta R, Fang G, Field B, Steinbach M, Kumar V (2008) Quantitative evaluation of approximate frequent pattern mining algorithms. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08, pp 301–309

- 12.
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86

- 13.
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’00, pp 1–12

- 14.
Hochbaum DS (1997) Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In: Approximation algorithms for NP-hard problems, pp 94–143

- 15.
Koh JL, Yo PW (2005) An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’05, pp 568–575

- 16.
Kolliopoulos SG, Young NE (2005) Approximation algorithms for covering/packing integer programs. J Comput Syst Sci 71(4):495–505

- 17.
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1:1–1:58

- 18.
Krysta P (2005) Greedy approximation via duality for packing, combinatorial auctions and routing. In: Proceedings of the international symposium on mathematical foundations of computer science, MFCS ’05, pp 615–627

- 19.
Lee G, Peng SL, Lin YT (2009) Proportional fault-tolerant data mining with applications to bioinformatics. Inf Syst Front 11(4):461–469

- 20.
Liu J, Paulsen S, Sun X, Wang W, Nobel A, Prins J (2006) Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In: Proceedings of the SIAM international conference on data mining, SDM ’06, pp 405–416

- 21.
Liu S, Poon CK (2014) On mining proportional fault-tolerant frequent itemsets. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’14, pp 342–356

- 22.
Liu X, Li J, Wang L (2010) Modeling protein interacting groups by quasi-bicliques: complexity, algorithm, and application. IEEE ACM Trans Comput Biol Bioinform 7(2):354–364

- 23.
Pei J, Tung AKH, Han J (2001) Fault-tolerant frequent pattern mining: problems and challenges. In: Proceedings of the international workshop on research issues on data mining and knowledge discovery, pp 7–12

- 24.
Poernomo AK, Gopalkrishnan V (2007) Mining statistical information of frequent fault-tolerant patterns in transactional databases. In: Proceedings of the IEEE international conference on data mining, ICDM ’07, pp 272–281

- 25.
Poernomo AK, Gopalkrishnan V (2009) Efficient computation of partial-support for mining interesting itemsets. In: Proceedings of the SIAM international conference on data mining, SDM ’09, pp 1014–1025

- 26.
Poernomo AK, Gopalkrishnan V (2009) Towards efficient mining of proportional fault-tolerant frequent itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09, pp 697–706

- 27.
Raghavan P (1988) Probabilistic construction of deterministic algorithms: approximating packing integer programs. J Comput Syst Sci 37(2):130–143

- 28.
Raghavan P, Thompson CD (1987) Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica 7(4):365–374

- 29.
Seppänen JK, Mannila H (2004) Dense itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’04, pp 683–688

- 30.
Sim K, Li J, Gopalkrishnan V, Liu G (2006) Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In: Proceedings of the IEEE international conference on data mining, ICDM ’06, pp 1059–1063

- 31.
Srinivasan A (1999) Improved approximation guarantees for packing and covering integer programs. SIAM J Comput 29(2):648–670

- 32.
Wang SS, Lee SY (2002) Mining fault-tolerant frequent patterns in large databases. In: Proceedings of the international computer symposium

- 33.
Wang X, Borgelt C, Kruse R (2005) Fuzzy frequent pattern discovering based on recursive elimination. In: Proceedings of the international conference on machine learning and applications, pp 391–396

- 34.
Yang C, Fayyad U, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01, pp 194–203

- 35.
Zeng JJ, Lee G, Lee CC (2008) Mining fault-tolerant frequent patterns efficiently with powerful pruning. In: Proceedings of the ACM symposium on applied computing, pp 927–931

## Acknowledgements

This work was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/FDS11/E02/15) and partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 122512). The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions.

## Additional information

This paper is an extended version of [21].

## Appendix

Fault-tolerant frequent itemset (FTFI) mining has many applications. We present an example from the World Wide Web (WWW). An interesting and important topic in the study of the WWW is the discovery of implicit local structures, i.e., *cyber-communities*, in order to better understand the sociological behavior and ever-increasing phenomena in the Web.

To this end, the WWW is commonly modeled as a *directed* Web graph *G*(*V*, *E*) where each vertex represents a web page and each arc represents a hyperlink from one web page to another. A cyber-community is then a subgraph with dense connections from one subset of vertices (representing web pages) to another subset (representing interests, etc.). Dourisboure et al. [10] showed that relatively dense subgraphs, rather than complete subgraphs, capture larger and more meaningful communities in the Web.

In the language of frequent itemset mining, one can view a vertex *v* as an item \(i_v\) and the set of out-neighbors of *v* as a transaction \(t_v\), i.e., \(t_v=\{i_u \mid u \in V \text { and } (v,u) \in E\}\). Then a fault-tolerant frequent pattern (i.e., an itemset together with its supporting transactions, where each item appears in most of the transactions and each transaction contains most of the items) represents a cyber-community in which *most* (but not necessarily all) of a set of web pages share *many* (but not necessarily all) of a set of common interests or authorities.
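The mapping from a directed graph to a transactional database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `graph_to_transactions`, the edge-list input format, and the toy page names are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def graph_to_transactions(edges):
    """Build one transaction per vertex v: the set of out-neighbors of v.

    `edges` is an iterable of (v, u) pairs, one per arc from v to u.
    Vertices without out-going arcs never receive an entry, so they
    produce no (empty) transaction.
    """
    transactions = defaultdict(set)
    for v, u in edges:
        transactions[v].add(u)  # item i_u is identified with the vertex u itself
    return dict(transactions)


# Toy directed graph (hypothetical data, not taken from any dataset).
edges = [
    ("p1", "a"), ("p1", "b"),
    ("p2", "a"), ("p2", "b"), ("p2", "c"),
    ("p3", "b"),
]
db = graph_to_transactions(edges)
```

Here `db` maps each linking page to the set of pages it points to; pages such as `a` that only receive links appear as items but not as transactions.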

We tested our algorithms on the Political Blogs dataset (see Note 5) to demonstrate the power of fault-tolerant frequent itemset mining. The Political Blogs dataset is a directed network of 1490 weblogs on US politics with 19,090 hyperlinks between these weblogs. Each blog in the dataset has an attribute describing its political leaning as either liberal or conservative.

After modeling the network as a transactional database as described above and removing the empty transactions (corresponding to vertices with no outgoing edges), we are left with 1065 transactions. We applied our exact and (approximate) iterative insertion-based FTFI mining algorithms (described in Sects. 4 and 3.2.2, respectively) with a minimum support threshold of \(\sigma = 10\%\) and proportional relaxation parameters \((\alpha _p, \beta _p) = (0.3, 0.2)\). Both algorithms extracted 198 FTFIs. Under the classic frequent itemset definition (i.e., without any fault tolerance), only 59 frequent itemsets can be found.
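To make the role of the thresholds concrete, the following Python sketch checks whether a given pair (*X*, *Y*) satisfies proportional fault tolerance in the informal sense used in this appendix: every transaction in *Y* misses only a small fraction of the items in *X*, and every item in *X* is missing from only a small fraction of the transactions in *Y*. It is a sketch, not the authors' algorithm. The names `trans_tol` and `item_tol` are hypothetical, and which of \(\alpha _p\) and \(\beta _p\) plays which role is not restated here, so that mapping is left open. Reading the minimum support as "at least \(\sigma\) times the number of transactions" is likewise an assumption. The check only verifies a given *Y*; it does not compute the fault-tolerant support count, which is the NP-hard part.

```python
import math
from fractions import Fraction


def min_support_count(num_transactions, sigma):
    """Smallest integer count that reaches a fraction sigma of the database."""
    return math.ceil(sigma * num_transactions)


def is_ft_pattern(X, Y, db, trans_tol, item_tol):
    """Check proportional fault tolerance of itemset X with supporting ids Y.

    trans_tol: each transaction in Y may miss at most floor(trans_tol * |X|) items.
    item_tol:  each item in X may be absent from at most floor(item_tol * |Y|)
               transactions. Both names are assumptions for illustration.
    """
    max_missing_per_transaction = math.floor(trans_tol * len(X))
    max_missing_per_item = math.floor(item_tol * len(Y))
    for t in Y:
        if len(X - db[t]) > max_missing_per_transaction:
            return False
    for item in X:
        missing = sum(1 for t in Y if item not in db[t])
        if missing > max_missing_per_item:
            return False
    return True


# Toy database (hypothetical): only t1 contains all of X.
db = {
    "t1": {"a", "b", "c", "d"},
    "t2": {"a", "b", "c"},
    "t3": {"a", "b", "d"},
    "t4": {"a", "c", "d"},
    "t6": {"a", "b"},
}
X = {"a", "b", "c", "d"}
quarter = Fraction(1, 4)  # exact fractions avoid floating-point rounding at the boundary

exact_support = sum(1 for items in db.values() if X <= items)
tolerant = is_ft_pattern(X, ["t1", "t2", "t3", "t4"], db, quarter, quarter)
too_sparse = is_ft_pattern(X, ["t1", "t2", "t3", "t4", "t6"], db, quarter, quarter)
needed = min_support_count(1065, Fraction(1, 10))
```

In the toy data the classic support of `X` is 1, while four transactions support it once each is allowed to miss one item; adding `t6`, which misses two items, breaks the pattern. Under the assumed reading of the threshold, 10% of 1065 transactions rounds up to 107 supporting transactions.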

To further illustrate the advantage of FTFIs over standard frequent itemsets, we look at a randomly selected fault-tolerant frequent pattern (*X*, *Y*) where \(X = \{a,b,c,d,e,f,g\}\) and \(|Y| = 111\); see Fig. 8. All weblogs in *X* have the same political leaning, namely liberal, and 108 of the 111 transactions in *Y* are also liberal.

On the other hand, the frequent itemset mining algorithm (with no fault tolerance) was unable to discover *X* at the same minimum support of 10%. Even when the minimum support is lowered to 5%, it can only identify some fragments of *X* such as \(X_1 = \{a,b,c,d,e\}\) and \(X_2 = \{d,e,f,g\}\). See Fig. 8.

## About this article

### Cite this article

Liu, S., Poon, C.K. On mining approximate and exact fault-tolerant frequent itemsets. *Knowl Inf Syst* **55**, 361–391 (2018). https://doi.org/10.1007/s10115-017-1079-4

### Keywords

- Data mining
- Mining methods and algorithms
- Frequent itemsets
- Fault tolerance
- Approximate support count