On mining approximate and exact fault-tolerant frequent itemsets

Liu, Shengxin; Poon, Chung Keung

doi:10.1007/s10115-017-1079-4

Regular Paper
Published: 11 July 2017

Volume 55, pages 361–391 (2018)

Knowledge and Information Systems

Abstract

Robust frequent itemset mining has attracted much attention because many applications need to find frequent patterns in noisy data. In this paper, we focus on a variant of robust frequent itemsets in which a small number of “faults” is allowed in each item and each supporting transaction. This problem is challenging because computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximation version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms while the error remains acceptable. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms.

Notes

1. We use a machine different from the one in the conference version [21] and observe no significant difference in the performance.
2. http://www.cais.ntu.edu.sg/~vivek/pubs/ftfim09/.
3. http://lpsolve.sourceforge.net/5.5/
4. http://fimi.cs.helsinki.fi/data/.
5. This dataset can be downloaded from http://www-personal.umich.edu/~mejn/netdata/.

References

Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’98, pp 94–105
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’93, pp 207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, VLDB ’94, pp 487–499
Bansal N, Korula N, Nagarajan V, Srinivasan A (2012) Solving packing integer programs via randomized rounding with alterations. Theory Comput 8(24):533–565
Besson J, Pensa RG, Robardet C, Boulicaut JF (2005) Constraint-based mining of fault-tolerant patterns from Boolean data. In: Proceedings of the international conference on knowledge discovery in inductive databases, pp 55–71
Briest P, Krysta P, Vöcking B (2011) Approximation techniques for utilitarian mechanism design. SIAM J Comput 40(6):1587–1622
Calders T, Goethals B (2005) Depth-first non-derivable itemset mining. In: Proceedings of the SIAM international conference on data mining, SDM ’05, pp 250–261
Cheng H, Yu PS, Han J (2008) Approximate frequent itemset mining in the presence of random noise. In: Soft computing for knowledge discovery and data mining, pp 363–389
Cong G, Tung AKH, Xu X, Pan F, Yang J (2004) FARMER: finding interesting rule groups in microarray datasets. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’04, pp 143–154
Dourisboure Y, Geraci F, Pellegrini M (2009) Extraction and classification of dense implicit communities in the web graph. ACM Trans Web 3(2):7:1–7:36
Gupta R, Fang G, Field B, Steinbach M, Kumar V (2008) Quantitative evaluation of approximate frequent pattern mining algorithms. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08, pp 301–309
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD ’00, pp 1–12
Hochbaum DS (1997) Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In: Approximation algorithms for NP-hard problems, pp 94–143
Koh JL, Yo PW (2005) An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’05, pp 568–575
Kolliopoulos SG, Young NE (2005) Approximation algorithms for covering/packing integer programs. J Comput Syst Sci 71(4):495–505
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1:1–1:58
Krysta P (2005) Greedy approximation via duality for packing, combinatorial auctions and routing. In: Proceedings of the international symposium on mathematical foundations of computer science, MFCS ’05, pp 615–627
Lee G, Peng SL, Lin YT (2009) Proportional fault-tolerant data mining with applications to bioinformatics. Inf Syst Front 11(4):461–469
Liu J, Paulsen S, Sun X, Wang W, Nobel A, Prins J (2006) Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In: Proceedings of the SIAM international conference on data mining, SDM ’06, pp 405–416
Liu S, Poon CK (2014) On mining proportional fault-tolerant frequent itemsets. In: Proceedings of the international conference on database systems for advanced applications, DASFAA ’14, pp 342–356
Liu X, Li J, Wang L (2010) Modeling protein interacting groups by quasi-bicliques: complexity, algorithm, and application. IEEE ACM Trans Comput Biol Bioinform 7(2):354–364
Pei J, Tung AKH, Han J (2001) Fault-tolerant frequent pattern mining: problems and challenges. In: Proceedings of the international workshop on research issues on data mining and knowledge discovery, pp 7–12
Poernomo AK, Gopalkrishnan V (2007) Mining statistical information of frequent fault-tolerant patterns in transactional databases. In: Proceedings of the IEEE international conference on data mining, ICDM ’07, pp 272–281
Poernomo AK, Gopalkrishnan V (2009) Efficient computation of partial-support for mining interesting itemsets. In: Proceedings of the SIAM international conference on data mining, SDM ’09, pp 1014–1025
Poernomo AK, Gopalkrishnan V (2009) Towards efficient mining of proportional fault-tolerant frequent itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09, pp 697–706
Raghavan P (1988) Probabilistic construction of deterministic algorithms: approximating packing integer programs. J Comput Syst Sci 37(2):130–143
Raghavan P, Thompson CD (1987) Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica 7(4):365–374
Seppänen JK, Mannila H (2004) Dense itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’04, pp 683–688
Sim K, Li J, Gopalkrishnan V, Liu G (2006) Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In: Proceedings of the IEEE international conference on data mining, ICDM ’06, pp 1059–1063
Srinivasan A (1999) Improved approximation guarantees for packing and covering integer programs. SIAM J Comput 29(2):648–670
Wang SS, Lee SY (2002) Mining fault-tolerant frequent patterns in large databases. In: Proceedings of the international computer symposium
Wang X, Borgelt C, Kruse R (2005) Fuzzy frequent pattern discovering based on recursive elimination. In: Proceedings of the international conference on machine learning and applications, pp 391–396
Yang C, Fayyad U, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01, pp 194–203
Zeng JJ, Lee G, Lee CC (2008) Mining fault-tolerant frequent patterns efficiently with powerful pruning. In: Proceedings of the ACM symposium on applied computing, pp 927–931

Acknowledgements

This work was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/FDS11/E02/15) and partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 122512). The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions.

Author information

Authors and Affiliations

Department of Computer Science, City University of Hong Kong, Hong Kong, China
Shengxin Liu
School of Computing and Information Sciences, Caritas Institute of Higher Education, Hong Kong, China
Chung Keung Poon

Corresponding author

Correspondence to Chung Keung Poon.

Additional information

This paper is an extended version of [21].

Appendix

There are many applications of fault-tolerant frequent itemset (FTFI) mining. We present an example from the World Wide Web (WWW). An interesting and important topic in the study of the WWW is the discovery of implicit local structures, i.e., cyber-communities, in order to better understand sociological behavior and the ever-growing range of phenomena on the Web.

To this end, the WWW is commonly modeled as a directed Web graph G(V, E), where each vertex represents a web page and each arc represents a hyperlink from one web page to another. A cyber-community is then a subgraph with dense connections from one subset of vertices (representing web pages) to another subset (representing interests, etc.). Dourisboure et al. [10] showed that relatively dense subgraphs, rather than complete subgraphs, capture larger and more meaningful communities in the Web.

In the language of frequent itemset mining, one can view a vertex v as an item \(i_v\) and the set of out-neighbors of v as a transaction \(t_v\), i.e., \(t_v=\{i_u \mid u \in V \text{ and } (v,u) \in E\}\). Then a fault-tolerant frequent pattern (i.e., an itemset together with its supporting transactions, where each item appears in most of the transactions and each transaction contains most of the items) represents a cyber-community in which most (but not necessarily all) of a set of web pages share many (but not necessarily all) of a set of common interests or authorities.
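The mapping from a directed graph to a transactional database can be sketched in a few lines of Python. This is only an illustration of the modeling step described above, not the authors' code; the function name `graph_to_transactions` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def graph_to_transactions(edges):
    """Map each vertex v to the transaction t_v, the set of out-neighbors of v.

    `edges` is an iterable of (v, u) pairs, one per arc from v to u.
    """
    transactions = defaultdict(set)
    for v, u in edges:
        transactions[v].add(u)
    # A vertex without outgoing arcs never becomes a key, so empty
    # transactions are left out of the result.
    return dict(transactions)


# Hypothetical toy graph: vertex "e" has incoming arcs only.
edges = [
    ("a", "b"), ("a", "c"), ("a", "e"),
    ("b", "c"),
    ("c", "a"),
    ("d", "a"), ("d", "b"), ("d", "c"),
]
transactions = graph_to_transactions(edges)
```

Here every vertex plays both roles: it is an item (when it is linked to) and the identifier of a transaction (when it links out).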

We tested our algorithms on the Political Blogs dataset (footnote 5) to demonstrate the power of fault-tolerant frequent itemset mining. The Political Blogs dataset is a directed network of 1490 weblogs on US politics with 19,090 hyperlinks between them. Each blog in the dataset has an attribute describing its political leaning as either liberal or conservative.

After modeling it as a transactional database as described above and removing the empty transactions (corresponding to vertices with no outgoing edges), we are left with 1065 transactions. We applied our exact and (approximate) iterative insertion-based FTFI mining algorithms (described in Sects. 4 and 3.2.2, respectively) with a minimum support threshold of \(\sigma = 10\%\) and proportional relaxation parameters \((\alpha _p, \beta _p) = (0.3, 0.2)\). Both algorithms extracted 198 FTFIs. Using the classic frequent itemset definition (i.e., without any fault tolerance), only 59 frequent itemsets can be found.
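To make the role of the three parameters concrete, the following Python sketch checks whether a given pair (X, Y) qualifies as a fault-tolerant frequent pattern. The formal definition is not reproduced in this appendix, so the sketch rests on labeled assumptions: each transaction in Y may miss at most \(\lfloor \alpha_p |X| \rfloor\) items of X, each item of X may be absent from at most \(\lfloor \beta_p |Y| \rfloor\) transactions of Y, and |Y| must be at least \(\sigma\) times the number of transactions. The function name `is_ft_pattern`, the assignment of \(\alpha_p\) to transactions and \(\beta_p\) to items, and the rounding are all assumptions. The sketch only verifies a candidate pattern; it does not mine patterns.

```python
import math


def is_ft_pattern(X, Y, transactions, alpha, beta, min_support):
    """Check (X, Y) under an ASSUMED proportional fault-tolerance definition.

    X: candidate itemset; Y: ids of supporting transactions;
    transactions: dict mapping transaction id -> set of items.
    """
    X, Y = set(X), list(Y)
    if not X or len(Y) < min_support * len(transactions):
        return False
    # Assumption: alpha bounds faults per transaction, beta faults per item,
    # and both bounds are rounded down.
    max_row_faults = math.floor(alpha * len(X))
    max_col_faults = math.floor(beta * len(Y))
    for t in Y:
        if len(X - transactions[t]) > max_row_faults:
            return False
    for item in X:
        missing = sum(1 for t in Y if item not in transactions[t])
        if missing > max_col_faults:
            return False
    return True


# Hypothetical toy database: t2..t5 each miss exactly one item of X.
db = {
    "t1": {"a", "b", "c", "d"},
    "t2": {"a", "b", "c"},
    "t3": {"a", "b", "d"},
    "t4": {"a", "c", "d"},
    "t5": {"b", "c", "d"},
    "t6": {"x"},
}
X = {"a", "b", "c", "d"}
Y = ["t1", "t2", "t3", "t4", "t5"]

tolerant = is_ft_pattern(X, Y, db, alpha=0.3, beta=0.2, min_support=0.5)
exact = is_ft_pattern(X, Y, db, alpha=0.0, beta=0.0, min_support=0.5)
```

In this toy database the tolerant check accepts (X, Y), while the check with both relaxation parameters set to zero rejects it, which mirrors how the classic definition finds fewer itemsets than the fault-tolerant one.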

To further illustrate the advantage of FTFIs over standard frequent itemsets, we look at a randomly selected fault-tolerant frequent pattern (X, Y) where \(X = \{a,b,c,d,e,f,g\}\) and \(|Y| = 111\), see Fig. 8. It turns out that all weblogs in X have the same political leaning, i.e., liberal, and 108 out of the 111 transactions in Y are also liberal.

On the other hand, the frequent itemset mining algorithm (with no fault tolerance) was unable to discover X at the same minimum support of 10%. Even when the minimum support was lowered to 5%, it could only identify some fragments of X such as \(X_1 = \{a,b,c,d,e\}\) and \(X_2 = \{d,e,f,g\}\). See Fig. 8.

About this article

Cite this article

Liu, S., Poon, C.K. On mining approximate and exact fault-tolerant frequent itemsets. Knowl Inf Syst 55, 361–391 (2018). https://doi.org/10.1007/s10115-017-1079-4

Received: 12 August 2016
Revised: 20 May 2017
Accepted: 30 June 2017
Published: 11 July 2017
Issue Date: May 2018
DOI: https://doi.org/10.1007/s10115-017-1079-4
