Abstract
We investigate the problem of counting the number of frequent (item)sets, a problem for which exact computation in polynomial time is known to be intractable. We show that it is, in general, also hard to approximate. We then develop a randomized counting algorithm based on the Markov chain Monte Carlo method. Although exponential running time is needed on general inputs to guarantee a given approximation bound, we show that the algorithm still attains the desired accuracy on several real-world datasets when its running time is capped polynomially.
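To make the counting problem concrete, the following is a minimal sketch, not the algorithm developed in the paper. It assumes a toy dataset and hypothetical helper names (`support`, `count_frequent_exact`, `estimate_frequent_count`). The exact counter enumerates all subsets and is therefore exponential in the number of items. The estimator runs a lazy random walk over the frequent sets, whose stationary distribution is uniform, and uses the fraction of time spent at the empty set as a crude stand-in for a proper Markov chain Monte Carlo counting scheme.

```python
import random
from itertools import combinations

# Toy dataset (an assumption for illustration, not taken from the paper).
ITEMS = ["a", "b", "c", "d"]
TRANSACTIONS = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bc"),
    frozenset("abcd"),
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def count_frequent_exact(items, transactions, minsup):
    """Brute-force count of frequent sets (the empty set included)."""
    return sum(
        1
        for k in range(len(items) + 1)
        for combo in combinations(items, k)
        if support(frozenset(combo), transactions) >= minsup
    )


def estimate_frequent_count(items, transactions, minsup, steps, seed=0):
    """Estimate the count from a lazy random walk over frequent sets.

    Each step stays put with probability 1/2; otherwise it toggles one
    uniformly chosen item and moves only if the result is frequent.
    The moves are symmetric, so the walk visits every frequent set
    equally often in the long run, and the empty set is visited about
    a 1/N fraction of the time, where N is the number of frequent sets.
    Assumes the empty set is frequent (minsup <= len(transactions)).
    """
    rng = random.Random(seed)
    current = frozenset()
    visits_to_empty = 0
    for _ in range(steps):
        if rng.random() < 0.5:
            proposal = current ^ {rng.choice(items)}
            if support(proposal, transactions) >= minsup:
                current = proposal
        if not current:
            visits_to_empty += 1
    if visits_to_empty == 0:
        raise ValueError("walk never visited the empty set; use more steps")
    return steps / visits_to_empty


exact = count_frequent_exact(ITEMS, TRANSACTIONS, minsup=2)  # 8 for this toy data
estimate = estimate_frequent_count(ITEMS, TRANSACTIONS, minsup=2, steps=200_000)
```

On this toy input the walk mixes quickly because the state space is tiny. The sketch gives no accuracy guarantee, and on general inputs such a chain can take exponentially long to mix, consistent with the hardness result stated above.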
Additional information
A short version of this paper has appeared in the proceedings of the 2008 eighth IEEE international conference on data mining (ICDM 2008).
Cite this article
Boley, M., Grosskreutz, H. Approximating the number of frequent sets in dense data. Knowl Inf Syst 21, 65–89 (2009). https://doi.org/10.1007/s10115-009-0212-4
DOI: https://doi.org/10.1007/s10115-009-0212-4