
Approximating the number of frequent sets in dense data

  • Regular Paper
Knowledge and Information Systems

Abstract

We investigate the problem of counting the number of frequent (item)sets—a problem known to be intractable in terms of an exact polynomial time computation. In this paper, we show that it is in general also hard to approximate. Subsequently, a randomized counting algorithm is developed using the Markov chain Monte Carlo method. While for general inputs an exponential running time is needed in order to guarantee a certain approximation bound, we show that the algorithm still has the desired accuracy on several real-world datasets when its running time is capped polynomially.
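
To make the counting problem concrete, here is a minimal Python sketch. It is not the authors' algorithm, and all function names and the toy dataset are hypothetical. It assumes transactions are sets of items and that a set is frequent when at least `min_support` transactions contain it. The first function counts frequent sets exactly by enumerating every subset of the item universe, which takes exponential time. The second runs a simple lazy random walk that toggles one item at a time and rejects moves that leave the family of frequent sets. Its stationary distribution is uniform over the frequent sets, but the number of steps needed to get close to uniform can be very large on general inputs. Turning such samples into a count estimate is not shown.

```python
import random
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def count_frequent_exact(transactions, min_support):
    """Exact count by brute force over all 2^n subsets of the item universe."""
    items = sorted(set().union(*transactions))
    return sum(
        1
        for k in range(len(items) + 1)
        for combo in combinations(items, k)
        if support(frozenset(combo), transactions) >= min_support
    )


def random_walk_sample(transactions, min_support, steps, rng):
    """Lazy single-item-toggle walk restricted to frequent sets.

    Assumes min_support <= len(transactions), so the empty set is frequent
    and can serve as the start state. Moves are symmetric, so the uniform
    distribution over frequent sets is stationary.
    """
    items = sorted(set().union(*transactions))
    current = frozenset()
    for _ in range(steps):
        if rng.random() < 0.5:
            continue  # stay put with probability 1/2 (keeps the chain aperiodic)
        proposal = current ^ {rng.choice(items)}  # add or remove one item
        if support(proposal, transactions) >= min_support:
            current = proposal  # accept only moves that stay frequent
    return current


if __name__ == "__main__":
    # Hypothetical toy dataset: four transactions over the items a, b, c.
    data = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
    print(count_frequent_exact(data, 2))  # 7 frequent sets, including the empty set
    print(sorted(random_walk_sample(data, 2, 200, random.Random(0))))
```

The walk only ever evaluates the support of one candidate set per step, so a step costs one pass over the data, whereas the exact count touches all subsets. This is the trade-off the abstract alludes to when it caps the running time polynomially.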

Author information

Corresponding author

Correspondence to Mario Boley.

Additional information

A short version of this paper appeared in the proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM 2008).

About this article

Cite this article

Boley, M., Grosskreutz, H. Approximating the number of frequent sets in dense data. Knowl Inf Syst 21, 65–89 (2009). https://doi.org/10.1007/s10115-009-0212-4

  • DOI: https://doi.org/10.1007/s10115-009-0212-4
