Joint European Conference on Machine Learning and Knowledge Discovery in Databases

ECML PKDD 2012: Machine Learning and Knowledge Discovery in Databases, pp 25–41


Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

  • Matteo Riondato &
  • Eli Upfal
  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7523)

Abstract

The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s suffice for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique that provides tight bounds on the sample size guaranteeing approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size depends linearly on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is \(O(\frac{1}{\varepsilon^2}(d+\log\frac{1}{\delta}))\) (resp. \(O(\frac{2+\varepsilon}{\varepsilon^2(2-\varepsilon)\theta}(d\log\frac{2+\varepsilon}{(2-\varepsilon)\theta}+\log\frac{1}{\delta}))\)) transactions, a significant improvement over previously known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods and showing that they achieve even higher-quality approximations than guaranteed by the analysis.
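The d-index defined in the abstract is an h-index-style quantity over transaction lengths: sort the lengths in decreasing order and find the largest d whose d-th largest length is still at least d. A minimal sketch of this computation (the toy `transactions` dataset is illustrative, not from the paper):

```python
def d_index(dataset):
    """Maximum integer d such that the dataset contains at least d
    transactions of length at least d (an h-index over transaction
    lengths)."""
    lengths = sorted((len(t) for t in dataset), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i  # the i largest transactions all have length >= i
        else:
            break
    return d

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "beer"},
    {"milk", "eggs", "beer", "diapers"},
    {"bread"},
]
print(d_index(transactions))  # → 2: two transactions have length >= 2,
                              # but fewer than three have length >= 3
```

Since only the sorted transaction lengths matter, the d-index is computable in a single scan of the dataset plus a sort, without any itemset mining.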

Keywords

  • Association Rule
  • Frequent Itemsets
  • Association Rule Mining
  • Range Space
  • Frequent Itemsets Mining

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Work was supported in part by NSF award IIS-0905553.



Author information

Authors and Affiliations

  1. Department of Computer Science, Brown University, Providence, RI, USA

    Matteo Riondato & Eli Upfal


Editor information

Editors and Affiliations

  1. Intelligent Systems Laboratory, University of Bristol, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK

    Peter A. Flach, Tijl De Bie & Nello Cristianini


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Riondato, M., Upfal, E. (2012). Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_7

  • DOI: https://doi.org/10.1007/978-3-642-33460-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33459-7

  • Online ISBN: 978-3-642-33460-3

  • eBook Packages: Computer Science, Computer Science (R0)
