Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

  • Matteo Riondato
  • Eli Upfal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7523)

Abstract

The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is \(O(\frac{1}{\varepsilon^2}(d+\log\frac{1}{\delta}))\) (resp. \(O(\frac{2+\varepsilon}{\varepsilon^2(2-\varepsilon)\theta}(d\log\frac{2+\varepsilon}{(2-\varepsilon)\theta}+\log\frac{1}{\delta}))\)) transactions, which is a significant improvement over previously known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods and showing that they achieve even higher quality approximations than what is guaranteed by the analysis.
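As an illustration of the quantities named in the abstract, the following minimal Python sketch computes the d-index as defined there (the largest d such that at least d transactions have length at least d) and evaluates the two sample-size expressions. The function names and the constant `c` are assumptions made for this sketch: the abstract states the bounds only in O-notation, so the hidden constant is not known from this page and `c = 1.0` is a placeholder, and the logarithms are taken as natural logarithms. `theta` denotes the minimum frequency threshold appearing in the relative bound.

```python
import math
from typing import Hashable, Iterable, Sequence


def d_index(transactions: Iterable[Sequence[Hashable]]) -> int:
    """Largest d such that at least d transactions have length >= d."""
    # Transactions are treated as itemsets, so repeated items count once.
    lengths = sorted((len(set(t)) for t in transactions), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank  # the `rank` longest transactions all have length >= rank
        else:
            break
    return d


def absolute_sample_size(d: int, eps: float, delta: float, c: float = 1.0) -> int:
    """Evaluate c * (1/eps^2) * (d + ln(1/delta)); c is a placeholder constant."""
    return math.ceil(c / eps**2 * (d + math.log(1.0 / delta)))


def relative_sample_size(d: int, eps: float, delta: float, theta: float,
                         c: float = 1.0) -> int:
    """Evaluate the relative bound from the abstract, up to the constant c."""
    ratio = (2.0 + eps) / ((2.0 - eps) * theta)
    return math.ceil(c * ratio / eps**2 * (d * math.log(ratio) + math.log(1.0 / delta)))


if __name__ == "__main__":
    toy_dataset = [[1, 2, 3], [1, 2], [3], [1, 2, 3, 4]]
    d = d_index(toy_dataset)
    print("d-index:", d)
    print("absolute sample size:", absolute_sample_size(d, eps=0.1, delta=0.1))
    print("relative sample size:", relative_sample_size(d, eps=0.1, delta=0.1, theta=0.2))
```

The d-index needs only the transaction lengths, so a single pass over the dataset suffices to collect them, and the resulting sample sizes depend on d, ε, δ (and θ) but not on the number of transactions in the dataset.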

Keywords

Association Rule; Frequent Itemsets; Association Rule Mining; Range Space; Frequent Itemsets Mining
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Matteo Riondato (1)
  • Eli Upfal (1)
  1. Department of Computer Science, Brown University, Providence, USA