Power-law relationship and self-similarity in the itemset support distribution: analysis and applications

  • Regular Paper
The VLDB Journal

Abstract

In this paper, we show that a power-law relationship and self-similarity appear in the itemset support distribution, that is, the distribution of the number of itemsets versus their supports. Characterizing these natural phenomena is useful in many applications, for example in guiding performance tuning for frequent-itemset mining. However, because the number of itemsets is explosive, it is prohibitively expensive to retrieve a large number of itemsets before the characteristics of the itemset support distribution in the target data can be identified. We therefore propose a valid and cost-effective algorithm, called PPL, to extract the characteristics of the itemset support distribution. Furthermore, to fully exploit this discovery, we propose novel PPL-based mechanisms for two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining a sufficient sample size for mining frequent patterns. Our experimental results show that PPL efficiently and precisely identifies the characteristics of the itemset support distribution in various real data. In addition, empirical studies demonstrate that our mechanisms for these two challenging problems are orders of magnitude better than previous work, which shows the value of PPL as a pre-processing step for mining applications.
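To make the definition of the itemset support distribution concrete, the following minimal sketch counts, for a toy transaction list, how many itemsets share each support value, and then estimates the slope of the distribution on a log-log scale. It is not algorithm PPL, whose details are not given in this abstract: it enumerates itemsets by brute force, which is exactly the cost PPL is meant to avoid. The function names, the `max_size` cap, and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def itemset_support_distribution(transactions, max_size=2):
    """Map each support value to the number of itemsets with that support.

    Brute-force enumeration of all itemsets up to max_size items
    (illustrative only; this is not the PPL algorithm).
    """
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                support[itemset] += 1
    # Count how many itemsets share each support value.
    return Counter(support.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(support)."""
    xs = [math.log(s) for s in distribution]
    ys = [math.log(c) for c in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data, chosen only to keep the example small.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["a"],
    ["b", "d"],
]

dist = itemset_support_distribution(transactions)
slope = loglog_slope(dist)
```

A negative slope here only indicates that low-support itemsets outnumber high-support ones in this toy input; five transactions are far too few to support any claim about a power law.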


Author information

Corresponding author

Correspondence to Kun-Ta Chuang.

About this article

Cite this article

Chuang, KT., Huang, JL. & Chen, MS. Power-law relationship and self-similarity in the itemset support distribution: analysis and applications. The VLDB Journal 17, 1121–1141 (2008). https://doi.org/10.1007/s00778-007-0054-1

  • DOI: https://doi.org/10.1007/s00778-007-0054-1
