Abstract
In this paper, we show that a power-law relationship and self-similarity appear in the itemset support distribution, that is, the distribution of the number of itemsets versus their supports. Characterizing these natural phenomena is useful in many applications, for example in guiding performance tuning for frequent-itemset mining. However, because the number of itemsets grows explosively, it is prohibitively expensive to retrieve a large number of itemsets before the characteristics of the itemset support distribution in the target data can be identified. We therefore propose a valid and cost-effective algorithm, called PPL, to extract the characteristics of the itemset support distribution. Furthermore, to fully exploit this discovery, we propose novel PPL-based mechanisms for two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining a sufficient sample size for mining frequent patterns. Our experimental results show that PPL efficiently and precisely identifies the characteristics of the itemset support distribution in various real data sets. Empirical studies also demonstrate that our mechanisms for these two challenging problems are orders of magnitude better than previous work, which shows the value of PPL as a pre-processing step for mining applications.
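To make the term "itemset support distribution" concrete, the following minimal Python sketch counts, for a toy transaction set, how many itemsets have each support value and then fits a straight line to the log-log points. It is not algorithm PPL, whose details the abstract does not give: it enumerates itemsets by brute force, which is exactly the cost the paper aims to avoid. The function names, the `max_size` cap, and the toy transactions are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def itemset_support_distribution(transactions, max_size=2):
    """Return a Counter mapping support value -> number of itemsets with that support.

    Brute-force enumeration of all itemsets up to max_size items
    (hypothetical helper, feasible only for tiny inputs).
    """
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                support[itemset] += 1
    # Count how many itemsets share each support value.
    return Counter(support.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(support).

    A roughly linear log-log plot with negative slope is the usual
    informal signature of a power-law relationship.
    """
    xs = [math.log(s) for s in distribution]
    ys = [math.log(c) for c in distribution.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data, invented for this example.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["a", "d"],
    ["b", "e"],
]

dist = itemset_support_distribution(transactions)
slope = loglog_slope(dist)
for s in sorted(dist):
    print(f"support={s}: {dist[s]} itemsets")
print(f"log-log slope: {slope:.2f}")
```

Even on this tiny input, many itemsets have low support and few have high support, so the fitted slope is negative. A five-transaction example cannot establish a power law; it only shows what quantity is being measured.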
Cite this article
Chuang, KT., Huang, JL. & Chen, MS. Power-law relationship and self-similarity in the itemset support distribution: analysis and applications. The VLDB Journal 17, 1121–1141 (2008). https://doi.org/10.1007/s00778-007-0054-1
DOI: https://doi.org/10.1007/s00778-007-0054-1