Power-law relationship and self-similarity in the itemset support distribution: analysis and applications

Chuang, Kun-Ta; Huang, Jiun-Long; Chen, Ming-Syan

doi:10.1007/s00778-007-0054-1

Regular Paper
Published: 04 July 2007

Volume 17, pages 1121–1141, (2008)
The VLDB Journal

Kun-Ta Chuang¹,
Jiun-Long Huang² &
Ming-Syan Chen¹

Abstract

In this paper, we show that a power-law relationship and self-similarity appear in the itemset support distribution, that is, the distribution of the number of itemsets versus their supports. Characterizing these phenomena is useful in many applications, for example in guiding performance tuning for frequent-itemset mining. However, because the number of itemsets grows explosively, it is prohibitively expensive to retrieve a large number of itemsets just to identify the characteristics of the itemset support distribution in the target data. We therefore propose a valid and cost-effective algorithm, called PPL, to extract the characteristics of the itemset support distribution. To exploit this discovery further, we also propose novel PPL-based mechanisms for two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining a sufficient sample size for mining frequent patterns. Our experimental results show that PPL identifies the characteristics of the itemset support distribution efficiently and precisely on various real data. Empirical studies also demonstrate that our mechanisms for these two challenging problems are orders of magnitude better than previous work, which shows the advantage of PPL as an important pre-processing step for mining applications.
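To make the definition of the itemset support distribution concrete, the following minimal Python sketch counts, for each support value, how many itemsets have that support, and then estimates a log-log slope by ordinary least squares. It is an illustration only and is not algorithm PPL: it enumerates itemsets by brute force, which is exactly the cost the paper aims to avoid. The function names, the `max_size` cap, and the toy transactions are assumptions introduced here.

```python
from collections import Counter
from itertools import combinations
import math


def itemset_supports(transactions, max_size=2):
    """Count the support of every itemset with at most max_size items.

    Support is the number of transactions that contain the itemset.
    Brute-force enumeration: only feasible for tiny inputs.
    """
    supports = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                supports[itemset] += 1
    return supports


def support_distribution(supports):
    """Map each support value to the number of itemsets with that support."""
    return Counter(supports.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(support).

    A roughly straight line on the log-log plot is what a power-law
    relationship looks like; the slope estimates its exponent.
    Requires at least two distinct support values.
    """
    points = [(math.log(s), math.log(c)) for s, c in distribution.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical toy data, not taken from the paper.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["a"],
    ["b", "d"],
]

dist = support_distribution(itemset_supports(transactions, max_size=2))
slope = loglog_slope(dist)
print(sorted(dist.items()))  # [(1, 3), (2, 3), (3, 1), (4, 1)]
print(slope)
```

On this toy input, many itemsets have low support and few have high support, so the fitted slope is negative. Five transactions are far too few to say anything about a power law; the sketch only shows what quantity is being plotted.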

Author information

Authors and Affiliations

Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, ROC
Kun-Ta Chuang & Ming-Syan Chen
Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, ROC
Jiun-Long Huang

Corresponding author

Correspondence to Kun-Ta Chuang.

About this article

Cite this article

Chuang, KT., Huang, JL. & Chen, MS. Power-law relationship and self-similarity in the itemset support distribution: analysis and applications. The VLDB Journal 17, 1121–1141 (2008). https://doi.org/10.1007/s00778-007-0054-1

Received: 28 February 2006
Accepted: 28 March 2007
Published: 04 July 2007
Issue Date: August 2008
DOI: https://doi.org/10.1007/s00778-007-0054-1
