SCALE: a scalable framework for efficiently clustering transactional data

Yan, Hua; Chen, Keke; Liu, Ling; Yi, Zhang

doi:10.1007/s10618-009-0134-5

Published: 17 June 2009

Volume 20, pages 1–27 (2010)

Data Mining and Knowledge Discovery

Abstract

This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive, and it allows the weight of each item in a cluster to change dynamically according to the occurrences of items. Second, we develop the weighted coverage density based clustering algorithm, a fast, memory-efficient, and scalable algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific evaluation metrics are critical for capturing the transactional semantics in clustering analysis. The SCALE framework combines weighted coverage density based clustering over a sample dataset with self-configuring methods. These self-configuring methods automatically handle two important aspects of our clustering algorithm: (1) selecting the candidates for the best number K of clusters; and (2) applying the two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted an extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
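
As a rough illustration of the idea that item weights follow item occurrences, the sketch below computes a weighted coverage density for a single cluster. The abstract does not state the formula, so the definition used here is an assumption: each item's weight is taken to be its share of all item occurrences in the cluster, and the density is the weighted occurrence sum divided by the number of transactions. The function name `weighted_coverage_density` is hypothetical and is not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def weighted_coverage_density(cluster: Sequence[Iterable[Hashable]]) -> float:
    """Assumed form of weighted coverage density for one cluster.

    ASSUMPTION (not given in the abstract): the weight of an item is its
    occurrence count divided by the total occurrences in the cluster, and
    the density is sum(occurrence * weight) / number_of_transactions.
    """
    n_transactions = len(cluster)
    if n_transactions == 0:
        return 0.0
    # Count in how many transactions each item appears (sets drop repeats).
    occurrences = Counter(item for t in cluster for item in set(t))
    total = sum(occurrences.values())
    if total == 0:
        return 0.0
    # Weights are recomputed from the current counts, so they change
    # whenever transactions join or leave the cluster.
    weighted_sum = sum(occ * (occ / total) for occ in occurrences.values())
    return weighted_sum / n_transactions


if __name__ == "__main__":
    tight = [{"a", "b"}, {"a", "b"}, {"a", "b"}]   # transactions share items
    loose = [{"a", "b"}, {"c", "d"}, {"e", "f"}]   # transactions are disjoint
    print(weighted_coverage_density(tight))
    print(weighted_coverage_density(loose))
```

Under this assumed definition, a cluster whose transactions share the same items scores higher than one whose transactions are disjoint, which is the behaviour a similarity measure for transactional clustering would need.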

Author information

Authors and Affiliations

Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, 610054, Chengdu, People’s Republic of China
Hua Yan
Department of Computer Science and Engineering, Wright State University, Dayton, OH, 45435, USA
Keke Chen
Georgia Institute of Technology, College of Computing, Atlanta, GA, 30280, USA
Ling Liu
Machine Intelligence Laboratory, College of Computer Science, Sichuan University, 610065, Chengdu, People’s Republic of China
Zhang Yi

Corresponding author

Correspondence to Keke Chen.

Additional information

Responsible editors: Charu Aggarwal and Geoffrey Webb.

About this article

Cite this article

Yan, H., Chen, K., Liu, L. et al. SCALE: a scalable framework for efficiently clustering transactional data. Data Min Knowl Disc 20, 1–27 (2010). https://doi.org/10.1007/s10618-009-0134-5

Received: 23 September 2007
Accepted: 14 May 2009
Published: 17 June 2009
Issue Date: January 2010
DOI: https://doi.org/10.1007/s10618-009-0134-5
