Abstract
This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept is intuitive, and it allows the weight of each item in a cluster to change dynamically according to the occurrences of items. Second, we develop a clustering algorithm based on the weighted coverage density measure that is fast, memory-efficient, and scalable for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific clustering evaluation metrics are critical to capturing the transactional semantics in clustering analysis. The SCALE framework combines weighted coverage density clustering over a sample dataset with self-configuring methods. These self-configuring methods automate the two important configuration tasks of our clustering algorithm: (1) generating the candidates for the best number K of clusters; and (2) applying the two domain-specific cluster validity measures to select the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
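The abstract does not state the formula for weighted coverage density, so the following is only a minimal sketch under a stated assumption: each item's weight is taken to be its share of all item occurrences in the cluster, and the cluster score is the weighted occurrence sum divided by the number of transactions. The function name `weighted_coverage_density` and this exact weighting are illustrative choices, not definitions taken from the text above.

```python
from collections import Counter


def weighted_coverage_density(cluster):
    """Score a cluster of transactions (each transaction is an iterable of items).

    ASSUMPTION: the weight of an item is occurrences(item) / total occurrences
    in the cluster, so frequent items contribute more. The abstract only says
    that weights change dynamically with item occurrences.
    """
    # Count in how many transactions each item occurs (duplicates within a
    # transaction are ignored by converting it to a set).
    occurrences = Counter(item for transaction in cluster for item in set(transaction))
    n_transactions = len(cluster)
    total = sum(occurrences.values())
    if n_transactions == 0 or total == 0:
        return 0.0
    # Weighted sum of occurrences, normalised by the number of transactions.
    weighted_sum = sum(occ * (occ / total) for occ in occurrences.values())
    return weighted_sum / n_transactions


# Two identical transactions overlap completely.
dense = [{"a", "b"}, {"a", "b"}]
# Two transactions that share only one item overlap less.
sparse = [{"a", "b"}, {"a", "c"}]
print(weighted_coverage_density(dense), weighted_coverage_density(sparse))
```

Under this assumed weighting, a cluster whose transactions share more items scores higher, which matches the intuition that item weights follow item occurrences.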
Additional information
Responsible editors: Charu Aggarwal and Geoffrey Webb.
About this article
Cite this article
Yan, H., Chen, K., Liu, L. et al. SCALE: a scalable framework for efficiently clustering transactional data. Data Min Knowl Disc 20, 1–27 (2010). https://doi.org/10.1007/s10618-009-0134-5
DOI: https://doi.org/10.1007/s10618-009-0134-5