SCALE: a scalable framework for efficiently clustering transactional data

Data Mining and Knowledge Discovery

Abstract

This paper presents SCALE, a fully automated framework for clustering transactional data. The SCALE design has three distinguishing features. First, we introduce Weighted Coverage Density (WCD) as a categorical similarity measure for efficient clustering of transactional datasets. The measure is intuitive, and it lets the weight of each item in a cluster change dynamically with the number of occurrences of that item. Second, we develop a WCD-based clustering algorithm that is fast, memory-efficient, and scalable for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific metrics are critical for capturing transactional semantics in clustering analysis. The SCALE framework combines WCD-based clustering over a sample dataset with self-configuring methods. These methods automatically tune two important aspects of our clustering algorithm: (1) the candidate values for the best number K of clusters; and (2) the use of two domain-specific cluster validity measures to select the best result from the set of clustering results. We have conducted extensive experiments on both synthetic and real datasets, and the results show that the weighted coverage density approach, powered by the SCALE framework, can efficiently generate high-quality clustering results in a fully automated manner.
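
The abstract names weighted coverage density (WCD) but does not state its formula. The following is a minimal sketch under an assumed definition, taken from the authors' earlier CIKM 2006 paper as best recalled rather than from this page: for a cluster with N transactions, WCD = Σ occur(i)² / (S × N), where occur(i) is the number of transactions containing item i and S is the sum of all occurrences. The function names `weighted_coverage_density` and `expected_wcd` are hypothetical, and the size-weighted average over a partition is likewise an assumption.

```python
from collections import Counter


def weighted_coverage_density(cluster):
    """Assumed WCD of one cluster: sum(occur(i)^2) / (S * N).

    cluster: list of transactions, each an iterable of items.
    Each item is weighted by its own share of occurrences, occur(i) / S,
    so frequent items contribute more than rare ones.
    """
    n = len(cluster)
    # Count, for each item, how many transactions contain it.
    occur = Counter(item for t in cluster for item in set(t))
    s = sum(occur.values())
    if n == 0 or s == 0:
        return 0.0
    return sum(c * c for c in occur.values()) / (s * n)


def expected_wcd(partition):
    """Assumed partition-level score: size-weighted average of cluster WCDs."""
    total = sum(len(cluster) for cluster in partition)
    if total == 0:
        return 0.0
    return sum(
        len(cluster) / total * weighted_coverage_density(cluster)
        for cluster in partition
    )


tight = [{"a", "b"}, {"a", "b"}]   # identical transactions
loose = [{"a", "b"}, {"a", "c"}]   # only item "a" is shared
mixed = [{"c"}, {"c", "d"}]

wcd_tight = weighted_coverage_density(tight)
wcd_loose = weighted_coverage_density(loose)
ewcd = expected_wcd([tight, mixed])
```

Under this assumed definition, a cluster of identical transactions scores the maximum value of 1, and the score falls as transactions share fewer items, which matches the abstract's description of item weights following item occurrences.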

Author information

Corresponding author

Correspondence to Keke Chen.

Additional information

Responsible editors: Charu Aggarwal and Geoffrey Webb.

About this article

Cite this article

Yan, H., Chen, K., Liu, L. et al. SCALE: a scalable framework for efficiently clustering transactional data. Data Min Knowl Disc 20, 1–27 (2010). https://doi.org/10.1007/s10618-009-0134-5

  • DOI: https://doi.org/10.1007/s10618-009-0134-5
