Summary
Cluster ensembles address challenges that stem from the ill-posed nature of clustering. They can find robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out spurious structures that arise from the various biases to which each participating algorithm is tuned. In this chapter we focus on the design of ensembles for categorical data. Our techniques build upon diverse input clusterings discovered in random subspaces, and reduce the problem of defining a consensus function to a graph partitioning problem. We experimentally demonstrate the efficacy of our approach in combination with the categorical clustering algorithm COOLCAT.
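The pipeline outlined in the summary (base clusterings in random attribute subspaces, then a consensus obtained from a graph over the objects) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the chapter's method: a tiny k-modes routine stands in for COOLCAT, and connected components of a thresholded co-association graph stand in for a proper graph partitioner. All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def kmodes(rows, k, iters=5):
    """Tiny k-modes, used here only as a stand-in base clusterer (not COOLCAT)."""
    modes = []
    for r in rows:  # seed with the first k distinct rows
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    labels = [0] * len(rows)
    for _ in range(iters):
        # assign each row to the nearest mode (ties go to the lowest index)
        labels = [min(range(len(modes)), key=lambda c: hamming(r, modes[c]))
                  for r in rows]
        for c in range(len(modes)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # new mode = per-attribute majority value
                modes[c] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels


def random_subspace_ensemble(data, k, n_members, subspace_size, seed=0):
    """Cluster the data once per randomly drawn subset of attributes."""
    rng = random.Random(seed)
    n_attrs = len(data[0])
    ensemble = []
    for _ in range(n_members):
        attrs = sorted(rng.sample(range(n_attrs), subspace_size))
        projected = [tuple(row[a] for a in attrs) for row in data]
        ensemble.append(kmodes(projected, k))
    return ensemble


def consensus(ensemble, n, threshold=0.5):
    """Assumed stand-in for graph partitioning: link objects that share a
    cluster in more than `threshold` of the members, then return the
    connected components of that graph as consensus labels."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            share = sum(m[i] == m[j] for m in ensemble) / len(ensemble)
            if share > threshold:
                parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical toy data: two groups with disjoint attribute values,
# interleaved so that the first two rows come from different groups.
data = [
    ("a", "x", "p", "m"), ("b", "y", "r", "n"),
    ("a", "x", "p", "k"), ("b", "y", "s", "n"),
    ("a", "x", "q", "m"), ("b", "z", "r", "n"),
]
ensemble = random_subspace_ensemble(data, k=2, n_members=5, subspace_size=2)
consensus_labels = consensus(ensemble, len(data))
print(consensus_labels)  # [0, 1, 0, 1, 0, 1]
```

On this toy data every subspace separates the two groups, so the ensemble members agree and the consensus simply recovers them. On real data the members would disagree, and a dedicated k-way graph partitioner would normally replace the thresholding step.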
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Al-Razgan, M., Domeniconi, C., Barbará, D. (2008). Random Subspace Ensembles for Clustering Categorical Data. In: Okun, O., Valentini, G. (eds) Supervised and Unsupervised Ensemble Methods and their Applications. Studies in Computational Intelligence, vol 126. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78981-9_2
DOI: https://doi.org/10.1007/978-3-540-78981-9_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78980-2
Online ISBN: 978-3-540-78981-9
eBook Packages: Engineering, Engineering (R0)