Projective clustering ensembles

Published in Data Mining and Knowledge Discovery.

Abstract

A considerable amount of work has been done in data clustering research over the last four decades, and a myriad of methods have been proposed, focusing on different data types, proximity functions, cluster representation models, and ways of presenting clusters. However, clustering remains a challenging problem due to its ill-posed nature: it is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, mainly because every clustering algorithm has its own bias resulting from the optimization of different criteria. This bias becomes even more significant because, in almost all real-world applications, data are inherently high-dimensional and multiple clustering solutions may exist for the same data collection. In this respect, the problems of projective clustering and clustering ensembles have recently been defined to deal with the high-dimensionality and multiple-clusterings issues, respectively. Nevertheless, although these two issues are often encountered together, existing approaches to the two problems have been developed independently of each other. In our earlier work, Gullo et al. (Proceedings of the international conference on data mining (ICDM), 2009a), we introduced a novel clustering problem, called projective clustering ensembles (PCE): given a set (ensemble) of projective clustering solutions, the goal is to derive a projective consensus clustering, i.e., a projective clustering that complies with the information on object-to-cluster and feature-to-cluster assignments given in the ensemble. In this paper, we enhance our previous study and provide theoretical and experimental insights into the PCE problem. PCE is formalized as an optimization problem designed to satisfy desirable requirements: independence from the specific clustering ensemble algorithm, the ability to handle hard as well as soft data clustering, and support for different feature weightings.
Two PCE formulations are defined: a two-objective optimization problem, in which the two objective functions account for the object- and feature-based representations of the solutions in the ensemble, respectively, and a single-objective optimization problem, in which the object- and feature-based representations are embedded into a single function that measures the distance error between the projective consensus clustering and the projective ensemble. The significance of the proposed methods for solving the PCE problem is demonstrated through an extensive experimental evaluation on several datasets, in comparison with projective clustering and clustering ensemble baselines.
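As an illustrative aside (not the authors' implementation), the extended Jaccard (Tanimoto) coefficient J, which the paper uses to compare representation vectors, can be sketched in a few lines of Python; the per-cluster distance combining the object-based (Γ) and feature-based (Δ) vectors below is a hypothetical simplification of the single-objective distance-error idea:

```python
def tanimoto(x, y):
    """Extended Jaccard (Tanimoto) similarity between two nonnegative vectors:
    J(x, y) = <x, y> / (||x||^2 + ||y||^2 - <x, y>)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom > 0 else 0.0

def cluster_distance(gamma_a, delta_a, gamma_b, delta_b):
    # Hypothetical per-cluster distance error: combines the similarity of the
    # object-to-cluster vectors (gamma) and feature-to-cluster vectors (delta);
    # identical clusters yield distance 0, disjoint ones yield distance 1.
    return 1.0 - tanimoto(gamma_a, gamma_b) * tanimoto(delta_a, delta_b)
```

Identical vectors give J = 1 and orthogonal ones give J = 0, so the distance above is bounded in [0, 1].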

Abbreviations

o : Data object
\({\mathcal{D}}\) : Collection of data objects
C : Projective cluster (set of data objects)
\({\mathcal{C}}\) : Projective clustering (set of projective clusters)
\({\mathcal{E}}\) : Projective ensemble (set of projective clusterings)
\({\mathcal{C}^*}\) : Projective consensus clustering
K : Number of clusters in the projective consensus clustering
f : Feature
\({\mathcal{F}}\) : Set of features
Γ_C : Object-based representation vector of projective cluster C
Γ_{C,o} : Object-to-cluster assignment of object o to projective cluster C
Δ_C : Feature-based representation vector of projective cluster C
Δ_{C,f} : Feature-to-cluster assignment of feature f to projective cluster C
Λ_o : Feature-based representation vector of object o
Λ_{o,f} : Probability that feature f is informative for object o
\({\Psi_o, \overline{\psi}_o, \psi_o}\) : Object-based optimization functions
\({\Psi_f, \overline{\psi}_f, \psi_f}\) : Feature-based optimization functions
Q : Object- and feature-based optimization function
J : Extended Jaccard (Tanimoto) similarity coefficient
ρ : Pareto ranking function
t : Population size
\({\mathcal{I}, I}\) : Numbers of iterations
F1_{of}, F1_o, F1_f : External assessment criteria
\({\overline{F1}_{of}, \overline{F1}_o, \overline{F1}_f}\) : Internal assessment criteria
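The Pareto ranking function ρ and population size t above suggest an evolutionary multi-objective search in the spirit of NSGA-II (Deb et al. 2002), which the paper cites. As a hedged sketch only (not the authors' algorithm), nondominated ranking of candidate solutions by two objective values can be written as:

```python
def dominates(u, v):
    """u Pareto-dominates v (minimization): u is no worse in every objective
    and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_rank(points):
    """Assign rank 0 to the nondominated front, remove it, and repeat
    on the remaining points (rank 1, 2, ...)."""
    ranks = [None] * len(points)
    remaining = set(range(len(points)))
    r = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = r
        remaining -= front
        r += 1
    return ranks
```

For example, with objective pairs [(1, 2), (2, 1), (2, 2), (3, 3)] the first two points are mutually nondominated (rank 0), (2, 2) is dominated only by rank-0 points (rank 1), and (3, 3) sits behind both fronts (rank 2).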

References

  • Achtert E, Böhm C, Kriegel H-P, Kröger P, Müller-Gorman I, Zimek A (2006) Finding hierarchies of subspace clusters. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases (PKDD), pp 446–453

  • Achtert E, Böhm C, Kriegel H-P, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Proceedings of the international conference on database systems for advanced applications (DASFAA), pp 152–163

  • Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 61–72

  • Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings ACM SIGMOD international conference on management of data, pp 94–105

  • Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings ACM SIGMOD international conference on management of data, pp 49–60

  • Assent I, Krieger R, Müller E, Seidl T (2008) EDSC: efficient density-based subspace clustering. In: Proceedings ACM conference on information and knowledge management (CIKM), pp 1093–1102

  • Asuncion A, Newman DJ (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml/

  • Ayad H, Kamel MS (2003) Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In: Proceedings of the international workshop on multiple classifier systems (MCS), pp 166–175

  • Barthélemy JP, Leclerc B (1995) The median procedure for partitions. Partit Data Sets 19: 3–33

  • Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton

  • Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the international conference on database theory (ICDT), pp 217–235

  • Böhm C, Kailing K, Kriegel HP, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 27–34

  • Boulis C, Ostendorf M (2004) Combining multiple clustering systems. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases (PKDD), pp 63–74

  • Bradley PS, Fayyad UM (1998) Refining initial points for k-means clustering. In: Proceedings of the international conference on machine learning (ICML), pp 91–99

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2): 123–140

  • Caruana R, Elhawary MF, Nguyen N, Smith C (2006) Meta clustering. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 107–118

  • Chen L, Jiang Q, Wang S (2008) A probability model for projective clustering on high dimensional data. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 755–760

  • Deb K (2001) Multi-objective optimization using evolutionary algorithms. Wiley, New York

  • Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2): 182–197

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39: 1–38

  • Dimitriadou E, Weingessel A, Hornik K (2001) Voting-merging: an ensemble method for clustering. In: Proceedings of the international conference on artificial neural networks (ICANN), pp 217–224

  • Domeniconi C, Al-Razgan M (2009) Weighted cluster ensembles: methods and analysis. ACM Trans Knowl Disc Data (TKDD) 2(4)

  • Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc 14(1): 63–97

  • Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9): 1090–1099

  • Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the international conference on knowledge discovery and data mining (KDD), pp 226–231

  • Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proceedings of the international conference on machine learning (ICML), pp 281–288

  • Fern XZ, Lin W (2008) Cluster ensemble selection. In: Proceedings of the SIAM international conference on data mining (SDM), pp 787–797

  • Fischer B, Buhmann JM (2003) Bagging for path-based clustering. IEEE Trans Patt Anal Mach Intell (TPAMI) 25(11): 1411–1415

  • Fred ALN (2001) Finding consistent clusters in data partitions. In: Proceedings of the international workshop on multiple classifier systems (MCS), pp 309–318

  • Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: Proceedings of the international conference on pattern recognition (ICPR), pp 276–280

  • Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications. ASA-SIAM series on statistics and applied probability

  • Ghaemi R, bin Sulaiman N, Ibrahim H, Mustapha N (2011) A review: accuracy optimization in clustering ensembles using genetic algorithms. Artif Intell Rev 35(4): 287–318

  • Ghosh J, Acharya A (2011) Cluster ensembles. Wiley interdisciplinary reviews. Data Min Knowl Disc 1(4): 305–315

  • Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Disc Data (TKDD) 1(1)

  • Gullo F, Domeniconi C, Tagarelli A (2009a) Projective clustering ensembles. In: Proceedings of the international conference on data mining (ICDM), pp 794–799

  • Gullo F, Tagarelli A, Greco S (2009b) Diversity-based weighting schemes for clustering ensembles. In: Proceedings of the SIAM international conference on data mining (SDM), pp 437–448

  • Gullo F, Domeniconi C, Tagarelli A (2010) Enhancing single-objective projective clustering ensembles. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 833–838

  • Gullo F, Domeniconi C, Tagarelli A (2011) Advancing data clustering via projective clustering ensembles. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 733–744

  • Günnemann S, Boden B, Seidl T (2011a) DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors. In: Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML/PKDD), pp 565–580

  • Günnemann S, Färber I, Müller E, Assent I, Seidl T (2011b) External evaluation measures for subspace clustering. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 1363–1372

  • Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the international conference on very large data bases (VLDB), pp 506–515

  • Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

  • Ka Ka Ng E, Wai-Chee Fu A, Chi-Wing Wong R (2005) Projective clustering by histograms. IEEE Trans Knowl Data Eng (TKDE) 17(3): 369–383

  • Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1): 359–392

  • Karypis G, Aggarwal R, Kumar V, Shekhar S (1997) Multilevel hypergraph partitioning: applications in VLSI domain. In: Proceedings of the design automation conference (DAC), pp 526–529

  • Keogh E, Xi X, Wei L, Ratanamahatana CA (2003) The UCR time series classification/clustering page. http://www.cs.ucr.edu/~eamonn/time_series_data/

  • Kriegel H-P, Kroger P, Renz M, Wurst S (2005) A generic framework for efficient subspace clustering of high-dimensional data. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 250–257

  • Kriegel H-P, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Disc Data (TKDD) 3(1): 1–58

  • Krivánek M, Morávek J (1986) NP-hard problems in hierarchical-tree clustering. Acta Inform 23(3): 311–323

  • Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2: 83–97

  • Kuncheva LI, Hadjitodorov ST, Todorova LP (2006) Experimental comparison of cluster ensemble methods. In: Proceedings of the international conference on information fusion, pp 1–7

  • Lewis DD, Yang Y, Rose T, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5: 361–397

  • Li T, Ding C (2008) Weighted consensus clustering. In: Proceedings of the SIAM international conference on data mining (SDM), pp 798–809

  • Li T, Ding C, Jordan MI (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 577–582

  • Liu B, Xia B, Yu PS (2000) Clustering through decision tree construction. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 20–29

  • Meila M (2005) Comparing clusterings: an axiomatic view. In: Proceedings of the international conference on machine learning (ICML), pp 577–584

  • Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3): 273–298

  • Moise G, Zimek A, Kröger P, Kriegel H-P, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21(3): 299–326

  • Müller E, Assent I, Günnemann S, Krieger R, Seidl T (2009a) Relevant subspace clustering: mining the most interesting non-redundant concepts in high dimensional data. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 377–386

  • Müller E, Günnemann S, Assent I, Seidl T (2009b) Evaluating clustering in subspace projections of high dimensional data. Proc VLDB Endow (PVLDB) 2(1): 1270–1281

  • Müller E, Günnemann S, Assent I, Seidl T (2009c) Evaluating clustering in subspace projections of high dimensional data. http://dme.rwth-aachen.de/en/OpenSubspace/evaluation

  • Müller E, Assent I, Günnemann S, Seidl T (2011) Scalable density-based subspace clustering. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 1077–1086

  • Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Proceedings of the international conference on neural information processing systems (NIPS), pp 849–856

  • Nguyen N, Caruana R (2007) Consensus clustering. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 607–612

  • Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD Explor 6(1): 90–105

  • Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng (TKDE) 18(7): 902–916

  • Procopiuc CM, Jones M, Agarwal PK, Murali TM (2002) A Monte Carlo algorithm for fast projective clustering. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 418–427

  • Schapire R (1990) The strength of weak learnability. Mach Learn 5(2): 197–227

  • Sequeira K, Zaki M (2004) SCHISM: a new approach for interesting subspace mining. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 186–193

  • Srinivas N, Deb K (1994) Multiobjective optimization using nondominated sorting in genetic algorithms. Evol Comput 2(3): 221–248

  • Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617

  • Strehl A, Ghosh J, Mooney R (2000) Impact of similarity measures on web-page clustering. In: Proceedings of the AAAI workshop on artificial intelligence for web search, pp 58–64

  • Tomasev N, Radovanovic M, Mladenic D, Ivanovic M (2011) The role of hubness in clustering high-dimensional data. In: Proceedings of the Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 183–195

  • Topchy AP, Jain AK, Punch WF (2004) A mixture model for clustering ensembles. In: Proceedings of the SIAM international conference on data mining (SDM), pp 379–390

  • Topchy AP, Jain AK, Punch WF (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell (TPAMI) 27(12): 1866–1881

  • van Rijsbergen CJ (1979) Information retrieval. Butterworths, London

  • Wang H, Shan H, Banerjee A (2009) Bayesian cluster ensembles. In: Proceedings of the SIAM international conference on data mining (SDM), pp 209–220

  • Wang H, Shan H, Banerjee A (2011) Bayesian cluster ensembles. Stat Anal Data Min 4(1): 54–70

  • Wang P, Domeniconi C, Laskey KB (2010) Nonparametric Bayesian clustering ensembles. In: Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML/PKDD), pp 435–450

  • Wang P, Laskey KB, Domeniconi C, Jordan M (2011) Nonparametric Bayesian co-clustering ensembles. In: Proceedings of the SIAM international conference on data mining (SDM), pp 331–342

  • Woo K-G, Lee J-H, Kim M-H, Lee Y-J (2004) FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting. Inf Softw Technol 46(4): 255–271

  • Yang Y, Kamel MS (2006) An aggregated clustering approach using multi-ant colonies algorithms. Pattern Recog 39(7): 1278–1289

  • Yip KY, Cheung DW, Ng MK (2004) HARP: a practical projected clustering algorithm. IEEE Trans Knowl Data Eng (TKDE) 16(11): 1387–1397

  • Yip KY, Cheung DW, Ng MK (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the IEEE international conference on data engineering (ICDE), pp 329–340

  • Yiu ML, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Trans Knowl Data Eng (TKDE) 17(2): 176–189

  • Zeng Y, Tang J, Garcia-Frias J, Gao GR (2002) An adaptive meta-clustering approach: combining the information from different clustering results. In: Proceedings of the IEEE computer society bioinformatics conference (CSB), pp 330–332

Author information

Corresponding author

Correspondence to Francesco Gullo.

Additional information

Responsible editor: Charu Aggarwal.

Cite this article

Gullo, F., Domeniconi, C. & Tagarelli, A. Projective clustering ensembles. Data Min Knowl Disc 26, 452–511 (2013). https://doi.org/10.1007/s10618-012-0266-x
