Subspace Metric Ensembles for Semi-supervised Clustering of High Dimensional Data

  • Bojun Yan
  • Carlotta Domeniconi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)


A critical problem in clustering research is the definition of a proper metric to measure distances between points. Semi-supervised clustering uses information provided by the user, usually in the form of constraints, to guide the search for clusters. Learning effective metrics from constraints in high dimensional spaces remains an open challenge: the number of parameters to be estimated is quadratic in the number of dimensions, and we seldom have enough side-information to achieve accurate estimates. In this paper, we address the high dimensionality problem by learning an ensemble of subspace metrics. This is achieved by projecting the data and the constraints into multiple subspaces, and by learning positive semi-definite similarity matrices therein. This methodology allows leveraging the given side-information while solving lower dimensional problems. We demonstrate experimentally, using high dimensional data (e.g., microarray data), that our method achieves superior accuracy with respect to competing approaches.
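The pipeline sketched in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses random feature subspaces, a simple diagonal (hence positive semi-definite) metric derived from must-link constraints in each subspace, and a co-association matrix to combine the per-subspace partitions. Cannot-link constraints, the paper's actual metric-learning procedure, and its ensemble-combination scheme are all simplified away; the function and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # module-level RNG for reproducibility

def kmeans(X, k, iters=20):
    """Tiny k-means with farthest-point initialization for stability."""
    centers = X[[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.vstack([X[labels == c].mean(0) if (labels == c).any()
                             else centers[c] for c in range(k)])
    return labels

def subspace_metric_ensemble(X, must_link, n_subspaces=5, dim=3, n_clusters=2):
    """Project data and constraints into random feature subspaces, learn a toy
    diagonal (PSD) metric in each, cluster, and combine the resulting
    partitions through a co-association matrix."""
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for _ in range(n_subspaces):
        feats = rng.choice(X.shape[1], size=dim, replace=False)
        Xs = X[:, feats]
        # Toy "metric learning": upweight features on which must-link pairs
        # are close (the paper learns a full PSD matrix instead).
        w = np.ones(dim)
        for i, j in must_link:
            w += 1.0 / (1e-6 + (Xs[i] - Xs[j]) ** 2)
        w /= w.sum()
        labels = kmeans(Xs * np.sqrt(w), n_clusters)  # diagonal metric = rescaling
        coassoc += labels[:, None] == labels[None, :]
    # Final partition: cluster the rows of the averaged co-association matrix.
    return kmeans(coassoc / n_subspaces, n_clusters)
```

Because each base clustering runs in a low dimensional subspace, the side-information only has to constrain a few parameters at a time, which is the point of the paper's approach; the co-association step then reconciles the partitions into a single clustering.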


Keywords: High Dimensional Data · Cluster Solution · Normalized Mutual Information · Ensemble Size · Cluster Ensemble
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Bojun Yan¹
  • Carlotta Domeniconi¹
  1. Department of Information and Software Engineering, George Mason University, Fairfax, USA
