Data Mining and Knowledge Discovery, Volume 14, Issue 1, pp 63–97

Locally adaptive metrics for clustering high dimensional data

  • Carlotta Domeniconi
  • Dimitrios Gunopulos
  • Sheng Ma
  • Bojun Yan
  • Muna Al-Razgan
  • Dimitris Papadopoulos

Abstract

Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of information loss encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves over competitive methods, using both synthetic and real datasets. In particular, our results show the feasibility of the proposed technique for simultaneous clustering of genes and conditions in gene expression data, and for clustering very high-dimensional data such as text.
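The core idea described above, a weight vector attached to each cluster that measures per-dimension relevance, can be sketched as a small iterative procedure. The Python sketch below is a minimal illustration under stated assumptions (the exponential weighting with a bandwidth h, the function name, and the update order are illustrative choices, not the paper's exact algorithm): points are assigned by a weighted Euclidean distance, and each cluster's weights are refreshed from the per-dimension dispersion of its members, so dimensions along which a cluster is tight receive larger weights.

```python
# Minimal sketch of locally adaptive clustering with per-cluster feature
# weights. Assumes a LAC-style exponential weighting controlled by a
# bandwidth h; names and update rules are illustrative, not the authors'
# exact formulation.
import numpy as np

def locally_adaptive_clustering(X, k, h=1.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centroids = X[rng.choice(n, k, replace=False)]   # initial centers
    weights = np.full((k, d), 1.0 / d)               # start with uniform weights

    for _ in range(n_iter):
        # Assign each point to the cluster with the smallest weighted distance.
        dists = np.stack([
            ((X - centroids[j]) ** 2 * weights[j]).sum(axis=1) for j in range(k)
        ])                                           # shape (k, n)
        labels = dists.argmin(axis=0)

        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # Per-dimension dispersion of cluster j around its centroid.
            disp = ((members - centroids[j]) ** 2).mean(axis=0)
            # Smaller dispersion along a dimension -> larger weight.
            w = np.exp(-disp / h)
            weights[j] = w / w.sum()
            centroids[j] = members.mean(axis=0)

    return labels, centroids, weights
```

Calling locally_adaptive_clustering(X, k=3) on a data matrix X of shape (n, d) returns cluster labels together with per-cluster centroids and weight vectors; inspecting the weights shows which subspace each cluster concentrates in.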

Keywords

Subspace clustering · Dimensionality reduction · Local feature relevance · Clustering ensembles · Gene expression data · Text data

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Carlotta Domeniconi (1)
  • Dimitrios Gunopulos (2)
  • Sheng Ma (3)
  • Bojun Yan (1)
  • Muna Al-Razgan (1)
  • Dimitris Papadopoulos (2)

  1. George Mason University, Fairfax, USA
  2. UC Riverside, Riverside, USA
  3. Vivido Media Inc., Beijing, China
