Advertisement

Subspace and projected clustering: experimental evaluation and analysis

  • Gabriela Moise
  • Arthur Zimek
  • Peer Kröger
  • Hans-Peter Kriegel
  • Jörg Sander
Regular Paper

Abstract

Subspace and projected clustering have emerged as a possible solution to the challenges associated with clustering in high-dimensional data. Numerous subspace and projected clustering techniques have been proposed in the literature. A comprehensive evaluation of their advantages and disadvantages is urgently needed. In this paper, we evaluate systematically state-of-the-art subspace and projected clustering techniques under a wide range of experimental settings. We discuss the observed performance of the compared techniques, and we make recommendations regarding what type of techniques are suitable for what kind of problems.

Keywords

Subspace clustering Projected clustering 

References

  1. 1.
    Achtert E, Böhm C, Kriegel H-P, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Ramamohanarao K, Krishna P, Mohania M, Nantajeewarawat E (eds) Proceedings of the 12th international conference on database systems for advanced applications (DASFAA), Bangkok, Thailand, 2007, pp 152–163Google Scholar
  2. 2.
    Achtert E, Kriegel H-P, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, ChinaGoogle Scholar
  3. 3.
    Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Proceedings of the eighth international conference on database theory (ICDT), London, UK, 2001, pp 420–434Google Scholar
  4. 4.
    Aggarwal C, Procopiuc C, Wolf J, Yu P, Park J (1999) Fast algorithms for projected clustering. In: Delis A, Faloutsos C, Ghandeharizadeh, S (eds) Proceedings of the ACM SIGMOD international conference on management of data, Philadelphia, PA, USA, 1999, pp 61–72Google Scholar
  5. 5.
    Aggarwal C, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD international conference on management of data, Dallas, TX, USA, 2000, pp 70–81Google Scholar
  6. 6.
    Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high-dimensional data for data mining applications. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data, Seattle, WA, USA, 1998, pp 94–105Google Scholar
  7. 7.
    Agrawal R, Srikan R (1994) Fast algorithms for mining association rules in large databases. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the international conference on very large data bases VLDB, Santiago de Chile, Chile, 1994, pp 487–499Google Scholar
  8. 8.
    Ankerst M, Breunig M, Kriegel H-P, Sander J (1999) OPTICS: Ordering points to identify the clustering structure. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, USA, 1999, pp 49–60Google Scholar
  9. 9.
    Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the seventh international conference on data mining (ICDM), Omaha, NE, USA, 2007, pp 409–414Google Scholar
  10. 10.
    Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?. In: Beeri C, Buneman P (eds) Proceedings of the seventh international conference on database theory (ICDT), Jerusalem, Israel, 1999, pp 217–235Google Scholar
  11. 11.
    Böhm C, Kailing K, Kriegel H-P, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the fourth international conference on data mining (ICDM), Brighton, UK, 2004, pp 27–34Google Scholar
  12. 12.
    Cheng C, Fu A, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, USA, 1999, pp 84–93Google Scholar
  13. 13.
    Dempster A, Laird N, Rubin D (1977) Maximum likelihood for incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1): 1–38zbMATHMathSciNetGoogle Scholar
  14. 14.
    Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Barbara D, Kamath C (eds) Proceedings of the third SIAM international conference on data mining (SDM), San Francisco, CA, USA, 2003Google Scholar
  15. 15.
    Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second ACM international conference on knowledge discovery and data mining (KDD), Portland, OR, USA, 1996, pp 226–231Google Scholar
  16. 16.
    Gondek D, Hofmann T (2007) Non-redundant data clustering. Knowl Inf Syst 12(1): 1–116CrossRefGoogle Scholar
  17. 17.
    Han J, Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Academic Press, LondonGoogle Scholar
  18. 18.
    Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Abbadi A, Brodie M, Chakravarthy, Dayal U, Kamel N, Schlageter G, Whang K (eds) Proceedings of the 26th international conference on very large data bases (VLDB), Cairo, Egypt, 2000, pp 506–515Google Scholar
  19. 19.
    Hinneburg A, Keim D (2003) A general approach to clustering in large databases with noise. Knowl Inf Syst 5(4): 387–415CrossRefGoogle Scholar
  20. 20.
    Kailing K, Kriegel H-P, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: Berry M, Dayal U, Kamath C, Skilicorn D (eds) Proceedings of the fourth SIAM international conference on data mining (SDM), Orlando, FL, USA, 2004, pp 1–11Google Scholar
  21. 21.
    Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analyis. Wiley, New YorkGoogle Scholar
  22. 22.
    Kriegel H-P, Kröger P, Renz M, Wurst S (2005) A generic framework for efficient subspace clustering of high-dimensional data. In: Proceedings of the fifth international conference on data mining, Houston, TX, USA, 2005, pp 250–257Google Scholar
  23. 23.
    Kriegel H-P, Kröger P, Zimek A (2009) Clustering high dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1)Google Scholar
  24. 24.
    Li T (2008) Clustering based on matrix approximation: a unifying view. Knowl Inf Syst 17(1): 1–133CrossRefGoogle Scholar
  25. 25.
    Liu G, Li J, Sim K, Wong L (2007) Distance based subspace clustering with flexible dimension partitioning. In: Proceedings of the 23rd international conference on data engineering (ICDE), Istanbul, Turkey, 2007, pp 1250–1254Google Scholar
  26. 26.
    McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematics, statistics, and probabilistics, vol 1, 1967, pp 281–297Google Scholar
  27. 27.
    Moise G, Sander J, Ester M (2006) P3C: A robust projected clustering algorithm. In: Proceedings of the sixth international conference on data mining (ICDM), Hong Kong, China, 2006, pp 414–425Google Scholar
  28. 28.
    Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3): 273–298zbMATHCrossRefGoogle Scholar
  29. 29.
    Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD), Las Vegas, NV, USA, 2008, pp 533–541Google Scholar
  30. 30.
    Nagesh H, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the first SIAM international conference on data mining (SDM), Chicago, IL, USA, 2001, pp 1–17Google Scholar
  31. 31.
    Newman D, Hettich S, Blake C, Merz C (1998) UCI repository of machine learning databases 1998Google Scholar
  32. 32.
    Ng K, Fu A, Wong C (2005) Projective clustering by histograms. IEEE Trans Knowl Data Eng 17(3): 369–383CrossRefGoogle Scholar
  33. 33.
    Ng R, Han J(1994) Efficient and effective clustering methods for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the international conference on very large data bases VLDB, Santiago de Chile, Chile, 1994, pp 487–499Google Scholar
  34. 34.
    Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1): 90–105CrossRefGoogle Scholar
  35. 35.
    Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18(7): 902–916CrossRefGoogle Scholar
  36. 36.
    Procopiuc C, Jones M, Agarwal P, Murali T (2002) A Monte Carlo algorithm for fast projective clustering. In: Franklin M, Moon B, Ailamaki, A (eds) Proceedings of the ACM international conference on management of data (SIGMOD, Madison, WI, USA, 2002, pp 418–427Google Scholar
  37. 37.
  38. 38.
    UCI Machine Learning Repository http://archive.ics.uci.edu/ml/
  39. 39.
    Sequeira K, Zaki M (2004) SCHISM: a new approach for interesting subspace mining. In: Proceedings of the fourth international conference on data mining (ICDM), Brighton, UK, 2004, pp 186–193Google Scholar
  40. 40.
    Tang J, Chen J, Fu A, Cheung W (2007) Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inf Syst 11(1): 45–84CrossRefGoogle Scholar
  41. 41.
    Yip K, Cheung D, Ng M (2004) HARP: a practical projected clustering algorithm. IEEE Trans Knowl Data Eng 16(11): 1387–1397CrossRefGoogle Scholar
  42. 42.
    Yip K, Cheung D, Ng M (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the 21st international conference on data engineering (ICDE), Tokyo, Japan, 2005, pp 329–340Google Scholar
  43. 43.
    Yip K, Qi P, Schultz M, Cheung D, Cheung K (2006) SemBiosphere: a semantic web approach to recommending microarray clustering services. In: Proceedings of the 11th pacific symposium on biocomputing (PSB), Maui, HI, USA, 2006Google Scholar
  44. 44.
    Yiu M, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Trans Knowl Data Eng 17(2): 176–189CrossRefGoogle Scholar

Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Gabriela Moise
    • 1
  • Arthur Zimek
    • 2
  • Peer Kröger
    • 2
  • Hans-Peter Kriegel
    • 2
  • Jörg Sander
    • 1
  1. 1.Department of Computing ScienceUniversity of AlbertaEdmontonCanada
  2. 2.Ludwig-Maximilians-Universität MünchenMunichGermany

Personalised recommendations