Journal of Intelligent Information Systems

, Volume 17, Issue 2–3, pp 107–145 | Cite as

On Clustering Validation Techniques

  • Maria Halkidi
  • Yannis Batistakis
  • Michalis Vazirgiannis
Article

Abstract

Cluster analysis aims at identifying groups of similar objects and, therefore helps to discover distribution of patterns and interesting correlations in large data sets. It has been subject of wide research since it arises in many application domains in engineering, business and social sciences. Especially, in the last years the availability of huge transactional and experimental data sets and the arising requirements for data mining created needs for clustering algorithms that scale and can be applied in diverse domains.

This paper introduces the fundamental concepts of clustering while it surveys the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of clustering process regarding the quality assessment of the clustering results. This is also related to the inherent features of the data set under concern. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by the recent algorithms and gives the trends in clustering process.

clustering algorithms unsupervised learning cluster validity validity indices 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Berry, M.J.A. and Linoff, G. (1996). Data Mining Techniques For Marketing, Sales and Customer Support. John Wiley & Sons, Inc., USA.Google Scholar
  2. Bezdeck, J.C., Ehrlich, R., and Full, W. (1984). FCM: Fuzzy C-Means Algorithm. Computers and Geoscience, 10(2-3), 191-203.Google Scholar
  3. Dave, R.N. (1996). Validating Fuzzy Partitions Obtained Through c-Shells Clustering. Pattern Recognition Letters, 17, 613-623.Google Scholar
  4. Davies, D.L. and Bouldin, D.W. (1979). A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227.Google Scholar
  5. Dunn, J.C. (1974). Well Separated Clusters and Optimal Fuzzy Partitions. J. Cybern., 4, 95-104.Google Scholar
  6. Ester, M., Kriegel, H-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceeding of 2nd Int. Conf.On Knowledge Discovery and Data Mining, Portland (pp. 226-23).Google Scholar
  7. Ester, M., Kriegel, H.-P., Sander, J.,Wimmer, M., and Xu, X. (1998). Incremental Clustering for Mining in a Data Warehousing Environment. In Proceedings of 24th VLDB Conference, New York, USA.Google Scholar
  8. Fayyad, M.U., Piatesky-Shapiro, G., Smuth P., Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press.Google Scholar
  9. Gath I. and Geva A.B. (1989). Unsupervised Optimal Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 773-781.Google Scholar
  10. Guha, S., Rastogi, R., and Shim K. (1998). CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD Conference.Google Scholar
  11. Guha, S, Rastogi, R., and Shim K. (1999). ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the IEEE Conference on Data Engineering.Google Scholar
  12. Halkidi, M., Vazirgiannis, M., and Batistakis, I. (2000). Quality Scheme Assessment in the Clustering Process. In Proceedings of PKDD, Lyon, France.Google Scholar
  13. Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, USA.Google Scholar
  14. Hinneburg, A. and Keim, D. (1998). An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of KDD Conference.Google Scholar
  15. Huang, Z. (1997). A Fast Clustering Algorithm to Cluster very Large Categorical Data Sets in Data Mining. DMKD. Google Scholar
  16. Jain, A.K., Murty, M.N., and Flyn, P.J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264-323.Google Scholar
  17. Krishnapuram, R., Frigui, H., and Nasraoui, O. (1993). Quadratic Shell Clustering Algorithms and the Detection of Second-Degree Curves. Pattern Recognition Letters, 14(7), 545-552.Google Scholar
  18. MacQueen, J.B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of 5th Berkley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, pp. 281-297.Google Scholar
  19. Milligan, G.W. and Cooper, M.C. (1985). An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika, 50, 159-179.Google Scholar
  20. Milligan, G.W., Soon, S.C., and Sokol, L.M. (1983). The Effect of Cluster Size, Dimensionality and the Number of Clusters on Recovery of True Cluster Structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40-47.Google Scholar
  21. Mitchell, T. (1997). Machine Learning.McGraw-Hill, USA.Google Scholar
  22. Ng, R. and Han, J.(1994). Effecient and Effictive Clustering Methods for Spatial Data Mining. In Proceeding's of the 20th VLDB Conference, Santiago, Chile.Google Scholar
  23. Pal, N.R. and Biswas, J. ( 1997). Cluster Validation Using Graph Theoretic Concepts. Pattern Recognition, 30(6), 847-857.Google Scholar
  24. Rezaee, R., Lelieveldt, B.P.F., and Reiber, J.H.C. (1998). A New Cluster Validity Index for the Fuzzy c-Mean. Pattern Recognition Letters, 19, 237-246.Google Scholar
  25. Sharma, S.C. (1996). Applied Multivariate Techniques.John Wiley and Sons.Google Scholar
  26. Sheikholeslami, C., Chatterjee, S., and Zhang, A. (1998). WaveCluster: A-MultiResolution Clustering Approach for Very Large Spatial Database. In Proceedings of 24th VLDB Conference, New York, USA.Google Scholar
  27. Smyth, P. (1996). Clustering using Monte Carlo Cross-Validation. In Proceedings of KDD Conference.Google Scholar
  28. Theodoridis, S. and Koutroubas, K. (1999). Pattern Recognition.Academic Press.Google Scholar
  29. Theodoridis, Y. (1999). Spatial Datasets: An “unofficial” collection. http://dias.cti.gr/~ytheod/research/ datasets/spatial.html Wang, W., Yang, J., and Muntz, R. (1997). STING: A Ststistical Information Grid Approach to Spatial Data Mining. In Proceedings of 23rd VLDB Conference.Google Scholar
  30. Xie, X.L. and Beni, G. (1991). A Validity Measure for Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4), 841-846.Google Scholar
  31. Zhang, T., Ramakrishnman, R., and Linvy, M. (1996). BIRCH: An Efficient Method for Very Large Databases. ACM SIGMOD, Montreal, Canada.Google Scholar

Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Maria Halkidi
    • 1
  • Yannis Batistakis
    • 1
  • Michalis Vazirgiannis
    • 1
  1. 1.Department of InformaticsAthens University of Economics & BusinessAthensGreece (Hellas)

Personalised recommendations