Abstract
Clustering is mostly an unsupervised procedure, and most clustering algorithms depend on assumptions and initial guesses to define the subgroups present in a data set. As a consequence, in most applications the final clusters require some sort of evaluation. The evaluation procedure has to tackle difficult problems, which can be qualitatively expressed as: (i) the quality of the clusters, (ii) the degree to which a clustering scheme fits a specific data set, and (iii) the optimal number of clusters in a partitioning. In this paper we present a scheme for finding the optimal partitioning of a data set during the clustering process, regardless of the clustering algorithm used. More specifically, we present an approach for evaluating clustering schemes (partitions) so as to find the best number of clusters for a specific data set. A clustering algorithm produces different partitions for different values of its input parameters. The proposed approach selects the best clustering scheme (i.e., the scheme with the most compact and well-separated clusters) according to a quality index we define. We verified our approach using two popular clustering algorithms on synthetic and real data sets in order to evaluate its reliability. Moreover, we study the influence of different clustering parameters on the proposed quality index.
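The selection loop described in the abstract (run a clustering algorithm for several parameter values, score each partition, keep the best) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the paper's quality index, so `stand_in_index` is a hypothetical substitute (mean distance to the assigned centroid divided by the smallest centroid-to-centroid distance, lower is better), and `kmeans`, `best_partition`, and the toy data are likewise invented for the example.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest already-chosen seed.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))

    def assign(cs):
        return [min(range(k), key=lambda c: math.dist(p, cs[c])) for p in points]

    for _ in range(iters):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster happens to become empty.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)

def stand_in_index(points, centroids, labels):
    """Hypothetical stand-in index (NOT the paper's): compactness / separation."""
    compactness = sum(math.dist(p, centroids[lab])
                      for p, lab in zip(points, labels)) / len(points)
    separation = min(math.dist(a, b)
                     for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return math.inf if separation == 0 else compactness / separation

def best_partition(points, k_values):
    """Score one partition per candidate k and return the lowest-scoring k."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        scores[k] = stand_in_index(points, centroids, labels)
    return min(scores, key=scores.get), scores

# Toy data (an assumption): three tight 3x3 grids of points around distant centres.
offsets = [-0.5, 0.0, 0.5]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx in offsets for dy in offsets]

best_k, scores = best_partition(points, range(2, 7))
print(best_k, scores)
```

Because the scoring function only sees the points, centroids, and labels, the same loop could wrap a different clustering algorithm or a different index without other changes, which mirrors the algorithm-independent framing of the abstract.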
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Halkidi, M., Vazirgiannis, M., Batistakis, Y. (2000). Quality Scheme Assessment in the Clustering Process. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science(), vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_26
DOI: https://doi.org/10.1007/3-540-45372-5_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive