Quality Scheme Assessment in the Clustering Process

Halkidi, M.; Vazirgiannis, M.; Batistakis, Y.

doi:10.1007/3-540-45372-5_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1910)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Clustering is mostly an unsupervised procedure, and most clustering algorithms depend on assumptions and initial guesses in order to define the subgroups present in a data set. As a consequence, in most applications the final clusters require some sort of evaluation. The evaluation procedure has to tackle difficult problems, which can be qualitatively expressed as: (i) the quality of the clusters, (ii) the degree to which a clustering scheme fits a specific data set, and (iii) the optimal number of clusters in a partitioning. In this paper we present a scheme for finding the optimal partitioning of a data set during the clustering process, regardless of the clustering algorithm used. More specifically, we present an approach for evaluating clustering schemes (partitions) so as to find the best number of clusters for a specific data set. A clustering algorithm produces different partitions for different values of its input parameters. The proposed approach selects the best clustering scheme (i.e., the scheme with the most compact and well-separated clusters) according to a quality index we define. We verified our approach using two popular clustering algorithms on synthetic and real data sets in order to evaluate its reliability. Moreover, we study the influence of different clustering parameters on the proposed quality index.
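The selection procedure described in the abstract (run a clustering algorithm for several parameter values, score each resulting partition, keep the best) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the formula of the quality index, so `stand_in_index` below is a hypothetical placeholder (mean point-to-center distance divided by the smallest center-to-center distance), not the index defined in the paper. The k-means routine, the farthest-first seeding, the synthetic blobs and all function names are likewise assumptions chosen to keep the example self-contained.

```python
import math
import random


def make_blobs(centers, n_per_blob, sigma, seed=0):
    """Draw n_per_blob Gaussian points around each center (synthetic data)."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, sigma), rng.gauss(cy, sigma))
        for cx, cy in centers
        for _ in range(n_per_blob)
    ]


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns (centers, labels)."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def stand_in_index(points, centers, labels):
    """Hypothetical placeholder index, NOT the paper's: compactness (mean
    distance of points to their own center) over separation (smallest
    distance between two centers). Lower means more compact and better
    separated clusters."""
    scatter = sum(
        math.dist(p, centers[lab]) for p, lab in zip(points, labels)
    ) / len(points)
    separation = min(
        math.dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]
    )
    return scatter / separation


def select_scheme(points, k_values):
    """Cluster once per candidate k and return the k with the lowest index."""
    scores = {}
    for k in k_values:
        centers, labels = kmeans(points, k)
        scores[k] = stand_in_index(points, centers, labels)
    return min(scores, key=scores.get), scores


# Three well-separated synthetic blobs; candidate cluster counts 2..6.
points = make_blobs([(0, 0), (10, 0), (0, 10)], n_per_blob=60, sigma=0.5)
best_k, scores = select_scheme(points, range(2, 7))
```

On this synthetic input the three-cluster partition receives the lowest score, so `best_k` is 3. Because the scoring function only needs points, centers and labels, another clustering algorithm or another validity index could be swapped in without changing `select_scheme`, which mirrors the abstract's claim that the approach is independent of the clustering algorithm used.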

Author information

Authors and Affiliations

Dept. of Informatics, Athens University of Economics & Business, Patision 76, 10434, Athens, Greece (Hellas)
M. Halkidi, M. Vazirgiannis & Y. Batistakis

Editor information

Editors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, 7491, Trondheim, Norway
Jan Komorowski
Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
Jan Żytkow
Laboratoire ERIC, Université Lyon 2, 5 avenue Pierre Mendès-France, 69676, Bron, France
Djamel A. Zighed

About this paper

Cite this paper

Halkidi, M., Vazirgiannis, M., Batistakis, Y. (2000). Quality Scheme Assessment in the Clustering Process. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_26

DOI: https://doi.org/10.1007/3-540-45372-5_26
Published: 18 July 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive
