An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity

  • Conference paper
  • First Online:
Machine Learning and Data Mining in Pattern Recognition (MLDM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10358)

Abstract

This paper analyses the application of Simplified Silhouette to the evaluation of k-means clustering validity and compares it with the k-means Cost Function and the original Silhouette. We conclude that, for a given dataset, the k-means Cost Function is the most valid and efficient measure for evaluating the validity of k-means clusterings with the same k value, but that Simplified Silhouette is more suitable than the original Silhouette for selecting the best result from k-means clusterings with different k values.
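
The sketch below illustrates, for orientation, how two of the measures compared above can be computed for a single k-means clustering. It is a minimal sketch assuming the commonly used definitions (the Cost Function as the sum of squared distances of points to their assigned centroids, and Simplified Silhouette with centroid-based a(i) and b(i)); the function names, the random data and the scikit-learn usage are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(X, labels, centroids):
    # k-means Cost Function (SSE): sum of squared distances from each point
    # to the centroid of the cluster it is assigned to.
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

def simplified_silhouette(X, labels, centroids):
    # Simplified Silhouette (common centroid-based formulation):
    # a(i) = distance from point i to its own centroid,
    # b(i) = distance from point i to the nearest other centroid,
    # sil(i) = (b(i) - a(i)) / max(a(i), b(i)); the score is the mean over all i.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = np.arange(len(X))
    a = d[idx, labels]
    d_other = d.copy()
    d_other[idx, labels] = np.inf
    b = d_other.min(axis=1)
    return float(np.mean((b - a) / np.maximum(a, b)))

# Illustrative usage on random data (not the paper's datasets).
X = np.random.RandomState(0).rand(300, 2)
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print("Cost Function:", kmeans_cost(X, km.labels_, km.cluster_centers_))
print("Simplified Silhouette:", simplified_silhouette(X, km.labels_, km.cluster_centers_))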


Notes

  1. Here we assume that there are at least two different data points in the cluster; otherwise a(i) is set to 0 and sil(i) is 1 (since sil(i) = (b(i) - a(i)) / max(a(i), b(i)), setting a(i) = 0 yields sil(i) = 1 whenever b(i) > 0).

  2. In preparing our experiments we tested two different initialisation methods for k-means: random initialisation and the well-known k-means++ algorithm. However, we found that the initialisation method made no difference to our results, so in this paper we report only the results obtained with random initialisation (a minimal sketch of such a comparison is given after these notes).

  3. If other methods such as k-means++ are used to select the initial centroids, it is very likely that the desired k values will be obtained for all the synthetic datasets.

  4. Due to time and resource limitations, Simplified Silhouette has not been fully explored in this paper; for example, the actual industrial datasets are not available. However, this paper is an attempt to evaluate internal measures for a specific clustering algorithm: specific methods should be evaluated, selected and even designed for specific algorithms or conditions, rather than always applying the same set of general methods to all situations.
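
As referenced in note 2, the following is a minimal sketch of the kind of initialisation comparison described there, assuming scikit-learn's KMeans with init='random' versus init='k-means++'; the data and parameter values are placeholders rather than the paper's experimental setup.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder data; the paper's synthetic and real datasets are not reproduced here.
X = np.random.RandomState(1).rand(500, 4)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=5, init=init, n_init=10, random_state=0).fit(X)
    # inertia_ is scikit-learn's name for the k-means cost (SSE) after convergence.
    print(init, km.inertia_)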


Acknowledgement

The authors wish to acknowledge the support of Enterprise Ireland through the Innovation Partnership Programme SmartSeg 2 and the ADAPT Research Centre. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Funds.

Author information

Corresponding author

Correspondence to Fei Wang.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wang, F., Franco-Penya, HH., Kelleher, J.D., Pugh, J., Ross, R. (2017). An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2017. Lecture Notes in Computer Science (LNAI), vol 10358. Springer, Cham. https://doi.org/10.1007/978-3-319-62416-7_21

  • DOI: https://doi.org/10.1007/978-3-319-62416-7_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62415-0

  • Online ISBN: 978-3-319-62416-7

  • eBook Packages: Computer Science, Computer Science (R0)
