An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity

  • Conference paper
  • First Online:
Machine Learning and Data Mining in Pattern Recognition (MLDM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10358)

Abstract

This paper analyses the application of Simplified Silhouette to the evaluation of k-means clustering validity and compares it with the k-means Cost Function and the original Silhouette. We conclude that, for a given dataset, the k-means Cost Function is the most valid and efficient measure for evaluating the validity of k-means clusterings with the same k value, but that Simplified Silhouette is more suitable than the original Silhouette for selecting the best result from k-means clusterings with different k values.
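
The sketch below illustrates, for orientation, how two of the measures compared above can be computed for a single k-means clustering. It is a minimal sketch assuming the commonly used definitions (the Cost Function as the sum of squared distances of points to their assigned centroids, and Simplified Silhouette with centroid-based a(i) and b(i)); the function names, the random data and the scikit-learn usage are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(X, labels, centroids):
    # k-means Cost Function (SSE): sum of squared distances from each point
    # to the centroid of the cluster it is assigned to.
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

def simplified_silhouette(X, labels, centroids):
    # Simplified Silhouette (common centroid-based formulation):
    # a(i) = distance from point i to its own centroid,
    # b(i) = distance from point i to the nearest other centroid,
    # sil(i) = (b(i) - a(i)) / max(a(i), b(i)); the score is the mean over all i.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = np.arange(len(X))
    a = d[idx, labels]
    d_other = d.copy()
    d_other[idx, labels] = np.inf
    b = d_other.min(axis=1)
    return float(np.mean((b - a) / np.maximum(a, b)))

# Illustrative usage on random data (not the paper's datasets).
X = np.random.RandomState(0).rand(300, 2)
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print("Cost Function:", kmeans_cost(X, km.labels_, km.cluster_centers_))
print("Simplified Silhouette:", simplified_silhouette(X, km.labels_, km.cluster_centers_))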


Notes

  1. Here we assume that there are at least two different data points in the cluster; otherwise a(i) is set to 0 and sil(i) is 1 (since sil(i) = (b(i) - a(i)) / max(a(i), b(i)), setting a(i) = 0 yields sil(i) = 1 whenever b(i) > 0).

  2. In preparing our experiments we tested two different initialisation methods for k-means: random initialisation and the well-known k-means++ algorithm. However, we found that the initialisation method made no difference to our results, so in this paper we report only the results obtained with random initialisation (a minimal sketch of such a comparison is given after these notes).

  3. If other methods such as k-means++ are used to select the initial centroids, it is very likely that the desired k values will be obtained for all the synthetic datasets.

  4. Due to time and resource limitations, Simplified Silhouette has not been fully explored in this paper; for example, the actual industrial datasets are not available. However, this paper is an attempt to evaluate internal measures for a specific clustering algorithm: specific methods should be evaluated, selected and even designed for specific algorithms or conditions, rather than always applying the same set of general methods to all situations.
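
As referenced in note 2, the following is a minimal sketch of the kind of initialisation comparison described there, assuming scikit-learn's KMeans with init='random' versus init='k-means++'; the data and parameter values are placeholders rather than the paper's experimental setup.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder data; the paper's synthetic and real datasets are not reproduced here.
X = np.random.RandomState(1).rand(500, 4)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=5, init=init, n_init=10, random_state=0).fit(X)
    # inertia_ is scikit-learn's name for the k-means cost (SSE) after convergence.
    print(init, km.inertia_)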


Acknowledgement

The authors wish to acknowledge the support of Enterprise Ireland through the Innovation Partnership Programme SmartSeg 2 and the ADAPT Research Centre. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Funds.

Author information

Corresponding author

Correspondence to Fei Wang.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wang, F., Franco-Penya, HH., Kelleher, J.D., Pugh, J., Ross, R. (2017). An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2017. Lecture Notes in Computer Science (LNAI), vol 10358. Springer, Cham. https://doi.org/10.1007/978-3-319-62416-7_21

  • DOI: https://doi.org/10.1007/978-3-319-62416-7_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62415-0

  • Online ISBN: 978-3-319-62416-7

  • eBook Packages: Computer Science, Computer Science (R0)
