Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

  • Duy-Tai Dinh
  • Tsutomu Fujinami
  • Van-Nam Huynh
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1103)


Estimating the number of clusters (say k) is one of the major challenges in partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. In the clustering step, the algorithm uses a kernel density estimation approach to define cluster centers and an information-theoretic dissimilarity to measure the distance between centers and the objects in each cluster. A silhouette-analysis-based approach is then used to evaluate the quality of the clusterings obtained in the former step and to choose the best k. Comparative experiments were conducted on both synthetic and real datasets to compare the performance of k-SCC with that of three other algorithms. The experimental results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
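The overall recipe the abstract describes — cluster the data for each candidate k, score each clustering with the silhouette coefficient, and keep the k with the best score — can be sketched in plain Python. Note that this is only an illustrative sketch, not the paper's k-SCC algorithm: it substitutes a simple k-modes-style clusterer with the matching (Hamming) dissimilarity for the paper's kernel-density-based centers and information-theoretic measure, and all names (`k_modes`, `silhouette`, `best_k`) are assumptions introduced here.

```python
import random
from collections import Counter


def matching_dissim(a, b):
    """Simple matching dissimilarity: count of attributes on which two
    categorical objects disagree (a stand-in for the paper's
    information-theoretic dissimilarity)."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(data, k, n_iter=20, seed=0):
    """A very small k-modes-style clusterer (a stand-in for k-SCC's
    kernel-density-based clustering step). Returns one label per object."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assign each object to its nearest center.
        labels = [min(range(k), key=lambda c: matching_dissim(x, centers[c]))
                  for x in data]
        # Recompute each center as the attribute-wise mode of its members.
        new_centers = []
        for c in range(k):
            members = [x for x, l in zip(data, labels) if l == c]
            if members:
                new_centers.append(tuple(Counter(col).most_common(1)[0][0]
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(data, labels):
    """Mean silhouette coefficient under the matching dissimilarity."""
    if len(set(labels)) < 2:
        return -1.0
    n, ks, scores = len(data), set(labels), []
    for i in range(n):
        same = [matching_dissim(data[i], data[j])
                for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:            # singleton cluster: silhouette 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)                       # mean intra-cluster distance
        b = min(sum(matching_dissim(data[i], data[j])   # nearest other cluster
                    for j in range(n) if labels[j] == c)
                / sum(1 for j in range(n) if labels[j] == c)
                for c in ks if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


def best_k(data, k_range, restarts=5):
    """Pick the k whose best-of-restarts clustering has the highest
    mean silhouette."""
    return max(k_range,
               key=lambda k: max(silhouette(data, k_modes(data, k, seed=s))
                                 for s in range(restarts)))
```

The silhouette value lies in [-1, 1]; a clustering whose clusters are compact and well separated scores close to 1, which is what makes the maximum over candidate k values a usable estimate of the true cluster count.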


Keywords: Data mining · Partitional clustering · Categorical data · Silhouette value · Number of clusters



This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.


References

  1. Azimi, R., Ghayekhloo, M., Ghofrani, M., Sajedi, H.: A novel clustering algorithm based on data transformation approaches. Expert Syst. Appl. 76, 59–70 (2017)
  2. Berkhin, P.: A survey of clustering data mining techniques. In: Kogan, J., Nicholas, C., Teboulle, M. (eds.) Grouping Multidimensional Data, pp. 25–71. Springer, Heidelberg (2006)
  3. Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 243–254. SIAM (2008)
  4. Chen, L., Wang, S.: Central clustering of categorical data with automated feature weighting. In: IJCAI, pp. 1260–1266 (2013)
  5. Dinh, D.-T., Huynh, V.-N.: k-CCM: a center-based algorithm for clustering categorical data with missing values. In: Torra, V., Narukawa, Y., Aguiló, I., González-Hidalgo, M. (eds.) MDAI 2018. LNCS (LNAI), vol. 11144, pp. 267–279. Springer, Cham (2018)
  6. Dinh, D.T., Huynh, V.N., Sriboonchita, S.: Data for: clustering mixed numeric and categorical data with missing values (2019)
  7. Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton (2015)
  8. Huang, Z.: Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, pp. 21–34. World Scientific, Singapore (1997)
  9. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
  10. Liang, J., Zhao, X., Li, D., Cao, F., Dang, C.: Determining the number of clusters using information entropy for mixed data. Pattern Recogn. 45(6), 2251–2265 (2012)
  11. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296–304 (1998)
  12. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)
  13. Nguyen, T.-P., Dinh, D.-T., Huynh, V.-N.: A new context-based clustering framework for categorical data. In: Geng, X., Kang, B.-H. (eds.) PRICAI 2018. LNCS (LNAI), vol. 11012, pp. 697–709. Springer, Cham (2018)
  14. Nguyen, T.H.T., Dinh, D.T., Sriboonchitta, S., Huynh, V.N.: A method for k-means-like clustering of categorical data. J. Ambient Intell. Humaniz. Comput. 1–11 (2019)
  15. Nguyen, T.-H.T., Huynh, V.-N.: A k-means-like algorithm for clustering categorical data using an information theoretic-based dissimilarity measure. In: Gyssens, M., Simari, G. (eds.) FoIKS 2016. LNCS, vol. 9616, pp. 115–130. Springer, Cham (2016)
  16. Reddy, C.K., Vinzamuri, B.: A survey of partitional and hierarchical clustering algorithms. In: Data Clustering: Algorithms and Applications, pp. 87–110. Chapman and Hall/CRC (2013)
  17. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
  18. San, O.M., Huynh, V.N., Nakamori, Y.: An alternative extension of the k-means algorithm for clustering categorical data. Int. J. Appl. Math. Comput. Sci. 14, 241–247 (2004)
  19. dos Santos, T.R., Zárate, L.E.: Categorical data clustering: what similarity measure to recommend? Expert Syst. Appl. 42(3), 1247–1260 (2015)
  20. Ünlü, R., Xanthopoulos, P.: Estimating the number of clusters in a dataset via consensus clustering. Expert Syst. Appl. 125, 33–39 (2019)

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Japan
