Evaluating clustering quality using features salience: a promising approach

Abstract

This paper focuses on using feature salience to evaluate the quality of a partition in the context of hard clustering. It relies on the hypothesis that a good partition is an easy-to-label partition, i.e. a partition in which each cluster is made of salient features. This approach is compared mainly to the usual approaches relying on distances between data points, but also to more recent approaches based on entropy or stability. We show that our feature-based indexes outperform the compared indexes for optimal model selection: they are more effective from the low- to the high-dimensional range, and they are more robust to noise. To show the efficiency of our indexes in a real-life application, we consider the task of diachronic analysis on a textual dataset. We demonstrate that our approach yields interesting and relevant results in this context, whereas the other approaches mostly lead to unusable results.
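
As context for the hypothesis above, here is a minimal Python sketch of what a feature-salience-based quality score can look like on a bag-of-words matrix. It only illustrates the general idea (clusters dominated by over-represented features score well); the function name and scoring rule are our own illustrative choices, not the EC index defined in the paper.

```python
import numpy as np

def feature_salience_score(X, labels):
    """Toy partition-quality score based on feature salience.

    A feature counts as salient for a cluster when its mean value inside
    the cluster exceeds its global mean; the score rewards partitions
    whose clusters concentrate their feature mass on salient features.
    Illustrative sketch only, NOT the EC index from the paper.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    global_mean = X.mean(axis=0)                  # per-feature mean over all data
    scores = []
    for k in np.unique(labels):
        cluster_mean = X[labels == k].mean(axis=0)
        salient = cluster_mean > global_mean      # features over-represented here
        mass = cluster_mean.sum()
        scores.append(cluster_mean[salient].sum() / mass if mass > 0 else 0.0)
    return float(np.mean(scores))                 # average salience over clusters
```

In a model-selection loop, such a score would be computed for each candidate number of clusters and the partition maximizing it retained.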

Notes

  1. http://www.research.att.com/~lewis/reuters21578.html.

  2. As an example, for the R52 dataset, the EC index computation time was 125 s, as compared to 43,000 s for the Silhouette index, on a standard laptop with a 2.2 GHz quad-core processor and 8 GB of memory (a sketch illustrating this runtime gap follows these notes).

  3. This new version was recently developed in the context of the ISTEX project. A demonstrator is available online (see https://github.com/nicolasdugue/istex-demonstrateur) and is described in Dugué et al. [36].

  4. Diachronic analysis and feature selection code can be found at https://github.com/nicolasdugue/istex.
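
The runtime gap reported in note 2 follows from the asymptotics: the Silhouette index needs all pairwise distances between documents, which is quadratic in their number, while a feature-based index only needs per-cluster feature statistics, which is linear. The sketch below, with purely illustrative synthetic sizes and no claim to reproduce the reported timings, makes this gap visible:

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a document-term matrix; sizes are illustrative only.
rng = np.random.default_rng(0)
X = rng.random((5000, 1000))
labels = KMeans(n_clusters=52, n_init=1, random_state=0).fit_predict(X)

t0 = time.perf_counter()
silhouette_score(X, labels)          # needs all pairwise distances: O(n^2 * d)
print(f"silhouette:    {time.perf_counter() - t0:.1f} s")

t0 = time.perf_counter()
for k in np.unique(labels):          # per-cluster feature statistics: O(n * d)
    X[labels == k].mean(axis=0)
print(f"feature stats: {time.perf_counter() - t0:.1f} s")
```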

References

  1. Liu Y, Li Z, Xiong H, Gao X, Wu J (2010) Understanding of internal clustering validation measures. In: International conference on data mining, pp 911–916

  2. Angel Latha Mary S, Sivagami AN, Usha Rani M (2015) Cluster validity measures for dynamic clustering algorithms. ARPN J Eng Appl Sci 10(9):4009–4012

  3. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

  4. Rendón E, Abundez I, Arizmendi A, Quiroz EM (2011) Internal versus external cluster validation indexes. Int J Comput Commun 5(1):27–34

  5. Kassab R, Lamirel J-C (2008) Feature-based cluster validation for high-dimensional data. In: International conference on artificial intelligence and applications, pp 232–239

  6. Lamirel J-C, Mall R, Cuxac P, Safi G (2011) Variations to incremental growing neural gas algorithm based on label maximization. In: International joint conference on neural networks, pp 956–965

  7. Guerra L, Robles V, Bielza C, Larrañaga P (2012) A comparison of clustering quality indices using outliers and noise. Intell Data Anal 16(4):703–715

  8. Hamerly G, Elkan C (2004) Learning the k in k-means. In: Advances in neural information processing systems, pp 281–288

  9. Bock HH (1996) Probability models and hypothesis testing in partitioning cluster analysis. In: Clustering and classification. World Scientific

  10. Gordon AD (1997) External validation in cluster analysis. Bull Int Stat Inst 51(2):353–356

  11. Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2):107–145

  12. Bellman RE (1961) Adaptive control processes. Princeton University Press

  13. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105

  14. Adolfsson A, Ackerman M, Brownstein NC (2019) To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit 88:13–26

  15. Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104

  16. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227

  17. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  18. Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Theory Methods 3(1):1–27

  19. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847

  20. Dudek A (2019) Silhouette index as clustering evaluation tool. In: Conference of the Section on Classification and Data Analysis of the Polish Statistical Association. Springer, pp 19–33

  21. Ünlü R, Xanthopoulos P (2019) Estimating the number of clusters in a dataset via consensus clustering. Expert Syst Appl 125:33–39

  22. Krasnov F, Sen A (2019) The number of topics optimization: clustering approach. Mach Learn Knowl Extr 1(1):416–426

  23. Akhanli SE, Hennig C (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. arXiv preprint arXiv:2002.01822

  24. Kargar M, Isazadeh A, Izadkhah H (2020) New internal metric for software clustering algorithms validity. IET Softw 14:402–410

  25. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1, pp 281–297

  26. Dimitriadou E, Dolnicar S, Weingessel A (2002) An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1):137–159

  27. Pal NR, Biswas J (1997) Cluster validation using graph theoretic concepts. Pattern Recognit 30(6):847–857

  28. Lago-Fernández LF, Corbacho F (2009) Using the negentropy increment to determine the number of clusters. In: International work-conference on artificial neural networks, pp 448–455

  29. Lago-Fernández LF, Aragón J, Martínez-Muñoz G, González AM, Sánchez-Montañés M (2014) Cluster validation in problems with increasing dimensionality and unbalanced clusters. Neurocomputing 123:33–39

  30. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723

  31. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  32. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press

  33. Ben-Hur A, Elisseeff A, Guyon I (2001) A stability based method for discovering structure in clustered data. Pac Symp Biocomput 7:6–17

  34. Lamirel J-C, Cuxac P, Chivukula AS, Hajlaoui K (2015) Optimizing text classification through efficient feature selection based on quality metric. J Intell Inf Syst 45(3):379–396

  35. Falk I, Gardent C, Lamirel J-C (2012) Classifying French verbs using French and English lexical resources. In: Proceedings of the Association for Computational Linguistics, pp 854–863

  36. Dugué N, Lamirel J-C, Cuxac P (2016) Diachronic'Explorer: keep track of your clusters. In: Research challenges in information science, pp 1–2

  37. Bache K, Lichman M (2013) UCI machine learning repository

  38. Sun L, Korhonen A, Poibeau T, Messiant C (2010) Investigating the cross-linguistic potential of VerbNet-style classification. In: International conference on computational linguistics, pp 1056–1064

  39. Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, pp 625–632

  40. Schubert E, Sander J, Ester M, Kriegel H-P, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):1–21

  41. Schubert E, Gertz M (2018) Improving the cluster structure extracted from OPTICS plots. In: LWDA, pp 318–329

  42. van der Merwe DW, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: Congress on evolutionary computation (CEC'03), vol 1, pp 215–220

  43. Janani R, Vijayarani S (2019) Text document clustering using spectral clustering algorithm with particle swarm optimization. Expert Syst Appl 134:192–200

  44. Lamirel J-C (2012) A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research. Scientometrics 93(1):151–166

  45. Pons P, Latapy M (2006) Computing communities in large networks using random walks. J Graph Algorithms Appl 10(2):191–218

Author information

Corresponding author

Correspondence to Jean-Charles Lamirel.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Clustering stability and subsampling

In Ben-Hur et al. [33], the method for finding the correct number of clusters is based on subsampling and clustering stability. To evaluate the agreement between partitions computed on distinct samples with the same value of k, the authors propose to calculate the correlation between the pairs of points classified in the same cluster in the different models. Then, by looking at the correlation distribution for each k, one may be able to find the correct k. Indeed, in the original paper, the correlation distribution is very close to 1 for low values of k, and a dramatic decrease occurs at a particular value \(k_{\mathrm{dec}}\). The optimal value chosen is \(k_{\mathrm{dec}} - 1\), the last value of k with a distribution close to 1. Our experiments show that this method does not work on our reference data when a standard clustering algorithm such as K-means is used. Indeed, Figs. 7 and 8 show that the correlation decreases steadily from one k to the next, so that no dramatic decrease indicating the optimal k can be identified. As stated in the introduction, this may be due to the fact that K-means partitions are too suboptimal, which leads to poor correlations. Moreover, the method requires a large number of runs for each k, which makes it impractical to use more complex algorithms on large datasets.
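
To make the procedure concrete, here is a minimal Python sketch of this subsampling scheme, using scikit-learn's K-means as the base clusterer; the subsample fraction, trial count, and function names are our own illustrative choices, not those of Ben-Hur et al.

```python
import numpy as np
from sklearn.cluster import KMeans

def comembership(labels):
    """Binary co-membership: entry (i, j) is 1 iff points i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def stability_correlations(X, k, n_trials=20, frac=0.8, seed=0):
    """Sketch of the subsampling scheme of Ben-Hur et al. [33].

    For each trial, two random subsamples are clustered with K-means at the
    same k; the correlation between the co-membership patterns of the points
    common to both subsamples measures the agreement of the two partitions.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, corrs = len(X), []
    for _ in range(n_trials):
        s1 = rng.choice(n, size=int(frac * n), replace=False)
        s2 = rng.choice(n, size=int(frac * n), replace=False)
        common = np.intersect1d(s1, s2)
        # Fit each model on its own subsample, then label the shared points.
        l1 = KMeans(n_clusters=k, n_init=5).fit(X[s1]).predict(X[common])
        l2 = KMeans(n_clusters=k, n_init=5).fit(X[s2]).predict(X[common])
        iu = np.triu_indices(len(common), k=1)   # pairs of distinct points
        corrs.append(np.corrcoef(comembership(l1)[iu],
                                 comembership(l2)[iu])[0, 1])
    return corrs  # a distribution concentrated near 1 indicates a stable k
```

Repeating this for each candidate k and plotting a histogram of the returned correlations yields figures such as Figs. 7 and 8; a stable k is one whose distribution stays concentrated near 1.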

Fig. 7

Correlation distribution for k from 2 to 7 on the IRIS dataset. The ground truth is 3 clusters. On the abscissa, the correlation between pairs, from 0 to 1; on the ordinate, the number of clustering pairs

Fig. 8

Correlation distribution for k from 4 to 9 on the ZOO dataset. The ground truth is 7 clusters. On the abscissa, the correlation between pairs, from 0 to 1; on the ordinate, the number of clustering pairs

About this article

Cite this article

Dugué, N., Lamirel, JC. & Chen, Y. Evaluating clustering quality using features salience: a promising approach. Neural Comput & Applic 33, 12939–12956 (2021). https://doi.org/10.1007/s00521-021-05942-7
