Probability Density Function for Clustering Validation

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14001)

Abstract

Choosing the number of clusters is a central problem in machine learning. Validation methods provide solutions, with the drawback that statistical inference is not possible. In this manuscript, we derive a distribution for the number of clusters for use in clustering validation. The starting point of our approach is the transformation of the data to the probabilistic space. Then, the dependence of the non-negative factorization on the dimensionality of the spanned space provides a sequence of traces as the dimensionality varies. Its limit is a gamma distribution. This result allows a non-excluding discussion when interpreting probabilities as credibility levels, and it opens the door to inference for clustering validation.
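The pipeline outlined in the abstract can be loosely illustrated as follows. This is only a sketch under stated assumptions, not the authors' procedure: the abstract does not say how the data are mapped to the probabilistic space, which factorization algorithm is used, or which matrix the trace is taken of. Here the data are shifted to be non-negative and scaled to sum to one, the factorization uses the standard multiplicative updates for the Kullback-Leibler divergence, and the trace is taken of the Gram matrix of the rank-k reconstruction. The function names `to_probabilistic`, `nmf_kl`, and `trace_sequence` are hypothetical.

```python
import numpy as np

def to_probabilistic(X):
    """Assumed transformation: shift to non-negative values, scale to sum to 1."""
    X = np.asarray(X, dtype=float)
    X = X - min(X.min(), 0.0)   # shift only if there are negative entries
    return X / X.sum()

def nmf_kl(Y, k, n_iter=200, seed=0, eps=1e-12):
    """Rank-k factorization Y ~ W @ H via multiplicative updates (KL divergence)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        R = Y / (W @ H + eps)                            # element-wise ratio
        W *= (R @ H.T) / (H.sum(axis=1) + eps)           # update W
        R = Y / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update H
    return W, H

def trace_sequence(X, max_k):
    """One trace value per rank k = 1..max_k (the traced matrix is an assumption)."""
    Y = to_probabilistic(X)
    traces = []
    for k in range(1, max_k + 1):
        W, H = nmf_kl(Y, k)
        P = W @ H                                        # rank-k reconstruction
        traces.append(float(np.trace(P @ P.T)))          # trace of its Gram matrix
    return traces

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 6))                         # synthetic data
    print(trace_sequence(X, max_k=4))
```

The sketch stops at the sequence of traces. How that sequence leads to a gamma limit is the paper's result and is not reproduced here.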

Author information

Corresponding author

Correspondence to Pau Figuera.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Figuera, P., Cuzzocrea, A., García Bringas, P. (2023). Probability Density Function for Clustering Validation. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science, vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_12

  • DOI: https://doi.org/10.1007/978-3-031-40725-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40724-6

  • Online ISBN: 978-3-031-40725-3

  • eBook Packages: Computer Science, Computer Science (R0)
