Abstract
Modal clustering has a clear population goal, in which density estimation plays a critical role. In this paper, we study how to obtain better density estimates to serve the objective of modal clustering. In particular, we use semiparametric mixtures for density estimation, aided by a novel mode-flattening technique. Semiparametric mixtures help to produce better density estimates, especially in the multivariate setting, and the mode-flattening technique is intended to identify and smooth out spurious and minor modes. With mode flattening, the number of clusters can be reduced sequentially until only one mode is left. In addition, we use the likelihood function in a coherent manner to measure the relative importance of each mode and let the currently least important mode disappear at each step. On both simulated and real-world data sets, the proposed method performs very well compared with some well-known clustering methods in the literature and can successfully solve some fairly difficult clustering problems.
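To make the idea of modal clustering concrete, the following is a minimal sketch in which each observation is assigned to the density mode reached by ascending an estimated density. It is not the method of the paper: as a stand-in for the semiparametric mixture estimate it assumes a one-dimensional Gaussian kernel density estimate with a hand-picked bandwidth `h`, and it uses mean-shift iterations for the ascent. The function names, the toy data, and the tolerances are all illustrative assumptions, and no mode flattening is performed.

```python
import math


def mean_shift_mode(x, data, h, tol=1e-8, max_iter=500):
    """Ascend a Gaussian kernel density estimate from x towards a nearby mode.

    Assumption: a plain kernel density estimate with bandwidth h stands in
    for the density estimate; this is not the paper's semiparametric mixture.
    """
    for _ in range(max_iter):
        # Gaussian kernel weight of every observation relative to x.
        weights = [math.exp(-0.5 * ((x - d) / h) ** 2) for d in data]
        # Mean-shift update: move x to the weighted mean of the data.
        new_x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


def modal_clusters(data, h, merge_tol=1e-3):
    """Label each observation by the mode its ascent path converges to."""
    modes, labels = [], []
    for x in data:
        m = mean_shift_mode(x, data, h)
        for k, known in enumerate(modes):
            # Ascents ending within merge_tol are treated as the same mode.
            if abs(m - known) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Hypothetical toy data: two well-separated groups on the real line.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.0, 2.1, 2.2, 1.8]
modes, labels = modal_clusters(data, h=0.5)
```

In this toy setting the number of clusters equals the number of modes of the estimated density, so it depends on how smooth that estimate is: a much larger bandwidth (for instance `h=3.0`) merges the two bumps into a single mode. The paper addresses the number of modes in a different way, through likelihood-guided mode flattening of a semiparametric mixture estimate.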
Acknowledgements
The authors thank the associate editor and two referees for their constructive and insightful suggestions, which led to many improvements in the manuscript.
Cite this article
Hu, S., Wang, Y. Modal Clustering Using Semiparametric Mixtures and Mode Flattening. Stat Comput 31, 5 (2021). https://doi.org/10.1007/s11222-020-09985-z