Abstract
The present paper proposes a new strategy for probabilistic (often called model-based) clustering. It is well known that local maxima of mixture likelihoods can be used to partition an underlying data set. However, local maxima are rarely unique. Therefore, it remains to select the reasonable solutions, and in particular the desired one. Credible partitions are usually recognized by separation (and cohesion) of their clusters. We use here the p values provided by the classical tests of Wilks, Hotelling, and Behrens–Fisher to single out those solutions that are well separated by location. It has been shown that reasonable solutions to a clustering problem are related to Pareto points in a plot of scale balance vs. model fit of all local maxima. We briefly review this theory and propose as solutions all well-fitting Pareto points in the set of local maxima separated by location in the above sense. We also design a new iterative, parameter-free cutting plane algorithm for the multivariate Behrens–Fisher problem.
References
Aitchison J, Silvey SD (1960) Maximum-likelihood estimation procedures and associated tests of significance. J R Stat Soc Ser B 22:154–171
Bailey TA Jr, Dubes RC (1982) Cluster validity profiles. Patt Rec 15:61–83
Behrens WU (1929) Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen. Landwirtschaftliche Jahrbücher. Zeitschrift für wissenschaftliche Landwirtschaft und Archiv des Königlich Preussischen Landes-Oekonomie-Kollegiums 68:807–837. Original in Hathi Trust Digital Library
Belloni A, Didier G (2008) On the Behrens-Fisher problem: a globally convergent algorithm and a finite-sample study of the Wald, LR and LM tests. Ann Stat 36:2377–2408
Bock H-H (1985) On some significance tests in cluster analysis. J Classif 2:77–108
Böhning D (2000) Computer-assisted analysis of mixtures and applications. Chapman & Hall/CRC, Boca Raton
Bonnans J-F, Gilbert JC, Lemaréchal C, Sagastizábal CA (2006) Numerical optimization. Theoretical and practical aspects, 2nd edn. Springer, Berlin
Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Austral J Zool 22:417–425
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman & Hall, London
Day NE (1969) Estimating the components of a mixture of normal distributions. Biometrika 56:463–474
Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Fisher RA (1939) The comparison of samples with possibly unequal variances. Ann Eugenics 9:174–180
Fisher RA (1941) The asymptotic approach to Behrens’ integral with further tables for the \(d\) test of significance. Ann Eugenics 11:141–172
Fraley C, Raftery AE (1999) MCLUST: software for model-based cluster analysis. J Classif 16:297–306
Fritz H, García-Escudero LA, Mayo-Iscar A (2013) A fast algorithm for robust constrained clustering. Comput Stat Data Anal 61:124–136
Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, Heidelberg
Gallegos MT, Ritter G (2009) Trimmed ML-estimation of contaminated mixtures. Sankhya Ser A 71:164–220
Gallegos MT, Ritter G (2009) Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3:135–167
Gallegos MT, Ritter G (2010) Using combinatorial optimization in model-based trimmed clustering with cardinality constraints. Comput Stat Data Anal 54:637–654. doi:10.1016/j.csda.2009.08.023
Gallegos MT, Ritter G (2013) Strong consistency of \(k\)-parameters clustering. J Multivar Anal 117:14–31
Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800
Kiefer J, Wolfowitz J (1956) Consistency of the maximum-likelihood estimation in the presence of infinitely many incidental parameters. Ann Math Stat 27:887–906
Kiefer NM (1978) Discrete parameter variation: efficient estimation of a switching regression model. Econometrica 46:427–434
Lee SX, McLachlan GJ (2013) On mixtures of skew normal and skew \(t\)-distributions. Adv Data Anal Classif 7:241–266
Lee SX, McLachlan GJ (2014) Finite mixtures of multivariate skew \(t\)-distributions: some recent and new results. Stat Comput 24:181–202
Lindsay BG (1995) Mixture models: theory, geometry and applications. NSF-CBMS regional conference series in probability and statistics, vol 5. IMS and ASA, Hayward
Mardia KV, Kent T, Bibby JM (1997) Multivariate analysis, 6th edn. Academic Press, London
McLachlan GJ, Peel D (2000a) Finite mixture models. Wiley, New York
McLachlan GJ, Peel D (2000b) On computational aspects of clustering via mixtures of normal and \(t\)-components. In: Proceedings of the American Statistical Association. American Statistical Association, Alexandria
Muirhead RJ (1982) Aspects of multivariate statistical theory. Wiley series in probability and mathematical statistics. Wiley, New York
Peters BC Jr, Walker HF (1978) An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions. SIAM J Appl Math 35:362–378
Ritter G (2015) Robust cluster analysis and variable selection. Monographs in statistics and applied probability, vol 137. Chapman & Hall/CRC, Boca Raton
Rossant C, Kadir S, Goodman DFM, Harris KD (2016) Spike sorting for large, dense electrode arrays. Nature Neurosci 19:624–641
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Silvey SD (1970) Statistical inference. Penguin, Baltimore
Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading
Wilks SS (1932) Certain generalizations in the analysis of variance. Biometrika 24:471–494
Wilks SS (1938) The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat 9:60–62
Yakowitz SJ, Spragins JD (1968) On the identifiability of finite mixtures. Ann Math Stat 39:209–214
Acknowledgements
We thank Prof. H.-H. Bock for kindly discussing with us the subject matter of this paper.
Appendix
Lemma 7.1
The set \(\mathcal K\) introduced in Eq. (12) is closed in \({\mathbb R}^2\) and convex.
Proof
Let \((u_1^{(k)},u_2^{(k)})\) be a sequence of pairs in \(\mathcal K\) converging to \((u_1,u_2)\in {\mathbb R}^2\). There exists a sequence \(m^{(k)}\) in \({\mathbb R}^d\) such that
\[ M(\overline{x}_j,m^{(k)},S_j)\le u_j^{(k)} \tag{14} \]
for all k and \(j=1,2\). It follows that \(m^{(k)}\) is bounded. By Bolzano–Weierstrass, we may without loss of generality assume that it converges to some \(m\in {\mathbb R}^d\). Closedness of \(\mathcal K\) follows after passing to the limit as \(k\rightarrow \infty \) in Eq. (14).
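To prove convexity of \(\mathcal K\), let \(u=(u_1,u_2),~v=(v_1,v_2)\in \mathcal K\) and let \(0<\lambda <1\). By definition,
\[ M(\overline{x}_j,m_u,S_j)\le u_j \quad \text{and}\quad M(\overline{x}_j,m_v,S_j)\le v_j \]
for \(j=1,2\) and two vectors \(m_u,m_v\in {\mathbb R}^d\). By convexity of \(m\mapsto M(x,m,S)\) it follows for \(j=1,2\)
\[ M(\overline{x}_j,(1-\lambda )m_u+\lambda m_v,S_j)\le (1-\lambda )M(\overline{x}_j,m_u,S_j)+\lambda M(\overline{x}_j,m_v,S_j)\le (1-\lambda )u_j+\lambda v_j. \]
Hence, \((1-\lambda )u+\lambda v\in \mathcal K\). This is the second claim. \(\square \)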
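The convexity step of this proof can be checked numerically. The following is a minimal sketch, assuming that \(M(x,m,S)\) is the squared Mahalanobis distance \((x-m)^{\top }S^{-1}(x-m)\), which is how the proofs in this appendix use it. The means `x1`, `x2`, the covariance matrices `S1`, `S2`, and the helper names are hypothetical and randomly generated for illustration only.

```python
import numpy as np

def mahalanobis_sq(x, m, S):
    """Assumed form of M(x, m, S): (x - m)^T S^{-1} (x - m)."""
    d = x - m
    return float(d @ np.linalg.solve(S, d))

rng = np.random.default_rng(0)
dim = 3
# Hypothetical sample means and positive definite covariance matrices.
x1, x2 = rng.normal(size=dim), rng.normal(size=dim)
A1, A2 = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
S1, S2 = A1 @ A1.T + np.eye(dim), A2 @ A2.T + np.eye(dim)

def convexity_gaps(m_u, m_v, lam):
    """For j = 1, 2 return (1-lam)*M(m_u) + lam*M(m_v) - M((1-lam)*m_u + lam*m_v)."""
    m = (1 - lam) * m_u + lam * m_v
    gaps = []
    for x, S in ((x1, S1), (x2, S2)):
        lhs = mahalanobis_sq(x, m, S)
        rhs = (1 - lam) * mahalanobis_sq(x, m_u, S) + lam * mahalanobis_sq(x, m_v, S)
        gaps.append(rhs - lhs)
    return gaps

# Smallest gap over many random triples (m_u, m_v, lam); convexity says it is >= 0.
worst = min(
    min(convexity_gaps(rng.normal(size=dim), rng.normal(size=dim), rng.uniform()))
    for _ in range(1000)
)
print("smallest convexity gap:", worst)
```

Under the stated assumption the gap equals \(\lambda (1-\lambda )(m_u-m_v)^{\top }S_j^{-1}(m_u-m_v)\), so it should be nonnegative up to rounding error.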
We next deal with the properties of the function g defined in Eq. (13).
Lemma 7.2
The function g defined in Eq. (13) is real-analytic and strictly convex; moreover \(g'=-\lambda \), where \(\lambda =\lambda (u_1)>0\) is the Lagrange multiplier determined in the proof.
Proof
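The function g arises from minimizing the parabolic function \(M(\overline{x}_2,m,S_2)\) on the set of all \(m\in {\mathbb R}^d\) such that \(M(\overline{x}_1,m,S_1)=u_1\). Application of the Lagrange function
\[ L(m,\lambda )=M(\overline{x}_2,m,S_2)+\lambda \bigl (M(\overline{x}_1,m,S_1)-u_1\bigr ) \]
with multiplier \(\lambda \) leads to the system of equations
\[ \frac{\partial }{\partial m}M(\overline{x}_2,m,S_2)+\lambda \frac{\partial }{\partial m}M(\overline{x}_1,m,S_1)=0,\qquad M(\overline{x}_1,m,S_1)=u_1 \]
for m and the multiplier \(\lambda \). The multiplier is positive since we require the minimum of a parabolic function on a convex set. The first equation reduces to
\[ S_2^{-1}(m-\overline{x}_2)+\lambda S_1^{-1}(m-\overline{x}_1)=0. \tag{15} \]
Without loss of generality we assume from here on that \(S_1=I_d\), the identity matrix, and \(\overline{x}_1=0\). Eq. (15) yields \(m=(I_d+\lambda S_2)^{-1}\overline{x}_2\). Inserting into the second equation above, \(\Vert m\Vert ^2=u_1\), we find
\[ \Vert (I_d+\lambda S_2)^{-1}\overline{x}_2\Vert ^2=u_1. \tag{16} \]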
Since \(\overline{x}_2\ne 0\), the function \(\lambda \mapsto \Vert (I_d+\lambda S_2)^{-1}\overline{x}_2\Vert ^2\), \(\lambda \ge 0\), is strictly positive and strictly decreasing from \(\Vert \overline{x}_2\Vert ^2\) to 0. Since \(0<u_1=\Vert m\Vert ^2<\Vert \overline{x}_2\Vert ^2\), Eq. (16) has a unique solution \(\lambda >0\) which determines the minimizer \(m=(I_d+\lambda S_2)^{-1}\overline{x}_2\) and, hence, \(g(u_1)~(=M(\overline{x}_2,m,S_2))\).
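Now keep in mind that both solutions \(\lambda \) and m are functions of \(u_1\). Equation (16) shows that \(\lambda \) is real-analytic. Therefore, so are both m and g. To compute the derivative of g, we deduce from Eq. (15) \(m^{\top }S_2^{-1}(m-\overline{x}_2)=-\lambda \Vert m\Vert ^2=-\lambda u_1\) and, therefore,
\[ g(u_1)=(m-\overline{x}_2)^{\top }S_2^{-1}(m-\overline{x}_2)=-\lambda u_1-\overline{x}_2^{\top }S_2^{-1}(m-\overline{x}_2). \]
Hence, \(g'(u_1) =-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}u_1-\lambda -\overline{x}_2^{\top }S_2^{-1}\frac{\,\mathrm {d}m}{\,\mathrm {d}u_1}\). Now, \(m=(I_d+\lambda S_2)^{-1}\overline{x}_2\) implies
\[ \frac{\,\mathrm {d}m}{\,\mathrm {d}u_1}=-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}(I_d+\lambda S_2)^{-1}S_2(I_d+\lambda S_2)^{-1}\overline{x}_2=-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}(I_d+\lambda S_2)^{-1}S_2\,m \]
and
\[ \overline{x}_2^{\top }S_2^{-1}\frac{\,\mathrm {d}m}{\,\mathrm {d}u_1}=-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}\overline{x}_2^{\top }(I_d+\lambda S_2)^{-1}m=-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}\Vert m\Vert ^2=-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}u_1. \]
We conclude \(g'(u_1)=-\lambda \). From Eq. (16), it follows that \(\lambda \) strictly decreases as \(u_1\) increases, so \(g'\) is strictly increasing. Hence g is strictly convex. \(\square \)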
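The identity \(g'=-\lambda \) can be illustrated numerically. The sketch below assumes the normalization \(S_1=I_d\), \(\overline{x}_1=0\) used in the proof and takes \(M(x,m,S)=(x-m)^{\top }S^{-1}(x-m)\). It finds \(\lambda \) by bisection from the condition \(\Vert (I_d+\lambda S_2)^{-1}\overline{x}_2\Vert ^2=u_1\) and compares a central finite difference of g with \(-\lambda \). The function names, the bisection solver, and the numerical values of `x2` and `S2` are illustrative choices, not part of the paper.

```python
import numpy as np

def solve_lambda(u1, x2, S2, tol=1e-13):
    """Bisection for the unique lam > 0 with ||(I + lam*S2)^{-1} x2||^2 = u1."""
    eye = np.eye(len(x2))

    def phi(lam):
        m = np.linalg.solve(eye + lam * S2, x2)
        return float(m @ m)

    if not 0.0 < u1 < phi(0.0):
        raise ValueError("need 0 < u1 < ||x2||^2")
    lo, hi = 0.0, 1.0
    while phi(hi) > u1:           # phi decreases strictly to 0, so this stops
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if phi(mid) > u1:         # root lies to the right of mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g_and_lambda(u1, x2, S2):
    """Return g(u1) = M(x2, m, S2) at the constrained minimizer m, and lam."""
    lam = solve_lambda(u1, x2, S2)
    m = np.linalg.solve(np.eye(len(x2)) + lam * S2, x2)
    diff = x2 - m
    return float(diff @ np.linalg.solve(S2, diff)), lam

# Hypothetical normalized data: x1 = 0, S1 = I, so u1 must lie in (0, ||x2||^2) = (0, 5).
x2 = np.array([2.0, 1.0])
S2 = np.array([[2.0, 0.5], [0.5, 1.0]])

u1, step = 1.0, 1e-4
_, lam = g_and_lambda(u1, x2, S2)
g_plus, _ = g_and_lambda(u1 + step, x2, S2)
g_minus, _ = g_and_lambda(u1 - step, x2, S2)
fd = (g_plus - g_minus) / (2 * step)   # central difference approximation of g'(u1)
print("finite difference:", fd, " -lambda:", -lam)
```

The two printed numbers should agree up to the finite-difference error. Evaluating `solve_lambda` at increasing values of `u1` should also give decreasing multipliers, in line with the strict convexity of g.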
Proof of Theorem 3.1
By Bolzano–Weierstrass, the sequence of minimizing vertices of the cutting plane algorithm has a cluster point in the compact region described by the vertices B, 0, and A in Fig. 1. The construction of the polygons excludes every point u below the graph of g as a cluster point of the minimizers on the vertices. Hence all cluster points lie on the graph of g. Denoting the minimum of h there by \(h^*\) and by \((\widehat{u}_{t_k})\) a subsequence of minimizing vertices converging to a cluster point \(\widehat{u}\), we estimate
\[ h^*\le h(\widehat{u})=\lim _{k\rightarrow \infty }h(\widehat{u}_{t_k})\le h^*. \]
The last inequality follows from the fact that the concave function h assumes its minimum on the convex set described by the polygon at a vertex. Hence any cluster point \((u_1,u_2)\) of minimizing vertices is minimal on g and we have also proved part (b). \(\square \)
Gallegos, M.T., Ritter, G. Probabilistic clustering via Pareto solutions and significance tests. Adv Data Anal Classif 12, 179–202 (2018). https://doi.org/10.1007/s11634-016-0278-2
DOI: https://doi.org/10.1007/s11634-016-0278-2
Keywords
- Cluster analysis
- Probabilistic models
- Mixture model
- Classification model
- Pareto solutions
- Behrens–Fisher problem
- Hotelling’s \(T^2\) statistic
- Wilks’ lambda