Probabilistic clustering via Pareto solutions and significance tests

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

The present paper proposes a new strategy for probabilistic (often called model-based) clustering. It is well known that local maxima of mixture likelihoods can be used to partition an underlying data set. However, local maxima are rarely unique. Therefore, it remains to select the reasonable solutions, and in particular the desired one. Credible partitions are usually recognized by separation (and cohesion) of their clusters. We use here the p values provided by the classical tests of Wilks, Hotelling, and Behrens–Fisher to single out those solutions that are well separated by location. It has been shown that reasonable solutions to a clustering problem are related to Pareto points in a plot of scale balance vs. model fit of all local maxima. We briefly review this theory and propose as solutions all well-fitting Pareto points in the set of local maxima separated by location in the above sense. We also design a new iterative, parameter-free cutting plane algorithm for the multivariate Behrens–Fisher problem.
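The selection of Pareto points among local maxima can be illustrated with a small sketch. It assumes that each local maximum has already been summarized by a pair (scale balance, model fit), that larger values are better for both coordinates, and that solutions failing the significance tests have been discarded beforehand; the function name `pareto_points` and the numbers are hypothetical and serve illustration only.

```python
from typing import List, Tuple

# (scale_balance, fit); ASSUMPTION: larger is better in both coordinates.
Solution = Tuple[float, float]


def pareto_points(solutions: List[Solution]) -> List[Solution]:
    """Return the solutions that are not dominated by any other solution."""

    def dominates(a: Solution, b: Solution) -> bool:
        # a dominates b if it is at least as good in both coordinates
        # and strictly better in at least one of them.
        return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]


# Hypothetical local maxima, given as (scale balance, log-likelihood) pairs.
local_maxima = [
    (0.90, -520.0),
    (0.40, -480.0),
    (0.35, -500.0),
    (0.10, -470.0),
    (0.90, -530.0),
]
front = pareto_points(local_maxima)
print(front)
```

The quadratic scan is adequate for the moderate numbers of local maxima typically produced by repeated runs of an optimizer; the order of the input is preserved in the output.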

References

  • Aitchison J, Silvey SD (1960) Maximum-likelihood estimation procedures and associated tests of significance. J R Stat Soc Ser B 22:154–171

  • Bailey TA Jr, Dubes RC (1982) Cluster validity profiles. Patt Rec 15:61–83

  • Behrens WU (1929) Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen. Landwirtschaftliche Jahrbücher. Zeitschrift für wissenschaftliche Landwirtschaft und Archiv des Königlich Preussischen Landes-Oekonomie-Kollegiums 68:807–837. Original in Hathi Trust Digital Library

  • Belloni A, Didier G (2008) On the Behrens-Fisher problem: a globally convergent algorithm and a finite-sample study of the Wald, LR and LM tests. Ann Stat 36:2377–2408

  • Bock H-H (1985) On some significance tests in cluster analysis. J Classif 2:77–108

  • Böhning D (2000) Computer-assisted analysis of mixtures and applications. Chapman & Hall/CRC, Boca Raton

  • Bonnans J-F, Gilbert JC, Lemaréchal C, Sagastizábal CA (2006) Numerical optimization. Theoretical and practical aspects, 2nd edn. Springer, Berlin

  • Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Austral J Zool 22:417–425

  • Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman & Hall, London

  • Day NE (1969) Estimating the components of a mixture of normal distributions. Biometrika 56:463–474

  • Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York

  • Fisher RA (1939) The comparison of samples with possibly unequal variances. Ann Eugenics 9:174–180

  • Fisher RA (1941) The asymptotic approach to Behrens’ integral with further tables for the \(d\) test of significance. Ann Eugenics 11:141–172

  • Fraley C, Raftery AE (1999) MCLUST: software for model-based cluster analysis. J Classif 16:297–306

  • Fritz H, García-Escudero LA, Mayo-Iscar A (2013) A fast algorithm for robust constrained clustering. Comput Stat Data Anal 61:124–136

  • Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, Heidelberg

  • Gallegos MT, Ritter G (2009) Trimmed ML-estimation of contaminated mixtures. Sankhya Ser A 71:164–220

  • Gallegos MT, Ritter G (2009) Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3:135–167

  • Gallegos MT, Ritter G (2010) Using combinatorial optimization in model-based trimmed clustering with cardinality constraints. Comput Stat Data Anal 54:637–654. doi:10.1016/j.csda.2009.08.023

  • Gallegos MT, Ritter G (2013) Strong consistency of \(k\)-parameters clustering. J Multivar Anal 117:14–31

  • Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800

  • Kiefer J, Wolfowitz J (1956) Consistency of the maximum-likelihood estimation in the presence of infinitely many incidental parameters. Ann Math Stat 27:887–906

  • Kiefer NM (1978) Discrete parameter variation: efficient estimation of a switching regression model. Econometrica 46:427–434

  • Lee SX, McLachlan GJ (2013) On mixtures of skew normal and skew \(t\)-distributions. Adv Data Anal Classif 7:241–266

  • Lee SX, McLachlan GJ (2014) Finite mixtures of multivariate skew \(t\)-distributions: some recent and new results. Stat Comput 24:181–202

  • Lindsay BG (1995) Mixture models: theory, geometry and applications. NSF-CBMS regional conference series in probability and statistics, vol 5. IMS and ASA, Hayward

  • Mardia KV, Kent JT, Bibby JM (1997) Multivariate analysis, 6th edn. Academic Press, London

  • McLachlan GJ, Peel D (2000a) Finite mixture models. Wiley, New York

  • McLachlan GJ, Peel D (2000b) On computational aspects of clustering via mixtures of normal and \(t\)-components. In: Proceedings of the American Statistical Association. American Statistical Association, Alexandria

  • Muirhead RJ (1982) Aspects of multivariate statistical theory. Wiley series in probability and mathematical statistics. Wiley, New York

  • Peters BC Jr, Walker HF (1978) An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions. SIAM J Appl Math 35:362–378

  • Ritter G (2015) Robust cluster analysis and variable selection. Monographs in statistics and applied probability, vol 137. Chapman & Hall/CRC, Boca Raton

  • Rossant C, Kadir S, Goodman DFM, Harris KD (2016) Spike sorting for large, dense electrode arrays. Nature Neurosci 19:624–641

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  • Silvey SD (1970) Statistical inference. Penguin, Baltimore

  • Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading

  • Wilks SS (1932) Certain generalizations in the analysis of variance. Biometrika 24:471–494

  • Wilks SS (1938) The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat 9:60–62

  • Yakowitz SJ, Spragins JD (1968) On the identifiability of finite mixtures. Ann Math Stat 39:209–214

Acknowledgements

We thank Prof. H.-H. Bock for kindly discussing with us the subject matter of this paper.

Author information

Correspondence to Gunter Ritter.

Appendix

Lemma 7.1

The set \(\mathcal K\) introduced in Eq. (12) is closed in \({\mathbb R}^2\) and convex.

Proof

Let \((u_1^{(k)},u_2^{(k)})\) be a sequence of pairs in \(\mathcal K\), convergent to \((u_1,u_2)\in {\mathbb R}^2\). There exists a sequence \(m^{(k)}\) in \({\mathbb R}^d\) such that

$$\begin{aligned} u_j^{(k)}\ge M(\overline{x}_j,m^{(k)},S_j) \end{aligned}$$
(14)

for all k and \(j=1,2\). Since the convergent sequence \((u_1^{(k)},u_2^{(k)})\) is bounded and \(M(\overline{x}_j,m,S_j)\rightarrow \infty \) as \(\Vert m\Vert \rightarrow \infty \), it follows that \(m^{(k)}\) is bounded. By Bolzano–Weierstrass, we may, after passing to a subsequence, assume that it converges to some \(m\in {\mathbb R}^d\). Closedness of \(\mathcal K\) follows by continuity of M after passing to the limit as \(k\rightarrow \infty \) in Eq. (14).

To prove convexity of \(\mathcal K\), let \(u=(u_1,u_2),~v=(v_1,v_2)\in \mathcal K\) and let \(0<\lambda <1\). By definition,

$$\begin{aligned} u_j\ge M(\overline{x}_j,m_u,S_j)\text { and }v_j\ge M(\overline{x}_j,m_v,S_j) \end{aligned}$$

for \(j=1,2\) and two vectors \(m_u,m_v\in {\mathbb R}^d\). By convexity of \(m\mapsto M(x,m,S)\) it follows, for \(j=1,2\), that

$$\begin{aligned}&(1-\lambda )u_j+\lambda v_j \ge (1-\lambda )M(\overline{x}_j,m_u,S_j)+\lambda M(\overline{x}_j,m_v,S_j) \\&\quad \ge M(\overline{x}_j,(1-\lambda )m_u+\lambda m_v,S_j). \end{aligned}$$

Hence, \((1-\lambda )u+\lambda v\in \mathcal K\). This is the second claim. \(\square \)
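The convexity of \(m\mapsto M(x,m,S)\) used in the proof can be checked directly. The following short verification assumes, as the Lagrange system in the proof of Lemma 7.2 suggests, that \(M(x,m,S)=(m-x)^{\top }S^{-1}(m-x)\) with a positive definite matrix S; the abbreviation \(m_\lambda =(1-\lambda )m_u+\lambda m_v\) is introduced here for illustration only. Expanding the quadratic form gives

$$\begin{aligned}&(1-\lambda )M(x,m_u,S)+\lambda M(x,m_v,S)-M(x,m_\lambda ,S) \\&\quad = \lambda (1-\lambda )(m_u-m_v)^{\top }S^{-1}(m_u-m_v) \ge 0, \end{aligned}$$

with equality for \(0<\lambda <1\) only if \(m_u=m_v\).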

We next deal with the properties of the function g defined in Eq. (13).

Lemma 7.2

The function g defined in Eq. (13) is real-analytic and strictly convex; moreover \(g'=-\lambda \).

Proof

The function g arises from minimizing the parabolic function \(M(\overline{x}_2,m,S_2)\) on the set of all \(m\in {\mathbb R}^d\) such that \(M(\overline{x}_1,m,S_1)=u_1\). Application of the Lagrange function

$$\begin{aligned} M(\overline{x}_2,m,S_2)+\lambda (M(\overline{x}_1,m,S_1)-u_1) \end{aligned}$$

with multiplier \(\lambda \) leads to the system of equations

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathrm D}_m(m-\overline{x}_2)^{\top }S_2^{-1}(m-\overline{x}_2) +\lambda {\mathrm D}_m(m-\overline{x}_1)^{\top }S_1^{-1}(m-\overline{x}_1)=0,\\ (m-\overline{x}_1)^{\top }S_1^{-1}(m-\overline{x}_1)=u_1 \end{array}\right. } \end{aligned}$$

for m and the multiplier \(\lambda \). The multiplier is positive since we require the minimum of a parabolic function on a convex set. The first equation reduces to

$$\begin{aligned} S_2^{-1}(m-\overline{x}_2)+\lambda S_1^{-1}(m-\overline{x}_1)=0. \end{aligned}$$
(15)

Without loss of generality we assume from here on that \(S_1=I_d\), the identity matrix, and \(\overline{x}_1=0\). Eq. (15) yields \(m=(I_d+\lambda S_2)^{-1}\overline{x}_2\). Inserting into the second equation above, \(\Vert m\Vert ^2=u_1\), we find

$$\begin{aligned} \Vert (I_d+\lambda S_2)^{-1}\overline{x}_2\Vert ^2=u_1. \end{aligned}$$
(16)

Since \(\overline{x}_2\ne 0\), the function \(\lambda \mapsto \Vert (I_d+\lambda S_2)^{-1}\overline{x}_2\Vert ^2\), \(\lambda \ge 0\), is strictly positive and strictly decreasing from \(\Vert \overline{x}_2\Vert ^2\) to 0. Since \(0<u_1=\Vert m\Vert ^2<\Vert \overline{x}_2\Vert ^2\), Eq. (16) has a unique solution \(\lambda >0\) which determines the minimizer \(m=(I_d+\lambda S_2)^{-1}\overline{x}_2\) and, hence, \(g(u_1)~(=M(\overline{x}_2,m,S_2))\).

Now keep in mind that both solutions \(\lambda \) and m are functions of \(u_1\). Since the left-hand side of Eq. (16) is real-analytic in \(\lambda \) with strictly negative derivative, the implicit function theorem shows that \(\lambda \) is a real-analytic function of \(u_1\). Therefore, so are both m and g. To compute the derivative of g, we deduce from Eq. (15) that \(m^{\top }S_2^{-1}(m-\overline{x}_2)=-\lambda \Vert m\Vert ^2=-\lambda u_1\) and, therefore,

$$\begin{aligned} g(u_1)&= (m-\overline{x}_2)^{\top }S_2^{-1}(m-\overline{x}_2) = m^{\top }S_2^{-1}(m-\overline{x}_2)-\overline{x}_2^{\top }S_2^{-1}(m-\overline{x}_2)\\&= -\lambda u_1-\overline{x}_2^{\top }S_2^{-1}(m-\overline{x}_2). \end{aligned}$$

Hence, \(g'(u_1) =-\frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}u_1-\lambda -\overline{x}_2^{\top }S_2^{-1}\frac{\,\mathrm {d}m}{\,\mathrm {d}u_1}\). Now, \(m=(I_d+\lambda S_2)^{-1}\overline{x}_2\) implies

$$\begin{aligned} {\textstyle \frac{\,\mathrm {d}m}{\,\mathrm {d}u_1}} = -{\textstyle \frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}}(I_d+\lambda S_2)^{-2}S_2\overline{x}_2 \end{aligned}$$

and

$$\begin{aligned}&-\overline{x}_2^{\top }S_2^{-1}{\textstyle \frac{\,\mathrm {d}m}{\,\mathrm {d}u_1}} = {\textstyle \frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}}\overline{x}_2^{\top }S_2^{-1} (I_d+\lambda S_2)^{-2}S_2\overline{x}_2 = {\textstyle \frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}}\overline{x}_2^{\top }(I_d+\lambda S_2)^{-2}\overline{x}_2 \\&\quad = {\textstyle \frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}}\Vert m\Vert ^2= {\textstyle \frac{\,\mathrm {d}\lambda }{\mathrm {d}u_1}}u_1. \end{aligned}$$

We conclude \(g'(u_1)=-\lambda \). From Eq. (16), it follows that \(\lambda \) strictly decreases as \(u_1\) increases, so that \(g'=-\lambda \) is strictly increasing. Hence g is strictly convex. \(\square \)
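The statements of Lemma 7.2 can be checked numerically. The following sketch solves Eq. (16) for \(\lambda \) by bisection, evaluates \(g(u_1)\), and compares a central finite difference of g with \(-\lambda \). It assumes the normalization \(S_1=I_d\), \(\overline{x}_1=0\) of the proof and, in addition, that \(S_2\) is diagonal with positive entries; the function names and the numerical values are hypothetical, and the bisection is a generic root finder, not the algorithm of the paper.

```python
from typing import List, Tuple


def norm_sq_m(lam: float, s: List[float], x2: List[float]) -> float:
    """Left-hand side of Eq. (16), ||(I + lam*S2)^{-1} x2||^2, for S2 = diag(s)."""
    return sum((xi / (1.0 + lam * si)) ** 2 for si, xi in zip(s, x2))


def solve_lambda(u1: float, s: List[float], x2: List[float]) -> float:
    """Solve Eq. (16) for lam > 0 by bisection (ASSUMPTION: all s_i > 0)."""
    if not 0.0 < u1 < norm_sq_m(0.0, s, x2):
        raise ValueError("need 0 < u1 < ||x2||^2")
    lo, hi = 0.0, 1.0
    while norm_sq_m(hi, s, x2) > u1:  # enlarge the bracket until it holds the root
        lo, hi = hi, 2.0 * hi
    for _ in range(200):  # the left-hand side is strictly decreasing in lam
        mid = 0.5 * (lo + hi)
        if norm_sq_m(mid, s, x2) > u1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def g(u1: float, s: List[float], x2: List[float]) -> Tuple[float, float]:
    """Return (g(u1), lam), where g(u1) = (m - x2)^T S2^{-1} (m - x2)."""
    lam = solve_lambda(u1, s, x2)
    m = [xi / (1.0 + lam * si) for si, xi in zip(s, x2)]  # minimizer from Eq. (15)
    value = sum((mi - xi) ** 2 / si for mi, si, xi in zip(m, s, x2))
    return value, lam


# Hypothetical example in dimension d = 2.
s, x2, u1, h = [2.0, 0.5], [1.0, 2.0], 2.0, 1e-5
value, lam = g(u1, s, x2)
slope = (g(u1 + h, s, x2)[0] - g(u1 - h, s, x2)[0]) / (2.0 * h)
print(lam, slope)  # slope should be close to -lam
```

Restricting \(S_2\) to a diagonal matrix keeps the sketch free of linear algebra libraries; for a general positive definite \(S_2\) the same bisection applies with a linear solve in place of the componentwise division.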

Proof of Theorem 3.1

By Bolzano–Weierstrass, the sequence of minimizing vertices of the cutting plane algorithm has a cluster point in the compact region described by the vertices B, 0, and A in Fig. 1. The construction of the polygons excludes every point u below the graph of g as a cluster point of the minimizers on the vertices. Hence all cluster points lie on the graph of g. Denoting by \(h^*\) the minimum of h there and by \((\widehat{u}_{t_k})\) a subsequence of minimizing vertices converging to a cluster point \((u_1,u_2)\), we estimate

$$\begin{aligned} h^* \le h(u_1,u_2) = h(\lim _k\widehat{u}_{t_k}) = \lim _kh(\widehat{u}_{t_k}) \le h^*. \end{aligned}$$

The last inequality follows from the fact that the concave function h assumes its minimum on the convex set described by the polygon at a vertex. Hence any cluster point \((u_1,u_2)\) of minimizing vertices minimizes h on the graph of g and we have also proved part (b). \(\square \)
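The vertex property of concave functions invoked in the proof can be made explicit. In the following short argument the symbols P, \(v_1,\dots ,v_r\), and \(\alpha _i\) are notation introduced here for illustration: P denotes a convex polygon with vertices \(v_1,\dots ,v_r\). Every \(u\in P\) is a convex combination \(u=\sum _{i=1}^r\alpha _iv_i\) with \(\alpha _i\ge 0\) and \(\sum _{i=1}^r\alpha _i=1\), and concavity of h yields

$$\begin{aligned} h(u) = h\Big (\sum _{i=1}^r\alpha _iv_i\Big ) \ge \sum _{i=1}^r\alpha _ih(v_i) \ge \min _{1\le i\le r}h(v_i). \end{aligned}$$

Hence the minimum of h over P is attained at one of the vertices, so that it suffices to evaluate h at finitely many points in each step.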

Cite this article

Gallegos, M.T., Ritter, G. Probabilistic clustering via Pareto solutions and significance tests. Adv Data Anal Classif 12, 179–202 (2018). https://doi.org/10.1007/s11634-016-0278-2

  • DOI: https://doi.org/10.1007/s11634-016-0278-2
