A new mixture model on the simplex

Abstract

This paper introduces a significant extension of the flexible Dirichlet (FD) distribution, a tractable special mixture model for compositional data, i.e. data representing vectors of proportions of a whole. The FD model has several theoretical properties that make it suitable for inference and fairly easy to handle computationally. However, the rigid mixture structure implied by the FD makes it unsuitable for describing many compositional datasets. Furthermore, the FD only allows for negative correlations. The new extended model considerably relaxes the strict constraints among clusters entailed by the FD; it thereby allows for a more general dependence structure (including positive correlations) and has much wider applicability. At the same time, it retains, to a large extent, the good properties of the FD. EM-type estimation procedures can be developed for this more complex model, including reliable ad hoc initialization methods, which keep the computational issues at a fairly simple level. Accurate standard error estimates can be provided as well.

Funding

This study was partially funded by Università degli Studi di Milano-Bicocca (Grant No. FA 2018).

Author information

Corresponding author

Correspondence to Sonia Migliorati.

Appendix

Proof of Proposition 3

The conditional distribution function of \({\varvec{S}}_1\mid {\varvec{X}}_{2}={\varvec{x}}_{2}\) can be derived most easily by conditioning on \({\varvec{Z}}\):

$$\begin{aligned} \begin{aligned} F_{{\varvec{S}}_1\mid {\varvec{X}}_{2}={\varvec{x}}_{2}}({\varvec{s}}_1)=&\sum _{i\le D} F_{{\varvec{S}}_1\mid {\varvec{X}}_{2}={\varvec{x}}_{2},{\varvec{Z}}={\varvec{e}}_i}({\varvec{s}}_1)\\&\cdot P({\varvec{Z}}={\varvec{e}}_i\mid {\varvec{X}}_{2}={\varvec{x}}_{2}). \end{aligned} \end{aligned}$$
(35)

Given that \({\varvec{X}}\mid {\varvec{Z}}={\varvec{e}}_i\sim \mathcal{D}({\dot{\varvec{\alpha }}}_i)\), by using well-known Dirichlet independence properties we have that:

$$\begin{aligned} {\varvec{S}}_1 | {\varvec{X}}_{2}={\varvec{x}}_{2},{\varvec{Z}}={\varvec{e}}_i\sim {\varvec{S}}_1\mid {\varvec{Z}}={\varvec{e}}_i. \end{aligned}$$

Recalling that the Dirichlet distribution is closed under the operation of subcomposition, it follows that:

$$\begin{aligned} {\varvec{S}}_1 | {\varvec{Z}}={\varvec{e}}_i\;\sim \;\mathcal{D}(\alpha _1,\ldots ,\alpha _i+\tau _i,\ldots ,\alpha _k),\quad i\le k \end{aligned}$$

and

$$\begin{aligned} {\varvec{S}}_1 | {\varvec{Z}}={\varvec{e}}_i\;\sim \;\mathcal{D}(\alpha _1,\ldots ,\alpha _i,\ldots ,\alpha _k),\quad i> k. \end{aligned}$$

The probabilities \(P({\varvec{Z}}={\varvec{e}}_i\mid {\varvec{X}}_{2}={\varvec{x}}_{2})\) can be computed by Bayes' theorem. In particular, the distribution of \(({\varvec{X}}_{2},1-X_2^+)^\intercal \mid {\varvec{Z}}={\varvec{e}}_i\) follows from the closure of the Dirichlet under marginalization; it takes the form

$$\begin{aligned} ({\varvec{X}}_{2},1-X_2^+)^\intercal | {\varvec{Z}}={\varvec{e}}_i\;\sim \;\mathcal{D}(\alpha _{k+1},\ldots ,\alpha _D,\alpha _1^++\tau _i) \end{aligned}$$

if \(i\le k\) and

$$\begin{aligned} ({\varvec{X}}_{2},1-X_2^+)^\intercal | {\varvec{Z}}={\varvec{e}}_i\;\sim \;\mathcal{D}(\alpha _{k+1},\ldots ,\alpha _i+\tau _i,\ldots ,\alpha _D,\alpha _1^+) \end{aligned}$$

if \(i> k\). Applying Bayes' formula, some algebra shows that the probabilities \(P({\varvec{Z}}={\varvec{e}}_i\mid {\varvec{X}}_{2}={\varvec{x}}_{2})\) are proportional to the \(p_{i}^{'}\)’s given in (14). Plugging all the computed quantities into (35) yields the result.
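The Bayes step of this proof can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes 0-based indices, that the first `k` parts form \({\varvec{X}}_1\), and that the mixing weights `p`, the vector `alpha` and the vector `tau` are given. Since expression (14) is not reproduced in this appendix, the sketch evaluates Bayes' formula directly from the two Dirichlet forms stated above. The function names `dirichlet_logpdf` and `posterior_weights` are hypothetical.

```python
from math import lgamma, log, exp

def dirichlet_logpdf(x, a):
    """Log-density of a Dirichlet(a) at the composition x (entries sum to 1)."""
    return (lgamma(sum(a)) - sum(lgamma(ai) for ai in a)
            + sum((ai - 1.0) * log(xi) for ai, xi in zip(a, x)))

def posterior_weights(x2, alpha, tau, p, k):
    """P(Z = e_i | X_2 = x2) for every i, via Bayes' theorem.

    x2 holds the last D - k parts; the first k parts are collapsed
    into the remainder 1 - sum(x2), as in the proof.
    """
    D = len(alpha)
    alpha1_plus = sum(alpha[:k])            # alpha_1^+ in the text
    point = list(x2) + [1.0 - sum(x2)]      # (x_2, 1 - x_2^+)
    log_terms = []
    for i in range(D):
        a = list(alpha[k:]) + [alpha1_plus]
        if i < k:
            a[-1] += tau[i]                 # tau_i is absorbed by the remainder
        else:
            a[i - k] += tau[i]              # tau_i is added to part i of X_2
        if p[i] > 0:
            log_terms.append(log(p[i]) + dirichlet_logpdf(point, a))
        else:
            log_terms.append(float("-inf"))
    m = max(log_terms)                      # subtract the max for stability
    w = [exp(t - m) for t in log_terms]
    s = sum(w)
    return [wi / s for wi in w]

# Illustrative values (assumed, not taken from the paper).
alpha = [2.0, 3.0, 4.0, 5.0]
tau = [1.5, 1.5, 2.0, 0.5]
p = [0.1, 0.3, 0.4, 0.2]
weights = posterior_weights([0.3, 0.2], alpha, tau, p, k=2)
```

In this toy setting the first two clusters share the same \(\tau _i\) and both belong to the first block, so their conditional densities coincide and their posterior weights keep the prior ratio \(p_1/p_2\).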

Proof of Proposition 5

It is obvious that if \(\varvec{\theta }=\varvec{\theta }^\prime \), then \(\mathbf{X } \sim \mathbf{X }^\prime \). In order to show the converse, one can focus on the marginal distribution of \(X_i\). By virtue of Proposition 3, we can write its density function \(g(x_i; \varvec{\theta })\) as:

$$\begin{aligned} \begin{aligned} g(x_i; \varvec{\theta })&= x_i^{\alpha _i - 1} (1-x_i)^{\alpha ^+-\alpha _i -1} \\&\cdot \left\{ p_i \frac{\Gamma (\alpha ^++ \tau _i) x_i^{\tau _i}}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)} \right. \\&+\, \left. \sum _{l\ne i} p_l \frac{\Gamma (\alpha ^++ \tau _l)(1-x_i)^{\tau _l} }{\Gamma (\alpha _i)\Gamma (\alpha ^+- \alpha _i + \tau _l)} \right\} . \end{aligned} \end{aligned}$$
(36)

If \(\mathbf{X } \sim \mathbf{X }^\prime \), then \(X_i \sim X_i^\prime \) and therefore \(g(x_i; \varvec{\theta }) = g(x_i; \varvec{\theta }^\prime )\) \(\forall x_i \in (0, 1)\), as these density functions are continuous. It follows that \(\displaystyle \lim \limits _{x_i \rightarrow 0^+} \frac{g(x_i; \varvec{\theta })}{x_i^{\alpha _i - 1}} = \lim \limits _{x_i \rightarrow 0^+} \frac{g(x_i; \varvec{\theta }^\prime )}{x_i^{\alpha _i - 1}}\). We have:

$$\begin{aligned} \displaystyle \lim \limits _{x_i \rightarrow 0^+} \frac{g(x_i; \varvec{\theta })}{x_i^{\alpha _i - 1}} = \sum _{l \ne i} \frac{p_l \Gamma (\alpha ^++ \tau _l)}{\Gamma (\alpha _i)\Gamma (\alpha ^+-\alpha _i+\tau _l)} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} \lim \limits _{x_i \rightarrow 0^+} \frac{g(x_i; \varvec{\theta }^\prime )}{x_i^{\alpha _i - 1}}&= \left( \lim \limits _{x_i \rightarrow 0^+} \frac{x_i^{\alpha _i^\prime - 1}}{x_i^{\alpha _i - 1}} \right) \\&\cdot \sum _{l \ne i} \frac{p_l^\prime \Gamma (\alpha ^{\prime +} + \tau _l^\prime )}{\Gamma (\alpha _i^\prime )\Gamma (\alpha ^{\prime +}-\alpha _i^\prime +\tau _l^\prime )}. \end{aligned} \end{aligned}$$

For these two limits to be equal, the quantity \(\displaystyle \left( \lim _{x_i \rightarrow 0^+} \frac{x_i^{\alpha _i^\prime - 1}}{x_i^{\alpha _i - 1}}\right) \) must be finite and different from 0, which requires \(\alpha _i=\alpha _i^\prime \).

Since \(i\) is arbitrary, this implies that \(\varvec{\alpha }=\varvec{\alpha }^\prime \). As a consequence, the equality \(g(x_i; \varvec{\theta }) = g(x_i; \varvec{\theta }^\prime )\) can be rewritten as:

$$\begin{aligned}&\, \frac{p_i \Gamma (\alpha ^++ \tau _i)x_i^{\tau _i}}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)} + \sum _{l\ne i} \frac{ p_l \Gamma (\alpha ^++ \tau _l)(1-x_i)^{\tau _l}}{\Gamma (\alpha _i)\Gamma (\alpha ^+- \alpha _i + \tau _l)} = \nonumber \\&\quad = \frac{p_i^\prime \Gamma (\alpha ^++ \tau _i^\prime )x_i^{\tau _i^\prime }}{\Gamma (\alpha _i + \tau _i^\prime ) \Gamma (\alpha ^+-\alpha _i)} + \sum _{l\ne i} \frac{p_l^\prime \Gamma (\alpha ^++ \tau _l^\prime )(1-x_i)^{\tau _l^\prime }}{\Gamma (\alpha _i)\Gamma (\alpha ^+- \alpha _i + \tau _l^\prime )}.\nonumber \\ \end{aligned}$$
(37)

By taking the limits as \(x_i \rightarrow 1^-\) on both sides, one obtains:

$$\begin{aligned} \frac{p_i \Gamma (\alpha ^++ \tau _i)}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)} = \frac{p_i^\prime \Gamma (\alpha ^++ \tau _i^\prime )}{\Gamma (\alpha _i + \tau _i^\prime ) \Gamma (\alpha ^+-\alpha _i)}. \end{aligned}$$
(38)

Equation (38) implies that \(p_i\) and \( p_i^\prime \) are either both zero or both strictly positive. In the former case, because of the parameter space definition, \( \tau _i=\tau _i^\prime =1\). In the latter case, plugging (38) into equality (37) and differentiating both sides with respect to \(x_i\), the following equality must hold \(\forall x_i \in (0, 1)\):

$$\begin{aligned} \begin{aligned}&\, \frac{p_i \tau _i \Gamma (\alpha ^++ \tau _i) x_i^{\tau _i-1}}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)}\\&\qquad - \sum _{l\ne i} \frac{p_l \tau _l \Gamma (\alpha ^++ \tau _l)(1-x_i)^{\tau _l-1}}{\Gamma (\alpha _i)\Gamma (\alpha ^+- \alpha _i + \tau _l)} = \\&\quad = \frac{ p_i \tau _i^\prime \Gamma (\alpha ^++ \tau _i)x_i^{\tau _i^\prime -1}}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)}\\&\qquad -\sum _{l\ne i} \frac{p_l^\prime \tau _l^\prime \Gamma (\alpha ^++ \tau _l^\prime )(1-x_i)^{\tau _l^\prime -1}}{\Gamma (\alpha _i)\Gamma (\alpha ^+- \alpha _i + \tau _l^\prime )}. \end{aligned} \end{aligned}$$
(39)

Taking the limits as \(x_i \rightarrow 1^-\) on both sides, we have:

$$\begin{aligned} \frac{p_i \tau _i \Gamma (\alpha ^++ \tau _i)}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)} = \frac{p_i \tau _i ^\prime \Gamma (\alpha ^++ \tau _i)}{\Gamma (\alpha _i + \tau _i) \Gamma (\alpha ^+-\alpha _i)}. \end{aligned}$$
(40)

It follows that \(\tau _i=\tau _i^\prime \) for any \(i\) such that \(p_i>0\) and hence, together with the case \(p_i=0\) treated above, for all \(i\). Finally, substituting this constraint in (38), it is possible to conclude that \(\mathbf{p }= \mathbf{p }^\prime \).
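The marginal density (36) used in this proof is a finite mixture of Beta densities, which makes it easy to check numerically. The sketch below is an illustration under assumed parameter values, not code from the paper: it evaluates (36), verifies that it integrates to one with a simple midpoint rule, and compares the ratio \(g(x_i; \varvec{\theta })/x_i^{\alpha _i - 1}\) near zero with the closed-form limit stated above. All function names are hypothetical and indices are 0-based.

```python
from math import lgamma, log, exp

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1)."""
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b)
               + (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x))

def marginal_density(x, i, alpha, tau, p):
    """Density (36) of X_i, written as a mixture of Beta densities."""
    a_plus = sum(alpha)
    rest = a_plus - alpha[i]
    # Cluster i shifts the first Beta parameter by tau_i ...
    total = p[i] * beta_pdf(x, alpha[i] + tau[i], rest)
    # ... every other cluster l shifts the second one by tau_l.
    for l in range(len(alpha)):
        if l != i:
            total += p[l] * beta_pdf(x, alpha[i], rest + tau[l])
    return total

def limit_at_zero(i, alpha, tau, p):
    """Closed-form limit of g(x_i) / x_i**(alpha_i - 1) as x_i -> 0+."""
    a_plus = sum(alpha)
    return sum(p[l] * exp(lgamma(a_plus + tau[l]) - lgamma(alpha[i])
                          - lgamma(a_plus - alpha[i] + tau[l]))
               for l in range(len(alpha)) if l != i)

# Illustrative values (assumed, not taken from the paper).
alpha = [2.0, 3.0, 4.0]
tau = [1.5, 0.5, 2.0]
p = [0.2, 0.5, 0.3]

# Midpoint-rule check that the density integrates to one.
n = 20000
total_mass = sum(marginal_density((j + 0.5) / n, 0, alpha, tau, p)
                 for j in range(n)) / n

# Ratio near zero versus the closed-form limit.
x_small = 1e-8
ratio = marginal_density(x_small, 0, alpha, tau, p) / x_small ** (alpha[0] - 1.0)
limit = limit_at_zero(0, alpha, tau, p)
```

Dividing by \(x_i^{\alpha _i - 1}\) isolates the leading power of the density at the origin, which is why the comparison of the two limits pins down \(\alpha _i\).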

Proof of Proposition 8

Recall that \(\mathbf{X }\mid Y^+ = y^+ \sim EFD(\varvec{\alpha },\mathbf{p }^*(y^+),\varvec{\tau }, \beta )\), where \(\mathbf{p }^*(y^+)\) is defined as in (23). Then, if \(\tau _i=\tau \) \(\forall i\), it can be seen immediately that the \(p_i^*(y^+)\)’s are independent of \(y^+\) (and coincide with the \(p_i\)’s). Conversely, if the basis is compositionally invariant, then \(p_i^*(y^+)\) does not depend on \(y^+\), and therefore neither does the ratio \(p_i^*(y^+)/p_l^*(y^+)\), \(\forall i\ne l\). Because this ratio is proportional to \({(y^+)}^{\tau _i-\tau _l}\), it can be constant in \(y^+\) only if the exponent vanishes, i.e. \(\tau _i=\tau _l\) \(\forall i\ne l\).

Partial derivatives

In this section we show the partial derivatives of the complete-data log-likelihood (25). In particular, for \(i=1, \ldots ,D\), the first-order partial derivatives are:

$$\begin{aligned} \frac{\partial l_c(\varvec{\theta })}{\partial p_i} = \frac{z_{\cdot i}}{p_i} - \frac{z_{\cdot D}}{p_D}, \end{aligned}$$

where \(z_{\cdot i}=\sum _{j=1}^n z_{ji}\).

$$\begin{aligned} \frac{\partial l_c(\varvec{\theta })}{\partial \alpha _i}&= \left( \sum _{l=1}^D z_{\cdot l} \psi (\alpha ^++ \tau _l)\right) + \sum _{j=1}^n \log x_{ji} + z_{\cdot i} \left( \psi (\alpha _i) - \psi (\alpha _i + \tau _i) \right) - n \psi (\alpha _i),\\ \frac{\partial l_c(\varvec{\theta })}{\partial \tau _i}&= z_{\cdot i} \left( \psi (\alpha ^++ \tau _i) - \psi (\alpha _i + \tau _i)\right) + \sum _{j=1}^n z_{ji} \log x_{ji}, \end{aligned}$$

where \(\psi (\cdot )\) is the digamma function.

Table 7: Goodness-of-fit measures for two-part compositions

Fig. 9: Histograms and estimated densities of the eight two-part compositions

The second-order partial derivatives are:

$$\begin{aligned} \frac{\partial ^2 l_c(\varvec{\theta })}{\partial p_i \partial p_h} = - \frac{z_{\cdot D}}{p_D^2} - \mathbb {1}_{i=h} \cdot \frac{z_{\cdot i}}{p_i^2}, \end{aligned}$$

where \(\mathbb {1}_{i=h}\) is the indicator function that is equal to 1 if \(i = h\) and 0 otherwise.

$$\begin{aligned} \frac{\partial ^2 l_c(\varvec{\theta })}{\partial p_i \partial \alpha _h}&= \frac{\partial ^2 l_c(\varvec{\theta })}{\partial p_i \partial \tau _h} = 0,\\ \frac{\partial ^2 l_c(\varvec{\theta })}{\partial \alpha _i \partial \alpha _h}&= \left( \sum _{l=1}^D z_{\cdot l} \psi ^\prime (\alpha ^++ \tau _l) \right) - \mathbb {1}_{i=h} \, n\psi ^\prime (\alpha _i) + \mathbb {1}_{i=h} \cdot \left[ z_{\cdot i} \left( \psi ^\prime (\alpha _i) - \psi ^\prime (\alpha _i + \tau _i)\right) \right] , \end{aligned}$$

where \(\psi ^\prime (\cdot )\) is the trigamma function.

$$\begin{aligned} \frac{\partial ^2 l_c(\varvec{\theta })}{\partial \alpha _i \partial \tau _h}&= z_{\cdot h} \psi ^\prime (\alpha ^++ \tau _h) - \mathbb {1}_{i=h} z_{\cdot i} \psi ^\prime (\alpha _i + \tau _i),\\ \frac{\partial ^2 l_c(\varvec{\theta })}{\partial \tau _i \partial \tau _h}&= \mathbb {1}_{i=h} \left[ z_{\cdot i} \left( \psi ^\prime (\alpha ^++ \tau _i) - \psi ^\prime (\alpha _i + \tau _i) \right) \right] . \end{aligned}$$
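The first-order derivatives with respect to \(\alpha _i\) and \(\tau _i\) given in this section can be checked against finite differences. The sketch below is an illustration only. Expression (25) is not reproduced in this appendix, so the code assumes that, given \({\varvec{Z}}={\varvec{e}}_l\), the composition is Dirichlet with parameter \(\varvec{\alpha }\) plus \(\tau _l\) added to its \(l\)-th entry, consistent with the conditional distributions used in the proof of Proposition 3. The digamma routine is a simple hand-written approximation, and all names and data values are hypothetical.

```python
from math import lgamma, log

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))

def complete_loglik(alpha, tau, p, x, z):
    """Complete-data log-likelihood under the assumption stated in the lead-in."""
    a_plus = sum(alpha)
    total = 0.0
    for xj, zj in zip(x, z):
        for l in range(len(alpha)):
            if zj[l] == 0:
                continue
            a = list(alpha)
            a[l] += tau[l]                  # parameter vector of cluster l
            total += zj[l] * (log(p[l]) + lgamma(a_plus + tau[l])
                              - sum(lgamma(ah) for ah in a)
                              + sum((ah - 1.0) * log(xh) for ah, xh in zip(a, xj)))
    return total

def grad_alpha_tau(alpha, tau, x, z):
    """Analytic first-order derivatives, transcribed from the formulas above."""
    D, n = len(alpha), len(x)
    a_plus = sum(alpha)
    z_dot = [sum(zj[l] for zj in z) for l in range(D)]
    common = sum(z_dot[l] * digamma(a_plus + tau[l]) for l in range(D))
    d_alpha, d_tau = [], []
    for i in range(D):
        d_alpha.append(common + sum(log(xj[i]) for xj in x)
                       + z_dot[i] * (digamma(alpha[i]) - digamma(alpha[i] + tau[i]))
                       - n * digamma(alpha[i]))
        d_tau.append(z_dot[i] * (digamma(a_plus + tau[i]) - digamma(alpha[i] + tau[i]))
                     + sum(zj[i] * log(xj[i]) for zj, xj in zip(z, x)))
    return d_alpha, d_tau

def numeric_partial(f, params, i, h=1e-5):
    """Central finite difference of f with respect to params[i]."""
    up, down = list(params), list(params)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2.0 * h)

# Toy data (assumed): four compositions with hard cluster labels.
x = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3], [0.25, 0.25, 0.5], [0.1, 0.7, 0.2]]
z = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
alpha, tau, p = [2.0, 3.0, 1.5], [1.0, 2.5, 0.7], [0.5, 0.25, 0.25]

d_alpha, d_tau = grad_alpha_tau(alpha, tau, x, z)
errors = []
for i in range(len(alpha)):
    num_a = numeric_partial(lambda a: complete_loglik(a, tau, p, x, z), alpha, i)
    num_t = numeric_partial(lambda t: complete_loglik(alpha, t, p, x, z), tau, i)
    errors += [abs(d_alpha[i] - num_a), abs(d_tau[i] - num_t)]
max_err = max(errors)
```

A small `max_err` indicates that the analytic expressions and the assumed log-likelihood agree. In practice a library digamma routine would normally replace the hand-written one.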

Results of the univariate case of the olive oil dataset

In this section we report the AIC and BIC criteria for the considered models (Table 7), and the fitted density curves (Fig. 9).

Cite this article

Ongaro, A., Migliorati, S. & Ascari, R. A new mixture model on the simplex. Stat Comput 30, 749–770 (2020). https://doi.org/10.1007/s11222-019-09920-x

  • DOI: https://doi.org/10.1007/s11222-019-09920-x

Keywords

  • Proportion
  • Dirichlet mixture
  • EM-type algorithms
  • Multi-modality
  • Compositional invariance