Abstract
The simplex is the geometrical locus of D-dimensional positive data with constant sum, called compositions. A possible distribution for compositions is the Dirichlet. In Dirichlet models, there are no scale parameters and the D shapes are assumed dependent on auxiliary variables. This peculiar feature makes Dirichlet models difficult to apply and to interpret. Here, we propose a generalization of the Dirichlet, called the simplicial generalized Beta (SGB) distribution. It includes an overall shape parameter, a scale composition and the D Dirichlet shapes. The SGB is flexible enough to accommodate many practical situations. SGB regression models are applied to data from the United Kingdom Time Use Survey. The R-package SGB makes the methods accessible to users.
Appendices
Proofs
Proof of Theorem 1
1. \(\{a_k,\, k=1,\ldots ,D\}\) not constant implies dependence on \(\theta\).
Making the change of variables defined by \(t = \sum _{j=1}^D y_j\) and \(u_k = y_k/t\), \(k=1,\ldots ,D-1\), and setting \(u_D=1-\sum _{j=1}^{D-1} u_j\), we obtain
$$\begin{aligned}&f({\mathbf {u}},t | \theta ) = \prod _{k=1}^D \left[ \frac{a_k}{\Gamma (p_k) \theta ^{1/a_k}b_k}\left( \frac{t u_k}{\theta ^{1/a_k}b_k}\right) ^{a_k p_k-1} \exp \left( -\left( \frac{t u_k}{\theta ^{1/a_k}b_k}\right) ^{a_k}\right) \right] t^{D-1} \nonumber \\&= \left[ \prod _{k=1}^D \frac{a_k}{\Gamma (p_k) b_k }\left( \frac{u_k}{b_k}\right) ^{a_k p_k-1}\right] \exp \left[ -\sum _{k=1}^D \left( \frac{t}{\theta ^{1/a_k}}\frac{u_k}{b_k}\right) ^{a_k}\right] \prod _{k=1}^D \left( \frac{t}{\theta ^{1/a_k}}\right) ^{a_kp_k } \frac{1}{t} \nonumber \\&=f({\mathbf {u}}|\theta )f(t|{\mathbf {u}},\theta ). \end{aligned}$$
We want to find the constant of integration \(C\) such that
$$\begin{aligned} C\int _0^{\infty }f(t|{\mathbf {u}},\theta ) dt&= \int _0^{\infty } \exp \left[ -\sum _{k=1}^D \left( \frac{t}{\theta ^{1/a_k}}\frac{u_k}{b_k}\right) ^{a_k}\right] \prod _{k=1}^D \left( \frac{t}{\theta ^{1/a_k}}\right) ^{a_kp_k } \frac{1}{t} dt \\&= \int _0^{\infty } \exp \left[ -\theta ^{-1}\sum _{k=1}^D \left( \frac{t\,u_k}{b_k}\right) ^{a_k}\right] \theta ^{-P} \prod _{k=1}^D t^{a_kp_k } \frac{1}{t} dt, \end{aligned}$$
where \(P=\sum _{k=1}^D p_k\). If the parameters \(a_k\) are not all equal, no single rescaling of \(t\) absorbs \(\theta\) in every term of the exponent, so the result still depends on \(\theta\). In this case the distribution of the composition depends on the mixing scheme.
2. \(\{a_k,\, k=1,\ldots ,D\}\) constant implies independence of \(\theta\).
If \(a_k=a\) for all \(k=1,\ldots ,D\), \(f(t|{\mathbf {u}},\theta )\) is easily integrated. Setting
$$\begin{aligned}c_k= (u_k/b_k)^{a}, \quad v=\left( \sum _{k=1}^D c_k\right) \frac{t^{a}}{\theta } \text { and } dv=a\left( \sum _{k=1}^D c_k\right) \frac{t^{a}}{\theta }\frac{1}{t}dt, \end{aligned}$$
we have
$$\begin{aligned} C\int _0^{\infty }f(t|{\mathbf {u}},\theta ) dt&= \frac{\Gamma (P)}{a\left( \sum _{k=1}^D (u_k/b_k)^{a}\right) ^P}. \end{aligned} \tag{16}$$
Thus the constant \(C\) in Eq. (16) does not depend on \(\theta\), and hence neither does the distribution of the composition.
\(\square\)
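The closed form in Eq. (16) can be checked numerically. The following is a minimal sketch, not part of the original proof: it integrates the \(\theta\)-dependent integrand of the constant-\(a\) case with a trapezoid rule after the substitution \(t=e^{s}\), and compares the result with \(\Gamma (P)/\bigl(a(\sum _k c_k)^P\bigr)\) for several values of \(\theta\). The parameter values, the integration range and the function names are illustrative assumptions.

```python
import math

def integral_over_t(theta, a, c_sum, P, s_min=-30.0, s_max=10.0, n_steps=40000):
    """Trapezoid rule for the integral over t in the constant-a case.

    Integrand in t: exp(-c_sum * t**a / theta) * theta**(-P) * t**(a*P) / t.
    With t = exp(s) we have dt / t = ds, so the 1/t factor disappears.
    The range [s_min, s_max] is an assumption that suits the values used below.
    """
    h = (s_max - s_min) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        s = s_min + i * h
        log_value = -c_sum * math.exp(a * s) / theta + a * P * s - P * math.log(theta)
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * math.exp(log_value)
    return total * h

def closed_form(a, c_sum, P):
    """Right-hand side of Eq. (16); theta does not appear."""
    return math.gamma(P) / (a * c_sum ** P)

# Illustrative (hypothetical) parameter values with D = 3 parts.
a = 2.0
b = [0.2, 0.3, 0.5]
p = [1.5, 2.0, 1.0]
u = [0.5, 0.3, 0.2]
P = sum(p)
c_sum = sum((uk / bk) ** a for uk, bk in zip(u, b))

target = closed_form(a, c_sum, P)
ratios = [integral_over_t(theta, a, c_sum, P) / target for theta in (0.5, 1.0, 3.0)]
print(ratios)  # each ratio should be close to 1, whatever the value of theta
```

A ratio near 1 for every \(\theta\) is what item 2 of the proof predicts; the same experiment with unequal \(a_k\) has no closed form to compare against.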
Proof of Theorem 2
Taking the density \(f_{{\mathbf {U}}}({\mathbf {u}}_{-D})\) as expressed in Eq. (4), we see that the kernel is, up to a constant factor,
and cannot be put into the form \(K({\mathbf {u}}_{-D})=h\left( \sum _{j=1}^{D-1} (u_j/b_j)^{\beta _j}\right)\), except if \(a=b_j=1\), in which case it reduces to \(K({\mathbf {u}}_{-D};\,a=1,b_j=1,j=1,\ldots ,D) \propto \left[ 1-\sum _{j=1}^{D-1}u_j\right] ^{p_D-1}.\)
Thus \(h(x;\,a=1,b_j=1,j=1,\ldots ,D)=(1-x)^{p_D-1}.\) \(\square\)
Proof of Theorem 3
Without loss of generality, suppose that \(J=\{1,\ldots ,r\}\), where \(2\le r<D-1\). Consider the following change of variables: \(x = \sum _{j=1}^r u_j\); \(v_k = u_k/x\) if \(1 \le k \le r-1\); \(w_k = u_k/(1-x)\) if \(r+1 \le k \le D-1\). The Jacobian is \(x^{r-1}(1-x)^{D-r-1}\).
Let \({\mathbf {b}}_1=(b_1,\ldots ,b_r)\) and \({\mathbf {b}}_2=(b_{r+1},\ldots ,b_D)\). We have \(\left( \| {\mathbf {u}}/{\mathbf {b}}\| _a\right) ^a=x^a\left( \| {\mathbf {v}}/{\mathbf {b}}_1\| _a\right) ^a+(1-x)^a \left( \| {\mathbf {w}}/{\mathbf {b}}_2\| _a\right) ^a.\) Making the above change of variables in Eq. (5) and setting \(P_1=\sum _{j=1}^r p_j\) and \(P_2=\sum _{j=r+1}^D p_j,\) we obtain, rearranging terms
Thus the amalgamation is, conditionally on the two sub-compositions,
\(SGB\left( a,\{(\| {\mathbf {v}}/{\mathbf {b}}_1\| _a^{-1},P_1), (\| {\mathbf {w}}/{\mathbf {b}}_2\| _a^{-1},P_2) \}\right) ,\) with conditional density
The constant of integration does not involve \({\mathbf {v}}\) and \({\mathbf {w}}\). Thus the two sub-compositions \({\mathbf {V}}\) and \({\mathbf {W}}\) are independent SGB.
1. The densities of \({\mathbf {V}}\) and \({\mathbf {W}}\) are given in the first two rows of Eq. (17). The independence follows directly from Eq. (18).
2. The conditional distribution of \((X|{\mathbf {V}},{\mathbf {W}})\) is given in Eq. (19).
3. The conditional expectation of \(\log (X/(1-X))\) is a direct application of Eq. (21).
4. The expression for \({\mathrm {{E}}}_A(X | {\mathbf {V}}={\mathbf {v}}, {\mathbf {W}}={\mathbf {w}})\) is an application of Eq. (8) to the density in Eq. (19).
\(\square\)
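The norm decomposition used in the proof of Theorem 3 can be verified on a concrete composition. The sketch below assumes that \(\| {\mathbf {z}}\| _a=(\sum _j z_j^a)^{1/a}\), a definition that is not restated in this appendix, and uses arbitrary illustrative values for \(a\), \({\mathbf {u}}\) and \({\mathbf {b}}\); the helper name `norm_a_pow` is hypothetical.

```python
def norm_a_pow(z, a):
    """Return (||z||_a)**a = sum_j z_j**a, assuming ||z||_a = (sum_j z_j**a)**(1/a)."""
    return sum(zj ** a for zj in z)

# Illustrative values: D = 5 parts, first r = 2 parts form the first sub-composition.
a = 1.7
r = 2
u = [0.10, 0.25, 0.30, 0.20, 0.15]   # a composition: positive parts summing to 1
b = [0.30, 0.10, 0.20, 0.25, 0.15]   # scale composition

# Change of variables from the proof: amalgamation x and the two sub-compositions.
x = sum(u[:r])
v = [uk / x for uk in u[:r]]
w = [uk / (1.0 - x) for uk in u[r:]]

lhs = norm_a_pow([uk / bk for uk, bk in zip(u, b)], a)
rhs = (x ** a) * norm_a_pow([vk / bk for vk, bk in zip(v, b[:r])], a) \
    + ((1.0 - x) ** a) * norm_a_pow([wk / bk for wk, bk in zip(w, b[r:])], a)

print(lhs, rhs)  # the two values should agree up to rounding error
```

The identity holds term by term because \(u_k=x\,v_k\) for \(k\le r\) and \(u_k=(1-x)\,w_k\) for \(k>r\), so the factors \(x^a\) and \((1-x)^a\) come out of the two partial sums.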
Moments of ratios and log-ratios of parts
1. It is equivalent to compute moments of ratios and log-ratios of parts from the distribution of the composition \({\mathbf {U}}\) or from the initial vector \({\mathbf {Y}}\), because \(U_i/U_j = Y_i/Y_j\) for all \(i,j=1,\ldots ,D\).
The mixed moments of the random vector \({\mathbf {Y}}\) are given by \(M_{{\mathbf {Y}}}(t_1,\ldots ,t_D)= \mathrm {{E}}\left( Y_1^{t_1} \ldots Y_{D}^{t_D}\right) .\)
Set \(t_+=\sum _{j=1}^{D-1} t_j\). Then the mixed moment ratios of the random composition following an SGB distribution \({{SGB(a, \{b_j,p_j\})}},\) \(j=1,\ldots ,D\) are given by the corresponding moment of a product of generalized Gamma random variables, namely,
2. The function \(M_{{\mathbf {U}}}\) in Eq. (20) is the moment generating function of the log-ratios of parts. By taking the first and second derivative of \(M_{{\mathbf {U}}}(t{\mathbf {e}}_i)\) at \(t=0\), Eqs. (21) and (22) are obtained.
Distinct pairs of log-ratios of parts are uncorrelated. The technique can be readily applied to log-ratio transforms of any kind. From Eq. (21), we recover Eq. (8).
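The equivalence between moments computed from \({\mathbf {U}}\) and from \({\mathbf {Y}}\) suggests a simple Monte Carlo check of the first log-ratio moment. The sketch below is an illustration under stated assumptions rather than the paper's derivation: it represents each part as \(Y_k=b_k G_k^{1/a}\) with \(G_k\) a standard Gamma variable of shape \(p_k\) (which matches the generalized Gamma density in the proof of Theorem 1 with \(\theta =1\)), so that \(\mathrm {E}\log (U_i/U_D)=\log (b_i/b_D)+(\psi (p_i)-\psi (p_D))/a\). The digamma function is approximated by a central difference of `math.lgamma`; parameter values, seed and sample size are arbitrary.

```python
import math
import random

def digamma(p, h=1e-5):
    """Central-difference approximation of the digamma function from math.lgamma."""
    return (math.lgamma(p + h) - math.lgamma(p - h)) / (2.0 * h)

def sample_log_y(rng, a, b, p):
    """Draw log Y_k for independent Y_k = b_k * G_k**(1/a), G_k ~ Gamma(shape=p_k, scale=1)."""
    return [math.log(bk) + math.log(rng.gammavariate(pk, 1.0)) / a
            for bk, pk in zip(b, p)]

# Illustrative (hypothetical) parameters with D = 3 parts.
a = 2.0
b = [0.2, 0.3, 0.5]
p = [1.5, 2.0, 2.5]

rng = random.Random(12345)
n = 100000
total = 0.0
for _ in range(n):
    log_y = sample_log_y(rng, a, b, p)
    # log(U_1 / U_D) equals log(Y_1 / Y_D): the total sum cancels in the ratio.
    total += log_y[0] - log_y[-1]
mc_mean = total / n

theory = math.log(b[0] / b[-1]) + (digamma(p[0]) - digamma(p[-1])) / a
print(mc_mean, theory)  # the simulated mean should be close to the theoretical value
```

Because the total \(\sum _j Y_j\) cancels in every ratio, the simulation never needs to form the composition itself; the same loop can be adapted to other log-ratio transforms.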
Partial derivatives of the pseudo-log-likelihood
Let n be the sample size, D the number of parts and p the number of explanatory variables. Set
where \({\mathbf {u}}_i=(u_{i1},u_{i2},\ldots , u_{iD}), i=1,\ldots ,n\) are the observed compositions, \(b_{ij}, j=1,\ldots ,D\) the corresponding scales and \(z_j({\mathbf {u}}_i)\) the j-th component of the vector defined in Eq. (6).
The partial derivatives of the pseudo-log-likelihood in Eq. (12) are
where \(\psi\) is the digamma function.
The derivatives with respect to the regression parameters are given by
The partial derivatives \(\partial b_{ik}/\partial \beta _{j \ell }\) are computed in two steps.
Setting \({\mathbf {s}}_i={\mathbf {x}}_i^t {\mathbf {B}}{\mathbf {V}}^+ = (s_{ir}),r=1,\ldots ,D\) in Eq. (14), we have
Thus, denoting by \(({\mathbf {V}}^+)_{m}\) the m-th column of \({\mathbf {V}}^+\), we have
because \(\left( P z_k({\mathbf {u}}_i)-p_k\right) \mathbf {1}_D = 0\).
Cite this article
Graf, M. Regression for compositions based on a generalization of the Dirichlet distribution. Stat Methods Appl 29, 913–936 (2020). https://doi.org/10.1007/s10260-020-00512-y
Keywords
- Compositions
- Simplicial generalized Beta distribution
- Maximum likelihood estimation
- Imputation
- Multiple regression
Mathematics Subject Classification
- 62E15
- 62F10