Abstract
This paper investigates the small area estimation of population averages of unit-level compositional data. The new methodology transforms the compositions into vectors of \(R^m\) and assumes that the vectors follow a multivariate nested error regression model. Empirical best predictors of domain indicators are derived from the fitted model, and their mean squared errors are estimated by parametric bootstrap. The empirical behavior of the introduced predictors is investigated by means of simulation experiments. An application to real data from the Spanish household budget survey is given. The target is to estimate the averages of the proportions of annual household expenditures on food, housing and others, by Spanish provinces.
1 Introduction
Official statistics contain estimates of socioeconomic indicators at different levels of aggregation. In many sampling designs, small sample sizes do not allow accurate direct estimators to be calculated at low levels of aggregation. These territories or population groups are called small areas. Small Area Estimation (SAE) gives a solution to this problem by incorporating auxiliary information into the data analysis and by introducing model-based predictors. The books of Rao and Molina (2015) and Morales et al. (2021) give a general description of SAE.
The Spanish household budget survey (SHBS) provides information about the nature and destination of household consumption expenses, as well as on various characteristics related to household living conditions. Spain is hierarchically partitioned into 17 autonomous communities and 50 provinces, plus 2 autonomous cities. The sampling design and the sample sizes of the SHBS are developed to provide estimates at the level of the 17 autonomous communities, but not for the provinces. The direct estimates at the province level have low accuracy and, therefore, estimating SHBS indicators at that level is a SAE problem. This paper has two objectives. The first one is to model the unit-level proportions of annual household expenditures on food, housing and others. The second one is to estimate the averages of these proportions, by Spanish provinces.
Under area-level models, we find several proposals for estimating domain proportions and counts. For example, Esteban et al. (2012), Marhuenda et al. (2013, 2014) and Morales et al. (2015) derived predictors based on linear mixed models, and Chambers et al. (2014), Dreassi et al. (2014), Tzavidis et al. (2015) and Boubeta et al. (2016, 2017) applied binomial, negative binomial or Poisson regression models. There are also methodologies for estimating proportions and counts in the setup of contingency tables or multinomial regression models. Without being exhaustive, we find the papers of Zhang and Chambers (2004) and Berg and Fuller (2014) for contingency tables, and the papers of Ferrante and Trivisano (2010), Souza and Moura (2016), Fabrizi et al. (2016), Saei and Chambers (2003), Molina et al. (2007) and López-Vizcaíno et al. (2013, 2015) for multinomial regression models. However, in household survey samples, some variables of interest and domain indicators are compositions. That is to say, they are positive quantities summing up to one or to a known integer number. Concerning area-level models for compositional data, Esteban et al. (2020) and Krause et al. (2022) transformed compositions into target vectors of multivariate Fay-Herriot models in order to make model-based predictions, like the ones described by González-Manteiga et al. (2008a), Benavent and Morales (2016), Benavent and Morales (2021) or Arima et al. (2017).
The statistical literature presents some contributions to small area estimation of proportions and counts under unit-level models for binary outcomes. For example, Chambers et al. (2016), Hobza and Morales (2016), Hobza et al. (2018) and Burgard et al. (2021) derived predictors under M-quantile or binomial-logit models for binary outcomes. These approaches are based on univariate models and not on models for compositional data, which allow the counts or proportions of all the categories of a classification variable to be estimated jointly. This issue was addressed by Scealy and Welsh (2017), who introduced a directional mixed effects model for compositional data and predicted the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. A different approach was employed by Hijazi and Jernigan (2009), Camargo et al. (2012), Tsagris and Stewart (2018) and Morais et al. (2018), who modelled compositional data using Dirichlet regression models. This manuscript also deals with unit-level compositional data, but it proposes to fit multivariate linear mixed models to log-ratio transformations of compositions. Some references on the foundations of compositional data analysis are the books of Aitchison (1986) and Pawlowsky-Glahn and Buccianti (2011) and the papers of Egozcue et al. (2003) and Egozcue and Pawlowsky-Glahn (2019), where some basic transformations of compositions are studied.
This paper introduces small area predictors of averages of unit-level vectors of compositions. To this end, the paper considers three log-ratio transformations of compositions into vectors of \(R^m\). They are the additive, centered and isometric log-ratio transformations. We propose a multivariate nested error regression (MNER) model for analyzing the transformed SHBS compositional data, where the vectors of random effects and the vector of model errors have unstructured covariance matrices with unknown components. The estimates of the MNER model parameters are obtained by using the residual maximum likelihood (REML) estimation method, as described in Esteban et al. (2022a). The fitted model is then used to predict averages of proportions of annual household expenditures on food, housing and others, by Spanish provinces. The empirical best and plug-in predictors of small area compositional parameters are derived similarly to Esteban et al. (2022b).
The estimation of the mean squared error (MSE) of a model-based predictor is an important issue that has no easy solution. Under nonlinear models, the problem is even more difficult. We follow the resampling approach appearing in González-Manteiga et al. (2007, 2008b) to implement a parametric bootstrap procedure.
This paper introduces statistical methodology that is new in four main aspects: (1) the employment of three transformations of unit-level compositional survey data, (2) the use of MNER models with unstructured covariance matrix for modelling the transformed data and capturing the sample correlations, (3) the derivation of domain-level predictors of averages of compositions based on the MNER model fitted to the transformed unit-level data, and (4) the introduction of parametric bootstrap estimators of the MSEs of the new predictors.
The remainder of the paper is organized as follows: Section 2 establishes the probabilistic framework, describes the SAE problem of interest and presents the MNER model. Section 3 derives empirical best predictors (EBP) and plug-in predictors of average compositions and gives a parametric bootstrap method for estimating the MSEs of the EBPs. Section 4 presents three simulation experiments. The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model. Simulation 2 investigates the performance of the EBPs and plug-in predictors, and Simulation 3 analyzes the parametric bootstrap estimator of the MSEs. Section 5 applies the proposed methodology to data from the SHBS of 2016 in Spain. Section 6 gives some conclusions. The paper contains four appendices in a supplementary material file. Appendix A describes the additive, centered and isometric log-ratio transformations of compositions. Appendix B gives further simulation results. Appendix C analyzes the SHBS data with different transformations. Appendix D performs the application to SHBS data without applying log-ratio transformations of compositions.
2 The probabilistic framework
Let U be a population of size N partitioned into D domains or areas \(U_1,\ldots ,U_D\) of sizes \(N_1,\ldots ,N_D\), respectively, so that \(N=\sum _{d=1}^DN_d\) is the global population size. Let us consider the probability vector \(a_{dj}^{+}=(a_{dj1},\ldots ,a_{djm+1})^\prime \in R^{m+1}\) representing proportions associated with the \(m+1\) categories of a classification variable that is defined on unit j of domain d, \(d=1,\ldots ,D\), \(j=1,\ldots ,N_d\). For example, \(a_{dj}^{+}\) may contain the proportions of annual household expenditures in the different expense categories. The components of \(a_{dj}^{+}\) are nonnegative and fulfill the constraint \(a_{dj1}+\ldots +a_{djm+1}=1\). These vectors \(a_{dj}^{+}\) are called compositions or \((m+1)\)-part compositions, and vectors \(a_{dj}=(a_{dj1},\ldots ,a_{djm})^\prime \) are called m-part compositions. Compositional data, consisting of compositions, play an important role in public statistics. Compositions take values in the simplex embedded in \(R^{m+1}\), and m-part compositions take values in the m-dimensional simplex \({{\mathcal {S}}}^m\).
This paper deals with the problem of predicting the domain average compositions
$$\begin{aligned} A_{dk}=\frac{1}{N_d}\sum _{j=1}^{N_d}a_{djk},\quad d=1,\ldots ,D,\,\, k=1,\ldots ,m+1, \end{aligned}$$(2.1)
under a compositional data analysis approach. That is to say, we apply a one-to-one transformation, \(h=(h_1,\ldots ,h_m)^\prime :{{\mathcal {S}}}^m\mapsto R^m\), to m-part compositions and we assume that the transformed vectors follow a multivariate regression model. Appendix A presents three widely employed transformations. They are the additive, centered and isometric log-ratio transformations. The components of the transformed vectors \(y_{dj}=h(a_{dj})=(y_{dj1},\ldots ,y_{djm})^{\prime }\) are continuous variables measured on unit j of domain d, \(d=1,\ldots ,D\), \(j=1,\ldots ,N_d\).
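As an illustration of such a one-to-one map, the following minimal Python sketch implements the additive log-ratio (alr) transformation and its inverse in their usual textbook form, with the last part \(a_{djm+1}\) as reference. The exact definitions used in Appendix A are not reproduced in this section, so the function names `alr` and `alr_inv` and the choice of reference part are assumptions made for illustration only.

```python
import math

def alr(a):
    """Additive log-ratio of an m-part composition (a_1, ..., a_m).

    The implied last part a_{m+1} = 1 - sum(a) is used as the reference
    (an assumption; Appendix A may use a different convention).
    """
    last = 1.0 - sum(a)
    if last <= 0 or any(x <= 0 for x in a):
        raise ValueError("all m+1 parts must be strictly positive")
    return [math.log(x / last) for x in a]

def alr_inv(y):
    """Inverse alr: maps a vector of R^m back to an m-part composition."""
    expy = [math.exp(v) for v in y]
    denom = 1.0 + sum(expy)
    return [e / denom for e in expy]

# Hypothetical household: 20% food, 30% housing, hence 50% other.
a = [0.2, 0.3]
y = alr(a)          # transformed vector in R^2
a_back = alr_inv(y) # recovers the original 2-part composition
```

The round trip `alr_inv(alr(a))` returns the original shares up to floating-point error, which is the property the predictors of Sect. 3 rely on when mapping model predictions back to the simplex.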
For \(k=1,\ldots ,m\), let \({x}_{djk}=(x_{djk1},\ldots ,x_{djkp_k})\) be a row vector containing \(p_k\) explanatory variables and let \({X}_{dj}=\text{ diag }\left( {x}_{dj1},\ldots ,{x}_{djm}\right) _{m\times p}\) with \(p=p_1+\ldots +p_m\). Let \(\beta _{k}\) be a column vector of size \(p_k\) containing regression parameters and let \(\beta =\left( \beta _{1}^{\prime },\ldots ,\beta _{m}^{\prime }\right) ^{\prime }_{p\times 1}\). We assume that the transformed vectors \(y_{dj}\)’s follow the population MNER model
$$\begin{aligned} y_{dj}=X_{dj}\beta +u_{d}+e_{dj},\quad d=1,\ldots ,D,\,\, j=1,\ldots ,N_d, \end{aligned}$$(2.2)
where the vectors of random effects \(u_{d}\)’s and random errors \(e_{dj}\)’s are independent with multivariate normal distributions
$$\begin{aligned} u_{d}\sim N_m(0,V_{ud}),\quad e_{dj}\sim N_m(0,V_{edj}). \end{aligned}$$
The \(m\times m\) covariance matrices \({V}_{ud}\) depend on \(q=m(m+1)/2\) unknown parameters, denoted by
The matrix \(V_{ud}\) is
The \(m\times m\) covariance matrices \({V}_{edj}\) depend on q unknown parameters, i.e.
The matrix \(V_{edj}\) is
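The \(2q\times 1\) vector of variance component parameters is \(\theta =(\theta _u^\prime ,\theta _e^\prime )^\prime \). The \((p+2q)\times 1\) vector of model parameters is \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\). Let \({I}_a\) be the \(a\times a\) identity matrix. We define the \(mN_d\times 1\) vectors \(y_d\) and \(e_d\), the \(mN_d\times p\) matrix \(X_d\) and the \(mN_d\times m\) matrix \(Z_d\) as follows:
$$\begin{aligned} y_d=\underset{1\le j \le N_d}{\hbox {col}}(y_{dj}),\quad e_d=\underset{1\le j \le N_d}{\hbox {col}}(e_{dj}),\quad X_d=\underset{1\le j \le N_d}{\hbox {col}}(X_{dj}),\quad Z_d=\underset{1\le j \le N_d}{\hbox {col}}(I_m). \end{aligned}$$
Model (2.2) can be written in the domain-level form
$$\begin{aligned} y_d=X_d\beta +Z_du_d+e_d,\quad d=1,\ldots ,D, \end{aligned}$$(2.3)
where the vectors \(u_d\) and \(e_d\sim N_{mN_d}(0,V_{ed})\) are independent and \(V_{ed}=\underset{1\le j \le N_d}{\hbox {diag}}(V_{edj})\). We define the \(mN\times 1\) vectors y and e, the \(mD\times 1\) vector u, the \(mN\times p\) matrix X and the \(mN\times mD\) matrix Z as follows:
$$\begin{aligned} y=\underset{1\le d \le D}{\hbox {col}}(y_{d}),\quad e=\underset{1\le d \le D}{\hbox {col}}(e_{d}),\quad u=\underset{1\le d \le D}{\hbox {col}}(u_{d}),\quad X=\underset{1\le d \le D}{\hbox {col}}(X_{d}),\quad Z=\underset{1\le d \le D}{\hbox {diag}}(Z_{d}). \end{aligned}$$
Model (2.2) can be written in the linear mixed model form
$$\begin{aligned} y=X\beta +Zu+e, \end{aligned}$$(2.4)
where \(u\sim N_{mD}(0,{V}_{u})\), \(e\sim N_{mN}(0,V_{e})\) are independent, \(V_{u}=\underset{1\le d \le D}{\hbox {diag}}(V_{ud})\) and \(V_{e}=\underset{1\le d \le D}{\hbox {diag}}(V_{ed})\).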
Under the predictive approach to inference in finite populations, statistical procedures are based on a fixed subset (called sample), \(s=\cup _{d=1}^Ds_d\), of the finite population U. Let \(n_d\) be the size of the domain subset \(s_d\subset U_d\), \(d=1,\ldots ,D\), and let \(n=n_1+\ldots +n_D\) be the total sample size. The complementary domain subsets are \(r_d=U_d-s_d\), \(d=1,\ldots ,D\). Let \(y_s\) and \(y_{ds}\) be the subvectors of y and \(y_d\) corresponding to sample elements, and \(y_r\) and \(y_{dr}\) the subvectors of y and \(y_d\) corresponding to the out-of-sample elements. Without loss of generality, we can write \(y_d=(y_{ds}^\prime ,y_{dr}^\prime )^\prime \). Define also the corresponding decompositions of \(X_d\), \(Z_d\) and \(V_d\). As we assume that the sample indexes are fixed, the sample subvectors \(y_{ds}\) follow the marginal models derived from the population model (2.3), i.e.
$$\begin{aligned} y_{ds}=X_{ds}\beta +Z_{ds}u_d+e_{ds},\quad d=1,\ldots ,D, \end{aligned}$$(2.5)
where \(u_d\sim N_m(0,{V}_{ud})\), \(e_{ds}\sim N_{mn_d}(0,V_{eds})\) are independent and \(V_{eds}=\underset{1\le j \le n_d}{\hbox {diag}}(V_{edj})\). The vectors \(y_{ds}\) are independent with \(y_{ds}\sim N_{mn_d}(\mu _{ds},V_{ds})\), \(\mu _{ds}=X_{ds}\beta \), \(V_{ds}=Z_{ds}V_{ud}Z_{ds}^\prime +V_{eds}\).
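To make the stacked notation concrete, the following NumPy sketch builds \(X_{ds}\), \(Z_{ds}\) and \(V_{ds}\) for a single domain with \(m=2\) and simulates one sample vector \(y_{ds}\). All numerical values (regression parameters, covariance matrices, covariates) are hypothetical and chosen only for illustration, and the error covariance is assumed to be the same for every unit of the domain.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_d = 2, 5                      # components of y_dj, sampled units in domain d

# Hypothetical parameter values, chosen only for illustration.
beta = np.array([1.0, 0.5, -1.0, 0.3])          # p = 4 regression parameters
V_u = np.array([[0.75, 0.20], [0.20, 0.75]])    # covariance of u_d
V_e = np.array([[0.50, 0.10], [0.10, 0.50]])    # covariance of e_dj, same for all j

# X_dj = diag(x_dj1, x_dj2) is m x p; here both components share x_j = (1, x_j2).
x = np.column_stack([np.ones(n_d), rng.uniform(size=n_d)])
X_ds = np.vstack([np.kron(np.eye(m), x[j]) for j in range(n_d)])   # (m*n_d, p)

Z_ds = np.kron(np.ones((n_d, 1)), np.eye(m))    # n_d stacked copies of I_m
V_eds = np.kron(np.eye(n_d), V_e)               # block-diagonal error covariance
V_ds = Z_ds @ V_u @ Z_ds.T + V_eds              # marginal covariance of y_ds

# One draw of the sample vector y_ds = X_ds beta + Z_ds u_d + e_ds.
u_d = rng.multivariate_normal(np.zeros(m), V_u)
e_ds = rng.multivariate_normal(np.zeros(m), V_e, size=n_d).ravel()
y_ds = X_ds @ beta + Z_ds @ u_d + e_ds
```

In this layout the diagonal \(m\times m\) blocks of `V_ds` equal \(V_{ud}+V_{edj}\) and the off-diagonal blocks equal \(V_{ud}\), which is how the shared random effect induces correlation between units of the same domain.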
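When the variance component parameters are known, the best linear unbiased estimator (BLUE) of \(\beta \) and the best linear unbiased predictor (BLUP) of \(u_d\), \(d=1,\ldots ,D\), are
$$\begin{aligned} \tilde{\beta }=\Big (\sum _{d=1}^DX_{ds}^\prime V_{ds}^{-1}X_{ds}\Big )^{-1}\sum _{d=1}^DX_{ds}^\prime V_{ds}^{-1}y_{ds},\quad \tilde{u}_d=V_{ud}Z_{ds}^\prime V_{ds}^{-1}(y_{ds}-X_{ds}\tilde{\beta }). \end{aligned}$$
Let \({\hat{\theta }}\) be the REML estimator of \(\theta \). Then the empirical BLUE (EBLUE) of \(\beta \) and the empirical BLUP (EBLUP) of \(u_d\), \(d=1,\ldots ,D\), are
$$\begin{aligned} \hat{\beta }=\Big (\sum _{d=1}^DX_{ds}^\prime \hat{V}_{ds}^{-1}X_{ds}\Big )^{-1}\sum _{d=1}^DX_{ds}^\prime \hat{V}_{ds}^{-1}y_{ds},\quad \hat{u}_d=\hat{V}_{ud}Z_{ds}^\prime \hat{V}_{ds}^{-1}(y_{ds}-X_{ds}\hat{\beta }), \end{aligned}$$
where \(\hat{V}_{ds}\) and \(\hat{V}_{ud}\) are obtained by substituting \(\theta \) by \({\hat{\theta }}\) in \({V}_{ds}\) and \({V}_{ud}\), respectively. We calculate the inverse of \(V_{ds}=V_{eds}+Z_{ds}V_{ud}Z_{ds}^\prime =A+BCD\) by applying the formula
$$\begin{aligned} (A+BCD)^{-1}=A^{-1}-A^{-1}B\big (C^{-1}+DA^{-1}B\big )^{-1}DA^{-1}. \end{aligned}$$
As \(Z_{ds}^\prime V_{eds}^{-1}Z_{ds}=\sum _{j=1}^{n_d}V_{edj}^{-1}=n_dV_{edj}^{-1}\), we obtain
$$\begin{aligned} V_{ds}^{-1}=V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+n_dV_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}. \end{aligned}$$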
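As the sample indexes are fixed, the out-of-sample subvectors \(y_{dr}\) follow the marginal models derived from the population model (2.3), i.e.
$$\begin{aligned} y_{dr}=X_{dr}\beta +Z_{dr}u_d+e_{dr},\quad d=1,\ldots ,D, \end{aligned}$$
where \(u_d\sim N_m(0,{V}_{ud})\), \(e_{dr}\sim N_{m(N_d-n_d)}(0,V_{edr})\) are independent and \(V_{edr}=\underset{n_d+1\le j \le N_d}{\hbox {diag}}(V_{edj})\). The vectors \(y_{dr}\) are independent with \(y_{dr}\sim N_{m(N_d-n_d)}(\mu _{dr},V_{dr})\), \(\mu _{dr}=X_{dr}\beta \), \(V_{dr}=Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr}\). The covariance matrix between \(y_{dr}\) and \(y_{ds}\) is
$$\begin{aligned} \hbox {cov}(y_{dr},y_{ds})=Z_{dr}V_{ud}Z_{ds}^\prime . \end{aligned}$$
The distribution of \(y_{dr}\), given the sample data \(y_{s}\), is
$$\begin{aligned} y_{dr}|y_{s}\sim N_{m(N_d-n_d)}(\mu _{dr|s},V_{dr|s}). \end{aligned}$$
The conditional \(m(N_d-n_d)\times 1\) mean vector is
$$\begin{aligned} \mu _{dr|s}=X_{dr}\beta +Z_{dr}V_{ud}Z_{ds}^\prime V_{ds}^{-1}(y_{ds}-X_{ds}\beta ). \end{aligned}$$
The conditional covariance matrix is
$$\begin{aligned} V_{dr|s}=Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr}-Z_{dr}V_{ud}Z_{ds}^\prime V_{ds}^{-1}Z_{ds}V_{ud}Z_{dr}^\prime . \end{aligned}$$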
Note that
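If \(n_d\ne 0\) and \(j\in r_d\), \(j>n_d\), the conditional \(m\times 1\) mean vector is
$$\begin{aligned} \mu _{dj|s}=X_{dj}\beta +V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+n_dV_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \}(y_{ds}-X_{ds}\beta ). \end{aligned}$$
If \(n_d=0\) and \(j\in r_d\), the conditional \(m\times 1\) mean vector is
$$\begin{aligned} \mu _{dj|s}=X_{dj}\beta . \end{aligned}$$
If \(n_d\ne 0\) and \(j\in r_d\), \(j>n_d\), the conditional \(m\times m\) covariance matrix is
$$\begin{aligned} V_{d|s}=V_{ud}+V_{edj}-n_dV_{ud}V_{edj}^{-1}V_{ud}+n_d^2V_{ud}V_{edj}^{-1}\Big (V_{ud}^{-1}+n_dV_{edj}^{-1}\Big )^{-1}V_{edj}^{-1}V_{ud}. \end{aligned}$$
If \(n_d=0\) and \(j\in r_d\), the conditional \(m\times m\) covariance matrix is
$$\begin{aligned} V_{d|s}=V_{ud}+V_{edj}. \end{aligned}$$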
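The conditional moments of an out-of-sample vector \(y_{dj}\) can be computed without inverting the full \(mn_d\times mn_d\) matrix \(V_{ds}\), because the inversion formula above only requires \(m\times m\) inverses. The following NumPy sketch is one way to do this, assuming that the error covariance is common to all units of the domain; the function name `conditional_moments` and the test values are hypothetical.

```python
import numpy as np

def conditional_moments(x_new, X_ds, y_ds, beta, V_u, V_e):
    """Mean and covariance of an out-of-sample y_dj given the sample y_ds.

    x_new: (m, p) design block X_dj;  X_ds: (m*n_d, p);  y_ds: (m*n_d,)
    V_u, V_e: (m, m); V_e is assumed equal for all units of the domain.
    Units are stacked unit by unit, so y_ds.reshape(n_d, m) has one row per unit.
    """
    m = V_u.shape[0]
    n_d = y_ds.size // m
    if n_d == 0:                       # no sample in the domain
        return x_new @ beta, V_u + V_e
    Ve_inv = np.linalg.inv(V_e)
    T = np.linalg.inv(np.linalg.inv(V_u) + n_d * Ve_inv)
    # Sum of residual vectors over sampled units, i.e. Z_ds' (y_ds - X_ds beta).
    resid_sum = (y_ds - X_ds @ beta).reshape(n_d, m).sum(axis=0)
    # Z_ds' V_ds^{-1} (y_ds - X_ds beta), using only m x m inverses.
    weighted = Ve_inv @ resid_sum - n_d * Ve_inv @ T @ Ve_inv @ resid_sum
    mean = x_new @ beta + V_u @ weighted
    cov = (V_u + V_e - n_d * V_u @ Ve_inv @ V_u
           + n_d**2 * V_u @ Ve_inv @ T @ Ve_inv @ V_u)
    return mean, cov

# Hypothetical inputs for a domain with m = 2 and n_d = 4.
rng = np.random.default_rng(1)
m, n_d, p = 2, 4, 4
beta = np.array([1.0, 0.5, -1.0, 0.3])
V_u = np.array([[0.75, 0.20], [0.20, 0.75]])
V_e = np.array([[0.50, 0.10], [0.10, 0.50]])
X_ds = rng.normal(size=(m * n_d, p))
y_ds = rng.normal(size=m * n_d)
x_new = rng.normal(size=(m, p))
mu, V = conditional_moments(x_new, X_ds, y_ds, beta, V_u, V_e)
```

A direct computation with the full matrix \(V_{ds}=Z_{ds}V_{ud}Z_{ds}^\prime +V_{eds}\) gives the same mean and covariance, which is a convenient check when implementing the predictors.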
3 Predictors of average compositions
This section deals with the problem of predicting the domain average compositions \(A_{dk}\), \(d=1,\ldots ,D\), \(k=1,\ldots ,m+1\), defined in (2.1). As explained in Sect. 2 and Appendix A, we first transform the m-part compositions \({a}_{dj}=({a}_{dj1},\ldots ,{a}_{djm})^\prime \) into vectors of \(R^m\). This is done by applying a one-to-one function \(h=(h_1,\ldots ,h_m)^\prime :{{\mathcal {S}}}^m\mapsto R^m\). The transformed vectors \(y_{dj}=h(a_{dj})\) have components \(y_{dj1}=h_1(a_{dj}),\ldots ,y_{djm}=h_m(a_{dj})\). Let \(h^{-1}=(h_1^{-1},\ldots ,h_m^{-1})^\prime :R^m\mapsto {{\mathcal {S}}}^m\) be the inverse function of h, so that \(a_{dj1}=h_1^{-1}(y_{dj}),\ldots ,a_{djm}=h_m^{-1}(y_{dj})\).
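For estimating \(A_{dk}\), \(k=1,\ldots ,m+1\), we assume that \(y_{dj}=(y_{dj1},\ldots ,y_{djm})^\prime \) follows a multivariate nested error regression (MNER) model. For \(d=1,\ldots ,D\), the target parameters are additive, i.e.
$$\begin{aligned} A_{dk}=\frac{1}{N_d}\sum _{j=1}^{N_d}h_k^{-1}(y_{dj})=\frac{1}{N_d}\bigg (\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}}h_k^{-1}(y_{dj})\bigg ),\quad k=1,\ldots ,m. \end{aligned}$$
The EBP of \(A_{dk}\) is
$$\begin{aligned} \hat{A}_{dk}^{eb}=\frac{1}{N_d}\bigg (\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}}E_{y_r}\big [h_k^{-1}(y_{dj})|y_s;{\hat{\psi }}\big ]\bigg ),\quad k=1,\ldots ,m. \end{aligned}$$
For a general function h, the expected values above might not be analytically tractable. When this occurs, the following Monte Carlo procedure can be applied.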

(a) Estimate the unknown parameter \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\) using sample data \((y_s,X_s)\).
(b) Replacing \(\psi =(\beta ^{\prime },\theta ^\prime )^\prime \) by the estimate \({\hat{\psi }}=({\hat{\beta }}^\prime ,{\hat{\theta }}^\prime )^\prime \) obtained in (a), draw L copies of each out-of-sample variable \(y_{dj}\) as
$$\begin{aligned} y_{dj}^{(\ell )}\sim N_m({\hat{\mu }}_{dj|s},\hat{V}_{d|s}),\quad j\in r_{d},\,\, d=1,\ldots ,D,\,\, \ell =1,\ldots ,L, \end{aligned}$$
where
$$\begin{aligned} {\hat{\mu }}_{dj|s}=\left\{ \begin{array}{ll} X_{dj}{\hat{\beta }}+\hat{V}_{ud}Z_{ds}^\prime \Big \{\hat{V}_{eds}^{-1}-\hat{V}_{eds}^{-1}Z_{ds}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime \hat{V}_{eds}^{-1}\Big \}(y_{ds}-X_{ds}{\hat{\beta }})&{} \text{ if } \, n_d\ne 0, \\ X_{dj}{\hat{\beta }}&{} \text{ if } \, n_d=0, \end{array}\right. \end{aligned}$$
and
$$\begin{aligned} \hat{V}_{d|s}=\left\{ \begin{array}{ll} \hat{V}_{ud}+\hat{V}_{edj}-n_d\hat{V}_{ud}\hat{V}_{edj}^{-1}\hat{V}_{ud} +n_d^2\hat{V}_{ud}\hat{V}_{edj}^{-1}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}\hat{V}_{edj}^{-1}\hat{V}_{ud}&{} \text{ if } \, n_d\ne 0, \\ \hat{V}_{ud}+\hat{V}_{edj}&{} \text{ if } \, n_d=0. \end{array}\right. \end{aligned}$$
(c) The Monte Carlo approximation of the expected value is
$$\begin{aligned} E_{y_r}\big [h_k^{-1}(y_{dj})|y_s;{\hat{\psi }}\big ]\approx \frac{1}{L}\sum _{\ell =1}^L h_k^{-1}(y_{dj}^{(\ell )}),\,\,\, j\in r_{d},\,\,\,d=1,\ldots ,D. \end{aligned}$$
The Monte Carlo approximation of the EBP of \(A_{dk}\) is
$$\begin{aligned} \hat{A}_{dk}^{eb}\approx \frac{1}{L}\sum _{\ell =1}^LA_{dk}^{(\ell )},\,\,\, A_{dk}^{(\ell )}=\frac{1}{N_d}\bigg (\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}} h_k^{-1}(y_{dj}^{(\ell )})\bigg ),\,\,\,k=1,\ldots ,m. \end{aligned}$$
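Steps (b) and (c) can be sketched for a single domain as follows. This is a minimal illustration, not the authors' implementation: it assumes that the conditional means of the out-of-sample units and the common conditional covariance matrix have already been computed from the fitted model, it uses the inverse additive log-ratio as \(h^{-1}\), and it obtains the last component as one minus the sum of the others. The names `alr_inv` and `ebp_domain` and all input values are hypothetical.

```python
import numpy as np

def alr_inv(y):
    """Inverse additive log-ratio applied row-wise; returns the first m parts."""
    e = np.exp(y)
    return e / (1.0 + e.sum(axis=-1, keepdims=True))

def ebp_domain(y_s, mu_r, V_cond, h_inv, L, rng):
    """Monte Carlo EBP of the average composition of one domain.

    y_s:    (n_d, m) sampled transformed vectors y_dj, j in s_d
    mu_r:   (N_d - n_d, m) conditional means of the out-of-sample y_dj
    V_cond: (m, m) conditional covariance shared by the out-of-sample units
    """
    n_d, m = y_s.shape
    N_d = n_d + mu_r.shape[0]
    sampled_part = h_inv(y_s).sum(axis=0)        # observed units, no simulation
    chol = np.linalg.cholesky(V_cond)
    total = np.zeros(m)
    for _ in range(L):
        # Draw y_dj^(l) ~ N_m(mu_dj|s, V_d|s) for every out-of-sample unit.
        z = rng.standard_normal(mu_r.shape)
        y_r = mu_r + z @ chol.T
        total += (sampled_part + h_inv(y_r).sum(axis=0)) / N_d   # A_dk^(l)
    A_hat = total / L
    # Last part obtained by difference (assumption mirroring the plug-in rule).
    return np.append(A_hat, 1.0 - A_hat.sum())

# Hypothetical domain with n_d = 5 sampled and 15 out-of-sample households.
rng = np.random.default_rng(2)
y_s = rng.normal(size=(5, 2))
mu_r = rng.normal(size=(15, 2))
V_cond = np.array([[0.6, 0.1], [0.1, 0.6]])
A_hat = ebp_domain(y_s, mu_r, V_cond, alr_inv, L=200, rng=rng)
```

The draws are generated independently across units, which is sufficient here because the predictor only involves the sum of the unit-wise conditional expectations.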
The plug-in estimator of \(A_{dk}\) is
and \(\hat{A}_{dm+1}^{in}=1-\hat{A}_{d1}^{in}-\ldots -\hat{A}_{dm}^{in}\).
Remark 3.1
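In many practical cases, the values of the auxiliary variables are not available for all the population units. If in addition some of the variables are continuous, the EBP method is not applicable. An important particular case, where this method is applicable, is when the number of values of the vector of auxiliary variables is finite. More concretely, suppose that the covariates are categorical such that \(X_{dj}\in \{X_{01},\ldots ,X_{0T}\}\), then we can calculate \(A_{dk}^{(\ell )}\) as
$$\begin{aligned} A_{dk}^{(\ell )}=\frac{1}{N_d}\bigg (\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{t=1}^T\sum _{j=1}^{N_{dt}-n_{dt}}h_k^{-1}(y_{dtj}^{(\ell )})\bigg ), \end{aligned}$$
where \(N_{dt}=\#\{j\in U_{d}:\,X_{dj}=X_{0t}\}\) is available from external data sources (aggregated auxiliary information), \(n_{dt}=\#\{j\in s_{d}:\,X_{dj}=X_{0t}\}\), \(y_{dtj}^{(\ell )}\sim N_m({\hat{\mu }}_{dt|s},\hat{V}_{d|s})\), \(d=1,\ldots ,D\), \(j=1,\ldots ,N_{dt}-n_{dt}\), \(t=1,\ldots ,T\), \(\ell =1,\ldots ,L\), and
$$\begin{aligned} {\hat{\mu }}_{dt|s}=\left\{ \begin{array}{ll} X_{0t}{\hat{\beta }}+\hat{V}_{ud}Z_{ds}^\prime \Big \{\hat{V}_{eds}^{-1}-\hat{V}_{eds}^{-1}Z_{ds}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime \hat{V}_{eds}^{-1}\Big \}(y_{ds}-X_{ds}{\hat{\beta }})&{} \text{ if } \, n_d\ne 0, \\ X_{0t}{\hat{\beta }}&{} \text{ if } \, n_d=0, \end{array}\right. \end{aligned}$$
and \(\hat{V}_{d|s}\) is the conditional covariance matrix defined in Step (b) of the above Monte Carlo procedure.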
Remark 3.2
If some auxiliary variables are continuous, we can use the Hájek-type approximation to \(A_{dk}^{(\ell )}\), i.e.
where \(w_{dj}\) is the sample weight of unit j of domain d. A GREG-type approximation to \(A_{dk}^{(\ell )}\) is
where \( \tilde{w}_{dj}=w_{dj}N_d/\hat{N}_d\), \(\hat{N}_d=\sum _{j\in s_d}w_{dj}\).
Analytical approximations to the MSE are difficult to derive in the case of complex parameters. We therefore propose a parametric bootstrap MSE estimator by following the bootstrap method for finite populations of González-Manteiga et al. (2008b). The steps for implementing this method are
1. Fit the model (2.5) to sample data \((y_s,X_s)\) and calculate an estimator \({\hat{\psi }}=({\hat{\beta }}^{\prime },{\hat{\theta }}^\prime )^{\prime }\) of \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\).
2. For \(d=1,\ldots ,D\), \(j=1,\ldots ,N_{d}\), generate independently \(u_{d}^{*}\sim N_m(0,\hat{V}_{ud})\) and \(e_{dj}^{*}\sim N_m(0,\hat{V}_{edj})\), where \(\hat{V}_{ud}=V_{ud}(\hat{\theta })\) and \(\hat{V}_{edj}=V_{edj}(\hat{\theta })\).
3. Construct the bootstrap superpopulation model \(\xi ^*\) using \(\{u_{d}^{*}\}\), \(\{e_{dj}^{*}\}\), \(\{X_{dj}\}\) and \(\hat{\beta }\), i.e.
$$\begin{aligned} \xi ^{*}:\, y_{dj}^{*}=X_{dj}\hat{\beta }+u_{d}^{*}+e_{dj}^{*},\,\,d=1,\ldots ,D, j=1,\ldots ,N_{d}. \end{aligned}$$(3.1)
4. Under the bootstrap superpopulation model (3.1), generate a large number B of i.i.d. bootstrap populations \(\{y_{dj}^{*(b)}:\,d=1,\ldots ,D, j=1,\ldots ,N_{d}\}\) and calculate the bootstrap population parameters
$$\begin{aligned} A_{dk}^{*(b)}=\frac{1}{N_d}\sum _{j=1}^{N_{d}}h_k^{-1}(y_{dj}^{*(b)}),\quad k=1,\ldots ,m,\,\,\, b=1,\ldots ,B. \end{aligned}$$
5. From each bootstrap population b generated in Step 4, take the sample with the same indices \(s\subset U\) as the initial sample, and calculate the bootstrap EBPs, \(\hat{A}_{dk}^{eb*(b)}\), \(k=1,\ldots ,m\), as described in Sect. 3, using the bootstrap sample vector \(y_s^*\) and the known values \(X_{dj}\).
6. A Monte Carlo approximation to the theoretical bootstrap estimator
$$\begin{aligned} MSE_*(\hat{A}_{dk}^{eb*})=E_{\xi ^*}\big [(\hat{A}_{dk}^{eb*}-A_{dk}^{*})(\hat{A}_{dk}^{eb*}-A_{dk}^{*})^\prime \big ],\quad k=1,\ldots ,m, \end{aligned}$$
is
$$\begin{aligned} mse_*(\hat{A}_{dk}^{eb*})=\frac{1}{B}\sum _{b=1}^B(\hat{A}_{dk}^{eb*(b)}-A_{dk}^{*(b)})(\hat{A}_{dk}^{eb*(b)}-A_{dk}^{*(b)})^\prime ,\quad k=1,\ldots ,m. \end{aligned}$$(3.2)
The estimator (3.2) is used to estimate \(MSE(\hat{A}_{dk}^{eb})\), \(k=1,\ldots ,m\).
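The loop structure of Steps 2 to 6 can be sketched in NumPy as follows. This is only a skeleton under strong simplifying assumptions: the predictor is passed in as a user-supplied function, and the toy example plugs in a simple domain sample mean of compositions instead of the EBP, so no model is refitted inside the loop. The names `bootstrap_mse`, `sample_mean_predictor` and `alr_inv`, and all numerical values, are hypothetical.

```python
import numpy as np

def alr_inv(y):
    """Inverse additive log-ratio applied row-wise; returns the first m parts."""
    e = np.exp(y)
    return e / (1.0 + e.sum(axis=-1, keepdims=True))

def bootstrap_mse(X, dom, s_idx, beta, V_u, V_e, h_inv, predictor, B, rng):
    """Parametric bootstrap MSE of predicted domain average compositions.

    X: (N, m, p) design blocks X_dj; dom: (N,) domain labels 0..D-1;
    s_idx: indices of the sampled units; predictor(y_s, s_idx) -> (D, m).
    """
    N, m, _ = X.shape
    D = int(dom.max()) + 1
    N_d = np.bincount(dom, minlength=D)[:, None]
    Lu, Le = np.linalg.cholesky(V_u), np.linalg.cholesky(V_e)
    mean = X @ beta                                   # (N, m) fixed part
    acc = np.zeros((D, m))
    for _ in range(B):
        u = rng.standard_normal((D, m)) @ Lu.T        # Step 2: u_d*
        e = rng.standard_normal((N, m)) @ Le.T        # Step 2: e_dj*
        y = mean + u[dom] + e                         # Step 3: bootstrap population
        A_true = np.zeros((D, m))
        np.add.at(A_true, dom, h_inv(y))              # Step 4: true bootstrap averages
        A_true /= N_d
        A_hat = predictor(y[s_idx], s_idx)            # Step 5: same sample indices
        acc += (A_hat - A_true) ** 2                  # Step 6: squared errors
    return acc / B

# Toy population: D = 4 domains, 30 units each, 5 sampled units per domain.
rng = np.random.default_rng(3)
D, N_each, n_each, m = 4, 30, 5, 2
dom = np.repeat(np.arange(D), N_each)
xs = rng.uniform(size=dom.size)
X = np.stack([np.kron(np.eye(m), np.array([1.0, v])) for v in xs])   # (N, m, 4)
s_idx = np.concatenate([np.flatnonzero(dom == d)[:n_each] for d in range(D)])
beta = np.array([0.2, 0.5, -0.3, 0.4])
V_u = np.array([[0.75, 0.20], [0.20, 0.75]])
V_e = np.array([[0.50, 0.10], [0.10, 0.50]])

def sample_mean_predictor(y_s, idx):
    """Placeholder predictor (NOT the EBP): domain sample mean of compositions."""
    out = np.zeros((D, m))
    np.add.at(out, dom[idx], alr_inv(y_s))
    return out / np.bincount(dom[idx], minlength=D)[:, None]

mse = bootstrap_mse(X, dom, s_idx, beta, V_u, V_e, alr_inv,
                    sample_mean_predictor, B=50, rng=rng)
```

To follow the procedure as stated, `predictor` would have to recompute the EBP of Sect. 3 from each bootstrap sample, which makes every replicate considerably more expensive than in this toy version.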
4 Simulations
The simulation experiments empirically investigate the asymptotic behavior of: (1) the REML estimators of model parameters in Sect. 4.1 and Appendix B.1, (2) the EBP and plug-in predictors of domain average compositions in Sect. 4.2 and Appendix B.2, and (3) the parametric bootstrap MSE estimators in Sect. 4.3 and Appendix B.3.
To meet these three objectives, we consider a basic scenario in which we run simulations for different sample sizes. Take \(m=2\), \(p_1=p_2=2\), \(p=4\), \(\beta _1=(\beta _{11},\beta _{12})^\prime =(10,10)^\prime \), \(\beta _2=(\beta _{21},\beta _{22})^\prime =(10,10)^\prime \). For \(k=1,2\), \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), generate \(X_{dj}=\text{ diag }(x_{dj1},x_{dj2})_{2\times 4}\), where \(x_{dj1}=(x_{dj11},x_{dj12})\), \(x_{dj2}=(x_{dj21},x_{dj22})\) and
For \(d=1,\ldots ,D\), simulate \({u}_{d}\sim N_{2}(0,{V}_{ud})\) and \({e}_{dj}\sim N_{2}(0,{V}_{edj})\), where
where \(\theta _1=0.75\), \(\theta _2=0.75\), \(\theta _4=0.5\), \(\theta _5=0.5\) and \(\theta _3=0.4\), \(\theta _6=0.4\). Simulation 1 generates only 4 different matrices \(X_{dj}\). They are
where
4.1 Simulation 1 for REML estimators
The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model (2.5). This simulation runs \(I=200\) iterations. Appendix B.1 gives the steps of Simulation 1 and the definitions of the absolute and relative performance measures. For every REML estimator \(\hat{\eta }\in \{\hat{\beta }_{11},\hat{\beta }_{12},\hat{\beta }_{21},\hat{\beta }_{22},\hat{\theta }_{1},\ldots ,\hat{\theta }_{6}\}\), Tables 1 and 2 present the relative bias \(RB(\hat{\eta })\) and the relative root-mean-squared error \(RRE(\hat{\eta })\) in %. Appendix B.1 gives the corresponding absolute performance measures. Simulation 1 shows that the REML Fisher-scoring algorithm works properly because \(RB({\hat{\eta }})\) and \(RRE(\hat{\eta })\) decrease as \(n_d\) or D increase.
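The exact definitions of \(RB\) and \(RRE\) are given in Appendix B.1 and are not reproduced in this section. As a rough guide, the following Python sketch computes the usual versions of these measures over the simulation iterations, namely the empirical bias and root-mean-squared error divided by the absolute true value and expressed in %. This form, the function name `relative_measures` and the numbers are assumptions for illustration.

```python
import math

def relative_measures(estimates, true_value):
    """Relative bias and relative root-MSE (both in %) over simulation runs.

    Assumed definitions: bias and root-MSE across iterations, each divided
    by |true_value|; Appendix B.1 may define them slightly differently.
    """
    n_iter = len(estimates)
    bias = sum(e - true_value for e in estimates) / n_iter
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n_iter)
    return 100.0 * bias / abs(true_value), 100.0 * rmse / abs(true_value)

# Hypothetical estimates of a variance parameter with true value 0.75.
rb, rre = relative_measures([0.70, 0.80, 0.75, 0.79], 0.75)
```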
4.2 Simulation 2 for EBPs
Simulation 2 investigates the EBP and plug-in predictors, \(\hat{A}_{dk}^{eb}\) and \(\hat{A}_{dk}^{in}\), respectively, \(k=1,2,3\). It takes \(I=200\) iterations and generates \(L=200\) random vectors for the Monte Carlo approximations of integrals. The population sizes are \(N_d=200\) and \(D=50\). Let h be the clr, alr or ilr transformation. Appendix B.2 gives the steps of Simulation 2 and the definitions of the absolute and relative performance measures. Tables 3, 4 and 5 present the relative absolute bias \(RAB_k\) and the relative root-mean-squared error \(RRE_k\) in %, \(k=1,2,3\), for the clr, alr and ilr transformations, respectively. Appendix B.2 gives the corresponding absolute performance measures.
The performance measures decrease as the sample sizes, \(n_d\)’s, increase, and the EBP gets better results (RAB and RRE) than the plug-in predictor. Note that for each transformation, the data generation, and therefore the true underlying model, is different. For this reason, the results in Tables 3, 4 and 5 are not comparable. It is interesting to observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, the results are slightly better than in the clr and ilr cases.
4.3 Simulation 3 for MSEs
Simulation 3 investigates the MSE estimators of predictors \(\hat{A}_{dk}^{eb}\) and \(\hat{A}_{dk}^{in}\), \(k=1,2,3\). One of the goals is to give a recommendation on the number of bootstrap replicates B to implement. The simulation takes \(I=200\) iterations and generates \(L=200\) random vectors for the Monte Carlo approximations of integrals. The population sizes are \(N_d=200\) and \(D=50\). Let h be the clr, alr or ilr transformation. Appendix B.3 gives the steps of Simulation 3 and the definitions of the absolute and relative performance measures.
Tables 6, 7 and 8 present the relative absolute bias \(RAB_k\) and the relative root-mean-squared error \(RRE_k\) in %, \(k=1,2,3\), for the clr, alr and ilr transformations, respectively. The number of bootstrap replicates is \(B=50, 100, 200, 300, 400\). Appendix B.3 gives the corresponding absolute performance measures. As in Simulation 2, we remark that the results in Tables 6, 7 and 8 are not comparable because the data generation is different. Nevertheless, we observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, Simulation 3 gives slightly better results than in the clr or ilr cases. That is, the functional form of the transformation plays a non-negligible role. In any case, the selection of the transformation in an application to real data must be based on the diagnostics of the corresponding fitted MNER model.
Figures 1 and 2 show the boxplots of \(RRE_{dk}\) and \(RAB_{dk}\) for the predictors \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), with the clr transformation. From the obtained performance measures, we recommend implementing the bootstrap algorithm with at least \(B=300\) replicates. Appendix B.3 gives the same recommendation for the alr and ilr transformations.
5 The Spanish Household Budget Survey (SHBS)
The SHBS is carried out annually by the “Instituto Nacional de Estadística” (INE), with the objective of obtaining information on the nature and destination of consumption expenses, as well as on various characteristics related to household living conditions. In the Spanish economy, it is important to have good estimates of consumer spending, since this spending represents, approximately, \(60\%\) of gross domestic product. However, global political measures are often not satisfactory for regional authorities, which can also develop their own economic strategies. They need tools to determine, with precision and reliability, the main variables and consumption indicators in order to implement their strategies. Among the main consumption indicators are the proportions of annual household expenses on food and housing. This section presents an application of the new statistical methodology to the estimation of domain parameters defined as averages of proportions of annual household expenditures. We take data from the SHBS of 2016. The domains are the 50 Spanish provinces plus the autonomous cities Ceuta and Melilla, so that \(D=52\).
Let \(a_{dj1}\), \(a_{dj2}\) and \(a_{dj3}\) be the proportions of annual expenditures on food, housing and other for household j of domain d. Housing includes expenditure on current housing costs, water, electricity, gas and other fuels. Food includes both food and non-alcoholic beverages, and other represents the remaining expenditures. The vectors \(a_{dj}=(a_{dj1},a_{dj2})^\prime \in R^2\) are 2-part compositions that can be transformed into vectors \(y_{dj}=h(a_{dj})\) of \(R^2\) by one of the transformations h described in Appendix A. Let \(x_{djk}\), \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), \(k=1,2\), be the \(4\times 1\) vector whose components are the binary auxiliary variables that indicate the composition of household j in domain d. As auxiliary variables, we thus consider the household composition HC with categories
HC1: Single person or adult couple with at least one member aged over 65,
HC2: Other compositions with a single person or a couple without children,
HC3: Couple with children under 16 years old or adult with children under 16 years old,
HC4: Other households.
The variable HC is treated as a factor with reference category HC4.
For calculating the EBPs of the domain parameters of interest, we need the true population sizes, \(N_{dt}\), of the crossings of provinces with the categories of the variable HC. We calculate these sizes by using the sampling weights of the Spanish Labor Force Survey (SLFS). The SLFS sampling weights are calibrated to the population sizes of the provinces crossed with sex and age groups. These demographic quantities come from the INE population projection system and they are considered the most accurate demographic figures in Spain. On the other hand, the SHBS sampling weights are calibrated to the population sizes of the autonomous community (NUTS 2) crossed with sex and age groups, which are not the domains of interest.
This section presents a statistical analysis based on the centered log-ratio transformation. This choice is due to the good fit of the MNER model to the transformed data. For the sake of completeness, Appendix C presents the corresponding data analysis for the alr and ilr transformations. Table 9 presents the estimates of the regression parameters, the z-values, the standard errors and the asymptotic p-values. The factor HC is significant for \(y_1\) and \(y_2\). Table 10 presents the asymptotic 95% confidence intervals (L.CI, U.CI) for the variance component parameters. None of them contains zero.
For calculating the asymptotic p-values and confidence intervals of Tables 9 and 10, we take the asymptotic distributions of the REML estimators \({\hat{\theta }}\) and \({\hat{\beta }}\), i.e.
where \(F_s\) is the REML Fisher information matrix. For \(\hat{\beta }_i=\beta _0\), the asymptotic p-value for testing the hypothesis \(H_0:\,\beta _i=0\) is
where \(({X}^{\prime }{V}^{-1}({\hat{\theta }}){X})^{-1}=(q_{ij})_{i,j=1,\ldots ,p}\) and \(\beta _i\) denotes the ith component of the vector \(\beta \). The asymptotic \((1-\alpha )\)-level confidence intervals for the components \(\theta _{\ell }\) of \(\theta \) are
where \({F}^{-1}(\hat{\theta })=(\nu _{ab})_{a,b=1,\ldots ,6}\) and \(z_{\alpha }\) is the \(\alpha \)-quantile of the N(0, 1) distribution.
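As a sketch of how such quantities are typically computed, the following Python code evaluates a two-sided Wald p-value for a regression parameter and a symmetric normal confidence interval for a variance component, using only the standard library. It assumes the usual normal-theory forms, with \(q_{ii}\) the asymptotic variance of \(\hat{\beta }_i\) and \(\nu _{\ell \ell }\) the asymptotic variance of \(\hat{\theta }_\ell \); the function names and the numerical inputs are hypothetical and are not taken from Tables 9 and 10.

```python
import math
from statistics import NormalDist

def wald_p_value(beta_hat, q_ii):
    """Two-sided asymptotic p-value for H0: beta_i = 0.

    q_ii is the i-th diagonal element of (X' V^{-1} X)^{-1}, i.e. the
    asymptotic variance of beta_hat.
    """
    z = abs(beta_hat) / math.sqrt(q_ii)
    return 2.0 * (1.0 - NormalDist().cdf(z))

def wald_interval(theta_hat, nu_ll, alpha=0.05):
    """Asymptotic (1 - alpha)-level interval theta_hat -/+ z * sqrt(nu_ll)."""
    half = NormalDist().inv_cdf(1.0 - alpha / 2.0) * math.sqrt(nu_ll)
    return theta_hat - half, theta_hat + half

# Hypothetical inputs: estimate 0.5 with variance 0.04, so the z-value is 2.5.
p_val = wald_p_value(0.5, 0.04)
lower, upper = wald_interval(0.75, 0.01)
```

An interval that excludes zero, as reported for the variance components in Table 10, corresponds to `lower > 0` in this sketch.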
Figure 3 plots the histograms of the \(D=52\) standardized EBPs of the random effects of the fitted MNER model for food (left) and housing (right) expenditures. It also prints the corresponding probability density function estimates. The shapes of the densities are quite symmetrical, which indicates that the distributions of the random effects are not very far from the normal distributions. Since D is too small to obtain a good nonparametric estimate of the density functions, the definitive conclusions can not be drawn.
Figure 4 gives the histograms of standardized residuals for components \(y_1\) and \(y_2\). It also prints the corresponding probability density function estimates. We do not appreciate a large deviation from the normal distribution.
Figure 5 presents the dispersion plots of standardized residuals versus predicted values (in \(10^4\) euros). Most standardized residuals fall within the interval \((3,3)\), so we consider that outliers do not play a relevant role in the performance of the EBPs. Appendix C of the supplementary material gives the corresponding plots for the additive and isometric logratio transformations. The corresponding plots are similar to the ones shown in Figs. 4 and 5 for the centered logratio transformation. However, Fig. 5 presents more uniform clouds of points in both components than the corresponding figures for the two other transformations. From this graphical diagnosis, we finally prefer doing the data analysis with the centered logratio transformation. However, since the choice of the clr transformation can be debatable, Appendix C presents the full analysis of the data under the two other transformations.
Figure 6 plots the plug-in and the EBP predictions of \({a_{d1}}\) and \({a}_{d2}\). The domains are sorted by sample size, which is printed on the horizontal axis. The figure shows that both predictors follow a similar pattern. This information is complemented by Fig. 7, which shows the relative root-MSEs (RRMSE).
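For reference, a relative root-MSE in percent can be computed as in the short sketch below. It assumes the common definition \(100\sqrt{\text{mse}}/|\text{prediction}|\); the function name and the input values are hypothetical and are not results from the paper.

```python
import numpy as np


def rrmse_percent(prediction, mse):
    """Relative root-MSE in percent: 100 * sqrt(mse) / |prediction|."""
    prediction = np.asarray(prediction, dtype=float)
    mse = np.asarray(mse, dtype=float)
    return 100.0 * np.sqrt(mse) / np.abs(prediction)


# Two made-up domains with predictions 0.20 and 0.30.
rr = rrmse_percent([0.20, 0.30], [1e-4, 9e-4])
```

With these toy inputs the root-MSEs are 0.01 and 0.03, so the relative values are about 5 and 10 percent.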
Figure 8 (left) maps the proportions of the household annual expenditures on food by Spanish provinces. Figure 8 (right) maps the estimated RRMSE in \(\%\). These maps show that expenditures on food are rather variable between provinces. This happens mostly in the autonomous communities of Andalucía, Aragón and Castilla-León, where there are many provinces and some of them are more deprived than others. In contrast, in other regions, such as the Basque Country, the variability of the estimated proportions is smaller. This information could be of great use to local governments in developing economic plans aimed at households and at improving the quality of life.
Figure 9 (left) maps the proportions of the household annual expenditures on housing by Spanish provinces. Figure 9 (right) maps the estimated RRMSE in \(\%\). As is the case with food expenditure, these maps show that expenditures on housing are rather variable between provinces. There are clear differences between the north-central regions, where the proportion of spending on housing is higher, and the southern regions, where it is lower.
Tables 11 and 12 present some condensed numerical results. The tables are constructed in two steps. First, the domains are sorted by sample size, starting with the domain with the smallest sample size. Second, 14 of the 52 domains are selected from the positions \(1, 5, 9,\ldots , 52\). The names and codes of the provinces are labeled by province and d, respectively, and the sample sizes by \(n_d\). Table 11 presents the model-based predictions of food and housing expenditures by province, and Table 12 displays the corresponding estimates of RRMSEs. The plug-in predictors are denoted by in1 and in2 and the EBPs by ebp1 and ebp2.
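The two-step selection can be sketched as follows. The sketch reads the positions \(1, 5, 9,\ldots , 52\) as every fourth position from 1 to 49 plus the last position, which is one reading that yields exactly 14 domains. The sample sizes are randomly generated placeholders, not the SHBS sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 52
n_d = rng.integers(50, 2000, size=D)    # hypothetical sample sizes

# Step 1: sort the domains by sample size, smallest first.
order = np.argsort(n_d, kind="stable")

# Step 2: keep the 1-based positions 1, 5, ..., 49 and the last one, 52.
positions = list(range(1, D, 4)) + [D]
selected = order[np.array(positions) - 1]  # indices of the selected domains
```

Here `selected` holds the indices of the 14 retained domains, ordered from the smallest to the largest sample size.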
6 Conclusions
Compositional data play an important role in public statistics. This paper introduces small area predictors of averages of unit-level vectors of compositions and applies the proposed methodology to estimate the proportions of annual household expenditures on food, housing and others from the 2016 SHBS at the province level. For this purpose, the manuscript considers the centered log-ratio transformation of compositions into vectors of \(R^m\). For the sake of completeness, Appendix C of the supplementary material presents the corresponding statistical analysis under the additive and isometric log-ratio transformations. An MNER model is proposed for analyzing the transformed compositional data, where the vectors of random effects and the vectors of model errors have unstructured covariance matrices with unknown components. As usual in linear mixed models, the parameter estimates of the MNER model are obtained using the REML method. The selection of the centered log-ratio transformation was motivated by the interpretability and diagnostics of the selected MNER model. In this sense, we followed the recommendations of Greenacre (2019); that is, we have tried to provide a simple solution to a practical problem of compositional data.
Of the two proposed predictors, EBP and plug-in, the EBP performs slightly better, as can be seen in the simulation study. For the calculation of the MSE, we recommend a parametric bootstrap, following the ideas of González-Manteiga et al. (2008a), with more than 300 bootstrap replicates (\(B>300\)).
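The logic of a parametric bootstrap MSE estimator can be illustrated with a deliberately simplified sketch. It does not fit the MNER model by REML. It assumes a univariate, balanced nested error model without covariates, \(y_{dj}=\mu +u_d+e_{dj}\), fitted by ANOVA moment estimators, with target \(\mu +u_d\). All function names and the simulated data are hypothetical.

```python
import numpy as np


def fit_predict(y):
    """Fit the toy model to a balanced (D, n) sample; return parameters and predictions."""
    D, n = y.shape
    ybar_d = y.mean(axis=1)
    mu = ybar_d.mean()
    msw = ((y - ybar_d[:, None]) ** 2).sum() / (D * (n - 1))  # within mean square
    msb = n * ((ybar_d - mu) ** 2).sum() / (D - 1)            # between mean square
    s_e2 = msw
    s_u2 = max((msb - msw) / n, 0.0)                          # truncated at zero
    gamma = s_u2 / (s_u2 + s_e2 / n)                          # shrinkage factor
    return mu, s_u2, s_e2, mu + gamma * (ybar_d - mu)


def bootstrap_mse(y, B=400, seed=0):
    """Parametric bootstrap estimate of the MSE of each domain predictor."""
    rng = np.random.default_rng(seed)
    D, n = y.shape
    mu, s_u2, s_e2, _ = fit_predict(y)
    sq_err = np.zeros(D)
    for _ in range(B):
        # Generate a bootstrap population from the fitted model.
        u = rng.normal(0.0, np.sqrt(s_u2), size=D)
        e = rng.normal(0.0, np.sqrt(s_e2), size=(D, n))
        y_star = mu + u[:, None] + e
        target = mu + u                      # bootstrap "true" domain quantity
        *_, pred_star = fit_predict(y_star)  # refit and predict on bootstrap data
        sq_err += (pred_star - target) ** 2
    return sq_err / B


# Simulated data for illustration only: D = 20 domains, n = 5 units each.
rng = np.random.default_rng(1)
y = 1.0 + rng.normal(0.0, 0.5, size=(20, 1)) + rng.normal(0.0, 1.0, size=(20, 5))
mse = bootstrap_mse(y, B=400)
```

The key design point is that the model is refitted in every replicate, so the estimator captures the uncertainty due to parameter estimation as well as the prediction error. The same loop structure carries over to multivariate models, where the fitting and prediction steps are replaced by their multivariate counterparts.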
As a result of the statistical analysis for Spanish provinces, we conclude that food expenditure in Spain accounts for \(14.6\%\) of total household expenditure and varies considerably within autonomous communities. This happens mostly in the autonomous communities of Andalucía, Aragón and Castilla-León, where there are many provinces and some of them are more deprived than others. In contrast, in other regions, such as the Basque Country, the variability of the estimated proportions is smaller. Spending on housing in Spain accounts for \(31\%\) of total household spending, and there are important differences between the north-central provinces (with higher incomes) and those in the south.
Here we applied the introduced methodology to the SHBS, but it is also useful in other areas of official statistics, such as the classification of the population by educational level or by economic activity. In both situations, the simplex constraints must be taken into account.
Finally, we note that there are other regression models for compositions, such as directional mixed effects models or Dirichlet regression mixed models. These models could be adapted to the SAE context described in Sect. 2, including fitting algorithms, predictors of domain quantities, MSE estimators, and so on. They may be competitive alternatives to fitting a multivariate normal mixed model to log-ratio transformations of compositions. We believe that these tasks are interesting subjects for future research.
References
Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, New York
Arima S, Bell WR, Datta GS, Franco C, Liseo B (2017) Multivariate Fay-Herriot Bayesian estimation of small area means under functional measurement error. J R Stat Soc Ser A 180(4):1191–1209
Benavent R, Morales D (2016) Multivariate Fay-Herriot models for small area estimation. Comput Stat Data Anal 94:372–390
Benavent R, Morales D (2021) Small area estimation under a temporal bivariate area-level linear mixed model with independent time effects. Stat Methods Appl 30(1):195–222
Berg EJ, Fuller WA (2014) Small area prediction of proportions with applications to the Canadian Labour Force Survey. J Survey Stat Methodol 2:227–256
Boubeta M, Lombardía MJ, Morales D (2016) Empirical best prediction under area-level Poisson mixed models. TEST 25:548–569
Boubeta M, Lombardía MJ, Morales D (2017) Poisson mixed models for studying the poverty in small areas. Comput Stat Data Anal 107:32–47
Burgard JP, Krause J, Münnich R, Morales D (2021) l2-Penalized temporal logit-mixed models for the estimation of regional obesity prevalence over time. Stat Methods Med Res 30(7):1744–1768
Camargo AP, Stern JM, Lauretto MS (2012) Estimation and model selection in Dirichlet regression. AIP Conf Proc 1443:206–213. https://doi.org/10.1063/1.3703637
Chambers R, Dreassi E, Salvati N (2014) Disease mapping via negative binomial regression M-quantiles. Stat Med 33:4805–4824
Chambers R, Salvati N, Tzavidis N (2016) Semiparametric small area estimation for binary outcomes with application to unemployment estimation for Local Authorities in the UK. J R Stat Soc Ser A 179:453–479
Dreassi E, Ranalli MG, Salvati N (2014) Semiparametric M-quantile regression for count data. Stat Methods Med Res 23:591–610
Egozcue JJ, Pawlowsky-Glahn V (2019) Compositional data: the sample space and its structure. TEST 28(3):599–638
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300
Esteban MD, Morales D, Pérez A, Santamaría L (2012) Small area estimation of poverty proportions under area-level time models. Comput Stat Data Anal 56:2840–2855
Esteban MD, Lombardía MJ, López-Vizcaíno E, Morales D, Pérez A (2020) Small area estimation of proportions under area-level compositional mixed models. TEST 29(3):793–818
Esteban MD, Lombardía MJ, López-Vizcaíno E, Morales D, Pérez A (2022a) Small area estimation of expenditure means and ratios under a unit-level bivariate linear mixed model. J Appl Stat 49(1):143–168
Esteban MD, Lombardía MJ, López-Vizcaíno E, Morales D, Pérez A (2022b) Empirical best prediction of small area bivariate parameters. Scand J Stat 49:1699–1727
Fabrizi E, Ferrante MR, Trivisano C (2016) Hierarchical Beta regression models for the estimation of poverty and inequality parameters in small areas. In: Pratesi M (ed) Analysis of poverty data by small area methods. Wiley, New York
Ferrante MR, Trivisano C (2010) Small area estimation of the number of firms’ recruits by using multivariate models for count data. Surv Methodol 36(2):171–180
González-Manteiga W, Lombardía MJ, Molina I, Morales D, Santamaría L (2007) Estimation of the mean squared error of predictors of small area linear parameters under a logistic mixed model. Comput Stat Data Anal 51:2720–2733
González-Manteiga W, Lombardía MJ, Molina I, Morales D, Santamaría L (2008a) Analytic and bootstrap approximations of prediction errors under a multivariate Fay-Herriot model. Comput Stat Data Anal 52:5242–5252
González-Manteiga W, Lombardía MJ, Molina I, Morales D, Santamaría L (2008b) Bootstrap mean squared error of small-area EBLUP. J Stat Comput Simul 78:443–462
Greenacre M (2019) Comments on: Compositional data: the sample space and its structure. TEST 28:644–652
Hobza T, Morales D (2016) Empirical best prediction under unit-level logit mixed models. J Off Stat 32(3):661–669
Hobza T, Santamaría L, Morales D (2018) Small area estimation of poverty proportions under unit-level temporal binomial-logit mixed models. TEST 27(2):270–294
Hijazi RH, Jernigan RW (2009) Modeling compositional data using Dirichlet regression models. J Appl Probab Stat 4(1):77–91
Krause J, Burgard JP, Morales D (2022) Robust prediction of domain compositions from uncertain data using isometric logratio transformations in a penalized multivariate Fay-Herriot model. Stat Neerl 76(1):65–96
López-Vizcaíno E, Lombardía MJ, Morales D (2013) Multinomial-based small area estimation of labour force indicators. Stat Model 13(2):153–178
López-Vizcaíno E, Lombardía MJ, Morales D (2015) Small area estimation of labour force indicators under a multinomial model with correlated time and area effects. J R Stat Soc Ser A 178(3):535–565
Marhuenda Y, Molina I, Morales D (2013) Small area estimation with spatio-temporal Fay-Herriot models. Comput Stat Data Anal 58:308–325
Marhuenda Y, Morales D, Pardo MC (2014) Information criteria for Fay-Herriot model selection. Comput Stat Data Anal 70:268–280
Molina I, Saei A, Lombardía MJ (2007) Small area estimates of labour force participation under multinomial logit mixed model. J R Stat Soc Ser A 170:975–1000
Morais J, Thomas-Agnan C, Simioni M (2018) Using compositional and Dirichlet models for market share regression. J Appl Stat 45(9):1670–1689
Morales D, Pagliarella MC, Salvatore R (2015) Small area estimation of poverty indicators under partitioned area-level time models. SORT Stat Oper Res Trans 39(1):19–34
Morales D, Esteban MD, Pérez A, Hobza T (2021) A course on small area estimation and mixed models. Springer, Berlin
Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis. Wiley, Chichester
Rao JNK, Molina I (2015) Small area estimation, 2nd edn. Wiley, Hoboken
Saei A, Chambers R (2003) Small area estimation under linear and generalized linear mixed models with time and area effects. S3RI Methodology Working Paper M03/15, Southampton Statistical Sciences Research Institute
Scealy JL, Welsh AH (2017) A directional mixed effects model for compositional expenditure data. J Am Stat Assoc 112(517):24–36
Souza DB, Moura FAS (2016) Multivariate Beta regression with applications in small area estimation. J Off Stat 32:747–768
Tsagris M, Stewart C (2018) A Dirichlet regression model for compositional data with zeros. Lobachevskii J Math 39(3):398–412
Tzavidis N, Ranalli MG, Salvati N, Dreassi E, Chambers R (2015) Robust small area prediction for counts. Stat Methods Med Res 24(3):373–395
Zhang L, Chambers R (2004) Small area estimates for cross-classifications. J R Stat Soc Ser B 66(2):479–496
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Supported by the Instituto Galego de Estatística, by the Grants PGC2018-096840-B-I00 and PID2020-113578RB-I00 of the Spanish Ministerio de Economía y Competitividad, by the Grant Prometeo/2021/063 of the Generalitat Valenciana, by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C 2020/14), and by GAIN (Galician Innovation Agency) and the Regional Ministry of Economy, Employment and Industry Grant COV20/00604 and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01, all of them through the ERDF.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Esteban, M.D., Lombardía, M.J., López-Vizcaíno, E. et al. Small area estimation of average compositions under multivariate nested error regression models. TEST (2023). https://doi.org/10.1007/s11749-023-00847-0
Keywords
 Household budget survey
 Small area estimation
 Multivariate nested error regression model
 Compositional data
 Bootstrap
 Household expenditures
Mathematics Subject Classification
 62E30
 62J12