Small area estimation of average compositions under multivariate nested error regression models

This paper investigates the small area estimation of population averages of unit-level compositional data. The new methodology transforms the compositions into vectors of Rm\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^m$$\end{document} and assumes that the vectors follow a multivariate nested error regression model. Empirical best predictors of domain indicators are derived from the fitted model, and their mean squared errors are estimated by parametric bootstrap. The empirical analysis of the behavior of the introduced predictors is investigated by means of simulation experiments. An application to real data from the Spanish household budget survey is given. The target is to estimate the average of proportions of annual household expenditures on food, housing and others, by Spanish provinces.


Introduction
Official statistics contain estimates of socioeconomic indicators at different levels of aggregation. In many sampling designs, small sample sizes do not allow accurate direct estimators to be calculated at low levels of aggregation. These territories or population groups are called small areas. Small Area Estimation (SAE) gives a solution to this problem by incorporating auxiliary information to the data analysis and by introducing model-based predictors. The books of Rao and Molina (2015) and Morales et al. (2021) give a general description of SAE.
The Spanish household budget survey (SHBS) provides information about the nature and destination of the consumption household expenses, as well as on various characteristics related to the conditions of household life. Spain is hierarchically partitioned in 17 autonomous communities and 50 provinces, plus 2 autonomous cities. The sampling design and the sample sizes of the SHBS are developed to provide estimates for the 17 autonomous communities level, but not for the provinces. The direct estimates at the province level have a low accuracy and, therefore, estimating SHBS indicators at that level is a SAE problem. This paper has two objectives. The first one is to model the unit-level proportions of annual household expenditures on food, housing and others. The second one is to estimate the average of these proportions, by Spanish provinces.
Under area-level models, we find some more proposals for estimating domain proportions and counts. For example, Esteban et al. (2012), Marhuenda et al. (2013Marhuenda et al. ( , 2014 and Morales et al. (2015) derived predictors based on linear mixed models and (Chambers et al. 2014;Dreassi et al. 2014;Tzavidis et al. 2015) and (Boubeta et al. 2017(Boubeta et al. , 2016 applied binomial, negative binomial or Poisson regression models. There are also methodologies for estimating proportions and counts in the setup of contingency tables or multinomial regression models. Without being exhaustive, we find the papers of Zhang and Chambers (2004), Berg and Fuller (2014) for contingency tables, and the papers of Ferrante and Trivisano (2010), Souza and Moura (2016), Fabrizi et al. (2016), Saei and Chambers (2003), Molina et al. (2007) and López-Vizcaíno et al. (2013 for multinomial regression models. However, in the household survey samples, some variables of interest and domain indicators are compositions. This is to say, they are positive quantities summing up to one or to a known integer number. Concerning area-level model for compositional data, Esteban et al. (2020) and Krause et al. (2022) transformed compositions into target vectors of multivariate Fay-Herriot models in order to make model-based predictions, like the ones described by González-Manteiga et al. (2008a), Benavent and Morales (2016), Benavent and Morales (2021) or Arima et al. (2017).
The statistical literature presents some contributions to small area estimation of proportions and counts under unit-level models for binary outcomes. For example, Chambers et al. (2016), Hobza and Morales (2016), Hobza et al. (2018) and Burgard et al. (2021) derived predictors under M-quantile or binomial-logit models for binary outcomes. These approaches are based on univariate models and not in models for compositional data that consider the possibility of jointly estimating the counts or proportions of all the categories of a classification variable. This issue was faced by Scealy and Welsh (2017), which introduced a directional mixed effects model for compositional data and predicted the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. A different approach was employed by Hijazi and Jernigan (2009), Camargo et al. (2012), Tsagris and Stewart (2018), Morais et al. (2018), which modelled compositional data using Dirichlet regression models. This manuscript also deals with unit-level compositional data, but it proposes to fit multivariate linear mixed models to logratio transformations of compositions. Some references on the foundations of compositional data analysis are the books (Aitchison 1986) and (Pawlowsky-Glahn and Buccianti 2011) and the papers (Egozcue et al. 2003) and (Egozcue and Pawlowsky-Glahn 2019), where some basic transformations of compositions are studied.
This paper introduces small area predictors of averages of unit-level vectors of compositions. For this sake, the paper considers three logratio transformations of compositions into vectors of R m . They are the additive, centered and isometric logratio transformations. We propose a multivariate nested error regression (MNER) model for analyzing the transformed SHBS compositional data, where the vectors of random effects and the vector of model errors have unstructured covariance matrices with unknown components. The estimates of the MNER model parameters are obtained by using the residual maximum likelihood (REML) estimation method, as it is described in Esteban et al. (2022a). The fitted model is then used to predict averages of proportions of annual household expenditures on food, housing and others, by Spanish provinces. The empirical best and plug-in predictors of small area compositional parameters are derived similarly as in Esteban et al. (2022b).
The estimation of the mean squared error (MSE) of a model-based predictor is an important issue that has no easy solution. Under nonlinear models, the problem is even more difficult. We follow the resampling approach appearing in González-Manteiga et al. (2007, 2008b to implement a parametric bootstrap procedure. This paper introduces statistical methodology that is new in four main aspects: (1) the employment of three transformations of unit-level compositional survey data, (2) the use of MNER models with unstructured covariance matrix for modelling the transformed data and capturing the sample correlations, (3) the derivation of domainlevel predictors of averages of compositions based on the MNER model fitted to the transformed unit-level data, and (4) the introduction of parametric bootstrap estimators of the MSEs of the new predictors.
The remainder of the paper is organized as follows: Section 2 establishes the probabilistic framework, describes the SAE problem of interest and presents the MNER model. Section 3 derives empirical best predictors (EBP) and plug-in predictors of average compositions and gives a parametric bootstrap method for estimating the MSEs of the EBPs. Section 4 presents three simulation experiments. The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model. Simulation 2 investigates the performance of the EBPs and plug-in predictors, and Simulation 3 analyzes the parametric bootstrap estimator of the MSEs. Section 5 applies the proposed methodology to data from the SHBS of 2016 in Spain. Section 6 gives some conclusions. The paper contains four appendices in a supplementary material file. Appendix A describes the additive, centered and isometric logratio transformations of compositions. Appendix B gives further simulation results. Appendix C analyzes the SHBS data with different transformations. Appendix D performs the application to SHBS data without applying logratio transformations of compositions.

The probabilistic framework
Let U be a population of size N partitioned into D domains or areas U 1 , . . . , U D of sizes N 1 , . . . , N D , respectively. Let N = D j=1 N d be the global population size. Let us consider the probability vector a + d j = (a d j1 , . . . , a d jm+1 ) ∈ R m+1 representing proportions associated with the m + 1 categories of a classification variable that is defined on the sample unit j of domain d, d = 1, . . . , D, j = 1, . . . , N d  . , x d jkp k ) be a row vector containing p k explanatory variables and let X d j = diag x d j1 , . . . , x d jm m× p with p = p 1 + . . . + p m . Let β k be a column vector of size p k containing regression parameters and let β = β 1 , . . . , β m p×1 . We assume that the transformed vectors y d j 's follow the population MNER model where the vectors of random effects u d 's and random errors e d j 's are independent with multivariate normal distributions The m ×m covariance matrices V ud depend on q = m(m +1)/2 unknown parameters, denoted by The matrix V ud is The m × m covariance matrices V ed j depend on q unknown parameters, i.e.
The 2q ×1 vector of variance component parameters is θ = (θ u , θ e ) . The ( p +2q)×1 vector of model parameters is ψ = (β , θ ) . Let I a be the a × a identity matrix. We define the m N d × 1 vectors y d and e d , the m N d × p matrix X d and the m N d × m matrix Z d as follows: Model (2.2) can be written in the domain-level form Model (2.2) can be written in the linear mixed model form Under the predictive approach to inference in finite populations, statistical procedures are based on a fixed subset (called sample), s = ∪ D d=1 s d , of the finite population U . Let n d be the size of the domain subset s d ⊂ U d , d = 1, . . . , D, and let n = n 1 + . . . + n D be the total sample size. The complementary domain Let y s and y ds be the sub-vectors of y and y d corresponding to sample elements and y r y dr the sub-vectors of y and y d corresponding to the out-of-sample elements. Without lack of generality, we can write y d = (y ds , y dr ) . Define also the corresponding decompositions of X d , Z d and V d . As we assume that sample indexes are fixed, then the sample sub-vectors y ds follow the marginal models derived from the population model (2.3), i.e.
When the variance component parameters are known, the best linear unbiased estimator (BLUE) of β and the best linear unbiased predictor (BLUP) of u d , d = 1, . . . , D, arê Letθ be the REML estimator of θ , then the empirical BLUE (BLUE) of β and the empirical BLUP whereV ds andV ud are obtained by substituting θ byθ in V ds and V ud , respectively. We calculate the inverse of V ds = V eds + Z ds V ud Z ds = A + BC D by applying the formula As the sample indexes are fixed, the out-of-sample sub-vectors y dr follow the marginal models derived from the population model (2.3), i.e.
The covariance matrix between y dr and y ds is The distribution of y dr , given the sample data y s , is The conditional covariance matrix is Note that If n d = 0 and j ∈ r d , j > n d , the conditional m × 1 mean vector is If n d = 0 and j ∈ r d , the conditional m × 1 mean vector is If n d = 0 and j ∈ r d , the conditional m × m covariance matrix is

Predictors of average compositions
This section deals with the problem of predicting the domain average compositions A dk , d = 1, . . . , D, k = 1, . . . , m + 1, defined in (2.1). As explained in Sect. 2 and Appendix A, we first transform the m-part compositions a d j = (a d j1 , . . . , a d jm ) into vectors of R m . This is done by applying a one-to-one function . For estimating A dk , k = 1, . . . , m + 1, we assume that y d j = (y d j1 , . . . , y d jm ) follows a multivariate nested error regression (MNER) model. For d = 1, . . . , D, the target parameters are additive, i.e For a general function h, the expected values above might be not tractable analytically. When this occurs, the following Monte Carlo procedure can be applied.
(b) Replacing ψ = (β , θ ) by the estimateψ = (β ,θ ) obtained in (a), draw L copies of each out-of-sample variable y d j as (c) The Monte Carlo approximation of the expected value is The Monte Carlo approximation of the EBP of A dk iŝ The plug-in estimator of A dk iŝ In many practical cases, the values of the auxiliary variables are not available for all the population units. If in addition some of the variables are continuous, the EBP method is not applicable. An important particular case, where this method is applicable, is when the number of values of the vector of auxiliary variables is finite. More concretely, suppose that the covariates are categorical such that andV d|s was defined in Step (b) of the above Monte Carlo procedure.
Remark 3.2 If some auxiliary variables are continuous, we can use the Hájek-type approximation to A Analytical approximations to the MSE are difficult to derive in the case of complex parameters. We therefore propose a parametric bootstrap MSE estimator by following the bootstrap method for finite populations of González-Manteiga et al. (2008b). The steps for implementing this method are 1. Fit the model (2.5) to sample data (y s , X s ) and calculate an estimatorψ = (β ,θ )  k = 1, . . . , m, b = 1, . . . , B.

From each bootstrap population b generated in
Step 4, take the sample with the same indices s ⊂ U as the initial sample, and calculate the bootstrap EBPs,Â eb * (b) dk , k = 1, . . . , m, as described in Sect. 3, using the bootstrap sample vector y * s and the known values X d j . 6. A Monte Carlo approximation to the theoretical bootstrap estimator (3. 2) The estimator (3.2) is used to estimate M SE(Â eb dk ), k = 1, . . . , m.

Simulation 1 for REML estimators
The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model (2.5). This simulation runs I = 200 iterations. Appendix B.1 gives the steps of Simulation 1 and the definitions of the absolute and relative performance measures. For every REML estimatorη ∈ {β 11 ,β 12 ,β 21 ,β 22 ,θ 1 , . . . ,θ 6 }, Tables 1 and 2 present the relative bias R B(η) and the relative root-mean-squared error R R E(η) in %. Appendix B.1 gives the corresponding absolute performance measures. Simulation 1 shows that the REML Fisher-scoring algorithm works properly because R B(η) and R R E(η) decrease as n d or D increase.

Simulation 2 for EBPs
Simulation 2 investigates the EBP and plug-in predictors,Â eb dk andÂ in dk , respectively, k = 1, 2, 3. It takes I = 200 iterations and generates L = 200 random vectors for  Tables 3, 4 and 5 present the relative absolute bias R AB k and the relative root-mean-squared error R R E k in %, k = 1, 2, 3, for the clr, alr and ilr transformations, respectively. Appendix B.2 gives the corresponding absolute performance measures. The performances measures decrease as the sample sizes, n d 's, increase and the EBP gets better results (RAB and RRE) than the plug-in predictor. Note that for each transformation, the data generation, and therefore the true underlying model, is different. For this reason, the results in Tables 3, 4 and 5 are not comparable. It is curious to observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, the results are slightly better than in the clr and ilr cases.

Simulation 3 for MSEs
Simulation 3 investigates the MSE estimators of predictorsÂ eb dk andÂ in dk , k = 1, 2, 3. One of the goals is to give a recommendation on the number of bootstrap replicates Tables 6, 7 and 8 present the relative absolute bias R AB k and the relative rootmean-squared error R R E k in %, k = 1, 2, 3, for the clr, alr and ilr transformations, respectively. The number of bootstrap replicates is B = 50, 100, 200, 300, 400. Appendix B.3 gives the corresponding absolute performance measures. As in Simulation 2, we remark that the results in Tables 6, 7 and 8 are not comparable because the data generation is different. Nevertheless, we observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, Simulation 3 gives slightly better results than in the clr or ilr cases. That is, the functional form of the transformation plays a non-negligible role. In any case, the selection of the transformation in an application to real data must be made based on the diagnosis of the corresponding MNER model that we select. Figures 1 and 2 show the boxplots of R R E dk and R AB dk for the predictorsÂ eb dk , k = 1, 2, 3, with the clr transformation. From the obtained performance measures, we recommend to implement the bootstrap algorithm with at least B = 300 iterations. Appendix B.3 give the same recommendation for the alr and ilr transformations.

The Spanish Household Budget Survey (SHBS)
The SHBS is annually carried out by the "Instituto Nacional de Estadística" (INE), with the objective of obtaining information on the nature and destination of the consumption expenses, as well as on various characteristics related to the conditions of household life. In the Spanish economy, it is important to have good estimates of consume spending, since this spending represents, approximately, 60% of gross domestic product. However, global political measures are not often satisfactory for regional authorities, which can also develop their own economic strategies. They need some tools to determine, with precision and reliability, the main variables and consume indicators in order to implement their strategies. Among the main consume indicators are the proportions of food and housing annual expenses of households. This section presents an application of the new statistical methodology to the estimation of domain parameters defined as average of proportions of annual household expenditures. We take data from the SHBS of 2016. The domains are the 50 Spanish provinces plus the autonomous cities Ceuta and Melilla, so that D = 52.
Let a d j1 , a d j2 and a d j3 be the proportions of annual expenditures on food, housing and other for household j of domain d. Housing includes expenditure on current housing costs, water, electricity, gas and other fuels. Food includes both food and nonalcoholic beverages and other represent the remaining expenditures. The vectors a d j = (a d j1 , a d j2 ) ∈ R 2 are 2-part compositions that can be transformed into vectors y d j = h(a d j ) of R 2 by one of the transformations h described in Appendix A. Let x d jk , d = 1, . . . , D, j = 1, . . . , n d , k = 1, 2, be the 4 × 1 vector whose components are the binary auxiliary variables that indicate the composition of the household to which household j belongs in domain d. As auxiliary variables, we thus consider the household composition HC with categories   The variable HC is treated as a factor with reference category HC4.
For calculating the EBPs of the domain parameters of interest, we need the true population sizes, N dt , of the crossings of provinces with the categories of the variable HC. We calculate these sizes by using the sampling weights of the Spanish Labor Force Survey (SLFS). The SLFS sampling weights are calibrated to the population sizes of the provinces crossed with sex and age groups. These demographic quantities come from the INE population projection system and they are considered the most accurate demographic figures in Spain. On the other hand, the SHBS sampling weights are calibrated to the population sizes of the autonomous community (NUTS 2) crossed with sex and age groups, which are not the domains of interest.
This section presents an statistical analysis by applying the centered logratio transformation. This choice is due to the good fit of the MNER model to the transformed data. For the sake of completeness, Appendix C presents the corresponding data analysis for the alr and ilr transformations. Table 9 presents the estimates of the regression parameters, the z-values, the standard errors and the asymptotic p-values. The factor HC is significative for y 1 and y 2 . Table 10 presents the asymptotic 95% confidence intervals (L.CI, U.CI) for the variance component parameters. None of them contains the zero. For calculating the asymptotic p-values and confidence intervals of Tables 9 and 10, we take the asymptotic distributions of the REML estimatorsθ andβ, i.e.
where F s is the REML Fisher information matrix. Forβ i = β 0 , the asymptotic p-value for testing the hypothesis H 0 : β i = 0 is where (X V −1 (θ)X ) −1 = (q i j ) i, j=1,..., p and β i denotes the i-th component of the vector β. The asymptotic (1 − α)-level confidence intervals for the components θ of θ areθ ± z α/2 ν 1/2 , = 1, . . . , 6, where F −1 (θ) = (ν ab ) a,b=1,...,6 and z α is the α-quantile of the N (0, 1) distribution. Figure 3 plots the histograms of the D = 52 standardized EBPs of the random effects of the fitted MNER model for food (left) and housing (right) expenditures. It also prints the corresponding probability density function estimates. The shapes of the densities are quite symmetrical, which indicates that the distributions of the random effects are not very far from the normal distributions. Since D is too small to obtain a good nonparametric estimate of the density functions, the definitive conclusions can not be drawn. Figure 4 gives the histograms of standardized residuals for components y 1 and y 2 . It also prints the corresponding probability density function estimates. We do not appreciate a large deviation from the normal distribution. Figure 5 presents the dispersion plots of standardized residuals versus predicted values (in 10 4 euros). Most standardized residuals fall within the interval (−3, 3), so we consider that outliers do not play a relevant role in the performance of the EBPs. Appendix C of the supplementary material gives the corresponding plots for the additive and isometric logratio transformations. The corresponding plots are similar to the ones shown in Figs. 4 and 5 for the centered logratio transformation. However, Fig. 5 presents more uniform clouds of points in both components than the corresponding figures for the two other transformations. From this graphical diagnosis, we finally prefer doing the data analysis with the centered logratio transformation. However, since the choice of the clr transformation can be debatable, Appendix C presents the full analysis of the data under the two other transformations. Figure 6 plots the plug-in and the EBP predictions of a d1 and a d2 . The domains are sorted by sample sizes and the sample size is printed in the axis OX. This figure shows that both estimators follow a similar pattern. This information is completed by Fig 7, which shows the relative root-MSEs (RRMSE). show that expenditures on food are rather variable between provinces. This happens mostly in the autonomous regions of Andalucía, Aragón or Castilla León, where there are many provinces and some of them are more deprived than others. In contrast, there are other regions, such as Basque Country where the variability of the estimated ratios is smaller. This information could be of great use to local governments in developing economic plans aimed at households and improving the quality of life. Figure 9 (left) maps the proportions of the household annual expenditures on housing by Spanish provinces. Figure 9 (right) maps the estimated RRMSE in %. As is the case with food expenditure, these figures show that expenditures on housing is rather variable between provinces. This map shows clear differences between the northcentral regions, where the proportion of spending is higher, and the southern regions, where household expenditures are lower.
Tables 11 and 12 present some condensed numerical results. The tables are constructed in two steps: First, the domains are sorted by sample size, starting by the  Table 11 presents the model-based predictions of food and housing expenditures by provinces and Table 12 displays the corresponding estimates of RRMSEs. The plug-in predictors are denoted by in1 and in2 and the EBPs by ebp1 and ebp2.

Conclusions
Compositional data play an important role in public statistics. The proposed methodology is applied to estimate the proportions of annual household expenditures on food, housing and others from the 2016 SHBS at the province level. This paper introduces small area predictors of averages of unit-level vectors of compositions. For this purpose, the manuscript considers the centered logratio transformations of compositions into vectors of R m . For the sake of completeness, Appendix C of the supplementary material presents the corresponding statistical analysis under the additive and isometric logratio transformations. A MNER model is proposed for analyzing the transformed compositional data, where the vectors of random effects and the vector of model errors have unstructured covariance matrices with unknown components. As usual in linear mixed models, the parameter estimates of the MNER model are obtained using the REML method. The selection of the centered logratio transformation was motivated by the interpretability and diagnosis of the selected MNER model. In this sense, we followed the recommendations of Greenacre (2019). This is to say, we have tried to provide a simple solution to a practical problem of compositional data. Of the two proposed predictors, EBP and plug-in, EBP presents a slightly better performance, as can be seen in the simulation study. For the calculation of the MSE, we recommend a parametric bootstrap, following the ideas of González-Manteiga et al. (2008a) and for a number of repetitions greater than B = 300.
As a result of the statistical analysis for Spanish provinces, we conclude that food expenditure in Spain accounts for 14.6% of total household expenditure and presents great variability within autonomous communities. This happens mostly in the Autonomous Regions of Andalucía, Aragón or Castilla León, where there are many provinces and some of them are more deprived than others. In contrast, there are other regions, such as Basque Country where the variability of the estimated proportions is smaller. On the other hand, spending on housing in Spain accounts for 31% of total household spending and there are important differences between the north-central provinces (with higher incomes) and those in the south.
In this case, we applied the introduced methodology to the SHBS, but it is useful in other topics of the official statistics, like the classification of the population by the educational level and according to economic activity. In both situations, it is necessary to take into account the simplex constraints.
We finally remind that there are other regression models for compositions, such as directional mixed effects models or Dirichlet regression mixed models. These models are likely to be adapted to the SAE context described in Sect. 2, including fitting algorithms, predictors of domain quantities, MSE estimators, and so on. They can be competitive options with respect to fitting a multivariate normal mixed model to logratio transformations of compositions. We believe that these tasks are interesting subjects for future research.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.