1 Introduction

Statistical offices are interested on the estimation of socio-economic indicators, like proportions or counts, for the whole population or for subsets called domains. Sampling designs are developed for obtaining precise estimators on their target (planned) domains. Statisticians are also asked to provide estimates for unplanned domains (small areas), where sample sizes are too small to carry out such estimations. Small Area Estimation (SAE) deals with this kind of problems by combining tools of survey sampling and statistical modelling at the unit or at the area level. The monographs of Rao (2003) and of Rao and Molina (2015) give a general description of SAE.

In Galicia (north-west of Spain), the Spanish Labour Force Survey (SLFS) provides information about labour market indicators. The territory of Galicia is hierarchically divided into counties and municipalities. As the sampling design of the SLFS is stratified with strata defined by the size of the municipalities and most municipalities are not represented in the sample, the direct estimates at the municipal or county level have a low accuracy. In this context, estimating labour force indicators is thus a SAE problem. The objective of this paper is to estimate proportions of under 16 years, employed, unemployed and inactive people in counties of Galicia crossed by sex under an area-level model-based approach.

Under the unit-level approach, Chambers et al. (2016), Hobza and Morales (2016) and Hobza et al. (2018) have derived estimators of domain proportions and counts based on M-quantile or binomial logit models for binary outcomes. Under the area-level approach, Chambers et al. (2014), Dreassi et al. (2014), Tzavidis et al. (2015) and Boubeta et al. (2016, 2017) applied binomial, negative binomial or Poisson regression models for estimating the domain counts or proportions. Esteban et al. (2012), Marhuenda et al. (2013, 2014) and Morales et al. (2015) derived estimators of proportions based on area-level linear mixed models. Concerning the Bayesian approach to small area estimation of proportions, the contributions of Farrell (2000), Larsen (2003), Chen and Lahiri (2012) and Liu and Lahiri (2017) are relevant. These approaches are based on univariate models that do not consider the possibility of jointly estimating the counts or proportions of more than two categories.

In labour force statistics, some indicators of interest are the totals or proportions of the categories of a classification variable. This is to say, they are domain compositional parameters summing up to one or to a known integer number. The seminal book of Aitchison (1986), the more recent book edited by Pawlowsky-Glahn and Buccianti (2011) and the papers Egozcue et al. (2003) and Egozcue and Pawlowsky-Glahn (2019) are basic references for an introduction to compositional data analysis.

The estimation of compositional parameters requires using multivariate models. At the unit level, Scealy and Welsh (2017) have applied a directional mixed effects model for predicting the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. At the area level, we can find methodologies for contingency tables where the cell counts are explained by categorical auxiliary variables or by regression-based inference procedures allowing for continuous auxiliary variables. Within the first approach, Zhang and Chambers (2004) developed a class of log-linear structural models that is suited to estimation of small area cross-classified counts based on survey data. Berg and Fuller (2014) gave a SAE procedure for a two-way table of proportions with predictors based on a nonlinear mixed model. Under the regression-based setup, Ferrante and Trivisano (2010) proposed a multivariate SAE approach for count data based on the multivariate Poisson-lognormal distribution an derived hierarchical Bayes predictors. Souza and Moura (2016) and Fabrizi et al. (2016) deal with multivariate Beta regression models in SAE. Saei and Chambers (2003), Molina et al. (2007) and López-Vizcaíno et al. (2013, 2015) have applied multinomial logit mixed models to category counts for estimating domain totals of labour status categories.

This paper introduces three area-level compositional mixed models to obtain small area estimates of labour force proportions in Galicia. The first model is an additive logistic transformation of a multivariate Fay–Herriot (MFH) model with no restrictions on the covariance matrix of the vector of random effects. The second and third models are defined similarly by using centred and isometric logratio transformations, respectively. The new models give higher flexibility than the multinomial model for dealing with the covariance structure of the categories. The MFH model was suggested by Fay (1987) and studied by Datta et al. (1991, 1996), Ghosh et al. (1996), González-Manteiga et al. (2008a), Benavent and Morales (2016) and Arima et al. (2017).

We propose a trivariate Fay–Herriot (TFH) model for analyzing the SLFS data, where the vector of random effects (model errors) has an unstructured covariance matrix with unknown components and the vector of sampling errors has a known covariance matrix. As far as we know, this model is studied and applied for the first time to SAE problems. We do not implement a further parametric modelling of the last covariance matrix as Berg and Fuller (2012) do. The estimates of the TFH model parameters are obtained by using the residual maximum likelihood (REML) estimation method. The fitted model is then used to estimate the proportions of under 16 years, employed, unemployed and inactive people in Galician counties.

The estimation of the mean squared error (MSE) of a model-based predictor is an important issue that has no easy solution. Under nonlinear models, the problem is even more difficult. In this paper, we follow the resampling approach appearing in González-Manteiga et al. (2007, 2008a, 2008b) to introduce a parametric bootstrap procedure. Further research would be needed to approximate the MSEs of predictors as it is done by Slud and Maiti (2006) in transformed univariate Fay–Herriot models.

This paper introduces statistical methodology that is new in three main aspects: (1) the employment of three transformations of area-level compositional survey data, (2) the use of TFH models (as a particular case of the general multivariate case) with unstructured covariance matrix for modelling the transformed data and capturing the sample correlations and (3) the derivation of domain-level predictors of proportions and counts based on the TFH model fitted to the transformed data.

The remainder of the paper is organized as follows. Section 2 gives an introduction to the labour force data and to the SAE problem of interest. Section 3 introduces the additive, centred and isometric logratio area-level model-based approaches for estimating domain compositional parameters. Section 4 describes the considered TFH model. Section 5 develops the proposed compositional predictors and the corresponding MSE estimation procedures. Section 6 applies the proposed methodology to data from the SLFS of the first quarter of 2017 in Galicia. Section 7 gives some conclusions.

The paper contains four appendixes in a supplementary material file. Appendix A presents four simulation experiments. The target of Simulation 1 is to check the behaviour of the REML algorithm for fitting the TFH model. Simulation 2 investigates the performance of the compositional and the multinomial predictors of the category proportions. Simulation 3 analyses the parametric bootstrap estimator of the MSEs. Simulation 4 studies the behaviour of the predictors of proportions when the target data are generated by a multinomial mixed model. Appendix B reviews the competitor predictors based on multinomial mixed models. Appendix C gives the REML Fisher scoring algorithm and derives the REML score vector and Fisher information matrix. Appendix D describes the centred and the isometric logratio transformations of compositions.

2 The problem of interest

The SLFS in Galicia is a quarterly survey following a stratified two-stage random sampling design. Primary sampling units are composed by census sections, which are geographical areas with a minimum of 500 dwellings or about 3000 people. Secondary sampling units are composed by main family dwellings and permanent accommodations. Subsampling is not carried out in secondary sampling units, and data are collected on all persons who regularly live in the same dwelling. The SLFS gets information about the labour market.

Galicia is divided into four provinces. They are Coruña, Lugo, Ourense and Pontevedra, coded by the Spanish Statistical Office as 15, 27, 32 and 36, respectively. Each province is hierarchically partitioned in comarcas (counties) and municipalities. Our domains of interest are the counties crossed by sex. As there are 51 counties in the SLFS of Galicia, we have 102 domains. Our goal is to estimate domain proportions of people in the categories of the variable “labour status” by using SLFS data from the first quarter of 2017.

The SLFS is designed to obtain precise direct estimates of labour indicators at the province level. For the considered SLFS data, the minimum domain sample size is 10, the first quartile is 43 and the median is 79. Therefore, obtaining reliable estimates for target domains is a small area estimation problem and borrowing strength from auxiliary data is recommended.

In mathematical terms, Galicia is a population \(U=\cup _{d=1}^DU_d\) partitioned in D domains \(U_d\). Each domain is partitioned in subsets \(U_{dk}\), \(k=1,\ldots ,q\), defined by the classification variable “labour force status” that classifies units into a finite number of categories. The \(q=4\) categories are \(\le 15\) years (\(k=1\)), employed (\(k=2\)), unemployed (\(k=3\)) and inactive (\(k=4\)). Let N and \(N_d\) be the sizes of U and \(U_d\), respectively. Consider the study variables taking the values \(z_{dkj}=1\) if the unit j from the domain \(U_d\) is in the category k and \(z_{dkj}=0\) otherwise. The target parameters are the domain means (proportions) and the domain totals (counts), i.e.

$$\begin{aligned} \bar{Z}_{dk}=\frac{Z_{dk}}{N_d},\quad Z_{dk}=\sum _{j\in U_d}z_{dkj},\quad d=1,\ldots ,D,\,\,k=1,\ldots ,q. \end{aligned}$$
(2.1)

As \(\bar{Z}_{d1}+\cdots +\bar{Z}_{dq}=1\) and \(Z_{d1}+\cdots +Z_{dq}=N_d\), with \(N_d\) known, we are interested in estimating domain compositions with \(q=4\) categories.

For any sample \(s\subset U\) extracted from the population, \(s_d\) denotes the subsample from \(U_d\) of size \(n_d\) and \(w_{dj}\) represents the SLFS sampling weight of the unit j from the subsample \(s_d\). The sample proportions and counts are as:

$$\begin{aligned} \bar{z}_{dk.}=\frac{z_{dk.}}{n_d},\quad z_{dk.}=\sum _{j\in s_d}z_{dkj},\quad d=1,\ldots ,D,\,\,k=1,\ldots ,q. \end{aligned}$$
(2.2)

Direct estimators of \(\bar{Z}_{dk}\) and \(Z_{dk}\) are as:

$$\begin{aligned} \hat{\bar{Z}}_{dk}^{dir}=\frac{\hat{Z}_{dk}^{dir}}{\hat{N}_d^{dir}},\quad \hat{Z}_{dk}^{dir}=\sum _{j\in s_d}w_{dj}z_{dkj},\quad \hat{N}_d^{dir}=\sum _{j\in s_d}w_{dj}. \end{aligned}$$
(2.3)

As the SLFS sampling weights \(w_{dj}\) are derived from the inverses of the inclusion probabilities after non-response correction and calibration at the province and at the regional level, the direct estimators (2.3) are not design-based unbiased for estimating the target parameters (2.1) at the domain level. As the SLFS domain sample sizes \(n_d\) are small, the estimators (2.3) have large design-based variances. Therefore, the application of model-based approaches is advisable.

López-Vizcaíno et al. (2013) introduced a multinomial logit mixed (MLM) model for fitting \(z_{d.}=(z_{d1.},\ldots ,z_{dq-1.})\), \(d=1,\ldots ,D\), and estimating domain proportions. A related approach is fitting a MLM model to \((\hat{Z}_{d1}^{dir},\ldots ,\hat{Z}_{dq-1}^{dir})\), \(d=1,\ldots ,D\). Appendix B describes these MLM models for a response vector \(\xi _d=(\xi _{d1},\ldots ,\xi _{dq-1})\) and the corresponding predictors of the multinomial category probabilities \(p_{dk}\). In both cases, \(\xi _{dk}=z_{dk.}\) or \(\xi _{dk}=\hat{Z}_{dk}^{dir}\), the covariances, given in (2.2) of Appendix B, under MLM models are negative. Further, under the multinomial distribution, it holds that \(\text {cov}_M(\xi _{dk_1},\xi _{dk_2})=0\) if and only if \(p_{dk_1}=0\) or \(p_{dk_2}=0\), which implies that \(\text {cov}_M(\xi _{dk_1},\xi _{dk})=0\) \(\forall k\ne k_1\) or \(\text {cov}_M(\xi _{dk_2},\xi _{dk})=0\) \(\forall k\ne k_2\) . Therefore, the multinomial correlation structure is rather rigid. More concretely, if the number of categories is \(q=4\), then the six variance components (three variances and three covariances) depend only on three category parameters through the formulas (2.2) of Appendix B. In practice, the multinomial covariance structure does not necessarily fit to the true covariances \(\text {cov}(\xi _{dk_1.},\xi _{dk_2.})\), \(k_1\ne k_2\), \(k_1,k_2=1,\ldots ,q\), of sample counts or direct estimators of totals.

For the categories \(k_1, k_2=1,2,3,4\), Table 1 presents the correlations of the set of values \(\{(\hat{\bar{Z}}_{dk_1}^{dir}, \hat{\bar{Z}}_{dk_2}^{dir}): d=1,\ldots ,D\}\) (left) and \(\{(\bar{z}_{dk_1.}, \bar{z}_{dk_2.}): d=1,\ldots ,D\}\) (right). These correlations are calculated from the direct estimates (2.3) and the simple estimates (2.2) of the category proportions along the domains. As they are calculated with aggregated data at the domain level, we call them domain-level empirical correlations between categories or, in short, domain-level correlations. Table 1 shows that most, but not all, correlations are negative. Nevertheless, we find the positive correlations 0.21 (right) or 0.04 (right) and 0.07 (left) that are close to zero.

Table 1 Domain-level correlations for \(\hat{\bar{Z}}_{dk_1}^{dir}, \hat{\bar{Z}}_{dk_2}^{dir}\) (left) and \(\bar{z}_{dk_1.}, \bar{z}_{dk_2.}\) (right)
Fig. 1
figure 1

Unit-level covariances (left) and correlations (right) for \(\hat{\bar{Z}}_{dk_1}^{dir}, \hat{\bar{Z}}_{dk_2}^{dir}\), \(k_1\ne k_2\)

The design-based within-domain covariance \(\text {cov}_\pi (\hat{\bar{Z}}_{dk_1}^{dir}, \hat{\bar{Z}}_{dk_2}^{dir})\), \(k_1,k_2=1,\ldots ,q-1\), can be estimated by

$$\begin{aligned} \hat{\text {cov}}_\pi (\hat{\bar{Z}}_{dk_1}^{dir}, \hat{\bar{Z}}_{dk_2}^{dir})= \frac{1}{\big (\hat{N}_d^{dir}\big )^2} \sum _{j\in s_{d}}w_{dj}(w_{dj}-1)(z_{dk_1j}-\hat{\bar{Z}}_{dk_1}^{dir})(z_{dk_2j}-\hat{\bar{Z}}_{dk_2}^{dir}),\nonumber \\ \end{aligned}$$
(2.4)

where the case \(k_1=k_2=k\) denotes estimated variance, i.e. \(\hat{{\text {var}}}_\pi (\hat{\bar{Z}}_{dk}^{dir})=\hat{{\text {cov}}}_\pi (\hat{\bar{Z}}_{dk}^{dir},\hat{\bar{Z}}_{dk}^{dir})\). The last formulas are obtained from Särndal et al. (1992), pp. 43, 185 and 391, with the simplifications \(w_{dj}=1/\pi _{dj}\), \(\pi _{dj,dj}=\pi _{dj}\) and \(\pi _{di,dj}=\pi _{di} \pi _{dj}\), \(i\ne j\) in the second-order inclusion probabilities. By applying formula (2.4), Fig. 1 (left) plots the estimated design-based covariances \(s_{k_1k_2}=\hat{\text {cov}}_\pi (\hat{\bar{Z}}_{dk_1}^{dir}, \hat{\bar{Z}}_{dk_2}^{dir})\), \(k_1, k_2=1,2,3,4\), \(k_1\ne k_2\), for the direct estimators of all the category proportions. As these covariances are calculated from unit-level data, they are called unit-level covariances. Figure 1 shows that \(\hat{\text {cov}}_\pi (\hat{\bar{Z}}_{d1}^{dir}, \hat{\bar{Z}}_{d3}^{dir})\approx 0\), \(d=1,\ldots ,D\). Under the multinomial mixed model, this fact implies (as explained above) that the domain proportions of people in the categories “under 16 years” and “unemployed” should be close to zero, which contradicts the observed sampling proportions. Figure 1 (right) plots the corresponding unit-level correlations \(r_{k_1k_2}\). In the application to SLFS data, these unit-level covariances are used to set the error covariance matrix of the fitted TFH model.

The exploratory data analysis shows that the covariance structure of the MLM models M1 and M2, described in Appendix B, cannot take into account simultaneously both types of covariances and variances (unit level and domain level). This is to say, the multinomial mixed model does not fit well to the domain-level correlations appearing in Table 1 or to the within-domain covariances shown in Fig. 1. Further, the definition of the model itself does not allow the introduction of both types of covariances and variances as MFH models do. This is why we propose transforming the direct estimators of domain proportions and fitting the transformed data to a more flexible compositional mixed model.

3 Transformations of compositions

This section considers three transformation of q-compositions onto \(R^{q-1}\). In what follows, we use the simpler notation \(z_{dk}\triangleq \hat{\bar{Z}}_{dk}^{dir}\), \(k=1,\ldots ,q\), \({z}_d=(z_{d1},\ldots ,z_{dq-1})^\prime \) and

$$\begin{aligned} \sigma _{z,k_1k_2}\triangleq \text {cov}_\pi (z_{dk_1},z_{dk_2}),\quad \text {var}_\pi ({z}_d)=\big (\sigma _{z,k_1k_2}\big )_{k_1,k_2=1,\ldots ,q-1}. \end{aligned}$$
(3.1)

We assume that \(z_{dk}>0\), \(d=1,\ldots ,D\), \(k=1,\ldots ,q\), and we note that \(z_{d1}+\ldots +z_{dq}=1\). For \(d=1,\ldots ,D\), let \({y}_d=(y_{d1},\ldots ,y_{dq-1})^\prime \in R^{q-1}\) be the additive logratio transformation (alr) of \({z}_d\), i.e. \({y}_d={h}({z}_d)=(h_1({z}_d),\ldots ,h_{q-1}({z}_d))^\prime \) with

$$\begin{aligned}&y_{dk}=h_k({z}_d)=\log (z_{dk}/z_{dq})=\log z_{dk} - \log \big ( 1-z_{d1}-\ldots -z_{dq-1}\big ),\quad \\&k=1,\ldots ,q-1, \end{aligned}$$

and with inverse the additive logistic (alogist) transformation

$$\begin{aligned} z_{dk}= & {} \frac{\exp \{y_{dk}\}}{1+\exp \{y_{d1}\}+\cdots +\exp \{y_{dq-1}\}},\quad k=1,\ldots ,q-1,\\ z_{dq}= & {} 1-z_{d1}-\cdots -z_{dq-1}=\frac{1}{1+\exp \{y_{d1}\}+\cdots +\exp \{y_{dq-1}\}}. \end{aligned}$$

For \(i,j=1,\ldots ,q-1\), the first partial derivatives of the alogist transformation are as:

$$\begin{aligned} \frac{\partial z_i}{\partial y_i}=z_i(1-z_i)>0,\quad \frac{\partial z_i}{\partial y_j}=-z_iz_j<0,\quad i\ne j. \end{aligned}$$
(3.2)

This to say, under the alogist transformation \(z_i\) is an increasing function of \(y_i\) and a decreasing transformation of \(y_j\), \(i,j =1,\ldots ,q-1\), \(j\ne i\). Alternatively, we may consider the centred or the isometric logratio (clr or ilr) transformations described in Appendix D. For ease of exposition, this paper deals mainly with the alr transformation. The mathematical developments for the clr or ilr transformations can be done similarly.

In applications to real data, we can have domain compositions \((z_{d1},\ldots ,z_{dq})\) with some zero components \(z_{dk}=0\). Since the logarithm of zero is \(-\infty \), we cannot apply alr, clr and ilr logratio transformations for analyzing compositional data. In our application to real data, we have no zeros. However, in other domain-level data we might find a low number of zeros. This can occur if a domain sample size is very small and none of the sampled units belong to a given category k. A practical solution for this problem is replacing zeros by a small numerical value \(\varepsilon >0\) and making a rounding-off adjustment process. For example, if \((z_{d1},\ldots ,z_{dq})\) has m zeros, we can apply the recommendation appearing in Section 11.5 of Aitchison (1986). This is to say, we can substitute the zeros by \(\varepsilon =(m+1)(q-m)\delta q^{-2}\) and we can subtract \(m(m+1)\delta q^{-2}\) to each positive component, where \(\delta \) is the maximum rounding-off error of the positive \(z_{dk}\)’s. We can take \(\big (\widehat{\text {var}}_\pi (z_{dk})\big )^{1/2}\) as rounding-off error of \(z_{dk}\). In any case, applied statisticians should do a sensitivity analysis of the results to the particular method employed to deal with zeros and a model diagnostic study.

The presence of structural zeros or a non-negligible amount of sample zeros is a severe issue when using alr, clr and ilr logratio transformations. We are not in favour of using the three introduced transformations in those cases. Other transformations and models for dealing with compositional data could be alternatively applied. For example, the directional mixed effects models applied by Scealy and Welsh (2017) overcome this problem.

The first derivatives of the alr transformation \(h_k\) are as:

$$\begin{aligned} H_{kk}({z}_d)=\frac{\partial h_{k}({z}_d)}{\partial z_{dk}}=\frac{1}{z_{dk}}+\frac{1}{z_{dq}},\quad H_{k_1k_2}({z}_d)=\frac{\partial h_{k_1}({z}_d)}{\partial z_{dk_2}}=\frac{1}{z_{dq}}\,\,\,\text {if }k_1\ne k_2. \end{aligned}$$

Let \({1}_{a}\) and \({0}_a\) be the \(a\times 1\) vectors with all the components equal to one and to zero, respectively. In matrix form, we have \({H}({z}_d)=\big (H_{ij}({z}_d)\big )_{i,j=1,\ldots ,q-1}= \mathop {\hbox {diag}}\nolimits _{1\le k \le q-1}(z_{dk}^{-1})+z_{dq}^{-1}{1}_{q-1}{1}_{q-1}^\prime \).

A Taylor series expansion of \({h}({z}_d)\) around a given \({z}_0\) yields to

$$\begin{aligned} {y}_d={h}({z}_d)\approx {h}({z}_0)+{H}({z}_0)({z}_d-{z}_0). \end{aligned}$$
(3.3)

If we take \({z}_0=q^{-1}{1}_{q-1}\), then \(h({z}_0)={0}_{q-1}\), \(H_0=H({z}_0)=q\big ({I}_{q-1}+{1}_{q-1}{1}_{q-1}^\prime \big )\), where \({I}_{a}\) is the \(a\times a\) identity matrix. Alternatively, we can take \(z_0=\frac{1}{D}\sum _{d=1}^Dz_d\) or we can select \(z_0\) depending on d and close to \(z_d\) for a better Taylor expansion approximation. For the clr and ilr transformations, Appendix D shows that \({h}(z_0)={0}_{q-1}\) and gives \(H(z_0)\). From (3.3), we get the approximated covariance matrix:

$$\begin{aligned} \text {var}_\pi ({y}_d)\approx {H}_0\text {var}_\pi ({z}_d){H}_0^\prime . \end{aligned}$$
(3.4)

A general area-level model for estimating \(\bar{Z}_{dk}\), \(d=1,\ldots ,D\), \(k=1,\ldots ,q\), is \({y}_d\overset{ind}{\sim }{N}_{q-1}(\mu _d,V_d)\), where \(\mu _d\) is a mean vector depending on unknown regression parameters and auxiliary variables and \(V_d\) is a covariance depending of some unknown parameters. The next section gives a flexible multivariate area-level linear mixed model for \({y}_d\), \(d=1,\ldots ,D\), which allows positive and negative covariances. Depending on the employed transformation, the resulting models for \(z_d\), \(d=1,\ldots ,D\), are called alr, clr and ilr compositional mixed models. These models arise as alternative to the MLM models, for \(z_{d.}\) or \(z_{d}\), described in Appendix B. For ease of exposition, we present the case \(q=4\) appearing in the application to real data.

4 Trivariate Fay–Herriot models

Let \(z_{d}=(z_{d1},z_{d2},z_{d3})^{\prime }\) be the vector of direct estimators (2.3) of proportions \(\bar{Z}_d=(\bar{Z}_{d1},\bar{Z}_{d2},\bar{Z}_{d3})^{\prime }\) of some classification variable with \(q=4\) categories. Let \(\widehat{\text {var}}_\pi ({z}_d)\) be the matrix of design-based covariance estimators (2.4), which can contain positive and negative covariances. Let \({y}_{d}=(y_{d1},y_{d2},y_{d3})^{\prime }\) be the alr transformation of \(z_d\). Let \(\mu _{d}=E_\pi (y_d)=(\mu _{d1},\mu _{d2},\mu _{d3})^{\prime }\) be the vector of design-based expectations of \(y_d\). The TFH model is defined in two stages. The sampling model is

$$\begin{aligned} {y}_{d}=\mu _{d}+{e}_{d},\quad d=1,\ldots ,D, \end{aligned}$$
(4.1)

where the vectors \({e}_{d}\overset{ind}{\sim } N_3\left( 0,{V}_{ed}\right) \) are independent and the \(3\times 3\) covariance matrices \({V}_{ed}=\big (\sigma _{dij}\big )_{i,j=1,2,3}\) are known. In practice, we take \({V}_{ed}=H_0\widehat{\text {var}}_\pi ({z}_d)H_0^\prime \), where \(\widehat{\text {var}}_\pi ({z}_d)\) is the covariance matrix given in (3.1) and \(H_0\) is defined in (3.4). Alternatively, \(y_d\) can be defined as the clr or ilr transformation of \(z_d\). In that case, the matrix \(H_0\) is taken from Appendix D.

Moreover, it is assumed that the \(\mu _{dk}\)’s are linearly related to \(r_k\) explanatory variables associated with the k-th category in the domain d. For \(k=1,2,3\), let \({x}_{dk}=(x_{dk1},\ldots ,x_{dkr_k})\) be a row vector containing the \(r_k\) explanatory variables for \(\mu _{dk}\) and let \({X}_{d}=\text {diag}\left( {x}_{d1},{x}_{d2},{x}_{d3}\right) _{3\times r}\) with \(r=r_1+r_2+r_3\). Let \(\beta _{k}\) be a column vector of size \(r_k\) containing the regression parameters for \(\mu _{dk}\) and let \(\beta =\left( \beta _{1}^{\prime },\beta _{2}^{\prime },\beta _{3}^{\prime }\right) ^{\prime }_{r\times 1}\). This section introduces a TFH model by assuming (4.1) and the linking model:

$$\begin{aligned} \mu _{d}={X}_{d}\beta +{u}_{d},\quad {u}_{d}\overset{ind}{\sim } N_3(0,{V}_{ud}),\quad d=1,\ldots ,D, \end{aligned}$$
(4.2)

where the vectors \({u}_{d}\)’s are independent and independent of the vectors \({e}_{d}\)’s. Unlike the MFH models studied by Benavent and Morales (2016), the \(3\times 3\) covariance matrices \({V}_{ud}\) are unstructured and depend on six unknown parameters, \(\theta _{1}=\sigma _{u1}^2\), \(\theta _{2}=\sigma _{u2}^2\), \(\theta _{3}=\sigma _{u3}^2\), \(\theta _{4}=\rho _{12}\), \(\theta _{5}=\rho _{13}\) and \(\theta _{6}=\rho _{23}\), i.e.

$$\begin{aligned} {V}_{ud}= \left( \begin{array}{ccc} \sigma _{u1}^2 &{} \quad \rho _{12}\sigma _{u1}\sigma _{u2} &{} \quad \rho _{13}\sigma _{u1}\sigma _{u3}\\ \rho _{12}\sigma _{u1}\sigma _{u2} &{} \quad \sigma _{u2}^2 &{} \quad \rho _{23}\sigma _{u2}\sigma _{u3}\\ \rho _{13}\sigma _{u1}\sigma _{u3} &{} \quad \rho _{23}\sigma _{u2}\sigma _{u3} &{} \quad \sigma _{u3}^2 \end{array}\right) . \end{aligned}$$

The matrix \(V_{ud}\) explains the covariance structure of the alr transformations of the direct estimators \(y_d\) that is not taken into account by the sampling errors \(e_d\) or by the auxiliary variables \(X_d\). Let \({I}_n\) be the \(n\times n\) identity matrix and \(\delta _{\ell d}\) be the Kronecker delta, and define

$$\begin{aligned} {y}=\underset{1\le d \le D}{\hbox {col}}({y}_{d}),\, {u}=\underset{1\le d \le D}{\hbox {col}}({u}_{d}),\, {e}=\underset{1\le d \le D}{\hbox {col}}({e}_{d}),\, {u}_d=\underset{1\le k \le 3}{\hbox {col}}(u_{dk}),\, {e}_d=\underset{1\le k \le 3}{\hbox {col}}(e_{dk}), \end{aligned}$$
$$\begin{aligned} {X}=\underset{1\le d \le D}{\hbox {col}}({X}_{d}),\, {Z}_{d}=\underset{1\le \ell \le D}{\hbox {col}}(\delta _{\ell d}{I}_3),\, {Z}=\underset{1\le d \le D}{\hbox {col}^{\prime }}({Z}_{d})={I}_{3D},\, {V}_u=\underset{1\le d \le D}{\hbox {diag}}({V}_{ud}), \end{aligned}$$

where col and \(\hbox {col}^{\prime }\) are matrix operators stacking by columns and rows, respectively.

In matrix form, the TFH model (4.1)+(4.2) is

$$\begin{aligned} {y}={X}\beta +{Z}{u}+{e}={X}\beta +{Z}_{1}{u}_{1}+\cdots +{Z}_{D}{u}_{D}+{e}, \end{aligned}$$
(4.3)

where \({e}, {u}_{1},\ldots ,{u}_D\) are independent with distributions

$$\begin{aligned} {e}\sim N\left( 0,{V}_{e}\right) ,\quad {u}\sim N(0,{V}_u)\quad \text {and}\quad {u}_{d}\sim N(0,{V}_{ud}),\quad d=1,\ldots ,D. \end{aligned}$$

Under model (4.3), it holds that

$$\begin{aligned} E\left( {y}\right) ={X}\beta \quad \text {and}\quad {V}=\text {var}({y})={Z}^{\prime }{V}_u{Z}+{V}_e={V}_u+{V}_e =\underset{1\le d \le D}{\hbox {diag}}({V}_d), \end{aligned}$$

where \({V}_{d}={V}_{ud}+{V}_{ed}\), \(d=1,\ldots ,D\). Further, the best linear unbiased estimator (BLUE) of \(\beta \) and the best linear unbiased predictors (BLUP) of u and \(\mu \) are as:

$$\begin{aligned} \hat{\beta }_B=({X}^{\prime }{V}^{-1}{X})^{-1}{X}^{\prime }{V}^{-1}{y},\,\, \hat{u}_B={V}_{u}{Z}^{\prime }{V}^{-1}({y}-{X}\hat{\beta }_B),\,\, \hat{\mu }_B={X}\hat{\beta }_B+{Z}\hat{u}_B.\nonumber \\ \end{aligned}$$
(4.4)

The residual maximum likelihood (REML) method maximizes the joint probability density function of a vector of \(3D-r\) independent contrasts \(\omega ={W}^\prime {y}\), where W is a \(3D\times (3D-r)\) matrix with linearly independent columns and such that \({W}^\prime {W}={I}_{3D-r}\) and \({W}^\prime {X}=0\). It holds that \(\omega \) is independent of the BLUE \(\hat{\beta }_B\) given in (4.4). The joint probability density function of \(\omega \) is the REML likelihood. The REML log-likelihood of model (4.3) is

$$\begin{aligned} {l}_\mathrm{reml}(\theta )= & {} -\frac{3D-r}{2}\log 2\pi +\frac{1}{2}\log |{X}^{\prime }{X}|-\frac{1}{2}\log |{V}|\nonumber \\&- \frac{1}{2}\log |{X}^{\prime }{V}^{-1}{X}| -\frac{1}{2}\,{y}^{\prime }{P}{y}, \end{aligned}$$
(4.5)

where \(\theta =(\theta _1,\ldots ,\theta _6)\), \({P}={V}^{-1}-{V}^{-1}{X}({X}^{\prime }{V}^{-1}{X})^{-1}{X}^{\prime }{V}^{-1}\), \({P}{V}{P}={P}\) and \({P}{X}=0\). Appendix C gives the calculation of the score vector \(S(\theta )=(S_1,\ldots ,S_6)^\prime \) and the Fisher information matrix \(F(\theta )=\big (F_{a,b}\big )_{a,b=1,\ldots ,6}\), where

$$\begin{aligned} S_{a}=\frac{\partial {l}_\mathrm{reml}}{\partial \theta _{a}},\quad F_{ab}=-E\Big [\frac{\partial {l}^2_\mathrm{reml}}{\partial \theta _{a}\partial \theta _{b}}\Big ],\quad a,b=1,\ldots ,6. \end{aligned}$$

Appendix C also gives the REML Fisher scoring algorithm for fitting the TFH model (4.3). To initiate this algorithm, a possible set of starting values is \(\hat{\theta }_{4,0}=\hat{\theta }_{5,0}=\hat{\theta }_{6,0}=0\), \(\hat{\theta }_{k,0}=\hat{\sigma }_{uk,0}^2\), \(k=1,2,3\), where \(\hat{\sigma }_{uk,0}^2\) is the REML or the ML or the Prasad and Rao (1990) moment-based estimator of \(\sigma _{uk}^2\) in the k-th marginal Fay–Herriot model. They can be calculated with the R library sae. Another possibility is to take \(V_{ud}^{(0)}\) as the domain-level variance matrix of the alr transformations of the direct estimators of the category proportions. This is the approach described and implemented around Table 2 in the application to real data.

The output of REML Fisher scoring algorithm, \(\hat{\theta }\), is the REML estimator of \(\theta \). By plugging \(\hat{\theta }\) in \({V}_u\), we get \(\hat{V}_u={V}_u(\hat{\theta })\) and \(\hat{V}=\hat{V}_u+{V}_e\). By substituting \(\hat{V}_u\) in (4.4), we obtain the EBLUP of \(\mu ={X}\beta +{Z}{u}\), i.e.

$$\begin{aligned} \hat{\beta }_C=({X}^{\prime }\hat{V}^{-1}{X})^{-1}{X}^{\prime }\hat{V}^{-1}{y},\quad \hat{u}_C=\hat{V}_{u}{Z}^{\prime }\hat{V}^{-1}({y}-{X}\hat{\beta }_C),\quad \hat{\mu }_C={X}\hat{\beta }_C+{Z}\hat{u}_C,\nonumber \\ \end{aligned}$$
(4.6)

with components

$$\begin{aligned} \hat{\mu }_{Cd}= & {} (\hat{\mu }_{Cd1},\hat{\mu }_{Cd2},\hat{\mu }_{Cd3})^\prime =X_d\hat{\beta }_C+\hat{u}_{Cd},\quad \\ \hat{u}_{Cd}= & {} \hat{V}_{ud}\hat{V}_d^{-1}\big (y_d-X_d\hat{\beta }\big ),\quad d=1,\ldots ,D. \end{aligned}$$

The asymptotic distributions of the REML estimators \(\hat{\theta }\) and \(\hat{\beta }\),

$$\begin{aligned} \hat{\theta }\sim N_6(\theta , {F}^{-1}(\theta )),\quad \hat{\beta }\sim N_r(\beta , ({X}^{\prime }{V}^{-1}{X})^{-1}), \end{aligned}$$

can be used to construct \((1-\alpha )\)-level asymptotic confidence intervals for the components \(\theta _{\ell }\) of \(\theta \) and \(\beta _i\) of \(\beta \), i.e.

$$\begin{aligned} \hat{\theta }_{\ell }\pm z_{\alpha /2}\,\nu _{\ell \ell }^{1/2},\,\, \ell =1,\ldots ,6,\quad \hat{\beta }_{i}\pm z_{\alpha /2}\,q_{ii}^{1/2},\,\, i=1,\ldots ,r, \end{aligned}$$
(4.7)

where \({F}^{-1}(\hat{\theta })=(\nu _{ab})_{a,b=1,\ldots ,6}\), \(({X}^{\prime }{V}^{-1}(\hat{\theta }){X})^{-1}=(q_{ij})_{i,j=1,\ldots ,r}\) and \(z_{\alpha }\) is the \(\alpha \)-quantile of the N(0, 1) distribution. For \(\hat{\beta }_i=\beta _0\), the asymptotic p value for testing the hypothesis \(H_0:\,\beta _i=0\) is

$$\begin{aligned} \text {{ p}-value}=2P_{H_0}(\hat{\beta }_i>|\beta _0|)=2P(N(0,1)> |\beta _0|/\sqrt{q_{ii}}). \end{aligned}$$
(4.8)

We remark that we have changed the notation in (4.7) and (4.8), where \(\beta _i\) denotes the i-th component of the vector \(\beta \) and not the vector of regression parameters of the i-th category.

5 Compositional predictors of proportions and counts

This section denotes the expectations under the distributions of the sampling design, the Fay–Herriot model and the compositional mixed model by \(E_{\pi }\), \(E_\mathrm{FH}\) and \(E_\mathrm{C}\), respectively. Statistical offices are interested in estimating population parameters, like the domain mean \(\bar{Z}_{dk}\) defined in (2.1). However, the domain-level approach to SAE gives predictors of functions of fixed and random model effects (in short, functions of model effects—FME). For example, a Fay–Herriot (FH) model for predicting \(\bar{Z}_{dk}\),

$$\begin{aligned} z_{dk}=x_{dk}\beta +u_{\mathrm{FH},dk}+e_{\mathrm{FH},dk},\quad d=1,\ldots ,D, \end{aligned}$$

with the standard assumptions on \(u_{\mathrm{FH},dk}\)’s and \(e_{\mathrm{FH},dk}\)’s, has \(\mu _{\mathrm{FH},dk}=E_\mathrm{FH}[z_{dk}|u_{\mathrm{FH},dk}]=x_{dk}\beta +u_{\mathrm{FH},dk}\) as the FME of interest. To connect \(\mu _{\mathrm{FH},dk}\) with \(\bar{Z}_{dk}\), it is assumed that: (1) the design-based expectation of \(z_{dk}\) is \(\bar{Z}_{dk}\), i.e. \(E_\pi [z_{dk}]=\bar{Z}_{dk}\), and (2) the design-based expectation is the realization of the model conditional expectation; in short, \(E_\mathrm{FH}[z_{dk}|u_{\mathrm{FH},dk}]=E_\pi [z_{dk}]\). Note that \(E_\mathrm{FH}[z_{dk}|u_{\mathrm{FH},dk}]\) is a function of \(u_{\mathrm{FH},dk}\). Similar assumptions are done for the MLM model described in Appendix B, i.e. (1) \(E_\pi [\bar{z}_{dk.}]=\bar{Z}_{dk}\), (2) \(E_{M}[\bar{z}_{dk.}|u_{M,dk}]=E_\pi [\bar{z}_{dk.}]\).

In practice, assumption (1) rarely holds. The direct estimator \(z_{dk}\) is not calculated with the inverse of the inclusion probabilities, but with sampling weights (expansion factors) that are not calibrated to domain totals. Therefore, \(z_{dk}\) is, in general, biased with respect to the sampling design distribution. Similarly, \(\bar{z}_{dk.}\) is also biased for estimating \(\bar{Z}_{dk}\), as it is calculated with weights equal to one. Therefore, predictors based on both models, FH and MLM, have problems to fulfil assumption (1) when dealing with real data.

Assumption (2) might be accepted if the area-level model has a “good” fit to data, which may happen if the set of auxiliary variables is highly correlated with the dependent variable. Further, by analogy to the GREG estimator, the auxiliary variables produce a calibration effect to their domain totals and a reduction of the design-based bias. In many cases, the assumptions (1) and (2) “approximately” hold and the empirical best predictor (EBP) will have some small bias, as the direct estimator has, but lower variance. So it is worthwhile to calculate EBPs based on area-level models, as they will tend to have lower design-based MSEs than direct estimators.

This paper introduces a compositional model for \(z_d\) that is introduced by assuming a TFH model on the alr transformed vector \(y_d\). In terms of \(y_d\), the compositional model does not fulfil assumption (1), because the direct estimator \(y_d\) is not design-based unbiased for the alr transformation of \(\bar{Z}_{dk}\), i.e. \(E_\pi [y_{dk}]\ne \log (\bar{Z}_{dk}/\bar{Z}_{d4})\), \(k=1,2,3\). For fulfilling the assumption (2), the FMEs to be predicted under the TFM model should be

$$\begin{aligned} E_C[z_{dk}|u_d]= & {} \int _{R^3}\frac{\exp \{y_{dk}\}}{1+\exp \{y_{d1}\}+\exp \{y_{d2}\}+\exp \{y_{d3}\}}\, f_{N_3(X_d\beta +u_d,V_{ed})}(y_d)\,dy_d,\quad \\&k=1,2,3, \end{aligned}$$

and \(E_C[z_{d4}|u_d]=1-E_C[z_{d1}|u_d]-E_C[z_{d2}|u_d]-E_C[z_{d3}|u_d]\), \(d=1,\ldots ,D\), which are nonlinear functions of \(u_d\) based on integrals that cannot be solved analytically. Further, the EBPs of \(E_C[z_{dk}|u_d]\) are non-analytically tractable integrals of \(E_C[z_{dk}|u_d]\) with respect to the density of \(u_d\) conditioned to \(y_d\). This is why we propose predicting the alogist transformations of \(\mu _{Cdk}=E_C[y_{dk}|u_d]\) under the TFH model, i.e.

$$\begin{aligned} {p}_{Cdk}= & {} \frac{\exp \{{\mu }_{Cdk}\}}{1+\exp \{{\mu }_{Cd1}\}+\exp \{{\mu }_{Cd2}\}+\exp \{{\mu }_{Cd3}\}},\,\,\\&{p}_{Cd4}=1-{p}_{Cd1}-{p}_{Cd2}-{p}_{Cd3},\,\,\, k=1,2,3. \end{aligned}$$

For the clr and ilr transformations, the alternative vectors are \(p_{Cd}^{clr}=\text {clr}^{-1}(\mu _{Cd})\) and \(p_{Cd}^{ilr}=\text {ilr}^{-1}(\mu _{Cd})\), respectively, where \(\mu _{Cd}=(\mu _{Cd1},\mu _{Cd2},\mu _{Cd3})\).

By predicting \({p}_{Cdk}\) instead of \(E_C[z_{dk}|u_d]\), we gain computational efficiency, but we have problems with assumption (2), as \(\bar{Z}_{dk}\) might not be considered as the realization of \({p}_{Cdk}\); in short, \({p}_{Cdk}\ne E_\pi [\bar{z}_{dk.}]\approx \bar{Z}_{dk}\). However, the disadvantage of not fulfilling the assumptions (1) and (2) can be compensated, as in practice happens with the FH and MLM models, by the use of auxiliary variables correlated with the objective variable and with the modelling of the covariance structure of the categories. So that what is lost on the one hand is recovered on the other.

In this section, two predictors of \({p}_{Cdk}\) are proposed. The compositional plug-in predictors, \(\hat{{p}}_{Cd}=(\hat{{p}}_{Cd1},\hat{{p}}_{Cd2},\hat{{p}}_{Cd3})^{\prime }\) and \(\hat{{p}}_{Cd4}\), of the proportions \({p}_{Cd}=({p}_{Cd1},{p}_{Cd2},{p}_{Cd3})^{\prime }\) and \({p}_{Cd4}\) are jointly obtained by applying the alogist transformation to \(\hat{\mu }_{Cd}=(\hat{\mu }_{Cd1},\hat{\mu }_{Cd2},\hat{\mu }_{Cd3})^{\prime }\), i.e. \(\hat{{p}}_{Cd}=\text {alogist}(\hat{\mu }_{Cd})\) or equivalently

$$\begin{aligned} \hat{{p}}_{Cdk}= & {} \frac{\exp \{\hat{\mu }_{Cdk}\}}{1+\exp \{\hat{\mu }_{Cd1}\}+\exp \{\hat{\mu }_{Cd2}\}+\exp \{\hat{\mu }_{Cd3}\}},\,\,\\ \hat{{p}}_{Cd4}= & {} 1-\hat{{p}}_{Cd1}-\hat{{p}}_{Cd2}-\hat{{p}}_{Cd3},\,\,\, k=1,2,3. \end{aligned}$$

Similarly, we may consider the plug-in predictors \(\hat{{p}}_{Cd}^{clr}=\text {clr}^{-1}(\hat{\mu }_{Cd})\) or \(\hat{{p}}_{Cd}^{ilr}=\text {ilr}^{-1}(\hat{\mu }_{Cd})\). The compositional plug-in predictors of the domain proportions \(\bar{Z}_{dk}\) are \(\hat{\bar{Z}}_{Cdk}=\hat{{p}}_{Cdk}\), and the compositional plug-in predictors of the counts \(Z_{dk}=N_{d}\bar{Z}_{dk}\) are \(\hat{Z}_{Cdk}=N_{d}\hat{\bar{Z}}_{Cdk}\), \(k=1,2,3,4\).

On the other hand, the compositional best predictors of the proportions \({p}_{Cdk}\) are \(\hat{{p}}_{Bdk}=\hat{{p}}_{Bdk}(\beta ,\theta )=E_{\beta ,\theta }[{p}_{Cdk}|\,y_d]\), \(k=1,2,3\), and \(\hat{{p}}_{Bd4}=1-\hat{{p}}_{Bd1}-\hat{{p}}_{Bd2}-\hat{{p}}_{Bd3}\). It holds that

$$\begin{aligned} \hat{{p}}_{Bdk}= & {} \frac{\int _{R^3}\frac{\exp \{{\mu }_{Cdk}\}}{1+\exp \{{\mu }_{Cd1}\}+\exp \{{\mu }_{Cd2}\}+\exp \{{\mu }_{Cd3}\}}f(y_d|u_d)f(u_d)\,du_d}{\int _{R^3}f(y_d|u_d)f(u_d)\,du_d}= \frac{A_{dk}(y_d,\beta ,\theta )}{B_d(y_d,\beta ,\theta )}, \end{aligned}$$

where \({\mu }_{Cdk}=x_{dk}\beta _k+u_{dk}\), \(y_d|u_d\sim N_3(X_d\beta +u_d, V_{ed})\), \(u_d\sim N_3(0, V_{ud}(\theta ))\) and

$$\begin{aligned} A_{dk}(y_d,\beta ,\theta )= & {} \int _{R^3}\frac{\exp \{{\mu }_{Cdk}\}\exp \big \{-\frac{1}{2} (y_d-X_d\beta -u_d)^\prime V_{ed}^{-1}(y_d-X_d\beta -u_d)\big \}}{1+\exp \{{\mu }_{Cd1}\}+\exp \{{\mu }_{Cd2}\}+\exp \{{\mu }_{Cd3}\}}\\&f_\theta (u_d)\,du_d, \\ B_d(y_d,\beta ,\theta )= & {} \int _{R^3}\exp \left\{ -\frac{1}{2} (y_d-X_d\beta -u_d)^\prime V_{ed}^{-1}(y_d-X_d\beta -u_d)\right\} f_\theta (u_d)\,du_d. \end{aligned}$$

The compositional EBP of \({p}_{Cdk}\) is \(\hat{{p}}_{Edk}=\hat{{p}}_{Bdk}(\hat{\beta },\hat{\theta })\), \(k=1,2,3,4\), and can be approximated by the following Monte Carlo algorithm:

  1. 1.

    Calculate the TFH model parameters \(\hat{\beta }\) and \(\hat{\theta }\).

  2. 2.

    For \(\ell =1,\ldots ,L\), \(d=1,\ldots ,D\), generate \(u_d^{(\ell )}\) i.i.d. \(N_3(0,V_{ud}(\hat{\theta }))\) and do \(u_d^{(L+\ell )}=-u_d^{(\ell )}\).

  3. 3.

    Calculate \(\hat{{p}}_{Edk}=\hat{{p}}_{Bdk}(\hat{\beta },\hat{\theta })=\hat{A}_{dk}/\hat{B}_d\), where

    $$\begin{aligned} \hat{A}_{dk}= & {} \frac{1}{2L}\sum _{\ell =1}^{2L}\frac{\exp \{\hat{\mu }_{Cdk}^{(\ell )}\}\exp \left\{ -\frac{1}{2} (y_d-X_d\hat{\beta }-u_d^{(\ell )})^\prime V_{ed}^{-1}(y_d-X_d\hat{\beta }-u_d^{(\ell )})\right\} }{1+\exp \{\hat{\mu }_{Cd1}^{(\ell )}\}+\exp \{\hat{\mu }_{Cd2}^{(\ell )}\}+\exp \{\hat{\mu }_{Cd3}^{(\ell )}\}}, \\ \hat{B}_d= & {} \frac{1}{2L}\sum _{\ell =1}^{2L}\exp \left\{ -\frac{1}{2} (y_d-X_d\hat{\beta }-u_d^{(\ell )})^\prime V_{ed}^{-1}(y_d-X_d\hat{\beta }-u_d^{(\ell )})\right\} , \end{aligned}$$

    where \(\hat{\mu }_{Cdk}^{(\ell )}=x_{dk}\hat{\beta }_k+u_{dk}^{(\ell )}\), \(d=1,\ldots ,D\), \(k=1,2,3\), \(\ell =1,\ldots ,2L\).

  4. 4.

    Calculate \(\hat{{p}}_{Ed4}=1-\hat{{p}}_{Ed1}-\hat{{p}}_{Ed2}-\hat{{p}}_{Ed3}\), \(d=1,\ldots ,D\).

The compositional EBPs of the domain proportions \(\bar{Z}_{dk}\) are \(\hat{\bar{Z}}_{Edk}=\hat{{p}}_{Edk}\), \(k=1,2,3,4\). The compositional EBPs of the counts \(Z_{dk}=N_{dk}\bar{Z}_{dk}\) are \(\hat{Z}_{Edk}=N_{dk}\hat{\bar{Z}}_{Edk}\), \(k=1,2,3,4\).

For the clr and ilr transformations, the EBP of \(p_{Cdk}\) is obtained by substituting the k-th component of the alogist transformation:

$$\begin{aligned} \text {alogist}_k(\mu _{Cd})=\frac{\exp \{{\mu }_{Cdk}\}}{1+\exp \{{\mu }_{Cd1}\}+\exp \{{\mu }_{Cd2}\}+\exp \{{\mu }_{Cd3}\}}, \end{aligned}$$

by the corresponding k-th component of \(\text {clr}^{-1}\) and \(\text {ilr}^{-1}\) transformations, respectively.

The estimation of compositions, like domain proportions of categories of a classification variable, requires the selection of the last (q-th) category. The last category is not a control or reference category as it happens in some ANOVA-type statistical analyses. The introduced methodology is not invariant with respect to the selection of the category q, but provides together the predictors of the proportions of the q categories. In practice, the selection of the first \(q-1\) categories should be based on the available explanatory variables. A good approach is to select as target categories (\(k=1,\ldots ,q-1\)) those ones that can be better explained by the auxiliary variables.

For every domain d, the predictors based on multinomial or compositional models fulfil the two conditions: (1) \(0\le p_{dk}\le 1\), \(k=1,\ldots ,q\), and (2) \(\sum _{k=1}^q p_{dk}=1\). Estimating \(p_{dk}\) with predictors based on univariate area-level mixed models, like Fay–Herriot or binomial logit, is not a good option because the conditions (1) and (2) might not be fulfilled.

Concerning the estimation of the MSEs of the compositional plug-in predictors or EBPs, we follow the parametric bootstrap approach of González-Manteiga et al. (2008a). The steps of the bootstrap resampling algorithm are as follows:

  1. 1.

    Fit the TFH model (4.3) to the data \((y_d,X_d)\), \(d=1,\ldots ,D\), and calculate \(\hat{\beta }\) and \(\hat{\theta }\).

  2. 2.

    Generate \(u_d^{*(b)}\sim N_3\big (0,{V}_{ud}(\hat{\theta })\big )\), \(e_d^{*(b)}\sim N_3(0,V_{ed})\), \(\mu _d^{*(b)}=X_{d}\hat{\beta }+u_{d}^{*(b)}\), \(y_{d}^{*(b)}=\mu _d^{*(b)}+e_{d}^{*(b)}\), \(p_{d}^{*(b)}=\text {alogist}(\mu _d^{*(b)})\), \(d=1,\ldots ,D\).

  3. 3.

    Calculate \(\hat{p}_{d}^{*(b)}\in \{\hat{p}_{Cd}^{*(b)},\hat{p}_{Ed}^{*(b)}\}\) based on the data \((y_{d}^{*(b)},X_d)\), \(d=1,\ldots ,D\).

  4. 4.

    Repeat B times, \(b=1,\ldots ,B\), the steps 2–3 and calculate the bootstrap MSE estimator

    $$\begin{aligned} mse_{dk}^{*}=\frac{1}{B}\sum _{b=1}^B\big (\hat{p}^{*(b)}_{dk}-p^{*(b)}_{dk}\big )^2,\,\,\, d=1,\ldots ,D,\,\,k=1,2,3. \end{aligned}$$

6 Application to Labour Force Survey data

This section gives an application of the alr compositional mixed model to the SLFS data described in Sect. 2. The TFH model is fitted to the target data and to a set of significant auxiliary aggregated variables taken from the administrative registers: PMH containing the official demographic data at municipal level, SSoc containing data from the Social Security System and SPEG containing data of employment claimants.

The considered domain-level auxiliary variables are as:

  • SS: proportion of population registered in SSoc.

  • REG: proportion of population registered as unemployed in SPEG.

  • to15, 16to24, 25to54, 55to: proportion of population aged \(\le \,15\), 16–24, 25–54 and \(\ge \,55\) registered in PMH.

The target parameters are the proportions of the four categories of the variable labour status, i.e. \(\le 15\) years, employed, unemployed and inactive people per sex in counties of Galicia. We are interested in estimating domain compositions with \(q=4\) categories. As the explanatory variables, to15, SS and REG, are highly correlated with \(\le 15\) years, employed and unemployed, respectively, for fitting compositional or multinomial mixed models to the SLFS data, we number the labour status categories from 1 to 4, so that inactive is the fourth one (q-th category).

We denote the alr transformations of the direct estimators of the category proportions by \(y_{dk}=\log (z_{dk}/z_{dq})\), \(k=1,2,3\). For the sake of brevity, we do not present the data analyses with the clr or ilr transformations. Keeping these assumptions in mind, below we present the results of the application to SLFS data.

Table 2 presents the domain-level correlations calculated for the sets \(\{(y_{dk_1}, y_{dk_2}): d=1,\ldots ,D\}\), \(k_1,k_2=1,2\). Because of the alr transformation, the domain-level correlation patterns of the y-variables are different from the corresponding ones of the z-variables given in Table 1. The correlations in Table 2 are used as seeds for the matrices \(V_{ud}\) in the Fisher scoring algorithm that calculates the REML estimators of the selected TFH model.

Table 2 Domain-level correlations for \(y_{dk_1},y_{dk_2}\)

Table 3 presents domain-level correlations for the response variables \(y_{dk}\) of the TFH model and the auxiliary variables \(x_{dki}\). For the k-th category and the i-th auxiliary variable, the correlations are calculated for the set of values \(\{(y_{dk}, x_{dki}): d=1,\ldots ,D\}\). Reading this table by rows, it is observed that to15 is the most positively correlated variable for \(y_1\) and 25to54 and SS are the most positively correlated variables for \(y_2\). Concerning \(y_3\), the first five auxiliary variables have a similar positive correlation. As expected, 55to is negatively correlated with \(y_1\), \(y_2\) and \(y_3\).

Table 3 Domain-level correlations for \(y_{dk},x_{dki}\)
Table 4 TFH model parameter estimates

A set of appropriate auxiliary variables is selected, and the corresponding TFH model is fitted to the data \((y_{dk}, x_{dki})\), \(d=1,\ldots ,D\), \(k=1,2,3\), \(i=1,\ldots ,r_k\). For the alr transformation, Table 4 presents the estimates of the regression parameters for the TFH model and their estimated standard deviations. It also presents the p-values, defined in (4.8), for testing the hypothesis \(H_0: \beta _{ki}=0\), \(k=1,2,3\), \(i=1,\ldots ,r_i\). Intercept parameters for \(y_1,y_2,y_3\) are denoted by \(c_1,c_2,c_3\), respectively. Appendix D presents the corresponding tables for the transformations clr and ilr. Based on those tables and on further model diagnostics, we select the alr transformation.

As all the auxiliary variables are proportions, the sign and the magnitude of the regression parameters give interesting interpretations. The alr transformation of the direct estimator of the SLFS proportion of people under 16 years, \(y_{d1}\), is solely explained with positive sign by the corresponding PMH proportion to15. For the proportion of employed people, \(y_{d2}\) tends to be greater in those domains with larger proportion of people registered in SSoc and greater proportion of working age people. The proportions of people SS and 25to54 are more relevant than 16to24 for predicting \(y_{d2}\). For the proportion of unemployed people, \(y_{d3}\) tends to be greater in those domains with larger proportion of people registered as unemployed in SPEG and greater proportion of working age people. As expected, REG is more important for predicting \(y_{d3}\) than 16to24 and 25to54. We have further fitted a second TFH model with 55to in the place of 16to24 and 25to54. The second model gives similar predictions because the auxiliary variables to15, 16to24, 25to54 and 55to sum up to one, and therefore, 55to and {16to24, 25to54} give basically the same information for predicting \(y_{d2}\) and \(y_{d3}\).

Table 5 gives the 95% confidence intervals (CIs) of the variance components, defined in (4.7), for \(\theta _1,\ldots ,\theta _6\). We observe that the variances \(\sigma _{u1}^2\), \(\sigma _{u2}^2\), \(\sigma _{u3}^2\) and the correlation \(\rho _{12}\) are significantly greater than zero.

Table 5 CIs of variance components

Figure 2 plots the dispersion graphs of the target variables \(y_{d1}\) (left), \(y_{d2}\) (centre) and \(y_{d3}\) versus their auxiliary variables with larger regression parameter, i.e. to15, SS and REG, respectively. Accordingly with Table 3, we observe the linear patterns related to positive high linear correlations.

Fig. 2
figure 2

Dispersion graphs of \(y_{d1}\), \(y_{d2}\) and \(y_{d3}\) versus to15, SS and REG, respectively

Figures 3 and 4 plot the marginal residuals and standardized marginal residuals of the fitted TFH model. The residuals are rather symmetric around zero and do not present any relevant pattern. Further, there are few standardized residuals (7 among 306) outside the interval (\(-3, 3\)).

Fig. 3
figure 3

Marginal residuals of the fitted TFH model

Fig. 4
figure 4

Standardized marginal residuals of the fitted TFH model

Figure 5 plots the compositional versus the direct estimates of the proportions of people under 16 years (left), employed (centre) and unemployed (right). We observe that model-based estimators take values rather symmetrically around the direct estimates. As the direct estimators of proportions are basically design-based unbiased, Fig. 5 suggests that compositional estimators partially share this property.

Fig. 5
figure 5

Compositional versus direct estimated proportions of people under 16 years (left), employed (centre) and unemployed (right)

In addition to the compositional plug-in predictors based on the selected compositional mixed model, we also calculate the direct estimators and the multinomial predictors. Taking into account the recommendations of López-Vizcaíno et al. (2013) and the last comment of Appendix B, the MLM model is fitted to the vectors of sample counts \(z_{d.}=(z_{d1.},z_{d2.},z_{d3.})\) with sample sizes \(n_d\), \(d=1,\ldots ,D\). For the sake of comparability, we use the same auxiliary variables as the fitted compositional model. This is to say, the auxiliary variables are listed in Table 4. Figure 6 plots the direct, multinomial and compositional plug-in estimated proportions of people under 16 years (left), employed (centre) and unemployed (right) for domains sorted by sample size. For the three categories, the compositional plug-in predictors present the smoothest behaviour across domains.

Fig. 6
figure 6

Direct, multinomial and compositional estimated proportions of people under 16 years (left), employed (centre) and unemployed (right) for domains sorted by sample size

Figure 7 plots the design-based estimates of the root mean squared errors (RMSE) of the direct estimators (D) and the parametric bootstrap estimates of the RMSEs of the multinomial and the compositional plug-in predictors. For the sake of comparability, the RMSEs of the multinomial (MC) and the compositional (CC) predictors are calculated under the assumption that the distribution of the fitted TFH Fay–Herriot model is the true one. Therefore, we run the bootstrap procedure by generating data from the fitted TFH model. The RMSEs of the corresponding direct estimates are estimated by applying formula (2.4). Nevertheless, we know that in real life there are no true models, but useful models. Therefore, we also calculate parametric bootstrap estimates of the RMSEs of the multinomial predictors assuming that the multinomial model is valid. The new estimated RMSEs (MM) can be interpreted as a measure of the goodness of fit of the multinomial model to the data. Figure 7 shows that the compositional plug-in predictors have the best results in terms of RMSE.

Fig. 7
figure 7

Estimated RMSEs of direct and model-based estimators of proportions of under 16 years, employed and unemployed people

Figures 8 and 9 map the compositional plug-in estimated county proportions of men (left) and women (right) under 16 years old and inactive, respectively. The colours are darker in areas with higher proportions. We observe that the counties with the youngest people are in the west coast, in the provinces of Coruña and Pontevedra where it is the Atlantic motorway (from Vigo to Coruña). On the contrary, it can be observed that the counties that are in the east of Galicia (in the provinces of Lugo and Ourense), in general terms, have a lot of inactive population, more than the 50% of their population. The Costa da Morte counties (in the west) are also in this situation.

Fig. 8
figure 8

Estimated county proportions of men (left) and women (right) under 16 years old

Fig. 9
figure 9

Estimated county proportions of inactive men (left) and women (right)

Fig. 10
figure 10

Estimated county proportions of employed men (left) and women (right)

Fig. 11
figure 11

Estimated county proportions of unemployed men (left) and women (right)

Table 6 Estimated men proportions and their estimated RMSEs
Table 7 Estimated women proportions and their estimated RMSEs

Figures 10 and 11 plot the compositional plug-in estimated county proportions of men (left) and women (right) that are employed and unemployed, respectively. Concerning employment, we observe that the touristic counties around the Santiago trail have the largest proportions. On the contrary, the industrial areas around Vigo (south-west) and the agricultural areas of Ourense (south-east) present the largest proportions of unemployed people.

Figure 9 shows that the quantiles of the distribution of the proportion of inactive women are displaced to the right with respect to that of men. Therefore, the proportions of inactive women tend to be greater than those of men in the regions of Galicia. Figures 10 and 11 give the opposite conclusion for employed and unemployed people, respectively.

Tables 6, 7 and 8 present some condensed numerical results for men and women, respectively. The tables have been constructed in two steps. The domains are sorted by province. Within each province, the domains are sorted by sample size, starting by the domain with the smallest sample size. A selection of 18 domains out of 51 is done from the positions \(1,4,7,\ldots ,49\) and 51. The tables give the direct and the compositional plug-in estimates (labelled by “dir” and “mod”, respectively) and the corresponding RMSE estimates. The provinces are labelled by “prov”, with codes given in Sect. 2, and the sample sizes are denoted by n.

Tables 6 and 7 are partitioned in three vertical sections dealing with the estimation of proportion of people under 16 years, employed and unemployed people. Table 8 contains the estimated proportions of inactive men and women and the corresponding RMSE estimates. By observing the columns of RMSEs, we conclude that the compositional plug-in predictors are preferred to the direct estimators.

We recall that the compositional predictors of the four categories are derived from the fitted TFH model by applying the alogist transformation, so they are jointly calculated. The last category (inactive) is not a reference category. Their results appear in Table 8 because of the lack of space when constructing Tables 6 and 7.

Table 8 Estimated inactive proportions and RMSEs

7 Conclusions

This paper introduces predictors of category proportions based on an area-level compositional mixed model. A TFH model is introduced for modelling the additive logratio transformations of the direct estimators of the category proportions. The additive logistic transformations of the EBLUPs under the TFH model are the proposed compositional plug-in predictors. Similarly, the centred or the isometric logratio transformation (see Appendix D) can be employed.

In addition, the compositional empirical best predictors are also introduced and empirically investigated. The first predictor is easy to calculate, but the second one requires the approximation of integrals in \(R^3\). This numerical problem demands rather high computational time to achieve an acceptable precision. Because of this issue and the results of Simulation 2, the use of the compositional plug-in predictor is recommended.

Unlike the multinomial approach of López-Vizcaíno et al. (2013), the compositional predictors take into account the sampling design. The sampling weights are employed by means of the direct estimators of the category proportions and through the estimators of their design-based variances and covariances.

As discussed in Section 4, both models (MLM and compositional) have problems to predict the population parameter \(\bar{Z}_{dk}\). However, these problems are compensated by the use of auxiliary variables correlated with the objective variable and with the modelling of the covariance structure of the categories. It is at this point where the compositional mixed model presents its greatest advantages compared to the MLM model. The compositional approach is more flexible and can be adapted to different correlation structures, while multinomial correlation structure is rather rigid (correlations negative) and does not fit well to both types of covariances and variances (unit level and domain level). The results of four simulation experiments and the analysis of real data confirm this assertion.

The new small area estimation methodology is applied to Labour Force Survey data from Galicia, a region in north-west of Spain, in the period January–April 2017. The selected compositional mixed model has had a better fit to the data than the corresponding multinomial mixed model, and therefore, the labour status proportions per county and sex are finally estimated by using the compositional plug-in predictors with its mean squares errors calculated by parametric bootstrap.

As for the labour market results in Galicia, we can conclude that the west coast, in general terms, is the most dynamic part with a higher proportion of employed people and people under 16 years old. There is a big problem in the south-east because there are several counties with a proportion of inactive people over the 50%. This area is essentially rural. Fixing the population in rural areas, with a decent living and income levels, is one basic requirement to ensure territorial balance and environmental sustainability. Galicia is ageing and more accentuated in rural areas, due to the emigration of young people and the fall in the birth rate. This reality, with obvious economic and sociological consequences, demands answers from social and employment policies.

Although the introduced methodology is applied to the variable “labour status” with \(q=4\) categories, it can be extended to categorical variables with any number \(q\ge 2\) of categories. However, we only recommend their use for the cases \(q=2,3,4,5\). This paper presents the mathematical developments for the case \(q=4\) leading to a TFH model with six variances or covariance parameters. The cases \(q=2\) and \(q=3\) are more simpler as they require fitting univariate and bivariate Fay–Herriot models with one and three variance components, respectively. The case \(q=5\) yields to a MFH model with ten variance components, where the fitting algorithm might have convergence problems when the number of domains D is not big enough.

The application to real data can be carried out in the subpopulation of people aged 16 or more, where the number of categories is \(q=3\) (employed, unemployed and inactive). This simpler scenario requires fitting a bivariate Fay–Herriot model to the transformed survey data and obtaining the predictions of the domain proportions of the three considered categories in a similar way as it is done in the case of \(q=4\) categories. We have decided to follow the more complex TFH approach because of the strength of the available auxiliary variables and to guarantee the coherence of the estimates of the four category proportions. For the categories \(\le 15\), employed and unemployed, we have the highly correlated variables to15, SS and REG. This last fact also motivated the selection of inactive as the fourth (reference) class.

Compositional data play an important role in public statistics. In this case, we applied the introduced methodology to the SLFS, but it is useful in other topics of the official statistics, like the classification of the population by the educational level, the income level or the type of household expenditure. In all these situations, it is necessary to take into account the simplex constraints.

This work may be extended to incorporate spatial correlations between domains. In practice, it is often reasonable to assume that the effects associated with neighbouring areas are proportionally correlated with a measure of distance. Also, it is important to know the evolution of the labour market, along the quarters for the counties, in an accurate and stable form, suitable for being used in statistical offices. Therefore, extensions of the introduced methodology to models incorporating temporal correlations are also a future research task.