1 Introduction

The use of social indicators at an appropriately disaggregated geographical scale is a relevant issue, not only for research but also for policy-making, because efficient policy design sometimes requires information with a high degree of spatial disaggregation. Official data, however, are rarely produced at a high level of spatial detail and are generally only available at a national or, sometimes, regional level. If the geographical location of individual agents were observable at a highly disaggregated spatial scale, the indicators for the small areas could be calculated simply by summing or averaging the individual values. In most cases, however, researchers have to deal with databases that might allow for a precise spatial location of the individuals, such as the microdata of a Population Census, but that do not normally contain information on economic variables related to distributional issues, such as household income. On the other hand, the surveys that contain data about these indicators, such as household surveys, do not normally provide detailed information on the geographical location of the individuals surveyed. Even when they do, the small sample sizes for the sub-regions of interest make the direct estimates very unstable (see Marchetti and Secondi 2017, for a recent application of this type of direct estimation).

Small Area Estimation (SAE) deals with the problem of providing reliable estimates of these indicators based on the information collected by conducting surveys in some or all areas. SAE techniques are increasingly used to provide estimates for local areas of interest (for a recent application, see Durán and Condorí 2019). Generally speaking, SAE methodologies combine direct estimates with some auxiliary information. The comparative performance of different techniques in this field has been recently evaluated by Guadarrama et al. (2016). Within the group of methodologies that aim at producing social indicators for small areas, there is one in particular that has been widely applied. Specifically, the original proposal by Elbers et al. (2003) and the modification proposed later in Tarozzi and Deaton (2009) have been used by the World Bank to map poverty across small areas in poor countries, and they have later been applied in several developing countries (see Modrego and Berdegué, 2015) and in more advanced economies (see Sánchez-Cantalejo et al. 2008; Melo et al. 2016; Morales et al. 2018, for recent examples).

The basic idea of this procedure consists of “projecting” predictions of the variable of interest, estimated on a household survey, onto the households recorded in a population census. In a nutshell, the procedure comprises three steps:

  1. From the household survey (HS), estimate a model of the variable of interest \(y\), \(y = f(X)\), where \(X\) is a set of regressors also observable in the Population Census (PC).
  2. Recover the set of parameters \(\beta\) estimated on the HS (with some degree of heterogeneity across regions or clusters of households) and take them to the PC.
  3. Given the \(X\) observable in the PC and the corresponding \(\widehat{\beta }\), predict the values of \(y\) for the households surveyed in the PC (\(\hat{y} = X\hat{\beta }\)).
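The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions: the data are simulated, the model is a plain linear regression estimated by ordinary least squares, and the heterogeneity across regions or clusters mentioned in step 2 is ignored. All names (`X_hs`, `X_pc`, `area`, `beta_true`) and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical household survey (HS): small sample with y and X observed.
n_hs, n_pc = 200, 5000
beta_true = np.array([1.0, 0.5, -0.3])           # illustrative values only
X_hs = np.column_stack([np.ones(n_hs), rng.normal(size=(n_hs, 2))])
y_hs = X_hs @ beta_true + rng.normal(scale=0.1, size=n_hs)

# Hypothetical population census (PC): same regressors X, but y is not observed.
X_pc = np.column_stack([np.ones(n_pc), rng.normal(size=(n_pc, 2))])
area = rng.integers(0, 10, size=n_pc)            # small-area code of each household

# Step 1: estimate the model on the survey (ordinary least squares here).
beta_hat, *_ = np.linalg.lstsq(X_hs, y_hs, rcond=None)

# Steps 2 and 3: take beta_hat to the census and predict y for every household.
y_pc_hat = X_pc @ beta_hat

# Small-area indicator: average the household-level predictions within each area.
area_means = np.array([y_pc_hat[area == d].mean() for d in range(10)])
```

Because the predictions are available for every census household, they can be averaged at whatever geographical level the census identifies.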

The estimates produced have the advantage of a higher precision than previous methodologies, due to the large number of households in the census (see Tarozzi and Deaton, 2009). Additionally, this large number of estimates allows for studying potential differences between the individuals or households belonging to the same small area.

This feature is highly appealing to the social researcher, and this is the approach that we follow in this paper. One problem derived from applying this type of estimation, however, is that the estimates are not necessarily consistent with the aggregates that are already observable: for example, the mean value of the estimates of household income produced by the techniques at household level for, say, a given region may not necessarily be equal to the mean regional household income available on the official databases.

In order to overcome this limitation, we propose an alternative methodology that adjusts the estimates to official observable aggregates by incorporating information on these aggregates into the estimation problem. We propose to make this correction based on the info-metrics framework detailed in Golan (2018) and with a procedure similar to the Generalized Maximum Entropy (GME) estimator presented in Bernadini-Papalia and Fernández-Vázquez (2018).

The rest of the paper is organized as follows. Section 2 describes the main characteristics of the technique proposed. In Sect. 3 we compare its performance with the original formulation by Tarozzi and Deaton (2009) and illustrate how the proposed technique can be applied to the empirical problem of recovering at risk of poverty and exclusion rates for small areas (municipalities) in a Spanish region. Finally, in Sect. 4 we draw some conclusions.

2 Spatial Disaggregation of Social Indicators Within an Infometrics Framework

The disaggregation methodology that we propose in this paper tries to exploit the information contained in the available databases while making minimal assumptions. The methodology suggested here follows the basics presented in Golan et al. (1996) and Golan (2018).

2.1 The GME Moment-Constrained Estimator

Assume that the estimation objective is to recover the cells of matrix \(P\), with dimension \(n \times J\), with a typical element \(p_{ij}\) that reflects the probability of category \(j\) for individual \(i\), with \(p_{ij} \ge 0\) and \(\mathop \sum \nolimits_{j = 1}^{J} p_{ij} = 1;\quad i = 1,..,n\). The basic idea of the info-metrics framework consists of choosing as the solution the one that requires the least information (i.e., maximizes the uncertainty or entropy) of the distribution while being consistent with the observed data. This is achieved by maximizing an entropy (uncertainty) function as in Eq. (1):

$$Ent\left( P \right) = - \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{J} p_{ij} {\ln}\left( {p_{ij} } \right)$$
(1)

This function reaches its unconstrained maximum when the \(n\) rows in \(P\) distribute uniformly across the \(J\) categories (see Shannon, 1948 for additional details). Any additional information—observed data—is included as a constraint and it makes the solution depart from the uniform distribution.
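As a quick numerical check of this property, the following sketch evaluates Eq. (1) for a uniform and a non-uniform matrix \(P\). The function name `entropy` and the example matrices are illustrative assumptions, not part of the original formulation.

```python
import math

def entropy(P):
    """Shannon entropy of Eq. (1) for a list of probability rows; 0*ln(0) is taken as 0."""
    return -sum(p * math.log(p) for row in P for p in row if p > 0)

n, J = 3, 4
uniform = [[1 / J] * J for _ in range(n)]           # every row uniform over J categories
skewed = [[0.7, 0.1, 0.1, 0.1] for _ in range(n)]   # rows concentrated on one category

h_uniform = entropy(uniform)   # equals n * ln(J), the unconstrained maximum
h_skewed = entropy(skewed)     # strictly smaller than h_uniform
```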

This basic idea can be adapted to predict and model categorical variables (see Golan 2018, Chapter 11). Consider a sample consisting of \(n\) observations of basic units, which for the sake of simplicity will be assumed to be individuals but which can be easily generalized to other types of observations, such as households or firms. Let us assume that the variable of interest is a categorical indicator that reflects whether the individual belongs to a specific class. Each data point observed in the sample belongs to one single category of the \(J\) unordered alternatives and each observation is coded as a binary variable, which takes value 1 for the realized category and value 0 for the rest. For each individual \(i = 1, \ldots ,n\), we define a variable \(y_{ij}\) that equals 1 if and only if alternative \(j\) is observed and zero otherwise. This implies that the information on the indicator of interest contained in the sample can be presented in the form of an \(n \times J\) matrix \(Y\):

$$Y = P + U$$
(2)

Equation (2) indicates that the observed data in the sample are specified as a function of the unobserved probability \(P\) plus some noise contained in matrix \(U\). Our objective is to predict the elements of \(P\) based on the observed data, assuming that the cells \(p_{ij}\) of matrix \(P\) are related to a set of explanatory covariates contained in the \(n \times K\) matrix \(X\), which contains \(K\) characteristics of the individuals. In order to accommodate these relations between \(P\) and \(X\) in a flexible way, Golan (2018) proposes the use of the cross-moment equations:

$$\frac{1}{n}X^{T} Y = \frac{1}{n}X^{T} \left[ {P + U} \right] = \frac{1}{n}X^{T} \left[ {P + WV} \right]$$
(3)

that serve as constraints in the following optimization program:

$$\mathop {Max}\limits_{P,W} Ent\left( {P,W} \right) = - \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{J} p_{ij} \ln \left( {p_{ij} } \right) - \mathop \sum \limits_{l = 1}^{L} \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{J} w_{ijl} \ln \left( {w_{ijl} } \right)$$
(4)

Subject to:

$$\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{ik} y_{ij} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{ik} \left[ {p_{ij} + u_{ij} } \right] = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{ik} \left[ {p_{ij} + \mathop \sum \limits_{l = 1}^{L} w_{ijl} v_{l} } \right];\quad k = 1, \ldots ,K;\quad j = 1, \ldots ,J$$
(5)
$$\mathop \sum \limits_{j = 1}^{J} p_{ij} = 1;\quad i = 1,..,n$$
$$\mathop \sum \limits_{l = 1}^{L} w_{ijl} = 1;\quad i = 1,..,n;\quad j = 1,..,J$$
(6)

Equations (4) to (6) depict a Generalized Maximum Entropy (GME) program, where the term \(Ent\left( W \right) = - \mathop \sum \nolimits_{l = 1}^{L} \mathop \sum \nolimits_{i = 1}^{n} \mathop \sum \nolimits_{j = 1}^{J} w_{ijl} \ln \left( {w_{ijl} } \right)\) quantifies the entropy associated with the cells \(u_{ij}\). These elements, which reflect the noise for individual \(i\) in category \(j\), are expressed as \(u_{ij} = \mathop \sum \nolimits_{l = 1}^{L} w_{ijl} v_{l}\), or \(U = WV\) in matrix notation, where matrix \(V\) contains the set of \(L\) possible values for the error terms and matrix \(W\) denotes the corresponding (unknown) probabilities. Given the categorical nature of the data in \(Y\), the feasible values for the error in \(V\) are naturally bounded between −1 and 1, and for the sake of simplicity the same support is used for all the data points.
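The reparameterization of the noise can be illustrated with a small numerical sketch. The weights in `W` below are invented for illustration, and the support \(v = (-1, 0, 1)\) with \(L = 3\) is one admissible choice; the only point is that each \(u_{ij}\) is a convex combination of the support points and therefore stays within \([-1, 1]\).

```python
import numpy as np

v = np.array([-1.0, 0.0, 1.0])          # support points v_l, with L = 3

# Hypothetical weights w_ijl for n = 2 individuals and J = 2 categories.
W = np.array([
    [[1/3, 1/3, 1/3], [0.1, 0.2, 0.7]],
    [[0.6, 0.3, 0.1], [0.0, 0.0, 1.0]],
])
assert np.allclose(W.sum(axis=2), 1.0)  # each w_ij. sums to one over l, as in Eq. (6)

U = W @ v                               # u_ij = sum_l w_ijl * v_l
# U is approximately [[0.0, 0.6], [-0.5, 1.0]]: uniform weights give zero noise,
# while weights concentrated on v = 1 push u_ij to the upper bound.
```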

Solving the GME program above yields estimates \(\hat{P}\) and \(\hat{W}\) of \(P\) and \(W\). While the constraints in Eq. (6) act just as normalization restrictions, the information contained in the sample in the form of cross-moments between \(Y\) and \(X\) is given on the left-hand side of Eq. (5). Without this piece of information, the solution of the GME estimator would be a uniform distribution, but this equation “pushes” the solution to be consistent with these observed moments.

2.2 Spatial Disaggregation Based on GME

We apply a modification of the general GME estimator above as an alternative to the methods presented in Elbers et al. (2003) or Tarozzi and Deaton (2009) for predicting social indicators for small areas. The GME technique proposed follows the idea of combining household surveys with a population census but exploits the information in a different way. One advantage of the method proposed here is that it uses the detailed geographical information present in the population census while making it consistent with some aggregates (moments) observable in the household survey.

For the sake of clarity, let us explain the procedure by assuming that our research interest is in estimating a categorical indicator in a set of small areas \(d = 1, \ldots ,D\). The proportion of individuals in \(d\) that fall into category \(j\) can be expressed as \(\mathop \sum \nolimits_{i = 1}^{{n_{d} }} y_{ij}^{d} /n_{d}\), where \(n_{d}\) represents the number of individuals in area \(d\). Our problem is that the indicators \(y_{ij}\) are not directly observable in a population census. They are observable in the household survey, but there we cannot identify the small area \(d\) in which the individual lives.

Following Eq. (2), our estimates of the indicator of interest \(\hat{y}_{ij}^{d}\) are defined as \(\hat{y}_{ij}^{d} = \hat{p}_{ij}^{d} + \hat{u}_{ij}^{d}\), where \(\hat{p}_{ij}^{d}\) and \(\hat{u}_{ij}^{d}\) are the solutions produced by the GME estimator and \(\hat{u}_{ij}^{d} = \mathop \sum \nolimits_{l = 1}^{L} \hat{w}_{ijl} v_{l}\). These estimates will be obtained by solving the previously depicted GME program but modifying the constraints on Eq. (5) as follows:

$$\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{sik} y_{sij} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} x_{cik} \left[ {p_{ij} + u_{ij} } \right] = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} x_{cik} \left[ {p_{ij} + \mathop \sum \limits_{l = 1}^{L} w_{ijl} v_{l} } \right];\quad k = 1, \ldots ,K;\quad j = 1, \ldots ,J$$
(7)

Or, in matrix notation:

$$\frac{1}{n}X_{s}^{T} Y_{s} = \frac{1}{N}X_{c}^{T} \left[ {P + U} \right] = \frac{1}{N}X_{c}^{T} \left[ {P + WV} \right]$$
(8)

In Eqs. (7) and (8) the subscripts \(s\) and \(c\) refer to moments observed in the household survey and the population census respectively. Matrix \(X_{s}\), together with \(Y_{s}\), is collected from the information in the sample and has dimension \(n \times K\), while matrix \(X_{c}\) is observable at the population census level and has dimension \(N \times K\), where \(N\) is the population size. Note that the left-hand side of these equations refers to the cross-moments between \(Y\) and the covariates in \(X\), which are observable for the sample of households surveyed. The solutions \(\hat{P}\) and \(\hat{W}\) on the right-hand side must produce, at the census level, cross-moments identical to those observed in the household sample. Since these estimates are obtained with sufficient geographical detail, it is possible to predict \(\hat{y}_{ij}^{d} = \hat{p}_{ij}^{d} + \hat{u}_{ij}^{d}\) and make these estimates consistent with the aggregates in \(\frac{1}{n}X_{s}^{T} Y_{s}\). This consistency can be viewed as one of the main comparative advantages of this GME technique: while other small-area estimation methods use only the data available for the small areas of interest, the GME estimator also considers exogenous information that may come in the form of aggregates. If these aggregates contain information about the small areas, this is explicitly exploited and incorporated into the estimation problem, which can help to improve the accuracy of the estimates. Additionally, discrepancies between the small area estimates and the regional or national aggregates are prevented by construction.
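A minimal numerical sketch of the program in Eq. (4) subject to Eqs. (6) and (8) is given below. The text does not describe a solution algorithm, so the approach here is an assumption: the first-order conditions of the program imply \(p_{ij} \propto \exp ( - \eta_{ij} )\) and \(w_{ijl} \propto \exp ( - v_{l} \eta_{ij} )\) with \(\eta_{ij} = \sum_{k} x_{cik} \lambda_{kj}\), and the multipliers \(\lambda_{kj}\) are found by plain gradient descent on the resulting unconstrained dual. The function names `gme_project` and `covariates`, the simulated data and the step-size rule are all hypothetical choices made for illustration.

```python
import numpy as np

def gme_project(X_c, target, v=(-1.0, 0.0, 1.0), iters=2000):
    """Gradient descent on the dual of the GME program with constraints (8).

    X_c    : (N, K) census covariates
    target : (K, J) survey cross-moments (1/n) X_s^T Y_s
    Returns P (N, J), U (N, J) and the remaining moment gap (K, J).
    """
    v = np.asarray(v)
    N, K = X_c.shape
    lam = np.zeros((K, target.shape[1]))             # Lagrange multipliers
    # Conservative step size from an upper bound on the dual curvature.
    step = 1.0 / (1.5 * np.linalg.eigvalsh(X_c.T @ X_c / N).max())
    for _ in range(iters):
        eta = X_c @ lam                               # (N, J)
        P = np.exp(-(eta - eta.min(axis=1, keepdims=True)))
        P /= P.sum(axis=1, keepdims=True)             # p_ij proportional to exp(-eta_ij)
        W = np.exp(-eta[:, :, None] * v)              # w_ijl proportional to exp(-v_l * eta_ij)
        W /= W.sum(axis=2, keepdims=True)
        U = W @ v                                     # u_ij = sum_l w_ijl * v_l
        gap = target - X_c.T @ (P + U) / N            # violation of Eq. (8)
        lam -= step * gap                             # dual gradient step
    return P, U, gap

rng = np.random.default_rng(1)

def covariates(m):
    """Hypothetical covariates: constant, standardized age, unemployment dummy."""
    return np.column_stack([np.ones(m), rng.normal(size=m), rng.random(m) < 0.3]).astype(float)

X_s, X_c = covariates(500), covariates(2000)          # simulated survey and census
prob = 1 / (1 + np.exp(-(-1.0 + 1.5 * X_s[:, 2])))    # illustrative data-generating process
y = (rng.random(500) < prob).astype(float)
Y_s = np.column_stack([y, 1 - y])                     # J = 2 categories
target = X_s.T @ Y_s / len(X_s)                       # left-hand side of Eq. (8)

P, U, gap = gme_project(X_c, target)
Y_hat = P + U                                         # census-level predictions
```

In this sketch the census-level cross-moments of `Y_hat` reproduce the survey moments in `target` up to numerical tolerance, which is the consistency property discussed above. Passing the survey covariates as `X_c` would correspond to the constraints in Eq. (5) instead.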

3 An Empirical Application: Estimating Poverty and Exclusion Rates for Small Areas in Andalusia (Spain), 2011

3.1 Data

This section illustrates how the GME estimator works with a real-world example. It will be applied to predict at risk of poverty and exclusion (AROPE) rates for small areas (municipalities) in a Spanish region. Following the definition given by Eurostat, an individual is classified as AROPE if he or she belongs to a household for which one or more of the following three conditions hold:

  i. Having a disposable income below 60% of the national median,
  ii. Suffering severe material deprivation, or,
  iii. Living in a household with a low work intensity.
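The classification rule above can be written as a simple indicator function. This is a simplified sketch: the function name `is_arope` and its arguments are hypothetical, and the precise operational definitions of income, material deprivation and work intensity used by Eurostat are not reproduced here.

```python
def is_arope(disposable_income, national_median, severely_deprived, low_work_intensity):
    """Return True if at least one of the three conditions (i)-(iii) holds."""
    income_poor = disposable_income < 0.6 * national_median   # condition (i)
    return income_poor or severely_deprived or low_work_intensity

# Illustrative households (all figures invented):
poor_by_income = is_arope(5000, 10000, False, False)      # income below 60% of the median
not_at_risk = is_arope(7000, 10000, False, False)         # none of the conditions holds
poor_by_deprivation = is_arope(7000, 10000, True, False)  # condition (ii) alone suffices
```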

This empirical example will be based on the data collected for 2011 in the European Union Statistics on Income and Living Conditions (EU-SILC) survey. This survey provides information for the EU member states on issues related to income distribution, poverty or, more generally, living conditions, and it is the most commonly used database for distributional issues. The EU-SILC survey is rich in information about several characteristics of individuals and households. However, due to confidentiality issues, the EU-SILC does not provide detailed information on the geographical location of the individuals surveyed: depending on the specific EU country, information on location is only released at the level of NUTS 1 or NUTS 2 regions. In our case, we will focus our analysis on the Spanish region of Andalusia. This NUTS 2 region is interesting for a number of reasons, the most important being that it presented the highest poverty rate in Spain in the year studied here, and one of the highest in the EU. Moreover, the region experienced high unemployment rates and low educational levels (measured as the percentage of population with a third-level degree) compared with other Spanish regions.

Additionally, this region contains large metropolitan areas (the urban agglomeration of Seville has more than 1.5 million inhabitants) with a relatively high concentration of manufacturing and service activities, together with many depopulated rural areas mainly specialized in agricultural and other low-value-added activities. Consequently, one would expect to find a sizable internal heterogeneity in the poverty indicators across the municipalities of the region. However, this heterogeneity cannot be studied by using only the information contained in the EU-SILC.

In order to overcome the problems generated by the lack of data availability, we will make use of the GME estimator proposed, and the information contained in the sample of households in the EU-SILC will be combined with the sample of households surveyed in Andalusia in the Spanish Population Census conducted in 2011. While the EU-SILC surveyed slightly more than 3,500 adult individuals in the region, the sample in the Population Census contained information on more than 50,000, but not on the variables related to income or deprivation that would allow an individual to be classified as AROPE or not. The Census survey also provides information on the geographical location at the scale of municipality, although only those municipalities larger than 20,000 inhabitants can be properly identified, and the rest are aggregated because of privacy issues. Nevertheless, this spatial scale allows us to identify larger urban settlements and differentiate them from rural areas.

Our estimation objective will be to recover AROPE rates at municipal level by applying the GME program that maximizes Eq. (4) subject to the set of normalizing constraints in Eq. (6) and the moment constraints in Eq. (7). In our problem, matrix \(Y_{s}\) contains the binary variable of interest in the EU-SILC sample, where \(y_{i} = 1\) if an individual can be classified as AROPE and zero otherwise, while \(X_{s}\) contains a set of covariates observable in the same EU-SILC sample that are also observable at census level, i.e., it is possible to observe the equivalent matrix \(X_{c}\). For the sake of simplicity in this empirical example, we consider as covariates only one continuous variable (“Age”) and its square, plus the binary variables “National”, “Rural”, “College”, “Unemployed” and “Worker”. Table 1 presents a summary of these variables in both databases:

Table 1 Mean values. Andalusia 2011

Additionally, the sample correlations found between \(Y_{s}\) and \(X_{s}\) are reported in the first column of Table 2:

Table 2 Sample correlation and cross moments. Andalusia 2011, EU-SILC sample

In the sample surveyed in the EU-SILC, the indicator of risk of poverty and exclusion is negatively correlated with the age of the individuals. As expected, situations of AROPE seem to be negatively correlated with having a job, holding a college degree and having Spanish nationality, while they are positively correlated with living in a rural environment and being unemployed. Given the observed correlation between the covariates chosen and the indicator of interest, our GME estimator is designed to “project” onto the set of individuals surveyed in the Population Census estimates that are consistent with the cross-moments contained in the matrix \(\frac{1}{n}X_{s}^{T} Y_{s}\). These cross-moments are also reported in Table 2 (second column).

3.2 Results: The Spatial Distribution of Rates of AROPE Population in Andalusia

From these data, we have applied a moment-constrained GME program including as constraints the sample information on \(X_{s}^{T} Y_{s}\). In this example, we followed the same procedure for the reparameterization of the error term as applied in the numerical experiments and we set wide error supports with \(L = 3\) points as \(v_{l} = \left( { - 1,0,1} \right)\). The solution of this program predicts the probability of being classified as AROPE for the individuals sampled in the Population Census, calculated as \(\hat{y}_{i}^{d} = \hat{p}_{i}^{d} + \hat{u}_{i}^{d}\), where the superscript \(d\) refers to the small area (a specific municipality or a group of municipalities) where the individual lives. Having recovered the \(\hat{y}_{i}^{d}\), local estimates of AROPE rates can be calculated as \(\mathop \sum \nolimits_{i = 1}^{{N_{d} }} \hat{y}_{i}^{d} /N_{d}\), where \(N_{d}\) represents the population size of area \(d\). These local estimates allow the spatial heterogeneity of the incidence of AROPE across the municipalities of the region to be studied. In total, we managed to recover estimates for 115 small areas and Fig. 1 visually represents these estimates:

Fig. 1 AROPE rates at municipal level. Andalusia 2011
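The aggregation of the individual predictions \(\hat{y}_{i}^{d}\) into local rates \(\mathop \sum \nolimits_{i} \hat{y}_{i}^{d} /N_{d}\) amounts to a group-wise mean. A minimal sketch, with a hypothetical helper `area_rates` and invented predictions and area codes, could look as follows:

```python
from collections import defaultdict

def area_rates(y_hat, area_codes):
    """Average the individual predictions within each small area d."""
    totals, counts = defaultdict(float), defaultdict(int)
    for y, d in zip(y_hat, area_codes):
        totals[d] += y
        counts[d] += 1
    return {d: totals[d] / counts[d] for d in totals}

# Four illustrative individuals living in two hypothetical areas "A" and "B".
rates = area_rates([0.2, 0.4, 0.9, 0.5], ["A", "A", "B", "B"])
```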

The estimates produced show sizable heterogeneity across the Andalusian municipalities. Approximately 10% of the small areas identified presented an AROPE rate smaller than 30%, which is close to the national AROPE rate for Spain in that year (26%), while the estimate for the top 10% was higher than 50%. The figures represented in Fig. 1 also allow some geographical patterns to be identified. First, the map shows that the areas located in the south-west of the region generally presented higher values. Additionally, and as the correlation calculated in the EU-SILC survey suggested, the highest AROPE rates are mainly concentrated in rural areas and are comparatively smaller in the urban agglomerations. Figure 1 displays the names of the municipalities at the core of the five largest metropolitan areas of Andalusia, which accounted for approximately 4 million inhabitants and almost 50% of the total population of the region. In general, the estimates of poverty rates around these areas are comparatively smaller than in other zones of the region. These results are consistent with previous research that found higher persistence of poverty in rural areas for the US (Partridge and Rickman 2008) or for the EU (Weziak-Bialowolska 2016).

For the specific case of Spain, Sánchez-Cantalejo et al. (2008) estimated multidimensional poverty indicators based on the 2001 Census data. Although the results are not directly comparable (the reference years are different and they use an alternative definition of the poverty indicator), the spatial patterns found in their study for the case of Andalusia seem to be similar to those estimated in this paper. Although their study covers all the Spanish regions, for the specific case of Andalusia in 2001 their estimates suggest a similar geographical concentration of poverty in rural areas of Andalusia or, at least, a comparatively smaller incidence in Andalusian urban agglomerations (see Sánchez-Cantalejo et al. 2008, Figs. 2 and 3, page 268).

4 Concluding Remarks

This paper develops and applies an estimation procedure based on the info-metrics approach to estimate social indicators for small areas that are consistent with observable aggregate information. In this regard, the technique proposed here can be seen as a spatial disaggregation technique. More specifically, we propose a modified version of the cross-moment constrained GME estimator for categorical variables developed in Golan (2018), which we adapt to be applied in the field of small area estimation. Although this is not the first attempt to use this type of methodology for small area estimation problems (see Bernadini-Papalia and Fernández-Vázquez 2018), the novelty of our proposal is that we use as constraints not only aggregates of the variable of interest but also the information contained in its cross-moments with potential covariates.

The formulation presented here is also inspired by the proposals made originally by Elbers et al. (2003) and the subsequent modification by Tarozzi and Deaton (2009) to map poverty for small areas. Our estimator is evaluated through numerical experiments, revealing a comparatively better performance than the previous proposals which attempt to make a similar analysis. We also illustrate, by means of an empirical example, how the proposed procedure can be implemented to solve a real-world problem, and we have applied it to derive At Risk of Poverty and Exclusion (AROPE) rates for small geographical areas. More specifically, we estimate these rates for the municipalities of the region of Andalusia (Spain) for 2011, combining microdata from the EU-SILC database and the Spanish Population Census produced that year.

The numerical simulations together with the empirical example illustrate how this methodology can be helpful for studying the spatial heterogeneity of social indicators across small areas. However, further work should be done to explore its performance when the variable of interest is not categorical but continuous, or it has the form of count data. Future research should also explicitly address potential processes of spatial dependence between the geographical locations of interest.

One innovation of the technique presented here is that it does not require strong distributional assumptions and that it combines information from two different sources, while maintaining the consistency of the estimates produced with national and regional aggregates. One potential limitation associated with this characteristic is that the proposed GME technique has comparatively larger information requirements than other methodologies, since it is based on combining two different sources of data. Alternatives based on just exploiting observable information, such as the Principal Components approach applied in Sánchez-Cantalejo et al. (2008), are less demanding in terms of information, but they do not necessarily produce small area estimates consistent with the national or regional figures that are usually estimated by official statistical agencies.