Spatial Disaggregation of Social Indicators: An Info-Metrics Approach

In this paper we propose a methodology to obtain social indicators at a detailed spatial scale by combining the information contained in census and sample surveys. As in previous proposals, the method estimates a model at the sample level and then projects it to the census scale. The main novelties of the technique are that (i) the small-scale mapping produced is perfectly consistent with the regional or national aggregates observed in the sample, and (ii) it does not require imposing strong distributional assumptions. The methodology follows the basics presented in Golan (2018) by adapting a cross-moment constrained Generalized Maximum Entropy (GME) estimator to the spatial disaggregation problem. This procedure is compared with the equivalent methodology of Tarozzi and Deaton (2009) by means of numerical experiments, in which it shows a comparatively better performance. Additionally, the practical implementation of the proposed methodology is illustrated by estimating poverty rates for small areas in the region of Andalusia (Spain).


Introduction
The use of social indicators at an appropriately disaggregated geographical scale is a relevant issue, not only for research but also for policy-making, because efficient policy design sometimes requires information with a high degree of spatial disaggregation. Official data, however, are rarely produced at a high level of spatial detail and are generally only available at a national or, sometimes, regional level. If the geographical location of individual agents were observable at a highly disaggregated spatial scale, the indicators for the small areas could be calculated simply by summing or averaging the individual estimates. In most cases, however, researchers have to deal with databases that might allow for a precise spatial location of the individuals, such as the microdata of a Population Census, but that do not normally contain information on economic variables related to distributional issues, such as household income. On the other hand, the surveys that contain data about these indicators, such as household surveys, do not normally provide detailed information on the geographical location of the individuals surveyed. Even when they do, the small sample sizes for the sub-regions of interest make the estimates directly obtained very unstable (see Marchetti and Secondi 2017, for a recent application of this type of direct estimation).
Small Area Estimation (SAE) deals with the problem of providing reliable estimates of these indicators based on the information collected by conducting surveys in some or all areas. SAE techniques are increasingly used to provide estimates for local areas of interest (for a recent application, see Durán and Condorí 2019). Generally speaking, SAE methodologies combine direct estimates with some auxiliary information. The comparative performance of different techniques in this field has been recently evaluated by Guadarrama et al. (2016). Within the group of methodologies that aim at producing social indicators for small areas, there is one in particular that has been widely applied. Specifically, the original proposal by Elbers et al. (2003) and the modification proposed later in Tarozzi and Deaton (2009) have been used by the World Bank to map poverty across small areas in poor countries, and they have later been applied in several developing countries (see Modrego and Berdegué 2015) and in other more advanced economies (see Sánchez-Cantalejo et al. 2008; Melo et al. 2016; Morales et al. 2018, for recent examples).
The basic idea of this procedure consists of "projecting" predictions of the variable of interest from a household survey onto the households that form the population. In a nutshell, the procedure comprises three steps:
1. From the household survey (HS), estimate a model of the variable of interest, $y = f(X)$, where $X$ is a set of regressors also observable in the Population Census (PC).
2. Recover the set of parameters $\beta$ estimated on the HS (with some degree of heterogeneity across regions or clusters of households) and take them to the PC.
3. Given the $X$ observable in the PC and the corresponding $\hat{\beta}$, predict the figures of $y$ for the households surveyed in the PC ($\hat{y} = X\hat{\beta}$).
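The three steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the model is taken to be linear and fitted by ordinary least squares, there is no heterogeneity across clusters, and the function name `project_survey_to_census` and the toy numbers are hypothetical rather than taken from Elbers et al. (2003) or Tarozzi and Deaton (2009).

```python
import numpy as np

def project_survey_to_census(X_survey, y_survey, X_census, area_census):
    """Fit on the survey, predict on the census, average by small area."""
    # Step 1: estimate the model on the household survey (plain OLS here).
    beta_hat, *_ = np.linalg.lstsq(X_survey, y_survey, rcond=None)
    # Steps 2-3: carry beta_hat to the census and predict y for every census record.
    y_hat = X_census @ beta_hat
    # Small-area indicator: mean prediction among the census records of each area.
    means = {int(d): float(y_hat[area_census == d].mean())
             for d in np.unique(area_census)}
    return beta_hat, means

# Toy census: six households in two areas, with columns (intercept, x).
x_census = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X_census = np.column_stack([np.ones(6), x_census])
area_census = np.array([0, 0, 0, 1, 1, 1])

# Toy survey: three households for which y is observed (here exactly y = 2 + 3x).
X_survey = X_census[[0, 2, 5]]
y_survey = np.array([2.0, 8.0, 17.0])

beta_hat, area_means = project_survey_to_census(X_survey, y_survey, X_census, area_census)
```

Note that the survey records carry no area identifier in this sketch; the geographical detail comes only from the census.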
The estimates produced have the advantage of a higher precision than previous methodologies, due to the large number of households in the census (see Tarozzi and Deaton 2009). Additionally, this large number of estimates allows for studying potential differences between the individuals or households belonging to the same small area. This feature is highly appealing to the social researcher, and this is the approach that we follow in this paper. One problem derived from applying this type of estimation, however, is that the estimates are not necessarily consistent with the aggregates that are already observable: for example, the mean of the household-level income estimates for a given region may not be equal to the mean regional household income published in official databases.
In order to overcome this limitation, we propose an alternative methodology that adjusts the estimates to official observable aggregates by incorporating information on these aggregates into the estimation problem. We propose to make this correction based on the info-metrics framework detailed in Golan (2018) and with a procedure similar to the Generalized Maximum Entropy (GME) estimator presented in Bernardini-Papalia and Fernández-Vázquez (2018).
The rest of the paper is organized as follows. Section 2 describes the main characteristics of the technique proposed. In Sect. 3 we illustrate how the proposed technique can be applied to the empirical problem of recovering at-risk-of-poverty and exclusion rates for small areas (municipalities) in a Spanish region. Finally, in Sect. 4 we draw some conclusions. The Appendix compares the performance of the technique with the original formulation by Tarozzi and Deaton (2009) by means of numerical experiments.

Spatial Disaggregation of Social Indicators Within an Infometrics Framework
The disaggregation methodology that we propose in this paper tries to exploit the information contained in the available databases while making minimal assumptions. The methodology suggested here follows the basics presented in Golan et al. (1996) and Golan (2018).

The GME Moment-Constrained Estimator
Assume that the estimation objective is to recover the cells of matrix $P$, with dimension $n \times J$, with a typical element $p_{ij}$ that reflects the probability of category $j$ for individual $i$, with $p_{ij} \geq 0$ and $\sum_{j=1}^{J} p_{ij} = 1$; $i = 1, \dots, n$. The basic idea of the info-metrics framework consists of choosing as the solution the distribution that requires the least information (i.e., that maximizes the uncertainty or entropy) while being consistent with the observed data. This amounts to maximizing an entropy (uncertainty) function as in Eq. (1):

$$H(P) = -\sum_{i=1}^{n} \sum_{j=1}^{J} p_{ij} \ln p_{ij} \tag{1}$$

This function reaches its unconstrained maximum when each of the $n$ rows in $P$ is distributed uniformly across the $J$ categories (see Shannon 1948 for additional details). Any additional information (observed data) is included as a constraint and makes the solution depart from the uniform distribution.
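As a small numerical illustration of this property, the following sketch evaluates the Shannon entropy of one row of $P$ for a uniform and for a non-uniform distribution over $J = 4$ categories. The function name `shannon_entropy` and the example probabilities are chosen here for illustration only.

```python
import math

def shannon_entropy(p):
    """Shannon entropy -sum(p * ln p) of one probability vector (with 0 * ln 0 = 0)."""
    return -sum(q * math.log(q) for q in p if q > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # no information beyond normalization
skewed = [0.70, 0.10, 0.10, 0.10]    # a distribution shaped by extra information

h_uniform = shannon_entropy(uniform)  # equals ln(4), the maximum for J = 4
h_skewed = shannon_entropy(skewed)    # strictly smaller than ln(4)
```

For a full matrix $P$, the entropy is simply the sum of these row entropies, so it is maximized when every row is uniform.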
This basic idea can be adapted to predict and model categorical variables (see Golan 2018, Chapter 11). Consider a sample consisting of $n$ observations of basic units, which for the sake of simplicity will be assumed to be individuals but which can be easily generalized to other types of observations, such as households or firms. Let us assume that the variable of interest is a categorical indicator that reflects whether the individual belongs to a specific class. Each data point observed in the sample belongs to one single category of the $J$ unordered alternatives and each observation is coded as a binary variable, which takes value 1 for the realized category and value 0 for the rest. For each individual $i = 1, \dots, n$, we define a variable $y_{ij}$ that equals 1 if and only if alternative $j$ is observed and zero otherwise. This implies that the information on the indicator of interest contained in the sample can be presented in the form of an $n \times J$ matrix $Y$:

$$Y = P + U \tag{2}$$

Equation (2) indicates that the observed data in the sample are specified as a function of the unobserved probabilities in $P$ plus some noise contained in matrix $U$. Our objective is to predict the elements of $P$ based on the observed data, assuming that the cells $p_{ij}$ in matrix $P$ are related to a set of explanatory covariates contained in the $n \times K$ matrix $X$, which collects $K$ characteristics of the individuals. In order to accommodate these relations between $P$ and $X$ in a flexible way, Golan (2018) proposes the use of the cross-moment equations:

$$\sum_{i=1}^{n} x_{ik} y_{ij} = \sum_{i=1}^{n} x_{ik} p_{ij} + \sum_{i=1}^{n} x_{ik} u_{ij}; \quad k = 1, \dots, K; \; j = 1, \dots, J \tag{3}$$

that serve as constraints in the following optimization program:

$$\max_{P, W} H(P, W) = -\sum_{i=1}^{n} \sum_{j=1}^{J} p_{ij} \ln p_{ij} - \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{l=1}^{L} w_{ijl} \ln w_{ijl} \tag{4}$$

subject to:

$$\sum_{i=1}^{n} x_{ik} y_{ij} = \sum_{i=1}^{n} x_{ik} p_{ij} + \sum_{i=1}^{n} x_{ik} \sum_{l=1}^{L} w_{ijl} v_l; \quad k = 1, \dots, K; \; j = 1, \dots, J \tag{5}$$

$$\sum_{j=1}^{J} p_{ij} = 1; \quad \sum_{l=1}^{L} w_{ijl} = 1 \tag{6}$$

Equations (4) to (6) depict a Generalized Maximum Entropy (GME) program, where the term $-\sum_{i} \sum_{j} \sum_{l} w_{ijl} \ln w_{ijl}$ quantifies the entropy associated with the cells $u_{ij}$.
These elements, which reflect the noise for individual $i$ in category $j$, are expressed as $u_{ij} = \sum_{l=1}^{L} w_{ijl} v_l$, or $U = WV$ in matrix notation, where matrix $V$ contains the set of $L$ possible values for the error terms while matrix $W$ denotes the corresponding (unknown) probabilities. Given the categorical nature of the data in $Y$, the feasible values for the error in $V$ are naturally bounded between −1 and 1, and for the sake of simplicity these bounds are set as common for all the data points.
If the GME program above is solved, estimates of $P$ ($\hat{P}$) and $W$ ($\hat{W}$) are recovered. While the constraints in Eq. (6) act just as normalization restrictions, the information contained in the sample in the form of cross-moments between $Y$ and $X$ is given on the left-hand side of Eq. (5). Without this piece of information, the solution of the GME estimator would be a uniform distribution, but this equation "pushes" the solution to be consistent with these observed moments.
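For readers who want to see how such a program is typically solved, the following block sketches the usual Lagrangian solution of an entropy program with linear moment constraints. It is a worked derivation under stated assumptions, not a formula reproduced from the paper: the constraint is assumed to have the form $\sum_i x_{ik} y_{ij} = \sum_i x_{ik} p_{ij} + \sum_i x_{ik} \sum_l w_{ijl} v_l$, and the multipliers $\lambda_{kj}$ and the normalization factors $\Omega_i$ and $\Psi_{ij}$ are notation introduced here.

```latex
% Assumed Lagrangian, with \lambda_j = (\lambda_{1j}, ..., \lambda_{Kj})^\top the
% multipliers of the K moment constraints for category j, and x_i the i-th row of X:
%   H(P, W) + \sum_{k,j} \lambda_{kj} \sum_i x_{ik} ( y_{ij} - p_{ij} - \sum_l w_{ijl} v_l )
%           + normalization terms.
\begin{align*}
\hat{p}_{ij} &= \frac{\exp(-x_i^\top \hat{\lambda}_j)}{\Omega_i(\hat{\lambda})},
  & \Omega_i(\lambda) &= \sum_{j=1}^{J} \exp(-x_i^\top \lambda_j), \\
\hat{w}_{ijl} &= \frac{\exp(-v_l \, x_i^\top \hat{\lambda}_j)}{\Psi_{ij}(\hat{\lambda})},
  & \Psi_{ij}(\lambda) &= \sum_{l=1}^{L} \exp(-v_l \, x_i^\top \lambda_j), \\
\hat{\lambda} &= \arg\min_{\lambda} \Big\{ \sum_{k,j} \lambda_{kj} \sum_{i} x_{ik} y_{ij}
  + \sum_{i} \ln \Omega_i(\lambda) + \sum_{i,j} \ln \Psi_{ij}(\lambda) \Big\}.
\end{align*}
% The gradient of the objective in the last line with respect to \lambda_{kj} is
%   \sum_i x_{ik} ( y_{ij} - p_{ij} - u_{ij} ),
% so the minimizer satisfies the moment constraints exactly.
```

The unconstrained problem in the last line is convex and has only $K \times J$ unknowns, which is why this concentrated form is convenient in practice. With $\lambda = 0$ the expressions return uniform $p_{ij}$ and $w_{ijl}$, in line with the remark above that the moment information is what moves the solution away from the uniform distribution.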

Spatial Disaggregation Based on GME
We apply a modification of the general GME estimator above as an alternative to the methods presented in Elbers et al. (2003) or Tarozzi and Deaton (2009) for estimating indicators for small areas. The GME technique proposed follows the idea of combining household surveys with a population census but exploits the information in a different way. One advantage of the method proposed here is that it uses the detailed geographical information present in the population census while making it consistent with some aggregates (moments) observable in the household survey. For the sake of clarity, let us explain the procedure by assuming that our research interest is in estimating a categorical indicator in a set of small areas $d = 1, \dots, D$. The proportion of individuals in $d$ that fall into category $j$ can be expressed as $\sum_{i=1}^{n_d} y^d_{ij} / n_d$, where $n_d$ represents the number of individuals in area $d$. Our problem is that the indicators $y_{ij}$ are not directly observable in a population census. They are observable in the household survey, but there we cannot identify the small area $d$ in which the individual lives.
Note: Golan (2018) shows that, under some mild assumptions, the GME solution can be seen as a generalization of the classical Maximum Likelihood estimator for a logit model, yielding the same solution when $U = 0$.
Following Eq. (2), our estimates of the indicator of interest $\hat{y}^d_{ij}$ are defined as

$$\hat{y}^d_{ij} = \hat{p}^d_{ij} + \hat{u}^d_{ij}$$

where $\hat{p}^d_{ij}$ and $\hat{u}^d_{ij}$ are the solutions produced by the GME estimator and $\hat{u}^d_{ij} = \sum_{l=1}^{L} \hat{w}_{ijl} v_l$. These estimates will be obtained by solving the previously depicted GME program but modifying the constraints in Eq. (5) as follows:

$$\frac{1}{n} \sum_{i=1}^{n} x^s_{ik} y^s_{ij} = \frac{1}{N} \sum_{i=1}^{N} x^c_{ik} \Big( p_{ij} + \sum_{l=1}^{L} w_{ijl} v_l \Big); \quad k = 1, \dots, K; \; j = 1, \dots, J \tag{7}$$

Or, in matrix notation:

$$\frac{1}{n} X_s^T Y_s = \frac{1}{N} X_c^T (P + U) \tag{8}$$

In Eqs. (7) and (8) the subscripts $s$ and $c$ refer to moments observed in the household survey and the population census respectively. Matrix $X_s$, together with $Y_s$, is collected from the information in the sample, and it has dimension $n \times K$, while matrix $X_c$ is observable at the population census level and has dimension $N \times K$, where $N$ is the population size. Note that the left-hand side of these equations refers to the cross-moments between $Y$ and the covariates in $X$, which are observable for the sample of households surveyed. The solutions $\hat{P}$ and $\hat{W}$ on the right-hand side must produce, at the census level, cross-moments identical to those observed in the household sample. Since these estimates are obtained with sufficient geographical detail, it is possible to predict $\hat{y}^d_{ij} = \hat{p}^d_{ij} + \hat{u}^d_{ij}$ and make these estimates consistent with the aggregates in $\frac{1}{n} X_s^T Y_s$. This consistency can be viewed as one of the main comparative advantages of this GME technique: while other small-area estimation methods only make use of the available data for the small areas of interest, the GME estimator also considers exogenous information that could come in the form of aggregates. If these aggregates contain information about the small areas, this is explicitly exploited and incorporated into the estimation problem, which can contribute to improving the accuracy of the estimates. Additionally, discrepancies between the small area estimates and the regional or national aggregates are prevented by construction.
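One possible way to compute such census-level estimates is sketched below. It is not the authors' implementation: it assumes the constraint $\frac{1}{n} X_s^T Y_s = \frac{1}{N} X_c^T (P + U)$, uses the standard exponential (Lagrangian) form of the maximum entropy solution, and finds the multipliers by plain gradient descent on the concentrated problem. The function names, step-size rule and simulated toy data are all hypothetical.

```python
import numpy as np

def _p_and_u(Xc, lam, v):
    """Exponential-form solution: P (N x J) and U (N x J) implied by multipliers lam."""
    eta = Xc @ lam                                   # N x J linear indices
    a = -eta - (-eta).max(axis=1, keepdims=True)     # stabilized exponents
    P = np.exp(a)
    P /= P.sum(axis=1, keepdims=True)                # rows of P sum to one
    b = -eta[:, :, None] * v                         # N x J x L exponents
    b -= b.max(axis=2, keepdims=True)
    W = np.exp(b)
    W /= W.sum(axis=2, keepdims=True)                # weights over the support sum to one
    return P, W @ v                                  # u_ij = sum_l w_ijl * v_l

def gme_disaggregate(Xs, Ys, Xc, v=(-1.0, 0.0, 1.0), n_iter=20000, tol=1e-9):
    """Census-level y_hat = P + U whose cross-moments match the survey's."""
    Xs, Ys, Xc = (np.asarray(m, dtype=float) for m in (Xs, Ys, Xc))
    v = np.asarray(v, dtype=float)
    n, N = Xs.shape[0], Xc.shape[0]
    target = Xs.T @ Ys / n                           # survey cross-moments (K x J)
    lam = np.zeros_like(target)
    # Conservative step: the curvature is bounded by 1.5 * lambda_max(Xc'Xc) / N
    # when the error support lies in [-1, 1].
    step = N / (1.5 * np.linalg.eigvalsh(Xc.T @ Xc).max())
    for _ in range(n_iter):
        P, U = _p_and_u(Xc, lam, v)
        gap = target - Xc.T @ (P + U) / N            # violation of the moment constraints
        if np.abs(gap).max() < tol:
            break
        lam -= step * gap                            # gradient step on the concentrated problem
    P, U = _p_and_u(Xc, lam, v)
    return P + U, target - Xc.T @ (P + U) / N

# Simulated toy data: a "census" of N records and a "survey" drawn from it.
rng = np.random.default_rng(0)
N, n = 600, 150
x = rng.normal(size=N)
Xc = np.column_stack([np.ones(N), x])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(0.5 - x))).astype(float)
idx = rng.choice(N, size=n, replace=False)
Xs, Ys = Xc[idx], np.column_stack([y[idx], 1.0 - y[idx]])

y_hat, gap = gme_disaggregate(Xs, Ys, Xc)
```

In this sketch `y_hat` has one row per census record, so it can be averaged within any area identified in the census, while `gap` reports how far the census-level cross-moments are from the survey ones.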

Data
This section illustrates how the GME estimator works with a real-world example. It will be applied to predict at-risk-of-poverty and exclusion (AROPE) rates for small areas (municipalities) in a Spanish region. Following the definition given by Eurostat, an individual is classified as AROPE if he or she belongs to a household for which one or more of the following three conditions holds:
i. Having a disposable income below 60% of the national median,
ii. Suffering severe material deprivation, or
iii. Living in a household with a low work intensity.
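The classification rule can be written as a simple logical "or" over the three conditions. The sketch below is illustrative only: the function and argument names are hypothetical, and the deprivation and work-intensity flags are assumed to be already available as booleans rather than derived from survey items.

```python
def is_arope(disposable_income, national_median_income,
             severe_material_deprivation, low_work_intensity):
    """Return True if at least one of the three AROPE conditions holds."""
    below_income_line = disposable_income < 0.6 * national_median_income
    return bool(below_income_line or severe_material_deprivation or low_work_intensity)

# Income above the 60% line, but the household has low work intensity.
example_a = is_arope(12000.0, 15000.0, False, True)
# Income below the 60% line (9000 = 0.6 * 15000), no other condition.
example_b = is_arope(8000.0, 15000.0, False, False)
# None of the three conditions holds.
example_c = is_arope(12000.0, 15000.0, False, False)
```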
This empirical example will be based on the data collected for 2011 in the EU Statistics on Income and Living Conditions (EU-SILC) survey. This survey provides information for the EU member states on issues related to income distribution, poverty or, more generally, living conditions, and it is the most commonly used database for distributional issues. The EU-SILC survey is rich in information about several characteristics of individuals and households. However, due to confidentiality issues, the EU-SILC does not provide detailed information on the geographical location of the individuals surveyed: depending on the specific EU country, information on location is only released at the level of NUTS 1 or NUTS 2 regions. In our case, we will focus our analysis on the Spanish region of Andalusia. This NUTS 2 region is interesting for a number of reasons, the most important being that it presented the highest poverty rate in Spain in the year studied here, and one of the highest in the EU. Moreover, the region experienced high unemployment rates and low educational levels (measured as the percentage of population with a third-level degree) compared with other Spanish regions.
Additionally, this region contains large metropolitan areas (the urban agglomeration of Seville has more than 1.5 million inhabitants) with a relatively high concentration of manufacturing and service activities, together with many depopulated rural areas mainly specialized in agricultural and other low-value-added activities. Consequently, one would expect to find a sizable internal heterogeneity in the poverty indicators across the municipalities of the region. However, this heterogeneity cannot be studied by using only the information contained in the EU-SILC.
In order to overcome the problems generated by the lack of data availability, we will make use of the GME estimator proposed, and the information contained in the sample of households in the EU-SILC will be combined with the sample of households surveyed in Andalusia in the Spanish Population Census conducted in 2011. While the EU-SILC surveyed slightly more than 3,500 adult individuals in the region, the sample in the Population Census contained information on more than 50,000, but not on the variables related to income or deprivation that would allow an individual to be classified as AROPE or not. The Census survey also provides information on the geographical location at the scale of the municipality, although only those municipalities with more than 20,000 inhabitants can be properly identified, and the rest are aggregated because of privacy issues. Nevertheless, this spatial scale allows us to identify larger urban settlements and differentiate them from rural areas.
Our estimation objective will be to recover AROPE rates at the municipal level by applying the GME program that maximizes Eq. (4) subject to the normalization constraints in Eq. (6) and the moment constraints in Eq. (7). In our problem, matrix $Y_s$ contains the binary variable of interest in the EU-SILC sample, where $y_i = 1$ if an individual can be classified as AROPE and zero otherwise, while $X_s$ contains a set of covariates observable in the same EU-SILC sample that are also observable at the census level, i.e., it is possible to observe the equivalent matrix $X_c$. For the sake of simplicity in this empirical example, we consider as covariates only one continuous variable ("Age") and its square, plus the binary variables "National", "Rural", "College", "Unemployed" and "Worker". Table 1 presents a summary of these variables in both databases. Additionally, the sample correlations found between $Y_s$ and $X_s$ are reported in the first column of Table 2. In the sample surveyed in the EU-SILC, the indicator of risk of poverty and exclusion is negatively correlated with the age of the individuals. As expected, situations of AROPE seem to be negatively correlated with having a job, holding a college degree and having Spanish nationality, while they are positively correlated with living in a rural environment and being unemployed. Given the observed correlation between the covariates chosen and the indicator of interest, our GME estimator is designed to "project" onto the set of individuals surveyed in the Population Census estimates that are consistent with the cross-moments contained in the matrix $\frac{1}{n} X_s^T Y_s$. These cross-moments are also reported in Table 2 (second column).
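To make the object $\frac{1}{n} X_s^T Y_s$ concrete, the sketch below builds a covariate matrix with the seven columns named above and computes its cross-moments with a binary AROPE indicator. The four records are made-up toy values, not EU-SILC data, and the variable names are chosen here for illustration.

```python
import numpy as np

# Made-up toy records for four survey respondents (NOT EU-SILC data).
age        = np.array([30.0, 45.0, 60.0, 25.0])
national   = np.array([1.0, 1.0, 0.0, 1.0])
rural      = np.array([0.0, 1.0, 1.0, 0.0])
college    = np.array([1.0, 0.0, 0.0, 0.0])
unemployed = np.array([0.0, 0.0, 1.0, 1.0])
worker     = np.array([1.0, 1.0, 0.0, 0.0])
arope      = np.array([0.0, 1.0, 1.0, 1.0])   # y_i = 1 if classified as AROPE

# Columns: Age, Age^2, National, Rural, College, Unemployed, Worker.
# An intercept column could be added in the same way if the model includes one.
X_s = np.column_stack([age, age ** 2, national, rural, college, unemployed, worker])
n = X_s.shape[0]

# One cross-moment per covariate: (1/n) * sum_i x_ik * y_i.
cross_moments = X_s.T @ arope / n
```

Each entry is the sample average of a covariate multiplied by the indicator; for a binary covariate such as "Rural" it is simply the share of respondents who are both rural and AROPE.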

Results: The Spatial Distribution of Rates of AROPE Population in Andalusia
From these data, we have applied a moment-constrained GME program including as constraints the sample information on $X_s^T Y_s$. In this example, we followed the same procedure for the reparameterization of the error term as applied in the numerical experiments and we set wide error supports with $L = 3$ points, $v = (-1, 0, 1)$. The solution of this program predicts the probability of being classified as AROPE for the individuals sampled in the Population Census, calculated as $\hat{y}^d_i = \hat{p}^d_i + \hat{u}^d_i$, where the superscript $d$ refers to the small area (a specific municipality or a group of municipalities) where the individual lives. Having recovered the $\hat{y}^d_i$, local estimates of AROPE rates can be calculated as

$$\widehat{AROPE}^d = \frac{1}{N_d} \sum_{i=1}^{N_d} \hat{y}^d_i$$

where $N_d$ represents the population size in area $d$. These local estimates allow for studying the spatial heterogeneity of the incidence of AROPE across the municipalities of the region. In total, we managed to recover estimates for 115 small areas, and Fig. 1 visually represents these estimates. The estimates produced show sizable heterogeneity across the Andalusian municipalities. Approximately 10% of the small areas identified presented an AROPE rate smaller than 30%, which is close to the national AROPE rate for Spain in that year (26%), while the estimate for the top 10% was higher than 50%. The figures represented in Fig. 1 also allow some geographical patterns to be identified. First, the map shows that the areas located in the south-west of the region generally presented higher values. Additionally, and as the correlation calculated in the EU-SILC survey suggested, the highest AROPE rates are mainly concentrated in rural areas and are comparatively smaller in the urban agglomerations. Figure 1 displays the names of the municipalities around which the five largest metropolitan areas of Andalusia are concentrated, which accounted for approximately 4 million inhabitants and almost 50% of the total population of the region.
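The aggregation from individual predictions to local rates is a plain within-area average. A minimal sketch follows, assuming the individual predictions and the area codes are stored in two parallel sequences; the function name `local_rates` and the toy values are hypothetical.

```python
from collections import defaultdict

def local_rates(y_hat, area):
    """Average the individual predictions y_hat within each small area."""
    total = defaultdict(float)
    count = defaultdict(int)
    for value, d in zip(y_hat, area):
        total[d] += value
        count[d] += 1
    return {d: total[d] / count[d] for d in total}

# Toy example: four census records in two areas.
rates = local_rates([0.2, 0.4, 0.6, 0.8], ["A", "A", "B", "B"])
```

Here area "A" averages 0.2 and 0.4, and area "B" averages 0.6 and 0.8.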
In general, the estimates of poverty rates around these areas are comparatively smaller than in other zones of the region. These results are consistent with previous research that found higher persistence of poverty in rural areas for the US (Partridge and Rickman 2008).

Concluding Remarks
This paper develops and applies an estimation procedure based on the info-metrics approach to estimate social indicators for small areas that are consistent with observable aggregate information. In this regard, the technique proposed here can be seen as a spatial disaggregation technique. More specifically, we propose a modified version of the cross-moment constrained GME estimator for categorical variables developed in Golan (2018), which we adapt to the field of small area estimation. Although this is not the first attempt to use this type of methodology for small area estimation problems (see Bernardini-Papalia and Fernández-Vázquez 2018), the novelty of our proposal is that we use as constraints not only aggregates of the variable of interest but also the information contained in its cross-moments with potential covariates. The formulation presented here is also inspired by the proposals made originally by Elbers et al. (2003) and the subsequent modification by Tarozzi and Deaton (2009) to map poverty for small areas. Our estimator is evaluated through numerical experiments, which reveal a comparatively better performance than previous proposals that attempt a similar analysis. We also illustrate, by means of an empirical example, how the proposed procedure can be implemented to solve a real-world problem, applying it to derive At Risk of Poverty and Exclusion (AROPE) rates for small geographical areas. More specifically, we estimate these rates for the municipalities of the region of Andalusia (Spain) for 2011, combining microdata from the EU-SILC database and the Spanish Population Census produced that year.
The numerical simulations together with the empirical example illustrate how this methodology can be helpful for studying the spatial heterogeneity of social indicators across small areas. However, further work should be done to explore its performance when the variable of interest is not categorical but continuous, or it has the form of count data. Future research should also explicitly address potential processes of spatial dependence between the geographical locations of interest.
One innovation of the technique presented here is that it does not require strong distributional assumptions and that it combines information from two different sources, while maintaining the consistency of the estimates produced with national and regional aggregates. One potential limitation associated with this characteristic is that the proposed GME technique has comparatively larger information requirements than other methodologies, since it is based on combining two different sources of data. Alternatives based on just exploiting observable information (for example, the Principal Components approach applied in Sánchez-Cantalejo et al. 2008) are less demanding in terms of information, but they do not necessarily produce small area estimates consistent with the national or regional figures that are usually estimated by official statistical agencies.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix: Numerical Experiments
This appendix conducts a Monte Carlo experiment that evaluates the GME procedure proposed. For the sake of simplicity, and also in order to facilitate comparison with the original formulation in Tarozzi and Deaton (2009), we replicate the general characteristics of the Monte Carlo simulations conducted in that paper (see Tarozzi and Deaton 2009, pages 781-784).
The idea of this simulation is to generate "true" but not completely observable data for a set of households and small spatial areas. The simulated population consists of $N = 15{,}000$ households distributed uniformly across 150 small (local) areas. The values of a continuous variable $z$ are generated as in Eq. (9). For simplicity, we assume that our categorical indicator of interest $y_{ij}$ is just a binary variable that equals 1 if $z^d_i < z^*$, where $z^*$ is a given threshold, and zero otherwise. In their paper, Tarozzi and Deaton (2009) use this data generation process to simulate poverty rates for the local areas as simple head-count indicators, where a household is classified as "poor" if the household income is below a given poverty line $z^*$.
In Eq. (9), subscript $d$ stands for the local areas and $i$ for the households, $e_d$ is an area-specific shifter distributed as $e_d \sim N(0, 0.01)$ and $\varepsilon^d_i$ is a disturbance that is normally distributed with mean zero and a variance $\sigma^2(x)$ that depends on the predictor. The predictor $x^d_i$ is generated as $x^d_i \sim N(0, 1)$. We assume that we sample $n_d = 10$ households in every local area, which makes a total sample size of $n = 1{,}500$. For this sample, we replicate Tarozzi and Deaton (2009) (TD) by estimating probit models where the dependent variable is $y_i$ and the matrix of covariates $X$ contains the values of variable $x$, its square, $\sin(x)$, $\cos(x)$, $\sin(2x)$ and $\cos(2x)$. Once these equations are estimated, the parameter estimates are projected onto the whole population of $N = 15{,}000$ households to predict the values of $z$ and $y$. Note that this population plays the role of the census, where the geographical location of the household (each local area $d$) is observable. In the numerical experiment conducted, they use three different poverty lines by setting $z^* = 24, 25, 26$. Once the predictions $\hat{y}^d_i$ for the whole population are calculated, we can produce estimates of the average of the indicator in each area $d$.
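The layout of this experiment (population, sampling scheme and design matrix) can be sketched as follows. The line that generates `z` is a PLACEHOLDER and not the data generating process of the paper or of Tarozzi and Deaton (2009): the intercept of 25 and the constant disturbance variance are assumptions made here only so that the code runs. The value 0.01 in $N(0, 0.01)$ is also read as a variance, which is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2009)

D, per_area, n_d = 150, 100, 10                  # 150 areas x 100 households = 15,000
area = np.repeat(np.arange(D), per_area)         # area code d of each household
x = rng.normal(0.0, 1.0, size=area.size)         # predictor x ~ N(0, 1)
e = rng.normal(0.0, np.sqrt(0.01), size=D)       # area shifter; 0.01 taken as a variance

# PLACEHOLDER outcome equation (hypothetical, NOT the paper's Eq. (9)).
z = 25.0 + x + e[area] + rng.normal(0.0, 1.0, size=area.size)
z_star = 25.0                                    # one of the poverty lines 24, 25, 26
y = (z < z_star).astype(int)                     # 1 if below the poverty line

def design(x):
    """Intercept, x, x^2, sin(x), cos(x), sin(2x), cos(2x): seven columns."""
    return np.column_stack([np.ones_like(x), x, x ** 2,
                            np.sin(x), np.cos(x), np.sin(2 * x), np.cos(2 * x)])

# Sample n_d = 10 households without replacement in every local area.
sample_idx = np.concatenate([
    rng.choice(np.flatnonzero(area == d), size=n_d, replace=False) for d in range(D)
])

X_s, y_s = design(x[sample_idx]), y[sample_idx]  # plays the role of the household survey
X_c = design(x)                                  # plays the role of the census
true_rates = np.array([y[area == d].mean() for d in range(D)])
```

The survey part (`X_s`, `y_s`) is what an estimator is allowed to see for the outcome, while `X_c` and `area` are available for every household; `true_rates` is kept only to evaluate the estimates afterwards.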
In order to evaluate their method, Tarozzi and Deaton (2009) compare the true values of $y^d_i$ with the predictions obtained by their TD estimator, finding that it produced smaller bias and Mean Squared Errors (MSE) than Elbers et al. (2003). Following this idea, we have extended this numerical experiment by including the proposed GME procedure in the comparison. In the Monte Carlo experiment conducted, we apply the GME moment-constrained estimator including equations such as (8) as constraints on each simulation draw. Note that we take the same $n \times 7$ matrix $X_s$ as the TD estimator to calculate the cross-moments $\frac{1}{n} X_s^T Y_s$ in the household survey that act as restrictions. Our GME estimates at the census level are $\frac{1}{N} X_c^T (\hat{P} + \hat{U})$, where matrix $X_c$ has dimension $N \times 7$.
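The two summary measures used in this kind of comparison can be computed as below. This is a sketch under an assumption: bias and RMSE are taken here as simple averages over all replications and local areas, which may differ in detail from the exact summaries reported by Tarozzi and Deaton (2009). The function name and the toy numbers are hypothetical.

```python
import numpy as np

def bias_and_rmse(estimates, truth):
    """Mean error and root mean squared error over replications and areas."""
    err = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    return float(err.mean()), float(np.sqrt((err ** 2).mean()))

# Toy example: 2 replications (rows) x 2 local areas (columns).
estimates = [[0.5, 0.3], [0.7, 0.1]]
truth     = [[0.4, 0.2], [0.6, 0.2]]
bias, rmse = bias_and_rmse(estimates, truth)
```

In the toy example three errors equal +0.1 and one equals −0.1, so the bias is 0.05 while the RMSE is 0.1, which shows how errors of opposite sign cancel in the bias but not in the RMSE.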
Regarding the specification of the support vectors for the error term contained in matrix $V$, we opted for wide error supports centered on zero with $L = 3$ points, $v = (-1, 0, 1)$. As in Tarozzi and Deaton (2009), we take 250 replications in the experiment, and a summary of the results, reporting the bias and the root of the MSE (RMSE), is presented in Table 3.
Note: Since the indicator of interest is binary, we simplify the notation and do not make use of the subindex $j$ in this and the following section.
Note: The probit specification implies the estimation of seven parameters because an intercept is considered as well.
The results of the numerical simulation suggest that the GME strategy produces results equivalent to those obtained by the TD estimator, both in terms of bias and RMSE. However, the results of the experiment are to some extent conditioned by the normality assumption underlying the probit estimation, which is in line with the conditions of the data generating process, since both $e_d$ and $\varepsilon^d_i$ (and $x^d_i$ as well, but only partially) are simulated to be normally distributed. We have modified the conditions of the data generating process by generating $\ln z^d_i \sim N(\mu_z, \sigma_z)$, where $\mu_z$ and $\sigma_z$ are respectively the empirical mean and standard deviation, across the 250 simulation draws, of the variable $z$ generated before. With this modification we aim at generating values of household income that follow a log-normal distribution, as empirical datasets usually suggest.
Once this modification had been introduced in the data generating process, we repeated the exercise and compared the performance of the TD and GME estimators. Results are reported in Table 4.
We can see that the deviations, both in terms of bias and RMSE, are now considerably smaller in the case of the GME estimator, irrespective of the specification of the poverty line. This is a consequence of the greater flexibility of the GME procedure, which only requires minimal assumptions on the distribution of the error terms $u_i$. In the specific case of our experiments, we only impose a symmetric distribution around zero with bounds at −1 and 1. Additionally, the GME estimator exploits the information contained in the household surveys and "pushes" the estimates at the census level to be consistent with the aggregates (i.e., the cross-moments $\frac{1}{n} X_s^T Y_s$) at the sample level. If these sample aggregates are reliable estimates of the equivalent aggregates in the population census (i.e., the unobservable $\frac{1}{N} X_c^T Y_c$), this information helps the estimator to produce more accurate estimates.