A hybrid approach for the spatial disaggregation of socio-economic indicators

Regular Paper


While statistical information on socio-economic activities is widely available, the data are often collected or released only at a relatively aggregated level. In these aggregated forms, the data are useful for broad-scale assessments, although we often need to disaggregate the source data in order to provide more localized estimates, and in order to analyze correlations against geophysical variables. Spatial disaggregation techniques can be used in this context, to transform data from a set of source zones into a set of target zones, with different geometry and with a higher general level of spatial resolution. Still, few previous studies in the area have attempted to leverage state-of-the-art spatial disaggregation procedures in the context of socio-economic variables, instead focusing on applications related to population modeling. In this article, we report on experiments with a hybrid spatial disaggregation technique that combines state-of-the-art regression analysis procedures with the classic methods of dasymetric mapping and pycnophylactic interpolation. The hybrid procedure was used together with population density, land coverage, nighttime satellite imagery, and OpenStreetMap road density, as ancillary data to disaggregate different types of socio-economic indicators to a high-resolution grid. Our test specifically leveraged data relative to the Portuguese territory, resulting in the production of raster datasets with a resolution of 30 arc-seconds per cell. The article discusses the spatial disaggregation methodology and the quality of the obtained results under different experimental conditions.


Keywords: Spatial analysis · Downscaling · Geographic information systems · Regression-based spatial disaggregation · Socio-economic indicators

1 Introduction

Statistical information on socio-economic activities is widely available, although the data are often collected or released only at a relatively aggregated level. Census data, for example, are often aggregated to census tracts, in part because of concerns about confidentiality. Depending on their nature, data on social indicators or on economic activities may likewise be aggregated to country or regional administrative units.

In these aggregated forms, the data are useful for broad-scale assessments, but using aggregated data has the danger of masking important local hotspots, and overall tends to smooth out spatial variations in impact. For this reason, researchers often need to disaggregate source data, in order to provide more localized estimates. In the context of spatial analysis, spatial disaggregation or spatial downscaling are processes by which information at a coarse spatial scale is translated to finer scales, while maintaining consistency with the original dataset. These techniques are used to convert data originally available for a set of source zones into a set of target zones that have a different geometry and a higher level of spatial resolution. Ranging in complexity from simple areal weighting to intelligent dasymetric disaggregation [33], most approaches have been applied to population data, and they have in common what Tobler [59] termed the pycnophylactic, or mass-preserving, property, in that the estimates are conditioned to sum to the original quantities in the source zones. The term spatial disaggregation is in fact usually employed in the context of additive variables (i.e., population counts and other datasets of aggregated counts over which the pycnophylactic property should be enforced), whereas spatial downscaling is more general, being frequently used with non-additive variables (e.g., environmental or geophysical properties, such as temperature, precipitation, soil moisture, agricultural land usage, air quality, etc.).

In this article, we report on experiments with a hybrid spatial disaggregation technique that combines the ideas of dasymetric mapping and pycnophylactic interpolation, using population density, nighttime satellite imagery, land coverage, and OpenStreetMap road density information as ancillary data to disaggregate different types of socio-economic indicators. Apart from a few exceptions (e.g., seminal work in the area by Goodchild et al. [25] that considered variables such as employment and income, or more recent work within the G-Econ research project at Yale University, which aimed to develop datasets on economic activity through spatial rescaling based on proportional allocation [47, 48]), most previous studies concerned with spatial disaggregation/downscaling have focused either on population density or on geophysical/environmental variables. We nonetheless believe that previously developed procedures, which had these traditional applications in mind, can equally be applied to socio-economic indicators, facilitating the development of studies that link socio-economic information with different types of geophysical factors.

The spatial disaggregation technique discussed in this article has specifically been applied in a case study relative to the Portuguese territory, resulting in the production of raster datasets with different types of socio-economic indicators. Also referred to as a grid, a raster dataset is a type of tessellation (i.e., a mosaic) that divides a surface into uniform cells (i.e., pixels), being commonly used for representing phenomena that vary continuously over a geographic space. Examining socio-economic patterns in a gridded format has many advantages, one of them being the ease of linking the socio-economic data to readily available geophysical data on themes such as climate, ecology, and the like. Even though socio-economic data are typically only collected at the level of regional administrative units, we may be interested in analyzing the data through different partitions of space (e.g., at the level of fine-grained administrative subunits, or on the basis of regular tessellations of the geographic space, in order to better analyze local hotspots), or in terms of their relation to particular geophysical characteristics (e.g., proximity to regions with specific land coverage types, or relationship to terrain elevation). Disaggregated socio-economic data can also have important applications in the generation of located synthetic population datasets [2, 3], for instance to be used later in the context of spatial simulations.

In brief, the main research questions and contributions of this article can be summarized as follows:
  • We compared different methods for the disaggregation of socio-economic indicators, including simple baseline approaches (e.g., standard mass-preserving areal weighting or pycnophylactic interpolation) and dasymetric mapping methods that leverage ancillary sources of information. Most previous work in the area has considered applications such as population mapping, and we were interested to see (i) if similar approaches could also be applied to other types of variables, and (ii) the degree to which ancillary data could be used to improve the results, depending on the type of variable that is being analyzed;

  • We proposed and evaluated a novel intelligent disaggregation method, based on a downscaling procedure originally proposed by Malone et al. [45] that uses regression analysis to combine different ancillary variables. We adapted/extended the original procedure, which deals with the spatial downscaling of non-additive variables, in several directions. These include (i) combining it with the use of pycnophylactic interpolation, thus resulting in a hybrid spatial disaggregation approach, (ii) experimenting with different types of regression models, and (iii) carefully sampling data points prior to the training of regression models. We comparatively evaluated the application of the proposed method to the disaggregation of socio-economic indicators relative to the Portuguese territory, when using different types of regression algorithms.

The rest of this article is organized as follows: Sect. 2 presents fundamental concepts and important related work. Section 3 describes the considered hybrid spatial disaggregation approach. Section 4 details the case study concerning socio-economic indicators relative to the Portuguese territory. Finally, Sect. 5 presents our main conclusions and highlights possible directions for future work.

2 Fundamental concepts and related work

This section starts by describing classical approaches for spatial disaggregation, afterward describing more recent developments and practical applications.

2.1 Spatial disaggregation methods

Spatial disaggregation, as a procedure, is applied to data sets for which the underlying spatial distribution is unknown, but for which aggregated data, on the basis of spatial zones that resulted from some convenience of enumeration, already exist. A process of spatial disaggregation, or spatial downscaling to be more general, thus refers to the transformation of data from the arbitrary zones of data aggregation, to a set of target zones with different geometry and a higher general level of spatial resolution, in order to recover and better depict the underlying spatial distribution of the data [33].

The most basic method for spatial disaggregation is mass-preserving areal weighting, in which a homogeneous distribution of the data throughout each source zone is assumed [26]. Mass-preserving areal weighting redistributes the aggregated data on the basis of the proportion of each source zone that overlaps with the target zone, according to the following equation:
$$\begin{aligned} P_t = \sum _{\{s : s \cap t \ne \emptyset \}} \left( P_s \times \frac{A_{s \cap t}}{A_s} \right) \end{aligned}$$
In the formula, the parameter \(P_t\) is the estimated count in a target zone t, while \(P_s\) is the count in a source zone s that is to be disaggregated. The parameter \(A_s\) corresponds to the area of source zone s, and \(A_{s \cap t}\) corresponds to the area of target zone t overlapping with the source zone s.
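
As an illustration of the equation above, the following minimal Python sketch computes areal-weighting estimates from precomputed overlap areas. The function name, the dictionary-based representation of zones, and the toy numbers are assumptions made for this example only; in practice the overlap areas would come from a GIS overlay operation.

```python
def areal_weighting(source_counts, source_areas, overlap_areas):
    """Mass-preserving areal weighting: P_t = sum_s P_s * A_st / A_s."""
    target_counts = {}
    for (s, t), overlap in overlap_areas.items():
        share = overlap / source_areas[s]  # fraction of zone s lying inside t
        target_counts[t] = target_counts.get(t, 0.0) + source_counts[s] * share
    return target_counts


# Hypothetical toy data: two source zones split over three target zones.
source_counts = {"s1": 100.0, "s2": 60.0}  # P_s
source_areas = {"s1": 4.0, "s2": 3.0}      # A_s
overlap_areas = {                           # A_st, overlapping pairs only
    ("s1", "t1"): 1.0,
    ("s1", "t2"): 3.0,
    ("s2", "t2"): 1.0,
    ("s2", "t3"): 2.0,
}

estimates = areal_weighting(source_counts, source_areas, overlap_areas)
print(estimates)
```

With these inputs the estimates are 25, 95 and 40 for t1, t2 and t3 (up to floating-point rounding), which sum to the original total of 160, as required by the mass-preserving property.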

While mass-preserving areal weighting ensures that the total count from the source data remains unchanged, it is based on the often incorrect assumption that the phenomena of interest are evenly distributed across the source zones. Population is one example where the assumption behind mass-preserving areal weighting clearly does not hold, since populations are rarely uniform across census tracts, and instead tend to be highly clustered in urban centers, surrounded by areas of dispersed rural homesteads.

Mask areal weighting, also referred to as binary dasymetric mapping, is an improvement on simple areal weighting in that it uses a mask to define where, within each source zone, the source data should be allocated [15]. Each source unit is divided into two sub-regions (i.e., populated and unpopulated) and the source information is then allocated only to the populated areas. Land coverage data, for instance derived from satellite imagery [22, 34], can be used to identify populated areas and create the mask. The general equation is as follows:
$$\begin{aligned} P_t = \sum _{\{s : s \cap t \ne \emptyset \}} \left( P_s \times \frac{A_{sp \cap t}}{A_{sp}} \right) \end{aligned}$$
In the previous formula, \(A_{sp \cap t}\) is the area of populated land that overlaps between the target map unit t and the source map unit s, while \(A_{sp}\) is the area of the source map unit s that corresponds to populated land. Mask areal weighting, and dasymetric disaggregation in general, has gained popularity with the increasing availability of satellite imagery, and with improved methods for using remotely-sensed Earth observation data within geographic information systems.

Although the results of binary mask areal weighting are generally an improvement over those of simple areal weighting, there are still considerable deficiencies in this method. For instance, not all populated areas have the same density, but binary mask areal weighting assumes that all the populated areas are homogeneous with respect to density. Additionally, areas marked as non-populated in the mask often have some population too, which is entirely eliminated in the purely binary approach. Several authors have proposed refinements to the dasymetric approach introduced above, taking it from a binary model to more nuanced approaches, which result in a more realistic depiction of the densities that are typically encountered in real-world variables.

In the general case, dasymetric disaggregation is any type of areal interpolation method for disaggregating spatial data that leverages ancillary information. As mentioned in relation to mask areal weighting, land cover data, in particular, offers a means by which residential areas can be distinguished from nonresidential areas, and thus land cover has been extensively used in the context of population modeling. General dasymetric disaggregation is thus an improvement over mask areal weighting in that two or more categories can be assigned weights (e.g., specific disaggregation weights can be derived for individual land cover types to reflect population density). This is often also referred to as poly-categorical dasymetric disaggregation, or as the class-percent method. In the poly-categorical dasymetric disaggregation method, percentages are applied to each of the categories for the source area, representing the percentage of population (or another variable) that is likely to be contained within that category, per source area. These percentage numbers will vary depending on the location of the area of interest, and are subject to perceived local conditions and arbitrariness of the analyst. The main challenge in dasymetric disaggregation thus involves devising an appropriate set of weights that can be applied to the classes in the ancillary data, for instance to reflect population density. Weights may, for instance, be defined using selective sampling, or by some form of regression analysis.

In the extreme case where there is no predefined number of classes for poly-categorical disaggregation, the general dasymetric disaggregation method can be seen as a proportional and weighted areal interpolation method. Similarly to the case of mass-preserving areal weighting, each target zone also takes a proportionally calculated value, but now this value is also weighted according to some external variable (e.g., nighttime light intensity) or combination of external variables (e.g., obtained through regression analysis). Proportional and weighted areal interpolation corresponds to the formula shown below, where \(W_{s \cap t}\) is the weight assigned to the part of target zone t that overlaps with source zone s, and where each \(W_{s \cap t}\) is chosen on the basis of the external variable(s), ensuring that \(\sum _{\{t' : t' \cap s \ne \emptyset \}} W_{s \cap t'} = 1\) if the estimates are required to sum to the same values of the source zones (i.e., in the case of disaggregation).
$$\begin{aligned} P_t = \sum _{\{s : s \cap t \ne \emptyset \}} \left( P_s \times \frac{W_{s \cap t} \times A_{s \cap t}}{\sum _{\{t' : t' \cap s \ne \emptyset \}} ( W_{s \cap t'} \times A_{s \cap t'} )} \right) \end{aligned}$$
The above method disaggregates the source data under the assumption that target regions containing a higher value for the external variable will also correspond to regions having higher counts in the source data. Given that socio-economic indicators often correlate with population density, leveraging this method together with population density as an auxiliary variable can be a simple and natural approach for the spatial disaggregation of socio-economic data. We explored this idea in the present article.
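
The proportional and weighted areal interpolation formula can be illustrated with a short Python sketch. The function name, the data layout, and the toy weights (a hypothetical mean nighttime light intensity per overlapping part) are assumptions made for illustration. Setting the weights to 0 or 1 mimics binary mask areal weighting, under the simplifying assumption that each overlapping part is either entirely populated or entirely unpopulated.

```python
def weighted_areal_interpolation(source_counts, overlap_areas, weights):
    """P_t = sum_s P_s * (W_st * A_st) / sum_t' (W_st' * A_st')."""
    # Denominator per source zone: total weighted area of its overlapping parts.
    denominators = {}
    for (s, t), area in overlap_areas.items():
        denominators[s] = denominators.get(s, 0.0) + weights[(s, t)] * area

    target_counts = {}
    for (s, t), area in overlap_areas.items():
        if denominators[s] == 0.0:
            # Assumption of this sketch: a zone with no weighted area is skipped,
            # so its count is not allocated (mass is lost for that zone).
            continue
        share = weights[(s, t)] * area / denominators[s]
        target_counts[t] = target_counts.get(t, 0.0) + source_counts[s] * share
    return target_counts


# Hypothetical toy data: one source zone covering two target zones.
source_counts = {"s1": 100.0}
overlap_areas = {("s1", "t1"): 1.0, ("s1", "t2"): 3.0}

# Continuous weights, e.g. a hypothetical mean light intensity per part.
light_weights = {("s1", "t1"): 3.0, ("s1", "t2"): 1.0}
by_light = weighted_areal_interpolation(source_counts, overlap_areas, light_weights)

# 0/1 weights reproduce a binary mask in which only t1 is populated.
mask_weights = {("s1", "t1"): 1.0, ("s1", "t2"): 0.0}
by_mask = weighted_areal_interpolation(source_counts, overlap_areas, mask_weights)

print(by_light, by_mask)
```

In the first case the small but bright part t1 receives as much as the large but dim part t2 (50 each), whereas plain areal weighting would assign 25 and 75. In the second case the whole count of 100 is allocated to t1.
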

Although dasymetric mapping methods are preferable to other methods that do not use ancillary data, having been shown to achieve superior performance [65], they also have several methodological and cartographic shortcomings, one of them being that the estimated density values often change abruptly when crossing source zone boundaries, even where both sides have the same land-use class. Rather than performing the disaggregation into zones, several methods instead aim at creating continuous surfaces depicting the disaggregation of the target data. For instance, Tobler [60] proposed one such pycnophylactic spatial disaggregation method, which is an extension of simple areal weighting that assumes a degree of spatial autocorrelation of the variable being interpolated. This method calculates target region values based on the values and weighted distance from the center of neighboring source regions, keeping the mass consistent within source regions. Tobler’s method starts by applying the mass-preserving areal weighting procedure described previously, using a grid to define the target zones. Then, the values for the grid cells \(P_t\) are smoothed, by replacing them with the average of their neighbors. The predicted values in each source zone are then compared with the actual values, and adjusted to meet the pycnophylactic condition of mass-preservation. The smoothing and adjustment steps are repeated until there is either no significant difference between predicted values and actual values within the source zones, or until there have been no significant changes from the previous iteration. The interpolated surface produced by the pycnophylactic method from Tobler [60] is smooth, with relatively small changes in attribute values at target region boundaries. The sum or mass of combined target attribute values, within each source region, is also kept consistent.
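
The smooth-then-rescale loop of pycnophylactic interpolation can be sketched in a few lines of Python. This is a simplified illustration rather than Tobler's exact procedure: it assumes equal-area grid cells, a grid with more than one cell, a fixed number of iterations instead of a convergence test, and a hypothetical uniform fallback for zones whose smoothed values sum to zero. The function and variable names are illustrative.

```python
def pycnophylactic(zone_ids, zone_totals, iterations=50):
    """Smooth a gridded surface while preserving the total of each source zone."""
    rows, cols = len(zone_ids), len(zone_ids[0])

    cell_counts = {}
    for row in zone_ids:
        for zone in row:
            cell_counts[zone] = cell_counts.get(zone, 0) + 1

    # Start from areal weighting: equal-area cells share the zone total evenly.
    grid = [[zone_totals[z] / cell_counts[z] for z in row] for row in zone_ids]

    for _ in range(iterations):
        # Smooth: replace each cell with the mean of its (up to eight) neighbours.
        smoothed = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                neighbours = [
                    grid[rr][cc]
                    for rr in range(max(0, r - 1), min(rows, r + 2))
                    for cc in range(max(0, c - 1), min(cols, c + 2))
                    if (rr, cc) != (r, c)
                ]
                smoothed[r][c] = sum(neighbours) / len(neighbours)

        # Rescale each zone so that its cells sum to the original total again.
        zone_sums = {}
        for r in range(rows):
            for c in range(cols):
                z = zone_ids[r][c]
                zone_sums[z] = zone_sums.get(z, 0.0) + smoothed[r][c]
        for r in range(rows):
            for c in range(cols):
                z = zone_ids[r][c]
                if zone_sums[z] > 0.0:
                    grid[r][c] = smoothed[r][c] * zone_totals[z] / zone_sums[z]
                else:
                    # Fallback (assumption): spread the total uniformly.
                    grid[r][c] = zone_totals[z] / cell_counts[z]
    return grid


# Hypothetical toy grid: zone "a" (total 40) next to zone "b" (total 8).
zone_ids = [["a", "a", "b", "b"],
            ["a", "a", "b", "b"]]
zone_totals = {"a": 40.0, "b": 8.0}
surface = pycnophylactic(zone_ids, zone_totals)
for row in surface:
    print([round(v, 2) for v in row])
```

In this example the values decrease gradually from the cells of zone "a" that are far from the boundary to the cells of zone "b" that are far from it, while the cells of each zone still sum to 40 and 8.
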

Fig. 1 Flowchart of the hybrid disaggregation method

Recent research on spatial disaggregation has advanced extensions and/or combinations of the classical methods surveyed in this section. For instance, authors like Kim and Yao [38] have developed hybrid approaches for the spatial disaggregation of population data that combine dasymetric mapping and pycnophylactic interpolation, making use of ancillary information that sheds light on the spatial structure of population distribution, while at the same time also adopting the conceptual assumption that population density varies smoothly, instead of uniformly, in space. Noting that the advantages and shortcomings of the two methods are complementary, Kim and Yao proposed an approach that consists of two consecutive logical steps: dasymetric mapping for a preliminary population redistribution, followed by an iterative pycnophylactic interpolation for obtaining a mass-preserving smoothed surface. Binary dasymetric mapping is used in the first step, resulting in a rough estimate of population density over the residential pixels, with the authors arguing that prior studies [18, 40] found no evidence to support any extra benefits of using general dasymetric mapping. In the second step, a smooth surface is produced on the basis of the neighborhood of each residential pixel, using a floating window to calculate the average density value. In this context, the search distance of the floating window is computed through an iterative process of fine-tuning, until the integral of smoothed density values in the source zone matches the original value of the source zone. In the beginning of Sect. 3, we present a diagram that illustrates our hybrid method, and point out its main differences from the approach of Kim and Yao.

The method advanced in this paper also takes its inspiration from the spatial downscaling algorithm proposed by Malone et al. [45]. This method is based on a disseveration procedure similar to dasymetric mapping, for which there is an open-source R implementation that was extended for the experiments reported in this article (i.e., although the original method was proposed for spatial downscaling of non-additive variables instead of disaggregation, in our work we extended the pre-existing R implementation in order to develop a novel spatial disaggregation method, combining the disseveration algorithm with the idea of pycnophylactic interpolation; see Fig. 1). The algorithm described by Malone et al. has two phases, and it originally used generalized additive modeling to fit a nonlinear relationship between a target variable (i.e., the indicator that we wish to model at a fine resolution, and for which we have data at a coarse resolution) and predictive covariates (i.e., data for other variables, available at a fine resolution, that can inform the downscaling). In an initialization phase, the authors perform a coarse grid to fine grid resample (i.e., through a nearest neighbor re-sampling approach for data downscaling, in which the cells at the fine grid take the value from the closest coarse grid cells), followed by random sampling of data points and initial model fit. The model assumes that the value at each target region corresponds to an additive combination of nonlinear functions (i.e., cubic splines with knots at each of the target regions) of the covariates. More details about additive models are available in the book by Hastie and Tibshirani [31]. In an iteration phase, adjustments are made to the predictions iteratively, trying to ensure that the coarse grid is linearly related to the fine grid predictions (i.e., there is a mass balance property to be attained). Iterations proceed until a stopping criterion is met, based on a maximum number of iterations, or alternatively using a threshold over the change in the estimated error rate over three consecutive iterations.
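
The two phases described above can be sketched as follows. This is a loose illustration in the spirit of the disseveration procedure, not the implementation by Malone et al. or the R package: it swaps the generalized additive model for an ordinary least-squares line on a single covariate, uses all cells instead of a random sample, applies a multiplicative adjustment so that the mean of the fine cells matches each coarse value, and stops when the error changes by less than a tolerance between two consecutive iterations. All names and toy values are assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def group_means(values, groups):
    """Mean of the fine-cell values inside each coarse cell."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


def dissever_sketch(coarse_values, cell_to_coarse, covariate, max_iter=20, tol=1e-6):
    # Initialization: every fine cell inherits the value of its coarse cell.
    fine = [coarse_values[g] for g in cell_to_coarse]
    previous_error = None
    for _ in range(max_iter):
        # Fit the (swapped-in) linear model and predict at the fine resolution.
        slope, intercept = fit_line(covariate, fine)
        predicted = [slope * x + intercept for x in covariate]
        means = group_means(predicted, cell_to_coarse)
        # Adjust predictions so each coarse cell's mean matches its known value.
        fine = [
            p * coarse_values[g] / means[g] if means[g] != 0 else coarse_values[g]
            for p, g in zip(predicted, cell_to_coarse)
        ]
        # Root mean squared mismatch between coarse values and prediction means.
        error = (sum((coarse_values[g] - means[g]) ** 2 for g in means) / len(means)) ** 0.5
        if previous_error is not None and abs(previous_error - error) < tol:
            break
        previous_error = error
    return fine


# Hypothetical toy data: two coarse cells, each covering two fine cells.
coarse_values = {"A": 10.0, "B": 20.0}
cell_to_coarse = ["A", "A", "B", "B"]
covariate = [1.0, 3.0, 2.0, 6.0]
fine_values = dissever_sketch(coarse_values, cell_to_coarse, covariate)
print([round(v, 3) for v in fine_values])
```

Within each coarse cell, the fine cell with the larger covariate value receives the larger estimate, while the mean of the fine cells stays equal to the coarse value. For additive variables, the adjustment would instead be made on sums, which is the role played by the pycnophylactic constraint in the hybrid method.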

2.2 Applications and case studies

In a study of areal interpolation for socio-economic data, Goodchild et al. [25] looked at a typical problem of spatial analysis using non-coincident areal units, namely the 58 counties of California (the source zones) and the state's 12 major hydrological basins (the target zones). The boundaries of the two sets of spatial units were, for the most part, incompatible. Socio-economic data were available at the county level, but data connected with water issues were collected based on the hydrological basin units that correspond to major watershed boundaries. In order to conduct a major economic impact study of water usage and policy, variables such as employment, income, and population had to be transferred from the county spatial units to the hydrological regions. Goodchild et al. [25] used direct areal weighting to accomplish this, assuming that densities in the source zones (the counties) were uniform. When later comparing the results of the areal weighting method with other methods using statistical approaches, they found that areal weighting had a much higher mean percentage error than did the other methods.

Gallego [20] described the production of a high-resolution European population map through a stochastic allocation process by which weights are devised for disaggregating population totals from larger administrative units (i.e., NUTS 2 regions in Europe) to smaller ones (i.e., communes) on the basis of land cover information. Communes were first stratified, by comparing the commune population density to the average density of the surrounding NUTS 2 region, into one of three levels reflecting population density (i.e., dense, less dense, and not urban). The method then involved disaggregating the NUTS 2 totals using an initial set of weights, re-aggregating the population to the commune level, and comparing it to the known total, this way computing a disagreement indicator, and adjusting the weights to reduce the disagreement. Several subsequent studies have described refined versions of the methodology originally put forward by Gallego [20], which involved combining land use and/or land coverage information present in various high-resolution data sources [24, 57].

Besides land coverage and related types of information, which are often extracted from multi-spectral satellite imagery through data classification procedures, other dasymetric disaggregation approaches have been reported to use different kinds of ancillary data. For instance, authors like Elvidge et al. [17] proposed to use satellite-observed visible to near-infrared emissions, while Doll et al. [13] proposed to use nighttime light emissions, and Langford [40] proposed the usage of rasterized topographic maps. Depending on the variable that is to be disaggregated, different types of ancillary variables can indeed be of use.

Reibel and Bufalino [52] used street network data (i.e., U.S. Census TIGER files) to derive weights for the interpolation of population and housing unit counts, for incompatible zone systems in Los Angeles County, California. The authors used the street and road grid as a proxy for approximate population and housing unit density surfaces, for census tracts in the county. They then conducted an error analysis, comparing the results of the street-weighting method with traditional areal weighting, finding that the street-weighting method offers some benefits. Despite the interesting results, the authors noted that the street-weighting method appears to reduce errors most in those areas where the lack of population is reflected in the lack of roads, and least in those areas with a more developed but nonresidential transportation infrastructure (e.g., industrial areas).

In the context of modeling European population, Briggs et al. [7] developed a model that incorporates ancillary data, specifically Earth Observation (EO) products corresponding to nighttime light intensity data and CORINE land cover [34], in a GIS-based regression approach to disaggregate NUTS 5 census totals to a resolution of 1 km\(^2\). European light emission data from the DMSP satellites were re-sampled and modeled using kriging and inverse distance weighting, to provide a 200 m resolution light emissions map. This was matched to CORINE land cover classes, and linear regression analysis was used to derive models of relationships between census population counts, land cover area, and light emissions. The regression weights were then used in the dasymetric disaggregation procedure.

In a more recent study, Stevens et al. [58] also combined different types of remotely-sensed data, such as nighttime lights and land cover information, to derive weights for the disaggregation of population counts, originally at a country level. As their weighting scheme, the authors used a flexible and nonparametric predictive model based on ensembles of decision trees (i.e., the random forest regression approach [6]), in order to leverage the available ancillary data to generate a gridded prediction at an approximate spatial resolution of 100 \(\times \) 100 m. Stevens et al. [58] concluded that, at country-level scales, the ensemble of decision trees performed substantially better than other methods. The authors argued that decision trees are indeed quite flexible, being able to handle multiple covariates (i.e., the auxiliary variables that can inform the spatial disaggregation) of both discrete and continuous natures, with a minimum amount of tuning and supervision.

In sum, several previous studies have compared a variety of spatial disaggregation methods (see Wu et al. [65] for a recent review). Previous research suggests that dasymetric and intelligent (e.g., regression-based) areal interpolation techniques can outperform areal weighting and other areal interpolation approaches that do not incorporate ancillary data, although interpolation accuracy is dependent on the strength of the relationship between the source and ancillary data. Fisher and Langford [18] found that the traditional binary dasymetric method could also be more accurate than both areal weighting and regression-based intelligent areal interpolation techniques. Similar results were reported by Langford [40], who found no evidence to support any extra benefits of using multi-class dasymetric mapping, and by Gregory [27], who also reported good results for areal interpolation with a probabilistic approach that combines the binary method with the Expectation-Maximization (EM) algorithm [28]. Most previous studies in the area have also been limited to applications related to population modeling, although there are many other potentially interesting applications.

3 A hybrid disaggregation method

Both the dasymetric mapping and the pycnophylactic interpolation methods have solid theoretical foundations, as well as strong empirical support in population-estimation research. Each of these methods has its own strengths, but also suffers from obvious shortcomings. For instance, pycnophylactic interpolation guarantees a smooth surface in the study area, without any presumption of uniform distribution (i.e., it iteratively smooths the estimates by taking the average of neighboring cells, instead of just dividing the total mass by the number of cells, as in the case of mass-preserving areal weighting). However, the method does not draw on any ancillary information about the real spatial distribution, so that its estimation accuracy cannot benefit from useful information that is frequently available.

We present a hybrid approach that takes advantage of the strengths and remedies the flaws of both methods, following the general ideas from Kim and Yao [38] and Malone et al. [45]. Kim and Yao [38] proposed a method that starts with binary dasymetric disaggregation leveraging land coverage data, for producing initial estimates that are later refined through pycnophylactic interpolation. In this article, we instead propose to leverage pycnophylactic interpolation for producing the initial estimates, which are then adjusted through a disseveration method [45], as shown in Fig. 1. The general procedure is detailed next, through an enumeration of all the individual steps that are involved:
  1. Produce a vector polygon layer for the variable to be disaggregated by associating the quantities, linked to the source regions, to geometric polygons representing the corresponding regions;

  2. Create a raster representation for the study region, on the basis of the vector polygon layer from the previous step and considering a resolution of 30 arc-seconds per cell. This raster, referred to as \(T^p\), will contain smooth values resulting from a pycnophylactic interpolation procedure [60]. The algorithm starts by assigning to the cells the corresponding values in the original vector polygon layer, using a simple mass-preserving areal weighting procedure (i.e., we redistribute the aggregated data on the basis of the proportion of each source zone that overlaps with the target zone). Iteratively, each cell’s value is replaced with the average of its eight neighbors in the target raster. We then adjust the values of all cells within each zone proportionally, so that each zone’s total in the target raster is the same as the original total (e.g., if the total is 10% lower than the original value, we multiply the value of each cell by 1/0.9, i.e., an increase of roughly 11%). The procedure is repeated until no more significant changes occur. The resulting raster is a smooth surface corresponding to an initial estimate for the disaggregated values;

  3. Overlay four rasters \(P^1\), \(P^2\), \(P^3\) and \(P^4\), also using a resolution of 30 arc-seconds per cell, on the study region covered by the original vector layer and by the raster produced in the previous step. These four rasters contain, respectively, information regarding (i) population counts, (ii) nighttime light emissions, (iii) land coverage classification, and (iv) OpenStreetMap road network density. These rasters will be used as ancillary information for the spatial disaggregation procedure. Prior to overlaying the data, the four different raster data sources are normalized to the resolution of 30 arc-seconds per cell, through a simple interpolation procedure based on taking the mean of the different values per cell (i.e., in the cases where the original raster had a higher resolution), or the value from the nearest/encompassing cell (i.e., in the cases where the original raster had a lower resolution);

  4.

    Overlay two other rasters \(P^5\) and \(P^6\) over the study region, again using the same resolution of 30 arc-seconds per cell and with ancillary information derived from the rasters in the previous step. Specifically, these two rasters encode (i) the distance from a given cell to the nearest cell with a land coverage type equal to water, and (ii) the distance from a given cell to the nearest cell containing a road or a street segment. Raster \(P^5\) is thus derived from raster \(P^3\) with land coverage information, whereas \(P^6\) is derived from raster \(P^4\) with OpenStreetMap road network density. These two rasters will also be used as ancillary information for spatial disaggregation;

  5.
    Overlay another raster \(T^d\) on the study region, with the same resolution used in the rasters from the previous steps. This raster will be used to store the estimates produced by a simple spatial disaggregation procedure based on dasymetric mapping (i.e., a method based on proportional and weighted areal interpolation). To produce these estimates, we weight the total value of each source zone in the original vector polygon layer according to the ratio between the population value available for the corresponding cell in raster \(P^1\) and the sum of all the values for the given source zone in the same raster. This corresponds to the following equation, where \(T^d_t\) is the estimated count in target zone t, \(S_s\) is the count in source zone s, \(P_t\) is the population count in target zone t, and \(P_s\) is the population count in source zone s;
    $$\begin{aligned} T^d_t = \sum _{\{s : s \cap t \ne \emptyset \}} \left( \frac{P_t}{P_s} \times S_s \right) \end{aligned}$$
  6.

    Collect a sample of cells in the fine-resolution grid, in order to later fit regression models. The experiments with the original disseveration procedure that were described by Malone et al. [45] considered a random sampling strategy, although better alternatives can also be used. For study regions of a moderate size, all data instances can be considered. Alternatively, sampling can be performed using the R function spsample, which supports both regular (i.e., systematically aligned) sampling, able to evenly represent the entire geographic region while at the same time avoiding the problem of spatial autocorrelation, and clustered sampling (i.e., the same number of samples are collected from groups of points assumed to have different characteristics). In some of our experiments, we relied on a regular sampling of the data points, although most of our results were reported with models trained with the full set of data instances;

  7.

    Create a final raster overlay, through the application of an intelligent dasymetric disaggregation procedure based on disseveration, as proposed by Malone et al. [45], and leveraging the rasters from the previous steps. Specifically, the vector polygon layer from Step 1 is considered as the source data to be disaggregated, while raster \(T^p\) from Step 2 is considered as an initial estimate for the disaggregated values. Rasters \(P^1\), \(P^2\), \(P^3\), \(P^4\), \(P^5\), \(P^6\) and \(T^d\) are seen as predictive covariates. The regression algorithm used in the disseveration procedure is fit using the data sample from the previous step, and applied to produce new values for raster \(T^p\). The application of the regression algorithm will refine the initial estimates based on their relation to the predictive covariates, this way dissevering the source data;

  8.

    Proportionally adjust the values returned by the downscaling method from Malone et al. [45] for all cells within each source zone, so that each source zone’s total in the target raster is the same as the total in the original vector polygon layer (e.g., again, if the total is 10% lower than the original value, multiply the value of each cell by 1/0.9, i.e., increase it by approximately 11%);

  9.

    Steps 6–8 are repeated, iteratively executing the disseveration procedure that relies on regression analysis to adjust the initial estimates \(T^p\) from Step 2, until the estimated values converge (i.e., the change in the estimated error rate over three consecutive iterations is less than 0.001) or until reaching a maximum number of iterations (i.e., 100 iterations).
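The pycnophylactic interpolation of Step 2 can be illustrated with a minimal sketch. The fragment below is written in Python for illustration only and is not the code from the pycno package. It assumes that every cell belongs to exactly one source zone, that border cells are averaged over the neighbors that exist, and that convergence is judged by the largest cell change; the function names and the toy grid are hypothetical.

```python
def zone_sums(grid, zones):
    """Sum the cell values of `grid` for each zone label in `zones`."""
    sums = {}
    for value_row, zone_row in zip(grid, zones):
        for value, zone in zip(value_row, zone_row):
            sums[zone] = sums.get(zone, 0.0) + value
    return sums


def rescale(grid, zones, totals):
    """Scale cells so that every zone sums to its source total."""
    sums = zone_sums(grid, zones)
    # A zone whose cells are all zero cannot be rescaled and stays at zero.
    return [[value * totals[zone] / sums[zone] if sums[zone] > 0 else 0.0
             for value, zone in zip(value_row, zone_row)]
            for value_row, zone_row in zip(grid, zones)]


def smooth(grid):
    """Replace each cell with the mean of its (up to eight) neighbors."""
    n_rows, n_cols = len(grid), len(grid[0])
    result = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            neighbors = [grid[i][j]
                         for i in range(max(0, r - 1), min(n_rows, r + 2))
                         for j in range(max(0, c - 1), min(n_cols, c + 2))
                         if (i, j) != (r, c)]
            row.append(sum(neighbors) / len(neighbors) if neighbors else grid[r][c])
        result.append(row)
    return result


def pycnophylactic(zones, totals, tol=1e-6, max_iter=100):
    """Smooth, then restore zone totals, until cells stop changing."""
    # Initial estimate: spread each zone total evenly over its cells.
    cell_counts = zone_sums([[1.0] * len(row) for row in zones], zones)
    grid = [[totals[zone] / cell_counts[zone] for zone in row] for row in zones]
    for _ in range(max_iter):
        updated = rescale(smooth(grid), zones, totals)
        change = max(abs(u - g)
                     for u_row, g_row in zip(updated, grid)
                     for u, g in zip(u_row, g_row))
        grid = updated
        if change < tol:
            break
    return grid


# Toy example: two source zones on a 4 x 4 grid (made-up totals).
zones = [["a", "a", "b", "b"] for _ in range(4)]
totals = {"a": 100.0, "b": 20.0}
surface = pycnophylactic(zones, totals)
```

Because the rescaling is the last operation of every iteration, the zone totals of the returned surface match the source totals regardless of when the loop stops.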

Notice that the previous enumeration describes the proposed procedure through example applications that involve a specific resolution (i.e., 30 arc-seconds per cell) and a particular set of ancillary datasets. In fact, we did not use a higher resolution because many of the ancillary datasets that were considered, particularly the ones that we expect to be more correlated with the target socio-economic variables, were only available at this resolution—see Sect. 3.2. The same general procedure could nonetheless also be used in different scenarios, involving different parameters. Moreover, different regression algorithms can also be used within Step 7. In the following sub-sections, we describe the particular regression algorithms that were considered, and detail the sources of ancillary data.
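As an illustration of the dasymetric baseline from Step 5, the following Python sketch applies the weighting equation under the simplifying assumption that each target cell falls entirely inside one source zone, so that the sum over intersecting zones reduces to a single term. The function name and the toy values are hypothetical.

```python
def dasymetric(zones, population, totals):
    """Distribute each zone total over its cells in proportion to population."""
    # Population per source zone (P_s in the equation of Step 5).
    zone_population = {}
    for zone_row, pop_row in zip(zones, population):
        for zone, pop in zip(zone_row, pop_row):
            zone_population[zone] = zone_population.get(zone, 0.0) + pop
    # Each cell receives S_s * P_t / P_s; zones without population get zero.
    return [[totals[zone] * pop / zone_population[zone]
             if zone_population[zone] > 0 else 0.0
             for zone, pop in zip(zone_row, pop_row)]
            for zone_row, pop_row in zip(zones, population)]


# Toy example: one row of three cells, two source zones (made-up values).
cell_zones = [["a", "a", "b"]]
cell_population = [[30, 10, 5]]
zone_totals = {"a": 80, "b": 7}
estimate = dasymetric(cell_zones, cell_population, zone_totals)
```

In this toy case, zone a holds 40 inhabitants, so its first cell receives 30/40 of the total of 80.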

3.1 Implementation details and the considered regression algorithms

The disaggregation procedure was implemented using the programming language of the R project for statistical computing, given that there are already many extension packages for R concerned with the analysis of spatial data, facilitating the usage of geospatial datasets encoded using either the geometric or the raster data models [5]. We have specifically integrated and extended the source code from the R packages named pycno and dissever, which, respectively, implement the pycnophylactic interpolation algorithm from Tobler [60] used in Step 2, and the downscaling procedure based on regression analysis and disseveration that was outlined by Malone et al. [45] and used in Step 7. By leveraging the pre-existing dissever package, we could easily perform experiments with different types of regression models, such as ensembles of decision trees as used by Stevens et al. [58], or generalized additive models as originally used by Malone et al. [45]. The latest version of dissever internally uses the caret package for the implementation of the regression models. The caret package [39], short for classification and regression training, contains numerous tools for developing different types of predictive models, facilitating the realization of experiments with different types of regression approaches in order to discover the relations between the target variable to disaggregate and the available covariates. In our experiments, we specifically used standard linear regression models, generalized additive models [31], and an approach based on ensembles of decision trees that is typically referred to as cubist [51].

In standard linear regression, a linear least-squares fit is computed for a set of predictor variables (i.e., the covariates) to predict a dependent variable (i.e., the disaggregated values). The well-known linear regression equation corresponds to a weighted linear combination of the predictive covariates, added to a bias term. In generalized additive models, the dependent variable values are also predicted from a linear combination of predictor variables, but these are instead connected to the dependent variable via a link function, which nonetheless may simply correspond to the identity function. Moreover, instead of a single coefficient for each variable in the model (i.e., for each additive term in the linear combination), a function (e.g., a cubic smoothing spline) is estimated for each predictor, to achieve the best prediction of the dependent variable values. Instead of estimating single parameters, in generalized additive models we find a more general function that relates the predicted values to the predictors, effectively allowing for some degree of nonlinearity. Details on how generalized additive models are fit to data can be found in the book from Hastie and Tibshirani [31].

The cubist approach is instead based on combining decision trees with linear regression models, again allowing for some degree of nonlinearity [51]. The leaf nodes in these trees contain linear regression models based on the predictors used in previous splits. There are also intermediate linear models at each step of a tree, so that the predictions made by the linear regression model, at the terminal node, are also smoothed by taking into account the predictions from the linear models in the previous nodes, recursively up the tree. The tree-based cubist approach is also normally used within an ensemble classification scheme based on committees, in which a series of trees is trained sequentially with adjusted weights. The final predictions result from the average of the predictions from all committee members.

We experimented with these different regression models to measure their impact on the disaggregation performance for different types of variables. In cases where the target variable has a smooth and nearly linear dependence on the covariates, a standard linear regression model will probably perform better than more sophisticated nonlinear approaches (e.g., an approach based on a combination of multiple decision trees, which will attempt to approximate the linear relationship with an irregular step function). In the presence of multi-collinearity, or for more complex relationships between the target values and the covariates, nonlinear models can perhaps offer a better performance.
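To make the interplay between regression and mass preservation in Steps 7 to 9 more concrete, the sketch below runs a disseveration-style loop in Python. It is not the code of the dissever package: a one-covariate least-squares line stands in for the regression models discussed above, negative predictions are clipped to zero, the cells are kept in a flat list, and the stopping rule compares the error of two consecutive iterations rather than three. All names and values are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def zone_sums(values, zones):
    """Sum cell values per source zone."""
    sums = {}
    for value, zone in zip(values, zones):
        sums[zone] = sums.get(zone, 0.0) + value
    return sums


def rescale(values, zones, totals):
    """Step 8: scale cells so that each zone sums to its source total."""
    sums = zone_sums(values, zones)
    sizes = zone_sums([1.0] * len(zones), zones)
    # If a zone is entirely zero, fall back to an even split of its total.
    return [v * totals[z] / sums[z] if sums[z] > 0 else totals[z] / sizes[z]
            for v, z in zip(values, zones)]


def dissever(initial, covariate, zones, totals, tol=1e-3, max_iter=100):
    estimate = rescale(initial, zones, totals)
    previous_error = None
    for _ in range(max_iter):
        # Step 7: fit the regression on the current estimates and predict.
        slope, intercept = fit_line(covariate, estimate)
        predicted = [max(slope * x + intercept, 0.0) for x in covariate]
        # Error of the raw predictions, aggregated back to the source zones.
        sums = zone_sums(predicted, zones)
        error = math.sqrt(sum((sums[z] - totals[z]) ** 2 for z in totals) / len(totals))
        # Step 8: restore the source totals.
        estimate = rescale(predicted, zones, totals)
        # Step 9 (simplified): stop when the error no longer changes.
        if previous_error is not None and abs(previous_error - error) < tol:
            break
        previous_error = error
    return estimate


# Toy example: six cells in two zones (made-up values).
cell_zones = ["a", "a", "a", "b", "b", "b"]
cell_covariate = [4.0, 5.0, 6.0, 1.0, 2.0, 3.0]
zone_totals = {"a": 60.0, "b": 12.0}
initial_estimate = [20.0, 20.0, 20.0, 4.0, 4.0, 4.0]
result = dissever(initial_estimate, cell_covariate, cell_zones, zone_totals)
```

Starting from a flat estimate within each zone, the loop shifts values toward the cells with larger covariate values, while the rescaling keeps every zone total unchanged.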

3.2 The ancillary data sources

The ancillary information regarding population statistics was, in our case, obtained from the Gridded Population of the World (GPW), a well-known dataset depicting the distribution of human population across the globe, providing globally consistent and spatially explicit (i.e., disaggregated) human population information. The current version of the dataset was constructed from national or subnational input units (i.e., from low-level administrative units from the different countries) of varying resolutions, through a complex spatial disaggregation procedure. The initial version of the GPW dataset, which was released in 1995 and used a simple pycnophylactic spatial disaggregation method for population data [61], resulted from a discussion at the 1994 workshop on global demography, where there was consensus that a consistent global database of population totals, in raster format, would be invaluable for interdisciplinary research. The dataset was then continuously revised over the years. Since many socio-economic variables are expected to correlate with population density, it is our belief that this dataset can provide crucial information for our disaggregation objectives.

The resolution considered for the GPW dataset is of 30 arc-seconds per cell, or approximately 1 km at the Equator, although aggregates at coarser resolutions are also provided. Separate grids are available with population counts and with the density per grid cell. Population data estimates (in 2015, when GPWv4 was released) are provided for 2000, 2005, 2010, 2015 and 2020, extrapolating from data collected in the 2010 round of censuses, which occurred between 2005 and 2014. In our experiments, we used the count data projected to the year of 2010 (i.e., the year that is closest to the date associated with the target variables that we want to disaggregate), with the resolution of 30 arc-seconds per cell.

As for the ancillary information regarding nighttime light emissions, we used the publicly available VIIRS Nighttime Lights-2012 dataset, maintained by the Earth Observation Group of the NOAA National Geophysical Data Center. Since 1992, the NOAA U.S. National Geophysical Data Center has produced and provided a long time-series and global dataset of annual nighttime satellite images from the U.S. Air Force Defense Meteorological Satellite Program (DMSP), using the Operational Linescan System (OLS). In the past, the distribution of artificial light from these images has been used in many different studies, as a proxy for urbanization, population density, economic activity, and armed conflict, as well as to assess the spatial extent of light pollution itself [16, 41]. We specifically used the global cloud-free composite of VIIRS nighttime lights, which was generated with VIIRS day/night band (DNB) observations collected on nights with zero moonlight, respectively, on 18–26 of April 2012, and on 11–23 of October 2012. Cloud screening was done based on the detection of clouds in the VIIRS M15 thermal band, and the product has not been filtered to subtract background noise, or to remove light detections associated with fires, gas flares, volcanoes, or aurora. The raster data, available at a resolution of 15 arc-seconds per cell, consist of floating point values calculated by averaging the pixels deemed to be cloud-free. Previous studies have shown that nighttime lights are strongly correlated with variables that reflect permanent or temporary population distribution, and thus this variable is also expected to be quite useful for the disaggregation of socio-economic variables. Nighttime light information is nowadays also made available at a high spatial resolution and at very frequent temporal intervals. Thus, this information can be useful in the case of applications that require frequent updates.
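Since the nighttime lights raster is distributed at 15 arc-seconds per cell, normalizing it to 30 arc-seconds as described in Step 3 amounts to averaging blocks of 2 by 2 cells. The following Python sketch shows this block-mean aggregation, assuming that the fine and coarse grids are aligned and ignoring any trailing rows or columns that do not fill a complete block. The function name and the toy values are hypothetical.

```python
def block_mean(grid, factor):
    """Aggregate a fine raster by averaging factor x factor blocks of cells."""
    n_rows, n_cols = len(grid), len(grid[0])
    coarse = []
    # Incomplete trailing blocks are dropped in this sketch.
    for r in range(0, n_rows - n_rows % factor, factor):
        coarse_row = []
        for c in range(0, n_cols - n_cols % factor, factor):
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Toy 4 x 4 raster at the fine resolution (made-up radiance values).
fine = [[1, 3, 5, 7],
        [1, 3, 5, 7],
        [2, 2, 0, 0],
        [2, 2, 4, 4]]
coarse = block_mean(fine, 2)
```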

Regarding land coverage information, we used the standard Corine Land Cover (CLC) data product, which is based on satellite images as the primary information source, and whose technical details are presented in the report by Heymann and Bossard [34]. We specifically used data for the year of 2012, on a 250 \(\times \) 250 m resolution (i.e., since the remaining ancillary layers are only available at more coarse-grained resolutions, using fine-gridded land coverage information, e.g., at a 100 m resolution, would imply the aggregation of the data, or the transformation of the remaining datasets). The 44 different classes of the 3-level Corine nomenclature that are considered in the original product (e.g., classes for water bodies, artificial surfaces, agricultural areas, etc.) were converted into a real value in the range [0, 1], which encodes how developed the territory corresponding to a given cell is, in a simple dasymetric distribution that corresponds to a class-percent method (i.e., cells with the class water bodies were assigned the value of zero, cells corresponding to wetlands were assigned the value of 0.25, different types of forest and semi-natural areas were assigned the value of 0.5, agricultural areas were assigned the value of 0.75, and artificial surfaces were assigned the value of one). This conversion from categorical to numeric values makes it easier to explore land coverage within different types of regression modeling methods (e.g., this procedure is appropriate for standard linear regression models, where categorical variables would otherwise have to be encoded, for instance, through the use of one different variable for each possible category, with a value of one if the case falls in that category and zero otherwise). Despite the arbitrariness of the considered weights, our disaggregation method based on regression will adjust the contribution of each of the class-percent values in a data-driven way. Many previous studies have used land coverage data as a source of ancillary information for disaggregating population data, e.g., in order to distinguish rural from urban regions, and to redistribute the aggregated values accordingly.
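The class-percent conversion described above can be written as a small lookup table. The Python sketch below assumes that each cell stores the three-digit CLC code (e.g., 111 or 512), whose first digit identifies the level-1 class; some distributions of the product store a grid code from 1 to 44 instead, in which case an additional lookup would be needed. The names are hypothetical.

```python
# Development weight per CLC level-1 class, following the values in the text.
LEVEL1_WEIGHT = {
    1: 1.0,   # artificial surfaces
    2: 0.75,  # agricultural areas
    3: 0.5,   # forest and semi-natural areas
    4: 0.25,  # wetlands
    5: 0.0,   # water bodies
}


def development_weight(clc_code):
    """Map a three-digit CLC code to its development weight in [0, 1]."""
    return LEVEL1_WEIGHT[clc_code // 100]


def to_weight_raster(clc_raster):
    """Apply the conversion to every cell of a raster of CLC codes."""
    return [[development_weight(code) for code in row] for row in clc_raster]


# Toy raster with made-up codes: urban fabric, arable land, forest, marsh, water.
weights = to_weight_raster([[111, 211, 311], [411, 512, 112]])
```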

Besides the raster encoding land development, the CLC dataset was also used to produce a second raster with derived information, encoding the distance toward the nearest water body (i.e., the distance toward the nearest CLC cell assigned to the class water bodies). For population distribution, as well as for socio-economic variables related to particular economic activities, this particular type of ancillary data can perhaps be of use, as we expect different concentrations of the target variables in areas near rivers, lakes, or oceans.
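A derived distance raster of this kind can be sketched with a brute-force search, as in the Python fragment below. The sketch measures Euclidean distances in units of cells between cell centers, which ignores the fact that cells defined in arc-seconds are not square on the ground, and its cost grows with the number of cells times the number of water cells, so it is only meant for small examples. The same helper could serve for the road distance raster \(P^6\). The function name and the toy mask are hypothetical.

```python
import math


def distance_to_nearest(mask):
    """Distance (in cells) from every cell to the nearest cell flagged in `mask`."""
    targets = [(r, c)
               for r, row in enumerate(mask)
               for c, flagged in enumerate(row) if flagged]
    if not targets:
        raise ValueError("the mask contains no target cells")
    return [[min(math.hypot(r - tr, c - tc) for tr, tc in targets)
             for c in range(len(row))]
            for r, row in enumerate(mask)]


# Toy mask: True marks a cell classified as water.
water = [[True, False, False],
         [False, False, False]]
distances = distance_to_nearest(water)
```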

Finally, regarding OpenStreetMap data, we followed the methodology associated with a study put forward by Martin Raifer in 2015, regarding the most densely mapped regions in OpenStreetMap. From a shapefile containing OpenStreetMap road network data, we computed a raster with a resolution of 30 arc-seconds per cell, where each cell is associated with the total number of nodes from the street network in that area. OpenStreetMap information was also used to produce an additional dataset with derived information, again with a resolution of 30 arc-seconds per cell and encoding the distance from a given position (i.e., from a given cell) toward the nearest road or street segment. Our assumption is that either the number of street/road nodes or the distance toward these infrastructures can reflect the distribution of human constructions, and thus also the distribution of related socio-economic variables.
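Counting street network nodes per cell is a simple binning operation, sketched below in Python. The sketch assumes that nodes are given as (longitude, latitude) pairs in degrees and that the grid is described by its north-west corner and a cell size in degrees (30 arc-seconds would correspond to 1/120 of a degree). The function name, the parameters, and the toy coordinates are hypothetical, and the example uses a cell size of one degree only to keep the arithmetic easy to follow.

```python
def count_nodes(nodes, west, north, cell_size, n_rows, n_cols):
    """Count how many (lon, lat) nodes fall inside each cell of a regular grid."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for lon, lat in nodes:
        col = int((lon - west) // cell_size)
        row = int((north - lat) // cell_size)  # rows grow southward
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] += 1  # nodes outside the grid are ignored
    return grid


# Toy example: a 3 x 3 grid with its north-west corner at (0, 10).
street_nodes = [(0.5, 9.5), (0.6, 9.4), (2.5, 8.5), (10.0, 10.0)]
node_counts = count_nodes(street_nodes, west=0.0, north=10.0,
                          cell_size=1.0, n_rows=3, n_cols=3)
```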
Fig. 2 Aggregated counts for the different socio-economic indicators

4 A case study with the Portuguese territory

In our case study, we used socio-economic data pertaining to the Portuguese territory and its administrative units, mostly at the level of civil parishes, publicly available from the Portuguese National Institute of Statistics. The available information is divided into several themes, like population, justice, education, health, or the environment, among several others. We have specifically used the following datasets in our case study, in all cases using data for the year of 2011 (i.e., the year of the last national census study):
  • Number of female residents, according to the national census in 2011;

  • Number of live births in 2011, by place of residence of the mother;

  • Number of deaths in 2011, according to the national directorate-general of health;

  • Number of foreign residents, according to the national census in 2011;

  • Number of buildings, according to the national census in 2011;

  • Number of buildings with at least two floors, according to the national census in 2011;

  • Resident population employed in the agriculture, animal production, hunting, forest, and fishery sectors, according to the national census in 2011;

  • Employed resident population, according to the national census in 2011;

  • Number of crimes registered by the police forces, for the year of 2011;

  • Number of hotel visitors (i.e., number of guests in hotel establishments) in 2011, according to the national tourism authority.

The table in “Appendix A1” presents aggregate information for all ten variables listed in the previous enumeration, considering large territorial divisions corresponding to NUTS III regions as the aggregation units. For testing the disaggregation procedure, we used the 4260 Portuguese civil parishes as the source units, producing raster datasets with a resolution of 30 arc-seconds per cell (i.e., the same resolution used in the GPW dataset that was used as ancillary data). Exceptions to this procedure are the two last variables from the previous enumeration (i.e., the number of crimes and the number of hotel visitors), for which we only had access to data aggregated at the level of municipalities.
Figure 2 presents a grid with multiple choropleth maps (i.e., five maps per row), illustrating the aggregated information at the level of civil parishes, for the considered socio-economic indicators and for the administrative units in Continental Portugal (i.e., ignoring the archipelagos of Azores and Madeira). In the case of the variables corresponding to (i) the number of crimes registered by the police forces, and (ii) the number of hotel visitors, the maps displayed in Fig. 2 use municipalities as the aggregation level, instead of civil parishes. Information on the number of hotel visitors is also available only for some of the municipalities in the Portuguese territory. Thus, in the corresponding map, the regions shown in red correspond to those municipalities where no information was available. All the maps from Fig. 2 used a logarithmic transformation to assign data values to particular colors, given that most of the indicators that were considered for disaggregation have a skewed distribution in their values. The logarithmic transformation is only used to facilitate the visual interpretation of the maps, given that most of the considered variables have a much higher density in big cities like Lisbon or Oporto, and we wanted to illustrate variations across the entire Portuguese territory.
Fig. 3 Disaggregation results for the different socio-economic indicators

Figure 3 presents a similar grid to the one that is shown in Fig. 2, but in this case illustrating the results that were obtained through the proposed spatial disaggregation procedure, using as source zones the highest possible resolutions in terms of the original data aggregation (i.e., civil parishes in all cases, except for the indicators corresponding to the number of crimes and the number of hotel visitors). The number of crimes was disaggregated from the level of municipalities, and the number of hotel visitors was disaggregated from the NUTS III level, given that we had these data for the entire NUTS regions, although not for some of the municipalities. We also used all sources of ancillary information, together with linear regression models within the disseveration-based algorithm. The maps from Fig. 3 have a resolution of 30 arc-seconds per cell, and they illustrate general trends in the resulting distribution for the disaggregated values (e.g., higher values are assigned to coastal regions).

Figure 4 details the disaggregation results for the number of hotel visitors, focusing on the city of Lisbon and its outskirts. We plot, side-by-side, (i) a map containing the estimates for the number of hotel visitors obtained with a pycnophylactic interpolation procedure, (ii) the estimates obtained with a baseline disaggregation method corresponding to a proportional and weighted areal interpolation procedure that only used population data (i.e., raster \(T^d\) from the enumeration shown in Sect. 3), (iii) the estimates computed with the complete hybrid method leveraging disseveration, and (iv) a satellite photo collected from Google Earth. From the figure, one can clearly see that areas with more buildings or landmarks have higher values, while less developed areas end up with lower values.
Fig. 4 Spatially disaggregated results for the variable corresponding to the number of hotel visitors, concerning the city of Lisbon and its outskirts, when using different approaches

Fig. 5 Spatially disaggregated results for the variable corresponding to the number of crimes

Fig. 6 Spatially disaggregated results for the variable corresponding to the number of foreign residents

Figure 5 details one of the variables from Figs. 2 and 3, specifically the one corresponding to the number of crimes. This figure plots, side-by-side, (i) a choropleth map with the number of crimes per municipality, (ii) the ancillary raster with population counts for the Portuguese territory, (iii) a raster showing the disaggregated number of crimes, as obtained with a simpler method corresponding to a proportional and weighted areal interpolation procedure, and (iv) the raster obtained with the proposed hybrid disaggregation method, using linear regression with all sources of ancillary data. From the figure, one can see that indeed the areas with the higher population counts end up receiving a large proportion of the disaggregated counts for the number of crimes, and also that the resulting map is smoother than the one that would be produced by the proportional and weighted areal interpolation procedure.

Figure 6 details another of the variables from Figs. 2 and 3, specifically the one corresponding to the disaggregated number of foreign residents. This figure plots, side-by-side, (i) a choropleth map with the number of foreign residents per civil parish, (ii) the ancillary raster with nighttime light emissions for the Portuguese territory, (iii) a raster showing the disaggregated number of foreign residents, as obtained with the simpler method that only used population data in the disaggregation procedure, and (iv) the raster obtained with the proposed hybrid disaggregation method, again using linear regression with all sources of ancillary data. From the figure, one can also see that the coastal areas with the highest population counts end up receiving a large proportion of the disaggregated values for the number of foreign residents, and also that the nighttime light emission data may aid in the disaggregation. Notice that, in Fig. 6, there is an apparent concentration of foreign residents in the eastern part of the source map, which no longer appears on the resulting map. Most of the foreign residents in those specific regions ended up being assigned to a few cells of the resulting map, both in the case of the baseline disaggregation method and in the case of the hybrid procedure. Notice also that the regions with the abnormally high values correspond to some of the larger parishes, although the counts to be disaggregated are somewhat small.
Fig. 7 Scatter-plots illustrating the correlation between the variables corresponding to the number of female inhabitants (top) and number of buildings (bottom), against three of the considered datasets with ancillary information

The results from the aforementioned figures suggest that indeed there is a high correlation between variables such as population counts or nighttime light emissions, and the target variables to be disaggregated. When investigating these correlations, we found that there is a very strong linear correlation between the population counts for each aggregation area (e.g., for each civil parish) and the aggregated values for the different variables that were considered. Consequently, a very high linear correlation is also found for the disaggregated results produced through the dasymetric procedure that relied exclusively on population counts as ancillary information (i.e., results suggest that proportional and weighted areal interpolation, leveraging the population counts, constitutes a very strong baseline).

Figure 7 presents scatter-plots illustrating the correlation between two of the variables that were considered in our study (i.e., the number of female residents and the number of buildings per civil parish, respectively, the variables with the highest and lowest linear correlations toward population counts) with three of the ancillary rasters used in the hybrid disaggregation procedure based on disseveration, namely the information regarding the population counts, the OpenStreetMap node count, and the raster obtained with the simple disaggregation method that only used population counts as ancillary information. The three variables with ancillary information were aggregated to the level of civil parish, in order to compare these values against those from the variables that were disaggregated. Each plot presents also the actual value that was obtained for the Pearson correlation coefficient between the variables, which in all cases had a p value below 0.001.
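For reference, the Pearson correlation coefficient reported in each plot can be computed as in the following Python sketch, which takes two lists with one value per civil parish (the target variable and an ancillary variable aggregated to the same parishes). The function name and the toy values are hypothetical, and the sketch does not compute the associated p value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Toy example with made-up per-parish values.
population_per_parish = [1200.0, 300.0, 4500.0, 800.0]
female_residents_per_parish = [630.0, 160.0, 2350.0, 410.0]
r = pearson(population_per_parish, female_residents_per_parish)
```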

From the scatter-plots presented in Fig. 7, one can confirm the relevance of the auxiliary variables for spatial disaggregation. For instance, one can see (i.e., either through visual observation, or through the computed values for the Pearson correlation coefficient) that the relationship between two of the ancillary rasters (i.e., the one containing values concerning population counts, and the one based on the simple baseline disaggregation method) and both of the considered target variables is indeed strong, validating our assumptions on the importance of population distribution in the disaggregation of socio-economic variables. On the other hand, the node density from OpenStreetMap has notably less relevance in the distribution of the two target indicators considered in the plots, although it still shows a noticeable relationship with indicators like the number of buildings (i.e., in this case, a Pearson correlation of 0.311). The regression algorithms should consider parameters based on the importance of such correlations, giving more relevance to the ancillary information provided by the population counts and by the simple disaggregation algorithm.

It should be noted that spatial disaggregation is never an error-free process, and errors introduced during disaggregation can be propagated to the subsequent analysis steps. The typical accuracy assessment strategy is to aggregate the target zone estimates to either the source or some intermediary zones, and then compare the aggregated estimates against the original counts. The results for the comparison can be summarized by various statistics, such as the root-mean-square error (RMSE) between estimated and observed values, or the mean absolute error (MAE). The corresponding formulas are as follows.
$$\begin{aligned} \mathrm{RMSE} = \sqrt{\frac{\sum _{i=1}^{n}({\hat{y}}_i-y_i)^2}{n}} \end{aligned}$$
$$\begin{aligned} \mathrm{MAE} = \frac{\sum _{i=1}^{n}|{\hat{y}}_i-y_i|}{n} \end{aligned}$$
In Eqs. (5) and (6), \({\hat{y}}_i\) corresponds to a predicted value, \(y_i\) corresponds to a true value, and n is the number of predictions. Using multiple error metrics can have advantages, given that individual measures condense a large amount of data into a single value, thus only providing one projection of the model errors that emphasizes a certain aspect of model performance. For instance, Willmott and Matsuura [64] proved that the RMSE is not equivalent to the MAE, and that one cannot easily derive the MAE value from the RMSE (and vice versa). While the MAE gives the same weight to all errors, the RMSE penalizes variance, as it gives errors with larger absolute values more weight than errors with smaller absolute values. When both metrics are calculated, the RMSE is by definition never smaller than the MAE. Chai and Draxler [8] argued that the MAE is suitable to describe uniformly distributed errors, but because model errors are likely to have a normal distribution rather than a uniform distribution, the RMSE is often a better metric to present than the MAE. Multiple metrics can provide a better picture of error distribution and thus, in our study, we present results in terms of the MAE and RMSE metrics. We also report results in terms of the normalized root-mean-square error (NRMSE) and the normalized mean absolute error (NMAE), in which we divide the values of the RMSE and MAE by the amplitude of the true values (i.e., the difference between the maximum and the minimum true values). This normalization can facilitate the comparison of results across variables.
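The four metrics can be computed together, as in the following Python sketch, which follows the formulas above and normalizes by the amplitude of the true values. The function name and the toy values are hypothetical.

```python
import math


def error_metrics(predicted, observed):
    """Return the RMSE, MAE, NRMSE and NMAE between predictions and true values."""
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Amplitude of the true values; assumed to be larger than zero.
    amplitude = max(observed) - min(observed)
    return {"RMSE": rmse, "MAE": mae,
            "NRMSE": rmse / amplitude, "NMAE": mae / amplitude}


# Toy example with made-up values for three zones.
metrics = error_metrics(predicted=[2.0, 4.0, 9.0], observed=[1.0, 4.0, 5.0])
```

In this toy case the errors are 1, 0 and 4, so the single large error pulls the RMSE above the MAE, which illustrates the different weighting of the two metrics.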
Table 1 Disaggregation errors measured for the ten different socio-economic variables, with the aggregated data collected originally at a NUTS III level

Column groups: Pycnophylactic interpolation; Weighted interpolation; Hybrid method. Rows: Female residents; Live births; Deaths; Foreign residents; Buildings; Tall buildings; Prim. sect. workers; Employed pop.; Crimes (M); Hotel visitors (M).

To get some idea of the errors that are involved in the proposed spatial disaggregation procedure, we experimented with the disaggregation of data originally reported at the level of large territorial divisions (i.e., the NUTS III divisions shown in Table 5) or at the level of municipalities, to the raster level, later aggregating the estimates to the level of civil parishes (i.e., taking the sum of the values from all raster cells associated to each civil parish) and comparing the aggregated estimates against the values that were originally available for the 4260 civil parishes from 308 municipalities.
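The aggregation step of this evaluation protocol can be sketched as follows in Python, assuming that each raster cell has already been assigned to exactly one civil parish. The function name, the parish identifiers, and the values are hypothetical.

```python
def aggregate_to_zones(cell_values, cell_zone_ids):
    """Sum the disaggregated cell estimates within each evaluation zone."""
    totals = {}
    for value, zone in zip(cell_values, cell_zone_ids):
        totals[zone] = totals.get(zone, 0.0) + value
    return totals


# Toy example: five cells assigned to two parishes (made-up values).
cell_estimates = [3.0, 1.0, 2.5, 0.5, 4.0]
parish_of_cell = ["p1", "p1", "p2", "p2", "p2"]
observed_by_parish = {"p1": 5.0, "p2": 6.0}

aggregated = aggregate_to_zones(cell_estimates, parish_of_cell)
# Per-parish absolute errors, from which summary statistics can be derived.
absolute_errors = {parish: abs(aggregated[parish] - observed_by_parish[parish])
                   for parish in observed_by_parish}
```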

Table 1 shows the obtained results, in the case of aggregated data collected at the NUTS III level, comparing the usage of the complete hybrid disaggregation method, when leveraging linear regression models, against the results obtained with (i) pycnophylactic interpolation, or with (ii) weighted areal disaggregation leveraging population data for the weights (i.e., raster \(T^d\) in the enumeration given in Sect. 3). All the evaluation metrics are computed over results at the level of civil parishes, except for the last two variables (i.e., number of crimes and number of hotel visitors) for which we had no access to information at a finer granularity than municipalities. The results for the NRMSE and NMAE metrics are reported with a multiplication factor of 10\(^{-2}\), in order to facilitate the interpretation of quantities associated to small areas. Values in bold correspond to the best results for each variable.
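For reference, the weighted areal disaggregation baseline can be sketched as a proportional allocation of each source-zone total to its cells, with population as the weight. This is a simplified illustration under stated assumptions: the function name, the fallback to equal shares for zones with zero total weight, and the toy values are not taken from the article.

```python
import numpy as np

def weighted_disaggregation(zone_totals, cell_zone_ids, cell_weights):
    """Split each zone total among its cells, proportionally to the weights."""
    shape = np.shape(cell_zone_ids)
    zones = np.asarray(cell_zone_ids, dtype=int).ravel()
    weights = np.asarray(cell_weights, dtype=float).ravel()
    totals = np.asarray(zone_totals, dtype=float)
    weight_sums = np.bincount(zones, weights=weights, minlength=totals.size)
    cell_counts = np.bincount(zones, minlength=totals.size)
    # Avoid division by zero for zones whose weights sum to zero.
    safe_sums = np.where(weight_sums > 0, weight_sums, 1.0)
    shares = np.where(
        weight_sums[zones] > 0,
        weights / safe_sums[zones],   # proportional share of the zone total
        1.0 / cell_counts[zones],     # assumed fallback: equal shares
    )
    return (totals[zones] * shares).reshape(shape)

# Toy example: two source zones with two cells each (hypothetical values).
zone_ids = np.array([[0, 0],
                     [1, 1]])
population = np.array([[30.0, 10.0],
                       [0.0, 0.0]])
cells = weighted_disaggregation([200.0, 50.0], zone_ids, population)
```

The allocation is mass-preserving by construction: the cell values within each source zone add up to the original zone total. Passing a constant weight raster reduces the procedure to an equal split of each zone total among its cells.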

The results from Table 1 show that the proposed hybrid method indeed outperforms the baselines corresponding to pycnophylactic interpolation or weighted areal interpolation, at a NUTS III level. However, in some error metrics and particularly for indicators that have a strong linear correlation with population counts (e.g., the indicator corresponding to the number of female residents), the simpler dasymetric procedure that only takes into account the population as ancillary data produces slightly better results.

Tables 2 and 3 present similar results to those from Table 1, in this case obtained with the disaggregation of data originally reported at the level of municipalities, and measuring results at the level of civil parishes. Table 2 presents results for baseline methods corresponding to (i) mass-preserving areal weighting, (ii) pycnophylactic interpolation, and (iii) weighted areal disaggregation leveraging population data for the weights. Table 3 instead presents results with the hybrid method based on disseveration, using three different regression methods in the dasymetric procedure, namely linear regression models, generalized additive models, or ensembles of trees based on the cubist method. The results for NRMSE and NMAE are again reported with a multiplication factor of 10\(^{-2}\). Values in bold, in both Tables 2 and 3, correspond to the best results that were achieved for each variable (i.e., values in bold shown in Table 2 correspond to cases in which the hybrid disaggregation method, based on disseveration, could not outperform one of the baselines, namely the one based on weighted areal disaggregation leveraging population).
Table 2

Disaggregation errors measured for different socio-economic variables, using baseline methods and with the aggregated data collected originally at the level of municipalities

[Table values not recoverable from the source. Column groups: Areal Interpolation; Pycnophylactic Interpolation; Weighted Interpolation. Surviving row labels: Female residents; Live births; Foreign residents; Tall buildings; Primary sector.]

Table 3

Disaggregation errors measured for different socio-economic variables, using different types of regression models and with the aggregated data collected at the level of municipalities

[Table values not recoverable from the source. Surviving column groups: Linear models; Generalized additive models. Surviving row labels: Female residents; Live births; Foreign residents; Tall buildings; Prim. sect. workers; Employed pop.]

From Tables 2 and 3 we can also see that, at a municipality level, the hybrid method continues to outperform the baseline disaggregation methods in almost all indicators. When the indicator to disaggregate is strongly correlated with population counts (e.g., for variables such as female residents, live births, or employed population), the methods that produced lower disaggregation errors used regression analysis based on standard linear regression or generalized additive models. The strong linear dependence between the indicators that are to be disaggregated and some of the ancillary variables can explain why a simple linear regression can model the dependence better than more sophisticated methods. On the other hand, for the case of indicators depending less on population (e.g., number of buildings, or number of buildings with more than a single floor), the regression model based on ensembles of trees obtained slightly better results. In all cases, the simple method based on weighted areal disaggregation, leveraging population data for the weights, indeed corresponded to a very strong baseline.

The strong correlations between the considered indicators and the auxiliary variable corresponding to population counts explain why the simpler baseline achieves almost the same results as the more sophisticated hybrid method. Nonetheless, for almost all indicators, the usage of additional ancillary information can indeed lead to improvements, sometimes considerable ones. It should also be noted that the errors that were reported correspond to an upper bound on the actual errors produced from the disaggregation of data reported at the level of civil parishes (i.e., we only measured the errors in the disaggregation of data originally at the level of NUTS III regions or municipalities), given that the higher the differences between the source and the target areas, the higher the errors introduced by a spatial disaggregation procedure.
Table 4

Disaggregation errors measured for different socio-economic variables, using linear regression together with different thresholds for the regular sampling procedure

[Table values not recoverable from the source. Column groups: Regular sampling 25%; Regular sampling 50%; Regular sampling 75%. Surviving row labels: Female residents; Live births; Foreign residents; Tall buildings; Prim. sect. workers; Employed pop.]

The results reported in Tables 1 and 3 leveraged the full set of data points when training the regression models. Table 4 instead presents results when considering a regular sampling strategy, in which only 25, 50, or 75% of the available data points are used for model training. These experiments relied on a linear regression model to disaggregate data originally reported at the level of municipalities, and the regular sampling procedure ensures that the entire geographic region is evenly represented through the systematically aligned collection of the data points. The values in bold correspond to cases where using the sampling strategy outperformed the linear model trained on the full set of instances. Comparing the results in Table 4 against those in the first column of Table 3, one can see the benefits of using the sampling procedure. The training of the regression models becomes computationally less demanding and, for many of the considered indicators, applying regular sampling also results in lower disaggregation errors. For instance, for variables such as the number of buildings or the number of primary sector workers, a drastic reduction in the number of samples (i.e., setting the sampling threshold to 25% of the data points) produces the best results. Reducing the number of samples from close-by regions is a possible strategy to deal with spatial autocorrelation, and it appears to result in lower disaggregation errors. Some of our variables do indeed exhibit a high degree of spatial autocorrelation (e.g., in Fig. 8, we can see that for many of the variables nearby regions have similar values, so that using all the data points is perhaps artificially reducing the variance in the training data and inflating the effect size of the covariates). This can explain why sampling was also beneficial in terms of result quality (i.e., regular sampling provides good variance in the observations, with small sample sizes).
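One possible way to obtain a systematically aligned sample of raster cells is sketched below. It is only an assumption about how such a sampling could be implemented, not the exact procedure used in the experiments: rows and columns are selected at evenly spaced positions so that approximately the requested fraction of cells is retained, and the function name `regular_sample_mask` is hypothetical.

```python
import numpy as np

def regular_sample_mask(shape, fraction):
    """Boolean mask selecting roughly `fraction` of the cells on a regular lattice."""
    n_rows, n_cols = shape
    # Use the same sampling rate along both axes, so that the product
    # of the two rates is close to the requested fraction.
    per_axis = np.sqrt(fraction)
    k_rows = max(1, int(round(n_rows * per_axis)))
    k_cols = max(1, int(round(n_cols * per_axis)))
    # Evenly spaced row and column indices covering the whole extent.
    rows = np.unique(np.round(np.linspace(0, n_rows - 1, k_rows)).astype(int))
    cols = np.unique(np.round(np.linspace(0, n_cols - 1, k_cols)).astype(int))
    mask = np.zeros(shape, dtype=bool)
    mask[np.ix_(rows, cols)] = True
    return mask

# Keep about 25% of the cells of a toy 8 x 8 raster.
mask_25 = regular_sample_mask((8, 8), 0.25)
```

Only the cells where the mask is true would then be passed to the regression model as training points. For fractions such as 50 or 75%, the retained share is approximate, since the number of rows and columns must be rounded to integers.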

It is also interesting to notice that the errors measured for the different spatial disaggregation procedures were evenly distributed over the considered geographic territory. In Fig. 8, we plot the errors that were measured individually for each civil parish (i.e., the difference between the estimated value and the real value, divided by the real value so as to obtain normalized scores), in the case of the indicators that generally had the lowest and highest errors (i.e., the number of female residents and the number of buildings, respectively) over the continental Portuguese territory, and for the archipelagos of Azores and Madeira. These errors are shown for the disaggregation procedures corresponding to (i) the disseveration algorithm based on a linear model, and (ii) the weighted areal interpolation procedure that leveraged population data. From the figures, one can see that the largest errors in both disaggregation procedures correspond to civil parishes for which we over-estimated the true values. The regions with darker colors and higher values generally correspond to civil parishes where the indicator being disaggregated had very small values (e.g., 268 female residents in the civil parish named São Nicolau, in the municipality of Mesão Frio and district of Vila Real, in continental Portugal), and where a small deviation in the estimated results therefore produced high values for the normalized error. Still, the disaggregation errors are very small in most of the territory, and also evenly distributed.
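The effect of small true values on the normalized error can be illustrated with a short sketch. The true value of 268 comes from the example above, whereas the estimated values and the larger parish are hypothetical numbers chosen only to show the arithmetic.

```python
def normalized_error(estimated, true):
    """Difference between the estimated and the real value, divided by the real value."""
    return (estimated - true) / true

# Same absolute deviation (134) in a small and in a large parish.
small_parish = normalized_error(estimated=402.0, true=268.0)       # hypothetical estimate
large_parish = normalized_error(estimated=26934.0, true=26800.0)   # hypothetical parish
```

The same absolute over-estimation yields a normalized error of 0.5 in the small parish and of 0.005 in the large one, which is why the darker regions in the maps tend to coincide with parishes having small indicator values.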
Fig. 8

Normalized errors measured for the different civil parishes

5 Conclusions and future work

Spatial analyses in the fields of urban and regional planning, transport planning, environment, or climate often require high-resolution socio-economic data. Such analyses typically work with raster data to calculate indicators such as exposure to air pollutants or to noise. However, in many cases, the available socio-economic data do not have the necessary spatial resolution. Usually, data on socio-economic variables relative to population, employment, or housing are available only for larger areas such as provinces, districts, municipalities, or other statistical entities, i.e., units that might be too coarse to be used in particular types of spatial analysis. Spatial disaggregation techniques can be used in this context, to transform data from a set of source zones into a set of target zones with different geometry and with a higher general level of spatial resolution (e.g., into raster cells). In this article, we reported on experiments with a hybrid spatial disaggregation technique that combines the ideas of dasymetric mapping and pycnophylactic interpolation, using population density, nighttime light emissions, land coverage, and information from OpenStreetMap as ancillary data to disaggregate different types of socio-economic indicators to a raster-grid level. The proposed technique was applied in a case study relative to the Portuguese territory, resulting in the production of raster datasets with an approximate resolution of 30 arc-seconds per cell. Throughout the article, we discussed the spatial disaggregation methodology and the quality of the obtained results. With regard to the research questions that were put forward in the introduction, the experiments reported in the article have shown that:
  • Standard spatial disaggregation approaches can be effectively used in the disaggregation of socio-economic indicators. Many of these variables have a strong correlation with population density, and thus a baseline disaggregation method, leveraging population data to perform proportional and weighted areal interpolation, achieved very good results. Still, in most cases, the use of additional ancillary variables within a disaggregation methodology leveraging regression could further improve the results. This was especially true in the case of variables less correlated with population density;

  • The hybrid disaggregation method that was proposed in the article could outperform baseline methods in most of the considered variables, even when using standard linear regression. The use of more sophisticated regression methods was nonetheless useful in cases where the target variable had a lower linear correlation against the ancillary variables based on population density;

  • The regular sampling of data points, prior to the training of regression models, was also beneficial in our experiments. Sampling could reduce computational efforts, while at the same time also leading to better result quality in most of the cases.

For future work, we intend to continue improving the spatial disaggregation methodology. For instance, in the experiments that were reported here, we have already compared different types of regression models, although other approaches could also be interesting to test. Robust regression methods, such as least trimmed squares [53], Huber M-estimators or M-quantile models [1, 9, 23, 54, 55], can for instance provide estimates with a superior quality, in the presence of outliers or when the classical assumptions of linear regression are not met. Geographically weighted regression [10, 19, 29, 30, 42, 54] can also be particularly interesting when downscaling data associated to large regions, given that these methods can more effectively capture spatially varying relationships between the involved variables.

The proposed approach could also be enriched with estimates for the variance associated to the disaggregation results, resulting in the production of fine-resolution estimates together with associated measures of uncertainty [46, 63]. A bootstrapping approach, based on running the disseveration procedure multiple times with random samples from initial estimates (i.e., random samples taken from the raster produced in Step 2 of the methodology outlined in Sect. 3), could for instance be used to estimate a raster with the uncertainty associated to the downscaled values.
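A rough sketch of such a bootstrapping scheme is given below. It is not the disseveration procedure itself: a single least-squares fit followed by a mass-preserving rescaling stands in for each run, and the function name, parameters, and synthetic inputs are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_cell_uncertainty(features, initial_estimates, zone_ids, zone_totals,
                               n_runs=50, sample_fraction=0.5, seed=0):
    """Mean and standard deviation of cell estimates over bootstrap runs."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(features)), features])  # add an intercept
    n_cells = X.shape[0]
    n_sample = max(X.shape[1], int(sample_fraction * n_cells))
    runs = np.empty((n_runs, n_cells))
    for r in range(n_runs):
        # Random sample (with replacement) of cells from the initial estimates.
        idx = rng.choice(n_cells, size=n_sample, replace=True)
        coef, *_ = np.linalg.lstsq(X[idx], initial_estimates[idx], rcond=None)
        pred = np.clip(X @ coef, 1e-9, None)  # keep predictions positive
        # Rescale so that the cells of each source zone add up to the zone total.
        zone_sums = np.bincount(zone_ids, weights=pred, minlength=len(zone_totals))
        runs[r] = pred * zone_totals[zone_ids] / zone_sums[zone_ids]
    return runs.mean(axis=0), runs.std(axis=0)

# Synthetic inputs (hypothetical): 40 cells, 2 ancillary variables, 4 source zones.
data_rng = np.random.default_rng(1)
features = data_rng.random((40, 2))
zone_ids = np.repeat(np.arange(4), 10)
initial = 5.0 + 3.0 * features[:, 0] + data_rng.random(40)
zone_totals = np.array([100.0, 200.0, 150.0, 50.0])
cell_mean, cell_std = bootstrap_cell_uncertainty(features, initial, zone_ids, zone_totals)
```

The per-cell standard deviation could then be written out as an additional raster, to accompany the downscaled values with a simple measure of uncertainty.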

The proposed approach already combines the ideas of dasymetric mapping and pycnophylactic interpolation, but recent studies in the area have also proposed other types of downscaling methods, for instance based on fractal analysis and interpolation [37, 56, 62, 66]. In future tests, we can perhaps consider including the results of different downscaling methods as ancillary rasters within the methodology based on disseveration.

Another idea for future work concerns the usage of other types of ancillary data, like information on terrain elevation inferred from satellite imagery, or population estimates inferred from mobile phone data [12, 14]. Taking inspiration from very recent studies [43, 44, 49], we would also like to experiment with the incorporation of ancillary data extracted from popular location-based services like Flickr or Twitter, for instance by creating density surfaces from geo-referenced items published on these services with particular keywords, and then using these density surfaces in more or less the same way as we are now using the population data or the data from OpenStreetMap. Geo-referenced social media data are already increasingly being used as a source of volunteered geographic information, for instance in applications like delimiting vague regions [11], modeling human mobility [32], or within land use and land cover analysis [4]. It has been shown, for instance, that the number of geo-referenced photos published on Flickr can correlate well with indicators like tourist visitors [21, 35, 36], and the number of geo-referenced Twitter messages mentioning diseases like flu can also correlate well with the number of patients with influenza [50]. It is therefore our belief that data from these services can indeed provide very useful information for supporting spatial disaggregation procedures.




This research was partially supported through Fundação para a Ciência e Tecnologia (FCT), through project grants with references PTDC/EEI-SCR/1743/2014 (Saturn) and EXPL/EEI-ESS/0427/2013 (KD-LBSN), as well as through the INESC-ID multi-annual funding from the PIDDAC programme (UID/CEC/50021/2013).


  1. Andersen, R.: Modern Methods for Robust Regression. No. 152 in Quantitative Applications in the Social Sciences. Sage Publications, Thousand Oaks (2008)
  2. Antoni, J.P., Vuidel, G., Aupet, J.B., Aube, J.: Generating a located synthetic population: a prerequisite to agent-based urban modelling. In: Proceedings of the European Colloquium of Quantitative and Theoretical Geography (2011)
  3. Antoni, J.P., Vuidel, G., Klein, O.: Generating a located synthetic population of individuals, households, and dwellings. Working Paper Series, Luxembourg Institute of Socio-Economic Research (2017)
  4. Antoniou, V., Fonte, C.C., See, L., Estima, J., Arsanjani, J.J., Lupia, F., Minghini, M., Foody, G., Fritz, S.: Investigating the feasibility of geo-tagged photographs as sources of land cover input data. ISPRS Int. J. GeoInf. 5(5), 64 (2016)
  5. Bivand, R.S., Pebesma, E., Gómez-Rubio, V.: Applied Spatial Data Analysis with R. Springer, Berlin (2012)
  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
  7. Briggs, D.J., Gulliver, J., Fecht, D., Vienneau, D.M.: Dasymetric modelling of small-area population distribution using land cover and light emissions data. Remote Sens. Environ. 108(4), 451–466 (2007)
  8. Chai, T., Draxler, R.R.: Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7(3), 1247–1250 (2014)
  9. Chambers, R., Tzavidis, N.: M-quantile models for small area estimation. Biometrika 93(2), 255–268 (2006)
  10. Chandra, H., Salvati, N., Chambers, R., Tzavidis, N.: Small area estimation under spatial nonstationarity. Comput. Stat. Data Anal. 56(10), 2875–2888 (2012)
  11. Cunha, E., Martins, B.: Using one-class classifiers and multiple kernel learning for defining imprecise geographic regions. Int. J. Geogr. Inf. Sci. 28(11), 2220–2241 (2014)
  12. Deville, P., Linard, C., Martin, S., Gilbert, M., Stevens, F.R., Gaughan, A.E., Blondel, V.D., Tatem, A.J.: Dynamic population mapping using mobile phone data. Proc. Natl. Acad. Sci. 111(45), 15888–15893 (2014)
  13. Doll, C.N.H., Muller, J.P., Elvidge, C.: Night-time imagery as a tool for global mapping of socio-economic parameters and greenhouse gas emissions. Ambio 29(3), 157–162 (2000)
  14. Douglass, R., Meyer, D., Ram, M., Rideout, D., Song, D.: High resolution population estimates from telecommunications data. Euro. Phys. J. Data Sci. 4(1), 4 (2015)
  15. Eicher, C.L., Brewer, C.A.: Dasymetric mapping and areal interpolation: Implementation and evaluation. Cartogr. Geogr. Inf. Sci. 28(2), 125–138 (2001)
  16. Elvidge, C., Erwin, E., Baugh, K., Ziskin, D., Tuttle, B., Ghosh, T., Sutton, P.: Overview of DMSP nighttime lights and future possibilities. In: Proceedings of the Joint Urban Remote Sensing Event, pp. 1–5 (2009)
  17. Elvidge, C.D., Baugh, K.E., Kihn, E.A., Kroehl, H.W., Davis, E.R., Davis, C.: Relation between satellite observed visible to near infrared emissions, population, and energy consumption. Int. J. Remote Sens. 18(6), 1373–1379 (1997)
  18. Fisher, P.F., Langford, M.: Modelling the errors in areal interpolation between zonal systems by Monte Carlo simulation. Environ. Plan. A 27(2), 211–224 (1995)
  19. Fotheringham, A.S., Brunsdon, C., Charlton, M.E.: Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Hoboken (2002)
  20. Gallego, F.J.: A population density grid of the European Union. Popul. Environ. 31(6), 460–473 (2010)
  21. García-Palomares, J.C., Gutiérrez, J., Mínguez, C.: Identification of tourist hot spots based on social networks: a comparative analysis of European metropolises using photo-sharing services and GIS. Appl. Geogr. 63(1), 408–417 (2015)
  22. Giri, C.P.: Remote Sensing of Land Use and Land Cover: Principles and Applications. CRC Press, Boca Raton (2012)
  23. Giusti, C., Tzavidis, N., Pratesi, M., Salvati, N.: Resistance to outliers of M-quantile and robust random effects small area models. Commun. Stat. Simul. Comput. 43(3), 549–568 (2014)
  24. Goerlich, F.J., Cantarino, I.: A population density grid for Spain. Int. J. Geogr. Inf. Sci. 27(12), 2247–2263 (2013)
  25. Goodchild, M.F., Anselin, L., Deichmann, U.: A framework for the areal interpolation of socioeconomic data. Environ. Plan. A 25(3), 383–397 (1993)
  26. Goodchild, M.F., Lam, N.S.N.: Areal interpolation: a variant of the traditional spatial problem. Department of Geography, University of Western Ontario London, Canada (1980)
  27. Gregory, I.N.: The accuracy of areal interpolation techniques: standardising 19th and 20th century census data to allow long-term comparisons. Comput. Environ. Urban Syst. 26(4), 293–314 (2002)
  28. Gupta, M.R., Chen, Y.: Theory and use of the EM algorithm. Found. Trends Signal Process. 4(3), 223–296 (2010)
  29. Harris, P., Brunsdon, C., Fotheringham, A.S.: Links, comparisons and extensions of the geographically weighted regression model when used as a spatial predictor. Stoch. Environ. Res. Risk Assess. 25(2), 123–138 (2011)
  30. Harris, P., Fotheringham, A., Crespo, R., Charlton, M.: The use of geographically weighted regression for spatial prediction: an evaluation of models using simulated data sets. Math. Geosci. 42(6), 657–680 (2010)
  31. Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, Boca Raton (1990)
  32. Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., Ratti, C.: Geo-located Twitter as proxy for global mobility patterns. Cartogr. Geogr. Inf. Sci. 41(3), 260–271 (2014)
  33. Hawley, K., Moellering, H.: A comparative analysis of areal interpolation methods. Cartogr. Geogr. Inf. Sci. 32(4), 411–423 (2005)
  34. Heymann Y., S.C.C.G., Bossard, M.: CORINE land cover technical guide. Technical Report EUR12585, Office for Official Publications of the European Communities (1994)
  35. Kádár, B.: Measuring tourist activities in cities using geotagged photography. Tour. Geogr. 16(1), 88–104 (2014)
  36. Kádár, B., Gede, M.: Where do tourists go? Visualizing and analysing the spatial distribution of geotagged photography. Cartogr. Int. J. Geogr. Inf. Geovis. 48(2), 78–88 (2013)
  37. Kim, G., Barros, A.P.: Downscaling of remotely sensed soil moisture with a modified fractal interpolation method using contraction mapping and ancillary data. Remote Sens. Environ. 83(3), 400–413 (2002)
  38. Kim, H., Yao, X.: Pycnophylactic interpolation revisited: integration with the dasymetric-mapping method. Int. J. Remote Sens. 31(21), 5657–5671 (2010)
  39. Kuhn, M.: Building predictive models in R using the caret package. J. Stat. Softw. 28(5), 1–26 (2008)
  40. Langford, M.: Rapid facilitation of dasymetric-based population interpolation by means of raster pixel maps. Comput. Environ. Urban Syst. 31(1), 19–32 (2007)
  41. Li, D., Zhao, X., Li, X.: Remote sensing of human beings a perspective from nighttime light. Geospat. Inf. Sci. 19(1), 69–79 (2016)
  42. Lin, J., Cromley, R., Zhang, C.: Using geographically weighted regression to solve the areal interpolation problem. Ann. GIS 17(1), 1–14 (2011)
  43. Lin, J., Cromley, R.G.: Evaluating geo-located Twitter data as a control layer for areal interpolation of population. Appl. Geogr. 58(1), 41–47 (2015)
  44. Longley, P.A., Adnan, M., Lansley, G.: The geotemporal demographics of Twitter usage. Environ. Plan. A 47(2), 465–484 (2015)
  45. Malone, B.P., McBratney, A.B., Minasny, B., Wheeler, I.: A general method for downscaling Earth resource information. Comput. Geosci. 41(1), 119–125 (2012)
  46. Nagle, N.N., Buttenfield, B.P., Leyk, S., Spielman, S.: Dasymetric modeling and uncertainty. Ann. Assoc. Am. Geogr. 104(1), 80–95 (2014)
  47. Nordhaus, W.D.: Alternative Approaches to Spatial Rescaling. Technical Report. Yale University, New Haven (2003)
  48. Nordhaus, W.D.: Geography and macroeconomics: new data and new findings. Proc. Natl. Acad. Sci. 103(10), 3510–3517 (2006)
  49. Patel, N.N., Stevens, F.R., Huang, Z., Gaughan, A.E., Elyazar, I., Tatem, A.J.: Improving large area population mapping using geotweet densities. Trans. GIS 21(2), 317–331 (2016)
  50. Paul, M.J., Dredze, M., Broniatowski, D.: Twitter improves influenza forecasting. PLoS Curr. 6(1), 18 (2014)
  51. Quinlan, R.J.: Learning with continuous classes. In: Proceedings of the Australian Joint Conference On Artificial Intelligence, pp. 343–348 (1992)
  52. Reibel, M., Bufalino, M.E.: Street-weighted interpolation techniques for demographic count estimation in incompatible zone systems. Environ. Plan. A 37(1), 127–139 (2005)
  53. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, Hoboken (2005)
  54. Salvati, N., Tzavidis, N., Pratesi, M., Chambers, R.: Small area estimation via M-quantile geographically weighted regression. Test 21(1), 1–28 (2012)
  55. Schmid, T., Münnich, R.T.: Spatial robust small area estimation. Stat. Pap. 55(3), 653–670 (2014)
  56. Sémécurbe, F., Tannier, C., Roux, S.G.: Spatial distribution of human population in France: Exploring the modifiable areal unit problem using multifractal analysis. Geogr. Anal. 48(3), 292–313 (2016)
  57. Batista e Silva, F., Gallego, J., Lavalle, C.: A high-resolution population grid map for Europe. J. Maps 9(1), 16–28 (2013)
  58. Stevens, F.R., Gaughan, A.E., Linard, C., Tatem, A.J.: Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PLoS ONE 10(2), 1–22 (2015)
  59. Tobler, W.: A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 46(2), 234–240 (1970)
  60. Tobler, W.: Smooth pycnophylactic interpolation for geographical regions. J. Am. Stat. Assoc. 74(367), 519–530 (1979)
  61. Tobler, W., Deichmann, U., Gottsegen, J., Maloy, K.: The Global Demography Project. Technical Report 95-6, National Center for Geographic Information and Analysis, Santa Barbara (1995)
  62. Vega, K.V.A.: Aplicación de la Interpolación Fractal en Downscaling de Imágenes Satelitales NOAA-AVHRR de Temperatura de Superficie en Terrenos de Topografia Compleja. Ph.D. thesis, Universidad de Chile (2012)
  63. Whitworth, A., Carter, E., Ballas, D., Moon, G.: Estimating uncertainty in spatial microsimulation approaches to small area estimation: a new approach to solving an old problem. Comput. Environ. Urban Syst. 63, 50–57 (2016)
  64. Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 30(1), 79–82 (2005)
  65. Wu, Ss, Qiu, X., Wang, L.: Population estimation methods in GIS and remote sensing: a review. GISci. Remote Sens. 42(1), 80–96 (2005)
  66. Xu, G., Xu, X., Liu, M., Sun, A.Y., Wang, K.: Spatial downscaling of TRMM precipitation product using a combined multifractal and regression approach: demonstration for South China. Water 7(6), 3083–3102 (2015)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2017

Authors and Affiliations

  1. Universidade de Lisboa, IST/INESC-ID, Porto Salvo, Portugal
  2. Universidade NOVA de Lisboa, DI, FCT/NOVA LINCS, Caparica, Portugal
