A hybrid approach for the spatial disaggregation of socioeconomic indicators
Abstract
While statistical information on socioeconomic activities is widely available, the data are often collected or released only at a relatively aggregated level. In these aggregated forms, the data are useful for broad-scale assessments, although we often need to disaggregate the source data in order to provide more localized estimates and to analyze correlations against geophysical variables. Spatial disaggregation techniques can be used in this context to transform data from a set of source zones into a set of target zones with different geometry and a higher general level of spatial resolution. Still, few previous studies in the area have attempted to leverage state-of-the-art spatial disaggregation procedures in the context of socioeconomic variables, instead focusing on applications related to population modeling. In this article, we report on experiments with a hybrid spatial disaggregation technique that combines state-of-the-art regression analysis procedures with the classic methods of dasymetric mapping and pycnophylactic interpolation. The hybrid procedure was used together with population density, land coverage, nighttime satellite imagery, and OpenStreetMap road density as ancillary data to disaggregate different types of socioeconomic indicators to a high-resolution grid. Our tests specifically leveraged data for the Portuguese territory, resulting in the production of raster datasets with a resolution of 30 arcseconds per cell. The article discusses the spatial disaggregation methodology and the quality of the obtained results under different experimental conditions.
Keywords
Spatial analysis, Downscaling, Geographic information systems, Regression-based spatial disaggregation, Socioeconomic indicators
1 Introduction
Statistical information on socioeconomic activities is widely available, although the data are often collected or released only at a relatively aggregated level. Census data, for example, are often aggregated to census tracts, in part because of concerns about confidentiality. Depending on their nature, data on social indicators or on economic activities may likewise be aggregated to country or regional administrative units.
In these aggregated forms, the data are useful for broad-scale assessments, but aggregated data can mask important local hotspots and tend to smooth out spatial variations in impact. For this reason, researchers often need to disaggregate source data in order to provide more localized estimates. In the context of spatial analysis, spatial disaggregation and spatial downscaling are processes by which information at a coarse spatial scale is translated to finer scales, while maintaining consistency with the original dataset. These techniques are used to convert data originally available for a set of source zones into a set of target zones that have a different geometry and a higher level of spatial resolution. Ranging in complexity from simple areal weighting to intelligent dasymetric disaggregation [33], most approaches have been applied to population data, and they have in common what Tobler [59] termed the pycnophylactic, or mass-preserving, property, in that the estimates are conditioned to sum to the original quantities in the source zones. The term spatial disaggregation is in fact usually employed in the context of additive variables (i.e., population counts and other datasets of aggregated counts over which the pycnophylactic property should be enforced), whereas spatial downscaling is more general, being frequently used with nonadditive variables (e.g., environmental or geophysical properties, such as temperature, precipitation, soil moisture, agricultural land usage, air quality, etc.).
In this article, we report on experiments with a hybrid spatial disaggregation technique that combines the ideas of dasymetric mapping and pycnophylactic interpolation, using population density, nighttime satellite imagery, land coverage, and OpenStreetMap^{1} road density information as ancillary data to disaggregate different types of socioeconomic indicators. Apart from a few exceptions (e.g., seminal work in the area by Goodchild et al. [25] that considered variables such as employment and income, or more recent work within the G-Econ research project of Yale University, which aimed to develop datasets on economic activity through spatial rescaling based on proportional allocation [47, 48]), most previous studies concerned with spatial disaggregation/downscaling have focused either on population density or on geophysical/environmental variables. We nonetheless believe that previously developed procedures, which had these traditional applications in mind, can equally be applied to socioeconomic indicators, facilitating the development of studies that link socioeconomic information with different types of geophysical factors.
The spatial disaggregation technique discussed in this article has specifically been applied in a case study of the Portuguese territory, resulting in the production of raster datasets with different types of socioeconomic indicators. Also referred to as a grid, a raster dataset is a type of tessellation (i.e., a mosaic) that divides a surface into uniform cells (i.e., pixels), being commonly used for representing phenomena that vary continuously over a geographic space. Examining socioeconomic patterns in a gridded format has many advantages, one of them being the ease of linking the socioeconomic data to readily available geophysical data on themes such as climate, ecology, and the like. Even though socioeconomic data are typically only collected at the level of regional administrative units, we may be interested in analyzing the data through different partitions of space (e.g., at the level of fine-grained administrative subunits, or based on regular tessellations of the geographic space, in order to better analyze local hotspots), or in terms of their relation to particular geophysical characteristics (e.g., proximity to regions with specific land coverage types, or relationship to terrain elevation). Disaggregated socioeconomic data can also have important applications in the generation of located synthetic population datasets [2, 3], for instance to be used later in the context of spatial simulations.

The main contributions of this article are as follows:

We compared different methods in the disaggregation of socioeconomic indicators, including simple baseline approaches (e.g., standard mass-preserving areal weighting or pycnophylactic interpolation) and dasymetric mapping methods that leverage ancillary sources of information. Most previous work in the area has considered applications such as population mapping, and we were interested to see (i) if similar approaches could also be used for other types of variables, and (ii) the degree to which ancillary data could be used to improve the results, depending on the type of variable that is being analyzed;

We proposed and evaluated a novel intelligent disaggregation method, based on a downscaling procedure originally proposed by Malone et al. [45] that uses regression analysis to combine different ancillary variables. We adapted/extended the original procedure, which deals with the spatial downscaling of nonadditive variables, in several directions. These include (i) combining it with the use of pycnophylactic interpolation, thus resulting in a hybrid spatial disaggregation approach, (ii) experimenting with different types of regression models, and (iii) carefully sampling data points prior to the training of regression models. We comparatively evaluated the application of the proposed method on the disaggregation of socioeconomic indicators for the Portuguese territory, when using different types of regression algorithms.
2 Fundamental concepts and related work
This section starts by describing classical approaches for spatial disaggregation, afterward describing more recent developments and practical applications.
2.1 Spatial disaggregation methods
Spatial disaggregation, as a procedure, is applied to data sets for which the underlying spatial distribution is unknown, but for which aggregated data, on the basis of spatial zones that resulted from some convenience of enumeration, already exist. A process of spatial disaggregation, or spatial downscaling to be more general, thus refers to the transformation of data from the arbitrary zones of data aggregation, to a set of target zones with different geometry and a higher general level of spatial resolution, in order to recover and better depict the underlying spatial distribution of the data [33].
While mass-preserving areal weighting disaggregation ensures that the total count from the source data remains unchanged, it is based on the often incorrect assumption that the phenomena of interest are evenly distributed across the source zones. Population is one example where the assumption behind mass-preserving areal interpolation clearly does not hold, since populations are rarely uniform across census tracts, and instead tend to be highly clustered in urban centers, surrounded by areas of dispersed rural homesteads.
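The uniformity assumption can be made concrete with a minimal sketch of mass-preserving areal weighting on a grid. It assumes equal-area target cells that each fall entirely inside one source zone; the function name, variable names, and toy values are illustrative and do not come from the article's implementation.

```python
def areal_weighting(zone_totals, cell_zones):
    """Spread each source-zone total evenly over the cells of that zone."""
    # Count how many target cells fall inside each source zone.
    cells_per_zone = {}
    for zone in cell_zones:
        cells_per_zone[zone] = cells_per_zone.get(zone, 0) + 1
    # Every cell receives an equal share of its zone's total.
    return [zone_totals[zone] / cells_per_zone[zone] for zone in cell_zones]


# Hypothetical toy data: two source zones covering six equal-area cells.
zone_totals = {"A": 120.0, "B": 30.0}
cell_zones = ["A", "A", "A", "A", "B", "B"]
estimates = areal_weighting(zone_totals, cell_zones)
```

Every cell in a zone ends up with the same value, which is exactly the even-distribution assumption criticized above, while the per-zone sums still match the source totals.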
Although the results of binary mask areal weighting are generally an improvement over those of simple areal weighting, there are still considerable deficiencies in this method. For instance, not all populated areas have the same density, but binary mask areal weighting assumes that all the populated areas are homogeneous with respect to density. Additionally, areas marked as nonpopulated in the mask often have some population too, which is entirely eliminated in the purely binary approach. Several authors have proposed refinements to the dasymetric approach introduced above, taking it from a binary model to more nuanced approaches, which result in a more realistic depiction of the densities that are typically encountered in real-world variables.
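A minimal sketch of binary mask areal weighting illustrates both deficiencies. It assumes equal-area cells and a hypothetical boolean mask marking populated cells; the names are illustrative only.

```python
def binary_mask_weighting(zone_totals, cell_zones, populated):
    """Spread each zone total evenly over the populated cells of that zone."""
    # Count the populated cells per source zone.
    populated_count = {}
    for zone, flag in zip(cell_zones, populated):
        if flag:
            populated_count[zone] = populated_count.get(zone, 0) + 1
    # Masked-out cells receive nothing; populated cells share the total equally.
    return [
        zone_totals[zone] / populated_count[zone] if flag else 0.0
        for zone, flag in zip(cell_zones, populated)
    ]


# Hypothetical toy data: the mask marks which cells are considered populated.
zone_totals = {"A": 120.0, "B": 30.0}
cell_zones = ["A", "A", "A", "A", "B", "B"]
populated = [True, True, False, False, True, False]
masked = binary_mask_weighting(zone_totals, cell_zones, populated)
```

All populated cells within a zone share a single density and masked-out cells are forced to zero, which are the two limitations noted above. A zone with no populated cell would lose its total entirely, so a practical implementation would need a fallback for that case.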
In the general case, dasymetric disaggregation is any type of areal interpolation method for disaggregating spatial data that leverages ancillary information. As mentioned in relation to mask areal weighting, land cover data, in particular, offer a means by which residential areas can be distinguished from nonresidential areas, and thus land cover has been extensively used in the context of population modeling. General dasymetric disaggregation is an improvement over mask areal weighting in that two or more categories can be assigned weights (e.g., specific disaggregation weights can be derived for individual land cover types to reflect population density). This is often also referred to as polycategorical dasymetric disaggregation, or as the class-percent method. In the polycategorical dasymetric disaggregation method, percentages are applied to each of the categories for the source area, representing the percentage of population (or another variable) that is likely to be contained within that category, per source area. These percentages will vary depending on the location of the area of interest, and are subject to perceived local conditions and to the arbitrary choices of the analyst. The main challenge in dasymetric disaggregation thus involves devising an appropriate set of weights that can be applied to the classes in the ancillary data, for instance to reflect population density. Weights may, for instance, be defined using selective sampling, or by some form of regression analysis.
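As a hedged illustration of the polycategorical idea, the sketch below distributes each zone total in proportion to a relative density weight attached to each cell's land cover class. This is a simplified variant that normalizes per-cell weights within each zone rather than applying analyst-specified percentages per class; the class names and weights are hypothetical.

```python
def class_weighted_dasymetric(zone_totals, cell_zones, cell_classes, class_weights):
    """Distribute zone totals in proportion to per-class density weights."""
    weights = [class_weights[cover] for cover in cell_classes]
    # Sum of the cell weights inside each source zone.
    weight_sums = {}
    for zone, weight in zip(cell_zones, weights):
        weight_sums[zone] = weight_sums.get(zone, 0.0) + weight
    # Each cell receives the share of the zone total given by its weight.
    return [
        zone_totals[zone] * weight / weight_sums[zone]
        for zone, weight in zip(cell_zones, weights)
    ]


# Hypothetical relative densities per land cover class (not from the article).
class_weights = {"urban": 14.0, "rural": 5.0, "forest": 1.0}
zone_totals = {"A": 200.0}
cell_zones = ["A", "A", "A", "A"]
cell_classes = ["urban", "rural", "rural", "forest"]
weighted = class_weighted_dasymetric(zone_totals, cell_zones, cell_classes, class_weights)
```

Choosing the values in `class_weights` is the hard part in practice, which is why selective sampling or regression analysis is used to derive them.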
Recent research on spatial disaggregation has advanced extensions and/or combinations of the classical methods surveyed in this section. For instance, authors like Kim and Yao [38] have developed hybrid approaches for the spatial disaggregation of population data that combine dasymetric mapping and pycnophylactic interpolation, making use of ancillary information that sheds light on the spatial structure of population distribution, while at the same time also adopting the conceptual assumption that population density varies smoothly, instead of uniformly, in space. Noting that the advantages and shortcomings of the two methods are complementary, Kim and Yao proposed an approach that consists of two consecutive logical steps: dasymetric mapping for a preliminary population redistribution, followed by an iterative pycnophylactic interpolation for obtaining a mass-preserving smoothed surface. Binary dasymetric mapping is used in the first step, resulting in a rough estimate of population density over the residential pixels, with the authors arguing that prior studies [18, 40] found no evidence to support any extra benefits of using general dasymetric mapping. In the second step, a smooth surface is produced based on the neighborhood of each residential pixel, using a floating window to calculate the average density value. In this context, the search distance of the floating window is computed through an iterative process of fine-tuning, until the integral of smoothed density values in the source zone matches the original value of the source zone. At the beginning of Sect. 3, we present a diagram that illustrates our hybrid method, and point out its main differences from the approach of Kim and Yao.
The method advanced in this paper also takes its inspiration from the spatial downscaling algorithm proposed by Malone et al. [45]. This method is based on a disseveration procedure similar to dasymetric mapping, for which there is an open-source R implementation that was extended for the experiments reported in this article (i.e., although the original method was proposed for spatial downscaling of nonadditive variables instead of disaggregation, in our work we extended the preexisting R implementation in order to develop a novel spatial disaggregation method, combining the disseveration algorithm with the idea of pycnophylactic interpolation; see Fig. 1). The algorithm described by Malone et al. has two phases, and it originally used generalized additive modeling to fit a nonlinear relationship between a target variable (i.e., the indicator that we wish to model at a fine resolution, and for which we have data at a coarse resolution) and predictive covariates (i.e., data for other variables, available at a fine resolution, that can inform the downscaling). In an initialization phase, the authors perform a coarse grid to fine grid resample (i.e., through a nearest neighbor resampling approach for data downscaling, in which the cells at the fine grid take the value from the closest coarse grid cells), followed by random sampling of data points and an initial model fit. The model assumes that the value at each target region corresponds to an additive combination of nonlinear functions (i.e., cubic splines with knots at each of the target regions) of the covariates. More details about additive models are available in the book by Hastie and Tibshirani [31]. In an iteration phase, adjustments are made to the predictions iteratively, trying to ensure that the coarse grid is linearly related to the fine grid predictions (i.e., there is a mass balance property to be attained). 
Iterations proceed until a stopping criterion is met, based on a maximum number of iterations, or alternatively using a threshold over the change in the estimated error rate over three consecutive iterations.
2.2 Applications and case studies
In a study of areal interpolation for socioeconomic data, Goodchild et al. [25] looked at a typical problem of spatial analysis using noncoincident areal units, namely the 58 counties of California (the source zones) and the state's 12 major hydrological basins (the target zones). The boundaries of the two sets of spatial units were, for the most part, incompatible. Socioeconomic data were available at the county level, but data connected with water issues were collected based on the hydrological basin units that correspond to major watershed boundaries. In order to conduct a major economic impact study of water usage and policy, variables such as employment, income, and population had to be transferred from the county spatial units to the hydrological regions. Goodchild et al. [25] used direct areal weighting to accomplish this, assuming that densities in the source zones (the counties) were uniform. When later comparing the results of the areal weighting method with other methods using statistical approaches, they found that areal weighting had a much higher mean percentage error than did the other methods.
Gallego [20] described the production of a high-resolution European population map through a stochastic allocation process by which weights are devised for disaggregating population totals from larger administrative units (i.e., NUTS 2 regions in Europe) to smaller ones (i.e., communes) on the basis of land cover information. Communes were first stratified, by comparing the commune population density to the average density of the surrounding NUTS 2 region, into one of three levels reflecting population density (i.e., dense, less dense, and not urban). The method then involved disaggregating the NUTS 2 totals using an initial set of weights, reaggregating the population to the commune level, comparing it to the known total to compute a disagreement indicator, and adjusting the weights to reduce the disagreement. Several subsequent studies have described refined versions of the methodology originally put forward by Gallego [20], which involved combining land use and/or land coverage information present in various high-resolution data sources [24, 57].
Besides land coverage and related types of information, which are often extracted from multispectral satellite imagery through data classification procedures, other dasymetric disaggregation approaches have been reported to use different kinds of ancillary data. For instance, authors like Elvidge et al. [17] proposed to use satellite-observed visible to near-infrared emissions, while Doll et al. [13] proposed to use nighttime light emissions, and Langford [40] proposed the usage of rasterized topographic maps. Depending on the variable that is to be disaggregated, different types of ancillary variables can indeed be of use.
Reibel and Bufalino [52] used street network data (i.e., U.S. Census TIGER files) to derive weights for the interpolation of population and housing unit counts, for incompatible zone systems in Los Angeles County, California. The authors used the street and road grid as a proxy for approximate population and housing unit density surfaces, for census tracts in the county. They then conducted an error analysis, comparing the results of the street-weighting method with traditional areal weighting, finding that the street-weighting method offers some benefits. Despite the interesting results, the authors noted that the street-weighting method appears to reduce errors most in those areas where the lack of population is reflected in the lack of roads, and least in those areas with a more developed but nonresidential transportation infrastructure (e.g., industrial areas).
In the context of modeling European population, Briggs et al. [7] developed a model that incorporates ancillary data, specifically Earth Observation (EO) products corresponding to nighttime light intensity data and CORINE land cover [34], in a GIS-based regression approach to disaggregate NUTS 5 census totals to a resolution of 1 km\(^2\). European light emission data from the DMSP satellites were resampled and modeled using kriging and inverse distance weighting, to provide a 200 m resolution light emissions map. This was matched to CORINE land cover classes, and linear regression analysis was used to derive models of relationships between census population counts, land cover area, and light emissions. The regression weights were then used in the dasymetric disaggregation procedure.
In a more recent study, Stevens et al. [58] also combined different types of remotely sensed data, such as nighttime lights and land cover information, to derive weights for the disaggregation of population counts, originally at a country level. As their weighting scheme, the authors used a flexible and nonparametric predictive model based on ensembles of decision trees (i.e., the random forest regression approach [6]), in order to leverage the available ancillary data to generate a gridded prediction at an approximate spatial resolution of 100 \(\times \) 100 m. Stevens et al. [58] concluded that, at country-level scales, the ensemble of decision trees performed substantially better than other methods. The authors argued that decision trees are indeed quite flexible, being able to handle multiple covariates (i.e., the auxiliary variables that can inform the spatial disaggregation) of both discrete and continuous natures, with a minimum amount of tuning and supervision.
In sum, some previous studies have indeed compared a variety of spatial disaggregation methods (see Wu et al. [65] for a recent review). Previous research suggests that dasymetric and intelligent (e.g., regression-based) areal interpolation techniques can outperform areal weighting and other areal interpolation approaches that do not incorporate ancillary data, although interpolation accuracy is dependent on the strength of the relationship between the source and ancillary data. Fisher and Langford [18] found that the traditional binary dasymetric method could also be more accurate than both areal weighting and regression-based intelligent areal interpolation techniques. Similar results were reported by Langford [40], who found no evidence to support any extra benefits of using multiclass dasymetric mapping, and by Gregory [27], who also reported good results for areal interpolation with a probabilistic approach that combines the binary method with the Expectation-Maximization (EM) algorithm [28]. Most previous studies in the area have also been limited to applications related to population modeling, although there are many other potentially interesting applications.
3 A hybrid disaggregation method
Both the dasymetric mapping and the pycnophylactic interpolation methods have solid theoretical foundations, as well as strong empirical support in population-estimation research. Each of these methods has its own strengths, but also suffers from obvious shortcomings. For instance, pycnophylactic interpolation produces a smooth surface in the study area, without any presumption of uniform distribution (i.e., it iteratively smooths the estimates by taking the average of neighboring cells, instead of just dividing the total mass by the number of cells, as in the case of mass-preserving areal weighting). However, the method does not draw on any ancillary information about the real spatial distribution, so that its estimation accuracy cannot benefit from useful information that is frequently available.
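The smoothing-and-rescaling idea behind pycnophylactic interpolation can be sketched as follows. This is a simplified illustration rather than Tobler's exact algorithm or the pycno implementation: it assumes a rectangular grid with strictly positive zone totals, averages over whichever of the eight neighbors exist at the grid border, and uses hypothetical names and stopping parameters throughout.

```python
def zone_sums(grid, zones):
    """Sum the cell values of `grid` per zone label in `zones`."""
    sums = {}
    for value_row, zone_row in zip(grid, zones):
        for value, zone in zip(value_row, zone_row):
            sums[zone] = sums.get(zone, 0.0) + value
    return sums


def rescale(grid, zones, totals):
    """Scale the cells of each zone so that the zone sums match `totals`."""
    current = zone_sums(grid, zones)
    return [
        [value * totals[zone] / current[zone] for value, zone in zip(value_row, zone_row)]
        for value_row, zone_row in zip(grid, zones)
    ]


def smooth(grid):
    """Replace each cell with the mean of its (up to eight) neighbours."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            neighbours = [
                grid[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            new_row.append(sum(neighbours) / len(neighbours))
        result.append(new_row)
    return result


def pycnophylactic(zones, totals, max_iter=100, tol=1e-6):
    """Alternate smoothing and per-zone rescaling until the surface stabilizes."""
    # Start from mass-preserving areal weighting (equal share per cell).
    ones = [[1.0] * len(row) for row in zones]
    grid = rescale(ones, zones, totals)
    for _ in range(max_iter):
        updated = rescale(smooth(grid), zones, totals)
        change = max(
            abs(u - g) for u_row, g_row in zip(updated, grid) for u, g in zip(u_row, g_row)
        )
        grid = updated
        if change < tol:  # assumed stopping rule: no more significant changes
            break
    return grid


# Hypothetical toy data: a 2 x 4 grid split into a dense zone A and a sparse zone B.
zones = [["A", "A", "B", "B"], ["A", "A", "B", "B"]]
totals = {"A": 80.0, "B": 8.0}
surface = pycnophylactic(zones, totals)
```

In this toy case the cells of zone A that border zone B end up lower than the cells farther away, while each zone still sums to its original total. No ancillary data enter the computation, which is the shortcoming discussed above.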
 1.
Produce a vector polygon layer for the variable to be disaggregated by associating the quantities, linked to the source regions, to geometric polygons representing the corresponding regions;
 2.
Create a raster representation for the study region, based on the vector polygon layer from the previous step and considering a resolution of 30 arcseconds per cell. This raster, referred to as \(T^p\), will contain smooth values resulting from a pycnophylactic interpolation procedure [60]. The algorithm starts by assigning to each cell a value derived from the original vector polygon layer, using a simple mass-preserving areal weighting procedure (i.e., we redistribute the aggregated data based on the proportion of each source zone that overlaps with the target zone). Iteratively, each cell’s value is replaced with the average of its eight neighbors in the target raster. We then adjust the values of all cells within each zone proportionally, so that each zone’s total in the target raster is the same as the original total (e.g., if a zone’s current total is 10% lower than the original value, every cell in that zone is multiplied by 1/0.9, an increase of roughly 11%). The procedure is repeated until no more significant changes occur. The resulting raster is a smooth surface corresponding to an initial estimate for the disaggregated values;
 3.
Overlay four rasters \(P^1\), \(P^2\), \(P^3\) and \(P^4\), also using a resolution of 30 arcseconds per cell, on the study region from the original vector layer and from the raster produced in the previous step, respectively, with information regarding (i) population counts, (ii) nighttime light emissions, (iii) land coverage classification, and (iv) OpenStreetMap road network density. These rasters will be used as ancillary information for the spatial disaggregation procedure. Prior to overlaying the data, the four different raster data sources are normalized to the resolution of 30 arcseconds per cell, through a simple interpolation procedure based on taking the mean of the different values per cell (i.e., in the cases where the original raster had a higher resolution), or the value from the nearest/encompassing cell (i.e., in the cases where the original raster had a lower resolution);
 4.
Overlay two other rasters \(P^5\) and \(P^6\) over the study region, again using the same resolution of 30 arcseconds per cell and with ancillary information derived from the rasters in the previous step. Specifically, these two rasters encode (i) the distance from a given cell to the nearest cell with a land coverage type equal to water, and (ii) the distance from a given cell to the nearest cell containing a road or a street segment. Raster \(P^5\) is thus derived from raster \(P^3\) with land coverage information, whereas \(P^6\) is derived from raster \(P^4\) with OpenStreetMap road network density. These two rasters will also be used as ancillary information for spatial disaggregation;
 5.
Overlay another raster \(T^d\) on the study region, with the same resolution used in the rasters from the previous steps. This raster will be used to store the estimates produced by a simple spatial disaggregation procedure based on dasymetric mapping (i.e., a method based on proportional and weighted areal interpolation). For producing these estimates, we weight the total value, for each source zone in the original vector polygon layer, according to the proportion between the population values available for the corresponding cell in raster \(P^1\), and the sum of all the values for the given source zone in the same raster. This is essentially a proportional and weighted areal interpolation method, corresponding to the following equation, where \(T^d_t\) is the estimated count in target zone t, \(S_s\) is the count in source zone s, \(P_t\) is the population count in target zone t, and \(P_s\) is the population count in source zone s;
$$\begin{aligned} T^d_t = \sum _{\{s : s \cap t \ne \emptyset \}} \left( \frac{P_t}{P_s} \times S_s \right) \end{aligned} \tag{4}$$
 6.
Collect a sample of cells in the fine-resolution grid, in order to later fit regression models. The experiments with the original disseveration procedure that were described by Malone et al. [45] considered a random sampling strategy, although better alternatives can also be used. For study regions of a moderate size, all data instances can be considered. Alternatively, sampling can be performed using the R function spsample^{2}, which supports regular (i.e., systematically aligned) sampling, which can evenly represent the entire geographic region while at the same time avoiding the problem of spatial autocorrelation, as well as clustered sampling (i.e., the same number of samples are collected from groups of points assumed to have different characteristics). In some of our experiments, we relied on a regular sampling of the data points, although most of our results were reported with models trained with the full set of data instances;
 7.
Create a final raster overlay, through the application of an intelligent dasymetric disaggregation procedure based on disseveration, as proposed by Malone et al. [45], and leveraging the rasters from the previous steps. Specifically, the vector polygon layer from Step 1 is considered as the source data to be disaggregated, while raster \(T^p\) from Step 2 is considered as an initial estimate for the disaggregated values. Rasters \(P^1\), \(P^2\), \(P^3\), \(P^4\), \(P^5\), \(P^6\) and \(T^d\) are seen as predictive covariates. The regression algorithm used in the disseveration procedure is fit using the data sample from the previous step, and applied to produce new values for raster \(T^p\). The application of the regression algorithm will refine the initial estimates based on their relation to the predictive covariates, this way dissevering the source data;
 8.
Proportionally adjust the values returned by the downscaling method from Malone et al. [45] for all cells within each source zone, so that each source zone’s total in the target raster is the same as the total in the original vector polygon layer (e.g., again, if a zone’s current total is 10% lower than the original value, multiply the value of each cell in that zone by 1/0.9, an increase of roughly 11%);
 9.
Steps 6–8 are repeated, iteratively executing the disseveration procedure that relies on regression analysis to adjust the initial estimates \(T^p\) from Step 2, until the estimated values converge (i.e., the change in the estimated error rate over three consecutive iterations is less than 0.001) or until reaching a maximum number of iterations (i.e., 100 iterations).
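The core of Steps 5 to 9 can be sketched in a few lines. This is a deliberately reduced illustration and not the dissever-based R implementation: it assumes a single covariate (the dasymetric estimate \(T^d\) from Eq. 4), ordinary least squares as the regression model, an areal-weighting surface standing in for \(T^p\), clipping of non-positive predictions, and a stopping rule based on the largest change in cell values instead of the error-rate criterion. All names and values are hypothetical.

```python
def rescale(values, cell_zones, zone_totals):
    """Scale cell values so that each source zone sums to its known total."""
    sums = {}
    for zone, value in zip(cell_zones, values):
        sums[zone] = sums.get(zone, 0.0) + value
    return [value * zone_totals[zone] / sums[zone] for zone, value in zip(cell_zones, values)]


def dasymetric(zone_totals, cell_zones, population):
    """Eq. 4: T^d_t = (P_t / P_s) * S_s, with P_s the zone's population sum."""
    return rescale(population, cell_zones, zone_totals)


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with one covariate."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hybrid_disaggregate(zone_totals, cell_zones, initial, covariate,
                        max_iter=100, tol=1e-3):
    """Iterate regression (Step 7) and mass-preserving rescaling (Step 8)."""
    estimate = rescale(initial, cell_zones, zone_totals)
    for _ in range(max_iter):
        intercept, slope = fit_line(covariate, estimate)
        # Assumption: clip predictions to a tiny positive value so that
        # counts stay non-negative and zone sums never become zero.
        predicted = [max(intercept + slope * x, 1e-9) for x in covariate]
        updated = rescale(predicted, cell_zones, zone_totals)
        change = max(abs(u - e) for u, e in zip(updated, estimate))
        estimate = updated
        if change < tol:  # simplified convergence check (Step 9)
            break
    return estimate


# Hypothetical toy data: two source zones with three fine-resolution cells each.
zone_totals = {"A": 90.0, "B": 30.0}
cell_zones = ["A", "A", "A", "B", "B", "B"]
population = [10.0, 30.0, 50.0, 5.0, 10.0, 15.0]

t_d = dasymetric(zone_totals, cell_zones, population)
t_p = rescale([1.0] * len(cell_zones), cell_zones, zone_totals)  # stand-in for Step 2
result = hybrid_disaggregate(zone_totals, cell_zones, t_p, t_d)
```

Because the rescaling is the last operation in every iteration, the pycnophylactic property holds regardless of how well the regression fits, and the regression only determines how the mass is distributed within each source zone.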
3.1 Implementation details and the considered regression algorithms
The disaggregation procedure was implemented in the programming language of the R^{3} project for statistical computing, given that there are already many extension packages^{4} for R concerned with the analysis of spatial data, facilitating the usage of geospatial datasets encoded using either the geometric or the raster data models [5]. We have specifically integrated and extended the source code from the R packages named pycno^{5} and dissever^{6}, which, respectively, implement the pycnophylactic interpolation algorithm from Tobler [60] used in Step 2, and the downscaling procedure based on regression analysis and disseveration that was outlined by Malone et al. [45] and used in Step 7. By leveraging the preexisting dissever package, we could easily perform experiments with different types of regression models, such as ensembles of decision trees as used by Stevens et al. [58], or generalized additive models as originally used by Malone et al. [45]. The latest version of dissever internally uses the caret^{7} package for the implementation of the regression models. The caret package [39], short for classification and regression training, contains numerous tools for developing different types of predictive models, facilitating experiments with different types of regression approaches in order to discover the relations between the target variable to disaggregate and the available covariates. In our experiments, we specifically used standard linear regression models, generalized additive models [31], and an approach based on ensembles of decision trees that is typically referred to as cubist [51].
In standard linear regression, a linear least-squares fit is computed for a set of predictor variables (i.e., the covariates) to predict a dependent variable (i.e., the disaggregated values). The well-known linear regression equation corresponds to a weighted linear combination of the predictive covariates, added to a bias term. In generalized additive models, the dependent variable values are also predicted from an additive combination of terms over the predictor variables, but these are instead connected to the dependent variable via a link function, which nonetheless may simply correspond to the identity function. Instead of a single coefficient for each variable in the model (i.e., for each additive term in the combination), a function (e.g., a cubic smoothing spline) is estimated for each predictor, to achieve the best prediction of the dependent variable values. Rather than estimating single parameters, in generalized additive models we thus find a more general function that relates the predicted values to the predictors, effectively allowing for some degree of nonlinearity. Details on how generalized additive models are fit to data can be found in the book by Hastie and Tibshirani [31].
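For reference, the two model families can be written side by side. The notation is ours and is introduced only for illustration: \(y\) denotes the disaggregated value at a cell, \(x_1, \ldots , x_k\) the covariates, \(\beta _0\) the bias term, \(\beta _j\) the regression coefficients, \(f_j\) the smooth functions estimated per covariate, and \(g\) the link function.

$$\begin{aligned} \hat{y} = \beta _0 + \sum _{j=1}^{k} \beta _j x_j \end{aligned}$$

$$\begin{aligned} g\left( \mathbb {E}[y] \right) = \beta _0 + \sum _{j=1}^{k} f_j(x_j) \end{aligned}$$

With \(g\) equal to the identity and each \(f_j(x_j) = \beta _j x_j\), the additive model reduces to the linear regression model, which makes explicit that the former generalizes the latter.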
The cubist approach is instead based on combining decision trees with linear regression models, again allowing for some degree of nonlinearity [51]. The leaf nodes in these trees contain linear regression models based on the predictors used in previous splits. There are also intermediate linear models at each step of a tree, so that the predictions made by the linear regression model at the terminal node are smoothed by taking into account the predictions from the linear models in the previous nodes, recursively up the tree. The tree-based cubist approach is also normally used within an ensemble scheme based on committees, in which a series of trees is trained sequentially with adjusted weights. The final predictions result from the average of the predictions from all committee members.
We experimented with these different regression models to measure their impact on the disaggregation performance for different types of variables. In cases where the target variable has a smooth and nearly linear dependence on the covariates, a standard linear regression model will probably perform better than more sophisticated nonlinear approaches (e.g., an approach based on a combination of multiple decision trees, which will attempt to approximate the linear relationship with an irregular step function). When the covariates are multicollinear, or when the relationships between the target values and the covariates are more complex, nonlinear models can perhaps offer better performance.
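To make the role of the regression model more concrete, the following minimal sketch illustrates the general idea of an iterative, mass-preserving downscaling loop of the kind described above. It is not the code of the dissever package. The function names, the toy data, and the use of a single-covariate least-squares fit in place of the regression models discussed in this section are all assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def rescale_to_totals(values, zones, totals):
    """Rescale cell values so that each source zone sums to its known total."""
    sums = {}
    for v, z in zip(values, zones):
        sums[z] = sums.get(z, 0.0) + v
    rescaled = []
    for v, z in zip(values, zones):
        if sums[z] > 0:
            rescaled.append(v * totals[z] / sums[z])
        else:
            # No signal inside the zone: spread the total evenly.
            rescaled.append(totals[z] / zones.count(z))
    return rescaled


def dissever_sketch(initial, covariate, zones, totals, iterations=10):
    """Alternate between fitting a regression and restoring the zone totals."""
    estimates = rescale_to_totals(initial, zones, totals)
    for _ in range(iterations):
        intercept, slope = fit_line(covariate, estimates)
        # Negative predictions are clipped, since the targets are counts.
        predicted = [max(intercept + slope * c, 0.0) for c in covariate]
        estimates = rescale_to_totals(predicted, zones, totals)
    return estimates


# Hypothetical toy data: six cells in two source zones.
zones = ["A", "A", "A", "B", "B", "B"]
totals = {"A": 60.0, "B": 30.0}
covariate = [1.0, 2.0, 3.0, 1.0, 1.0, 1.0]
initial = [20.0, 20.0, 20.0, 10.0, 10.0, 10.0]  # uniform initial estimates
result = dissever_sketch(initial, covariate, zones, totals)
```

In this toy example the estimates inside zone A move away from the uniform starting values and toward the pattern of the covariate, while the sum of the cells in each zone stays equal to the zone total.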
3.2 The ancillary data sources
The ancillary information regarding population statistics was, in our case, obtained from the Gridded Population of the World (GPW^{8}), a well-known dataset depicting the distribution of human population across the globe, providing globally consistent and spatially explicit (i.e., disaggregated) human population information. The current version of the dataset was constructed from national or subnational input units (i.e., from low-level administrative units from the different countries) of varying resolutions, through a complex spatial disaggregation procedure. The initial version of the GPW dataset, which was released in 1995 and used a simple pycnophylactic spatial disaggregation method for population data [61], resulted from a discussion at the 1994 workshop on global demography, where there was consensus that a consistent global database of population totals, in raster format, would be invaluable for interdisciplinary research. The dataset was then continuously revised over the years. Since many socioeconomic variables are expected to correlate with population density, it is our belief that this dataset can provide crucial information for our disaggregation objectives.
The resolution considered for the GPW dataset is 30 arcseconds per cell, or approximately 1 km at the Equator, although aggregates at coarser resolutions are also provided. Separate grids are available with population counts and with the density per grid cell. Population data estimates (in 2015, when GPWv4 was released) are provided for 2000, 2005, 2010, 2015 and 2020, extrapolating from data collected in the 2010 round of censuses, which occurred between 2005 and 2014. In our experiments, we used the count data projected to the year 2010 (i.e., the year that is closest to the date associated with the target variables that we want to disaggregate), with the resolution of 30 arcseconds per cell.
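As a quick check on the stated cell size, the sketch below converts an angular resolution into approximate ground distances. It assumes a spherical Earth with a mean radius of 6371 km, and the latitude of 39.5 degrees North is only an illustrative value for mainland Portugal. Neither figure comes from the original text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def cell_size_km(arcseconds, latitude_deg):
    """Return the approximate (east-west, north-south) cell size in km."""
    degrees = arcseconds / 3600.0
    north_south = math.radians(degrees) * EARTH_RADIUS_KM
    # Meridians converge toward the poles, so the east-west extent shrinks.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


equator = cell_size_km(30, 0.0)
portugal = cell_size_km(30, 39.5)  # illustrative latitude, not from the text
```

Under these assumptions a 30 arcsecond cell is a little under 1 km wide at the Equator and noticeably narrower in the east-west direction at Portuguese latitudes, while its north-south extent is unchanged.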
As for the ancillary information regarding nighttime light emissions, we used the publicly available VIIRS Nighttime Lights 2012 dataset^{9}, maintained by the Earth Observation Group of the NOAA National Geophysical Data Center. Since 1992, the NOAA U.S. National Geophysical Data Center has produced and provided a long time series of global annual nighttime satellite images from the U.S. Air Force Defense Meteorological Satellite Program (DMSP), using the Operational Linescan System (OLS). In the past, the distribution of artificial light from these images has been used in many different studies, as a proxy for urbanisation, population density, economic activity, and armed conflict, as well as to assess the spatial extent of light pollution itself [16, 41]. We specifically used the global cloud-free composite of VIIRS nighttime lights, which was generated with VIIRS day/night band (DNB) observations collected on nights with zero moonlight, on 18–26 of April 2012 and on 11–23 of October 2012. Cloud screening was done based on the detection of clouds in the VIIRS M15 thermal band, and the product has not been filtered to subtract background noise, or to remove light detections associated with fires, gas flares, volcanoes, or aurora. The raster data, available at a resolution of 15 arcseconds per cell, consist of floating point values calculated by averaging the pixels deemed to be cloud-free. Previous studies have shown that nighttime lights are strongly correlated with variables that reflect permanent or temporary population distribution, and thus this variable is also expected to be quite useful for the disaggregation of socioeconomic variables. Nighttime light information is nowadays also made available at a high spatial resolution and at very frequent temporal intervals. Thus, this information can be useful in the case of applications that require frequent updates.
Regarding land coverage information, we used the standard Corine Land Cover (CLC) data product^{10}, which is based on satellite images as the primary information source, and whose technical details are presented in the report by Heymann and Bossard [34]. We specifically used data for the year 2012, at a 250 \(\times \) 250 m resolution (i.e., since the remaining ancillary layers are only available at coarser resolutions, using finer-grained land coverage information, e.g., at a 100 m resolution, would imply either aggregating those data or transforming the remaining datasets). The 44 different classes of the 3-level Corine nomenclature that are considered in the original product (e.g., classes for water bodies, artificial surfaces, agricultural areas, etc.) were converted into real values in the range [0, 1], which encode how developed the territory corresponding to a given cell is, following a simple dasymetric distribution that corresponds to a class-percent method (i.e., cells with the class water bodies were assigned the value of zero, cells corresponding to wetlands were assigned the value of 0.25, different types of forest and semi-natural areas were assigned the value of 0.5, agricultural areas were assigned the value of 0.75, and artificial surfaces were assigned the value of one). This conversion from categorical to numeric values makes it easier to explore land coverage within different types of regression modeling methods (e.g., this procedure is appropriate for standard linear regression models, where categorical variables would otherwise have to be encoded, for instance, through the use of one different variable for each possible category, with a value of one if the case falls in that category and zero otherwise). Despite the arbitrariness of the considered weights, our disaggregation method based on regression will adjust the contribution of each of the class-percent values in a data-driven way.
Many previous studies have used land coverage data as a source of ancillary information for disaggregating population data, e.g., in order to distinguish rural from urban regions, and to redistribute the aggregated values accordingly.
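The class-percent conversion described above can be sketched as a simple lookup. The sketch assumes that cells carry the standard three-digit CLC codes, whose first digit identifies the level-1 group. The function and dictionary names are hypothetical, and only the five weights come from the text.

```python
# Weights per level-1 CLC group, as listed in the text.
DEVELOPMENT_WEIGHT = {
    "1": 1.0,   # artificial surfaces
    "2": 0.75,  # agricultural areas
    "3": 0.5,   # forest and semi-natural areas
    "4": 0.25,  # wetlands
    "5": 0.0,   # water bodies
}


def development_score(clc_code):
    """Map a three-digit CLC class code to a development score in [0, 1]."""
    # The first digit of the code identifies the level-1 group.
    return DEVELOPMENT_WEIGHT[str(clc_code)[0]]


def score_raster(clc_raster):
    """Apply the conversion to a raster given as a list of rows of codes."""
    return [[development_score(code) for code in row] for row in clc_raster]


# Hypothetical 2 x 3 raster of CLC codes.
example = [[111, 211, 311], [411, 512, 112]]
scores = score_raster(example)
```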
Besides the raster encoding land development, the CLC dataset was also used to produce a second raster with derived information, encoding the distance toward the nearest water body (i.e., the distance toward the nearest CLC cell assigned to the class water bodies). For population distribution, as well as for socioeconomic variables related to particular economic activities, this particular type of ancillary data can perhaps be of use, as we expect different concentrations of the target variables on areas near rivers, lakes, or oceans.
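A raster with the distance to the nearest water body can be derived as in the following brute-force sketch, which is only practical for small grids. It assumes Euclidean distances measured between cell centers in units of cells, and a raster of level-1 CLC groups in which the value 5 marks water bodies. How the original implementation computed these distances is not stated in the text.

```python
import math


def distance_to_water(grid, water_value=5, cell_size=1.0):
    """Return, for every cell, the distance to the nearest water cell."""
    water_cells = [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value == water_value
    ]
    if not water_cells:
        raise ValueError("the grid contains no water cells")
    return [
        [
            cell_size * min(math.hypot(r - wr, c - wc) for wr, wc in water_cells)
            for c in range(len(row))
        ]
        for r, row in enumerate(grid)
    ]


# Hypothetical 3 x 3 grid of level-1 CLC groups with one water cell.
level1 = [[5, 3, 3], [3, 3, 1], [3, 1, 1]]
distances = distance_to_water(level1)
```

For rasters of realistic size, a dedicated distance transform would be a more sensible choice than this quadratic search.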
4 A case study with the Portuguese territory

Number of female residents, according to the national census in 2011;

Number of live births in 2011, by place of residence of the mother;

Number of deaths in 2011, according to the national directorate-general of health;

Number of foreign residents, according to the national census in 2011;

Number of buildings, according to the national census in 2011;

Number of buildings with at least two floors, according to the national census in 2011;

Resident population employed in the agriculture, animal production, hunting, forest, and fishery sectors, according to the national census in 2011;

Employed resident population, according to the national census in 2011;

Number of crimes registered by the police forces, for the year of 2011;

Number of hotel visitors (i.e., number of guests in hotel establishments) in 2011, according to the national tourism authority.
Figure 3 presents a similar grid to the one that is shown in Fig. 2, but in this case illustrating the results that were obtained through the proposed spatial disaggregation procedure, using as source zones the highest possible resolutions in terms of the original data aggregation (i.e., civil parishes in all cases, except for the indicators corresponding to the number of crimes and the number of hotel visitors; the number of crimes was disaggregated from the level of municipalities, and the number of hotel visitors was disaggregated from the NUTS III level, given that we had these data for the entire NUTS regions, although not for some of the municipalities). We also used all sources of ancillary information, together with linear regression models within the disseveration-based algorithm. The maps from Fig. 3 have a resolution of 30 arcseconds per cell, and they illustrate general trends in the resulting distribution for the disaggregated values (e.g., higher values are assigned to coastal regions).
Figure 5 details one of the variables from Figs. 2 and 3, specifically the one corresponding to the number of crimes. This figure plots, side by side, (i) a choropleth map with the number of crimes per municipality, (ii) the ancillary raster with population counts for the Portuguese territory, (iii) a raster showing the disaggregated number of crimes, as obtained with a simpler method corresponding to a proportional and weighted areal interpolation procedure, and (iv) the raster obtained with the proposed hybrid disaggregation method, using linear regression with all sources of ancillary data. From the figure, one can see that the areas with the higher population counts indeed end up receiving a large proportion of the disaggregated counts for the number of crimes, and also that the resulting map is smoother than the one that would be produced by the proportional and weighted areal interpolation procedure.
The results from the aforementioned figures suggest that indeed there is a high correlation between variables such as population counts or nighttime light emissions, and the target variables to be disaggregated. When investigating these correlations, we found that there is a very strong linear correlation between the population counts for each aggregation area (e.g., for each civil parish) and the aggregated values for the different variables that were considered. Consequently, a very high linear correlation is also found for the disaggregated results produced through the dasymetric procedure that relied exclusively on population counts as ancillary information (i.e., results suggest that proportional and weighted areal interpolation, leveraging the population counts, constitutes a very strong baseline).
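The baseline mentioned above, i.e., proportional and weighted areal interpolation with population counts as weights, can be sketched as follows. The function name, the data layout (one zone label and one weight per cell), the even split used for zones without any population, and the toy values are assumptions made for illustration.

```python
def weighted_areal_interpolation(zone_totals, cell_zones, cell_weights):
    """Distribute each zone total over its cells, proportionally to weights."""
    weight_sums = {}
    cell_counts = {}
    for zone, weight in zip(cell_zones, cell_weights):
        weight_sums[zone] = weight_sums.get(zone, 0.0) + weight
        cell_counts[zone] = cell_counts.get(zone, 0) + 1
    estimates = []
    for zone, weight in zip(cell_zones, cell_weights):
        if weight_sums[zone] > 0:
            estimates.append(zone_totals[zone] * weight / weight_sums[zone])
        else:
            # Zone without any weight: fall back to an even split.
            estimates.append(zone_totals[zone] / cell_counts[zone])
    return estimates


# Hypothetical data: crimes per municipality and population per cell.
crimes = {"A": 50.0, "B": 8.0}
cell_zones = ["A", "A", "A", "B", "B"]
population = [100.0, 300.0, 600.0, 0.0, 0.0]
crime_raster = weighted_areal_interpolation(crimes, cell_zones, population)
```

By construction, the cell estimates within each zone add up to the zone total, which is the mass-preserving property shared by the dasymetric methods discussed in the article.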
Figure 7 presents scatterplots illustrating the correlation between two of the variables that were considered in our study (i.e., the number of female residents and the number of buildings per civil parish, respectively, the variables with the highest and lowest linear correlations toward population counts) with three of the ancillary rasters used in the hybrid disaggregation procedure based on disseveration, namely the information regarding the population counts, the OpenStreetMap node count, and the raster obtained with the simple disaggregation method that only used population counts as ancillary information. The three variables with ancillary information were aggregated to the level of civil parish, in order to compare these values against those from the variables that were disaggregated. Each plot presents also the actual value that was obtained for the Pearson correlation coefficient between the variables, which in all cases had a p value below 0.001.
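For readers who want to reproduce this kind of comparison, the Pearson correlation coefficient between two variables aggregated to the same zones can be computed as in the sketch below. The input values are hypothetical and the significance test behind the reported p values is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-parish values: population counts and number of buildings.
parish_population = [1200.0, 5400.0, 300.0, 9800.0, 2500.0]
parish_buildings = [610.0, 1900.0, 240.0, 2600.0, 1300.0]
r = pearson(parish_population, parish_buildings)
```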
From the scatterplots presented in Fig. 7, one can confirm the relevance of the auxiliary variables for spatial disaggregation. For instance, one can see (i.e., either through visual observation, or through the computed values for the Pearson correlation coefficient) that the relationship between two of the ancillary rasters (i.e., the one containing values concerning population counts, and the one based on the simple baseline disaggregation method) and both of the considered target variables is indeed strong, validating our assumptions on the importance of population distribution in the disaggregation of socioeconomic variables. On the other hand, the node density from OpenStreetMap has notably less relevance in the distribution of the two target indicators considered in the plots, although it still has a noticeable relationship with indicators like the number of buildings (i.e., in this case, a Pearson correlation of 0.311). The regression algorithms should estimate parameters that reflect the strength of such correlations, giving more relevance to the ancillary information provided by the population counts and by the simple disaggregation algorithm.
Table 1 Disaggregation errors measured for the ten different socioeconomic variables, with the aggregated data collected originally at a NUTS III level
| Variable | Pycnophylactic interpolation | | | | Weighted interpolation | | | | Hybrid method | | | |
| | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE |
| Female residents | 3944.16 | 1379.01 | 11.387 | 3.981 | 839.76 | 278.52 | 2.425 | 0.804 | 840.21 | 283.43 | 2.426 | 0.818 |
| Live births | 79.73 | 26.30 | 9.303 | 3.068 | 108.37 | 10.72 | 12.646 | 1.251 | 21.35 | 8.38 | 2.491 | 0.977 |
| Deaths | 68.43 | 24.16 | 12.556 | 4.432 | 107.48 | 11.43 | 19.722 | 2.098 | 19.95 | 9.15 | 3.661 | 1.679 |
| Foreign residents | 517.34 | 123.31 | 8.299 | 1.978 | 194.51 | 47.36 | 3.120 | 0.760 | 164.00 | 44.74 | 2.631 | 0.718 |
| Buildings | 1386.72 | 631.27 | 10.806 | 4.919 | 786.92 | 329.22 | 6.132 | 2.565 | 747.37 | 308.31 | 5.824 | 2.402 |
| Tall buildings | 906.69 | 427.85 | 10.471 | 4.941 | 497.83 | 225.28 | 5.749 | 2.602 | 487.98 | 223.98 | 5.636 | 2.587 |
| Prim. sect. workers | 54.57 | 24.61 | 3.555 | 1.603 | 120.48 | 27.28 | 7.849 | 1.777 | 45.74 | 20.73 | 2.980 | 1.351 |
| Employed pop. | 3215.25 | 1118.42 | 10.593 | 3.685 | 738.62 | 256.98 | 2.433 | 0.847 | 738.00 | 261.35 | 2.431 | 0.861 |
| Crimes (M) | 3295.44 | 1184.76 | 7.755 | 2.788 | 1298.86 | 383.84 | 3.057 | 0.903 | 1248.36 | 346.14 | 2.938 | 0.815 |
| Hotel visitors (M) | 288576.60 | 98276.07 | 10.103 | 3.441 | 195708.60 | 59899.58 | 6.852 | 2.097 | 195681.50 | 60013.58 | 6.851 | 2.101 |
To get some idea of the errors that are involved in the proposed spatial disaggregation procedure, we experimented with the disaggregation of data originally reported at the level of large territorial divisions (i.e., the NUTS III divisions shown in Table 5, or at the level of municipalities) to the raster level, later aggregating the estimates to the level of civil parishes (i.e., taking the sum of the values from all raster cells associated with each civil parish) and comparing the aggregated estimates against the values that were originally available for the 4260 civil parishes from 308 municipalities.
Table 1 shows the obtained results, in the case of aggregated data collected at the NUTS III level, comparing the usage of the complete hybrid disaggregation method, when leveraging linear regression models, against the results obtained with (i) pycnophylactic interpolation, or with (ii) weighted areal disaggregation leveraging population data for the weights (i.e., raster \(T^d\) in the enumeration given in Sect. 3). All the evaluation metrics are computed over results at the level of civil parishes, except for the last two variables (i.e., number of crimes and number of hotel visitors) for which we had no access to information at a finer granularity than municipalities. The results for the NRMSE and NMAE metrics are reported with a multiplication factor of \(10^{2}\), in order to facilitate the interpretation of quantities associated with small areas. Values in bold correspond to the best results for each variable.
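The re-aggregation step used in this evaluation, i.e., summing the raster cells that fall within each civil parish, can be sketched as below. The data layout (one parish identifier per cell) and the identifiers themselves are hypothetical.

```python
def aggregate_by_zone(cell_values, cell_zones):
    """Sum the disaggregated cell values that fall within each zone."""
    sums = {}
    for value, zone in zip(cell_values, cell_zones):
        sums[zone] = sums.get(zone, 0.0) + value
    return sums


# Hypothetical raster estimates and the parish containing each cell.
cell_estimates = [5.0, 15.0, 30.0, 4.0, 4.0]
cell_parishes = ["P1", "P1", "P2", "P2", "P3"]
parish_estimates = aggregate_by_zone(cell_estimates, cell_parishes)
```

The resulting per-parish sums are the quantities that are then compared against the values originally available for each civil parish.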
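The four error metrics can be computed as in the following sketch. The normalization used for NRMSE and NMAE is not restated in this section, so the sketch assumes division by the range of the observed values, which is one common convention. The factor of 100 matches the multiplication factor mentioned above, and the input values are hypothetical.

```python
import math


def error_metrics(estimated, observed):
    """RMSE and MAE, plus range-normalized variants scaled by 100."""
    n = len(observed)
    errors = [e - o for e, o in zip(estimated, observed)]
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    mae = sum(abs(err) for err in errors) / n
    # Assumption: normalize by the range of the observed values.
    value_range = max(observed) - min(observed)
    return {
        "RMSE": rmse,
        "MAE": mae,
        "NRMSE": 100.0 * rmse / value_range,
        "NMAE": 100.0 * mae / value_range,
    }


# Hypothetical per-parish values.
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]
metrics = error_metrics(estimated, observed)
```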
The results from Table 1 show that the proposed hybrid method indeed outperforms the baselines corresponding to pycnophylactic interpolation or weighted areal interpolation, at a NUTS III level. However, in some error metrics and particularly for indicators that have a strong linear correlation with population counts (e.g., the indicator corresponding to the number of female residents), the simpler dasymetric procedure that only takes into account the population as ancillary data produces slightly better results.
Table 2 Disaggregation errors measured for different socioeconomic variables, using baseline methods and with the aggregated data collected originally at the level of municipalities
| Variable | Areal interpolation | | | | Pycnophylactic interpolation | | | | Weighted interpolation | | | |
| | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE |
| Female residents | 2220.60 | 876.94 | 6.411 | 2.532 | 2097.99 | 845.73 | 6.057 | 2.442 | 587.44 | 185.57 | 1.696 | 0.536 |
| Live births | 47.21 | 17.40 | 5.509 | 2.031 | 44.72 | 16.74 | 5.218 | 1.953 | 15.12 | 6.07 | 1.764 | 0.708 |
| Deaths | 35.98 | 15.57 | 6.601 | 2.857 | 35.61 | 15.33 | 6.535 | 2.813 | 15.76 | 6.83 | 2.893 | 1.253 |
| Foreign residents | 322.70 | 82.41 | 5.176 | 1.322 | 326.06 | 81.55 | 5.230 | 1.308 | 133.91 | 35.35 | 2.148 | 0.567 |
| Buildings | 786.00 | 412.58 | 6.125 | 3.215 | 793.52 | 417.58 | 6.183 | 3.254 | 538.06 | 241.46 | 4.193 | 1.882 |
| Tall buildings | 548.98 | 290.48 | 6.340 | 3.355 | 554.03 | 291.93 | 6.398 | 3.371 | 317.16 | 155.76 | 3.663 | 1.799 |
| Primary sector | 41.83 | 16.97 | 2.725 | 1.105 | 44.72 | 17.89 | 2.913 | 1.165 | 37.42 | 16.92 | 2.438 | 1.102 |
| Employees | 1871.29 | 724.62 | 6.165 | 2.387 | 1761.49 | 696.99 | 5.803 | 2.296 | 506.61 | 172.12 | 1.669 | 0.567 |
Table 3 Disaggregation errors measured for different socioeconomic variables, using different types of regression models and with the aggregated data collected at the level of municipalities
| Variable | Linear models | | | | Generalized additive models | | | | Cubist | | | |
| | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE |
| Female residents | 589.04 | 188.44 | 1.701 | 0.544 | 600.76 | 194.92 | 1.735 | 0.563 | 702.99 | 237.05 | 2.030 | 0.684 |
| Live births | 14.99 | 5.97 | 1.749 | 0.696 | 14.91 | 6.02 | 1.739 | 0.702 | 17.57 | 6.71 | 2.050 | 0.783 |
| Deaths | 16.55 | 7.40 | 3.036 | 1.358 | 17.18 | 7.40 | 3.152 | 1.357 | 17.72 | 7.38 | 3.251 | 1.355 |
| Foreign residents | 133.31 | 34.78 | 2.138 | 0.558 | 133.21 | 36.66 | 2.137 | 0.588 | 180.71 | 45.36 | 2.899 | 0.728 |
| Buildings | 511.02 | 223.21 | 3.982 | 1.739 | 497.11 | 219.90 | 3.874 | 1.714 | 329.51 | 170.15 | 2.568 | 1.326 |
| Tall buildings | 311.83 | 154.74 | 3.601 | 1.787 | 304.25 | 150.93 | 3.514 | 1.743 | 263.99 | 138.36 | 3.049 | 1.598 |
| Prim. sect. workers | 37.37 | 15.33 | 2.435 | 0.998 | 37.98 | 16.48 | 2.474 | 1.074 | 35.33 | 15.99 | 2.302 | 1.042 |
| Employed pop. | 503.83 | 169.80 | 1.660 | 0.559 | 546.68 | 201.98 | 1.801 | 0.665 | 622.32 | 230.87 | 2.050 | 0.761 |
From Tables 2 and 3 we can also see that, at a municipality level, the hybrid method continues to outperform the baseline disaggregation methods in almost all indicators. When the indicator to disaggregate is strongly correlated with population counts (e.g., for variables such as female residents, live births, or employed population), the methods that produced lower disaggregation errors used regression analysis based on standard linear regression or generalized additive models. The strong linear dependence between the indicators that are to be disaggregated and some of the ancillary variables can explain why a simple linear regression can model the dependence better than more sophisticated methods. On the other hand, for the case of indicators depending less on population (e.g., number of buildings, or number of buildings with more than a single floor), the regression model based on ensembles of trees obtained slightly better results. In all cases, the simple method based on weighted areal disaggregation, leveraging population data for the weights, indeed corresponded to a very strong baseline.
Table 4 Disaggregation errors measured for different socioeconomic variables, using linear regression together with different thresholds for the regular sampling procedure
| Variable | Regular sampling 25% | | | | Regular sampling 50% | | | | Regular sampling 75% | | | |
| | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE | RMSE | MAE | NRMSE | NMAE |
| Female residents | 588.51 | 192.17 | 1.699 | 0.555 | 588.64 | 189.33 | 1.700 | 0.547 | 588.95 | 188.58 | 1.700 | 0.544 |
| Live births | 15.16 | 6.03 | 1.769 | 0.704 | 15.03 | 5.98 | 1.754 | 0.698 | 15.09 | 6.01 | 1.761 | 0.701 |
| Deaths | 16.88 | 7.53 | 3.098 | 1.381 | 17.30 | 7.75 | 3.174 | 1.421 | 16.80 | 7.51 | 3.083 | 1.379 |
| Foreign residents | 135.58 | 36.10 | 2.175 | 0.579 | 134.12 | 35.28 | 2.151 | 0.566 | 137.31 | 36.94 | 2.203 | 0.593 |
| Buildings | 496.05 | 216.35 | 3.865 | 1.686 | 498.88 | 217.71 | 3.887 | 1.696 | 503.94 | 219.78 | 3.927 | 1.713 |
| Tall buildings | 299.67 | 151.12 | 3.461 | 1.745 | 309.74 | 154.05 | 3.577 | 1.779 | 305.46 | 152.69 | 3.528 | 1.763 |
| Prim. sect. workers | 36.99 | 15.17 | 2.410 | 0.988 | 37.33 | 15.32 | 2.432 | 0.998 | 37.37 | 15.33 | 2.434 | 0.999 |
| Employed pop. | 502.13 | 169.92 | 1.654 | 0.560 | 502.12 | 169.40 | 1.654 | 0.558 | 503.61 | 169.65 | 1.659 | 0.559 |
The results reported in Tables 1 and 3 leveraged the full set of data points when training the regression models. Table 4 instead presents results when considering a regular sampling strategy, in which only 25, 50, or 75% of the available data points are used for model training. These experiments relied on a linear regression model to disaggregate data originally reported at the level of municipalities, and the regular sampling procedure ensures that the entire geographic region is evenly represented through the systematically aligned collection of the data points. The values in bold correspond to cases where using the sampling strategy outperformed the linear model trained on the full set of instances. When analyzing the results presented in Table 4, comparing them against the results presented in the first column of Table 3, one can see the benefits of using the sampling procedure. The training of the regression models becomes computationally less demanding and, for many of the considered indicators, applying regular sampling also results in lower disaggregation errors. For instance, in variables such as the number of buildings or the number of primary sector workers, a drastic reduction of the number of samples (i.e., setting the sampling threshold to 25% of the data points) produces the best results. Reducing the number of samples from nearby regions is a possible strategy to deal with spatial autocorrelation that appears to result in lower disaggregation errors. Some of our variables do indeed exhibit a high degree of spatial autocorrelation (e.g., in Fig. 8, we can see that for many of the variables the close regions have similar values, and using all the data points is perhaps artificially reducing variance in the training data, and inflating the effect size of the covariates), which can explain why sampling was also beneficial in terms of result quality (i.e., regular sampling provides a good variance in the observations, with small sample sizes).
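One simple way to obtain an evenly spaced subset of the training cells is sketched below. It assumes that the cells are listed in a fixed order (e.g., row by row) and that the threshold is given as an integer percentage. This is only an illustration of regular sampling, and not necessarily the systematically aligned scheme used in the experiments.

```python
def regular_sample_indices(n_cells, percent):
    """Pick evenly spaced cell indices covering about `percent` of the cells."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Index i is kept whenever the running quota crosses an integer boundary.
    # Integer arithmetic avoids floating point rounding issues.
    return [
        i
        for i in range(n_cells)
        if ((i + 1) * percent) // 100 > (i * percent) // 100
    ]


quarter = regular_sample_indices(8, 25)
half = regular_sample_indices(8, 50)
three_quarters = regular_sample_indices(8, 75)
```

The regression model would then be trained only on the cells at the selected positions, while predictions are still produced for every cell.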
5 Conclusions and future work

Standard spatial disaggregation approaches can be effectively used in the disaggregation of socioeconomic indicators. Many of these variables have a strong correlation with population density, and thus a baseline disaggregation method, leveraging population data to perform proportional and weighted areal interpolation, achieved very good results. Still, in most cases, the use of additional ancillary variables within a disaggregation methodology leveraging regression could further improve the results. This was especially true in the case of variables less correlated with population density;

The hybrid disaggregation method that was proposed in the article could outperform baseline methods in most of the considered variables, even when using standard linear regression. The use of more sophisticated regression methods was nonetheless useful in cases where the target variable had a lower linear correlation against the ancillary variables based on population density;

The regular sampling of data points, prior to the training of regression models, was also beneficial in our experiments. Sampling could reduce computational efforts, while at the same time also leading to better result quality in most of the cases.
The proposed approach could also be enriched with estimates for the variance associated with the disaggregation results, resulting in the production of fine-resolution estimates together with associated measures of uncertainty [46, 63]. A bootstrapping approach, based on running the disseveration procedure multiple times with random samples from initial estimates (i.e., random samples taken from the raster produced in Step 2 of the methodology outlined in Sect. 3), could for instance be used to estimate a raster with the uncertainty associated with the downscaled values.
The proposed approach already combines the ideas of dasymetric mapping and pycnophylactic interpolation, but recent studies in the area have also proposed other types of downscaling methods, for instance based on fractal analysis and interpolation [37, 56, 62, 66]. In future tests, we can perhaps consider including the results of different downscaling methods as ancillary rasters within the methodology based on disseveration.
Another idea for future work concerns the usage of other types of ancillary data, like information on terrain elevation inferred from satellite imagery, or population estimates inferred from mobile phone data [12, 14]. Taking inspiration from very recent studies [43, 44, 49], we would also like to experiment with the incorporation of ancillary data extracted from popular location-based services like Flickr^{13} or Twitter^{14}, for instance by creating density surfaces from georeferenced items published on these services with particular keywords, and then using these density surfaces in more or less the same way as we are now using the population data or the data from OpenStreetMap. Georeferenced social media data are already increasingly being used as a source of volunteered geographic information, for instance in applications like delimiting vague regions [11], modeling human mobility [32], or within land use and land cover analysis [4]. It has been shown, for instance, that the number of georeferenced photos published on Flickr can correlate well with indicators like tourist visitors [21, 35, 36], and the number of georeferenced Twitter messages mentioning diseases like flu can also correlate well with the number of patients with influenza [50]. It is therefore our belief that data from these services can indeed provide very useful information for supporting spatial disaggregation procedures.
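The bootstrapping idea can be sketched as follows, assuming a hypothetical callable `fit_and_predict` that trains the downscaling model on a resampled set of cells and returns predictions for all cells. The toy ratio model stands in for the disseveration procedure, and all names and values are illustrative.

```python
import random
import statistics


def bootstrap_cell_uncertainty(n_cells, fit_and_predict, runs=100, seed=0):
    """Per-cell mean and standard deviation over bootstrap replicates."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(runs):
        # Resample the training cells with replacement.
        sample = [rng.randrange(n_cells) for _ in range(n_cells)]
        predictions.append(fit_and_predict(sample))
    per_cell = list(zip(*predictions))
    means = [statistics.mean(values) for values in per_cell]
    spreads = [statistics.pstdev(values) for values in per_cell]
    return means, spreads


# Toy stand-in for the real procedure: each cell is predicted as its
# covariate times the ratio of sampled initial values to sampled covariates.
covariate = [1.0, 2.0, 3.0, 4.0]
initial = [1.1, 1.9, 3.2, 3.8]


def toy_fit_and_predict(sample):
    ratio = sum(initial[i] for i in sample) / sum(covariate[i] for i in sample)
    return [ratio * c for c in covariate]


means, spreads = bootstrap_cell_uncertainty(len(covariate), toy_fit_and_predict)
```

The list `spreads` plays the role of the uncertainty raster mentioned above, with one value per cell.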
Acknowledgements
This research was partially supported through Fundação para a Ciência e Tecnologia (FCT), through project grants with references PTDC/EEI-SCR/1743/2014 (Saturn) and EXPL/EEI-ESS/0427/2013 (KD-LBSN), as well as through the INESC-ID multi-annual funding from the PIDDAC programme (UID/CEC/50021/2013).
References
 1.Andersen, R.: Modern Methods for Robust Regression. No. 152 in Quantitative Applications in the Social Sciences. Sage Publications, Thousand Oaks (2008)CrossRefGoogle Scholar
 2.Antoni, J.P., Vuidel, G., Aupet, J.B., Aube, J.: Generating a located synthetic population: a prerequisite to agentbased urban modelling. In: Proceedings of the European Colloquium of Quantitative and Theoretical Geography (2011)Google Scholar
 3.Antoni, J.P., Vuidel, G., Klein, O.: Generating a located synthetic population of individuals, households, and dwellings. Working Paper Series, Luxembourg Institute of SocioEconomic Research (2017)Google Scholar
 4.Antoniou, V., Fonte, C.C., See, L., Estima, J., Arsanjani, J.J., Lupia, F., Minghini, M., Foody, G., Fritz, S.: Investigating the feasibility of geotagged photographs as sources of land cover input data. ISPRS Int. J. GeoInf. 5(5), 64 (2016)CrossRefGoogle Scholar
 5.Bivand, R.S., Pebesma, E., GmezRubio, V.: Applied Spatial Data Analysis with R. Springer, Berlin (2012)MATHGoogle Scholar
 6.Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)CrossRefMATHGoogle Scholar
 7.Briggs, D.J., Gulliver, J., Fecht, D., Vienneau, D.M.: Dasymetric modelling of smallarea population distribution using land cover and light emissions data. Remote Sens. Environ. 108(4), 451–466 (2007)CrossRefGoogle Scholar
 8.Chai, T., Draxler, R.R.: Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7(3), 1247–1250 (2014)CrossRefGoogle Scholar
 9.Chambers, R., Tzavidis, N.: Mquantile models for small area estimation. Biometrika 93(2), 255–268 (2006)MathSciNetCrossRefMATHGoogle Scholar
 10.Chandra, H., Salvati, N., Chambers, R., Tzavidis, N.: Small area estimation under spatial nonstationarity. Comput. Stat. Data Anal. 56(10), 2875–2888 (2012)MathSciNetCrossRefMATHGoogle Scholar
 11.Cunha, E., Martins, B.: Using oneclass classifiers and multiple kernel learning for defining imprecise geographic regions. Int. J. Geogr. Inf. Sci. 28(11), 2220–2241 (2014)CrossRefGoogle Scholar
 12.Deville, P., Linard, C., Martin, S., Gilbert, M., Stevens, F.R., Gaughan, A.E., Blondel, V.D., Tatem, A.J.: Dynamic population mapping using mobile phone data. Proc. Natl. Acad. Sci. 111(45), 15888–15893 (2014)CrossRefGoogle Scholar
 13.Doll, C.N.H., Muller, J.P., Elvidge, C.: Nighttime imagery as a tool for global mapping of socioeconomic parameters and greenhouse gas emissions. Ambio 29(3), 157–162 (2000)CrossRefGoogle Scholar
 14.Douglass, R., Meyer, D., Ram, M., Rideout, D., Song, D.: High resolution population estimates from telecommunications data. Euro. Phys. J. Data Sci. 4(1), 4 (2015)Google Scholar
 15.Eicher, C.L., Brewer, C.A.: Dasymetric mapping and areal interpolation: Implementation and evaluation. Cartogr. Geogr. Inf. Sci. 28(2), 125–138 (2001)CrossRefGoogle Scholar
 16.Elvidge, C., Erwin, E., Baugh, K., Ziskin, D., Tuttle, B., Ghosh, T., Sutton, P.: Overview of dmsp nightime lights and future possibilities. In: Proceedings of the Joint Urban Remote Sensing Event, pp. 1–5 (2009)Google Scholar
 17.Elvidge, C.D., Baugh, K.E., Kihn, E.A., Kroehl, H.W., Davis, E.R., Davis, C.: Relation between satellite observed visible to near infrared emissions, population, and energy consumption. Int. J. Remote Sens. 18(6), 1373–1379 (1997)CrossRefGoogle Scholar
 18.Fisher, P.F., Langford, M.: Modelling the errors in areal interpolation between zonal systems by monte carlo simulation. Environ. Plann. A 27(2), 211–224 (1995)CrossRefGoogle Scholar
 19.Fotheringham, A.S., Brunsdon, C., Charlton, M.E.: Geographically Weighted Regression : The Analysis of Spatially Varying Relationships. Wiley, Hoboken (2002)MATHGoogle Scholar
 20.Gallego, F.J.: A population density grid of the European Union. Popul. Environ. 31(6), 460–473 (2010)CrossRefGoogle Scholar
 21.GarcíaPalomares, J.C., Gutiérrez, J., Mínguez, C.: Identification of tourist hot spots based on social networks: a comparative analysis of european metropolises using photosharing services and GIS. Appl. Geogr. 63(1), 408–417 (2015)CrossRefGoogle Scholar
 22.Giri, C.P.: Remote Sensing of Land Use and Land Cover: Principles and Applications. CRC Press, Boca Raton (2012)CrossRefGoogle Scholar
 23. Giusti, C., Tzavidis, N., Pratesi, M., Salvati, N.: Resistance to outliers of M-quantile and robust random effects small area models. Commun. Stat. Simul. Comput. 43(3), 549–568 (2014)
 24. Goerlich, F.J., Cantarino, I.: A population density grid for Spain. Int. J. Geogr. Inf. Sci. 27(12), 2247–2263 (2013)
 25. Goodchild, M.F., Anselin, L., Deichmann, U.: A framework for the areal interpolation of socioeconomic data. Environ. Plan. A 25(3), 383–397 (1993)
 26. Goodchild, M.F., Lam, N.S.N.: Areal interpolation: a variant of the traditional spatial problem. Department of Geography, University of Western Ontario, London, Canada (1980)
 27. Gregory, I.N.: The accuracy of areal interpolation techniques: standardising 19th and 20th century census data to allow long-term comparisons. Comput. Environ. Urban Syst. 26(4), 293–314 (2002)
 28. Gupta, M.R., Chen, Y.: Theory and use of the EM algorithm. Found. Trends Signal Process. 4(3), 223–296 (2010)
 29. Harris, P., Brunsdon, C., Fotheringham, A.S.: Links, comparisons and extensions of the geographically weighted regression model when used as a spatial predictor. Stoch. Environ. Res. Risk Assess. 25(2), 123–138 (2011)
 30. Harris, P., Fotheringham, A., Crespo, R., Charlton, M.: The use of geographically weighted regression for spatial prediction: an evaluation of models using simulated data sets. Math. Geosci. 42(6), 657–680 (2010)
 31. Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, Boca Raton (1990)
 32. Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., Ratti, C.: Geolocated Twitter as proxy for global mobility patterns. Cartogr. Geogr. Inf. Sci. 41(3), 260–271 (2014)
 33. Hawley, K., Moellering, H.: A comparative analysis of areal interpolation methods. Cartogr. Geogr. Inf. Sci. 32(4), 411–423 (2005)
 34. Heymann, Y., S.C.C.G., Bossard, M.: CORINE land cover technical guide. Technical Report EUR12585, Office for Official Publications of the European Communities (1994)
 35. Kádár, B.: Measuring tourist activities in cities using geotagged photography. Tour. Geogr. 16(1), 88–104 (2014)
 36. Kádár, B., Gede, M.: Where do tourists go? Visualizing and analysing the spatial distribution of geotagged photography. Cartogr. Int. J. Geogr. Inf. Geovis. 48(2), 78–88 (2013)
 37. Kim, G., Barros, A.P.: Downscaling of remotely sensed soil moisture with a modified fractal interpolation method using contraction mapping and ancillary data. Remote Sens. Environ. 83(3), 400–413 (2002)
 38. Kim, H., Yao, X.: Pycnophylactic interpolation revisited: integration with the dasymetric-mapping method. Int. J. Remote Sens. 31(21), 5657–5671 (2010)
 39. Kuhn, M.: Building predictive models in R using the caret package. J. Stat. Softw. 28(5), 1–26 (2008)
 40. Langford, M.: Rapid facilitation of dasymetric-based population interpolation by means of raster pixel maps. Comput. Environ. Urban Syst. 31(1), 19–32 (2007)
 41. Li, D., Zhao, X., Li, X.: Remote sensing of human beings: a perspective from nighttime light. Geospat. Inf. Sci. 19(1), 69–79 (2016)
 42. Lin, J., Cromley, R., Zhang, C.: Using geographically weighted regression to solve the areal interpolation problem. Ann. GIS 17(1), 1–14 (2011)
 43. Lin, J., Cromley, R.G.: Evaluating geolocated Twitter data as a control layer for areal interpolation of population. Appl. Geogr. 58(1), 41–47 (2015)
 44. Longley, P.A., Adnan, M., Lansley, G.: The geotemporal demographics of Twitter usage. Environ. Plan. A 47(2), 465–484 (2015)
 45. Malone, B.P., McBratney, A.B., Minasny, B., Wheeler, I.: A general method for downscaling Earth resource information. Comput. Geosci. 41(1), 119–125 (2012)
 46. Nagle, N.N., Buttenfield, B.P., Leyk, S., Spielman, S.: Dasymetric modeling and uncertainty. Ann. Assoc. Am. Geogr. 104(1), 80–95 (2014)
 47. Nordhaus, W.D.: Alternative Approaches to Spatial Rescaling. Technical Report. Yale University, New Haven (2003)
 48. Nordhaus, W.D.: Geography and macroeconomics: new data and new findings. Proc. Natl. Acad. Sci. 103(10), 3510–3517 (2006)
 49. Patel, N.N., Stevens, F.R., Huang, Z., Gaughan, A.E., Elyazar, I., Tatem, A.J.: Improving large area population mapping using geotweet densities. Trans. GIS 21(2), 317–331 (2016)
 50. Paul, M.J., Dredze, M., Broniatowski, D.: Twitter improves influenza forecasting. PLoS Curr. 6(1), 18 (2014)
 51. Quinlan, J.R.: Learning with continuous classes. In: Proceedings of the Australian Joint Conference on Artificial Intelligence, pp. 343–348 (1992)
 52. Reibel, M., Bufalino, M.E.: Street-weighted interpolation techniques for demographic count estimation in incompatible zone systems. Environ. Plan. A 37(1), 127–139 (2005)
 53. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, Hoboken (2005)
 54. Salvati, N., Tzavidis, N., Pratesi, M., Chambers, R.: Small area estimation via M-quantile geographically weighted regression. Test 21(1), 1–28 (2012)
 55. Schmid, T., Münnich, R.T.: Spatial robust small area estimation. Stat. Pap. 55(3), 653–670 (2014)
 56. Sémécurbe, F., Tannier, C., Roux, S.G.: Spatial distribution of human population in France: Exploring the modifiable areal unit problem using multifractal analysis. Geogr. Anal. 48(3), 292–313 (2016)
 57. Batista e Silva, F., Gallego, J., Lavalle, C.: A high-resolution population grid map for Europe. J. Maps 9(1), 16–28 (2013)
 58. Stevens, F.R., Gaughan, A.E., Linard, C., Tatem, A.J.: Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PLoS ONE 10(2), 1–22 (2015)
 59. Tobler, W.: A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 46(2), 234–240 (1970)
 60. Tobler, W.: Smooth pycnophylactic interpolation for geographical regions. J. Am. Stat. Assoc. 74(367), 519–530 (1979)
 61. Tobler, W., Deichmann, U., Gottsegen, J., Maloy, K.: The Global Demography Project. Technical Report 95-6, National Center for Geographic Information and Analysis, Santa Barbara (1995)
 62. Vega, K.V.A.: Aplicación de la Interpolación Fractal en Downscaling de Imágenes Satelitales NOAA-AVHRR de Temperatura de Superficie en Terrenos de Topografía Compleja. Ph.D. thesis, Universidad de Chile (2012)
 63. Whitworth, A., Carter, E., Ballas, D., Moon, G.: Estimating uncertainty in spatial microsimulation approaches to small area estimation: a new approach to solving an old problem. Comput. Environ. Urban Syst. 63, 50–57 (2016)
 64. Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 30(1), 79–82 (2005)
 65. Wu, S.S., Qiu, X., Wang, L.: Population estimation methods in GIS and remote sensing: a review. GISci. Remote Sens. 42(1), 80–96 (2005)
 66. Xu, G., Xu, X., Liu, M., Sun, A.Y., Wang, K.: Spatial downscaling of TRMM precipitation product using a combined multifractal and regression approach: demonstration for South China. Water 7(6), 3083–3102 (2015)