Study area
The population of Sweden was 10,319,601 at the end of October 2019 with an annual population growth rate of approximately 1% (Statistiska centralbyrån (SCB), 2019a). Thus, Sweden can be characterized as having a slow-growing population. Similar rates of growth exist in countries such as Ireland, Vietnam, Argentina, Mexico, and Indonesia (based on 2018 annual growth rates) (World Bank 2018). The population density of Sweden was approximately 25 people/km2 in 2018 (SCB, 2019b). However, this varies widely throughout the country. The population density in the most populated municipalities is approximately 5000 people/km2 while the population density in the most sparsely populated municipalities is lower than 1 person/km2 (SCB, 2019b). Most of the population (87%) lives on approximately 630,000 ha of land, which equates to approximately 1.5% of Sweden’s landmass. Approximately 32% of the Swedish population lives in the nine cities with a population over 100,000. Since 2015, 40% of Sweden’s population increase has occurred in these cities and a further 37% has occurred in cities with populations between 50,000 and 99,999 (SCB, 2019c).
Data
Here, we describe the five gridded population datasets (hereafter referred to as “candidate population grids”) which were evaluated in this study as well as the Swedish reference dataset (hereafter referred to as the “known population grid”) which was used for comparison. We first provide a general background on existing methods for generating gridded population datasets and then provide details on each individual candidate population grid dataset.
Areal interpolation is a technique commonly used to disaggregate census population data gathered from various sources to the grid level (Wu et al. 2005). Areal interpolation refers to transforming statistical information from one set of geographical boundaries to another separate set of boundaries and can be accomplished using a number of different methods (Deichmann 1996; Fisher and Langford 1995, 1996; Mennis 2003). Most relevant to the global gridded population datasets used in this study are the areal interpolation methods of areal weighting and dasymetric mapping. Areal weighting assumes that population is equally distributed across administrative areas and distributes census population counts by assigning a population value to each grid cell based on the percentage of the administrative unit covered by each cell (Bhaduri et al. 2007; Lloyd et al. 2017; Mennis 2003). Of the five global gridded population datasets, GPW is the only one which uses areal weighting, while the others use varying interpretations of dasymetric mapping. Dasymetric mapping makes assumptions about the relationship between population density and various geographic and land characteristics and introduces ancillary data to help dictate where population should be assigned and how much should be assigned in each location (Bhaduri et al. 2007; Linard et al. 2011; Mennis 2003, 2009; Mennis and Hultgren 2006; Zandbergen and Ignizio 2010). GRUMP and GHS-POP use a binary reallocation that distinguishes rural and urban areas and built-up areas respectively, while LandScan and WorldPop use a number of different ancillary datasets to model population distribution. In the following subsections, we provide a brief summary of the characteristics of each of the datasets and the methods used to produce them. Table 1 provides an overview. A similar discussion can be found in Leyk et al. (2019). For full details, please consult the documentation related to each dataset.
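The difference between the two interpolation families can be illustrated with a minimal sketch. It is not the implementation used by any of the datasets: the function names, the cell identifiers, and the fallback to areal weighting when no ancillary weight is present are all assumptions made for illustration.

```python
def areal_weighting(unit_population, cell_area_in_unit):
    """Spread one administrative unit's population over grid cells.

    cell_area_in_unit maps a (hypothetical) cell id to the area of that cell
    lying inside the unit; population is assumed to be evenly distributed.
    """
    total_area = sum(cell_area_in_unit.values())
    if total_area <= 0:
        raise ValueError("administrative unit has no area on the grid")
    return {cell: unit_population * area / total_area
            for cell, area in cell_area_in_unit.items()}


def dasymetric(unit_population, cell_area_in_unit, cell_weight):
    """Same allocation, but each cell's area is scaled by an ancillary weight."""
    weighted = {cell: area * cell_weight.get(cell, 0.0)
                for cell, area in cell_area_in_unit.items()}
    if sum(weighted.values()) <= 0:
        # Assumption: with no ancillary signal, fall back to areal weighting.
        return areal_weighting(unit_population, cell_area_in_unit)
    return areal_weighting(unit_population, weighted)


areas = {"a": 1.0, "b": 1.0, "c": 0.5}   # km2 of each cell inside the unit
urban = {"a": 1.0, "b": 0.0, "c": 0.0}   # binary mask: only cell "a" is urban
even = areal_weighting(1000, areas)      # population follows area only
binary = dasymetric(1000, areas, urban)  # population restricted to cell "a"
```

With a binary weight, the whole unit population ends up in the urban cell, whereas areal weighting spreads it in proportion to area. Both allocations preserve the unit total.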
Table 1 Comparison of global gridded population datasets
GPWv4
GPWv4 models population distribution on a continuous and global surface in order to provide a “spatially disaggregated population layer that is compatible with data sets from social, economic, and Earth sciences disciplines, and remote sensing” for use in “research, policy-making and communications” (Center for International Earth Science Information Network (CIESIN) 2018b, p. 3). The first release of GPW was in 1995; however, here, the current release, version 4.11, will be discussed. Over time, the methodology and inputs to GPW have been improved and refined.
The GPWv4 population distribution surfaces are available both in population counts (persons per pixel) and population density (persons per km2) for 2000, 2005, 2010, 2015, and 2020. The spatial resolution is 30 arc seconds (which corresponds to approximately 1 km at the equator); however, aggregated datasets are also available at 2.5 arc minutes, 15 arc minutes, 30 arc minutes, and 1 degree for faster processing. The data are published in the WGS1984 geographic coordinate system and are available in ASCII, GeoTiff, and NetCDF formats. Each year of data includes one dataset which is consistent with national population censuses and registers and another which has been adjusted to match United Nations (UN) estimates of population taken from the UN World Population Prospects, 2015 Revision.
The GPWv4 population surfaces are generated using areal weighting. The input population data are gathered through an intensive process of data collection from national statistics offices and other organizations. In GPWv4, population data correspond to the highest spatial resolution available from censuses undertaken between 2005 and 2014 (CIESIN, 2018b). Specifically for Sweden, the population inputs are from the 2010 population register at administrative level 3 (corresponding to församling). This is the highest spatial disaggregation of population data freely available to the public. Input population data are matched to administrative boundaries collected from national agencies and other organizations. The priority is to use boundaries from the census used by each country; however, when these are not available, other sources are used. The sources of both population and boundary data are available on the GPW website. The yearly population estimates are then disaggregated to the 30 arc second grid using areal weighting. A water mask is applied before disaggregation in order to ensure that population is not assigned to waterbodies and ice-covered areas (CIESIN, 2018b; Doxsey-Whitfield et al. 2015).
A main benefit of the fact that GPW does not use any ancillary data in its modeling approach is that the outputs can be combined with other data without worrying that they might be endogenous to the modeled population (CIESIN, 2018b; Doxsey-Whitfield et al. 2015). However, one of the limitations of this approach is the assumption that population is evenly distributed within administrative boundaries (Deichmann et al. 2001). As a result, precision varies over space, because smaller input units produce more accurate results than larger input units (Deichmann et al. 2001; Doxsey-Whitfield et al. 2015). Additionally, the confidence in the estimates varies across countries as a function of the currency, spatial resolution, and accuracy of the input data (CIESIN, 2018b; Doxsey-Whitfield et al. 2015) and can be affected by the temporal interpolation, especially in regions where population changes dramatically over a short period (Deichmann et al. 2001).
GRUMP
The GRUMP database is an attempt to better model differences in population distribution across the urban-rural divide. It consists of several data products; however, here, only the population grid will be discussed. The GRUMP population grid builds on GPW data (version 3) to generate an improved population grid that reallocates population more accurately between urban and rural areas (Balk et al. 2005). The GRUMP population grid is available for 1990, 1995, and 2000 and is based on the 2000 round of Population and Housing Censuses. Like GPW, GRUMP population is provided in two sets, one based on unadjusted census estimates and one based on census estimates adjusted to UN population estimates, and includes both population counts and population densities. GRUMP population data are produced in the WGS1984 geographic coordinate system and have a spatial resolution of 30 arc seconds. They are available in BIL, ASCII, and GRID (ESRI) formats.
The GRUMP population datasets use a binary dasymetric methodology to reallocate population to urban and rural areas. They are based on two vector inputs: administrative areas with population counts built on data from GPWv3, and urban extents along with urban population counts based on national statistics and nighttime lights. The inputs are overlaid in order to generate a vector dataset that delineates urban and rural areas in each administrative unit. This dataset becomes the input to GRUMPe, an algorithm that, working on a country basis, reallocates the total population of each administrative unit to rural and urban areas based on the area and population of the urban area and total administrative area, the area of the region where the administrative and urban areas overlap, and UN national estimates for the percentage of the population in urban and rural areas. The output of the GRUMPe reallocation algorithm is a vector dataset which is converted to a raster grid to produce the GRUMP dataset (Balk et al. 2005; Balk et al. 2010).
Some issues with the GRUMP population dataset are related to the estimation of urban extents using nighttime lights, which have known issues with “blooming” and an inability to detect areas with low or no electrification (Balk et al. 2005; Elvidge et al. 2004). However, GRUMP uses a combination of nighttime lights and buffered settlement extents based on population size in dimly lit areas, which helps reduce this limitation. Additionally, issues relevant to GPW, including reliability of input population estimates, are also relevant for GRUMP. The accuracy of GRUMPe has also been observed to vary spatially depending on the quality of input data. Where the input administrative data and urban extents data are good, GRUMPe does not add much to the areal weighting of GPW. Where the administrative population data are moderate or poor but the urban extent data are good, GRUMPe performs very well. Finally, where both the administrative and urban extents data are poor, GRUMPe does not perform well, since the outputs can only be as good as the inputs (Balk et al. 2005). Since the release of GPWv4, users are recommended to use that dataset instead, since it supersedes GPWv3 and thus also GRUMP.
GHS-POP
The Global Human Settlement Layer is a group of datasets which describes the human presence on the planet based on Earth observation satellite sensors, national statistical surveys, and crowd sources and is processed by “exploiting novel spatial data analytics tools allowing to handle their complexity, heterogeneity and large volume” (Pesaresi et al. 2016, p. 10). Again, only the population grid dataset will be discussed here, although reference to other layers will be made in that they contribute to its creation. The population grid (GHS-POP) is available for 1975, 1990, 2000, and 2015. It is provided in a World Mollweide projection (an equal area projection) with a 250-m and 1-km spatial resolution. The grid cell values represent both the count and density of the population in the cell (density since each cell has a consistent size). The datasets are distributed in GeoTiff format.
GHS-POP also uses a binary dasymetric methodology to disaggregate population. Its population data input comes from the GPWv4 UN-WPP-adjusted dataset. These administrative area level population data are disaggregated to built-up areas based on the GHS built-up area (GHS-BU) layer, a grid dataset which shows the percentage of each cell that is covered by built-up areas. GHS-BU is generated using automated analysis of Landsat satellite imagery to identify the location and density of built-up areas (Freire et al. 2016; Pesaresi et al. 2016). The population data are reallocated in one of three ways, although most of the population (95%) is allocated in the first way (Freire et al. 2016). First, if the administrative area is large enough to generate 250 m grids and contains built-up areas, then the population for that administrative area is assigned in proportion to the density of the built-up areas. Second, if the administrative area is large enough to generate 250 m grids but does not contain any built-up areas, then the population is allocated using an areal weighting approach. If a cell is located on the border of an administrative area, it is assigned to the administrative area in which its centroid falls. Third, if the administrative area is smaller than a 250-m grid cell, then a centroid is generated for the area and the populations of all centroids found within a cell are summed (European Commission Joint Research Center (JRC) 2017; Freire et al. 2016). Because of the use of centroids to identify the location of cells and administrative units, shifts in the location of population to neighboring cells may occur (Freire et al. 2016).
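The three allocation cases described above can be sketched as follows. This is a simplified illustration, not the GHS-POP production code: the function names and cell identifiers are hypothetical, all cells are assumed to be of equal size (so areal weighting reduces to an equal split), and border cells are assumed to have been assigned to a unit by centroid beforehand.

```python
def allocate_unit(unit_population, built_up_share):
    """Disaggregate one administrative unit over its grid cells.

    built_up_share maps a cell id to the fraction of the cell that is
    built up (0.0 to 1.0).
    """
    if not built_up_share:
        raise ValueError("unit has no cells")
    total_built_up = sum(built_up_share.values())
    if total_built_up > 0:
        # Case 1: population follows built-up density.
        return {cell: unit_population * share / total_built_up
                for cell, share in built_up_share.items()}
    # Case 2: no built-up area detected, so spread the population evenly.
    equal = unit_population / len(built_up_share)
    return {cell: equal for cell in built_up_share}


def add_small_units(grid, small_units):
    """Case 3: a unit smaller than a cell adds its whole population to the
    cell containing its centroid. small_units is a list of (cell, population)."""
    out = dict(grid)
    for cell, population in small_units:
        out[cell] = out.get(cell, 0.0) + population
    return out


built = allocate_unit(900, {"a": 0.5, "b": 0.25, "c": 0.0})
empty = allocate_unit(900, {"d": 0.0, "e": 0.0, "f": 0.0})
combined = add_small_units(built, [("c", 40), ("c", 10), ("g", 5)])
```

In this toy example, cell "a" receives twice the population of cell "b" because it is twice as built up, while the unit without built-up cells is split evenly.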
The benefit of GHS-POP is that it restricts population to built-up areas and makes its density directly proportional to the density of built-up areas (Freire et al. 2015). However, because the reallocation of population in GHS-POP is based on the density of built-up areas, population may be allocated to “non-residential” areas such as commercial, industrial, and recreational areas. For this reason, the output of GHS-POP is said to depict the “resident-based population,” which is not the population at its place of residence, but rather residential population allocated to built-up areas (Freire et al. 2016). In order to generate population counts at the place of residence, land use information would need to be considered (Freire et al. 2016).
LandScan
LandScan gridded population data was first generated in 1998 and, for over 15 years, the methodology and data sources have been improved. LandScan population grids are freely available for researchers and students after registration and are available online yearly from 2000 to 2018 (as explained below, LandScan data are not, however, a consistent time series). The population grids are produced in WGS1984 geographic coordinate system and have a resolution of 30 arc seconds. They are provided in GRID and binary raster formats (both ESRI formats). The cell values for LandScan data represent integer population counts of ambient population. Ambient population, in contrast to resident population, is the average population over a 24-h period for typical days, weeks, and seasons (Dobson et al. 2000). Therefore, it represents not only where people live but also where they work and travel (Dobson et al. 2000).
LandScan uses a highly modeled approach to disaggregate subnational census data to the grid level through a series of dynamically adaptable algorithms using a combination of spatial data, satellite imagery, and derived products and manual corrections. The algorithms used by LandScan are proprietary and therefore not available publicly. Each grid cell in LandScan is assigned a “likelihood” coefficient for the presence of population. The subnational census population is then assigned proportionally to each cell based on this likelihood coefficient. Likelihood coefficients are based on the relationship between population and ancillary data such as land cover, roads, slope, urban areas, village locations, and high-resolution imagery. Relationships vary locally and are drawn from socio-economic and cultural understanding of the area (Oak Ridge National Laboratory (ORNL) 2019; Rose and Bright 2014).
The LandScan algorithm has been refined and improved over time. Additional and higher accuracy data have also been incorporated every year. For this reason, time series comparisons using LandScan data are discouraged, since changes cannot be solely attributed to changes in population but may instead reflect changes in the input data or algorithms. An additional weakness of LandScan is that the population and ancillary data inputs are not documented specifically, making it more difficult to evaluate fitness for use in various situations. Direct comparisons with other population datasets are also cautioned because ambient population is of course not directly comparable with resident population (Rose and Bright 2014). For example, the ambient population in city centers is expected to be higher than the resident population because of the presence of a high level of employment uses (Dobson et al. 2000).
WorldPop
The WorldPop project was initiated in 2013 to unite a number of different initiatives focused on producing population distribution and composition maps. Apart from population counts, the WorldPop database includes a number of different demographic indicators at the grid level. WorldPop population count datasets are available at the country level globally. They are available yearly from 2000 to 2020 (unlike LandScan, this does represent a time series) at a spatial resolution of 3 arc seconds (approximately 100 m at the equator) and in geographic coordinate system WGS1984.
WorldPop population grid datasets are created using Random Forest–based dasymetric redistribution. This method uses a Random Forest model to generate population density predictions based on ancillary data, which WorldPop refers to as covariates. These covariates include land cover, elevation data and derived slope estimates, nighttime lights, climatic spatial variation, roads, waterways, settlements, protected areas, and facilities such as schools, hospitals, and health clinics (Stevens et al. 2015) and can vary by country based on data availability and the relative importance for population estimation at each location. The population inputs are taken from a GIS-linked database of census and official population estimates constructed through the WorldPop initiative and based on GPWv4. Details of the Random Forest model are available in Stevens et al. (2015). The predicted population densities for each country are then used to disaggregate population estimates using a dasymetric mapping approach weighted by the population density prediction. The covariates and estimation algorithms are all freely available from the WorldPop website.
The WorldPop Random Forest methodology is said to improve on other methods because it incorporates many ancillary datasets with little tuning or supervision. However, some important challenges include the standardization of these multiple inputs that come in multiple scales and resolutions, the fact that many of the inputs are highly correlated and the presence of many non-linear interactions (Stevens et al. 2015). One limitation of the data is that Random Forest predictions are limited to the range of population densities of the inputs. This can have an effect in larger administrative areas where population is concentrated in one location (Stevens et al. 2015). However, it is found that the Random Forest methodology is no worse in these areas than other approaches that use less ancillary data (Stevens et al. 2015).
Candidate population grid inputs
Grid data for population counts from each of the above described datasets were downloaded between 2019-03-11 and 2019-03-12. Table 2 provides a summary of the datasets used as well as the temporal and spatial resolution used in the study.
Table 2 Description of datasets used in analysis
We maintained the native projection for all datasets and the native resolution for all except WorldPop (for GHS-POP, we used only the 1-km resolution). For WorldPop, we aggregated the 3 arc second resolution to 30 arc seconds. The data were clipped to the extent of the Swedish border, with the exception of the WorldPop dataset, which was downloaded already clipped. Grid population cells with NoData were assigned a value of zero in order to ensure that all cells could be compared to the known population grid.
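The NoData replacement and the aggregation of population counts to a coarser grid can be sketched as below. The function names and the NoData value of -9999 are assumptions, and the order of the two steps is our illustrative choice; in practice this would be done with GIS software. Counts are summed rather than averaged, and going from 3 to 30 arc seconds corresponds to a factor of 10 (a factor of 2 is used here to keep the example small).

```python
def fill_nodata(grid, nodata):
    """Replace NoData cells with zero so every cell can be compared."""
    return [[0.0 if value == nodata else value for value in row]
            for row in grid]


def aggregate_counts(grid, factor):
    """Sum population counts in non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block_total = sum(grid[i][j]
                              for i in range(r, r + factor)
                              for j in range(c, c + factor))
            coarse_row.append(block_total)
        coarse.append(coarse_row)
    return coarse


NODATA = -9999  # hypothetical NoData marker
fine = [
    [1, 2, NODATA, 0],
    [3, 4, 0, 0],
    [0, 0, 5, 5],
    [0, NODATA, 5, 5],
]
coarse = aggregate_counts(fill_nodata(fine, NODATA), factor=2)
```

Because counts are summed, the total population of the grid is unchanged by the aggregation.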
The datasets used in this analysis were chosen because they were identified as the most commonly used datasets in current research. We would like to make two notes regarding the inclusion of certain datasets in this analysis. First, GRUMP has been superseded by the latest version of GPW (version 4) and its use is no longer recommended; however, we included it in this analysis since it is still sometimes used in research. Second, LandScan and GHS-POP represent population which is not equivalent to the “night-time” or resident-based population which may affect comparison results, as population comparisons are made against known residential statistics. However, both have been included in this analysis because of the temporal and population type comparisons which were also done. The results which compare the datasets over time and in different situations are still highly relevant. Furthermore, comparing these datasets to a resident-based population highlights the differences between resident-based and ambient populations and shows the importance of choosing the correct dataset for the application in question.
Known population data input
The known population was provided by the Swedish Statistical Bureau (Statistiska centralbyrån, SCB) and consisted of a 100 m by 100 m vector grid containing the count of the population which resides in each grid cell. These data are available only to researchers upon specific request to SCB. For this reason, they are not currently used in the generation of the candidate population grids which we examine in this research and are thus suitable for use as an independent comparison. The gridded data are based on information from the Swedish population register. Included in the Swedish population register are all registered residents of Sweden, which comprises all Swedish citizens and all non-Swedish citizens with a residence permit for a minimum of 12 months. The Swedish population register does not include temporary migrants, undocumented migrants or asylum seekers who have not yet received a residence permit (SCB n.d.). To generate the 100 m by 100 m grid data, each person in the population register is geocoded to the location of residence (fastighet) and this information is then generalized to the grid level based on the location of the centroid of each residence building. The grid data are highly accurate; during the six study years, an average of approximately 5300 people are missing from the dataset per year, as compared to official population totals from SCB. This amounts to approximately 0.06% of the population.
The known population data were obtained for 1990, 1995, 2000, 2005, 2010, and 2015. For 1990–2010, the data were in an RT90 25 gon väst projection, while for 2015, they were in SWEREF99 TM. First, the known population of each grid cell was assigned to the cell’s centroid and the centroids were then re-projected to both WGS1984 and World Mollweide projections to match the projection of the candidate population grids.
Methods
Overlay
To compare the known and candidate population distributions, each candidate population grid was overlaid with the known population grid centroids for the corresponding year and appropriate projection. The total known population within each candidate population cell was then calculated as the sum of the known population centroids which fell within each candidate population cell. This produced an aggregated population grid with cell by cell totals for candidate and known populations.
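A minimal sketch of this overlay is given below, assuming a regular grid of square cells defined by its upper-left corner and cell size in the same projected units as the centroids. The function names and the example coordinates are hypothetical; the actual analysis would typically rely on GIS software.

```python
import math
from collections import defaultdict


def cell_index(x, y, origin_x, origin_y, cell_size):
    """Row and column of the candidate cell containing point (x, y).

    (origin_x, origin_y) is the upper-left corner of the candidate grid;
    rows increase southwards and columns increase eastwards.
    """
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((origin_y - y) / cell_size)
    return row, col


def sum_known_by_cell(centroids, origin_x, origin_y, cell_size):
    """Sum the known population of all centroids falling in each cell.

    centroids is an iterable of (x, y, population) tuples.
    """
    totals = defaultdict(float)
    for x, y, population in centroids:
        totals[cell_index(x, y, origin_x, origin_y, cell_size)] += population
    return dict(totals)


known_centroids = [
    (0.2, 9.9, 5),   # falls in row 0, column 0
    (0.8, 9.1, 7),   # also row 0, column 0
    (1.5, 9.5, 3),   # row 0, column 1
    (0.5, 8.5, 2),   # row 1, column 0
]
known_per_cell = sum_known_by_cell(known_centroids, 0.0, 10.0, 1.0)
```

Candidate cells that contain no centroid do not appear in the result and would be treated as having a known population of zero.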
Comparison statistics
Comparison statistics were then calculated between the known (k) and candidate (g) populations for each dataset and year. The comparison statistics used were the percent mean absolute error (%MAE), shown in Eq. (1), the percent root mean square error (%RMSE), shown in Eq. (2), Pearson’s r, the percentage of cells correctly identified as either populated or unpopulated, and the relative differences, shown in Eq. (3).
$$ \%\mathrm{MAE}=\frac{\frac{1}{n}{\sum}_{j=1}^n\left|{k}_j-{g}_j\right|}{\frac{\sum_{j=1}^n\left({k}_j\right)}{n}} $$
(1)
$$ \%\mathrm{RMSE}=\frac{\sqrt{\frac{1}{n}{\sum}_{j=1}^n{\left({k}_j-{g}_j\right)}^2}}{\frac{\sum_{j=1}^n\left({k}_j\right)}{n}} $$
(2)
$$ \mathrm{relative}\ \mathrm{difference}=\frac{\left(g-k\right)}{\left(g+k\right)} $$
(3)
The %MAE and %RMSE measure the absolute fit between the known and candidate populations. The %RMSE “penalizes” outliers more heavily than the %MAE, because errors are squared before they are averaged. The percent values are a standardization of the MAE and RMSE values by the average known population within the cells. This standardization was deemed necessary in order to allow for cross-dataset comparisons, since the cell size differs between datasets (due to different projections and resolutions) and thus the sizes of the study regions differed.
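As an illustration, Eqs. (1) and (2) can be computed as in the following sketch. The function names and the example values are ours; the functions return the ratio exactly as written in the equations, so multiplying by 100 gives a percentage.

```python
import math


def _mean_known(known, candidate):
    """Validate the inputs and return the mean known population."""
    if not known or len(known) != len(candidate):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_known = sum(known) / len(known)
    if mean_known <= 0:
        raise ValueError("mean known population must be positive")
    return mean_known


def percent_mae(known, candidate):
    """Eq. (1): mean absolute error divided by the mean known population."""
    mean_known = _mean_known(known, candidate)
    mae = sum(abs(k - g) for k, g in zip(known, candidate)) / len(known)
    return mae / mean_known


def percent_rmse(known, candidate):
    """Eq. (2): root mean square error divided by the mean known population."""
    mean_known = _mean_known(known, candidate)
    mse = sum((k - g) ** 2 for k, g in zip(known, candidate)) / len(known)
    return math.sqrt(mse) / mean_known


known = [10, 0, 30, 0]       # known population per cell (k)
candidate = [8, 2, 20, 10]   # candidate population per cell (g)
mae_ratio = percent_mae(known, candidate)
rmse_ratio = percent_rmse(known, candidate)
```

In this example the two large errors of 10 people raise the %RMSE above the %MAE, which illustrates the heavier weight that squaring gives to outliers.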
Pearson’s r was used as a measure of the linear association between the known and candidate populations.
In order to calculate the percentage of cells correctly identified as populated or unpopulated, the known and candidate datasets were converted to Boolean datasets, where cells with populations greater than zero were deemed populated and cells with populations equal to zero were deemed unpopulated. The percent correctly populated is the number of cells that were identified as populated in both the candidate and known datasets, divided by the number of known populated cells, while the percent correctly unpopulated cells is the same calculation with unpopulated cells replacing populated cells. The total percent correct is the sum of the number of cells identified as populated in both datasets and the number identified as unpopulated in both divided by the total number of cells.
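The three agreement measures can be sketched as follows. The function name and the returned keys are hypothetical, populations are assumed to be non-negative, and returning None when a denominator is zero is our assumption rather than something stated in the text.

```python
def populated_agreement(known, candidate):
    """Share of cells correctly identified as populated or unpopulated."""
    if not known or len(known) != len(candidate):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(known, candidate))
    both_populated = sum(1 for k, g in pairs if k > 0 and g > 0)
    both_unpopulated = sum(1 for k, g in pairs if k == 0 and g == 0)
    known_populated = sum(1 for k in known if k > 0)
    known_unpopulated = len(known) - known_populated

    def ratio(numerator, denominator):
        # Assumption: the measure is undefined (None) without such cells.
        return numerator / denominator if denominator else None

    return {
        "populated": ratio(both_populated, known_populated),
        "unpopulated": ratio(both_unpopulated, known_unpopulated),
        "total": (both_populated + both_unpopulated) / len(known),
    }


result = populated_agreement(known=[10, 0, 30, 0, 0],
                             candidate=[8, 2, 0, 0, 0])
```

Here one of the two known populated cells and two of the three known unpopulated cells are matched, so three of the five cells are correct overall.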
The relative differences were used in order to evaluate the differences between the known and candidate populations in each cell relative to the size of the population in each cell, rather than looking at the absolute differences. The relative difference ranges from − 1 to 1. Negative relative difference values indicate that the candidate population underestimates the known population; that is, the candidate population is lower than the known population. Positive values indicate that the candidate population overestimates the known population. Values of − 1 occur where the candidate population is estimated to be zero, but the cell is in fact populated based on the known population. Values of 1 are the opposite; in these cells, the known population is zero, but the candidate population has assigned the cell as populated. A relative difference of zero indicates that the known and candidate populations are equal. The relative differences were only used for mapped comparisons of the datasets.
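Eq. (3) and its limiting cases can be illustrated with a short sketch. The function name is ours, and returning None where both populations are zero (where Eq. (3) is undefined) is an assumption, since the text does not state how such cells were handled.

```python
def relative_difference(candidate, known):
    """Eq. (3): (g - k) / (g + k) for one cell, ranging from -1 to 1."""
    if candidate < 0 or known < 0:
        raise ValueError("populations must be non-negative")
    if candidate + known == 0:
        return None  # assumption: undefined when both populations are zero
    return (candidate - known) / (candidate + known)


missed = relative_difference(candidate=0, known=10)      # cell wrongly empty
invented = relative_difference(candidate=10, known=0)    # cell wrongly populated
exact = relative_difference(candidate=5, known=5)        # perfect agreement
over = relative_difference(candidate=30, known=10)       # overestimate
```

The first three calls reproduce the values of -1, 1, and 0 discussed above, and the last gives 0.5 for a candidate population three times the known population.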
Population density comparison
The population density for each cell in the known population was calculated as the population count divided by the area of the cell in km2. This is, of course, an approximation of population density, since cells can be partially covered by non-land areas. The cells were then divided into three groups: high, low, and zero population density. The high and low population density groups were divided using three different cutoffs as a robustness check. The cutoffs were determined as follows:
- Average population density for Sweden for the period from 1990 to 2015: 22 people/km2
- Average global population density for the period from 1990 to 2015: 50 people/km2
- Average of the average country population density for the period from 1990 to 2015: 382 people/km2
The spatial distribution of high- and low-density cells for each of these cutoffs is consistent with Swedish population centers (tätort) and divisions between urban and rural areas. Figure 2 shows the population distribution for the known population for the medium cutoff of 50 people/km2 during 2015.
The comparison statistics were then calculated for the high, low, and zero density groups separately.
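The grouping by population density can be sketched as below. The function name is hypothetical, and treating a density exactly equal to the cutoff as high is our assumption, since the text does not specify which side of the cutoff is inclusive.

```python
CUTOFFS = (22, 50, 382)  # people/km2, the three cutoffs listed above


def density_group(count, cell_area_km2, cutoff):
    """Classify one known-population cell as zero, low, or high density."""
    if cell_area_km2 <= 0:
        raise ValueError("cell area must be positive")
    density = count / cell_area_km2
    if density == 0:
        return "zero"
    # Assumption: a density equal to the cutoff counts as high.
    return "high" if density >= cutoff else "low"


# A cell with 40 people in 1 km2 changes group depending on the cutoff.
groups = [density_group(40, 1.0, cutoff) for cutoff in CUTOFFS]
empty_cell = density_group(0, 1.0, 50)
```

The example shows why several cutoffs serve as a robustness check: the same cell is high density under the Swedish average but low density under the other two.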
Population change comparison
Known population change was calculated for all but the first year of data for each dataset. Population change was calculated as the difference between the population in each cell and the population in the same cell in the first year available for that dataset. It was determined that 70 to 95% of cells experienced no change over time. The cells were then divided into growth, decline, and stable cells corresponding to cells with change above zero, below zero, and equal to zero. Comparison statistics were then calculated separately for each of the three change groups. Additionally, known population cells with zero population were removed and the comparison statistics were recalculated for the three groups using only the remaining known populated cells.
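The division into change groups can be sketched as follows. The function names and the example values are hypothetical; the sketch only covers the grouping step, not the subsequent removal of unpopulated cells.

```python
def change_group(first_year_count, later_year_count):
    """Classify one cell by its change relative to the first available year."""
    change = later_year_count - first_year_count
    if change > 0:
        return "growth"
    if change < 0:
        return "decline"
    return "stable"


def split_by_change(first_year, later_year):
    """Return the cell indices belonging to each change group."""
    if len(first_year) != len(later_year):
        raise ValueError("both years must cover the same cells")
    groups = {"growth": [], "decline": [], "stable": []}
    for index, (first, later) in enumerate(zip(first_year, later_year)):
        groups[change_group(first, later)].append(index)
    return groups


first = [10, 0, 5, 0]   # known population in the first available year
later = [12, 0, 3, 0]   # known population in a later year
change_groups = split_by_change(first, later)
```

The comparison statistics would then be computed on the cells of each group separately.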