Assessing the use of global land cover data for guiding large area population distribution modelling
- Cite this article as:
- Linard, C., Gilbert, M. & Tatem, A.J. GeoJournal (2011) 76: 525. doi:10.1007/s10708-010-9364-8
Gridded population distribution data are finding increasing use in a wide range of fields, including resource allocation, disease burden estimation and climate change impact assessment. Land cover information can be used in combination with detailed settlement extents to redistribute aggregated census counts to improve the accuracy of national-scale gridded population data. In East Africa, such analyses have been done using regional land cover data, thus restricting application of the approach to this region. If gridded population data are to be improved across Africa, an alternative, consistent and comparable source of land cover data is required. Here these analyses were repeated for Kenya using four continent-wide land cover datasets combined with detailed settlement extents, and accuracies were assessed against detailed census data. The aim was to identify the large area land cover dataset that, combined with detailed settlement extents, produces the most accurate population distribution data. The effectiveness of the population distribution modelling procedures in the absence of high resolution census data was evaluated, as was the ability to extrapolate population densities between different regions. Results showed that the use of the GlobCover dataset refined with detailed settlement extents provided significantly more accurate gridded population data compared to the use of refined AVHRR-derived, MODIS-derived and GLC2000 land cover datasets. This study supports the hypothesis that land cover information is important for improving population distribution model accuracies, particularly in countries where only coarse resolution census data are available. Obtaining high resolution census data must, however, remain the priority.
With its higher spatial resolution and its more recent data acquisition, the GlobCover dataset was found to be the most valuable resource to use in combination with detailed settlement extents for the production of gridded population datasets across large areas.
Keywords: Population mapping; Global land cover data; Census data; Dasymetric modelling; GlobCover
Gridded population distribution data are increasingly being used for resource allocation, disease burden estimation and climate change impact assessment amongst other applications, at global, continental and national scales. Detailed and spatially disaggregated population data are essential resources in the assessment of the number of impacted people in decision-making processes related to developmental or health issues (Bhaduri et al. 2002; Dobson et al. 2000; Hay et al. 2005; Salvatore et al. 2005). Existing gridded population data have been used, for example, to quantify populations at risk of several infectious diseases such as malaria (Guerra et al. 2006; Hay et al. 2009), yellow fever and dengue (Rogers et al. 2006), or avian influenza (Ferguson et al. 2005; Rao et al. 2009). Global population datasets have also been used to study the spatial distribution of infant mortality (Storeygard et al. 2008) and child hunger (Balk et al. 2005c). Moreover, gridded population distribution data have shown application in the analysis of the impacts of climate change, such as sea level rise (McGranahan et al. 2007) and the collapse of an Antarctic ice sheet (Nicholls et al. 2005), while the vulnerability of people to natural disasters has also been quantified (Balk et al. 2005a; Maynard-Ford et al. 2008).
Three global gridded population datasets are available for undertaking such studies: the Gridded Population of the World (GPW), the Global Rural Urban Mapping Project (GRUMP), and the LandScan Global Population database. The United Nations Environment Programme (UNEP) has also compiled gridded population data for Africa, Asia and Latin America. In the GPW database, which was first released in 1995 (Tobler et al. 1995, 1997), then updated in 2000 (Deichmann et al. 2001) and 2004 (Balk and Yetman 2004), population data were simply areal-weighted per administrative unit, thus assuming that the population is uniformly distributed within each administrative unit. GRUMP uses a similar approach to GPW, but incorporates satellite nighttime light-derived urban extents and their corresponding populations in the spatial reallocation of census counts (Balk et al. 2005b). LandScan was first developed in 1998 (Dobson et al. 2000), then updated yearly from 2000 to 2008. LandScan uses ancillary data such as roads, slope, land cover and nighttime lights to estimate probabilities of population occurrence in grid cells. Populations are spatially reallocated within each areal unit using modelling approaches based on these probability coefficients (Dobson et al. 2000; Bhaduri et al. 2007). Finally, the UNEP database was constructed based on an accessibility surface developed from road networks and populated places datasets (Deichmann 1996; Hyman et al. 2004; Nelson 2004).
These existing large area population datasets exhibit significant drawbacks due to the coarse nature of the input census data used in their construction for many countries, particularly those in the low income regions of the world. For the majority of African countries, census data are often over a decade old and at a provincial or district level resolution (Tatem et al. 2008). The use of modelling techniques for the spatial reallocation of populations within census units is therefore particularly relevant for Africa. Dasymetric modelling methods involve using ancillary data to redistribute populations from administrative units to more homogeneous units such as square grids (Mennis 2003). However, these approaches only increase population distribution model accuracies over the simple gridding (areal weighting) of census data if the ancillary data are more detailed and spatially complete than the input census data, and can be detrimental to modelling accuracies otherwise (Hay et al. 2005; Tatem et al. 2007). Land cover and land use data, particularly on settlements, at a spatial resolution finer than the scale of census data administrative units offer an opportunity for improving population distribution models in areas with poor ancillary spatial data, such as sub-Saharan Africa. Population density is assumed to vary according to land use and land cover types (Mennis 2003; Wright 1936). Land use classes, defined by the purposes for which humans exploit the land cover, are closely linked to people's activities, which makes them a more effective indicator of population distribution than land cover. Satellite remote sensing offers a cheap and effective solution to obtain spatial information such as land cover and land use data at different spatial scales (Tatem et al. 2004).
Recent work forming part of the AfriPop project (www.afripop.org) has shown that detailed satellite imagery-based mapping of settlements combined with land cover information can be used to increase population model accuracies across large areas (Tatem et al. 2007). Using East Africa as an example, Tatem et al. (2007) showed that the combination of detailed settlement extents data with land cover data produced more accurate population distribution data than simple areal weighting or the allocation of people only to the grid squares classified as settlement. Dasymetric modelling methods based on land use data require the definition of relative weights associated with land use classes (Hay et al. 2005; Tatem et al. 2007). These weights are first calculated for regions where high resolution census data are available and then applied to other geographically proximate or similar regions with coarser census data. The aim of the AfriPop project is to extend these dasymetric methods to model population distributions across the whole of Africa. As census data are coarse and outdated in many of these countries, land cover specific weights will be calculated based on regions where accurate, detailed and contemporary data are available and then extrapolated to neighbouring regions. The extrapolation level will depend on available data. This spatial extrapolation of relative population weights assumes that the weights are consistent across the regions considered.
The work performed by Tatem et al. (2007) relied upon East Africa-specific land cover information (Africover, www.africover.org), thus restricting application to East Africa. The extension of these approaches beyond the region requires the identification and testing of candidate land cover datasets of wider extent. This paper aims to identify the large area land cover dataset that, combined with detailed settlement extents, produces the most accurate population distribution data. The most appropriate land cover data, refined with detailed settlement extents, will then be used for population distribution modelling across Africa. Here, four satellite imagery derived global land cover datasets are first refined in the same way, and then tested with Kenyan census data on their ability to improve the accuracy of population distribution models. In addition, the spatial extrapolation ability of the relative weights calculated from the four refined land cover datasets was also tested.
Land cover and land use
Table 1 Global land cover datasets and their main characteristics (dataset; producer; number of land cover classes; data acquisition year)
- Advanced Very High Resolution Radiometer (AVHRR) Land Cover Classification; University of Maryland, Department of Geography
- MODerate resolution Imaging Spectroradiometer (MODIS) Land Cover Classification; Boston University, Department of Geography
- Global land cover 2000 (v1.1); Joint Research Center
- GlobCover Land Cover product v2.2; European Space Agency; 23 global and 47 regional land cover classes
Settlement maps at 30 m spatial resolution were created by Tatem et al. (2007) for five East African countries (Kenya, Uganda, Burundi, Rwanda and Tanzania) based upon methodologies detailed in Tatem et al. (2004). In brief, bands 1–5, 7 and 8 from Landsat Enhanced Thematic Mapper (ETM) imagery and eight texture layers extracted from Radarsat-1 synthetic aperture radar (SAR) were combined for classifier training. The imagery was split into segments and spatial-spectral segmentation was undertaken in each segment. A feed-forward neural network classifier was then used to identify settlements within each spectrally and spatially contiguous zone, using Africover and settlement centroid data for training and testing. In highly rugged areas, only ETM data were used to avoid strong radar responses due to variations in topography.
Kenya census data at administrative unit levels 0 (national), 1 (province), 2 (district), 3 (division), 4 (location) and 5 (sublocation) were obtained from the 1999 population and housing census report, available at the Central Bureau of Statistics in Nairobi (CBS 2001), along with corresponding administrative unit boundaries. Also obtained were corresponding census data at administrative unit level 6 (enumeration area, EA) with corresponding boundaries for 58 of the 69 Kenyan districts.
Population distribution modelling approach
Land cover data refinement (Fig. 1, box 1)
The global land cover maps were ‘refined’ to accommodate the more detailed and accurate information on settlements provided by Tatem et al. (2007). The four global land cover datasets were first resampled to 100 m spatial resolution. For each land cover dataset, the urban class, which typically overestimates settlement extent size (Tatem et al. 2005, 2007), was removed and the surrounding classes expanded equally to fill the remaining space. The 30 m settlement map constructed in Tatem et al. (2007) was also degraded to 100 m spatial resolution. This more detailed settlement map was then overlaid onto the ‘urban class deprived’ land cover map and land covers beneath were replaced to produce a refined land cover map. Four refined land cover datasets were therefore created for Kenya.
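The refinement steps above can be sketched on a toy grid. This is a minimal illustration and not the authors' implementation: the class labels, the function name and the use of a breadth-first fill to stand in for "expanding the surrounding classes equally" are all assumptions.

```python
from collections import deque

URBAN = "urban"            # hypothetical label of the coarse urban class
SETTLEMENT = "settlement"  # hypothetical label of the detailed settlement class

def refine_land_cover(land_cover, settlement_mask):
    """Return a refined copy of `land_cover` (a grid stored as a list of rows)."""
    rows, cols = len(land_cover), len(land_cover[0])
    # Step 1: remove the urban class by marking its cells as empty.
    grid = [[None if cell == URBAN else cell for cell in row] for row in land_cover]
    # Step 2: grow the surrounding classes into the gaps, one cell at a time
    # in the four cardinal directions (multi-source breadth-first fill).
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if grid[r][c] is not None)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] is None:
                grid[nr][nc] = grid[r][c]
                queue.append((nr, nc))
    # Step 3: overlay the detailed settlement map, replacing what lies beneath.
    for r in range(rows):
        for c in range(cols):
            if settlement_mask[r][c]:
                grid[r][c] = SETTLEMENT
    return grid

land_cover = [
    ["crop", "urban", "urban", "forest"],
    ["crop", "urban", "urban", "forest"],
    ["crop", "crop", "forest", "forest"],
]
settlement_mask = [
    [False, False, False, False],
    [False, True, False, False],
    [False, False, False, False],
]
refined = refine_land_cover(land_cover, settlement_mask)
```

In this toy case the overestimated four-pixel urban block is replaced by its neighbours, and only the single pixel flagged in the settlement mask is labelled as settlement.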
Sampling methods (Fig. 1, box 2)
Extrapolation levels (EXL) with their corresponding sampling method
- L5: selection based on admin. level 5 (sublocations)
- L4: selection based on admin. level 4 (locations)
- L3: selection based on admin. level 3 (divisions)
- L2: selection based on admin. level 2 (districts)
- L1: selection based on admin. level 1 (provinces)
Dasymetric modelling (Fig. 1, box 3)
The refined land cover data and Kenyan enumeration area census data were then used to define per land cover class population densities (i.e. the average number of people per 100 × 100 m pixel). Mennis and Hultgren (2006) described and compared different methods for estimating population densities based on land cover data. Here, the average population density of one specific land cover class was calculated based on EAs from the first sample that record this land cover class for the majority of their pixels. Different tables were produced containing the population density per land cover class for each of the four newly created land cover datasets. Zeros were attributed to classes with no human habitation, mainly water bodies.
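The density calculation described above might look as follows. This is a sketch under stated assumptions: the record layout, the function name, the strict-majority rule (more than half of the pixels) and the pooling of people and pixels across the selected EAs are illustrative choices, not details confirmed by the text.

```python
from collections import defaultdict

def class_densities(enumeration_areas, uninhabited=("water",)):
    """Estimate people per 100 m pixel for each land cover class.

    Each EA is a dict with a census `population` and a `pixels` mapping
    {land cover class: number of pixels}. Only EAs in which one class
    covers a strict majority of pixels contribute to that class.
    """
    population_sum = defaultdict(float)
    pixel_sum = defaultdict(int)
    for ea in enumeration_areas:
        pixels = ea["pixels"]
        total_pixels = sum(pixels.values())
        dominant, count = max(pixels.items(), key=lambda item: item[1])
        if 2 * count > total_pixels:  # strict majority of the EA's pixels
            population_sum[dominant] += ea["population"]
            pixel_sum[dominant] += total_pixels
    densities = {cls: population_sum[cls] / pixel_sum[cls] for cls in pixel_sum}
    # Classes with no human habitation (mainly water bodies) are set to zero.
    for cls in uninhabited:
        densities[cls] = 0.0
    return densities

sample_eas = [
    {"population": 900, "pixels": {"settlement": 8, "crop": 2}},
    {"population": 100, "pixels": {"crop": 9, "forest": 1}},
    {"population": 300, "pixels": {"crop": 15, "settlement": 5}},
    {"population": 50, "pixels": {"crop": 5, "forest": 5}},  # no majority: skipped
]
densities = class_densities(sample_eas)
```

Classes that never dominate an EA in the sample (forest in this toy case) receive no estimate, which illustrates why densities for rare classes can rest on very few EAs.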
Totals correction methods (TCM) with their corresponding average spatial resolution (ASR) in Kenya
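A correction by totals can be illustrated with a short sketch. It assumes that the correction is a proportional rescaling, so that the predicted counts within each administrative unit at the chosen level sum to that unit's census count; this rule and all names below are assumptions made for illustration.

```python
def correct_totals(predicted, unit_of_pixel, census_totals):
    """Rescale per-pixel predictions so each unit matches its census total.

    predicted     -- list of predicted people per pixel
    unit_of_pixel -- list giving the administrative unit id of each pixel
    census_totals -- dict {unit id: census population count}
    """
    # Sum the uncorrected predictions within each administrative unit.
    predicted_totals = {}
    for value, unit in zip(predicted, unit_of_pixel):
        predicted_totals[unit] = predicted_totals.get(unit, 0.0) + value
    corrected = []
    for value, unit in zip(predicted, unit_of_pixel):
        total = predicted_totals[unit]
        # A unit with no predicted population cannot be rescaled; keep zeros.
        factor = census_totals[unit] / total if total > 0 else 0.0
        corrected.append(value * factor)
    return corrected

corrected = correct_totals(
    predicted=[90.0, 10.0, 10.0, 30.0, 10.0],
    unit_of_pixel=["A", "A", "A", "B", "B"],
    census_totals={"A": 220, "B": 20},
)
```

The coarser the administrative level used for the totals, the fewer constraints the rescaling imposes, so more of the final pattern depends on the land cover based densities.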
Accuracy assessment (Fig. 1, box 4)
The accuracies of these population distribution data were tested principally using the second sample of EA census data, the first sample having been used for the relative weights calculation. With an average of 23,017 EAs and an ASR of 3.21 km (8.4 EAs per sublocation on average), these provided a valuable dataset for assessing the accuracy with which populations had been distributed within each administrative unit by the application of each global land cover dataset. Predicted population data per EA were compared to observed population data from the 1999 Kenyan census. Accuracy statistics including root mean square errors (RMSE) and Pearson correlation coefficients were computed. Accuracies were also tested by comparing the output population distribution data derived from each land cover product to areal weighting, to examine which approaches produced improvements over this simplest of methods. As discussed previously, the areal weighting method is a simple population distribution modelling method consisting of a homogeneous distribution of populations within census units, and represents the basis on which the existing widely used global population data, Gridded Population of the World (Balk et al. 2006), are constructed.
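The two accuracy statistics and the areal weighting baseline can be written compactly. The sketch below uses invented numbers for three EAs inside one census unit; the function names and data are hypothetical and serve only to show how the comparison works.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted EA counts."""
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def pearson(observed, predicted):
    """Pearson correlation coefficient (undefined if either series is constant)."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def areal_weighting(unit_population, ea_areas):
    """Spread a census unit's population over its EAs in proportion to area."""
    total_area = sum(ea_areas)
    return [unit_population * area / total_area for area in ea_areas]

observed = [100, 300, 600]            # invented census counts per EA
model = [150, 250, 600]               # invented land cover based predictions
baseline = areal_weighting(1000, [2.0, 1.0, 1.0])  # areas in km2

print(rmse(observed, model), rmse(observed, baseline))
print(pearson(observed, model), pearson(observed, baseline))
```

A land cover based dataset improves on the baseline when its RMSE is lower and its correlation with the observed EA counts is higher, as is the case for the invented values here.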
Tests and replications
In summary, each population distribution dataset produced in this study is characterized by input land cover data (AVHRR, MODIS, GLC2000 or GlobCover), a TCM (ADMIN-5, ADMIN-4, ADMIN-3, ADMIN-2, ADMIN-1, ADMIN-0, or no correction) and an EXL (RD, L5, L4, L3, L2 or L1) (Fig. 1).
In a first step, we fixed the extrapolation level to the maximum (i.e. EXL = L1)–because a high extrapolation level is likely to be required to produce population distribution data in other African countries–and varied the TCM. This allowed for exploration of the effectiveness of the population modelling procedures in the absence of high resolution census data. With EXL = L1, the sampling method is based on the Kenyan provinces. The selection of 4 out of the 8 provinces was replicated 25 times and these 25 different combinations (out of 70) were used to produce population distribution datasets. This was also repeated with the four land cover datasets as input data.
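The count of 70 possible selections follows from choosing 4 provinces out of 8. A minimal sketch of drawing 25 distinct selections is given below; the province labels, the function name and the fixed random seed are placeholders rather than details from the study.

```python
import itertools
import random

def sample_province_sets(provinces, k=4, n_replicates=25, seed=0):
    """List all k-province combinations and draw distinct replicates at random."""
    all_sets = list(itertools.combinations(provinces, k))
    rng = random.Random(seed)  # fixed seed so that the draw is repeatable
    return all_sets, rng.sample(all_sets, n_replicates)

# Placeholder labels standing in for the eight Kenyan provinces.
provinces = [f"province_{i}" for i in range(1, 9)]
all_sets, replicates = sample_province_sets(provinces)
print(len(all_sets), len(replicates))  # prints "70 25"
```

Because `random.Random.sample` draws without replacement, no combination of provinces is used twice among the 25 replicates.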
Statistical analyses including analyses of variance and Tukey’s honest significant difference tests were performed to test for differences between different land cover data, TCM and EXL. The Tukey’s honest significant difference statistical test is used to identify which means are significantly different from the others. This test is based on the range of the sample means rather than the individual differences.
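As an illustration of the first of these tests, the F statistic of a one-way analysis of variance can be computed directly. This sketch is a simplification: it handles a single factor only (for example, land cover dataset), uses invented RMSE values, and does not reproduce Tukey's test, which additionally requires the studentized range distribution.

```python
def one_way_anova_f(groups):
    """Return the F statistic comparing the means of several groups."""
    k = len(groups)                               # number of groups
    n = sum(len(group) for group in groups)       # total number of observations
    grand_mean = sum(sum(group) for group in groups) / n
    means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean) ** 2
                    for group, mean in zip(groups, means) for x in group)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented RMSE replicates for three hypothetical land cover inputs.
rmse_by_dataset = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic = one_way_anova_f(rmse_by_dataset)
```

A large F statistic indicates that the differences between group means are large relative to the scatter within groups; the Pr(>F) value reported by statistical packages is the corresponding tail probability of the F distribution.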
Results from the analysis of variance performed on RMSEs extracted from population maps (significance reported as Pr(>F))
The accuracy of population distribution data decreased drastically with coarser administrative levels used for TCM, both in terms of RMSE and correlation coefficient (Fig. 4). This is even more marked for ASR below 100 km. Without any correction by totals, the population distribution data produced show Pearson correlation coefficients similar to those of population data produced with TCM = ADMIN-0, but RMSEs approximately 100 times higher, with average RMSEs between 411,467 for the GLC2000-based population model and 615,896 for the GlobCover-based population model.
Figure 3 also allows comparison of the accuracy of population distribution data produced with the areal weighting method (dotted line in the graphs). We observe that with TCM = ADMIN-5, the areal weighted method produced more accurate population distribution data than the procedure described in this paper, whereas for the other levels of TCM, the land cover based population data were generally more accurate. Moreover, for ADMIN-2 and ADMIN-1 levels of TCM, the improvement shown for the GlobCover-based population distribution dataset compared to the areal weighted data is much clearer than the population data based on other land cover datasets.
Results from the analysis of variance performed on RMSEs extracted from population maps (significance reported as Pr(>F))
We performed 25 random simulations for each combination of TCM and EXL. Results showed that the average RMSE converged appreciably after this reasonable number of simulations, with changes in the average RMSE in the last 5 simulations generally lower than 2% and lower than 1% in 84% of cases.
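One way to check such convergence is to track the running average of the RMSE and measure how much it still moves over the last simulations. The sketch below assumes that "change" means the relative change of the running average from one simulation to the next; that reading, the function names and the invented values are assumptions.

```python
def running_means(values):
    """Average of the first 1, 2, ..., n values."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means

def max_relative_change(values, window=5):
    """Largest step-to-step relative change of the running mean
    over the last `window` simulations (needs more than `window` values)."""
    means = running_means(values)
    changes = [abs(means[i] - means[i - 1]) / means[i - 1]
               for i in range(len(means) - window, len(means))]
    return max(changes)

# Invented RMSEs for 25 simulations: 20 at 100 followed by 5 at 110.
simulated_rmse = [100.0] * 20 + [110.0] * 5
change = max_relative_change(simulated_rmse)
converged = change < 0.02  # 2% threshold quoted in the text
```

With these invented values the running average moves by less than 0.5% per simulation over the last five runs, so it would count as converged under the 2% threshold.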
The primary aim of this work was to identify which global land cover data could be used in combination with detailed settlement extents to produce the most accurate population distribution modelling across Africa. Results showed that, combined with detailed settlement extents, the GlobCover dataset generally provided significantly more accurate population distribution models than other global land cover datasets in Kenya. As the vast majority of people across the world reside in settlements, it was important to refine global land cover datasets with settlement extent data that were as detailed as possible. However, we showed that different refined land cover data resulted in significantly different output population distribution datasets, which confirms that the use of additional land cover classes for dasymetric modelling can further improve population distribution models.
We also tested the effectiveness of the population distribution modelling procedures with different extrapolation levels. Our results showed no significant influence of the extrapolation level used on the accuracy of the output population distribution datasets for Kenya. This does not exclude accuracy losses when land cover specific population densities are extrapolated from one country to another for large area population distribution modelling, as the relationship between population density and land cover differs between countries. The spatial extrapolation level should therefore be minimized as much as possible in any large area population distribution modelling. Even if the impact of a high extrapolation level was limited in our analysis, whichever global land cover dataset is used, population weights can only be extrapolated to spatially proximate and environmentally similar regions.
The better performance of the GlobCover dataset for population distribution modelling is most likely due to its finer spatial resolution (300 m compared to 1 km for AVHRR, MODIS and GLC2000). The GlobCover dataset also includes a larger number of land cover classes compared to other global land cover datasets (Table 1), which could enable greater precision in the derivation and modelling of land cover-population density relationships. However, a large number of different land cover classes would only improve the accuracy of population distribution data produced if population densities are significantly different by land cover class. The optimal land cover data for population distribution modelling would be a land cover classification that maximizes within-class homogeneity and between-class heterogeneity in relation to population density. In addition, the land cover specific weights calculated can actually be less accurate with a higher number of classes. Combining land cover classes could therefore increase the accuracy of population distribution data. An additional analysis showed, however, that combining GlobCover land cover classes did not significantly influence the accuracies of output population distribution datasets here (see supplementary material).
GlobCover provided less accurate average results in the worst modelling situation, i.e. with the highest level of extrapolation (EXL = L1) and the lowest administrative level for the correction by the totals (TCM = ADMIN-0) (see the last boxplot in Fig. 3) or without any correction by the totals. In some particular cases, the population distribution data produced using the GlobCover dataset provided very high RMSEs, which increased the RMSE variation and reduced considerably the average accuracy. The large number of land cover classes in the GlobCover dataset made the per land cover class specific densities sometimes less accurate because they were calculated based on a limited number of EAs where these land cover classes are dominant. The RMSEs of GlobCover-based population data are thus higher in some particular situations, but the correlation coefficient is always higher on average for GlobCover-based population distribution data (Fig. 4). Aggregating land cover classes could limit this effect.
The time of land cover data acquisition may also influence the results. The dates of imagery acquisition for MODIS and GLC2000 (2001 and 2000 respectively, see Table 1) were close to that of the census data (1999), whereas the AVHRR data were older (1981–1994) and the MERIS imagery used for GlobCover was more recent (2005–2006). With substantial population growth and urbanization taking place across Africa, the expansion of cities may have been important. In our analysis, the urban classes of land cover data have been refined based on the settlement map from Tatem et al. (2007), which relied upon data collected between 1999 and 2002. The discrepancy between census and land cover data is therefore limited for urban areas in our analysis. However, other land use changes, such as the expansion of cropland over natural vegetation, may have occurred in Kenya and could have induced discrepancies between census and land cover data.
This study supports prior work on population distribution modelling. Firstly, the gridded population distribution datasets produced from census data and satellite imagery derived land cover data generally provided more accurate results than areal weighting, as already shown in Tatem et al. (2007) and Mennis and Hultgren (2006). However, when using sublocation level census data in Kenya (with ASR < 10 km), the areal weighting method provided the most accurate results (first boxplot in Fig. 3). This suggests that when very fine-resolution census data are available, the use of land cover data at the spatial resolutions considered here in population distribution modelling does not necessarily improve the simple areal weighting method. This demonstrates that the approach only increases population distribution model accuracies over the simple gridding of census data if land cover data are significantly more detailed than the input census data. In our case, more spatially detailed ancillary data would be needed to improve the redistribution of populations within sublocation units in Kenya. Secondly, Fig. 4 shows the accuracy changes experienced with different ASR of census data available for modelling. As already described in Hay et al. (2005), it demonstrates that obtaining as high a spatial resolution of census data as possible must be the priority starting point in population distribution modelling. Given the resolution of census data available, the ancillary data can improve population model accuracies to a lesser extent. The potential improvement provided by land cover data is higher with coarser ASR of input census data.
In conclusion, GlobCover, in combination with detailed settlement extents, likely represents a more accurate source of land cover data for dasymetric modelling than other global land cover datasets. In addition, GlobCover is the most recent global land cover dataset, being derived from 2005/2006 MERIS imagery. Moreover, the robust automated processes used in the data production (Arino et al. 2007, 2008) allow for updates to be incorporated in the coming years. A complete land cover dataset for the year 2009 is currently under production (ESA GlobCover Team 2009). For all these reasons, GlobCover represents the preferred global land cover dataset for use as an alternative to regional land cover products in the creation of population distribution data across large areas.
These analyses form part of a wider initiative, the AfriPop project (www.afripop.org), aimed at providing detailed and open access gridded population distribution data for all African countries. AfriPop aims to produce datasets based on freely available data and methods that can easily incorporate new data as it becomes available.
CL is supported by a grant from the Fondation Philippe Wiener - Maurice Anspach. AJT is supported by a grant from the Bill and Melinda Gates Foundation (#49446). This work forms part of the output of the AfriPop Project (www.afripop.org), principally funded by the Fondation Philippe Wiener–Maurice Anspach, and the Malaria Atlas Project (MAP, www.map.ox.ac.uk), principally funded by the Wellcome Trust, U.K.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.