For many studies, no sampling frame of the target population is available. The most common approach to addressing this problem for large-scale household surveys in the developing world is to use a stratified two-stage design. In the first stage, census enumeration areas are selected as the primary sampling units (PSUs), using probability proportional to estimated size. In the second stage, a household listing operation is conducted in the selected PSUs, and households are selected using simple random sampling.Footnote 1,Footnote 2 With this approach, even outdated census data can be used to select PSUs, as long as a high-quality listing operation is done in the selected PSUs to create a sampling frame for the second-stage selection of households. Using out-of-date census data as a measure of size in PSU selection will result in estimates that are inefficient but still unbiased. However, some countries do not have census records at all because of accessibility issues, war, or natural disasters. In these situations, newly available high-resolution satellite data can be used to generate estimated population densities and to demarcate PSU boundaries. The two examples discussed here are from surveys conducted in rural Somalia and Kinshasa, Democratic Republic of the Congo (DRC).
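The two-stage logic described above can be illustrated with a minimal sketch. It assumes systematic PPS selection at the first stage, which is one common way to implement probability proportional to size and is not confirmed by the text. The function names, PSU sizes, and sample sizes are hypothetical. The weight function shows why an outdated size measure affects efficiency rather than bias: the design weight is the inverse of the actual selection probability, which uses the size measure that was really applied at the first stage and the fresh listing count at the second.

```python
import random

def pps_systematic(sizes, n, rng):
    """Select n PSU indices with probability proportional to size.

    Assumption: systematic PPS along the cumulated size measure.
    A PSU larger than the sampling interval could be hit more than once.
    """
    step = sum(sizes) / n
    start = rng.random() * step            # random start in [0, step)
    chosen, cum, i = [], 0.0, 0
    for k in range(n):
        target = start + k * step
        # Advance until the target falls inside PSU i's size interval.
        while i < len(sizes) - 1 and cum + sizes[i] <= target:
            cum += sizes[i]
            i += 1
        chosen.append(i)
    return chosen

def household_weight(size_i, total_size, n_psus, listed_i, m_per_psu):
    """Design weight of a household in a selected PSU (hypothetical helper)."""
    p1 = n_psus * size_i / total_size      # first stage: PPS on the (possibly outdated) size
    p2 = m_per_psu / listed_i              # second stage: SRS from the fresh listing
    return 1.0 / (p1 * p2)

# Hypothetical frame of seven PSUs with estimated sizes.
sizes = [120, 80, 200, 150, 50, 100, 300]
picked = pps_systematic(sizes, n=2, rng=random.Random(1))
print(picked, household_weight(200, sum(sizes), 2, listed_i=250, m_per_psu=10))
```

In this illustration the PSU had an estimated size of 200 but 250 households were found in the listing. The weight still reflects the true selection probability, so the mismatch makes weights more variable without biasing the estimates.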
In Somalia, the last population census, carried out in 1975, measured the population at 3.9 million. Current estimates for the country indicate a population of more than 14 million. For the DRC, similarly, the last census was carried out in 1984, at which time the population was around 29 million. Current population estimates are now over 77 million. As noted above, it would still be possible to use the outdated census for estimated population totals if there was an expectation of approximately constant growth across regions. Both Somalia and DRC, however, have experienced significant civil strife, including large-scale displacements of the population.
Some countries, notably Haiti following the 2010 earthquake, have used “quick counts” to collect information about where the population lives and to estimate its size. In a quick count, enumeration areas are randomly sampled and listed, then the results are used to build a model to update census counts in the remaining areas.Footnote 3 However, in Haiti, the most recent census was only seven years old at the time of the earthquake, and the damage and population movements were relatively concentrated. The more time that has elapsed since the last census, the more difficult it is to develop an accurate model of the current population based on quick counts. Moreover, the DRC has a land area nearly 85 times the size of Haiti, which makes using a quick count methodology impractical from the perspective of both cost and implementation time. In Somalia, in addition to ongoing insecurity in certain areas, the enumeration area estimates from the 1975 census were never published, and the full results are thought to be lost. Therefore, alternative approaches to selecting a household sample were needed in both Somalia and the DRC.
Three approaches were implemented across the two surveys. For the Somali High Frequency Survey (SHFS), rural areas posed a challenge for the creation of a sampling frame. Rural areas were defined as non-urban permanently settled areas but excluding Internally Displaced Persons (IDP) settlements.Footnote 4 To create a frame for the first selection stage, a gridded population approach was developed in collaboration with Flowminder.Footnote 5 Rural areas that were secure enough for data collection were divided into 100 by 100 meter grid cells. For each cell, WorldPop data provided an estimated population size.
Neighboring cells were then combined to form PSUs, using a quadtree algorithm, which combines cells to meet specified criteria, in this case, area and population size.Footnote 6 The maximum area was set at 3 by 3 kilometers, and the maximum population was limited to 3500 to keep enumeration areas manageable for field teams. The left panel of Fig. 1 shows the PSUs created by the above steps, with the color indicating the estimated population in each one.Footnote 7 Next, a sample of PSUs was selected using probability proportional to estimated size. The selected PSUs were then further subdivided into segments. If the selected PSU contained 12 or fewer dwellings based on satellite imagery, only one segment was defined. For those PSUs containing between 13 and 150 dwellings, 12 segments were defined, with additional segments being defined for PSUs with more than 150 dwellings.
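The grouping of grid cells into PSUs can be sketched as follows. This is a generic top-down quadtree that splits a block into quadrants until each block satisfies both limits. It is not the implementation used for the SHFS, whose details are not given here. The function name and the toy grid are hypothetical. With 100 m cells, a 3 km side corresponds to 30 cells.

```python
def quadtree_psus(grid, max_side=30, max_pop=3500):
    """Partition a population grid into rectangular blocks (candidate PSUs).

    grid: list of rows, each a list of estimated population per cell.
    Returns tuples (row, col, n_rows, n_cols, population).
    """
    blocks = []

    def block_pop(r, c, h, w):
        return sum(sum(row[c:c + w]) for row in grid[r:r + h])

    def split(r, c, h, w):
        if h == 0 or w == 0:                   # empty quadrant from an odd split
            return
        pop = block_pop(r, c, h, w)
        fits = h <= max_side and w <= max_side and pop <= max_pop
        if fits or (h == 1 and w == 1):        # a single cell cannot be split further
            blocks.append((r, c, h, w, pop))
            return
        h1, w1 = (h + 1) // 2, (w + 1) // 2    # first halves take the extra cell
        for rr, hh in ((r, h1), (r + h1, h - h1)):
            for cc, ww in ((c, w1), (c + w1, w - w1)):
                split(rr, cc, hh, ww)

    split(0, 0, len(grid), len(grid[0]))
    return blocks

# Toy 4 x 4 grid with small limits so the splitting is visible.
grid = [
    [1, 2, 0, 0],
    [3, 1, 0, 1],
    [9, 4, 2, 2],
    [0, 0, 2, 3],
]
for block in quadtree_psus(grid, max_side=2, max_pop=10):
    print(block)
```

In the toy grid, only the lower-left quadrant exceeds the population limit, so it alone is split down to single cells while the other three quadrants are kept whole.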
A major disadvantage of the grid approach described above is that the boundaries of the resulting PSUs do not follow natural boundaries such as roads, valleys, and rivers. The cells’ artificial boundaries complicate field implementation. Aware of this constraint, the team initially pursued an alternative methodology in which the WorldPop distribution was used to randomly select points to serve as “seeds” for PSUs, which were then grown until they reached an estimated size of around 150 dwellings but without crossing natural boundaries.Footnote 8 Unfortunately, two major drawbacks became immediately apparent: the development of algorithms to detect natural boundaries was expensive and time-consuming, and selection probabilities were not straightforward to calculate because of boundary effects (seeds near boundaries could grow in fewer directions than others). The team therefore reverted to a gridded approach but manually adjusted segments to follow natural boundaries to mitigate potential implementation issues.
In the DRC, two methods were used. In the districts of Kisenso, Kimbanseke, and Mont Ngafula in Kinshasa, and the sites of Kindu, Tchonka, and Basankusu, a one-stage sample of dwellings was selected based on counts of dwellings made from satellite images. In partnership with the firm Satplan Alpha, the project used recent satellite images to count and geo-locate all dwelling units. This work was done manually. Team members classified each building in the satellite images as low-density residential, high-density residential, or non-residential, using their local knowledge of the typical characteristics of dwelling units in the DRC. These typical characteristics were locally specific, varying between cities and between dense inner-city districts, peri-urban zones, and semirural areas on the outskirts. The main characteristics used to classify structures were architecture, building size and features, roof segmentation, roof design intricacy and height, building orientation, site boundary features, proximity to major streets, street activity, and traffic. The right panel in Fig. 1 shows the final map for Kindu, DRC, with each building classified as low-density residential (blue), high-density residential (yellow), and non-residential (red). When the counting, geo-locating, and classification were complete, each dwelling was assigned a random number, and a sample was selected through a one-stage random draw. If the classification was correct, this approach resulted in an equal-probability simple random sample of dwellings.
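The one-stage draw described above can be sketched in a few lines. The record fields (`id`, `cls`), the class labels, and the function name are hypothetical. The sketch only illustrates the idea of attaching a random number to every residential structure and keeping those with the smallest numbers, which gives each structure on the frame the same selection probability n/N.

```python
import random

# Hypothetical class labels for structures treated as dwellings.
RESIDENTIAL = {"low_density", "high_density"}

def draw_dwellings(buildings, n, seed=None):
    """Equal-probability sample of n dwellings from classified buildings."""
    rng = random.Random(seed)
    # Keep only structures classified as residential.
    dwellings = [b["id"] for b in buildings if b["cls"] in RESIDENTIAL]
    # Assign each dwelling a random number and sort on it.
    keyed = sorted((rng.random(), d) for d in dwellings)
    # The n smallest random numbers define the sample.
    return [d for _, d in keyed[:n]]

# Toy frame: every fifth building is non-residential.
frame = [
    {"id": f"D{k}", "cls": "non_residential" if k % 5 == 0 else "low_density"}
    for k in range(50)
]
print(draw_dwellings(frame, n=8, seed=42))
```

Keeping the full sorted list, rather than only the first n entries, would also provide a ready-made random order from which replacement dwellings can be taken.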
In the districts of N’djili and Makala in Kinshasa, a two-stage random sample was used.Footnote 9 PSU boundaries were first defined using administrative and physical boundaries such as rivers, highways, and secondary and residential roads that would be easily identifiable by interviewers on the ground. The delineation process used an automated iterative approach where PSUs were created and then split or merged based on target population size. The left panel of Fig. 2 shows a map indicating the manually created PSUs.
The next step was to estimate the population within each of these PSUs from high-resolution satellite data. First, a Random Forest Regression model was used to estimate population density based on contextual image information (image metrics that incorporate various aspects of surrounding information, rather than a single-pixel signature).Footnote 10 The model was trained using a sub-sample of building locations.Footnote 11 The area and average building density for each PSU were then integrated with land use and land-cover data to adjust the area by the percentage covered with vegetation and then to produce a building count.Footnote 12
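One plausible reading of the final step is that the building count equals the non-vegetated area multiplied by the average building density. The sketch below encodes that reading only. The exact formula used in the study is not stated, and the function name, units, and example values are assumptions.

```python
def estimated_buildings(area_km2, vegetation_share, density_per_km2):
    """Building count for one PSU under the assumed formula.

    Assumption: density applies to the non-vegetated part of the PSU only.
    """
    built_area = area_km2 * (1.0 - vegetation_share)  # remove vegetated area
    return built_area * density_per_km2

# Hypothetical PSU: 0.5 km2, 20% vegetation, 900 buildings per km2.
print(estimated_buildings(0.5, 0.20, 900))
```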
PSUs were selected with probability proportional to this estimated size. A full listing operation was then conducted in the selected PSUs prior to the second-stage selection of households. Because the resulting sample is clustered, this approach yields estimates with larger variances, and therefore lower precision, than the single-stage approach.Footnote 13
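The precision loss from clustering is often summarized with the standard design effect approximation attributed to Kish. It is shown here as general background, not as a calculation reported for this survey. The symbols are assumptions: \bar{m} is the average number of households interviewed per PSU and \rho is the intra-cluster correlation of the variable of interest.

```latex
\mathrm{deff} \approx 1 + (\bar{m} - 1)\,\rho,
\qquad
\operatorname{Var}_{\text{two-stage}}(\hat{\theta}) \approx \mathrm{deff} \times \operatorname{Var}_{\text{SRS}}(\hat{\theta})
```

For example, with \bar{m} = 12 and \rho = 0.05, the approximation gives deff = 1.55, meaning the variance is roughly 55% larger than under a simple random sample of the same size.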
2.2 Key Results and Implementation Challenges
Each of the methods described above produced a sampling frame from which a representative sample was selected. There were, however, substantial challenges in Somalia. For the SHFS, 407 PSUs were selected for the survey (320 urban and 87 rural), and 366 PSUs were selected as replacements (251 urban and 115 rural). After selection, the PSUs were overlaid with satellite imagery from Google Earth and Bing to verify the presence of dwellings. Following that process, 53% of rural PSUs and 2% of urban PSUs were discarded and replaced due to having no visible population. In some cases, it was necessary to replace a PSU multiple times before one with visible dwellings was identified.
The approach used in the DRC generated more reliable results. Both the single-stage and multi-stage methods yielded results close to what the interview teams found during the listing exercise. The single-stage approach, which manually located dwellings based on satellite imagery and then drew a one-stage random sample, was applied in three large districts of Kinshasa. Locating individual dwellings on satellite imagery remains a manual task that is relatively time-consuming and cannot be entirely standardized. While guidelines can be set for identifying dwellings, in practice, judgment calls are often required, for example to distinguish businesses or to separate conjoined structures into multiple dwellings. When selected structures turned out to be businesses, empty or destroyed houses, or other non-dwelling structures, each misidentified structure was replaced by a randomly selected replacement dwelling. If such misidentification is not excessive and does not systematically vary across the sampled area, the sample can be assumed to remain unbiased. However, misidentification can increase costs and needs to be monitored closely. Systematic variation in the misidentification of households across the sampled area may bias the sample (for example, underrepresenting areas with many high-rise buildings if the true number of dwellings within high-rises is systematically under-identified in a rooftop count). From a practical point of view, interviewers also sometimes struggled to find the selected households in dense areas, because no addresses were available, only a rooftop view with a GPS point. This drawback can be mitigated, however, by equipping interviewers with GPS-capable phones and clear walking maps that point out local landmarks and house characteristics to help with identification.
The second approach used in the DRC, which first defined PSUs and then algorithmically estimated population numbers to allow for an unbiased two-stage selection, posed different challenges. First, refining the algorithm that estimates population density is technically more complex than a simple visual count of dwellings based on satellite imagery. Once in place, however, it can quickly create automated population estimates for large areas. A second challenge is the loss of statistical efficiency inherent in the two-stage approach. Third, interviewers carrying out the listing within selected PSUs sometimes struggled to follow PSU boundaries and to distinguish which buildings were within or outside a given PSU. To minimize such problems, it is critical to prepare clear walking maps for interviewers and guidelines on how to deal with overlapping properties.
In the 28 PSUs in the Makala municipality in the Funa district of Kinshasa, both manual counting of residential buildings (the first method) and the modeling approach (the second method) were used, permitting a comparison between the two methods and the actual number of households identified in the field listing. Compared to an actual total of 9322 households recorded by the listing, the manual approach identified 7489 dwellings, while the modeling approach generated 10,667 dwellings in the same area. The correlations between the estimated and the actual values at the PSU level were 88.7% and 93.1% for the manual approach and the modeled approach, respectively. This important result indicates that the algorithm outperformed manual counting, at least for this application. See Fig. 3 for a comparison of the dwelling counts estimated by the two methods with the household totals generated in the listing operation, for the 28 PSUs in Makala.
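The aggregate totals reported above can be turned into relative differences with a short calculation. The figures come from the text, while the variable and function names are illustrative. Note that the listing counts households and the two methods count dwellings, so the comparison is approximate.

```python
actual_households = 9322                          # field listing total for the 28 PSUs
estimates = {"manual": 7489, "modeled": 10667}    # dwelling totals from each method

def pct_difference(estimate, actual):
    """Relative difference of an estimate from the listing total, in percent."""
    return 100.0 * (estimate - actual) / actual

differences = {
    name: round(pct_difference(value, actual_households), 1)
    for name, value in estimates.items()
}
print(differences)  # {'manual': -19.7, 'modeled': 14.4}
```

At the aggregate level, the manual count fell about 20% below the listing total and the modeled count about 14% above it. For selection with probability proportional to size, what matters most is how closely the estimates track the true counts across PSUs, which is what the PSU-level correlations measure.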