1 Introduction

As of April 2018, the United Nations High Commissioner for Refugees (UNHCR) reported that an estimated 6.6 million Syrians were internally displaced within the country, and that over 5.6 million Syrians had fled to seek refuge in other countries, of whom around 8% were accommodated in camps.Footnote 1 In addition to these official figures, there were anywhere from 0.4 to 1.1 million unregistered Syrian refugees in Lebanon and Jordan, and an estimated one million Syrian asylum-seekers in Europe.Footnote 2 In effect, more than half of Syria’s pre-war population has been forcibly displaced since the beginning of the Syrian civil war.

The Syrian crisis has caused one of the largest episodes of forced displacement since World War II and some of the densest refugee-hosting situations in modern history. Syria’s immediate neighbors host the bulk of Syrian refugees: Turkey, Lebanon, and Jordan rank in the top five countries globally for the number of refugees hosted. According to UNHCR data, as of June 2018, Turkey hosted 3.5 million Syrian refugees, Lebanon 0.97 million, and Jordan 0.66 million. In fact, Lebanon and Jordan hold the top two slots for per-capita recipients of refugees in the world, at 164 and 71 refugees per 1000 inhabitants, respectively (UNHCR 2019).Footnote 3 The influx into these countries has also occurred at a more rapid rate than in prior refugee crises. At one point in the conflict, an average of 6000 Syrians were fleeing into neighboring countries every day.Footnote 4 Beyond the immediate impact of the inflow of refugees, the host countries are also dealing with other consequences of the Syrian conflict, including the disruption of trade and economic activity and the growth and spread of the Islamic State (also called ISIS) in Iraq. While the Kurdish Region of Iraq (KRI) hosts at least 200,000 Syrian refugees, the ISIS-induced displacement from neighboring parts of Iraq means that KRI is now hosting over 2.25 million displaced persons, equivalent to approximately 40–50% of its population.

While each neighboring country has received many Syrian refugees in both absolute and relative terms, that is where the commonality ends. Each country has responded to the influx in its own way, influenced by its previous experience of handling protracted displacement situations. Given its history of encampment of the displaced Palestinian population, Lebanon has refrained from setting up camps for Syrians. There is also understandable wariness and anxiety about the impact the influx may have on the delicate domestic political power-sharing equilibrium. In KRI, the influx of Syrian refugees overlaps with a significant number of Iraqi citizens seeking a safe haven from ISIS militants. The refugees and internally displaced persons (IDPs) are located both inside and outside camps, with very porous camp boundaries that allow residents to move freely and work outside the camps. At the time of the survey, Jordan had an explicit policy of housing refugees in camps, and few refugees had legal residency and/or work permits, although a significant majority of refugees had moved outside the camps.

Creating an evidence base to frame policies for refugees in host environments requires a sampling methodology that yields a sample representative of both the host and refugee populations. There are several challenges associated with conducting a representative survey of the host community population and the forcibly displaced. In all three settings we consider, a reliable and updated sampling frame for the resident population was not available.Footnote 5 No sample frames existed for forcibly displaced populations, as they were excluded from available national sampling frames. Databases maintained by humanitarian agencies for internal programming purposes are often incomplete and out of date. The displaced are also highly mobile and often unwilling to speak to surveyors. In this context, and in similar contexts of forced displacement, the selection of a representative sample of hosts and the displaced becomes a major challenge to drawing credible inferences about their socio-economic outcomes.

In this chapter, we describe the strategies that had to be devised to overcome these challenges when designing the sampling procedure for the Syrian Refugee and Host Community Surveys (SRHCS), which were implemented over 2015–2016 in Lebanon, Jordan, and the Kurdistan region of Iraq.Footnote 6 Section 2 describes the innovative use of available information to come up with a strategy for generating representative samples of host community and refugee households in the three settings. Section 3 presents the implementation of this strategy. Section 4 concludes by highlighting implementation challenges and drawing general lessons from our experience on sampling forcibly displaced populations.

2 The Innovation

In all three settings, the main challenge to implementing a survey that would yield estimates representative of the refugee and host community populations was the lack of an updated or comprehensive sample frame, both for hosting populations and especially for displaced populations. In general, the latter were completely missing from existing national sample frames. At the time, none of the three countries had a recent population and housing census, duly updated for population growth and movement, which could have provided the frame from which to choose the survey sample for the hosting community.

Each of the three contexts presented different challenges. Neither Lebanon nor Iraq had conducted a census for several decades, and existing sample frames were out of date at the time of the SRHCS. In Lebanon, information from the existing sample frame was not available at low levels of geographic disaggregation, while in Iraq, the internal displacement of millions of Iraqis had made existing frames obsolete. In Jordan, while census exercises are undertaken every decade, data from the most recent census were not available for the SRHCS, and we had to rely on a relatively outdated sample frame based on the 2005 census. Differences in the distribution of Syrian refugees across the three contexts also called for a country-specific approach. In Lebanon, there were no refugee camps for Syrians; in Jordan, there were two main refugee camps for Syrians; and in Kurdistan, Iraq, Syrians as well as Iraqi IDPs lived in camps but were also free to move in and out.

Defining a sampling strategy to yield representative samples of hosts and displaced populations in this context involved two key innovations. The first was the creation, from large geographical divisions, of a sample frame feasible for household listing operations where none existed. This was the case in Lebanon and in the two largest refugee camps in Jordan. In Lebanon, cartographic divisions of the country were only available for large areas, and had to be segmented and subsegmented based on satellite imagery and dwelling counts to yield geographic areas small enough for listing. These segmentations attempted to divide the larger areas into subdivisions or segments of equal population size, in much the same way as enumeration areas are generated. Similarly, for the two largest refugee camps in Jordan, Za’atari and Azraq, satellite imagery was used to divide the camps into mutually exclusive and collectively exhaustive sampling units of roughly equal population size.

The second innovation was the use of available information from different sources on the prevalence of displaced populations, which was incorporated into the sample frames for the host population. In most cases, this information was only available at a geographic level higher than the smaller sampling units used in the final frame. These data allowed for the estimation of known probabilities of selection. The first-stage sample selection assumed that this prevalence was uniform over the larger geographic area, and hence across the sampling units within it. The household listing operation in the selected small sampling units was then used to update this known (albeit incorrect) probability of selection. In Lebanon and Kurdistan, auxiliary information on the spatial distribution of refugees and IDPs, available from the UNHCR and the International Organization for Migration (IOM), was merged with the sampling frame. Subdistrict-level refugee and IDP prevalence information was used to stratify subdistricts by intensity of prevalence: low, middle, and high. The sample was further stratified into subgroups of interest, depending on the context. In Lebanon, the survey was representative of the host community and the Syrian refugee population. In Kurdistan, the scope of the survey was expanded to include IDPs, so that the survey was representative of the host community, Syrian refugees inside and outside of camps, and IDPs inside and outside of camps.

3 Implementation

In what follows, we detail the sampling strategy for Lebanon, which was the most complicated, and then describe the strategy for the other two contexts.

Lebanon. Conducting a representative survey in Lebanon was especially challenging. The first difficulty was that, as of 2015, there was no recent or reliable sample frame, even for Lebanese households, as the last official population census was conducted in 1932. Typically, such a sample frame consists of the universe of enumeration areas in a country, with associated estimates of population. This meant that we had to construct our own sample frame by selecting a few Small Area Units (SAUs) and then conducting a full listing operation by visiting every household within the selected SAUs and collecting basic demographic and contact information. The second difficulty was that there was no available cartographic division of the country into geographic areas small enough to be the subject of a full listing operation, which could then serve as a sampling frame for the SAUs. Circonscriptions Foncières (CFs) were the finest level of disaggregation available; CFs are generally too large to be listed, as some have populations of over 100,000. Finally, there was no available sampling frame for Syrian refugees in Lebanon, which meant that we had to depend on UNHCR data on registered Syrian refugees, combined with the estimates of the Lebanese population at the CF level. Given these challenges, as well as time and budgetary constraints, the sample was selected in four stages, as described below.

3.1 First Sampling Stage

The sample frame for the first stage combines the list of 1301 CFs published by the Council for Development and Reconstruction (CDR) in 2004 with the 2014 UNHCR registration database. Each CF is identified by way of its administrative affiliation: Kaza, Qadha, and Mohafza. The UNHCR database reports the total population in each CF, as well as the Lebanese and Syrian populations in each.Footnote 7,Footnote 8,Footnote 9 The CF cartographic boundaries are described digitally in a linked Geographic Information System shape file.

The CFs were sorted into three strata depending on their ex-ante prevalence of Syrian population, as follows:

  • Low prevalence: where the Syrian population accounted for less than 20% of the total population;

  • Medium prevalence: where the Syrian population accounted for between 20 and 50% of the total population;

  • High prevalence: where the Syrian population accounted for over 50% of the total population.

Prevalence of Syrian refugees at the CF level was defined as the number of registered Syrian refugees from the 2014 UNHCR database divided by the sum of the number of registered Syrian refugees and the 2004 Lebanese population counts from the CDR database. The first columns of Table 1 show the distribution of the CFs into strata, as well as the population in each stratum, as per the UNHCR database.
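The prevalence definition and the three strata above can be sketched in Python as follows. This is a minimal illustration rather than the survey's actual code: the function names and the example CFs are hypothetical, and the treatment of the exact boundary values (20% and 50% are both placed in the medium stratum) is an assumption, since the text does not say where ties fall.

```python
def syrian_prevalence(registered_syrians: int, lebanese_2004: int) -> float:
    """Registered Syrian refugees as a share of the combined CF population."""
    total = registered_syrians + lebanese_2004
    if total == 0:
        raise ValueError("CF has no recorded population")
    return registered_syrians / total


def prevalence_stratum(prevalence: float) -> str:
    """Map a prevalence share to the low / medium / high stratum.

    Assumption: exactly 20% and exactly 50% both fall in 'medium'.
    """
    if prevalence < 0.20:
        return "low"
    if prevalence <= 0.50:
        return "medium"
    return "high"


# Hypothetical CFs: (name, registered Syrians in 2014, Lebanese population in 2004)
example_cfs = [("CF A", 500, 9500), ("CF B", 3000, 7000), ("CF C", 6000, 4000)]
example_strata = {
    name: prevalence_stratum(syrian_prevalence(syrians, lebanese))
    for name, syrians, lebanese in example_cfs
}
print(example_strata)  # {'CF A': 'low', 'CF B': 'medium', 'CF C': 'high'}
```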

Table 1 Syrian Refugee and Host Community Survey: sampling strata—Lebanon

Our intention was to select 75 CFs in total. The decision of how to distribute them across the three strata faced the classical dilemma of whether to do it in proportion to the population of the strata, which would deliver nearly optimal estimates for the country as a whole, or to allocate the same sample size (i.e. 25 CFs) to each stratum, which would deliver estimates of nearly the same quality for each of them. Since both considerations were important for the 2015 SRHCS, we opted to do it in accordance with Markwardt’s rule (also known as the ‘50/50 equal/proportional allocation’), which is generally considered a good compromise between the two extremes. The last three columns in Table 1 show the chosen allocation, the corresponding sample sizes (in number of households), and the expected maximum margins of error.Footnote 10
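To make the compromise concrete, the sketch below reads the ‘50/50 equal/proportional allocation’ as the simple average of the equal and the proportional allocations. That reading, the largest-remainder rounding, and the stratum populations in the example are all assumptions made for illustration; they are not the figures in Table 1.

```python
def compromise_allocation(stratum_pop: dict, total_units: int) -> dict:
    """Average an equal and a proportional allocation across strata, then
    round to integers that still sum to total_units (largest remainders)."""
    n_strata = len(stratum_pop)
    pop_total = sum(stratum_pop.values())
    raw = {
        h: 0.5 * total_units / n_strata + 0.5 * total_units * pop / pop_total
        for h, pop in stratum_pop.items()
    }
    alloc = {h: int(x) for h, x in raw.items()}  # floor, values are positive
    leftover = total_units - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical stratum populations (NOT the Table 1 values).
example_pop = {"low": 3_000_000, "medium": 1_500_000, "high": 500_000}
example_alloc = compromise_allocation(example_pop, 75)
print(example_alloc)  # {'low': 35, 'medium': 24, 'high': 16}
```

In this example, a purely proportional allocation would give the high-prevalence stratum only 7 or 8 of the 75 CFs, whereas the compromise gives it 16, at the cost of fewer CFs in the largest stratum.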

Within each stratum, CFs were selected for inclusion with probability proportional to size (PPS), using the total population as a measure of size, and with implicit stratification by administrative units (Kaza, Qadha and Mohafza). Some of the large CFs were selected more than once. For instance, there were 34 selections made from among the ‘low prevalence’ CFs (as per Table 1), and one extremely populous CF (Chiyah, located in Mount Lebanon) was randomly selected three times. As a result, the 75 selections were drawn from 71 different CFs. Annex Table 1 shows the list of sampled CFs, where the last column indicates the number of times each CF was selected in the sample (i.e. one, two or three times).
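One common way to obtain PPS selection with implicit stratification, and with the possibility of repeated hits on large units, is systematic selection along a list sorted by administrative unit. The text does not state which PPS algorithm was used, so the sketch below should be read as an assumption; the function name and the example CF sizes are hypothetical.

```python
import random


def systematic_pps(units, sizes, n_draws, start=None, rng=None):
    """Systematic PPS selection over an ordered list of units.

    Returns {unit: number of hits}. A unit larger than the sampling
    interval can be hit more than once.
    """
    total = float(sum(sizes))
    step = total / n_draws  # sampling interval
    if start is None:
        rng = rng or random.Random()
        start = rng.random() * step  # random start in [0, step)
    points = [start + k * step for k in range(n_draws)]

    hits = {}
    cumulative = 0.0
    next_point = 0
    for unit, size in zip(units, sizes):
        cumulative += size
        # Every selection point that falls inside this unit's range is a hit.
        while next_point < n_draws and points[next_point] < cumulative:
            hits[unit] = hits.get(unit, 0) + 1
            next_point += 1
    return hits


# Hypothetical CFs, already sorted by administrative unit (implicit stratification).
cf_names = ["CF A", "CF B", "CF C", "CF D", "CF E"]
cf_sizes = [100, 900, 200, 300, 500]
example_hits = systematic_pps(cf_names, cf_sizes, n_draws=4, start=250)
print(example_hits)  # {'CF B': 2, 'CF D': 1, 'CF E': 1}
```

Here the sampling interval is 500, so the CF with a size of 900 is hit twice, which mirrors how a very populous CF can enter the sample more than once.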

3.2 Segmentation of Circonscriptions Foncières (PSUs)

Given that CFs are larger than typical census Enumeration Areas, which contain roughly 200 households each, the majority of the selected CFs were too large for a complete household listing operation to be manageable. For this reason, these large CFs were divided into ‘super segments’ and ‘segments’ of roughly equal size within each category, using the total number of households as a measure of size. The number of households in each ‘super segment’ or ‘segment’ was estimated from the observed height of buildings and the estimated population density in each area in the 2015 ESRI World ImageryFootnote 11 and 2015 Google Earth imagery, combined with local knowledge of these areas.

Based on the estimated measure of size, only five CFs were considered too large and hence were selected for ‘super segmentation’. At a later stage, all CFs and ‘super segments’ were divided into ‘segments’ due to their large size.

3.3 Second Sampling Stage: Super Segmentation of Circonscriptions Foncières

In the second stage, the boundaries of the ‘super segments’ in each CF were drawn using the 2015 ESRI World imagery basemap. These boundaries take into account the total estimated household count, as well as natural boundaries such as major roads, rivers, and paths that can easily be recognizable by field teams during the listing operation and implementation of the household questionnaire.

Within each super-segmented CF, the sample ‘super segments’ were selected with equal probability, based on the assumption that each ‘super segment’ is of roughly equal size. The number of ‘super segments’ selected within each CF was the same as the number of times the corresponding CF was selected in the first sampling stage. For instance, if a CF was selected three times in the first sampling stage, we selected three ‘super segments’ within this CF. Similarly, if a CF was selected only once or twice in the first sampling stage, we correspondingly selected one or two ‘super segments’ in the second sampling stage.
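The second-stage rule can be sketched as an equal-probability draw in which the number of ‘super segments’ taken equals the number of first-stage hits. The sketch assumes selection without replacement, which the text does not state explicitly; the function name and the identifiers are hypothetical.

```python
import random


def select_super_segments(super_segment_ids, n_hits, rng):
    """Draw n_hits super segments with equal probability, without replacement.

    Returns the selected ids and the conditional selection probability,
    i.e. (number drawn) / (number of super segments in the CF).
    """
    total = len(super_segment_ids)
    if not 0 < n_hits <= total:
        raise ValueError("n_hits must be between 1 and the number of super segments")
    chosen = rng.sample(list(super_segment_ids), n_hits)
    return sorted(chosen), n_hits / total


# Hypothetical CF split into six super segments and hit three times at stage one.
example_ids = ["SS1", "SS2", "SS3", "SS4", "SS5", "SS6"]
example_chosen, example_prob2 = select_super_segments(example_ids, 3, random.Random(0))
print(example_chosen, example_prob2)
```

With three draws out of six super segments, the conditional probability is 0.5 for every super segment in the CF, which is the kind of quantity reported as a second-stage probability.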

Annex Table 2 shows the list of ‘super segments’ within selected CFs, where the ninth column indicates the number of times each CF was selected in the sample (i.e. one, two or three times). The column headed ‘Prob 2’ shows the probability of selecting the ‘super segment’ within each CF.

Table 2 List of selected segments (enumeration areas)—Lebanon

3.4 Third Sampling Stage: Segmentation of Circonscriptions Foncières

In a third stage, the boundaries of the ‘segments’ were drawn for all CFs and selected ‘super segments’ within CFs. Similar to the process of ‘super segmentation’, boundaries of segments were drawn using the 2015 ESRI World imagery basemap. These boundaries also take into account the total estimated household count, as well as natural boundaries such as major roads, rivers, and paths.

Within each CF or corresponding ‘super segment’, the sample ‘segments’ were selected with equal probability, with the underlying assumption that each ‘segment’ is of roughly equal size. Annex Table 3 shows the list of ‘segments’ for all CFs, where the last column indicates the probability of selecting the ‘segment’ within each CF in the third sampling stage.

Table 3 List of sample super segments (for CFs divided into super-segments or secondary sampling units)—Lebanon

3.5 Fourth Sampling Stage

The sample frame for the fourth stage is the full list of all households in the sample CF segments. The listing operation consisted of a full enumeration of all physical structures in the area, with each physical structure being classified as a primary or secondary residential dwelling, commercial building, school, hospital, government office, etc. The listing operation collected information about the household occupying each residential dwelling, and each household was classified as either a Syrian refugee household or a host community household. Care was also taken to record two households living in the same unit separately.Footnote 12

To ensure the quality and completeness of the listing operation, enumerators relied on high-resolution paper maps identifying all buildings within each segment. Each building or structure was pre-assigned a unique identifier. Enumerators then created a record for each residential unit and household following the protocol described in the 2015 SRHCS Manual of Enumerator. The 40 households to be visited by the 2015 SRHCS in each segment (with a target of 20 Syrian refugee and 20 non-Syrian refugee households in each) were selected from the listing data by systematic equal-probability sampling.Footnote 13
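Systematic equal-probability sampling from a listing can be sketched as below. The function name and the household identifiers are hypothetical, and taking every listed household when a group has fewer households than its target is an assumption made only so that the example is complete.

```python
import random


def systematic_sample(listing, n_target, start=None, rng=None):
    """Systematic equal-probability sample of n_target records from a listing.

    If the listing holds n_target records or fewer, all of them are returned
    (an assumption for this sketch).
    """
    if n_target >= len(listing):
        return list(listing)
    step = len(listing) / n_target  # fractional sampling interval
    if start is None:
        rng = rng or random.Random()
        start = rng.random() * step  # random start in [0, step)
    last = len(listing) - 1
    return [listing[min(int(start + k * step), last)] for k in range(n_target)]


# Hypothetical listing results for one segment.
host_listing = [f"H{i}" for i in range(100)]
syrian_listing = [f"S{i}" for i in range(12)]

host_sample = systematic_sample(host_listing, 20, start=2.5)
syrian_sample = systematic_sample(syrian_listing, 20)
print(host_sample[:3], len(host_sample), len(syrian_sample))
# ['H2', 'H7', 'H12'] 20 12
```

Because the listing follows the order in which structures were enumerated, a systematic draw spreads the selected households across the whole segment.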

3.6 Selection Probabilities and Sampling Weights

Given the sampling design discussed in the preceding paragraphs, the probability \(p_{\text{hizsj}}\) of selecting household \({\text{hizsj}}\) in segment \({\text{hizs}}\) of super segment \({\text{hiz}}\) in Circonscription Foncière \({\text{hi}}\) of stratum \(h\) is given by:

$$p_{\text{hizsj}} = \frac{k_{h} n_{\text{hi}}}{\sum_{i} n_{\text{hi}}} \times \frac{t_{\text{hi}}}{T_{\text{hi}}} \times \frac{g_{\text{hi}}}{G_{\text{hi}}} \times \frac{m_{\text{hizs}}}{n_{\text{hizs}}^{'}}$$

where the four fractions on the right-hand side respectively represent the probability of selecting the CF in the first stage, and the conditional probabilities of selecting the super segment, the segment, and the household in the second, third, and fourth stages, and:

  • \(k_{h}\) is the number of CFs selected in the stratum (the fifth column in Table 1),

  • \(n_{\text{hi}}\) is the number of households in the CF, as per the sample frame (the column headed ‘population’ in Table 1),

  • \(t_{\text{hi}}\) is the number of ‘super segments’ to be drawn in the CF, as per the first sampling stage (the column headed ‘No. super segments selected’ in Annex Table 2),

  • \(T_{\text{hi}}\) is the total number of ‘super segments’ in the CF, as per the segmentation procedure (the column headed ‘No. of super segments’ in Annex Table 2),

  • \(g_{\text{hi}}\) is the number of segments to be drawn in the CF, as per the second sampling stage (the column headed ‘n_segments to draw’ in Annex Table 3),

  • \(G_{\text{hi}}\) is the total number of segments in the CF, as per the segmentation procedure in the third sampling stage (the column headed ‘n_segments per SSU’ in Annex Table 3),

  • \(m_{\text{hizs}}^{S}\) is the total number of households identified as Syrian refugees during the household listing operation,

  • \(m_{\text{hizs}}\) is the number of households selected in the segmented CF (with a target of 20 Syrian-refugee and 20 non-Syrian-refugee households in this case), or \(m_{\text{hizs}} = m_{\text{hizs}}^{S} + (40 - m_{\text{hizs}}^{S})\),

  • \(n_{\text{hizs}}^{'}\) is the number of households in the segmented CF, as per the household listing operation.

To deliver unbiased estimates from the sample, the data from each household \({\text{hizsj}}\) should be weighted by a sampling weight (or raising factor) \(w_{\text{hizsj}}\), equal to the inverse of its selection probability (i.e. \(w_{\text{hizsj}} = p_{\text{hizsj}}^{-1}\)).
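As a numerical illustration, the four-stage product and the resulting weight can be computed as in the sketch below. The argument names are chosen for this example and all input values are hypothetical, not taken from Table 1 or the annex tables. Setting the second-stage fraction to 1 for a CF that was not super segmented is an assumption of the sketch.

```python
import math


def selection_probability(k_h, n_hi, stratum_total, t_hi, T_hi, g_hi, G_hi,
                          m_selected, n_listed):
    """Product of the four stage-wise selection probabilities."""
    p_cf = k_h * n_hi / stratum_total   # stage 1: PPS selection of the CF
    p_super = t_hi / T_hi               # stage 2: super segment within the CF
    p_segment = g_hi / G_hi             # stage 3: segment
    p_household = m_selected / n_listed  # stage 4: household within the segment
    return p_cf * p_super * p_segment * p_household


# Hypothetical inputs: 34 CFs drawn in a stratum of 2,000,000 households,
# a CF of 20,000 households, 1 of 4 super segments, 1 of 10 segments,
# and 40 of 200 listed households.
example_p = selection_probability(
    k_h=34, n_hi=20_000, stratum_total=2_000_000,
    t_hi=1, T_hi=4, g_hi=1, G_hi=10,
    m_selected=40, n_listed=200,
)
example_weight = 1 / example_p  # sampling weight (raising factor)
print(round(example_p, 6), round(example_weight, 1))  # 0.0017 588.2
```

In this example each sampled household stands for roughly 588 households in the population. For a very large CF the first fraction can exceed 1, in which case it is an expected number of hits rather than a probability.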

Kurdistan. Much of the sampling procedure in Kurdistan resembled that of Lebanon, except for one important difference: unlike in Lebanon, a frame for the first-stage sample existed in Kurdistan (albeit outdated), and a subset of the enumeration areas had updated population information from the 2012 IHSES survey (which did not take into account subsequent internal displacement). A subsample of the 2012 clusters was selected for our survey, followed by a comprehensive listing exercise to update the frame for second-stage sampling. Four strata based on refugee and IDP prevalence were defined as follows:

  • Low Syrian prevalence (<5%) and low IDP prevalence (<15%);

  • Low Syrian prevalence (<5%) and high IDP prevalence (≥15%);

  • High Syrian prevalence (≥5%) and low IDP prevalence (<15%);

  • High Syrian prevalence (≥5%) and high IDP prevalence (≥15%).

In the first stage, within each stratum, enumeration areas (EAs) were selected with PPS, using the number of households reported from the 2012 listing exercise as a measure of size. In the second stage, 18 households per PSU were selected: six Syrian households, six IDP households, and six host community households in each PSU, to the extent possible. In areas where there were fewer than six Syrian or IDP households, the shortfall was met by host community households. The sampling frame for second-stage sampling was the complete list of households in the selected EAs from the listing exercise.
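The second-stage quota rule, including the use of host community households to cover any shortfall, can be sketched as follows. The function name and the listing counts are hypothetical, and the sketch assumes that the host quota is capped by the number of host households actually listed.

```python
def household_quota(n_syrian, n_idp, n_host, per_group=6):
    """Number of households to draw per group in one PSU.

    Any shortfall of Syrian or IDP households is made up with host
    community households, capped by the host households listed.
    """
    syrian = min(per_group, n_syrian)
    idp = min(per_group, n_idp)
    shortfall = 2 * per_group - syrian - idp
    host = min(per_group + shortfall, n_host)
    return {"syrian": syrian, "idp": idp, "host": host}


# Hypothetical PSU with 2 Syrian, 0 IDP and 150 host households listed.
example_quota = household_quota(n_syrian=2, n_idp=0, n_host=150)
print(example_quota)  # {'syrian': 2, 'idp': 0, 'host': 16}
```

The total remains 18 households per PSU whenever enough host community households are listed.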

Jordan. In contrast to Lebanon and Iraq, Jordan has carried out Population and Housing Censuses at regular intervals, most recently in late 2015. What was particularly attractive about the latest census from the perspective of sampling was that it explicitly asked about the nationality of all residents. This would have allowed stratification of areas by density of Syrians. However, the original design could not be implemented because we could not access the new sample frame based on the 2015 Jordanian census. The design was then amended to include a representative sample of the Azraq and Za’atari camps (which account for the vast majority of Syrian refugees in camps in Jordan). This sample was complemented by purposive samples of the surrounding governorates, Mafraq and Zarqa, where the sample included areas physically proximate to the camps and other areas with a high number of Syrian refugees. In Amman Governorate, a purposive sample was drawn, combining a geographically distributed sample with a sample of areas with a high prevalence of Syrian refugees per the 2015 census, as indicated by the Jordanian Department of Statistics. Analytically, this implies that the insights from Jordan will be limited to camp residents, areas neighboring the camps, and Amman Governorate.

4 Implementation Challenges, Lessons Learned, and Next Steps

The three surveys described in this paper were designed to generate comparable findings on the lives and livelihoods of Syrian refugees and host communities in the three settings. The absence of updated national sample frames and the lack of a comprehensive mapping of the forcibly displaced within these countries posed challenges for the design of these surveys. These challenges are not unique; indeed, most developing countries face similar issues, which are exacerbated at times of large-scale internal population movements or in contexts of a large localized or widespread influx of migrants. Such data challenges become particularly stark in countries hosting displaced populations or in situations of ongoing or protracted conflict, as local populations move to escape violence. But the exclusion of displaced persons from national sampling frames, and consequently from national surveys, provides a skewed picture of the world (World Bank 2018a). As the number of displaced persons continues to increase, it becomes all the more urgent to devise strategies to include them in representative socioeconomic surveys.

This methodology paper describes the strategy implemented in the three contexts: generating known ex-ante selection probabilities from a variety of data sources, using geospatial segmentation to create enumeration areas where they did not exist, and using data collected by humanitarian agencies to generate sample frames for displaced populations. The strategies implemented in these surveys can be useful in designing similar exercises in contexts of forced displacement. Moreover, this effort shows the importance of including refugees and non-nationals in national sample frames. The move by Jordan’s statistical agency to explicitly include non-nationals in the 2017/2018 household survey is a commendable step in the right direction.