Background

There is a constant need to estimate population size and density for the purposes of planning resource requirements or assessing health needs. For reasons of timeliness, cost or practicality, data are often obtained through surveys that aim to collect representative samples. Public health specialists traditionally rely on detailed sample frames to survey populations. There are, however, many situations (such as those involving displaced populations in emergencies) in which detailed sample frames are unavailable or infeasible to construct. Only a small number of sampling methods are suitable for such situations.

Ecological methods, which often do not require a detailed sample frame, can offer practical solutions to household sampling problems and are currently being explored. These methods include sequential sampling techniques to estimate prevalence or program coverage [1, 2], capture-recapture techniques [3, 4], adaptive sampling [5], T-Square sampling [6] and Catana's wandering quarter method [7] to estimate population size and density.

One of the problems in validating and verifying sampling methods used in situations devoid of sampling frames is the difficulty in analysing the properties of the sampling methods [8]. Traditional optimisation of sampling methods is done using computationally intensive re-sampling techniques such as Monte Carlo (MC) or Latin Hypercube Sampling (LHS) simulations, while experimenting with different permutations of the parameters of the sampling method on simulated or real population data. Further, from a theoretical perspective, there are infinitely many scenarios (covering a wide distribution of household and individual data) for which the sampling method requires validation and verification.

Mathematical Programming (MP) provides a powerful tool to rigorously optimise the properties of sampling methods [8]. The key advantage of MP is that it provides a more directed and less computing-intensive approach to optimisation than traditional methods. The purpose of this paper is to demonstrate this methodology in practice. Optimisation of a sampling method through MP can be considered the first step in a four-step validation procedure, as shown in figure 1. Here, we explore optimisation as a first step in developing an alternative sampling method, using the T-Square sampling method to estimate human population sizes as an example.

Figure 1

Validation steps of a household survey sampling method.

T-Square sampling is a distance-based sampling method whose statistical properties have been thoroughly investigated [9-14]. It has been used in ecology to estimate sizes, densities and deviations from random spatial distributions of mainly plant populations [15] and more recently it has been used to estimate the size of displaced human populations in emergency situations [6, 16, 17].

Estimating human populations in emergencies using distance-based methods, such as the T-Square, relies on collecting data on distances between households (shelters) rather than on households per se. Advantages of distance sampling methods include:

  • Human population density can be estimated even when not every household per unit area is detected;

  • The same population density estimate can be calculated from data independently collected by multiple observers;

  • A relatively small number of distances need to be measured;

  • It may be less resource intensive and potentially more accurate than traditional sampling methods such as the quadrat method [6, 16].

Two of the substantive issues to be addressed in this paper are whether:

  • The assumptions on which the T-Square method is originally based for estimating plant population sizes are equally valid for estimating human population sizes;

  • The T-Square method can be optimised.

Analysis

T-Square sampling and other distance-based methods

Two of the simplest distance-based methods to estimate population densities are those which measure distances between a random geographical point and its nearest household or a randomly selected household and its nearest neighbour. If the households are randomly distributed in the region of interest, both approaches are equivalent. On the other hand, if the households are aggregated, the assumption of randomness can be violated and both methods are prone to bias. However, the bias of the two methods in estimating population densities tends to be in opposite directions. This is because when households are aggregated, the average distance from a 'random geographic point to the nearest household' increases while the average distance measured between a 'random household to its nearest neighbour' decreases (figure 2). Using both distances together improves the robustness of the estimation method compared to the use of any estimation method which relies on either distance measure on its own.
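The opposite directions of bias can be demonstrated with a small simulation. The sketch below is illustrative only: the clustered pattern, region size and sample sizes are our own choices (not from the paper), and the density estimator is the simple method-of-moments one implied by the Poisson model of Appendix I.

```python
import numpy as np

rng = np.random.default_rng(0)

# An aggregated pattern: 20 cluster centres in a 10 x 10 region, each with
# 10 households scattered tightly around it (200 households in total).
centres = rng.uniform(0, 10, size=(20, 2))
households = np.vstack([c + rng.normal(0, 0.15, size=(10, 2)) for c in centres])
true_density = len(households) / 100.0

def naive_density(distances):
    # Method-of-moments estimator implied by the Poisson model (Appendix I):
    # E[pi * lambda * d^2] = 1 for each measured distance d.
    return len(distances) / (np.pi * np.sum(distances**2))

m = 100
# (a) random geographic point -> nearest household
pts = rng.uniform(0, 10, size=(m, 2))
x = np.linalg.norm(pts[:, None, :] - households[None, :, :], axis=2).min(axis=1)

# (b) random household -> nearest neighbouring household
idx = rng.choice(len(households), size=m, replace=False)
d = np.linalg.norm(households[idx][:, None, :] - households[None, :, :], axis=2)
d[d == 0] = np.inf   # exclude each household's zero distance to itself
y = d.min(axis=1)

print("true density:", true_density)
print("point-to-nearest estimate (biased low under clustering): ", naive_density(x))
print("nearest-neighbour estimate (biased high under clustering):", naive_density(y))
```

With an aggregated pattern the point-to-household distances are long (random points tend to fall between clusters), pulling the first estimate down, while nearest-neighbour distances are short within clusters, pushing the second estimate up.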

Figure 2

A schematic of distance-sampling methods. (Abbreviations: H, household; S, sampling points).

The T-Square method starts by generating random geographical points in the region of interest (Ω), such as point S1 in figure 3. From each point, the distance x is measured to the nearest household H1 along the line C connecting S1 and H1. At H1 the plane is split into two semi-planes, L and R, by a line Q which passes through H1 and is perpendicular to C. The distance y from H1 to the nearest household in the opposite semi-plane R (the semi-plane which does not contain point S1) is then measured. The "T" formed by lines C and Q gives the method its name. The calculation of the population size and population densities based on these distances is explained in detail in Appendix I. The T-Square method assumes "complete spatial randomness". In mathematical terms, this assumption means that the households are described by a spatially homogeneous Poisson process (Appendix I).
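To make the geometry concrete, the sketch below (our illustration; the household coordinates are hypothetical) computes the T-Square pair (x, y) for a single sampling point.

```python
import numpy as np

def t_square_pair(s, households):
    """Return the T-Square distances (x, y) for sampling point s.

    x: distance from s to its nearest household h1.
    y: distance from h1 to the nearest household lying in the semi-plane
       beyond the line Q through h1 perpendicular to the segment s -> h1.
    Assumes s does not coincide exactly with a household.
    """
    households = np.asarray(households, dtype=float)
    d = np.linalg.norm(households - s, axis=1)
    i1 = int(np.argmin(d))
    h1, x = households[i1], d[i1]

    u = (h1 - s) / x                 # unit vector along line C
    rel = households - h1
    in_r = rel @ u > 0               # semi-plane R: households beyond line Q
    in_r[i1] = False
    if not in_r.any():
        return x, None               # no household in R (edge of the region)
    return x, np.linalg.norm(rel[in_r], axis=1).min()

# hypothetical coordinates
households = [(1.0, 1.2), (2.5, 0.8), (3.1, 2.9), (0.4, 2.2), (0.2, 0.5)]
x, y = t_square_pair(np.array([2.0, 2.0]), households)
print(x, y)   # x ~ 1.28 to H1 = (1.0, 1.2); y ~ 1.06 to (0.2, 0.5) in R
```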

Figure 3

T-Square sampling method. (Abbreviations: H, household; S, sampling points; distances labelled x and y; planes labelled L and R; lines labelled Q and C; Ω, region of interest).

An alternative method to T-Square sampling is Catana's 'wandering quarter' method [7]. The principle of the method is illustrated in figure 4. A transect of random direction and a random starting point (S1) is selected. From this point, the closest household (H1) within a 90° vertex (area bounded by the dotted lines) is determined. Starting from this household, the next household (H2) is selected in the same way, resulting in a sequence of distances (x1, x2,...). This process is continued until the nearest household falls outside the survey area. Although the properties of this method have not been as thoroughly studied as those of T-Square sampling, Catana's method does not require the assumption of complete spatial randomness [7, 13].
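The selection rule can be sketched in the same way. The code below is our reading of the method and assumes the 90° vertex keeps the original transect bearing at each step; for simplicity it stops when no unvisited household lies within the vertex (in the field, the walk stops when the nearest household falls outside the survey area).

```python
import numpy as np

def wandering_quarter(start, bearing_deg, households, max_steps=1000):
    """Sketch of Catana's wandering quarter selection rule.

    From the current location, the nearest household within a 90-degree
    vertex (+/- 45 degrees about the transect bearing) becomes the next
    location. Returns the sequence of distances (x1, x2, ...).
    """
    households = np.asarray(households, dtype=float)
    pos = np.asarray(start, dtype=float)
    bearing = np.deg2rad(bearing_deg)   # kept fixed at each step (assumption)
    visited = np.zeros(len(households), dtype=bool)
    distances = []
    for _ in range(max_steps):
        rel = households - pos
        d = np.linalg.norm(rel, axis=1)
        ang = np.arctan2(rel[:, 1], rel[:, 0])
        off = (ang - bearing + np.pi) % (2 * np.pi) - np.pi  # signed angle offset
        in_vertex = (~visited) & (d > 0) & (np.abs(off) <= np.pi / 4)
        if not in_vertex.any():
            break                        # no further household in the quarter
        j = int(np.argmin(np.where(in_vertex, d, np.inf)))
        distances.append(float(d[j]))
        pos, visited[j] = households[j], True
    return distances
```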

Figure 4

Catana's wandering quarter sampling method (Abbreviations: H, household; S, sampling points; x, distance).

Choosing the appropriate distance-based method for use in human populations requires careful practical and theoretical consideration. The distances within which a surveyor can accurately determine the closest household to a random point, or the closest household to a previously selected household, are limited. In practice, it could be difficult to identify precisely the location of a household that occupies a large area. Furthermore, some sampling methods are more sensitive than others to errors in the measurement of angles and distances. In the T-Square method the sample observations are pre-determined, unlike in the wandering quarter method. The wandering quarter method could therefore be more difficult to plan in advance than the T-Square method if health data are to be collected from each household.

In addition to T-Square sampling and Catana's wandering quarter method, there are other distance-based methods such as the line-transect and point-transect distance methods [18, 19]. It could be argued that, although these methods are well established for estimating the abundance of biological populations (plants or animals), extrapolating their use to household surveys would require evaluation. We note, however, that distance-based methods do not replace classical sampling methods where sample frames are available.

Optimisation of the T-Square sampling method

The elements of optimising any household sampling method are the objective function (performance measure) to be optimised (maximised or minimised), the parameters of the method which can be tuned to optimise the objective function, and the constraints that are imposed on the values of these parameters [8]. In the context of optimising the T-Square method this is translated as follows.

The choice of the objective function to be optimised is not arbitrary and should be carefully considered. In real-life applications, a set of empirically-derived objective functions would be proposed and tailored to particular situations. Appendix II derives a simple objective function based on practical considerations. We present several examples of objective functions in the following paragraphs.

The simplest objective functions to be optimised (minimised in this case) are the standard error of the estimate of the average area per household (E) and the "cost" of the sampling (C), defined in a generic sense as a measure of the "quantity of resources" required for sampling (for example, human resources). We can define an objective function which combines the two: T = E + αC, where α is a trade-off scalar, or parameter, with a dual purpose: to scale E and C numerically to the same unit and to weight the relative significance of each in the overall performance measure.

An obvious parameter to tune is the number of sampling points (m). Both terms (E and C) in the above combined objective function depend on m. We would expect E(m) to decrease monotonically with respect to m and C(m) to increase monotonically with m thus providing a trade-off in the choice of m to be optimised.

A key assumption in the optimisation analysis is that the distribution of the households can be described adequately by a two-dimensional spatially homogeneous Poisson process (Appendix I). In using the T-Square method, there is a potential bias in the estimate of the household density (mean number of households per unit area) if the Poisson assumption does not hold. The standard error term E(m) is proportional to $\sqrt{m^{-1}}$ provided the sampling points are well spaced. The constant of proportionality, however, will depend on the underlying distribution and therefore influences the optimal solution. Unlike the expression for E(m), the expression for C(m) is derived from practical considerations. The constraints on m are usually in the form of simple bounds on the sample size, for example greater than zero but less than 60.

For illustrative purposes, we chose the following objective function to be minimised as a first example:

$$T(m) = \sqrt{m^{-1}} + \alpha m^2$$
(1)

The above objective function is the weighted sum of two terms: the standard error of the population size estimate and a quadratic cost relationship. The optimal sample size is sensitive to the choice of the trade-off parameter α. The choice of α balances the importance of maximising the precision of the estimate against minimising cost. In this example, we set α to $10^{-5}$ and the simple bound constraint to 1 ≤ m. Figure 5 shows the variation of T(m) with m.

Figure 5

Objective function corresponding to Equation (1). (Abbreviations: T, objective function; m, number of sampling points).

The minimisation was carried out in Mathematica using a standard non-linear programming optimisation algorithm [20]. The optimal sample size (to the nearest integer) is m* = 58.

Another example of an objective function was chosen to reflect a different cost-sample size relationship:

$$T(m) = \sqrt{m^{-1}} + \alpha \tanh(\beta m)$$
(2)

The standard error term is the same as in the previous example, but the cost term is assumed to increase asymptotically with respect to sample size and is modelled using a hyperbolic tangent function where β is an empirically derived parameter. In the simulation, β is set to 0.002. This relationship represents scenarios where the incremental cost becomes smaller with progressively increasing sample size. The trade-off parameter α was set to unity and the same constraint was used as before. Figure 6 shows the variation of T(m) with m. The optimal sample size (to the nearest integer) in this example is m* = 40.
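Both examples can be reproduced with a few lines of generic code. The sketch below is ours (the paper used Mathematica) and assumes the forms of Equations (1) and (2) as reconstructed above, with α and β as given in the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Equation (1): T(m) = sqrt(1/m) + alpha * m^2, with alpha = 1e-5
T1 = lambda m: np.sqrt(1.0 / m) + 1e-5 * m**2
# Equation (2): T(m) = sqrt(1/m) + alpha * tanh(beta * m), alpha = 1, beta = 0.002
T2 = lambda m: np.sqrt(1.0 / m) + 1.0 * np.tanh(0.002 * m)

for name, T in (("Equation (1)", T1), ("Equation (2)", T2)):
    res = minimize_scalar(T, bounds=(1, 1000), method="bounded")
    print(name, "-> continuous minimum near m =", round(res.x, 1),
          "; integer optimum m* =", round(res.x))
# Equation (2) gives m* = 40, as reported in the text; for Equation (1) the
# continuous minimum lies near m = 57.4, with T(57) and T(58) almost equal.
```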

Figure 6

Objective function corresponding to Equation (2). (Abbreviations: T, objective function; m, number of sampling points).

The two previous simulations were concerned with optimising sample size. Once the optimal sample size is determined, one can envisage a second optimisation stage whose aim is to select the optimal pathway for data collection. This could be required in practice for operational reasons and is not necessarily reflected in the cost function of the first stage optimisation problem. The optimal pathway is defined as the shortest pathway connecting all the sampling points. It is assumed here that one observer would be carrying out the survey.

Assume that the optimal sample size (obtained in the first optimisation step) is m* = 50. Figure 7 shows a two-dimensional display of 50 sampling points chosen randomly in a square plane whose corner points have coordinates (0,0), (0,5), (5,0) and (5,5). The two coordinates of each sampling point are generated independently using a pseudo-random number generator which produces real numbers uniformly distributed between 0 and 5. Ignoring for the time being the straight-line segments, the dots numbered 1 to 50 in figure 7 represent the locations of the random points in the plane. Dot 1 is the location of the first sampling point selected, and dot 50 is the location of the last.

Figure 7

Location of sampling points.

The optimisation is concerned with computing the shortest pathway that connects all the sampling points. This is a classical problem in combinatorial optimisation known as the "Travelling Salesperson Problem" [21]: determine the least-distance route by which a salesperson visits a fixed number of cities, visiting each city exactly once and starting and ending at the same point. The Travelling Salesperson Problem (TSP) is not easy to solve (its computational difficulty increases rapidly with the number of cities) and there is an extensive literature on fast and efficient numerical algorithms for both the classical version and more complex variations of the TSP [22, 23].

Here, we solved the TSP in Mathematica [20, 24] using simulated annealing. Simulated annealing is a stochastic approach for finding the global solution of an optimisation problem that may have multiple local solutions [25]. In this approach, an optimal solution is sought iteratively by randomly selecting, at each step, a point in the neighbourhood of the current solution and then directing the search in subsequent steps to improve the value of the objective function whilst avoiding becoming trapped in a local solution. Simulated annealing has been found to have several advantages over other optimisation methods for solving the TSP [26]. (Additional information and an illustration of simulated annealing are given in [27].)
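For illustration, the sketch below (ours, not the Mathematica implementation) applies simulated annealing with 2-opt neighbourhood moves to the set-up of figure 7: 50 points distributed uniformly on a 5 × 5 square.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(0, 5, size=(50, 2))   # 50 random sampling points, as in figure 7

def tour_length(order):
    """Total length of the closed tour visiting pts in the given order."""
    p = pts[order]
    return np.sum(np.linalg.norm(p - np.roll(p, -1, axis=0), axis=1))

order = rng.permutation(50)
best_len = tour_length(order)
temp = 1.0
for _ in range(200_000):
    i, j = sorted(rng.integers(0, 50, size=2))
    if i == j:
        continue
    cand = order.copy()
    cand[i:j + 1] = cand[i:j + 1][::-1]           # 2-opt move: reverse a segment
    delta = tour_length(cand) - tour_length(order)
    # Always accept improvements; accept worse tours with a probability that
    # shrinks as the temperature falls, so the search can escape local optima.
    if delta < 0 or rng.random() < np.exp(-delta / temp):
        order = cand
        best_len = min(best_len, tour_length(order))
    temp *= 0.99995                               # geometric cooling schedule
print("shortest closed pathway found:", round(best_len, 3))
```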

Figure 8 is a schematic diagram of a plausible sequence of steps to apply the optimised T-Square method in practice. This is an extension of the chronology of steps proposed by Grais et al [6]. The first step defines the elements of the first optimisation problem, namely the standard error of the average area per household, the cost-sample size relationship and the constraints on the sample size. The second step solves for the optimal sample size. The third step generates the random coordinates of the sampling points bounded by the perimeter of domain Ω (the region of interest). The fourth step computes the optimal pathway. Starting from any sampling point on the optimal pathway and moving in either direction (clockwise or counter-clockwise), the fifth step collects the pair of distances comprising: (i) the distance from the random sampling point to the nearest household; and (ii) the distance from that household to its nearest neighbour in the opposite semi-plane of the T-Square. The sixth step applies the T-Square statistics to test the null hypothesis that the distribution of the households is completely random (Appendix I). If the null hypothesis is rejected, the optimisation procedure yields a sub-optimal solution. Note that the optimisation in Step 2 is done only once, whereas the optimisation in Step 4 is required for each set of sampling points.

Figure 8

An illustration of the steps followed when applying the T-Square method in practice.

Because of the strict condition of complete randomness demanded by the T-Square sampling method, it is unlikely that this method would always be applicable. Catana's method could prove a valid alternative in the sense that it does not require complete spatial randomness; however, no results have been published on its use in human populations. As with the T-Square method, Catana's method also has some practical restrictions, as discussed previously.

Conclusion

The purpose of this paper was to illustrate the principle of optimising a household sampling method in situations where sampling frames are unavailable. We chose the T-Square method as the exemplar because it holds promise for estimating population sizes in such situations. The optimisation of the T-Square method was demonstrated using a simple illustrative example depicting scenarios that are faithful to the basic assumption of the method, namely that the distribution of the households can be described by a two-dimensional homogeneous Poisson process. If this assumption does not hold, the proposed optimisation procedure would likely be sub-optimal. Further work should investigate optimising the T-Square method in more realistic scenarios, in which the distribution of the households is not described by a spatially homogeneous Poisson process.

The rigorous optimisation approach demonstrated here on the T-Square method can be applied to any other sampling method. Traditionally, sampling methods were validated using computer simulations and were not formally optimised. The scope of the traditional computing-intensive approaches is somewhat limited, and a mathematical approach to validation and optimisation is therefore warranted [8].

Optimisation of sampling methods provides important information for surveys in contexts where sampling frames are not available. These techniques may be embedded in computer software used by field survey teams without requiring technical knowledge of the underlying algorithms. That is, a user interface allowing survey teams to enter their objective function and generate an optimal survey strategy can mask the formulae, making the methods easier to use for non-technical survey teams. Instead of asking survey teams to define the objective function, they could be led through a set of heuristics which provide the number of points to be sampled. For example, in the case of the T-Square method: if the distribution of dwellings is uniform (e.g. as in a street-structured refugee camp), then sample m1 points; if the distribution of dwellings is clumped (e.g. as in a village-structured refugee camp), then sample m2 points. Alternatively, a similar set of heuristic questions could be translated into an objective function behind the user interface. The second stage of optimisation, the travelling salesperson problem, could likewise be contained within computer software and adapted for use in the field. These heuristics could be tailored to the key issues at hand in other sampling methods.

Appendix I. Statistical properties of the T-Square sampling method

The T-Square sampling method is illustrated in figure 3. We assume that individuals live in households that are not enumerated (i.e. there is no sampling frame); in emergencies, impromptu shelters grouped haphazardly represent households. Points H1, H2 and H3 represent the locations of three of the households. The region of interest (Ω) could contain n households (H1...Hn). Point S1 is an arbitrarily chosen point in Ω, one of a sample of m points (S1...Sm) which are generated randomly and used as anchors for the estimation method.

Recall the description of figure 3. C is the straight line joining S1 to the nearest household (H1). Q is the line perpendicular to C at household H1. Q partitions the Ω plane into two semi-planes R and L indicated by the arrows. Household H2 is the nearest to H1 on the R semi-plane. The distance between S1 and H1, and the distance between H1 and H2 are denoted by x and y, respectively.

The primary assumption of the T-Square method is that the objects of interest (plants or households) are distributed randomly within the region of interest, i.e. their spatial distribution is described by a two-dimensional homogeneous Poisson point process [11, 12]. This means that, for any two non-overlapping regions A and B (within Ω) of areas δA and δB respectively, the numbers of households in A and B are statistically independent, and each follows a Poisson distribution whose mean is proportional to the area:

$$p(N_A = k) = \exp(-\lambda \delta A) \times \frac{(\lambda \delta A)^k}{k!}, \qquad p(N_B = k) = \exp(-\lambda \delta B) \times \frac{(\lambda \delta B)^k}{k!}$$
(I.1)

In Equation (I.1), $N_A$ and $N_B$ are respectively the numbers of households in regions A and B, and λ is the density (number of households per unit area) of the underpinning Poisson process and the parameter to be estimated.

Of course, the principal assumption of the T-Square method is very restrictive in the context of human population estimates. There are several statistical tests available to test for complete randomness of spatial point patterns [9, 12-14, 28-31]. The relaxation of this assumption has implications for the robustness of the method used to estimate λ (see below) [12].

Recall that x is the distance between point S1 and household H1. Consider next the ensemble of all such distances between the randomly chosen sample points (S1...Sm) and their nearest households (H1...Hm) and assume for simplicity that n = m. The probability density function (pdf) of x is [9, 31]

$$f(x) = 2\pi\lambda x \exp(-\pi\lambda x^2)$$
(I.2)
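The form of Equation (I.2) follows directly from the Poisson assumption; the short derivation below is added for completeness. The distance from a sampling point to its nearest household exceeds x exactly when the disc D(x) of radius x centred on that point contains no household, so by Equation (I.1)

$$P(X > x) = p(N_{D(x)} = 0) = \exp(-\lambda \pi x^2)$$

Differentiating the cumulative distribution function $F(x) = 1 - \exp(-\lambda \pi x^2)$ with respect to x then recovers $f(x) = 2\pi\lambda x \exp(-\pi\lambda x^2)$.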

It follows from Equation (I.2) that the random variable ω defined by ω = 2πλx² is chi-square (χ²) distributed with 2 degrees of freedom [12].

If we selected households at random, instead of sampling points, and measured the distance between each selected household and its nearest neighbour, this distance would have the same pdf as x. However, households cannot be selected at random without first enumerating them.

Distance methods invariably use pairs of distances: between each random point and its nearest household, and between that household and its nearest neighbour (defined in some sense). With reference to figure 3, this means that the pair (x, y) can be used to estimate λ. Besag and Gleaves [9, 12] showed that, under the principal assumption that the households are distributed as a homogeneous Poisson process, $y/\sqrt{2}$ is independent of x and identically distributed to it. In other words, $y/\sqrt{2}$ has the same pdf as x (Equation I.2). Using this statistical feature of the distribution of the pair of variables (x, y), a robust estimator for λ is [12]

$$\eta = \frac{\pi \times \left( \sum_{i=1}^{m} x_i^2 + \frac{1}{2} \sum_{i=1}^{m} y_i^2 \right)}{2m}, \qquad \lambda = \eta^{-1}$$
(I.3)

where η is the average area per household.
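Equation (I.3) translates directly into code; the sketch below is a straightforward transcription (our illustration), taking the field-collected distance arrays as inputs.

```python
import numpy as np

def estimate_eta(x, y):
    """T-Square estimator of the average area per household (Equation I.3).

    x: distances from each random point to its nearest household.
    y: T-Square distances from that household to its nearest
       neighbour in the opposite semi-plane.
    Returns (eta, lambda): average area per household and its
    reciprocal, the household density.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m = len(x)
    eta = np.pi * (np.sum(x**2) + 0.5 * np.sum(y**2)) / (2.0 * m)
    return eta, 1.0 / eta
```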

The principal assumption can be tested using statistical tests appropriate to T-Square sampling [9, 11, 14]. These tests address the null hypothesis that the households (or shelters) are distributed as a homogeneous two-dimensional Poisson process. Under the null hypothesis, the random variable z on the left-hand side of Equation (I.4) [6, 9, 11]

$$z = \frac{t - \frac{1}{2}}{\sqrt{(12m)^{-1}}}$$
(I.4)

is normally distributed with zero mean and unit variance, where

$$t = \frac{\sum_{i=1}^{m} \left( \frac{x_i^2}{x_i^2 + \frac{1}{2} y_i^2} \right)}{m}$$
(I.5)

As argued by Diggle [12] and proposed for practical use in human population estimates by Grais et al [6], hypothesis testing can be carried out as a two-step procedure. In the first step, the above null hypothesis is tested for statistical significance; if found to be statistically not significant, a supplementary null hypothesis is tested. In this second step, the null hypothesis corresponds to u² being χ²-distributed with m − 1 degrees of freedom, where

$$u = \sqrt{\frac{48m}{13m+1}} \times \left( m \log\left( \sum_{i=1}^{m} \left( x_i^2 + \frac{1}{2} y_i^2 \right) \right) - \sum_{i=1}^{m} \log\left( x_i^2 + \frac{1}{2} y_i^2 \right) \right)$$
(I.6)

If both null hypotheses are statistically not significant (i.e. the spatial pattern is consistent with a two-dimensional homogeneous Poisson process), it is justified to use Equation (I.3) to estimate the average area per household (η). The 95% confidence interval for η is calculated as:

$$I = \left[ \eta - 1.96 \times \frac{\eta}{\sqrt{2m}},\ \eta + 1.96 \times \frac{\eta}{\sqrt{2m}} \right]$$
(I.7)

The underlying assumptions concerning the distribution of the households (or shelters) may, however, be violated, as indicated by the statistical tests performed after the field data were collected. In this case, a more robust estimate of η is [12, 13]

$$\eta = \frac{\pi}{m} \times \sqrt{ \sum_{i=1}^{m} x_i^2 \times \frac{1}{2} \sum_{i=1}^{m} y_i^2 }$$
(I.8)

Equation (I.3) (or Equation (I.8)) estimates the average area per household. The human population ρ in the region of interest (Ω) can be estimated by Equation (I.9) [6]

$$\rho = \frac{\kappa \times \Gamma}{\eta}$$
(I.9)

where κ is the average household population and Γ is the total area of region Ω.
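For completeness, the remaining formulae can be transcribed in the same way. The sketch below implements Equations (I.4), (I.5), (I.7), (I.8) and (I.9) as reconstructed above (the supplementary statistic of Equation (I.6) is omitted); κ and Γ are survey inputs.

```python
import numpy as np

def csr_z_statistic(x, y):
    """z statistic of Equations (I.4)-(I.5); approximately N(0, 1)
    under complete spatial randomness."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    t = np.mean(x**2 / (x**2 + 0.5 * y**2))
    return (t - 0.5) / np.sqrt(1.0 / (12.0 * m))

def eta_confidence_interval(eta, m):
    """95% confidence interval for eta (Equation I.7)."""
    half_width = 1.96 * eta / np.sqrt(2.0 * m)
    return eta - half_width, eta + half_width

def eta_robust(x, y):
    """More robust estimator of eta (Equation I.8)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    return (np.pi / m) * np.sqrt(np.sum(x**2) * 0.5 * np.sum(y**2))

def population_size(kappa, total_area, eta):
    """Population of the region (Equation I.9): average household
    size times the estimated number of households (area / eta)."""
    return kappa * total_area / eta
```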

Appendix II. Objective function

This section describes a simple objective function which has been used in practice to determine sample size requirements in cluster surveys on the provision of water, sanitation and hygiene. The cluster surveys used a two-stage sampling approach. In the first stage, the primary sampling units (PSUs) were selected with probability proportional to their size. In the second stage, a simple random sample of size b was taken from each PSU, where b is the number of basic sampling units (BSUs) sampled within each PSU; b is also known as the 'take'.

The objective function describes the relationship between the survey cost and the number of BSUs. The total sample size (s) is determined by the number of clusters (c) and the number of BSUs per cluster (s = c × b). The total survey cost ($C_{survey}$) is the sum of a fixed cost ($C_{fixed}$), which is independent of b, and a variable cost ($C_{variable}$), which depends on b and c.

$$C_{survey} = C_{fixed} + C_{variable}$$
(II.1)

The variable cost is given by

$$C_{variable} = c \times C_{PSU} + c \times b \times C_{BSU}$$
(II.2)

where $C_{PSU}$ and $C_{BSU}$ are respectively the survey cost per PSU and per BSU. If we set $C_{ratio} = \frac{C_{PSU}}{C_{BSU}}$ and assume without loss of generality that $C_{BSU} = 1$ (i.e. represent all costs relative to $C_{BSU}$), Equation (II.2) becomes

$$C_{variable} = (C_{ratio} + b) \times c$$
(II.3)

The required number of clusters can be expressed in terms of the expected proportion of the target population, p, and the standard error of its mean estimate, ξ [32]

$$c = \frac{p \times (1-p)}{\xi^2 \times b} \times d_{eff}$$
(II.4)

where $d_{eff}$ is the design effect [33]

$$d_{eff} = 1 + \rho \times (b - 1)$$
(II.5)

where ρ is the rate of homogeneity. Substituting Equations (II.4) and (II.5) into (II.3) gives the expression of $C_{variable}$ in terms of b:

$$C_{variable} = (C_{ratio} + b) \times \frac{p \times (1-p) \times (1 + \rho \times (b-1))}{\xi^2 \times b}$$
(II.6)
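Under these assumptions the optimal take can be found numerically. The sketch below uses hypothetical values for $C_{ratio}$, p, ρ and ξ, chosen for illustration only.

```python
import numpy as np

def c_variable(b, c_ratio, p, rho, xi):
    """Variable survey cost of Equation (II.6), in units of C_BSU."""
    c = p * (1 - p) * (1 + rho * (b - 1)) / (xi**2 * b)  # clusters, Eqs (II.4)-(II.5)
    return (c_ratio + b) * c

# hypothetical inputs: cost ratio 10, coverage 50%, roh 0.05, SE of 2.5 points
b = np.arange(1, 101)
cost = c_variable(b, c_ratio=10.0, p=0.5, rho=0.05, xi=0.025)
print("optimal take b* =", b[np.argmin(cost)])
# Matches the closed form b* = sqrt(C_ratio * (1 - rho) / rho) ~ 13.8, i.e. 14.
```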

Figure 9 shows $C_{survey}/C_{BSU}$ as a function of b.

Figure 9

An example of a practically constructed objective function.