Combining Environmental Area Frame Surveys of a Finite Population

New ways to combine data from multiple environmental area frame surveys of a finite population are being introduced. Environmental surveys often sample finite populations through area frames. However, to combine multiple surveys without risking bias, design components (inclusion probabilities, etc.) are needed at unit level of the finite population. We show how to derive the design components and exemplify this for three commonly used area frame sampling designs. We show how to produce an unbiased estimator using data from multiple surveys, and how to reduce the risk of introducing significant bias in linear combinations of estimators from multiple surveys. If separate estimators and variance estimators are used in linear combinations, there’s a risk of introducing negative bias. By using pooled variance estimators, the bias of a linear combination estimator can be reduced. National environmental surveys often provide good estimators at national level, while being too sparse to provide sufficiently good estimators for some domains. With the proposed methods, one can plan extra sampling efforts for such domains, without discarding readily available information from the aggregate/national survey. Through simulation, we show that the proposed methods are either unbiased, or yield low variance with small bias, compared to traditionally used methods.


INTRODUCTION
For a traditional finite population survey, one often think of some well-structured list frame covering the population of interest, from which a statistician can draw a sample according to some procedure, in order to produce an efficient and unbiased estimator of some population parameter. When conducting environmental surveys, however, this is often not the case.
Environmental surveys often lack well-structured, comprehensive list frames to sample from. In such settings, it is common to use area frames covering the assumed spread of the population of interest. Examples of environmental surveys using such area frames are national forest inventories (Axelsson et al. 2010), agricultural inventories (Fecso et al. 1986), landscape inventories (Allard 2017), among others. By using area frames, a sample unit becomes a point from a continuous population-the area surface-why there is a need to map the sample properties for the sampled points to the indirectly sampled units in the population of interest.
Other desirable outcomes in environmental surveys are domain estimates, or their counterparts, estimates created by aggregating domain estimates. In the first case, primary surveys are seldom planned with domain estimates in mind, why complementary surveys are often considered. The latter case may especially be considered when dealing with rare populations, or wanting to incorporate a previously conducted domain survey into an aggregate survey (Benedetti et al. 2015).
Scenarios like these, or when dealing with two samples with different designs, connect to the multiple-frame research area. When combining such samples, an optimal linearly combined estimator should be weighted by the variance (Lohr and Rao 2006). Since true variances are most likely not available, variance estimates are often used instead. However, environmental surveys conducted using area frames often have target variables with highly skewed distributions, since the units in the population of interest might be absent in large parts of the area frame. Under such circumstances, the estimators and the variance estimators are susceptible to correlation, which can introduce significant bias into linearly combined estimates using variance estimates as weights (Grafström et al. 2019).
In order to reduce the bias of a combined estimate, we propose two methods: The first approach is a generalization of the combining samples approach derived by Grafström et al. (2019), which combines unit sample properties from an arbitrary number of designs into design components for the combined design. The second approach uses a pooled variance estimator to estimate the variance of each survey's estimator by using all available information from the surveys.
The targeted applications are primarily environmental surveys and monitoring, where it is common to use area frames. Several countries have national landscape and forest monitoring programs that may not be enough to produce regional or domain level estimates, and thus need be complemented on some level to reach specific accuracy targets (Christensen and Ringvall 2013).
With the methodology presented in this paper, there might be a need to link surveys relating to different definitions of statistical units. Hence, this is something that should be planned for from start. We need be able to detect if the same population unit is included in more than one sample (or multiple times in the same sample). However, in most applications, the size of the area being sampled is likely to be very large compared to the area covered in the samples, which makes overlap not particularly common. In area-based surveys, we are likely to have geographical coordinates for at least the statistical unit. These coordinates can easily be used to detect possible overlap between different surveys. In the rare case of possible overlap, it may be difficult identify exactly which population unit that is included multiple times. If this is thought to be an issue, then it may be needed to use markings of coordinates and/or population units in the field to make such identification easier.
In some cases, e.g., for unbiased variance estimation using a combined sample, we need at least partial knowledge of the geographical coordinates of the sampled population units. Such knowledge can be included by the use of accurate satellite-based positioning systems, as is done, e.g., for permanent sample plots in the Swedish national forest inventory (Fridman et al. 2014).
In Sect. 2, we provide a general procedure to produce unit sample properties for a discrete population sampled using an area frame. Through Sect. 2.1, we show examples on unit sample properties for a discrete population sampled through three different, commonly used area frame designs. In Sect. 3, we recall the single and multiple count estimators that are used to estimate population totals. Then, in Sect. 4, we present the theory for combining samples, and for combining estimators using pooled variance estimators. In Sect. 5, we use a simulation to compare a naive linear combination with the combined sample and the linear combination using pooled variance estimates. Finally, we discuss the results in Sect. 6.

UNIT SAMPLE PROPERTIES FOR GENERAL DESIGNS
Assume that there is a finite, but unknown population U , represented by fixed points on an area of interest F U , that has some measurable properties of interest. If a sample point X (k) , with probability density function (pdf) f (k) (x), falls within the inclusion zone A (k) i of an unit i ∈ U , the unit is included in the sample.
Let P be the set of independent but not necessarily equally distributed sample points. For any sample point X (k) ∈ P, and units {i, j} ∈ U , we make the following definitions: where I (·) denotes the indicator function, S i is the number of inclusions of unit i by sample point X (k) , π (k) i is the first-order inclusion probability of unit i by sample point X (k) , i.e., the probability of unit i being included into the sample by a sample point X (k) , π (k) i j is the second-order inclusion probability for units i, j to be included in the sample simultaneously by sample point X (k) , E (k) i is the (first-order) expected number of inclusions of unit i by X (k) , and E (k) i j is the second-order expected number of inclusions of units i, j by X (k) . For the set of independent sample points P, we extend the definition in (1) to Expanding the definition of (4) to the first-order expected number of inclusions for unit i by the set of sample points P, we have while it can be shown (see "Appendix" for further details), that the expected number of inclusions of the second-order for units i, j by the set of sample points P can be extended from (5) to Moreover, the inclusion probabilities of the first and second-order of units i, j by the set of sample points P can be expressed similarly to (2) and (3) as For any set of sample points P to be used to make an unbiased estimator of a parameter of U , we require that all units in the population have positive inclusion probabilities, equivalent to ensuring that a sampling design satisfies For an unbiased estimator of variance by any set of sample points P, we require that all pairs of units {i, j} ∈ U have positive second-order inclusion probabilities, equivalent to ensuring that a sampling design satisfies While the requirements in (11) and (12) are necessary and sufficient for positive inclusion probabilities of the first and second-order, they are in reality often not assessable if the units in U are unknown. Instead, sufficient counterparts with respect to F U can be formulated as where F, the sample frame, is connected to F U so that F U \F dx = 0, assuming reasonably defined inclusion zones. It holds that (14) is sufficient for (13).

SAMPLE PROPERTIES FOR THREE COMMON DESIGNS
Provided the derived sample properties, it is easy to show the sample properties for three common designs-i.i.d., one point per stratum stratified, and systematic-given uniform sample point distributions. Assuming that unit i's inclusion zones are identical for all sample points within a specific design, i.e., A d , we define F as the area enclosing all possible inclusion zones, a F as the area of F, a i as the area of A i , and a i j as the area of 1 . The inclusion probabilities for units i, j by a single sample point X (k) 1 can thus be described as From this, it follows that the first-order sample properties for unit i are with the second-order sample properties for units i, j where n 1 denotes the cardinality of P 1 , i.e., the number of sample points in the design. A systematic design with uniform pdf's, and a repeating pattern in the inclusion zones defined by the stratification (exemplified in Fig. 1), is a special case of the i.i.d. design where only one point is sampled. Thus, for the systematic design, the sample properties for units The final example is the one point per stratum stratified design defined by P 3 , where one point is sampled from each of a fixed number of disjoint strata. Let the stratum for sample point X 3 , given uniform pdf's, can then be described as Figure 1. Examples of a i.i.d., b stratified, and c systematic frames and inclusion zones. The outer areas represent the sample frames (F), the inner areas represents the areas of interest (F U ), and the circles represents the inclusion zones (A) for units. In both a and b, the sample frame expands around the area of interest so that the largest of the inclusion zones will always be fully within the area frame. In b four disjoint strata of unequal sizes and shapes are exemplified through the dashed lines. c shows inclusion zones for two units, where dashed circles and x'es indicate the units' positions. These types of inclusion zones would exemplify systematic plot sampling. from which the results in (7), (8), (9), and (10) follows. In the case of equally sized and disjoint strata, a (k) F = a F /n 3 , where n 3 represent the number of strata/sample points.

SINGLE AND MULTIPLE COUNT ESTIMATORS
The sample properties derived in Sect. 2 are needed for two common estimators used when estimating the population total Y = i∈U y i of a finite population U . The first of these two estimators is the single-count (SC) Horvitz-Thompson estimator (Horvitz and Thompson 1952), defined asŶ where S i denotes the number of inclusions of unit i, π i = Pr (S i > 0) denotes the inclusion probability for unit i, i.e., the probability for unit i to be included in the sample, and I (·) denotes the indicator function. The variance ofŶ SC can be shown to be where π i j = Pr S i > 0, S j > 0 denotes the second-order inclusion probability, i.e., the probability for units i, j to be included in the sample simultaneously. Given that the secondorder inclusion probabilities are strictly positive for all pairs {i, j} ∈ U , an unbiased variance estimator forŶ SC isV The second estimator to be used in this paper is the multiple-count (MC), or Hansen-Hurwitz, estimator (Hansen and Hurwitz 1943), defined aŝ where E i j = E S i S j denotes the second-order expected number of inclusions for two units i, j. Given that the second-order expected number of inclusions are strictly positive for all pairs {i, j} ∈ U , an unbiased variance estimator ofŶ MC iŝ As by the requirements in (13) and (14), the variance estimators presented here are not applicable when using a one-per-stratum stratified or systematic sample design such as those presented in Sect. 2.1. However, when combining two or more independent samples, these criteria will be evaluated on the combined sample.

COMBINING SAMPLES
Let D = {P d } d denote a combined sample, i.e., a set of independent sets of sample points P d . By extending the definition of (6) to the number of inclusions by the combined sample as S the inclusion probability of unit i by a combined sample D becomes similar to (9). Comparable to (7), (8), and (10), the rest of the necessary sample properties for units i, j by a combined sample D follows as By using these combined sample properties, the estimators in Sect. 3 can be applied directly. When combining samples, for example in a multiple frame setting, the individual designs' sample frames do not need to be identical, nor do they need to individually cover the area of interest. The requirements in (11) and (12) needs to be fulfilled with respect to the sample points in ∪ d P d , i.e., the necessary condition for positive second-order inclusion probabilities and positive expected number of inclusions for all pairs in the combined sample D is with sufficient counterpart both of which imply positive first-order inclusion probabilities and positive expected number of inclusions for all units by the combined sample D.
If sample frames are extended in ways similar to those in Fig. 1, or if combining multiple frames, there will be some oversampling. In such cases, it will be required to be able to identify objects not part of the population of interest.
These results are not limited to area frames. As per an example in Lohr and Rao (2006), it is possible to combine, for example, a sample taken from an area frame with full coverage of the population of interest, and a list frame with unknown coverage of the population of interest, as long as it is possible to identify units in the list frame that are not part of the population of interest, and units sampled from the area frame that are also present in the list frame.

COMBINING ESTIMATORS BY LINEAR COMBINATIONS
When combining a set of unbiased estimates formed of the samples in D by linear combinations, the formŶ is often considered, since it will yield an unbiased result. Often the inverse variance proportion is used as the weight in order to increase accuracy. However, as described by Grafström et al. (2019), if true variances are not available, using variance estimates may in certain cases introduce bias to such a linear combination, especially when the variance estimator is correlated with the estimator of the population parameter. We denote a linear combination using variance estimates aŝ with * for either SC (single-count) or MC (multiple-count).
To overcome the issue with biased variance estimators, we propose a pooled variance estimator, using all available information to estimate the separate variances. We denote the linear combination estimator using such pooled variance estimates aŝ are both unbiased estimators of the variances of the single and multiple count estimators, given ∀{i, j} ∈ U, π (D) i j > 0 and ∀{i, j} ∈ U, E (D) i j > 0. Note that the final fractions for both variance estimators for a design P d assures that all available information are used through S i j , as defined in (15), (17) and (19). However, if many second-order design properties are positive, but small, the variance estimators might produce negative and unstable estimates, making them unsuitable for combinations.

SIMULATION
In order to evaluate the proposed combinations of samples and estimates, a simulation study was performed. The simulation sampled 10,000 times from a simulated population generated from the SLU (Swedish University of Agricultural Sciences) Forest Map (Reese et al. 2003). The SLU Forest Map, previously known as kNN-Sweden, has extensive information about Swedish forest land and is based on satellite and field data from the Swedish national forest inventory (NFI). The map contains information about age, height, species  of wood and woodland for the country's forest land. The basic format is raster data with a resolution of 25 × 25 square meters.
From the SLU Forest map, an area of 1000 × 1000 square meters of southern Sweden was cropped to represent the area of interest. Figure 2 illustrates the location as well as the total volume of the stand for the cropped area. Using individual tree data variables from the Swedish NFI, the three dominating tree species-birch, pine, and spruce-were randomly added to the population according to species-specific volume maps of the cropped area. In the resulting population, the number of trees for each species is 7411 (13%), 24,428 (41%) and 27,212 (46%), respectively. The resulting population is presented in Fig. 3, color-coded by volume intensity.
For each of the 10,000 simulation runs, four samples were generated from the sample frame using uniform densities-two i.i.d. samples, one systematic sample, and one stratified sample. Each design used circular inclusion zones of common sizes per design, correspond- ing to plot sampling. In order to have equal first-order expected number of inclusions for all units, the sample frames were expanded around the area of interest in each direction by the size of the inclusion zone radius, guaranteeing that all inclusion zones are fully within the sample frames. In Table 1, the designs are described in further detail. For each sample and combination, single (SC) and multiple count (MC) estimates were calculated. To show the effect of different ways of combining data, we compared the estimators using combined samples, with sample properties derived through (16), (17), (18) and (19), with the estimators based on linear combinations of estimates using estimated variances and pooled variance estimates as in (22).
As mentioned in Sect. 3, for variance estimators to be unbiased, we require positive second-order sample properties for all pairs in the population. While the systematic and stratified designs fulfills the requirements in (20) and (21) in combination with each other or any of the i.i.d. designs, they do not fulfill (12) and (14) individually, while also being prone to negative and unstable pooled variance estimates due to small second-order design properties, making them unsuitable to use in a linear combination. In environmental surveys, one often deal with this by using a more conservative variance estimator, for example by using the i.i.d. variance estimator (Benedetti et al. 2015). However, using the i.i.d. variance estimator might be too conservative, i.e., reducing the assumed efficiency of the stratified and systematic designs.
For this simulation, second-order design properties were calculated as if they were sampled using a i.i.d. design, when calculating the linear combination of estimates using pooled variances. For the naive combination, plot variance estimates in the linear combination were used, where y (l) d is the plot l estimate of the total. In order to reduce the efficiency impact of the stratified and systematic designs, plot variances were calculated using a variant of the local mean variance estimator proposed by Grafström and Schelin (2013) where P * d (k) is the set of n * sample points of design d closest to X (k) d . For this simulation, the fixed number of neighbors was set to n * = 4.
The results, presented in Table 2, show that while any combination reduced the variance in the estimator, the combination based on plot variance estimates introduced bias at least three times of that generated by the pooled variance estimates. Because of the relatively small probability of two sample points sampling the same tree, the SC and MC estimators perform similarly.
In Table 3, bias, MSE, and variance estimates are presented for the i.i.d. 1 and 2 designs, and the combinations of the two. Comparing the combined samples versus the combined estimates, one can observe the trade-off between unbiased estimates and estimates with reduced variances.

DISCUSSION
In Table 2, we showed that combined samples and linear combinations based on pooled variances (pooled combination) will probably always be preferable to linear combinations based on individual variances (naive combination), given that the target variable has a skewed distribution. Even if no correlation exists between the estimator and its variance estimator, the pooled combination should be more efficient than the naive combination, as more information is used. The main drawback of the pooled combination is the need to compute additional second-order design properties, which may be difficult if positional data is not available or accurate enough to map the sample properties of the designs. Furthermore, for some designs the pooled variance estimator might be unstable, which makes it an unsuitable choice for such designs. However, the combined samples approach will function sufficiently in most cases, as its estimate is not dependent on second-order design properties, why the impact of absence of reliable positional data should be small, for most designs. While the results from the simulation are conditional to the simulated population, we expect the bias to be proportional to the heterogeneity of the population, why we may draw some general conclusions. We believe both of these methods to be useful for domain estimates. For the domain estimate of a primary survey, the target variable will have a skewed distribution, even if the target variable over the domain is not. It is thus expected that significant bias will be introduced by using the naive combination.
Another scenario where both presented methods might be useful are when combining designs like those used in the simulation here, where it is not possible to get an unbiased variance estimator for one or more of the individual designs. The pooled combination is unbiased if the combined second-order sample properties are positive for all units in the population, whereas the naive combination needs positive second-order sample properties for all units and all designs. Furthermore, the combined samples approach has none of these restrictions and is also more relaxed in terms of first-order sample properties. Table 3 provides results regarding MSE and variance estimates for i.i.d. designs. These results highlight the bias-variance trade-off between the pooled combination and the combined sample approaches. The combined samples approach produces unbiased estimators, however, in the simulation, with larger empirical mean-squared errors than the pooled combinations. A statistician deciding between these two approaches should thus know to what extent the end product needs to be accurate or reliable.
In Tables 2 and 3, we see that the bias is, as expected, more apparent when dealing with skewed target variables, as the volume of birch. It is not uncommon to reach acceptable MSE's for some dominant or aggregate target variable in a primary survey, here represented by the total wood volume, while needing complementary surveys to study some target variable with a more skewed distribution. The results of the simulation show that different methods of combination will affect the reliability of the combined estimates.
Further research would study the effects of errors in the positioning of units, to see how previously described mismatching would affect the estimates. For plot sampling procedures, that are commonly used in forest inventories, one can assume two types of mismatching to be common: One where there is a difference between the location of the studied plot and the sampled location, and one where the positioning of units within a plot are inaccurate. Depending on designs, these errors will have different effects. permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Funding Open access funding provided by Swedish University of Agricultural Sciences.

APPENDIX: UNIT DESIGN PROPERTIES
Let U be a finite, unknown population, representable by fixed points on an area of interest F U . If a sample point X (k) , with probability density function (pdf) f (k) (x), falls within the inclusion zone A (k) i of unit i ∈ U , the unit is included in the sample.