
1 Introduction

National and global scale land cover mappings based on remotely sensed imagery have been shown to be directly useful in many environmental science applications, including carbon emission monitoring [1,2,3,4], forest monitoring [5, 6], modelling of soil properties [7], land change detection [8,9,10], climate dynamics [11,12,13,14], natural hazard assessment [15, 16], agriculture and water/wetland monitoring [17, 18] and biodiversity studies [19, 20]. Because of this, along with the increasing availability of satellite imagery, national and global scale land cover mappings have attracted significant attention from researchers in the environmental sciences in recent decades.

Satellite imagery alone, however, is generally not enough to build reliable and meaningful land cover maps. One must also collect reference samples (sometimes referred to as ground truth samples) both to train (when using supervised learning techniques) and to validate maps.

When estimating standard performance metrics (e.g. user, producer and overall accuracies) and area estimates in land cover mapping, a popular approach is to use a post-hoc validation set. This is done by comparing the ground-truth values of these validation samples with their respective predicted values and inferring estimates based on the forms of agreement and disagreement between these values. Since these estimations usually only require the number of different types of agreements and disagreements, it is often convenient to tabulate the results. When this is the case, the tabulated results are often presented as an error (or confusion) matrix. As a validation set is itself only a sample, such estimations will inevitably have uncertainties associated with them. In order for policy makers, stakeholders and other users to have the appropriate level of confidence in such estimations, it is vital that any quantification of these uncertainties is justified.

A major advantage of using a post-hoc validation sample for estimating these quantities (and subsequently quantifying the associated uncertainties) is that it does not place any requirements on the methods used to create the strata. This means that one is free to build mappings with machine learning techniques (such as Random Forests, Support Vector Machines and Artificial Neural Networks [21]) without needing to be concerned that many of these methods are black box in nature. Another advantage is that one has much more freedom when collecting training samples. This is because one is not restricted to the specific stochastic structures of sampling that are necessary when inferring uncertainties directly from a model. This is especially important when dealing with machine learning techniques, as we often have to rely on cheaper, less structured methods of collecting training data (e.g. polygon sampling, using found data, etc.). Thirdly, it is possible to apply this method with nothing more than the results from an error matrix. This is especially useful when analysing historical or third party maps.

The currently recommended approach to uncertainty quantification from error matrices is to take a frequentist approach and rely on asymptotic normality estimates to provide confidence intervals [22, 23]. The drawback of this approach is that it is not appropriate when relevant entries of an error matrix are not sufficiently large. Furthermore, because relevant events may be rare (e.g. instances of incorrect labelling between two contrasting classes), additional sampling of validation data is not always a practical solution to this problem. Whilst there are methods for dealing with low entry counts in simple situations [24,25,26], complications arise when one needs to correct estimates for disproportionate sampling across the strata. The main consequence of these complications is that the resultant confidence intervals are of little practical use, either due to their excessively cautious nature, or because the fundamental statement that is implicitly made by confidence intervals (i.e. nominal coverage) cannot be reliably verified.

The goal of this paper is to review existing methods for quantifying uncertainty with the aim of providing an approach that can deal with the aforementioned complications. In what follows, we first review the currently recommended practice for uncertainty quantification under a frequentist perspective based on the following criteria: transparency, generalisability, suitability under stratified sampling and suitability in low count situations (see Sect. 2 for further details).

We then make the case that a Bayesian approach is better suited as a default method for uncertainty quantification when judged by these criteria.

2 Terminology and Formulating the Problem

We begin by supposing that we have \( k \) mutually exclusive strata and that, within each stratum, instances can be classified as belonging to one of \( c \) discrete values. Typically, in land cover mapping applications these instances are single pixels or small clusters of pixels, each of approximately equal size. For the sake of convenience we assume that these instances are always at the single pixel level and hence refer to instances as pixels.

Let \( \varvec{p}_{i} \in [0,1]^{c}, \; i = 1, \ldots, k \) denote the proportion vector for stratum \( i \), where \( (\varvec{p}_{i})_{j} \) is the proportion of pixels within stratum \( i \) that belong to class \( j \), with \( j = 1, \ldots, c \). We define a global quantity as any quantity that can be expressed as a function of \( \varvec{p} := (\varvec{p}_{1}^{\prime}, \ldots, \varvec{p}_{k}^{\prime})^{\prime} \), i.e. a global quantity is any quantity that can be expressed in the form \( g(\varvec{p}) \). Examples of global quantities in land cover mapping applications are performance metrics such as user, producer and overall accuracies, along with large scale measurements such as total areas. In practice not all entries of \( \varvec{p} \) will be needed in the calculation of \( g \). Here we will write “relevant \( \varvec{p} \)” as a shorthand for “all entries of \( \varvec{p} \) that are necessary in the calculation of \( g \)”.

Next suppose that for each of the \( k \) strata, we draw a random sample of pixels (with replacement) of sizes \( n_{1}, \ldots, n_{k} \) respectively, and let \( \varvec{x}_{i} \) denote the response vector for stratum \( i \), with \( (\varvec{x}_{i})_{j} \) indicating the number of the \( n_{i} \) pixels drawn from stratum \( i \) that belong to class \( j \).

Within this notation, the aim of this paper is to review current methods of quantifying uncertainty for estimates of \( g(\varvec{p}) \) made with \( \varvec{n} := (n_{1}, \ldots, n_{k})^{\prime} \) and \( \varvec{x} := (\varvec{x}_{1}^{\prime}, \ldots, \varvec{x}_{k}^{\prime})^{\prime} \).

Note that estimates obtained from an error matrix are a special case of this whereby \( k = c \) and \( \varvec{x} \) is a vector representation of said error matrix. The evaluation of methods discussed in this paper will be based on the following four criteria.

Transparency – the extent to which one can explicitly state, justify and analyse any assumptions or choices necessary within the method. This is an important criterion as it will influence how likely end users are to have confidence in the results of a method.

Generalisability – the suitability and ease of applying a method when considering a wide variety of global quantities or when estimates for global quantities are part of a modelling chain. Essentially, this criterion is included to assess how flexible a given method is to choices of \( g \). This is important in land cover mapping applications as \( g \) is regularly a non-trivial function of the components of \( \varvec{p} \) (e.g. ratios, weighted sums etc.). In addition, global quantities are regularly used as inputs to other models. Hence, it is common that one may wish to propagate the uncertainty for an estimate of \( g \) into another quantity.

Suitability for stratified sampling – how appropriate the method is in situations where stratified random sampling has taken place. Stratified random sampling has been common practice when collecting test samples as it allows for a more efficient reduction in uncertainty under the currently recommended approach [27]. Hence, it is important that a method of uncertainty quantification can also handle the case of stratified random sampling in order to take similar advantage of these practices.

Suitability in low count situations – how appropriate the method is in situations where relevant entries (or combinations of entries) in \( \varvec{x} \) are close to, or exactly, zero (around 5 or less). Note that low sample sizes can cause low count situations, but these are not the same thing. For example, a sample of 25 successes and 25 failures is not a low count situation, but a sample of 499 successes and 1 failure would be. It is important that a method of uncertainty quantification can handle low count situations as there are several naturally occurring factors that make them quite frequent. Such factors include a demand for higher resolutions (e.g. thematic, temporal), the relatively high cost of test sampling (reducing the total sample size) and situations where a single class dominates a stratum (making the alternative classes in said stratum rare). The latter factor is interconnected with the efficiency gains that can arise from stratified random sampling. This is because stratified random sampling is most effective when one can create strata in which a single class of pixel heavily dominates each stratum. However, such a stratification is likely to induce a low count situation. This can lead to a peculiar situation in which one can be a victim of one’s own success when quantifying uncertainty with a method that cannot handle low count situations.

3 A Motivating Example: Georgian Deforestation

To motivate the work, we consider an example case study of estimating the total area of deforestation with the use of a land cover change map of Georgia [28].

This case study was chosen as it provides an example in which stratified sampling has taken place and one is in a low count situation for some of the entries of the error matrix. The general problem of monitoring deforestation plays an important role in estimating carbon emissions and is now required as part of recent EU policy [29].

Table 1. Error table for the map presented in Fig. 1. 1 = forest-to-non-forest; 2 = stable forest; 3 = stable non-forest. \( W_{i} \) denotes the total area of the predicted classes in hectares.

For the sake of brevity, we shall only consider providing uncertainty quantification for estimates of the total area of deforestation, along with the user accuracy and producer accuracy for the deforestation class. In terms of our notation, we define the total area \( ({\mathcal{A}}_{1}) \), user accuracy \( ({\mathcal{U}}_{1}) \) and producer accuracy \( ({\mathcal{P}}_{1}) \) for the forest-to-non-forest class as

$$ {\mathcal{A}}_{1} = \sum\limits_{i = 1}^{k} W_{i} \left( \varvec{p}_{i} \right)_{1}, \quad {\mathcal{U}}_{1} := \left( \varvec{p}_{1} \right)_{1}, \quad {\mathcal{P}}_{1} = \frac{W_{1} \left( \varvec{p}_{1} \right)_{1}}{\sum\nolimits_{i = 1}^{k} W_{i} \left( \varvec{p}_{i} \right)_{1}} = \frac{W_{1} {\mathcal{U}}_{1}}{{\mathcal{A}}_{1}} $$

which we need to estimate from \( \varvec{x} := (\varvec{x}_{1}^{\prime}, \varvec{x}_{2}^{\prime}, \varvec{x}_{3}^{\prime})^{\prime} \) with

$$ \varvec{x}_{1} = (51, 23, 13)^{\prime}, \quad \varvec{x}_{2} = (0, 416, 15)^{\prime}, \quad \varvec{x}_{3} = (1, 20, 410)^{\prime} $$

We chose these accuracy quantities as they are standard practice in many land cover mapping applications and will allow us to demonstrate how different methods behave when assessed against our chosen criteria. The user accuracy for the forest-to-non-forest class is intended to act as a simple base case. The total area of deforestation is a quantity for which we are in a low count situation and must account for stratified sampling, whilst \( g \) remains a relatively simple function (i.e. a weighted sum). The producer accuracy has the same qualities as the total area but considers a slightly more complex case of \( g \) that involves a ratio of two unknown values. We also note that only \( (\varvec{p}_{1})_{1} \) is relevant to \( {\mathcal{U}}_{1} \), and \( (\varvec{p}_{i})_{1}, \; i = 1, 2, 3 \) are relevant to \( {\mathcal{A}}_{1} \) and \( {\mathcal{P}}_{1} \).
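
To make these definitions concrete, the plug-in estimates of \( {\mathcal{A}}_{1} \), \( {\mathcal{U}}_{1} \) and \( {\mathcal{P}}_{1} \) can be computed directly from the sample proportions. The following is a minimal sketch in Python; the counts are those given above, but the stratum areas \( W_{i} \) appear only in Table 1, so the values of `W` below are purely illustrative placeholders.

```python
import numpy as np

# Rows: stratum 1 = predicted forest-to-non-forest, 2 = predicted stable
# forest, 3 = predicted stable non-forest; columns follow the same order.
x = np.array([[51, 23, 13],
              [0, 416, 15],
              [1, 20, 410]])

n = x.sum(axis=1)        # per-stratum sample sizes: 87, 431, 431
p_hat = x / n[:, None]   # plug-in estimates of the proportion vectors p_i

# Stratum areas W_i (hectares). The true values are given in Table 1 and
# are not reproduced here, so these are purely illustrative placeholders.
W = np.array([30_000.0, 1_200_000.0, 5_000_000.0])

A1 = np.sum(W * p_hat[:, 0])   # total area of deforestation
U1 = p_hat[0, 0]               # user accuracy of forest-to-non-forest
P1 = W[0] * U1 / A1            # producer accuracy of forest-to-non-forest
print(f"A1 = {A1:.0f} ha, U1 = {U1:.3f}, P1 = {P1:.3f}")
```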

4 Methods of Uncertainty Quantification

One way of quantifying uncertainty is to take a frequentist approach and use measures of uncertainty such as confidence intervals. Here the unknown value of \( g(\varvec{p}) \) is assumed fixed, and confidence intervals are probabilistic statements made in relation to the test sampling process, of which \( \varvec{x} \) is one realisation.

It is here that we introduce the concept of nominal coverage. Suppose we have a method of generating confidence intervals for \( g(\varvec{p}) \) and we repeat a sampling process a large number of times to generate a large number of test samples. Next suppose one was to apply said method to each of these test samples to generate a large number of confidence intervals. The nominal coverage for \( g(\varvec{p}) \) of a method under this sampling process is then the proportion of these confidence intervals containing the unobserved true value of \( g(\varvec{p}) \). For a method that quantifies uncertainty in terms of confidence intervals, the validity of said method in particular situations is determined by how closely the stated level of confidence relates to its nominal coverage. For example, a method that creates a confidence interval at the \( 100(1 - \alpha)\% \) level is valid in a given scenario if it is reasonable to believe that the nominal coverage is approximately \( 1 - \alpha \). For the sake of simplicity, this paper will only focus on equal-tailed intervals, but much of the analysis will extend to the case when tails are not equal.
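
In simple cases this notion can be checked by simulation. The sketch below estimates the coverage of the standard normal-approximation (Wald) interval for a binomial sample of size 431, under several assumed true proportions; the chosen proportions are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def wald_interval(x, n, alpha=0.05):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p = x / n
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def estimate_coverage(p_true, n, n_reps=100_000, alpha=0.05):
    """Fraction of simulated samples whose interval contains p_true."""
    x = rng.binomial(n, p_true, size=n_reps)
    lo, hi = wald_interval(x, n, alpha)
    return np.mean((lo <= p_true) & (p_true <= hi))

# Coverage collapses as the true proportion approaches zero (cf. Fig. 2):
for p_true in [0.5, 0.1, 0.01, 0.002]:
    print(f"p = {p_true:<6} coverage ~ {estimate_coverage(p_true, n=431):.3f}")
```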

The use of confidence intervals is currently the recommended practice within the land cover mapping community [22, 23]. We place methods of creating confidence intervals into three categories: exact, heuristic and asymptotic.

Exact methods are methods that rely on using the exact distribution of the sampling processes (in relation to \( g(\varvec{p}) \)) to generate confidence intervals that have rational guarantees regarding nominal coverage. An example of this is Clopper–Pearson intervals [30].
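
For a single binomial proportion, the Clopper–Pearson interval has a standard closed form in terms of Beta quantiles; a minimal sketch:

```python
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial
    proportion, via the standard Beta-quantile representation."""
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

# e.g. the single forest-to-non-forest count in stratum 3 of the example:
print(clopper_pearson(1, 431))   # a wide interval despite the large sample
```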

Heuristic methods are methods that rely on approximations of sampling distributions or make slight amendments to exact methods. Typically, heuristic methods are developed in response to specific weaknesses of exact methods, or used when exact methods are not easily derivable. Examples of heuristic approaches are Agresti–Coull intervals [26] and the use of credible intervals from Bayesian methods with uninformative priors.
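
As an illustration, the Agresti–Coull interval amends the normal-approximation interval by adding \( z^{2}/2 \) pseudo-successes and pseudo-failures before applying the usual formula; a sketch:

```python
import numpy as np
from scipy.stats import norm

def agresti_coull(x, n, alpha=0.05):
    """Agresti-Coull interval: a Wald interval computed after adding
    z^2/2 pseudo-successes and z^2/2 pseudo-failures to the sample."""
    z = norm.ppf(1 - alpha / 2)
    n_adj = n + z**2
    p_adj = (x + z**2 / 2) / n_adj
    half = z * np.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

print(agresti_coull(1, 431))
```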

Asymptotic methods are methods that rely on asymptotic theory to generate confidence intervals. Whilst they could be considered specific cases of heuristic methods, we have chosen to separate them as they act differently when judged by our four criteria (see Sect. 5). The currently recommended practice, which assumes a normal distribution based on asymptotic properties of the central limit theorem, and bootstrapping methods [31] are both examples of asymptotic methods.
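
A naïve bootstrap for a stratified global quantity resamples each stratum from its own empirical proportions and recomputes the plug-in estimate. A sketch for \( {\mathcal{A}}_{1} \), again with the stratum areas `W` as illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([[51, 23, 13], [0, 416, 15], [1, 20, 410]])
n = x.sum(axis=1)
W = np.array([30_000.0, 1_200_000.0, 5_000_000.0])  # illustrative placeholders

def boot_area(n_boot=10_000):
    """Naive bootstrap: resample each stratum's counts from its own
    empirical proportions, then recompute the plug-in area estimate."""
    p_hat = x / n[:, None]
    draws = np.stack([rng.multinomial(n[i], p_hat[i], size=n_boot)
                      for i in range(3)], axis=1)   # (n_boot, strata, classes)
    p_star = draws / n[None, :, None]
    return (W * p_star[:, :, 0]).sum(axis=1)        # bootstrap A1 values

a1 = boot_area()
print(np.percentile(a1, [2.5, 97.5]))   # equal-tailed 95% interval
# Note: stratum 2 has zero forest-to-non-forest counts, so every resample
# does too -- the bootstrap cannot reflect the uncertainty in that entry.
```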

An alternative approach to uncertainty quantification seen in land cover mapping applications is to express uncertainties in the form of probability density functions through Bayesian inference [32, 33]. Here, the uncertainty of relevant \( \varvec{p} \) is represented as a probability distribution given the observed data and a predetermined prior distribution. From this, we can then quantify uncertainty for \( g(\varvec{p}) \), either through direct inference or through simulation-based methods.
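
In simple cases this is direct. Placing independent Beta priors on each relevant \( (\varvec{p}_{i})_{1} \) (collapsing each stratum's counts to class 1 versus the rest) yields Beta posteriors, and the uncertainty for any \( g(\varvec{p}) \) can then be propagated by Monte Carlo. A sketch under a Jeffreys prior, again with illustrative placeholder values for \( W \):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

x1 = np.array([51, 0, 1])      # forest-to-non-forest counts per stratum
n = np.array([87, 431, 431])   # per-stratum sample sizes
W = np.array([30_000.0, 1_200_000.0, 5_000_000.0])  # illustrative placeholders

# Independent Jeffreys Beta(0.5, 0.5) priors on each relevant (p_i)_1 give
# Beta posteriors; draw from them and push the draws through g.
n_draws = 100_000
p = np.column_stack([
    beta.rvs(x1[i] + 0.5, n[i] - x1[i] + 0.5, size=n_draws, random_state=rng)
    for i in range(3)])

A1 = (W * p).sum(axis=1)       # posterior draws of the total area
P1 = W[0] * p[:, 0] / A1       # posterior draws of the producer accuracy
print(np.percentile(A1, [2.5, 97.5]))
print(np.percentile(P1, [2.5, 97.5]))
```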

Because frequentist and Bayesian methods take different perspectives on probability, it does not make sense to judge a Bayesian approach through the assessment of nominal coverage. In a frequentist setting, a confidence interval is a statement related to the behaviour of a large number of (hypothetical) samples. The uncertainty is on the sampling process, not on the parameter itself. In contrast, a measure of uncertainty such as a credible interval (often described as a parallel to confidence intervals in a Bayesian setting) is a measure of the spread of the posterior distribution of model unknowns, including parameters. This distribution is a rational quantification of uncertainty based on an observed sample and prior knowledge (or belief). Technically speaking, provided that we believe the prior placed on relevant \( \varvec{p} \) to be suitable, the resultant posterior distribution for \( g(\varvec{p}) \) is valid. A potential difference in results due to the set of priors deemed suitable is consistent with this view. An intuitive interpretation is that if two or more actors have different beliefs before observing a sample, their beliefs after seeing the sample may also differ if their prior beliefs were sufficiently strong.

Hence, when a method takes a Bayesian approach to uncertainty, we shall judge its suitability based on how sensitive the posterior distributions are to similar choices of prior distributions.

5 Analysis of Methods

We begin by applying several methods of uncertainty quantification to our Georgian deforestation example. For each method, we calculate an equal-tailed confidence (or credible) interval at the 95% level. For the frequentist methods we apply the currently recommended normal approximation method as well as naïve bootstrapping (both asymptotic), a method based on using the bounds of multiple Clopper–Pearson intervals for \( (\varvec{p}_{i})_{1} \), each created at the \( 100\sqrt[3]{0.95}\% \) confidence level so that the combined coverage is 95% (exact method), and the 95% credible intervals from the Bayesian methods (heuristic). For the Bayesian methods, we use a set of uninformative priors for each \( (\varvec{p}_{i})_{1} \): Jeffreys \( (\varvec{p}_{i})_{1} \sim \text{Beta}(0.5, 0.5) \), uniform \( (\varvec{p}_{i})_{1} \sim \text{Beta}(1, 1) \) and (close to) improper \( (\varvec{p}_{i})_{1} \sim \text{Beta}(0.01, 0.01) \).
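
Prior sensitivity for a single entry can be inspected directly, since each of the priors above yields a Beta posterior. The sketch below compares the three credible intervals for \( (\varvec{p}_{2})_{1} \), the entry with a zero observed count:

```python
from scipy.stats import beta

x1, n1 = 0, 431   # stratum 2: zero observed forest-to-non-forest pixels

for name, a in {"Jeffreys": 0.5, "uniform": 1.0, "near-improper": 0.01}.items():
    # Beta(a, a) prior + binomial likelihood -> Beta(x1 + a, n1 - x1 + a)
    lo, hi = beta.ppf([0.025, 0.975], x1 + a, n1 - x1 + a)
    print(f"{name:14s} 95% credible interval: ({lo:.2e}, {hi:.2e})")
```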

From Table 2 we can see that all methods largely agree for the user accuracy of the deforestation class. This is largely expected, as all methods will eventually converge to normality and the user accuracy is a relatively simple case. What is particularly striking, however, is the substantial differences we see for the intervals around the area of deforestation. This is problematic as the choice of method here could potentially have a serious impact on decision making.

Table 2. Limits of the confidence and credible intervals (equal-tailed, at the 95% level) under various methods for the Georgian deforestation example.

In a frequentist setting, one must be able to confirm that it is likely that the stated level of confidence is at least close to the nominal coverage. However, analysing nominal coverage in general is difficult and will depend on many factors, including the sample sizes \( \varvec{n} \), the level of confidence, \( g \) and the value of the unknown \( \varvec{p} \). In practice, the dependence on the unknown \( \varvec{p} \) will generally mean that one will never be able to give the exact nominal coverage. Rather, we may need to consider multiple plausible values of relevant entries of \( \varvec{p} \) based on \( \varvec{x} \) to build a representative picture of coverage.

As an example, let us consider a confidence interval at the 95% level for \( (\varvec{p}_{3})_{1} \) generated with the normal approximation method and naïve bootstrapping, based on a sample size from the presented example (431). In Fig. 2 we can see that both methods suffer from considerable under-coverage if \( (\varvec{p}_{3})_{1} \) is close to 0. The low count observed in \( (\varvec{x}_{3})_{1} \) in the present example is evidence that \( (\varvec{p}_{3})_{1} \) may indeed be close to 0.

Fig. 1. Change map for Georgia, from circa 1990 to 2000, as presented in [28].

Fig. 2. Coverage plot for the normal approximation method and naïve bootstrapping based on a sample size of 431 (the same as stratum 3 in the Georgian example).

This may call into question the validity of both these methods when quantifying uncertainty for the total area of deforestation, as this quantity relies heavily on \( (\varvec{p}_{3})_{1} \) (especially with \( W_{3} \) being so large). However, a more robust analysis is likely unviable, since the total area relies on three entries of \( \varvec{p} \), and so one would need a 3-dimensional equivalent of Fig. 2.

For the other frequentist methods, whilst the combined Clopper–Pearson intervals are guaranteed to provide sufficient coverage, the unknown degree of over-coverage may be so high that this is also misleading.

For each of the different types of frequentist methods, there is a trade-off between how well they satisfy each of our criteria. A more systematic analysis of how each type of method meets our criteria is presented in Table A1 in the appendix. The major findings are that most suffer from transparency issues in more complex situations due to the difficulties in coverage assessment, and that no type of method is likely to be suitable across all four criteria. Of course, not all criteria are relevant in all situations, and so some methods may be suitable in individual cases. The problem is that no method is consistent enough across all four criteria to be a good default approach.

In practice, this can mean having to choose between many methods when taking a frequentist approach to uncertainty. Such an approach would rely on expertise in said methods and suitable diagnostic tests (which may not even be available in more complex situations).

In comparison, one is more likely to satisfy our criteria when taking a Bayesian approach to uncertainty quantification (see Table A1 for further details). This is mainly due to three important advantages.

The first advantage is that issues related to coverage are avoided as Bayesian analysis is an entirely different form of uncertainty quantification.

The second advantage is that sensitivity to prior choice can be assessed post-hoc. This means we can effectively “wait and see” if prior sensitivity is going to be an issue at all. In comparison, one must have assurances related to nominal coverage for every new situation with frequentist methods.

Thirdly, the problem of prior sensitivity is not as detrimental as the problems we face in frequentist settings due to coverage assessment. This is because the validity of the results is determined by how reasonable we believe the priors to be. In practice, a set of standard prior choices is often agreed in advance by communities (e.g. a set of uninformative priors). We would argue that this is an easier task than, say, having communities agree which frequentist methods to use in which particular situations.

In the context of the Georgian deforestation example, consider the Bayesian results for the area of deforestation. The results in this case differ by a relatively considerable amount. Whilst this is not ideal, these results are robust and informative for decision makers. This differs from the frequentist setting, where we have considerably different confidence intervals that are potentially misleading due to their level of mis-coverage.

One may be tempted to make a similar statement about the frequentist approaches. The problem with this is that we need to assume that all the individual methods are appropriate to begin with (otherwise, they may act as misinformation). In order to do this, one must assess the coverage for all methods in a given situation, which, as we have discussed already, is often very difficult.

6 Discussion and Future Work

So far we have discussed the advantages of a Bayesian approach to quantifying uncertainty from an error matrix produced by a single sample and map. However, a Bayesian approach to uncertainty quantification has the potential to offer many more advantages that are not available in frequentist approaches. Whilst an in-depth exploration of these advantages goes beyond the scope of this paper, they do offer some insights into where future work may lead in terms of uncertainty quantification from a Bayesian perspective. Such work could include:

A Formal Means of Including Prior Information.

The inclusion of a prior distribution means that we can formally include information in our uncertainty quantification before observing our sample. This information may come from historical maps or biased samples (e.g. citizen science data). This could substantially reduce the sizes of test samples needed to reduce uncertainty to satisfactory levels, especially when investigating the prevalence of rare classes (e.g. when monitoring land-use change). Note that prior information cannot be formally incorporated in frequentist approaches, as statements such as confidence intervals relate to the sample itself under fixed parameter values.
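
As a hypothetical sketch, suppose a historical assessment suggests a class proportion of around 0.02, and we are willing to weight that information as being worth roughly 50 observations; this can be encoded as a Beta prior and combined with the new counts (all figures here are invented for illustration):

```python
from scipy.stats import beta

# Hypothetical: historical information suggesting a class proportion near
# 0.02, encoded as a Beta prior with mean 0.02 and weight a0 + b0 = 50.
a0, b0 = 1.0, 49.0

x1, n1 = 1, 431   # new test sample counts (as in stratum 3 of the example)
lo, hi = beta.ppf([0.025, 0.975], a0 + x1, b0 + n1 - x1)
print(f"posterior 95% credible interval: ({lo:.5f}, {hi:.5f})")
```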

Predicting the Impact of Additional Test Samples on Uncertainty Reduction.

Suppose we wish to reduce the uncertainty further by collecting an additional test sample. When predicting the effects of further test sampling, the impact on the degree of uncertainty is often governed by relevant \( \varvec{p} \), which we can estimate based on an initial test sample. When taking a frequentist approach to uncertainty quantification, it is difficult to propagate the uncertainty in these initial estimates.

However, when the uncertainty for relevant \( \varvec{p} \) is represented as a probability distribution, one can propagate this forward. The practical advantage this gives is that one can have a more reliable means of assessing the trade-offs between the cost of additional samples and their likely impact on uncertainty. In addition, it allows us to compare how different allocations of samples across strata may affect uncertainty. This would be a key step in any work assessing the efficiency of different sampling strategies.
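
One possible form this could take is a preposterior analysis: draw plausible truths from the current posterior, simulate the additional sample, and average the width of the updated credible interval. A sketch for a single entry, assuming a Jeffreys prior:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

def expected_width(x, n, n_extra, alpha=0.05, n_sims=5_000):
    """Expected 95% credible-interval width after n_extra further samples,
    averaging over the current Beta posterior (Jeffreys prior)."""
    a, b = x + 0.5, n - x + 0.5
    p = beta.rvs(a, b, size=n_sims, random_state=rng)   # plausible truths
    x_new = rng.binomial(n_extra, p)                    # simulated new counts
    lo = beta.ppf(alpha / 2, a + x_new, b + n_extra - x_new)
    hi = beta.ppf(1 - alpha / 2, a + x_new, b + n_extra - x_new)
    return np.mean(hi - lo)

# How quickly would further sampling shrink the interval for a rare entry?
for n_extra in [0, 100, 500, 2000]:
    print(n_extra, expected_width(1, 431, n_extra))
```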

7 Conclusion

When making estimates from mappings built with machine learning techniques, one must often rely on error matrices obtained from test sampling to quantify the uncertainty of these estimates. The currently recommended approach in this setting is a frequentist one that assumes asymptotic normality of estimates. This is often unsuitable when estimating the prevalence of rare classes or when strata are homogeneous. Alternative methods exist for simple cases, but they do not extend to the more advanced situations that are relied upon in land cover mapping applications. Furthermore, the assessment of any frequentist method is itself near impossible in more complex situations because of the difficulties in analysing nominal coverage.

In comparison, Bayesian inference can offer an approach to uncertainty quantification that is better suited for land cover mapping applications. It is for these reasons that we recommend that future work related to uncertainty quantification from error matrices should be focused on the development and refinement of Bayesian approaches rather than looking towards more advanced frequentist methods.