# An information-based criterion to measure pixel-level thematic uncertainty in land cover classifications

## Abstract

Traditional accuracy assessment of satellite-derived maps relies on a confusion matrix and its associated indices, built by comparing ground truth observations and classification outputs at specific locations. These indices may be applied at the map level or at the class level. However, the spatial variation of the accuracy is not captured by those statistics. Pixel-level thematic uncertainty measures derived from class membership probability vectors can provide such spatially explicit information. In this paper, a new information-based criterion—the equivalent reference probability—is introduced to provide a synoptic thematic uncertainty measure that has the advantage of taking the maximum probability value into account while accounting for the full set of probabilities. The fundamental theoretical properties of this indicator are first established, and its use is then demonstrated on a real case study in Belgium. Results showed that the proposed approach positively correlates with the quality of the classification and is more sensitive than the classical maximum probability criterion. As this information-based criterion can be used to produce spatially explicit maps of thematic uncertainty, it provides substantial additional information about classification quality compared to conventional quality measures. Accordingly, it proved to be useful both for end-users and map producers as a way to better understand the nature of the errors and to subsequently improve the map quality.

## Keywords

Information theory · Classification · Thematic uncertainty · Confidence · Remote sensing

## 1 Introduction

In the framework of classification, the most frequent way of assessing the performance of a classifier is to compare the labels of the classification with independent ground truth observations (Stehman 1997). Accuracy measures have been designed to report accuracy both at the map level and at the class level [see (Story and Congalton 1986) for examples] and are typically assumed to apply uniformly over the region of interest. Yet several studies have also demonstrated that errors vary spatially (Liu et al. 2004; Foody 2005; Comber et al. 2012; Renier et al. 2015; Liu et al. 2015; Waldner et al. 2015b; Feng et al. 2015).

As global accuracy statistics cannot model this spatial variation adequately, statistics describing the map quality at a more local level are thus necessary. Foody (2005) applied local accuracy assessment by geographically constraining the data used for accuracy assessment and showed that local accuracy assessment provides a more complete understanding of the quality of land cover maps derived from remote sensing. Nonetheless, to obtain sub-regional accuracy estimates using conventional design-based accuracy assessment, validation data must ensure a sufficient sample size within the region of interest for precise estimates. Unfortunately, sufficient sub-regional data are rarely available to support this (Strahler et al. 2006). Cripps et al. (2013) presented a Bayesian method for quantifying the uncertainty that results from potential misclassification in remotely sensed land cover maps. Discrete remote sensing classification intrinsically neglects the fuzzy character of the land surface and, as a consequence, leads to the inclusion of uncertainty in class assignments (Van der Wel et al. 1998). Lunetta et al. (1991) give an overview of the sources of errors and uncertainties in remote sensing classification. Accordingly, several studies have addressed quality issues by propagating uncertainties in spatial datasets (Pontius 2000; Atkinson and Foody 2002; Crosetto and Tarantola 2001; Liu et al. 2004), while others found that addressing classification uncertainty improved subsequent model calibration (Cockx et al. 2014).

In remote sensing, measures like the posterior probability of membership to the allocated class are often used as an indicator of uncertainty on a per-case basis (Foody et al. 1992). Probably the simplest approach for visualizing the uncertainties underlying a remote sensing classification is by way of a gray-scale map depicting the maximum probability (MP) \(\max {({\mathbf {p}})}\) of a probabilistic output vector \({\mathbf {p}} = (p_1,..., p_k)\) (Van der Wel et al. 1998), where *k* is the number of classes. The direct use of the MP from other probabilistic classifiers is a common practice; see for instance Mitchell et al. (2013), Dronova et al. (2011) and Polikar (2006). For non-probabilistic classifiers, soft outputs can also be used as a proxy for class membership probability. In the random forest framework, it is defined as the number of trees in the ensemble voting for the final class (Loosvelt et al. 2012a). In support vector machine classifications it is based on the distances of the samples to the optimal separating hyperplane in the feature space (Giacco et al. 2010), while for the multi-layer perceptron it is based on the activation levels (Brown et al. 2009). While these measures are not posterior probabilities *per se*, they can be regarded as such.
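As a minimal sketch of this proxy idea (the function name and setup are illustrative, not from the paper), per-class soft outputs such as ensemble vote counts can be normalized into a membership probability vector:

```python
def votes_to_probabilities(votes):
    """Normalize per-class soft outputs (e.g. tree votes in a random forest)
    into a probability-like membership vector summing to 1."""
    total = sum(votes)
    if total == 0:
        raise ValueError("at least one class must receive a non-zero score")
    return [v / total for v in votes]

# 100 trees: 70 vote for class 1, 20 for class 2, 10 for class 3
p = votes_to_probabilities([70, 20, 10])  # [0.7, 0.2, 0.1]
```

Any of the soft outputs mentioned above (votes, distances converted to scores, activation levels) can be fed through such a normalization before being treated as class membership probabilities.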

Here, \(\bar{m_i}\) is the class average in the *i*th band and \(s_i\) is the standard deviation of the class in the *i*th band. The ratio \(d_1/d_2\) between the distance to the closest (assigned) class centroid \(d_1\) and the distance to the second closest class centroid \(d_2\), along with the magnitude of these distances for each pixel, provides additional information about the reliability of the class label assignment.

The result is a measure *U* lying in [0, 1] that depends only on the maximum probability and the total number of classes. The numerator of the second term expresses the difference between the MP assigned to a class and the probability that would be associated with the classes if a maximum dispersion over all classes occurred, that is, if a probability of 1/*k* were assigned to all *k* classes. The denominator corresponds to the extreme opposite case, where the MP is 1 (and thus a total commitment to a single class occurs). The ratio of these two quantities expresses the degree of commitment to a specific class relative to the largest possible commitment.
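The verbal description above can be sketched directly; the formula is reconstructed from that description (the original equation is not reproduced in this extract) and the function name is illustrative:

```python
def commitment_u(p):
    """Degree of commitment U in [0, 1], depending only on max(p) and k.

    Numerator: excess of the maximum probability over the uniform value 1/k
    (the maximum-dispersion case); denominator: the same excess for a total
    commitment to a single class (max(p) = 1).
    """
    k = len(p)
    return (max(p) - 1.0 / k) / (1.0 - 1.0 / k)

commitment_u([0.25, 0.25, 0.25, 0.25])  # 0.0 (maximum dispersion)
commitment_u([1.0, 0.0, 0.0, 0.0])      # 1.0 (total commitment)
```

Note that *U* depends on the probability vector only through its maximum, which is precisely the limitation discussed in the remainder of this section.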

This measure takes the maximum probability *p* into account and commits the probabilities of the other classes to the uncertainty evaluation. It has been suggested that it has a higher sensitivity than Shannon’s entropy (Löw et al. 2013). Yet, its definition depends on \(\alpha\), whose value is often set arbitrarily. Löw et al. (2013) proposed a normalized version of the \(\alpha\)-quadratic entropy, the relative \(\alpha\)-quadratic entropy, which simply consists in dividing \(H_\alpha ({\mathbf {p}})\) by the maximum possible \(H_\alpha ({\mathbf {p}})\), that is, its value when the probabilities are evenly distributed over all categories with \(p_i = 1/k\) for all *i*.

Despite alternative approaches to characterize pixel-level thematic uncertainty with more elaborate criteria, the most popular way of assessing the performance of a classifier remains the rate of correctly classified items (or variations around this theme). Although the shortcomings of this simple approach have been clearly emphasized by many authors, it also remains true that most of the proposed alternatives are based on *ad hoc* methods or indicators that lack strong epistemic grounds. As a direct consequence, such indicators multiply, leaving the user without clear final guidelines.

While the four cases (*a*) to (*d*) in Table 1 share the same category \(c_1\) as the most probable one, they widely differ with respect to probabilities \(p_2\), \(p_3\) and \(p_4\). While (*a*) concentrates the remaining probability \(1-p_1=0.3\) over a single category \(c_2\) and (*b*) distributes it evenly over the three remaining categories, the corresponding \(p_1\) is the same and does not allow a clear preference between these two cases. The same remark applies when comparing (*c*) with (*d*). A comparison between (*a*) and (*c*) would lead to the conclusion that (*c*) is more favourable, i.e. \(p_1\) is higher while the remaining probability \(1-p_1\) is distributed over the same single category \(c_2\). However, a major issue arises when it comes to comparing (*a*) with (*d*) and (*b*) with (*c*), as all probabilities are now different. Clearly, the difficulty of comparing these various cases comes precisely from the necessity of accounting for the whole probability vector \({\mathbf {p}}\) based on a sound theoretical approach, so that meaningful comparisons can be made and clear conclusions reached afterwards. This is of course impossible when relying only on \(\max {({\mathbf {p}})}\).

Illustrative examples when \(k=4\) for the values of \(p_1\) and the way probabilities are distributed over the remaining categories

| | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(\max ({{\mathbf {p}}})\) |
---|---|---|---|---|---|
(*a*) | 0.7 | 0.3 | 0 | 0 | 0.7 |
(*b*) | 0.7 | 0.1 | 0.1 | 0.1 | 0.7 |
(*c*) | 0.8 | 0.2 | 0 | 0 | 0.8 |
(*d*) | 0.8 | 0.1 | 0.1 | 0 | 0.8 |

Instead of discussing at length the benefits and limitations of all possible alternative approaches advocated so far, we present in this paper a way of assessing pixel-level thematic uncertainty built from scratch on information theory. To that aim, we will begin from the most elementary concept of information theory, i.e., the definition of information itself. It will be shown how an expected difference of information can account for the full set of probabilities while remaining perfectly consistent with \(p_i^*\) when used as a simple assessment indicator or as a criterion for selecting the best category. The similarities and discrepancies with entropy-based criteria will also be emphasized. Following a rigorous statistical reasoning, one indicator is proposed: the equivalent reference probability, derived from the information difference. Its use and usefulness are demonstrated with synthetic examples as well as with a real land cover classification case study.

## 2 The notion of difference of information

Let us consider a set of *k* non-overlapping categories \(\{c_1,\ldots ,c_k\}\) with associated probabilities \({\mathbf {p}}=(p_1,\ldots ,p_k)\) such that \(\sum _ip_i=1\), and let us consider that an arbitrary category \(c_i\) is observed. The information \(I(p_i)\) which is gained by observing the occurrence of \(c_i\) is then given by \(I(p_i)=-\ln p_i\).

### 2.1 The expected difference of information and its relationship with entropy

Taking the expectation of the information *per se* corresponds to the traditional definition of entropy \(H({\mathbf {p}})\), with \(H({\mathbf {p}})=-\sum _ip_i\ln p_i\).

It is also worth emphasizing that even if \(E[D(i\vert \vert i^*)]\) is measuring an expected difference of information, there is no direct connection with a classical measure of expected difference of information as given by the Kullback–Leibler (KL) divergence (Kullback and Leibler 1951) nor with cross-entropy measures (Stehlík and Sivasundaram 2012). Indeed, while KL divergence and cross-entropies aim at comparing two distinct probability vectors, say *p* and *q*, in our case the comparison is always made with respect to a given reference category \(p_{i^*}\) belonging to a single probability vector *p*.
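This contrast can be made concrete with a short sketch (illustrative code, not from the paper): the KL divergence takes two distinct probability vectors as input, whereas the expected difference of information operates within a single vector relative to its reference category.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) between two probability
    vectors; zero-probability terms of p contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# KL requires two distinct distributions to compare, e.g. a classifier
# output against the uniform (maximum-dispersion) distribution:
p = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25] * 4
d = kl_divergence(p, uniform)  # > 0; and kl_divergence(p, p) == 0
```

By contrast, \(E[D(i\vert \vert i^*)]\) needs no second vector: the comparison baseline is the probability \(p_{i^*}\) of the chosen reference category inside \({\mathbf {p}}\) itself.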

### 2.2 Fundamental properties

Resuming again from the interpretation of the probabilities as information, one can see that \(E[D(i\vert \vert i^*)]\) is measuring the average (difference of) surprise of observing any category \(c_i\) instead of the reference category \(c_{i^*}\). If the issue is to select the reference category at best among a set of categories as represented by a probability vector \({\mathbf {p}}\), it is thus consistent to select \(c_{i^*}\) such that \(E[D(i\vert \vert i^*)]\) is maximized. When the problem at hand is to compare classifications as represented by two probability vectors \({\mathbf {p}_j}\) and \({\mathbf {p}_{j'}}\) with corresponding reference categories \(c_{i_j^*}\) and \(c_{i_{j'}^*}\), it is thus consistent to directly compare their corresponding expected difference of information \(E[D(i_j\vert \vert i_j^*)]\) and \(E[D(i_{j'}\vert \vert i_{j'}^*)]\) and to favor the classification which exhibits a higher expected difference of information.

We thus postulate that \(E[D(i\vert \vert i^*)]\) is a sound and natural way of assessing the quality associated with a probability vector \({\mathbf {p}}\) and the choice of a given \(c_{i^*}\) as reference category. In order to show this, the most important properties of \(E[D(i\vert \vert i^*)]\) will first be given. The corresponding proofs of the theorems are grouped in the appendices for the sake of conciseness; the non-specialist reader can thus skip them without compromising the global understanding of the text. For each result, special attention is also devoted to its interpretation and to the way it relates to specific and important cases. Furthermore, the use of \(E[D(i\vert \vert i^*)]\) will be illustrated using simple but carefully selected synthetic examples.

**Theorem 1**

*Given any probability vector \({\mathbf {p}}=(p_1,\ldots ,p_k)\) with \(\sum _ip_i=1\) and two possible reference categories \(i^*\) and \(i^{**}\), with \(i^*\ne i^{**}\). If \(p_{i^*}>p_{i^{**}}\), then \(E[D(i\vert \vert i^*)]>E[D(i\vert \vert i^{**})]\).*

Synthetic example for a probability vector \({\mathbf {p}}=(0.1,0.2,0.4,0.3)\), with \(c_3\) as the most probable category, that also maximizes the value for \(E[D(i\vert \vert i^*)]\)

\(i^*\) | \(p_{i^*}\) | \(E[D(i\vert \vert i^*)]\) |
---|---|---|

1 | 0.1 | −1.1364 |

2 | 0.2 | −0.4120 |

3 | 0.4 | 0.6059 |

4 | 0.3 | 0.1084 |
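A minimal sketch reproducing Table 2 follows. The closed form is reconstructed here from the reported values rather than quoted from the paper's equations: \(E[D(i\vert \vert i^*)]\) is taken as the expectation of \(\ln (p_{i^*}/p_i)\) over the non-reference categories, weighted by their renormalized probabilities \(p_i/(1-p_{i^*})\), which reproduces every tabulated value.

```python
import math

def expected_info_diff(p, ref):
    """Expected difference of information E[D(i||i*)] for reference index `ref`.

    Reconstructed from the worked examples: the expectation of ln(p_ref / p_i)
    over the non-reference categories, weighted by p_i / (1 - p_ref).
    Categories with null probability are ignored (they carry no weight).
    """
    p_ref = p[ref]
    return sum(
        (p_i / (1.0 - p_ref)) * math.log(p_ref / p_i)
        for i, p_i in enumerate(p)
        if i != ref and p_i > 0
    )

p = [0.1, 0.2, 0.4, 0.3]
[round(expected_info_diff(p, i), 4) for i in range(4)]
# [-1.1364, -0.412, 0.6059, 0.1084]
```

Note how the most probable category \(c_3\) also maximizes the criterion, as stated by Theorem 1.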

**Theorem 2**

*If* \(p_i\le p_{i^*}\) \(\forall i\ne i^*\) *, then* \(E[D(i\vert \vert i^*)]\ge 0\) *with equality if and only if* \(p_i=p_{i^*}=\frac{1}{k'}\) *for all* \(k'\le k\) *categories with associated non null probabilities. *

Synthetic example for three probability vectors \({\mathbf {p}}\) sharing the same MP value occurring for category \(c_1\)

| | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(E[D(i\vert \vert i^*)]\) |
---|---|---|---|---|---|
(*a*) | 0.5 | 0.5 | 0 | 0 | 0 |
(*b*) | 0.5 | 0.3 | 0.1 | 0.1 | 0.9503 |
(*c*) | 0.5 | 0.2 | 0.2 | 0.1 | 1.0549 |

**Theorem 3**

*Given a reference category \(i^*\) with probability \(p_{i^*}\). The minimum possible value for \(E[D(i\vert \vert i^*)]\) is then given by \({\mathcal {L}}(p_{i^*})=\ln p_{i^*}-\ln (1-p_{i^*})\) (Eq. 15), and it occurs if and only if there is a single non-null \(p_i=1-p_{i^*}\) with \(i\ne i^*\).*

This lower bound can be checked on Table 3(*a*) by considering \(c_1\) as the reference category, where the lower bound is then precisely equal to \(\ln \frac{1}{2}-\ln (1-\frac{1}{2})=0\). However, Eq. (15) applies in a more general way even if the chosen reference category is not the most probable one (though the situation where the most probable category corresponds to the reference category is of particular interest, of course). Looking again at Eq. (15), it is worth noting that \({\mathcal {L}}(p_{i^*})\) is monotonically increasing with \(p_{i^*}\), as seen from Fig. 2. Moreover, the value of \({\mathcal {L}}(p_{i^*})\) does not depend on the number *k* of categories. When combined with the results for the upper bound given below, these remarks will prove useful for practical purposes.

**Theorem 4**

*Given a set of k categories and a reference category \(i^*\) with probability \(p_{i^*}\). The upper bound for \(E[D(i\vert \vert i^*)]\) is then given by \({\mathcal {U}}(p_{i^*},k)=\ln p_{i^*}-\ln (1-p_{i^*})+\ln (k-1)\) (Eq. 17), and it occurs if and only if \(\displaystyle p_i=\frac{1-p_{i^*}}{k-1}\) \(\forall i\ne i^*\).*

Synthetic example for three probability vectors \({\mathbf {p}}\) sharing the MP value occurring for category \(c_1\)

| | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(E[D(i\vert \vert i^*)]\) |
---|---|---|---|---|---|
(*a*) | 0.7 | 0.3 | 0 | 0 | 0.8473 |
(*b*) | 0.7 | 0.2 | 0.1 | 0 | 1.4838 |
(*c*) | 0.7 | 0.1 | 0.1 | 0.1 | 1.9459 |

Clearly, when choosing \(c_1\) as the reference category, the lowest possible value of \(E[D(i\vert \vert i^*)]\) is reached for case (*a*), where the complementary probability \(1-p_1=0.3\) is concentrated over a single category, as previously stated by Eq. (15). It is worth noting too that this minimum value is higher than 0, as the non-null probabilities are not all equal, in contrast to Table 3(*a*). On the other hand, the maximum possible value is reached for case (*c*), where \(1-p_1=0.3\) is distributed evenly over the three remaining categories.

The upper bound \({\mathcal {U}}(p_{i^*},k)\) is monotonically increasing both with \(p_{i^*}\) and with the number *k* of categories. Combining the formulas for the lower and upper bounds as given by Eqs. (15) and (17) on the same graph leads directly to Fig. 3. For the special case \(k=2\), one can see from Eqs. (15) and (17) that the lower and upper bounds are identical. This is a direct consequence of the fact that \(p_{i^*}\) completely defines the distribution, as the only possible probability value for the single other category is \(1-p_{i^*}\), of course. One can also remark from Eqs. (15) and (17) that the difference between these bounds does not depend on \(p_{i^*}\); indeed, \({\mathcal {U}}(p_{i^*},k)-{\mathcal {L}}(p_{i^*})=\ln (k-1)\). The effect of *k* can be illustrated with the simple example given in Table 5. Let us consider various probability vectors \({\mathbf {p}}\) sharing the same maximum probability \(p_1=\frac{1}{2}\) but where the complementary probability \(1-p_1=\frac{1}{2}\) is evenly distributed over an increasing number \(k-1\) of remaining categories. Using the same reference category \(c_1\), the upper bound accordingly increases with *k*.

Synthetic example for three probability vectors \({\mathbf {p}}\) sharing the same MP value occurring for category \(c_1\) and remaining probabilities that are evenly distributed over an increasing number of categories *k*

*k* | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(E[D(i\vert \vert i^*)]\) |
---|---|---|---|---|---|
2 | \(\frac{1}{2}\) | \(\frac{1}{2}\) | – | – | 0 |
3 | \(\frac{1}{2}\) | \(\frac{1}{4}\) | \(\frac{1}{4}\) | – | 0.6931 |
4 | \(\frac{1}{2}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | 1.0986 |
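The two bounds can be sketched directly. The formulas are reconstructed from the worked values, since Eqs. (15) and (17) are not reproduced in this extract: \({\mathcal {L}}(p_{i^*})=\ln \big (p_{i^*}/(1-p_{i^*})\big )\) and \({\mathcal {U}}(p_{i^*},k)={\mathcal {L}}(p_{i^*})+\ln (k-1)\), which reproduce the values of Table 5.

```python
import math

def lower_bound(p_ref):
    """L(p_ref): minimum of E[D(i||i*)], reached when the complementary
    probability 1 - p_ref is concentrated on a single other category."""
    return math.log(p_ref / (1.0 - p_ref))

def upper_bound(p_ref, k):
    """U(p_ref, k): maximum of E[D(i||i*)], reached when 1 - p_ref is spread
    evenly over the k - 1 remaining categories; exceeds L by ln(k - 1)."""
    return lower_bound(p_ref) + math.log(k - 1)

# Table 5: p_1 = 1/2 with the remainder spread evenly, so E[D] sits exactly
# at the upper bound for each k:
[round(upper_bound(0.5, k), 4) for k in (2, 3, 4)]
# [0.0, 0.6931, 1.0986]
```

The bound difference \(\ln (k-1)\) is visible in the code: it is the only term of `upper_bound` that depends on *k*.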

### 2.3 Categories with null probabilities

Synthetic example for two probability vectors \({\mathbf {p}}\) sharing the same MP value occurring for category \(c_1\)

*k* | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \({\mathcal {L}}(p_{i^*})\) | \(E[D(i\vert \vert i^*)]\) | \({\mathcal {U}}(p_{i^*},k)\) |
---|---|---|---|---|---|---|---|
3 | 0.6 | 0.2 | 0.2 | – | 0.4055 | 1.0986 | 1.0986 |
4 | 0.6 | 0.2 | 0.2 | 0 | 0.4055 | 1.0986 | 1.5041 |

This also emphasizes that, as soon as one wants to compare classifiers over a distinct number of categories, the value of \(E[D(i\vert \vert i^*)]\) cannot be interpreted as is without referencing how \(E[D(i\vert \vert i^*)]\) is located with respect to the corresponding upper bound (the lower bound remaining the same, as it does not depend on *k*). It will be shown further on that a way of accounting for this relative location of \(E[D(i\vert \vert i^*)]\) is through the use of an “upper bound equivalent probability”.

### 2.4 Negative values for \(E[D(i\vert \vert i^*)]\)

### 2.5 Equivalent reference probability

The equivalent reference probability is designed to ease interpretation across any number *k* of categories. The values of \(E[D(i\vert \vert i^*)]\) necessarily lie in the interval \([{\mathcal {L}}(p_{i^*}),{\mathcal {U}}(p_{i^*},k)]\), so that for any probability vector \({\mathbf {p}}\) and a chosen reference category \(c_{i^*}\) one can see how close the corresponding value \(E[D(i\vert \vert i^*)]\) is to these lower and upper bounds. However, for people used to dealing with probabilities, the interpretation of the \(E[D(i\vert \vert i^*)]\) values is made more difficult by the fact that both \({\mathcal {L}}(p_{i^*})\) and \({\mathcal {U}}(p_{i^*},k)\) are unbounded; indeed, from Eqs. (15) and (17) along with Fig. 3, it is clear that both bounds tend to infinity when \(p_{i^*}\) tends to 1. Equating \(E[D(i\vert \vert i^*)]\) with the upper bound for an equivalent probability \(p^*\) and the same number *k* of categories leads to the definition of the equivalent reference probability.

Consider, for instance, the probability vector in Table 7(*a*). Solving for \(p^*\) using Eq. (22) leads to the result \(p^*=0.5\). Accordingly, the probability vector in (*b*), where \(p_1=p^*=0.5\), can be viewed as an equivalent case, in the sense that it has the same \(E[D(i\vert \vert i^*)]\) value but this value now corresponds to the upper bound when \(c_1\) is chosen as the reference category (note, however, that any permutation of the probabilities in (*b*) would lead to the same result as long as the same probability value is used for the reference category, so that \(p^*\) is not intended to be associated with any specific category). Focusing now on the graphical representation of this equivalence between \(E[D(i\vert \vert i^*)]\) and \(p^*\) as given in Fig. 4, the value of \(p^*\) is found by moving horizontally leftwards from the point \((p_{i^*},E[D(i\vert \vert i^*)])\) to the curve corresponding to the upper bound \({\mathcal {U}}(p^*,k)\), which also makes clear that \(p^*\le p_{i^*}\) necessarily holds true. Clearly too, the closer \(E[D(i\vert \vert i^*)]\) is to the upper bound \({\mathcal {U}}(p_{i^*},k)\), the closer \(p^*\) will be to \(p_{i^*}\).

Synthetic example for two probability vectors \({\mathbf {p}}\) sharing the same \(E[D(i\vert \vert i^*)]\) value but where the last vector corresponds to the upper bound when \(k=4\)

| | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(E[D(i\vert \vert i^*)]\) | \({\mathcal {U}}(p_{i^*})\) |
---|---|---|---|---|---|---|
(*a*) | 0.6 | 0.2 | 0.2 | 0 | 1.0986 | 1.5041 |
(*b*) | 0.5 | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | 1.0986 | 1.0986 |
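Inverting the upper bound gives a closed form for the ERP, sketched below (reconstructed from the worked example, not quoted from Eq. (22)): setting \({\mathcal {U}}(p^*,k)=E[D(i\vert \vert i^*)]\) and solving yields \(p^*=e^{E}/(e^{E}+k-1)\).

```python
import math

def equivalent_reference_probability(e_d, k):
    """ERP p*: the probability whose upper bound U(p*, k) matches the
    observed expected difference of information e_d, obtained by solving
    ln((k - 1) p / (1 - p)) = e_d for p."""
    e = math.exp(e_d)
    return e / (e + k - 1)

# Table 7(a): E[D] = ln 3 ~ 1.0986 with k = 4 gives p* = 3 / (3 + 3) = 0.5
equivalent_reference_probability(math.log(3), 4)  # ~ 0.5
```

As expected, the ERP lies in [0, 1] and equals the uniform probability \(1/k\) when \(E[D(i\vert \vert i^*)]=0\).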

## 3 Synthetic examples

Cases (*a*) and (*b*) in Table 8 are respectively the lower and upper bounds for \(E[D(i\vert \vert i^*)]\) when \(p_{i^*}=0.7\). As a consequence, any intermediate case sharing the same \(p_{i^*}\) value will have a value \(E[D(i\vert \vert i^*)]\in [0.85,1.95]\), as, e.g., for case (*c*). Comparing now case (*a*) with case (*d*) and case (*c*) with case (*e*), for which the probabilities are distributed with the same logic over \(c_2\), \(c_3\) and \(c_4\), it can be seen that increasing \(p_{i^*}\) leads to an increase of \(E[D(i\vert \vert i^*)]\), as expected. However, higher \(p_{i^*}\)’s do not automatically correspond to higher \(E[D(i\vert \vert i^*)]\)’s. Indeed, directly comparing cases (*b*) and (*d*), which are respectively the most favorable case when \(p_{i^*}=0.7\) and the least favorable one when \(p_{i^*}=0.8\), \(E[D(i\vert \vert i^*)]\) still favours case (*b*) over case (*d*), as the even distribution of the probabilities over categories \(c_2\), \(c_3\) and \(c_4\) in (*b*) compensates for the higher probability of \(c_1\) in (*d*). Using \(E[D(i\vert \vert i^*)]\) as a sorting criterion from the most favourable to the least favourable case, the ordering is (*e*), (*b*), (*c*), (*d*), (*a*). Clearly, \(E[D(i\vert \vert i^*)]\) allows us to directly compare the various cases using a single criterion that simultaneously accounts for the effect of the reference category probability and the way the other probabilities are distributed over the remaining categories.

Illustrative examples when \(k=4\) for the values of \(p_{i^*}\) and \(E[D(i\vert \vert i^*)]\), where (*a*) and (*b*) are the lower and upper bounds when \(p_{i^*}=0.7\), while (*d*) is the lower bound when \(p_{i^*}=0.8\) (the value for the upper bound is equal to 2.48)

| | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(p_{i^*}\) | \(E[D(i\vert \vert i^*)]\) |
---|---|---|---|---|---|---|
(*a*) | 0.7 | 0.3 | 0 | 0 | 0.7 | 0.85 |
(*b*) | 0.7 | 0.1 | 0.1 | 0.1 | 0.7 | 1.95 |
(*c*) | 0.7 | 0.15 | 0.15 | 0 | 0.7 | 1.54 |
(*d*) | 0.8 | 0.2 | 0 | 0 | 0.8 | 1.39 |
(*e*) | 0.8 | 0.1 | 0.1 | 0 | 0.8 | 2.08 |
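The ordering discussed above can be checked with a short sketch, reusing the expectation reconstructed in Sect. 2 from the worked values (names are illustrative):

```python
import math

def expected_info_diff(p, ref=0):
    """E[D(i||i*)], reconstructed from the worked values: weighted mean of
    ln(p_ref / p_i) over the non-reference, non-null categories."""
    p_ref = p[ref]
    return sum((pi / (1 - p_ref)) * math.log(p_ref / pi)
               for i, pi in enumerate(p) if i != ref and pi > 0)

cases = {
    "a": [0.7, 0.3, 0.0, 0.0],
    "b": [0.7, 0.1, 0.1, 0.1],
    "c": [0.7, 0.15, 0.15, 0.0],
    "d": [0.8, 0.2, 0.0, 0.0],
    "e": [0.8, 0.1, 0.1, 0.0],
}
# Sort from most to least favourable under the criterion
order = sorted(cases, key=lambda c: expected_info_diff(cases[c]), reverse=True)
# order == ['e', 'b', 'c', 'd', 'a']
```

Sorting by \(\max {({\mathbf {p}})}\) instead would leave the ties (*a*)/(*b*)/(*c*) and (*d*)/(*e*) unresolved, which is precisely the limitation the criterion removes.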

## 4 Evaluation using remote sensing data

Satellite images were downloaded over a 30 × 30 km\(^2\) area in Belgium centered on 50.60\(^\circ\)N, 4.68\(^\circ\)E, from which land/crop cover maps were derived by a random forest classifier. It should be emphasized here that the purpose was not to achieve the highest level of accuracy but rather to demonstrate (1) how the equivalent reference probability (ERP), as defined by Eq. (22), can complement traditional accuracy assessments and (2) how the ERP criterion compares with the MP criterion.

### 4.1 Study area and data

The study site is located in the central agricultural loamy region of Belgium. The typical field size ranges from 3 to 15 ha, and the dominant crop types are winter wheat, winter barley, potatoes, sugar beet, and corn. Winter crops are generally sown in October and harvested in August at the latest, whereas summer crops are sown in April and harvested from September onward. Other dominant land covers include pastures, forests, artificial lands and water bodies. The landscape consists of flatlands and hills. The climate is temperate, with annual rainfall of about 780 mm that is relatively well distributed over the year, so irrigation is infrequent.

The targeted legend includes ten classes: six crop types [winter barley (WB), winter wheat (WW), sugar beet (SB), potato (Po), corn (C) and other crops (OC)], pasture (Pa), forest (F), artificial areas (A) and water bodies (W) [see Radoux et al. (2016) for a separability analysis of the main land cover classes in the area]. One thousand calibration samples were randomly extracted from a data set combining the land parcel identification system and the land cover map of Wallonia. Similarly, 2000 samples independent from the training data were randomly selected to constitute the validation dataset.

### 4.2 Evaluation methodology

#### 4.2.1 Qualitative analysis and spatial patterns

#### 4.2.2 Quantitative analysis and relationship to class-level accuracy measures

To quantitatively evaluate the proposed indicator, thematic uncertainty measures and classification errors were compared. The results from this comparison were then used to establish whether thematic uncertainty is positively correlated with classification accuracy and can therefore indicate classification quality. Results demonstrate that the proposed approach successfully predicts the quality of the classification and is more sensitive than MP.

## 5 Discussion and conclusions

This paper presents a new criterion to derive thematic uncertainty measures from pixel-level class membership outputs as provided by classifiers. This indicator—the equivalent reference probability—is built on the concept of information as defined in information theory. Its derivation from the expected difference of information has been demonstrated. Theorems and simple synthetic examples illustrated how it can account for the full set of probabilities, while remaining at the same time perfectly consistent with the MP both when used as a simple assessment indicator or as a criterion for selecting the best category. Additionally, the ERP does not rely on any tuning procedure, and it can be derived from any classifier that provides soft outputs, either probabilistic or based on probability membership proxies—number of trees, distance to the separating plane, activation level, etc.

The fundamental theoretical properties of the expected difference of information leading to the definition of the ERP were first demonstrated. In particular, it has been shown that the expected difference of information (i) is bounded, (ii) is consistent with the initial order of the input probability vector, and (iii) is non-negative as long as the reference category is the most probable one. To ease the interpretation and comparison of the information-based criterion, we introduced the notion of equivalent reference probability, which maps the expected difference of information onto the [0, 1] probability scale. Using synthetic examples, it has been shown how this index allows us to directly compare various cases of probability membership outputs using a single value that simultaneously accounts for the effect of the reference category probability and for the way the other probabilities are distributed over the remaining categories. The usefulness of and complementary information brought by the criterion were highlighted on both synthetic and real data sets. Based on a case study, it has been shown that the criterion provides a way of obtaining per-pixel classification confidences that are strongly correlated with classification accuracy (Pearson’s r = 0.8).

The ERP criterion has been shown to be more sensitive than the maximum probability criterion. For a given MP, the ERP varies as a function of the distribution of the remaining class membership probabilities, which permits a finer characterization of the uncertainty. This enhanced sensitivity, highlighted in the real case study, makes the ERP particularly well suited for classification comparisons and benchmarking activities.

Reliable pixel-level thematic uncertainty indicators are critical because they provide a means of producing classification confidences that convey considerably more information about classification quality than traditional accuracy assessment measures. As classifying large areas repeatedly over time with high spatial resolution images is becoming more and more frequent, the local/regional relevance of simple global confusion matrices and their derived measures is continuously reduced.

This type of approach is valuable for providing a deeper and spatially explicit understanding of the quality of land cover maps as derived from remote sensing. Additionally, the indicator is also useful to visualize the uncertainty, to ease the monitoring of ecological conditions (Dronova et al. 2011) and to further improve the classification accuracy (Foody 2008; Gonçalves et al. 2009), e.g., by combining different classifier outputs (Liu et al. 2004) and fusing classifier decisions (Löw et al. 2015a). Such a criterion could also inform sampling strategies for selecting reliable pixels in the framework of vegetation monitoring, area estimation or subsequent classifications. Further research will focus on the link between uncertainty, class proportion and purity, as well as on the way to integrate these information-based criteria within the classifiers themselves for optimal class selection.

## Notes

### Acknowledgments

This research was funded in the framework of the Seventh Programme for research, technological development, and demonstration under Grant Agreement No. 603719. The Landsat data were obtained through the online Data Pool at the NASA Land Processes Distributed Active Archive Center (LP DAAC), USGS/Earth Resources Observation and Science (EROS) Center, Sioux Falls, South Dakota (https://lpdaac.usgs.gov/get_data). The SPOT4 imagery was obtained under the SPOT4/Take5 programme. Imagery is copyrighted to CNES under the mention: “CNES 2013, all rights reserved. Commercial use of the product prohibited”.

## References

- Atkinson P, Foody G (2002) Uncertainty in remote sensing and GIS. Wiley, Chichester, pp 1–18
- Brown K, Foody G, Atkinson P (2009) Estimating per-pixel thematic uncertainty in remote sensing classifications. Int J Remote Sens 30(1):209–229
- Cockx K, Van de Voorde T, Canters F (2014) Quantifying uncertainty in remote sensing-based urban land-use mapping. Int J Appl Earth Obs Geoinf 31:154–166
- Comber A, Fisher P, Brunsdon C, Khmag A (2012) Spatial analysis of remote sensing image classification accuracy. Remote Sens Environ 127:237–246
- Cripps E, O'Hagan A, Quaife T (2013) Quantifying uncertainty in remotely sensed land cover maps. Stoch Environ Res Risk Assess 27(5):1239–1251
- Crosetto M, Tarantola S (2001) Uncertainty and sensitivity analysis: tools for GIS-based model implementation. Int J Geogr Inf Sci 15(5):415–437
- Dehghan H, Ghassemian H (2006) Measurement of uncertainty by the entropy: application to the classification of MSS data. Int J Remote Sens 27(18):4005–4014
- Dronova I, Gong P, Wang L (2011) Object-based analysis and change detection of major wetland cover types and their classification uncertainty during the low water period at Poyang Lake, China. Remote Sens Environ 115(12):3220–3236
- Eastman JR (2006) Idrisi Andes. Guide to GIS and image processing. Clark University, Worcester, pp 87–131
- Feng Y, Liu Y, Batty M (2015) Modeling urban growth with GIS based cellular automata and least squares SVM rules: a case study in Qingpu–Songjiang area of Shanghai, China. Stoch Environ Res Risk Assess 30:1–14
- Foody G (2005) Local characterization of thematic classification accuracy through spatially constrained confusion matrices. Int J Remote Sens 26(6):1217–1228
- Foody GM (2008) RVM-based multi-class classification of remotely sensed data. Int J Remote Sens 29(6):1817–1823
- Foody GM, Campbell N, Trodd N, Wood T (1992) Derivation and applications of probabilistic measures of class membership from the maximum-likelihood classification. Photogramm Eng Remote Sens 58(9):1335–1341
- Ge Y, Li S, Lakhan VC, Lucieer A (2009) Exploring uncertainty in remotely sensed data with parallel coordinate plots. Int J Appl Earth Obs Geoinf 11(6):413–422
- Giacco F, Thiel C, Pugliese L, Scarpetta S, Marinaro M (2010) Uncertainty analysis for the classification of multispectral satellite images using SVMs and SOMs. IEEE Trans Geosci Remote Sens 48(10):3769–3779
- Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recognit Lett 27(4):294–300
- Glasziou P, Hilden J (1989) Test selection measures. Med Decis Mak 9(2):133–141
- Gonçalves LM, Fonte CC, Júlio EN, Caetano M (2009) A method to incorporate uncertainty in the classification of remote sensing images. Int J Remote Sens 30(20):5489–5503
- Hagolle O, Dedieu G, Mougenot B, Debaecker V, Duchemin B, Meygret A (2008) Correction of aerosol effects on multi-temporal images acquired with constant viewing angles: application to Formosat-2 images. Remote Sens Environ 112(4):1689–1701
- Hagolle O, Huc M, Villa Pascual D, Dedieu G (2015) A multi-temporal and multi-spectral method to estimate aerosol optical thickness over land, for the atmospheric correction of Formosat-2, Landsat, Venμs and Sentinel-2 images. Remote Sens 7(3):2668–2691
- Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
- Liu R, Chen Y, Wu J, Gao L, Barrett D, Xu T, Li L, Huang C, Yu J (2015) Assessing spatial likelihood of flooding hazard using Naive Bayes and GIS: a case study in Bowen Basin, Australia. Stoch Environ Res Risk Assess 30:1–16
- Liu W, Gopal S, Woodcock CE (2004) Uncertainty and confidence in land cover classification using a hybrid classifier approach. Photogramm Eng Remote Sens 70(8):963–971
- Loosvelt L, Peters J, Skriver H, De Baets B, Verhoest NE (2012a) Impact of reducing polarimetric SAR input on the uncertainty of crop classifications based on the random forests algorithm. IEEE Trans Geosci Remote Sens 50(10):4185–4200
- Loosvelt L, Peters J, Skriver H, Lievens H, Van Coillie FM, De Baets B, Verhoest NE (2012b) Random forests as a tool for estimating uncertainty at pixel-level in SAR image classification. Int J Appl Earth Obs Geoinf 19:173–184
- Lunetta RS, Congalton RG, Fenstermaker L, Jensen J, McGwire K, Tinney L (1991) Remote sensing and geographic information system data integration: error sources and research issues. Photogramm Eng Remote Sens 57(6):677–687
- Löw F, Conrad C, Michel U (2015a) Decision fusion and non-parametric classifiers for land use mapping using multi-temporal RapidEye data. ISPRS J Photogramm Remote Sens 108:191–204
- Löw F, Knöfel P, Conrad C (2015b) Analysis of uncertainty in multi-temporal object-based classification. ISPRS J Photogramm Remote Sens 105:91–106
- Löw F, Michel U, Dech S, Conrad C (2013) Impact of feature selection on the accuracy and spatial uncertainty of per-field crop classification using support vector machines. ISPRS J Photogramm Remote Sens 85:102–119
- Maselli F, Conese C, Petkov L (1994) Use of probability entropy for the estimation and graphical representation of the accuracy of maximum likelihood classifications. ISPRS J Photogramm Remote Sens 49(2):13–20
- McIver DK, Friedl M et al (2001) Estimating pixel-scale land cover classification confidence using nonparametric machine learning methods. IEEE Trans Geosci Remote Sens 39(9):1959–1968
- Mitchell SW, Remmel TK, Csillag F, Wulder MA (2008) Distance to second cluster as a measure of classification confidence. Remote Sens Environ 112(5):2615–2626
- Mitchell JJ, Shrestha R, Moore-Ellison CA, Glenn NF (2013) Single and multi-date Landsat classifications of basalt to support soil survey efforts. Remote Sens 5(10):4857–4876
- Pal NR, Bezdek JC (1994) Measuring fuzzy uncertainty. IEEE Trans Fuzzy Syst 2(2):107–118
- Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45
- Pontius RG (2000) Quantification error versus location error in comparison of categorical maps. Photogramm Eng Remote Sens 66(8):1011–1016
- Radoux J, Chomé G, Jacques DC, Waldner F, Bellemans N, Matton N, Lamarche C, d'Andrimont R, Defourny P (2016) Sentinel-2's potential for sub-pixel landscape feature detection. Remote Sens 8(6):488
- Renier C, Waldner F, Jacques DC, Babah Ebbe MA, Cressman K, Defourny P (2015) A dynamic vegetation senescence indicator for near-real-time desert locust habitat monitoring with MODIS. Remote Sens 7(6):7545–7570
- Rodriguez-Galiano V, Ghimire B, Rogan J, Chica-Olmo M, Rigol-Sanchez J (2012) An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J Photogramm Remote Sens 67:93–104
- Stehlík M, Sivasundaram S (2012) Decompositions of information divergences: recent development, open problems and applications. In: AIP conference proceedings, vol 1493. American Institute of Physics, p 972
- Stehman SV (1997) Selecting and interpreting measures of thematic classification accuracy. Remote Sens Environ 62(1):77–89
- Story M, Congalton RG (1986) Accuracy assessment: a user's perspective. Photogramm Eng Remote Sens 52(3):397–399
- Strahler AH, Boschetti L, Foody GM, Friedl MA, Hansen MC, Herold M, Mayaux P, Morisette JT, Stehman SV, Woodcock CE (2006) Global land cover validation: recommendations for evaluation and accuracy assessment of global land cover maps. European Communities, Luxembourg, p 51
- Van der Wel FJ, Van der Gaag LC, Gorte BG (1998) Visual exploration of uncertainty in remote-sensing classification. Comput Geosci 24(4):335–343
- Waldner F, Canto GS, Defourny P (2015a) Automated annual cropland mapping using knowledge-based temporal features. ISPRS J Photogramm Remote Sens 110:1–13
- Waldner F, Lambert MJ, Li W, Weiss M, Demarez V, Morin D, Marais-Sicre C, Hagolle O, Baret F, Defourny P (2015c) Land cover and crop type classification along the season based on biophysical variables retrieved from multi-sensor high-resolution time series. Remote Sens 7(8):10400–10424
- Waldner F, Ebbe MAB, Cressman K, Defourny P (2015b) Operational monitoring of the desert locust habitat with earth observation: an assessment. ISPRS Int J GeoInf 4(4):2379. doi:10.3390/ijgi4042379. http://www.mdpi.com/2220-9964/4/4/2379
- Zhang J, Sun J (2002) The survey of accuracy analysis of remote sensing and GIS. Int Arch Photogramm Remote Sens Spat Inf Sci 34(2):581–584

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.