Digital classification is commonly used to convert image data into map products by assigning each pixel to one of two or more categories. This approach is routine in many fields, ranging from medical imaging to satellite remote sensing (Celeb et al. 2019; Phiri and Morgenroth 2017; Xiao et al. 2019). The resulting maps are frequently used for decision making and scientific research, so it is vital to verify map quality. One standard way to evaluate maps is to use error matrices (also known as confusion matrices) to compute accuracy metrics or indices, including map-level and category-level accuracies. In the geospatial field, the map-level accuracy metric is called the overall accuracy (O) and the category-level accuracy metrics are called the producer's (P) and user's (U) accuracies (Table 1) (Story and Congalton 1986; Congalton 1991; Foody 2002). The overall accuracy is the percentage of cases (e.g., pixels or sampling units) correctly classified. If only a single accuracy metric is reported for a classification map, it is likely to be the overall accuracy, as shown by the overwhelming majority of published research in the field of earth-surface detection and mapping (Fig. 1). We refer to this phenomenon as "overselling overall accuracy," meaning that map producers and users overemphasize the importance of map-level accuracy.
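As a point of reference, the following is a minimal sketch of how these three metrics are computed from a population error matrix. The two-category matrix used here is hypothetical; rows are map categories, columns are reference categories, and entries are proportions of the mapped area.

```python
# Minimal sketch: overall (O), producer's (P), and user's (U) accuracies
# computed from a hypothetical population error matrix.
# Rows = map categories, columns = reference categories, entries = proportions.
import numpy as np

p = np.array([[0.45, 0.05],
              [0.10, 0.40]])

O = np.trace(p)                 # proportion of cases correctly classified
P = np.diag(p) / p.sum(axis=0)  # producer's accuracy: p_jj / reference total p_+j
U = np.diag(p) / p.sum(axis=1)  # user's accuracy: p_jj / map total p_j+

print(O)  # ~0.85
print(P)  # ~[0.82 0.89]
print(U)  # ~[0.9 0.8]
```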

Table 1 A population error matrix and selected accuracy metrics
Fig. 1 Histogram comparing the number of publications reporting three types of accuracy, based on a topic search from the Web of Science Core Collection. The data cover the top ten journal groups, including remote sensing, imaging science photographic technology, environmental sciences, geoscience multidisciplinary, geography physical, ecology, forestry, water resources, agriculture interdisciplinary, and geography, during 1978–2018. The topic keywords include remote sensing or remotely sensed, and classification or map*, in addition to overall accuracy, producer's accuracy, and user's accuracy in titles, keywords, and abstracts

Overselling overall accuracy should be discouraged because the O metric has inherent properties that lead to misleading perceptions about the soundness of classification methods and map-based research. Specifically, these inherent properties contribute to at least four application problems: (1) the O metric does not convey the accuracy or uncertainty of map derivatives, (2) O values are not comparable between maps with different numbers of categories, (3) O values are not comparable between maps with different category proportions, and (4) a greater O value does not necessarily indicate a more realistic map.

The first O-metric problem arises because the O metric is computed using only the diagonal elements of an error matrix and thus does not reflect how errors are distributed among categories. A high O value for a map does not assure the reliability of map derivatives. These derivatives include category area, the most common map derivative in remote-sensing applications (Olofsson et al. 2014). For example, an overall accuracy of 90% does not ensure that the area estimate of a category is accurate to within ±10%. Consider a case where a 100 ha forest has a carbon density of 100 tons/ha at time 1 and 110 tons/ha at time 2. Based on these nominal values, the forest sequesters 1000 tons of carbon between the two times. Given ±10% uncertainties in the estimates of area and carbon density at each time, however, the widest possible range of carbon sequestration between time 1 and time 2 is from −3190 to 5210 tons. This example illustrates that a relatively high O value does not guarantee that map derivatives are accurate enough for subsequent applications. By contrast, the producer's and user's accuracies are more informative than O for area estimation. Even with a reasonable value of O, errors or uncertainties can be surprisingly high in landscape quantification (Arbia et al. 1998; Shao and Wu 2008), because the O metric does not reflect the spatial distribution of errors.
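A sketch of the interval arithmetic behind this range, assuming the ±10% bounds apply independently to the area and density estimates at each time, with $C_1$ and $C_2$ denoting the estimated carbon stocks (area × density) at times 1 and 2:

$$\begin{aligned} C_{1} &\in [90 \times 90,\; 110 \times 110] = [8100,\; 12100]\ \text{tons} \\ C_{2} &\in [90 \times 99,\; 110 \times 121] = [8910,\; 13310]\ \text{tons} \\ C_{2} - C_{1} &\in [8910 - 12100,\; 13310 - 8100] = [-3190,\; 5210]\ \text{tons} \end{aligned}$$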

The second problem with the O metric appears because an increase in the number of mapping categories leads to an increase in the number of spectrally similar pixels and in the number of edge pixels. When a pixel's assignment probabilities are nearly equal for two or more categories, the chance that the pixel is wrongly classified increases. Because edge pixels are located at the borders of adjacent patches, they tend to contain a mixture of two categories, which complicates image classification (Sweeney and Evans 2012; Heydari and Mountrakis 2018). Thus, an increase in mapping categories usually lowers O. The balance between the number of categories and classification accuracy is a practical consideration in classification design and implementation. On the application side, if a single category is the focus, such as deforestation or urban sprawl, a low-O map with multiple categories may be aggregated into a high-O map with only two categories. Such aggregation does not, however, increase the accuracy of the target category.
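A hypothetical three-category error matrix (rows are map categories, columns are reference categories, entries are proportions, and the target category is listed first; the numbers are illustrative, not from any real map) makes this explicit: merging the two non-target categories converts their mutual confusion into agreement and raises O, but the target category's row and column, and hence $P_1$ and $U_1$, are unchanged.

$$\begin{pmatrix} 0.10 & 0.02 & 0.02 \\ 0.01 & 0.40 & 0.10 \\ 0.01 & 0.10 & 0.24 \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} 0.10 & 0.04 \\ 0.02 & 0.84 \end{pmatrix}, \qquad O: 0.74 \rightarrow 0.94, \qquad P_{1} = \frac{0.10}{0.12},\; U_{1} = \frac{0.10}{0.14}\ \text{in both matrices}$$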

To understand the third problem with the O metric, we make the range of $p_{jj}$ explicit in the formula for O:

$$O = \sum_{j = 1}^{J} p_{jj} \qquad \left( 0 < p_{jj} \le p_{+j} \right)$$
(1)

A large category means that $p_{+j}$ is large (Olofsson et al. 2014) and that $p_{jj}$ can cover a broad range. Thus, a larger category can have a greater role than a smaller category in regulating the value of O. For example, if a dominant category accounts for 90% of the total mapping area (i.e., $p_{+j}$ = 90%) and its $p_{jj}$ value is also 90%, O will be at least 90%. If its $p_{jj}$ value is reduced to 70%, O cannot exceed 80%, because the other categories have only a 10% share in total and their $p_{jj}$ values cannot sum to more than 10%. In contrast, a single small category does not exert such a dominant effect. This dominant control of O by a large category has already attracted attention in predictive analytics and machine learning (Fielding and Bell 1997; He and Garcia 2009) and in remote sensing (Stehman and Foody 2019).
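The 80% ceiling in this example follows directly from the formula for O, with subscript 1 denoting the dominant category:

$$O = p_{11} + \sum_{j = 2}^{J} p_{jj} \le 0.70 + \sum_{j = 2}^{J} p_{+j} = 0.70 + 0.10 = 0.80$$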

When a map contains two unevenly sized or imbalanced categories (e.g., category 1 is the majority category and category 2 is the minority category), and the producer's and user's accuracies are balanced for each category (i.e., $P_1 = U_1$, $P_2 = U_2$, and $p_{12} = p_{21}$), we obtain

$$\frac{P_{1}}{P_{2}} = \frac{p_{11}/p_{+1}}{p_{22}/p_{+2}} = \frac{1 - p_{21}/p_{+1}}{1 - p_{21}/p_{+2}}$$
(2)

Because $p_{+1} > p_{+2}$, the term $p_{21}/p_{+1}$ is smaller than $p_{21}/p_{+2}$, so $P_1 > P_2$ whenever $p_{21} > 0$; that is, the accuracy of a large category exceeds that of a small category. In other words, the relative error of the minority category always exceeds that of the majority category. Equation (2) represents a simple case of image-data classification and map assessment. Although maps with multiple unevenly sized categories are more complicated than Eq. (2), they show similar trends. For example, a positive correlation between category areas and accuracies was found in the 1-km-resolution global land-cover data sets (Scepan 1999). Equation (2) also implies that requiring imbalanced categories to be equally accurate may be unrealistic. Thus, O cannot represent the accuracy of every category.
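A hypothetical numeric case (not drawn from any real map) illustrates this imbalance. Let $p_{+1} = 0.8$, $p_{+2} = 0.2$, and $p_{12} = p_{21} = 0.05$, so that each category receives the same absolute amount of confusion; then

$$P_{1} = U_{1} = \frac{0.75}{0.80} \approx 0.94, \qquad P_{2} = U_{2} = \frac{0.15}{0.20} = 0.75, \qquad \frac{P_{1}}{P_{2}} = \frac{1 - 0.05/0.80}{1 - 0.05/0.20} = 1.25$$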

Under the conditions of Eq. (2), the correctly classified proportion of the majority category, $p_{11}$, must exceed the difference between the majority and minority populations (i.e., their reference totals, $p_{+1} - p_{+2}$), because $p_{21} = p_{+2} - p_{22} < p_{+2}$. Thus, Eq. (1) can be expressed as

$$O = p_{11} + p_{22} \qquad \left( p_{+1} - p_{+2} < p_{11} \le p_{+1},\; 0 < p_{22} \le p_{+2} \right)$$
(3)

This means that as category 1 becomes more dominant, the lower bounds on $p_{11}$ and $P_1$ rise. The same reasoning holds for O: a map with a heavily dominant category is assured of a relatively high O.
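For instance, under the balanced-error conditions of Eq. (2), if the majority category accounts for 95% of the reference area ($p_{+1} = 0.95$, $p_{+2} = 0.05$), Eq. (3) guarantees

$$O = p_{11} + p_{22} > p_{+1} - p_{+2} = 0.90$$

regardless of how well the minority category is classified.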

The fourth problem with the O metric is an extension of the second and third problems. O can be increased by maximizing the producer's accuracy of the majority category or by expanding the extent of the mapping area. We demonstrate this effect using three classification maps with different O values (Fig. 2b–d). The O value in Fig. 2c is increased through an effortless modification of the initial classification (Fig. 2b), which raises the question of whether a map with a higher O value (Fig. 2c) is better in practice than the map with the lowest O value (Fig. 2b). Increasing O by enlarging the mapping area can create false expectations regarding the performance of the classification technique used and/or the quality of the resulting map, an effect referred to as optimistic bias (Hammond and Verbyla 1996). In an extreme case, O improves if every pixel is classified as the dominant category (Fig. 2d). Obtaining a high O value in this way clearly does not indicate a preferable map in the real world, which shows that the O metric can be a poor measure of accuracy. In the field of predictive analytics, this phenomenon (i.e., when high-accuracy models do not have greater predictive power than lower-accuracy models) is called the "accuracy paradox" (Thomas 2013; Kim et al. 2017). It is worth mentioning that the kappa coefficient for Fig. 2d is 0, so kappa does not exhibit the accuracy paradox in this case. Although the kappa coefficient is more conservative than overall accuracy, it has its own problems, and its overuse should also be avoided.
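As a check on the kappa claim, the following is a minimal sketch computing O and Cohen's kappa for the Fig. 2d scenario, in which all 49 pixels are mapped as category 1 while the reference contains 40 pixels of category 1 and 9 of category 2 (rows are map categories, columns are reference categories):

```python
# Sketch: overall accuracy and Cohen's kappa for the Fig. 2d scenario.
# Counts: every pixel mapped as category 1; reference has 40 vs. 9 pixels.
import numpy as np

n = np.array([[40.0, 9.0],
              [ 0.0, 0.0]])

p = n / n.sum()                              # counts -> proportions
O = np.trace(p)                              # 40/49 ~= 0.816
p_e = (p.sum(axis=1) * p.sum(axis=0)).sum()  # chance agreement from marginals
kappa = (O - p_e) / (1 - p_e)                # (0.816 - 0.816) / (1 - 0.816) = 0

print(O, kappa)                              # ~0.816 0.0
```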

Fig. 2 Comparison of thematic maps under different classification strategies. The mapping system consists of two categories denoted 1 and 2 (a). Wrongly classified pixels are located along the shared edges of the two categories. The gray pixels belong to category 2 (e.g., shrubs) and the white area is category 1 (e.g., a grassland). Three classification strategies: b the mapping area is defined by the 7 × 7 box and O = 33/49 = 67%, c the mapping area is expanded to the 9 × 9 box and O = 65/81 = 80%, and d every pixel is classified as category 1, resulting in O = 40/49 = 82%

Note that these problems exist even when the O metric is computed correctly from an error matrix that represents an estimated population. Inappropriate use of reference sampling data can make them worse. For example, O can be unrealistically high if only pure pixels are used to quantify accuracy; biased pixel sampling thus amplifies the problems inherent to O. Overselling overall accuracy misinforms about research reliability because overall accuracy alone can establish neither the robustness of a classification method on the method-research side nor the quality of the classification output on the application-research side.

Because artificial intelligence is becoming increasingly popular for image recognition, classification, and mapping in various disciplines (e.g., Grekousis 2019), consistent standards of accuracy quantification and analysis are needed to assure users that the technique is adequately dependable. Although overall accuracy has problems, it cannot be ignored. Instead, it should be used together with multiple accuracy metrics as part of a comprehensive approach to quantifying map accuracy (Lasko et al. 2005; Liu et al. 2007). Many researchers urge map producers to provide complete information on map development and assessment (Congalton et al. 2014; Congalton and Green 2019; Stehman and Foody 2019). This type of transparency is vital because it allows map stakeholders to compute any accuracy metric rather than relying solely on those reported by the map providers.