Given two classes of objects, labelled 0 and 1, our aim is to assign new objects to their correct class. To do this, we use the information in a sample of objects each with known class memberships (this is called the training set or design set) to construct an algorithm, a classification model, a classification rule, a measuring instrument, or a diagnostic system which will yield estimated probabilities of belonging to class 1 (and, by implication, also for class 0) for future objects. Let \(q\) be the estimated probability that an object belongs to class 1, and define \(F_{0} \left( q \right)\) and \(F_{1} \left( q \right)\) to be the cumulative distribution functions of these estimates of probability for objects randomly drawn from classes 0 and 1 respectively, \(\pi_{0}\) and \(\pi_{1}\) to be the respective class sizes (alternative terms sometimes used for the latter are class proportions, ratios, or priors; we discuss how to handle uncertain class sizes in Sect. 6), \(F\left( q \right) = \pi_{0} F_{0} \left( q \right) + \pi_{1} F_{1} \left( q \right)\) to be the overall mixture cumulative distribution function, and \(f_{0} \left( q \right)\), \(f_{1} \left( q \right)\), and \(f\left( q \right) = \pi_{0} f_{0} \left( q \right) + \pi_{1} f_{1} \left( q \right)\) to be the corresponding density functions. For a given training set and a specified way of deriving the classification model, these functions are known. This paper is concerned with estimating the performance of a classification rule based on such known functions. Comparison of performance estimates for different rules can be used to choose between rules, and to adjust or optimise a rule (e.g. enabling parameter optimisation, threshold setting, etc.). However, note that this paper is not concerned with estimating or comparing the performance of methods of constructing classification or machine learning rules, which would require also taking into consideration the randomness implicit in the choice of training sample and random elements in the estimation algorithm (e.g. in the case of random forests or cross-validation).
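To make the notation concrete, the following sketch shows one way the empirical counterparts of \(\pi_{0}\), \(\pi_{1}\), \(F_{0}\), \(F_{1}\), and \(F\) might be formed from a sample of scores. The array names `scores` and `labels` are illustrative placeholders rather than part of the formal development.

```python
import numpy as np

def empirical_distributions(scores, labels):
    """Empirical analogues of pi_0, pi_1, F_0, F_1 and the mixture F.

    scores : estimated probabilities q of class 1 membership
    labels : true classes (0 or 1), one per object in the training set
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    s0, s1 = scores[labels == 0], scores[labels == 1]
    pi0, pi1 = s0.size / scores.size, s1.size / scores.size

    F0 = lambda q: np.mean(s0 <= q)           # class 0 score CDF
    F1 = lambda q: np.mean(s1 <= q)           # class 1 score CDF
    F = lambda q: pi0 * F0(q) + pi1 * F1(q)   # mixture CDF
    return pi0, pi1, F0, F1, F
```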
To classify an object, its estimated probability of class 1 membership, or “score” for short, is compared with a threshold t, the “classification threshold”, such that objects with scores greater than t are assigned to class 1 and otherwise to class 0. This results in two kinds of potential misclassifications: a class 0 object might be misclassified as class 1, and a class 1 object might be misclassified as class 0. Define \(c \in \left[ {0,1} \right]\) to be the cost due to misclassifying a class 0 object and \(\left( {1 - c} \right)\) the cost due to misclassifying a class 1 object, and take correct classifications as incurring no cost. This is the basic and most common situation encountered, but various generalisations can be made. For example, correct classifications might incur a cost (in that case the cost scale can be standardised by subtracting the cost of a correct classification and renormalising c); misclassification costs might differ depending on how “severe” the misclassification is (an object with a score just on the “wrong” side of the threshold might incur a different cost from one which is far from the threshold on the wrong side); misclassification costs (and even the correct classification costs) might depend on the object, yielding a distribution of costs across the population; costs might not combine additively; population score distributions might change over time (sometimes called population drift or concept drift; this is often handled by revising the classifier periodically: in credit scoring, for example, the tradition has been to rebuild every three years or so, though more advanced methods adaptively update the parameters); and, perhaps the most important generalisation, more than two classes might be involved (extending the H-measure to more than two classes is an ongoing project).
We see that, when threshold t is used, the expected misclassification loss is
$$ L\left( {c;\;t} \right) = c\pi_{0} \left( {1 - F_{0} \left( t \right)} \right) + \left( {1 - c} \right)\pi_{1} F_{1} \left( t \right) $$
(1)
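For illustration, Eq. (1) translates directly into code; the minimal sketch below reuses the empirical quantities constructed above.

```python
def loss(c, t, pi0, pi1, F0, F1):
    """Eq. (1): expected loss at threshold t, where misclassifying a class 0
    object costs c and misclassifying a class 1 object costs 1 - c."""
    return c * pi0 * (1.0 - F0(t)) + (1.0 - c) * pi1 * F1(t)
```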
For any given cost c, a sensible choice of classification threshold t is the value which minimises \(L\left( {c;\;t} \right)\) (we consider other ways of choosing t in Sect. 9). Assuming the distributions are differentiable on \(\left[ {0,1} \right]\), differentiating (1) shows that this is given by \(t = T_{c}\) satisfying
$$ c = \pi_{1} f_{1} \left( {T_{c} } \right)/f\left( {T_{c} } \right) $$
(2)
But (by definition of \(\pi_{1}\), \(f_{1} \left( q \right)\), and \(f\left( q \right)\)) the ratio \(\pi_{1} f_{1} \left( q \right)/f\left( q \right)\) is the conditional probability that an object with score q belongs to class 1, as estimated from the training set using the specified classifier model or algorithm. That is, \(\pi_{1} f_{1} \left( q \right)/f\left( q \right) = q\). It follows from (2) that the (estimated) best choice of threshold to use, when c is the cost of misclassifying a class 0 object and \(\left( {1 - c} \right)\) is the cost of misclassifying a class 1 object, is \(T_{c} = c\).
This leads to a minimum classification loss of
$$ L\left( c \right) = c\pi_{0} \left( {1 - F_{0} \left( c \right)} \right) + \left( {1 - c} \right)\pi_{1} F_{1} \left( c \right) $$
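In code, the minimum loss is then obtained simply by setting the threshold equal to the cost, as the following sketch (reusing `loss()` from above) illustrates.

```python
def min_loss(c, pi0, pi1, F0, F1):
    """Minimum expected loss L(c), obtained with threshold t = T_c = c
    (reuses loss() from the sketch above)."""
    return loss(c, c, pi0, pi1, F0, F1)

# Illustrative sanity check: for a well-calibrated classifier, a grid search
# over thresholds should find its minimiser close to t = c, e.g.
# (with numpy imported as np and empirical pi0, pi1, F0, F1 in scope):
#   ts = np.linspace(0.0, 1.0, 1001)
#   t_star = ts[np.argmin([loss(0.3, t, pi0, pi1, F0, F1) for t in ts])]
```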
In summary, if the cost c (and class sizes and distributions) are known, this gives us the unique classification threshold to use and the associated consequent minimum loss. In general, however, costs are difficult to determine and they may not be known at the time that the classification rule has to be evaluated. For example, in a clinical setting the severity of misclassifications might depend on what treatments will be available in a clinic or on the particular characteristics of the local patient population, or in a credit scoring context the degree of acceptable risk might depend on current interest rates. More generally, we might want to make comparative statements about the relative performance of classification rules without knowing the details of the environment in which they will be applied, with the risk that any particular choice of costs could be very different from those encountered in practice. For these reasons, we will integrate over a distribution of costs, \(w\left( c \right)\) say, chosen to represent one’s beliefs about the costs to be encountered in the future. A similar point applies if the class sizes, \(\pi_{0}\) and \(\pi_{1}\), are unknown, but they can, at least in principle, be estimated from empirical considerations. We discuss this further in Sect. 6.
The overall expected minimum misclassification loss is then
$$ L = \int\limits_{0}^{1} {\left[ {c\pi_{0} \left( {1 - F_{0} \left( c \right)} \right) + \left( {1 - c} \right)\pi_{1} F_{1} \left( c \right)} \right]w\left( c \right){\text{d}}c} $$
(3)
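Eq. (3) can be approximated numerically by integrating the minimum loss against a chosen cost distribution. In the sketch below, which reuses `min_loss()` from above, the Beta(2, 2) density is only a placeholder for w; the choice of w is discussed in the next section.

```python
import numpy as np
from scipy.stats import beta

def expected_min_loss(pi0, pi1, F0, F1, w=beta(2, 2).pdf, n_grid=1001):
    """Numerical approximation of Eq. (3): integrate L(c) against w(c)."""
    cs = np.linspace(0.0, 1.0, n_grid)
    Lc = np.array([min_loss(c, pi0, pi1, F0, F1) for c in cs])
    return np.trapz(Lc * w(cs), cs)
```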
Substituting \(c = \pi_{1} f_{1} \left( c \right)/f\left( c \right)\), and hence \(1 - c = \pi_{0} f_{0} \left( c \right)/f\left( c \right)\), into (3) yields
$$ \begin{aligned} L & = \int\limits_{0}^{1} {\left[ {\frac{{\pi_{1} f_{1} \left( c \right)}}{f\left( c \right)}\pi_{0} \left( {1 - F_{0} \left( c \right)} \right) + \frac{{\pi_{0} f_{0} \left( c \right)}}{f\left( c \right)}\pi_{1} F_{1} \left( c \right)} \right]w\left( c \right){\text{d}}c} \\ & = \pi_{0} \pi_{1} \mathop \smallint \limits_{0}^{1} \left[ {f_{1} \left( c \right)\left( {1 - F_{0} \left( c \right)} \right) + f_{0} \left( c \right)F_{1} \left( c \right)} \right]\frac{w\left( c \right)}{{f\left( c \right)}}{\text{d}}c \\ \end{aligned} $$
From this we see that, were we to take \(w\left( c \right) = f\left( c \right)\), we would obtain \( L = L_{A}\) where
$$ \begin{aligned} L_{A} & = \pi_{0} \pi_{1} \int\limits_{0}^{1} {\left[ {f_{1} \left( c \right)\left( {1 - F_{0} \left( c \right)} \right) + f_{0} \left( c \right)F_{1} \left( c \right)} \right]\frac{f\left( c \right)}{{f\left( c \right)}}{\text{d}}c} \\ & = 2\pi_{0} \pi_{1} \left( {1 - \int\limits_{0}^{1} {F_{0} \left( c \right)f_{1} \left( c \right){\text{d}}c} } \right) \\ \end{aligned} $$
(4)
The expression \(\int\limits_{0}^{1} {F_{0} \left( c \right)f_{1} \left( c \right){\text{d}}c}\) in (4) is the familiar and widely used measure of performance known as the Area Under the Receiver Operating Characteristic Curve, denoted AUC (see, for example, Krzanowski and Hand 2009).
Rearranging (4), we obtain
$$ AUC = 1 - \frac{{L_{A} }}{{2\pi_{0} \pi_{1} }} $$
That is, the AUC for a particular classifier is a linear function of the expected misclassification loss when the cost distribution w is taken to be the overall score distribution for the classifier being evaluated: \(w\left( c \right) = f\left( c \right)\).
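As an informal numerical check (a sketch under the assumption, made above in deriving \(T_{c} = c\), that the scores are calibrated probabilities), the AUC can be computed via the rank (Mann–Whitney) statistic and compared with \(1 - L_{A} /\left( {2\pi_{0} \pi_{1} } \right)\), where \(L_{A}\) is approximated by averaging the minimum loss over the pooled scores, which is what taking \(w = f\) amounts to. The function name below is illustrative.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the rank statistic: the probability that a randomly chosen
    class 1 score exceeds a randomly chosen class 0 score (ties count 1/2)."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    s0, s1 = s[y == 0], s[y == 1]
    gt = (s1[:, None] > s0[None, :]).mean()
    eq = (s1[:, None] == s0[None, :]).mean()
    return gt + 0.5 * eq

# Taking w(c) = f(c) amounts to averaging min_loss over the pooled scores:
#   L_A ~ mean of min_loss(q, pi0, pi1, F0, F1) over all observed scores q
#   AUC ~ 1 - L_A / (2 * pi0 * pi1)        (Eq. (4) rearranged)
# with the agreement relying on the calibration identity pi1*f1(q)/f(q) = q.
```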
This seems to provide a justification for the widespread use of the AUC in comparing classification rules. However, note that the overall score distribution \(f\) will generally be different for different classifiers. This means that, if we choose the distribution \(w\) to be \(f\), the loss \(L_{A}\) will be calculated using different \(w\) cost distributions for different classifiers. But this is inappropriate: the cost distribution must be the same for all classifiers we might wish to compare. We would not want to say that, when we used one classifier, we thought that misclassifying a particular object would be likely to incur cost \(c_{x}\), but that, when we used a different classifier, misclassifying that same object would be likely to incur a different cost \(c_{y}\). Putting aside the expense of running an algorithm, we would not want to say that the loss arising from misclassifying a particular object using logistic regression is more severe than the loss arising from misclassifying that object using a support vector machine. Our belief about the likely severities of losses arising from misclassifying objects is an aspect of the problem and the researcher, not of the classification method. The distribution we choose for the misclassification cost should therefore not depend on which classifier we use.
Since the AUC is equivalent to (i.e. is a linear transformation of) \(L_{A}\), this also implies that measuring the performance of different classifiers using the AUC is unreasonable, at least if the threshold is chosen to minimise loss for each c (we discuss alternatives in Sect. 9): it is equivalent to using different measuring criteria for different classifiers, contravening the basic principle of measurement that different objects should be measured using the same or equivalent instruments. This fundamental unsuitability of the AUC as a measure of classifier performance had been previously noted by Hilden (1991), though we were unaware of his work when we wrote Hand (2009). The dependence of the AUC on the classifier itself means that it can lead to seriously misleading results, so it is concerning that it continues to be widely used (see Hand and Anagnostopoulos 2013 for evidence of how widely it is used).
To overcome the problem, we need to use the same \(w\left( c \right)\) distribution for all classifiers being compared. This is exactly what the H-measure does, and we discuss the choice of w in the next section.
As a final point, note that many measures of performance, including the AUC, the proportion correctly classified, and the F-measure, take values between 0 and 1, with larger values indicating better performance. To produce a measure which accords with this convention (Hand 2009), the H-measure is a standardised version of the loss:
$$ H = 1 - L/L_{{{\text{ref}}}} $$
(5)
where
$$ L_{{{\text{ref}}}} = \pi_{0} \int\limits_{0}^{{\pi_{1} }} {cw\left( c \right){\text{d}}c} + \pi_{1} \int\limits_{{\pi_{1} }}^{1} {\left( {1 - c} \right)w\left( c \right){\text{d}}c} $$
(6)
is the value of L when \(F_{0} \equiv F_{1}\). This reference value of L is derived by noting that, when \(F_{0} \equiv F_{1}\), the minimum of L is achieved by classifying everything to class 1 if \(c\pi_{0} < \left( {1 - c} \right)\pi_{1}\) and everything to class 0 otherwise. That is, classify all objects to class 1 whenever \(c < \pi_{1}\) and to class 0 whenever \(c \ge \pi_{1}\). \(L_{{{\text{ref}}}}\) is the worst case in the sense that it is the minimum loss corresponding to a classifier which fails to separate the score distributions of the two classes at all. Of course, even worse cases can arise, namely when the classifier assigns classes the wrong way round. For example, the very worst loss arises when the classifier assigns all class 0 objects to class 1 and all class 1 objects to class 0, yielding loss \(c\pi_{0} + \left( {1 - c} \right)\pi_{1} = \pi_{0} \left( {2c - 1} \right) + 1 - c\). In such cases, we can simply invert all the assigned class labels, yielding a classification performance better than that of random assignment.
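As a final illustrative sketch (not a definitive implementation), Eqs. (5) and (6) combine with the earlier functions as follows; again the Beta(2, 2) weight merely stands in for whatever w is eventually chosen.

```python
import numpy as np
from scipy.stats import beta

def h_measure(pi0, pi1, F0, F1, w=beta(2, 2).pdf, n_grid=1001):
    """H = 1 - L / L_ref (Eq. (5)), with L from Eq. (3) and L_ref from Eq. (6).
    Reuses expected_min_loss() from the earlier sketch."""
    cs = np.linspace(0.0, 1.0, n_grid)
    wc = w(cs)
    L = expected_min_loss(pi0, pi1, F0, F1, w=w, n_grid=n_grid)
    # Eq. (6): minimum loss of a classifier whose class score distributions
    # coincide, i.e. assign everything to class 1 when c < pi1 (loss c*pi0)
    # and everything to class 0 otherwise (loss (1 - c)*pi1).
    L_ref = np.trapz(np.where(cs < pi1, pi0 * cs, pi1 * (1.0 - cs)) * wc, cs)
    return 1.0 - L / L_ref
```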