1 Introduction

In the medical field, binary classification problems are common, and sensitivity, specificity, accuracy, and negative and positive predictive values are often used as measures of the performance of a binary predictor. In computer science, a classifier is usually evaluated with precision and recall, which equal the positive predictive value and sensitivity, respectively. For measuring the performance of text classification in information retrieval and of classifiers in machine learning, the F score (F measure) has been widely used. In particular, the F1 score, defined as the harmonic mean of precision and recall, has been popular [1, 2]. Despite its favorable characteristics, the F1 score is rarely used in diagnostic studies in medicine. As a single performance measure, the F1 score may be preferred to specificity and accuracy, which can be artificially high even for a poor classifier with a high false negative probability when disease prevalence is low. The F1 score is especially useful when identification of true negatives is relatively unimportant, because the true negative rate enters the computation of neither precision nor recall.

To evaluate a multi-class classification, a single summary measure is often sought, and two such measures exist as extensions of the binary F1 score: the micro-averaged F1 score and the macro-averaged F1 score [2]. The micro-averaged F1 score pools per-sample classifications across classes and then calculates the overall F1 score. In contrast, the macro-averaged F1 score computes a simple average of the F1 scores over classes. Sokolova and Lapalme [3] gave an alternative definition of the macro-averaged F1 score as the harmonic mean of the simple averages of precision and recall over classes. Both micro-averaged and macro-averaged F1 scores have a simple interpretation as an average of precision and recall, differing in how the averages are computed. Moreover, as will be shown in Section 2, the micro-averaged F1 score has an additional interpretation as the total probability of true positive classifications.

For binary classification, statistical methods for inference on the F1 score have been proposed (e.g., [4]); however, the methodology has not been extended to the multi-class F1 scores. To our knowledge, methods for computing variance estimates of the micro-averaged and macro-averaged F1 scores have not been reported. Consequently, confidence intervals for the multi-class F1 scores cannot be computed, and inference about them is usually based solely on point estimates, which is highly limiting in practice. For example, consider the results of an analysis reported by Dong et al. [5]. The authors calculated point estimates of the macro-averaged F1 score for four classifiers and concluded that one classifier outperformed the others by comparing the point estimates without taking their uncertainty into account. Others have also used multi-class F1 scores but reported only point estimates without confidence intervals [6,7,8,9,10,11,12,13,14,15,16].

To address this knowledge gap, we herein provide methods for computing the variances of these multi-class F1 scores, so that estimating the micro-averaged and macro-averaged F1 scores with confidence intervals becomes possible in multi-class classification.

The rest of the manuscript is organized as follows. The definitions of the micro-averaged and macro-averaged F1 scores are reviewed in Section 2. In Section 3, variance estimates and confidence intervals for the multi-class F1 scores are derived. A simulation study investigating the coverage probabilities of the proposed confidence intervals is presented in Section 4. Our method is then applied to a real study as an example in Section 5, followed by a brief discussion in Section 6.

2 Averaged F1 scores

This section introduces notation and definitions of the multi-class F1 scores, namely the macro-averaged and micro-averaged F1 scores. Consider an r × r contingency table for a nominal categorical variable with r classes (r ≥ 2). The columns indicate the true conditions, and the rows indicate the predicted conditions. The case r = 2 is called binary classification, and the case r > 2 multi-class classification. Such a table is also called a confusion matrix. We consider multi-class classification, i.e., r > 2, and denote the cell probabilities by pij and the marginal probabilities by pi⋅ and p⋅j (i,j = 1,⋯ ,r). For each class i (i = 1,⋯ ,r), the true positive rate (TPi), the false positive rate (FPi), and the false negative rate (FNi) are defined as follows:

$$ \begin{array}{@{}rcl@{}} TP_{i} &=& p_{ii}, \\ FP_{i} &=& \sum\limits_{\substack{j=1 \\ j\neq i}}^{r} p_{ij}, \\ FN_{i} &=& \sum\limits_{\substack{j=1 \\ j\neq i}}^{r} p_{ji}. \end{array} $$

TPi is the i-th diagonal element, FPi is the sum of the off-diagonal elements of the i-th row, and FNi is the sum of the off-diagonal elements of the i-th column. Note that TPi + FPi = pi⋅ and TPi + FNi = p⋅i.

In the current and following sections, we will use the simple 3-by-3 confusion matrix in Table 1 as an example to demonstrate various computations. Columns represent the true state, and rows represent the predicted classification. The total sample size is 100.

Table 1 Numeric example

The within-class probabilities are:

$$ \begin{array}{llllll} TP_{1}&=0.02, & \qquad TP_{2}&=0.70, & \qquad TP_{3}&=0.15, \\ FP_{1}&=0.04, & \qquad FP_{2}&=0.07, & \qquad FP_{3}&=0.02, \\ FN_{1}&=0.05, & \qquad FN_{2}&=0.04, & \qquad FN_{3}&=0.04. \end{array} $$
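For concreteness, these quantities can be reproduced in R. Since Table 1 is not reproduced in this section, the confusion matrix below is a hypothetical reconstruction: its diagonal, margins, and the resulting TPi, FPi, and FNi match all values reported here, but the split of the off-diagonal cells is one consistent choice rather than a quotation of Table 1.

```r
# Hypothetical 3-by-3 confusion matrix (n = 100) consistent with the
# reported TP, FP, FN, and margins; rows: predicted, columns: true.
tab <- matrix(c( 2,  3,  1,
                 4, 70,  3,
                 1,  1, 15),
              nrow = 3, byrow = TRUE)
n <- sum(tab)        # 100
p <- tab / n         # estimated cell probabilities

TP <- diag(p)                 # TP_i = p_ii
FP <- rowSums(p) - diag(p)    # FP_i = sum over j != i of p_ij
FN <- colSums(p) - diag(p)    # FN_i = sum over j != i of p_ji
round(rbind(TP, FP, FN), 2)   # matches the values above
```

The code sketches in the remainder of this manuscript reuse `tab`, `p`, `n`, `TP`, `FP`, and `FN` from this block.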

Micro-averaged F1 score

The micro-averaged precision (miP) and micro-averaged recall (miR) are defined as

$$ \begin{array}{@{}rcl@{}} miP &=& \frac{{\sum}_{i=1}^{r} TP_{i}}{{\sum}_{i=1}^{r}\left( TP_{i} + FP_{i}\right)} = \frac{\sum p_{ii}}{\sum p_{i \cdot}} = \sum\limits_{i=1}^{r} p_{ii}, \\ miR &=& \frac{{\sum}_{i=1}^{r} TP_{i}}{{\sum}_{i=1}^{r}\left( TP_{i} + FN_{i}\right)} = \frac{\sum p_{ii}}{\sum p_{\cdot i}} = \sum\limits_{i=1}^{r} p_{ii}. \end{array} $$

Note that for both miP and miR, the denominator is the sum of all the elements (diagonal and off-diagonal) of the confusion matrix, which equals 1. Finally, the micro-averaged F1 score is defined as the harmonic mean of these quantities:

$$ miF_{1} = 2\frac{miP \times miR}{miP + miR} = \sum\limits_{i=1}^{r} p_{ii}. $$
(1)

This definition is commonly used (e.g., [6, 8,9,10,11,12, 14, 15]).

By definition, we have miP, miR, and miF1 all equal to the sum of the diagonal elements, which, in our example, is 0.87.
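As a quick check in R (reusing `p` from the sketch above), the three micro-averaged quantities indeed coincide with the diagonal sum:

```r
# miP and miR share the denominator sum(p) = 1, so
# miP = miR = miF1 = sum of the diagonal elements.
miP  <- sum(diag(p)) / sum(p)
miR  <- sum(diag(p)) / sum(p)
miF1 <- 2 * miP * miR / (miP + miR)
miF1   # 0.87
```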

Macro-averaged F1 score

To define the macro-averaged F1 score (maF1), first consider the following precision (Pi) and recall (Ri) within each class, i = 1,⋯ ,r:

$$ \begin{array}{@{}rcl@{}} P_{i} &=& \frac{TP_{i}}{(TP_{i} + FP_{i})} = p_{ii} / p_{i \cdot}, \\ R_{i} &=& \frac{TP_{i}}{(TP_{i} + FN_{i})} = p_{ii} / p_{\cdot i}. \end{array} $$

For our example, simple calculation shows:

$$ \begin{array}{llllll} P_{1}&=0.33, & \qquad P_{2}&=0.91, & \qquad P_{3}&=0.88, \\ R_{1}&=0.29, & \qquad R_{2}&=0.95, & \qquad R_{3}&=0.79. \end{array} $$

The F1 score within each class (F1i) is defined as the harmonic mean of Pi and Ri, that is,

$$ F_{1i} = 2\frac{P_{i} \times R_{i}}{P_{i} + R_{i}} = 2\frac{p_{ii}}{p_{i \cdot}+p_{\cdot i}}. $$

The macro-averaged F1 score is defined as the simple arithmetic mean of F1i:

$$ maF_{1} = \frac{1}{r} \sum\limits_{i=1}^{r} F_{1i} = \frac{2}{r}\sum\limits_{i=1}^{r} \frac{p_{ii}}{p_{i \cdot}+p_{\cdot i}}. $$
(2)

This score, like miF1, is frequently reported (e.g., [5,6,7,8,9,10, 13]).

F1i and maF1 in our example are:

$$ \begin{array}{@{}rcl@{}} F_{11} &=& 0.308, \phantom{abc} F_{12}=0.927, \phantom{abc} F_{13} =0.833. \\ maF_{1} &=& (0.308 + 0.927 + 0.833) / 3 = 0.689. \end{array} $$
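These within-class and macro-averaged values can be computed in R (reusing `p` from the earlier sketch):

```r
# Within-class precision, recall, and F1, then their simple average (Eq. 2).
P    <- diag(p) / rowSums(p)   # P_i = p_ii / p_i.
R    <- diag(p) / colSums(p)   # R_i = p_ii / p_.i
F1   <- 2 * P * R / (P + R)    # = 2 p_ii / (p_i. + p_.i)
maF1 <- mean(F1)
round(F1, 3)    # 0.308 0.927 0.833
round(maF1, 3)  # 0.689
```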

Alternative definition of macro-averaged F1 score

Sokolova and Lapalme [3] gave an alternative definition of the macro-averaged F1 score (\(maF^{*}_{1}\)). First, the macro-averaged precision (maP) and macro-averaged recall (maR) are defined as simple arithmetic means of the within-class precision and recall, respectively:

$$ \begin{array}{@{}rcl@{}} maP &=& \frac{1}{r}\sum\limits_{i=1}^{r}\frac{TP_{i}}{TP_{i}+FP_{i}} = \frac{1}{r} \sum\limits_{i=1}^{r}\frac{p_{ii}}{p_{i \cdot}}, \\ maR &=& \frac{1}{r}\sum\limits_{i=1}^{r}\frac{TP_{i}}{TP_{i}+FN_{i}} = \frac{1}{r} \sum\limits_{i=1}^{r}\frac{p_{ii}}{p_{\cdot i}}. \end{array} $$

Then \(maF^{*}_{1}\) is defined as the harmonic mean of these quantities:

$$ maF^{*}_{1} = 2\frac{maP\times maR}{maP+maR}. $$
(3)

This version of the macro-averaged F1 score is less frequently used (e.g., [11, 12, 16]). For our example,

$$ \begin{array}{@{}rcl@{}} maP &=& (0.02/0.06 + 0.70/0.77 + 0.15/0.17) / 3 = 0.708. \\ maR &=& (0.02/0.07 + 0.70/0.74 + 0.15/0.19) / 3 = 0.674. \\ maF_{1}^{*} &=& 0.691. \end{array} $$
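In R (reusing `p`), the alternative definition is computed as:

```r
# Macro-averaged precision and recall, then their harmonic mean (Eq. 3).
maP       <- mean(diag(p) / rowSums(p))   # 0.708
maR       <- mean(diag(p) / colSums(p))   # 0.674
maF1.star <- 2 * maP * maR / (maP + maR)  # 0.691
```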

In this example, the micro-averaged F1 score is higher than the macro-averaged F1 scores because both within-class precision and recall are much lower for the first class than for the other two. Micro-averaging puts only a small weight on the first class because its sample size is relatively small. This numeric example illustrates a shortcoming of summarizing the performance of a multi-class classification with a single number when within-class precision and recall vary substantially. Nevertheless, aggregate measures such as the micro-averaged and macro-averaged F1 scores are useful for quantifying the performance of a classifier as a whole.

3 Variance estimate and confidence interval

In this section, we derive confidence intervals for miF1, maF1, and \(maF^{*}_{1}\). We assume that the observed frequencies nij, 1 ≤ i ≤ r, 1 ≤ j ≤ r, follow a multinomial distribution with sample size n and probabilities p = (p11,⋯ ,p1r,p21,⋯ ,p2r,⋯ ,pr1,⋯ ,prr)T, where “T” denotes the transpose; that is,

$$ (n_{11}, n_{12}, \cdots, n_{rr}) \sim Multinomial (n; \boldsymbol{p}). $$

The expectation, variance, and covariance, for i,j,k,l = 1,⋯ ,r, are:

$$ \begin{array}{@{}rcl@{}} E\left( n_{ij}\right) &=& np_{ij}, \\ Var\left( n_{ij}\right) &=& np_{ij}(1-p_{ij}), \\ Cov\left( n_{ij},n_{kl}\right) &=& -np_{ij}p_{kl}, \text{ for } i\not=k \text{ or } j\not=l, \end{array} $$

respectively, where \(n= {\sum }_{i,j} n_{ij}\) is the overall sample size. The maximum likelihood estimate (MLE) of pij is \(\hat {p}_{ij} = n_{ij}/n\). Using the multivariate central limit theorem, we have

$$ \begin{array}{@{}rcl@{}} \sqrt{n}\left( \hat{\boldsymbol{p}} - \boldsymbol{p} \right) \dot\sim Normal\left( \boldsymbol{0}_{r^{2}}, diag(\boldsymbol{p})-\boldsymbol{p}\boldsymbol{p}^{T} \right), \end{array} $$

where \(\boldsymbol {0}_{r^{2}}\) is the r² × 1 vector whose elements are all 0, diag(p) is the r² × r² diagonal matrix whose diagonal elements are p, and “\( \dot \sim \)” represents “approximately distributed as.”
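This asymptotic covariance is easy to check empirically. The following R sketch (the replicate count and per-replicate sample size are arbitrary choices for illustration) compares the sample covariance of \(\sqrt{n}(\hat{\boldsymbol{p}} - \boldsymbol{p})\) with diag(p) − ppT, using the example table from Section 2 as the true p:

```r
# Empirical check of the multivariate CLT for the vectorized cell
# probabilities; 'p' is the example matrix from the earlier sketch.
set.seed(1)
pvec  <- as.vector(p)                            # column-major vectorization
nn    <- 1000                                    # sample size per replicate
draws <- rmultinom(10000, size = nn, prob = pvec) / nn
emp   <- cov(t(sqrt(nn) * (draws - pvec)))
max(abs(emp - (diag(pvec) - pvec %*% t(pvec))))  # close to 0
```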

By the invariance property of MLEs, the maximum likelihood estimates of miF1, maF1, \(maF_{1}^{*}\), and the other quantities in the previous section are obtained by substituting \(\hat {p}_{ij}\) for pij. In the following subsections, we use the multivariate delta method to derive the large-sample distributions of \(\widehat {miF_{1}}\), \(\widehat {maF_{1}}\), and \(\widehat {maF_{1}^{*}}\).

3.1 Confidence interval for miF1

As shown in (1), \(miF_{1} = \sum p_{ii}\), and its MLE is

$$ \widehat{miF_{1}} = \sum\limits_{i=1}^{r} \hat{p}_{ii}. $$

Using the multivariate delta method (Appendix), we have

$$ \widehat{miF_{1}} \dot\sim Normal\left( miF_{1},Var(\widehat{miF_{1}})\right), $$

where the variance of \(\widehat {miF_{1}}\) is

$$ Var\left( \widehat{miF_{1}}\right) = \left. \left( \sum\limits_{i=1}^{r} p_{ii}\right)\left( 1-\sum\limits_{i=1}^{r} p_{ii}\right) \right/ n. $$
(4)

A (1 − α) × 100% confidence interval for miF1 is

$$ \widehat{miF_{1}} \pm Z_{1-\alpha/2} \times \sqrt{\widehat{Var}\left( \widehat{miF_{1}}\right)}, $$

where \(\widehat {Var}\left (\widehat {miF_{1}}\right )\) is \(Var\left (\widehat {miF_{1}}\right )\) with {pii} replaced by \(\{\hat {p}_{ii}\}\), and Zp denotes the 100p-th percentile of the standard normal distribution. Computation of \(\widehat {Var}\left (\widehat {miF_{1}}\right )\) for our numeric example is straightforward using (4):

$$ \begin{array}{@{}rcl@{}} \widehat{Var}\left( \widehat{miF_{1}}\right) &=& \left( 0.02+0.70+0.15 \right)\\ &&\times \left\{ 1-(0.02+0.70+0.15) \right\} / 100 \\ &=& 0.0336^{2}. \end{array} $$

A 95% confidence interval for miF1 is then

$$ 0.87 \pm 1.960 \times 0.0336 = \left( 0.804, 0.936 \right). $$
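The same computation in R (reusing `p` and `n` from the earlier sketch) takes one line each for the point estimate, the variance, and the interval:

```r
# Wald-type 95% CI for miF1 using Eq. (4).
miF1 <- sum(diag(p))
v.mi <- miF1 * (1 - miF1) / n
miF1 + c(-1, 1) * qnorm(0.975) * sqrt(v.mi)   # (0.804, 0.936)
```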

3.2 Confidence interval for maF1

The MLE of maF1 can be obtained by substituting pii, pi⋅, and p⋅i by their MLEs in (2):

$$ \widehat{maF}_{1} = \frac{2}{r} \sum\limits_{i=1}^{r}\frac{\hat{p}_{ii}}{\hat{p}_{i \cdot} + \hat{p}_{\cdot i}}. $$

Again by the multivariate delta method (Appendix), the variance of \(\widehat {maF_{1}}\) is

$$ \begin{array}{@{}rcl@{}} Var\left( \widehat{maF_{1}}\right) &=& \left. \frac{2}{r^{2}} \left\{ \sum\limits_{i=1}^{r} \frac{F_{1i}(p_{i \cdot}+p_{\cdot i}-2p_{ii})}{(p_{i \cdot}+p_{\cdot i})^{2}}\left( \frac{p_{i \cdot}+p_{\cdot i}-2p_{ii}}{p_{i\cdot}+p_{\cdot i}} +\frac{F_{1i}}{2}\right)\right.\right.\\ &&\left.\left.+ \sum\limits_{i=1}^{r} \sum\limits_{j\neq i}\frac{p_{ij}F_{1i}F_{1j}}{(p_{i\cdot}+p_{\cdot i})(p_{j \cdot}+p_{\cdot j})}\right\} \right/ n . \end{array} $$

A (1 − α) × 100% confidence interval of maF1 is

$$ \widehat{maF_{1}} \pm Z_{1-\alpha/2} \times \sqrt{ \widehat{Var}\left( \widehat{maF_{1}}\right)} , $$

where \(\widehat {Var}\left (\widehat {maF_{1}}\right )\) is \(Var\left (\widehat {maF_{1}}\right )\) with {pij} replaced by \(\{\hat {p}_{ij}\}\). This computation is complex even for a small 3-by-3 table; the R code in the Appendix was used to compute the variance estimate and a 95% confidence interval for maF1:

$$ \begin{array}{@{}rcl@{}} \widehat{Var}\left( \widehat{maF_{1}}\right) &=& 0.0650^{2}, \\ 0.69 &\pm& 1.960 \times 0.0650 = \left( 0.562, 0.817 \right). \end{array} $$
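The R code in the Appendix performs this computation; the following is a minimal sketch of the same variance formula for the running example (reusing `p`, `F1`, and `n` from the earlier sketches), with the two sums computed separately:

```r
# Variance of the maF1 estimate and its Wald-type 95% CI.
r <- nrow(p)
s <- rowSums(p) + colSums(p)          # s_i = p_i. + p_.i
a <- s - 2 * diag(p)                  # p_i. + p_.i - 2 p_ii
term1 <- sum(F1 * a / s^2 * (a / s + F1 / 2))
cross <- p * (F1 %o% F1) / (s %o% s)  # p_ij F1_i F1_j / (s_i s_j)
term2 <- sum(cross) - sum(diag(cross))           # off-diagonal pairs only
v.ma  <- 2 / r^2 * (term1 + term2) / n
sqrt(v.ma)                                       # 0.0650
mean(F1) + c(-1, 1) * qnorm(0.975) * sqrt(v.ma)  # (0.562, 0.817)
```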

3.3 Confidence interval for \(maF^{*}_{1}\)

To obtain the MLE of \(maF^{*}_{1}\), we first substitute pii, pi⋅, and p⋅i by their MLEs to get the MLEs of maP and maR and use these in (3):

$$ \widehat{maF^{*}_{1}} = 2\frac{\widehat{maP}\times \widehat{maR}}{\widehat{maP} + \widehat{maR}}. $$

As shown in the Appendix,

$$ Var\left( \widehat{maF^{*}_{1}}\right) = \left. 4n\frac{ maR^{4} Var\left( \widehat{maP}\right) + 2maP^{2}maR^{2}Cov\left( \widehat{maP},\widehat{maR}\right) + maP^{4} Var\left( \widehat{maR}\right) }{\left( maP+maR\right)^{4}} \right/ n , $$

where

$$ \begin{array}{@{}rcl@{}} Var\left( \widehat{maP}\right) &=& \frac{1}{r^{2}}\left.\left( \sum\limits_{i=1}^{r} \frac{p_{ii}\left( {\sum}_{j\not=i}p_{ij}\right)}{p_{i \cdot}^{3}} \right) \right/ n , \\ Var\left( \widehat{maR}\right) &=& \frac{1}{r^{2}}\left.\left( \sum\limits_{i=1}^{r} \frac{p_{ii}\left( {\sum}_{j\not=i}p_{ji}\right)}{p_{\cdot i}^{3}} \right) \right/ n ,\\ Cov\left( \widehat{maP}, \widehat{maR}\right) &=& \frac{1}{r^{2}}\left.\left( \sum\limits_{i=1}^{r} \frac{\left( {\sum}_{j\not=i}p_{ij}\right) p_{ii} \left( {\sum}_{j\not=i}p_{ji}\right)}{p_{i \cdot}^{2}p_{\cdot i}^{2}} + \sum\limits_{i=1}^{r} \sum\limits_{j\not=i} \frac{p_{ii}p_{ij}p_{jj}}{p_{i \cdot}^{2}p_{\cdot j}^{2}} \right) \right/ n . \end{array} $$

A (1 − α) × 100% confidence interval of \(maF^{*}_{1}\) is

$$ \widehat{maF^{*}_{1}} \pm Z_{1-\alpha/2} \times \sqrt{ \widehat{Var}\left( \widehat{maF^{*}_{1}}\right)}. $$

Again, to obtain \(\widehat {Var}\left (\widehat {maF^{*}_{1}}\right )\), all components of \(Var\left (\widehat {maF^{*}_{1}}\right )\) are replaced by their respective MLEs. Using the accompanying R code (Appendix), we computed the variance estimate and a 95% confidence interval for \(maF_{1}^{*}\):

$$ \begin{array}{@{}rcl@{}} \widehat{Var}\left( \widehat{maF_{1}^{*}}\right) &=& 0.0649^{2} \\ 0.69 &\pm& 1.960 \times 0.0649 = \left( 0.563, 0.818 \right). \end{array} $$
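A sketch of this computation in R, reusing `p`, `TP`, `FP`, `FN`, `n`, `r`, `maP`, `maR`, and `maF1.star` from the earlier blocks (note that the n in the numerator of the variance formula cancels against the final division by n, since `v.maP`, `v.maR`, and `cv` below already include the factor 1/n):

```r
# Variance components for the maF1* estimate and its Wald-type 95% CI.
pr <- rowSums(p); pc <- colSums(p)
v.maP <- sum(TP * FP / pr^3) / r^2 / n
v.maR <- sum(TP * FN / pc^3) / r^2 / n
cr <- (TP %o% TP) * p / (pr^2 %o% pc^2)   # p_ii p_ij p_jj / (p_i.^2 p_.j^2)
cv <- (sum(FP * TP * FN / (pr^2 * pc^2)) + sum(cr) - sum(diag(cr))) / r^2 / n
v.star <- 4 * (maR^4 * v.maP + 2 * maP^2 * maR^2 * cv + maP^4 * v.maR) /
          (maP + maR)^4
sqrt(v.star)                                        # 0.0649
maF1.star + c(-1, 1) * qnorm(0.975) * sqrt(v.star)  # (0.563, 0.818)
```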

4 Simulation

We performed a simulation study to assess the coverage probability of the confidence intervals proposed in Section 3. We set r = 3 (classes 1, 2, 3) and generated data according to multinomial distributions with the p summarized in Table 2. The total sample size n was set to 25, 50, 100, 500, 1,000, and 5,000. For each combination of true distribution and sample size, we generated 1,000,000 datasets, each time computing 95% confidence intervals for miF1, maF1, and \(maF^{*}_{1}\).

Table 2 Simulation study: True cell probabilities

In scenario 1, the true conditions of class 1, 2, and 3 have the same probability (1/3), and the recall and precision are equal (80%). Thus miP = maP = 0.80, miR = maR = 0.80, and \(miF_{1} = maF_{1} = maF^{*}_{1} = 0.80\).

In scenario 2, the true condition of class 1 has higher probability than the others (80% vs 10%), and the recall and precision of class 1 are also higher than the others (80% vs 40%, and 91% vs 27%, respectively). miF1 gives equal weight to each per-sample classification decision, whereas maF1 gives equal weight to each class. Thus, large classes dominate small classes in computing miF1 [2], and miF1 is larger than maF1 (miF1 = 0.72, maF1 = 0.50, \(maF^{*}_{1} = 0.51\)) in scenario 2 because class 1 has higher probability and has higher precision and recall.

In scenario 3, the true condition of class 1 has higher probability than the others (80% vs 10%). The precision of class 1 is higher than the others (94% vs 24%), and the recall of class 1 is lower than the others, (40% vs 80%). Compared to the other two scenarios, the diagonal entries are relatively small, which makes miF1 small (miF1 = 0.48, maF1 = 0.44, and \(maF^{*}_{1} = 0.55\)).

Table 3 shows the coverage probabilities of the proposed 95% confidence intervals for each scenario. The coverage probabilities for both miF1 and maF1 are close to the nominal 95% when the sample size is large. When n is small (25, 50), the coverage probability tends to fall below 95%, especially for maF1 and \(maF^{*}_{1}\). Moreover, computing a confidence interval for \(maF^{*}_{1}\) with small n is often impossible because \(\widehat {maF^{*}_{1}}\) is undefined when \(\hat{p}_{i \cdot} = 0\) or \(\hat{p}_{\cdot i} = 0\) for any i. In typical applications where these F scores are computed, n is large, and this small-sample problem is unlikely to occur.
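To illustrate how such a simulation can be set up, the following R sketch estimates the coverage probability of the miF1 interval under a given true p; the function name and the reduced replicate count (relative to the 1,000,000 used in the study) are ours for illustration, not taken from the published code:

```r
# Coverage probability of the Wald-type 95% CI for miF1 (sketch).
coverage_miF1 <- function(p.true, n, nrep = 10000, level = 0.95) {
  target <- sum(diag(p.true))
  z      <- qnorm(1 - (1 - level) / 2)
  hits <- replicate(nrep, {
    tab.sim <- matrix(rmultinom(1, n, as.vector(p.true)), nrow(p.true))
    est <- sum(diag(tab.sim)) / n
    se  <- sqrt(est * (1 - est) / n)
    (est - z * se <= target) && (target <= est + z * se)
  })
  mean(hits)
}
set.seed(2)
coverage_miF1(p, n = 100)   # close to 0.95 for the example table
```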

Table 3 Simulation study: Coverage probability

5 Example

As an example, we applied our method to the temporal sleep stage classification data provided by Dong et al. [5]. They proposed a new approach based on a Mixed Neural Network (MNN) to classify sleep into five stages: one awake stage (W), three sleep stages (N1, N2, N3), and one rapid eye movement stage (REM). In addition to the MNN, they evaluated three other classifiers: Support Vector Machine (SVM), Random Forest (RF), and Multilayer Perceptron (MLP). The data came from 62 healthy subjects, and classification by a single sleep expert was used as the gold standard. Staging is based on a 30-second window of physiological signals called an EEG (electroencephalography) epoch; thus, each subject contributes a large number of observations to be classified. The total number of epochs depends on the classifier and is about 59,000. The performance of each classifier was evaluated using maF1 along with precision, recall, and overall accuracy, and the authors concluded that the MNN outperformed its competitors by comparing the point estimates of maF1 and overall accuracy. We provide 95% confidence intervals for miF1, maF1, and \(maF^{*}_{1}\) for each of the four methods, as summarized in Table 4. As n is large in this example, the confidence intervals are narrow, and the intervals for the MNN overlap with neither the point estimates nor the confidence intervals of the other three methods, providing further evidence that the MNN is superior.

Table 4 Point estimates and confidence intervals for miF1, maF1, and \(maF^{*}_{1}\)

6 Discussion

We derived large-sample variance estimates of miF1, maF1, and \(maF_{1}^{*}\) in terms of the observed cell probabilities and the sample size, which enabled us to derive large-sample confidence intervals.

The coverage probabilities of the proposed confidence intervals were assessed through a simulation study. When n was larger than 100, the coverage probability was close to the nominal level; for n < 100, however, the coverage probabilities tended to fall below the target. Moreover, with an extremely small sample size, \(maF_{1}^{*}\) could not always be estimated, as its computation requires all margins to be non-zero. Zhang et al. [17] considered interval estimation of miF1 and maF1 and proposed the highest density interval within a Bayesian framework. In contrast, we have proposed confidence intervals for miF1, maF1, and \(maF_{1}^{*}\) within a frequentist framework using a large-sample approximation.

An inherent drawback of the multi-class F1 scores is that they do not summarize the data appropriately when large variability exists between classes. This was demonstrated in the numeric example in Section 2, for which the within-class F1 values are 0.308, 0.927, and 0.833, while miF1, maF1, and \(maF_{1}^{*}\) are 0.870, 0.689, and 0.691, respectively. Reporting the within-class F1 scores separately may be an option, as done in [18] and [19]; however, an aggregate measure is useful for evaluating the overall performance of a classifier across classes. Another limitation of the F1 scores is that they do not take the true negative rate into consideration and thus may not be appropriate measures when true negatives are important.

As future work, we are developing hypothesis testing procedures for miF1, maF1, and \(maF_{1}^{*}\) based on the variance estimates proposed in this article.

The R code for computing confidence intervals for miF1, maF1, and \(maF^{*}_{1}\) is presented in the Appendix.