Abstract
Several performance metrics are currently available to evaluate the performance of Machine Learning (ML) models in classification problems. ML models are usually assessed using a single measure because it facilitates the comparison between several models. However, there is no silver bullet since each performance metric emphasizes a different aspect of the classification. Thus, the choice depends on the particular requirements and characteristics of the problem. An additional problem arises in multiclass classification problems, since most of the well-known metrics are only directly applicable to binary classification problems. In this paper, we propose the General Performance Score (GPS), a methodological approach to build performance metrics for binary and multiclass classification problems. The basic idea behind GPS is to combine a set of individual metrics, penalising low values in any of them. Thus, users can combine several performance metrics that are relevant to the particular problem according to their preferences, obtaining a conservative combination. Different GPS-based performance metrics are compared with alternatives in classification problems using real and simulated datasets. The metrics built using the proposed method improve the stability and explainability of the usual performance metrics. Finally, the GPS brings benefits in both new research lines and practical usage, where performance metrics tailored for each particular problem are considered.
1 Introduction
Supervised Learning is the set of Machine Learning (ML) techniques that use labelled data. The task of these techniques is to learn a function that maps an input to a label, learning from examples of input-label pairs. When the label is categorical, the task addressed by these methods is referred to as classification. Based on the characteristics of the labels, several types of classification problems are defined: binary, multiclass, multi-labelled, and hierarchical [24].
In the literature, there are several metrics to evaluate the performance of ML models in classification problems [25]. Most of these metrics are defined for binary classification, of which some can be generalised for more than two classes. In practice, data analysts focus mainly on selecting the algorithm with the best predictive performance, disregarding the selection of the specific performance metric [6]. However, no general performance metric exists. Consequently, the proper definition of a performance metric, based on the problem domain and requirements, is crucial. Performance metrics are used to rank ML models and to evaluate if the selected one meets the classification requirements. Therefore, the choice of the right metric is crucial, especially when the cost of misclassification varies between classes.
In general, given a classification ML model, the information regarding its performance is summarised into a confusion matrix. This matrix is built by comparing the observed and predicted classes for a set of observations. It contains all the information needed to calculate most of the classification performance metrics. Among them, Accuracy (ACC) is one of the most common. It represents the ratio of correctly predicted observations. However, in many binary classification problems, alternative measures that combine two metrics regarding the classification task in both classes are more appropriate.
In this paper, several performance metrics used in classification problems are discussed. The General Performance Score (GPS), a new family of classification metrics, is presented. The GPS is obtained from the combination of several metrics estimated through a \(K \times K\) confusion matrix, with \(K \ge 2\). Therefore, this family of metrics applies to both binary and multiclass classification. Several instances of GPS are presented and compared with well-known alternative metrics at both a theoretical and a practical level.
The main contributions of the paper are listed as follows:

A novel family of performance metrics, GPS, is developed for both binary and multiclass classification.

GPS is configurable depending on the problem domain by combining appropriate performance metrics.

GPS performance metrics allow a high explainability of the performance of the ML models.
The rest of the paper is structured as follows. Section 2 presents an overview of binary and multiclass classification metrics based on the confusion matrix. The proposed metrics family is described in Section 3 for both binary and multiclass classification. Experiments on simulated and real case studies with different numbers of classes are detailed in Section 4. Finally, Section 5 concludes and outlines further research lines.
2 State of the art
2.1 Binary classification
In a binary classification problem, with classes \(-1\) and \(+1\), the performance metrics achieved by the selected ML classifier are obtained from the well-known \(2 \times 2\) confusion matrix (see Table 1). This matrix relates the observed values to the ones predicted by the classifier. Notice that many ML models return probabilities. In these cases, a threshold on these probabilities can be used to obtain binary predictions. The elements of a confusion matrix are:

True Positive (TP): the observed \(+1\) instances that are predicted as \(+1\).

True Negative (TN): the observed \(-1\) instances that are predicted as \(-1\).

False Positive (FP): the observed \(-1\) instances that are predicted as \(+1\).

False Negative (FN): the observed \(+1\) instances that are predicted as \(-1\).
FP and FN are also known as type I and type II errors, respectively. The relative importance of these errors depends on the problem under consideration [5, 21]. For instance, in anomaly detection problems, the number of observed \(+1\) is usually much smaller than the number of observed \(-1\). On the one hand, the FP are false alarms that must be handled by the system, which implies several actions with an associated cost. On the other hand, the FN are those anomalies that are not detected by the system and thus could potentially damage it.
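The four cells can be counted directly from observed and predicted labels. The following is a minimal sketch, assuming labels coded as \(+1\) and \(-1\) and a probability threshold of 0.5; the function names and the toy data are illustrative and not part of the original study.

```python
def threshold_predictions(probabilities, threshold=0.5):
    """Turn predicted probabilities of the +1 class into +1 / -1 labels."""
    return [1 if p >= threshold else -1 for p in probabilities]


def confusion_counts(observed, predicted):
    """Return (TP, TN, FP, FN) for labels coded as +1 / -1."""
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == 1 and pred == 1:
            tp += 1
        elif obs == -1 and pred == -1:
            tn += 1
        elif obs == -1 and pred == 1:
            fp += 1  # type I error (false alarm)
        elif obs == 1 and pred == -1:
            fn += 1  # type II error (missed positive)
    return tp, tn, fp, fn


# Toy data (hypothetical values, for illustration only).
observed = [1, 1, -1, -1, -1, 1]
probabilities = [0.9, 0.4, 0.2, 0.7, 0.1, 0.8]
predicted = threshold_predictions(probabilities)
counts = confusion_counts(observed, predicted)
```

Changing the threshold changes the predictions and, with them, the balance between FP and FN.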
The performance metrics that can be obtained from a confusion matrix are summarised in Table 2. The most intuitive one is the ACC [9], which represents the ratio of correctly predicted instances among all instances in the dataset. The complementary metric is the Error Rate (ERR), which evaluates the model by its proportion of incorrect predictions. Both metrics are commonly used by researchers to select a model. However, these two metrics turn out to be an over-optimistic estimation of the ability of the classifier over the majority class [4]. Consequently, they are sensitive to imbalanced data.
The Precision, also known as Positive Predictive Value (PPV), can be considered as the probability of success when an instance is classified as \(+1\). The Sensitivity, also known as Recall or True Positive Rate (TPR), can be understood as the probability that an observed \(+1\) is classified as \(+1\) by the ML classifier. The Specificity, also known as True Negative Rate (TNR), is the proportion of observed \(-1\) instances that are correctly predicted. Similarly, the Negative Predictive Value (NPV) is the proportion of instances classified as \(-1\) by the ML classifier that are indeed \(-1\). The main drawback of these metrics is that they do not consider all the confusion matrix elements. For example, the Sensitivity only focuses on positive examples, while Specificity only focuses on the negative ones. The main goal of ML classifiers is to improve the Sensitivity without losing Specificity. However, there is a trade-off between these two metrics, since increasing the Sensitivity implies a decrease in the Specificity and vice versa. The same relationship appears between Sensitivity and Precision. Besides, Precision and NPV are sensitive to imbalanced data. None of these four metrics can be used on its own to evaluate the performance of a ML method, because none of them takes into consideration the entire confusion matrix. That is, they do not take into account all the information that the classifier provides. Hence, these metrics are adequate for capturing a partial perspective of the classifier performance, but are individually insufficient.
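Regarding the four basic metrics, given three of them, the remaining fourth can be obtained. For instance, given PPV, TPR, and TNR, the NPV is defined as follows:
$$NPV = \frac{TPR \cdot TNR \cdot (1-PPV)}{TPR \cdot TNR \cdot (1-PPV) + (1-TPR) \cdot (1-TNR) \cdot PPV}$$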
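The dependence among the four basic metrics can be checked numerically. The sketch below recovers NPV from PPV, TPR and TNR using a closed form derived here from the definitions of the four metrics (it may differ in layout from the expression in the original paper); the function names are illustrative, and the formula assumes that the denominator is non-zero.

```python
def basic_metrics(tp, tn, fp, fn):
    """Compute the four basic metrics from the cells of a 2x2 confusion matrix."""
    return {
        "PPV": tp / (tp + fp),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "NPV": tn / (tn + fn),
    }


def npv_from_others(ppv, tpr, tnr):
    """Recover NPV from PPV, TPR and TNR (assumes a non-zero denominator)."""
    a = tpr * tnr * (1 - ppv)
    b = (1 - tpr) * (1 - tnr) * ppv
    return a / (a + b)


# Toy confusion matrix (hypothetical counts).
m = basic_metrics(tp=40, tn=30, fp=20, fn=10)
recovered = npv_from_others(m["PPV"], m["TPR"], m["TNR"])
```

For these counts, the recovered value matches the NPV computed directly from TN and FN.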
The Balanced Accuracy (BA) is the arithmetic mean of Sensitivity and Specificity. That is, the average of two rates: positive instances correctly classified and negative instances correctly classified. The BA, unlike Accuracy, is robust for evaluating classifiers over imbalanced datasets.
Another useful metric is the geometric mean of Sensitivity and Specificity, known as Geometric Mean (GM) [25]. It can be used both with balanced and imbalanced data. Likewise, the Fowlkes-Mallows Index (FM) [12] is defined as the geometric mean of Sensitivity and Precision. In contrast to GM, FM will approach zero with a random classification.
Notice that the harmonic mean is more intuitive than the arithmetic mean when computing a mean of ratios. Thus, the \(F_1^{+}\) (usually called \(F_1\)-score [23]) is defined as the harmonic mean of Precision and Recall. Therefore, to achieve a high \(F_1^{+}\) value, it is necessary to have both high values of Precision and Recall. Even though the \(F_{1}^{+}\) is popular in statistics, it can be misleading since it does not consider the TN. Thus, this performance metric does not consider the ratio of \(-1\) instances correctly classified by the ML classifier. Besides, \(F_{1}^{+}\) is not invariant to class swapping.
Furthermore, it is possible to define the \(F_1^{-}\) [22] as the harmonic mean of Specificity and NPV. The \(F_1^{-}\) is a trade-off between the success of predicting an observation as \(-1\) and the ratio of right predictions in the negative class. The \(F_{1}^{-}\) has the same strengths and weaknesses as the \(F_{1}^{+}\), but focusing on the negative class. That is, it considers the TN but not the TP.
Markedness (MK) is defined as the distance of the sum of Precision and NPV to 1, while Bookmaker Informedness (BM) is defined as the distance between 1 and the sum of Specificity and Sensitivity [20]. Again, both measures complement each other, but do not provide an overall view of the different perspectives provided by the four metrics involved in their definitions. MK is sensitive to changes in data distributions and, hence, it is not appropriate for imbalanced data [25]. On the contrary, BM is suitable with imbalanced data. Nevertheless, it does not change concerning the differences between Specificity and Sensitivity [25].
In [22], a new metric that considers all the elements in the confusion matrix has been recently proposed. The Unified Performance Measure (UPM) is defined as the harmonic mean of \(F_1^{+}\) and \(F_1^{-}\). Thus, UPM assesses the performance on both the positive and the negative class. This performance metric has high values only when the four fundamental metrics, PPV, TPR, TNR, and NPV, also have high values. In addition, UPM is suitable for imbalanced data [22].
In the same way, Matthews Correlation Coefficient (MCC) [16] also includes all the elements of the confusion matrix. MCC is defined as the geometric mean of the regression coefficients of the problem and its dual. It can also be formulated as follows:
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
However, MCC differs from the above-mentioned metrics as it takes values in the range \([-1,1]\). On the one hand, \(MCC=1\) means that both classes are perfectly classified, as it occurs in the alternative metrics. On the other hand, \(MCC=-1\) reveals a total disagreement between the observed and the predicted classes. \(MCC=0\) indicates a random prediction. It has been shown that MCC is not as stable as UPM [22].
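The Cohen’s Kappa coefficient measures the accordance between the ML classifier and the observed classes. Writing \(Pr(a)\) for the observed agreement (that is, the Accuracy), it is defined as follows:
$$\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)}$$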
where Pr(e) is the hypothetical probability of agreement by chance, using the observed data to calculate the probabilities that each observer will randomly rank each category. The Cohen’s Kappa coefficient also takes values from \(-1\) to \(+1\). The Cohen’s Kappa coefficient is more informative than Accuracy when working with imbalanced data. However, it is likely to give low values for imbalanced data [2].
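The behaviour of both coefficients at their extremes can be illustrated with a short sketch. It uses the standard count-based expressions for the binary MCC and Cohen's Kappa; returning 0 when the MCC denominator vanishes is a common convention adopted here as an assumption, and the function names are illustrative.

```python
import math


def mcc(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0.0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, tn, fp, fn):
    """Binary Cohen's Kappa; undefined (division by zero) when Pr(e) equals 1."""
    n = tp + tn + fp + fn
    pr_a = (tp + tn) / n  # observed agreement (Accuracy)
    # Chance agreement from the marginal totals of predictions and observations.
    pr_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (pr_a - pr_e) / (1 - pr_e)


perfect = (50, 50, 0, 0)      # every instance correctly classified
inverted = (0, 0, 50, 50)     # every instance misclassified
uniform = (25, 25, 25, 25)    # random-looking classifier
```

On these three toy matrices both coefficients take the values 1, -1 and 0, respectively.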
Finally, the Receiver Operating Characteristics (ROC) graph is a technique for visualising, organising, and selecting classifiers based on their performance [8]. In this case, a set of confusion matrices is obtained by modifying parameters in the model. ROC graphs are two-dimensional representations in which two inversely related variables are plotted. For instance, TPR is usually plotted versus False Positive Rate (FPR) (\(FPR=1-TNR\)). These two metrics are calculated for each confusion matrix. Then, the ROC curve is plotted with TPR on the y-axis and FPR on the x-axis. The Area Under the ROC Curve (AUC) [3] is the performance metric obtained from ROC. It is defined as the proportion of the unit square under the ROC curve. Thus, it takes values in the range [0, 1]. No realistic classifier should have an AUC less than 0.50. Although the AUC is generally used, it presents some drawbacks. For instance, the AUC lacks clinical interpretability because it does not reflect when diagnostic tests are presented in terms of gains and losses to individual patients [13].
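As an illustration of how a ROC curve and its AUC arise from a set of confusion matrices, the sketch below sweeps a threshold over predicted scores and integrates the resulting curve with the trapezoidal rule. It assumes labels coded as \(+1\) and \(-1\), with at least one instance of each class; the function names and the toy data are illustrative.

```python
def roc_points(observed, scores):
    """Return (FPR, TPR) points obtained by thresholding at each distinct score."""
    positives = sum(1 for y in observed if y == 1)
    negatives = len(observed) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted as +1
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(observed, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(observed, scores) if s >= t and y == -1)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (hypothetical values, for illustration only).
observed = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(observed, scores)
area = auc(curve)
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, which is the value the trapezoidal area reproduces.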
2.2 Multiclass classification
Consider a multiclass classification problem with K classes to be predicted by a ML classifier. As in the binary classification, most performance metrics are obtained from the confusion matrix (see Table 3). In this matrix, the element \(C_{ij}\) (\({i,j=1,\ldots ,K}\)) represents the number of the elements in class j classified as class i.
A common approach when dealing with multiclass classification problems is the One vs Rest technique [1]. It consists of setting each class against the rest of them. Thus, the model is trained and evaluated on a binary setting where one of the classes is set to positive and the others to negative. This process is repeated for all classes, obtaining a binary confusion matrix for each class. An instance of this approach is the generalisation of \(F_1^{+}\) to multiclass classification, the \(MacroF_1^{+}\) [19]:
$$MacroF_1^{+} = \frac{1}{K}\sum _{i=1}^{K} F_{1,i}^{+}$$
where \(F_{1,i}^{+}\) is the \(F_1^{+}\) value obtained from the confusion matrix when the i-th class is set against the rest of the classes. Analogously, MacroPrecision, MacroRecall, \(MacroF_1^{-}\), and MacroAccuracy can be defined. Notice that \(MacroF_1^{+}\) is an arithmetic mean of harmonic means.
Micro averages are an alternative to macro averages. Since a FP for a given class is a FN for another class, all errors are considered the same in multiclass micro averages. The same reasoning applies to TP and TN. Thus, \(FP=FN\) and \(TP=TN\). In this context, the MicroAccuracy (or multiclass accuracy) is defined as the ratio between the correctly predicted instances and the dataset size. Furthermore, the MicroAccuracy equals the MicroRecall, the MicroPrecision, and the Micro\(F_1\). When the dataset is imbalanced, MicroAccuracy provides an over-optimistic estimation of the classifier performance over the majority class. Notice that these metrics are invariant to class swapping since \(TP=TN\) and \(FP=FN\).
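The macro and micro averages can be sketched from a \(K \times K\) confusion matrix as follows. The code follows the convention stated above, in which \(C_{ij}\) counts the class j instances classified as class i; the function names and the example matrix are illustrative, and returning 0 for an empty denominator is an assumption.

```python
def one_vs_rest(C, k):
    """Binary counts (TP, TN, FP, FN) for class k against the rest.

    C[i][j] is the number of class-j instances classified as class i.
    """
    K = len(C)
    total = sum(sum(row) for row in C)
    tp = C[k][k]
    fp = sum(C[k][j] for j in range(K) if j != k)  # predicted k, observed other
    fn = sum(C[i][k] for i in range(K) if i != k)  # observed k, predicted other
    tn = total - tp - fp - fn
    return tp, tn, fp, fn


def f1_positive(tp, tn, fp, fn):
    """Harmonic mean of Precision and Recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: 0.0 when undefined


def macro_f1(C):
    """Arithmetic mean of the per-class F1 values."""
    K = len(C)
    return sum(f1_positive(*one_vs_rest(C, k)) for k in range(K)) / K


def micro_f1(C):
    """F1 computed on the counts pooled over all One vs Rest matrices."""
    pooled = [one_vs_rest(C, k) for k in range(len(C))]
    tp = sum(c[0] for c in pooled)
    fp = sum(c[2] for c in pooled)
    fn = sum(c[3] for c in pooled)
    return 2 * tp / (2 * tp + fp + fn)


def micro_accuracy(C):
    """Correctly predicted instances divided by the dataset size."""
    total = sum(sum(row) for row in C)
    return sum(C[k][k] for k in range(len(C))) / total


# Example matrix (hypothetical counts).
C = [[30, 5, 0],
     [10, 40, 5],
     [0, 5, 55]]
```

For this matrix the micro-averaged \(F_1\) coincides with the multiclass accuracy, whereas the macro average weights the three classes equally.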
There are also specific approaches to extend binary metrics to a multiclass setting such as multiclass MCC [10] and multiclass Cohen’s Kappa coefficient [11]. Considering the \(K \times K\) confusion matrix in Table 3, \(MCC_K\) for multiclass classification is defined as:
The range of multiclass MCC is different from the binary MCC. In this case, the minimum value might be between \(-1\) and 0 depending on the distribution of the labels, while the maximum value is the same.
Regarding the multiclass Cohen’s Kappa coefficient, it is defined as follows:
where \(p_k = \sum _i^K C_{ki}\) and \(t_k = \sum _i^K C_{ik}\).
MCC and Cohen’s Kappa are close in multiclass classification. The only difference between them is that the denominator is slightly lower in Cohen’s Kappa coefficient, justifying slightly higher final scores.
3 General Performance Score
Several performance metrics to evaluate ML classifiers have been presented in the previous section. However, in some cases it is necessary to jointly consider a set of metrics that emphasise different aspects of the classifier. Thus, it is necessary to define an approach that combines a set of metrics into a single one. In this section, GPS, an approach to perform this combination, is presented.
Definition 1
Let \(p_1, \cdots , p_n\) be n different performance metrics that describe the output of a ML model for a classification task. Then, the General Performance Score (GPS) is defined as follows:
$$GPS(p_1, \ldots , p_n) = \frac{n}{\sum _{i=1}^{n} \frac{1}{p_i}}$$
Notice that the GPS is the harmonic mean of the set of different performance metrics \(p_1, \cdots , p_n\). The harmonic mean is a measure of central tendency, which is useful when averaging rates like those obtained from the confusion matrix.
It can be proven that the GPS is also equal to:
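One such equivalent expression can be sketched by multiplying the numerator and the denominator of the harmonic mean by the product of all the metrics. This is offered as an assumption about the form intended here, valid when every \(p_i > 0\):

$$GPS(p_1, \ldots , p_n) = \frac{n \prod _{j=1}^{n} p_j}{\sum _{i=1}^{n} \prod _{j \ne i} p_j}$$

For \(n=2\), this reduces to \(\frac{2 p_1 p_2}{p_1 + p_2}\), the familiar form of the harmonic mean of two values.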
The GPS has the following properties:
Property 1
When the set of n performance metrics are defined in [0, 1], the GPS is maximum, i.e., equal to 1, \(\iff\) all the performance metrics are maximum, i.e., equal to 1.
Property 2
GPS is equal to 0, if at least one performance metric is equal to 0.
Notice that the harmonic mean minimises the impact of large values while maximising the impact of small values. Therefore, high values of GPS denote that all of the involved metrics have high values. Furthermore, it is possible to calculate the GPS standard deviation based on the standard deviation of the harmonic mean [17].
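A minimal sketch of GPS as a harmonic mean makes Properties 1 and 2 easy to check. The function name `gps` is illustrative, and returning 0 as soon as one metric is 0 is how this sketch encodes Property 2 (the harmonic mean itself would require a division by zero).

```python
def gps(*metrics):
    """General Performance Score: harmonic mean of metrics defined in [0, 1]."""
    if not metrics:
        raise ValueError("at least one performance metric is required")
    if any(p == 0 for p in metrics):
        return 0.0  # Property 2: a single zero metric drives GPS to zero
    return len(metrics) / sum(1 / p for p in metrics)


all_perfect = gps(1.0, 1.0, 1.0)        # Property 1
one_failure = gps(0.9, 0.0, 0.8)        # Property 2
one_low_value = gps(0.9, 0.9, 0.1)      # a low value dominates the score
arithmetic_mean = (0.9 + 0.9 + 0.1) / 3
```

In the last case the arithmetic mean is about 0.63, whereas GPS is about 0.25, which illustrates the penalisation of low values.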
Property 3
The standard deviation of GPS is:
It is clear that the standard deviation is minimum (and takes the zero value) when all the performance metrics (\(p_i\)) are the same. To study the maximum value for sd(GPS), first consider the binary case.
Property 4
Given two performance metrics, the standard deviation of GPS is maximum when one of the metrics is 1 and the other is \(\frac{1}{3}\). In this case, GPS\(=\frac{1}{2}\), and \(sd(GPS)=\frac{1}{2\sqrt{2}}\).
Proof
Given two performance metrics \(p_1\) and \(p_2\), the maximum distance between them is achieved when one metric is equal to 1 and the other is equal to 0. However, in that case, the sd(GPS) is not defined. To examine the maximum of the function, let \(x=1/p_1\) and \(y=1/p_2\). Thus, \(x,y\ge 1\). Without loss of generality, we assume that \(x \ge y\). Then, the GPS is:
and the sd(GPS) is:
The partial derivatives of the previous expression are:
Given that \(x \ge y\), we require that \(x=3y\). Thus, when \(y=1\), the derivative is 0 at \(x=3\). That is, \(p_1=1/3\), and \(p_2=1\). In such a case, \(GPS=\frac{2\cdot 1/3}{1+1/3}=\frac{1}{2}\), and \(sd(GPS)=\frac{1}{2\sqrt{2}}\). Figure 1 shows the value of sd(GPS) for \(x \in [1,100]\) at several values of \(y \in [1,10]\). It can be shown that the maximum is achieved for \(y=1\), \(x=3\).
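The maximisation step can be made explicit. As a sketch, assume that the standard deviation takes the form \(sd(GPS)=\frac{1}{\alpha ^2}\frac{s}{\sqrt{n-1}}\), where \(\alpha\) and \(s\) are the mean and the sample standard deviation of the reciprocals \(1/p_i\). This form is an assumption, chosen because it reproduces the values stated in Properties 4 and 5. For \(n=2\) and \(x \ge y\), it gives:

$$sd(GPS) = \frac{2\sqrt{2}\,(x-y)}{(x+y)^2}, \qquad \frac{\partial \, sd(GPS)}{\partial x} = \frac{2\sqrt{2}\,(3y-x)}{(x+y)^3}, \qquad \frac{\partial \, sd(GPS)}{\partial y} = \frac{2\sqrt{2}\,(y-3x)}{(x+y)^3}$$

Since \(x \ge y \ge 1\), the derivative with respect to y is always negative, so the maximum lies on the boundary \(y=1\). The derivative with respect to x vanishes at \(x=3y\), which yields \(x=3\) and \(sd(GPS)=\frac{2\sqrt{2}\cdot 2}{16}=\frac{1}{2\sqrt{2}}\).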
It is straightforward to show the following property.
Property 5
Given a set of n performance metrics \(p_1,\ldots ,p_n\), the standard deviation of GPS is maximum when \(p_i=1\) \(\forall i=1,\ldots ,n-1\), and \(p_n=\frac{1}{n+1}\). In such a case, GPS\(=\frac{1}{2}\), and \(sd(GPS)=\frac{1}{4}\sqrt{\frac{n}{n-1}}\).
Proof
Let \(x_i= 1/p_i\). Since \(p_i\le 1\), then \(x_i\ge 1\,,\forall i\). Let \(s=\sum _{i=1}^{n}x_i\). Then, \(GPS=n/s\), and
In order to maximise this expression, s needs to be as small as possible, while the \(x_i\) maximise their difference to the mean value s/n for all i. To minimise s, set \(x_i=1\,,\forall i < n\). Thus, \(s=x_n+n-1\). Now, the standard deviation is:
The derivative of this expression is:
The root of the derivative is \(x_n=n+1\). Through the second derivative it can be demonstrated that it is a maximum. Thus, the maximum standard deviation of GPS is achieved for \(x_i=1 \,, \forall i < n\), and \(x_n=n+1\). Therefore, \(GPS=\frac{n}{n-1+n+1}=\frac{1}{2}\), and \(sd(GPS)=\frac{1}{4}\sqrt{\frac{n}{n-1}}\).
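The extreme values in Properties 4 and 5 can be checked numerically. The sketch below assumes that the standard deviation has the form \(\frac{1}{\alpha ^2}\frac{s}{\sqrt{n-1}}\), with \(\alpha\) and \(s\) the mean and sample standard deviation of the reciprocals \(1/p_i\). This form is an assumption that is consistent with the stated maxima, not a quotation of the original expression, and the function name is illustrative.

```python
import math


def gps_sd(metrics):
    """Assumed form of sd(GPS) for strictly positive metrics (n >= 2)."""
    n = len(metrics)
    reciprocals = [1 / p for p in metrics]
    alpha = sum(reciprocals) / n  # mean of the reciprocals
    # Sample standard deviation (n - 1 denominator) of the reciprocals.
    s = math.sqrt(sum((x - alpha) ** 2 for x in reciprocals) / (n - 1))
    return s / (math.sqrt(n - 1) * alpha ** 2)


two_metrics_max = gps_sd([1, 1 / 3])       # Property 4 configuration
three_metrics_max = gps_sd([1, 1, 1 / 4])  # Property 5 configuration, n = 3
equal_metrics = gps_sd([0.5, 0.5, 0.5])    # identical metrics
```

Under this assumption, the two configurations give \(\frac{1}{2\sqrt{2}} \approx 0.35\) and \(\frac{1}{4}\sqrt{\frac{3}{2}} \approx 0.31\), and nearby configurations give smaller values.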
3.1 Binary classification
In binary classification, a well-known particular case of GPS is the \(F_1^{+}\)-score. It corresponds to GPS parameterised with the Precision (PPV) and Recall (TPR):
$$GPS(PPV, TPR) = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR} = F_1^{+}$$
On the other hand, the \(F_1^{-}\)-score is GPS parameterised with the Specificity (TNR) and Negative Predictive Value (NPV):
$$GPS(TNR, NPV) = \frac{2 \cdot TNR \cdot NPV}{TNR + NPV} = F_1^{-}$$
The UPM [22] is another performance metric that belongs to the GPS family. The UPM is equal to GPS parameterised with Precision (PPV), Recall (TPR), Specificity (TNR) and Negative Predictive Value (NPV):
$$UPM = GPS(PPV, TPR, TNR, NPV) = \frac{4}{\frac{1}{PPV}+\frac{1}{TPR}+\frac{1}{TNR}+\frac{1}{NPV}}$$
Given that the combined harmonic mean of two sets of variables is equal to the harmonic mean of the harmonic means of the two sets [18], the previous expression can be easily simplified to:
$$UPM = GPS(F_1^{+}, F_1^{-}) = \frac{2 \cdot F_1^{+} \cdot F_1^{-}}{F_1^{+} + F_1^{-}}$$
This instance of GPS overcomes one of the main shortcomings of the \(F_1^{+}\) and \(F_1^{-}\), which is that they do not consider TN and TP, respectively. Thus, both metrics are misleading for imbalanced classes. By contrast, UPM performs properly for imbalanced classification problems, since it is built using information regarding the performance of a classifier on both classes. In addition, it improves the stability and explainability of the existing metrics [22].
Another possible instance of GPS is the combination of the Specificity (TNR) and Sensitivity (TPR):
$$GPS(TNR, TPR) = \frac{2 \cdot TNR \cdot TPR}{TNR + TPR}$$
This same combination is performed by the GM and BA (see Section 2), which use the geometric and arithmetic mean, respectively. Since the harmonic mean is less than or equal to the geometric mean, and the geometric mean is less than or equal to the arithmetic mean, then:
$$GPS(TNR, TPR) \le GM \le BA$$
Let us consider two different ML models: \(ML_1\) and \(ML_2\). Let the performances of these models be as follows: \(Specificity=0.4\) and \(Sensitivity=0.6\), for \(ML_1\), and \(Specificity=0.1\) and \(Sensitivity=0.9\), for \(ML_2\). On the one hand, notice that \(BA=0.5\) for both models. On the other hand, GM is equal to 0.49 and 0.30 for \(ML_1\) and \(ML_2\), respectively, penalising the low value of Specificity. The proposed GPS results are: 0.48 and 0.18 for \(ML_1\) and \(ML_2\), respectively. Thus, as explained before, it can be seen that GPS is more sensitive to smaller values than to larger values in the involved metrics.
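The figures in this comparison can be reproduced with a few lines. This is a minimal sketch with illustrative function names; the inputs are the Specificity and Sensitivity values given for \(ML_1\) and \(ML_2\).

```python
import math


def balanced_accuracy(tnr, tpr):
    """Arithmetic mean of Specificity and Sensitivity."""
    return (tnr + tpr) / 2


def geometric_mean(tnr, tpr):
    """Geometric mean of Specificity and Sensitivity."""
    return math.sqrt(tnr * tpr)


def gps_tnr_tpr(tnr, tpr):
    """Harmonic mean of Specificity and Sensitivity."""
    return 2 * tnr * tpr / (tnr + tpr)


models = {"ML_1": (0.4, 0.6), "ML_2": (0.1, 0.9)}
results = {
    name: {
        "BA": balanced_accuracy(tnr, tpr),
        "GM": geometric_mean(tnr, tpr),
        "GPS": gps_tnr_tpr(tnr, tpr),
    }
    for name, (tnr, tpr) in models.items()
}
```

Rounded to two decimals, the three metrics are 0.50, 0.49 and 0.48 for \(ML_1\), and 0.50, 0.30 and 0.18 for \(ML_2\), in line with the ordering between the harmonic, geometric and arithmetic means.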
3.2 Multiclass classification
In this section, several instances of GPS in a multiclass classification problem are discussed. Let us consider a multiclass confusion matrix with K classes (see Table 3). Applying a technique for switching from multiclass confusion matrices to binary matrices, it is possible to obtain K different binary confusion matrices. For instance, in this case the One vs Rest technique is used. Let UPM\(_k\) (k in \(1,\ldots ,K\)) be the UPM calculated for each of these K confusion matrices. Then, GPS can be parameterised with UPM\(_k\) in order to create a multiclass performance metric as follows:
$$GPS_{UPM} = GPS(UPM_1, \ldots , UPM_K) = \frac{K}{\sum _{k=1}^{K} \frac{1}{UPM_k}}$$
Consider a uniform confusion matrix, that is, one in which all the elements are equal. The following property can be stated:
Property 6
Given a K-class classification problem, the value of \({\mathrm{GPS}}_{\mathrm{UPM}}\) for a uniform confusion matrix is:
$${\mathrm{GPS}}_{\mathrm{UPM}} = \frac{2(K-1)}{K^2}$$
Proof
Let us consider all the elements in the uniform confusion matrix equal to x. First, notice that all UPMs in a uniform confusion matrix are equal. Since \(GPS_{UPM}\) is a harmonic mean of the UPMs, its value is equal to the value of the UPMs. Thus, it is enough to calculate one UPM. The \(UPM_k\) in a uniform confusion matrix is equal to:
The Precision and Recall are \(\frac{1}{K}\), and the NPV and Specificity are \(\frac{(K-1)^2}{(K-1)^2+(K-1)}=\frac{K-1}{K}\). Then, UPM is equal to:
$$UPM = \frac{4}{K + K + \frac{K}{K-1} + \frac{K}{K-1}} = \frac{2(K-1)}{K^2}$$
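The value for a uniform confusion matrix can also be verified numerically. The sketch below builds the One vs Rest matrices, computes the UPM of each one as the harmonic mean of PPV, TPR, TNR and NPV, and combines them with a harmonic mean. The function names are illustrative, and treating an undefined ratio as 0 is an assumption of this sketch.

```python
def harmonic_mean(values):
    """Harmonic mean; returns 0.0 if any value is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1 / v for v in values)


def ratio(numerator, denominator):
    """Safe ratio (assumption: 0.0 when the denominator is 0)."""
    return numerator / denominator if denominator else 0.0


def upm(tp, tn, fp, fn):
    """UPM as the harmonic mean of PPV, TPR, TNR and NPV."""
    return harmonic_mean([
        ratio(tp, tp + fp),  # PPV
        ratio(tp, tp + fn),  # TPR
        ratio(tn, tn + fp),  # TNR
        ratio(tn, tn + fn),  # NPV
    ])


def gps_upm(C):
    """GPS over the per-class UPMs; C[i][j] = class-j instances classified as i."""
    K = len(C)
    total = sum(sum(row) for row in C)
    scores = []
    for k in range(K):
        tp = C[k][k]
        fp = sum(C[k]) - tp                         # rest of row k
        fn = sum(C[i][k] for i in range(K)) - tp    # rest of column k
        tn = total - tp - fp - fn
        scores.append(upm(tp, tn, fp, fn))
    return harmonic_mean(scores)


def uniform_matrix(K, x=7):
    """K x K confusion matrix with every cell equal to x."""
    return [[x] * K for _ in range(K)]
```

For example, `gps_upm(uniform_matrix(3))` evaluates to 4/9 and `gps_upm(uniform_matrix(2))` to 1/2, in agreement with \(\frac{2(K-1)}{K^2}\).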
As an example, let us consider a 3-class classification problem. The \(3 \times 3\) multiclass confusion matrix can be divided into 3 binary confusion submatrices (see Table 4). Then, \(GPS(UPM_1,UPM_2,UPM_3)\) is defined as follows:
$$GPS(UPM_1,UPM_2,UPM_3) = \frac{3}{\frac{1}{UPM_1}+\frac{1}{UPM_2}+\frac{1}{UPM_3}}$$
Notice that in the particular case of ordered classes, the confusion matrix in Table 4b could be omitted. When the order is relevant, merging the first and last classes could be meaningless from the application domain perspective. Then, the GPS implementation parameterised with UPM for ordered classes is defined as follows:
Furthermore, alternative context-aware definitions of performance metrics could be useful. For instance, consider a multiclass classification problem where only the Recall of each class is relevant. Thus, the base metrics are:
In such a case, GPS is defined as follows:
$$GPS_{Recall} = \frac{K}{\sum _{k=1}^{K} \frac{1}{Recall_k}}$$
Notice that when \(K=2\), then \(GPS_{Recall}\) is equal to the harmonic mean of Specificity and Sensitivity, presented in (18).
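A Recall-based instance can be sketched as follows, assuming the convention of Section 2.2 (\(C_{ij}\) counts the class j instances classified as class i) and that every class has at least one observed instance. The function names and the example matrix are illustrative.

```python
def class_recalls(C):
    """Recall of each class: diagonal cell over the total of its column."""
    K = len(C)
    return [C[k][k] / sum(C[i][k] for i in range(K)) for k in range(K)]


def gps_recall(C):
    """Harmonic mean of the per-class Recalls (0.0 if any Recall is 0)."""
    recalls = class_recalls(C)
    if any(r == 0 for r in recalls):
        return 0.0
    return len(recalls) / sum(1 / r for r in recalls)


# Binary example (hypothetical counts): columns are the observed classes.
C2 = [[45, 10],
      [5, 40]]
recalls = class_recalls(C2)
score = gps_recall(C2)
```

With \(K=2\) the two Recalls are the Sensitivity and the Specificity (0.9 and 0.8 in this example), so the score is their harmonic mean.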
4 Experiments
In this section, several experiments on real and artificial datasets are considered. The properties and performance of GPS-based metrics are discussed and compared with alternative performance metrics. The first and second experiments consider a binary classification problem with simulated confusion matrices and real datasets, respectively. In the third experiment, a battery of simulated confusion matrices obtained from a multiclass classification problem is considered. Finally, in the fourth experiment, several definitions of GPS for two real datasets in multiclass classification problems are explored.
4.1 Simulated confusion matrices in binary classification
In this experiment, five confusion matrices are generated to compare GPS-based metrics against several alternatives. These confusion matrices are reported in Table 5. The confusion matrix a) presents a good classifier with adequate results in both classes. The confusion matrix b) is a random confusion matrix with the same values in all its cells. In the confusion matrices c) and d) only one class is correctly classified, negative class in c) and positive class in d). Finally, the confusion matrix e) presents a conservative classifier (most of the model predictions are negative) in an imbalanced dataset (most of the instances are positive).
Table 6 shows the results of the metrics for these confusion matrices. In this experiment, the GPS(PPV, TPR, TNR, NPV) has been considered. First, when the classification model works properly, as in a), all metrics achieve high values. The GPS instance presents low values in the confusion matrices c), d) and e) since at least one of its performance metrics presents low values. Regarding the random confusion matrix b), the GPS value is 0.5. It is interesting to remark that in this case, all the performance metrics used in its definition have the same value. Thus, the standard deviation of GPS is 0.0.
In confusion matrix e), the Precision and Specificity are very high, but the Recall and NPV are very low. In addition, it can be observed in the confusion matrices c) and d) that these metrics are sensitive to swapping the classes and to imbalanced data. The Balanced Accuracy obtains very similar values for the last four confusion matrices, although they represent totally different scenarios. It can be observed that the \(F_1^{+}\) and the \(F_1^{-}\) metrics are sensitive to imbalanced data. In the confusion matrix c), \(F_1^{-}\) achieves a high value while the positive class is almost entirely misclassified. On the other hand, in confusion matrix d), \(F_1^{+}\) achieves a high value while the negative class is almost entirely misclassified. Moreover, they are sensitive to swapping the classes. The Geometric Mean value in the confusion matrices c) and d) is similar to that of the random confusion matrix b). The Fowlkes-Mallows Index obtains very similar values to \(F_1^{+}\). Markedness, Bookmaker Informedness and Cohen’s Kappa get low values for the last three confusion matrices, and 0.00 for the random confusion matrix b). Given the low performance on the non-predominant class, GPS achieves values lower than 0.50 for the confusion matrices c) and d). However, MCC achieves higher values for these confusion matrices (0.13 in both cases) than for the random confusion matrix b) (0.00). Moreover, MCC returns similar values for the confusion matrices b) (random) and e) (high Precision and low Recall).
4.2 Binary classification with real datasets
The performance of GPS for binary classification is also evaluated on several real datasets from the UCI Machine Learning Repository [7]. In this experiment, the following datasets are considered:

Pima Indians and Vote datasets: two imbalanced datasets for the positive class.

Ionosphere: an imbalanced dataset for the negative class.

Sonar: a balanced dataset.

Adult and Credit datasets: two very imbalanced datasets for the positive class.

Hepatitis: a very imbalanced dataset for the negative class.
Each dataset has been randomly split into two sets: training (80%) and testing (20%). A Random Forest (RF) model with the following parameters has been trained on the training set: the number of trees is equal to 500, each tree grows to the maximum possible number of terminal nodes, and the square root of the number of variables in the dataset is used as the number of variables randomly sampled as candidates at each split. Then, the metrics MCC and GPS are estimated over the testing sets. This process is repeated 100 times. Finally, the global performance metric values are obtained as the mean of the 100 performance scores in the testing sets. The Mean, Standard Deviation (SD) and Coefficient of Variation (CV) for both GPS and MCC are shown in Table 7.
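The summary statistics of the repeated evaluation can be sketched as follows. The function name is illustrative, the use of the sample standard deviation is an assumption, and the scores are placeholder values rather than results from the experiments.

```python
import statistics


def summarise(scores):
    """Mean, SD and CV of a list of performance scores over repetitions."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # assumption: sample SD (n - 1 denominator)
    # CV is the SD relative to the mean; it is undefined when the mean is 0.
    cv = sd / mean if mean else float("nan")
    return {"mean": mean, "sd": sd, "cv": cv}


# Placeholder scores from five hypothetical repetitions (not results from the paper).
example_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
summary = summarise(example_scores)
```

Because the CV divides by the mean, it should be read with care for a metric such as MCC, whose mean can be close to zero or negative.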
The correlation between both metrics is very high (Pearson correlation coefficient equals 0.98). However, GPS presents a lower standard deviation, which indicates that GPS is more stable. Furthermore, MCC obtains higher CV values, meaning that it is more dispersed than GPS. In addition, the GPS is easier to interpret since it is defined in the range [0, 1] as most performance metrics. Thus, it can be concluded that the proposed ML model performs properly for Vote and Ionosphere datasets. Better classifiers could probably be found for Sonar, Adult, and Pima Indians datasets. Finally, given the low values for GPS, the proposed classification technique shows a poor performance for Credit and Hepatitis datasets.
4.3 Simulated confusion matrices in multiclass classification
In this experiment, different simulated \(3 \times 3\) confusion matrices are generated and presented in Table 8. The confusion matrices a) and b) show good classifiers on balanced datasets. The confusion matrices c) and d) correspond to highly imbalanced data. The confusion matrices e) and f) correspond to classifiers on imbalanced data. The confusion matrix g) presents the results of a bad classifier. Finally, the confusion matrices h) and i) show very bad classifiers, completely wrong in their predictions. The following metrics have been calculated: Accuracy, MacroAccuracy, MacroPrecision, MacroRecall, Macro\(F_1^{+}\), Macro\(F_1^{-}\), Micro\(F_1^{+}\), Micro\(F_1^{-}\), MCC and \(GPS_{UPM}\).
Table 9 shows the results of the metrics for these multiclass confusion matrices. First, when the classes are balanced and the classification error is not high, as in a) and b), all performance metrics achieve high values. Notice that the metrics Accuracy, Micro\(F_1^{+}\) and Micro\(F_1^{-}\) have the same results for all the proposed confusion matrices. In the confusion matrices c) and d), corresponding to imbalanced data, ACC and MacroAccuracy are unreliable measures for model performance. The good performance of the model for the majority class implies high ACC and MacroAccuracy, even when the performance of the model is low for the other classes. By contrast, \(GPS_{UPM}\) penalises the poor performance of the model in any of the classes.
The \(GPS_{UPM}\) obtains the lowest possible value when all observations are wrongly classified. The \(GPS_{UPM}\) is similar in the confusion matrices a) and c). Nevertheless, its standard deviation is minimum (0.0) in a), but 0.15 in c). This reveals a non-homogeneous performance across the different classes in the problem. The same occurs in cases b) (standard deviation 0.0) and e) (standard deviation 0.07). Note that, following Property 5, the maximum standard deviation is 0.31. The \(GPS_{UPM}\) value for confusion matrix g) implies a near-random performance. In fact, notice that the expected random value in each element of the diagonal is equal to the observed value 50 (450 observations to be distributed in 9 cells). Following Property 6, the \(GPS_{UPM}\) for a uniform \(3 \times 3\) confusion matrix is 4/9.
Confusion matrices h) and i) show non-zero values for Macro-\(F_1^-\), even though all the observations are misclassified. In these two confusion matrices, MCC obtains different values. Moreover, negative MCC values are difficult to interpret, because the minimum MCC value depends on the distribution of the observed labels. Finally, Cohen’s Kappa coefficient achieves results similar to MCC in all the cases except example i). In that case, Cohen’s Kappa coefficient behaves similarly to GPS, providing the same values for h) and i).
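The dependence of the minimum MCC on the label distribution can be illustrated with a small sketch. The two matrices below are hypothetical fully misclassified examples, not the matrices h) and i) of Table 8. The code uses the standard multiclass MCC (Gorodkin's \(R_K\)) and Cohen's Kappa computed from a confusion matrix, assuming rows are observed labels and columns are predictions.

```python
from math import sqrt


def _summaries(matrix):
    """Total count, correct count, observed totals and predicted totals per class."""
    n = len(matrix)
    s = sum(sum(row) for row in matrix)
    c = sum(matrix[k][k] for k in range(n))
    observed = [sum(matrix[k]) for k in range(n)]
    predicted = [sum(matrix[i][k] for i in range(n)) for k in range(n)]
    return s, c, observed, predicted


def multiclass_mcc(matrix):
    """Multiclass Matthews correlation coefficient from a square confusion matrix."""
    s, c, t, p = _summaries(matrix)
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt((s ** 2 - sum(pk ** 2 for pk in p)) *
                       (s ** 2 - sum(tk ** 2 for tk in t)))
    return numerator / denominator if denominator else 0.0


def cohen_kappa(matrix):
    """Cohen's Kappa: agreement corrected by the agreement expected at random."""
    s, c, t, p = _summaries(matrix)
    observed_agreement = c / s
    expected_agreement = sum(pk * tk for pk, tk in zip(p, t)) / s ** 2
    if expected_agreement == 1:
        return 0.0
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)


# Hypothetical matrices in which every observation is misclassified.
balanced_wrong = [[0, 50, 0], [0, 0, 50], [50, 0, 0]]
skewed_wrong = [[0, 100, 0], [25, 0, 0], [25, 0, 0]]

for name, matrix in [("balanced", balanced_wrong), ("skewed", skewed_wrong)]:
    print(name, round(multiclass_mcc(matrix), 4), round(cohen_kappa(matrix), 4))
```

For these two matrices the MCC differs (about -0.5 and -0.707) although both classifiers are wrong on every observation, while Cohen's Kappa happens to coincide at -0.5. This coincidence holds for this particular pair and is not a general property.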
4.4 Multiclass classification with real datasets
In the last experiment, GPS-based metrics are evaluated on multiclass datasets. First, the three-class Connect-4 dataset [7] is used. Second, the four-class Vehicle dataset [7] is considered. Both datasets have been divided into a training set (80%), used to fit an ML model, and a testing set (20%).
For the Connect-4 dataset, an RF model has been trained on the training set with the following parameters: the number of trees is 500, each tree grows to the maximum possible number of terminal nodes, and the number of variables randomly sampled as candidates at each split is the square root of the number of variables in the dataset. For each observation in the testing set, the ML model returns the probability of belonging to each class. Given these probabilities, different thresholds are used to classify the elements. Thus, a set of confusion matrices is obtained.
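The step from class probabilities to a set of confusion matrices can be sketched as follows. The text does not specify how thresholds are applied in the multiclass case, so this sketch assumes one threshold per class and assigns each observation to the class with the largest ratio between its probability and its threshold. The probabilities, the threshold grid and all function names are hypothetical, and the selection criterion used here is the harmonic mean of the per-class Recalls.

```python
from itertools import product


def harmonic_mean(values):
    """Harmonic mean of non-negative values; defined here as 0 if any value is 0."""
    values = list(values)
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def classify(probabilities, thresholds):
    """Assumed rule: pick the class with the largest probability/threshold ratio."""
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    return ratios.index(max(ratios))


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are observed labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for observed, predicted in zip(y_true, y_pred):
        matrix[observed][predicted] += 1
    return matrix


def gps_recall(matrix):
    """Harmonic mean of the Recalls; assumes every class is observed at least once."""
    return harmonic_mean(matrix[k][k] / sum(matrix[k]) for k in range(len(matrix)))


# Hypothetical test labels and predicted class probabilities.
y_true = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.60, 0.30, 0.10],
    [0.50, 0.30, 0.20],
    [0.50, 0.40, 0.10],
    [0.20, 0.60, 0.20],
    [0.45, 0.15, 0.40],
    [0.10, 0.20, 0.70],
]
grid = [0.2, 0.3, 0.4, 0.5]  # hypothetical candidate thresholds per class


def evaluate(thresholds):
    y_pred = [classify(p, thresholds) for p in probabilities]
    matrix = confusion_matrix(y_true, y_pred, 3)
    return gps_recall(matrix), matrix


baseline_score, _ = evaluate((0.5, 0.5, 0.5))  # equal thresholds: plain argmax
best_thresholds = max(product(grid, repeat=3), key=lambda t: evaluate(t)[0])
best_score, best_matrix = evaluate(best_thresholds)
print(baseline_score, best_thresholds, best_score)
```

Each threshold combination yields one confusion matrix, and the combination kept is the one that maximises the chosen GPS instance. Replacing `gps_recall` with another combination of metrics would generally select a different matrix.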
Three GPS-based instances are considered to show that the metric can be built according to the particular problem specifications. First, the \(GPS_{UPM}\) is calculated as a summary metric. Next, the \(GPS_{Recall}\) is considered as a metric that focuses on the relevant instances retrieved from all the relevant instances of all the classes in the problem. Finally, the \(GPS_{Recall, Precision_3}\) is considered. In this case, it is calculated from the three Recalls and the Precision of class 3.
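As a minimal sketch of how such instances can be assembled, the snippet below computes \(GPS_{Recall}\) and \(GPS_{Recall, Precision_3}\) as harmonic means of the selected per-class metrics. The confusion matrix is hypothetical (rows are observed labels, columns are predictions) and the variable names are illustrative.

```python
from statistics import harmonic_mean

# Hypothetical 3 x 3 confusion matrix: rows = observed, columns = predicted.
matrix = [
    [40, 5, 5],
    [10, 30, 10],
    [5, 5, 40],
]
n_classes = len(matrix)

# Per-class Recall: correct predictions over observed instances of the class.
recalls = [matrix[k][k] / sum(matrix[k]) for k in range(n_classes)]

# Per-class Precision: correct predictions over predicted instances of the class.
precisions = [
    matrix[k][k] / sum(matrix[i][k] for i in range(n_classes))
    for k in range(n_classes)
]

# GPS instance built from the three Recalls only.
gps_recall = harmonic_mean(recalls)

# GPS instance built from the three Recalls plus the Precision of class 3.
gps_recall_precision3 = harmonic_mean(recalls + [precisions[2]])

print(round(gps_recall, 4), round(gps_recall_precision3, 4))
```

Adding or removing a metric from the list passed to the harmonic mean is all that is needed to define a new instance, which is what makes the family easy to tailor to a given problem.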
Table 10 presents the confusion matrices that maximise the \(GPS_{UPM}\), the \(GPS_{Recall}\) and the \(GPS_{Recall, Precision_3}\) values, respectively, in the test dataset. Table 11 shows the values of the metrics for each confusion matrix. Notice that, in this case:
The \(GPS_{UPM}\) achieves its maximum value, \(0.69\pm 0.08\), in confusion matrix a). The standard deviation of GPS has been calculated using Property 3 in Section 3. Notice that the range of the six basic metrics (three Precisions and three Recalls) is smallest in this case: a) 0.56, b) 0.64, c) 0.75. When only the Recalls are relevant, the maximum of \(GPS_{Recall}\) is \(0.67\pm 0.07\), corresponding to confusion matrix b). Since the Precisions are not considered, they can take more extreme values (range equal to 0.64), while the Recalls are restricted to less extreme values (range equal to 0.21). Finally, when \(GPS_{Recall,Precision_3}\) is used, a higher value of \(Precision_{3}\) is obtained. In this case, the maximum value, \(0.72\pm 0.08\), is achieved in confusion matrix c).
Second, GPS-based metrics are evaluated on the Vehicle dataset. In this case, the selected ML model is a Support Vector Machine (SVM) with a linear kernel and cost equal to 1. For each observation in the testing set, the ML model returns the probability of belonging to each class. Given these probabilities, different thresholds are used to classify the elements. Thus, a set of confusion matrices is obtained.
In this case, six different GPS-based instances are considered to show that the classifier predictions that maximise the chosen performance metric differ depending on the GPS definition, leading to different confusion matrices. First, the \(GPS_{UPM}\) is calculated as a summary metric. Next, the \(GPS_{NPV}\) is considered as a metric that measures the proportion of negative samples that were correctly classified with respect to the total number of samples predicted as negative. Then, the \(GPS_{Precision}\) is considered as the counterpart of NPV, representing the proportion of positive samples that were correctly classified with respect to the total number of samples predicted as positive. Afterwards, the \(GPS_{NPV, Precision_1}\) is considered. In this case, it is calculated from the four NPVs and the Precision of class 1. Then, the \(GPS_{Recall}\) is considered as a metric that focuses on the relevant instances retrieved from all the relevant instances of all the classes in the problem. Finally, the \(GPS_{Recall, Precision_4}\) is presented to show the changes related to the increase in the Precision of class 4.
Table 12 presents the confusion matrices in the test dataset obtained by maximising each of the GPS-based instances. Table 13 shows the values of the metrics for each confusion matrix.
Confusion matrix a) maximises \(GPS_{UPM}\), with a maximum value of \(0.12\pm 0.04\). The standard deviation of GPS has been calculated using Property 3 in Section 3. When only the NPV is relevant, the maximum of \(GPS_{NPV}\) is 0.80, corresponding to confusion matrix b). Notice the significant differences between the confusion matrices, depending on the chosen performance metric. In this case, since the Recalls are not considered, they can take more extreme values (range equal to 1.00), whereas the Specificity is restricted to less extreme values (range equal to 0.36). When only the Precisions are relevant, the maximum of \(GPS_{Precision}\) is \(0.07\pm 0.09\), corresponding to confusion matrix c). Confusion matrix d) is the result of maximising \(GPS_{NPV,Precision_1}\). The solution is similar to the one obtained when \(GPS_{NPV}\) is chosen as the performance metric (confusion matrix b)). However, in d) a higher value of \(Precision_1\) is required (0.81 vs 0.54). The thresholds chosen to maximise \(GPS_{Recall}\) result in confusion matrix e), where the maximum value is \(0.06\pm 0.11\). Confusion matrix f) is the solution achieved when the Precision of class 4 is added to the above definition of \(GPS_{Recall}\). As expected, the main differences between confusion matrices e) and f) appear in class 4, where the corresponding Precision increases from 0.02 to 0.11 and the corresponding Recall from 0.02 to 0.15.
5 Conclusions
In this paper, the GPS, a novel family of performance metrics for binary and multiclass classification problems, has been presented. It is defined as the combination of a set of performance metrics using the harmonic mean. The harmonic mean is a natural choice to combine values representing ratios, such as those derived from the confusion matrix. Besides, it generates conservative combinations since it penalises low values. Thus, data analysts can use GPS to develop different metrics tailored to the problem domain and the goals of the domain expert.
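The conservative behaviour of the harmonic mean can be seen in a short sketch that compares it with the arithmetic mean on three hypothetical metric values, one of which is low.

```python
from statistics import harmonic_mean, mean

# Hypothetical metric values: two high and one low.
metrics = [0.9, 0.9, 0.1]

arithmetic = mean(metrics)           # dominated by the two high values
harmonic = harmonic_mean(metrics)    # pulled towards the single low value

print(round(arithmetic, 4), round(harmonic, 4))
```

The arithmetic mean stays above 0.6, whereas the harmonic mean drops below 0.25, so a single poorly performing metric is enough to lower the combined score substantially.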
Several instances of GPS have been presented and compared with various state-of-the-art performance metrics in both binary and multiclass classification problems. It has been shown that it is possible to use different instances of GPS depending on the particular problem specifications. These definitions lead to different class predictions from the classifier and, therefore, to different confusion matrices. The GPS has proven to be more stable and explainable than the alternatives. Further, it has been shown that previous definitions of performance metrics such as \(F_1^{+}\), \(F_1^{-}\) and UPM are instances of GPS.
Future work will focus on performing model selection using GPS. Given a set of ML classifiers, different performance metrics might lead to a different selection of the best model. In this context, the effect of GPS-based metrics on the selection process could be evaluated. In addition, a sensitivity analysis will be carried out to study the effect of different misclassification costs and of different techniques to build binary matrices in multiclass problems. Further analysis will address the classification of datasets with a large number of categories. Notice that, as the number of categories grows, the number of possible performance metrics that can be derived from the one proposed in this paper increases. Thus, a future research line would be a comparative study of the different solutions achieved through the chosen metrics within a specific problem. Furthermore, instances of GPS will be developed for multilabelled and hierarchical classification, and for classification with non-square confusion matrices. The latter corresponds to binary classification problems where an output with more than two options is more informative. For instance, in a system that predicts whether a patient will die in a given surgery, an output such as high-risk, medium-risk, and low-risk is more informative than a binary output. Finally, future work will focus on the use of the method when the data are in tensor form [14, 15].
References
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Bland M (2008) Cohen’s kappa. University of York Department of Health Sciences. https://www-users.york.ac.uk/~mb55/msc/clinimet/week4/kappash2.pdf. Accessed 13 Feb 2014
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7):1145–1159
Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1):6
Cohen P (1982) To be or not to be: Control and balancing of type I and type II errors. Evaluation and Program Planning 5(3):247–253
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874
Goodall DW (1967) The distribution of the matching coefficient. Biometrics, 647–656
Gorodkin J (2004) Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry 28(5–6):367–374
Grandini M, Bagli E, Visani G (2020) Metrics for multi-class classification: an overview. arXiv:2008.05756
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. Journal of Intelligent Information Systems 17(2–3):107–145
Halligan S, Altman DG, Mallett S (2015) Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: a discussion and proposal for an alternative approach. European Radiology 25(4):932–939
Hu C, Wang Y, Gu J (2020) Cross-domain intelligent fault classification of bearings based on tensor-aligned invariant subspace learning and two-dimensional convolutional neural networks. Knowledge-Based Systems 209:106214
Hu C, He S, Wang Y (2021) A classification method to detect faults in a rotating machinery based on kernelled support tensor machine and multilinear principal component analysis. Applied Intelligence 51(4):2609–2621
Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442–451
Norris N (1940) The standard errors of the geometric and harmonic means and their application to index numbers. The Annals of Mathematical Statistics 11(4):445–448
Ogbi MSZ (2012) A mathematical property of the harmonic mean. In: The 6th international days of statistics and economics. Prague University of Economics and Business, pp 873–877
Opitz J, Burst S (2019) Macro F1 and Macro F1. arXiv:1911.03347
Powers DM (2020) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv:2010.16061
Puthiya Parambath S, Usunier N, Grandvalet Y (2014) Optimizing F-measures by cost-sensitive classification. Advances in Neural Information Processing Systems 27:2123–2131
Redondo AR, Navarro J, Fernández RR, de Diego IM, Moguerza JM, Fernández-Muñoz JJ (2020) Unified performance measure for binary classification problems. In: International conference on intelligent data engineering and automated learning. Springer, pp 104–112
Sasaki Y, Fellow R (2007) The truth of the F-measure. Manchester: MIB-School of Computer Science, University of Manchester, p 25
Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4):427–437
Tharwat A (2020) Classification assessment methods. New England Journal of Entrepreneurship 17(1):168–192
Acknowledgements
This research has been supported by grants from Madrid Autonomous Community (Ref: IND2018/TIC9665) and the Spanish Ministry of Science and Innovation, under the Retos-Colaboración program: SABERMED (Ref: RTC201762531); and the Retos-Investigación program: MODASIN (reference: RTI2018094269BI00). Special thanks to MISC International S.L.
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
De Diego, I.M., Redondo, A.R., Fernández, R.R. et al. General Performance Score for classification problems. Appl Intell 52, 12049–12063 (2022). https://doi.org/10.1007/s10489-021-03041-7