After having introduced the mathematical foundations of MCC, accuracy, and F1 score, and having explored their relationships, here we describe some synthetic, realistic scenarios where MCC results are more informative and truthful than the other two measures analyzed.
Positively imbalanced dataset — Use case A1. Consider, for a clinical example, a positively imbalanced dataset made of 9 healthy individuals (negatives = 9%) and 91 sick patients (positives = 91%) (Fig. 2c). Suppose the machine learning classifier generated the following confusion matrix: TP = 90, FN = 1, TN = 0, FP = 9 (Fig. 2b).
In this case, the algorithm showed its ability to predict the positive data instances (90 sick patients out of 91 were correctly predicted), but it also displayed its lack of talent in identifying healthy controls (none of the 9 healthy individuals was correctly recognized) (Fig. 2b). Therefore, the overall performance should be judged poor. However, accuracy and F1 showed high values in this case: accuracy = 0.90 and F1 score = 0.95, both close to the best possible value 1.00 in the [0, 1] interval (Fig. 2a). At this point, if one decided to evaluate the performance of this classifier by considering only accuracy and F1 score, he/she would overoptimistically think that the computational method generated excellent predictions.
Instead, if one decided to take advantage of the Matthews correlation coefficient in the Use case A1, he/she would notice the resulting MCC = –0.03 (Fig. 2a). By seeing a value close to zero in the [–1, +1] interval, he/she would be able to understand that the machine learning method has performed poorly.
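These three scores can be verified quickly from the confusion matrix counts. A minimal Python sketch (not part of the original study) for Use case A1:

```python
# Compute accuracy, F1 score, and MCC from the Use case A1 confusion matrix.
from math import sqrt

TP, FN, TN, FP = 90, 1, 0, 9  # Use case A1

accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"accuracy = {accuracy:.2f}")  # 0.90
print(f"F1 score = {f1:.2f}")        # 0.95
print(f"MCC      = {mcc:.2f}")       # -0.03
```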
Positively imbalanced dataset — Use case A2. Suppose the prediction generated this other confusion matrix: TP = 5, FN = 70, TN = 19, FP = 6 (Additional file 1b).
Here the classifier was able to correctly predict negatives (19 healthy individuals out of 25), but was unable to correctly identify positives (only 5 sick patients out of 75). In this case, all three statistical rates showed a low score which emphasized the deficiency in the prediction process (accuracy = 0.24, F1 score = 0.12, and MCC = −0.24).
Balanced dataset — Use case B1. Consider now, as another example, a balanced dataset made of 50 healthy controls (negatives = 50%) and 50 sick patients (positives = 50%) (Additional file 2c). Imagine that the machine learning prediction generated the following confusion matrix: TP = 47, FN = 3, TN = 5, FP = 45 (Additional file 2b).
Once again, the algorithm exhibited its ability to predict the positive data instances (47 sick patients out of 50 were correctly predicted), but it also demonstrated its lack of talent in identifying healthy individuals (only 5 healthy controls out of 50 were correctly recognized) (Additional file 2b). Again, the overall performance should be considered mediocre.
Checking only F1, one would read a good value (0.66 in the [0, 1] interval), and would be overall satisfied with the prediction (Additional file 2a). Once again, this score would hide the truth: the classification algorithm has performed poorly on the negative subset. The Matthews correlation coefficient, instead, by showing a score close to random guessing (+0.07 in the [–1, +1] interval), would be able to inform the user that the machine learning method has been on the wrong track. It is also worth noticing that accuracy would provide an informative result in this case (0.52 in the [0, 1] interval).
Balanced dataset — Use case B2. As another example, imagine the classifier produced the following confusion matrix: TP = 10, FN = 40, TN = 46, FP = 4 (Additional file 3b).
Similar to what happened for the Use case A2, the method was able to correctly predict many negative cases (46 healthy individuals out of 50), but failed in predicting most of the positive data instances (only 10 sick patients out of 50 were correctly predicted). As in the Use case A2, accuracy, F1, and MCC show average or low scores (accuracy = 0.56, F1 score = 0.31, and MCC = +0.17), correctly informing the user about the non-optimal performance of the prediction method (Additional file 3a).
Negatively imbalanced dataset — Use case C1. As another example, consider now this negatively imbalanced dataset made of 90 healthy controls (negatives = 90%) and 10 sick patients (positives = 10%) (Additional file 4c).
Assume the classifier prediction produced this confusion matrix: TP = 9, FN = 1, TN = 1, FP = 89 (Additional file 4b).
In this case, the method revealed its ability to predict positive data instances (9 sick patients out of 10 were correctly predicted), but it also showed its lack of skill in identifying negative cases (only 1 healthy individual out of 90 was correctly recognized) (Additional file 4c). Again, the overall performance should be judged modest.
Similar to the Use cases A2 and B2, all three statistical scores generated low results that reflect the mediocre quality of the prediction: F1 score = 0.17 and accuracy = 0.10 in the [0, 1] interval, and MCC = −0.19 in the [–1, +1] interval (Additional file 4a).
Negatively imbalanced dataset — Use case C2. As a last example, suppose you obtained this alternative confusion matrix, through another prediction: TP = 2, FN = 9, TN = 88, FP = 1 (Additional file 5b).
As in the Use cases A2 and B2, the method was able to correctly identify most negative data instances (88 healthy individuals out of 89), but unable to correctly predict most of the sick patients (only 2 true positives out of 11 possible elements).
Here, accuracy showed a high value: 0.90 in the [0, 1] interval.
On the contrary, if one decided to take a look at F1 and at the Matthews correlation coefficient, by noticing low values (F1 score = 0.29 in the [0, 1] interval and MCC = +0.31 in the [–1, +1] interval), she/he would be correctly informed about the low quality of the prediction (Additional file 5a).
As we explained earlier, the key advantage of the Matthews correlation coefficient is that it generates a high quality score only if the prediction correctly classified a high percentage of negative data instances and a high percentage of positive data instances, regardless of the class balance or imbalance of the dataset.
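Recalling the definition given in the previous section, this behavior can be read directly off the formula: when the prediction contains almost no true negatives (or no true positives), the numerator cannot become large and positive, as in Use case A1:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}} = \frac{90 \cdot 0 - 9 \cdot 1}{\sqrt{99 \cdot 91 \cdot 9 \cdot 1}} \approx -0.03
\]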
Recap. We recap here the results obtained for the six use cases (Table 4). For the Use case A1 (positively imbalanced dataset), the machine learning classifier was unable to correctly predict negative data instances, and it therefore produced a confusion matrix featuring no true negatives (TN). There, accuracy and F1 generated overoptimistic and inflated results, while the Matthews correlation coefficient was the only statistical rate which identified the aforementioned prediction problem, and therefore the only one to provide a low, truthful quality score.
In the Use case A2 (positively imbalanced dataset), instead, the method did not correctly predict enough positive data instances, and therefore showed few true positives. In this case, accuracy, F1, and MCC all produced low values that correctly reflected the low quality of the prediction.
In the Use case B1 (balanced dataset), the machine learning method was unable to correctly predict negative data instances, and therefore produced a confusion matrix featuring few true negatives (TN). In this case, F1 generated an overoptimistic result, while accuracy and the MCC correctly produced low results that highlighted an issue in the prediction.
In the Use case B2 (balanced dataset), too, the classifier did not find enough true positives. In this case, all the analyzed rates (accuracy, F1, and MCC) produced average or low results which correctly represented the prediction issue.
In the Use case C1 (negatively imbalanced dataset), the machine learning method was again unable to correctly recognize negative data instances, and therefore produced a confusion matrix with a low number of true negatives (TN). Here, all three rates (accuracy, F1, and MCC) correctly produced low results that indicated a problem in the prediction process.
Finally, in the last Use case C2 (negatively imbalanced dataset), the prediction technique failed in predicting positive elements, and therefore its confusion matrix showed a low percentage of true positives. Here accuracy generated an overoptimistic, misleading, and inflated high result, while F1 and MCC produced low scores that correctly reflected the prediction issue.
In summary, even if F1 and accuracy results were able to reflect the prediction issue in some of the six analyzed use cases, the Matthews correlation coefficient was the only score which correctly indicated the prediction problem in all six examples (Table 4).
In particular, in the Use case A1 (a prediction which generated many true positives and few true negatives on a positively imbalanced dataset), the MCC was the only statistical rate able to truthfully highlight the classification problem, while the other two rates showed misleading results (Fig. 2).
These results show that, while accuracy and F1 score can generate high values that do not inform the user about ongoing prediction issues, the MCC is a robust, reliable, truthful statistical measure able to correctly reflect the deficiencies of a prediction regardless of the class balance of the dataset.
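The six synthetic use cases are easy to reproduce. The following illustrative snippet (assuming scikit-learn is available; it is not part of the original study) expands each confusion matrix above into ground-truth and predicted label vectors and recomputes the three scores recapped in Table 4:

```python
# Reproduce the accuracy, F1, and MCC values of the six synthetic use cases.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

use_cases = {        # TP, FN, TN, FP
    "A1": (90, 1, 0, 9),
    "A2": (5, 70, 19, 6),
    "B1": (47, 3, 5, 45),
    "B2": (10, 40, 46, 4),
    "C1": (9, 1, 1, 89),
    "C2": (2, 9, 88, 1),
}

for name, (tp, fn, tn, fp) in use_cases.items():
    # Expand the counts into label vectors (1 = sick patient, 0 = healthy control).
    y_true = [1] * (tp + fn) + [0] * (tn + fp)
    y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
    print(f"{name}: accuracy = {accuracy_score(y_true, y_pred):.2f}, "
          f"F1 = {f1_score(y_true, y_pred):.2f}, "
          f"MCC = {matthews_corrcoef(y_true, y_pred):+.2f}")
```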
Genomics scenario: colon cancer gene expression
In this section, we show a real genomics scenario where the Matthews correlation coefficient is more informative than accuracy and F1 score.
Dataset. We trained and applied several machine learning classifiers to gene expression data from the microarray experiments on colon tissue released by Alon et al. and made publicly available within the Partial Least Squares Analyses for Genomics (plsgenomics) R package [104, 105]. The dataset contains 2,000 gene probesets for 62 individuals, of whom 22 are healthy controls and 40 have colon cancer (35.48% negatives and 64.52% positives).
Experiment design. We employed machine learning binary classifiers to predict patients and healthy controls in this dataset: gradient boosting, decision tree, k-nearest neighbors (k-NN), support vector machine (SVM) with linear kernel, and support vector machine with radial Gaussian kernel.
For gradient boosting and decision tree, we trained the classifiers on a training set containing 80% of randomly selected data instances, and tested them on the test set containing the remaining 20% of data instances. For k-NN and the SVMs, we split the dataset into a training set (60% of data instances, randomly selected), a validation set (20% of data instances, randomly selected), and a test set (the remaining 20% of data instances). We used the validation set for the hyper-parameter optimization grid search: the number k of neighbors for k-NN, and the cost C hyper-parameter for the SVMs. We trained one model per hyper-parameter value on the training set, applied it to the validation set, and then picked the one obtaining the highest MCC as the final model to be applied to the test set. For all the classifiers, we repeated the experiment execution ten times and recorded the average results for MCC, F1 score, accuracy, true positive (TP) rate, and true negative (TN) rate.
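A minimal sketch of this validation protocol, written with scikit-learn for illustration (not the original pipeline: the real X and y would be the 62 × 2000 Colon expression matrix and diagnosis labels from plsgenomics, replaced below by a synthetic stand-in so the example stays self-contained):

```python
# Illustration of MCC-driven hyper-parameter selection for a linear SVM.
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 62 x 2000 colon cancer expression matrix.
X, y = make_classification(n_samples=62, n_features=2000, n_informative=20,
                           weights=[0.35, 0.65], random_state=0)

# 60% training, 20% validation, 20% test, randomly selected.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Grid search on the validation set: keep the cost C with the highest MCC.
best_C, best_mcc = None, float("-inf")
for C in (0.01, 0.1, 1, 10, 100):
    model = SVC(kernel="linear", C=C).fit(X_train, y_train)
    mcc = matthews_corrcoef(y_val, model.predict(X_val))
    if mcc > best_mcc:
        best_C, best_mcc = C, mcc

# Apply the selected model to the held-out test set.
final_model = SVC(kernel="linear", C=best_C).fit(X_train, y_train)
print("best C:", best_C, "test MCC:", matthews_corrcoef(y_test, final_model.predict(X_test)))
```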
We then ranked the results obtained on the test sets or the validation sets first based on the MCC, then based on the F1 score, and finally based on the accuracy (Table 5).
Results: different metric, different ranking. The three rankings we employed to report the same results (Table 5) show two interesting aspects. First, the top classifier changes depending on whether the ranking is based on the MCC, the F1 score, or the accuracy. In the MCC ranking, in fact, the top performing method is gradient boosting (MCC = +0.55), while in the F1 score ranking and in the accuracy ranking the best classifier turned out to be k-NN (F1 score = 0.87 and accuracy = 0.81). The ranks of the other methods change, too: linear SVM is ranked fourth in the MCC ranking and in the accuracy ranking, but second in the F1 score ranking. Decision tree also changes its position from one ranking to another.
As mentioned earlier, for binary classifications like this one, we prefer to focus on the ranking obtained with the MCC, because this rate generates a high score only if the classifier was able to correctly predict the majority of the positive data instances and the majority of the negative data instances. In our example, in fact, gradient boosting, the top classifier in the MCC ranking, did quite well both on the recall (TP rate = 0.85) and on the specificity (TN rate = 0.69). k-NN, the top performing method in both the F1 score ranking and the accuracy ranking, instead obtained an excellent recall (TP rate = 0.92) but only a barely sufficient specificity (TN rate = 0.52).
The F1 score ranking and the accuracy ranking, in conclusion, hide this important flaw of their top classifier: k-NN was unable to correctly predict a high percentage of healthy controls (the negative class). The MCC ranking, instead, takes this information into account.
Results: F1 score and accuracy can mislead, but MCC does not. The second interesting aspect of the results we obtained relates to the radial SVM (Table 5). If a researcher decided to evaluate the performance of this method by observing only the F1 score and the accuracy, she/he would notice good results (F1 score = 0.75 and accuracy = 0.67) and might be satisfied with them. These values, in fact, correspond roughly to 3/4 for the F1 score and 2/3 for the accuracy.
However, these values of F1 score and accuracy would mislead the researcher once again: with a closer look at the results, one can notice that the radial SVM performed poorly on the true negatives (TN rate = 0.40), correctly predicting less than half of the healthy controls. Similar to the synthetic Use case A1 previously described (Fig. 2 and Table 4), the Matthews correlation coefficient is the only aggregate rate that highlights the weak performance of the classifier here. With its low value (MCC = +0.29), the MCC informs the reader about the poor overall outcome of the radial SVM, while the accuracy and F1 score show misleading values.