Comparing ϕ and the F-measure as performance metrics for software-related classifications

The F-measure has been widely used as a performance metric when selecting binary classifiers for prediction, but it has also been widely criticized, especially given the availability of alternatives such as ϕ (also known as Matthews Correlation Coefficient). Our goals are to (1) investigate possible issues related to the F-measure in depth and show how ϕ can address them, and (2) explore the relationships between the F-measure and ϕ. Based on the definitions of ϕ and the F-measure, we derive a few mathematical properties of these two performance metrics and of the relationships between them. To demonstrate the practical effects of these mathematical properties, we illustrate the outcomes of an empirical study involving 70 Empirical Software Engineering datasets and 837 classifiers. We show that ϕ can be defined as a function of Precision and Recall, which are the only two performance metrics used to define the F-measure, and the rate of actually positive software modules in a dataset. Also, ϕ can be expressed as a function of the F-measure and the rates of actual and estimated positive software modules. We derive the minimum and maximum value of ϕ for any given value of the F-measure, and the conditions under which both the F-measure and ϕ rank two classifiers in the same order. Our results show that ϕ is a sensible and useful metric for assessing the performance of binary classifiers. We also recommend that the F-measure should not be used by itself to assess the performance of a classifier, but that the rate of positives should always be specified as well, at least to assess if and to what extent a classifier performs better than random classification. The mathematical relationships described here can also be used to re-interpret the conclusions of previously published papers that relied mainly on the F-measure as a performance metric.


Introduction
Classification problems are quite common in rather diverse application areas of software practice and research. Here are just a few examples:
- The classification of the words in requirements texts has been used to derive the semantic representation of functional software requirements (Sonbol et al. 2020).
- Software requirements have been classified into functional requirements and subclasses of non-functional requirements via machine-learning techniques (Dias Canedo and Cordeiro Mendes 2020).
- Machine-learning techniques have also been used to recognize attacks on software-defined networks (Scaranti et al. 2020).
- The diffusion of news via Twitter was used to classify news articles pertaining to disinformation vs. mainstream news (Pierri et al. 2020).
- Defect prediction, which is probably the best known software classification activity (Hall et al. 2011), classifies software modules as faulty or non-faulty.
Many different classifiers are built and used to address these and other Empirical Software Engineering problems. Thus, it is important to assess how well classifiers perform, so the best can be selected.
In this paper, we focus on binary classifiers, which are the most widely used, and on the metrics that have been defined to evaluate their performance. Using one performance metric instead of another may lead to very different evaluations and ranking among competing classifiers. To select effective and practically useful classifiers, it is therefore crucial to use performance metrics that are sound and reliable. This requires carefully examining and comparing the properties and possible issues of performance metrics before adopting any of them.
The F-measure (also known as F-score or F1) is a performance metric that has been widely used in Empirical Software Engineering. For instance, it was used, along with other metrics, to evaluate the performance of the classifications obtained in all of the empirical studies mentioned above.
The F-measure combines two performance metrics, Precision and Recall, also widely used to measure specific aspects of performance. As such, the F-measure is often perceived as a convenient means for obtaining an overall performance metric. The F-measure was originally defined to evaluate the performance of information retrieval techniques (van Rijsbergen 1979). However, it has numerous serious drawbacks that spurred criticisms (Hernández-Orallo et al. 2012; Powers 2011; Sokolova and Lapalme 2009; Luque et al. 2019).
Several researchers favored using other performance metrics like φ (Cohen 1988) (also known as Matthews Correlation Coefficient (Matthews 1975)), which are generally considered sounder (Yao and Shepperd 2020).
Unfortunately, the F-measure and φ may rank competing classifiers in different ways. According to Yao and Shepperd's analysis of the literature, around 22% of the published results change when φ is used instead of the F-measure (Yao and Shepperd 2021).
The goal of this paper is to analyze and compare the issues, advantages, and relationships of F-measure and φ, to help decision-makers use the performance metric that allows them to select the classifiers that better suit their goals.
Thus, after introducing the basic notions and terminology in Section 2, the paper provides the following main contributions, which we list along with the section where they can be found.
- We provide an organized in-depth discussion and comparison of the characteristics of the F-measure and φ, by building on the criticisms of the literature and adding some more observations (Section 3).
- We show that φ is a mathematical function of Precision, Recall, and the rate of actual positive modules (Section 4).
- We show that φ can be mathematically expressed as a function of the F-measure and the rates of actual and estimated positive modules. We study the extent to which these rates influence the set of possible values of φ that correspond to a given value of the F-measure. We also derive the conditions under which both the F-measure and φ rank two classifiers in the same order (Section 5). Specifically, we prove that φ and the F-measure tend to rank two classifiers in the same way when the rate of actual positives is quite small. This result explains why the F-measure was originally proposed in the information retrieval domain, where the rate of actual negatives is generally very large. When that is not the case, as in many software engineering situations, even a seemingly high value of the F-measure may correspond to a performance not better than that of a random classifier.
- The knowledge provided in this paper casts new light on some results published previously, allowing for a more rigorous and sound reinterpretation of such results, and in some cases leading to rejecting conclusions that are not based on reliable evaluations (Section 7).
Our mathematical approach, described and proved in Sections 4 and 5 (whose details can be found in the Appendices), provides a theoretical explanation for the findings of the previous literature (discussed in Section 8), which were based on empirical studies or simulations. In addition, it generalizes and extends them to new results and evidence. Our results are of mathematical nature and therefore do not need empirical confirmatory evidence. At any rate, for demonstrative illustration purposes only, we also carried out an empirical study with 70 real-life Empirical Software Engineering datasets and 837 classifiers (shown in Section 6), to show the practical relevance of the mathematical results.
As we remark in the conclusions in Section 9, our study indicates that i) the proportion of positive modules should always be reported, along with the performance metrics of choice, ii) the F-measure should be used only when the rate of positive modules is very small, iii) φ is always a useful alternative, as already observed by some other previous studies, iv) if possible, providing the raw measures that are used to compute performance metrics is the best choice, as it provides the most detailed view of performance.
As a final observation, the mathematical results reported in this paper depend only on the definitions of φ and F-measure. Therefore, they can be used in the evaluation of any binary classifier used in Software Engineering and in any other domain. In the Software Engineering domain specifically, our results can be useful in software defect prediction, in which binary classifiers are used to estimate which software modules are likely to be defective and should be treated as such. To this end, the illustrative empirical study of Section 6 focuses on software defect prediction.

Background
A classifier is a function that partitions a set of n elements into equivalence classes, identified by different labels. We only deal with binary classifiers, hence we write "classifier" instead of "binary classifier" for conciseness in what follows. Also, since we are interested in software-related classifiers, instead of "element" we use "software module," or, for short, "module," by which term we denote any piece of software (e.g., routine, method, class). The modules of the set are therefore classified as "positive" or "negative," where the meaning of these labels depends on the specific application. For instance, when estimating whether software modules are defective, the label "positive" means "faulty module" and the label "negative" means "non-faulty module." The performance of a classifier on a set of modules is usually assessed based on a 2 × 2 matrix called "confusion matrix" (also known as "contingency table") that shows how many of those n modules are correctly and incorrectly classified. As Table 1 shows, the cells of a confusion matrix contain the numbers of modules that are: correctly estimated negative (True Negatives TN); incorrectly estimated negative (False Negatives FN); incorrectly estimated positive (False Positives FP); and correctly estimated positive (True Positives TP).
In Table 1, we also reported EN and EP, the numbers of Estimated Negatives and Estimated Positives, and AN and AP, the numbers of Actual Negatives and Actual Positives. AN and AP are intrinsic characteristics of the dataset, as is the actual prevalence $\rho = AP/n$ (Yao and Shepperd 2021). Instead, EN and EP depend on the classifier, as does the estimated prevalence $\sigma = EP/n$. Note that prevalence, quantified via ρ, is closely related to the notion of class imbalance, as quantified by IR (Imbalance Ratio), which is the ratio of the number of elements of the majority class to the number of elements of the minority class. In several application areas, e.g., software defect prediction, there is a majority of negative elements, so, for instance, Song, Guo, and Shepperd (Song et al. 2019) take $IR = AN/AP = 1/\rho - 1$. Because of the existence of this functional relationship between prevalence and imbalance, we take into account class imbalance via prevalence ρ in the paper. Unlike IR, ρ ranges between zero and one: according to ρ, a dataset is perfectly balanced when ρ = 0.5, while positive classes are prevalent when ρ > 0.5, and negative classes are prevalent when ρ < 0.5.
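As a minimal sketch of the quantities just introduced, the following Python snippet computes ρ, σ, and IR from the four cells of a confusion matrix. The function name `prevalence_stats` and the cell values are hypothetical and chosen only for illustration; IR is computed as AN/AP, which assumes that negatives are the majority class.

```python
def prevalence_stats(tp, fn, fp, tn):
    """Return (rho, sigma, ir) for one confusion matrix (hypothetical helper)."""
    n = tp + fn + fp + tn
    ap, an = tp + fn, fp + tn   # actual positives and actual negatives
    ep = tp + fp                # estimated positives
    rho = ap / n                # actual prevalence
    sigma = ep / n              # estimated prevalence
    ir = an / ap                # imbalance ratio, assuming negatives are the majority
    return rho, sigma, ir


# Illustrative cell values, not taken from the paper.
rho, sigma, ir = prevalence_stats(tp=8, fn=12, fp=4, tn=76)
print(rho, sigma, ir)
print(abs(ir - (1 / rho - 1)) < 1e-9)  # IR = 1/rho - 1
```

The last line checks numerically the functional relationship IR = 1/ρ − 1 that motivates using ρ alone to account for class imbalance.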
A perfect classifier has FN = FP = 0, but this is hardly ever the case for any real-life classifier, so the closer FN and FP are to zero, the better. To evaluate the performance of a classifier with respect to FP or FN, two performance metrics have been defined and used, Precision and Recall, respectively. For brevity, and to shorten the length of the formulas, we write PPV (Positive Predictive Value) for Precision and TPR (True Positive Rate) for Recall, as in Formula (1).
$$PPV = \frac{TP}{EP} = \frac{TP}{TP + FP} \qquad TPR = \frac{TP}{AP} = \frac{TP}{TP + FN} \tag{1}$$
PPV is the proportion of estimated positives that have been correctly estimated, and can be used to quantify FP, since FP = EP (1 − PPV). Maximizing PPV amounts to minimizing FP, regardless of the value of FN. Thus, maximizing PPV is important when the cost of dealing with an estimated positive is high, but the impact of having false negatives is low. TPR is the proportion of correctly estimated actual positives, and it is related to FN, since FN = AP (1 − TPR). Maximizing TPR amounts to minimizing FN, regardless of the value of FP. Maximizing TPR is important when the consequences of false negatives are substantial and the cost of dealing with a false positive instead is quite low.
So, using one of these performance metrics means dealing with only one of FP and FN.
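The relations between PPV, TPR, FP, and FN can be checked numerically. The sketch below assumes PPV = TP/EP and TPR = TP/AP, as described in the text; the helper name `ppv_tpr` and the cell values are hypothetical.

```python
def ppv_tpr(tp, fn, fp):
    """Return (PPV, TPR); NaN when the corresponding denominator is zero."""
    ep, ap = tp + fp, tp + fn
    ppv = tp / ep if ep else float("nan")
    tpr = tp / ap if ap else float("nan")
    return ppv, tpr


# Illustrative cell values, not taken from the paper.
tp, fn, fp = 8, 12, 4
ppv, tpr = ppv_tpr(tp, fn, fp)
ep, ap = tp + fp, tp + fn

# FP = EP (1 - PPV) and FN = AP (1 - TPR), up to floating-point error.
print(abs(fp - ep * (1 - ppv)) < 1e-9)
print(abs(fn - ap * (1 - tpr)) < 1e-9)
```

Each metric constrains one error type only: PPV says nothing about FN, and TPR says nothing about FP.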
Given two classifiers $cl_1$ and $cl_2$, it is easy to conclude that $cl_1$ is preferable to $cl_2$ if $TPR_1 > TPR_2$ and $PPV_1 > PPV_2$. However, it is not straightforward to draw any conclusions if $TPR_1 > TPR_2$ and $PPV_1 < PPV_2$, or if $TPR_1 < TPR_2$ and $PPV_1 > PPV_2$. This is a typical issue in multi-objective optimization, since the goal here is to minimize two figures of merit, i.e., FN and FP, or, equivalently, maximize TPR and PPV, which may not be possible at the same time. Multi-objective optimization is often reduced to single-objective optimization, by defining a single figure of merit (Serafini 1985). Based on the cells of the confusion matrix, several performance metrics have been defined and used to act as single figures of merit. Different performance metrics take into account different aspects of performance that can be of interest in different application cases.

The Definition of F-Measure
The purpose of the F-measure (FM) is to combine PPV and TPR into a single performance metric by taking their harmonic mean, as shown in Formula (2).
$$FM = \frac{2}{\frac{1}{PPV} + \frac{1}{TPR}} = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR} \tag{2}$$
Since FM was originally defined to evaluate the performance of information retrieval (van Rijsbergen 1979), the focus is on how well the true positives have been identified. It is important that, at the same time, (1) a high proportion of actual positives be correctly estimated as such, so TPR should be high, and (2) a high proportion of the estimated positives be positive indeed, so PPV should be high too. Instead, true negatives are not taken into account in the computation of the F-measure because (1) their number is usually very large, (2) it is generally unknown, and (3) in practice it is hardly relevant. Strictly speaking, FM is not defined when PPV = 0 or TPR = 0, i.e., TP = 0, but we can safely assume that FM = 0 when TP = 0, since it can be easily shown that
$$FM = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2\,TP}{AP + EP} \tag{3}$$
and the rightmost fraction is equal to 0 when TP = 0.
So, FM is in the [0,1] range, with FM = 0 if and only if TP = 0, i.e., no actual positives have been correctly estimated, and FM = 1 if and only if FP = FN = 0, i.e., in the perfect classification case. When interpreting FM, classifier $cl_1$ performs better than classifier $cl_2$ if $FM_1 > FM_2$. So, the higher FM, the better.
FM is a special case of a more general definition that includes a parameter β, used to weigh PPV and TPR differently, as shown in Formula (4).
$$F_\beta = \frac{(1 + \beta^2) \cdot PPV \cdot TPR}{\beta^2 \cdot PPV + TPR} \tag{4}$$
However, β is set to 1 in the near totality of Empirical Software Engineering studies using FM. So, we use "F-measure" (or FM) instead of $F_1$.
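The following sketch computes FM and its β-weighted generalization directly from the cells of a confusion matrix. It assumes the standard cell-based form of the definition, (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP), and returns 0 when TP = 0, in line with the convention adopted above. The function name `f_beta` and the cell values are hypothetical.

```python
def f_beta(tp, fn, fp, beta=1.0):
    """F-measure from confusion-matrix cells; beta=1 gives FM."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention from the text: FM = 0 when TP = 0 (denominator may be 0 too).
    return (1 + b2) * tp / denom if denom else 0.0


# Illustrative cell values, not taken from the paper: PPV = 2/3, TPR = 0.4.
print(f_beta(tp=8, fn=12, fp=4))            # harmonic mean of PPV and TPR
print(f_beta(tp=8, fn=12, fp=4, beta=2.0))  # weighs TPR more than PPV
print(f_beta(tp=0, fn=5, fp=5))             # no true positives
```

Note that TN never appears in the computation, whatever the value of β.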

The Definition of φ
The purpose of φ, defined in Formula (5), is to assess the strength of the association between estimated and actual values in a confusion matrix.
$$\varphi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{EN \cdot EP \cdot AN \cdot AP}} \tag{5}$$
φ is not defined when EN · EP · AN · AP = 0, i.e., when at least an entire row or column of the confusion matrix is null. As Chicco and Jurman (2020) observe, if exactly one among AP, AN, EP, or EN is null, i.e., when exactly one column or row of the confusion matrix is null, the value of φ can be set to 0. When a row and a column are null, φ can be set to 1 if the only nonnull cell in the confusion matrix is TN = n or TP = n (perfect classification) and instead set to −1 if the only nonnull cell in the confusion matrix is FN = n or FP = n (total misclassification). At any rate, these cases are quite peculiar, as Yao and Shepperd observe too (Yao and Shepperd 2021), since they apply to datasets composed exclusively of elements belonging to one class.
φ is in the [−1, 1] range. Specifically, φ = 1 if and only if FP = FN = 0, i.e., in the perfect classification case. φ = 0 is the expected (i.e., average) performance of the random classifier that estimates a module positive with a probability equal to ρ, i.e., with the same probability as that of selecting a positive module totally at random from the set of modules. φ = −1 if and only if TP = TN = 0, i.e., in the perfect misclassification case, that is, with a "perverse" classifier. It is well-known that perfect misclassification can be transformed into perfect classification by simply inverting the estimations, which means swapping the rows, in terms of confusion matrices. More generally, when φ < 0, a classifier appears to be better at misclassifying modules than at classifying them correctly, so one can invert the estimations to obtain a classifier that instead is better at classifying modules correctly.
Thus, φ is an effect size measure, which quantifies how far the estimation given by a classifier is from being random, i.e., from the random classifier that has φ = 0. A commonly cited proposal (Cohen 1988) uses φ = 0.1, φ = 0.3, and φ = 0.5 respectively to denote a weak, a medium, and a large effect size. φ is also related to the $\chi^2$ statistic, since $|\varphi| = \sqrt{\chi^2 / n}$.
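A minimal sketch of φ follows, assuming the usual definition (TP·TN − FP·FN) / √(EN·EP·AN·AP) and the conventions for degenerate confusion matrices reported above. The function name `phi` and the cell values are hypothetical; the χ² value is obtained from the relation |φ| = √(χ²/n), not from a statistics library.

```python
import math


def phi(tp, fn, fp, tn):
    """phi (Matthews Correlation Coefficient) with degenerate-case conventions."""
    ap, an = tp + fn, fp + tn
    ep, en = tp + fp, tn + fn
    denom = ap * an * ep * en
    if denom == 0:
        n = ap + an
        if tp == n or tn == n:   # only nonnull cell is TP or TN
            return 1.0
        if fp == n or fn == n:   # only nonnull cell is FP or FN
            return -1.0
        return 0.0               # exactly one null row or column
    return (tp * tn - fp * fn) / math.sqrt(denom)


# Illustrative cell values, not taken from the paper.
tp, fn, fp, tn = 8, 12, 4, 76
value = phi(tp, fn, fp, tn)
n = tp + fn + fp + tn
chi2 = n * value ** 2            # from |phi| = sqrt(chi2 / n)
print(round(value, 3), round(chi2, 2))
print(phi(0, 10, 10, 0))         # perfect misclassification
```

Unlike FM, the result changes when TN changes, and its sign distinguishes concordant from discordant associations.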

A Comparative Assessment of FM and φ
In Sections 3.1 and 3.2, we report and elaborate on some of the issues that have been found about FM in the past, and add another possible issue in Section 3.3. We show whether and how φ can address them. In Section 3.4, we introduce and discuss a possible advantage of FM, which seems to be more sensitive to false negatives than to false positives. We summarize the results of our comparative assessment in Section 3.5.

FM Does not Take into Account TN, while φ Does
Formula (3) clearly shows that FM does not depend on TN. So, let us consider the two confusion matrices $CM_a$ and $CM_b$ shown below, which concern different datasets. It is $FM_a = FM_b \simeq 0.64$, while $\varphi_a \simeq 0.5$: $\varphi_b > \varphi_a$ because φ accounts for the fact that in $CM_b$ 400 more true negatives are correctly classified than in $CM_a$.
Take now a third confusion matrix $CM_c$, concerning a third dataset.
It is $FM_c \simeq 0.68$, thus, according to FM, one should conclude that the performance represented by $CM_c$ is slightly better than those represented in $CM_a$ and $CM_b$. However, though one more actual positive is classified correctly in $CM_c$, when it comes to classifying actual negatives $CM_c$ performs quite poorly. $\varphi_c \simeq -0.07$ appears to account for the overall performance represented by $CM_c$ more adequately.
Based on these examples and on Formula (3), it appears that FM is not always an adequate metric for quantifying the overall performance of a classifier, since it does not use all available information about the classification results. This is one of the main criticisms made to FM by previous studies (Powers 2011;Yao and Shepperd 2021).
Formula (5) instead shows that φ takes into account all of the cells of a confusion matrix, so φ is a better performance metric for the overall performance of a classifier. The above examples with $CM_a$, $CM_b$, and $CM_c$ show typical cases in which φ agrees with intuition more than FM does.
Note, however, that $CM_a$, $CM_b$, and $CM_c$ show results related to three different datasets. When it comes to comparing the performance of classifiers on the same dataset, things are a bit different. Let us rewrite FM as
$$FM = \frac{2\,TP}{n + TP - TN} = \frac{2\,(AP - FN)}{2\,AP + FP - FN} \tag{6}$$
where the first fraction contains only TP and TN, which are related to correct classifications, while the second only FP and FN, which are related to misclassifications. AP and AN, and therefore n = AP + AN, are fixed when comparing classifiers on the same dataset. So, knowing the values of two cells in different columns equates to knowing the entire confusion matrix. Thus, FM provides an overall performance evaluation of a classifier that can be used when comparing classifiers applied to the same dataset.
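The two rewritings of FM can be checked numerically. The sketch below assumes the forms 2·TP / (n + TP − TN), which uses only correct classifications, and 2·(AP − FN) / (2·AP + FP − FN), which uses only misclassifications; both follow from FM = 2·TP / (2·TP + FP + FN) once AP and n are fixed. Function names and cell values are hypothetical.

```python
def fm_from_cells(tp, fn, fp):
    """FM = 2 TP / (2 TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


def fm_correct_only(tp, tn, n):
    """Same value, written with the correctly classified cells and n."""
    return 2 * tp / (n + tp - tn)


def fm_errors_only(fn, fp, ap):
    """Same value, written with the misclassified cells and AP."""
    return 2 * (ap - fn) / (2 * ap + fp - fn)


# Illustrative cell values, not taken from the paper.
tp, fn, fp, tn = 8, 12, 4, 76
n, ap = tp + fn + fp + tn, tp + fn
print(fm_from_cells(tp, fn, fp), fm_correct_only(tp, tn, n), fm_errors_only(fn, fp, ap))
```

On a fixed dataset, TN is therefore implicitly present in FM through n, which is why FM can still rank classifiers applied to the same dataset.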

FM Does not Allow for Comparisons with Baseline Classifiers, while φ Does
The assessment of a model, like a classifier, is typically done by comparing its performance against the performance of a less complex baseline model. A classifier estimates the class of modules by taking into account information on their characteristics. For instance, a classifier may estimate a software module defective or not defective based on the module's number of Lines Of Code (LOC). However, how much performance do we gain by using that classifier, instead of using random estimation, i.e., a baseline classifier that does not require any knowledge of the modules? Recall that a random classifier behaves as described in Section 2.2, i.e., it estimates each module positive with probability ρ.
The expected values of TPR and PPV for the random classifier (i.e., the mean values obtained from a large number of random estimations) are both equal to ρ (Morasca and Lavazza 2020): by using these values in (3), we obtain $FM = \rho$ as well, so, when evaluating a classifier, we should compare its FM against ρ. Thus, the knowledge of FM by itself is not sufficient to tell whether a classifier performs better than even random estimation (Yao and Shepperd 2021). An example is given by confusion matrix $CM_c$ above: it is $FM_c \simeq 0.68 < \rho_c = \frac{AP}{n} = \frac{90}{105} \simeq 0.86$, thus the performance represented by confusion matrix $CM_c$ is worse than random, on average.
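The comparison against the random baseline can be sketched as follows. The code plugs the expected values PPV = TPR = ρ into the harmonic mean, which gives FM = ρ, and then compares the FM reported in the text for the third confusion matrix (about 0.68) with its prevalence 90/105. The function names are hypothetical.

```python
def fm_from_ppv_tpr(ppv, tpr):
    """Harmonic mean of PPV and TPR; 0 when both are 0."""
    return 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0


def beats_random_baseline(fm_value, rho):
    """True when FM exceeds the FM expected from the random classifier."""
    return fm_value > fm_from_ppv_tpr(rho, rho)


rho_c = 90 / 105                      # prevalence given in the text
print(round(fm_from_ppv_tpr(rho_c, rho_c), 2))
print(beats_random_baseline(0.68, rho_c))
```

The same FM of 0.68 would be well above the baseline on a dataset with, say, ρ = 0.1, which is why ρ has to be reported together with FM.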
On the other hand, φ, by its very definition, quantifies how far a classifier is from the random classifier. This may lead to what might seem to be a paradox, especially if compared to what happens with FM. Take $CM_d$ below. We have $FM_d \simeq 0.78$, which in general is considered a quite good result in terms of FM. Also, visual inspection of the confusion matrix shows that the classifier is able to correctly classify most of the majority class (the 90 actual positives), even though it does not fare well with the minority class (the 15 actual negatives).
So, one may suggest that the classifier performs well, since, at any rate, the minority class is only one-sixth of the majority class, so its contribution to performance should be much less than that of the majority class anyway. However, $\varphi_d \simeq -0.01$ casts some serious doubts on the performance level of the classifier, which appears to be rather poor. Which performance metric should we trust then? The answer is that the classifier is indeed good in itself at estimating the positive class, as indicated by the high value $FM_d \simeq 0.78$. However, even the random classifier would be better overall, since it has $FM_{random} \simeq 0.86$. This is also indicated by the value of $\varphi_d$, because φ is an effect size measure, which in this case shows that the classifier is quite close to the random classifier, just a bit worse. Since we can always perform random estimation without having to go through the pains of building and validating a classifier, we must conclude that the classifier at hand is "good," but nowhere near good enough.
To further explain why the comparison with a baseline classifier is a fundamental point in the evaluation of a classifier beyond Empirical Software Engineering, consider that using a classifier on a dataset is similar to administering a treatment to a set of subjects: in a way, it is like giving a treatment to a dataset. A classifier is worth using if it has greater beneficial effects than using another existing classifier or doing nothing, i.e., relying on randomness.
Likewise, it is worthwhile giving a certain treatment to subjects only if it is better than some existing treatment or better than providing no treatment.
Suppose therefore that we need to evaluate the effectiveness of a medication for a disease. Suppose that, in a clinical trial, 96 out of 100 diseased patients that take the medication fully recover, i.e., the treatment achieves a 96% recovery rate. This rate looks quite high, especially if the disease is a lethal one. However, by itself, this seemingly high value does not tell us much about the real effectiveness. Suppose that the rate of spontaneous recovery from the disease, i.e., without taking any medications, is 97%. Then, one may argue that the medication actually worsens the chances of recovery. If, instead, the spontaneous recovery rate was 54%, for instance, then the medication would appear to be very effective. Thus, when evaluating the performance of some treatment (i.e., classifier, in our case) we always need to compare its effect to those of some baseline treatment (i.e., baseline classifier, in our case).
Prediction in Empirical Software Engineering refers to totally different domains than medical treatments, but the consequences of misjudging classifier effectiveness can be quite serious too. Using a performance metric that leads to selecting a classifier that estimates too many false negatives results in, say, having too many vulnerability attacks in software security applications or, in software quality assurance, having too many faulty modules released to the final users. If the selected classifier estimates too many false positives, precious resources are wasted by unnecessarily maintaining software to make it supposedly more secure or less faulty. Note that these unnecessary software modifications may even lead to introducing more vulnerabilities or defects.

FM is Nonnegative, while φ Can Be Negative
However strange it may seem, another issue with FM is that $FM \ge 0$. Suppose that $FM \simeq 0$: does this really mean that there is no association between estimated and actual values? Though this is the usual interpretation, $FM \simeq 0$ should instead be interpreted as a lack of a concordant association between the estimated and the actual values, but not as a lack of a discordant one.
Take the confusion matrices $CM_e$ and $CM_f$ above, which differ only by TN. We have $FM_e = FM_f \simeq 0.167$, i.e., a fairly small value for FM. However, it is apparent that $CM_e$ represents a much better situation than $CM_f$, in terms of correct module classification. It is also apparent that in $CM_f$ more modules (50) are misclassified than correctly classified (just 10). Detecting discordant associations can be useful, since it is possible to obtain concordant associations by inverting the classifications.
With φ, we have $\varphi_e \simeq -0.079$, which indicates a close-to-random classification, and $\varphi_f \simeq -0.556$, which indicates a rather strong discordant association. If we swap the rows of $CM_f$, we obtain a new confusion matrix $CM'_f$ having $FM'_f \simeq 0.889$, so one may conclude that it is possible to use FM to detect discordant classifications anyway. However, swapping the rows of the confusion matrix basically equates to using a different performance metric on the original confusion matrix, defined as $\frac{2\,FN}{AP + EN}$. This would defeat the purpose of having a single performance metric to evaluate the overall performance of a classifier, while φ is actually able to detect both concordant and discordant associations.
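The figures quoted for the two confusion matrices can be reproduced with a short script. The cells of the two matrices are not reproduced in the text here; the values below are an assumption, inferred so that they match the reported numbers (FM ≈ 0.167 for both, φ ≈ −0.079 and φ ≈ −0.556, 50 misclassified and 10 correctly classified modules in the second matrix, and FM ≈ 0.889 after swapping the rows). Names such as `CM_E`, `CM_F`, and `swap_rows` are hypothetical.

```python
import math

# Cells as (TP, FN, FP, TN); inferred values, not copied from the paper.
CM_E = (5, 40, 10, 50)
CM_F = (5, 40, 10, 5)


def fm(tp, fn, fp, tn):
    """FM = 2 TP / (2 TP + FP + FN); TN is ignored by definition."""
    return 2 * tp / (2 * tp + fp + fn)


def phi(tp, fn, fp, tn):
    """phi, assuming no null row or column."""
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (fp + tn) * (tp + fp) * (fn + tn))


def swap_rows(tp, fn, fp, tn):
    """Invert every estimation: TP <-> FN and FP <-> TN."""
    return fn, tp, tn, fp


for name, cm in (("CM_e", CM_E), ("CM_f", CM_F)):
    print(name, round(fm(*cm), 3), round(phi(*cm), 3))

inverted = swap_rows(*CM_F)
print("inverted CM_f", round(fm(*inverted), 3), round(phi(*inverted), 3))
```

Inverting the estimations only flips the sign of φ, whereas FM changes from a value near 0 to a value near 0.9, so FM alone gives no hint that the inversion was worth trying.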

FM Gives Different Relative Importance to False Positives and False Negatives, while φ does not
In practical applications, the cost of a false positive may be quite different from the cost of a false negative. For instance, suppose that a defective software module is not detected during the Verification & Validation phase in the development of a safety-critical application. That module-a false negative-is then released to the users as a part of the final product. The cost due to the damages it can do during operational use is typically much higher than the unnecessary Verification & Validation cost incurred by a false positive module.
FM is symmetrical with respect to PPV and TPR, but not with respect to FN and FP. Take the fraction in the rightmost member of Formula (6). Swapping FN and FP produces another fraction that is equal to the one in Formula (6) if and only if FN = FP. Thus, the same extent of variation in FN and FP does not have the same impact on FM. We show in Appendix A that increasing or decreasing FN by some amount has more impact on FM than does increasing or decreasing FP by the same amount.
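A small numerical experiment illustrates the asymmetry. The sketch assumes a fixed dataset, so that AP is constant and each additional false negative removes one true positive, and it uses FM written in terms of misclassifications only, 2·(AP − FN) / (2·AP + FP − FN). The function name and the starting values are hypothetical.

```python
def fm_errors_only(fn, fp, ap):
    """FM as a function of FN and FP for a dataset with AP actual positives."""
    return 2 * (ap - fn) / (2 * ap + fp - fn)


# Illustrative starting point, not taken from the paper.
ap, fn, fp, delta = 70, 20, 20, 5

base = fm_errors_only(fn, fp, ap)
drop_fn = base - fm_errors_only(fn + delta, fp, ap)  # 5 more false negatives
drop_fp = base - fm_errors_only(fn, fp + delta, ap)  # 5 more false positives

print(round(base, 4), round(drop_fn, 4), round(drop_fp, 4))
print(drop_fn > drop_fp)
```

With these values, the same increase costs FM roughly twice as much when it affects FN as when it affects FP.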
To provide even more relative importance to false negatives, one may set β of Formula (4) to specific values. Increasing β means giving more importance to TPR with respect to PPV. However, as we already noted, the vast majority of studies in the literature use the definition of FM of Formula (3).
φ is perfectly symmetrical with respect to FN and FP, whose variations therefore have the same impact on the value of φ.

Summary of Evaluations
FM does not fully capture the intrinsic characteristics of the underlying dataset, being defined as the harmonic mean of PPV and TPR (see Appendix C for some considerations on the usage of the harmonic mean). Therefore, FM quantifies an aspect of a classifier's performance related to the positive class that may be used only for comparing classifiers for the same dataset. FM cannot detect the existence and the extent of discordant associations between estimated and actual values either. Since it does not take into account ρ, which is also the rate with which a random classifier would successfully detect positives, FM cannot tell whether a classifier performs better than the random classifier, which can be taken as an inexpensive, default classifier that a decision-maker can always fall back on. The only advantage that FM seems to have over φ is that FM is more affected by variations of FN than of FP. Thus, software managers who rely on FM as a performance metric may be encouraged to reduce false negatives more than false positives.
In this section, we show how φ too can be defined as a function of PPV and TPR, like FM, so we can point out the structural differences between FM and φ, along with their consequences.
In the following Section 5, we directly investigate the relationship between FM and φ and how it can be influenced by ρ and σ.
From Formula (5), via a few mathematical computations (reported in Appendix B), we have
$$\varphi = (PPV - \rho)\,\sqrt{\frac{TPR}{(1 - \rho)\,(PPV - \rho \cdot TPR)}} \tag{7}$$
Unlike FM, φ is not a symmetrical function of PPV and TPR, so equal variations in PPV and TPR have different effects on φ.
Formula (7) shows that, in addition to PPV and TPR, φ depends on ρ, which is an intrinsic characteristic of the dataset. Thus, for the same values of PPV and TPR, we can obtain different values of φ, depending on the imbalance degree of the dataset. However, as we show in Appendix B, given ρ, it is
$$\rho \le \frac{PPV}{PPV + TPR - PPV \cdot TPR} \tag{8}$$
for $PPV \ne 0$ and $TPR \ne 0$. For completeness, Appendix B also shows what happens in the special case in which PPV = TPR = TP = 0. We also show in Appendix B that, for given values of PPV and TPR, φ is a monotonically decreasing function of ρ, which tends to $\sqrt{PPV \cdot TPR}$ when ρ tends to 0 and takes value $\frac{PPV + TPR - PPV \cdot TPR - 1}{\sqrt{(1 - PPV)(1 - TPR)}} = -\sqrt{(1 - PPV)(1 - TPR)}$ when ρ reaches the upper bound in (8).
Figure 1 shows how φ varies depending on the value of ρ in three cases, depending on the values of PPV and TPR that satisfy (8). Note that FM ≈ 0.5 for the three pairs of values of PPV and TPR used for the three curves.
In the special case of an unbiased classifier, i.e., when EP = AP, it is PPV = TPR and FP = FN, so the confusion matrix is symmetric. In this case, as previously shown (see, e.g., Delgado and Tibau (2019)), φ coincides with Cohen's kappa (Cohen 1960), as follows
$$\varphi = \kappa = \frac{PPV - \rho}{1 - \rho} \tag{9}$$
Cohen's kappa is a measure of the extent to which two classifiers (the actual classifier and the estimated classifier, in our case) agree when classifying n items into a number of different categories (two categories, in our case).
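The dependence of φ on PPV, TPR, and ρ can be explored with the sketch below. It assumes the closed form φ = (PPV − ρ)·√(TPR / ((1 − ρ)(PPV − ρ·TPR))), obtained by substituting TP = ρ·TPR·n and EP = TP/PPV into the cell-based definition of φ, and the feasibility bound ρ ≤ PPV / (PPV + TPR − PPV·TPR), which expresses FP ≤ AN. The function name is hypothetical, and the bound is treated as strict in the code only to keep the denominator nonzero.

```python
import math


def phi_from_ppv_tpr(ppv, tpr, rho):
    """phi as a function of PPV, TPR, and prevalence rho (sketch)."""
    rho_max = ppv / (ppv + tpr - ppv * tpr)
    if not 0 < rho < rho_max:
        raise ValueError("rho outside the feasible range for these PPV and TPR")
    return (ppv - rho) * math.sqrt(tpr / ((1 - rho) * (ppv - rho * tpr)))


# Same PPV and TPR (so the same FM = 0.5), different prevalence.
for rho in (0.01, 0.2, 0.5, 0.6):
    print(rho, round(phi_from_ppv_tpr(0.5, 0.5, rho), 3))
```

With PPV = TPR = 0.5, φ starts close to √(PPV·TPR) = 0.5 for small ρ, reaches 0 at ρ = PPV, and becomes negative beyond that, while FM stays at 0.5 throughout.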
It can be shown that Formula (9) as a function of ρ represents a rectangular hyperbola with asymptotes φ = 1 and ρ = 1. Figure 3 shows how φ varies depending on the value of ρ ∈ [0, 1] in three cases, i.e., for high (green line), medium (red line), and low (blue line) values of PPV.
Formula (9) also shows that φ, unlike FM, is not a central tendency indicator for PPV and TPR. Specifically, φ does not satisfy Cauchy's property (Cauchy 1821), according to which a central tendency indicator of a set of values must always be between the minimum and the maximum value in the set. In our case, for Cauchy's property to be satisfied, φ would have to be between PPV and TPR. Since PPV and TPR are equal for an unbiased classifier, this would mean having φ = PPV as a result of Formula (9), but this is not the case. FM, instead, satisfies Cauchy's property.
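The unbiased case can be illustrated with a symmetric confusion matrix. The sketch assumes that, when FP = FN, φ equals (PPV − ρ)/(1 − ρ) and coincides with Cohen's kappa, computed here in its usual form (p_o − p_e)/(1 − p_e). The function name `unbiased_summary` and the cell values are hypothetical.

```python
def unbiased_summary(tp, off_diag, tn):
    """Metrics for an unbiased classifier, i.e., FP = FN = off_diag."""
    n = tp + 2 * off_diag + tn
    ap, an = tp + off_diag, tn + off_diag   # AP = EP and AN = EN here
    rho = ap / n
    ppv = tpr = tp / ap
    fm = 2 * ppv * tpr / (ppv + tpr)
    phi = (tp * tn - off_diag ** 2) / (ap * an)  # sqrt(AP*AN*EP*EN) = AP*AN
    p_o = (tp + tn) / n                          # observed agreement
    p_e = rho ** 2 + (1 - rho) ** 2              # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    closed_form = (ppv - rho) / (1 - rho)
    return {"rho": rho, "ppv": ppv, "fm": fm, "phi": phi,
            "kappa": kappa, "closed_form": closed_form}


# Illustrative cell values, not taken from the paper.
for key, value in unbiased_summary(tp=30, off_diag=10, tn=50).items():
    print(key, round(value, 4))
```

Here FM equals PPV = TPR = 0.75, as Cauchy's property requires of a central tendency indicator, while φ ≈ 0.583 lies below both.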

The Relationships between φ and FM
Before presenting our mathematical study (described in Sections 5.2-5.5), we show some empirical evidence and simulation results about the relationship between φ and FM in Section 5.1.

Empirical Observations and Simulation Results
The scatterplot in Fig. 4 shows the values of φ and FM for all of the 837 classifiers that we obtained in our empirical study (more details in Section 6). The scatterplot, which shows only the part of the FM × φ plane in which we obtained pairs of values (FM, φ) for our classifiers, is consistent with the findings by other researchers, like Yao and Shepperd (see Figure 4 in Yao and Shepperd (2021)). It shows that the vast majority of the points are below the bisector. Also, FM and φ often provide discordant indications; while a high value of φ implies a high value of FM, the converse is not true: when φ > 0.5, it is also FM > 0.5, but when FM is close to 0.8, φ can be below 0.2 as well as above 0.6.
In a simulation analysis, Chicco and Jurman (2020) computed FM and φ for all confusion matrices (hence, for all ρ) with n = 500 and showed the results in a scatterplot. In Fig. 5, we show a similar scatterplot for illustration purposes, where we choose n = 100 because the dots corresponding to the confusion matrices are already dense enough that increasing the value of n would not change the graphical aspect of the figure. Figure 5 shows that, for a given value of FM, there is a wide range of possible values of φ, in general.
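A simulation of this kind can be sketched in a few lines of Python. The code below is only an illustrative reconstruction under our own assumptions (it is not the code used by Chicco and Jurman or for Fig. 5): it enumerates every confusion matrix with n items, skips those for which φ is undefined, and collects the (FM, φ, ρ) triples. The function name fm_phi_pairs is hypothetical.

```python
import math


def fm_phi_pairs(n: int):
    """Return (FM, phi, rho) for every confusion matrix with n items
    in which phi is defined (no empty row or column)."""
    pairs = []
    for tp in range(n + 1):
        for fn in range(n - tp + 1):
            for fp in range(n - tp - fn + 1):
                tn = n - tp - fn - fp
                ap, an = tp + fn, fp + tn  # actual positives / negatives
                ep, en = tp + fp, fn + tn  # estimated positives / negatives
                if 0 in (ap, an, ep, en):
                    continue  # phi is undefined for degenerate matrices
                fm = 2.0 * tp / (ap + ep)
                phi = (tp * tn - fp * fn) / math.sqrt(ap * an * ep * en)
                pairs.append((fm, phi, ap / n))
    return pairs


# Spread of phi among the matrices whose FM is close to 0.5 (small n for speed).
pairs = fm_phi_pairs(20)
near_half = [phi for fm, phi, _ in pairs if abs(fm - 0.5) < 0.01]
print(len(pairs), min(near_half), max(near_half))
```

Plotting the first two components of each triple for n = 100 should give a cloud similar in shape to the one described above; coloring the points by the third component makes the role of ρ visible.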
Sections 5.2-5.5 mathematically explain the scatterplots in Figs. 4 and 5. Specifically, in Section 5.2, we show how the relationship between FM and φ is influenced by ρ and σ. In Section 5.3, we derive the upper and lower bounds of φ when FM is known, based on a

The Mathematical Relationship Between φ and FM
The mathematical relationship between FM and φ can be expressed as in Formula (10) (the derivation can be found in Appendix D; see Formula (38))

$$\phi = \frac{\frac{FM}{2}(\rho + \sigma) - \rho\sigma}{\sqrt{\rho(1-\rho)\,\sigma(1-\sigma)}} \tag{10}$$

which shows how φ depends on FM, ρ, and σ. In the special case of an unbiased classifier, i.e., when σ = ρ, Formula (10) becomes

$$\phi = \frac{FM - \rho}{1 - \rho} \tag{11}$$

consistently with Formula (9), since PPV = TPR = FM for unbiased classifiers. This relationship between FM and φ holds for all values of FM provided that ρ ≤ 1/2. When, instead, ρ > 1/2, the relationship holds only for some values of FM, because it must be φ = (FM − ρ)/(1 − ρ) ≥ −1, so FM ≥ 2ρ − 1. For instance, when ρ = 0.75, it must be FM ≥ 1/2. This effect can also be seen in Fig. 3, in which the three lines also show φ vs. ρ for different values of FM = PPV = TPR in the unbiased case. When ρ = 0.75, a line either coincides with the red line or is above it, i.e., it has a value of FM = PPV = TPR ≥ 0.5.
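The dependence of φ on FM, ρ, and σ can be checked numerically. The sketch below assumes the form φ = (FM(ρ + σ)/2 − ρσ)/√(ρ(1 − ρ)σ(1 − σ)), which follows from the definition of φ once TP/n is written as FM(ρ + σ)/2; the function names and the confusion matrix used in the example are hypothetical.

```python
import math


def phi_from_fm(fm: float, rho: float, sigma: float) -> float:
    """phi as a function of FM, actual prevalence rho and estimated prevalence sigma."""
    numerator = fm * (rho + sigma) / 2.0 - rho * sigma
    return numerator / math.sqrt(rho * (1.0 - rho) * sigma * (1.0 - sigma))


def phi_from_cells(tp: int, fp: int, fn: int, tn: int) -> float:
    """phi computed directly from the four cells of a confusion matrix."""
    ap, an, ep, en = tp + fn, fp + tn, tp + fp, fn + tn
    return (tp * tn - fp * fn) / math.sqrt(ap * an * ep * en)


# Hypothetical confusion matrix with n = 43 items.
tp, fp, fn, tn = 12, 5, 4, 22
n = tp + fp + fn + tn
fm = 2.0 * tp / (2 * tp + fp + fn)
rho, sigma = (tp + fn) / n, (tp + fp) / n

# The two computations agree up to floating-point error.
print(phi_from_fm(fm, rho, sigma), phi_from_cells(tp, fp, fn, tn))

# Unbiased case (sigma = rho): the value reduces to (FM - rho) / (1 - rho).
print(phi_from_fm(0.6, 0.3, 0.3), (0.6 - 0.3) / (1 - 0.3))
```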

Variation Intervals of φ Depending on FM for Given Values of ρ
Formula (10) shows that, given a dataset (hence, given a value of ρ), the relationship between FM and φ is influenced by σ. To evaluate how tightly FM and φ are related to each other on a given dataset, it is useful to assess the extent of such influence, that is, how much φ can vary, depending on σ, for any given value of FM. Appendix E shows that φ belongs to an interval [φ_min(FM; ρ), φ_max(FM; ρ)], where function φ_min(FM; ρ) depends on the value of FM (for a given value of ρ, taken as a parameter), as shown in Formulas (12) and (13), while the function that defines φ_max(FM; ρ) is the same for all values of FM

$$\phi_{min}(FM;\rho) = -\sqrt{\frac{(1-\rho)\,\bigl(2\rho - (1+\rho)FM\bigr)}{\rho\,\bigl(2(1-\rho) + \rho\,FM\bigr)}} \quad \text{for } FM \le \frac{2\rho}{1+\rho} \tag{12}$$

$$\phi_{min}(FM;\rho) = \sqrt{\frac{FM\,\bigl((1+\rho)FM - 2\rho\bigr)}{1-\rho}} \quad \text{for } FM \ge \frac{2\rho}{1+\rho} \tag{13}$$

$$\phi_{max}(FM;\rho) = \sqrt{\frac{(1-\rho)\,FM}{2 - (1+\rho)FM}} \tag{14}$$

It can be shown that φ_min(FM; ρ) is a continuous function and that it is zero for FM = 2ρ/(1+ρ).

The plots graphically show how the width of the variation intervals depends on ρ and, therefore, imbalance. The region delimited by φ_min(FM; ρ) and φ_max(FM; ρ) is generally quite thin for small values of ρ and becomes thicker and thicker the larger ρ becomes. In other words, the uncertainty with which the value of φ can be known for a value of FM increases with ρ. Thus, the higher ρ, the easier it is to find both good and bad values of φ for any given value of FM. For instance, with ρ = 0.5, when FM = 0.4, φ may take a value between φ_min(0.4; 0.5) ≈ −0.5773 and φ_max(0.4; 0.5) ≈ 0.378.
The plots also show that, when FM ≥ 2ρ/(1+ρ) and ρ is small, φ_min(FM; ρ) approximates very well the straight line of Formula (11) that describes the relationship between FM and φ for unbiased classifiers. Note that, the smaller ρ, the larger is the interval FM ∈ [2ρ/(1+ρ), 1] in which this approximation can be used. For FM ∈ [2ρ/(1+ρ), 1], the difference between the value of φ of Formula (11) and φ_min(FM; ρ) is maximum exactly when FM = 2ρ/(1+ρ). Since φ_min(2ρ/(1+ρ); ρ) = 0, this maximum difference turns out to be equal to ρ/(1+ρ).
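The variation interval of φ for given FM and ρ can be computed with the following Python sketch. The closed forms are our own evaluation of φ at the extreme admissible values of σ (FP = 0 for the maximum; TN = 0 or the interior stationary point for the minimum), so they are an assumption and not necessarily the exact expressions printed in the paper; they do reproduce the numerical values quoted in the text. The function names are hypothetical.

```python
import math


def phi_max(fm: float, rho: float) -> float:
    """Largest phi compatible with FM and rho (reached when FP = 0)."""
    return math.sqrt((1.0 - rho) * fm / (2.0 - (1.0 + rho) * fm))


def phi_min(fm: float, rho: float) -> float:
    """Smallest phi compatible with FM and rho."""
    knee = 2.0 * rho / (1.0 + rho)  # value of FM at which phi_min is zero
    if fm <= knee:
        # Minimum reached when TN = 0; max() guards against rounding at the knee.
        inner = (1.0 - rho) * (2.0 * rho - (1.0 + rho) * fm)
        return -math.sqrt(max(0.0, inner) / (rho * (2.0 * (1.0 - rho) + rho * fm)))
    # Minimum reached at an interior stationary point of phi as a function of sigma.
    return math.sqrt(fm * ((1.0 + rho) * fm - 2.0 * rho) / (1.0 - rho))


# Intervals for a balanced dataset and for a strongly imbalanced one.
for fm, rho in ((0.4, 0.5), (0.65, 0.05), (0.70, 0.05), (0.71, 0.05)):
    print(fm, rho, round(phi_min(fm, rho), 4), round(phi_max(fm, rho), 4))
```

With ρ = 0.05 the interval is only a few hundredths wide, while with ρ = 0.5 it spans almost a full unit, which is the thinning and thickening effect described above.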

Preserving Classifiers' Rankings with φ and FM
Let us now consider Also, the higher FM, the smaller the difference in FM needed to have complete separation between two variation intervals. With ρ = 0.05, suppose for instance that FM_d = 0.65, with variation interval φ ∈ [0.6313, 0.6846], and FM_e = 0.7, with variation interval φ ∈ [0.6840, 0.7250]. These two intervals still minimally overlap, but it suffices to take FM_f = 0.71, which has a variation interval φ ∈ [0.6946, 0.7333], to have complete separation between the variation intervals related to FM_d and FM_f.
2 There is no general consensus on acceptability thresholds for FM. As a matter of fact, a value FM = 0.3 would probably be considered too low for practical purposes, since generally it implies that either PPV or TPR is even below 0.3. However, we choose a value of FM low enough to show that the variation interval of φ is small for an interval of FM even larger than would be practically useful.

This inequality along with Formula (15) and Fig. 8 show that, given a value FM_a, the set of values of FM_b such that FM_a < FM_b for which we have φ_a < φ_b with certainty gets larger when ρ decreases. For instance, take FM_a = 0.6. When ρ = 0.05, any value of FM_b > 0.663 guarantees that φ_b > φ_a, while when ρ = 0.5, we need FM_b > 0.783, so that φ_b > φ_a. So, the ranking between two classifiers is more and more likely to be the same according to FM and to φ for smaller and smaller values of ρ.
Formula (15) is a special case of a more general formula that applies when using two classifiers cl a and cl b on two datasets with different actual prevalence values ρ a and ρ b , as discussed in Appendix F.
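The threshold on FM_b that guarantees the same ranking can be computed as sketched below. We assume that complete separation means φ_min(FM_b; ρ) > φ_max(FM_a; ρ) and we solve this condition for FM_b in closed form, using our own expressions for the two bounds (maximum reached with FP = 0, minimum taken on its nonnegative branch). The function names are hypothetical and the code is a sketch, not the authors' implementation of Formula (15).

```python
import math


def phi_max(fm: float, rho: float) -> float:
    """Largest phi compatible with FM and rho (assumed closed form)."""
    return math.sqrt((1.0 - rho) * fm / (2.0 - (1.0 + rho) * fm))


def min_fm_b(fm_a: float, rho: float) -> float:
    """Smallest FM_b whose phi interval lies entirely above that of FM_a.

    Solves FM_b * ((1 + rho) * FM_b - 2 * rho) / (1 - rho) = phi_max(FM_a)^2,
    i.e. the quadratic (1 + rho) x^2 - 2 rho x - (1 - rho) phi_max^2 = 0,
    and returns its positive root.
    """
    target = phi_max(fm_a, rho) ** 2
    return (rho + math.sqrt(rho ** 2 + (1.0 - rho ** 2) * target)) / (1.0 + rho)


for rho in (0.05, 0.5):
    print(rho, round(min_fm_b(0.6, rho), 3))
```

For FM_a = 0.6 the sketch returns thresholds close to 0.663 for ρ = 0.05 and 0.783 for ρ = 0.5, in line with the values given in the text.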

Variation Intervals of φ for All Values of ρ
We have so far supposed that ρ is given, so it is either known or it is assumed to be equal to some value. At any rate, we may want to delimit the variation interval [φ_min(FM), φ_max(FM)] of φ for a given value of FM for all possible values of ρ. Appendix G shows that

Consequences of the Analytical Relationship Between φ and FM
Sections 5.2-5.5 show that the relationship between φ and FM depends on ρ and σ. Providing FM by itself, without specifying ρ, provides at best an incomplete view of a classifier's performance. In some cases, notably when ρ is quite small, φ and FM basically provide the same information, e.g., they tend to rank classifiers in the same order. The value ρ may be quite small in some software-related application cases. For instance, the actual prevalence of vulnerable software modules in a software system is typically quite low. In other application areas, however, the range of ρ can be quite wide, e.g., in software defect prediction. For instance, the real-life datasets that we used in the empirical study of Section 6 have ρ ranging between 0.007 and 0.988 (see also Fig. 10). So, it is not possible to tell whether a classifier is an effective and useful one by simply looking at FM, and a statement like "classifier X achieves FM = 0.8, therefore it is very accurate," the likes of which have sometimes appeared in the literature, may be misleading. Therefore, the FM achieved by a classifier on a dataset should always be accompanied by the actual prevalence ρ of the dataset, also because ρ provides the F-measure value of a totally random classifier.
As a final observation, we note that our results confirm the validity of FM as a performance metric in the domain in which it was originally proposed, i.e., information retrieval. In fact, ρ is generally very small in information retrieval situations. Consider for instance the case of a search on google.scholar.com: you are typically interested in no more than a few hundred papers out of the 10^8 indexed papers, hence ρ is in the order of 10^−5.

In this section, we show how the analytical results of Section 5 can explain empirical data, which were obtained from real-life projects. Note that this empirical demonstration is not meant to confirm the validity or correctness of the relationship between φ and FM or of any mathematical results introduced in Section 5. Those results were derived analytically and therefore do not need any empirical validation.

The Datasets
We use two sets of datasets that are publicly available from the SEACRAFT repository (2017) and are reported among the most widely used (Singh et al. 2015). The first set was collected by Jureczko and Madeyski (2010) from real-life projects of different types and has been used in several defect prediction studies (e.g., Bowes et al. (2018) and Zhang et al. (2017)). The second set is the NASA Metrics Data Program defect dataset (Menzies and Di Stefano 2004); it has also been used in several defect prediction studies (e.g., Gray et al. (2011)). Therefore, in this section, a positive module is a defective one and a negative module a non-defective one. Some descriptive statistics concerning the datasets are given in Appendix I.
The data from the aforementioned datasets were used to derive models of module defectiveness. The technique used to derive defect predictors is immaterial for the purpose of this work; nonetheless, we provide some details in Appendix I.
Given the importance of ρ, the distribution of ρ in the considered datasets is illustrated by the boxplot in Fig. 10. This distribution contains a fairly large and varied set of values

Analysis of FM vs. φ with Different Values of ρ
First, we plot FM vs. φ when ρ is small. Figure 11 (which shows only the part of the FM × φ plane in which we obtained pairs of values (FM, φ)) illustrates the situation when ρ is close to 0.05, namely when ρ ∈ [0.025, 0.075]. We select the data that correspond to a range, rather than to a specific value of ρ, because in the latter case we would end up selecting data from a single dataset. In the [0.025, 0.075] range, we have 2 datasets and 44 classifiers.
The yellow line has equation (11) with ρ = 0.05 (the mean value for these datasets). FM and φ tend to provide practically equivalent information, regardless of σ, especially when FM > 0.4, and the relationship between FM and φ is well represented by (11). These results are practically relevant for application areas such as vulnerability prediction or defect prediction, in which low values of actual prevalence ρ can be found.
For higher values of ρ, the correspondence between FM and φ is less clear: Fig. 12 shows FM vs. φ when ρ ∈ [0.74, 0.77] (like with Figs. 4 and 11, we only show the relevant part of the FM × φ plane). We could not observe higher values of ρ, because no dataset with higher values of ρ supported enough classifiers. In the [0.74, 0.77] range, we have 16 classifiers from 3 datasets (xerces-1.4, with ρ = 0.743, pbeans1, with ρ = 0.769, and velocity 1.4 with ρ = 0.750). The value of ρ used to draw the yellow line having equation (11) is the mean of the three datasets' ρ, i.e., 0.754. Figure 12 shows that the different values of σ of different classifiers blur the relationship between FM and φ. Also, some predictors with high FM (close to 0.8) actually have a rather poor value of φ (around 0.2). Namely, we have a model that features FM = 0.77 and φ = 0.23. This is coherent with (12), (13), and (14), according to which φ is expected to be in the [−0.22, 0.54] range, when ρ = 0.754 and FM = 0.77.
In practice, it is apparent that, with high values of ρ, FM can be deceiving, showing high values that correspond to rather low φ.

On the Threats to Validity of the Demonstration Study
Even though our results are of an analytical nature, let us here explore the possible threats to validity that would derive from an empirical study like the demonstration study that we described.
Construct validity. Our demonstration study is about analyzing the relationships between two specific variables, i.e., φ and FM, so there is no real threat to construct validity. However, the debate is still open on whether these two variables adequately represent the overall performance of a classifier, even though FM has been heavily criticized in the last few years (Hernández-Orallo et al. 2012; Powers 2011; Sokolova and Lapalme 2009; Luque et al. 2019). Our analytical results provide researchers and practitioners with more information to make a more informed decision on which of the two they would like to use.
Internal Validity. Our goal was of a descriptive kind, i.e., we wanted to show how φ varied for any given value of FM depending on ρ. We did not look for possible associations or correlations between them. This would likely be the goal of an empirical study, by using statistical or machine-learning techniques. An association/correlation could be found or not depending on the specific sample. However, if ρ and σ were included as additional independent variables, a statistical or machine-learning technique may indicate perfect correlation in all cases, even though this is not guaranteed, because of the nature of the technique used. For instance, suppose that a linear model that combines FM, ρ, and σ were used to estimate φ. Since the relationship between these variables is not linear, even having all the information needed to determine φ would not suffice. At any rate, even if a technique were able to find a perfect correlation, this would simply be an empirical way of finding the relationship that we analytically describe in Formula (10). In addition, the empirical approach would only provide strong evidence about perfect correlation, but not certainty.
External Validity. Like with any empirical study, we took a sample of possible subjects (the software projects) and we showed results about it. Thus, in an empirical study like the ones we show here, it is very possible that the results have limited external validity. In our demonstration study, we took projects from different application domains, of different sizes and with different prevalence values, so the results may be applicable to a fairly large set of projects. However, the analytical results are applicable to all projects and are valid beyond software defect prediction and Software Engineering.

Revisiting Previous Empirical Studies
We here show how our analytical study can be used to reinterpret previous defect prediction analyses that reported results via FM (and possibly other performance metrics, like PPV and TPR), but did not report φ values. Li et al. used Binary Logistic Regression (BLR), Naive Bayes (NB), Decision Trees (DT), CoForest (CF), and ACoForest (ACF) to build defect classifiers. They performed within-project defect predictions, training predictors with data from a variable number of modules from the software system that was the object of predictions (Li et al. 2012).

Case 1
As an example of the outcomes of the study by Li et al., Table 2 (taken from Table 5 in Li et al. (2012)) shows the values of FM for the classifiers built with CF, BLR, NB, and DT. For each row, i.e., for each dataset, the highest FM value is in bold.
Li et al. conclude that "It can be easily observed from the table that CoForest achieves the best performance among the compared methods except that on SWT NaiveBayes performs the best." Of the considered datasets, all but one have ρ ≥ 0.3. Based on the considerations illustrated in Section 5.3, conclusions based on FM alone should not be trusted when ρ is that high. Since Li et al. reported the values of PPV and TPR in Tables 8-13 of their paper, we computed φ via Formula (7). The results are in Table 3, where the highest φ of each row is in bold. Table 3 shows clearly that 1) the performance of the CF classifier is unacceptably low when quantified via φ, with φ ≤ 0.23 for all datasets, 2) NB classifiers have the best φ for all datasets, and 3) NB classifiers are the only ones with acceptable performance, featuring φ ≥ 0.3 in 4 datasets out of 6.
This case demonstrates that considering FM without taking ρ into consideration is risky, as it can easily lead to untrustworthy conclusions.
However, for papers that published not only the values of FM, but also those of ρ and TPR or PPV, it is possible to derive reliable indications based on φ. For instance, the conclusions by Li et al. concerning ACF appear reliable, according to our computation of φ (not reported here).

Case 2
Deng et al. addressed cross-project defect prediction via a method that adopts a better abstract syntax tree node granularity and proposes and uses multi-kernel transfer convolutional neural networks (Deng et al. 2020).
They evaluated their approach on 110 cross-project defect prediction tasks formed by 11 open-source projects. As an example of their evaluations, we report in Table 4 the FM values from Table 7 in Deng et al. (2020) concerning the proposed method MK-TCNN-mix and the ρ of the projects used to evaluate method MK-TCNN-mix (from Table 3 in Deng et al. (2020)).
Based on the ρ and FM columns of Table 4, we computed the range to which φ must belong, via (12), (13), and (14). It turns out that only for the Xerces dataset and, to some   Table 4 were used by Deng et al. to draw conclusions concerning the proposed method's performance; however, as Table 4 clearly shows, no reliable conclusion (i.e., neither in favor of nor against the proposal by Deng et al.) is supported.
In conclusion, the paper by Deng et al. shows that FM, even with ρ, does not let readers appreciate the actual performance of classifiers. This is because FM is a reliable metric only for small values of ρ (see Fig. 6). Take for instance the result on project Velocity in Table 4: the FM obtained (0.519) could correspond to φ = 0, indicating that the proposed method MK-TCNN-mix is equivalent to random estimation. Unfortunately, Deng et al. do not provide in the paper additional data that can be used to compute φ more precisely than done in Table 4; hence, resolving the doubts concerning the validity of the conclusions by Deng et al. is not possible.

Case 3
In the paper "Slope-based fault-proneness thresholds for software engineering measures" (Morasca and Lavazza 2016), we also used FM to evaluate classifications. Specifically, we proposed a method to set thresholds for defect estimation based on the slope of Binary Logistic Regression (BLR) and Probit Regression (PBR) functions. The performance of the classifiers built with the proposed method was evaluated via an empirical study that used data from several projects from the SEACRAFT repository, including project berek, which has n = 43 software modules, of which AP = 16 are defective, so ρ = 16/43 ≈ 0.372. Performance was quantified and reported via FM and TPR. A value ρ ≈ 0.372 is too large to ensure that a reliable value of φ can be derived from FM alone. Nonetheless, we can derive the value of φ for all the models presented in Morasca and Lavazza (2016), by means of the following procedure:
1. Derive PPV from FM and TPR: PPV = FM · TPR/(2 · TPR − FM).
2. Compute TP = AP · TPR; then, compute TN as follows: TN = n − AP − TP · (1 − PPV)/PPV (these equations can be derived via simple transformations of the definitions of PPV and TPR in Formula (1)).
3. Compute FP, FN, EP and EN based on their definitions (Table 1).
4. Compute φ based on its definition (Formula (5)).
Noticeably, in an extended version of the paper, being aware of the limitations of FM, we reported φ in addition to FM (Morasca and Lavazza 2017). The values of φ that can be computed as shown above match exactly the values of φ that were computed based on the confusion matrices and were reported in Morasca and Lavazza (2017).
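The four-step procedure described above can be sketched in Python as follows. The function name phi_from_fm_tpr and the confusion matrix used in the example are hypothetical (the example does not use data from Morasca and Lavazza (2016)); the only inputs are n, AP, FM, and TPR, as in the procedure.

```python
import math


def phi_from_fm_tpr(n: int, ap: int, fm: float, tpr: float) -> float:
    """Recover phi from FM and TPR, given the dataset size n and AP."""
    # Step 1: PPV from FM and TPR (FM is the harmonic mean of the two).
    ppv = fm * tpr / (2.0 * tpr - fm)
    # Step 2: TP from TPR, then TN from the definitions of PPV and TPR.
    tp = ap * tpr
    tn = n - ap - tp * (1.0 - ppv) / ppv
    # Step 3: remaining cells and marginals.
    fn = ap - tp
    fp = (n - ap) - tn
    ep, en, an = tp + fp, fn + tn, n - ap
    # Step 4: phi from its definition.
    return (tp * tn - fp * fn) / math.sqrt(ep * en * ap * an)


# Hypothetical classifier on a dataset with n = 43 and AP = 16:
# TP = 12, FP = 5, FN = 4, TN = 22, hence TPR = 0.75 and FM = 24/33.
print(phi_from_fm_tpr(43, 16, 24 / 33, 0.75))
```

When FM and TPR are taken from a paper, they are usually rounded, so the recovered cells may be slightly off from integers; rounding TP and TN to the nearest integer before step 3 is a reasonable refinement.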

Related Work
Yao and Shepperd investigated the relationship between FM and φ (Yao and Shepperd 2020; 2021) from an empirical point of view. Via a systematic literature review, they identified 38 refereed primary studies in which FM and φ were used, to evaluate the effects of using FM instead of φ. In this sense, the work by Yao and Shepperd provides a solid background and a strong justification for our analytical study. In fact, they found that around 22% of all results found in the 38 primary studies would be reversed if φ is selected as a performance metric instead of FM. Based on the empirical results and a comparison of the properties of φ and FM, they strongly recommend that FM should no longer be used and that φ should be used instead.
In a simulation analysis, Chicco and Jurman (2020) computed FM and φ for all confusion matrices with n = 500 and showed the results in a scatterplot. A similar scatterplot is in Figure 5, which shows that, for a given value of FM, there is a wide range of possible values of φ, in general. Our work (see Section 5.5) provides the theoretical explanation for their simulation results. Chicco and Jurman also illustrated via representative numerical cases how the imbalance between the actual negatives and actual positives affects the ability of FM and φ to assess classifier performance. When ρ is quite low, they find that both FM and φ provide the same kind of evaluation. Our study (see Section 5.4) provides the general mathematical bases and explanations for their numerical results.
Bowes et al. (2012) observed that a variety of different performance metrics are used in empirical studies. Since these measures are not directly comparable, comparing different results is often difficult. Also, decision-makers may be interested in different measures than those reported in a specific study. Therefore, Bowes et al. proposed an approach to reconstruct a frequency confusion matrix based on the values of the performance measures provided in empirical studies. The proposal by Bowes et al. can therefore be used to compute FM or φ when an empirical study does not provide them, but provides instead a suitable set of metrics, as specified in Table 5 in Bowes et al. (2012).

Findings
Different performance metrics provide different evaluations and rankings for a set of classifiers. We focused on two performance metrics that have been extensively used in the Empirical Software Engineering literature, namely FM and φ.
Previous research found that imbalanced data can significantly affect performance metrics. However, to the best of our knowledge, this is the first time that the role of imbalance (via prevalence ρ) in the relationship between φ and FM is made explicit.
Our study provides the mathematical explanations for some phenomena that have been detected empirically or via simulations. Specifically, we show the mathematical relationships between FM and φ, and how they are influenced by the values of actual and estimated prevalence. Though FM and φ are based on different formulas, we show the conditions under which both FM and φ provide the same ranking between two classifiers. Specifically, it appears that FM and φ tend to agree on the ranking more when the actual prevalence ρ is low, as is the case for several datasets used for software defect prediction. In addition, we review existing analyses about the validity and usefulness of FM and φ, and add two more observations. The mathematical relationships between FM and φ can be used also to get a more rigorous and sound interpretation of the results published in papers that used FM alone.

Recommendations
Based on the considerations reported through the paper, we can formulate a few recommendations about the performance metrics to be used to evaluate classifiers.
It is not advisable to evaluate the performance of a classifier based exclusively on FM. Also, if using FM, the value of ρ needs to be specified, at least to know if a classifier performs better than the random one. Unfortunately, this practice has not been followed in many cases, which led to many questionable evaluations (Yao and Shepperd 2021).
At any rate, we recommend that, even when the value of ρ is reported, FM should not be used without also providing the value of φ, to have at least a more complete evaluation of a classifier. For instance, take the data in Table 4: for dataset Synapse, we have FM = 0.516 and ρ = 0.336. FM is sufficiently larger than ρ to suggest that the classifier performs better than the random one. However, our mathematical results show that in this case φ is between 0.17 (which would be rather bad) and 0.51 (which would be fairly good). Thus, in this case, the knowledge of FM and ρ is not sufficient to establish how good the binary classifier is.
Unlike FM, φ takes into account all of the cells in a confusion matrix. Thus, φ seems to be more adequate to be used as an overall metric for the performance of a classifier. It is true that FM and φ are likely to provide the same ranking when the actual prevalence is small. However, one may as well use φ without using FM.
In addition, it would be useful that, whenever possible, authors of scientific articles provide the entire confusion matrices for the classifiers. Based on the confusion matrices, any performance metric of interest to decision-makers and researchers can be computed.

Dealing with Other Performance Metrics
In this paper, we investigated FM and φ. As already mentioned, many performance metrics have been proposed. Of these, several are used in common practice. Therefore, it could be useful to explore the relationship among these metrics (including their relationships with FM and φ).
To this end, we note that in a previous paper (Morasca and Lavazza 2020) we provided the mathematical basis for comparing some performance metrics. Some comparisons have also been performed already, although not systematically. For instance, in Morasca and Lavazza (2020) and Lavazza and Morasca (2022) we showed how φ, FM and Youden's J can be expressed in terms of TPR and FPR (i.e., the axes of the ROC space) and ρ. The systematic investigation of additional relationships is part of our research agenda, from a mathematical and an empirical point of view.
In this respect, an important topic that we plan to investigate further is the impact of data imbalance on the indications provided by the various performance metrics, some of which, like Youden's J, do not suffer from imbalance effects while others appear to be largely influenced by imbalance. Formula (6) shows that FM can be written as follows

Appendix A: Comparing the Variations of FM Depending on FP and FN
Suppose that we would like to compare the performance of CM against the performance of another confusion matrix CM′ defined by "difference" from CM, as below P and N are the variations on the numbers of false negatives and false positives with respect to CM. We here study how FM varies depending on the values of P and N.
Let us first deal with a few trivial cases: So, let us now assume that P and N are nonzero and have opposite signs. First, take P = A > 0 and N = −R < 0, to obtain a new confusion matrix CM′ in which, in comparison to CM, R units are removed from FP at the price of adding A units to FN. The value of FM′ for CM′ is Via mathematical computations, we obtain It is easy to show that Formula (22) shows that, to have FM′ > FM, the number R of units removed from FP must be at least as large as the number A of units added to FN.
Conversely, let us now take P = −R < 0 and N = A > 0, i.e., we have a new confusion matrix CM′ in which, in comparison to CM, R units are removed from FN while A units are added to FP. The value of FM′ for this CM′ is Via mathematical computations, we obtain The rightmost inequality in Formula (25)  It can also be shown that FM′ > FM also when exactly one between A and R is zero. Summarizing, reducing or increasing FN by some amount has more impact on FM than does reducing or increasing FP by the same amount.
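The asymmetry between FN and FP can be illustrated numerically. The sketch below assumes that the dataset is fixed, so that AP and AN do not change: a unit removed from FN becomes a true positive, while a unit removed from FP becomes a true negative. The starting confusion matrix is hypothetical.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """FM written in terms of the confusion matrix cells."""
    return 2.0 * tp / (2 * tp + fp + fn)


# Hypothetical starting point: TP = 50, FP = 20, FN = 20.
tp, fp, fn = 50, 20, 20
base = f_measure(tp, fp, fn)
k = 5  # number of units moved

# Removing k units from FN turns them into true positives (AP is fixed).
gain_fn = f_measure(tp + k, fp, fn - k) - base
# Removing k units from FP turns them into true negatives (TP is unchanged).
gain_fp = f_measure(tp, fp - k, fn) - base
# The same comparison for increases instead of reductions.
loss_fn = base - f_measure(tp - k, fp, fn + k)
loss_fp = base - f_measure(tp, fp + k, fn)

print(round(gain_fn, 4), round(gain_fp, 4), round(loss_fn, 4), round(loss_fp, 4))
```

In both directions the change in FM caused by moving k units of FN is larger than the change caused by moving k units of FP, because a change in FN also changes TP in the numerator.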

Appendix B: Defining φ in Terms of PPV and TPR
Starting from Formula (5), which defines φ, we apply a few mathematical computations, as follows.
By dividing both numerator and denominator by EP, we have

$$\phi = \frac{\sqrt{TPR}\,(PPV - \rho)}{\sqrt{PPV - (PPV + TPR)\rho + TPR\,\rho^2}} \tag{27}$$

We now show the possible values of PPV and TPR, given ρ.
It is immediate to note that constr_ρ(PPV, TPR) in (28) is 1 if and only if PPV = 1. For completeness, let us now study the special case in which PPV = TPR = TP = 0. We have which can take any value between the minimum 0 and the supremum 1.

It can be shown that, given PPV and TPR, φ is a monotonically decreasing function of ρ, except when PPV = TPR = 1. Based on Formula (27), the first derivative of φ with respect to ρ is

$$\frac{d\phi}{d\rho} = \sqrt{TPR}\;\frac{(PPV + TPR - 2\,PPV \cdot TPR)\,\rho + PPV\,(PPV + TPR - 2)}{2\left(PPV - (PPV + TPR)\rho + TPR\,\rho^2\right)^{3/2}} \tag{30}$$

The denominator of the right-hand fraction in Formula (30) is always positive, as it is twice the cube of the denominator of the right-hand fraction in Formula (27). Thus, the sign of the derivative is the same as the sign of the numerator of the right-hand fraction in Formula (30). The numerator is a linear function of ρ. The value of the numerator for ρ = 0 is PPV(PPV + TPR − 2), which is a negative value unless PPV = TPR = 1 (in which case φ = 1 for all values of ρ). Thus, the numerator is negative for all values in the interval ρ ∈ [0, ρ̄(PPV, TPR)), where

$$\bar{\rho}(PPV, TPR) = \frac{PPV\,(2 - PPV - TPR)}{PPV + TPR - 2\,PPV \cdot TPR} \tag{31}$$

We now prove that constr_ρ(PPV, TPR) ≤ ρ̄(PPV, TPR), so φ is a decreasing function for all admissible values of ρ. To prove that constr_ρ(PPV, TPR) ≤ ρ̄(PPV, TPR), i.e.,

$$\frac{PPV}{PPV + TPR - PPV \cdot TPR} \le \frac{PPV\,(2 - PPV - TPR)}{PPV + TPR - 2\,PPV \cdot TPR} \tag{32}$$

consider that both denominators are positive, so we can cross-multiply; after simplification, the inequality is equivalent to (PPV + TPR)(1 − PPV)(1 − TPR) ≥ 0, which always holds. Thus, constr_ρ(PPV, TPR) ≤ ρ̄(PPV, TPR). Therefore, φ is minimum when ρ = constr_ρ(PPV, TPR), with value

$$\phi_{min} = \frac{PPV + TPR - PPV \cdot TPR - 1}{\sqrt{(1 - PPV)(1 - TPR)}} \tag{33}$$

It is φ_min ≤ 0, since the numerator in (33) is (1 − TPR)(PPV − 1) ≤ 0. Finally, for completeness, the supremum of φ is √(PPV · TPR), approached when ρ tends to 0, based on Formula (27).
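The monotonic behavior of φ and its value at the largest admissible ρ can be checked numerically with the sketch below. It assumes the closed form of φ in terms of PPV, TPR, and ρ obtained from the definition of φ, and the upper bound PPV/(PPV + TPR − PPV · TPR) on ρ discussed in this appendix; the function names and the chosen values of PPV and TPR are hypothetical.

```python
import math


def phi_from_ppv_tpr(ppv: float, tpr: float, rho: float) -> float:
    """Assumed closed form of phi in terms of PPV, TPR and rho."""
    return math.sqrt(tpr) * (ppv - rho) / math.sqrt((1.0 - rho) * (ppv - rho * tpr))


def max_rho(ppv: float, tpr: float) -> float:
    """Largest rho compatible with the given PPV and TPR."""
    return ppv / (ppv + tpr - ppv * tpr)


ppv, tpr = 0.6, 0.4
rho_top = max_rho(ppv, tpr)

# Evaluate phi on a grid of admissible prevalence values, up to the bound.
rhos = [rho_top * k / 50 for k in range(1, 51)]
values = [phi_from_ppv_tpr(ppv, tpr, r) for r in rhos]

print(round(rho_top, 4), round(values[0], 4), round(values[-1], 4))
print(round(-math.sqrt((1 - ppv) * (1 - tpr)), 4))  # expected value at the bound
```

The sequence of values is strictly decreasing, starts close to √(PPV · TPR), and ends at −√((1 − PPV)(1 − TPR)), which is the same quantity as the minimum discussed above written in a more compact form.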

Appendix C: FM is the Harmonic Mean of TPR and PPV
A characteristic of FM that is not fully explained (Yao and Shepperd 2021) is that it is defined as the harmonic mean of TPR and PPV. The often suggested rationale is that the harmonic mean is a "low" mean: it is never higher than the geometric mean, which, in turn, is never higher than the arithmetic mean (which would probably be a more easily interpretable choice (Yao and Shepperd 2021)). It is immediate to show that the harmonic mean of two values is never greater than twice the lesser of the two, i.e., it never gets "too far" from the lower value. As an example, if PPV = 0.04, FM ≤ 0.08 no matter how high the value of TPR. It is not clear, however, what is to be gained by choosing a "low" mean instead of a "high" mean. If it were all that important to have low values, one could then use min{PPV, TPR} as a performance metric. What matters more, instead, is that choosing a different mean implies choosing a different ordering among performances and, therefore, a different preference ranking among classifiers. For instance, a classifier cl_f with TPR_f = 0.2 and PPV_f = 0.8 would rank better than a classifier cl_g with TPR_g = 0.3 and PPV_g = 0.5 if taking the geometric mean of TPR and PPV as a performance metric (in fact, √(0.2 · 0.8) = 0.4 > √(0.3 · 0.5) ≈ 0.387), but worse with the harmonic mean, i.e., FM (in fact, 2 · 0.8 · 0.2/(0.8 + 0.2) = 0.32 < 2 · 0.3 · 0.5/(0.3 + 0.5) = 0.375). As we show in Section 4, φ is not defined as a mean (of any type) of TPR and PPV, but, rather, as an effect size metric.
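The ranking reversal between the two classifiers can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the example; the helper names are hypothetical.

```python
import math


def harmonic_mean(a: float, b: float) -> float:
    return 2.0 * a * b / (a + b)


def geometric_mean(a: float, b: float) -> float:
    return math.sqrt(a * b)


# (PPV, TPR) pairs of the two classifiers in the example.
cl_f = (0.8, 0.2)
cl_g = (0.5, 0.3)

for name, mean in (("geometric", geometric_mean), ("harmonic (FM)", harmonic_mean)):
    better = "cl_f" if mean(*cl_f) > mean(*cl_g) else "cl_g"
    print(name, round(mean(*cl_f), 3), round(mean(*cl_g), 3), "->", better)
```

The geometric mean prefers cl_f, while the harmonic mean, i.e., FM, prefers cl_g, so the choice of the mean is a choice of ranking.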

Appendix D: The Relationship between φ and FM
We take the definition formula for FM, FM = 2TP/(AP + EP), and solve it for TP, thereby obtaining TP = FM · (AP + EP)/2, i.e., TP can be seen as a function of FM and EP. We replace the value of TP in the rightmost member of Formula (26) and carry out a few computations. Formula (38) shows how φ depends on FM, EP, and ρ. It is the basis for studying the relationship between φ and FM, which we do in Section 5.

Appendix E: Variation Interval of φ as a Function of FM
The range of values of $\sigma$ in Formula (10) is constrained because of the natural constraints on the cells of the confusion matrix, as we now detail. For illustration purposes, we set the constraints in terms of EP, which can always be immediately translated in terms of $\sigma = \frac{EP}{n}$.
It can be shown that all other natural constraints (e.g., $FN \le EN$, or $EP \le n$) are satisfied when the above constraints are satisfied and all cells of the confusion matrix are nonnegative.
We need to find under what conditions these upper bounds are stricter than the others. Let us first compare the upper bounds of Formulas (39) and (41).
With reference to Formula (37), $\phi_{\min}(FM; \rho)$ and $\phi_{\max}(FM; \rho)$ are, respectively, the minimum and the maximum value of $\phi$ as EP varies in its interval. $\phi$ is a continuous function of EP, so we use the first derivative of $\phi$ with respect to EP to identify minima and maxima.
The second-degree polynomial of FM in Formula (45) has roots $FM_1 = -\frac{1 - \rho}{\rho}$ and $FM_2 = \frac{2\rho}{1 + \rho}$, so it is less than or equal to 0 between those two roots. Root $FM_1$ is negative, while root $FM_2$ coincides with the upper bound of the interval of FM we are currently investigating, so we can conclude that the derivative of $\phi$ with respect to EP is not positive when $FM \le \frac{2\rho}{1 + \rho}$. This implies that, for all $FM \le \frac{2\rho}{1 + \rho}$, $\phi_{\min}(FM)$ is achieved with the highest possible value for EP, i.e., $EP = \frac{2\,AN + AP \cdot FM}{2 - FM}$.
Furthermore, it can be shown that $\phi_{lb}(FM) \ge \phi_{ub}(FM)$, so $\phi_{\max} = \phi_{lb}(FM)$. Note that $\phi_{\max}(FM)$ is defined by the same function in Formulas (47) and (51).
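The variation interval of EP can be explored with a short sketch. Formulas (39) to (41) are not reproduced in this appendix, so the three bounds below are re-derived here from $TP \le EP$, $TP \le AP$, and $FP \le AN$ with $TP = FM\,(AP + EP)/2$. They are an assumption about the form of those formulas, and the function names are hypothetical.

```python
import math


def ep_interval(fm, ap, n):
    """Admissible interval of EP for given FM, AP and n (ASSUMED bounds)."""
    an = n - ap
    lower = fm * ap / (2 - fm)                 # from TP <= EP
    upper_tp = ap * (2 - fm) / fm              # from TP <= AP
    upper_fp = (2 * an + ap * fm) / (2 - fm)   # from FP <= AN
    return lower, min(upper_tp, upper_fp)


def fm_threshold(rho):
    """Value of FM below which the FP <= AN bound is the stricter upper bound."""
    return 2 * rho / (1 + rho)


def phi_from_fm(fm, ep, ap, n):
    """phi from FM, EP, AP and n, using TP = FM (AP + EP) / 2."""
    tp = fm * (ap + ep) / 2
    return (n * tp - ep * ap) / math.sqrt(ep * (n - ep) * ap * (n - ap))


def check_bounds(n, ap, tol=1e-9):
    """Brute-force check: every confusion matrix with TP > 0 respects the interval."""
    for tp in range(1, ap + 1):
        for fp in range(0, n - ap + 1):
            ep = tp + fp
            fm = 2 * tp / (ap + ep)
            lower, upper = ep_interval(fm, ap, n)
            if not (lower - tol <= ep <= upper + tol):
                return False
    return True


if __name__ == "__main__":
    print(check_bounds(30, 10))
    fm, ap, n = 0.3, 0.4, 1.0  # normalized counts, so ap plays the role of rho
    lower, upper = ep_interval(fm, ap, n)
    print(fm <= fm_threshold(ap / n), lower, upper, phi_from_fm(fm, upper, ap, n))
```

In the normalized example, FM is below $\frac{2\rho}{1+\rho}$, and evaluating `phi_from_fm` over the interval shows $\phi$ decreasing in EP, so its minimum is reached at the upper end of the interval, in agreement with the argument above.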

Appendix F: Preserving Classifiers' Rankings with φ and FM for Datasets with Different Actual Prevalence
Formula (52) shows the inequality that must hold when using two classifiers $cl_a$ and $cl_b$ on two datasets with different actual prevalence values $\rho_a$ and $\rho_b$.
Formula (52) reduces to Formula (15) when $\rho_a = \rho_b$. It can be shown that, for every value of $FM_a$, there always exists a value of $FM_b$ such that there is complete separation between the variation intervals, regardless of the values of $\rho_a$ and $\rho_b$. The right-hand side of the inequality in Formula (52) is an increasing function of $\rho_b$ and a decreasing function of $\rho_a$: the constraint on $FM_b$ becomes stricter when $\rho_b$ increases and easier to satisfy when $\rho_a$ increases. Figure 13 shows the minimum value of $FM_b$ as a function of $FM_a$, for a few pairs $\rho_a$, $\rho_b$.
The function in Formula (12) is never positive, while the one in Formula (13) is nonnegative, so the value of $\rho$ that minimizes $\phi_{\min}(FM; \rho)$ for a given value of FM is the one that minimizes the function in Formula (12). Thus, let us take the first derivative of the function in the square root in the right-hand side of the equality in Formula (12). This derivative is positive for $\rho < \frac{1}{2 - FM}$, null in $\rho = \frac{1}{2 - FM}$, and negative for $\rho > \frac{1}{2 - FM}$, so the function in the square root has a maximum in $\rho = \frac{1}{2 - FM}$, and $\phi_{\min}(FM; \rho)$ has a minimum there. By setting $\rho = \frac{1}{2 - FM}$ in Formula (12), we obtain $\phi_{\min}(FM)$ as in Formula (16).
Let us now compute $\phi_{\max}(FM)$, the maximum value of $\phi_{\max}(FM; \rho)$ for a given value of FM. To this end, let us compute the first derivative of the term under the square root sign of Formula (14). For all values of $FM < 1$, this derivative is negative, so $\phi_{\max}(FM; \rho)$ attains its maximum when $\rho = 0$. Thus, we obtain $\phi_{\max}(FM)$ as in Formula (17). For completeness, $\phi_{\max}(FM; \rho) = 1$ when $FM = 1$, which coincides with the value obtained in Formula (17) for $FM = 1$ anyway.
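The two optimizations over $\rho$ can be illustrated numerically. Formulas (12), (14), (16), and (17) are not reproduced in this appendix, so the closed forms used below are assumptions: they are re-derived from the confusion matrix by setting $TN = 0$ for the minimum (valid for $FM \le \frac{2\rho}{1+\rho}$) and $FP = 0$ for the maximum, and they may differ in form from the paper's formulas. The function names are hypothetical.

```python
import math


def phi_min_given_rho(fm, rho):
    """ASSUMED form of phi_min(FM; rho) for FM <= 2 rho / (1 + rho), with TN = 0."""
    t = fm / (2 - fm)  # TP / n when TN = 0
    inside = (1 - rho) * (rho - t) / (rho * (1 + t - rho))
    return -math.sqrt(max(0.0, inside))


def phi_max_given_rho(fm, rho):
    """ASSUMED form of phi_max(FM; rho), obtained with FP = 0 (i.e., PPV = 1)."""
    tpr = fm / (2 - fm)
    return math.sqrt(tpr * (1 - rho) / (1 - tpr * rho))


def rho_minimizing_phi_min(fm, steps=2000):
    """Grid search for the rho that minimizes phi_min(FM; rho)."""
    t = fm / (2 - fm)
    grid = [t + (1 - t) * i / steps for i in range(1, steps)]
    return min(grid, key=lambda rho: phi_min_given_rho(fm, rho))


if __name__ == "__main__":
    for fm in (0.2, 0.5, 0.8):
        best_rho = rho_minimizing_phi_min(fm)
        print(
            fm,
            best_rho,
            1 / (2 - fm),
            phi_min_given_rho(fm, 1 / (2 - fm)),
            phi_max_given_rho(fm, 0.0),
        )
```

Under these assumptions, the grid search locates the minimum at $\rho = \frac{1}{2 - FM}$, as stated in the text, with value $FM - 1$, and the maximum at $\rho = 0$ has value $\sqrt{FM / (2 - FM)}$, which equals 1 for $FM = 1$.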
A file describing the 837 models from 70 datasets that we analyzed is available from http://www.dista.uninsubria.it/supplemental material/PhiFM/binclass.csv. The file also specifies the value of $\rho$ for each dataset, the confusion matrix of each binary classifier, and a set of performance metrics.