Master your Metrics with Calibration

Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as F1-score or AUC-PR (Area Under the Precision-Recall Curve). Because they depend heavily on the class prior, such metrics can lead to wrong conclusions about performance. For example, when dealing with non-stationary data streams, they do not allow the user to discern why a model's performance varies across different periods. In this paper, we propose a way to calibrate the metrics so that they are no longer tied to the class prior. Calibration corresponds to a readjustment, based on probabilities, to the value that the metric would have if the class prior were equal to a reference prior (a user parameter). We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use-cases where calibration is beneficial, such as model monitoring in production, reporting, or fairness evaluation.


Introduction
In real-world machine learning systems, the predictive performance of a model is often evaluated on several test datasets rather than one, and comparisons are made. These datasets can correspond to sub-populations in the data, or to different periods in time [15,18]. Choosing the best-suited metrics for these comparisons is not a trivial task. Some metrics may prevent a proper interpretation of the performance differences between the sets [8,14], especially because the latter generally have not only a different class prior $P(y)$ but also a different likelihood $P(x \mid y)$. For instance, a metric that depends on the prior (e.g. precision, which is naturally higher if positive examples are globally more frequent) will be affected by both differences indiscernibly [3], whereas a practitioner could be interested in isolating the variation of performance due to the likelihood, which reflects the model's intrinsic performance. Take the example of comparing the performance of a model across time periods: at time $t$, we receive data drawn from $P_t(x, y) = P_t(x \mid y) P_t(y)$, where $x$ are the features and $y$ the label. Hence the optimal scoring function (i.e. model) for this dataset is the likelihood ratio [11]:

$$s_t(x) = \frac{P_t(x \mid y = 1)}{P_t(x \mid y = 0)} \tag{1}$$

In particular, if $P_t(x \mid y)$ does not vary with time, neither will $s_t(x)$. In this case, even if the prior $P_t(y)$ varies, it is desirable to have a performance metric $M(\cdot)$ satisfying $M(s_t, P_t) = M(s_{t+1}, P_{t+1})$ for all $t$, so that the model maintains the same metric value over time.
In binary classification, researchers often rely on the AUC-ROC (Area Under the Receiver Operating Characteristic Curve) to measure a classifier's performance [6,9]. While the AUC-ROC has the advantage of being invariant to the class prior, many real-world applications, especially when data are imbalanced, favor precision-based metrics. The reason is that, by considering the proportion of false positives with respect to the negative examples, which are in large majority, ROC gives false positives too little importance [5]. Yet false positives deteriorate user experience and waste human effort with false alerts. Thus, ROC can give over-optimistic scores for a classifier that performs poorly in terms of precision, which is particularly inconvenient when the class of interest is in the minority [10].
Metrics directly based on recall and precision, which give more weight to false positives, such as AUC-PR or F1-score, have been shown to be much more relevant [12] and have gained in popularity in recent years [13]. That being said, these metrics are strongly tied to the class prior [2,3]. A redefinition of precision and recall as precision gain and recall gain has recently been proposed to correct several drawbacks of AUC-PR [7]. However, while the resulting AUC-PR Gain has some of the advantages of the AUC-ROC, such as the validity of linear interpolation between points, it remains dependent on the class prior.
This study aims at providing (i) a precision-based metric to cope with problems where the class of interest is in a slim minority and (ii) a metric independent of the prior to monitor performance over time. The motivation is illustrated by a fraud detection use-case [4]. Figure 1 shows a situation where, over a given period of time, both the detection system's performance and the fraud ratio π (i.e. the empirical $P_t(y)$) decrease. As the metric used, AUC-PR, depends on the prior, it cannot tell whether the performance variation is only due to the fraud ratio or whether other factors are involved (e.g. drift in $P(x \mid y)$). To cope with the limitations of the widely used precision-based metrics, we introduce a calibrated version of them which excludes the impact of the class ratio. More precisely, our contributions are:

1. A formulation, in section 3.1, of calibration for precision-based metrics. It estimates the value that the metrics would have if the class ratio in the test set were equal to a reference class ratio $\pi_0$ fixed by the user. We give a theoretical argument to show that it allows invariance to the class prior. We also provide a calibrated version of precision gain and recall gain.

4. Large-scale experiments on 614 datasets using OpenML [16], in section 4, to (a) give more insight into the correlation between popular metrics by analyzing how they rank models, and (b) explore the links between the calibrated metrics and the regular ones and explain how these links are affected by the choice of $\pi_0$.
We emphasize that calibration not only solves the issue of dependence on the prior but also, through the parameter $\pi_0$, lets the user evaluate the model under a different configuration (a different class ratio) and control what the metric precisely reflects. These new properties are of practical interest (e.g. for development, reporting, analysis) and we discuss them through realistic use-cases in section 5.

Popular metrics for binary classification: advantages and limits
We consider a usual binary classification setting where a model has been trained and its performance is evaluated on a test dataset of $N$ instances. $y_i \in \{0, 1\}$ is the ground-truth label of the $i$-th instance and is equal to 1 (resp. 0) if the instance belongs to the positive (resp. negative) class. The model provides $s_i \in \mathbb{R}$, a score for the $i$-th instance to belong to the positive class. For a given threshold $\tau \in \mathbb{R}$, the predicted label is $\hat{y}_i = 1$ if $s_i > \tau$ and $\hat{y}_i = 0$ otherwise.
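The numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are defined as follows:

$$TP = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1, y_i = 1), \qquad TN = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 0, y_i = 0),$$
$$FP = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1, y_i = 0), \qquad FN = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 0, y_i = 1),$$

where $\mathbb{1}(\omega)$ yields 1 when $\omega$ is true and 0 otherwise. Based on these statistics, one can compute relevant ratios such as the True Positive Rate (TPR), also referred to as the Recall (Rec), the False Positive Rate (FPR), also referred to as the Fall-out, and the Precision (Prec):

$$TPR = Rec = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}, \qquad Prec = \frac{TP}{TP + FP}$$

As these ratios are biased towards a specific type of error and can easily be manipulated with the threshold, more complex metrics have been proposed. In this paper, we discuss the most popular ones, which have been widely adopted in binary classification: F1-Score, AUC-ROC, AUC-PR and AUC-PR Gain.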
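As a minimal sketch of the definitions above, the counts and ratios can be computed directly from labels, scores and a threshold. The function names `confusion_counts` and `rates` and the toy data are illustrative assumptions, not part of the paper.

```python
def confusion_counts(y_true, scores, tau):
    """Count TP, TN, FP, FN for the rule: predict 1 if score > tau, else 0."""
    tp = tn = fp = fn = 0
    for y, s in zip(y_true, scores):
        y_hat = 1 if s > tau else 0
        if y_hat == 1 and y == 1:
            tp += 1
        elif y_hat == 1 and y == 0:
            fp += 1
        elif y_hat == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Return (TPR, FPR, Prec).

    Assumes both classes are present and at least one instance is predicted positive.
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    prec = tp / (tp + fp)
    return tpr, fpr, prec


# Hypothetical toy data: 3 positives, 5 negatives.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
tp, tn, fp, fn = confusion_counts(y_true, scores, tau=0.55)
tpr, fpr, prec = rates(tp, tn, fp, fn)
```

Moving `tau` trades one type of error for the other, which is why the threshold-free area-based metrics discussed next are often preferred.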
F1-Score is the harmonic mean of Prec and Rec:

$$F_1 = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec} \tag{3}$$

The three other metrics consider every threshold τ from the highest $s_i$ to the lowest. For each one, they compute TP, FP, TN and FN. Then, they plot one ratio against another and compute the Area Under the Curve (Figure 2). AUC-ROC considers the Receiver Operating Characteristic curve, where TPR is plotted against FPR. AUC-PR considers the Precision vs Recall curve. Finally, in AUC-PR Gain, the precision gain ($Prec_G$) is plotted against the recall gain ($Rec_G$). They are defined in [7] as follows:

$$Prec_G = \frac{Prec - \pi}{(1 - \pi)\, Prec} \tag{4}$$

$$Rec_G = \frac{Rec - \pi}{(1 - \pi)\, Rec} \tag{5}$$

where $\pi = \frac{1}{N} \sum_{i=1}^{N} y_i$ is the proportion of positive examples, or positive class ratio. This allows PR Gain to enjoy many interesting properties of the ROC that the original PR analysis does not, such as the validity of linear interpolations, the existence of universal baselines or the interpretability of the area under its curve [7]. However, AUC-PR Gain can become limited in an extremely imbalanced setting. In particular, we can derive from (4) and (5) that both $Prec_G$ and $Rec_G$ will often be close to 1 if π is close to 0. This is illustrated in Figure 2.

Each metric has its own raison d'être. For instance, we visually observe that only AUC-ROC is invariant to the positive class ratio. Indeed, FPR and Rec are both unrelated to the class ratio because they each focus on a single class, but this is not the case for Prec: its dependency on the positive class ratio π is illustrated in Figure 3.
Instances, represented as circles (resp. squares) if they belong to the positive (resp. negative) class, are ordered from left to right according to the score given by the model. A threshold is illustrated as a vertical line between the instances: those on the left (resp. right) are classified as positive (resp. negative). When comparing a case (i) with a given ratio π and another case (ii) where a randomly selected half of the positive examples has been removed, one can visually understand that both the recall and the false positive rate are the same but the precision is lower in the second case.
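The comparison between case (i) and case (ii) can be reproduced numerically. The sketch below uses hypothetical scores and, instead of a random removal, drops every other positive example so that the recall is preserved exactly; this deterministic choice is an assumption made for illustration only.

```python
# Hypothetical scores, sorted from highest to lowest within each class.
pos_scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.4, 0.3]
neg_scores = [0.65, 0.6, 0.45, 0.35, 0.25, 0.2, 0.15, 0.1]
tau = 0.5


def rates(pos, neg, tau):
    """Return (recall, false positive rate, precision) at threshold tau."""
    tp = sum(s > tau for s in pos)
    fp = sum(s > tau for s in neg)
    return tp / len(pos), fp / len(neg), tp / (tp + fp)


# Case (i): all positives are kept.
case_i = rates(pos_scores, neg_scores, tau)
# Case (ii): half of the positives are removed (every other one), negatives untouched.
case_ii = rates(pos_scores[::2], neg_scores, tau)
```

Recall and false positive rate are computed within a single class, so they do not move; precision mixes both classes and drops from 0.75 to 0.6 in this toy setting.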

[Figure 3, case (ii): same recall, same false positive rate, lower precision.]

These different properties can become strengths or limitations depending on the context. As stated in the introduction, we consider a motivating scenario where (i) data are imbalanced and the minority (positive) class is the one of interest and (ii) we monitor performance across time. In that case, we need a metric that considers precision rather than FPR and that is invariant to the prior. By definition, AUC-ROC does not involve precision, whereas the other metrics do. But it is the only one that is invariant to the positive class ratio.

Calibrated Metrics
To obtain a metric that satisfies both properties from the last section, we modify those based on Prec (AUC-PR, F1-Score and AUC-PR Gain) to make them independent of the positive class ratio π.

Calibration
The idea is to fix a reference ratio $\pi_0$ and to weigh the count of TP or FP in order to calibrate them to the value that they would have if π were equal to $\pi_0$. $\pi_0$ can be chosen arbitrarily (e.g. 0.5 for balanced classes) but it is preferable to fix it according to the task at hand. We analyze the impact of $\pi_0$ in section 4 and describe simple guidelines to fix it in section 5.
If the positive class ratio is $\pi_0$ instead of π, the ratio between negative examples and positive examples is multiplied by $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}$. In this case, we expect the ratio between false positives and true positives to be multiplied by the same factor $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}$. Therefore, we define the calibrated precision $Prec_c$ as follows:

$$Prec_c = \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} FP} = \frac{1}{1 + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \frac{FP}{TP}}$$

Since $\frac{1-\pi}{\pi}$ is the imbalance ratio $\frac{N_-}{N_+}$, where $N_+$ (resp. $N_-$) is the number of positive (resp. negative) examples, we have:

$$\frac{\pi}{1-\pi} \frac{FP}{TP} = \frac{FP / N_-}{TP / N_+} = \frac{FPR}{TPR}$$

which is independent of π. Based on the calibrated precision, we can also define the calibrated F1-score, the calibrated $Prec_G$ and the calibrated $Rec_G$ by replacing Prec by $Prec_c$ and π by $\pi_0$ in equations (3), (4) and (5). Note that calibration does not change precision gain. Indeed, the calibrated precision gain $\frac{Prec_c - \pi_0}{(1-\pi_0) Prec_c}$ can be rewritten as $\frac{Prec - \pi}{(1-\pi) Prec}$, which is equal to the regular precision gain. Also, the interesting properties of the recall gain were proved independently of the ratio π in [7], which means calibration preserves them.
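The definitions of this section can be checked on a small numerical example. The sketch below is illustrative: the function names and the counts (100 positives, 900 negatives) are assumptions, and `gain` implements the common form of equations (4) and (5).

```python
def precision(tp, fp):
    return tp / (tp + fp)


def calibrated_precision(tp, fp, pi, pi0):
    """Precision with FP reweighted by pi(1 - pi0) / (pi0(1 - pi))."""
    weight = pi * (1 - pi0) / (pi0 * (1 - pi))
    return tp / (tp + weight * fp)


def f1(prec, rec):
    """Harmonic mean of precision and recall, as in equation (3)."""
    return 2 * prec * rec / (prec + rec)


def gain(value, pi):
    """Precision gain or recall gain: (value - pi) / ((1 - pi) * value)."""
    return (value - pi) / ((1 - pi) * value)


# Hypothetical counts: 100 positives and 900 negatives, so pi = 0.1.
tp, fp, n_pos, n_neg = 60, 90, 100, 900
pi, pi0 = n_pos / (n_pos + n_neg), 0.5
tpr, fpr = tp / n_pos, fp / n_neg

prec = precision(tp, fp)
prec_c = calibrated_precision(tp, fp, pi, pi0)
# Same quantity written only with rates, which do not depend on pi.
prec_c_from_rates = 1 / (1 + (1 - pi0) / pi0 * fpr / tpr)
# Calibrated F1: Prec is replaced by Prec_c in equation (3).
f1_c = f1(prec_c, tpr)
# Precision gain before and after calibration.
prec_gain = gain(prec, pi)
prec_gain_c = gain(prec_c, pi0)
```

In this example both forms of the calibrated precision coincide, and the precision gain is identical before and after calibration, in line with the remark above. Setting `pi0 = pi` gives back the regular precision.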

Robustness to π variations
In order to evaluate the robustness of the new metrics to variations in π, we create a synthetic dataset where the label is drawn from a Bernoulli distribution with parameter π and the feature is drawn from Normal distributions (7) with class-dependent means $\mu_1$ and $\mu_0$. We empirically study the behavior of the F1-score, AUC-PR, AUC-PR Gain and their calibrated versions on the optimal model defined in (1) as π decreases from π = 0.5 (balanced) to π = 0.001. Figure 4 presents the results averaged over 30 runs with their confidence intervals. We observe that the class prior has a strong impact on the regular metrics. This can be a serious issue for applications where π sometimes varies by one order of magnitude from one day to another (see [4] for a real-world example), as it leads to a significant variation of the measured performance (see the difference between AUC-PR when π = 0.5 and when π = 0.05) even if the model remains the same. In contrast, the calibrated versions remain very robust to changes in the class prior π, even for extreme values. Note that we experiment here with synthetic data in order to have strong control over the distribution and the prior. In appendix A, we run a similar experiment with real-world data where we artificially change π in the test set with undersampling. The conclusions remain the same.
Let us remark that for test datasets in which $\pi = \pi_0$, $Prec_c$ is equal to the regular precision Prec since $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} = 1$ (the intersection of the metrics in Figure 4 when $\pi = \pi_0 = 0.5$ reflects that). The new metrics essentially have the value that the original ones would have if the positive class ratio π were equal to $\pi_0$. This is analyzed in depth in appendix B, where we compare the proposed formula for calibration with the most promising earlier proposal [10]: a heuristic-based calibration. That approach consists in randomly undersampling the test set so that $\pi = \pi_0$ and then computing the regular metrics. Because of the randomness, sampling can remove more hard examples than easy examples, so the performance can be over-estimated, and vice versa. To avoid that, the authors perform several runs and compute the mean performance. The appendix shows and explains why our formula directly computes a value equal to their estimate. Therefore, while similar in spirit, our proposal can be seen as an improvement, as it provides a closed-form solution, is deterministic, and is less computationally expensive than their Monte-Carlo approximation.
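A reduced version of the robustness experiment can be sketched with the standard library only. Several choices below are assumptions made for illustration and differ from the paper's protocol: the features have unit variance, the metric is the precision at a single fixed threshold rather than an area under a curve, and the sample size is far smaller than $10^6$. Thresholding the feature is equivalent to thresholding the likelihood ratio (1) here, because the ratio is increasing in $x$ when both classes share the same variance and $\mu_1 > \mu_0$.

```python
import random


def simulate(pi, n, mu1=2.0, mu0=1.8, tau=1.9, pi0=0.5, seed=0):
    """Return (precision, calibrated precision) at threshold tau on synthetic data.

    Assumption: x | y ~ Normal(mu_y, 1), i.e. unit variance in both classes.
    """
    rng = random.Random(seed)
    tp = fp = n_pos = 0
    for _ in range(n):
        y = 1 if rng.random() < pi else 0
        x = rng.gauss(mu1 if y == 1 else mu0, 1.0)
        n_pos += y
        if x > tau:
            if y == 1:
                tp += 1
            else:
                fp += 1
    pi_obs = n_pos / n  # observed class ratio, used in the calibration weight
    weight = pi_obs * (1 - pi0) / (pi0 * (1 - pi_obs))
    return tp / (tp + fp), tp / (tp + weight * fp)


# Same model and same likelihood, two very different class priors.
results = {pi: simulate(pi, n=100_000) for pi in (0.5, 0.05)}
```

In this sketch the regular precision collapses when π goes from 0.5 to 0.05, whereas the calibrated precision stays close to its value in the balanced case, up to sampling noise.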

Assessment of the model quality
Besides the robustness of the calibrated metrics to changes in π, we also want them to be sensitive to the quality of the model. If the latter decreases, regardless of the value of π, we expect all metrics, calibrated ones included, to decrease in value. Let us consider an experiment where we use the same synthetic dataset as defined in the previous section. However, instead of changing only the value of π, we also change $(\mu_1, \mu_0)$ to make the problem harder and harder and thus worsen the optimal model's performance. This can be done by reducing the distance between the two Normal distributions in (7). As a distance, we consider the KL-divergence $KL(p_1, p_0)$ between them. Figure 5 shows how the values of the metrics evolve as the KL-divergence gets closer to zero. For each run, we randomly choose π in the interval [0.001, 0.5]. As expected, all metrics globally decrease as the problem gets harder. However, we can notice an important difference: the variations of the calibrated metrics are smooth compared to those of the original ones, which are affected by the random changes in π. In that sense, variations of the calibrated metrics across the different generated datasets are much easier to interpret than those of the original metrics.
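For reference, the KL-divergence between two Normal distributions has a simple closed form when they share the same variance. The statement below assumes such a shared variance $\sigma^2$; the symbol $\sigma$ is introduced here for illustration and is not part of the notation above.

```latex
\[
KL(p_1, p_0) = \frac{(\mu_1 - \mu_0)^2}{2\sigma^2},
\qquad
p_1 = \mathcal{N}(\mu_1, \sigma^2), \quad p_0 = \mathcal{N}(\mu_0, \sigma^2).
\]
```

Under this assumption, bringing $\mu_1$ and $\mu_0$ closer together drives the divergence to zero quadratically. For instance, with $\sigma = 1$, $\mu_1 = 2$ and $\mu_0 = 1.8$, the divergence is $0.2^2 / 2 = 0.02$.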
4 Link with the original metrics

Calibrated metrics are robust to variations in π and improve the interpretability of the model's performance. But a question remains: are they still linked to the original metrics, and does $\pi_0$ play a role in that link? We tackle this question from the model selection point of view by empirically analyzing the correlation of the metrics in terms of model ordering. We use OpenML [16] to select the 614 supervised binary classification datasets over which at least 30 models have been evaluated with a 10-fold cross-validation. For each one, we randomly choose 30 models, fetch their predictions, and evaluate their performance with several metrics. This leaves us with 614 × 30 = 18,420 different values for each metric. To analyze whether the metrics rank the models in the same order, we compute the Spearman rank correlation coefficient between them for each of the 614 problems.
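The Spearman rank correlation used in this analysis is the Pearson correlation of the rank vectors. The following sketch computes it with the standard library; the function names and the five metric values are hypothetical and only illustrate how two metrics can be compared on one dataset.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


# Hypothetical values of two metrics for five models on one dataset.
auc_pr = [0.61, 0.48, 0.75, 0.52, 0.70]
f1_score = [0.55, 0.50, 0.66, 0.41, 0.68]
rho = spearman(auc_pr, f1_score)
```

A coefficient of 1 means that both metrics order the models identically; averaging such coefficients over datasets gives one entry of a correlation matrix.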
Most datasets have roughly balanced classes (see the cumulative distribution function in Figure 9 in appendix C). We also run the same experiment on the subset of 4 highly imbalanced datasets where π < 0.01. A third experiment with the 21 datasets where π < 0.05 is available in appendix C.
The compared metrics are AUC-ROC, AUC-PR, AUC-PR Gain and F1-score. The threshold for the latter is fixed on a holdout validation set. We also add the calibrated versions of the last three. In order to understand the impact of $\pi_0$ on calibration, we use two different values: the arbitrary $\pi_0 = 0.5$ and another value $\pi_0 \approx \pi$ (for the first experiment with all datasets, $\pi_0 \approx \pi$ corresponds to $\pi_0 = 1.01\pi$; for the second experiment, where π is very small, we go further and $\pi_0 \approx \pi$ corresponds to $\pi_0 = 10\pi$, which remains closer to π than 0.5). The resulting correlation matrices are shown in Figure 6. Each entry corresponds to the average Spearman correlation over all datasets between the row metric and the column metric. In general, we observe that metrics are less correlated when classes are imbalanced (right figure).
We also note that F1-score is more correlated with PR-based metrics than with AUC-ROC, and that it agrees more with AUC-PR than with AUC-PR Gain. The latter two are highly correlated, especially in the balanced case (left matrix in Figure 6). Also in the balanced case, we can see that metrics defined as areas under curves are more correlated with each other than with the threshold-sensitive classification metric F1-score.
Let us now analyze calibration. As expected, in general, when $\pi_0 \approx \pi$, calibrated metrics behave very similarly to the original metrics because $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \approx 1$ and therefore $Prec_c \approx Prec$. In the balanced case (left), since π is close to 0.5, calibrated metrics where $\pi_0 = 0.5$ are also highly correlated with the original metrics. In the imbalanced case (right matrix of Figure 6), when $\pi_0$ is arbitrarily set to 0.5, the calibrated metrics seem to disagree with the original ones. In fact, they are even less correlated with the original metrics than with AUC-ROC. This can be explained by the relative weights given to FP and TP by each of these metrics. The original precision gives the same weight to TP as to FP, although false positives are $\frac{1-\pi}{\pi}$ times more likely to occur. The calibrated precision with the arbitrary value $\pi_0 = 0.5$ boils down to $\frac{TP}{TP + \frac{\pi}{1-\pi} FP}$ and gives a weight $\frac{1-\pi}{\pi}$ times smaller to false positives, which counterbalances their higher likelihood. ROC also implicitly gives $\frac{1-\pi}{\pi}$ times less weight to FP because it is computed from FPR and TPR, which are linked to TP and FP through the relationship $\frac{\pi}{1-\pi} \frac{FP}{TP} = \frac{FPR}{TPR}$.

To conclude this analysis, we first emphasize that the choice of metric seems to matter much less when datasets are rather balanced than in the extremely imbalanced case. Indeed, in the balanced case, the least correlated metrics are F1-score and AUC-ROC, with a correlation coefficient of 0.53. For the imbalanced datasets, on the other hand, many different metrics are uncorrelated, which means that most of the time they disagree on the best model. The choice of the metric is very important here, and our experiment reflects that it is a matter of how much weight we are willing to give to each type of error. With calibration, in order to preserve the nature of the original metrics, $\pi_0$ has to be fixed to a value close to π and not arbitrarily. $\pi_0$ can also be fixed to a value different from π if the user has another purpose.
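The disagreement between original and calibrated metrics on imbalanced data can be illustrated with two hypothetical models, each described by a (TPR, FPR) pair at its operating threshold. The numbers and names below are assumptions chosen for illustration. The helper expresses the precision at a given class ratio from the rates, which coincides with the calibrated precision when the class ratio is set to $\pi_0$.

```python
def precision_at_prior(tpr, fpr, prior):
    """Precision implied by (TPR, FPR) when the positive class ratio is `prior`."""
    return prior * tpr / (prior * tpr + (1 - prior) * fpr)


def f1_at_prior(tpr, fpr, prior):
    """F1-score built from the precision at `prior` and the recall (TPR)."""
    prec = precision_at_prior(tpr, fpr, prior)
    return 2 * prec * tpr / (prec + tpr)


# Hypothetical models: A catches more positives, B raises far fewer false alerts.
models = {"A": (0.9, 0.1), "B": (0.5, 0.01)}
pi = 0.01  # highly imbalanced test set

f1_original = {m: f1_at_prior(t, f, pi) for m, (t, f) in models.items()}
f1_calibrated = {m: f1_at_prior(t, f, 0.5) for m, (t, f) in models.items()}
```

With π = 0.01 the original F1-score prefers model B, because each false positive weighs heavily against the rare true positives. With $\pi_0 = 0.5$ the calibrated F1-score prefers model A, since false positives receive much less weight. Neither ranking is wrong; they answer different questions, which is why $\pi_0$ should be chosen deliberately.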

Guidelines and use-cases
Calibration can benefit ML practitioners when comparing the performance of a single model on different datasets or time periods. Without being exhaustive, we give four use-cases where it is beneficial (the strategy for setting $\pi_0$ depends on the target use-case):

• Model performance monitoring in an industrial context: in systems where performance is monitored over time with F1-score, using calibration in addition to the regular metrics makes it easier to analyze the drift (i.e. to distinguish between variations linked to π and variations linked to $P(x \mid y)$, see Appendix D) and to design adapted solutions: either updating the threshold or completely retraining the model. To avoid denaturing the F1-score too much, $\pi_0$ can here be fixed, for instance, based on expert knowledge (e.g. the average π in historical data).
• Comparing the performance of a model on two populations: if the prior differs from one population to another, the calibrated metric sets each population to the same reference. It provides a balanced point of view and makes the analysis richer. This might be useful to study fairness [1], for instance. Here, $\pi_0$ can be chosen as the prior of one of the two populations.
• Establishing agreements with clients: the positive label ratio π can vary extremely during particular events (e.g. fraudster attacks), which significantly affects the measured F1-score (see Figure 4) and makes it difficult to guarantee a standard to a client. Calibration helps provide a more controlled guarantee, such as a minimal level of precision at a given reference ratio $\pi_0$. Here $\pi_0$ can be a norm fixed by both parties based on expert knowledge.
• Anticipating the deployment of an algorithm in a real-world system: if the prior in the data collected for offline development differs from reality, non-calibrated metrics measured during development might give pessimistic or optimistic estimates of the post-deployment performance. In particular, this can be harmful when industry has constraints on the performance (e.g. precision has to be strictly above 1/2). Calibration with, for instance, a value $\pi_0$ equal to the minimal prior envisioned for the application at hand will allow anticipating the worst-case scenario.
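The last use-case can be sketched as follows. The rates, priors and function names are hypothetical. The second helper inverts the calibrated precision to find the smallest class ratio at which a precision target is still met; this inversion is a direct algebraic consequence of the formula and is added here for illustration.

```python
def calibrated_precision_from_rates(tpr, fpr, pi0):
    """Calibrated precision at reference ratio pi0, written with TPR and FPR."""
    return pi0 * tpr / (pi0 * tpr + (1 - pi0) * fpr)


def min_prior_for_precision(tpr, fpr, target):
    """Smallest pi0 for which the calibrated precision reaches `target`."""
    return target * fpr / (target * fpr + (1 - target) * tpr)


# Hypothetical rates measured offline on development data where pi = 0.2.
tpr, fpr = 0.8, 0.05
prec_dev = calibrated_precision_from_rates(tpr, fpr, 0.2)
# Worst case envisioned for production: pi0 = 0.02.
prec_worst = calibrated_precision_from_rates(tpr, fpr, 0.02)
# Class ratio below which the constraint "precision above 1/2" breaks.
pi_min = min_prior_for_precision(tpr, fpr, 0.5)
```

In this example the development data suggest a comfortable precision of 0.8, while the worst-case prior brings it below the 1/2 constraint, which would go unnoticed without calibration.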

Conclusion
In this paper, we provided a formula and guidelines to calibrate metrics in order to make the variation of their value across different datasets more interpretable. As opposed to the regular metrics, the new ones are shown to be robust to random variations in the class prior. This property can be useful to both academic and industrial applications. On the one hand, in a research study that involves incremental learning on streams, having a metric which resists virtual concept drift [17] can help to better focus on other sources of drift and design adapted solutions. On the other hand, in an industrial context, calibrated metrics give more stable results, help prevent false conclusions about a deployed model and also allow for more interpretable performance indicators that are easier to guarantee and report. Calibration relies on a reference positive class ratio $\pi_0$ and transforms the metric such that its value is the one that would be obtained if π in the evaluated test set were equal to $\pi_0$. The reference ratio $\pi_0$ has a simple interpretation and should be chosen with caution. If one's goal is to preserve the nature of the original metric, $\pi_0$ has to be close to the real π. But choosing a different $\pi_0$ also makes it possible, if needed, to situate the algorithm's performance with respect to a reference situation. Because it directly controls the importance given to false positives relative to true positives, calibration opens an interesting perspective for future work: investigating a generalized metric in which the cost associated with each type of error (FP, FN) appears as a parameter and from which we can retrieve the definition of existing popular metrics such as TPR, FPR or Precision.

B Comparison between the proposed calibrated metrics and a previously published heuristic for calibration
In [10], the authors propose a heuristic-based calibration which consists in randomly undersampling the test set so that $\pi = \pi_0$ and then computing the regular metrics. Because of the randomness, sampling can remove more hard examples than easy examples, so the performance can be over-estimated, and vice versa. To avoid that, they perform several runs and compute the mean value of the metrics.
In this appendix, we experimentally compare the results obtained with our calibration formula and with their calibration heuristic. We use the same train/test data and model as in Appendix A. Figure 8 displays the AUC-PR on the test set calibrated with our formula (blue dots) and with the heuristic from [10] (red line) at several reference ratios $\pi_0$. The red shaded area represents the standard deviation over 1000 runs when applying the heuristic. We can observe that our formula and the heuristic provide the same value, which can be confirmed theoretically (Proposition 1). This is an important result, as it confirms that the formula proposed in the paper really plays the role of calibrating the metric to the value that it would have if $\pi = \pi_0$.
Note that the closed-form calibration can be seen as an improvement over the heuristic because it directly provides the targeted value. It is deterministic and computes the metric only once, which makes it roughly $n$ times faster (where $n$ is the number of runs in the heuristic).

Proposition 1. Let $\hat{y} \in \{0, 1\}^N$ be the predictions of a classifier on a test set of $N$ samples and $y \in \{0, 1\}^N$ the ground-truth label vector with $N_+ = N\pi$ non-zero values. Let $\hat{y}'$, $y'$ be the prediction and ground-truth vectors of a sampled test set where we keep all the negative examples and a random sample of $k$ positive examples so that the positive class ratio becomes $\pi_0$. If we denote by $Prec$ (resp. $Prec'$) and $Rec$ (resp. $Rec'$) the precision and recall of $(\hat{y}, y)$ (resp. $(\hat{y}', y')$), then:

$$E[Rec'] = Rec$$

and

$$Prec' = \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} FP} \quad \text{if } Rec' = Rec.$$

Proof. Let us first introduce some notation:
• $P(y)$ (resp. $P(\hat{y})$): the set of indices of the positive examples in $y$ (resp. $\hat{y}$).
• $R \subset P(y)$: a random sample of size $k < N_+$, drawn without replacement from $P(y)$.
• $TP$, $FP$, $FN$, $\pi$ (resp. $TP'$, $FP'$, $\pi_0$): the number of true positives, the number of false positives, the number of false negatives and the positive class ratio in $(\hat{y}, y)$ (resp. in $(\hat{y}', y')$).

By definition, since the sampled test set contains $k$ positive examples, we have:

$$Rec' = \frac{TP'}{k}$$

Therefore,

$$E[Rec'] = \frac{E[TP']}{k} \tag{12}$$

Moreover,

$$E[TP'] = E\Big[\sum_{i \in P(y) \cap P(\hat{y})} \mathbb{1}(i \in R)\Big] = \sum_{i \in P(y) \cap P(\hat{y})} P(i \in R) = TP \cdot \frac{k}{N_+} \tag{13}$$

By injecting (13) in (12), we obtain:

$$E[Rec'] = \frac{TP}{N_+} = Rec$$

Let us now define $Prec'$:

$$Prec' = \frac{TP'}{TP' + FP'}$$

Since only positive examples are sampled from $y$ to $y'$, the number of false positives remains unchanged, so $FP' = FP$. Moreover, if $Rec' = Rec$ then:

$$TP' = k \cdot Rec = \frac{k}{N_+} TP$$

Therefore, if $Rec' = Rec$ then:

$$Prec' = \frac{\frac{k}{N_+} TP}{\frac{k}{N_+} TP + FP} = \frac{TP}{TP + \frac{N_+}{k} FP} = \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} FP}$$

where the last equality uses $\frac{N_+}{N_-} = \frac{\pi}{1-\pi}$ and $\frac{k}{N_-} = \frac{\pi_0}{1-\pi_0}$.

Note: Proposition 1 reflects the case where we sample the positive class in the test set, so it explains why the heuristic in [10] provides results close to our calibration formula when the target reference ratio $\pi_0$ is lower than the original ratio π. If the targeted reference ratio $\pi_0$ is higher than π, the heuristic has to sample the negative class. In that case, we can show a similar property (and proof) in which we simply have to replace the recall $Rec$ (resp. $Rec'$) with the false positive rate $FPR$ (resp. $FPR'$).
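The agreement between the undersampling heuristic and the closed-form calibration can be checked on a small simulation. The sketch below is an illustration under stated assumptions: the labels and predictions are hypothetical, the metric is the precision at a fixed threshold rather than AUC-PR, and the function names are not taken from [10].

```python
import random


def precision_of(y_true, y_pred):
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fp)


def closed_form_precision(y_true, y_pred, pi0):
    """Calibrated precision: FP reweighted by pi(1 - pi0) / (pi0(1 - pi))."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    pi = sum(y_true) / len(y_true)
    weight = pi * (1 - pi0) / (pi0 * (1 - pi))
    return tp / (tp + weight * fp)


def undersampled_precision(y_true, y_pred, pi0, runs, seed=0):
    """Heuristic: keep all negatives and k random positives so that the
    class ratio becomes pi0, then average the regular precision over runs."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    k = round(pi0 * len(neg) / (1 - pi0))
    total = 0.0
    for _ in range(runs):
        kept = rng.sample(pos, k) + neg
        total += precision_of([y_true[i] for i in kept],
                              [y_pred[i] for i in kept])
    return total / runs


# Hypothetical test set: 1000 positives (700 detected), 4500 negatives (450 false alerts).
y_true = [1] * 1000 + [0] * 4500
y_pred = [1] * 700 + [0] * 300 + [1] * 450 + [0] * 4050

closed = closed_form_precision(y_true, y_pred, pi0=0.1)
monte_carlo = undersampled_precision(y_true, y_pred, pi0=0.1, runs=200)
```

The closed form needs a single pass over the data, whereas the heuristic repeats the evaluation once per run and only approaches the same value on average.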

C Additional experiment with OpenML
Here we display the CDF of the minority class ratio π in OpenML binary classification datasets (Figure 9). We also present a third experiment on OpenML. The protocol is the same as explained in the paper, except that we select the 21 datasets for which π < 0.05. The correlation matrix in Figure 10 presents the results.

D Real-world use-case of calibration: fraud detection systems
In this appendix, we consider a real dataset from a credit card fraud detection system, which fits the scenario described in the paper. We reuse the example presented in Figure 1, where we observed that both the model's performance in terms of AUC-PR and the fraud ratio decrease. This example comes from a private dataset similar to the credit card fraud detection dataset available on Kaggle. Let us take a look at the calibrated metric. To set $\pi_0$, we recommend avoiding the arbitrary value 0.5 in order to preserve the behavior of precision: $\pi_0$ has to be fixed to a value that makes sense for the application at hand. A straightforward strategy is to fix it with expert knowledge, such as the average proportion of fraudulent transactions in historical data (in our case, $\pi_0 = 0.0031$). Figure 11 presents the comparison between AUC-PR and AUC-P$_c$R. On the left figure, we can see that the two metrics differ when the fraud ratio π is far from the reference $\pi_0$. As expected, when π is higher (resp. lower) than $\pi_0$, calibration reduces (resp. increases) the value of the precision (see how the difference between AUC-PR and AUC-P$_c$R is correlated with the variation of π in the right figure). That being said, although it could not be concluded with certainty from the original metric, the calibrated metric shows that the model is getting worse with time independently of π. With this in mind, we can start making hypotheses about the reasons for such behavior and take proper actions to correct future predictions.
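A calibrated area under the PR curve can be sketched as below. This is an assumption-laden illustration rather than the exact procedure used for Figure 11: the area is approximated with a step-wise sum (in the style of average precision), tied scores are processed together, and the function name and toy data are hypothetical.

```python
def calibrated_average_precision(y_true, scores, pi0=None):
    """Step-wise area under the (calibrated) precision vs recall curve.

    With pi0=None the weight is 1 and the regular area is returned.
    """
    n_pos = sum(y_true)
    pi = n_pos / len(y_true)
    if pi0 is None:
        pi0 = pi
    weight = pi * (1 - pi0) / (pi0 * (1 - pi))
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prev_recall = 0.0
    area = 0.0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Consume all instances sharing the same score (one threshold).
        while i < len(order) and scores[order[i]] == current:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / n_pos
        if tp > 0:
            area += tp / (tp + weight * fp) * (recall - prev_recall)
        prev_recall = recall
    return area


# Hypothetical period 1: 3 frauds among 8 transactions.
y1 = [1, 0, 1, 0, 0, 1, 0, 0]
s1 = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
# Hypothetical period 2: same model behavior, but twice as many legitimate transactions.
y2 = y1 + [0, 0, 0, 0, 0]
s2 = s1 + [0.8, 0.6, 0.5, 0.3, 0.2]

regular_1 = calibrated_average_precision(y1, s1)
regular_2 = calibrated_average_precision(y2, s2)
calibrated_2 = calibrated_average_precision(y2, s2, pi0=3 / 8)
```

The regular area drops in the second period only because the fraud ratio dropped, while the area calibrated at the ratio of the first period is unchanged, which is the kind of separation sought when monitoring a model over time.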

Figure 1: Evolution of the fraudulent transactions ratio (π) and the AUC-PR of the model over time.

Figure 2: ROC, PR and PR gain curves for the same model evaluated on an extremely imbalanced test set from a fraud detection application (π = 0.003, in the top row) and on a balanced sample (π = 0.5, in the bottom row).

Figure 3: Illustration of the impact of π on precision, recall, and the false positive rate.

Figure 4: Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated versions as π decreases. For every value of π, $10^6$ data points are generated from (7) with $\mu_1 = 2$ and $\mu_0 = 1.8$, so that the observed class ratio is approximately equal to the Bernoulli parameter π, and the metrics are evaluated on the optimal model defined in (1). We arbitrarily set $\pi_0 = 0.5$ for the calibrated metrics. The curves are obtained by averaging results over 30 runs.

Figure 5: Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated versions as $KL(p_1, p_0)$ tends to 0 and as π randomly varies. These curves were obtained by averaging results over 30 runs.

Figure 6: Spearman rank correlation matrices between 10 metrics, over 30 models and 614 datasets for the left figure, and over 30 models and 4 highly imbalanced datasets for the right figure. The supervised binary classification datasets and model predictions used are all available in OpenML.

Figure 8: Comparison between heuristic-based calibration and closed-form calibration applied to AUC-PR on real-world data.

Figure 9: CDF of the minority class ratio π in OpenML binary classification datasets.

Figure 10: Spearman rank correlation matrices between 10 metrics over 30 models and 21 datasets. The supervised binary classification datasets and model predictions used are all available in OpenML.

Figure 11: On the left, the performance of a model over time on the fraud detection task in terms of AUC-PR and AUC-P$_c$R. On the right, the normalized difference between AUC-PR and AUC-P$_c$R against the value of π in the test sets.
with π in the order of $10^{-3}$.