A comprehensive theoretical framework for the optimization of neural network classification performance with respect to weighted metrics

In many contexts, customized and weighted classification scores are designed in order to evaluate the quality of the predictions produced by neural networks. However, there exists a discrepancy between the maximization of such scores and the minimization of the loss function in the training phase. In this paper, we provide a complete theoretical setting that formalizes weighted classification metrics and then allows the construction of losses that drive the model to optimize these metrics of interest. After a detailed theoretical analysis, we show that our framework includes as particular instances well-established approaches such as classical cost-sensitive learning, weighted cross entropy loss functions and value-weighted skill scores.


The supervised classification problem via neural networks
Neural networks are among the state of the art in many machine learning classification tasks across a huge variety of contexts [6], such as, e.g., medical diagnostics [15], forecasting problems [10,7], and image classification [21]. Thanks to their flexibility, they have been adapted and shaped for the solution of many different complex problems. In this paper we focus on their usage in addressing the classical supervised learning problem, which can be described by using the following notation. Let X ⊂ Ω, Ω ⊂ R^m, be a training set of data in some space of dimension m ∈ N≥1, and let Y be a finite set of d labels or classes to be learned, where d ∈ N≥2. In multiclass tasks, each element in X is uniquely assigned to a class in Y, and it is common for these labels to be integer or one-hot encoded [11]. Differently, in the so-called multilabel framework, each element in X can be assigned to more than one class in Y.
Hence, the supervised classification problem consists in constructing a function ŷ on Ω by learning from the labeled data set X, so that ŷ models the data-label relation between elements in Ω and labels in Y. In the context of neural networks, such a function is characterized by a vector (or matrix) of weights θ, i.e., ŷ = ŷθ(x), x ∈ Ω. These weights are set by minimizing a certain loss function ℓ(ŷθ(x), y) that measures the possible discrepancy between the prediction given for x and its true label y ∈ Y. More precisely, the minimization process is carried out on the training set, that is, we consider the task

min_θ ℓ(ŷθ(x), y), x ∈ X,   (1)

where possible regularization terms on the weights of the network can also be added. Beyond a good performance on the training set, the model is expected to generalize in predicting unseen test samples in Ω. Although several loss functions have been proposed with the aim of accounting for the specific properties of the problem under analysis [14,17,27], the most common choice of loss function is the Cross Entropy (CE) and its generalizations [5,26], which is provided with a robust theoretical background originating in information theory.
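As a concrete illustration of the training task (1) (a minimal sketch on synthetic data; all names are ours, and a one-layer sigmoid model stands in for a generic network ŷθ), plain gradient descent on the binary cross entropy decreases the training loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set X (n samples, m features) with binary labels y.
n, m = 200, 2
X = rng.normal(size=(n, m))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A minimal "network": y_hat_theta(x) = sigmoid(x . theta), weights theta.
theta = np.zeros(m)

def bce_loss(theta):
    p = sigmoid(X @ theta)
    # Binary cross entropy, the standard training loss mentioned in the text.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

loss_before = bce_loss(theta)
for _ in range(500):          # plain gradient descent on the training set
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / n  # gradient of the BCE with respect to theta
    theta -= 0.5 * grad
loss_after = bce_loss(theta)
```

Beyond driving down this training loss, the fitted model would then be evaluated on held-out samples, as the text notes.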

Classical and weighted scores
After the neural network model weights are calibrated in the training phase, the classification results are evaluated by means of metrics, also known as scores, that are chosen according to the framework and are usually built from the elements of the so-called confusion matrix (CM), which provides a consistent description of the quality and the type of the errors made by the constructed classifier. In this work, we mainly focus on the binary classification setting d = 2, where the CM can be expressed in its fundamental form involving a positive {y = 1} and a negative {y = 0} class, and the output of the network is such that ŷθ ∈ (0, 1), accordingly. The choice of the evaluation metric is crucial for the assessment of the predictions: for example, when the dataset is imbalanced, the ratio between the correct predictions and the total number of elements, that is, the accuracy, is not very meaningful, while other scores like the F1 score, the True Skill Statistic (TSS), or the Heidke Skill Score (HSS) are more appropriate for evaluating the goodness of predictions. Another perspective for evaluating the predictions consists in assigning a different impact to errors of different types: this falls into the cost-sensitive learning field [3]. In this framework, the evaluation of the predicted value is commonly carried out on the basis of preassigned costs for False Positives (FPs) and False Negatives (FNs). For example, in applications such as medical diagnosis or fraud detection, missing a positive instance is worse than incorrectly classifying an example from the negative class: therefore, a higher cost is assigned to FNs than to FPs by defining a suitable cost matrix [4,24]. Furthermore, in cases where binary predictions are performed over time, scores defined upon the classical CM do not take into account the distribution of the predictions over the actual outcomes: a false positive is counted as "one" independently of whether it represents a false alarm just before an actual occurrence or a false alarm given after an actual occurrence. The evaluation of the forecast in terms of its usefulness in supporting the user while making a decision is known in the literature as the forecast value. In [9], value-weighted skill scores have been introduced in order to take into account the severity of errors with respect to the distribution of predictions over time, with applications ranging from weather and space-weather forecasting to environmental problems [8,12].
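For reference, these scores can be computed directly from the four CM entries; the sketch below (function names are ours) uses the confusion matrix of Figure 1, TN = 15, TP = 5, FP = 4, FN = 2:

```python
# Scores computed from the confusion-matrix entries (TN, FP, FN, TP).
def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tp + tn + fp + fn)

def tss(tn, fp, fn, tp):
    # True Skill Statistic: sensitivity + specificity - 1.
    return tp / (tp + fn) - fp / (fp + tn)

def hss(tn, fp, fn, tp):
    # Heidke Skill Score: accuracy improvement over random chance.
    num = 2 * (tp * tn - fp * fn)
    den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return num / den

# The confusion matrix of Figure 1: TN = 15, TP = 5, FP = 4, FN = 2.
acc = accuracy(15, 4, 2, 5)
t = tss(15, 4, 2, 5)
h = hss(15, 4, 2, 5)
```

On this imbalanced example the accuracy (about 0.77) looks flattering, while the skill scores (TSS about 0.50, HSS about 0.46) are noticeably more severe, illustrating the point made above.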

Addressing the score-loss discrepancy
Loss minimization and score maximization are intertwined concepts. However, a direct maximization of a skill score of interest in the training phase is not recommended, as the score is usually discontinuous with respect to the weights of the network. This is due to the fact that the continuous predictions in the interval (0, 1) are assigned to the negative or to the positive class by assessing their relative value with respect to a fixed threshold τ ∈ (0, 1). In many situations, this threshold parameter is set to τ = 0.5, but it can be tuned a posteriori in order to maximize a chosen skill score. In order to deal with this discrepancy between loss minimization and score maximization, many empirical strategies have been designed in the literature to align the loss function with approximate metrics of interest [13,22,23]. In particular, in [18] a new class of Score-Oriented Loss (SOL) functions has been introduced. To build such losses, the threshold τ is treated not as a fixed value, but as a random variable provided with a certain a priori density function. Then, considering the expected value of the entries of the CM, it is possible to obtain an averaged score that is differentiable with respect to the weights of the model. During the training process, the averaged score maximization automatically leads to classifiers that are oriented to achieve results that are already optimal for the chosen score, the optimal threshold value being also driven by the a priori probability density function. We point out that the cost-sensitive learning approach proposed in [19,20] shares a similar spirit with [18], meaning that their common objective is optimizing the classification process with respect to weighted losses. However, on the one hand, in [20] the classifier is obtained by taking the argmax of the outputs of the network, possibly in a multiclass case, without considering thresholds that convert probability outcomes into 0-1 classification predictions. On the other hand, [18] focuses on the influence of the threshold value in a threshold-based classification process, and it is then suitable for a generalization to the multilabel case, as we discuss further in this paper.
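The discontinuity issue, and the effect of averaging over the threshold, can be seen in a few lines (a sketch with a uniform prior on τ; names are ours): the thresholded decision jumps as the prediction crosses τ, while the expected indicator E_τ[1_{ŷ>τ}] = F(ŷ) varies continuously with the prediction:

```python
import numpy as np

rng = np.random.default_rng(1)
taus = rng.uniform(0.0, 1.0, size=100_000)  # tau treated as a uniform random variable

def hard_positive(y_hat, tau=0.5):
    # Thresholded decision: a step function of y_hat, discontinuous at tau.
    return float(y_hat > tau)

def expected_positive(y_hat):
    # Monte Carlo estimate of E_tau[1_{y_hat > tau}] = F(y_hat) = y_hat (uniform cdf).
    return float(np.mean(y_hat > taus))

jump = hard_positive(0.501) - hard_positive(0.499)            # jumps from 0 to 1
drift = expected_positive(0.501) - expected_positive(0.499)   # changes only slightly
```

The same smoothing applies entrywise to the CM, which is what makes the averaged score usable as a training objective.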

Outline and contribution of this work
In this work, our first purpose is to provide a wide theoretical background for the formalization of weighted classification scores. This is done in Section 2, where, starting from classical CMs, we derive weighted CMs and scores. Then, in Section 3 we analyze the presented weighted matrices via a probabilistic approach, which is a crucial step that allows the definition of weighted SOL (wSOL) functions in Section 4; these losses are designed to promote the chosen weighted metrics directly in the training phase of the network. We then show the effectiveness of the proposed theoretical framework by analyzing some particular cases that are represented in the literature: classical cost-sensitive learning approaches in Section 5, weighted cross entropy loss functions in Section 6, and value-weighted skill scores in Section 7. The extension of the achieved results to the multilabel setting is outlined in Section 8. Finally, we draw some final remarks in Section 9.

From confusion matrices to (weighted) classification scores
To facilitate our analysis, we define S = S_n(θ) = {(ŷθ(x_i), y_i)}_{i=1,...,n} to be a batch of prediction-label pairs, in which y_i ∈ {0, 1} is the true label associated to the element x_i ∈ X. Letting τ ∈ (0, 1) be a threshold parameter, the classical confusion matrix is defined in terms of the indicator function 1 as

CM(τ, S_n(θ)) = [ TN(τ, S_n(θ))  FP(τ, S_n(θ)) ; FN(τ, S_n(θ))  TP(τ, S_n(θ)) ],   (2)

with entries

TP(τ, S_n(θ)) = Σ_{i=1}^n 1_{ŷθ(x_i)>τ} 1_{y_i=1},  FP(τ, S_n(θ)) = Σ_{i=1}^n 1_{ŷθ(x_i)>τ} 1_{y_i=0},
FN(τ, S_n(θ)) = Σ_{i=1}^n 1_{ŷθ(x_i)≤τ} 1_{y_i=1},  TN(τ, S_n(θ)) = Σ_{i=1}^n 1_{ŷθ(x_i)≤τ} 1_{y_i=0}.

Definition 1. A score s : M_{2,2}(N) → R is a function that takes in input the confusion matrix CM(τ, S_n(θ)) and gives a real number as output, which is non-decreasing with respect to TN(τ, S_n(θ)) and TP(τ, S_n(θ)) and non-increasing with respect to FN(τ, S_n(θ)) and FP(τ, S_n(θ)).
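A direct computation of these indicator-based entries (a sketch; the function name is ours):

```python
import numpy as np

def confusion_matrix(y_hat, y, tau):
    # Entries of CM(tau, S_n(theta)): predictions above tau are classified positive.
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    pos = y_hat > tau
    tn = int(np.sum(~pos & (y == 0)))
    fp = int(np.sum(pos & (y == 0)))
    fn = int(np.sum(~pos & (y == 1)))
    tp = int(np.sum(pos & (y == 1)))
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_matrix([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0], tau=0.5)
```

The four entries always sum to the batch size n, one count per prediction-label pair.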
In the following, our purpose is to extend the presented framework to include modified scores where false positives and false negatives are weighted. Proceeding in this way, we remark that we recover the classical CM in the case of perfect classification, as we do not modify the true positives and negatives. Therefore, a positive-valued weight function w is introduced in the FP and FN entries of the CM, obtaining

wFP(τ, S_n(θ)) = Σ_{i=1}^n w(·) 1_{ŷθ(x_i)>τ} 1_{y_i=0},  wFN(τ, S_n(θ)) = Σ_{i=1}^n w(·) 1_{ŷθ(x_i)≤τ} 1_{y_i=1}.   (3)

In general, this weight function w can depend on different inputs. We consider the following ones.
• The label y_i. The weight can treat the case y_i = 0 differently from the case y_i = 1, i.e., treat false negatives and false positives in different manners. For example, this turns out to be significant in medical diagnosis applications, where false negatives are more dangerous than false positives.
• The prediction ŷθ(x_i) and the threshold τ. Predictions can be handled according to their relative value with respect to the threshold. As an example, having fixed τ = 0.4, a false positive ŷθ(x_i) = 0.5 can be treated as a minor mistake compared to ŷθ(x_i) = 0.9.
• The batch S_n(θ). Indeed, this is the case where the CM can be provided with a probabilistic connotation by normalizing its entries [16].
• Past and future labels and predictions, in case the data are chronological. Letting T ∈ N, we can consider as inputs of the weight function the vectors

y_i^+ = (y_{i+1}, ..., y_{i+T}),   1_i^- = (1_{ŷθ(x_{i−1})>τ}, ..., 1_{ŷθ(x_{i−T})>τ}).   (4)

Note that while in (3) the elements of the batch S_n(θ) are not required to be in chronological order, we assume a chronological order in (4), meaning that we are looking at the time interval [i − T, i + T] centered at time i. This little abuse allows us to keep a simpler notation. This framework is particularly suitable to formalize situations in which positive predictions are interpreted as alarms; then, for example, a missed alarm (false negative) may be almost negligible if alarms were produced by close past samples. We discuss this setting further in Section 7.
Therefore, the weight function may assume the form w = w(τ, y_i, ŷθ(x_i), S_n(θ), y_i^+, 1_i^-).
We denote as wCM(τ, S_n(θ)) the weighted confusion matrix where FP and FN are replaced by wFP and wFN.

Definition 2. A weighted score s_w : M_{2,2}(N) → R is a score s that takes in input the weighted matrix wCM(τ, S_n(θ)), i.e., s_w = s(wCM(τ, S_n(θ))).

Expected confusion matrices
In the following, we consider the non-weighted case CM(τ, S n (θ)) as a special case of wCM(τ, S n (θ)) where the weight function is set to w ≡ 1.
Assume that a certain weighted score s_w has been designed in order to assess the performance of a classifier. As discussed in the introductory section, s_w is discontinuous with respect to the weights of the neural network, and cannot be directly used in constructing a loss function. In particular, we cannot expect regularity with respect to the threshold τ because of the indicator function. To deal with this problem, in what follows we leverage the approach effectively carried out in [18] with classical CMs.
The first crucial step is to let τ be a continuous random variable whose probability density function (pdf) f is supported in [a, b] ⊆ [0, 1], a, b ∈ R. We denote as F the corresponding cumulative density function (cdf),

F(t) = ∫_a^t f(ξ) dξ.

This ensures the possibility of averaging the entries of wCM with respect to the threshold, thus replacing the irregular indicator function with the regular cdf F. Indeed, we recall that the expected value of the indicator function behaves as

E_τ[1_{ŷθ(x_i)>τ}] = F(ŷθ(x_i)),   E_τ[1_{ŷθ(x_i)≤τ}] = 1 − F(ŷθ(x_i)).

Hence, we consider the expected value of the matrix wCM(τ) = wCM(τ, S_n(θ)) with respect to τ, meaning the matrix whose entries are the expected values of the entries of wCM(τ). As far as the true positives and negatives are concerned, we get

E_τ[TP(τ)] = Σ_{i : y_i=1} F(ŷθ(x_i)),   E_τ[TN(τ)] = Σ_{i : y_i=0} (1 − F(ŷθ(x_i))),

where we used the linearity of the expected value. Then, under the assumptions of the previous section, we obtain

E_τ[wFP(τ)] = Σ_{i : y_i=0} W_P(x_i),   E_τ[wFN(τ)] = Σ_{i : y_i=1} W_N(x_i),   (5)

where W_P(x_i) = E_τ[w(·) 1_{ŷθ(x_i)>τ}] and W_N(x_i) = E_τ[w(·) 1_{ŷθ(x_i)≤τ}], and we highlighted the dependence of the indicator functions on the threshold. We conclude this section with the definition of admissible weight function and score.
Definition 3. Let w be a weight function. We say that w is admissible if W_P and W_N are differentiable with respect to the weights θ of the neural network. Accordingly, we say that the weighted score s_w is admissible if so is w.
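The averaged entries can be checked numerically; the sketch below uses the uniform pdf on [0, 1], for which F(t) = t, and compares the closed form of E_τ[TP] with a Monte Carlo average over sampled thresholds (data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
y_hat = np.array([0.9, 0.6, 0.4, 0.2])
y = np.array([1, 0, 1, 0])

# With the uniform pdf on [0, 1], F(t) = t, hence
# E_tau[TP] = sum over positives of F(y_hat_i), E_tau[FN] = sum of 1 - F(y_hat_i).
exp_tp = float(np.sum(y_hat[y == 1]))
exp_fn = float(np.sum(1.0 - y_hat[y == 1]))

# Monte Carlo check: average the indicator-based TP count over sampled thresholds.
taus = rng.uniform(size=200_000)
tp_counts = np.sum((y_hat[None, :] > taus[:, None]) & (y[None, :] == 1), axis=1)
mc_tp = float(np.mean(tp_counts))
```

Note that E_τ[TP] + E_τ[FN] equals the number of positive samples, as the two expected entries split each positive count between them.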

Weighted score-oriented losses
If w is admissible according to Definition 3, the classification metric can finally be employed in the construction of a score-oriented loss function.

Definition 4. Let S_n(θ) be a batch of predictions-labels and let s_w be an admissible weighted score. We call weighted SOL (wSOL) function related to s_w the loss

ℓ_{s_w}(S_n(θ)) = −s(E_τ[wCM(τ, S_n(θ))]).

With the following result, we highlight the benefit of minimizing the loss ℓ_{s_w} in the training phase.

Theorem 1. Recalling the optimization problem in (1), we have

min_θ ℓ_{s_w}(S_n(θ)) ≈ −max_θ E_τ[s(wCM(τ, S_n(θ)))];   (6)

more precisely:
1. if the score s is linear with respect to the entries of the weighted CM, then equality is achieved in (6);
2. if the score s is analytic and non-linear with respect to the entries of the weighted CM, then equality is achieved in (6) up to derivative terms (see (7)).
Proof. If the score s is linear with respect to the entries of the weighted CM, we can exploit the linearity of the expected value and obtain

s(E_τ[wCM(τ, S_n(θ))]) = E_τ[s(wCM(τ, S_n(θ)))],

which implies min_θ ℓ_{s_w}(S_n(θ)) = −max_θ E_τ[s(wCM(τ, S_n(θ)))]. Now, assume s is analytic with respect to the entries of the weighted CM. Then, using the abridged notations s_w = s(wCM(τ, S_n(θ))) and s̄_w = s(E_τ[wCM(τ, S_n(θ))]), we can leverage the Taylor formula (see e.g. [2])

E_τ[s_w] = s̄_w + Σ_{|α|≥2} (D^α s)(E_τ[wCM(τ, S_n(θ))]) E_τ[(wCM(τ, S_n(θ)) − E_τ[wCM(τ, S_n(θ))])^α] / α!,   (7)

where the entries of the weighted CM are vectorized in R^4 and α ∈ N^4 denotes the classical multi-index notation for derivatives of multivariate functions.
The result in Theorem 1 certifies that the model is effectively driven to promote the score of interest directly in the training phase. Furthermore, since the expected value is calculated with respect to the chosen pdf for τ, the network is steered to optimize the score with respect to the threshold values that are taken into account by the pdf.

Remark 1. The proposed framework can include a multi-objective setting where different scores are considered. Indeed, letting s_w^1, ..., s_w^m be m ∈ N weighted admissible scores and due to the linearity of the expected value, we may optimize the convex combination

Σ_{j=1}^m λ_j ℓ_{s_w^j}(S_n(θ)),   λ_j ≥ 0,  Σ_{j=1}^m λ_j = 1.
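The linear case of Theorem 1 can be verified numerically (a sketch; the linear score s(CM) = TP + TN − FP − FN is a made-up example): averaging the score over sampled thresholds coincides with evaluating it at the averaged CM:

```python
import numpy as np

rng = np.random.default_rng(3)
y_hat = np.array([0.9, 0.6, 0.4, 0.2])
y = np.array([1, 0, 1, 0])
taus = rng.uniform(size=20_000)

# Sampled confusion matrices, one row per threshold: columns (TN, FP, FN, TP).
pos = y_hat[None, :] > taus[:, None]
tp = np.sum(pos & (y == 1), axis=1); fp = np.sum(pos & (y == 0), axis=1)
fn = np.sum(~pos & (y == 1), axis=1); tn = np.sum(~pos & (y == 0), axis=1)
cms = np.stack([tn, fp, fn, tp], axis=1).astype(float)

def s_linear(v):
    tn, fp, fn, tp = v
    return tp + tn - fp - fn   # a score that is linear in the CM entries

lhs = np.mean([s_linear(v) for v in cms])  # E_tau[s(CM(tau))]
rhs = s_linear(cms.mean(axis=0))           # s(E_tau[CM(tau)])
```

For a non-linear score (a ratio such as the TSS), the two quantities would differ by the higher-order derivative terms of (7).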

Applications to cost-sensitive learning
In this section, we show how the proposed setting includes as an instance classical cost-sensitive learning approaches [3], where a cost matrix CC = (C_ij)_{i,j∈{0,1}} with non-negative entries is applied to the CM via the pointwise (or Hadamard) product CC ⊙ CM(τ, S_n(θ)). The matrix CC is employed to assign a different cost to each classification outcome.
In particular, in most applications the diagonal elements are set to zero, i.e., C_00 = C_11 = 0, or simply ignored [24,25]. The reason behind this choice is that the focus is put on weighting different errors in different manners, which turns out to be useful in cases where false negatives and false positives are not of equal importance, or when it is necessary to treat a strong imbalance in the dataset, as we outlined in the introduction. This approach is included in our theoretical framework by considering the admissible weight function

w_cost(y_i) = C_01 if y_i = 0,   w_cost(y_i) = C_10 if y_i = 1.

Then, by setting w = w_cost and computing the expected value with respect to τ, we get

E_τ[wFP(τ)] = C_01 E_τ[FP(τ)],   E_τ[wFN(τ)] = C_10 E_τ[FN(τ)],

as w_cost is independent of the threshold τ. Therefore, it is possible to consider a cost-sensitive score that relies on the weighted errors and derive the corresponding wSOL.
As an example, we can consider the score

s_cost(wCM(τ, S_n(θ))) = −(wFP(τ, S_n(θ)) + wFN(τ, S_n(θ))),   (8)

and then the loss

ℓ_{s_cost}(S_n(θ)) = E_τ[wFP(τ, S_n(θ))] + E_τ[wFN(τ, S_n(θ))],

where we highlighted the dependence on S_n(θ).
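A sketch of the resulting cost-sensitive wSOL, assuming the uniform pdf on [0, 1] (so F(t) = t) and a score that penalizes the sum of the weighted errors; the costs and data are made up for illustration:

```python
import numpy as np

def cost_sensitive_wsol(y_hat, y, c_fp, c_fn):
    """wSOL for the cost-sensitive weight (c_fn for y_i = 1, c_fp for y_i = 0),
    with a uniform pdf on [0, 1] for tau, so F(t) = t.  The score is assumed to
    be minus the sum of the weighted errors; the loss is its negation."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    exp_wfp = c_fp * np.sum(y_hat[y == 0])      # c_fp * sum of F(y_hat_i) over negatives
    exp_wfn = c_fn * np.sum(1 - y_hat[y == 1])  # c_fn * sum of 1 - F(y_hat_i) over positives
    return exp_wfp + exp_wfn

loss_sym  = cost_sensitive_wsol([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0], c_fp=1.0, c_fn=1.0)
loss_asym = cost_sensitive_wsol([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0], c_fp=1.0, c_fn=5.0)
```

Raising c_fn increases the loss contribution of low predictions on positive samples, steering the trained model away from false negatives.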

Application to weighted cross entropy losses
In the following, we show that the well-known (weighted) Cross Entropy (wCE) loss [1] can be included in our framework as a particular wSOL. To observe this, let us consider the following admissible weight function:

w_CE(τ, y_i) = ω_0/(1 − τ) if y_i = 0,   w_CE(τ, y_i) = ω_1/τ if y_i = 1,

where ω_0, ω_1 > 0 are weight parameters. We remark that w_CE leads to well-defined averages, as the prediction lies in (0, 1). Moreover, here we choose the uniform pdf on [0, 1] for the threshold τ, that is, f(ξ) ≡ 1. Therefore, by setting w = w_CE and referring to (5), we compute

W_P(x_i) = ∫_0^{ŷθ(x_i)} ω_0/(1 − ξ) dξ = −ω_0 log(1 − ŷθ(x_i)).

Similarly, we find W_N(x_i) = −ω_1 log(ŷθ(x_i)), and therefore

E_τ[wFP(τ)] = −ω_0 Σ_{i : y_i=0} log(1 − ŷθ(x_i)),   E_τ[wFN(τ)] = −ω_1 Σ_{i : y_i=1} log(ŷθ(x_i)).

Finally, taking again the score s_CE = s_cost (see (8)), we get the wCE loss

ℓ_{wCE}(S_n(θ)) = −Σ_{i=1}^n [ω_1 y_i log(ŷθ(x_i)) + ω_0 (1 − y_i) log(1 − ŷθ(x_i))].

Note that we can recover the classical binary CE by setting ω_0 = ω_1 = 1.
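The averages W_P and W_N can be checked numerically. The weight below, w_CE(τ, y_i) = ω_0/(1 − τ) for y_i = 0 and ω_1/τ for y_i = 1, is one admissible choice consistent with the stated logarithmic averages (an assumption of this sketch):

```python
import numpy as np

# Assumed weight consistent with W_P = -w0*log(1 - y_hat), W_N = -w1*log(y_hat):
# w_CE(tau, y=0) = w0 / (1 - tau),  w_CE(tau, y=1) = w1 / tau.
w0, w1, y_hat = 2.0, 3.0, 0.7
taus = np.linspace(1e-6, 1 - 1e-6, 2_000_001)  # uniform pdf f == 1 on [0, 1]
dtau = taus[1] - taus[0]

# W_P = E_tau[w_CE * 1_{y_hat > tau}] integrates the weight over tau < y_hat.
W_P = np.sum((w0 / (1 - taus)) * (y_hat > taus)) * dtau
# W_N = E_tau[w_CE * 1_{y_hat <= tau}] integrates the weight over tau >= y_hat.
W_N = np.sum((w1 / taus) * (y_hat <= taus)) * dtau
```

The Riemann sums agree with the closed forms −ω_0 log(1 − ŷ) and −ω_1 log(ŷ) up to discretization error, confirming that averaging this threshold-dependent weight reproduces the weighted cross entropy terms.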
Applications to value-weighted scores

On the severity of false and missed alarms
In classification problems with chronological data, the distribution of predictions along time with respect to the actual occurrences of events is not taken into account when the classical confusion matrix is computed. In Figure 1 we show two predictions which have the same CM, and therefore the scores, also referred to as skill scores in forecasting applications, have the same value. However, the two predictions are different from a forecasting-value viewpoint. Indeed, the prediction in the first panel may be preferred, since the missed events are anticipated by positive predictions, also called alarms, whereas in the second panel, for example, the first event is completely missed. The aim of value-weighted skill scores, which we detail in the next subsection, is then to differentiate the severity of errors by taking into account the sequential order. In such a way, the prediction in the first panel of Figure 1 will be associated with a higher value-weighted skill score than the one in the second panel, since the false positives and false negatives in the first prediction are less severe than the ones in the second.

Value-weighted scores
We therefore address the problem outlined in the previous subsection by defining a weight function that takes into account the value of the error with respect to the chronological sequence of data. First, letting T ∈ N, in this setting we consider the vectors (cf. (4))

y_i^+ = (y_{i+1}, ..., y_{i+T}),   1_i^- = (1_{ŷθ(x_{i−1})>τ}, ..., 1_{ŷθ(x_{i−T})>τ}),

which provide information on future labels and past predictions, ordered with respect to the time distance from the present referring index i. Exploiting the theoretical framework proposed in the previous sections, and inspired by [9], we then restrict to the following structure for a value-weight function:

w_value(y_i, y_i^+, 1_i^-) = 1 − g(y_i^+) if y_i = 0,   w_value(y_i, y_i^+, 1_i^-) = 1 − g(1_i^-) if y_i = 1,

where we assume 0 ≤ g(·) < 1. Let us elaborate on the consequences of this formulation. Note that 0 < w_value(y_i, y_i^+, 1_i^-) ≤ 1, thus we are going to reward less severe errors rather than penalize the more severe ones. Furthermore, the weight depends on the future labels y_i^+ if y_i = 0 (false positive) and on the past alarms 1_i^- if y_i = 1 (false negative). This property agrees with the fact that a false alarm can be rewarded if it anticipates positive events occurring in the near future, while, on the other hand, a missed alarm can be rewarded if alarms have been raised in the near past.
In order to provide concrete examples and calculations for this setting, we focus on two particular formulations of the function g. In both cases, we make use of a positive weight vector ω = (ω_1, ..., ω_T) acting on y_i^+ and 1_i^-. We assume ω to be non-increasing with respect to the indexing, i.e., ω_j ≥ ω_{j+1}, 1 ≤ j ≤ T − 1. Indeed, we expect the outcomes that are closer to the referring index i to be more important than the ones approaching the extremes i + T or i − T, as they are more distant in time. We focus on the following cases.
1. Case g_prod(z) = ω · z, where · denotes the scalar product. With this structure, all the positive labels in the temporal window [i + 1, i + T] and all the alarms in [i − T, i − 1] influence the entries of the matrix, in a way that is ruled by the weight vector ω.
2. Case g_max(z) = max(ω ⊙ z), where ⊙ is the pointwise product between two vectors. In this situation, being ω non-increasing with respect to the indexing, only the closest positive label and the closest alarm play a role in determining the value of wFP and wFN, respectively.
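A small sketch of the two formulations (names are ours; the past alarms are passed already binarized for a fixed threshold, whereas in the text 1_i^- depends on τ):

```python
import numpy as np

def w_value(y_i, future_labels, past_alarms, omega, mode="prod"):
    """Value weight w = 1 - g(z): z is the future-label vector for a false
    positive (y_i = 0) and the past-alarm vector for a false negative (y_i = 1)."""
    z = np.asarray(future_labels if y_i == 0 else past_alarms, float)
    omega = np.asarray(omega, float)   # non-increasing: closer samples weigh more
    g = omega @ z if mode == "prod" else np.max(omega * z)
    return 1.0 - g

omega = [0.4, 0.2, 0.1]   # positive, non-increasing, with sum < 1 so 0 <= g_prod < 1
# A false alarm just before an event is mild: a nearby future label 1 shrinks the weight.
w_near = w_value(0, [1, 0, 0], None, omega)               # 1 - 0.4
w_far  = w_value(0, [0, 0, 1], None, omega)               # 1 - 0.1
# A missed alarm preceded by a recent alarm is mild under g_max as well.
w_miss = w_value(1, None, [1, 1, 0], omega, mode="max")   # 1 - max(0.4, 0.2)
```

As expected, the closer the mitigating outcome, the smaller the resulting error weight.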
Remark 2. Let w = w_value. Since 0 < w(·) ≤ 1, we have wFN ≤ FN and wFP ≤ FP. Moreover, scores are non-increasing with respect to false positives and false negatives, and thus s ≤ s_w for any batch S_n(θ).

wSOLs for value-weighted scores
The structure of wSOLs in the case w = w_value is in fact fully characterized by the formulations of the expected values of wFP and wFN, on which we focus. We obtain

E_τ[wFP(τ)] = Σ_{i : y_i=0} (1 − g(y_i^+)) F(ŷθ(x_i)),

because the weight of a false positive does not depend on the threshold τ. On the other hand, we have

E_τ[wFN(τ)] = Σ_{i : y_i=1} E_τ[(1 − g(1_i^-)) 1_{ŷθ(x_i)≤τ}],

where the dependence on τ enters also through the past alarms 1_i^-. In the following, we discuss separately the form of E_τ[wFN(τ)] depending on the weight formulations outlined in the previous section. We adopt the abridged notation ŷ_i = ŷθ(x_i).

Case g = g_prod
In this case, to satisfy the requirement g_prod(z) < 1 for every binary vector z, we need

Σ_{j=1}^T ω_j < 1, i.e., ∥ω∥_1 < 1.

As far as the false negatives are concerned, we get the following.

Theorem 2. Let ŷθ(x_{i−1}), ..., ŷθ(x_{i−T}) be the T predictions in chronological order before ŷθ(x_i), and let g = g_prod. By defining, for each i = 1, ..., n,

D_i(ŷ) = F(ŷ) − F(min{ŷ, ŷ_i}),

we have

E_τ[wFN(τ)] = Σ_{i : y_i=1} (1 − F(ŷ_i) − Σ_{j=1}^T ω_j D_i(ŷ_{i−j})).   (10)

Proof. We need to calculate the integral

E_τ[wFN(τ)] = Σ_{i : y_i=1} ∫_a^b (1 − Σ_{j=1}^T ω_j 1_{ŷ_{i−j}>ξ}) 1_{ŷ_i≤ξ} f(ξ) dξ.

For each j, it holds ∫_a^b 1_{ŷ_{i−j}>ξ} 1_{ŷ_i≤ξ} f(ξ) dξ = F(ŷ_{i−j}) − F(min{ŷ_{i−j}, ŷ_i}) = D_i(ŷ_{i−j}), which yields the thesis.

Let us comment on the achieved expression for E_τ[wFN(τ)]. In principle, the past predictions that play a role are the ones that are larger than the fixed threshold value. When applying the expected value, we lose the possibility of deciding which predictions are positive in terms of a threshold. Looking at (10), we observe that the prediction ŷ_i substitutes τ in establishing if past predictions are to be considered positive or not. Indeed, only past predictions whose value is larger than ŷ_i give a non-zero contribution in the expression of E_τ[wFN(τ)]. Finally, note that if ω ≡ 0 then w ≡ 1, and so E_τ[wFP(τ)] = E_τ[FP(τ)] and E_τ[wFN(τ)] = E_τ[FN(τ)], consistently with the non-weighted setting.
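For the uniform pdf, where F(t) = t, the expected false-negative entry under g_prod can be checked numerically against direct averaging over the threshold (a sketch for a single positive sample, with made-up values):

```python
import numpy as np

# Single sample i with y_i = 1, uniform pdf on [0, 1] so F(t) = t.
y_hat_i = 0.5
past = np.array([0.8, 0.3, 0.6])    # y_hat_{i-1}, y_hat_{i-2}, y_hat_{i-3}
omega = np.array([0.4, 0.2, 0.1])   # non-increasing, sum < 1

# Closed form: (1 - F(y_hat_i)) - sum_j omega_j * max(F(y_hat_{i-j}) - F(y_hat_i), 0);
# only past predictions larger than y_hat_i contribute.
closed = (1 - y_hat_i) - np.sum(omega * np.maximum(past - y_hat_i, 0.0))

# Direct averaging of (1 - omega . 1_{past > tau}) * 1_{y_hat_i <= tau} over tau.
taus = np.linspace(0.0, 1.0, 1_000_001)
w = 1.0 - (past[None, :] > taus[:, None]).astype(float) @ omega
numeric = float(np.mean(w * (y_hat_i <= taus)))
```

Here only the past predictions 0.8 and 0.6 exceed ŷ_i = 0.5, so only the first and third components of ω enter the closed form.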

Case g = g_max
In this case, we require max_{j=1,...,T} ω_j < 1, i.e., ∥ω∥_∞ < 1. First, we get

E_τ[wFP(τ)] = Σ_{i : y_i=0} (1 − g_max(y_i^+)) F(ŷ_i).

To analyze what happens for the false negatives, we need to fix some preliminary definitions. We make the following assumption.
Assumption 1. Let ŷ_{i−j}, j = 1, ..., T, be a prediction involved in the construction of the vector 1_i^-. We assume that ŷ_{i−j} ∈ (a, b), that is, it belongs to the support of the pdf chosen for the threshold.
The requirement in Assumption 1 is mild, as many pdfs are supported in the whole interval (0, 1) (e.g., the uniform one). Moreover, this requirement significantly simplifies the following presentation, and analogous results can be achieved in the more general case.
We proceed by introducing some necessary ingredients.

Definition 5. Let ŷ_{i−j}, j = 1, ..., T, be a prediction involved in the construction of the vector 1_i^-. We denote as power interval the interval I_j ⊆ [a, b] constructed as follows: ξ ∈ I_j if and only if
1. ŷ_{i−j} > ξ, and
2. ŷ_{i−s} ≤ ξ for every s < j, s ∈ {1, ..., T}.

In other words, I_j collects the threshold values for which ŷ_{i−j} is the most recent past prediction classified as positive.
This fact motivates the following.

Definition 6. For j = 2, ..., N, we denote ŷ_{i−k(j)} as the actual precursor of ŷ_{i−j}. We include the case j = 1 by setting ŷ_{i−k(1)} = a.
Let us present a couple of examples to clarify the setting.
We are now ready to express the expected value of the false-negatives entry of the weighted CM.

Theorem 3. Let ŷ_{i−1}, ..., ŷ_{i−T} be the T predictions in chronological order before ŷ_i, and let g = g_max. We have

E_τ[wFN(τ)] = Σ_{i : y_i=1} (1 − F(ŷ_i) − Σ_{j=1}^S (ω_{t_j} − ω_{t_{j+1}}) D_i(ŷ_{i−t_j})),   (11)

where t_1 < ... < t_S are the indices of the past predictions with non-empty power interval, D_i(·) was defined in Theorem 2, and we set ω_{t_{S+1}} = 0.
Proof. To calculate the integral

E_τ[wFN(τ)] = Σ_{i : y_i=1} ∫_a^b (1 − max_{j=1,...,T} ω_j 1_{ŷ_{i−j}>ξ}) 1_{ŷ_i≤ξ} f(ξ) dξ,

we make use of Definitions 5 and 6. Since ω is non-increasing with respect to the indexing, for a threshold value ξ ∈ I_j the maximum equals ω_j. Let us focus on the computation of

∫_{I_j ∩ [ŷ_i, b]} f(ξ) dξ = D_i(ŷ_{i−j}) − D_i(ŷ_{i−k(j)}),

where we assume k(j) = j in the case I_j = ∅, so that the corresponding contribution vanishes. By recalling that ŷ_{i−k(1)} = a, we can put the contributions together as

E_τ[wFN(τ)] = Σ_{i : y_i=1} (1 − F(ŷ_i) − Σ_{j=1}^T ω_j (D_i(ŷ_{i−j}) − D_i(ŷ_{i−k(j)}))).

By using the subset of predictions introduced in Section 7.5, this result can be rewritten as in (11), where we set ω_{t_{S+1}} = 0.

The result obtained in Theorem 3 shares a similar spirit with the one in Theorem 2, but some differences ought to be highlighted. In the g_max case, before averaging, the only prediction that plays a role is the alarm closest to the present index i. When applying the expected value, it is necessary for a past prediction to be larger than ŷ_i in order to give a non-zero contribution in the expression of E_τ[wFN(τ)]. However, while this property is not only necessary but also sufficient in the g_prod case, here we also need to refer to a hierarchy, in which only the predictions that dominate subsequent ones for some threshold values in the respective power interval are to be considered. This is observable in Definition 5, where indeed for a threshold ξ ∈ I_j the value ŷ_{i−j} acts as a temporary maximum with respect to the more recent predictions ŷ_{i−s} ≤ ξ, s < j. Furthermore, the discrepancy between adjacent elements of the weight vector ω severely rules the influence of the past predictions. A noteworthy particular case is ω ≡ 1, where we observe that

W_N(x_i) = 1 − F(ŷ_i) − ω_{t_{j⋆}} (F(ŷ_{i−t_{j⋆}}) − F(min{ŷ_{i−t_{j⋆}}, ŷ_i})),

being j⋆ = argmax_{j=1,...,S} ŷ_{i−t_j}.

Generalization to the multilabel framework
In this section, we discuss how our proposed theory can be extended to the multilabel classification setting. We recall that in the multilabel case each sample may belong to one or more classes, differently from the classical multiclass framework where each element of the dataset is associated to one class only. The generalization is based on a one-versus-rest approach: letting d ∈ N be the number of classes, we can consider d confusion matrices of size 2 × 2, wCM_1(τ_1, S_n(θ)), ..., wCM_d(τ_d, S_n(θ)), one for each class, where each wCM_i(τ_i, S_n(θ)) describes the classification outcomes, achieved with respect to a certain threshold value τ_i, in predicting class i against the union of the remaining classes. In this way, according to our theory, we can then consider the application of a classification metric s to such expected confusion matrices, that is

µ(s(E_{τ_1}[wCM_1(τ_1, S_n(θ))]), ..., s(E_{τ_d}[wCM_d(τ_d, S_n(θ))])),

where µ : R^d → R is a function such as, e.g., the average of its arguments, the weighted average, or the minimum. Then, we can define the loss function as outlined in Definition 4. Note that in this extension to the multilabel setting the threshold parameter still plays a crucial role, differently from the standard multiclass setting where the argmax function of the outputs of the network is taken into account.
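A sketch of the one-versus-rest construction (names are ours; the uniform pdf is assumed for each threshold τ_i, so the expected entries are linear in the predictions, and an accuracy-like score stands in for a generic s):

```python
import numpy as np

def expected_cm_onevsrest(Y_hat, Y, k):
    # Expected one-vs-rest CM for class k with the uniform pdf: F(t) = t.
    p, t = Y_hat[:, k], Y[:, k]
    tp = np.sum(p[t == 1]); fn = np.sum(1 - p[t == 1])
    fp = np.sum(p[t == 0]); tn = np.sum(1 - p[t == 0])
    return tn, fp, fn, tp

def multilabel_score(Y_hat, Y, score, mu=np.mean):
    # mu combines the d per-class scores (average here; weighted average or min also fit).
    d = Y.shape[1]
    return mu([score(*expected_cm_onevsrest(Y_hat, Y, k)) for k in range(d)])

def acc_like(tn, fp, fn, tp):
    # An accuracy-like score evaluated on the expected entries.
    return (tp + tn) / (tp + tn + fp + fn)

Y_hat = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.6]])   # two samples, d = 3 labels
Y     = np.array([[1,   0,   1  ], [0,   1,   1  ]])
val = multilabel_score(Y_hat, Y, acc_like)
```

Negating such a combined score would then give the corresponding multilabel wSOL, in the spirit of Definition 4.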

Conclusions
In this work, we first presented in detail a theoretical framework that formalizes weighted classification metrics. Then, we showed how tailored losses can be designed for the optimization of such weighted scores in the training phase of the neural network. Finally, we demonstrated the concreteness of the proposed setting by highlighting some of its particular instances, which are well-known approaches considered in the literature. Overall, the analysis carried out indicates that the constructed framework can serve as a theoretical background for future developments of research lines and applications involving weighted classification scores and dedicated loss functions.

Fig. 1 Two different binary predictions with the same confusion matrix: TN = 15, TP = 5, FP = 4 and FN = 2. The green bars are the binary true labels and the red dots are the binary predictions.