Adaptive Temperature Scaling for Robust Calibration of Deep Neural Networks

In this paper, we study the post-hoc calibration of modern neural networks, a problem that has drawn a lot of attention in recent years. Many calibration methods of varying complexity have been proposed for the task, but there is no consensus about how expressive these should be. We focus on the task of confidence scaling, specifically on post-hoc methods that generalize Temperature Scaling; we call these the Adaptive Temperature Scaling family. We analyse expressive functions that improve calibration and propose interpretable methods. We show that when there is plenty of data, complex models like neural networks yield better performance, but they are prone to fail when the amount of data is limited, a common situation in certain post-hoc calibration applications like medical diagnosis. We study the functions that expressive methods learn under ideal conditions and design simpler methods with a strong inductive bias towards these well-performing functions. Concretely, we propose Entropy-based Temperature Scaling, a simple method that scales the confidence of a prediction according to its entropy. Results show that our method achieves state-of-the-art performance and, unlike complex models, is robust against data scarcity. Moreover, our proposed model enables a deeper interpretation of the calibration process.



Introduction
There is an increasing trend in using Deep Neural Networks (DNNs) to automate a multitude of tasks, like image classification for healthcare [1] and speech recognition [2], among others. Some of these are high-risk applications; for example, a False Negative in cancer detection could be fatal for the patient. To this end, it is of paramount importance to use reliable Machine Learning (ML) systems that acknowledge the uncertainty of their predictions. A probabilistic classifier that outputs a confidence value, or probability, for each class allows one to make Bayes decisions, i.e. optimal decisions that leverage the cost of each decision [3].
The extent to which the confidence outputs of a classifier can be interpreted as class probabilities is what is known as the calibration of a classifier [4,5]. Modern DNNs achieve very low test error rates but are not necessarily well-calibrated [6,7]. Hence, the focus of the community is shifting towards improving the calibration of DNNs.
One approach to obtain better confidence estimates is to average the predictions of different models, using ensembles [8] or taking a Bayesian approach [9]. Data Augmentation techniques have also been used to improve calibration [10,11], as well as modified training objectives [12,13]. Among popular approaches, and the focus of this work, is post-hoc calibration, in which the predictions of an already trained classifier are recalibrated. Typically, a new model, the calibrator, is trained on the outputs of the classifier evaluated on a held-out dataset. This approach is very convenient since one can use off-the-shelf ML systems that already present good test error rates, taking advantage of a plethora of work on DNNs. Deep Learning models have been widely adopted and usually offer a good solution for any Machine Learning task. For this reason, DNNs have become standard models with easy application via public frameworks like PyTorch [14] and TensorFlow [15]. With post-hoc calibration, we may still apply typical DNNs to high-risk tasks and benefit from their good error rates without over-confidence issues.
Probably the most popular post-hoc calibration method is Temperature Scaling (TS), proposed by [6]. It is a single-parameter model that re-scales the confidence predictions of DNNs by a temperature factor. The simplicity of this method, and the fact that in their experiments it seemed to perform better than more complex alternatives, led the authors to believe that the problem of re-calibration is inherently simple. However, recent alternatives based on expressive models like Bayesian Neural Networks [16] and Gaussian Processes [17] improve on TS, suggesting that re-calibration might be a more complex problem than previously assumed. Expressive models, though, can be more data-hungry and may require careful tuning when the amount of data is limited.
Based on the observation that miscalibration in modern DNNs is often caused by over-confidence [6,18], recent work proposes to learn calibration functions more complex than TS but from a constrained space. By imposing restrictions, like being Accuracy-preserving [19] and order-invariant [20], the authors force an inductive bias towards the desired calibration functions. This approach shows promising results, but it may still fail in low-data scenarios, especially when using over-parametrized models. This can be a huge limitation in tasks where data for calibration is usually scarce: in certain language recognition tasks [21] some languages can be underrepresented; likewise, it can be difficult to obtain training examples for the medical diagnosis of very rare diseases [22]. Hence, there is a need for calibration methods that achieve high performance with low data requirements.
To this end, choosing a model with a suitable inductive bias gives some advantages. First, the set of possible calibration functions that the model can learn, or hypothesis space, is reduced. This translates into an easier training objective. Moreover, if the bias is well-specified, the learned calibration function will be more robust against a lack of training data, and will better generalize to other data [23]. The quality of the inductive bias depends on the knowledge we have of the task at hand. For instance, the specific architecture of Convolutional Neural Networks (CNNs), based on convolution filters, explains their success on visual recognition tasks [24], even though by sharing weights the total number of parameters is reduced, thus limiting the learning capacity.

Contributions
Intending to gain knowledge about the specific task of calibrating modern DNNs, we provide a study of post-hoc adaptive calibration methods, with varying degrees of expressiveness and robustness, that lead to better calibration. This may help design models more resilient to data scarcity. We focus on the problem of confidence scaling, as the poor calibration of DNNs is mainly attributed to over-confidence [6].
To perform this study we focus on Adaptive Temperature Scaling (ATS) methods, a family of calibration maps that generalizes TS by making the temperature factor input-dependent, as proposed by Ding et al. [25]. However, those authors estimate the temperature factor as a function of the classifier input. ATS models, on the other hand, learn a temperature function that computes temperature factors directly from the output of the classifier. Within this family, we can compare several calibration methods that extend the expressiveness of TS in different ways.
We analyze and benchmark several calibration models, focusing on which temperature functions can lead to better calibration. Results show that highly parametrized methods achieve high performance when there is plenty of data, but also that these are doomed to failure in low-data scenarios. By exploiting the gained knowledge about the post-hoc calibration task, we develop Entropy-based Temperature Scaling (HTS), a method with a strong inductive bias that is robust to the size of the dataset and provides performance comparable to other state-of-the-art methods.
The rest of the paper is organized as follows. First, we introduce some theoretical background on the calibration task. Then, in Section 3, we introduce some post-hoc calibration methods and motivate their design. We also describe other existing techniques to which we compare our methods. In Section 4 we describe the performed experiments and show their results. Finally, in the last section, we give our conclusions and comment on possible future work.

Background
In this work we focus on the multi-class classification task. Let $x \sim X \in \mathcal{X}$ be the input random variable with associated target $y \sim Y \in \mathcal{Y}$, where $y = [y_1, y_2, \ldots, y_K]$ is a one-hot encoded label. The goal is to obtain a probabilistic model $f$ for the conditional distribution $P(Y \mid X = x)$. The model defines the function $f(x) = z$, with $x \in \mathcal{X}$ and $z \in \mathbb{R}^K$. The outputs $z$ of the model are known as logits, since they are later mapped to probability vectors via the softmax function:

$$q = \sigma_{SM}(z) = \frac{\exp(z)}{\sum_{k=1}^{K} \exp(z_k)},$$

where the exponential in the numerator is applied element-wise, and $q \in S^K$ is the corresponding probability vector. We use $S^K$ to denote the probability simplex in $K$ classes.
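As an illustrative sketch (ours, not code from the paper), the softmax mapping above can be written in NumPy; subtracting the maximum logit is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    """Map a logit vector z in R^K to a probability vector on the simplex S^K."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()
```

Note that shifting every logit by the same constant does not change the output, which is why logits are only defined up to an additive scalar.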
In practice there is no distribution $P(X, Y)$ (or we do not have access to it). Instead, we have a labeled data set $D$ of $N$ pair-realizations, $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, that is used to approximate it. For example, DNNs are normally trained by minimizing the expected value of some cost function. This expected value is computed from the empirical distribution induced by placing a Dirac delta at each point $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$.

Calibration
A probabilistic classifier is said to be well-calibrated whenever its confidence predictions for a given class match the chances of that class being the correct one [4,5]. We can express this property as an equation in terms of the probability distributions introduced earlier:

$$P(y \mid q) = q,$$

where $P(y \mid q)$ represents the relative class frequency, i.e. the proportion of each class among all samples for which the classifier predicts $q$.
From this expression, it is easy to derive a measure of miscalibration, or Calibration Error (CE):

$$CE_d = \mathbb{E}_{q}\left[\, \big\| q - P(y \mid q) \big\|_d \,\right],$$

that is, the expected value of the $d$-norm of the difference between prediction vectors and the relative class proportions. While this equation is useful to illustrate the concept of miscalibration, it does not provide a feasible way to measure it. Our main problem is the unknown $P(X, Y)$: we cannot compute the expected value with respect to it directly, although the empirical distribution given by the labelled set $D$ can be used to draw Monte Carlo samples. The main limitation is evaluating $P(y \mid q)$, for which there is no simple way using this empirical distribution. Therefore, further approximations are required to estimate the miscalibration of a classifier.

ECE
The most popular metric used to estimate the Calibration Error is the Expected Calibration Error (ECE) [6,26]. This metric uses a histogram approach to model $P(y \mid q)$ and considers only top-label predictions. The samples of a given evaluation set $D_{test}$ are partitioned into $M$ bins $B_1, B_2, \ldots, B_M$ according to the confidence of their top prediction:

$$B_i = \left\{ q : \tfrac{i-1}{M} < \max_k q_k \leq \tfrac{i}{M} \right\}.$$

Then the ECE is computed as:

$$ECE = \sum_{i=1}^{M} \frac{|B_i|}{N} \left| \mathrm{acc}(B_i) - \mathrm{conf}(B_i) \right|,$$

where $|\cdot|$ denotes the number of samples in a set, $\mathrm{acc}(B_i)$ is the accuracy of the classifier evaluated only on $B_i$, and $\mathrm{conf}(B_i)$ is the mean confidence of the top-label predictions in $B_i$. Despite its popularity, this estimator can give unreliable results, as it is biased and noisy [27,28,29]. Many improvements over the ECE have been proposed to mitigate these problems, such as class-wise ECE and variable-width confidence intervals [29]. However, no binning scheme is consistently reliable [30]. Nevertheless, ECE remains the most popular metric used by the community to measure miscalibration, and we use it in our experiments for the sake of comparison.
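A minimal sketch of the equal-width-binning ECE estimator described above (our illustration; `probs` is an $N \times K$ array of predictions and `labels` the true class indices):

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    conf = probs.max(axis=1)                  # top-label confidence per sample
    pred = probs.argmax(axis=1)               # predicted class per sample
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |B_i| / N  *  | acc(B_i) - conf(B_i) |
            total += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return total
```

For example, two predictions with confidence 0.9 of which only one is correct give an ECE of |0.5 − 0.9| = 0.4.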

Proper Scoring Rules
One way to implicitly measure calibration is to use Proper Scoring Rules (PSRs). Any PSR can be decomposed into the sum of two terms [31]: a refinement term and the so-called reliability, or calibration, term. Thus, when evaluating the goodness of a classifier with a Proper Scoring Rule, one is also indirectly measuring calibration. The fact that the calibration component cannot be evaluated in isolation is what drives the community to use approximate metrics like ECE. Moreover, different PSRs may rank the same set of systems, evaluated on the same data, differently. Nevertheless, PSRs provide a theoretically grounded way of measuring the goodness of a classifier. Throughout this work, we use two well-known PSRs to evaluate models: the log-score, or Negative Log-Likelihood (NLL), and the Brier score [32].

Post-hoc Calibration
Ideally, a model $f$ trained on some data $D$ would generalize and show good calibration properties when evaluated on other data $D_{test}$, assuming both sets are reasonably similar. However, many classification systems turn out to be badly calibrated in practice; for instance, Convolutional Neural Networks (CNNs) tend to produce overconfident predictions [6,18]. Moreover, in some tasks, it cannot be guaranteed that the training data is similar enough to the actual data on which the model will be deployed. For instance, a language recognition system may be trained on broadcast narrowband speech (BNBS) data but applied in a telephone service where the audio characteristics are different. To solve this problem, one common approach is that of post-hoc calibration, in which a function is applied to the outputs of the model. This function can be seen as a decoupled classifier that learns to map uncalibrated outputs to calibrated ones, i.e. $q \mapsto \hat{q}$. We use the $\hat{\cdot}$ notation to denote the calibrated prediction. The standard practice is to fit this calibration map, or calibrator, on a held-out data set $D_{val}$, or validation data, that is supposed to resemble the data on which the model will make predictions.
Many post-hoc calibration methods take prediction logits as input instead of the final probability vectors. Notice that this does not limit their applicability, since the outputs $q$ of a probabilistic model can be mapped to the logit domain through the logarithm, $z = \log q + k$, where $k$ is an arbitrary scalar value.

Accuracy-preserving Calibration
Modern classification systems achieve very low test error rates, and their miscalibration is attributed mainly to over-confidence, i.e. predicted confidences that call for higher accuracy rates than those actually obtained. Under this assumption, it is reasonable to constrain the calibration transforms so that the predicted ranking over the classes is maintained. This condition is known as Accuracy-preserving [19], because functions that meet it do not change the top-label prediction: $\arg\max \hat{q} = \arg\max q$.
When using expressive, unconstrained classification models like DNNs for the task of calibration, it is possible to improve calibration at the cost of losing accuracy [16,20]. This trade-off is avoided by restricting the calibration functions to be Accuracy-preserving, so that the class decision, left to the classifier, is decoupled from the confidence estimation of each decision.
In this work, we compare only Accuracy-preserving methods, which avoids a potential problem often encountered in the calibration task. Since miscalibration is measured in isolation, accuracy must also be considered to evaluate calibrators. This poses the question of which calibrator is better: one that improves calibration more but degrades the accuracy, or one that does not degrade the accuracy but improves calibration less. This decision is often application-dependent, but it can be circumvented by using an Accuracy-preserving method.

Temperature Scaling
Temperature Scaling (TS) is probably the most widely used post-hoc calibration approach in the literature. It belongs to the family of Accuracy-preserving methods. It scales the output logits by a temperature factor $T_0$:

$$\hat{q} = \sigma_{SM}\!\left(\frac{z}{T_0}\right).$$

This factor is obtained by minimizing the NLL on some validation data consisting of predictions of the uncalibrated classifier. Since the NLL is a Proper Scoring Rule, TS is encouraged to improve calibration. Consequently, the temperature factor conveys information about the level of over-confidence in these predictions. A high temperature $T_0 > 1$ flattens the logits so the probability vectors approach the uniform distribution $q = [1/K, 1/K, \ldots, 1/K]$, thus relaxing the confidences and fixing over-confidence. Conversely, a low temperature $T_0 < 1$ sharpens the confidence values, moving the top-label predictions towards 1 and the others towards 0, hence fixing under-confidence.
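As a hedged sketch of this procedure (ours, not the paper's implementation), TS can be fitted with a coarse 1-D grid search over the NLL; the grid search is an illustrative stand-in for the BFGS optimizer the paper uses in its experiments:

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax for an (N, K) logit matrix."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, T):
    """Negative log-likelihood of the temperature-scaled predictions."""
    lq = log_softmax(logits / T)
    return -lq[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Pick the T0 minimizing validation NLL over a coarse grid."""
    grid = np.linspace(0.05, 10.0, 400)
    return float(min(grid, key=lambda T: nll(logits, labels, T)))
```

On over-confident validation predictions (confidence higher than accuracy), the fitted temperature comes out above 1, flattening the confidences as described above.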

Methods
In this section, we first describe the Adaptive Temperature Scaling family and illustrate it by proposing some methods of our own contribution. Then, we introduce other Accuracy-preserving methods, not necessarily of the ATS family, with state-of-the-art performance, which we use as benchmarks in the experiments.

The Adaptive Temperature Scaling family
We refer to the group of Accuracy-preserving maps that generalizes Temperature Scaling as the ATS family. All ATS methods can be expressed as the calibration function:

$$\hat{q} = \sigma_{SM}\!\left(\frac{z}{T(z)}\right),$$

where $T : \mathbb{R}^K \to \mathbb{R}^+$ is the temperature function. This family generalizes Temperature Scaling by making the temperature factor input-dependent. TS is limited to the constant temperature function $T(z) = T_0$, where $T_0$ is the scalar parameter of the model. Hence, TS implicitly assumes that a classifier will generate predictions with the same level of over-confidence independently of the specific sample being classified.
On the other hand, a general ATS method computes a different temperature factor for each prediction via the temperature function $T(z)$. The computed factor for some $z$ estimates the degree of over-confidence of the corresponding prediction $q = \sigma_{SM}(z)$. Hence, ATS methods acknowledge the possibility that a classifier's over-confidence may depend on the samples being classified.
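The general form of the family can be sketched as follows (our illustration, with a hand-set constant temperature standing in for plain TS):

```python
import numpy as np

def softmax(z):
    e = np.exp(np.asarray(z, dtype=float) - np.max(z))
    return e / e.sum()

def ats_calibrate(z, temperature_fn):
    """General ATS map: divide the logits by a prediction-dependent temperature."""
    return softmax(np.asarray(z, dtype=float) / temperature_fn(z))

# Plain TS is the special case of a constant temperature function:
ts_fn = lambda z: 2.0
```

Any choice of `temperature_fn` returning a positive scalar yields an Accuracy-preserving calibrator, since dividing the logits by a positive scalar cannot change their ranking.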
The input-dependent property was first exploited by Ding et al. [25] with their Local Temperature Scaling method. However, this approach relies on the classifier input $x$ to estimate a temperature factor, $T_x = T(x)$. An ATS method estimates the factor from the classifier output instead, $T_x = T(z)$, thus separating the calibration step further from the original classification task. The former approach tries to learn for which inputs the classifier is likely overconfident; ATS is independent of the classification task and is only concerned with estimating the overconfidence of an already made prediction. In other words, Local TS should be tailored to each classification task (for instance, if the input is audio one might use an RNN, but choose a CNN instead for images), whereas the input space of ATS methods is always the logit domain, so these are more likely to generalize across classification tasks.
We acknowledge that this may reduce the potential expressive power of ATS, since $z$ is a processed version of $x$. Nevertheless, we believe that such a constraint is not necessarily limiting since, as we show in our experiments, the logit vector of a prediction already conveys information about its degree of miscalibration. Moreover, one advantage of post-hoc methods is the decoupling of the classification step from the calibration step, which is in some sense lost if the original classifier input is required for calibration.

Proposed Methods
We introduce three different ATS methods based on simple temperature functions. These functions are theoretically motivated and interpretable, so we can empirically validate the use of more expressive calibration transforms. First, we note that, to meet the positivity constraint on the temperature factor, we apply the softplus function to our models' outputs:

$$\mathrm{softplus}(a) = \log\left(1 + e^{a}\right).$$

Linear Temperature Scaling
We call this method Linear Temperature Scaling (LTS) since it is based on a linear combination of the logit vector; its temperature function is given by:

$$T(z) = \mathrm{softplus}\left(w_L^\top z + b\right),$$

where $w_L \in \mathbb{R}^K$ and $b \in \mathbb{R}$ are the learnable parameters of the model. The weight vector $w_L$ takes into account the score assigned to each class to determine the level of over-confidence. Hence, LTS can predict higher temperature factors for certain predicted classes than for others. The scalar parameter $b$ allows LTS to recover the base TS by zeroing the $w_L$ parameter.
We motivate this method with the following example: an uncalibrated classifier can make over-confident predictions for only certain classes. Since LTS weights each component of the logit vector to obtain the temperature factor, it should be able to raise (shrink) the factor by increasing (decreasing) the weight component $w_{L,i}$ depending on whether the classifier is more (less) likely to make an over-confident prediction when predicting class $i$.
From this follows the interpretation of the method: after fitting LTS on a validation set, the weight vector will point towards the direction of the highest degree of overconfidence in the logit space.
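The LTS temperature function above can be sketched as follows (our illustration; the parameters are set by hand rather than fitted):

```python
import numpy as np

def softplus(a):
    """Numerically stable softplus, log(1 + exp(a))."""
    return max(a, 0.0) + np.log1p(np.exp(-abs(a)))

def lts_temperature(z, w_L, b):
    """LTS temperature function: affine in the logits, softplus for positivity."""
    return softplus(float(np.dot(w_L, z)) + b)
```

Setting `w_L` to zero makes the temperature constant, recovering plain TS; a positive weight on a class raises the temperature when that class's logit is large.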

Entropy-based Temperature Scaling
Motivated by the fact that the entropy of the predictive class distribution can be interpreted as the uncertainty of the prediction, we propose HTS. The temperature function of this method is given by:

$$T(z) = \mathrm{softplus}\left(w_H \log \bar{H}(z) + b\right),$$

where $\bar{H}(z) = H(\sigma_{SM}(z)) / \log K$ is the normalized entropy, and $w_H \in \mathbb{R}$ and $b \in \mathbb{R}$ are the learnable parameters of the model. We normalize the entropy so that it is always upper-bounded by 1 irrespective of the number of classes, which lets us carry the interpretation of $w_H$ across tasks with different numbers of classes. We apply the logarithm to the entropy because, as we show later in the experiments, the temperature shows a linear trend with the logarithm of the entropy. We give $b$ the same interpretation as in the previous model. The parameter $w_H$ determines how much the predictive uncertainty of a prediction, i.e. $\log \bar{H}(z)$, influences the computed temperature factor: the higher the magnitude of $w_H$, the more variability we can expect in the computed factors. On the other hand, a model with $w_H \to 0$ recovers the base TS.
The ECE metric and over-confidence evaluation [18] are tasks that consider only the confidence value assigned to the top-rated class. This value represents the class probability estimated by the classifier; while it is a confidence value, it does not represent the 'confidence' of the classifier in the whole prediction, only in the predicted class in particular. Conversely, the entropy of the predictive distribution is a measure of uncertainty of the whole prediction, i.e. an alternative, more comprehensive way of assessing the classifier's confidence in a prediction.
For instance, we may have two predictions $q^{(i)} = [0.6, 0.2, 0.2]$ and $q^{(j)} = [0.6, 0.4, 0.0]$ in a 3-class problem. Both assign the same confidence 0.6 to class 1, but it is clear that $q^{(i)}$ is a higher-entropy predictive than $q^{(j)}$, i.e. it is a more uncertain prediction.
Again, we motivate the method with a hypothetical example. Suppose that we have a classifier that produces predictions with variable degrees of over-confidence. One way in which a prediction logit can convey information about its level of over-confidence is via its entropy. That is, for two predictions with the same predicted confidence, we may assume that the more uncertain of the two, i.e. the higher-entropy prediction, is more likely to be over-confident, since it reports the same value of confidence despite its higher uncertainty.
This model makes a strong assumption about the level of over-confidence in a prediction: mainly, that it can be expressed as a simple linear function of the log-entropy. The resulting model is easy to train, since the set of possible calibration functions, or hypothesis space, is comparatively limited. However, its performance is completely conditioned on the assumption being met. We provide experiments validating the model in Section 4.
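The HTS temperature function above can be sketched as follows (our illustration; parameters hand-set, not fitted):

```python
import numpy as np

def normalized_entropy(z):
    """Entropy of softmax(z) divided by log K, so it lies in (0, 1]."""
    z = np.asarray(z, dtype=float)
    q = np.exp(z - z.max()); q /= q.sum()
    H = -np.sum(q * np.log(np.clip(q, 1e-12, None)))
    return H / np.log(len(q))

def hts_temperature(z, w_H, b):
    """HTS temperature function: affine in the log normalized entropy."""
    a = w_H * np.log(normalized_entropy(z)) + b
    return max(a, 0.0) + np.log1p(np.exp(-abs(a)))  # softplus for positivity
```

A uniform prediction has normalized entropy 1, so its log vanishes and the temperature reduces to $\mathrm{softplus}(b)$ regardless of $w_H$; sharper predictions have lower normalized entropy and are scaled differently.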

Combined system
Finally, we propose HnLTS, a model that combines the previous two with a single temperature function given by:

$$T(z) = \mathrm{softplus}\left(w_L^\top z + w_H \log \bar{H}(z) + b\right),$$

where $w_L \in \mathbb{R}^K$, $w_H \in \mathbb{R}$, and $b \in \mathbb{R}$ are the learnable parameters, to which we give the same interpretation as above.
The motivation behind this model is to increase the expressiveness of the system in a controlled way, to see how this affects its performance and training procedure compared to the simpler methods. The hypothesis space of this method is a combination of the previous two, so it should be able to recover the solution of either one. However, we argue that the increased hypothesis space also makes the model more difficult to train, with higher data requirements.
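The combined temperature function can be sketched as follows (our illustration, merging the two terms above; parameters hand-set):

```python
import numpy as np

def hnlts_temperature(z, w_L, w_H, b):
    """HnLTS temperature: linear-logit term (LTS) plus log-entropy term (HTS)."""
    z = np.asarray(z, dtype=float)
    q = np.exp(z - z.max()); q /= q.sum()
    H_bar = -np.sum(q * np.log(np.clip(q, 1e-12, None))) / np.log(len(q))
    a = float(np.dot(w_L, z)) + w_H * np.log(H_bar) + b
    return max(a, 0.0) + np.log1p(np.exp(-abs(a)))  # softplus for positivity
```

Zeroing `w_L` recovers the HTS solution and zeroing `w_H` the LTS solution, which is the sense in which the hypothesis space contains both.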

Baseline Methods
We now describe other Accuracy-preserving methods already existing in the literature with state-of-the-art performance. Some of these, but not all, belong to the ATS family, as they can be expressed in the general form given in Section 3.1.

Parametrized Temperature Scaling
Parametrized Temperature Scaling (PTS) [13] is a specific instance of the ATS family in which the temperature function is a neural network (NN). The input to the NN is the logit vector sorted by decreasing confidence, $z_s$. Sorting the logit vector makes the model order-invariant [20], simplifying the hypothesis space at the cost of losing the ability to discriminate between classes, i.e. it cannot consider the predicted ranking over the classes. PTS can be expressed as an ATS method with temperature function:

$$T(z) = \mathrm{NN}(z_s),$$

where NN is the function defined by the neural network. Instead of optimizing the parameters of the NN to minimize some PSR as other methods do, the authors propose to minimize an ECE-based loss:

$$\mathcal{L} = \sum_{i=1}^{M} \frac{|B_i|}{N} \left| \mathrm{acc}(B_i) - \mathrm{conf}(B_i) \right|,$$

where $B_i$, $\mathrm{conf}(B_i)$, and $\mathrm{acc}(B_i)$ are defined as in the ECE of Section 2. During training, samples are re-partitioned into the bins $B_i$ at each loss evaluation, since the confidences are re-scaled differently.
In their experiments, the authors always use the same architecture, a Multi-Layer Perceptron (MLP) with two 5-unit hidden layers, and limit the input size of the network to the 10 highest confidence values whenever the number of classes is greater than 10. We use the same architecture in our experiments.
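A hedged sketch of this temperature function (ours, with random untrained weights purely to illustrate the sorted-input, order-invariant design; the real model is trained as described above):

```python
import numpy as np

def pts_temperature(z, params, n_in=10):
    """PTS-style temperature: an MLP on the logits sorted in decreasing order,
    truncated (or zero-padded) to the n_in highest values."""
    zs = np.sort(np.asarray(z, dtype=float))[::-1][:n_in]
    zs = np.pad(zs, (0, n_in - len(zs)))
    h = zs
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on the hidden layers
    a = float(h[0])
    return max(a, 0.0) + np.log1p(np.exp(-abs(a)))  # softplus keeps T positive

# Illustrative (untrained) weights for a 10-5-5-1 architecture:
rng = np.random.default_rng(0)
demo_params = [(rng.normal(size=(10, 5)), np.zeros(5)),
               (rng.normal(size=(5, 5)), np.zeros(5)),
               (rng.normal(size=(5, 1)), np.zeros(1))]
```

Because the logits are sorted before entering the network, permuting the classes of a prediction cannot change the computed temperature.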

Bin-Wise Temperature Scaling
Bin-Wise Temperature Scaling (BTS) [33] is a histogram-based method that applies a different temperature factor to each bin of the histogram. First, samples are partitioned into $N$ bins according to their top-label confidence. The authors force a high-confidence bin that ranges from 0.999 to 1; the samples with predicted confidence below 0.999 are partitioned into the other $N - 1$ intervals such that each bin contains the same number of samples.
This method can also be included in the ATS family. The temperature function in this case is simply a look-up table that assigns the corresponding temperature factor to the input confidence value.
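Viewed as an ATS temperature function, the look-up can be sketched as follows (our illustration; the bin edges and temperatures here are hand-set, whereas the paper's scheme uses equal-mass bins plus the dedicated (0.999, 1] bin):

```python
import numpy as np

def bts_temperature(z, edges, temps):
    """BTS as an ATS temperature function: a look-up table over the top-label
    confidence. edges are the interior bin boundaries, with
    len(temps) == len(edges) + 1."""
    z = np.asarray(z, dtype=float)
    q = np.exp(z - z.max()); q /= q.sum()
    i = int(np.searchsorted(edges, q.max(), side='left'))
    return temps[i]
```

The temperature function is piecewise constant in the confidence, so within a bin every prediction is scaled identically.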

Ensemble Temperature Scaling
Ensemble Temperature Scaling (ETS) [19] obtains a new prediction as a convex combination of the uncalibrated probability vector, a maximum-entropy (uniform) vector, and the temperature-scaled vector:

$$\hat{q} = w_1\, \sigma_{SM}(z) + w_2\, \frac{1}{K}\mathbf{1} + w_3\, \sigma_{SM}\!\left(\frac{z}{T_{ETS}}\right), \quad \text{subject to } w_1 + w_2 + w_3 = 1,\ w_i \geq 0,$$

where $w_1, w_2, w_3$ are the learnable weights of the convex combination and $T_{ETS}$ is the temperature parameter of the TS component. All the parameters are optimized en bloc to minimize some PSR. This method is Accuracy-preserving and also an extension of the standard TS; however, it does not belong to the ATS family. This can be verified by noting that ATS methods compute, for each logit vector, a single scalar temperature factor that applies equally to every entry, whereas the transformation induced by ETS effectively scales each component of the logit vector by a different temperature factor.
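The combination can be sketched as follows (a hedged illustration with hand-set weights; in the method itself the weights and temperature are fitted by minimizing a PSR):

```python
import numpy as np

def softmax(z):
    e = np.exp(np.asarray(z, dtype=float) - np.max(z))
    return e / e.sum()

def ets_calibrate(z, w1, w2, w3, T):
    """Convex combination of the uncalibrated prediction, the uniform
    (maximum-entropy) prediction, and the temperature-scaled prediction."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9 and min(w1, w2, w3) >= 0.0
    K = len(z)
    return (w1 * softmax(z)
            + w2 * np.full(K, 1.0 / K)
            + w3 * softmax(np.asarray(z, dtype=float) / T))
```

Each component of the mixture preserves the class ranking (the uniform term is constant), so the convex combination is Accuracy-preserving.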

Experiments
We present two rounds of experiments. First, we report a study of the proposed methods that motivates their design and shows ways in which calibration performance improves with model complexity. With this, we give evidence that the logit vector conveys information about its degree of over-confidence, motivating the design of calibration methods that take this into account. Then, we compare our methods with other state-of-the-art Accuracy-preserving calibration techniques in different dataset-size settings to assess their robustness to data scarcity.

Setup

Datasets and tasks
We refer to model-dataset pairs as calibration tasks. So a task is composed of the predictions of a model, for instance a ResNet-101 [34], on a specific dataset, like CIFAR-100 [35]. Every dataset is partitioned into three splits: train, validation, and test. The model of each task is trained using the train set and is then used to generate predictions on the validation and test sets. We evaluate a calibration method on a certain task using the following procedure: first, we fit the calibration method using the predictions on the validation set; then, we apply it to the test-set predictions and compute metrics over these.

Training details
We use NLL as the optimization objective to fit calibrators. Additionally, in all tasks, we fit a second version of the PTS method minimizing the ECE-based loss instead (see Section 3.3). All methods except TS and ETS are implemented in PyTorch [14] and optimized using Stochastic Gradient Descent (SGD) with an initial learning rate of $10^{-4}$, Nesterov momentum [36] of 0.9, and a batch size of 1000. We reduce the learning rate on plateau by a factor of 10 until it reaches $10^{-7}$, at which point we stop training and consider the algorithm converged. The standard TS is optimized with SciPy [37] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
To calibrate with ETS we use the code released by its authors [19].
All experiments are run 50 times with different random initializations, and the results are averaged across runs. For the experiments in which a subsampled validation set is used, this set is resampled at each run but kept consistent across calibration methods. That is, for each of the 50 runs we sample an $N$-sized validation set and use it to fit all the calibration methods in the comparison.

Analysis of the ATS methods
For the first round of experiments, we calibrate a ResNet-50 [34] on CIFAR-10 [35] with the proposed interpretable ATS methods, Entropy-based TS (HTS) and Linear TS (LTS), and discuss each separately.

Linear TS: Introducing class dependence
With this experiment, we aim to illustrate the example that we gave to motivate the LTS method: that LTS can adapt to a classifier that makes more or less over-confident predictions depending on which class it predicts as correct.
We divide the test set of predictions according to their true class and compute for each subset the optimum temperature factor, obtained by optimizing TS on each group. Then, we use the LTS model optimized on the validation set to compute a temperature factor for every test prediction. Finally, we represent in Figure 1 the average of these factors per subset against the optimum temperature. For reference, we include the TS temperature factor learned on the validation set (dashed orange line).
From Figure 1 we notice that the classifier does produce more overconfident predictions for some classes than for others, even in a curated and well-balanced dataset such as CIFAR-10. We can expect this effect to be even more present in real-life applications, in which the prevalence of classes may vary and some distribution mismatch between development and production data can be expected. LTS exploits this difference between classes and manages to adapt the temperature factor in each subset, closely matching the optimum.

Entropy-based TS: Leveraging uncertainty of predictions
Our motivation for the HTS method is that the level of over-confidence in a prediction is related to its entropy. If our hypothesis is correct, then, for the same confidence in the predicted class, higher-entropy predictions like q(i) should be more over-confident on average. Consequently, we expect higher temperature factors for higher-entropy predictions.
In Figure 2 we depict the temperature function learned by HTS on the validation set. We train two models, one with the full validation set, plotted in a darker shade, and the other using a random subset of 200 samples. We also plot the optimum temperature factor estimated on the test set for different ranges of normalized entropy: we partition the log-domain of the normalized entropy into equally spaced bins, divide the test samples according to this binning scheme, and for each bin estimate the optimum temperature factor given by TS. In a second experiment, we show the temperature factor that PTS assigns to each prediction on the test set (see Figure 3). With this plot, we aim to see whether a very expressive method like PTS learns any relation between the entropy of a prediction and its temperature factor.
We find that, at least in this particular task, there exists a positive relation between the entropy of a prediction and its level of over-confidence. Figure 2 shows that a linear function is a fair approximation to the relation between entropy and temperature, and that HTS manages to capture it even with little data.
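The entropy-temperature link that HTS exploits can be sketched as follows. The log-linear form T = exp(w * log(Hn) + b), with Hn the normalized entropy, is our assumption, chosen to match the roughly linear trend in the log-domain of Figure 2; the paper's exact 2-parameter link function may differ.

```python
import math
import torch
import torch.nn as nn

class EntropyTS(nn.Module):
    """Sketch of Entropy-based TS (HTS): a 2-parameter model whose
    temperature depends on the prediction's normalized entropy, via the
    assumed log-linear form T = exp(w * log(Hn) + b). At w = b = 0 the
    model is the identity (T = 1)."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(1))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        p = torch.softmax(z, dim=-1)
        h = -(p * torch.log(p.clamp_min(1e-12))).sum(-1)  # Shannon entropy
        hn = h / math.log(z.shape[-1])  # normalized entropy in (0, 1]
        t = torch.exp(self.w * torch.log(hn.clamp_min(1e-12)) + self.b)
        return z / t.unsqueeze(-1)
```

With w > 0, higher-entropy predictions receive higher temperatures, which is exactly the behaviour our hypothesis predicts for over-confident high-entropy predictions.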
Figure 3 shows that the much more expressive PTS also captures this linear relationship when given enough data. However, with limited data, it fails to do so. Moreover, in Figure 4 we plot, for all samples in the test set, the temperature factors given by PTS against those given by HTS, both methods fitted using all validation samples. The plot shows that when data is plentiful the functions learnt by the two methods are reasonably similar, suggesting that the function space of HTS, despite being much more constrained, contains well-performing solutions similar to those learnt by PTS.

Benchmarking
In this section, we compare the performance of the proposed ATS methods (LTS, HTS, and HnLTS) with state-of-the-art accuracy-preserving methods: TS, ETS, BTS, and PTS. We fit two versions of PTS: one trained to minimize the NLL, the calibration objective we use to train every method, and a second version optimizing the ECE-based loss instead, as reported in [13] (see Section 3.3). We refer to the former as PTS and the latter as PTSe, where the 'e' stands for the ECE-based objective.

Results
For the sake of space and simplicity, we depict results for each dataset averaged across models (e.g., the average ECE of HTS on all CIFAR-10 tasks). We defer detailed results to Appendix A. Results are shown in Figure 5. We normalize each metric by the performance of TS, as we consider it the main benchmark, and report performance in terms of normalized ECE and normalized NLL. For each method, we plot five markers whose size increases with the size of the validation set; from smallest to biggest these are N = (200, 500, 1000, 5000, 10000). The y-axis position of the marker indicates the mean value across tasks, where each task corresponds to calibrating a different NN architecture.
We first point out that almost all models outperform the simple TS when there is enough data (big markers), although, on average, there are no big differences between models. However, when data is scarce, all the highly parametrized models show severe performance degradation and only ETS and HTS provide consistent performance. Moreover, HTS provides better results in most of the individual tasks, while ETS barely outperforms the baseline TS.
It is also worth noting the difference between datasets. On CIFAR-100, with its larger number of classes, there is a greater advantage in using calibration methods more complex than TS. On the other hand, the best methods barely outperform TS on the CIFAR-10 tasks. This suggests that the problem of calibration may grow more complex with the number of classes, although the number of datasets in our experiments is too small to be representative and more experiments are required to validate this observation.
Interestingly, HnLTS fails in low-data scenarios, even though it could, in theory, recover the HTS solution by zeroing the w_L parameter. This suggests that increasing expressiveness can do more harm than good by complicating the training objective.

Conclusions
We have shown that post-hoc calibration of DNNs can benefit from models more expressive than the widely used Temperature Scaling, especially in tasks with a high number of classes. For instance, simply adjusting the temperature factor of TS with a linear combination of the logits improves calibration by taking into account the score assigned to each class. However, more complex models require larger amounts of data to find a well-performing solution. This poses a trade-off between the complexity of the calibration model and the data available to train it. In many real-world tasks, the data available for re-calibration is limited, which hinders calibration with a complex model.
By analysing the calibration functions learned by expressive models with plenty of data, we can design simpler models with a strong inductive bias towards similar calibration functions. In this work, we have introduced HTS, a 2-parameter model that scales predictions according to their entropy. The temperature factors estimated by PTS, a much more expressive model, follow the same linear relation with the predictive entropy that HTS implicitly assumes. HTS shows calibration performance comparable to that of more expressive methods under ideal data conditions but, unlike other methods, it is robust to data scarcity. Moreover, an important feature of the model is that it is interpretable, characterizing the link between a prediction's uncertainty and its over-confidence.
With this work, we motivate the study of expressive methods as a way to design practical models with a suitable inductive bias. As a first approach, we propose using a hand-designed low-parameter model to achieve this bias. In future work, we plan to explore other ways of inducing the desired bias, for instance via prior specification in a Bayesian inference setting. This option may allow training higher-capacity models while remaining robust to data scarcity.

Appendix A. Results
In this section, we provide tables with the results for each model-dataset task. Additionally, we give average performance across the tasks in each dataset, normalized by that of the uncalibrated model.
Results of ECE (M = 50), NLL, and Brier score using the whole validation set are shown in Tables A.1, A.2, and A.3. Tables A.4, A.5, and A.6 show average results using 5000 validation samples, randomly chosen at each experiment run, to calibrate the models. Equivalently, Tables A.7, A.8, and A.9 show the same results using 1000 validation samples, and Tables A.10, A.11, and A.12 show average results using 500 validation samples. Lastly, Tables A.13, A.14, and A.15 show average results using 200 validation samples.

Figure 1 :
Figure 1: Mean predicted Temperature (blue) against optimum temperature factor (green) for each class on the test set.

Figure 2 :
Figure 2: Temperature function of HTS fitted using 200 (light blue) and 10000 (dark blue) validation samples and optimal temperature on the test set (green).

Figure 3 :
Figure 3: Temperature factors of PTS for test samples fitted using 200 (light orange) and 10000 (dark orange) validation samples and optimal temperature on the test set (green).

Figure 4 :
Figure 4: Temperature factor computed by PTS against temperature factor computed by HTS for test set predictions. The black dotted line represents the one-to-one relation.

Figure 5 :
Figure 5: Average results on the CIFAR-10 (left) and CIFAR-100 (right) tasks for all calibration methods, in terms of ECE (top) and NLL (bottom) normalized by the performance of TS.

Table A.1: ECE (M = 50). Models are denoted by their architecture and depth (and width if applicable).
Table A.2: NLL. Models are denoted by their architecture and depth (and width if applicable).
Table A.3: Brier Score. Models are denoted by their architecture and depth (and width if applicable).

Table A.5: NLL using 5000 validation samples. Models are denoted by their architecture and depth (and width if applicable).
Table A.6: Brier Score using 5000 validation samples. Models are denoted by their architecture and depth (and width if applicable).

Table A.9: Brier Score using 1000 validation samples. Models are denoted by their architecture and depth (and width if applicable).

Table A.11: NLL using 500 validation samples. Models are denoted by their architecture and depth (and width if applicable).

Table A.12: Brier Score using 500 validation samples. Models are denoted by their architecture and depth (and width if applicable).

Table A.13: ECE (M = 50) using 200 validation samples. Models are denoted by their architecture and depth (and width if applicable).
Table A.14: NLL using 200 validation samples. Models are denoted by their architecture and depth (and width if applicable).
Table A.15: Brier Score using 200 validation samples. Models are denoted by their architecture and depth (and width if applicable).