1 Introduction

Deep Neural Networks (DNNs) typically produce predictions that are not calibrated, meaning their predicted probabilities express confidence levels that are not matched by the actual accuracy. Calibration methods improve the calibration of predictive uncertainty both during model training [1,2,3,4,5] and post-hoc [6,7,8,9,10,11,12,13,14,15,16,17]. Even with such methods, models still tend to be over-confident under distribution shift because of the disparity between the source (or pre-shift) and target (or post-shift) domains [18,19,20,21]. Standard methods of domain adaptation and transfer learning [22] offer some but limited help because they focus on prediction accuracy rather than on calibration [23].

Table 1. Comparison between calibration generalization and related calibration paradigms

This issue is addressed in recent research on calibration across multiple domains [19,20,21, 24,25,26,27], where the goal is to obtain calibrated probabilities in the target domain(s) using information from the source domain(s). More precisely, these methods address several different but related tasks, which we propose to categorize as follows, building on the categorization by Wang et al. [28] (see Table 1): (1) single-domain calibration, i.e., the classical task of learning calibrated uncertainty estimates in a single domain without any shift involved; (2) multi-domain calibration, where the goal is to learn calibrated predictions for multiple domains by using some labeled data from each of these domains during learning; (3) calibration transfer or adaptation, where a calibration learned on a source domain is transferred (possibly losing calibration on the source domain) or adapted (preserving calibration on the source domain) to a target domain with the help of some labeled or unlabeled samples from the target domain during learning [19, 20, 25, 27]; and (4) calibration generalization, where no data are available from the target domain during the learning phase, and hence the model is faced with test data from a previously unseen domain, typically a variation of the seen domain(s) due to a slight distribution shift, some perturbations to the data, or a context change [21, 24].

We focus on calibration generalization, and in particular on the same scenario as Gong et al. [21], where the goal is to provide calibrated predictions in an unseen domain under the assumption of having access to the following resources: (a) a model trained on training (source) domains; (b) labeled data in several calibration domains, helping to prepare for the unseen domain; (c) access to the representation layer (i.e., latent space features) of the model, helping to relate test instances of the unseen domain to the training and calibration domains. In other words, the goal is very specific: having access to a model trained on source domains and to multiple calibration domains, how can we best prepare for unseen data by using, but not modifying, the representation layer of the model?

The work of Gong et al. [21] builds on Temperature Scaling (TS) [16], a simple and commonly used calibration method where a single temperature parameter adjusts the confidence level of the classifier. Instead of applying the same temperature to all instances, Gong et al. vary the temperature across instances. The idea is to first cluster the data of all calibration domains jointly and then learn a different temperature for each cluster. Clustering is performed in the representation space, i.e., on the activation vectors of the representation layer instead of the original features. On a test instance, either the temperature of the closest cluster is used (a method called Cluster NN), or a linear model fitted on cluster centers predicts the temperature to use (a method called Cluster LR).

These cluster-based methods have the following limitations. First, they rely on TS to offer a suitable family of calibration transformations, while several newer calibration methods with richer transformation families have been shown to outperform TS [1,2,3, 29]. Second, test-time inference requires additional computation outside the classifier itself, making the solution technically more complicated. While this second point is typically a minor problem, having a single classifier model that provides predictions on test data would still be advantageous.

Fig. 1. Modified head of the DNN. The network up to the representation layer is fixed; after the representation layer, two dense hidden layers (with dimensions (1024, 512) for DomainNet and (512, 128) for the other two datasets) and one dense logit layer (with size equal to the number of classes) are added. We use dropout (0.5), an L2 regularizer (0.01), and ReLU activation for all layers, with softmax on the last layer
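As a concrete illustration, the following Keras sketch builds such a modified head. The exact ordering of dropout relative to the dense layers, and all function and argument names (build_modified_head, repr_dim, hidden), are our illustrative assumptions rather than the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_modified_head(repr_dim, n_classes, hidden=(512, 128)):
    """Modified head of Fig. 1, trained on top of the frozen representation.

    hidden=(1024, 512) for DomainNet, (512, 128) for the other two datasets.
    """
    inputs = tf.keras.Input(shape=(repr_dim,))
    x = inputs
    for units in hidden:
        # Dense hidden layer with ReLU, L2 regularization (0.01), and dropout (0.5)
        x = layers.Dense(units, activation="relu",
                         kernel_regularizer=regularizers.l2(0.01))(x)
        x = layers.Dropout(0.5)(x)
    # Logit layer sized to the number of classes, with softmax output
    outputs = layers.Dense(n_classes, activation="softmax",
                           kernel_regularizer=regularizers.l2(0.01))(x)
    return tf.keras.Model(inputs, outputs)
```

For DomainNet, the head would be built with hidden=(1024, 512) and a 345-way output layer.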

To address these shortcomings, we propose a novel calibration method, CaliGen (Calibration by Generality Training), to improve calibration generalization to unseen domains. We insert additional fully-connected layers into the network on top of the representation layer and train this additional structure on the calibration domains (freezing the representations, i.e., the earlier part of the network; see the illustration in Fig. 1). We propose a custom objective function (see Sect. 4.1) to be used in this training process, which addresses both prediction accuracy and calibration error with a weighted combination of cross-entropy and KL-divergence. We call this process generality-training because its goal is to improve the model’s generalizability to unseen domains. Our major contributions include the following:

  • We propose a novel solution for the simultaneous generalization of a neural network classifier’s calibration and accuracy in a domain generalization setting.

  • We propose a novel secondary loss function to be used with cross-entropy to encourage calibration. We have tested it in a calibration generalization setting, but it may find further use cases in similar or different scenarios.

  • We provide a theoretical justification for the proposed algorithm and explain its advantage in better generalization.

  • We provide experimental results on real-world data to justify its advantage over existing calibration generalization methods. We show that our method generalizes better to unseen target domains with improved accuracy and calibration while maintaining well-calibrated probabilities in the source domains.

The rest of the paper is organized as follows: Sect. 2 discusses related work in calibration and multi-domain calibration. Sect. 3 covers the background required to understand the paper. We propose the method and give a theoretical explanation in Sect. 4. We describe the datasets and experimental set-up in Sect. 5. In Sect. 6, we discuss the results obtained on three datasets in comparison with other state-of-the-art (SOTA) methods, along with an ablation study, and in Sect. 7, we give concluding remarks.

2 Related Work

2.1 Calibration

Researchers have proposed many solutions to obtain well-calibrated predictions from neural networks. These solutions can be summarized into three categories. In the first category, the primary training loss is replaced or augmented with a term that explicitly incentivizes calibration; examples include the AvUC loss [1], MMCE loss [2], Focal loss [3, 4], and Cross-entropy loss with Pairwise Constraints [5]. Other examples include Mixup [30], Label Smoothing [31], and Label Relaxation [32], which can also be interpreted as modifying the loss function and have been shown to improve calibration.

While the above methods aim to achieve calibrated probabilities during training, solutions using post-hoc calibration methods fall into the second category, in which model predictions are transformed after training by optimizing additional parameters on a held-out validation set [6,7,8,9,10,11,12,13,14,15,16,17]. One of the most popular techniques in this category is TS [16]; however, it is ineffective under distribution shift in certain scenarios [18].

A third category of methods examines model changes such as ensembling multiple predictions [33, 34] or multiple priors [29].

2.2 Multi-domain Calibration

All the above works aim to be calibrated in a single domain or context (in-distribution data). Still, studies have shown that existing post-hoc calibration methods are highly overconfident under domain shift [23, 27]. In recent times, researchers have shifted their attention to calibration under distribution shift and multi-domain calibration [19,20,21, 24,25,26,27, 35, 36]. As calibration on multiple in-distribution domains, multi-domain calibration is also used in fairness [25, 35, 36]. Recent research on calibration adaptation considers the setting where labels are unavailable in the target domain [19, 20, 25, 27]. The more challenging task of calibration generalization, where no samples from target domains are available during training or calibration, is considered by [21, 24]. In particular, Wald et al. [24] extended Isotonic Regression to the multi-domain setting, taking the predictions of a trained model on validation data pooled from the training domains.

Gong et al. [21] used latent space features to predict the temperature to use in TS calibration. This work is the closest to ours, as it also seeks calibration generalization. The core idea is that instances with similar representations might require a similar temperature to achieve better-calibrated probabilities. They proposed Cluster-level Nearest Neighbour (Cluster NN), which clusters the features and calculates a temperature for each cluster; a target-domain instance is then scaled with the temperature of its assigned cluster. They also proposed Cluster-level Regression (Cluster LR), where a linear regression model trained on cluster centers predicts the target temperature. They use multiple domains to learn a calibration that generalizes better to unseen domains, while relying on the inferred temperature for probability calibration.

3 Background

3.1 Calibration

Consider a DNN \(\phi (.)\), parameterized by \(\theta \), which for any given input X predicts a probability for each of the K classes as a class probability vector \(\hat{P}\). The class with the highest probability is denoted \(\hat{Y}\), and the corresponding probability is known as the confidence, denoted \(\hat{Q}\). The classifier is defined to be perfectly confidence-calibrated [16] if \(\mathbb {P}(\hat{Y} = Y | \hat{Q} = q) = q, \forall q \in [0,1]\), where Y is the ground truth label. Calibration methods aim to adjust the confidence during or after training to achieve calibrated probabilities. Expected Calibration Error (ECE) [13, 16] quantifies the calibration of a classifier by grouping test instances into B bins of equal width based on predicted confidence values, where \(B_m = \{i \mid \hat{q}_i \in (\frac{m-1}{B},\frac{m}{B}]\}\) is the set of instances in the mth bin and \(\hat{q}_i\) is the predicted confidence for the ith instance. ECE is calculated as the weighted absolute difference between accuracy and confidence across bins:

$$\begin{aligned} \mathcal {L}_{\text {ECE}} = \sum _{m=1}^B \frac{|B_m|}{N} |\mathbb {A}(B_m) - \mathbb {Q}(B_m)|, \end{aligned}$$
(1)

where N is the total number of instances; for each bin \(B_m\), the accuracy is \(\mathbb {A}(B_m) = \frac{1}{|B_m|} \sum _{i \in B_m} 1(\hat{y}_i = y_i)\) and the confidence is \(\mathbb {Q}(B_m) = \frac{1}{|B_m|} \sum _{i \in B_m} \hat{q}_i\); and finally, \(\hat{y}_i\) and \(y_i\) are the predicted and actual label for the ith instance.
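For reference, a minimal NumPy sketch of Eq. (1) could look as follows; the function name and argument conventions are ours, and the default of 15 bins matches the bin size used in Sect. 6.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE of Eq. (1) with equal-width confidence bins ((m-1)/B, m/B]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = np.mean(predictions[in_bin] == labels[in_bin])   # A(B_m)
            conf = np.mean(confidences[in_bin])                    # Q(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)            # |B_m|/N weighting
    return ece
```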

Temperature Scaling Calibration. TS is a simple and popular calibration method [16] that produces calibrated probabilities once an optimal temperature \(T^* > 0\) has been found by minimizing the negative log-likelihood (NLL) loss as follows:

$$\begin{aligned} T^* = \mathop {\mathrm {arg\,min}}\limits _{T>0} \sum _{(\textbf{x}_v,\textbf{y}_v)\in \mathcal {D}_v} \mathcal {L}_{\text {NLL}} (\sigma (\textbf{z}_v / T), \textbf{y}_v), \end{aligned}$$
(2)

where \(\mathcal {D}_v\) is the validation set, \(\textbf{z}_v\) are the logit vectors (network outputs before the softmax) obtained from a classifier trained on \(\mathcal {D}_{tr}\) (the training set), \(\textbf{y}_v\) are the ground truth labels, and \(\sigma (.)\) is the softmax function. Calibrated probabilities are obtained by applying \(T^*\) on test logit vectors \(\textbf{z}_{ts}\) of the test set \(\mathcal {D}_{ts}\) as \(\mathbf {\hat{y}}_{ts} = \sigma (\textbf{z}_{ts} / T^*)\).
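A minimal sketch of fitting \(T^*\) with a bounded one-dimensional NLL minimization is given below; the helper names and the search bounds are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits_val, labels_val):
    """Solve Eq. (2): T* = argmin_{T>0} NLL(softmax(z / T), y) on the validation set."""
    def nll(T):
        p = softmax(logits_val, T)
        return -np.mean(np.log(p[np.arange(len(labels_val)), labels_val] + 1e-12))
    # Bounded 1-D minimization; the bounds (0.05, 20) are an illustrative assumption.
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Usage: T_star = fit_temperature(z_val, y_val); p_test = softmax(z_test, T_star)
```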

3.2 Standard Calibration-Refinement Decomposition

Following the notation of the previous section, let \(C = (C_1, \cdots , C_K)\) be the perfectly calibrated probability vector corresponding to the classifier \(\phi (.)\), where \(C=\mathbb {E}[Y\mid \hat{P}]\). Consider the expected loss of \(\hat{P}\) with respect to Y, i.e., \(\mathbb {E}[D(\hat{P}, Y)]\), where D is a proper scoring rule. According to the calibration-refinement decomposition [37, 38], the expected loss can be decomposed into the sum of the expected divergence of \(\hat{P}\) from C and the expected divergence of C from Y with respect to any proper scoring rule D as follows:

$$\begin{aligned} \mathbb {E}[D(\hat{P}, Y)] = \mathbb {E}[D(\hat{P}, C)] + \mathbb {E}[D(C, Y)] \end{aligned}$$
(3)

These two terms are known as the Calibration Loss (CL) and the Refinement Loss (RL). CL (\(\mathbb {E}[D(\hat{P}, C)]\)) is the loss due to the difference between the model’s estimated probability vector \(\hat{P}\) and the fraction of positive instances among those with the same output; better calibrated models have lower CL. RL (\(\mathbb {E}[D(C, Y)]\)) is the loss due to instances of multiple classes receiving the same score \(\hat{P}\).

As NLL decomposes into the sum of CL and RL, training a DNN with the NLL objective means putting equal importance on both parts. This motivates our custom modification of the loss function, which we describe next.

4 Calibration by Generality Training (CaliGen)

We aim to achieve better generalization of calibration and accuracy through generality-training of the classifier. To achieve the best of both, we propose a new loss function, the CaliGen loss, to be used with our approach. The primary objective of classifier training with NLL (i.e., the cross-entropy loss) is to increase classification accuracy. In contrast, we want the network to produce calibrated probabilities, which is hard to achieve by minimizing NLL [16]. To achieve this goal, we need an objective function that penalizes the model for producing uncalibrated probabilities.

4.1 CaliGen Loss Function

NLL loss can be expressed as follows [3]:

$$\begin{aligned} \mathcal {L}_{\text {NLL}}(\hat{P}, Y) = \mathcal {L}_{\text {KL}}(\hat{P}, Y) + \mathbb {H}[Y] \end{aligned}$$
(4)

where \(\mathcal {L}_{\text {KL}}(.)\) is the KL-divergence loss and \(\mathbb {H}[Y]\) is the entropy of Y, which is constant with respect to the prediction being optimized. Following Eq. (3), we can decompose the divergence in Eq. (4) as:

$$\begin{aligned} \mathcal {L}_{\text {NLL}}(\hat{P}, Y) = \mathcal {L}_{\text {KL}}(\hat{P}, C) + \mathcal {L}_{\text {KL}}(C, Y) + \mathbb {H}[Y], \end{aligned}$$
(5)

Our generalization objective is to put more emphasis on CL (\(\mathcal {L}_{\text {KL}}(\hat{P}, C)\)), that is, to obtain better calibration. In other words, decreasing the emphasis on RL (\(\mathcal {L}_{\text {KL}}(C, Y)\)) gives more importance to CL. Mathematically, we consider a new loss function as follows:

$$\begin{aligned} \mathcal {L}(\hat{P}, Y) = \mathcal {L}_{\text {KL}}(\hat{P}, C) + (1-\rho ) ( \mathcal {L}_{\text {KL}}(C, Y) + \mathbb {H}[Y]), \end{aligned}$$
(6)

where \(\rho \in [0, 1]\) is a hyperparameter, with higher values of \(\rho \) putting less emphasis on refinement and thus more emphasis on calibration. The problem in implementing such a loss function is that the perfectly calibrated probabilities C are unknown. Thus, we instead approximate them with the probability vector \(\hat{C}\) obtained by TS calibration, assuming that \(\mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) \approx \mathcal {L}_{\text {KL}}(\hat{P}, C)\). This assumption is justified as we first fit TS on the validation set and then use the same data to generate \(\hat{C}\). Given the above approximation, we can now add a negligible term \(\rho (\mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) - \mathcal {L}_{\text {KL}}(\hat{P}, C))\) to the loss function in Eq. (6):

$$\begin{aligned} \mathcal {L}(\hat{P}, \hat{C}, Y) =\,\,&\mathcal {L}(\hat{P}, Y) + \rho (\mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) - \mathcal {L}_{\text {KL}}(\hat{P}, C)) \nonumber \\ =\,\,&(1-\rho ) \mathcal {L}_{\text {KL}}(C, Y) + \mathcal {L}_{\text {KL}}(\hat{P}, C) \nonumber \\&+ \rho (\mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) - \mathcal {L}_{\text {KL}}(\hat{P}, C)) + (1-\rho ) \mathbb {H}[Y] \nonumber \\ =\,\,&(1-\rho ) \mathcal {L}_{\text {KL}}(C, Y) + (1-\rho ) \mathcal {L}_{\text {KL}}(\hat{P}, C) \nonumber \\&+ \rho \mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) + (1-\rho ) \mathbb {H}[Y] \nonumber \\ =\,\,&(1-\rho ) (\mathcal {L}_{\text {KL}}(\hat{P}, C) + \mathcal {L}_{\text {KL}}(C, Y) + \mathbb {H}[Y]) + \rho \mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) \nonumber \\ =\,\,&(1-\rho ) \mathcal {L}_{\text {NLL}}(\hat{P}, Y) + \rho \mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}), \end{aligned}$$
(7)

For \(\rho > 0\), the loss function in Eq. (7) decreases the emphasis on \(\mathcal {L}_{\text {NLL}}(\hat{P}, Y)\) and adds emphasis on \(\mathcal {L}_{\text {KL}}(\hat{P}, \hat{C})\), which is equivalent to reducing the gap between the distribution of predicted probabilities \(\hat{P}\) and the temperature-scaled calibrated probabilities \(\hat{C}\). We have \(\mathcal {L}(\hat{P}, \hat{C}, Y) \approx \mathcal {L}(\hat{P}, Y)\) (using the assumption \(\mathcal {L}_{\text {KL}}(\hat{P}, \hat{C}) \approx \mathcal {L}_{\text {KL}}(\hat{P}, C)\)); hence, by minimizing \(\mathcal {L}(\hat{P}, \hat{C}, Y)\), we are minimizing \(\mathcal {L}(\hat{P}, Y)\) with more emphasis on CL (\(\mathcal {L}_{\text {KL}}(\hat{P}, C)\)). We call this custom loss function \(\mathcal {L}(\hat{P}, \hat{C}, Y)\) the CaliGen loss.
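The final form of Eq. (7) is straightforward to implement as a custom loss. Below is a minimal Keras sketch, assuming (our convention, not the paper's) that each target row packs the one-hot label together with the TS-calibrated probability vector \(\hat{C}\):

```python
import tensorflow as tf

def caligen_loss(rho, n_classes):
    """CaliGen loss, Eq. (7): (1 - rho) * NLL(P_hat, Y) + rho * L_KL(P_hat, C_hat).

    y_true is assumed to pack [one-hot labels | calibrated probabilities C_hat],
    so each target row has 2 * n_classes entries.
    """
    cce = tf.keras.losses.CategoricalCrossentropy()
    kld = tf.keras.losses.KLDivergence()
    def loss(y_true, y_pred):
        y = y_true[:, :n_classes]       # ground-truth one-hot labels Y
        c_hat = y_true[:, n_classes:]   # TS-calibrated probability vectors C_hat
        # In the paper's notation, L_KL(P_hat, C_hat) = KL(C_hat || P_hat),
        # i.e., KLDivergence with c_hat in the "true distribution" slot.
        return (1.0 - rho) * cce(y, y_pred) + rho * kld(c_hat, y_pred)
    return loss
```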

4.2 Generality Training

We learn calibration by generality-training a trained network with a modified prediction head. We insert two more layers between the representation and logit layers of the trained network, as shown in Fig. 1. The two additional layers improve the ability to realize more complex functions, as demonstrated by the ablation study in Sect. 6.3. This modified head can be considered a separate Multi-Layer Perceptron (MLP) with two hidden layers whose input is the representation produced by the trained model. The CaliGen loss requires three vectors: the representation vector, the ground truth label vector, and the calibrated probability vector. Generality-training of this modified head is a two-stage task (sketched in code below): (i) We consider multiple domains as the calibration domain set \(\mathcal {C}\), and for each domain \(c \in \mathcal {C}\), we obtain \(T^*_{c}\) using the TS calibration method given in Eq. (2). We then use it to get the calibrated probability vector \(\hat{C}\) for each instance in the calibration domains. These calibrated probabilities are generated once per calibration domain before the generality-training. We use TS to obtain the calibrated probabilities for its simplicity; however, our method requires only calibrated probabilities, so in principle TS can be replaced by any other post-hoc calibration method in generality-training. (ii) During generality-training, all layers up to the representation layer are frozen. We use the CaliGen loss function given in Eq. (7) for optimization with a fixed value of \(\rho \).
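To make the two stages concrete, here is a hedged end-to-end sketch reusing the fit_temperature and softmax helpers from Sect. 3.1, the caligen_loss and target packing from Sect. 4.1, and a head built as in Fig. 1; the optimizer, epoch count, and early-stopping patience are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def generality_train(head, calib_domains, rho, n_classes):
    """Two-stage generality-training on a frozen representation.

    calib_domains: one (representations, logits, labels) triple per domain c.
    """
    X, Y = [], []
    # Stage (i): per-domain TS gives the calibrated targets C_hat (computed once).
    for reprs_c, logits_c, labels_c in calib_domains:
        T_c = fit_temperature(logits_c, labels_c)          # Eq. (2), per domain
        c_hat = softmax(logits_c, T_c)
        onehot = np.eye(n_classes)[labels_c]
        X.append(reprs_c)
        Y.append(np.concatenate([onehot, c_hat], axis=1))  # [Y | C_hat] packing
    # Stage (ii): train only the modified head with the CaliGen loss.
    head.compile(optimizer="adam", loss=caligen_loss(rho, n_classes))
    head.fit(np.concatenate(X), np.concatenate(Y), validation_split=0.2,
             epochs=100, callbacks=[tf.keras.callbacks.EarlyStopping(
                 patience=5, restore_best_weights=True)])
    return head
```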

Hyper-parameter Tuning. We consider values of \(\rho \) from \(\{0.0, 0.1, \ldots , 0.9, 1.0\}\) in the CaliGen loss function Eq. (7) and use early stopping (with 20% of the data allocated for validation). We select the best value of \(\rho \) using 3-fold cross-validation based on the lowest error, restricting the selection of \(\rho \) to the range [0.2, 0.8]; this range avoids extreme values of \(\rho \) that, in our observation, do not improve calibration (see Fig. 2, and the selection sketch below). The best values of \(\rho \) are given in the Supplementary material.
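A sketch of this cross-validation over \(\rho \) might look as follows, with make_head() standing in for a function that builds a fresh modified head; all names and training settings are again illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def select_rho(X, Y, make_head, n_classes):
    """3-fold CV over rho in {0.2, ..., 0.8}, scored by classification error.

    X: representations; Y: packed [one-hot labels | C_hat] targets, as above.
    """
    cv_error = {}
    for rho in np.round(np.arange(0.2, 0.81, 0.1), 1):
        errs = []
        for tr, va in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
            head = make_head()
            head.compile(optimizer="adam", loss=caligen_loss(rho, n_classes))
            head.fit(X[tr], Y[tr], validation_split=0.2, epochs=100, verbose=0,
                     callbacks=[tf.keras.callbacks.EarlyStopping(
                         patience=5, restore_best_weights=True)])
            pred = head.predict(X[va], verbose=0).argmax(axis=1)
            true = Y[va][:, :n_classes].argmax(axis=1)
            errs.append(np.mean(pred != true))  # fold classification error
        cv_error[rho] = np.mean(errs)
    return min(cv_error, key=cv_error.get)
```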

5 Experiments

We evaluate our method on three datasets. These experiments aim to test our proposed method CaliGen on real data and compare the results with SOTA methods in the area. In the following, we give a brief description of each dataset.

5.1 Datasets

Office-Home. [39] The dataset contains around 15,500 images of different sizes from 65 categories. It is divided into four domains: Art, Clipart, Product, and Real World. We resize the images to 224 \(\times \) 224 and split each domain 80-20 into subsets referred to as the Large subset and the Small subset, respectively. We use the Large subset for training and evaluation, and the Small subset for calibration. We use one domain for training, two for calibration, and the remaining domain for evaluation, along the same lines as [21], and perform experiments on all 12 possible ways of splitting the 4 domains into 1 training, 1 test, and 2 calibration domains.

DomainNet. [40] The dataset contains images of different sizes from 6 domains across 345 categories. The domains are Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. We resize the images to 224 \(\times \) 224 and split each domain 90-10 into subsets referred to as the Large subset and the Small subset, respectively. We use the Large subset for training and evaluation and the Small subset for calibration. We use 2 domains for training, 3 for calibration, and 1 for evaluation, similar to [21], and perform experiments on all 60 possible combinations.

CIFAR-10-C. We use the CIFAR-10 dataset [41] with 15 corruptions [42, 43] applied at severity levels 1 (least severe) to 5 (most severe). We consider 4 corruptions (Gaussian Noise, Brightness, Pixelate, and Gaussian Blur) at level 1, together with the original images, as the source domains; 4 different corruptions (Fog, Contrast, Elastic Transform, and Saturate) at level 1 as the calibration domains; and the remaining 7 corruptions at all levels as the target domains. A more detailed description of this dataset is given in the supplementary material.

5.2 Experimental Setup

Generality-Training. We use ResNet101 [44] and EfficientNet V2 B0 [45], pre-trained on ImageNet, and retrain them on each of the datasets listed in Sect. 5.1. For generality-training, we train a Multi-Layer Perceptron (the modified head of the DNN) with the details given in Fig. 1 and Sect. 4.2.

Base Methods. We use TS on a held-out validation set from the source domain as a reference method. This method does not generalize to unseen domains and is called source-only calibration (TS). We also consider an Oracle method where the model is calibrated with TS on the Large subset of the target domain. The Oracle method is the closest to the best possible calibration within the TS family, as it has access to the test data for calibration. As further base methods, we fit TS and Top Label Calibration (Histogram Binning Top Label, or HB-TL) [17] on the calibration domains. We also consider learning the network weights with focal loss [3] instead of NLL and then applying calibration methods on top.

Calibration Adaptation Methods. There is very limited work in the calibration generalization setting; however, Calibrated Prediction with Covariate Shift (CPCS) by Park et al. [19] and Transferable Calibration (TransCal) by Wang et al. [20] address calibration adaptation by estimating the density ratio between the target and source domains from unlabeled target-domain instances. For a fair comparison with CaliGen, we use calibration-domain instances to estimate the density ratio, keeping the target domain unseen.

Cluster-Level Methods. As we propose a calibration generalization method, we compare it with the current SOTA cluster-level methods [21]. We use K-means clustering (for Cluster NN) with 8 clusters for the Office-Home dataset and 9 clusters for the other two datasets. The numbers of clusters for Office-Home and DomainNet were taken from [21], while for CIFAR-10-C we experimented with 6 to 15 clusters and chose 9, which gave the best results on the test data. We train a linear regressor (for Cluster LR) on the cluster centers, and for the Cluster Ensemble method, we take the mean of the logits of these two together with TS.
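For reference, a minimal sketch of Cluster NN under our naming conventions (reusing fit_temperature and softmax from Sect. 3.1) is given below; Cluster LR would additionally fit a linear regressor from cluster centers to the per-cluster temperatures. This is an illustration of the idea in [21], not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_cluster_nn(reprs_calib, logits_calib, labels_calib, n_clusters=9):
    """Cluster calibration-domain representations; fit one temperature per cluster via Eq. (2)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reprs_calib)
    temps = np.array([fit_temperature(logits_calib[km.labels_ == k],
                                      labels_calib[km.labels_ == k])
                      for k in range(n_clusters)])
    return km, temps

def cluster_nn_predict(km, temps, reprs_test, logits_test):
    """Scale each test instance's logits by the temperature of its nearest cluster."""
    k = km.predict(reprs_test)
    return softmax(logits_test / temps[k][:, None])
```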

Table 2. Calibration performance (ECE %) evaluated on target domains of Office-Home dataset and averaged by target domains. The weights are learned by minimizing NLL (default) and focal loss (FL)
Table 3. Calibration performance (ECE %) evaluated on target domains of DomainNet dataset and averaged by target domains

6 Results and Discussion

We perform experiments to test the robustness of CaliGen on different challenging tasks: (a) slight distribution shifts due to corruptions (CIFAR-10-C) and (b) major domain shifts (Office-Home, DomainNet). For this, we compare CaliGen against other SOTA calibration methods, with calibration domains that include the source domains. Gong et al. [21] considered only calibration-domain data for calibration learning, while our experiments suggest that calibration on the source data is lost if the source domains are not included among the calibration domains (see supplementary material). We run 20 iterations of our experiments with 500 samples from the Large subset or Test subset and report the mean ECE % (calculated with bin size 15) and the mean Error %, each with standard deviation, for every dataset.

6.1 Performance Measures

Calibration Performance. Our method CaliGen achieves SOTA results for ECE on each dataset, as shown in Table 2 (Office-Home) and Table 3 (DomainNet), where we outperform all other methods on average. For the DomainNet dataset, we achieve the best results on average (8.13 ± 2.57, while the second best method, the Cluster Ensemble, achieves an ECE of 9.37 ± 6.6). Note that the high standard deviations in the tables are due to considering different combinations of domains for source, calibration, and target. E.g., when Art is the target in Office-Home, CaliGen has an ECE of 6.64 ± 1.84, where 6.64 and 1.84 are the mean and standard deviation of 5.41, 8.75, and 5.76, which are the results when the source is Clipart, Product, or Real World, respectively. All methods other than CaliGen struggle when the task is more complex (Clipart in Office-Home and Quickdraw in DomainNet); in contrast, CaliGen improves the uncalibrated ECE significantly. CPCS either outperforms or is comparable to the cluster-based methods, while CaliGen achieves lower ECE on the target domains and also on the source domains (see supplementary material). A DNN trained with focal loss [3] might not give better calibration on the unseen domain than one trained with NLL; detailed results with focal loss are given in the supplementary material.

Improvement Ratio. The Improvement Ratio (IR) [21] measures a model’s calibration transfer score. Given a source-only calibration and a target-only (Oracle) calibration, it measures how close the model gets to the oracle level, relative to the source-only level. It is defined as \(IR = \frac{ECE_S-ECE}{ECE_S-ECE_T}\), where \(ECE_S\) is the source-only ECE and \(ECE_T\) is the Oracle ECE. CaliGen achieves the best IR across all datasets (see Table 4). Detailed results on the CIFAR-10-C dataset and the EfficientNet network are given in the supplementary material.
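IR is a simple ratio; for example, with a source-only ECE of 10%, an Oracle ECE of 2%, and a method ECE of 4%, IR = (10 - 4) / (10 - 2) = 0.75. A one-line helper (illustrative naming):

```python
def improvement_ratio(ece, ece_source, ece_oracle):
    # IR = 1 recovers Oracle-level calibration; IR = 0 matches source-only TS.
    return (ece_source - ece) / (ece_source - ece_oracle)

print(improvement_ratio(4.0, 10.0, 2.0))  # 0.75
```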

Table 4. Improvement Ratio based on average ECE % scores of target domains when the classifier is trained using ResNet (R) or EfficientNet (E)

Accuracy Generalization. The generality-training procedure has access to the representations, ground truth labels, and calibrated probabilities of the calibration domains, but the representations are learned only on the source domains. During generality-training, the model minimizes the NLL loss on the ground truth labels while also minimizing the divergence to the calibrated probabilities, which helps accuracy generalization (see Table 5) along with calibration generalization, given the representations learned on the source domains.

Table 5. Error % averaged by target domains for different datasets while the classifier is trained using ResNet101 (R) or EfficientNet V2 B0 (E). All TS-based methods do not change the error and are therefore the same as Uncalibrated

6.2 Effect of \(\rho \) on ECE and Error

In our method, \(\rho \) is a hyper-parameter, for which we select the best value by 3-fold cross-validation based on the minimum error. Figure 2 shows the effect of changing \(\rho \) on error and ECE for both the Office-Home and DomainNet datasets. We observe the best error rate when \(\rho \) is 0.2 or 0.3, while ECE is not monotonic in \(\rho \). For \(\rho =0\), the objective function is NLL and generality-training does not minimize the KL-divergence, so higher ECE and lower error are expected. As we increase \(\rho \) from 0, the error first decreases further and then increases monotonically after \(\rho =0.3\).

Fig. 2. Effect of \(\rho \) on ECE and Error on the datasets (a) Office-Home, (b) DomainNet

6.3 Ablation Study

We perform an ablation study on generality-training, considering the effects of (i) not modifying the objective function and (ii) not modifying the head.

Unmodified Objective Function. We test the capability of our proposed objective function given in Eq. (7) by setting \(\rho = 0\) (i.e., only the NLL loss). We observe that when only the NLL loss is used, we do not achieve desirable ECE results, as shown in Table 6. This confirms that models trained with NLL give equal importance to CL and RL and thus struggle to produce well-calibrated probabilities.

Table 6. Calibration performance (ECE %) averaged by target domains of the Office-Home dataset when fine-tuned with either an unmodified loss function or an unmodified head

Unmodified Head. We justify the modification of the network head by testing our CaliGen loss function on the unmodified head. We use the same procedure without modifying the head and select the best \(\rho \) by 3-fold cross-validation. As shown in Table 6, modifying the head improves the performance on average. Adding more layers to the head gives it the ability to realize more complex functions, while dropout (0.5) and an L2 regularizer (0.01) prevent over-fitting.

6.4 Limitations

Our method gives better calibration than SOTA methods in the calibration generalization setting of Gong et al. [21], where the representations are fixed. We have made this assumption of fixed representations throughout the paper. In contrast, we now investigate what one can do given enough time and resources to relearn the representations as well, something that was not considered by Gong et al. [21]. For this experiment, we use a ResNet pre-trained on ImageNet and retrain it on the source and calibration domains.

Experimental Setup. For training, we use the Office-Home dataset with an additional 50% of the Small subsets (10% of each whole domain) from the calibration domains along with the source domain (Large subset, 80%). This setting redistributes the data for training and calibration such that the Large subset of the source domain and the Small subsets of the calibration domains are available, as discussed in Sect. 5.1. For calibration, the Small subset of the source domain (20%) and the remaining 50% of the Small subsets (10%) of the calibration domains are used.

Table 7. Calibration performance (ECE %) and Error % averaged by target domains of the Office-Home dataset while ResNet is trained and calibrated on all combinations of 3 domains.

Results. The results shown in Table 7 are averaged by target domains. CaliGen obtains surprisingly better ECE than the model trained on the calibration data. This training procedure, however, aims to give richer representations of the source and calibration domains, which helps accuracy generalization on unseen domains; CaliGen still gives slightly lower error in the Clipart and Real World domains. In a scenario where calibration domains are unavailable during training, it is more cost-effective to apply generality-training with CaliGen instead of retraining the whole network.

7 Conclusion

In this paper, we addressed the problem of calibration generalization to unseen domains. Based on the goal of improving calibration, we derived a new loss function that gives more weight to the Calibration Loss. We proposed a novel generality-training procedure, CaliGen, which modifies the head of the DNN and trains it with this loss. Together, these two changes improve domain generalization, meaning that accuracy and calibration improve on unseen domains while staying comparable to standard training on seen domains. Like several earlier works, the method assumes that all layers up to the representation layer are fixed and cannot be retrained due to limited resources. We also show that if retraining is possible, the results can be improved further.