Confidence Calibration for Object Detection and Segmentation

Calibrated confidence estimates obtained from neural networks are crucial, particularly for safety-critical applications such as autonomous driving or medical image diagnosis. However, although the task of confidence calibration has been investigated on classification problems, thorough investigations on object detection and segmentation problems are still missing. Therefore, we focus on the investigation of confidence calibration for object detection and segmentation models in this chapter. We introduce the concept of multivariate confidence calibration, which extends well-known calibration methods to the tasks of object detection and segmentation. This allows for an extended confidence calibration that is also aware of additional features such as bounding box/pixel position and shape information. Furthermore, we extend the expected calibration error (ECE) to measure miscalibration of object detection and segmentation models. We examine several network architectures on MS COCO as well as on Cityscapes and show that object detection and instance segmentation models in particular are intrinsically miscalibrated given the introduced definition of calibration. Using our proposed calibration methods, we are able to improve calibration, which also has a positive impact on the quality of the segmentation masks.


Introduction
Common neural networks for object detection must not only determine the position and class of an object but also state their confidence about the correctness of each detection. This also holds for instance or semantic segmentation models that output a score for each pixel indicating the confidence of object mask membership. A reliable confidence estimate for detected objects is crucial, particularly for safety-critical applications such as autonomous driving, to reliably process detected objects. Another example is medical diagnosis, where, for example, the shape of a brain tumor within an MRI image is of special interest [MWT+20].
Each confidence estimate might be seen as a probability of correctness, reflecting the model's uncertainty about a detection or the pixel mask. During inference, we expect the estimated confidence to match the observed precision for a prediction. For example, given 100 predictions with 80% confidence each, we expect 80 predictions to be correct [GPSW17, KKSH20]. However, recent work has shown that the confidence estimates of both classification and detection models based on neural networks are miscalibrated, i.e., the confidence does not match the observed accuracy in classification [GPSW17] or the observed precision in object detection [KKSH20]. While confidence calibration within the scope of classification has been extensively investigated [NCH15, NC16, GPSW17, KSFF17, KPNK+19], we recently defined the notion of calibration for object detection and proposed methods to measure and resolve miscalibration [KKSH20, SKR+21]. In this context, we measured miscalibration w.r.t. the position and scale of detected objects by also including the regression branch of an object detector into a calibration mapping. We have been able to show that modern object detection models also tend to be miscalibrated. On the one hand, this can be mitigated using standard calibration methods. On the other hand, we show that our proposed methods for position- and shape-dependent calibration [KKSH20] are able to further reduce miscalibration.
In this chapter, we review our methods for position- and shape-dependent confidence calibration and provide a definition of confidence calibration for the task of instance/semantic segmentation. To this end, we extend the definition of the expected calibration error (ECE) to measure miscalibration within the scope of instance/semantic segmentation. Furthermore, we adapt the extended calibration methods originally designed for object detection [KKSH20] to enable position-dependent confidence calibration for segmentation tasks as well.
This chapter is structured as follows. In Sect. 2 we review the most important related works regarding confidence calibration. In Sect. 3 we provide the definitions of calibration for the tasks of object detection, instance segmentation, and semantic segmentation. Furthermore, in Sect. 4 we present the extended confidence calibration methods for object detection and segmentation. Extensive experimental evaluations on a variety of architectures and computer vision problems, including object detection and instance/semantic segmentation, are discussed in Sect. 5. Finally, we provide conclusions and discuss our most important findings.

Related Works
In the past, most research focused on measuring and resolving miscalibration for classification tasks. In this scope, the expected calibration error (ECE) [NCH15] is commonly used in conjunction with the Brier score and negative log likelihood (NLL) loss to measure miscalibration. For calculating the ECE, all samples are grouped into equally sized bins by their confidence. Afterward, for each bin the accuracy is calculated and used as an approximation of the accuracy of a single sample in its respective bin. Recently, several extensions such as classwise ECE [KPNK+19], marginal calibration error [KLM19], or adaptive calibration error [NDZ+19] for the evaluation of multi-class problems have also been proposed, where the ECE is evaluated for each class separately. In contrast to previous work, we extend the common ECE definition to the task of object detection [KKSH20] and instance/semantic segmentation. This extension allows for a position-dependent miscalibration evaluation so that it is possible to quantify miscalibration separately for certain image regions. The definition is given in Sect. 3.
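The binning procedure behind the ECE described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `expected_calibration_error`, the choice of equal-width bins, and the toy data are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins (assumed binning scheme):
    sum over bins of (N_m / N) * |acc(m) - conf(m)|."""
    n = len(confidences)
    conf_sum = [0.0] * num_bins  # summed confidence per bin
    hit_sum = [0] * num_bins     # number of correct predictions per bin
    count = [0] * num_bins       # number of samples per bin
    for p, c in zip(confidences, correct):
        m = min(int(p * num_bins), num_bins - 1)  # p == 1.0 falls into the last bin
        conf_sum[m] += p
        hit_sum[m] += c
        count[m] += 1
    ece = 0.0
    for m in range(num_bins):
        if count[m] == 0:
            continue  # empty bins do not contribute
        acc = hit_sum[m] / count[m]
        conf = conf_sum[m] / count[m]
        ece += count[m] / n * abs(acc - conf)
    return ece


# 100 predictions with 80% confidence, 80 of them correct: calibrated.
calibrated = expected_calibration_error([0.8] * 100, [1] * 80 + [0] * 20)
# Same confidences, but only 60 are correct: overconfident by 0.2.
overconfident = expected_calibration_error([0.8] * 100, [1] * 60 + [0] * 40)
```

In the first toy case the ECE is (numerically) zero, whereas in the second case the gap between confidence and accuracy of 0.2 is returned.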
Besides measuring miscalibration, it is also possible to resolve a potential miscalibration by using calibration methods that are applied after inference. These post-hoc calibration methods can be divided into binning and scaling methods. Binning methods such as histogram binning [ZE01], isotonic regression [ZE02], Bayesian binning into quantiles (BBQ) [NCH15], or ensemble of near-isotonic regression (ENIR) [NC16] group all samples into several bins by their confidence (similar to the ECE calculation) and perform a mapping from uncalibrated confidence estimates to calibrated ones. In contrast, scaling methods such as logistic calibration (Platt scaling) [Pla99], temperature scaling [GPSW17], beta calibration [KSFF17], or Dirichlet calibration [KPNK+19] scale the network logits before sigmoid/softmax activation to obtain calibrated confidences. The scaling parameters are commonly obtained by logistic regression. Other approaches comprise binwise temperature scaling [JJY+19] or the scaling-binning calibrator [KLM19], which combine both approaches to further improve calibration performance for classification tasks.
Recently, we proposed an extension of common calibration methods to object detection by also including the bounding box regression branch of an object detector into a calibration mapping [KKSH20]. On the one hand, we extended histogram binning [ZE01] to perform calibration using a multidimensional binning scheme. On the other hand, we also extended scaling methods to include position information into a calibration mapping. Both approaches are presented in Sect. 4 in more detail.
Unlike classification, the task of instance or semantic segmentation calibration has not yet been addressed by many authors. In the work of [WLK+20], the authors perform online confidence calibration of the classification head of an instance segmentation model and show that this has a significant impact on the mask quality. The authors in [KG20] use a multi-task learning approach for semantic segmentation models to improve model calibration and out-of-distribution detection within the scope of medical image diagnosis. A related approach is proposed by [MWT+20], where the authors train multiple fully convolutional networks as an ensemble to obtain well-calibrated semantic segmentation models. However, none of these methods provide an explicit definition of calibration for segmentation models (instance or semantic). This problem is addressed by [DLXS20], where the authors explicitly define semantic segmentation calibration and propose a local temperature scaling for semantic image masks. This approach utilizes the well-known temperature scaling [GPSW17] and assigns a temperature to each mask pixel separately. Furthermore, they use a dedicated convolutional neural network (CNN) to infer the temperature parameters for each image. Our definition of segmentation calibration in Sect. 3 conforms to their definition.
Our approach differs from their proposed image-based temperature scaling as we use a single position-dependent calibration mapping to model the probability distributions for all images.

Calibration Definition and Evaluation
In this section, we review the term of confidence calibration for classification tasks [NCH15,GPSW17] and extend this definition to object detection, instance segmentation, and semantic segmentation. Furthermore, we derive the detection expected calibration error (D-ECE) to measure miscalibration.

Definitions of Calibration
Classification: In a first step, we define the dataset $\mathcal{D}$ of size $|\mathcal{D}| = N$ with indices $i \in \mathcal{I} = \{1, \ldots, N\}$, consisting of images $x \in \mathbb{I}^{H \times W \times C}$, $\mathbb{I} = [0, 1]$, with height $H$, width $W$, and number of channels $C$. Each image has ground-truth information that consists of the class information $Y \in \mathcal{Y} = \{1, \ldots, Y\}$. As an introduction to the task of confidence calibration, we start with the definition of perfect calibration for classification. A classification model $F_{\text{cls}}$ outputs a label $\hat{Y}$ and a corresponding confidence score $\hat{P}$ indicating its belief about the prediction's correctness. In this case, perfect calibration is defined by [GPSW17]
$$P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1]. \tag{1}$$
In other words, the accuracy $P(\hat{Y} = Y \mid \hat{P} = p)$ for a certain confidence level $p$ should match the estimated confidence. If we observe a deviation, a model is called miscalibrated. In binary classification, we rather consider the relative frequency of $Y = 1$ instead of the accuracy as the calibration measure. We illustrate this difference using the following example. Consider $N = 100$ images of a dataset $\mathcal{D}$ with binary ground-truth labels $Y \in \{0, 1\}$, where 50 images are labeled as $Y = 0$ and 50 images as $Y = 1$. Furthermore, consider a classification model $F_{\text{cls}}$ with a sigmoidal output in $[0, 1]$, indicating its confidence for $Y = 1$. In our example, this model is able to always predict the correct ground-truth label with confidences $\hat{P} = 0$ and $\hat{P} = 1$ for $Y = 0$ and $Y = 1$, respectively. Thus, the network achieves an accuracy of 100% but an average confidence of only 50%. Therefore, we consider the relative frequency of $Y = 1$ in a binary classification task as the calibration goal, which is also 50% in this scenario.
Object detection: In the next step, we extend our dataset and model to the task of object detection. In contrast to a classification dataset, an object detection dataset consists of ground-truth annotations $Y \in \mathcal{Y} = \{1, \ldots, Y\}$ for each object within an image as well as the ground-truth position and shape information $R \in \mathcal{R} = [0, 1]^A$ ($A$ denotes the size of the normalized box encoding, comprising the center $x$ and $y$ positions $c_x$, $c_y$, as well as width $w$ and height $h$). An object detection model $F_{\text{det}}$ further outputs a confidence score $\hat{P} \in [0, 1]$, a label $\hat{Y} \in \mathcal{Y}$, and the corresponding position $\hat{R} \in \mathcal{R}$ for each detection in an image $x$. Extending the original formulation for confidence calibration within classification tasks [GPSW17], perfect calibration for object detection is defined by [KKSH20]
$$P(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r) = p, \quad \forall p \in [0, 1],\ y \in \mathcal{Y},\ r \in \mathcal{R}, \tag{2}$$
where $M = 1$ denotes a correctly classified prediction that matches a ground-truth object with a certain intersection-over-union (IoU) score. Commonly, an object detection model is calibrated by means of its precision, as the computation of the accuracy is not possible without knowing all anchors of a model [KKSH20, SKR+21]. Thus, $P(M = 1)$ is a shorthand notation for $P(\hat{Y} = Y, \hat{R} = R)$, which expresses the precision for a dedicated IoU threshold.
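To make the match indicator M concrete, the following sketch computes the IoU of two boxes in the normalized (c_x, c_y, w, h) encoding and derives M for a single prediction/ground-truth pair. The helper names `to_corners`, `iou`, and `is_match` are hypothetical, and the assignment of predictions to ground-truth objects (which a full evaluation needs) is deliberately left out.

```python
def to_corners(box):
    """Convert a (cx, cy, w, h) box to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (cx, cy, w, h) encoding."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_label, pred_box, gt_label, gt_box, iou_threshold=0.5):
    """M = 1 if the class is correct and the IoU reaches the threshold, else 0."""
    return int(pred_label == gt_label and iou(pred_box, gt_box) >= iou_threshold)


pred = (0.5, 0.5, 0.4, 0.4)
gt = (0.6, 0.5, 0.4, 0.4)             # shifted by 0.1 in x, IoU = 0.6
m_loose = is_match(3, pred, 3, gt, iou_threshold=0.50)
m_strict = is_match(3, pred, 3, gt, iou_threshold=0.75)
```

The same detection counts as a match at an IoU threshold of 0.50 but not at 0.75, which is why the calibration target depends on the chosen threshold.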
Instance segmentation: At this point, we adapt this idea to define confidence calibration for instance segmentation. Consider a dataset with $K$ annotated objects, indexed by the set $\mathcal{K}$, over all images. For notational simplicity, we further use $j \in \mathcal{J}_k = \{1, \ldots, H_k \cdot W_k\}$ as the index for pixel $j$ within a bounding box $R_k$ of object $k \in \mathcal{K}$ in the instance segmentation dataset, where $H_k$ and $W_k$ denote the height and width of object $k$, respectively. In addition to object detection, a ground-truth dataset for instance segmentation also consists of pixel-wise mask labels denoted by $Y_j \in \mathcal{Y}^* = \{0, 1\}$ for each pixel $j$ in the bounding box $R_k$ of object $k$. Note that we introduce the star superscript ($*$) here to distinguish between the bounding box label/prediction encoding and the instance segmentation label/prediction encoding. An instance segmentation model $F_{\text{ins}}$ predicts the membership $\hat{Y}_j \in \mathcal{Y}^*$ of each pixel $j$ in the predicted bounding box $\hat{R}_k$ to the object mask with a certain confidence $\hat{P}_j \in [0, 1]$. We further denote $R_j \in \mathcal{R}^* = [0, 1]^{A^*}$ as the position of pixel $j$ within the bounding box $\hat{R}_k$, where $A^*$ denotes the size of the used position encoding of a pixel within its bounding box. In contrast to object detection, it is possible to treat the confidence scores of each pixel within an instance segmentation mask as a binary classification problem. In this case, the confidence $\hat{P}_j$ can be interpreted as the probability of a pixel belonging to the object mask. Therefore, the aim of confidence calibration for instance segmentation is that the pixel confidence should match the relative frequency with which a pixel is part of the object mask.
Analogously to the task of object detection, we further include a position dependency into the definition for instance segmentation and obtain
$$P(\hat{Y}_j = Y_j \mid \hat{P}_j = p, \hat{Y} = y, R_j = r) = p, \quad \forall p \in [0, 1],\ y \in \mathcal{Y},\ r \in \mathcal{R}^*,\ j \in \mathcal{J}_k,\ k \in \mathcal{K}. \tag{3}$$
The left-hand side can be interpreted as the probability that the prediction $\hat{Y}_j$ for a pixel with index $j$ within an object $k$ matches the ground-truth annotation $Y_j$ given a certain confidence $p$, a certain pixel position $r$ within the bounding box, as well as a certain object category $y$ that is predicted by the bounding box head of the instance segmentation model.
Semantic segmentation: Compared to object detection or instance segmentation, a semantic segmentation dataset does not hold ground-truth information for individual objects but rather consists of pixel-wise class annotations $Y_j \in \mathcal{Y} = \{1, \ldots, Y\}$, $j \in \mathcal{J} = \{1, \ldots, H \cdot W\}$. A semantic segmentation model $F_{\text{sem}}$ outputs pixel-wise labels $\hat{Y}_j$ and probabilities $\hat{P}_j$ with relative position $R_j$ within an image. Therefore, we can define perfect calibration for semantic segmentation as
$$P(\hat{Y}_j = Y_j \mid \hat{P}_j = p, R_j = r) = p, \quad \forall p \in [0, 1],\ r \in \mathcal{R}^*,\ j \in \mathcal{J}, \tag{4}$$
which is related to the calibration definition of classification [GPSW17]. In addition, the confidence score of each pixel must reflect its accuracy not only given a certain confidence level but also at a certain pixel position.

Measuring Miscalibration
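We can measure the miscalibration of an object detection model as the expected deviation between confidence and observed precision, which is also known as the detection expected calibration error (D-ECE) [KKSH20]:
$$\text{D-ECE} = \mathbb{E}_{\hat{P}, \hat{Y}, \hat{R}}\Big[\big|P(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r) - p\big|\Big].$$
Let further $s = (p, y, r)$ denote a single detection with confidence $p$, class $y$, and bounding box $r$ so that $s \in \mathcal{S}$, where $\mathcal{S}$ is the aggregated set of the confidence space $[0, 1]$, the set of ground-truth labels $\mathcal{Y}$, and the set of possible bounding box positions $\mathcal{R}$. Since the confidence and the bounding box quantities are continuous random variables, we need to approximate the D-ECE using the Riemann-Stieltjes integral [GPSW17]
$$\text{D-ECE} = \int_{\mathcal{S}} \big|P(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r) - p\big| \, dF_{\hat{P}, \hat{Y}, \hat{R}}(s)$$
with $F_{\hat{P}, \hat{Y}, \hat{R}}(s)$ as the joint cumulative distribution of $\hat{P}$, $\hat{Y}$, and $\hat{R}$. Let $a = (a_p, a_y, a_{r_1}, \ldots, a_{r_A})^\top$ and $b = (b_p, b_y, b_{r_1}, \ldots, b_{r_A})^\top$ denote the interval boundaries of a bin for the confidence $a_p, b_p \in [0, 1]$, the estimated labels $a_y, b_y \in \mathcal{Y}$, and each quantity of the bounding box encoding $a_{r_1}, b_{r_1}, \ldots, a_{r_A}, b_{r_A} \in [0, 1]$. The integral is then approximated by
$$\text{D-ECE} \approx \sum_{m=1}^{B} \big|P(M = 1 \mid \hat{P} = p_m, \hat{Y} = y_m, \hat{R} = r_m) - p_m\big| \, \Delta F_m,$$
where $\Delta F_m$ denotes the probability mass that $F_{\hat{P}, \hat{Y}, \hat{R}}$ assigns to bin $m$, with $B$ as the number of equidistant bins used for integral approximation so that $\mathcal{B} = \{1, \ldots, B\}$, and $p_m \in [0, 1]$, $y_m \in \mathcal{Y}$, and $r_m \in \mathcal{R}$ being the respective bin entities for average confidence, the current label, and the average (unpacked) bounding box scores within bin $m \in \mathcal{B}$, respectively. Let $N_m$ denote the number of samples within a single bin $m$. For large datasets $|\mathcal{D}| = N$, the probability mass $\Delta F_m$ is approximated by the relative frequency $N_m / N$ and the probability $P(M = 1 \mid \hat{P} = p_m, \hat{Y} = y_m, \hat{R} = r_m)$ is approximated by the average precision within a single bin $m$, whereas $p_m$ is approximated by the average confidence, so that the D-ECE can finally be computed by
$$\text{D-ECE} \approx \sum_{m=1}^{B} \frac{N_m}{N} \big|\text{prec}(m) - \text{conf}(m)\big|,$$
where $\text{prec}(m)$ and $\text{conf}(m)$ denote the precision and average confidence within bin $m$, respectively. Similarly, we can extend the D-ECE to instance segmentation by
$$\text{D-ECE} \approx \sum_{m=1}^{B} \frac{N_m}{N} \big|\text{freq}(m) - \text{conf}(m)\big|$$
with a binning scheme over all pixels and $\text{freq}(m)$ as the average frequency within each bin. In this case, each pixel is treated as a separate prediction and binned by its confidence, label class, and relative position. In the same way, the D-ECE for semantic segmentation is approximated by
$$\text{D-ECE} \approx \sum_{m=1}^{B} \frac{N_m}{N} \big|\text{acc}(m) - \text{conf}(m)\big|$$
with accuracy $\text{acc}(m)$ within each bin $m$.
For the calibration evaluation of object detection models, we can use the relative position $c_x$, $c_y$ and the shape $h$, $w$ of the bounding boxes [KKSH20]. For segmentation, we consider the relative position $x$, $y$ of each pixel within a bounding box (instance segmentation) or within the image (semantic segmentation), as well as its distance $d$ to the nearest segment boundary, as we expect a higher uncertainty in the peripheral areas of a segmentation mask. We use these definitions to evaluate different models in Sect. 5.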
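The binned D-ECE approximation can be sketched for a single class as follows. This is an illustrative sketch under stated assumptions: the function name `detection_ece`, the equal-width bins per dimension, the `min_samples` argument, and the choice to keep N as the total number of samples when sparse bins are skipped are all assumptions of this example, not details given in the text.

```python
def detection_ece(samples, matched, bins_per_dim, min_samples=1):
    """Binned D-ECE for one class.

    samples:      feature tuples in [0, 1]^Q with the confidence first,
                  e.g. (p, cx) or (p, cx, cy, w, h)
    matched:      0/1 match indicators (M) for each sample
    bins_per_dim: number of equal-width bins per feature; only the first
                  len(bins_per_dim) features of each sample are binned
    """
    n = len(samples)
    stats = {}  # bin index tuple -> [count, summed confidence, summed matches]
    for s, m in zip(samples, matched):
        key = tuple(min(int(v * b), b - 1) for v, b in zip(s, bins_per_dim))
        entry = stats.setdefault(key, [0, 0.0, 0])
        entry[0] += 1
        entry[1] += s[0]
        entry[2] += m
    d_ece = 0.0
    for count, conf_sum, match_sum in stats.values():
        if count < min_samples:
            continue  # sparse bins are neglected
        precision = match_sum / count
        confidence = conf_sum / count
        d_ece += count / n * abs(precision - confidence)
    return d_ece


# Toy data: every detection has confidence 0.7; detections in the left image
# half (cx = 0.25) have precision 0.9, those in the right half (cx = 0.75) 0.5.
samples = [(0.7, 0.25)] * 10 + [(0.7, 0.75)] * 10
matched = [1] * 9 + [0] * 1 + [1] * 5 + [0] * 5

conf_only = detection_ece(samples, matched, bins_per_dim=(10,))
with_position = detection_ece(samples, matched, bins_per_dim=(10, 2))
```

In this toy case the confidence-only D-ECE is (numerically) zero because the overall precision equals the confidence, while binning additionally by position exposes an error of 0.2 in both image halves.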
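One possible way to compute the per-pixel features named above for a single object mask is sketched below. The helper name `pixel_features`, the pixel-center convention, the 4-neighbourhood definition of a boundary pixel, and the brute-force nearest-boundary search are assumptions of this example; the normalization by the bounding box diagonal follows the description given for the instance segmentation experiments.

```python
import math


def pixel_features(mask):
    """Relative position (x, y) and normalized boundary distance d per pixel.

    mask: list of rows with 0/1 entries covering one bounding box.
    Returns a dict mapping (row, col) -> (x, y, d), all values in [0, 1].
    """
    height, width = len(mask), len(mask[0])
    diagonal = math.hypot(height, width)

    # A boundary pixel has at least one 4-neighbour with a different label.
    boundary = []
    for r in range(height):
        for c in range(width):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width and mask[rr][cc] != mask[r][c]:
                    boundary.append((r, c))
                    break

    features = {}
    for r in range(height):
        for c in range(width):
            x = (c + 0.5) / width   # pixel center relative to the box width
            y = (r + 0.5) / height  # pixel center relative to the box height
            # Brute-force nearest boundary pixel; falls back to the diagonal
            # if the mask has no boundary at all.
            dist = min((math.hypot(r - br, c - bc) for br, bc in boundary),
                       default=diagonal)
            features[(r, c)] = (x, y, min(dist / diagonal, 1.0))
    return features


# 4x4 box whose left half belongs to the object mask.
toy_mask = [[1, 1, 0, 0] for _ in range(4)]
toy_features = pixel_features(toy_mask)
```

The brute-force search is quadratic in the number of pixels and only meant for illustration; a distance transform would be the usual choice for full-size masks.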

Position-Dependent Confidence Calibration
For post-hoc calibration, we distinguish between binning and scaling methods. According to the approximation in (7), binning methods such as histogram binning [ZE01] divide all samples into several bins by their confidence and measure the average accuracy/precision within each bin. In contrast, scaling methods rescale the logits of a neural network before sigmoid/softmax activation to calibrate a network's output. In this section, we extend standard calibration methods so that they are capable of dealing with additional information such as position and/or shape. These extended methods can be used for confidence calibration of object detection, instance segmentation, and semantic segmentation tasks. We illustrate the difference between standard calibration and position-dependent calibration using an artificially created dataset in Fig. 1. This dataset consists of points with a confidence score and binary ground-truth information in {0, 1}, so that we are able to compute the frequency of the points across the whole image. We observe that standard calibration only shifts the average confidence to fit the average precision/accuracy. This leads to an increased calibration error in the center of the image. In contrast, position-dependent calibration as defined in this section is able to capture correlations between position information and calibration error and reduces the D-ECE across the whole image.

Histogram Binning
Given a binning scheme with $B$ different bins so that $\mathcal{B} = \{1, \ldots, B\}$ with the according bin boundaries $0 = a_1 \le a_2 \le \cdots \le a_{B+1} = 1$ and the corresponding calibrated estimate $\theta_m$ within each bin, the objective for histogram binning to estimate $\Theta = \{\theta_m \mid m \in \mathcal{B}\}$ is the minimization
$$\min_{\Theta} \sum_{i=1}^{N} \sum_{m=1}^{B} \mathbb{1}(a_m \le \hat{p}_i < a_{m+1}) \, (\theta_m - y_i)^2$$
with $\hat{p}_i$ and $y_i$ as the confidence and the binary target of sample $i$, and $\mathbb{1}(\cdot)$ as the indicator function, yielding a 1 if its argument is true and a 0 if it is false [ZE01, GPSW17]. This objective converges to the fraction of positive samples within each bin under consideration. However, within the scope of object detection or segmentation, we also want to measure the accuracy/precision w.r.t. position and shape information. As an extension to the standard histogram binning [ZE01], we therefore propose a multidimensional binning scheme that divides all samples into several bins by their confidence and by all additional information such as position and shape [KKSH20]. We further denote $\hat{S} = (\hat{P}, \hat{R})$ of size $Q = A + 1$ as the input vector to a calibration function consisting of the confidence and the bounding box encoding, so that $\mathcal{Q} = \{1, \ldots, Q\}$. In a multidimensional histogram binning, we indicate the number of bins as $B = (B_1, \ldots, B_Q)$ so that $\mathcal{B}^* = \{\mathcal{B}_q = \{1, \ldots, B_q\} \mid q \in \mathcal{Q}\}$ with bin boundaries $\{0 = a_{1,q} \le a_{2,q} \le \cdots \le a_{B_q+1,q} = 1 \mid q \in \mathcal{Q}\}$. For each bin combination $m_1 \in \mathcal{B}_1, \ldots, m_Q \in \mathcal{B}_Q$, we have a dedicated calibration parameter $\theta_{m_1, \ldots, m_Q} \in \mathbb{R}$ so that the calibration parameters $\Theta^*$, with $|\Theta^*| = \prod_{q \in \mathcal{Q}} B_q$, are given as $\Theta^* = \{\theta_{m_1, \ldots, m_Q} \mid m_1 \in \mathcal{B}_1, \ldots, m_Q \in \mathcal{B}_Q\}$. This results in an objective function given by
$$\min_{\Theta^*} \sum_{i=1}^{N} \sum_{m_1 \in \mathcal{B}_1} \cdots \sum_{m_Q \in \mathcal{B}_Q} J(\hat{s}_i) \, (\theta_{m_1, \ldots, m_Q} - y_i)^2,$$
where
$$J(\hat{s}_i) = \prod_{q \in \mathcal{Q}} \mathbb{1}(a_{m_q, q} \le \hat{s}_{i,q} < a_{m_q+1, q}),$$
which again converges to the fraction of positive samples within each bin. The term $J(\hat{s}_i)$ simply denotes that the loss is only applied if a sample $\hat{s}_i$ falls in a certain bin combination. A drawback of using this method is that the number of bins is given by $\prod_{q \in \mathcal{Q}} B_q$, which grows exponentially as the number of dimensions $Q$ grows.
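Because the squared-error objective decouples over bins, each calibration parameter can be obtained in closed form as the mean target within its bin combination. The following sketch illustrates this; the class name `MultiHistogramBinning`, the equal-width bin boundaries, and the fallback to the uncalibrated confidence for bins without training samples are assumptions of this example.

```python
class MultiHistogramBinning:
    """Multidimensional histogram binning over feature vectors in [0, 1]^Q.

    The first feature is the confidence; further features may hold position
    and shape information. Equal-width bins are assumed for every dimension.
    """

    def __init__(self, bins_per_dim):
        self.bins_per_dim = tuple(bins_per_dim)
        self.theta = {}  # bin index tuple -> calibrated estimate

    def _bin_index(self, s):
        return tuple(min(int(v * b), b - 1) for v, b in zip(s, self.bins_per_dim))

    def fit(self, samples, targets):
        counts, positives = {}, {}
        for s, y in zip(samples, targets):
            key = self._bin_index(s)
            counts[key] = counts.get(key, 0) + 1
            positives[key] = positives.get(key, 0) + y
        # The per-bin minimizer of the squared error is the mean target,
        # i.e. the fraction of positive samples within the bin.
        self.theta = {key: positives[key] / counts[key] for key in counts}
        return self

    def predict(self, s):
        # Assumption: bins without training samples keep the input confidence.
        return self.theta.get(self._bin_index(s), s[0])


# Toy data: confidence 0.7 everywhere, precision 0.9 on the left, 0.5 on the right.
train_samples = [(0.7, 0.25)] * 10 + [(0.7, 0.75)] * 10
train_targets = [1] * 9 + [0] * 1 + [1] * 5 + [0] * 5

calibrator = MultiHistogramBinning(bins_per_dim=(10, 2)).fit(train_samples, train_targets)
left = calibrator.predict((0.7, 0.1))    # calibrated estimate for the left half
right = calibrator.predict((0.7, 0.9))   # calibrated estimate for the right half
```

With Q = 5 features and 5 bins each, this scheme already maintains up to 5^5 = 3125 parameters, which illustrates the exponential growth mentioned above.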

Scaling Methods
As opposed to binning methods like histogram binning, scaling methods perform a rescaling of the logits before sigmoid/softmax activation to obtain calibrated confidences. We can distinguish between logistic calibration (Platt scaling) [Pla99] and beta calibration [KSFF17].
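Logistic calibration: According to the well-known logistic calibration (Platt scaling) [Pla99], the calibration parameters are commonly obtained using logistic regression. For a binary logistic regression model, we assume normally distributed scores for the positive and negative class, so that $p(\hat{p} \mid +) \sim \mathcal{N}(\hat{p}; \mu_+, \sigma^2)$ and $p(\hat{p} \mid -) \sim \mathcal{N}(\hat{p}; \mu_-, \sigma^2)$. The mean values for the two classes are given by $\mu_+$, $\mu_-$ and the variance $\sigma^2$ is equal for all classes. First, we follow the derivation of logistic calibration introduced by [Pla99, KSFF17] using the likelihood ratio
$$LR(\hat{p}) = \frac{p(\hat{p} \mid +)}{p(\hat{p} \mid -)} = \frac{\exp\big(-\frac{1}{2\sigma^2}(\hat{p} - \mu_+)^2\big)}{\exp\big(-\frac{1}{2\sigma^2}(\hat{p} - \mu_-)^2\big)} = \exp\big(\gamma (2\hat{p} - \eta)\big),$$
where $\gamma = \frac{1}{2\sigma^2}(\mu_+ - \mu_-)$ and $\eta = \mu_+ + \mu_-$. Assuming a uniform prior over the positive and negative classes, the likelihood ratio equals the posterior odds. Hence, a calibrated probability is derived by
$$P(+ \mid \hat{p}) = \frac{1}{1 + LR(\hat{p})^{-1}} = \frac{1}{1 + \exp\big(-\gamma (2\hat{p} - \eta)\big)},$$
which recovers the logistic function.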
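The derivation can be checked numerically: the posterior computed directly from the two Gaussian class densities with a uniform prior should coincide with the logistic function with gamma = (mu_+ - mu_-) / (2 sigma^2) and eta = mu_+ + mu_-. The function names and parameter values below are chosen for illustration only.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def posterior_from_densities(p_hat, mu_pos, mu_neg, variance):
    """P(+ | p_hat) via Bayes' rule with a uniform class prior."""
    pos = gaussian_pdf(p_hat, mu_pos, variance)
    neg = gaussian_pdf(p_hat, mu_neg, variance)
    return pos / (pos + neg)


def logistic_calibration(p_hat, mu_pos, mu_neg, variance):
    """P(+ | p_hat) via the closed-form logistic function."""
    gamma = (mu_pos - mu_neg) / (2.0 * variance)
    eta = mu_pos + mu_neg
    return 1.0 / (1.0 + math.exp(-gamma * (2.0 * p_hat - eta)))


# Illustrative parameters: positives score around 0.8, negatives around 0.4.
MU_POS, MU_NEG, VARIANCE = 0.8, 0.4, 0.04
grid = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9]
max_gap = max(
    abs(posterior_from_densities(p, MU_POS, MU_NEG, VARIANCE)
        - logistic_calibration(p, MU_POS, MU_NEG, VARIANCE))
    for p in grid
)
```

Both computations agree up to floating-point error, and the calibrated probability equals 0.5 exactly halfway between the two class means.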
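Recently, [KKSH20] used this formulation to derive a position-dependent confidence calibration by using multivariate Gaussians for the positive and negative classes. Introducing the concept of multivariate confidence calibration [KKSH20], we can derive a likelihood ratio for position-dependent logistic calibration by
$$LR_{LC}(\hat{s}) = \exp\Big(\frac{1}{2}\big[\hat{s}_-^\top \Sigma_-^{-1} \hat{s}_- - \hat{s}_+^\top \Sigma_+^{-1} \hat{s}_+\big] + c\Big), \quad c = \frac{1}{2} \log \frac{|\Sigma_-|}{|\Sigma_+|}, \tag{21}$$
where $\hat{s}_+ = \hat{s} - \mu_+$ and $\hat{s}_- = \hat{s} - \mu_-$ using $\mu_+, \mu_- \in \mathbb{R}^Q$ as the mean vectors and $\Sigma_+, \Sigma_- \in \mathbb{R}^{Q \times Q}$ as the covariance matrices for the positive and negative classes, respectively.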
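For intuition, the likelihood ratio of two multivariate Gaussians can be evaluated explicitly in the two-dimensional case (confidence plus one position feature). This is a sketch under stated assumptions: the restriction to Q = 2, the helper names, the uniform class prior, and all parameter values are illustrative; in practice the parameters would be fitted on a calibration set.

```python
import math


def mahalanobis_sq_2d(diff, cov):
    """Return (diff^T cov^{-1} diff, det(cov)) for a symmetric 2x2 covariance."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    quad = (c * diff[0] ** 2 - 2.0 * b * diff[0] * diff[1] + a * diff[1] ** 2) / det
    return quad, det


def multivariate_logistic_calibration(s, mu_pos, cov_pos, mu_neg, cov_neg):
    """Calibrated probability from the Gaussian likelihood ratio (uniform prior)."""
    diff_pos = (s[0] - mu_pos[0], s[1] - mu_pos[1])
    diff_neg = (s[0] - mu_neg[0], s[1] - mu_neg[1])
    quad_pos, det_pos = mahalanobis_sq_2d(diff_pos, cov_pos)
    quad_neg, det_neg = mahalanobis_sq_2d(diff_neg, cov_neg)
    log_lr = 0.5 * (quad_neg - quad_pos) + 0.5 * math.log(det_neg / det_pos)
    return 1.0 / (1.0 + math.exp(-log_lr))


# Case 1: equal covariances and equal position means. The position terms
# cancel and the univariate logistic calibration is recovered.
shared_cov = ((0.04, 0.0), (0.0, 0.09))
univariate_like = multivariate_logistic_calibration(
    (0.7, 0.3), (0.8, 0.5), shared_cov, (0.4, 0.5), shared_cov)

# Case 2: positives are concentrated around the image center (small position
# variance), so the same confidence is trusted less near the image border.
cov_pos = ((0.04, 0.0), (0.0, 0.01))
cov_neg = ((0.04, 0.0), (0.0, 0.09))
center = multivariate_logistic_calibration((0.6, 0.5), (0.8, 0.5), cov_pos, (0.4, 0.5), cov_neg)
border = multivariate_logistic_calibration((0.6, 0.9), (0.8, 0.5), cov_pos, (0.4, 0.5), cov_neg)
```

In the second case the same confidence of 0.6 is mapped to a high calibrated value at the center and to a value close to zero near the border, which is the position dependency that a confidence-only mapping cannot express.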
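Beta calibration: Similarly, we can also extend the beta calibration method [KSFF17] to a multivariate calibration scheme. However, we need a special form of a multivariate beta distribution as the Dirichlet distribution is only defined over a $Q$-simplex and thus is not suitable for this kind of calibration. Therefore, we use a multivariate beta distribution proposed by [LN82], which is defined as
$$f(\hat{s} \mid \alpha, \beta) = \frac{1}{B(\alpha)} \cdot \frac{\prod_{q=1}^{Q} \lambda_q^{\alpha_q} (s_q^*)^{\alpha_q - 1} \big(\frac{s_q^*}{\hat{s}_q}\big)^2}{\big[1 + \sum_{q=1}^{Q} \lambda_q s_q^*\big]^{\sum_{q \in \mathcal{Q}_0} \alpha_q}} \tag{22}$$
with $\mathcal{Q}_0 = \{0, \ldots, Q\}$ and the shape parameters $\alpha = (\alpha_0, \ldots, \alpha_Q)^\top$, $\beta = (\beta_0, \ldots, \beta_Q)^\top$ that are restricted to $\alpha_q, \beta_q > 0$. Furthermore, we denote $\lambda_q = \frac{\beta_q}{\beta_0}$ and $s_q^* = \frac{\hat{s}_q}{1 - \hat{s}_q}$. In this context, $B(\alpha)$ denotes the multivariate beta function. However, this kind of beta distribution is only able to capture positive correlations [LN82]. Nevertheless, it is possible to derive a likelihood ratio given by
$$LR_{BC}(\hat{s}) = \exp\Big(\sum_{q=1}^{Q} \big[\alpha_q^+ \log \lambda_q^+ - \alpha_q^- \log \lambda_q^- + (\alpha_q^+ - \alpha_q^-) \log s_q^*\big] + \sum_{q \in \mathcal{Q}_0} \Big[\alpha_q^- \log\Big(1 + \sum_{j=1}^{Q} \lambda_j^- s_j^*\Big) - \alpha_q^+ \log\Big(1 + \sum_{j=1}^{Q} \lambda_j^+ s_j^*\Big)\Big] + c\Big) \tag{23}$$
with $\alpha^+, \alpha^-$ and $\lambda^+, \lambda^-$ as the shape parameters for the multivariate beta distribution in (22) for the positive and negative class, respectively, so that $c = \log \frac{B(\alpha^-)}{B(\alpha^+)}$. We investigate the effect of position-dependent calibration in the next section using these methods.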
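The multivariate beta density and the resulting calibrated probability can be evaluated in log space as sketched below. The function names, the uniform class prior, and the parameter values are assumptions of this example; the sanity check uses the fact that for Q = 1 and equal beta parameters the density reduces to an ordinary beta distribution.

```python
import math


def log_multivariate_beta_pdf(s, alpha, beta):
    """Log density of the multivariate beta distribution in the form given above.

    s:     Q values in (0, 1)
    alpha: Q + 1 shape parameters (alpha_0, ..., alpha_Q), all > 0
    beta:  Q + 1 shape parameters (beta_0, ..., beta_Q), all > 0
    """
    q_dim = len(s)
    lam = [beta[q] / beta[0] for q in range(1, q_dim + 1)]
    s_star = [v / (1.0 - v) for v in s]
    # log B(alpha) = sum_q log Gamma(alpha_q) - log Gamma(sum_q alpha_q)
    log_beta_fn = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    log_pdf = -log_beta_fn
    for q in range(q_dim):
        a = alpha[q + 1]
        log_pdf += a * math.log(lam[q]) + (a - 1.0) * math.log(s_star[q])
        log_pdf += 2.0 * math.log(s_star[q] / s[q])  # (s*/s)^2 = (1 - s)^(-2)
    log_pdf -= sum(alpha) * math.log(1.0 + sum(l * v for l, v in zip(lam, s_star)))
    return log_pdf


def beta_calibration(s, alpha_pos, beta_pos, alpha_neg, beta_neg):
    """Calibrated probability from the likelihood ratio of two class densities."""
    log_lr = (log_multivariate_beta_pdf(s, alpha_pos, beta_pos)
              - log_multivariate_beta_pdf(s, alpha_neg, beta_neg))
    return 1.0 / (1.0 + math.exp(-log_lr))


# Q = 1 sanity check: alpha = (1, 2), beta = (1, 1) yields the density 2 * s,
# alpha = (2, 1), beta = (1, 1) yields the density 2 * (1 - s).
density_at_quarter = math.exp(log_multivariate_beta_pdf((0.25,), (1.0, 2.0), (1.0, 1.0)))
calibrated = beta_calibration((0.75,), (1.0, 2.0), (1.0, 1.0), (2.0, 1.0), (1.0, 1.0))
```

With these two toy class densities the posterior at a score of 0.75 is 1.5 / (1.5 + 0.5) = 0.75, i.e., the input happens to be calibrated already.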

Experimental Evaluation and Discussion
In this section, we evaluate our proposed calibration methods for the tasks of object detection, instance segmentation, and semantic segmentation using pretrained neural networks. For calibration evaluation, we use the MS COCO validation dataset [LMB+14] consisting of 5,000 images with 80 different object classes for object detection and instance segmentation. For semantic segmentation, we use the panoptic segmentation annotations consisting of 171 different object and stuff categories in total. Our investigations are limited to the validation dataset since the training set has already been used for network training and no ground-truth annotations are available for the respective test dataset. Thus, we use a 50%/50% random split and use the first set for calibration training, while the second set is used for the evaluation of the calibration methods. Furthermore, we also utilize the Cityscapes validation dataset [COR+16] consisting of 500 images with 19 different classes that are used for model training. The Munster and Lindau images are used for calibration training, whereas the Frankfurt images are held back for calibration evaluation.

Object Detection
We evaluate our proposed calibration methods histogram binning (HB) (14), logistic calibration (LC) (21), and beta calibration (BC) (23) for the task of object detection using pretrained detection models, where only the predicted bounding box information is used for calibration. For calibration evaluation, we use the proposed D-ECE with the same feature subsets that have been used for calibration training. Thus, we use a binning scheme of $B = 20$ when using the confidence $\hat{p}$ only. In contrast, we use $B_q = 5$ for $Q = 5$ when all auxiliary information is used. Each bin with fewer than 8 samples is neglected to increase D-ECE robustness. We further measure the Brier score (BS) and the negative log likelihood (NLL) of each model to evaluate its calibration properties. It is of special interest to assess whether calibration has an influence on the precision/recall. Thus, we also report the area under the precision/recall curve (AUPRC) for each model. As opposed to previous examinations [KKSH20], we evaluate calibration using all available classes within each dataset. Each of these scores is measured for each class separately. In our experiments, we report average scores that are weighted by the number of samples for each class. We perform calibration using IoU scores of 0.50 and 0.75, respectively. Furthermore, only bounding boxes with a score over 0.3 are used for calibration to reduce the number of non-informative predictions. We give a qualitative example in Fig. 2 that illustrates the effect of confidence calibration in the scope of object detection.
In our experiments, we first apply the standard calibration methods (Table 1) and compare the results with the baseline miscalibration of the detection model. We can already observe a default miscalibration of each examined network. This miscalibration is reduced by standard calibration, where the scaling methods offer the best performance compared to histogram binning. In a next step, we apply our box-sensitive calibration that includes the confidence $\hat{p}$, position $c_x$, $c_y$, and shape $w$ and $h$ into a calibration mapping (Table 2). Similar to the confidence-only case in Table 1, the scaling methods consistently outperform baseline and histogram binning calibration.
In each calibration case, we observe a miscalibration of the base network that is alleviated using our proposed calibration methods. By examining the reliability diagram (Fig. 3), we observe that the networks are consistently miscalibrated for all confidence levels. This can be alleviated either by standard calibration or by position-dependent calibration. By examining the position-dependence of the calibration error (Figs. 4 and 5), a considerable increase toward the image boundaries can be observed. This calibration error is already well mitigated by standard calibration methods and can be further reduced by position-dependent calibration. Also note that the AUPRC is not affected by standard scaling methods (logistic/beta calibration) as these methods perform a monotonically increasing mapping of the confidence estimates and thus do not affect the order of the samples. However, this is not the case with histogram binning, which may lead to a significant drop of the AUPRC. Furthermore, even the position-dependent scaling methods cannot guarantee a monotonically increasing mapping. However, compared to the improvement of the calibration, the impact on the AUPRC is marginal. Therefore, we conclude that our calibration methods are a valuable contribution especially for safety-critical systems, since they lead to statistically better calibrated confidences.
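The complementary metrics and the class-weighted averaging described above can be sketched as follows. The function names, the clipping constant `eps`, and the toy values are assumptions of this example; the targets are the binary match indicators of the detections.

```python
import math


def brier_score(confidences, targets):
    """Mean squared difference between confidence and binary target."""
    return sum((p - y) ** 2 for p, y in zip(confidences, targets)) / len(confidences)


def negative_log_likelihood(confidences, targets, eps=1e-12):
    """Mean binary NLL; confidences are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(confidences, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(confidences)


def weighted_class_average(scores, counts):
    """Average per-class scores weighted by the number of samples per class."""
    total = sum(counts.values())
    return sum(scores[name] * counts[name] for name in scores) / total


bs = brier_score([0.8, 0.8], [1, 0])                 # (0.04 + 0.64) / 2
nll = negative_log_likelihood([0.5, 0.5], [1, 0])    # log(2)
avg = weighted_class_average({"car": 0.1, "person": 0.3}, {"car": 30, "person": 10})
```

The weighting ensures that classes with few detections, whose scores are less reliable, contribute proportionally less to the reported average.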
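The ranking argument can be illustrated with a small example: a monotonically increasing scaling function keeps the order of the detections, whereas a binning-based mapping assigns the same value to all samples of a bin and thereby introduces ties. The mapping parameters and scores below are made up for illustration and are not fitted to any data.

```python
import math


def ranking(scores):
    """Sample indices sorted by descending score (stable for ties)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def logistic_map(p, gamma=3.0, eta=1.2):
    """Monotonically increasing scaling function (requires gamma > 0)."""
    return 1.0 / (1.0 + math.exp(-gamma * (2.0 * p - eta)))


def binned_map(p, theta=(0.2, 0.8)):
    """Histogram-binning style mapping with one calibrated value per bin."""
    num_bins = len(theta)
    return theta[min(int(p * num_bins), num_bins - 1)]


scores = [0.95, 0.55, 0.70, 0.35, 0.60]
scaled = [logistic_map(p) for p in scores]
binned = [binned_map(p) for p in scores]

order_original = ranking(scores)
order_scaled = ranking(scaled)
order_binned = ranking(binned)
```

Since the precision/recall curve depends only on the ordering of the confidences, the scaled scores yield the same curve as the original ones, while the ties created by binning collapse several operating points into one and can therefore change the AUPRC.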

Instance Segmentation
After object detection, we investigate the calibration properties of instance segmentation models. According to the definition of calibration for segmentation models in (9), we can use each pixel within a segmentation mask as a separate prediction. This alleviates the problem of limited data availability and allows for a more robust calibration evaluation. Recently, Kumar et al. [KLM19] started a discussion about the sample-efficiency of binning and scaling methods. The authors show that binning methods yield a more robust calibration mapping for large datasets but also tend to overfit for a small database. We can confirm this observation as histogram binning provides poor calibration performance in our examinations on object detection calibration, particularly for classes with fewer samples. In contrast, scaling methods are more sample-efficient but also more inaccurate [KLM19]. Furthermore, scaling methods are computationally more expensive as they require an iterative update of the calibration parameters over the complete dataset, whereas a binning of samples comes at low computational costs, especially for large datasets. Therefore, our examinations are focused on the calibration performance of a (multivariate) histogram binning using 15 bins for each dimension. For inference, we use pretrained Mask R-CNN [HGDG17, WKM+19] as well as pretrained PointRend [KWHG20] models to obtain predictions of instance segmentation masks for both datasets. As within object detection evaluation, all objects with a bounding box score below 0.3 are neglected. We perform standard calibration using the confidence only, as well as a position-dependent calibration including the x and y position of each pixel. The pixel position is scaled by the mask's bounding box size to get position information in the [0, 1] interval. 
Furthermore, we also include the distance of each pixel to the nearest mask segment boundary as a feature in the calibration mapping, since we expect a higher uncertainty especially at the segment boundaries. The distance is normalized by the bounding box's diagonal to obtain distance values in [0, 1]. Since many data samples are available, we measure calibration using a D-ECE with 15 bins, neglecting each bin with fewer than 8 samples. We further assess the Brier score as well as the NLL loss as complementary metrics. The task of instance segmentation is related to object detection: in a first step, the network also needs to infer a bounding box for each detected object. Similar to the calibration evaluation for object detection, we use IoU thresholds of 0.50 and 0.75 to specify whether a prediction has matched a ground-truth object or not. In contrast to object detection, the IoU score within instance segmentation is computed from the overlap of the inferred segmentation mask and the corresponding ground-truth object. Using this definition, we can compute the AUPRC to evaluate the average quality of the object detection branch as well as of the corresponding segmentation masks. All results for instance segmentation calibration for IoU of 0.50 and 0.75 are shown in Tables 3 and 4, respectively. These tables compare standard calibration (subset: confidence only) with position-dependent calibration (subset: full). We observe a significant miscalibration of the networks by default. Using standard calibration or our calibration methods, it is possible to improve the D-ECE, Brier, and NLL scores in each case. The miscalibration is successfully corrected by histogram binning using either the standard binning or the position-dependent binning. This is also underlined by the reliability diagrams for the confidence-only case (Fig. 6). Furthermore, we can also observe miscalibration that depends on the relative x and y pixel position within a mask (Figs. 7 and 8). Standard histogram binning already achieves good calibration results but still exhibits a weak dependency on the pixel position. Although the position-dependent calibration only achieves a minor improvement in calibration compared to the standard calibration case, it shows an equal calibration performance across the whole image. In contrast to object detection, we observe a slightly increased miscalibration toward the center of a mask. As can be seen in Fig. 9, most pixels belonging to an object mask are also located in the center. This underlines the need for a position-dependent calibration. Interestingly, although position-dependent calibration does not seem to offer better Brier or NLL scores compared to the confidence-only case, it significantly improves the mask quality, as we can observe higher AUPRC scores for position-dependent calibration. We further provide a qualitative example that illustrates the effect of confidence calibration in Fig. 9. In this example, we can see the difference between standard calibration and position-dependent calibration. In the former case, the mask scores are only rescaled by their confidence, which might lead to a better calibration but sometimes also to unwanted losses of mask segments (especially small objects in the background). In contrast, position-dependent calibration is capable of capturing possible correlations between confidence and pixel position or the size of an object. This leads to improved estimates of the mask confidences even for smaller objects. Therefore, we conclude that multidimensional confidence calibration has a positive influence on the calibration properties as well as on the quality of the object masks.

Table 3 Calibration results for instance segmentation @ IoU=0.50. The best D-ECE scores are underlined for each subset separately, since it is not convenient to compare D-ECE scores with different subsets to each other [KKSH20]. Furthermore, the best Brier, NLL, and AUPRC scores are highlighted in bold. These scores are only calculated using the confidence information and thus can be compared to each other. The histogram binning calibration consistently improves miscalibration. Furthermore, we find that although position dependence does not improve the Brier and NLL scores as in the standard case, it leads to a significant improvement in the mIoU score.

Table 4 Calibration results for instance segmentation @ IoU=0.75. The best scores are highlighted. We observe the same behavior in calibration for all models as for the IoU=0.50 case shown in Table 3.
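The D-ECE evaluation used in this section (15 bins per dimension, bins with fewer than 8 samples neglected) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the evaluation code behind Tables 3 and 4: the function name `detection_ece` is hypothetical, the D-ECE is taken to be the sample-weighted mean absolute gap between the observed frequency of correct predictions and the mean confidence per bin, and the bin weights are renormalized over the retained bins only, which the text does not specify.

```python
import numpy as np


def detection_ece(features, matched, n_bins=15, min_samples=8):
    """Binned calibration error over confidence and additional features.

    features: (n, d) array with values in [0, 1]; column 0 is the confidence.
    matched:  (n,) array of 0/1 flags (1 = prediction counts as correct).
    Bins holding fewer than min_samples samples are ignored.
    """
    features = np.asarray(features, dtype=float)
    matched = np.asarray(matched, dtype=float)
    n_dims = features.shape[1]
    size = n_bins ** n_dims

    # Flat bin index of every sample in the d-dimensional grid.
    idx = np.clip((features * n_bins).astype(int), 0, n_bins - 1)
    flat = np.ravel_multi_index(idx.T, (n_bins,) * n_dims)

    counts = np.bincount(flat, minlength=size)
    conf_sum = np.bincount(flat, weights=features[:, 0], minlength=size)
    hit_sum = np.bincount(flat, weights=matched, minlength=size)

    keep = counts >= min_samples
    if not keep.any():
        return float("nan")  # no bin is populated well enough

    # |observed frequency - mean confidence| for each retained bin.
    gap = np.abs(hit_sum[keep] - conf_sum[keep]) / counts[keep]
    # Assumption: weights are renormalized over the retained bins.
    weights = counts[keep] / counts[keep].sum()
    return float(np.sum(weights * gap))
```

With a single feature column this reduces to a confidence-only calibration error, and with confidence plus relative position it measures the position-dependent miscalibration discussed above.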

Semantic Segmentation
For the evaluation of semantic segmentation, we use the same datasets, the same binning scheme, and the same features that have already been used for the instance segmentation in Sect. 5.2. For COCO mask inference, we use a pretrained Deeplabv2 [CPK+18] as well as a pretrained HRNet model [SXLW19, WSC+20, YCW20]. For Cityscapes, we also use an HRNet as well as a pretrained Deeplabv3+ model [CZP+18]. Similar to the instance segmentation, we use the D-ECE, Brier score, NLL loss, and mIoU to compare the baseline calibration with a histogram binning model. As opposed to our previous experiments, we only use 15% of the provided samples to reduce computational complexity. The results for default miscalibration, standard calibration (subset: confidence only), and position-dependent calibration (subset: full) are shown in Table 5. Unlike instance segmentation calibration, the  Fig. 10 to illustrate the effect of calibration for semantic segmentation. In this example, we can observe only minor differences between the uncalibrated mask and the masks either after standard calibration or after position-dependent calibration. This aspect is supported by the confidence reliability diagram shown in Fig. 11. Although we can observe an overconfidence, most samples are located in the low-confidence region with a low calibration error, which also results in an overall low miscalibration score. Furthermore, our calibration methods are able to achieve even better calibration results, but mostly in the confidence-only calibration case. In addition, we observe only a low correlation between position and calibration error (Figs. 12 and 13). One reason for the major difference between instance and semantic segmentation calibration may be the difference in model training. An instance segmentation model needs to infer an appropriate bounding box first to achieve qualitatively good results in mask inference.
In contrast, a semantic segmentation model does not need to infer the position of objects within an image but is able to directly learn and improve mask quality. We also suspect an influence of the number of available data points, since a semantic segmentation model is able to use each pixel within an image as a separate sample, whereas an instance segmentation model is restricted to the pixels that are available within an estimated bounding box. Therefore, we conclude that semantic segmentation models do not require a post-hoc confidence calibration, as they already offer a good calibration performance.
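The complementary metrics and the 15% subsampling mentioned in this section can be sketched as follows. This is only a sketch under stated assumptions: the helper names `brier_score`, `nll_loss`, and `subsample_indices` are hypothetical, both metrics are written in their binary form (each confidence is treated as the predicted probability that the corresponding 0/1 label is 1), and the subsample is drawn uniformly at random without replacement, whereas the text does not state how the 15% of samples were selected.

```python
import numpy as np


def brier_score(confidences, labels):
    """Mean squared difference between confidence and 0/1 label."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((confidences - labels) ** 2))


def nll_loss(confidences, labels, eps=1e-12):
    """Binary negative log-likelihood, clipped for numerical stability."""
    p = np.clip(np.asarray(confidences, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def subsample_indices(n_samples, fraction=0.15, seed=0):
    """Draw a random subset of sample indices without replacement."""
    rng = np.random.default_rng(seed)
    size = max(1, int(round(fraction * n_samples)))
    return rng.choice(n_samples, size=size, replace=False)
```

Both metrics would then be evaluated on the selected subset only, for example `brier_score(conf[idx], labels[idx])` with `idx = subsample_indices(len(conf))`.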

Conclusions
Within the scope of confidence calibration, recent work has mainly focused on the task of classification calibration. In this chapter, we presented an analysis of confidence calibration for object detection as well as for instance and semantic segmentation. Firstly, we introduced definitions for confidence calibration within the scope of object detection, instance segmentation, and semantic segmentation. Secondly, we presented methods to measure and alleviate miscalibration of detection and segmentation networks. These methods are extensions of well-known calibration methods such as histogram binning [ZE01], Platt scaling [Pla99], and beta calibration [KSFF17]. We extended these methods so that they encompass additional calibration information such as position and shape. Finally, the experiments revealed that common object detection models as well as instance segmentation networks tend to be miscalibrated. In addition, we showed that auxiliary information such as the estimated position or shape of a predicted object also has an influence on confidence calibration. However, we also found that semantic segmentation models are already intrinsically calibrated. Thus, the examined models do not require additional post-hoc calibration and already offer well-calibrated mask confidence scores. We argue that this difference between instance and semantic segmentation is a result of data quality and availability during training. This leads to the assumption that limited data availability is a direct cause of miscalibration during training and thus an effect of overfitting. Nevertheless, our proposed calibration framework is capable of calibrating object detection and instance segmentation models. In safety-critical applications, the confidence in the applied algorithms is paramount. The proposed calibration algorithms make it possible to detect situations of low confidence and thus to trigger the appropriate system reaction.
Therefore, calibrated confidence values can be used as additional information especially in safety-critical applications.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.