Introduction

Quality control is a key part of the manufacturing process, encompassing inspection, testing, and identification to ensure that manufactured products comply with specific standards and specifications (Kurniati et al., 2015; Wuest et al., 2014; Yang et al., 2020). For example, inspection tasks aim to determine whether a specific part features assembly integrity, an adequate surface finish, and correct geometric dimensions (Newman & Jain, 1995). In addition, product quality is key to the business since it (i) builds trust with customers, (ii) boosts customer loyalty, and (iii) reinforces the brand reputation.

One such quality inspection activity is visual inspection, which is considered a bottleneck in some instances (Zheng et al., 2020). Visual inspection is associated with many challenges. Some visual inspections require substantial reasoning capability, visual ability, and specialization (Newman & Jain, 1995). Furthermore, reliance on humans to perform such tasks can affect the scalability and quality of the inspection. Regarding scalability, human inspection requires training inspectors to develop inspection skills; their inspection tends to be slower than that of machines, they fatigue over time, and they can be absent from work (due to sickness or other reasons) (Selvi & Nasira, 2017; Vergara-Villegas et al., 2014). The quality of inspection is usually affected by the inherent subjectiveness of each human inspector, the task complexity, the job design, the working environment, the inspectors’ experience, well-being, and motivation, and the management’s support and communication (Cullinane et al., 2013; Kujawińska et al., 2016; See, 2012). The scalability and quality shortcomings of manual visual inspection can be addressed through automated visual inspection.

Automated visual inspection can be realized with Machine Learning models. Technological advances [e.g., the Internet of Things or Artificial Intelligence (Rai et al., 2021; Zheng et al., 2021)] and trends in manufacturing [e.g., the Industry 4.0 and Industry 5.0 paradigms (Barari et al., 2021; Rozanec et al., 2022)] have enabled the timely collection of data and fostered the use of machine learning models to automate manufacturing tasks while reshaping the role of the worker (Carvajal Soto et al., 2019; Chouchene et al., 2020). Automated visual inspection has been applied to several use cases in the past (Beltrán-González et al., 2020; Duan et al., 2012; Jiang & Wong, 2018; Villalba-Diez et al., 2019). Nevertheless, it is considered that the field is still in its early stages and that artificial intelligence has the potential to revolutionize product inspection (Aggour et al., 2019).

While machine learning models can be trained to determine whether a manufactured piece is defective and do so in an unsupervised or supervised manner, no model is perfect. At least three challenges must be faced: (a) how to improve the models’ discriminative capabilities over time, (b) how to calibrate the models’ prediction scores into probabilities to enable the use of standardized decision rules (Silva Filho et al., 2021), and (c) how to alleviate the manual labeling effort.

This paper presents our approach to addressing these three challenges. To address the first challenge, active learning is used to enhance the classification model; pool-based and stream-based settings are compared, considering different active learning sample query strategies across five machine learning algorithms. Platt scaling, a popular probability calibration technique, addresses the second challenge. Finally, two scenarios were considered when addressing the reduction of manual labeling effort: (i) manual inspection of cases where the machine learning model does not predict with enough confidence and (ii) data labeling to acquire ground truth data for model calibration. The first scenario was addressed by exploring the use of multiple oracles and soft labeling to reduce the manual inspection effort, the second by approximating the ground truth with the models’ predictions to calibrate the model. Furthermore, several novel metrics to measure the quality of calibration were proposed. The results confirm that they can measure the quality of such calibration without needing a ground truth.

This work extends our previous research described in the paper Streaming Machine Learning and Online Active Learning for Automated Visual Inspection (Rožanec et al., 2022). In that paper, research was performed to measure the impact of active learning on streaming algorithms. The present paper explores batch and online settings, active learning policies, and oracles, and overcomes some of the shortcomings of the previous research. First, it does not only consider the models’ uncertainty to decide which data instances to forward to the oracles but also a certain quality acceptance level. Second, it calibrates the machine learning models so that, through probability calibration, they issue probabilities rather than prediction scores. Third, it increases the amount of data devoted to active learning to ensure more meaningful results. Finally, it focuses on batch machine learning models (which achieve a greater discriminative performance) and studies them in batch and streaming active learning settings. In addition to the abovementioned items, multiple metrics were developed to assess the calibration quality of a calibrator. The metrics overcome some shortcomings of widely adopted metrics and enable measuring calibration quality when no ground truth is available. The research was performed on a real-world use case with images provided by the Philips Consumer Lifestyle BV corporation. The dataset comprises images of the logo printed on manufactured shavers. The images are classified into three classes: good prints, double prints, and interrupted prints.

The Area Under the Receiver Operating Characteristic Curve [AUC ROC, see (Bradley, 1997)] was used to evaluate the discriminative capability of the classification models. AUC ROC estimates the quality of the model for all possible cutting thresholds. It is invariant to a priori class probabilities and, therefore, suitable for classification tasks with strong class imbalance. Given that the models were evaluated in a multiclass setting, the AUC ROC was computed with a one-vs-rest strategy. In addition, the performance of multiple probability calibration approaches was measured through the Estimated Calibration Error (ECE) and several novel metrics proposed in this research.

This paper is organized as follows. Section “Related work” describes the current state of the art and related works. Section “Approximate model’s probabilities calibration” describes novel metrics proposed for probability calibration and how calibration methods can leverage approximate ground truth to enlarge the calibration set. The main novelty regarding the proposed probabilities calibration metrics is the ability to measure calibration quality without needing a ground truth. Section “Use case” describes the use case, while section “Methodology” provides a detailed description of the methodology followed. Section “Experiments” describes the experiments performed, while section “Results and evaluation” presents and discusses the results obtained. Finally, section “Conclusion” presents the conclusions and outlines future work.

Related work

This section provides a short overview of three topics relevant to this research: (i) the use of machine learning for quality inspection, (ii) active learning, and (iii) probabilities calibration. The following subsections are devoted to each of them.

Machine learning for quality inspection

A comprehensive and reliable quality inspection is indispensable to the manufacturing process, and high inspection volumes turn inspection processes into bottlenecks (Schmitt et al., 2020). Machine Learning has been recognized as a technology that can drive the automation of quality inspection tasks in the industry. Multiple authors report applying it for early prediction of manufacturing outcomes, which can help drop a product that will not meet quality expectations and avoid investment in expensive manufacturing stages. Furthermore, similar predictions can be used to determine whether the product can be repaired, thereby avoiding either discarding a piece in which manufacturing effort has already been invested or selling a defective piece, with the corresponding costs for the company (Weiss et al., 2016). Automated visual inspection refers to image processing techniques for quality control, usually applied in the production line of manufacturing industries (Beltrán-González et al., 2020). It has been successfully applied to determine the end quality of products and provides many advantages, such as performing non-contact inspection that is not affected by the type of target, surface, or ambient conditions (e.g., temperature) (Park et al., 2016). In addition, visual inspection systems can perform multiple tasks simultaneously, such as object, texture, or shape classification and defect segmentation, among other inspections. Nevertheless, automated visual inspection is a challenging task given that collecting the dataset is usually expensive, and the methods developed for that purpose are dataset-dependent (Ren et al., 2017).

Jian et al. (2017) consider three approaches to automated visual inspection: (a) classification, (b) background reconstruction and removal (reconstruct and remove the background to find defects in the residual image), and (c) template reference (comparing a template image with a test image). Tsai and Lai (2008) describe how TFT-LCD panels and LCD color filters were inspected by comparing surface segments containing complex periodic patterns. Lin et al. (2019) describe how defect inspection on LED chips was automated using deep Convolutional Neural Networks (CNN). Kang and Liu (2005) successfully applied feed-forward networks to detect surface defects on cold-rolled strips. In the same line, Yun et al. (2014) proposed a novel defect detection algorithm for steel wire rods produced by the hot rolling process. Valavanis and Kosmopoulos (2010) compared multiple machine learning models (Support Vector Machine, Neural Network, and K-nearest neighbors (kNN)) on defect detection in weld images. Park et al. (2016) developed a CNN and compared it to multiple models (particle swarm optimization-imperialist competitive algorithm, Gabor-filter, and random forest with variance-of-variance features) to find defects on silicon wafers, solid paint, pearl paint, fabric, stone, and wood surfaces. Furthermore, Aminzadeh and Kurfess (2019) described how Bayesian classification enabled online quality inspection in a powder-bed additive manufacturing setting. Multiple authors developed machine learning algorithms for visual inspection leveraging feature extraction from pre-trained models (Cohen & Hoshen, 2020; Li et al., 2021; Jezek et al., 2021). While much research was devoted to supervised machine learning methods, unsupervised defect detection was explored by many authors, using Fourier transforms to remove regularities and highlight irregularities (defects) (Aiger & Talbot, 2012) or employing autoencoders to find how a reference image differs from the expected pattern (Mujeeb et al., 2018; Zavrtanik et al., 2021, 2022).

Active learning

Active learning is a subfield of machine learning that studies how an active learner can best identify informative unlabeled instances and request their labels from some oracle. Typical scenarios involve (i) membership query synthesis (a synthetic data instance is generated), (ii) stream-based selective sampling (the unlabeled instances are drawn one at a time, and a decision is made whether a label is requested or the sample is discarded), and (iii) pool-based selective sampling (samples are queried from a pool of unlabeled data). Frequently used querying strategies include (i) uncertainty sampling [select the unlabeled sample with the highest uncertainty, given a certain metric or machine learning model (Lewis & Catlett, 1994)] and (ii) query-by-committee [retrieve the unlabeled sample with the highest disagreement between a set of forecasting models (the committee) (Cohn et al., 1994; Settles, 2009)]. More recently, new scenarios have been proposed leveraging reinforcement learning, where an agent learns to select images based on their similarity, and rewards obtained are based on the oracle’s feedback (Ren et al., 2020). In addition, it has been demonstrated that ensemble-based active learning can effectively counteract class imbalance through newly labeled image acquisition (Beluch et al., 2018). While active learning reduces the required volume of labeled images, it is also essential to consider that it can produce an incomplete ground truth by missing the annotations of defective parts classified as false negatives and not queried by the active learning strategy (Cordier et al., 2021).

Active learning has been successfully applied in manufacturing, but the scientific literature in this domain remains scarce (Meng et al., 2020). Some use cases include the automatic optical inspection of printed circuit boards (Dai et al., 2018), media news recommendation in a demand forecasting setting (Zajec et al., 2021), and the identification of the local displacement between two layers on a chip in the semiconductor industry (van Garderen, 2018).

Probabilities calibration

Probabilities denote the likelihood that a particular event will occur and are expressed as a real number between zero and one (Cheeseman, 1985). Many machine learning models output prediction scores that cannot be directly interpreted as probabilities. Therefore, such models can be calibrated (mapped to a known scale with known properties), ensuring the prediction scores are converted to probabilities. Probability calibration aims to provide reliable estimates of the true probability that a sample is a member of a class of interest. Such calibration (a) usually does not decrease the classification accuracy, (b) enables the use of thresholds in decision rules and therefore minimizes the classification error, (c) ensures decision rules and their maximum posterior probability are fully justified from the theoretical point of view, (d) can be easily adapted to changes in class and cost distributions, and therefore (e) is key to decision-making tasks (Cohen & Goldszmidt, 2004; Song et al., 2021).

A k-class probabilistic classifier is considered well-calibrated if the predicted k-dimensional probability vector has a distribution that approximates the distribution of the test instances. While a single accepted notion of probabilistic calibration exists for binary classifiers, the definition for multiclass settings has multiple nuances. Three kinds of probability calibration are described in the literature for multiclass settings: (i) confidence calibration [aims only to calibrate the classifier’s most likely predicted class (Song et al., 2021)], (ii) class-wise calibration (attempts to calibrate the scores for each class as marginal probabilities), and (iii) multi-class calibration (seeks to create an entire vector of predicted probabilities so that, for any prediction vector, the proportion of each class among all instances receiving that prediction equals the probability of that class in the predicted vector).

Multiple probability calibration methods have been proposed in the scientific literature. Post-hoc techniques aim to learn a calibration map for a machine learning model based on hold-out validation data. Popular calibration methods for binary classifiers include logistic calibration (Platt scaling), isotonic calibration, Beta calibration, temperature calibration, and binning calibration.

Empirical binning builds the calibration map by computing the empirical frequencies within a set of score intervals. It can therefore capture arbitrary prediction score distributions (Kumar et al., 2019). Isotonic regression computes a regression assuming the uncalibrated model has a set of non-decreasing constant segments corresponding to bins of varying widths. Given its non-parametric nature, it avoids a model misfit, and due to the monotonicity assumption, it can find optimal bin edges. Nevertheless, training times and memory consumption can be high on large datasets, and it gives sub-optimal results if the monotonicity assumption is violated. Platt scaling (Platt, 2000) aims to transform prediction scores into probabilities through a logistic regression model, considering a uniform probability vector as the target. While the implementation is straightforward and the training process is fast, it assumes the input values correspond to a real scalar space and restricts the calibration map to a sigmoid shape. Probability calibration trees evolve the concept of Platt scaling, identifying regions of the input space that lead to poor probability calibration and learning different probability calibration models for those regions, achieving better overall performance (Leathart et al., 2017). Beta calibration was designed for probabilistic classifiers. It assumes that the scores of each class can be approximated with two Beta distributions and is implemented as a bivariate logistic regression. Temperature scaling uses a scalar parameter \(T>0\) (where T is considered the temperature) to rescale logit scores before applying a softmax function, achieving recalibrated probabilities with scores better spread between zero and one. It is frequently applied to deep learning models, where the prediction scores are often strongly skewed towards one or zero. Furthermore, the method can be applied to generic probabilistic models by transforming the prediction scores with a logit transform (Guo et al., 2017). This enables calculating the score against a reference class and obtaining the ratio against other classes. Nevertheless, the method is not robust in capturing epistemic uncertainty (Ovadia et al., 2019). Finally, the concept of temperature scaling is extended by vector scaling, which considers a different temperature for each class, and matrix scaling, which considers a matrix and intercept parameters (Song et al., 2021).
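As an illustration of the temperature scaling approach described above, a minimal sketch follows, assuming the uncalibrated prediction scores are available as raw logits in a NumPy array and the hold-out calibration labels are integer-encoded; the scalar T is found by minimizing the negative log-likelihood with SciPy:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(logits, labels):
    # Search for the scalar T > 0 that minimizes the negative log-likelihood
    # of softmax(logits / T) on the hold-out calibration set.
    def nll(t):
        log_probs = log_softmax(logits / t, axis=1)
        return -log_probs[np.arange(len(labels)), labels].mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def temperature_scale(logits, t):
    # Recalibrated probabilities: softmax of the rescaled logits.
    return np.exp(log_softmax(logits / t, axis=1))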

Several metrics and methods have been proposed to assess the quality of the calibration. Reliability diagrams plot the observed relative frequency of predicted scores against their values. They therefore enable a quick assessment of whether the event happens with a relative frequency consistent with the forecasted value (Bröcker & Smith, 2007). On the other hand, validity plots aim to convey the bin frequencies for every bin and therefore provide valuable information regarding miscalibration bounds (Gupta & Ramdas, 2021). Among the metrics, the binary ECE measures the average gap across all bins in a reliability diagram, weighted by the number of instances in each bin, considering the labeled samples of a test set. In the same line, the binary Maximum Calibration Error computes the maximum gap across all bins in a reliability diagram. The Confidence Estimated Calibration Error measures the average difference between accuracy and average confidence across all bins in a confidence reliability diagram, weighted by the number of instances per bin. A different approach is followed by the Brier score, which measures the mean squared difference between the predicted probability and the actual outcome. While the ECE metric is widely accepted, research has shown that it is subject to shortcomings (Nixon et al., 2019; Posocco & Bonnefoy, 2021). One such shortcoming is that when fixed calibration ranges are used, some bins contain most of the data, resulting in decreased sharpness of the metric. Furthermore, ECE is measured across non-empty bins, failing to account for the overall distribution of positives across the mean predicted probabilities. Measuring probabilistic calibration remains a challenge (Nixon et al., 2019).
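For reference, the binary ECE described above can be computed with a few lines of NumPy; a minimal sketch, assuming y_true holds the binary labels of a labeled test set and y_prob the predicted probabilities for the positive class:

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    # Instance-weighted average gap between the observed fraction of positives
    # and the mean predicted probability, computed over non-empty bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += gap * mask.sum() / len(y_true)
    return ece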

While many probability calibration methods and metrics have been developed, most of them were conceived under the assumption that probability calibration must be performed against some ground truth. Nevertheless, acquiring data for such a ground truth is expensive (it requires labeled instances), limits the amount of data available to build the probability calibration map, and therefore introduces inaccuracies due to the inherent characteristics of the sample. To address this gap, this research proposes labeling each predicted data instance with the predicted class with the highest score, or the most likely class if the highest predicted scores are equal. In the best case, assuming a classifier with perfect discriminative power, such labels would equal the ground truth. Furthermore, this research proposes metrics to assess the discrepancy between an ideal probability calibration scenario and the calibrated classifier to measure the quality of probability calibration achieved. By doing so, the calibrators’ quality over time can be measured without needing any data labeling for such an assessment. Furthermore, it enables exploring approximate model’s probabilities calibration, training a calibrator from a ground truth approximated with predicted labels. This idea is further explained and developed in section “Approximate model’s probabilities calibration”.

Approximate model’s probabilities calibration

Towards approximate probability calibration models

This research proposes metrics and an approach to calibrating machine learning prediction scores to probabilities using a ground truth approximation. The approach considers building an initial calibration set, as is common practice for probability calibration methods. A calibration set contains (a) the prediction scores used to perform the probability calibration and (b) the ground truth labels for the corresponding data instances. Using both, a mapping is created between the prediction scores and the probability of a class outcome. Nevertheless, the limited amount of data in the calibration set can impact the fidelity of the calibration. In particular, the distribution of predictive scores between the calibration set and the predictions performed in a production environment can differ.

The final prediction of a calibrated model has at least two sources of error: (a) the classification model, which does not perfectly predict the target class, and (b) the probability calibration technique, which does not produce a perfect probabilistic mapping between the predicted scores and the target class. While (a) directly affects the refinement loss (loss produced by assigning the same probability for instances from different classes), (b) affects the calibration loss (loss due to the difference between the predicted probabilities and observed positive instances for such an output). While metrics and plots exist to assess the quality of the probability calibration, such means require a ground truth to evaluate the probability calibration. While the requirement for a ground truth allows for an exact estimate of the classifier on that particular hold-out data, it has at least two drawbacks: (i) it requires labeling a certain amount of data to perform the evaluation, and (ii) such data may not be representative of current or future data distributions observed in a production environment.

Fig. 1 The figure presents two calibration plots. On the left, the calibration plot shows a perfectly calibrated calibrator (where the fraction of positives for the class under consideration equals the mean predicted probability). On the right, the same information is presented, but normalizing the values of the plot on the right to ensure the sum of their values equals one

Intuitions behind a calibration without a ground truth

Current scientific literature considers that the quality of a model’s calibration can be measured by comparing, for a fixed class, whether the fraction of positives corresponds to the predicted mean probability of a given classifier. The fraction of positives empirically measures the likelihood of positive class events for the class under consideration within a specific mean probability range (bin). In a well-calibrated model, the likelihood of the occurrence of positive class events in a particular bin for the class under consideration matches the mean predicted probability, revealing a linear relationship between the mean predicted probability and the likelihood of the occurrence in that bin of the positive class event for the class considered (see Fig. 1). Furthermore, a perfectly calibrated classifier is only possible for a binary classification problem with no class imbalance. Class imbalance or multiple classes introduce distortions regarding the frequency with which the positive class is observed within a given predicted mean probability range compared to the frequency with which other events occur within that range.

For a well-calibrated classifier, each of the predicted classes is expected to behave as shown in Fig. 1. Therefore, while class imbalance or a multi-class setting can introduce distortions to the histogram’s shape, the distance to the ideal case can be measured by comparing the histogram shape of a perfectly calibrated model for a given class and the shape of the histogram in the real-world case under consideration. To estimate how close the histograms are to each other, optimal transport is used (Peyré et al., 2019; Villani, 2009). In particular, the Wasserstein distance measures the distance between the two histogram distributions. We consider the Wasserstein distance between the histograms representing the existing calibration and a perfect one. The distance denotes the improvement opportunity regarding the specific calibration model to achieve a perfect calibration (or a desired calibration according to the reference histogram). Nevertheless, the fraction of positives for a given class cannot be computed when no ground truth is available. Therefore, we reframe the problem so that the goodness of a model calibration can be evaluated even without considering a ground truth.

Considering the information available in Fig. 1 and a particular class j, and considering each prediction regarding class j an event x, we are interested in two types of events: \(E_{1}=\{x \, corresponds \, to \, bin \, i \}\) and \(E_{2}=\{x \, corresponds \, to \, class \, j \}\). Furthermore, we are interested in calibrating the model so that the resulting score indicates \(p_{j}(E_{2} |E_{1})\).

Intuition 1: Considering a perfectly calibrated classifier

Let us consider the case of a perfectly calibrated classifier. Given a perfectly calibrated classifier, the fraction of positives for a given class must match the mean predicted probability. The fraction of positives within a certain bin i can be considered the empirical computation of \(p_{j}(E_{2} |E_{1})\). \(E_{1}\) and \(E_{2}\) are not independent events, given the probability of belonging to class j should be higher in bins representing a higher mean predicted probability. Therefore, \(p_{j}(E_{2} |E_{1}) = \frac{p_{j}(E_{2} \cap E_{1})}{p_{j}(E_{1})}\). Considering a balanced binary classification problem, the number of predictions issued for each mean predicted probability range must be equal to verify the symmetry regarding the fraction of positives observed in the mean probability ranges for both classes. Fluctuations regarding the fraction of positives observed in the mean predicted probability ranges translate into an unequal number of predictions in them and directly impact the quality of the calibration. Based on this observation and the abovementioned equation, \(p_{j}(E_{2} |E_{1}) = \frac{p_{j}(E_{2} \cap E_{1})}{p_{j}(E_{1})}\), \(p_{j}(E_{1})\) is constant and can be empirically computed as \(p_{j}(E_{1}) = \frac{1}{\# \, of \, bins}\). The number of predictions for a given class j is computed as the count of predictions where the highest predicted value was issued for that class j. While \(p_{j}(E_{2} \cap E_{1})\) cannot be computed without ground truth, the expected values that must be satisfied for each bin for \(p_{j}(E_{2} |E_{1})\) are known. Therefore, we envision at least two ways to estimate the mismatch between the ideal case and the case under consideration. First, the value of \(p_{j}(E_{2} \cap E_{1})\) can be inferred based on the expected \(p_{j}(E_{2} |E_{1})\) for a particular bin and the empirical computation of \(p_{j}(E_{1})\), to then measure the Wasserstein distance between the resulting distributions. Second, it could be estimated by only considering \(p_{j}(E_{1})\) and measuring the Wasserstein distance between the ideal distribution (an equal number of predictions per mean predicted probability range) and the distribution of predictions obtained from the calibrated classifier under consideration (the number of predictions per bin, which is empirically measured; usually the number of predictions is not equal across bins given the calibrated classifier’s imperfection). In both cases, each class’s calibration quality could be estimated by comparing two histograms: the ideal case and the calibration model under consideration. The distance between both distributions measures how far the particular calibrator is from a perfectly calibrated case.

While the case above was demonstrated for a balanced binary classification problem, it approximately holds for multiclass settings and cases with class imbalance. In these scenarios, we aim to calibrate each class as perfectly as possible, even though a perfect calibration cannot be achieved. Nevertheless, how well-calibrated each class is against the ideal case can still be assessed by comparing the distributions described above.

Intuition 2: Considering a perfect classifier

Let us consider the case of a perfect classifier. Given a perfect classifier, the prediction equals the ground truth regarding a positive class event for the class under consideration. Therefore, two scenarios are considered: (a) degrade the classifier’s performance to achieve a calibrated classifier, or (b) spread the predicted values within a specific range so that they emulate a particular calibration. It must be noted that while (a) can still satisfy the definition of probability considered for calibration, (b) does not.

For (a), the classifier’s performance must be degraded due to the inherent definition of probabilities used in this problem: the calibration model will ensure a proportion of positive events regarding a class given a mean predicted probability bin. Therefore, given \(n=number \, of \, classes\), the highest predicted value for each class will not issue only data instances of that class above 1/n. Furthermore, some cases will be lost under the 1/n threshold.

Fig. 2 The figure presents two calibration plots. On the left, the calibration plot shows a perfect binary classifier, while on the right we find a perfectly calibrated binary classifier

On the other hand, for (b), the abovementioned equation \(p_{j}(E_{2} |E_{1}) = \frac{p_{j}(E_{2} \cap E_{1})}{p_{j}(E_{1})}\) can be considered. It is known that for a perfect classifier, the following is true: \(p_{j}(E_{2})=0\) or \(p_{j}(E_{2})=1\). Furthermore, \(E_{1}\) and \(E_{2}\) can be considered dependent events, given \(p_{j}(E_{2})=0\) for bins below a certain threshold, and \(p_{j}(E_{2})=1\) otherwise (see Fig. 2). In addition, the mean predicted probability would not match the fraction of positives, given the classifier is perfect: each prediction perfectly identifies the target class. Therefore, this scenario per se violates the idea behind probabilities calibration. Nevertheless, the best approximation towards Fig. 1 would be to achieve an increasing number of predictions per mean predicted probability range (histogram bin) for a specific class. To avoid degrading the models’ discriminative power, such a mapping function will not issue scores below 1/n, where \(n=number \, of \, classes\).

Intuitions materialized

From intuitions to approximate calibrators

To perform model calibration, a function that can map the predictive scores of a machine learning model to probability scores is required. Ideally, such probability scores would indicate \(p_{j}(E_{2} |E_{1})\). When no ground truth is available, the intuitions described above can be considered to reproduce some scenarios where the resulting probability score distribution can be compared against an ideal probability score distribution. Therefore, we consider labeling the predicted data instances with the class with the highest predicted score. In case two classes hold equal scores, we decide on the most probable one based on the class imbalance observed in the train set. For balanced datasets, the class can be assigned randomly, given no other information exists to guide the decision. The more perfect the classification model, the closer the assigned labels will be to the ground truth. Given data instances with predicted scores and assigned labels, a calibrator can be fitted to map the classifier’s output to a calibrated probability.
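A minimal sketch of this labeling rule, assuming scores is an (N, K) array of predicted scores and class_priors holds the class frequencies observed in the train set (both names are illustrative):

import numpy as np

def approximate_ground_truth(scores, class_priors):
    # Label each instance with the class holding the highest predicted score;
    # ties are broken in favour of the class most frequent in the train set
    # (for a balanced dataset this tie-break is effectively random).
    labels = np.empty(len(scores), dtype=int)
    for i, row in enumerate(scores):
        candidates = np.flatnonzero(row == row.max())
        labels[i] = candidates[np.argmax(class_priors[candidates])]
    return labels

The calibrator is then fitted on the prediction scores paired with these approximate labels instead of human annotations.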

To realize the abovementioned calibration without ground truth, at least the following preconditions must be met: (a) no concept drift exists, (b) no covariate shift exists, and (c) the values of the features in the production environment remain within the ranges considered when training the machine learning model.

From intuitions to metrics

In Sections “Intuition 1: Considering a perfectly calibrated classifier” and “Intuition 2: Considering a perfect classifier”, the cases of a perfectly calibrated model and a perfect classifier were considered. While in the case of a perfect classifier a ground truth is not needed (the predicted labels equal the ground truth), non-perfect classifiers approximate such a ground truth to a certain degree (measured as the classifier’s performance). Furthermore, regardless of the calibration technique, it was shown that a certain correlation between the calibration quality and the calibration score distribution exists. In particular, it was shown that for each class k a histogram can be computed showing (a) the number of predictions per bin and (b) the proportion of positive class occurrences per mean predicted probability bin. Both can then be compared against ideal cases. A certain advantage of (a) is that it does not require a ground truth or ground truth approximation to determine whether some bins are under- or over-assigned. While such an imbalance certainly signals a calibration error, the histograms lack information regarding the composition of each bin. In particular, they provide no information on whether the positive class occurrences increase according to the value of the mean predicted probability bin. This can only be measured in (b), comparing all cases against an ideal calibration histogram. For multiclass problems, each class could be compared against such a histogram, and the resulting scores averaged (Fig. 3).

Fig. 3 The figure illustrates two sample histograms: the histogram on the left corresponds to some sub-optimally calibrated classifier. In contrast, the histogram on the right (reference histogram) corresponds to a perfectly calibrated classifier

To estimate how close a probability calibration method is w.r.t. the target (ideal) histogram, optimal transport is used (Peyré et al., 2019; Villani, 2009). In particular, the Wasserstein distance between two histogram distributions is considered: a histogram constructed with the calibrator scores and a histogram corresponding to the ideal scenario. Based on them, we propose a metric that can be used to estimate the quality of calibration of any calibrator given certain ground truth. We name it Probability Calibration Score (PCS—see Eq. 1). The proposed metrics issue a value between zero and one: PCS is zero when the model is not calibrated and one when the model is perfectly calibrated. Furthermore, a weighted metric variant can also be considered (wPCS—see Eq. 2), where the proportion of each class among the observed instances weights the Wasserstein distances.

\(W_{1}(h_{i}, h_{ref})\) is the 1-Wasserstein distance between the histogram \(h_{i}\) and the reference histogram \(h_{ref}\) and n is the number of classes.

$$\begin{aligned} PCS = \sum _{i=1}^{n} \frac{1 - W_{1}(h_{i}, h_{ref})}{n} \end{aligned}$$
(1)

\(W_{1}(h_{i}, h_{ref})\) is the 1-Wasserstein distance between the histogram \(h_{i}\) and the reference histogram \(h_{ref}\) and \(w_{i}\) is the weight of a particular class. n indicates the number of classes under consideration.

$$\begin{aligned} wPCS = \sum _{i=1}^{n} \left( 1 - W_{1}(h_{i}, h_{ref})\right) \cdot w_{i} \end{aligned}$$
(2)

To ensure the histograms are comparable, they are normalized, ensuring that the sum of their values equals one. To ensure the Wasserstein distance remains between zero and one, the distance between both distributions is divided by the distance measured between the worst-case scenario and the reference ideal histogram (see Fig. 4). In Fig. 4, we consider the Wasserstein distance between the case on the left and the distribution of a Perfect Probability Calibration Model (PPCM) to be the highest among possible calibration scenarios.
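A minimal sketch of how PCS and wPCS could be computed with SciPy’s optimal transport utilities, assuming per_class_hists holds one histogram of calibrator scores per class (one count per mean predicted probability bin) and centers holds the bin centers; the reference and worst-case histograms follow Figs. 3 and 4, and all names are illustrative:

import numpy as np
from scipy.stats import wasserstein_distance

def normalized_w1(hist, ref, worst, centers):
    # Normalize the histograms so their values sum to one, then divide the
    # 1-Wasserstein distance by the worst-case distance to keep it in [0, 1].
    hist, ref, worst = (h / h.sum() for h in (hist, ref, worst))
    d = wasserstein_distance(centers, centers, u_weights=hist, v_weights=ref)
    d_max = wasserstein_distance(centers, centers, u_weights=worst, v_weights=ref)
    return d / d_max

def pcs(per_class_hists, centers, class_weights=None):
    # Reference histogram (PPCM): mass grows linearly with the bin center;
    # worst case: all mass concentrated in the lowest bin (Fig. 4, left).
    ref = np.asarray(centers, dtype=float)
    worst = np.zeros_like(ref)
    worst[0] = 1.0
    dists = [normalized_w1(np.asarray(h, dtype=float), ref, worst, centers)
             for h in per_class_hists]
    if class_weights is None:
        return float(np.mean([1.0 - d for d in dists]))                          # Eq. (1)
    return float(np.sum([(1.0 - d) * w for d, w in zip(dists, class_weights)]))  # Eq. (2)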

Fig. 4 The figure illustrates two sample calibration plots: the calibration plot on the left corresponds to a calibrated classifier where all positives were assigned to a zero mean predicted probability (worst-case scenario). In contrast, the calibration plot on the right (reference histogram) corresponds to a perfectly calibrated classifier. Both calibration plots correspond to normalized cases, where the sum of the values equals one

When assessing the performance of an approximate calibrated model, two errors must be taken into account: (i) the classification error, given the classifier does not perfectly predict the target class (and the ground truth is approximated with such predictions), and (ii) the probability calibration technique, which does not produce a perfect probabilistic mapping between the predicted scores and the (approximated) target class. To measure (i), we choose the AUC ROC metric, which is not affected by the class imbalance. AUC ROC can be computed in a multiclass setting with a one-vs-rest or one-vs-one strategy. We measure it on the test set. We consider (ii) can be measured using the Wasserstein distance, comparing the ideal calibration histogram and a histogram where the proportion of positive class occurrences (given the approximate ground truth) is considered per mean predicted probability bin.

We propose two metrics, which we name Additive Probability Calibration Score (APCS—see Eq. 6) and Multiplicative Probability Calibration Score (MPCS—see Eq. 8). Both summarize the calibrated models’ performance, considering the classifier’s imperfection (see Eq. 3) and the calibration error incurred due to the lack of ground truth. To ensure the Wasserstein distance remains between zero and one, we compute a normalized histogram, ensuring the area of the entire histogram equals one. The proposed metrics issue a value between zero and one, and in both cases, the higher the value, the better the model. Furthermore, we also provide a weighted version of both metrics (wAPCS (see Eq. 7) and wMPCS (see Eq. 9)), which aim to weight the Wasserstein distance between the normalized histograms obtained from a calibrator and the ideal histogram with the class weights (see Eqs. 4 and  5 for APCS\(_{\hbox {W}}\) and wAPCS\(_{\hbox {W}}\), and Eqs. 1 and 2 for MPCS and wMPCS).

APCS is zero when the model has no discriminative power and is not calibrated, and one when the model is perfectly calibrated and shows no classification error on the test set. The APCS metric is detailed in Eq. 6.

K is used to measure classifiers’ discriminative power. \(AUC ROC_{Classifier_{test}}\) corresponds to the classifiers’ AUC ROC measured on the test set.

$$\begin{aligned} K_{AUC ROC} = |0.5 - AUC ROC_{Classifier_{test}}|\end{aligned}$$
(3)

Component for Wasserstein distance measurement between an ideal calibrator and the calibrator under consideration, as used for the APCS metric.

$$\begin{aligned} APCS_{W} = 0.5 \cdot PCS \end{aligned}$$
(4)

Component for Wasserstein distance measurement between an ideal calibrator and the calibrator under consideration, as used for the wAPCS metric.

$$\begin{aligned} wAPCS_{W} = 0.5 \cdot wPCS \end{aligned}$$
(5)

APCS metric definition.

$$\begin{aligned} APCS = K_{AUC ROC} + APCS_{W} \end{aligned}$$
(6)

wAPCS metric definition.

$$\begin{aligned} wAPCS = K_{AUC ROC} + wAPCS_{W} \end{aligned}$$
(7)

On the other hand, MPCS and wMPCS correspond to zero when (a) the classifiers’ predictive ability is no better than random guessing or (b) the Wasserstein distance between histograms is highest (equal to one). Moreover, MPCS and wMPCS correspond to one when (a) the classifiers’ predictive ability is perfect, and (b) the calibration is perfect w.r.t. the target histogram h of choice. The MPCS metric is detailed in Eq. 8.

MPCS metric definition

$$\begin{aligned} MPCS = K_{AUC ROC} \cdot PCS \end{aligned}$$
(8)

wMPCS metric definition

$$\begin{aligned} wMPCS = K_{AUC ROC} \cdot wPCS \end{aligned}$$
(9)
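Given the PCS and wPCS values from the sketch above and the classifier’s AUC ROC on the test set, the composite metrics are a direct transcription of Eqs. (3)–(9); a minimal sketch:

def k_auc_roc(auc_roc_test):
    # Eq. (3): discriminative-power component measured on the test set.
    return abs(0.5 - auc_roc_test)

def apcs(auc_roc_test, pcs_value):
    return k_auc_roc(auc_roc_test) + 0.5 * pcs_value    # Eqs. (4) and (6)

def wapcs(auc_roc_test, wpcs_value):
    return k_auc_roc(auc_roc_test) + 0.5 * wpcs_value   # Eqs. (5) and (7)

def mpcs(auc_roc_test, pcs_value):
    return k_auc_roc(auc_roc_test) * pcs_value          # Eq. (8)

def wmpcs(auc_roc_test, wpcs_value):
    return k_auc_roc(auc_roc_test) * wpcs_value         # Eq. (9)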
Fig. 5 The figure illustrates three histograms that correspond to the ideal cases described in this section: Perfect Probability Calibration Model (PPCM), Almost Perfect Probability Calibration Model (APPCM), and Perfect Classification with Perfect Confidence (PPwPC)

For models’ probability calibration, PCS, APCS, and MPCS assume an ideal reference histogram. Three histograms are presented in Fig. 5 corresponding to (a) a Perfect Probability Calibration Model (PPCM), (b) an Almost Perfect Probability Calibration Model (APPCM), and (c) Perfect Classification with Perfect Confidence (PPwPC). While only PPCM can be used for strict probability calibration, the other two reference histograms measure how far the distributions of the predicted values are from other desired distribution shapes. In particular, APPCM achieves a similar spread of predicted probabilities as PPCM but neglects the segment of predictions below 1/n (with \(n=number \, of \, classes\)), where the classifier would become suboptimal. On the other hand, PPwPC advocates for a classifier where all scores are pushed toward the highest possible score for a given class. This research only considers the PPCM reference histogram to compute the above-described metrics.

Use case

Philips Consumer Lifestyle BV in Drachten, The Netherlands, is one of Philips’ biggest development and production centers in Europe. They use cutting-edge production technology to manufacture products ceaselessly. One of their improvement opportunities is related to visual inspection, where they aim to identify when the company logo is not properly printed on the manufactured products. They have multiple printing pad machines, after which the products are handled, visually inspected, and removed if any defect is detected. Experts estimate that a fully automated procedure would speed up the process by more than 40%. Currently, there are two defects associated with the printing quality of the logo (see Fig. 6): double prints (the whole logo is printed twice with a varying overlap degree) and interrupted prints (the logo displays small non-pigmented areas, similar to scratches).

Fig. 6 The images shown above correspond to three possible classes: good (no defect), double print (defective), and interrupted print (defective)

Machine learning models can be developed to automate the visual inspection procedure (Rippel et al., 2021; Zavrtanik et al., 2022). However, given that such models are imperfect, manual revision can be used as a fallback to inspect the products about which the uncertainty of the machine learning model exceeds a certain threshold. Such decisions can be made based on simple decision rules, quality policies, and the probability of obtaining a defective product given a particular prediction score. Furthermore, products sent for manual inspection can be prioritized using different criteria to enhance the existing defect detection machine learning model. This research explores the abovementioned capabilities through multiple experiments, building supervised models, leveraging active learning, and comparing six machine learning algorithms. Furthermore, new measures for probability calibration are explored, and experiments are executed to determine whether existing calibration techniques would benefit from enlarging the calibration set with approximate ground truth. The experiments were conducted on a dataset of 3518 labeled images, all corresponding to manufactured shavers.

Methodology

The research presented in this paper was performed using the Python language, and open source libraries, such as scikit-learn (Buitinck et al., 2013) and netcal (Küppers et al., 2020).

Methodological aspects to evaluate active learning strategies

Fig. 7 The methodology we followed to train and assess machine learning models and active learning scenarios

We frame the automated defect detection as a supervised, multiclass classification problem. A ResNet-18 model (He et al., 2016) was used for feature extraction: a 512-dimensional vector was extracted for each image from the average pooling layer. To avoid overfitting, the procedure suggested by Hua et al. (2005) was followed, selecting the top K features, with \(K=\sqrt{N}\), where N is the number of data instances in the train set. Features’ relevance was assessed considering the mutual information score, which measures any relationship between random variables. The mutual information score is not sensitive to feature transformations if these transformations are invertible and differentiable in the feature space or preserve the order of the original elements of the feature vectors (Vergara & Estévez, 2014) (Fig. 7).
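A minimal sketch of the feature extraction and selection steps, assuming PIL images as input; the torchvision weights API shown assumes a recent library version, and the function names are illustrative:

import numpy as np
import torch
from torchvision import models, transforms
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# ResNet-18 truncated after the global average-pooling layer, yielding
# 512-dimensional feature vectors per image.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images])
    return backbone(batch).squeeze(-1).squeeze(-1).numpy()  # shape (N, 512)

def fit_feature_selector(features, labels):
    # Keep the top K = sqrt(N) features ranked by mutual information (Hua et al., 2005).
    k = int(np.sqrt(len(features)))
    return SelectKBest(mutual_info_classif, k=k).fit(features, labels)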

Fig. 8 A 10-fold stratified cross-validation was used. The dataset was split for four purposes: training, testing, probabilities calibration, and simulating unlabeled data under an active learning setting

To evaluate the models’ and active learning scenarios’ performance, a stratified k-fold cross-validation (Zeng & Martinez, 2000) was applied, considering k=10 based on recommendations by Kuhn and Johnson (2013). One fold was used for testing (test set) and one for the machine learning models’ probabilities calibration (calibration set). Three folds were used to simulate a pool of unlabeled data for active learning (active learning set), and the rest to train the model (train set) (see Fig. 8). Samples are selected from the active learning set to be annotated by the oracle and then added to the training set, on which the models are retrained. In this research, two types of oracles were considered: (a) machine oracles, which can be imperfect, and (b) human annotators (assumed to be ideal). Five machine learning algorithms were evaluated: Gaussian Naïve Bayes, CART (Classification and Regression Trees, similar to C4.5, but it does not compute rule sets), Linear SVM, kNN, and Multilayer Perceptron (MLP).
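A minimal sketch of this split, assuming X and y are NumPy arrays with the extracted features and labels; the fold assignment mirrors Fig. 8:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def split_for_active_learning(X, y, seed=0):
    # 10-fold stratified split: one fold for testing, one for probability
    # calibration, three folds as the simulated unlabeled pool, five for training.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = [fold_idx for _, fold_idx in skf.split(X, y)]
    test_idx = folds[0]
    calibration_idx = folds[1]
    active_learning_idx = np.concatenate(folds[2:5])
    train_idx = np.concatenate(folds[5:])
    return train_idx, test_idx, calibration_idx, active_learning_idx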

To evaluate the discriminative power of the machine learning models and how it is enhanced over time through active learning, the AUC ROC metric was computed. Given the multiclass setting, the “one-vs-rest” heuristic was selected, splitting the multiclass dataset into multiple binary classification problems and computing their average, weighted by the number of true instances for each class. In addition, to assess the usefulness of the active learning approaches, the AUC ROC values obtained by evaluating the model against the test fold for the first (Q1) and last (Q4) quartiles of instances queried in an active learning setting were compared. The amount of manual work saved under each active learning setting and the soft-labeling approaches’ precision were also evaluated.
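In scikit-learn, this weighted one-vs-rest AUC ROC can be obtained directly; a minimal sketch, assuming y_true holds the class labels and y_prob the per-class predicted probabilities:

from sklearn.metrics import roc_auc_score

def weighted_ovr_auc_roc(y_true, y_prob):
    # One-vs-rest AUC ROC, averaging per-class values weighted by the number
    # of true instances of each class.
    return roc_auc_score(y_true, y_prob, multi_class="ovr", average="weighted")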

Fig. 9 Expected visual inspection pipeline in a production setting. Multiple active learning strategies were assessed to identify which would drive the best results

Through different experiments (detailed in section “Experiments”), a visual inspection pipeline was simulated (see Fig. 9). First, a stream of images is directed toward the machine learning model trained to identify possible defects. Then, based on the prediction score, a decision is made on whether the manufactured product should remain in the production line or be deferred to manual inspection. If the product is unlikely to be defective, such a decision can be considered a label (it is considered a soft label when not made by a human annotator). The label is then persisted, enlarging the existing dataset. The enlarged dataset can be used to retrain the model and replace the existing one after a successful deployment.

Methodological aspects to evaluate probability calibration metrics and strategies

Fig. 10 A 10-fold stratified cross-validation was used. The data was split into train set, test set, calibration set, and unlabeled data. Unlabeled data was used to simulate a stream of unlabeled data and assess whether a histogram-based calibration method without ground truth can enhance its performance over time

wECE metric definition. n indicates the number of classes under consideration.

$$\begin{aligned} wECE = \sum _{i=1}^{n} ECE_{i} \cdot w_{i} \end{aligned}$$
(10)

A procedure similar to the one described in the previous subsection was followed to evaluate the proposed probability calibration metrics and techniques, omitting the active learning step. Furthermore, a different dataset split was considered (see Fig. 10). After training the machine learning model and calibrating it with the calibration set, the non-calibrated model was used to issue a prediction for each instance in the unlabeled data set. The predicted class is then used to further adjust (train) the calibrator. While this introduces some noise, we expect that the better the classification model, the more it would benefit the calibrator, as explained in section “Approximate model’s probabilities calibration”. Eleven performance metrics were measured: AUC ROC, ECE (computed as the ECE for each class and averaged, assigning the same weight to all classes), wECE (computed as the class-wise ECE—see Eq. 10), PCS, wPCS, APCS\(_{\hbox {W}}\), wAPCS\(_{\hbox {W}}\), APCS, wAPCS, MPCS, and wMPCS. AUC ROC measures the discriminative capability of the model and provides insights into how such capability is affected by different calibration techniques. ECE evaluates the expected difference between the accuracy and confidence of a calibration model. The ECE metric was used to compare the calibration quality for the multiple calibration techniques and the newly proposed PCS, wPCS, APCS, wAPCS, MPCS, and wMPCS metrics. Furthermore, given that the newly proposed metrics were built on a similar concept as the ECE metric, we are interested in how much they capture the same information. The Kendall \(\tau \) [see Kendall (1938)] and the Pearson correlation between ECE and the newly proposed metrics were measured. The Kendall correlation measures the ordinal association between two measured quantities; in this case, it measures to what extent both metrics increase or decrease together, given the predictions for a given machine learning model and calibrators. The Pearson correlation, on the other hand, was used to assess whether the correlation between metrics is linear.
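A minimal sketch of wECE (Eq. 10) and of the correlation measurements, reusing the expected_calibration_error function sketched in the related work section; the array names are illustrative:

import numpy as np
from scipy.stats import kendalltau, pearsonr

def weighted_ece(y_true, y_prob, class_weights, n_bins=10):
    # Eq. (10): class-wise binary ECE weighted by the class proportions.
    n_classes = y_prob.shape[1]
    return sum(
        class_weights[c]
        * expected_calibration_error((y_true == c).astype(int), y_prob[:, c], n_bins)
        for c in range(n_classes)
    )

# Agreement between ECE and a proposed metric across the evaluated
# model/calibrator combinations:
# tau, _ = kendalltau(ece_values, pcs_values)
# rho, _ = pearsonr(ece_values, pcs_values)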

The metrics were computed on the test set against the ground truth (class annotations) and the approximate ground truth (predicted classes). The results were analyzed to understand how well the metrics capture the models’ performance and calibration when no ground truth is available. Furthermore, the weighted and non-weighted metrics were compared to understand how class weighting influences the final score and perception regarding the quality of the calibration.

Experiments

Experimenting with active learning strategies

For this research, two active learning settings were explored (pool-based and stream-based), using four distinct strategies to label the queried data instances in an active learning setting. Two strategies were used to select data from the active learning set under the pool-based active learning setting: (a) random sampling and (b) selecting the instances for which the classification model was the most uncertain. The model’s uncertainty was assessed by considering the highest score for a given class for a given instance and selecting the instance with the lowest such score among the data instances in the active learning set. In both cases, data were sampled until the set’s exhaustion. Under the streaming active learning setting, a slightly different policy was used. When random sampling was used, a decision was made whether to keep or discard the instance with a probability threshold of 0.5. Under the highest-uncertainty selection criteria, the prediction for each data instance was analyzed, and the instance was forwarded to the oracles for labeling if the prediction was below a certain confidence threshold (p = 0.95 or p = 0.99).
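A minimal sketch of both query strategies, assuming pool_scores is an (N, K) array of predicted scores for the remaining pool and score_row the score vector of a single streamed instance (names are illustrative):

import numpy as np

def pool_query_most_uncertain(pool_scores):
    # Pool-based uncertainty sampling: pick the instance whose highest class
    # score is the lowest across the active learning pool.
    return int(np.argmin(pool_scores.max(axis=1)))

def stream_should_query(score_row, threshold=0.95, random_policy=False, rng=None):
    # Stream-based selection: query the oracle when the top score falls below
    # the confidence threshold; under the random policy, keep with probability 0.5.
    if random_policy:
        rng = rng if rng is not None else np.random.default_rng()
        return bool(rng.random() < 0.5)
    return bool(score_row.max() < threshold)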

Fig. 11 Three oracle settings are explored in this research: A a human annotator; B soft-labeling with the classification model’s outcomes for instances with high-confidence scores, and a human annotator for instances where the model has low confidence; and C, which is analogous to B, but the machine oracle takes into account the classifier’s output score and whether the predicted class matches the class of the labeled image with the shortest distance to the active sample. In C, the sample is sent to manual revision if there is a class mismatch in the machine oracle. Samples are only discarded in a streaming setting

Three oracle settings were considered (see Fig. 11): (A) a human labeler as the only source of truth; (B) a machine oracle (classifier model) for data instances where the classifier had high certainty, and a human labeler otherwise; and (C) a machine oracle (classifier model) for data instances where the classifier had high certainty, requesting an additional opinion from another machine oracle when uncertain about the outcome. This second oracle queries the closest labeled image from three randomly picked images (one per class). In (C), a label is issued by the machine oracles only when both are unanimous; otherwise, the instance labeling is delegated to a human labeler. The decision regarding which oracle to query was based on the model’s confidence regarding the outcome and a probability threshold set based on manufacturing quality policies. It was assumed that the second machine oracle in (C) is accessible at a certain cost (e.g., a paid external service) and, therefore, cannot be used for every prediction. Such a service was simulated by computing the Structural Similarity Index Measure (SSIM) score over the queried image.
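A minimal sketch of the second machine oracle in setting (C), assuming 8-bit grayscale images and illustrative function names; the SSIM comparison uses scikit-image:

import numpy as np
from skimage.metrics import structural_similarity as ssim

def second_opinion(query_img, reference_imgs, reference_labels):
    # Simulated external oracle: compare the query against one randomly picked
    # labeled image per class and return the label of the most similar one.
    scores = [ssim(query_img, ref, data_range=255) for ref in reference_imgs]
    return reference_labels[int(np.argmax(scores))]

def machine_oracle_label(predicted_class, query_img, reference_imgs, reference_labels):
    # Setting (C): accept the soft label only when both machine oracles agree;
    # otherwise return None to signal that the sample requires manual revision.
    if second_opinion(query_img, reference_imgs, reference_labels) == predicted_class:
        return predicted_class
    return None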

Eight scenarios were set up (see Table 1) and experimented with, using two quality thresholds (0.95 and 0.99 probability that the item corresponds to a certain class) and five machine learning models. The machine learning models were calibrated using a sigmoid model based on Platt’s logistic model (Platt, 1999) (see Eq. 11).

Platt classifier calibration logistic model. \(y_i\) denotes the true label, and \(f_i\) denotes the uncalibrated classifier’s prediction for a particular sample. A and B denote parameters adjusted when fitting the regressor.

$$\begin{aligned} P(y_i=1 \mid f_i) = \frac{1}{1+exp(Af_i+B)} \end{aligned}$$
(11)
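A minimal sketch of fitting Eq. (11) with scikit-learn, assuming scores holds the uncalibrated prediction scores of the calibration set and y the corresponding binary labels; note that the original Platt formulation additionally smooths the target labels, which is omitted here:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, y):
    # Eq. (11): P(y=1 | f) = 1 / (1 + exp(A*f + B)). scikit-learn's logistic
    # regression models 1 / (1 + exp(-(w*f + b))), hence A = -w and B = -b.
    lr = LogisticRegression(C=1e6)  # large C: effectively unregularized
    lr.fit(np.asarray(scores).reshape(-1, 1), y)
    return -lr.coef_[0][0], -lr.intercept_[0]

def platt_probability(scores, A, B):
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores) + B))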
Table 1 Proposed experiments to evaluate the best active learning setting regarding how it influences the models’ learning and its impact on the manual revision workload

Experiments assessing probability calibration metrics and techniques

In an automated visual inspection setting, a labeling effort is required to (a) label data to train and calibrate the machine learning models and (b) perform a manual inspection when the models cannot determine the class of a given data instance accurately. To understand how the probability calibration affects the machine learning models, the models’ predictions were compared against those obtained by (a) not calibrating the model (No calibration) and calibrating the model with (b) a sigmoid model based on the Platt logistic model (Platt), (c) temperature scaling (Temperature), and (d) Histogram calibration. Two aspects were considered in the experiments: (i) how calibration techniques compare against each other and (ii) whether calibrating a model without a ground truth can provide comparable results to models calibrated with ground truth.

Results and evaluation

Results and evaluation of active learning strategies

The active learning strategies were analyzed from two points of view. First, whether they contributed to better learning of the machine learning model. Second, how much manual work could be saved by adopting such strategies.

Table 2 Mean values for the mean AUC ROC computed across ten folds for five machine learning models

For the first case, the AUC ROC was measured over time (see Table 2). In particular, the models' average performance was contrasted after they consumed the data within Q1 and Q4 of the active learning pool. The best outcomes were observed for Experiment 2 (highest uncertainty with a human labeler), while the second-best performance was observed for Experiment 8 (highest uncertainty with machine and human oracles). Overall, the streaming setting had a better average performance than the pool-based experiments, despite achieving only the second-best results with Platt scaling. Furthermore, in two cases, the machine learning models' performance degraded between Q1 and Q4: Experiment 3 (\(p=0.95\)) and Experiment 4 (\(p=0.95\)).

Given that (a) a machine oracle was used in both experiments, (b) no performance decrease was observed for \(p=0.99\), and (c) the same setting did not affect the streaming case, we were initially tempted to conclude that the machine oracles most likely mislabeled certain instances, confusing the model when retrained and therefore reducing its performance over time. Nevertheless, further analysis revealed that only a small fraction of the data was soft-labeled and that most of those cases were accurately labeled. While soft labeling was detrimental in the pool-based active learning settings, it led to superior results in the streaming setting, achieving results close to the best ones obtained across all experiments.

Table 3 Mean AUC ROC values computed across ten test folds for five machine learning models

In Table 3 we report the performance of the machine learning models for Experiment 2 and compare how they performed after Q1 and Q4 of the active learning pool data were shown to them. We found that the best performance was attained by the MLP, followed by the SVM, which trailed by at least 0.05 AUC ROC points. Furthermore, while the MLP increased its performance over time, the SVM slightly reduced it in Q4. No other model showed a performance decrease over time. Since Experiment 2 only considered a human oracle and the annotations are accurate, the performance decrease cannot be attributed to mislabeling. Furthermore, while the model's loss of discriminative capacity could be attributed to class imbalance, we consider this improbable, given that the rest of the models could better discern among the classes over time. Finally, the CART model obtained the worst results, lagging slightly more than 0.16 AUC ROC points behind the best model.

As mentioned at the beginning of this section, another relevant aspect of evaluating active learning strategies is their potential to reduce data annotation efforts. This could be analyzed from two perspectives. First, whether the additional data annotations provide enough knowledge to enhance the models’ performance significantly. If not, the data annotation can be avoided. Second, a strategy can be devised (e.g., a machine oracle) to reduce the manual annotation effort. In this work, we focused on the second one. Table 4 presents the results for a cut-off value of p = 0.95. For p = 0.99, no instances were retrieved and given to machine oracles; therefore, no analysis was performed on them. The task required annotating 2460 samples on average.

When considering the cut-off value of 0.95, it was noticed that, with the Platt-calibrated scores, only a negligible number of cases per experiment qualified for soft labeling. While the quality of the annotations was high, using machine oracles would not strongly alleviate the manual labeling effort. The highest number of soft-labeled instances corresponded to the experiments with streaming settings (Experiment 7 and Experiment 8), which soft-labeled 4% and 3% of all data instances, respectively. Furthermore, 96% of samples were correctly labeled in both cases, meeting the quality threshold of p = 0.95. The lower number of soft-labeled samples in Experiment 8 was due to discrepancies between the machine learning model and the SSIM score. Furthermore, the best machine labeling quality was achieved when considering Oracle C (unanimous vote of two machine oracles). When contrasting with the AUC ROC results obtained for those experiments, it was observed that while Experiments 3 and 4 slightly decreased their discriminative power, Experiments 7 and 8 increased their performance by at least 0.01 AUC ROC points.

Table 4 Proportion and quality of soft labeling through different settings, considering a predicted probability cut-off value of p = 0.95
Table 5 The results were obtained for different models and probability calibration techniques
Table 6 The results were obtained for different probability calibration techniques

Results and evaluation of probability calibration metrics and techniques

The experiments performed in this research aimed to validate whether the metrics proposed to measure the quality of a calibrator can be used to understand the performance of a calibrator even when no ground truth is available. Furthermore, it aimed to validate whether predictions on unlabeled data could enhance the calibrators’ performance. The results are presented in Tables 5, 6, and 7. The PCS, APCS, and MPCS (along with the weighted variants) metrics were computed considering the PPCM histogram, which denotes a perfect calibration.

To understand whether the proposed metrics can measure the calibration quality without ground truth, the Pearson and Kendall correlations were computed between the ECE, wECE, APCS\(_{\hbox {W}}\), wAPCS\(_{\hbox {W}}\), PCS, and wPCS metrics (see Table 5). While ECE and wECE are always computed considering the ground truth of the test set, PCS, APCS\(_{\hbox {W}}\), wPCS, and wAPCS\(_{\hbox {W}}\) were calculated considering two cases: the ground truth (golden standard) and the predicted labels (approximate ground truth) of the test set. Furthermore, the correlations between the metrics were evaluated at two separate moments: after calibrating the models with the calibration set (CS) and after calibrating the models with additional samples retrieved from the unlabeled data set (CS+UD). The results show that the correlations between the ECE, PCS, APCS\(_{\hbox {W}}\), wPCS, and wAPCS\(_{\hbox {W}}\) metrics are consistent across all cases. Furthermore, little variation exists between the values obtained when PCS or wPCS were computed on the ground truth or the approximate ground truth. While the Pearson correlation decreased after training the calibrator with predicted labels from the unlabeled data set, the Kendall correlation grew stronger when PCS or APCS\(_{\hbox {W}}\) were simply averaged across classes and not weighted by the frequency of occurrence of each class. We consider the correlations moderate (negative Pearson correlations were measured between 0.50 and 0.61) or strong (negative Kendall correlations were above 0.33 and slightly below 0.40). Given the abovementioned results, we consider that the PCS, wPCS, APCS\(_{\hbox {W}}\), and wAPCS\(_{\hbox {W}}\) metrics adequately capture the information conveyed by the ECE metric regardless of the source of truth used to measure the quality of the calibration. Therefore, we conclude that PCS, wPCS, APCS\(_{\hbox {W}}\), and wAPCS\(_{\hbox {W}}\) can be used to assess the calibrators' quality when no ground truth is available.
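The agreement between the metrics described above can be quantified with off-the-shelf routines. The following minimal Python sketch assumes two hypothetical arrays, ece_values and proxy_values, holding one metric value per evaluated model/calibrator combination (the ECE computed with ground truth versus a proposed metric computed with predicted labels).

from scipy.stats import pearsonr, kendalltau

def metric_agreement(ece_values, proxy_values):
    """Pearson and Kendall correlations between the ECE (ground-truth based)
    and a proposed proxy metric (e.g., PCS computed on predicted labels)."""
    pearson_r, _ = pearsonr(ece_values, proxy_values)
    kendall_tau, _ = kendalltau(ece_values, proxy_values)
    return pearson_r, kendall_tau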

Fig. 12 Eight calibration plots, comparing No calibration, Histogram calibration, Platt calibration, and Temperature calibration at CS (calibrated with the calibration set, i.e., ground truth) and CS+UD (calibrated with the calibration set and predicted labels over time). The calibration plots have been adapted, showing normalized values (their sum is one) rather than the usual fraction of positives on the dependent variable axis. The x-axis denotes the mean predicted probability for a given class. The histograms average predictions across classes and calibrated machine learning models
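A sketch of how such a normalized calibration plot could be produced is given below. It relies on scikit-learn's calibration_curve and normalizes the per-bin fraction of positives so the plotted values sum to one, which is one plausible reading of the adaptation described in the caption rather than the exact procedure used for Fig. 12.

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def normalized_calibration_plot(y_true, y_prob, n_bins=10, label="calibrator"):
    """Calibration plot with the per-bin fraction of positives normalized
    so that the plotted values sum to one (cf. Fig. 12)."""
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    total = frac_pos.sum()
    normalized = frac_pos / total if total > 0 else frac_pos
    plt.plot(mean_pred, normalized, marker="o", label=label)
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Normalized fraction of positives")
    plt.legend()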

Tables 6 and 7 compare the calibrators across multiple metrics to assess how an approximate calibration affects their discriminative power (AUC ROC) and whether it helps to enhance the calibrators' quality (ECE, PCS, APCS, MPCS, and their weighted variants). Furthermore, Fig. 12 presents calibration plots for each calibrator at CS and CS+UD for visual assessment.

When comparing the calibrators through the non-weighted metrics (see Table 6), we consider that the Platt calibrator achieved the most stable performance. While the measured quality of calibration slightly decreased with the approximate calibration, it should be noted that a higher proportion of positives was allocated at higher scores. Furthermore, while the model's overall discriminative power slightly decreased with the approximate calibration, it remained superior to the other models (even the uncalibrated one) by at least 0.02 AUC ROC points. The Histogram and Temperature calibrators provide an interesting case, given that both had a similar initial (CS) calibration quality when measured with the ECE, PCS, APCS, or MPCS metrics. Nevertheless, the metrics at CS+UD showed discrepancies: while the ECE slightly increased for the Histogram calibrator (indicating a worse calibration quality), it remained the same for the Temperature calibrator. On the other hand, PCS, APCS, and MPCS decreased (signaling a worse calibration quality) for both the Histogram and Temperature calibrators. Furthermore, the decrease in the metrics' values was more pronounced for the Histogram calibrator. When visually assessing both calibrators, we found that they had a similar initial distribution (CS), but the Histogram calibrator ended up much more skewed than the Temperature calibrator at CS+UD. While the ECE metric did not capture this behavior, it was successfully summarized by the PCS, APCS, and MPCS metrics. The same patterns could be observed when analyzing the weighted metrics (see Table 7).

From the results above, we confirm that the proposed metrics can accurately measure the quality of calibration of a given calibrator when no ground truth is available. Furthermore, the metrics have been shown to provide a more accurate measurement of the calibrators' quality than the ECE, overcoming some of its shortcomings (e.g., by providing a more holistic view of the distribution of positives along the mean predicted probability and by taking empty bins into account).

Our research shows that tracking predictions over time did not enhance the quality of calibration for any of the methods involved (Histogram calibration, Platt calibration, or Temperature calibration). Finding accurate calibration models for probability calibration, given a lack of ground truth, remains a matter of future work.

Table 7 The results were obtained for different probability calibration techniques

Conclusions and future work

This work explored active learning with multiple oracles to alleviate the manual inspection of manufactured products and the labeling of inspected products. Our active learning settings can save up to four percent of the manual inspection and data labeling load without compromising the quality of the outcome for a quality threshold of p = 0.95. It must be noted that the labeling savings depend on the machine learning model deployed, the acceptance quality levels, and the quality of the active learning machine oracles under consideration. Furthermore, multiple probability calibration techniques were compared, and several new metrics to measure the quality of a calibrator were proposed. The metrics enable measuring the calibrators' quality even when no ground truth is available. The experiments demonstrated that the proposed metrics capture relevant information otherwise summarized in the ECE metric, a popular metric to measure the quality of a probability calibration model. Nevertheless, the behavior of the proposed metrics under concept drift has not yet been studied, and we consider it a matter of future research.

We envision multiple lines of investigation for future work. Regarding active learning, we are interested in enriching our current setup by adopting different strategies to decide how interesting an upcoming image is (e.g., learning distance metrics for each class or learning to predict which piece of data would enhance the classifier the most) and by enhancing the calibration techniques to display the desired behavior at high confidence thresholds. We will conduct further research on probability calibration to understand how the proposed metrics behave when concept drift occurs. Finally, we will explore new approximate probability calibration approaches leading to enhanced calibrators when no ground truth is available.