Deep Bayesian Self-Training

Supervised deep learning has been highly successful in recent years, achieving state-of-the-art results in most tasks. However, with the ongoing uptake of such methods in industrial applications, the requirement for large amounts of annotated data is often a challenge. In most real-world problems, manual annotation is practically intractable due to time/labour constraints; thus, the development of automated and adaptive data annotation systems is highly sought after. In this paper, we propose both (1) a deep Bayesian self-training methodology for automatic data annotation, which leverages predictive uncertainty estimates using variational inference and modern neural network (NN) architectures, and (2) a practical adaptation procedure for handling high label variability between different dataset distributions through clustering of NN latent variable representations. An experimental study on both public and private datasets is presented, illustrating the superior performance of the proposed approach over standard self-training baselines and highlighting the importance of predictive uncertainty estimates in safety-critical domains.


Introduction
With the advent of Big Data in industrial applications, the ability to automatically label datasets using limited supervision is increasingly sought after. In most real-world problems, manual annotation is practically intractable due to time and labour constraints. Furthermore, recent advances in supervised deep learning have shown that training over-parameterised models on large datasets significantly increases performance [1]. With that in mind, and despite the high demand for annotated data, deep learning practitioners have not yet explored or leveraged many deep learning tools for automatic annotation systems. This is evidenced by the scarcity of existing research in the field, compared to others [2]. Automated annotation techniques typically involve semi-supervised algorithmic variants, wherein learning systems are trained on a small initial sample of labelled data and leverage information from unlabelled data to generalise better [3]. Well-established semi-supervised methods such as self-training [4], transfer learning [5], co-training [6], active learning [7] and tri-training [8], among others, have been shown to be useful for labelling in the past, but some challenges remain with regard to their scalability to high-dimensional data and their suitability to modern deep learning settings [2,9]. Prominent recent works have explored some of these ideas in the context of modern deep models, proposing new paradigms such as co-teaching [10], active learning on image data [2] and analyses of deep transfer learning [11,12], with good levels of success. Taking inspiration from these works, in this paper we primarily focus on exploring the self-training algorithm in combination with modern Bayesian deep learning methods and leverage predictive uncertainty estimates for self-labelling of high-dimensional data.

Background on application domain
In addition to public domain datasets, we evaluate our methods on a real-world task involving optical character verification (OCV) of real food packaging images, expanding on earlier work in [13] by reducing manual data annotation.
Incorrectly labelled food products (e.g. bearing an incorrect/illegible use-by date) result in product recalls and food waste, as label faults can lead to food safety incidents. Label faults are primarily attributed to human error during error-prone manual checking. Automatic approaches typically involve OCV, whereby a supervisory system holds the correct date code string and transfers it to both the printer and the vision system. The latter will then verify its read and take appropriate action. Such a system could also be used alongside other systems such as blockchain, within the food chain for food traceability [14]. Current OCV systems require accurately labelled data to be utilised for training, but the labelling process is time-consuming, expensive and requires expertise. They also rely on consistency in date code format, packaging and camera view angle which is difficult to ensure in a manufacturing environment, so there is a great need for a more robust solution.

Contribution
We propose a deep Bayesian self-training methodology orthogonal to [2] that leverages approximate variational inference in DNNs to estimate predictive uncertainty during a self-training setting. Both aleatoric and epistemic uncertainties of predicted pseudo-labels for unseen data are estimated, and the samples with the lowest predictive uncertainty (highest confidence) are added to the training set in an automated manner. We offer ways to mitigate the known problem of propagating errors in self-training by including: (1) an entropy penalty on the log-likelihood loss to punish overconfident output distributions and facilitate thresholding, and (2) an adaptive sample-wise weight on the influence of predicted pseudo-labelled samples over gradient updates to be inversely proportional to their predictive uncertainty. Lastly, we propose a new simple methodology for visualising and analysing variability between two dataset distributions in DNNs and attempt to adapt information from one problem to the other by clustering learnt latent variable representations in the context of our application domain. An experimental study on both public and private (real) datasets is presented demonstrating the increased performance of our algorithm over standard self-training baselines.

Related work
The ability of deep learning models to learn abstract hierarchical representations from data has pushed the state of the art in most machine learning-related tasks [1,15]. The uptake of these methodologies in academia and industry has resulted in many diverse and interesting DNN applications, wherein patterns learned from data have been adapted to perform tasks in various domains, including computer vision [13,15-17], medical imaging [18-20] and signal processing [21,22]. Although many important improvements to DNNs have been made in various domains, it remains difficult to train models that can be easily adapted to other tasks, and the lack of annotated data is one of the contributing factors.

Deep semi-supervised learning
Most work addressing the aforementioned issues relates to domain adaptation and to semi-supervised learning algorithms such as self-training [4], which is an iterative procedure for self-labelling data points in an unlabelled pool and retraining a classifier until stop conditions are met. Co-training [6] can be considered a multi-view variant of self-training, wherein two separate classifiers are trained on different views of the data and augment each other's training sets with their predicted labels. Tri-training [8] extends co-training to three classifiers, where unlabelled examples are added to a classifier's training set if and only if the other two agree on the predicted label. Active learning [7] selects the most informative samples from a pool of unlabelled data and retrains the classifier with human-given labels in an effort to maximise performance and minimise data labelling requirements. Transfer learning [5] is often used when there is a lack of annotated data in the target domain, and the goal is to adapt knowledge from one task to another by initialising the weights of the target task with the pre-trained weights of another, which often performs better than random initialisation. Among these algorithms, transfer learning has undoubtedly had the most success in the context of deep models, and it is widely used in computer vision for adapting visual features from large source domains to target domains with limited annotated data. Notably, [11] find that initialising a network with transferred features gives a boost in generalisation that lingers even after fine-tuning to the target dataset, and that transferring features from distant tasks is still better than using random weights. Recent work in [23] suggests that a single DL model can jointly learn a number of tasks from multiple domains successfully. In fact, it was observed that adding knowledge from unrelated tasks never hurts performance and mostly improves it on all tasks.
This observation is complemented by research in [24], with results suggesting that combining tasks, even via a naïve multi-head architecture, always improves performance. The authors of [25] propose learning a network composed of the most successful layers from many different source networks, which are continuously generated and evaluated by a recurrent neural network (RNN) controller. Task transfer learning was recently studied in great depth by [12], where a fully computational approach termed taskonomy was proposed. This was achieved by identifying dependencies between 26 different tasks in latent space, producing a computational taxonomic map for task transfer learning. Deep generative modelling is also gaining popularity for adapting knowledge learnt from data-generating distributions to pool sets of unlabelled data [26-28]. Other notable related works presented more recently include co-teaching [10], wherein two neural networks are trained simultaneously and teach each other to select clean labels and then decide what data to use for training. Mean teacher models [29] maintain an exponential moving average of model weights and penalise inconsistent predictions, enabling training with fewer labels as an added benefit. Deep co-training [30] extends the original co-training algorithm by training multiple DNNs with different views generated by exploiting adversarial examples. In [31], a simple method termed pseudo-label, similar to entropy regularisation [32], is proposed; it consists of iteratively assigning pseudo-labels via the maximum predicted probability of an NN. Although research on self-training with deep models is scarce, notable work in [33] presents an unsupervised domain adaptation (UDA) framework based on self-training for semantic segmentation using DNNs. The authors develop a self-paced policy that increases the number of pseudo-labels incorporated in each additional round and demonstrate performance benefits over other popular methods.
However, as is the case with all previous works mentioned thus far, their proposed approach does not provide principled predictive uncertainty estimates. The black box nature of DNNs is a concern in most real-world applications, and by quantifying what a model does not know with uncertainty measures, we can not only better trust our predictions but also avoid potentially harmful outcomes [34]. With that in mind, perhaps the most significant related work is in [2], where the authors propose a Bayesian formulation of active learning for image data using DNNs, obtaining a significant improvement on existing active learning approaches by considering uncertainty estimates in approximating acquisition functions.

Uncertainty estimation
The estimation of uncertainty as a measure of confidence over a model's predictions is desirable for self-labelling, and for safety-critical systems in general [34]. Bayesian neural networks (BNNs) were studied by many in the past [35-37] and have more recently regained popularity. In BNNs, uncertainty is typically captured by placing a prior distribution, such as a Gaussian, over the weights and averaging over all possible parameters, rather than optimising them directly. Bayesian inference is then used to compute the posterior over the weights, capturing the set of likely parameters. However, inference in BNNs is difficult with traditional methods, as they do not scale well to high-dimensional inputs or very complex DL models [34]. Recent promising methods including [34,38,39] offer alternative ways of capturing uncertainty through simple modifications to loss functions, having the network learn to predict aleatoric uncertainty in an unsupervised manner. Aleatoric uncertainty relates to sensory noise in the acquisition process of the data and is therefore inherently irreducible [40]. However, we argue that it can be a great tool for quantifying our uncertainty about pseudo-label predictions. In [39], dropout was shown to perform approximate variational inference, wherein stochastic forward passes with dropout at test time are effectively samples from the approximate posterior. This technique is known as Monte Carlo (MC) dropout [39] and can be used to quantify epistemic uncertainty in NN predictions. Epistemic uncertainty relates to our uncertainty about the model parameters, which is in fact reducible as we observe more data. This is because the uncertainty about the model parameters can be explained away in the limit of observing all explanatory variables of the data [34,40]. This type of uncertainty is useful for identifying out-of-distribution data points and is the most important type of uncertainty measure when assigning pseudo-labels to data.
In this paper we argue that, with some modifications, uncertainty estimation techniques in Bayesian deep learning can also be useful in a self-training setting, and to the best of our knowledge, these ideas have yet to be explored in this context. All things considered, we propose a deep Bayesian self-training algorithm, in which a DNN assigns pseudo-labels to new data and automatically weighs their sample-wise importance for the next self-training iteration to be inversely proportional to the predictive uncertainty of the assigned pseudo-label. In this way, we can reduce the burden of manual data annotation requirements and also offer a measure of uncertainty about our predictions, which is important in safety-critical domains.

Deep Bayesian self-training
In this section, we provide a brief background on Bayesian NNs and explore the idea of uncertainty estimation of pseudo-label predictions for unlabelled data, in a deep Bayesian self-training framework (see Algorithm 1). In order to quantify what our algorithm does and does not know, we extend existing approaches for estimating uncertainty in deep CNNs [34,41]. To this end, we consider the following Bayesian formulation of a deep CNN for estimating both aleatoric and epistemic uncertainties.

Bayesian neural networks
Let $\mathcal{D} = \{(X, Y)\}$ denote a dataset given as $N$ pairs of inputs $x_i \in \mathbb{R}^d$ of dimension $d$, and class labels $y_i \in \{1, \dots, K\}$ of $K$ total classes. Assuming a Bayesian neural network (BNN) formulation, we place a Gaussian prior probability distribution $p(\omega)$ over the set of trainable parameters $\omega = \{W_1, \dots, W_\ell\}$. We define the likelihood, i.e. the conditional output distribution $p(Y \mid X, \omega)$ of the NN for mapping inputs to labels, and find the parameters $\omega$ that yield the maximum likelihood estimate (MLE). MLE is the pillar of supervised learning in DNNs and yields a point estimate of the parameters most likely to have generated the data. In a Bayesian sense, the MLE is a special case of maximum a posteriori (MAP) estimation when a uniform prior is assumed. In practical classification tasks, the MLE is obtained by minimising the negative log-likelihood of a Bernoulli or softmax distribution, depending on the number of classes. We define the softmax negative log-likelihood of our classification NN model as
$$\mathcal{L}_{\mathrm{NLL}}(\omega) = -\sum_{i=1}^{N} \log \frac{\exp(z_{i,y_i})}{\sum_{k=1}^{K} \exp(z_{i,k})},$$
where $z_i$ denotes the vector of output logits of the network for input $x_i$ and $k$ denotes a class. Having defined a prior and a likelihood, we would like to compute the posterior probability distribution over the weights given the data by Bayes' rule
$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} \propto p(Y \mid X, \omega)\, p(\omega),$$
with which we can also formulate the predictive distribution given new inputs $x^{*}$ and labels $y^{*}$
$$p(y^{*} \mid x^{*}, X, Y) = \int p(y^{*} \mid x^{*}, \omega)\, p(\omega \mid X, Y)\, d\omega, \qquad (4)$$
enabling predictions using a full distribution over the parameters $\omega$, which captures uncertainty over the model parameters, rather than using a point estimate. However, in most cases, the posterior distribution $p(\omega \mid X, Y)$ cannot be evaluated analytically. This is because to compute the marginal probability $p(Y \mid X)$ we must integrate over all possible model parameters $\omega$, weighted by their prior probability $p(\omega)$, in order to obtain the normalising constant, also known as the model evidence.
Since the true posterior distribution $p(\omega \mid X, Y)$ is intractable, various approximations exist [36,37,42]. Most of them were important early steps towards performing approximate inference in Bayesian NNs, but are unfortunately difficult to employ in modern applications due to scalability constraints or expert knowledge requirements. More recent work in [41,43-45] addressed some of these issues with variational inference, reigniting interest in the field of Bayesian NNs.

Variational inference
Next, we provide a background on variational inference (VI) to contextualise some of the ideas presented in [41], wherein dropout is shown to perform approximate variational inference in NNs when used at test time. In VI, a factorised variational distribution $q_\theta(\omega)$ from a tractable family, parameterised by $\theta$, is defined for approximating the posterior distribution by minimising the Kullback-Leibler (KL) divergence between $q_\theta(\omega)$ and $p(\omega \mid X, Y)$. Intuitively, the KL divergence $\mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid X, Y))$ is a non-negative, asymmetric measure of dissimilarity between the two distributions, which we minimise via the variational parameters $\theta$ of our approximating distribution:
$$\hat{\theta} = \arg\min_{\theta}\, \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)).$$
However, optimising the KL divergence directly requires knowledge of the intractable posterior. This is circumvented by instead maximising the evidence lower bound (ELBO) on the marginal log-likelihood $\log p(Y \mid X)$, derived via Jensen's inequality $\log(\mathbb{E}[X]) \geq \mathbb{E}[\log(X)]$. Given that the KL divergence is $\geq 0$, we have
$$\log p(Y \mid X) = \mathcal{L}_{\mathrm{ELBO}}(\theta) + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)).$$
Since $\log p(Y \mid X)$ does not depend on $\theta$, maximising the lower bound minimises the KL divergence as intended.
We extend these ideas in the light of recent developments in [41] with the Monte Carlo dropout approximation using $q_\theta(\omega)$, further explained in the following section.
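As a worked illustration of the bound, the Jensen step can be written out as below. This is a sketch using the standard ELBO definition, which is assumed here rather than taken from the text; $q_\theta(\omega)$ denotes the variational distribution over the weights $\omega$ and $p(\omega)$ the prior.

```latex
\begin{align}
\log p(Y \mid X)
  &= \log \int p(Y \mid X, \omega)\, p(\omega)\, d\omega
   = \log \mathbb{E}_{q_\theta(\omega)}\!\left[
       \frac{p(Y \mid X, \omega)\, p(\omega)}{q_\theta(\omega)} \right] \\
  &\geq \mathbb{E}_{q_\theta(\omega)}\!\left[
       \log \frac{p(Y \mid X, \omega)\, p(\omega)}{q_\theta(\omega)} \right]
   \quad \text{(Jensen's inequality)} \\
  &= \mathbb{E}_{q_\theta(\omega)}\!\left[ \log p(Y \mid X, \omega) \right]
     - \mathrm{KL}\!\left( q_\theta(\omega) \,\|\, p(\omega) \right)
   =: \mathcal{L}_{\mathrm{ELBO}}(\theta).
\end{align}
```

The first term rewards fitting the data under weights sampled from $q_\theta(\omega)$, and the second keeps $q_\theta(\omega)$ close to the prior.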

Continuous relaxation of dropout
Concrete dropout is based on the concrete relaxation of discrete distributions [46], allowing the replacement of dropout's discrete Bernoulli distribution with its continuous relaxation [47]. To obtain calibrated uncertainty estimates with Monte Carlo dropout, it is necessary to tune the dropout probabilities. A grid search is a common approach but is costly for large models, which motivates optimising the probabilities directly with gradient descent. This requires formulating an objective for minimising epistemic uncertainty [41] using the variational interpretation of dropout. Formally, dropout can be treated as an approximating distribution $q_\theta(\omega)$ to the posterior in a BNN, where $\omega = \{W_\ell\}_{\ell=1}^{L}$ represents the weight matrices of the $L$ layers in the network, and $\theta$ are the variational parameters to optimise [47]. Let $F(\cdot\,; \omega)$ be the model with weight matrix realisation $\omega$; given a random set $S$ comprising $M$ of all $N$ data points, denote the model's output on the input $x_i$ as $F(x_i; \omega)$. The following NN objective function can then be formulated
$$\hat{\mathcal{L}}(\theta) = -\frac{1}{M} \sum_{i \in S} \log p(y_i \mid F(x_i; \omega)) + \frac{1}{N}\, \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)),$$
where $p(y_i \mid F(x_i; \omega))$ is the model's likelihood, a Gaussian with a predictive mean given by $F(x_i; \omega)$. The KL term is a regularisation term which constrains the approximate posterior $q_\theta(\omega)$ from deviating too far from the prior $p(\omega)$. Following [38], the KL term can be approximated in terms of $\{M_\ell, p_\ell\}_{\ell=1}^{L}$, a set of mean weight matrices and dropout probabilities, and of $H[p]$, which is simply the entropy of a Bernoulli random variable with probability $p$,
$$H[p] := -p \log p - (1 - p) \log(1 - p), \qquad (10)$$
which can be interpreted as a regularisation term that only depends on the dropout probability $p$, so minimising the KL term is equivalent to maximising the entropy of a Bernoulli random variable with probability $(1 - p)$. Rather than sampling the random variable from the discrete Bernoulli distribution, by adopting the concrete distribution [46,47] with some temperature $t$, it is possible to sample variables in the interval [0, 1], s.t.
$$\tilde{z} = \mathrm{sigmoid}\left(\frac{1}{t}\left(\log p - \log(1 - p) + \log u - \log(1 - u)\right)\right),$$
where the concrete relaxation $\tilde{z}$, parameterised by means of $u \sim \mathrm{Unif}(0, 1)$, provides a relationship between $\tilde{z}$ and $u$ that is differentiable w.r.t. $p$. With the concrete relaxation of the dropout masks, the dropout probabilities for each layer $\{p_\ell\}_{\ell=1}^{L}$ can be optimised using the path-wise derivative estimator [47].
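To make the relaxation concrete, the following minimal sketch samples a relaxed dropout mask in plain Python. It assumes the usual concrete dropout form, in which a uniform sample u is passed through a temperature-scaled sigmoid of log p - log(1 - p) + log u - log(1 - u). The function names, the default temperature and the `eps` guard are illustrative choices, not values from the paper.

```python
import math
import random


def _sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def concrete_dropout_mask(p, temperature=0.1, size=5, rng=None):
    """Sample `size` relaxed 'drop' indicators in [0, 1].

    p is the dropout probability. Values near 1 mean "drop the unit",
    values near 0 mean "keep it". Lower temperatures push the samples
    towards {0, 1}, approaching a discrete Bernoulli(p) mask.
    """
    rng = rng or random.Random()
    eps = 1e-7  # hypothetical guard against log(0)
    mask = []
    for _ in range(size):
        u = rng.random()  # u ~ Unif(0, 1)
        logit = (math.log(p + eps) - math.log(1.0 - p + eps)
                 + math.log(u + eps) - math.log(1.0 - u + eps))
        mask.append(_sigmoid(logit / temperature))
    return mask
```

Because the mask is a smooth function of p, an automatic differentiation framework could propagate gradients to p through the same expression; this sketch only covers the sampling step.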

Entropy penalty on output distributions
The probabilities assigned to incorrect classes at test time help quantify a model's ability to generalise. By penalising output distributions with low entropy (i.e. confident predictions), we can obtain a similar effect to label smoothing and improve generalisation [48]. This can be useful in self-training, since we assign pseudo-labels based on low-uncertainty predictions, which are in some cases wrongly assigned. We suggest that by penalising very confident output distributions we can improve generalisation and make thresholding easier, since the output distributions are smoother rather than overly concentrated at 0 or 1. The entropy of an NN's output conditional distribution is given by
$$H(p(y \mid x, \omega)) = -\sum_{k=1}^{K} p(y = k \mid x, \omega) \log p(y = k \mid x, \omega),$$
with $p(y \mid x, \omega)$ as the probability distribution obtained from a softmax function. To penalise very confident predictions, we can simply take the negative log-likelihood and subtract the entropy of the output distribution as
$$\mathcal{L}_{\mathrm{PNLL}}(\omega) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \omega) - \beta \sum_{i=1}^{N} H(p(y \mid x_i, \omega)),$$
where the scaling hyperparameter $\beta$ balances how much we would like to penalise non-uniformity of the softmax.
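A minimal per-sample sketch of the entropy-penalised loss follows. It assumes the form "negative log-likelihood minus beta times the softmax entropy" described above; the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Stable log-softmax of a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def penalised_nll(logits, label, beta=1.0):
    """Negative log-likelihood of `label` minus beta * entropy of the softmax.

    Confident (low-entropy) outputs receive a smaller entropy bonus, so they
    are penalised relative to smoother output distributions.
    """
    log_probs = log_softmax(logits)
    nll = -log_probs[label]
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    return nll - beta * entropy
```

With `beta=0` the function reduces to the ordinary softmax negative log-likelihood; for a uniform output over K classes and `beta=1` the two terms cancel, since both equal log K.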

Inverse uncertainty weighting
A known limitation of self-training is the potential accumulation of wrongly pseudo-labelled samples being added to the training set. A common approach is to remove less confident samples from the training set and leave them in the unlabelled set. However, this tends to underperform in practice, as the algorithm can become biased by continuously adding the easiest unlabelled samples to the training set. This can hinder learning over time, as more difficult and potentially informative samples are neglected.
In an attempt to mitigate this behaviour, we propose a sample-wise weighting scheme during training that places a weight $\lambda_i$ on each training sample $\{x_i, \hat{y}_i, \lambda_i\}$, determined by the predictive uncertainty over its pseudo-label $\hat{y}_i$, such that its contribution to the loss function is inversely proportional to its predictive uncertainty (see Algorithm 1). To calculate the predictive uncertainty, we can have the network predict the aleatoric uncertainty as one of its outputs and add the epistemic uncertainty obtained from the variance of Monte Carlo dropout samples.
Formally, let $\hat{p}_t = \mathrm{softmax}(F(x; \hat{\omega}_t))$ denote the softmax output of a BNN, and $\{\hat{p}_t\}_{t=1}^{T}$ be the set of outputs from $T$ Monte Carlo dropout samples at test time, each parameterised by weights drawn from the approximate posterior $\hat{\omega}_t \sim q_{\hat{\theta}}(\omega)$. We propose calculating the predictive uncertainty from these samples by generalising the binary variant approach in [49] to a multivariate classification setting. By the definition of variance of a multinomial distribution, we can decompose the variance of $\hat{p}$ into
$$\mathrm{Var}[\hat{p}] \approx \frac{1}{T}\sum_{t=1}^{T}\left[\mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top}\right] + \frac{1}{T}\sum_{t=1}^{T}(\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^{\top}, \qquad \bar{p} = \frac{1}{T}\sum_{t=1}^{T}\hat{p}_t,$$
where the first term represents the aleatoric uncertainty $\sigma_a^2$, and the second the epistemic uncertainty $\sigma_e^2$. Each diagonal entry of the resulting matrix is the variance of a binomially distributed random variable, and the off-diagonals are negative covariances for fixed $T$. Since we are only interested in a single number to measure our uncertainty, we take the trace of the resulting uncertainty matrix.
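The trace of the uncertainty matrix can be computed without forming the matrix. The sketch below assumes the decomposition used in [49]: the aleatoric part is the average over samples of diag(p) minus the outer product of p with itself, and the epistemic part is the sample covariance of the softmax vectors around their mean. The function name and the list-of-lists input format are illustrative.

```python
def predictive_uncertainty(mc_probs):
    """Return (aleatoric, epistemic, total) traces from T softmax samples.

    mc_probs: list of T softmax vectors, each a list of K probabilities,
    e.g. obtained from T stochastic forward passes with dropout enabled.
    """
    T = len(mc_probs)
    K = len(mc_probs[0])
    mean_p = [sum(p[k] for p in mc_probs) / T for k in range(K)]
    # trace(diag(p) - p p^T) = sum_k p_k (1 - p_k), averaged over samples
    aleatoric = sum(p[k] * (1.0 - p[k]) for p in mc_probs for k in range(K)) / T
    # trace((p - mean)(p - mean)^T) = sum_k (p_k - mean_k)^2, averaged over samples
    epistemic = sum((p[k] - mean_p[k]) ** 2 for p in mc_probs for k in range(K)) / T
    return aleatoric, epistemic, aleatoric + epistemic
```

For example, two dropout samples that confidently disagree, `[[1, 0], [0, 1]]`, give zero aleatoric and 0.5 epistemic uncertainty, whereas samples that all agree on `[0.5, 0.5]` give 0.5 aleatoric and zero epistemic uncertainty.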
Alternatively, we can have the NN predict the input noise variance $\sigma_a^2$ as one of its outputs [34], by assuming measurement error in our target function $y = F(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_a^2)$. In the resulting expression for the predictive variance in a multivariate classification setting, the entropy term measures epistemic uncertainty in the output softmax distributions, whereas the log aleatoric uncertainty term $\hat{s}_i := \log \sigma_{a,i}^2$ is regressed by the NN for each input $x_i$, for numerical stability. To capture aleatoric uncertainty in our classification task, we can use Monte Carlo integration on the NN's Gaussian log-likelihood objective function, by drawing $T$ samples of Gaussian noise-corrupted NN output logits $F(x)$. The resulting loss uses noise $\epsilon_t \sim \mathcal{N}(0, I)$, parameterised by the predicted aleatoric uncertainty $\exp(\hat{s} \mid x)$ for each sample $x_i$, and learns to capture measurement error.
Having calculated the predictive uncertainty $\mathrm{Var}[\hat{p}]$ of our pseudo-labels, we calculate a per-sample importance weight $\lambda_i$ for each sample $\{x_i, \hat{y}_i, \lambda_i\}$ using $\phi(\cdot)$, a parameterised hyperbolic tangent function with $c$, $b$ as scale and intercept terms, where $r$ denotes the self-training iteration. In the weighted penalised log-likelihood of our NN with weights $\omega$, $p(y \mid x, \omega)$ is computed via softmax, and the optional confidence entropy penalty term is balanced by $\beta$. By tuning $c$ and $b$, we can obtain the desired behaviour over $r$ iterations, s.t. when the uncertainty is low, we assign high weight to the predicted pseudo-labelled sample $\{x_i, \hat{y}_i, \lambda_i \approx 1\}$. We can incrementally encourage the model to assign more weight to uncertain pseudo-labelled samples as self-training progresses, since $\lim_{r \to \infty} \phi(r) = -1$.
Intuitively, this procedure inverts Eq. (17) over time, incrementally forcing exploration by adding more uncertain, and potentially informative, samples to the training set. In summary, using this logic along with entropy penalties on overconfident output distributions, we can mitigate the effect of pseudo-labelling error accumulation in the training set and adjust risk taking by tuning $c$ and $b$. Once per-sample predictive uncertainties are calculated, we decide which pseudo-labelled samples to add to the training set via a Tukey fence. Concretely, assume an NN has been trained on data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, learning a function $F(x; \omega)$ for mapping inputs to labels. At inference time, we take the correct predictions where $y_i = F(x_i; \omega)$ and retrieve their predictive uncertainty. We then summarise variability by calculating the interquartile range (IQR) outlier statistic and define an uncertainty upper bound $\tau$, which is used to decide which pseudo-labelled samples should be added to $\mathcal{D}$. Here $\hat{y}_i$ denotes the pseudo-label assigned to sample $x_i$, computed as $\hat{y}_i = \arg\max \hat{p}_i$, and $\mathcal{D}^{*}$ is the augmented training set. Lastly, we can also easily adjust the uncertainty upper bound $\tau$ by selecting higher or lower quartiles to reflect how confident we would like to be about predictions before adding samples to $\mathcal{D}^{*}$.
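The selection step can be sketched as follows. This example assumes the conventional Tukey upper fence, Q3 + 1.5 × IQR, computed on the uncertainties of correctly predicted samples; the exact fence used in the experiments is not reproduced here, and the function names and the `whisker` parameter are hypothetical.

```python
import statistics


def tukey_upper_fence(uncertainties, whisker=1.5):
    """Upper fence Q3 + whisker * IQR over a list of uncertainty values."""
    q1, _, q3 = statistics.quantiles(uncertainties, n=4, method="inclusive")
    return q3 + whisker * (q3 - q1)


def select_pseudo_labels(pool_probs, pool_uncertainty, tau):
    """Return (index, pseudo_label) pairs whose uncertainty is at most tau.

    pool_probs: mean softmax vector per unlabelled sample.
    pool_uncertainty: scalar predictive uncertainty per unlabelled sample.
    """
    selected = []
    for i, (probs, u) in enumerate(zip(pool_probs, pool_uncertainty)):
        if u <= tau:
            y_hat = max(range(len(probs)), key=probs.__getitem__)  # arg max
            selected.append((i, y_hat))
    return selected
```

A stricter policy could be obtained by passing a lower quartile as `tau` directly (for example the median), at the cost of labelling fewer samples per iteration.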

Latent variable adaptive clustering
We propose a new simple methodology for visualising and analysing variability between distributions and attempt to adapt information from one problem to another in DNNs. In Fig. 1, an illustration of our adaptation framework is shown using an example backbone InceptionV3 CNN. Let $D_1$ and $D_2$ denote two training sets from separate datasets targeting the same task (21), and $T_1$ and $T_2$ the two respective test sets. We obtain $\tilde{x}^{(i)} \in \mathbb{R}^{2048}$ as latent variable representations by simply forward-propagating each image through the trained CNN, as is typically done at inference time.
Utilising these, our adaptation methodology is then performed as follows:
1. Given $D_2$, produce a set of clusters $C = \{c_1, \dots, c_k\}$ by minimising the within-cluster $L_2$ norms of the k-means clustering objective function;
2. Repeat step 1 with $D_1$ to generate $k$ clusters $U = \{u_1, \dots, u_k\}$ and compute the $k$ closest instances in $D_1$ to each centroid in $U$. Fetch the corresponding set of images $S = \{S_1, \dots, S_k\}$, whose latent variables are closest to $U$;
3. Forward-propagate $S$ through $F(D_2; W_2)$ to obtain a new set of adapted clusters $Z = \{z_1, \dots, z_k\}$, where $S$ is considered an approximation of $U$ from $F(D_1; W_1)$;
4. Derive an augmented cluster representation that encapsulates knowledge from both facets of the trained CNNs, by concatenating the respective $C$ and $Z$ clusters into a set $A = \{c_1, \dots, c_k, z_1, \dots, z_k\}$;
5. Compute the Euclidean distance between $T_1$ and $A$ and evaluate the classification performance;
6. Iteratively remove the lowest performing cluster in $A$ and repeat step 5 until the performance stops improving.
In all cases, the k-means++ [50] seeding strategy was used, whereby the first cluster centre $c_1$ is chosen uniformly at random from $\mathcal{X}$, and all subsequent cluster centres $x \in \mathcal{X}$ are chosen with probability
$$\frac{D(x)^2}{\sum_{x' \in \mathcal{X}} D(x')^2},$$
where $D(x)$ denotes the distance between $x$ and the closest $c_i$ already chosen. Moreover, we assign the class label of a given cluster $c_i$ as simply the mode class of all data points within it. In the experimental study of Sect. 6, we demonstrate that our method distils and adapts knowledge from both trained CNNs on real data, achieving better performance than direct inference of $T_1$ with $F(D_2; W_2)$, without any parameter retraining.
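The seeding strategy can be sketched in a few lines of plain Python. This follows the standard k-means++ rule of sampling each new centre with probability proportional to its squared distance from the nearest centre already chosen; the function names and the tuple-based point format are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Choose k initial cluster centres from `points` using k-means++ seeding.

    Assumes `points` contains at least k distinct vectors; otherwise all
    sampling weights become zero and random.choices raises ValueError.
    """
    rng = rng or random.Random()
    centres = [rng.choice(points)]  # first centre: uniform at random
    while len(centres) < k:
        # D(x)^2: squared distance from each point to its closest chosen centre
        d2 = [min(squared_distance(x, c) for c in centres) for x in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres
```

Points that are already centres have weight zero and are never re-selected, which spreads the seeds across the latent space before the usual k-means updates begin.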

Experimental study
This section is divided into two subsections. The first presents experiments using deep Bayesian self-training applied to the public MNIST dataset; an ablation study is presented and comparisons are made with baseline methods. The second comprises a study using private (real) datasets, in which we perform some preliminary experiments using transfer learning and then evaluate our proposed latent variable adaptive clustering method. We conclude the second subsection by evaluating deep Bayesian self-training on the self-annotation of the real datasets.

MNIST dataset
In order to validate our algorithm, we conduct a series of self-labelling experiments on the popular MNIST dataset using a DenseNet architecture [15]. DenseNets have revealed several well-founded advantages over previous architectures, from mitigating vanishing gradients to encouraging feature propagation and reuse with shorter connections between layers [15,51]. In the formal definition of the dense connectivity in DenseNets, $f(\cdot)$ is the ReLU activation function, $\mathrm{BN}(\cdot)$ is batch normalisation [52] and $[A^{[0]}, A^{[1]}, \dots, A^{[\ell-1]}]$ represents feature map-wise concatenation of all layers preceding $\ell$. A sequential composite function consisting of BN, ReLU and $3 \times 3$ convolution can then be defined as $H^{[\ell]}$. Each function $H^{[\ell]}$ produces $x$ feature maps, known as the growth rate of the network, and each layer $\ell$ takes as input $f + x \times (\ell - 1)$ total feature maps, where $f$ denotes the number of channels in the visible layer. To reduce the spatial dimensionality of feature maps, a transition layer is introduced between densely connected DenseBlocks. Transition layers in [15] are composed of BN followed by $1 \times 1$ convolution and $2 \times 2$ average pooling, with a feature map compression factor $\theta = 0.5$.
Following Algorithm 1 closely, we propose a progressively growing NN scheme, starting with a 40-layer DenseNet with a growth rate of $k = 12$ and incrementally increasing the growth rate (width) of the network as more data are added to the training set. In the first iteration, the network has only 181k parameters to avoid overfitting on the small initial training set, but the complexity of the network is incrementally increased in an automated way. As described in greater detail in Sect. 3.5, we employ Monte Carlo dropout at test time to calculate the predictive uncertainty of the assigned pseudo-labels. In all cases, we take $T = 30$ samples, equating to 30 different dropout masks. We compare the performance of our proposed approach with a baseline ensemble method (DEST), similar to [53], for estimating predictive uncertainty, and with the vanilla self-training methodology, albeit in a deep learning model, which considers only the output probability of the NN as a measure of confidence, similarly to [31]. We also evaluate the effect of our inverse uncertainty weighting scheme, as well as the entropy penalty on confident output distributions, on the performance of our Bayesian self-training algorithm.
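The feature-map bookkeeping described above amounts to simple arithmetic, sketched below. The input count follows the expression "visible channels plus growth rate times (layer index - 1)" from the text; rounding the compressed count down with a floor is an assumption based on the usual DenseNet convention, and the function names are illustrative.

```python
import math


def dense_layer_input_maps(visible_channels, growth_rate, layer):
    """Feature maps entering layer `layer` (1-indexed) of a dense block."""
    return visible_channels + growth_rate * (layer - 1)


def transition_output_maps(n_maps, compression=0.5):
    """Feature maps leaving a transition layer with the given compression factor."""
    return math.floor(compression * n_maps)
```

For a single-channel input and a growth rate of 12, the fifth layer of a block receives 1 + 12 × 4 = 49 feature maps, and a transition layer with compression 0.5 applied to 49 maps would output 24.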

Training details
In all MNIST experiments, we use the same DenseNet model and hyperparameters for fair comparisons. Specifically, we train the networks using stochastic gradient descent (SGD) with a Nesterov momentum of 0.9, a batch size of 32 and an initial learning rate of 0.1. We train all models for 75 epochs and reduce the learning rate by a factor of 10 at 50% and 75% of the way through training. All models are trained using the same train/valid/test/unlabelled splits, no data augmentation is used aside from simple image standardisation (mean 0, s.d. 1), and we take $T = 30$ Monte Carlo dropout samples at test time as explained in Sect. 3.5. With regard to the ensemble, we train $M = 5$ models, each initialised with random weights, and capture the predictive uncertainty following Eq. (15), but without using dropout at test time. Lastly, the stop conditions can be adjusted depending on the application at hand, but here they were kept consistent in all experiments for fairness of comparison. Specifically, we stipulate that if fewer images than the current batch size are selected to be added to the training set in the next self-training iteration, the algorithm stops.
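The step schedule and stop condition can be sketched as below. The milestone handling (a drop applies from the first 0-indexed epoch at or beyond the fractional milestone) is an assumption, since the text does not say how 50% and 75% of 75 epochs are rounded; the function names are illustrative.

```python
def learning_rate(epoch, total_epochs=75, base_lr=0.1,
                  milestones=(0.5, 0.75), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each passed milestone.

    `epoch` is 0-indexed; milestones are fractions of `total_epochs`.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m * total_epochs:
            lr *= factor
    return lr


def should_stop(n_selected, batch_size=32):
    """Stop self-training when fewer samples than one batch were selected."""
    return n_selected < batch_size
```

Under these assumptions, epochs 0 to 37 use a rate of 0.1, epochs 38 to 56 use 0.01, and epochs 57 to 74 use 0.001.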

Ablation study
The results are reported in Table 1 and illustrated in Figs. 2, 3, 4 and 5. In our experiments, we simply have the NNs predict the labels for the 54,500 unlabelled MNIST samples and evaluate how well the system predicts the correct labels at the end of each self-training iteration. The evaluation is primarily considered in terms of the Cohen's kappa statistic (κ), which is more robust than accuracy because it accounts for agreement occurring by chance, and the number of images left unlabelled after self-training. As can be observed from the results, the addition of our proposed inverse uncertainty weighting scheme improves the performance of the algorithm by leaving fewer images unlabelled and achieving a higher κ score (DBST-1 to DBST-2). We also test the effect of tightening the quartile uncertainty threshold τ from Q3 to Q2 (DBST-2 to DBST-3), meaning we are stricter about which pseudo-labelled samples can be added to the training set. This admits only very confident pseudo-label predictions, resulting in a higher κ score at the cost of labelling fewer examples, as expected. In the DBST-4 model, we combine both the sample-wise inverse uncertainty weighting scheme and the entropy penalty on the log-likelihood loss (L_PNLL) using β = 1 as described in Sect. 3.5. As reported in Table 1, the number of examples left unlabelled is significantly lower, whilst maintaining a good Cohen's κ agreement between predicted and actual labels. In comparison with the others, the DBST-4 model provides the best balance between the number of unlabelled images left after self-training and a high Cohen's κ score.
Table 1 Notation: τ is the upper bound uncertainty threshold for augmenting D*, λ_i are sample-wise inverse uncertainty weights, and r is the number of self-training iterations taken before stop conditions were met. All metrics (precision, recall, F1-score and Cohen's κ) are reported in 1 − metric format.
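
To make the ablated components concrete, the sketch below shows one possible form of quartile-based thresholding, inverse uncertainty weights and an entropy-penalised, sample-weighted negative log-likelihood. The exact definitions are those of Sect. 3.5 and are not reproduced here: the reciprocal weighting, the application of the weights to the penalised per-sample loss, and all function names are assumptions for illustration only.

```python
import numpy as np

def select_pseudo_labels(uncertainty, quantile=0.75):
    """Keep samples whose uncertainty is at or below the chosen quartile
    (0.75 for Q3, 0.5 for the stricter Q2)."""
    tau = np.quantile(uncertainty, quantile)
    return np.flatnonzero(uncertainty <= tau), tau

def inverse_uncertainty_weights(uncertainty, eps=1e-8):
    """Hypothetical weighting: weights shrink as uncertainty grows and are
    rescaled so that the most certain sample has weight 1."""
    inv = 1.0 / (uncertainty + eps)
    return inv / inv.max()

def penalised_nll(probs, labels, weights, beta=1.0, eps=1e-12):
    """Sample-weighted negative log-likelihood minus beta times the entropy
    of the output distribution (a confidence penalty)."""
    nll = -np.log(probs[np.arange(len(labels)), labels] + eps)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return float(np.mean(weights * (nll - beta * entropy)))

# Toy uncertainty scores for eight pseudo-labelled samples (hypothetical).
u = np.array([0.05, 0.10, 0.20, 0.40, 0.80, 1.60, 2.00, 2.20])
idx_q3, tau_q3 = select_pseudo_labels(u, 0.75)
idx_q2, tau_q2 = select_pseudo_labels(u, 0.50)
w = inverse_uncertainty_weights(u[idx_q2])
```

Moving the threshold from Q3 to Q2 retains fewer samples in this toy example, mirroring the trade-off between label quality and the number of labelled examples discussed above.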

Comparative discussion
Lastly, we compare our Bayesian models (DBST) with two baseline methods: a deep ensemble of NNs (DEST), which estimates uncertainty in a similar way to [53], and standard self-training (DST) following the logic in [31], which simply uses the NN's predicted probability of an assigned pseudo-label as a level of confidence. The predictions from each NN in the ensemble (DEST) can be used to calculate predictive uncertainty, as the deviations between them capture model parameter uncertainty. Here, we do not employ any bootstrap methods, as the randomness from the NN weight initialisation and shuffled training has been shown experimentally to be sufficient [53]. We use the same DenseNet architecture, including related hyperparameters and identical dataset splits, to train an ensemble of five models. Table 1 shows that our methods (DBST) are better than using an ensemble (M = 5) for predicting uncertainty for our self-training purpose, whilst taking approximately 5× less time to run in our experiments. Note that Monte Carlo dropout samples are very cheap to compute at inference time compared to training multiple models; thus, we can afford to take many more samples, i.e. T = 30 as compared to an ensemble of M = 5, which is also an advantage of our approach. With regard to the vanilla self-training baseline (DST), we again use the exact same DenseNet architecture and related hyperparameters for fair comparisons. As previously outlined, in standard self-training we take the NN's predicted probability as a measure of confidence, and to demonstrate the inadequacy of this method, we threshold with a very high confidence probability of 0.99. This simply means that only pseudo-label predictions above the 0.99 probability (confidence) threshold in a 10-way softmax (MNIST digit classes) are added to the training set. As reported in Table 1 and Fig. 2, DST underperforms compared to our methods since it is overconfident early on, resulting in the addition of more wrong pseudo-labels to the training set and thus propagating the errors forward. Although the number of images left unlabelled is low, the Cohen's κ score is significantly lower.
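
The DST selection rule and the agreement measure used to score it can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names and toy labels, not the evaluation code used for Table 1.

```python
import numpy as np

def confident_selection(probs, threshold=0.99):
    """Vanilla self-training rule: keep predictions whose largest softmax
    probability exceeds the threshold."""
    preds = probs.argmax(axis=1)
    keep = probs.max(axis=1) > threshold
    return preds, keep

def cohen_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    cm = np.zeros((num_classes, num_classes), dtype=float)
    np.add.at(cm, (y_true, y_pred), 1.0)          # confusion matrix counts
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_chance = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (p_observed - p_chance) / (1.0 - p_chance)

# Toy labels (hypothetical): five of six predictions agree.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
kappa = cohen_kappa(y_true, y_pred, num_classes=3)   # 0.75
```

Here the raw agreement is 5/6, but after discounting the agreement expected by chance (1/3), κ is 0.75, which is why κ is a stricter criterion than accuracy.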

Real datasets
Four datasets of food package photographs were collected by a leading food company and provided to us for research purposes. The four sets include 1404, 6739, 1154 and 13948 captured images, respectively. In order to produce trainable datasets, a portion of the images was first manually annotated w.r.t. the presence, or lack thereof, of use-by dates. Unreadable images, in which dates were not discernible from the background (potentially due to heavy distortion, non-homogeneous illumination or blur), were set aside in a separate category. Conversely, images in which the day, the month, or both were missing were considered incomplete and grouped into their own category. Lastly, images of good quality, reporting the date including both the day and month, were considered as good candidates for OCV.
The first three sets of images were annotated as mentioned above to form five categories: complete dates, missing day, missing month, no date and unreadable (Table 2), whereas photographs belonging to the fourth dataset were annotated as good or bad candidates for OCV and utilised to test our proposed Bayesian self-annotating framework. After annotating all the images in the first three datasets, it was possible to plot some statistics (see Fig. 6) on the frequency of specific dates within each dataset, and thus devise a methodology for conducting experiments with balanced sets of classes. Moreover, by inspecting the images with partially missing data, it was observed that most of them were photographs of package labels which had been folded at crucial points, or which exhibited photographic glare, digits fading over time, or human-made occlusions. With regard to the fourth dataset, 8931 images were annotated as including readable dates, and the remaining 5017 as unreadable (Fig. 7).

Transfer learning
It was of particular interest to conduct transfer learning in order to assess the adaptability of pre-trained CNN weights [54] on the current food datasets. Specifically, each image from our datasets was fed through an InceptionV3 CNN previously trained on the ImageNet dataset, up to the last global average pooling (GAP) layer, where a 2048-dimensional vector representation of each instance was extracted. The 2048-dimensional vectors then became the input to a new series of FC layers and a final softmax layer able to predict N classes (see Fig. 8). In order to optimise the training performance of the new FC layer network, a series of architectural decisions were made empirically, and the best performances were achieved using an FC network consisting of two 2048-unit hidden layers with rectified linear unit (ReLU) activations and batch normalisation (BN) [52] layers. The risk of overfitting rises as the number of parameters increases w.r.t. the number of training examples. Due to the limited amount of training data available for experimentation, it is infeasible to train state-of-the-art models from scratch. Therefore, we introduced an effective regulariser in the new network and adapted previously learned low-level features through transfer learning. One of the most effective regularisation techniques is dropout [55]. In practice, to preserve more information in the input layer ℓ^(0) (of L total layers) in the network and thus aid learning, the probability of keeping (p(z^(i)) ≠ 0) any given neuron z^(i) in layer i was set according to a layer-dependent schema. In view of the imbalance present among the various classes, it was beneficial to use a weighted negative log-likelihood as the loss function (28). In (28), λ_j is a weight coefficient computed for the jth of all J classes as a function of the proportion of instances N_j compared to the most densely populated class (29). During training, λ encourages the model to focus on under-represented classes, with the per-class weight parameter λ_j calculated as in (29). In the case of multiclass classification, where J > 2, the weighted cross-entropy loss function is defined as in (28), where log p(ŷ = j | z_j) is calculated from z, a vector of NN output logits, and M denotes the batch size of choice for stochastic optimisation of L_NLL via backpropagation. In all cases, we use adaptive moment estimation (Adam) as the optimiser [56].
In this framework, three sets of experiments were conducted and the obtained results are reported in Tables 3 and 4. The goal of the first experiment was to establish a baseline for images that would be classified as acceptable according to human standards. The appearance of unreadable images was especially prominent in the first of the three datasets. Conversely, the average image quality of the second and third datasets was higher; therefore, they were not considered in this experiment. Moreover, the first dataset contained images from seven different locations, and as such, there were at least seven different types of food packaging present. To devise a balanced experiment, images from all locations were combined and categorised into two classes: 'Complete Dates' and 'Unreadable'. As reported in Table 3, 90.1% classification accuracy was achieved over all seven locations.
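
The class-weighted negative log-likelihood described above can be sketched in NumPy as follows. The exact weighting function of (29) is not reproduced here, so the ratio of the largest class count to each class count is used as a stand-in; this choice, the batch averaging and the function names are assumptions for illustration.

```python
import numpy as np

def class_weights(counts):
    """Hypothetical weighting: ratio of the largest class count to each
    class count, so the most populated class has weight 1."""
    counts = np.asarray(counts, dtype=float)
    return counts.max() / counts

def log_softmax(z):
    """Numerically stable log-softmax over the class axis."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def weighted_nll(logits, labels, weights):
    """Batch mean of -lambda_j * log p(y = j | z) for the true class j."""
    log_p = log_softmax(logits)
    picked = log_p[np.arange(len(labels)), labels]
    return float(-(weights[labels] * picked).mean())

lam = class_weights([100, 50, 25])      # weights 1, 2 and 4
logits = np.zeros((3, 3))               # uniform predictions over 3 classes
labels = np.array([0, 1, 2])
loss = weighted_nll(logits, labels, lam)
```

With uniform predictions, each sample contributes log 3 scaled by its class weight, so errors on the rarest class cost four times as much as errors on the most populated one.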
The second experiment aimed to distinguish between acceptable (complete) and unacceptable (partial or missing) dates; that is, the absence of either the day or the month digits in a use-by date was deemed unacceptable. The second dataset was the largest, containing approximately 50% of examples with partial or missing dates. Images missing the day, the month, or both were assigned to one class and 'Complete Dates' to the other. As reported in Table 3, an accuracy of 96.8% was achieved.
Similarly, a performance of 94.8% was achieved when applying the same procedure to the first dataset. As for the third dataset, it includes images of higher quality, but there is a very small number of missing value examples available. To address this, we performed data augmentation in order to produce a larger set of 'Partial Dates'. The accuracy achieved on this synthetic set was 85.8%. Lastly, a small variation of this experiment (2.1 in Table 3) was conducted in order to assess how well the network can identify the presence of any type of date, be it complete or partial, versus the absence of a date altogether. This experiment offered insight into how well the network can produce inferred localisation of dates, as it must learn to filter out the abundant non-date-related text/numbers in the images. Table 3 shows that good accuracies were achieved across all three datasets, with the best case of 96.2% date presence detection on the second dataset.
In a brief third experiment, a global approach to OCV was tested by targeting the classification of specific digits and letters. Successful text recognition systems typically begin with the detection of text presence within a given image, followed by a segmentation or localisation of the desired region-of-interest (ROI) in order to perform classification of segmented digits thereafter. Here, we assess how well the NN can perform without specifying any additional labels or local information. Given that almost all images in the third dataset contained 'Complete Dates', we conducted a brief digit classification experiment (see Table 4 for results). Despite the small number of training examples (1138) and limited possible class combinations, four digit classes were identified, namely 5, 8, 16 and 20. With these labelled examples, an accuracy of 90% was achieved. Similarly, for the second dataset, due to limited data, a brief global OCV classification experiment between the months of October and November in use-by dates was conducted. An accuracy of 92.7% was achieved despite the small number of training examples. In reflecting on these results, it is important to remember the great variety of text and numbers included in each image. Without being provided with any local knowledge, and given limited training examples, the networks were still able to automatically infer the importance of specific digits and their respective locations in a global manner, whilst ignoring the same or other digits located in close proximity.

Latent variable adaptive clustering
A major challenge spanning the three datasets was the high variability in the characteristics of the captured images. This variability made the reuse of a DNN trained on one dataset for classifying the data of another very difficult, leading to poor performance. Fundamentally, this is because each dataset comes from a different distribution. Subsequently, we explored whether the respective trained networks were suitable for carrying out the proposed network adaptation approach (see Table 5 for results).
To this end, consider F(D_2; W_2) as a trained CNN with a test performance of 95.9% on a binary classification problem of use-by date verification on a real dataset. Let T_1 be the test set of a dataset from a different distribution targeting the same classification task. We forward-propagate T_1 through F(D_2; W_2) and achieve a lower accuracy of 63.8%, as expected. We employed our adaptation procedure to classify T_1 without any parameter retraining, decreasing the relative error by 34.81% with an improved accuracy of 76.4%. Interestingly, the original performance achieved by F(D_2; W_2) on T_2 also increased from 95.9% to 97.1% when classifying T_2 with A instead of the CNN it was originally trained on. Figure 9 depicts a 3D visualisation of all 2048-dimensional cluster centroids, for k = 7 for both datasets (14 in total). (Red) squares and (blue) crosses denote the centroids corresponding to the complete date class in the first and second datasets, respectively. (Green) circles and (pink) diamonds are the centroids in the missing date category, and the (black) stars indicate the centroids not used in the final classification as per the centroid exclusion policy explained previously in Sect. 4.
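
The reported relative error reduction follows directly from the two accuracies, and classification against a set of labelled centroids can be sketched as a nearest-centroid rule. The sketch below is only an assumption about how centroids might be used at test time; the actual adaptation procedure, including the centroid exclusion policy, is the one given in Sect. 4, and the function names and toy 2D features are hypothetical.

```python
import numpy as np

def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error removed by the adaptation."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

def nearest_centroid_predict(features, centroids, centroid_labels):
    """Assign each feature vector the label of its closest centroid
    (squared Euclidean distance)."""
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroid_labels[np.argmin(d2, axis=1)]

# Accuracy on T_1 before (63.8%) and after (76.4%) adaptation.
reduction = relative_error_reduction(0.638, 0.764)   # about 0.348

# Toy 2D stand-in for 2048-dimensional latent vectors (hypothetical).
features = np.array([[0.1, 0.0], [0.9, 1.1]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
centroid_labels = np.array([0, 1])
pred = nearest_centroid_predict(features, centroids, centroid_labels)
```

The error on T_1 falls from 36.2% to 23.6%, and (36.2 − 23.6) / 36.2 gives the 34.81% relative reduction quoted above.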

Deep Bayesian self-training on real data
In order to validate our approach, we conducted a series of experiments on a pool of held-out annotated data comprising 11,948 real food package images. The results can be seen in Table 6 and Fig. 10. We begin by introducing concrete dropout layers after every convolutional layer in the last DenseBlock of a DenseNet-201, pre-trained on ImageNet. We then fine-tuned the last DenseBlock on a small portion of 500 images, with binary annotated labels representing whether the use-by date was readable (OK) or not (NOT-OK). As observable in Fig. 10a, we first applied these ideas to the full set of 11,948 unlabelled images and simply selected the 500 most certain predicted labels to be added to the initial training set of 500 images. This process was repeated 10 times in order to collect a total of 5000 images with predicted labels, which we then compared with our annotated labels as shown in Table 6. In the remaining set of experiments, instead of selecting a predetermined number of images, we filtered out uncertain predictions based on a threshold τ as in Algorithm 1. Figure 10c, d depicts the confusion matrices for the automatically annotated images w.r.t. true labels and highlights the benefits of applying a confidence penalty on the log-likelihood loss.
Fig. 9 t-SNE visualisation of the derived centroids A with best k = 7, achieving the results reported in Table 5. The 'Excluded centroids' (2 black stars) were removed as per the policy outlined in step 6 of our proposed adaptation procedure (colour figure online)
In order to compare our approach to standard self-training, we took the same network and dataset splits and trained it without the Bayesian components. The threshold was set based on the confidence of the CNN output to only consider very confident predictions with over 0.999 predicted probability. As can be seen in Table 6, even with a high threshold, the deterministic CNN tends to be overconfident in its wrong predictions. This causes an increase in the propagated error as more images with wrong predicted labels are added to the training set and the model starts to underperform. To ensure a fair comparison between the self-training methods, the stop conditions were set to be identical, such that the procedure was interrupted after three consecutive iterations without selecting more images to be added to the training set.
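
The fixed-size selection used for Fig. 10a and the stop condition shared by both self-training variants can be sketched as follows. The function names and the simulated uncertainty scores are hypothetical; only the values of 500 images per round and three consecutive empty iterations come from the text.

```python
import numpy as np

def most_certain(uncertainty, k=500):
    """Indices of the k samples with the lowest predictive uncertainty."""
    k = min(k, len(uncertainty))
    return np.argsort(uncertainty)[:k]

def run_until_stalled(selected_per_iteration, patience=3):
    """Return the number of iterations executed before `patience`
    consecutive iterations select no new images."""
    stalled = 0
    for iteration, num_selected in enumerate(selected_per_iteration, start=1):
        stalled = stalled + 1 if num_selected == 0 else 0
        if stalled == patience:
            return iteration
    return len(selected_per_iteration)

# Simulated uncertainty scores for the pool of 11,948 images (hypothetical).
rng = np.random.default_rng(1)
pool_uncertainty = rng.random(11948)
chosen = most_certain(pool_uncertainty, k=500)

# Hypothetical selection counts per iteration; the run halts at iteration 7.
stopped_at = run_until_stalled([120, 40, 0, 7, 0, 0, 0, 3])
```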

Conclusion and future work
In this paper, we propose a deep Bayesian self-training methodology that leverages modern approximate variational inference in DNNs to estimate predictive uncertainty in a self-training setting. Both aleatoric and epistemic uncertainties of predicted pseudo-labels for unseen data are estimated, and the samples with the lowest predictive uncertainty (highest confidence) are added to the training set in an automated manner. We offer ways to mitigate the known problem of propagating errors in self-training by including: (i) an entropy penalty on the log-likelihood loss to punish overconfident output distributions and facilitate thresholding, and (ii) an adaptive sample-wise weight that makes the influence of pseudo-labelled samples on gradient updates inversely proportional to their predictive uncertainty. Lastly, we propose a new, simple methodology for visualising and analysing variability between two dataset distributions in DNNs and attempt to adapt information from one problem to the other by clustering learnt latent variable representations in the context of our application domain. An experimental study on both public and private (real) datasets is presented, demonstrating the increased performance of our algorithm over standard self-training baselines and highlighting the importance of predictive uncertainty estimates in safety-critical domains.
Our future work will extend the experimental study to a larger dataset, consisting of about half a million real food packaging images, and we intend to apply the presented DNN-based methodologies for adaptation and self-annotation of these data.
Fig. 10 Normalised confusion matrices of the results obtained from our self-annotation procedure. a The 5000 predicted labels obtained with the lowest prediction uncertainty. b Deterministic baseline CNN predicted labels, wherein the thresholds were set based on the network's sigmoid output. c Predicted labels from our Bayesian self-training approach, trained with a standard binary negative log-likelihood loss. d Similar to c but using a Bayesian CNN trained with the entropy-penalised binary negative log-likelihood loss