1 Introduction

Handwritten text recognition (HTR) and word retrieval, also known as word spotting, have been central tasks in the document analysis community. Both fields have seen a transformation from classical pattern recognition approaches, which rely on designed features combined with models such as Hidden Markov Models or Support Vector Machines, towards end-to-end approaches based on deep neural networks. Introducing large neural networks with an enormous number of parameters has been a promising approach for improving performance. This trend of scaling models and removing structural biases for the sake of overall performance seems to persist, as indicated by the rising popularity of transformer architectures that forgo classical convolutional and recurrent components.

While this development leads to increasing performance on all relevant benchmarks, it comes at the cost of requiring annotated training data. Annotating data is costly and a limiting factor for the application of machine learning. This is especially relevant considering that a main application of HTR or word spotting is the exploration of so far unknown historic documents. These documents are usually specific in their appearance, causing general models to perform poorly. The creation of annotated data often requires domain experts, which can render the application of machine learning unfeasible. In this work, we show that it is feasible to train an AttributeCNN, a state-of-the-art word spotting model, and a sequence-to-sequence model for HTR without the need for any annotated data. We consider the approach annotation-free (AF), as no manually created annotations are required. In contrast to methods that are considered learning-free, we still use machine learning to optimize our models.

At the core of the proposed method lies a self-training approach, see Fig. 1. First, an initial model is trained on a synthetic dataset. The initial model is then used to make predictions for the unlabeled target dataset. These predictions are assumed to be correct and are used as pseudo-labels to further optimize the model. Predicting pseudo-labels and training on them is an iterative process, resulting in the adaptation of the model from synthetic data to the real-world target domain. We present confidence estimates for both models that make it possible to identify erroneous pseudo-label predictions and that can be used to enhance the quality of the pseudo-labeled training set by thresholding. We further investigate the influence of crucial characteristics of the synthetic data generation process and how it can be adapted to a target dataset.

Fig. 1 Schematic visualization of self-training. Neural models are trained based on a confidence-based selection of pseudo-labels. An initial model (\(\text {Model}_0\)) is derived from synthetic data and later iteratively adapted

Aspects of the proposed methods and preliminary results have already been published in [62,63,64,65]. The idea of confidence estimation for an AttributeCNN was first investigated in [65]. Wolf and Fink [63] showed that self-training an AttributeCNN based on synthetic data is feasible. In Wolf et al. [62], several considerations with respect to synthetic data generation and synthesis calibration were introduced. The generalization of the self-training approach to HTR was presented in [64]. While the previous works only consider certain aspects of self-training and data synthesis for a single isolated task, this work specifically shows that a unified approach can be applied for both model types and challenges. The extended experimentation shows that the same conclusions can be drawn and strongly underlines the generality of the proposed method. The presented training approach makes it possible to train the two considered models without requiring a single manually annotated training sample. Our contributions can be summarized as follows:

  • We show the influence of language properties of synthetic datasets used for training recognition and retrieval models.

  • We propose a visual adaptation strategy for synthesis based on font classification.

  • We show that it is feasible to train a retrieval and a recognition model with self-training in the absence of any manually annotated data.

  • We show that the inclusion of confidence measures in the self-training scheme improves performance and robustness.

2 Related works

HTR and the retrieval of query words are common tasks that have been of great interest in the document analysis community. While handwriting recognition produces a fixed transcription, word spotting results in a ranked list of possible occurrences. In the following, essential methods for handwriting recognition and word spotting are briefly discussed. Furthermore, common word synthesis approaches are reviewed, as they constitute the state-of-the-art approach to pretrain networks and build the basis for deriving initial models in this work. Section 2.4 presents works on self-training that inspired the proposed method.

2.1 Word spotting

Word spotting describes the task of locating a query word in a large document collection. Instead of generating a full transcription, the user searches for a designated word inside the collection. As the provided result is a ranked list of potential occurrences, the final interpretation is left to the user, which is often desired in the context of exploring historical document collections.

The most characterizing distinction between methods can be made based on the query representation, the requirement of an independent segmentation step and the use of machine learning [16]. The literature distinguishes query-by-example (QbE) and query-by-string (QbS), depending on whether the query is represented by an exemplary image or a string. Segmentation-based methods such as [4, 26, 27, 42, 54] assume an independent word segmentation. In this case, the generation of a retrieval list boils down to computing similarities between word images. A popular approach to this problem was proposed in [4], which introduced the embedded attribute framework. Its goal is to map word images and strings to a common vector space, such that similarity can be quantified by vector distances. The Pyramidal Histogram of Characters (PHOC) embedding established itself as the de facto standard embedding. A PHOC vector encodes the presence of characters in different splits of a word in a pyramidal fashion. Predicting the embedding for a word image, e.g., by using Support Vector Machines [4], maps images and strings into a common vector space, solving the segmentation-based word spotting problem for QbE and QbS. Convolutional neural networks rapidly became the model of choice to predict attribute embeddings. In their line of research, Sudholt et al. [52,53,54] developed the PHOCNet, which is capable of predicting PHOC vectors for word images. The PHOCNet is the model of choice in this work and will be discussed in more detail in Sect. 3.1. Using a neural network to predict attribute embeddings requires annotated data for training the respective model. Several works on learning-free word spotting try to circumvent this requirement [16]. These works commonly rely on designing specific feature representations [43, 49, 66]. Their performance is clearly inferior to trained models and they usually lack the capability to perform QbS.
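To make the embedding concrete, the following sketch builds a simplified PHOC vector in Python. The alphabet and the pyramid levels chosen here are illustrative assumptions; the embedding used in [4, 54] additionally covers further characters and bigram levels.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # assumed character set
LEVELS = (1, 2, 3, 4, 5)                            # assumed pyramid levels


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Build a simplified PHOC vector: at every pyramid level the word is
    split into equally sized regions and a binary histogram marks which
    characters occur in each region."""
    vec = np.zeros(sum(levels) * len(alphabet), dtype=np.float32)
    n = len(word)
    offset = 0
    for level in levels:
        for region in range(level):
            lo, hi = region / level, (region + 1) / level
            for pos, char in enumerate(word.lower()):
                c_lo, c_hi = pos / n, (pos + 1) / n        # occupancy of the character
                overlap = min(hi, c_hi) - max(lo, c_lo)
                # assign the character if at least half of it falls into the region
                if char in alphabet and overlap / (c_hi - c_lo) >= 0.5:
                    vec[offset + region * len(alphabet) + alphabet.index(char)] = 1.0
        offset += level * len(alphabet)
    return vec


print(phoc("beyond").shape)  # (540,) with the assumed alphabet and levels
```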

2.2 Handwritten text recognition

Handwritten text recognition is a problem in document analysis and general computer vision with a long-standing tradition. Before the rise of deep neural networks, many methods relied on Hidden Markov Models [38]. The shift to neural models was significantly fostered by the introduction of Connectionist Temporal Classification (CTC) [17]. CTC defines a loss that makes it possible to train a neural network based on image-level annotations. State-of-the-art architectures combine convolutional backbones with recurrent components such as Long Short-Term Memory (LSTM) cells or Gated Recurrent Units (GRU) [13, 35, 40, 58, 60].

In contrast to CTC, other methods investigated the use of sequence-to-sequence models that exploit an attention mechanism. The model is usually built upon a convolutional encoder followed by recurrent layers, an attention mechanism and a decoder. The attention mechanism allows the decoder to focus on different elements of the encoded feature sequence and, therefore, makes the prediction of character sequences of arbitrary length feasible. Several works show that sequence-to-sequence models give competitive performances and alleviate some of the drawbacks of CTC [8, 23, 55].

Recently, transformer architectures have become increasingly popular in all fields of computer vision. They have also been explored for handwritten text recognition [20, 31]. Transformers usually require large amounts of annotated training data to achieve high performance. Nonetheless, it was shown in [37] that an application to low resource domains can be feasible using different pretraining and transfer learning strategies.

2.3 Handwriting synthesis

A main drawback of large neural architectures is that usually a high number of manually annotated, representative training samples is required to optimize the model. To alleviate this problem, creating data automatically has been of interest to the research community. Handwriting is an especially well-suited domain, as handwritten text follows a number of defined rules regularizing its possible appearance. This facilitates the design of an automated generation process that produces labeled samples without manual effort. Generating handwritten text by using TrueType fonts to render strings has become a popular approach in the literature [23, 29, 31]. Pretraining recognition and retrieval models on large synthetic datasets has become a standard approach. Most works report poor performances when no manually annotated data is included in the training process, but the use of synthetic data generally improves performance [27] and reduces the demand for manually labeled data [18].

Other works explore the use of Generative Adversarial Networks (GAN) for image generation [5, 21, 34]. While the generated results are visually closer to real-world examples, these models require annotated samples to be trained. When samples generated by adversarial networks are used as a data augmentation strategy, only marginal improvements in recognizer performance have been reported [5, 21, 34]. The same drawback can be observed for diffusion models, which appear to be the next evolution in image generation [12, 36].

2.4 Self-training

Self-training describes the technique of improving model performance by learning from pseudo-labels generated by the model itself. Conceptually, this idea has a long-standing tradition and has been investigated for almost all model types in semi-supervised machine learning [10]. Empirically, results show that training with pseudo-labels improves performance for diverse sets of models and application domains [14].

One of the first works to explore self-training for neural networks was presented in [30], where the authors show that self-training improves the performance of a classification network on MNIST. Self-training has also become a popular approach for object recognition, and its effectiveness has been demonstrated, for example, in the highly recognized line of work presented in [6, 7, 51]. Self-training is often combined with confidence criteria that improve the quality of the pseudo-labeled training set by removing samples that are likely to be erroneous [7]. This is commonly done by considering the network activation as a confidence estimate and removing samples with a low prediction confidence. Furthermore, works such as [6, 51] introduce consistency criteria by exploiting multiple augmented versions of a sample during training.

Self-training has also been explored in several works in the document analysis community. The authors of [56] adapt a text recognizer from one language to another by training on pseudo-labels with the support of a language model. In [44], self-training is used in a transductive learning scenario. In [24], self-training is applied to a CTC-based recognizer in a semi-supervised learning task. Note that in most cases initial models are derived from representative training data that constitutes a part of the actual training set.

3 Method

This section discusses the core elements of the proposed self-training method for handwritten text recognition and word retrieval. First, we introduce the underlying models for word recognition and retrieval in Sect. 3.1. Section 3.2 describes the synthesis process and how it can be adapted to a target dataset without the need for annotated data. Finally, Sect. 3.3 presents the self-training scheme and its extension with a confidence-based selection.

3.1 Model architectures

Despite addressing conceptually different problems, modern retrieval and recognition models share a common structure. Different architectural components recur in almost any successful network for handwriting recognition and retrieval and can be summarized by a three-step architecture. First, a convolutional neural network serves as a feature extractor and computes a set of feature maps for a given input image. Feature extraction is commonly followed by an aggregation step with the goal of computing a characteristic and compact representation. This can be achieved either in a holistic manner, for example by spatial pooling, or alternatively by slicing and sequentializing the set of feature maps. Finally, a decoder computes the desired outputs. The models used in this work implement this common architecture in terms of an AttributeCNN for retrieval [54, 63] and a sequence-to-sequence model for recognition [23, 64]. See Fig. 2 for an overview of the different model components according to their shared structure.

Fig. 2 Comparison of the two main models. Both architectures follow the general three-step procedure of feature extraction, aggregation and decoding. Both models closely follow established approaches. For a detailed description of the architecture of the retrieval model, see [54]. The recognition model is discussed in depth in [23]

3.1.1 Attribute CNN

In this work, we follow the AttributeCNN approach presented and researched in [52,53,54]. Even though the model integrates the prediction of attributes in a single network architecture, the structure of feature extraction, aggregation and decoding can be identified. The model considered in this work is essentially a TPP-PHOCNet, first presented in [53]. Feature extraction uses a variant of the VGG16 architecture with fewer pooling layers. In order to aggregate an image representation of fixed size that is independent of the input image dimensions, the model includes a Temporal Pyramid Pooling (TPP) layer with five levels. The final two fully-connected layers can be considered a decoder that predicts an attribute vector \(\varvec{\hat{a}}\). Given annotated data, the network can be trained using the binary cross entropy loss.
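The following sketch illustrates the aggregation step, assuming a PyTorch implementation of temporal pyramid pooling with the five levels mentioned above; channel counts and the exact pooling variant of the TPP-PHOCNet are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPyramidPooling(nn.Module):
    """Pool a feature map of arbitrary width into a fixed-size vector by
    max-pooling over 1 + 2 + 3 + 4 + 5 horizontal bins (five pyramid levels)."""

    def __init__(self, levels=(1, 2, 3, 4, 5)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                        # x: (batch, channels, height, width)
        pooled = []
        for bins in self.levels:
            # collapse the vertical axis, split the horizontal axis into `bins` parts
            p = F.adaptive_max_pool2d(x, output_size=(1, bins))
            pooled.append(p.flatten(start_dim=1))
        return torch.cat(pooled, dim=1)          # (batch, channels * sum(levels))


features = torch.randn(2, 512, 8, 47)            # VGG-style features of two word images
print(TemporalPyramidPooling()(features).shape)  # torch.Size([2, 7680])
```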

3.1.2 Sequence-to-sequence HTR

Sequence-to-sequence handwriting recognition models avoid the use of the common CTC loss. Attention-based decoding has been shown to be a promising alternative [1, 8, 23, 55]. We adopt most of the design choices from [23] to allow for a fair comparison of training and adaptation methods, which are the main focus of this work.

A VGG19 network serves as a feature extractor and computes a set of feature maps \(\varvec{X}\). The aggregation stage consists of two steps. First, the computed feature maps are transformed into a sequence of feature vectors by slicing the extracted feature maps. Then, two layers of bi-directional Gated Recurrent Units introduce recurrent connections inside the network, resulting in an enriched feature representation \(\varvec{H}\). Based on the sequence of enriched feature vectors, a context vector \(\varvec{c_t}\) is computed at every decoding step t using location-based attention [11]. The decoder is implemented as a one-directional multi-layered GRU with a fully-connected layer and softmax activation. Each prediction sequence starts with a <GO> signal. Then, the decoder computes a pseudo-probability distribution \(\varvec{d_t}\) over the character set at each time step until either the end signal occurs or the maximum number of time steps T is reached. As the decoder simply predicts a character \(y_t\) at every time step t from the decoder state \(\varvec{s_t}\), which also encodes the context vector \(\varvec{c_t}\), the entire model can be optimized using a Binary Cross Entropy loss.
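The sketch below outlines a single decoding step under these assumptions. For brevity it uses plain additive attention instead of the location-based attention of [11], and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn


class AttnDecoderStep(nn.Module):
    """One decoding step: attention over the encoded sequence H, a GRU cell and
    a character classifier. Plain additive attention is used here instead of the
    location-based variant; all sizes are placeholders."""

    def __init__(self, enc_dim=512, hid_dim=256, num_chars=80):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(enc_dim + hid_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, 1)
        )
        self.gru = nn.GRUCell(enc_dim + num_chars, hid_dim)
        self.classifier = nn.Linear(hid_dim, num_chars)

    def forward(self, H, s_prev, y_prev):
        # H: (B, L, enc_dim), s_prev: (B, hid_dim), y_prev: one-hot of previous character
        s_rep = s_prev.unsqueeze(1).expand(-1, H.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([H, s_rep], dim=-1)).squeeze(-1), dim=1)
        c_t = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)          # context vector c_t
        s_t = self.gru(torch.cat([c_t, y_prev], dim=-1), s_prev)   # new decoder state s_t
        d_t = torch.softmax(self.classifier(s_t), dim=-1)          # pseudo-probabilities d_t
        return d_t, s_t, alpha


H = torch.randn(4, 120, 512)                                       # encoded feature sequence
d_t, s_t, alpha = AttnDecoderStep()(H, torch.zeros(4, 256), torch.zeros(4, 80))
print(d_t.shape, alpha.shape)                                      # (4, 80) and (4, 120)
```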

3.2 Synthesis of handwritten words

3.2.1 Image synthesis

The goal of data synthesis is to generate training samples that model the variance occurring in real-world data. Essentially, we need to design a function \(f_{\textit{syn}}: \mathcal {Y} \rightarrow \mathcal {I}\) that maps a given label y to a handwritten word image \(\varvec{I}\). The choice of a set of labels and a corresponding distribution p(y) can be considered a form of language modeling. As information on language may only be learned implicitly during training, it is an open question whether it is actually encoded and learned in a synthetic pre-training stage. In order to investigate this, we propose the use of different labelsets that are later rendered into handwritten word images.

For this purpose, we use the top 100 ebooks available from Project Gutenberg.Footnote 1 This yields a text corpus of approximately 14 million words. The text corpus includes punctuation as well as lower and upper case letters, and all distributions follow the occurrence patterns of natural text from English literature.

Using a literature corpus as the source for the labelset results in a training dataset where the number of training samples per word depends on its occurrence in natural language use. In order to gain insight into whether the distribution over words influences model performance, we remove the natural distribution from the generation process. To this end, we use the most frequent English words and generate a fixed number of samples for each. In contrast to the text corpus extracted from natural text, the samples per word then follow a uniform distribution.

In order to further remove language characteristics at the stage of synthetic pretraining, we investigate whether and how performances degrade when the model is trained on randomized strings. In this case, not only the word distribution is removed from the labelset but also any information on word level. If the model learns independent character information, the influence of removing word-level information should be limited. If performances degrade strongly, this is a clear indication that synthetic pretraining constitutes a form of language modeling. In order to generate strings that at least resemble words, we introduce the following constraints: the length distribution follows the distribution observed in the previously described Gutenberg corpus, and the same holds true for the occurrence of capital letters and special characters.
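A minimal sketch of such a string randomization is given below; it only reproduces the length distribution of a reference corpus and omits capitalization and special characters for brevity.

```python
import random
from collections import Counter


def random_labelset(corpus_words, num_labels, charset="abcdefghijklmnopqrstuvwxyz"):
    """Sample random strings whose lengths follow the empirical length
    distribution of a reference corpus."""
    length_counts = Counter(len(w) for w in corpus_words)
    lengths, weights = zip(*length_counts.items())
    labels = []
    for _ in range(num_labels):
        n = random.choices(lengths, weights=weights, k=1)[0]
        labels.append("".join(random.choice(charset) for _ in range(n)))
    return labels


print(random_labelset(["the", "quick", "brown", "fox"], 5))
```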

Randomizing strings aims at removing any information on language from the synthetic dataset. At the other extreme, it is an open question whether the model benefits from more specific language information from the target domain. Therefore, we create a dataset-specific labelset. Here, the synthetic dataset is based on the closed dictionary of the evaluation dataset.

After modeling language and word information by extracting the underlying labelset, the next step in the synthesis pipeline is the design of the function \(f_\text {syn}: \mathcal {Y} \rightarrow \mathcal {I}\) that maps a given label \(y \in \mathcal {Y}\) to a handwritten word image \(\varvec{I} \in \mathcal {I}\). This mapping can easily be realized by off-the-shelf solutions that render strings to word images. In this work, we use the open-source software ImageMagickFootnote 2 for rendering. The framework allows generating word images from a given string. The resulting style is determined by a random choice of a handwritten font, the stroke width, the kerning, the slant and the skew angle.
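As an illustration, the rendering could be scripted as follows by calling ImageMagick from Python. The font files, parameter ranges and the mapping of slant to a horizontal shear are assumptions and not necessarily the exact settings used in this work (ImageMagick 7 uses magick instead of convert).

```python
import random
import subprocess

FONTS = ["fonts/handwriting_001.ttf", "fonts/handwriting_002.ttf"]  # placeholder font files


def render_word(label, out_path):
    """Render a label string to a word image with randomized style parameters
    (font, stroke width, kerning and a horizontal shear approximating the slant)."""
    cmd = [
        "convert",                                   # "magick" for ImageMagick 7
        "-background", "white",
        "-fill", "black",
        "-stroke", "black",
        "-strokewidth", str(random.choice([0, 1, 2])),
        "-kerning", str(random.randint(0, 4)),
        "-font", random.choice(FONTS),
        "-pointsize", str(random.randint(48, 96)),
        f"label:{label}",
        "-shear", f"{random.randint(-10, 10)}x0",
        out_path,
    ]
    subprocess.run(cmd, check=True)


# render_word("example", "example.png")  # requires ImageMagick and the listed fonts
```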

Fig. 3 Architecture of the synthesis calibration system. Font and slant classification networks are trained on synthetic images. Respective histograms can be derived from predictions on the target images. An adapted word image synthesis includes the style information encoded by the histograms

3.2.2 Synthesis calibration

Considering the synthesis parameters described in Sect. 3.2.1, the selection of a font and a slant angle strongly defines what can be considered the style of a writer or document. Instead of randomly choosing these parameters following a uniform distribution, we propose to predict them for a target dataset. In a first step, a synthetic dataset is generated in which fonts and slant angles are uniformly distributed. This dataset is then used to train a font and a slant angle classifier. Training these classifiers solely on synthetic data does not introduce any additional labeling effort and allows deriving distributions over fonts and slant angles for a given target dataset. The assumption is that by classifying each word image from the target dataset, we can estimate a distribution that allows synthesizing a dataset that is visually more similar to the target set. We use a convolutional neural network to classify the font of a given word image. Each font used in the synthesis procedure is considered a class. As a backbone architecture, we use the residual network ResNet50 [19] with the number of output neurons equal to the number of available fonts. The same approach is taken for predicting one out of five slant angles. Both networks are exclusively trained on synthetically generated word images. In order to adapt the synthesis procedure to an unknown dataset, we predict a font and a slant angle for each sample from the target dataset. The histograms of predicted fonts and slant angles constitute an approximate distribution of which fonts and slant angles result in visually similar word images. An adapted synthetic dataset is then generated by replacing the uniform distributions of the synthesis procedure with the distributions approximated by the font and slant angle predictors. Figure 3 presents an overview of the synthesis calibration procedure.
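A sketch of the histogram estimation step is given below, assuming a font classifier trained on synthetic data and a loader over the unlabeled target images are available; the same procedure applies to the slant angle classifier.

```python
import numpy as np
import torch


def estimate_font_distribution(classifier, target_loader, num_fonts):
    """Predict a font class for every word image of the unlabeled target set and
    turn the prediction counts into a sampling distribution for synthesis."""
    counts = np.zeros(num_fonts)
    classifier.eval()
    with torch.no_grad():
        for images in target_loader:                 # batches of unlabeled word images
            for p in classifier(images).argmax(dim=1).cpu().numpy():
                counts[p] += 1
    return counts / counts.sum()                     # approximate p(font | target dataset)


# during synthesis, fonts are then drawn from the estimated distribution:
# font_id = np.random.choice(num_fonts, p=font_distribution)
```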

3.3 Self-training

Self-training can be realized for the models discussed in Sect. 3.1, starting with an initial model trained on a synthetic dataset. Let \(\varvec{S}\) be a synthetic dataset containing word images \(\varvec{s}^{(0)}, \ldots , \varvec{s}^{(N)}\) with corresponding labels \(y_0, \ldots , y_N\). The goal is to train a model \(\phi (\cdot , \theta )\) for a target dataset \(\varvec{X}\) for which no annotations are available. In general, we can use the same self-training approach for both recognition with a sequence-to-sequence HTR model and retrieval with an AttributeCNN. First, an initial model is required. In order to derive an initial set of weights \(\theta ^0\), we train the model on the synthetic dataset for one epoch. The initial model is then used to make predictions for the unlabeled target dataset. These predictions are assumed to be correct and, therefore, constitute pseudo-labels. The previously unlabeled dataset in combination with the predicted pseudo-labels can then be considered a training dataset \(\varvec{X}_{\text {train}}\). Training on this dataset results in a new set of weights \(\theta ^k\). Predicting pseudo-labels and training on the newly generated training set is performed iteratively.
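The scheme can be summarized by the following sketch, in which predict and train_one_epoch are hypothetical helpers standing in for the model-specific prediction and optimization routines.

```python
def self_train(model, synthetic_set, target_images, cycles, predict, train_one_epoch):
    """Generic self-training loop: derive an initial model from synthetic data, then
    alternate between predicting pseudo-labels for the unlabeled target set and
    training on them."""
    train_one_epoch(model, synthetic_set)                      # initial weights theta_0
    for k in range(cycles):
        pseudo_labels = [predict(model, img) for img in target_images]
        pseudo_set = list(zip(target_images, pseudo_labels))   # X_train with pseudo-labels
        train_one_epoch(model, pseudo_set)                     # updated weights theta_k
    return model
```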

Generating a pseudo-label is straightforward for the recognition model discussed above. The recognition decoder consists of a linear layer with softmax activation, which gives a prediction vector \(\varvec{\hat{y}}_t\) at every time step t. A final recognition sequence is then derived by taking the character with maximum activation at each step, and the resulting prediction sequence serves as the pseudo-label.

In the case of an AttributeCNN, using the network predictions as pseudo-labels directly or in a binarized form disregards any knowledge about the structure of a plausible PHOC vector. To avoid neglecting this structural consistency and to further include language information in the training procedure, lexicon-based recognition is used for pseudo-label generation. Even though the AttributeCNN used in this work is not primarily designed to solve word recognition, it is feasible to do so if a lexicon is available. Let \(\mathbb {L} = \{l_0, \ldots , l_N\}\) be a lexicon containing N strings. Each string can trivially be represented by its corresponding attribute representation, yielding \(\mathbb {L}_{\varvec{a}} = \{^{l_0}\varvec{a}, \ldots , ^{l_N}\varvec{a}\}\). After the prediction of an attribute representation \(\varvec{\hat{a}} = \phi (\varvec{x}, \theta )\), word recognition is performed in the fashion of a nearest neighbor search over the lexicon. The recognition result \(^{*}l\) is the lexicon entry with minimal cosine dissimilarity \(d_{cos}\) between the respective attribute vectors:

$$\begin{aligned} {}^{*}l = \mathop {\mathrm {arg\,min}}\limits _{l \in \mathbb {L}} d_{cos}\left( ^{l}\varvec{a}, \varvec{\hat{a}}\right) . \end{aligned}$$
(1)

The final pseudo-label is derived as the attribute representation of the recognition result \(\varvec{\bar{y}}=^{^{*}l}\varvec{a}\).
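A minimal sketch of this lexicon-based pseudo-label generation, assuming the lexicon PHOC vectors are stored as rows of a NumPy array:

```python
import numpy as np


def lexicon_pseudo_label(a_hat, lexicon_words, lexicon_phocs):
    """Return the lexicon entry whose attribute vector has minimal cosine
    dissimilarity to the predicted vector, together with its attribute
    representation used as pseudo-label (Eq. 1)."""
    a_hat = a_hat / (np.linalg.norm(a_hat) + 1e-8)
    L = lexicon_phocs / (np.linalg.norm(lexicon_phocs, axis=1, keepdims=True) + 1e-8)
    dissimilarity = 1.0 - L @ a_hat                  # cosine dissimilarity to every entry
    best = int(np.argmin(dissimilarity))
    return lexicon_words[best], lexicon_phocs[best]
```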

Aiming at the exploration of new datasets where no annotations are available also means that the exact lexicon is unknown. In the following experiments, we make use of a language-based lexicon, meaning that, similar to the synthesis stage, we use the 10,000 most common English words as the recognition lexicon. This only introduces the assumption that the language of the target dataset is known before training the model.

3.3.1 Confidence estimation

Using pseudo-labels to train a neural network results in a training dataset of poor label quality, as especially early models may be inaccurate. Nonetheless, prediction quality differs across different parts of the dataset and it is likely that a small portion of the labels is at least partially correct. The goal of confidence estimation in the context of self-training is to identify these correct predictions without knowing the actual annotation. Therefore, a numerical value is required that quantifies how likely it is that a given prediction is correct.

A confidence measure essentially constitutes a function \(c: \mathcal {I} \times \Theta \rightarrow \mathbb {R}\) that maps an input image \(\varvec{I} \in \mathcal {I}\) to a real-valued number based on a set of model parameters \(\theta \in \Theta \). In the literature, most self-training methods for classification models use the activation of the neuron corresponding to the predicted class as a confidence estimate. This is not directly feasible for the retrieval and recognition models of this work, as the prediction task is either a multi-label classification or a sequential problem. We still follow the idea of using the network activations and derive several confidence measures based either on the predicted attribute representation \(\varvec{\hat{a}}\) or the prediction sequence \(\hat{y}\).

At each time step of the recognition sequence, the decoder of the sequence-to-sequence model outputs a vector of pseudo-probabilities \(\varvec{d}_t\) over the character set. For each character, we consider its corresponding activation a numeric estimate of how confident the network is in the respective prediction. In the proposed self-training procedure, a sample is either included in the training set or discarded based on a total confidence value. Therefore, the individual character confidences have to be summarized in a single value quantifying the pseudo-probability that the prediction sequence is correct. The character confidences are summarized either by taking their mean or their minimum:

$$\begin{aligned} c_{\text {mean}} = \frac{1}{T} \sum ^T_{t=0} \max {(\varvec{d}_t)} \quad c_{\text {min}} = \min _{0\le t \le T}(\max (\varvec{d}_t)) \end{aligned}$$
(2)
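Both summaries can be computed directly from the decoder outputs, as in the following sketch:

```python
import numpy as np


def sequence_confidences(d_sequence):
    """Summarize the per-step character confidences of a prediction sequence
    (Eq. 2); d_sequence is the list of decoder output vectors d_t."""
    char_conf = np.array([d_t.max() for d_t in d_sequence])
    return char_conf.mean(), char_conf.min()         # c_mean, c_min
```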

In order to investigate the potential performance loss caused by the introduced confidence estimates, we further experiment with an oracle confidence measure. A perfect estimate that fulfills the previously discussed requirements of the training procedure can easily be derived if the annotation is taken into account. From a practical perspective this is unfeasible, but it allows quantifying the performance drop caused by a suboptimal confidence estimation. For the sequence-to-sequence model, we use the Levenshtein distance between the prediction sequence and its annotation as an oracle confidence estimate:

$$\begin{aligned} c_{\text {oracle}} = \text {Levenshtein}(\hat{y}, y) \end{aligned}$$
(3)

Retrieval relies on the prediction of an attribute representation \(\varvec{\hat{a}}\). Similar to the sequence model, a confidence measure may be built directly upon the network's outputs. In the case of a PHOC vector, an ideal prediction would yield a vector with either a zero or a one for every entry. An attribute is considered to be present in the word image if its pseudo-probability \(\hat{a}_i \approx p(A_i=1|\varvec{x})\) lies above a value of 0.5. For defining a confidence estimate, we neglect all attributes that are not present according to the prediction. The final value results from averaging over all present attributes:

$$\begin{aligned} c_{\text {sig}} = \frac{1}{|\varvec{\hat{a}}|} \sum _{\hat{a}_i> 0.5} \phi (\varvec{x},\theta )_i \approx \frac{1}{|\varvec{\hat{a}}|} \sum _{\hat{a}_i > 0.5} p(A_i=1|\varvec{x}). \end{aligned}$$
(4)

While only considering present attributes is a straightforward generalization of using a class activation as a confidence estimate, it neglects most of the information in the predicted PHOC vector. In most cases, PHOC vectors and, therefore, also their estimates are rather sparse. For an estimate of high confidence, it is not only desired that the activations of present attributes are close to one, but also that the entries of absent attributes are close to zero. This characteristic can be quantified by using entropy as a confidence measure. In this case, a low entropy corresponds to a high confidence in the network's predictions. Following the interpretation of an attribute as a Bernoulli distributed random variable \(A_i\), its entropy is given by

$$\begin{aligned} H({A}_i) = -\hat{a}_i\log (\hat{a}_i) - (1-\hat{a}_i)\log (1-\hat{a}_i). \end{aligned}$$
(5)

To model the confidence of an entire attribute vector, we compute the negative joint entropy over all attributes. Assuming conditional independence among attributes, the joint entropy is the sum of the entropies of the individual random variables:

$$\begin{aligned} c_{\text {entropy}} = - H({A}_1, \ldots , {A}_D) = - \sum _{i=1}^{D}H({A}_i) = \sum _{i=1}^{D} \left( \hat{a}_i\log \hat{a}_i + (1-\hat{a}_i)\log (1-\hat{a}_i)\right) . \end{aligned}$$
(6)
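The two confidence measures for the AttributeCNN can be sketched as follows; the epsilon term is only added for numerical stability and is not part of Eq. (6):

```python
import numpy as np


def phoc_confidences(a_hat, eps=1e-8):
    """Confidence estimates for a predicted attribute vector a_hat in [0, 1]^D:
    the averaged activation of attributes predicted as present (Eq. 4) and the
    negative joint entropy under the independence assumption (Eq. 6)."""
    present = a_hat > 0.5
    c_sig = a_hat[present].sum() / len(a_hat)
    c_entropy = np.sum(a_hat * np.log(a_hat + eps) + (1 - a_hat) * np.log(1 - a_hat + eps))
    return c_sig, c_entropy
```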
Algorithm 1 Self-training with confidence selection

Additionally, we introduce an oracle confidence measure for the AttributeCNN that exploits the annotations. Using string-based attribute representations such as the PHOC encoding maps strings and the network prediction into a common embedding space. A confidence estimate can thus be defined based on the cosine dissimilarity \(d_{cos}\) between the predicted attribute representation and the attribute representation of the annotation. Here, a high dissimilarity corresponds to a prediction of low confidence:

$$\begin{aligned} c_{\text {oracle}} = 1 - d_\text {cos}(^{l}\varvec{a}, \varvec{\hat{a}}) \end{aligned}$$
(7)

Confidence estimation can easily be integrated into the self-training scheme as follows. After making predictions for the entire unlabeled dataset, a confidence value is determined for each sample. This allows sorting the unlabeled dataset with respect to the confidence estimates. Under the assumption that predictions with low confidence estimates are highly erroneous, we aim at removing inaccurate pseudo-labels in each training cycle. One option is to define a training schedule and always include a fixed percentage \(p_k\) of the unlabeled training data. Alternatively, sample selection may be performed by introducing a threshold \(\tau \). In this case, the number of hyperparameters is reduced to a single threshold value. Algorithm 1 summarizes the entire self-training scheme with confidence-based sample selection.
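The selection step itself reduces to sorting by confidence and keeping either the samples above a threshold or a fixed fraction of the data, as in the following sketch:

```python
def select_confident(samples, pseudo_labels, confidences, tau=None, keep_fraction=None):
    """Build the pseudo-labeled training set by keeping either all samples with a
    confidence above the threshold tau or the most confident fraction of the data
    (fixed schedule); exactly one of the two options is expected to be set."""
    order = sorted(range(len(samples)), key=lambda i: confidences[i], reverse=True)
    if tau is not None:
        keep = [i for i in order if confidences[i] >= tau]
    else:
        keep = order[: int(keep_fraction * len(samples))]
    return [(samples[i], pseudo_labels[i]) for i in keep]
```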

4 Experiments

The main focus of our experimental evaluation is to answer the following questions:

  • Does an adaptation of the language properties of the synthetic dataset have direct implications for model performance?

  • Can classification networks trained on synthetic data capture visual properties of a target dataset, and how does the visual adaptation of the synthesis approach influence performances?

  • Does self-training make it possible to adapt the considered models to a target dataset without the need for manually labeled data?

  • Do the proposed confidence measures capture prediction quality?

  • Is a confidence-based selection of pseudo-labels beneficial for self-training?

We conduct experiments on four evaluation datasets and on one synthetic dataset from the literature. Performances are evaluated in terms of word error rate (WER) and character error rate (CER) for the recognition model. For the AttributeCNN, we report mean average precision (mAP) values for different QbE and QbS benchmarks.

4.1 Datasets

In order to evaluate the methods presented in this work, experiments are conducted on four different benchmark datasets. While all datasets consist of handwritten word images, their appearance varies strongly across the different benchmarks. See Fig. 4 for selected examples from all datasets. Furthermore, the datasets show significant differences in the size of the available training and testing material, which reflects the challenge in a real-world application. Besides the presented evaluation benchmarks, a publicly available synthetic dataset serves as an additional baseline.

Fig. 4 Example word images from the historic (GW, BT), contemporary (CVL, IAM) and synthetic (IIIT) datasets

4.1.1 George Washington (GW)

The George Washington (GW) dataset constitutes a historic document collection. For recognition, we follow the protocol established in [13, 57]. The dataset is split into training, validation and test sets according to the first partition proposed in [15]. Capitalization and punctuation are considered as annotated. Retrieval is evaluated following the protocol first described in [4].

4.1.2 Jeremy Bentham manuscripts (BT)

We further consider the Jeremy Bentham Manuscripts. In order to evaluate retrieval performance, we follow the competition protocols established in 2014 [39] (\(\text {BT}_{\text {14}}\)) and 2015 (\(\text {BT}_{\text {15}}\)) [41]. Both competitions define segmentation-based protocols for query-by-example word spotting. Note that the respective datasets are not entirely annotated but only provide a list of words that are relevant with respect to a given query. Therefore, a numeric evaluation of QbS is not feasible, even though the proposed model has the capability. Even though no classic word recognition benchmark exists for the Bentham Manuscripts, the collection has been used in this fashion in [33]. Based on [59], the authors propose a question answering benchmark and, therefore, define the BenthamQA dataset (\(\text {BT}_{\text {QA}}\)) with word-level annotations. They also report recognition results of a system trained entirely without training material from the target domain, to which we can draw a direct comparison.

4.1.3 IAM handwriting database

The IAM Handwriting Database contains contemporary handwriting and was specifically created to establish a benchmark for handwriting recognition tasks [32]. We follow the predominantly used RWTH evaluation protocol as in [22, 27]. Due to its large scale and challenging characteristics, the IAM database was also adopted as a word spotting benchmark. We follow the commonly used protocol defined in [4].

4.1.4 CVL database

The CVL Database [25] is the second contemporary benchmark that we use to evaluate recognition performance. Note that the dataset includes words in German, while it is mainly written in English. For all later experiments, we ignore this fact and follow the exact same procedures as for the other datasets, which only include English words. We follow the same protocol as [1, 22] to allow for a fair comparison.

4.1.5 IIIT-HWS

In the literature on handwritten document analysis, using synthetic data to pre-train models is common practice. Nonetheless, the actual synthetic datasets are rarely publicly available. One of the first published synthetic datasets of handwritten word images is the IIIT-HWS dataset, first described in [29]. The dataset consists of a total of nine million word images, which were rendered from a vocabulary of 90,000 words. The authors define two variants of the dataset: IIIT-HWS10K, which only uses a subset of 10,000 words, and IIIT-HWS90K, which includes the entire dataset. In this work, both versions are used as a baseline for a generic synthetic dataset from the literature without any model- or domain-specific considerations.

4.2 Significance testing and confidence intervals

The experiments in this work are in most cases the result of combining numerous stochastic processes. In order to account for the random nature of the presented experiments, significance tests are performed for all retrieval experiments. Furthermore, we provide confidence intervals for the recognition experiments.

As proposed in [54], running a permutation test [50] gives further insight into the statistical significance of a performance difference. Consider two word spotting models that are evaluated on a given benchmark with a predefined set of queries. The permutation test is supposed to answer the question whether the two systems are significantly different. Therefore, the null hypothesis is formulated that both systems are actually identical [50]. If both systems are identical, any permutation of the resulting set of average precisions is equally likely. Based on this assumption, the p-value or significance level can be computed. The exact p-value is the number of permutations that result in a mAP difference greater than the observed value for the original systems, divided by the number of all possible permutations. In practice, Monte-Carlo sampling of random permutations is used to derive an estimate \(\hat{p}\) of the actual p-value. In the following experiments, we follow the same parameterization as in [46, 54]. A precise approximation of the p-value with a maximal standard deviation of \(s_{\hat{p}} = 0.0001\) requires \(k=250000\) permutations. Two models are considered significantly different if the estimated p-value lies below the significance level of \(\alpha =0.05\). If a result is significantly better or worse than its respective baseline, its mAP value is noted in italics.
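The following sketch illustrates a Monte-Carlo permutation test on paired per-query average precisions; the exact permutation scheme of [50, 54] may differ in detail:

```python
import numpy as np


def permutation_test(ap_a, ap_b, k=250_000, seed=0):
    """Monte-Carlo permutation test on paired per-query average precisions.
    Returns an estimate of the p-value of the observed mAP difference under the
    null hypothesis that both systems are identical."""
    rng = np.random.default_rng(seed)
    ap_a, ap_b = np.asarray(ap_a), np.asarray(ap_b)
    observed = abs(ap_a.mean() - ap_b.mean())
    count = 0
    for _ in range(k):
        swap = rng.random(len(ap_a)) < 0.5            # randomly exchange paired APs
        diff = np.where(swap, ap_b, ap_a).mean() - np.where(swap, ap_a, ap_b).mean()
        if abs(diff) >= observed:
            count += 1
    return count / k
```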

All recognition experiments are quantitatively evaluated in terms of error rates. Due to the random nature of the conducted experiments, the reported error rate can only be considered an estimate of the actual error rate. Typically, recognition tasks are modeled as a Bernoulli process that follows a binomial distribution [9]. An observed word accuracy is an estimate \(\hat{p} = \frac{k}{n}\) based on the fraction of correctly classified patterns k over the total number of patterns n. If n is large enough, the Wilson interval [9] can be derived from solving the following quadratic equation:

$$\begin{aligned} (n+c^2)p^2 - (2k + c^2)p + \frac{k^2}{n} = 0 \end{aligned}$$
(8)

The value c determines the probability with which the actual accuracy lies in the confidence interval; \(c=1.96\) corresponds to a probability of 95%. The final confidence interval is derived by computing the solutions of Eq. 8, given by

$$\begin{aligned} p_{u/l} = \frac{2k+c^2}{2(n+c^2)} \pm \sqrt{\left( \frac{(2k+c^2)}{2(n+c^2)}\right) ^2 - \frac{k^2}{n(n+c^2)}} \end{aligned}$$
(9)
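For reference, the interval of Eq. 9 can be computed as follows:

```python
import math


def wilson_interval(k, n, c=1.96):
    """Wilson confidence interval for an observed accuracy k/n (Eq. 9);
    c = 1.96 corresponds to a 95% interval."""
    center = (2 * k + c**2) / (2 * (n + c**2))
    half = math.sqrt(center**2 - k**2 / (n * (n + c**2)))
    return center - half, center + half


print(wilson_interval(8500, 10000))  # e.g. word accuracy 0.85 on 10,000 test samples
```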

4.3 Results

4.3.1 Training on synthetic data

In order to derive an initial model as a starting point for self-training, we train the retrieval and recognition models on synthetic datasets. Following the synthesis pipeline described in Sect. 3.2, we generate several datasets based on different underlying labelsets to investigate how the language properties of the synthetic data influence performance. In all cases, three million word images are synthesized using all 396 available fonts, and the datasets only differ in the labels that represent the available language information. As described in Sect. 3.2, we either use the natural text corpus extracted from Project Gutenberg, a vocabulary-based approach with an equal number of training samples for the 1000 most frequent words, or randomized strings. Furthermore, it is feasible to define a labelset that perfectly represents the target datasets by using their respective annotations. From a practical application perspective, it is usually not possible to obtain such a labelset. For the quantitative evaluation, we train both the AttributeCNN and the sequence-to-sequence recognition model on the synthetic data. Following the same setup, models are also trained on the two variants of the IIIT-HWS dataset. Tables 1 and 2 present the resulting performances on all benchmarks.

Table 1 Comparison of retrieval performances resulting from training on synthetic datasets with varying labelsets
Table 2 Comparison of recognition performances resulting from training on synthetic datasets with varying labelsets

Considering retrieval performance, quite high mAP values of up to 80% in the case of query-by-string on George Washington can be achieved, despite the fact that the model has never seen representative samples from the target dataset. In general, considerable performances are achieved on all benchmarks, motivating the suitability as an initial model for self-training. Focusing the synthesis labels on the annotations of the actual datasets leads to significant performance gains for the retrieval model. Improvements are comparably larger for a small dataset such as George Washington than for the IAM dataset, indicating that the choice of the labelset introduces a strong bias into the model. For a homogeneous dataset with a rather limited test set, this bias actually improves performances quite significantly.

Recognition performances draw a similar picture, with character error rates around 20% and word error rates around 50%. With respect to the influence of changing the labelset, the experiments offer a more diverse interpretation. In the case of George Washington and Bentham, a drop in performance can be observed. In contrast to the AttributeCNN, the sequence-to-sequence model does not seem to benefit from the introduced language bias. The strong sequential modeling capacity of the recurrent architecture is likely to overfit on the rather limited labelset, which yields a drop in performance. This is not the case for the IAM database with its larger and more diverse labelset, for which a performance gain similar to the retrieval case can be observed. The CVL database presents an interesting outlier with a striking jump in performance, which is explainable by the dataset characteristics. On the one hand, using the annotations includes the occurring German labels, which are not part of the Gutenberg corpus. On the other hand, the diversity of text in the CVL database is extremely limited, allowing the model to highly benefit from the introduced bias.

The results show that removing language information generally leads to a drop in performance. Already removing the word distribution by using the vocabulary-based dataset results in significant performance drops on almost all benchmarks for both models. With the randomization approach, performances generally decrease further compared to the vocabulary dataset, but the influence is comparably lower. This indicates that information on word distributions is more important than n-gram information. Interestingly, in some cases slight performance gains can be observed when comparing the randomization to the vocabulary models. This is especially the case for purely visual retrieval in the query-by-example setting. The respective model seems to benefit more from the visual diversity of the dataset than it is hindered by the lack of language information.

In summary, the experiments clearly show that training a model on synthetic data can be considered a form of language modeling. This is in line with the literature on implicit language modeling [48], but in contrast to training a supervised model the following problem arises. For supervised training, language information is modeled by the collection of a representative training dataset, while for synthetic data it is freely chosen independently of the visual generation process. The experiments indicate that focusing the language information towards a specific dataset may benefit performance but carries the risk of introducing harmful biases. In general, synthesizing from natural text, including word distributions according to their real-world occurrences, offers a robust and easily available basis for synthetic data generation.

4.3.2 Synthesis calibration

In the following experiments, we investigate whether it is feasible to determine a beneficial selection of fonts and whether increasing their number benefits performance. Furthermore, the distribution of slant angles is adapted following the same strategy. We follow the adaptation strategy presented in Sect. 3.2.2 and use font and slant classification networks that have been entirely trained on synthetic data. The main assumption is that if, for example, the font classifier predicts a certain synthetic font for a given real word image, this font is likely to be visually similar. If this assumption holds true, the resulting distributions of fonts and slants should lead to a more representative dataset when they are integrated into the synthesis process.

Fig. 5 Examples of style adapted word image synthesis. First column shows original images from each dataset. Second column shows images generated using the font predicted by the classification network. Column three shows the least frequently predicted font for each dataset. The style is likely to be dissimilar to the original image

Fig. 6 Overview of the histograms derived from predicting fonts and slant angles for the given datasets. For visualization purposes, the absolute frequencies were normalized to the interval \([0,\,1]\)

Fig. 7 Comparison of query-by-example retrieval performances in mAP (upper half) and recognition performances in WER (bottom half) on adapted synthetic datasets. Orange indicates performances derived from datasets using the least frequently predicted fonts. Green represents datasets generated with the most frequently predicted fonts. Gray corresponds to using all fonts. The numbers at each bar show the number of fonts used for synthesis

After inferring the distributions of fonts and slants for all four datasets, first conclusions can be drawn from a purely qualitative inspection. Figure 5 shows a selection of samples from the actual datasets with corresponding synthetic word images. In all cases, the synthetic image is generated with the most often predicted slant angle. For each of the real-world samples, a synthetic version is generated using the font predicted by the font classifier. Additionally, synthetic images using the least often predicted font for each dataset are shown. The given examples show that exploiting the style classification networks yields synthetic samples that are visually more similar to the actual training samples. This supports the assumption that font classification can be used to identify the most similar synthetic font for a given word image.

Besides looking at the adaptation process on the level of individual word images, the distributions can also be considered from a dataset perspective. Figure 6 presents the font and slant angle histograms for all four datasets. The goal of the adaptation process is that the given histograms approximate distributions that encapsulate the visual characteristics of each dataset. As shown in the first row of the figure, it can be argued that this is the case, as the font distributions clearly reflect the number of writers in each of the datasets. For datasets such as the IAM and CVL databases, where a high number of different writers contributed to the collection, we observe the frequent prediction of multiple fonts. This stands in clear contrast to the historic George Washington and Bentham datasets with one or few contributing writers. In these cases, the histograms are clearly dominated by individual fonts. Considering the slant angle, the dataset characteristics are reflected by the histograms shown in the second row of the figure. Again, a difference between the modern and historic datasets can be observed.

In general, the interpretation of the histograms of fonts and slant angles corresponds to the qualitative perception of the different datasets. The proposed slant and font classifiers are able to encode the visual properties of the datasets, motivating an adapted synthesis process.

Fig. 8 Visualization of local confidence values derived from combining the character confidences \(c_t\) and the respective attention weights \(\varvec{a}_t\). Red indicates an area of low confidence, blue an area of high confidence. Word images are sorted by the total confidence value \(c_\text {mean}\) with images of lower confidence on the left

Up to this point, the experiments only indicate that using the style classification networks leads to word images with a style similar to the respective dataset. This does not necessarily mean that training a model on an adapted dataset leads to performance gains. In order to answer the question whether fonts that are more frequently predicted by the font classifier are better suited for generation, the following approach is proposed. Synthetic datasets are created that only use the ten or fifty most frequently predicted fonts of the target dataset. Using the same synthesis process, corresponding datasets with the ten and fifty least frequently predicted fonts are generated. Additionally, we compare performances to using all fonts for synthesis. In all cases, we train the retrieval and recognition models for one epoch on the different synthetic datasets and compare performances in terms of query-by-example mAP and word error rates on the target benchmarks.

See Fig. 7 for a comparison of the resulting performances. Considering the models trained on a limited number of fonts, a clear superiority of the adapted datasets is shown. In all cases, the models trained on the most frequently predicted fonts outperform the models trained on the least frequently predicted ones. This observation supports the previously made assumption that the set of handwritten fonts contains writing styles that are better or worse suited for data synthesis, and that the prediction frequency of the font classifier indicates which fonts to use during synthesis. In the experiments, we also observe performance gains when the variance inside the dataset is increased by using more of the available fonts. This characteristic is less notable for a rather homogeneous dataset such as George Washington. Using all 396 fonts considered in this work results in the best performing models. This leads to the conclusion that even though visually dissimilar fonts are included during data synthesis, the additional variance benefits performance.

4.3.3 Self-training

In the following experiments, self-training is performed according to Algorithm 1. We train initial models on the synthetic dataset derived from the Gutenberg corpus. In order to observe how adaptation is influenced by a poorly performing initial model, we also run a series of experiments with initial models trained on the IIIT-HWS90K dataset.

For the retrieval model, we perform self-training for \(K=20\) cycles. Here, a cycle corresponds to the entire process of predicting pseudo-labels and training on them for one epoch. In each cycle, all images from the target dataset are included in the pseudo-labeled training set. Self-training of the recognition model is performed with the same strategy. We use an increased number of cycles with \(K=50\), as preliminary experiments showed that the sequence-to-sequence model needs a longer training period to converge. As presented in Tables 3 and 4, self-training leads to considerable performance gains on GW and IAM compared to the well performing initial model. Despite the poor performance of the initial models trained on IIIT-HWS90K, self-training leads to performance improvements in most cases. Only in the case of the AttributeCNN on IAM, the initial performance seems to be insufficient for a successful adaptation of the model. In general, final performances are better when starting with a better initial model.

In order to enhance the quality of the pseudo-labeled dataset, we aim at removing samples with inaccurate predictions by exploiting a confidence measure. The first question that needs to be answered is whether the proposed confidence estimates quantify the prediction quality in the absence of the respective annotation. The following experiment gives a first indication whether a higher confidence value provided by the proposed methods coincides with higher prediction quality. Consider an initial model that has been solely trained on synthetic data. If the proposed confidence measures quantify prediction quality, better performance should be observed for more confident samples. Therefore, we sort the dataset samples according to the predicted confidence values and observe the word error rates of the pseudo-labels for different parts of the dataset. Note that for the retrieval model we also report word error rates, even though the model is not inherently designed for recognition. Nonetheless, lexicon-based recognition can be performed following the same method as for lexicon-based pseudo-label generation. Figure 9 shows the error rates with respect to the most confident x% of the IAM database. As an upper baseline, the oracle confidence measures are plotted. The main observation is that even though the proposed confidence measures do not have access to the actual annotations, they are able to mimic the behavior of the oracle confidence measure. For both models and all confidence measures, lower error rates are observed for those parts of the dataset with more confident pseudo-labels.

In order to further investigate the behavior of the proposed confidence measures, a closer analysis of the recognition model is pursued. The attention mechanism integrated in the model directly links a character prediction to a region of the input sequence. This allows getting a qualitative impression of which image regions lead to unconfident predictions. The expectation for a well-working confidence measure is that unconfident regions are visually ambiguous and also correspond to prediction errors.

For a given sample, the prediction relies on the feature sequence \((\varvec{h}_0, \varvec{h}_1, \ldots , \varvec{h}_N)\), which is weighted for a given character at time step t by a set of attention weights \(a_{ti}\). Due to the sequentialization procedure, each feature vector corresponds to a series of image columns. Therefore, the attention weights can be interpreted as the numerical relevance of an image column for a given character prediction. In order to link the local relevance with the proposed confidence estimation, we simply multiply the attention weights \(a_{ti}\) with the respective character confidence \(c_t\). Here, the character confidence corresponds to the activation of the predicted character, \(c_t = \max (\varvec{d_t})\). By considering each character of the prediction sequence, the entire confidence mass is distributed over the feature vector sequence. This results in a function C that gives a local confidence value depending on the feature vector index i:

$$\begin{aligned} C(i) = \sum _{t=0}^{T} c_t \cdot a_{ti} \end{aligned}$$
(10)
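Equation 10 can be computed directly from the per-step confidences and attention weights, for example:

```python
import numpy as np


def local_confidence(char_confidences, attention_weights):
    """Distribute the per-character confidences c_t over the feature sequence
    using the attention weights (Eq. 10); attention_weights has shape (T, N)."""
    c = np.asarray(char_confidences)                  # (T,)
    A = np.asarray(attention_weights)                 # (T, N)
    return c @ A                                      # local confidence C(i) per index i
```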

For visualization purposes, the feature vector index is mapped back to the corresponding image columns. Figure 8 shows some examples with the resulting confidence mask overlayed on the original image. High confidence values are depicted in blue, while red indicates uncertainty.

From a qualitative perspective, the character confidence values align well with the visual appearance of the samples. Several characteristics can be observed that show the desired behavior with respect to confidence estimation. For word samples with an overall very low confidence value (e.g. “che”, “lortyl”, “Ily”, “fososf”, “leper”), large parts of the image are linked to low confidence values and prediction errors can be observed. Nonetheless, the prediction of individual characters may still be correct, which is also reflected by small image regions that are considered confident. For other samples (e.g. “lounty”, “belceve”, “miphty”, “obserue”), the confidence value precisely quantifies the wrong prediction of individual characters and localizes the error in the image. In general, prediction errors often correspond to ambiguous image regions for which identifying the actual character without context is mostly infeasible, even from a human point of view. It is important to point out that lower confidence does not inherently result in a prediction error. Consider the samples (e.g. “god”, “that”, “very”, “Vienna”, “knows”) with a high overall word confidence and a correct prediction. Despite the absence of prediction errors, low confidence areas exist and correspond to hard-to-classify image regions that are qualitatively prone to prediction errors. Overall, the analysis shows that for the recognition model the proposed character confidence numerically captures what would be considered a plausible confidence from a human perspective. In addition to an overall evaluation of a sample, the connection to the attention weights allows the localization of potential prediction errors.

Fig. 9
figure 9

Word error rates over different parts of the IAM test set. The dataset is sorted by its confidence estimates

See Table 4 for the resulting performances of the AttributeCNN including confidence estimation. When starting from the well performing initial model (Table 4, upper half), the inclusion of confidence estimation leads to performance gains in only a few cases. This is also the case for the oracle confidence, which improves performance on George Washington only marginally. The observation is entirely different when the initial model with poor performance is considered. Including the confidence-based selection of pseudo-labels drastically improves the results. For both benchmarks, introducing the confidence measure clearly outperforms training on all pseudo-labels. The resulting performances are in line with the conclusions drawn from Fig. 9. Comparing the two measures gives only marginal differences, and performances are close to the results for the oracle confidence.

Table 3 Comparison of recognition performances resulting from self-training with a confidence-based selection under a fixed schedule

Table 3 presents the resulting performances for the recognition models. A difference between the sequence-to-sequence model and the AttributeCNN can be observed with respect to their sensitivity to the initial model performance. While the AttributeCNN is not able to adapt from the model trained on the IIIT-HW90K dataset, the recognizer is able to do so despite its poor initial performance. Focusing training on more confident pseudo-labels outperforms training on all samples even when the better performing initial model is considered (Table 3, upper half). On all benchmarks, a confidence-based selection improves performance in terms of character and word error rates. The impact of the confidence measures and the relative improvement is considerably higher for models trained based on the initial model from the IIIT-HW90K dataset. Taking the mean over the character confidences generally performs better than the minimum-based approach.
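The two aggregation strategies compared here can be sketched as follows; char_activations is assumed to be a (T, num_classes) array of per-step output activations of the decoder.

import numpy as np

def word_confidences(char_activations):
    char_conf = np.asarray(char_activations).max(axis=1)  # c_t = max(d_t) per decoding step
    return char_conf.mean(), char_conf.min()               # mean-based and minimum-based word confidence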

The experiments show that the addition of confidence estimation leads to performance gains and a higher robustness to a poorly performing initial model. Introducing a predefined schedule based on fixed portions of the datasets allows a comparison of different confidence measures. Nonetheless, it has a severe drawback. For the considered benchmarks, it is feasible to derive consistent schedules for both models giving robust results. To what extent this approach generalizes to practical applications is questionable, as intuitively the number of selected samples should not depend on the number of available samples but only on their prediction quality. A possible alternative to choosing fixed numbers of samples is to follow a thresholding approach. We follow the same training procedure as previously, but only train on samples whose confidence lies above the threshold value. For the AttributeCNN, we use the \(c_\text {sig}\) confidence measure with a threshold of \(\tau = 0.95\); for the recognition model, we use \(c_\text {mean}\) with \(\tau = 0.5\). Note that the recognition model uses label smoothing, which results in activations that rarely exceed 0.5. Experiments showed that selecting high threshold values, which are initially surpassed by only a few samples, offers a robust strategy. Over the course of training, the network becomes increasingly confident, which leads to the inclusion of more samples in the pseudo-labeled training set. Therefore, thresholding automatically balances label quality and diversity. See Tables 5, 6 and 7 for quantitative results. The thresholding approach generally performs better than the previously discussed fixed schedule or training on all available samples.
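A minimal sketch of the threshold-based selection is given below; the function name is illustrative, and the thresholds follow the values given above (\(\tau = 0.95\) for \(c_\text {sig}\) and \(\tau = 0.5\) for \(c_\text {mean}\)).

def select_by_threshold(images, pseudo_labels, confidences, tau):
    # Keep only pseudo-labels whose confidence exceeds the threshold; as the model
    # becomes more confident over training, this set grows automatically.
    return [(img, lbl) for img, lbl, c in zip(images, pseudo_labels, confidences) if c > tau]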

Table 4 Comparison of retrieval performances resulting from self-training with a confidence-based selection under a fixed schedule
Table 5 Comparison of recognition performances to results from the literature

4.3.4 Comparison to the literature

In the literature on word retrieval, most works fall into one of two categories. On the one hand, a paradigm shift towards deep neural networks has taken place, and several works reported tremendous performance gains in the presence of a labeled training set [27, 28, 47, 54, 61]. On the other hand, several works avoid the use of any labeled data and, therefore, propose methods that are learning-free and do not rely on a trained model [3, 4, 26, 41, 43, 49, 66]. These methods do not require any labeling effort, but they are clearly inferior when more visually challenging benchmarks are considered. Works that solely use synthetic data for training commonly do not yield high performances [18, 27].

Table 6 presents retrieval performances for George Washington and the IAM database for annotation-free and annotation-based methods. With respect to methods that do not include manual labels during training, the presented self-training approach clearly outperforms all learning-free approaches. The performance of the model trained only on synthetic data is similar to other results in the literature [27] and, in many cases, already superior to the reported performances of learning-free approaches. Additionally, training a model allows query-by-string retrieval, which is out of scope otherwise. Adapting the proposed model with self-training under a confidence-based pseudo-label selection further increases model performance. Other works report higher performances when supervised with the labeled training set, but the proposed self-training method tremendously reduces this performance gap compared to a model trained only on synthetic data.

Table 6 Comparison of retrieval performances on GW and IAM to results from the literature
Table 7 Comparison of retrieval performances on BT to results from the literature

For both benchmarks based on the Bentham document collection, similar observations can be made, see Table 7. Other annotation-free methods are clearly outperformed. Note that even though it is not possible to report query-by-string performances for these benchmarks due to the lack of annotations, the method is still capable of performing query-by-string retrieval. This is not the case for the other works in the literature. A similar comparison to the state of the art can be drawn for the recognition model, see Table 5. Most works on recognition report performances either for models trained in a fully supervised fashion or for models trained on synthetic data. In this regard, the work of Kang et al. [22] provides an interesting comparison in terms of the adaptation efficiency of the proposed self-training approach. The underlying model and, therefore, also the fully supervised and synthetic-only performances are comparable. In the considered work, the authors adapt the recognition model by introducing an additional loss that is supposed to align the feature vectors of synthetic and real-world samples. The reported performance gains are considerably smaller than those achieved with self-training, which clearly indicates the effectiveness of the proposed training scheme in this application area. While the work of Kang et al. [22] focuses the adaptation process solely on the visual domain in the feature space, self-training benefits from the language bias and the information learned on the target domain. Overall, training an initial model on synthetic data followed by self-training gives state-of-the-art results in the absence of a manually labeled training set. In comparison to models trained in a fully supervised manner, a performance gap still exists, which leaves room for improvements in the adaptation of neural models.

5 Conclusions

In this work, we present a self-training method for two classical document analysis models that are popular approaches to HTR and Word Spotting. We show that it is feasible to achieve high performances without any manually annotated data by training initial models on synthetic datasets. Despite poor initial performances, both models are capable of adapting to a target dataset by training on pseudo-labels. Our experiments indicate the significance of the language properties of the synthetic data; therefore, the creation of synthetic data can be considered a form of implicit language modeling. Self-training has been shown to be highly effective in exploiting unlabeled target data. To improve performance and robustness with respect to a poorly performing initial model, the proposed confidence measures can be integrated into the self-training process. The presented results show that all confidence measures quantify prediction quality in the absence of the actual label. While their integration into self-training is beneficial, the activation-based confidence estimates are still suboptimal, which leaves room for further improvement. Overall, the presented self-training approach has been shown to be highly performant and clearly outperforms other adaptation strategies and learning-free approaches. Its application to two different tasks and models in document analysis speaks for its generality and encourages the investigation of self-training for other models and problems.