1 Introduction

While automatic transcription of characters from printed documents has significantly improved in the past decade [1,2,3], transcription of handwritten documents (or handwritten text recognition—HTR) is still a rather challenging exercise and current models are still far from satisfactory performance. This is due, among other reasons, to the large variations caused by the existence of various writing styles as well as the multiple noise factors introduced during the scanning phase. It is even more difficult with historical documents where the writing supports are highly degraded and present a high level of noise. However challenging, automatic transcription of handwritten documents provides a gateway to the content of a large volume of not-yet-transcribed books and offers experts the ability to accelerate the analysis of such content to extract meaningful information and to facilitate the search for information across vast repositories of historical manuscripts.

Extracting a proper representation of the input image that mitigates the irrelevant variations in document images while preserving the important information is a key step in automatic text recognition. Handcrafted features, such as statistical attributes or shape descriptors, have been widely used in handwriting recognition models [4,5,6]. However, specifying a proper set of features compensating for all document variations while being compatible with the subsequent recognizer is an expensive and time-consuming task. An alternative approach is to apply learning methods to extract representations automatically in a supervised or unsupervised mode [7,8,9,10]. Deep neural networks (DNNs) have been extensively applied to text image recognition [11, 12], so that the most common architecture for handwriting recognition utilizes convolutional neural network (CNN) layers to compute a feature map representing the input image to be fed to the subsequent neural recognizer (usually a stack of recurrent neural networks (RNNs)) which recognizes the sequence of constituent letters [13]. The representation network and the recognizer are trained jointly to minimize the cost function. Despite their improved performance compared to conventional methods, current DNN-based models, however, have some weaknesses:

  • They apply deep architectures introduced for general images directly to document images, without considering the characteristics of text images. While some modifications have been made to adapt the recognizer (e.g., developing sequential models) and the loss function (applying connectionist temporal classification, CTC), not much has been done to adapt the representation network, even though the importance of a proper representation in handwriting recognition has been widely demonstrated in previous studies [14, 15].

  • They are limited to learning a “continuous” representation of the input image to be trainable by the backpropagation algorithm. While many conventional handwriting recognition models take advantage of intrinsically discretized visual components of the text image, e.g., primitive shapes [14, 16], existing deep models are not adapted correspondingly.

In this paper, we introduce a novel method for neural handwritten text recognition that utilizes discrete representation. The main motivation is to develop a representation network that matches the configuration of scripts where a limited set of basic components (e.g., signs or drawing strokes) are combined to construct the words and sentences. Depending on the handwriting style, writing tools, and type of image noises, these basic components can deform and appear differently, while still referring to a component in the set. Inspired by this characteristic of text images, the proposed model utilizes a set of primitive representation vectors, i.e., a dictionary, to represent the input text image.

The model includes a CNN-based encoder that maps the input image to a continuous feature space. The computed features are passed to a “quantization” layer where the feature vectors are discretized using a dictionary. The discrete features are then passed to an RNN-based decoder to generate the output text. The quantization layer performs vector quantization on the latent vectors by comparing them with the dictionary atoms and selecting the closest one. During training, the dictionary is trained jointly with the encoder–decoder networks in order to be consistent with the recognition model. To tackle the non-differentiability imposed by the quantization layer, we develop a hybrid training algorithm that iteratively updates the dictionary with the k-means algorithm and the network weights with backpropagation. The experimental results demonstrate the effectiveness of utilizing discrete representation in handwriting recognition, as the model achieves promising results on both modern and historical documents, outperforming the state-of-the-art results.

In summary, the main contributions of this work are the following:

  • A novel deep neural network architecture for handwriting recognition is proposed. This architecture builds upon the conventional CNN+RNN architectures but utilizes discrete representation instead of continuous representation.

  • A novel method for handwriting recognition that combines a dictionary-based model with a deep neural network is proposed. The dictionary and network parameters are trained together by iterating between batch k-means and backpropagation.

  • A comprehensive evaluation of the proposed model is carried out using modern and historical manuscript pages from benchmark databases. The results show that the proposed approach using discrete representation surpasses previous deep handwriting recognition models.

2 Related works

2.1 Handwriting recognition

Traditional handwritten text recognition models extract representative features, like scale-invariant feature transform (SIFT) and histogram of gradients (HOG), from the segmented input image and apply classical machine learning to recognize constitutive characters [4]. To mitigate the segmentation errors, holistic descriptors have been applied to represent the word image as a whole and classify the words as inseparable entities [17]. The problem with this approach is the large number of classes to be recognized, which requires highly discriminative features [18]. Hence, some lexicon reduction approaches were used to reduce the number of word hypotheses [19, 20], which are still prone to propagating errors of the elimination step to the recognition step. Later on, the hidden Markov model (HMM) became popular for HTR as it merges the segmentation and recognition steps by modeling the word image as a sequence generated from the set of characters [21]. The advent of long short-term memory (LSTM) networks prompted their application to end-to-end text-line recognition. The LSTM network and its variants were successfully applied to text recognition, thanks to the introduction of the CTC loss [22]. Minimizing the CTC loss function over an LSTM network built on top of CNN features achieved state-of-the-art results on most benchmarks [1, 23]. Encoder–decoder architectures combined with the attention mechanism have also been applied to HTR, modeling the problem as a sequence-to-sequence conversion [24]. Despite their progress, the performance of RNNs is limited by the available memory size. Sequential modeling with attention gates, as an alternative to recurrent units, has demonstrated significant improvements in modeling longer dependencies. The same approach applied to handwriting recognition [12] has achieved promising results using fully convolutional models. However, this approach is computationally expensive, as it requires multiple attention layers to model the input dependencies, therefore increasing the size of the network with respect to the size of the input image. Due to the high capability of deep learning methods in processing high-dimensional images, some studies have started to apply them to end-to-end paragraph and page recognition as well [12, 25]. In this paper, we evaluate the performance of the model by running experiments on text-line images.

2.2 Neural discrete representation models

The success of deep learning methods is mainly due to their ability to learn proper representations. Autoencoders are the main deep network architecture widely used for computing data representation [26]. By applying a bottleneck in the network architecture, they learn the continuous representation vectors for the input data. Deep embedding models, like triplet network [27] and Siamese network [28], are commonly used in content-based image retrieval (CBIR) to learn a compact and meaningful representation of images for similarity comparisons. These models typically consist of an encoder network that maps an input image to a high-dimensional feature space and a similarity metric, such as a cosine similarity, that is used to compare feature vectors.

Most of the existing neural representation models learn continuous features. However, enforcing a discrete representation, as we aim to do in this paper, creates difficulties in training, as the gradients cannot backpropagate through the sampling layer. To solve this issue, one possibility is to estimate the gradient of the categorical distribution by differentiable sampling from a Gumbel-Softmax distribution [29]. Another approach is to use vector quantization, as in VQ-VAE [30], which enforces discretization through a nearest neighbor lookup procedure. In this paper, we use vector quantization for discretizing the latent variables.

3 Problem statement

The majority of deep learning-based models for handwriting recognition use continuous variables to represent the input image. This is mainly due to the complexity of training neural networks with discrete variables and computational considerations. However, discrete representation has been widely used in traditional HTR models and has shown promising performance in representing the fundamental elements of text images. The goal of this paper is to explore the possibility of using discrete representation for deep neural network (DNN)-based handwritten text recognition and to assess its impact on the performance of standard architectures.

More specifically, our objective is to develop a discretization layer for the CNN+RNN architecture commonly used in handwritten text recognition (HTR) to discretize the features from the CNN output. We also devise a training procedure to address the non-differentiability imposed by the added discretization layer, allowing the entire model to be trained with backpropagation. The new HTR model should follow the standard inference process in deep handwriting recognition, where it takes an image of a handwritten text line as input and outputs a sequence of individual characters. In the following, we present our approach to this research problem in more detail and discuss the results.

4 Proposed approach

This section describes the proposed method for handwriting recognition using discrete representation. The proposed HTR model consists of an encoder–decoder network with an added quantization layer in the bottleneck that is responsible for quantizing the latent vectors (Fig. 1).

In the next subsections, we first illustrate the mechanism of the quantization layer explaining the process of dictionary learning. Then, a detailed architecture of the proposed HTR model is explained.

4.1 Quantization layer

To learn the discrete representation, we are inspired by the work presented in [30], VQ-VAE. Relying on the main idea of VQ-VAE, we compute discrete latent variables by applying a vector quantization technique with a dictionary D. The encoder and decoder networks apply this dictionary to find the representation of a given input and generate the output from it, respectively. The dictionary is trained simultaneously with the encoder–decoder networks. This allows for learning a dictionary compatible with the encoding/decoding functions.

Figure 1 provides a simple visualization of the vector quantization process that takes place in the quantization layer. Given the input x, the encoder computes the continuous feature vector \(z_e\) which is passed to the quantization layer to be quantized to \(z_q\). To this end, a dictionary D of size \(N\times K\) is used, where N is the dimensionality of each vector and K denotes the total number of vectors in the dictionary. \(z_e\) is quantized to be equal to the closest element of the dictionary.

More formally, the model can be considered as a conditional VAE with a discrete latent variable z, where the probability distribution over z, i.e., \(p_{\theta }(z|x)\), is a categorical distribution over the K dictionary atoms. This categorical distribution is restricted to take a nonzero value on only one element, selected based on the encoder output (Fig. 1). More precisely, \(p_{\theta }(z|x)\) is defined as in Eq. 1.

$$\begin{aligned} p_\theta (z=k|x) = {\left\{ \begin{array}{ll} 1 &{} \hbox { if}\ k={\text {argmin}}_{j} \Vert z_{e}(x)-e_{j} \Vert _2,\\ 0 &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(1)

where \(z_e(x)\) denotes the output of the encoder network and \(e_j\) is the jth element of the dictionary. Accordingly, in the forward pass, the model takes the input x and passes it through the encoder to compute \(z_e(x)\). The computed value is used in Eq. 1 to determine the distribution over the discrete latent variable z. Sampling from this distribution is equivalent to performing a nearest neighbor look-up through the dictionary and picking the dictionary atom closest to \(z_e\), i.e., \({\text {argmin}}_j \Vert z_e(x)- e_j\Vert _2\). The drawn sample, \(z_q\), then passes through the decoder network to compute the output.
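The nearest neighbor look-up of Eq. 1 can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function name `quantize` is hypothetical, and the dictionary is stored here with one atom per row (shape K × N) for convenience, whereas the text describes it as N × K.

```python
import numpy as np

def quantize(z_e, dictionary):
    """Map each row of z_e to its nearest dictionary atom (Eq. 1).

    z_e:        (M, N) array of continuous encoder outputs
    dictionary: (K, N) array, one atom e_j per row (layout is an assumption)
    returns:    (indices of shape (M,), quantized vectors z_q of shape (M, N))
    """
    z_e = np.asarray(z_e, dtype=float)
    dictionary = np.asarray(dictionary, dtype=float)
    # Squared Euclidean distance between every vector and every atom: (M, K).
    # The argmin is the same for the squared and the plain L2 norm.
    diff = z_e[:, None, :] - dictionary[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    idx = np.argmin(dist, axis=1)
    return idx, dictionary[idx]

# Toy usage with K = 3 atoms of dimension N = 2
atoms = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
idx, z_q = quantize([[0.1, -0.2], [0.9, 1.2], [3.0, 0.5]], atoms)
```

Each output row of `z_q` is an exact copy of one dictionary atom, which is what makes the representation discrete: the decoder only ever sees K distinct vectors.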

Fig. 1 Overview of the proposed handwriting recognition model: a variational autoencoder with discrete latent variables

4.1.1 Training

The parameters of the model to be trained include the weights of the encoder network, the weights of the decoder network and the dictionary atoms. The model is trained by maximizing the likelihood function. Following the common approach in deep learning-based handwriting recognition, we apply the CTC algorithm to compute the likelihood function. However, training the network by backpropagation is not straightforward due to the non-differentiability imposed by the argmin function in Eq. (1), which prevents the gradient from backpropagating toward the encoder. Moreover, as the discrete latent variable has a categorical distribution, the reparameterization trick is not applicable either. To solve this problem, we use the trick proposed in [30] and approximate the gradient at the output of the encoder by the gradient at the input of the decoder, i.e., bypassing the argmin function (Fig. 1). This is practically feasible as the dimensions of the two vectors are equal. Thus, in the forward pass, the input image passes normally through the model, while in the backward pass the vector quantization layer is skipped. This approach, however, prevents the gradient from reaching the dictionary, hence the dictionary cannot be updated. To address this issue, we modify the learning algorithm (as in Algorithm 1) so that the dictionary is trained separately through the k-means clustering algorithm. The remainder of this section describes this hybrid training method in detail.

For simultaneous training of the dictionary and the network, one way is to alternate between k-means clustering and a full run of backpropagation optimization. This approach is computationally expensive and requires a long time to converge. To reduce the computational costs, we utilize the batch k-means algorithm instead and integrate the clustering iterations with the optimization iterations. In each training iteration, the current batch samples are used to update the dictionary by running a single iteration of k-means clustering and to update the network weights by running one step of backpropagation. Given a batch of samples at iteration i, the samples are passed through the encoder network and their corresponding \(z_e\) vectors are computed. Applying the batch k-means algorithm, the new cluster centers are computed and used to update the dictionary. The new dictionary is then used for updating the network weights through backpropagation (with the argmin bypass technique). The procedure continues for a fixed number of iterations or until convergence. Simultaneous training of the dictionary and the network weights leads the encoder to learn a representation that takes into account the set of elements provided in the dictionary, while the dictionary is also built to be a descriptive set of quantized vectors.
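A single dictionary update of this kind might look like the sketch below. The exact update rule is not spelled out in the text, so this is an assumption: each atom is moved toward the mean of the encoder outputs assigned to it, with a step size `lr_km` that plays the role of the batch k-means learning rate. The function name `batch_kmeans_step` and the row-wise dictionary layout are hypothetical.

```python
import numpy as np

def batch_kmeans_step(z_e, dictionary, lr_km=0.1):
    """One batch k-means iteration on a batch of encoder outputs.

    z_e:        (M, N) encoder outputs for the current batch
    dictionary: (K, N) current atoms, one per row
    lr_km:      step size toward the batch centroid (assumed update rule)
    returns:    (updated dictionary, assignment indices of shape (M,))
    """
    z_e = np.asarray(z_e, dtype=float)
    old = np.asarray(dictionary, dtype=float)
    # Assignment step: nearest atom for every encoder output
    dist = ((z_e[:, None, :] - old[None, :, :]) ** 2).sum(axis=-1)
    assign = dist.argmin(axis=1)
    # Update step: move each used atom toward the mean of its members
    new = old.copy()
    for k in range(old.shape[0]):
        members = z_e[assign == k]
        if len(members) == 0:
            continue  # atom not selected in this batch: leave it unchanged
        new[k] = (1.0 - lr_km) * old[k] + lr_km * members.mean(axis=0)
    return new, assign
```

In a training loop, this step would be followed by one backpropagation step on the network weights using the updated dictionary, with the gradient copied across the quantization layer as described above.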

In accordance with [30], another loss function is added to the CTC loss for training the network. This loss, given in Eq. (2), attempts to keep the encoder output and the dictionary elements close together. sg in Eq. (2) stands for the stop-gradient operator, which prevents the gradient from backpropagating through its argument, and \(\alpha\) is a constant to be set empirically. This loss function, called the “embedding loss” from now on, computes the squared Euclidean distance between the encoder output and its closest dictionary element e, and its gradient is backpropagated toward the encoder and the dictionary with relative weights set by \(\alpha\). The value of e is used as the quantized value of the latent variable, i.e., \(z_q\).

The first term in Eq. (2) is to pull the dictionary atoms toward the encoder output, while the second term aims to keep the output of encoder close to the dictionary atoms avoiding its arbitrary growth. As we apply the k-means algorithm to train the dictionary, this additional loss is not necessary for training. However, according to our experiments, adding this term improves the performance of the model. On this basis, the total loss function for optimization is specified as in Eq. 3.

$$\mathcal{L}_{\text{emb}} = \alpha \cdot \left\| sg[z_{e}(x)] - e \right\|_{2}^{2} + (1 - \alpha ) \cdot \left\| z_{e}(x) - sg[e] \right\|_{2}^{2}$$
(2)

$$\mathcal{L}_{\text{total}} = \frac{1}{L}\sum\limits_{l = 1}^{L} \left[ \mathcal{L}_{\text{CTC}}(y^{l}, \text{pred}^{l}) + \beta \cdot \mathcal{L}_{\text{emb}}^{l} \right],$$
(3)

where

$$\begin{aligned} \beta = G \cdot e{^{- 10 e ^{-4} \cdot T \cdot i}} \end{aligned}$$
(4)

\(\beta\) in Eq. (3) is a coefficient that adjusts the weight of the embedding loss relative to the CTC loss in the total loss. The initial value of \(\beta\) is G, and \(\beta\) gradually increases with the training iteration i. T is a constant that is set to \(-1\) empirically (Sect. 5.5). The intuition is to lessen the impact of the embedding loss in the beginning, hence relying mostly on k-means clustering to find a descriptive dictionary. As training progresses, the encoder network and the dictionary get aligned through the embedding loss.
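To make the interplay of Eqs. (3) and (4) concrete, the following sketch computes the schedule for \(\beta\) and the batch-averaged total loss. It is illustrative only: the function names are hypothetical, and the numeric coefficient in the exponent of Eq. (4) is passed in explicitly as `rate` rather than hard-coded, so that the sketch does not commit to a particular reading of that constant. The defaults G = 15 and T = −1 are the values reported in Sect. 5.1.

```python
import math

def beta_schedule(i, rate, G=15.0, T=-1.0):
    """Weight of the embedding loss at iteration i (Eq. 4).

    beta = G * exp(-rate * T * i); with T < 0 the weight grows with i,
    with T > 0 it shrinks. `rate` is the numeric coefficient of Eq. 4
    and is supplied by the caller (an assumption of this sketch).
    """
    return G * math.exp(-rate * T * i)

def total_loss(ctc_losses, emb_losses, beta):
    """Batch average of CTC loss plus beta-weighted embedding loss (Eq. 3)."""
    if len(ctc_losses) != len(emb_losses):
        raise ValueError("one CTC and one embedding loss per sample expected")
    per_sample = [c + beta * e for c, e in zip(ctc_losses, emb_losses)]
    return sum(per_sample) / len(per_sample)

# Example with a hypothetical rate of 1e-3
b0 = beta_schedule(0, rate=1e-3)       # equals G at the first iteration
b1000 = beta_schedule(1000, rate=1e-3) # larger than G because T = -1
```

At iteration 0 the schedule returns G itself, so G acts as the starting weight of the embedding loss and T only controls how fast (and in which direction) that weight changes.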

The details of the training procedure are provided in Algorithm 1. The encoder and decoder weights are initialized by pretraining a baseline model made of the same encoder-decoder architecture, without a quantization layer. The experimental results, provided in Sect. 5.4, demonstrate that this initialization improves the performance.

Algorithm 1 (training procedure)

4.2 Handwriting recognition model

Figure 2 illustrates the detailed architecture of the proposed HTR model. The encoder network consists of a stack of CNN layers with LeakyReLU activation functions and several max-pooling layers. All CNN layers use a stride of 1. The feature map computed at the output of the encoder network is passed to the quantization layer to be discretized. The quantization is performed on patches extracted from the feature map by computing their distances to the dictionary elements. For each patch, the closest element of the dictionary is selected as its quantized value. The quantized patches are flattened over the channel dimension to become a sequence of 2D vectors to be fed to the LSTM network. An FC layer is then applied to the flattened vectors to reduce their dimension. Through quantization, the representative vectors get redundant values; this dimensionality reduction is useful to compress the information stored in the quantized vectors. The sequence of feature vectors is then passed to the decoder network, where two layers of bidirectional LSTM networks are applied. The output of the BLSTMs is passed through some FC layers to project its height dimension to the size of the target alphabet (plus one for the blank character). Finally, a softmax layer is applied to the output.
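The patch-wise quantization and flattening can be illustrated with a small NumPy sketch. The tensor layout is an assumption made for illustration: the feature map is taken to be of shape (C, H, W), each patch is one column of height H within one channel, and the dictionary holds K atoms of dimension H, one per row. The function name `quantize_feature_map` is hypothetical, and the FC and LSTM layers are omitted.

```python
import numpy as np

def quantize_feature_map(fmap, dictionary):
    """Quantize column patches of a feature map and flatten them to a sequence.

    fmap:       (C, H, W) encoder output (layout is an assumption)
    dictionary: (K, H) atoms, one per row
    returns:    sequence of shape (W, C * H) and atom indices of shape (C, W)
    """
    fmap = np.asarray(fmap, dtype=float)
    dictionary = np.asarray(dictionary, dtype=float)
    C, H, W = fmap.shape
    # One patch per (channel, column): a vector of H values -> (C * W, H)
    patches = fmap.transpose(0, 2, 1).reshape(C * W, H)
    # Nearest dictionary atom for every patch
    dist = ((patches[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    quantized = dictionary[idx].reshape(C, W, H)
    # Concatenate the C quantized patches of each column: one vector per time step
    sequence = quantized.transpose(1, 0, 2).reshape(W, C * H)
    return sequence, idx.reshape(C, W)
```

The resulting `sequence` has one row per horizontal position of the feature map, which is the form a recurrent decoder expects; in the model described above, a fully connected layer would then reduce the dimension of each row.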

Fig. 2 Architecture of the proposed handwriting recognition model

5 Experiments and results

This section describes the sets of experiments undertaken to analyze the performance of the proposed model for handwritten text recognition. In the first set of experiments, the model is tested on five different benchmark datasets of handwritten text lines, comparing its recognition results with those of state-of-the-art models. Another set of experiments is performed to analyze the effect of the pretraining step and the hyperparameter values on the model performance. Some other experiments are also performed for ablation analysis. Finally, we provide visualizations of the quantized features and present sample outputs of the model.

5.1 Implementation details

The model’s hyperparameters are set empirically and can be fine-tuned for each dataset separately. However, in our experiments, unless mentioned otherwise, the following default values are used: the number of dictionary elements K is set to 128, and the size of the patches extracted from the feature map is set to \(H \times 1 \times 1\), where H denotes the height of the feature map. The constants G, \(\alpha\), and T used for loss computation in Eqs. 2–4 are set to 15, 1, and \(-1\), respectively. The batch k-means learning rate, \(lr_{km}\), is set to 0.1. The network is trained using the Adam optimizer [31] with a learning rate of 1e−3 and a batch size of 64 for 20K iterations.

To improve generalization, similar to [12], augmentation techniques are applied to the input image during training, including random projection and elastic distortion. The models are implemented in PyTorch and trained on 4 NVIDIA Tesla V100 GPUs.

The performance of the model is evaluated by two metrics: character error rate (CER) and word error rate (WER). CER (WER) is the Levenshtein distance computed at the character (word) level, normalized by the length of the ground truth.
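As a concrete reading of these definitions, the sketch below computes CER and WER for a single line with a standard dynamic-programming Levenshtein distance. The function names are illustrative, words are assumed to be separated by whitespace, and how scores are aggregated over a whole test set (per-line average versus total edits over total length) is not specified here and is left to the caller.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character-level distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word-level distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

# One substituted character out of four, one wrong word out of three
example_cer = cer("abcd", "abxd")
example_wer = wer("the cat sat", "the cat sit")
```

Because the distance is normalized by the ground-truth length only, both rates can exceed 1 when the hypothesis is much longer than the reference.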

5.2 Datasets

The experiments are performed on 3 datasets: IAM, ICFHR18 and IAM-HistDB, where the latter itself is divided into three sets, taken from three different medieval manuscripts, called Saint Gall, Parzival and Washington.

The IAM database [32] contains 1539 pages of scanned English handwritten text, written by 657 different writers. The text lines used in our experiments are partitioned into 6161 training, 966 validation and 1861 test lines. There are 79 different character classes in the dataset.

The ICFHR 2018 is associated with a competition on automated text recognition [13]. The dataset consists of text images from 22 heterogeneous documents, divided into a general set (17 documents) and a document-specific set (5 documents). At the time of writing this paper, the online evaluation engine is not accessible through the competition website. Therefore, we performed the experiments using the general set to train our model and the document-specific set as the test set. The samples are written in modern and medieval Italian and German. The dataset provides the extracted text-line images with their annotations using a 98-character alphabet. The training and the test set consist of 11925 and 2878 line images, respectively.

The Washington database [33] contains 565 text line images extracted from 20 pages of George Washington letters. The dataset provides the binarized and normalized text-line images, together with their annotations using an 83-character alphabet.

The Saint Gall dataset [34] has been created from a ninth century hagiography manuscript in the Latin language, from which 60 pages are extracted and segmented into 1,410 text line images. The processed and binarized text-line images are provided in the dataset together with their annotations by 49 characters.

The Parzival database [33] contains 4477 text line images of gothic script extracted from 47 pages of a thirteenth-century manuscript compiled by three copyists in the medieval German language. Similar to Washington and Saint Gall, the text line images are binarized and divided into training, validation, and test sets, containing 2237, 912, and 1328 lines, respectively. Each text line is annotated using an alphabet of 96 different characters.

5.3 Performance evaluation

To evaluate the performance of the proposed discrete representation for HTR, the recognition is performed using two models: the “VQ-HTR” model, which has the architecture described in Sect. 4.2, and the “Baseline” model, which has the same architecture as VQ-HTR but without the quantization layer. The main difference between the two is therefore that the baseline uses a continuous latent representation while VQ-HTR uses a discrete one. The experiments are performed by applying both models to all the datasets and comparing their results with each other, as well as with other state-of-the-art models. For a fair comparison, we only consider the methods that do not utilize any language models or post-processing as part of their transcription pipeline.

Table 1 reports the results of the experiment on IAM dataset. As evident in the table, applying vector quantization significantly improves the performance of the baseline architecture, reducing its character error rate by 22%. It also outperforms the prevalent architectures in the state-of-the-art methods including CNN+BLSTM, fully CNN and sequence to sequence models. Note that the performance improvement of the VQ-HTR model is achieved solely by embedding the quantization layer into the baseline model which has the canonical CNN+LSTM architecture. While, for instance, [35] and [36] use attention mechanism in their proposed architecture, or [37] and [38] use large networks with many layers, VQ-HTR is constructed of a 6-layer-CNN encoder, a quantization layer with a dictionary of size \(11 \times 128\), and a 2-layer-BLSTM decoder (see Sect. 4.2). We opted for a simple model, in order to focus more on the potential of discrete representation in handwritten text recognition. The table purposely does not include recently developed architectures based on deformable convolutions [39, 40] as they use completely different architectures, not comparable with the baseline architecture.

Table 1 Recognition results on IAM dataset

The baseline and VQ-HTR models are also tested using the ICFHR18 dataset and their recognition results are reported in Table 2, along with the results of the participants of the ICFHR2018 competition and of [38] which has an architecture similar to our baseline. Like for the IAM dataset, discrete representation significantly improves the performance of the baseline model. VQ-HTR outperforms the OSU system, the winner of the competition, and the RPPDI system which gained the best result for the case where images of the same document are not provided in the training. Moreover, our model achieves a lower character error rate in comparison with [38] which has a more complicated decoder compared to our model, 5 vs 2 layers BLSTM.

Table 2 Recognition results on ICFHR18 dataset. The first 5 methods are the participants of the ICFHR2018 competition [13]

Table 3 shows the results of the proposed model evaluated on three datasets of IAM-HistDB. Similar to the previous results on other datasets, discrete representation improves the performance of the baseline method in all three. Compared to other methods in the literature, our model has the best performance for Washington and Saint Gall. On Parzival dataset, VQ-HTR has a higher character error rate compared to [38]. In the ground truth of Parzival dataset, the special characters are encoded with a long sequence of digits (see Fig. 5). We use the exact same labeling, without any modifications, to train and test our model. This can explain the relatively higher performance of [38], as it uses a larger decoder, which is basically more capable of capturing these dependencies.

Table 3 Recognition results on IAM-HistDB dataset

5.4 Effect of pretraining

The proper initialization of the encoder and decoder weights plays an important role in the performance of our model. This is mainly due to the mutual interaction between the dictionary and the encoder during training: if the encoder outputs precise representation vectors, the dictionary computed by clustering these vectors will be precise as well. As mentioned before, we initialize the encoder and decoder networks with the trained baseline model.

To evaluate the effect of pretraining, we perform an experiment where the pretraining step is removed completely and all network parameters are initialized randomly. The model is trained starting from these randomly initialized weights. The performance of the obtained model is reported in Table 4, under “Randomly Initialized” column, along with the results of the baseline model and the pretrained VQ-HTR model. The results are provided for all datasets, computed on the corresponding test set.

According to the results reported, removing the pretraining step significantly reduces the performance of the model in all datasets, which does not even reach the performance of the baseline model. This implies that the added quantization layer makes the network more vulnerable to the local minima, making it critical to start from proper initial weights. Interestingly, the largest performance drop happened for IAM dataset where the CER increases more than eighteen times. This can be related to the relatively large number of writers participating in IAM dataset collection, as the style variations can make the training optimization more prone to local minima. By pretraining, the encoder computes meaningful representations, up to some tolerable precision, making it easier to optimize the dictionary while fine tuning the encoder weights.

Overall, the results of these experiments demonstrate the fact that, in the proposed model, the feature learning and dictionary learning are highly intertwined, and pretraining can be effectively used to tune the training procedure.

Table 4 Comparing performances obtained with different pretraining strategies

5.5 Effect of costs weights

\(\alpha\) and G are two of the main hyperparameters of the model and play an important role in training it. Here, we provide a detailed analysis of their values in order to gain a better understanding of the dynamics of the training. Recall that \(\alpha\) is a parameter that determines the weights of the two terms in the embedding loss (Eq. 2), while G is a coefficient controlling the weight of the embedding loss versus the CTC loss (Eq. 3). To analyze how the value of \(\alpha\) affects the model performance, we perform an experiment in which the model is trained multiple times while changing the value of \(\alpha\) from 0 to 1. More specifically, we set \(\alpha\) to 0, 0.25, 0.5, 0.75 and 1 and train the model for each value separately. The experiment is performed on the IAM and ICFHR18 datasets and the corresponding results are shown in Fig. 3a and b, where the character error rates obtained by each model are plotted versus the corresponding \(\alpha\) value.

The zero value of \(\alpha\) corresponds to the case where the first term of Eq. 2 has no effect on the embedding loss, and increasing its value toward one increases the contribution of this term against the second term. Recall that the first term of Eq. 2 drives the output of the encoder toward the dictionary atoms, while its second term moves the dictionary atoms toward the encoder output. Hence, \(\alpha =0\) is equivalent to the case where the encoder network is kept fixed during training and only the dictionary is updated in each iteration (together with the decoder network). On the other hand, \(\alpha = 1\) refers to the case where the encoder is updated in each backpropagation step whereas the dictionary is not. Note that this does not imply that the dictionary is not updated at all, since it is still updated through the k-means algorithm. According to Fig. 3, for both datasets, \(\alpha = 0\) performs worse than the other cases. This is not unexpected as, in this case, the gradient does not reach the encoder network, so it is not trained. The other end of the spectrum, i.e., \(\alpha = 1\), has a low error rate, indicating that the gradient-based updating of the dictionary does not have much effect. \(\alpha = 0.25\) has the best performance on both datasets. However, we set \(\alpha =1\) for a better understanding of the model dynamics.

In a second experiment, we evaluate the effect of the hyperparameter G on model performance. In the proposed model, G is a constant determining the weight of the embedding loss relative to the CTC loss. As in the previous experiment, the model is trained multiple times using different values of G while the rest of the hyperparameters are set to their default values. The experiment is performed on the IAM and ICFHR18 datasets, and the results are reported in Fig. 3c and d. According to these figures, as G increases from zero, the error rate first decreases and then either remains constant or increases for higher values of G. We set G to 15 in the final models for all datasets.

As mentioned in Sect. 4.1, we also employ an exponential decay technique to gradually increase the value of \(\beta\) during training. Figure 3e and f demonstrate how the decay parameter, \(\tau\) in Eq. 4, affects model performance on IAM and ICFHR18. We also test positive values of \(\tau\), i.e., increasing \(\beta\) by iteration. According to the presented curves, the best performance is achieved for \(\tau = -1\), which we take as the default value for training the models in this paper.

Fig. 3 Comparison of different values of \(\alpha\), G and \(\tau\)

5.6 Ablation study

Compared to the baseline model, VQ-HTR employs a dictionary to quantize the encoder output and a fully connected network to reduce the dimension of the quantized feature map before it is fed to the LSTM (see Sect. 4.2). To examine the effect of these two added parts separately, we run two experiments in which we remove one part at a time, train the reduced model, and compute its recognition results on the validation set. In each experiment, we also make the changes needed for the reduced model to work properly. The results of these experiments are reported in Table 5, marked as “Only FC” and “Only Quantization,” respectively. For comparison, the result of our full model, which employs both parts together, is also provided.

In another experiment, we investigate the proposed training strategy. In contrast to the quantization algorithm introduced in [30], where the dictionary is trained only through backpropagation, here we utilize the k-means algorithm as well. To clarify the effect of k-means clustering, Table 5 also provides the result for the case where the k-means-based update of the dictionary is eliminated and the model is trained solely via backpropagation. According to the table, adding the k-means update improves the performance of the model considerably.
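For readers unfamiliar with clustering-based dictionary updates, the following minimal sketch shows one standard k-means (Lloyd) step applied to a dictionary: each encoder vector is assigned to its nearest atom, and each atom is then moved to the mean of the vectors assigned to it. This is a generic illustration; the function name, the use of squared Euclidean distance, and the handling of unused atoms are assumptions, and the update schedule used in the paper may differ.

```python
import numpy as np

def kmeans_dictionary_step(features, atoms):
    """One generic k-means (Lloyd) update of the dictionary atoms.

    features: (N, D) array of encoder output vectors.
    atoms:    (K, D) array holding the current dictionary.
    Returns the updated atoms and the nearest-atom index of each feature.
    """
    features = np.asarray(features, dtype=float)
    new_atoms = np.array(atoms, dtype=float)  # float copy of the dictionary
    # Squared Euclidean distances between all features and atoms: (N, K).
    dists = ((features[:, None, :] - new_atoms[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    for k in range(new_atoms.shape[0]):
        members = features[assign == k]
        # Assumption: atoms with no assigned vectors are left unchanged.
        if len(members) > 0:
            new_atoms[k] = members.mean(axis=0)
    return new_atoms, assign

# Toy example: three 2-D feature vectors and a two-atom dictionary.
feats = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
dictionary = np.array([[0.0, 0.0], [9.0, 9.0]])
updated, idx = kmeans_dictionary_step(feats, dictionary)
```

Unlike a gradient step, this update does not depend on a learning rate: each atom jumps directly to the centroid of its assigned vectors.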

Table 5 Ablation experiments on IAM and ICFHR18 datasets

5.7 Visualization and sample results

To gain a better understanding of the procedure taking place in the quantization layer, a visualization of feature vector discretization is provided in Fig. 4 for two samples taken from the IAM and ICFHR18 datasets. For each sample, the input text-line image is shown along with two other images representing the feature vectors in one channel of the feature map before and after discretization, i.e., at the input and output of the quantization layer. For better display, cropped pieces of the images are shown instead of the full images. Each column in the feature images corresponds to one vector in the feature map, which is discretized in the quantization layer. Comparing the feature images before and after discretization shows how the quantization layer changes the feature vectors. As expected, the discretized feature map contains a number of patterns repeated along the sequence, which reduces the large pixel variations present in the continuous feature map. To check the effect of this discretization on the recognition result, the final output of the model computed using each feature map is also provided in the image. Each column of the feature maps is assigned a character, written on top of it, which shows the model output at the corresponding position. In the case of the continuous feature map, the recognition result is computed by feeding the feature map directly to the decoder, i.e., bypassing the quantization layer. In the recognition results of the discrete and continuous feature maps, some correct and false predictions are marked with green and red boxes, respectively. “ _ ” denotes the blank. The final recognition results after CTC decoding, i.e., the “predictions”, are also provided. In addition, Fig. 5 provides more samples taken from all datasets with their corresponding recognition results computed by both the VQ-HTR and baseline models. The words colored in red are the ones with at least one character transcribed wrongly.
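The relation between the per-column characters and the final prediction can be illustrated with the standard greedy CTC collapsing rule: consecutive repeated labels are merged and blanks are then removed. The sketch below is a generic illustration of that rule, not the decoding code used in our experiments; the function name and the toy label sequence are hypothetical.

```python
def ctc_collapse(frame_labels, blank="_"):
    """Collapse per-frame labels into a transcription.

    Consecutive repeats are merged first, then blank symbols are dropped,
    so a blank between two identical characters keeps both of them.
    """
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it starts a new run and is not the blank.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Toy per-frame output, one symbol per feature-map column.
prediction = ctc_collapse("__hh_e_ll_ll_oo")
```

A single changed column therefore alters the final prediction only if it starts a new run of a non-blank character or removes an existing one.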

Fig. 4 Visualization of the feature map for two sample text-line images taken from (a) the IAM dataset and (b) the ICFHR18 dataset. For each sample, three images are provided: the original input image (top), the continuous feature map indicating the encoder output (middle), and the discrete feature map indicating the quantization layer output (bottom). For each feature map image, the text recognized at the decoder output is shown above each vector

Fig. 5 Sample images from different datasets with their corresponding recognition results

6 Discussion

The experimental results on various datasets demonstrate that discretizing the latent variables can improve the accuracy of conventional CNN+RNN architectures when applied to handwritten text recognition. The improvement is large enough that, on some datasets, the model outperforms larger and more advanced models. It is noteworthy that this improvement is achieved by adding extra constraints to the latent variables, forcing them to take on a limited number of dictionary atoms. This suggests that the representability of handwritten symbols by a limited set of vectors can be leveraged to regularize network training.

However, the proposed approach also has some limitations compared to conventional methods. Firstly, it introduces additional hyperparameters that need to be tuned for each dataset. These hyperparameters include the cost coefficients and the number of dictionary atoms, which need to be determined through experimentation. The number of vectors constituting the dictionary is an influential hyperparameter that controls its capacity to represent the input images. It should be large enough to give the model the flexibility to represent the main variations, but not so large that it increases the number of trainable parameters unnecessarily. We observed that increasing the number of dictionary atoms improves the performance up to some point, after which performance remains constant and then decreases with further increases. This can be attributed to the fact that a dictionary that is too small does not have enough elements to construct an adequate representation of the input images. Conversely, overly large dictionaries provide the model with excessive flexibility, making it more prone to overfitting. The optimal number of dictionary atoms lies between these two extremes. In our experiments, we tuned this value on one dataset and used the same value for the other datasets. Tuning this hyperparameter for each dataset could further improve the reported results.

Secondly, our model requires a pretraining stage before the main training. This warm-up step has a significant impact on the model’s performance and cannot be skipped. An important aspect to consider here is that the main training phase begins with a set of parameters that have already been trained for a similar architecture (excluding the quantization layer) on the same task and the same data. The only difference is the constraint imposed by the quantization layer, which forces the latent variables to take on one of the dictionary atoms. This restriction fine-tunes the encoder–decoder parameters, resulting in improved accuracy.

7 Conclusion

Finding an appropriate representation is a crucial step in handwritten text recognition. In this paper, we used a discrete image representation in place of the prevalent continuous representation in DNN-based HTR models. We proposed a novel encoder–decoder-based architecture in which a quantization layer is embedded to discretize the latent variables. The results of multiple experiments indicate that discrete representations are effective in improving modern and historical handwritten text recognition models.

The discretization layer is applied to the latent vector of the encoder–decoder architecture. Similarly, it can be utilized in other layers of the network to analyze the effect of quantization at various representation levels. In the current model, the discretization is applied over rectangular patches extracted from the encoder output, whereas the primitive components of a text image can have flexible shapes that extend differently over the image.