1 Introduction

Convolutional neural networks (CNNs) [1] and long short-term memory networks (LSTMs) [2] and their variants [3, 4] have recently achieved impressive results [5,6,7]. This exceptional performance comes, however, at the cost of having an ensemble of, e.g., 100–2000 recognizers [8]. The high cost of training and operation raises the question of whether less costly methods can be applied to boost the performance of handwriting recognizers.

Fig. 1

A historical spelling of a word, Afdeeling, in the historical KdK dataset. The contemporary spelling of this word would be Afdeling

A possible direction would be the use of linguistic statistics [9]. A recent method for using language information is the dual-state word-beam search [10] for decoding the connectionist temporal classification (CTC [11]) layer of neural networks, which has been shown to be effective.

Although the presence of dictionaries and corpora is beneficial, historical documents present a challenge. For instance, the historical spelling of a word often differs from the contemporary spelling, there is often an absence of strict orthography, and there may be frequent misspellings [12]. Figure 1 shows a word image from one of the datasets used in this paper. This historical word has an extra character compared to the current spelling. Moreover, for rare languages, e.g., Aymara [13], a complete lexicon does not exist yet, and corpora are of very limited size. Handwritten-text recognition (HTR) is precisely what is required to obtain such digital linguistic resources for these languages.

Another possible direction to improve performance concerns heavy optimization of the network architecture and the training (hyper)parameters. State-of-the-art approaches can be sensitive to the choice of hyper-parameter values. As an example, it has been reported that increasing the depth of a neural network consisting of convolutional and LSTM layers from 8 hidden layers to 10 is advantageous, while further enlarging it to 12 hidden layers yielded unsatisfactory results [14]. From the perspective of e-Science services for handwriting recognition dealing with hundreds of different books, it is not feasible to tailor the recognizer models to each book on the basis of prior knowledge, using human handcrafting of neural networks. An e-Science server applies computationally intensive modern scientific procedures for data gathering, preparation, experimentation, result distribution, and long-term maintenance; it can include data modeling, digitized datasets, and analysis, e.g., the Monk system [15,16,17,18]. For an e-Science server, an ensemble consisting of a limited number of automatically generated neural-network architectures would be practical. The Monk e-Science server is a live web-based search engine for word and character recognition, retrieval, and annotation. It contains diverse digitized historical and contemporary handwritten manuscripts in many languages: Chinese, Thai, Arabic, Dutch, English, and Persian. Complicated machine-printed documents, such as German Fraktur and Egyptian hieroglyphs, including historical language, are also available in the Monk system. For such a system with almost 600 manuscripts, it is not feasible to use human effort to fine-tune a model for each of the manuscripts to reach higher accuracy, as in [8]. An essential consideration is that it should be possible to add our suggested algorithm to the Monk system with a minimum of required operational human effort.

In this paper, we explore the possibilities of exploiting the success of current CNN/LSTM approaches, using several methods at the level of linguistics and labeling systematics, as well as an ensemble method. Ideally, the approach should be robust and require a minimum of human intervention, a limited set of hyper-parameter settings (architectures), and minimal linguistic resources. For evaluation, we use a standard benchmark public dataset, RIMES [19], a historical handwritten dataset, KdK [15, 20], and the standard public benchmark George Washington dataset (GW [21]). The three datasets differ in historical period and language. The purpose of this paper is not to handcraft a model that achieves maximum accuracy on a particular dataset, but to design a robust, high-performing recognizer that can be deployed in an e-Science server such that training occurs largely autonomously and no hyper-parameter tuning is necessary. This is important because the number of collections and the variation in styles preclude individual attention by human operators in the back office. In other words, the goal of this paper is to use neural networks for real-world applications involving large collections in a massive high-performance computing context.

The rest of this paper is structured as follows. In Sect. 2, we briefly survey the related works in terms of recent state-of-the-art methods and word search approaches in character-hypothesis grids. In Sect. 3, the requirements of the proposed method are explained. In Sect. 4, we present our system. The experimental evaluation and discussion are given in Sects. 5 and 6. Finally, conclusions are drawn in Sect. 7.

2 Related work

In this section, we first briefly survey the state of the art on the handwriting-recognition task. Afterward, we survey part of the long history of word search in character-hypothesis grids and linguistic post-processing.

2.1 The state of the art on the handwriting-recognition task

Offline handwriting recognition classifiers typically use the direct image values or features extracted from an input image to predict posterior probabilities [22, 23]. These classifiers, e.g., neural networks (NNs) and hidden Markov models (HMMs), have their own merits and demerits. HMMs are relatively simple but rely on strong assumptions; one of their main drawbacks is a weakness in modeling long-term dependencies in the input data. Two well-known NN types are recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [1]. CNNs are able to learn important features without human interference. Additionally, the convolutional approach makes CNNs translation invariant, and the pooling layers make them relatively insensitive to scale variation. A distinct disadvantage is the reliance on a fixed-size input image. This is undesirable for text processing, with its variable-length sequential patterns. In contrast, RNNs, and specifically LSTMs, have shown remarkable success in various sequence-learning tasks [5,6,7, 24, 25].

Combinations of classifiers, such as pipeline methods and heterogeneous/homogeneous ensembles, are used to reach better performance. One common pipeline concerns ensembles of different RNN-based models using different feature-extraction [7, 26] and decoding methods [8, 14, 27,28,29].

CNNs are sometimes used as a feature-extraction method for classifiers, in particular LSTMs [10, 26, 28, 30]. In [30], a framework consisting of a deep CNN, LSTM layers as encoder/decoder, and an attention mechanism for isolated handwritten-word recognition is given. Results are reported with and without a dictionary. For pre-processing, methods for baseline correction, normalization, and deslanting are applied. After pre-processing, an input image is converted to a sequence of image patches by using a horizontal sliding window. A deep CNN is used for feature extraction. Afterward, an LSTM is applied to extract the horizontal relationships existing among the sequence of overlapping horizontal patches of input images. A decoder component is used, a combination of an LSTM and an attention mechanism. To find the best performance, experiments are done to determine the optimal LSTM cell size and patch size. Although the overall architecture is interesting, this method [30] does not have a very high performance. In [28], a spatial transformer network, residual convolutional blocks (ResNet-18), stacked BiLSTMs, and a CTC layer are used. Deslanting and slope normalization are performed on the images, using the approach presented in [31]. A CNN-RNN is pre-trained on the IIIT-HWS dataset [32]. During training and testing on benchmark datasets, three types of augmentation are used: affine transformation, elastic distortion, and multi-scale transformation. Each test image is augmented 25 times. This type of augmentation in the operational stage has been reported earlier in other applications [33, 34] involving animal recognition.

In [35], a 12-layer convolutional neural network (CNN) is used to process fixed-size word images and recognize a Pyramidal Histogram of Characters (PHOC) representation [36], using multiple parallel fully connected layers. Afterward, canonical correlation analysis (CCA) [37] is applied as a final stage of the word-recognition task, using a predefined lexicon.

In [38], a whole-word CNN is applied to recognize known words, defined as the 500 most frequent words in the training set of the RIMES dataset, when the prediction has a minimum confidence level of 70%. Otherwise, a block-length CNN predicts the number of symbols in the given image block. Then, a fully convolutional neural network predicts the characters. Finally, the result is enhanced by a vocabulary-matching method. This varied-CNN method has a problem with separating common and non-common words. The separation of the lexicon into a set of common and a set of uncommon words may be artificial, in view of the usual continuously decaying Zipf distribution [39].

There are two key solutions for converting the prediction output of a handwriting recognizer into a character sequence. The first approach uses an HMM [7, 40]; this is the most traditional way to detect a word. The second uses CTC [11]. The approach of an RNN followed by a CTC layer has been widely used [8, 14, 27,28,29].

The successful methods are unfortunately quite complicated. Most of them use a combination of CNNs and LSTMs. Therefore, it is important to consider more integrated, simplified approaches, such as a convolutional LSTM [26]. It will be treated in Sect. 3.1.

2.1.1 Ensemble system

A simple but effective method for improving the performance of an individual classifier is the ensemble method [27, 41,42,43,44,45,46,47,48,49]. In [42], it is shown that having diverse classifiers is a key point for classifier fusion. Using ensembles for handwriting recognition with hidden Markov models as basic word classifiers, Günter and Bunke [43] compare different ensemble-creation methods (Bagging, AdaBoost, half & half bagging, and random subspace) as well as different voting combination methods for the handwriting-recognition task. It is shown that each of these four methods increases performance.

The impact of the dictionary size, the training-set size, and the number of recognizers in ensemble systems is studied for off-line cursive handwritten-word recognition in [44]. The ensemble methods are Bagging, AdaBoost, and random subspace, while the recognizers used are HMMs with different configurations. It is verified that increasing the size of the training set and the number of recognizers raises the performance of the system, while a larger dictionary pulls the performance down.

Recently, in [45], ensemble classifiers for Persian handwriting recognition were used. This study used AdaBoost and Bagging to combine weak classifiers created from hand-crafted families of simple features.

In the deep learning domain, Yang et al. [46] obtained very high accuracy for Chinese handwritten-character recognition using deep convolutional neural networks and a hybrid serial-parallel ensemble strategy, which tries to find an 'expert' network for each example that can classify it with high accuracy or, if such a network cannot be found, falls back to a majority vote over all networks.

An ensemble of NN and HMM methods is used in [27]. This ensemble uses eight recognizers for handwriting recognition, including four variants of a multidimensional long short-term memory neural network (MDLSTM [4]), a grapheme-based MLP-HMM, and two variants of a context-dependent sliding window based on Gaussian mixture models (GMM-HMM). The ensemble combination is a simple sum rule. This example illustrates that some studies involve highly complicated and heterogeneous algorithm architectures requiring a lot of traditional engineering effort.

In an ensemble system, majority voting or, alternatively, plurality voting can be used if the output of each individual recognizer is only the best hypothesis label. If the recognizers of the ensemble produce a ranked hypothesis list, the Borda count can be used [47, 48] to determine the result. In this case, it is required that the ranked list shows a sufficient diversity of intuitive candidates, i.e., candidates with a low edit distance [50] from the target. Two ensemble systems for handwriting recognition are presented in [49]: one using word-list merging and one using linear combination.

In [8], two architectures are used to generate more than a thousand networks to construct an ensemble. Each network is either a two-layer BiLSTM or a three-layer MDLSTM. The BiLSTMs are fed by HOG features [51], and the input of the MDLSTMs is raw images. The best-path algorithm [52] is applied for CTC decoding. This approach uses a lexicon-verification method. After training 2100 networks and evaluating them on the validation set of the RIMES dataset, the lowest-performing networks are removed, which results in 118 networks. It is reported that the pruned ensemble of 118 networks has a 0.16 pp drop in performance compared to the ensemble of 2100 networks on the RIMES dataset. On another dataset, IAM [53], the size of the ensemble is different (\(\hbox {n}_{\mathrm{rec}}=1039\)). Because of the simplicity of the system and the high number of recognizers, the complexity is medium to high.

The good results presented in the literature are often based on fairly complex systems with many hyperparameters. In an e-Science service such as Monk, which currently hosts almost 600 different manuscripts, human attendance and a detailed, handcrafted selection of hyperparameters for each of those documents are clearly impossible.

2.2 Word search and linguistic post-processing

Character-oriented approaches create a data structure representing the character hypotheses, their position in the text, and the confidence value. For example, an LSTM produces a final map with character hypothesis activations, ordered from left-to-right or right-to-left with some stride (step size). Other approaches generate a grid or graph of character hypotheses. The final processing step involves finding the most likely character path, given a dictionary and potential other linguistic resources (statistics). For the LSTM, a well-known first step toward this is connectionist temporal classification (CTC) [11].

Given a dictionary containing the possible input words, a simple method can be used for error detection and correction of a word recognizer. If the word hypothesis exists in the dictionary, the result is accepted as the label of the input image. Otherwise, if a similar word exists in the dictionary, it can be accepted as a final label candidate, using the Levenshtein distance and its variants [50, 54,55,56] or n-gram distances [57] as common measures of (dis)similarity. If required, it is possible to use suitable linguistic statistics to further refine the ranking [58,59,60].
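As a concrete illustration, the following Python sketch implements such an accept-or-correct step; the linear scan over the dictionary and the edit-distance threshold max_dist are illustrative choices, not prescriptions from the cited works.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(hypothesis: str, dictionary: list[str], max_dist: int = 2) -> str:
    """Accept the hypothesis if it is in the dictionary; otherwise return the
    nearest dictionary word within max_dist edits, else the raw hypothesis."""
    if hypothesis in dictionary:
        return hypothesis
    best = min(dictionary, key=lambda w: levenshtein(hypothesis, w))
    return best if levenshtein(hypothesis, best) <= max_dist else hypothesis
```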

A data structure for contextual word recognition is presented in [61] for a quick dictionary look-up using limited memory.

An approach that provides contextual information by using a given dictionary to predict the most probable label in a graph search is presented in [62]; it is robust to dictionary errors. In this approach, for every lexical word, the most probable path and the related confidence are calculated to predict the dictionary ranking.

Shannon [63, 64] was one of the first researchers to work on the letter-prediction task. Based on this idea, a linguistic post-processing model for character recognizers using a trainable variable-memory-length Markov model (VLMM) is introduced in [65]. The next character is predicted from a variable-length window of previous characters.

In [66], a statistical n-gram language model of syllables is trained on linguistic corpora. In [67], a character-recognition method for Japanese mail addresses uses a dictionary stored in a trie. The dictionary matching is controlled by a beam-search approach. The dictionary includes all the address names and principal postal offices in Japan. After pre-processing and segmentation, character hypotheses are produced by combining successive segments. Then, a version of a nearest-neighbor classifier that exploits the trie structure is used for fast prediction of the final label. In [68], an on-line handwriting-recognition system for cursive words uses simple character features to reduce a given large dictionary. The outputs of a time-delay neural network (TDNN) are converted into a character sequence. The result of the system is a matched word in the reduced dictionary, found using a variant of the Damerau–Levenshtein distance. For on-line handwriting recognition, a search technique is proposed in [69] as a post-processing phase of a recognition system; it calculates posterior probabilities of characters based on Viterbi decoding.

In [70], a version of a beam- and Viterbi-search recognizer is presented. This search method enables the use of discrete probabilities generated by many stroke-based character-recognition systems. Powalka et al. [71] introduce a technique combining word segmentation and character recognition with a lexical search to deal with segmentation ambiguities. A depth-first traversal of a dictionary tree for text recognition using a recursive procedure is presented in [72]. For on-line handwriting recognition, in [73], a given dictionary is reduced by applying simple feature extraction; afterward, the reduced dictionary is refined by AI techniques. In [74], contextual knowledge is used for isolated cursive handwritten-word recognition. A dictionary-tree representation with an efficient pruning method, as a fast search method through a large dictionary for an on-line handwriting-recognition system, is proposed in [75].

Of all these approaches, the dual-state word-beam search for CTC decoding of Scheidl et al. [10] currently enjoys increased interest; it will be described in Sect. 3.2.

3 Background

In this section, we detail two approaches that are essential for our proposed method. First, the convolutional recurrent neural network is briefly described [26]. Afterward, the dual-state word-beam search (DSWBS [10]) for CTC decoding is explained.

Fig. 2

The architecture of a convolutional recurrent neural network is composed of three components: convolutional layers, recurrent layers, and a transcription layer. The phases are as follows: first, feature extraction is carried out by the convolutional layers directly on a height-normalized, grayscale word image. Secondly, for each frame, the prediction of the label distribution is performed by the RNN layers. Thirdly, the transcription layer transcribes the corresponding prediction into a label sequence [26]. In this paper, handwritten character sequences are the input

3.1 Convolutional recurrent neural network

The convolutional recurrent neural network is an end-to-end trainable system presented in [26]. It outperforms the plain CNN in four aspects: (1) it does not need precise annotation for each character and can handle a string of characters for the word image; (2) it works without a strict pre-processing phase, hand-crafted features, or component localization/segmentation; (3) it benefits from the state-preservation capability of a recurrent neural network (RNN) to deal with character sequences; (4) it does not depend on the width of the word image; only height normalization is needed.

The model is composed of seven convolutional layers followed by two layers of BiLSTM units containing 256 hidden cells each, and a transcription layer. Although the model is made up of two distinct neural-network varieties, it can be trained integrally using one loss function.

Figure 2 shows the pipeline of the convolutional recurrent neural network [26]. The input of the model is a height-normalized, grayscale word image. Feature extraction is performed by the convolutional layers directly on the input image. The output of the CNN is a sequence of feature frames that acts as the input of the recurrent neural network, which provides raw character hypotheses. Finally, the transcription layer translates the resulting prediction into a label sequence.

Fig. 3

The dual-state word-beam search for CTC decoding [10] used for our proposed system

3.2 A dual-state word-beam search for CTC decoding

The dual-state word-beam search for CTC decoding [10] is based on Vanilla Beam Search decoding (VBS) [76] of the CTC layer. The output of the RNN is a matrix, which is the input of the dual-state word-beam search method. In the dual-state word-beam search, a prefix tree is built from the ground-truth labels of the training set. The decoder has two states: the word state and the non-word state (Fig. 3). The next character of the current beam is either a word-character or a non-word-character, and it determines the subsequent state of the beam. The sets of word-characters and non-word-characters are predefined.

The temporal evolution of a beam depends on its state. A beam in the non-word state can be extended by a non-word-character, in which case it stays in the non-word state. An entering word-character brings the beam to the word state; such a word-character is the beginning of a word. For a beam in the word state, the feasible following characters are given by the prefix tree. This procedure repeats iteratively until a complete word is reached. Scoring can be done in four ways:

  1. Words: a dictionary is used without employing a language model (LM).

  2. N-grams as LM: as a beam transitions from the word state to the non-word state, the LM scores the beam labeling.

  3. N-grams + forecast: as a word-character is appended to a beam, the prefix tree yields all possible words, and the LM scores all of the relevant beam extensions.

  4. N-grams + forecast + sample: to restrain the following potential words, some samples are first randomly selected; then the LM scores them. The total score has to be corrected to account for the random-sampling step.

The pseudo-code of the dual-state word-beam search is illustrated in Algorithm 1. The list of symbols is as follows.

  • \(RNN_{o}\): the sequence of RNN output activations over time.

  • \(B\): the set of beams at the present time step.

  • \(Width\): the beam width.

  • \(P_{b}\): the probability that the paths of a beam end with blank.

  • \(P_{nb}\): the probability that the paths of a beam do not end with blank.

  • \(P_{tot}\): \(P_{b}+P_{nb}\).

  • \(P_{txt}\): the probability allocated by the LM.

  • \(T\): the final iteration of the algorithm, \(t=T\).

  • Ø: the empty beam.

  • \(-1\): (the index of) the last character of the beam.

  • \(x\): a beam.

  • \(c\): a character.

  • \(x(t)\): a beam character at time \(t\).

  • \(numWords(x)\): the number of words in the beam \(x\).

  • \(getBestBeams(B,\ Width)\): the best \(Width\) beams based on the highest value of \(P_{txt}\cdot P_{tot}\).

  • \(scoreBeam(LM, x, c)\): the probability of seeing character \(c\) as an extension of the beam \(x\).


Algorithm 1: The dual-state word-beam search for CTC decoding [10]
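As a concrete illustration of the mechanism, the following Python sketch implements a simplified, single-word variant: a CTC prefix beam search in which a prefix tree built from the dictionary constrains the characters that may extend a beam. The non-word state, the language-model scoring modes, and several bookkeeping details of the full algorithm [10] are deliberately omitted.

```python
from collections import defaultdict

END = "$"  # marks a complete word inside the nested-dict prefix tree

def make_trie(words):
    """Build a prefix tree from the dictionary (or the training-set labels)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def descend(trie, prefix):
    """Return the trie node reached by a prefix, or None if it leaves the tree."""
    node = trie
    for ch in prefix:
        node = node.get(ch) if node else None
    return node

def word_beam_search(mat, alphabet, trie, beam_width=25):
    """mat: (T, len(alphabet) + 1) per-frame character probabilities, blank last.
    Each beam keeps P_b (paths ending in blank) and P_nb (not ending in blank)."""
    blank = len(alphabet)
    beams = {"": (1.0, 0.0)}
    for t in range(mat.shape[0]):
        nxt = defaultdict(lambda: [0.0, 0.0])
        for lab, (pb, pnb) in beams.items():
            nxt[lab][0] += (pb + pnb) * mat[t, blank]  # emit blank, label unchanged
            if lab:                                    # repeat last char, no blank
                nxt[lab][1] += pnb * mat[t, alphabet.index(lab[-1])]
            node = descend(trie, lab) or {}
            for c in node:                             # only tree-allowed extensions
                if c == END:
                    continue
                p_c = mat[t, alphabet.index(c)]
                # a genuinely repeated character needs an intervening blank
                nxt[lab + c][1] += (pb if lab and lab[-1] == c else pb + pnb) * p_c
        beams = dict(sorted(nxt.items(), key=lambda kv: sum(kv[1]),
                            reverse=True)[:beam_width])
    complete = [(lab, sum(p)) for lab, p in beams.items()
                if (descend(trie, lab) or {}).get(END)]
    return max(complete, key=lambda kv: kv[1])[0] if complete else ""
```

For a word image, mat would be the frame-wise soft-max output of the recurrent layers; the returned string is, by construction, a dictionary word.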

In RNNs, such as LSTMs, the exact alignment of the observed word image with the ground-truth label is not known. Hence, a probability distribution over labels at each time step is used for prediction, which makes it all the more important to use an adequate coding scheme.

However, even after the CTC stage, additional processing steps from the above-mentioned repertoire are needed to boost classification performance.

Unfortunately, although using linguistic resources is clearly advantageous, there are cases where this is not, or only partly, possible:

  • Not all problems enjoy the presence of an abundance of digitally encoded text, comparable to, e.g., the massive contemporary-English text corpora;

  • in historical collections, there may be virtually no resources, not even a lexicon;

  • many collections, e.g., administrative ones, have a dedicated jargon, abbreviations and non-standard phrasing. Diaries often contain family-specific or other idiosyncratic neologisms;

  • many collections have outdated geographical and scientific terminology, such as the historical document collection that belonged to the Natuurkundige Commissie's scientific exploration of the Indonesian Archipelago between 1820 and 1850 [77]. This heterogeneous handwritten manuscript contains 17,000 pages of field notes based on the scientists' observations of nature, written in German, Latin, Dutch, Malay, Greek, and French. Biological terms vary greatly over periods in history [78].

There is, however, an additional way to improve classification performance. Impressive results using an ensemble method were presented in [8]; however, the number of networks was so large (118) that the need for a less drastic approach becomes urgent. We will therefore focus on the possibilities of a small-scale ensemble.

4 Method

In this section, we present a limited-size ensemble system for word recognition requiring a minimum of human intervention. The suggested system uses an adequate label-coding scheme and a dictionary as the only resource for the language model. This makes the system suitable for deployment on e-Science servers. The system is described as follows.

4.1 The Extra-separator label-coding scheme

In the common label-coding scheme, which we call 'Plain,' only the characters that are present in the word image appear in the corresponding label. In the 'Extra-separator' label-coding scheme, one more character is appended at the end of each label. The appended character, named the extra separator (e.g., '|'), must not exist in the alphabet of the dataset. The aim of adding the Extra-separator character is to give the recognizer an extra hint concerning the end-of-word shape condition.
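In code, the scheme amounts to a one-character change in the label encoding. A minimal sketch (the bar sign follows Sect. 5.2; the helper name is ours):

```python
SEPARATOR = "|"  # any character absent from the dataset's alphabet works

def encode_label(transcription: str, scheme: str = "extra-separator") -> str:
    """'Plain' keeps the transcription as-is; 'Extra-separator' appends an
    end-of-word token that hints the recognizer at the end-of-word shape."""
    if scheme == "extra-separator":
        return transcription + SEPARATOR
    return transcription

# encode_label("Afdeeling") == "Afdeeling|"
```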

4.2 Neural network

The neural network is a convolutional BiLSTM neural network, and it is an end-to-end trainable framework inspired by [26]. The main configuration of the networks is detailed in Table 1. In this section, we explain the essential components of our approach.

Table 1 Configuration of our convolutional recurrent neural network from input image (bottom) to last output (top)

4.2.1 Pre-processing

The pre-processing is performed in each epoch of training. It consists of (a) data augmentation by randomly stretching/squeezing the grayscale images in the width direction, (b) re-sizing the images to \(128 \times 32\), and (c) normalization. Data augmentation is performed to increase the size of the training set; it is achieved by changing the width of an image randomly by a factor between 0.5 and 1.5. Next, both the original grayscale images and those added through data augmentation are resized so that either the width is 128 pixels or the height is 32 pixels. After that, we pad the image with white pixels until the size is \(128 \times 32\). Then we normalize the intensity of the grayscale image. Note that our method does not need baseline alignment or precise deslanting. Please note that one of our datasets was already deslanted to 90°.
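A minimal sketch of this per-epoch pipeline using PIL and NumPy is given below. The width factor in [0.5, 1.5], the \(128 \times 32\) target size, and the white padding follow the text; the interpolation mode and the exact normalization are assumptions.

```python
import random
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, train: bool = True) -> np.ndarray:
    """Random width stretch/squeeze (training only), resize to fit 128 x 32
    while keeping the aspect ratio, pad with white, and normalize intensity."""
    w, h = img.size
    if train:  # augmentation: width changed by a random factor in [0.5, 1.5]
        img = img.resize((max(1, int(w * random.uniform(0.5, 1.5))), h))
    w, h = img.size
    scale = min(128 / w, 32 / h)
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = Image.new("L", (128, 32), color=255)  # white background
    canvas.paste(img, (0, 0))
    arr = np.asarray(canvas, dtype=np.float32)
    return (arr - arr.mean()) / max(float(arr.std()), 1e-6)
```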

4.2.2 A 5-layer CNN

The pixel-intensity values after pre-processing are fed to the first of the five layers of a CNN to extract feature sequences. Each layer of the CNN contains a convolution operation, normalization, the ReLU activation function [79], and a max-pooling operation. The size of the kernel filters in each layer is \(3 \times 3\). Given the fixed setting of the important hyperparameters, such as the number of layers, the only variable control parameters concern the number of units in the hidden layers. A simple table of three possible sizes, \(\{128,\ 256,\ 512\}\), is used, each selected with probability 0.33, to determine the sizes of the hidden layers. The network has no dropout. The numbers of hidden units used in our experiments are shown in Table 2. The number of layers, the kernel size, and the optimizer are our own configuration and differ from Shi et al. [26].

Furthermore, instead of ADADELTA [80], used in [26], we use RMSProp [81]. Moreover, we use five convolutional layers instead of the seven suggested in [26].
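Generating an ensemble member then reduces to drawing the variable hidden-layer sizes; a sketch of this sampling step (the helper is ours; the choices and probabilities follow the text and Sect. 4.3):

```python
import random

HIDDEN_CHOICES = [128, 256, 512]  # each drawn with probability 1/3

def sample_architecture(rng: random.Random) -> list[int]:
    """Draw the hidden-unit counts of CNN layers 2-4 (cf. Table 2); the number
    of layers, the 3 x 3 kernels, and the optimizer remain fixed."""
    return [rng.choice(HIDDEN_CHOICES) for _ in range(3)]

rng = random.Random(0)
ensemble_configs = [sample_architecture(rng) for _ in range(5)]  # five members
```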

Table 2 Number of hidden units in the CNNs front ends, in the five architectures (\(A_i\), \(i= 1\ldots 5\))
Table 3 Datasets

4.2.3 BiLSTM

The five convolutional layers are followed by three layers of BiLSTM. Because the last convolutional layer contains 512 hidden units, each BiLSTM has 512 hidden units.

4.2.4 Connectionist temporal classification (CTC)

The CTC output layer contains two more units than the number of characters in the alphabet (A) of the given dataset: the suggested extra separator (e.g., '|') and the common CTC blank, which differs from the space character. Therefore, the alphabet of the CTC output is:

$$\begin{aligned} A' = A \cup \{extra\ separator\} \cup \{blank\} \end{aligned}$$

The \(|A|+2\) output units determine the probability of detecting the relevant label at each time step. Further, the blank unit determines the probability of observing blank, i.e., 'no label.' For CTC decoding, we use the dual-state word-beam search presented in [10]. This method is explained in Sect. 3.2.
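Putting Sects. 4.2.2 through 4.2.4 together, the following PyTorch-style sketch shows the overall shape of one recognizer. It is an illustration under stated assumptions: the per-layer channel counts, the batch normalization, and the pooling schedule are plausible guesses, not the exact Table 1 configuration.

```python
import torch.nn as nn

class ConvBiLSTM(nn.Module):
    """Sketch of one recognizer: 5 conv layers -> 3 BiLSTM layers -> CTC output."""

    def __init__(self, alphabet_size: int, conv_sizes=(64, 128, 256, 512, 512)):
        super().__init__()
        layers, in_ch = [], 1
        for i, out_ch in enumerate(conv_sizes):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU()]
            # halve the height in every layer, the width only in the first two,
            # so enough time steps survive along the width (sequence) axis
            layers.append(nn.MaxPool2d((2, 2) if i < 2 else (2, 1)))
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers)
        self.rnn = nn.LSTM(conv_sizes[-1], 512, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 512, alphabet_size + 2)  # + separator + blank

    def forward(self, x):                       # x: (B, 1, 32, 128)
        f = self.cnn(x)                         # (B, 512, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)       # (B, W', 512): width = time
        y, _ = self.rnn(f)                      # (B, W', 1024)
        log_probs = self.fc(y).log_softmax(-1)
        return log_probs.permute(1, 0, 2)       # (W', B, C) as nn.CTCLoss expects
```

Training would then combine torch.nn.CTCLoss on these log-probabilities with torch.optim.RMSprop, matching the optimizer choice above.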

4.3 The ensemble system

In order to construct an ensemble automatically, the number of hidden units in layers 2, 3, and 4 is selected at random from a list of possible sizes, \(\{128, 256, 512\}\) (Table 2). In each of the five CNN-BiLSTMs, for an input image, the outcome of the CTC decoder is a string, a word hypothesis with its relative likelihood. The word hypotheses obtained from the five networks are sent to the voter component. Plurality voting [82] with a solution for ties is then applied: the alternatives are divided into subsets with identical strings, and the subset with the largest number of voters is selected. In case of a tie, the subset with the highest averaged likelihood is the winner. If the number of subsets equals the number of alternatives, the alternative with the highest likelihood is the winner. The winning string is considered the final, best label of the input image. This approach was chosen after a pilot experiment using Borda-count voting, without good results, possibly due to the lack of diversity in the ranked candidate lists. Therefore, the simpler approach of plurality voting with exception handling was used. Please note that analyzing the different, randomly drawn CNN-BiLSTM architectures in the ensemble is not the research goal of this paper. We just need a number of networks that sufficiently support each other in the ensemble by sufficiently independent votes.
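The tie handling can be captured in a few lines; a sketch, where each network contributes a (string, likelihood) pair:

```python
def plurality_vote(hypotheses: list[tuple[str, float]]) -> str:
    """Plurality voting with the tie handling described above: identical strings
    form subsets; the largest subset wins; ties between equally large subsets
    are broken by the highest averaged likelihood; if every network disagrees,
    the single most confident hypothesis wins."""
    groups: dict[str, list[float]] = {}
    for word, likelihood in hypotheses:
        groups.setdefault(word, []).append(likelihood)
    largest = max(len(ps) for ps in groups.values())
    if largest == 1:  # as many subsets as alternatives: fall back to confidence
        return max(hypotheses, key=lambda wp: wp[1])[0]
    tied = {w: ps for w, ps in groups.items() if len(ps) == largest}
    return max(tied, key=lambda w: sum(tied[w]) / len(tied[w]))
```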

5 Results

In this section, we will first describe the datasets used in the experiments. Then, we explain how our experiments were carried out. Finally, we report the quantitative results.

5.1 Datasets

In this paper, we used three datasets that differ in time period and language, summarized in Table 3. The first dataset is RIMES, which was used to be comparable with the state-of-the-art methods. This database has different versions; we used the isolated words of the ICDAR 2011 version to evaluate the methods and make the comparison with published results possible [19]. The RIMES database is drawn from different types of handwritten manuscripts: postal mail and faxes. It contains 12,723 pages written by 1300 volunteers using black ink on white paper. The RIMES dataset consists of 51,738 images of French handwriting for training, 7464 images for validation, and 7776 images for testing. The dictionary size of the training set is 4943 words, of the validation set 1612 words, and of the test set 1692 words; the dictionary size of the whole dataset is 5744 words. The comparison is performed case-insensitively, as is common for the RIMES dataset, and accents were taken into account. In the evaluation of our model on RIMES, two dictionaries were used: Concise and Large. The Concise dictionary contains all the words within the RIMES dataset, \(n_{words}\) = 5744 (6 K). A French dictionary called Large (50 K) is used to study the effect of a larger dictionary.

Fig. 4

Samples of the KdK dataset (the year 1903). a–d show the images labeled using the Extra-separator label-coding scheme

The second dataset belongs to the National Archive of the Netherlands and is named KdK (Het Kabinet der Koningin, the Dutch Queen's Office) [15, 20]. The manuscript was written between 1798 and 1988; the year 1903 was used here. The KdK dataset contains 172,440 Dutch word images. The number of word classes of the total dataset is 11,749 and 10,747, case-sensitively and case-insensitively, respectively. Regardless of case sensitivity, there are 1–5628 sample(s) in each class. The length of the word samples is 1–28 characters. In the case-sensitive setting, 5% of the test words do not occur in the training words and are 'out of vocabulary (OOV)'; OOV in the case-insensitive setting is 4.5%. The remaining words are referred to as 'in vocabulary (INV).' Figure 4 shows four original samples of the KdK dataset. For evaluation, two dictionaries are used: Concise and Large. The Concise dictionary contains all the words in the KdK dataset (12 K); the size of the Dutch Large dictionary is 384 K [83].

The third dataset is George Washington (GW [21]). The GW dataset is harvested from 20 pages written by George Washington and George Mercer in the year 1755. The GW dataset for handwritten-word recognition consists of 4894 word images. The ground truth of GW contains the upper- and lower-case English letters, punctuation marks, digits, and historical special characters, e.g., the long s, all of which were encountered in our evaluations. As is common, we used the first partition of the dataset [28, 84] (Table 3). The Concise dictionary contains all the words within the GW dataset, \(n_{words}\) = 1471 (1 K). A Large dictionary (12 K) is used to study the effect of a larger dictionary on our model. The Large dictionary contains all the words and signs from the Pamphlets of the American Revolution [85] from 1750 to 1776.

Table 4 The result of the RIMES dataset

5.2 Quantitative results

In this section, we evaluate our model on the RIMES, KdK, and GW datasets in terms of the label-coding scheme (Plain vs Extra separator) and ensemble vs single network, measured in word accuracy. Moreover, for the RIMES and GW datasets, the results of our model are compared with the state-of-the-art methods. We train the model from scratch. Although using synthetically generated images can boost the result for a particular task, this is not a general solution. There are no synthetic resources for rare languages, e.g., Aymara, or rare script types. This argument also applies to difficult note-fields, e.g., the MkS dataset [77]. Generating artificial (synthetic) samples in the proper language and style is very interesting but much more complicated than random morphing of existing data [86]. There are several methods for data augmentation [87, 88]. Whether one uses algorithmic synthesis or generative adversarial networks (GANs [89]), this requires human expertise and research labor, in addition to the recognizer's design and training. This dependency on human input is actually in contrast with the current data-driven philosophy in machine learning and AI.

For the Extra-separator label-coding scheme, a character that is absent in the given dataset, the bar sign (|), was found automatically and used as the Extra-separator character; hence, the bar sign is appended to the end of each image label (Fig. 4). As a result, the size of the output of the CTC layer increases. The RIMES dataset contains 80 unique characters, meaning that the size of the CTC output layer is 82 (80 unique characters, one extra separator, and one common blank). The KdK dataset contains 52 unique characters; therefore, the size of the CTC output layer is 54 (52 unique characters, one extra separator, and one common blank). We compare the result of this addition to the Plain label-coding scheme. Two CTC-decoder methods are used: dictionary-free (best path) and with a dictionary (dual-state word-beam search [10]). For the dual-state word-beam search, two dictionaries are used for each dataset: Concise and Large.

Table 4 shows the effect of the two label-coding schemes, single recognizers, and ensemble voting on the RIMES dataset in terms of word accuracy (%). For each of the two label-coding schemes (Plain and Extra separator), the five architectures were trained, which resulted in 10 trained networks. The networks were then evaluated using the best-path CTC decoder and the dual-state word-beam search CTC decoder, applying the Concise (6 K) and the Large (50 K) dictionaries. The result of each evaluation and the corresponding average ± standard deviation (avg ± sd) are reported. In the bottom row of Table 4, the voting-based result of the ensemble of the five networks is presented.

Best path vs dual-state word-beam search: the results confirm that using a decoder with a dictionary considerably improves the performance (95–97%), as expected (t-test, \(p < 0.05\), significant). The dictionary-free best-path CTC decoder gives a lower performance, still at 88–89%. Moreover, when the dual-state word-beam search CTC decoder is used, adding an Extra-separator character enhances the model.

Plain vs Extra separator: for the best-path CTC decoder, both Plain and Extra separator have an average of 84.5% (t-test, \(p > 0.05\), N.S.); the extra separator thus has no effect. However, for the dual-state word-beam search CTC decoder using the Concise dictionary, Plain has an average of 94.3% and Extra separator an average of 95.2% (t-test, \(p < 0.05\), significant); hence, the extra separator is effective. For the dual-state word-beam search CTC decoder using the Large dictionary, Plain has an average of 92.9% and Extra separator an average of 94.1% (t-test, \(p < 0.05\), significant); the Extra separator is therefore effective again, in the case of a large dictionary.
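For reference, a comparison of this kind can be computed with SciPy. Whether the underlying test was paired or unpaired is not stated here, so the paired variant below, over hypothetical per-architecture accuracies, is an assumption for illustration only.

```python
from scipy import stats

# hypothetical word accuracies (%) of the five architectures; these are
# placeholders for illustration, not the paper's per-network values
plain = [94.0, 94.2, 94.5, 94.3, 94.5]
extra = [95.0, 95.1, 95.4, 95.2, 95.3]

t, p = stats.ttest_rel(plain, extra)  # paired t-test across architectures
print(f"t = {t:.2f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```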

Single network vs ensemble: ensemble voting increases the performance, and its effect is larger for weaker recognizers (4 pp increase in performance for the dictionary-free CTC decoder using the Plain/Extra-separator label-coding scheme, final row vs average and individual). An ensemble of five recognizers, using the CTC decoder with the Concise dictionary combined with the Extra-separator label-coding scheme, results in the highest performance among the methods trained from scratch (96.6%, column 6, bottom).

Table 5 The comparison of our system to the state-of-the-art systems on the RIMES dataset in terms of number of recognizers (\(\hbox {n}_{rec}\)), homogeneity of the algorithm (Hom.), complexity of the approach (Compl.), and word accuracy (%) (\(\hbox {word}_{acc}\))
Fig. 5

The graph shows the effect of number of networks in the voting ensemble on the final accuracy (%) for the RIMES dataset, with diminishing returns as the number of voters increases

To study the effect of the number of networks in the ensemble on the final accuracy, the results of randomly selecting 1, 3, 5, 10, and 15 network(s) are shown in Fig. 5 for the RIMES dataset. The label-coding scheme is Extra separator, and the CTC decoder is the dual-state word-beam search using the Concise dictionary. The networks in the ensemble differ only in the random initialization and in the number of units in layers 1 through 4, randomly selected from the set \(n = \{128,\ 256,\ 512\}\). The maximum accuracy is obtained by the ensemble of 15 networks, 96.72%, which is just 0.09 pp more than using 10 networks.

Table 5 shows the comparison of our method on the RIMES dataset with [8, 27, 29, 30, 35, 38] in terms of a number of characteristics: the number of recognizers, the homogeneity of the algorithm, the word accuracy (%), and the complexity of the approach (not to be confused with computational complexity), e.g., whether it is a deep-learning method without extra complicated modules.

Here, model complexity means complexity in general, but this includes the computational complexity, which for LSTMs is on the order of the total number of coefficients or weights [90]. Indeed, there is another aspect of complexity, i.e., the intricacy of the models in terms of the number of layers and other hyperparameters, and the consequent human effort spent on training.

The importance of a homogeneous ensemble for an e-Science server lies in its practicality. In this paper, we use the term 'homogeneous' to indicate constrained heterogeneity, because absolute homogeneity would not be sensible due to the resulting lack of stochastic independence of opinions in an ensemble. While [91] rightly suggested that heterogeneity is important in ensemble voting, our results indicate that even constrained heterogeneity, i.e., with a limited number of random hyper-parameter variations, is still beneficial. In this work, we pursue a method adequate for e-Science servers. On an e-Science server, tailoring an ensemble for slightly higher performance is not feasible because such a server handles massive numbers of different datasets. Therefore, for an e-Science server, it is practical to use an ensemble composed of a limited number of automatically generated architectures.

The result of Dutta et al. [28] in Table 5 appears to be the highest performance; however, in [28], the special characters are not counted in the evaluation, whereas in our approach, the punctuation marks and digits are counted for RIMES. The model in [28] is pre-trained on the IIIT-HWS dataset, and then training and evaluation were done on the RIMES dataset.

Table 6 The results of the KdK dataset
Fig. 6

The samples of the pre-processed KdK dataset. a–f show the images labeled using the Extra-separator label-coding scheme. After the binarization process, all the word images were sheared 45\(^{\circ }\) in the anticlockwise direction

For the KdK dataset, the results are as follows. The samples of the KdK dataset for our experiment were binarized and then sheared 45° in the anticlockwise direction, the slant angle in this writing style being approximately 45°. Afterward, the white borders of the images were removed horizontally and vertically, up to the position of the first black pixel. In Fig. 6, the deslanted, white-removed images are shown. To derive a more accurate estimate of the performance of our model, we ran fivefold cross-validation. Each architecture \(A_i\), where \(i = 1\) to 5, is trained either using the Plain label-coding scheme or using the Extra separator, resulting in 50 trained networks (\(5\times 5\times 2\)). Then, each network is tested three times: using the dictionary-free best-path CTC decoder, and using the dual-state word-beam search CTC decoder applying the Concise (12 K) and the Large (384 K) dictionaries.

Table 6 shows the average (avg) and standard deviation (sd) of the word accuracy (%) of the five architectures using fivefold cross-validation, varying per architecture over the following parameters: dictionary (none, Concise, Large) and label-coding scheme (Plain, Extra separator) (\(5\times 3\times 2\)). Each row is derived from 30 network evaluations; in other words, each row is the result of one architecture, regardless of the CTC decoding method, dictionary, and label-coding scheme used. A slightly lower performance is expected here, as the best-path CTC decoder pulls the average down. A similar result is achieved for each label-coding scheme, regardless of the CTC decoding method, dictionary, and architecture used. The Extra separator has a higher performance, 94.5%, which is 0.4 pp higher than the Plain coding scheme.

Fig. 7
figure 7

The behavior of a single network \(\hbox {A}_2\), using the Extra-separator label-coding scheme and the dual-state word-beam search CTC decoder for different word lengths for the OOV and INV conditions in the KdK dataset. The continuous black line indicates the word-length proportion of the training set of one round of fivefold cross-validation for the KdK dataset. The dots represent the accuracy of the network \(\hbox {A}_2\) on OOV and INV using the dual-state beam search and an extra separator for labeling. Please note that OOV words can be recognized with high accuracy in a range in which there are few numbers of samples in the training set (i.e., even in case of infrequently used words) (color figure online)

Table 7 shows the average (avg) and standard deviation (sd) of the word accuracy (%) when using a dictionary, on fivefold cross-validation, varying per dictionary over the following parameters: architecture (\(\hbox {A}_i,\ i= 1\) to 5) and label-coding scheme (Plain, Extra separator) (\(5\times 5\times 2\)). Each row is derived from 50 network evaluations.

Table 7 The results of the KdK dataset

Figure 7 shows the behavior of a single network \(\hbox {A}_2\), using the Extra-separator coding scheme and the dual-state word-beam search CTC decoder for different word lengths for the OOV and INV condition in the KdK dataset. The blue and red dots represent the accuracy of OOV and INV, respectively.

Fig. 8

Accuracy of test words obtained by network \(A_2\) on one round of fivefold cross-validation on the KdK dataset. The horizontal axis represents the number of instances per word class, sorted in order of log(frequency) in the training set. The parentheses show the number of samples per class. The blue circles show the test words which are present in the training set, in vocabulary, where the darker the blue circle, the more word classes. The dark red circle indicates the average accuracy (89.9%) of out-of-vocabulary samples at \(frequency=0\) (color figure online)

Table 8 The result of the KdK dataset

The continuous green and black lines in Fig. 7 indicate the word-length occurrence of the training and test sets of the KdK dataset in one round of the fivefold cross-validation. The single network \(A_2\) has high accuracy on INV words with a length of up to 17 characters, which is promising. For longer words, the performance becomes erratic. The single network \(A_2\) does not perform satisfactorily on short OOV words of 1–4 characters. The performance on OOV words of 5–15 characters is highly adequate. For OOV words whose length is between 16 and 20 characters, the performance is variable. Surprisingly, for OOV samples longer than 21 characters, the model has a high performance.

Table 9 The result of the GW dataset
Table 10 The results of the GW dataset

Figure 8 shows the accuracy of words achieved by network \(\hbox {A}_2\) on one round of fivefold cross-validation on the KdK dataset. The horizontal axis shows the number of instances per word class, sorted in order of log(frequency) in the training set. The blue circles indicate INV words. The dark red circle shows the average accuracy and the log occurrence of OOV words (89.9%). Note the different 'threads' in the curve, revealing groups of easy and difficult (slow-starting) classes. In lifelong machine learning, the horizontal axis corresponds to time, starting with just a few examples on the left. The average performance on OOV samples is high, at \(log(f) = 0\), where \(f\) is the frequency in the training set. From the curves, it can be seen that more examples imply a higher accuracy, but even words that are not in the (training) lexicon can obtain a decent performance.

Table 8 shows the comparison of the effect of the two label-coding schemes (Plain and Extra separator) and the CTC decoder application on the ensemble for the five rounds of the cross-validation of the KdK dataset.

Best path vs dual-state word-beam search: using no dictionary results in more than 93% accuracy. Using a decoder with a dictionary boosts the performance (t-test, \(p < 0.05\), significant). Adding an extra separator enhances the model when a CTC decoder with a dictionary is used.

Plain vs Extra separator: for the best-path CTC decoder, Plain has an average of 90.6% and Extra separator an average of 90.7% (t-test, \(p > 0.05\), N.S.); the extra separator thus has no effect. For the dual-state word-beam search CTC decoder using the Concise dictionary (12 K), Plain has an average of 96.3% and Extra separator an average of 96.8% (t-test, \(p < 0.05\), significant); the extra separator is therefore effective. For the dual-state word-beam search CTC decoder using the Large dictionary (384 K), Plain has an average of 95.5% and Extra separator an average of 96.1% (t-test, \(p < 0.05\), significant); again, the extra separator is effective.

Single network vs ensemble: ensemble voting increases the performance, and its effect is larger for a weaker recognizer (3 pp increase in performance for the dictionary-free CTC decoder for Plain/Extra separator). The ensemble of five recognizers that used the CTC decoder with the Concise dictionary combined with the Extra-separator label-coding scheme results in the highest performance (97.4%).

Table 9 shows the effect of the two label-coding schemes, Plain and Extra separator, single recognizers, and ensemble voting on the GW dataset in terms of word accuracy (%). We follow the use of this measure to compare our results to other studies. Sometimes character classification rates or edit distances are reported; there are caveats here. For instance, asymmetric distance measurement has been reported to be more relevant in the case of historically spelled words compared to contemporary spelling [92]. Since word accuracy is a strict, conservative measure, we use it here. For each of the two coding schemes, the five CNN-BiLSTMs described in Sect. 4.2 with different numbers of hidden units (Table 2) were trained. The evaluation was then conducted using the best-path (BP) and the dual-state word-beam search (DSWBS) CTC decoders, applying the Concise (1 K) and the Large (12 K) dictionaries. The word recognition accuracy (%) of each evaluation and the corresponding average ± standard deviation (avg ± sd) are reported in Table 9. The bottom row of Table 9 shows the result of the ensemble.

Best path vs dual-state word-beam search: the results confirm that using a CTC decoder with a dictionary significantly improves the performance, as expected (t-test, \(p < 0.05\), significant). Additionally, the dual-state word-beam search CTC decoder coupled with an Extra-separator character enhances the model further (86%). The dictionary-free best-path CTC decoder results in a low performance, 70%.

Plain vs Extra separator: when the best-path CTC decoder is used, the performance for both Plain and Extra separator is low (t-test, \(p > 0.05\), N.S.); using the Extra-separator label-coding scheme is thus not beneficial when the best-path CTC decoder is used. However, for the dual-state word-beam search CTC decoder using the Concise dictionary, Plain has an average of 84% and Extra separator an average of 86% (t-test, \(p < 0.05\), significant). For the dual-state word-beam search CTC decoder using the Large dictionary, Plain has an average of 81% and Extra separator an average of 84% (t-test, \(p < 0.05\), significant).

Single network vs ensemble: ensemble voting boosts the performance, and the ensemble has a larger effect on a weak classifier. An ensemble of five recognizers increases the performance by 4 pp using the Plain/Extra-separator label-coding scheme when the dictionary-free CTC decoder is used. The ensemble using Plain/Extra separator increases the performance by 3 pp when a dictionary is used.

Table 10 shows the comparison of the performance of our approach on the GW dataset with a recent paper [28]. In this paper, we focus on word recognition rather than character recognition; unfortunately, the other studies on this dataset have reported character recognition accuracy [36, 84].

Figure 9 shows a comparison of the effect of the two label-coding schemes and of dictionary application on a single architecture and on ensemble voting for the RIMES, KdK, and GW datasets, showing the weighted average. Table 11 shows the average word accuracy (%) on the RIMES, KdK, and GW datasets, using the Concise dictionary and the Extra-separator label-coding scheme.

The analysis of the ensemble shows that the suggested solution for ties is most beneficial in the case of weaker classifiers. The increase in accuracy is 1.3 pp for GW, 0.3 pp for RIMES, and 0.2 pp for the KdK dataset.

6 Discussion

The results indicate that it is possible to achieve a high word accuracy (%), in comparison with the state of the art, with a limited-size ensemble, a homogeneous algorithmic approach, and a low complexity [8, 27,28,29,30, 35, 38] (cf. Tables 5, 10). In those studies, numerous networks (up to 118 or 2100 network instances) are required in the ensemble. Our method uses only five networks, yielding comparable or better results, also considering that no extraneous training sets were used. Another approach to improving recognition rates is using classifiers pre-trained on synthetic data [28]. However, such an approach will only work if a language model is available and the allographs of the script style are known, including details of punctuation and diacritics. In general, this is not the case, and human effort is necessary to implement the training setup. This is not acceptable as a general solution in a large and diverse e-Science server for historical document processing. In the proposed method, handcrafted feature descriptors such as the histogram of oriented gradients (HOG [51]) are not used; the process starts with a pixel image and is trained end to end.

The results confirm that the DSWBS CTC decoder, using a prefix tree made of a given dictionary, significantly increases the performance, as anticipated (5–16 pp). The results also indicate that adding an end-of-word separator is beneficial specifically when the dual-state word-beam search is used for CTC decoding, not in the case of basic best-path (dictionary-free) CTC decoding. In other words, the Extra-separator character, '|,' tagging the end of the word, boosts the result of the dual-state word-beam search CTC decoding. This increase in performance occurs despite the slight increase in model size caused by adding the Extra-separator character. The effect on the result of CTC best-path decoding, i.e., a non-dictionary method, is limited, however. Finally, ensemble voting clearly improves the word accuracy (1–4 pp); its effect is stronger for weaker recognizers.

Table 11 Weighted average of word accuracy (%) on the RIMES, KdK and GW datasets, using the dual-state word-beam search applying the Concise dictionary and the Extra-separator label-coding scheme, for the two CTC methods and single vs ensemble voting
Fig. 9

Comparison of the effect of the two label-coding schemes (Plain vs Extra-separator) and dictionary application on the single architecture and ensemble voting on the RIMES, the KdK, and GW datasets showing the weighted average based on test set sizes. The two datasets are rather different. The spread of a distribution is not very informative

It should be noted that the reported results are based on realistic images with many word-segmentation problems and can therefore be considered a conservative estimate (cf. Fig. 6).

We have shown that medium-length OOV words (5–11 characters), i.e., words that are in the test set but not in the training set, profit from training information that is present within short words in the training set (cf. Fig. 7). Longer OOV words (11–23 characters) profit from the training on words whose length is 1–11 characters. Interestingly, OOV words can have a high performance in a word-length range for which there are not many examples (cf. Fig. 7). In addition, for INV words shorter than 18 characters, the accuracy is higher than 95%. Therefore, it can be inferred that both OOV and INV words are recognized with high accuracy if they are of a commonly occurring word length. Furthermore, it is apparent that some INV words can be considered 'easy' in the sense that they need only a limited number of examples in the training set, whereas other words are 'difficult,' i.e., needing more than 100 examples to obtain a high accuracy (Fig. 8).

The goal of this research is not a record attempt toward maximized accuracy on the RIMES, KdK, and GW datasets. Higher performance can undoubtedly be achieved using a larger ensemble (e.g., from Fig. 5 it can be derived that 15 or more NNs in the ensemble would yield 97% accuracy); this is not the point. Our choice of an ensemble of five voting elements is a compromise with a very good and stable performance. The jump of more than 1 pp in performance from one individual classifier to five classifiers is larger than the increase of less than 0.3 pp from 5 to 10 classifiers, and the increase in performance is even smaller for higher numbers of classifiers in the ensemble, showing diminishing returns.

Furthermore, we have shown that providing a more than 30 times larger dictionary (in the case of the KdK dataset) causes only a slight drop in performance. In addition, for the dictionary-free approach, using an ensemble system results in a much higher performance, with more stability, than a single network. In the higher-performing approach using a dictionary, the ensemble-based improvement is present but less prominent. Moreover, as expected from previous research, using the CTC decoder with a dictionary increases the performance compared to a dictionary-free CTC decoder.

From the literature, it is known that using synthetic data for pre-training can be beneficial for contemporary and common languages, e.g., the contemporary French RIMES dataset [28]. For the English GW dataset with special characters and styles, it was also noted that some characters were absent in the handcrafted synthetic augmentation data. However, the positive effect of synthetic data is expected to be smaller when the problem is highly multilingual, e.g., as in the case of the MkS dataset, or concerns a rarely used language, e.g., Aymara. The use of handcrafted synthetic data makes the recognizer highly dependent on the human labor that is needed for its implementation. An alternative to augmenting the data from a training set would be to exploit pre-existing text-shape knowledge from a pre-trained network. However, this was not the purpose of this paper, because we want to provide training on the data itself, not using extraneous background knowledge.

7 Conclusions

Implementing algorithms that perform very well on standard benchmark datasets may yield sub-optimal results on other, large historical manuscript collections in a real application. In this study, we wanted to find an LSTM architecture and CTC decoding approach that shows a high performance and is easy to implement without human supervision in training and operation. Our model consists of an ensemble of just five homogeneous end-to-end trainable recognizers, using plurality voting with a solution for ties. Each recognizer is composed of five convolutional layers and three BiLSTM layers, followed by a CTC layer. Diversity is fostered by varying the number of units in the hidden layers of the CNNs. For CTC decoding, a dual-state word-beam search is applied, using only the given dictionary as the language model. For the labeling of words, we show that adding a token to stress the end-of-word state is significantly beneficial. Training of the system is done from scratch, exclusively on the given dataset, and data augmentation is not used during testing. The word accuracy of our model is 96.6% on RIMES, 89.55% on the George Washington dataset, and 97.4% on the KdK dataset, a locally collected historical handwritten dataset. The contributions of this paper are:

  (a) To illustrate that, even in a deep-learning paradigm, a careful, one could say handcrafted, design of the labeling systematics in LSTMs plays an important role. Adding a separator that marks the word ending has a beneficial effect;

  (b) to make the point that LSTM architectures are difficult to design and train. They do not lend themselves easily to large-scale operations where training and deployment take place as autonomously as possible. The goal is to design an architecture that can be generated randomly on the basis of a limited number of hyper-parameter values, with good performance;

  (c) to introduce an ensemble-based approach that, unlike recent examples in the literature, does not require hundreds or thousands of individual networks, but just a handful;

  (d) a specific solution for ties in plurality voting, yielding a 0.2–1.3 pp improvement over plain plurality voting;

  (e) to provide empirical results on very large datasets to give insight into: (1) the effect of the word length on the accuracy of in-vocabulary and out-of-vocabulary test samples; (2) the in-vocabulary vs out-of-vocabulary results; (3) the effect of the number of examples per word class; (4) the presence of easy and difficult classes in training; (5) the effect of the ensemble size on the accuracy; and (6) the effect of the size of a word class on the recognition rate for easy and difficult test samples.

Word-based LSTMs cannot make use of the larger textual context. Therefore, as future work, we plan to extend our approach to handle line-strip images. Moreover, we will explore the applicability of our model to other datasets with different languages and increase the performance on out-of-vocabulary words. Furthermore, the challenge of high-performance recognition of long words will be addressed.