A limited-size ensemble of homogeneous CNN/LSTMs for high-performance word classification

In recent years, long short-term memory neural networks (LSTMs) have been applied quite successfully to problems in handwritten text recognition. However, their strength lies more in handling sequences of variable length than in handling geometric variability of the image patterns. Furthermore, the best results for LSTMs are often based on large-scale training of an ensemble of network instances. In this paper, an end-to-end convolutional LSTM neural network is used to handle both geometric variation and sequence variability. We show that high performance can be reached on a common benchmark set with just five such networks, using proper data augmentation, a proper coding scheme and a proper voting scheme. The networks have similar architectures (convolutional neural network (CNN): five layers; bidirectional LSTM (BiLSTM): three layers, followed by a connectionist temporal classification (CTC) processing step). The approach assumes differently-scaled input images and different feature map sizes. Two datasets are used to evaluate the performance of our algorithm: the standard benchmark RIMES dataset (French), and a historical handwritten dataset, KdK (Dutch). The final performance obtained for the word-recognition test of RIMES was 96.6%, a clear improvement over other state-of-the-art approaches. On the KdK dataset, our approach also shows good results. The proposed approach is deployed in the Monk search engine for historical-handwriting collections.


Introduction
Convolutional neural networks (CNNs) [1] and long short-term memory networks (LSTMs) [2] and their variants [3,4] have recently achieved impressive results [5][6][7]. This exceptional performance comes, however, at the cost of having an ensemble of, e.g., 118 recognizers [8]. The high cost of training and operation raises the question whether less costly methods can be applied to boost the performance of handwriting recognizers.
A possible direction would consist of the use of linguistic statistics [9]. A recent method for using language information is a dual-state word-beam search [10] for decoding the connectionist temporal classification (CTC [11]) layer of neural networks, which has been shown to be effective [10].
Although the presence of dictionaries and corpora is beneficial, historical documents present a challenge. For instance, the historical spelling of a word differs from the contemporary spelling, there is often no strict orthography, and there may be frequent misspellings [12]. Figure 1 shows a word image from one of the datasets used in this paper. This historical word has an extra character compared to the current spelling. Moreover, for rare languages, e.g., Aymara [13], a complete lexicon does not exist yet, and corpora are of very limited size. Handwritten-text recognition (HTR) is precisely what is required to obtain digital linguistic resources for such a language.
Another possible direction to improve performance would be a heavy optimization of the network architecture and training (hyper)parameters. The state-of-the-art approaches can be sensitive to the choice of hyperparameter values. As an example, it is reported that increasing the depth of a neural network that consists of convolutional and LSTM layers from 8 hidden layers to 10 is advantageous, whereas further enlarging it to 12 hidden layers yielded unsatisfactory results [14]. From the perspective of e-Science services for handwriting recognition dealing with hundreds of different books, it is not feasible to tailor the recognizer models for each book based on prior knowledge, using human handcrafting of neural networks. Preferably, an ensemble consisting of a limited number of automatically generated architectures would be used.
In this paper, we explore the possibilities of exploiting the success of current CNN/LSTM approaches, using several methods at the level of linguistics and labeling systematics, as well as an ensemble method. Ideally, the approach should be robust, require a minimum of human intervention with a limited set of hyperparameter settings (architectures), and need minimal linguistic resources. For evaluation, we use a standard public benchmark dataset, RIMES [15], and a historical handwritten dataset, KdK [16,17]. The two datasets differ in time period and language.
An essential consideration is that it should be possible to add our suggested algorithm to the Monk system [16,18-20]. Monk is a live web-based search engine for word and character recognition, retrieval and annotation. It contains diverse digitized historical and contemporary handwritten manuscripts in many languages: Chinese, Thai, Arabic, Dutch, English, Persian. Complicated machine-printed documents, such as German Fraktur, Egyptian hieroglyphs, and historical language, are also available in the Monk system.
The rest of this paper is structured as follows. In section 2, we briefly survey the related work in terms of recent state-of-the-art methods on RIMES, the convolutional recurrent neural network, word search in character-hypothesis grids, ensemble systems, and requirements of the proposed method. In section 3, we present our system. The experimental evaluation and discussion are given in sections 4 and 5. Finally, conclusions are drawn in section 6.

Related Work
In this section, we first briefly survey the recent studies that worked on isolated words of the RIMES dataset. Then, a convolutional recurrent neural network is briefly described. Afterwards, we survey part of the long history of word search in character-hypothesis grids and linguistic post-processing. As an example of these approaches, we explain a dual-state word-beam search for CTC decoding, which is one of the principles underlying our work. Finally, we review research on ensemble-system approaches.

On RIMES
One of the datasets used in this paper is RIMES [15]. In this section, the methods we compare against are explained briefly. In [21], a 12-layer convolutional neural network (CNN) is used to process fixed-size word images and recognize a Pyramidal Histogram of Characters (PHOC) representation [22], using multiple parallel fully connected layers. Afterwards, Canonical Correlation Analysis (CCA) [23] is applied as a final stage of the word recognition task, using a predefined lexicon.
In [8], two architectures are used to generate more than a thousand networks to construct an ensemble. Each network is either a two-layer BiLSTM or a three-layer multidimensional LSTM (MDLSTM) neural network [4]. The BiLSTMs are fed with HOG features [24], and the input of the MDLSTM is the raw image. The best-path algorithm [25] is applied for CTC decoding. This approach uses a lexicon verification method. After training 2,100 networks and evaluating them on the validation set of the RIMES dataset, the lowest-performing networks are removed, which results in 118 networks. It is reported that the pruned ensemble of 118 networks has a 0.16 pp drop in performance compared to the ensemble of 2,100 networks on the RIMES dataset. On another dataset, IAM [26], the size of the ensemble is different (n_rec = 1,039). Because of the simplicity of the system and the high number of recognizers, the complexity is medium to high.
In [27], an ensemble of eight recognizers is used for handwriting recognition, which includes four variants of an MDLSTM, a grapheme-based MLP-HMM, and two variants of a context-dependent sliding-window GMM-HMM. The ensemble combination is a simple sum rule.
In [28], a framework consisting of a deep CNN, LSTM layers as encoder/decoder, and an attention mechanism for isolated handwritten-word recognition is given. The result is reported with and without a dictionary. For pre-processing, methods for baseline correction, normalization, and deslanting are applied. After pre-processing, an input image is converted to a sequence of image patches by using a horizontal sliding window. Then, a deep CNN is used for feature extraction. Afterwards, an LSTM is applied to extract the horizontal relationships existing among a sequence of overlapping horizontal patches of the input image. Then, a decoder component is used, which is a combination of an LSTM and an attention mechanism. To find the best performance, experiments are done to determine the optimal LSTM cell size and patch size. This method does not have very high performance.
In [29], a whole-word CNN is applied to recognize known words, defined as the 500 most frequent words in the training set of the RIMES dataset, which have a minimum confidence level of 70%. Otherwise, a Block Length CNN predicts the number of symbols in the given image block. Then, a fully convolutional neural network predicts the characters. Finally, the result is enhanced by a vocabulary-matching method. This varied-CNN method has a problem with separating common and non-common words. The separation of the lexicon into a set of common and a set of uncommon words may be artificial, in view of the usual continuously decaying Zipf distribution [30].
In [31], deslanting and slope normalization are performed on images, using the approach presented in [32]. A pre-trained CNN-RNN is used. During training and testing on benchmark datasets, three types of augmentation are used: affine transformation, elastic distortion, and multi-scale transformation. Then, the best result of their seven approaches is reported. Before that, image augmentation during training and testing was used in [33,34] for animal recognition.
The successful methods applied to the RIMES dataset are unfortunately quite complicated. Most of them use a combination of CNNs and LSTMs. Therefore, we treat convolutional recurrent neural networks in the next section.

Convolutional Recurrent Neural Network
The convolutional recurrent neural network is an end-to-end trainable system presented in [35]. It outperforms the plain CNN in four aspects: 1) it does not need precise annotation for each character, and it can handle a string of characters for the word image; 2) it works without a strict preprocessing phase, hand-crafted features or component localization/segmentation; 3) it benefits from the state-preservation capability of a recurrent neural network (RNN) in order to deal with character sequences; 4) it does not depend on the width of the word image; only height normalization is needed.
The model is composed of seven convolutional layers followed by two layers of BiLSTM units containing 256 hidden cells and a transcription layer. Although the model is made up of two distinct neural network varieties, it can be trained integrally using one loss function. Figure 2 shows the pipeline of the convolutional recurrent neural network [35]. The input of the model is a height-normalized, gray-scale word image. The feature extraction is performed by the convolutional layers directly from the input image. The output of the CNN is a sequence of feature frames, which acts as the input of the recurrent neural network, which in turn provides raw character hypotheses. Finally, the transcription layer translates the resulting prediction into a label sequence.

Word search and linguistic post-processing
Character-oriented approaches create a data structure representing the character hypotheses, their position in the text and the confidence value. For example, an LSTM produces a final map with character hypothesis activations, ordered from left-to-right or right-to-left with some stride (step size). Other approaches generate a grid or graph of character hypotheses. The final processing step involves finding the most likely character path, given a dictionary and potential other linguistic resources (statistics). For the LSTM, a well-known first step toward this is connectionist temporal classification (CTC) [11].
Figure 2: The architecture of a convolutional recurrent neural network is composed of three components: convolutional layers, recurrent layers and a transcription layer. The phases are as follows: First, feature extraction is carried out by the convolutional layers directly from a height-normalized, gray-scale word image. Secondly, for each frame, prediction of the label distribution is performed by the RNN layers. Thirdly, the transcription layer transcribes the resulting prediction into a label sequence [35].
Given a dictionary containing possible input words, an easy method can be used for error detection and correction of a word recognizer. If the word hypothesis exists in the dictionary, the result is accepted as the label of the input image. Otherwise, if a similar word exists in the dictionary, it can be accepted as the final label candidate by using the Levenshtein distance and its variants [36][37][38][39], or n-gram distances [40], as common measures for comparing (dis)similarity. If required, it is possible to use suitable linguistic statistics to further refine the ranking [41][42][43].
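The dictionary-based verification and correction just described can be illustrated with a minimal sketch. It uses the plain Levenshtein distance; the function names and the threshold `max_distance` are hypothetical choices made for illustration and are not taken from the cited works.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_with_dictionary(hypothesis, dictionary, max_distance=2):
    """Accept a hypothesis found in the dictionary; otherwise return the
    closest entry within max_distance (an assumed threshold), else None."""
    if hypothesis in dictionary:
        return hypothesis
    if not dictionary:
        return None
    best = min(dictionary, key=lambda word: levenshtein(hypothesis, word))
    return best if levenshtein(hypothesis, best) <= max_distance else None
```

For example, with the dictionary `["bonjour", "merci"]`, the hypothesis `"bonjor"` would be mapped to `"bonjour"`, while a hypothesis far from every entry is rejected.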
A data structure for contextual word recognition is presented in [44] for quick dictionary look-up using limited memory.
An approach of providing contextual information by giving a dictionary to predict the most probable label in a graph search is presented in [45], which is robust to dictionary errors. In this approach, for every lexical word, the most probable path and related confidence is calculated to predict dictionary ranking.
Shannon [46] [47] was one of the first researchers working on the letter prediction task. Based on this idea, using a trainable variable memory length Markov model (VLMM), a linguistic post-processing model for character recognizers is introduced in [48]. The next character is predicted by a variable length window of previous characters.
In [49], a statistical n-gram language model of syllables is trained on linguistic corpora. In [50], for Japanese mail addresses, a character recognition method uses a dictionary stored in a trie. The dictionary matching is controlled by a beam-search approach. The dictionary includes all the address names and principal postal offices in Japan. After pre-processing and segmentation, character hypotheses are produced by combining successive segments. Then, a version of a nearest-neighbor classifier that exploits the trie structure is made for a fast prediction of the final label. In [51], an on-line handwriting recognition system for cursive words uses simple character features to reduce a given large dictionary. The outputs of a Time-Delay Neural Network (TDNN) are converted into a character sequence. The result of the system is a matched word in the reduced dictionary, using a variant of the Damerau-Levenshtein distance. For on-line handwriting recognition, a search technique is proposed in [52], which is a post-processing phase of a recognition system that calculates posterior probabilities of characters based on Viterbi decoding.
In [53], a version of a beam and Viterbi search recognizer is presented. This search method makes use of the discrete probabilities generated by many stroke-based character recognition systems. [54] introduces a technique combining word segmentation and character recognition with lexical search to deal with segmentation ambiguities. A depth-first trace of a dictionary tree for text recognition, using a recursive procedure, is presented in [55]. For on-line handwriting recognition, in [56], a given dictionary is reduced by applying simple feature extraction. Afterwards, the reduced dictionary is refined by AI techniques. In [57], contextual knowledge is used for isolated cursive handwritten-word recognition. A dictionary tree representation with an efficient pruning method, as a fast search method in a large dictionary for an on-line handwriting recognition system, is proposed in [58].
Of all these approaches, a dual-state word-beam search for CTC decoding [10] currently enjoys increased interest, and will be described next.

A dual-state word-beam search for CTC decoding
The dual-state word-beam search for CTC decoding [10] is based on Vanilla Beam Search Decoding (VBS) [59] for decoding of the CTC layer. The output of the RNN is a matrix, and it is the input of the dual-state word-beam search method. In the dual-state word-beam search, a prefix tree is made of the ground-truth labels of the training set. The search has two states: word state and non-word state (Figure 3). The next character of the current beam is either a word-character or a non-word-character, and it determines the subsequent state of the beam. The sets of word-characters and non-word-characters are predefined.
Figure 3: The dual-state word-beam search for CTC decoding [10] used for our proposed system.
The temporal evolution of a beam depends on its state. A beam in the non-word state can be extended by a non-word-character, in which case it stays in the non-word state. A word-character brings the beam to the word state; such a word-character is the beginning of a word. For a beam in the word state, the feasible following characters are given by the prefix tree. This procedure repeats iteratively until a complete word is reached.
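The state-dependent extension of a beam can be sketched with a small prefix tree. This is only an illustration of the constraint, not the decoder of [10]: the class and function names are hypothetical, beam scoring is left out, and the rule that a beam may leave the word state only after a complete dictionary word is an assumption of this sketch.

```python
END = None  # sentinel key marking that a dictionary word ends at this node


class PrefixTree:
    """Prefix tree (trie) built from a list of dictionary words."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for char in word:
                node = node.setdefault(char, {})
            node[END] = True

    def _find(self, prefix):
        node = self.root
        for char in prefix:
            node = node.get(char)
            if node is None:
                return None
        return node

    def next_chars(self, prefix):
        """Characters that can follow the prefix inside some dictionary word."""
        node = self._find(prefix)
        if node is None:
            return []
        return sorted(char for char in node if char is not END)

    def is_word(self, prefix):
        node = self._find(prefix)
        return node is not None and END in node


def allowed_next_chars(current_word, in_word_state, tree, non_word_chars):
    """Characters by which a beam may be extended, given its state."""
    if not in_word_state:
        # Non-word state: stay with a non-word-character, or start a word.
        return sorted(set(non_word_chars) | set(tree.next_chars("")))
    allowed = set(tree.next_chars(current_word))
    if tree.is_word(current_word):
        # Assumption: a complete word may be followed by a non-word-character.
        allowed |= set(non_word_chars)
    return sorted(allowed)
```

With the words "to", "too" and "the", a beam holding "t" in the word state can only be extended by "h" or "o", whereas a beam holding "to" can also be closed by a non-word-character.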
Scoring can be done in four ways:
1. Words: A dictionary is used without employing a language model (LM).
2. N-grams as LM: As a beam goes from the word state to the non-word state, the LM scores the beam labeling.
3. N-gram+forecast: As a word-character is appended to a beam, the prefix tree presents all possible words. The LM scores all of the relevant beam extensions.
4. N-gram+forecast+sample: To restrain the following potential words, first some samples are randomly selected. Then, the LM scores them. The total score value has to be refined to account for the random-sampling step.
The pseudo code of the dual-state word-beam search is illustrated in Algorithm 1. The list of symbols is as follows.
Algorithm 1: The dual-state word-beam search for CTC decoding [10].
- In RNNs, such as LSTMs, the exact alignment of the observed word image with the ground-truth label is not clear. Hence, a probability distribution at each time step is used for prediction, which makes it more important to use an adequate coding scheme.
However, even after the CTC stage, additional processing steps from the above-mentioned repertoire are needed to boost classification performance.
Unfortunately, although using linguistic resources is clearly advantageous, there are cases where this is not, or only partly, possible:
- Not all problems enjoy the presence of abundant or digitally encoded contemporary text content.
- In historical collections there may be virtually no resources, not even a lexicon.
- Many collections, e.g., administrative ones, have a dedicated jargon, abbreviations and non-standard phrasing. Even diaries may contain idiosyncratic neologisms.
- Many collections have outdated geographical and scientific terminology, such as the historical document collection that belonged to the Natuurkundige Commissie's scientific exploration of the Indonesian Archipelago between the years 1820 and 1850 [60]. This heterogeneous handwritten manuscript contains 17,000 pages of field notes based on the scientists' natural observations in German, Latin, Dutch, Malay, Greek, and French. Biological terms vary greatly over periods in history [61].
There is, however, an additional way to improve the classification performance. Impressive results using an ensemble method were presented in [8]; however, the number of networks was so large (118) that the need for a less drastic approach is becoming urgent. We will therefore focus on the possibilities of a small-scale ensemble.

Ensemble system
A simple but effective method for improving the performance of an individual classifier is the ensemble method [27,62-70]. In [63], it is shown that having diverse classifiers is a key point for classifier fusion. Using ensembles for handwriting recognition with hidden Markov models (HMMs) as basic word classifiers, [64] compares different ensemble creation methods (Bagging, AdaBoost, Half & half bagging, random subspace, architecture) as well as different voting combination methods for the handwriting-recognition task. It is shown that each of the four methods increases the performance. The impact of the dictionary size, the training-set size and the number of recognizers in ensemble systems is studied for off-line cursive handwritten-word recognition in [65]. The ensemble methods are Bagging, AdaBoost and the random subspace method, while the recognizers are HMMs with different configurations. It is verified that increasing the size of the training set and the number of recognizers elevates the performance of the system, while a larger dictionary lowers the performance.
Recently, in [66], ensemble classifiers were used for Persian handwriting recognition. AdaBoost and Bagging were used to combine weak classifiers created from hand-crafted families of simple features.
In the deep learning domain, [67] obtained very high accuracy for Chinese handwritten character recognition using deep convolutional neural networks and a hybrid serial-parallel ensembling strategy which tries to find an "expert" network for each example that can classify the example with a high accuracy, or if such a network cannot be found, falls back to the majority vote over all networks.
In [27], an ensemble system is used for handwriting recognition on the RIMES [71] dataset. The ensemble uses eight recognizers: four variants of a recurrent neural network (RNN), a grapheme-based MLP-HMM, and two variants of a context-dependent sliding window based on a GMM-HMM. For the RNN, a multi-dimensional long short-term memory neural network (MDLSTM) [4] is used.
In an ensemble system, majority voting can be used if the output of each individual recognizer is only the best hypothesis label. If the recognizers of the ensemble system output a ranked list of hypotheses, a Borda count can be used [68,69] to determine the result. In this case, it is required that the ranked list shows a sufficient diversity of intuitive candidates, i.e., with a low edit distance from the target. Two ensemble systems of handwriting-recognition methods are presented in [70]: one using word-list merging and one using linear combination.
The good results reported in the literature are often based on a fairly complex system with many hyperparameters. In an e-Science service such as Monk, which currently holds about 530 different manuscripts, it is clear that human supervision and detailed selection of hyperparameters for each of those documents by human handcrafting are impossible.

Method
In this section, we present a limited-size ensemble system for word recognition with a minimum of human intervention. The suggested system uses an adequate label-coding scheme and a dictionary as the only resource for the language model. The system is described as follows.

The Extra-separator coding scheme
In the common coding scheme, which we call 'Plain', only the characters which are present in the word image appear in the corresponding label. In the 'Extra-separator' coding scheme, one more character is appended at the end of each label. The appended character, named the extra separator (e.g., '|'), must not exist in the alphabet of the dataset. The aim of adding the extra-separator character is to give the recognizer an extra hint concerning the end-of-word shape condition.
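As an illustration, the two coding schemes can be expressed in a few lines. The sketch below picks a separator that is absent from all labels and appends it to each label; the candidate list and the function names are hypothetical and only serve the example.

```python
def choose_extra_separator(labels, candidates="|#~^@"):
    """Return the first candidate character that occurs in no label.

    The candidate list is an assumption made for this sketch."""
    alphabet = set("".join(labels))
    for char in candidates:
        if char not in alphabet:
            return char
    raise ValueError("no unused separator character among the candidates")


def encode_extra_separator(labels, separator):
    """Extra-separator coding: append the separator to every Plain label."""
    return [label + separator for label in labels]
```

With the Plain labels `["bonjour", "merci"]`, the separator `'|'` is selected and the Extra-separator labels become `["bonjour|", "merci|"]`.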

Neural Network
The neural network is a convolutional BiLSTM neural network, and it is an end-to-end trainable framework inspired by [35]. The main configuration of the networks is detailed in Table 1. In this section, we explain the essential components of our approach.

Pre-processing
The preprocessing is performed in each epoch of training. It consists of: a) data augmentation through randomly stretching/squeezing the gray-scale images in the width direction, b) re-sizing the images to 128 × 32 and c) normalization. Data augmentation is performed to increase the size of the training set, and it is achieved by changing the width of an image randomly by a factor between 0.5 and 1.5. Next, both the original gray-scale images and those added through data augmentation are re-sized so that either the width is 128 pixels or the height is 32 pixels. After that, we pad the image with white pixels until the size is 128 × 32. Then we normalize the intensity of the gray-scale image. Note that our method does not need baseline alignment or precise deslanting. Please note that one of our datasets was already deslanted to 90 degrees.
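The geometry of these steps can be sketched as follows. The code only computes the stretch factor, the re-sized dimensions and the amount of white padding; it does not touch pixel data, and intensity normalization is omitted because its exact form is not specified here. The function names are hypothetical, and preserving the aspect ratio of the stretched image during re-sizing is an assumption of this sketch.

```python
import random

TARGET_WIDTH, TARGET_HEIGHT = 128, 32


def sample_stretch(rng: random.Random) -> float:
    """Random width factor for augmentation, drawn from [0.5, 1.5]."""
    return rng.uniform(0.5, 1.5)


def fit_to_target(width: int, height: int, stretch: float = 1.0):
    """Return (new_width, new_height, pad_width, pad_height).

    The image is first stretched in width, then scaled (aspect ratio kept,
    an assumption) until the width reaches 128 or the height reaches 32;
    the remaining area is to be filled with white pixels."""
    stretched_width = max(1, round(width * stretch))
    scale = min(TARGET_WIDTH / stretched_width, TARGET_HEIGHT / height)
    new_width = min(TARGET_WIDTH, max(1, round(stretched_width * scale)))
    new_height = min(TARGET_HEIGHT, max(1, round(height * scale)))
    return (new_width, new_height,
            TARGET_WIDTH - new_width, TARGET_HEIGHT - new_height)
```

A wide image of 256 × 32 pixels is thus scaled to 128 × 16 and padded vertically, whereas a 64 × 64 image is scaled to 32 × 32 and padded horizontally.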

A 5-layer CNN
The pixel-intensity values after preprocessing are fed to the first of the five layers of a CNN to extract feature sequences. Each layer of the CNN contains a convolution operation, normalization, the ReLU activation function [72], and a max-pooling operation. The size of the kernel filters in each layer is 3 × 3. Given the fixed setting of the important hyperparameters, such as the number of layers, the only variable control parameters concern the number of units in the hidden layers. A simple table of three possible sizes, {128, 256, 512}, is used, each with a random probability of 0.33, for selecting the number of hidden units. The numbers of hidden units used in our experiments are shown in Table 2. The number of layers, the kernel size and the optimizer are our own configuration and differ from [35]: instead of the ADADELTA optimizer [73] used in [35], we used RMSProp [74], and we used five convolutional layers instead of the seven suggested in [35].
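A minimal sketch of such an automatic architecture generation is given below. It assumes that the sizes of the first four convolutional layers are drawn uniformly from the table and that the last layer is fixed at 512 units, which is our reading of the description; the function name is hypothetical, and the architectures actually used are those listed in Table 2.

```python
import random

HIDDEN_SIZES = (128, 256, 512)


def sample_cnn_widths(rng=None, n_layers=5, last_width=512):
    """Draw the number of units per convolutional layer.

    Layers 1 to n_layers-1 are sampled uniformly from HIDDEN_SIZES;
    the last layer is fixed (an assumption of this sketch)."""
    rng = rng or random.Random()
    widths = [rng.choice(HIDDEN_SIZES) for _ in range(n_layers - 1)]
    widths.append(last_width)
    return widths


# Five randomly generated architectures, one per ensemble member.
rng = random.Random(0)
architectures = [sample_cnn_widths(rng) for _ in range(5)]
```

Passing a seeded `random.Random` instance makes the generated set of architectures reproducible.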

BiLSTM
The five convolutional layers are followed by three layers of BiLSTM. Because the last convolutional layer contains 512 hidden units, each BiLSTM layer has 512 hidden units.

Connectionist temporal classification (CTC)
The CTC output layer contains two units more than there are characters in the alphabet ($A$) of the given dataset: the suggested extra separator (e.g., '|'), and the common blank for CTC, which differs from the space character. Therefore, the alphabet of the CTC output is $A' = A \cup \{\text{extra separator}\} \cup \{\text{blank}\}$. The $|A| + 1$ non-blank output units determine the probability of detecting the relevant label at each time step. Further, the blank unit determines the probability of observing blank, or 'no label'. For CTC decoding, we use the dual-state word-beam search presented in [10]. This method is explained in section 2.3.1.
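To make the output alphabet concrete, the sketch below builds the symbol table and decodes a matrix of per-frame probabilities with the simple dictionary-free best-path rule [25]: take the most probable unit per frame, collapse repeats, and drop blanks. It is not the dual-state word-beam search; the function names and the choice of placing the blank at the last index are assumptions of this illustration.

```python
def build_ctc_alphabet(dataset_chars, separator="|"):
    """Return (symbols, blank_index) for the CTC output layer.

    symbols holds the dataset alphabet plus the extra separator; the blank
    is assigned the last index (an assumption of this sketch), so the
    layer has len(symbols) + 1 units."""
    alphabet = sorted(set(dataset_chars))
    if separator in alphabet:
        raise ValueError("the extra separator must not occur in the dataset")
    symbols = alphabet + [separator]
    return symbols, len(symbols)


def best_path_decode(frame_probs, symbols, blank_index):
    """Best-path decoding: argmax per frame, merge repeats, remove blanks."""
    decoded = []
    previous = None
    for probs in frame_probs:
        index = max(range(len(probs)), key=probs.__getitem__)
        if index != previous and index != blank_index:
            decoded.append(symbols[index])
        previous = index
    return "".join(decoded)
```

For an alphabet of 80 characters this yields 81 symbols plus the blank, i.e., 82 output units.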

The ensemble system
For an input image, the outcome of the CTC decoder is a string, i.e., a word hypothesis, with its relative likelihood. The word hypotheses obtained from the five networks are sent to the voter component. Plurality voting is then applied [75], where the alternatives are divided into subsets with identical strings. The subset with the largest number of voters is selected. In case of a tie, the subset with the highest average likelihood is the winner. If the number of subsets is equal to the number of alternatives, the alternative with the highest likelihood is the winner. The winning string is considered the final, best label of the input image. This approach was chosen after a pilot experiment using Borda-count voting, which did not give good results. This may be due to the lack of diversity in the ranked candidate lists. Therefore, the simpler approach, using plurality voting with exception handling, was used.
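The voting rule can be sketched compactly. In the code below, ranking the subsets by (size, average likelihood) covers all three cases at once: when every string is different, each subset has size one and the average equals the single likelihood. The function name and the input format, a list of (string, likelihood) pairs, are assumptions of this sketch.

```python
from collections import defaultdict


def plurality_vote(hypotheses):
    """Select the winning string from (string, likelihood) pairs.

    Subsets of identical strings are ranked by their number of voters;
    ties are broken by the highest average likelihood."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    subsets = defaultdict(list)
    for text, likelihood in hypotheses:
        subsets[text].append(likelihood)

    def rank(item):
        _, likelihoods = item
        return len(likelihoods), sum(likelihoods) / len(likelihoods)

    winner, _ = max(subsets.items(), key=rank)
    return winner
```

For instance, the five hypotheses ("chat", 0.6), ("chat", 0.5), ("chot", 0.9), ("char", 0.4) and ("chat", 0.3) yield "chat", since three of the five networks agree, even though "chot" has the highest individual likelihood.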

Results
In this section, firstly, we describe the datasets used in the experiments. Then, we explain how our experiments were carried out. Finally, we report the numerical results.

Datasets
In this paper, we used two datasets which differ in time period and language, summarized in Table 3. The first dataset is RIMES, which was used to allow comparison with the state-of-the-art methods. This database has different versions. We used the isolated words of the ICDAR 2011 version for the evaluation of the methods, which makes a comparison with the published results possible [15]. The RIMES database is drawn from different types of handwritten manuscripts: postal mail and faxes. It contains 12,723 pages written by 1,300 volunteers using black ink on white paper. The RIMES dataset consists of 51,738 images of French handwriting for training, 7,464 images for validation and 7,776 images for testing. The dictionary size of the training set is 4,943 words, that of the validation set is 1,612 and that of the test set is 1,692; the dictionary size of the whole dataset is 5,744 words. The comparison is case-insensitive, as is common for the RIMES dataset, and accents were taken into account. In the evaluation process of our model on RIMES, two dictionaries were used: Concise and Large. The Concise dictionary contains all the words within the RIMES dataset, n_words = 5,744 (6K). A French dictionary called Large (50K) is used to study the effect of a larger dictionary.
The second dataset belongs to the National Archive of the Netherlands and is named KdK (Het Kabinet der Koningin, or Dutch Queen's Office) [16,17]. The manuscript was written between the years 1798 and 1988; the year 1903 was used. The KdK dataset contains 172,440 word images. The number of word classes of the total dataset is 11,749 and 10,747, case-sensitively and case-insensitively, respectively. Regardless of case-sensitivity, there are 1 to 5,628 sample(s) in each class. The length of the word samples is 1 to 28 characters. In the case-sensitive manner, 5% of the test words do not occur in the training words and are 'out of vocabulary' (OOV). In the case-insensitive manner, the OOV rate is 4.5%. The remaining words are referred to as 'in vocabulary' (INV). Figure 4 shows four original samples of the KdK dataset. For evaluation, two dictionaries are used: Concise and Large. The Concise dictionary contains all the words in the KdK dataset (12K); the size of the Dutch Large dictionary is 384K [76].
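The OOV percentage can be computed as sketched below, in both the case-sensitive and the case-insensitive manner. Whether the reported figures are counted over word samples (tokens) or over word classes is not stated here; the sketch counts test samples, which is an assumption, and the function name is hypothetical.

```python
def oov_rate(train_words, test_words, case_sensitive=True):
    """Fraction of test samples whose word does not occur in the training set."""
    normalize = (lambda word: word) if case_sensitive else str.lower
    vocabulary = {normalize(word) for word in train_words}
    if not test_words:
        return 0.0
    missing = sum(1 for word in test_words if normalize(word) not in vocabulary)
    return missing / len(test_words)
```

Lower-casing merges classes such as "De" and "de", which is why the case-insensitive OOV rate cannot exceed the case-sensitive one.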

Quantitative results
In this section, we evaluate our model on the RIMES and the KdK datasets in terms of coding scheme (Plain vs Extra separator) and ensemble/single network. Moreover, for the RIMES dataset, the results of our model is compared with the-state-of-the-art methods suggested in [8,21,[27][28][29]77]. In [31], very good result are reported. However, their system was trained with a large amount of synthetic data. Therefore, we do not find it comparable with our approach, which is exclusively based on the the given dataset, and its augmentations.
For the Extra-separator coding scheme, a character which is absent in the given dataset was found automatically as the extra-separator character, the bar sign (|); hence,the bar sign is annexed to the end of each image label, Figure 4. As a result, the size of the output of the CTC layer increases. The RIMES dataset contains 80 unique characters. Meaning that the size of the output layer of the CTC layer is 82 (80 unique character, one extra separator, and one common blank). The KdK dataset contains 52 unique characters. Therefore, the size of output layer of CTC layer is 54 (52 unique character, one extra separator, and one common blank). We compare the result of this addition to the Plain coding scheme. Two CTC decoder methods are used: dictionary-free (Best path) and with dictionary (dual-state word-beam search [10]). For the dual-state word-beam search, two dictionaries are used for each dataset; Concise and Large. Table 4 shows the effect of the two coding schemes, single recognizer and ensemble voting on the RIMES dataset showing word accuracy (%). For each of the two coding schemes (Plain and Extra separator), the five architectures were trained, which resulted to 10 trained networks. Then the networks were evaluated using the Best-path CTC decoder and the dual-state word-beam search CTC decoder applying the Concise (6K) and the Large (50K) dictionaries. The result of each evaluation and the relative average ± standard deviation (avg ±sd) are reported. In the bottom row of the Table 4, the voting-based result of the ensemble of the five networks is presented.
Best path vs Dual-state word-beam search: the results confirm that using a decoder with dictionary considerably improves the performance (95-97%) as expected (t-test, p < 0.05, significant). The dictionary-free Best-path CTC decoder is given a low performance, still at 88-89%. Moreover, when the dual-state word-beam search CTC decoder is used, adding an extra-separator character enhances the model.
Plain vs Extra separator: for the Best-path CTC decoder, both Plain and Extra separator have an average of 84.5% (t-test, p > 0.05, N.S.). Therefore, the extra separator has no effect. However, for the dual-state word-beam search CTC decoder using the Concise dictionary, Plain has an average of 94.3% and Extra separator has an average of 95.2% (t-test, p < 0.05, significant). Hence, the extra separator is effective. For the dual-state word-beam search CTC decoder using the Large dictionary, Plain has an average of 92.9% and Extra separator has an average of 94.1% (t-test, p < 0.05, significant). Therefore, the extra separator is again effective in the case of a large dictionary.
Single network vs Ensemble: ensemble voting increases the performance, and its effect is larger for a weaker recognizer (4 pp increase in performance for the dictionary-free CTC decoder using the Plain/Extra separator coding scheme, final row vs average and individual). An ensemble of five recognizers, using the CTC decoder with the Concise dictionary combined with the Extra-separator coding scheme, results in the highest performance (96.6%, column 6, bottom).
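Plurality voting over the word hypotheses of the ensemble can be sketched as follows. The tie-breaking rule is not specified in this section, so the rule used here (highest summed confidence among the tied words, otherwise the word proposed first) is purely an assumption for illustration, as is the function name `plurality_vote`.

```python
from collections import Counter


def plurality_vote(predictions, confidences=None):
    """Return the most frequent word hypothesis (sketch).

    predictions: one word hypothesis per network, in a fixed network order.
    confidences: optional per-network scores, used only to break ties
                 (assumed rule, not necessarily the one used in the paper).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    # Counter keeps insertion order, so tied words stay in network order.
    tied = [word for word in counts if counts[word] == top]
    if len(tied) == 1:
        return tied[0]
    if confidences is not None:
        totals = {word: 0.0 for word in tied}
        for word, score in zip(predictions, confidences):
            if word in totals:
                totals[word] += score
        return max(tied, key=lambda word: totals[word])
    # Fallback: the tied word proposed by the earliest network.
    return tied[0]
```

With five voters, a tie can only arise as a 2-2-1 split or when every network proposes a different word, so the tie-breaking rule is needed in relatively few cases.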
To study the effect of the number of networks in the ensemble on the final accuracy, the results of randomly selected 1, 3, 5, 10 and 15 network(s) are shown in Figure 5 for the RIMES dataset. The coding scheme is Extra separator, and the CTC decoder is the dual-state word-beam search using the Concise dictionary. The networks in the ensemble only differ in the random initialization and the number of units over the layers, also randomly selected from the set n = {128, 256, 512} in 1 through 4. The maximum accuracy is obtained by the ensemble of 15 networks, 96.72%, which is just 0.09 pp more than using 10 networks. Table 5 shows the comparison of our method on the RIMES dataset with [8,21,27,28,29,77] in terms of a number of characteristics: number of recognizers, homogeneity of the algorithm, word accuracy (%) and the complexity of the approach, not to be confused with ...

Table 5: The comparison of our system to the state-of-the-art systems on the RIMES dataset in terms of number of recognizers (n_rec), homogeneity of the algorithm (Hom.), complexity of the approach (Compl.), and word accuracy (%) (word_acc). Please refer to the text for further explanation.

For the KdK dataset, the results are as follows. The samples of the KdK dataset for our experiment were binarized, then sheared 45 degrees in the anticlockwise direction, because the slant angle in this style is approximately 45 degrees. Afterwards, the white borders of the images were removed horizontally and vertically, up to the first black pixel. In Figure 6, the deslanted, border-cropped images are shown. To derive a more accurate estimate of the performance of our model, we ran 5-fold cross-validation. Each architecture, A_i, where i = 1 to 5, is trained using either the Plain coding scheme or the Extra separator, resulting in 50 trained networks (5×5×2).
Then, each network is tested three times: using the dictionary-free Best-path CTC decoder, and using the dual-state word-beam search CTC decoder with the Concise (12K) and the Large (384K) dictionaries. Table 6 shows the average (avg) and standard deviation (sd) of word accuracy (%) of the five architectures using 5-fold cross-validation, varying per architecture over the following parameters: dictionary (none, Concise, Large) and coding scheme (Plain, Extra separator) (5 × 3 × 2). Each row is derived from 30 network evaluations. In other words, each row is the result of one architecture, regardless of the CTC decoding method, dictionary, and coding scheme used. Slightly lower performance is expected, as the Best-path CTC decoder pulls the average down. A similar result is obtained for each coding scheme, regardless of the CTC decoding method, dictionary, and architecture used. The Extra-separator coding scheme has a higher performance, 94.5%, which is 0.4 pp higher than the Plain coding scheme.
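The counts quoted for the KdK experiments (50 trained networks, 30 evaluations per architecture row, 50 evaluations per dictionary row) follow from the experimental grid. The sketch below simply enumerates that grid; the labels such as "A1" and "Word-beam (Concise)" are placeholder names chosen for illustration.

```python
from itertools import product

ARCHITECTURES = ["A1", "A2", "A3", "A4", "A5"]
FOLDS = [1, 2, 3, 4, 5]
CODING_SCHEMES = ["Plain", "Extra separator"]
DECODERS = ["Best path", "Word-beam (Concise)", "Word-beam (Large)"]

# One network is trained per (architecture, fold, coding scheme).
trained = list(product(ARCHITECTURES, FOLDS, CODING_SCHEMES))

# Every trained network is evaluated once per decoder setting.
evaluations = list(product(ARCHITECTURES, FOLDS, CODING_SCHEMES, DECODERS))

# Evaluations that feed one architecture row and one dictionary row.
per_architecture = sum(1 for e in evaluations if e[0] == "A1")
per_decoder = sum(1 for e in evaluations if e[3] == "Best path")

if __name__ == "__main__":
    print(len(trained))       # 5 x 5 x 2 = 50 trained networks
    print(len(evaluations))   # 50 x 3 = 150 evaluations in total
    print(per_architecture)   # 5 x 3 x 2 = 30 per architecture
    print(per_decoder)        # 5 x 5 x 2 = 50 per dictionary setting
```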
Table 7 shows the average (avg) and standard deviation (sd) of word accuracy (%) per dictionary on 5-fold cross-validation, varying over the following parameters: architecture (A_i, i = 1 to 5) and coding scheme (Plain, Extra separator) (5 × 5 × 2). Each row is derived from 50 network evaluations. Figure 7 shows the behavior of a single network A_2, using the Extra-separator coding scheme and the dual-state word-beam search CTC decoder, for different word lengths and for the OOV and INV conditions. ... words with 1 to 4 characters. The performance on OOV words that have 5 to 15 characters is highly adequate. For OOV words whose length is between 16 and 20, the performance is variable. Surprisingly, for OOV samples longer than 21 characters, the model has a high performance. Figure 8 shows the word accuracy achieved by network A_2 on one round of 5-fold cross-validation on the KdK dataset. On the X axis, words are sorted in order of increasing relative log frequency in the test set. The blue circles indicate INV words. The dark red circle marks the average accuracy and the log occurrence of OOV words. Note the different 'threads' in the curve, revealing groups of easy and difficult (slow-starting) classes. In a lifelong machine-learning setting, the horizontal axis corresponds to time, starting with just a few examples on the left. The average performance on OOV samples is high, at log(f) = 0, where f is the frequency in the test set (#samples = 34,488).
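A per-length breakdown of the kind plotted in Figure 7 can be computed with a few lines. The sketch assumes that a word counts as in-vocabulary (INV) when it occurs in the training vocabulary and as out-of-vocabulary (OOV) otherwise; this definition and the function name `accuracy_by_length` are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_length(samples, training_vocabulary):
    """Word accuracy (%) grouped by vocabulary status and word length.

    samples: iterable of (ground_truth, prediction) string pairs.
    training_vocabulary: set of words seen during training (assumed
        definition of INV; everything else is treated as OOV).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, predicted in samples:
        status = "INV" if truth in training_vocabulary else "OOV"
        key = (status, len(truth))
        totals[key] += 1
        hits[key] += int(predicted == truth)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}
```

Bins with few samples give noisy percentages, so the counts in `totals` are worth reporting alongside the accuracies when interpreting long-word results.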
Table 8 shows the comparison of the effect of the two coding schemes (Plain and Extra separator) and the CTC decoder application on the ensemble for the five rounds of the cross-validation of the KdK dataset.
Best path vs Dual-state word-beam search: using no dictionary results in more than 93% accuracy. Using a decoder with a dictionary boosts the performance (t-test, p < 0.05, significant). Adding an extra separator enhances the model when a CTC decoder with a dictionary is used.
Plain vs Extra separator: for the Best-path CTC decoder, Plain has an average of 90.6% and Extra separator has an average of 90.7% (t-test, p > 0.05, N.S.). Therefore, the extra separator has no effect. For the dual-state word-beam search CTC decoder using the Concise dictionary (12K), Plain has an average of 96.3% and Extra separator has an average of 96.8% (t-test, p < 0.05, significant). Therefore, the extra separator is effective. For the dual-state word-beam search CTC decoder using the Large dictionary (384K), Plain has an average of 95.5% and Extra separator has an average of 96.1% (t-test, p < 0.05, significant). Therefore, the extra separator is effective in this case as well.
Single network vs Ensemble: ensemble voting increases the performance, and its effect is larger for a weaker recognizer (3 pp increase in performance for the dictionary-free CTC decoder for Plain/Extra separator). An ensemble of five recognizers, using the CTC decoder with the Concise dictionary combined with the Extra-separator coding scheme, results in the highest performance (97.4%). Figure 9 compares the effect of the two coding schemes and of dictionary application on a single architecture and on ensemble voting for the RIMES and the KdK datasets, showing the weighted average. Table 9 shows the average word accuracy (%) on the RIMES and KdK datasets, using the Concise dictionary and the Extra-separator coding scheme.

Discussion
The results indicate that it is possible to achieve a high word accuracy (%) in comparison to the state of the art [8,21,27,28,29,77] with a limited-size ensemble, a homogeneous algorithmic approach and a low complexity (cf. Table 5). In those studies, numerous networks (up to 118 or 2100 network instances) are required in the ensemble, whereas our method uses only five networks, yielding comparable or better results. In the proposed method, feature descriptors such as histogram of oriented gradients (HOG [24]) are not used; the process starts with a pixel image and is trained end to end.
Results also indicate that the average performances of the two coding schemes (Plain and Extra separator) differ significantly if the dual-state word-beam search is used for CTC decoding. In other words, the extra-separator character, '|', tagging the end of the word, boosts the result of the dual-state word-beam search CTC decoding. This increase in performance occurs despite the slight increase of the model size by adding the extra-separator character. However, the effect on the result of CTC best-path decoding, i.e., a non-dictionary method, is limited. On the other hand, using the decoder with dictionary boosts the performance. Finally, ensemble voting clearly improves the word accuracy (%); its effect is stronger on weaker recognizers.
It should be noted that the reported result is based on realistic images with many word-segmentation problems, and therefore can be considered a conservative estimate (cf. Figure 6).
We have shown that medium-length OOV words (5 to 11 characters) profit from training information in the shorter words in the training set (cf. Figure 7). Longer OOV words (11 to 23 characters) profit from the training on words whose length is 1 to 11 characters. Interestingly, OOV words can have a high performance in a range for which there are not many examples (cf. Figure 7). In addition, for INV words shorter than 18 characters, the accuracy is higher than 95%. Therefore, our method recognized the common-length OOV and INV words with a high accuracy. Stated differently, we demonstrate an important finding on a single network: increasing the size of difficult in-vocabulary word classes yields superior results, while the performance on easy in-vocabulary word classes is high even for a limited number of samples.
The goal of this research is not a record attempt towards maximized accuracy on the RIMES and the KdK datasets. Higher performance can undoubtedly be achieved using a larger ensemble. However, our choice for an ensemble of five voting elements results in a compromise with a very good and stable performance. The jump of more than 1 pp in performance from one individual classifier to five classifiers is larger than the increase of less than 0.3 pp from 5 to 10 classifiers, and the increase in performance is even smaller for higher numbers of classifiers in the ensemble, showing diminishing returns.
Furthermore, we have shown that by providing a more than 30 times larger dictionary, only a slight drop in performance occurred. In addition, for the dictionary-free approach, using an ensemble system results in a much higher performance with more stability than a single network. In the higher-performing approach, where a dictionary is used, the relative improvement is present but less prominent. Moreover, as expected from previous research, using the CTC decoder with a dictionary increases the performance of our model compared to a dictionary-free CTC decoder.

Conclusions
This study was aimed at achieving high-performance handwritten word recognition using deep learning, but with a limited cost in terms of network handcrafting, combined with low complexity. Our model consists of an ensemble of just five homogeneous end-to-end trainable recognizers, using plurality voting with a solution for ties. Each recognizer is composed of five convolutional layers and three BiLSTM layers, followed by a CTC layer. Diversity is fostered by varying the number of units in the hidden layers of the CNNs. For CTC decoding, a dual-state word-beam search is applied, using the given dictionary as the only language model. Furthermore, we studied the effects of the dictionary-free Best-path CTC decoding on a single network and on the ensemble. Training the system is done from scratch, exclusively on the given dataset, and data augmentation is not used during testing. The word accuracy (%) of our model is 96.6% on RIMES, and 97.4% on the KdK dataset, a locally collected historical handwritten dataset. Results show that an ensemble size higher than five networks only yields limited further improvement; the method is not very sensitive to diverse network correspondence. Moreover, we showed that using an extra separator in the label-coding scheme boosts the performance, and that this advantage also holds in the case of a large dictionary.
We showed that by providing a dictionary that is approximately 30 times larger, only a slight drop in performance occurred. Ensemble voting improves the performance; its effect is stronger on weaker recognizers. Longer out-of-vocabulary (OOV) words benefit from training information in the shorter words in the training set.
On in-vocabulary word classes, increasing the number of samples yields better results. However, it does not have an effect on easy word classes. The performance of our model is even relatively high for OOV classes in word-length ranges, where there are a limited number of samples in the training set. The suggested method is applicable to e-Science services where it is not feasible to manually tailor hyperparameters, pre-processing and language model for each manuscript based on prior knowledge.
Word-based LSTMs cannot make use of the large textual content. Therefore, as future work, we plan to extend our approach to handle the handwritten line recognition task. Moreover, we will explore the applicability of our model on other datasets with different languages, and increase the performance on out-of-vocabulary words. Furthermore, the challenge of high-performance recognition of long words will be addressed.