
1 Introduction

Detecting regular expressions (REGEX) in handwritten documents is useful for finding sub-strings that are relevant to a further, higher-level information extraction task. It consists in detecting sequences of characters that obey certain rules, described using meta models such as lower case letters (#[a-z]#), upper case letters (#[A-Z]#) or digits (#[0-9]#). For example, a system of that kind could spot entities such as dates (#[0-9]{2}/[0-9]{2}/[0-9]{4}#), first names (#[A-Z][a-z]*#), or the ZIP code and city name of a French postal address (#[0-9]{5} [A-Z]*#). The extraction of this information enables high-level processing stages such as document categorisation, customer identification, named entity detection, etc.
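On ASCII text, such patterns can be matched directly with a standard regular expression engine. As a minimal illustration (the sample string and pattern names are invented for this sketch):

```python
import re

# The three example patterns above, written in standard REGEX syntax.
patterns = {
    "date":     r"[0-9]{2}/[0-9]{2}/[0-9]{4}",  # e.g. 23/05/2011
    "name":     r"[A-Z][a-z]*",                 # e.g. Marie
    "zip_city": r"[0-9]{5} [A-Z]*",             # e.g. 76130 ROUEN
}

text = "Mme Marie Durand, 76130 ROUEN, le 23/05/2011"  # invented sample
for label, pattern in patterns.items():
    print(label, re.findall(pattern, text))
```

The whole difficulty addressed in this paper is that no such error-free ASCII text is available for handwritten document images.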

Spotting regular expressions is a common task on electronic documents, using natural language processing methods [1, 2]. In this case, REGEX spotting is rather straightforward, as it consists in applying exact string matching methods to the ASCII text. When dealing with document images, a recognition step is needed to produce the ASCII transcription before processing the input data. The trouble is that this recognition step is subject to errors and uncertainty, which makes string matching problematic. Some attempts have been made on printed documents [3, 4]. In these works, an OCR is applied to the whole document before the regular expression spotting step, which is based on a set of rules performing exact matching. In spite of OCR errors, the system provides acceptable performance (82 % average precision and 72 % recall).

However, only a few works address regular expression spotting in handwritten documents. The reason is that exact matching methods cannot overcome the frequent recognition errors due to the intrinsic difficulty of recognizing handwriting. Therefore, in order to cope with these errors, inexact matching methods must be used. This can be done with statistical sequence models such as HMMs. Some works have been published within this framework, proposing the spotting of patterns such as dates [5] and numerical fields [6, 7] that involve meta models of characters, namely digits. However, these HMM-based approaches are limited to very specific fields.

A more generic approach for REGEX spotting in handwritten documents has been addressed using a pure HMM approach [8], but it led to moderate results (see Sect. 4 for detailed results and a comparison). This article presents a REGEX spotting system for handwritten documents. It is based on the combination of an HMM statistical sequence model with the state-of-the-art BLSTM neural network. This hybrid BLSTM/HMM model enables us to benefit from both strong local discrimination and the generative sequence modeling ability of the HMM. An alternative to the hybrid system is also proposed in this article, using the BLSTM stage without the HMM model. It consists in applying the BLSTM and searching for the query in the raw recognition results. Surprising results are obtained, namely 100 % precision.

This paper is organized as follows: first, a review of word and REGEX spotting is given in Sect. 2; then we present our REGEX spotting system based on a hybrid BLSTM/HMM in Sect. 3. Section 4 is devoted to the experimental setup and results on both word spotting and regular expression spotting tasks, carried out on the RIMES database [9].

2 Related Work

As a REGEX can match sequences of variable length and content, a REGEX spotting task can be viewed as a word spotting task in which the word belongs to a lexicon containing every character string admissible by the REGEX. The less constrained the REGEX, the larger the lexicon. Relaxing these constraints makes the REGEX spotting task more complex, especially on handwritten document images. As regular expression spotting shares many aspects with word spotting, we now briefly review related work on word spotting approaches.

Word spotting in document images has received a lot of attention in recent years. Systems proposed in the literature fall into two main categories: image-based and recognition-based systems. The first kind, also known as query-by-example, operates on an image representation of the keywords [10-14]. Such systems are therefore ill-suited to omni-writer handwriting and require an image of the query. The second kind, also known as query-by-string methods, deals with the ASCII representation of the keywords [15-19]. Moving from the image representation to the ASCII representation of the query is performed through a recognition stage. These systems are suitable for omni-writer handwriting and can be used with any string query of any size. In this context, many works have focused on variants of Hidden Markov Models (HMMs) to process this intrinsically sequential problem [19].

State-of-the-art recognition-based approaches rely on text line models [18-20]. The line model generally contains a model of the target word, combined with filler models that describe the out-of-vocabulary words. For example, in [20] the authors present an alpha-numerical information extraction system for unconstrained handwritten documents. It relies on a global line model allowing a dual representation of the relevant and the irrelevant information. The acceptance or rejection of the matched keyword is controlled by a hyper-parameter in the HMM line model. A similar approach is presented in [18]. The line model is made of left and right filler models surrounding the word model. The acceptance or rejection of the matched keyword is controlled by a text line score based on the likelihood ratio between the word line model and the filler line model. However, HMMs rely on strong observation independence assumptions and perform poorly on high-dimensional observations. Moreover, they have low discrimination capabilities between character classes due to their inherently generative modeling framework.

Recently, a new approach based on recurrent neural networks has overcome these shortcomings. The Bidirectional Long Short-Term Memory (BLSTM) architecture has demonstrated impressive capabilities for omni-writer handwriting recognition [21]. Early applications of the BLSTM to word spotting have also shown promising results [22, 23]. In these systems, the BLSTM is combined with a CTC layer which provides character class posterior probabilities; a token passing algorithm then allows efficient decoding of the spotting line model. Very interesting results have been reported on the IAM database [22].

In this paper, we combine the BLSTM-CTC architecture with an HMM-based spotting line model. This two-stage architecture is first evaluated on handwritten word spotting on the RIMES database. Then we explore extensions of the system to regular expression spotting (REGEX spotting). The model is described in the following section.

3 BLSTM-CTC/HMM System

In this section, we describe our hybrid model for word and REGEX spotting. We first describe the BLSTM-HMM architecture that has been retained, then we present our word spotting model, based on the standard state-of-the-art word spotting framework. Finally, we propose the adaptation of this model to REGEX spotting.

3.1 Character Recognition and Segmentation

The BLSTM-CTC is a recurrent neural network able to manage long-term dependencies thanks to its internal memory structure. Each output neuron is specialized to spot a specific character in the input signal. The recurrent architecture allows each neuron to take into account previously activated neurons (characters), possibly many time steps earlier in the input signal, thus modeling long-term dependencies. This architecture therefore accounts for character bigrams in addition to the input signal when computing the activation of each neuron. The BLSTM is composed of two recurrent neural networks with Long Short-Term Memory units: the first one processes the data from left to right, whereas the second one proceeds in the reverse order. At each time step, the decision combines the outputs of the two networks, taking advantage of both left and right context. Such context is essential to have some knowledge of the surrounding characters, because in most cases sequences of letters are constrained by the properties of the lexicon. The outputs of the two networks are combined through a softmax decision layer that provides character posterior probabilities, in addition to a non-decision (blank) class. This decision stage is called the Connectionist Temporal Classification (CTC) layer [24]; it enables the labelling of unsegmented data.
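As an illustration of this architecture, a minimal sketch in PyTorch is given below. This is a toy under our own assumptions (layer sizes and names are invented), not the authors' implementation:

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """Minimal sketch of a BLSTM with a softmax output layer.

    n_features: size of one frame of the input sequence (e.g. a HOG vector).
    n_classes:  number of characters + 1 extra "blank" (non-decision) class.
    """
    def __init__(self, n_features=64, n_hidden=100, n_classes=100):
        super().__init__()
        # Two LSTM passes (left-to-right and right-to-left), combined per frame.
        self.blstm = nn.LSTM(n_features, n_hidden, bidirectional=True,
                             batch_first=True)
        self.output = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, n_features)
        h, _ = self.blstm(x)               # h: (batch, frames, 2*n_hidden)
        return torch.log_softmax(self.output(h), dim=-1)  # per-frame posteriors

# Training would use the CTC loss over unsegmented label sequences:
# loss = nn.CTCLoss()(logprobs.permute(1, 0, 2), targets, in_lens, tgt_lens)
```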

These networks integrate a special neural unit: the Long Short-Term Memory (LSTM) cell [24]. An LSTM neuron is composed of a memory cell, an input and three control gates. Each gate controls the memory of the cell: how a given input affects the memory (input gate), whether a new input should reset the memory cell (forget gate), and whether the memory of the cell should be presented to the following neurons (output gate). This system of gates allows fine-grained control of the memory cell during training. An LSTM layer is fully recurrent, that is to say, the input and the three gates receive at each instant \(t\) both the input signal at time \(t\) and the previous outputs (at time \(t-1\)).
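With \(\sigma \) the logistic sigmoid and \(\odot \) the element-wise product, the standard LSTM update at time \(t\) reads:

$$\begin{aligned} i_t&= \sigma (W_i x_t + U_i h_{t-1} + b_i) \quad \text {(input gate)} \\ f_t&= \sigma (W_f x_t + U_f h_{t-1} + b_f) \quad \text {(forget gate)} \\ o_t&= \sigma (W_o x_t + U_o h_{t-1} + b_o) \quad \text {(output gate)} \\ c_t&= f_t \odot c_{t-1} + i_t \odot \tanh (W_c x_t + U_c h_{t-1} + b_c) \\ h_t&= o_t \odot \tanh (c_t) \end{aligned}$$

where \(c_t\) is the memory cell and \(h_t\) the output presented to the following layer; this is the standard formulation of [24], restated here for reference.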

This architecture has shown very impressive results on challenging datasets dedicated to word recognition [9, 25], thanks to its efficient classification and segmentation abilities. For these reasons, its capacity to cope with low-level character identification also appears very promising for handwritten word and REGEX spotting, since such scenarios are less constrained by lexicon properties.

The proposed BLSTM/HMM architecture has been chosen in order to take advantage of both the generative and the discriminative frameworks. As shown in Fig. 2, the input sequence is processed by a BLSTM-CTC network in order to compute character posterior probabilities at every time step. These posteriors are then fed to the HMM stage (in place of the character likelihoods computed by Gaussian mixture models in the traditional HMM framework) to perform the alignment of the spotting model. We now describe the HMM line spotting models, which enable us to spot either words (cf. Sect. 3.2) or REGEX (cf. Sect. 3.3).
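A minimal sketch of this coupling, assuming the BLSTM-CTC has produced a matrix of per-frame log posteriors (all names and shapes below are illustrative):

```python
import numpy as np

def viterbi(log_post, log_trans, state_to_char):
    """Align an HMM line model on BLSTM-CTC posteriors.

    log_post:      (frames, n_chars) log posteriors from the BLSTM-CTC;
                   they replace the usual GMM emission likelihoods.
    log_trans:     (n_states, n_states) log transitions of the line model.
    state_to_char: array mapping each HMM state to its character class.
    """
    T, n_states = log_post.shape[0], log_trans.shape[0]
    delta = np.full((T, n_states), -np.inf)   # best log score per state
    psi = np.zeros((T, n_states), dtype=int)  # best predecessor per state
    delta[0] = log_post[0, state_to_char]     # uniform start is assumed
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_post[t, state_to_char]
    path = [int(delta[-1].argmax())]          # backtrack best state sequence
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```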

3.2 Handwritten Word Spotting Model

Our word spotting model describes a line of text that may contain the word to spot. As classically proposed in the literature, it is made of the HMM word model surrounded by filler models that represent any other sequence of characters. Figure 1 shows an example of a word spotting model for the word “sentiments”. The space model is directly integrated into the filler. By constraining the whole model, we can locate the word at the beginning, in the middle or at the end of the line. The filler model is an ergodic model made of every character model. In our problem, we use 99 character models corresponding to lower case letters, upper case letters, digits, punctuation and space. A sketch of how such a line model can be assembled is given below.
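The following sketch assembles a filler-word-filler line model as a transition matrix (the transition probabilities are illustrative, not tuned values from the paper); it can be decoded with the Viterbi sketch of the previous section:

```python
import numpy as np

def word_line_model(word, charset, p_stay=0.5, p_enter=0.1):
    """Filler-word-filler line model as a (log) transition matrix.

    States 0..n-1 form the ergodic filler (one state per character model);
    the next len(word) states form a left-to-right word model.
    """
    n, m = len(charset), len(charset) + len(word)
    trans = np.zeros((m, m))
    trans[:n, :n] = (1 - p_enter) / n      # filler: any char follows any char
    trans[:n, n] = p_enter                 # enter the word model
    for i in range(n, m - 1):
        trans[i, i] = p_stay               # self-loop (duration modeling)
        trans[i, i + 1] = 1 - p_stay       # next character of the word
    trans[m - 1, m - 1] = p_stay
    trans[m - 1, :n] = (1 - p_stay) / n    # back to the filler after the word
    state_to_char = list(range(n)) + [charset.index(c) for c in word]
    return np.log(trans + 1e-300), np.array(state_to_char)
```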

Fig. 1. HMM line model: detail of every component of the line model.

Decoding a text line is classically achieved using the Viterbi algorithm: the system outputs the character sequence with the maximum likelihood \(P(X|\lambda )\). In order to accept or reject the spotted word, decoding is generally performed twice: a first pass using the spotting model, and a second pass using a filler model. The likelihood ratio between the two models then serves as a score for accepting or rejecting the spotted hypothesis. With the BLSTM-CTC architecture, posterior probabilities are available and can directly serve as an acceptance/rejection score, without the need for a filler model. The score of each spotted hypothesis is computed as the average of the character posteriors over the frames spanning the hypothesis, normalised by the number of characters of the spotted word. In doing so, we choose to rely on the strong discriminative decisions of the BLSTM-CTC and use the HMM only as a sequence model constrained by high-level information such as lexicons and/or language models. A graphical representation of the whole word spotting system is shown in Fig. 2.
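Under this description, the hypothesis score could be computed as follows (a sketch; the frame span and per-frame labels come from the Viterbi alignment above):

```python
import numpy as np

def hypothesis_score(post, align, t_start, t_end, n_chars):
    """Score a spotted hypothesis from frame-level posteriors.

    post:  (frames, n_classes) posteriors from the BLSTM-CTC.
    align: character class aligned to each frame by the Viterbi pass.
    The average posterior over the frames spanning the hypothesis is
    normalised by the word length, as described in the text.
    """
    avg = np.mean([post[t, align[t]] for t in range(t_start, t_end)])
    return avg / n_chars
```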

Fig. 2. Hybrid BLSTM/HMM structure: details of every step of the word spotting task, from feature extraction to the position of the word “sentiments” in the sentence. The BLSTM-CTC outputs a posteriori probabilities for HMM decoding.

Fig. 3. HMM meta models.

We now show how this model can be adapted to REGEX spotting.

3.3 Regular Expression Spotting Model

As previously mentioned, REGEX spotting is a generalisation of the word spotting task: the sequences to spot are less constrained and more variable, leading to a larger lexicon of admissible expressions.

In order to cope with REGEX queries, we use the HMM stage to model a regular expression with a stochastic model of character sequences. Each meta model is an ergodic model over the characters involved in the query, e.g. lower case letters (#[a-z]#), upper case letters (#[A-Z]#) or digits (#[0-9]#), as is the case for the filler models. Figure 3 shows the meta models for these three examples.

We also need to model the variable length of the queries that arises when using the * or + operators (matching a character between 0 and \(\infty \) times, or between 1 and \(\infty \) times), such as in #[0-9]+#, which stands for any sequence of at least one digit. This is simply modeled by allowing auto transitions over the desired character meta model. Figure 4 shows an example of a model for spotting variable-length sequences. The query is the sub-string agr followed by an unconstrained sequence of lower case letters (#[a-z]*#); in this example we expect the system to spot the word agréer correctly. A sketch of such a meta model is given below.
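Such a meta model with auto transitions could be sketched as follows (probabilities are illustrative; the loop mass controls the expected length of the matched sequence):

```python
import numpy as np

LOWER = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def meta_model(char_class, p_loop=0.8):
    """Ergodic meta model over a character class, e.g. #[a-z]*#.

    Inside the block, every character model can follow every other one;
    p_loop is the auto-transition mass that keeps the model emitting,
    and 1 - p_loop is the mass for leaving the block.
    """
    n = len(char_class)
    inner = np.full((n, n), p_loop / n)  # stay inside the class
    exit_mass = 1.0 - p_loop             # transition out of the block
    return inner, exit_mass

inner, exit_mass = meta_model(LOWER)     # 26x26 block for #[a-z]*#
```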

Similar line models allow searching for a REGEX at the beginning of a line (#[a-z]*ion#), at the end of a line (#le[a-z]*#), or both (#[A-Z]o[a-z]#). The line model can also contain only meta models, dedicated for example to spotting sequences of digits of any length (#[0-9]*#), or words beginning with one upper case letter and ending with a sequence of lower case letters of arbitrary length (#[A-Z][a-z]*#). Here, the arbitrary length of the unconstrained sequence (*) is controlled by the auto-transition probabilities of the HMM meta model.

Fig. 4. HMM stage: spotting of the regular expression #se[a-z]*# (i.e. every word beginning with the sub-string se, followed by any number of lower case characters).

As the transitions in the HMM meta models are ergodic, the Viterbi alignment is driven only by the local classifications of the BLSTM-CTC. The spotting model therefore depends on the discriminative capacity of the BLSTM-CTC to feed the higher HMM stage with accurate local character recognition information.

A graphical representation of the whole REGEX spotting system is shown in Fig. 5.

Fig. 5. Hybrid BLSTM/HMM structure: details of every step of the REGEX spotting task, from feature extraction to the position of the REGEX #se[a-z]*# in the sentence.

Finally, the integration of meta models and auto transitions into the line model allows the spotting of handwritten REGEX. In practice, the line model is built on the fly at query time by rewriting the REGEX into an HMM line spotting model. For now, this “translation” is done manually, but it could be automated for industrial purposes; a sketch of such a translation is given below.
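As an illustration, a small translator for the restricted family of queries used in this paper (literal characters mixed with the #[a-z]*# meta model) could look like this; the function and block names are invented for the sketch:

```python
def regex_to_blocks(query):
    """Translate a restricted REGEX (e.g. "se[a-z]*") into HMM blocks.

    Returns the sequence of blocks to chain between the two fillers:
    one single-state block per literal character, one looped ergodic
    meta-model block per "[a-z]*". Only this query family is handled.
    """
    blocks, i = [], 0
    while i < len(query):
        if query.startswith("[a-z]*", i):
            blocks.append(("meta", "lower", True))  # looped meta model
            i += len("[a-z]*")
        else:
            blocks.append(("char", query[i]))       # literal character state
            i += 1
    return blocks

# regex_to_blocks("se[a-z]*")
# -> [('char', 's'), ('char', 'e'), ('meta', 'lower', True)]
```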

4 Experiments

In this section, we give some details about the implementation of the system, starting with a description of the feature extraction in Sect. 4.1. The performance of the system is evaluated on the 2011 RIMES database [26]; the results are summarized in Sect. 4.2.

Fig. 6. Regular expression spotting performance for the sub-string effe (#effe[a-z]*#).

Fig. 7. Regular expression spotting performance for the sub-string cha (#cha[a-z]*#).

4.1 Feature Set

Our feature vector is based on Histograms of Oriented Gradients (HOG) [27] extracted from windows of \(8 \times 64\) pixels. During the extraction, the window is divided into sub-windows of \(n \times m\) pixels. For each sub-window, a histogram representing the distribution of the local intensity gradients (edge directions) is computed. The histograms of all sub-windows are then concatenated to obtain the final feature vector. We use \(8 \times 8\) non-overlapping sub-windows and 8 gradient directions, which produces a 64-dimensional feature vector.
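A sketch of this extraction for one window is given below (pure NumPy; the exact binning and normalization details of [27] are not reproduced, so treat this as an approximation):

```python
import numpy as np

def hog_window(win, cell=8, n_bins=8):
    """HOG vector for one window: non-overlapping cells x orientation bins.

    For an 8x64 window with 8x8 cells and 8 directions, this yields
    8 cells x 8 bins = a 64-dimensional feature vector.
    """
    gy, gx = np.gradient(win.astype(float))           # intensity gradients
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    feats = []
    for r in range(0, win.shape[0], cell):
        for c in range(0, win.shape[1], cell):        # one histogram per cell
            b = bins[r:r + cell, c:c + cell].ravel()
            w = mag[r:r + cell, c:c + cell].ravel()
            feats.append(np.bincount(b, weights=w, minlength=n_bins))
    return np.concatenate(feats)
```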

4.2 Results and Discussion

To evaluate the performance of our system, all the experiments have been performed on the RIMES database used for the 2011 ICDAR handwriting recognition competitions [26]. The training set is composed of 1,500 documents; the validation and test sets each contain 100 documents. In order to evaluate the spotting system, we compute recall (R) and precision (P) measures. To do so, the numbers of true positives (TP), false positives (FP) and false negatives (FN) are evaluated for all possible threshold values. From these values, a recall-precision curve is computed by accumulating the counts over all queries.

$$\begin{aligned} R = \frac{TP}{TP+FN} \qquad P = \frac{TP}{TP+FP} \end{aligned}$$
(1)
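The curve accumulation can be sketched as follows (inputs are illustrative: scored hypotheses pooled over all queries):

```python
def recall_precision_curve(hypotheses, n_relevant):
    """Accumulate TP/FP over all score thresholds.

    hypotheses: list of (score, is_correct) pairs over all queries;
    n_relevant: total number of ground-truth occurrences (TP + FN).
    """
    points, tp, fp = [], 0, 0
    for score, ok in sorted(hypotheses, reverse=True):    # decreasing threshold
        tp, fp = tp + int(ok), fp + int(not ok)
        points.append((tp / n_relevant, tp / (tp + fp)))  # (recall, precision)
    return points
```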
Fig. 8. Regular expression spotting performance for the sub-string com (#com[a-z]*#).

Fig. 9. Regular expression spotting performance for the sub-string pa (#pa[a-z]*#).

Regular Expression Results. To evaluate the performance of our system on a regular expression spotting task, we performed exactly the same experiments as in [8]. In that study, the authors were interested in spotting four different REGEX queries corresponding to the search for the sub-strings “effe”, “pa”, “com” and “cha” at the beginning of a word (#effe[a-z]*#, #pa[a-z]*#, #com[a-z]*#, #cha[a-z]*#). As for the word spotting experiments, the results of the HMM system are also reported in order to provide a precise comparison between the systems (cf. Figs. 6, 7, 8 and 9).

A first observation is that the system achieves good performance, since most of the REGEX queries lead to a mean average precision of nearly 75 %, whereas the queries involve far fewer constraints than word spotting. Moreover, our results are far beyond those of the standard HMM approach: we observe a gap of more than 40 % in the difficult cases (#com[a-z]*# and #cha[a-z]*#) and of 20 % in the easier ones (#effe[a-z]*# and #pa[a-z]*#). We also ran more tests on other queries such as #[a-z]*er#, #[a-z]*tion#, #[a-z]*tt[a-z]*# and #[a-z]*mm[a-z]*#. The results remain good for both #[a-z]*tion# and #[a-z]*mm[a-z]*#. However, the system seems to have trouble with the two other queries. For #[a-z]*tt[a-z]*#, this is certainly due to the fact that doubled letters such as “tt” are really difficult to spot; for #[a-z]*er#, it is due to the high level of confusion between “r” and other characters such as “n”, “u”, etc.

We have also tested less constrained queries, searching for REGEX matching any sequence of upper case characters (#[A-Z]*#) and any sequence of digits (#[0-9]*#). This problem is by far more difficult than the previous queries, since the corresponding sequences may have variable contents and lengths. For example, the digit query should detect the sequence “1” as well as the sequence “0123456789”. Results are presented in Fig. 11.

Given the difficulty of the problem, the performance is still interesting. Note that digit characters are not very frequent in the database. An interesting fact is that the upper case query reaches interesting precision scores, whereas the digit query reaches very high recall scores.

In the following section, an additional experiment is proposed in order to maximize the precision.

4.3 Using the BLSTM Without HMM

Recall-precision curves allow the user to choose the most appropriate threshold for a given problem. Indeed, some applications need to maximise precision whereas others favor recall. That is why we performed a REGEX spotting experiment without the higher-level HMM spotting model. This amounts to analysing the raw transcriptions provided by the low-level decision stage and matching the searched expressions on these transcriptions. Results are shown in Table 1.
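This second system amounts to the following (a sketch: the decoded lines are assumed to come from a best-path CTC decoding, which is not shown):

```python
import re

def spot_without_hmm(decoded_lines, query):
    """Exact REGEX matching on raw BLSTM-CTC transcriptions.

    decoded_lines: iterable of recognized text lines;
    query: a standard regular expression, e.g. r"effe[a-z]*".
    A hit requires every character of the match to have been recognized;
    a recognition error inside the sequence yields a miss, not a false alarm.
    """
    hits = []
    for line_idx, text in enumerate(decoded_lines):
        for m in re.finditer(query, text):
            hits.append((line_idx, m.start(), m.group()))
    return hits
```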

Fig. 10. Regular expression spotting performance for #[a-z]*er#, #[a-z]*mm[a-z]*#, #[a-z]*tion# and #[a-z]*tt[a-z]*#.

Fig. 11. Regular expression spotting performance for upper case sequences (#[A-Z]*#) and digit sequences (#[0-9]*#).

Table 1. Results of the detection without the HMM stage for the following REGEX: #effe[a-z]*#, #pa[a-z]*#, #com[a-z]*#, #cha[a-z]*#, #[a-z]*er#, #[a-z]*tion#, #[a-z]*tt[a-z]*#, #[a-z]*mm[a-z]*#.

The first observation from this experiment is that each query leads to 100 % precision, which means that the system does not produce any false alarm. This result was expected, since only two cases may occur: either the BLSTM correctly recognizes every character of the query, leading to a hit, or a recognition error occurs within the searched sequence, in which case the query is missed but no false alarm is produced. Hence no false alarms are produced by this system, whatever the recognition performance. Despite the weak constraints of the queries, the BLSTM-CTC manages to obtain high recall values: more than 55 % for six of our REGEX and more than 60 % for four of them. Interestingly, these operating points are significantly beyond the recall-precision curves obtained with the HMM stage. The 100 % precision (absence of false alarms) makes this system suitable for keyword spotting dedicated to categorization systems, as in [28].

Our system seems to have trouble dealing with #[a-z]*tt[a-z]*# and #[a-z]*er#, but as can be seen in Fig. 10, these two REGEX were already poorly handled with the HMM stage. As said before, this is certainly due to the difficulty of correctly recognizing doubled letters (tt) or easily confused letters (r). The BLSTM-CTC nevertheless seems able to recognize out-of-vocabulary elements such as named entities.

In most applications, global results can be further improved thanks to a language model, a lexicon or other kinds of high-level information. This experiment once again demonstrates the powerful capacity of the BLSTM-CTC to tackle the Sayre paradox, as it is able to segment and recognize characters very accurately.

5 Conclusion

In this paper, we have proposed a hybrid BLSTM-CTC/HMM system able to spot any word or REGEX. We have shown that the hybrid system exhibits interesting results, even on weakly constrained queries such as the search for digit sequences of arbitrary length. We have compared our REGEX spotting system with recent work carried out on the same dataset using the standard HMM framework: our approach outperforms this system by more than 30 % on the standard word spotting task and by more than 40 % on REGEX spotting. These very promising results make it possible to envisage higher-level spotting systems for entities such as addresses or named entities, for which a combination of specific markers (keywords and alpha-numerical expressions) is generally used to detect the relevant information. Additional results show that the BLSTM-CTC provides interesting performance even when no additional constraints are introduced, reaching interesting recall values (higher than 60 %). These experiments show that the frontiers of processing handwritten documents are getting closer and closer to those of processing printed or born-digital documents, which offers many perspectives for developing applications dealing with handwritten documents in the near future.