Word graphs size impact on the performance of handwriting document applications

Abstract

Two document processing applications are considered: computer-assisted transcription of text images (CATTI) and Keyword Spotting (KWS), for transcribing and indexing handwritten documents, respectively. Instead of working directly on the handwriting images, both of them employ meta-data structures called word graphs (WG), which are obtained using segmentation-free handwritten text recognition technology based on N-gram language models and hidden Markov models. A WG contains most of the relevant information of the original text (line) image required by CATTI and KWS but, if it is too large, the computational cost of generating and using it can become unaffordable. Conversely, if it is too small, relevant information may be lost, leading to a reduction of CATTI or KWS performance. We study the trade-off between WG size and performance in terms of effectiveness and efficiency of CATTI and KWS. Results show that small, computationally cheap WGs can be used without losing the excellent CATTI and KWS performance achieved with huge WGs.

Introduction

In recent years, huge amounts of historical handwritten documents have been scanned into digital images, which are then made available through Web sites of libraries and archives all over the world. However, the wealth of information conveyed by the text captured in these images remains largely inaccessible (no plain text, difficult to read even for researchers). Therefore, automated methods are needed to add value to mass digitization and preservation efforts of Cultural Heritage institutions, in order to provide adequate access to the contents of the preserved collections of handwritten text documents. To this end, in the tranScriptorium project [32] two different applications were developed aiming to fulfill these needs: computer-assisted transcription for text images (CATTI) [30], intended to speed up transcription processes, and Keyword Spotting [40] for automatic indexing of untranscribed handwritten material under the so-called Precision-Recall trade-off model. Both applications rely on word graphs (WG).

A WG is a data structure proposed by several authors some decades ago during the development of automatic speech recognition (ASR) technology [20, 45, 46]. Nowadays, WGs are also being used in the fields of machine translation (MT) [41], and lately in handwritten text recognition (HTR) [30, 39]. In HTR (as in ASR), WGs are obtained through adequate extensions of the standard dynamic programming Viterbi decoding algorithm, which determines the single best HTR hypothesis. A WG represents very efficiently a huge number of word sequence hypotheses whose posterior probabilities are large enough, according to the morphological character likelihood and the prior (language) models used to decode a text line image. WGs can also store additional important data about these hypotheses, namely alternative word segmentations and word decoding likelihoods. In the literature, we can find different algorithms to generate WGs, mainly related to the particularities of the underlying decoder. There are several standard ASR/HTR systems that provide WGs, such as HTK [47], Kaldi [25] or iATROS [15].

An important shortcoming of WGs is the large computing cost entailed by their generation, often very much larger than the cost of the basic Viterbi decoding process itself. WG generation cost depends on many factors, including the input sequence length and decoding vocabulary size. But a major factor is, by far, the size of the WG. This size can be measured in terms of WG “density” [20, 22, 34] and, in many cases, is determined by a parameter known as maximum node input degree (Idg), which specifies the amount of information retained at each node during the WG generation process [15, 47]. In addition to reducing Idg, other pruning techniques, such as beam-search and histogram pruning, can also be used to reduce the size of the WGs and the generation cost, at the expense of additional loss of the information retained in the resulting WGs [20, 34, 47].

This work, which extends the one presented in [37], studies how different sizes of WGs, pruned by different Idg values, impact the effectiveness/efficiency of both CATTI and KWS applications. The main differences of this paper with previous ones are a better explanation of the WG generation process and also of the experimental framework. This study will serve as a reference for estimating the space–time resources required for tasks entailing the processing of large handwritten image collections using WG-based techniques.

Overview of HTR and WG technology

This section introduces the basics of the HTR systems used to generate the WGs required by CATTI and KWS.

HTR based on hidden Markov models and N-grams

Holistic, segmentation-free HTR technology is used in this work to produce WGs. It is based on hidden Markov models (HMMs) and N-grams, and follows the fundamentals presented in [2] and further developed in [38, 42], among others. This kind of recognizer accepts a handwritten text line image, represented as a sequence of D-dimensional feature vectors \(\mathbf {x}=\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_n\), \(\mathbf {x}_i\in \mathfrak {R}^{D}\), and finds a most likely word sequence \(\widehat{\mathbf {w}}=\widehat{w}_1\widehat{w}_2\ldots \widehat{w}_l\), according to:

$$\begin{aligned} \widehat{\mathbf {w}} = \mathop {\hbox {arg max}}\limits _{\mathbf {w}} P(\mathbf {w}\mid \mathbf {x}) = \mathop {\hbox {arg max}}\limits _{\mathbf {w}} p(\mathbf {x},\mathbf {w}) = \mathop {\hbox {arg max}}\limits _{\mathbf {w}} p(\mathbf {x}\mid \mathbf {w})\cdot P(\mathbf {w}) \end{aligned}$$
(1)

The conditional density \(p(\mathbf {x}\mid \mathbf {w})\) is approximated by optical word models, built by concatenating character HMMs [10, 27, 30], and the prior \(P(\mathbf {w})\) is approximated by an N-gram language model [10, 11]. Two main modules comprise the HTR process illustrated in Fig. 1: preprocessing and feature extraction, and decoding. Preprocessing generally entails line image enhancement and basic geometry corrections, including slant normalization, while feature extraction obtains an image representation in terms of a sequence of feature vectors. A simple feature extraction method based on gray levels and gray-level gradients [38] is illustrated in Fig. 1.

Fig. 1

Diagram of the HTR decoding process. Starting from a text line image of the text “home is located near,” a visual representation of the feature vector sequence obtained is shown (average gray level and horizontal and vertical components of the gray-level gradient in this illustrative example). Finally, the corresponding WG is delivered by the decoding process, using the provided knowledge sources: optical character HMMs, lexicon word models and an N-gram language model

The search (or decoding) for \(\widehat{\mathbf {w}}\) is efficiently carried out by the Viterbi algorithm [10, 11], also referred to as token-passing [19]. This dynamic programming decoding process can yield not only a single best solution, as in Eq. (1), but also a huge set of best solutions compactly represented in a WG (see next section).

For a more detailed description of these HTR technologies, including text line processing, model training and decoding, the reader is referred to [30].

Word graphs

Generally speaking, a word graph (WG) is a data structure which represents in a compact manner a large number of decoding or recognition hypotheses (e.g., word sequences), along with corresponding likelihood and signal segmentation or alignment information. More specifically, our concept of WG is that of a weighted directed acyclic graph, where edges are labeled with words and weighted with scores, and nodes contain word-to-signal alignment information. As we will see later, alignment information is not required in the CATTI approach (see Sect. 3.1), but it is essential in the KWS application presented in Sect. 3.2.

There is no unique or standard definition of WG in the literature. For example, in [13] it is defined as a weighted directed acyclic graph, where each edge is labeled with a word and a score; in [20, 22], word-to-signal alignment information is also included in each node, and in [11] the alignment information is removed and multiple overlapping copies of the same word are merged. In addition, many modern papers [23, 33, 50] and toolkits [25, 31, 47] prefer to use the term “lattice” for the same type of graphs we (and other authors) refer to as WGs. Moreover, while some authors treat word graphs and word lattices as synonymous [9, 31, 35, 43], others are careful to distinguish between them [11, 22].

WGs are different from “confusion networks” (CN), also used in many works on automatic speech recognition, machine translation and HTR [8, 9, 16, 21]. A CN can be obtained by simplifying a WG in such a way that the alignment information is discarded and the word sequences contained in the original WG are generalized (in a rather uncontrolled way). These effects render CNs inappropriate for our purposes.

In the present work, we adhere to the name and definition given by Ney et al. [20, 22]. In these works, the term WG was used to make explicit the differences with the word lattices used in early speech recognition systems [3, 45], where the hypotheses represented by a lattice were allowed to have word overlaps or gaps along the input signal axis.

Formally (and according to [20]), we define a WG as an eight-tuple \((Q,V,E,q_I,F,t,s,\omega )\). Q is a finite set of nodes, including an initial node \(q_I\in Q\) and a set of final nodes \(F\subseteq (Q-q_I)\). Each node q is associated with a horizontal position of \(\mathbf {x}\) (often called “frame”), given by \(t(q)\!\in \![0,n]\), where \(t(q_I)=0\) and \(\forall q\in F,\, t(q)=n\). V is a non-empty set of words (vocabulary), E is a finite set of edges and s and \(\omega \) are edge scoring and labeling functions, respectively. For an edge \((q',q)\!\in \!E\) (\(q'\ne q,q'\!\not \in \!F,q\!\ne \!q_I\)), \(v=\omega (q',q)\) is its associated word and \(s(q',q)\) is its score, corresponding to the likelihood that the word v appears in the image segment delimited by frames \(t(q')+1\) and t(q), computed during the Viterbi decoding of \(\mathbf {x}\). Figure 2 shows a small, illustrative example of a WG obtained by decoding the text line image also shown in this figure.

Fig. 2

Illustrative, simplified example of a WG that would be obtained from the decoding of a line image of the Spanish handwritten text: “antiguos ciudadanos, que en Castilla se llamaban,” represented by a sequence of feature vectors, \(\mathbf {x}\), of length n. Each edge such as \((q',q)=(12,13)\) is labeled with the corresponding word, \(\omega (12,13)\!=\,\) “Castillo,” and weighted with its score \(s(12,13)=0.6\). In this case, node positions of \(q'\) and q correspond to \(t_7=t(12)\) and \(t_8=t(13)\), respectively, which are shown on the bottom and in the image itself

A complete path of a WG is a sequence of nodes starting with node \(q_I\) and ending with a node in F. Complete paths correspond to whole line decoding hypotheses. It is important to remark that the WGs considered in this (and many other) work(s) are unambiguous; that is, no two complete paths can exist in a WG which correspond to the same word sequence. This implies that, for each word sequence in a WG, only one word segmentation is retained. However, for a given individual word v, several (often many) segmentations can appear in a WG, each corresponding to a different word sequence which includes v. For example, in Fig. 2, two different segmentations appear for the word “ciudadanas,” corresponding to word sequence hypotheses starting with “antiguas ciudadanas fue…” and “antiguos ciudadanas que ...”. Similarly, two alternative segmentations appear for the word “que.”
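To make the definition concrete, the following minimal Python sketch stores a WG as a list of labeled, scored edges plus node frame positions, and enumerates its complete paths. The class name `WordGraph`, the toy edges and their scores are illustrative assumptions, not data from Fig. 2, and exhaustive enumeration is only feasible for tiny graphs such as this one.

```python
from dataclasses import dataclass, field


@dataclass
class WordGraph:
    """Weighted DAG: nodes carry frame positions, edges carry a word and a score."""
    frames: dict   # node id -> frame position t(q)
    initial: int   # initial node q_I
    finals: set    # set of final nodes F
    edges: list = field(default_factory=list)  # (q_from, q_to, word, score)

    def add_edge(self, q_from, q_to, word, score):
        # Edges must advance along the frame axis, which keeps the graph acyclic.
        assert self.frames[q_from] < self.frames[q_to]
        self.edges.append((q_from, q_to, word, score))

    def complete_paths(self):
        """Yield (word sequence, product of edge scores) for every complete path."""
        outgoing = {}
        for q_from, q_to, word, score in self.edges:
            outgoing.setdefault(q_from, []).append((q_to, word, score))

        def walk(node, words, score):
            if node in self.finals:  # final nodes have no outgoing edges
                yield tuple(words), score
                return
            for q_to, word, s in outgoing.get(node, []):
                yield from walk(q_to, words + [word], score * s)

        yield from walk(self.initial, [], 1.0)


# Hypothetical toy WG: "ciudadanas" appears with two different segmentations.
wg = WordGraph(frames={0: 0, 1: 4, 2: 5, 3: 9}, initial=0, finals={3})
wg.add_edge(0, 1, "antiguos", 0.6)
wg.add_edge(0, 2, "antiguas", 0.4)
wg.add_edge(1, 3, "ciudadanos", 0.7)
wg.add_edge(1, 3, "ciudadanas", 0.3)
wg.add_edge(2, 3, "ciudadanas", 0.5)

all_paths = list(wg.complete_paths())
paths = dict(all_paths)
best = max(paths, key=paths.get)
```

In this toy graph every complete path yields a different word sequence (the graph is unambiguous), while the word “ciudadanas” is still reached through two different segmentations, one starting after frame 4 and one after frame 5.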

WG generation overview

Several methods can be applied for WG generation, depending on the intended applications and/or on particularities of the underlying decoder. Practically all the techniques proposed so far can produce graphs with the basic unambiguity property discussed above. Also, most methods guarantee that the single best decoding hypothesis is also the most probable path in the WG.

Apart from these basic requirements, other important optimization criteria are considered. Almost all the systems aim to maximize the probability mass of the set of word sequences represented by the WG. Another criterion is to minimize the “graph error rate” (called minimum word error rate in Sect. 4.3). Additional or complementary criteria are to minimize the number of WG paths (i.e., different word sequences) or the number of edges, and/or the computational cost of WG generation.

It is worth remarking that no method has been described in the literature which actually meets a chosen criterion or combination of criteria. Therefore, all the existing methods rely on heuristics to obtain only approximate solutions which prove sufficiently good for the main applications aimed at. This is the case, for example, of systems described in [22, 26]. The former presents a method, used in the Kaldi decoder [25], based on automata composition [18], while the latter introduces techniques based on the so-called word-pair approximation to achieve good compromises between accuracy and coverage. We briefly discuss in some more detail the WG generation algorithm used in two toolkits: HTK [47] and iATROS [15]. Both of them are based on HMMs and N-grams and the decoding process is performed by using the Viterbi-like token-passing algorithm [48].

In HTK, token-passing search is performed on a global network [19] which takes into account (finite-state) language model restrictions. This recognition network consists of two types of nodes: HMM nodes, mapped to specific character HMMs, where accumulated optical log-likelihoods along paths are held, and word end nodes mapped to specific words, where language model log-probabilities are added. At each input frame, each network node holds a “moveable token” that contains partial log-likelihood and path identifiers to allow for path trace-back.

HTK generates WGs which fulfill the two basic requirements mentioned above: a) the best recognition hypothesis given by the Viterbi decoding process is also the best WG path and b) WGs are unambiguous. During the token-passing decoding, for each active word (set by the associated HMM nodes on which the moveable token is currently propagating), there are multiple copies, and thereby, it is possible to generate multiple word sequence hypotheses with some additional computation. This is implemented by linking the tokens that flow into the same word end nodes and then propagating the most likely token (which is the head of this list) into following nodes. It is important to remark here that, in order to avoid ambiguous paths, linked tokens must come from different preceding words. Of course, this strategy cannot generate exact solutions for any but the best path since it implicitly assumes that the start time of each word is independent of all words before its immediate predecessor. At the end of decoding, the WG is straightforwardly built by descending the list at each word token boundary and recovering the multiple word hypotheses, which include the best one. This guarantees that the resulting graph contains the best word sequence hypothesis. The size of the WGs generated in this way can be controlled by setting the maximum number of best tokens which are retained at each word end node. This number is usually called maximum node input degree (Idg).
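The role of Idg at a word end node can be illustrated with the following simplified Python sketch. It is not HTK code: the function `merge_tokens` and the token representation (a log-score paired with the preceding word) are assumptions made only for illustration. Tokens arriving at the same word end node are merged so that at most one token per preceding word survives, and only the Idg best are kept; the head of the resulting list is the token that would be propagated further.

```python
import heapq


def merge_tokens(tokens, idg):
    """Keep at most `idg` tokens, best first, one per distinct preceding word.

    tokens: iterable of (log_score, previous_word) pairs reaching one word end node.
    """
    best_per_prev = {}
    for log_score, prev_word in tokens:
        # Linked tokens must come from different preceding words.
        if prev_word not in best_per_prev or log_score > best_per_prev[prev_word]:
            best_per_prev[prev_word] = log_score
    # nlargest returns the selected items sorted from best to worst.
    return heapq.nlargest(idg, ((s, w) for w, s in best_per_prev.items()))


# Hypothetical tokens arriving at one word end node at a given frame.
tokens = [(-12.0, "que"), (-10.5, "fue"), (-11.0, "que"), (-15.0, "se")]
kept_idg2 = merge_tokens(tokens, idg=2)
kept_idg1 = merge_tokens(tokens, idg=1)
```

With Idg = 1 only the head token survives, so the resulting WG degenerates into the 1-best hypothesis; larger values retain more alternative predecessors and therefore more WG edges.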

The iATROS decoder is also based on the Viterbi algorithm and adopts the same criteria as HTK. A main difference between HTK and iATROS is that whereas HTK is restricted to the use of bi-gram language models, iATROS can employ N-gram models of any order. On the other hand, iATROS implements real back off, while HTK approaches it by interpolation. Another difference is that in iATROS, rather than using a tree-structured lexicon like in HTK, the search process is carried out in a network of parallel word HMMs. Although the WG generation method is practically the same in both toolkits, these implementation details entail non-negligible differences between the WGs produced. While these iATROS peculiarities allow for greater modeling flexibility, it is known that they entail somewhat inferior quality of the WGs generated.

Computational cost of WG generation

It is well known that the computational complexity of the Viterbi algorithm is proportional to the length of \(\mathbf {x}\) and the lexicon size. However, for real applications, even for moderately sized tasks, carrying out a complete search is not computationally feasible. This cost can be made largely independent of the lexicon size and the overall size of the models used by means of well-known pruning techniques such as beam-search [7, 10, 14]. In beam search, only paths whose likelihood falls within a fixed beam width of the most likely path are considered for extension. This beam search approximation allows for significant speed-ups at the expense of a controllable, small degradation of the decoding performance.

However, when the decoding process includes WG generation, the overall computing cost is observed to grow very fast with the WG size (exponentially with the WG density, according to [34]). In summary, the asymptotic cost of generating a WG for a line image of length n can be expressed as \(O({\varGamma }\cdot n)\), where \({\varGamma }\) is a (generally large) constant which depends on the WG size. Nevertheless, it should be taken into account that this process is carried out only once and that, by choosing adequate WG sizes, reasonable WG generation times can be achieved in practice, as will be shown later on in this paper.
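The beam pruning step mentioned above can be sketched in a few lines of Python. This is a generic illustration under simple assumptions (active hypotheses stored in a dictionary of log-likelihoods, a hypothetical function name `beam_prune`), not the implementation used by any of the toolkits discussed here.

```python
def beam_prune(active, beam_width):
    """Keep only hypotheses whose log-likelihood is within `beam_width` of the best.

    active: dict mapping a search state to its accumulated log-likelihood.
    """
    if not active:
        return {}
    best = max(active.values())
    return {state: ll for state, ll in active.items() if ll >= best - beam_width}


# Hypothetical active states at one frame.
active = {"a": -10.0, "b": -12.5, "c": -30.0}
survivors = beam_prune(active, beam_width=5.0)
```

Applied at every frame, such a filter bounds the number of hypotheses that are extended, which is why the decoding cost becomes largely independent of the lexicon size.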

Outline of WG-based CATTI and KWS

CATTI

The interactive computer-assisted transcription of text images (CATTI) framework is presented in detail in [30, 36]. In this framework, the human transcriber is directly involved in the handwritten text transcription process and he/she is responsible for validating and/or correcting the HTR output.

The interactive process starts when the HTR system proposes a full transcript of a feature vector sequence \(\mathbf {x}\), extracted from a handwritten text line image. In each interaction step, the user validates a prefix of the transcript which is error free and makes some amendment(s) to correct the erroneous text that follows the validated prefix, producing a new correct prefix \(\mathbf {p}\). The new, extended prefix is used by CATTI to search for a new most likely suffix, \(\hat{\mathbf {s}}\), according to:

$$\begin{aligned} \hat{\mathbf {s}} = \mathop {\hbox {arg max}}\limits _s P(\mathbf {s}\mid \mathbf {x},\mathbf {p}) \approx \mathop {\hbox {arg max}}\limits _s p(\mathbf {x}\mid \mathbf {p},\mathbf {s}) \cdot P(\mathbf {s}\mid \mathbf {p}) \end{aligned}$$
(2)

Equation (2) is very similar to Eq. (1), with \(\mathbf {w}\) being the concatenation of \(\mathbf {p}\) and \(\mathbf {s}\). As in conventional HTR, \(p(\mathbf {x}\mid \mathbf {p},\mathbf {s})\) can be approximated by HMMs and \(P(\mathbf {s}\mid \mathbf {p})\) by an N-gram model, but now the N-gram is conditioned by \(\mathbf {p}\), which is given. Therefore, the search must be performed over all possible suffixes \(\mathbf {s}\) of \(\mathbf {p}\), rather than over complete transcripts as in Eq. (1). A key point of this interactive process is that, at each user–system interaction step, the system can take advantage of the prefix validated so far to attempt to improve its prediction.

This search problem can be solved through an extension of the conventional Viterbi algorithm. In the first iteration of the CATTI process, \(\mathbf {p}\) is empty. Therefore, the decoder has to generate a full transcript of \(\mathbf {x}\) as shown in Eq. (1). Afterward, at each user interaction step, a special “prefix-conditioned” language model which accounts for \(P(\mathbf {s}\mid \mathbf {p})\) is built [30, 36]. Owing to the finite-state nature of this special language model, the search involved in Eq. (2) can be efficiently carried out using the Viterbi algorithm. However, the computational cost of a full Viterbi decoding at each interaction step becomes prohibitive for the very short response times generally needed for adequate interactive operation.

As shown in [30], much more efficient search can be achieved using the WG obtained during the conventional Viterbi decoding of the whole image representation \(\mathbf {x}\), as outlined in Sect. 2.

In each interaction step, the decoder parses the previously validated prefix \(\mathbf {p}\) over the WG. This parsing procedure leads to defining a set of nodes \(Q_p\) corresponding to paths from the initial node whose associated word sequence is \(\mathbf p \). Then, the decoder continues searching from any of the nodes in \(Q_p\) for a suffix \(\mathbf {s}\) that maximizes the posterior probability according to Eq. (2). The prefix parsing step entails a significant complication because it may happen that a prefix given by the user cannot be exactly found in the WG. In this case, an error-correcting parsing procedure is carried out: rather than looking for the exact prefix \(\mathbf p \), a best-match, “approximate” prefix is searched for over all the possible prefixes existing in the WG [30]. This approximate search procedure can be efficiently carried out using dynamic programming and it can be further improved by visiting the WG nodes in topological order [1]. This process is repeated until a complete and correct transcript of the input image is obtained.

The computational costs of these WG procedures can be divided into two phases: initialization and prediction. For each line image, first the corresponding WG must be stored in memory and several data needed at each successive interaction step can be pre-computed. Then, at each interaction step, the cost of prefix matching and suffix prediction should be considered. Both costs can be seen to be roughly linear in the number of WG edges but, thanks to the pre-computation phase, the more critical suffix prediction costs can be reduced very significantly [30]. As will be shown later, for reasonably small WG sizes, both of these costs can be kept sufficiently small, as required by the real-time constraints imposed by interactive operation.
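The prefix parsing and suffix search described above can be sketched as follows. This is a simplified illustration, assuming exact prefix matching only (the error-correcting parsing of [30] is not reproduced) and assuming that each WG edge score already combines the optical and language model contributions. The function name `best_suffix` and the toy graph are hypothetical.

```python
import math
from collections import defaultdict


def best_suffix(edges, initial, finals, prefix):
    """Return the most likely suffix (tuple of words) following `prefix` in the WG.

    edges: list of (q_from, q_to, word, score). Returns None if the prefix
    cannot be parsed exactly (error-correcting parsing would be needed).
    """
    outgoing = defaultdict(list)
    for q_from, q_to, word, score in edges:
        outgoing[q_from].append((q_to, word, score))

    # 1) Prefix parsing: compute Q_p with the best log-score reaching each node.
    reached = {initial: 0.0}
    for word in prefix:
        nxt = {}
        for node, logp in reached.items():
            for q_to, w, s in outgoing[node]:
                if w == word:
                    cand = logp + math.log(s)
                    if cand > nxt.get(q_to, -math.inf):
                        nxt[q_to] = cand
        reached = nxt
    if not reached:
        return None

    # 2) Suffix search: best path from a node to a final node (memoized).
    memo = {}

    def search(node):
        if node not in memo:
            if node in finals:
                memo[node] = (0.0, ())
            else:
                options = []
                for q_to, w, s in outgoing[node]:
                    sub = search(q_to)
                    if sub is not None:
                        options.append((math.log(s) + sub[0], (w,) + sub[1]))
                memo[node] = max(options) if options else None
        return memo[node]

    candidates = []
    for node, logp in reached.items():
        sub = search(node)
        if sub is not None:
            candidates.append((logp + sub[0], sub[1]))
    return max(candidates)[1] if candidates else None


# Hypothetical toy WG (scores are illustrative).
edges = [
    (0, 1, "antiguos", 0.6), (0, 2, "antiguas", 0.4),
    (1, 3, "ciudadanos", 0.7), (1, 3, "ciudadanas", 0.3),
    (2, 3, "ciudadanas", 0.5), (3, 4, "que", 0.8), (3, 4, "fue", 0.2),
]
first_hypothesis = best_suffix(edges, 0, {4}, ())
after_correction = best_suffix(edges, 0, {4}, ("antiguas",))
```

With an empty prefix the function returns the 1-best transcript of the toy WG; once the user corrects the first word to “antiguas”, the suffix is recomputed from the nodes reached by that prefix, without any new Viterbi decoding of the image.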

KWS

We focus here on line-based “query by string” keyword spotting (KWS). The goal is to determine whether a textually given keyword is likely to appear in each text line image, no matter how many occurrences of the word may appear in the line.

According to [40], an adequate line-level measure to score the likelihood that a keyword v appears in any horizontal position of a line image \(\mathbf {x}\) is:

$$\begin{aligned} S(v,\mathbf {x}) \,{{\mathrm{\mathop {=}\limits ^{\tiny {\text {def}}}}}}\, \max _{i} P(v\mid i,\mathbf {x}) \end{aligned}$$
(3)

where \(\mathbf {x}\) is the given vector sequence representation of the image and i is an index or “frame” of \(\mathbf {x}\). In this equation, \(P(v\mid \,i,\mathbf {x})\) (called “frame posterior”) is the probability that the word v appears in some segment of the line image \(\mathbf {x}\) such that i lies within this segment. \(P(v\mid \,i,\mathbf {x})\) could be obtained directly from the image representation \(\mathbf {x}\) using a (language model constrained) word recognizer, on the basis of backward–forward scores computed at each state of a trellis similar to that used by a Viterbi decoder for \(\mathbf {x}\). However, the computational cost of such a direct approach and its implementation complexity would be exceedingly high and a much more convenient approach based on WGs was proposed in [40].

In a nutshell, \(P(v\mid \,i,\mathbf {x})\) can be easily and efficiently computed by considering the contribution of all the WG edges labeled with v, which correspond to segmentation hypotheses that include the frame i; that is:

$$\begin{aligned} P(v\mid i,\mathbf {x}) \,\,\approx \!\!\! \sum _{\begin{array}{c} (q',q)\in E{:}\\ v=\omega (q',q),\\ t(q')<\,i\le t(q) \end{array}}\!\!\! \varphi (q',q)\;,\qquad \varphi (q',q) = \dfrac{\alpha (q')\cdot s(q',q)\cdot \beta (q)}{\beta (q_I)} \end{aligned}$$
(4)

where the so-called edge posterior probability \(\varphi (q',q)\) is computed using the forward \(\alpha (.)\) and backward \(\beta (.)\) accumulated path scores which, in turn, can be very efficiently computed on the WGs by dynamic programming [40, 44]. Figure 3 shows a version of the WG shown in Fig. 2 where edge scores are “normalized” in this way.

Fig. 3

Example of an “edge posterior” normalized WG. The original, unnormalized WG is shown in Fig. 2. Each edge \((q',q)\) is labeled with the word \(\omega (q',q)\), and weighted with the edge posterior \(\varphi (q',q)\). Note that, for any horizontal image position i, the sum of the weights of all the edges encompassing i is 1

The costs entailed by the computation of the confidence measure \(S(v,\mathbf {x})\), based on the frame word-posteriors \(P(v\mid i,\mathbf {x})\) and the corresponding WG normalization [Eqs. (3, 4)], depend linearly on the total number of WG edges and on the length, n, of the line image representation, \(\mathbf {x}\). As will be seen in Sect. 4.4, these costs are practically negligible in comparison with the cost of WG generation.
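A minimal Python sketch of Eqs. (3) and (4) is given next. It assumes a WG stored as a list of edges and a dictionary of node frame positions, and it works with plain probabilities for readability (a practical implementation would work in the log domain). The function name `kws_scores` and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def kws_scores(edges, frames, initial, finals):
    """Return {word: S(v, x)} from edge posteriors, following Eqs. (3) and (4).

    edges: list of (q_from, q_to, word, score); frames: node -> frame t(q).
    """
    # Sorting nodes by frame gives a topological order, since edges advance in t.
    order = sorted(frames, key=frames.get)
    outgoing = defaultdict(list)
    for q_from, q_to, _word, score in edges:
        outgoing[q_from].append((q_to, score))

    alpha = defaultdict(float)  # forward accumulated path scores
    alpha[initial] = 1.0
    for q in order:
        for q_to, score in outgoing[q]:
            alpha[q_to] += alpha[q] * score

    beta = defaultdict(float)   # backward accumulated path scores
    for q in finals:
        beta[q] = 1.0
    for q in reversed(order):
        for q_to, score in outgoing[q]:
            beta[q] += score * beta[q_to]

    frame_post = defaultdict(lambda: defaultdict(float))
    for q_from, q_to, word, score in edges:
        phi = alpha[q_from] * score * beta[q_to] / beta[initial]  # edge posterior
        for i in range(frames[q_from] + 1, frames[q_to] + 1):     # t(q') < i <= t(q)
            frame_post[word][i] += phi

    return {word: max(post.values()) for word, post in frame_post.items()}


# Hypothetical toy WG (scores and frame positions are illustrative).
frames = {0: 0, 1: 4, 2: 5, 3: 9, 4: 12}
edges = [
    (0, 1, "antiguos", 0.6), (0, 2, "antiguas", 0.4),
    (1, 3, "ciudadanos", 0.7), (1, 3, "ciudadanas", 0.3),
    (2, 3, "ciudadanas", 0.5), (3, 4, "que", 0.8), (3, 4, "fue", 0.2),
]
scores = kws_scores(edges, frames, initial=0, finals={4})
```

In this toy example the two edges labeled “ciudadanas” overlap in frames 6 to 9, so their edge posteriors (0.225 and 0.25) add up to a line-level score of 0.475, which is higher than either segmentation hypothesis alone would give.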

Experiments

To study the performance of WG-based CATTI and KWS for different WG sizes, several experiments were carried out. The evaluation measures, corpora, experimental setup and the results are presented next.

Evaluation measures

The effect of WG size on CATTI and KWS performance is assessed in terms of effectiveness (accuracy) and efficiency (computational time and space requirements).

To assess the effectiveness of CATTI, we use the word stroke ratio (WSR), defined as the number of word-level interaction steps needed to achieve the reference transcript of the text image considered, divided by the total number of reference words. The WSR gives an estimate of the human effort required to produce correct transcripts using a CATTI system.
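As a rough illustration of how the WSR can be estimated with a simulated user, consider the sketch below. The function `word_stroke_ratio`, the `predict` callback and the example sentences are hypothetical; the sketch also assumes that surplus words proposed beyond the end of the reference are discarded at no cost, which may differ from the exact protocol used in the experiments.

```python
def word_stroke_ratio(reference, predict):
    """Simulate word-level CATTI interaction and return the WSR.

    reference: list of reference words.
    predict: function mapping a validated prefix (tuple) to a suffix (sequence of words).
    """
    prefix, corrections = [], 0
    while len(prefix) < len(reference):
        hypothesis = prefix + list(predict(tuple(prefix)))
        # The simulated user validates the longest error-free prefix.
        k = len(prefix)
        while k < len(reference) and k < len(hypothesis) and hypothesis[k] == reference[k]:
            k += 1
        if k < len(reference):
            corrections += 1  # the user types the correct word reference[k]
            k += 1
        prefix = reference[:k]
    return corrections / len(reference)


reference = "antiguos ciudadanos que en Castilla se llamaban".split()
system_output = "antiguas ciudadanos que en Castillo se llamaban".split()


def static_predict(prefix):
    # Toy predictor: always completes with the same fixed hypothesis.
    return system_output[len(prefix):]


wsr = word_stroke_ratio(reference, static_predict)
```

Here two word corrections are needed for seven reference words, so the WSR is 2/7. A real CATTI system would replace `static_predict` with a WG-based suffix search, which can lower the WSR because each correction may also fix later words.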

On the other hand, KWS effectiveness was measured by means of the standard recall and interpolated precision [17] curve, which is obtained by varying a threshold to decide whether a score \(S(v,\mathbf {x})\) [Eq. (3)] is high enough to assume that a word v appears in \(\mathbf {x}\). More specifically, the average precision (AP) [28, 49] was used as a single scalar performance measure.
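For reference, a common (non-interpolated) way of computing AP from a ranked list of spotting events is sketched below. The exact AP variant used in [28, 49] may differ in details such as interpolation or tie handling, so this should be read as an illustrative assumption; the function name `average_precision` and the sample events are hypothetical.

```python
def average_precision(events):
    """Non-interpolated AP of a list of (score, is_relevant) spotting events."""
    # Rank events by decreasing confidence score (ties keep their input order).
    ranked = sorted(events, key=lambda e: e[0], reverse=True)
    total_relevant = sum(1 for _, relevant in ranked if relevant)
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / total_relevant


# Hypothetical (score, ground truth) pairs for keyword/line events.
events = [(0.9, True), (0.8, False), (0.7, True), (0.2, False)]
ap = average_precision(events)
```

In this example the two relevant events are ranked first and third, giving AP = (1/1 + 2/3) / 2 = 5/6.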

Finally, the computing times required for efficiency assessment are reported in terms of total elapsed times measured on a dedicated single core of an Intel® Core™ 2 Quad CPU at 2.83 GHz.

Corpora

Two historical manuscripts, CS [29] and PAR [5], were used in the experiments.

CS (or “Cristo Salvador”) is a XIX-century Spanish manuscript, entitled “Noticia histórica de las fiestas con que Valencia celebró el siglo sexto de la venida a esta capital de la milagrosa imagen del Salvador,” which was kindly provided by the Biblioteca Valenciana Digital (BiVaLDi). It is composed of 50 color text images of pages written by a single writer and scanned at \(300\,\)dpi. Some examples are shown in Fig. 4.

Fig. 4

Examples of CS manuscript

On the other hand, PAR is a XIII-century epic poem, by Wolfram von Eschenbach, identified as “St. Gall, collegiate library, cod. 857” (and often referred to as “Parzival”). It is composed of 47 pages written in the Middle High German language. While written by multiple hands, all the writing styles are very similar. Figure 5 shows some examples of this manuscript.

Fig. 5

Examples of PAR manuscript

Table 1 summarizes information on data partitioning used for both datasets. The percentage of different words of the test partition that do not appear in the training partition is shown in the row “Running OOV(%)” (out of vocabulary words).

Table 1 Basic statistics of the CS and PAR datasets and the corresponding partitions

For KWS evaluation, several criteria can be adopted for the selection of the keywords. Clearly, any given KWS system may perform better or worse depending on the query words it is tested with and how these words are distributed in the test set. In general, the larger the set of keywords, the more reliable the empirical results. According to these observations, in this work we adopt the same criterion as in [6], where all the words that appear in the training partition are selected as keywords, namely 2236 and 3221 keywords for CS and PAR, respectively. It is important to remark that, according to this criterion, there will be many keywords which actually do not appear in any of the test images. This is a challenging scenario, since the system may erroneously find other similar words, thereby leading to important precision degradations.

System setup

Each line image was represented as a sequence of feature vectors. For CS, an approach based on smoothed gray levels and gray-level gradients was used (see [30] and Fig. 1), while a technique just based on gray level PCA analysis [24] was adopted for PAR.

The line image feature-vector sequences of both the CS and PAR training partitions were used to train corresponding character HMMs, using the standard embedded Baum-Welch training algorithm [10]. A left-to-right HMM was trained for each of the elements appearing in the training text images (78 for CS and 92 for PAR). This included lowercase and uppercase characters, symbols, special abbreviations, possible spacing between words and characters, crossed-words, etc. Meta-parameters of both HTR feature extractions and HMM models were optimized through cross-validation on the training data for CS and on the validation data for PAR. The optimal HMM meta-parameters were 14 states with 16 Gaussian densities per state for CS, and 8 states with 16 Gaussians per state for PAR.

The training set transcripts of both corpora were used to train the respective 2-grams with Kneser-Ney back-off smoothing [12] (for the PAR final evaluation, the language model training includes also the validation data).

For each test line image, six WGs were obtained, one for each of several input degree (Idg) values, using the HTK toolkit based on the previously trained HMMs and 2-grams. The following Idg values were considered: 1, 3, 5, 10, 20 and 40, where the value 1 corresponds to a degenerate WG representing only the 1-best transcript. Table 2 shows relevant statistics of the resulting WGs, along with the minimum word error rates (W(%)) [4] and the average generation computing time (\(T_{\text {gen}}\)). The W values are “oracle” WERs obtained by computing, for each WG, the path (word sequence) which best matches the corresponding reference transcript. Therefore, this is the best WER that we could obtain during an “oracle decoding process” that knows the correct decoding. As expected, W decreases for larger WGs. Specifically, the relative differences between the W values obtained with WGs generated with Idg = 10 and those generated with Idg = 40 are around 8–10 %. This is quite a small improvement given the very large increase in the size and also the computation time of the corresponding WGs.

Table 2 Statistics of the CS and PAR WGs obtained for different Idg values

Similar sets of WGs were also generated using the iATROS toolkit. As mentioned in Sect. 2.3, these WGs tend to be somewhat worse than those of HTK and this is what is actually observed in this case. For example, for PAR WGs with Idg = 3, the “oracle” WER is 20.8 %, as compared with 18.8 % for the HTK WGs. While such differences do not necessarily entail a similar degradation in CATTI and KWS performance, in what follows, we will focus only on the results obtained with HTK WGs.
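The “oracle” error count mentioned above can be computed without enumerating paths, by running an edit-distance dynamic programming over the WG nodes in topological order. The following Python sketch illustrates this idea; it is not the procedure of [4], and the function name `oracle_errors` and the toy graph are assumptions made for illustration.

```python
import math


def oracle_errors(edges, frames, initial, finals, reference):
    """Minimum word-level edit distance between `reference` and any complete WG path."""
    m = len(reference)
    outgoing = {}
    for q_from, q_to, word, _score in edges:
        outgoing.setdefault(q_from, []).append((q_to, word))

    # dist[q][j]: best edit distance between a path from q_I to q and reference[:j].
    dist = {q: [math.inf] * (m + 1) for q in frames}
    dist[initial][0] = 0
    for q in sorted(frames, key=frames.get):  # topological order by frame
        row = dist[q]
        for j in range(m):
            row[j + 1] = min(row[j + 1], row[j] + 1)  # reference word j left unmatched
        for q_to, word in outgoing.get(q, []):
            nxt = dist[q_to]
            for j in range(m + 1):
                nxt[j] = min(nxt[j], row[j] + 1)      # edge word matches no reference word
                if j < m:                             # match or substitution
                    nxt[j + 1] = min(nxt[j + 1], row[j] + (word != reference[j]))
    return min(dist[f][m] for f in finals)


# Hypothetical toy WG (scores are irrelevant for the oracle computation).
frames = {0: 0, 1: 4, 2: 5, 3: 9, 4: 12}
edges = [
    (0, 1, "antiguos", 0.6), (0, 2, "antiguas", 0.4),
    (1, 3, "ciudadanos", 0.7), (1, 3, "ciudadanas", 0.3),
    (2, 3, "ciudadanas", 0.5), (3, 4, "que", 0.8), (3, 4, "fue", 0.2),
]
reference = ["antiguas", "ciudadanos", "que"]
errors = oracle_errors(edges, frames, 0, {4}, reference)
oracle_wer = errors / len(reference)
```

In the toy graph no path contains both “antiguas” and “ciudadanos”, so the best matching path still has one word error and the oracle WER for this reference is 1/3.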

Once the WGs were generated, they were directly used by CATTI to complete the prefixes accepted by the (simulated) user. In each interaction step, the decoder parsed the validated prefix over the WG and then continued searching for a suffix which maximizes the posterior probability according to Eq. (2).
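A simplified sketch of this interaction step is given below. It is only an illustration under stated assumptions: the prefix is matched exactly (the actual system uses error-correcting parsing), the WG is a hypothetical list of `(source, target, word, log_score)` edges with topologically ordered integer node ids, and the suffix returned is the continuation of the best-scoring complete path whose first words equal the validated prefix.

```python
from collections import defaultdict


def predict_suffix(edges, start, final, prefix):
    """Return the best-scoring suffix (list of words) after `prefix`, or None.

    Assumptions: exact prefix matching; node ids in topological order.
    """
    out = defaultdict(list)
    nodes = {start, final}
    for u, v, word, logp in edges:
        out[u].append((v, word, logp))
        nodes.update((u, v))

    # Backward Viterbi pass: best[u] = (score, words) of the best path u -> final
    best = {final: (0.0, [])}
    for u in sorted(nodes, reverse=True):
        for v, word, logp in out[u]:
            if v in best:
                cand = (logp + best[v][0], [word] + best[v][1])
                if u not in best or cand[0] > best[u][0]:
                    best[u] = cand

    # Parse the prefix: frontier maps each reachable node to the best score
    # of a path from `start` that spells the prefix words consumed so far
    frontier = {start: 0.0}
    for prefix_word in prefix:
        reached = {}
        for u, score in frontier.items():
            for v, word, logp in out[u]:
                if word == prefix_word and score + logp > reached.get(v, float("-inf")):
                    reached[v] = score + logp
        frontier = reached

    candidates = [
        (score + best[u][0], best[u][1])
        for u, score in frontier.items()
        if u in best
    ]
    if not candidates:
        return None  # the prefix cannot be parsed exactly over this WG
    return max(candidates, key=lambda c: c[0])[1]


# Toy WG with hypothetical log-scores
TOY_WG = [
    (0, 1, "the", -0.2), (0, 1, "he", -1.8),
    (1, 2, "old", -0.4), (1, 2, "cold", -1.2),
    (2, 3, "house", -0.3), (2, 3, "horse", -1.5),
]
```

The `None` case shows why exact matching is not enough in practice: as soon as the user types a word that is absent from the WG, no path is compatible with the prefix, which is the situation error-correcting parsing is meant to handle.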

For KWS, the WGs were normalized by computing edge posteriors and used to obtain the frame-level word posterior probability according to Eq. (4). Finally, word confidence scores were computed from these probabilities according to Eq. (3).
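The following sketch illustrates these three steps on a toy graph. It rests on several assumptions that are not taken from the experimental software: edges are hypothetical `(source, target, word, log_score)` tuples with topologically ordered node ids, each node carries a frame index in `node_frame`, an edge is taken to span the frames from its source frame up to (but excluding) its target frame, the frame-level posterior of a word is the sum of the posteriors of all edges labelled with that word which span the frame, and the confidence score is the maximum of that posterior over frames.

```python
import math
from collections import defaultdict


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def edge_posteriors(edges, start, final):
    """Forward-backward normalization; returns one posterior per edge."""
    nodes = {start, final} | {q for e in edges for q in e[:2]}
    alpha = {q: -math.inf for q in nodes}
    beta = {q: -math.inf for q in nodes}
    alpha[start] = 0.0
    beta[final] = 0.0
    # Forward pass: edges sorted by source node (topological ids assumed)
    for u, v, _, logp in sorted(edges, key=lambda e: e[0]):
        alpha[v] = log_add(alpha[v], alpha[u] + logp)
    # Backward pass: edges sorted by decreasing target node
    for u, v, _, logp in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[u] = log_add(beta[u], logp + beta[v])
    total = alpha[final]
    return [math.exp(alpha[u] + logp + beta[v] - total) for u, v, _, logp in edges]


def keyword_score(edges, node_frame, start, final, keyword):
    """Maximum over frames of the frame-level posterior of `keyword`."""
    frame_post = defaultdict(float)
    for (u, v, word, _), post in zip(edges, edge_posteriors(edges, start, final)):
        if word == keyword:
            for frame in range(node_frame[u], node_frame[v]):
                frame_post[frame] += post
    return max(frame_post.values(), default=0.0)


# Toy WG with hypothetical unnormalized scores and frame positions
TOY_WG = [
    (0, 1, "the", math.log(4.0)), (0, 1, "he", math.log(1.0)),
    (1, 2, "old", math.log(7.0)), (1, 2, "cold", math.log(3.0)),
    (2, 3, "house", math.log(9.0)), (2, 3, "horse", math.log(1.0)),
]
NODE_FRAME = {0: 0, 1: 10, 2: 25, 3: 40}
```

Both passes visit each edge once, so the normalization cost in this sketch grows linearly with the number of WG edges; the per-frame accumulation adds a factor proportional to the edge lengths.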

Results and discussion

Experiments with the WG-based systems outlined in Sect. 3 were carried out for increasingly large WGs, as described in Sect. 4.3.

CATTI WSR, along with the corresponding average WG initialization and interactive prediction times (\(T_{\text {init}}\) and \(T_{\text {pred}}\) in milliseconds), are reported in Table 3 for the WGs obtained using the HTK toolkit with increasing Idg. \(T_{\text {init}}\) includes the time required to load the WG in memory along with the time needed to initialize the data structures for error-correcting parsing and efficient suffix search. \(T_{\text {pred}}\), on the other hand, corresponds to the time required to compute a suffix prediction, as needed in each successive word-level interaction step.

As expected, the results obtained with iATROS WGs were somewhat worse. For instance, for PAR WGs with Idg = 3 and Idg = 5 the obtained WSR was 22.9 and 22.7 %, respectively. However, the WSR, \(T_{\text {init}}\) and \(T_{\text {pred}}\) tendencies observed for increasing Idg are essentially the same as for the HTK WGs.

Table 3 CATTI WSR (in %) for different Idg values, along with average WG initialization and CATTI prediction times: \(T_{\text {init}}\) and \(T_{\text {pred}}\) (in milliseconds)

For KWS, on the other hand, Table 4 shows the average precision (AP) along with the average normalization time (\(T_{\text {norm}}\), in milliseconds) and total indexing time (\(T_{\text {indx}}\), in seconds), for increasing Idg. Here \(T_{\text {norm}}\) includes the time needed to load the WG in memory plus the average time needed for WG normalization and computation of the KWS scores according to Eqs. (3) and (4). \(T_{\text {indx}}\), on the other hand, is determined by adding \(T_{\text {norm}}\) to the corresponding WG generation time, \(T_{\text {gen}}\), given in Table 2.
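For readers unfamiliar with the AP measure, a minimal sketch of its computation from a ranked list of spotted events follows. It assumes the plain, non-interpolated definition (the mean of the precision values at the ranks of the relevant events); the exact variant used in the experiments may differ, and the function name and input format are hypothetical.

```python
def average_precision(scored_events, n_relevant=None):
    """Non-interpolated AP of (score, is_relevant) pairs.

    `n_relevant` is the total number of relevant events; if omitted, it is
    assumed that every relevant event appears in `scored_events`.
    """
    ranked = sorted(scored_events, key=lambda e: e[0], reverse=True)
    if n_relevant is None:
        n_relevant = sum(1 for _, relevant in ranked if relevant)
    if n_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / n_relevant


# Hypothetical confidence scores for four (keyword, line) events
EVENTS = [(0.9, True), (0.8, False), (0.7, True), (0.2, False)]
```

For `EVENTS`, the relevant items sit at ranks 1 and 3, so the AP is (1/1 + 2/3) / 2. Passing a larger `n_relevant` lowers the result, which is how relevant events that are missing from a too-small WG would penalize AP in this formulation.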

Table 4 KWS AP along with the average normalization and total indexing times per line: \(T_{\text {norm}}\) (in milliseconds) and \(T_{\text {indx}}\) (in seconds)

From the results, we observe that for WG Idg values larger than 10, the CATTI WSR and the KWS AP do not improve significantly (43.4 or 19.3 WSR and 0.720 or 0.893 AP, for CS or PAR, respectively). It is worth noting that, above Idg \(\,{=}\,10\), the WGs become huge (more than two orders of magnitude larger for Idg \(\,{=}\,40\)). On the other hand, in both datasets, the WSR and the AP degrade by less than 2 % when using the WGs obtained with Idg \(=5\), which are 5–7 times smaller on average. For a full comparison, Tables 3 and 4 also include results for Idg \(\,{=}\,1\), which is equivalent to using just the HTR 1-best transcription hypothesis. As can be observed, the effectiveness of both CATTI and KWS degrades very significantly in this case.

With respect to efficiency, the computing times in Tables 2, 3 and 4 clearly show that WG generation dominates all the costs. For KWS, which is intended to process and index thousands or millions of page images without any supervision, this WG generation time is the one that matters.

In the case of CATTI, which is usually aimed at semiautomatically transcribing documents with hundreds of pages, WG generation time is much less critical, as it is only spent in a preparatory phase. However, during interactive operation, large WGs may require prohibitively long initialization times (\(T_{\text {init}}\)), which negatively affect the interactive experience in the first interaction step for each line image. Moreover, for the very large WGs (Idg \(\,{=}\,40\)), the successive word-level interaction steps may also become compromised because of the large increase in prediction time (\(T_{\text {pred}}\)), thereby significantly hindering the overall usability of CATTI.

Taking into account this discussion, we conclude that Idg \(\,=\,5\) constitutes a very good trade-off between accuracy and computing cost, both for CATTI and KWS.

Remarks and conclusions

This paper studies the performance of two handwritten document image processing applications based on word graphs: CATTI and KWS. In both applications, the word graphs are generated during the decoding process of text line images using optical character HMMs and N-gram language models. The work focuses on how the performance of these applications is affected by using WGs of increasing sizes, where WG size is controlled by limiting the maximum node input degree during WG generation.

From the reported performance results, no significant differences are observed for WG input degrees equal to or larger than 5. For this input degree, the word graphs are really small, on the order of hundreds of edges on average. Such word graphs not only allow extremely fast computation of CATTI predictions and line-level KWS word confidence scores, but can also be generated at a low extra computing cost over that of standard Viterbi decoding.

The estimates reported in the paper can be used to gauge the computational resources that will be needed for performing WG-based CATTI and KWS on large collections of handwritten document images.

Notes

  1. http://www.transcriptorium.eu.
  2. CS and PAR are publicly available for research purposes from http://www.prhlt.upv.es/page/data and www.iam.unibe.ch/fki/databases, respectively.
  3. http://bv2.gva.es.

References

  1. Amengual JC, Vidal E (1998) Efficient error-correcting Viterbi parsing. IEEE Trans Pattern Anal Mach Intell 20(10):1109–1116
  2. Bazzi I, Schwartz R, Makhoul J (1999) An omnifont open-vocabulary OCR system for English and Arabic. IEEE Trans Pattern Anal Mach Intell 21(6):495–504
  3. Erman L, Lesser V (1990) The HEARSAY-II speech understanding system: a tutorial. Readings in Speech Reasoning, pp 235–245
  4. Evermann G (1999) Minimum word error rate decoding. Ph.D. thesis, Churchill College, University of Cambridge
  5. Fischer A, Wuthrich M, Liwicki M, Frinken V, Bunke H, Viehhauser G, Stolz M (2009) Automatic transcription of handwritten medieval documents. In: 15th international conference on virtual systems and multimedia, 2009. VSMM ’09, pp 137–142
  6. Frinken V, Fischer A, Manmatha R, Bunke H (2012) A novel word spotting method based on recurrent neural networks. IEEE Trans Pattern Anal Mach Intell 34(2):211–224
  7. Furcy D, Koenig S (2005) Limited discrepancy beam search. In: Proceedings of the 19th international joint conference on artificial intelligence, IJCAI’05, pp 125–131
  8. Granell E, Martínez-Hinarejos CD (2015) Multimodal output combination for transcribing historical handwritten documents. In: 16th international conference on computer analysis of images and patterns, CAIP 2015, pp 246–260. Springer International Publishing
  9. Hakkani-Tür D, Béchet F, Riccardi G, Tur G (2006) Beyond ASR 1-best: using word confusion networks in spoken language understanding. Comput Speech Lang 20(4):495–514
  10. Jelinek F (1998) Statistical methods for speech recognition. MIT Press, Cambridge
  11. Jurafsky D, Martin JH (2009) Speech and language processing: an introduction to natural language processing, speech recognition, and computational linguistics, 2nd edn. Prentice-Hall, Englewood Cliffs
  12. Kneser R, Ney H (1995) Improved backing-off for N-gram language modeling. In: International conference on acoustics, speech and signal processing (ICASSP ’95), vol 1, pp 181–184. IEEE Computer Society
  13. Liu P, Soong FK (2006) Word graph based speech recognition error correction by handwriting input. In: Proceedings of the 8th international conference on multimodal interfaces, ICMI ’06, pp 339–346. ACM
  14. Lowerre BT (1976) The harpy speech recognition system. Ph.D. thesis, Pittsburgh, PA
  15. Luján-Mares M, Tamarit V, Alabau V, Martínez-Hinarejos CD, Pastor M, Sanchis A, Toselli A (2008) iATROS: a speech and handwritting recognition system. In: V Jornadas en Tecnologías del Habla (VJTH’2008), pp 75–78
  16. Mangu L, Brill E, Stolcke A (2000) Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comput Speech Lang 14(4):373–400
  17. Manning CD, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge University Press, New York
  18. Mohri M, Pereira F, Riley M (2002) Weighted finite-state transducers in speech recognition. Comput Speech Lang 16(1):69–88
  19. Odell JJ, Valtchev V, Woodland PC, Young SJ (1994) A one pass decoder design for large vocabulary recognition. In: Proceedings of the workshop on human language technology, HLT ’94, pp 405–410. Association for Computational Linguistics
  20. Oerder M, Ney H (1993) Word graphs: an efficient interface between continuous-speech recognition and language understanding. IEEE Int Conf Acoust Speech Signal Process 2:119–122
  21. Olivie J, Christianson C, McCarry J (eds) (2011) Handbook of natural language processing and machine translation. Springer, Berlin
  22. Ortmanns S, Ney H, Aubert X (1997) A word graph algorithm for large vocabulary continuous speech recognition. Comput Speech Lang 11(1):43–72
  23. Padmanabhan M, Saon G, Zweig G (2000) Lattice-based unsupervised MLLR for speaker adaptation. In: ASR2000-automatic speech recognition: challenges for the New Millenium ISCA Tutorial and Research Workshop (ITRW)
  24. Pesch H, Hamdani M, Forster J, Ney H (2012) Analysis of preprocessing techniques for latin handwriting recognition. In: International conference on frontiers in handwriting recognition, ICFHR’12, pp 280–284
  25. Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, Hannemann M, Motlicek P, Qian Y, Schwarz P, Silovsky J, Stemmer G, Vesely K (2011) The Kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society
  26. Povey D, Hannemann M, Boulianne G, Burget L, Ghoshal A, Janda M, Karafiat M, Kombrink S, Motlicek P, Qian Y, Riedhammer K, Vesely K, Vu NT (2012) Generating Exact Lattices in the WFST Framework. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP)
  27. Rabiner L (1989) A tutorial of hidden Markov models and selected application in speech recognition. Proc IEEE 77:257–286
  28. Robertson S (2008) A new interpretation of average precision. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval (SIGIR ’08), pp 689–690. ACM
  29. Romero V, Toselli AH, Rodríguez L, Vidal E (2007) Computer assisted transcription for ancient text images. Proc Int Conf Image Anal Recogn LNCS 4633:1182–1193
  30. Romero V, Toselli AH, Vidal E (2012) Multimodal interactive handwritten text transcription. Series in machine perception and artificial intelligence (MPAI). World Scientific Publishing, Singapore
  31. Rybach D, Gollan C, Heigold G, Hoffmeister B, Lööf J, Schlüter R, Ney H (2009) The RWTH aachen university open source speech recognition system. In: Interspeech, pp 2111–2114
  32. Sánchez J, Mühlberger G, Gatos B, Schofield P, Depuydt K, Davis R, Vidal E, de Does J (2013) tranScriptorium: an European project on handwritten text recognition. In: DocEng, pp 227–228
  33. Saon G, Povey D, Zweig G (2005) Anatomy of an extremely fast LVCSR decoder. In: INTERSPEECH, pp 549–552
  34. Strom N (1995) Generation and minimization of word graphs in continuous speech recognition. In: Proceedings of IEEE workshop on ASR’95, pp 125–126. Snowbird, Utah
  35. Tanha J, de Does J, Depuydt K (2015) Combining higher-order N-grams and intelligent sample selection to improve language modeling for Handwritten Text Recognition. In: ESANN 2015 proceedings, European symposium on artificial neural networks, computational intelligence and machine learning, pp 361–366
  36. Toselli A, Romero V, i Gadea MP, Vidal E (2010) Multimodal interactive transcription of text images. Pattern Recogn 43(5):1814–1825
  37. Toselli A, Romero V, Vidal E (2015) Word-graph based applications for handwriting documents: impact of word-graph size on their performances. In: Paredes R, Cardoso JS, Pardo XM (eds) Pattern recognition and image analysis. Lecture Notes in Computer Science, vol 9117, pp 253–261. Springer International Publishing
  38. Toselli AH, Juan A, Keysers D, González J, Salvador I, Ney H, Vidal E, Casacuberta F (2004) Integrated handwriting recognition and interpretation using finite-state models. Int J Pattern Recogn Artif Intell 18(4):519–539
  39. Toselli AH, Vidal E (2013) Fast HMM-Filler approach for key word spotting in handwritten documents. In: Proceedings of the 12th international conference on document analysis and recognition (ICDAR’13). IEEE Computer Society
  40. Toselli AH, Vidal E, Romero V, Frinken V (2013) Word-graph based keyword spotting and indexing of handwritten document images. Technical report, Universitat Politècnica de València
  41. Ueffing N, Ney H (2007) Word-level confidence estimation for machine translation. Comput Linguist 33(1):9–40. doi:10.1162/coli.2007.33.1.9
  42. Vinciarelli A, Bengio S, Bunke H (2004) Off-line recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Trans Pattern Anal Mach Intell 26(6):709–720
  43. Weng F, Stolcke A, Sankar A (1998) Efficient lattice representation and generation. In: Proceedings of ICSLP, pp 2531–2534
  44. Wessel F, Schluter R, Macherey K, Ney H (2001) Confidence measures for large vocabulary continuous speech recognition. IEEE Trans Speech Audio Process 9(3):288–298
  45. Wolf J, Woods W (1977) The HWIM speech understanding system. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP ’77, vol 2, pp 784–787
  46. Woodland P, Leggetter C, Odell J, Valtchev V, Young S (1995) The 1994 HTK large vocabulary speech recognition system. In: International conference on acoustics, speech, and signal processing (ICASSP ’95), vol 1, pp 73–76
  47. Young S, Odell J, Ollason D, Valtchev V, Woodland P (1997) The HTK book: hidden Markov models toolkit V2.1. Cambridge Research Laboratory Ltd, Cambridge
  48. Young S, Russell N, Thornton J (1989) Token passing: a simple conceptual model for connected speech recognition systems. Technical report
  49. Zhu M (2004) Recall, precision and average precision. Working Paper 2004–09 Department of Statistics and Actuarial Science, University of Waterloo
  50. Zimmermann M, Bunke H (2004) Optimizing the integration of a statistical language model in hmm based offline handwritten text recognition. In: Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004, vol 2, pp 541–544

Acknowledgments

Work partially supported by the Generalitat Valenciana under the Prometeo/2009/014 Project Grant ALMAMATER, by the Spanish MECD as part of the Valorization and I+D+I Resources program of VLC/CAMPUS in the International Excellence Campus program, and through the EU projects: HIMANIS (JPICH programme, Spanish Grant Ref. PCIN-2015-068) and READ (Horizon-2020 programme, Grant Ref. 674943).

Author information

Corresponding author

Correspondence to Alejandro H. Toselli.

Cite this article

Toselli, A.H., Romero, V. & Vidal, E. Word graphs size impact on the performance of handwriting document applications. Neural Comput & Applic 28, 2477–2487 (2017). https://doi.org/10.1007/s00521-016-2336-2

Keywords

  • Computer-assisted transcription of text images
  • Keyword spotting for handwritten text
  • Historical handwritten manuscripts
  • Word graphs
  • Evaluation performance