1 Introduction

Since the advent of the printing press in the fifteenth century, the amount of printed text has grown overwhelmingly. Although a great deal of text is now generated in electronic character-coded formats (HTML, word processor files, etc.), many documents, available only in print, remain important. This is due in part to the existence of large collections of legacy documents available only in print, and in part because printed text remains an important distribution channel that can effectively deliver information without the technical infrastructure that is required to deliver character-coded text. These factors are particularly important for Arabic, which is widely used in places where the installed computer infrastructure is often quite limited. Printed documents can be browsed and indexed for retrieval relatively easily in limited quantities, but effective access to the contents of large collections requires some form of automation.

One such form of automation is to scan the documents (to produce document images) and subsequently perform OCR on the document images to convert them to text. Typically, the OCR process introduces errors into the text representation of the document images. The error level is affected by the quality of paper, printing, and scanning. The introduced errors are more pronounced in Arabic OCR (as compared to English) due to some of the orthographic and morphological features of Arabic. For example, the dataset reported on in this paper, which is based on a fairly clean book that was published 25 years ago and scanned at 300 × 300 dpi, has a word error rate of approximately 39%. Even higher word error rates were observed by the authors in their collaborative work with the Library of Alexandria on its Arabic digitization project. This is significantly higher than the average word error rate for the out-of-copyright English books (typically 100 years old) that are available through the Internet Archive. Orthographically, Arabic characters are connected and change shape depending on their position in a word. As for morphological complexity, Arabic allows the insertion of infixes to form words and the attachment of prefixes and suffixes that include pronouns, determiners, number markers (singular, dual, and plural), conjunctions, etc.

The introduced errors adversely affect the retrieval effectiveness of OCR’ed documents. This paper examines the effect of word-based post-OCR error correction, in conjunction with language modeling, on Arabic retrieval effectiveness using different index terms on two collections of degraded Arabic documents. The correction uses a character-segment-based noisy channel model and language modeling to correct OCR errors. The paper compares the effect on retrieval effectiveness of “good” error correction (with language modeling) against “moderate” error correction (without language modeling). The effect of error correction strategies is also investigated when using different index terms, namely word surface forms, morphological variants, and sub-word character n-gram sequences. The paper provides suggestions on which error correction strategies and index terms to use or avoid under different conditions to improve retrieval effectiveness. The paper is organized as follows: Sect. 2 provides background information on Arabic OCR and retrieval along with OCR error correction; Sect. 3 presents the experimental setup; Sect. 4 reports and discusses experimental results; and Sect. 5 concludes the paper and provides possible future directions.

2 Background

2.1 Arabic morphology and OCR

The goal of OCR is to transform a document image into character-coded text. The usual process is to automatically segment a document image into character images in the proper reading order using image analysis heuristics, apply an automatic classifier to determine the character codes that most likely correspond to each character image, and then exploit sequential context (e.g., preceding and following characters and a list of possible words) to select the most likely character in each position. The character error rate can be influenced by reproduction quality (e.g., original documents are typically better than photocopies), the resolution at which a document was scanned, and any mismatch between the instances on which the character image classifier was trained and the rendering of the characters in the printed document. Arabic OCR presents several challenges, including:

  • Arabic’s cursive script, in which most characters are connected and their shapes vary with position in the word. Further, multiple connected characters may resemble other single characters or combinations of characters. For example, the letter “شـ” (sheen) may resemble “نتـ” (noon—ta combination).

  • The optional use of word elongations and ligatures, which are special forms of certain letter sequences.

  • The presence of dots in 15 of the 28 letters to distinguish between different letters, and the optional use of diacritics, which can be confused with dirt, dust, and speckle (Darwish and Oard 2002a, b). The orthographic features of Arabic lead to some characters being more prone to OCR errors than others.

  • The morphological complexity of Arabic, which results in an estimated 60 billion possible surface forms, complicates dictionary-based error correction. A surface form is any group of consecutive characters in the text that may include a word with the attachment of a conjunction, a determiner, and/or a pronoun. Arabic words are built from a closed set of about 10,000 root forms that typically contain 3 characters, although 4-character roots are not uncommon, and some 5-character roots do exist. Arabic stems are derived from these root forms by fitting the root letters into a small set of regular patterns, which sometimes includes addition of “infix” characters within the root (Ahmed 2000). Thus, stems with no infixes are identical to roots. Further attachment of prefixes and suffixes that include determiners, conjunctions, pronouns, and grammatical markers produces a word surface form. Again, a word can be identical to a stem if no prefixes or suffixes are attached.

There are a number of commercial Arabic OCR systems, with Sakhr’s Automatic Reader and Shonut’s OmniPage being perhaps the most widely used (Kanungo et al. 1999a, 1997). Most Arabic OCR systems segment characters (Gillies et al. 1999; Hassibi 1994a, b; Kanungo et al. 1997), while a few opted to recognize words without segmenting characters (Allam 1995; Lu et al. 1999). A system developed by BBN avoids character segmentation by dividing lines into slender vertical frames (and frames into cells) and uses an HMM recognizer to recognize character sequences (Lu et al. 1999).

2.2 OCR degraded text retrieval

Retrieval of OCR degraded text documents has been reported on for many languages, including English (Harding et al. 1997; Kantor and Voorhees 1996; Taghva et al. 1994a, b, 1995, 1996b); Chinese (Tseng and Oard 2001); and Arabic (Darwish and Oard 2002a, b).

For English, Doermann (1997) reports that retrieval effectiveness decreases significantly for OCR’ed documents with an error rate at some point between 5% and 20%. Taghva reported experiments using English collections ranging in size between 204 and 674 documents that were about 38 pages long on average (Taghva et al. 1994b, 1995). The documents were scanned and OCR’ed. His results show negligible decline in retrieval effectiveness due to OCR errors. Taghva’s work was criticized for being done on very small collections of very long documents (Tseng and Oard 2001). Small collections might not behave like larger ones, and thus they might not be reflective of real-life applications in which retrieval from a large number of documents is required (Harman 1992). Similar results for English were reported by Smith (1990), who found no significant drop in retrieval effectiveness with the introduction of simulated OCR degradation in which characters were randomly replaced by a symbol indicating failure to recognize. These results contradict other studies in which retrieval effectiveness deteriorated dramatically as degradation increased. Hawking reported a significant drop in retrieval effectiveness at a 5% character error rate on the TREC-4 “confusion track” (Hawking 1996). In the TREC-4 confusion track, approximately 50,000 English documents from the federal registry were degraded by applying random edit operations to random characters in the documents (Kantor and Voorhees 1996). The contradiction might be due to the degradation method, the size of the collection, the size of the documents, or a combination of these factors. In general, retrieval effectiveness is adversely affected by increased degradation and decreased redundancy of search terms in the documents (Doermann 1998).

Several studies reported the results of using n-grams. A study by Harding et al. (1997) compared the use of different-length n-grams to words on four English collections in which errors were artificially introduced. The documents were degraded iteratively using a model of OCR degradation until retrieval effectiveness using words as index terms started to deteriorate significantly. The error rate in the documents was unknown. For n-grams, a combination of 2- and 3-grams and a combination of 2-, 3-, 4-, and 5-grams were compared to words. Their results show that n-gram indexing consistently outperformed word indexing, and combining more n-grams was better than combining fewer. In another study, Tseng and Oard experimented with different combinations of n-grams on a Chinese collection of 8,438 document images and 30 Chinese queries (Tseng and Oard 2001). Although no ground truth was available for the image collection from which to conclude the effect of degradation on retrieval effectiveness, the effectiveness of different index terms was compared. They experimented with unigrams, bigrams, and a combination of both. Chinese words were not segmented, and bigrams crossed word boundaries. The results show that a combination of unigrams and bigrams consistently and significantly outperforms character bigrams, which in turn consistently and significantly outperform character unigrams.

For Arabic, Darwish and Oard (2002a, b) reported that character 3-grams and 4-grams were the best index terms for searching OCR degraded text. They conducted their experiments on a small collection of 2,730 scanned documents.

In general, blind relevance feedback does not help for the retrieval of OCR degraded documents (Darwish and Emam 2005; Lam-Adesina and Jones 2006; Taghva et al. 1996a, b; Tseng and Oard 2001).

2.3 Building an OCR degraded collection

To build an OCR-degraded test collection, there are three common approaches:

  1. Printed document domain: building a collection by scanning printed documents and performing OCR. This approach is most desirable because the errors in the text are due to real OCR degradation and not a model of the degradation. However, building large test collections of several hundred thousand documents with a set of topics and relevance judgments can be very expensive. Therefore, the collections reported in the literature were all small. One such collection is a Chinese collection of 8,438 documents which was developed by Tseng and Oard (2001). The documents in Tseng’s collection varied widely in their degradation level, and there was no accurately character-coded version (OCR ground truth) for the collection. Abdelsapor et al. (2006) developed a collection of Arabic OCR’ed document images by randomly picking approximately 25 pages each from 1,378 Arabic books from Bibliotheca Alexandrina (BA), forming a set of 34,651 printed documents. Associated with the collection are a set of 25 topics that were developed using an iterative search-and-judge method (Sanderson and Joho 2004). The books cover a variety of topics including historical, philosophical, cultural, and political subjects, and the printing dates of the books range from the early 1920s to the present. Again, no ground truth is available for the collection. Having ground truth helps show the effect of degradation on retrieval. Developing OCR ground truth is typically laborious, involving either correction of OCR errors in the OCR’ed version of the collection or manual re-entry of the collection’s text. Lam-Adesina and Jones (2006) reported on a collection that they developed from the Spoken Document Retrieval (SDR) track collection. The stories in the collection were printed using different formats and fonts, and the resulting hardcopies were scanned and OCR’ed. Associated with the collection of 21,759 news stories are rough or closed-caption quality transcripts and 50 topics that were developed for the SDR track (Lam-Adesina and Jones 2006). Darwish and Oard (2003) report on a small collection of 2,730 scanned and OCR’ed document images for which ground truth exists. That collection is used in this paper and is thoroughly described later.

  2. Image domain: building a collection by synthesizing document images from a preexisting non-degraded collection, degrading the document images, and performing OCR on them. Synthesizing document images is done by typesetting the text into an image (Doermann and Yao 1995). To degrade document images, different document degradation models were developed (Baird 1990, 1993, 2000; Doermann and Yao 1995; Kanungo 1996; Kanungo et al. 1995, 1993). The models parameterize different aspects of the document images such as font size, page skew, horizontal and vertical offset, horizontal and vertical scaling, blur, resolution, pixel jitter, and sensitivity. With degradation modeling, document image collections of varying degradation levels with corresponding ground truth can be developed automatically. To verify the suitability of the generated document image collections for further OCR research, tests were developed. It is claimed that a degradation model is valid if the confusion matrices that result from automatically degraded documents are similar to the ones that result from real documents (Kanungo and Haralick 1998; Li et al. 1997; Lopresti and Zhou 1994; Nagy 1994). However, Kanungo and Haralick (1998) criticized this approach on the basis that OCR algorithms might filter certain features in either the synthetic or the real documents, making both produce similar confusion matrices. Kanungo et al. (2000) instead proposed a probabilistic method that focuses on the correctness of the model in isolation from OCR algorithms. The advantage of this approach for creating OCR-degraded collections is that it is inexpensive, the degradation level can be tuned, and OCR ground truth is automatically available. Although OCR researchers prefer real document images and real OCR output (Tseng and Oard 2001), the suitability of this approach for IR experimentation needs to be verified.

  3. Text domain: building a collection by synthesizing OCR degradation directly in the text. This approach has the advantage of being able to use a preexisting non-degraded collection with its topics and relevance judgments to rapidly build a new degraded collection. This approach was used in developing many degraded text collections (Croft et al. 1994; Harding et al. 1997; Harman 1995; Smith 1990; Taghva et al. 1996a). The degradation models ranged from ones that attempted to accurately model OCR degradation (Harding et al. 1997) to ones that randomly introduced errors (Smith 1990). Mittendorf and Schäuble (2000) argued that synthetic OCR degradation does not produce the variations in recognition probabilities observed in real OCR degradation, and it is these variations that most affect ranking permutations. Darwish (2003) introduced formal tests to verify that modeled OCR degradation has an effect on retrieval similar to that of real OCR degradation.

2.4 OCR error correction

Much research has been done to correct recognition errors in OCR-degraded collections. There are two main categories of approaches to correct these errors, namely word-level and passage-level post-OCR processing. Word-level post-processing techniques include the use of dictionary lookup, probabilistic relaxation, character and word n-gram frequency analysis (Hong 1995), and morphological analysis. Passage-level post-processing techniques include the use of word n-grams, word collocations, grammar, conceptual closeness, passage-level word clustering, linguistic context, and visual context. The following introduces some of these error correction techniques.

  • Dictionary lookup: dictionary lookup, which is the basis for the correction reported in this paper, is used to compare recognized words with words in a term list (Hong 1995; Tseng and Oard 2001). If a word is found in the dictionary, then it is considered correct. Otherwise, a checker attempts to find a dictionary word that might be the correct spelling of the misrecognized word.

Jurafsky and Martin illustrate the use of a noisy channel model to find the correct spelling of misspelled or misrecognized words (Jurafsky and Martin 2000). The model assumes that text errors are due to edit operations, namely insertions, deletions, and substitutions. Given two words, the number of edit operations required to transform one word into the other is called the Levenshtein edit distance (Baeza-Yates and Navarro 1996). To capture the probabilities associated with different edit operations, confusion matrices are employed. Another source of evidence is the relative probabilities that candidate word corrections would be observed. These probabilities can be obtained using word frequencies in a text corpus (Jurafsky and Martin 2000; Lu et al. 1999); a minimal sketch of this ranking appears at the end of this discussion. However, the dictionary lookup approach has the following problems (Hong 1995):

  (a) A correctly recognized word might not be in the dictionary. This problem could surface if the dictionary is small, if the correct word is an acronym or a named entity that would not normally appear in a dictionary, or if the language being recognized is morphologically complex. In morphologically complex languages such as Arabic, German, and Turkish, the number of valid word surface forms is arbitrarily large, which complicates building dictionaries for spell checking. The work in this paper shows that this problem can be overcome even for Arabic if the lookup dictionary is large.

  (b) A misrecognized word might match another word that is in the dictionary. An example is the recognition of the word “tear” instead of “fear”. This problem is particularly acute in a language such as Arabic, where a large fraction of three-letter sequences are valid words. In handling this problem, the error correction reported in this paper does not assume that a word is correct just because it exists in the dictionary of possible words; it allows for the possibility that the word was generated from another correct word.

Mittendorf and Schäuble (2000) argue that using dictionary lookup can be harmful to retrieval effectiveness: if a correctly recognized token does not exist in the dictionary, it is likely to have a high inverse document frequency and hence be a valuable search term, and the correction process might eliminate it. In effect, a correctly recognized token may be replaced because the ranking formula did not rank it as the best correction.
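To make the noisy channel ranking concrete, the following is a minimal sketch (not the system used in this paper): it ranks candidate dictionary words for a misrecognized token by P(word) · P(token | word), with invented corpus counts standing in for the prior and a fixed per-edit penalty standing in for learned confusion-matrix probabilities.

```python
# A minimal sketch of noisy-channel candidate ranking (illustrative only):
# P(word) comes from toy corpus counts and P(token|word) is approximated
# by a fixed per-edit penalty instead of learned confusion matrices.

DICTIONARY = {"piece": 120, "peace": 95, "pence": 14}   # word -> corpus count
TOTAL = sum(DICTIONARY.values())

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (insertion/deletion/substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]

def correct(token: str, edit_penalty: float = 0.01):
    """Rank dictionary words by P(word) * edit_penalty ** edit_distance."""
    scored = [(w, (c / TOTAL) * edit_penalty ** levenshtein(token, w))
              for w, c in DICTIONARY.items()]
    return max(scored, key=lambda pair: pair[1])

print(correct("peece"))  # 'piece' wins: one edit away and more frequent
```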

  • Character n-grams: character n-grams may be used alone or in combination with dictionary lookup (Lu et al. 1999; Taghva et al. 1994a). The premise for using n-grams is that some letter sequences are more common than others, while other letter sequences are rare or impossible. For example, the trigram “xzx” is rare in English, while the trigram “ies” is common. Using this method, an unusual sequence of letters can point to the position of an error in a misrecognized word. This technique is employed by BBN’s Arabic OCR system (Lu et al. 1999). The technique can be particularly helpful in limiting the number of candidate corrections and hence making correction more efficient; a sketch follows.
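The following toy sketch illustrates the idea: character trigrams of a token are checked against counts from a reference word list (hypothetical here), and rare or unseen trigrams localize a likely error.

```python
# Illustrative character-trigram error localization; the reference words
# below are a toy stand-in for statistics mined from a large word list.
from collections import Counter

def char_trigrams(word: str):
    padded = f"##{word}##"        # pad so edge characters appear in trigrams
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

reference_words = ["piece", "peace", "pence", "pieces", "apiece"]
trigram_counts = Counter(t for w in reference_words for t in char_trigrams(w))

def suspicious_trigrams(word: str, min_count: int = 1):
    """Return trigrams of `word` rarer than min_count in the reference."""
    return [t for t in char_trigrams(word) if trigram_counts[t] < min_count]

print(suspicious_trigrams("pxece"))  # trigrams around 'x' are unseen
```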

  • Using morphology: many morphologically complex languages, such as Arabic, Swedish, Finnish, Turkish, and German, have enormous numbers of possible words. Accounting for and listing all the possible words is not feasible for purposes of error correction. Domeij proposed a method to build a spell checker that utilizes stem lists together with orthographic rules, which govern how a word is written, and morphotactic rules, which govern how morphemes (the building blocks of meaning) are allowed to combine, to accept legal combinations of stems (Domeij et al. 1994). By breaking up compound words, dictionary lookup can be applied to the individual constituent stems. Similar work was done for Turkish, in which an error-tolerant finite state recognizer was employed (Oflazer 1996). The finite state recognizer tolerated a maximum number of edit operations away from correctly spelled candidate words. This approach was initially developed to perform morphological analysis for Turkish and was extended to perform spelling correction. The techniques used for Swedish and Turkish can potentially be applied to Arabic. Much work has been done on Arabic morphology and could potentially be extended for spelling correction. This paper tests correction without accounting for morphology.

  • Word clustering: another approach tries to cluster different spellings of a word based on a weighted Levenshtein edit distance. The insight is that important words, especially acronyms and named entities, are likely to appear more than once in a passage. Taghva described an English recognizer that identifies acronyms and named entities, clusters them, and then treats the words in each cluster as one word (Taghva et al. 1994a). Applying this technique to Arabic requires accounting for morphology, because prefixes or suffixes might be affixed to instances of named entities. De Roeck introduced a clustering technique tolerant of Arabic’s complex morphology (De Roeck and Al-Fares 2000). Perhaps the technique can be modified to make it tolerant of errors.

  • Using grammar: in this approach, a passage containing spelling errors is parsed based on a language specific grammar. In a system described by Agirre, an English grammar was used to parse sentences with spelling mistakes (Agirre et al. 1998). Parsing such sentences gives clues to the expected part of speech of the word that should replace the misspelled word. Thus candidates produced by the spell checker can be filtered. Applying this technique to Arabic might prove challenging because the work on Arabic parsing has been very limited (Moussa et al. 2003).

  • Word n-grams (language modeling): a word n-gram is a sequence of n consecutive words in text. The word n-gram technique is a flexible method that can be used to calculate the likelihood that a word sequence would appear (Magdy and Darwish 2006; Tillenius 1996). Using this method, the correct replacement for a misspelled word can often be picked. For example, in the sentence “I bought a peece of land,” the possible corrections for the word peece might be “piece” and “peace”. However, the n-gram method will likely indicate that the word trigram “piece of land” is much more likely than the trigram “peace of land.” Thus the word “piece” is a more likely correction than “peace” (see the sketch below). The work in this paper uses language modeling and does not automatically assume that a word is correct if it exists in the dictionary. This paper builds on the work of Magdy and Darwish (2006) to ascertain the effect of error correction on retrieval effectiveness.
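A toy illustration of this disambiguation follows; the trigram counts are invented and stand in for statistics estimated from a large background corpus.

```python
# Toy word-trigram disambiguation for the "peece of land" example.
# Counts are invented; a real system would estimate them from a corpus.
TRIGRAM_COUNTS = {
    ("piece", "of", "land"): 930,
    ("peace", "of", "land"): 2,     # "peace of land" is implausible
}

def pick_correction(candidates, right1, right2):
    """Choose the candidate whose trigram (candidate, right1, right2) is most
    frequent; add-one smoothing keeps unseen trigrams from scoring zero."""
    return max(candidates,
               key=lambda c: TRIGRAM_COUNTS.get((c, right1, right2), 0) + 1)

print(pick_correction(["piece", "peace"], "of", "land"))  # -> "piece"
```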

  • Multi-OCR output fusion: in this approach multiple OCR systems, which typically have different classification engines with different training data, are used to recognize the same text. The output of the different OCR systems is then fused by picking the most likely recognized sequence of tokens using language modeling (Magdy et al. 2007). This is akin to using classifier ensembles.

2.5 Arabic information retrieval

Most early studies of character-coded Arabic text retrieval relied on relatively small test collections (Abu-Salem et al. 1999; Al-Kharashi and Evens 1994); more recent results are based on a single large collection (from TREC-2001/2002) (Gey and Oard 2001; Oard and Gey 2002). Several types of index terms have been examined, including words, word clusters, terms obtained through morphological analysis (e.g., stems and roots), and character n-grams of various lengths. The effects of normalizing alternative characters, removing diacritics, and removing stop-words have also been explored (Darwish and Oard 2002a, b; Fraser et al. 2002; Larkey et al. 2002; Mayfield et al. 2001; McNamee et al. 2002). Early studies conducted on small collections suggested that roots were the best Arabic index terms (Abu-Salem et al. 1999; Al-Kharashi and Evens 1994). More recent studies using the larger TREC-2001/2002 Arabic test collection indicate that lightly stemmed words and character 3- and 4-grams result in better retrieval effectiveness than roots (Aljlayl et al. 2001; Darwish and Oard 2002a, b; Fraser et al. 2002; Larkey et al. 2002; Mayfield et al. 2001; McNamee et al. 2002). Retrieval effectiveness is known to be affected by the size, genre, and document length of the test collection, and by many details of system processing (e.g., character normalization, stop-word removal, and morphological analysis). As for OCR degraded Arabic text, a previous study suggests that character 3- and 4-grams and their combinations with index terms obtained through morphological analysis, such as light stems, outperform all other kinds of index terms (Darwish and Oard 2002a, b).

3 Experimental setup

As shown in Fig. 1, documents are scanned, OCR’ed, optionally corrected, indexed, and searched. For evaluation, two collections are employed. The first is a small collection of OCR degraded text. As for the second, since no large collection of Arabic OCR text exists, a large existing character-coded Arabic collection is corrupted to simulate OCR errors in the documents (further explanation is provided in the following subsection). The effect of corrupting the collection and of its subsequent correction on retrieval effectiveness is examined. For both collections, a portion of the collection is used to train a character-based or character-segment-based OCR error correction model. The following presents the collections, the error model used to corrupt the large collection, the error correction model used to correct both collections, and the design of experiments that test the effect of error correction on retrieval using different index terms.

Fig. 1 Document flow in a printed document retrieval system

3.1 The document collection

The first document collection is the Zad collection, which is built from Zad Al-Me’ad, a printed fourteenth century religious book, which was scanned at 300 × 300 dpi and OCR’ed using Sakhr’s Automatic Reader version 4.0 without any book-specific training. Further, a manually entered and corrected electronic copy of the Zad collection is available. The collection consists of 2,730 separate documents, 25 topics, which only include title queries, and relevance judgments that were built by exhaustively searching the collection. The number of relevant documents per topic ranges between 3 and 72, averaging 20. The average query length is 5.4 words (Darwish and Oard 2002a, b). The first author of Darwish and Oard (2002a, b) created the topics and performed the relevance judgments.

As for the large collection, the best presently available Arabic test collection was created for the TREC-2002 Cross-Language IR (CLIR) track; for brevity, it is referred to here simply as the TREC collection. It contains 383,872 articles from the Agence France Presse (AFP) Arabic newswire. NIST developed 50 topics in cooperation with the Linguistic Data Consortium (LDC), and relevance judgments were developed at the LDC by manually judging a pool of documents obtained from combining the top 100 documents from all the runs submitted by the participating teams in the TREC 2002 CLIR track. The number of known relevant documents ranges from 10 to 523, with an average of 118 relevant documents per topic (Oard and Gey 2002). The topic descriptions include a title field that briefly names the topic, a description field that usually consists of a single-sentence description, and a narrative field that is intended to contain any information that would be needed by a human judge to accurately assess the relevance of a document (Harman 1995). As for the corruption of the collection, a unigram model is used, as described in (Darwish 2003). OCR degradation is modeled as a noisy channel in which the observed characters result from the application of some distortion function on the real characters. The model used here accounts for three character edit operations: insertion, deletion, and substitution. Formally, given a clean word #C_1..C_i..C_n# and the resulting word after OCR degradation #D_1..D_j..D_m#, where D_j resulted from C_i, ε representing the null character, L representing the position of the letter in the word (beginning, middle, end, or isolated), and # marking word boundaries, the probability estimates for the three edit operations are:

$$ P_{\text{substitution}}(C_i \to D_j) = \frac{\text{count}(C_i \to D_j \mid L_{C_i})}{\text{count}(C_i \mid L_{C_i})} $$
(1)
$$ P_{\text{deletion}}(C_i \to \varepsilon) = \frac{\text{count}(C_i \to \varepsilon \mid L_{C_i})}{\text{count}(C_i \mid L_{C_i})} $$
(2)
$$ P_{\text{insertion}}(\varepsilon \to D_j) = \frac{\text{count}(\varepsilon \to D_j)}{\text{count}(C)} $$
(3)

The models are trained using 2,000 words obtained by automatically aligning the real OCR outputs from the 300 × 300 dpi version of the Zad collection with the associated clean text version.

The resulting character-level alignments are used to create a garbler that reads in a clean word #C_1..C_i..C_n# and synthesizes OCR degradation to produce #D_1..D_j..D_m#. For a given character C_i, the garbler chooses a single edit operation to perform by sampling the estimated probability distribution over the possible edit operations. If an insertion operation is chosen, the model picks a character to be inserted prior to C_i by sampling the estimated probability distribution for possible insertions. Insertions before the # (end-of-word) marker are also allowed. If a substitution operation is chosen, the substituted character is selected by sampling the probability distribution of possible substitutions. If a deletion operation is chosen, the selected character is simply deleted. Darwish (2003) validated that the effect of synthesizing OCR degradation using the aforementioned model on retrieval is consistent with the effect of real OCR degradation for the Zad collection.
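As a concrete illustration, the following is a minimal sketch of such a garbler under invented probability tables; real tables would be estimated from the aligned training data via Eqs. 1–3 and would also condition on letter position, which is omitted here for brevity.

```python
# Minimal garbler sketch: one sampled edit operation per clean character.
# All probability tables are invented stand-ins for the trained estimates.
import random

EDIT_OPS = {"d": {"keep": 0.80, "sub": 0.15, "del": 0.03, "ins": 0.02}}
SUBSTITUTIONS = {"d": {"c": 0.7, "o": 0.3}}   # P(observed | clean char)
INSERTIONS = {"l": 0.6, "i": 0.4}             # P(inserted char)

def sample(dist):
    """Draw one outcome from a {outcome: probability} distribution."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome   # guard against floating-point residue

def garble(word: str) -> str:
    out = []
    for c in word:
        op = sample(EDIT_OPS.get(c, {"keep": 1.0}))
        if op == "ins":                    # insert a character before c
            out.append(sample(INSERTIONS))
            out.append(c)
        elif op == "sub":                  # replace c
            out.append(sample(SUBSTITUTIONS[c]))
        elif op == "keep":
            out.append(c)
        # op == "del": drop c entirely
    return "".join(out)

print(garble("made"))  # e.g. "made", "mace", "maoe", "mae", "malde", or "maide"
```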

3.2 Error correction model

For OCR model training, the goal is to learn an effective model of OCR degradation to enable effective correction of OCR errors. It is desirable to minimize the number of training examples, because the process of producing the examples is manual. Previously published results indicate that training an error model with 2,000 examples produces a good model, and that as few as 5,000 examples produce nearly the best possible model (Darwish and Oard 2003). The model introduced by Darwish and Oard (2003) is used as-is in this work. For this work, 2,000 words were randomly picked from the corrupted TREC collection to train the error correction model, and 4,000 words were used from the Zad collection.Footnote 1 The trained models are used to correct the respective collections. The 2,000 words amount to roughly 2–4 pages of an average-size book and typically require 20–30 min of correction time. For all words (in training and testing), the different forms of alef (hamza, alef, alef maad, alef with hamza on top, hamza on wa, alef with hamza below it, and hamza on ya) are normalized to alef, and ya and alef maqsoura are normalized to ya. Also, all diacritics and kashidas are removed. The characters in the corrupted and manually corrected training examples may be aligned in two different ways: 1:1 character alignment (as done in the synthetic degradation process), where each character is mapped to no more than one character (including the null character for deletion or insertion); or m:n alignment, where any number of characters may be aligned to any other number of characters. The second method is more general and potentially more accurate, especially for Arabic, where a character can be confused with as many as three or four characters. The following example highlights the difference between the 1:1 and the m:n alignment approaches. Given the training pair (rnacle, made):

[Figure: the 1:1 and m:n character alignments for the training pair (rnacle, made)]

For alignment, the Levenshtein dynamic programming minimum edit distance algorithm is used to produce 1:1 alignments. The algorithm first computes the minimum number of edit operations required to transform one string into another, and is then back-traced to find the alignments. Given the output alignments of the algorithm, properly aligned characters (such as a → a and e → e) are used as anchors, ε’s (null characters) are combined with properly aligned adjacent characters (anchors) to produce m:n alignments, and ε’s between correctly aligned characters are counted as deletions or insertions.
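A sketch of the 1:1 alignment step follows: a standard edit-distance table with backtracing, emitting (clean, noisy) character pairs in which None plays the role of ε. Merging the None entries into adjacent anchors, as described above, would then yield the m:n alignments.

```python
# Levenshtein DP alignment with backtrace; None stands for the null
# character (epsilon). One optimal alignment is returned.
def align(clean: str, noisy: str):
    n, m = len(clean), len(noisy)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # delete all of clean[:i]
    for j in range(m + 1):
        dist[0][j] = j                      # insert all of noisy[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (clean[i - 1] != noisy[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, sub)
    pairs, i, j = [], n, m                  # backtrace from the corner
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (clean[i - 1] != noisy[j - 1])):
            pairs.append((clean[i - 1], noisy[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((clean[i - 1], None))   # deletion
            i -= 1
        else:
            pairs.append((None, noisy[j - 1]))   # insertion
            j -= 1
    return pairs[::-1]

print(align("made", "rnacle"))
# -> [(None, 'r'), ('m', 'n'), ('a', 'a'), (None, 'c'), ('d', 'l'), ('e', 'e')]
# one optimal 1:1 alignment; merging the None entries into neighbouring
# anchors recovers the m:n pairs m -> rn and d -> cl
```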

To formalize the error model, given a clean word χ = #C_1..C_k..C_l..C_n# and the resulting OCR degraded word δ = #D_1..D_x..D_y..D_m#, where D_x..D_y resulted from C_k..C_l, ε representing the null character, and # marking word boundaries, the probability estimates for the three edit operations are:

$$ P_{\text{substitution}}(C_k..C_l \to D_x..D_y) = \frac{\text{count}(C_k..C_l \to D_x..D_y)}{\text{count}(C_k..C_l)} $$
(4)
$$ P_{\text{deletion}}(C_k..C_l \to \varepsilon) = \frac{\text{count}(C_k..C_l \to \varepsilon)}{\text{count}(C_k..C_l)} $$
(5)
$$ P_{\text{insertion}}(\varepsilon \to D_x..D_y) = \frac{\text{count}(\varepsilon \to D_x..D_y)}{\text{count}(C)} $$
(6)

When decoding a corrupted string δ composed of the characters D_1..D_x..D_y..D_m, the goal is to find a string χ composed of the characters C_1..C_k..C_l..C_n such that P(δ|χ) · P(χ) is maximized. P(χ) is the prior probability of observing χ in text and P(δ|χ) is the conditional probability of producing δ from χ.

A modification to the above involves giving a small uniform probability to single-character substitutions that are unseen in the training data (Magdy and Darwish 2006). This is done in accordance with Lidstone’s law to smooth the probabilities. The probability is set to be 100 times smaller than the smallest probability of any seen single-character substitution.Footnote 2

For the Zad collection, P(χ) is computed from a web-mined collection of religious text by Ibn Taymiya, who was the main teacher of the medieval author of the Zad book. The collection contains approximately 16 million words, with 279,000 unique surface forms. As for the TREC collection, P(χ) is computed from a web-mined collection of Arabic newswire documents from the BBC, Al-Ahram newspaper, Al-Jazeera news site, Al-Wafd newspaper, and Al-Moheet news site. The collection contains 12 million words, with nearly 260,000 unique surface forms.

P(δ|χ) is calculated using the trained model, as follows:

$$ P(\delta \mid \chi) = \prod_{\text{all}\ D_x..D_y} P(D_x..D_y \mid C_k..C_l) $$
(7)

The segments D_x..D_y are generated by finding all possible 2^(n−1) segmentations of the word δ. For example, given “macle”, all possible segmentations are (m,a,c,l,e), (ma,c,l,e), (m,ac,l,e), (mac,l,e), (m,a,cl,e), (ma,cl,e), (m,acl,e), (macl,e), (m,a,c,le), (ma,c,le), (m,ac,le), (mac,le), (m,a,cle), (ma,cle), (m,acle), (macle). The segmentation producing the highest probability is chosen.
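The enumeration itself is straightforward, as the following sketch shows: each of the n−1 gaps between adjacent characters independently is or is not a split point.

```python
# Enumerate all 2**(n-1) segmentations of a token: one binary split
# decision per gap between adjacent characters.
from itertools import product

def segmentations(word: str):
    for cuts in product([False, True], repeat=len(word) - 1):
        segs, start = [], 0
        for i, cut in enumerate(cuts, 1):
            if cut:
                segs.append(word[start:i])
                start = i
        segs.append(word[start:])
        yield tuple(segs)

all_segs = list(segmentations("macle"))
print(len(all_segs))               # 16 == 2**(5-1)
print(all_segs[0], all_segs[-1])   # ('macle',) ... ('m','a','c','l','e')
```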

All segment sequences C_k..C_l known to produce D_x..D_y for each of the possible segmentations are produced. If a sequence of C_k..C_l segments generates a valid word χ that exists in the web-mined collection, then argmax_χ P(δ|χ) · P(χ) is computed; otherwise the sequence is discarded. Possible corrections are subsequently ranked.
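Putting the pieces together, the following condensed sketch decodes a single token under invented segment-model and prior tables (stand-ins for the trained estimates of Eqs. 4–6 and the web-mined word frequencies); it reuses the segmentation generator sketched above.

```python
# Condensed single-token decoder: expand each segmentation of the noisy
# token into clean-word candidates and rank by P(noisy|clean) * P(clean).
# SEGMENT_MODEL and PRIOR are invented stand-ins for trained estimates.
from itertools import product

SEGMENT_MODEL = {            # noisy segment -> {clean segment: P(noisy|clean)}
    "rn": {"m": 0.8}, "cl": {"d": 0.7},
    "a": {"a": 0.99}, "c": {"c": 0.95}, "l": {"l": 0.95}, "e": {"e": 0.98},
}
PRIOR = {"made": 3e-4, "mace": 5e-6}   # P(clean word) from a corpus

def candidates(segs):
    """Yield (clean_word, P(noisy|clean)) for one segmentation."""
    options = [SEGMENT_MODEL.get(s, {}) for s in segs]
    if not all(options):               # some segment has no known source
        return
    for combo in product(*(o.items() for o in options)):
        word = "".join(clean for clean, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield word, prob

def decode(noisy: str, segmentations):
    """Return the best correction, argmax P(noisy|clean) * P(clean)."""
    scored = ((w, p * PRIOR.get(w, 0.0))
              for segs in segmentations for w, p in candidates(segs))
    return max(scored, key=lambda pair: pair[1], default=(noisy, 0.0))

segs = [("rn", "a", "cl", "e"), ("rn", "a", "c", "l", "e")]
print(decode("rnacle", segs))   # -> ('made', ...): rn->m and cl->d win
```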

3.3 Language modeling

For language modeling, a trigram language model is trained on the same web-mined collections mentioned in the previous subsection, without any morphological processing. As with the Zad and TREC collections, alef and ya letter normalizations are performed, and diacritics and kashidas are removed. The language model is built using the SRILM toolkit with Good-Turing smoothing and default backoff.
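For concreteness, a minimal sketch of the normalization step follows; the Unicode ranges and character sets are one reading of the description above (alef variants folded to bare alef, alef maqsoura folded to ya, diacritics and kashida stripped), not code from the paper.

```python
# Arabic orthographic normalization as described in the text; the exact
# character sets below are assumptions based on that description.
import re

ALEF_VARIANTS = "أإآءؤئ"                      # hamza/alef forms -> bare alef
DIACRITICS = re.compile("[\u064B-\u0652]")     # fathatan .. sukun
KASHIDA = "\u0640"                             # tatweel (elongation)

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text).replace(KASHIDA, "")
    for ch in ALEF_VARIANTS:
        text = text.replace(ch, "\u0627")      # bare alef
    return text.replace("\u0649", "\u064A")    # alef maqsoura -> ya

print(normalize("إِسْلاَم"))   # -> "اسلام"
```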

Given a corrupted word sequence Δ = {δ_1..δ_i..δ_n} and Ξ = {X_1..X_i..X_n}, where X_i = {χ_i0..χ_im} are the possible corrections of δ_i (m = 10 for all the experiments reported in the paper), the aim is to find a sequence Ω = {ω_1..ω_i..ω_n}, where ω_i ∈ X_i, that maximizes:

$$ \underbrace{\left( \prod_{i=1..n,\ j=1..m} P(\chi_{ij} \mid \chi_{i-1,j}, \chi_{i-2,j}) \right)}_{\text{Language Model}} \cdot \underbrace{P(\delta_i \mid \chi_{ij})}_{\text{Character Model}} $$
(8)

For each corrupted word δ_i, the top m (m = 10) corrections X_i = {χ_i0..χ_im}, as computed by the character error model described above, are generated. So given a sequence Δ = {δ_1..δ_i..δ_n}, the top m corrections for each word are generated, leading to Ξ = {X_1..X_i..X_n}. All possible sequences ω_1..ω_n (ω_i ∈ X_i) are generated and scored using Eq. 8. The highest-scoring sequence is picked as the correct sequence Ω.
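A minimal sketch of this sequence search follows, with an invented toy trigram table and per-word candidate lists; exhaustive enumeration is shown for clarity, though the m^n growth makes beam search the practical choice for long sequences.

```python
# Toy sequence decoder for Eq. 8: score every combination of per-word
# candidates by trigram LM log-probability plus character-model log-prob.
# The LM table and candidate lists are invented for illustration.
from itertools import product
import math

def best_sequence(candidates, lm, floor=1e-9):
    """candidates: one list per corrupted word of (correction, channel_prob)
    pairs; lm: {(w1, w2, w3): probability}. Returns the best word sequence."""
    best, best_score = None, -math.inf
    for seq in product(*candidates):
        words = ["<s>", "<s>"] + [w for w, _ in seq]
        score = sum(math.log(p) for _, p in seq)          # character model
        score += sum(math.log(lm.get((words[i - 2], words[i - 1], words[i]),
                                      floor))             # language model
                     for i in range(2, len(words)))
        if score > best_score:
            best, best_score = [w for w, _ in seq], score
    return best

lm = {("<s>", "<s>", "a"): 0.10,
      ("<s>", "a", "piece"): 0.01, ("<s>", "a", "peace"): 0.01,
      ("a", "piece", "of"): 0.20, ("a", "peace", "of"): 0.002,
      ("piece", "of", "land"): 0.30, ("peace", "of", "land"): 0.001}
cands = [[("a", 0.9)],
         [("piece", 0.5), ("peace", 0.5)],   # equally likely per channel
         [("of", 0.9)],
         [("land", 0.9)]]
print(best_sequence(cands, lm))   # -> ['a', 'piece', 'of', 'land']
```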

3.4 Testing the models

Two types of tests are performed to measure the effect of error correction. The first type examines the change in Word Error Rate (WER), which is computed by examining sets of approximately 2,000 and 6,000 words for the Zad and TREC collections, respectively. The testing is done for the 1:1 and m:n character models with language modeling (LM) enabled or disabled. In all the results reported in this paper, the top correction is chosen. The second type examines the effect of correction on retrieval effectiveness. The retrieval experiments are performed on the clean, OCR degraded/synthetically corrupted, and corrected versions of the Zad and TREC collections described above. Note that for the TREC collection, only the m:n character mapping is used. The authors’ intuition is that since the TREC collection is corrupted using a 1:1 model, using either model would not make much difference, as the m:n model is a generalization of the 1:1 model. Multiple corrected versions of the collection are generated with all the correction models mentioned above. The resulting corrected collections are as follows:

For the Zad collection, correction with:

  1. 1:1 character error model.

  2. m:n character error model.

  3. 1:1 character error model + language model.

  4. m:n character error model + language model.

For the TREC collection, correction with:

  1. m:n character error model.

  2. m:n character error model + language model.

The collections are indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words obtained using Al-Stem (Oard and Gey 2002). For all experiments, Indri is used with no blind relevance feedback, stop-word removal, or stemming. Indri combines an inference network model with language modeling (Metzler and Croft 2004). The figure of merit for evaluating retrieval results is mean average precision (MAP). Statistical significance between different retrieval results is assessed using a paired 2-tailed t-test and a Wilcoxon test with continuity correction, with p-values of less than 0.05 assumed to indicate statistical significance. The Wilcoxon test p-values are reported for completeness. There are some indications that the t-test is sufficiently reliable despite the fact that the normality condition might not be met (Sanderson and Zobel 2005).
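For reference, this testing procedure can be reproduced with a few lines of SciPy; the per-topic average precision values below are placeholders, not results from the paper.

```python
# Paired significance testing over per-topic average precision (AP) values.
# The AP lists are placeholders; MAP is simply their mean.
from statistics import mean
from scipy.stats import ttest_rel, wilcoxon

run_a = [0.42, 0.31, 0.55, 0.12, 0.47, 0.38, 0.29, 0.51]  # AP per topic
run_b = [0.35, 0.28, 0.49, 0.15, 0.40, 0.33, 0.27, 0.44]

print(f"MAP a={mean(run_a):.3f}, MAP b={mean(run_b):.3f}")
t_stat, t_p = ttest_rel(run_a, run_b)                  # paired 2-tailed t-test
w_stat, w_p = wilcoxon(run_a, run_b, correction=True)  # continuity correction
print(f"t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
print("significant at 0.05" if t_p < 0.05 else "not significant at 0.05")
```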

4 Results and discussion

Tables 1 and 2 summarize the effect of correction on WER for the Zad and TREC collections, respectively. As stated earlier, two sets of 2,000 and 6,000 words are used to test the correction of the Zad and the TREC collections, respectively. The evaluation involved examining the word error rate before and after correction with language modeling enabled or disabled. The results show that error correction removes a large portion of the errors, with language modeling having a positive impact on error correction. Also, error correction is more effective for the Zad collection than for the TREC collection. This could be a result of better dictionary coverage and a better-matched language model for the correction of the Zad collection.

Table 1 Word error rate (WER) and error reduction (ER) for correction with the different models for the Zad collection
Table 2 Word error rate (WER) and error reduction (ER) for correction with the different models for the TREC collection

Figures 2 and 3 and Tables 3 and 4 summarize the retrieval results of searching the original (clean), OCR’ed (corrupted/bad), and corrected versions of the Zad and TREC collections, respectively, using words, character 3-grams, character 4-grams, and lightly stemmed words. Tables 5 and 6 provide the p-values of the paired 2-tailed t-test and Wilcoxon test comparing the results for the Zad and TREC collections, respectively. The results confirm that character 3- and 4-grams are indeed the best index terms, with 3-grams on uncorrected text outperforming words and light stems even after correction. For the Zad collection, correcting with or without language modeling yields retrieval effectiveness that is statistically indistinguishable from both the original uncorrupted and the OCR degraded versions of the collection when indexing using words (Table 3). However, for the TREC collection (Table 4), using language modeling statistically improves effectiveness over the corrupted version and makes effectiveness indistinguishable from the clean version. The same is true for the use of light stems for the Zad collection with and without language modeling, and for the TREC collection with language modeling only. For character 3-grams, error correction statistically significantly improves retrieval effectiveness over the corrupted versions for the Zad and TREC collections (except for 3-grams with the m:n model without language modeling). Unlike character 3-grams, character 4-grams do not necessarily improve retrieval effectiveness statistically. For character 3- and 4-grams, retrieval effectiveness is generally statistically significantly worse than on the clean text (except for 4-grams with the m:n model with and without language modeling and with the 1:1 model with language modeling for the Zad collection).

Fig. 2 Results in MAP of searching the original, bad, and corrected versions of the Zad collection (+LM indicates the use of language modeling)

Fig. 3 Results in MAP of searching the original, bad, and corrected versions of the TREC collection (+LM indicates the use of language modeling)

Table 3 Results in MAP of searching the original, bad, and corrected versions of the Zad collection (+LM indicates the use of language modeling). The left|right squares below MAP for corrected versions indicate t-test values in comparing to the clean and bad collections respectively (using t-test values from Table 5), with black and grey indicating statistically significantly worse or better, respectively, and white indicating no statistical significance
Table 4 Results in MAP of searching the original, bad, and corrected versions of the TREC collection (+LM indicates the use of language modeling). The left|right squares below MAP for corrected version indicate t-test values in comparing to the clean and bad collections respectively (using t-test values from Table 6), with black and grey indicating statistically significantly worse or better, respectively, and white indicating no statistical significance
Table 5 p-Value of the paired 2-tailed t-test and Wilcoxon test comparisons of retrieval results for the ZAD Collection for Base Model. Black and Grey squares indicate that results are statistically significantly worse and better than corrected version, respectively
Table 6 p-Value of the paired 2-tailed t-test and Wilcoxon test comparisons of retrieval results for the TREC Collection for Base Model. Black and Grey squares indicate that results are statistically significantly worse and better than corrected version, respectively

The results suggest that given a moderately degraded Arabic collection with a resulting word error rate greater than 20%, performing no correction and searching using character 3- or 4-grams is not a bad strategy. This can be seen by comparing the results for character 3- and 4-grams on the corrupted version against the corrected and stemmed versions of Zad and TREC. The results also suggest that indexing using short n-grams such as 3-grams is a better strategy than moderate error correction with no language modeling.

As for using a language model, with the m:n model for both collections, error correction statistically significantly improves retrieval effectiveness, and for the corrected Zad collection, unlike the TREC collection, retrieval effectiveness is statistically indistinguishable from the effectiveness of retrieving from the clean version. This suggests that “good” error correction, with a word error rate of less than 15%, can have a statistically significant positive effect on retrieval, possibly improving it to the level of retrieval from clean documents.

Another interesting and important observation is that correction in these experiments is done at the word level without any morphological analysis, and the correction yielded good results. In fact, using the m:n character model with language modeling reduces the word error rate by 70%. This suggests that using a large language model for correcting a morphologically rich language like Arabic can minimize the need for morphological analysis. Further, indexing using character n-grams can benefit from good correction that performs no morphological analysis. This is advantageous because character 3- and 4-grams are the best index terms for OCR degraded Arabic text.

5 Conclusion and future work

This paper examines the effect of OCR error correction on the retrieval effectiveness of Arabic OCR degraded documents. When correcting without language modeling, the word error rate is nearly halved, but the effect on retrieval effectiveness is less pronounced, with no guarantee of statistically significant improvement. This suggests that given only moderate error correction, performing no correction and using character n-grams is not a bad strategy. However, given “good” error correction, as in the case of using language modeling, retrieval effectiveness can improve in a statistically significant way (often to the level of retrieval of the uncorrupted documents). Therefore, unless error correction is “very good” (reducing the word error rate below roughly 15%), using n-gram index terms is preferred for retrieval. Further, given a large language model, word-based error correction can be effective for Arabic, which is orthographically and morphologically complex, even in the absence of morphological processing. Also, character 3- and 4-grams, which are the best index terms for OCR degraded Arabic text, can benefit from word-based correction with language modeling.

For future work, there are a few clear directions to follow. Investigating sub-word error correction techniques may prove useful for languages where the best index terms are n-grams. Further, a comparison of the effect of error correction as opposed to query garbling is warranted (Darwish and Oard 2003). Also, a serious exploration of the effect of correction on large real OCR document collections is warranted. Unfortunately, there are no reports in the literature of TREC size Arabic OCR document collections and much effort needs to be invested to create such collections. Lastly, investigating the effect of correction using language modeling but no character level model is warranted.