A Robust Word Spotting System for Historical Arabic Manuscripts

Cheriet, Mohamed; Moghaddam, Reza Farrahi

doi:10.1007/978-1-4471-4072-6_19

Abstract

A novel system for word spotting in old Arabic manuscripts is developed. The system has a complete chain of operations and consists of three major steps: pre-processing, data preparation, and word spotting. In the pre-processing step, using multi-level classifiers, clean binarization is obtained from the input degraded document images. In the second step, the smallest units of data, i.e., the connected components, are processed and clustered in a robust way in a library, based on features which have been extracted from their skeletons. The preprocessed data are ready to be used in the final and third step, in which occurrences of queries are located within the manuscript. Various techniques are used to improve the performance and to cope with possible inaccuracies in data and representation. The latter techniques have been developed in an integrated collaboration with scholars, especially for relaxing the system to absorb various scripts. The system is tested on an old manuscript with promising results.

Notes

1. http://islamsci.mcgill.ca/RASI/; http://islamsci.mcgill.ca/RASI/pipdi.html

References

Al-Khatib, W.G., Shahab, S.A., Mahmoud, S.A.: Digital library framework for Arabic manuscripts. In: Shahab, S.A. (ed.) AICCSA’07, Amman, Jordan, May 13–16, 2007, pp. 458–465 (2007)
Antonacopoulos, A., Downton, A.: Special issue on the analysis of historical documents. Int. J. Doc. Anal. Recognit. 9(2), 75–77 (2007)
Bradley, D., Roth, G.: Adaptive thresholding using the integral image. J. Graph. Tools 12(2), 13–21 (2007)
Chang, H.-H., Yan, H.: Analysis of stroke structures of handwritten Chinese characters. IEEE Trans. Syst. Man Cybern., Part B, Cybern. 29(1), 47–61 (1999)
Chomsky, N.: Aspects of the Theory of Syntax, 1st edn. MIT Press, Cambridge (1969). ISBN-10: 0262530074
Farin, G.: Curves and Surfaces for Computer Aided Geometric Design: A Practical Guide, 5th edn. Academic Press, San Diego (2001)
Gacek, A.: Arabic manuscripts: a vademecum for readers. In: Handbook of Oriental Studies. Section 1 The Near and Middle East, vol. 98. Brill, Leiden/Boston (2009). ISBN-10: 90 04 17036 7
Hamza, H., Belaid, Y., Belaid, A., Chaudhuri, B.B.: Incremental classification of invoice documents. In: ICPR’08, Tampa, FL, USA, December 8–11, 2008, pp. 1–4 (2008)
Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision. Addison-Wesley/Longman, Reading (1992)
Huang, L., Wan, G., Liu, C.: An improved parallel thinning algorithm. In: ICDAR’03, pp. 780–783 (2003)
Joosten, I.: Applications of microanalysis in the cultural heritage field. Mikrochim. Acta 161(3), 295–299 (2008)
Kane, S., Lehman, A., Partridge, E.: Indexing George Washington’s Handwritten Manuscripts: A study of word matching techniques. CIIR technical report, University of Massachusetts, Amherst (2001)
Kohonen, T.: Self Organizing Maps, 3rd edn. Springer, Berlin (2001)
Leydier, Y., Le Bourgeois, F., Emptoz, H.: Omnilingual segmentation-free word spotting for ancient manuscripts indexation. In: Le Bourgeois, F. (ed.) ICDAR’05, vol. 1, pp. 533–537 (2005)
Leydier, Y., Ouji, A., LeBourgeois, F., Emptoz, H.: Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognit. 42, 2089–2105 (2009)
Manso, M., Carvalho, M.L.: Application of spectroscopic techniques for the study of paper documents: A survey. Spectrochim. Acta, Part B, At. Spectrosc. 64(6), 482–490 (2009)
Matuschek, M., Schlüter, T., Conrad, S.: Measuring text similarity with dynamic time warping. In: Proceedings of the 2008 International Symposium on Database Engineering and Applications, Coimbra, Portugal, pp. 263–267. ACM, New York (2008)
Moghaddam, R.F., Cheriet, M.: Application of multi-level classifiers and clustering for automatic word-spotting in historical document images. In: ICDAR’09, Barcelona, Spain, July 26–29, 2009, pp. 511–515 (2009)
Moghaddam, R.F., Cheriet, M.: Low quality document image modeling and enhancement. Int. J. Doc. Anal. Recognit. 11(4), 183–201 (2009)
Moghaddam, R.F., Cheriet, M.: Restoration of single-sided low-quality document images. Pattern Recognit. 42, 3355–3364 (2009)
Moghaddam, R.F., Cheriet, M.: A multi-scale framework for adaptive binarization of degraded document images. Pattern Recognit. 43(6), 2186–2198 (2010)
Moghaddam, R.F., Cheriet, M., Adankon, M.M., Filonenko, K., Wisnovsky, R.: IBN SINA: a database for research on processing and understanding of Arabic manuscripts images. In: DAS’10, Boston, Massachusetts, pp. 11–18. ACM, New York (2010)
Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 38–62 (2000)
Nakayama, K., Hasegawa, H., Hernandez, C.A.: Handwritten alphabet and digit character recognition using skeleton pattern mapping with structural constraints. In: Proc. ICANN’93, Amsterdam, September 1993, pp. 941 (1993)
Rath, T., Manmatha, R.: Word spotting for historical documents. Int. J. Doc. Anal. Recognit. 9(2), 139–152 (2007)
Barni, M., Beraldin, J.-A., Lahanier, C., Piva, A.: Signal processing in visual cultural heritage, special issue. In: IEEE Signal Processing Magazine, vol. 25(4) (2008)
Rodriguez-Serrano, J.A., Perronnin, F., Llados, J., Sanchez, G.: A similarity measure between vector sequences with application to handwritten word image retrieval. In: CVPR’09 (2009)
Rothfeder, J.L., Feng, S., Rath, T.M.: Using corner feature correspondences to rank word images by similarity. In: Workshop on Document Image Analysis and Retrieval, Madison, June 20, 2003
Saykol, E., Sinop, A.K., Gudukbay, U., Ulusoy, O., Cetin, A.E.: Content-based retrieval of historical ottoman documents stored as textual images. IEEE Trans. Image Process. 13(3), 314–325 (2004)
Shafait, F., Keysers, D., Breuel, T.M.: Efficient implementation of local adaptive thresholding techniques using integral images. In: Document Recognition and Retrieval XV, San Jose, CA, January 2008
Sharma, O., Mioc, D., Anton, F.: Voronoi diagram based automated skeleton extraction from colour scanned maps. In: ISVD’06, pp. 186–195 (2006)
Shih, F.Y., Pu, C.C.: A skeletonization algorithm by maxima tracking on Euclidean distance transform. Pattern Recognit. 28(3), 331–341 (1995)
Spitz, A.L.: Using character shape codes for word spotting in document images. In: Dori, D., Bruckstein, A. (eds.) Shape, Structure and Pattern Recognition, pp. 382–389. World Scientific, Singapore (1995)
Steinherz, T., Intrator, N., Rivlin, E.: A special skeletonization algorithm for cursive words. In: IWFHR’00, pp. 529–534 (2000)
Tang, Y.Y., Suen, C.Y., De Yan, C., Cheriet, M.: Financial document processing based on staff line and description language. IEEE Trans. Syst. Man Cybern. 25(5), 738–754 (1995)
The Mathworks Inc., Natick, MA. MATLAB Version 7.5.0
van der Zant, T., Schomaker, L., Haak, K.: Handwritten-word spotting using biologically inspired features. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1945–1957 (2008)
van Dongen, S.: Graph clustering by flow simulation. Ph.D. thesis, Univ. Utrecht., May 2000
Vijaya Kumar, V., Srikrishna, A., Ali Shaik, S., Trinath, S.: A new skeletonization method based on connected component approach. Int. J. Comput. Sci. Netw. Secur. 8, 133–137 (2008)
Yalniz, I.Z., Altingovde, I.S., Güdükbay, U., Ulusoy, O.: Ottoman archives explorer: A retrieval system for digital Ottoman archives. J. Comput. Cult. Herit. 2(3), 1–20 (2009)
Zeng, J., Liu, Z.-Q.: Stroke segmentation of Chinese characters using Markov random fields. In: ICPR’06, vol. 1, pp. 868–871 (2006)
Zhu, X.: Shape recognition based on skeleton and support vector machines. In: Advanced Intelligent Computing Theories and Applications. With Aspects of Contemporary Intelligent Computing Techniques, vol. 2, pp. 1035–1043 (2007)

Acknowledgements

The authors would like to thank the NSERC of Canada for their financial support. Also, we would like to acknowledge Dr. Robert Wisnovsky and his team, from IIS, McGill University, for their collaboration and fruitful comments and discussions.

Author information

Authors and Affiliations

Synchromedia Laboratory for Multimedia Communication in Telepresence, École de Technologie Supérieure, Montréal, QC, H3C 1K3, Canada
Mohamed Cheriet & Reza Farrahi Moghaddam

Authors

Mohamed Cheriet
Reza Farrahi Moghaddam

Corresponding author

Correspondence to Mohamed Cheriet.

Editor information

Editors and Affiliations

Institute for Communications Technology, Braunschweig Technical University, Schleinitzstraße 22, Braunschweig, 38106, Niedersachsen, Germany
Volker Märgner
Institute for Communications Technology, Braunschweig Technical University, Schleinitzstraße 22, Braunschweig, 38106, Niedersachsen, Germany
Haikal El Abed

Appendices

Appendix A: Competence vs. Performance

Pattern recognition and document understanding [4, 35] are actually equivalent to a process in which humans (authors, writers, and copiers of documents and manuscripts) are modeled as a kind of machine. These people usually differ in the writing they produce mainly because of their philological and psychological differences and limitations. In other words, some aspects of these variations are rooted in the inability of such a “machine” to reproduce a well-standardized set of patterns. This inability can be related to motor limitations of hand, mind, or memory, for example. In modern societies, and for short time scales, where a high level of education based on a single “school” can be assumed, this is close to reality.

Performance [5] is a term used to measure the level of this imperfection, and is a concept that has been used widely in many fields, from linguistics to engineering. Performance is a measure of the output of a writing process. It reflects not only the personality of the writer, but also, and mainly, the errors and inabilities of the writer. Most approaches to document understanding, which usually originated from the engineering point of view, overestimate performance. This tradition has resulted in attempts to filter out, “average,” or ignore variations.

In earlier times, there was more diversity still, although even in modern societies there are variations in the way individuals think. Every author, writer, and copier has a unique and well-developed ability to think independently. It may be difficult to gauge the value of the large number of manuscripts produced over a period of a thousand years or so, but, if we look at them on a scale of a few years, they appear as individual gems. The deep philosophical thinking of their authors is reflected in their lives through many stories told about them. It is obvious that their thinking would be reflected in their writing style as well. Moreover, their writing styles should also be as unique as the philosophical schools to which they belong. In other words, a major part of the difference between the writing styles of two manuscripts arises from differences in the thinking of the writers, rather than their inability to replicate a standard style.

This brings us to competence [5], a concept which has been somewhat ignored in document analysis and understanding. In contrast to performance, competence actually depends on the knowledge of the writer. We believe that a successful approach to analysis and understanding of historical-manuscript images should be based on accepting differences in competence and allowing enough room for variations. This is why our approach is adaptable to any manuscript under study, as it “learns” and organizes the information in the manuscript, and can therefore capture a large proportion of the possible variations. This may, of course, lead to a sharp rise in the resources needed to understand the documents, as well as in the complexity of the models. But, at the same time, if the approach attempts to understand the underlying philosophy behind the writing style, it can easily isolate and locate sources of variations in competence, and can therefore control the complexity of the model.

Appendix B: A Priori Information

In order to use the document properties directly in processing, some parameters, considered as a priori information, are defined and included in the models. The first and most important parameter is the average stroke width, which depends on the writing style and the acquisition setup. The average line height and the minimum and maximum dot sizes are a few other parameters.

As discussed in the previous sections, the key concepts in the proposed framework are characteristic lengths, which are defined based on the range of interactions. These parameters can be extracted using various tools, such as wavelet transform or kernel-based analysis. Although these parameters may vary drastically, even on a single image from one site (paragraph) to another, their behavior is usually very robust and almost constant over a whole dataset. Therefore, many learning and data mining methods can be used to obtain robust values for the characteristic lengths. In this work, we assume that the values of these parameters are known a priori and are constant for each manuscript. Below, a few characteristic lengths are defined.

1. Stroke width

The most important characteristic length on a document image is the stroke width. In this work, we use the average stroke width, w_s, as a priori information in the form of a constant parameter. It is estimated using a kernel-based algorithm (see Appendix C).

2. Line height

The second most important characteristic length on document images is the shortest distance between text sites, which are usually text lines. We call this parameter the line height. By definition, line height is the distance between two adjoining baselines. Again, in this work, only the average line height, h_l, is used.

3. Vertical extent of text line

The average vertical extent of a text line, h_e, is defined as the average distance to which the text pixels extend from a text line. It is different from the line height h_l, which is the average distance between two successive baselines.
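
As a rough illustration of how two of these characteristic lengths could be obtained from a binarized page, the sketch below estimates w_s and h_l with simple run-length and projection-profile heuristics. These heuristics are stand-ins chosen for brevity and are not the kernel-based estimator used in this work. The names `CharacteristicLengths`, `estimate_stroke_width` and `estimate_line_height` are hypothetical, and the input is assumed to be a 2-D array in which ink pixels are 1 and background pixels are 0.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class CharacteristicLengths:
    """A priori parameters, assumed constant for one manuscript (in pixels)."""
    w_s: float  # average stroke width
    h_l: float  # average line height


def _runs_of_ones(values):
    """Return (starts, ends) of consecutive runs of 1s; ends are exclusive."""
    padded = np.concatenate(([0], np.asarray(values, dtype=np.int8), [0]))
    steps = np.diff(padded)
    return np.flatnonzero(steps == 1), np.flatnonzero(steps == -1)


def estimate_stroke_width(ink):
    """Median ink run length over all rows and columns (stand-in heuristic)."""
    lengths = []
    for line in list(ink) + list(ink.T):
        starts, ends = _runs_of_ones(line)
        lengths.append(ends - starts)
    lengths = np.concatenate(lengths)
    return float(np.median(lengths)) if lengths.size else 0.0


def estimate_line_height(ink):
    """Mean spacing between centres of successive text bands (stand-in)."""
    has_ink = ink.sum(axis=1) > 0          # rows that contain any ink
    starts, ends = _runs_of_ones(has_ink)  # each run is one text band
    centres = (starts + ends - 1) / 2.0
    if centres.size < 2:
        return 0.0                          # a single band has no spacing
    return float(np.mean(np.diff(centres)))


if __name__ == "__main__":
    # Synthetic page: three horizontal strokes, 3 px thick, 10 px apart.
    page = np.zeros((30, 30), dtype=np.uint8)
    for top in (5, 15, 25):
        page[top:top + 3, 2:28] = 1
    params = CharacteristicLengths(
        w_s=estimate_stroke_width(page),
        h_l=estimate_line_height(page),
    )
    print(params)
```

The band-centre spacing only approximates the baseline-to-baseline definition of h_l given above; on real manuscripts with touching or skewed lines a more careful baseline detector would be needed.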

Appendix C: Markov Clustering

Robust and parameterless clustering of objects is an interesting and at the same time difficult problem. Usually, the similarity measure between objects is not normalized, and therefore threshold-based methods, or methods that assume that the number of clusters is known a priori, are very sensitive to parameters. There are many approaches to parameterless clustering, such as Markov clustering [38] and improved incremental growing neural gas [8]. Markov clustering (MCL) is a robust technique in which the similarity values between like objects are increased gradually over a few iterations, while the values for dissimilar objects decrease. This process eventually drives the entries of the similarity matrix to zero or one. Although there are two parameters in this technique, which will be discussed later, their effects on the performance of the clustering are mainly limited to the number of iterations, and the convergence of the process is usually independent of them. It is worth noting that the translation of distances to similarities (for example, relation (19.5)), which is independent of MCL and a common step in all clustering techniques, depends strongly on the nature of the problem under study. This is why the parameter h_MCL is set equal to d_thr.

Algorithm 3 provides the details of the MCL process. In each MCL iteration, there are two main operations: expansion and inflation. The parameter p specifies the intensity of the expansion operation, which is represented by the power operation in the algorithm. The parameter r controls the inflation operation, implemented as an entry-wise power operation followed by column renormalization. In this work, p=2 and r=1.2 are used. If we imagine the similarity value between different objects as the capacity of imaginary pipes that connect the objects, the expansion operation actually increases the capacity of high-flow pipes, and reduces the capacity of pipes with little flow, in a gradual and smooth way. The inflation operator preserves the stochastic nature of the process. In the end, there will only be two types of pipes: open pipes with the highest capacity (W_ij=1) and blocked pipes (W_ij=0). The objects that are connected with open pipes represent a cluster, and the object with the highest number of connections can be considered the representative of that set of objects.
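
The two operations described above can be sketched in a few lines of NumPy. This is a minimal illustration and not a transcription of Algorithm 3: the function names `markov_clustering` and `extract_clusters`, the stopping tolerance `tol`, the iteration cap `max_iter` and the threshold `eps` are assumptions introduced here. The input is assumed to be a similarity matrix with a positive diagonal, already obtained from the distances by some translation step. Only p=2 and r=1.2 are taken from the text.

```python
import numpy as np


def markov_clustering(similarity, p=2, r=1.2, max_iter=200, tol=1e-6):
    """Iterate expansion and inflation until the matrix stops changing."""
    W = np.asarray(similarity, dtype=float).copy()
    W /= W.sum(axis=0, keepdims=True)           # make columns sum to one
    for _ in range(max_iter):
        previous = W
        W = np.linalg.matrix_power(W, p)        # expansion: matrix power
        W = W ** r                              # inflation: entry-wise power
        W /= W.sum(axis=0, keepdims=True)       # ... and column renormalization
        if np.abs(W - previous).max() < tol:    # assumed stopping rule
            break
    return W


def extract_clusters(W, eps=1e-3):
    """Group objects (columns) that end up attached to the same rows.

    Entries above eps are treated as open pipes. Exactly tied objects can
    keep fractional values instead of 1, hence a threshold rather than an
    equality test.
    """
    groups = {}
    for j in range(W.shape[1]):
        attractors = tuple(int(i) for i in np.flatnonzero(W[:, j] > eps))
        groups.setdefault(attractors, []).append(j)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy similarity matrix: objects 0-2 resemble each other, as do 3-4.
    S = np.array([
        [1.0, 0.9, 0.8, 0.0, 0.0],
        [0.9, 1.0, 0.7, 0.0, 0.0],
        [0.8, 0.7, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 0.9],
        [0.0, 0.0, 0.0, 0.9, 1.0],
    ])
    print(extract_clusters(markov_clustering(S)))
```

In this sketch the rows that keep non-zero entries play the role of cluster representatives, and each column is assigned to the representative(s) it remains connected to. On larger libraries of connected components a sparse-matrix implementation with pruning of tiny entries would likely be preferable.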

About this chapter

Cite this chapter

Cheriet, M., Moghaddam, R.F. (2012). A Robust Word Spotting System for Historical Arabic Manuscripts. In: Märgner, V., El Abed, H. (eds) Guide to OCR for Arabic Scripts. Springer, London. https://doi.org/10.1007/978-1-4471-4072-6_19

DOI: https://doi.org/10.1007/978-1-4471-4072-6_19
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4071-9
Online ISBN: 978-1-4471-4072-6
eBook Packages: Computer Science, Computer Science (R0)
