Abstract
A novel system for word spotting in old Arabic manuscripts is developed. The system has a complete chain of operations and consists of three major steps: pre-processing, data preparation, and word spotting. In the pre-processing step, using multi-level classifiers, clean binarization is obtained from the input degraded document images. In the second step, the smallest units of data, i.e., the connected components, are processed and clustered in a robust way in a library, based on features which have been extracted from their skeletons. The prepared data are then used in the third and final step, in which occurrences of queries are located within the manuscript. Various techniques are used to improve the performance and to cope with possible inaccuracies in data and representation. These techniques have been developed in close collaboration with scholars, especially for relaxing the system to absorb various scripts. The system is tested on an old manuscript with promising results.
References
Al-Khatib, W.G., Shahab, S.A., Mahmoud, S.A.: Digital library framework for Arabic manuscripts. In: Shahab, S.A. (ed.) AICCSA’07, Amman, Jordan, May 13–16, 2007, pp. 458–465 (2007)
Antonacopoulos, A., Downton, A.: Special issue on the analysis of historical documents. Int. J. Doc. Anal. Recognit. 9(2), 75–77 (2007)
Bradley, D., Roth, G.: Adaptive thresholding using the integral image. J. Graph. Tools 12(2), 13–21 (2007)
Chang, H.-H., Yan, H.: Analysis of stroke structures of handwritten Chinese characters. IEEE Trans. Syst. Man Cybern., Part B, Cybern. 29(1), 47–61 (1999)
Chomsky, N.: Aspects of the Theory of Syntax, 1st edn. MIT Press, Cambridge (1969). ISBN-10: 0262530074
Farin, G.: Curves and Surfaces for Computer Aided Geometric Design: A Practical Guide, 5th edn. Academic Press, San Diego (2001)
Gacek, A.: Arabic manuscripts: a vademecum for readers. In: Handbook of Oriental Studies. Section 1 The Near and Middle East, vol. 98. Brill, Leiden/Boston (2009). ISBN-10: 90 04 17036 7
Hamza, H., Belaid, Y., Belaid, A., Chaudhuri, B.B.: Incremental classification of invoice documents. In: ICPR’08, Tampa, FL, USA, December 8–11, 2008, pp. 1–4 (2008)
Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision. Addison-Wesley/Longman, Reading (1992)
Huang, L., Wan, G., Liu, C.: An improved parallel thinning algorithm. In: ICDAR’03, pp. 780–783 (2003)
Joosten, I.: Applications of microanalysis in the cultural heritage field. Mikrochim. Acta 161(3), 295–299 (2008)
Kane, S., Lehman, A., Partridge, E.: Indexing George Washington’s Handwritten Manuscripts: A study of word matching techniques. CIIR technical report, University of Massachusetts, Amherst (2001)
Kohonen, T.: Self Organizing Maps, 3rd edn. Springer, Berlin (2001)
Leydier, Y., Le Bourgeois, F., Emptoz, H.: Omnilingual segmentation-free word spotting for ancient manuscripts indexation. In: Le Bourgeois, F. (ed.) ICDAR’05, vol. 1, pp. 533–537 (2005)
Leydier, Y., Ouji, A., LeBourgeois, F., Emptoz, H.: Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognit. 42, 2089–2105 (2009)
Manso, M., Carvalho, M.L.: Application of spectroscopic techniques for the study of paper documents: A survey. Spectrochim. Acta, Part B, At. Spectrosc. 64(6), 482–490 (2009)
Matuschek, M., Schlüter, T., Conrad, S.: Measuring text similarity with dynamic time warping. In: Proceedings of the 2008 International Symposium on Database Engineering and Applications, Coimbra, Portugal, pp. 263–267. ACM, New York (2008)
Moghaddam, R.F., Cheriet, M.: Application of multi-level classifiers and clustering for automatic word-spotting in historical document images. In: ICDAR’09, Barcelona, Spain, July 26–29, 2009, pp. 511–515 (2009)
Moghaddam, R.F., Cheriet, M.: Low quality document image modeling and enhancement. Int. J. Doc. Anal. Recognit. 11(4), 183–201 (2009)
Moghaddam, R.F., Cheriet, M.: Restoration of single-sided low-quality document images. Pattern Recognit. 42, 3355–3364 (2009)
Moghaddam, R.F., Cheriet, M.: A multi-scale framework for adaptive binarization of degraded document images. Pattern Recognit. 43(6), 2186–2198 (2010)
Moghaddam, R.F., Cheriet, M., Adankon, M.M., Filonenko, K., Wisnovsky, R.: IBN SINA: a database for research on processing and understanding of Arabic manuscripts images. In: DAS’10, Boston, Massachusetts, pp. 11–18. ACM, New York (2010)
Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 38–62 (2000)
Nakayama, K., Hasegawa, H., Hernandez, C.A.: Handwritten alphabet and digit character recognition using skeleton pattern mapping with structural constraints. In: Proc. ICANN’93, Amsterdam, September 1993, p. 941 (1993)
Rath, T., Manmatha, R.: Word spotting for historical documents. Int. J. Doc. Anal. Recognit. 9(2), 139–152 (2007)
Barni, M., Beraldin, J.-A., Lahanier, C., Piva, A.: Signal processing in visual cultural heritage, special issue. In: IEEE Signal Processing Magazine, vol. 25(4) (2008)
Rodriguez-Serrano, J.A., Perronnin, F., Llados, J., Sanchez, G.: A similarity measure between vector sequences with application to handwritten word image retrieval. In: CVPR’09 (2009)
Rothfeder, J.L., Feng, S., Rath, T.M.: Using corner feature correspondences to rank word images by similarity. In: Workshop on Document Image Analysis and Retrieval, Madison, June 20, 2003
Saykol, E., Sinop, A.K., Gudukbay, U., Ulusoy, O., Cetin, A.E.: Content-based retrieval of historical ottoman documents stored as textual images. IEEE Trans. Image Process. 13(3), 314–325 (2004)
Shafait, F., Keysers, D., Breuel, T.M.: Efficient implementation of local adaptive thresholding techniques using integral images. In: Document Recognition and Retrieval XV, San Jose, CA, January 2008
Sharma, O., Mioc, D., Anton, F.: Voronoi diagram based automated skeleton extraction from colour scanned maps. In: ISVD’06, pp. 186–195 (2006)
Shih, F.Y., Pu, C.C.: A skeletonization algorithm by maxima tracking on Euclidean distance transform. Pattern Recognit. 28(3), 331–341 (1995)
Spitz, A.L.: Using character shape codes for word spotting in document images. In: Dori, D., Bruckstein, A. (eds.) Shape, Structure and Pattern Recognition, pp. 382–389. World Scientific, Singapore (1995)
Steinherz, T., Intrator, N., Rivlin, E.: A special skeletonization algorithm for cursive words. In: IWFHR’00, pp. 529–534 (2000)
Tang, Y.Y., Suen, C.Y., De Yan, C., Cheriet, M.: Financial document processing based on staff line and description language. IEEE Trans. Syst. Man Cybern. 25(5), 738–754 (1995)
The Mathworks Inc., Natick, MA. MATLAB Version 7.5.0
van der Zant, T., Schomaker, L., Haak, K.: Handwritten-word spotting using biologically inspired features. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1945–1957 (2008)
van Dongen, S.: Graph clustering by flow simulation. Ph.D. thesis, Univ. Utrecht, May 2000
Vijaya Kumar, V., Srikrishna, A., Ali Shaik, S., Trinath, S.: A new skeletonization method based on connected component approach. Int. J. Comput. Sci. Netw. Secur. 8, 133–137 (2008)
Yalniz, I.Z., Altingovde, I.S., Güdükbay, U., Ulusoy, O.: Ottoman archives explorer: A retrieval system for digital Ottoman archives. J. Comput. Cult. Herit. 2(3), 1–20 (2009)
Zeng, J., Liu, Z.-Q.: Stroke segmentation of Chinese characters using Markov random fields. In: ICPR’06, vol. 1, pp. 868–871 (2006)
Zhu, X.: Shape recognition based on skeleton and support vector machines. In: Advanced Intelligent Computing Theories and Applications. With Aspects of Contemporary Intelligent Computing Techniques, vol. 2, pp. 1035–1043 (2007)
Acknowledgements
The authors would like to thank the NSERC of Canada for their financial support. Also, we would like to acknowledge Dr. Robert Wisnovsky and his team, from IIS, McGill University, for their collaboration and fruitful comments and discussions.
Appendices
Appendix A: Competence vs. Performance
Pattern recognition and document understanding [4, 35] are actually equivalent to a process in which humans (authors, writers, and copiers of documents and manuscripts) are modeled as a kind of machine. These people usually differ in the writing they produce mainly because of their philological and psychological differences and limitations. In other words, some aspects of these variations are rooted in the inability of such a “machine” to reproduce a well-standardized set of patterns. This inability can be related to motor limitations of hand, mind, or memory, for example. In modern societies, and for short time scales, where a high level of education based on a single “school” can be assumed, this is close to the reality. Performance [5] is a term used to measure the level of this imperfection, and is a concept that has been used widely in many fields, from linguistics to engineering. Performance is a measure of the output of a writing process. It not only reflects the personality of the writer, but also, and mainly, the errors and inabilities of the writer. Most approaches to document understanding, which usually originated from the engineering point of view, overestimate performance. This tradition has resulted in attempts to filter out, “average,” or ignore variations. In earlier times, there was more diversity still, although even in modern societies there are variations in the way individuals think. Every author, writer, and copier has a unique and well-developed ability to think for himself. Perhaps it would be difficult to gauge the value of the large number of manuscripts produced over a period of a thousand years or so, but, if we look at them on a scale of a few years, they appear as individual gems. The deep philosophical thinking of their authors is reflected in their lives through many stories told about them. It is obvious that their thinking would be reflected in their writing style as well. 
Moreover, their writing styles should also be as unique as the philosophical schools to which they belong. In other words, a major part of the difference between the writing styles of two manuscripts arises from differences in the thinking of the writers, rather than their inability to replicate a standard style. This brings us to competence [5], a concept which has been somewhat ignored in document analysis and understanding. In contrast to performance, competence actually depends on the knowledge of the writer. We believe that a successful approach to analysis and understanding of historical-manuscript images should be based on accepting differences in competence and allowing enough room for variations. This is why our approach is adaptable to any manuscript under study, as it “learns” and organizes the information in the manuscript, and can therefore capture a large proportion of the possible variations. This may, of course, lead to a sharp rise in the resources needed to understand the documents, as well as in the complexity of the models. But, at the same time, if the approach attempts to understand the underlying philosophy behind the writing style, it can easily isolate and locate sources of variations in competence, and can therefore control the complexity of the model.
Appendix B: A Priori Information
In order to use the document properties directly in processing, some parameters, considered as a priori information, are defined and included in the models. The first and most important parameter is the average stroke width, which depends on the writing style and the acquisition setup. The average line height and the minimum and maximum dot sizes are a few other parameters.
As discussed in the previous sections, the key concepts in the proposed framework are characteristic lengths, which are defined based on the range of interactions. These parameters can be extracted using various tools, such as wavelet transform or kernel-based analysis. Although these parameters may vary drastically, even on a single image from one site (paragraph) to another, their behavior is usually very robust and almost constant over a whole dataset. Therefore, many learning and data mining methods can be used to obtain robust values for the characteristic lengths. In this work, we assume that the values of these parameters are known a priori and are constant for each manuscript. Below, a few characteristic lengths are defined.
1. Stroke width
The most important characteristic length on a document image is the stroke width. In this work, we use the average stroke width, $w_s$, as a priori information in the form of a constant parameter. It is estimated using a kernel-based algorithm (see Appendix C).
2. Line height
The second most important characteristic length on document images is the shortest distance between text sites, which are usually text lines. We call this parameter the line height. By definition, line height is the distance between two adjoining baselines. Again, in this work, only the average line height, $h_l$, is used.
3. Vertical extent of text line
The average vertical extent of text line, $h_e$, is defined as the average distance to which the text pixels extend from a text line. It is different from the line height $h_l$, which is the average distance between two successive baselines.
Appendix C: Markov Clustering
Robust and parameterless clustering of objects is an interesting and, at the same time, difficult problem. Usually, the similarity measure between objects is not normalized, and therefore threshold-based methods, or methods that assume that the number of clusters is known a priori, are very sensitive to their parameters. There are many approaches to parameterless clustering, such as Markov clustering [38] and improved incremental growing neural gas [8]. Markov clustering (MCL) is a robust technique in which the similarity values between similar objects are increased gradually, over a few iterations, while the values for dissimilar objects decrease. This process eventually leads to a similarity matrix whose entries are all zero or one. Although there are two parameters in this technique, which are discussed below, their effect on the performance of the clustering is mainly limited to the number of iterations, and the convergence of the process is usually independent of them. It is worth noting that the translation of distances to similarities (for example, relation (19.5)), which is independent of MCL and a common step in all clustering techniques, depends strongly on the nature of the problem under study. This is why $h_{\mathrm{MCL}}$ is selected equal to $d_{\mathrm{thr}}$.
Algorithm 3 provides the details of the MCL process. In each MCL iteration, there are two main operations. The parameter p specifies the intensity of the expansion operation, which is represented by the matrix power operation in the algorithm. The parameter r controls the inflation operation, implemented as an entry-wise power operation followed by column renormalization. In this work, p=2 and r=1.2 are used. If we imagine the similarity value between different objects as the capacity of imaginary pipes that connect the objects, the expansion operation lets flow spread along paths through intermediate objects, while the inflation operation increases the capacity of high-flow pipes and reduces the capacity of pipes with little flow, in a gradual and smooth way. The column renormalization preserves the stochastic nature of the process. In the end, there will only be two types of pipes: open pipes with the highest capacity ($W_{ij}=1$) and blocked pipes ($W_{ij}=0$). The objects that are connected by open pipes represent a cluster, and the object with the highest number of connections can be considered the representative of that set of objects.
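The MCL iteration described above can be illustrated with a short Python sketch. It is not a reproduction of Algorithm 3: the function names `markov_clustering` and `extract_clusters`, the stopping parameters `max_iter` and `tol`, and the threshold `eps` are assumptions introduced for illustration only. The input is assumed to be a non-negative similarity matrix that already includes self-similarities on its diagonal; the values p=2 and r=1.2 are the ones quoted in the text.

```python
import numpy as np


def markov_clustering(similarity, p=2, r=1.2, max_iter=200, tol=1e-8):
    """Illustrative MCL loop on a non-negative similarity matrix.

    Assumes every column has a positive sum (e.g. self-similarity on
    the diagonal). Returns the column-stochastic matrix at convergence.
    """
    W = np.asarray(similarity, dtype=float).copy()
    W /= W.sum(axis=0, keepdims=True)          # make columns sum to one
    for _ in range(max_iter):
        previous = W
        W = np.linalg.matrix_power(W, p)       # expansion: matrix power
        W = W ** r                             # inflation: entry-wise power
        W /= W.sum(axis=0, keepdims=True)      # column renormalization
        if np.abs(W - previous).max() < tol:   # stop when nothing changes
            break
    return W


def extract_clusters(W, eps=1e-6):
    """Read clusters off the converged matrix: each row with non-zero
    entries lists the objects that remain connected to that row's object."""
    clusters = []
    for row in W:
        members = tuple(int(i) for i in np.flatnonzero(row > eps))
        if members and members not in clusters:
            clusters.append(members)
    return clusters


if __name__ == "__main__":
    # Hypothetical toy input: two groups of three mutually similar objects.
    S = np.zeros((6, 6))
    S[:3, :3] = 1.0
    S[3:, 3:] = 1.0
    print(extract_clusters(markov_clustering(S)))
```

In this sketch, the distance-to-similarity conversion is deliberately left out, since it depends on the problem under study; any similarity matrix built from the connected-component features could be passed in instead of the toy matrix.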
Copyright information
© 2012 Springer-Verlag London
About this chapter
Cite this chapter
Cheriet, M., Moghaddam, R.F. (2012). A Robust Word Spotting System for Historical Arabic Manuscripts. In: Märgner, V., El Abed, H. (eds) Guide to OCR for Arabic Scripts. Springer, London. https://doi.org/10.1007/978-1-4471-4072-6_19
DOI: https://doi.org/10.1007/978-1-4471-4072-6_19
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4071-9
Online ISBN: 978-1-4471-4072-6
eBook Packages: Computer Science (R0)