
A Robust Word Spotting System for Historical Arabic Manuscripts


Abstract

A novel system for word spotting in old Arabic manuscripts is developed. The system has a complete chain of operations and consists of three major steps: pre-processing, data preparation, and word spotting. In the pre-processing step, multi-level classifiers are used to obtain a clean binarization of the degraded input document images. In the second step, the smallest units of data, i.e., the connected components, are processed and robustly clustered into a library, based on features extracted from their skeletons. The prepared data are then used in the third and final step, in which occurrences of queries are located within the manuscript. Various techniques are used to improve the performance and to cope with possible inaccuracies in data and representation. These techniques have been developed in close collaboration with scholars, especially to relax the system so that it can accommodate various scripts. The system is tested on an old manuscript with promising results.


Notes

  1. http://islamsci.mcgill.ca/RASI/; http://islamsci.mcgill.ca/RASI/pipdi.html

References

  1. Al-Khatib, W.G., Shahab, S.A., Mahmoud, S.A.: Digital library framework for Arabic manuscripts. In: Shahab, S.A. (ed.) AICCSA’07, Amman, Jordan, May 13–16, 2007, pp. 458–465 (2007)


  2. Antonacopoulos, A., Downton, A.: Special issue on the analysis of historical documents. Int. J. Doc. Anal. Recognit. 9(2), 75–77 (2007)


  3. Bradley, D., Roth, G.: Adaptive thresholding using the integral image. J. Graph. Tools 12(2), 13–21 (2007)


  4. Chang, H.-H., Yan, H.: Analysis of stroke structures of handwritten Chinese characters. IEEE Trans. Syst. Man Cybern., Part B, Cybern. 29(1), 47–61 (1999)


  5. Chomsky, N.: Aspects of the Theory of Syntax, 1st edn. MIT Press, Cambridge (1969). ISBN-10: 0262530074


  6. Farin, G.: Curves and Surfaces for Computer Aided Geometric Design: A Practical Guide, 5th edn. Academic Press, San Diego (2001)


  7. Gacek, A.: Arabic manuscripts: a vademecum for readers. In: Handbook of Oriental Studies. Section 1 The Near and Middle East, vol. 98. Brill, Leiden/Boston (2009). ISBN-10: 90 04 17036 7


  8. Hamza, H., Belaid, Y., Belaid, A., Chaudhuri, B.B.: Incremental classification of invoice documents. In: ICPR’08, Tampa, FL, USA, December 8–11, 2008, pp. 1–4 (2008)


  9. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision. Addison-Wesley/Longman, Reading (1992)


  10. Huang, L., Wan, G., Liu, C.: An improved parallel thinning algorithm. In: ICDAR’03, pp. 780–783 (2003)


  11. Joosten, I.: Applications of microanalysis in the cultural heritage field. Mikrochim. Acta 161(3), 295–299 (2008)


  12. Kane, S., Lehman, A., Partridge, E.: Indexing George Washington’s Handwritten Manuscripts: A study of word matching techniques. CIIR technical report, University of Massachusetts, Amherst (2001)


  13. Kohonen, T.: Self Organizing Maps, 3rd edn. Springer, Berlin (2001)


  14. Leydier, Y., Le Bourgeois, F., Emptoz, H.: Omnilingual segmentation-free word spotting for ancient manuscripts indexation. In: Le Bourgeois, F. (ed.) ICDAR’05, vol. 1, pp. 533–537 (2005)


  15. Leydier, Y., Ouji, A., LeBourgeois, F., Emptoz, H.: Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognit. 42, 2089–2105 (2009)


  16. Manso, M., Carvalho, M.L.: Application of spectroscopic techniques for the study of paper documents: A survey. Spectrochim. Acta, Part B, At. Spectrosc. 64(6), 482–490 (2009)


  17. Matuschek, M., Schlüter, T., Conrad, S.: Measuring text similarity with dynamic time warping. In: Proceedings of the 2008 International Symposium on Database Engineering and Applications, Coimbra, Portugal, pp. 263–267. ACM, New York (2008)


  18. Moghaddam, R.F., Cheriet, M.: Application of multi-level classifiers and clustering for automatic word-spotting in historical document images. In: ICDAR’09, Barcelona, Spain, July 26–29, 2009, pp. 511–515 (2009)


  19. Moghaddam, R.F., Cheriet, M.: Low quality document image modeling and enhancement. Int. J. Doc. Anal. Recognit. 11(4), 183–201 (2009)


  20. Moghaddam, R.F., Cheriet, M.: Restoration of single-sided low-quality document images. Pattern Recognit. 42, 3355–3364 (2009)


  21. Moghaddam, R.F., Cheriet, M.: A multi-scale framework for adaptive binarization of degraded document images. Pattern Recognit. 43(6), 2186–2198 (2010)


  22. Moghaddam, R.F., Cheriet, M., Adankon, M.M., Filonenko, K., Wisnovsky, R.: IBN SINA: a database for research on processing and understanding of Arabic manuscripts images. In: DAS’10, Boston, Massachusetts, pp. 11–18. ACM, New York (2010)


  23. Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 38–62 (2000)


  24. Nakayama, K., Hasegawa, H., Hernandez, C.A.: Handwritten alphabet and digit character recognition using skeleton pattern mapping with structural constraints. In: Proc. ICANN’93, Amsterdam, September 1993, pp. 941 (1993)


  25. Rath, T., Manmatha, R.: Word spotting for historical documents. Int. J. Doc. Anal. Recognit. 9(2), 139–152 (2007)


  26. Barni, M., Beraldin, J.-A., Lahanier, C., Piva, A.: Signal processing in visual cultural heritage, special issue. In: IEEE Signal Processing Magazine, vol. 25(4) (2008)


  27. Rodriguez-Serrano, J.A., Perronnin, F., Llados, J., Sanchez, G.: A similarity measure between vector sequences with application to handwritten word image retrieval. In: CVPR’09 (2009)


  28. Rothfeder, J.L., Feng, S., Rath, T.M.: Using corner feature correspondences to rank word images by similarity. In: Workshop on Document Image Analysis and Retrieval, Madison, June 20, 2003


  29. Saykol, E., Sinop, A.K., Gudukbay, U., Ulusoy, O., Cetin, A.E.: Content-based retrieval of historical ottoman documents stored as textual images. IEEE Trans. Image Process. 13(3), 314–325 (2004)


  30. Shafait, F., Keysers, D., Breuel, T.M.: Efficient implementation of local adaptive thresholding techniques using integral images. In: Document Recognition and Retrieval XV, San Jose, CA, January 2008


  31. Sharma, O., Mioc, D., Anton, F.: Voronoi diagram based automated skeleton extraction from colour scanned maps. In: ISVD’06, pp. 186–195 (2006)


  32. Shih, F.Y., Pu, C.C.: A skeletonization algorithm by maxima tracking on Euclidean distance transform. Pattern Recognit. 28(3), 331–341 (1995)


  33. Spitz, A.L.: Using character shape codes for word spotting in document images. In: Dori, D., Bruckstein, A. (eds.) Shape, Structure and Pattern Recognition, pp. 382–389. World Scientific, Singapore (1995)


  34. Steinherz, T., Intrator, N., Rivlin, E.: A special skeletonization algorithm for cursive words. In: IWFHR’00, pp. 529–534 (2000)


  35. Tang, Y.Y., Suen, C.Y., De Yan, C., Cheriet, M.: Financial document processing based on staff line and description language. IEEE Trans. Syst. Man Cybern. 25(5), 738–754 (1995)


  36. The Mathworks Inc., Natick, MA. MATLAB Version 7.5.0


  37. van der Zant, T., Schomaker, L., Haak, K.: Handwritten-word spotting using biologically inspired features. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1945–1957 (2008)


  38. van Dongen, S.: Graph clustering by flow simulation. Ph.D. thesis, Univ. Utrecht., May 2000


  39. Vijaya Kumar, V., Srikrishna, A., Ali Shaik, S., Trinath, S.: A new skeletonization method based on connected component approach. Int. J. Comput. Sci. Netw. Secur. 8, 133–137 (2008)


  40. Yalniz, I.Z., Altingovde, I.S., Güdükbay, U., Ulusoy, O.: Ottoman archives explorer: A retrieval system for digital Ottoman archives. J. Comput. Cult. Herit. 2(3), 1–20 (2009)


  41. Zeng, J., Liu, Z.-Q.: Stroke segmentation of Chinese characters using Markov random fields. In: ICPR’06, vol. 1, pp. 868–871 (2006)


  42. Zhu, X.: Shape recognition based on skeleton and support vector machines. In: Advanced Intelligent Computing Theories and Applications. With Aspects of Contemporary Intelligent Computing Techniques, vol. 2, pp. 1035–1043 (2007)



Acknowledgements

The authors would like to thank the NSERC of Canada for their financial support. Also, we would like to acknowledge Dr. Robert Wisnovsky and his team, from IIS, McGill University, for their collaboration and fruitful comments and discussions.

Author information


Corresponding author

Correspondence to Mohamed Cheriet.


Appendices

Appendix A: Competence vs. Performance

Pattern recognition and document understanding [4, 35] are actually equivalent to a process in which humans (authors, writers, and copiers of documents and manuscripts) are modeled as a kind of machine. These people usually differ in the writing they produce mainly because of their philological and psychological differences and limitations. In other words, some aspects of these variations are rooted in the inability of such a “machine” to reproduce a well-standardized set of patterns. This inability can be related to motor limitations of hand, mind, or memory, for example. In modern societies, and for short time scales, where a high level of education based on a single “school” can be assumed, this is close to the reality.

Performance [5] is a term used to measure the level of this imperfection, and is a concept that has been used widely in many fields, from linguistics to engineering. Performance is a measure of the output of a writing process. It not only reflects the personality of the writer, but also, and mainly, the errors and inabilities of the writer. Most approaches to document understanding, which usually originated from the engineering point of view, overestimate performance. This tradition has resulted in attempts to filter out, “average,” or ignore variations.

In earlier times, there was more diversity still, although even in modern societies there are variations in the way individuals think. Every author, writer, and copier has a unique and well-developed ability to think for himself. Perhaps it would be difficult to gauge the value of the large number of manuscripts produced over a period of a thousand years or so, but, if we look at them on a scale of a few years, they appear as individual gems. The deep philosophical thinking of their authors is reflected in their lives through many stories told about them. It is obvious that their thinking would be reflected in their writing style as well. Moreover, their writing styles should also be as unique as the philosophical schools to which they belong. In other words, a major part of the difference between the writing styles of two manuscripts arises from differences in the thinking of the writers, rather than their inability to replicate a standard style.

This brings us to competence [5], a concept which has been somewhat ignored in document analysis and understanding. In contrast to performance, competence actually depends on the knowledge of the writer. We believe that a successful approach to analysis and understanding of historical-manuscript images should be based on accepting differences in competence and allowing enough room for variations. This is why our approach is adaptable to any manuscript under study, as it “learns” and organizes the information in the manuscript, and can therefore capture a large proportion of the possible variations. This may, of course, lead to a sharp rise in the resources needed to understand the documents, as well as in the complexity of the models. But, at the same time, if the approach attempts to understand the underlying philosophy behind the writing style, it can easily isolate and locate sources of variations in competence, and can therefore control the complexity of the model.

Appendix B: A Priori Information

In order to use the document properties directly in processing, some parameters, considered as a priori information, are defined and included in the models. The first and most important parameter is the average stroke width, which depends on the writing style and the acquisition setup. The average line height and the minimum and maximum dot sizes are a few other parameters.

As discussed in the previous sections, the key concepts in the proposed framework are characteristic lengths, which are defined based on the range of interactions. These parameters can be extracted using various tools, such as wavelet transform or kernel-based analysis. Although these parameters may vary drastically, even on a single image from one site (paragraph) to another, their behavior is usually very robust and almost constant over a whole dataset. Therefore, many learning and data mining methods can be used to obtain robust values for the characteristic lengths. In this work, we assume that the values of these parameters are known a priori and are constant for each manuscript. Below, a few characteristic lengths are defined.

  1. Stroke width

    The most important characteristic length on a document image is the stroke width. In this work, we use the average stroke width, $w_s$, as a priori information in the form of a constant parameter. It is estimated using a kernel-based algorithm (see Appendix C).

  2. Line height

    The second most important characteristic length on document images is the shortest distance between text sites, which are usually text lines. We call this parameter the line height. By definition, line height is the distance between two adjoining baselines. Again, in this work, only the average line height, $h_l$, is used.

  3. Vertical extent of text line

    The average vertical extent of text line, $h_e$, is defined as the average distance to which the text pixels extend from a text line. It is different from the line height $h_l$, which is the average distance between two successive baselines.
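To make the first two characteristic lengths concrete, the following is a minimal sketch of how the average stroke width and the average line height could be estimated from a binarized page. It is not the kernel-based estimator used in this work: as a stand-in, it takes the most frequent ink run length as the stroke width and the spacing of projection-profile peaks as the line height. All function names, and the synthetic test page, are hypothetical and introduced only for illustration.

```python
from collections import Counter


def ink_run_lengths(rows):
    """Lengths of consecutive runs of ink pixels (value 1) along each row."""
    runs = []
    for row in rows:
        length = 0
        for pixel in row:
            if pixel:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:  # run that touches the right border
            runs.append(length)
    return runs


def estimate_stroke_width(image):
    """Most frequent ink run length over rows and columns.

    Assumption: the mode of the run lengths is used as a simple proxy
    for the average stroke width; it is not the chapter's estimator.
    """
    columns = [list(col) for col in zip(*image)]
    runs = ink_run_lengths(image) + ink_run_lengths(columns)
    return Counter(runs).most_common(1)[0][0] if runs else 0


def estimate_line_height(image):
    """Average distance between successive baselines, or None if fewer than two.

    Assumption: inside each band of inked rows, the row with the most ink
    is taken as the baseline.
    """
    profile = [sum(row) for row in image]
    baselines, band = [], []
    for y, ink in enumerate(profile + [0]):  # trailing 0 closes the last band
        if ink:
            band.append(y)
        elif band:
            baselines.append(max(band, key=lambda r: profile[r]))
            band = []
    if len(baselines) < 2:
        return None
    gaps = [b - a for a, b in zip(baselines, baselines[1:])]
    return sum(gaps) / len(gaps)


def synthetic_page(height=24, width=12, line_starts=(2, 10, 18)):
    """Hypothetical test page: two vertical strokes on a thick baseline per line."""
    page = [[0] * width for _ in range(height)]
    for top in line_starts:
        for y in range(top, top + 5):
            for x in (2, 3, 7, 8):  # two vertical strokes, each 2 px wide
                page[y][x] = 1
        for y in (top + 3, top + 4):  # baseline stroke, 2 px thick
            for x in range(1, 11):
                page[y][x] = 1
    return page
```

On the synthetic page, the strokes are 2 pixels wide and the baselines are 8 rows apart, and the two estimators recover those values. Real manuscripts would need a more robust estimator, which is why the values are treated as a priori constants per manuscript in this work.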

Appendix C: Markov Clustering

Robust and parameterless clustering of objects is an interesting and, at the same time, difficult problem. Usually, the similarity measure between objects is not normalized, and therefore threshold-based methods, and methods that assume the number of clusters is known a priori, are very sensitive to their parameters. There are several approaches to parameterless clustering, such as Markov clustering [38] and improved incremental growing neural gas [8]. Markov clustering (MCL) is a robust technique in which the similarity values between similar objects are increased gradually over a few iterations, while the values for dissimilar objects decrease. This process eventually leads to a similarity matrix that contains only zeros and ones. Although there are two parameters in this technique, which are discussed below, their effect on the performance of the clustering is mainly limited to the number of iterations, and the convergence of the process is usually independent of them. It is worth noting that the translation of distances to similarities (for example, relation (19.5)), which is independent of MCL and a common step in all clustering techniques, depends strongly on the nature of the problem under study. This is why $h_{\mathrm{MCL}}$ is set equal to $d_{\mathrm{thr}}$.

Algorithm 3 provides the details of the MCL process. In each MCL iteration, there are two main operations. The parameter p specifies the intensity of the expansion operation, which is represented by the matrix power operation in the algorithm. The parameter r controls the inflation operation, which is implemented as an entry-wise power operation followed by column renormalization. In this work, p=2 and r=1.2 are used. If we imagine the similarity value between different objects as the capacity of imaginary pipes that connect the objects, the expansion operation lets flow spread along longer paths through the network, while the inflation operation gradually and smoothly increases the capacity of high-flow pipes and reduces the capacity of pipes with little flow. The column renormalization in the inflation step preserves the stochastic nature of the process. In the end, there will only be two types of pipes: open pipes with the highest capacity ($W_{ij}=1$) and blocked pipes ($W_{ij}=0$). The objects that are connected with open pipes represent a cluster, and the object with the highest number of connections can be considered the representative of that set of objects.

Algorithm 3 Operation cycle of the MCL technique (shown as a figure in the original; not reproduced here)
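As a rough illustration of the expansion and inflation cycle described above, the sketch below implements a generic Markov clustering loop in plain Python with the stated values p=2 and r=1.2. It follows the textbook form of MCL [38] and is not a transcription of Algorithm 3. The Gaussian mapping from distances to similarities is an assumption made for this example only: relation (19.5) is not reproduced in this appendix, and the scale `h` merely plays the role of $h_{\mathrm{MCL}}$. All function names are hypothetical.

```python
import math


def similarity_from_distances(distances, h):
    """Assumed Gaussian mapping exp(-(d/h)^2); NOT relation (19.5) of the chapter."""
    return [[math.exp(-((d / h) ** 2)) for d in row] for row in distances]


def normalize_columns(matrix):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_clustering(similarity, p=2, r=1.2, max_iter=200, tol=1e-9):
    """Generic MCL loop: expansion (matrix power p), then inflation (power r)."""
    n = len(similarity)
    w = normalize_columns([[float(v) for v in row] for row in similarity])
    for _ in range(max_iter):
        expanded = w
        for _step in range(p - 1):  # expansion: p-th matrix power
            expanded = matmul(expanded, w)
        # inflation: entry-wise power followed by column renormalization
        inflated = normalize_columns([[v ** r for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - w[i][j])
                     for i in range(n) for j in range(n))
        w = inflated
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if w[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]
```

For a similarity matrix made of two disconnected groups, for example objects 0 to 2 and objects 3 to 4 with no similarity across the groups, no flow can ever cross between them, so the returned clusters never mix the two groups. How finely a connected group is split depends on r and on the similarity values, so results on real connected-component features should be inspected rather than assumed.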


Copyright information

© 2012 Springer-Verlag London

About this chapter

Cite this chapter

Cheriet, M., Moghaddam, R.F. (2012). A Robust Word Spotting System for Historical Arabic Manuscripts. In: Märgner, V., El Abed, H. (eds) Guide to OCR for Arabic Scripts. Springer, London. https://doi.org/10.1007/978-1-4471-4072-6_19


  • DOI: https://doi.org/10.1007/978-1-4471-4072-6_19

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-4071-9

  • Online ISBN: 978-1-4471-4072-6

  • eBook Packages: Computer Science, Computer Science (R0)
