Text retrieval from early printed books

  • Simone Marinai
Original Paper


Retrieving text from early printed books is particularly difficult because the words in these documents are set very close to one another and, as in medieval manuscripts, ligatures and abbreviations are used extensively. To address these problems, we propose a word indexing and retrieval technique that does not require word segmentation and is tolerant of errors in character segmentation. Two main principles characterize the approach. First, characters are identified in the pages and clustered with a self-organizing map (SOM). During retrieval, the similarity of characters is estimated from the proximity of cluster centroids in the SOM space, rather than by directly comparing the character images. Second, query words are matched against the indexed sequence of characters by means of a dynamic time warping (DTW)-based approach. The proposed technique integrates the SOM similarity and information about character widths into the string matching process. The best path in the DTW array is identified by considering the width of each matching word relative to the query, so as to deal with broken or touching symbols. The proposed method is tested on four copies of the Gutenberg Bible.
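The two principles above can be illustrated with a minimal sketch. It is not the paper's implementation: the representation of each character as a tuple (som_row, som_col, width_px), the use of Euclidean distance between SOM grid positions as the local cost, the function names som_distance and spot_word, and the width_weight parameter are all assumptions made for illustration. The sketch runs a subsequence DTW, so the query may start and end anywhere in the indexed character sequence (no word segmentation), and it ranks candidate end points by adding a penalty for the difference between the matched width and the query width.

```python
import math


def som_distance(a, b):
    """Euclidean distance between two SOM grid positions (row, col)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def spot_word(query, page, width_weight=1.0):
    """Find the run of `page` characters that best matches `query`.

    Both arguments are lists of (som_row, som_col, width_px) tuples
    (a hypothetical index format). Returns (score, start, end) so that
    page[start:end] is the best match, or None if `page` is empty.
    """
    if not query:
        raise ValueError("query must contain at least one character")
    n, m = len(query), len(page)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with a run ending at page[j-1]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in `page` where that best run begins
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # a match may begin at any position in the page
        start[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = som_distance(query[i - 1][:2], page[j - 1][:2])
            # Diagonal: one-to-one match. Horizontal: the query character
            # absorbs another page character (e.g. a broken symbol).
            candidates = [
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
                (cost[i][j - 1], start[i][j - 1]),
            ]
            if i > 1:
                # Vertical: one page character covers two query characters
                # (e.g. touching symbols).
                candidates.append((cost[i - 1][j], start[i - 1][j]))
            prev_cost, prev_start = min(candidates)
            cost[i][j] = prev_cost + local
            start[i][j] = prev_start

    # Prefix sums of widths give the pixel width of any run in O(1).
    prefix = [0]
    for _, _, width in page:
        prefix.append(prefix[-1] + width)
    query_width = max(sum(width for _, _, width in query), 1)

    best = None
    for j in range(1, m + 1):
        if cost[n][j] == inf:
            continue
        s = start[n][j]
        matched_width = prefix[j] - prefix[s]
        penalty = abs(matched_width - query_width) / query_width
        score = cost[n][j] / n + width_weight * penalty
        if best is None or score < best[0]:
            best = (score, s, j)
    return best
```

In this sketch a character broken into two pieces costs nothing extra as long as both pieces fall on the same SOM unit and their widths add up to the width of the query character. That is one simple way to combine SOM proximity with width information, and the weighting between the two terms would need tuning on real data.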


Early printed books, Dynamic Time Warping, Self-Organizing Map





Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. Dipartimento di Sistemi e Informatica, Università di Firenze, Firenze, Italy
