Tilting at Windmills: Adventures in Attempting to Reconstruct Don Quixote

  • A. Lawrence Spitz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3163)


Despite the current practice of re-keying most documents placed in digital libraries, we continue to try to improve the accuracy of automated recognition techniques for obtaining document image content. This task is made more difficult when the document in question has been rendered in letterpress, has aged for hundreds of years, and has been microfilmed before scanning.

We endeavored to leave intact a previously described document reconstruction technique and to enhance the document images instead, bringing their perceived production values up to a more modern standard, in order to process a novel of historic importance: Don Quixote by Miguel de Cervantes Saavedra. The page images were pre-processed before the reconstruction techniques were applied, to accommodate early 17th-century typography and low-quality scanned microfilm images.

Though our technology easily outstripped the capabilities of commercial OCR systems, it too was found lacking, at this stage of development, for automated processing of historical documents for digital libraries.

We had hoped to develop a useful transcription of the text and a lexicon of Spanish contemporary with the composition of this novel. However, the actual accomplishment was limited to improving the recognizability of the page images involved and providing a basis for further research.


Keywords: Digital Library; Document Image; Optical Character Recognition; Lexical Entry; Character Position



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • A. Lawrence Spitz, DocRec Ltd, Atawhai, Nelson, New Zealand
