Tilting at Windmills: Adventures in Attempting to Reconstruct Don Quixote
Despite the current practice of re-keying most documents placed in digital libraries, we continue to try to improve the accuracy of automated recognition techniques for obtaining document image content. This task is made more difficult when the document in question has been rendered in letterpress, subjected to hundreds of years of aging, and microfilmed before scanning.
We endeavored to leave a previously described document reconstruction technique intact and to enhance the document images, bringing their perceived production values up to more modern standards, in order to process a novel of historic importance: Don Quixote by Miguel de Cervantes Saavedra. The page images were pre-processed before the reconstruction techniques were applied, to accommodate early 17th-century typography and low-quality scanned microfilm images.
Though our technology easily outstripped the capabilities of commercial OCRs, it too was found lacking, at this stage of development, for automated processing of historical documents for digital libraries.
We had hoped to develop a useful transcription of the text and a lexicon of Spanish contemporary with the composition of this novel. However, the actual accomplishment was limited to improving the recognizability of the page images involved and providing a basis for further research.
Keywords: Digital Library, Document Image, Optical Character Recognition, Lexical Entry, Character Position