Abstract
OCRopus is an open source OCR system currently under development and intended to be omni-lingual and omni-script. Beyond modern digital library applications, the system can be applied to capturing and recognizing classical literature, as well as the large body of research literature about classics. OCRopus advances the state of the art in a number of ways, including the ability to plug in new text recognition and layout analysis modules easily, the use of adaptive and user-extensible character recognition, and statistical and trainable layout analysis. Of particular interest for computational linguistics applications are the consistent use of probability estimates throughout the system and the use of weighted finite-state transducers to represent both alternative recognition hypotheses and statistical language models. In this paper, I first give an overview of these technologies and their relevance to digital library applications in the humanities, and then focus on statistical language models and their role in integrating OCR output with subsequent computational linguistic and information extraction modules.
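As a rough illustration of the idea of combining alternative recognition hypotheses with a statistical language model, the following minimal sketch scores a small lattice of per-character alternatives against a bigram model and picks the cheapest path. It is not the OCRopus API: the lattice, the bigram table, and the names `lattice`, `bigram`, and `best_path` are all hypothetical, and the probabilities are made up for illustration. Costs are negative log probabilities, so adding costs along a path corresponds to multiplying probabilities, as in a weighted finite-state transducer over the tropical semiring.

```python
import math

def neg_log(p):
    """Convert a probability into an additive cost."""
    return -math.log(p)

# Hypothetical recognizer output: for each character position,
# alternative symbols with their estimated probabilities.
lattice = [
    {"e": 0.55, "c": 0.45},
    {"a": 0.5, "o": 0.5},
    {"t": 0.7, "l": 0.3},
]

# Hypothetical bigram language model: P(next | previous), "<s>" marks the start.
bigram = {
    ("<s>", "c"): 0.5, ("<s>", "e"): 0.5,
    ("c", "a"): 0.8, ("c", "o"): 0.2,
    ("e", "a"): 0.5, ("e", "o"): 0.5,
    ("a", "t"): 0.9, ("a", "l"): 0.1,
    ("o", "t"): 0.3, ("o", "l"): 0.7,
}

def best_path(lattice, bigram, floor=1e-6):
    """Return (cost, text) of the cheapest path through the lattice.

    The cost of a path is the sum of recognizer costs and language
    model costs. Because the model is a bigram, it is enough to keep
    the cheapest path ending in each symbol (Viterbi search).
    """
    best = {"<s>": (0.0, "")}
    for alternatives in lattice:
        nxt = {}
        for sym, p_ocr in alternatives.items():
            for prev, (cost, text) in best.items():
                # Unseen bigrams get a small floor probability (an assumption).
                p_lm = bigram.get((prev, sym), floor)
                total = cost + neg_log(p_ocr) + neg_log(p_lm)
                if sym not in nxt or total < nxt[sym][0]:
                    nxt[sym] = (total, text + sym)
        best = nxt
    return min(best.values())

cost, text = best_path(lattice, bigram)
print(text, math.exp(-cost))
```

In this toy example the recognizer alone slightly prefers "e" in the first position, but the combined score selects "cat", because the language model costs outweigh the small difference in recognizer confidence. Keeping probability estimates for all alternatives, rather than only the top choice, is what makes this kind of correction possible in later processing stages.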
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Breuel, T.M. (2009). Applying the OCRopus OCR System to Scholarly Sanskrit Literature. In: Huet, G., Kulkarni, A., Scharf, P. (eds) Sanskrit Computational Linguistics. ISCLS 2007, ISCLS 2008. Lecture Notes in Computer Science, vol 5402. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00155-0_21
DOI: https://doi.org/10.1007/978-3-642-00155-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00154-3
Online ISBN: 978-3-642-00155-0
eBook Packages: Computer Science, Computer Science (R0)