Applying the OCRopus OCR System to Scholarly Sanskrit Literature

Breuel, Thomas M.

doi:10.1007/978-3-642-00155-0_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5402)

Abstract

OCRopus is an open source OCR system, currently under development, that is intended to be omni-lingual and omni-script. Beyond modern digital library applications, the system can be applied to capturing and recognizing classical literature, as well as the large body of research literature about classics. OCRopus advances the state of the art in a number of ways, including the ability to plug in new text recognition and layout analysis modules easily, the use of adaptive and user-extensible character recognition, and statistical and trainable layout analysis. Of particular interest for computational linguistics applications are the consistent use of probability estimates throughout the system and the use of weighted finite-state transducers to represent both alternative recognition hypotheses and statistical language models. In this paper, I first give an overview of these technologies and their relevance to digital library applications in the humanities, and then focus on statistical language models and their role in integrating OCR output with subsequent computational linguistic and information extraction modules.
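
To make the idea of combining alternative recognition hypotheses with a statistical language model concrete, here is a minimal Python sketch. It is not OCRopus code and does not use any OCRopus API. The function `best_path`, the toy lattice, the bigram costs, and the `unseen_cost` penalty are all hypothetical, chosen only for illustration. Costs are treated as negative log probabilities, so adding costs along a path and taking the minimum corresponds to a shortest-path search over the combination of a recognition lattice and a bigram model. A real weighted finite-state transducer would also encode segmentation alternatives, which this sketch leaves out.

```python
def best_path(lattice, bigram_cost, unseen_cost=10.0):
    """Return (total_cost, labels) minimising OCR cost plus language-model cost.

    lattice:     list of dicts, one per character position, mapping each
                 candidate label to its recognition cost (assumed to be a
                 negative log probability).
    bigram_cost: dict mapping (previous_label, label) to a language-model
                 cost; "<s>" marks the start of the line.
    unseen_cost: hypothetical penalty for bigrams missing from the model.
    """
    # best[label] = (cost of the cheapest path ending in label, that path)
    best = {"<s>": (0.0, [])}
    for alternatives in lattice:
        nxt = {}
        for label, ocr_cost in alternatives.items():
            candidates = []
            for prev, (cost, path) in best.items():
                lm = bigram_cost.get((prev, label), unseen_cost)
                candidates.append((cost + ocr_cost + lm, path + [label]))
            # Keep only the cheapest way of reaching this label.
            nxt[label] = min(candidates)
        best = nxt
    return min(best.values())


# Toy data (invented): the recogniser alone slightly prefers "e" and "l".
lattice = [
    {"c": 0.4, "e": 0.3},
    {"a": 0.1},
    {"t": 0.6, "l": 0.5},
]
bigram_cost = {
    ("<s>", "c"): 0.5, ("<s>", "e"): 1.0,
    ("c", "a"): 0.2, ("e", "a"): 0.4,
    ("a", "t"): 0.3, ("a", "l"): 1.5,
}

cost, labels = best_path(lattice, bigram_cost)
print("".join(labels))
```

In this toy case, choosing the cheapest label at each position independently gives "eal", while the combined search selects "cat" because the language-model costs outweigh the small differences in recognition cost.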

Author information

Authors and Affiliations

DFKI and University of Kaiserslautern, Kaiserslautern, Germany
Thomas M. Breuel

Editor information

Editors and Affiliations

INRIA, Centre de Paris-Rocquencourt, BP 105, 78153, Le Chesnay Cedex, France
Gérard Huet
Department of Sanskrit Studies, University of Hyderabad, 500046, Hyderabad, India
Amba Kulkarni
Department of Classics, Brown University, Macfarlane House, 48 College Street, Providence, RI 02912, USA
Peter Scharf

About this paper

Cite this paper

Breuel, T.M. (2009). Applying the OCRopus OCR System to Scholarly Sanskrit Literature. In: Huet, G., Kulkarni, A., Scharf, P. (eds) Sanskrit Computational Linguistics. ISCLS 2007, ISCLS 2008. Lecture Notes in Computer Science, vol 5402. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00155-0_21

DOI: https://doi.org/10.1007/978-3-642-00155-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00154-3
Online ISBN: 978-3-642-00155-0
eBook Packages: Computer Science, Computer Science (R0)
