Applying the OCRopus OCR System to Scholarly Sanskrit Literature

  • Conference paper
Sanskrit Computational Linguistics (ISCLS 2007, ISCLS 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5402)

Abstract

OCRopus is an open source OCR system, currently under development, that is intended to be omni-lingual and omni-script. Beyond modern digital library applications, it can be used to capture and recognize classical literature as well as the large body of research literature about the classics. OCRopus advances the state of the art in several ways: new text recognition and layout analysis modules can be plugged in easily, character recognition is adaptive and user-extensible, and layout analysis is statistical and trainable. Of particular interest for computational linguistics applications are the consistent use of probability estimates throughout the system and the use of weighted finite state transducers to represent both alternative recognition hypotheses and statistical language models. In this paper, I first give an overview of these technologies and their relevance to digital library applications in the humanities, and then focus on statistical language models and how they support the integration of OCR output with subsequent computational linguistic and information extraction modules.
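As a rough illustration of how recognition hypotheses and a statistical language model can be combined through probability estimates, the following minimal Python sketch runs a shortest-path (Viterbi) search over a toy character lattice scored by a toy bigram model. It is not OCRopus code and does not reproduce the method of the paper: the function name `best_path`, the lattice layout, and all probabilities are assumptions made up for this example. Costs are negative log probabilities, so they add along a path, as in the tropical semiring commonly used with weighted finite state transducers.

```python
import math


def best_path(lattice, bigram_cost, default_cost):
    """Find the cheapest string through a lattice scored by a bigram model.

    lattice: list of positions; each position maps a symbol to its
        recognition cost (negative log probability).
    bigram_cost: maps (previous symbol, symbol) to a language model cost;
        "<s>" marks the start of the string.
    default_cost: cost charged for bigrams missing from bigram_cost.
    """
    # best[symbol] = (total cost, text so far) for the cheapest path
    # that ends in that symbol.
    best = {"<s>": (0.0, "")}
    for alternatives in lattice:
        nxt = {}
        for sym, ocr_cost in alternatives.items():
            for prev, (cost, text) in best.items():
                lm_cost = bigram_cost.get((prev, sym), default_cost)
                total = cost + ocr_cost + lm_cost
                if sym not in nxt or total < nxt[sym][0]:
                    nxt[sym] = (total, text + sym)
        best = nxt
    return min(best.values())


def c(p):
    """Convert a probability into an additive cost."""
    return -math.log(p)


# Hypothetical recognizer output: per-position alternatives.
# Taken alone, the recognizer prefers "cot" (0.6 * 0.55 > 0.6 * 0.45).
lattice = [
    {"c": c(0.6), "e": c(0.4)},
    {"a": c(0.45), "o": c(0.55)},
    {"t": c(1.0)},
]

# Hypothetical bigram language model.
bigram_cost = {
    ("<s>", "c"): c(0.5), ("<s>", "e"): c(0.5),
    ("c", "a"): c(0.8), ("c", "o"): c(0.2),
    ("e", "a"): c(0.9), ("e", "o"): c(0.1),
    ("a", "t"): c(0.5), ("o", "t"): c(0.5),
}

cost, text = best_path(lattice, bigram_cost, default_cost=c(1e-6))
print(text, math.exp(-cost))
```

With these invented numbers the language model overrides the recognizer's locally preferred "o" and the search returns "cat". A production system would express the same computation as composition of a hypothesis transducer with a language model transducer followed by a shortest-path search, which generalizes beyond fixed-length lattices and bigram models.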

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Breuel, T.M. (2009). Applying the OCRopus OCR System to Scholarly Sanskrit Literature. In: Huet, G., Kulkarni, A., Scharf, P. (eds) Sanskrit Computational Linguistics. ISCLS 2007, ISCLS 2008. Lecture Notes in Computer Science, vol 5402. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00155-0_21

  • DOI: https://doi.org/10.1007/978-3-642-00155-0_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00154-3

  • Online ISBN: 978-3-642-00155-0

  • eBook Packages: Computer Science, Computer Science (R0)
