Chapter

Guide to OCR for Indic Scripts

Part of the series Advances in Pattern Recognition, pp. 3-25

Building Data Sets for Indian Language OCR Research

  • C.V. Jawahar, affiliated with Center for Visual Information Processing, International Institute for Information Technology
  • Anand Kumar, affiliated with International Institute for Information Technology, Center for Visual Information Technology
  • A. Phaneendra, affiliated with International Institute for Information Technology, Center for Visual Information Technology
  • K.J. Jinesh, affiliated with International Institute for Information Technology, Center for Visual Information Technology

Abstract

Lack of resources in the form of annotated data sets has been one of the hurdles in developing robust document understanding systems for Indian languages. In this chapter, we present our activities in this direction. Our corpus consists of more than 600,000 document images in Indian scripts. A parallel text is aligned to the images to obtain word- and symbol-level annotated data sets. We describe the process we follow and the status of the activities.

Keywords

OCR, Data sets, Indic scripts, Annotation, Tools