Building Data Sets for Indian Language OCR Research

  • C.V. Jawahar
  • Anand Kumar
  • A. Phaneendra
  • K.J. Jinesh
Part of the Advances in Pattern Recognition book series (ACVPR)


The lack of annotated data sets has been one of the main hurdles in developing robust document understanding systems for Indian languages. In this chapter, we present our efforts to build such data sets. Our corpus consists of more than 600,000 document images in Indian scripts. A parallel text is aligned with the images to obtain word- and symbol-level annotated data sets. We describe the process we follow and the current status of the work.


OCR, Data sets, Indic scripts, Annotation, Tools



The authors wish to acknowledge the financial support provided by the Ministry of Communication and Information Technology, Govt. of India. They also acknowledge the inputs from the members of the Indian language OCR consortia in formulating the annotation procedure, and thank them for identifying the books to be included in the corpora and for getting some of these books typed.



Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • C.V. Jawahar (1, email author)
  • Anand Kumar (2)
  • A. Phaneendra (2)
  • K.J. Jinesh (2)
  1. Center for Visual Information Processing, International Institute for Information Technology, Hyderabad, India
  2. International Institute for Information Technology, Center for Visual Information Technology, Hyderabad, India
