Multimedia Tools and Applications, Volume 77, Issue 2, pp 1643–1678

PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification

  • Sk Md Obaidullah
  • Chayan Halder
  • K. C. Santosh
  • Nibaran Das
  • Kaushik Roy

Abstract

Without publicly available datasets, particularly in handwritten document recognition (HDR), methods cannot be compared fairly or reliably. Within HDR, document recognition for Indic scripts is still at an early stage compared with scripts such as Roman and Arabic. In this paper, we present a page-level handwritten document image dataset (PHDIndic_11) of 11 official Indic scripts: Bangla, Devanagari, Roman, Urdu, Oriya, Gurumukhi, Gujarati, Tamil, Telugu, Malayalam and Kannada. PHDIndic_11 comprises 1458 document text-pages written by 463 individuals from various parts of India. We also report benchmark results for handwritten script identification (HSI). Besides script identification, the dataset can be used in many other document image analysis applications, such as script sentence recognition/understanding, text-line segmentation, word segmentation/recognition, word spotting, separation of handwritten and machine-printed text, and writer identification.

Keywords

Page-level handwritten dataset; Handwritten document recognition; Official Indic scripts; Script identification

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Sk Md Obaidullah (1)
  • Chayan Halder (2)
  • K. C. Santosh (3)
  • Nibaran Das (4)
  • Kaushik Roy (2)

  1. Department of Computer Science and Engineering, Aliah University, Kolkata, India
  2. Department of Computer Science, West Bengal State University, Kolkata, India
  3. Department of Computer Science, University of South Dakota, Vermillion, USA
  4. Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
