Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents

  • Sk Md Obaidullah (email author)
  • K. C. Santosh
  • Chayan Halder
  • Nibaran Das
  • Kaushik Roy
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 709)

Abstract

Without a publicly available database, we can neither advance research nor make a fair comparison with state-of-the-art methods. To bridge this gap, we present a database of eleven Indic scripts covering thirteen official languages for script identification in multi-script document images. The database comprises 39K words, equally distributed across languages (i.e., 3K words per language). We also study three pertinent features, spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), together with their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, using all features, MLP performs best, with bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
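
The abstract names wavelet energy (WE) as a feature but does not define it. As a rough illustration only, the sketch below uses one common definition: the mean squared coefficient of each subband of a one-level 2-D Haar decomposition of a word image. The function name `haar_subband_energies`, the choice of the Haar wavelet, and the single decomposition level are assumptions made for this example and are not taken from the paper.

```python
def haar_subband_energies(image):
    """Mean squared coefficient per subband of a one-level 2-D Haar transform.

    `image` is a grayscale image given as a list of equal-length rows.
    Assumption: height and width are even (pad or crop beforehand).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")

    sums = {"approx": 0.0, "horizontal": 0.0, "vertical": 0.0, "diagonal": 0.0}
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            # Pixels of the current 2x2 block:
            #   a b
            #   d e
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            coeffs = {
                "approx": (a + b + d + e) / 2.0,      # local average
                "horizontal": (a + b - d - e) / 2.0,  # top vs bottom row
                "vertical": (a - b + d - e) / 2.0,    # left vs right column
                "diagonal": (a - b - d + e) / 2.0,    # diagonal difference
            }
            for name, value in coeffs.items():
                sums[name] += value * value

    blocks = (rows // 2) * (cols // 2)
    return {name: total / blocks for name, total in sums.items()}


# A 2x2 image with a bright top row and a dark bottom row: all detail
# energy falls in the "horizontal" subband.
energies = haar_subband_energies([[1, 1], [0, 0]])
```

In a script identification pipeline, the four energies (possibly computed at several decomposition levels) would be concatenated with the other features into one vector per word before being passed to a classifier such as MLP or RF.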

Keywords

Multi-script documents · Official Indic script database · Script identification

Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  • Sk Md Obaidullah (1), email author
  • K. C. Santosh (2)
  • Chayan Halder (3)
  • Nibaran Das (4)
  • Kaushik Roy (3)

  1. Department of Computer Science and Engineering, Aliah University, Kolkata, West Bengal, India
  2. Department of Computer Science, The University of South Dakota, Vermillion, USA
  3. Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  4. Department of Computer Science, West Bengal State University, Kolkata, India
