Abstract
In a multilingual country like India, a document may contain words in more than one language. Reading such documents requires a multilingual Optical Character Recognition (OCR) system. It is therefore necessary to identify the different language regions of a document before feeding it to the OCR system of each individual language. This paper proposes a procedure based on visual clues to identify the Kannada, Hindi and English text portions of an Indian multilingual document.
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Padma, M.C., Vijaya, P.A. Language Identification of Kannada, Hindi and English Text Words Through Visual Discriminating Features. Int J Comput Intell Syst 1, 116–126 (2008). https://doi.org/10.2991/ijcis.2008.1.2.2
DOI: https://doi.org/10.2991/ijcis.2008.1.2.2