Automatic Language Identification and Content Separation from Indian Multilingual Documents Using Unicode Transformation Format

Rakholia, Rajnish M.; Saini, Jatinderkumar R.

doi:10.1007/978-981-10-1675-2_37

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 468)

Abstract

In Natural Language Processing (NLP), language identification is the problem of determining which natural language or languages a written text uses. This paper presents a methodology for language identification in multilingual documents written in Indian languages. The main objective of this research is to recognize the languages in such a document automatically, quickly, and accurately, and then to separate the content by language using the Unicode Transformation Format (UTF). The proposed methodology can serve as a preprocessing step in document classification and in applications such as POS tagging, information retrieval, search engine optimization, and machine translation for Indian languages. Sixteen Indian languages were used in the experiments. The corpus texts were collected randomly from the web, and 822 documents were prepared, comprising 300 Portable Document Format (PDF) files and 522 text files. Each of the 822 documents contained more than 800 words written in multiple Indian languages, mixed at the sentence level. The methodology was implemented using UTF-8 in the free and open-source Java Server Pages (JSP) technology. Running it on the 522 text files yielded an accuracy of 99.98%, whereas the 300 PDF documents yielded an accuracy of 99.28%. The accuracy for text files exceeds that for PDF files by 0.70 percentage points, which the authors attribute to corrupted text in the PDF files.
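
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of using Unicode code point ranges to label and separate sentences. It is written in Python rather than the JSP used by the authors, and the function names (`script_of`, `dominant_script`, `separate_by_script`), the sentence-splitting rule, and the majority-vote rule are all assumptions made for illustration. The block ranges come from the Unicode Standard. Note that a code point range identifies a script, not a language: Hindi and Marathi, for example, both use Devanagari, so a complete system covering sixteen languages would need additional cues that this sketch does not attempt.

```python
import re
from collections import Counter, defaultdict

# Unicode block ranges for several Indic scripts (taken from the Unicode Standard).
# This table is illustrative and does not cover all sixteen languages in the paper.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch):
    """Return the script name for one character, or None if it is outside the table."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return None


def dominant_script(sentence):
    """Label a sentence with the script that most of its characters belong to."""
    counts = Counter(s for s in map(script_of, sentence) if s)
    return counts.most_common(1)[0][0] if counts else "Unknown"


def separate_by_script(text):
    """Split text into sentences and group them by dominant script.

    Assumed splitting rule: sentences end at a danda (U+0964), a double
    danda (U+0965), or one of '.', '?', '!'.
    """
    parts = re.split(r"[\u0964\u0965.?!]+", text)
    sentences = [p.strip() for p in parts if p.strip()]
    groups = defaultdict(list)
    for sentence in sentences:
        groups[dominant_script(sentence)].append(sentence)
    return dict(groups)


if __name__ == "__main__":
    sample = "यह हिंदी है। આ ગુજરાતી છે. இது தமிழ்."
    for script, sentences in separate_by_script(sample).items():
        print(script, sentences)
```

Counting characters per sentence and taking the majority keeps the sketch tolerant of a few stray characters, such as Latin digits or punctuation inside an otherwise Gujarati sentence. Corrupted text of the kind the abstract reports for PDF files would show up here as sentences labelled "Unknown" or assigned to the wrong script.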

Author information

Authors and Affiliations

R K University, Rajkot, Gujarat, India
Rajnish M. Rakholia
Narmada College of Computer Application, Bharuch, Gujarat, India
Jatinderkumar R. Saini

Corresponding author

Correspondence to Rajnish M. Rakholia.

Editor information

Editors and Affiliations

Department of Computer Science & Engineering, ANITS, Visakhapatnam, India
Suresh Chandra Satapathy
Dept. of ECE, Shri Ramswaroop Mem. Group of Prof. Clg, Lucknow, Uttar Pradesh, India
Vikrant Bhateja
Sabar Institute of Technology, Tajpur, Sabarkantha, Gujarat, India
Amit Joshi

About this paper

Cite this paper

Rakholia, R.M., Saini, J.R. (2017). Automatic Language Identification and Content Separation from Indian Multilingual Documents Using Unicode Transformation Format. In: Satapathy, S., Bhateja, V., Joshi, A. (eds) Proceedings of the International Conference on Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 468. Springer, Singapore. https://doi.org/10.1007/978-981-10-1675-2_37

DOI: https://doi.org/10.1007/978-981-10-1675-2_37
Published: 24 August 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-1674-5
Online ISBN: 978-981-10-1675-2
eBook Packages: Engineering, Engineering (R0)
