Automatic Language Identification and Content Separation from Indian Multilingual Documents Using Unicode Transformation Format

  • Conference paper
Proceedings of the International Conference on Data Engineering and Communication Technology

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 468)

Abstract

In Natural Language Processing (NLP), language identification is the problem of determining which natural language(s) a written text uses. This paper presents a methodology for language identification in multilingual documents written in Indian languages. The main objective of this research is to recognize the languages in such a document automatically, quickly, and accurately, and then to separate the content by language using the Unicode Transformation Format (UTF). The proposed methodology is applicable as a preprocessing step in document classification and in a number of applications such as POS tagging, information retrieval, search engine optimization, and machine translation for Indian languages. Sixteen different Indian languages were used in the experiments. The corpus texts were collected randomly from the web, and 822 documents were prepared, comprising 300 Portable Document Format (PDF) files and 522 text files. Each of the 822 documents contained more than 800 words written in multiple Indian languages, mixed at the sentence level. The proposed methodology was implemented using UTF-8 in Java Server Pages (JSP), a free and open-source Java technology. The 522 text files yielded an accuracy of 99.98%, whereas the 300 PDF files yielded an accuracy of 99.28%. Accuracy on text files is higher than on PDF files by 0.70 percentage points, owing to corrupted text in the PDF files.
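
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of separating content by Unicode code-point ranges, not the authors' implementation. It assumes that each whitespace-delimited token can be assigned to the script that most of its characters belong to, and it relies on Java's standard `Character.UnicodeScript` lookup. The class name `ScriptSeparator` and its two methods are hypothetical names introduced for illustration. Note that this identifies scripts, not languages: languages that share a script (for example Hindi and Marathi, both written in Devanagari) would need additional rules that are not shown here.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: groups the tokens of a mixed-script text by Unicode script.
final class ScriptSeparator {

    // Returns the most frequent script among the token's characters,
    // or null if the token holds only digits, punctuation, or similar.
    static UnicodeScript dominantScript(String token) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        token.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // COMMON and INHERITED cover digits, punctuation, and shared marks.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    // Splits the text on whitespace and collects the tokens of each script,
    // keyed by script name, in order of first appearance.
    static Map<String, String> separateByScript(String text) {
        Map<String, StringBuilder> buckets = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            UnicodeScript script = dominantScript(token);
            if (script == null) {
                continue; // no script-specific characters in this token
            }
            StringBuilder bucket =
                    buckets.computeIfAbsent(script.name(), k -> new StringBuilder());
            if (bucket.length() > 0) {
                bucket.append(' ');
            }
            bucket.append(token);
        }
        Map<String, String> result = new LinkedHashMap<>();
        buckets.forEach((name, content) -> result.put(name, content.toString()));
        return result;
    }

    public static void main(String[] args) {
        // Sample input mixing Devanagari, Gujarati, and Latin tokens.
        separateByScript("नमस्ते દુનિયા hello भारत")
                .forEach((script, content) -> System.out.println(script + ": " + content));
    }
}
```

The sketch works on decoded code points rather than raw UTF-8 bytes; reading the input with a UTF-8 decoder (for example `Files.readString`, which defaults to UTF-8) would be the matching input step. Grouping per token is a simplification of the sentence-level mixing described in the abstract.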

Author information

Corresponding author

Correspondence to Rajnish M. Rakholia.

Copyright information

© 2017 Springer Science+Business Media Singapore

About this paper

Cite this paper

Rakholia, R.M., Saini, J.R. (2017). Automatic Language Identification and Content Separation from Indian Multilingual Documents Using Unicode Transformation Format. In: Satapathy, S., Bhateja, V., Joshi, A. (eds) Proceedings of the International Conference on Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 468. Springer, Singapore. https://doi.org/10.1007/978-981-10-1675-2_37

  • DOI: https://doi.org/10.1007/978-981-10-1675-2_37

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-1674-5

  • Online ISBN: 978-981-10-1675-2

  • eBook Packages: Engineering, Engineering (R0)
