Resources and components for Gujarati NLP systems: a survey

Artificial Intelligence Review

Abstract

Natural Language Processing (NLP) is the automatic handling of natural human language by machines. Its applications span a wide spectrum: translating text between languages, retrieving and summarizing data from very large and complex repositories, filtering spam email, identifying fake news in digital media, mining people's political opinions, views, and sentiments on government policies, and providing effective medical assistance based on patients' past records. Gujarati is an Indian language with more than sixty million speakers worldwide. Considerable effort is currently under way to develop NLP applications and resources for Indian languages. This survey gives a taxonomy and a comprehensive report on component and resource development for Gujarati NLP systems. A few prominent tools available in the open domain are also tested, and a subsequent analysis of them is presented. Possible measures for handling the issues in resource and component development for Gujarati NLP systems are also discussed. This report may help industry practitioners, researchers, and academicians form a clear picture of the research gaps, challenges, and opportunities in Gujarati NLP systems.



Author information

Corresponding author

Correspondence to Nikita P. Desai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Desai, N.P., Dabhi, V.K. Resources and components for Gujarati NLP systems: a survey. Artif Intell Rev 55, 1–19 (2022). https://doi.org/10.1007/s10462-021-10120-1

  • DOI: https://doi.org/10.1007/s10462-021-10120-1
