Abstract
Natural Language Processing (NLP) is the automatic handling of natural human language by machines. NLP applications span a wide spectrum: translating text between languages, retrieving and summarizing data from very large and complex repositories, filtering spam email, identifying fake news in digital media, finding people's political opinions, views and sentiments on government policies, and providing effective medical assistance based on patients' past records. Gujarati is an Indian language with more than sixty million speakers worldwide. At present, considerable effort is being devoted to developing NLP applications and resources for Indian languages. This survey presents a taxonomy and a comprehensive report on component and resource development for Gujarati NLP systems. A few prominent tools available in the open domain are also tested, and a subsequent analysis of them is presented. Possible measures for handling the issues in resource and component development for Gujarati NLP systems are also discussed. This report may help industry, researchers and academicians gain a clear picture of the research gaps, challenges and opportunities in Gujarati NLP systems.
Cite this article
Desai, N.P., Dabhi, V.K. Resources and components for Gujarati NLP systems: a survey. Artif Intell Rev 55, 1–19 (2022). https://doi.org/10.1007/s10462-021-10120-1
DOI: https://doi.org/10.1007/s10462-021-10120-1