Abstract
Odia is among the 30 most spoken languages in the world and is primarily spoken in the Indian state of Odisha. The language lacks online content and resources for natural language processing (NLP) research, and there is a great need for a better language model for low-resource Odia that can serve many downstream NLP tasks. In this paper, we introduce a BERT-based language model, pre-trained on 430,000 Odia sentences. We evaluate the model on the well-known Kaggle Odia news classification dataset (BertOdia: 96%, RoBERTaOdia: 92%, and ULMFiT: 91.9% classification accuracy) and perform a comparison study with multilingual Bidirectional Encoder Representations from Transformers (BERT) models supporting Odia. The model will be released publicly for researchers to explore other NLP tasks.
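The pre-training objective behind a BERT-style model like the one described above is masked language modeling (MLM): a fraction of input tokens is corrupted, and the model is trained to recover the original tokens at only those positions. The sketch below illustrates this objective in plain PyTorch; the model sizes, vocabulary, and random data are toy stand-ins, not the actual BertOdia configuration or corpus.

```python
# Sketch of the masked-language-model (MLM) objective used in BERT-style
# pre-training. All sizes are illustrative, not BertOdia's real config.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model, seq_len, batch = 1000, 64, 16, 4

embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=2, dim_feedforward=128,
                               batch_first=True),
    num_layers=2)
head = nn.Linear(d_model, vocab)  # predicts the original token ids

# Fake pre-tokenized batch standing in for Odia sentences.
input_ids = torch.randint(0, vocab, (batch, seq_len))
labels = input_ids.clone()

# Mask ~15% of positions; only these contribute to the loss.
masked = torch.rand(batch, seq_len) < 0.15
labels[~masked] = -100                           # ignored by the loss
corrupted = input_ids.masked_fill(masked, 0)     # id 0 stands in for [MASK]

logits = head(encoder(embed(corrupted)))         # (batch, seq_len, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), labels.reshape(-1), ignore_index=-100)
loss.backward()  # an optimizer step over the full corpus would follow
```

In practice, a real pre-training run replaces the random batch with subword-tokenized corpus text and repeats this masking-and-prediction step over many epochs; the fine-tuning for news classification then swaps the MLM head for a classification head.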
Acknowledgements
The authors Shantipriya Parida and Petr Motlicek were supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 833635 (project ROXANNE: Real-time network, text, and speaker analytics for combating organized crime, 2019-2022).
The authors do not see any significant ethical or privacy concerns that would prevent the processing of the data used in the study. The datasets do contain personal data, and these are processed in compliance with the GDPR and national law.
Esaú Villatoro-Tello was partially supported by Idiap Research Institute, SNI CONACyT, and UAM-Cuajimalpa Mexico.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Parida, S. et al. (2022). BertOdia: BERT Pre-training for Low Resource Odia Language. In: Dehuri, S., Prasad Mishra, B.S., Mallick, P.K., Cho, SB. (eds) Biologically Inspired Techniques in Many Criteria Decision Making. Smart Innovation, Systems and Technologies, vol 271. Springer, Singapore. https://doi.org/10.1007/978-981-16-8739-6_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-8738-9
Online ISBN: 978-981-16-8739-6
eBook Packages: Intelligent Technologies and Robotics