
BertOdia: BERT Pre-training for Low Resource Odia Language

  • Conference paper
  • In: Biologically Inspired Techniques in Many Criteria Decision Making

Abstract

Odia is among the 30 most spoken languages in the world and is spoken predominantly in the Indian state of Odisha. Yet the language lacks online content and resources for natural language processing (NLP) research, and there is a great need for a better language model for this low-resource language that can be used for many downstream NLP tasks. In this paper, we introduce a Bidirectional Encoder Representations from Transformers (BERT)-based language model pre-trained on 430,000 Odia sentences. We evaluate the model on the well-known Kaggle Odia news classification dataset (BertOdia: 96%, RoBERTaOdia: 92%, and ULMFiT: 91.9% classification accuracy) and compare it with multilingual BERT models that support Odia. The model will be released publicly for researchers to explore other NLP tasks.
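Although this preview contains no code, the evaluation described above maps onto a standard sequence-classification fine-tuning loop. The sketch below shows one plausible way to fine-tune a BERT-style checkpoint on the Kaggle Odia news classification dataset using Hugging Face Transformers. The checkpoint name (a multilingual BERT stand-in, since the BertOdia weights were not yet public), the CSV file names, and the column names ("headings", "label") are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning a multilingual BERT checkpoint for Odia news
# headline classification with Hugging Face Transformers. File paths and
# column names are assumptions about the Kaggle dataset layout.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # stand-in for the BertOdia checkpoint

class NewsDataset(Dataset):
    """Wraps tokenized headlines and integer labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(list(texts), truncation=True,
                             padding="max_length", max_length=max_len)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Assumed columns: "headings" (Odia headline text) and "label" (category).
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
label2id = {l: i for i, l in enumerate(sorted(train_df["label"].unique()))}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(label2id))

train_ds = NewsDataset(train_df["headings"],
                       train_df["label"].map(label2id), tokenizer)
valid_ds = NewsDataset(valid_df["headings"],
                       valid_df["label"].map(label2id), tokenizer)

args = TrainingArguments(output_dir="bertodia-news",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")

Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=valid_ds).train()
```

A pipeline along these lines is the kind of setup under which the reported classification accuracies would be measured, though the exact hyperparameters are not stated in this preview.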





Acknowledgements

The authors Shantipriya Parida and Petr Motlicek were supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 833635 (project ROXANNE: Real-time network, text, and speaker analytics for combating organized crime, 2019-2022).

The authors do not see any significant ethical or privacy concerns that would prevent the processing of the data used in the study. The datasets do contain personal data, and these are processed in compliance with the GDPR and national law.

Esaú Villatoro-Tello was partially supported by Idiap Research Institute, SNI CONACyT, and UAM-Cuajimalpa Mexico.

Author information

Corresponding author

Correspondence to Shantipriya Parida.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Parida, S. et al. (2022). BertOdia: BERT Pre-training for Low Resource Odia Language. In: Dehuri, S., Prasad Mishra, B.S., Mallick, P.K., Cho, SB. (eds) Biologically Inspired Techniques in Many Criteria Decision Making. Smart Innovation, Systems and Technologies, vol 271. Springer, Singapore. https://doi.org/10.1007/978-981-16-8739-6_32

