
BertOdia: BERT Pre-training for Low Resource Odia Language

  • Conference paper
  • In: Biologically Inspired Techniques in Many Criteria Decision Making

Abstract

Odia is among the 30 most spoken languages in the world and is spoken predominantly in the Indian state of Odisha. Yet the language lacks online content and resources for natural language processing (NLP) research, and there is a great need for a better language model for this low-resource language that can be used for many downstream NLP tasks. In this paper, we introduce a Bidirectional Encoder Representations from Transformers (BERT)-based language model pre-trained on 430,000 Odia sentences. We evaluate the model on the well-known Kaggle Odia news classification dataset (BertOdia: 96%, RoBERTaOdia: 92%, and ULMFiT: 91.9% classification accuracy) and compare it with multilingual BERT models that support Odia. The model will be released publicly for researchers to explore other NLP tasks.
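Although this preview contains no code, the evaluation described above maps onto a standard sequence-classification fine-tuning loop. The sketch below shows one plausible way to fine-tune a BERT-style checkpoint on the Kaggle Odia news classification dataset using Hugging Face Transformers. The checkpoint name (a multilingual BERT stand-in, since the BertOdia weights were not yet public), the CSV file names, and the column names ("headings", "label") are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning a multilingual BERT checkpoint for Odia news
# headline classification with Hugging Face Transformers. File paths and
# column names are assumptions about the Kaggle dataset layout.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # stand-in for the BertOdia checkpoint

class NewsDataset(Dataset):
    """Wraps tokenized headlines and integer labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(list(texts), truncation=True,
                             padding="max_length", max_length=max_len)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Assumed columns: "headings" (Odia headline text) and "label" (category).
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
label2id = {l: i for i, l in enumerate(sorted(train_df["label"].unique()))}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(label2id))

train_ds = NewsDataset(train_df["headings"],
                       train_df["label"].map(label2id), tokenizer)
valid_ds = NewsDataset(valid_df["headings"],
                       valid_df["label"].map(label2id), tokenizer)

args = TrainingArguments(output_dir="bertodia-news",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")

Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=valid_ds).train()
```

A pipeline along these lines is the kind of setup under which the reported classification accuracies would be measured, though the exact hyperparameters are not stated in this preview.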





Acknowledgements

The authors Shantipriya Parida and Petr Motlicek were supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 833635 (project ROXANNE: Real-time network, text, and speaker analytics for combating organized crime, 2019-2022).

The authors do not see any significant ethical or privacy concerns that would prevent the processing of the data used in the study. The datasets do contain personal data, and these are processed in compliance with the GDPR and national law.

Esaú Villatoro-Tello was partially supported by Idiap Research Institute, SNI CONACyT, and UAM-Cuajimalpa Mexico.

Author information

Corresponding author

Correspondence to Shantipriya Parida.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Parida, S. et al. (2022). BertOdia: BERT Pre-training for Low Resource Odia Language. In: Dehuri, S., Prasad Mishra, B.S., Mallick, P.K., Cho, SB. (eds) Biologically Inspired Techniques in Many Criteria Decision Making. Smart Innovation, Systems and Technologies, vol 271. Springer, Singapore. https://doi.org/10.1007/978-981-16-8739-6_32

