
L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

  • Conference paper
  • In: Intelligent Computing (SAI 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 739)


Abstract

Sentence representations from vanilla BERT models do not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets have been shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of such specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy. These models are evaluated on downstream text classification and similarity tasks. We evaluate these models on real text classification datasets to show that embeddings obtained from synthetic-data training generalize to real datasets as well, and thus represent an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi, respectively. Our work also serves as a guide to building low-resource sentence embedding models.

A. Joshi—Authors contributed equally.
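
The NLI-then-STSb training strategy described in the abstract can be sketched with the sentence-transformers library. The snippet below is a minimal illustration, not the authors' exact configuration: the base model name, placeholder data rows, batch sizes, and epoch counts are assumptions for demonstration only.

# Minimal sketch (not the authors' exact setup): NLI pre-training followed by
# STSb fine-tuning with the sentence-transformers library. The base model,
# toy data rows, batch sizes, and epoch counts are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a sentence-transformer from a monolingual BERT with mean pooling.
word_emb = models.Transformer("l3cube-pune/hindi-bert-v2", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

# Step 1: NLI pre-training with a softmax classification head over
# (premise, hypothesis) pairs, e.g. machine-translated NLI rows.
# Labels: 0 = entailment, 1 = neutral, 2 = contradiction.
nli_rows = [("premise text", "hypothesis text", 0)]  # placeholder; use translated NLI data
nli_examples = [InputExample(texts=[p, h], label=y) for p, h, y in nli_rows]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1, warmup_steps=100)

# Step 2: STSb fine-tuning with cosine-similarity regression on translated
# STS benchmark pairs (gold scores rescaled from [0, 5] to [0, 1]).
sts_rows = [("sentence one", "sentence two", 4.2)]  # placeholder; use translated STSb data
sts_examples = [InputExample(texts=[s1, s2], label=score / 5.0) for s1, s2, score in sts_rows]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=4, warmup_steps=100)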


Notes

  1. https://huggingface.co/l3cube-pune/hindi-sentence-bert-nli
  2. https://huggingface.co/l3cube-pune/hindi-sentence-similarity-sbert
  3. https://huggingface.co/l3cube-pune/marathi-sentence-bert-nli
  4. https://huggingface.co/l3cube-pune/marathi-sentence-similarity-sbert
  5. https://github.com/divyanshuaggarwal/IndicXNLI
  6. https://huggingface.co/datasets/stsb_multi_mt
  7. https://github.com/l3cube-pune/MarathiNLP
  8. https://fasttext.cc/docs/en/crawl-vectors.html
  9. https://huggingface.co/ai4bharat/indic-bert
  10. https://huggingface.co/xlm-roberta-base
  11. https://huggingface.co/bert-base-multilingual-cased
  12. https://huggingface.co/google/muril-base-cased
  13. https://huggingface.co/setu4993/LaBSE
  14. https://github.com/facebookresearch/LASER
  15. https://huggingface.co/l3cube-pune/marathi-bert-v2
  16. https://huggingface.co/l3cube-pune/marathi-albert-v2
  17. https://huggingface.co/l3cube-pune/marathi-roberta
  18. https://huggingface.co/l3cube-pune/marathi-tweets-bert
  19. https://huggingface.co/l3cube-pune/hindi-bert-v2
  20. https://huggingface.co/l3cube-pune/hindi-albert
  21. https://huggingface.co/l3cube-pune/hindi-roberta
  22. https://huggingface.co/l3cube-pune/hindi-tweets-bert
  23. https://huggingface.co/hiiamsid/sentence_similarity_hindi
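
Footnotes 1–4 point to the released HindSBERT and MahaSBERT models on the Hugging Face Hub. A minimal loading-and-scoring sketch with sentence-transformers follows; the example sentences are placeholders, not from the paper.

# Minimal usage sketch: load a released sentence-similarity model (footnote 4)
# and score a pair of Marathi sentences. Example sentences are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("l3cube-pune/marathi-sentence-similarity-sbert")

sentences = ["पहिले वाक्य", "दुसरे वाक्य"]  # placeholder Marathi sentences
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings, in [-1, 1].
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"similarity = {score:.3f}")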

References

  1. Aggarwal, D., Gupta, V., Kunchukuttan, A.: Indicxnli: evaluating multilingual inference for Indian languages. arXiv preprint arXiv:2204.08776 (2022)

  2. Cer, D., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)

  3. Choi, H., Kim, J., Joe, S., Gwon, Y.: Evaluation of BERT and albert sentence embedding performance on downstream NLP tasks. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5482–5487. IEEE (2021)

  4. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451 (2020)

  5. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680 (2017)

  6. Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M.: What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties (2018)

  7. Dasgupta, I., Guo, D., Stuhlmüller, A., Gershman, S.J., Goodman, N.D.: Evaluating compositionality in sentence embeddings. arXiv preprint arXiv:1802.04302 (2018)

  8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  9. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805 (2018)

  10. Feng, F., Yang, Y., Cer, D., Arivazhagan, N., Wang, W.: Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852 (2020)

  11. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)

  12. Heffernan, K., Çelebi, O., Schwenk, H.: Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654 (2022)

  13. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)

  14. Joshi, R., Goel, P., Joshi, R.: Deep learning for Hindi text classification: a comparison. In: Tiwary, U.S., Chaudhury, S. (eds.) IHCI 2019. LNCS, vol. 11886, pp. 94–101. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-44689-5_9

  15. Joshi, R.: L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. arXiv preprint arXiv:2211.11418 (2022)

  16. Joshi, R.: L3Cube-MahaCorpus and MahaBERT: marathi monolingual corpus, Marathi BERT language models, and resources. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, Marseille, France, pp. 97–101. European Language Resources Association (2022)

  17. Joshi, R.: L3cube-mahanlp: Marathi natural language processing datasets, models, and library. arXiv preprint arXiv:2205.14728 (2022)

  18. Kakwani, D., et al.: Indicnlpsuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4948–4961 (2020)

  19. Khanuja, S., et al.: Muril: multilingual representations for Indian languages. arXiv preprint arXiv:2103.10730 (2021)

  20. Kulkarni, A., Mandhane, M., Likhitkar, M., Kshirsagar, G., Joshi, R.: L3cubemahasent: a Marathi tweet-based sentiment analysis dataset. In: Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 213–220 (2021)

  21. Kunchukuttan, A., Kakwani, D., Golla, S., Bhattacharyya, A., Khapra, M.M., Kumar, P., et al.: AI4Bharat-IndicNLP corpus: monolingual corpora and word embeddings for indic languages. arXiv preprint arXiv:2005.00085 (2020)

  22. Li, B., Zhou, H., He, J., Wang, M., Yang, Y., Li, L.: On the sentence embeddings from pre-trained language models. arXiv preprint arXiv:2011.05864 (2020)

  23. Ma, X., Wang, Z., Ng, P., Nallapati, R., Xiang, B.: Universal text representation from BERT: an empirical study. arXiv preprint arXiv:1910.07973 (2019)

  24. Oh, D., Kim, Y., Lee, H., Huang, H.H., Lim, H.: Don’t judge a language model by its last layer: contrastive learning with layer-wise attention pooling. arXiv preprint arXiv:2209.05972 (2022)

  25. Patankar, S., Gokhale, O., Kane, A., Chavan, T., Joshi, R.: Spread love not hate: undermining the importance of hateful pre-training for hate speech detection. arXiv preprint arXiv:2210.04267 (2022)

  26. Perone, C.S., Silveira, R., Paula, T.S.: Evaluation of sentence embeddings in downstream and linguistic probing tasks. arXiv preprint arXiv:1806.06259 (2018)

  27. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)

  28. Scheible, R., Thomczyk, F., Tippmann, P., Jaravine, V., Boeker, M.: Gottbert: a pure German language model. arXiv preprint arXiv:2012.02110 (2020)

  29. Shrestha, S.: sentence_similarity_hindi. Hugging Face model, https://huggingface.co/hiiamsid/sentence_similarity_hindi

  30. Straka, M., Náplava, J., Straková, J., Samuel, D.: RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model. In: Ekštein, K., Pártl, F., Konopík, M. (eds.) TSD 2021. LNCS (LNAI), vol. 12848, pp. 197–209. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-83527-9_17

  31. Sun, X., et al.: Sentence similarity based on contexts. Trans. Assoc. Comput. Linguist. 10, 573–588 (2022)

  32. Velankar, A., Patil, H., Joshi, R.: Mono vs multilingual BERT for hate speech detection and text classification: a case study in Marathi. In: El Gayar, N., Trentin, E., Ravanelli, M., Abbas, H. (eds.) ANNPR 2022. LNCS, vol. 13739, pp. 121–128. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-20650-4_10

  33. Wang, B., Kuo, C.-C.J.: SBERT-WK: a sentence embedding method by dissecting BERT-based word models. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2146–2157 (2020)


Acknowledgments

This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.

Author information

Corresponding author

Correspondence to Ananya Joshi.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Joshi, A., Kajale, A., Gadre, J., Deode, S., Joshi, R. (2023). L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi. In: Arai, K. (eds) Intelligent Computing. SAI 2023. Lecture Notes in Networks and Systems, vol 739. Springer, Cham. https://doi.org/10.1007/978-3-031-37963-5_82
