Abstract
Sentence representations from vanilla BERT models do not work well on sentence-similarity tasks. Sentence-BERT models trained specifically on STS or NLI datasets have been shown to provide state-of-the-art performance. However, building such models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in producing high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained with this simple strategy outperform the multilingual LaBSE model, which relies on a more complex training procedure. We evaluate these models on downstream text-classification and similarity tasks, using real classification datasets to show that embeddings obtained from synthetic-data training generalize to real data as well, and thus represent an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence-embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi, respectively. Our work also serves as a guide to building sentence-embedding models for low-resource languages.
A. Joshi—Authors contributed equally.
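To make the recipe concrete, the following is a minimal sketch of the NLI pre-training followed by STSb fine-tuning strategy described in the abstract, written against the sentence-transformers library. The base checkpoint name, dataset contents, and hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the NLI -> STSb training recipe (assumed setup, not the paper's exact one).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# 1) Build a sentence-BERT model on top of a monolingual BERT checkpoint.
#    The model name below is an assumption for illustration.
word_embedding = models.Transformer("l3cube-pune/marathi-bert-v2", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# 2) NLI pre-training: (premise, entailed hypothesis) pairs from a machine-translated
#    NLI dataset, trained with MultipleNegativesRankingLoss (in-batch negatives).
nli_examples = [
    InputExample(texts=["premise sentence", "entailed hypothesis"]),
    # ... more translated NLI pairs
]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1, warmup_steps=100)

# 3) STSb fine-tuning: sentence pairs with gold similarity scores scaled to [0, 1],
#    from a machine-translated STS benchmark, trained with CosineSimilarityLoss.
sts_examples = [
    InputExample(texts=["sentence one", "sentence two"], label=0.8),
    # ... more translated STSb pairs
]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=4, warmup_steps=100)

model.save("sbert-sts-model")
```

After training (or when loading the released sentence-BERT checkpoints), model.encode() produces sentence embeddings whose cosine similarity (sentence_transformers.util.cos_sim) can be used directly for similarity scoring or as fixed features for downstream text classification.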
Acknowledgments
This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Joshi, A., Kajale, A., Gadre, J., Deode, S., Joshi, R. (2023). L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi. In: Arai, K. (eds) Intelligent Computing. SAI 2023. Lecture Notes in Networks and Systems, vol 739. Springer, Cham. https://doi.org/10.1007/978-3-031-37963-5_82
DOI: https://doi.org/10.1007/978-3-031-37963-5_82
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37962-8
Online ISBN: 978-3-031-37963-5
eBook Packages: Intelligent Technologies and Robotics (R0)