Abstract
Sentence representations from vanilla BERT models do not work well on sentence-similarity tasks. Sentence-BERT models trained specifically on STS or NLI datasets have been shown to provide state-of-the-art performance. However, building such models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in producing high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained with this simple strategy outperform the multilingual LaBSE model, which relies on a more complex training procedure. We evaluate these models on downstream text-classification and similarity tasks, using real classification datasets to show that embeddings obtained from synthetic-data training generalize to real data as well, and thus represent an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence-embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi, respectively. Our work also serves as a guide to building sentence-embedding models for low-resource languages.
A. Joshi—Authors contributed equally.
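To make the recipe concrete, the following is a minimal sketch of the NLI pre-training followed by STSb fine-tuning strategy described in the abstract, written against the sentence-transformers library. The base checkpoint name, dataset contents, and hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the NLI -> STSb training recipe (assumed setup, not the paper's exact one).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# 1) Build a sentence-BERT model on top of a monolingual BERT checkpoint.
#    The model name below is an assumption for illustration.
word_embedding = models.Transformer("l3cube-pune/marathi-bert-v2", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# 2) NLI pre-training: (premise, entailed hypothesis) pairs from a machine-translated
#    NLI dataset, trained with MultipleNegativesRankingLoss (in-batch negatives).
nli_examples = [
    InputExample(texts=["premise sentence", "entailed hypothesis"]),
    # ... more translated NLI pairs
]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1, warmup_steps=100)

# 3) STSb fine-tuning: sentence pairs with gold similarity scores scaled to [0, 1],
#    from a machine-translated STS benchmark, trained with CosineSimilarityLoss.
sts_examples = [
    InputExample(texts=["sentence one", "sentence two"], label=0.8),
    # ... more translated STSb pairs
]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=4, warmup_steps=100)

model.save("sbert-sts-model")
```

After training (or when loading the released sentence-BERT checkpoints), model.encode() produces sentence embeddings whose cosine similarity (sentence_transformers.util.cos_sim) can be used directly for similarity scoring or as fixed features for downstream text classification.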
Acknowledgments
This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Joshi, A., Kajale, A., Gadre, J., Deode, S., Joshi, R. (2023). L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi. In: Arai, K. (eds) Intelligent Computing. SAI 2023. Lecture Notes in Networks and Systems, vol 739. Springer, Cham. https://doi.org/10.1007/978-3-031-37963-5_82
DOI: https://doi.org/10.1007/978-3-031-37963-5_82
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37962-8
Online ISBN: 978-3-031-37963-5
eBook Packages: Intelligent Technologies and Robotics (R0)