Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification

  • Regular paper
  • Published in: Knowledge and Information Systems

Abstract

Code-mixing and code-switching (CMCS) are frequent features of online conversations. Classifying such text is challenging when one of the languages involved is low-resourced. Fine-tuning pre-trained multilingual language models (PMLMs) is a promising avenue for code-mixed text classification. In this paper, we explore adapter-based fine-tuning of PMLMs for CMCS text classification. We introduce sequential and parallel stacking of adapters, continuous fine-tuning of adapters, and training adapters without freezing the original model as novel techniques with respect to single-task CMCS text classification. We also present a newly annotated dataset for the classification of Sinhala–English code-mixed and code-switched text, where Sinhala is a low-resourced language. Our dataset of 10,000 user comments has been manually annotated for five classification tasks: sentiment analysis, humor detection, hate speech detection, language identification, and aspect identification, making it the first publicly available Sinhala–English CMCS dataset and the one with the largest number of task annotation types. In addition to this dataset, we tested our proposed techniques on Kannada–English and Hindi–English datasets. These experiments confirm that our adapter-based PMLM fine-tuning techniques outperform, or are on par with, basic fine-tuning of PMLMs.
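The adapter compositions named in the abstract (sequential and parallel stacking, training without freezing) can be sketched with the AdapterHub adapter-transformers library referenced in this work (see Notes 5–8). The snippet below is a minimal illustration, not the authors' exact configuration: the adapter name "sentiment", the label count, and the choice of the pre-trained English language adapter are assumptions made for the example.

```python
# Minimal sketch of adapter composition with adapter-transformers
# (https://github.com/Adapter-Hub/adapter-transformers). Names and
# hyperparameters here are illustrative assumptions, not the paper's setup.
import transformers.adapters.composition as ac
from transformers import XLMRobertaModelWithHeads

model = XLMRobertaModelWithHeads.from_pretrained("xlm-roberta-base")

# Task adapter and classification head for one CMCS task (e.g., sentiment).
model.add_adapter("sentiment")
model.add_classification_head("sentiment", num_labels=3)

# Pre-trained language adapter from AdapterHub (https://adapterhub.ml/).
lang = model.load_adapter("en/wiki@ukp")

# Freeze the PMLM weights and train only the task adapter; calling
# model.freeze_model(False) afterwards would also update the PMLM weights
# (cf. the paper's "training adapters without freezing the original model").
model.train_adapter("sentiment")

# Sequential stacking: activations pass through the language adapter first,
# then the task adapter. ac.Parallel(lang, "sentiment") is the alternative
# parallel composition.
model.active_adapters = ac.Stack(lang, "sentiment")
```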


Data availability

The dataset, code, and the trained models used in this paper are publicly available (https://huggingface.co/datasets/NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset, https://github.com/HimashiRathnayake/CMCS-Text-Classification).
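Assuming the Hugging Face datasets library, the corpus can presumably be loaded directly from the Hub; the repository ID is from the link above, but the available splits and columns are not confirmed here and should be inspected:

```python
# Hypothetical loading sketch using the Hugging Face `datasets` library.
# The repository ID comes from the paper's data availability statement;
# split and column names are assumptions to verify after loading.
from datasets import load_dataset

ds = load_dataset("NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset")
print(ds)  # inspect available splits and columns before use
```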

Notes

  1. https://github.com/amsuhane/Humour-Detection-in-English-Hindi-Code-Mixed-Text.

  2. https://huggingface.co/datasets/NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset.

  3. https://github.com/HimashiRathnayake/CMCS-Text-Classification.

  4. Writing the pronunciation of one language using the characters of another, e.g., hodama — a Sinhala word written in English characters; translation: best.

  5. https://docs.adapterhub.ml/adapter_composition.html.

  6. https://github.com/Adapter-Hub/adapter-transformers.

  7. https://docs.adapterhub.ml/adapter_composition.html.

  8. https://adapterhub.ml/.

  9. https://github.com/amsuhane/Humour-Detection-in-English-Hindi-Code-Mixed-Text.

  10. The other official language in Sri Lanka.

  11. https://adapterhub.ml/adapters/ukp/xlm-roberta-base-en-wiki_pfeiffer/.

  12. The only available Hindi BERT-style model is an Electra model, and Electra does not currently support adapters.

  13. https://huggingface.co.

  14. https://colab.research.google.com/.

References

  1. Aguilar G, Kar S, Solorio T (2020) Lince: a centralized benchmark for linguistic code-switching evaluation. In: Proceedings of the 12th language resources and evaluation conference, pp 1803–1813

  2. Ansari MZ, Beg M, Ahmad T, et al (2021) Language identification of Hindi-English tweets using code-mixed Bert. arXiv preprint arXiv:2107.01202

  3. Antoun W, Baly F, Achour R, et al (2020) State of the art models for fake news detection tasks. In: 2020 IEEE international conference on informatics, IoT, and enabling technologies (ICIoT). IEEE, pp 519–524

  4. Bohra A, Vijay D, Singh V et al (2018) A dataset of Hindi-English code-mixed social media text for hate speech detection. NAACL HLT 2018:36

  5. Chakravarthi BR, Jose N, Suryawanshi S, et al (2020) A sentiment analysis dataset for code-mixed Malayalam-English. In: Proceedings of the 1st Joint workshop on spoken language technologies for under-resourced languages (SLTU) and collaboration and computing for under-resourced languages (CCURL), pp 177–184

  6. Chakravarthi BR, Priyadharshini R, Muralidaran V, et al (2022) Dravidiancodemix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Language Resources and Evaluation pp 1–42

  7. Chathuranga S, Ranathunga S (2021) Classification of code-mixed text using capsule networks. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2021), pp 256–263

  8. Chawla NV, Bowyer KW, Hall LO et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intel Res 16:321–357

  9. Conneau A, Khandelwal K, Goyal N, et al (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451

  10. Das A, Gambäck B (2014) Identifying languages at the word level in code-mixed Indian social media text. In: Proceedings of the 11th international conference on natural language processing, pp 378–387

  11. Dhananjaya V, Demotte P, Ranathunga S, et al (2022) Bertifying Sinhala - a comprehensive analysis of pre-trained language models for Sinhala text classification. In: Proceedings of the 13th language resources and evaluation conference

  12. Friedman D, Dodge B, Chen D (2021) Single-dataset experts for multi-dataset question answering. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 6128–6137

  13. Gundapu S, Mamidi R (2018) Word level language identification in English Telugu code mixed data. In: Proceedings of the 32nd Pacific Asia conference on language, information and computation

  14. Hande A, Hegde SU, Priyadharshini R, et al (2021a) Benchmarking multi-task learning for sentiment analysis and offensive language identification in under-resourced Dravidian languages. arXiv preprint arXiv:2108.03867

  15. Hande A, Puranik K, Yasaswini K, et al (2021b) Offensive language identification in low-resourced code-mixed Dravidian languages using pseudo-labeling. arXiv preprint arXiv:2108.12177

  16. Houlsby N, Giurgiu A, Jastrzebski S, et al (2019) Parameter-efficient transfer learning for nlp. In: International conference on machine learning. PMLR, pp 2790–2799

  17. Huertas García Á, et al (2021) Automatic information search for countering Covid-19 misinformation through semantic similarity. Master’s thesis

  18. Kakwani D, Kunchukuttan A, Golla S et al (2020) Indicnlpsuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. Find Assoc Comput Linguist EMNLP 2020:4948–4961

  19. Kamble S, Joshi A (2018) Hate speech detection from code-mixed Hindi-English tweets using deep learning models. arXiv preprint arXiv:1811.05145

  20. Kazhuparambil S, Kaushik A (2020) Cooking is all about people: comment classification on cookery channels using Bert and classification models (Malayalam-English mix-code). arXiv preprint arXiv:2007.04249

  21. Devlin J, Chang MW, Lee K, Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp 4171–4186

  22. Khandelwal A, Swami S, Akhtar SS, et al (2018) Humor detection in English-Hindi code-mixed social media content: corpus and baseline system. In: 11th international conference on language resources and evaluation, LREC 2018, European Language Resources Association (ELRA), pp 1203–1207

  23. Khanuja S, Dandapat S, Srinivasan A, et al (2020) Gluecos: an evaluation benchmark for code-switched nlp. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 3575–3585

  24. Libovickỳ J, Rosa R, Fraser A (2019) How language-neutral is multilingual Bert? arXiv preprint arXiv:1911.03310

  25. Mathur P, Sawhney R, Ayyar M, et al (2018a) Did you offend me? Classification of offensive tweets in Hinglish language. In: Proceedings of the 2nd workshop on abusive language online (ALW2), pp 138–148

  26. Mathur P, Shah RR, Sawhney R et al (2018) Detecting offensive tweets in Hindi-English code-switched language. ACL 2018:18

  27. Mave D, Maharjan S, Solorio T (2018) Language identification and analysis of code-switched social media text. ACL 2018:51

  28. Molina G, Rey-Villamizar N, Solorio T et al (2016) Overview for the second shared task on language identification in code-switched data. EMNLP 2016:40

  29. Ousidhoum N, Lin Z, Zhang H, et al (2019) Multilingual and multi-aspect hate speech analysis. In: EMNLP/IJCNLP (1)

  30. Pfeiffer J, Rücklé A, Poth C, et al (2020a) Adapterhub: a framework for adapting transformers. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp 46–54

  31. Pfeiffer J, Vulić I, Gurevych I, et al (2020b) Mad-x: An adapter-based framework for multi-task cross-lingual transfer. In: Proceedings of the 2020 Conference on empirical methods in natural language processing (EMNLP), pp 7654–7673

  32. Pfeiffer J, Kamath A, Rücklé A, et al (2021) Adapterfusion: non-destructive task composition for transfer learning. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pp 487–503

  33. Rücklé A (2021) Representation learning and learning from limited labeled data for community question answering

  34. Rücklé A, Geigle G, Glockner M, et al (2021) Adapterdrop: on the efficiency of adapters in transformers. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 7930–7946

  35. Sabty C, Elmahdy M, Abdennadher S (2019) Named entity recognition on Arabic-English code-mixed data. In: 2019 IEEE 13th international conference on semantic computing (ICSC), IEEE computer society, pp 93–97

  36. Senevirathne L, Demotte P, Karunanayake B, et al (2020) Sentiment analysis for Sinhala language using deep learning techniques. arXiv preprint arXiv:2011.07280

  37. Smith I, Thayasivam U (2019) Language detection in Sinhala-English code-mixed data. In: 2019 International conference on Asian language processing (IALP). IEEE, pp 228–233

  38. Solorio T, Blair E, Maharjan S et al (2014) Overview for the first shared task on language identification in code-switched data. EMNLP 2014:62

  39. Swami S, Khandelwal A, Singh V, et al (2018) A corpus of English-Hindi code-mixed tweets for sarcasm detection. arXiv preprint arXiv:1805.11869

  40. Toftrup M, Sørensen SA, Ciosici MR, et al (2021) A reproduction of apple’s bi-directional lstm models for language identification in short strings. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: student research workshop, pp 36–42

  41. Ünal U, Dağ H (2022) Anomalyadapters: parameter-efficient multi-anomaly task detection. IEEE Access

  42. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  43. Vilares D, Alonso MA, Gómez-Rodríguez C (2016) En-es-cs: an English-Spanish code-switching twitter corpus for multilingual sentiment analysis. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16), pp 4149–4153

  44. Wang X, Tsvetkov Y, Ruder S et al (2021) Efficient test time adapter ensembling for low-resource language varieties. Find Assoc Comput Linguist EMNLP 2021:730–737

  45. Yadav S, Chakraborty T (2020) Unsupervised sentiment analysis for code-mixed data. arXiv preprint arXiv:2001.11384

Acknowledgements

The authors would like to acknowledge the SRC funding.

Funding

This study was funded by a Senate Research Committee (SRC) Grant of the University of Moratuwa, Sri Lanka.

Author information

Contributions

S.R. was involved in the research idea, supervision, and manuscript reviewing. H.R., J.S., and R.R. contributed to the literature review, system implementation, evaluation, and preparation of the initial manuscript.

Corresponding author

Correspondence to Himashi Rathnayake.

Ethics declarations

Conflict of interest

The four authors of this paper have no conflict of interest to disclose.

Ethical approval

Ethical approval is not applicable. Data annotators were informed about the task prior to the work and were paid according to institution-approved rates.

Consent for publication

We would like to confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere. All the authors give their consent to publish the manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A CMI calculation

The following equation, from Das and Gambäck [10], calculates the Code-Mixing Index (CMI) at the utterance level.

$$\mathrm{CMI} = \begin{cases} 100 \times \left[ 1 - \frac{\max(w_i)}{n - u} \right] & : n > u \\ 0 & : n = u \end{cases}$$

In this equation, \(w_i\) is the number of words from language \(i\), so \(\max(w_i)\) is the word count of the dominant language in the CMCS utterance; \(n\) is the total number of tokens; and \(u\) is the number of tokens with language-independent tags. In our Sinhala–English corpus, we considered the Mixed, Symbol, and NameEntity tags to be language-independent.
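For concreteness, the calculation can be expressed as a short Python function. This is an illustrative sketch, not code from the paper's repository; the function name and the tag-string spellings are assumptions.

```python
from collections import Counter

# Tags treated as language-independent in the Sinhala-English corpus
# (spellings assumed from the appendix text).
LANG_INDEPENDENT = {"Mixed", "Symbol", "NameEntity"}

def cmi(tags):
    """Utterance-level Code-Mixing Index of Das and Gambäck [10].

    `tags` is the per-token language-tag sequence of one utterance.
    """
    n = len(tags)                                     # total tokens
    u = sum(tag in LANG_INDEPENDENT for tag in tags)  # language-independent tokens
    if n == u:                                        # no language-tagged tokens
        return 0.0
    counts = Counter(tag for tag in tags if tag not in LANG_INDEPENDENT)
    max_wi = max(counts.values())                     # dominant language's word count
    return 100 * (1 - max_wi / (n - u))

# Example: 3 Sinhala + 2 English tokens and 1 symbol -> 100 * (1 - 3/5) = 40.0
print(cmi(["Sinhala", "Sinhala", "English", "Sinhala", "English", "Symbol"]))
```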

Appendix B English translations of non-English texts

English translations of non-English texts used in this paper are given in Table 19.

Table 19 English translations of non-English texts

About this article

Cite this article

Rathnayake, H., Sumanapala, J., Rukshani, R. et al. Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification. Knowl Inf Syst 64, 1937–1966 (2022). https://doi.org/10.1007/s10115-022-01698-1
