Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification

  • Regular paper
  • Published in: Knowledge and Information Systems

Abstract

Code-mixing and code-switching (CMCS) are frequent features of online conversations. Classifying such text is challenging when one of the languages involved is low-resourced. Fine-tuning pre-trained multilingual language models (PMLMs) is a promising avenue for code-mixed text classification. In this paper, we explore adapter-based fine-tuning of PMLMs for CMCS text classification. We introduce sequential and parallel stacking of adapters, continuous fine-tuning of adapters, and training adapters without freezing the original model as novel techniques with respect to single-task CMCS text classification. We also present a newly annotated dataset for the classification of Sinhala–English code-mixed and code-switched text, where Sinhala is a low-resourced language. Our dataset of 10,000 user comments has been manually annotated for five classification tasks: sentiment analysis, humor detection, hate speech detection, language identification, and aspect identification, making it the first publicly available Sinhala–English CMCS dataset and the one with the largest number of task annotation types. In addition to this dataset, we tested our proposed techniques on Kannada–English and Hindi–English datasets. These experiments confirm that our adapter-based PMLM fine-tuning techniques outperform, or are on par with, basic fine-tuning of PMLMs.
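The adapter compositions named in the abstract (sequential and parallel stacking, training without freezing) can be sketched with the AdapterHub adapter-transformers library referenced in this work (see Notes 5–8). The snippet below is a minimal illustration, not the authors' exact configuration: the adapter name "sentiment", the label count, and the choice of the pre-trained English language adapter are assumptions made for the example.

```python
# Minimal sketch of adapter composition with adapter-transformers
# (https://github.com/Adapter-Hub/adapter-transformers). Names and
# hyperparameters here are illustrative assumptions, not the paper's setup.
import transformers.adapters.composition as ac
from transformers import XLMRobertaModelWithHeads

model = XLMRobertaModelWithHeads.from_pretrained("xlm-roberta-base")

# Task adapter and classification head for one CMCS task (e.g., sentiment).
model.add_adapter("sentiment")
model.add_classification_head("sentiment", num_labels=3)

# Pre-trained language adapter from AdapterHub (https://adapterhub.ml/).
lang = model.load_adapter("en/wiki@ukp")

# Freeze the PMLM weights and train only the task adapter; calling
# model.freeze_model(False) afterwards would also update the PMLM weights
# (cf. the paper's "training adapters without freezing the original model").
model.train_adapter("sentiment")

# Sequential stacking: activations pass through the language adapter first,
# then the task adapter. ac.Parallel(lang, "sentiment") is the alternative
# parallel composition.
model.active_adapters = ac.Stack(lang, "sentiment")
```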


Data availability

The dataset, code, and the trained models used in this paper are publicly available (https://huggingface.co/datasets/NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset, https://github.com/HimashiRathnayake/CMCS-Text-Classification).
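Assuming the Hugging Face datasets library, the corpus can presumably be loaded directly from the Hub; the repository ID is from the link above, but the available splits and columns are not confirmed here and should be inspected:

```python
# Hypothetical loading sketch using the Hugging Face `datasets` library.
# The repository ID comes from the paper's data availability statement;
# split and column names are assumptions to verify after loading.
from datasets import load_dataset

ds = load_dataset("NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset")
print(ds)  # inspect available splits and columns before use
```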

Notes

  1. https://github.com/amsuhane/Humour-Detection-in-English-Hindi-Code-Mixed-Text.

  2. https://huggingface.co/datasets/NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset.

  3. https://github.com/HimashiRathnayake/CMCS-Text-Classification.

  4. Writing the pronunciation of one language using the characters of another, e.g., hodama — a Sinhala word written in English characters; translation: best.

  5. https://docs.adapterhub.ml/adapter_composition.html.

  6. https://github.com/Adapter-Hub/adapter-transformers.

  7. https://docs.adapterhub.ml/adapter_composition.html.

  8. https://adapterhub.ml/.

  9. https://github.com/amsuhane/Humour-Detection-in-English-Hindi-Code-Mixed-Text.

  10. The other official language in Sri Lanka.

  11. https://adapterhub.ml/adapters/ukp/xlm-roberta-base-en-wiki_pfeiffer/.

  12. The only available Hindi BERT-style model is an Electra model, and Electra does not currently support adapters.

  13. https://huggingface.co.

  14. https://colab.research.google.com/.

References

  1. Aguilar G, Kar S, Solorio T (2020) Lince: a centralized benchmark for linguistic code-switching evaluation. In: Proceedings of the 12th language resources and evaluation conference, pp 1803–1813

  2. Ansari MZ, Beg M, Ahmad T, et al (2021) Language identification of Hindi-English tweets using code-mixed Bert. arXiv preprint arXiv:2107.01202

  3. Antoun W, Baly F, Achour R, et al (2020) State of the art models for fake news detection tasks. In: 2020 IEEE international conference on informatics, IoT, and enabling technologies (ICIoT). IEEE, pp 519–524

  4. Bohra A, Vijay D, Singh V et al (2018) A dataset of Hindi-English code-mixed social media text for hate speech detection. NAACL HLT 2018:36

  5. Chakravarthi BR, Jose N, Suryawanshi S, et al (2020) A sentiment analysis dataset for code-mixed Malayalam-English. In: Proceedings of the 1st Joint workshop on spoken language technologies for under-resourced languages (SLTU) and collaboration and computing for under-resourced languages (CCURL), pp 177–184

  6. Chakravarthi BR, Priyadharshini R, Muralidaran V, et al (2022) Dravidiancodemix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Language Resources and Evaluation pp 1–42

  7. Chathuranga S, Ranathunga S (2021) Classification of code-mixed text using capsule networks. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2021), pp 256–263

  8. Chawla NV, Bowyer KW, Hall LO et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intel Res 16:321–357

  9. Conneau A, Khandelwal K, Goyal N, et al (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451

  10. Das A, Gambäck B (2014) Identifying languages at the word level in code-mixed Indian social media text. In: Proceedings of the 11th international conference on natural language processing, pp 378–387

  11. Dhananjaya V, Demotte P, Ranathunga S, et al (2022) Bertifying Sinhala - a comprehensive analysis of pre-trained language models for Sinhala text classification. In: Proceedings of the 13th language resources and evaluation conference

  12. Friedman D, Dodge B, Chen D (2021) Single-dataset experts for multi-dataset question answering. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 6128–6137

  13. Gundapu S, Mamidi R (2018) Word level language identification in English Telugu code mixed data. In: Proceedings of the 32nd Pacific Asia conference on language, information and computation

  14. Hande A, Hegde SU, Priyadharshini R, et al (2021a) Benchmarking multi-task learning for sentiment analysis and offensive language identification in under-resourced Dravidian languages. arXiv preprint arXiv:2108.03867

  15. Hande A, Puranik K, Yasaswini K, et al (2021b) Offensive language identification in low-resourced code-mixed Dravidian languages using pseudo-labeling. arXiv preprint arXiv:2108.12177

  16. Houlsby N, Giurgiu A, Jastrzebski S, et al (2019) Parameter-efficient transfer learning for nlp. In: International conference on machine learning. PMLR, pp 2790–2799

  17. Huertas García Á, et al (2021) Automatic information search for countering Covid-19 misinformation through semantic similarity. Master’s thesis

  18. Kakwani D, Kunchukuttan A, Golla S et al (2020) Indicnlpsuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. Find Assoc Comput Linguist EMNLP 2020:4948–4961

  19. Kamble S, Joshi A (2018) Hate speech detection from code-mixed Hindi-English tweets using deep learning models. arXiv preprint arXiv:1811.05145

  20. Kazhuparambil S, Kaushik A (2020) Cooking is all about people: comment classification on cookery channels using Bert and classification models (Malayalam-English mix-code). arXiv preprint arXiv:2007.04249

  21. Devlin J, Chang MW, Lee K, Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp 4171–4186

  22. Khandelwal A, Swami S, Akhtar SS, et al (2018) Humor detection in English-Hindi code-mixed social media content: corpus and baseline system. In: 11th international conference on language resources and evaluation, LREC 2018, European Language Resources Association (ELRA), pp 1203–1207

  23. Khanuja S, Dandapat S, Srinivasan A, et al (2020) Gluecos: an evaluation benchmark for code-switched nlp. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 3575–3585

  24. Libovickỳ J, Rosa R, Fraser A (2019) How language-neutral is multilingual Bert? arXiv preprint arXiv:1911.03310

  25. Mathur P, Sawhney R, Ayyar M, et al (2018a) Did you offend me? Classification of offensive tweets in Hinglish language. In: Proceedings of the 2nd workshop on abusive language online (ALW2), pp 138–148

  26. Mathur P, Shah RR, Sawhney R et al (2018) Detecting offensive tweets in Hindi-English code-switched language. ACL 2018:18

  27. Mave D, Maharjan S, Solorio T (2018) Language identification and analysis of code-switched social media text. ACL 2018:51

  28. Molina G, Rey-Villamizar N, Solorio T et al (2016) Overview for the second shared task on language identification in code-switched data. EMNLP 2016:40

  29. Ousidhoum N, Lin Z, Zhang H, et al (2019) Multilingual and multi-aspect hate speech analysis. In: EMNLP/IJCNLP (1)

  30. Pfeiffer J, Rücklé A, Poth C, et al (2020a) Adapterhub: a framework for adapting transformers. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp 46–54

  31. Pfeiffer J, Vulić I, Gurevych I, et al (2020b) Mad-x: An adapter-based framework for multi-task cross-lingual transfer. In: Proceedings of the 2020 Conference on empirical methods in natural language processing (EMNLP), pp 7654–7673

  32. Pfeiffer J, Kamath A, Rücklé A, et al (2021) Adapterfusion: non-destructive task composition for transfer learning. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pp 487–503

  33. Rücklé A (2021) Representation learning and learning from limited labeled data for community question answering

  34. Rücklé A, Geigle G, Glockner M, et al (2021) Adapterdrop: on the efficiency of adapters in transformers. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 7930–7946

  35. Sabty C, Elmahdy M, Abdennadher S (2019) Named entity recognition on Arabic-English code-mixed data. In: 2019 IEEE 13th international conference on semantic computing (ICSC), IEEE computer society, pp 93–97

  36. Senevirathne L, Demotte P, Karunanayake B, et al (2020) Sentiment analysis for Sinhala language using deep learning techniques. arXiv preprint arXiv:2011.07280

  37. Smith I, Thayasivam U (2019) Language detection in Sinhala-English code-mixed data. In: 2019 International conference on Asian language processing (IALP). IEEE, pp 228–233

  38. Solorio T, Blair E, Maharjan S et al (2014) Overview for the first shared task on language identification in code-switched data. EMNLP 2014:62

  39. Swami S, Khandelwal A, Singh V, et al (2018) A corpus of English-Hindi code-mixed tweets for sarcasm detection. arXiv preprint arXiv:1805.11869

  40. Toftrup M, Sørensen SA, Ciosici MR, et al (2021) A reproduction of apple’s bi-directional lstm models for language identification in short strings. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: student research workshop, pp 36–42

  41. Ünal U, Dağ H (2022) Anomalyadapters: parameter-efficient multi-anomaly task detection. IEEE Access

  42. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  43. Vilares D, Alonso MA, Gómez-Rodríguez C (2016) En-es-cs: an English-Spanish code-switching twitter corpus for multilingual sentiment analysis. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16), pp 4149–4153

  44. Wang X, Tsvetkov Y, Ruder S et al (2021) Efficient test time adapter ensembling for low-resource language varieties. Find Assoc Comput Linguist EMNLP 2021:730–737

  45. Yadav S, Chakraborty T (2020) Unsupervised sentiment analysis for code-mixed data. arXiv preprint arXiv:2001.11384

Acknowledgements

The authors would like to acknowledge the SRC funding.

Funding

This study was funded by a Senate Research Committee (SRC) Grant of the University of Moratuwa, Sri Lanka.

Author information

Contributions

S.R. was involved in the research idea, supervision, and manuscript reviewing. H.R., J.S., and R.R. contributed to the literature review, system implementation, evaluation, and preparation of the initial manuscript.

Corresponding author

Correspondence to Himashi Rathnayake.

Ethics declarations

Conflict of interest

The four authors of this paper have no conflict of interest to disclose.

Ethical approval

Ethical approval is not applicable. Data annotators were informed about the task prior to the work and were paid according to institution-approved rates.

Consent for publication

We would like to confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere. All the authors give their consent to publish the manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A CMI calculation

The following equation, from Das and Gambäck [10], calculates the Code-Mixing Index (CMI) at the utterance level.

$$\mathrm{CMI} = \begin{cases} 100 \times \left[ 1 - \frac{\max(w_i)}{n - u} \right] & : n > u \\ 0 & : n = u \end{cases}$$

In this equation, \(w_i\) is the number of words from language \(i\), so \(\max(w_i)\) is the word count of the dominant language in the CMCS utterance; \(n\) is the total number of tokens; and \(u\) is the number of tokens with language-independent tags. In our Sinhala–English corpus, we considered the Mixed, Symbol, and NameEntity tags to be language-independent.
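For concreteness, the calculation can be expressed as a short Python function. This is an illustrative sketch, not code from the paper's repository; the function name and the tag-string spellings are assumptions.

```python
from collections import Counter

# Tags treated as language-independent in the Sinhala-English corpus
# (spellings assumed from the appendix text).
LANG_INDEPENDENT = {"Mixed", "Symbol", "NameEntity"}

def cmi(tags):
    """Utterance-level Code-Mixing Index of Das and Gambäck [10].

    `tags` is the per-token language-tag sequence of one utterance.
    """
    n = len(tags)                                     # total tokens
    u = sum(tag in LANG_INDEPENDENT for tag in tags)  # language-independent tokens
    if n == u:                                        # no language-tagged tokens
        return 0.0
    counts = Counter(tag for tag in tags if tag not in LANG_INDEPENDENT)
    max_wi = max(counts.values())                     # dominant language's word count
    return 100 * (1 - max_wi / (n - u))

# Example: 3 Sinhala + 2 English tokens and 1 symbol -> 100 * (1 - 3/5) = 40.0
print(cmi(["Sinhala", "Sinhala", "English", "Sinhala", "English", "Symbol"]))
```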

Appendix B English translations of non-English texts

English translations of non-English texts used in this paper are given in Table 19.

Table 19 English translations of non-English texts

About this article

Cite this article

Rathnayake, H., Sumanapala, J., Rukshani, R. et al. Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification. Knowl Inf Syst 64, 1937–1966 (2022). https://doi.org/10.1007/s10115-022-01698-1
