Abstract
Code-mixing and code-switching (CMCS) are frequent features in online conversations. Classification of such text is challenging if one of the languages is low-resourced. Fine-tuning pre-trained multilingual language models (PMLMs) is a promising avenue for code-mixed text classification. In this paper, we explore adapter-based fine-tuning of PMLMs for CMCS text classification. We introduce sequential and parallel stacking of adapters, continuous fine-tuning of adapters, and training adapters without freezing the original model as novel techniques with respect to single-task CMCS text classification. We also present a newly annotated dataset for the classification of Sinhala–English code-mixed and code-switched text data, where Sinhala is a low-resourced language. Our dataset of 10,000 user comments has been manually annotated for five classification tasks: sentiment analysis, humor detection, hate speech detection, language identification, and aspect identification, thus making it the first publicly available Sinhala–English CMCS dataset with the largest number of task annotation types. In addition to this dataset, we also tested our proposed techniques on Kannada–English and Hindi–English datasets. These experiments confirm that our adapter-based PMLM fine-tuning techniques outperform or are on par with the basic fine-tuning of PMLMs.
Data availability
The dataset, code, and the trained models used in this paper are publicly available (https://huggingface.co/datasets/NLPC-UOM/Sinhala-English-Code-Mixed-Code-Switched-Dataset, https://github.com/HimashiRathnayake/CMCS-Text-Classification).
Notes
Writing the pronunciation of one language in the script of another, e.g., hodama: a Sinhala word written as it is pronounced, using English characters (translation: best).
The other official language in Sri Lanka.
There is only one Hindi BERT model available, but it is an Electra model. Electra does not currently support adapters.
References
Aguilar G, Kar S, Solorio T (2020) Lince: a centralized benchmark for linguistic code-switching evaluation. In: Proceedings of the 12th language resources and evaluation conference, pp 1803–1813
Ansari MZ, Beg M, Ahmad T, et al (2021) Language identification of Hindi-English tweets using code-mixed Bert. arXiv preprint arXiv:2107.01202
Antoun W, Baly F, Achour R, et al (2020) State of the art models for fake news detection tasks. In: 2020 IEEE international conference on informatics, IoT, and enabling technologies (ICIoT). IEEE, pp 519–524
Bohra A, Vijay D, Singh V et al (2018) A dataset of Hindi-English code-mixed social media text for hate speech detection. NAACL HLT 2018:36
Chakravarthi BR, Jose N, Suryawanshi S, et al (2020) A sentiment analysis dataset for code-mixed Malayalam-English. In: Proceedings of the 1st Joint workshop on spoken language technologies for under-resourced languages (SLTU) and collaboration and computing for under-resourced languages (CCURL), pp 177–184
Chakravarthi BR, Priyadharshini R, Muralidaran V, et al (2022) Dravidiancodemix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Language Resources and Evaluation pp 1–42
Chathuranga S, Ranathunga S (2021) Classification of code-mixed text using capsule networks. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2021), pp 256–263
Chawla NV, Bowyer KW, Hall LO et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intel Res 16:321–357
Conneau A, Khandelwal K, Goyal N, et al (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451
Das A, Gambäck B (2014) Identifying languages at the word level in code-mixed indian social media text. In: Proceedings of the 11th international conference on natural language processing, pp 378–387
Dhananjaya V, Demotte P, Ranathunga S, et al (2022) Bertifying sinhala - a comprehensive analysis of pre-trained language models for sinhala text classification. In: Proceedings of the 13th language resources and evaluation conference
Friedman D, Dodge B, Chen D (2021) Single-dataset experts for multi-dataset question answering. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 6128–6137
Gundapu S, Mamidi R (2018) Word level language identification in English Telugu code mixed data. In: Proceedings of the 32nd Pacific Asia conference on language, information and computation
Hande A, Hegde SU, Priyadharshini R, et al (2021a) Benchmarking multi-task learning for sentiment analysis and offensive language identification in under-resourced Dravidian languages. arXiv preprint arXiv:2108.03867
Hande A, Puranik K, Yasaswini K, et al (2021b) Offensive language identification in low-resourced code-mixed Dravidian languages using pseudo-labeling. arXiv preprint arXiv:2108.12177
Houlsby N, Giurgiu A, Jastrzebski S, et al (2019) Parameter-efficient transfer learning for nlp. In: International conference on machine learning. PMLR, pp 2790–2799
Huertas García Á, et al (2021) Automatic information search for countering Covid-19 misinformation through semantic similarity. Master’s thesis
Kakwani D, Kunchukuttan A, Golla S et al (2020) Indicnlpsuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. Find Assoc Comput Linguist EMNLP 2020:4948–4961
Kamble S, Joshi A (2018) Hate speech detection from code-mixed Hindi-English tweets using deep learning models. arXiv preprint arXiv:1811.05145
Kazhuparambil S, Kaushik A (2020) Cooking is all about people: comment classification on cookery channels using Bert and classification models (Malayalam-English mix-code). arXiv preprint arXiv:2007.04249
Devlin J, Chang MW, Lee K, Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp 4171–4186
Khandelwal A, Swami S, Akthar SS, et al (2019) Humor detection in English-Hindi code-mixed social media content: Corpus and baseline system. In: 11th international conference on language resources and evaluation, LREC 2018, European language resources association (ELRA), pp 1203–1207
Khanuja S, Dandapat S, Srinivasan A, et al (2020) Gluecos: an evaluation benchmark for code-switched nlp. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 3575–3585
Libovickỳ J, Rosa R, Fraser A (2019) How language-neutral is multilingual Bert? arXiv preprint arXiv:1911.03310
Mathur P, Sawhney R, Ayyar M, et al (2018a) Did you offend me? Classification of offensive tweets in Hinglish language. In: Proceedings of the 2nd workshop on abusive language online (ALW2), pp 138–148
Mathur P, Shah RR, Sawhney R et al (2018b) Detecting offensive tweets in Hindi-English code-switched language. ACL 2018:18
Mave D, Maharjan S, Solorio T (2018) Language identification and analysis of code-switched social media text. ACL 2018:51
Molina G, Rey-Villamizar N, Solorio T et al (2016) Overview for the second shared task on language identification in code-switched data. EMNLP 2016:40
Ousidhoum N, Lin Z, Zhang H, et al (2019) Multilingual and multi-aspect hate speech analysis. In: EMNLP/IJCNLP (1)
Pfeiffer J, Rücklé A, Poth C, et al (2020a) Adapterhub: a framework for adapting transformers. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp 46–54
Pfeiffer J, Vulić I, Gurevych I, et al (2020b) Mad-x: An adapter-based framework for multi-task cross-lingual transfer. In: Proceedings of the 2020 Conference on empirical methods in natural language processing (EMNLP), pp 7654–7673
Pfeiffer J, Kamath A, Rücklé A, et al (2021) Adapterfusion: non-destructive task composition for transfer learning. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pp 487–503
Rücklé A (2021) Representation learning and learning from limited labeled data for community question answering
Rücklé A, Geigle G, Glockner M, et al (2021) Adapterdrop: on the efficiency of adapters in transformers. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 7930–7946
Sabty C, Elmahdy M, Abdennadher S (2019) Named entity recognition on Arabic-English code-mixed data. In: 2019 IEEE 13th international conference on semantic computing (ICSC), IEEE computer society, pp 93–97
Senevirathne L, Demotte P, Karunanayake B, et al (2020) Sentiment analysis for Sinhala language using deep learning techniques. arXiv preprint arXiv:2011.07280
Smith I, Thayasivam U (2019) Language detection in Sinhala-English code-mixed data. In: 2019 International conference on Asian language processing (IALP). IEEE, pp 228–233
Solorio T, Blair E, Maharjan S et al (2014) Overview for the first shared task on language identification in code-switched data. EMNLP 2014:62
Swami S, Khandelwal A, Singh V, et al (2018) A corpus of English-Hindi code-mixed tweets for sarcasm detection
Toftrup M, Sørensen SA, Ciosici MR, et al (2021) A reproduction of apple’s bi-directional lstm models for language identification in short strings. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: student research workshop, pp 36–42
Ünal U, Dağ H (2022) Anomalyadapters: parameter-efficient multi-anomaly task detection. IEEE Access
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Vilares D, Alonso MA, Gómez-Rodríguez C (2016) En-es-cs: an English-Spanish code-switching twitter corpus for multilingual sentiment analysis. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16), pp 4149–4153
Wang X, Tsvetkov Y, Ruder S et al (2021) Efficient test time adapter ensembling for low-resource language varieties. Find Assoc Comput Linguist EMNLP 2021:730–737
Yadav S, Chakraborty T (2020) Unsupervised sentiment analysis for code-mixed data. arXiv preprint arXiv:2001.11384
Acknowledgements
The authors would like to acknowledge the funding provided by the Senate Research Committee (SRC) of the University of Moratuwa.
Funding
This study was funded by a Senate Research Committee (SRC) Grant of the University of Moratuwa, Sri Lanka.
Author information
Contributions
S.R. was involved in the research idea, supervision and manuscript reviewing. H.R., J.S., and R.R. contributed to the literature review, system implementation, evaluation and preparation of the initial manuscript.
Ethics declarations
Conflict of interest
The four authors of this paper have no conflict of interest to disclose.
Ethical approval
Ethical approval is not applicable. Data annotators were informed about the task prior to the work and were paid according to institution-approved rates.
Consent for publication
We would like to confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere. All the authors give their consent to publish the manuscript.
Appendices
Appendix A CMI calculation
The Code-Mixing Index (CMI) can be calculated at the utterance level using the following equation (Das and Gambäck [10]).
\[
\mathrm{CMI} =
\begin{cases}
100 \times \left[ 1 - \dfrac{\max(w_i)}{n - u} \right] & \text{if } n > u \\
0 & \text{if } n = u
\end{cases}
\]
In this equation, \(\max(w_i)\) is the highest number of words present from any single language in the CMCS data, \(n\) is the total number of tokens, and \(u\) is the number of tokens with language-independent tags. In our Sinhala–English corpus, we considered Mixed, Symbol, and NameEntity tags as the language-independent tags.
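As an illustration, the utterance-level calculation can be sketched in Python. This is a minimal sketch rather than the code used in our experiments: it assumes the usual Das and Gambäck formulation, in which CMI is 100 × (1 − max(w_i)/(n − u)) when n > u and 0 otherwise, and it assumes each token carries a single language tag. The function name `code_mixing_index` and the exact tag strings are hypothetical.

```python
from collections import Counter

# Tags treated as language-independent. The three tag names follow the text
# above; their exact spelling in the dataset files is an assumption.
LANGUAGE_INDEPENDENT_TAGS = {"Mixed", "Symbol", "NameEntity"}


def code_mixing_index(tags):
    """Return the utterance-level CMI for a list of per-token language tags."""
    n = len(tags)  # total number of tokens
    # Count tokens per language, ignoring language-independent tags.
    language_counts = Counter(
        tag for tag in tags if tag not in LANGUAGE_INDEPENDENT_TAGS
    )
    u = n - sum(language_counts.values())  # language-independent tokens
    if n == u:
        # No language-specific tokens (or an empty utterance): CMI is 0.
        return 0.0
    max_w = max(language_counts.values())  # tokens of the dominant language
    return 100.0 * (1.0 - max_w / (n - u))


example = ["Sinhala", "Sinhala", "English", "Symbol"]
print(round(code_mixing_index(example), 2))  # 33.33
```

In the example, three of the four tokens are language-specific and two of them belong to the dominant language, so the CMI is 100 × (1 − 2/3). A monolingual utterance yields a CMI of 0.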
Appendix B English translations of non-English texts
English translations of non-English texts used in this paper are given in Table 19.
About this article
Cite this article
Rathnayake, H., Sumanapala, J., Rukshani, R. et al. Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification. Knowl Inf Syst 64, 1937–1966 (2022). https://doi.org/10.1007/s10115-022-01698-1
DOI: https://doi.org/10.1007/s10115-022-01698-1