Abstract
Bilingual individuals already outnumber monolinguals, yet most of the available resources for research in natural language processing (NLP) are for high-resource single languages. A recent area of interest in NLP research for low-resource languages is code-switching, a phenomenon in both written and spoken communication marked by the use of at least two languages in one utterance. This work presented two novel contributions to NLP research for low-resource languages. First, it introduced the first sentiment-annotated corpus of Filipino-English Reviews with Code-Switching (FiReCS), with more than 10k instances of product and service reviews. Second, it developed sentiment analysis models for Filipino-English text using pre-trained Transformers-based large language models (LLMs) and introduced benchmark results for zero-shot sentiment analysis on text with code-switching using OpenAI’s GPT-3 series models. The performance of the Transformers-based sentiment analysis models was compared against that of existing lexicon-based sentiment analysis tools designed for monolingual text. The fine-tuned XLM-RoBERTa model achieved the highest accuracy and weighted average F1-score of 0.84, with F1-scores of 0.89, 0.86, and 0.78 in the Positive, Negative, and Neutral sentiment classes, respectively. The poor performance of the lexicon-based sentiment analysis tools exemplifies the limitations of systems designed for a single language when they are applied to bilingual text involving code-switching.
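As a rough illustration of the metrics reported above, the following minimal sketch computes accuracy, per-class F1-scores, and the support-weighted average F1-score for a three-class sentiment task. The toy labels and the function names are hypothetical and are not drawn from FiReCS or from the authors' evaluation code; only the class names follow the abstract.

```python
from collections import Counter

LABELS = ["Positive", "Negative", "Neutral"]


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, labels=LABELS):
    """F1 = 2*TP / (2*TP + FP + FN), computed one class at a time."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Average of per-class F1, weighted by each class's share of gold labels."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred, labels)
    return sum(f1[label] * support[label] for label in labels) / len(y_true)


# Hypothetical toy data (NOT from the corpus), for illustration only.
y_true = ["Positive", "Positive", "Positive", "Negative", "Negative", "Neutral"]
y_pred = ["Positive", "Positive", "Neutral", "Negative", "Negative", "Neutral"]

acc = accuracy(y_true, y_pred)
f1_by_class = per_class_f1(y_true, y_pred)
f1_weighted = weighted_f1(y_true, y_pred)

print(f"accuracy:    {acc:.2f}")
print(f"weighted F1: {f1_weighted:.2f}")
for name, score in f1_by_class.items():
    print(f"F1 ({name}): {score:.2f}")
```

Because the weighted average scales each class's F1-score by its number of gold instances, a weaker score on a smaller class lowers the overall figure less than the same score on a larger class would.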
Notes
1. The data set is made available to the research community through the following link: https://huggingface.co/datasets/ccosme/FiReCS.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Cosme, C.J., De Leon, M.M. (2024). Sentiment Analysis of Code-Switched Filipino-English Product and Service Reviews Using Transformers-Based Large Language Models. In: Iglesias, A., Shin, J., Patel, B., Joshi, A. (eds) Proceedings of World Conference on Information Systems for Business Management. ISBM 2023. Lecture Notes in Networks and Systems, vol 834. Springer, Singapore. https://doi.org/10.1007/978-981-99-8349-0_11
DOI: https://doi.org/10.1007/978-981-99-8349-0_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8348-3
Online ISBN: 978-981-99-8349-0
eBook Packages: Intelligent Technologies and Robotics (R0)