Abstract
Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such excerpts accurately is often challenging due to intrinsic aspects of language, such as irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is a key NLP field nowadays, having advanced significantly over the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, literature coverage of embedding generation for Brazilian Portuguese texts is scarce, especially for commercial user reviews. Therefore, this work provides a comprehensive experimental study of embedding approaches targeting binary sentiment classification of user reviews in Brazilian Portuguese. This study ranges from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models. The methods are evaluated on five open-source databases with pre-defined data partitions, made available in an open digital repository to encourage reproducibility. The fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative rankings varied depending on the database under analysis.
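As a concrete illustration of the classical end of this spectrum, a Bag-of-Words representation weighted by TF-IDF can be sketched in a few lines of pure Python. This is a minimal, hypothetical example (whitespace tokenizer, smoothed IDF); it does not reproduce the paper's actual preprocessing or evaluation pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw documents into TF-IDF vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency, as in common TF-IDF variants.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Term frequency (normalized by document length) times IDF.
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors

# Two toy Portuguese reviews, one positive and one negative.
docs = ["produto excelente recomendo", "produto ruim nao recomendo"]
vocab, vecs = tfidf_vectors(docs)
```

The resulting fixed-length vectors can feed any standard classifier (e.g., logistic regression), which is the usual role of such embeddings in a sentiment-classification pipeline.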
Data availability
The databases used to train and evaluate the models are available at www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Souza, F.D., Filho, J.B.d.O.e.S. Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers. Neural Comput & Applic 35, 9393–9406 (2023). https://doi.org/10.1007/s00521-022-08068-6