Abstract
Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such excerpts accurately is often challenging due to intrinsic aspects of language, such as irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is a key NLP field nowadays, having advanced significantly over the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, literature coverage of embedding generation for Brazilian Portuguese texts is scarce, especially for commercial user reviews. Therefore, this work provides a comprehensive experimental study of embedding approaches targeting binary sentiment classification of user reviews in Brazilian Portuguese. This study ranges from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models. The methods are evaluated on five open-source databases with pre-defined data partitions, made available in an open digital repository to encourage reproducibility. The fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative rankings varied depending on the database under analysis.
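As a concrete illustration of the classical end of this spectrum, a Bag-of-Words representation weighted by TF-IDF can be sketched in a few lines of pure Python. This is a minimal, hypothetical example (whitespace tokenizer, smoothed IDF); it does not reproduce the paper's actual preprocessing or evaluation pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw documents into TF-IDF vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency, as in common TF-IDF variants.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Term frequency (normalized by document length) times IDF.
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors

# Two toy Portuguese reviews, one positive and one negative.
docs = ["produto excelente recomendo", "produto ruim nao recomendo"]
vocab, vecs = tfidf_vectors(docs)
```

The resulting fixed-length vectors can feed any standard classifier (e.g., logistic regression), which is the usual role of such embeddings in a sentiment-classification pipeline.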
Data availability
The databases used to train and evaluate the models are available at www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Souza, F.D., Filho, J.B.d.O.e.S. Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers. Neural Comput & Applic 35, 9393–9406 (2023). https://doi.org/10.1007/s00521-022-08068-6