Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers

  • S.I.: Latin American Computational Intelligence
  • Published in: Neural Computing and Applications

Abstract

Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such excerpts accurately is often challenging due to intrinsic language aspects, such as irony and nuance. To accomplish this task, one must provide a robust numerical representation of documents, a process known as embedding. Embedding is a key NLP field that has advanced significantly in the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature on generating embeddings for Brazilian Portuguese texts is scarce, especially for commercial user reviews. Therefore, this work provides a comprehensive experimental study of embedding approaches for binary sentiment classification of user reviews in Brazilian Portuguese, covering NLP models ranging from the classical (Bag-of-Words) to the state-of-the-art (Transformer-based). The methods are evaluated on five open-source databases with pre-defined data partitions, made available in an open digital repository to encourage reproducibility. Fine-tuned TLMs achieved the best results in all cases, followed by the Feature-based TLM, LSTM, and CNN, whose relative ranks varied depending on the database under analysis.
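The classical end of the embedding spectrum the abstract mentions can be illustrated with a minimal Bag-of-Words sketch. The vocabulary and the two Brazilian Portuguese reviews below are hypothetical examples, not drawn from the paper's datasets, and the tokenization is deliberately simplified:

```python
import re
from collections import Counter

def bag_of_words(doc: str, vocabulary: list[str]) -> list[int]:
    """Map a raw document to a term-frequency vector over a fixed vocabulary."""
    tokens = re.findall(r"\w+", doc.lower())  # lowercase and strip punctuation
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and user reviews in Brazilian Portuguese.
vocabulary = ["produto", "excelente", "entrega", "ruim", "recomendo", "não"]
positive_review = "Produto excelente, recomendo!"
negative_review = "Produto ruim, não recomendo."

print(bag_of_words(positive_review, vocabulary))  # [1, 1, 0, 0, 1, 0]
print(bag_of_words(negative_review, vocabulary))  # [1, 0, 0, 1, 1, 1]
```

Each review becomes a fixed-length count vector that a downstream classifier can consume; the neural approaches studied in the paper (CNN, LSTM, TLM) replace these sparse counts with dense, learned representations.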


Data availability

The databases used to train and evaluate the models are available at www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets.

Notes

  1. www.huggingface.co/HeyLucasLeao/gpt-neo-small-portuguese.

  2. www.huggingface.co/pierreguillou/gpt2-small-portuguese.

  3. www.huggingface.co/EleutherAI/gpt-neo-125M.

  4. www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets.

  5. www.huggingface.co/docs/transformers/model_doc/auto.


Author information

Corresponding author

Correspondence to Frederico Dias Souza.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Souza, F.D., Filho, J.B.d.O.e.S. Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers. Neural Comput & Applic 35, 9393–9406 (2023). https://doi.org/10.1007/s00521-022-08068-6
