1 Introduction

In recent years, the use of social media networks, such as TwitterFootnote 1, has grown exponentially. It is estimated that about 500 million tweets – the short informal messages sent by Twitter users – are published daily.Footnote 2 Unlike other text styles, tweets exhibit an informal linguistic style, misspelled words, careless use of grammar, URL links, user mentions, hashtags, and more. Due to these inherent characteristics, discovering patterns from tweets represents both a challenge and an opportunity for machine learning and natural language processing (NLP) tasks, such as sentiment analysis.

Sentiment analysis is the field of study that analyzes people’s opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes expressed in written text (Liu 2020). Usually, the sentiment analysis task is reduced to polarity classification, i.e., determining whether a piece of text carries a positive or negative connotation. One of the biggest challenges in the sentiment classification of tweets is that people often express their sentiments and opinions using a casual linguistic style, resulting in misspelled words and careless use of grammar. Consequently, the automated analysis of tweets’ content requires machines to build a deep understanding of natural text to deal effectively with its informal structure (Pathak et al. 2020). However, before discovering patterns from text, a more fundamental step must be addressed: how automatic methods can numerically represent textual content.

Vector space models (VSMs) (Salton et al. 1975) are one of the earliest and most common strategies adopted in the text classification literature to allow machines to deal with texts and their structures. The VSM represents each document in a corpus as a point in a vector space. Points that are close together in this space are semantically similar, and points that are far apart are semantically distant (Turney and Pantel 2010). The first VSM approaches are count-based methods, such as Bag-of-Words (BoW) and BoW with TF-IDF (Term Frequency-Inverse Document Frequency) (Manning et al. 2008). Although VSMs have been extensively used in the literature, they suffer from the curse of dimensionality. More clearly, considering the inherent characteristics of tweets, a corpus of tweets may contain different spellings for each unique word, leading to an extensive vocabulary and making the vector representation of those tweets very large and often sparse.

To tackle the curse of dimensionality inherent to BoW-based approaches, in recent years it has become standard practice to learn dense vectors to represent words and texts, the so-called embeddings. Methods such as Word2Vec (Mikolov et al. 2013), FastText (Mikolov et al. 2018), and others (Agrawal et al. 2018; Felbo et al. 2017; Tang et al. 2014; Xu et al. 2018) have been used with relative success to address a plethora of NLP tasks. Nevertheless, in general, the performance of such techniques is still unsatisfactory for the sentiment analysis of tweets, taking into account the dynamic vocabulary used by Twitter users to express themselves. Specifically, in tweets, ironic and sarcastic content expressed in a limited space, regularly out of context and informal, makes it even more challenging to retrieve meaning from words. Such attributes may degrade the performance of traditional word embedding methods if not handled properly. In this context, contextualized word representations have recently emerged in the literature, aiming at allowing the vector representation of words to adapt to the context in which they appear. Contextual embedding techniques, including ELMo (Peters et al. 2018) and Transformer-based autoencoder methods, such as BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and BERTweet (Nguyen et al. 2020), are built upon the concept of neural language model (Bengio et al. 2000) to capture not only complex characteristics of word usage, such as syntax and semantics, but also how word usage varies across linguistic contexts. These methods have achieved state-of-the-art results on various NLP tasks, including sentiment analysis (Adhikari et al. 2019; Akkalyoncu Yilmaz et al. 2019; Chaybouti et al. 2021; Gao et al. 2019).

Much effort in recent language modeling research is focused on scalability issues of existing word embedding methods. On this basis, inductive transfer learning strategies and pre-trained embedding models have gained importance in the literature, especially when the amount of labeled data to train a model is relatively small. Accordingly, models obtained from the aforementioned contextual embedding methods are rarely trained from scratch but are instead fine-tuned from models pre-trained on datasets with a huge amount of text (Howard and Ruder 2018; Peters et al. 2018; Radford et al. 2018). Pre-trained models reduce the use of computational resources and tend to increase the classification performance of several NLP tasks, sentiment analysis included.

Despite the successful achievements in developing efficient word representation methods in the NLP literature, there is still a gap regarding a robust evaluation of existing language models applied to the sentiment analysis of tweets. Most studies focus on evaluating those models for different NLP tasks using only a small number of datasets (Lan et al. 2020; Liu et al. 2019; Peters et al. 2018; Xu et al. 2018). In this study, our main goal is to identify appropriate embedding-based text representations for the sentiment analysis of English tweets. For this purpose, we evaluate distinct types of embeddings, including: i) static embeddings learned from generic texts (Agrawal et al. 2018; Mikolov et al. 2018, 2013; Pennington et al. 2014); ii) static embeddings learned from Twitter sentiment analysis datasets (Araque et al. 2017; Bravo-Marquez et al. 2016; Felbo et al. 2017; Pennington et al. 2014; Tang et al. 2014; Xu et al. 2018); iii) contextualized embeddings learned from Transformer-based autoencoders with generic texts, with no adjustments (Devlin et al. 2019; Liu et al. 2019); iv) contextualized embeddings learned from Transformer-based autoencoders with a dataset of tweets, with no adjustments (Nguyen et al. 2020); v) contextualized embeddings adapted to the language of tweets with a second phase of pretraining of the language model (Gururangan et al. 2020); and vi) contextualized embeddings adapted to the sentiment language of tweets with a second phase of pretraining of the language model (Gururangan et al. 2020). In all assessments, we use a representative set of twenty-two sentiment datasets (Carvalho and Plastino 2021) as input to five classifiers to evaluate the predictive performance of the embeddings. To the best of our knowledge, no previous study has conducted such a robust evaluation covering language models of several flavors and a large number of datasets. In order to identify the most appropriate text embeddings, we conduct this study to answer the following four research questions.

RQ1  Which static embeddings are the most effective in the sentiment classification of tweets? Our motivation to evaluate those models is that many state-of-the-art deep learning models require a lot of computational power, such as memory and storage. Thus, running those models locally on some devices may be difficult for mass-market applications that depend on low-cost hardware. To overcome this limitation, embeddings generated by language models can be gathered by simply looking up the embedding table to achieve a static representation of textual content. We intend to assess how these static representations work and which are the most appropriate in this context. We answer this research question by evaluating a rich set of text representations from the literature (Agrawal et al. 2018; Araque et al. 2017; Bravo-Marquez et al. 2016; Devlin et al. 2019; Felbo et al. 2017; Mikolov et al. 2018, 2013; Nguyen et al. 2020; Pennington et al. 2014; Tang et al. 2014; Xu et al. 2018; Zhu et al. 2015). To achieve a good overview of the static representations, we conduct an experimental evaluation in the sentiment analysis task with five different classifiers and 22 datasets.

RQ2  Considering state-of-the-art Transformer-based autoencoder models, which are the most effective in the sentiment classification of tweets? Regarding recent advances in language modeling, Transformer-based architectures have achieved state-of-the-art performance in many NLP tasks. Specifically, BERT (Devlin et al. 2019) is the first method that successfully uses the encoder component of the Transformer architecture (Vaswani et al. 2017) to learn contextualized embeddings from texts. Shortly after that, RoBERTa (Liu et al. 2019) was introduced by Facebook as an extension of BERT that uses an optimized training methodology. Next, BERTweet (Nguyen et al. 2020) was proposed as an alternative to RoBERTa for NLP tasks focusing on tweets. While RoBERTa was trained on traditional English texts, such as Wikipedia, BERTweet was trained from scratch using a massive corpus of 850M English tweets. In this context, to answer this research question, we conduct an experimental evaluation of the BERT, RoBERTa, and BERTweet models in the sentiment analysis task with five different classifiers and 22 datasets to obtain a comprehensive analysis of their predictive performances. By evaluating these models, we may obtain a robust overview of the Transformer-based autoencoder representations that better fit the style of tweets.

RQ3  Can a second phase of adaptive pretraining of the Transformer-based autoencoder models using a large set of English tweets improve the sentiment classification performance? One of the benefits of pre-trained language models, such as the Transformer-based models exploited in this study, is the possibility of adjusting the language model to a specific domain. We aim at assessing whether the sentiment analysis of tweets can benefit from adapting the BERT, RoBERTa, and BERTweet language models to a vast, generic, and unlabeled set of around 6.7M English tweets from distinct domains. To that end, we employed a second phase of training of the pre-trained language model using the intermediate masked language model task. Besides, considering that the adaptation procedure can be a very data-intensive task that may demand a lot of computational power, in addition to the large corpus of 6.7M tweets, we use in that process nine other samples with different sizes, varying from 500 to 1.5M tweets. We conduct an experimental evaluation with all models in the sentiment analysis task with five different classifiers and 22 datasets, as in the previous questions.

RQ4  Can Transformer-based autoencoder models benefit from a second phase of adaptive pretraining with tweets specific to sentiment analysis datasets? Although using unlabeled generic tweets to adjust a language model seems promising regarding the availability of data, we believe that the downstream sentiment task may benefit from the sentiment information contained in tweets from labeled datasets. In this context, we aim at identifying whether adjusting the language models with positive and negative tweets can boost the sentiment classification of tweets. We perform this evaluation by assessing three distinct strategies in order to simulate three real-world situations, as follows. In the first strategy, we use a specific sentiment dataset itself as the target domain dataset to adapt the language model. The second strategy simulates the case where a collection of general sentiment datasets is available to adapt the language model. In the third and last strategy, we combine the two previous situations. In short, we put together tweets from a target dataset and from a collection of sentiment datasets in the adaptation procedure. Finally, we present a comparison between the predictive performances achieved by these three evaluations and the adapted models evaluated in RQ3. As in the previous questions, we conduct the experiments with five different classifiers and 22 datasets.

In summary, given the large number of language models exploited in this study, our main contributions are: (i) a comparative study of a rich collection of publicly available static representations generated from distinct deep learning methods, with different dimensions and vocabulary sizes, and trained on various kinds of corpora; (ii) an assessment of state-of-the-art contextualized language models from the literature, that is, Transformer-based autoencoder models, including BERT, RoBERTa, and BERTweet; (iii) an evaluation of distinct strategies for adapting Transformer-based autoencoder language models; and (iv) a general comparison over static, Transformer-based autoencoder, and adapted language models, aiming at determining the most suitable ones for detecting the sentiment expressed in tweets.Footnote 3

In order to present our contributions, we organized this article as follows. Section 2 presents a literature review related to the language models examined in this study. In Sect. 3, we describe the experimental methodology we followed in the computational experiments, which are reported in Sects. 4, 5, 6, and 7, addressing the four research questions, respectively. Finally, in Sect. 8, we present the conclusions and directions for future research.

2 Literature review

Sentiment analysis is an automated process used to predict people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes (Liu 2020). Recently, sentiment analysis has been recognized as a suitcase research problem (Cambria et al. 2017), which involves solving different NLP classification sub-tasks, including sarcasm, subjectivity, and polarity detection, which is the focus of this study.

Pioneering works in the sentiment classification of tweets mainly focused on the polarity detection task, which aims at categorizing a piece of text as carrying a positive or negative connotation. For example, Go et al. (2009) define sentiment as a personal positive or negative feeling. They used unigrams as features to train different machine learning classifiers, using tweets with emoticons as training data. The unigram model, or Bag-of-Words (BoW), is the most basic representation in text classification problems.

Over the years, different techniques have been developed in NLP literature in an effort to make natural language easily processable by computers. Vector Space Models (VSMs) (Salton et al. 1975) are one of the earliest strategies used to represent the knowledge extracted from a given corpus. Earlier approaches to build VSMs are grounded on count-based methods, such as BoW with TF-IDF representation, which measures how important a word is to a document, relying on its frequency of occurrence in a corpus (Manning et al. 2008).

The BoW model, which assumes word order is not important, is based on the hypothesis that the frequencies of words in a document tend to indicate the relevance of the document to a query (Salton et al. 1975). This hypothesis expresses the belief that a column vector in a term-document matrix captures an aspect of the meaning of the corresponding document or phrase. More precisely, let X be a term-document matrix. Suppose the document collection contains n documents and m unique terms. The matrix X will then have m rows (one row for each unique term in the vocabulary) and n columns (one column for each document). Let w\(_i\) be the i-th term in the vocabulary and let d\(_j\) be the j-th document in the collection. The i-th row in X is the row vector x\(_{i:}\) and the j-th column in X is the column vector x\(_{:j}\). The row vector x\(_{i:}\) contains n elements, one element for each document, and the column vector x\(_{:j}\) contains m elements, one element for each term. If X is a simple matrix of frequencies, then the element x\(_{ij}\) in X is the frequency of the i-th term w\(_i\) in the j-th document d\(_j\) (Turney and Pantel 2010).
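
As an illustration of this construction (not the authors’ code), the following minimal sketch builds a small term-document frequency matrix with scikit-learn; the toy documents are placeholders, and CountVectorizer is transposed because it natively produces a document-term matrix.

```python
# Minimal sketch: building a term-document frequency matrix X, where rows
# are terms (w_i) and columns are documents (d_j), as defined above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "i love this phone",
    "i hate this phone",
    "such a great great movie",
]

vectorizer = CountVectorizer()
# CountVectorizer yields an n x m document-term matrix; transpose it to
# obtain the m x n term-document matrix X described in the text.
X = vectorizer.fit_transform(docs).T.toarray()
terms = vectorizer.get_feature_names_out()

for i, term in enumerate(terms):
    print(term, X[i])  # x_ij = frequency of term w_i in document d_j

# The TF-IDF variant mentioned earlier simply reweights these raw counts.
X_tfidf = TfidfVectorizer().fit_transform(docs).T
```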

Such a simple way of creating numeric representations from texts has motivated early studies on detecting the sentiment expressed in tweets (Barbosa and Feng 2010; Go et al. 2009; Pak and Paroubek 2010). However, though widely adopted, this kind of feature representation leads to the curse of dimensionality due to the large number of uncommon words tweets contain (Saif 2015).

Thus, with the revival and success of neural-based learning techniques, several methods that learn dense, real-valued, low-dimensional vectors to represent words have been proposed, such as Word2Vec (Mikolov et al. 2013), FastText (Mikolov et al. 2018), and GloVe (Pennington et al. 2014). Word2Vec (Mikolov et al. 2013) is one of the pioneering models to become popular by taking advantage of the development of neural networks over the years. Word2Vec is actually a software package composed of two distinct implementations of language models, both based on a feed-forward neural architecture, namely Continuous Bag-Of-Words (CBOW) and Skip-gram. The CBOW model aims at predicting a word given its surrounding context words. Conversely, the Skip-gram model predicts the words in the surrounding context given a target word. Both architectures consist of an input layer, a hidden layer, and an output layer. The input layer has the size of the vocabulary and encodes the context by combining the one-hot vector representations of the surrounding words of a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word obtained during the training. However, one of the main disadvantages of those models is that they usually struggle to deal with out-of-vocabulary (OOV) words, i.e., words that have not been seen in the training data before. To address this weakness, more complex approaches have been proposed, such as FastText (Mikolov et al. 2018).
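
For concreteness, the following minimal sketch trains CBOW and Skip-gram models with gensim on a toy corpus; the corpus, vector dimension, and window size are illustrative assumptions, not the settings of the pretrained embeddings evaluated later in this study.

```python
# Minimal sketch: CBOW vs. Skip-gram Word2Vec models trained with gensim.
from gensim.models import Word2Vec

tweets = [
    ["great", "game", "tonight"],
    ["worst", "game", "ever"],
    ["love", "this", "team"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 selects Skip-gram (predict the context from a word).
cbow = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["game"].shape)          # dense 100-dimensional vector
print(skipgram.wv.most_similar("game"))
```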

FastText (Mikolov et al. 2018) is based on the Skip-gram model (Mikolov et al. 2013), but it treats each word as a bag of character n-grams, which are contiguous sequences of n characters from a word, including the word itself. A dense vector is learned for each character n-gram, and the dense vector associated with a word is the sum of those representations. Thus, FastText can deal with the different morphological structures of words, covering words not seen in the training phase, i.e., OOV words. For that reason, FastText is also well suited to tweets, considering the huge number of uncommon and unique words in this kind of text.
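
A brief, hedged illustration of this OOV behavior with gensim’s FastText implementation (toy corpus and hyperparameters are assumptions):

```python
# Minimal sketch: FastText builds a vector for a word absent from training
# data by summing the vectors of its character n-grams.
from gensim.models import FastText

tweets = [
    ["great", "game", "tonight"],
    ["worst", "game", "ever"],
]

model = FastText(sentences=tweets, vector_size=100, window=5, min_count=1)

print("gameee" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["gameee"].shape)           # still gets a vector from its n-grams
```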

Going in another direction, the GloVe model (Pennington et al. 2014) attempts to make efficient use of statistics of word occurrences in a corpus to learn better word representations. Pennington et al. (2014) present a model that relies on the insight that ratios of co-occurrences, rather than raw counts, encode semantic information about pairs of words. This relationship is used to derive a suitable loss function for a log-linear model, which is then trained to maximize the similarity of every word pair, as measured by the ratios of co-occurrences. Given a probe word, the ratio can be small, large, or equal to one, depending on their correlations. This ratio gives hints on the relations between three different words. For example, given a probe word and two other words w\(_i\) and w\(_j\), if the ratio is large, the probe word is related to w\(_i\) but not to w\(_j\).

In general, methods for learning word embeddings deal well with the syntactic role of words but ignore the potential sentiment they carry. In the context of sentiment analysis, words with a similar syntactic role but opposite sentiment polarity, such as good and bad, are usually mapped to neighbouring word vectors, which is undesirable for sentiment classification. To address this issue, Tang et al. (2014) proposed the Sentiment-Specific Word Embedding model (SSWE), which encodes sentiment information in the embeddings. Specifically, they developed neural networks that incorporate the supervision from the sentiment polarity of texts in their loss function. To that end, they slide an n-gram window across a sentence and then predict the sentiment polarity of each n-gram with a shared neural network. In addition to SSWE, other methods have been proposed to improve the quality of word representations in sentiment analysis by leveraging sentiment information in the training phase, such as DeepMoji (Felbo et al. 2017), Emo2Vec (Xu et al. 2018), and EWE (Agrawal et al. 2018).

The aforementioned word embedding models have been used as standard components in most sentiment analysis methods. However, they pre-compute the representation of each word independently of the context in which it will appear. The static nature of these models results in two problems: (i) they ignore the diversity of meanings each word may have, and (ii) they struggle to capture long-term dependencies of meaning. Different from those static word embedding techniques, contextualized embeddings are not fixed, adapting the word representation to the context in which it appears. Precisely, at training time, for each word in a given input text, the learning model analyzes the context, usually using sequence-based models such as recurrent neural networks (RNNs), and adjusts the representation of the target word by looking at that context. These context-aware embeddings are actually the internal states of a deep neural network trained in a self-supervised setting. Thus, the training phase is carried out independently of the primary task on extensive unlabeled data. Depending on the sequence-based model adopted, these contextualized models can be divided into two main groups, namely RNN-based (Peters et al. 2018) and Transformer-based (Lan et al. 2020; Liu et al. 2019; Nguyen et al. 2020).

Transfer learning strategies have also emerged to improve the quality of word representation, such as ULMFit (Universal Language Model Fine-tuning) (Howard and Ruder 2018). ULMFit is an effective transfer learning method that can be applied to any NLP task, and introduces key techniques for fine-tuning a language model, consisting of three stages, described as follows. First, the language model is trained on a general-domain corpus to capture generic features of the language in different layers. Next, the full language model is fine-tuned on the target task data using discriminative fine-tuning and slanted triangular learning rates (STLR) to learn task-specific features. Lastly, the model is fine-tuned on the target task using gradual unfreezing and STLR to preserve low-level representations and to adapt high-level ones.

Fine-tuning techniques made possible the development and availability of pre-trained contextualized language models using massive amounts of data. For example, Peters et al. (2018) introduced ELMo (Embeddings from Language Models), a deep contextualized model for word representation. ELMo comprises a Bi-directional Long Short-Term Memory Recurrent Neural Network (BiLSTM) to combine a forward model, looking at the sequence in the traditional order, and a backward model, looking at the sequence in the reverse order. ELMo is composed of two BiLSTM sequence encoder layers responsible for capturing the semantics of the context. Besides, some weights are shared between the two directions of the language modeling unit, and there is also a residual connection between the LSTM layers to accommodate the deep connections without the gradient vanishing issue. ELMo also makes use of a character-based technique for computing embeddings. Therefore, it benefits from the characteristics of character-based representations to avoid OOV words.

Although ELMo is more effective than static pre-trained models, its performance may degrade when dealing with long texts, exposing a trade-off between efficient learning by gradient descent and latching on information for long periods (Bengio et al. 1994). Transformer-based language models, on the other hand, have been proposed to solve the gradient propagation problems described in Bengio et al. (1994). Compared to RNNs, which process the input sequentially, Transformers work in parallel, which brings benefits when dealing with large corpora. Moreover, while RNNs by default process the input in one direction, Transformer-based models can attend to the context of a word from distant parts of a sentence and pay attention to the part of the text that really matters, using self-attention (Vaswani et al. 2017).

The OpenAI Generative Pre-Training Transformer model (GPT) (Radford et al. 2018) is one of the first attempts to learn representations using Transformers. It encompasses only the decoder component of the Transformer architecture, with some adjustments, discarding the encoder part. Therefore, instead of having a source and a target sentence for the sequence transduction model, a single sentence is given to the decoder. GPT’s objective function targets predicting the next word given a sequence of words, as in standard language modeling. To comply with the standard language modeling task, while reading a token, GPT can only attend to previously seen tokens in the self-attention layers. This setting can be limiting for encoding sentences, since understanding a word might require processing the ones coming after it in the sentence.

Devlin et al. (2019) addressed the unidirectional nature of GPT by presenting a strategy called BERT (Bidirectional Encoder Representations from Transformers) that, as the name says, encodes sentences by looking at them in both directions. BERT is also based on the Transformer architecture but, contrary to GPT, it relies on the encoder component of that architecture. The essential improvement over GPT is that BERT provides a solution for making Transformers bidirectional by applying masked language modeling, which randomly masks some percentage of the input tokens, with the objective of predicting those masked tokens based on their context. Also, Devlin et al. (2019) use a next sentence prediction task for predicting whether two text segments follow each other. All those improvements made BERT achieve state-of-the-art results in various NLP tasks when it was published.
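
The masked language modeling objective can be probed directly through the Hugging Face fill-mask pipeline; the sketch below is illustrative only and assumes the publicly available bert-base-uncased checkpoint.

```python
# Minimal sketch: querying BERT's masked language modeling objective.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from both its left and right context.
for prediction in fill_mask("this movie was absolutely [MASK] !"):
    print(prediction["token_str"], round(prediction["score"], 3))
```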

Later, Liu et al. (2019) proposed RoBERTa (Robustly optimized BERT approach), achieving even better results than BERT. RoBERTa is an extension of BERT with some modifications, such as: (i) training the model for a longer period of time, with bigger batches, over more data, (ii) removing the next sentence prediction objective, (iii) training on longer sequences, and (iv) dynamically changing the masking pattern applied to the training data.

Recently, Nguyen et al. (2020) introduced BERTweet, an extension of RoBERTa trained from scratch with tweets. BERTweet has the same architecture as BERT, but it is trained using the RoBERTa pre-training procedure instead. BERTweet was trained on a corpus of 850M English tweets, which is a concatenation of two corpora. The first corpus contains 845M English tweets from the Twitter Stream dataset, and the second one contains 5M English tweets related to the COVID-19 pandemic. In Nguyen et al. (2020), the proposed BERTweet model outperformed RoBERTa baselines in some tasks on tweets, including sentiment analysis.

As far as we know, most studies in language modeling focus on designing new effective models to improve the predictive performance of distinct NLP tasks. For example, Devlin et al. (2019) and Liu et al. (2019) have respectively introduced BERT and RoBERTa, which achieved state-of-the-art results in many NLP tasks. Nevertheless, they did not evaluate the performance of such methods on the sentiment classification of tweets. Nguyen et al. (2020), on the other hand, used only a single generic collection of tweets when evaluating their BERTweet strategy. In this context, we carry out a robust evaluation of existing language models of distinct natures, including static representations, Transformer-based autoencoder models, and fine-tuned models, by using a significant set of 22 datasets of tweets from different domains and sizes. In the following sections, we present the assessment of such models.

3 Experimental methodology

This section presents the experimental methodology we followed in this article. We begin by describing, in Sect. 3.1, the twenty-two benchmark datasets used to evaluate the different language models we investigate in this study. In Sect. 3.2, we present the experimental protocol we followed. Then, in Sect. 3.3, we describe the computational experiments reported in Sects. 4, 5, 6, and 7.

3.1 Datasets

We used a large set of twenty-two datasetsFootnote 4 (Carvalho and Plastino 2021) to assess the effectiveness of the distinct word representation models described in Sect. 2. Table 1 summarizes the main characteristics of these datasets, namely the abbreviation we use when reporting the experimental results to save space (Abbrev. column), the domain they belong to (Domain column), the number of positive tweets (#pos. column), the proportion of positive tweets (%pos. column), the number of negative tweets (#neg. column), the proportion of negative tweets (%neg. column), and the total number of tweets (Total column).

Those datasets have been extensively used in the Twitter sentiment analysis literature, and we believe they provide a diverse scenario for evaluating embeddings of tweets in the sentiment classification task, covering a variety of domains, sizes, and class balances. For example, while the SemEval13, SemEval16, SemEval17, and SemEval18 datasets contain generic tweets, other datasets, such as iphone6, movie, and archeage, contain tweets from a particular domain. Also, the datasets vary a lot in size, with some of them containing only dozens of tweets, such as irony and sarcasm. We believe that this diverse and large collection of datasets may help draw more consistent and robust conclusions on the effectiveness of distinct language models in the sentiment analysis task.

Table 1 Characteristics of the Twitter sentiment datasets ordered by size (Total column)

3.2 Experimental protocol

To assess the effect of different kinds of word representation models in the polarity classification task, we follow the protocol of first extracting the features from the several vector-based languageFootnote 5Footnote 6Footnote 7Footnote 8 representation mechanisms (BoW, static embeddings, contextualized embeddings). Next, those features compose the input attribute space for five distinct classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), XGBoost (XGB), and Multi-layer Perceptron (MLP). We adopted scikit-learn’sFootnote 9 implementations of those machine learning algorithms. Although we used the default parameters in most cases, it is important to mention that we set the class balance parameter for SVM, LR, and RF (class_weight = balanced). Also, for LR, we set the maximum number of iterations to 500 (max_iter = 500) and the solver parameter to liblinear. Moreover, for MLP, we set the hidden layer size to 100. We aim at determining which word representation models are the most effective in Twitter sentiment analysis by leveraging different types of classifiers, thus examining how they deal with the peculiarities of each evaluated model. Furthermore, it is important to note that we do not aim at establishing the best classifier for the sentiment analysis task, which may require a specific study and additional computational experiments.
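
A minimal sketch of this classifier setup, assuming the scikit-learn and xgboost Python APIs; any parameter not mentioned above is left at its default value.

```python
# Minimal sketch: the five classifiers with the non-default settings stated above.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

classifiers = {
    "SVM": SVC(class_weight="balanced"),
    "LR": LogisticRegression(class_weight="balanced", max_iter=500, solver="liblinear"),
    "RF": RandomForestClassifier(class_weight="balanced"),
    "XGB": XGBClassifier(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,)),
}

# Each classifier is trained on the feature matrix produced by one of the
# vector-based representation mechanisms (BoW, static or contextualized embeddings).
```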

Preprocessing is the first step in many text classification problems, and the use of appropriate techniques can reduce noise, hence improving classification effectiveness (Fayyad et al. 2003). As this manuscript’s main goal is to evaluate the performance of different models of tweet representation, the preprocessing step is kept simple so that the focus remains on the word representation models and classifiers. Thus, for each tweet in a given dataset, we only replace URLs with the token someurl and user mentions with the token someuser, and lowercase all tokens.
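
A minimal sketch of this preprocessing step is shown below; the regular expressions are illustrative assumptions rather than the exact patterns used in the experiments.

```python
# Minimal sketch: replace URLs and user mentions, then lowercase.
import re

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+|www\.\S+", "someurl", tweet)  # replace URLs
    tweet = re.sub(r"@\w+", "someuser", tweet)                  # replace user mentions
    return tweet.lower()                                        # lowercase all tokens

print(preprocess("@john check this out https://t.co/abc GREAT game!"))
# someuser check this out someurl great game!
```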

In the experimental evaluation, the predictive performance of the sentiment classification is measured in terms of accuracy and \(F_1\)-macro. Precisely, for each evaluated dataset, the accuracy of the classification was computed as the ratio between the number of correctly classified tweets and the total number of tweets, following a stratified ten-fold cross-validation. \(F_1\)-macro was computed as the unweighted average of the \(F_1\)-scores for the positive and negative classes. Moreover, all experiments were performed using a Tesla P100-SXM2 GPU within the Ubuntu operating system, running on a machine with an Intel(R) Xeon(R) CPU E5-2698 v4 processor.
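
The evaluation protocol can be sketched as follows, assuming scikit-learn’s cross-validation utilities; the feature matrix X and label vector y below are random placeholders standing in for the tweet representations and polarity labels of a dataset.

```python
# Minimal sketch: stratified ten-fold cross-validation with accuracy and F1-macro.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

X = np.random.rand(200, 50)              # placeholder feature matrix
y = np.random.randint(0, 2, size=200)    # placeholder polarity labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    SVC(class_weight="balanced"), X, y,
    cv=cv, scoring=["accuracy", "f1_macro"],
)

print("accuracy:", np.mean(scores["test_accuracy"]))
print("F1-macro:", np.mean(scores["test_f1_macro"]))
```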

Lastly, as recommended by Demšar (2006), we ran the Friedman test followed by the Nemenyi post-hoc test to determine whether the differences among the results are statistically significant at a 0.05 significance level. Whenever applicable, we present the results of the statistical tests immediately below each results table. We use the symbol \(\succ \) to show that a word representation model x is significantly better than another word representation model y, so that {x} \(\succ \) {y}.
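
A hedged sketch of this statistical comparison in Python, assuming scipy for the Friedman test and the scikit-posthocs package for the Nemenyi post-hoc test; the results matrix below is a random placeholder.

```python
# Minimal sketch: Friedman test followed by the Nemenyi post-hoc test.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# rows: datasets, columns: word representation models (e.g., accuracy scores)
results = np.random.rand(22, 3)

stat, p_value = friedmanchisquare(*results.T)
if p_value < 0.05:
    # pairwise Nemenyi post-hoc comparison on the same results matrix
    print(sp.posthoc_nemenyi_friedman(results))
```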

Table 2 Characteristics of the static pretrained embeddings ordered by the number of dimensions

3.3 Computational experiments details

In the next sections, we evaluate a significant collection of vector-based word representation models, attempting to answer the research questions introduced in Sect. 1. Specifically, we conduct a comparative study of vector-based word representation models of distinct natures, including Bag-of-Words, as a classic baseline, static representations, and representations induced from Transformer-based autoencoder models, with or without a second phase of training on the intermediate masked language task, in order to acknowledge their effectiveness in the polarity classification of English tweets. These language representation models are incrementally evaluated throughout Sects. 4, 5, 6, and 7.

In Sect. 4, we begin by analyzing the predictive performance of the static representations, which include 13 pretrained embeddings from the literature, as shown in Table 2, as well as the classical BoW with the TF-IDF representation scheme. Regarding the static embeddings described in Table 2, we have selected representations trained on distinct kinds of texts (Corpus column) and built from different architectures (Architecture column), from feedforward neural networks to Transformer-based ones. The |D| and |V| columns refer to the dimension and vocabulary size of each pretrained embedding, respectively. Although the most usual way of employing embeddings trained from Transformer-based architectures is running the text through the model to obtain contextualized representations, here we first investigate how these models behave when the experimental protocol is the same as for earlier embedding models: pretrained embeddings are collected from the embedding layer and are the input of the classifiers.
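
The following sketch illustrates this static, look-up-table usage of a Transformer model with the Hugging Face transformers library; the checkpoint name and example tweet are assumptions for illustration only.

```python
# Minimal sketch: reading token vectors directly from the embedding layer,
# without running the encoder, and averaging them to represent a tweet.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
embedding_layer = model.get_input_embeddings()  # vocabulary-sized look-up table

ids = tokenizer("great game tonight", return_tensors="pt")["input_ids"]
with torch.no_grad():
    token_vectors = embedding_layer(ids).squeeze(0)  # (num_tokens, 768)
tweet_vector = token_vectors.mean(dim=0)             # average over tokens
```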

Next, in Sect. 5, we present an evaluation of state-of-the-art Transformer-based autoencoder models, including BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and BERTweet (Nguyen et al. 2020). To achieve a proper vector representation for each sentence, we first take the last four layers of the model for each token of the sentence and concatenate them, generating a 3072-dimension (4 \(\times \) 768) representation per token. Then, to build the sentence embedding, we take the average of these token vector representations. For the sake of simplicity, the Transformer-based autoencoder models assessed in this study are referred to hereafter as Transformer-based models.
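
A minimal sketch of this feature extraction procedure, assuming the Hugging Face transformers API and the bert-base-uncased checkpoint as an example:

```python
# Minimal sketch: concatenate the last four hidden layers per token and
# average over tokens to obtain a 3072-dimensional tweet embedding.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def tweet_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.hidden_states            # embedding layer + 12 encoder layers
    token_vecs = torch.cat(hidden_states[-4:], dim=-1).squeeze(0)  # (seq_len, 4 * 768)
    return token_vecs.mean(dim=0)                     # (3072,) tweet representation
```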

Lastly, in Sects. 6 and 7, we evaluate the effectiveness of adapting the aforementioned Transformer-based models regarding the intermediate masked-language task in two different ways: (i) by using a huge collection of unlabeled, or non-sentiment, tweets, and (ii) by using tweets from sentiment datasets.

In Sect. 6, regarding the non-sentiment adaptation approach, we adopted the general purpose collection of unlabeled tweets from the Edinburgh corpus (Petrović et al. 2010), which contains 97M tweets in multiple languages. Tweets written in languages other than English were discarded, resulting in a final corpus of 6.7M English tweets, which was then used to adapt BERT, RoBERTa, and BERTweet. In addition to the entire corpus of 6.7M tweets, we used nine other samples with different sizes, varying from 500 to 1.5M tweets. Specifically, we generated samples containing 500 (0.5K), 1K, 5K, 10K, 25K, 50K, 250K, 500K, and 1.5M non-sentiment tweets.

Conversely, in Sect. 7, we evaluated the sentiment adaptation procedure using positive and negative tweets from the twenty-two benchmark datasets described in Table 1. For this purpose, we used each dataset once as the target dataset, while the others were used as the source datasets. More clearly, for each assessed dataset, referred to as the target dataset, we explored three distinct strategies to adapt the masked-language model: (i) by using only the tweets from the target sentiment dataset itself, (ii) by using the tweets from the remaining 21 datasets, and (iii) by using the entire collection of tweets from the 22 datasets, including the tweets from the target dataset.

4 Evaluation of static text representations

The computational experiments conducted in this section aim at answering the research question RQ1, as follows:

RQ1. Which static embeddings are the most effective in the sentiment classification of tweets?

We answer this question by assessing the predictive power of the 13 pretrained embeddings described in Table 2. These embeddings were generated from distinct neural network architectures, with different dimensions and vocabulary sizes, and trained on various kinds of corpora. Recall that by static embeddings we mean that the features are gathered from the embedding layer working as a look-up table of tokens. In addition to the pretrained embeddings, we evaluate the BoW model with the TF-IDF representation, which is the most basic text representation used in Twitter sentiment analysis and text classification tasks in general. For every tweet representation, we take the average of the representations of all tokens in the tweet.

We begin by evaluating the predictive performance of the static representations for each classification algorithm. To limit the number of tables in the manuscript, we report the computational results in detail for SVM as an example of this evaluation.Footnote 10 Tables 3 and 4 show the results achieved by using each static representation to train an SVM classifier, in terms of classification accuracy and unweighted \(F_1\)-macro, respectively. The boldfaced values indicate the best results, and the last three lines show the total number of wins for each static representation (#wins row), as well as a ranking of the results (rank sums and position rows). Precisely, for each dataset, we assign scores, from 1.0 to 14.0, to each assessed representation (each column), in ascending order of accuracy (\(F_1\)-macro), where the score 1.0 is assigned to the representation with the highest accuracy (\(F_1\)-macro). Thus, low score values indicate better results. When two assessed representations have the same performance, we take the average of their scores; for instance, if two representations achieve the best performance, they both receive a score of 1.5 ((1+2)/2). Finally, we sum up the scores obtained on each dataset for each assessed representation to calculate the rank sums. With the rank sum of each assessed representation, we rank the rank-sum results from the best (1) to the worst (14), producing the rank position.
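
For clarity, this ranking scheme can be sketched with scipy’s rankdata, which averages tied scores by default; the accuracy matrix below is a random placeholder.

```python
# Minimal sketch: per-dataset scores, rank sums, and final rank positions.
import numpy as np
from scipy.stats import rankdata

# rows: 22 datasets, columns: the 14 assessed representations (accuracy values)
accuracies = np.random.rand(22, 14)

# score 1.0 goes to the highest accuracy on each dataset; ties share averaged scores
scores_per_dataset = rankdata(-accuracies, axis=1)
rank_sums = scores_per_dataset.sum(axis=0)   # one rank sum per representation
rank_position = rankdata(rank_sums)          # 1 = best overall, 14 = worst
```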

Table 3 Accuracies (%) achieved by evaluating the static representations using the SVM classifier
Table 4 \(F_1\)-macro scores (%) achieved by evaluating the static representations using the SVM classifier

As we can see in Tables 3 and 4, RoBERTa (RoBstatic column) achieved the best performance in nine out of the 22 datasets in terms of accuracy, in 11 out of the 22 datasets in terms of \(F_1\)-macro, and was ranked first in the overall evaluation (position row). Regarding the number of wins (#wins row), we can note that Emo2Vec and SSWE achieved the second best results, reaching the best performance in four out of the 22 datasets for both accuracy and \(F_1\)-macro. However, regarding the overall evaluation (position row), w2v-Edin and w2v-GN were ranked among the top three best static representations along with RoBERTa, in terms of accuracy. Regarding \(F_1\)-macro, the top three best static representations were RoBERTa, w2v-Edin, and BERT (BERT-static column). Finally, the Friedman test followed by the Nemenyi post-hoc test detected that the top three best representations – RoBERTa, w2v-Edin, and w2v-GN in terms of accuracy, and RoBERTa, w2v-Edin, and BERT in terms of \(F_1\)-macro – are significantly better than many of the other static representations, as shown below Tables 3 and 4. Nevertheless, there is no significant difference among them.

Tables 5 and 6 show a summary of the results by evaluating each static representation on the 22 datasets, for each classification algorithm. Each cell indicates the number of wins, the rank sums, and the rank position achieved by the related static representation (each line) used to train the corresponding classifier (each column). The Total column indicates the total number of wins, the total rank sums, and the total rank position, i.e., the sum of the rank positions presented in each cell for each assessed model. Moreover, in the total column, we underline the top three best overall results in terms of total rank position.

Table 5 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each static representation on the 22 datasets, for each classification algorithm, in terms of accuracy
Table 6 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each static representation on the 22 datasets, for each classification algorithm, in terms of \(F_1\)-macro

Regarding the overall evaluation (Total column), from Tables 5 and 6, we can see that although Emo2Vec achieved the highest total number of wins (i.e., 27 wins in terms of accuracy, and 29 wins in terms of \(F_1\)-macro), w2v-Edin was ranked as the best overall model, achieving the lowest total rank position for both accuracy (22.0) and \(F_1\)-macro (21.0). Nevertheless, considering each classifier (each column), we can note that RoBERTa achieved the best performance when used to train LR, SVM, and MLP, for both accuracy and \(F_1\)-macro. Conversely, Emo2Vec achieved the best overall results when used to train RF and XGB classifiers. Analyzing the overall results in terms of the total rank position (Total column), we observe that Emo2Vec and w2v-GN, along with w2v-Edin, are ranked as the top three best static representations. These results suggest that w2v-Edin, Emo2Vec, and w2v-GN are well-suited static representations for Twitter sentiment analysis.

In the previous evaluations, we analyzed the predictive performance achieved by each representation for one classification algorithm at a time, focusing on the individual contribution of the text representations in the performance on the final task. Next, we investigate the classification performance of the final sentiment analysis process, that is, the combination of text representation and classifier. Considering that the final classification is a combination of both representation and classifier, an appropriate choice of the classification algorithm may affect the performance of a text representation. For this purpose, we present an overall evaluation of all possible combinations of text representations and classification algorithms, examining them as pairs {text representation, classifier}. More clearly, we evaluate the classification effectiveness of 70 possible combinations of text representations and classifiers (14 \(\times \) 5) on the 22 datasets of tweets. Tables 7 and 8 present the top and the bottom ten results in terms of the average rank position, respectively. Specifically, for each dataset, we calculate a rank of the 70 combinations and then average the rank position of each combination over the 22 datasets.

From Table 7, we can note that the best overall results were achieved by using RoBERTa to train an SVM classifier, for both accuracy and \(F_1\)-macro. Also, w2v-Edin \(+\) SVM and RoBERTa \(+\) MLP appear in the top three results along with RoBERTa \(+\) SVM. From Table 8, we can notice that the RF classifier often appears among the worst results.

Table 7 Top 10 results achieved by evaluating combinations of static word representation models and classifiers
Table 8 Bottom 10 results achieved by evaluating combinations of static word representation models and classifiers

Tables 9 and 10 show a summary of the results for each text representation and classifier, respectively, from best to worst, in terms of the average rank position. As we can observe, Emo2Vec, RoBERTa, and w2v-Edin appear in the top three, being the representations that achieved the best overall performances. Among the classifiers, we can note that SVM and MLP seem to be good choices for Twitter sentiment analysis regarding the usage of static text representations. Conversely, RF achieved the worst overall performance across all evaluations.

The top three static representations identified in the previous evaluation, i.e., RoBERTa, w2v-Edin, and Emo2Vec, are very different from each other. While w2v-Edin and Emo2Vec were trained from scratch on tweets, RoBERTa was trained on traditional English texts. However, among these, RoBERTa is the only Transformer-based model, which holds state-of-the-art performance in capturing the context and semantics of terms from texts. Furthermore, regarding w2v-Edin, although it was trained with a more straightforward architecture (a feedforward neural network) compared to the others, its training parameters were optimized for the emotion detection task on tweets (Bravo-Marquez et al. 2016), which may have helped in determining the sentiment expressed in tweets.

Surprisingly, as shown in Table 9, BERTweet achieved the worst overall performance among all assessed text representations, despite having been trained with the same state-of-the-art Transformer-based architecture as RoBERTa and on tweets. One possible explanation for this behavior is that the BERTweet training procedure limits the representation of its training tweets to only 60 tokens, while RoBERTa uses a limit of 512 tokens. For that reason, we believe that the RoBERTa model is able to attach more semantic information to the tokens of its training vocabulary than BERTweet when one collects the token representations from the embedding layer.

In addition to the individual assessment of text representations and classifiers presented in Tables 9 and 10, Table 11 shows the best results achieved for each dataset. We can see that RoBERTa achieved the highest accuracies in seven out of the 22 datasets, and the highest \(F_1\)-macro scores in nine out of the 22 datasets. Furthermore, as highlighted in Table 7, RoBERTa \(+\) SVM achieved the best performances in six out of the 22 datasets in terms of accuracy, and in eight out of the 22 datasets in terms of \(F_1\)-macro.

Finally, regarding research question RQ1, we can highlight and suggest that: (i) disregarding the classification algorithms, Emo2Vec, w2v-Edin, and RoBERTa seem to be well-suited representations for determining the sentiment expressed in tweets, and (ii) considering the combination of text representations and classifiers, RoBERTa \(+\) SVM achieved the best overall performance, which may represent a good choice for Twitter sentiment analysis in hardware-restricted environments, since the cost here is mostly due to the classifier induction.

Table 9 Summary of the results for each static word representation model, from best to worst, in terms of the average rank position
Table 10 Summary of the results for each classifier, from best to worst, by evaluating the static word representations, in terms of the average rank position
Table 11 Best results achieved for each dataset by evaluating the static word representation models

5 Evaluation of the transformer-based text representations

In this section, we address the research question RQ2, as follows:

RQ2.Considering state-of-the-art Transformer-based autoencoder models, which are the most effective in the sentiment classification of tweets?

To answer that question, we conduct a thorough evaluation of the widely used BERT and RoBERTa models and of BERTweet, the BERT-based Transformer trained from scratch on tweets. These models represent a set of the most recent Transformer-based autoencoder language modeling techniques that have achieved state-of-the-art performance in many NLP tasks. While BERT is the first Transformer-based autoencoder model to appear in the literature, RoBERTa is an evolution of BERT with an improved training methodology, due to the elimination of the Next Sentence Prediction task, which may fit NLP tasks on tweets considering they are limited in size and self-contained in context. Moreover, by evaluating BERTweet we analyze the performance of a Transformer-based model trained from scratch on tweets.

In this set of experiments, we give each tweet as input to the Transformer model and concatenate its last four layers to form the token representations; the tweet representation is then the average of those token representations. Next, the representations collected from the whole dataset are given as input to the learning method together with the labels of the tweets. Finally, the learned classifier is employed to perform the evaluation. In this way, we once again follow the feature extraction plus classification strategy, but now using the contextualized embedding of each tweet.

Table 12 presents the classification results when using the SVM classifier in terms of accuracy and \(F_1\)-macro, and Table 13 shows a summary of the complete evaluation regarding all classifiers. As in the previous section, to limit the number of tables in the manuscript, we only report the computational results in detail for the SVM classifier as an example of this evaluation. From Table 12, we can note that BERTweet achieved the best results in 18 out of the 22 datasets for both accuracy and \(F_1\)-macro. Precisely, the Friedman and Nemenyi tests detected that BERTweet is significantly better than RoBERTa and BERT, while RoBERTa is better than BERT. Similarly, regarding all classifiers, Table 13 shows that BERTweet outperformed BERT and RoBERTa by a significant margin in terms of the total number of wins for both accuracy and \(F_1\)-macro.

Table 12 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating the Transformer-based language models using the SVM classifier
Table 13 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each Transformer-based model on the 22 datasets, for each classification algorithm

Next, we present an overall analysis of using the BERT, RoBERTa, and BERTweet models to train each one of the five classification algorithms, examining them as pairs {language model, classifier}. Table 14 presents the average rank position across all 15 possible combinations (3 language models \(\times \) 5 classification algorithms), from best to worst, as explained in Sect. 4. We can observe that BERTweet combined with the LR, MLP, and SVM classifiers achieved the best overall performances for both accuracy and \(F_1\)-macro. Conversely, using the Transformer-based embeddings to train RF seems to harm the classification performance.

Table 14 Overall analysis of using the Transformer-based models to train each classification algorithm, examining them as pairs {language model, classifier}, in terms of the average rank position

Tables 15 and 16 show a summary of the results for each model and classifier, respectively, from best to worst, in terms of the average rank position. From Table 15, we can see that BERTweet achieved the best overall classification effectiveness and was ranked first. Also, RoBERTa and BERT achieved comparable overall performances for both accuracy and \(F_1\)-macro. Regarding the classifiers, as shown in Table 16, MLP and LR achieved rather comparable performances and were ranked as the top two best classifiers regarding the Transformer-based models, followed by SVM, XGB, and RF.

Table 15 Summary of the results for each Transformer-based model, from best to worst, in terms of the average rank position
Table 16 Summary of the results for each classifier, from best to worst, by evaluating the Transformer-based models, in terms of the average rank position

Regarding the results achieved for each dataset, Table 17 presents the best results in terms of accuracy and \(F_1\)-macro. As we can notice, BERTweet outperformed BERT and RoBERTa in 17 out of the 22 datasets in terms of accuracy and in 18 out of the 22 datasets in terms of \(F_1\)-macro. These results may confirm that Twitter sentiment classification benefits most from contextualized language models trained from scratch on Twitter data. Unlike BERT and RoBERTa, which were trained on traditional English texts, BERTweet was trained on a huge amount of 850M tweets. This fact may have helped BERTweet in learning the specificities of tweets, such as their morphological and semantic characteristics.

Table 17 Best results achieved for each dataset by evaluating combinations of Transformer-based models and classifiers
Table 18 Percentage of vocabulary’s tokens of the Transformer-based model in the row that are also in the vocabulary’s tokens of the Transformer-based model in the column
Table 19 Top 10 results achieved for combinations of language model and classifier by evaluating the Transformer-based models and the static word representations, in terms of the average rank position
Table 20 Overall evaluation of the Transformer-based models and the static word representations, from best to worst, in terms of the average rank position

For a better understanding of the results, we present an analysis of the difference between the vocabularies embedded in the assessed models. For this purpose, Table 18 highlights the number of tokens shared between BERT, RoBERTa, and BERTweet. In other words, we show the amount of tokens (in %) embedded in the model presented in each row that are also included in the model presented in each column, i.e., the intersection between their vocabularies. For example, regarding BERT (first row), we can see that 61% of its tokens can be found in RoBERTa (second column). The information below each model name in the columns refers to their vocabulary size (number of embedded tokens). It is possible to note that only 32% of the 64K tokens from the BERTweet vocabulary (i.e., about 20K tokens) can be found in BERT. This means that, compared to BERT, BERTweet contains about 44K (\(64-20\)) specific tokens extracted from tweets. Similarly, 55% of the tokens embedded in BERTweet (i.e., about 35K tokens) can be found in RoBERTa, meaning that BERTweet holds about 29K (\(64-35\)) specific tokens from tweets that are not included in RoBERTa. As a matter of fact, analyzing the tokens embedded in BERTweet, we find some specific tokens, such as “Awww”, “hahaha”, “broo”, and other internet expressions and slang that social media users often use to express themselves. While creating representations for these tokens is straightforward in BERTweet, BERT and RoBERTa need to perform some extra steps. Specifically, when BERT and RoBERTa do not find a token in their vocabularies, they split the token into subtokens until all of them are found. For example, the token “hahaha” would be split into “ha”, “ha”, and “ha” to represent the original token. This analysis points out that this particular vocabulary, combined with a language model trained to learn the intrinsic structure of tweets, is responsible for the BERTweet language model’s best performance on tweet sentiment classification.
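
This subword splitting can be observed directly with the Hugging Face tokenizers; the sketch below is illustrative only and assumes the public bert-base-uncased and vinai/bertweet-base checkpoints (and the packages they require).

```python
# Minimal sketch: how tweet-style tokens absent from a model's vocabulary
# are split into subtokens, compared with BERTweet's tweet-specific vocabulary.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bertweet_tok = AutoTokenizer.from_pretrained("vinai/bertweet-base")

print(bert_tok.tokenize("hahaha Awww broo"))      # split into subword pieces
print(bertweet_tok.tokenize("hahaha Awww broo"))  # kept (mostly) as whole tokens
```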

Table 21 Best results achieved for each dataset by evaluating combinations of language models and classifiers, regarding Transformer-based models and static word representations

In this context, regarding RQ2, we believe BERTweet is an effective language modeling technique in distinguishing the sentiment expressed in tweets. Also, regarding the classifiers, in general, MLP and LR seem to be good choices when using Transformer-based models.

Unlike the static representations, for which we used only the embedding layer of the language models, in this section we use the whole language model: each tweet passes from the embedding layer up to the last layer before being transformed into a vector representation. To understand the benefits of using the whole language model (embedding layer plus the remaining layers), we compare the predictive performance of the Transformer-based models evaluated in this section against all the static representations assessed in Sect. 4. Table 19 presents the top ten results across all 85 possible combinations of models and classifiers (17 models \(\times \) 5 classification algorithms), and Table 20 shows an overall evaluation of the models, from best to worst, in terms of the average rank position. In addition, Table 21 shows the best results achieved for each dataset.
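
The sketch below contrasts the two feature-extraction settings. It is meant only to illustrate the difference, not to reproduce the exact pipeline of the study: the mean pooling over token vectors is an assumption, as is the use of the vinai/bertweet-base checkpoint.

```python
# Minimal sketch of the two feature-extraction settings compared here.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")
model.eval()

inputs = tokenizer("loving this new phone :)", return_tensors="pt")

with torch.no_grad():
    # (a) Static setting: only the (context-independent) embedding layer.
    static_vectors = model.get_input_embeddings()(inputs["input_ids"])
    static_repr = static_vectors.mean(dim=1)           # one vector per tweet

    # (b) Contextual setting: the tweet goes through all Transformer layers.
    outputs = model(**inputs)
    contextual_repr = outputs.last_hidden_state.mean(dim=1)

print(static_repr.shape, contextual_repr.shape)        # e.g., (1, 768) each
```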

From Tables 19 and 20, we can notice that the Transformer-based BERTweet model outperformed all other models and was ranked first in both evaluations. Also, Table 20 shows that the Transformer-based models achieved the best overall results against all static models, being ranked as the top three representations. Furthermore, from Table 21, the Transformer-based BERTweet model achieved the best overall classification effectiveness in 16 out of the 22 datasets in terms of accuracy and in 17 out of the 22 datasets in terms of \(F_1\)-macro.

These results point out that learning the language model parameters is essential for distinguishing the sentiment expressed in tweets. Static representations may lose relevant information because they ignore the diversity of meanings a word may have depending on the context in which it appears. In contrast, Transformer-based models benefit from learning how to encode the context information of a token into its embedding.

6 Adapting transformer-based models to a large collection of English tweets

In this section, we perform computational experiments to answer research question RQ3, stated as follows:

RQ3. Can a second phase of adaptive pretraining of Transformer-based autoencoder models using a large set of English tweets improve the sentiment classification performance?

To answer this research question, we evaluate the classification effectiveness of the BERT, RoBERTa, and BERTweet language models adapted with tweets from a corpus of 6.7M unlabeled tweets (referred to as generic unlabeled tweets), as described in Sect. 3.3. Precisely, we use this set of tweets to adapt the model weights using the intermediate masked language model task as the training objective, randomly masking 15% of the input tokens. We also compare the results of these adapted models against those achieved with the original weights of the Transformer-based models, as presented in Sect. 5, to analyze whether adjusting the models via a second phase of pretraining improves the predictive performance of sentiment classification.

In general, the performance of adapted models is very sensitive to the random seed (Dodge et al. 2020). For that reason, all results presented in this section are averages of three executions using different seeds (12, 34, 56).

The first part of the experiments reported in this section consists in determining whether the predictive performance of the Transformer-based models is affected by the adaptation procedure when using tweet corpora of different sizes. For this purpose, in addition to the entire Edinburgh corpus of 6,657,700 tweets (around 6.7M tweets), we used nine smaller samples of tweets of different sizes, varying from 500 to 1.5M tweets. Specifically, we generated samples containing 0.5K, 1K, 5K, 10K, 25K, 50K, 250K, 500K, and 1.5M generic unlabeled tweets. In the adaptation processes, we performed three training epochs, except for the adaptation with 6.7M tweets, for which we used one epoch, as some models, such as BERTweet, degraded with more epochs. In all adaptation processes, all layers are unfrozen. Regarding the batch size, we use the available hardware capacity of eight instances per device. We used a learning rate of 5e-5 with a linear scheduler and the Adam optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\). We also used a maximum gradient norm of 1 and no weight decay.
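
A condensed sketch of this adaptive pretraining setup with the Hugging Face Trainer is shown below; the corpus file name and the tokenization details (e.g., the 128-token truncation) are placeholders, while the optimization hyperparameters mirror the ones reported above.

```python
# Sketch of the second-phase masked language model (MLM) adaptation, assuming a
# plain-text file with one unlabeled tweet per line ("tweets.txt" is a placeholder).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments, set_seed)

set_seed(12)  # repeated with seeds 34 and 56, and the results averaged

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")  # all layers trainable

dataset = load_dataset("text", data_files={"train": "tweets.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)  # mask 15% of the tokens

args = TrainingArguments(
    output_dir="bertweet-adapted",
    num_train_epochs=3,                 # one epoch for the full 6.7M corpus
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8,
    max_grad_norm=1.0, weight_decay=0.0,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```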

Tables 22, 23, and 24 present the average classification accuracies and \(F_1\)-macro scores obtained when adapting BERT, RoBERTa, and BERTweet, respectively, with the different samples of tweets drawn from the Edinburgh corpus. As in previous sections, due to space constraints, we only report the detailed evaluation using the SVM classifier. Regarding the variance in performance across the different seeds, the mean and maximum standard deviations are 0.05% and 0.5%, respectively, for both accuracy and \(F_1\)-macro.

Table 22 Average classification accuracies and \(F_1\)-macro scores (%) achieved by adaptive pretraining of BERT with different samples of generic unlabeled tweets, using the SVM classifier
Table 23 Average classification accuracies and \(F_1\)-macro scores (%) achieved by adaptive pretraining of RoBERTa with different samples of generic unlabeled tweets, using the SVM classifier
Table 24 Average classification accuracies and \(F_1\)-macro scores (%) achieved by adaptive pretraining of BERTweet with different samples of generic unlabeled tweets, using the SVM classifier
Table 25 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-based models with different samples of unlabeled tweets in terms of accuracy
Table 26 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-based models with different samples of unlabeled tweets in terms of \(F_1\)-macro

Note that BERT (Table 22) benefited most when adapted with the sample of 250K tweets, being ranked first in the overall evaluation (position row) for both accuracy and \(F_1\)-macro. Although these results are only significantly better than those obtained by adapting BERT with 6.7M tweets, they provide some evidence that more tweets do not necessarily mean better performance for adapted models. RoBERTa (Table 23) achieved the best overall results when adapted with the sample of 1.5M tweets, for both accuracy and \(F_1\)-macro. However, these results are only significantly better than those obtained by adapting RoBERTa with the sample of 0.5K. BERTweet (Table 24), on the other hand, benefited from smaller samples, achieving the highest overall predictive performance when adapted with the sample of 25K tweets, for both accuracy and \(F_1\)-macro, being significantly better than the results achieved with samples of 0.5K, 1.5M, and 6.7M. This is an expected result, as BERTweet is already trained from scratch on tweets. Since we are adapting the language models, BERT and RoBERTa seem to require more samples to accommodate the Twitter-based vocabulary into the model weights.

Next, we analyze the overall performance of the adapted Transformer-based models for each classification algorithm. Tables 25 and 26 summarize the results in terms of accuracy and \(F_1\)-macro, respectively. Regarding the variance across the different seeds, the mean and maximum standard deviations are 0.2% and 0.7% in terms of accuracy, and 0.26% and 0.98% in terms of \(F_1\)-macro.

Interestingly, from Tables 25 and 26, we can note that when adapting a language model to fit a specific type of text, such as tweets, using large corpora does not guarantee better predictive performance. Specifically, the best overall results (Total column) were achieved when adapting the BERT, RoBERTa, and BERTweet models with samples of 250K, 50K, and 5K tweets, respectively, for both accuracy and \(F_1\)-macro.

Regarding the results achieved for each dataset, Table 27 shows the best predictive performances in terms of accuracy and \(F_1\)-macro. We can see that BERTweet achieved the best results for most datasets when the adaptive pretraining uses fewer tweets. More specifically, BERTweet adapted with samples varying from 1K to 25K tweets outperformed the other models in 14 out of the 22 datasets for both accuracy and \(F_1\)-macro.

Table 27 Best results achieved for each dataset by adapting the Transformer-based models with different samples of generic tweets

As in previous sections, we also present an overall evaluation combining all adapted models and classifiers across the 22 datasets, in terms of the average rank position. Table 28 shows the top ten results among all 150 possible combinations (3 models \(\times \) 10 samples of tweets \(\times \) 5 classification algorithms). As we can see in Table 28, adapted BERTweet embeddings achieved the best overall performances when used to train LR, MLP, and SVM, dominating the top ten results. Also, note that when using LR, MLP, and SVM, BERTweet outperformed all other models when adapted with samples containing 50K tweets or less.

Tables 29 and 30 show, respectively, the top ten results among all adapted models and a summary of the results for each classifier, from best to worst, in terms of the average rank position. From Table 29, we can notice that the ten BERTweet adapted models (0.5K, 1K, 5K, 10K, 25K, 50K, 250K, 500K, 1.5M, and 6.7M) occupy all of the top ten positions. Consequently, neither BERT nor RoBERTa appears in the top results, even when adapted with the entire corpus of 6.7M tweets. RoBERTa first appears at position 24 of the accuracy ranking, with an average rank of 37.02, when adapted with 50K tweets and combined with the MLP classifier, and at position 28 of the \(F_1\)-macro ranking, with an average rank of 37.27, when adapted with 50K tweets and combined with the LR classifier. BERT first appears at position 56 of the accuracy ranking, with an average rank of 66.05, when adapted with 1.5M tweets and combined with the MLP classifier, and at position 51 of the \(F_1\)-macro ranking, with an average rank of 60.77, when adapted with 6.7M tweets and combined with the LR classifier. Among the classifiers, as we can see in Table 30, MLP and LR achieved the best predictive performances and were ranked as the top two classifiers. Conversely, RF was ranked as the worst classifier.

Table 28 Top 10 results achieved for combinations of Transformer-based models and classifiers by adapting the Transformer-based models with different samples of generic tweets
Table 29 Top 10 results achieved by adapting the Transformer-based models with different samples of generic tweets, in terms of the average rank position
Table 30 Summary of the results for each classifier, from best to worst, by adapting the Transformer-based models with samples of generic tweets, in terms of the average rank position

From all the previous evaluations, we can note that as the size of the samples increases, the adaptation procedure seems to become less effective. This may be due to the adjustment of the weights of the models’ layers during back-propagation. Considering that the adaptation procedure consists in unfreezing the entire pre-trained model and adjusting its weights with the new data, the original model and the semantic and syntactic knowledge learned in its layers are changed. In that case, we believe that after some training iterations the adjustment of the weights starts to damage the original knowledge embedded in the models’ layers. This may further explain why BERTweet achieved better classification performance with smaller samples of tweets than BERT and RoBERTa. Our hypothesis is that, since the weights in BERTweet’s layers are already adjusted to fit the tweets’ language style, using more data to adapt the model amounts to merely continuing its initial training, and too much additional data may harm the learned weights. Thus, we suggest that, when employing adaptive pretraining in Transformer-based models such as BERT, RoBERTa, and BERTweet, samples of different sizes should be exploited instead of simply adopting a dataset with a massive number of instances.
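
As an illustration of what unfreezing means in practice, the sketch below marks every parameter of a pre-trained model as trainable, which is the setting used in these experiments; the commented-out lines show how a subset of layers could instead be kept frozen to preserve more of the original knowledge, an alternative not explored here. The parameter name prefixes are assumptions about the checkpoint’s internal naming.

```python
# Illustration of "unfreezing the entire model": every weight is updated during
# the adaptive pretraining.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

for name, param in model.named_parameters():
    param.requires_grad = True   # fully unfrozen (the setting used here)
    # Hypothetical alternative: keep the embeddings and the first encoder layer fixed.
    # if name.startswith(("roberta.embeddings", "roberta.encoder.layer.0.")):
    #     param.requires_grad = False
```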

Additionally, we present a comparison of all adapted Transformer-based models against their original versions. Tables 31, 32, and 33 report this comparison in terms of the average rank position for BERT, RoBERTa, and BERTweet, respectively. We can see that the adapted versions achieved meaningful predictive performances compared to their original models, which indicates that adaptive pretraining strategies can boost classification performance in Twitter sentiment analysis. Moreover, from Tables 31 and 32, we note that the adapted versions of BERT and RoBERTa benefited most from samples containing a large number of tweets. Conversely, as pointed out before, BERTweet achieved better overall performances when using smaller samples, as shown in Table 33.

Table 31 Comparison among all adapted BERT models and BERT’s original version (no adaptation), in terms of the average rank position
Table 32 Comparison among all adapted RoBERTa models and RoBERTa’s original version (no adaptation), in terms of the average rank position
Table 33 Comparison among all adapted BERTweet models and BERTweet’s original version (no adaptation), in terms of the average rank position

Addressing research question RQ3, we observed that adaptive pretraining of Transformer-based models improves classification effectiveness in Twitter sentiment analysis. Nevertheless, using large sets of tweets does not guarantee better predictive performance, particularly for models trained from scratch on tweets, such as BERTweet, which benefited most from samples containing 50K tweets or less. Furthermore, regarding the classifiers, MLP and LR seem, in general, to be good choices to be employed after extracting features from adapted Transformer-based models.

7 Adapting transformer-based models to sentiment datasets

The experiments conducted in this section aim at answering the research question RQ4, stated as follows:

RQ4. Can Transformer-based autoencoder models benefit from a second phase of adaptive pretraining with tweets from sentiment analysis datasets?

We address this research question by evaluating whether the sentiment classification of tweets benefits from adapting the language models to tweets from sentiment analysis datasets. For this purpose, we use the same collection of 22 benchmark datasets presented in Sect. 3.1 (Table 1). We perform this evaluation by assessing three distinct strategies that simulate three real-world scenarios. In addition, as in Sect. 6, all experiments were run three times using different seeds (12, 34, 56) with the same hyperparameters, and we report the average results.

The first adaptation strategy we investigate, referred to as InData, simulates using the target sentiment dataset itself as the new domain data to adapt a pre-trained language model. Precisely, each of the 22 datasets is used once as the target dataset, following a 10-fold cross-validation procedure. In each of the ten executions, the tweets from nine folds are used as the source data (i.e., the training data) to adjust the language model, which is then validated on the remaining fold (i.e., the test data).

The second strategy, referred to as LOO (Leave One dataset Out), simulates the situation where a collection of general sentiment datasets is available to adapt the language model. We use each dataset once as the target dataset, while the tweets from the remaining 21 datasets are combined to adjust the language model. Although these sentiment datasets contain a label for each tweet, the labels are not used in the adaptation process, as we leverage the intermediate self-supervised masked language model task to tune the network parameters.

The third and last strategy, referred to as AllData, is a combination of the other two. Specifically, as in strategy InData, for each assessed dataset (target dataset) and for each of the ten executions of the 10-fold cross-validation procedure, we combine the tweets from the nine training folds (i.e., the training data of the target dataset) with the tweets from the remaining 21 datasets to adapt the language model. This last strategy evaluates the benefit of combining the tweets from the target sentiment dataset with a representative corpus of general sentiment datasets in the adaptation process.
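
The sketch below summarizes, in simplified form, how the adaptation corpus is assembled under each strategy for one target dataset and one cross-validation split; the data structures and the function name are illustrative placeholders, not the code used in the experiments.

```python
# Sketch of corpus construction for the InData, LOO, and AllData strategies.
# `datasets` is assumed to map dataset names to lists of tweet texts, and
# `train_folds` is the list of nine training folds of the target dataset.
def build_adaptation_corpus(strategy, target_name, datasets, train_folds):
    target_train = [t for fold in train_folds for t in fold]   # 9 folds of the target
    others = [t for name, tweets in datasets.items()
              if name != target_name for t in tweets]          # remaining 21 datasets

    if strategy == "InData":     # only the target dataset's training folds
        return target_train
    if strategy == "LOO":        # only the other sentiment datasets
        return others
    if strategy == "AllData":    # union of both sources
        return target_train + others
    raise ValueError(f"unknown strategy: {strategy}")
```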

Tables 34, 35, and 36 present the predictive performances achieved by adapting BERT, RoBERTa, and BERTweet, respectively, with strategies InData, LOO, and AllData, one at a time. As in previous sections, due to space constraints, we only report the detailed evaluation using the SVM classifier.

From Table 34, we can observe that, although BERT seems to benefit most from strategy InData, which uses only the target dataset itself to adjust the language model, the Friedman and the Nemenyi tests did not detect any significant differences between strategies InData, LOO, and AllData. Regarding the RoBERTa and BERTweet models (Tables 35 and 36, respectively), adapting them with strategies that combine tweets from distinct sentiment analysis corpora achieved the best results for most datasets. More clearly, AllData, which combines the tweets from the target dataset with tweets from a collection of sentiment datasets, achieved the best overall results with both RoBERTa and BERTweet. As a matter of fact, the Friedman and the Nemenyi tests indicate that strategy AllData with RoBERTa significantly outperformed strategy InData. Similarly, strategies AllData and LOO with BERTweet are significantly better than strategy InData. It is also noteworthy that smaller datasets seem to have benefited most from adapting RoBERTa and BERTweet using strategy LOO, whereas larger datasets achieved higher predictive performances when using strategy AllData to adapt these models. Tables 37 and 38 show a summary of the complete evaluation, considering all classifiers, in terms of classification accuracy and \(F_1\)-macro, respectively.
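
To make the rank-based comparison concrete, the sketch below (a simplification, not the authors’ evaluation script) computes average rank positions, the Friedman statistic, and the Nemenyi critical difference over a placeholder score matrix using NumPy and SciPy.

```python
# Sketch of the rank-based comparison used throughout this evaluation.
# The score matrix is a random placeholder standing in for per-dataset
# accuracies of the InData, LOO, and AllData strategies.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
scores = 70 + 10 * rng.random((22, 3))     # 22 datasets x 3 strategies (placeholder values)

ranks = np.vstack([rankdata(-row) for row in scores])   # rank 1 = best on each dataset
print("average rank per strategy:", ranks.mean(axis=0))

stat, p_value = friedmanchisquare(*scores.T)            # one array of scores per strategy
print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.3f}")

# Nemenyi critical difference (Demsar, 2006): CD = q_alpha * sqrt(k*(k+1) / (6*N)),
# with q_0.05 ~ 2.343 for k = 3 compared strategies.
k, n = scores.shape[1], scores.shape[0]
cd = 2.343 * np.sqrt(k * (k + 1) / (6 * n))
print(f"Nemenyi critical difference: {cd:.3f}")
```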

Table 34 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating BERT with adaptation strategies InData, LOO, and AllData using the SVM classifier
Table 35 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating RoBERTa with adaptation strategies InData, LOO, and AllData using the SVM classifier
Table 36 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating BERTweet with adaptation strategies InData, LOO, and AllData using the SVM classifier
Table 37 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-Autoencoder models using strategies InData, LOO, and AllData in terms of accuracy
Table 38 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-Autoencoder models using strategies InData, LOO, and AllData in terms of \(F_1\)-macro

Regarding the overall results achieved for each dataset, Table 39 presents the best results. We can note that, when adapting the Transformer-based models with tweets from sentiment datasets, BERTweet outperformed BERT and RoBERTa on all datasets except sarcasm (sar) and hobbit (hob). Interestingly, as mentioned before, while strategy LOO achieved the best results for smaller datasets, larger datasets seem to benefit from strategy AllData. Precisely, strategy AllData achieved the best overall performances in ten out of the 22 datasets in terms of accuracy and in 11 out of the 22 datasets in terms of \(F_1\)-macro, while strategy LOO achieved the best results in nine out of the 22 datasets for both metrics. The better performance of the AllData strategy for larger target datasets indicates that the significant amount of information present in the target dataset is indispensable for the adaptation process, whereas the limited information present in smaller target datasets contributes little, making the LOO strategy adequate for datasets with a limited number of tweets.

Conversely, strategy InData did not achieve meaningful results. The inferior performance of the InData strategy in almost all datasets shows that, regardless of the size of the dataset, the use of external and more extensive data brings more information to the adaptation process, improving the final performance.

Table 39 Best results achieved for each dataset by adapting the Transformer-based models using strategies InData, LOO, and AllData

Next, we present an overall evaluation combining all adapted models and classifiers across the 22 datasets, in terms of the average rank position. Table 40 reports the top ten results among all 45 possible combinations (3 language models \(\times \) 3 adaptation strategies \(\times \) 5 classification algorithms). We can observe that the LR classifier trained with BERTweet embeddings adapted via strategy AllData achieved the best overall predictive performance. Also, note that the adapted BERTweet embeddings with strategies AllData and LOO, combined with LR, MLP, and SVM, occupy the top of the ranking (top six results). Another point worth highlighting is that BERTweet dominates the top ten results, appearing in eight out of the ten positions in terms of accuracy and in nine out of the ten positions in terms of \(F_1\)-macro.

Tables 41 and 42 show, respectively, the results among all adapted models and a summary of the results for each classifier, from best to worst, in terms of the average rank position. Once again, from Table 41, we can notice that the BERTweet adapted models (InData, LOO, and AllData) were ranked in the top three positions. Among the classifiers, as we can see in Table 42, MLP and LR achieved the best predictive performances and were ranked as the top two classifiers. Conversely, RF was ranked as the worst classifier.

Table 40 Top 10 results achieved for combinations of Transformer-based models and classifiers by adapting the models using strategies InData, LOO, and AllData
Table 41 Comparison among all adapted Transformer-based models using strategies InData, LOO, and AllData, in terms of the average rank position
Table 42 Summary of the results for each classifier, from best to worst, by adapting the Transformer-based models using strategies InData, LOO, and AllData, in terms of the average rank position

To evaluate the effectiveness of adapting the Transformer-based models using tweets from sentiment datasets, we present a comparison among all the adaptation strategies assessed in this study for each language model. Specifically, we compare the adapted models presented in this section, by using strategies InData, LOO, and AllData, against the best adapted models identified in Sect. 6, i.e., BERT-250K, RoBERTa-50K, and BERTweet-5K. Table 43 reports these results in terms of the average rank position for BERT, RoBERTa, and BERTweet.

Regarding BERT, as shown in Table 43, all the adaptation strategies using tweets from sentiment datasets achieved better overall results than using the sample of 250K generic tweets. Moreover, strategy InData appears at the top of the ranking as the best adaptation strategy. It is worth mentioning that strategy InData uses only the tweets from the target dataset itself to adjust the language model, i.e., far fewer tweets than the 250K contained in the generic sample. On the other hand, strategy InData did not achieve meaningful results for the RoBERTa and BERTweet models. Nevertheless, for these models, strategies AllData and LOO, which also use tweets from sentiment datasets, achieved rather comparable performances and were ranked as the top two adaptation strategies.

Table 43 Comparison among the adapted models by using strategies InData, LOO, and AllData, against the best adapted models with different samples of generic tweets

To further assess the effectiveness of adapting the Transformer-based models to tweets from sentiment datasets, i.e., using strategies InData, LOO, and AllData, we present an overall comparison between these strategies and the 47 models assessed in Sects. 4, 5, and 6. Tables 44 and 45 present, respectively, the ten best and the ten worst combinations of models and classifiers, in terms of the average rank position, among all 280 combinations (56 models \(\times \) 5 classifiers). We note that BERTweet adapted with tweets from sentiment datasets and combined with LR and MLP achieved the four best results in terms of accuracy and the two best results in terms of \(F_1\)-macro. These combinations were followed by BERTweet adapted with generic tweets. More specifically, combinations with strategies AllData and LOO achieved the best overall results. Regardless of the language model, LR and MLP were the most frequent classifiers in the top ten results. Conversely, all ten worst combinations involve static representations combined with RF, which appears in every one of the worst model and classifier combinations.

Disregarding the classifiers, Tables 46 and 47 present, respectively, the top ten and the bottom ten models, comparing all 56 word representations assessed in this study (14 static representations \(+\) 3 Transformer-based models \(+\) 30 models adapted with samples of generic tweets \(+\) 9 models adapted with sentiment datasets). From Table 46, we can confirm the good performance obtained by adapting the Transformer-based models using tweets from sentiment datasets. Specifically, the BERTweet models adapted with strategies AllData and LOO appear at the top of the ranking as the two best models. We can also notice that adapting BERTweet with generic tweets improves upon the original BERTweet. Regarding the bottom ten models, from Table 47, we can see that all of them are static representations.

Lastly, regarding research question RQ4, we highlight that adapting Transformer-based models using tweets from sentiment datasets seems to boost classification performance in Twitter sentiment analysis. As a matter of fact, the strategies AllData and LOO exploited in this section, which use a collection of sentiment tweets to adjust the language model, achieved better overall results than using samples of generic unlabeled tweets. Although we do not use the labels of those tweets in the adaptation procedure, they may carry much more sentiment-related information than the tweets from the Edinburgh corpus, which originated the samples of generic unlabeled tweets used in the experiments. Furthermore, BERTweet embeddings adapted with the AllData strategy seem to be very effective in determining the sentiment expressed in tweets, especially when used to train the LR, MLP, and SVM classifiers.

Table 44 Top 10 results achieved by evaluating combinations of models and classifiers, regarding all 56 models assessed in this study
Table 45 Bottom 10 results achieved by evaluating combinations of models and classifiers, regarding all 56 models assessed in this study
Table 46 Top 10 models among the 56 word representation models assessed in this study, in terms of the average rank position
Table 47 Bottom 10 models among the 56 word representation models assessed in this study, in terms of the average rank position

8 Conclusions and future work

In this article, we presented an extensive assessment of modern and classical word representations when used for the task of Twitter sentiment analysis. Specifically, we assessed the classification performance of 14 static representations and of the most recent Transformer-based autoencoder models, including BERT, RoBERTa, and BERTweet, as well as different strategies for adapting the language representations of such models. All models were evaluated in the context of Twitter sentiment analysis using a rich set of 22 datasets and five classifiers of distinct natures. The main focus of this study was on identifying the most appropriate word representations for the sentiment analysis of English tweets.

Based on the results of the experiments performed in this study, we can highlight the following conclusions and recommendations:

  • Considering a limited computing resource scenario, where static representations could play an important role, we noticed that the Emo2Vec, w2v-Edin, and RoBERTa models seem to be well-suited static representations for determining the sentiment expressed in tweets. Although there is no significant difference between them, they are significantly better than many of the other assessed static representations. The good performance achieved by Emo2Vec and w2v-Edin indicates that being trained from scratch with tweets can boost the classification performance of static representations applied to Twitter sentiment analysis. Although RoBERTa was not trained from scratch with tweets, it is a Transformer-based autoencoder model, which holds state-of-the-art performance in several NLP tasks. Regarding the classifiers, we could see that SVM and MLP achieved the best overall performances, especially when trained with RoBERTa’s static embeddings. Nevertheless, in such a scenario, we acknowledge that there is no globally optimal language model. Therefore, when implementing a classification system, we recommend assessing the RoBERTa, Emo2Vec, and w2v-Edin language models, as well as their combinations with the SVM and MLP classifiers.

  • Regarding the Transformer-based models, we observed that BERTweet is the most appropriate language model for the sentiment classification of tweets, achieving significantly better results than RoBERTa and BERT. Specifically, the particular vocabulary tweets contain, combined with a language model trained to learn their intrinsic structure, can effectively improve the performance of the Twitter sentiment analysis task. Considering the combination of language models and classifiers, BERTweet achieved the best overall results when combined with LR and MLP. Furthermore, by comparing the Transformer-based models and the static representations, we noticed that the adaptation of the tokens’ embeddings to the context in which they appear, performed by the Transformer-based models, benefits the sentiment classification task. In this context, considering a scenario where the availability of computing resources is not an issue, we recommend BERTweet as the language model to be adopted in a Twitter sentiment classification system, with LR and MLP being reasonable choices of classifiers.

  • When adapting the Transformer-based pre-trained models to a large set of unlabeled English tweets, we noticed that, although it improves classification performance, using as many tweets as possible does not necessarily yield better results. Based on that, we presented an extensive evaluation with sets of tweets of different sizes, varying from 0.5K to 1.5M. The results show that while BERT and RoBERTa achieved better predictive performances when adapted with sets of 250K and 50K tweets, respectively, BERTweet outperformed all adapted models using only 5K tweets. Although the Friedman and the Nemenyi tests did not detect any significant difference among these results, we believe that models trained from scratch with tweets, such as BERTweet, need fewer tweets to improve their performance. Moreover, comparing all adapted models while taking the classifiers into account, BERTweet combined with MLP, LR, and SVM achieved the best overall performances. In this context, if adapting a language model is an option, with enough computing resources and a considerable amount of unlabeled English tweets at hand, we recommend evaluating the Twitter sentiment classification system with sets of tweets of different sizes, and we suggest BERTweet as the language model.

  • Analyzing the adaptation of the Transformer-based autoencoder language models with sentiment analysis datasets, i.e., with tweets that express polarity, we can see that the adapted models perform better than when adapted with generic tweets. All adaptation strategies using sentiment analysis datasets performed better than the best models adjusted with generic tweets. We conclude that it is worth adapting a Transformer-based autoencoder model using a set of sentiment tweets. Among the adaptation strategies using sentiment analysis tweets explored in this study, each Transformer model performed best with a different adjustment method. Using only the target dataset, for example, was a good option for BERT. For RoBERTa and BERTweet, combining the target dataset with a set of tweets from other datasets proved to be a good strategy for adapting the language model. In a general comparison, we noticed that BERTweet adapted with the union of the target dataset and the set of sentiment analysis tweets (BERTweet-AllData) performed better than the other adjusted models. Besides, we observed that BERTweet-AllData performed well when combined with the LR and MLP classifiers. Hence, considering a scenario where a specific dataset of English tweets carrying positive and negative polarities is available for adapting a language model, we recommend using BERTweet adapted with strategy AllData as the language model of a sentiment classification system.

  • After answering our research questions, we can briefly state that: (i) Transformer-based autoencoder models perform better than static representations; (ii) Transformer-based autoencoder models adapted to English tweets perform better than their original counterparts; and (iii) it is worth adapting a language model with tweets from sentiment analysis datasets, even one originally trained with generic English tweets. Considering all original and adapted models, the best overall performance for the English tweet sentiment analysis task was achieved by the Transformer-based autoencoder model trained from scratch with generic tweets (BERTweet) when adapted with tweets from the target sentiment dataset combined with tweets from a large set of other sentiment datasets. This strategy, called BERTweet-AllData, is our main suggestion for the sentiment classification of English tweets, especially when combined with the MLP or LR classifiers.

For future work, we plan to investigate other methods for adjusting the language models, mainly fine-tuning them with polarity classification as the downstream task. Transformer-based autoencoder pre-trained models, such as BERT, RoBERTa, and BERTweet, can have their weights adjusted to become more accurate in a specific task, such as sentiment analysis. This adjustment is made by adding an extra classification layer on top of the model and back-propagating the error of the final task through the language model’s weights. We then intend to compare the best results obtained in this study with those achieved by this task-specific category of fine-tuning.
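
As a rough illustration of this kind of task-specific fine-tuning, the sketch below adds a classification head on top of a pre-trained checkpoint and trains it end-to-end with the Hugging Face Trainer; the tiny in-memory dataset and the hyperparameters are placeholders and do not correspond to the planned experiments.

```python
# Sketch of task-specific fine-tuning: a classification layer is added on top of
# the pre-trained encoder and the polarity-classification error is back-propagated
# through all weights. The two example tweets are placeholders, not benchmark data.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=2)   # adds a randomly initialized classification head

data = Dataset.from_dict({
    "text": ["loving this new phone :)", "worst service ever ..."],
    "labels": [1, 0],
})
data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                    padding="max_length", max_length=64),
                batched=True)

args = TrainingArguments(output_dir="bertweet-sentiment", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=data).train()
```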