1 Introduction

In recent years, the use of social media networks, such as TwitterFootnote 1, has grown exponentially. It is estimated that about 500 million tweets – the short informal messages sent by Twitter users – are published daily.Footnote 2 Unlike other text styles, tweets exhibit an informal linguistic style, misspelled words, careless use of grammar, URL links, user mentions, hashtags, and more. Due to these inherent characteristics, discovering patterns from tweets represents both a challenge and an opportunity for machine learning and natural language processing (NLP) tasks, such as sentiment analysis.

Sentiment analysis is the field of study that analyzes people’s opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes expressed in written text (Liu 2020). Usually, the sentiment analysis task is reduced to polarity classification, i.e., determining whether a piece of text carries a positive or negative connotation. One of the biggest challenges in the sentiment classification of tweets is that people often express their sentiments and opinions using a casual linguistic style, resulting in misspelled words and careless use of grammar. Consequently, the automated analysis of tweets’ content requires machines to build a deep understanding of natural text to deal effectively with its informal structure (Pathak et al. 2020). However, before discovering patterns from text, a more fundamental step must be addressed: how automatic methods can numerically represent textual content.

Vector space models (VSMs) (Salton et al. 1975) are one of the earliest and most common strategies adopted in the text classification literature to allow machines to deal with texts and their structures. The VSM represents each document in a corpus as a point in a vector space. Points that are close together in this space are semantically similar, and points that are far apart are semantically distant (Turney and Pantel 2010). The first VSM approaches are count-based methods, such as Bag-of-Words (BoW) and BoW with TF-IDF (Term Frequency-Inverse Document Frequency) (Manning et al. 2008). Although VSMs have been extensively used in the literature, they suffer from the curse of dimensionality. More clearly, considering the inherent characteristics of tweets, a corpus of tweets may contain different spellings for each unique word, leading to an extensive vocabulary and making the vector representation of those tweets very large and often sparse.

To tackle the curse of dimensionality inherent to BoW-based approaches, in recent years it has become standard practice to learn dense vectors to represent words and texts, the so-called embeddings. Methods such as Word2Vec (Mikolov et al. 2013), FastText (Mikolov et al. 2018), and others (Agrawal et al. 2018; Felbo et al. 2017; Tang et al. 2014; Xu et al. 2018) have been used with relative success to address a plethora of NLP tasks. Nevertheless, in general, the performance of such techniques is still unsatisfactory for the sentiment analysis of tweets, taking into account the dynamic vocabulary used by Twitter users to express themselves. Specifically, in tweets, ironic and sarcastic content expressed in a limited space, regularly out of context and informal, makes it even more challenging to retrieve meaning from words. Such attributes may degrade the performance of traditional word embedding methods if not handled properly. In this context, contextualized word representations have recently emerged in the literature, aiming at allowing the vector representation of words to adapt to the context in which they appear. Contextual embedding techniques, including ELMo (Peters et al. 2018) and Transformer-based autoencoder methods, such as BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and BERTweet (Nguyen et al. 2020), are built upon the concept of neural language model (Bengio et al. 2000) to capture not only complex characteristics of word usage, such as syntax and semantics, but also how word usage varies across linguistic contexts. These methods have achieved state-of-the-art results on various NLP tasks, including sentiment analysis (Adhikari et al. 2019; Akkalyoncu Yilmaz et al. 2019; Chaybouti et al. 2021; Gao et al. 2019).

Much effort in recent language modeling research is focused on scalability issues of existing word embedding methods. On this basis, inductive transfer learning strategies and pre-trained embedding models have gained importance in the literature, especially when the amount of labeled data to train a model is relatively small. Accordingly, models obtained from the aforementioned contextual embedding methods are rarely trained from scratch but are instead fine-tuned from models pre-trained on datasets with a huge amount of text (Howard and Ruder 2018; Peters et al. 2018; Radford et al. 2018). Pre-trained models reduce the use of computational resources and tend to increase the classification performance of several NLP tasks, sentiment analysis included.

Despite the successful achievements in developing efficient word representation methods in the NLP literature, there is still a gap regarding a robust evaluation of existing language models applied to the sentiment analysis of tweets. Most studies focus on evaluating those models for different NLP tasks using only a small number of datasets (Lan et al. 2020; Liu et al. 2019; Peters et al. 2018; Xu et al. 2018). In this study, our main goal is to identify appropriate embedding-based text representations for the sentiment analysis of English tweets. For this purpose, we evaluate distinct types of embeddings, including: i) static embeddings learned from generic texts (Agrawal et al. 2018; Mikolov et al. 2018, 2013; Pennington et al. 2014); ii) static embeddings learned from Twitter sentiment analysis datasets (Araque et al. 2017; Bravo-Marquez et al. 2016; Felbo et al. 2017; Pennington et al. 2014; Tang et al. 2014; Xu et al. 2018); iii) contextualized embeddings learned from Transformer-based autoencoders with generic texts, with no adjustments (Devlin et al. 2019; Liu et al. 2019); iv) contextualized embeddings learned from Transformer-based autoencoders with a dataset of tweets, with no adjustments (Nguyen et al. 2020); v) contextualized embeddings adapted to the language of tweets with a second phase of pretraining of the language model (Gururangan et al. 2020); and vi) contextualized embeddings adapted to the sentiment language of tweets with a second phase of pretraining of the language model (Gururangan et al. 2020). In all assessments, we use a representative set of twenty-two sentiment datasets (Carvalho and Plastino 2021) as input to five classifiers to evaluate the predictive performance of the embeddings. To the best of our knowledge, no previous study has conducted such a robust evaluation covering language models of several flavors and a large number of datasets. In order to identify the most appropriate text embeddings, we conduct this study to answer the following four research questions.

RQ1  Which static embeddings are the most effective in the sentiment classification of tweets? Our motivation to evaluate those models is that many state-of-the-art deep learning models require a lot of computational power, such as memory and storage. Thus, running those models locally on some devices may be difficult for mass-market applications that depend on low-cost hardware. To overcome this limitation, embeddings generated by language models can be gathered by simply looking up the embedding table to achieve a static representation of textual content. We intend to assess how these static representations work and which are the most appropriate in this context. We answer this research question by evaluating a rich set of text representations from the literature (Agrawal et al. 2018; Araque et al. 2017; Bravo-Marquez et al. 2016; Devlin et al. 2019; Felbo et al. 2017; Mikolov et al. 2018, 2013; Nguyen et al. 2020; Pennington et al. 2014; Tang et al. 2014; Xu et al. 2018; Zhu et al. 2015). To achieve a good overview of the static representations, we conduct an experimental evaluation in the sentiment analysis task with five different classifiers and 22 datasets.

RQ2  Considering state-of-the-art Transformer-based autoencoder models, which are the most effective in the sentiment classification of tweets? Regarding recent advances in language modeling, Transformer-based architectures have achieved state-of-the-art performance in many NLP tasks. Specifically, BERT (Devlin et al. 2019) is the first method that successfully uses the encoder component of the Transformer architecture (Vaswani et al. 2017) to learn contextualized embeddings from texts. Shortly after that, RoBERTa (Liu et al. 2019) was introduced by Facebook as an extension of BERT that uses an optimized training methodology. Next, BERTweet (Nguyen et al. 2020) was proposed as an alternative to RoBERTa for NLP tasks focusing on tweets. While RoBERTa was trained on traditional English texts, such as Wikipedia, BERTweet was trained from scratch using a massive corpus of 850M English tweets. In this context, to answer this research question, we conduct an experimental evaluation of the BERT, RoBERTa, and BERTweet models in the sentiment analysis task with five different classifiers and 22 datasets to obtain a comprehensive analysis of their predictive performances. By evaluating these models, we may obtain a robust overview of the Transformer-based autoencoder representations that better fit the style of tweets.

RQ3  Can a second phase of adaptive pretraining of the Transformer-based autoencoder models using a large set of English tweets improve the sentiment classification performance? One of the benefits of pre-trained language models, such as the Transformer-based models exploited in this study, is the possibility of adjusting the language model to a specific domain. We aim at assessing whether the sentiment analysis of tweets can benefit from adapting the BERT, RoBERTa, and BERTweet language models to a vast, generic, and unlabeled set of around 6.7M English tweets from distinct domains. To that end, we employed a second phase of training of the pre-trained language model using the intermediate masked language model task. Besides, considering that the adaptation procedure can be a very data-intensive task that may demand a lot of computational power, in addition to the large corpus of 6.7M tweets, we use in that process nine other samples with different sizes, varying from 500 to 1.5M tweets. We conduct an experimental evaluation with all models in the sentiment analysis task with five different classifiers and 22 datasets, as in the previous questions.

RQ4  Can Transformer-based autoencoder models benefit from a second phase of adaptive pretraining with tweets specific to sentiment analysis datasets? Although using unlabeled generic tweets to adjust a language model seems promising regarding the availability of data, we believe that the downstream sentiment task may benefit from the sentiment information contained in tweets from labeled datasets. In this context, we aim at identifying whether adjusting the language models with positive and negative tweets can boost the sentiment classification of tweets. We perform this evaluation by assessing three distinct strategies in order to simulate three real-world situations, as follows. In the first strategy, we use a specific sentiment dataset itself as the target domain dataset to adapt the language model. The second strategy simulates the case where a collection of general sentiment datasets is available to adapt the language model. In the third and last strategy, we combine the two previous situations. In short, we put together tweets from a target dataset and from a collection of sentiment datasets in the adaptation procedure. Finally, we present a comparison between the predictive performances achieved by these three evaluations and the adapted models evaluated in RQ3. As in the previous questions, we conduct the experiments with five different classifiers and 22 datasets.

In summary, given the large number of language models exploited in this study, our main contributions are: (i) a comparative study of a rich collection of publicly available static representations generated from distinct deep learning methods, with different dimensions and vocabulary sizes, and trained on various kinds of corpora; (ii) an assessment of state-of-the-art contextualized language models from the literature, that is, Transformer-based autoencoder models, including BERT, RoBERTa, and BERTweet; (iii) an evaluation of distinct strategies for adapting Transformer-based autoencoder language models; and (iv) a general comparison over static, Transformer-based autoencoder, and adapted language models, aiming at determining the most suitable ones for detecting the sentiment expressed in tweets.Footnote 3

In order to present our contributions, we organized this article as follows. Section 2 presents a literature review related to the language models examined in this study. In Sect. 3, we describe the experimental methodology we followed in the computational experiments, which are reported in Sects. 4, 5, 6, and 7, addressing the four research questions, respectively. Finally, in Sect. 8, we present the conclusions and directions for future research.

2 Literature review

Sentiment analysis is an automated process used to predict people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes (Liu 2020). Recently, sentiment analysis has been recognized as a suitcase research problem (Cambria et al. 2017), which involves solving different NLP classification sub-tasks, including sarcasm, subjectivity, and polarity detection, which is the focus of this study.

Pioneering works in the sentiment classification of tweets mainly focused on the polarity detection task, which aims at categorizing a piece of text as carrying a positive or negative connotation. For example, Go et al. (2009) define sentiment as a personal positive or negative feeling. They used unigrams as features to train different machine learning classifiers, using tweets with emoticons as training data. The unigram model, or Bag-of-Words (BoW), is the most basic representation in text classification problems.

Over the years, different techniques have been developed in NLP literature in an effort to make natural language easily processable by computers. Vector Space Models (VSMs) (Salton et al. 1975) are one of the earliest strategies used to represent the knowledge extracted from a given corpus. Earlier approaches to build VSMs are grounded on count-based methods, such as BoW with TF-IDF representation, which measures how important a word is to a document, relying on its frequency of occurrence in a corpus (Manning et al. 2008).

The BoW model, which assumes word order is not important, is based on the hypothesis that the frequencies of words in a document tend to indicate the relevance of the document to a query (Salton et al. 1975). This hypothesis expresses the belief that a column vector in a term-document matrix captures an aspect of the meaning of the corresponding document or phrase. More precisely, let X be a term-document matrix. Suppose the document collection contains n documents and m unique terms. The matrix X will then have m rows (one row for each unique term in the vocabulary) and n columns (one column for each document). Let w\(_i\) be the i-th term in the vocabulary and let d\(_j\) be the j-th document in the collection. The i-th row in X is the row vector x\(_{i:}\) and the j-th column in X is the column vector x\(_{:j}\). The row vector x\(_{i:}\) contains n elements, one element for each document, and the column vector x\(_{:j}\) contains m elements, one element for each term. If X is a simple matrix of frequencies, then the element x\(_{ij}\) in X is the frequency of the i-th term w\(_i\) in the j-th document d\(_j\) (Turney and Pantel 2010).
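
As an illustration of this construction (not the authors’ code), the following minimal sketch builds a small term-document frequency matrix with scikit-learn; the toy documents are placeholders, and CountVectorizer is transposed because it natively produces a document-term matrix.

```python
# Minimal sketch: building a term-document frequency matrix X, where rows
# are terms (w_i) and columns are documents (d_j), as defined above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "i love this phone",
    "i hate this phone",
    "such a great great movie",
]

vectorizer = CountVectorizer()
# CountVectorizer yields an n x m document-term matrix; transpose it to
# obtain the m x n term-document matrix X described in the text.
X = vectorizer.fit_transform(docs).T.toarray()
terms = vectorizer.get_feature_names_out()

for i, term in enumerate(terms):
    print(term, X[i])  # x_ij = frequency of term w_i in document d_j

# The TF-IDF variant mentioned earlier simply reweights these raw counts.
X_tfidf = TfidfVectorizer().fit_transform(docs).T
```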

Such a simple way of creating numeric representations from texts has motivated early studies on detecting the sentiment expressed in tweets (Barbosa and Feng 2010; Go et al. 2009; Pak and Paroubek 2010). However, though widely adopted, this kind of feature representation leads to the curse of dimensionality due to the large number of uncommon words tweets contain (Saif 2015).

Thus, with the revival and success of neural-based learning techniques, several methods that learn dense, real-valued, low-dimensional vectors to represent words have been proposed, such as Word2Vec (Mikolov et al. 2013), FastText (Mikolov et al. 2018), and GloVe (Pennington et al. 2014). Word2Vec (Mikolov et al. 2013) is one of the pioneering models to become popular by taking advantage of the development of neural networks over the years. Word2Vec is actually a software package composed of two distinct implementations of language models, both based on a feed-forward neural architecture, namely Continuous Bag-Of-Words (CBOW) and Skip-gram. The CBOW model aims at predicting a word given its surrounding context words. Conversely, the Skip-gram model predicts the words in the surrounding context given a target word. Both architectures consist of an input layer, a hidden layer, and an output layer. The input layer has the size of the vocabulary and encodes the context by combining the one-hot vector representations of the surrounding words of a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word obtained during the training. However, one of the main disadvantages of those models is that they usually struggle to deal with out-of-vocabulary (OOV) words, i.e., words that have not been seen in the training data before. To address this weakness, more complex approaches have been proposed, such as FastText (Mikolov et al. 2018).
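
For concreteness, the following minimal sketch trains CBOW and Skip-gram models with gensim on a toy corpus; the corpus, vector dimension, and window size are illustrative assumptions, not the settings of the pretrained embeddings evaluated later in this study.

```python
# Minimal sketch: CBOW vs. Skip-gram Word2Vec models trained with gensim.
from gensim.models import Word2Vec

tweets = [
    ["great", "game", "tonight"],
    ["worst", "game", "ever"],
    ["love", "this", "team"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 selects Skip-gram (predict the context from a word).
cbow = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["game"].shape)          # dense 100-dimensional vector
print(skipgram.wv.most_similar("game"))
```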

FastText (Mikolov et al. 2018) is based on the Skip-gram model (Mikolov et al. 2013), but it treats each word as a bag of character n-grams, which are contiguous sequences of n characters from a word, including the word itself. A dense vector is learned for each character n-gram, and the dense vector associated with a word is the sum of those representations. Thus, FastText can deal with the different morphological structures of words, covering words not seen in the training phase, i.e., OOV words. For that reason, FastText is also well suited to tweets, considering the huge number of uncommon and unique words in this kind of text.
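
A brief, hedged illustration of this OOV behavior with gensim’s FastText implementation (toy corpus and hyperparameters are assumptions):

```python
# Minimal sketch: FastText builds a vector for a word absent from training
# data by summing the vectors of its character n-grams.
from gensim.models import FastText

tweets = [
    ["great", "game", "tonight"],
    ["worst", "game", "ever"],
]

model = FastText(sentences=tweets, vector_size=100, window=5, min_count=1)

print("gameee" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["gameee"].shape)           # still gets a vector from its n-grams
```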

Going in another direction, the GloVe model (Pennington et al. 2014) attempts to make efficient use of statistics of word occurrences in a corpus to learn better word representations. Pennington et al. (2014) present a model that relies on the insight that ratios of co-occurrences, rather than raw counts, encode semantic information about pairs of words. This relationship is used to derive a suitable loss function for a log-linear model, which is then trained to maximize the similarity of every word pair, as measured by the ratios of co-occurrences. Given a probe word, the ratio can be small, large, or equal to one, depending on their correlations. This ratio gives hints on the relations between three different words. For example, given a probe word and two other words w\(_i\) and w\(_j\), if the ratio is large, the probe word is related to w\(_i\) but not to w\(_j\).

In general, methods for learning word embeddings deal well with the syntactic role of words but ignore the potential sentiment they carry. In the context of sentiment analysis, words with a similar syntactic role but opposite sentiment polarity, such as good and bad, are usually mapped to neighbouring word vectors, which is undesirable for sentiment classification. To address this issue, Tang et al. (2014) proposed the Sentiment-Specific Word Embedding model (SSWE), which encodes sentiment information in the embeddings. Specifically, they developed neural networks that incorporate the supervision from the sentiment polarity of texts in their loss function. To that end, they slide an n-gram window across a sentence and then predict the sentiment polarity of each n-gram with a shared neural network. In addition to SSWE, other methods have been proposed to improve the quality of word representations in sentiment analysis by leveraging sentiment information in the training phase, such as DeepMoji (Felbo et al. 2017), Emo2Vec (Xu et al. 2018), and EWE (Agrawal et al. 2018).

The aforementioned word embedding models have been used as standard components in most sentiment analysis methods. However, they pre-compute the representation of each word independently of the context in which it will appear. The static nature of these models results in two problems: (i) they ignore the diversity of meanings each word may have, and (ii) they struggle to capture long-term dependencies of meaning. Different from those static word embedding techniques, contextualized embeddings are not fixed, adapting the word representation to the context in which it appears. Precisely, at training time, for each word in a given input text, the learning model analyzes the context, usually using sequence-based models such as recurrent neural networks (RNNs), and adjusts the representation of the target word by looking at that context. These context-aware embeddings are actually the internal states of a deep neural network trained in a self-supervised setting. Thus, the training phase is carried out independently of the primary task on extensive unlabeled data. Depending on the sequence-based model adopted, these contextualized models can be divided into two main groups, namely RNN-based (Peters et al. 2018) and Transformer-based (Lan et al. 2020; Liu et al. 2019; Nguyen et al. 2020).

Transfer learning strategies have also emerged to improve the quality of word representation, such as ULMFit (Universal Language Model Fine-tuning) (Howard and Ruder 2018). ULMFit is an effective transfer learning method that can be applied to any NLP task, and introduces key techniques for fine-tuning a language model, consisting of three stages, described as follows. First, the language model is trained on a general-domain corpus to capture generic features of the language in different layers. Next, the full language model is fine-tuned on the target task data using discriminative fine-tuning and slanted triangular learning rates (STLR) to learn task-specific features. Lastly, the model is fine-tuned on the target task using gradual unfreezing and STLR to preserve low-level representations and to adapt high-level ones.

Fine-tuning techniques made possible the development and availability of pre-trained contextualized language models using massive amounts of data. For example, Peters et al. (2018) introduced ELMo (Embeddings from Language Models), a deep contextualized model for word representation. ELMo comprises a Bi-directional Long Short-Term Memory Recurrent Neural Network (BiLSTM) to combine a forward model, looking at the sequence in the traditional order, and a backward model, looking at the sequence in the reverse order. ELMo is composed of two BiLSTM sequence encoder layers responsible for capturing the semantics of the context. Besides, some weights are shared between the two directions of the language modeling unit, and there is also a residual connection between the LSTM layers to accommodate the deep connections without the gradient vanishing issue. ELMo also makes use of a character-based technique for computing embeddings. Therefore, it benefits from the characteristics of character-based representations to avoid OOV words.

Although ELMo is more effective than static pre-trained models, its performance may degrade when dealing with long texts, exposing a trade-off between efficient learning by gradient descent and latching on information for long periods (Bengio et al. 1994). Transformer-based language models, on the other hand, have been proposed to solve the gradient propagation problems described in Bengio et al. (1994). Compared to RNNs, which process the input sequentially, Transformers work in parallel, which brings benefits when dealing with large corpora. Moreover, while RNNs by default process the input in one direction, Transformer-based models can attend to the context of a word from distant parts of a sentence and pay attention to the part of the text that really matters, using self-attention (Vaswani et al. 2017).

The OpenAI Generative Pre-Training Transformer model (GPT) (Radford et al. 2018) is one of the first attempts to learn representations using Transformers. It encompasses only the decoder component of the Transformer architecture, with some adjustments, discarding the encoder part. Therefore, instead of having a source and a target sentence for the sequence transduction model, a single sentence is given to the decoder. GPT’s objective function targets predicting the next word given a sequence of words, as in standard language modeling. To comply with the standard language modeling task, while reading a token, GPT can only attend to previously seen tokens in the self-attention layers. This setting can be limiting for encoding sentences, since understanding a word might require processing the ones coming after it in the sentence.

Devlin et al. (2019) addressed the unidirectional nature of GPT by presenting a strategy called BERT (Bidirectional Encoder Representations from Transformers) that, as the name says, encodes sentences by looking at them in both directions. BERT is also based on the Transformer architecture but, contrary to GPT, it relies on the encoder component of that architecture. The essential improvement over GPT is that BERT provides a solution for making Transformers bidirectional by applying masked language modeling, which randomly masks some percentage of the input tokens, with the objective of predicting those masked tokens based on their context. Also, Devlin et al. (2019) use a next sentence prediction task for predicting whether two text segments follow each other. All those improvements made BERT achieve state-of-the-art results in various NLP tasks when it was published.
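
The masked language modeling objective can be probed directly through the Hugging Face fill-mask pipeline; the sketch below is illustrative only and assumes the publicly available bert-base-uncased checkpoint.

```python
# Minimal sketch: querying BERT's masked language modeling objective.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from both its left and right context.
for prediction in fill_mask("this movie was absolutely [MASK] !"):
    print(prediction["token_str"], round(prediction["score"], 3))
```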

Later, Liu et al. (2019) proposed RoBERTa (Robustly optimized BERT approach), achieving even better results than BERT. RoBERTa is an extension of BERT with some modifications, such as: (i) training the model for a longer period of time, with bigger batches, over more data, (ii) removing the next sentence prediction objective, (iii) training on longer sequences, and (iv) dynamically changing the masking pattern applied to the training data.

Recently, Nguyen et al. (2020) introduced BERTweet, an extension of RoBERTa trained from scratch with tweets. BERTweet has the same architecture as BERT, but it is trained using the RoBERTa pre-training procedure instead. BERTweet was trained on a corpus of 850M English tweets, which is a concatenation of two corpora. The first corpus contains 845M English tweets from the Twitter Stream dataset, and the second one contains 5M English tweets related to the COVID-19 pandemic. In Nguyen et al. (2020), the proposed BERTweet model outperformed RoBERTa baselines in some tasks on tweets, including sentiment analysis.

As far as we know, most studies in language modeling focus on designing new effective models to improve the predictive performance of distinct NLP tasks. For example, Devlin et al. (2019) and Liu et al. (2019) have respectively introduced BERT and RoBERTa, which achieved state-of-the-art results in many NLP tasks. Nevertheless, they did not evaluate the performance of such methods on the sentiment classification of tweets. Nguyen et al. (2020), on the other hand, used only a single generic collection of tweets when evaluating their BERTweet strategy. In this context, we carry out a robust evaluation of existing language models of distinct natures, including static representations, Transformer-based autoencoder models, and fine-tuned models, by using a significant set of 22 datasets of tweets from different domains and sizes. In the following sections, we present the assessment of such models.

3 Experimental methodology

This section presents the experimental methodology we followed in this article. We begin by describing, in Sect. 3.1, the twenty-two benchmark datasets used to evaluate the different language models we investigate in this study. In Sect. 3.2, we present the experimental protocol we followed. Then, in Sect. 3.3, we describe the computational experiments reported in Sects. 4, 5, 6, and 7.

3.1 Datasets

We used a large set of twenty-two datasetsFootnote 4 (Carvalho and Plastino 2021) to assess the effectiveness of the distinct word representation models described in Sect. 2. Table 1 summarizes the main characteristics of these datasets, namely the abbreviation we use when reporting the experimental results to save space (Abbrev. column), the domain they belong to (Domain column), the number of positive tweets (#pos. column), the proportion of positive tweets (%pos. column), the number of negative tweets (#neg. column), the proportion of negative tweets (%neg. column), and the total number of tweets (Total column).

Those datasets have been extensively used in the Twitter sentiment analysis literature, and we believe they provide a diverse scenario for evaluating embeddings of tweets in the sentiment classification task, covering a variety of domains, sizes, and class balances. For example, while the SemEval13, SemEval16, SemEval17, and SemEval18 datasets contain generic tweets, other datasets, such as iphone6, movie, and archeage, contain tweets from a particular domain. Also, the datasets vary a lot in size, with some of them containing only dozens of tweets, such as irony and sarcasm. We believe that this diverse and large collection of datasets may help draw more consistent and robust conclusions on the effectiveness of distinct language models in the sentiment analysis task.

Table 1 Characteristics of the Twitter sentiment datasets ordered by size (Total column)

3.2 Experimental protocol

To assess the effect of different kinds of word representation models in the polarity classification task, we follow the protocol of first extracting the features from the several vector-based languageFootnote 5Footnote 6Footnote 7Footnote 8 representation mechanisms (BoW, static embeddings, contextualized embeddings). Next, those features compose the input attribute space for five distinct classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), XGBoost (XGB), and Multi-layer Perceptron (MLP). We adopted scikit-learn’sFootnote 9 implementations of those machine learning algorithms. Although we used the default parameters in most cases, it is important to mention that we set the class balance parameter for SVM, LR, and RF (class_weight = balanced). Also, for LR, we set the maximum number of iterations to 500 (max_iter = 500) and the solver parameter to liblinear. Moreover, for MLP, we set the hidden layer size to 100. We aim at determining which word representation models are the most effective in Twitter sentiment analysis by leveraging different types of classifiers, thus examining how they deal with the peculiarities of each evaluated model. Furthermore, it is important to note that we do not aim at establishing the best classifier for the sentiment analysis task, which may require a specific study and additional computational experiments.
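
A minimal sketch of this classifier setup, assuming the scikit-learn and xgboost Python APIs; any parameter not mentioned above is left at its default value.

```python
# Minimal sketch: the five classifiers with the non-default settings stated above.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

classifiers = {
    "SVM": SVC(class_weight="balanced"),
    "LR": LogisticRegression(class_weight="balanced", max_iter=500, solver="liblinear"),
    "RF": RandomForestClassifier(class_weight="balanced"),
    "XGB": XGBClassifier(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,)),
}

# Each classifier is trained on the feature matrix produced by one of the
# vector-based representation mechanisms (BoW, static or contextualized embeddings).
```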

Preprocessing is the first step in many text classification problems, and the use of appropriate techniques can reduce noise, hence improving classification effectiveness (Fayyad et al. 2003). As this manuscript’s main goal is to evaluate the performance of different models of tweet representation, the preprocessing step is kept simple so that the focus remains on the word representation models and classifiers. Thus, for each tweet in a given dataset, we only replace URLs with the token someurl and user mentions with the token someuser, and lowercase all tokens.
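
A minimal sketch of this preprocessing step is shown below; the regular expressions are illustrative assumptions rather than the exact patterns used in the experiments.

```python
# Minimal sketch: replace URLs and user mentions, then lowercase.
import re

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+|www\.\S+", "someurl", tweet)  # replace URLs
    tweet = re.sub(r"@\w+", "someuser", tweet)                  # replace user mentions
    return tweet.lower()                                        # lowercase all tokens

print(preprocess("@john check this out https://t.co/abc GREAT game!"))
# someuser check this out someurl great game!
```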

In the experimental evaluation, the predictive performance of the sentiment classification is measured in terms of accuracy and \(F_1\)-macro. Precisely, for each evaluated dataset, the accuracy of the classification was computed as the ratio between the number of correctly classified tweets and the total number of tweets, following a stratified ten-fold cross-validation. \(F_1\)-macro was computed as the unweighted average of the \(F_1\)-scores for the positive and negative classes. Moreover, all experiments were performed using a Tesla P100-SXM2 GPU within the Ubuntu operating system, running on a machine with an Intel(R) Xeon(R) CPU E5-2698 v4 processor.
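
The evaluation protocol can be sketched as follows, assuming scikit-learn’s cross-validation utilities; the feature matrix X and label vector y below are random placeholders standing in for the tweet representations and polarity labels of a dataset.

```python
# Minimal sketch: stratified ten-fold cross-validation with accuracy and F1-macro.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

X = np.random.rand(200, 50)              # placeholder feature matrix
y = np.random.randint(0, 2, size=200)    # placeholder polarity labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    SVC(class_weight="balanced"), X, y,
    cv=cv, scoring=["accuracy", "f1_macro"],
)

print("accuracy:", np.mean(scores["test_accuracy"]))
print("F1-macro:", np.mean(scores["test_f1_macro"]))
```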

Lastly, as recommended by Demšar (2006), we ran the Friedman test followed by the Nemenyi post-hoc test to determine whether the differences among the results are statistically significant at a 0.05 significance level. Whenever applicable, we present the results of the statistical tests immediately below each results table. We use the symbol \(\succ \) to show that a word representation model x is significantly better than another word representation model y, so that {x} \(\succ \) {y}.
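
A hedged sketch of this statistical comparison in Python, assuming scipy for the Friedman test and the scikit-posthocs package for the Nemenyi post-hoc test; the results matrix below is a random placeholder.

```python
# Minimal sketch: Friedman test followed by the Nemenyi post-hoc test.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# rows: datasets, columns: word representation models (e.g., accuracy scores)
results = np.random.rand(22, 3)

stat, p_value = friedmanchisquare(*results.T)
if p_value < 0.05:
    # pairwise Nemenyi post-hoc comparison on the same results matrix
    print(sp.posthoc_nemenyi_friedman(results))
```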

Table 2 Characteristics of the static pretrained embeddings ordered by the number of dimensions

3.3 Computational experiments details

In the next sections, we evaluate a significant collection of vector-based word representation models, attempting to answer the research questions introduced in Sect. 1. Specifically, we conduct a comparative study of vector-based word representation models of distinct natures, including Bag-of-Words, as a classic baseline, static representations, and representations induced from Transformer-based autoencoder models, with or without a second phase of training on the intermediate masked language task, in order to acknowledge their effectiveness in the polarity classification of English tweets. These language representation models are incrementally evaluated throughout Sects. 4, 5, 6, and 7.

In Sect. 4, we begin by analyzing the predictive performance of the static representations, which include 13 pretrained embeddings from the literature, as shown in Table 2, as well as the classical BoW with the TF-IDF representation scheme. Regarding the static embeddings described in Table 2, we have selected representations trained on distinct kinds of texts (Corpus column) and built from different architectures (Architecture column), from feedforward neural networks to Transformer-based ones. The |D| and |V| columns refer to the dimension and vocabulary size of each pretrained embedding, respectively. Although the most usual way of employing embeddings trained from Transformer-based architectures is running the text through the model to obtain contextualized representations, here we first investigate how these models behave when the experimental protocol is the same as for earlier embedding models: pretrained embeddings are collected from the embedding layer and are the input of the classifiers.
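
The following sketch illustrates this static, look-up-table usage of a Transformer model with the Hugging Face transformers library; the checkpoint name and example tweet are assumptions for illustration only.

```python
# Minimal sketch: reading token vectors directly from the embedding layer,
# without running the encoder, and averaging them to represent a tweet.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
embedding_layer = model.get_input_embeddings()  # vocabulary-sized look-up table

ids = tokenizer("great game tonight", return_tensors="pt")["input_ids"]
with torch.no_grad():
    token_vectors = embedding_layer(ids).squeeze(0)  # (num_tokens, 768)
tweet_vector = token_vectors.mean(dim=0)             # average over tokens
```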

Next, in Sect. 5, we present an evaluation of state-of-the-art Transformer-based autoencoder models, including BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and BERTweet (Nguyen et al. 2020). To achieve a proper vector representation for each sentence, we first take the last four layers of the model for each token of the sentence and concatenate them, generating a 3072-dimension (4 \(\times \) 768) representation per token. Then, to build the sentence embedding, we take the average of these token vector representations. For the sake of simplicity, the Transformer-based autoencoder models assessed in this study are referred to hereafter as Transformer-based models.
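
A minimal sketch of this feature extraction procedure, assuming the Hugging Face transformers API and the bert-base-uncased checkpoint as an example:

```python
# Minimal sketch: concatenate the last four hidden layers per token and
# average over tokens to obtain a 3072-dimensional tweet embedding.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def tweet_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.hidden_states            # embedding layer + 12 encoder layers
    token_vecs = torch.cat(hidden_states[-4:], dim=-1).squeeze(0)  # (seq_len, 4 * 768)
    return token_vecs.mean(dim=0)                     # (3072,) tweet representation
```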

Lastly, in Sects. 6 and 7, we evaluate the effectiveness of adapting the aforementioned Transformer-based models regarding the intermediate masked-language task in two different ways: (i) by using a huge collection of unlabeled, or non-sentiment, tweets, and (ii) by using tweets from sentiment datasets.

In Sect. 6, regarding the non-sentiment adaptation approach, we adopted the general purpose collection of unlabeled tweets from the Edinburgh corpus (Petrović et al. 2010), which contains 97M tweets in multiple languages. Tweets written in languages other than English were discarded, resulting in a final corpus of 6.7M English tweets, which was then used to adapt BERT, RoBERTa, and BERTweet. In addition to the entire corpus of 6.7M tweets, we used nine other samples with different sizes, varying from 500 to 1.5M tweets. Specifically, we generated samples containing 500 (0.5K), 1K, 5K, 10K, 25K, 50K, 250K, 500K, and 1.5M non-sentiment tweets.

Conversely, in Sect. 7, we evaluated the sentiment adaptation procedure using positive and negative tweets from the twenty-two benchmark datasets described in Table 1. For this purpose, we used each dataset once as the target dataset, while the others were used as the source datasets. More clearly, for each assessed dataset, referred to as the target dataset, we explored three distinct strategies to adapt the masked-language model: (i) by using only the tweets from the target sentiment dataset itself, (ii) by using the tweets from the remaining 21 datasets, and (iii) by using the entire collection of tweets from the 22 datasets, including the tweets from the target dataset.

4 Evaluation of static text representations

The computational experiments conducted in this section aim at answering the research question RQ1, as follows:

RQ1. Which static embeddings are the most effective in the sentiment classification of tweets?

We answer this question by assessing the predictive power of the 13 pretrained embeddings described in Table 2. These embeddings were generated from distinct neural network architectures, with different dimensions and vocabulary sizes, and trained on various kinds of corpora. Recall that by static embeddings we mean that the features are gathered from the embedding layer working as a look-up table of tokens. In addition to the pretrained embeddings, we evaluate the BoW model with the TF-IDF representation, which is the most basic text representation used in Twitter sentiment analysis and text classification tasks in general. For every tweet representation, we take the average of the representations of all tokens in the tweet.

We begin by evaluating the predictive performance of the static representations for each classification algorithm. To limit the number of tables in the manuscript, we report the computational results in detail for SVM as an example of this evaluation.Footnote 10 Tables 3 and 4 show the results achieved by using each static representation to train an SVM classifier, in terms of classification accuracy and unweighted \(F_1\)-macro, respectively. The boldfaced values indicate the best results, and the last three lines show the total number of wins for each static representation (#wins row), as well as a ranking of the results (rank sums and position rows). Precisely, for each dataset, we assign scores, from 1.0 to 14.0, to each assessed representation (each column), in ascending order of accuracy (\(F_1\)-macro), where the score 1.0 is assigned to the representation with the highest accuracy (\(F_1\)-macro). Thus, low score values indicate better results. When two assessed representations have the same performance, we take the average of their scores; for instance, if two representations achieve the best performance, they both receive a score of 1.5 ((1+2)/2). Finally, we sum up the scores obtained on each dataset for each assessed representation to calculate the rank sums. With the rank sum of each assessed representation, we rank the rank-sum results from the best (1) to the worst (14), producing the rank position.
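
For clarity, this ranking scheme can be sketched with scipy’s rankdata, which averages tied scores by default; the accuracy matrix below is a random placeholder.

```python
# Minimal sketch: per-dataset scores, rank sums, and final rank positions.
import numpy as np
from scipy.stats import rankdata

# rows: 22 datasets, columns: the 14 assessed representations (accuracy values)
accuracies = np.random.rand(22, 14)

# score 1.0 goes to the highest accuracy on each dataset; ties share averaged scores
scores_per_dataset = rankdata(-accuracies, axis=1)
rank_sums = scores_per_dataset.sum(axis=0)   # one rank sum per representation
rank_position = rankdata(rank_sums)          # 1 = best overall, 14 = worst
```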

Table 3 Accuracies (%) achieved by evaluating the static representations using the SVM classifier
Table 4 \(F_1\)-macro scores (%) achieved by evaluating the static representations using the SVM classifier

As we can see in Tables 3 and 4, RoBERTa (RoBstatic column) achieved the best performance in nine out of the 22 datasets in terms of accuracy, in 11 out of the 22 datasets in terms of \(F_1\)-macro, and was ranked first in the overall evaluation (position row). Regarding the number of wins (#wins row), we can note that Emo2Vec and SSWE achieved the second best results, reaching the best performance in four out of the 22 datasets for both accuracy and \(F_1\)-macro. However, regarding the overall evaluation (position row), w2v-Edin and w2v-GN were ranked among the top three best static representations along with RoBERTa, in terms of accuracy. Regarding \(F_1\)-macro, the top three best static representations were RoBERTa, w2v-Edin, and BERT (BERT-static column). Finally, the Friedman test followed by the Nemenyi post-hoc test detected that the top three best representations – RoBERTa, w2v-Edin, and w2v-GN in terms of accuracy, and RoBERTa, w2v-Edin, and BERT in terms of \(F_1\)-macro – are significantly better than many of the other static representations, as shown below Tables 3 and 4. Nevertheless, there is no significant difference among them.

Tables 5 and 6 show a summary of the results by evaluating each static representation on the 22 datasets, for each classification algorithm. Each cell indicates the number of wins, the rank sums, and the rank position achieved by the related static representation (each line) used to train the corresponding classifier (each column). The Total column indicates the total number of wins, the total rank sums, and the total rank position, i.e., the sum of the rank positions presented in each cell for each assessed model. Moreover, in the total column, we underline the top three best overall results in terms of total rank position.

Table 5 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each static representation on the 22 datasets, for each classification algorithm, in terms of accuracy
Table 6 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each static representation on the 22 datasets, for each classification algorithm, in terms of \(F_1\)-macro

Regarding the overall evaluation (Total column), from Tables 5 and 6, we can see that although Emo2Vec achieved the highest total number of wins (i.e., 27 wins in terms of accuracy, and 29 wins in terms of \(F_1\)-macro), w2v-Edin was ranked as the best overall model, achieving the lowest total rank position for both accuracy (22.0) and \(F_1\)-macro (21.0). Nevertheless, considering each classifier (each column), we can note that RoBERTa achieved the best performance when used to train LR, SVM, and MLP, for both accuracy and \(F_1\)-macro. Conversely, Emo2Vec achieved the best overall results when used to train RF and XGB classifiers. Analyzing the overall results in terms of the total rank position (Total column), we observe that Emo2Vec and w2v-GN, along with w2v-Edin, are ranked as the top three best static representations. These results suggest that w2v-Edin, Emo2Vec, and w2v-GN are well-suited static representations for Twitter sentiment analysis.

In the previous evaluations, we analyzed the predictive performance achieved by each representation for one classification algorithm at a time, focusing on the individual contribution of the text representations in the performance on the final task. Next, we investigate the classification performance of the final sentiment analysis process, that is, the combination of text representation and classifier. Considering that the final classification is a combination of both representation and classifier, an appropriate choice of the classification algorithm may affect the performance of a text representation. For this purpose, we present an overall evaluation of all possible combinations of text representations and classification algorithms, examining them as pairs {text representation, classifier}. More clearly, we evaluate the classification effectiveness of 70 possible combinations of text representations and classifiers (14 \(\times \) 5) on the 22 datasets of tweets. Tables 7 and 8 present the top and the bottom ten results in terms of the average rank position, respectively. Specifically, for each dataset, we calculate a rank of the 70 combinations and then average the rank position of each combination over the 22 datasets.

From Table 7, we can note that the best overall results were achieved by using RoBERTa to train an SVM classifier, for both accuracy and \(F_1\)-macro. Also, w2v-Edin \(+\) SVM and RoBERTa \(+\) MLP appear in the top three results along with RoBERTa \(+\) SVM. From Table 8, we can notice that the RF classifier often appears among the worst results.

Table 7 Top 10 results achieved by evaluating combinations of static word representation models and classifiers
Table 8 Bottom 10 results achieved by evaluating combinations of static word representation models and classifiers

Tables 9 and 10 show a summary of the results for each text representation and classifier, respectively, from best to worst, in terms of the average rank position. As we can observe, Emo2Vec, RoBERTa, and w2v-Edin appear in the top three, being the representations that achieved the best overall performances. Among the classifiers, we can note that SVM and MLP seem to be good choices for Twitter sentiment analysis regarding the usage of static text representations. Conversely, RF achieved the worst overall performance across all evaluations.

The top three static representations identified in the previous evaluation, i.e., RoBERTa, w2v-Edin, and Emo2Vec, are very different from each other. While w2v-Edin and Emo2Vec were trained from scratch on tweets, RoBERTa was trained on traditional English texts. However, among these, RoBERTa is the only Transformer-based model, which holds state-of-the-art performance in capturing the context and semantics of terms from texts. Furthermore, regarding w2v-Edin, although it was trained with a more straightforward architecture (a feedforward neural network) compared to the others, its training parameters were optimized for the emotion detection task on tweets (Bravo-Marquez et al. 2016), which may have helped in determining the sentiment expressed in tweets.

Surprisingly, as shown in Table 9, BERTweet achieved the worst overall performance among all assessed text representations, despite having been trained with the same state-of-the-art Transformer-based architecture as RoBERTa and on tweets. One possible explanation for this behavior is that the BERTweet training procedure limits the representation of its training tweets to only 60 tokens, while RoBERTa uses a limit of 512 tokens. For that reason, we believe that the RoBERTa model is able to attach more semantic information to the tokens of its training vocabulary than BERTweet when one collects the token representations from the embedding layer.

In addition to the individual assessment of text representations and classifiers presented in Tables 9 and 10, Table 11 shows the best results achieved for each dataset. We can see that RoBERTa achieved the highest accuracies in seven out of the 22 datasets, and the highest \(F_1\)-macro scores in nine out of the 22 datasets. Furthermore, as highlighted in Table 7, RoBERTa \(+\) SVM achieved the best performances in six out of the 22 datasets in terms of accuracy, and in eight out of the 22 datasets in terms of \(F_1\)-macro.

Finally, regarding research question RQ1, we can highlight and suggest that: (i) disregarding the classification algorithms, Emo2Vec, w2v-Edin, and RoBERTa seem to be well-suited representations for determining the sentiment expressed in tweets, and (ii) considering the combination of text representations and classifiers, RoBERTa \(+\) SVM achieved the best overall performance, which may represent a good choice for Twitter sentiment analysis in hardware-restricted environments, since the cost here is mostly due to the classifier induction.

Table 9 Summary of the results for each static word representation model, from best to worst, in terms of the average rank position
Table 10 Summary of the results for each classifier, from best to worst, by evaluating the static word representations, in terms of the average rank position
Table 11 Best results achieved for each dataset by evaluating the static word representation models

5 Evaluation of the transformer-based text representations

In this section, we address the research question RQ2, as follows:

RQ2.Considering state-of-the-art Transformer-based autoencoder models, which are the most effective in the sentiment classification of tweets?

To answer that question, we conduct a thorough evaluation of the widely used BERT and RoBERTa models and of BERTweet, the BERT-based Transformer trained from scratch on tweets. These models represent a set of the most recent Transformer-based autoencoder language modeling techniques that have achieved state-of-the-art performance in many NLP tasks. While BERT is the first Transformer-based autoencoder model to appear in the literature, RoBERTa is an evolution of BERT with an improved training methodology, due to the elimination of the Next Sentence Prediction task, which may fit NLP tasks on tweets considering they are limited in size and self-contained in context. Moreover, by evaluating BERTweet we analyze the performance of a Transformer-based model trained from scratch on tweets.

In this set of experiments, we give each tweet as input to the Transformer model and concatenate its last four layers to form the token representations; the tweet representation is then the average of those token representations. Next, the representations collected from the whole dataset are given as input to the learning method together with the labels of the tweets. Finally, the learned classifier is employed to perform the evaluation. In this way, we once again follow the feature extraction plus classification strategy, but now using the contextualized embedding of each tweet.

Table 12 presents the classification results when using the SVM classifier in terms of accuracy and \(F_1\)-macro, and Table 13 shows a summary of the complete evaluation regarding all classifiers. As in the previous section, to limit the number of tables in the manuscript, we only report the computational results in detail for the SVM classifier as an example of this evaluation. From Table 12, we can note that BERTweet achieved the best results in 18 out of the 22 datasets for both accuracy and \(F_1\)-macro. Precisely, the Friedman and Nemenyi tests detected that BERTweet is significantly better than RoBERTa and BERT, while RoBERTa is better than BERT. Similarly, regarding all classifiers, Table 13 shows that BERTweet outperformed BERT and RoBERTa by a significant margin in terms of the total number of wins for both accuracy and \(F_1\)-macro.

Table 12 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating the Transformer-based language models using the SVM classifier
Table 13 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each Transformer-based model on the 22 datasets, for each classification algorithm

Next, we present an overall analysis of using the BERT, RoBERTa, and BERTweet models to train each one of the five classification algorithms, examining them as pairs {language model, classifier}. Table 14 presents the average rank position across all 15 possible combinations (3 language models \(\times \) 5 classification algorithms), from best to worst, as explained in Sect. 4. We can observe that BERTweet combined with the LR, MLP, and SVM classifiers achieved the best overall performances for both accuracy and \(F_1\)-macro. Conversely, using the Transformer-based embeddings to train RF seems to harm the classification performance.

Table 14 Overall analysis of using the Transformer-based models to train each classification algorithm, examining them as pairs {language model, classifier}, in terms of the average rank position

Tables 15 and 16 show a summary of the results for each model and classifier, respectively, from best to worst, in terms of the average rank position. From Table 15, we can see that BERTweet achieved the best overall classification effectiveness and was ranked first. Also, RoBERTa and BERT achieved comparable overall performances for both accuracy and \(F_1\)-macro. Regarding the classifiers, as shown in Table 16, MLP and LR achieved rather comparable performances and were ranked as the top two best classifiers regarding the Transformer-based models, followed by SVM, XGB, and RF.

Table 15 Summary of the results for each Transformer-based model, from best to worst, in terms of the average rank position
Table 16 Summary of the results for each classifier, from best to worst, by evaluating the Transformer-based models, in terms of the average rank position

Regarding the results achieved for each dataset, Table 17 presents the best results in terms of accuracy and \(F_1\)-macro. As we can notice, BERTweet outperformed BERT and RoBERTa in 17 out of the 22 datasets in terms of accuracy and in 18 out of the 22 datasets in terms of \(F_1\)-macro. These results may confirm that Twitter sentiment classification benefits most from contextualized language models trained from scratch on Twitter data. Unlike BERT and RoBERTa, which were trained on traditional English texts, BERTweet was trained on a huge amount of 850M tweets. This fact may have helped BERTweet in learning the specificities of tweets, such as their morphological and semantic characteristics.

Table 17 Best results achieved for each dataset by evaluating combinations of Transformer-based models and classifiers
Table 18 Percentage of vocabulary’s tokens of the Transformer-based model in the row that are also in the vocabulary’s tokens of the Transformer-based model in the column
Table 19 Top 10 results achieved for combinations of language model and classifier by evaluating the Transformer-based models and the static word representations, in terms of the average rank position
Table 20 Overall evaluation of the Transformer-based models and the static word representations, from best to worst, in terms of the average rank position

For a better understanding of the results, we present an analysis of the difference between the vocabularies embedded in the assessed models. For this purpose, Table 18 highlights the number of tokens shared between BERT, RoBERTa, and BERTweet. In other words, we show the amount of tokens (in %) embedded in the model presented in each row that are also included in the model presented in each column, i.e., the intersection between their vocabularies. For example, regarding BERT (first row), we can see that 61% of its tokens can be found in RoBERTa (second column). The information below each model name in the columns refers to their vocabulary size (number of embedded tokens). It is possible to note that only 32% of the 64K tokens from the BERTweet vocabulary (i.e., about 20K tokens) can be found in BERT. This means that, compared to BERT, BERTweet contains about 44K (\(64-20\)) specific tokens extracted from tweets. Similarly, 55% of the tokens embedded in BERTweet (i.e., about 35K tokens) can be found in RoBERTa, meaning that BERTweet holds about 29K (\(64-35\)) specific tokens from tweets that are not included in RoBERTa. As a matter of fact, analyzing the tokens embedded in BERTweet, we find some specific tokens, such as “Awww”, “hahaha”, “broo”, and other internet expressions and slang that social media users often use to express themselves. While creating representations for these tokens is straightforward in BERTweet, BERT and RoBERTa need to perform some extra steps. Specifically, when BERT and RoBERTa do not find a token in their vocabularies, they split the token into subtokens until all of them are found. For example, the token “hahaha” would be split into “ha”, “ha”, and “ha” to represent the original token. This analysis points out that this particular vocabulary, combined with a language model trained to learn the intrinsic structure of tweets, is responsible for the BERTweet language model’s best performance on tweet sentiment classification.
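
This subword splitting can be observed directly with the Hugging Face tokenizers; the sketch below is illustrative only and assumes the public bert-base-uncased and vinai/bertweet-base checkpoints (and the packages they require).

```python
# Minimal sketch: how tweet-style tokens absent from a model's vocabulary
# are split into subtokens, compared with BERTweet's tweet-specific vocabulary.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bertweet_tok = AutoTokenizer.from_pretrained("vinai/bertweet-base")

print(bert_tok.tokenize("hahaha Awww broo"))      # split into subword pieces
print(bertweet_tok.tokenize("hahaha Awww broo"))  # kept (mostly) as whole tokens
```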

Table 21 Best results achieved for each dataset by evaluating combinations of language models and classifiers, regarding Transformer-based models and static word representations

In this context, regarding RQ2, we believe BERTweet is an effective language modeling technique in distinguishing the sentiment expressed in tweets. Also, regarding the classifiers, in general, MLP and LR seem to be good choices when using Transformer-based models.

Unlike the static representations, for which we used only the embedding layer of the language models, in this section we use the whole language model: each tweet passes from the embedding layer up to the last layer before being transformed into a vector representation. To understand the benefits of using the whole language model (embedding layer plus the remaining layers), we compare the predictive performance of the Transformer-based models evaluated in this section against all the static representations assessed in Sect. 4. Table 19 presents the top ten results across all 85 possible combinations of models and classifiers (17 models \(\times \) 5 classification algorithms), and Table 20 shows an overall evaluation of the models, from best to worst, in terms of the average rank position. In addition, Table 21 shows the best results achieved for each dataset.
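
The sketch below contrasts the two feature-extraction settings. It is meant only to illustrate the difference, not to reproduce the exact pipeline of the study: the mean pooling over token vectors is an assumption, as is the use of the vinai/bertweet-base checkpoint.

```python
# Minimal sketch of the two feature-extraction settings compared here.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")
model.eval()

inputs = tokenizer("loving this new phone :)", return_tensors="pt")

with torch.no_grad():
    # (a) Static setting: only the (context-independent) embedding layer.
    static_vectors = model.get_input_embeddings()(inputs["input_ids"])
    static_repr = static_vectors.mean(dim=1)           # one vector per tweet

    # (b) Contextual setting: the tweet goes through all Transformer layers.
    outputs = model(**inputs)
    contextual_repr = outputs.last_hidden_state.mean(dim=1)

print(static_repr.shape, contextual_repr.shape)        # e.g., (1, 768) each
```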

From Tables 19 and 20, we can notice that the Transformer-based BERTweet model outperformed all other models and was ranked first in both evaluations. Also, Table 20 shows that the Transformer-based models achieved the best overall results against all static models, being ranked as the top three representations. Furthermore, from Table 21, the Transformer-based BERTweet model achieved the best overall classification effectiveness in 16 out of the 22 datasets in terms of accuracy and in 17 out of the 22 datasets in terms of \(F_1\)-macro.

These results point out that learning the language model parameters is essential for distinguishing the sentiment expressed in tweets. Static representations may lose relevant information because they ignore the diversity of meanings a word may have depending on the context in which it appears. In contrast, Transformer-based models benefit from learning how to encode the context information of a token into its embedding.

6 Adapting transformer-based models to a large collection of English tweets

In this section, we perform computational experiments to answer research question RQ3, stated as follows:

RQ3. Can a second phase of adaptive pretraining of Transformer-based autoencoder models using a large set of English tweets improve the sentiment classification performance?

To answer this research question, we evaluate the classification effectiveness of the BERT, RoBERTa, and BERTweet language models adapted with tweets from a corpus of 6.7M unlabeled tweets (referred to as generic unlabeled tweets), as described in Sect. 3.3. Precisely, we use this set of tweets to adapt the model weights using the intermediate masked language model task as the training objective, randomly masking 15% of the input tokens. We also compare the results of these adapted models against those achieved with the original weights of the Transformer-based models, as presented in Sect. 5, to analyze whether adjusting the models via a second phase of pretraining improves the predictive performance of sentiment classification.

In general, the performance of adapted models is very sensitive to the random seed (Dodge et al. 2020). For that reason, all results presented in this section are averages of three executions using different seeds (12, 34, 56).

The first part of the experiments reported in this section consists in determining whether the predictive performance of the Transformer-based models is affected by the adaptation procedure when using tweet corpora of different sizes. For this purpose, in addition to the entire Edinburgh corpus of 6,657,700 tweets (around 6.7M tweets), we used nine smaller samples of tweets of different sizes, varying from 500 to 1.5M tweets. Specifically, we generated samples containing 0.5K, 1K, 5K, 10K, 25K, 50K, 250K, 500K, and 1.5M generic unlabeled tweets. In the adaptation processes, we performed three training epochs, except for the adaptation with 6.7M tweets, for which we used one epoch, as some models, such as BERTweet, degraded with more epochs. In all adaptation processes, all layers are unfrozen. Regarding the batch size, we use the available hardware capacity of eight instances per device. We used a learning rate of 5e-5 with a linear scheduler and the Adam optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\). We also used a maximum gradient norm of 1 and no weight decay.
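
A condensed sketch of this adaptive pretraining setup with the Hugging Face Trainer is shown below; the corpus file name and the tokenization details (e.g., the 128-token truncation) are placeholders, while the optimization hyperparameters mirror the ones reported above.

```python
# Sketch of the second-phase masked language model (MLM) adaptation, assuming a
# plain-text file with one unlabeled tweet per line ("tweets.txt" is a placeholder).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments, set_seed)

set_seed(12)  # repeated with seeds 34 and 56, and the results averaged

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")  # all layers trainable

dataset = load_dataset("text", data_files={"train": "tweets.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)  # mask 15% of the tokens

args = TrainingArguments(
    output_dir="bertweet-adapted",
    num_train_epochs=3,                 # one epoch for the full 6.7M corpus
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8,
    max_grad_norm=1.0, weight_decay=0.0,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```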

Tables 22, 23, and 24 present the average classification accuracies and \(F_1\)-macro scores obtained when adapting BERT, RoBERTa, and BERTweet, respectively, with the different samples of tweets drawn from the Edinburgh corpus. As in previous sections, due to space constraints, we only report the detailed evaluation using the SVM classifier. Regarding the variance in performance across the different seeds, the mean and maximum standard deviations are 0.05% and 0.5%, respectively, for both accuracy and \(F_1\)-macro.

Table 22 Average classification accuracies and \(F_1\)-macro scores (%) achieved by adaptive pretraining of BERT with different samples of generic unlabeled tweets, using the SVM classifier
Table 23 Average classification accuracies and \(F_1\)-macro scores (%) achieved by adaptive pretraining of RoBERTa with different samples of generic unlabeled tweets, using the SVM classifier
Table 24 Average classification accuracies and \(F_1\)-macro scores (%) achieved by adaptive pretraining of BERTweet with different samples of generic unlabeled tweets, using the SVM classifier
Table 25 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-based models with different samples of unlabeled tweets in terms of accuracy
Table 26 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-based models with different samples of unlabeled tweets in terms of \(F_1\)-macro

Note that BERT (Table 22) benefited most when adapted with the sample of 250K tweets, being ranked first in the overall evaluation (position row) for both accuracy and \(F_1\)-macro. Although these results are only significantly better than those obtained by adapting BERT with 6.7M tweets, they provide some evidence that more tweets do not necessarily mean better performance for adapted models. RoBERTa (Table 23) achieved the best overall results when adapted with the sample of 1.5M tweets, for both accuracy and \(F_1\)-macro. However, these results are only significantly better than those obtained by adapting RoBERTa with the sample of 0.5K. BERTweet (Table 24), on the other hand, benefited from smaller samples, achieving the highest overall predictive performance when adapted with the sample of 25K tweets, for both accuracy and \(F_1\)-macro, being significantly better than the results achieved with samples of 0.5K, 1.5M, and 6.7M. This is an expected result, as BERTweet is already trained from scratch on tweets. Since we are adapting the language models, BERT and RoBERTa seem to require more samples to accommodate the Twitter-based vocabulary into the model weights.

Next, we analyze the overall performance of the adapted Transformer-based models for each classification algorithm. Tables 25 and 26 summarize the results in terms of accuracy and \(F_1\)-macro, respectively. Regarding the variance across the different seeds, the mean and maximum standard deviations are 0.2% and 0.7% in terms of accuracy, and 0.26% and 0.98% in terms of \(F_1\)-macro.

Interestingly, from Tables 25 and 26, we can note that when adapting a language model to fit a specific type of text, such as tweets, using large corpora does not guarantee better predictive performance. Specifically, the best overall results (Total column) were achieved when adapting the BERT, RoBERTa, and BERTweet models with samples of 250K, 50K, and 5K tweets, respectively, for both accuracy and \(F_1\)-macro.

Regarding the results achieved for each dataset, Table 27 shows the best predictive performances in terms of accuracy and \(F_1\)-macro. We can see that BERTweet achieved the best results for most datasets when the adaptive pretraining uses fewer tweets. More specifically, BERTweet adapted with samples varying from 1K to 25K tweets outperformed the other models in 14 out of the 22 datasets for both accuracy and \(F_1\)-macro.

Table 27 Best results achieved for each dataset by adapting the Transformer-based models with different samples of generic tweets

As in previous sections, we also present an overall evaluation combining all adapted models and classifiers across the 22 datasets, in terms of the average rank position. Table 28 shows the top ten results among all 150 possible combinations (3 models \(\times \) 10 samples of tweets \(\times \) 5 classification algorithms). As we can see in Table 28, adapted BERTweet embeddings achieved the best overall performances when used to train LR, MLP, and SVM, dominating the top ten results. Also, note that when using LR, MLP, and SVM, BERTweet outperformed all other models when adapted with samples containing 50K tweets or less.

Tables 29 and 30 show, respectively, the top ten results among all adapted models and a summary of the results for each classifier, from best to worst, in terms of the average rank position. From Table 29, we can notice that the ten BERTweet adapted models (0.5K, 1K, 5K, 10K, 25K, 50K, 250K, 500K, 1.5M, and 6.7M) occupy all of the top ten positions. Consequently, neither BERT nor RoBERTa appears in the top results, even when adapted with the entire corpus of 6.7M tweets. RoBERTa first appears at position 24 of the accuracy ranking, with an average rank of 37.02, when adapted with 50K tweets and combined with the MLP classifier, and at position 28 of the \(F_1\)-macro ranking, with an average rank of 37.27, when adapted with 50K tweets and combined with the LR classifier. BERT first appears at position 56 of the accuracy ranking, with an average rank of 66.05, when adapted with 1.5M tweets and combined with the MLP classifier, and at position 51 of the \(F_1\)-macro ranking, with an average rank of 60.77, when adapted with 6.7M tweets and combined with the LR classifier. Among the classifiers, as we can see in Table 30, MLP and LR achieved the best predictive performances and were ranked as the top two classifiers. Conversely, RF was ranked as the worst classifier.

Table 28 Top 10 results achieved for combinations of Transformer-based models and classifiers by adapting the Transformer-based models with different samples of generic tweets
Table 29 Top 10 results achieved by adapting the Transformer-based models with different samples of generic tweets, in terms of the average rank position
Table 30 Summary of the results for each classifier, from best to worst, by adapting the Transformer-based models with samples of generic tweets, in terms of the average rank position

From all the previous evaluations, we can note that as the size of the samples increases, the adaptation procedure seems to become less effective. This may be due to the adjustment of the weights of the models’ layers during back-propagation. Considering that the adaptation procedure consists in unfreezing the entire pre-trained model and adjusting its weights with the new data, the original model and the semantic and syntactic knowledge learned in its layers are changed. In that case, we believe that after some training iterations the adjustment of the weights starts to damage the original knowledge embedded in the models’ layers. This may further explain why BERTweet achieved better classification performance with smaller samples of tweets than BERT and RoBERTa. Our hypothesis is that, since the weights in BERTweet’s layers are already adjusted to fit the tweets’ language style, using more data to adapt the model amounts to merely continuing its initial training, and too much additional data may harm the learned weights. Thus, we suggest that, when employing adaptive pretraining in Transformer-based models such as BERT, RoBERTa, and BERTweet, samples of different sizes should be exploited instead of simply adopting a dataset with a massive number of instances.
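
As an illustration of what unfreezing means in practice, the sketch below marks every parameter of a pre-trained model as trainable, which is the setting used in these experiments; the commented-out lines show how a subset of layers could instead be kept frozen to preserve more of the original knowledge, an alternative not explored here. The parameter name prefixes are assumptions about the checkpoint’s internal naming.

```python
# Illustration of "unfreezing the entire model": every weight is updated during
# the adaptive pretraining.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

for name, param in model.named_parameters():
    param.requires_grad = True   # fully unfrozen (the setting used here)
    # Hypothetical alternative: keep the embeddings and the first encoder layer fixed.
    # if name.startswith(("roberta.embeddings", "roberta.encoder.layer.0.")):
    #     param.requires_grad = False
```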

Additionally, we present a comparison of all adapted Transformer-based models against their original versions. Tables 31, 32, and 33 report this comparison in terms of the average rank position for BERT, RoBERTa, and BERTweet, respectively. We can see that the adapted versions achieved meaningful predictive performances compared to their original models, which indicates that adaptive pretraining strategies can boost classification performance in Twitter sentiment analysis. Moreover, from Tables 31 and 32, we note that the adapted versions of BERT and RoBERTa benefited most from samples containing a large number of tweets. Conversely, as pointed out before, BERTweet achieved better overall performances when using smaller samples, as shown in Table 33.

Table 31 Comparison among all adapted BERT models and BERT’s original version (no adaptation), in terms of the average rank position
Table 32 Comparison among all adapted RoBERTa models and RoBERTa’s original version (no adaptation), in terms of the average rank position
Table 33 Comparison among all adapted BERTweet models and BERTweet’s original version (no adaptation), in terms of the average rank position

Addressing research question RQ3, we observed that adaptive pretraining of Transformer-based models improves classification effectiveness in Twitter sentiment analysis. Nevertheless, using large sets of tweets does not guarantee better predictive performance, particularly for models trained from scratch on tweets, such as BERTweet, which benefited most from samples containing 50K tweets or less. Furthermore, regarding the classifiers, MLP and LR seem, in general, to be good choices to be employed after extracting features from adapted Transformer-based models.

7 Adapting transformer-based models to sentiment datasets

The experiments conducted in this section aim at answering the research question RQ4, stated as follows:

RQ4. Can Transformer-based autoencoder models benefit from a second phase of adaptive pretraining with tweets from sentiment analysis datasets?

We address this research question by evaluating whether the sentiment classification of tweets benefits from adapting the language models to tweets from sentiment analysis datasets. For this purpose, we use the same collection of 22 benchmark datasets presented in Sect. 3.1 (Table 1). We perform this evaluation by assessing three distinct strategies that simulate three real-world scenarios. In addition, as in Sect. 6, all experiments were run three times using different seeds (12, 34, 56) with the same hyperparameters, and we report the average results.

The first adaptation strategy we investigate, referred to as InData, simulates using the target sentiment dataset itself as the new domain data to adapt a pre-trained language model. Precisely, each of the 22 datasets is used once as the target dataset, following a 10-fold cross-validation procedure. In each of the ten executions, the tweets from nine folds are used as the source data (i.e., the training data) to adjust the language model, which is then validated on the remaining fold (i.e., the test data).

The second strategy, referred to as LOO (Leave One dataset Out), simulates the situation where a collection of general sentiment datasets is available to adapt the language model. We use each dataset once as the target dataset, while the tweets from the remaining 21 datasets are combined to adjust the language model. Although these sentiment datasets contain a label for each tweet, the labels are not used in the adaptation process, as we leverage the intermediate self-supervised masked language model task to tune the network parameters.

The third and last strategy, referred to as AllData, is a combination of the other two. Specifically, as in strategy InData, for each assessed dataset (target dataset) and for each of the ten executions of the 10-fold cross-validation procedure, we combine the tweets from the nine training folds (i.e., the training data of the target dataset) with the tweets from the remaining 21 datasets to adapt the language model. This last strategy evaluates the benefit of combining the tweets from the target sentiment dataset with a representative corpus of general sentiment datasets in the adaptation process.
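
The sketch below summarizes, in simplified form, how the adaptation corpus is assembled under each strategy for one target dataset and one cross-validation split; the data structures and the function name are illustrative placeholders, not the code used in the experiments.

```python
# Sketch of corpus construction for the InData, LOO, and AllData strategies.
# `datasets` is assumed to map dataset names to lists of tweet texts, and
# `train_folds` is the list of nine training folds of the target dataset.
def build_adaptation_corpus(strategy, target_name, datasets, train_folds):
    target_train = [t for fold in train_folds for t in fold]   # 9 folds of the target
    others = [t for name, tweets in datasets.items()
              if name != target_name for t in tweets]          # remaining 21 datasets

    if strategy == "InData":     # only the target dataset's training folds
        return target_train
    if strategy == "LOO":        # only the other sentiment datasets
        return others
    if strategy == "AllData":    # union of both sources
        return target_train + others
    raise ValueError(f"unknown strategy: {strategy}")
```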

Tables 34, 35, and 36 present the predictive performances achieved by adapting BERT, RoBERTa, and BERTweet, respectively, with strategies InData, LOO, and AllData, one at a time. As in previous sections, due to space constraints, we only report the detailed evaluation using the SVM classifier.

From Table 34, we can observe that, although BERT seems to benefit most from strategy InData, which uses only the target dataset itself to adjust the language model, the Friedman and the Nemenyi tests did not detect any significant differences between strategies InData, LOO, and AllData. Regarding the RoBERTa and BERTweet models (Tables 35 and 36, respectively), adapting them with strategies that combine tweets from distinct sentiment analysis corpora achieved the best results for most datasets. More clearly, AllData, which combines the tweets from the target dataset with tweets from a collection of sentiment datasets, achieved the best overall results with both RoBERTa and BERTweet. As a matter of fact, the Friedman and the Nemenyi tests indicate that strategy AllData with RoBERTa significantly outperformed strategy InData. Similarly, strategies AllData and LOO with BERTweet are significantly better than strategy InData. It is also noteworthy that smaller datasets seem to have benefited most from adapting RoBERTa and BERTweet using strategy LOO, whereas larger datasets achieved higher predictive performances when using strategy AllData to adapt these models. Tables 37 and 38 show a summary of the complete evaluation, considering all classifiers, in terms of classification accuracy and \(F_1\)-macro, respectively.
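
To make the rank-based comparison concrete, the sketch below (a simplification, not the authors’ evaluation script) computes average rank positions, the Friedman statistic, and the Nemenyi critical difference over a placeholder score matrix using NumPy and SciPy.

```python
# Sketch of the rank-based comparison used throughout this evaluation.
# The score matrix is a random placeholder standing in for per-dataset
# accuracies of the InData, LOO, and AllData strategies.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
scores = 70 + 10 * rng.random((22, 3))     # 22 datasets x 3 strategies (placeholder values)

ranks = np.vstack([rankdata(-row) for row in scores])   # rank 1 = best on each dataset
print("average rank per strategy:", ranks.mean(axis=0))

stat, p_value = friedmanchisquare(*scores.T)            # one array of scores per strategy
print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.3f}")

# Nemenyi critical difference (Demsar, 2006): CD = q_alpha * sqrt(k*(k+1) / (6*N)),
# with q_0.05 ~ 2.343 for k = 3 compared strategies.
k, n = scores.shape[1], scores.shape[0]
cd = 2.343 * np.sqrt(k * (k + 1) / (6 * n))
print(f"Nemenyi critical difference: {cd:.3f}")
```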

Table 34 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating BERT with adaptation strategies InData, LOO, and AllData using the SVM classifier
Table 35 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating RoBERTa with adaptation strategies InData, LOO, and AllData using the SVM classifier
Table 36 Accuracies and \(F_1\)-macro scores (%) achieved by evaluating BERTweet with adaptation strategies InData, LOO, and AllData using the SVM classifier
Table 37 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-Autoencoder models using strategies InData, LOO, and AllData in terms of accuracy
Table 38 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by each classifier when adapting the Transformer-Autoencoder models using strategies InData, LOO, and AllData in terms of \(F_1\)-macro

Regarding the overall results achieved for each dataset, Table 39 presents the best results. We can note that, when adapting the Transformer-based models with tweets from sentiment datasets, BERTweet outperformed BERT and RoBERTa on all datasets except sarcasm (sar) and hobbit (hob). Interestingly, as mentioned before, while strategy LOO achieved the best results for smaller datasets, larger datasets seem to benefit from strategy AllData. Precisely, strategy AllData achieved the best overall performances in ten out of the 22 datasets in terms of accuracy and in 11 out of the 22 datasets in terms of \(F_1\)-macro, while strategy LOO achieved the best results in nine out of the 22 datasets for both metrics. The better performance of the AllData strategy for larger target datasets indicates that the significant amount of information present in the target dataset is indispensable for the adaptation process, whereas the limited information present in smaller target datasets contributes little, making the LOO strategy adequate for datasets with a limited number of tweets.

Conversely, strategy InData did not achieve meaningful results. The inferior performance of the InData strategy in almost all datasets shows that, regardless of the size of the dataset, the use of external and more extensive data brings more information to the adaptation process, improving the final performance.

Table 39 Best results achieved for each dataset by adapting the Transformer-based models using strategies InData, LOO, and AllData

Next, we present an overall evaluation combining all adapted models and classifiers across the 22 datasets, in terms of the average rank position. Table 40 reports the top ten results among all 45 possible combinations (3 language models \(\times \) 3 adaptation strategies \(\times \) 5 classification algorithms). We can observe that the LR classifier trained with BERTweet embeddings adapted via strategy AllData achieved the best overall predictive performance. Also, note that the adapted BERTweet embeddings with strategies AllData and LOO, combined with LR, MLP, and SVM, occupy the top of the ranking (top six results). Another point worth highlighting is that BERTweet dominates the top ten results, appearing in eight out of the ten positions in terms of accuracy and in nine out of the ten positions in terms of \(F_1\)-macro.

Tables 41 and 42 show, respectively, the results among all adapted models and a summary of the results for each classifier, from best to worst, in terms of the average rank position. Once again, from Table 41, we can notice that the BERTweet adapted models (InData, LOO, and AllData) were ranked in the top three positions. Among the classifiers, as we can see in Table 42, MLP and LR achieved the best predictive performances and were ranked as the top two classifiers. Conversely, RF was ranked as the worst classifier.

Table 40 Top 10 results achieved for combinations of Transformer-based models and classifiers by adapting the models using strategies InData, LOO, and AllData
Table 41 Comparison among all adapted Transformer-based models using strategies InData, LOO, and AllData, in terms of the average rank position
Table 42 Summary of the results for each classifier, from best to worst, by adapting the Transformer-based models using strategies InData, LOO, and AllData, in terms of the average rank position

To evaluate the effectiveness of adapting the Transformer-based models using tweets from sentiment datasets, we present a comparison among all the adaptation strategies assessed in this study for each language model. Specifically, we compare the adapted models presented in this section, by using strategies InData, LOO, and AllData, against the best adapted models identified in Sect. 6, i.e., BERT-250K, RoBERTa-50K, and BERTweet-5K. Table 43 reports these results in terms of the average rank position for BERT, RoBERTa, and BERTweet.

Regarding BERT, as shown in Table 43, all the adaptation strategies using tweets from sentiment datasets achieved better overall results than using the sample of 250K generic tweets. Moreover, strategy InData appears at the top of the ranking as the best adaptation strategy. It is worth mentioning that strategy InData uses only the tweets from the target dataset itself to adjust the language model, i.e., far fewer tweets than the 250K contained in the generic sample. On the other hand, strategy InData did not achieve meaningful results for the RoBERTa and BERTweet models. Nevertheless, for these models, strategies AllData and LOO, which also use tweets from sentiment datasets, achieved rather comparable performances and were ranked as the top two adaptation strategies.

Table 43 Comparison among the adapted models by using strategies InData, LOO, and AllData, against the best adapted models with different samples of generic tweets

To further assess the effectiveness of adapting the Transformer-based models to tweets from sentiment datasets, i.e., using strategies InData, LOO, and AllData, we present an overall comparison between these strategies and the 47 models assessed in Sects. 4, 5, and 6. Tables 44 and 45 present, respectively, the ten best and the ten worst combinations of models and classifiers, in terms of the average rank position, among all 280 combinations (56 models \(\times \) 5 classifiers). We note that BERTweet adapted with tweets from sentiment datasets and combined with LR and MLP achieved the four best results in terms of accuracy and the two best results in terms of \(F_1\)-macro. These combinations were followed by BERTweet adapted with generic tweets. More specifically, combinations with strategies AllData and LOO achieved the best overall results. Regardless of the language model, LR and MLP were the most frequent classifiers in the top ten results. Conversely, all ten worst combinations involve static representations combined with RF, which appears in every one of the worst model and classifier combinations.

Disregarding the classifiers, Tables 46 and 47 present, respectively, the top ten and the bottom ten models, comparing all 56 word representations assessed in this study (14 static representations \(+\) 3 Transformer-based models \(+\) 30 models adapted with samples of generic tweets \(+\) 9 models adapted with sentiment datasets). From Table 46, we can confirm the good performance obtained by adapting the Transformer-based models using tweets from sentiment datasets. Specifically, the BERTweet models adapted with strategies AllData and LOO appear at the top of the ranking as the two best models. We can also notice that adapting BERTweet with generic tweets improves upon the original BERTweet. Regarding the bottom ten models, from Table 47, we can see that all of them are static representations.

Lastly, regarding research question RQ4, we highlight that adapting Transformer-based models using tweets from sentiment datasets seems to boost classification performance in Twitter sentiment analysis. As a matter of fact, the strategies AllData and LOO exploited in this section, which use a collection of sentiment tweets to adjust the language model, achieved better overall results than using samples of generic unlabeled tweets. Although we do not use the labels of those tweets in the adaptation procedure, they may carry much more sentiment-related information than the tweets from the Edinburgh corpus, which originated the samples of generic unlabeled tweets used in the experiments. Furthermore, BERTweet embeddings adapted with the AllData strategy seem to be very effective in determining the sentiment expressed in tweets, especially when used to train the LR, MLP, and SVM classifiers.

Table 44 Top 10 results achieved by evaluating combinations of models and classifiers, regarding all 56 models assessed in this study
Table 45 Bottom 10 results achieved by evaluating combinations of models and classifiers, regarding all 56 models assessed in this study
Table 46 Top 10 models among the 56 word representation models assessed in this study, in terms of the average rank position
Table 47 Bottom 10 models among the 56 word representation models assessed in this study, in terms of the average rank position

8 Conclusions and future work

In this article, we presented an extensive assessment of modern and classical word representations when used for the task of Twitter sentiment analysis. Specifically, we assessed the classification performance of 14 static representations and of the most recent Transformer-based autoencoder models, including BERT, RoBERTa, and BERTweet, as well as different strategies for adapting the language representations of such models. All models were evaluated in the context of Twitter sentiment analysis using a rich set of 22 datasets and five classifiers of distinct natures. The main focus of this study was on identifying the most appropriate word representations for the sentiment analysis of English tweets.

Based on the results of the experiments performed in this study, we can highlight the following conclusions and recommendations:

  • Considering a limited computing resource scenario, where static representations could play an important role, we noticed that the Emo2Vec, w2v-Edin, and RoBERTa models seem to be well-suited static representations for determining the sentiment expressed in tweets. Although there is no significant difference between them, they are significantly better than many of the other assessed static representations. The good performance achieved by Emo2Vec and w2v-Edin indicates that being trained from scratch with tweets can boost the classification performance of static representations applied to Twitter sentiment analysis. Although RoBERTa was not trained from scratch with tweets, it is a Transformer-based autoencoder model, which holds state-of-the-art performance in several NLP tasks. Regarding the classifiers, we could see that SVM and MLP achieved the best overall performances, especially when trained with RoBERTa’s static embeddings. Nevertheless, in such a scenario, we acknowledge that there is no globally optimal language model. Therefore, when implementing a classification system, we recommend assessing the RoBERTa, Emo2Vec, and w2v-Edin language models, as well as their combinations with the SVM and MLP classifiers.

  • Regarding the Transformer-based models, we observed that BERTweet is the most appropriate language model for the sentiment classification of tweets, achieving significantly better results than RoBERTa and BERT. Specifically, the particular vocabulary tweets contain, combined with a language model trained to learn their intrinsic structure, can effectively improve the performance of the Twitter sentiment analysis task. Considering the combination of language models and classifiers, BERTweet achieved the best overall results when combined with LR and MLP. Furthermore, by comparing the Transformer-based models and the static representations, we noticed that the adaptation of the tokens’ embeddings to the context in which they appear, performed by the Transformer-based models, benefits the sentiment classification task. In this context, considering a scenario where the availability of computing resources is not an issue, we recommend BERTweet as the language model to be adopted in a Twitter sentiment classification system, with LR and MLP being reasonable choices of classifiers.

  • When adapting the Transformer-based pre-trained models to a large set of unlabeled English tweets, we noticed that, although it improves classification performance, using as many tweets as possible does not necessarily yield better results. Based on that, we presented an extensive evaluation with sets of tweets of different sizes, varying from 0.5K to 1.5M. The results show that while BERT and RoBERTa achieved better predictive performances when adapted with sets of 250K and 50K tweets, respectively, BERTweet outperformed all adapted models using only 5K tweets. Although the Friedman and the Nemenyi tests did not detect any significant difference among these results, we believe that models trained from scratch with tweets, such as BERTweet, need fewer tweets to improve their performance. Moreover, comparing all adapted models while taking the classifiers into account, BERTweet combined with MLP, LR, and SVM achieved the best overall performances. In this context, if adapting a language model is an option, with enough computing resources and a considerable amount of unlabeled English tweets at hand, we recommend evaluating the Twitter sentiment classification system with sets of tweets of different sizes, and we suggest BERTweet as the language model.

  • Analyzing the adaptation of the Transformer-based autoencoder language models with sentiment analysis datasets, i.e., with tweets that express polarity, we can see that the adapted models perform better than when adapted with generic tweets. All adaptation strategies using sentiment analysis datasets performed better than the best models adjusted with generic tweets. We conclude that it is worth adapting a Transformer-based autoencoder model using a set of sentiment tweets. Among the adaptation strategies using sentiment analysis tweets explored in this study, each Transformer model performed best with a different adjustment method. Using only the target dataset, for example, was a good option for BERT. For RoBERTa and BERTweet, combining the target dataset with a set of tweets from other datasets proved to be a good strategy for adapting the language model. In a general comparison, we noticed that BERTweet adapted with the union of the target dataset and the set of sentiment analysis tweets (BERTweet-AllData) performed better than the other adjusted models. Besides, we observed that BERTweet-AllData performed well when combined with the LR and MLP classifiers. Hence, considering a scenario where a specific dataset of English tweets carrying positive and negative polarities is available for adapting a language model, we recommend using BERTweet adapted with strategy AllData as the language model of a sentiment classification system.

  • After answering our research questions, we can briefly state that: (i) Transformer-based autoencoder models perform better than static representations; (ii) Transformer-based autoencoder models adapted to English tweets perform better than their original counterparts; and (iii) it is worth adapting a language model with tweets from sentiment analysis datasets, even one originally trained with generic English tweets. Considering all original and adapted models, the best overall performance for the English tweet sentiment analysis task was achieved by the Transformer-based autoencoder model trained from scratch with generic tweets (BERTweet) when adapted with tweets from the target sentiment dataset combined with tweets from a large set of other sentiment datasets. This strategy, called BERTweet-AllData, is our main suggestion for the sentiment classification of English tweets, especially when combined with the MLP or LR classifiers.

For future work, we plan to investigate other methods for adjusting the language models, mainly fine-tuning them with polarity classification as the downstream task. Transformer-based autoencoder pre-trained models, such as BERT, RoBERTa, and BERTweet, can have their weights adjusted to become more accurate in a specific task, such as sentiment analysis. This adjustment is made by adding an extra classification layer on top of the model and back-propagating the error of the final task through the language model’s weights. We then intend to compare the best results obtained in this study with those achieved by this task-specific category of fine-tuning.
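
As a rough illustration of this kind of task-specific fine-tuning, the sketch below adds a classification head on top of a pre-trained checkpoint and trains it end-to-end with the Hugging Face Trainer; the tiny in-memory dataset and the hyperparameters are placeholders and do not correspond to the planned experiments.

```python
# Sketch of task-specific fine-tuning: a classification layer is added on top of
# the pre-trained encoder and the polarity-classification error is back-propagated
# through all weights. The two example tweets are placeholders, not benchmark data.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=2)   # adds a randomly initialized classification head

data = Dataset.from_dict({
    "text": ["loving this new phone :)", "worst service ever ..."],
    "labels": [1, 0],
})
data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                    padding="max_length", max_length=64),
                batched=True)

args = TrainingArguments(output_dir="bertweet-sentiment", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=data).train()
```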