1 Introduction

Social networks (SNs) have become increasingly popular platforms worldwide and are now used more than ever. With the growth and rising popularity of SNs such as Twitter, people share more personal emotions and opinions on a wide range of issues in such networks. This rapid growth of SNs, combined with the accessibility of large amounts of data on a multitude of topics, offers great research potential for a wide range of applications, such as customer analysis, product analysis, sector analysis, and digital marketing (Bhatnagar and Choubey 2021; Fatehi et al. 2022). In addition, identifying users' polarities and mining the opinions they share, especially on SNs, has become one of the most popular and useful research fields. Social media platforms can build rich profiles from the online presence of users by tracking activities such as participation, messaging, and website visits (Cui et al. 2020). As the number of SN users grows and the interactions between them rise exponentially, large volumes of user-generated content are produced. Analyzing all these data is difficult because most social media data are unstructured and change frequently. Social network analysis provides innovative techniques for analyzing interactions among entities by emphasizing social relationships (Kumar and Sinha 2021). Analyzing SNs with data mining and machine learning algorithms has therefore become a must-have strategy for obtaining useful information. Data mining is the process of extracting and identifying useful patterns and relationships from large data sets, using data analysis tools, which may lead to the extraction of new information (Keyvanpour et al. 2020).

Among the various SNs, Twitter is one of the most studied platforms in social network research. Twitter enables users to share their daily emotions and opinions and is considered a convenient platform for sharing personal messages, pictures, and videos. One of the main advantages of platforms like Twitter is that users are organized into groups, making it possible to investigate communities united by common interests rather than individual profiles. This is enabled by the extensive use of hashtags, mentions, and retweets, which form a complex network and provide a rich source of data for analysis. Twitter is known as a novel source of data for those studying the attitudes, beliefs, and behaviors of consumers and opinion makers (Islam et al. 2020; Kwak and Grable 2021).

Among the various forms of communication, text messages are one of the most conspicuous, since users can express their opinions and emotions on diverse topics using text. Text mining is the process of exploring and transforming unstructured text data into structured data to find meaningful insights. It is defined as a multi-purpose research method for studying a wide range of issues by systematically and objectively identifying the characteristics of large samples of data. Text mining is a sub-field of data mining and an extension of classical data mining methods, which can be applied when making sophisticated formulations using text classification and clustering procedures (Yang et al. 2021). Hossny et al. (2020) listed the key challenges of analyzing Twitter text, including the short tweet length, frequent use of abbreviations, misspelled words and acronyms, transliteration of non-English words using Roman script, ambiguous semantics, and synonyms.

Information on several social media platforms, such as blogs, review sites, and Twitter, is processed to extract people's opinions about a particular product, organization, or situation. Attitudes and feelings, known as sentiments, form an essential part of evaluating an individual's behavior. These sentiments can be analyzed using a field of study known as sentiment analysis (SA) (Singh et al. 2021). SA belongs to the area of natural language processing (NLP) (Chen et al. 2020) and has been an active research topic in NLP; it is the cognitive computing study of people's opinions, sentiments, emotions, appraisals, and attitudes toward entities such as products, services, organizations, individuals, issues, events, topics, and their attributes (Dai et al. 2021). It also aims to analyze and extract knowledge from subjective information published on the Internet (Basiri et al. 2021). Sentiment analysis of user-generated data is very useful for understanding the opinion of the crowd. Two main approaches to sentiment analysis of text documents are described in the literature: approaches based on machine learning and approaches based on symbolic techniques. Symbolic techniques use lexicons and other linguistic resources to determine the sentiment of a given text. Some research has used machine learning to classify the sentiment of a given text, sometimes following the approach of most symbolic techniques and seeking to identify positive, negative, and neutral categories, but sometimes also considering other sentiment categories such as anger, joy, and sadness (Moutidis and Williams 2020).

SA plays a significant role in many domains by extracting people's emotions, which can then help businesses develop accordingly. In this study, we investigate the performance of different machine learning (ML) models for analyzing the sentiment of two real datasets.

The contributions of this paper are summarized as follows:

  1. We generate and preprocess two real datasets extracted with the Twitter Application Programming Interface (API)—binomial and polynomial—to investigate sentiment analysis. The binomial dataset incorporates the two polarities of positive and negative, which is the typical dataset used in the literature, while the polynomial dataset includes the three polarities of positive, negative, and neutral.

  2. We investigate the performance of sentiment classification in terms of accuracy/AUC and accuracy/kappa for four classifiers on the binomial and polynomial datasets, respectively.

  3. To increase the reliability of SA and reduce the variance and bias of the learning models, we apply ensemble methods to both the binomial and polynomial datasets and report the accuracy values for these methods.

  4. To find the best train–test split ratio in addition to K-fold cross-validation, we divide the dataset into a training set and a testing set with different percentages of data.

The rest of this paper is structured as follows: Sect. 2 reviews related work in the literature. Section 3 describes the methodology, including data collection, preprocessing for sentiment analysis, sentiment detection, and classification modeling. The results are presented and discussed in Sect. 4, and finally, the conclusion is given in Sect. 5.

2 Related work

Researchers in the field of sentiment analysis have mostly used supervised machine learning algorithms for primary classification, such as the work by Chauhan et al. (2020). Furthermore, many recent studies use Twitter as the primary source of data (Al-Laith et al. 2021; Yadav et al. 2021).

Henríquez and Ruz (2018) used a non-iterative deep random vector functional link network (D-RVFL). They analyzed two different datasets: dataset 1 contains 10,000 tweets from the 2017 Catalan referendum, and dataset 2 contains 2187 tweets from the 2010 Chilean earthquake. They treated both datasets as two-class classification problems with positive and negative labels. D-RVFL showed the best performance compared to SVM, random forest, and RVFL.

Ankit and Saleena (2018) proposed an ensemble classification system formed from different learners, namely naive Bayes, random forest, SVM, and logistic regression classifiers. Their system employs two algorithms: the first calculates a positive and a negative score for a tweet, and the second uses these scores to predict the tweet's sentiment. Their dataset consists of 43,532 negative and 56,457 positive tweets.

Symeonidis et al. (2018) evaluated preprocessing techniques with respect to the resulting classification accuracy and the number of features they produce. The paper examined techniques such as lemmatization, removing numbers, and replacing contractions, although the resulting detection accuracy was low. For this task, they used four classification algorithms—logistic regression, Bernoulli naive Bayes, linear SVC, and convolutional neural networks—on two datasets with positive, negative, and neutral classes.

Sailunaz and Alhajj (2019) used a dataset to detect sentiment and emotion from tweets and their replies, and measured users' influence scores based on various user-based and tweet-based parameters. The dataset also includes replies to tweets, and the paper incorporates the agreement score, sentiment score, and emotion score of replies into the influence score calculation.

Ruz et al. (2020) reviewed five classifiers and assessed their performance on two Twitter datasets covering two different critical events. Their datasets were in Spanish, and they concluded that there is no difference between the behavior of the support vector machine (SVM) and random forest in English and Spanish. To automatically control the number of edges supported by the training examples in the Bayesian network classifier, they adopted a Bayes factor approach, yielding more realistic networks.

Wang et al. (2021) proposed a system for monitoring general population sentiment from a social media stream (Twitter) through comprehensive multilevel filters and an improved latent Dirichlet allocation (LDA) method for sentiment classification. They reached an accuracy of 68% for general sentiment analysis on real-world content. They used one dataset with three categories (positive, negative, and neutral) and another with four categories (positive, negative, neutral, and junk).

Ali et al. (2021) utilized bilingual (English and Urdu) data from Twitter and news websites to perform sentiment and emotion classification using machine learning and deep learning models. Kaur and Sharma (2020) used the Twitter API to collect coronavirus-related tweets and categorized them into three groups (positive, negative, and neutral) to investigate people's feelings about the COVID-19 pandemic. Nuser et al. (2022) proposed an unsupervised learning framework based on a serial ensemble of hierarchical clustering methods for sentiment analysis on a binomial dataset collected from Twitter.

Machuca et al. (2021) performed sentiment analysis on English COVID-19 pandemic tweets using a logistic regression algorithm on a binomial dataset with positive and negative labels.

In Table 1, we review state-of-the-art approaches and their reported accuracy for sentiment classification with binomial (positive and negative) and polynomial (positive, negative, and neutral) data structures.

Table 1 Comparison of sentiment analysis approaches

3 Methodology

This section introduces our research framework in four phases: data collection, preprocessing, sentiment detection, and classification modeling (Fig. 1).

Fig. 1

Overview of proposed sentiment classification workflow

3.1 Data collection

Twitter is among the most popular social networking platforms today. It provides users with a platform to share their daily lives with other users and to express opinions on various national and international issues from different perspectives. Each user can write a short text called a tweet, with a maximum length of 280 characters (140 before November 2017). These opinions and comments can be used to raise public awareness and to help governments and enterprises understand the views of the public. Twitter can also be used to predict event trends. Tweets are therefore an important resource for studying public awareness.

Researchers and practitioners can access Twitter data using the Twitter API. The search and streaming APIs allow them to collect Twitter data using different types of queries, including keywords and user profiles, offering opportunities to access the data needed to analyze challenging problems in diverse domains. Thus, many researchers and practitioners have begun to focus on Twitter data mining to obtain more research and business value from these data (Li et al. 2019).

For our experiments, in order to collect tweets, we selected a few recent events and issues and searched for coronavirus-related keywords such as #covid-19, #coronavirus, #covid19vaccine, etc. A total of 14,000 tweets were extracted using the Twitter API, 6980 of which were written in English; we kept only these tweets. Since these tweets were sentences, we preprocessed them and converted them to sets of words, which were then represented in a form understandable by the classifiers. In the following sections, we elaborate on this procedure.
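As an illustration, the collection step can be scripted against the Twitter API. The following is a minimal sketch using the tweepy library (our choice for illustration; the paper does not name a client library), with placeholder credentials:

```python
# Illustrative collection script using tweepy (our assumption; the paper
# does not name a client library). Credentials below are placeholders.
import tweepy

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Keywords as described above; excluding retweets is our addition
# to reduce duplicates.
query = "#covid-19 OR #coronavirus OR #covid19vaccine -filter:retweets"

# Collect up to 14,000 tweets, then keep only those written in English,
# mirroring the filtering step described in the text.
statuses = tweepy.Cursor(
    api.search_tweets, q=query, tweet_mode="extended"
).items(14000)
english_tweets = [s.full_text for s in statuses if s.lang == "en"]
```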

3.2 Preprocessing

Tweets are sometimes not in a usable format; for instance, they may include special characters, symbols, or emoticons. Therefore, we need to convert them into an appropriate usable form to be able to extract meaningful opinions from them. As a first step in preprocessing, most (if not all) studies apply tokenization, the task of separating a full text string into a list of separate words. Tokenization is defined as a kind of lexical analysis that breaks a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. At its core, tokenization is a standard preparation for further natural language processing (NLP) transformations (Symeonidis et al. 2018). Various preprocessing methods have been proposed and can be applied for data cleaning. The following are the data preprocessing steps used in this article:

  • All non-English tweets are eliminated.

  • User names preceded by ‘@’ and external links are omitted.

  • The '#' symbol is removed from all hashtags (the hashtag text itself is kept).

  • Stop-words or useless words are removed from the tweet.

  • All emoticons (e.g., :-), :-( ) are removed.

  • All tweets are converted to lower case to make the dataset uniform.
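To make these steps concrete, the following is a minimal cleaning sketch in Python (our own illustration; the paper does not specify its implementation, and the stop-word list here is deliberately tiny):

```python
import re

# A deliberately tiny stop-word list for illustration; in practice a fuller
# list (e.g., NLTK's English stop words) would be used.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}

def clean_tweet(text: str) -> list[str]:
    text = re.sub(r"@\w+", "", text)                # drop @user mentions
    text = re.sub(r"http\S+|www\.\S+", "", text)    # drop external links
    text = text.replace("#", "")                    # drop only the '#' symbol
    text = re.sub(r"[:;=8][-o*']?[)(\][dDpP/\\]", "", text)  # common emoticons
    text = text.lower()                             # uniform lower case
    tokens = re.findall(r"[a-z']+", text)           # tokenize into words
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tweet("@WHO Great news!! :-) New #covid19vaccine doses arrive https://t.co/xyz"))
# -> ['great', 'news', 'new', 'covid', 'vaccine', 'doses', 'arrive']
```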

3.3 Detection

Each tweet should be labeled with one of three sentiment values: negative, neutral, or positive. Given the size of our dataset, the first step in labeling the tweets is to apply an unsupervised method. For this purpose, we used the TextBlob library in the Python programming language, which assigns each tweet a polarity score between −1 and +1 (−1 being the most negative and +1 the most positive value). We then double-checked the labels manually. Tweets with scores in [−1, −0.1), [−0.1, +0.1], and (+0.1, +1] were labeled negative, neutral, and positive, respectively. Figure 2 illustrates the results of the sentiment analysis, and the number of tweets in each class is shown in Table 2. Of the 6980 tweets in total, 977 are negative, 3689 are neutral, and 2314 are positive.

Fig. 2

Sentiment proportion of dataset

Table 2 Dataset structure
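For illustration, the labeling step with TextBlob and the thresholds above can be sketched as follows (the function name and example tweets are ours):

```python
# Sketch of the unsupervised labeling step using TextBlob; the thresholds
# match those given in the text, and the function name is our own.
from textblob import TextBlob

def label_tweet(text: str) -> str:
    polarity = TextBlob(text).sentiment.polarity  # score in [-1, +1]
    if polarity < -0.1:
        return "negative"
    if polarity > 0.1:
        return "positive"
    return "neutral"

print(label_tweet("I love the new vaccine rollout"))  # -> positive
print(label_tweet("The lockdown news is terrible"))   # -> negative
```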

3.4 Classification modeling

For our experiments, and in order to make a comparative analysis, we employed four of the most widely used classifiers for sentiment analysis, namely (1) K-nearest neighbor (KNN), (2) decision tree (DT), (3) support vector machine (SVM), and (4) naive Bayes (NB), along with two ensemble methods, voting and bagging.

3.4.1 K-nearest neighbor

The logic behind KNN classification is that we expect a test sample X to have the same label as the training samples located in the local region surrounding X, the size of which is determined by the parameter K. Training a KNN classifier simply consists of determining K: KNN memorizes all samples in the training set and then compares a test sample against them.

3.4.2 Decision tree

The decision tree is a particularly efficient method for producing classifiers from data. It is a tree-like collection of nodes intended to decide a value's affiliation to a class or to estimate a numerical target value. Each node represents a splitting rule for one specific attribute; for classification, this rule separates values belonging to different classes. New nodes are built repeatedly until the stopping criteria are met. The class label predicted at a leaf is determined by the majority of the training examples that reached that leaf during tree generation.

3.4.3 Support vector machine

An SVM is a supervised learning algorithm that creates learning functions from a set of labeled training data. It solves the traditional text categorization problem effectively. The main principle of SVMs is to determine a linear separator that separates the classes in the search space with maximum distance. The SVM classification function is based on the concept of decision planes that define decision boundaries between classes of samples. The main idea is that the decision boundary should be as far away as possible from the data points of both classes; among all separating hyperplanes, there is only one that maximizes this margin.

3.4.4 Naive Bayes

The naive Bayes method is one of the most widely used methods for text data classification. It is a simple probabilistic classifier that uses the concept of mixture models to perform classification, relying on the assumption that each of the predefined classes is one of the components of the mixture. The components of the mixture model denote the probability that any term belongs to the particular component. The naive Bayes classifier uses Bayes' theorem and finds the maximum likelihood of the probability of any word belonging to a particular predefined class. The algorithm assumes that the elements in the dataset are independent of each other and that their occurrences in different datasets indicate their relevance to certain data attributes (Desai and Mehta 2016). This method is a high-bias, low-variance classifier, and it can build a good model even with a small dataset. Typical use cases involve text categorization, including spam detection, sentiment analysis, and recommender systems.

3.4.5 Ensemble methods

Ensemble methods are learning algorithms that try to improve predictive performance by employing a set of learning algorithms. They reduce the bias and variance of the model and are therefore more reliable than a single classifier (Dietterich 2000). The voting method can be used with different combination sets of classifiers; we applied it with the combination of all four classifiers to get the maximum accuracy. We also used the bagging method with DT (this combination has generally shown good performance), as well as bagging with SVM, KNN, and NB.
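Although our experiments were carried out in RapidMiner Studio (Sect. 4), the same modeling setup can be sketched with scikit-learn; the snippet below illustrates the four single classifiers and the two ensemble methods rather than reproducing our exact configuration (the value of K and the number of bagging estimators are assumptions):

```python
# Illustrative scikit-learn equivalents of the four single classifiers and
# the two ensemble methods; the experiments themselves were run in
# RapidMiner Studio, so this is a sketch rather than the exact setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.pipeline import make_pipeline

single = {
    "KNN": KNeighborsClassifier(n_neighbors=5),  # K = 5 is assumed here
    "DT": DecisionTreeClassifier(),
    "SVM": SVC(kernel="linear"),
    "NB": MultinomialNB(),
}

# Majority voting over all four single classifiers.
voting = VotingClassifier(estimators=list(single.items()), voting="hard")

# Bagging with a decision tree base learner; SVM, KNN, or NB can be
# substituted as the base estimator in the same way.
# (The keyword is `base_estimator` in scikit-learn < 1.2.)
bagging_dt = BaggingClassifier(estimator=DecisionTreeClassifier(),
                               n_estimators=10)

# Each learner is trained on TF-IDF features of the cleaned tweets, e.g.:
model = make_pipeline(TfidfVectorizer(), voting)
# model.fit(train_texts, train_labels)
```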

3.4.6 Evaluation metric

3.4.6.1 Accuracy
$$\mathrm{accuracy}= \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$

TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

3.4.6.2 AUC

The area under the curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the receiver operating characteristic (ROC) curve. The higher the AUC, the better the model distinguishes between the positive and negative classes.

3.4.6.3 Kappa

Kappa is a metric that compares the observed accuracy with the accuracy expected by chance.
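Concretely, kappa is computed as

$$\kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}$$

where \(p_{o}\) is the observed agreement (the accuracy) and \(p_{e}\) is the agreement expected by chance. All three metrics are available off the shelf; the snippet below is an illustration using scikit-learn (our reported values were computed in RapidMiner Studio):

```python
# Illustrative metric computation with scikit-learn; the reported values
# in this paper were produced by RapidMiner Studio.
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

def evaluate(y_true, y_pred, y_score=None):
    results = {
        "accuracy": accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }
    # AUC needs classifier scores and is reported for the binomial dataset.
    if y_score is not None:
        results["auc"] = roc_auc_score(y_true, y_score)
    return results

# Toy example with hypothetical labels and scores:
print(evaluate([1, 0, 1, 1], [1, 0, 0, 1], y_score=[0.9, 0.2, 0.4, 0.8]))
```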

To start the classification, we divided the dataset into a training set and a testing set with different percentages of data. Ratios commonly used in the literature are 70% or 60% of the dataset for training and 30% or 40% for testing. In our experiments, we used train–test split percentages ranging from 10 to 70%. We also used K-fold cross-validation (K-FCV) with K = 10 to generate the training and testing sets, and we compared the results with the above-mentioned split ratios.
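Continuing the scikit-learn sketch above (reusing its `model`, and assuming tweet texts `X` with labels `y` from the preprocessing step; the fixed random seed is our addition for reproducibility), this protocol can be expressed as:

```python
# Sketch of the evaluation protocol: training ratios from 10% to 70%,
# plus 10-fold cross-validation.
from sklearn.model_selection import cross_val_score, train_test_split

for train_ratio in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_ratio, random_state=42
    )
    model.fit(X_train, y_train)
    print(f"train ratio {train_ratio:.0%}: "
          f"accuracy {model.score(X_test, y_test):.4f}")

# 10-fold cross-validation (K-FCV with K = 10) for comparison.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"10-FCV mean accuracy: {scores.mean():.4f}")
```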

In this paper, first, the above-mentioned classifiers were applied to a dataset with just negative and positive tweets (binomial), and then, the classifiers were applied to a dataset including negative, positive, and neutral tweets (polynomial).

4 Result analysis

This section gives an overview of the accuracy rates of the different trained classifiers. All calculations were done in the RapidMiner Studio application.

Table 3 shows the predicted accuracy of all classifiers on the binomial dataset. The results demonstrate that K-FCV with K = 10 yields the highest accuracy rate (except for DT) compared with the train–test split procedure. SVM, with 86.42%, has the best accuracy rate among the single methods, and voting, with 86.75%, among the ensemble methods. Table 4 shows the differences between the accuracy rates. For most algorithms, accuracy decreases slightly when 60% of the dataset is used for training; for some methods, this decrease also appears at a 40% training ratio. Furthermore, for all methods, the largest increase in accuracy occurs when moving from a 10% to a 20% training ratio. The NB algorithm, with +9.15%, and bagging with NB, with +9.62%, show the largest variation in accuracy across the 10–70% train–test split percentages.

Table 3 Sentiment accuracy comparison on binomial dataset
Table 4 Sentiment accuracy differences on binomial dataset

Table 5 shows the predicted AUC for the binomial dataset. SVM and bagging with SVM have the best values. We can also see that 10-FCV yields better results than the split procedure. From Table 6, there is some reduction when 60% of the dataset is used for training compared with 50%. The increase in AUC from a 10% to a 20% training ratio is larger than for the other ratios.

Table 5 Sentiment AUC comparison on binomial dataset
Table 6 Sentiment AUC differences on binomial dataset

The classification continued with the polynomial dataset, so we applied the classifiers to the dataset with three categories: positive, negative, and neutral tweets. Tables 7, 8, 9, and 10 compare the classifiers in terms of accuracy and kappa on the polynomial data. According to these tables, accuracy and kappa decrease slightly for most classifiers when 60% of the dataset is used for training compared with 50%, and in some cases there is only a small increase. SVM and bagging with SVM have better results than the other classifiers; SVM, with an accuracy of 73.91%, is the best choice for polynomial classification. However, bagging with SVM is a more reliable model than SVM alone, since the ensemble technique makes the learning model more reliable by reducing variance and bias. Tables 7 and 10 show that the largest positive variation occurs from a 10% to a 20% training ratio in both accuracy and kappa.

Table 7 Sentiment accuracy comparison on polynomial dataset
Table 8 Sentiment accuracy differences on polynomial dataset
Table 9 Sentiment Kappa comparison on polynomial dataset
Table 10 Sentiment Kappa differences on polynomial dataset

From the results for accuracy and AUC on the binomial dataset (Tables 3, 4, 5, 6) and for accuracy and kappa on the polynomial dataset (Tables 7, 8, 9, 10), we observe that SVM and bagging with SVM outperform the other classifiers, although the accuracy of polynomial classification is lower than that of binomial classification. A likely reason for SVM's superior performance is the sparse nature of text data: few features are irrelevant, the relevant features tend to be correlated with each other, and they group into distinct categories that can be separated by linear separators. We can also see that with a 50% train–test split, most classifiers achieve almost the same accuracy (Figs. 3 and 4), AUC, and kappa rates as with 70%, while 10-FCV reaches better results.

Fig. 3

Classification accuracy on binomial dataset

Fig. 4

Classification accuracy on polynomial dataset

We also compared the performance of SVM under 10-FCV with the state-of-the-art approaches presented in Table 1. The results show that the overall accuracy improved by at least 3.52% and 5.91% on the binomial and polynomial datasets, respectively. This improvement may result from dividing the training and testing data through the K-fold cross-validation method.

5 Conclusion

In this paper, we analyzed the sentiment of social media data, specifically Twitter data, using both single classifiers and ensemble models combined with single classifiers on two datasets: a binomial (positive and negative) and a polynomial (positive, negative, and neutral) dataset.

From the results, we observed that data mining is a good choice for sentiment prediction, since the achieved accuracy rates are relatively high. We also reviewed four classifiers—SVM, K-nearest neighbor, decision tree, and naive Bayes—and two ensemble methods, voting and bagging.

From the results, we concluded that among the single classifiers and their combinations with ensemble methods, SVM achieved improvements of 3.53% and 7.41% on the binomial and polynomial datasets, respectively. Although the ensemble methods do not outperform the single methods, they can decrease the bias or variance of the learning models and reduce the generalization error. There is therefore a trade-off between the reliability and the accuracy of the algorithm.

Our results show that using 50% of the dataset as training data produces almost the same results as using 70%, while 10-FCV produces better results. This holds for the accuracy and AUC rates on the binomial dataset and for the accuracy and kappa rates on the polynomial dataset.

In future studies, we will apply other ensemble methods, such as boosting and stacking, combined with other classifiers as well as single classifiers. Furthermore, we will attempt to improve our dataset by selecting other keywords covering both negative and positive sentiments and by extracting more tweets to increase the dataset size.