Table 3 presents the high-level categories of the primary studies selected for this systematic review, as discussed in Sect. 2.4.
Table 3 Categories of primary studies

It must be noted that not all the published papers were considered in the analysis conducted; therefore, this table is referenced across the different aspects of the data synthesised, as presented below. It presents the primary studies returned from each electronic library together with the additional ones, and separately identifies the ones without full access, survey papers, papers presenting work that can be applied/used on social data, and papers originating from organised tasks within the domain.
The in-depth analysis, which focused on the social media platforms, techniques, social datasets, language, modality, tools and technologies, NLP tasks and other aspects used across the published papers, is presented in Sects. 3.1–3.7.
Social media platforms
Social data refers to online data generated from any type of social media platform, be it microblogging, social networking, blogging, photo/video sharing or crowdsourcing. Given that this systematic survey focuses on opinion mining approaches that make use of social networking and microblogging services, we identify the social media platforms used in the studies within this review.
In total, 469 studies were evaluated with 66 from ACM, 155 from IEEE Xplore, 32 from ScienceDirect, 182 from SpringerLink and 34 additional ones. Papers which did not provide full access were excluded. Note that 4 survey papers—2 from ACM (Giachanou and Crestani 2016; Zimbra et al. 2018), 1 from IEEE Xplore (Wagh and Punde 2018), 1 from SpringerLink (Abdullah and Hadzikadic 2017)—and 2 SpringerLink organised/shared task papers (Loukachevitch and Rubtsova 2015; Patra et al. 2015) were included, since the former focus on Twitter Sentiment Analysis methods whereas the latter focus on Sentiment Analysis of tweets (therefore the target social media platform of all evaluated papers is clear in both cases). None of the other 14 survey papers (Rajalakshmi et al. 2017; Yenkar and Sawarkar 2018; Abdelhameed and Muñoz-Hernández 2017; Rathan et al. 2017; Liu and Young 2018; Zhang et al. 2018; Ravi and Ravi 2015; Nassirtoussi et al. 2014; Beigi et al. 2016; Lo et al. 2017; Ji et al. 2016; Batrinca and Treleaven 2015; Li et al. 2014; Lin and He 2014) was included, since various social media platforms were used in the respective studies evaluated. In addition, 2 papers that presented a general approach which can be applied/used on social data (i.e., not tied to any particular source) (Min et al. 2013; El Haddaoui et al. 2018) have also not been included.
Out of these studies, 429 made use of 1 social media platform, whereas 32 made use of 2–4 social media platforms, as can be seen in Fig. 2.
With respect to social media platforms, a total of 504 platform usages were recorded across all of the studies. These span the following 18 different platforms, which are also listed in Table 4:
1. Twitter: a microblogging platform that allows publishing of short text updates ("microposts");
2. Sina Weibo: a Chinese microblogging platform that is like a hybrid of Twitter and Facebook;
3. Facebook: a social networking platform that allows users to connect and share content with family and friends online;
4. YouTube: a video-sharing platform;
5. Tencent Weibo: a Chinese microblogging platform;
6. TripAdvisor: a travel platform that allows people to post their reviews about hotels, restaurants and other travel-related content, besides offering accommodation bookings;
7. Instagram: a platform for sharing photos and videos from a smartphone;
8. Flickr: an image- and video-hosting platform that is popular for sharing personal photos;
9. Myspace: a social networking platform for musicians and bands to show and share their talent and connect with fans;
10. Digg: a social bookmarking and news aggregation platform that selects stories for a specific audience;
11. Foursquare: formerly a location-based service and nowadays a local search and discovery mobile application known as Foursquare City Guide;
12. Stocktwits: a social networking platform for investors and traders to connect with each other;
13. LinkedIn: a professional networking platform that allows users to communicate and share updates with colleagues and potential clients, and supports job searching and recruitment;
14. Plurk: a social networking and microblogging platform;
15. Weixin: a Chinese multi-purpose messaging and social media app developed by Tencent;
16. PatientsLikeMe: a health information sharing platform for patients;
17. Apontador: a Brazilian platform that allows users to share their opinions and photos on social networks and also book hotels and restaurants;
18. Google+: formerly a social networking platform (shut down in April 2019) whose features included posting photos and status updates, grouping different relationship types into Circles, organising events and location tagging.
Table 4 Social media platforms used in the studies

Overall, Twitter was the most popular platform, with 371 opinion mining studies making use of it, followed by Sina Weibo with 46 and Facebook with 30. Other popular platforms such as YouTube (12), Tencent Weibo (8), TripAdvisor (7), Instagram (6) and Flickr (5) were also used in a few studies. These results show the importance and popularity of microblogging platforms, such as Twitter and Sina Weibo, which are very frequently used for research and development purposes in this domain. Such microblogging platforms provide researchers with an Application Programming Interface (API) to access social data, which plays a crucial role in their selection for studies. On the other hand, data retrieval from other social media platforms, such as Facebook, is becoming more challenging due to ethical concerns. For example, access to Facebook's Public Feed API is restricted and users cannot apply for it.
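To illustrate the API-based collection route on which most of these studies rely, the following is a minimal sketch assuming the Tweepy Python library (v4 or later); the bearer token and query string are placeholders, not taken from any reviewed study:

```python
import tweepy  # assumes Tweepy v4+ (pip install tweepy)

# Placeholder credentials: a real bearer token must be obtained
# from the platform's developer portal.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Retrieve up to 100 recent English tweets matching a keyword,
# excluding retweets; the query itself is purely illustrative.
response = client.search_recent_tweets(
    query="climate -is:retweet lang:en", max_results=100
)
for tweet in response.data or []:
    print(tweet.id, tweet.text)
```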
Techniques
For this analysis, 465 studies were evaluated: 65 from ACM, 154 from IEEE Xplore, 32 from ScienceDirect, 180 from SpringerLink and 34 additional ones. The studies excluded are the ones with no full access, surveys and organised task papers. The main aim was to identify the technique(s) used for the opinion mining process on social data. Therefore, the studies were categorised under the following approaches: Lexicon (Lx), Machine Learning (ML), Deep Learning (DL), Statistical (St), Probabilistic (Pr), Fuzziness (Fz), Rule (Rl), Graph (Gr), Ontology (On), Hybrid (Hy) (a combination of more than one technique), Manual (Mn) and Other (Ot). Table 5 provides the yearly statistics for all the respective approaches adopted.
Table 5 Approaches used in the studies analysed

From the studies analysed, 88 developed and used more than 1 technique within their respective studies. These techniques include the ones originally used in their approach and/or ones used for comparison/baseline/experimentation purposes. In particular, from these 88 studies, 65 used 2 techniques each, 17 studies used 3 techniques, 4 studies used 4 techniques, and 2 studies made use of 5 techniques, for a total of 584 techniques used across all studies (including the studies that used 1 technique). The results show that a hybrid approach is the most popular one, with over half of the studies adopting such an approach. This is followed by Machine Learning and Lexicon techniques, which are usually chosen to perform any form of opinion mining. These results are explained in more detail in the sub-sections below.
Lexicon
In total, 94 unique studies adopted a lexicon-based approach to perform a form of SOM, producing a total of 96 different techniques. The majority of the lexicons used were specifically related to opinions and are well known in this domain, whereas the remaining ones, though not opinion-specific, can still be used for conducting opinion mining.
Table 6 Lexicon-based studies

Table 6 presents the number of lexicons (first row, columns titled 1–8) used by the lexicon-based studies (second row). The column titled "Other/NA" covers studies that only used other general lexicons, such as acronym dictionaries, intensifier words, downtoner words, negation words and internet slang, and/or studies which do not provide any information on the exact lexicons used.
The majority of the lexicon-based studies used one or two lexicons; in total, 144 state-of-the-art lexicons (55 unique ones) were used across these studies. The following are the top six lexicons based on use (a usage sketch for the most popular one follows the list):
1. SentiWordNet (Baccianella et al. 2010)—used in 22 studies;
2. Hu and Liu (Hu and Liu 2004)—used in 12 studies;
3. AFINN (Årup Nielsen 2011) and SentiStrength (Thelwall et al. 2012)—used in 9 studies each;
4. MPQA—Subjectivity (Wilson et al. 2005)—used in 8 studies;
5. HowNet Sentiment Analysis Word Library (HowNetSenti)—used in 6 studies;
6. NRC Word-Emotion Association Lexicon (also known as NRC Emotion Lexicon or EmoLex) (Mohammad and Turney 2010, 2013), WordNet (Miller 1995) and Wikipedia's list of emoticons—used in 5 studies each.
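As an indication of how such lexicons are typically consulted, the following sketch scores a word with SentiWordNet through NLTK's corpus reader; the naive averaging over word senses is our own simplification, not the scoring scheme of any particular study:

```python
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

def word_polarity(word, pos="a"):
    """Average positive-minus-negative score over all senses of `word`."""
    synsets = list(swn.senti_synsets(word, pos))
    if not synsets:
        return 0.0  # out-of-vocabulary words are treated as neutral
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

print(word_polarity("good"))   # > 0 (positive)
print(word_polarity("awful"))  # < 0 (negative)
```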
In addition to the lexicons mentioned above, 19 studies used lexicons that they created as part of their work or focused specifically on creating SOM lexicons. Examples include Årup Nielsen (2011), who created the AFINN word list for sentiment analysis in microblogs; Javed et al. (2014), who built a bilingual sentiment lexicon for English and Roman Urdu; Santarcangelo et al. (2015), creators of the first Italian sentiment thesaurus; Wu et al. (2016) for Chinese sentiment analysis; and Bandhakavi et al. (2016) for sentiment analysis on Twitter. These lexicons varied from social media-focused lexicons (Tian et al. 2015; Ghiassi and Lee 2018; Pollacci et al. 2017), to sentiment and/or emoticon lexicons (Jurek et al. 2014; Molina-González et al. 2014; Khuc et al. 2012; Ranjan et al. 2018; Vo et al. 2017; Feng et al. 2015; Wang and Wu 2015; Zhou et al. 2014), and extensions of existing state-of-the-art lexicons (Li et al. 2016; Pandarachalil et al. 2015; Andriotis et al. 2014), such as Li et al. (2016), who extended HowNetSenti with words manually collected from the internet, and Pandarachalil et al. (2015), who built a sentiment lexicon from SenticNet (Cambria et al. 2020) and SentiWordNet for slang words and acronyms.
Machine learning
A total of 121 studies adopted a machine learning-based approach to perform a form of SOM, where several supervised and unsupervised algorithms were used. Table 7 below presents the number of machine learning algorithms (first row, columns titled 1–7) used by the machine learning-based studies (second row). The column titled "NA" refers to studies that do not provide any information on the exact algorithms used.
Table 7 Machine learning-based studies

In total, 239 machine learning algorithms were used (not distinct) across 117 studies (since 4 studies did not provide any information), with 235 being supervised and 4 unsupervised. It is important to note that this figure does not include any supervised/semi-supervised/unsupervised algorithms proposed by the respective authors; these algorithms are discussed below.
Table 8 Supervised machine learning algorithms

Table 8 provides a breakdown of the 235 supervised machine learning algorithms (not distinct) used within these studies. The NB and SVM algorithms are clearly the most popular in this domain, especially for text classification. With respect to the former, it is important to note that 20 out of the 75 studies used the Multinomial NB (MNB), a model usually utilised for discrete counts, i.e., the number of times a given term (word or token) appears in a document. The other 55 studies made use of the Multi-variate Bernoulli NB (MBNB) model, which is based on binary data, where every token in a document's feature vector takes the value 0 or 1. As for SVM, this method sorts the given data into two categories (binary classification). If multi-class classification is required, the Support Vector Classification (SVC), NuSVC or LinearSVC algorithms are usually applied, where the "one-against-one" approach is implemented for SVC and NuSVC, whereas the "one-vs-the-rest" multi-class strategy is implemented for LinearSVC.
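The distinction between the two NB variants and the multi-class SVM strategies can be illustrated with scikit-learn, which implements all three classifiers; the toy corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

posts = ["great phone, love it", "terrible battery, hate it",
         "best purchase ever", "worst service ever"]
labels = ["pos", "neg", "pos", "neg"]

# MNB models term counts, BernoulliNB binarises them (0/1 per token),
# and LinearSVC applies the one-vs-the-rest multi-class strategy.
X = CountVectorizer().fit_transform(posts)
for clf in (MultinomialNB(), BernoulliNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X[:1]))
```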
The LoR statistical technique is also widely used in machine learning for binary classification problems. In total, 16 of the studies analysed made use of this algorithm. DT learning, which applies a DT to both classification and regression problems, has also been widely used. There are various algorithms for building a DT: 2 studies used C4.5 (Quinlan 1993), an extension of Quinlan's Iterative Dichotomiser 3 (ID3) algorithm used for classification purposes; 3 studies used J48, a simple C4.5 DT for classification (Weka's implementation); 2 used the Hoeffding Tree (Hulten et al. 2001); and the other 8 used the basic ID3 algorithm.
MaxEnt, used by 12 studies, is a probabilistic classifier that is also used for text classification problems, such as sentiment analysis. More specifically, it is a generalisation of LoR for multi-class scenarios (Yu et al. 2011). RF was used in 9 studies; this supervised learning algorithm, which can be used for both classification and regression tasks, creates a forest (an ensemble of DTs) whose construction is randomised. Moreover, 7 studies used the KNN algorithm, one of the simplest classification algorithms, where no learning is required since the model structure is determined from the entire dataset.
The SentiStrength algorithm, utilised by 5 studies (Gonçalves et al. 2013; Lu et al. 2015; Baecchi et al. 2016; Yan et al. 2017; Zhang et al. 2018), can be used in both supervised and unsupervised cases, since the authors developed a version for each learning case. Conditional Random Fields, used by 4 studies (Pak and Paroubek 2010; Zhang et al. 2014; Wang et al. 2016; Hao et al. 2017), are a type of discriminative classifier that models the decision boundary amongst different classes, whereas LiR was also used by 4 studies (Bollen et al. 2011; Pavel et al. 2017; Adibi et al. 2018; Xiaomei et al. 2018). Moreover, 3 studies each used the SANT (Ou et al. 2014; Lu 2015; Xiaomei et al. 2018) and SGD (Bifet and Frank 2010; Juneja and Ojha 2017; Sánchez-Holgado and Arcila-Calderón 2018) algorithms, with the former being mostly used for comparison against the approaches proposed by the respective authors.
In addition, the PA algorithm was used in 2 studies (Li et al. 2014; Filice et al. 2014). In the former (Li et al. 2014), this algorithm was used in a collaborative online learning framework to automatically classify whether a post is emotional or not, thereby overcoming challenges posed by the diversity of microblogging styles, which increases the difficulty of classification. The authors in the latter study (Filice et al. 2014) extend the budgeted PA algorithm to enable robust and efficient natural language learning processes based on semantic kernels. The proposed online learner was applied to two real-world linguistic tasks, one of which was sentiment analysis.
Nine other algorithms were used by 7 different studies, namely: Bagging (Sygkounas et al. 2016), BN (Lu et al. 2016), CRB (Raja and Swamynathan 2016), AB (Raja and Swamynathan 2016), HMM (Zhang et al. 2014), Dictionary Learning (Asiaee et al. 2012), NBSVM (Sun et al. 2017), MCC (Çeliktuğ 2018) and ICO (Çeliktuğ 2018).
In terms of unsupervised machine learning algorithms, 4 were used in 2 of the 80 studies that used a machine learning-based technique. Suresh and Raj S. used the K-Means (KM) (Lloyd 1982) and Expectation Maximization (Dempster et al. 1977) clustering algorithms in Suresh (2016), both for comparison against an unsupervised modified fuzzy clustering algorithm proposed by the authors. The proposed algorithm produced accurate results without the manual processing, linguistic knowledge or training time that supervised approaches require. Baecchi et al. (2016) used two unsupervised algorithms, namely Continuous Bag-Of-Words (CBOW) (Mikolov et al. 2013) and Denoising Autoencoder (DA) (Vincent et al. 2008) (the SGD and backpropagation algorithms were used for the DA learning process), as well as supervised ones, namely LoR, SVM and SentiStrength, for constructing their method and for comparison purposes. They considered both textual and visual information in their work on sentiment analysis of social network multimedia. Their proposed unified model (CBOW-DA-LoR) works in both an unsupervised and semi-supervised manner, learning text and image representations as well as a sentiment polarity classifier for tweets containing images.
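As a minimal sketch of the unsupervised clustering route, the following uses scikit-learn's KMeans over tf-idf vectors (not the modified fuzzy clustering algorithm proposed in Suresh (2016)):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["love this phone", "really love it",
         "hate this phone", "really hate it"]

# Cluster unlabelled posts into two groups; which cluster corresponds
# to which polarity must still be decided after the fact.
X = TfidfVectorizer().fit_transform(posts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```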
Other studies proposed their own algorithms, with some of the established algorithms discussed above playing an important role in their implementation and/or comparison. Zimmermann et al. proposed a semi-supervised algorithm, the S*3Learner (Zimmermann et al. 2014), which suits changing opinion stream classification environments, where the vector of words evolves over time, with new words appearing and old words disappearing. Severyn et al. (2016) defined a novel and efficient tree kernel function, the Shallow syntactic Tree Kernel, for multi-class supervised sentiment classification of online comments. This study focused on YouTube, which is multilingual, multimodal, multidomain and multicultural, with the aim of finding whether the polarity of a comment is directed towards the source video, the product described in the video or another product. Furthermore, Ignatov and Ignatov (2017) presented a novel DT-based algorithm, the Decision Stream, where Twitter sentiment analysis was one of several common machine learning problems it was evaluated on. Lastly, Fatyanosa et al. (2018) enhanced the NB classifier with an optimisation algorithm, the Variable Length Chromosome Genetic Algorithm (VLCGA), proposing VLCGA-NB for Twitter sentiment analysis.
Moreover, the following 13 studies proposed an ensemble method or evaluated ensemble-based classifiers:
- Çeliktuğ (2018) used two ensemble learning methods, RF and MCC (amongst other machine learning algorithms), for sentiment classification of Twitter datasets;
- Yan et al. (2017) presented two ensemble learners built on four off-the-shelf classifiers, for Twitter sentiment classification;
- Zhang et al. (2018), Adibi et al. (2018), Çeliktuğ (2018), Vora and Chacko (2017), Lu et al. (2016), Rexha et al. (2016), Xie et al. (2012) and Zhang et al. (2011) used the RF ensemble learning method in their work;
- Troussas et al. (2016) evaluated the most common ensemble methods that can be used for sentiment analysis on Twitter datasets;
- Sygkounas et al. (2016) proposed an ensemble system composed of five state-of-the-art sentiment classifiers;
- Le et al. (2014) used multiple oblique decision stump classifiers to form an ensemble of classifiers, which is more accurate for classifying tweets than a single one;
- Neethu and Rajasree (2013) used an ensemble classifier (and single-algorithm classifiers) for sentiment classification.
Ensembles usually provide more accurate classifications than individual classifiers, i.e., classic learning approaches. In addition, ensembles reduce the overall risk of choosing a wrong classifier, especially when applying it to a new dataset (Da Silva et al. 2014).
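A hedged sketch of such an ensemble, using scikit-learn's VotingClassifier for hard (majority) voting; the base learners and toy data are illustrative choices, not those of any cited study:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

posts = ["good product", "bad product", "great experience", "awful experience"]
labels = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(posts)

# Hard voting: the ensemble prediction is the majority vote of the
# three heterogeneous base classifiers.
ensemble = VotingClassifier(estimators=[
    ("nb", MultinomialNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
], voting="hard")
ensemble.fit(X, labels)
print(ensemble.predict(X[:2]))
```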
Deep learning
Deep learning is a subset of machine learning based on Artificial Neural Networks (ANNs), algorithms inspired by the human brain, where connections, layers and neurons allow data to propagate. A total of 35 studies adopted a deep learning-based approach to perform a form of SOM, where supervised and unsupervised algorithms were used. Twenty-six (26) of the studies made use of 1 deep learning algorithm, with 5 utilising 2 algorithms, and 2 studies each using 3 and 4 algorithms, respectively. Table 9 provides a breakdown of the 50 deep learning algorithms (not distinct) used within these studies.
Table 9 Deep learning algorithms

LSTM, a prominent RNN variant that makes it easier to retain past information in memory, was used in 13 studies (Yan et al. 2018; Sun et al. 2018; Sanyal et al. 2018; Ameur et al. 2018; Wazery et al. 2018; Li et al. 2018; Chen and Wang 2018; Chen et al. 2018; Sun et al. 2017; Hu et al. 2017; Shi et al. 2017; Wang et al. 2016; Yan and Tao 2016), making it the most popular deep learning algorithm amongst the evaluated studies. Three further studies (Ameur et al. 2018; Balikas et al. 2017; Wang et al. 2016) used the BLSTM, an extension of the traditional LSTM that can improve model performance on sequence classification problems (an indicative sketch of such a model follows). In particular, a BLSTM was used in Balikas et al. (2017) to improve the performance of fine-grained sentiment classification, an approach that can benefit sentiment expressed in different textual types (e.g., tweets and paragraphs), in different languages and at different granularity levels (e.g., binary and ternary). Similarly, Wang et al. (2016) proposed a language-independent method based on BLSTM models that incorporates preceding microblogs for context-aware Chinese sentiment classification.
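The following is the indicative BLSTM sketch referred to above, written with the Keras API; the vocabulary size, layer dimensions and binary output are illustrative assumptions:

```python
import tensorflow as tf

# Token ids in, binary polarity out; the vocabulary size and layer
# dimensions are illustrative, not taken from any reviewed study.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```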
The CNN algorithm, a variant of the ANN, is made up of neurons that have learnable weights and biases, where each neuron receives an input, performs a dot product and optionally follows it with a non-linearity. In total, 12 studies (Sun et al. 2018; Ochoa-Luna and Ari 2018; Ameur et al. 2018; Adibi et al. 2018; Chen and Wang 2018; Napitu et al. 2017; Shi et al. 2017; Wehrmann et al. 2017; Zhang et al. 2017; Stojanovski et al. 2015; Wang et al. 2016; Severyn and Moschitti 2015) made use of this algorithm. Notably, Wehrmann et al. (2017) propose a language-agnostic translation-free method for Twitter sentiment analysis.
RNNs, a powerful set of ANNs useful for processing and recognising patterns in sequential data such as natural language, were used in 8 studies (Yan et al. 2018; Ochoa-Luna and Ari 2018; Piñeiro-Chousa et al. 2018; Wazery et al. 2018; Pavel et al. 2017; Shi et al. 2017; Yan and Tao 2016; Wang et al. 2016). One study in particular (Averchenkov et al. 2015) considered a novel approach to aspect-based sentiment analysis of Russian social networks based on RNNs, where the best results were obtained by using a special network modification, the RNTN. Two further studies (Lu et al. 2015; Sygkounas et al. 2016) also used this algorithm (RNTN) in their work.
Five other studies (Arslan et al. 2018; Anjaria and Guddeti 2014; Du et al. 2014; Politopoulou and Maragoudakis 2013; Zhang et al. 2011) used a simple type of ANN, such as the feedforward neural network. Moreover, the MLP, a class of feedforward ANN, was used in 2 studies (Chen and Zheng 2018; Ramadhani and Goo 2017). Similarly, 2 studies (Yan et al. 2018; Ameur et al. 2018) proposed methods based on the AE unsupervised learning algorithm which is used for representation learning. Lastly, one study each made use of the GRU (Wang et al. 2016) and DAN2 (Ghiassi et al. 2013) algorithms.
Some studies used several types of ANNs in their work. Ameur et al. (2018) used multiple methods based on AE, CNN, LSTM and BLSTM for sentiment polarity classification, and Wang et al. (2016) used RNN, LSTM, BLSTM and GRU models. Yan et al. (2018) used learning methods based on RNN, LSTM and AE for comparison with their proposed learning framework for short text classification, and Shi et al. (2017) proposed an improved LSTM which considers user-based and content-based features, using CNN, LSTM and RNN models for comparison purposes. Furthermore, Ochoa-Luna and Ari (2018) made use of CNN and RNN deep learning algorithms for tweet sentiment analysis, Wazery et al. (2018) and Yan and Tao (2016) used the RNN and LSTM, whereas Sun et al. (2018) and Chen and Wang (2018) proposed new models based on CNN and LSTM.
Statistical
A total of 9 studies (Wang et al. 2018; Kitaoka and Hasuike 2017; Arslan et al. 2017; Raja and Swamynathan 2016; Yang et al. 2014; Bukhari et al. 2016; Zhang et al. 2015; Karpowicz et al. 2013; Supriya et al. 2016) adopted a statistical approach to perform a form of SOM. In particular, one of the approaches proposed in Arslan et al. (2017) uses the term frequency-inverse document frequency (tf-idf) (Salton and McGill 1986) statistic to identify the important words within a tweet, in order to dynamically enrich Twitter-specific dictionaries created by the authors. The tf-idf statistic is also one of several statistical techniques used in Wang et al. (2018) for comparison against their proposed novel feature weighting approach for Twitter sentiment analysis. Moreover, Raja and Swamynathan (2016) focus on a statistical sentiment score calculation technique based on adjectives, whereas Yang et al. (2014) use a variation of point-wise mutual information to measure the opinion polarity of an entity and its competitors, a method that differs from traditional opinion mining approaches.
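A small sketch of the tf-idf statistic used in these studies, computed here with scikit-learn on invented example tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["battery life is great", "battery drains fast",
          "great camera great screen"]

# Terms frequent in one tweet but rare across the collection receive
# the highest weights, surfacing that tweet's distinctive words.
vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
terms = vec.get_feature_names_out()
weights = X[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda t: -t[1])[:3])
```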
Probabilistic
A total of 6 studies (Bhattacharya and Banerjee 2017; Baecchi et al. 2016; Ou et al. 2014; Ragavi and Usharani 2014; Yan et al. 2014; Lek and Poo 2013) adopted a probabilistic approach to perform a form of SOM. In particular, Ou et al. (2014) propose a novel probabilistic model, the Content and Link Unsupervised Sentiment Model, which focuses on microblog sentiment classification incorporating link information, namely behaviour, same-user and friend links.
Fuzziness
Two studies (D’Asaro et al. 2017; Del Bosque and Garza 2014) adopted a fuzzy-based approach to perform a form of SOM. D’Asaro et al. (2017) present a sentiment evaluation and analysis system based on fuzzy linguistic textual analysis. Del Bosque and Garza (2014) assume that aggressive text detection is a sub-task of sentiment analysis, which is closely related to document polarity detection given that aggressive text can be seen as intrinsically negative. This approach considers the document’s length and the number of swear words as inputs, with the output being an aggressiveness value between 0 and 1.
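The following toy sketch conveys the fuzzy idea described by Del Bosque and Garza (2014): document length and swear-word count as inputs, an aggressiveness value between 0 and 1 as output. The membership functions and weights are our own illustrative assumptions, not the authors' actual system:

```python
def ramp(x, lo, hi):
    """Linear fuzzy membership rising from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def aggressiveness(n_words, n_swears):
    # Illustrative memberships: high swear-word density and short
    # documents are both treated as signals of aggressive text.
    density = n_swears / max(n_words, 1)
    high_density = ramp(density, 0.0, 0.2)
    short_doc = 1.0 - ramp(n_words, 5, 50)
    # Fuzzy AND via min, blended into a single [0, 1] output.
    return min(1.0, 0.7 * high_density + 0.3 * min(high_density, short_doc))

print(aggressiveness(n_words=12, n_swears=3))  # value in [0, 1]
```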
Rule-based
In total, 4 studies (El Haddaoui et al. 2018; Zhang et al. 2014; Min et al. 2013; Bosco et al. 2013) adopted a rule-based approach to perform a form of SOM. Notably, Bosco et al. (2013) applied an approach for automatic emotion annotation of ironic tweets, relying on sentiment lexicons (words and expressions) and a sentiment grammar expressed through compositional rules.
Graph
Four studies (Dritsas et al. 2018; Vilarinho and Ruiz 2018; Chen et al. 2015; Rabelo et al. 2012) adopted a graph-based approach to perform a form of SOM. The study in Vilarinho and Ruiz (2018) presents a word graph-based method for Twitter sentiment analysis using global centrality metrics over graphs to evaluate sentiment polarity. In Dritsas et al. (2018), a graph-based method is proposed for sentiment classification at a hashtag level. Moreover, the authors in Chen et al. (2015) compare their proposed multimodal hypergraph-based microblog sentiment prediction approach with a combined hypergraph-based method (Huang et al. 2010). Lastly, Rabelo et al. (2012) used link mining techniques to infer the opinions of users.
Ontology
Two studies (Lau et al. 2014; Kontopoulos et al. 2013) adopted an ontology-based approach to perform a form of SOM. In particular, the technique developed in Kontopoulos et al. (2013) performs more fine-grained sentiment analysis of tweets where each subject within the tweets is broken down into a set of aspects, with each one being assigned a sentiment score.
Hybrid
Hybrid approaches are very much in demand for performing different opinion mining tasks: 244 unique studies (out of 465) adopted this approach, producing a total of 282 different techniques.
Tables 10 and 11 list these studies, together with the type of techniques used in each. In total, 38 different hybrid combinations were identified across the analysed studies.
Table 10 Studies adopting a hybrid approach consisting of two techniques

Table 11 Studies adopting a hybrid approach consisting of three and four techniques

The majority of these studies used two different techniques (213 out of 282) within their hybrid approach (see Table 10), whereas 62 used three and 7 used four different techniques (see Table 11).
The Lexicon and Machine Learning-based combination was the most used, accounting for 40% of the hybrid approaches, followed by Lexicon and Statistical-based (7.8%), Machine Learning and Statistical-based (7.4%), and Lexicon, Machine Learning and Statistical-based (7.4%) techniques.
Moreover, out of the 282 hybrid approaches, 232 used lexicons, 205 used Machine Learning and 39 used Deep Learning. These numbers reflect the importance of these three techniques within the SOM research and development domain. In light of this, a list of the lexicons, machine learning and deep learning algorithms used in these studies has been compiled, similar to Sects. 3.2.1, 3.2.2 and 3.2.3 above. The lexicons and algorithms quoted below were either used in the proposed method(s) and/or for comparison purposes in the respective studies.
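A common pattern behind such lexicon and machine learning hybrids is to feed lexicon scores to a learned classifier as additional features. The sketch below combines a VADER compound score (via NLTK) with tf-idf features; it is a generic illustration under those assumptions, not the pipeline of any specific study:

```python
import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

nltk.download("vader_lexicon", quiet=True)

posts = ["love it", "hate it", "not bad at all", "not good at all"]
labels = [1, 0, 1, 0]

# Lexicon component: one compound polarity score per post.
sia = SentimentIntensityAnalyzer()
lex_scores = np.array([[sia.polarity_scores(p)["compound"]] for p in posts])

# Machine learning component: tf-idf features augmented with the
# lexicon score, fed to a logistic regression classifier.
tfidf = TfidfVectorizer().fit_transform(posts)
X = hstack([tfidf, lex_scores])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))
```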
In terms of state-of-the-art lexicons, these total 403 within the studies adopting a hybrid approach. The top ones align with the results obtained for the lexicon-based approaches in Sect. 3.2.1 above. The following are the lexicons used more than ten times across the hybrid approaches:
1. SentiWordNet—used in 51 studies;
2. MPQA—Subjectivity—used in 28 studies;
3. Hu and Liu—used in 25 studies;
4. WordNet—used in 24 studies;
5. AFINN—used in 22 studies;
6. SentiStrength—used in 21 studies;
7. HowNetSenti—used in 15 studies;
8. NRC Word-Emotion Association Lexicon—used in 13 studies;
9. NRC Hashtag Sentiment Lexicon (Mohammad et al. 2013)—used in 12 studies;
10. SenticNet, Sentiment140 Lexicon (also known as NRC Emoticon Lexicon) (Mohammad et al. 2013), National Taiwan University Sentiment Dictionary (NTUSD) (Ku et al. 2006) and Wikipedia's list of emoticons—used in 11 studies each.
Further to the quoted lexicons, 49 studies used lexicons that they created as part of their work. Some studies composed their lexicons from emoticons/emojis extracted from a dataset (Cao et al. 2018; Li and Fleyeh 2018; Azzouza et al. 2017; Zimbra et al. 2016; You and Tunçer 2016; Chen et al. 2015; Porshnev et al. 2014; Cui et al. 2011; Zhang et al. 2012; Vu et al. 2012), combined publicly available emoticon lexicons/lists (Siddiqua et al. 2016) or mapped emoticons to their corresponding polarity (Tellez et al. 2017), while others (Gao et al. 2016; Souza et al. 2016; Su et al. 2014; Yan et al. 2014; Tang et al. 2013; Cui et al. 2011; Zhang et al. 2012; Li and Xu 2014) used seed/feeling/emotional words to establish a typical microblog emotional dictionary. Additionally, some authors constructed or used sentiment lexicons (Zhang et al. 2018; Vo et al. 2017; Rout et al. 2017; Jin et al. 2017; Ismail et al. 2018; Yan et al. 2017; Katiyar et al. 2018; Al Shammari 2018; Abdullah and Zolkepli 2017; Liu and Young 2016; Sahu et al. 2015; Cho et al. 2014; Wang et al. 2014; Chen et al. 2015; Jiang et al. 2013; Cui et al. 2013; Khuc et al. 2012; Montejo-Raez et al. 2014; Rui et al. 2013), some of which are domain or language specific (Konate and Du 2018; Hong and Sinnott 2018; Chen et al. 2017; Zhao et al. 2016; Lu et al. 2016; Zhou et al. 2014; Porshnev and Redkin 2014), others that extend state-of-the-art lexicons (Li et al. 2016, 2016; Koto and Adriani 2015), and some who made them available to the research community (Cotfas et al. 2017; Castellucci et al. 2015), such as the Distributional Polarity Lexicon.
Table 12 Machine learning algorithms used in the studies adopting a hybrid approach

Table 12 below presents the machine learning algorithms (in total 381, across 197 studies) used within the hybrid approaches. The first column indicates the algorithm, the second lists the type of learning (Supervised (Sup), Unsupervised (Unsup) and Semi-supervised (Semi-sup)), and the last column lists the total number of studies using each respective algorithm. The SVM and NB algorithms were the most used in supervised learning, a result that corresponds to the machine learning-based approaches in Sect. 3.2.2 above. With respect to the latter, 76 studies used the MBNB algorithm, 19 studies the MNB and 1 study the Discriminative MNB. Moreover, the LoR, DT (namely the basic ID3 (10 studies), J48 (5 studies), C4.5 (5 studies), Classification And Regression Tree (3 studies), Reduced Error Pruning (1 study), DT with AB (1 study), McDiarmid Tree (McDiarmid 1989) (1 study) and Hoeffding Tree (1 study) algorithms), RF, MaxEnt and SentiStrength (used in both supervised and unsupervised settings) algorithms were also used in various studies. Notably, some algorithms beyond the ones used in the machine learning-based approaches in Sect. 3.2.2 above were used in a hybrid approach, in particular SVR (Drucker et al. 1997), Extremely Randomised Trees (Geurts et al. 2006), Least Median of Squares Regression (Rousseeuw 1984), Maximum Likelihood Estimation (Fisher 1925), Hyperpipes (Witten et al. 2016), Extreme Learning Machine (Huang et al. 2006), Domain Adaptation Machine (Duan et al. 2009), RIPPER (Cohen 1995), Affinity Propagation (Frey and Dueck 2007), Multinomial Inverse Regression (Taddy 2013), Apriori (Agrawal et al. 1994), Distant Supervision (Go et al. 2009) and Label Propagation (Zhu and Ghahramani 2002).
Given that deep learning is a subset of machine learning, the algorithms used within the hybrid approaches are presented below. In total, 36 studies used the following deep learning algorithms:
- CNN—used in 16 studies (Yan et al. 2018; Stojanovski et al. 2018; Konate and Du 2018; Hanafy et al. 2018; Haldenwang et al. 2018; Ghosal et al. 2018; Chen et al. 2017; Ameur et al. 2018; Alharbi and DeDoncker 2017; Symeonidis et al. 2018; Saini et al. 2018; Jianqiang et al. 2018; Baccouche et al. 2018; Cai and Xia 2015; Kalayeh et al. 2015; Yanmei and Yuda 2015);
- ANN—used in 8 studies (Li and Fleyeh 2018; Karyotis et al. 2017; Poria et al. 2016; Er et al. 2016; Koto and Adriani 2015; Porshnev and Redkin 2014; Porshnev et al. 2014; Hassan et al. 2013);
- LSTM—used in 7 studies (Yan et al. 2018; Konate and Du 2018; Hanafy et al. 2018; Ghosal et al. 2018; Ameur et al. 2018; Sun et al. 2017; Baccouche et al. 2018);
- MLP—used in 7 studies (Villegas et al. 2018; Ghosal et al. 2018; Coyne et al. 2017; Karyotis et al. 2017; Bravo-Marquez et al. 2014; Del Bosque and Garza 2014; Thelwall et al. 2010);
- RNN—used in 4 studies (Yan et al. 2018; Liu et al. 2018; Baccouche et al. 2018; Yanmei and Yuda 2015);
- AE—used in 2 studies (Yan et al. 2018; Ameur et al. 2018);
- BLSTM—used in 2 studies (Konate and Du 2018; Ameur et al. 2018);
- DAN2—used in 2 studies (Ghiassi and Lee 2018; Zimbra et al. 2016);
- Deep Belief Network (Hinton and Salakhutdinov 2006), a probabilistic generative model composed of multiple layers of stochastic, latent variables—used in 2 studies (Jin et al. 2017; Tang et al. 2013);
- GRU—used in 1 study (Cao et al. 2018);
- Generative Adversarial Network (GAN) (Goodfellow et al. 2014), a deep neural net architecture composed of two networks, a generator and a discriminator, pitted one against the other—used in 1 study (Cao et al. 2018);
- Conditional GAN (Mirza and Osindero 2014), a conditional version of the GAN constructed by feeding the data to be conditioned on to both the generator and discriminator—used in 1 study (Cao et al. 2018);
- Hierarchical Attention Network (Yang et al. 2016), a neural architecture for document classification—used in 1 study (Liu et al. 2018).
Further to the quoted algorithms, 22 studies (Hong and Sinnott 2018; Hanafy et al. 2018; Ghosal et al. 2018; Saleena 2018; Yan et al. 2017; Tong et al. 2017; Dedhia and Ramteke 2017; Wijayanti and Arisal 2017; Xia et al. 2017; Jianqiang 2016; Prusa et al. 2015; Fersini et al. 2015; Abdelwahab et al. 2015; Kanakaraj and Guddeti 2015; Hagen et al. 2015; Cai and Xia 2015; Mansour et al. 2015; Wang et al. 2014; Tsakalidis et al. 2014; Da Silva et al. 2014; Hassan et al. 2013; Gonçalves et al. 2013) used ensemble learning methods in their work, combining the output of several base machine learning and/or deep learning methods. In particular, Gonçalves et al. (2013) compared eight popular lexicon and machine learning-based sentiment analysis algorithms, and then developed an ensemble that combines them, which in turn provided the best coverage results and competitive agreement. Moreover, Ghosal et al. (2018) propose an MLP-based ensemble network that combines LSTM, CNN and feature-based MLP models, with each model incorporating character-, word- and lexicon-level information, to predict the degree of intensity for sentiment and emotion. Lastly, as presented in Table 12, the RF ensemble learning method was used in 21 studies (Da Silva et al. 2014; Porshnev et al. 2014; Samoylov 2014; Yuan et al. 2014; Buddhitha and Inkpen 2015; Kanakaraj and Guddeti 2015; Jianqiang 2015; Bouchlaghem et al. 2016; Deshwal and Sharma 2016; Jianqiang 2016; Yan and Tao 2016; Tong et al. 2017; Jianqiang and Xiaolin 2017; Bouazizi and Ohtsuki 2017; Elouardighi et al. 2017; Bouazizi and Ohtsuki 2018; Li and Fleyeh 2018; Saleena 2018; Villegas et al. 2018; Yan et al. 2018; Zhang et al. 2018).
Other
In total, 23 studies did not adopt any of the approaches discussed in Sects. 3.2.1–3.2.10, mainly for one of three reasons: no information was provided by the authors (13 studies), an automated approach was used (4 studies), or a manual approach was used (6 studies) (Sandoval-Almazan and Valle-Cruz 2018; Fang and Ben-Miled 2017; Song and Gruzd 2017; Zafar et al. 2016; Furini and Montangero 2016; Cvijikj and Michahelles 2011). Regarding the first group, the majority (Ayoub and Elgammal 2018; Tiwari et al. 2017; Ouyang et al. 2017; Anggoro et al. 2016; Williamson and Ruming 2016; Agrawal et al. 2014; Pupi et al. 2014; Das et al. 2014) were not specifically focused on SOM (it was a secondary concern), in contrast to the others (Vivanco et al. 2017; Gonzalez-Marron et al. 2017; Chen et al. 2016; Barapatre et al. 2016; Mejova and Srinivasan 2012). As for the automated approaches (Sharma et al. 2018; Pai and Alathur 2018; Ali et al. 2018; Teixeira and Laureano 2017), some used cloud services, such as Microsoft Azure Text Analytics, or out-of-the-box functionality provided by existing tools/software libraries, such as the TextBlob Python library.
Social datasets
Numerous datasets were used across the 465 studies evaluated for this systematic review. These consisted of SOM datasets released online for public use, which have been widely used across the studies, and newly collected datasets, some of which were made available for public use while others remained private within the respective studies. In terms of data collection, the majority used the respective platform's API, such as the Twitter Search API, either directly or through a third-party library, e.g., Twitter4J. Due to the large number of datasets, only the most used ones are discussed in this section. In addition, only social datasets are mentioned, irrespective of whether other non-social datasets (e.g., news, movies) were also used, given that the main focus of this review is on social data.
The first sub-section (Sect. 3.3.1) presents an overview of the top social datasets used, whereas the second sub-section (Sect. 3.3.2) presents a comparative analysis of the studies that produced the best performance for each respective social dataset.
Overview
The following are the top fourteen social datasets used across all studies:

1. Stanford Twitter Sentiment (STS) (Go et al. 2009)—used in 61 studies: 1,600,000 training tweets collected via the Twitter API, made up of 800,000 tweets containing positive emoticons and 800,000 containing negative emoticons. These cover various topics, such as Nike, Google, China, Obama, Kindle, San Francisco, North Korea and Iran.
2. Sanders—used in 32 studies: 5513 hand-classified tweets about four topics: Apple, Google, Microsoft, Twitter. These tweets are labelled as follows: 570 positive, 654 negative, 2503 neutral and 1786 irrelevant.
3. SemEval 2013—Task 2 (Nakov et al. 2013)—used in 28 studies: training, development and test sets of Twitter and SMS messages annotated with positive, negative and objective/neutral labels via the Amazon Mechanical Turk crowdsourcing platform. This was done for two subtasks, one at expression level and one at message level.
4. SemEval 2014—Task 9 (Rosenthal et al. 2014)—used in 18 studies: continuation of SemEval 2013—Task 2, where three new test sets from regular and sarcastic tweets, and LiveJournal sentences, were introduced.
5. STS Gold (STS-Gold) (Saif et al. 2013)—used in 17 studies: a subset of STS annotated manually at tweet and entity level. The tweet labels are positive, negative, neutral, mixed or other.
6. Health care reform (HCR) (Speriosu et al. 2011)—used in 17 studies: tweets about the 2010 health care reform in the USA. A subset is annotated for polarity with the labels positive, negative, neutral and irrelevant. The polarity targets, such as health care reform, conservatives, democrats, liberals, republicans, Obama, Stupak and Tea Party, were also annotated. All were distributed into training, development and test sets.
7. Obama-McCain Debate (OMD) (Shamma et al. 2009)—used in 17 studies: 3238 tweets about the first presidential debate held in the USA for the 2008 campaign. The sentiment labels of the tweets were acquired by Diakopoulos and Shamma (2010) using Amazon Mechanical Turk, rated as positive, negative, mixed or other.
8. SemEval 2015—Task 10 (Rosenthal et al. 2015)—used in 15 studies: continues on datasets 3 and 4, with three new subtasks. The first two target sentiment about a particular topic in one tweet or a collection of tweets, whereas the third targets the degree of prior polarity of a phrase.
9. SentiStrength Twitter (SS-Twitter) (Thelwall et al. 2012)—used in 12 studies: six human-coded databases from BBC, Digg, MySpace, Runners World, Twitter and YouTube annotated for sentiment polarity strength, i.e., negative between -1 (not negative) and -5 (extremely negative), and positive between 1 (not positive) and 5 (extremely positive).
10. SemEval 2016—Task 4 (Nakov et al. 2016)—used in 9 studies: a re-run of dataset 8, with three new subtasks. The first replaces the standard two-point (positive/negative) or three-point (positive/negative/neutral) scale with a five-point scale (very positive/positive/OK/negative/very negative). The other two subtasks replace tweet classification with quantification (i.e., estimating the distribution of the classes in a set of unlabelled items) according to a two-point and five-point scale, respectively.
11. NLPCC 2012—used in 6 studies: Chinese microblog sentiment dataset (sentence level) from Tencent Weibo, provided by the First Conference on Natural Language Processing and Chinese Computing (NLP&CC 2012). It consists of a training set of microblogs about two topics and a test set about 20 topics, where the subjectivity (subjective/objective) and the polarity (positive/negative/neutral) were assigned for each.
12. NLPCC 2013—used in 6 studies: dataset from Sina Weibo used for the Chinese Microblog Sentiment Analysis Evaluation (CMSAE) task in the second NLP&CC conference (2013). The Chinese microblogs were classified into seven emotion types: anger, disgust, fear, happiness, like, sadness and surprise. The test set contains 10,000 microblogs, where each text is labelled with a primary emotion type and a secondary one (if possible).
13. Sentiment Evaluation (SE-Twitter) (Narr et al. 2012)—used in 5 studies: human-annotated multilingual dataset of 12,597 tweets in four languages, namely English, German, French and Portuguese. Polarity annotations with the labels positive, negative, neutral and irrelevant were conducted manually using Amazon Mechanical Turk.
14. SemEval 2017—Task 4 (Rosenthal et al. 2017)—used in 5 studies: a re-run of dataset 10, with two new changes: inclusion of the Arabic language for all subtasks and provision of profile information of the Twitter users that posted the target tweets.
All the datasets above are textual, with the majority composed of social data from Twitter. In terms of language, only the SE-Twitter social dataset (number 13) can be considered multilingual, with the rest targeting English (the majority) or Chinese microblogs, whereas SemEval 2017—Task 4 (number 14) introduced a new language, Arabic. An additional dataset is the one produced by Mozetič et al. (2016), which contains 15 Twitter sentiment corpora for 15 European languages. Some studies, such as Munezero et al. (2015), used one of the English-based datasets above (STS-Gold) for multiple languages, given that they adopted a lexicon-based approach. Moreover, these datasets had different usage within the respective studies, the most common being as a training/test set, for the final evaluation of the proposed solution/lexicon, or for comparison purposes. Evaluation challenges like SemEval are important for generating social datasets such as the above and that of Cortis et al. (2017), since these can be used by the Opinion Mining community for further research and development.
Comparative analysis
A comparative analysis of all the studies that used the social datasets presented in the previous sub-section was carried out. The Precision, Recall, F-measure (F1-score), and Accuracy metrics were selected to evaluate the said studies (when available) and identify the best performance for each respective social dataset. It is important to note that for certain datasets, this could not be done, since the experiments conducted were not consistent across all the studies. The top three studies (where possible) obtaining the best results for each of the four evaluation metrics are presented in the tables below.
Tables 13 and 14 provide the best results for the STS and Sanders datasets.
Table 13 Studies obtaining the best performance for the STS (1) social dataset

Table 14 Studies obtaining the best performance for the Sanders (2) social dataset

Tables 15 and 16 provide the best results for the SemEval 2013—Task 2 and SemEval 2014—Task 9 datasets, specifically for sub-task B, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 15 Studies obtaining the best performance for the SemEval 2013—Task 2 (3) social dataset

Table 16 Studies obtaining the best performance for the SemEval 2014—Task 9 (4) social dataset

Tables 17, 18 and 19 provide the best results for the STS-Gold, HCR and OMD datasets.

Table 17 Studies obtaining the best performance for the STS-Gold (5) social dataset

Table 18 Studies obtaining the best performance for the HCR (6) social dataset

Table 19 Studies obtaining the best performance for the OMD (7) social dataset

Table 20 provides the best results for the SemEval 2015—Task 10 dataset, specifically for sub-task B, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 20 Studies obtaining the best performance for the SemEval 2015—Task 10 (8) social dataset

Table 21 provides the best results for the SS-Twitter dataset.

Table 21 Studies obtaining the best performance for the SS-Twitter (9) social dataset

Table 22 provides the best results for the SemEval 2016—Task 4 dataset, specifically for sub-task A, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 22 Studies obtaining the best performance for the SemEval 2016—Task 4 (10) social dataset

Tables 23 and 24 provide the best results for the NLPCC 2012 dataset. Results quoted below are for task 1, which focused on subjectivity classification (see Table 23), and task 2, which focused on sentiment polarity classification (see Table 24). Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 23 Studies obtaining the best performance for the NLPCC 2012—Task 1 (11) social dataset

Table 24 Studies obtaining the best performance for the NLPCC 2012—Task 2 (11) social dataset

Tables 25 and 26 provide the best results for the NLPCC 2013 and SE-Twitter datasets.

Table 25 Studies obtaining the best performance for the NLPCC 2013 (12) social dataset

Table 26 Studies obtaining the best performance for the SE-Twitter (13) social dataset

Table 27 provides the best results for the SemEval 2017—Task 4 dataset, specifically for sub-task A, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 27 Studies obtaining the best performance for the SemEval 2017—Task 4 (14) social dataset

The following are some comments regarding the social dataset results quoted in the tables above:
- In cases where several techniques and/or methods were applied, the highest result obtained in the study for each of the four evaluation metrics was recorded, even if the same technique did not produce the best result for all metrics.
- The average Precision, Recall and F-measure results are quoted (if provided by the authors), i.e., the average score over each classified level (e.g., the average of the results obtained for each sentiment polarity classification level: positive, negative and neutral); a short sketch of this macro-averaging follows the list.
- Results for social datasets released as a shared evaluation task, such as SemEval, were sometimes only provided in the metrics used by the task organisers, or in other metrics chosen by the authors, and are therefore not quoted.
- Certain studies evaluated their techniques on a subset of the actual dataset. Results quoted are the ones where the entire dataset was used (according to the authors and/or our understanding).
- Quoted results are for classification tasks and not aspect-based SOM, which can vary depending on the focus of the study.
- Results presented only in graph visualisations were not considered, since the exact values are not clear.
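For reference, the macro-averaging described in the second point above can be reproduced with scikit-learn as follows (the labels are invented for illustration):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

# Macro averaging: compute each metric per class, then take the
# unweighted mean over the three polarity classes.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(p, r, f1, accuracy_score(y_true, y_pred))
```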
Language
Multilingual/bilingual SOM is very challenging, since it deals with multi-cultural social data. For example, analysing Chinese and English online posts together can produce mixed sentiment results. It is therefore hard for researchers to make a fair judgement in cases where the results for online posts in different languages contradict each other (Yan et al. 2014).
The majority of the studies (354 out of 465) considered for this review analysis support one language in their SOM solutions. A total of 80 studies did not specify whether their proposed solution is language-agnostic or otherwise, or else their modality was not textual-based. Lastly, only 31 studies cater for more than one language, with 18 being bilingual, 1 being trilingual and 12 proposed solutions claiming to be multilingual. Regarding the latter, the majority were tested on a few languages at most: Castellucci et al. (2015, 2015) on English and Italian, Montejo-Raez et al. (2014) on English and Spanish, Erdmann et al. (2014) on English and Japanese, Radhika and Sankar (2017) on English and Malayalam, Baccouche et al. (2018) on English, French and Arabic, Munezero et al. (2015) on keyword sets for different languages (e.g., Spanish, French), Wehrmann et al. (2017) on English, Spanish, Portuguese and German, Cui et al. (2011) on Basic Latin (English) and Extended Latin (Portuguese, Spanish, German), Teixeira and Laureano (2017) on Spanish, Italian, Portuguese, French, English and Arabic, Zhang et al. (2017) on 8 languages, namely English, German, Portuguese, Spanish, Polish, Slovak, Slovenian and Swedish, and Gao et al. (2016) on 11 languages, namely English, Dutch, French, German, Italian, Polish, Portuguese, Russian, Spanish, Swedish and Turkish.
The list below specifies the languages supported by the 19 bilingual and trilingual studies:
- English and Italian (Severyn et al. 2016; D'Avanzo and Pilato 2015; Pupi et al. 2014);
- English and German (Abdelrazeq et al. 2016; Tumasjan et al. 2010);
- English and Spanish (Giachanou et al. 2017; Cotfas et al. 2015; Delcea et al. 2014);
- English and Brazilian Portuguese (Guerra et al. 2014);
- English and Chinese (Xia et al. 2017; Yan et al. 2014);
- English and Dutch (Flaes et al. 2016);
- English and Greek (Politopoulou and Maragoudakis 2013);
- English and Hindi (Anjaria and Guddeti 2014);
- English and Japanese (Ragavi and Usharani 2014);
- English and Roman-Urdu (Javed et al. 2014);
- English and Swedish (Li and Fleyeh 2018);
- English and Korean (Ramadhani and Goo 2017);
- English, German and Spanish (Boididou et al. 2018).
Some studies above (D’Avanzo and Pilato 2015; Anjaria and Guddeti 2014; Tumasjan et al. 2010) translated their input data into an intermediate language, mostly English, to perform SOM.
Moreover, Table 28 provides a list of the non-English languages identified from the 354 studies that support one language. Chou et al. (2017) claim that their method can easily be applied to any ConceptNet-supported language, with Wang et al. (2016) similarly claiming that their method is language independent, whereas the solution by Wang and Wu (2015) is multilingual given that emoticons are used in the majority of languages.
Table 28 Non-English languages supported by studies in this review analysis

Modality
The majority of the studies in this systematic review, and in the state-of-the-art, focus on SOM for the textual modality, with only 15 out of 465 studies applying their work to more than one modality. Other modalities, such as visual (image, video) and audio information, are often ignored, even though they contribute greatly towards expressing user emotions (Chen et al. 2015). Moreover, when two or more modalities are considered together for any form of social opinion, such as emotion recognition, they are often complementary, thus increasing the system's performance (Caschera et al. 2016). Table 29 lists the multimodal studies within the review analysis, with the ones catering for two modalities, text and image, being the most popular.
Table 29 Studies adopting a multimodal approach

Datasets
Currently available datasets and resources for SOM are restricted to the textual modality only. The following are the non-textual social datasets (not listed in Sect. 3.3) used across the mentioned studies:

- YouTube Dataset (Morency et al. 2011), used in Poria et al. (2016): 47 videos targeting various topics, such as politics, electronics and product reviews.
- SentiBank Twitter Dataset (Borth et al. 2013), used in Baecchi et al. (2016) and Cai and Xia (2015): an image dataset from Twitter annotated for polarity using Amazon Mechanical Turk. Tweets with images related to 21 hashtags (topics) resulted in 470 positive and 133 negative instances.
- SentiBank Flickr Dataset (Borth et al. 2013), used in Cai and Xia (2015): 500,000 image posts from Flickr labelled by 1553 adjective-noun pairs based on Plutchik's Wheel of Emotions, a psychological theory (Plutchik 1980).
- You Image Dataset (You et al. 2015), used in Cai and Xia (2015): an image dataset from Twitter consisting of 769 positive and 500 negative tweets with images, annotated using Amazon Mechanical Turk.
- Katsurai and Satoh Image Dataset (Katsurai and Satoh 2016), used in Ortis et al. (2018): a dataset of images from Flickr (90,139) and Instagram (65,439) with their sentiment labels.
Observations
The novel methodology by Poria et al. (2016) is the only multimodal sentiment analysis approach which caters for four different modalities, namely text, image, video and audio, extracting sentiment from social Web videos. In Caschera et al. (2016), the authors propose a method whereby machine learning techniques are trained on different and heterogeneous features for each modality, such as polarity and intensity of lexicons from text, prosodic features from audio, and postures, gestures and expressions from video. The sentiment of video and audio data in Song and Gruzd (2017) was manually coded, a task which is labour-intensive and time-consuming. The addition of images to microblogs’ textual data reinforces and clarifies certain feelings (Wang et al. 2014; Baecchi et al. 2016), thus improving the sentiment classifier with image features (Liu et al. 2015; Zhang et al. 2015; Wang et al. 2014; Cai and Xia 2015). Similarly, Chen et al. (2015) demonstrate the superiority of their multimodal hypergraph method when compared to single-modality (in this case textual) methods. These results are further supported by the method in Poria et al. (2016), which caters for more than two modalities (audio, visual and textual) and shows that accuracy improves drastically when such modalities are used together.
Flaes et al. (2016) apply their multimodal (text, images) method in a real-world application area; their research shows that several relationships exist between city liveability indicators collected by the local government and automatically extracted sentiment. For example, sentiment detected from Flickr data has a negative linear association with the number of people living on welfare checks. Results in Rai et al. (2018) show a high correlation between sentiment extracted from text-based social data and human image-based landscape preferences. In addition, results in Yuan et al. (2015) show some correlation between image and textual tweets; however, the authors note that more features and more robust data are required to determine the exact influence of multimedia content in the social domain. The work in Chen et al. (2017) adopts a bimodal approach to the problem of cross-domain image sentiment classification by using textual and visual features from the target domain and measuring text/image similarity simultaneously.
Therefore, multimodality in the SOM domain is one of numerous research gaps identified in this systematic review, providing researchers with an opportunity for further research, development and innovation in this area.
Tools and technologies
In this systematic review, we also analysed the tools and technologies used across all studies for the various opinion mining operations conducted on social data, such as NLP, machine learning and big data handling. The subsections below list the ones most used across the studies for each of these operations.
NLP
The following are the top 5 NLP tools used across all studies for various NLP tasks:
- Natural Language Toolkit (NLTK)Footnote 75: a platform that provides lexical resources; text processing libraries for classification, tokenisation, stemming, tagging, parsing, and semantic reasoning; and wrappers for industrial NLP libraries;
- TweetNLPFootnote 76: consists of a tokeniser, Part-of-Speech (POS) tagger, hierarchical word clusters, and a dependency parser for tweets, besides annotated corpora and web-based annotation tools;
- Stanford NLPFootnote 77: software that provides statistical NLP, deep learning NLP and rule-based NLP tools, such as Stanford CoreNLP, the Stanford Parser and the Stanford POS Tagger;
- NLPIR-ICTCLASFootnote 78: a Chinese word segmentation system that includes keyword extraction, POS tagging, NER, and microblog analysis, amongst other features;
- word2vecFootnote 79: an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words (a brief usage sketch follows this list).
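To make the last item concrete, the following is a minimal, illustrative sketch of training skip-gram word vectors on tokenised microposts with the gensim library (assuming gensim 4.x); the toy corpus and parameter values are ours and are not drawn from any reviewed study.

```python
# Minimal sketch: training skip-gram word vectors on tokenised microposts
# with gensim (assuming gensim 4.x); the toy corpus and parameters are
# illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "battery", "life", "is", "great"],
    ["terrible", "battery", "and", "slow", "screen"],
    ["great", "screen", "and", "great", "camera"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = continuous bag-of-words
)

print(model.wv.most_similar("great", topn=3))
```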
Machine learning
The top 5 machine learning tools used across all studies are listed below:
- WekaFootnote 80: a collection of machine learning algorithms for data mining tasks, including tools for data preparation, classification, regression, clustering, association rule mining and visualisation;
- scikit-learnFootnote 81: consists of a set of tools for data mining and analysis, such as classification, regression, clustering, dimensionality reduction, model selection and pre-processing (see the sketch after this list);
- LIBSVMFootnote 82: an integrated software package for support vector classification, regression, distribution estimation and multi-class classification;
- LIBLINEARFootnote 83: a linear classifier for data with millions of instances and features;
- SVM-LightFootnote 84: an implementation of SVMs for pattern recognition, classification, regression and ranking problems.
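As an illustration of how such toolkits are typically combined for SOM, the following minimal sketch trains a bag-of-words sentiment classifier with scikit-learn; its LinearSVC estimator is backed by LIBLINEAR (sklearn.svm.SVC is backed by LIBSVM). The toy texts and labels are ours, purely for illustration.

```python
# Minimal sketch: a bag-of-words sentiment classifier with scikit-learn.
# LinearSVC is backed by LIBLINEAR (sklearn.svm.SVC is backed by LIBSVM).
# The toy texts and labels below are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["love this phone", "worst service ever", "great camera", "awful battery"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF unigrams/bigrams feed a linear SVM, a very common SOM baseline.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["love the camera"]))
```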
Opinion mining
Certain studies used opinion mining tools in their research, either to conduct their main experiments or to compare against their proposed solution/s. The following are the top 3 opinion mining tools used:
- SentiStrengthFootnote 85: a sentiment analysis tool that is able to conduct binary (positive/negative), trinary (positive/neutral/negative), single-scale (from −4, very negative, to +4, very positive), keyword-oriented and domain-oriented classifications;
- Sentiment140Footnote 86: a tool that allows users to discover the sentiment of a brand, product or topic on Twitter;
- VADER (Valence Aware Dictionary and sEntiment Reasoner)Footnote 87: a lexicon and rule-based sentiment analysis tool that is specifically focused on sentiments expressed in social media.
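As a brief usage illustration, the following sketch scores a micropost with VADER through the vaderSentiment Python package (one common distribution of the tool; NLTK also bundles an implementation).

```python
# Brief usage sketch: scoring a micropost with VADER through the
# vaderSentiment package (pip install vaderSentiment); NLTK also bundles
# an implementation of the same tool.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The new update is AWESOME!!! :)")
print(scores)  # neg/neu/pos proportions plus a compound score in [-1, 1]
```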
Big data
Several big data technologies were used by the analysed studies. The most popular ones are categorised in the list below:
1. Relational storage
   (a) MySQLFootnote 88
   (b) PostgreSQLFootnote 89
   (c) Amazon Relational Database Service (Amazon RDS)Footnote 90
   (d) Microsoft SQL ServerFootnote 91
2. Non-relational storage
   (a) Document-based
      i. MongoDBFootnote 92
      ii. Apache CouchDBFootnote 93
   (b) Column-based
      i. Apache HBaseFootnote 94
3. Resource Description Framework Triplestore
4. Distributed Processing
   (a) Apache HadoopFootnote 95
   (b) Apache SparkFootnote 96
   (c) IBM InfoSphere StreamsFootnote 97
   (d) Apache AsterixDBFootnote 98
   (e) Apache StormFootnote 99
5. Data Warehouse
   (a) Apache HiveFootnote 100
6. Data Analytics
   (a) DatabricksFootnote 101
The MySQL relational database management system was the most used technology for storing structured social data, whereas MongoDB was the most used for storing and processing unstructured social data. On the other hand, the distributed processing technologies were used for processing large-scale social real-time and/or historical data. In particular, Hadoop MapReduce was used for parallel processing of large volumes of structured, semi-structured and unstructured social datasets stored in the Hadoop Distributed File System, whereas Spark’s ability to process both batch and streaming data was utilised in cases where velocity is more important than volume.
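As an illustration of the batch side of this trade-off, the following is a minimal, hypothetical PySpark sketch for processing archived tweets; the HDFS path and the "lang" field are assumptions for illustration, and a velocity-oriented pipeline would use Structured Streaming (spark.readStream) analogously.

```python
# Minimal, hypothetical PySpark sketch: batch processing of archived tweets.
# The HDFS path and the "lang" field are assumptions for illustration; a
# velocity-oriented pipeline would use Structured Streaming (spark.readStream).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("social-opinion-batch").getOrCreate()

tweets = spark.read.json("hdfs:///data/tweets/*.json")  # hypothetical location
lang_counts = (tweets
               .groupBy(F.col("lang").alias("language"))
               .count()
               .orderBy(F.desc("count")))
lang_counts.show()
```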
Natural language processing tasks
This section presents information about other NLP tasks that were conducted to perform SOM.
Overview
An element of NLP is performed in 283 of the 465 analysed studies, either for pre-processing (248 studies), for feature extraction (in machine learning approaches), or as one of the processing steps within their SOM solution. The most common and important NLP tasks range from Tokenisation, Segmentation and POS tagging, to NER and Language Detection.
It is important to mention that the NLP tasks mentioned above, together with Anaphora Resolution, Parsing, Sarcasm, and Sparsity, are some of the other challenges faced in the SOM domain (Khan et al. 2014). Moreover, online posts with complicated linguistic patterns are challenging to deal with (Li and Xu 2014).
However, Koto and Adriani (2015) showcase the importance and potential of NLP within this domain: they investigated the word-combination patterns of tweets with respect to subjectivity and polarity by considering their POS sequences. Results reveal that subjective tweets tend to contain adverb-adjective combinations, whereas objective tweets tend to contain combinations of nouns. Moreover, negative tweets tend to contain affirmation words combined with a negation word.
Pre-processing and negations
The majority (355 out of 465) of the studies performed some form of pre-processing. Different methods and resources were used for this step, such as NLP tasks (e.g., tokenisation, stemming, lemmatisation, NER) and dictionaries for stop words and slang acronyms (e.g., noslang.com, noswearing.com, Urban Dictionary, Internet lingo).
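A minimal sketch of such a pre-processing pipeline is shown below, using NLTK's TweetTokenizer; the slang dictionary is a hypothetical stand-in for resources such as noslang.com, and the NLTK stopword corpus must be downloaded beforehand.

```python
# Minimal pre-processing sketch using NLTK's TweetTokenizer. The slang
# dictionary is a hypothetical stand-in for resources such as noslang.com;
# the stopword corpus must be downloaded first: nltk.download("stopwords").
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

SLANG = {"gr8": "great", "luv": "love"}  # illustrative entries only
STOP = set(stopwords.words("english"))
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def preprocess(tweet):
    tokens = tokenizer.tokenize(tweet)          # handles @mentions, elongation
    tokens = [SLANG.get(t, t) for t in tokens]  # slang/acronym expansion
    return [t for t in tokens if t.isalnum() and t not in STOP]

print(preprocess("@user This phone is gr8!!! luv the camera sooooo much"))
```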
Negation handling is one of the most challenging issues faced by SOM solutions; nevertheless, 117 studies cater for negations within their approach. Several different methods are used, such as negation replacement, negation transformation, negation dictionaries, textual features based on negation words, and negation models (a sketch of one such method is given below).
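The following sketch implements a common form of negation transformation: tokens following a negation cue are suffixed with _NEG until the next punctuation mark ends the scope. The cue list and marking scheme are illustrative choices, not taken from a specific reviewed study.

```python
# Sketch of a common negation transformation: tokens after a negation cue
# are suffixed with _NEG until the next punctuation mark ends the scope.
# The cue list and marking scheme are illustrative choices.
import re

NEGATIONS = {"not", "no", "never", "n't", "cannot"}
PUNCT = re.compile(r"^[.,;:!?]$")

def mark_negation(tokens):
    out, negated = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negated = False                      # scope ends at punctuation
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negated else tok)
            if tok.lower() in NEGATIONS:
                negated = True                   # open a negation scope
    return out

print(mark_negation(["i", "do", "not", "like", "this", "phone", ".", "great", "camera"]))
# -> ['i', 'do', 'not', 'like_NEG', 'this_NEG', 'phone_NEG', '.', 'great', 'camera']
```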
Emoticons/Emojis
Social media can be seen as a sub-language that mixes emoticons/emojis with text to show emotions (Min et al. 2013). Emoticons/emojis are commonly used in tweets irrespective of the language, and are therefore sometimes considered domain- and language-independent (Khan et al. 2014), thus useful for multilingual SOM (Cui et al. 2011).
Even though some researchers remove emoticons/emojis as part of their pre-processing stage (depending on what the authors want to achieve), many others have utilised their emotional meaning within the SOM process. As a result, emoticons/emojis play a very important role in the solutions of 205 of the analysed studies, especially when the focus is on emotion recognition.
Results obtained from the emoticon networks model in Zhang et al. (2013) show that emoticons can help in performing sentiment analysis. This is supported by Jiang et al. (2015), who found that emoticons are a pure carrier of sentiment, and further by the emoticon polarity-aware method in Li et al. (2018), whose results show that emoticons can significantly improve precision when identifying sentiment polarity. In the case of hybrid (lexicon and machine learning) approaches, emoticon-aided lexicon expansion improves the performance of lexicon-based classifiers (Zhou et al. 2014). From an emotion classification perspective, Porshnev et al. (2014) analysed users' emoticons on Twitter to improve the accuracy of predictions for the Dow Jones Industrial Average and S&P 500 stock market indices. Other researchers (Cvijikj and Michahelles 2011) were interested in analysing how people express emotions, displayed via adjectives or usage of Internet slang, i.e., emoticons, interjections and intentional misspelling.
Several emoticon lists were used in these studies, with the Wikipedia and DataGeneticsFootnote 102 ones most commonly used. Moreover, emoticon dictionaries consisting of emoticons and their corresponding polarity class, such as those of Agarwal et al. (2011), Aisopos et al. (2012) and Becker et al. (2013), were also used in certain studies.
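A minimal sketch of how such an emoticon polarity dictionary is typically applied is given below; the entries and scores are illustrative and are not taken from any of the cited lexicons.

```python
# Sketch of applying an emoticon polarity dictionary of the kind used in
# these studies; the entries and scores are illustrative, not taken from
# any of the cited lexicons.
EMOTICON_POLARITY = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1, ":'(": -1}

def emoticon_score(tokens):
    """Sum the polarity of any emoticons present in a tokenised post."""
    return sum(EMOTICON_POLARITY.get(t, 0) for t in tokens)

print(emoticon_score(["great", "match", ":)", ":D"]))  # -> 2
```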
Word embeddings
Word embeddings, a type of word representation which allows words with a similar meaning to have a similar representation, were used by several studies (Severyn and Moschitti 2015; Jiang et al. 2015; Castellucci et al. 2015, 2015; Cai and Xia 2015; Gao et al. 2015; Chen et al. 2015; Stojanovski et al. 2015; Gao et al. 2016; Zhao et al. 2016; Rexha et al. 2016; Hao et al. 2017; Kitaoka and Hasuike 2017; Arslan et al. 2018; Baccouche et al. 2018; Chen et al. 2018; Ghosal et al. 2018; Hanafy et al. 2018; Jianqiang et al. 2018; Stojanovski et al. 2018; Sun et al. 2018; Wan et al. 2018; Yan et al. 2018) adopting a learning-based (Machine Learning, Deep Learning and Statistical) or hybrid approach. These studies used word embedding algorithms, such as word2vec, fastTextFootnote 103, and/or GloVeFootnote 104. Such a learned representation for text is capable of capturing the context of words within a piece of text, syntactic patterns, semantic similarity and relations with other words, amongst other properties. Therefore, word embeddings are used for different NLP problems, with SOM being one of them.
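To illustrate one common way such embeddings feed a learning-based SOM pipeline, the sketch below averages pretrained GloVe Twitter vectors into a fixed-length tweet representation; it assumes the gensim-data model "glove-twitter-25" can be downloaded, and the averaging strategy is our illustrative choice rather than any specific study's method.

```python
# Minimal sketch: averaging pretrained GloVe Twitter vectors into a fixed-
# length tweet representation for a downstream classifier. Assumes the
# gensim-data model "glove-twitter-25" is available for download.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-twitter-25")  # 25-dimensional GloVe vectors trained on tweets

def embed(tokens):
    vecs = [wv[t] for t in tokens if t in wv]  # skip out-of-vocabulary tokens
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

print(embed(["good", "morning", "twitter"]).shape)  # -> (25,)
```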
Aspect-based social opinion mining
Sentence-level SOM approaches tend to fail in discovering an opinion dimension, such as sentiment polarity, about a particular entity and/or its aspects (Cambria et al. 2013). Therefore, an aspect-level approach (also referred to as feature- or topic-based) (Hu and Liu 2004), where an opinion is made up of targets and their associated opinion dimension (e.g., sentiment polarity), has been used in some studies to overcome such issues. Certain NLP tasks, such as parsing, POS tagging and NER, are usually required to extract the entities or aspects from the respective social data.
From all the studies analysed, 39 performed aspect-based SOM, with 37 (Bansal and Srivastava 2018; Dragoni 2018; Gandhe et al. 2018; Ghiassi and Lee 2018; Kao and Huang 2018; Katz et al. 2018; Liu et al. 2018; Rathan et al. 2018; Wang et al. 2018; Zainuddin et al. 2018; Abdullah and Zolkepli 2017; Dambhare and Karale 2017; Hagge et al. 2017; Ray and Chakrabarti 2017; Rout et al. 2017; Tong et al. 2017; Vo et al. 2017; Zhou et al. 2017; Zimbra et al. 2016; Zainuddin et al. 2016, 2016; Kokkinogenis et al. 2015; Lima et al. 2015; Hridoy et al. 2015; Castellucci et al. 2015; Averchenkov et al. 2015; Tan et al. 2014; Lau et al. 2014; Del Bosque and Garza 2014; Varshney and Gupta 2014; Unankard et al. 2014; Lek and Poo 2013; Wang and Ye 2013; Min et al. 2013; Kontopoulos et al. 2013; Jiang et al. 2011; Prabowo and Thelwall 2009) focusing on aspect-based sentiment analysis, 1 (Aoudi and Malik 2018) on aspect-based sentiment and emotion analysis and 1 (Weichselbraun et al. 2017) on aspect-based affect analysis.
In particular, the Twitter aspect-based sentiment classification process in Lek and Poo (2013) consists of the following main steps: aspect-sentiment extraction, aspect ranking and selection, and aspect classification, whereas Lau et al. (2014) use NER to parse product names and determine their polarity. The aspect-based sentiment analysis approach in Hagge et al. (2017) leveraged POS tagging and dependency parsing. Moreover, Zainuddin et al. (2016) proposed a hybrid approach to analyse the aspect-based sentiment of tweets; as the authors claim, it is more important to identify opinions about specific aspects within tweets than the overall polarity, which might not be useful to organisations. In Zainuddin et al. (2018), the same authors used association rule mining augmented with a heuristic combination of POS patterns to find single- and multi-word explicit and implicit aspects. Results in Jiang et al. (2011) show that classifiers incorporating target-dependent features significantly outperform target-independent ones. In contrast to the studies discussed, Weichselbraun et al. (2017) introduced an aspect-based analysis approach that integrates affective (sentiment polarity and emotions) and factual knowledge extraction to capture opinions related to certain aspects of brands and companies; the social data analysed is classified in terms of sentiment polarity and emotions, aligned with the "Hourglass of Emotions" (Susanto et al. 2020).
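As a rough illustration of POS-pattern aspect extraction of the kind used in these studies, the following sketch treats maximal noun sequences as candidate aspects; it assumes NLTK's tokeniser and POS tagger models have been downloaded, and the pattern is a deliberate simplification, not any specific study's method.

```python
# Rough sketch of POS-pattern aspect extraction: maximal noun sequences are
# taken as candidate aspects. A simplification, not any specific study's
# method; assumes NLTK's punkt tokeniser and perceptron tagger models.
import nltk

def candidate_aspects(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    aspects, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):       # noun starts or continues a candidate
            current.append(word)
        elif current:                  # noun run ended, emit the candidate
            aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects

print(candidate_aspects("The battery life is great but the screen is dim"))
# e.g. -> ['battery life', 'screen']
```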
In terms of techniques, the majority of the aspect-based studies used a hybrid approach, with only 5 studies using deep learning for this task. In particular, the study by Averchenkov et al. (2015) used a deep learning approach based on RNNs for aspect-based sentiment analysis. A comparative review of deep learning for aspect-based sentiment analysis by Do et al. (2019) discusses current research in this domain, focusing on deep learning approaches, such as CNN, LSTM and GRU, which extract both syntactic and semantic features of text without the in-depth feature engineering required by classical NLP. For future research directions on aspect-based SOM, refer to Sect. 6.2.