1 Introduction

Social media is increasing in both popularity and importance, principally due to the large number of people who use different social media platforms for various types of social interaction. Kaplan and Haenlein define social media as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, which allows the creation and exchange of user generated content” (Kaplan and Haenlein 2010). This definition reflects how social media platforms enable online users to publish their views and read those posted by others about various aspects and/or entities, such as opinions about a political party they support in an upcoming election, recommendations of products to buy, restaurants to eat in and holiday destinations to visit. In particular, people’s social opinions as expressed through various social media platforms can be beneficial in several domains, used in several applications and applied in real-life scenarios. Therefore, mining people’s opinions, which are usually expressed in various media formats, such as textual (e.g., online posts, newswires), visual (e.g., images, videos) and audio, is a valuable business asset that can be utilised in many ways, ranging from marketing strategies to product or service improvement. However, as indicated in Ravi and Ravi (2015), dealing with unstructured data, such as video, speech, audio and text, poses crucial research challenges.

This research area is evolving due to the rise of social media platforms, and a considerable body of work already exists on the analysis of sentiment polarity. Through it, researchers can gauge widespread opinions from user-generated content and better model and understand human beliefs and behaviour. Opinion Mining is regarded as a challenging Natural Language Processing (NLP) problem, in particular for social data obtained from social media platforms, such as TwitterFootnote 1, and for transcribed text. Standard linguistic processing tools were built and developed on newswires and review-related data, since such data follows stricter grammar rules. These differences should be taken into consideration when performing any kind of analysis (Balazs and Velásquez 2016). Social data is difficult to analyse due to the short length of the text, non-standard abbreviations, the highly sparse representation of terms, difficulties in finding synonyms and other relations between terms, the emoticons and hashtags used, lack of punctuation, informal text, slang, non-standard shortcuts and word concatenations. Hence, typical NLP solutions are not likely to work well for Opinion Mining.

Opinion Mining—presently a very popular field of study—is defined by Liu and Zhang as “the computational study of people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events, topics and their attributes” (Liu and Zhang 2012). Social is defined by the Merriam-Webster Online dictionaryFootnote 2 as “of or relating to human society, the interaction of the individual and the group, or the welfare of human beings as members of society”.

In light of this, we define Social Opinion Mining (SOM) as “the study of user-generated content by a selective portion of society be it an individual or group, specifically those who express their opinion about a particular entity, individual, issue, event and/or topic via social media interaction”.

Therefore, the research area of SOM is tasked with identifying several dimensions of opinion, such as sentiment polarity, emotion, sarcasm, irony and mood, from social data represented in structured, semi-structured and/or unstructured formats. Information fusion is the field tasked with researching efficient methods for automatically or semi-automatically transforming information from different sources into a single coherent representation, which can be used to guide the fusion process. This is important due to the diversity of the data in terms of content, format and volume (Balazs and Velásquez 2016). Sections 1.1 and 1.2 provide further information about SOM and its challenges.

In addition, SOM is generally very personal to the individual responsible for expressing an opinion about an object or set of objects, thus making it user-oriented from an opinion point-of-view, e.g., a social post about an event on Twitter, a professional post about a job opening on LinkedInFootnote 3 or a review about a hotel on TripAdvisorFootnote 4.

Our SOM research focuses on microposts—i.e. information published on the Web that is small in size and requires minimal effort to publish (Cano et al. 2016)—that are expressed by individuals on a microblogging service, such as Sina WeiboFootnote 5 or Twitter and/or a social network service that has its own microblogging feature, such as FacebookFootnote 6 and LinkedIn.

1.1 Opinion mining versus social opinion mining

In 2008, Pang and Lee had already identified the relevance of the field of “social media monitoring and analysis” to the body of work reviewed in Pang and Lee (2008), which deals with the computational treatment of opinion, sentiment and subjectivity in text. This work is nowadays known as opinion mining, sentiment analysis, and/or subjectivity analysis (Pang and Lee 2008). Other phrases, such as review mining and appraisal extraction, have been used in the same context, while connections have also been drawn to affective computing, one of whose goals is to enable computers to recognise and express emotions (Pang and Lee 2008). Merriam-Webster’s Online Dictionary defines the termsFootnote 7 “opinion”, “view”, “belief”, “conviction”, “persuasion” and “sentiment” as a judgement one holds as true. This shows that the distinctions in common usage between these terms can be quite subtle. In light of this, the three main research areas (opinion mining, sentiment analysis and subjectivity analysis) are all related and use multiple techniques taken from NLP, information retrieval, and structured and unstructured data mining (Ravi and Ravi 2015). However, even though these three concepts are broadly used as synonyms and thus used interchangeably, it is worth noting that their origins differ. Some authors also consider that each concept presents a different understanding (Serrano-Guerrero et al. 2015) and a different notion (Tsytsarau and Palpanas 2012). We agree with this view, hence we felt that new terminology was required to properly specify what SOM means, as defined in Sect. 1.

According to Cambria et al., sentiment analysis can be considered a very restricted NLP problem, where the polarity (negative/positive) of each sentence and/or of target entities or topics needs to be understood (Cambria et al. 2013). On the other hand, Liu notes that “opinions are usually subjective expressions that describe people’s sentiments, appraisals or feelings toward entities, events and their properties” (Liu 2010). He further identifies two sub-topics of sentiment and subjectivity analysis, namely sentiment classification (or document-level sentiment classification) and subjectivity classification. SOM requires such classification methods to determine an opinion dimension, such as objectivity/subjectivity and sentiment polarity. For example, subjectivity classification is required to classify whether user-generated content, such as a product review, is objective or subjective, whereas sentiment classification is performed on subjective content to find the sentiment polarity (positive/negative) expressed by the author of the opinionated text. In cases where the user-generated content is made up of multiple sentences, sentence-level classification needs to be performed to determine the respective opinion dimension. However, sentence-level classification is not suitable for compound sentences, i.e., sentences that express more than one opinion. For such cases, aspect-based opinion mining needs to be performed.
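This two-stage process can be sketched with a toy lexicon-based classifier; the tiny lexicons and the `classify` function below are illustrative assumptions for exposition only, not a method taken from any of the surveyed studies.

```python
# Minimal two-stage pipeline: subjectivity classification first,
# then sentiment (polarity) classification on subjective text only.
# The tiny lexicons below are illustrative placeholders.

SUBJECTIVE_CUES = {"love", "hate", "great", "terrible", "awful", "amazing"}
POSITIVE = {"love", "great", "amazing"}
NEGATIVE = {"hate", "terrible", "awful"}

def classify(text: str) -> str:
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    # Stage 1: subjectivity classification.
    if not tokens & SUBJECTIVE_CUES:
        return "objective"
    # Stage 2: sentiment classification, performed on subjective content only.
    pos = len(tokens & POSITIVE)
    neg = len(tokens & NEGATIVE)
    if pos > neg:
        return "subjective/positive"
    if neg > pos:
        return "subjective/negative"
    return "subjective/neutral"

print(classify("The battery lasts ten hours."))     # objective
print(classify("I love this phone, it is great!"))  # subjective/positive
```

A real system would replace the word-set lookups with trained classifiers, but the control flow (objectivity filter before polarity) is the same.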

1.2 Issues and challenges

Pang and Lee (2008) had already identified that the writings of Web users can be very challenging in their own way due to numerous factors, such as the quality of the written text, the discourse structure and the order in which different opinions are presented. The latter factor can result in a completely opposite overall sentiment polarity, where order effects can completely overwhelm frequency effects. This is not the case in traditional text classification, where if a document mentions the term “car” frequently, the document will probably be somewhat related to cars. Therefore, order dependence manifests itself at a more fine-grained level of analysis.

Liu (2010) mentions that complete sentences (in reviews) are more complex than short phrases and contain a large amount of noise, thus making it more difficult to extract features for feature-based sentiment analysis. Even though we agree that more text comes with a higher probability of spelling mistakes and similar errors, we tend to disagree that shorter text, such as microposts, contains less noise.

The process of mining user-generated content posted on the Web is very intricate and challenging due to the limits on short textual content (e.g., tweets allowed up to 140 characters until October 2017), which at times force a user to resort to shortened forms, such as acronyms and slang, to make a statement. These often lead to further issues in the text, such as misspellings, incomplete content, jargon, incorrect acronyms and/or abbreviations, emoticons and content misinterpretation (Cortis 2013). Other noteworthy challenges include swear words, irony, sarcasm, negation, conditional statements, grammatical mistakes, use of multiple languages, incorrect language syntax, syntactically inconsistent words, and different discourse structures. In fact, when informal language is used in user-generated content, the grammar and lexicon vary from the standard language normally used (Dashtipour et al. 2016). Moreover, user-generated text exhibits more language variation and is less grammatical than longer posts, where the aforementioned emoticons and abbreviations, together with hashtags and inconsistent capitalisation, can form an important part of the meaning (Maynard et al. 2012). Maynard et al. (2012) also point out that microposts are in some sense the most challenging type of text for text mining tools, especially for opinion mining, since they do not contain much contextual information and assume much implicit knowledge. Another issue is ambiguity, since microposts such as tweets do not follow a conversation thread. This isolation from other tweets makes it more difficult to make use of coreference information, unlike in blog posts and comments. Due to the short textual content, features can also be sparse and difficult to find and use in terms of text representation (Wang et al. 2014). In addition, the majority of microposts usually contain information about a single topic due to the length limitation, unlike traditional blog posts, which often cover more than one topic given that they do not face the same length limitations (Giachanou and Crestani 2016).
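Several of these surface-level issues (hashtags, character repetition, slang abbreviations) are commonly addressed by a text normalisation step before any opinion classification is attempted. A minimal sketch, assuming a tiny hand-made abbreviation map where a real system would use a large curated resource:

```python
import re

# Illustrative slang/abbreviation map; a real system would use a
# much larger, curated lexicon.
ABBREVIATIONS = {"gr8": "great", "u": "you", "imo": "in my opinion"}

def normalise(micropost: str) -> str:
    text = micropost.lower()
    # Split hashtags off their '#' so the word itself can be analysed.
    text = re.sub(r"#(\w+)", r"\1", text)
    # Collapse character repetitions ("soooo" -> "soo") common in informal text.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Expand known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(normalise("This movie is gr8 #mustwatch soooo good"))
# this movie is great mustwatch soo good
```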

Big data challenges, such as handling and processing large volumes of streaming data, are also encountered when analysing social data (Bravo-Marquez et al. 2014). Limited availability of labelled data and the evolving nature of social streams mean that the target concept often changes, which requires the learning models to be constantly updated (Guerra et al. 2014).

In light of the above, social networking services bring with them several issues and challenges regarding the way content is generated by their users. Therefore, several Information Extraction (IE) tasks, such as Named Entity Recognition (NER) and Coreference Resolution, might be required to carry out multi-dimensional SOM. In fact, several shared evaluation tasks are being organised to try to reach a standard mechanism for performing IE tasks on noisy text, which is very common in user-generated social media content. As already discussed in detail above, such tasks are much harder to solve when applied to micro-text like microposts (Ravi and Ravi 2015). This problem presents serious challenges on several levels, such as performance. An example of such a task is “Named Entity Recognition in Twitter”Footnote 8.

In terms of content, social media-based studies present analysis and results from only a selective portion of society, since not everyone uses social media. Moreover, several cross-cultural differences and factors determine social media usage in each country and hence the results of such studies. For example, in the political domain, these services are used predominantly by young and politically active individuals or by those with strong political views. This was reflected in the Brexit results, where the majority of the younger generation (aged 18–44) voted to remain in the European Union, as opposed to people over the age of 45. Such a result falls in line with the latest United Kingdom social media statistics, such as for Twitter, where 72% of users are aged 15–44, whilst for Facebook the most popular age group is 25–34 (26% of users) (Hürlimann et al. 2016). However, results of similar studies in other cultures and languages might differ due to different use of social words to reflect a general opinion, sentiment polarity and/or emotion (Lin et al. 2018).

1.3 Systematic review

In light of the above, it is noteworthy that no systematic review within this newly defined domain exists, even though there are several good survey papers (Liu and Zhang 2012; Tsytsarau and Palpanas 2012; Medhat et al. 2014; Ravi and Ravi 2015). The research paper by Bukhari et al. (2016) is the closest to a systematic review in this domain, whereby the authors searched the ScienceDirect and SpringerLink electronic libraries for the terms “sentiment analysis”, “sentiment analysis models” and “sentiment analysis of microblogs”. As a result, we felt that the SOM domain well and truly deserves a thorough systematic review that captures all of the relevant research conducted over the last decade. This review also identifies the current literature gaps within this popular and constantly evolving research domain.

The structure of this comprehensive systematic review is as follows: Sect. 2 presents the research method adopted to carry out this review, followed by Sect. 3, which provides a thorough review analysis of the main aspects derived from the analysed studies. This is followed by Sect. 4, which focuses on the different dimensions of social opinions as derived from the analysed studies, and Sect. 5, which presents the application areas where SOM is being used. Lastly, Sect. 6 discusses the latest developments for SOM (beyond the period covered by the systematic review) and future research directions as identified by the authors.

2 Research method

This survey paper about SOM adopts a systematic literature review process. This empirical research process was based on the guidelines and procedures proposed by Kitchenham (2004), Brereton et al. (2007), Dyba et al. (2007) and Attard et al. (2015), which were focused on the software engineering domain. The systematic review process, although more time-consuming, is reproducible, minimising bias and maximising internal and external validity. The procedure undertaken was structured as follows and is explained in detail in the sub-sections below:

  1. Specification of research questions;

  2. Generation of search strategy which includes the identification of electronic sources (libraries) and selection of relevant search terms;

  3. Application of the relevant search;

  4. Choice of primary studies via the utilisation of inclusion and exclusion criteria on the obtained results;

  5. Extraction of required data from primary studies;

  6. Synthesis of data.

2.1 Research questions

A systematic literature review is usually characterised by an appropriate generic “research question, topic area, or phenomenon of interest” (Kitchenham 2004). This question can be expanded into a set of sub-questions that are more clearly defined, whereby all available research relevant to these sub-questions are identified, evaluated and interpreted.

The goal of this systematic review is to identify, analyse and evaluate current opinion mining solutions that make use of social data (data extracted from social media platforms). In light of this, the following generic research question is defined:

  • What are the existing opinion mining approaches which make use of user-generated content obtained from social media platforms?

The following are specific sub-questions that the generic question above can be sub-divided into:

  1. What are the existing approaches that make use of social data for opinion mining and how can they be classifiedFootnote 9?

  2. What are the different dimensions/types of social opinion mining?

  3. What are the challenges faced when performing opinion mining on social data?

  4. What techniques, datasets, tools/technologies and resources are used in the current solutions?

  5. What are the application areas of social opinion mining?

2.2 Search strategy

The search strategy for this systematic review is primarily directed at published papers, which consist of journal articles, conference/workshop proceedings, or technical reports. The following electronic libraries were identified for use, due to their wide coverage of relevant publications within our domain: ACM Digital LibraryFootnote 10, IEEE Xplore Digital LibraryFootnote 11, ScienceDirectFootnote 12, and SpringerLinkFootnote 13.

The first three electronic libraries listed were used by three out of the four systematic reviews that our research process was based on (and which made use of a digital source), whereas SpringerLink is one of the most popular sources for publishing work in this domain (as will be seen in Sect. 2.4 below). Three other electronic libraries were also considered: Web of ScienceFootnote 14 and Ei CompendexFootnote 15, to which the host university did not have access, and Google ScholarFootnote 16, which was not included since its content is obtained from the electronic libraries listed above (and more), making its inclusion redundant.

The relevant search terms were identified for answering the research questions defined in Sect. 2.1. In addition, these questions were also used to perform some trial searches before the following list of relevant search terms was determined:

  1. “Social opinion mining”;

  2. “Social sentiment analysis”;

  3. “Opinion mining social media”;

  4. “Sentiment analysis social media”;

  5. “Microblog opinion mining”;

  6. “Microblog sentiment analysis”;

  7. “Social network sentiment”;

  8. “Social network opinion”;

  9. “Social data sentiment analysis”;

  10. “Social data opinion mining”;

  11. “Twitter sentiment analysis”;

  12. “Twitter opinion mining”;

  13. “Social data analysis”.

The following are important justifications behind the search terms selected above:

  • “opinion mining” and “sentiment analysis”: are both included due to the fact that these key terms are used interchangeably to denote the same field of study (Pang and Lee 2008; Cambria et al. 2013), even though their origins differ and hence do not refer to the same concept (Serrano-Guerrero et al. 2015);

  • “microblog”, “social network” and “Twitter”: the majority of the opinion mining and/or sentiment analysis research and development efforts target these two kinds of social media platforms, in particular the Twitter microblogging service.

2.3 Search application

The “OR” Boolean operator was chosen to formulate the search string. The search terms were all linked using this operator, making the search query simple and easy to use across multiple electronic libraries. Therefore, a publication only had to include any one of the search terms to be retrieved (Attard et al. 2015). In addition, this operator is more suitable for the defined search terms given that this study is not a general one, e.g., about opinion mining in general, but is focused on opinion mining in a social context. Construction of the correct search string (and terms) is very important, since it eliminates noise (i.e., false positives) as much as possible while still retrieving potentially relevant publications, which increases recall.
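As an illustration, such a search string can be assembled by quoting each multi-word term as a phrase and joining the terms with “OR”; the helper below is a sketch with an abbreviated term list, and the exact phrase-query syntax differs between the electronic libraries.

```python
# Abbreviated subset of the search terms from Sect. 2.2, for illustration.
SEARCH_TERMS = [
    "social opinion mining",
    "social sentiment analysis",
    "twitter sentiment analysis",
]

def build_or_query(terms):
    # Quote each multi-word term so it is treated as a phrase, then join
    # with OR so a publication matching any single term is retrieved.
    return " OR ".join(f'"{t}"' for t in terms)

print(build_or_query(SEARCH_TERMS))
# "social opinion mining" OR "social sentiment analysis" OR "twitter sentiment analysis"
```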

Several other factors had to be taken into consideration during the application of the search terms on the electronic libraries. The following is a list of factors relevant to our study, identified in Brereton et al. (2007) and verified during our search application process:

  • Electronic library search engines have different underlying models and thus do not always provide the required support for systematic searching;

  • The same set of search terms cannot be used across multiple engines, e.g., complex logical combinations are not supported by the ACM Digital Library but are by the IEEE Xplore Digital Library;

  • The Boolean search string is dependent on the order of terms, independently of brackets;

  • There are inconsistencies in the ordering by relevance of search results (e.g., IEEE Xplore Digital Library results are sorted in order of relevance);

  • Certain electronic libraries treat multiple words as a single Boolean term and look for instances of all the words together (e.g., “social opinion mining”). In this case, the use of the “AND” Boolean operator (e.g., “social AND opinion AND mining”) looks for all of the words, but not necessarily together.

Given the above, it was very important in our case to select a search strategy appropriate to the review’s research question that could be applied to the selected electronic libraries.

When applying the relevant search on top of the search strategy defined in Sect. 2.2, another important element was to identify appropriate metadata fields upon which the search string can be executed. Table 1 presents the ones applied in our study.

Table 1 Metadata fields used in search application

Applying the search on the title metadata field alone would have resulted in several missed and/or incorrect results. Therefore, using the abstract and/or keywords in the search is very important to reduce the number of irrelevant results. In addition, this ensures that significant publications that lack any of the relevant search terms in their title are still returned.

A separate search method was applied for each electronic library, since they all offer different functionalities and have different underlying models. Each method is detailed below:

  • ACM: Separate searches for each metadata field were conducted and the results were merged (duplicates removed). This was necessary because the metadata field search functionality “ANDs” all metadata fields, and manually amending the search query did not work well.

  • IEEE: Separate searches for each metadata field were conducted and results were merged (duplicates removed).

  • ScienceDirect: One search that takes into consideration all the chosen metadata fields.

  • SpringerLink: By entering a search term or phrase, a search is conducted over the title, abstract and full-text (including authors, affiliations and references) of every article and book chapter. This was evident in the large number of returned papers (as will be discussed in the next sub-section), which results in a high number of false positives (and possibly a higher recall).
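The merge step used for the ACM and IEEE searches (union of per-field result lists with duplicates removed) can be sketched as follows; the `title` field and list-of-dicts representation are illustrative assumptions:

```python
def merge_results(*result_sets):
    """Merge per-field/per-library result lists, removing duplicates by a
    normalised title key (case- and whitespace-insensitive)."""
    seen, merged = set(), []
    for results in result_sets:
        for paper in results:
            key = " ".join(paper["title"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged

title_search = [{"title": "Opinion Mining on Twitter"}]
abstract_search = [{"title": "opinion mining on  Twitter"},  # duplicate, different casing
                   {"title": "Sentiment Analysis of Microblogs"}]
print(len(merge_results(title_search, abstract_search)))  # 2
```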

2.4 Study selection

A manual study selection was performed on the primary studies obtained from the search application defined in Sect. 2.3. This is required to eliminate any studies that might be irrelevant even though the search terms appear in one of the metadata fields defined in Table 1 above. Therefore, the inclusion and exclusion criteria listed below were defined.

Published papers that meet any of the following inclusion criteria are chosen as primary studies:

  • I1. A study that targeted at least one social networking service and/or utilised a social dataset, possibly alongside other social media services, such as blogs, chats and wikis. Please note that only work performed on social data from social networking services is taken into consideration for the purposes of this review;

  • I2. A study published from the year 2007 onwards. This year was chosen since the mid-2000s saw the evolution of several social networking services, in particular Facebook’s growth (2007), which currently has the highest number of monthly active users;

  • I3. A study published in the English language.

Published papers that satisfy any of the exclusion criteria in the following list are removed from the systematic review:

  • E1. A study published before 2007;

  • E2. A study that does not focus on performing any sort of opinion mining on social media services, even though it mentions some of the search terms;

  • E3. A study that focuses on opinion mining or sentiment analysis in general, i.e., with no reference to a social context;

  • E4. A study that is only focused on social data sources obtained from online forums, communities, blogs, chats, social news websites (e.g., SlashdotFootnote 17), review websites (e.g., IMDbFootnote 18);

  • E5. A study that consists of either a paper’s front cover and/or title page.
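Although the screening in this review was performed manually, the combined effect of these criteria can be expressed as a simple predicate over each candidate study; the field names below are illustrative assumptions rather than an actual screening tool.

```python
def select_study(study: dict) -> bool:
    """Apply the review's year (I2/E1) and language (I3) criteria plus a
    relevance flag standing in for I1/E2-E4; field names are illustrative."""
    if study["year"] < 2007:            # E1 / I2: published 2007 onwards
        return False
    if study["language"] != "English":  # I3: English-language studies only
        return False
    # I1 / E2-E4: must target a social networking or microblogging service
    return study.get("targets_social_network", False)

studies = [
    {"year": 2006, "language": "English", "targets_social_network": True},
    {"year": 2015, "language": "English", "targets_social_network": True},
    {"year": 2015, "language": "French",  "targets_social_network": True},
]
print(sum(select_study(s) for s in studies))  # 1
```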

Selection of the primary studies for this systematic review was carried out in 2019. Therefore, studies indexed or published from 2019 onwards are not included in this review.

Table 2 Primary studies selection procedure from the electronic libraries

Table 2 shows the results for each electronic library at each step of the procedure used for selecting the final set of primary studies. The results included one conference proceedings volume, which was resolved by including all the published papers within the track relevant to this studyFootnote 19, since the other papers were not relevant and thus not included in the initial results. The search application phase resulted in a total of 861 published papers. False positives, which consist of duplicate papers and papers that meet any of the exclusion criteria, were removed. This was done through a manual study selection performed on all the metadata fields considered, i.e., the title, abstract and keywords. In cases where it was still unclear whether a published paper was valid or not, we went through the full text. This study selection operation left us with 460 published papers, meaning the number of false positives totalled 401. Out of the final selection, we did not have full access to 9 published papers, reducing the total number of primary studies to 451.

In addition to the primary studies selected from the electronic libraries, we added a set of 34 relevant published papers (excluding survey papers) for completeness, which were either published in reputable venues within the Opinion Mining community or were highly cited. Therefore, the final set of primary studies totals 485 published papers.

2.5 Extraction of data

2.5.1 Overall

The main objective of this study is to conduct a systematic analysis of the current literature in the field of SOM. Each published paper in this review was analysed in terms of the following information/parameters: social media platforms, techniques and approaches, social datasets, language, modality, tools and technologies, (other) NLP tasks, application areas and opinion mining dimensions. It is important to note that this information was manually extracted from each published paper. In the sub-sections below we discuss the overall statistics about the relevant primary studies that resulted from the study selection phase of this systematic review.

2.5.2 Study selection: electronic libraries

Figure 1 shows that the first three years of the evaluation period, i.e., 2007–2009, did not return any relevant literature. It is important to note that 2006 and 2007 was the period when opinion mining emerged in Web applications and weblogs within multiple domains, such as politics and marketing (Pang and Lee 2008). However, 2010, a year which coincides with the introduction of various social media platforms and the major increase in Facebook and Twitter usageFootnote 20, produced the first relevant literature, and the figures kept increasing in the following years. Please note that the final year in the evaluation, 2018, contains literature that was published or indexed up to 31st December 2018. Of the twelve full years evaluated, 2018 produced the highest number of relevant publications. This reflects the importance of opinion mining on social data and the continuous increase in social media usage and popularity, in particular of social networking services. Moreover, SOM solutions are increasingly used in various real-world applications.

Fig. 1
figure 1

Primary studies by year

2.5.3 Study selection: additional set

The additional set of studies included in this systematic review were published between 2009 and 2014. These came from various publishers, namely the four selected for this study (ACM, IEEE Xplore, ScienceDirect and SpringerLink) and other popular ones, such as the Association for the Advancement of Artificial Intelligence (AAAI)Footnote 21, the Association for Computational Linguistics (ACL)Footnote 22 and the Wiley Online LibraryFootnote 23.

2.6 Synthesis of data

The data synthesis of this detailed analysis is based on the extracted data mentioned in Sect. 2.5.1 above, which is discussed in the subsequent sections.

3 Review analysis

Table 3 provides different high-level categories of the primary studies selected for this systematic review, as discussed in Sect. 2.4.

Table 3 Categories of primary studies

It must be noted that not all the published papers were considered in every analysis conducted. Therefore, this table is referenced in all of the different aspects of the data synthesised, as presented below. It presents the primary studies returned from each electronic library and the additional ones, together with those without full access, survey papers, papers which present work that can be applied/used on social data, and papers originating from organised tasks within the domain.

The in-depth analysis, which focused on the social media platforms, techniques, social datasets, language, modality, tools and technologies, NLP tasks and other aspects used across the published papers, is presented in Sects. 3.1–3.7.

3.1 Social media platforms

Social data refers to online data generated from any type of social media platform, whether microblogging, social networking, blogging, photo/video sharing or crowdsourcing. Given that this systematic survey focuses on opinion mining approaches that make use of social networking and microblogging services, we identify the social media platforms used in the studies within this review.

In total, 469 studies were evaluated: 66 from ACM, 155 from IEEE Xplore, 32 from ScienceDirect, 182 from SpringerLink and 34 additional ones. Papers to which we did not have full access were excluded. Note that 4 survey papers (2 from ACM (Giachanou and Crestani 2016; Zimbra et al. 2018), 1 from IEEE Xplore (Wagh and Punde 2018) and 1 from SpringerLink (Abdullah and Hadzikadic 2017)) and 2 SpringerLink organised/shared task papers (Loukachevitch and Rubtsova 2015; Patra et al. 2015) were included, since the former focus on Twitter Sentiment Analysis methods and the latter on Sentiment Analysis of tweets (so the target social media platform of all evaluated papers is clear in both cases). None of the other 14 survey papers (Rajalakshmi et al. 2017; Yenkar and Sawarkar 2018; Abdelhameed and Muñoz-Hernández 2017; Rathan et al. 2017; Liu and Young 2018; Zhang et al. 2018; Ravi and Ravi 2015; Nassirtoussi et al. 2014; Beigi et al. 2016; Lo et al. 2017; Ji et al. 2016; Batrinca and Treleaven 2015; Li et al. 2014; Lin and He 2014) have been included, since various social media platforms were used in the studies they evaluate. In addition, 2 papers that presented a general approach which can be applied/used on social data (i.e., not tied to any particular source) (Min et al. 2013; El Haddaoui et al. 2018) have also not been included.

Out of these studies, 429 made use of 1 social media platform, whereas 32 made use of 2–4 social media platforms, as can be seen in Fig. 2.

Fig. 2 Number of social media platforms used in each study

With respect to social media platforms, in total 504 were used across all of the studies. These span the following 18 different platforms, which are also listed in Table 4:

  1. Twitter: a microblogging platform that allows publishing of short text updates (“microposts”);

  2. Sina Weibo: a Chinese microblogging platform that is like a hybrid of Twitter and Facebook;

  3. Facebook: a social networking platform that allows users to connect and share content with family and friends online;

  4. YouTube: a video-sharing platform;

  5. Tencent Weibo: a Chinese microblogging platform;

  6. TripAdvisor: a travel platform that allows people to post their reviews about hotels, restaurants and other travel-related content, besides offering accommodation bookings;

  7. Instagram: a platform for sharing photos and videos from a smartphone;

  8. Flickr: an image- and video-hosting platform that is popular for sharing personal photos;

  9. Myspace: a social networking platform for musicians and bands to show and share their talent and connect with fans;

  10. Digg: a social bookmarking and news aggregation platform that selects stories for a specific audience;

  11. Foursquare: formerly a location-based service, nowadays a local search and discovery mobile application known as Foursquare City Guide;

  12. Stocktwits: a social networking platform for investors and traders to connect with each other;

  13. LinkedIn: a professional networking platform that allows users to communicate and share updates with colleagues and potential clients, and supports job searching and recruitment;

  14. Plurk: a social networking and microblogging platform;

  15. Weixin: a Chinese multi-purpose messaging and social media app developed by Tencent;

  16. PatientsLikeMe: a health information sharing platform for patients;

  17. Apontador: a Brazilian platform that allows users to share their opinions and photos on social networks and also book hotels and restaurants;

  18. Google+: formerly a social networking platform (shut down in April 2019) that included features such as posting photos and status updates, grouping different relationship types into Circles, organising events and tagging locations.

Table 4 Social media platforms used in the studies

Overall, Twitter was the most popular with 371 opinion mining studies making use of it, followed by Sina Weibo with 46 and Facebook with 30. Other popular platforms such as YouTube (12), Tencent Weibo (8), TripAdvisor (7), Instagram (6) and Flickr (5) were also used in a few studies. These results show the importance and popularity of microblogging platforms, such as Twitter and Sina Weibo, which are also very frequently used for research and development purposes in this domain. Such microblogging platforms provide researchers with an Application Programming Interface (API) to access social data, which plays a crucial role in their selection for such studies. On the other hand, data retrieval from other social media platforms, such as Facebook, is becoming more challenging due to ethical concerns. For example, access to Facebook's Public Feed API is restricted and users cannot apply for it.

3.2 Techniques

For this analysis, 465 studies were evaluated: 65 from ACM, 154 from IEEE Xplore, 32 from ScienceDirect, 180 from SpringerLink and 34 additional ones. Excluded were studies without full access, surveys and organised task papers. The main aim was to identify the technique(s) used for the opinion mining process on social data. The studies were therefore categorised under the following approaches: Lexicon (Lx), Machine Learning (ML), Deep Learning (DL), Statistical (St), Probabilistic (Pr), Fuzziness (Fz), Rule (Rl), Graph (Gr), Ontology (On), Hybrid (Hy), i.e., a combination of more than one technique, Manual (Mn) and Other (Ot). Table 5 provides the yearly statistics for all the respective approaches adopted.

Table 5 Approaches used in the studies analysed

From the studies analysed, 88 developed and used more than 1 technique within their respective studies. These techniques include the ones originally used in their approach and/or ones used for comparison/baseline/experimentation purposes. In particular, from these 88 studies, 65 used 2 techniques each, 17 studies used 3 techniques, 4 studies used 4 techniques, and 2 studies made use of 5 techniques, for a total of 584 techniques used across all studies (including the studies that used 1 technique). The results show that a hybrid approach is the most popular one, with over half of the studies adopting such an approach. This is followed by Machine Learning and Lexicon techniques, which are usually chosen to perform any form of opinion mining. These results are explained in more detail in the sub-sections below.

3.2.1 Lexicon

In total, 94 unique studies adopted a lexicon-based approach to perform a form of SOM, producing a total of 96 different techniques. The majority of the lexicons used were specifically related to opinions and are well known in this domain, whereas the rest, though more general, can still be used for conducting opinion mining.

Table 6 Lexicon-based studies

Table 6 presents the number of lexicons (first row and columns titled 1–8) used by the lexicon-based studies (second row). The column titled “Other/NA” covers studies that only used general lexicons, such as acronym dictionaries, intensifier words, downtoner words, negation words and internet slang lists, and/or studies that do not provide any information on the exact lexicons used.

The majority of the lexicon-based studies used one or two lexicons, with a total of 144 state-of-the-art lexicons (55 unique ones) used across them. The following are the top six lexicons by usage:

  1. SentiWordNet (Baccianella et al. 2010)—used in 22 studies;

  2. Hu and Liu (Hu and Liu 2004)—used in 12 studies;

  3. AFINN (Årup Nielsen 2011) and SentiStrength (Thelwall et al. 2012)—used in 9 studies each;

  4. MPQA—Subjectivity (Wilson et al. 2005)—used in 8 studies;

  5. HowNet Sentiment Analysis Word Library (HowNetSenti)—used in 6 studies;

  6. NRC Word-Emotion Association Lexicon (also known as NRC Emotion Lexicon or EmoLex) (Mohammad and Turney 2010, 2013), WordNet (Miller 1995) and Wikipedia's list of emoticons—used in 5 studies each.

In addition to the lexicons mentioned above, 19 studies used lexicons that they created as part of their work or specifically focused on creating SOM lexicons, such as Årup Nielsen (2011), who created the AFINN word list for sentiment analysis in microblogs; Javed et al. (2014), who built a bilingual sentiment lexicon for English and Roman Urdu; Santarcangelo et al. (2015), the creators of the first Italian sentiment thesaurus; Wu et al. (2016) for Chinese sentiment analysis; and Bandhakavi et al. (2016) for sentiment analysis on Twitter. These lexicons varied from social media-focused lexicons (Tian et al. 2015; Ghiassi and Lee 2018; Pollacci et al. 2017), to sentiment and/or emoticon lexicons (Jurek et al. 2014; Molina-González et al. 2014; Khuc et al. 2012; Ranjan et al. 2018; Vo et al. 2017; Feng et al. 2015; Wang and Wu 2015; Zhou et al. 2014) and extensions of existing state-of-the-art lexicons (Li et al. 2016; Pandarachalil et al. 2015; Andriotis et al. 2014), such as Li et al. (2016), who extended HowNetSenti with words manually collected from the internet, and Pandarachalil et al. (2015), who built a sentiment lexicon for slang words and acronyms from SenticNet (Cambria et al. 2020) and SentiWordNet.
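At their core, most of the lexicon-based techniques surveyed above share the same mechanism: look up each token's prior polarity in a lexicon and aggregate the scores, often with simple handling for negation. A minimal sketch of this idea (the toy lexicon and negation rule below are illustrative inventions, not any of the resources listed above):

```python
# Minimal lexicon-based polarity scorer: sum token polarities from a lookup
# table, flipping the sign of an opinion word that follows a negator.
import re

TOY_LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
               "bad": -1.0, "awful": -2.0, "hate": -2.0}
NEGATORS = {"not", "no", "never"}

def lexicon_score(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True          # flip the polarity of the next opinion word
            continue
        if tok in TOY_LEXICON:
            score += -TOY_LEXICON[tok] if negate else TOY_LEXICON[tok]
        negate = False
    return score

def polarity(text):
    s = lexicon_score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"
```

Real systems differ mainly in the lexicon used (e.g., SentiWordNet, AFINN) and in how intensifiers, downtoners, emoticons and hashtags adjust the token scores.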

3.2.2 Machine learning

A total of 121 studies adopted a machine learning-based approach to perform a form of SOM, where several supervised and unsupervised algorithms were used. Table 7 below presents the number of machine learning algorithms (first row and columns titled 1–7) used by the machine learning-based studies (second row). The column titled “NA” refers to studies that do not provide any information on the exact algorithms used.

Table 7 Machine learning-based studies

In total, 239 machine learning algorithms were used (not distinct) across 117 studies (since 4 studies did not provide any information), with 235 being supervised and 4 unsupervised. It is important to note that this figure does not include any supervised/semi-supervised/unsupervised algorithms proposed by the respective authors; these are discussed below.

Table 8 Supervised machine learning algorithms

Table 8 provides a breakdown of the 235 supervised machine learning algorithms (not distinct) that were used within these studies. The NB and SVM algorithms are clearly the most popular in this domain, especially for text classification. With respect to the former, it is important to note that 20 out of the 75 studies used the Multinomial NB (MNB), a model usually utilised for discrete counts, i.e., the number of times a given term (word or token) appears in a document. The other 55 studies made use of the Multi-variate Bernoulli NB (MBNB) model, which is based on binary data, where every token in a document's feature vector takes the value 0 or 1. As for SVM, this method sorts the given data into two categories (binary classification). If multi-class classification is required, the Support Vector Classification (SVC), NuSVC or LinearSVC algorithms are usually applied, with SVC and NuSVC implementing the “one-against-one” approach and LinearSVC the “one-vs-the-rest” multi-class strategy.
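The count-based view of the MNB model described above can be made concrete with a short sketch; the training data below is invented for illustration, and an MBNB variant would simply binarise each document's counts (token present or absent) before training:

```python
# Multinomial NB with Laplace smoothing over term *counts*; an MBNB variant
# would instead model token presence/absence per document.
from collections import Counter
from math import log

def train_mnb(docs, labels):
    """Train a multinomial Naive Bayes text classifier; returns a predict function."""
    vocab = {w for d in docs for w in d.split()}
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())

    def predict(text):
        def log_posterior(c):
            total = sum(counts[c].values()) + len(vocab)  # Laplace denominator
            return log(prior[c]) + sum(
                log((counts[c][w] + 1) / total) for w in text.split())
        return max(classes, key=log_posterior)

    return predict

docs = ["good good movie", "great film", "awful movie", "bad plot", "boring bad film"]
labels = ["pos", "pos", "neg", "neg", "neg"]
predict = train_mnb(docs, labels)
```

Each class is scored by its log prior plus the smoothed log likelihood of every token, which is exactly the discrete-count behaviour of MNB described above.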

The LoR statistical technique is also widely used in machine learning for binary classification problems; in total, 16 of the studies analysed made use of this algorithm. DT learning has also been widely used, a model that applies a DT to both classification and regression problems. Various algorithms exist for building a DT: 2 studies used C4.5 (Quinlan 1993), an extension of Quinlan's Iterative Dichotomiser 3 (ID3) algorithm used for classification purposes; 3 studies used J48, Weka's implementation of a simple C4.5 DT for classification; 2 used the Hoeffding Tree (Hulten et al. 2001); and the other 8 used the basic ID3 algorithm.

MaxEnt, used by 12 studies, is a probabilistic classifier also applied to text classification problems such as sentiment analysis. More specifically, it is a generalisation of LoR for multi-class scenarios (Yu et al. 2011). RF was used in 9 studies; this supervised learning algorithm, which can be used for both classification and regression tasks, builds a forest (an ensemble of DTs) with an element of randomness. Moreover, 7 studies used the KNN algorithm, one of the simplest classification algorithms, where no learning is required since the model structure is determined from the entire dataset.

The SentiStrength algorithm, utilised by 5 studies (Gonçalves et al. 2013; Lu et al. 2015; Baecchi et al. 2016; Yan et al. 2017; Zhang et al. 2018), can be used in both supervised and unsupervised settings, since the authors developed a version for each learning case. Conditional Random Fields, used by 4 studies (Pak and Paroubek 2010; Zhang et al. 2014; Wang et al. 2016; Hao et al. 2017), are a type of discriminative classifier that models the decision boundary between different classes, whereas LiR was also used by 4 studies (Bollen et al. 2011; Pavel et al. 2017; Adibi et al. 2018; Xiaomei et al. 2018). Moreover, 3 studies each used the SANT (Ou et al. 2014; Lu 2015; Xiaomei et al. 2018) and SGD (Bifet and Frank 2010; Juneja and Ojha 2017; Sánchez-Holgado and Arcila-Calderón 2018) algorithms, with the former mostly used for comparison against the approaches proposed by the respective authors.

In addition, the PA algorithm was used in 2 studies (Li et al. 2014; Filice et al. 2014). In the former (Li et al. 2014), this algorithm was used in a collaborative online learning framework to automatically classify whether a post is emotional or not, thereby overcoming the diversity of microblogging styles that increases the difficulty of classification. The authors of the latter study (Filice et al. 2014) extend the budgeted PA algorithm to enable robust and efficient natural language learning processes based on semantic kernels. The proposed online learner was applied to two real-world linguistic tasks, one of which was sentiment analysis.

Nine other algorithms were used by 7 different studies, namely: Bagging (Sygkounas et al. 2016), BN (Lu et al. 2016), CRB (Raja and Swamynathan 2016), AB (Raja and Swamynathan 2016), HMM (Zhang et al. 2014), Dictionary Learning (Asiaee et al. 2012), NBSVM (Sun et al. 2017), MCC (Çeliktuğ 2018) and ICO (Çeliktuğ 2018).

In terms of unsupervised machine learning algorithms, 4 were used in 2 of the studies that adopted a machine learning-based technique. Suresh and Raj S. used the K-Means (KM) (Lloyd 1982) and Expectation Maximization (Dempster et al. 1977) clustering algorithms in Suresh (2016). Both were used for comparison against an unsupervised modified fuzzy clustering algorithm proposed by the authors. The proposed algorithm produced accurate results without the manual processing, linguistic knowledge or training time that supervised approaches require. Baecchi et al. (2016) used two unsupervised algorithms, namely Continuous Bag-Of-Words (CBOW) (Mikolov et al. 2013) and the Denoising Autoencoder (DA) (Vincent et al. 2008) (the SGD and backpropagation algorithms were used for the DA learning process), together with supervised ones, namely LoR, SVM and SentiStrength, for constructing their method and for comparison purposes. They considered both textual and visual information in their work on sentiment analysis of social network multimedia. Their proposed unified model (CBOW-DA-LoR) works in both an unsupervised and semi-supervised manner, learning text and image representations as well as the sentiment polarity classifier for tweets containing images.

Other studies proposed their own algorithms, with some of the established algorithms discussed above playing an important role in their implementation and/or comparison. Zimmermann et al. proposed a semi-supervised algorithm, the S*3Learner (Zimmermann et al. 2014), which suits changing opinion stream classification environments, where the vocabulary evolves over time, with new words appearing and old words disappearing. Severyn et al. (2016) defined a novel and efficient tree kernel function, the Shallow syntactic Tree Kernel, for multi-class supervised sentiment classification of online comments. This study focused on YouTube, which is multilingual, multimodal, multidomain and multicultural, with the aim of determining whether the polarity of a comment is directed towards the source video, the product described in the video or another product. Furthermore, Ignatov and Ignatov (2017) presented a novel DT-based algorithm, the Decision Stream, with Twitter sentiment analysis being one of several common machine learning problems it was evaluated on. Lastly, Fatyanosa et al. (2018) enhanced the NB classifier with an optimisation algorithm, the Variable Length Chromosome Genetic Algorithm (VLCGA), thereby proposing VLCGA-NB for Twitter sentiment analysis.

Moreover, the following 13 studies proposed an ensemble method or evaluated ensemble-based classifiers:

  • Çeliktuğ (2018) used two ensemble learning methods, RF and MCC (amongst other machine learning algorithms), for sentiment classification of Twitter datasets;

  • Yan et al. (2017) presented two ensemble learners built on four off-the-shelf classifiers, for Twitter sentiment classification;

  • Zhang et al. (2018), Adibi et al. (2018), Çeliktuğ (2018), Vora and Chacko (2017), Lu et al. (2016), Rexha et al. (2016), Xie et al. (2012) and Zhang et al. (2011) used the RF ensemble learning method in their work;

  • Troussas et al. (2016) evaluated the most common ensemble methods that can be used for sentiment analysis on Twitter datasets;

  • Sygkounas et al. (2016) proposed an ensemble system composed of five state-of-the-art sentiment classifiers;

  • Le et al. (2014) used multiple oblique decision stumps classifiers to form an ensemble of classifiers, which is more accurate than a single one for classifying tweets;

  • Neethu and Rajasree (2013) used an ensemble classifier (and single algorithm classifiers) for sentiment classification.

Ensembles usually provide more accurate classification answers than individual classifiers, i.e., classic learning approaches. In addition, ensembles reduce the overall risk of choosing a wrong classifier, especially when applied to a new dataset (Da Silva et al. 2014).
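The majority-voting idea behind many of these ensembles can be sketched in a few lines; the three base classifiers below are hypothetical stand-ins for, e.g., an NB model, an SVM and a lexicon scorer:

```python
# Majority-vote ensemble: each base classifier votes and the most common label
# wins, reducing the risk of relying on one poorly matched model.
from collections import Counter

def majority_vote(classifiers, text):
    votes = [clf(text) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical base classifiers, stand-ins for e.g. NB, SVM and a lexicon scorer.
clf_a = lambda t: "pos" if "good" in t else "neg"
clf_b = lambda t: "pos" if ("great" in t or "good" in t) else "neg"
clf_c = lambda t: "neg"  # deliberately biased; the ensemble can outvote it

label = majority_vote([clf_a, clf_b, clf_c], "a good film")  # "pos" (2 votes to 1)
```

Stacking and bagging ensembles, as used in some of the studies above, replace the simple vote with a learned combiner or with resampled training sets, but the risk-spreading intuition is the same.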

3.2.3 Deep learning

Deep learning is a subset of machine learning based on Artificial Neural Networks (ANNs), algorithms inspired by the human brain, with connections, layers and neurons through which data propagates. A total of 35 studies adopted a deep learning-based approach to perform a form of SOM, where supervised and unsupervised algorithms were used. Twenty-six of the studies made use of 1 deep learning algorithm, 5 utilised 2 algorithms, and 2 studies each used 3 and 4 algorithms, respectively. Table 9 provides a breakdown of the 50 deep learning algorithms (not distinct) used within these studies.

Table 9 Deep learning algorithms

LSTM, a prominent variation of the RNN that makes it easier to retain past data in memory, was used in 13 studies (Yan et al. 2018; Sun et al. 2018; Sanyal et al. 2018; Ameur et al. 2018; Wazery et al. 2018; Li et al. 2018; Chen and Wang 2018; Chen et al. 2018; Sun et al. 2017; Hu et al. 2017; Shi et al. 2017; Wang et al. 2016; Yan and Tao 2016), making it the most popular deep learning algorithm amongst the evaluated studies. Three further studies (Ameur et al. 2018; Balikas et al. 2017; Wang et al. 2016) used the BLSTM, an extension of the traditional LSTM that can improve model performance on sequence classification problems. In particular, a BLSTM was used in Balikas et al. (2017) to improve the performance of fine-grained sentiment classification, an approach that can benefit sentiment expressed in different textual types (e.g., tweets and paragraphs), in different languages and at different granularity levels (e.g., binary and ternary). Similarly, Wang et al. (2016) proposed a language-independent method based on BLSTM models that incorporates preceding microblogs for context-aware Chinese sentiment classification.
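The "memory" that makes the LSTM attractive for these studies comes from its gated cell state. A single LSTM cell step can be sketched in NumPy (the weights below are random placeholders, not trained values):

```python
# One LSTM cell step: the input, forget and output gates decide what to write
# to, erase from and expose out of the cell state c, which is what lets the
# network retain information over long token sequences.
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """x: input vector; h_prev/c_prev: previous hidden and cell state.
    W, U, b hold the stacked parameters of the four gates (i, f, o, g)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h_prev + b      # (4*n,) stacked pre-activations
    n = h_prev.shape[0]
    i = sigmoid(z[0*n:1*n])         # input gate
    f = sigmoid(z[1*n:2*n])         # forget gate
    o = sigmoid(z[2*n:3*n])         # output gate
    g = np.tanh(z[3*n:4*n])         # candidate cell update
    c = f * c_prev + i * g          # new cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, n = 5, 4                         # input and hidden sizes
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(3, d)):   # run three time steps
    h, c = lstm_step(x, h, c, W, U, b)
```

A BLSTM simply runs a second such recurrence over the sequence in reverse and concatenates the two hidden states per time step.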

The CNN algorithm, a variant of the ANN, is made up of neurons that have learnable weights and biases, where each neuron receives an input, performs a dot product and optionally follows it with a non-linearity. In total, 12 studies (Sun et al. 2018; Ochoa-Luna and Ari 2018; Ameur et al. 2018; Adibi et al. 2018; Chen and Wang 2018; Napitu et al. 2017; Shi et al. 2017; Wehrmann et al. 2017; Zhang et al. 2017; Stojanovski et al. 2015; Wang et al. 2016; Severyn and Moschitti 2015) made use of this algorithm. Notably, Wehrmann et al. (2017) proposed a language-agnostic, translation-free method for Twitter sentiment analysis.

RNNs, a powerful set of ANNs useful for processing and recognising patterns in sequential data such as natural language, were used in 8 studies (Yan et al. 2018; Ochoa-Luna and Ari 2018; Piñeiro-Chousa et al. 2018; Wazery et al. 2018; Pavel et al. 2017; Shi et al. 2017; Yan and Tao 2016; Wang et al. 2016). One study in particular (Averchenkov et al. 2015) considered a novel approach to aspect-based sentiment analysis of Russian social networks based on RNNs, where the best results were obtained by using a special network modification, the RNTN. Two further studies (Lu et al. 2015; Sygkounas et al. 2016) also used this algorithm (the RNTN) in their work.

Five other studies (Arslan et al. 2018; Anjaria and Guddeti 2014; Du et al. 2014; Politopoulou and Maragoudakis 2013; Zhang et al. 2011) used a simple type of ANN, such as the feedforward neural network. Moreover, the MLP, a class of feedforward ANN, was used in 2 studies (Chen and Zheng 2018; Ramadhani and Goo 2017). Similarly, 2 studies (Yan et al. 2018; Ameur et al. 2018) proposed methods based on the AE unsupervised learning algorithm which is used for representation learning. Lastly, one study each made use of the GRU (Wang et al. 2016) and DAN2 (Ghiassi et al. 2013) algorithms.

Some studies used several types of ANNs in their work. Ameur et al. (2018) used multiple methods based on AE, CNN, LSTM and BLSTM for sentiment polarity classification, and Wang et al. (2016) used RNN, LSTM, BLSTM and GRU models. Yan et al. (2018) used learning methods based on RNN, LSTM and AE for comparison with their proposed learning framework for short text classification, and Shi et al. (2017) proposed an improved LSTM that considers user-based and content-based features, using CNN, LSTM and RNN models for comparison purposes. Furthermore, Ochoa-Luna and Ari (2018) made use of CNN and RNN deep learning algorithms for tweet sentiment analysis, Wazery et al. (2018) and Yan and Tao (2016) used the RNN and LSTM, whereas Sun et al. (2018) and Chen and Wang (2018) proposed new models based on CNN and LSTM.

3.2.4 Statistical

A total of 9 studies (Wang et al. 2018; Kitaoka and Hasuike 2017; Arslan et al. 2017; Raja and Swamynathan 2016; Yang et al. 2014; Bukhari et al. 2016; Zhang et al. 2015; Karpowicz et al. 2013; Supriya et al. 2016) adopted a statistical approach to perform a form of SOM. In particular, one of the approaches proposed in Arslan et al. (2017) uses the term frequency-inverse document frequency (tf-idf) (Salton and McGill 1986) statistic to identify the important words within a tweet and dynamically enrich Twitter-specific dictionaries created by the authors. The tf-idf is also one of several statistical techniques used in Wang et al. (2018) for comparison against their proposed novel feature weighting approach for Twitter sentiment analysis. Moreover, Raja and Swamynathan (2016) focus on a statistical sentiment score calculation technique based on adjectives, whereas Yang et al. (2014) use a variation of point-wise mutual information to measure the opinion polarity of an entity and its competitors, a method that differs from traditional opinion mining.
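The tf-idf statistic used in several of these studies can be computed directly from its definition, tf-idf(t, d) = tf(t, d) × log(N / df(t)); the three short "tweets" below are invented for illustration:

```python
# tf-idf from its definition: terms frequent in one document but rare across
# the collection receive the highest weights.
from collections import Counter
from math import log

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))  # document frequency
    weights = []
    for d in docs:
        tf = Counter(d.split())
        weights.append({t: tf[t] * log(N / df[t]) for t in tf})
    return weights

tweets = ["cheap flights to malta", "cheap hotel deals", "malta weather today"]
weights = tf_idf(tweets)
```

In the first tweet, "flights" (appearing in one document) outweighs "cheap" (appearing in two), which is what makes tf-idf useful for picking out a tweet's distinctive words, as in the dictionary-enrichment approach above.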

3.2.5 Probabilistic

A total of 6 studies (Bhattacharya and Banerjee 2017; Baecchi et al. 2016; Ou et al. 2014; Ragavi and Usharani 2014; Yan et al. 2014; Lek and Poo 2013) adopted a probabilistic approach to perform a form of SOM. In particular, Ou et al. (2014) propose a novel probabilistic model, the Content and Link Unsupervised Sentiment Model, which focuses on microblog sentiment classification incorporating link information, namely behaviour, same-user and friend links.

3.2.6 Fuzziness

Two studies (D’Asaro et al. 2017; Del Bosque and Garza 2014) adopted a fuzzy-based approach to perform a form of SOM. D’Asaro et al. (2017) present a sentiment evaluation and analysis system based on fuzzy linguistic textual analysis. Del Bosque and Garza (2014) assume that aggressive text detection is a sub-task of sentiment analysis, closely related to document polarity detection, given that aggressive text can be seen as intrinsically negative. Their approach considers the document's length and the number of swear words as inputs, with the output being an aggressiveness value between 0 and 1.
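A fuzzy score of the kind described by Del Bosque and Garza (2014) can be sketched with simple membership functions; the thresholds, the choice of fuzzy AND operator and the toy swear-word list below are illustrative assumptions, not the authors' actual design:

```python
# Fuzzy aggressiveness sketch: fuzzify document length and swear-word count
# with linear membership functions, then combine them with a fuzzy AND (min).
def ramp(x, lo, hi):
    """Linear membership function: 0 below lo, 1 above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

SWEAR_WORDS = {"damn", "hell"}  # toy list for illustration

def aggressiveness(text):
    """Return an aggressiveness value in [0, 1] from length and swearing."""
    tokens = text.lower().split()
    mu_short = 1.0 - ramp(len(tokens), 5, 30)  # short posts weigh swearing more
    mu_swear = ramp(sum(t in SWEAR_WORDS for t in tokens), 0, 3)
    return min(mu_short, mu_swear)  # fuzzy AND of "short" and "contains swearing"
```

The fuzzy formulation avoids a hard threshold: a document is aggressive to a degree, which suits the graded nature of polarity in this task.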

3.2.7 Rule-based

In total, 4 studies (El Haddaoui et al. 2018; Zhang et al. 2014; Min et al. 2013; Bosco et al. 2013) adopted a rule-based approach to perform a form of SOM. Notably, Bosco et al. (2013) applied an approach for automatic emotion annotation of ironic tweets, which relies on sentiment lexicons (words and expressions) and a sentiment grammar expressed by compositional rules.

3.2.8 Graph

Four studies (Dritsas et al. 2018; Vilarinho and Ruiz 2018; Chen et al. 2015; Rabelo et al. 2012) adopted a graph-based approach to perform a form of SOM. The study in Vilarinho and Ruiz (2018) presents a word graph-based method for Twitter sentiment analysis using global centrality metrics over graphs to evaluate sentiment polarity. In Dritsas et al. (2018), a graph-based method is proposed for sentiment classification at a hashtag level. Moreover, the authors in Chen et al. (2015) compare their proposed multimodal hypergraph-based microblog sentiment prediction approach with a combined hypergraph-based method (Huang et al. 2010). Lastly, Rabelo et al. (2012) used link mining techniques to infer the opinions of users.
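Graph-based methods like these generally build a word graph and let polarity flow from seed words to their neighbours. A minimal propagation sketch (the co-occurrence graph, seed polarities and damping factor below are invented for illustration):

```python
# Polarity propagation over a word co-occurrence graph: seed words keep a
# fixed polarity; other nodes repeatedly take a damped average of their
# neighbours' scores.
def propagate(adj, seeds, iters=10, damping=0.5):
    """adj: {word: [neighbour words]}; seeds: {word: fixed polarity, e.g. +1/-1}."""
    scores = {n: seeds.get(n, 0.0) for n in adj}
    for _ in range(iters):
        updated = {}
        for node, nbrs in adj.items():
            if node in seeds:
                updated[node] = seeds[node]  # seed polarities never change
            else:
                avg = sum(scores[m] for m in nbrs) / len(nbrs) if nbrs else 0.0
                updated[node] = damping * avg
        scores = updated
    return scores

adj = {"good": ["nice", "service"],
       "nice": ["good", "service"],
       "bad": ["rude"],
       "rude": ["bad", "service"],
       "service": ["good", "nice", "rude"]}
scores = propagate(adj, {"good": 1.0, "bad": -1.0})
```

Centrality-based variants, as in the word-graph method above, instead weight each node's contribution by a global centrality metric rather than a plain neighbour average.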

3.2.9 Ontology

Two studies (Lau et al. 2014; Kontopoulos et al. 2013) adopted an ontology-based approach to perform a form of SOM. In particular, the technique developed in Kontopoulos et al. (2013) performs more fine-grained sentiment analysis of tweets where each subject within the tweets is broken down into a set of aspects, with each one being assigned a sentiment score.

3.2.10 Hybrid

Hybrid approaches are very much in demand for performing different opinion mining tasks: 244 unique studies (out of 465) adopted such an approach, producing a total of 282 different techniques.

Tables 10 and 11 list these studies, together with the type of techniques used in each. In total, there were 38 different hybrid combinations across the analysed studies.

Table 10 Studies adopting a hybrid approach consisting of two techniques
Table 11 Studies adopting a hybrid approach consisting of three and four techniques

The majority of these studies used two different techniques (213 out of 282)—see Table 10—within their hybrid approach, whereas 62 used three and 7 studies used four different techniques (see Table 11).

The Lexicon and Machine Learning-based techniques were mostly used, where they accounted for 40% of the hybrid approaches, followed by Lexicon and Statistical-based (7.8%), Machine Learning and Statistical-based (7.4%), and Lexicon, Machine Learning and Statistical-based (7.4%) techniques.

Moreover, out of the 282 hybrid approaches, 232 used lexicons, 205 used Machine Learning and 39 used Deep Learning. These numbers reflect the importance of these three techniques within the SOM research and development domain. In light of this, a list of the lexicons, machine learning and deep learning algorithms used in these studies has been compiled, similar to Sects. 3.2.1 to 3.2.3 above. The lexicons, machine learning and deep learning algorithms quoted below were used either in the proposed method/s and/or for comparison purposes in the respective studies.

In terms of state-of-the-art lexicons, these total 403 within the studies adopting a hybrid approach. The top ones align with the results obtained for the lexicon-based approaches in Sect. 3.2.1 above. The following lexicons were used more than ten times across the hybrid approaches:

  1. SentiWordNet—used in 51 studies;

  2. MPQA—Subjectivity—used in 28 studies;

  3. Hu and Liu—used in 25 studies;

  4. WordNet—used in 24 studies;

  5. AFINN—used in 22 studies;

  6. SentiStrength—used in 21 studies;

  7. HowNetSenti—used in 15 studies;

  8. NRC Word-Emotion Association Lexicon—used in 13 studies;

  9. NRC Hashtag Sentiment Lexicon (Mohammad et al. 2013)—used in 12 studies;

  10. SenticNet, Sentiment140 Lexicon (also known as NRC Emoticon Lexicon) (Mohammad et al. 2013), National Taiwan University Sentiment Dictionary (NTUSD) (Ku et al. 2006) and Wikipedia's list of emoticons—used in 11 studies each.

Further to the lexicons quoted above, 49 studies used lexicons that they created as part of their work. Some studies composed their lexicons from emoticons/emojis extracted from a dataset (Cao et al. 2018; Li and Fleyeh 2018; Azzouza et al. 2017; Zimbra et al. 2016; You and Tunçer 2016; Chen et al. 2015; Porshnev et al. 2014; Cui et al. 2011; Zhang et al. 2012; Vu et al. 2012), combined publicly available emoticon lexicons/lists (Siddiqua et al. 2016) or mapped emoticons to their corresponding polarity (Tellez et al. 2017), while others (Gao et al. 2016; Souza et al. 2016; Su et al. 2014; Yan et al. 2014; Tang et al. 2013; Cui et al. 2011; Zhang et al. 2012; Li and Xu 2014) used seed/feeling/emotional words to establish a typical microblog emotional dictionary. Additionally, some authors constructed or used sentiment lexicons (Zhang et al. 2018; Vo et al. 2017; Rout et al. 2017; Jin et al. 2017; Ismail et al. 2018; Yan et al. 2017; Katiyar et al. 2018; Al Shammari 2018; Abdullah and Zolkepli 2017; Liu and Young 2016; Sahu et al. 2015; Cho et al. 2014; Wang et al. 2014; Chen et al. 2015; Jiang et al. 2013; Cui et al. 2013; Khuc et al. 2012; Montejo-Raez et al. 2014; Rui et al. 2013), some of which are domain or language specific (Konate and Du 2018; Hong and Sinnott 2018; Chen et al. 2017; Zhao et al. 2016; Lu et al. 2016; Zhou et al. 2014; Porshnev and Redkin 2014), others that extend state-of-the-art lexicons (Li et al. 2016, 2016; Koto and Adriani 2015), and some that were made available to the research community (Cotfas et al. 2017; Castellucci et al. 2015), such as the Distributional Polarity Lexicon.

Table 12 Machine learning algorithms used in the studies adopting a hybrid approach

Table 12 below presents a list of machine learning algorithms (381 in total, across 197 studies) that were used within the hybrid approaches. The first column indicates the algorithm, the second lists the type of learning algorithm, in terms of Supervised (Sup), Unsupervised (Unsup) and Semi-supervised (Semi-sup), and the last column lists the total number of studies using each respective algorithm. The SVM and NB algorithms were mostly used in supervised learning, a result that corresponds to the machine learning-based approaches in Sect. 3.2.2 above. With respect to the latter, 76 studies used the MBNB algorithm, 19 studies the MNB and 1 study the Discriminative MNB. Moreover, the LoR; DT, namely the basic ID3 (10 studies), J48 (5 studies), C4.5 (5 studies), Classification And Regression Tree (3 studies), Reduced Error Pruning (1 study), DT with AB (1 study), McDiarmid Tree (McDiarmid 1989) (1 study) and Hoeffding Tree (1 study) algorithms; RF; MaxEnt; and SentiStrength (used in both supervised and unsupervised settings) algorithms were also used in various studies. Notably, some algorithms beyond those used in the machine learning-based approaches in Sect. 3.2.2 above were used in a hybrid approach, in particular: SVR (Drucker et al. 1997), Extremely Randomised Trees (Geurts et al. 2006), Least Median of Squares Regression (Rousseeuw 1984), Maximum Likelihood Estimation (Fisher 1925), Hyperpipes (Witten et al. 2016), Extreme Learning Machine (Huang et al. 2006), Domain Adaptation Machine (Duan et al. 2009), RIPPER (Cohen 1995), Affinity Propagation (Frey and Dueck 2007), Multinomial inverse regression (Taddy 2013), Apriori (Agrawal et al. 1994), Distant Supervision (Go et al. 2009) and Label Propagation (Zhu and Ghahramani 2002).

Given that deep learning is a subset of machine learning, the deep learning algorithms used within the hybrid approaches are presented separately below. In total, 36 studies used the following deep learning algorithms:

  • CNN—used in 16 studies (Yan et al. 2018; Stojanovski et al. 2018; Konate and Du 2018; Hanafy et al. 2018; Haldenwang et al. 2018; Ghosal et al. 2018; Chen et al. 2017; Ameur et al. 2018; Alharbi and DeDoncker 2017; Symeonidis et al. 2018; Saini et al. 2018; Jianqiang et al. 2018; Baccouche et al. 2018; Cai and Xia 2015; Kalayeh et al. 2015; Yanmei and Yuda 2015);

  • ANN—used in 8 studies (Li and Fleyeh 2018; Karyotis et al. 2017; Poria et al. 2016; Er et al. 2016; Koto and Adriani 2015; Porshnev and Redkin 2014; Porshnev et al. 2014; Hassan et al. 2013);

  • LSTM—used in 7 studies (Yan et al. 2018; Konate and Du 2018; Hanafy et al. 2018; Ghosal et al. 2018; Ameur et al. 2018; Sun et al. 2017; Baccouche et al. 2018);

  • MLP—used in 7 studies (Villegas et al. 2018; Ghosal et al. 2018; Coyne et al. 2017; Karyotis et al. 2017; Bravo-Marquez et al. 2014; Del Bosque and Garza 2014; Thelwall et al. 2010);

  • RNN—used in 4 studies (Yan et al. 2018; Liu et al. 2018; Baccouche et al. 2018; Yanmei and Yuda 2015);

  • AE—used in 2 studies (Yan et al. 2018; Ameur et al. 2018);

  • BLSTM—used in 2 studies (Konate and Du 2018; Ameur et al. 2018);

  • DAN2—used in 2 studies (Ghiassi and Lee 2018; Zimbra et al. 2016);

  • Deep Belief Network (Hinton and Salakhutdinov 2006), a probabilistic generative model that is composed of multiple layers of stochastic, latent variables—used in 2 studies (Jin et al. 2017; Tang et al. 2013);

  • GRU—used in 1 study (Cao et al. 2018);

  • Generative Adversarial Networks (GAN) (Goodfellow et al. 2014), deep neural net architectures composed of two networks, a generator and a discriminator, pitted one against the other—used in 1 study (Cao et al. 2018);

  • Conditional GAN (Mirza and Osindero 2014), a conditional version of GAN constructed by feeding the data to be conditioned on to both the generator and the discriminator—used in 1 study (Cao et al. 2018);

  • Hierarchical Attention Network (Yang et al. 2016), a neural architecture for document classification—used in 1 study (Liu et al. 2018).

Further to the quoted algorithms, 22 studies (Hong and Sinnott 2018; Hanafy et al. 2018; Ghosal et al. 2018; Saleena 2018; Yan et al. 2017; Tong et al. 2017; Dedhia and Ramteke 2017; Wijayanti and Arisal 2017; Xia et al. 2017; Jianqiang 2016; Prusa et al. 2015; Fersini et al. 2015; Abdelwahab et al. 2015; Kanakaraj and Guddeti 2015; Hagen et al. 2015; Cai and Xia 2015; Mansour et al. 2015; Wang et al. 2014; Tsakalidis et al. 2014; Da Silva et al. 2014; Hassan et al. 2013; Gonçalves et al. 2013) used ensemble learning methods in their work, where they combined the output of several base machine learning and/or deep learning methods. In particular, Gonçalves et al. (2013) compared eight popular lexicon and machine learning-based sentiment analysis algorithms, and then developed an ensemble that combines them, which in turn provided the best coverage results and competitive agreement. Moreover, Ghosal et al. (2018) propose an MLP-based ensemble network that combines LSTM, CNN and feature-based MLP models, with each model incorporating character-, word- and lexicon-level information, to predict the degree of intensity for sentiment and emotion. Lastly, as presented in Table 12, the RF ensemble learning method was used in 21 studies (Da Silva et al. 2014; Porshnev et al. 2014; Samoylov 2014; Yuan et al. 2014; Buddhitha and Inkpen 2015; Kanakaraj and Guddeti 2015; Jianqiang 2015; Bouchlaghem et al. 2016; Deshwal and Sharma 2016; Jianqiang 2016; Yan and Tao 2016; Tong et al. 2017; Jianqiang and Xiaolin 2017; Bouazizi and Ohtsuki 2017; Elouardighi et al. 2017; Bouazizi and Ohtsuki 2018; Li and Fleyeh 2018; Saleena 2018; Villegas et al. 2018; Yan et al. 2018; Zhang et al. 2018).
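The simplest way such ensembles combine the outputs of their base classifiers is majority voting; more elaborate schemes weight each base model, as in the MLP-based ensemble of Ghosal et al. (2018). A minimal sketch of the voting step (the function name and labels are illustrative, not taken from any of the cited studies):

```python
from collections import Counter

def majority_vote(classifier_outputs):
    """Combine the labels predicted by several base classifiers for one post.

    Ties are broken by the order in which labels first appear in the input.
    """
    return Counter(classifier_outputs).most_common(1)[0][0]
```

For example, if three base classifiers label a tweet positive, negative and positive respectively, the ensemble outputs positive.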

3.2.11 Other

In total, 23 studies did not adopt any of the approaches discussed in the previous sections. This is mainly due to three reasons: no information being provided by the authors (13 studies), use of an automated approach (4 studies), or use of a manual approach (6 studies) (Sandoval-Almazan and Valle-Cruz 2018; Fang and Ben-Miled 2017; Song and Gruzd 2017; Zafar et al. 2016; Furini and Montangero 2016; Cvijikj and Michahelles 2011) to perform a form of SOM. Regarding the first reason, the majority of these studies (Ayoub and Elgammal 2018; Tiwari et al. 2017; Ouyang et al. 2017; Anggoro et al. 2016; Williamson and Ruming 2016; Agrawal et al. 2014; Pupi et al. 2014; Das et al. 2014) were not specifically focused on SOM (it was secondary), in contrast to the others (Vivanco et al. 2017; Gonzalez-Marron et al. 2017; Chen et al. 2016; Barapatre et al. 2016; Mejova and Srinivasan 2012). As for the automated approaches (Sharma et al. 2018; Pai and Alathur 2018; Ali et al. 2018; Teixeira and Laureano 2017), some of them used cloud services, such as Microsoft Azure Text AnalyticsFootnote 60, or out-of-the-box functionality provided by existing tools/software libraries, such as the TextBlobFootnote 61 Python library.

3.3 Social datasets

Numerous datasets were used across the 465 studies evaluated for this systematic review. These consisted of SOM datasets released online for public use, which have been widely used across the studies, and newly collected datasets, some of which were made publicly available while others remained private to the respective studies. In terms of data collection, the majority of studies used the respective platform's API, such as the Twitter Search APIFootnote 62, either directly or through a third-party library, e.g., Twitter4JFootnote 63. Due to the large number of datasets, only the most widely used ones are discussed in this section. In addition, only social datasets are mentioned, irrespective of whether other non-social datasets (e.g., news, movies) were also used, given that the main focus of this review is on social data.

The first sub-section (Sect. 3.3.1) presents an overview of the top social datasets used, whereas the second sub-section (Sect. 3.3.2) presents a comparative analysis of the studies that produced the best performance for each respective social dataset.

3.3.1 Overview

The following are the fourteen most used social datasets across all studies:

  1. Stanford Twitter Sentiment (STS) (Go et al. 2009)—used in 61 studies: 1,600,000 training tweets collected via the Twitter API, made up of 800,000 tweets containing positive emoticons and 800,000 tweets containing negative emoticons. These are based on various topics, such as Nike, Google, China, Obama, Kindle, San Francisco, North Korea and Iran.

  2. SandersFootnote 64—used in 32 studies: 5513 hand-classified tweets about four topics: Apple, Google, Microsoft, Twitter. These tweets are labelled as follows: 570 positive, 654 negative, 2503 neutral, and 1786 irrelevant.

  3. SemEval 2013—Task 2Footnote 65 (Nakov et al. 2013)—used in 28 studies: Training, development and test sets of Twitter and SMS messages annotated with positive, negative and objective/neutral labels via the Amazon Mechanical Turk crowdsourcing platform. This was done for two subtasks, focusing on the expression level and the message level.

  4. SemEval 2014—Task 9Footnote 66 (Rosenthal et al. 2014)—used in 18 studies: Continuation of SemEval 2013—Task 2, where three new test sets from regular and sarcastic tweets, and LiveJournal sentences, were introduced.

  5. STS Gold (STS-Gold) (Saif et al. 2013)—used in 17 studies: A subset of STS, which was annotated manually at tweet and entity level. The tweet labels were either positive, negative, neutral, mixed, or other.

  6. Health care reform (HCR) (Speriosu et al. 2011)—used in 17 studies: This dataset contains tweets about the 2010 health care reform in the USA. A subset of these is annotated for polarity with the following labels: positive, negative, neutral, irrelevant. The polarity targets, such as health care reform, conservatives, democrats, liberals, republicans, Obama, Stupak and Tea Party, were also annotated. All were distributed into training, development and test sets.

  7. Obama-McCain Debate (OMD) (Shamma et al. 2009)—used in 17 studies: 3238 tweets about the first presidential debate held in the USA for the 2008 campaign. The sentiment labels of the tweets were acquired by Diakopoulos and Shamma (2010) using Amazon Mechanical Turk, and are rated as either positive, negative, mixed, or other.

  8. SemEval 2015—Task 10Footnote 67 (Rosenthal et al. 2015)—used in 15 studies: This continues datasets 3 and 4, with three new subtasks. The first two target sentiment about a particular topic in one tweet or a collection of tweets, whereas the third targets the degree of prior polarity of a phrase.

  9. SentiStrength Twitter (SS-Twitter) (Thelwall et al. 2012)—used in 12 studies: Six human-coded databases from BBC, Digg, MySpace, Runners World, Twitter and YouTube, annotated for sentiment polarity strength, i.e., negative between -1 (not negative) and -5 (extremely negative), and positive between 1 (not positive) and 5 (extremely positive).

  10. SemEval 2016—Task 4Footnote 68 (Nakov et al. 2016)—used in 9 studies: This is a re-run of dataset 8, with three new subtasks. The first one replaces the standard two-point scale (positive/negative) or three-point scale (positive/negative/neutral) with a five-point scale (very positive/positive/OK/negative/very negative). The other two subtasks replaced tweet classification with quantification (i.e., estimating the distribution of the classes in a set of unlabelled items) according to a two-point and a five-point scale, respectively.

  11. NLPCC 2012Footnote 69—used in 6 studies: Chinese microblog sentiment dataset (sentence level) from Tencent Weibo provided by the First Conference on Natural Language Processing and Chinese Computing (NLP&CC 2012). It consists of a training set of microblogs about two topics and a test set about 20 topics, where the subjectivity (subjective/objective) and the polarity (positive/negative/neutral) were assigned for each.

  12. NLPCC 2013Footnote 70—used in 6 studies: Dataset from Sina Weibo used for the Chinese Microblog Sentiment Analysis Evaluation (CMSAE) task at the second NLP&CC conference in 2013. The Chinese microblogs were classified into seven emotion types: anger, disgust, fear, happiness, like, sadness, surprise. The test set contains 10,000 microblogs, where each text is labelled with a primary emotion type and a secondary one (if possible).

  13. Sentiment Evaluation (SE-Twitter) (Narr et al. 2012)—used in 5 studies: Human-annotated multilingual dataset of 12,597 tweets in four languages, namely English, German, French and Portuguese. Polarity annotation with the labels positive, negative, neutral and irrelevant was conducted manually using Amazon Mechanical Turk.

  14. SemEval 2017—Task 4 (Rosenthal et al. 2017)—used in 5 studies: This dataset continues with a re-run of dataset 10, where two new changes were introduced: inclusion of the Arabic language for all subtasks, and provision of profile information of the Twitter users who posted the target tweets.

All the datasets above are textual, with the majority composed of social data from Twitter. In terms of language, only the SE-Twitter dataset (number 13) can be considered multilingual, with the rest targeting English (the majority) or Chinese microblogs, although SemEval 2017—Task 4 (number 14) introduced a new language in Arabic. An additional dataset is the one produced by Mozetič et al. (2016), which contains 15 Twitter sentiment corpora for 15 European languages. Some studies, such as Munezero et al. (2015), used one of the English-based datasets above (STS-Gold) for multiple languages, given that they adopted a lexicon-based approach. Moreover, these datasets had different usages within the respective studies, the most common being as a training/test set, for the final evaluation of the proposed solution/lexicon, or for comparison purposes. Evaluation challenges like SemEval are important for generating social datasets, such as the ones above and that of Cortis et al. (2017), since these can be used by the Opinion Mining community for further research and development.
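The distant-supervision scheme behind the STS dataset (item 1), where emoticons act as noisy polarity labels and are then stripped from the text so that a classifier cannot trivially memorise them, can be sketched as follows. The function name and emoticon sets are illustrative, not taken from Go et al.'s implementation:

```python
# Distant supervision: emoticons act as noisy sentiment labels for tweets.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}   # illustrative subset
NEGATIVE_EMOTICONS = {":(", ":-(", "=("}         # illustrative subset

def distant_label(tweet: str):
    """Return (cleaned_text, label), or None when there is no reliable label."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:   # neither or both polarities present: skip tweet
        return None
    label = "positive" if has_pos else "negative"
    # Strip the emoticons so the training text carries no label leakage.
    cleaned = " ".join(t for t in tokens
                       if t not in POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS)
    return cleaned, label
```

Applied over a large tweet stream, this yields a weakly labelled training set of the kind STS provides, without any manual annotation.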

3.3.2 Comparative analysis

A comparative analysis was carried out of all the studies that used the social datasets presented in the previous sub-section. The Precision, Recall, F-measure (F1-score) and Accuracy metrics were selected to evaluate these studies (where available) and identify the best performance for each social dataset. It is important to note that this could not be done for certain datasets, since the experiments conducted were not consistent across all the studies. The top three studies (where possible) obtaining the best results for each of the four evaluation metrics are presented in the tables below.

Tables 13 and 14 provide the best results for the STS and Sanders datasets.

Table 13 Studies obtaining the best performance for the STS (1) social dataset
Table 14 Studies obtaining the best performance for the Sanders (2) social dataset

Tables 15 and 16 provide the best results for the SemEval 2013—Task 2 and SemEval 2014—Task 9 datasets, specifically for sub-task B, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 15 Studies obtaining the best performance for the SemEval 2013—Task 2 (3) social dataset
Table 16 Studies obtaining the best performance for the SemEval 2014—Task 9 (4) social dataset

Tables 17, 18 and 19 provide the best results for the STS-Gold, HCR and OMD datasets.

Table 17 Studies obtaining the best performance for the STS-Gold (5) social dataset
Table 18 Studies obtaining the best performance for the HCR (6) social dataset
Table 19 Studies obtaining the best performance for the OMD (7) social dataset

Table 20 provides the best results for the SemEval 2015—Task 10 dataset, specifically for sub-task B, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 20 Studies obtaining the best performance for the SemEval 2015—Task 10 (8) social dataset

Table 21 provides the best results for the SS-Twitter dataset.

Table 21 Studies obtaining the best performance for the SS-Twitter (9) social dataset

Table 22 provides the best results for the SemEval 2016—Task 4 dataset, specifically for sub-task A, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 22 Studies obtaining the best performance for the SemEval 2016—Task 4 (10) social dataset

Tables 23 and 24 provide the best results for the NLPCC 2012 dataset. Results quoted below are for task 1 which focused on subjectivity classification (see Table 23) and task 2 which focused on sentiment polarity classification (see Table 24). Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 23 Studies obtaining the best performance for the NLPCC 2012 - Task 1 (11) social dataset
Table 24 Studies obtaining the best performance for the NLPCC 2012 - Task 2 (11) social dataset

Tables 25 and 26 provide the best results for the NLPCC 2013 and SE-Twitter datasets.

Table 25 Studies obtaining the best performance for the NLPCC 2013 (12) social dataset
Table 26 Studies obtaining the best performance for the SE-Twitter (13) social dataset

Table 27 provides the best results for the SemEval 2017—Task 4 dataset, specifically for sub-task A, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 27 Studies obtaining the best performance for the SemEval 2017—Task 4 (14) social dataset

The following are some comments regarding the social dataset results quoted in the tables above:

  • In cases where several techniques and/or methods were applied, the highest result obtained in the study for each of the four evaluation metrics was recorded, even if that technique did not produce the best result for all metrics.

  • The average Precision, Recall and F-measure results are quoted (if provided by the authors), i.e., the average score of the results for each classified level (e.g., the average score of the results obtained for each sentiment polarity classification level: positive, negative and neutral).

  • Results for social datasets that were released as a shared evaluation task, such as SemEval, are not quoted when they were provided only in the metrics used by the task organisers, or when the authors chose other metrics.

  • Certain studies evaluated their techniques on a subset of the actual dataset. The results quoted are the ones where the entire dataset was used (according to the authors and/or our understanding).

  • Quoted results are for classification tasks and not aspect-based SOM, which can vary depending on the focus of the study.

  • Results presented in a graph visualisation were not considered due to the exact values not being clear.
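The averaging convention described in the comments above, i.e., per-class scores averaged across the polarity levels (macro-averaging), can be sketched as follows; the three-class polarity setup is illustrative:

```python
def macro_scores(gold, predicted):
    """Macro-averaged Precision, Recall and F1 over the classes in gold."""
    classes = sorted(set(gold))
    per_class = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        pred_c = sum(1 for p in predicted if p == c)   # predicted as class c
        gold_c = sum(1 for g in gold if g == c)        # truly of class c
        prec = tp / pred_c if pred_c else 0.0
        rec = tp / gold_c if gold_c else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(classes)
    # Average each metric over all classes, giving every class equal weight.
    return tuple(sum(m[i] for m in per_class) / n for i in range(3))
```

Unlike micro-averaging, this gives the minority classes (often positive and negative, against a large neutral class) the same weight as the majority class, which is why it is the convention commonly reported in the surveyed studies.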

3.4 Language

Multilingual/bilingual SOM is very challenging, since it deals with multi-cultural social data. For example, analysing Chinese and English online posts together can yield mixed sentiment across such posts. Therefore, it is hard for researchers to make a fair judgement in cases where the results for online posts in different languages contradict each other (Yan et al. 2014).

The majority of the studies (354 out of 465) considered for this review analysis support one language in their SOM solutions. A total of 80 studies did not specify whether their proposed solution is language-agnostic or otherwise, or else their modality was not textual-based. Lastly, only 31 studies cater for more than one language: 18 bilingual, 1 trilingual and 12 proposed solutions claiming to be multilingual. Regarding the latter, the majority were tested on a few languages at most, with Castellucci et al. (2015, 2015) on English and Italian, Montejo-Raez et al. (2014) on English and Spanish, Erdmann et al. (2014) on English and Japanese, Radhika and Sankar (2017) on English and MalayalamFootnote 71, Baccouche et al. (2018) on English, French and Arabic, Munezero et al. (2015) on keyword sets for different languages (e.g., Spanish, French), Wehrmann et al. (2017) on English, Spanish, Portuguese and German, Cui et al. (2011) on Basic Latin (English) and Extended Latin (Portuguese, Spanish, German), Teixeira and Laureano (2017) on Spanish, Italian, Portuguese, French, English and Arabic, Zhang et al. (2017) on 8 languages, namely English, German, Portuguese, Spanish, Polish, Slovak, Slovenian and Swedish, and Gao et al. (2016) on 11 languages, namely English, Dutch, French, German, Italian, Polish, Portuguese, Russian, Spanish, Swedish and Turkish.

The list below specifies the languages supported by the 19 bilingual and trilingual studies:

  • English and Italian (Severyn et al. 2016; D’Avanzo and Pilato 2015; Pupi et al. 2014);

  • English and German (Abdelrazeq et al. 2016; Tumasjan et al. 2010);

  • English and Spanish (Giachanou et al. 2017; Cotfas et al. 2015; Delcea et al. 2014);

  • English and Brazilian Portuguese (Guerra et al. 2014);

  • English and Chinese (Xia et al. 2017; Yan et al. 2014);

  • English and Dutch (Flaes et al. 2016);

  • English and Greek (Politopoulou and Maragoudakis 2013);

  • English and Hindi (Anjaria and Guddeti 2014);

  • English and Japanese (Ragavi and Usharani 2014);

  • English and Roman-Urdu (Javed et al. 2014);

  • English and Swedish (Li and Fleyeh 2018);

  • English and Korean (Ramadhani and Goo 2017);

  • English, German and Spanish (Boididou et al. 2018).

Some studies above (D’Avanzo and Pilato 2015; Anjaria and Guddeti 2014; Tumasjan et al. 2010) translated their input data into an intermediate language, mostly English, to perform SOM.

Moreover, Table 28 provides a list of the non-English languages identified from the 354 studies that support one language. Chou et al. (2017) claim that their method can be easily applied to any ConceptNetFootnote 72 supported language, with Wang et al. (2016) similarly claiming that their method is language independent, whereas the solution by Wang and Wu (2015) is multilingual given that emoticons are used in the majority of languages.

Table 28 Non-English languages supported by studies in this review analysis

3.5 Modality

The majority of the studies in this systematic review, and in the state-of-the-art, focus on SOM in the textual modality, with only 15 out of 465 studies applying their work to more than one modality. Other modalities, such as visual (image, video) and audio information, are often ignored, even though they contribute greatly towards expressing user emotions (Chen et al. 2015). Moreover, when two or more modalities are considered together for any form of social opinion, such as emotion recognition, they are often complementary, thus increasing the system's performance (Caschera et al. 2016). Table 29 lists the multimodal studies within the review analysis, with the ones catering for two modalities, text and image, being the most popular.

Table 29 Studies adopting a multimodal approach

3.5.1 Datasets

Most currently available datasets and resources for SOM are restricted to the textual modality. The following non-textual social datasets (not listed in Sect. 3.3) were used across the mentioned studies:

  • YouTube Dataset (Morency et al. 2011) used in Poria et al. (2016): 47 videos targeting various topics, such as politics, electronics and product reviews.

  • SentiBank Twitter DatasetFootnote 73 (Borth et al. 2013) used in Baecchi et al. (2016) and Cai and Xia (2015): Image dataset from Twitter annotated for polarity using Amazon Mechanical Turk. Tweets with images related to 21 hashtags (topics) resulted in 470 being positive and 133 being negative.

  • SentiBank Flickr Dataset (Borth et al. 2013) used in Cai and Xia (2015): 500,000 image posts from Flickr labeled by 1553 adjective noun pairs based on Plutchik’s Wheel of Emotions (psychological theory) (Plutchik 1980).

  • You Image Dataset (You et al. 2015) used in Cai and Xia (2015): Image dataset from Twitter consisting of 769 positive and 500 negative tweets with images, annotated using Amazon Mechanical Turk.

  • Katsurai and Satoh Image DatasetFootnote 74 (Katsurai and Satoh 2016) used in Ortis et al. (2018): Dataset of images from Flickr (90,139) and Instagram (65,439) with their sentiment labels.

3.5.2 Observations

The novel methodology by Poria et al. (2016) is the only multimodal sentiment analysis approach which caters for four different modalities, namely text, image, video and audio, with sentiments extracted from social Web videos. In Caschera et al. (2016), the authors propose a method whereby machine learning techniques need to be trained on different and heterogeneous features when used on different modalities, such as polarity and intensity of lexicons from text, prosodic features from audio, and postures, gestures and expressions from video. The sentiment of video and audio data in Song and Gruzd (2017) was manually coded, a task that is labour-intensive and time-consuming. The addition of images to the microblogs' textual data reinforces and clarifies certain feelings (Wang et al. 2014; Baecchi et al. 2016), thus improving the sentiment classifier with the image features (Liu et al. 2015; Zhang et al. 2015; Wang et al. 2014; Cai and Xia 2015). Similarly, Chen et al. (2015) also demonstrate the superiority of their multimodal hypergraph method when compared to single-modality (in this case textual) methods. Moreover, these results are further supported by the method in Poria et al. (2016), which caters for more than two modalities (audio, visual and textual) and shows that accuracy improves drastically when such modalities are used together.

Flaes et al. (2016) apply their multimodal (text, images) method in a real-world application area, where their research shows that several relationships exist between city liveability indicators collected by the local government and automatically extracted sentiment; for example, sentiment detected from Flickr data has a negative linear association with people living on welfare checks. Results in Rai et al. (2018) show that there is a high correlation between sentiment extracted from text-based social data and image-based landscape preferences by humans. In addition, results in Yuan et al. (2015) show some correlation between image and textual tweets; however, the authors mention that more features and more robust data are required to determine the exact influence of multimedia content in the social domain. The work in Chen et al. (2017) adopts a bimodal approach to solve the problem of cross-domain image sentiment classification by using textual and visual features from the target domain and measuring text/image similarity simultaneously.

Therefore, multimodality in the SOM domain is one of numerous research gaps identified in this systematic review. This provides researchers with an opportunity towards further research, development and innovation in this area.

3.6 Tools and technologies

In this systematic review, we also analysed the tools and technologies that were used across all studies for various opinion mining operations conducted on social data, such as NLP, machine learning and big data handling. The subsections below list the most used ones across the studies for the various operations required.

3.6.1 NLP

The following are the top 5 NLP tools used across all studies for various NLP tasks:

  • Natural Language Toolkit (NLTK)Footnote 75: a platform that provides lexical resources, text processing libraries for classification, tokenisation, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial NLP libraries;

  • TweetNLPFootnote 76: consists of a tokeniser, Part-of-Speech (POS) tagger, hierarchical word clusters, and a dependency parser for tweets, besides annotated corpora and web-based annotation tools;

  • Stanford NLPFootnote 77: software that provides statistical NLP, deep learning NLP and rule-based NLP tools, such as Stanford CoreNLP, Stanford Parser, Stanford POS Tagger;

  • NLPIR-ICTCLASFootnote 78: a Chinese word segmentation system that includes keyword extraction, POS tagging, NER, and microblog analysis, amongst other features;

  • word2vecFootnote 79: an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words.

3.6.2 Machine learning

The top 5 machine learning tools used across all studies are listed below:

  • WekaFootnote 80: a collection of machine learning algorithms for data mining tasks, including tools for data preparation, classification, regression, clustering, association rules mining and visualisation;

  • scikit-learnFootnote 81: consists of a set of tools for data mining and analysis, such as classification, regression, clustering, dimensionality reduction, model selection and pre-processing;

  • LIBSVMFootnote 82: an integrated software for support vector classification, regression, distribution estimation and multi-class classification;

  • LIBLINEARFootnote 83: a linear classifier for data with millions of instances and features;

  • SVM-LightFootnote 84: an implementation of SVMs for pattern recognition, classification, regression and ranking problems.

3.6.3 Opinion mining

Certain studies used opinion mining tools in their research to either conduct their main experiments or for comparison purposes to their proposed solution/s. The following are the top 3 opinion mining tools used:

  • SentiStrengthFootnote 85: a sentiment analysis tool that is able to conduct binary (positive/negative), trinary (positive/neutral/negative), single-scale (from -4, very negative, to +4, very positive), keyword-oriented and domain-oriented classifications;

  • Sentiment140Footnote 86: a tool that allows you to discover the sentiment of a brand, product, or topic on Twitter;

  • VADER (Valence Aware Dictionary and sEntiment Reasoner)Footnote 87: a lexicon and rule-based sentiment analysis tool that is specifically focused on sentiments expressed in social media.

3.6.4 Big data

Several big data technologies were used by the analysed studies. The most popular ones are categorised in the list below:

  1. 1.

    Relational storage

    1. (a)

      MySQLFootnote 88

    2. (b)

      PostgreSQLFootnote 89

    3. (c)

      Amazon Relational Database Service (Amazon RDS)Footnote 90

    4. (d)

      Microsoft SQL ServerFootnote 91

  2. 2.

    Non-relational storage

    1. (a)


      1. i.

        MongoDBFootnote 92

      2. ii.

        Apache CouchDBFootnote 93

    2. (b)


      1. i

        Apache HBaseFootnote 94

  3. 3.

    Resource Description Framework Triplestore

  4. 4.

    Distributed Processing

    1. (a)

      Apache HadoopFootnote 95

    2. (b)

      Apache SparkFootnote 96

    3. (c)

      IBM InfoSphere StreamsFootnote 97

    4. (d)

      Apache AsterixDBFootnote 98

    5. (e)

      Apache StormFootnote 99

  5. 5.

    Data Warehouse

    1. (a)

      Apache HiveFootnote 100

  6. 6.

    Data Analytics

    1. (a)

      DatabricksFootnote 101

The MySQL relational database management system was the most used technology for storing structured social data, whereas MongoDB was mostly used for unstructured social data. The distributed processing technologies, on the other hand, were used for processing large-scale real-time and/or historical social data. In particular, Hadoop MapReduce was used for parallel processing of large volumes of structured, semi-structured and unstructured social datasets stored in the Hadoop Distributed File System, whereas Spark's ability to process both batch and streaming data was utilised in cases where velocity is more important than volume.

3.7 Natural language processing tasks

This section presents information about other NLP tasks that were conducted to perform SOM.

3.7.1 Overview

An element of NLP is performed in 283 of the 465 analysed studies, either for pre-processing (248 studies), for feature extraction (machine learning), or as one of the processing steps within their SOM solution. The most common and important NLP tasks range from Tokenisation, Segmentation and POS tagging to NER and Language Detection.

It is important to mention that the NLP tasks mentioned above, together with Anaphora Resolution, Parsing, Sarcasm and Sparsity, are some of the other challenges faced in the SOM domain (Khan et al. 2014). Moreover, online posts with complicated linguistic patterns are challenging to deal with (Li and Xu 2014).

However, Koto and Adriani (2015) showcase the importance and potential of NLP within this domain: they investigated the word combination patterns of tweets for subjectivity and polarity by considering their POS sequences. Results reveal that subjective tweets tend to have word combinations consisting of an adverb and an adjective, whereas objective tweets tend to have combinations of nouns. Moreover, negative tweets tend to contain affirmation words which often appear as a negation word.
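The POS-pattern finding above can be turned into a toy heuristic; the Penn-style tag set (RB/JJ/NN) and the rule itself are illustrative simplifications for exposition, not the actual classifier of Koto and Adriani:

```python
def subjectivity_hint(tagged_tokens):
    """Guess subjectivity from POS patterns: an adverb followed by an
    adjective hints at a subjective tweet, a run of nouns at an objective one.
    """
    tags = [tag for _, tag in tagged_tokens]
    for a, b in zip(tags, tags[1:]):
        if a.startswith("RB") and b.startswith("JJ"):   # adverb + adjective
            return "subjective"
    if tags and all(t.startswith("NN") for t in tags):  # nouns only
        return "objective"
    return "unknown"
```

For instance, the tagged sequence ("really", "RB"), ("good", "JJ"), ("phone", "NN") would be flagged as subjective, while a sequence of nouns such as ("budget", "NN"), ("report", "NN") would be flagged as objective.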

3.7.2 Pre-processing and negations

The majority (355 out of 465) of the studies performed some sort of pre-processing. Different methods and resources were used for this process, such as NLP tasks (e.g., tokenisation, stemming, lemmatisation, NER), and dictionaries for stop words and for acronyms and slang words (e.g., noslang.com, noswearing.com, Urban Dictionary, Internet lingo).
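A typical pre-processing pipeline of this kind can be sketched as follows; the slang dictionary and stop-word list are tiny illustrative stand-ins for resources such as noslang.com:

```python
import re

SLANG = {"gr8": "great", "u": "you", "imo": "in my opinion"}   # illustrative
STOP_WORDS = {"the", "a", "an", "is", "are", "to"}             # illustrative

def preprocess(post: str):
    """Lowercase, strip URLs/mentions, tokenise, expand slang, drop stop words."""
    post = post.lower()
    post = re.sub(r"(https?://\S+|@\w+)", " ", post)   # remove URLs, @mentions
    tokens = re.findall(r"[a-z0-9']+", post)           # crude tokenisation
    expanded = []
    for t in tokens:
        expanded.extend(SLANG.get(t, t).split())       # slang expansion
    return [t for t in expanded if t not in STOP_WORDS]
```

Real pipelines in the surveyed studies add further steps, such as stemming, lemmatisation or NER, on top of this skeleton.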

Negation handling is one of the most challenging issues faced by SOM solutions; nevertheless, 117 studies cater for negations within their approach. Several different methods are used, such as negation replacement, negation transformation, negation dictionaries, textual features based on negation words, and negation models.
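As an illustration of the negation transformation method listed above, the following sketch suffixes tokens in a negator's scope with "_NEG" until the next punctuation mark, so that "good" and "good_NEG" become distinct features; the negator and punctuation lists are illustrative, not drawn from any cited study.

```python
# Minimal negation-scope marking: every token after a negation word is
# suffixed with "_NEG" until the scope is closed by punctuation.

NEGATORS = {"not", "no", "never", "n't", "cannot"}
PUNCT = {".", ",", "!", "?", ";"}

def mark_negation(tokens):
    out, in_scope = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            in_scope = True
            out.append(tok)
        elif tok in PUNCT:
            in_scope = False   # punctuation closes the negation scope
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return out

print(mark_negation(["I", "did", "not", "like", "the", "movie", ".", "Ending", "was", "fine"]))
# ['I', 'did', 'not', 'like_NEG', 'the_NEG', 'movie_NEG', '.', 'Ending', 'was', 'fine']
```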

3.7.3 Emoticons/Emojis

Social media can be seen as a sub-language that uses emoticons/emojis mixed with text to show emotions (Min et al. 2013). Emoticons/emojis are commonly used in tweets irrespective of the language, therefore are sometimes considered as being domain and language independent (Khan et al. 2014), thus useful for multilingual SOM (Cui et al. 2011).

Even though some researchers remove emoticons/emojis as part of their pre-processing stage (depending on what the authors want to achieve), many others have utilised their emotional meaning within their SOM process. This has led to emoticons/emojis playing a very important role in 205 of the analysed studies, especially when the focus is on emotion recognition.

Results obtained from the emoticon networks model in Zhang et al. (2013) show that emoticons can help in performing sentiment analysis. This is supported by Jiang et al. (2015), who found that emoticons are a pure carrier of sentiment, and further by the emoticon polarity-aware method in Li et al. (2018), whose results show that emoticons can significantly improve the precision of sentiment polarity identification. In the case of hybrid (lexicon and machine learning) approaches, emoticon-aided lexicon expansion improves the performance of lexicon-based classifiers (Zhou et al. 2014). From an emotion classification perspective, Porshnev et al. (2014) analysed users' emoticons on Twitter to improve the accuracy of predictions for the Dow Jones Industrial Average and S&P 500 stock market indices. Other researchers (Cvijikj and Michahelles 2011) were interested in analysing how people express emotions, displayed via adjectives or usage of internet slang, i.e., emoticons, interjections and intentional misspelling.

Several emoticon lists were used in these studies, with the Wikipedia and DataGenetics ones most commonly used. Moreover, emoticon dictionaries consisting of emoticons and their corresponding polarity class, such as those in Agarwal et al. (2011), Aisopos et al. (2012) and Becker et al. (2013), were also used in certain studies.
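A hedged sketch of how such an emoticon dictionary can be applied: the emoticon-to-polarity entries below are illustrative stand-ins, not the contents of any cited resource.

```python
# A tiny emoticon polarity dictionary and a scorer that sums the
# polarities of all known emoticons in a tokenised post.

EMOTICON_POLARITY = {
    ":)": 1, ":-)": 1, ":D": 1, "<3": 1,   # positive
    ":(": -1, ":-(": -1, ":'(": -1,        # negative
    ":|": 0,                               # neutral
}

def emoticon_score(tokens):
    """Sum the polarities of all known emoticons; unknown tokens score 0."""
    return sum(EMOTICON_POLARITY.get(t, 0) for t in tokens)

print(emoticon_score(["great", "match", ":)", ":D"]))  # 2
print(emoticon_score(["delayed", "again", ":("]))      # -1
```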

3.7.4 Word embeddings

Word embeddings, a type of word representation that allows words with a similar meaning to have a similar representation, were used by several studies (Severyn and Moschitti 2015; Jiang et al. 2015; Castellucci et al. 2015, 2015; Cai and Xia 2015; Gao et al. 2015; Chen et al. 2015; Stojanovski et al. 2015; Gao et al. 2016; Zhao et al. 2016; Rexha et al. 2016; Hao et al. 2017; Kitaoka and Hasuike 2017; Arslan et al. 2018; Baccouche et al. 2018; Chen et al. 2018; Ghosal et al. 2018; Hanafy et al. 2018; Jianqiang et al. 2018; Stojanovski et al. 2018; Sun et al. 2018; Wan et al. 2018; Yan et al. 2018) adopting a learning-based (Machine Learning, Deep Learning and Statistical) or hybrid approach. These studies used word embedding algorithms such as word2vec, fastText and/or GloVe. Such a learned representation of text is capable of capturing the context of words within a piece of text, syntactic patterns, semantic similarity and relations with other words. Therefore, word embeddings are used for different NLP problems, with SOM being one of them.
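The notion that embeddings place similar words close together can be illustrated with cosine similarity over toy vectors; real systems would load pre-trained word2vec, fastText or GloVe vectors, whereas the three-dimensional vectors below are invented purely for demonstration.

```python
# Cosine similarity over hand-made 3-dimensional "embeddings": words with
# similar meaning ("good", "great") get a similarity close to 1, while
# opposites ("good", "bad") score low or negative.

import math

TOY_VECTORS = {
    "good":  [0.9, 0.1, 0.2],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.7, 0.1, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(TOY_VECTORS["good"], TOY_VECTORS["great"]))  # close to 1
print(cosine(TOY_VECTORS["good"], TOY_VECTORS["bad"]))    # negative
```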

3.7.5 Aspect-based social opinion mining

Sentence-level SOM approaches tend to fail in discovering an opinion dimension, such as sentiment polarity, about a particular entity and/or its aspects (Cambria et al. 2013). Therefore, an aspect-level approach (also referred to as feature/topic-based) (Hu and Liu 2004), where an opinion is made up of targets and their associated opinion dimension (e.g., sentiment polarity), has been used in some studies to overcome such issues. Certain NLP tasks, such as parsing, POS tagging and NER, are usually required to extract the entities or aspects from the respective social data.

From all the studies analysed, 39 performed aspect-based SOM, with 37 (Bansal and Srivastava 2018; Dragoni 2018; Gandhe et al. 2018; Ghiassi and Lee 2018; Kao and Huang 2018; Katz et al. 2018; Liu et al. 2018; Rathan et al. 2018; Wang et al. 2018; Zainuddin et al. 2018; Abdullah and Zolkepli 2017; Dambhare and Karale 2017; Hagge et al. 2017; Ray and Chakrabarti 2017; Rout et al. 2017; Tong et al. 2017; Vo et al. 2017; Zhou et al. 2017; Zimbra et al. 2016; Zainuddin et al. 2016, 2016; Kokkinogenis et al. 2015; Lima et al. 2015; Hridoy et al. 2015; Castellucci et al. 2015; Averchenkov et al. 2015; Tan et al. 2014; Lau et al. 2014; Del Bosque and Garza 2014; Varshney and Gupta 2014; Unankard et al. 2014; Lek and Poo 2013; Wang and Ye 2013; Min et al. 2013; Kontopoulos et al. 2013; Jiang et al. 2011; Prabowo and Thelwall 2009) focusing on aspect-based sentiment analysis, 1 (Aoudi and Malik 2018) on aspect-based sentiment and emotion analysis and 1 (Weichselbraun et al. 2017) on aspect-based affect analysis.

In particular, the Twitter aspect-based sentiment classification process in Lek and Poo (2013) consists of the following main steps: aspect-sentiment extraction, aspect ranking and selection, and aspect classification, whereas Lau et al. (2014) use NER to parse product names to determine their polarity. The aspect-based sentiment analysis approach in Hagge et al. (2017) leveraged POS tagging and dependency parsing. Moreover, Zainuddin et al. (2016) proposed a hybrid approach to analyse aspect-based sentiment of tweets. As the authors claim, it is more important to identify the opinions of tweets rather than finding the overall polarity which might not be useful to organisations. In Zainuddin et al. (2018), the same authors used association rule mining augmented with a heuristic combination of POS patterns to find single and multi-word explicit and implicit aspects. Results in Jiang et al. (2011) show that classifiers incorporating target-dependent features significantly outperform target-independent ones. In contrast to the studies discussed, Weichselbraun et al. (2017) introduced an aspect-based analysis approach that integrates affective (includes sentiment polarity and emotions) and factual knowledge extraction to capture opinions related to certain aspects of brands and companies. The social data analysed is classified in terms of sentiment polarity and emotions, aligned with the “Hourglass of Emotions” (Susanto et al. 2020).
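A much-simplified sketch of aspect-opinion pairing in the spirit of the distance-based aspect scoring mentioned above: each known aspect term is paired with the nearest opinion word by token distance. The aspect and opinion lexicons are hypothetical.

```python
# Pair each known aspect term with the nearest opinion word in the
# sentence, a crude stand-in for dependency-based aspect-sentiment linking.

ASPECTS = {"battery", "screen", "camera"}
OPINIONS = {"great": "positive", "poor": "negative", "amazing": "positive"}

def aspect_opinions(tokens):
    pairs = {}
    for i, tok in enumerate(tokens):
        if tok in ASPECTS:
            # nearest opinion word by absolute token distance
            best = min(
                (j for j, t in enumerate(tokens) if t in OPINIONS),
                key=lambda j: abs(j - i),
                default=None,
            )
            if best is not None:
                pairs[tok] = OPINIONS[tokens[best]]
    return pairs

print(aspect_opinions("the battery is great but the camera is poor".split()))
# {'battery': 'positive', 'camera': 'negative'}
```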

In terms of techniques, the majority of the aspect-based studies used a hybrid approach, where only 5 studies used deep learning for such a task. In particular, the study by Averchenkov et al. (2015) used a deep learning approach based on RNNs for aspect-based sentiment analysis. A comparative review of deep learning for aspect-based sentiment analysis published by Do et al. (2019) discusses current research in this domain. It focuses on deep learning approaches, such as CNN, LSTM and GRU, for extracting both syntactic and semantic features of text without the need for in-depth requirements for feature engineering as required by classical NLP. For future research directions on aspect-based SOM, refer to Sect. 6.2.

4 Dimensions of social opinion mining

4.1 Context

An opinion describes a viewpoint or statement about a subjective matter. In many research problems, authors assume that an opinion is more specific and simpler in definition; for example, sentiment analysis is considered a type of opinion mining even though it focuses only on extracting a sentiment score from a given text. Social data contains a wealth of signals to mine, where opinions can be extracted over time. Different types of opinions require different modes of analysis (Agrawal et al. 2014), which makes opinions multi-dimensional semantic artefacts. In fact, Troussas et al. specify that "emotions and polarities are mutually influenced by each other, conditioning opinion intensities and emotional strengths". Moreover, multiple studies applied different approaches: Bravo-Marquez et al. (2013) showed that a composition of polarity, emotion and strength features achieves significant improvements over single approaches, whereas Koto and Adriani (2015) focused on finding the correlation between emotion (which can be differentiated by facial expression, voice intonation and also words) and sentiment in social media. Similarly, Buscaldi and Hernandez-Farias (2015) found that finer-grained negative tweets potentially help in differentiating between negative feelings, e.g., fear (emotion).

Furthermore, mood, emotions and decision making are closely connected (Porshnev and Redkin 2014). Research on multi-dimensional sentiment analysis shows that human mood is very rich in social media, where a piece of text may contain multiple moods, such as calm and agreement (Huang et al. 2015). On the other hand, some studies show that a single mood alone is highly influential in encouraging people to rummage through Twitter feeds for predictive information; for example, in Weiss et al. (2015), "calmness" was highly correlated with stock market movement. Different dimensions of opinions are also able to affect different entities, such as events. Results in Zhang et al. (2012) show a strong correlation between emergent events and public moods; in such cases, new events can be identified by monitoring emotional vectors in microblogs. Moreover, work in Thelwall et al. (2011) assessed whether popular events are correlated with increases in sentiment strength, which is likely the case.

All of the above motivates us to pursue further research and development on the identification of different opinion dimensions that are present within social data, such as microblogs, published across heterogeneous social media platforms. A more fine-grained opinion representation and classification of this social data shall lead to a better understanding of the messages conveyed, thus potentially influencing multiple application areas. Section 5 lists the application areas of the analysed studies.

4.2 Different dimensions of social opinions identified in the review analysis

The analysed studies focused on different opinion dimensions, namely: objectivity/subjectivity, sentiment polarity, emotion, affect, irony, sarcasm and mood. These were conducted at different levels, such as document-level, sentence-level and/or feature/aspect-based, depending on the study. As for the techniques presented in Sect. 3.2, 465 studies were evaluated. The majority focused on one social opinion dimension, with 60 studies focusing on more than one: 58 on two dimensions, 1 on three dimensions, and 1 on four dimensions. In this regard, Table 30 lists the different dimensions and respective studies. Most of the studies focused on sentiment analysis, specifically polarity classification.

Table 30 Studies focusing on two or more social opinion dimensions

The following sections present the different tasks conducted for each form of opinion mentioned above.

4.2.1 Subjectivity

Subjectivity determines whether a sentence expresses an opinion (in terms of personal feelings or beliefs) or not; in the latter case, the sentence is objective. Objectivity refers to sentences that express factual information about the world (Liu 2010).

  1. subjectivity classification: 2-level
     (a)
     (b)
     (c) opinionated/not opinionated
  2. subjectivity classification: 3-level
     (a)
     (b)
  3. subjectivity score
     (a) objective/subjective ranging from 0 (low) to 1 (high)

In this domain, objective statements are usually classified as neutral (in terms of polarity), whereas subjective statements are non-neutral; in the latter case, sentiment analysis is performed to determine the polarity classification (more information on this below). However, it is important to clarify that neutrality and objectivity are not the same: neutrality refers to situations whereby a balanced view is taken, whereas objectivity refers to factual, i.e., true, statements that are quantifiable and measurable.
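The subjectivity-score output format listed above (0 for objective to 1 for subjective) can be illustrated with a naive lexicon-ratio sketch; the subjective-word list below is hypothetical and far smaller than any real subjectivity lexicon.

```python
# Naive subjectivity score in [0, 1]: the fraction of tokens found in a
# (toy) subjectivity lexicon. 0 reads as objective, 1 as fully subjective.

SUBJECTIVE_WORDS = {"love", "hate", "awful", "wonderful", "think", "feel"}

def subjectivity_score(tokens):
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in SUBJECTIVE_WORDS)
    return hits / len(tokens)

print(subjectivity_score("i love this awful weather".split()))  # 0.4
```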

4.2.2 Sentiment

Sentiment determines the polarity (positive/negative/neutral) and strength/intensity (through a numeric rating score e.g., 1–5 stars, or level of depth e.g., low/high/medium) of an expressed opinion (Liu 2010).

  1. polarity classification: 2-level
     (a)
     (b)
     (c)
     (d) positive/not positive (neutral or negative)
  2. polarity classification: 3-level
     (a)
     (b)
     (c)
     (d)
     (e) positive/negative/other (neutral, irrelevant)
     (f)
     (g)
     (h)
     (i) beneficial (positive)/harmful (negative)/neutral
     (j) personal negative/personal non-negative/non-personal i.e. news
     (k)
  3. polarity classification: 4-level
     (a) positive/not so positive/not so negative/negative
     (b)
     (c)
     (d)
     (e)
     (f) positive/negative/neutral/not related to target topic
     (g)
     (h)
     (i) subjective/positive/negative/ironic (subjectivity and irony classification is also considered)
     (j)
     (k)
     (l)

  4. polarity classification: 5-level
     (a) highly positive/positive/neutral/negative/highly negative
     (b) strong positive/positive/neutral/negative/strong negative
     (c) strongly positive/mildly positive/neutral/mildly negative/strongly negative
     (d) strongly positive/slightly positive/neutral/slightly negative/strongly negative
     (e) very positive/positive/neutral/negative/very negative
     (f) positive/somewhat positive/neutral/somewhat negative/negative
     (g) most positive/positive/neutral/negative/most negative
     (h) extremely positive/positive/neutral/negative/extremely negative
     (i) positive/negative/ironic/positive and negative/objective (subjectivity and irony classification is also considered)
     (j)
     (k)

  5. polarity classification: 6-level
     (a) strong positive/steady positive/weak positive/weak negative/steady negative/strong negative
  6. polarity classification: 8-level
     (a) partially positive/mildly positive/positive/extremely positive/partially negative/mildly negative/negative/extremely negative
     (b) ProCon/AntiCon/ProLab/AntiLab/ProLib/AntiLib/Unknown/Irrelevant (levels are oriented towards the political domain)
  7. polarity classification: 12-level
     (a) future orientation/past orientation/positive emotions/negative emotions/sadness/anxiety/anger/tentativeness/certainty/work/achievement/money

  8. polarity score
     (a) negative ranging from 0–0.5 and positive ranging from 0.5–1
     (b) negative/neutral/positive ranging from 0 (low) to 0.45 (high)
     (c) negative/positive ranging from -1 (low) to 1 (high)
     (d) negative/positive ranging from -1.5 (low) to 1.5 (high)
     (e) negative/positive ranging from -2 (low) to 2 (high)
     (f) negative/positive ranging from 1 (low) to 5 (high)
     (g) negative ranging from -1 (low) to -5 (high) and positive ranging from 1 (low) to 5 (high)
     (h) strongly negative to strongly positive ranging from -2 (low) to 2 (high)
     (i) normalised values from -100 to 100
     (j) weighted average of polarity scores of the sentiment aspects/topic segments
     (k) score for every aspect/feature of the subject
     (l) score per aspect by calculating the distance between the aspect and sentiment word
     (m) total classification probability close to 1

  9. polarity strength
     (a) -5 (very negative) to 5 (very positive)
     (b) 1 (no sentiment) to 5 (very strong positive/negative sentiment)
     (c) low (0) to high (5)
     (d) -4 (most negative) to 4 (most positive)
     (e) weak/strong (relative strength)
     (f) Euclidean distance of positive and negative dimensions
  10. polarity intensity
     (a)
     (b) no emotion/a bit/normal/very/extremely
     (c) -3 (negative) to 3 (positive)

  11. sentiment assignment
     (a) total sentiment is the sum of the sentiment of all words divided by the total number of words (high to low)
     (b) average mean sentiment score
     (c) sentiment index based on the distribution of positive and negative online posts (Oh and Kumar 2017)
     (d) sum of inverse distance weighted sentiment values (+1, -1) of words in textual interactions
     (e) sentiment for a term is computed as [min, max] of all the positive and negative polarities
     (f) average score of associated messages in a time range and overall sentiment trend encoded by colours
  12.
     (a) cluster heads from sentimental content
     (b) sentiment change detection

In some studies (Sandoval-Almazan and Valle-Cruz 2018; Bouazizi and Ohtsuki 2017; Chou et al. 2017; Karyotis et al. 2017; Furini and Montangero 2016; Gambino and Calvo 2016; Jiang et al. 2015; Yuan et al. 2015), the sentiment polarity was derived from the emotion classification; for example, joy/love/surprise translated to positive, and anger/sadness/fear translated to negative.
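The emotion-to-polarity derivation used in these studies can be sketched as a simple mapping with majority voting; the joy/love/surprise and anger/sadness/fear groupings follow the convention above, while the majority-vote aggregation rule is an assumption (the cited studies differ in how they combine emotions).

```python
# Derive a sentiment polarity from a set of detected emotions by mapping
# each emotion to positive/negative and taking a majority vote.

EMOTION_TO_POLARITY = {
    "joy": "positive", "love": "positive", "surprise": "positive",
    "anger": "negative", "sadness": "negative", "fear": "negative",
}

def polarity_from_emotions(emotions):
    """Majority vote over the polarities of the detected emotions."""
    votes = [EMOTION_TO_POLARITY[e] for e in emotions if e in EMOTION_TO_POLARITY]
    if not votes:
        return "neutral"
    pos, neg = votes.count("positive"), votes.count("negative")
    return "positive" if pos > neg else "negative" if neg > pos else "neutral"

print(polarity_from_emotions(["joy", "surprise"]))       # positive
print(polarity_from_emotions(["anger", "fear", "joy"]))  # negative
```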

4.2.3 Emotion

Emotion refers to a person’s subjective feelings and thoughts, such as love, joy, surprise, anger, sadness and fear (Liu 2010).

  1. emotion classification: 2-level
     (a)
     (b)
  2. emotion classification: 3-level
     (a)
  3. emotion classification: 4-level
     (a)
     (b)
     (c)
  4. emotion classification: 5-level
     (a)
     (b)
     (c)
  5. emotion classification: 6-level
     (a)
     (b)
     (c)
     (d)
  6. emotion classification: 7-level
     (a)
     (b) pleasure/wondering/confirmation/excitement/laughter/tasty/surprise (emotions based on interjections (Cvijikj and Michahelles 2011))
     (c) love-heart/quality/happiness-smile/sadness/amused/anger/thumbs up (emotions based on sentiment-carrying words and/or emoticons (Walha et al. 2016))
     (d)
     (e)
     (f)
     (g)
     (h)
     (i)
     (j)
     (k)
     (l)
  7. emotion classification: 8-level
     (a)
     (b)
     (c)
     (d)
     (e)
     (f)
  8. emotion classification: 9-level
     (a)
  9. emotion classification: 11-level
     (a)
     (b) joy/excitement/wink/happiness/love/playfulness/surprise/scepticism/support/sadness/annoyance (emotions based on emoticons (Cvijikj and Michahelles 2011))
  10. emotion classification: 22-level
     (a) hope/fear/joy/distress/pride/shame/admiration/reproach/liking/disliking/gratification/remorse/gratitude/anger/satisfaction/fears-confirmed/relief/disappointment/happy-for/resentment/gloating/pity (emotions based on an Emotion-Cause-OCC model that describes the eliciting conditions of emotions (Gao et al. 2015))
  11. emotion–anger classification: 7-level
     (a)
  12. emotion score
     (a) valence/arousal/dominance ranging from 1 (low) to 9 (high)
     (b) prediction/valence/arousal/outcome from 0 (low) to 100 (high)
  13. emotion intensity
     (a) 0 (minimum) to 1 (maximum)
     (b) 0 (minimum) to 9 (maximum)
     (c)
  14. emotion–happiness measurement
     (a) average happiness score

A study (Munezero et al. 2015) mapped the observed emotions into two broad categories of enduring sentiments: ‘like’ and ‘dislike’. The former includes emotions that have a positive evaluation of the object, i.e., joy, trust and anticipation, and the latter includes emotions that have a negative evaluation of the object, i.e., anger, fear, disgust, and sadness.

It is important to note that some of the emotion categories listed above are based on published theories of emotion, the most popular being Paul Ekman's six basic emotions (anger, disgust, fear, happiness, sadness and surprise) (Ekman 1992), and Plutchik's eight primary emotions (anger, fear, sadness, disgust, surprise, anticipation, trust, and joy) (Plutchik 1980). Moreover, other studies have used emotion categories that are influenced by emotional state/psychological models, such as the Pleasure Arousal Dominance model (Mehrabian 1996) and the Ortony, Clore and Collins model (commonly referred to as OCC) (Ortony et al. 1988).

Several studies (Xu et al. 2012; Furini and Montangero 2016; Walha et al. 2016; Hubert et al. 2018) that targeted emotion classification incorrectly referred to such a task as sentiment analysis. Even though emotions and sentiment are highly related, the former are seen as enablers of the latter, i.e., an emotion or set of emotions affects the sentiment.

4.2.4 Affect

Affect refers to a set of observable manifestations of a subjectively experienced emotion. The basic tasks of affective computing are emotion recognition and polarity detection (Cambria 2016).

  1. affect classification: 4-level
     (a) aptitude/attention/pleasantness/sensitivity (based on the “Hourglass of Emotions”, which was inspired by Plutchik’s studies on human emotions)

When using this affective model, sentiment is based on the four independent dimensions mentioned, namely Pleasantness, Attention, Sensitivity, and Aptitude; the different levels of activation of these dimensions constitute the total emotional state of the mind (Hussain and Cambria 2018). The semi-supervised learning model proposed by Hussain and Cambria (2018), based on the merged use of multi-dimensional scaling by means of random projections and biased SVM, has been exploited for the inference of semantics and sentics (conceptual and affective information) linked with concepts in a multi-dimensional vector space, in accordance with this affective model. This is used to carry out sentiment polarity detection and emotion recognition in cases where labelled common-sense data is lacking.

4.2.5 Irony

Irony is usually used to convey the opposite of what is actually said, but its purpose is not to hurt the other person.

  1. irony classification: 2-level
     (a)


4.2.6 Sarcasm

Sarcasm shares the characteristic of meaning the opposite of what is said but, unlike irony, it is used to hurt the other person.

  1. sarcasm classification: 2-level
     (a)
     (b)


4.2.7 Mood

Mood refers to a conscious state of mind or predominant emotional state of a person, or the atmosphere of groups, people or places, at any point in time.

  1. mood classification: 6-level
     (a) composed-anxious/agreeable-hostile/elated-depressed/confident-unsure/energetic-tired/clearheaded-confused (based on the Profile of Mood States (POMS) Bipolar questionnaire (McNair et al. 1971), which is designed by psychologists to assess human mood states)
     (b) calm/alert/sure/vital/kind/happy (based on GPOMS (Bollen et al. 2011), which expands the POMS Bipolar questionnaire to capture a wider variety of naturally occurring mood terms in tweets)
  2. mood classification: 8-level
     (a)


4.2.8 Aggressiveness

Del Bosque and Garza (2014) assume that aggressive text detection is a sub-task of sentiment analysis, which is closely related to document polarity detection. Their reasoning is that aggressive text can be seen as intrinsically negative.

  1. Aggressiveness detection
     (a) aggressiveness score ranging from 0 (no aggression) to 10 (strong aggression)

4.2.9 Other

  1. Opinion retrieval
     (a) opinion score from 0 (minimum) to 5 (maximum)

4.3 Impact of sarcasm and irony on social opinions

Sarcasm and irony are often confused and/or misused. This makes their classification very difficult, even for humans (Unankard et al. 2014; Buscaldi and Hernandez-Farias 2015), with most users holding negative views on such messages (Unankard et al. 2014). The study by Buscaldi and Hernandez-Farias (2015) is a relevant example, whereby a large number of false positives were identified in the tweets classified as ironic. Moreover, such tasks are also very time consuming and labour intensive, particularly with the rapid growth in the volume of online social data. Therefore, not many studies focused on and/or catered for sarcasm and/or irony detection.

4.3.1 Challenges

The majority of the approaches proposed in the reviewed studies are not equipped to cater for traditional limitations, such as negation effects or ironic phenomena in text (Castellucci et al. 2015). Such opinion mining tasks face several challenges, the main ones being:

  • Different languages and cultures result in various ways of how an opinion is expressed on certain social media platforms. For example, Sina Weibo users prefer to use irony when expressing negative polarity (Wang et al. 2014). Future research is required for the development of cross-lingual/multilingual NLP tools that are able to identify irony and sarcasm (Yan et al. 2014).

  • Presence of sarcasm and irony in social data, such as tweets, may affect the feature values of certain machine learning algorithms. Therefore, further advancement is required in the techniques used for handling sarcastic and ironic tweets (Pandey et al. 2017). The work in Sarsam et al. (2020) addresses the main challenges faced for sarcasm detection in Twitter and the machine learning algorithms that can be used in this regard.

  • Classifying/rating a given sentence’s sentiment is very difficult and ambiguous, since people often use negative words to be humorous or sarcastic.

  • Sarcasm and/or irony annotation is very hard for humans and thus should be carried out by multiple persons for accuracy purposes. This makes it very challenging to collect large datasets that can be used for supervised learning, with the only feasible way being to hire people to carry out such annotations (D’Asaro et al. 2017). Moreover, the difficulty human annotators face in differentiating between sarcasm and irony results in a lack of available datasets, and of datasets with enough examples of ironic and/or sarcastic annotations. Such datasets are usually needed for “data hungry” computational learning methods (Sykora et al. 2020).

4.3.2 Observations

Table 31 lists the studies within the review analysis that focused on sarcasm and/or irony. These account for only 18 out of 465 reviewed papers. One can clearly note the research gap that exists within these research areas.

Table 31 Studies adopting sarcasm and/or irony

The following is an overview of the studies’ main results and observations:

  • Bosco et al. (2013): The authors found that irony is normally used together with a positive statement to express a negative one, but seldom the other way round. Analysis shows that the Senti-TUT corpus can be representative of a wide range of irony phenomena, from bitter sarcasm to genteel irony.

  • Reyes et al. (2013): The study describes a number of textual features used to identify irony at a linguistic level. These are mostly applicable for short texts, such as tweets. The developed irony detection model is evaluated in terms of representativeness and relevance. Authors also mention that there are overlaps in occurrences of irony, satire, parody and sarcasm, with their main differentiators being tied to usage, tone and obviousness.

  • Mejova et al. (2013): A multi-stage data-driven political sentiment classifier is proposed in this study. The authors found "that a humorous tweet is 76.7% likely to also be sarcastic", whereas "sarcastic tweets are only 26.2% likely to be humorous". Future work is required on the connection between sarcasm and humour.

  • Fersini et al. (2015): Addresses the automatic detection of sarcasm and irony by introducing an ensemble approach based on Bayesian Model Averaging that takes into account several classifiers according to their reliabilities and their marginal probability predictions. Results show that not all features are equally able to characterise sarcasm and irony: sarcasm is better characterised by POS tags, and ironic statements by pragmatic particles (such as emoticons and emphatic/onomatopoeic expressions, the linguistic elements typically used in social media to convey a particular message).

  • Jiang et al. (2015): The authors’ model classifies subjectivity, polarity and emotion in microblogs. Results show that emoticons are a pure carrier of sentiment, whereas sentiment words have more complex senses and contexts, such as negations and irony.

  • Wang et al. (2012): Post-facto analysis of user-generated content, such as tweets, show that political tweets tend to be quite sarcastic.

  • Ghiassi and Lee (2018): Certain keywords or hash-tagged words (e.g., “thanks”, “#smh”, “ #not”) that follow certain negative or positive sentiment markers in textual social data, usually indicate the presence of sarcasm.
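The contrast drawn by Fersini et al. (2015) between POS-based and pragmatic-particle features can be illustrated with a small feature extractor; the particle lists and the character-lengthening regex below are illustrative assumptions, not the authors' actual feature set.

```python
# Extract simple "pragmatic particle" counts of the kind used to
# characterise irony: emoticons, onomatopoeic tokens, emphatic
# character lengthening (e.g. "soooo"), and exclamation marks.

import re

EMOTICONS = {":)", ":(", ":P", ";)"}
ONOMATOPOEIC = {"haha", "hahaha", "lol", "wow", "yay"}

def pragmatic_features(tokens):
    return {
        "n_emoticons": sum(1 for t in tokens if t in EMOTICONS),
        "n_onomatopoeic": sum(1 for t in tokens if t.lower() in ONOMATOPOEIC),
        # a character repeated 3+ times signals emphatic lengthening
        "n_emphatic": sum(1 for t in tokens if re.search(r"(.)\1{2,}", t)),
        "n_exclamations": sum(t.count("!") for t in tokens),
    }

print(pragmatic_features(["soooo", "helpful", "!", ":)", "haha"]))
# {'n_emoticons': 1, 'n_onomatopoeic': 1, 'n_emphatic': 1, 'n_exclamations': 1}
```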

5 Application areas of social opinion mining

Around half of the studies analysed focused their work on a particular real-world application area (or multiple), where Fig. 3 shows the ones applicable for this systematic review. Note that each circle represents an application area, where the size reflects the number of studies within the particular application area. The smallest circles represent a minimum of two studies that pertain to the respective application area, whereas the biggest circle reflects the most popular application area. Intersecting circles represent application areas that were identified as being related to each other based on the analysis conducted.

Fig. 3
figure 3

Application areas

The Politics domain is the dominant application area with 45 studies applying SOM on different events, namely elections (Elouardighi et al. 2017; Bansal and Srivastava 2018; Nugroho et al. 2017; Chen 2018; Nausheen and Begum 2018; Abdullah and Hadzikadic 2017; Joyce and Deng 2017; Soni et al. 2017; Salari et al. 2018; Fatyanosa and Bachtiar 2017; Juneja and Ojha 2017; Sandoval-Almazan and Valle-Cruz 2018; Zhou et al. 2017; Le et al. 2017; Yuan et al. 2014; Ramteke et al. 2016; Smailović et al. 2015; Burnap et al. 2016; Rill et al. 2014; Anjaria and Guddeti 2014; Kuo et al. 2016; Batista and Ratté 2014; Mejova et al. 2013; Hoang et al. 2013; Gonçalves et al. 2013; Unankard et al. 2014; Wang et al. 2012; Maynard and Funk 2011; Bosco et al. 2013; Bakliwal et al. 2013; Tumasjan et al. 2010), reforms, such as equality marriage (Lai et al. 2015), debates (Tapia and Velásquez 2014), referendums (Pavel e al. 2017; Fang and Ben-Miled 2017), political parties or politicians (Ozer et al. 2017; Javed et al. 2014; Taddy 2013), and political events, such as terrorism, protests, uprisings and riots (Sachdeva et al. 2018; Kamyab et al. 2018; Bouchlaghem et al. 2016; Mejova and Srinivasan 2012; de Souza Carvalho et al. 2016; Sheth et al. 2014; Weiss et al. 2013).

In terms of Marketing & Advertising & Sales, 29 studies focused on brand/product management and/or awareness (Giachanou et al. 2017; Ayoub and Elgammal 2018; Ghiassi and Lee 2018; Li and Fleyeh 2018; Ducange and Fazzolari 2017; Husnain et al. 2017; Teixeira and Laureano 2017; Halibas et al. 2018; Hu et al. 2017; Abdullah and Zolkepli 2017; Zimbra et al. 2016; Cho et al. 2014; Esiyok and Albayrak 2015; Dasgupta et al. 2015; Ghiassi et al. 2013; Mostafa 2013b; Min et al. 2013; Cvijikj and Michahelles 2011; Li and Li 2013; Gonçalves et al. 2013), products/services in general (Asghar et al. 2018; Kao and Huang 2018; Walha et al. 2016; Polymerou et al. 2014; Li and Li 2013), local marketing (Costa et al. 2014) and online advertising (Adibi et al. 2018; Dragoni 2018; Lewenberg et al. 2015).

The Technology industry-oriented studies (23) focused on either: company perception (Wan et al. 2018; Rout et al. 2018; Lek and Poo 2013; Karpowicz et al. 2013; Jiang et al. 2011), products, such as mobile/smart phones (Rathan et al. 2018; Ray and Chakrabarti 2017; Geetha et al. 2018; Gupta and Joshi 2017; Gandhe et al. 2018; Hridoy et al. 2015; Agrawal et al. 2014; Suresh 2016; Mumu and Ezeife 2014; Erdmann et al. 2014), laptops (Raja and Swamynathan 2016), electronics (Neethu and Rajasree 2013), tablets (Severyn et al. 2016; Jiang et al. 2011), operating systems (Huang et al. 2018), cloud service providers (Qaisi and Aljarah 2016), social media providers (Arslan et al. 2017) and multiple technologies (Vo et al. 2017).

All the 21 studies targeting the Finance domain applied SOM on demonetisation (Gupta and Singal 2017), currencies (Pavel et al. 2017) and the stock market, for risk management (Ishikawa and Sakurai 2017) and predictive analytics (Ghosal et al. 2018; Chen and Zheng 2018; Piñeiro-Chousa et al. 2018; Simões et al. 2017; Tiwari et al. 2017; Sun et al. 2017; Coyne et al. 2017; Zhao et al. 2016; Attigeri et al. 2015; Weiss et al. 2015; Rao and Srivastava 2014; Huang et al. 2015; Porshnev et al. 2013; Porshnev and Redkin 2014; Porshnev et al. 2014; Yu et al. 2013; Bollen et al. 2011; Vu et al. 2012).

Thirteen studies applied SOM on the Film industry for recommendations (Orellana-Rodriguez et al. 2015; Song et al. 2015), box office predictions (Du et al. 2014; Rui et al. 2013) or from a general perspective (Pavel et al. 2017; Sihwi et al. 2018; Permatasari et al. 2018; Orellana-Rodriguez et al. 2013; Yan et al. 2014; Gonçalves et al. 2013; Wang and Ye 2013; Blenn et al. 2012; Chen et al. 2012). Similarly, 13 studies focused on Healthcare, namely on epidemics/infectious diseases (Hong and Sinnott 2018; Lim et al. 2017; Lu et al. 2015; Gonçalves et al. 2013), drugs (Moh et al. 2017; Peng et al. 2016; Wu et al. 2015), hospitals (Gupta and Kohli 2016), vaccines (Song and Gruzd 2017), public health, such as epidemics, clinical science and mental health (Ji et al. 2015, 2016), and in general, such as health-related tweets (Baccouche et al. 2018) and health applications (Pai and Alathur 2018).

In terms of other industries, SOM was applied within the following:

  • Telecommunications (e.g., telephony, television) on particular service providers (Ghiassi and Lee 2018; Ranjan et al. 2018; Napitu et al. 2017; Fitri et al. 2018; Kumar and Bala 2016; Varshney and Gupta 2014; Wunnasri et al. 2013; Tan et al. 2011; Trung et al. 2013) or complaints (Souza et al. 2016);

  • Automotive (Vo et al. 2017; Pai and Liu 2018; Fatyanosa et al. 2018; Weichselbraun et al. 2017; Shukri et al. 2015; Bifet et al. 2011; Reyes et al. 2013; Severyn et al. 2016; Erdmann et al. 2014);

  • Hospitality for restaurant recommendations (Vo et al. 2017; Yang et al. 2013) and hotel/resort perceptions (Rout et al. 2017; Philander and YunYing 2016; Lu et al. 2016; Chen et al. 2015; Molina-González et al. 2014);

  • Aviation on specific airline services, e.g., customer relationship management (Ghiassi and Lee 2018; Mostafa 2013a; Chen et al. 2016), and air crashes (Gonçalves et al. 2013);

  • Food either in general (dos Santos et al. 2018; Liu et al. 2015) or on safety (Sun et al. 2014);

  • Fashion (Mukkamala et al. 2014, 2014).

In terms of other domains, the studies focused on:

  • Sports on football/soccer (Stojanovski et al. 2018; Seron et al. 2015; Guerra et al. 2014), American football (Guerra et al. 2014; Brooks et al. 2014), basketball (Tan et al. 2011; Jiang et al. 2011), cricket (Ahuja and Dubey 2017) and Olympics (Gonçalves et al. 2013);

  • Government for smart cities (D’Asaro et al. 2017; Anggoro et al. 2016; Li et al. 2016) and e-Government (Hubert et al. 2018; Rezk et al. 2018; Williamson and Ruming 2016);

  • Environment for policy makers (Sluban et al. 2015), urban mobility (Gallegos et al. 2016), wind energy (Politopoulou and Maragoudakis 2013), green initiatives (Rai et al. 2018) and peatland fires (Gandhe et al. 2018);

  • E-commerce for product recommendations (Xie et al. 2012; Lau et al. 2014), crisis management (Park et al. 2011), decision making (D’Avanzo and Pilato 2015) and policy making (Omar et al. 2017);

  • Education for e-learning (Ortigosa et al. 2014; Karyotis et al. 2017) and on universities (Abdelrazeq et al. 2016);

  • Transportation for ride hailing services and logistics (Anastasia and Budi 2016) and traffic conditions (Cao et al. 2018).

Moreover, other studies applied SOM in the following areas:

  • Personalities (Ali et al. 2018; Ghiassi and Lee 2018; Arslan et al. 2017; Tasoulis et al. 2018; Goel et al. 2018; Poortvliet and Wang 2018; Wang et al. 2018; Jiang et al. 2011; Tan et al. 2011; Khan et al. 2014; Kranjc et al. 2013);

  • Natural Disasters on earthquakes (Aoudi and Malik 2018; Ragavi and Usharani 2014; Zhang et al. 2012; Thelwall et al. 2011), flooding (Buscaldi and Hernandez-Farias 2015), explosions (Ouyang et al. 2017) and in general (Sangameswar et al. 2017);

  • Aggressive Behaviour in relation to crime (Kitaoka and Hasuike 2017; Chen et al. 2015; Zainuddin et al. 2016), cyberbullying (Del Bosque and Garza 2014), bullying (Xu et al. 2012) and violence and disorder (Jurek et al. 2014);

  • Main/Breaking Events such as Black Friday (Choi and Kim 2013), Oscars, TV shows, product launch, earthquake (Thelwall et al. 2011), accidents e.g., shootings (Singh et al. 2018; Akcora et al. 2010) and in general (Stojanovski et al. 2018);

  • Liveability in terms of place design to support local authorities, urban designers and city planners (You and Tunçer 2016), and government services, such as welfare (Flaes et al. 2016);

  • Digital Forensics (Andriotis et al. 2014; Aboluwarin et al. 2016).

Lastly, 19 further studies (not represented in Fig. 3) focused on the following application areas: Human Development (Zafar et al. 2016), Human Mobility (Kokkinogenis et al. 2015), Public Facilities (Ramadhani et al. 2016), Smart Cities (Li et al. 2017), Web Publishing (Tian et al. 2015), Sponsorships (Kaushik and Dey 2016), Countries (Khan et al. 2014), Industry (Trung et al. 2013), Entertainment (Trung et al. 2013), Refugee/Migrant crisis (Lee and Nerghes 2017), Tourism (Michailidis et al. 2018), Music (Radhika and Sankar 2017), Cryptocurrency (Pant et al. 2018), Economy (Gupta and Singal 2017), Social Issues (Vora and Chacko 2017), Law (Gandhe et al. 2018), Insurance/Social Security (Zhang et al. 2017), Geographic Information (Stojanovski et al. 2018) and Social Interactions (Vivanco et al. 2017).

6 Concluding remarks

This section presents the latest research developments and advancements within the SOM research area (Sect. 6.1) and the overall conclusions of this systematic review in terms of target audience and future research and development (Sect. 6.2).

6.1 Latest research in social opinion mining

Given that this systematic review covers studies up to 2018, this sub-section discusses some recent developments and advancements from 2019 to 2021. These show the fast research turnaround in SOM, which has kept evolving at a rapid rate, reiterating its validity and popularity as a research area.

The number of studies using Deep Learning approaches continued to increase (as reflected in Table 5), especially ones using certain deep learning techniques, such as CNNs, RNNs, LSTMs, GRUs and Deep Belief Networks (Yadav and Vishwakarma 2020; Wadawadagi and Pagi 2020), and with the introduction of new techniques, such as Transfer Learning. This is supported by numerous studies (Carvalho and Plastino 2021; Eke et al. 2020) which note that researchers are shifting from traditional machine learning techniques to deep learning ones. Carvalho and Plastino (2021) focus on sentiment analysis on tweets, Xu et al. (2020) on emotion classification on tweets, Akhtar et al. (2020) on sentiment and emotion intensity, Cignarella et al. (2020) on irony detection in English, Spanish, French and Italian tweets, and Eke et al. (2020) on sarcasm detection, with Twitter also being the social media platform most used in this research area.

Transfer learning is a machine learning technique, widely adopted in deep learning, where a model is trained on one or more tasks (source tasks) and the learnt knowledge is applied to a related second task (target task) (Pan and Yang 2009). In particular, the Transformer model architecture introduced by Vaswani et al. (2017) is based on attention mechanisms and is designed to handle sequential data, like natural language, for NLP tasks such as sentiment analysis and text summarisation. This has coincided with the advancement of SOM for different opinion dimensions, such as sentiment polarity (Nguyen et al. 2020; Naseem et al. 2020), emotion (Acheampong et al. 2021), and irony (Nguyen et al. 2020), especially in studies focused on adaptation to new domains and/or knowledge transfer from one language to another. The latter application is particularly valuable for cross-lingual adaptation, where a labelled dataset available in one language, e.g., English, is applied to another, such as a low-resourced language (Ruder 2017).
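The source-to-target idea behind transfer learning can be illustrated with a deliberately simplified, non-neural sketch: a word-score sentiment model is fitted on a labelled source task, and its learnt scores are then reused and updated on a much smaller target task instead of training from scratch (all data and scores below are invented for illustration).

```python
from collections import Counter

def train_word_scores(examples):
    """Learn per-word sentiment scores from (text, label) pairs,
    where label is +1 (positive) or -1 (negative)."""
    scores = Counter()
    for text, label in examples:
        for word in text.lower().split():
            scores[word] += label
    return scores

def predict(scores, text):
    """Classify text by summing the learnt per-word scores."""
    total = sum(scores.get(w, 0) for w in text.lower().split())
    return +1 if total >= 0 else -1

# Source task: plenty of labelled review-style data.
source = [
    ("great film loved it", +1),
    ("brilliant and great acting", +1),
    ("terrible film hated it", -1),
    ("boring and terrible plot", -1),
]

# Target task: only a couple of labelled tweets about phones.
target = [
    ("great phone battery", +1),
    ("terrible phone screen", -1),
]

# Transfer: start from the source model's scores and update them
# with the small target set, rather than training from scratch.
scores = train_word_scores(source)
scores.update(train_word_scores(target))

print(predict(scores, "loved the great screen"))  # → 1
```

In practice the same pattern is realised by fine-tuning a pre-trained Transformer, where the "scores" being transferred are millions of learnt weights rather than per-word counts.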

With respect to language, more SOM studies supporting languages other than the popular ones (such as English and Chinese) are on the rise. In Rani and Kumar (2019), the authors discuss the growth of research work in the fields of sentiment and emotion analysis for Indian languages. Moreover, Buechel et al. (2020) created emotion lexicons for 91 languages for sentiment and emotion analysis. Other recent studies have focused on languages such as Urdu for sentiment analysis (Mukhtar and Khan 2019), Maltese for sentiment and emotion analysis and sarcasm/irony detection (Cortis and Davis 2019), Indonesian for sentiment analysis (Koto et al. 2020), Portuguese for sentiment and emotion analysis (Pereira 2021), and Arabic for sentiment and emotion analysis (Alhumoud and Al Wazrah 2021). Studies on code-switched languages are also on the increase, with Bansal et al. (2020) demonstrating how Hindi-English code-switching patterns from tweets can be used to improve sarcasm detection, and Appidi et al. (2020) analysing code-switched Kannada-English tweets for emotion classification.

In terms of modality, the visual modality is gaining more interest in the SOM research community. In Akhtar et al. (2019), the authors propose a deep multi-task learning framework that carries out sentiment and emotion analysis from the textual, acoustic and visual frames of video data obtained from YouTube. On the other hand, Kumar and Garg (2019) propose a multi-modal sentiment analysis model for Twitter, where the sentiment polarity and strength are extracted from tweets based on their text and images (typographic and/or infographic).

More research has been published on aspect-based SOM, with Jiang et al. (2020) focusing on sentiment polarity in both single-aspect and multi-aspect scenarios, and Hyun et al. (2020) on sentiment polarity in the automotive domain for the English and Korean languages.

In terms of application areas, the ones identified in Sect. 5 are still very popular, with research in new sub-domains emerging. In particular, several studies (Kapočiūtė-Dzikienė et al. 2019; Cresci et al. 2019; Guo and Li 2019; Xing et al. 2020; Chen et al. 2020; Mishev et al. 2020) focus on the Finance domain. Xing et al. (2020) identify common error patterns that cause financial sentiment analysis to fail, namely irrealis mood, rhetoric, dependent opinion, unspecified aspects, unrecognised words, and external reference. On the other hand, Mishev et al. (2020) evaluate sentiment analysis studies in the Finance domain, starting from lexicon-based approaches and finishing with ones that use Transformers, such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) and the Robustly optimised BERT approach (RoBERTa) (Liu et al. 2019).
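As a concrete illustration of the lexicon-based end of that spectrum, the toy scorer below (with an invented mini-lexicon and simple negation flipping) shows the kind of baseline such evaluations start from; it also makes one of the error patterns identified by Xing et al. (2020) visible, since unrecognised words simply contribute nothing to the score.

```python
# Toy financial sentiment lexicon; entries are invented for illustration.
LEXICON = {
    "profit": 1.0, "growth": 1.0, "beat": 0.5,
    "loss": -1.0, "decline": -1.0, "miss": -0.5,
}
NEGATIONS = {"not", "no", "never"}

def lexicon_sentiment(text):
    """Sum lexicon scores over the words in text,
    flipping the sign of the word that follows a negation."""
    score, negate = 0.0, False
    for word in text.lower().split():
        if word in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(word, 0.0)  # unrecognised words score zero
        score += -value if negate else value
        negate = False
    return score

print(lexicon_sentiment("profit growth beat expectations"))  # 2.5
print(lexicon_sentiment("did not beat expectations"))        # -0.5
```

Transformer-based approaches replace this fixed word-to-score mapping with contextual representations, which is why they handle phenomena such as rhetoric and dependent opinion far better than a static lexicon can.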

The ongoing coronavirus disease (COVID-19) global pandemic has led to a rise in SOM studies analysing social opinions in terms of different dimensions, such as sentiment polarity. Müller et al. (2020) released a COVID-19 Transformer-based model that was pre-trained on multiple datasets of tweets from Twitter. These datasets contained tweets on various topics, such as vaccine sentiment and maternal vaccine stance, and drew on other well-known datasets, such as SemEval 2016 Task 4, which was previously discussed in Sect. 3.3. This model was also pre-trained to carry out sentiment analysis on tweets written in other languages, such as Arabizi, a written form of spoken Arabic that relies on Latin characters and digits (Baert et al. 2020). On the other hand, Kruspe et al. (2020) presented sentiment analysis results for 4.6 million European tweets covering the initial period of COVID-19 (December 2019 to April 2020), with results aggregated by country (Italy, Spain, France, Germany, United Kingdom) and averaged over time. An ANN was trained to carry out sentiment analysis, and the resulting model was compared with several pre-trained models, such as BERT, which is trained on BookCorpus and English Wikipedia data (Devlin et al. 2018), a multilingual version of BERT trained on COVID-19 tweets (Müller et al. 2020), and Embeddings from Language Models (ELMo) trained on the 1 Billion Word Benchmark dataset.
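The country-level aggregation and temporal averaging step used in such analyses is straightforward once each tweet carries a sentiment score; a minimal sketch (with invented scores and week numbers) might look like:

```python
from collections import defaultdict
from statistics import mean

# Invented per-tweet sentiment scores: (country, week, score in [-1, 1]).
scored_tweets = [
    ("IT", 1, -0.25), ("IT", 1, -0.75), ("IT", 2, 0.125),
    ("DE", 1,  0.5),  ("DE", 2, -0.25), ("DE", 2, -0.75),
]

# Group scores by country and time bucket, then average each bucket.
by_country_week = defaultdict(list)
for country, week, score in scored_tweets:
    by_country_week[(country, week)].append(score)

averages = {key: mean(scores) for key, scores in by_country_week.items()}
print(averages[("IT", 1)])  # -0.5
```

At the scale of millions of tweets the same grouping would typically be done with a dataframe library or a streaming pipeline, but the aggregation logic is identical.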

In terms of NLP tools, Hugging FaceFootnote 107 provides a state-of-the-art Transformer library for PyTorch and TensorFlow 2.0Footnote 108. It provides general-purpose architectures, such as BERT, GPT-2 (Radford et al. 2019), RoBERTa, the cross-lingual language model (XLM) (Lample and Conneau 2019), DistilBERT (Sanh et al. 2019), and XLNet (Yang et al. 2019) for NLP tasks (like sentiment analysis), with over 32 pre-trained models available in 100+ languages. Similarly, TensorFlow HubFootnote 109 provides a repository of trained machine learning models, a variety of which use the Transformer architectureFootnote 110, such as BERT.

The carbon footprint of training new deep learning models should always be taken into consideration, especially if a large number of Central Processing Units (CPUs), Graphical Processing Units (GPUs), or Tensor Processing Units (TPUs) are needed. This in turn increases the related costs for model training, which is becoming very expensive and is expected to keep increasing in the future. Strubell et al. (2019) note that such costs comprise both a financial aspect, in terms of hardware and electricity or cloud compute time, and an environmental aspect, in terms of the carbon footprint needed to fuel modern tensor processing hardware. Therefore, researchers should report the training time and computational resources needed in their published work, and should prioritise computationally efficient algorithms and hardware that need less energy.
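As a minimal sketch of such reporting, training time can be measured directly and converted into a rough energy estimate. The 250 W average power draw below is an arbitrary assumption standing in for a measured figure, and `train_step` is a placeholder for one step of a real model; proper estimates, as in Strubell et al. (2019), rely on measured power consumption.

```python
import time

def train_step():
    # Placeholder for one training step of a real model.
    sum(i * i for i in range(100_000))

AVG_POWER_WATTS = 250  # assumed average draw of the training hardware
steps = 50

# Wrap the training loop with simple bookkeeping so wall-clock time
# (and, from it, an energy estimate) can be reported alongside results.
start = time.perf_counter()
for _ in range(steps):
    train_step()
elapsed_s = time.perf_counter() - start

energy_kwh = AVG_POWER_WATTS * elapsed_s / 3_600_000  # W * s -> kWh
print(f"{steps} steps in {elapsed_s:.2f}s, ~{energy_kwh:.6f} kWh")
```

Reporting even coarse figures like these makes it possible to compare the computational efficiency of competing approaches.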

6.2 Conclusion

The main aim of this systematic review is to provide in-depth analysis and insights on the most prominent technical aspects, dimensions and application areas of SOM. The target audience of this comprehensive review is threefold:

  • Early-Stage Researchers who are interested in working within this evolving research field of study and/or are looking for an overview of this field;

  • Experienced Researchers already working in SOM who would like to progress further on the technical side of their work and/or are looking for weaknesses in the field of SOM;

  • Early-Stage and/or Experienced Researchers who are looking into applying SOM/their SOM work in a real-world application area.

The identification of the current literature gaps within the SOM field of study is one of the main contributions of this systematic review. The overview below provides a pathway to future research and development work:

  • Social Media Platforms: Most studies focus on data gathered from one social media platform, with Twitter being the most popular, followed by Sina Weibo for Chinese-targeted studies. Researchers are encouraged to explore multi-source information by using other platforms, that is, data from multiple data sources, subject to any existing API limitationsFootnote 111. This would increase the variety and volume of data (two of the V’s of Big Data) used for evaluation purposes, thus ensuring that results provide a more reflective picture of society in terms of opinions. The use of multiple data sources in studies focusing on the same real-world application areas is also beneficial for comparison purposes and for the identification of any potential common traits, patterns and/or results. Mining opinions from multiple sources of information also presents several advantages, such as higher authenticity, reduced ambiguity and greater availability (Balazs and Velásquez 2016).

  • Techniques: The use of Deep Learning, Statistical, Probabilistic, Ontology and Graph-based approaches should be further explored, both standalone and/or as part of hybrid techniques, due to their potential and accessibility. In particular, Deep Learning capabilities have made several applications feasible, whereas Ontologies and Graph Mining enable fine-grained opinion mining and the identification of relationships between opinions and their enablers (person, organisation, etc.). Moreover, ensemble Machine Learning and Deep Learning methods and fine-tuned Transformer-based models are still under-explored. In such cases, researchers should be attentive to the carbon footprint needed to train neural network models for NLP.

  • Social Datasets: The majority of available datasets are either English or Chinese specific. This domain needs further social datasets published under a common open license for public use. These should target any of the following criteria: bi-lingual/multilingual data, and/or annotations of multiple opinion dimensions within the data, e.g., sentiment polarity, emotion, sarcasm, irony, mood, etc. Both requirements are costly in terms of resources (time, funding and personnel), domain knowledge and expertise.

  • Language: The majority of studies support one language, with English and Chinese being the most popular. Supporting two or more languages is one of the major challenges in this domain due to numerous factors, such as cultural differences and the lack of language-specific resources, e.g., lexicons, datasets, tools and technologies. This domain also needs more studies that focus on code-switched and less-resourced languages, which would enable the development of the language resources these languages need.

  • Modality: Bi-/Multi-modal SOM is another sub-domain that requires further research. Most studies cater for the text modality only, with the visual (image) modality gaining more popularity. However, the visual (video) and audio modalities are still in their early research phases, with many aspects still unexplored. This also stems from a lack of available visual, audio and multimodal datasets.

  • Aspect-based SOM: Research in this sub-domain is increasing and developing; however, it is far from mature, especially when applied in certain domains. Further aspect-based research is encouraged on opinion dimensions other than sentiment polarity, such as emotions and moods, which remain largely unexplored. Moreover, more research is required on the use of Deep Learning approaches for this task, which is still at an early stage.

  • Application areas: Most studies target Politics, Marketing & Advertising & Sales, Technology, Finance, Film and Healthcare. Research into other areas/sub-domains is encouraged to study and show the potential of SOM.

  • Dimensions of SOM: Most studies focus on subjectivity detection and sentiment analysis. The area of emotion analysis is increasing in popularity; however, sarcasm detection, irony detection and mood analysis are still in their early research phases. Moreover, from the analysis of this systematic review it is evident that there is a lack of research on possible correlations between the different opinion dimensions, e.g., emotions and sentiment. Lastly, no studies cater for all the different SOM dimensions within their work.

Shared evaluation tasks, such as the International Workshop on Semantic Evaluation (SemEval), that focus on any of the research gaps identified above are very important and will contribute to the advancement of the SOM research area. Therefore, researchers are encouraged to engage in these tasks through participation and/or the organisation of new tasks.

In conclusion, as identified through this systematic review, a fusion of social opinions represented in multiple sources and in various media formats can potentially influence multiple application areas.