Table 3 presents the high-level categories of the primary studies selected for this systematic review, as discussed in Sect. 2.4.
Table 3 Categories of primary studies

It must be noted that not all the published papers were considered in the analysis conducted; therefore, this table is referenced across the different aspects of the data synthesised, as presented below. It presents the primary studies returned from each electronic library together with the additional ones, and separately identifies the ones without full access, survey papers, papers presenting work that can be applied/used on social data, and papers originating from organised tasks within the domain.
The in-depth analysis, which focused on the social media platforms, techniques, social datasets, language, modality, tools and technologies, NLP tasks and other aspects used across the published papers, is presented in Sects. 3.1–3.7.
Social media platforms
Social data refers to online data generated from any type of social media platform, be it microblogging, social networking, blogging, photo/video sharing or crowdsourcing. Given that this systematic survey focuses on opinion mining approaches that make use of social networking and microblogging services, we identify the social media platforms used in the studies within this review.
In total, 469 studies were evaluated with 66 from ACM, 155 from IEEE Xplore, 32 from ScienceDirect, 182 from SpringerLink and 34 additional ones. Papers which did not provide full access were excluded. Note that 4 survey papers—2 from ACM (Giachanou and Crestani 2016; Zimbra et al. 2018), 1 from IEEE Xplore (Wagh and Punde 2018), 1 from SpringerLink (Abdullah and Hadzikadic 2017)—and 2 SpringerLink organised/shared task papers (Loukachevitch and Rubtsova 2015; Patra et al. 2015) were included, since the former focus on Twitter Sentiment Analysis methods whereas the latter focus on Sentiment Analysis of tweets (therefore the target social media platform of all evaluated papers is clear in both cases). None of the other 14 survey papers (Rajalakshmi et al. 2017; Yenkar and Sawarkar 2018; Abdelhameed and Muñoz-Hernández 2017; Rathan et al. 2017; Liu and Young 2018; Zhang et al. 2018; Ravi and Ravi 2015; Nassirtoussi et al. 2014; Beigi et al. 2016; Lo et al. 2017; Ji et al. 2016; Batrinca and Treleaven 2015; Li et al. 2014; Lin and He 2014) was included, since various social media platforms were used in the respective studies evaluated. In addition, 2 papers that presented a general approach which can be applied/used on social data (i.e., not tied to any particular source) (Min et al. 2013; El Haddaoui et al. 2018) have also not been included.
Out of these studies, 429 made use of 1 social media platform, whereas 32 made use of 2–4 social media platforms, as can be seen in Fig. 2.
With respect to social media platforms, a total of 504 platform usages were recorded across all of the studies. These span the following 18 different platforms, which are also listed in Table 4:
1. Twitter: a microblogging platform that allows publishing of short text updates ("microposts");
2. Sina Weibo: a Chinese microblogging platform that is like a hybrid of Twitter and Facebook;
3. Facebook: a social networking platform that allows users to connect and share content with family and friends online;
4. YouTube: a video-sharing platform;
5. Tencent Weibo: a Chinese microblogging platform;
6. TripAdvisor: a travel platform that allows people to post their reviews about hotels, restaurants and other travel-related content, besides offering accommodation bookings;
7. Instagram: a platform for sharing photos and videos from a smartphone;
8. Flickr: an image- and video-hosting platform that is popular for sharing personal photos;
9. Myspace: a social networking platform for musicians and bands to show and share their talent and connect with fans;
10. Digg: a social bookmarking and news aggregation platform that selects stories for a specific audience;
11. Foursquare: formerly a location-based service and nowadays a local search and discovery mobile application known as Foursquare City Guide;
12. Stocktwits: a social networking platform for investors and traders to connect with each other;
13. LinkedIn: a professional networking platform that allows users to communicate and share updates with colleagues and potential clients, and supports job searching and recruitment;
14. Plurk: a social networking and microblogging platform;
15. Weixin: a Chinese multi-purpose messaging and social media app developed by Tencent;
16. PatientsLikeMe: a health information sharing platform for patients;
17. Apontador: a Brazilian platform that allows users to share their opinions and photos on social networks and also book hotels and restaurants;
18. Google+: formerly a social networking platform (shut down in April 2019) whose features included posting photos and status updates, grouping different relationship types into Circles, organising events and location tagging.
Table 4 Social media platforms used in the studies

Overall, Twitter was the most popular platform, with 371 opinion mining studies making use of it, followed by Sina Weibo with 46 and Facebook with 30. Other popular platforms such as YouTube (12), Tencent Weibo (8), TripAdvisor (7), Instagram (6) and Flickr (5) were also used in a few studies. These results show the importance and popularity of microblogging platforms, such as Twitter and Sina Weibo, which are very frequently used for research and development purposes in this domain. Such microblogging platforms provide researchers with an Application Programming Interface (API) to access social data, which plays a crucial role in their selection for studies. On the other hand, data retrieval from other social media platforms, such as Facebook, is becoming more challenging due to ethical concerns. For example, access to Facebook's Public Feed API is restricted and users cannot apply for it.
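To illustrate the API-based collection route on which most of these studies rely, the following is a minimal sketch assuming the Tweepy Python library (v4 or later); the bearer token and query string are placeholders, not taken from any reviewed study:

```python
import tweepy  # assumes Tweepy v4+ (pip install tweepy)

# Placeholder credentials: a real bearer token must be obtained
# from the platform's developer portal.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Retrieve up to 100 recent English tweets matching a keyword,
# excluding retweets; the query itself is purely illustrative.
response = client.search_recent_tweets(
    query="climate -is:retweet lang:en", max_results=100
)
for tweet in response.data or []:
    print(tweet.id, tweet.text)
```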
Techniques
For this analysis, 465 studies were evaluated: 65 from ACM, 154 from IEEE Xplore, 32 from ScienceDirect, 180 from SpringerLink and 34 additional ones. The studies excluded are the ones with no full access, surveys and organised task papers. The main aim was to identify the technique(s) used for the opinion mining process on social data. Therefore, the studies were categorised under the following approaches: Lexicon (Lx), Machine Learning (ML), Deep Learning (DL), Statistical (St), Probabilistic (Pr), Fuzziness (Fz), Rule (Rl), Graph (Gr), Ontology (On), Hybrid (Hy) (a combination of more than one technique), Manual (Mn) and Other (Ot). Table 5 provides the yearly statistics for all the respective approaches adopted.
Table 5 Approaches used in the studies analysed

From the studies analysed, 88 developed and used more than 1 technique within their respective studies. These techniques include the ones originally used in their approach and/or ones used for comparison/baseline/experimentation purposes. In particular, from these 88 studies, 65 used 2 techniques each, 17 studies used 3 techniques, 4 studies used 4 techniques, and 2 studies made use of 5 techniques, for a total of 584 techniques used across all studies (including the studies that used 1 technique). The results show that a hybrid approach is the most popular one, with over half of the studies adopting such an approach. This is followed by Machine Learning and Lexicon techniques, which are usually chosen to perform any form of opinion mining. These results are explained in more detail in the sub-sections below.
Lexicon
In total, 94 unique studies adopted a lexicon-based approach to perform a form of SOM, producing a total of 96 different techniques. The majority of the lexicons used were specifically related to opinions and are well known in this domain, whereas the remaining ones, though not opinion-specific, can still be used for conducting opinion mining.
Table 6 Lexicon-based studies

Table 6 presents the number of lexicons (first row, columns titled 1–8) used by the lexicon-based studies (second row). The column titled "Other/NA" covers studies that only used other general lexicons, such as acronym dictionaries, intensifier words, downtoner words, negation words and internet slang, and/or studies which do not provide any information on the exact lexicons used.
The majority of the lexicon-based studies used one or two lexicons; in total, 144 state-of-the-art lexicons (55 unique ones) were used across these studies. The following are the top six lexicons based on use (a usage sketch for the most popular one follows the list):
1. SentiWordNet (Baccianella et al. 2010)—used in 22 studies;
2. Hu and Liu (Hu and Liu 2004)—used in 12 studies;
3. AFINN (Årup Nielsen 2011) and SentiStrength (Thelwall et al. 2012)—used in 9 studies each;
4. MPQA—Subjectivity (Wilson et al. 2005)—used in 8 studies;
5. HowNet Sentiment Analysis Word Library (HowNetSenti)—used in 6 studies;
6. NRC Word-Emotion Association Lexicon (also known as NRC Emotion Lexicon or EmoLex) (Mohammad and Turney 2010, 2013), WordNet (Miller 1995) and Wikipedia's list of emoticons—used in 5 studies each.
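As an indication of how such lexicons are typically consulted, the following sketch scores a word with SentiWordNet through NLTK's corpus reader; the naive averaging over word senses is our own simplification, not the scoring scheme of any particular study:

```python
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

def word_polarity(word, pos="a"):
    """Average positive-minus-negative score over all senses of `word`."""
    synsets = list(swn.senti_synsets(word, pos))
    if not synsets:
        return 0.0  # out-of-vocabulary words are treated as neutral
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

print(word_polarity("good"))   # > 0 (positive)
print(word_polarity("awful"))  # < 0 (negative)
```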
In addition to the lexicons mentioned above, 19 studies used lexicons that they created as part of their work or focused specifically on creating SOM lexicons. Examples include Årup Nielsen (2011), who created the AFINN word list for sentiment analysis in microblogs; Javed et al. (2014), who built a bilingual sentiment lexicon for English and Roman Urdu; Santarcangelo et al. (2015), creators of the first Italian sentiment thesaurus; Wu et al. (2016) for Chinese sentiment analysis; and Bandhakavi et al. (2016) for sentiment analysis on Twitter. These lexicons varied from social media-focused lexicons (Tian et al. 2015; Ghiassi and Lee 2018; Pollacci et al. 2017), to sentiment and/or emoticon lexicons (Jurek et al. 2014; Molina-González et al. 2014; Khuc et al. 2012; Ranjan et al. 2018; Vo et al. 2017; Feng et al. 2015; Wang and Wu 2015; Zhou et al. 2014), and extensions of existing state-of-the-art lexicons (Li et al. 2016; Pandarachalil et al. 2015; Andriotis et al. 2014), such as Li et al. (2016), who extended HowNetSenti with words manually collected from the internet, and Pandarachalil et al. (2015), who built a sentiment lexicon from SenticNet (Cambria et al. 2020) and SentiWordNet for slang words and acronyms.
Machine learning
A total of 121 studies adopted a machine learning-based approach to perform a form of SOM, where several supervised and unsupervised algorithms were used. Table 7 below presents the number of machine learning algorithms (first row, columns titled 1–7) used by the machine learning-based studies (second row). The column titled "NA" refers to studies that do not provide any information on the exact algorithms used.
Table 7 Machine learning-based studies

In total, 239 machine learning algorithms were used (not distinct) across 117 studies (since 4 studies did not provide any information), with 235 being supervised and 4 unsupervised. It is important to note that this figure does not include any supervised/semi-supervised/unsupervised algorithms proposed by the respective authors; these algorithms are discussed below.
Table 8 Supervised machine learning algorithms

Table 8 provides a breakdown of the 235 supervised machine learning algorithms (not distinct) used within these studies. The NB and SVM algorithms are clearly the most popular in this domain, especially for text classification. With respect to the former, it is important to note that 20 out of the 75 studies used the Multinomial NB (MNB), a model usually utilised for discrete counts, i.e., the number of times a given term (word or token) appears in a document. The other 55 studies made use of the Multi-variate Bernoulli NB (MBNB) model, which is based on binary data, where every token in a document's feature vector takes the value 0 or 1. As for SVM, this method sorts the given data into two categories (binary classification). If multi-class classification is required, the Support Vector Classification (SVC), NuSVC or LinearSVC algorithms are usually applied, where the "one-against-one" approach is implemented for SVC and NuSVC, whereas the "one-vs-the-rest" multi-class strategy is implemented for LinearSVC.
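The distinction between the two NB variants and the multi-class SVM strategies can be illustrated with scikit-learn, which implements all three classifiers; the toy corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

posts = ["great phone, love it", "terrible battery, hate it",
         "best purchase ever", "worst service ever"]
labels = ["pos", "neg", "pos", "neg"]

# MNB models term counts, BernoulliNB binarises them (0/1 per token),
# and LinearSVC applies the one-vs-the-rest multi-class strategy.
X = CountVectorizer().fit_transform(posts)
for clf in (MultinomialNB(), BernoulliNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X[:1]))
```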
The LoR statistical technique is also widely used in machine learning for binary classification problems. In total, 16 of the studies analysed made use of this algorithm. DT learning, which applies a DT to both classification and regression problems, has also been widely used. There are various algorithms for building a DT: 2 studies used C4.5 (Quinlan 1993), an extension of Quinlan's Iterative Dichotomiser 3 (ID3) algorithm used for classification purposes; 3 studies used J48, a simple C4.5 DT for classification (Weka's implementation); 2 used the Hoeffding Tree (Hulten et al. 2001); and the other 8 used the basic ID3 algorithm.
MaxEnt, used by 12 studies, is a probabilistic classifier that is also used for text classification problems, such as sentiment analysis. More specifically, it is a generalisation of LoR for multi-class scenarios (Yu et al. 2011). RF was used in 9 studies; this supervised learning algorithm, which can be used for both classification and regression tasks, creates a forest (an ensemble of DTs) whose construction is randomised. Moreover, 7 studies used the KNN algorithm, one of the simplest classification algorithms, where no learning is required since the model structure is determined from the entire dataset.
The SentiStrength algorithm, utilised by 5 studies (Gonçalves et al. 2013; Lu et al. 2015; Baecchi et al. 2016; Yan et al. 2017; Zhang et al. 2018), can be used in both supervised and unsupervised cases, since the authors developed a version for each learning case. Conditional Random Fields, used by 4 studies (Pak and Paroubek 2010; Zhang et al. 2014; Wang et al. 2016; Hao et al. 2017), are a type of discriminative classifier that models the decision boundary amongst different classes, whereas LiR was also used by 4 studies (Bollen et al. 2011; Pavel et al. 2017; Adibi et al. 2018; Xiaomei et al. 2018). Moreover, 3 studies each used the SANT (Ou et al. 2014; Lu 2015; Xiaomei et al. 2018) and SGD (Bifet and Frank 2010; Juneja and Ojha 2017; Sánchez-Holgado and Arcila-Calderón 2018) algorithms, with the former being mostly used for comparison against the approaches proposed by the respective authors.
In addition, the PA algorithm was used in 2 studies (Li et al. 2014; Filice et al. 2014). In the former (Li et al. 2014), this algorithm was used in a collaborative online learning framework to automatically classify whether a post is emotional or not, thereby overcoming challenges posed by the diversity of microblogging styles, which increases the difficulty of classification. The authors in the latter study (Filice et al. 2014) extend the budgeted PA algorithm to enable robust and efficient natural language learning processes based on semantic kernels. The proposed online learner was applied to two real-world linguistic tasks, one of which was sentiment analysis.
Nine other algorithms were used by 7 different studies, namely: Bagging (Sygkounas et al. 2016), BN (Lu et al. 2016), CRB (Raja and Swamynathan 2016), AB (Raja and Swamynathan 2016), HMM (Zhang et al. 2014), Dictionary Learning (Asiaee et al. 2012), NBSVM (Sun et al. 2017), MCC (Çeliktuğ 2018) and ICO (Çeliktuğ 2018).
In terms of unsupervised machine learning algorithms, 4 were used in 2 of the 80 studies that used a machine learning-based technique. Suresh and Raj S. used the K-Means (KM) (Lloyd 1982) and Expectation Maximization (Dempster et al. 1977) clustering algorithms in Suresh (2016), both for comparison against an unsupervised modified fuzzy clustering algorithm proposed by the authors. The proposed algorithm produced accurate results without the manual processing, linguistic knowledge or training time that supervised approaches require. Baecchi et al. (2016) used two unsupervised algorithms, namely Continuous Bag-Of-Words (CBOW) (Mikolov et al. 2013) and Denoising Autoencoder (DA) (Vincent et al. 2008) (the SGD and backpropagation algorithms were used for the DA learning process), as well as supervised ones, namely LoR, SVM and SentiStrength, for constructing their method and for comparison purposes. They considered both textual and visual information in their work on sentiment analysis of social network multimedia. Their proposed unified model (CBOW-DA-LoR) works in both an unsupervised and semi-supervised manner, learning text and image representations as well as a sentiment polarity classifier for tweets containing images.
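As a minimal sketch of the unsupervised clustering route, the following uses scikit-learn's KMeans over tf-idf vectors (not the modified fuzzy clustering algorithm proposed in Suresh (2016)):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["love this phone", "really love it",
         "hate this phone", "really hate it"]

# Cluster unlabelled posts into two groups; which cluster corresponds
# to which polarity must still be decided after the fact.
X = TfidfVectorizer().fit_transform(posts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```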
Other studies proposed their own algorithms, with some of the established algorithms discussed above playing an important role in their implementation and/or comparison. Zimmermann et al. proposed a semi-supervised algorithm, the S*3Learner (Zimmermann et al. 2014), which suits changing opinion stream classification environments, where the vector of words evolves over time, with new words appearing and old words disappearing. Severyn et al. (2016) defined a novel and efficient tree kernel function, the Shallow syntactic Tree Kernel, for multi-class supervised sentiment classification of online comments. This study focused on YouTube, which is multilingual, multimodal, multidomain and multicultural, with the aim of finding whether the polarity of a comment is directed towards the source video, the product described in the video or another product. Furthermore, Ignatov and Ignatov (2017) presented a novel DT-based algorithm, the Decision Stream, where Twitter sentiment analysis was one of several common machine learning problems it was evaluated on. Lastly, Fatyanosa et al. (2018) enhanced the NB classifier with an optimisation algorithm, the Variable Length Chromosome Genetic Algorithm (VLCGA), proposing VLCGA-NB for Twitter sentiment analysis.
Moreover, the following 13 studies proposed an ensemble method or evaluated ensemble-based classifiers:
- Çeliktuğ (2018) used two ensemble learning methods, RF and MCC (amongst other machine learning algorithms), for sentiment classification of Twitter datasets;
- Yan et al. (2017) presented two ensemble learners built on four off-the-shelf classifiers, for Twitter sentiment classification;
- Zhang et al. (2018), Adibi et al. (2018), Çeliktuğ (2018), Vora and Chacko (2017), Lu et al. (2016), Rexha et al. (2016), Xie et al. (2012) and Zhang et al. (2011) used the RF ensemble learning method in their work;
- Troussas et al. (2016) evaluated the most common ensemble methods that can be used for sentiment analysis on Twitter datasets;
- Sygkounas et al. (2016) proposed an ensemble system composed of five state-of-the-art sentiment classifiers;
- Le et al. (2014) used multiple oblique decision stump classifiers to form an ensemble of classifiers, which is more accurate for classifying tweets than a single one;
- Neethu and Rajasree (2013) used an ensemble classifier (and single-algorithm classifiers) for sentiment classification.
Ensembles usually provide more accurate classifications than individual classifiers, i.e., classic learning approaches. In addition, ensembles reduce the overall risk of choosing a wrong classifier, especially when applying it to a new dataset (Da Silva et al. 2014).
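A hedged sketch of such an ensemble, using scikit-learn's VotingClassifier for hard (majority) voting; the base learners and toy data are illustrative choices, not those of any cited study:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

posts = ["good product", "bad product", "great experience", "awful experience"]
labels = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(posts)

# Hard voting: the ensemble prediction is the majority vote of the
# three heterogeneous base classifiers.
ensemble = VotingClassifier(estimators=[
    ("nb", MultinomialNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
], voting="hard")
ensemble.fit(X, labels)
print(ensemble.predict(X[:2]))
```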
Deep learning
Deep learning is a subset of machine learning based on Artificial Neural Networks (ANNs), algorithms inspired by the human brain, where connections, layers and neurons allow data to propagate. A total of 35 studies adopted a deep learning-based approach to perform a form of SOM, where supervised and unsupervised algorithms were used. Twenty-six (26) of the studies made use of 1 deep learning algorithm, with 5 utilising 2 algorithms, and 2 studies each using 3 and 4 algorithms, respectively. Table 9 provides a breakdown of the 50 deep learning algorithms (not distinct) used within these studies.
Table 9 Deep learning algorithms

LSTM, a prominent RNN variant that makes it easier to retain past information in memory, was used in 13 studies (Yan et al. 2018; Sun et al. 2018; Sanyal et al. 2018; Ameur et al. 2018; Wazery et al. 2018; Li et al. 2018; Chen and Wang 2018; Chen et al. 2018; Sun et al. 2017; Hu et al. 2017; Shi et al. 2017; Wang et al. 2016; Yan and Tao 2016), making it the most popular deep learning algorithm amongst the evaluated studies. Three further studies (Ameur et al. 2018; Balikas et al. 2017; Wang et al. 2016) used the BLSTM, an extension of the traditional LSTM that can improve model performance on sequence classification problems (an indicative sketch of such a model follows). In particular, a BLSTM was used in Balikas et al. (2017) to improve the performance of fine-grained sentiment classification, an approach that can benefit sentiment expressed in different textual types (e.g., tweets and paragraphs), in different languages and at different granularity levels (e.g., binary and ternary). Similarly, Wang et al. (2016) proposed a language-independent method based on BLSTM models that incorporates preceding microblogs for context-aware Chinese sentiment classification.
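The following is the indicative BLSTM sketch referred to above, written with the Keras API; the vocabulary size, layer dimensions and binary output are illustrative assumptions:

```python
import tensorflow as tf

# Token ids in, binary polarity out; the vocabulary size and layer
# dimensions are illustrative, not taken from any reviewed study.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```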
The CNN algorithm, a variant of the ANN, is made up of neurons that have learnable weights and biases, where each neuron receives an input, performs a dot product and optionally follows it with a non-linearity. In total, 12 studies (Sun et al. 2018; Ochoa-Luna and Ari 2018; Ameur et al. 2018; Adibi et al. 2018; Chen and Wang 2018; Napitu et al. 2017; Shi et al. 2017; Wehrmann et al. 2017; Zhang et al. 2017; Stojanovski et al. 2015; Wang et al. 2016; Severyn and Moschitti 2015) made use of this algorithm. Notably, Wehrmann et al. (2017) propose a language-agnostic translation-free method for Twitter sentiment analysis.
RNNs, a powerful set of ANNs useful for processing and recognising patterns in sequential data such as natural language, were used in 8 studies (Yan et al. 2018; Ochoa-Luna and Ari 2018; Piñeiro-Chousa et al. 2018; Wazery et al. 2018; Pavel et al. 2017; Shi et al. 2017; Yan and Tao 2016; Wang et al. 2016). One study in particular (Averchenkov et al. 2015) considered a novel approach to aspect-based sentiment analysis of Russian social networks based on RNNs, where the best results were obtained by using a special network modification, the RNTN. Two further studies (Lu et al. 2015; Sygkounas et al. 2016) also used this algorithm (RNTN) in their work.
Five other studies (Arslan et al. 2018; Anjaria and Guddeti 2014; Du et al. 2014; Politopoulou and Maragoudakis 2013; Zhang et al. 2011) used a simple type of ANN, such as the feedforward neural network. Moreover, the MLP, a class of feedforward ANN, was used in 2 studies (Chen and Zheng 2018; Ramadhani and Goo 2017). Similarly, 2 studies (Yan et al. 2018; Ameur et al. 2018) proposed methods based on the AE unsupervised learning algorithm which is used for representation learning. Lastly, one study each made use of the GRU (Wang et al. 2016) and DAN2 (Ghiassi et al. 2013) algorithms.
Some studies used several types of ANNs in their work. Ameur et al. (2018) used multiple methods based on AE, CNN, LSTM and BLSTM for sentiment polarity classification, and Wang et al. (2016) used RNN, LSTM, BLSTM and GRU models. Yan et al. (2018) used learning methods based on RNN, LSTM and AE for comparison with their proposed learning framework for short text classification, and Shi et al. (2017) proposed an improved LSTM which considers user-based and content-based features, using CNN, LSTM and RNN models for comparison purposes. Furthermore, Ochoa-Luna and Ari (2018) made use of CNN and RNN deep learning algorithms for tweet sentiment analysis, Wazery et al. (2018) and Yan and Tao (2016) used the RNN and LSTM, whereas Sun et al. (2018) and Chen and Wang (2018) proposed new models based on CNN and LSTM.
Statistical
A total of 9 studies (Wang et al. 2018; Kitaoka and Hasuike 2017; Arslan et al. 2017; Raja and Swamynathan 2016; Yang et al. 2014; Bukhari et al. 2016; Zhang et al. 2015; Karpowicz et al. 2013; Supriya et al. 2016) adopted a statistical approach to perform a form of SOM. In particular, one of the approaches proposed in Arslan et al. (2017) uses the term frequency-inverse document frequency (tf-idf) (Salton and McGill 1986) statistic to identify the important words within a tweet, in order to dynamically enrich Twitter-specific dictionaries created by the authors. The tf-idf statistic is also one of several statistical techniques used in Wang et al. (2018) for comparison against their proposed novel feature weighting approach for Twitter sentiment analysis. Moreover, Raja and Swamynathan (2016) focus on a statistical sentiment score calculation technique based on adjectives, whereas Yang et al. (2014) use a variation of point-wise mutual information to measure the opinion polarity of an entity and its competitors, a method that differs from traditional opinion mining approaches.
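A small sketch of the tf-idf statistic used in these studies, computed here with scikit-learn on invented example tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["battery life is great", "battery drains fast",
          "great camera great screen"]

# Terms frequent in one tweet but rare across the collection receive
# the highest weights, surfacing that tweet's distinctive words.
vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
terms = vec.get_feature_names_out()
weights = X[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda t: -t[1])[:3])
```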
Probabilistic
A total of 6 studies (Bhattacharya and Banerjee 2017; Baecchi et al. 2016; Ou et al. 2014; Ragavi and Usharani 2014; Yan et al. 2014; Lek and Poo 2013) adopted a probabilistic approach to perform a form of SOM. In particular, Ou et al. (2014) propose a novel probabilistic model, the Content and Link Unsupervised Sentiment Model, which focuses on microblog sentiment classification incorporating link information, namely behaviour, same-user and friend links.
Fuzziness
Two studies (D’Asaro et al. 2017; Del Bosque and Garza 2014) adopted a fuzzy-based approach to perform a form of SOM. D’Asaro et al. (2017) present a sentiment evaluation and analysis system based on fuzzy linguistic textual analysis. Del Bosque and Garza (2014) assume that aggressive text detection is a sub-task of sentiment analysis, which is closely related to document polarity detection given that aggressive text can be seen as intrinsically negative. This approach considers the document’s length and the number of swear words as inputs, with the output being an aggressiveness value between 0 and 1.
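The following toy sketch conveys the fuzzy idea described by Del Bosque and Garza (2014): document length and swear-word count as inputs, an aggressiveness value between 0 and 1 as output. The membership functions and weights are our own illustrative assumptions, not the authors' actual system:

```python
def ramp(x, lo, hi):
    """Linear fuzzy membership rising from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def aggressiveness(n_words, n_swears):
    # Illustrative memberships: high swear-word density and short
    # documents are both treated as signals of aggressive text.
    density = n_swears / max(n_words, 1)
    high_density = ramp(density, 0.0, 0.2)
    short_doc = 1.0 - ramp(n_words, 5, 50)
    # Fuzzy AND via min, blended into a single [0, 1] output.
    return min(1.0, 0.7 * high_density + 0.3 * min(high_density, short_doc))

print(aggressiveness(n_words=12, n_swears=3))  # value in [0, 1]
```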
Rule-based
In total, 4 studies (El Haddaoui et al. 2018; Zhang et al. 2014; Min et al. 2013; Bosco et al. 2013) adopted a rule-based approach to perform a form of SOM. Notably, Bosco et al. (2013) applied an approach for automatic emotion annotation of ironic tweets, relying on sentiment lexicons (words and expressions) and a sentiment grammar expressed through compositional rules.
Graph
Four studies (Dritsas et al. 2018; Vilarinho and Ruiz 2018; Chen et al. 2015; Rabelo et al. 2012) adopted a graph-based approach to perform a form of SOM. The study in Vilarinho and Ruiz (2018) presents a word graph-based method for Twitter sentiment analysis using global centrality metrics over graphs to evaluate sentiment polarity. In Dritsas et al. (2018), a graph-based method is proposed for sentiment classification at a hashtag level. Moreover, the authors in Chen et al. (2015) compare their proposed multimodal hypergraph-based microblog sentiment prediction approach with a combined hypergraph-based method (Huang et al. 2010). Lastly, Rabelo et al. (2012) used link mining techniques to infer the opinions of users.
Ontology
Two studies (Lau et al. 2014; Kontopoulos et al. 2013) adopted an ontology-based approach to perform a form of SOM. In particular, the technique developed in Kontopoulos et al. (2013) performs more fine-grained sentiment analysis of tweets where each subject within the tweets is broken down into a set of aspects, with each one being assigned a sentiment score.
Hybrid
Hybrid approaches are very much in demand for performing different opinion mining tasks: 244 unique studies (out of 465) adopted this approach, producing a total of 282 different techniques.
Tables 10 and 11 list these studies, together with the type of techniques used in each. In total, 38 different hybrid combinations were identified across the analysed studies.
Table 10 Studies adopting a hybrid approach consisting of two techniques

Table 11 Studies adopting a hybrid approach consisting of three and four techniques

The majority of these studies used two different techniques (213 out of 282) within their hybrid approach (see Table 10), whereas 62 used three and 7 used four different techniques (see Table 11).
The Lexicon and Machine Learning-based combination was the most used, accounting for 40% of the hybrid approaches, followed by Lexicon and Statistical-based (7.8%), Machine Learning and Statistical-based (7.4%), and Lexicon, Machine Learning and Statistical-based (7.4%) techniques.
Moreover, out of the 282 hybrid approaches, 232 used lexicons, 205 used Machine Learning and 39 used Deep Learning. These numbers reflect the importance of these three techniques within the SOM research and development domain. In light of this, a list of the lexicons, machine learning and deep learning algorithms used in these studies has been compiled, similar to Sects. 3.2.1, 3.2.2 and 3.2.3 above. The lexicons and algorithms quoted below were either used in the proposed method(s) and/or for comparison purposes in the respective studies.
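A common pattern behind such lexicon and machine learning hybrids is to feed lexicon scores to a learned classifier as additional features. The sketch below combines a VADER compound score (via NLTK) with tf-idf features; it is a generic illustration under those assumptions, not the pipeline of any specific study:

```python
import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

nltk.download("vader_lexicon", quiet=True)

posts = ["love it", "hate it", "not bad at all", "not good at all"]
labels = [1, 0, 1, 0]

# Lexicon component: one compound polarity score per post.
sia = SentimentIntensityAnalyzer()
lex_scores = np.array([[sia.polarity_scores(p)["compound"]] for p in posts])

# Machine learning component: tf-idf features augmented with the
# lexicon score, fed to a logistic regression classifier.
tfidf = TfidfVectorizer().fit_transform(posts)
X = hstack([tfidf, lex_scores])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))
```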
In terms of state-of-the-art lexicons, these total 403 within the studies adopting a hybrid approach. The top ones align with the results obtained for the lexicon-based approaches in Sect. 3.2.1 above. The following are the lexicons used more than ten times across the hybrid approaches:
1. SentiWordNet—used in 51 studies;
2. MPQA—Subjectivity—used in 28 studies;
3. Hu and Liu—used in 25 studies;
4. WordNet—used in 24 studies;
5. AFINN—used in 22 studies;
6. SentiStrength—used in 21 studies;
7. HowNetSenti—used in 15 studies;
8. NRC Word-Emotion Association Lexicon—used in 13 studies;
9. NRC Hashtag Sentiment Lexicon (Mohammad et al. 2013)—used in 12 studies;
10. SenticNet, Sentiment140 Lexicon (also known as NRC Emoticon Lexicon) (Mohammad et al. 2013), National Taiwan University Sentiment Dictionary (NTUSD) (Ku et al. 2006) and Wikipedia's list of emoticons—used in 11 studies each.
Further to the quoted lexicons, 49 studies used lexicons that they created as part of their work. Some studies composed their lexicons from emoticons/emojis extracted from a dataset (Cao et al. 2018; Li and Fleyeh 2018; Azzouza et al. 2017; Zimbra et al. 2016; You and Tunçer 2016; Chen et al. 2015; Porshnev et al. 2014; Cui et al. 2011; Zhang et al. 2012; Vu et al. 2012), combined publicly available emoticon lexicons/lists (Siddiqua et al. 2016) or mapped emoticons to their corresponding polarity (Tellez et al. 2017), while others (Gao et al. 2016; Souza et al. 2016; Su et al. 2014; Yan et al. 2014; Tang et al. 2013; Cui et al. 2011; Zhang et al. 2012; Li and Xu 2014) used seed/feeling/emotional words to establish a typical microblog emotional dictionary. Additionally, some authors constructed or used sentiment lexicons (Zhang et al. 2018; Vo et al. 2017; Rout et al. 2017; Jin et al. 2017; Ismail et al. 2018; Yan et al. 2017; Katiyar et al. 2018; Al Shammari 2018; Abdullah and Zolkepli 2017; Liu and Young 2016; Sahu et al. 2015; Cho et al. 2014; Wang et al. 2014; Chen et al. 2015; Jiang et al. 2013; Cui et al. 2013; Khuc et al. 2012; Montejo-Raez et al. 2014; Rui et al. 2013), some of which are domain or language specific (Konate and Du 2018; Hong and Sinnott 2018; Chen et al. 2017; Zhao et al. 2016; Lu et al. 2016; Zhou et al. 2014; Porshnev and Redkin 2014), others that extend state-of-the-art lexicons (Li et al. 2016, 2016; Koto and Adriani 2015), and some who made them available to the research community (Cotfas et al. 2017; Castellucci et al. 2015), such as the Distributional Polarity Lexicon.
Table 12 Machine learning algorithms used in the studies adopting a hybrid approach

Table 12 below presents the machine learning algorithms (in total 381, across 197 studies) used within the hybrid approaches. The first column indicates the algorithm, the second lists the type of learning (Supervised (Sup), Unsupervised (Unsup) and Semi-supervised (Semi-sup)), and the last column lists the total number of studies using each respective algorithm. The SVM and NB algorithms were the most used in supervised learning, a result that corresponds to the machine learning-based approaches in Sect. 3.2.2 above. With respect to the latter, 76 studies used the MBNB algorithm, 19 studies the MNB and 1 study the Discriminative MNB. Moreover, the LoR, DT (namely the basic ID3 (10 studies), J48 (5 studies), C4.5 (5 studies), Classification And Regression Tree (3 studies), Reduced Error Pruning (1 study), DT with AB (1 study), McDiarmid Tree (McDiarmid 1989) (1 study) and Hoeffding Tree (1 study) algorithms), RF, MaxEnt and SentiStrength (used in both supervised and unsupervised settings) algorithms were also used in various studies. Notably, some algorithms beyond the ones used in the machine learning-based approaches in Sect. 3.2.2 above were used in a hybrid approach, in particular SVR (Drucker et al. 1997), Extremely Randomised Trees (Geurts et al. 2006), Least Median of Squares Regression (Rousseeuw 1984), Maximum Likelihood Estimation (Fisher 1925), Hyperpipes (Witten et al. 2016), Extreme Learning Machine (Huang et al. 2006), Domain Adaptation Machine (Duan et al. 2009), RIPPER (Cohen 1995), Affinity Propagation (Frey and Dueck 2007), Multinomial Inverse Regression (Taddy 2013), Apriori (Agrawal et al. 1994), Distant Supervision (Go et al. 2009) and Label Propagation (Zhu and Ghahramani 2002).
Given that deep learning is a subset of machine learning, the algorithms used within the hybrid approaches are presented below. In total, 36 studies used the following deep learning algorithms:
- CNN—used in 16 studies (Yan et al. 2018; Stojanovski et al. 2018; Konate and Du 2018; Hanafy et al. 2018; Haldenwang et al. 2018; Ghosal et al. 2018; Chen et al. 2017; Ameur et al. 2018; Alharbi and DeDoncker 2017; Symeonidis et al. 2018; Saini et al. 2018; Jianqiang et al. 2018; Baccouche et al. 2018; Cai and Xia 2015; Kalayeh et al. 2015; Yanmei and Yuda 2015);
- ANN—used in 8 studies (Li and Fleyeh 2018; Karyotis et al. 2017; Poria et al. 2016; Er et al. 2016; Koto and Adriani 2015; Porshnev and Redkin 2014; Porshnev et al. 2014; Hassan et al. 2013);
- LSTM—used in 7 studies (Yan et al. 2018; Konate and Du 2018; Hanafy et al. 2018; Ghosal et al. 2018; Ameur et al. 2018; Sun et al. 2017; Baccouche et al. 2018);
- MLP—used in 7 studies (Villegas et al. 2018; Ghosal et al. 2018; Coyne et al. 2017; Karyotis et al. 2017; Bravo-Marquez et al. 2014; Del Bosque and Garza 2014; Thelwall et al. 2010);
- RNN—used in 4 studies (Yan et al. 2018; Liu et al. 2018; Baccouche et al. 2018; Yanmei and Yuda 2015);
- AE—used in 2 studies (Yan et al. 2018; Ameur et al. 2018);
- BLSTM—used in 2 studies (Konate and Du 2018; Ameur et al. 2018);
- DAN2—used in 2 studies (Ghiassi and Lee 2018; Zimbra et al. 2016);
- Deep Belief Network (Hinton and Salakhutdinov 2006), a probabilistic generative model composed of multiple layers of stochastic, latent variables—used in 2 studies (Jin et al. 2017; Tang et al. 2013);
- GRU—used in 1 study (Cao et al. 2018);
- Generative Adversarial Network (GAN) (Goodfellow et al. 2014), a deep neural net architecture composed of two networks, a generator and a discriminator, pitted one against the other—used in 1 study (Cao et al. 2018);
- Conditional GAN (Mirza and Osindero 2014), a conditional version of the GAN constructed by feeding the data to be conditioned on to both the generator and discriminator—used in 1 study (Cao et al. 2018);
- Hierarchical Attention Network (Yang et al. 2016), a neural architecture for document classification—used in 1 study (Liu et al. 2018).
Further to the quoted algorithms, 22 studies (Hong and Sinnott 2018; Hanafy et al. 2018; Ghosal et al. 2018; Saleena 2018; Yan et al. 2017; Tong et al. 2017; Dedhia and Ramteke 2017; Wijayanti and Arisal 2017; Xia et al. 2017; Jianqiang 2016; Prusa et al. 2015; Fersini et al. 2015; Abdelwahab et al. 2015; Kanakaraj and Guddeti 2015; Hagen et al. 2015; Cai and Xia 2015; Mansour et al. 2015; Wang et al. 2014; Tsakalidis et al. 2014; Da Silva et al. 2014; Hassan et al. 2013; Gonçalves et al. 2013) used ensemble learning methods in their work, combining the output of several base machine learning and/or deep learning methods. In particular, Gonçalves et al. (2013) compared eight popular lexicon and machine learning-based sentiment analysis algorithms, and then developed an ensemble that combines them, which in turn provided the best coverage results and competitive agreement. Moreover, Ghosal et al. (2018) propose an MLP-based ensemble network that combines LSTM, CNN and feature-based MLP models, with each model incorporating character-, word- and lexicon-level information, to predict the degree of intensity for sentiment and emotion. Lastly, as presented in Table 12, the RF ensemble learning method was used in 21 studies (Da Silva et al. 2014; Porshnev et al. 2014; Samoylov 2014; Yuan et al. 2014; Buddhitha and Inkpen 2015; Kanakaraj and Guddeti 2015; Jianqiang 2015; Bouchlaghem et al. 2016; Deshwal and Sharma 2016; Jianqiang 2016; Yan and Tao 2016; Tong et al. 2017; Jianqiang and Xiaolin 2017; Bouazizi and Ohtsuki 2017; Elouardighi et al. 2017; Bouazizi and Ohtsuki 2018; Li and Fleyeh 2018; Saleena 2018; Villegas et al. 2018; Yan et al. 2018; Zhang et al. 2018).
Other
In total, 23 studies did not adopt any of the approaches discussed in Sects. 3.2.1–3.2.10, mainly for one of three reasons: no information was provided by the authors (13 studies), an automated approach was used (4 studies), or a manual approach was used (6 studies) (Sandoval-Almazan and Valle-Cruz 2018; Fang and Ben-Miled 2017; Song and Gruzd 2017; Zafar et al. 2016; Furini and Montangero 2016; Cvijikj and Michahelles 2011). Regarding the first group, the majority (Ayoub and Elgammal 2018; Tiwari et al. 2017; Ouyang et al. 2017; Anggoro et al. 2016; Williamson and Ruming 2016; Agrawal et al. 2014; Pupi et al. 2014; Das et al. 2014) were not specifically focused on SOM (it was a secondary concern), in contrast to the others (Vivanco et al. 2017; Gonzalez-Marron et al. 2017; Chen et al. 2016; Barapatre et al. 2016; Mejova and Srinivasan 2012). As for the automated approaches (Sharma et al. 2018; Pai and Alathur 2018; Ali et al. 2018; Teixeira and Laureano 2017), some used cloud services, such as Microsoft Azure Text Analytics, or out-of-the-box functionality provided by existing tools/software libraries, such as the TextBlob Python library.
Social datasets
Numerous datasets were used across the 465 studies evaluated for this systematic review. These consisted of SOM datasets released online for public use, which have been widely used across the studies, and newly collected datasets, some of which were made available for public use while others remained private within the respective studies. In terms of data collection, the majority used the respective platform's API, such as the Twitter Search API, either directly or through a third-party library, e.g., Twitter4J. Due to the large number of datasets, only the most used ones are discussed in this section. In addition, only social datasets are mentioned, irrespective of whether other non-social datasets (e.g., news, movies) were also used, given that the main focus of this review is on social data.
The first sub-section (Sect. 3.3.1) presents an overview of the top social datasets used, whereas the second sub-section (Sect. 3.3.2) presents a comparative analysis of the studies that produced the best performance for each respective social dataset.
Overview
The following are the top fourteen social datasets used across all studies:

1. Stanford Twitter Sentiment (STS) (Go et al. 2009)—used in 61 studies: 1,600,000 training tweets collected via the Twitter API, made up of 800,000 tweets containing positive emoticons and 800,000 containing negative emoticons. These cover various topics, such as Nike, Google, China, Obama, Kindle, San Francisco, North Korea and Iran.
2. Sanders—used in 32 studies: 5513 hand-classified tweets about four topics: Apple, Google, Microsoft, Twitter. These tweets are labelled as follows: 570 positive, 654 negative, 2503 neutral and 1786 irrelevant.
3. SemEval 2013—Task 2 (Nakov et al. 2013)—used in 28 studies: training, development and test sets of Twitter and SMS messages annotated with positive, negative and objective/neutral labels via the Amazon Mechanical Turk crowdsourcing platform. This was done for two subtasks, one at expression level and one at message level.
4. SemEval 2014—Task 9 (Rosenthal et al. 2014)—used in 18 studies: continuation of SemEval 2013—Task 2, where three new test sets from regular and sarcastic tweets, and LiveJournal sentences, were introduced.
5. STS Gold (STS-Gold) (Saif et al. 2013)—used in 17 studies: a subset of STS annotated manually at tweet and entity level. The tweet labels are positive, negative, neutral, mixed or other.
6. Health care reform (HCR) (Speriosu et al. 2011)—used in 17 studies: tweets about the 2010 health care reform in the USA. A subset is annotated for polarity with the labels positive, negative, neutral and irrelevant. The polarity targets, such as health care reform, conservatives, democrats, liberals, republicans, Obama, Stupak and Tea Party, were also annotated. All were distributed into training, development and test sets.
7. Obama-McCain Debate (OMD) (Shamma et al. 2009)—used in 17 studies: 3238 tweets about the first presidential debate held in the USA for the 2008 campaign. The sentiment labels of the tweets were acquired by Diakopoulos and Shamma (2010) using Amazon Mechanical Turk, rated as positive, negative, mixed or other.
8. SemEval 2015—Task 10 (Rosenthal et al. 2015)—used in 15 studies: continues on datasets 3 and 4, with three new subtasks. The first two target sentiment about a particular topic in one tweet or a collection of tweets, whereas the third targets the degree of prior polarity of a phrase.
9. SentiStrength Twitter (SS-Twitter) (Thelwall et al. 2012)—used in 12 studies: six human-coded databases from BBC, Digg, MySpace, Runners World, Twitter and YouTube annotated for sentiment polarity strength, i.e., negative between -1 (not negative) and -5 (extremely negative), and positive between 1 (not positive) and 5 (extremely positive).
10. SemEval 2016—Task 4 (Nakov et al. 2016)—used in 9 studies: a re-run of dataset 8, with three new subtasks. The first replaces the standard two-point (positive/negative) or three-point (positive/negative/neutral) scale with a five-point scale (very positive/positive/OK/negative/very negative). The other two subtasks replace tweet classification with quantification (i.e., estimating the distribution of the classes in a set of unlabelled items) according to a two-point and five-point scale, respectively.
11. NLPCC 2012—used in 6 studies: Chinese microblog sentiment dataset (sentence level) from Tencent Weibo, provided by the First Conference on Natural Language Processing and Chinese Computing (NLP&CC 2012). It consists of a training set of microblogs about two topics and a test set about 20 topics, where the subjectivity (subjective/objective) and the polarity (positive/negative/neutral) were assigned for each.
12. NLPCC 2013—used in 6 studies: dataset from Sina Weibo used for the Chinese Microblog Sentiment Analysis Evaluation (CMSAE) task in the second NLP&CC conference (2013). The Chinese microblogs were classified into seven emotion types: anger, disgust, fear, happiness, like, sadness and surprise. The test set contains 10,000 microblogs, where each text is labelled with a primary emotion type and a secondary one (if possible).
13. Sentiment Evaluation (SE-Twitter) (Narr et al. 2012)—used in 5 studies: human-annotated multilingual dataset of 12,597 tweets in four languages, namely English, German, French and Portuguese. Polarity annotations with the labels positive, negative, neutral and irrelevant were conducted manually using Amazon Mechanical Turk.
14. SemEval 2017—Task 4 (Rosenthal et al. 2017)—used in 5 studies: a re-run of dataset 10, with two new changes: inclusion of the Arabic language for all subtasks and provision of profile information of the Twitter users that posted the target tweets.
All the datasets above are textual, with the majority composed of social data from Twitter. In terms of language, only the SE-Twitter social dataset (number 13) can be considered multilingual, with the rest targeting English (the majority) or Chinese microblogs, whereas SemEval 2017—Task 4 (number 14) introduced a new language, Arabic. An additional dataset is the one produced by Mozetič et al. (2016), which contains 15 Twitter sentiment corpora for 15 European languages. Some studies, such as Munezero et al. (2015), used one of the English-based datasets above (STS-Gold) for multiple languages, given that they adopted a lexicon-based approach. Moreover, these datasets had different usage within the respective studies, the most common being as a training/test set, for the final evaluation of the proposed solution/lexicon, or for comparison purposes. Evaluation challenges like SemEval are important for generating social datasets such as the above and that of Cortis et al. (2017), since these can be used by the Opinion Mining community for further research and development.
Comparative analysis
A comparative analysis of all the studies that used the social datasets presented in the previous sub-section was carried out. The Precision, Recall, F-measure (F1-score), and Accuracy metrics were selected to evaluate the said studies (when available) and identify the best performance for each respective social dataset. It is important to note that for certain datasets, this could not be done, since the experiments conducted were not consistent across all the studies. The top three studies (where possible) obtaining the best results for each of the four evaluation metrics are presented in the tables below.
Tables 13 and 14 provide the best results for the STS and Sanders datasets.
Table 13 Studies obtaining the best performance for the STS (1) social dataset

Table 14 Studies obtaining the best performance for the Sanders (2) social dataset

Tables 15 and 16 provide the best results for the SemEval 2013—Task 2 and SemEval 2014—Task 9 datasets, specifically for sub-task B, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 15 Studies obtaining the best performance for the SemEval 2013—Task 2 (3) social dataset

Table 16 Studies obtaining the best performance for the SemEval 2014—Task 9 (4) social dataset

Tables 17, 18 and 19 provide the best results for the STS-Gold, HCR and OMD datasets.

Table 17 Studies obtaining the best performance for the STS-Gold (5) social dataset

Table 18 Studies obtaining the best performance for the HCR (6) social dataset

Table 19 Studies obtaining the best performance for the OMD (7) social dataset

Table 20 provides the best results for the SemEval 2015—Task 10 dataset, specifically for sub-task B, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 20 Studies obtaining the best performance for the SemEval 2015—Task 10 (8) social dataset

Table 21 provides the best results for the SS-Twitter dataset.

Table 21 Studies obtaining the best performance for the SS-Twitter (9) social dataset

Table 22 provides the best results for the SemEval 2016—Task 4 dataset, specifically for sub-task A, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 22 Studies obtaining the best performance for the SemEval 2016—Task 4 (10) social dataset

Tables 23 and 24 provide the best results for the NLPCC 2012 dataset. Results quoted below are for task 1, which focused on subjectivity classification (see Table 23), and task 2, which focused on sentiment polarity classification (see Table 24). Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 23 Studies obtaining the best performance for the NLPCC 2012—Task 1 (11) social dataset

Table 24 Studies obtaining the best performance for the NLPCC 2012—Task 2 (11) social dataset

Tables 25 and 26 provide the best results for the NLPCC 2013 and SE-Twitter datasets.

Table 25 Studies obtaining the best performance for the NLPCC 2013 (12) social dataset

Table 26 Studies obtaining the best performance for the SE-Twitter (13) social dataset

Table 27 provides the best results for the SemEval 2017—Task 4 dataset, specifically for sub-task A, which focused on message polarity classification. Moreover, the results obtained by the participants of this shared task should be reviewed for a more representative comparative evaluation.

Table 27 Studies obtaining the best performance for the SemEval 2017—Task 4 (14) social dataset

The following are some comments regarding the social dataset results quoted in the tables above:
- In cases where several techniques and/or methods were applied, the highest result obtained in the study for each of the four evaluation metrics was recorded, even if the same technique did not produce the best result for all metrics.
- The average Precision, Recall and F-measure results are quoted (if provided by the authors), i.e., the average score over each classified level (e.g., the average of the results obtained for each sentiment polarity classification level: positive, negative and neutral); a short sketch of this macro-averaging follows the list.
- Results for social datasets released as a shared evaluation task, such as SemEval, were sometimes only provided in the metrics used by the task organisers, or in other metrics chosen by the authors, and are therefore not quoted.
- Certain studies evaluated their techniques on a subset of the actual dataset. Results quoted are the ones where the entire dataset was used (according to the authors and/or our understanding).
- Quoted results are for classification tasks and not aspect-based SOM, which can vary depending on the focus of the study.
- Results presented only in graph visualisations were not considered, since the exact values are not clear.
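For reference, the macro-averaging described in the second point above can be reproduced with scikit-learn as follows (the labels are invented for illustration):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

# Macro averaging: compute each metric per class, then take the
# unweighted mean over the three polarity classes.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(p, r, f1, accuracy_score(y_true, y_pred))
```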
Language
Multilingual/bilingual SOM is very challenging, since it deals with multi-cultural social data. For example, analysing Chinese and English online posts together can produce mixed sentiment results. It is therefore hard for researchers to make a fair judgement in cases where the results for online posts in different languages contradict each other (Yan et al. 2014).
The majority of the studies (354 out of 465) considered for this review analysis support one language in their SOM solutions. A total of 80 studies did not specify whether their proposed solution is language-agnostic or otherwise, or else their modality was not textual-based. Lastly, only 31 studies cater for more than one language, with 18 being bilingual, 1 being trilingual and 12 proposed solutions claiming to be multilingual. Regarding the latter, the majority were tested on a few languages at most: Castellucci et al. (2015, 2015) on English and Italian, Montejo-Raez et al. (2014) on English and Spanish, Erdmann et al. (2014) on English and Japanese, Radhika and Sankar (2017) on English and Malayalam, Baccouche et al. (2018) on English, French and Arabic, Munezero et al. (2015) on keyword sets for different languages (e.g., Spanish, French), Wehrmann et al. (2017) on English, Spanish, Portuguese and German, Cui et al. (2011) on Basic Latin (English) and Extended Latin (Portuguese, Spanish, German), Teixeira and Laureano (2017) on Spanish, Italian, Portuguese, French, English and Arabic, Zhang et al. (2017) on 8 languages, namely English, German, Portuguese, Spanish, Polish, Slovak, Slovenian and Swedish, and Gao et al. (2016) on 11 languages, namely English, Dutch, French, German, Italian, Polish, Portuguese, Russian, Spanish, Swedish and Turkish.
The list below specifies the languages supported by the 19 bilingual and trilingual studies:
- English and Italian (Severyn et al. 2016; D'Avanzo and Pilato 2015; Pupi et al. 2014);
- English and German (Abdelrazeq et al. 2016; Tumasjan et al. 2010);
- English and Spanish (Giachanou et al. 2017; Cotfas et al. 2015; Delcea et al. 2014);
- English and Brazilian Portuguese (Guerra et al. 2014);
- English and Chinese (Xia et al. 2017; Yan et al. 2014);
- English and Dutch (Flaes et al. 2016);
- English and Greek (Politopoulou and Maragoudakis 2013);
- English and Hindi (Anjaria and Guddeti 2014);
- English and Japanese (Ragavi and Usharani 2014);
- English and Roman-Urdu (Javed et al. 2014);
- English and Swedish (Li and Fleyeh 2018);
- English and Korean (Ramadhani and Goo 2017);
- English, German and Spanish (Boididou et al. 2018).
Some studies above (D’Avanzo and Pilato 2015; Anjaria and Guddeti 2014; Tumasjan et al. 2010) translated their input data into an intermediate language, mostly English, to perform SOM.
Moreover, Table 28 provides a list of the non-English languages identified from the 354 studies that support one language. Chou et al. (2017) claim that their method can easily be applied to any ConceptNet-supported language, with Wang et al. (2016) similarly claiming that their method is language independent, whereas the solution by Wang and Wu (2015) is multilingual given that emoticons are used in the majority of languages.
Table 28 Non-English languages supported by studies in this review analysis

Modality
The majority of the studies in this systematic review, and in the state-of-the-art, focus on SOM for the textual modality, with only 15 out of 465 studies applying their work to more than one modality. Other modalities, such as visual (image, video) and audio information, are often ignored, even though they contribute greatly towards expressing user emotions (Chen et al. 2015). Moreover, when two or more modalities are considered together for any form of social opinion, such as emotion recognition, they are often complementary, thus increasing the system's performance (Caschera et al. 2016). Table 29 lists the multimodal studies within the review analysis, with the ones catering for two modalities, text and image, being the most popular.
Table 29 Studies adopting a multimodal approach

Datasets
Currently available datasets and resources for SOM are restricted to the textual modality only. The following are the non-textual social datasets (not listed in Sect. 3.3) used across the mentioned studies:

- YouTube Dataset (Morency et al. 2011), used in Poria et al. (2016): 47 videos targeting various topics, such as politics, electronics and product reviews.
- SentiBank Twitter Dataset (Borth et al. 2013), used in Baecchi et al. (2016) and Cai and Xia (2015): an image dataset from Twitter annotated for polarity using Amazon Mechanical Turk. Tweets with images related to 21 hashtags (topics) resulted in 470 positive and 133 negative instances.
- SentiBank Flickr Dataset (Borth et al. 2013), used in Cai and Xia (2015): 500,000 image posts from Flickr labelled by 1553 adjective-noun pairs based on Plutchik's Wheel of Emotions, a psychological theory (Plutchik 1980).
- You Image Dataset (You et al. 2015), used in Cai and Xia (2015): an image dataset from Twitter consisting of 769 positive and 500 negative tweets with images, annotated using Amazon Mechanical Turk.
- Katsurai and Satoh Image Dataset (Katsurai and Satoh 2016), used in Ortis et al. (2018): a dataset of images from Flickr (90,139) and Instagram (65,439) with their sentiment labels.
Observations
The novel methodology by Poria et al. (2016) is the only multimodal sentiment analysis approach which caters for four different modalities, namely text, image, video and audio, extracting sentiment from social Web videos. In Caschera et al. (2016), the authors propose a method whereby machine learning techniques are trained on different and heterogeneous features for each modality, such as polarity and intensity of lexicons from text, prosodic features from audio, and postures, gestures and expressions from video. The sentiment of video and audio data in Song and Gruzd (2017) was manually coded, a task which is labour-intensive and time-consuming. The addition of images to microblogs’ textual data reinforces and clarifies certain feelings (Wang et al. 2014; Baecchi et al. 2016), thus improving the sentiment classifier with image features (Liu et al. 2015; Zhang et al. 2015; Wang et al. 2014; Cai and Xia 2015). Similarly, Chen et al. (2015) demonstrate the superiority of their multimodal hypergraph method when compared to single-modality (in this case textual) methods. These results are further supported by the method in Poria et al. (2016), which caters for more than two modalities (audio, visual and textual) and shows that accuracy improves drastically when such modalities are used together.
Flaes et al. (2016) apply their multimodal (text, images) method in a real-world application area; their research shows that several relationships exist between city liveability indicators collected by the local government and automatically extracted sentiment. For example, sentiment detected from Flickr data has a negative linear association with the number of people living on welfare checks. Results in Rai et al. (2018) show a high correlation between sentiment extracted from text-based social data and human image-based landscape preferences. In addition, results in Yuan et al. (2015) show some correlation between image and textual tweets; however, the authors note that more features and more robust data are required to determine the exact influence of multimedia content in the social domain. The work in Chen et al. (2017) adopts a bimodal approach to the problem of cross-domain image sentiment classification by using textual and visual features from the target domain and measuring text/image similarity simultaneously.
Therefore, multimodality in the SOM domain is one of numerous research gaps identified in this systematic review, providing researchers with an opportunity for further research, development and innovation in this area.
Tools and technologies
In this systematic review, we also analysed the tools and technologies used across all studies for the various opinion mining operations conducted on social data, such as NLP, machine learning and big data handling. The subsections below list the ones most used across the studies for each of these operations.
NLP
The following are the top 5 NLP tools used across all studies for various NLP tasks:
- Natural Language Toolkit (NLTK)Footnote 75: a platform that provides lexical resources; text processing libraries for classification, tokenisation, stemming, tagging, parsing, and semantic reasoning; and wrappers for industrial NLP libraries;
- TweetNLPFootnote 76: consists of a tokeniser, Part-of-Speech (POS) tagger, hierarchical word clusters, and a dependency parser for tweets, besides annotated corpora and web-based annotation tools;
- Stanford NLPFootnote 77: software that provides statistical NLP, deep learning NLP and rule-based NLP tools, such as Stanford CoreNLP, the Stanford Parser and the Stanford POS Tagger;
- NLPIR-ICTCLASFootnote 78: a Chinese word segmentation system that includes keyword extraction, POS tagging, NER, and microblog analysis, amongst other features;
- word2vecFootnote 79: an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words (a brief usage sketch follows this list).
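To make the last item concrete, the following is a minimal, illustrative sketch of training skip-gram word vectors on tokenised microposts with the gensim library (assuming gensim 4.x); the toy corpus and parameter values are ours and are not drawn from any reviewed study.

```python
# Minimal sketch: training skip-gram word vectors on tokenised microposts
# with gensim (assuming gensim 4.x); the toy corpus and parameters are
# illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "battery", "life", "is", "great"],
    ["terrible", "battery", "and", "slow", "screen"],
    ["great", "screen", "and", "great", "camera"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = continuous bag-of-words
)

print(model.wv.most_similar("great", topn=3))
```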
Machine learning
The top 5 machine learning tools used across all studies are listed below:
- WekaFootnote 80: a collection of machine learning algorithms for data mining tasks, including tools for data preparation, classification, regression, clustering, association rule mining and visualisation;
- scikit-learnFootnote 81: consists of a set of tools for data mining and analysis, such as classification, regression, clustering, dimensionality reduction, model selection and pre-processing (see the sketch after this list);
- LIBSVMFootnote 82: an integrated software package for support vector classification, regression, distribution estimation and multi-class classification;
- LIBLINEARFootnote 83: a linear classifier for data with millions of instances and features;
- SVM-LightFootnote 84: an implementation of SVMs for pattern recognition, classification, regression and ranking problems.
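As an illustration of how such toolkits are typically combined for SOM, the following minimal sketch trains a bag-of-words sentiment classifier with scikit-learn; its LinearSVC estimator is backed by LIBLINEAR (sklearn.svm.SVC is backed by LIBSVM). The toy texts and labels are ours, purely for illustration.

```python
# Minimal sketch: a bag-of-words sentiment classifier with scikit-learn.
# LinearSVC is backed by LIBLINEAR (sklearn.svm.SVC is backed by LIBSVM).
# The toy texts and labels below are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["love this phone", "worst service ever", "great camera", "awful battery"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF unigrams/bigrams feed a linear SVM, a very common SOM baseline.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["love the camera"]))
```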
Opinion mining
Certain studies used opinion mining tools in their research, either to conduct their main experiments or to compare against their proposed solution/s. The following are the top 3 opinion mining tools used:
- SentiStrengthFootnote 85: a sentiment analysis tool that is able to conduct binary (positive/negative), trinary (positive/neutral/negative), single-scale (from −4, very negative, to +4, very positive), keyword-oriented and domain-oriented classifications;
- Sentiment140Footnote 86: a tool that allows users to discover the sentiment of a brand, product or topic on Twitter;
- VADER (Valence Aware Dictionary and sEntiment Reasoner)Footnote 87: a lexicon and rule-based sentiment analysis tool that is specifically focused on sentiments expressed in social media.
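As a brief usage illustration, the following sketch scores a micropost with VADER through the vaderSentiment Python package (one common distribution of the tool; NLTK also bundles an implementation).

```python
# Brief usage sketch: scoring a micropost with VADER through the
# vaderSentiment package (pip install vaderSentiment); NLTK also bundles
# an implementation of the same tool.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The new update is AWESOME!!! :)")
print(scores)  # neg/neu/pos proportions plus a compound score in [-1, 1]
```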
Big data
Several big data technologies were used by the analysed studies. The most popular ones are categorised in the list below:
1. Relational storage
   (a) MySQLFootnote 88
   (b) PostgreSQLFootnote 89
   (c) Amazon Relational Database Service (Amazon RDS)Footnote 90
   (d) Microsoft SQL ServerFootnote 91
2. Non-relational storage
   (a) Document-based
      i. MongoDBFootnote 92
      ii. Apache CouchDBFootnote 93
   (b) Column-based
      i. Apache HBaseFootnote 94
3. Resource Description Framework Triplestore
4. Distributed Processing
   (a) Apache HadoopFootnote 95
   (b) Apache SparkFootnote 96
   (c) IBM InfoSphere StreamsFootnote 97
   (d) Apache AsterixDBFootnote 98
   (e) Apache StormFootnote 99
5. Data Warehouse
   (a) Apache HiveFootnote 100
6. Data Analytics
   (a) DatabricksFootnote 101
The MySQL relational database management system was the most used technology for storing structured social data, whereas MongoDB was the most used for storing and processing unstructured social data. On the other hand, the distributed processing technologies were used for processing large-scale social real-time and/or historical data. In particular, Hadoop MapReduce was used for parallel processing of large volumes of structured, semi-structured and unstructured social datasets stored in the Hadoop Distributed File System, whereas Spark’s ability to process both batch and streaming data was utilised in cases where velocity is more important than volume.
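As an illustration of the batch side of this trade-off, the following is a minimal, hypothetical PySpark sketch for processing archived tweets; the HDFS path and the "lang" field are assumptions for illustration, and a velocity-oriented pipeline would use Structured Streaming (spark.readStream) analogously.

```python
# Minimal, hypothetical PySpark sketch: batch processing of archived tweets.
# The HDFS path and the "lang" field are assumptions for illustration; a
# velocity-oriented pipeline would use Structured Streaming (spark.readStream).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("social-opinion-batch").getOrCreate()

tweets = spark.read.json("hdfs:///data/tweets/*.json")  # hypothetical location
lang_counts = (tweets
               .groupBy(F.col("lang").alias("language"))
               .count()
               .orderBy(F.desc("count")))
lang_counts.show()
```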
Natural language processing tasks
This section presents information about other NLP tasks that were conducted to perform SOM.
Overview
An element of NLP is performed in 283 of the 465 analysed studies, either for pre-processing (248 studies), for feature extraction (in machine learning approaches), or as one of the processing steps within their SOM solution. The most common and important NLP tasks range from Tokenisation, Segmentation and POS tagging, to NER and Language Detection.
It is important to mention that the NLP tasks mentioned above, together with Anaphora Resolution, Parsing, Sarcasm, and Sparsity, are some of the other challenges faced in the SOM domain (Khan et al. 2014). Moreover, online posts with complicated linguistic patterns are challenging to deal with (Li and Xu 2014).
However, Koto and Adriani (2015) showcase the importance and potential of NLP within this domain: they investigated the word-combination patterns of tweets with respect to subjectivity and polarity by considering their POS sequences. Results reveal that subjective tweets tend to contain adverb-adjective combinations, whereas objective tweets tend to contain combinations of nouns. Moreover, negative tweets tend to contain affirmation words combined with a negation word.
Pre-processing and negations
The majority (355 out of 465) of the studies performed some form of pre-processing. Different methods and resources were used for this step, such as NLP tasks (e.g., tokenisation, stemming, lemmatisation, NER) and dictionaries for stop words and slang acronyms (e.g., noslang.com, noswearing.com, Urban Dictionary, Internet lingo).
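A minimal sketch of such a pre-processing pipeline is shown below, using NLTK's TweetTokenizer; the slang dictionary is a hypothetical stand-in for resources such as noslang.com, and the NLTK stopword corpus must be downloaded beforehand.

```python
# Minimal pre-processing sketch using NLTK's TweetTokenizer. The slang
# dictionary is a hypothetical stand-in for resources such as noslang.com;
# the stopword corpus must be downloaded first: nltk.download("stopwords").
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

SLANG = {"gr8": "great", "luv": "love"}  # illustrative entries only
STOP = set(stopwords.words("english"))
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def preprocess(tweet):
    tokens = tokenizer.tokenize(tweet)          # handles @mentions, elongation
    tokens = [SLANG.get(t, t) for t in tokens]  # slang/acronym expansion
    return [t for t in tokens if t.isalnum() and t not in STOP]

print(preprocess("@user This phone is gr8!!! luv the camera sooooo much"))
```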
Negation handling is one of the most challenging issues faced by SOM solutions; nevertheless, 117 studies cater for negations within their approach. Several different methods are used, such as negation replacement, negation transformation, negation dictionaries, textual features based on negation words, and negation models (a sketch of one such method is given below).
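The following sketch implements a common form of negation transformation: tokens following a negation cue are suffixed with _NEG until the next punctuation mark ends the scope. The cue list and marking scheme are illustrative choices, not taken from a specific reviewed study.

```python
# Sketch of a common negation transformation: tokens after a negation cue
# are suffixed with _NEG until the next punctuation mark ends the scope.
# The cue list and marking scheme are illustrative choices.
import re

NEGATIONS = {"not", "no", "never", "n't", "cannot"}
PUNCT = re.compile(r"^[.,;:!?]$")

def mark_negation(tokens):
    out, negated = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negated = False                      # scope ends at punctuation
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negated else tok)
            if tok.lower() in NEGATIONS:
                negated = True                   # open a negation scope
    return out

print(mark_negation(["i", "do", "not", "like", "this", "phone", ".", "great", "camera"]))
# -> ['i', 'do', 'not', 'like_NEG', 'this_NEG', 'phone_NEG', '.', 'great', 'camera']
```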
Emoticons/Emojis
Social media can be seen as a sub-language that mixes emoticons/emojis with text to show emotions (Min et al. 2013). Emoticons/emojis are commonly used in tweets irrespective of the language, and are therefore sometimes considered domain- and language-independent (Khan et al. 2014), thus useful for multilingual SOM (Cui et al. 2011).
Even though some researchers remove emoticons/emojis as part of their pre-processing stage (depending on what the authors want to achieve), many others have utilised their emotional meaning within the SOM process. As a result, emoticons/emojis play a very important role in the solutions of 205 of the analysed studies, especially when the focus is on emotion recognition.
Results obtained from the emoticon networks model in Zhang et al. (2013) show that emoticons can help in performing sentiment analysis. This is supported by Jiang et al. (2015), who found that emoticons are a pure carrier of sentiment, and further by the emoticon polarity-aware method in Li et al. (2018), whose results show that emoticons can significantly improve precision when identifying sentiment polarity. In the case of hybrid (lexicon and machine learning) approaches, emoticon-aided lexicon expansion improves the performance of lexicon-based classifiers (Zhou et al. 2014). From an emotion classification perspective, Porshnev et al. (2014) analysed users' emoticons on Twitter to improve the accuracy of predictions for the Dow Jones Industrial Average and S&P 500 stock market indices. Other researchers (Cvijikj and Michahelles 2011) were interested in analysing how people express emotions, displayed via adjectives or usage of Internet slang, i.e., emoticons, interjections and intentional misspelling.
Several emoticon lists were used in these studies, with the Wikipedia and DataGeneticsFootnote 102 ones most commonly used. Moreover, emoticon dictionaries consisting of emoticons and their corresponding polarity class, such as those of Agarwal et al. (2011), Aisopos et al. (2012) and Becker et al. (2013), were also used in certain studies.
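A minimal sketch of how such an emoticon polarity dictionary is typically applied is given below; the entries and scores are illustrative and are not taken from any of the cited lexicons.

```python
# Sketch of applying an emoticon polarity dictionary of the kind used in
# these studies; the entries and scores are illustrative, not taken from
# any of the cited lexicons.
EMOTICON_POLARITY = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1, ":'(": -1}

def emoticon_score(tokens):
    """Sum the polarity of any emoticons present in a tokenised post."""
    return sum(EMOTICON_POLARITY.get(t, 0) for t in tokens)

print(emoticon_score(["great", "match", ":)", ":D"]))  # -> 2
```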
Word embeddings
Word embeddings, a type of word representation which allows words with a similar meaning to have a similar representation, were used by several studies (Severyn and Moschitti 2015; Jiang et al. 2015; Castellucci et al. 2015, 2015; Cai and Xia 2015; Gao et al. 2015; Chen et al. 2015; Stojanovski et al. 2015; Gao et al. 2016; Zhao et al. 2016; Rexha et al. 2016; Hao et al. 2017; Kitaoka and Hasuike 2017; Arslan et al. 2018; Baccouche et al. 2018; Chen et al. 2018; Ghosal et al. 2018; Hanafy et al. 2018; Jianqiang et al. 2018; Stojanovski et al. 2018; Sun et al. 2018; Wan et al. 2018; Yan et al. 2018) adopting a learning-based (Machine Learning, Deep Learning and Statistical) or hybrid approach. These studies used word embedding algorithms, such as word2vec, fastTextFootnote 103, and/or GloVeFootnote 104. Such a learned representation for text is capable of capturing the context of words within a piece of text, syntactic patterns, semantic similarity and relations with other words, amongst other properties. Therefore, word embeddings are used for different NLP problems, with SOM being one of them.
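To illustrate one common way such embeddings feed a learning-based SOM pipeline, the sketch below averages pretrained GloVe Twitter vectors into a fixed-length tweet representation; it assumes the gensim-data model "glove-twitter-25" can be downloaded, and the averaging strategy is our illustrative choice rather than any specific study's method.

```python
# Minimal sketch: averaging pretrained GloVe Twitter vectors into a fixed-
# length tweet representation for a downstream classifier. Assumes the
# gensim-data model "glove-twitter-25" is available for download.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-twitter-25")  # 25-dimensional GloVe vectors trained on tweets

def embed(tokens):
    vecs = [wv[t] for t in tokens if t in wv]  # skip out-of-vocabulary tokens
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

print(embed(["good", "morning", "twitter"]).shape)  # -> (25,)
```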
Aspect-based social opinion mining
Sentence-level SOM approaches tend to fail in discovering an opinion dimension, such as sentiment polarity, about a particular entity and/or its aspects (Cambria et al. 2013). Therefore, an aspect-level approach (also referred to as feature- or topic-based) (Hu and Liu 2004), where an opinion is made up of targets and their associated opinion dimension (e.g., sentiment polarity), has been used in some studies to overcome such issues. Certain NLP tasks, such as parsing, POS tagging and NER, are usually required to extract the entities or aspects from the respective social data.
From all the studies analysed, 39 performed aspect-based SOM, with 37 (Bansal and Srivastava 2018; Dragoni 2018; Gandhe et al. 2018; Ghiassi and Lee 2018; Kao and Huang 2018; Katz et al. 2018; Liu et al. 2018; Rathan et al. 2018; Wang et al. 2018; Zainuddin et al. 2018; Abdullah and Zolkepli 2017; Dambhare and Karale 2017; Hagge et al. 2017; Ray and Chakrabarti 2017; Rout et al. 2017; Tong et al. 2017; Vo et al. 2017; Zhou et al. 2017; Zimbra et al. 2016; Zainuddin et al. 2016, 2016; Kokkinogenis et al. 2015; Lima et al. 2015; Hridoy et al. 2015; Castellucci et al. 2015; Averchenkov et al. 2015; Tan et al. 2014; Lau et al. 2014; Del Bosque and Garza 2014; Varshney and Gupta 2014; Unankard et al. 2014; Lek and Poo 2013; Wang and Ye 2013; Min et al. 2013; Kontopoulos et al. 2013; Jiang et al. 2011; Prabowo and Thelwall 2009) focusing on aspect-based sentiment analysis, 1 (Aoudi and Malik 2018) on aspect-based sentiment and emotion analysis and 1 (Weichselbraun et al. 2017) on aspect-based affect analysis.
In particular, the Twitter aspect-based sentiment classification process in Lek and Poo (2013) consists of the following main steps: aspect-sentiment extraction, aspect ranking and selection, and aspect classification, whereas Lau et al. (2014) use NER to parse product names and determine their polarity. The aspect-based sentiment analysis approach in Hagge et al. (2017) leveraged POS tagging and dependency parsing. Moreover, Zainuddin et al. (2016) proposed a hybrid approach to analyse the aspect-based sentiment of tweets; as the authors claim, it is more important to identify opinions about specific aspects within tweets than the overall polarity, which might not be useful to organisations. In Zainuddin et al. (2018), the same authors used association rule mining augmented with a heuristic combination of POS patterns to find single- and multi-word explicit and implicit aspects. Results in Jiang et al. (2011) show that classifiers incorporating target-dependent features significantly outperform target-independent ones. In contrast to the studies discussed, Weichselbraun et al. (2017) introduced an aspect-based analysis approach that integrates affective (sentiment polarity and emotions) and factual knowledge extraction to capture opinions related to certain aspects of brands and companies; the social data analysed is classified in terms of sentiment polarity and emotions, aligned with the "Hourglass of Emotions" (Susanto et al. 2020).
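As a rough illustration of POS-pattern aspect extraction of the kind used in these studies, the following sketch treats maximal noun sequences as candidate aspects; it assumes NLTK's tokeniser and POS tagger models have been downloaded, and the pattern is a deliberate simplification, not any specific study's method.

```python
# Rough sketch of POS-pattern aspect extraction: maximal noun sequences are
# taken as candidate aspects. A simplification, not any specific study's
# method; assumes NLTK's punkt tokeniser and perceptron tagger models.
import nltk

def candidate_aspects(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    aspects, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):       # noun starts or continues a candidate
            current.append(word)
        elif current:                  # noun run ended, emit the candidate
            aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects

print(candidate_aspects("The battery life is great but the screen is dim"))
# e.g. -> ['battery life', 'screen']
```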
In terms of techniques, the majority of the aspect-based studies used a hybrid approach, with only 5 studies using deep learning for this task. In particular, the study by Averchenkov et al. (2015) used a deep learning approach based on RNNs for aspect-based sentiment analysis. A comparative review of deep learning for aspect-based sentiment analysis by Do et al. (2019) discusses current research in this domain, focusing on deep learning approaches, such as CNN, LSTM and GRU, which extract both syntactic and semantic features of text without the in-depth feature engineering required by classical NLP. For future research directions on aspect-based SOM, refer to Sect. 6.2.