1 Introduction

The advent of digitization has broadened the scope for the general public to express their sentiments or opinions on online platforms. Experts and laypeople alike now wish to reach well-informed decisions or opinions using the available opinionated data. Online platforms, such as e-commerce websites and social media sites, maintain a level of transparency that increases their chance of influencing other users. However, a single topic or item can attract millions of varied opinions on a single platform. The opinions or sentiments expressed may contain minute details or only a general impression, which increases the research community's interest in further investigation. This was the beginning of the discipline of sentiment analysis, also known as opinion mining. Sentiment analysis makes it easier to retrieve sentimental details, analyze opinionated/sentimental web data, and classify sentiment patterns in a variety of situations.

Sentiment analysis can be defined as the procedure of identifying, recognizing, and/or categorizing users' emotions or opinions about any item, such as a movie, product issue, event, or attribute, as positive, negative, or neutral (Mehta and Pandya 2020). When sentiment is stated as a polarity in computational linguistics, it is typically treated as a classification task. When the emotion is expressed through sentiment scores lying within a particular range, the task is instead regarded as a regression problem. Cortis et al. (2017) mentioned various research works where sentiment analysis is approached as either a classification or a regression task. While analyzing sentiments by assigning instances sentiment scores within the range [−1, 1], Cortis et al. (2017) found circumstances where the prediction is sometimes considered a classification task and other times a regression task. To handle this regression/classification ambiguity, the authors developed a novel approach that combined two evaluation methods to compute the similarity matrix. Therefore, the mining and analysis of sentiment is either limited to positive/negative/neutral or performed on an even deeper granular sentiment scale, depending on the necessity, topic, scenario, or application (Vakali et al. 2013).

In the decades since the paper by Pang et al. (2002), a large number of techniques, methods, and enhancements have been proposed for the problem of sentiment analysis, for different tasks and at different levels. Numerous review papers on sentiment analysis are already available. However, the existing studies do not give the scientific community a comprehensive picture of how to build a proper sentiment analysis model. A general, step-by-step framework that can be used as a guide by an expert or even by a new researcher would be ideal for designing a proper sentiment analysis model. Many of the existing surveys simply report the general approaches, methods, applications, and challenges available for sentiment analysis. The survey by Alessia et al. (2015) reports the three basic levels of sentiment analysis, presents three types of sentiment classification approaches, discusses some of the available tools and methods, and points out four application domains of sentiment analysis. That study can be extended to give more details about the different levels, methods/approaches, additional applications, and other related factors and areas. Wankhade et al. (2022) provided a detailed study of different sentiment analysis methods, the four basic levels of sentiment analysis, applications based on domains and industries, and various challenges. Their survey emphasizes several classification methods while discussing some of the necessary procedures in sentiment analysis. Instead of concentrating only on the procedures that are necessary for sentiment analysis, a detailed description of all the possible approaches is highly desirable, as it can help in selecting the best one for a certain type of sentiment analysis model. Each step/module of the sentiment analysis model should be discussed in detail to gain insight into which technique should be used given the domain, dataset availability, and other variables, or how to proceed further to achieve high performance. Further, applications of sentiment analysis are commonly described based on the domain or applicable industry; possible application areas based purely on the dataset are rarely covered by recent review papers. Some of the survey papers focus on only one direction or angle of sentiment analysis. Multimodal sentiment analysis and its applications, prospects, challenges, and adjacent fields were the main topics of the paper by Kaur and Kautish (2022). Schouten and Frasincar (2015) focused on semantically rich, concept-centric aspect-level sentiment analysis and foresaw the rise of machine learning techniques in this context. Verma (2022) addressed the application of sentiment analysis to build a smart society based on public services, showing that understanding future research directions and changes in sentiment analysis for a smart society unfolds immense opportunities for improved public services. Therefore, this survey paper aims to categorize sentiment analysis techniques in general, while critically evaluating and discussing the various modules/steps associated with them.

This paper offers a broad foundation for creating a sentiment analysis model. Instead of focusing on specific areas or enumerating the methodological steps in a scattered manner, this paper follows a systematic approach and provides an extensive discussion of the different sentiment analysis levels, modules, techniques, algorithms, and other factors associated with designing an effective sentiment analysis model. The important contributions can be summarized as follows:

  1.

    The paper outlines all the granularity levels at which sentiment analysis can be carried out, through appropriate representative examples.

  2.

    The paper provides a generic step-by-step framework that can be followed while designing a simple as well as a high-quality sentiment analysis model. An overview of different techniques for data collection and standardization, along with pre-processing, which significantly influences the efficiency of the model, is presented in this research work. Keyword extraction and sentiment classification, which have a great impact on a sentiment analysis model, are thoroughly investigated.

  3.

    Possible applications of sentiment analysis based on the available datasets are also presented in this paper.

  4.

    The paper makes an effort to review the main research problems in recent articles in this field. To facilitate the future extension of studies on sentiment analysis, some of the research gaps along with possible solutions are also pointed out in this paper.

The remainder of the paper is organized into six sections to provide a clear view of the different angles associated with a sentiment analysis process. Section 2 provides background on sentiment analysis along with its different granularity levels. A detailed discussion of the framework for performing sentiment analysis is presented in Sect. 3, where each module associated with designing an effective sentiment analysis model is discussed. Section 4 discusses different performance measures which can be used to evaluate a sentiment analysis model. Section 5 presents various possible applications of sentiment analysis based on the content of the data. Section 6 discusses the future scope of research on sentiment analysis. Finally, Sect. 7 concludes the paper.

2 Background and granularity levels of sentiment analysis

The first-ever paper that focused on public or expert opinion was published in 1940 by Stagner (1940). However, at that time, such studies were survey-based. As reported in Mäntylä et al. (2018), the earliest computer-based sentiment analysis was proposed by Wiebe (1990) to detect subjective sentences in a narrative. Research on modern sentiment analysis accelerated in 2002 with the paper by Pang et al. (2002), where ratings on movie reviews were used to perform machine learning-based sentiment classification. Pang et al. (2002) classified a document based on its overall sentiment, i.e., whether a review is positive or negative, rather than based on its topic.

Current studies mostly concentrate on multilabel sentiment classification while filtering out neutral opinions/sentiments. Because proper techniques for handling neutral opinions are lacking, excluding neutral sentiment might disrupt optimal decision-making or cause the loss of valuable information. Based on a consensus method, Valdivia et al. (2017) proposed two polarity aggregation models with neutrality proximity functions. Valdivia et al. (2018) filtered the neutral reviews using induced Ordered Weighted Averaging (OWA) operators based on fuzzy majority. Santos et al. (2020) demonstrated that the examination of neutral texts becomes more relevant and useful for comprehending and profiling particular frameworks when a specific polarity predominates. Besides, there can be opinions that contain both positive and negative emotions as a result of noise. This kind of opinion is termed an ambivalent opinion, which is often misinterpreted as neutral. Wang et al. (2020) presented a multi-level fine-scaled sentiment sensing approach and showed that the performance of sentiment sensing improves with ambivalence handling. Wang et al. (2014) introduced the idea of classifying a tweet with more positive than negative emotions into the positive category, and one with more negative emotions than positive ones into the negative category.

Computational linguistics, Natural Language Processing (NLP), text mining, and text analysis are different areas that are closely interlinked with the sentiment analysis process. The relationship between sentiment analysis and the different areas is summarized below:

  1.

    Computational linguistics

    Sentiment analysis is a blend of linguistics and computer science (Taboada 2016; Hart 2013). Thousands of human languages, as well as abbreviated or special languages such as those used on social media, are used nowadays to convey thoughts, emotions, or opinions. People might use a single language or a combination of languages, for example Hinglish (a combination of Hindi and English), along with emoticons or symbols, to convey their messages. Computational linguistics assists in obtaining a computer-executable and understandable representation of this vast source of raw language, so that the associated sentiments can be extracted properly. By developing formal theories of parsing and semantics along with statistical methods such as deep learning, computational linguistics forms the foundation for performing sentiment analysis.

    Linguistics knowledge aids in the development of the corpus set that will be used for sentiment analysis while understanding the characteristics of the data it operates on and determining which linguistic features may be applied. Data-driven or rule-based computer algorithms are designed to extract subjective information or to score polarity with the help of linguistic features, corpus linguistics, computational semantics, part of speech tagging, and the development of analytical systems for parsing. Connotations and associations are used to construct sentiment lexicons.

    Recognition of sarcasm, mood classification, and polarity classification are some of the tasks covered by sentiment analysis, which is itself a small subset of the discipline of computational linguistics. Approaches to classifying moods introduce a new dimension that is based on external psychological models. Methods for detecting sarcasm make use of ideas like "content" and "non-content" terms, which coexist in linguistic theory. Pragmatic models of language, such as Grice's well-known maxims, are used to characterize sarcasm.

  2.

    Natural language processing

    NLP deciphers human language and makes it machine understandable. With the aid of NLP, the sentiments behind human-generated online comments, social media posts, blogs, and other information can be processed and represented by patterns and structures that can be used by software to comprehend and implement them. Sentiment analysis can be considered as a subset of NLP which helps users in opinionative/sentimental decision-making.

    Different NLP tasks such as tokenization, stemming, lemmatization, negation detection, n-gram creation, and feature extraction aid in proper sentiment analysis. NLP-based pre-processing helps in improving the polarity classifier’s performance by analyzing the sentiment lexicons that are associated with the subject (Chong et al. 2014). As a result, NLP facilitates text comprehension, accurately captures text polarity, and ultimately facilitates improved sentiment analysis (Rajput 2020; Solangi et al. 2018).

    Advanced NLP techniques are often needed when dealing with emoticons, multilingual data, idioms, sarcasm, sense or tone, bias, negation, etc.; otherwise, the outcome can deteriorate drastically. For instance, if NLTK's general stopwords list is used during pre-processing, words like not, nor, and no are frequently deleted along with the other stopwords, yet removing such words can alter the actual sentiment of the data. Thus, depending on how they are applied, NLP tasks can either improve or degrade the result.

  3.

    Text mining

    Text messages, comments, reviews, and blog posts are excellent sources of sentimental information. The extraction of useful information and knowledge hidden in textual data is an important aspect of sentiment analysis. Mining the relevant information from textual data offers multi-dimensional advantages for improved decision-making, public influence, national security, health and safety, etc. (Zhang et al. 2021; Wakade et al. 2012). Text mining involves the use of statistical techniques to retrieve quantifiable data from unstructured text and uses NLP to transform the unstructured text into normalized, structured data, which makes it suitable for sentiment analysis.

    Sentiment analysis, however, is not confined to text. In some cases, such as when a sarcastic comment is made, or when pointing a finger at someone and saying, "You are responsible!", the exact sentiment behind the plain text might not be conveyed properly. Non-text data like video, audio, and images are helpful in such scenarios to portray sentiment accurately.

  4.

    Text analysis

    A key part of sentiment analysis is extracting insightful information, trends, and patterns from unstructured and semi-structured text data, and text analysis is the process that supports this. Using techniques including word spotting, manual rules, text classification, topic modeling, and thematic analysis, the procedure helps in the extraction of meaning from text. Text analysis can also be used to specify individual lexical items (words or phrases) and observe their patterns.

    Sentiment analysis, in contrast to basic text analytics, fundamentally shows the emotion concealed beneath the words, while text analytics analyses the grammar and relationships between words. Sentiment analysis essentially identifies whether a topic conveys a positive, negative, neutral, or any other sentiment, while text analysis is used to identify the most popular topics and the prevalent ideas in texts. In addition, it can be more challenging to specify the intended target of a conveyed sentiment than it is to determine a document's general subject.

    A textual document with numerous opinions would have a mixed polarity overall, as opposed to having no polarity at all (being objective). It is also important to distinguish between the polarity and the strength of a conveyed sentiment. One may have strong feelings about a product being decent, average, or awful while having mild feelings about a product being excellent (perhaps because one has only used it for a brief period before forming an opinion). Also, unlike topical (text-based) analysis, in many cases, such as that of quotations, it is critical to understand whether the sentiment conveyed in the document accurately reflects the author's true intentions or not.

    Analyzing the existence of an important word in conjunction with the use of a sentiment score approach can help to uncover the most profound and specific insights that can be used to make the best decision in many situations. Areas of application for sentiment analysis aided by appropriate text analysis include strategic decision-making, product creation, marketing, competition intelligence, content suggestion, regulatory compliance, and semantic search.

2.1 Granularity levels

At present, a sentiment analysis model can be implemented at various granular levels according to the requirement and scope. There are mainly four levels of sentiment analysis that have gained a lot of popularity. They are document level (Pang et al. 2002; Li and Li 2013; Hu and Li 2011; Li and Wu 2010; Rui et al. 2013; Zhan et al. 2009; Yu et al. 2010), sentence or phrase level (Nguyen and Nguyen 2017; Wilson et al. 2005; Narayanan et al. 2009; Liu et al. 2013; Yu et al. 2013; Tan et al. 2012; Mullen and Collier 2004), word level (Nielsen 2011; Dang et al. 2009; Reyes and Rosso 2012; Bollegala et al. 2012; Thelwall and Buckley 2013; Li et al. 2014), and entity or aspect level (Li et al. 2012; Li and Lu 2017; Quan and Ren 2014; Cruz Mata et al. 2013; Mostafa 2013; Yan et al. 2015; Li et al. 2015a).

Some of the other research works concentrate on concept level (Zad et al. 2021; Tsai et al. 2013; Poria et al. 2013; Balahur et al. 2011; Cambria et al. 2022; Cambria 2013), link/user level (Rabelo et al. 2012; Bao et al. 2013; Tan et al. 2011), clause level (Kanayama and Nasukawa 2006; Liu et al. 2013), and sense level (Banea et al. 2014; Wiebe and Mihalcea 2006; Alfter et al. 2022) sentiment analysis. Some of the important levels of sentiment analysis are discussed in the following sub-sections. To understand the different levels, let us consider a customer review R as shown below.

R = “I feel the latest mobile from iPhone is really good. The camera has an outstanding resolution. It has a long battery life. I can even bear the mobile’s heating problem. However, I feel it could have been a bit light weighted. Given the configurations, it is a bit expensive; but I must give a thumbs up for the processor.”

In the following subsections, we will observe the analysis of review R based on different levels.

2.1.1 Document-level sentiment analysis

It aims to assess a document's overall emotional content and assumes that the whole document expresses a single sentiment (Pang et al. 2002; Hu and Li 2011). The general approach at this level is to combine the polarities of each word/sentence in the document to find the overall polarity (Kharde and Sonawane 2016). According to document-level sentiment analysis, the overall sentiment of the document represented by review R is positive. According to Turney (2002), there are two approaches to document sentiment classification, namely term counting and machine learning. The term-counting approach derives a sentiment score by counting the total positive and negative terms in the document, whereas machine learning approaches generally yield superior results. At this level, it is assumed that the document focuses on only one object and thus holds an opinion about that particular object only; if the document contains opinions about different objects, this approach is not suitable.
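For illustration, the following minimal sketch applies the term-counting idea to part of review R; the polarity lexicon and the tie-breaking rule are illustrative assumptions, not taken from the cited works.

```python
# Minimal sketch of document-level term counting (illustrative lexicon, not from a cited work).
POSITIVE = {"good", "outstanding", "long"}        # hypothetical positive terms
NEGATIVE = {"heating", "expensive", "problem"}    # hypothetical negative terms

def document_polarity(document: str) -> str:
    tokens = [t.strip(".,!") for t in document.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

review_R = ("I feel the latest mobile from iPhone is really good. The camera has an "
            "outstanding resolution. It has a long battery life.")
print(document_polarity(review_R))  # -> "positive"
```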

2.1.2 Sentence/phrase-level sentiment analysis

The sentiment associated with each sentence of a dataset is analyzed at this level of sentiment analysis. The general approach is to combine the sentiment orientations of the words in a sentence/phrase to compute the sentiment of that sentence/phrase (Kharde and Sonawane 2016). It attempts to classify a sentence as conveying positive, negative, neutral, or mixed sentiment, or as a subjective or objective sentence (Katrekar and AVP 2005). Objective sentences are facts and do not convey any sentiment about an object or entity; they play no role in polarity determination and thus need to be filtered out (Kolkur et al. 2015). Each sentence in review R is found to be positive, negative, or mixed, irrespective of the review's overall polarity.

2.1.3 Word-level sentiment analysis

Through proper examination of the polarity of every word, this level of sentiment analysis investigates how impactful individual words can be on the overall sentiment. The two methods of automatically assigning sentiment at this level are dictionary-based and corpus-based methods (Kharde and Sonawane 2016). According to Reyes and Rosso (2012), corpus-based techniques use the co-occurrence patterns of words for sentiment determination; however, the statistical information needed to determine a word's sentiment orientation usually depends on a large corpus. The dictionary-based approaches use synonyms, antonyms, and hierarchies from lexical resources such as WordNet and SentiWordNet (SWN) to determine the sentiments of words (Kharde and Sonawane 2016). Such techniques assign positive, negative, and objective sentiment scores to each synset. If words in review R such as outstanding, expensive, etc. are evaluated individually, different words within a particular sentence are observed to hold different polarities.
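A minimal sketch of the dictionary-based approach using NLTK's SentiWordNet interface is shown below; only the first adjective synset of each word is scored, whereas a full system would first disambiguate the word sense (see Sect. 2.1.8). Depending on the NLTK version, an additional corpus such as omw-1.4 may also need to be downloaded.

```python
# Sketch of dictionary-based word-level scoring with NLTK's SentiWordNet interface.
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)
from nltk.corpus import sentiwordnet as swn

for word in ["outstanding", "expensive", "good"]:
    synsets = list(swn.senti_synsets(word, "a"))   # "a" restricts the lookup to adjectives
    if synsets:
        s = synsets[0]                              # first sense only; no disambiguation here
        print(f"{word}: pos={s.pos_score()}, neg={s.neg_score()}, obj={s.obj_score()}")
```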

2.1.4 Aspect or entity-level sentiment analysis

For a specific target entity, this approach essentially identifies various aspects associated with it. Then, the sentiment expressed towards the target by each of its aspects is determined in this level of sentiment analysis. As a result, it can be divided into two different tasks, namely extraction of aspects and sentiment classification of aspects (Liu and Zhang 2012). For the different aspects such as resolution, weight, and price of the same product in review R, different sentiments are conveyed.

2.1.5 Concept-level sentiment analysis

Most of the time, merely using emotional words to determine sentiment or opinion is insufficient. To obtain the best results, a thorough examination of the underlying meaning of the concepts and their interactions is required. Concept-level sentiment analysis intends to convey the semantic and affective information associated with opinions, with the use of web ontologies or semantic networks (Cambria 2013). Rather than simply using word co-occurrences or other dictionary-based approaches as in word-level sentiment analysis, or finding the overall opinion about a single item as in document-level sentiment analysis, concept-level sentiment analysis generally makes use of feature spotting and polarity detection based on different concepts. For example, "long battery life" in review R is considered positive, whereas a "long route" might not be preferable if someone wants to reach a destination in minimum time and can thus be considered negative. Tsai et al. (2013) made use of features of the concept itself as well as features of the neighboring concepts.

2.1.6 User-level sentiment analysis

User-level sentiment analysis takes into account the fact that if there is a strong connection among users of a social platform, then the opinion of one user can influence other users. Also, they may hold similar sentiments/opinions for a particular topic (Tan et al. 2011). At the user level, all the followers of the reviewer of review R may get influenced by this review.

2.1.7 Clause-level sentiment analysis

A sentence can be a combination of multiple clauses, each conveying different sentiments. The clauses in review R can be observed to represent opposing polarity because they are separated by the word “but”. Clause-level sentiment analysis focuses on the sentiment associated with each clause based on aspect, associated condition, domain, grammatical dependencies of the words in the clause, etc.

2.1.8 Sense-level sentiment analysis

The words which form a sentence can have different meanings based on their usage in the sentence. Specifically, when the same word has multiple meanings, the sense in which the word is used can strongly affect the sentiment orientation of the whole sentence or document. Consider, for example, the word "bear" in review R: is it referring to the mammal, or does it indicate bearing (tolerating) something? In what sense is it used, and is it a noun or a verb? In such cases, proper knowledge of the grammatical structure or word sense can contribute immensely to determining the appropriate sentiment of any natural language text. Thus, resolving words' syntactic ambiguity and performing word sense disambiguation (Wiebe and Mihalcea 2006) are vital parts of designing an advanced sentiment analysis model. Alfter et al. (2022) provided a sense-level annotated resource rather than word-level annotation and performed various experiments to explore the explanations of difficult words.

The analysis of review R at different levels shows that the same review can have different interpretations based on the requirement. Single-level approaches work well in most cases. However, when the evaluation of sentiments is based on very short or very long documents, a single-level model may lack the required flexibility. To determine the polarity of overall documents, Li et al. (2010) combined phrase-level and sentence-level sentiment analysis to design a multi-level model. Valakunde and Patwardhan (2013) advised following a ladder-like computation, in which aspect- or entity-level sentiment is employed to compute the sentence-level sentiments, and the weightage of entities is then combined with the sentence-level sentiments to evaluate the complete document.
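The ladder-like idea can be sketched as a simple weighted aggregation, as below; the aspect scores, entity weights, and averaging scheme are illustrative assumptions rather than the exact formulation of the cited works.

```python
# Illustrative sketch of ladder-like aggregation: aspect scores -> sentence scores -> document score.
# The scores and entity weights are made-up values, not taken from the cited works.
def sentence_score(aspect_scores: list[float]) -> float:
    return sum(aspect_scores) / len(aspect_scores)          # aspect level -> sentence level

def document_score(sentences: list[tuple[float, float]]) -> float:
    # each sentence carries (sentence_score, entity_weight); weights reflect entity importance
    total_weight = sum(weight for _, weight in sentences)
    return sum(score * weight for score, weight in sentences) / total_weight

s1 = sentence_score([0.9, 0.7])     # e.g., camera resolution, battery life
s2 = sentence_score([-0.6])         # e.g., price
print(document_score([(s1, 0.7), (s2, 0.3)]))   # overall document polarity score
```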

3 General framework of sentiment analysis

The evolution of sentiment analysis marks the emergence of different models by different experts. After going through more than 500 sentiment analysis models proposed till now, a general framework of sentiment analysis is presented in Fig. 1. The framework comprises mainly four modules along with an additional optional module. The modules perform collection and standardization of data; pre-processing of the dataset; extraction of features or keywords which represent the overall dataset; prediction or classification of the sentiments associated with the keywords or the whole sentence or document according to the requirement; and summarization of the overall sentiment associated with the dataset. The different modules are discussed in detail below.

Fig. 1: General framework of sentiment analysis

3.1 Data collection and standardization

With the growing number of platforms for expression, the types and formats in which people express their views, opinions, or sentiments on a particular subject are also increasing. Among the different available types of data, such as text, image, audio, and video, research on textual data has gained momentum in the last few years. Although multilingual text data has attracted some researchers, currently about 90% of sentiment analysis studies, experiments, and designs concentrate on English textual data.

The development, examination, and validation of a system typically depend on the quality and structure of the data used for building, operating, and maintaining the model. The overall functionality of a model depends to a great extent on the data selected from the boundless and voluminous sources available. Many public data sources are used by researchers to design sentiment analysis models. The publicly available Blitzer multi-domain sentiment dataset (Blitzer et al. 2007) is used by Dang et al. (2009). Public product reviews from Epinions (epinions.com) are also used by some researchers (Kharde and Sonawane 2016; Fahrni and Klenner 2008). The UCI Machine Learning Repository provides standard sentiment datasets, such as Twitter Data for Arabic Sentiment Analysis, Sentiment Labelled Sentences, Paper Reviews, and Sentiment Analysis in Saudi Arabia about distance education during Covid-19.

The overwhelming rate of data production demands a system that keeps updating its database from time to time to avoid over-generality or a bias toward the interests of a particular time. Manually collecting a substantial volume of data is not desirable practice, so automatic big data collection techniques are a vital aspect that must be observed keenly. Several tools and APIs have emerged recently that help to collect data from online social or e-commerce platforms, such as NodeXL, Google Sheets with Twitter Archiver, Zapier, RapidMiner, ParseHub, BeautifulSoup in Python, and WebHarvy. Most of these tools and APIs help collect real-time data. The main problem occurs when someone needs to work with historical data, because many of these techniques, such as the Twitter API, do not permit extracting tweets older than seven days.

Building a standard database involves dealing with the unstructured information attached to data from the internet. For a dataset representing a particular topic, proper standardization into an appropriate type, format, and context extensively boosts the overall outcome of the analysis. To design a robust system, the homogeneity of the data must be maintained. Besides, proper labelling of the collected data can improve the performance of the sentiment analysis model. Different online labelling techniques are available nowadays; however, they are sometimes noisy, which lowers the accuracy of the system. Designing an automatic labelling system that makes use of statistical knowledge of the whole corpus and appropriate domain knowledge of words contributes more to enhancing the sentiment analysis process.
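As an illustration of automated collection and standardization, the sketch below scrapes review text with requests and BeautifulSoup and stores it in a uniform record format. The URL, the CSS selector, and the record fields are placeholders that must be adapted to the target site (and to its terms of service).

```python
# Minimal sketch of collecting and standardizing review text with requests + BeautifulSoup.
# The URL and the CSS selector are placeholders; every site needs its own selector.
import datetime
import requests
from bs4 import BeautifulSoup

def collect_reviews(url: str, selector: str = "div.review-text") -> list[dict]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for node in soup.select(selector):
        records.append({
            "text": " ".join(node.get_text().split()),       # collapse whitespace
            "source": url,
            "collected_at": datetime.datetime.utcnow().isoformat(),
            "label": None,                                    # to be filled by a labelling step
        })
    return records
```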

3.2 Pre-processing

The process of removing noise from a textual dataset and preparing a clean, relevant, and well-structured dataset for the sentiment analysis process is called pre-processing. Appropriate pre-processing of a dataset noticeably improves the sentiment analysis process. For analyzing the sentiment of online movie reviews, Zin et al. (2017) adopted a three-tier approach to examine the effect of pre-processing tasks. In the first tier, they experimented with the removal of stopwords using the English stopwords list. Stopwords are words such as the articles a, an, and the, which have no effective role in determining sentiment. In the second tier, sentiment analysis is performed after the removal of stopwords and all other meaningless characters/words such as dates (16/11/20), special characters (@, #), and words with no meaning (a+, a-, b+). In the third tier, more cleaning strategies are used: numbers and words having fewer than three characters are removed along with the stopwords and meaningless words. Their results demonstrate that the different combinations of pre-processing steps yield favorable improvements in the classification process, establishing the significance of removing stopwords, meaningless words such as special characters, numbers, and words with fewer than three characters.

Jianqiang (2015) found that replacing negations and expanding acronyms have a positive effect on sentiment classification, whereas the removal of URLs, numbers, and stopwords hardly changes the accuracy. To establish that efficient pre-processing can increase the accuracy of a sentiment analysis model, Haddi et al. (2013) combined various pre-processing methods on online movie reviews, following steps such as cleaning online text, removing white space, expanding abbreviations, stemming, eliminating stopwords, and handling negation. Apart from these, they also considered feature selection as a pre-processing step and used the chi-square method to filter out the less impactful features. To handle negation, a few researchers, such as Pang et al. (2002), tagged every word following a negation word until the next punctuation mark. However, the authors of Haddi et al. (2013) and Dave et al. (2003) observed that the results before and after such tagging remain almost the same; therefore, Haddi et al. (2013) reduced the number of tagged following words to three and two.

Saif et al. (2014) observed that a list of pre-compiled stopwords negatively affects Twitter sentiment classification, although pre-processing significantly reduces the original feature space. Jianqiang and Xiaolin (2017) showed that stopword removal, acronym expansion, and negation replacement are effective pre-processing steps; according to them, URLs and numbers do not contain useful information for sentiment analysis. They also found that reverting words with repeated characters gives fluctuating performance. This is likely because, in some situations, a word such as goooood gets replaced by goood, creating confusion about whether it should be interpreted as good or god. Such a situation may alter the actual polarity conveyed by the word; therefore, reverting words with repeated characters is not recommended.
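A minimal sketch of the pre-processing steps discussed above, retaining negation words during stopword removal, is given below; the exact combination of steps (e.g., whether to drop numbers, URLs, or short words) should follow the findings cited in the text.

```python
# Sketch of pre-processing: cleaning, stopword removal that keeps negation words, and stemming.
import re
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

NEGATIONS = {"no", "not", "nor", "never"}
STOPWORDS = set(stopwords.words("english")) - NEGATIONS   # keep negation words
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)       # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove numbers and special characters
    tokens = [t for t in text.split()
              if t in NEGATIONS or (t not in STOPWORDS and len(t) > 2)]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The battery is NOT good!! http://example.com 2023"))
# -> ['batteri', 'not', 'good']
```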

3.3 Feature/keyword extraction

In a sentiment analysis model, the words and symbols within the corpus are mainly used as the features (O'Keefe and Koprinska 2009). Traditional topical text classification approaches are used in most sentiment analysis systems, in which a document is treated as a Bag of Words (BOW), projected as a feature vector, and then categorized using a proper classification technique. Experts use a variety of feature sets to boost sentiment classification efficiency, including higher-order n-grams (Pang et al. 2002; Dave et al. 2003; Joshi and Rosé 2009) and word pairs and dependency relations (Dave et al. 2003; Joshi and Rosé 2009; Gamon 2004; Subrahmanian and Reforgiato 2008). Using different word-relation feature sets, namely unigrams (one word), bigrams (two words), and dependency parsing, Xia et al. (2011) performed sentiment classification using an ensemble framework. Wiebe and Mihalcea (2006) introduced a ground-breaking study focused on the Measure of Concern (MOC) to assess public issues using Twitter data and the most significant unigrams. While conducting text opinion mining, Sidorov et al. (2013) demonstrated the supremacy of unigrams, as well as other suitable settings such as minimal classes, the efficacy of balanced and unbalanced corpora, the usage of appropriate machine learning classifiers, and so on.

Not every word present in a dataset is important in the context of sentiment analysis, and the continuous growth of data has made precise sentiment classification more difficult. Even after cleaning the dataset with various pre-processing steps, using all of the data can result in dimensionality issues, longer computation times, and the use of irrelevant or less significant features or terms. These problems become even worse for higher-dimensional and multivariate data. According to Li et al. (2017), a good word representation that captures sentiment performs well in word sentiment analysis and sentence classification, and building document-level sentiment analysis dynamically based on the words in need is the best practice. Keyword extraction is a method for extracting essential features/terms from textual data by identifying particular terms, phrases, or words from a document to represent the document concisely (Benghuzzi and Elsheh 2020). If a text's keywords are extracted correctly, the text's subject can be thoroughly researched and evaluated, and a good decision can be made about the text. Given that manually extracting keywords from such a large number of databases is a repetitive, time-consuming, and costly process, automated keyword extraction has become a popular field of research in recent years. Automatic keyword extraction can be categorized into supervised, semi-supervised, and unsupervised methods (Beliga et al. 2015). The keywords are mainly represented using either a Vector Space Model (VSM) or a Graph-Based Model (GBM) (Ravinuthala et al. 2016; Kwon et al. 2015). Once the datasets are represented using any of the VSM or GBM techniques, the keywords are extracted using simple statistics, linguistics, machine learning techniques, or hybridized methods (Bharti and Babu 2017). Statistical keyword extraction methods comprise simple methodologies that do not require training data and are independent of language and domain. To identify keywords, researchers have used the frequency of terms, Term Frequency-Inverse Document Frequency (TF-IDF), co-occurrences of terms, n-gram statistics, the PATricia (PAT) tree, and other statistics from documents (Chen and Lin 2010). The linguistic approach examines the linguistic properties of words, sentences, and documents, with lexical, semantic, syntactic, and discourse analysis being the most frequently studied properties (HaCohen-Kerner 2003; Hulth 2003; Nguyen and Kan 2007). A machine learning technique applies supervised or unsupervised learning while extracting keywords. Supervised learning produces a system that is trained on a collection of relevant keywords, followed by the identification and analysis of keywords within unfamiliar texts (Medelyan and Witten 2006; Theng 2004; Zhang et al. 2006). All of these methods are combined in the hybrid method of keyword extraction.

O'Keefe and Koprinska (2009) performed sentiment analysis using machine learning classifiers, which they validated on the movie review dataset. Along with the use of feature presence, feature frequency, and TF-IDF as feature weighting methods, they proposed SWN Word Score Groups (SWN-SG), SWN Word Polarity Groups (SWN-PG), and SWN Word Polarity Sums (SWN-PS) using words grouped by their SWN values. The authors suggested categorical Proportional Difference (PD), SWN Subjectivity Scores (SWNSS), and SWN Proportional Difference (SWNPD) as feature selection techniques. They discovered that feature weights based on unigrams, especially feature presence, outperformed the SWN-based methods. Using different machine learning techniques and various feature selection techniques, Tan and Zhang (2008) proposed a model for sentiment analysis of Chinese documents in three domains: education, film, and home. Mars and Gouider (2017) proposed a MapReduce-based algorithm for determining opinion polarity using features of consumer opinions and big data technologies combined with Text Mining (TM) and machine learning tools. Using a supervised approach, Kummer and Savoy (2012) suggested a KL score for weighting features for sentiment and opinion mining. All these research works establish that the machine learning approach to keyword extraction, when incorporated with other techniques, has great scope in the field of sentiment analysis. The different kinds of methods used to perform keyword extraction with the VSM and GBM approaches are discussed in detail below.
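As a concrete illustration of a BOW/n-gram feature space and a simple statistical keyword ranking, the following sketch uses scikit-learn's TfidfVectorizer (requires a recent scikit-learn); the toy corpus and the choice of the top three terms per document are illustrative assumptions.

```python
# Sketch of a BoW / n-gram feature space and a simple statistical keyword ranking.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the camera has an outstanding resolution",
    "the battery life is long but the phone is expensive",
    "the processor deserves a thumbs up",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(corpus)               # document-term matrix (VSM)
terms = vectorizer.get_feature_names_out()

for i in range(X.shape[0]):
    row = X[i].toarray().ravel()
    top = row.argsort()[::-1][:3]                  # three highest-weighted terms per document
    print(i, [terms[j] for j in top])
```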

3.3.1 Vector space model

In VSM, the documents are represented as vectors of the terms (Wang et al. 2015). VSM involves building a matrix V, usually termed a document-term matrix, where the rows represent the documents in the dataset and the columns correspond to the terms of the whole dataset. Thus, if the set of documents is represented by \(D = (d_{1}, d_{2}, \ldots , d_{m})\) and the set of terms/tokens representing the entire corpus is \(T = (t_{1}, t_{2}, \ldots , t_{n})\), then each element \(dt_{i,j} \in V_{m \times n}\), \(i=1,2,\ldots , m\), \(j=1,2,\ldots , n\), is assigned a weight \(w_{i,j}\). The weights can be assigned based on the word frequency associated with a document or the entire dataset. According to Abilhoa and De Castro (2014), the frequencies can be binary, absolute, relative, or weighted. Algorithms such as binary, Term Frequency (TF), and TF-IDF are used in traditional term weighting schemes.

  a.

    Binary

    If document \(d_i\) contains the term \(t_j\), the element \(dt_{i,j}\) of a term vector is assigned a value 1 in the binary term weighting scheme, otherwise, the value 0 is assigned (Salton and Buckley 1988). It has the obvious drawback of being unable to recognize the most representative words in a text. Furthermore, using word frequency often helps to increase the importance of terms in documents.

  b.

    TF

    The limitation of the binary term weighting scheme motivates the use of term frequency as the weight of a term for a specific text. The number of times a word appears in a text is known as its term frequency. As a result, a value \(w_{i,j}\) is assigned to \(dt_{i,j}\) with \(w_{i,j}\) equaling the number of times the word \(t_j\) appears in the document \(d_i\). However, as opposed to words that appear infrequently in documents, terms that appear consistently in all documents have less distinguishing power to describe a document (Kim et al. 2022). This is an area where the TF algorithm falls short.

  c.

    TF-IDF

    The number of documents in the entire document corpus where a word appears is known as its document frequency. If a word has a higher document frequency, it has a lower distinguishing power, and vice versa. As a result, the Inverse Document Frequency (IDF) metric is used as a global weighting factor to highlight a term’s ability to identify documents. Equation 1 (Zhang et al. 2020) may be used to describe a term’s TF-IDF weight as follows:

    $$\begin{aligned} W(t_{k}) = tf_{k} \cdot \log \left( \frac{m}{df_{k}} \right) \end{aligned}$$
    (1)

    where, \(tf_{k}\) denotes the frequency of the term \(t_{k}\) in a specific document and \(df_{k}\) denotes the document frequency of the term \(t_k\), i.e., the number of documents containing the term \(t_k\). The total number of documents in the corpus is denoted by m.

    Using the traditional term-weighing techniques, many experts tried to propose their improvised version. Some of them are TF-CHI (Sebastiani and Debole 2003), TF-RF (Lan et al. 2008), TF-Prob (Liu et al. 2009), TF-IDF-ICSD (Ren and Sohrab 2013), and TF-IGM (Chen et al. 2016).
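The TF-IDF weight of Eq. 1 can be computed directly, as in the following sketch; the three-document toy corpus is illustrative, and the logarithm base (natural log here) is a convention that Eq. 1 does not fix.

```python
# Direct implementation of the TF-IDF weight from Eq. 1 on a tiny illustrative corpus.
import math
from collections import Counter

corpus = [
    ["camera", "outstanding", "resolution", "camera"],
    ["battery", "life", "long", "camera"],
    ["processor", "expensive", "battery"],
]
m = len(corpus)                                            # total number of documents
df = Counter(term for doc in corpus for term in set(doc))  # document frequency of each term

def tf_idf(term: str, doc: list[str]) -> float:
    tf = doc.count(term)                                   # term frequency in this document
    return tf * math.log(m / df[term])

print(tf_idf("camera", corpus[0]))       # frequent in doc 0, but present in 2 of 3 documents
print(tf_idf("resolution", corpus[0]))   # rarer across the corpus, so higher IDF weight
```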

3.3.2 Graph based model

A graph G is constructed in GBM, with each node or vertex \(V_i\) representing a document term or feature \(t_i\) and the edges \(E_{i, j}\) representing the relationships between them (Beliga et al. 2015). Nasar et al. (2019) showed that various properties of a graph, like centrality measures, node co-occurrence, and others, play a significant role in keyword ranking. Semantic, syntactic, co-occurrence, and similarity relationships are some of the specific perspectives of graph-based text analysis. In GBM techniques, centrality measures tend to be the most significant deciding factor (Malliaros and Skianis 2015): the importance of a term is determined by calculating the centrality of its node in the graph. Beliga (2014) surveyed nineteen different measures used for extraction purposes. Degree centrality, closeness centrality, betweenness centrality, selectivity centrality, eigenvector centrality, PageRank, TextRank, strength centrality, neighborhood size centrality, coreness centrality, the clustering coefficient, and other centrality measures have been proposed so far. Some of the popular centrality measures are discussed below.

  a.

    Degree centrality

    Degree centrality measures how often a term occurs with any other term. For a particular node, the total count of edges incident on it is used to compute the metric (Beliga 2014); the more edges are incident on the node, the more significant it is in the graph. A node \(V_i\)'s degree centrality is measured using Eq. 2.

    $$\begin{aligned} D_C (V_i) = \frac{\mid n(V_i) \mid }{\mid N \mid -1} \end{aligned}$$
    (2)

    where, \(D_C (V_i)\) represents node \(V_i\)'s degree centrality, \(\mid N \mid\) indicates the total count of nodes, and \(\mid n(V_i) \mid\) represents the number of nodes linked to node \(V_i\).

  b.

    Closeness centrality

    Closeness centrality determines the closeness of a term to all other terms of the dataset. This metric is based on the average shortest distance from a given node to every other node in the graph. It is defined by Eq. 3 (Tamilselvam et al. 2017) as the number of other nodes divided by the sum of the distances from the node to all other nodes, i.e., the inverse of farness.

    $$\begin{aligned} C_C (V_i) = \frac{\mid N \mid -1}{\sum \limits _{V_{j} \in G} dist(V_i,V_j) } \approx \frac{N }{\sum \limits _{V_{j} \in G} dist(V_i,V_j) }, \quad \text {if } N \gg 1 \end{aligned}$$
    (3)

    where, \(C_C (V_i)\) represents node \(V_i\)’s closeness centrality, \(\mid N \mid\) represents graph’s node count, and \(dist(V_i,V_j)\) represents the shortest distance from node \(V_i\) to node \(V_j\).

  c.

    Betweenness centrality

    This metric measures how often a term lies between other terms. It indicates how many times a node serves as a bridge on the shortest path between two other nodes. For a node \(V_i\), it is calculated using Eq. 4 (Tamilselvam et al. 2017).

    $$\begin{aligned} B_C (V_i) = \sum \limits _{V_x \ne V_i \ne V_y \in G } \frac{\sigma _{{V_x}{V_y}}(V_i)}{\sigma _{{V_x}{V_y}}} \end{aligned}$$
    (4)

    In Eq. 4, \(B_C (V_i)\) represents \(V_i\)'s betweenness centrality, \(\sigma _{{V_x}{V_y}}\) represents the total number of shortest paths from node \(V_x\) to \(V_y\), and \(\sigma _{{V_x}{V_y}}(V_i)\) represents the number of those shortest paths that pass through \(V_i\).

  d.

    Selectivity centrality

    Selectivity Centrality (\(S_C (V_i)\)) (Beliga et al. 2015) is the average weight of the edges incident on a node. As shown in Eq. 5, \(S_C (V_i)\) is the ratio of the strength of node \(V_i\), \(s(V_i)\), to its degree \(d(V_i)\).

    $$\begin{aligned} S_C (V_i) = \frac{s(V_i)}{d(V_i)} \end{aligned}$$
    (5)

    As shown in Eq. 6, the strength of node \(V_i\), \(s(V_i)\), is the sum of the weights of all edges incident on \(V_i\).

    $$\begin{aligned} s(V_i) = \sum \limits _{V_j} EW_{{V_i}{V_j}} \end{aligned}$$
    (6)
  e.

    Eigenvector centrality

    This centrality measure determines the global importance of a term. It is calculated for a node using the centralities of the node's neighbors, via the adjacency matrix and a matrix calculation that determines the principal eigenvector (Golbeck 2013). Assume that A is an \(n \times n\) adjacency matrix, with \(A = (\alpha _{{V_i}{V_j}})\), where \(\alpha _{{V_i}{V_j}} = 1\) if \(V_i\) is connected to \(V_j\) and \(\alpha _{{V_i}{V_j}}= 0\) otherwise. The i-th entry in the normalized eigenvector belonging to the largest eigenvalue of A then gives the eigenvector centrality \(EV_C (V_i)\) of node \(V_i\). Equation 7 (Bonacich 2007) shows the formula for eigenvector centrality.

    $$\begin{aligned} \lambda EV_C (V_i) = \sum \limits _{j=1}^N \alpha _{{V_i}{V_j}} EV_C (V_j) \end{aligned}$$
    (7)

    where, \(\lambda\) is the largest eigenvalue of A.

Castillo et al. (2015) suggested a supervised model that uses the degree and closeness centrality measures of a co-occurrence graph to determine the words belonging to each sentiment while representing the existing relationships among document terms. Nagarajan et al. (2016) also suggested an algorithm for keyword extraction based on the degree and closeness centrality metrics. For obtaining an optimal set of ranked keywords, Vega-Oliveros et al. (2019) used nine popular graph centralities for keyword determination and introduced a new multi-centrality metric; they found that all of the centrality measures are strongly related and that degree centrality is the quickest and most efficient measure to compute. While experimenting with various centrality measures, Lahiri et al. (2014) also noticed that degree centrality makes keyword and keyphrase extraction much simpler. Abilhoa and De Castro (2014) suggested a keyword extraction model based on graph representation and the eccentricity and closeness centrality measures, with degree centrality used as a tiebreaker. In several real-world models, disconnected graphs are common, and using eccentricity and closeness centralities to achieve the expected result often fails. Yadav et al. (2014) recommended extracting keywords using degree, eccentricity, closeness, and other centralities of the graph while emphasizing the semantics of the terms. With the use of Part of Speech (PoS) tagging, Bronselaer and Pasi (2013) presented a method to represent textual documents in a graph-based representation. Using various centralities, Beliga et al. (2015) proposed a node selectivity-driven keyword extraction approach. Kwon et al. (2015) suggested yet another ground-breaking graph-based keyword weighting and extraction method. To improve the traditional TextRank algorithm, Wang et al. (2018) used document frequency and Average Term Frequency (ATF) to calculate the node weight for extracting keywords belonging to a particular domain. Bellaachia and Al-Dhelaan (2012) introduced the Node and Edge rank (NE-rank) algorithm for keyword extraction, which basically combines node weight (TF-IDF in this case) with TextRank. Khan et al. (2016) suggested Term-ranker, a graph-based re-ranking approach for the extraction of single words and multi-words using a statistical method; they identified classes of semantically related words while estimating term similarity using term embeddings, and used graph refinement and centrality measures to extract the top-ranked terms. For directed graphs, Ravinuthala et al. (2016) weighted the edges based on themes and examined their framework on keywords produced both automatically and manually. Using the PageRank algorithm, Devika and Subramaniyaswamy (2021) extracted keywords based on the graph's semantics and centralities. The above studies show that centrality measures are a catalyst for effective sentiment analysis, because a powerful keyword's effect on, or position in, determining the sentiment score is often greater than that of a weaker keyword. For the extraction of sentiment sentences, Shimada et al. (2009) suggested the use of a hierarchical directed acyclic graph and similarity estimation. For sentences' sentiment representation, Wu et al. (2011) developed an integer linear programming-based structural learning system using graphs.

Using graphs, Duari and Bhatnagar (2019) also suggested keyword score determination and extraction procedures based on the sentences' co-occurrence with a window size of 2, position-dependent weights, contextual hierarchy, and connections based on semantics. In comparison to other existing models, their model has an excessively high dimensionality, with terms in the text interpreted as nodes and edges representing node relationships in the graph. A variety of unsupervised graph-driven automated keyword extraction approaches was investigated by Mothe et al. (2018) using node ranking and varying word embedding and co-occurrence hybridization. Litvak et al. (2011) suggested DegExt, an unsupervised cross-lingual keyphrase extractor that makes use of a syntactic graph representation of text, in which the edges represent the order relationship between the terms represented by nodes. However, without a restriction on the maximum number of possible nodes, their algorithm generates exponentially larger graphs for larger datasets. As a result, dimensionality is one of the consequences of a graph-based keyword extraction procedure that must be regulated by appropriate means for sentiment analysis to be efficient. Chen et al. (2019) suggested extracting keywords using an unsupervised approach that relies solely on the article itself as the corpus; words are ranked in their model based on their occurrence in strong motifs. Bougouin et al. (2013) assessed the relevance of a document's topics in order to suggest TopicRank, an unsupervised approach for extracting keyphrases; however, their model does not have an optimal key selection approach. To retrieve topic-wise essential keywords, Zhao et al. (2011) suggested a three-stage algorithm in which edge weighting based on the co-occurrence frequency of two words is used to rate the keywords (i.e., nodes), followed by the generation and ranking of candidate keyphrases. Shi et al. (2017) suggested an automated single-document keyphrase extraction technique based on co-occurrence-based knowledge graphs, which learns hidden semantic associations between documents using Personalized PageRank (PPR). Thus, many experts have used co-occurrence graphs, as well as other graph properties such as centrality metrics, to demonstrate the effectiveness of these methods for keyword ranking in sentiment analysis.
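As an illustration of how such graph-based measures can be computed in practice, the sketch below builds a word co-occurrence graph with a window size of 2 and evaluates the centralities of Eqs. 2-4 with networkx; the token sequence is illustrative, and networkx applies its own normalization to betweenness centrality.

```python
# Sketch of a co-occurrence graph (window size 2) and the centrality measures of Eqs. 2-4.
import networkx as nx

tokens = ["camera", "outstanding", "resolution", "battery", "life", "long",
          "battery", "heating", "problem"]

G = nx.Graph()
for a, b in zip(tokens, tokens[1:]):        # connect words co-occurring in a window of 2
    if a != b:
        G.add_edge(a, b)

degree = nx.degree_centrality(G)            # Eq. 2
closeness = nx.closeness_centrality(G)      # Eq. 3
betweenness = nx.betweenness_centrality(G)  # Eq. 4 (with networkx normalization)

# Rank candidate keywords by degree centrality, the measure reported above as fastest to compute.
for term, score in sorted(degree.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(term, round(score, 3), round(closeness[term], 3), round(betweenness[term], 3))
```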

3.4 Sentiment prediction and classification techniques

Different techniques have emerged till now for serving sentiment prediction and classification purposes. Several researchers group the techniques based on the applicability of the techniques, challenges, or simply the general topics of sentiment analysis. According to Cambria (2016), affective computing can be performed either by using knowledge-based techniques, statistical methods, or hybrid approaches. Knowledge-based techniques categorize text into affect categories with the use of popular sources of affect words or multi-word expressions, based on the presence of affect words such as ‘happy’, ‘sad’, ‘angry’ etc. Statistical methods make use of affectively annotated training corpus and determine the valence of affect keywords through word co-occurrence frequencies, the valence of other arbitrary keywords, etc. Hybrid approaches such as Sentic Computing (Cambria and Hussain 2015) make use of knowledge-driven linguistic patterns and statistical methods to infer polarity from the text.

Medhat et al. (2014) presented different classification techniques of sentiment analysis in a very refined and illustrative manner. Inspired by their paper, the current sentiment prediction and classification techniques are depicted in Fig. 2. The techniques are examined thoroughly below, to assist in choosing the best sentiment analysis classification or prediction method for a particular task.

3.4.1 Machine learning approach

The machine learning approach to sentiment classification uses well-known machine learning classifiers or algorithms along with linguistic features to classify the given set of data into appropriate sentiment classes (Cambria and Hussain 2015). Given a set of data, machine learning algorithms focus on building models that can learn from representative data (Patil et al. 2016). The extraction and selection of the best set of features for detecting sentiment are crucial to the models' performance (Serrano-Guerrero et al. 2015). There are basically two types of machine learning techniques, namely supervised and unsupervised learning. However, some researchers also use a hybrid approach by combining both of these techniques.

  a.

    Supervised learning

    The supervised machine learning approach is based on the usage of the initial set of labeled documents/opinions, to determine the associated sentiment or opinion of any test set or new document. Among the different supervised learning techniques Support Vector Machine (SVM), Naive Bayes, Maximum Entropy, Artificial Neural Network (ANN), Random Forest, and Gradient Boosting are some of the most popular techniques which are employed in the sentiment analysis process. A brief introduction to each of these techniques is presented below; followed by a discussion on some of the research works using these algorithms either individually, in combination, or in comparison to each other.

    1. i.

      Support vector machine SVM is a kernel-based classifier that has gained popularity in different regression and classification problems. Many researchers have established that the Gaussian (or RBF) kernel function performs better for sentiment analysis (Kim et al. 2005; Li et al. 2015b). However, when a large number of features is encountered, the RBF kernel does not provide suitable results. For very large datasets, the linear kernel function proves to be the best for text classification among the different kernel functions used in the SVM classifier (Mullen and Collier 2004). The linear kernel function is represented as follows:

      $$\begin{aligned} K(x_i,x_j) = x_{i}^{T} x_j \end{aligned}$$
      (8)

      where, \(x_i\) and \(x_j\) are the input space vectors and \(x_{i}^{T}\) is the transpose of \(x_i\).

      SVM classifier is basically designed for binary classification. However, if the model is extended to support multi-class classification, the One-vs-Rest (OvR, one against all) or One-vs-One (OvO, one against one) strategy is applied to the SVM classifier (Hsu and Lin 2002). In OvR, the multi-class dataset is re-designed into multiple binary datasets, where data belonging to one class is considered positive while the rest are considered negative. The classifier is then trained on these binary datasets, and the final class assignment is made by choosing the class that classifies the test data with the greatest margin. Alternatively, the One-vs-One (OvO) strategy can be used, in which the class selected by the majority of the classifiers is chosen. OvO involves splitting the original dataset into datasets representing one class versus every other class, one pair at a time.

      Ahmad et al. (2018) presented a systematic review of sentiment analysis using SVM. Based on the papers published during the span of 5 years, i.e., from 2012 to 2017, they found that a lot of research works are published either using SVM directly for analysis or in a hybrid manner or even for comparing their proposed model with SVM. Some of the recent studies that used SVM for sentiment analysis are listed in Table 1.
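
      As an illustration only, the following minimal sketch (using scikit-learn and a hypothetical toy corpus, not data from the surveyed papers) applies a linear-kernel SVM to multi-class sentiment labels with an explicit One-vs-Rest wrapper; scikit-learn also provides a OneVsOneClassifier for the OvO strategy.

```python
# Minimal sketch: linear-kernel SVM with an explicit One-vs-Rest strategy.
# The documents and labels below are hypothetical toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

docs = ["great phone, love it", "terrible battery, very poor", "it is okay, nothing special"]
labels = ["positive", "negative", "neutral"]

X = TfidfVectorizer().fit_transform(docs)        # sparse TF-IDF features
clf = OneVsRestClassifier(SVC(kernel="linear"))  # one binary SVM per class
clf.fit(X, labels)
print(clf.predict(X[:1]))                        # e.g. ['positive']
```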

    2. ii.

      Naïve Bayes The probabilistic classifier Naive Bayes classifies data based on the naive presumption that features are independent of one another. It is one of the simplest algorithms with low computational cost and relatively high accuracy. NB classifier uses the Bayesian method as shown in Eq. 9 to classify a document.

      $$\begin{aligned} P(class_i \mid doc) = \frac{P(class_i)P(doc \mid class_i)}{P(doc)} \end{aligned}$$
      (9)

      Given a document, \(P(class_i \mid doc)\) is the probability that the document belongs to a particular \(class_i\), \(P(doc \mid class_i)\) is the probability of observing the document given \(class_i\), and \(P(class_i)\) and P(doc) are the probabilities of \(class_i\) and of the document, respectively, in the training set. For any other level of analysis, such as sentence- or word-level sentiment analysis, doc in Eq. 9 is simply replaced with an instance of the corresponding level.

      There are basically two models which are commonly used for text analysis i.e., Multivariate Bernoulli Naive Bayes (MBNB) and Multinomial Naive Bayes (MNB) (Altheneyan and Menai 2014).

      However, for continuous data, Gaussian Naive Bayes is also used. MBNB is used for classification when multiple keywords (features) represent a dataset. In MBNB, the document-term matrix is built using BoW, where the keywords for a document are represented by 1 or 0 based on their occurrence or non-occurrence in the document.

      Whenever the count of occurrence is considered, MNB is used. In MNB, the distribution is associated with vector parameters \(\theta _c = (\theta _{c1},\theta _{c2},\ldots ,\theta _{cn})\) for class c, where n is the number of keywords and \(\theta _{ck}\) is the probability \(P(V_k \mid Class_c)\) of keyword \(V_k\) appearing in a dataset belonging to class c. For estimating \(\theta _c\), a smoothed variant of maximum likelihood, namely relative frequency counting, is employed as shown below.

      $$\begin{aligned} \widehat{\theta _{ck}} = \frac{N_{ck}+\alpha }{N_c + \alpha n} \end{aligned}$$
      (10)

      where, \(\alpha\) is the smoothing factor, \(N_{ck}\) is the number of times keyword k appears in the training samples of class c, \(N_{c}\) is the total count of all keywords in class c, and n is the number of keywords in the vocabulary.

      To conduct a thorough investigation of the sentiment of micro-blog data, Le and Nguyen (2015) developed a sentiment analysis model using Naive Bayes and SVM, as well as information gain, unigram, bigram, and object-oriented feature extraction methods. Wawre and Deshmukh (2016) presented a system for sentiment classification that included comparisons of the common machine learning approaches Naive Bayes and SVM. Bhargav et al. (2019) used the Naive Bayes algorithm and NLP to analyze customer sentiments in various hotels.
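
      To make the counting-based formulation concrete, the sketch below (a toy example with hypothetical reviews, not data from the cited studies) trains a Multinomial Naive Bayes classifier on a document-term count matrix; the alpha argument plays the role of the smoothing factor \(\alpha\) in Eq. 10.

```python
# Minimal sketch: Multinomial Naive Bayes over raw term counts (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good acting and a good story", "boring plot and bad acting", "a good movie overall"]
labels = ["pos", "neg", "pos"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                    # document-term count matrix (BoW)
clf = MultinomialNB(alpha=1.0).fit(X, labels)  # alpha is the Laplace smoothing factor

print(clf.predict(vec.transform(["bad story"])))  # e.g. ['neg']
```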

    3. iii.

      Maximum entropy Maximum entropy classifier is a conditional probability model. Unlike the Naive Bayes classifier, it does not consider any prior assumptions such as the independence of keywords for the given set of data. Rather than using probabilities to set the parameters of the model, the maximum entropy classifier applies search techniques to determine the parameter set which maximizes the classifier’s performance. After the determination of the document-term matrix, the training set is summarised in terms of its empirical probability distribution as shown in Eq. 11.

      $$\begin{aligned} \widetilde{P} (doc_i, c) = \frac{1}{N} \, n(doc_i, c) \end{aligned}$$
      (11)

      where, N is the count of documents in the training set, \(n(doc_i, c)\) is the co-occurrence count of document \(doc_i\) and class c, and \(doc_i\) comprises the contextual information of the document, i.e., the sparse feature array.

      Using the empirical probability distribution, the maximum entropy approach models a given dataset by finding the distribution of highest entropy that satisfies the constraints imposed by the prior knowledge. The unique distribution with maximum entropy has the exponential form shown in Eq. 12.

      $$\begin{aligned} P^{*} (c \mid doc_i) = \frac{\exp \left( \sum _j \lambda _j f_j (doc_i, c)\right) }{\sum _{c'} \exp \left( \sum _j \lambda _j f_j (doc_i, c')\right) } \end{aligned}$$
      (12)

      Here, \(f_j (doc_i, c)\) is a feature function (e.g., an indicator of a keyword occurring in document \(doc_i\) under class c) and \(\lambda _j\) is a parameter to be estimated. The denominator of Eq. 12 is a normalizing factor that ensures a proper probability distribution.

      The flexibility offered by the maximum entropy classifier helps to augment syntactic, semantic, and pragmatic features with the stochastic rule systems. However, the computational resources and the amount of annotated training data required for estimating the parameters of even the simplest maximum entropy model are very high. Thus, for large datasets, the model is not only expensive but is also sensitive to round-off errors because of the sparsely distributed features. For the estimation of parameters, different methods such as gradient ascent, conjugate gradient, variable metric methods, Generalized Iterative Scaling, and Improved Iterative Scaling are available (Hemalatha et al. 2013). Yan and Huang (2015) used the maximum entropy classifier to perform sentiment analysis of Tibetan sentences, based on the probability difference between positive and negative outcomes. To identify the sentiment expressed by multilingual text, Boiy and Moens (2009) combined SVM, MNB, and maximum entropy classifiers on blogs, reviews, and forum texts represented using unigram feature vectors.
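
      Since the maximum entropy classifier is equivalent to (multinomial) logistic regression, a minimal sketch of such a model, with hypothetical toy data and scikit-learn's quasi-Newton (lbfgs) solver standing in for the parameter estimation methods listed above, might look as follows.

```python
# Minimal sketch: maximum entropy (logistic regression) sentiment classifier on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["what a wonderful day", "this is awful", "wonderful, simply wonderful", "awful service"]
labels = [1, 0, 1, 0]                 # 1 = positive, 0 = negative (hypothetical labels)

X = TfidfVectorizer().fit_transform(docs)
# lbfgs is a quasi-Newton (variable metric) method for estimating the lambda weights
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, labels)
print(clf.predict_proba(X[:1]))       # class probabilities of the exponential form in Eq. 12
```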

    4. iv.

      Artificial Neural Network ANN is a machine learning classifier designed based on the biological brain. In an ANN, a set of fundamental processing units, known as neurons, are connected and organized according to specific tasks. The network topology, weights between the neurons, activation function, bias, momentum, etc. together form the basis of learning in an ANN. Among the different types of ANNs, the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Recursive Neural Network (RecNN) are most commonly used for sentiment analysis. A convolutional layer is used by a CNN to extract information from a larger piece of text. RNNs are particularly well suited to handling sequential data, such as text; in sentiment analysis, they can be used to anticipate the sentiment as each token in a piece of text is processed. CNNs excel at extracting local and position-invariant features, whereas RNNs excel at classification based on long-range semantic dependencies rather than local key phrases. Compared to the shallow ANNs that were initially introduced, deep ANNs, generally known as deep learning models, have emerged as a powerful technique in the context of sentiment analysis.

      Deep learning (DL): Deep learning is essentially an ANN with three or more layers that has the capability to handle large datasets and their associated complexities, such as non-linearity and intricate patterns. It automatically transforms and extracts features across multiple hidden layers, facilitating self-learning in a manner loosely analogous to humans. These advantages enhance the performance of sentiment analysis models and have driven the popularity of deep learning for this task since around 2015. The input features of many deep learning models are generally preferred to be word embeddings. Word embeddings can be learned from text data by using an embedding layer, Word2Vec, or GloVe vectors. Word2Vec can be learned either by the Continuous Bag of Words (CBOW) or the Continuous Skip-Gram model. Some of the common deep learning algorithms include CNNs, RecNN, RNN, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Deep Belief Networks (DBN). The detailed study by Yadav and Vishwakarma (2020) on sentiment analysis using DL found that LSTM performs better than other popular DL algorithms.

      Tembhurne and Diwan (2021) provided valuable insight into the usage of several architectural versions of sequential deep neural networks, such as RNN, for sentiment analysis of inputs in any form, including textual, visual, and multimodal inputs. Tang et al. (2015) introduced several deep NNs with the use of sentiment-specific word embeddings for performing word-level, sentence-level, and lexical-level sentiment analysis. To encode the sentiment polarity of sentences, the authors introduced different NNs including a prediction model and a ranking model. They discovered discriminative features from different domains using sentiment embeddings to perform sentiment classification of reviews. According to the authors, the SEHyRank model shows the best performance among all the other proposed models. To fit CNN in aspect-based sentiment analysis, Wang et al. (2021) proposed an aspect mask to keep the important sentiment words and reduce the noisy ones. Their work made use of the position of aspects to perform aspect-based sentiment analysis in a unified framework. Hidayatullah et al. (2021) performed sentiment analysis using tweets on the Indonesian President Election 2019 using various deep neural network algorithms. According to the authors, Bidirectional LSTM (Bi-LSTM) showed better results as compared to CNN, LSTM, CNN-LSTM, GRU-LSTM, and other machine learning algorithms namely SVM, Logistic Regression (LR), and MNB. Soubraylu and Rajalakshmi (2021) proposed a hybrid convolutional bidirectional recurrent neural network, where the rich set of phrase-level features are extracted by the CNN layer and the chronological features are extracted by Bidirectional Gated Recurrent Unit (BGRU) through long-term dependency in a multi-layered sentence. Priyadarshini and Cotton (2021) suggested a sentiment analysis model using LSTM-CNN for a fully connected deep neural network and a grid search strategy for hyperparameter tuning optimization.

      The Emotional Recurrent Unit (ERU) is an RNN, which contains a Generalized Neural Tensor Block (GNTB) and a Two-Channel Feature Extractor (TFE) designed to tackle conversational sentiment analysis. Generally, using ERU for sentiment analysis involves obtaining the context representation, incorporating the influence of the context information into an utterance, and extracting emotional features for classification. Li et al. (2022) employed ERU in a bidirectional manner to propose a Bidirectional Emotional Recurrent Unit (BiERU) to perform sentiment classification or regression. BiERU follows a two-step task instead of the three steps mentioned for simple ERUs. According to the source of context information, the authors presented two types of BiERUs, namely BiERU with global context (BiERU-gc) and BiERU with local context (BiERU-lc). Compared with c-LSTM (Poria et al. 2017), CMN (Hazarika et al. 2018), DialogueRNN (Majumder et al. 2019), DialogueGCN (Ghosal et al. 2019), and AGHMN (Jiao et al. 2020), BiERU showed better performance in most of the cases.
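
      For concreteness, the following sketch (assuming TensorFlow/Keras, hypothetical vocabulary and embedding sizes, and already tokenized, padded integer sequences) outlines a Bi-LSTM binary sentiment classifier of the kind the studies above report to perform well; it is a generic illustration, not a re-implementation of any cited model.

```python
# Minimal sketch: Bi-LSTM sentiment classifier over learned word embeddings.
# VOCAB_SIZE and EMBED_DIM are assumed hyperparameters; real padded integer
# sequences and labels would be required before calling model.fit(...).
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM = 20000, 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),        # learned word embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # forward + backward context
    tf.keras.layers.Dense(1, activation="sigmoid"),           # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=..., batch_size=...) would follow.
```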

    5. v.

      Random forest Random Forest, as its name suggests, comprises a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s prediction. The main idea behind random forest is a simple but powerful one, the wisdom of crowds: a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

      The low correlation between models is the key. Just as investments with low correlations combine to form a portfolio that is greater than the sum of its parts, uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this effect is that the trees protect each other from their individual errors. While a few trees may be wrong, many other trees will be right, so as a group the trees can move in the right direction. The requirements for a random forest to perform well are therefore:

      There should be some real signal in the features, so that models built using those features perform better than random guessing.

      The predictions made by the individual trees need to have low correlations with one another. Just as a forest comprises trees, and more trees imply a more robust forest, the random forest algorithm builds decision trees on data samples, obtains a prediction from each of them, and finally selects the best solution by voting. It is an ensemble method that is superior to a single decision tree because it reduces over-fitting by averaging the results.

      Baid et al. (2017) analyzed movie reviews using various techniques such as Naïve Bayes, K-Nearest Neighbour, and Random Forest, and showed that Naïve Bayes performed better than the other algorithms. While performing sentiment analysis of real-time 2019 election Twitter data, Hitesh et al. (2019) demonstrated that Word2Vec with Random Forest improves the accuracy of sentiment analysis significantly compared to traditional feature representations such as BoW and TF-IDF, because Word2Vec improves the quality of features by considering the contextual semantics of words.
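
      The sketch below (assuming the gensim library and a hypothetical tokenized corpus, not the data of the cited works) illustrates a Word2Vec-plus-Random-Forest pipeline similar in spirit to the setup reported by Hitesh et al. (2019): each document is represented by the average of its word vectors and classified with a Random Forest.

```python
# Minimal sketch: averaged Word2Vec document vectors fed to a Random Forest (toy data).
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

tokenized = [["love", "this", "party"], ["hate", "the", "candidate"], ["love", "it"]]
labels = [1, 0, 1]                                   # hypothetical polarity labels

w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1)
# average the word vectors of each document to obtain a fixed-length feature vector
X = np.array([np.mean([w2v.wv[t] for t in doc], axis=0) for doc in tokenized])

clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(X[:1]))
```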

    6. vi.

      Gradient boosting Boosting is one of the well-known ensemble techniques. Boosting builds multiple models sequentially: each sample is initially assigned an equal weight, and subsequent models target the samples misclassified by earlier ones. Gradient Boosting Machine (GBM) is an ensemble technique that applies a decision tree as the base classifier. GBM constructs one tree at a time, where each new tree helps to rectify errors made by the previously trained trees. In a random forest classifier, the trees are built independently of previously constructed trees, whereas GBM relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. GBM can learn with different loss functions, providing the ability to work efficiently with high-dimensional data. Moreover, sentiment analysis may encounter class imbalance issues, since the distribution of sentiments in real-world applications is often skewed. While handling the class imbalance problem using the Synthetic Minority Oversampling TEchnique (SMOTE) algorithm and Tomek links, Athanasiou and Maragoudakis (2017) demonstrated that GBM outperforms decision trees, SVM, Naïve Bayes, and ANN. While using emoji and slang dictionaries, Prasad et al. (2017) found that GBM handles sarcasm in tweets much better than Random Forest, Decision Tree, Adaptive Boost, Logistic Regression, and the Gaussian Naïve Bayes algorithm.
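
      A minimal sketch of such an imbalance-aware setup, using synthetic data rather than any of the surveyed corpora and the imbalanced-learn implementation of SMOTE, could look like this.

```python
# Minimal sketch: SMOTE oversampling followed by Gradient Boosting on synthetic,
# imbalanced data (in the spirit of the SMOTE + GBM setup discussed above).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import SMOTE   # from the imbalanced-learn package

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # rebalance the classes

clf = GradientBoostingClassifier().fit(X_res, y_res)
print(clf.score(X, y))                     # accuracy on the original data
```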

      Jain and Dandannavar (2016) suggested a system for sentiment analysis of tweets based on an NLP-based technique and machine learning algorithms such as MNB and decision tree, which use features extracted based on various parameters. For sentiment analysis of online movie reviews, Sharma and Dey (2012) have developed a noteworthy comparison of seven current machine learning techniques in conjunction with various feature selection approaches. Tan and Zhang (2008) also introduced a similar work, in which sentiment analysis of various fields, such as education, movies, and houses, is carried out using various feature selection methods along with machine learning techniques. Depending on the applicability and need for better-quality models for sentiment analysis, experts in the field use a variety of cascaded and ensemble approaches to combine machine learning algorithms with other existing options (Ji et al. 2015; Tripathy et al. 2015; Xia et al. 2011; Ye et al. 2009).

  2. b.

    Unsupervised learning

    In unsupervised learning, the models are trained using unlabeled datasets. This technique in most cases relies on clustering methods such as k-means clustering, expectation-maximization, and COBWEB. Darena et al. (2012) used k-means clustering, through the Cluto 2.1.2 toolkit, to determine the sentiment associated with customer reviews.
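
    As a minimal illustration (with hypothetical reviews, not the dataset used by Darena et al.), TF-IDF vectors can be clustered with k-means and the resulting clusters inspected and labeled afterwards.

```python
# Minimal sketch: k-means clustering of TF-IDF vectors into two groups,
# which an analyst could then inspect and label as broadly positive/negative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = ["excellent service, very happy", "awful experience, never again",
           "really happy with the product", "terrible support, awful"]
X = TfidfVectorizer().fit_transform(reviews)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster assignment of each review
```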

  3. c.

    Self-supervised learning

    In self-supervised learning, the model begins with unlabeled datasets and trains itself to predict a part of the input by leveraging the underlying structure of the data. Although the use of unlabeled datasets gives this learning technique the appearance of being unsupervised, such models are basically designed to execute downstream tasks that are traditionally addressed by supervised learning. One of the self-supervised learning techniques that has gained a lot of popularity in recent years is the Pretrained Language Model (PLM).

    1. i.

      Pretrained language model (PLM) PLMs are models that have been built and trained previously, specifically on a large language dataset. In sentiment analysis, transfer learning from a pretrained model, which brings knowledge from the open domain into a downstream task, is extremely beneficial when limited language resources are available. The learning process of a PLM involves tasks such as missing or masked word prediction, next sentence prediction, sentence-order prediction, corrupted text reconstruction, and autoregressive language modeling, which aid the sentiment prediction process.

      Typical steps in creating a sentiment analysis model from scratch usually involve making use of standard sentiment lexicons, sentiment scoring and data labeling by human experts, and proper parameter tuning so that the model performs well on the rest of the dataset. This procedure can be expensive and time-consuming. A PLM makes it simpler for developers of sentiment analysis models to implement a model with less training time and improved accuracy, by providing extensive semantic and syntactic information through a few lines of code. A PLM acts as a reusable NLP model for various tasks associated with sentiment analysis, such as PoS tagging, lemmatization, dependency parsing, and tokenization. Thus, PLMs can prove advantageous for solving similar new tasks using prior experience, without training the sentiment analysis model from scratch.

      Chan et al. (2022) provided a detailed study on the evolution and advancement of sentiment analysis using pretrained models. Additionally, the authors covered the various sentiment analysis tasks for which pretrained models can be used. Early works on PLMs involved transferring a single pretrained embedding layer to the task-oriented network architecture. To cope with numerous challenges such as word sense, polysemy, grammatical structure, semantics, and anaphora, models are presently being improved toward higher levels of representation.

      Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018), Universal Language Model Fine-tuning (ULMFit) (Howard and Ruder 2018), and Embeddings from Language Models (ELMo) (Sarzynska-Wawer et al. 2021) are some of the well-known PLMs, while NLTK (Loper and Bird 2002) and Stanford NLP (Manning et al. 2014) are open-source NLP libraries commonly used alongside them for sentiment analysis. The performance of BERT was determined to be superior by Mathew and Bindu (2020), who thoroughly analyzed numerous PLMs that are frequently used for sentiment analysis.
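
      For illustration, the snippet below uses the Hugging Face transformers library, whose default sentiment-analysis pipeline downloads a BERT-family model fine-tuned for binary polarity; it is a generic example of reusing a PLM with a few lines of code, not a model used by the cited studies.

```python
# Minimal sketch: reusing a pretrained language model for sentiment analysis.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default fine-tuned model
print(classifier("The new update made the app much easier to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```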

      Many pre-trained models use self-supervision strategies to learn semantic content; however, they give less importance to sentiment-specific knowledge during the pre-training phase. There is also a risk of overfitting associated with a pretrained model, which may lead to a domain-specific sentiment mismatch between the source and the target domain. When dealing with social media related content, a PLM might introduce biases into the results, since the language on which the PLM was trained might differ from the language generally used on social media platforms. Further in-depth analysis and model development may be constrained if the PLM behaves in a black-box manner. In a few cases, the PLM might not be able to handle a multi-class problem if it was originally designed for identifying single or binary classes. This might also lead to the neglect or mishandling of an important class, say the neutral class, if the PLM was initially designed for handling only positive and negative classes. Thus, while choosing a particular PLM, we must consider the domain and data it was originally designed for. Also, a human expert might be required to validate the results, whenever required, to assure the quality of the sentiment analysis model.

      Mao et al. (2022) provided an in-depth analysis of how PLMs are biased in prompt-based sentiment analysis and emotion detection. According to the authors, the number of label classes, emotional label-word selections, prompt templates and positions, and the word forms of emotion lexicons lead to biased results. To address the issue of cross-domain tasks, Zhou et al. (2020) proposed SENTIX, a sentiment-aware model that learns domain-invariant sentiment knowledge during the pre-training phase. To address several factors related to sentiment analysis, experts have so far presented a variety of improvised modifications of the original PLMs. Some of them include Dynamic Re-weighting BERT (DR-BERT) (Zhang et al. 2022), BERT-based Dilated CNN (BERT-DCNN) (Jain et al. 2022), Attention-based ELMo (A-ELMo) (Huang and Zhao 2022), Contextual Sentiment Embeddings (CoSE) (Wang et al. 2022a), Extended Universal Language Model Fine-Tuning (Ext-ULMFiT) and Fine-Tuned BERT (FiT-BERT) (Fazlourrahman et al. 2022), etc.

  4. d.

    Other learning techniques

    Many researchers combine supervised and unsupervised techniques to generate hybrid or semi-supervised approaches that can be used to classify sentiments (König and Brill 2006; Kim and Lee 2014). With new information generated every millisecond, finding a fully labeled large dataset representing all the required information is nearly impossible. In such a scenario, semi-supervised algorithms train an initial model on a few labeled samples and then iteratively apply it to the much larger pool of unlabelled data to make predictions on new data. Among various semi-supervised techniques, the Graph Convolution Network (GCN) (Kipf and Welling 2016; Keramatfar et al. 2022; Dai et al. 2022; Zhao et al. 2022; Lu et al. 2022; Yu and Zhang 2022; Ma et al. 2022) has recently gained the attention of researchers for performing sentiment analysis.

    1. i.

      Graph Convolution Network

      GCN is based on the CNN but operates directly on graphs, taking advantage of the syntactic structure and word dependency relations to correctly analyze sentiment. GCNs learn features by inspecting neighboring nodes. By using a syntactic dependency tree, a GCN model captures the relations among different words and links specific aspects to syntax-related words. Each layer of the multi-layer GCN architecture encodes and updates the representation of the graph’s nodes using features from those nodes’ closest neighbors. GCNs assist in node-level, edge-level, and graph-level prediction tasks for sentiment analysis, such as determining how connections on a social media platform affect the opinions of the users within that network, creating user recommendations based on connections between various products previously purchased, or suggesting movies. Generally, GCNs focus on learning the dependency information from contextual words to aspect words based on the dependency tree of the sentence. As a result, GCN has mainly attracted researchers in the field of aspect-based sentiment analysis.
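
      A minimal sketch of a single GCN layer (the propagation rule of Kipf and Welling 2016) in PyTorch is shown below; the adjacency matrix is a hypothetical stand-in for the dependency edges of a sentence, and real models stack several such layers and learn W by back-propagation.

```python
# Minimal sketch: one GCN layer H' = relu(D^-1/2 (A+I) D^-1/2 H W) on toy inputs.
import torch

def gcn_layer(A, H, W):
    # add self-loops and apply symmetric normalization
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt
    return torch.relu(A_norm @ H @ W)          # aggregate neighbours, transform, activate

A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # toy dependency edges
H = torch.randn(3, 8)                          # 3 words, 8-dimensional node features
W = torch.randn(8, 4)                          # weight matrix (fixed here for brevity)
print(gcn_layer(A, H, W).shape)                # torch.Size([3, 4])
```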

      Lu et al. (2021) built a GCN on the sentence dependency tree to fully utilize the syntactical and semantic information. Their methodology fixed the issues of incorrectly detecting irrelevant contextual words as clues for evaluating aspect sentiment, disregarding syntactical constraints, and long-range sentiment dependencies, which were present in earlier models. SenticGCN was proposed by Liang et al. (2022) to capture the affective dependencies of the sentences according to the specific aspects. To combine the affective knowledge between aspects and opinion words, the model performs aspect-based sentiment analysis using SenticNet along with GCN.

      Along with the local structure information of a given sentence, such as locality, sequential knowledge, or syntactical dependency constraints within the sentence, global dependency information also holds importance in determining the sentiments accurately. Zhu et al. (2021) proposed a model named Global and Local Dependency Guided Graph Convolutional Networks (GL-GCN), where word global semantic dependency relations were revealed with the use of a word-document graph representing the entire corpus. An attention mechanism was adopted by the authors to combine both local and global dependency structure signals.

      In general, the layers in GCN models are not devised distinctively for processing the aspect. To handle this issue, Chen et al. (2021) integrated GCN and co-attention networks for aspect-based sentiment analysis, to extract relevant information from contexts and remove the noise while considering colloquial texts. Tian et al. (2021) addressed the issues of the inability to learn from different layers of GCN, not considering dependency types, and lacking mechanisms for differentiating between various relations in the context of sentiment analysis. The authors utilized dependency types for aspect-based sentiment analysis with Type-aware GCN (T-GCN).

3.4.2 Lexicon-based approach

    Opinion terms are used in a lexicon-based approach to execute sentiment classification tasks. This method suggests that a sentence’s or document’s cumulative polarity is the sum of the polarities of individual terms or phrases (Devika et al. 2016). According to Zhang et al. (2014), in opinion lexicon methods, evaluated and tagged sentiment-related words are counted and weighted with the use of a lexicon to perform sentiment analysis. This approach is based on sentiment lexicons, which are a compilation of recognized and pre-compiled terms, phrases, and idioms formed for traditional communication genres, according to Kharde and Sonawane (2016). More complex systems, such as dictionaries or ontologies, may also be used for this approach (Kontopoulos et al. 2013). Some of the recent lexicons available for sentiment analysis are discussed below in Table 2.

    The lexicon-based approach is categorized into three methods: manual, dictionary-based, and corpus-based methods based on the various approaches to classification (Zhang et al. 2014). Because of the considerable time investment, researchers seldom use the manual approach, though it is often paired with the other two automated approaches.

    1. a.

      Dictionary-based approach

      Dictionary-based approach starts with a series of manually annotated opinion seed terms. The collection is then extended by searching through a dictionary such as WordNet (Miller et al. 1990) to find synonyms and antonyms. SWN (Baccianella et al. 2010) is one of the earliest thesauri and makes use of WordNet to assign positive, negative, and objective ratings to terms. The new words are added to the initial list after they have been discovered. The next iteration begins and the method continues until no new words need to be added after a particular point. While considering valence shifters (intensifiers, downtoners, negation, and irrealis markers), Read and Carroll (2009) proposed a word-level sentiment analysis model called Semantic Orientation CALculator (SO-CAL). In SO-CAL, lexicon-based sentiment classification is performed using dictionaries of sentiment-bearing terms annotated with their polarities and strengths.

      The use of a dictionary for sentiment analysis suffers from one major drawback. This methodology does not adequately handle the domain and context-sensitive orientations of opinion terms.

    2. b.

      Corpus-based approach

      The corpus-based approach uses syntactic patterns or co-occurring patterns in a vast corpus to extend the original seed list of opinion terms (Cambria and Hussain 2015). It is very difficult to generate a corpus large enough to cover each and every English word using the corpus-based approach. However, using a domain corpus has the advantage of allowing the identification of domain- and context-specific opinion terms as well as their orientations. The corpus-based approach aims to provide dictionaries that are specifically related to a particular domain (Kharde and Sonawane 2016). To expand the dictionary, statistical or semantic approaches may be used to look for similar words, as discussed below.

      1. i.

        Statistical approach

        The statistical approach searches for co-occurrence patterns of seed opinion words. If the corpus is insufficient, the problem of certain words not being available can be addressed by using the whole collection of indexed documents on the web as the corpus for building the dictionary (Turney 2002). In a broad annotated corpus, even the mere appearance of a word in positive or negative text can be used to determine its polarity (Read and Carroll 2009). Similar opinion words are likely to co-occur in a corpus (Cambria and Hussain 2015); hence, the polarity of an unfamiliar word can be estimated from its relative frequency of co-occurrence with known words, typically using PMI (Turney 2002). Another such approach is Latent Semantic Analysis (LSA) (Deerwester et al. 1990).
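
        For illustration, the PMI-based semantic orientation of a word or phrase w, as originally used by Turney (2002), can be written as the difference between its PMI with a positive and a negative reference word; a positive value of SO(w) suggests a positive orientation and a negative value a negative one.

        $$\begin{aligned} SO(w) = PMI(w, \textit{excellent}) - PMI(w, \textit{poor}) = \log _2 \frac{hits(w \;\text {NEAR}\; \textit{excellent}) \cdot hits(\textit{poor})}{hits(w \;\text {NEAR}\; \textit{poor}) \cdot hits(\textit{excellent})} \end{aligned}$$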

      2. ii.

        Semantic approach

        Semantically close words are assigned similar polarities based on this approach. This method is based on various criteria for measuring word similarity (Cambria and Hussain 2015). The relative count of positive and negative synonyms of an unknown word can be used to find out the polarity of that word using different semantic relationships given by WordNet (Kim and Hovy 2004).

        A combination of both statistical and semantic approaches is also followed by a few researchers to perform sentiment analysis. Zhang et al. (2012) applied a mixture of both these approaches to online reviews to determine the weakness of products. Sentence-based sentiment analysis, according to their model, is carried out by taking into account the effect of degree adverbs to determine the polarity of each aspect within a sentence. To find the implicit features, they used the collocation statistics-based selection method-Pointwise Mutual Information (PMI). With the use of semantic methods, feature words of the products are grouped into corresponding aspects.

        Ding et al. (2008) demonstrated that the same term can have multiple polarities in different contexts, even within the same domain. Therefore, rather than simply finding domain-dependent sentiment words using the corpus-based approach, they explored the notion of intra-sentential and inter-sentential sentiment consistency.

        In the lexicon-based approach, two points are worth noting. First, the initial manual annotation of the seed list can be a costly procedure. Second, and most importantly, the use of a dictionary even for seed list generation can lead to insufficient handling of cross-domain problems. Thus, choosing a proper technique to generate a seed list for a lexicon-based approach remains an open problem. Also, whenever linguistic rules are involved in handling knowledge, there might be situations where they fail to correctly grasp the affective sentiment.

        Hybrid approaches which use sentiment lexicons in machine learning methods have also attracted many researchers to combine the benefits of both approaches. Trinh et al. (2018) used the hybrid approach to perform sentiment analysis of Facebook comments in the Vietnamese language. While their dictionary is partly based on SO-CAL, the authors manually built the dictionary to include nouns, verbs, adjectives, and adverbs along with emotional icons. They performed sentence-level sentiment analysis of product reviews using the SVM classifier. Appel et al. (2016) also performed sentence-level sentiment analysis using a combination of lexicon and machine learning approaches. They extended their sentiment lexicon with the use of SWN and used fuzzy sets to determine the polarity of sentences. Using an SVM classifier, Zhang et al. (2011) performed entity-level sentiment analysis of tweets, with the use of a lexicon that supports business marketing or social studies. They made use of the lexicon by Ding et al. (2008) along with some frequently used opinion hashtags to build the lexicon for their model. Pitogo and Ramos (2020) performed sentiment analysis for Facebook comments using a lexicon-based approach called Valence Aware Dictionary and Sentiment Reasoner (VADER) along with a hierarchical clustering algorithm.
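
        As a minimal illustration of the lexicon- and rule-based VADER approach (using NLTK's bundled implementation and a hypothetical comment, not the data of Pitogo and Ramos), polarity scores can be obtained as follows.

```python
# Minimal sketch: VADER polarity scores for a social-media style comment.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)        # fetch the VADER lexicon once
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The event was AMAZING, but the queue was horrible :("))
# returns neg/neu/pos proportions and a compound score in [-1, 1]
```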

3.5 Sentiment or opinion summarization

Sentiment or opinion summarization, or aggregation, aims to provide an idea of the overall influence or polarity depicted by the dataset, by summing up the polarity of all individual words/aspects/sentences/documents of the dataset. Sentiment summarization must not be confused with text summarization, though they are slightly related. Text summarization aims to provide a summary of the dataset, while sentiment summarization provides the generalized polarity depicted by the whole dataset.

Different types of summarization models have been proposed by researchers to obtain an average sentiment. Pang and Lee (2004) first extracted all subjective sentences and then summarized those subjective sentences. Blair-Goldensohn et al. (2008) used a tool to choose a few representative documents from a vast number of documents and then used them for aspect-based emotion summarization. By mining opinion features from product feedback, Hu and Liu (2004) suggested an aspect-based sentiment summarization strategy for online consumer reviews. Using the ratings on different aspects, Titov and McDonald (2008) proposed a model which can contribute to the sentiment summarization process; their algorithm is designed to find related topics in text and collect textual evidence from reviews to support aspect ratings. Bahrainian and Dengel (2013) developed an emotion summarization model to summarise the opinionated text on consumer goods by integrating different polarity detection techniques and automated aspect detection algorithms.

Fig. 2 Sentiment classification techniques

Table 1 Recent literature on sentiment analysis using SVM
Table 2 Lexicons for sentiment analysis

4 Performance analysis measures

The evaluation of performance is one of the principal concerns associated with building a resourceful model. Once the sentiments are classified as either positive or negative, the performance of the model needs to be evaluated. The work of Sokolova and Lapalme (2009) provides a better understanding of the applicability of performance measures depending on the variability of classification tasks. Among the different kinds of metrics available for measuring the performance of a textual sentiment analysis model, metrics based on the confusion matrix are widely used (Sokolova and Lapalme 2007, 2009; John and Kartheeban 2019). The confusion matrix shows the classifications that are expected alongside those that are computed by the classifier. A confusion matrix for binary classification problems consists of four separate entries, namely True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), as shown in Table 3.

Table 3 Confusion matrix for binary classification

Accuracy, which measures the overall effectiveness of the model, is the most frequently used performance metric. Accuracy determines the proportion of the total number of instances (i.e., documents/sentences/words) that are correctly predicted by the sentiment analysis model. Equation 13 shows the formula for estimating the model’s accuracy.

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(13)

Apart from accuracy, precision and recall are well-known metrics that are best suited for text applications (Sokolova and Lapalme 2007). Precision, or positive predictive value, measures the proportion of instances predicted as positive that are truly positive, while the negative predictive value measures the corresponding proportion for instances predicted as negative. Recall, or sensitivity, measures the proportion of actual positive instances that are correctly classified; specificity, or negative recall, measures the proportion of actual negative instances that are correctly classified.

The following are the formulas for calculating them (Salari et al. 2014).

$$\begin{aligned}{} & {} Precision/Positive\, predictive\, value = \frac{TP}{TP+FP} \end{aligned}$$
(14)
$$\begin{aligned}{} & {} Negative\, predictive\, value = \frac{TN}{TN+FN} \end{aligned}$$
(15)
$$\begin{aligned}{} & {} Recall/Sensitivity = \frac{TP}{TP+FN} \end{aligned}$$
(16)
$$\begin{aligned}{} & {} Specificity/Negative\, recall = \frac{TN}{TN+FP} \end{aligned}$$
(17)

Precision and recall are better indicators of a system’s success than accuracy for an imbalanced binary classifier. Yet, in certain situations, a system may have high precision but poor recall, or vice versa. In such cases, the f-measure expresses both concerns in a single number. Once the precision and recall for a binary or multi-class classification task have been calculated, the two scores are combined into the f-measure, as shown in Eq. 18.

$$\begin{aligned} F{\text {-}}measure, F = 2 *\frac{Precision *Recall }{Precision + Recall} \end{aligned}$$
(18)

Accuracy or f-measure can show overoptimistic, inflated results, especially on imbalanced datasets. The Matthews Correlation Coefficient (MCC) is a more reliable statistical rate, which produces a high score only if the prediction obtains good results in all four confusion matrix categories, proportionally to both the size of the positive class and the size of the negative class in the dataset. The confusion matrix (or error matrix) can be summed up using MCC as shown in Eq. 19. MCC lies in the range [− 1,1], where 1 indicates perfect agreement between the predicted and actual values. MCC thus helps to identify the ineffectiveness of a classifier, especially in classifying minority-class samples.

$$\begin{aligned} MCC = \frac{TN *TP - FN *FP}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)} } \end{aligned}$$
(19)

To measure the ability of a sentiment classifier to distinguish between the polarity classes, the Area Under the Curve (AUC) is employed. The curve is typically the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate (recall) against the false positive rate at all classification thresholds, as shown in Fig. 3. AUC is an aggregated evaluation of the classifier as the threshold varies over all possible values, and it measures how the true positive rate and false positive rate trade off. The Precision-Recall AUC similarly summarizes the precision-recall curve over a range of threshold values as a single score. For imbalanced datasets in particular, where overfitting needs to be avoided, AUC is a preferable evaluation metric. AUC represents the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance, and it ranges from 0 to 1. An AUC of 0.0 denotes a model that makes all incorrect classifications, whereas an AUC of 1.0 denotes a model that makes all correct classifications.

Fig. 3 AUC under ROC
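
As a minimal illustration with hypothetical predictions, the confusion-matrix based metrics of Eqs. 13-19, together with ROC-AUC, can be computed with scikit-learn as follows.

```python
# Minimal sketch: classification metrics on hypothetical binary predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                     # actual polarities (1 = positive)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                     # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]     # classifier scores for AUC

print(accuracy_score(y_true, y_pred))                  # Eq. 13
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # Eqs. 14, 16
print(f1_score(y_true, y_pred))                        # Eq. 18
print(matthews_corrcoef(y_true, y_pred))               # Eq. 19
print(roc_auc_score(y_true, y_score))                  # ROC-AUC
```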

When a regression task is adopted for sentiment analysis, the Mean Squared Error (MSE) is employed to measure the average squared difference between the actual and predicted values. It is an absolute measure of the goodness of fit of the dependent variables in the model. The formula of MSE is given in Eq. 20. The lower the value of MSE, the better the sentiment analyzer. MSE can be used as a loss function since it is differentiable. However, it is not very suitable when the dataset contains outliers.

$$\begin{aligned} MSE = \frac{1}{n} \left( \sum {(y-\widehat{y})}^2\right) \end{aligned}$$
(20)

In contrast to the context dependency of MSE, R squared is a context-independent metric that is used for a regression task. It is a relative measure of how well the model fits dependent variables or how close the data is to the fitted regression line. Coefficient of Determination and Goodness of Fit are other names for R squared and it is calculated using Eq. 21.

$$\begin{aligned} R^2 = 1 - \frac{SSR}{SSM} \end{aligned}$$
(21)

where SSR is the squared sum error of the regression line and SSM is the squared sum error of the mean line.

Other performance evaluation metrics that can be also considered for evaluating a sentiment analysis model are Root Mean Squared Error (RMSE), Residual Standard Error (RSE), Mean Absolute Error (MAE), etc.
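
As a minimal illustration with hypothetical sentiment scores in [− 1,1], the regression metrics discussed above can be computed with scikit-learn as follows.

```python
# Minimal sketch: regression metrics on hypothetical sentiment scores.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [0.8, -0.5, 0.1, 0.9, -0.7]
y_pred = [0.6, -0.4, 0.0, 0.7, -0.9]

print(mean_squared_error(y_true, y_pred))          # MSE, Eq. 20
print(mean_squared_error(y_true, y_pred) ** 0.5)   # RMSE
print(mean_absolute_error(y_true, y_pred))         # MAE
print(r2_score(y_true, y_pred))                    # R squared, Eq. 21
```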

5 Applications of sentiment analysis

Sentiment analysis or opinion mining has recently been used in studies on e-commerce feedback, tweets, Facebook posts, YouTube content, blog entries, and a variety of other data mining and knowledge-based AI programs. As a result, it has progressed significantly in fields including Information Retrieval (IR), web data analysis, text mining, text analysis, NLP, computational linguistics, and biometrics. Using the different approaches/methods/frameworks analyzed earlier in this paper, sentiment analysis can be applied to various fields such as tourism, education, defense, business, politics, public services, finance, hazards, health, and safety. Across this broad range of applications, the aim is to obtain the best possible combination of strengths, whether or not all of the components in Fig. 1 or all of the approaches indicated in Fig. 2 are employed. Depending on the requirement/aim/framework of a sentiment analysis model, applications can vary from a straightforward prediction of the polarity of a single word to uncovering sensitive or hidden information, or even a pattern that can protect a nation from a potential terrorist attack or disaster. Many research works mention different application areas based on different domains or approaches used (Alessia et al. 2015; Jain and Gupta 2022; Saxena et al. 2022; Feldman 2013; Govindarajan 2022; Ravi and Ravi 2015). Knowledge of diverse application fields based purely on the dataset at hand is hard to find in existing research papers. This paper therefore outlines several sentiment analysis application areas based on the data/content/material at hand, which can be used by researchers for sentiment analysis.

5.1 Reviews on products

Sentiment analysis using reviews on different products with different brands is the most widespread practice, which encompasses different application angles. For a particular product, the number of brands has been increasing day by day. Also, the same brand may offer products with different specifications, and different online shopping sites now sell the same product. This creates confusion among customers trying to reach an optimal decision. Though shopping sites offer the option of displaying comments and star ratings left by former customers to assist potential buyers, the amount of feedback can be so large that scrolling through thousands of reviews becomes a time-consuming process. Sentiment analysis helps to alleviate this problem by giving a concise perspective on a product or brand as a whole, or even on a certain feature/aspect of the product. It can also be used by sellers or manufacturers to concentrate on the relevant aspects or specifications when upgrading the product or deciding the advertisement strategy. Product analysis by buyers, suppliers, and sellers; competitor analysis or market study by sellers or manufacturers; brand tracking and reputation management by manufacturers; customer service by e-commerce sites; and customer analysis by sellers and manufacturers are among the various application directions associated with sentiment analysis of product feedback. The necessity to detect fake reviews before using the available data for decision-making was highlighted in the research work by Vidanagama et al. (2022). The authors made use of a rule-based classifier, a domain feature ontology, and the Mahalanobis distance to detect fake reviews while performing aspect-based sentiment analysis. Cao et al. (2022) introduced a quality evaluation model of products by combining deep learning, word vector conversion, keyword clustering, and feature word extraction technologies. Their model improves product features based on consumer online reviews and finally calculates customer satisfaction and attention based on short text comments with sentiment tags. With the use of pre-trained word embeddings, Bhuvaneshwari et al. (2022) proposed a Bi-LSTM Self Attention based CNN (BAC) model for the analysis of user reviews. Wang et al. (2022b) designed a multi-attention bi-directional LSTM (BLSTM(MA)) and used Latent Dirichlet Allocation (LDA) modeling to perform multimodal fusion for sentiment analysis of product reviews. Alantari et al. (2022) examined 260,489 reviews from five review platforms, covering 25,241 products in nine different product categories. They discovered that pretrained neural network-based machine learning techniques provide the most precise forecasts, while topic models like LDA provide more thorough diagnostics. In other words, neural network models are better suited to prediction, whereas topic models are better suited to diagnostics. As a result, analysts’ preference for prediction or diagnostics is likely to determine how text review processing technologies are chosen in the future.

5.2 Political Tweets, Facebook comments, Blog posts, and YouTube Videos

Recently, people have started to openly share their views or opinion on different political parties, electoral candidates, government policies, and rules on different public platforms such as Twitter, Facebook, YouTube, and blogs. These create a great influence on the followers. Therefore, they are used by many experts to predict the outcome of an election beforehand, monitor public sentiment on various political movements, or analyze the sentiment of the public on a proposed government rule, bill, or law.

With the use of pre-trained models and the Chi-square test, Antypas et al. (2022) proposed a multilingual sentiment analysis model to analyze tweets from both influential and less popular members of parliament in Greece, Spain, and the United Kingdom. Their study indicates that negative tweets spread more rapidly than positive ones. Using the Valence Aware Dictionary and sentiment Reasoner (VADER) and 2 million tweets on the 2019 Indian Lok Sabha Election, Passi and Motisariya (2022) analyzed the sentiments of Twitter users towards each of the Indian political parties. Using an aging estimation method based on the ratio of positive to negative message rates, Yavari et al. (2022) designed an indicator of future election results.

5.3 Tweets or comments on Facebook/YouTube/Instagram on social cause or events

Expressions of opinions on different social causes or events have also increased recently. This increases the scope for designing application portals that analyze public sentiment, monitor and predict the possible outcomes of such an event or cause, and decide the steps to be adopted should any chaotic situation break out.

A multi-grained sentiment analysis and event summary method employing crowd-sourced social media data on explosive accidents was built by Ouyang et al. (2017). The system can determine which components of the event draw users’ attention, identify which microblog is responsible for a large shift in sentiment, and detect those aspects of the event that affect users’ attention. Smith and Cipolli (2022) studied the emotional discourse before and after a prohibition on graphic photos of self-harm on Facebook and Instagram using a corpus of 8,013 tweets. By clarifying topical content using statistical modeling to extract abstract topics in discourse, the authors offered an insight into how the policy change relating to self-harm was viewed by those with a vested interest.

5.4 Reviews on Blogs/Tweets/Facebook comments on movie

Reviews on an upcoming movie or a movie that is onscreen in the theatres can be used to decide the success or failure of the movie. Different movie recommender systems can also be designed using the reviews from the audience. Also, the distributors or producers can use such reviews to improve their advertising strategy based on the different aspects which are liked by the viewers.

Using sentiment analysis to gain a deeper understanding of user preferences, Dang et al. (2021) proposed methods to enhance the functionality of recommender systems for streaming services. The Multimodal Album Reviews Dataset (MARD) and Amazon Movie Reviews were used to test and compare two different LSTM and CNN combinations, LSTM-CNN and CNN-LSTM. They started with a version of the recommendation engine without sentiment analysis or genres as their baseline. Compared to the baseline, the results demonstrate that their models are superior in terms of rating prediction and top recommendation list evaluation. Pavitha et al. (2022) designed a system for analyzing movie reviews in different languages, classifying them as either positive or negative using Naive Bayes and the Support Vector Classifier (SVC), and recommending similar movies to users based on cosine similarity. For B-T4SA and IMDB movie reviews, Zhu et al. (2022) proposed a self-supervised sentiment analysis model named Senti-ITEM. The model pairs a representative image with the social media text as a pretext task, extracts features in a shared embedding space, and uses SVM for sentiment classification.

5.5 Tweets/Facebook comments on pandemic/crisis /environmental issues

Nowadays, people encountering abrupt situations or difficulties due to the Covid-19 pandemic or environmental events such as storms or earthquakes post real-time tweets or comments on Facebook. In such situations, by analyzing these tweets or comments properly, the government, an agency, or even nearby people can offer help and perform disaster management and crisis analysis.

Hodson et al. (2022) suggested a corpus-assisted discourse analysis approach, for analyzing public opinion on COVID-19 tweets and YouTube comments related to Canadian Public Health Office. The authors found that different platforms convey key differences between comments, specifically based on the tone used in YouTube videos as compared to plain text in Tweets. To capture sarcasm or get clear information, cross-platform and diverse methods must be adopted to facilitate health-related communication and public opinion. Chopra et al. (2022) employed logistic regression, Naive Bayes, XGBoost, LSTM, GloVe, and BERT to predict disaster warnings from tweets and evaluate the seriousness of the content.

5.6 Tweets/Facebook comments/YouTube videos on Stock Market

One of the trending application areas of sentiment analysis is stock market prediction. Identifying stocks and shares with great potential, and deciding the optimal time to buy them at the lowest price and sell them at their peak, can be supported by a suitable sentiment analysis model. Using stock market data with SVM, Ren et al. (2018) suggested a model that forecasts movement direction and predicts stock prices while capturing investor psychology. Sousa et al. (2019) used the BERT algorithm to analyze the sentiments of news articles and provide relevant information that can facilitate quick stock market-related decision-making. Considering both positive and negative financial news, de Oliveira Carosia et al. (2021) analyzed the influence on the stock market using three deep artificial neural networks, namely the Multi-Layer Perceptron (MLP), LSTM, and CNN. Their findings revealed that while recurrent neural networks such as LSTM perform better in terms of time characteristics when used to predict the stock market, CNNs perform better when assessing text semantics.

6 Future scope of research in sentiment analysis

There are numerous scientific studies in the literature that focus on each of the components of the sentiment analysis approach, either independently or in combination. Each of these sentiment analysis modules offers plenty of opportunities for further investigation, improvisation, and innovation. Several challenges and issues are also faced during the process of performing sentiment analysis, which may hinder the proper functioning or performance of the model. Some of them are domain dependency, reference problems, sarcasm detection, spam detection, time period, etc. Most of these challenges influence the development of better techniques and algorithms to handle them. Some of the primary research gaps that offer scope for future research and hence encourage further sentiment analysis research are discussed below:

1. It has been found that current techniques dedicated to sentiment analysis do not employ effective data initialization and pre-processing techniques. Rather than relying only on established NLP pre-processing techniques, an advanced pre-processing step, such as a normalization procedure that deliberately takes into account negation and mixed emotion, would be extremely beneficial (a minimal negation-marking sketch is provided after this list).

2. One of the most critical steps in improving the performance of a sentiment analysis model is keyword extraction. Many sentiment analysis models have been observed to extract keywords using generalized dictionaries. The use of generalized dictionaries, however, produces inaccurate findings, since most of these dictionaries contain keywords that are relevant only to specific domains, and in the real world there is no predefined list of keywords for a given domain or subject. Different researchers have shown the superiority of the degree centrality metric in graph-based methods for obtaining the best collection of representative and sentimental words, so it may be used to find key terms or phrases (see the graph-based keyword extraction sketch after this list). Automatic keyword extraction techniques can be used for sentiment analysis in a variety of applications, both independently and in combination. Most of these techniques have found applications in a variety of research areas, including Data Analysis, TM, IR, and NLP, since they allow text records to be condensed.

3. The assignment of polarity scores to keywords using sentiment dictionaries has gained a lot of attention in sentiment analysis. However, depending on its use in a specific domain, a term can serve as a positive or a negative word at different times, so the use of sentiment dictionaries with pre-defined polarities for words is not an appropriate practice for sentiment analysis. Existing sentiment dictionaries also largely fail to handle sarcasm or negation. Similarly, many machine learning-based techniques are trained to work only for a particular domain and do not consider that words can change their polarity based on the context and domain of application. Thus, when the same word is encountered in another domain, the trained classifier produces incorrect results in some situations.

4. New edge and node weighting approaches may be introduced and used in place of NE-Rank or TextRank centralities to determine keyword rank. To achieve improved outcomes in the future, different ensemble or individually improved centralities may be used (the centrality ranking step in the graph-based sketch after this list is where such measures would be substituted). This establishes a framework for future research into graph mining algorithms for sentiment analysis in various fields.
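
For point 1, the following is a minimal sketch of a negation-aware normalization step: tokens inside a negation scope are marked with a NEG_ prefix so that, for example, "good" and "not good" become distinct features. The negator list and the punctuation-based scope rule are simplifying assumptions, not an established standard.

```python
import re

NEGATORS = {"not", "no", "never", "cannot"}   # illustrative negator list
SCOPE_END = {".", ",", ";", "!", "?", "but"}  # tokens that end a negation scope

def normalize_with_negation(text: str) -> list[str]:
    """Lower-case, tokenize, and prefix tokens inside a negation scope with NEG_."""
    tokens = re.findall(r"\w+|[.,;!?]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in NEGATORS:
            negating = True
            out.append(tok)
        elif tok in SCOPE_END:
            negating = False
            out.append(tok)
        else:
            out.append("NEG_" + tok if negating else tok)
    return out

print(normalize_with_negation("The plot was not good, but the acting was great."))
# ['the', 'plot', 'was', 'not', 'NEG_good', ',', 'but', 'the', 'acting', 'was', 'great', '.']
```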
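
For points 2 and 4, the sketch below builds a word co-occurrence graph over a sliding window and ranks candidate keywords both by degree centrality and by PageRank, the centrality underlying TextRank. The window size and the toy sentence are illustrative assumptions; NE-Rank-style or other improved weightings would replace the ranking step.

```python
import networkx as nx

def keyword_graph(tokens, window=3):
    """Build an undirected co-occurrence graph over a sliding window of tokens."""
    g = nx.Graph()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                g.add_edge(tokens[i], tokens[j])
    return g

tokens = ("battery life is poor but camera quality and screen quality "
          "are excellent for the price").split()
g = keyword_graph(tokens)

# Rank candidate keywords by two alternative centralities
by_degree = sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1])
by_pagerank = sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1])

print(by_degree[:5])    # highest-degree words, e.g. 'quality'
print(by_pagerank[:5])  # TextRank-style ranking
```

In practice, stop-word filtering and part-of-speech restrictions would be applied before graph construction, and the chosen centrality would be swapped for whichever weighting scheme is under investigation.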

7 Conclusion

The era of digitization marks the astonishing growth of subjective textual data online. Properly analyzing this textual information, so that it rightly reflects public sentiment regarding any topic, demands careful investigation of the data. Sentiment analysis has emerged as a key task that enhances the decision-making process by extracting the underlying sentiment or opinion from data. Even though sentiment analysis has progressed in recent years, modern models still have shortcomings such as domain dependence, poor negation handling, high dimensionality, and the failure to use efficient keyword extraction. This paper examines and provides a comprehensive discussion of different perspectives related to the creation and implementation of an effective sentiment analysis model. A thorough examination of the various modules of the sentiment analysis methodology is carried out to help plan and improve effective sentiment analysis models. The keyword extraction algorithm is vital to the success of a sentiment analysis model and is therefore studied in detail in this paper. The paper also discusses sentiment classification methods, which form an essential aspect of a sentiment analysis model, and conducts a detailed review of both machine learning and lexicon-based approaches to textual sentiment analysis.

As a thorough, well-organized study on sentiment analysis, this research effort can assist academicians and industry experts in analyzing and developing powerful sentiment analysis models in a wide range of domains. Sentiment analysis models hold great potential for further development and use in the near future because they have a broad range of uses in social, industrial, political, economic, health and safety, education, defense, and financial contexts, among others. Each of the sentiment analysis modules discussed in this paper can be investigated, improved, and supplemented with relevant algorithms to design an efficient sentiment analysis model. This study also offers prospective guidelines for carrying out proper sentiment analysis research.