A multimodal sentiment analysis system for recognizing person aggressiveness in pain based on textual and visual information

This article proposes a multimodal sentiment analysis system for recognizing a person's aggressiveness in pain. The implementation is divided into five components. The first three components form a text-based sentiment analysis system that classifies comments into non-aggressive, covertly aggressive, and overtly aggressive classes. The remaining two components form an image-based sentiment analysis system, in which a deep learning-based approach is employed for feature learning and for predicting the three types of pain classes. An aggression dataset is employed for the text-based system and the UNBC-McMaster database for the image-based system. Experimental results are compared with state-of-the-art methods and show the superiority of the proposed approach. Finally, the scores from the text-based and image-based sentiment analysis systems are fused to obtain the performance of the proposed multimodal sentiment analysis system.


Introduction
Sentiment analysis (SA) is the task of identifying opinions about different entities such as events, issues, aggression, anger, and attitude. It tries to categorize the sentiment of people's opinions into three main categories: positive, negative, and neutral (Ghosh et al. 2022). Nowadays, research is not limited to finding positive, negative, or neutral sentiment; it also aims to quantify the degree of positivity and negativity with the help of sentiment scores obtained from natural language, images/video, and audio. User activity on online platforms is growing rapidly day by day. With the increased amount of interaction and the increased number of people involved in these interactions over the web, aggression-oriented activities such as flaming, trolling, roasting, and cyberbullying have also increased globally (Ebrahimi et al. 2017). Applications of sentiment analysis include the detection of spam email, bullying, trolling, suicidal tendency, aggression, fraudulent messages, and roasting. Several problems have been solved using text, audio, image, or video-based sentiment analysis methods (Li and Xu 2019). For example, a detailed analysis of an aggressive text makes it possible to measure the intensity of the aggression. Images can be analyzed to detect pain, happiness, and other human emotions. A video can be used to track changes in emotion as the person's activity changes. Several tasks, such as aggression detection, are possible with text, image, or video alone, but there are situations where sentiment analysis based on a single modality is not enough to get the correct result, as explained in the following subsections.

Text-based sentiment analysis
Text-based sentiment analysis is performed on text data. This approach goes through a specific set of activities: preprocessing of the text data (removing stop words, articles, prepositions, pronouns, forms of the verb 'be', and other words), tokenization of words, feature extraction (using a bag of words, aspect-based features, context-based features, etc.), and finally classification using well-known classifiers such as random forest (Mursalin et al. 2017), support vector machine (Noble 2006), logistic regression (Hosmer Jr et al. 2013), Naïve Bayes (Malmasi and Zampieri 2017), and neural network-based approaches (Sundermeyer et al. 2012; Zaremba et al. 2014; Gambäck and Sikdar 2017). Text-based sentiment analysis addresses the identification of bullying, roasting, aggression, trolling, and suicidal tendencies. It also supports feedback analysis, which helps provide better customer service and track the performance of an organization's employees. One important issue is that the actual feelings of the person cannot be judged correctly from text alone, because emotional cues are absent from the text. Text-based sentiment analysis is also often characterized by a lack of labeled data, an inability to handle complex sentences, and a misunderstanding of context in specific conversations, making this task particularly demanding. Multimodal sentiment analysis, as opposed to traditional single-modality analysis, considers diverse manifestation patterns. Therefore, the sentiment analysis method must be effective in bridging the gap between different modalities. The semantic information covered by the text description and the visual content may differ, so it is necessary to extract from each modality the comprehensive and discriminative information most related to sentiment classification. Finally, the absence of one modality from the multimodal data is a common phenomenon, and dealing with incomplete multimodal data for sentiment analysis is still a challenging issue.

Image/video-based sentiment analysis
Analyzing information from images is vital for understanding human behavior by capturing people's activities. Image-based SA handles human attitudes and emotions better. Nowadays, people use social media to share various images with friends, relatives, and near and dear ones. A description or caption is unavailable for most shared images (You et al. 2015). Various video-sharing network applications, websites, and other multimedia platforms help researchers work on multimodal sentiment analysis (Li and Xu 2019). The information gathered from natural language processing is not enough, since humans communicate and express their emotions and sentiments through different channels. The simultaneous and cognitive analysis of text, audio, and visual modalities enables the effective extraction of semantic and affective information. In this direction, visual information is essential, as the speaker's gestures and facial expressions carry significant sentiment characteristics. Taking up these challenges, we propose a sentiment analysis system to identify the type of sentiment among the non-aggressive (NAG, no-pain), covertly aggressive (CAG, low-pain), and overtly aggressive (OAG, high-pain) classes in human behaviors. Covertly aggressive behaviors (due to pain in the human body) are common and of relatively low intensity compared to OAG behavior, which has a very high intensity of aggressiveness. Finally, the NAG class has no aggressive behavior. The contributions of this work are as follows:
• A sentiment analysis system using both text and image data of a person, such that the proposed framework predicts the level of the person's sentiment, such as pain, among NAG, CAG, and OAG human behaviors.
• A text-based sentiment analysis using both conventional and deep learning approaches [long short-term memory (LSTM) architecture].
• An image-based sentiment analysis that adopts a deep learning approach, i.e., the convolutional neural network (CNN) architecture.
• Fusion of the scores from the text-based and image-based sentiment analysis systems to derive the prediction of the proposed multimodal sentiment system.
This paper is organized as follows: Sect. 2 describes the related works for the proposed system. The methodology for implementing the proposed sentiment analysis system using text and image data is discussed in Sect. 3. The experimental results and discussion are presented in Sect. 4. Finally, Sect. 5 concludes this paper.

Related work
Aggression on online platforms is no longer considered just a nuisance; it has been marked as a significant criminal activity that can be dangerous for many people (Kumar et al. 2018b). So, it is essential to take preventive action to safeguard people on the web. Analyzing text, images, or videos using natural language processing, image processing, pattern recognition, and computer vision techniques can therefore be highly effective in detecting aggression-related issues. Several deep learning-based sentiment analysis methods have recently drawn more attention to text detection techniques and word embedding algorithms (Chen and Zhang 2018).

Text-based sentiment analysis
Kumar et al. (2018b) present work on distinguishing hate speech from vulgarity, which opens the scope for differentiating covert and overt aggression. Dinakar et al. (2011) carried out work on cyber-bullying, while Dadvar et al. (2013), Dadvar et al. (2014), and Van Hee et al. (2015) implemented cyber-bullying detection systems with improved performance. Trolling is another kind of human behavior; initial research on it was carried out by Cambria et al. (2010) and, after that, by Kumar et al. (2014). Mihaylov et al. (2015) and Mojica (2016) have also published work on trolling identification with improved performance. Racism is another important behavior for sentiment analysis, and Greevy and Smeaton (2004) derived solutions for analyzing racism as a sentiment analysis problem. Hate speech identification has been extensively analyzed by Burnap and Williams (2015), Djuric et al. (2015), Gitari et al. (2015), and Badjatiya et al. (2017).

Image/video-based sentiment analysis
In the present scenario, due to the availability of upgraded communication devices such as smartphones, plenty of data is uploaded as video (Cambria et al. 2017). As research activity in sentiment analysis has grown, analyzing aggression from text alone is no longer sufficient; it is imperative to analyze other data types to get the most relevant result (You et al. 2015). Kaur et al. (2021) present a survey of deep learning approaches to medical images with a systematic look into object detection tasks. Concerning image or video-based sentiment analysis, Arya et al. (2021) surveyed multidisciplinary domains contributing to affective computing for emotion recognition. Image sentiment analysis using deep learning approaches is discussed in Mittal et al. (2018). Emotions conveyed by a person's facial expressions (Neth 2007) and body language are contained in the visual information. In multimodal sentiment analysis (MSA), facial expression recognition using video-based data plays an important role (Li and Xu 2019). There are two types of data representation for facial expression recognition (Butler et al. 2009): spatial and spatio-temporal. In the spatial case, the image sequences are encoded frame by frame; in the spatio-temporal representation, the neighboring frames are also considered (Sariyanidi et al. 2014). Ekman and Keltner (1970) described a thorough investigation into facial expression recognition. Chen et al. (1998) and De Silva et al. (1997) presented early work on emotion detection fusing visual and audio modalities. Werner et al. (2019) presented a survey of automatic recognition methods supporting pain assessment. Ullah et al. (2017) surveyed the difficulties faced when implementing multimodal sentiment analysis using text, image, audio, and video posted regularly on social media; the survey also lists existing and upcoming difficulties and opportunities for MSA research. Another survey on MSA (Soleymani et al. 2017) discusses the problems of MSA in different domains such as spoken reviews, video blogs, images, and human-machine and human-human interaction systems, together with their opportunities and challenges. Rao et al. (2021) employed speech-based LSTM features together with k-nearest neighbors (kNN), Bayesian networks, hidden Markov models (HMMs), and artificial neural network (ANN) based features from facial expressions, with acoustic features (Gaussian mixture model, Mel frequency cepstral coefficients), using the RAVDESS (Ryerson audio-visual database of emotional speech and song) audio dataset. Emotion classification from speech and text in videos using a multimodal approach is performed in Caschera et al. (2022), which demonstrates the automatic extraction of emotional information from a variety of data provided by different interaction modalities and from different domains. A survey on multimodal video sentiment analysis using deep learning approaches is reported in Abdu et al. (2021), which discusses multimodal sentiment analysis systems with the Multimodal Multi-Utterance based architecture.
Based on these studies, a text-based and an image-based sentiment analysis system have been developed in this work using machine learning and computer vision techniques. Using category-based learning, the proposed system can detect aggressiveness in the pain of human behavior. The classes are NAG, CAG, and OAG. For the text-based sentiment analysis, only the text data is used. For the image-based sentiment analysis, static images extracted from the facial region are employed to analyze emotions and thus predict the level of aggression.

Proposed method
The proposed approach deals with two different types of data: text, and images or sequences of frames from a video. Both are used to recognize the type of sentiment when a person feels pain due to a physical ailment. The block diagram of the proposed system is shown in Fig. 1.

Text preprocessing
The text datasets employed for sentiment analysis are noisy because they come from diverse domains, so some text preprocessing is required to make them usable. In this work, the text-based sentiment analysis system has three components: text preprocessing, feature extraction, and classification. During text preprocessing, a document $D$ from the same domain as the problem to be solved is considered. Let $N$ be the number of comments in the document $D$.
Each comment $C_i$ may contain several stop words such as 'is', 'are', 'i', 'am', 'would', 'will', 'what is', 'more', 'such', 'has', 'have', etc. During processing, these stop words are removed from $C_i$, so that after preprocessing $C_i$ is transformed to $C'_i$ (Fig. 2). This $C'_i$ is then passed to the feature extraction component, in which two different algorithms, Scheme 1 and Scheme 2, are adopted for feature computation. Scheme 1 derives the $f_{\mathrm{TEXT}}$ feature vectors, while Scheme 2 provides the $g_{\mathrm{TEXT}}$ feature vectors. The classification tasks are performed on the computed feature vectors. A detailed description of the two schemes is given below.
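The stop-word removal step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the stop-word set contains only the examples named in the text (with 'what is' split into 'what' and 'is'), and the lowercasing and punctuation stripping are assumptions.

```python
# Hypothetical stop-word list: only the examples named in the text are included.
STOP_WORDS = {"is", "are", "i", "am", "would", "will", "what",
              "more", "such", "has", "have"}


def preprocess_comment(comment, stop_words=STOP_WORDS):
    """Lowercase a comment, split it on whitespace, and drop stop words."""
    tokens = comment.lower().split()
    # Strip surrounding punctuation so that "more," still matches "more".
    tokens = [t.strip(".,!?;:'\"()") for t in tokens]
    return [t for t in tokens if t and t not in stop_words]


comments = ["I am in such pain!", "What is more, this will hurt."]
cleaned = [preprocess_comment(c) for c in comments]
```

Each entry of `cleaned` plays the role of a preprocessed comment $C'_i$; a production system would normally use a complete stop-word list for each language in the dataset.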

Scheme 1
The preprocessed comments $C' = \{C'_1, \ldots, C'_N\}$ contain several words that are related to measuring the aggressiveness label of a person. Since a particular word may have several meanings, the contribution of each word is important. Converting words into numeric form not only removes the redundancy between words but also introduces distinctive properties between them (Boulis and Ostendorf 2005). Moreover, the numeric transformation of these words reduces the dimension and helps the classifier derive better predictions for the proposed system. In Scheme 1, the sentences in each comment $C'_i$ are tokenized into words, i.e., sentence $S_j \in C'_i = \{w_1, w_2, \ldots, w_K\}$. During feature extraction, features are extracted from each $C'_i$ in the form of a feature vector $f_{\mathrm{TEXT}}$ such that for each word $w_i \in C'_i$, two values are computed: (a) term frequency (TF), and (b) inverse document frequency (IDF). In this work, TF is denoted by $a$ and IDF by $b$; they are computed from $n_{w_i}$, the number of occurrences of word $w_i$ in the particular $C'_i$, from $N$, the total number of comments (i.e., $|C'| = N$), and from $M_{w_i}$, the number of comments $C'_i$ in which $w_i$ appears. The counting of words in each comment and across all comments is handled using the Bag-of-Words technique (Boulis and Ostendorf 2005), in which a list of unique words is considered (let there be $K$ unique words in the dictionary) and '1' marks the presence and '0' the absence of a word in the comment $C'_i$. The final feature value for each word $w_i$ is given by $c = a \times b$. So, the feature vector for each comment $C'_i$ is $f_{\mathrm{TEXT}} \in \mathbb{R}^{1 \times K}$, which is sparse, with zeros for absent words. The number of non-zero entries varies, since each comment $C'_i$ may not have the same number of words. The block diagram for extracting the features for the text-based sentiment analysis system is shown in Fig. 3.

Fig. 2 Text pre-processing steps for the proposed text-based sentiment analysis system
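To make the Scheme 1 feature computation concrete, the sketch below builds one feature row per comment. The exact TF and IDF formulas are assumptions: it uses the common forms $a = n_{w_i}$ (raw count) and $b = \log(N / M_{w_i})$, and the function name `tfidf_features` is illustrative only.

```python
import math
from collections import Counter


def tfidf_features(comments):
    """Return (vocabulary, feature rows) for a list of tokenized comments.

    Assumed definitions (common TF-IDF forms, not confirmed by the text):
      a = n_w            raw count of word w in the comment (TF)
      b = log(N / M_w)   N comments, M_w comments containing w (IDF)
      c = a * b          final feature value
    """
    n_comments = len(comments)
    vocab = sorted({w for c in comments for w in c})
    # M_w: number of comments in which each word appears at least once.
    doc_freq = Counter(w for c in comments for w in set(c))
    rows = []
    for c in comments:
        counts = Counter(c)  # n_w for this comment; absent words count as 0
        rows.append([counts[w] * math.log(n_comments / doc_freq[w])
                     for w in vocab])
    return vocab, rows


docs = [["pain", "bad", "pain"], ["pain", "gone"], ["bad", "day"]]
vocab, features = tfidf_features(docs)
```

Every row has length $K$ (here the four words of `vocab`), and words absent from a comment contribute zeros, which is what makes $f_{\mathrm{TEXT}}$ sparse.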
The features extracted from the text documents are then passed to the classification task. Several classifiers, namely logistic regression (LR), SVM, classification and regression tree (CART), and kNN, have been employed. During classification, 50% of the samples from each class are used to form the training set, while the remaining 50% are used for testing. This partitioning into training and testing sets is performed ten times, and the average performance is reported for the text-based sentiment analysis system. The performance is reported on the testing set in terms of both F1-score and correct recognition rate [accuracy (%)].
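The evaluation protocol above (a per-class 50/50 split repeated ten times, with averaged scores) can be sketched as follows. The helper names and the `evaluate` callback are hypothetical; in practice the callback would train a classifier on the training indices and return its score on the testing indices.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, rng):
    """Split sample indices so each class sends half of its samples to training."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train.extend(shuffled[:half])
        test.extend(shuffled[half:])
    return train, test


def average_over_splits(labels, evaluate, n_repeats=10, seed=0):
    """Average the score of `evaluate(train_idx, test_idx)` over repeated splits."""
    rng = random.Random(seed)
    scores = [evaluate(*stratified_half_split(labels, rng))
              for _ in range(n_repeats)]
    return sum(scores) / len(scores)


labels = ["NAG"] * 6 + ["CAG"] * 4 + ["OAG"] * 2
train_idx, test_idx = stratified_half_split(labels, random.Random(0))
```

Splitting within each class keeps the class proportions identical in the training and testing sets, which matters here because the three classes are not equally represented.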

Scheme 2
In the second scheme, LSTM-based feature extraction followed by classification is employed. LSTM (Hochreiter and Schmidhuber 1997) is a specific type of Recurrent Neural Network (RNN) applied to sequence labeling and prediction tasks. Feature extraction in this scheme proceeds as follows: each preprocessed comment $C'_i$ is treated as a space-separated sequence of words, which is split into a list of tokens, and these tokens are then vectorized using a suitable data structure. The list of tokens is finally input to the LSTM-based architecture, which performs feature learning on the tokens of each comment and classifies the comment into the NAG, CAG, or OAG class. The LSTM-based architecture is shown in Fig. 4, and its parameters are listed in Table 1. Here also, 50% of the samples from each class are randomly selected to form the training set and the remaining 50% are used for testing; the average performance over ten such training-testing partitions of the employed dataset is reported.
Scheme 1 is based on the TF and IDF methods, which are NLP-based handcrafted feature representation techniques in which the contributions of word occurrences are used for feature computation at the lexical level. This scheme gives more importance to the word frequency in the comments, so emotional information retrieval is easier with this scheme once the significance of the words has been learned by the system, which makes search engines faster at identifying emotions in the given content. On the other hand, this scheme is based on the bag-of-words (BoW) model. Therefore, it does not capture a word's position, co-occurrences, and semantics in the other comments; hence, it cannot introduce the concepts of word embeddings and topic modeling during information extraction. For this reason, the Scheme 2 feature representation technique has also been incorporated. In Scheme 2, an LSTM-based technique is employed to better memorize specific patterns of words in the comments. This technique can also support extracting semantic information from the given content. Hence, both schemes are adopted for the proposed text-based sentiment analysis to extract more distinct and discriminant features from texts and to incorporate both lexical and semantic information from the comments.
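The tokenization and vectorization step that feeds the LSTM in Scheme 2 can be illustrated as below. The text does not name the data structure used, so integer indexing with zero padding is an assumption, and `max_len` is a hypothetical parameter.

```python
def build_vocabulary(tokenized_comments):
    """Map each distinct token to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for tokens in tokenized_comments:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def vectorize(tokens, vocab, max_len):
    """Convert tokens to indices, truncating or right-padding with 0 to max_len."""
    # Tokens that are not in the vocabulary are skipped.
    ids = [vocab[t] for t in tokens if t in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


comments = ["this hurts badly", "no pain"]
tokenized = [c.split() for c in comments]  # space-separated words -> tokens
vocab = build_vocabulary(tokenized)
sequences = [vectorize(t, vocab, max_len=4) for t in tokenized]
```

Fixed-length integer sequences of this kind are what an embedding layer followed by an LSTM layer typically consumes; unlike the bag-of-words vector of Scheme 1, they preserve word order.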

Image preprocessing
The image-based sentiment analysis has two components: (i) image preprocessing for extracting the facial region as a region of interest and (ii) feature learning with classification for predicting the aggressiveness level in human behavior.
In an unconstrained imaging environment, noise, illumination, pose variations, and background introduce irrelevant features (Umer et al. 2019). So, to extract more relevant and valuable characteristics, the face region is detected as the region of interest in the input image, extracted, and normalized so that a feature vector of the same dimension is obtained from each extracted face region. During the image preprocessing phase, a tree-structured part model (Zhu and Ramanan 2012) is employed, which works for all variants of face poses. This technique computes sixty-eight facial landmark points for a frontal face, while thirty-nine landmarks are extracted for a profile face. The bilinear image interpolation method is employed to normalize each extracted face region. The face detection process for the proposed image-based sentiment analysis system is shown in Fig. 5.
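The bilinear normalization of an extracted face region to a fixed size can be sketched in plain Python as follows. This is an illustrative version for a single-channel image that maps output corners onto input corners (one of several common conventions, assumed here); the function name is hypothetical.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    # Scale factors chosen so that output corners land on input corners.
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * sx
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
            bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


face = [[0.0, 10.0], [20.0, 30.0]]
resized = bilinear_resize(face, 3, 3)
```

Calling `bilinear_resize(region, 48, 48)` on every cropped face would yield inputs of one common size, such as the 48 × 48 size reported in the experiments.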

Feature learning with classification
A deep learning-based approach, namely a CNN, has been employed for feature learning and for classifying images into three aggressiveness classes. With CNN-based models, various research problems such as object detection, texture classification, face recognition, object recognition, scene understanding, and many other computer vision applications can be analyzed and solved (Szegedy et al. 2015; Saxena 2016). CNN-based approaches extract shape and texture information using machine learning optimization algorithms. A CNN architecture is trained on a large database, and the weights in the network are adjusted according to the number of classes for the given problem. The CNN architecture has two parts: (i) feature learning and (ii) classification.
The input layer accepts the image, while in the convolution layer, convolution operations are performed with various kernels (filter banks) to obtain a convolved image (feature map) for each filter. Here, the parameters are the adjusted weights in the filter sets. The max-pooling layers (Liu et al. 2013) reduce the computational burden by decreasing the number of parameters within the network. In the max-pooling layer, a 2 × 2 filter is applied to each feature map retrieved from the preceding layer. The filter moves with a stride of 2 in the horizontal and then the vertical direction, and a down-sampled value (the maximum in max pooling; the minimum or average in other pooling variants) is computed over each group of four numbers. A fully connected layer transforms all the features from the previous layer into a 1-dimensional vector. The dense layer is another fully connected layer that performs linear operations: each input is connected to the output, and probability scores are generated as the outcomes with the help of an activation function such as Softmax. Using these concepts of the CNN layers, we have proposed a CNN architecture in this work. The architectural design of the proposed CNN is shown in Fig. 6. The architecture has six main blocks (each composed of convolution, batch normalization, activation, max-pooling, and dropout layers) followed by two fully connected layers with three dense layers, and the probability values for the three aggressiveness classes, i.e., NAG, CAG, and OAG, are obtained from the last dense layer. The detailed description of the proposed CNN architecture, including the adopted layers, the output shape of the feature maps at each layer, and the parameters involved at each layer, is given in Table 2.
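Two of the layer operations described above, 2 × 2 max pooling with a stride of 2 and the Softmax activation of the last dense layer, can be illustrated with the short sketch below. It is a plain-Python illustration of the operations themselves, not the proposed architecture, and the example values are made up.

```python
import math


def max_pool_2x2(feature_map):
    """Apply 2 x 2 max pooling with stride 2 to a 2-D feature map."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            row.append(max(window))  # keep the largest of the four values
        pooled.append(row)
    return pooled


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 1, 1]]
pooled = max_pool_2x2(fmap)        # 4 x 4 map reduced to 2 x 2
probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for NAG, CAG, OAG
```

Each pooling step halves the height and width of a feature map, which is how the six blocks progressively shrink the input before the fully connected layers.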
While learning the parameters, the data from the preceding layer is normalized by the Batch Normalization technique (Ding et al. 2018). This technique processes a batch of data by subtracting the batch mean and dividing by the batch standard deviation. Two trainable parameters, a shift (a learned mean) and a scale (a learned standard deviation), are then applied to the batch-normalized data. The activation function used is the Rectified Linear Unit (ReLU) (Ding et al. 2018). The Dropout method (Wang et al. 2017) ignores arbitrarily selected neurons during training.

Datasets
An Aggression dataset (Kumar et al. 2018b) has been used for text, which is divided into three classes: non-aggressive (NAG, no-pain), covertly aggressive (CAG, low-pain), and overtly aggressive (OAG, high-pain). As mentioned in Sect. 2, the OAG class contains comments in which the user's aggression against a particular topic is expressed with great intensity. Both the external and internal statements are highly aggressive for these types of comments. For the CAG class of comments, the intensity of the overall aggressiveness is quite low compared to OAG comments. Moreover, the external statement of such a comment may not look aggressive, but the aggressiveness is clearly identified in its internal statement. In the NAG class of comments, no aggressiveness can be identified from either the external or the internal statements. NAG is the $C_1$ class, CAG is the $C_2$ class, and OAG is the $C_3$ class. The comments in this dataset are written in two languages: 'Hindi' and 'English'. Some examples of these comments in 'English' and 'Hindi' (written in the English alphabet) for the NAG, CAG, and OAG classes are shown in Table 3, whereas Table 4 describes the text database used.
Similarly, for the image-based sentiment analysis, we have employed the UNBC-McMaster (Lucey et al. 2011) shoulder pain expression archive database. This dataset is composed of 129 subjects (63 male and 66 female). The participants had shoulder pain, their problems were identified at three physiotherapy clinics, and the videos were captured on the campus of McMaster University. The subjects suffered from arthritis, bursitis, tendinitis, subluxation, rotator cuff injuries, impingement syndromes, bone spurs, capsulitis, and dislocation. Frames have been extracted from each video, and the images are labeled from 'No pain' to 'High-intensity pain' classes. These images are classified as non-aggressive (NAG, no-pain) ($C_1$), covertly aggressive (CAG, low-pain) ($C_2$), and overtly aggressive (OAG, high-pain) ($C_3$), with descriptions of the samples in Table 5. Some images from this database are shown in Fig. 7.

Results and discussion
The proposed sentiment analysis system has been implemented in Python on the Ubuntu operating system with 32 GB RAM and an Intel Core i7 processor at 3.20 GHz. Several Python packages have been employed during implementation, in particular Theano (Bergstra et al. 2010) and Keras (Gulli and Pal 2017). The sentiment of a person has been analyzed using their text information (written by them) and their image data (showing the pain emotion on their face). Text-based and image-based sentiment analyses have been performed individually, and the results are reported accordingly. Finally, to improve the performance of the proposed system, the results from the text-based and image-based sentiment analysis systems are fused at the post-classification level so that unconstrained environments can be handled with improved performance. The text-based and image-based sentiment analyses are discussed in the following sections.

Fig. 6 The employed CNN architecture for the proposed system

Results on text
During text-based sentiment analysis, each comment of the dataset is considered and classified into three classes. First, each comment $C_i$ is preprocessed to $C'_i$ using the technique discussed in Sect. 3.1. Using Scheme 1, a feature vector $f_{\mathrm{TEXT}} \in \mathbb{R}^{1 \times W}$ with $W = 1000$ is obtained for each comment $C'_i$, and hence a feature matrix $F_{\mathrm{TEXT}} \in \mathbb{R}^{12000 \times W}$ is obtained for all comments. This feature matrix is randomly partitioned, with 50% of its data as the training set and 50% as the testing set. The training set is used for classification with the LR, kNN, CART, and SVM classifiers. Each classifier yields a model that is used to obtain the performance of the proposed text-based sentiment analysis system on the testing set. The performance of the proposed text-based sentiment analysis using the Scheme 1 and Scheme 2 methods is given in Table 6. This table shows the performance for both the 2-class problem (where samples from $C_2$ and $C_3$ are considered to be from the same class, i.e., the aggressive class, and samples from $C_1$ form the NAG class) and the 3-class problem. From Table 6, it is observed that the proposed system attains better performance using the SVM and LSTM classifiers for both the 2-class and 3-class problems. It is also observed that the proposed system achieves better performance for the 2-class problem than for the 3-class problem, because the samples are better distributed in the 2-class problem. The performance of the proposed text-based sentiment analysis with both the Scheme 1 and Scheme 2 methods is compared with existing state-of-the-art methods in Table 7, which lists the performance of the methods of Samghabadi et al. (2018), Kumar et al. (2018a), Modha et al. (2018), and Orǎsan (2018) in terms of both F1-score and accuracy (%).
This comparison shows that the proposed text-based sentiment analysis obtains better performance using both the Scheme 1 and Scheme 2 methods. To improve the performance of the proposed text-based sentiment analysis system, the scores from the SVM classifier using Scheme 1 features and the scores from the LSTM classifier using Scheme 2 features are fused using sum, product, and weighted-sum rule-based score-level fusion techniques. Let $s_1$ and $s_2$ be the post-classification scores of the SVM and LSTM classifiers; (i) the sum rule is defined as $s = s_1 + s_2$, (ii) the product rule is defined as $s = s_1 \times s_2$, and (iii) the weighted-sum rule is defined as $s = w_1 \times s_1 + w_2 \times s_2$, where $s$ is the fused score and $w_1$ and $w_2$ are the corresponding weights such that $w = w_1 + w_2$. The fused performance of text-based sentiment analysis using the Scheme 1 and Scheme 2 methods is shown in Table 8.
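The three score-level fusion rules can be sketched as follows, applied element-wise to per-class score vectors. The example scores, the default weights, and the `predict` helper are hypothetical; only the three rules themselves come from the definitions above.

```python
def fuse_scores(s1, s2, rule="sum", w1=0.5, w2=0.5):
    """Fuse two per-class score vectors with the sum, product, or weighted-sum rule."""
    if len(s1) != len(s2):
        raise ValueError("score vectors must have the same length")
    if rule == "sum":
        return [a + b for a, b in zip(s1, s2)]
    if rule == "product":
        return [a * b for a, b in zip(s1, s2)]
    if rule == "weighted_sum":
        return [w1 * a + w2 * b for a, b in zip(s1, s2)]
    raise ValueError(f"unknown fusion rule: {rule}")


def predict(fused, classes=("NAG", "CAG", "OAG")):
    """Return the class with the highest fused score."""
    return classes[max(range(len(fused)), key=fused.__getitem__)]


svm_scores = [0.2, 0.5, 0.3]    # hypothetical s1 values
lstm_scores = [0.1, 0.3, 0.6]   # hypothetical s2 values
fused_sum = fuse_scores(svm_scores, lstm_scores, "sum")
fused_prod = fuse_scores(svm_scores, lstm_scores, "product")
fused_weighted = fuse_scores(svm_scores, lstm_scores, "weighted_sum", 0.8, 0.2)
```

With these example scores the sum and product rules both favor OAG, while a weighted sum that trusts the first classifier more favors CAG, which shows how the choice of rule and weights can change the final decision.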

Results on image
In the image-based sentiment analysis, the implementation of the proposed system is divided into (i) image preprocessing and (ii) feature learning with classification. During face preprocessing, the face region $F$ is extracted using the tree-structured part model (TSPM). The extracted face region $F$ is then normalized to a fixed image size of $N \times N$. Further, the extracted facial regions from the training samples are passed to the proposed CNN architecture (Fig. 6). Here the size of the face region $N \times N$ is 48 × 48, while the batch size and the number of epochs vary. During experimentation, it has been observed that the performance of the proposed system is affected by the batch size, chosen from {30, 40, 50}, and the number of epochs, chosen from {50, 100, 200}. Figure 8 shows the effect of the batch size and the number of epochs on the performance of the proposed image-based sentiment analysis system for the UNBC-McMaster shoulder pain database.
From Fig. 8, it is observed that the performance improves as the number of epochs increases, and that the performance is better for batch size 30. For further experiments, we have employed batch size 30 on the training samples with 200 epochs for learning the parameters of the proposed CNN architecture. The top-2 performance (in terms of accuracy and F1-score) of the proposed image-based sentiment analysis system is shown in Table 9. Here also, the performance of the proposed system is shown for both the 2-class and 3-class problems. For the 2-class problem, the image samples of $C_2$ and $C_3$ are considered to be from the 'Pain' class, while the image samples of $C_1$ belong to the 'Non-Pain' class. From this performance, it is observed that the proposed system performs well for both the 2-class and 3-class problems. We have compared the performance of the proposed system with some existing state-of-the-art methods for the UNBC-McMaster shoulder pain database in Table 10. The performance reported for the competing methods was obtained under the same training-testing protocols. These results show that the proposed system performs better than the competing methods for the UNBC-McMaster shoulder pain database. Tables 7 and 10 show that the proposed system achieves outstanding performance on both the text and image datasets employed. So, the performances of the text-based and image-based sentiment analysis systems are fused for the multimodal sentiment analysis system. The two datasets have different numbers of samples for the NAG, CAG, and OAG classes. Consequently, the same number of samples per class has been taken from each dataset for the multimodal sentiment analysis system: 5000 samples for NAG, 4000 samples for CAG, and 2700 samples for OAG. The datasets have been partitioned with 50% of the data for training and 50% for testing. The performance of the proposed multimodal sentiment analysis system is shown in Table 11.
According to the reported performance, the data in the multimodal system is challenging. Due to a lack of data in the training set for both datasets, the individual performances are somewhat similar. The performance is much better after fusion. It is also shown that the multimodal system obtains better performance with the product-rule-based fusion technique than with the other fusion techniques.
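The 2-class and 3-class evaluations used throughout this section can be illustrated with the sketch below, which merges CAG and OAG into a single class and computes accuracy and F1-score. The labels are made-up examples, and the macro averaging of the F1-score is an assumption, since the averaging mode is not stated.

```python
def to_two_class(label):
    """Merge CAG (C2) and OAG (C3) into one class; NAG (C1) stays separate."""
    return "NAG" if label == "NAG" else "AG"


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging mode is an assumption)."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["NAG", "CAG", "OAG", "OAG", "NAG", "CAG"]   # hypothetical labels
y_pred = ["NAG", "OAG", "OAG", "CAG", "NAG", "CAG"]   # hypothetical predictions
acc3 = accuracy(y_true, y_pred)
f1_3 = macro_f1(y_true, y_pred)
acc2 = accuracy([to_two_class(t) for t in y_true],
                [to_two_class(p) for p in y_pred])
```

In this toy example every error is a confusion between CAG and OAG, so the errors vanish once the two classes are merged; this is one reason a 2-class score can exceed the corresponding 3-class score.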

Conclusions
This paper presents a multimodal sentiment analysis system for recognizing people's aggression in pain using textual and visual information from a person. The implementation of the proposed system has five components. Text preprocessing, feature extraction, and classification are the components of the text-based sentiment analysis system, which processes the person's textual information to predict the label of aggressiveness among the NAG, CAG, and OAG classes when they have pain or no pain. Similarly, image preprocessing and feature learning with classification are the components of the image-based sentiment analysis system, where the visual intensity of the pain emotion on the facial region of a person is employed to predict the NAG, CAG, and OAG classes. Both systems have been implemented individually and evaluated on their respective databases. The performance has been compared with state-of-the-art methods, showing the superiority of the proposed system. Finally, the scores from both systems have been fused to derive the performance of the proposed multimodal sentiment analysis system.

Table 4 Description of text database for the proposed system
Class                                          Sample
Non-aggressive (NAG) (No-pain) C_1             5052
Covertly aggressive (CAG) (Low-pain) C_2       4240
Overtly aggressive (OAG) (High-pain) C_3       2708

Table 10 Performance comparison with competing methods on the UNBC-McMaster shoulder pain database
Method                                         Performance
(Szegedy et al. 2016)                          79.32
Inception-v3 (McNeely-White et al. 2020)       79.64
Werner et al. (2016)                           75.50
Li and Xu (2019)                               76.89
Cambria et al. (2017)                          79.71
Lucey et al. (2011)                            81.80
Proposed                                       82.35
Funding Open access funding provided by Università degli Studi di Salerno within the CRUI-CARE Agreement.