1 Introduction

In today's day and age, popular social media platforms such as Instagram, Twitter and WhatsApp give us a fighting chance against the menace of stress and anxiety by raising awareness and sharing resources to help those in need. At the same time, social media directly affects a person's mental health [12]. Studies have shown that, along with physical fitness, mental fitness is just as crucial for the healthy functioning of a human being; hence, equally focused treatment for those who feel weak or mentally distressed is the need of the hour. The COVID-19 pandemic has taken a serious toll on mental health, with people confined to their homes and cut off from the world and normal interactions. There is thus a growing need for ways to easily identify and prevent mental health issues, along with increasing access to mental health services [24]. First aid for mental health is neither as popular nor as developed as first aid for physical health, yet providing it can help ease the symptoms a person may be experiencing [17].

To address these issues, this project set out to create a platform that would help people assess their condition and mental health more extensively and take any necessary precautions if warranted. Such a platform would not only give people an efficient way to conduct precursory psychiatric diagnostics, but would also play a big role in raising awareness. The platform enables this via sentiment analysis using audio and video. Analysis based on audio or video alone is not sufficient, since a person expresses themselves not just through words but also through facial expressions and body language; the face often tells a different story than the words. By listening to a person without looking at them, one can understand what they say but cannot fully gauge their feelings. Hence, on this platform a person answers a set of questions, and their responses are used to analyse their immediate mood and emotions. The audio is converted to text and then processed to perform sentiment analysis, categorising the mood throughout the session. Alongside this, OpenCV is used to detect facial emotions through facial recognition. Combining both results gives a report of the person's state of mind that can be used for further diagnosis.

The motivation of this paper is to conduct thorough research into the different studies in the discipline of sentiment analysis based on audio, video and text input. The premise of this paper is based on research work titled 'Psychiatric Precursory Diagnostic Test', conducted as part of a college thesis, wherein a cursory test was created from the perspective of a counsellor who assesses the day-to-day mood and attitude of a user through targeted questions and, based on the results produced, recommends appropriate activities to help the user. In the interest of covering all bases, this study presents both a technological and a psychological point of view.

This research was predicated on a simple idea: to digitise the prognosis conducted by a psychiatric professional and make that technology easily available to users. The stigma around mental health is a major issue preventing people from seeking the help they require [16]. The challenges faced while obtaining mental health services are alone enough for a person to disregard their health. The aim of this project is to provide a reliable resource that a person feels comfortable using in their day-to-day life without facing that stigma. While replicating the exact behaviour and capabilities of a professional might not be prudent, the primary objective is to develop a model that can, at the very least, perform accurate sentiment analysis and depict a user's mood. Much like a psychiatrist who converses with the user and draws conclusions from their behaviour, our system uses speech, text and video input to identify the user's mood and accordingly suggests activities for the user to indulge in. The research stems from this project, wherein the authors have surveyed the extensive studies performed over the past few decades and drawn out the most effective machine learning models to facilitate the entire project.

This paper is structured in sections so as to present the information in an ordered manner. Section 2 describes the datasets used to train the sentiment analysis model and the chatbot model. Section 3 comprises a diligent literature review of the work of various authors in the field of sentiment analysis, contrasting their approaches. Section 4 covers the methodology of the model, encapsulating the specific details about the methods, functions and libraries used for the different models in the project. Section 5 presents the various findings, results and observations gathered through this project. Section 6 concludes the project and the research conducted for it. Finally, the last section of the paper contains the references and citations to previous studies.

2 Datasets

Through this study, a myriad of different techniques have been explored in order to identify those that best cater to the domain of sentiment and semantic analysis. In this pursuit, the authors developed unique models of their own that yielded remarkable accuracies on multimedia inputs. To train these models, datasets were chosen that could ensure a plethora of different options for holistic coverage of most bases in the model. For textual sentiment analysis, which takes input from the user either directly as text or via speech-to-text conversion, the Amazon Reviews dataset was employed. Meanwhile, for the chatbot model, a custom dataset was created in the .json format consisting of several relevant tags and responses.

2.1 Text sentiment analysis

Amazon Customer Reviews is one of Amazon's richest databases, containing over a hundred million reviews collected across two decades in which customers express their views on products. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR) and Machine Learning (ML), amongst others.

The dataset contains the customer review text with accompanying metadata, consisting of three major components:

  • Reviews offering insight into the properties (and the evolution) of customer reviews, including how people evaluate and express their experiences with products at scale.

  • Reviews in multiple languages to analyse perception of the same product across different languages and countries.

  • Non-compliant reviews with respect to Amazon policies helping flag unwanted and biased reviews.

2.2 Chatbot – custom dataset

A sophisticated chatbot was developed that is capable of carrying out intelligent conversations with a user. To achieve this, we created a custom dataset in a .json file. It contains 9 primary tags ("Welcome", "Exit", "Thanks", "Incorrectinput", "Inspiration", "Support", "Sad", "Happy", "Joke") which enable the grouping of the distinct types of inputs given by the user, so that the bot can reply appropriately. The inputs given by users are defined as patterns, and the replies given by our bot are defined as responses. With this dataset, the chatbot was trained to our customisations in order to give our users an interactive and satisfying experience.
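For illustration, the sketch below shows what such an intents file and its loading code might look like. The tags are those listed above, while the example patterns and responses are hypothetical stand-ins for the actual dataset entries.

```python
import json

# Illustrative excerpt of the custom intents dataset; the real patterns and
# responses in the project's .json file may differ.
intents = {
    "intents": [
        {"tag": "Welcome",
         "patterns": ["Hi", "Hello", "Good morning"],
         "responses": ["Hello! How are you feeling today?"]},
        {"tag": "Sad",
         "patterns": ["I feel low", "I am sad"],
         "responses": ["I'm sorry to hear that. Do you want to talk about it?"]},
        {"tag": "Exit",
         "patterns": ["Bye", "See you"],
         "responses": ["Goodbye, take care of yourself!"]},
    ]
}

# Persist and reload the dataset the way a chatbot trainer would consume it.
with open("intents.json", "w") as f:
    json.dump(intents, f, indent=2)

with open("intents.json") as f:
    data = json.load(f)

for intent in data["intents"]:
    print(intent["tag"], "->", len(intent["patterns"]), "patterns")
```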

3 Literature review

The 21st century marked the advent of the digital age, which has gathered unparalleled pace in its first two decades, with technological advancements that address many of our problems. People around the globe hold phones in the palms of their hands that facilitate communication; we have access to the World Wide Web, which answers our questions, and the fastest vehicles at our disposal to facilitate travel. Machines grow smarter by the day to cater to us humans and make our lives easier. The field of teaching computers to perform certain tasks using previously created data is known as Machine Learning. One major sub-discipline of this field is Sentiment Analysis, wherein a machine is taught to study and recognise the different human emotions. This task is achieved through proper analysis of multimedia inputs such as text, audio or video.

Over the past few decades, meticulous analysis and feature extraction have been conducted on multimedia inputs to classify them on the basis of the sentiment or emotion exhibited. A cogent analysis of these studies presents a distinguished comparison between the different techniques used by professionals across the globe. One such comparison is presented in Table 1 below, where different models employed for video, audio and text-based sentiment analysis are examined.

Table 1 Studies for sentiment analysis on video, audio and text-based inputs

In the extensive study conducted by Rao et al. [22], techniques such as Support Vector Machines (SVMs), Decision Trees and OpenCV were employed for text, audio and video-based inputs, respectively, obtaining an accuracy of about 70% in identifying a total of six emotions. Secondly, in the paper by Li et al. [14], an SVM model was used for sentiment analysis of video-based input on the MOUD [21] and CMU-MOSI [29] datasets, attaining accuracies of 63.9% and 71.1%, respectively. In the paper by Schmidt et al. [25], sentiment analysis is conducted on the textual and audio (audiobook) versions of historical German plays, with Emilia Galotti by G.E. Lessing taken as the base case. This study employed the naïve lexicon method for text and the free Vokaturi tool for audio-based analysis, reporting substantial accuracy for both models.

Gaikwad et al. [9] employed models such as Support Vector Machines, Naïve Bayes and K-Nearest Neighbours on text-based inputs and achieved excellent levels of accuracy, with the SVM model proving the most accurate at 82%. Raza et al. [23] used a total of six different models for sentiment analysis of textual data, obtaining remarkable accuracies. Singh et al. [26] used models such as Naïve Bayes, J48, BFTree and OneR for sentiment analysis on the Amazon Reviews and IMDB movie reviews datasets, with the OneR model giving the best accuracy. In their paper, Stappen et al. [27] used video transcriptions of the MuSe-CAR dataset for sentiment analysis and topic recognition using a support vector machine model, which gave an accuracy of 66.16%. Bhuiyan et al. [4] used NLP techniques to analyse YouTube user comments to retrieve relevant videos and attained an accuracy of about 75%. Agarwal et al. [1] used Recurrent Neural Networks with variants such as LSTMs and GRUs to perform multimodal sentiment analysis on the CMU-MOSI dataset. Kaushik et al. [13] employed a Maximum Entropy (ME) model to develop a text-based emotion detection model that uses decoded speech transcripts of YouTube videos.

Chu et al. [6] employed an audio-visual approach to sentiment analysis, applying sophisticated models to the Spotify dataset and a vast collection of movie clips, obtaining an AUC of 0.652. Sentiment analysis of news videos was conducted by Pereira et al. [19] based on the audio, visual and textual features of the videos, using a myriad of ML techniques and achieving an accuracy of 75%. Luo et al. [15] used a parallel combination of LSTM and CNN based networks to conduct audio-based sentiment detection on the MOSI dataset. The Naïve Bayes algorithm was used on a Twitter dataset by Parveen et al. [18] for sentiment analysis, yielding an accuracy of 57%. In the research conducted by Ezzat et al. [8], text-based classification was performed using speech-to-text conversions of a set of call centre audio conversations; of the plethora of techniques used, the SVM model yielded the highest accuracy of 94.4%.

In their paper, Al-Azani et al. [3] fused textual, auditory and visual data for sentiment analysis on the MOSI, MOUD and IEMOCAP datasets by developing SVM and Logistic Regression based classification models. The paper by Rosas et al. [20] explores multimodal sentiment analysis of Spanish videos available online using a support vector machine model that yielded an overall accuracy of 64.86%. Poria et al. [5] conducted multimodal emotion analysis using an LSTM-based model on user-generated videos and on the MOUD, MOSI and IEMOCAP datasets, obtaining remarkable accuracies on each. Lastly, in the study conducted by Gautam et al. [10], Twitter data was used for sentiment analysis with models based on the Naïve Bayes algorithm, SVM and Maximum Entropy, and WordNet was employed for semantic analysis. This study found that the Naïve Bayes model gave the highest accuracy for sentiment analysis, while WordNet gave an accuracy of 89.9% for semantic analysis.

4 Methodology

The core principle behind the research work is sentiment analysis using audio and video. For better and more reliable results, multiple inputs are considered. The audio input is converted to text and then processed to perform sentiment analysis, categorising the mood throughout the session. Alongside this, OpenCV is used to detect facial emotions through facial recognition. Combining the results obtained from both inputs gives a report of the person's state of mind that can be used for further diagnosis.
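As a concrete illustration of the speech-to-text step, the following minimal sketch uses the third-party SpeechRecognition package's interface to Google's free web speech API. The package choice and the file name response.wav are assumptions for illustration, not a description of the exact implementation.

```python
import speech_recognition as sr  # pip install SpeechRecognition

recognizer = sr.Recognizer()

# "response.wav" is a hypothetical recording of the user's spoken answer.
with sr.AudioFile("response.wav") as source:
    audio = recognizer.record(source)

try:
    # recognize_google() calls Google's free web speech API under the hood.
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
except sr.RequestError as e:
    print("API request failed:", e)
```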

This section describes the implementation of the system predicated on the premise of sentiment analysis, developed after thorough research into the different ML techniques employed over the past few decades for real-time sentiment analysis. The following subsections cover the four primary points of focus of this project and the expected outcomes for each. The four primary points of focus are:

  1. Facial Emotion Recognition – Video-Based Analysis: MTCNN Model

  2. ChatBot Model – NLP Model

  3. Sentiment Analysis – Text Input Analysis: Support Vector Machines Model

  4. Speech-to-Text Model – Audio Conversion to Text for Sentiment Analysis: Google API

Using these four aspects, the proposed project was implemented in the form of a user-friendly platform that caters directly to the user by asking targeted questions and providing a quick-fix solution.

This section encapsulates all the specific details about the methods, functions and libraries used for the different models used in the project.

4.1 Sentiment analysis

Being a sub-discipline of Natural Language Processing (NLP), sentiment analysis refers to the examination and identification of natural-language user inputs in order to predict the positive, negative or neutral inclination of the input [2]. This helps users find the true sentiment, which in turn helps them comprehend the real meaning of a given text. It has become prominent in the past few years as more and more companies explore this domain to identify their users' responses to their products, allowing them to better tailor their products to the needs of customers. It can also be used to gauge the general reaction of netizens to certain topics or news stories: whether the outcome carries a positive or negative emotion, or barely affects anyone at all.

The main aim of every sentiment analysis is to find whether the given data is positive, negative or neutral. But as researchers delve deeper into studying the underlying emotions of a human being using machine learning, they are also focusing on finer emotions: whether the data indicates the user is happy, cheerful, sad, sorry, and so on. Using a lexicon is an efficient way of determining this range of emotions with the help of neural networks; a lexicon is a list mapping certain words to the emotions they correspond to.

For this research, after a thorough study, a supervised machine learning algorithm, Support Vector Machines (SVMs) [7], was employed; it is one of the most commonly used classification and regression techniques. An SVM is a sophisticated model predicated on finding a hyperplane in N-dimensional space that segregates the underlying data into its respective classes. Figure 1 shows a two-dimensional representation of an SVM model, with a hyperplane separating the data points into two classes. The main tuning parameter of an SVM is the kernel: among the several types, the RBF kernel is mostly used for non-linear problems, while the linear kernel is used for linear classification problems.

Fig. 1 Two-dimensional hyperplane: linear classification using SVM

Using the sklearn library, a sophisticated SVM model was developed on the previously vectorised data. The model yielded an F1-score of 91%, a recall of 0.91 and a precision of 0.91, demonstrating how efficiently the technique classifies the data into positive, negative or neutral. The F1-score is a statistical measure of a model's performance, given by the harmonic mean of precision and recall, as in the formula below:

$$\mathrm{F1\ score}=2\times \left(\frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}\right)$$

The model was initially trained and tested on the Amazon Reviews dataset, which comprises thousands of user inputs of different natures, and was saved with the Pickle library so it can be loaded in future sessions. On testing the model with custom user inputs, an accuracy of over 91% was obtained. Given its forthcoming employment for the overall sentiment analysis of the chatbot inputs, the model performed ideally and can be used effectively across several instances.
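The sketch below illustrates a pipeline of this kind with sklearn: TF-IDF vectorisation followed by a linear-kernel SVM, with the trained model pickled for later sessions. The toy texts and labels are hypothetical stand-ins for the Amazon Reviews data; the exact vectorisation and parameters used in the project may differ.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the review texts and their
# positive/negative/neutral labels.
texts = ["Great product, works perfectly", "Terrible, broke in a day",
         "It is okay, nothing special"] * 100
labels = ["positive", "negative", "neutral"] * 100

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# Vectorise the text, then fit a linear SVM, mirroring the
# vectorisation-then-SVM setup described above.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))

# Save the trained model for future sessions, as done with Pickle in the paper.
with open("sentiment_svm.pkl", "wb") as f:
    pickle.dump(model, f)
```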

4.2 NLTK

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that we can use for NLP.

A lot of the data that could be analysed is unstructured and contains human-readable text. Therefore, before programmatic analysis, the data first needs to be pre-processed. The following are the steps involved in pre-processing, which allow us to feed meaningful and efficient data into the model.

4.2.1 Tokenizing

Tokenizing splits our text data into smaller, processable pieces. It makes the text easier to crunch, allowing us to work with modest bits of text that remain reasonably meaningful even outside the context of the remainder of the text. It is the first step in the pipeline, converting enormous unstructured data into easily processable, algorithm-friendly structured data (Table 2).

Table 2 Tokenized words
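A minimal NLTK tokenization sketch; the sample sentence is illustrative.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

sentence = "I am feeling a little anxious about tomorrow."
tokens = word_tokenize(sentence)
print(tokens)
# ['I', 'am', 'feeling', 'a', 'little', 'anxious', 'about', 'tomorrow', '.']
```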

4.2.2 Stop words

Removing stop words is a crucial step in our pipeline: it filters out words that are not essential to process during analysis. This streamlines the process and reduces the number of words to be processed, making the analysis faster and more efficient. Frequently used words like 'i', 'am' and 'to', which do not really contribute to finding the emotion of a message, are examples of stop words scrubbed out in the pipeline (Table 3).

Table 3 Before and after stop words
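A short sketch of stop-word filtering using NLTK's built-in English stop-word list; the sample sentence is illustrative.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("i am really trying to stay calm today")
# Drop any token that appears in the stop-word list.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # e.g. ['really', 'trying', 'stay', 'calm', 'today']
```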

4.3 Stemming

Stemming is one of the important processes that allows users to extract meaningful data from given text. It takes tokenized words and extracts the root or base word from the different variants of a similar word. This collapses the majority of similar words having the same meaning, making the training process much faster and more efficient, and it is one of the main parts of the pipelining process. The main obstacle in using this technique is guarding against under-stemming and over-stemming (Table 4):

  • Under-stemming usually occurs when related words are not condensed to the same stem even though they are similar, resulting in a false negative.

  • Over-stemming is encountered when related words are condensed to the same stem even if they are not remotely related, resulting in a false positive.

Table 4 Before and after stemming words
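A brief illustration with NLTK's PorterStemmer, which also shows the under- and over-stemming pitfalls mentioned above; the word list is illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["feeling", "feels", "felt", "happily", "happiness"]
print([stemmer.stem(w) for w in words])
# ['feel', 'feel', 'felt', 'happili', 'happi']
# 'felt' is left unstemmed (under-stemming), while 'happily' and
# 'happiness' collapse to non-words, a mild form of over-stemming.
```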

4.3.1 Lemmatization

Lemmatization is another step in the pipeline in which words are grouped: the inflected forms of a word are collapsed and then processed as a single item. Although it resembles stemming, it is a crucial step, as it brings context to words by linking together words that have similar meanings.
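A minimal NLTK lemmatization sketch; note how supplying a part-of-speech tag gives the lemmatizer the context that a stemmer lacks. The example words are illustrative.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
# Passing the part of speech ("v" for verb, "a" for adjective) lets the
# lemmatizer resolve irregular forms that stemming cannot.
print(lemmatizer.lemmatize("felt", pos="v"))     # 'feel'
print(lemmatizer.lemmatize("worries", pos="v"))  # 'worry'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```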

4.3.2 Quantifying emotions

An elaborate dataset was created containing a copious number of words and the emotion attached to each. After the meaningful words are extracted from the text, each word is compared against this database, which reveals the emotion hidden behind the text. After the words and their emotions have been extracted, the text is run through a Counter, which quantifies the emotions present. Figure 2 shows a plot of the magnitude of emotions detected in a sample video fed into the classifier.

Fig. 2 Emotions detected in a sample video
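A minimal sketch of this counting step, using a tiny hypothetical word-to-emotion lexicon in place of the elaborate dataset described above.

```python
from collections import Counter

# A tiny, hypothetical word-to-emotion lexicon standing in for the
# elaborate dataset described above.
emotion_lexicon = {
    "happy": "joy", "glad": "joy", "cry": "sadness",
    "alone": "sadness", "afraid": "fear", "angry": "anger",
}

# Illustrative tokens as they might emerge from the pre-processing pipeline.
tokens = ["happy", "glad", "alone", "afraid", "happy"]

emotions = [emotion_lexicon[t] for t in tokens if t in emotion_lexicon]
print(Counter(emotions))  # Counter({'joy': 3, 'sadness': 1, 'fear': 1})
```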

4.4 Facial emotion recognition

Emotions are the biggest driving force in any decision making. It is often said that a person might be saying one thing while their face says another; facial expressions provide a lot of insight into a person's mood and emotions. Two models were considered for facial emotion recognition (FER): the Multi-task Cascaded Convolutional Network (MTCNN) and the Haar Cascade classifier.

MTCNN is a neural network model used to detect faces. It consists of three convolutional networks cascaded together, hence the name, and it localises facial features such as the eyes, mouth and nose. Haar Cascade is a well-known face detection model that uses edge and line detection to find facial features. In recent years, Convolutional Neural Networks (CNNs) have increased the accuracy of face detection manifold [28].

The process for facial emotion recognition is the same for images and videos, except for an additional step in the case of videos: a video is, in essence, a series of images in motion. The video is split into separate images, and each image is passed through the emotion recognition algorithm to detect the emotions in the frame and their magnitudes on a scale from 0 to 1. After this, an output video is generated with a box drawn around the face and the live emotion predictions and their magnitudes overlaid. Figure 3 represents this processing of a video input by splitting it into frames that are then passed to the classifier for sentiment analysis.

Fig. 3 Pictorial representation of video input processing for sentiment analysis
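To make the frame-by-frame pipeline concrete, the sketch below samples frames from a video with OpenCV and scores each sampled frame with the third-party fer package, which wraps an MTCNN face detector and returns emotion magnitudes on a 0 to 1 scale. The package choice and the file name session.mp4 are illustrative assumptions, not the project's exact implementation.

```python
import cv2           # pip install opencv-python
from fer import FER  # pip install fer; wraps an MTCNN face detector

detector = FER(mtcnn=True)
cap = cv2.VideoCapture("session.mp4")        # hypothetical recorded session
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if metadata is missing
step = max(int(fps), 1)                      # sample about one frame per second

frame_scores = []  # one dict of emotion magnitudes (0 to 1) per sampled frame
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        faces = detector.detect_emotions(frame)
        if faces:  # keep the first detected face's emotion scores
            frame_scores.append({"t": frame_idx / fps, **faces[0]["emotions"]})
    frame_idx += 1
cap.release()
print(f"Scored {len(frame_scores)} sampled frames")
```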

The pandas library is used to analyse the results of all frames, and their output is plotted against time using matplotlib for the complete duration of the video. Further analysis is drawn by taking the sum of all the emotions detected in the video and forming a table of emotions and their values.
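A sketch of this aggregation step with pandas and matplotlib; the per-frame scores below are hypothetical stand-ins for the detector output described above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-frame emotion scores of the kind produced by the
# frame-sampling sketch above.
frame_scores = [
    {"t": 0.0, "happy": 0.10, "sad": 0.60, "fear": 0.20},
    {"t": 1.0, "happy": 0.55, "sad": 0.25, "fear": 0.10},
    {"t": 2.0, "happy": 0.80, "sad": 0.05, "fear": 0.05},
]

df = pd.DataFrame(frame_scores).set_index("t")

# Plot each emotion against time, as in Fig. 5.
ax = df.plot(figsize=(10, 4))
ax.set_xlabel("Time (s)")
ax.set_ylabel("Emotion magnitude (0 to 1)")
plt.savefig("emotions_over_time.png")

# Summing the per-frame magnitudes yields the overall emotion table
# (cf. Table 5).
print(df.sum().sort_values(ascending=False))
```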

The text, speech and video input models together constitute the sentiment analysis model, which is capable of producing reliable, cogently justified outputs after thorough analysis of the user's input. Figure 4 shows a flowchart of the entire project, explaining the process of sentiment analysis and input processing in a simple manner.

Fig. 4 Simple flowchart representation of the sentiment analysis model developed by the authors

5 Result

The model for textual sentiment analysis yielded an F1-score of 91%, a recall of 0.91 and a precision of 0.91, demonstrating how efficiently the technique classifies the data into positive, negative or neutral.

The result of the video analysis is a graph of emotions plotted against time. The magnitude of each emotion is plotted on a scale of 0 to 1: the X-axis represents the timespan of the video, the Y-axis represents the magnitude of emotion, and assorted colours signify the different emotions. Until around the 200-second mark, a mix of emotions is detected in the video; from there until the 400-second mark, a strong 'happy' signal is detected, which could be cross-verified from the video; after the 400-second mark, a mix of emotions is observed again. This can be corroborated from Fig. 5 below, in which lines of different colours trace the magnitudes of the emotions in the sample video fed to the classifier. Furthermore, Table 5 shows that the most dominant emotion in the test video is 'happy', followed by 'fear', both of which were visible in the video; it can thus be concluded that an accurate analysis has been made.

Fig. 5 Emotions plotted against time

Table 5 Magnitude of the emotions in quantifiable values obtained as output from the MTCNN classifier

MTCNN [11], which is CNN-based, provides better results than the Haar Cascade classifier. Another advantage of MTCNN is that it draws an arbitrary rectangle around the face for better detection, whereas Haar Cascade produces a square.

Having created this system, we were able to observe the performance of our model and how it produced reliable results from live user input. After thorough research into previous studies in this domain, we observed significant improvements in the outputs and the overall accuracy/F1-score of the models. While the papers focusing on NLP worked only with pre-existing datasets, our model produced accurate responses and predictions based on a user's natural-language text input. Furthermore, the live video input from a user was broken into frames for cogent analysis, with complete processing of each frame to identify a sentiment over a period of time. The results yielded by this part of the model, on top of the text and speech analysis, were testaments to the excellent performance of the classifier. Thus, it can be stated confidently that the extensive, manifold model developed by the authors improves on several past studies: it takes multimedia input (text, audio and video) from a user, processes it in little to no time, and predicts the sentiments being expressed to a high level of accuracy.

6 Conclusion

The disciplines of Machine Learning and Deep Learning have found prolific applications in the field of semantics and sentiment analysis. Interpretation of emotions and responses by computers helps not just developers but professionals across various domains. Such recognition can allow counsellors, and indeed users themselves, to identify and keep track of their daily moods.

Through this paper, we have conducted thorough research into the different ML techniques that have been employed over the past few decades for real-time sentiment analysis. Moreover, this paper was predicated on a project developed to allow users to track their day-to-day moods and get appropriate recommendations for alleviating a stressful or anxious state of being. This manifold project comprises the following parts, which encapsulate three different ML models, namely:

  1. Facial Emotion Recognition – Video-Based Analysis: MTCNN Model

  2. ChatBot Model – NLP Model

  3. Sentiment Analysis – Text Input Analysis: Support Vector Machines Model

  4. Speech-to-Text Model – Audio Conversion to Text for Sentiment Analysis: Google API

Using these models, we created a user-friendly platform that caters directly to the user by asking targeted questions and providing a quick-fix solution. Such a system can be adapted for several different purposes, such as communication through text, audio or video, and even several applications in the fields of medicine and psychology. In the near future, the aim is to develop a scalable mobile application for Android and iOS users, so that they can assess and keep track of their moods and reactions on a regular basis and even use this model to interpret the tone and mood of the person they are communicating with.