Introduction

People's habits have started to change in recent years, especially for environmental and technological reasons. In line with technological developments, socialization has largely moved to social media applications accessed through smartphones. Beyond liking photos and videos, people now shoot and share videos that reflect their personal experiences. They instantly share their satisfaction with, or appreciation of, a movie or a service received from restaurants, municipalities, businesses, and public institutions. In addition to social media applications such as Twitter and Facebook, they share their ideas as videos on YouTube channels, and they share their experiences and comments on applications such as TripAdvisor and Airbnb. These shares amount to a very large volume of data.

Approximately 7 TB of data on Twitter and approximately 10 TB of data on Facebook are accumulated every day [1]. The inadequacy of the processing power of traditional computer systems and of the classical machine learning methods used to analyze such large amounts of data has led to the emergence of new technologies. While cloud computing technologies are becoming prevalent for higher processing power [2], deep learning methods are becoming prevalent for data analysis. One application area of deep learning on text is the sentiment analysis task of natural language processing (NLP).

Sentiment analysis is the classification of text data to estimate its sentiment polarity using NLP techniques. Tagging comments on social media applications and websites, especially Twitter, as positive or negative, and classifying them with NLP techniques through these tags, is widely used in many areas for different purposes. As a recent technology, deep learning also appears frequently in studies in this field.

Deep learning is a machine learning method based on an improved version of artificial neural networks, which are inspired by the structure and function of the human brain. Unlike classical artificial neural networks, it consists of many more hidden layers. Consequently, more comprehensive and efficient learning can be achieved by creating more complex networks, similar to the human brain. Deep learning methods are used in many areas, such as image processing [3,4,5,6], text classification [7], speech recognition [8, 9], and NLP [10, 11].

Deep learning methods in many areas make use of different and multi-layer neural networks. In one study on the use of multi-layer neural networks, geometric relational features based on distances between joints and selected lines were investigated using a multi-layer Long Short-Term Memory (LSTM) network [12]. In another study, researchers modeled energy consumption using a multi-layer Gated Recurrent Unit (GRU) [13]. A new multi-layer GRU neural network was also used to speed up the calculation of aerodynamic forces [14]. A multi-layer LSTM model was proposed for binary and multi-class data on denial-of-service attacks [15]. A multi-layer Bidirectional LSTM (Bi-LSTM) was used in image processing and achieved better results than single-layer LSTM and GRU models [16]. In a forecasting study on daily power consumption recorded by a US electric company, a multi-layer LSTM model was used and performed better than the traditional single-layer LSTM and Bi-LSTM [17].

New models using multilayer neural networks have been proposed in many areas, from the energy industry to network security to image processing, and they give better results than single-layer neural network models.

In this study, a new multi-layer deep learning model was proposed to investigate the effect of multi-layer and different Artificial Neural Network (ANN) architectures on sentiment analysis performance. In the proposed model (MBi-GRUMCONV), a combination of 6 Bi-GRU layers and 2 Convolution (Conv) layers was used.

The main contributions of this study are as follows:

  • A new model (MBi-GRUMCONV) was proposed that uses multi-layer and different neural network architectures together to achieve the best performance in the sentiment analysis task.

  • For the proposed method, accuracy was investigated to reveal the effect of 3 different vector sizes (100, 200, 300) and 2 different Word2Vec word embedding methods (Skip Gram and Continuous Bag of Words (CBOW)) on sentiment classification performance.

  • The literature indicates that Bi-GRU gives good results on sequential data, while CNN gives better results on image data. Bi-GRU, which offers an up-to-date solution to the vanishing gradient problem of RNNs, was therefore used in a multi-layer configuration, and 2 convolution layers were then added to increase performance.

The rest of the article is organized as follows: Sect. "Related Works" presents a literature review. Sect. "Methodology" introduces the dataset and discusses data preprocessing, word embedding methods, and deep learning methods. The structure and details of the proposed model are given in Sect. "Proposed Model (MBi-GRUMCONV)". Sect. "Experiment Environments" presents the experimental setup, performance metrics, and the results obtained with the proposed architecture, together with comments on them. The sixth and final section gives an overall evaluation of the results and makes recommendations for future research.

Related Works

Existing studies with different approaches to the sentiment analysis (SA) task have been reviewed, and detailed information about them is given below.

Sentiment analysis goes by several names depending on its application area, such as aspect-based analysis, opinion extraction, sentiment mining, subjectivity analysis, and impact analysis. Opinion mining and sentiment analysis are used interchangeably [18], as are the terms sentiment, opinion, and impact. Sentiment classification approaches can be divided into two groups: machine learning approaches and dictionary-based approaches.

Sentiment can be extracted from texts at various levels: document, sentence, aspect, and concept. Dictionary-based methods exist for the concept [19,20,21] and document [22] levels. Document- and sentence-level analyses, however, do not capture the effect of the properties and aspects of the entities mentioned in the text on the sentiment classification; aspect-based methods [18, 23, 24] are used for this purpose. By dividing the text into small parts while preserving the integrity of the sentiment it expresses, aspect-based analysis extracts more detailed and accurate information from the data [25].

Emotion identification studies are also carried out before, or instead of, sentiment classification. Emotions such as happiness, disappointment, anger, and sadness can be determined in more detail, rather than the standard positive, negative, and neutral classes. In systems with an emotion detection feature, these emotion classes allow the user's attitude to be examined as positive or negative, and the mood can be analyzed more clearly so that appropriate actions can be taken [25].

Sentiment analysis approaches, as seen in Fig. 1, can be generalized under four main headings: dictionary-based approaches, machine learning approaches, their hybrid use, and other approaches [20]. In our study, we followed deep learning methods under supervised learning, one of the machine learning approaches.

Fig. 1

Sentiment analysis approaches [26]

In the earliest sentiment analysis studies, frequency-based word representation methods such as Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words were combined with machine learning classifiers such as Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Multinomial Naïve Bayes (MNB), Bernoulli Naïve Bayes (BNB), Logistic Regression (LogR), and Stochastic Gradient Descent (SGD) for sentiment classification. Afterwards, prediction-based methods such as Word2Vec and Global Vectors (GloVe) started to be used. More recently, pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT), alongside deep architectures such as CNN, have been applied to the sentiment analysis task as well as other NLP tasks.

In textual affective computing tasks such as sentiment analysis, supervised machine learning methods such as LogR, MNB, RF, SVM, Recurrent Neural Networks (RNN), LSTM, and CNN have been used [27,28,29,30,31,32,33]. Studies are also carried out with methods such as prompt-based analysis [34,35,36] and graph-based approaches [37,38,39], and neurosymbolic AI [40,41,42] can be mentioned as a new trend in sentiment analysis research. Within the scope of this study, we investigate the performance of using multiple neural networks together in an NLP task after prediction-based Word2Vec embedding.

Among the sentiment analysis studies published in popular databases such as IEEE and Science Direct in the last 5 years, those applied to the IMDB reviews dataset were selected, and the details of these studies are presented in Table 1. The table shows the word representation or embedding methods used in each study, the architectures used, the architecture descriptions, and the accuracy values in detail.

Table 1 Sentiment analysis studies on the IMDB dataset

As shown in Table 1, various deep learning models were created by using the same or different multi-layer neural network architectures together. When these studies are examined, it is seen that the use of multiple layers leads to higher accuracy than single layers. We decided to use a multi-layered architecture to obtain higher performance in our study.

Methodology

In this section, the dataset used in the study is introduced, and preprocessing, word embedding and deep learning methods are discussed.

Dataset

The IMDB dataset, which contains 25,000 positive-tagged and 25,000 negative-tagged movie reviews, is a reliable, balanced, and popular dataset collected by Stanford researchers to be used in sentiment analysis studies [57]. The sentiment class distribution of the dataset is shown in Fig. 2.

Fig. 2

IMDB dataset sentiment class distribution

The use of the IMDB dataset was preferred to compare the results of our proposed architecture with previous studies in the literature. Table 2 shows information about the features of the IMDB dataset.

Table 2 IMDB dataset

In Table 2, a sentiment attribute of 1 indicates a positive sentiment class, and 0 indicates a negative sentiment class. Table 3 shows more detailed information about the dataset.

Table 3 Dataset detailed information

The word cloud of the negative and positive word distribution from the 50,000 comments in the dataset is shown in Fig. 3. As shown in Fig. 3, in the distribution of positive-tagged and negative-tagged data, words such as "good", and "great" are positive-tagged, and words such as "even", and "though" are negative-tagged. Words such as "film", "scene", and "story" are seen both in the positive and negative word clouds.

Fig. 3

Reviews dataset word clouds

Data preprocessing

Text preprocessing is the preparation process for easier processing of texts. It includes processes such as the removal of stop words and special characters. It is used to prepare the data before the NLP classification stage.

To achieve better and more reliable classification results, only the most important words are retained, the existing text is cleaned of HTML tags, and everything is converted into a normalized form. The preprocessing operations were as follows (a minimal code sketch is given after the list):

  1. HTML tags were removed.

  2. Special characters, URLs, and email addresses were removed.

  3. Numbers were removed.

  4. White spaces were removed.

  5. Punctuation was removed.

  6. Tokenization was performed.

  7. All text data were converted to lowercase to eliminate variations in word forms.

  8. Stopwords were removed.

  9. Word2Vec's CBOW and Skip Gram methods were used with vector sizes of 100, 200, and 300 for vectorization.
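The following is a minimal Python sketch of steps 1-8, assuming the NLTK stopword list and tokenizer; the exact regular expressions and cleaning order used in the study may differ.

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(review: str) -> list[str]:
    """Clean one IMDB review and return its list of tokens."""
    text = re.sub(r"<[^>]+>", " ", review)                  # 1. remove HTML tags
    text = re.sub(r"http\S+|www\.\S+|\S+@\S+", " ", text)   # 2. remove URLs and e-mail addresses
    text = re.sub(r"\d+", " ", text)                         # 3. remove numbers
    text = re.sub(r"[^\w\s]", " ", text)                     # 2/5. remove special characters and punctuation
    text = re.sub(r"\s+", " ", text).strip()                 # 4. normalize white space
    tokens = word_tokenize(text.lower())                     # 6/7. tokenize and lowercase
    return [t for t in tokens if t not in STOP_WORDS]        # 8. remove stopwords
```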

Figure 4 shows the outline of the study, including preprocessing the dataset, word embedding operations, splitting the dataset into training, test, and validation sets, feeding them into the proposed classification model, and evaluating the results. In the experiments carried out to examine the performance of the proposed model, the dataset was divided into 80%-20% and 70%-30% training-test sets. Approximately 10% of the training data was used for validation; a code sketch of this split is given after Fig. 4.

Fig. 4

Block diagram of the study
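As an illustration, the train-test-validation split described above can be sketched as follows with scikit-learn; the stratification and random seed are assumptions, as the exact splitting procedure is not specified.

```python
from sklearn.model_selection import train_test_split

# 80%-20% train-test split (use test_size=0.3 for the 70%-30% experiments)
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=42)

# roughly 10% of the training portion is held out for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, stratify=y_train, random_state=42)
```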

Word Embedding

Word embedding is a method in which each word in a sentence is mapped to a d-dimensional vector so that words with close semantic similarity are represented close together in a latent space. This is a more advanced technique than bag-of-words, as it carries more of the sentence's context in a low-dimensional space. The input to neural networks must be numerical; therefore, the text is converted to a numeric representation, a process called text vectorization. Several useful methods exist for text vectorization. One of the most popular techniques for word vectorization using neural networks is Word2Vec, which was developed by Google. Word2Vec uses two methods: Skip Gram and CBOW [58].

CBOW is a word embedding architecture that uses future words as well as past words to create a word embedding [56]. The objective function of the CBOW is given in Eq. (1):

$$J_{\theta }=\frac{1}{T}\sum_{t=1}^{T}\log p\left({w}_{t}\mid {w}_{t-n},\dots ,{w}_{t-1},{w}_{t+1},\dots ,{w}_{t+n}\right)$$
(1)

In the CBOW method, distributed representations of the context are used to predict the word in the middle of the window [58]. A visual view of the CBOW model structure is shown in Fig. 5.

Fig. 5

CBOW model structure

Skip Gram uses the central word to predict the surrounding words [58]. The objective function of the Skip Gram is given in Eq. (2):

$$J_{\theta }=\frac{1}{T}\sum_{t=1}^{T}\sum_{-n\le j\le n,\, j\ne 0}\log p\left({w}_{t+j}\mid {w}_{t}\right)$$
(2)

In the objective function of the Skip Gram given in Eq. (2), the log probabilities of the n surrounding words to the left and right of the target word \(w_{t}\) are summed. The structure of the Skip Gram model is shown in Fig. 6.

Fig. 6

Skip gram model structure

The Word2Vec parameters are as follows:

Size: The dimensionality of the word (feature) vectors.

Window: The maximum distance between the current and predicted words in a sentence.

Workers: The number of worker threads used for parallelization to speed up training.

min_count: The minimum number of occurrences of a word in the corpus for it to be included in the model.

SG: If its value is 0, CBOW is used; otherwise, Skip Gram is used.

Table 4 shows the Word2Vec parameters used in the study.

Table 4 Word2Vec parameters used in the study
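For illustration, a minimal training sketch with gensim (4.x API) is given below; the window, min_count, and workers values are assumptions standing in for the values in Table 4, while the sg flag and the three vector sizes follow the text.

```python
from gensim.models import Word2Vec

# tokenized_reviews: list of token lists produced by the preprocessing step
for sg in (0, 1):                      # 0 = CBOW, 1 = Skip Gram
    for size in (100, 200, 300):       # the three vector sizes compared in the study
        w2v = Word2Vec(
            sentences=tokenized_reviews,
            vector_size=size,          # Size: dimensionality of the word vectors
            window=5,                  # Window: assumed context size
            min_count=2,               # min_count: assumed minimum word frequency
            workers=4,                 # Workers: assumed number of threads
            sg=sg)                     # SG: CBOW (0) or Skip Gram (1)
        w2v.save(f"w2v_{'skipgram' if sg else 'cbow'}_{size}.model")
```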

Deep learning

Deep learning is a machine learning method based on artificial neural networks. It relies on at least one artificial neural network and many algorithms to derive new information from the available data. Deep learning can be carried out in a supervised, semi-supervised, or unsupervised manner. It has been successfully applied to areas such as computer vision, machine vision, speech recognition, natural language processing, voice recognition, social network filtering, and machine translation [59].

RNN

The RNN is a type of neural network that is often used in NLP problems. Successful results have been obtained with it because it can also recall past information. The RNN uses the output of the previous step as the input to the current step, so each output builds on the previous step. It tries to keep the results calculated in previous steps in memory [60].

The RNN architecture has one input layer, two hidden layers, and one output layer. All of these layers work independently. The structures in each layer have weights, and specific threshold values are determined for each layer; in this way, the system gives more realistic results. As a result of these recurrent steps, the previous input state is stored and combined with the newly obtained input value so that the new input is associated with the previous one. A visualization of the RNN architecture is shown in Fig. 6 [60].

Due to the problems of the RNN, various later variants, such as the GRU and Bi-GRU, have been proposed. Information about the GRU and Bi-GRU is given in the following subsections.

Gated Recurrent Unit Network

The GRU is a type of recurrent neural network. The GRU, which consists of only two gates, a reset gate and an update gate, uses the hidden state to transfer information [60].

Update Gate: It decides the information to discard and the new information to include.

Reset Gate: This gate is used to decide how much of the past information is to be forgotten.

GRUs are slightly faster than other types of RNNs since they involve fewer vector operations.

The model structure of the GRU is shown in Fig. 7. For a given time step t, \(x_{t}\in {\mathbb{R}}^{n\times d}\) is the mini-batch input (n: sample count, d: number of inputs), and \(h_{t-1}\in {\mathbb{R}}^{n\times h}\) is the hidden state of the previous step (h: number of hidden units) [61].

Fig. 7

GRU architecture

GRU has Update and Reset Gates that control the flow of information within the unit. The Update Gate decides the information to discard and the information to include [61].

The calculation of the update gate is given in Eq. (3). This gate decides which information to discard and which new information to include, and it helps to capture long-term dependencies in time series.

$$U_{t}=\sigma \left({W}_{U}{h}_{t-1}+{W}_{U}{x}_{t}+{b}_{U}\right)$$
(3)

The calculation of the reset gate is given in Eq. (4). A value close to 0 indicates that the corresponding previous information is forgotten in the current memory content, while a value close to 1 indicates that it is retained [62].

$$R_{t}=\sigma \left({W}_{R}{h}_{t-1}+{W}_{R}{x}_{t}+{b}_{R}\right)$$
(4)

After determining, through the reset and update gates, how much of the memory will be forgotten and how much will be kept at time t, the candidate hidden state in Eq. (5) is obtained by scaling this information with the tanh activation function [63].

$$\tilde{h}_{t}=\mathrm{tanh}\left({W}_{h}{x}_{t}+{W}_{h}\left({R}_{t}*{h}_{t-1}\right)+{b}_{h}\right)$$
(5)

The information stored by the hidden layer at time t is determined by Eq. (6).

$$h_{t}=\left(1-{U}_{t}\right)*{h}_{t-1}+{U}_{t}*\tilde{h}_{t}$$
(6)
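To make Eqs. (3)-(6) concrete, the sketch below implements a single GRU step in NumPy. Unlike the compact notation above, separate weight matrices are used for the input and the hidden state, which is how the gates are parameterized in practice; all names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step following Eqs. (3)-(6).

    x_t: input at time t, shape (d,); h_prev: previous hidden state, shape (h,).
    p holds one input-weight matrix, one hidden-weight matrix, and one bias per gate.
    """
    u_t = sigmoid(p["Wu"] @ x_t + p["Uu"] @ h_prev + p["bu"])             # Eq. (3): update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])             # Eq. (4): reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])  # Eq. (5): candidate state
    return (1.0 - u_t) * h_prev + u_t * h_cand                            # Eq. (6): new hidden state
```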

Bi-GRU

Bidirectional recurrent neural networks bring together two independent GRUs. This structure ensures that the network always has both backward and forward information about the sequence at each step.

The bidirectional GRU processes the input in two directions, one from the past to the future and the other from the future to the past. What distinguishes this approach from the unidirectional one is that the GRU running backward preserves information from the future, and the two hidden states are combined. The Bi-GRU can therefore preserve information from both the past and the future at any point in time. Figure 8 shows the schematic structures of the GRU and Bi-GRU side by side [64].

Fig. 8

a) GRU, b) Bi-GRU
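A short Keras sketch of the bidirectional wrapper is given below to illustrate how the forward and backward hidden states are concatenated; the unit count and tensor sizes are illustrative only.

```python
import tensorflow as tf
from tensorflow.keras import layers

# One GRU reads the sequence left-to-right, a second reads it right-to-left,
# and their hidden states are concatenated at every time step.
bi_gru = layers.Bidirectional(layers.GRU(64, return_sequences=True))

x = tf.random.normal((2, 100, 300))   # (batch, time steps, embedding size), toy values
print(bi_gru(x).shape)                # (2, 100, 128): 64 forward + 64 backward units
```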

CNN

The CNN is a class of deep, feed-forward artificial neural networks most commonly used for image recognition and computer vision. It is a regularized version of the multilayer perceptron. CNNs were developed for image processing and image classification, but they have also been used in text classification tasks. A CNN uses convolution and pooling as its two main processes for feature extraction. The output of the convolution and pooling stages is connected to a fully connected multilayer perceptron [65].

A convolutional neural network consists of an input layer, hidden layers, and an output layer. In any feed-forward neural network, the middle layers are called hidden layers because their inputs and outputs are masked by the activation function and the final convolution. In a convolutional neural network, the hidden layers include the layers that perform the convolutions. Typically, such a layer computes a dot product or other pointwise multiplication of the filter with its input and applies an activation function. This is followed by other layers, such as pooling layers, fully connected layers, and normalization layers [66].

Convolution is a process in which a filter, or kernel, of a predefined size is moved along the text values and multiplied by the corresponding values. The text in the sentences is first converted to numbers using the Word2Vec embedding technique. All the values covered by the filter are then summed to obtain the first feature of the resulting array. The filter moves along the text values based on the stride size, which determines how many steps the filter takes each time it moves. The resulting array is called the convolved feature [67].

Pooling is a technique used to downsample the convolved features. The global max pooling technique was chosen in the present study; it reduces the size of the convolved feature to a single value per filter. In the convolution process, "same" padding was used so that the output array has the same length as the input.
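The following toy sketch illustrates convolution with "same" padding followed by global max pooling on an embedded text sequence; the filter count, kernel size, and tensor sizes are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 100, 300))      # (batch, sequence length, embedding size), toy values

conv = layers.Conv1D(filters=64, kernel_size=3, strides=1,
                     padding="same", activation="relu")
pool = layers.GlobalMaxPooling1D()

features = conv(x)                        # "same" padding keeps the length: (2, 100, 64)
pooled = pool(features)                   # one maximum per filter: (2, 64)
print(features.shape, pooled.shape)
```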

Proposed Model (MBi-GRUMCONV)

In this section, a multi-layer deep learning model is proposed for sentiment classification after the Word2Vec embedding process (Skip Gram and CBOW) with 3 different vector sizes (100, 200, and 300) on the public IMDB reviews dataset. The details of the proposed model, shown in Fig. 9, are discussed below.

Fig. 9

The proposed MBi-GRUMCONV model for sentiment analysis

Input Layer

The input layer is the first stage of the network; it is where the input data used by the model are fed in.

Embedding layer

The embedding layer is the second layer of the model and converts each index corresponding to a specific word in the input sequence into a real-valued vector. Together, these real-valued vectors form the embedding matrix, in which each row represents a unique word in the dictionary. The embedding matrix is of size "d*w", where "d" denotes the dictionary size and "w" denotes the word vector size. Word2Vec (Skip Gram, CBOW) word vectors with sizes of 100, 200, and 300 were used in the study.

Convolution, dropout, and dense layer

The convolution layer convolves its input by computing the pointwise multiplication between all input channels and the filters, according to the structure of the feature map. The dropout layer randomly nullifies the connections of certain neurons to the next layer while leaving others unchanged. The dense layer is a layer of basic artificial neural network neurons.

Output layer

The output layer is the last layer in the model that yields negative or positive prediction results.
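Putting the above layers together, the sketch below is a minimal Keras version consistent with the description of MBi-GRUMCONV (embedding, 6 Bi-GRU layers, 2 convolution layers, global max pooling, dropout, dense, and a sigmoid output). The GRU unit count, filter count, kernel size, dropout rate, dense width, and optimizer are assumptions, since the tuned values of Tables 5 and 6 are not reproduced here.

```python
from tensorflow.keras import layers, models

def build_mbi_grumconv(vocab_size, embedding_dim=300, max_len=200, embedding_matrix=None):
    """Sketch of the MBi-GRUMCONV architecture: 6 Bi-GRU layers followed by 2 Conv layers."""
    model = models.Sequential()
    model.add(layers.Embedding(
        vocab_size, embedding_dim, input_length=max_len,
        weights=[embedding_matrix] if embedding_matrix is not None else None))
    for _ in range(6):                                        # stacked bidirectional GRU layers
        model.add(layers.Bidirectional(layers.GRU(64, return_sequences=True)))
    model.add(layers.Conv1D(64, 3, padding="same", activation="relu"))
    model.add(layers.Conv1D(64, 3, padding="same", activation="relu"))
    model.add(layers.GlobalMaxPooling1D())                    # downsample the convolved features
    model.add(layers.Dropout(0.5))                            # assumed dropout rate
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))          # positive / negative prediction
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Here, embedding_matrix would hold the pre-trained Word2Vec vectors described in the embedding layer subsection.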

Experiment environments

Experiments on the established model were conducted on Google Colaboratory (Colab Pro) [68] using the TensorFlow 2.9.0 [69] and Keras 2.9 [70] libraries and Python version 3.9.13 [71]. The Colab Pro version was used to make the experiments faster without any interruption.

Hyperparameters settings

The ReduceLROnPlateau [72] and early stopping [73] callbacks of Keras [70] were used to obtain a model with a good fit.

If there is no improvement in the validation accuracy during training, the training process is stopped before the specified number of epochs is reached. This process is called early stopping and is performed with the EarlyStopping function in the Keras library. In this study, training was stopped if there was no improvement in the validation accuracy for 5 epochs.

If there is no improvement in the validation loss during training, the learning rate is reduced by multiplying it by a specific coefficient. In this study, if the validation loss did not improve for 5 epochs, the learning rate was multiplied by 0.1.
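A minimal sketch of these two callbacks is given below; the monitored quantities, patience of 5, and factor of 0.1 follow the text, while restore_best_weights, the epoch count, and the batch size are assumptions. X_train and X_val denote the vectorized (padded) review sequences.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # stop training if the validation accuracy does not improve for 5 epochs
    EarlyStopping(monitor="val_accuracy", patience=5, restore_best_weights=True),
    # multiply the learning rate by 0.1 if the validation loss does not improve for 5 epochs
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
]

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=50, batch_size=128, callbacks=callbacks)
```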

TensorBoard is a TensorFlow tool that enables tracking of experimental metrics such as loss and accuracy, model graph visualization, and fast multi-parameter model setup in machine learning studies [69]. Using the TensorBoard library, the parameters in the first column of Table 5 were run with the values in the second column in nested loops to find the optimum value for each parameter. These optimum values are shown in bold in Table 5.

Table 5 Parameters used in the model

Summary information of the entire model determined by TensorBoard and used in the study is presented in Table 6.

Table 6 Model parameter summary information

Experiments

This section gives the results of the experiments with the proposed deep learning model, conducted with 3 different Word2Vec embedding vector sizes (100, 200, 300) on 70%-30% and 80%-20% train-test splits, with a 10% validation split in each case. The performance metrics used to evaluate these experimental results are also described in this section.

Performance metrics

The performance of the models created for sentiment analysis was evaluated using the accuracy criterion. This criterion is obtained as the ratio of True Negatives (TN) and True Positives (TP) to the total number of samples, as given in Eq. (7) [74].

$$\mathrm{ACC}=\frac{{\mathrm{T}}_{\mathrm{P}}+{\mathrm{T}}_{\mathrm{N}}}{{\mathrm{T}}_{\mathrm{P}}+{\mathrm{T}}_{\mathrm{N}}+{\mathrm{F}}_{\mathrm{P}}+{\mathrm{F}}_{\mathrm{N}}}$$
(7)

The F1 score is used when a metric is needed that accounts not only for False Negatives (\(F_{N}\)) or False Positives (\(F_{P}\)) but for all error costs. The F1 score is given in Eq. (8) [74].

$$\mathrm{F}1=2*\left(\frac{\frac{{\mathrm{T}}_{\mathrm{P}}}{{\mathrm{T}}_{\mathrm{p}}+{\mathrm{F}}_{\mathrm{P}}} * \frac{{\mathrm{T}}_{\mathrm{P}}}{{\mathrm{T}}_{\mathrm{p}}+{\mathrm{F}}_{\mathrm{N}}}}{\frac{{\mathrm{T}}_{\mathrm{P}}}{{\mathrm{T}}_{\mathrm{p}}+{\mathrm{F}}_{\mathrm{P}}} + \frac{{\mathrm{T}}_{\mathrm{P}}}{{\mathrm{T}}_{\mathrm{p}}+{\mathrm{F}}_{\mathrm{N}}}}\right)$$
(8)
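As an illustration, both metrics in Eqs. (7) and (8) can be computed on the test predictions with scikit-learn; the 0.5 decision threshold on the sigmoid output is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score

# model.predict returns sigmoid probabilities; threshold them at 0.5 to obtain class labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

acc = accuracy_score(y_test, y_pred)   # Eq. (7): (TP + TN) / (TP + TN + FP + FN)
f1 = f1_score(y_test, y_pred)          # Eq. (8): harmonic mean of precision and recall
print(f"accuracy = {acc:.4f}, F1 = {f1:.4f}")
```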

Experimental results

The results of the experiments are given in Tables 7 and 8.

Table 7 Accuracy values of the proposed model (80%-20%)
Table 8 Accuracy values of the proposed model (70%-30%)

As shown in Table 7, in all dataset splits (train, test, and validation), the performance was found to increase in line with the increase in the vector size in both the Skip Gram and CBOW methods.

For the vector size of 100 in the CBOW and Skip Gram, the Skip Gram has better accuracy performance in the training and validation sets, while the CBOW has better results in the test data set. In the vector size of 200, however, CBOW was better in the training and test datasets, while Skip Gram was better in validation. At a vector size of 300, Skip Gram performed better than CBOW in all three sets (test, train, and validation). The best result of the proposed model was obtained with a vector size of 300 in the Skip Gram embedding in all three data sets.

As shown in Table 8, in all dataset splits (training, testing, and validation), the performance was found to increase with the increase in the vector size in both the Skip Gram and CBOW methods.

For the vector size of 100 in the CBOW and Skip Gram, the Skip Gram has better accuracy performance in the test and validation sets, while the CBOW has better results in the training data set. In the vector size of 200, however, CBOW was better in the training and test datasets, while Skip Gram was better in validation. At a vector size of 300, Skip Gram performed better than CBOW in all three sets (test, train, and validation). The best result of the proposed model was obtained with a vector size of 300 in the Skip Gram embedding in all three data sets.

Considering Tables 7 and 8 together, the 80%-20% train-test split gave better results than the 70%-30% split. In both the 80%-20% and 70%-30% splits (Table 9, Table 10), the performance was found to increase as the vector size increased for both the Skip Gram and CBOW methods. Although the best results in both splits were obtained with the Skip Gram method at a vector size of 300, the 80%-20% train-test split gave the best result for the proposed model.

Table 9 IMDB Dataset 80%-20% train and validation accuracy/loss graphs
Table 10 IMDB Dataset 70%-30% train and validation accuracy/loss graphs

Accuracy and loss graphs for all embedding and vectors of the proposed model are presented in Table 9 and Table 10.

Table 9 shows that Skip Gram is more robust than CBOW in all vector sizes (100, 200, 300).

Table 10 shows that the Skip Gram yields a good fit/robust model, better than CBOW in all vector sizes (100, 200, and 300).

The F1 score values are given in Table 11.

Table 11 F1 score values of the proposed model

As seen in Table 11, the Skip Gram method gave better results than the CBOW embedding method, consistent with the accuracy values.

Comparison of the proposed model (MBi-GRUMCONV) with related works

The experimental results of the proposed model, state-of-the-art models proposed in other studies in the literature, author information, and the comparison of the accuracy results of the test sets are given in Table 12.

Table 12 Comparison of the test accuracy results of the proposed model with those of previous studies

The validation accuracy of the proposed model is compared specifically with the multi-layer models proposed in other studies in Table 13. In these comparisons, it is seen that the proposed model surpasses those of the other studies.

Table 13 Comparison of the validation accuracy results of the proposed model with those of previous studies

Conclusion and Discussion

A new model was proposed based on deep learning with Word2Vec word embedding on the IMDB reviews dataset. Multi-layer Bi-GRU and Conv layers were used as the deep learning methods: the proposed model uses 6 Bi-GRU layers followed by 2 Conv layers. In the experiments, the dataset was split into 80%-20% and 70%-30% training-test sets, with 10% of the training data used for validation, and the results are presented in two separate tables (Table 7, Table 8). As expected, higher accuracy was obtained with the 80%-20% split, which has more training data.

The proposed model was found to have a 95.32% training accuracy, 94.67% validation accuracy, and 95.34% test accuracy with a vector size of 300 in the Word2Vec Skip Gram method. These values yielded a higher performance compared to state-of-the-art studies.

One of the reasons for this performance increase is that Word2Vec works more efficiently with Skip Gram, which sums the log probabilities of the surrounding words to the left and right of the target word, unlike CBOW, which uses distributed representations of the context to predict the word in the middle of the window. Another reason is that increasing the vector dimension has a positive effect on the classification.

According to the results of both the literature review and the proposed model, multi-layer and hybrid models give better results than single-layer neural network models. Researchers in this field are therefore encouraged to use neural network architectures in multi-layer and hybrid structures.

Future studies should investigate the effects of different deep learning models on sentiment analysis performance with different word representation methods and different classifiers. In future work, we will examine hybrid approaches to investigate the effects of different learning techniques, such as supervised and semi-supervised learning methods, on enhancing sentiment analysis accuracy.