1 Introduction

The multifaceted ramifications of sentiment analysis across diverse domains such as psychology, politics, education, and marketing have catalyzed significant scholarly interest within natural language processing (NLP). Computational examination of textual expressions, emotions, and sentiments sheds light on the collective sentiments and perceptions held by individuals toward various subjects [1,2,3,4]. The burgeoning expansion of digital platforms, notably social media, has precipitated an unprecedented influx of textual data. Consequently, sentiment analysis assumes heightened significance in contemporary discourse, reflecting the imperative to glean insights from this vast reservoir of information (e.g., Facebook [5], Twitter (X) [2], forums, and review sites). Nevertheless, substantial obstacles remain due to the intricacy of human emotion and natural language. The importance of context and semantic subtleties in conveying emotions is especially apparent in morphologically rich languages such as Turkish [6].

The use of embedding techniques, which capture semantic associations in text, has been a game-changer for sentiment analysis thanks to recent deep learning developments. Word embedding (WE) is one such technique: it converts textual input into numerical representations that neural networks can handle efficiently. For Turkish, these techniques are essential for precise sentiment analysis, particularly given the prevalence of brief texts such as tweets and internet comments. Combining embedding approaches with deep learning architectures such as CNNs and LSTMs can greatly improve sentiment analysis performance, as shown in studies such as [7].

Within the area of deep learning-based Turkish short-text sentiment analysis, our work methodically assesses the efficacy of different embedding techniques. Importantly, this study adds to our knowledge of how various embedding methods affect the efficacy and precision of sentiment analysis in a morphologically complex language. Researchers, companies, and governments depend on sentiment analysis to understand public opinion and consumer behavior. Our goal is to investigate the efficacy of embedding techniques for Turkish texts. Our study therefore not only addresses a technical problem but also has wider ramifications for areas where public opinion is very important: in marketing, precise sentiment analysis can shed light on customer tastes and trends; in politics, it can show how public opinion is changing. In this way, our work enhances the precision and effectiveness of sentiment analysis for Turkish texts, adding to the growing body of knowledge on the complex meaning of digital communications, which is crucial for decision-making across industries.

This paper provides significant advancements in natural language processing, particularly in sentiment analysis of Turkish short texts. The contributions are listed below:

  • This paper introduces a unique character one-hot encoding embedding (COE) technique using a set of 129 characters, previously unexplored in Turkish text analysis.

  • This work evaluates three prominent pre-trained word embedding algorithms, namely Word2Vec, GloVe, and FastText, in addition to two character-level embeddings and a hybrid character-level and word-level embedding technique, for Turkish short-text sentiment analysis, assessing their efficacy and constraints in this context.

  • This study proposes a hybrid word-embedding and character-embedding method that performed promisingly on Turkish short-text classification.

  • This study demonstrates the usefulness of integrating embedding approaches with deep learning architectures, namely CNN, LSTM, BiLSTM, and hybrid models (i.e., CNN-LSTM, CNN-BiLSTM, and BiLSTM-CNN), for sentiment analysis.

  • Unlike earlier works in Turkish sentiment analysis, this study conducts a comparative analysis of embedding techniques with six deep learning architectures on two datasets in two rounds: one incorporating cross-validation and the other using the train-test-split method. The aim is to provide detailed knowledge of the performance of each combination in terms of accuracy, computing efficiency, and linguistic sensitivity.

  • The results of this research have practical implications for governments and companies that want to use sentiment analysis to make informed decisions, particularly in areas where Turkish is commonly spoken.

This study is structured to methodically examine the efficacy of deep learning (DL) models utilizing embedding strategies for sentiment analysis of Turkish short texts. The subsequent section (Sect. 2) presents a literature review, offering a synthesis of prior research endeavors and their respective methodologies. Section 3 furnishes a theoretical foundation for the embedding techniques and DL-based models discussed herein. The Methodology section (Sect. 4) comprehensively delineates the experimental design, data processing methodologies, and model architectures employed in this study. Subsequently, Section 5, entitled Experimental Results, juxtaposes the findings of each model incorporating distinct embedding techniques. Section 6, designated as the Discussion section, elucidates interpretations of the observed results. Finally, the Conclusion and Further Works section succinctly encapsulates the key findings, draws conclusions, and outlines avenues for future research.

2 Related Works

The field of deep learning-based sentiment analysis of Turkish texts is a rich source of research and methodologies covering a wide range of topics such as education [8], social media analytics [9], and business [10]. It is therefore important to examine the key contributions that have shaped the present landscape of embedding methods and deep learning approaches in this sector.

Word embedding (WE) plays a crucial role in natural language processing because it provides a method for representing words in a dense vector space. This format can capture semantic meanings as well as connections between words, enabling deep learning models to handle text more efficiently. The efficacy of word embedding has been shown by studies such as Zhang et al. in handling languages such as Modern Standard Arabic (MSA), which are characterized by the prevalence of dialectal variants. GloVe and Word2Vec are examples of embeddings that have become commonplace in many natural language processing applications. These embedding techniques improve the performance of models in tasks such as text classification and sentiment analysis [11].

Table 1 The recent results for Turkish sentiment analysis

In the field of advanced deep learning (DL) and embedding approaches, Dogan and Kaya [7] made a major contribution in 2019 by proving the efficacy of deep learning approaches, notably semantic context word embedding and latent semantic analysis (LSA), in analyzing sentiments in Turkish social networks. Their study reported a success rate of 93%.

To enhance the effectiveness of artificial intelligence-based sentiment analysis models, attempts to use text segmentation have been made. The paper [17] focused on the impact of several segmentation techniques, including morphological, sub-word, tokenization, and hybrid approaches, on the sentiment analysis of informal Turkish texts. The authors observed that the performance of sentiment analysis is greatly improved when these approaches are employed in combination with deep neural networks, highlighting the crucial role that text segmentation plays in managing the morphological complexity of Turkish [17]. Another study was conducted by Karakus et al. [18] to evaluate deep learning models; however, only the skip-gram Word2Vec method was used. The study concluded that the CNN-LSTM model with pre-trained word embedding performed better on a dataset containing 44k reviews. In 2021, the study [9] demonstrated the effectiveness of recurrent neural networks with word embedding for sentiment analysis on Turkish Facebook data. The findings showed that the RNN achieved the highest accuracy among the models tested, highlighting the potential of recurrent neural networks (RNNs) to capture the subtleties of sentiment in Turkish textual data [9].

The paper [12] investigated the use of LSTM units in RNNs for sentiment analysis in Turkish, focusing on the role of LSTM in capturing long-term dependencies. In terms of accuracy, the authors found that the LSTM-based model outperformed classic machine learning approaches such as logistic regression and Naive Bayes, achieving an accuracy of 83.3%. According to Ciftci et al. [12], these results demonstrate the importance of LSTM in managing the sequential and context-dependent character of language for sentiment analysis.

In 2019, Akin and Yıldız introduced a transfer learning model that used word embedding techniques trained on Turkish Wikipedia and an LSTM with dropout for sentiment analysis. According to [14], the proposed model demonstrated its effectiveness in analyzing sentiments in Turkish-language restaurant and product reviews, achieving an accuracy of 90.1% on the restaurant reviews dataset and about 83.4% on the product reviews dataset. This demonstrates the potential of transfer learning in adapting pre-trained models to particular domains. In [19], Amasyali et al. studied the use of character-based representations for sentiment analysis of Turkish textual data. Their results indicate that these representations are more effective than word-based techniques, providing a fresh viewpoint on managing the complexity involved in processing the Turkish language.

In the lexicon domain, Ucan et al. [16] introduced a model that employed a Turkish Sentiment Dictionary (TSD) and SVM to reach an accuracy of 80.70% on the Hotel Reviews Dataset (HRD) and 84.60% on the Movie Reviews Dataset. Similarly, Erşahin et al. [20] used a hybrid approach that employed a lexicon and traditional machine learning (TML) models to reach an accuracy of 86.31% on a Movie Reviews Dataset. In 2020, Yildirim et al. [10] suggested an approach for Turkish sentiment analysis using term frequency-inverse document frequency (TF-IDF), bag of words (BoW), and TML to achieve an accuracy of 86.00% on Hotel Reviews data and about 83.00% on the Movie Reviews Dataset.

Different attempts have been made in this domain; the most relevant studies are summarized in Table 1. One recent study is [15], which used text filtering and transformer-based models. Even though this study reached encouraging results on the Movie and Hotel Reviews datasets, attaining an accuracy of 98% on the hotel reviews dataset, there are drawbacks to the real-world application of the proposed method for two reasons. Firstly, the study used a novel text filtering method based on removing the most frequent words of the positive reviews from the negative reviews and the most commonly used terms of the negatively labeled reviews from the positive reviews, which is inapplicable to new unlabeled text. Secondly, the study used transformers, which necessitate more memory and computational resources. Hence, there is a need for lightweight models and for further investigating and improving embedding methods with deep learning approaches for more efficient and accurate Turkish short-text sentiment analysis. This study builds upon these needs to further explore, evaluate, and polish these techniques.

3 Preliminaries

3.1 Word Embedding (WE)

WE techniques are fundamental in natural language processing, providing a method to encode words in a compact vector space [11]. This format incorporates semantic meanings and interconnections among words, enabling deep learning models to handle text with greater efficiency. The work of Zhang et al. showcases the effectiveness of word embedding in managing languages such as Modern Standard Arabic (MSA), which often exhibit dialectal variation. Embedding methods such as GloVe and Word2Vec have been widely used in various natural language processing (NLP) tasks, significantly improving the performance of models in tasks like sentiment analysis and text classification. The studies conducted in [21] and [22] examine the efficacy and difficulties associated with neural word embeddings across different NLP tasks, emphasizing their potential despite concerns regarding interpretability. These approaches treat each word as the essential unit for representation and learning, concentrating primarily on word-level embedding. This technique is advantageous for capturing word-specific characteristics and semantics, which are crucial for tasks such as sentiment analysis; however, it may have difficulty with terms that are not in its lexicon. According to Kim et al. [23], incorporating recurrent neural networks (RNN) or long short-term memory (LSTM) networks with word-level embeddings can greatly increase performance by capturing sequential information in texts [23].

Fig. 1
figure 1

A simple LSTM unit architecture

3.1.1 GloVe

GloVe is an unsupervised learning technique that generates word embeddings by aggregating global word-to-word co-occurrence statistics from a corpus. The model is a common option for a variety of natural language processing tasks since it is able to capture both the semantic and syntactic information of words. GloVe has been demonstrated to be useful in improving tasks such as sentiment analysis and word similarity assessment, according to several studies. For instance, in 2017, Sharma et al. [24] illustrated its usefulness in sentiment analysis. Moreover, in 2021, Wang developed an improved version of GloVe to boost performance in natural language processing applications [25].

3.1.2 Word2Vec

Word2Vec, which stands for Word to Vector, is a collection of models used to construct word embeddings, utilizing either the Continuous Bag-of-Words (CBOW) model or the Skip-Gram model. In addition to its widespread use in a variety of natural language processing applications, it is well known for its effectiveness in extracting word relationships from big datasets. Various studies have shown that Word2Vec performs well at both comprehending natural language and improving machine learning tasks. For instance, the authors of [26] studied the enhancement of neural network models using weighted Word2Vec. Moreover, Jaffe et al. [27] investigated its spectral foundation, indicating possible verifiable guarantees for its usage in natural language processing (NLP).
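As an illustration of the Skip-Gram formulation (not the authors' implementation; a library such as gensim would be used in practice), the model is trained on (target, context) pairs drawn from a sliding window over each sentence. A minimal sketch of the pair generation, using a hypothetical Turkish toy sentence:

```python
# Sketch of Skip-Gram training-pair generation over a sliding window.
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "bu film gerçekten çok güzel".split()  # "this film is really very nice"
pairs = skipgram_pairs(sentence, window=1)
# pairs such as ('film', 'bu') and ('film', 'gerçekten') are produced
```

CBOW inverts this setup, predicting the target word from the surrounding context words instead.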

3.1.3 FastText

FastText, developed by Facebook’s AI Research group, is an extension of the Word2Vec model that treats each word as a bag of character n-grams, enabling it to capture subword information. This capability makes FastText especially helpful for morphologically rich languages and for words that are not encountered during training. FastText has proven successful in a variety of natural language processing tasks, including text categorization and sentiment analysis. In 2021, Lin et al. [28] highlighted the effectiveness of FastText in sentiment analysis, while Young and Rusli [29] acknowledged its potential in text pre-processing for enhanced results [28].
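To make the subword idea concrete, the following sketch (illustrative only; the real model sums learned vectors of these n-grams to embed a word) shows FastText-style character n-gram decomposition with the usual boundary markers:

```python
# Sketch of FastText-style character n-gram decomposition.
def char_ngrams(word, n=5):
    """Return the length-n character n-grams of a word with boundary markers."""
    marked = f"<{word}>"  # '<' and '>' mark word boundaries, as in FastText
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

grams = char_ngrams("kitaplar")  # Turkish "books"
# → ['<kita', 'kitap', 'itapl', 'tapla', 'aplar', 'plar>']
```

Because an unseen inflected form such as "kitaplardan" shares many of these n-grams with "kitaplar", FastText can still produce a sensible vector for it, which is the property that matters for Turkish morphology.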

3.2 Character-Level Embedding

Character embedding (CE) shifts the focus of embedding techniques from words to characters, providing a more fine-grained representation of text; such techniques are also known as character-level embedding [30]. For languages with high morphological complexity or dialectal differences, this technique is very useful since it avoids the failure of word-level embedding on unknown terms. Character-level embeddings, often implemented with the help of convolutional neural networks (CNNs), can overcome these issues by learning representations from character sequences. As emphasized in [31], using this method guarantees that even words that are not effectively captured by WE may be successfully represented by CE.

3.3 Deep Learning Models

3.3.1 Long Short-Term Memory (LSTM)

LSTM networks are a subset of recurrent neural networks (RNNs) designed to overcome the constraints of standard RNNs in handling long-range temporal dependencies. Because of their capacity to recall and absorb prior information over long periods, LSTMs are especially successful in time-series prediction, language modeling, and sequence generation. For example, LSTMs have been effectively deployed in financial time-series forecasting, outperforming comparable models in predictive accuracy and profitability [32]. The simple architecture of LSTM is shown in Fig. 1.

Fig. 2
figure 2

The architecture of BiLSTM

3.3.2 Bidirectional LSTM (BiLSTM)

BiLSTM networks improve on the LSTM by processing data in both backward and forward directions, as shown in Fig. 2. Bidirectionality facilitates the extraction of local features and global connections, which is especially advantageous in analyzing time-series data. This capability improves tasks such as spectrum detection in cognitive radio [33]. In educational environments, BiLSTMs are used to forecast student performance accurately and with great efficiency [8]. In the domain of sentiment analysis, many attempts have been made to enhance text classification performance using BiLSTM [34, 35].

3.3.3 Convolutional Neural Network (CNN)

CNNs are a potent tool in sentiment analysis, leveraging deep learning to analyze and categorize expressive tones in text. The primary advantage of CNNs is their capacity to automatically and effectively learn features from raw input, rendering them highly proficient in managing the extensive and intricate datasets often encountered in sentiment research [36]. A basic architecture of a 1-dimensional CNN with one convolutional layer is shown in Fig. 3.

Fig. 3
figure 3

A basic CNN model

3.3.4 Hybrid Approach

In recent years, scholars have been investigating hybrid approaches that combine CNN with other architectures such as LSTM and BiLSTM to enhance performance. For instance, the CNN-LSTM deep neural network model demonstrates superior performance compared to SVM and other models on multidomain datasets by accurately discerning several emotion polarities within a single phrase, highlighting the proficiency of CNNs in intricate sentiment analysis tasks [37]. Similarly, a hybrid CNN-RNN model integrating a convolutional neural network (CNN) with a recurrent neural network (RNN) enhances the precision of sentiment analysis for Twitter data. The study [36] indicates that CNNs may be effectively used across various social media platforms, highlighting their versatility.

4 Methodology

This section outlines the methodology used in this study to assess the effectiveness of several embedding techniques for Turkish short-text sentiment analysis utilizing deep learning methodologies.

Fig. 4
figure 4

The method used for evaluating the performance of embedding techniques on Turkish Sentiment Analysis

Figure 4 shows the steps performed to evaluate the performance of the different embedding methods (Word2Vec, GloVe, FastText, character-level, and the hybrid character-level and word-level approach) using deep learning techniques for analyzing Turkish short texts.

4.1 Dataset

To evaluate the performance of word embedding techniques, two datasets were employed in this study.

4.1.1 Turkish Higher Education Dataset (THED)

THED is an original dataset consisting of concise Turkish texts gathered from the Twitter (X) social media platform. The tweets were collected based on universities’ names and hashtags to capture people’s opinions about universities and educational quality, motivated by the increased demand from Turkish institutions and policymakers for evaluating public opinion about universities in Turkey in order to enhance their services. The collection period was the first half of 2023, when the quality of education was a hot topic on social media among Turkish users due to the presidential election held in Turkey on May 14, 2023. After collecting more than 17k tweets, the annotation was performed manually by two annotators in order to correctly categorize the tweets into two classes (i.e., positive and negative).

4.1.2 Hotel Reviews Dataset (HRD)

HRD is a public dataset introduced in [16]. It includes customer reviews of hotels. This dataset was chosen for three reasons. Firstly, it is public and has been used by scholars in many previous studies such as [10]. Secondly, it is a balanced dataset: since our original dataset is unbalanced, choosing a balanced dataset makes it possible to monitor and evaluate the deep learning models on data with different characteristics and gain better insight into the performance of the proposed models. Thirdly, the dataset is not very large, which can be a challenge for deep learning models that require huge amounts of data.

Both the THED and HRD datasets include two categories, as shown in Fig. 5. Compared with HRD, THED is recent Twitter (X) data that needs intensive pre-processing to remove hashtags and convert about 4282 emojis to text, as shown with further statistical information in Table 2. Additionally, the average word count per record was 24 words in THED, versus about 77 words in HRD. In terms of mentions, i.e., notifying a user by writing their name after the @ character, THED includes 13,915 mentions while only 10 mentions appear in HRD. In short, THED, as a tweets dataset, has shorter texts and needs more pre-processing, whereas HRD has longer records in terms of word count and fewer emojis and mentions.

Fig. 5
figure 5

Data distribution

4.2 Data Pre-processing

To obtain high performance, the textual data were intensively pre-processed with a procedure that included tokenization, stop-word removal, and text normalization to address the distinctive characteristics of the Turkish language. Moreover, emojis were replaced with their textual meaning, and all links, URLs, usernames, and mentions were removed from the data. Many studies use Turkish-specific stemmers (e.g., the Snowball Turkish stemmer or Zemberek-NLP); however, after exhaustive experimentation on both datasets, stemming did not enhance the results for either dataset, which is why it was not employed in our methodology.
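The cleaning steps above can be sketched as follows. This is a hedged illustration of the described pipeline, not the exact script used in the study; the emoji-to-text step would additionally use a library such as `emoji`, and the function name is our own:

```python
import re

# Illustrative cleaning pass: drop links/URLs, mentions, and hashtags,
# then apply Turkish-aware lowercasing and whitespace normalization.
def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links and URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    # Turkish has dotted/dotless I, which str.lower() handles incorrectly:
    # 'I' must become 'ı' and 'İ' must become 'i' before lowercasing the rest.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return " ".join(text.split())                       # collapse whitespace

print(clean_tweet("Üniversite EĞİTİMİ harika! @kullanici #egitim https://t.co/x"))
# → üniversite eğitimi harika!
```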

4.3 Embedding Phase

During the embedding phase, both character-level embedding and word-level embedding were employed.

Table 2 Statistical details of the datasets used

Word-Level Embedding: three widely used pre-trained WE techniques (namely Word2Vec, GloVe, and FastText) were employed. Both the Skip-Gram and Continuous Bag-of-Words (CBOW) versions of Word2Vec were used. The GloVe embedding was used to exploit the global word-word co-occurrence matrix in the textual data. FastText was chosen for its capacity to capture subword details, which is essential for effectively dealing with the morphological complexity of Turkish. The process of utilizing these WE techniques started by downloading the pre-trained models built specifically for analyzing Turkish texts. Firstly, Word2Vec was downloaded from [38] with the gensim Python library; its size is around 366 MB. The Turkish Word2Vec embedding matrix was then created with an embedding dimension of 400. Secondly, GloVe embedding [39] was employed after downloading the Turkish GloVe embedding model and creating its embedding matrix. A recent version of GloVe was used, with a size of 1.42 GB and dimensions of 25, 50, 100, and 200. Thirdly, the FastText project provides pre-trained word vector models for 157 languages, including Turkish, as stated in [40]. We chose the Turkish-specific model trained on data from Common Crawl and Wikipedia; it employs the position-weighted Continuous Bag-of-Words (CBOW) architecture, has an embedding dimension of 300, and uses 5-character n-grams.
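Creating an embedding matrix from a pre-trained model follows a standard pattern: each word in the tokenizer vocabulary is looked up in the pre-trained vectors, and its row in the matrix is filled accordingly. A minimal sketch with hypothetical toy vectors (the real matrices came from the Turkish Word2Vec/GloVe/FastText models described above):

```python
import numpy as np

# Toy pre-trained vectors and a toy vocabulary, for illustration only.
pretrained = {"güzel": np.array([0.1, 0.2]), "kötü": np.array([-0.3, 0.4])}
word_index = {"güzel": 1, "kötü": 2, "filmmm": 3}  # index 0 reserved for padding
dim = 2

embedding_matrix = np.zeros((len(word_index) + 1, dim))
for word, idx in word_index.items():
    vec = pretrained.get(word)
    if vec is not None:            # out-of-vocabulary rows remain all-zero
        embedding_matrix[idx] = vec
```

Such a matrix typically initializes a Keras `Embedding` layer via its `weights` argument, optionally frozen with `trainable=False`.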

To determine the optimal dimensions for the embedding matrix, we experimented with various embedding sizes across different models. For the Word2Vec model, an embedding dimension of 400 was identified as the most effective. In contrast, when testing the GloVe model with embedding dimensions of 100, 200, and 300, a dimension of 100 was found to be the most suitable. Similarly, the FastText model demonstrated superior performance with an embedding dimension of 100.

Character-level Embedding (CE): This method goes beyond word-based feature extraction from textual data [11]. In contrast to WE, it allows training on a smaller vocabulary set consisting of characters, symbols, and punctuation marks, and it does not need pre-trained word embedding (WE) matrices. Compared with WE, CE’s capability to handle new words without pre-existing vector representations is a major advantage, making CE a strong candidate for capturing textual properties of the morphologically rich Turkish language.

Two CE methods were employed. Firstly, character-integer embedding (CIE) was applied using the standard Python tokenizer library: each character of a word is mapped to an integer, constructing a matrix of integers that represents the input textual data. Secondly, character one-hot encoding embedding (COE) was used. The process begins with specifying the alphabet needed to generate the embedding matrix, which for Turkish comprises a fixed list of characters.
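The CIE mapping can be sketched in a few lines. This pure-Python version is illustrative (in a Keras pipeline the same effect is obtained with a character-level tokenizer); the function names and the padding convention are our own:

```python
# Sketch of character-integer embedding (CIE): each distinct character
# receives an integer id, and records are padded to a fixed length.
def build_char_index(texts):
    chars = sorted({c for t in texts for c in t})
    return {c: i + 1 for i, c in enumerate(chars)}  # 0 reserved for padding

def encode(text, char_index, max_len=180):
    ids = [char_index.get(c, 0) for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))         # pad to fixed length

index = build_char_index(["çok iyi", "çok kötü"])
row = encode("iyi", index, max_len=10)
```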

Choosing a wider character set requires greater vector dimensions for each letter, and the increase in vector size increases the computational resources needed to analyze the embedding matrices, especially in text (i.e., tweet or review) analysis. For Turkish text, characters and symbols were restricted to 129 characters: 58 Turkish letters (29 uppercase and 29 lowercase), 10 numerals, 6 English letters, 53 special characters including punctuation marks and symbols, and 2 characters (newline and space) that fall into the ’Other’ category.

Our analysis examined how different character sets affected classification ability, starting with the digits (0–9) and expanding. The approach included every distinctive character in our data, including non-Turkish ones (e.g., English letters). Table 3 lists the 129 distinct characters and symbols found using this method. We then created a one-hot embedding matrix of (180, 129) binary values for each record, where 180 is the maximum character length of a record and 129 is the length of the one-hot encoded vector of each character. Figure 6 shows the COE process, which turns texts into a character-level numerical matrix.
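The COE step can be sketched as follows; a tiny two-character alphabet is used here for readability, whereas the paper's alphabet has 129 characters and max_len = 180:

```python
import numpy as np

# Sketch of COE: each record becomes a (max_len, alphabet_size) binary matrix.
def one_hot_encode(text, alphabet, max_len):
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    matrix = np.zeros((max_len, len(alphabet)), dtype=np.int8)
    for pos, ch in enumerate(text[:max_len]):
        idx = char_to_idx.get(ch)
        if idx is not None:        # characters outside the alphabet stay all-zero
            matrix[pos, idx] = 1
    return matrix

m = one_hot_encode("abba", alphabet="ab", max_len=6)
# rows 0-3 are one-hot for 'a','b','b','a'; rows 4-5 are all-zero padding
```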

Table 3 The characters that were employed for COE
Fig. 6
figure 6

The process of COE

4.4 Deep Learning Architectures

CNNs were selected for their effectiveness in extracting local (spatial) features from text data, while LSTMs were chosen for their capacity to capture long-term relationships in text sequences. Additionally, BiLSTM was considered due to its ability to capture connections in sequential data in both directions. Ultimately, after investigating different architectures for the sentiment analysis tasks on both datasets, we relied on six deep learning architectures: CNN, BiLSTM, CNN-BiLSTM, BiLSTM-CNN, LSTM, and CNN-LSTM, as shown in Fig. 7. A Bayesian optimization technique was then used to select the best hyperparameters.

Fig. 7
figure 7

The DL-based models that were employed for sentiment analysis

Table 4 The hyper-parameters of the proposed models

The proposed models, in Fig. 7, were configured with a consistent set of hyper-parameters: a batch size of 32, a learning rate (LR) of 0.0001, the Adam optimizer, a maximum of 20 epochs, and an early stopping mechanism with a patience of 3 to mitigate overfitting. The specific hyper-parameters detailed in Table 4 were chosen based on extensive experimentation, aiming to strike an optimal balance between model complexity and efficacy. For instance, the use of 64 versus 128 LSTM units across different models was guided by performance nuances observed during validation. Similarly, dropout rates were tailored to each model’s propensity for overfitting. This systematic approach to model development was employed to explore a broad spectrum of sequence modeling aspects. The architectural design of each model incorporates a strategic assembly of layers and parameters, including LSTM units, dropout rates, convolutional layers, and dense layers, all selected to enhance performance while curtailing overfitting risks. Alongside Table 4, which encapsulates the hyperparameters, Fig. 7 visually represents the architectural design of these models.

The hyperparameters for each model were chosen after a thorough evaluation of their effect on performance, balancing the complexity needed to capture detailed patterns in the data against the necessity of preventing overfitting. Preliminary experiments showed that 64 LSTM units is the ideal choice, based on the observation that performance improvements plateau beyond this point, and to ensure computing efficiency. The dropout rates were adjusted to address overfitting, a typical issue in machine learning: the LSTM and CNN models had a dropout rate of 0.5, whereas the hybrid models had different dropout rates, as in Table 4. These adjustments were made through iterative testing, ensuring that each model configuration provided the most reliable performance against the difficulties presented by our datasets.

The proposed hybrid architecture for character-level and word-level embeddings is a novel dual-pathway architecture constructed by duplicating one of the proposed models (from Fig. 7) and removing its last layer: one branch processes the word embedding and the other the character embedding. The two branches are merged by a concatenation layer, which is followed by a classification head consisting of two layers, namely a dropout layer and a Softmax layer, as shown in Fig. 8.
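The dual-pathway idea can be sketched with the Keras functional API. The layer sizes, sequence lengths, and the LSTM backbone below are illustrative assumptions, not the exact configuration of the paper's models:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative dual-pathway model: a word-level branch and a character-level
# branch, concatenated before a dropout + softmax classification head.
word_in = layers.Input(shape=(100,), name="word_ids")   # padded word-id sequence
char_in = layers.Input(shape=(180,), name="char_ids")   # padded char-id sequence

w = layers.Embedding(input_dim=20000, output_dim=100)(word_in)
w = layers.LSTM(64)(w)

c = layers.Embedding(input_dim=130, output_dim=32)(char_in)  # 129 chars + padding
c = layers.LSTM(64)(c)

merged = layers.Concatenate()([w, c])
merged = layers.Dropout(0.5)(merged)
out = layers.Dense(2, activation="softmax")(merged)      # two sentiment classes

model = tf.keras.Model(inputs=[word_in, char_in], outputs=out)
```

Training such a model requires feeding both inputs, e.g. `model.fit([word_ids, char_ids], labels, ...)`.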

Fig. 8
figure 8

The hybrid character-word levels embedding model

4.5 Experimental Setup

The experiments were carried out using the Pro version of Google Colab, which offers more processing power than the basic Colab platform: GPUs (NVIDIA Tesla P100 and T4), well known for their strong performance in deep learning workloads, with 25 GB of RAM, allowing the effective processing of massive datasets and models. Python 3.7.10, TensorFlow 2.5.0, and other key libraries such as Keras (included in TensorFlow), NumPy, scikit-learn, and NLTK were used in this work. This configuration guaranteed that our deep learning models were executed robustly and efficiently, with dramatically decreased training and testing durations. Each embedding approach was applied to all proposed models (i.e., LSTM, CNN, BiLSTM, CNN-BiLSTM, CNN-LSTM, and BiLSTM-CNN). The models went through training and evaluation in two rounds. In the first round, the holdout validation procedure was used by employing the train-test-split function (80% for training and validation and 20% for testing). In the second round, the cross-validation method with K=10 was utilized. The hyperparameters, including the learning rate, batch size, and number of layers, were optimized using Bayesian optimization based on the performance on the validation set.
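The two evaluation rounds can be sketched with scikit-learn on dummy data (the stratification shown here is our own illustrative choice for handling class imbalance, not a statement about the paper's exact splitting code):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Dummy balanced data standing in for the encoded records and labels.
X = np.arange(100).reshape(-1, 1)
y = np.array([0, 1] * 50)

# Round 1: holdout validation with an 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Round 2: 10-fold cross-validation (K=10).
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X, y)]
# each of the 10 folds serves once as the held-out test portion
```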

4.6 Evaluation Measures

The performance of each model was assessed using five evaluation measures: accuracy, precision, recall, AUC, and F1 score. Furthermore, the average time per epoch was employed as a sixth measure to gain deeper insight into the behavior of the deep learning models. Together, these metrics provide a thorough view of how effectively each model classifies sentiment in Turkish short texts.

To guarantee a balanced and unbiased evaluation, especially since one of the datasets is imbalanced, the ‘macro’ averaging method was used with precision, recall, and F1 Score when evaluating all models on the two datasets. Accuracy (Acc): the number of correct predictions divided by the total number of predictions, as in Eq. 1.

$$\begin{aligned} \text {Acc} = \frac{N_{\text {TP}} + N_{\text {TN}}}{N_{\text {TP}} + N_{\text {TN}} + N_{\text {FP}} + N_{\text {FN}}} \end{aligned}$$
(1)

Precision (Pre): the proportion of instances predicted as positive that are truly positive, as in Eqs. 2 and 3.

$$\begin{aligned} \text {Macro (Pre)} = \frac{1}{2} \times [\text { Pre }\text {Class}_1 + \text { Pre }\text {Class}_2] \end{aligned}$$
(2)

where

$$\begin{aligned} \text {Pre of each class} = \frac{N_{\text {TP}}}{N_{\text {TP}} + N_{\text {FP}}} \end{aligned}$$
(3)

Recall (Rec): the proportion of actual positive instances that are correctly classified, as in Eqs. 4 and 5.

$$\begin{aligned} \text {Macro (Rec)} = \frac{1}{2} \times [\text { Rec }\text {Class}_1 + \text { Rec }\text {Class}_2] \end{aligned}$$
(4)

where

$$\begin{aligned} \text {Rec of each class} = \frac{N_{\text {TP}}}{N_{\text {TP}} + N_{\text {FN}}} \end{aligned}$$
(5)

where N denotes a count, and TP, TN, FP, and FN (true/false positives and negatives) are depicted in the standard confusion matrix structure in Fig. 9.

F1 Score (F1 Sc.): the harmonic mean of recall and precision, striking a balance between the two, as in Eq. 6.

$$\begin{aligned} \text {Macro (F1 Sc)} = 2 \times \frac{\text {Macro (Rec)} \times \text {Macro (Pre)}}{\text {Macro (Rec)} + \text {Macro (Pre)}} \end{aligned}$$
(6)
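For the binary case, Eqs. 1–6 reduce to a few lines of Python; the confusion-matrix counts below are placeholders. Note that Eq. 6 defines the macro F1 as the harmonic mean of the macro precision and macro recall, which differs slightly from averaging the per-class F1 scores:

```python
def macro_scores(tp, fp, fn, tn):
    """Macro precision/recall treat each of the two classes in turn as 'positive'."""
    pre1, pre2 = tp / (tp + fp), tn / (tn + fn)   # Eq. 3 per class
    rec1, rec2 = tp / (tp + fn), tn / (tn + fp)   # Eq. 5 per class
    acc = (tp + tn) / (tp + tn + fp + fn)         # Eq. 1
    macro_pre = 0.5 * (pre1 + pre2)               # Eq. 2
    macro_rec = 0.5 * (rec1 + rec2)               # Eq. 4
    macro_f1 = 2 * macro_pre * macro_rec / (macro_pre + macro_rec)  # Eq. 6
    return acc, macro_pre, macro_rec, macro_f1

# Placeholder confusion-matrix counts for illustration
acc, pre, rec, f1 = macro_scores(tp=40, fp=10, fn=5, tn=45)
```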

Area Under Curve (AUC): a measure of a model’s overall capacity to differentiate between classes, much as a radar system is assessed by how well it identifies various kinds of objects. A higher AUC indicates a better model.

Average Time per Epoch (AT/E): the average time the model takes to complete one training epoch, as in Eq. 7. Empirically, this time varies with conditions; for instance, it is directly affected by the experiment’s computational resources and the input vector lengths. We therefore applied this metric consistently, running each model under identical conditions and parameters.

$$\begin{aligned} \text { AT/E} = \frac{\text {The total time of all training epochs}}{\text {Number of training epochs}} \end{aligned}$$
(7)
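Eq. 7 can be sketched in a few lines of Python; `run_epoch` below is a hypothetical stand-in for a single training pass:

```python
import time

def run_epoch():
    time.sleep(0.01)  # placeholder for one training epoch

epoch_times = []
for _ in range(3):
    start = time.perf_counter()
    run_epoch()
    epoch_times.append(time.perf_counter() - start)

# AT/E (Eq. 7): total training time divided by the number of epochs
at_per_epoch = sum(epoch_times) / len(epoch_times)
```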
Fig. 9

The standard confusion matrix structure

Fig. 10

Performance of the proposed models on THED dataset in terms of F1 Score

Fig. 11

Performance of the proposed models on HRD dataset in terms of F1 Score

5 Results

As mentioned in the experimental setup subsection above, performance was evaluated in two rounds. In the first round, the proposed DL-based models used holdout validation: each dataset was initially partitioned in an 80:20 ratio for training and testing, and the training portion was then further partitioned in an 80:20 ratio for training and validation. Thus, the training set comprised \(80\% \times 80\% = 64\%\) of the data, the validation set about \(80\% \times 20\% = 16\%\), and the test set the remaining \(20\%\) of the whole dataset. In the second round, cross-validation (\(K=10\)) was used to ensure the robustness and generalizability of the proposed models.
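The nested holdout split can be sketched with scikit-learn's `train_test_split` on placeholder data, yielding the 64/16/20 proportions described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = np.arange(1000) % 2              # placeholder binary labels

# First split: 80% train+validation, 20% test
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
# Second split: 80% of the 80% for training, 20% for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=0.20, random_state=0)
```

On 1000 records this yields 640 training, 160 validation, and 200 test samples, i.e. 64%, 16%, and 20% of the data.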

5.1 Round 1: Holdout Validation

During the first round, on the THED dataset, all embedding techniques were used for training and testing the six proposed DL-based models. Although BiLSTM-CNN achieved the best performance with the Word2Vec embedding method, reaching an accuracy of 0.8682 and an F1 Score of 0.8262, it had the worst AT/E at 3.63 s, as summarized in Table 5. Compared to Word2Vec, a considerable rise in performance was achieved with the GloVe embedding method, where LSTM attained 0.8730 accuracy and 0.8338 F1 Score. Compared with the previous two embedding techniques (i.e., GloVe and Word2Vec), FastText further improved model performance, especially for CNN-LSTM, which reached 0.8758, 0.8389, 0.8351, and 0.9302 in terms of accuracy, precision, F1 Score, and AUC, respectively. With the CIE method, the performance of three models (i.e., LSTM, BiLSTM, and BiLSTM-CNN) was not encouraging; however, reasonable results above 80% were achieved by the other three models, namely CNN (0.8415 accuracy), CNN-BiLSTM (0.8351), and CNN-LSTM (0.8235).

Combining character-level and word-level embeddings enhanced the results: with the hybrid CIE-WE method, all proposed models exceeded 87% accuracy, especially CNN-BiLSTM, which outperformed the others with an accuracy of 0.8800 and an F1 Score of 0.8392. Compared to CIE, COE achieved better performance, indicating that using one-hot rather than integer encoding at the character level can improve results; in both cases, however, BiLSTM performed poorly. With the hybrid embedding technique (COE-WE), CNN achieved the best results, 0.8811 accuracy and 0.9345 AUC.

The results summarized in terms of F1 Score in Fig. 10 demonstrate that among the pre-trained embedding techniques, FastText (especially with CNN-LSTM) achieved the best results. Furthermore, using one of the character embedding techniques (i.e., COE or CIE) alone does not guarantee high performance, whereas a hybrid approach that pairs word embedding with character embedding can enhance performance effectively. Moreover, using either COE or CIE with CNN can achieve promising results, as shown in Fig. 10 for the first dataset and in Fig. 11 for the second.

Regarding the AT/E, as reported in Table 5 and depicted in Fig. 12, the hybrid embedding techniques required more time because the model is duplicated, with one pathway for character-level embedding and one for word-level embedding, as shown in Fig. 8. Additionally, for every embedding method, CNN had the best AT/E among the models using the same embedding. Moreover, placing CNN before LSTM or BiLSTM effectively decreased the time required to complete one epoch relative to the original model: for instance, LSTM with Word2Vec took 3.0 s while CNN-LSTM took only 2.25 s, and BiLSTM took 3.31 s while CNN-BiLSTM took only 3.13 s. GloVe was the most time-efficient embedding method, followed by FastText. Among the character-level methods, CIE was more efficient than COE, because COE embeds the input text as a three-dimensional tensor (number of records, number of characters = 129, maximum length = 180).
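As an illustration of why COE produces heavier inputs than CIE, the following sketch builds the three-dimensional one-hot tensor. The toy alphabet is an assumption (the paper uses 129 characters and a maximum length of 180), and the sketch arranges the tensor in the conventional (records, length, characters) order:

```python
import numpy as np

NUM_CHARS, MAX_LEN = 129, 180

def one_hot_chars(texts, char_index):
    """Encode each text as a (MAX_LEN, NUM_CHARS) one-hot matrix."""
    out = np.zeros((len(texts), MAX_LEN, NUM_CHARS), dtype=np.float32)
    for i, text in enumerate(texts):
        for j, ch in enumerate(text[:MAX_LEN]):
            k = char_index.get(ch)
            if k is not None:      # unknown characters stay all-zero
                out[i, j, k] = 1.0
    return out

# Hypothetical alphabet index for illustration only
char_index = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
batch = one_hot_chars(["harika bir film", "kotu"], char_index)
```

Each record thus occupies a 180 × 129 matrix under COE, versus a single vector of 180 integers under CIE.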

Turning to the HRD dataset, the results in Table 6 show that all models with Word2Vec embedding achieved promising results above 90% accuracy; the best model was LSTM with an accuracy of 0.9353, while the worst was CNN with 0.9013. Compared to Word2Vec, GloVe considerably enhanced the performance of CNN, which achieved an accuracy of 0.9297, the best among the models using GloVe.

Table 5 The performance of DL-based models on the THED dataset

The third embedding technique outperformed the previous two (i.e., GloVe and Word2Vec): FastText yielded high performance, especially for CNN-LSTM, which attained 0.9543, 0.9543, 0.9544, 0.9543, and 0.9872 in terms of accuracy, precision, recall, F1 Score, and AUC, respectively. Among the character-level methods, with CIE the performance was encouraging only for CNN (0.9444 accuracy) and CNN-BiLSTM (0.9293); BiLSTM and BiLSTM-CNN obtained moderate accuracies of around 82% and 87%, respectively, while LSTM and CNN-LSTM performed poorly, at around 40% accuracy. Adding WE to CIE increased the models' accuracy to over 94.70%; in particular, CNN-BiLSTM matched the progress of CNN-LSTM with FastText, reaching an accuracy of 0.9543 and an F1 Score of 0.9543, with a noteworthy AUC of 0.9875. Although CNN-LSTM was not the best with this hybrid embedding in terms of accuracy and F1 Score (slightly below CNN-BiLSTM), it showed positive indications, achieving an accuracy of 0.9526 and an AUC of 0.9890, higher than CNN-BiLSTM's. As for the COE method, it again outperformed CIE, especially for LSTM and CNN-LSTM, which had performed poorly with CIE; this indicates that one-hot encoding at the character level can enhance performance over integer encoding. With the hybrid embedding technique (COE-WE), all models performed strongly on all metrics, particularly CNN-BiLSTM and CNN-LSTM, which achieved the best AUC values of 0.9903 and 0.9901, respectively.

Fig. 12

The performance of DL-based models on THED dataset in terms of AT/E

A comparison of the F1 scores obtained with all embedding techniques on the HRD dataset is depicted in Fig. 11. The findings summarized in Table 6 and the comparison in Fig. 11 clearly show that FastText was the most successful pre-trained embedding strategy. Furthermore, high performance cannot be guaranteed by using only one of the character embedding approaches (i.e., COE or CIE); however, a hybrid technique that combines character and word embedding can significantly improve sentiment analysis models.

Table 6 The performance of DL-based models on the HRD dataset

Regarding the time measure (AT/E), as reported in Table 6 and depicted in Fig. 13, the hybrid embedding technique CIE-WE required more time for two reasons. Firstly, the records of the HRD dataset are longer than THED's; the higher word count increases the vector length, and the computational time grows accordingly, especially when each letter is represented by an integer. Secondly, each model is used twice in parallel, one pathway for character-level embedding and the other for word-level embedding, as shown in Fig. 8. Furthermore, across all embedding methods, CNN had the best AT/E compared to the other models using the same embedding approach, and the time needed for the original model to complete one epoch could be significantly reduced by placing CNN before LSTM or BiLSTM. FastText was the most time-efficient technique, slightly better than GloVe. Regarding the character-level embedding techniques, unlike on THED, COE was better than CIE in terms of time efficiency: although COE represents the embedded text as three-dimensional one-hot matrices, it was efficient because one-hot matrices, even three-dimensional ones, are easier to process than integer sequences.

5.2 Round 2: Cross-Validation

In the second round, a cross-validation methodology was employed to assess the effectiveness of the proposed models; the outcomes are presented in Table 7. All models demonstrated strong performance on the THED dataset with Word2Vec embeddings, with accuracy scores between 0.856 and 0.868 and F1 Scores between 0.811 and 0.828. With a mean accuracy of \(0.868 \pm 0.006\) and an F1 Score of \(0.828 \pm 0.009\), the BiLSTM model outperformed the other models thanks to its bidirectional processing, which gathers contextual information from both preceding and succeeding elements in a sequence. By combining information from both directions, BiLSTM captures long-range dependencies and contextual nuances in the text, improving its sentiment analysis ability; its flexibility in learning complex patterns and its handling of variable-length input sequences also contribute to its strong performance on the THED dataset. Likewise, GloVe embeddings produced good outcomes, especially with the CNN-LSTM model, which achieved an accuracy of \(0.867 \pm 0.009\). FastText proved a very successful embedding method, outperforming the other pre-trained embeddings: used with the CNN-LSTM model, it improved the F1 Score to \(0.834 \pm 0.011\) and achieved an accuracy of \(0.874 \pm 0.009\). These findings highlight how well FastText captures subtle semantic information, which is particularly useful for sentiment analysis tasks, and how effectively the CNN-LSTM model exploits it.

Fig. 13

The performance of DL-based models on the HRD dataset in terms of AT/E

When it comes to character embedding methods, CIE achieved its best performance using CNN, with an accuracy of \(0.863 \pm 0.008\), precision of \(0.831 \pm 0.014\), recall of \(0.800 \pm 0.015\), F1 Score of \(0.813 \pm 0.013\), and AUC of \(0.920 \pm 0.010\); the second-best model was CNN-BiLSTM, with an accuracy of \(0.851 \pm 0.011\). As in the first round with holdout validation, a hybrid method combining character-level and word-level embedding enhanced the results effectively. With the CIE-WE method, CNN achieved the best accuracy at \(0.880 \pm 0.008\); the second-best model was CNN-LSTM, with an accuracy of \(0.878 \pm 0.006\) and an F1 Score of \(0.834 \pm 0.011\), while the best F1 Score with this technique, \(0.835 \pm 0.005\), was obtained by LSTM. Compared to CIE, the COE approach demonstrated a similar level of effectiveness, with only marginal differences observed, while COE with WE showed slightly less progress than CIE with WE, again by only a marginal difference. It is noteworthy that COE-WE with CNN yielded the highest AUC among all embedding techniques and models, \(0.938 \pm 0.005\), as summarized in Table 7. This indicates the effectiveness of pairing a hybrid embedding technique (COE-WE) with the CNN model for sentiment analysis: by combining contextual and semantic information, character one-hot encoding embedding (COE) together with word embedding (WE) provides a thorough representation of textual properties, improving the model's capacity to reliably determine sentiment polarity by allowing it to capture the contextual cues present in the text.
Furthermore, the convolutional neural network (CNN) architecture, known for its efficiency in recognizing local patterns and features in sequential data, complements the hybrid embedding technique by further refining the representation of textual features. The enhanced AUC performance highlights the advantage of integrating COE-WE embedding with the CNN model for sentiment analysis applications.

Table 8 shows the performance of the DL-based models on the HRD dataset with cross-validation (K=10). The LSTM model performed best with the Word2Vec technique, with an accuracy of \(0.882 \pm 0.015\) and an F1 score of \(0.882 \pm 0.016\), slightly outperforming BiLSTM-CNN's accuracy of \(0.882 \pm 0.013\) and F1 score of \(0.881 \pm 0.013\). These findings are especially significant because the dataset includes longer Turkish reviews: the LSTM model's ability to capture long-term dependencies within text sequences makes it well suited to analyzing both short and long comments, contributing to its somewhat better performance than the BiLSTM-CNN architecture. FastText surpassed the first two embedding techniques, providing encouraging results, especially with CNN-LSTM, which achieved the best performance across all embedding techniques on this dataset: an accuracy of \(0.913 \pm 0.013\), precision of \(0.916 \pm 0.011\), recall of \(0.913 \pm 0.013\), F1 Score of \(0.913 \pm 0.013\), and AUC of \(0.970 \pm 0.004\). This shows that combining FastText embedding with the CNN-LSTM architecture may produce better results than other pre-trained embedding approaches. FastText, known for capturing subword information and properly handling out-of-vocabulary words, gives a more detailed representation of textual semantics, particularly in languages with complex morphology such as Turkish. When combined with the CNN-LSTM architecture, which excels at capturing both local and global dependencies within text sequences, FastText embeddings support a more sophisticated understanding of sentiment, leading to improved performance in sentiment analysis tasks.

Table 7 The performance of DL-based models on the THED dataset using cross-validation
Table 8 The performance of proposed DL-based models on the HRD dataset using cross-validation

In the domain of character embedding methods, CIE achieved poor performance, especially with LSTM and CNN-LSTM, which reached only about 50% accuracy; the best performance was obtained with CNN-BiLSTM, at an accuracy of \(0.812 \pm 0.039\). With the CIE-WE method, LSTM achieved a better result, an accuracy of \(0.875 \pm 0.016\). On the other hand, COE markedly improved the results of LSTM and CNN-LSTM, which had performed poorly with CIE, with accuracy gains of around 26% for LSTM and 30.9% for CNN-LSTM. Additionally, the COE-WE method demonstrated compelling outcomes, reaching around \(0.887 \pm 0.020\) accuracy for LSTM and about \(0.896 \pm 0.01\) for both CNN-LSTM and CNN-BiLSTM, as shown in Table 8.

6 Discussion

This section explores the experimental results attained using a range of embedding approaches in conjunction with different DL-based sentiment models on our datasets. By carefully testing and analyzing each model, we aim to determine how well it performs and how accurately it identifies sentiment in Turkish text. Evaluating the obtained results highlights the strengths and limitations of each approach, thereby improving our understanding of sentiment analysis in the context of Turkish short texts.

Figure 14 shows the importance of a hybrid approach that employs both word-level and character-level embedding for obtaining promising results. Precisely, the CNN-BiLSTM model with the CIE-WE embedding technique obtained the best F1 Score on the THED dataset, with a value of 0.839. Among the pre-trained embedding models, only CNN-LSTM with FastText reached promising results both with and without cross-validation. It is worth noting that CNN with either CIE-WE or COE-WE showed encouraging findings in the cross-validation round. The best F1 Score with cross-validation, 0.835, was obtained by LSTM with the hybrid embedding approach CIE-WE.

Fig. 14

DL-based models having the best performance on THED dataset

Fig. 15

DL-based models having the best performance on HRD dataset

As depicted in Fig. 15, CNN-LSTM obtained promising results with both cross-validation and holdout validation. Moreover, both CNN-BiLSTM with CIE-WE and CNN-LSTM with FastText obtained the best results using holdout validation on the HRD dataset, reaching a value of 0.9543, while CNN-LSTM with COE-WE achieved a comparable F1 Score of 0.9517, ranking it among the best models. In the cross-validation round, CNN-LSTM with FastText achieved the best performance with an F1 Score of 0.913, and CNN-LSTM with COE-WE remained among the best models with an F1 Score of 0.895, mirroring CNN-BiLSTM's performance with the same embedding strategy.

In terms of accuracy, the two best models on the HRD dataset under holdout validation (as in Table 6) were CNN-BiLSTM using the CIE-WE technique and CNN-LSTM using the FastText embedding technique, both achieving an accuracy of 0.9543. On the THED dataset (as in Table 5), CNN using COE-WE and CNN-BiLSTM using CIE-WE were the best models, with accuracies of 0.8811 and 0.8800, respectively. This indicates positive progress from the hybrid technique. The training and validation accuracy of these three models, namely CNN-BiLSTM (CIE-WE), CNN-LSTM (FastText), and CNN (COE-WE), on both datasets is shown in Figs. 16 and 17.

Fig. 16

The accuracy of the best models on the HRD dataset

Fig. 17

The accuracy of the best models on the THED dataset

Figures 16 and 17 illustrate that CNN-LSTM with the FastText embedding technique started at a higher accuracy, thanks to the power of a recent pre-trained embedding technique that incorporates subword information. The models with the hybrid embedding techniques, on the other hand, began the learning process from a lower level but rose sharply toward optimal results. The corresponding loss curves of these three best models are shown in Figs. 18 and 19.

Fig. 18

The loss of the best models on the HRD dataset

Fig. 19

The loss of the best models on the THED dataset

As evidenced by the results presented above, all proposed models demonstrated promising outcomes, with the exception of the character-level embedding techniques (CIE and COE) when used with non-CNN-based architectures. This observation suggests that embedding textual data solely at the letter level is insufficient for models like LSTM and BiLSTM, although better results can be obtained with CNN-based architectures. Moreover, the consistently good results across almost all models on both datasets indicate precise modeling and architecture design, achieved through intensive experiments aimed at optimizing performance. Concerning the Word2Vec technique, the LSTM and BiLSTM architectures regularly outperformed the others. This finding derives from the nature of Word2Vec, which primarily encodes words as vectors rather than delving into deep subword characteristics; as a result, the sequential features embedded in Word2Vec representations directly benefit the LSTM and BiLSTM models. These architectures excel at capturing long-term relationships within sequential data, making them ideal for harnessing the sequential nature of Word2Vec embeddings and yielding optimal sentiment analysis results. Specifically, on the HRD dataset, LSTM achieved the highest F1 score of approximately \(93.52\%\) using holdout validation and \(0.882 \pm 0.016\) using cross-validation. Similarly good performance was observed on the THED dataset, where LSTM achieved either the best or second-best results in both the cross-validation and holdout validation scenarios.

When applying the GloVe method to both the HRD and THED datasets, LSTM tended to yield better results, particularly with cross-validation. On the HRD dataset, LSTM ranked as the second-best model with an accuracy score of \(0.883 \pm 0.016\), following BiLSTM-CNN with \(0.885 \pm 0.012\). Similarly, LSTM outperformed other models on the THED dataset, achieving the best F1 score of \(0.828 \pm 0.005\) and accuracy of \(0.864 \pm 0.005\).

FastText embedding techniques demonstrated the most promising results on both datasets (HRD and THED) and in both validation rounds (cross-validation and holdout validation), particularly with the CNN-LSTM model. This highlights the effectiveness of FastText embeddings in extracting subword sentimental features, combined with the CNN-LSTM architecture, which leverages the advantages of both LSTM and CNN to excel in Turkish sentiment analysis.

Despite the poor performance of the CIE embedding technique with LSTM and BiLSTM, it exhibited promising results with CNN-based models. For instance, on the HRD dataset, CNN achieved \(94.44\%\), while on the THED dataset, CNN-BiLSTM achieved the best results in terms of accuracy (\(0.812 \pm 0.039\)) and F1 score (\(0.811 \pm 0.039\)) in round 2 (cross-validation). This indicates that the CIE embedding method may yield good results with CNN-based models and their variations, such as CNN-BiLSTM. Similarly, the COE embedding technique showed encouraging results with CNN-based architectures.

As demonstrated in the previous tables (Tables 5, 6, 7, and 8), employing hybrid character-level and word-level embedding techniques significantly improved results. For example, CIE-WE with BiLSTM-CNN achieved an accuracy of \(0.848 \pm 0.040\) on the HRD dataset, outperforming the results obtained with the CIE embedding method alone. Similarly, on the THED dataset, the CNN model achieved an accuracy of \(0.880 \pm 0.008\) and an F1 score of \(0.834 \pm 0.012\), surpassing the performance of the CIE method alone. Combining COE with WE also improved performance compared to character-level embedding alone, underscoring the role of hybrid word-character embedding techniques in enhancing CNN-based architectures. This emphasizes the critical role of CNN-based architectures in extracting sentiment features from Turkish text, using convolutional layers and concatenating word-level with character-level features to enhance performance.

To summarize, the obtained results show that the FastText embedding method outperformed the other pre-trained embedding techniques on both Turkish short-text datasets for four reasons. First, the GloVe and Word2Vec models were trained on limited datasets in comparison with FastText. Second, FastText's subword information strengthens its ability to handle rare words and morphological variants, even those it was not trained on. Third, how closely the domain a pre-trained embedding model was trained on matches the domain it is applied to strongly affects word embedding success. Fourth, short texts may include more out-of-vocabulary (OOV) words, which benefit from FastText's subword-unit embeddings for words not seen during training.

Word2Vec’s methodology can be more advantageous than GloVe’s global co-occurrence technique in datasets where a word’s significance depends mostly on its immediate surroundings. Word2Vec’s continuous bag-of-words (CBOW) and skip-gram methods naturally capture word contexts dynamically, which can be advantageous in datasets where contextual complexity is vital.

In some cases, the GloVe embedding technique obtains better results than Word2Vec because of its design and the content of the dataset used. GloVe analyzes the whole corpus by creating a co-occurrence matrix and decomposing it, efficiently collecting comprehensive contextual information on a global scale, which can be especially advantageous in datasets of brief text segments where local context is limited. Additionally, GloVe's reliance on global statistics can offer improved handling of infrequent words compared with Word2Vec, which is beneficial for datasets containing a wide range of terminology. When analyzing short texts with few contextual signals, GloVe's capability to use broader corpus data could give it an advantage over Word2Vec's approach of considering only local context windows.

On both datasets, the dual-pathway architecture incorporating the hybrid character-level and word-level embedding methods (COE-WE and CIE-WE) progressed better than the pre-trained embedding methods, although in terms of time these methods may not be the best choice. Additionally, positioning CNN before LSTM or BiLSTM in the model architecture is likely to reduce the AT/E below that of the original LSTM or BiLSTM model; for instance, LSTM requires a higher AT/E than CNN-LSTM. This is because the CNN model has lower time complexity than LSTM and BiLSTM, which learn sequential data and the connections within sequences and therefore require more time. Furthermore, hybrid architectures such as CNN-LSTM show impressive results, especially with FastText embedding or the hybrid embedding methods.

Employing DL-based models with various embedding approaches for Turkish sentiment analysis can have a variety of real-world applications. These include social media monitoring to better understand public opinions and customer sentiments, market research to analyze product feedback and consumer preferences, brand monitoring to track sentiment toward companies and products, and political analysis to gauge public sentiment toward political figures and parties. Sentiment analysis can also be used in finance for risk management, in tourism to improve hospitality services, and in healthcare to improve patient happiness and satisfaction. Overall, these applications demonstrate the versatility and benefit of using deep learning algorithms for sentiment analysis in the Turkish language across different sectors and domains.

Despite attempts to address common limitations like model complexity, overfitting, data quality, and training time complexity, this study, exploring various deep learning architectures and embedding techniques, continues to encounter significant challenges and constraints warranting future investigation. These include the intricacies of multilingual support, transfer learning, and transformer models such as BERT, the imperative to capture temporal dynamics and contextual nuances, particularly in languages with intricate contextual and morphological structures like Turkish, as well as the necessity for robust domain adaptation strategies to ensure generalizability across diverse domains. Furthermore, there is a pressing need to explore the practical application of these methodologies in real-time and real-world scenarios.

7 Conclusion

This work conducted a comprehensive evaluation of embedding approaches for Turkish sentiment analysis, a domain characterized by the distinctive difficulties posed by the grammatical complexities of the Turkish language. Employing three pre-trained embedding techniques, namely GloVe, Word2Vec, and FastText, in addition to a hybrid method that combines character and word embedding, has shed light on the ever-changing field of sentiment analysis in Turkish text processing. Our investigation has shown that by combining advanced embedding techniques such as Word2Vec, GloVe, and FastText, and novel character-level methods like character-integer and one-hot encoding embedding, with deep learning models such as LSTM, CNN, and BiLSTM, and hybrid architectures (i.e., CNN-BiLSTM, BiLSTM-CNN, and CNN-LSTM), we can effectively address the unique challenges of Turkish sentiment analysis.

The experimental results demonstrate FastText's greater ability to handle textual elements and the novel dual-pathway architecture's capacity for sophisticated sentiment analysis. The study offers practical recommendations for professionals, indicating that utilizing both character and word embedding can significantly enhance sentiment analysis in languages with intricate morphologies. The practical implications of these improvements are extensive, including improved customer feedback analysis and enhanced social media monitoring systems, providing a more comprehensive understanding of public viewpoints.

It is crucial to note that our study encounters significant obstacles when investigating alternative deep learning architectures and embedding strategies. These include multilingual support, transfer learning, transformer-based models such as BERT, and the requirement for robust domain adaptation strategies. Furthermore, there is an urgent need to examine the actual use of these approaches in real-time circumstances. Recognizing the computing challenges, future research will focus on optimizing the efficiency of these models, to increase their utility in real-world settings while minimizing computational resources.