1 Introduction

Text analysis offers many possibilities to learn about the author of a text and about meanings that may be hidden at first sight. For example, hate speech, fake news, and other relevant phenomena can be identified automatically from text features. One such feature is the sentiment of the text. Sentiment indicates whether a text is positive or negative, or what level of positivity or negativity it carries. It is obvious at first sight that such a characteristic can be useful, for example, in identifying hate speech or incitement to hatred.

There are several tools for sentiment analysis, but unsurprisingly, the best-optimized tools target resource-rich languages such as English (Kajava et al., 2020) or Spanish. This raises the question of how far low-resource languages need to develop their own sentiment analysis tools. It might be more effective to rely on cross-lingual sentiment analysis and classification (Zhou et al., 2016). Under such a concept, it would be more efficient to use one of the available machine translation tools to translate the text from a low-resource language into English and apply sentiment analysis to this machine translation using a reliable tool such as the IBM Watson™ Natural Language Understanding service. However, this raises the further question of whether the sentiment of the original text is preserved in the translation.

This paper aims to analyze how accurately this tool can identify sentiment for the Slovak language, as a representative low-resource language, based on machine translation. The IBM Watson™ Natural Language Understanding service will be used as the reference tool for sentiment identification. This tool can identify the sentiment of text, but only for selected languages. It is best optimized for English, and therefore the results of the analysis of English texts will be considered the reference.

For the analysis, a parallel corpus of Slovak texts and their corresponding English translations written by humans was acquired. This study employs movie subtitles for this purpose. However, the research does not use community-created subtitles but professional subtitles from a streaming service. These are high-quality human translations, which are also used, for example, in English language teaching (Dizon & Learning, 2021).

The processing of the subtitle files required quite extensive data preparation, mainly related to the correct alignment of the translations into segments. Machine translations were obtained from the most widely used online machine translation system, Google Translate. Previous research verified that there is no statistically significant difference in the level of sentiment preservation in machine translation from Slovak to English between the most widely used systems, Google Translate and DeepL; based on those results, it is sufficient to use only one of them and generalize the results (Reichel & Benko, 2022b). Using the IBM Watson™ Natural Language Understanding service, the sentiment of each segment was identified in two versions of the text: the human-written text (EN) and the Google Translate machine translation (MT). The research aims to compare these results and determine the degree of agreement between the sentiment scores identified in the human text and in the machine translation.

The structure of the paper is as follows. The first section introduces the alignment of Slovak and English subtitles into a coherent parallel corpus, the removal of erroneous records, segmentation, and sentiment analysis using the IBM Watson NLU service. The second section focuses on data analysis, including lexico-morphological analysis, verification of correlation, and comparison of the sentiment error rate. The third section deals with the verification and interpretation of results, including the comparison of sentiment polarity using the F1 score. Additionally, the paper explores sentiment analysis using OpenAI GPT technology and discusses the relevance of the results achieved by the described method.

2 Related work

There are several commercial systems for sentiment analysis (Ermakova et al., 2021), which is a subfield of information extraction (Sazzed & Jayarathna, 2019). Based on research (Dash & Pathare, 2022) in which the IBM system came out as the most accurate, it can be considered reliable enough for our experiment. This IBM system has been used for sentiment analysis of social networks (Daneshfar et al., 2022) and of different types of reviews. In addition to the positive and negative polarity of text, some studies deal with identified emotions (Abdaoui et al., 2017). Sentiment analysis has also been used to examine the impact of text sentiment on the ability to identify fake news (Kapusta et al., 2020), and the correlation between the truthfulness of the information in a text and its sentiment has been examined (Reichel et al., 2020).

Cross-lingual sentiment analysis (CLSA) is a natural language processing task that involves analyzing the sentiment (positive, negative, neutral) of text written in different languages. CLSA leverages one or several source languages to help low-resource languages perform sentiment analysis tasks. The models used in the CLSA methodology can be significantly refined if the best range of source languages can be found for a given target language (Reichel & Benko, 2022b). However, as a low-resource language, Slovak faces limits in this respect (Xu et al., 2022).

Rasooli et al. (2018) investigated the transfer of sentiment from selected corpora such as the Bible, the Qur’an, or Europarl (Koehn, 2005) and applied their methodology to sentiment analysis. Among other languages, they also used Slovak, whose results were among the most accurate of the sample.

Araujo et al. (2020) validated the accuracy of sentiment analysis methods developed for English when applied to multiple other languages. Several methods were used in the experiment, including IBM Watson. Unlike some of the other systems studied, IBM Watson did not provide analysis capabilities for a wide range of languages; however, for the languages in which sentiment analysis was provided, it ranked among the most accurate systems.

The identification of sentiment and emotion in translation, and their preservation, has been investigated using film subtitles by Öhman et al. (2021). In a related study (Kajava et al., 2020), the authors also used a parallel corpus created from subtitles published on opensubtitles.com (Lison & Tiedemann, 2016); however, these are often subtitles created by the community, not by professionals. They identified three main reasons why sentiment information is lost in translation: incomplete translations, an ambiguous choice on the part of the translator, or an overlap of possible sentiment classes. Concerning machine translation quality and the level of preserved sentiment, Lohar et al. (2018) attempted to balance these two attributes of translation, examining English-German translation. Afli et al. (2017) worked with tweets and found that sentiment is significantly preserved in translation between Irish and English; implementing a sentiment lexicon further improved the results of the sentiment analysis.

At present, the utilization of advanced language models, including those developed by OpenAI, is unlocking new possibilities for sentiment analysis, even in low-resource languages. The utilization of OpenAI’s API allows for the processing of language models without the need to develop proprietary ones, making it an attractive option for low-resource language analysis (Wang et al., 2023). These models, pre-trained on extensive and diverse corpora, can perform sentiment analysis across various languages and domains. This approach not only simplifies the process but also offers an end-to-end solution without additional intermediary steps. The comparison of ChatGPT with state-of-the-art models in sentiment classification and aspect-based sentiment analysis further validates its effectiveness and efficiency in open-domain sentiment analysis tasks.

3 Experimental setups

The research problems that are addressed in the experiment:

  • RQ-1: Does the machine translation from Slovak to English preserve the original level of a positive or negative sentiment of the text?

  • RQ-1.1: What are the characteristics of such subtitles in which the sentiment has been misidentified in machine translation?

  • RQ-2: Can the level of sentiment preservation in direct speech translation (subtitles) be used as a metric of translation quality?

Based on the research problems, the following null hypotheses were formulated.

  • H0-1: There is no statistically significant correlation between the machine translation from Slovak to English and the English human text in the level of identified positive or negative sentiment.

  • H0-2: There is no statistically significant correlation between the error of the identified sentiment score in the machine translation and the accuracy (or error rate) of the machine translation.

A more detailed description of the research process is given in the following steps:

  1. Data preparation

     (a) Source corpus preparation

        (i) Alignment of Slovak and English subtitles into a coherent parallel corpus.

        (ii) Removal of erroneous, inconsistent, repetitive, or unnecessary records.

        (iii) Segmentation—merging sentences that were split into multiple subtitles back into a single segment.

     (b) Generating a machine translation for each of the subtitles using a machine translation system.

     (c) Identification of keywords and their sentiment using the IBM Watson NLU service.

     (d) Transforming the sentiment of the keywords into a coherent dataset of sentiment scores of each segment for the two sets:

        (i) human text (EN),

        (ii) machine translation from Google Translate (MT).

  2. Data analysis

     (a) Lexico-morphological analysis of the dataset.

     (b) Verification of the level of correlation of the identified sentiment of the machine translations (MT) with the reference sentiment from the human text (EN).

     (c) Analysis of text characteristics that cause a decrease in the accuracy of sentiment transfer in machine translation.

     (d) Comparison of the sentiment error rate of machine translation with the accuracy of the translation.

     (e) Comparison of the results of the described methodology with the outcomes obtained using OpenAI API calls.

  3. Verification and interpretation of results

     (a) Verification of research hypotheses H0-1 and H0-2 and drawing conclusions from them.

     (b) Comparison of sentiment polarity using the F1 score.

3.1 Modifying the source corpus

The corpus that was used (Table 1) contained 11,601 subtitles from 10 movies of different styles (war, fairy tale, action, sci-fi, comedy). The corpus contained the variables:

  • id – subtitle identifier,

  • Text_sk – Slovak subtitles,

  • Text_en – English subtitles,

  • Movie_cat – movie category (war – war, fairytale – fai, action – act, sci-fi – sci, comedy – com).

Table 1 Sample of the input dataset

There were several types of errors in the raw files, which had to be corrected or removed; several of these erroneous subtitles had to be found and corrected (or removed) manually. The above modifications aimed to create a parallel corpus of many segments from dialogues for which both the human English text and the Slovak text were available. The aim was not to analyze complete dialogues but individual sentences (segments), i.e. subtitles. Therefore, the removal of some erroneous subtitles is not a problem from the point of view of the research aim: the removal may change the meaning or the dynamics of the conversation, but it does not change the meaning of the individual segments, which is the priority of this research.

The subtitles in the source files were not aligned; there were two separate files, one for Slovak and one for English. These two files needed to be merged and aligned correctly. Multiple errors and redundancies in the records caused errors in the text alignment, so it was necessary to correct them carefully. The errors and redundancies that occurred in the matching of subtitles and had to be corrected during text alignment are as follows:

  • Incorrectly loaded lines from CSV file. These lines had to be manually found and either corrected or removed.

  • One subtitle segment split into two. One subtitle in SK was split into two subtitles in EN, or vice versa.

  • Greetings or shouts. In the SK subtitles, there was sometimes a shout such as “Hey!” or a greeting like “Ahoj.” (EN: “Hi.”) which was omitted in the EN subtitles.

  • Singing. The sung parts of the movies were often subtitled only in Slovak; these subtitles were missing in English. The sung parts in the subtitles are usually marked in italics, i.e. with the tags <i> and </i>, and were therefore easy to find in the records. However, all the sung texts were removed. The reason is that even subtitled sung parts are usually handled in human translation by rewriting the lyrics: the authors of the human translation usually place more emphasis on preserving the rhymes in the verses than on preserving the fidelity of the content. Therefore, these subtitles were not considered relevant to the aims of this research. This condition was met by 921 out of 11,601 segments.

  • Descriptions. Descriptions of buildings, locations of scenes, chapter titles, etc. appear only in the Slovak subtitles. These are English texts inserted directly into the film footage; therefore, they are not included in the EN subtitles, while the SK subtitles contain them. These records were also removed from the corpus.

  • Storyteller. Particularly in war movies, there was sometimes a storyteller who described the facts of the event. This commentary was sometimes not given in the EN subtitles.

  • Tags. In addition to the italics markers, the {\an8} marker was also present in the subtitles, indicating that the subtitle should be placed at the top of the screen. This tag has been removed.

  • Texts of multiple characters in one subtitle segment. Texts of multiple characters were found in some subtitles. Because the purpose of data preparation is to create a corpus for examining the sentiment of individual subtitles, merging the sentiment of the texts of two different characters into one result is not correct. Such subtitles could be found efficiently by searching for the character "-", which separated the texts of different characters in subtitles containing the speech of two or more characters. This was the case for 1,379 out of 11,601 segments.

  • Duplicate subtitles. In this case, it is not an error in the subtitles but an adjustment following from the objectives of the research: the aim is not sentiment analysis along the timeline of the movie but the creation of a corpus. Repetitive short sentences such as “Hello.” or “Bye.” are often found in movies. These subtitles were used only once in the analysis, and duplicates were thus removed from the dataset. Duplicates were identified for 906 out of 11,601 segments.

Adjustments based on the above factors produced a cleaned corpus with 8,551 records; approximately 26% of the records were removed from the raw corpus.

Some of the longer sentences in the actors’ speech were split into multiple subtitles. These situations were identified based on whether the sentence ended in the first subtitle. For this purpose, a variable Segment_id was created that groups the subtitles into sentences (segments) (Table 2).

Table 2 Dataset with subtitles joined into segments based on Segment_id
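As an illustration, the following is a minimal sketch (assuming a pandas DataFrame with the columns described above; the punctuation-based sentence-end heuristic is a simplification of the procedure used) of how subtitles can be grouped into segments:

```python
# Minimal sketch: assign a Segment_id by detecting sentence-ending punctuation
# and merge split sentences back into single segments. The heuristic below is
# a simplification of the rule-based procedure described in the text.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "Text_en": ["I told you", "we would win.", "Thank you."],
})

# A subtitle closes a segment if its text ends with sentence-final punctuation
ends_sentence = df["Text_en"].str.strip().str.endswith((".", "!", "?"))

# Segment_id: cumulative count of completed sentences before each subtitle
df["Segment_id"] = ends_sentence.shift(fill_value=False).cumsum() + 1

# Join the subtitles of each segment back into one sentence
segments = df.groupby("Segment_id")["Text_en"].apply(" ".join).reset_index()
print(segments)
```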

3.2 Machine translation generation

This paper aims to verify whether the sentiment identified from machine translation can be considered relevant. Its relevance was verified by comparing it with the sentiment identified from the human subtitle text. Therefore, it was necessary to obtain a machine translation for each Slovak-English text pair (the English text is considered the human text; it was not translated but is a transcript). Google Translate was chosen to obtain the machine translation (the translation was created in July 2022 using the Basic edition of the API, version 2.0.1).
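For illustration, a minimal sketch of obtaining such a translation via the Basic (v2) API could look as follows (assuming the google-cloud-translate client library and configured credentials; this is not the authors’ original script):

```python
# Minimal sketch of a Basic (v2) Google Translate API call, assuming the
# google-cloud-translate client library and configured credentials.
from google.cloud import translate_v2 as translate

client = translate.Client()

def translate_sk_to_en(text_sk: str) -> str:
    result = client.translate(text_sk, source_language="sk", target_language="en")
    return result["translatedText"]

print(translate_sk_to_en("Ahoj, ako sa máš?"))  # e.g. "Hi, how are you?"
```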

For further analysis, only the variables Segment_id, Text_sk, and the new variable Text_mt (machine translation obtained from GT) were needed (Table 3).

Table 3 Sample of the dataset with subtitles and their machine translation from GT

3.3 Dataset composition

The final dataset (Reichel & Benko, 2022a) consisted of text columns of preprocessed Slovak and English subtitles of various movie styles and their machine translation. The parallel corpus consists of three corpora:

  • subtitles in the Slovak language—source texts,

  • subtitles in the English language—these were treated like human translations or reference texts,

  • subtitles translated by machine translation using Google Translate from Slovak to English.

The lexico-grammatical structure of the dataset is described in Table 4.

Table 4 Lexico-grammatical composition of the examined dataset of subtitles

The high number of short sentences in the whole dataset is due to the fact that sentences in subtitles contain mainly a low number of words, as they have to fit onto the screen. The Slovak subtitles contain even more short sentences. This contrasts with written language but is understandable, as the translation of speech is much looser. The type-token ratio results show that the texts are transcriptions of spoken text: Biber et al. (2002) have shown that spoken texts have relatively little lexical variation in comparison to written text. The much higher type-token ratio for Slovak was identified because Slovak is a flective language and contains more variations of words; despite that, it still reaches values that represent spoken language. The lexico-grammatical composition shows that the corpus consists mainly of nouns and verbs (Table 4). In the case of the English subtitles (reference and machine-translated), there is a higher number of pronominals in comparison to the source subtitles in Slovak (Table 4). As can be seen, there are only a few differences between the English subtitles and the machine-translated English subtitles. This supported the expectation of the experiment that there should not be differences in the sentiment analysis between the reference subtitles and the machine-translated subtitles.

3.4 Translation evaluation

The machine translation system was evaluated using automatic metrics of machine translation evaluation: metrics of accuracy (BLEU-1 to BLEU-4 and GLEU) and a metric of error rate (TER). Bilingual Evaluation Understudy (BLEU) is a standard automatic measure, a precision-oriented metric used for machine translation accuracy. BLEU-n (Papineni et al., 2002) is a geometric mean of n-gram precisions with a brevity penalty (BP), i.e. a penalty to prevent very short sentences:

$$BLEU\left(N\right)=BP\times \exp\left(\sum_{n=1}^{N}{w}_{n}\log {p}_{n}\right)$$

where \({w}_{n}\) are the weights for the n-gram precisions \({p}_{n}\), and

$$BP=\begin{cases}1 & \text{if } h>r\\ {e}^{1-\frac{r}{h}} & \text{if } h\le r\end{cases}$$

where r is the length of the reference and h is the length of the hypothesis.

BLEU represents translation quality in terms of adequacy and fluency (Munk et al., 2018). The BLEU score has several variations that depend on the number of words in the reference used to compute the brevity penalty.

Wu et al. (2016) used the GLEU (Google BLEU) score to remove the inaccuracies and poor correlation of the BLEU metric with human ratings that arise from the corpus-level focus of the BLEU metric. GLEU operates at the segment level: the GLEU score is computed as the minimum of the precision and the coverage of the n-grams in the output and target sequences, and it takes values between 0 (no matches) and 1 (all matches). Precision is the ratio of the number of matching n-grams to the number of all n-grams in the generated output word sequence, and coverage (i.e., recall) is the ratio of the number of matching n-grams to the number of all n-grams in the target word sequence.
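A minimal sketch of segment-level BLEU and GLEU computation (using NLTK; the example sentences are hypothetical, and the smoothing choice is an assumption, since the paper does not specify its implementation):

```python
# Minimal sketch (assuming NLTK) of the segment-level accuracy metrics for
# one hypothesis/reference pair. Sentences below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

reference = "They will attack at dawn .".split()   # human EN subtitle (hypothetical)
hypothesis = "They attack at dawn .".split()       # machine translation (hypothetical)

smooth = SmoothingFunction().method1  # avoids zero scores on short segments

# BLEU-1..4: the weights tuple selects the maximum n-gram order
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([reference], hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")

# GLEU: minimum of n-gram precision and recall at the segment level
print(f"GLEU: {sentence_gleu([reference], hypothesis):.3f}")
```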

The automated metric of error rate used was Translation Edit Rate (TER) which is defined as the minimum number of edit-operations required to change the hypothesis to an exact match with the reference (Snover et al., 2006):

$$ TER (h,r) = \frac{\min \# (I + D + S + shift)}{{\left| r \right|}} $$

where r is the reference of hypothesis h, I is an insertion, D is deletion, S is substitution and shift is the number of changes in word order.
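Similarly, a segment-level TER score can be sketched with the sacrebleu library (an assumption, as the paper does not name the implementation used); note that sacrebleu reports TER on a 0-100 scale:

```python
# Minimal TER sketch using sacrebleu (assumed implementation; sacrebleu
# reports the score on a 0-100 scale, i.e. 25.0 corresponds to TER = 0.25).
from sacrebleu.metrics import TER

ter = TER()
hypothesis = "They attack at dawn."           # machine translation (hypothetical)
reference = "They will attack at dawn."       # human EN subtitle (hypothetical)

result = ter.sentence_score(hypothesis, [reference])
print(f"TER: {result.score:.1f}")
```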

Table 5 shows that the accuracy of the machine translation is very high (BLEU-1 score of more than 0.6). The other identified BLEU scores were mostly of high quality as well (BLEU-2|3|4 scores of more than 0.4). This indicates that the obtained translation of the Slovak subtitles into English is of good quality. Overall, the best translation was identified for sci-fi movies; comedy movies, on the other hand, obtained the lowest-quality translation, although even that translation is still of high quality. Similar results were obtained for the GLEU metric for each movie category.

Table 5 BLEU-n score and GLEU score for machine translation system Google Translate

Since translation quality can also be assessed from the perspective of error-rate metrics, the TER metric was used. The lower the identified score, the better the translation. Table 6 shows that most movie categories were identified with a low error rate (less than 0.5), which confirms the results of the BLEU-n metrics. The obtained error-rate scores correspond with the accuracy scores: the lowest error rate was identified for the sci-fi movies, while the comedy movies obtained the highest one.

Table 6 TER error rate score for machine translation system Google Translate

Cosine similarity is a metric used to quantify the similarity between two or more vectors (Salton, 1989); it represents the cosine of the angle between the vectors. It is also used in text mining to compare the similarity of text documents. The obtained value is between 0 and 1, where 1 implies that the two documents are exactly alike and 0 means that there are no similarities between them; usually, a value higher than 0.5 indicates strong similarity. The scikit-learn Python library was used to calculate the cosine similarity for each movie category.

$$cosine\ similarity = S_C\left(A,B\right) := \cos\left(\theta\right) = \frac{A \cdot B}{\left\|A\right\|\left\|B\right\|} = \frac{\sum_{i=1}^{n}A_i B_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\,\sqrt{\sum_{i=1}^{n}B_i^2}}$$
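A minimal sketch of this computation with scikit-learn (the sentences are hypothetical; term-frequency vectors are one possible document representation, as the paper does not specify the vectorization used):

```python
# Minimal sketch of scikit-learn cosine similarity between a reference
# subtitle and its machine translation (hypothetical sentences).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = ["They will attack at dawn."]
machine_translation = ["They attack at dawn."]

# Build a shared term-frequency vector space for both texts
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(reference + machine_translation)

# Cosine of the angle between the two document vectors
score = cosine_similarity(vectors[0], vectors[1])[0][0]
print(f"Cosine similarity: {score:.3f}")
```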

Table 7 shows the obtained cosine similarity scores between the reference subtitles and the Google Translate machine translation for the various types of movies. As can be seen, the score is high (more than 0.6), which means that the machine translation output closely resembles the reference subtitles.

Table 7 Cosine similarity between the machine translation and reference

3.5 Sentiment analysis using the machine translation and IBM NLU

The sentiment analysis in this study was conducted using the IBM NLU service, a comprehensive system that encompasses various text analysis functions, including sentiment and emotion detection, text classification, and more. This tool represents an evolution of its predecessor, known as the Tone Analyzer, which provided similar functionalities.

The IBM NLU system was utilized for the analysis based on a comparative study of commercial sentiment analysis systems (Ermakova et al., 2021). The study compares Amazon Comprehend, Google Cloud Natural Language, IBM Watson NLU, Microsoft Azure Text Analytics, the Lexalytics Semantria API, and the MeaningCloud Sentiment Analysis API. According to the study, IBM NLU ranks among the top-performing systems in most metrics, with the highest accuracy score.

Sentiment analysis in the IBM NLU system is carried out in the following steps (Solanki, 2022):

  • Text preprocessing: The input text is first preprocessed to remove any unnecessary elements such as HTML tags, URLs, and special characters.

  • Tokenization: The preprocessed text is then tokenized, which involves splitting the text into individual words, phrases, or sentences, depending on the type of analysis being performed.

  • Part-of-speech tagging: Each token is assigned a part-of-speech tag, which helps to identify the grammatical structure of the text.

  • Dependency parsing: The tokens are then analyzed to identify the relationships between them, which helps to understand the meaning and context of the text.

  • Sentiment analysis: The sentiment of each sentence in the text is then analyzed using a combination of lexicon-based and machine learning-based approaches. For lexicon-based analysis, the text is matched against a sentiment lexicon containing words and phrases with known sentiment values. For machine learning-based analysis, IBM NLU uses a trained model to predict the sentiment of the text based on features such as word usage, sentence structure, and context.

  • Sentiment aggregation: Finally, the sentiment scores of each sentence are aggregated to calculate an overall sentiment score for the entire text.

A Jupyter Notebook server was created in the Watson Studio environment, where sentiment analysis can be performed on selected texts using Python and built-in libraries (Carvalho et al., 2019). The following libraries had to be imported into the system:

  • ibm_watson.natural_language_understanding_v1: version 2021-03-25,

  • ibm-watson: version 6.0.0,

  • websocket-client: version 1.1.0,

  • python-dateutil: version 2.8.2,

  • ibm-cloud-sdk-core: version 3.*, minimum version 3.3.6,

  • requests: version 2.26.0, minimum version 2.0,

  • PyJWT: version 2.1.0, maximum version 3.0.0,

  • six: version 1.15.0,

  • certifi: version 2022.5.18.1,

  • urllib3: version 1.26.7, maximum version 1.27,

  • idna: version 3.3, maximum version 4,

  • charset-normalizer: version 2.0.4, approximately version 2.0.0.

Sentiment analysis was performed by calling the natural_language_understanding.analyze function for each segment. Each segment’s sentiment analysis resulted in the identification of keywords and the determination of their sentiment. These results were transformed from JSON format into two matrices, one for each translation group (EN, MT).

To enable reproduction of the experiment, a Minimal Working Example (MWE) of the code utilized in the Jupyter Notebook for the IBM NLU service is also provided, along with pseudocode of the processing loop.

3.6 MWE

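The original code figure is not reproduced here; the following is a hedged reconstruction (a minimal sketch under the library versions listed above, with placeholder credentials; not the authors’ original figure) of an MWE call to the IBM NLU service:

```python
# Hedged reconstruction of an MWE for the IBM NLU service: analyze one text
# and print the identified keywords with their sentiment. The API key and
# service URL are placeholders.
import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, KeywordsOptions

authenticator = IAMAuthenticator("YOUR_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2021-03-25", authenticator=authenticator)
nlu.set_service_url("YOUR_SERVICE_URL")

response = nlu.analyze(
    text="I really enjoyed this movie.",
    features=Features(keywords=KeywordsOptions(sentiment=True)),
    language="en",
).get_result()

print(json.dumps(response["keywords"], indent=2))
```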

4 Pseudocode

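Again, the original pseudocode figure is not reproduced; a hedged Python sketch of the processing loop it described (one keyword/sentiment matrix per translation group; the file names are hypothetical) could look as follows:

```python
# Hedged sketch (not the authors' original figure) of the processing loop:
# each segment of a translation group (EN, MT) is sent to IBM NLU and the
# returned keywords with their sentiment scores are written to one matrix
# (CSV file) per group.
import csv
from ibm_watson.natural_language_understanding_v1 import Features, KeywordsOptions

def analyze_group(segments, group_name, nlu):
    """segments: iterable of (segment_id, text) pairs; nlu: authenticated client."""
    with open(f"keywords_{group_name}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["segment_id", "keyword", "sentiment_score"])
        for segment_id, text in segments:
            try:
                result = nlu.analyze(
                    text=text,
                    features=Features(keywords=KeywordsOptions(sentiment=True)),
                    language="en",
                ).get_result()
            except Exception:
                continue  # segments with no identifiable keywords are skipped
            for kw in result.get("keywords", []):
                writer.writerow([segment_id, kw["text"], kw["sentiment"]["score"]])

# analyze_group(en_segments, "en", nlu)   # human English text
# analyze_group(mt_segments, "mt", nlu)   # Google Translate output
```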

The result was two files with identified keywords for each segment and an associated sentiment_score value (Table 8).

Table 8 Sample output from IBM NLU in the form of a matrix

By grouping the keywords by segment, the average sentiment for each segment was calculated. If the IBM NLU system did not identify any keywords in a segment, that segment was not present in the final output. The two datasets were then joined via the segment_id variable, so the resulting dataset contained only the segments (subtitles) for which sentiment was identified in both groups (EN, MT). Since subtitles are mostly short sentences, it was assumed that the keywords of a segment would not be spread across multiple sentences.

The number of identified keywords was 8,419 for Text_en and 6,886 for Text_mt; that is, there are approximately 18% fewer keywords in the machine translation than in the human text. After joining the groups based on identified/unidentified sentiment, 4,076 segments were extracted. These records were further manually cleaned of erroneous unpaired segments from sentences split into multiple subtitles. The resulting dataset contained 3,768 segments. A sample of the resulting data matrix for analysis is shown in Table 9.

Table 9 Sample of the resulting data matrix prepared for hypothesis testing
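A minimal pandas sketch of this aggregation and join (assuming the keyword matrices produced by the sketch above):

```python
# Minimal pandas sketch of the aggregation step: average keyword sentiment
# per segment, then an inner join keeping only segments with sentiment
# identified in both groups (EN, MT).
import pandas as pd

kw_en = pd.read_csv("keywords_en.csv")
kw_mt = pd.read_csv("keywords_mt.csv")

sent_en = kw_en.groupby("segment_id")["sentiment_score"].mean().rename("sentiment_score_en")
sent_mt = kw_mt.groupby("segment_id")["sentiment_score"].mean().rename("sentiment_score_mt")

# The inner join drops segments for which NLU found no keywords in one group
dataset = pd.concat([sent_en, sent_mt], axis=1, join="inner").reset_index()
```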

4.1 Sentiment analysis using OpenAI GPT model

A solution utilizing GPT technology from OpenAI was used to validate the relevance of the results achieved by the described method. Specifically, the variable Text_sk, the Slovak variant of the subtitles in the dataset (Table 3), was used for sentiment analysis through an end-to-end method. Instead of using translations into English, a request to determine the sentiment of the text according to the specification was sent directly via API calls to OpenAI. This approach did not create a new dataset of translations, only sentiment values extracted by OpenAI.

The creation of the used prompt was guided by a series of testing and adherence to principles of prompt engineering (Kheiri & Karimi, 2023):

  • The prompt should be clear and specific to minimize the likelihood of the model generating irrelevant or overly generalized responses.

  • Proper context setting within the prompt can guide the model towards the desired output style.

  • It is essential to maintain a balance in the length of the prompt, providing sufficient context without leading the model to generate excessively long answers.

  • In certain cases, especially in few-shot learning scenarios, it may be beneficial to include examples of desired responses within the prompt itself.

The code used to execute the calls is provided below:

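The original code figure is not reproduced here; a hedged reconstruction follows (the prompt wording is an assumption guided by the description in the text; the legacy Completion API matches the text-davinci-003 model named below):

```python
# Hedged reconstruction (not the authors' original figure) of the OpenAI API
# call used for the end-to-end sentiment analysis of the Slovak subtitles.
# The prompt wording is an assumption; the API key is a placeholder.
import openai

openai.api_key = "YOUR_API_KEY"

def gpt_sentiment(text_sk: str) -> str:
    prompt = (
        "Determine the sentiment of the following Slovak text. "
        "Answer only with a decimal number between -1 (negative) "
        "and 1 (positive).\n\n"
        f"Text: {text_sk}\nSentiment:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,   # deterministic responses with a consistent structure
        max_tokens=5,
    )
    return response["choices"][0]["text"].strip()
```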

By setting the temperature to 0, consistency in the responses and their structure was achieved, given that the responses to the API calls are required only in the form of a decimal number. The text-davinci-003 model was utilized; it is less economically advantageous than the commonly used gpt-3.5-turbo but is expected to attain a higher level of analysis, and it is designed for standalone prompts, whereas gpt-3.5-turbo is more suited to chat interactions. The analysis was conducted using the paid version of the OpenAI API.

The results obtained from OpenAI were recorded in the final dataset under the variable sentiment_score_gpt (Table 9). A total of 98.27% of the results were returned correctly as a decimal number, or as a number accompanied by a few extra characters, which were corrected manually. The remaining 1.73% of the results were not in the correct numerical form but were instead strings of characters; these records were excluded from the analysis. Based on the above modifications, the resulting dataset for analysis was obtained (Reichel & Benko, 2023).

5 Results

The results of the research are presented from the perspective of evaluating the hypotheses established before the initiation of the study. These are assessed based on the identified continuous variable, the sentiment score. To enhance the comparability of the findings with other research, which typically employs the discrete variable of sentiment polarity, the created datasets were also transformed into this format and the results were compared using the standard F1-score metric.

5.1 Comparison of sentiment score

The null hypothesis H0-1 states that there is no statistically significant dependence between the sentiment of the English human text and the sentiment of the English machine translation. Thus, it was necessary to test whether there is a dependency between the identified sentiments.

Correlation analysis was used to verify the dependence; put simply, it verifies that if sentiment is high in the human text, it is also high in the machine translation, and vice versa. To determine the correct method for the correlation analysis, the distribution of each of the groups EN and MT was verified. The sentiment_score variable does not have a normal distribution, as confirmed by the results of the Kolmogorov–Smirnov test for the variables:

  • sentiment_score_en (D(3768) = 0.256, p < 0.01),

  • sentiment_score_mt (D(3768) = 0.276, p < 0.01).

Since enough cases are available, a parametric method can be used (Salton, 1989): Pearson’s correlation coefficient. The calculation was performed at the 5% significance level.
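A short sketch of these tests (SciPy, assuming the joined dataset from the aggregation sketch above):

```python
# Sketch of the normality check and the correlation test described here,
# assuming the `dataset` DataFrame built in the aggregation sketch above.
from scipy import stats

# Kolmogorov-Smirnov test against a normal distribution fitted to the data
for col in ["sentiment_score_en", "sentiment_score_mt"]:
    x = dataset[col]
    d, p = stats.kstest(x, "norm", args=(x.mean(), x.std()))
    print(f"{col}: D = {d:.3f}, p = {p:.4f}")

# Pearson correlation between the EN and MT sentiment scores
r, p = stats.pearsonr(dataset["sentiment_score_en"], dataset["sentiment_score_mt"])
print(f"EN/MT: r = {r:.2f}, p = {p:.4f}")
```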

The results of correlation analysis (Fig. 1) are EN/MT: r(3768) = 0.73, p < 0.01 (Reichel & Benko, 2022b).

Fig. 1 2D scatterplots for the correlation of the variables sentiment_score_en and sentiment_score_mt

Already in this setting, it is noticeable that there is a statistically significant dependence in the identified sentiment between the human text and the machine translation. However, the scatterplots (Fig. 1) show that a significant proportion of the records are cases where neutral sentiment (a value of zero) was identified in one of the variables.

The correlation between the sentiment identified using OpenAI from the Slovak text and the sentiment identified from the English human translation of the Slovak text using the IBM NLU tool was evaluated to verify the relevance of the results of the proposed method.

The results of the correlation analysis are EN/GPT: r(3703) = 0.5, p < 0.01 (the reduced number of records is due to the omission of erroneous responses from OpenAI for the corresponding entries).

A new categorical variable, sentiment_simple_polarity, was created to check whether positive or negative sentiment is mis-transmitted during machine translation. The variable was derived from the identified sentiment scores:

  • sentiment_score > 0, then sentiment_simple_polarity = positive,

  • sentiment_score = 0, then sentiment_simple_polarity = neutral,

  • sentiment_score < 0, then sentiment_simple_polarity = negative.

Using contingency analysis and visualization with bar charts (Fig. 2), a significant proportion of translations were found to be assigned to the correct group of identified sentiment (i.e., if there is negative sentiment in the human text, then there is negative sentiment in the machine translation, etc.). Most of the segments were assigned to the correct group (77.04%). Most of the misclassified segments were neutral/positive or neutral/negative combinations (i.e., cases where neither positive nor negative sentiment was identified in one of the translations). Only 3.71% of cases were completely misclassified (i.e., negative/positive or vice versa).

Fig. 2 Frequencies of combinations of identified sentiments (the color difference corresponds to the distribution according to sentiment_simple_polarity)

Since the aim of this research is to test whether positive and negative sentiment is transmitted in machine translation, segments where neutral sentiment was identified (in any of the variables that entered the given analysis) were excluded. Subsequent correlation analysis using Pearson’s correlation coefficient showed the following results (Fig. 3) (Reichel & Benko, 2022b): EN/MT: r(1497) = 0.86, p < 0.01.

Fig. 3 2D scatterplots for the correlation of the variables sentiment_score_en and sentiment_score_mt, excluding segments with neutral sentiment

There is a high correlation (0.86) between the identified sentiment in the human text and in the machine translation (Munk & Benko, 2018). This is confirmed by the p-value, which is significantly lower than 0.01. Therefore, based on these results, the null hypothesis H0-1, “There is no statistically significant correlation between the machine translation from Slovak to English and the English human text in the level of identified positive or negative sentiment.”, is rejected. Thus, it can be said that there is a statistically significant correlation between sentiment in the human text and in the machine translation. The same evaluation of sentiment using OpenAI has a Pearson correlation with the variable sentiment_score_en of r(1045) = 0.72, p < 0.01.

Based on the above results, the end-to-end solution built on the OpenAI API has advantages in ease of implementation, but there are shortcomings in accuracy. These occurred both when neutral sentiments were considered (rMT = 0.73 > rGPT = 0.5) and when they were omitted (rMT = 0.86 > rGPT = 0.72). The methodology proposed in this paper achieves significantly better results.

RQ-2 asks whether the level of sentiment preservation in direct speech translation (subtitles) can be used as a metric of translation quality. This was verified by investigating the dependence of the BLEU-n and TER results (GLEU was omitted as it yielded similar results) on the error (deviation) between the sentiment level identified in the human text and in the machine translation. For this purpose, it was necessary to create a new variable, sent_dev_mt.

$$sent\_dev\_mt= \left|sentiment\_score\_en-sentiment\_score\_mt\right|$$

Pearson’s correlation coefficient was used. In all combinations, the p-value was < 0.01 (Table 10).

Table 10 Pearson correlation coefficients for accuracy and translation error metrics with the sent_dev_mt variable
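This deviation and its correlation with the metric scores can be sketched as follows (the bleu_1 and ter columns holding segment-level metric scores are hypothetical):

```python
# Sketch of the H0-2 test: correlate the sentiment deviation with segment-level
# translation metric scores (the bleu_1 and ter columns are hypothetical).
from scipy import stats

dataset["sent_dev_mt"] = (dataset["sentiment_score_en"]
                          - dataset["sentiment_score_mt"]).abs()

for metric in ["bleu_1", "ter"]:
    r, p = stats.pearsonr(dataset["sent_dev_mt"], dataset[metric])
    print(f"sent_dev_mt vs {metric}: r = {r:.2f}, p = {p:.4f}")
```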

Thus, the null hypothesis H0-2, “There is no statistically significant correlation between the error of the identified sentiment score in the machine translation and the accuracy (or error rate) of the machine translation.”, can be rejected at the 1% significance level. There is a statistically significant relationship between the error in the sentiment score and the accuracy/error rate of the translation.

5.2 Comparison of sentiment polarity using F1 score

The aim is to quantify the sentiment conveyed in a text by categorizing it into negative, neutral, or positive polarity. While the sentiment was initially identified using a continuous score ranging from −1 to 1, it was essential to convert this continuous score into discrete polarity categories to better align with the analysis objectives. The F1 score, defined as the harmonic mean of precision and recall, was employed to evaluate the accuracy of the sentiment categorization.

To transform the continuous sentiment scores into discrete polarities, the 33rd and 67th percentiles were used as boundary values, computed while excluding the frequently occurring zero values (a sketch of this transformation follows the list). Specifically:

  • Values below the 33rd percentile (inclusive) were assigned a polarity of − 1 (negative).

  • Values between the 33rd and 67th percentiles were assigned a polarity of 0 (neutral).

  • Values above the 67th percentile (inclusive) were assigned a polarity of 1 (positive).
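A sketch of this discretization and of the F1 comparison (scikit-learn; the macro-averaging strategy across the three classes is an assumption, as the paper does not specify it):

```python
# Sketch of the percentile-based discretization and the F1 comparison.
# The averaging strategy (macro) is an assumption not stated in the paper.
import numpy as np
from sklearn.metrics import f1_score

def to_polarity(scores):
    nonzero = scores[scores != 0]  # percentiles computed without the frequent zeros
    lo, hi = np.percentile(nonzero, [33, 67])
    return np.where(scores <= lo, -1, np.where(scores >= hi, 1, 0))

pol_en = to_polarity(dataset["sentiment_score_en"].to_numpy())
pol_mt = to_polarity(dataset["sentiment_score_mt"].to_numpy())

# Macro-averaged F1 treats the three polarity classes equally
print(f"F1 (EN vs MT): {f1_score(pol_en, pol_mt, average='macro'):.2f}")
```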

The F1 scores were computed for two combinations of sentiment categorizations:

  • Between sentiment_polarity_en and sentiment_polarity_mt, resulting in an F1 score of approximately 0.79.

  • Between sentiment_polarity_en and sentiment_polarity_gpt, resulting in an F1 score of approximately 0.67.

These results indicate a higher agreement between human text and Google Translate sentiment categorizations compared to human text and GPT-based categorizations. The lower F1 score for the GPT-based comparison may highlight differences in the sentiment identification process or specific nuances captured by the models. Overall, these findings offer valuable insights into the compatibility of different sentiment analysis techniques and their application in various linguistic contexts, with implications for both research and practical implementations.

6 Discussion

This paper aimed to analyze whether machine translation of text from Slovak, a flective language, into English provides sufficiently accurate text for sentiment analysis. The research examined whether the results of sentiment analysis using the IBM Watson™ Natural Language Understanding service on human text match the results of sentiment analysis on machine translation from the most popular translator, Google Translate.

The study utilized a parallel corpus of segmented movie subtitles combining the Slovak and English languages. The objective of the study was to evaluate the effectiveness of sentiment analysis on machine translation. Since previous research had shown no statistically significant difference in sentiment preservation between the most widely used translation systems, a single machine translation version (Google Translate) was generated, with the results generalizable to other systems. The paper delves into the process of preparing the data for sentiment analysis and outlines the methodology employed, including the use of the IBM Watson™ NLU service. Furthermore, the hypotheses were tested based on the identified research problems.

The purpose of this study was to verify H0-1, which was based on RQ-1 and aimed to assess whether the sentiment expressed in machine-translated movie subtitles is statistically dependent on the human-generated text of the subtitles. The findings of the correlation analysis indicated a strong dependence between the two groups (0.86), implying that one can use machine translation to convert Slovak text into English and subsequently apply an established tool for sentiment analysis in English to accurately identify the sentiment expressed in the Slovak text. The results obtained from this approach should align well with those obtained from a high-quality sentiment analysis tool designed for the Slovak language. However, the findings of the study also raise the question of whether there is a pressing need to develop such a localized tool for sentiment analysis.

The study investigated the suitability of sentiment analysis on movie subtitles, including the identification of appropriate movie genres for this purpose (RQ-1.1). The results indicated that the accuracy of sentiment analysis was adversely affected by the presence of comedy movies. This finding can be attributed to the difficulties in translating humor, irony, and sarcasm, which are common in comedy movies. Translating such language nuances can be challenging even for human translators, thus posing challenges for sentiment analysis. Therefore, when applying the proposed methodology to texts with ambiguous or humorous language, it is recommended to exercise caution.

In addressing RQ-1.1, 149 segments were identified in which the opposite sentiment was identified for the human text and the machine translation. One of the frequent problems was the occurrence of vulgarisms in the texts and their differing translation; these occurred mainly in comedy movies. Vulgarisms were placed differently in the text, which the machine translation often could not evaluate correctly. Subtitles are sometimes not translated literally; the translation depends on the situation and its standard understanding in the region. Vulgarisms are also used differently in different languages, and therefore, in the erroneous sample, vulgarisms were sometimes present in only one of the languages (SK/EN), i.e. a vulgarism either did not occur or occurred only in the machine translation. Machine translation also sometimes avoids translating vulgarisms and tries to render the word more politely.

Due to the high emotional expressiveness of vulgarisms, their inappropriate use can lead to significant distortion of the intended sentiment within a sentence, ultimately causing discrepancies in data records. This issue is compounded by the ambiguity of other types of slang, which can also have multiple interpretations.

In addition to the methodologies explored, a comparative analysis using OpenAI to evaluate the sentiment of the Slovak text was conducted. By leveraging OpenAI’s capabilities, it was possible to assess the sentiment directly from the Slovak text without the need for translation into English. The correlation between the sentiment identified using OpenAI and the sentiment identified from the human-translated English text using IBM NLU further substantiated the identified findings. This approach not only reinforced the validity of our methodology but also highlighted the potential of utilizing advanced language models for sentiment analysis in low-resource languages. Despite the relatively high accuracy of the rather simple end-to-end solution using OpenAI, the described methodology, which consists of machine translation followed by sentiment analysis in English, proved to be significantly more precise.

A further challenge that emerged was related to the subtitles of a specific fairy tale movie. In particular, the dialogue of one character, depicted as a giant, exhibited ungrammatical construction and inconsistent inflection. Consequently, the machine translation and sentiment analysis of the giant's speech were subject to occasional inaccuracies.

To address RQ-2, it was crucial to establish a variable measuring the error rate of the sentiment analysis carried out on the machine translation. By examining the relationship between this variable and the translation accuracy, a correlation at a moderate level, ranging from 0.30 to 0.32, was discovered. Although moderate, this correlation was statistically significant.

7 Conclusions

The research has demonstrated that machine translation of Slovak text into English, followed by sentiment analysis, can be used as an effective alternative to a sentiment analysis tool in the local language, Slovak. This statement is valid for texts containing short direct-speech sentences, e.g. dialogues. It is reasonable to assume that the concept is also applicable to general texts, such as articles or reviews; however, attention must be paid to correct grammar. Expressive and slang words in texts may also reduce the accuracy of the described methodology, but they do not reduce it so much that the methodology ceases to be statistically significantly effective for such texts. While the primary methodology, involving machine translation into English followed by sentiment analysis, proved to be highly accurate, the comparative analysis with OpenAI’s end-to-end solution revealed a promising alternative. Despite the relatively high accuracy of OpenAI’s sentiment determination, the proposed methodology was found to be more precise.

The F1 scores results indicate a higher agreement between human text and Google Translate sentiment categorizations compared to human text and GPT-based categorizations, offering valuable insights into the compatibility of different sentiment analysis techniques and their application in various linguistic contexts.

The research findings indicate a statistically significant correlation between the error of sentiment score identification in machine translation and the translation accuracy assessed using BLEU metrics. Therefore, the sentiment transfer factor in machine translation can be considered a valuable metric for evaluating the accuracy of machine translation. To validate this hypothesis further, the research should be extended to other text types besides movie subtitles and should incorporate manual translation evaluation.

Conducting additional research would be valuable to examine diverse text genres and to confirm the findings using alternative data sources. Expanding the research to comparable languages beyond Slovak would also be advantageous; to achieve this, applying the same methodology to other flective languages would be a prerequisite.