1 Introduction

Nowadays, millions of people share their opinions and feedback on products and services via comments, reviews, and social media. Positive and negative feedback impacts the sales of products and services as well as the organization's public perception. However, the rate at which this information is generated makes it impossible to analyse manually. Sentiment analysis offers a solution: it automatically processes these data by computationally understanding and classifying subjective information in source materials. Companies have successfully deployed such systems, as a part of business intelligence and big data analytics technologies (Cavallari et al. 2019; Yang et al. 2018; Guellil et al. 2021), to use this feedback for improving their products (Cambria et al. 2019; Dragoni et al. 2019; Hussain et al. 2021).

Due to the informal and highly colloquial nature of reviews and comments uploaded by general Internet users, the source text often contains idiomatic expressions. An idiomatic expression is a language construction that conveys a meaning distinct from the literal meaning of the words it is formed of, or a multi-word expression with non-compositional meaning. While idiomatic expressions often make speech more expressive, they are difficult to understand if not listed in the lexicon (Zikopoulos et al. 2011; Dashtipour et al. 2021). In addition, they are highly language-dependent and unpredictable: knowing idiomatic expressions in one language often provides no help in interpreting idiomatic expressions in another language (Wang et al. 2013; Li et al. 2021; Dashtipour et al. 2021). Finally, a lexicon-based sentiment analysis framework often incorrectly classifies sentences that contain idiomatic expressions. Therefore, the inclusion of idiomatic expressions in the lexicon could help to accurately interpret a wide range of opinions available on the Internet.

Although Persian is an official language of Iran, Afghanistan (where the variety is called Dari) and Tajikistan (where the variety is called Tajik), with about 130 million speakers across these countries, very few tools are currently available to automatically summarize the overall opinion (Ling Lo et al. 2016; Dashtipour et al. 2019, 2017, 2018, 2017; Ieracitano et al. 2018; Jiang et al. 2021; Dashtipour et al. 2021); in particular, Persian sentiment analysis techniques suffer from the lack of resources to detect and interpret idiomatic expressions in the reviews.

Persian has a number of interesting sociolinguistic peculiarities. A notable difference between Persian and English is the writing system: Persian uses a complex, right-to-left script (or a variant of the Cyrillic script in Tajikistan) that is often not compatible with existing keyboards or devices, which leads to much greater variation in spelling and code switching than in English. In addition, some letters of the Persian script, such as T, S and Z, are written in different shapes depending on the context, which creates more spelling variants in informal user-contributed texts often typed on phone keyboards: for example, the word (“environment”) can be written in various forms (Basiri et al. 2019; Nezhad et al. 2019; Dashtipour et al. 2017, 2020; Gogate et al. 2020; Ahmed et al. 2021; Gogate et al. 2019, 2017).

Particularly important for this work is the fact that Persian has many idiomatic expressions, metaphors, slang, swear words and profane expressions, with a much more complicated system of social acceptance and taboos than English, leading to a more complex system of nuances, double meanings and wordplay. We attribute this mostly to an interplay of sociolinguistic factors such as education level, general standard of living, religion and dialectal variation. Persian speakers often prefer not to express their sentiment explicitly but use idiomatic expressions or euphemisms to convey their thoughts (Khoshnevisan 2019; Gogate et al. 2020, 2020; Dashtipour et al. 2021; Gogate et al. 2019).

In addition, source text that consists of many informal words, as well as transliterations or literal translations of English words, makes it difficult to analyse the text polarity (Mullen and Malouf 2006). The process of extracting features such as idioms for sentiment analysis of Persian texts is more complex than extracting traditional features, and it is sometimes difficult to assign polarity to such features. We found that nearly 8% of the movie review dataset contained idioms. Not considering them has a negative impact on the classification of the overall polarity of the text (Mansouri 2015).

In this paper, we present PerSent lexicon (Dashtipour et al. 2016) with 14000 idioms in the Persian language, along with their sentiment polarity, and show its value for the sentiment analysis task. In addition, we integrate the extended lexicon and the idiom analysis engine into our lexicon-based Persian sentiment analysis framework. We evaluate the usefulness of the obtained novel resource via application to sentiment analysis using machine learning applied to a dataset of reviews of movies and products. We show that our extended PerSent-based sentiment analysis outperforms state-of-the-art sentiment analysis approaches as well as Persian-to-English translation-based approaches. The extended version of the lexicon will be made publicly available for research purposes.

The main contribution of this work is the first-of-its-kind publicly available Persian Sentiment lexicon for accurately classifying source text consisting of idiomatic expressions. We provide an algorithm to detect Persian idioms in a sentence. We also provide a methodology for using the lexicon in a machine learning-based sentiment analysis framework and show that the proposed model outperforms the baseline lexicon-based algorithm that simply counts the average polarity of the words and expressions in the document. In this way, we illustrate a sentiment analysis framework for a resource-poor language, to which more advanced corpus-based methods, in particular, deep learning-based techniques, are not applicable. In addition, we show that for idiomatic expression-based sentiment analysis, less pre-processing of the input text is required and a simpler algorithm can be used in comparison with phrase-based and concept-based sentiment analysis, since idioms are fixed expressions with no or very little syntactic variation, while detecting phrases or concepts in the text may require paraphrase detection.

The paper is organized as follows. In Sect. 2, we discuss related work. In Sect. 3, we describe the methodology used for compilation of our lexicon. In Sect. 4, we describe our procedure for the evaluation of the obtained lexicon and the datasets used for evaluation. In Sects. 5 and 6, we present experimental results and discussion, respectively. Finally, Sect. 7 concludes the paper and outlines the directions of future work.

2 Related work

An idiom is a phrase or fixed expression that conveys meaning different from the combination of literal meanings of its constituent words, such as figurative meaning: for example, “give me a hand” means in English “help me”; “hold on a second” means in English “wait for a short time” (Langlotz 2006). Thus, it is difficult or impossible to deduce its meaning without knowing it beforehand, even if the meaning of the individual words is known.

Idioms exist in all known languages; it is estimated that more than 20,000 idiomatic expressions exist in the English language, and Persian language is no exception. For example, (“as mouse and cat”) in Persian refers to two people who keep fighting with each other. The following properties characterize idioms (Fraser 1970):

  • Conventionality: literal meaning of the phrase does not coincide with its idiomatic meaning;

  • Inflexibility: the syntax of idioms is very restrictive;

  • Figuration: idioms often present figurative meaning;

  • Informality: idioms often contain informal words;

  • Affect: idioms often have non-neutral (i.e. positive or negative) polarity.

A correct interpretation of idioms is crucial for the understanding of the meaning of a text. To identify idioms in a given sentence, various linguistic resources, such as lists of fixed expressions, phrases, cliches and proverbs, need to be considered (Passaro et al. 2019).

Non-native speakers experience difficulties in understanding idioms, which can significantly affect their personal and professional communication (Nippold and Martin 1989). Therefore, most of the second-language classes focus on idioms as an important part of language acquisition (Liu 2003; Erman and Warren 2000). Correct interpretation of idioms is very important in sentiment analysis. Williams et al. (2015) collected more than five hundred English idioms and used web-based crowdsourcing to assign their polarity. They evaluated the usefulness of their resource on a movie reviews dataset, using an SVM classifier for sentiment analysis. In their experiments, they obtained an accuracy of 0.70. However, they showed that their approach was not suitable for document-level sentiment analysis, so it was applied for sentence-level analysis.

Liang et al. (2018) built a sentiment lexicon SlangSD that contained informal words. They used online resources to collect idioms and slang words. The performance of the classifier using their resource was boosted from the baseline 0.73 to 0.87. A shortcoming of their approach was that the lexicon was relatively small, and hence, it could not effectively detect informal words in texts. Ibrahim et al. (2015, 2015) created an Arabic lexicon to identify idioms in the text. They trained an SVM to evaluate the performance of the features extracted from text, including idioms. Once the idioms have been extracted, a corpus of 1000 positive and 1000 negative tweets was used to identify their polarity. The overall accuracy achieved was 0.986. The technique that they used does not generalize to Persian because they developed their lexicon for Arabic dialects and colloquial Arabic language.

Gul (2014) proposed an approach for detecting idioms in Urdu texts. In this work, an idiom lexicon of 2500 Urdu idioms along with their polarity was developed. A classifier was trained on a dataset that was manually annotated with positive and negative polarity (Raj and Kajla 2015). The technique that they used does not generalize to Persian because their lexicon consists of Urdu words.

Table 1 summarizes the most important existing resources and approaches to the use of idioms for sentiment analysis in various languages. We included in the table several recently developed approaches, as well as some approaches published several years ago that have not been outperformed by later approaches. The table also shows the accuracy of polarity classification with and without detection of idioms. For example, the figures for Wang et al. (2010) suggest that idioms alone provide high accuracy for the Chinese dataset they used for evaluation. In all cases, adding idioms gives a significant boost in sentiment analysis accuracy.

Table 1 Approaches for idiom detection in text

3 Persian idiom lexicon

In order to build our Persian idiom lexicon for sentiment analysis, we extracted idioms from a website with a list of 925 Persian idioms. This website provides the most widely used Persian idioms, though there are many more idioms not widely used in daily communication and not understandable for many native speakers; the treatment of such rarer idioms is a topic of our future work.

Three annotators manually assigned polarity to the idioms in the form of a number between −1 (very negative) and +1 (very positive), with a precision of one decimal place, such as +0.7. In their work, they could consult the Internet or other sources for usage examples as they felt appropriate. In the resulting lexicon, the values of the annotators were averaged and, for simplicity of presentation, rounded to one decimal digit, since these estimates were very subjective anyway. The annotators were educated native speakers of Persian, two of them between 50 and 60 years old and one 30 years old. The agreement between the annotators was substantial, with a Fleiss' kappa of 0.73 (Fleiss et al. 2020).
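To illustrate how an agreement figure of this kind can be computed, the following minimal Python sketch implements Fleiss' kappa on a small made-up table of ratings. It assumes that each numeric score is first reduced to a categorical label (here "pos" or "neg"); this binning, the function name `fleiss_kappa` and the toy data are illustrative assumptions, not necessarily the procedure used for our lexicon.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one list of categorical labels per item,
    each list holding one label per annotator (same number of annotators)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()          # how often each category was used overall
    agreement_sum = 0.0         # sum of per-item agreement values P_i
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items                       # observed agreement
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2     # chance agreement
              for c in categories)
    return (p_bar - p_e) / (1 - p_e)

# Toy data (hypothetical): four idioms, three annotators each.
toy = [["pos", "pos", "pos"],
       ["neg", "neg", "neg"],
       ["pos", "pos", "neg"],
       ["neg", "neg", "neg"]]
kappa = fleiss_kappa(toy, ["pos", "neg"])
```

The function is undefined when all annotators use a single category for every item, because chance agreement is then equal to one.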

In some cases, the annotators did not agree on whether the polarity is positive or negative; examples of such idioms are (“bread and kebab”) or (“wind party”: a person who keeps changing political views). In such cases, these idioms were removed. In addition, for some idioms, annotators were not sure as to their polarity; such idioms were also removed from the lexicon.

In about 20% of cases, the annotators agreed on the sign of polarity but the difference of the assigned numerical values was more than 0.5, such as +0.9 versus +0.3. In such cases, we used TextBlob, a Python library for natural language processing, based on the widely used Natural Language Toolkit (NLTK). It includes an automatic polarity detector for English sentiment analysis, pre-trained with supervised learning using naive Bayes as the classifier, which can identify the polarity for a word, multi-word expression, or sentence. To determine the sentiment of a Persian phrase, we used Google Translate to automatically obtain an idiomatic translation of the phrase into English. Then, we manually corrected these translations where the translation provided by Google was not idiomatic enough. Finally, we fed the translation into TextBlob's sentiment detector to obtain the automatic sentiment value judgement. For example, for (“bad face”) the polarity obtained in this way was −0.6; for (“ugly face”) it was −0.7. A Persian dictionary consisting of 14,000 phrases was also used, and three Persian language experts removed unrelated phrases. Of the 925 Persian idioms, 326 were negative and 374 were positive (700 in total), and 225 were neutral.
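The rules described above for reconciling the three annotator scores can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `merge_scores`, the `fallback` callable (standing in for the translation-plus-TextBlob estimate) and the example values are hypothetical.

```python
from statistics import mean

def merge_scores(scores, fallback=None, max_spread=0.5):
    """Combine annotator scores in [-1, 1] into a single polarity.

    Returns None when the idiom should be dropped (annotators disagree
    on the sign). `fallback` is a hypothetical callable that returns an
    automatic polarity estimate; it is consulted only when the annotators
    agree on the sign but their values differ by more than `max_spread`.
    """
    has_pos = any(s > 0 for s in scores)
    has_neg = any(s < 0 for s in scores)
    if has_pos and has_neg:
        return None                       # sign disagreement: remove idiom
    if max(scores) - min(scores) > max_spread and fallback is not None:
        return round(fallback(), 1)       # large spread: automatic estimate
    return round(mean(scores), 1)         # otherwise: rounded average
```

For instance, `merge_scores([0.7, 0.8, 0.6])` averages the three values, while `merge_scores([0.9, -0.3, 0.5])` returns `None` because the signs conflict.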

Finally, of the 925 idioms that have been annotated manually, those idioms that were annotated with neutral polarity were not included in our lexicon; by neutral, we understood either the label “neutral” assigned by the annotators or the value of zero assigned by TextBlob. We did not use any threshold for the exclusion of neutral idioms, because, as Fig. 1 illustrates, the majority of the idioms that had nonzero polarity value were strongly opinionated, so there were very few cases of “almost neutral but still not neutral” annotations. We also decided to remove from the lexicon idioms that contained strong profanity, because they might create ethical, religious or legal problems in annotating and sharing the lexicon. This did not affect our evaluation because the review datasets on which we evaluated our lexicon did not contain strong profanity due to the policy of the corresponding websites. Investigating the impact of profanity on sentiment analysis using a suitable dataset will be a topic of our future work.

This resulted in 326 negative and 374 positive idioms. These figures look a bit unexpected because in any language negative expressions are more numerous than positive ones. We attribute the inverse relation in our figures to the removal of the idioms that contained strong profanity, which was mostly negative. Figure 1 shows the number of idioms by assigned polarity. Table 2 shows the numbers of different types of idiom, such as dialectal, light swear words and slang. Table 3 gives examples of idioms in our lexicon along with their idiomatic translation and polarity.

Fig. 1 Number of idioms in our lexicon by polarity

Table 2 Statistics of idioms in our lexicon by type
Table 3 Examples of idioms

4 Evaluation methodology

In order to evaluate the performance of our idiom lexicon, we used the original PerSent lexicon and its extension with the idiom lexicon to assign polarity to sentences with various classification algorithms. The algorithms we used varied from direct counting of the average polarity of the words and expressions in the sentence to the use of machine learning techniques to extend the sentiment values from the words present in our lexicon to other words that co-occur with them in texts.

Figure 2 shows the framework we used to evaluate the performance of our idiom lexicon. The left-hand part of this figure represents the process of compilation and annotation of the lexicon described in the previous section. In this section, we describe the corresponding processing steps we followed for evaluation of the obtained lexicon, shown in the right-hand part of the figure.

Fig. 2 Framework for compilation and evaluation of our idiom lexicon

4.1 Pre-processing and feature extraction

Pre-processing Datasets collected from online sources are noisy. At the pre-processing step, the noise and uninformative parts of the text were removed, which sped up the rest of the process. In addition, normalization was done at this stage. We used the Persian normalization algorithm JHAZM to normalize the words and phrases. For example, (“The movie was greatttttt”) was changed to (“The movie was great”) (Dashtipour et al. 2016; Nourian et al. 2015). This normalization might cause false positives in our idiom detection procedure; in our future work, we plan to study the effect of pre-processing on idiom detection.
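The kind of normalization shown in the example above can be approximated with a single regular expression. The sketch below is not the JHAZM algorithm; it is a simplified stand-in that assumes a character repeated three or more times in a row is an elongation and collapses the run to one character. The function name `collapse_elongation` is hypothetical.

```python
import re

# A run of the same character repeated three or more times.
_ELONGATION = re.compile(r"(.)\1{2,}")

def collapse_elongation(text):
    """Collapse elongated character runs ("greatttttt" -> "great")."""
    return _ELONGATION.sub(r"\1", text)

normalized = collapse_elongation("The movie was greatttttt")
```

Legitimate double letters, as in "good", are left unchanged because only runs of three or more characters are collapsed.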

N-gram feature extraction In the experiments that involved machine learning algorithms, we used trigram and four-gram features. For example, (“It was really bad movie”) was transformed into two trigrams: and (Lopez-Gazpio et al. 2019).
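As a minimal sketch of this feature extraction, the function below slides a window of length n over a token list. The English placeholder tokens are an assumption made for readability; in the experiments the tokens are Persian words.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["it", "was", "really", "bad"]   # placeholder four-token sentence
trigrams = ngrams(tokens, 3)
fourgrams = ngrams(tokens, 4)
```

A sentence of four tokens yields two trigrams and one four-gram; a sentence shorter than n yields no n-grams.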

Bag of words We also used the bag-of-words (unigram) features to reflect the frequency of occurrence of words for training the classifier (Ayadi et al. 2019). The bag-of-words features for the sentence (“God Lawyer was bad movie”) were (“God-Lawyer”), (“was”), (“movie”), and (“bad”), each one with term frequency 1. We used adjectives, adverbs, verbs and nouns as features (Deshpande et al. 2019).
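The bag-of-words representation with part-of-speech filtering can be sketched as below. The tag names in `CONTENT_TAGS` and the pre-tagged input are hypothetical; any Persian part-of-speech tagger with equivalent categories could supply them.

```python
from collections import Counter

# Hypothetical tag names for nouns, adjectives, adverbs and verbs.
CONTENT_TAGS = {"NOUN", "ADJ", "ADV", "VERB"}

def bag_of_words(tagged_tokens):
    """Term frequencies of tokens whose tag is a content-word category."""
    return Counter(tok for tok, tag in tagged_tokens if tag in CONTENT_TAGS)

tagged = [("God-Lawyer", "NOUN"), ("was", "VERB"),
          ("bad", "ADJ"), ("movie", "NOUN"), ("the", "DET")]
features = bag_of_words(tagged)
```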

Idiom detection algorithm The idiom detector automatically detects the idioms in Persian sentences. First, the sentences are tokenized; then the algorithm automatically identifies the idioms, which are fed into the machine learning classifiers to evaluate the performance of the approach.

Algorithm 1 presents the procedure that we used to detect idioms in a Persian text. Like any lexicon-based algorithm, our procedure detects idioms present in the lexicon, but does not discover new idioms from the texts. Some idioms contain discontinuous phrases or words that can change, for example, “pull up his socks” or “pull up her socks”; a Persian example is (lit. “step on my eyes”) or (lit. “step on his eyes”), which means “welcome”. Our algorithm does not detect such types of idioms, which are statistically very rare in Persian, because in Persian verbs and nouns have no grammatical gender (Chen et al. 2014).

Persian idioms are usually located in the middle of the sentence. Persian sentences start with a noun and end with a verb, and the idiom is usually located in the sentence between noun and verb. The idioms can be detected automatically in any part of the sentence. The algorithm detects the idioms in the sentence even if multiple idioms are present in different parts of the sentence. For example, in the following sentence, (“It is a mockery movie and it drives me crazy when I see it”), our algorithm identifies the idioms by tagging the idiom in the sentence: (shown here by underlining instead of the idiom tag).

Example with positive polarity: (“I would like to hurray bravo the whole team of the film”). The algorithm detects the idiom in the sentence and returns .

Example with negative polarity: (“Seeing this film, I wasted my life”). The algorithm detects the idiom in the sentence and tags it as follows:

Example with multiple idioms: The algorithm can detect multiple idioms in the sentence: (“the movie was trash and fiddle and faddle”). The algorithm tags the idioms in the sentence as .
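The behaviour illustrated by these examples can be sketched as a lexicon lookup over contiguous token sequences. The code below is not a reproduction of Algorithm 1; it assumes a greedy, left-to-right, longest-match strategy, and the English placeholder idioms and their polarity values are made up for illustration.

```python
def detect_idioms(tokens, idiom_lexicon):
    """Find lexicon idioms in `tokens`.

    `idiom_lexicon` maps a tuple of tokens to a polarity value.
    Returns a list of (start, end, polarity), with `end` exclusive.
    Only contiguous, fixed-form idioms are detected.
    """
    max_len = max((len(key) for key in idiom_lexicon), default=0)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in idiom_lexicon:
                found.append((i, i + n, idiom_lexicon[key]))
                i += n          # continue after the matched idiom
                break
        else:
            i += 1              # no idiom starts at this position
    return found

# Hypothetical lexicon entries and polarity values.
lexicon = {("drives", "me", "crazy"): -0.8, ("mockery",): -0.6}
sentence = "it is a mockery movie and it drives me crazy".split()
matches = detect_idioms(sentence, lexicon)
```

Because the scan continues after each match, several idioms in different parts of the same sentence are all reported.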

4.2 Labelling with average polarity

The PerSent Persian lexicon contains 1500 Persian words along with their polarity and part-of-speech tag: noun, adjective, adverb or verb (Dashtipour et al. 2016). As a variant of classification technique, we used the PerSent lexicon and our idiom lexicon to identify the average polarity of the features. For example, in (“it was great movie”), the word (“great”) was searched in the PerSent lexicon to identify its polarity. If the idiom was detected in the sentence, the idiom lexicon was used to assign polarity to it. For example, for (lit. “I really love the movie but the acting did not play guitar with my heart”), the output was + 1 for the positive word and -0.4 for the negative idiom, and thus + 0.6 overall. We calculated the average polarity of the review as

$$\begin{aligned} AP=\frac{1}{n}\sum _{i=1}^{n}P(w_i) \end{aligned}$$
(1)

where AP denotes the average polarity, P denotes the polarity of a word, \(w_i\) denotes the ith word in the review and n denotes the number of words in the review. We used this value for automatic labelling of texts used as training examples for supervised classification algorithms, as explained below.
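Equation (1) translates directly into code. In the sketch below, words that are not listed in the lexicon are assumed to contribute a polarity of zero; this treatment of out-of-lexicon words, the function name and the toy lexicon are illustrative assumptions.

```python
def average_polarity(tokens, lexicon):
    """Average polarity of a review as in equation (1).

    `lexicon` maps a word (or detected expression) to its polarity;
    entries missing from the lexicon are assumed to have polarity 0.
    """
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)

toy_lexicon = {"great": 1.0}              # hypothetical entry
ap = average_polarity(["it", "was", "great", "movie"], toy_lexicon)
```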

4.3 Direct and machine learning-based classification

The purpose of our classification experiments was twofold: to evaluate the usefulness of the developed lexicon for Persian sentiment analysis and to identify the most accurate machine learning algorithm for sentiment analysis of Persian texts. To show that our framework performed better with the PerSent 2.0 lexicon that includes the idiom lexicon than with the original PerSent 1.0 lexicon alone, we conducted various experiments in document-level polarity classification of Persian texts. In these experiments, we compared the results obtained with two versions of the lexicon: the old PerSent 1.0 (without idioms) and the new PerSent 2.0 (with idioms) on the task of document-level polarity classification. We also compared the techniques that rely on our lexicon with a baseline technique that does not rely on it.

We used three classification methods: a direct method, based on the average polarity from equation (1); a machine learning-based method that implicitly extended the annotation from the existing lexicon to a greater number of features; and a baseline experiment, based on automatic translation into English without using our PerSent lexicon or the idiom lexicon.

Direct classification: We used the average polarity from equation (1) to assign positive or negative polarity to texts. Namely, we assigned the negative label to the text if AP was negative; otherwise, we assigned positive polarity. Note that we never assigned neutral polarity, even when AP was zero.

Machine learning classification: We also tried a more sophisticated procedure that internally involved training a supervised classifier, with a distant-supervision approach. Namely, we first used the unsupervised direct classifier to identify the polarity in a large unlabelled dataset, and then trained a supervised classifier on these examples obtained with the unsupervised classifier. For the SVM, we used an RBF kernel; for naive Bayes, no sample weights were used.

We did not experiment with purely supervised approaches since the aim of our experiments was not to show how well one can classify Persian texts when one has a large enough corpus of manually labelled examples. Here we only aim to show how useful our lexicon is for classification of Persian texts in a lexicon-based (distant-supervision) manner, without any manually labelled examples at all. A more complete distant-supervision approach could use for training both a manually labelled dataset and a much larger corpus automatically annotated with the help of our lexicon; however, currently, we do not have at our disposal such a large corpus. In addition, our current experiments are sufficient to demonstrate the value of our idiom lexicon.

In order to train the supervised classifier, the features were extracted from the reviews and the lexicon was used to assign polarity to the sentences. Then, we used this automatically labelled dataset to train a supervised classifier. We evaluated the trained classifier on our manually annotated dataset for which the polarity of the reviews was known. (The main reason for using binary classification was its efficiency.) We used five different classifiers: support vector machine (SVM), naive Bayes classifier, k-nearest neighbour classifier (kNN), decision tree and convolutional neural network (CNN). For the CNN, we used deeplearning4j, an open-source Java library (Nicholson and Gibson 2017).

The whole process was still unsupervised since no manually labelled examples were used for training; however, this process allowed us to extend implicitly the labels present in the lexicon to words that are not listed in the lexicon but co-occur with the words present in the lexicon, in order to improve the results. We did not export the learnt data in the form of a separate, larger lexicon with these automatically obtained sentiment values, but this is certainly possible. Such a list could be used in our future work, for example, for manual revision of the assigned sentiment values in order to extend the original lexicon.

Due to the small size of our manually labelled dataset and the lack of data in Persian language, in our experiments, we used the same dataset as a source of unlabelled data for the distant-supervision procedure. Namely, we used a tenfold cross-validation procedure, with the training portion of the dataset used only as a source of unlabelled data. For each fold, denote by A the test set with manually assigned polarity labels and by B a corpus obtained from the training set by removing the labels. Then we automatically annotated B using equation (1), trained a classifier on the obtained dataset and tested it on the manually annotated set A. Note that, in spite of using a training procedure internally, the whole process is lexicon-based and unsupervised since it does not use any manually labelled examples.
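The cross-validation procedure with sets A and B can be sketched as follows. This is a minimal illustration, not our experimental code: the per-word voting model is a toy stand-in for the supervised classifiers, the fold assignment by index is an assumption, and all function names are hypothetical. The dataset must contain at least k documents.

```python
from collections import Counter, defaultdict

def lexicon_label(tokens, lexicon):
    """Direct classifier: sign of the average polarity (zero -> positive)."""
    ap = sum(lexicon.get(t, 0.0) for t in tokens) / max(len(tokens), 1)
    return "neg" if ap < 0 else "pos"

def train_word_votes(docs, labels):
    """Toy supervised model: count, per word, the labels it occurs with."""
    votes = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in set(tokens):
            votes[tok][label] += 1
    return votes

def predict(votes, tokens):
    """Label a document by summing the votes of its known words."""
    score = sum(votes[t]["pos"] - votes[t]["neg"] for t in tokens if t in votes)
    return "neg" if score < 0 else "pos"

def distant_supervision_cv(docs, gold, lexicon, k=10):
    """Mean accuracy over k folds; gold labels are used for testing only."""
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(len(docs)) if i % k == fold]    # set A
        train_docs = [d for i, d in enumerate(docs) if i % k != fold]  # set B
        auto_labels = [lexicon_label(d, lexicon) for d in train_docs]
        model = train_word_votes(train_docs, auto_labels)
        hits = sum(predict(model, docs[i]) == gold[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

# Toy data (hypothetical): "fun" and "dull" are not in the lexicon.
docs = [["great", "fun"], ["great", "fun"], ["awful", "dull"], ["awful", "dull"]]
gold = ["pos", "pos", "neg", "neg"]
toy_lexicon = {"great": 1.0, "awful": -1.0}
acc = distant_supervision_cv(docs, gold, toy_lexicon, k=2)
```

Note that the trained model also holds votes for words such as "fun" and "dull" that are absent from the lexicon, which is how the labels are implicitly extended to co-occurring words.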

Classification via translation In this experiment, the whole dataset was translated into English using Google Translate. The translated sentences were passed to TextBlob, and the average polarity was calculated. However, TextBlob was not able to identify the polarity of some idioms, because Google Translate provided non-idiomatic translations (Loria et al. 2014).

4.4 Dataset used

To evaluate the performance of the framework, we used the following three datasets.

Movie Reviews dataset: Due to the lack of available resources for the Persian language, we had to compile our own dataset. For this, we collected more than 1000 movie reviews from two popular movie review sites. We annotated these reviews as positive or negative and selected 500 positive and 500 negative reviews. The movie reviews in our dataset are on comedy and action movies, from 2014 to 2016. This dataset is rich in colloquial language, slang and idioms, for example, (“It is movie with lots of zany acting”). However, as we have explained above, it does not contain profanity.

Persian VOA dataset: We used the widely used benchmark Persian Voice of America (VOA) news dataset, which contains 500 positive and 500 negative news headlines. The language of headline news is quite formal; the headlines contain far fewer informal or colloquial expressions, slang words or idioms than the movie reviews (Mirsarraf et al. 2013).

Amazon reviews dataset: In order to compare the result, we used the Amazon reviews dataset, which contains more than one million English-language reviews on movies and TV. We used the bag of words of the reviews to compare the result with the Persian datasets. Table 4 summarizes the statistics of the datasets used. It includes the number of idioms detected in each dataset with our lexicon.

Table 4 Datasets used

5 Experimental results

We measured the performance in terms of accuracy: the ratio of the number of correctly classified documents to the total number of documents in the dataset. In addition, to estimate the bias of the classifier, we report separately the recall of the positive and negative examples on the test set: the ratio of the correctly classified positive (respectively, negative) texts to the total number of positive (respectively, negative) texts in the dataset. The difference between recall on the two classes measures the bias of the classifier.
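The measures defined above can be computed as in the following minimal sketch; the label names and the toy predictions are hypothetical.

```python
def accuracy_and_recall(gold, predicted, classes=("pos", "neg")):
    """Accuracy plus the recall of each class, as defined in the text."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    result = {"accuracy": correct / len(gold)}
    for c in classes:
        total = sum(g == c for g in gold)                  # texts of class c
        hits = sum(g == c and p == c for g, p in zip(gold, predicted))
        result[f"recall_{c}"] = hits / total if total else 0.0
    return result

gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
scores = accuracy_and_recall(gold, pred)
# The classifier bias is the gap between the two recall values.
bias = abs(scores["recall_pos"] - scores["recall_neg"])
```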

Tables 5 and 6 show the results on the two datasets we used, with and without the use of our idiom lexicon, obtained with a tenfold cross-validation procedure. All reported figures are averaged over the ten folds.

Table 5 Performance of different classifiers on the movie reviews dataset. R stands for Recall
Table 6 Performance of different classifiers on the Persian VOA dataset. R stands for Recall

Table 7 shows that on the Movie Review dataset, the translation technique outperforms the original PerSent lexicon without the idiom lexicon, but is outperformed by our PerSent lexicon with the idiom lexicon, which again demonstrates the usefulness of our idiom lexicon for dealing with highly informal texts. For the VOA Persian dataset, translation showed a comparable result, because this dataset contains far fewer informal and slang words, for which either translation would fail or TextBlob would not have data; see Table 4. Indeed, for formal texts, TextBlob is more complete than our small PerSent lexicon with only 1500 Persian words. Comparison with a very large English-language Amazon corpus suggests that our results are good enough and can be considered close to the state of the art, though we cannot quantify this because results on different languages are incomparable. Many studies translate Persian sentences into English and apply an English lexicon. Therefore, we also translated the Persian sentences into English and then applied the machine learning classifiers. The experimental results show that the translated sentences achieved lower performance than the original Persian sentences.

Table 7 Accuracy of classification via translation. R stands for recall

6 Discussion and further analysis

We applied five different classifiers to the Movie Review dataset and the Persian VOA dataset. Comparison of the results shows that CNN outperformed the other classifiers. This classifier gave better performance with our idiom lexicon than with the original PerSent lexicon alone.

Since the language in the Persian VOA dataset is more formal, the idiom lexicon was not as useful for it as it was for the movie review dataset, which includes many idioms, slang words and informal words. Our idiom lexicon is particularly useful for highly informal texts, which is especially important because the majority of texts on the Internet to which opinion-mining techniques are typically applied are highly informal. The CNN classifier typically outperforms other classifiers in terms of accuracy; however, the CNN takes increasingly more time to train the model for large datasets (Iqbal et al. 2019).

In particular, in our experiments, the CNN was trained to evaluate the performance of our lexicon, and it showed superior performance in comparison with more traditional approaches such as SVM. We attribute this to our noisy data, which can cause low performance. At the pre-processing step, we did not remove all stop words, because this could affect automatic detection of idioms in the sentence. Various other factors affect the performance of the CNN classifier, such as the choice of feature extraction techniques, which is difficult to adapt to different types of dataset. The performance of the CNN improved when we increased the number of convolutional layers: five layers were enough to construct the model, while increasing the number of pooling layers could deteriorate the results (Hassan and Mahmood 2017).

For the convenience of the reader, Figs. 3 and 4 show in graphical form the data from Tables 5 and 6, respectively: the average overall accuracy of the five classifiers on the datasets using the original PerSent lexicon and after adding our idiom lexicon to it. One can see that the direct classifier, which just uses the average polarity, gave good results on both datasets. This shows that the average polarity can be used in practice to assign polarity labels to texts. However, the addition of distant supervision further improved the results, with the CNN classifier providing the best accuracy. On both corpora, the use of the idiom lexicon yielded better results, though for the highly informal Movie Review dataset the improvement was much greater than for the formal Persian VOA dataset. In particular, the improvement on the Movie Review dataset was statistically significant for all classifiers, while the improvement on the Persian VOA dataset was statistically significant only for the direct classifier.
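To illustrate the direct classifier discussed above, the following is a minimal Python sketch. It assumes that equation (1) averages the lexicon polarities of the entries matched in a document (the exact form is the one defined earlier in the paper) and that multi-word idioms are matched before single words. The toy lexicons, the English stand-in tokens, the zero decision threshold and all function names are hypothetical illustrations, not the PerSent data or the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicons with polarity in [-1, 1]; English stand-ins
# are used here instead of real PerSent entries.
WORD_LEXICON: Dict[str, float] = {"good": 0.6, "boring": -0.7, "cake": 0.1}
IDIOM_LEXICON: Dict[Tuple[str, ...], float] = {
    ("piece", "of", "cake"): 0.8,
    ("pain", "in", "the", "neck"): -0.8,
}
MAX_IDIOM_LEN = max(len(key) for key in IDIOM_LEXICON)


def matched_polarities(tokens: Sequence[str], use_idioms: bool = True) -> List[float]:
    """Scan left to right, preferring the longest idiom match at each position."""
    scores: List[float] = []
    i = 0
    while i < len(tokens):
        matched = False
        if use_idioms:
            # Try the longest candidate first; idioms have at least two tokens.
            for n in range(min(MAX_IDIOM_LEN, len(tokens) - i), 1, -1):
                key = tuple(tokens[i:i + n])
                if key in IDIOM_LEXICON:
                    scores.append(IDIOM_LEXICON[key])
                    i += n  # skip the tokens consumed by the idiom
                    matched = True
                    break
        if not matched:
            if tokens[i] in WORD_LEXICON:
                scores.append(WORD_LEXICON[tokens[i]])
            i += 1
    return scores


def average_polarity(tokens: Sequence[str], use_idioms: bool = True) -> float:
    """Mean polarity of matched entries; 0.0 when nothing matches (assumption)."""
    scores = matched_polarities(tokens, use_idioms)
    return sum(scores) / len(scores) if scores else 0.0


def direct_label(tokens: Sequence[str], use_idioms: bool = True) -> str:
    """Assign a label from the sign of the average polarity (assumed threshold 0)."""
    return "positive" if average_polarity(tokens, use_idioms) >= 0 else "negative"
```

In this toy setting, `direct_label("this film is a pain in the neck".split())` is negative with the idiom lexicon, whereas without it no entry matches and the document falls back to the default label, which mirrors how idioms can change the outcome on informal text.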

Fig. 3 Comparison of the results of five different classifiers on the Movie Review dataset

The fact that recall on the positive polarity class and on the negative polarity class is similar shows that the classifiers are not biased, since the datasets are balanced. Figures 3 and 4 show that the direct classifier with idioms performed better than some of the machine learning algorithms. We attribute such cases to wrong generalizations made from chance co-occurrences of some unigrams or n-grams with some idioms, due to the very small size of the raw corpus available for training the supervised classifiers.
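The bias check described above compares recall on the two polarity classes. A minimal Python sketch of that computation follows; the label strings and function names are assumed for illustration, and no values from the paper's tables are reproduced.

```python
from collections import Counter
from typing import Dict, Sequence


def recall_per_class(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Recall per class: correctly predicted items divided by gold items of that class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


def recall_gap(recalls: Dict[str, float]) -> float:
    """Difference between the best and worst class recall; a small gap suggests no class bias."""
    return max(recalls.values()) - min(recalls.values())
```

A gap near zero on a balanced dataset indicates that the classifier does not favour one polarity class over the other.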

Fig. 4 Comparison of the results of five different classifiers on the Persian VOA dataset without idioms and with our idiom lexicon

Figures 5 and 6 show the distribution of all examples and of the correctly classified examples by average polarity, obtained with equation (1), for the original PerSent lexicon and after we added our new idiom lexicon to it. Note that the figures show absolute numbers, not percentages. The figures suggest that the addition of the idioms made the analysis more detailed. Indeed, while the number of documents assigned almost neutral polarity or extreme polarity (completely positive or completely negative) decreased with the addition of the idiom lexicon, the number of documents assigned moderate polarity (positive or negative) increased. This indicates a more balanced and specific estimation of the sentiment conveyed by those documents, since very few real texts convey extremely positive or extremely negative feelings.
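A distribution of this kind can be obtained by binning documents by their average polarity and counting, per bin, all documents and the correctly classified ones. The sketch below is only an illustration: the equal-width bins on [-1, 1], the default of five bins, and the function name are assumptions, not the binning used for the paper's figures.

```python
from typing import List, Sequence, Tuple


def polarity_histogram(
    polarities: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> Tuple[List[int], List[int]]:
    """Return (all_counts, correct_counts) over equal-width bins on [-1, 1]."""
    if len(polarities) != len(correct):
        raise ValueError("polarities and correct must have the same length")
    all_counts = [0] * n_bins
    correct_counts = [0] * n_bins
    width = 2.0 / n_bins
    for p, ok in zip(polarities, correct):
        if not -1.0 <= p <= 1.0:
            raise ValueError("average polarity is expected to lie in [-1, 1]")
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int((p + 1.0) / width), n_bins - 1)
        all_counts[idx] += 1  # absolute counts, not percentages
        if ok:
            correct_counts[idx] += 1
    return all_counts, correct_counts
```

Running this once with polarities computed from the original lexicon and once with the idiom-extended lexicon gives two pairs of count lists that can be plotted side by side.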

Fig. 5 Number of documents in the Movie Review dataset by average polarity according to equation (1) before and after adding idioms to the PerSent lexicon

Fig. 6 Number of documents in the Persian VOA dataset by average polarity according to equation (1) before and after adding idioms to the PerSent lexicon

7 Conclusions and future work

We have extended the PerSent Persian sentiment lexicon with 1000 idiomatic expressions. The resulting lexicon is useful not only for detecting idioms in Persian texts but also for accurately classifying Persian texts. We used several algorithms to evaluate the performance of our lexicon on the polarity detection task and have shown that it improves the performance of the classifiers, especially on user-contributed content rich in informal language.

We have also shown that using deep learning algorithms, especially CNN, to implicitly extend the annotation from the sentiment lexicon to words not included in it, in a distant-supervision manner, improves the results. This leaves the whole labelling process lexicon-based and unsupervised, since the training data for the machine learning algorithms are obtained automatically with a lexicon-based algorithm; no manually annotated examples are involved in training.
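The distant-supervision idea summarized above can be sketched as follows: a lexicon-based labeller produces training labels automatically, and a supervised model trained on them can then score words that are absent from the lexicon. To keep the example self-contained, a small multinomial Naive Bayes classifier with add-one smoothing is used here in place of the CNN; this substitution, the toy lexicon, the English stand-in documents and all names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence

# Hypothetical toy lexicon (English stand-ins for lexicon entries).
LEXICON: Dict[str, float] = {"great": 0.8, "love": 0.7, "awful": -0.8, "hate": -0.7}


def lexicon_label(tokens: Sequence[str]) -> Optional[str]:
    """Label a document by average lexicon polarity; None if nothing matches."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    if not scores:
        return None
    return "pos" if sum(scores) / len(scores) >= 0 else "neg"


class NaiveBayes:
    """Tiny multinomial Naive Bayes, standing in for the CNN of the paper."""

    def fit(self, docs: Sequence[Sequence[str]], labels: Sequence[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens: Sequence[str]) -> str:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = "", -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_with_distant_supervision(raw_docs: Sequence[Sequence[str]]) -> NaiveBayes:
    """Label raw documents with the lexicon, then train on those automatic labels."""
    labelled = [(d, lexicon_label(d)) for d in raw_docs]
    labelled = [(d, y) for d, y in labelled if y is not None]
    if not labelled:
        raise ValueError("the lexicon matched no document")
    docs: List[Sequence[str]] = [d for d, _ in labelled]
    labels: List[str] = [y for _, y in labelled]
    return NaiveBayes().fit(docs, labels)
```

No manually annotated example enters the training step: every label comes from `lexicon_label`, yet the trained model can classify a document such as `["dull", "pacing"]`, for which the lexicon alone gives no label.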

As part of our future work, we plan to develop a multilingual idiom detection framework for the English and Persian languages. We also plan to overcome certain shortcomings of our idiom detection method: to enable it to classify code-mixed texts, to identify multiple meanings of words and to deal with cultural and regional language variation. For instance, there are words in Persian that have several meanings: e.g. (sun) can be used to say (“the male actor was shining like the sun”). Our current sentiment classification method is unable to detect these peculiarities in the text. We plan to extend it so that it can distinguish figures of speech such as, in this case, metaphor.

Another scenario where our framework needs enhancement is code-mixed text, which includes a mixture of Persian and English idioms. For example, “Dude, (“Dude, you have done a great job”).

Some of the slang words are culture-specific or region-specific. For example, “Laila and Majnun” is an ancient Persian love story. It is currently difficult for idiom detectors to recognize that (“Laila and Majnun”) can be translated as “lovers”. Our idiom detection framework will be enhanced to detect such culture-specific or region-specific slang. Similarly, our method is to be adapted to handle different dialects of the Persian language. These goals can be achieved both by including in our lexicon manually annotated region- and dialect-specific idioms and by automatically identifying idioms already included in our lexicon as region- or dialect-specific. The latter can be done using dialect-specific corpora and corpora with geo-localization information.

Finally, our algorithm currently detects only those idioms that have been manually included in our lexicon. However, new idiomatic and slang expressions constantly appear in language, especially in the language of the Internet, microblogging and social networks. Automatic detection of new slang and idiomatic expressions in raw texts and automatic discovery of their sentiment value is an ambitious goal, which would probably involve deep learning techniques and semantic analysis of the text. This is particularly challenging for a resource-poor language such as Persian.