A Transformer-based approach to Irony and Sarcasm detection

Figurative Language (FL) seems ubiquitous in all social-media discussion forums and chats, posing extra challenges to sentiment analysis endeavors. Identification of FL schemas in short texts remains a largely unresolved issue in the broader field of Natural Language Processing (NLP), mainly due to their contradictory and metaphorical meaning content. The main FL expression forms are sarcasm, irony and metaphor. In the present paper we employ advanced Deep Learning (DL) methodologies to tackle the problem of identifying the aforementioned FL forms. Significantly extending our previous work [71], we propose a neural network methodology that builds on a recently proposed pre-trained transformer-based network architecture, which is further enhanced with a recurrent convolutional neural network (RCNN). With this set-up, data preprocessing is kept to a minimum. The performance of the devised hybrid neural architecture is tested on four benchmark datasets, and contrasted with other relevant state-of-the-art methodologies and systems. Results demonstrate that the proposed methodology achieves state-of-the-art performance on all benchmark datasets, outperforming, even by a large margin, all other methodologies and published studies.


Introduction
In the networked-world era the production of (structured or unstructured) data is increasing, with most of our knowledge being created and communicated via web-based social channels [92]. Such data explosion raises the need for efficient and reliable solutions for the management, analysis and interpretation of huge data sizes. Analyzing and extracting knowledge from massive data collections is not only a big issue per se, but also challenges the data analytics state-of-the-art [99], with statistical and machine learning methodologies paving the way, and deep learning (DL) taking over and presenting highly accurate solutions [29]. Relevant applications in the field of social media cover a wide spectrum, from the categorization of major disasters [42] and the identification of suggestions [69] to inferring users' appeal to political parties [2].
The rise of computational social science [55], and mainly its social media dimension [63], challenges contemporary computational linguistics and text-analytics endeavors. The challenge concerns the advancement of text analytics methodologies towards the transformation of unstructured excerpts into some kind of structured data via the identification of special passage characteristics, such as their emotional content (e.g., anger, joy, sadness) [48]. In this context, Sentiment Analysis (SA) comes into play, targeting the design and development of efficient algorithmic processes for the automatic extraction of a writer's sentiment or emotion as conveyed in text excerpts. Relevant efforts focus on tracking the sentiment polarity of single utterances, which in most cases is loaded with a lot of subjectivity and a degree of vagueness [57]. Contemporary research in the field utilizes data from social media resources (e.g., Facebook, Twitter) as well as other short text references in blogs, forums etc. [70]. However, users of social media tend to violate common grammar and vocabulary rules and even use various figurative language forms to communicate their message. In such situations, the sentiment inclination underlying the literal content of the conveyed concept may significantly differ from its figurative context, making SA tasks even more puzzling. Evidently, approaches based on single-turn text fall short in detecting sentiment polarity in sarcastic and ironic expressions, as already signified in the relevant SemEval-2014 Sentiment Analysis task 9 [78]. Moreover, the lack of facial expressions and voice tone requires context-aware approaches to tackle such a challenging task and overcome its ambiguities [31]. As sentiment is the emotion behind customer engagement, SA finds its realization in automated customer-aware services, elaborating over users' emotional intensities [13].
arXiv:1911.10401v1 [cs.CL] 23 Nov 2019
Most of the related studies utilize single-turn texts from topic-specific sources, such as Twitter, Amazon, IMDB etc. Hand-crafted and sentiment-oriented features, indicative of emotion polarity, are utilized to represent the respective excerpts. The formed data are then fed to traditional machine learning classifiers (e.g. SVM, Random Forest, multilayer perceptrons) or to DL techniques and respective complex neural architectures, in order to induce analytical models that are able to capture the underlying sentiment content and polarity of passages [32,79,41].
The linguistic phenomenon of figurative language (FL) refers to the contradiction between the literal and the non-literal meaning of an utterance [17]. Literal written language assigns exact (or real) meaning to the used words (or phrases) without any reference to putative speech figures. In contrast, FL schemas exploit non-literal mentions that deviate from the exact concept presented by the used words and phrases. FL is rich in various linguistic phenomena, such as metonymy (a reference to an entity that stands for another of the same domain, a more general case of synonymy) and metaphor (a systematic interchange between entities from different abstract domains) [18]. Besides the philosophical considerations, theories and debates about the exact nature of FL, findings from the neuroscience research domain present clear evidence of differentiating FL processing patterns in the human brain [91,58,45,6,13], even for woman-man attraction situations [23], a fact that makes FL processing even more challenging and difficult to tackle. Indeed, this is the case for pragmatic FL phenomena like irony and sarcasm, which in most cases are characterized by an intended meaning opposite to the literal language context. It is crucial to distinguish the literal meaning of an expression considered as a whole from that of its constituent words and phrases. As literal meaning is assumed to be invariant in all contexts, at least in its classical conceptualization [46], it is exactly this separation of an expression from its context that permits and opens the road to computational approaches for detecting and characterizing FL utterances.
We may identify three common FL expression forms, namely irony, sarcasm and metaphor. In this paper, figurative expressions, and especially ironic or sarcastic ones, are considered as a way of indirect denial. From this point of view, the interpretation and ultimately the identification of the indirect meaning involved in a passage does not entail the cancellation of the indirectly rejected message and its replacement with the intentionally implied message (as advocated in [12,30]). On the contrary, ironic/sarcastic expressions presuppose the processing of both the indirectly rejected and the implied message, so that the difference between them can be identified. This view differs from the assumption that irony and sarcasm involve only one interpretation [88,80]. Holding that irony activates both grammatical/explicit and ironic/involved notions implies that irony will be more difficult to grasp than a non-ironic use of the same expression.
Although all forms of FL are well-studied linguistic phenomena [88], computational approaches fail to identify their polarity within a text. The influence of FL on sentiment classification emerged both in the SemEval-2014 Sentiment Analysis task [78] and in [18]. Results show that Natural Language Processing (NLP) systems that are effective in most other tasks see their performance drop when dealing with figurative forms of language. Thus, methods capable of detecting, separating and classifying forms of FL would be valuable building blocks for a system that could ultimately provide a full-spectrum sentiment analysis of natural language.
In the literature we encounter some major drawbacks of previous studies, which we aim to resolve with our proposed method:
• Many studies tackle figurative language by utilizing a wide range of engineered features (e.g. lexical and sentiment based features) [21,28,71,73,74,82], making classification frameworks impractical.
• Several approaches look words up in large dictionaries, which demands long computation times and can be considered impractical [71,82].
• Many studies exhaustively preprocess the input texts, including stemming, tagging, emoji processing etc., steps that tend to be time consuming, especially on large datasets [51,86].
• Many approaches create their own datasets, using social media APIs to collect data automatically, rather than evaluating their system on benchmark datasets of proven quality. As a result, they cannot be compared and evaluated against one another [51,56,86].
To tackle the aforementioned problems, we propose an end-to-end methodology that contains no hand-crafted engineered features or lexicon dictionaries and whose preprocessing step includes only de-capitalization, and we evaluate our system on several benchmark datasets. To the best of our knowledge, this is the first time that an unsupervised pre-trained Transformer method is used to capture figurative language in many of its forms.
The rest of the paper is structured as follows. Section 2 presents the related work in the field of FL detection. Section 3 presents our proposed method, along with several state-of-the-art models that achieve high performance in a wide range of NLP tasks and are used for performance comparison. The results of our experiments are presented in Section 4, and our conclusions in Section 5.

Literature Review
Although the NLP community has researched all aspects of FL independently, none of the proposed systems was evaluated on more than one type. Related work on FL detection and classification tasks can be grouped into two main categories, according to the studied task: (a) irony and sarcasm detection, and (b) sentiment analysis of FL excerpts. Even if sarcasm and irony are not identical phenomena, we will present those types together, as they appear in the literature.

Irony and Sarcasm Detection
Recently, the distinction of ironic and sarcastic meanings from their literal counterparts has raised scientific interest, due to the intrinsic difficulty of differentiating between them. Apart from the English language, irony and sarcasm detection have been widely explored in other languages as well, such as Italian [81], Japanese [35], Spanish [64], Greek [10] etc. In the review that follows we group related approaches according to the key concepts they adopt to handle FL.
Approaches based on unexpectedness and contradictory factors. Reyes et al. [75,76] were the first to attempt to capture irony and sarcasm in social media. They introduced the concepts of unexpectedness and contradiction, which seem to be frequent in FL expressions. The unexpectedness factor was adopted as a key concept in other studies as well. In particular, Barbieri et al. [4] compared tweets with sarcastic content with tweets on other topics, such as #politics, #education and #humor. The measure of unexpectedness was calculated using the American National Corpus Frequency Data source as well as the morphology of tweets, with Random Forest (RF) and Decision Tree (DT) classifiers. In the same direction, Buschmeier et al. [7] considered unexpectedness as an emotional imbalance between words in the text. Ghosh et al. [26] identified sarcasm with Support Vector Machines (SVM), using as features the contradictions identified within each tweet.
Content and context-based approaches. Inspired by the contradictory and unexpectedness concepts, follow-up approaches utilized features that expose information about the content of each passage, including: N-gram patterns, acronyms and adverbs [8]; semi-supervised attributes like word frequencies [16]; statistical and semantic features [74]; and the Linguistic Inquiry and Word Count (LIWC) dictionary along with syntactic and psycholinguistic features [72]. The LIWC corpus [65] was also utilized in [28], comparing sarcastic tweets with positive and negative ones using an SVM classifier. Similarly, using several lexical resources [82], and syntactic and sentiment related features [56], the respective researchers explored differences between sarcastic and ironic expressions. Affective and structural features are also employed to predict irony with conventional machine learning classifiers (DT, SVM, Naive Bayes/NB) in [20]. In a follow-up study [21], a knowledge-based k-NN classifier was fed with a feature set that captures a wide range of linguistic phenomena (e.g., structural, emotional). Significant results were achieved in [86], where a combination of lexical, semantic and syntactic features passed through an SVM classifier outperformed LSTM deep neural network approaches. Apart from local content, several approaches claimed that global context may be essential to capture FL phenomena. In particular, in [89] it is claimed that capturing previous and following comments on Reddit increases classification performance. Users' behavioral information also seems to be beneficial, as it captures useful contextual information in Twitter posts [73]. A novel unsupervised probabilistic modeling approach to detect irony was also introduced in [62].
Deep Learning approaches. Although several DL methodologies, such as recurrent neural networks (RNNs), are able to capture hidden dependencies between terms within text passages and can be considered content-based, we grouped all DL studies together for readability purposes. Word embeddings, i.e., learned mappings of words to real-valued vectors [60], play a key role in the success of RNNs and other DL neural architectures that utilize pre-trained word embeddings to tackle FL. In fact, the combination of word embeddings with Convolutional Neural Networks (CNNs) and LSTMs, in so-called CNN-LSTM units, was introduced by Kumar [52] and Ghosh & Veale [25], achieving state-of-the-art performance. Attentive RNNs also exhibit good performance when matched with pre-trained Word2Vec embeddings [38] and contextual information [98]. Following the same approach, an LSTM-based intra-attention model was introduced in [84] that achieved increased performance. A different approach, founded on the claim that numbers are significant indicators, was introduced by Dubey et al. [19]: an attentive CNN applied to a dataset of sarcastic tweets that contain numbers showed notable results. An ensemble of a shallow classifier with lexical, pragmatic and semantic features and a Bidirectional LSTM model is presented in [50]. In a subsequent study [51], the researchers engineered a soft attention LSTM model coupled with a CNN. Contextual DL approaches are also employed, utilizing pre-trained embeddings along with user embeddings built from previous posts [1], or personality embeddings passed through CNNs [33]. ELMo embeddings [68] are utilized in [39]. In our previous approach we implemented an ensemble deep learning classifier (DESC) [71], capturing content and semantic information. In particular, we employed an extensive feature set of 44 features in total, leveraging syntactic, demonstrative, sentiment and readability information from each text, along with Tf-idf features. In addition, an attentive bidirectional LSTM model trained with GloVe pre-trained word embeddings was utilized to structure an ensemble classifier processing different text representations. The DESC model achieved state-of-the-art results on several FL tasks.

Sentiment Analysis on Figurative Language
The Semantic Evaluation Workshop-2015 [24] proposed a joint task to evaluate the impact of FL in sentiment analysis on ironic, sarcastic and metaphorical tweets, with a number of submissions achieving high performance. The ClaC team [101] exploited four lexicons to extract attributes as well as syntactic features to identify sentiment polarity. The UPF team [3] introduced a regression-based classification methodology on tweet features extracted with the use of the widely utilized SentiWordNet and DepecheMood lexicons. The LLT-PolyU team [95] used semi-supervised regression and decision trees on extracted uni-gram and bi-gram features, coupled with features that capture potential contradictions at short distances. An SVM-based classifier on extracted n-gram and Tf-idf features was used by the Elirf team [27], coupled with specific lexicons such as AFINN, Pattern and Jeffrey. Finally, the LT3 team [85] used an ensemble Regression and SVM semi-supervised classifier with lexical features extracted with the use of WordNet and DBpedia.

The background: Transfer Learning
Due to the limited availability of annotated datasets and the high cost of data collection, unsupervised learning approaches tend to be an easier way towards training networks. Recently, transfer learning approaches, i.e., the transfer of already acquired knowledge to new conditions, are gaining attention in several domain adaptation problems [22]. In fact, pre-trained embedding representations, such as GloVe, ELMo and USE, coupled with transfer learning architectures were introduced and managed to achieve state-of-the-art results on various NLP tasks [36]. In this section we review these methodologies in order to introduce our proposed transfer learning system. Model specifications used for the state-of-the-art models compared can be found in Appendix A.

Contextual Embeddings
Pre-trained word embeddings have proved to increase classification performance in many NLP tasks. In particular, Global Vectors (GloVe) [66] and Word2Vec [61] became popular in various tasks due to their ability to capture representative semantic representations of words, trained on large amounts of data. However, in various studies (e.g., [67,68,59]) it is argued that the actual meaning of words, along with their semantic representations, varies according to their context. Following this assumption, the researchers in [68] present an approach that is based on the creation of pre-trained word embeddings through building a bidirectional language model, i.e. predicting the next word within a sequence. The ELMo model was extensively trained on a corpus of 30 million sentences [11], with a two-layered bidirectional LSTM architecture, aiming to predict both next and previous words, introducing the concept of contextual embeddings. The final embedding vector is produced by a task-specific weighted sum of the two directional hidden layers of the LSTM models. Another contextual approach for creating embedding vector representations is proposed in [9], where complete sentences, instead of words, are mapped to a latent vector space. The approach provides two variations of the Universal Sentence Encoder (USE) with some trade-offs in computation and accuracy. The first variation consists of a computationally intensive encoder that resembles the Transformer network [87] and has proved to achieve higher performance figures. In contrast, the second variation provides a light-weight model that averages input embedding weights for words and bi-grams by utilizing a Deep Averaging Network (DAN) [40]. The output of the DAN is passed through a feed-forward neural network in order to produce the sentence embeddings. Both variations take as input lowercased PTB-tokenized strings (tokenizer: nlp.stanford.edu/software/tokenizer.html), and output 512-dimensional sentence embedding vectors.
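The task-specific weighted sum used to form a contextual embedding can be illustrated with a minimal sketch. It assumes that the per-layer scalars are normalized with a softmax and multiplied by a scaling factor; the names scalar_mix, layer_scores and gamma are hypothetical, and in practice the scalars would be learned together with the downstream task rather than fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_vectors, layer_scores, gamma=1.0):
    """Combine the per-layer vectors of one token into a single embedding.

    layer_vectors: list of equal-length vectors, one per language-model layer.
    layer_scores:  unnormalized task-specific scalars (learned in practice).
    gamma:         task-specific scaling factor (hypothetical default).
    """
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for w, vec in zip(weights, layer_vectors):
        for i, x in enumerate(vec):
            mixed[i] += w * x
    return [gamma * x for x in mixed]


# Toy example: two layers with 3-dimensional hidden states for one token.
layers = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
embedding = scalar_mix(layers, layer_scores=[0.0, 0.0])  # equal weights
print(embedding)  # [2.0, 2.0, 1.0]
```

With equal scores the mix reduces to a plain average of the layers; unequal scores let a downstream task emphasize the layer that is most informative for it.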

Transformer Methods
Sequence-to-sequence (seq2seq) methods using encoder-decoder schemes are a popular choice for several tasks such as Machine Translation, Text Summarization, Question Answering etc. [83]. However, the encoder's contextual representations become unreliable when dealing with long-range dependencies. To address these drawbacks, Vaswani et al. in [87] introduced a novel network architecture, called Transformer, relying entirely on self-attention units to map input sequences to output sequences without the use of RNNs. The Transformer's decoder unit architecture contains a masked multi-head attention layer followed by a multi-head attention unit and a feed-forward network, whereas the encoder unit is almost identical but without the masked attention unit. Multi-head self-attention layers are calculated in parallel, reducing the computational cost of the regular attention layers used by previous seq2seq network architectures. In [17] the authors presented a model that is founded on findings from various previous studies (e.g., [14,37,68,72,87]), which achieved state-of-the-art results on eleven NLP tasks, called BERT (Bidirectional Encoder Representations from Transformers). The BERT training process is split in two phases: the unsupervised pre-training phase and the fine-tuning phase, which uses labelled data for downstream tasks. In contrast with previously proposed models (e.g., [68,72]), BERT uses masked language models (MLMs) to enable pre-trained deep bidirectional representations. In the pre-training phase the model is trained with a large amount of unlabeled data from Wikipedia and BookCorpus [100], using WordPiece [94] embeddings.
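The self-attention operation at the core of the Transformer can be sketched for a single head as follows. This is a simplified illustration and not the implementation of [87]: the learned query, key and value projections and the multi-head split are omitted, and the function name self_attention and its causal flag are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of vectors, one per position.
    causal=True hides future positions, as in the decoder's masked layer.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # With causal masking, position i may only attend to positions <= i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in visible]
        weights = softmax(scores)
        out = [sum(w * values[j][t] for w, j in zip(weights, visible))
               for t in range(dim)]
        outputs.append(out)
    return outputs


# Toy input: three positions with 2-dimensional states. Using the same
# vectors as queries, keys and values is an assumption made for brevity.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = self_attention(x, x, x)
masked = self_attention(x, x, x, causal=True)
print(masked[0])  # [1.0, 0.0]: the first position can only attend to itself
```

Every output position is computed independently of the others, which is what allows the attention layers to be evaluated in parallel.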
During pre-training, the model is trained on two tasks: in the first task, the model randomly masks 15% of the input tokens, aiming to capture conceptual representations of word sequences by predicting masked words inside the corpus, whereas in the second task the model is given two sentences and tries to predict whether the second sentence is the next sentence of the first. In the second phase, BERT is extended with a task-related classifier model that is trained in a supervised manner. During this supervised phase, the pre-trained BERT model receives minimal changes, with the classifier's parameters trained in order to minimize the loss function. Two models are presented in [17]: a Base BERT model with 12 encoder layers (i.e. transformer blocks), a hidden size of 768 and 12 attention heads, and a Large BERT model with 24 encoder layers, a hidden size of 1024 and 16 attention heads. The pre-trained BERT model has an architecture almost identical to the aforementioned Transformer network. A [CLS] token is supplied in the input as the first token, the final hidden state of which is used as the aggregate representation for classification tasks. Despite the achieved breakthroughs, the BERT model suffers from several drawbacks. Firstly, BERT, as all language models using Transformers, assumes (and pre-supposes) independence between the masked words of the input sequence, and neglects all the positional and dependency information between words. In other words, for the prediction of a masked token both word and position embeddings are masked out, even if positional information is a key aspect of NLP [15]. In addition, the [MASK] token, which substitutes the masked words, is mostly absent in the fine-tuning phase for downstream tasks, leading to a pretrain-finetune discrepancy.
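The masked-language-model objective described above can be sketched as a small data-corruption routine. This is a simplified illustration under stated assumptions: every selected token is replaced by [MASK], whereas the original BERT recipe also leaves some selected tokens unchanged or swaps them for random tokens; the function name mask_tokens and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, targets) for one masked-language-model step.

    targets maps each masked position to the original token that the model
    has to predict. Special tokens such as [CLS] and [SEP] are never masked.
    """
    rng = rng or random.Random()
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    # Mask roughly mask_prob of the candidate tokens, but at least one.
    n_to_mask = max(1, round(len(candidates) * mask_prob))
    chosen = rng.sample(candidates, n_to_mask)
    corrupted = list(tokens)
    targets = {}
    for i in chosen:
        targets[i] = tokens[i]
        corrupted[i] = MASK_TOKEN
    return corrupted, targets


tokens = ["[CLS]", "oh", "great", ",", "another", "monday", "morning", "[SEP]"]
corrupted, targets = mask_tokens(tokens, rng=random.Random(0))
print(corrupted)
print(targets)
```

The model sees only the corrupted sequence and is trained to recover the entries of targets, which is also why the [MASK] symbol appears during pre-training but not in downstream inputs.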
To address the shortcomings of BERT, a permutation language model was introduced, the so-called XLNet, trained to predict masked tokens in a non-sequential random order, factorizing the likelihood in an autoregressive manner without the independence assumption and without relying on any input corruption [96]. In particular, a query stream is used that extends embedding representations to incorporate positional information about the masked words. The original representation set (content stream), including both token and positional embeddings, is then used as input to the query stream, following a scheme called Two-Stream Self-Attention. To overcome the problem of slow convergence, the authors propose the prediction of only the last tokens in the permutation phase, instead of predicting the entire sequence. Finally, XLNet also uses special tokens for the classification and separation of the input sequence, [CLS] and [SEP] respectively; however, it also learns an embedding that denotes whether two words are from the same segment. This is similar to the relative positional encodings introduced in Transformer-XL [15], and extends the ability of XLNet to cope with tasks that encompass arbitrary input segments. Recently, a replication study [17] suggested several modifications in the training procedure of BERT, the result of which outperforms the original XLNet architecture on several NLP tasks. The optimized model, called Robustly Optimized BERT Approach (RoBERTa), used 10 times more data (160GB compared with the 16GB originally exploited), and is trained with far more epochs than the BERT model (500K vs 100K), also using 8-times larger batch sizes and a byte-level BPE vocabulary instead of the character-level vocabulary that was previously utilized. Another significant modification was the dynamic masking technique, used instead of the single static mask of BERT. In addition, the RoBERTa model removes the next sentence prediction (NSP) objective used in BERT, following the advice of several other studies that question the NSP loss term [54,97,43].
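The difference between a single static mask and dynamic masking can be made concrete with a short sketch. It only models which positions are masked in each epoch, under the simplifying assumption that a static mask is sampled once during preprocessing and reused; the function names and the default values are hypothetical.

```python
import random


def sample_mask(n_tokens, mask_prob, rng):
    """Pick the set of positions to mask in a sequence of n_tokens tokens."""
    n_to_mask = max(1, round(n_tokens * mask_prob))
    return frozenset(rng.sample(range(n_tokens), n_to_mask))


def static_masks(n_tokens, n_epochs, mask_prob=0.15, seed=0):
    """One mask sampled during preprocessing and reused in every epoch."""
    mask = sample_mask(n_tokens, mask_prob, random.Random(seed))
    return [mask for _ in range(n_epochs)]


def dynamic_masks(n_tokens, n_epochs, mask_prob=0.15, seed=0):
    """A fresh mask sampled each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [sample_mask(n_tokens, mask_prob, rng) for _ in range(n_epochs)]


static = static_masks(n_tokens=40, n_epochs=10)
dynamic = dynamic_masks(n_tokens=40, n_epochs=10)
print(len(set(static)))   # 1: the same positions are masked in every epoch
print(len(set(dynamic)))  # number of distinct masks seen across the epochs
```

With dynamic masking the model is asked to predict different tokens of the same sentence across epochs, so each training sequence yields more than one prediction problem.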

Proposed Method - Recurrent CNN RoBERTa (RCNN-RoBERTa)
The intuition behind our proposed RCNN-RoBERTa approach is founded on the following observation: as pre-trained networks are beneficial for several downstream tasks, their outputs could be further enhanced if processed properly by other networks. Towards this end, we devised an end-to-end model with minimum training time that utilizes pre-trained RoBERTa weights combined with an RCNN in order to capture contextual information. Specifically, the proposed learning model is based on a hybrid DL neural architecture that utilizes pre-trained transformer models and feeds the hidden representations of the transformer into a Recurrent Convolutional Neural Network (RCNN), similar to [53]. In particular, we employed the RoBERTa base model with 12 hidden layers and 12 attention heads, and used its output hidden states as an embedding layer to an RCNN. As already stated, contradictions and long-range dependencies within a sentence may be used as strong identifiers of FL expressions. RNNs are often used to capture time relationships between words; however, they are strongly biased, i.e. later words tend to be more dominant than earlier ones [53]. This problem can be alleviated with CNNs, which, as unbiased models, can determine semantic relationships between words with max-pooling. Nevertheless, contextual information in CNNs depends entirely on kernel sizes. Thus, we appropriately modified the RCNN model presented in [53] in order to capture unbiased recurrent informative relationships within text, and we implemented a Bidirectional LSTM (BiLSTM) layer, which is fed with RoBERTa's final hidden layer outputs. The output of the LSTM is concatenated with the embedded representations, and passed through a feed-forward network and a max-pooling layer. Finally, a softmax function is used for the output layer. Table 1 shows the parameters used in training and Figure 1 illustrates our method.
Fig. 1 The proposed RCNN-RoBERTa methodology, consisting of a RoBERTa pre-trained transformer followed by a Bidirectional LSTM layer (BiLSTM). Pooling is applied to the representation vector of concatenated RoBERTa and LSTM outputs and passed through a fully connected softmax-activated layer.
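The data flow after the BiLSTM layer (concatenation, feed-forward projection, max-pooling over time and softmax output) can be sketched as follows. This is an illustration under stated assumptions and not the trained model: the RoBERTa and BiLSTM outputs are replaced by random stand-in vectors, the tanh activation and all dimensions are assumed for the example, and the names rcnn_head, proj_w and out_w are hypothetical.

```python
import math
import random


def linear(vec, weights, bias):
    """Dense layer: weights holds one row of coefficients per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rcnn_head(roberta_states, bilstm_states, proj_w, proj_b, out_w, out_b):
    """Map per-token states to class probabilities.

    roberta_states: one vector per token (stand-in for RoBERTa's last layer).
    bilstm_states:  one vector per token (stand-in for the BiLSTM outputs).
    """
    # 1. Concatenate the BiLSTM output with the transformer state per token.
    concatenated = [l + r for l, r in zip(bilstm_states, roberta_states)]
    # 2. Feed-forward projection (tanh is an assumption) applied to each token.
    projected = [[math.tanh(v) for v in linear(tok, proj_w, proj_b)]
                 for tok in concatenated]
    # 3. Max-pooling over the token axis: one maximum per feature.
    pooled = [max(feature) for feature in zip(*projected)]
    # 4. Fully connected softmax output layer.
    return softmax(linear(pooled, out_w, out_b))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes; the real hidden sizes are far larger.
n_tokens, d_roberta, d_lstm, d_proj, n_classes = 5, 4, 2, 3, 2
probs = rcnn_head(
    rand_matrix(n_tokens, d_roberta),
    rand_matrix(n_tokens, d_lstm),
    rand_matrix(d_proj, d_lstm + d_roberta), [0.0] * d_proj,
    rand_matrix(n_classes, d_proj), [0.0] * n_classes,
)
print(probs)
```

Because the maximum is taken per feature across all tokens, a strong cue contributes to the pooled vector regardless of where it occurs in the sentence, which is the position-unbiased behaviour that motivates the pooling step.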

Experimental Results
To assess the performance of the proposed method we performed an exhaustive comparison with several advanced state-of-the-art methodologies along with published results. The compared methodologies were implemented using the available code and guidelines, and include: ELMo [68], USE [9], NBSVM [90], FastText [44], the XLNet base cased model (XLNet) [96], BERT [17] in two setups, BERT base cased (BERT-Cased) and BERT base uncased (BERT-Uncased), and the RoBERTa base model. The published results were acquired from the respective original publications (the reference publication is indicated in the respective tables). For the comparison we utilized benchmark datasets that include ironic, sarcastic and metaphoric expressions. Namely, we used the dataset provided in Semantic Evaluation Workshop Task 3 (SemEval-2018) that contains ironic tweets [34]; Riloff's high quality, unbalanced sarcastic dataset [77]; a large dataset containing political comments from Reddit [47]; and a SA dataset that contains tweets with various FL forms from SemEval-2015 Task 11 [24]. All datasets are used in a binary classification manner (i.e., irony/sarcasm vs. literal), except for the SemEval-2015 Task 11 dataset, where the task is to predict an integer sentiment score (from -5 to 5) for each tweet (refer to [71] for more details). The evaluation was made across five standard metrics, namely Accuracy (Acc), Precision (Pre), Recall (Rec), F1-score (F1), and Area Under the Receiver Operating Characteristics Curve (AUC). For the SA task the cosine similarity (Cos) and mean squared error (MSE) metrics are used, as proposed in the original study [24].
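For reference, most of the metrics listed above can be computed from predictions with a few lines of code. The sketch below covers Accuracy, Precision, Recall and F1 for the binary tasks and cosine similarity and MSE for the sentiment scoring task; AUC is omitted because it requires ranked prediction scores rather than hard labels. The label and score vectors are made-up toy data, not results from the paper.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def cosine_similarity(a, b):
    """Cosine of the angle between the gold and predicted score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy binary task: 1 = ironic/sarcastic, 0 = literal.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores["accuracy"], scores["precision"], scores["recall"])  # 0.75 0.75 0.75

# Toy sentiment scoring task with integer scores in [-5, 5].
gold = [-3, 2, -1, 4]
pred = [-2, 2, -2, 3]
print(mean_squared_error(gold, pred))  # 0.75
print(cosine_similarity(gold, pred))
```

Cosine similarity rewards predictions that point in the same direction as the gold scores, while MSE penalizes the magnitude of each individual error, so the two metrics can rank systems differently.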
The results are summarized in Tables 2-5; each table refers to the respective comparison study. All tables present the performance results of our proposed method (Proposed) and contrast them with eight state-of-the-art baseline methodologies along with published results using the same dataset. Specifically, Table 2 presents the results obtained using the ironic dataset used in SemEval-2018 Task 3.A, compared with recently published studies and two high performing teams from the respective SemEval shared task [5,93]. Tables 3 and 4 summarize results obtained using the sarcastic datasets (Reddit SARC politics [47] and Riloff Twitter [77]). Finally, Table 5 compares the results of the baseline models, of the two top-ranked task participants [3,101] and of our previous study with the DESC methodology [71] with those of the proposed RCNN-RoBERTa framework on a Sentiment Analysis task with figurative language, using the SemEval-2015 Task 11 dataset.
As can be easily observed, the proposed RCNN-RoBERTa approach outperforms all baseline approaches, as well as all methods with published results, on the respective binary classification tasks (Tables 2, 3, and 4). Our previous approach, DESC (introduced in [71]), performs slightly better in terms of cosine similarity for the sentiment scoring task (Table 5, 0.820 vs. 0.810), whereas the RCNN-RoBERTa approach performs better in terms of MSE, reducing it by about 41.5% (from 2.480 to 1.450).

Conclusion
In this study, we propose the first transformer-based methodology, leveraging the pre-trained RoBERTa model combined with a recurrent convolutional neural network, to tackle figurative language in social media. Our network is compared with all, to the best of our knowledge, published approaches on four different benchmark datasets. In addition, we aim to minimize preprocessing and engineered feature extraction steps which are, as we claim, unnecessary when using heavily pre-trained deep learning methods such as transformers. In fact, hand-crafted features along with preprocessing techniques such as stemming and tagging on huge datasets containing thousands of samples are almost prohibitive in terms of their computation cost. Our proposed model, RCNN-RoBERTa, achieves state-of-the-art performance under six metrics over four benchmark datasets, denoting that transfer learning can effectively capture non-literal forms of language. Moreover, the RCNN-RoBERTa model outperforms all other state-of-the-art approaches tested, including BERT, XLNet, ELMo, and USE, under all metrics, some by a large margin.