1 Introduction

Text classification is the task of assigning to strings of text a label from a predefined set. Automating text classification is motivated by the massive amount of texts available in the current digital age, and by the complexity of determining the correct label for new texts. Automated text classification is made possible by increasingly accurate algorithms (Kowsari et al. 2019), such as supervised models from machine learning (Sebastiani 2002) and deep learning (Minaee et al. 2021), or semantic methods Altinel and Ganiz (2018) that overcome the simple representation of texts as bag of words. For example, sentiment classification Liu and Zhang (2012) of subjective texts about a product (or, a brand, a politician, a regulation, etc.) into positive, negative, or neutral opinions is supported by several techniques Hemmatian and Sohrabi (2019). Other applications of text classification Li et al. (2020); Zhou et al. (2020) include: filling texts into appropriate sections in websites or folders (document organization); assigning one or more subjects to texts (topic labeling); selecting relevant texts from a stream (text filtering); detecting unsolicited emails (spam detection); determining the stance – favor, against, neither – of the author of a text towards a target (stance detection); etc.

The classification of short texts  (Song et al. 2014), which abound in micro-blogging sites such as Twitter and in online reviews, is especially challenging, due to their sparsity, non-uniformity, and noisiness. Deep Neural Networks (DNNs) Korde and Mahender (2012), Zhang et al. (2015) and Random Forests (RFs) da Silva et al. (2014) and Xu et al. (2012), have been shown to be effective in terms of predictive accuracy and robustness to noise. However, the decision logic learned by a DNN or by a RF to classify a given text remains obscure to human inspection. These inscrutable “black box” models may reproduce and amplify biases learned from data, such as prejudice (Bolukbasi et al. 2016), or they may introduce new forms of bias (Olteanu et al. 2019), e.g., due to spurious correlations. When opinions concern specific individuals, this may also result in social discrimination against protected-by-law groups identified by their sensitive traits (e.g. gender identity, sexual orientation, ethnic background) (Ntoutsi 2020).

Explainability of decisions made by black box models is nowadays a mandatory requirement (Doshi-Velez and Kim 2017; Freitas 2013) for the social acceptance of Artificial Intelligence (AI) applications Danks (2019). Developers need to understand a model’s decisions for debugging and optimization. For example, to characterize conditions under which the model can be trustfully applied, and conditions for which the model should abstain to make decisions. People subject to black box decisions may inquire to be provided with “meaningful information of the logic involved” (articles 13–15 of European Union General Data Protection Regulation, sometimes referred to as right to legibility (Malgieri and Comandé 2017) or right to explanation (Selbst and Powles 2017)). For example, if a comment in a social network has been removed because it has been classified as hate speech, the author has the right to know why the machine learning system has assigned such a label to her comment.

In this paper, we investigate the problem of explaining the decisions of a black box model (simply, a black box) when classifying a given short text in input, e.g., as in sentiment classification. We design and experiment with a model-agnostic local approach named xspells (explaining sentiment prediction generating exemplars in the latent space). xspells’s explanations for the prediction \(y = b(x)\) assigned by a black box b to a short text x consists of set of exemplar texts E, a set of counter-exemplar texts C, and the most frequent words in each of those sets \(W = W_E \cup W_C\). Exemplars are sentences classified by the black box with the same label as x and close in meaning to x. They are intended to provide the user with hints about the kind of texts in the neighborhood of x that the black box classifies in the same way as x. Counter-exemplars are sentences that the black box classifies differently from y, but like exemplars, are also close in meaning to x. They are intended to provide the user with hints about the kind of texts in the neighborhood of x that the black box classifies differently from x. The usefulness of counter-factual reasoning has been widely recognized in the literature on Explainable AI Byrne (2019), particularly as a tool for causal understanding of the behavior of black boxes. Diversity of the provided counter-exemplars is modeled as an optimization problem, which is solved by resorting to a greedy algorithm with a theoretical guarantee. By contrasting exemplars and counter-exemplars, the user can gain an understanding of the factors affecting the classification of x. To help such an understanding, xspells provides also the most frequent words appearing in the exemplar texts E and in the counter-exemplar texts C.

The main novelty of our approach lies in the fact that the exemplars and counter-exemplars produced by xspells are meaningful texts, albeit synthetically generated. We map the input text x from a (sparse) high-dimensional vector space into a low-dimensional latent space vector z by means of Variational Autoencoders Kingma and Welling (2014), which are effective in encoding and decoding diverse and well-formed short texts (Bowman et al. 2016). Then we study the behavior of the black box b in the neighborhood of z, or, more precisely, the behavior of b on texts decoded back from the latent space. Finally, we exploit a decision tree built from latent space neighborhood instances to drive the selection of exemplars and counter-exemplars. Experiments on three standard datasets and two black box classifiers show that xspells overtakes the baseline method lime Ribeiro et al. (2016) by providing understandable, faithful, useful, and stable explanations.

This paper extends the conference version Lampridis et al. (2020) in several aspects. First, we formulate the problem of diverse counter-exemplar selection and provide a solution based on a greedy algorithm. Second, in addition to training a VAE on a subset of available data, we consider using a pre-trained VAE from the state of the art. Third, a deeper experimental qualitative/quantitative analysis is conducted. The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 formalizes the problem and recalls key notions for the proposed method, which is described in Sect. 4. Section 5 presents the experimental results. Finally, Sect. 6 summarizes our contribution, its limitations, and future work.

2 Related work

Research on interpretability and explainability in AI has bloomed over the last few years (Guidotti et al. 2019b; Miller 2019), with many implementations of proposed methods (Bodria et al. 2021; Linardatos et al. 2021). Intrinsically explainable AI models can be directly interpretable by humans, or the explanation of their decisions arise as part of their prediction process (self-explainability). Examples in the area of short text classification include linear classifiers exploiting word taxonomies (Skrlj et al. 2021) or lexicons (Clos and Wiratunga 2017). The best performing text classifiers, however, rely on black box models, which are inaccessible, inscrutable, or simply too complex for humans to understand. Hence, they require post-hoc explanations of their decisions. Explanation methods can be categorized as: (i) Model-specific or model-agnostic, depending on whether the approach requires access to the internals of the black box; (ii) Local or global, depending on whether the approach explains the prediction for a specific instance or the overall logic of the black box.

xspells falls into the category of local, model-agnostic methods. Well known tools in this category that are able to work also on textual data include lime, anchor and shap. lime Ribeiro et al. (2016) randomly generates synthetic instances in the neighborhood of the instance to explain. An interpretable linear model is trained from such instances. Feature weights of the linear model are used for explaining the feature importance over the instance to explain. In the case of texts, a feature is associated to each of the top frequent words in a dataset. lime has two main weaknesses. First, the number of top features/words to be considered is assumed to be provided as an input by the user. Second, the neighborhood texts are generated by randomly removing words, possibly generating meaningless texts. anchor Ribeiro et al. (2018) follows the main ideas of lime but it returns decision rules (called anchors) as explanations. In the case of texts, such rules state which words, once fixed, do not alter the decision of the black box when randomly replacing all other words by similar words (in an embedding space) with the same part-of-speech (POS) tag. anchor adopts a bandit algorithm that constructs anchors with predefined minimum precision. Its weaknesses include the need for user-defined precision threshold parameters, and, as for lime, the generation of possibly meaningless instances. shap Lundberg and Lee (2017) relates game theory with local explanations and overcomes some of the limitations of lime and anchor. Also shap audits the black box with possibly meaningless synthetic sentences. The method xspells proposed in this paper recovers from this drawback by generating the sentences for the neighborhood in a latent space by taking advantage of variational autoencoders.

With regard to model-specific local approaches, lionets, deeplift and neurox are designed to explain deep neural networks and they are able to work on textual data. deeplift Shrikumar et al. (2017) decomposes the prediction of neural networks on a specific input by back-propagating the contributions of all neurons in the network to the input features. Then it compares the activation of each neuron to its “reference activation” and it assigns contribution scores according to the difference. neurox Dalvi et al. (2019) facilitates the analysis of individual neurons in DNNs. In particular, it identifies specific dimensions in the vector representations learned by a neural network model that are responsible for specific properties. Afterwards, it allows for the ranking of neurons and dimensions based on their overall saliency. Finally, lionets Mollas et al. (2019) looks at the penultimate layer of a DNN, which models texts in an alternative representation, randomly permutes the weights of nodes in that layer to generate new vectors, classifies them, observes the classification outcome and returns the explanation using a linear regressor like lime. Differently from these model-specific methods, xspells is not tied to a specific architecture and can be used to explain any black box classifier.

Other approaches such as IntGrad Sundararajan et al. (2017), LRP Bach et al. (2015), DeepLift Shrikumar et al. (2017) and L2X Chen et al. (2018) are designed to explain image classifiers. However, they can be adapted to work on text classifiers by returning a sort of saliency map as explanation, also named sentence highlighting, that highlights the most important words responsible for the classification. For instance, Arras et al. (2017) adapts the model-specifc approach of layer-wise relevance propagation (LRP Bach et al. 2015) to extract word-wise relevance score. Similarly, Li et al. (2016) proposes an attention-based sentence-highlighting explanation system that extracts a saliency score for every word by using the weights layer of a DNN black box. Attention-based explanations aim also at connecting the importance of various words in a sentence. For instance, Vaswani et al. (2017) provides the explanation as a matrix where rows and columns represent a word and the value in the cell is proportional to the self-attention in the explanation between the words at the row/column of the cell. Visualization of such a matrix can also highlight connections among words (Hoover et al. 2019). Compared to such approaches, the explanation given by xspells is “instance-wide” as it is formed by prototypes showing with examples/counter-examples similar/different classification outcomes. On the other hand, those approaches return an explanation that points the attention on single words causing the classification outcomes.

Regarding counter-factual approaches, while there is a growing literature for tabular data and images (Artelt and Hammer 2019; Verma et al. 2020), to the best of our knowledge our proposal is an original contribution in the context of short text classification. A form of contrastive explanations has been proposed in the local model-specific approach of Croce et al. (2019) for a self-explaining question classification system based on LRP. Here, the top texts from the training set which contribute the most (negatively) to the decision are returned as counter-exemplars.

Finally, for explainability in Natural Language Processing beyond classification, we refer to a recent survey Danilevsky et al. (2020) and its associated living website Qian etal (2021).

3 Setting the stage

In this paper, we address the black box outcome explanation problem Guidotti et al. (2019b) in the domain of short text classification. We will only consider and experiment with short texts such as posts on social networks, brief reviews, or single sentences. These are typically categorized into two or a small number of class labels, as in sentiment classification, stance detection, hate-speech recognition, etc. Text classification is particularly challenging in these cases, with high risks of biased decisions accompanied with an urgent need for black box explanation methods. A black box model is a non-interpretable or inaccessible text classifier b which assigns a class label y to a given text x, i.e., \(b(x) = y\). We assume that the black box b can be queried at will. We use the notation b(X) as a shorthand for \(\{b(x) \;|\; x \in X\}\).

Definition 1

Let b be a black box text classifier, and x a text for which the decision \(y=b(x)\) has to be explained. The black box outcome explanation problem for text classification consists of providing an explanation \(\xi \in \Xi \) belonging to a human-interpretable domain \(\Xi \).

In the following, we introduce the key tools used in our approach.

3.1 Local explainers, factuals and counter-factuals

In the context of tabular data, a widely adopted human-interpretable domain \(\Xi \) consists of if-then rules. They provide conditions (in the if-part) met by the instance x to be explained, that determined the answer of the black box (then-part). Rules can also be used to provide counter-factuals, namely alternative conditions, not met by x, that would determine a different answer by the black box Byrne (2019). In our approach, we will build on lore Guidotti et al. (2019a), a local explainer for tabular data that learns a decision tree from a given neighborhood Z of the instance to explain. Such a tree is a surrogate model of the black box, i.e., it is trained to reproduce the decisions of the black box locally to x. lore provides as output: (i) a factual rule \(p \rightarrow y\), corresponding to the path p in the surrogate tree that explains why an instance x has been labeled as y by the black box b; and (ii) a set of counter-factual rules \(p[\delta ] \rightarrow y'\) explaining (minimal) changes \(\delta \) in the features of x that would change the class y assigned by b to \(y'\). In lore, the neighborhood Z is synthetically generated using a genetic algorithm that balances the number of instances similar to x and with its same label y, and the number of instances similar to x but with a different label \(y' \ne y\) assigned by b. One first problem with directly using lore is that texts are not tabular data. We will tackle this issue by mapping a large m-dimensional space of words to a small k-dimensional space of numbers (latent space), a tabular space from which rules can be extracted.

3.2 Variational autoencoders for short text generation

Local explanation methods, such as lore, audit the behavior of a black box in the neighborhood of the instance to explain. A second non-trivial problem with textual data is how to generate meaningful synthetic sentences in the neighborhood (w.r.t. semantic similarity) of the instance. We tackle this problem by adopting Variational Autoencoders (VAEs) (Kingma and Welling 2014), which have been shown to be very effective (Bowman et al. 2016) in generating diverse and well-formed (short text) sentences. A VAE is trained with the aim of learning a representation that reduces the dimensionality from the space of words to the latent space, also capturing non-linear relationships. An encoder \(\zeta \), and a decoder \(\eta \) are simultaneously learned with the objective of minimizing the reconstruction loss. Starting from the reduced encoding \(z = \zeta (x)\), the VAE reconstructs a representation as close as possible to its original input \({\tilde{x}} = \eta (z) \simeq x\). After training, the decoder can be used with generative purposes to reconstruct instances never observed by generating vectors in the latent space of dimensionality k. The difference to standard autoencoders (Hinton and Salakhutdinov 2006) is that VAEs are trained by considering an additional limitation on the loss function such that the latent space is scattered and does not contain “dead zones”. Indeed, the name “variational” comes from the fact that VAEs work by approaching the posterior distribution with a variational distribution. The encoder \(\zeta \) emits the parameters for this variational distribution, in terms of a multi-factorial Gaussian distribution, and the latent representation is taken by sampling this distribution. The decoder \(\eta \) takes as input the latent representation and focuses on reconstructing the original input from it. The avoidance of dead zones ensures that the instances reconstructed from vectors in the latent space, e.g., posts or tweets, are semantically meaningful Bowman et al. (2016).

figure a

4 Explaining short text classifiers

We propose a local model-agnostic explainer for classification of short texts, called xspells (e xplaining sentiment prediction generating exemplars in the latent space). Given a black box b, a short text x, e.g., a post on a social network, and the class label \(y = b(x)\) assigned by the black box, e.g., hate or neutral, the explanation provided by xspells is composed of: (i) A set of exemplar texts; (ii) A set of counter-exemplar texts; and, (iii) The set of most common words in exemplars and counter-exemplars. Exemplar and counter-exemplar texts respectively illustrate instances classified with the same and with a different label than x. Such texts are close in meaning to x, and they offer an understanding of what makes the black box determine the label of texts in the neighborhood of x. Exemplars help in understanding reasons for, e.g., the sentiment assigned to x. Counter-exemplars help in understanding reasons that would reverse the sentiment assigned. The most common words in the exemplars and counter-exemplars may allow for highlighting terms (not necessarily appearing in x) that discriminate between the assigned label and a different one. These components form the human-interpretable explanation \(\xi \in \Xi \) for the classification \(y = b(x)\) returned by xspells, whose aim is to satisfy the requirements of counter-factuability, usability, and meaningfulness of an explanation (Byrne 2019; Miller 2019; Pedreschi et al. 2019).

Fig. 1
figure 1

xspells process (part 1) on a sample input. xspells takes as input the short text x and the assigned label b(x). After coding the text into the latent space and generating a neighborhood, a decision tree over the latent features (\(L_1, L_2, \ldots \)) is learned

Fig. 2
figure 2

xspells process (part 2) on a sample input. The figure starts at the extraction of exemplars (u) and counter-exemplars (v) from the decision tree in the latent space. The output of xspells is a set of exemplars and counter-exemplars, and the most common discriminative words

Besides the black box b and the text x to explain, xspells is parametric in: an encoder \(\zeta \) and a decoder \(\eta \) for representing texts in a compact way in the latent space. Algorithm 1 details xspells, and Figs. 1 and 2 show the steps of the explanation process on a sample input. We describe the process in detail by a simple example x: “the stupid brown fox", classified as “hate". First, x is transformed into a low-dimensionality vector \(z = \zeta (x)\) in the latent space. xspells then generates a neighborhood Z of z, which is decoded back to a set of texts \({\tilde{Z}}\) (e.g. “stupid brown fox", “stupid red dog"). The dataset Z and the decisions of the black box on the decoded text \(Y = b({\tilde{Z}})\) are used to train a surrogate decision tree (in the latent space).

Then, the \( explCexpl ()\) module selects exemplars E (e.g. “stupid brown fox", “stupid red dog") and counter-exemplars C (e.g. “lovely brown cat", “cute red cat") from Z by exploiting the knowledge extracted (i.e., the decision tree branches), and decodes them into texts. Finally, the most common words \(W = W_E \cup W_C\) (here “stupid, silly, fox", “lovely, cute, cat") are extracted from E and C and the overall explanation \(\xi \) is returned. Details of each step are presented in the rest of this section.

4.1 Latent encoding and neighborhood generation

The input text x is first passed to a trained VAE \(\zeta \) (line 1 of Algorithm 1), thus obtaining the latent space representation \(z = \zeta (x)\). The number of latent dimensions k is kept low to avoid dimensionality problems. While xspells is parametric in the encoder-decoder, we considered in the experiments two actual implementations. The first one is trained on a subset of available data. It captures the sequential information in texts by means of long short-term memory layers (LSTM) (Hochreiter and Schmidhuber 1997) for both the encoder \(\zeta \) and decoder \(\eta \) (lines 1 and 3). In particular, the decoder \(\eta \) is trained to predict the next characters of a text given the previous characters of the text, with the purpose of reconstructing the original text. The second implementation relies instead on a state-of-the-art VAE already pretrained on a large text corpus (details in Sect. 5).

xspells generates a set Z of n instances in the latent feature space for a given z. The neighborhood generation function \( neighgen \) (line 2) can be implemented by adopting several different strategies, ranging from a purely random approach like in lime (Ribeiro et al. 2016), to using a given distribution and a genetic algorithm maximizing a fitness function like in lore (Guidotti et al. 2019a). xspells adopts a random generation of latent synthetic instances by relying on the fact that the encoder maps uniformly the data distribution over the latent space. Duplicated instances in the random generation processes are removed, keeping only distinct ones. Next, xspells uses the synthetically generated instances \({\tilde{Z}}\) for querying the black box b (line 4). This is made possible by turning back the latent representation to text through the decoder \(\eta \) Bowman et al. (2016) (line 3). We tackle the requirement of generating local instances by randomly generating \(N \gg n\) latent instances, and then retaining in Z only the n closest instances to z, i.e., \(|Z| = n\). Such an heuristics balances diversity of instances in Z with similarity w.r.t. x. The distance function used in the latent space is the Euclidean distance. The neighborhood generation \( neighgen \) actually returns a partitioned set \(Z = Z_{=} \cup Z_{\ne }\) with \(z' \in Z_=\) such that \(b(\eta (z')) = b(\eta (z))\), and instances \(z' \in Z_{\ne }\) such that \(b(\eta (z')) \ne b(\eta (z))\). We further consider the problem of imbalanced distributions in Z, which may lead to weak decision trees. Class balancing between the two partitions is achieved by adopting the SMOTE Chawla et al. (2002) procedure if the proportion of the minority class is less than a predefined threshold \(\tau \).

4.2 Local latent rules and explanation extraction

Given Z and \(Y = b({\tilde{Z}})\), xspells builds a latent decision tree \( ldt \) (line 5) acting as a local surrogate of the black box, i.e., being able to locally mime the behavior of b. xspells adopts decision trees because decision rules can be naturally derived from a root-to-leaf path (Guidotti et al. 2019a). Indeed, the premise p of a rule \(r = p {\rightarrow } y\) is the conjunction of the split conditions from the root to the leaf of the tree that is followed by features in z. This approach is a variant of lore (see Sect. 3.1) but in a latent feature space. The consequence y of the rule is the class assigned at that leafFootnote 1.

Given a text x, the explanations returned by xspells are of the form \(\xi =\langle E, C, W \rangle \), where: \(E = \{e^x_1, \dots , e^x_u\}\) is the set of exemplars (\(b(e^x_i) = b(x) \; \forall i \in [1, u]\)); \(C = \{c^x_1, \ldots , c^x_v\}\) is the set of counter-exemplars (\(b(c^x_i) \ne b(x) \; \forall i \in [1, v]\)); and \(W = W_E \cup W_C\) is a union of the set \(W_E\) of the h most frequent words in exemplars E and of the set \(W_E\) of the h most frequent words in counter-exemplars C. Here, u, v, and h are parameters that can be set in xspells. Exemplars are chosen starting from the latent instances in Z which satisfy both the premise p and the consequence y of the rule \(r = p \rightarrow y\) above, namely the instances \(z' \in Z\) that follow the same path as z in the decision tree, and such that the \(b(\eta (z')) = y\). We visualized this in Fig. 2. By construction, elements in the same leaf as the instance to explain are candidate exemplars. The u instances \(z'\) closest to z are selected, using Euclidean distance. They are decoded back to the text space \(\eta (z')\) and included in E. Counter-exemplars are chosen starting from the latent instances \(z' \in Z\) which do not satisfy the premise p and such that \(b(\eta (z')) \ne b(x)\). Let us call them the set \({{\mathcal {A}}}\) of admissible counter-exemplars. The v instances in \({{\mathcal {A}}}\) closest to z are chosen. They are decoded back to the text space \(\eta (z')\) and included in C. We call such an approach distance-based counter-exemplar selection. Again, this is visualized in Fig. 2.

4.3 Diversity

Consider the set \({{\mathcal {A}}}\) of admissible counter-exemplars. The distance-based strategy of selecting the v instances in \({{\mathcal {A}}}\) closest to z may return counter-exemplars too similar to each other. Here, we consider the problem of improving diversity of the returned instances by proposing a diversity-based counter-exemplar selection. The presentation is for counter-exemplars, but the approach applies to exemplars simply by considering as admissible set the instances of Z satisfying the premise and the consequence of the rule \(p \rightarrow y\).

There are several ways of formulating diversity requirements. For example, Mothilal et al. (2020) combines the search for counter-factuals (in tabular instances) with diversity maximization by resorting to gradient descent for solving an optimization problem. In our case, we have a somehow simpler problem, since the set of admissible counter-factuals is given. We formulate our problem as maximizing a function over subsets of admissible counter-exemplars:

$$\begin{aligned} \underset{S \subseteq {{\mathcal {A}}} \wedge |S| \le v}{ argmax } \ h_z(S) \end{aligned}$$

where \(h_z(S)\) is an objective function to be optimized under a size constraint for S. We write z as sub-script to denote that \(h_z(S)\) depends also on the instance z. We choose the following objective function:

$$\begin{aligned} h_z(S) = |\bigcup _{a \in S} { nn _k}(a)| - \lambda \sum _{a \in S} dist (a, z) \end{aligned}$$

which maximizes the difference between the coverage of the k-nearest instances of counter-exemplars (a measure of diversity) and the total distance of counter-exemplars from z (a measure of proximity) regularized by a parameter \(\lambda \). The distance function \( dist \), which is also used in the k-nearest neighborhood function \({ nn _k}()\), is the Euclidean distance between min-max normalized vectors. An efficient greedy algorithm with formal guarantees can be devised for the function above. For details, see Appendix 1.

Let us discuss here how to set \(\lambda \). Let k be the number of nearest neighbors considered in \({ nn _k}(a)\) (including a itself) and let \(a_0 \in {{\mathcal {A}}}\) be the closest instance to z, i.e., such that the distance \(d_0 = dist (a_0, z)\) is minimal. We have that \(h_z(\{a_0\}) = k - \lambda \cdot d_0\), where k is the gain of adding \(a_0\) to the empty set, and \(\lambda \cdot d_0\) the cost. In order to have at least \(a_0\) selected, it must hold that \(h_z(\{a_0\}) > 0\), hence \(\lambda < k/d_0\). Since \(k \ge 1\), we conservatively set:

$$\begin{aligned} \lambda = 1/(2d_0) \end{aligned}$$

See Appendix 1 for a simulation on the dependency of such \(\lambda \) over k.

5 Experiments

In this section, we illustrate a qualitative and quantitative experimental analysis of faithfulness, usefulness, stability and diversity properties of xspells explanations. The xspells system has been developedFootnote 2 in Python. It relies on the CART decision tree algorithm as provided by the scikit-learn library, on a VAE implemented with the keras library, as well as on OptimusFootnote 3, a VAE that is pretrained on a large text corpus Li et al. (2020), specifically from Wikipedia.

Table 1 Dataset descriptions and data partitions

5.1 Experimental settings

Datasets. We conduct experiments on a total of five datasets covering a wide array of problems within the short text classification domain. These include sentiment classification, hate speech identification, spam detection, fake news detection and question classification (Li et al. 2020; Zhou et al. 2020). The hate speech dataset (hate) Davidson et al. (2017) contains tweets labeled as hate, offensive or neutral. Here, we focus on the 1430 tweets that belong to the hate class, and on the 4163 tweets of the neutral class. The polarity dataset (polarity) Pang and Lee (2005) contains 10,660 tweets about movie reviews. Half of these tweets are classified as negative reviews, and the other half as positive ones. The third dataset we look at contains comments from five YouTube videos (youtube) Alberto et al. (2015), either classified as spam or ham (i.e., not spam). The final labels are no spam and spam The liar dataset (liar) Wang (2017) contains 12,788 manually annotated short statements with regards to whether they contain false or real information and are taken from politifact.com. This dataset contains six classes ranging from utterly false news to completely real news. Following related work Alhindi et al. (2018), we merge the classes to convert the task to a binary classification one. The labels are named fake news and real news respectively. Finally, the question dataset (question) Li and Roth (2002) contains questions categorized into six different semantic classes. To turn the problem into binary classification, we considered a "1 vs rest" classifier. where the label defined as 1 is the one with the largest number of instances. This class label is entity and it reflects questions that are about a specific entity. Thus, the two emerging labels are named entity and all other classes All these datasets are remarkable examples where a black box approach is likely to be used to remove or flag posts or to even ban users, possibly in an automated way, or to recommend topics based on user questions, possibly leading to wrong answers. Such extreme actions risk to hurt the free speech rights of people. Explanations of the black box decisions are then of primary relevance both to account for the action and to test or debug a black box. Datasets details are reported in Table 1 (left). We will experiment with binary (short) text classification. This is not a limitation of the proposed approach, which is able to deal with multi-class black boxes. Rather, the problem traces back to the limited size of the available datasets, which have to be partitioned into different sets for training black boxes, tuning VAEs, and testing the approach, as described next. Working with more than two classes makes it harder to train black boxes with acceptable performances.

Experimental scenarios. For each dataset, we use 75% of the available data for training a black box machine learning classifier. We call such a subset the training set. The remaining 25% of data, called the test set, is used for training/tuning the autoencoder and for explaining the black box decisions. More specifically, the test set is further split into 75% for training/tuning the autoencoder, called the tuning set, and 25% as the set of instances to explain, called the explanation set. All splits are stratified. We call this scenario the standard case. The VAE that we implemented, when trained on the tuning set, will be denoted as SVAE (for “standard VAE"). Also, we call OVAE the Optimus (Li et al. 2020) VAE, when it is fine-tuned using the tuning set. We will also experiment with an alternative scenario, which simulates the ideal situation when the autoencoder has a perfect knowledge of the (domain of the) instances to explain. In this scenario, both the tuning set and the explanation set are fixed to the whole test set. In particular, we call BVAE (for “best case VAE") the VAE we implemented, when it is trained on the whole test set. This scenario is realistic during the development cycle of a black box, when the developers want to debug its decisions on a set of instances representative of the population. Finally, for computational efficiency reasons, in both scenarios we selected 100 instances (98 for the youtube dataset in the case of the standard scenario) from the explanation set to calculate the quantitative evaluation metrics (fidelity, usefulness, stability, and diversity).

Table 2 Black box model accuracy and MRE of BVAE and OVAE

Black boxes. We trained and explained the following black box classifiers: Random Forests (Tan et al. 2016) (RF) as implemented by the scikit-learn library, and Deep Neural Networks (DNN) as implemented with the keras library. For the RF, we transformed texts into their TF-IDF weight vectors (Tan et al. 2016), after removing Twitter stop-words such as “rt”, hashtags, URLs and usernames. For the youtube dataset we additionally removed emojis and texts that hold more than 140 characters (maximum length of a tweet until 2017). A randomized cross-validation search was then performed for parameter tuning. Parameters for RF models were set as follows: 100 decision trees, Gini split criterion, \(\sqrt{m}\) random features where m is the total number of features; no limit on tree depth. The DNNs adopted have the architecture shown in Fig. 3. The first layer is a dense embedding layer. It takes as input a sparse vector representation of each text (subject to same pre-processing steps as for the RF, without the TF-IDF representation) obtained by using a Keras tokenizerFootnote 4 to turn the text into an array of integers and a padder so that each vector has the same length. This way, we allow the network to learn its own dense embeddings of size 64. The first embedding layer is followed by a dropout layer at 0.25. Afterwards, the DNN is composed by three dense layers with sizes 64, 512 and 128. The central layer is an LSTM (Hochreiter and Schmidhuber 1997) that captures the sequential nature of texts and has size 100. After that, there are three dense layers with sizes 512, 64 and 32. The dense layers adopt the ReLu activation function. Finally, the sigmoid activation function is used for the final classification. We adopted binary cross-entropy as loss function and the Adam optimizer. We trained the DNN for a maximum of 100 epochs, or until the performance of the model stops improving on a held out validation dataset for more than 2 epochs. Classification accuracies are reported in Table 2.

Fig. 3
figure 3

Architecture of DNN black boxes. Bold text: type of layer. Numbers: layer size

Table 3 MRE of SVAE

VAEs. Here, we describe the structure and the training parameters of the VAEs we developed. We distinguish the best scenario (BVAE) from the standard scenario (SVAE). The performances of SVAE were considerably worse than BVAE (see later on). For this reason, we decided to adopt the pre-trained Optimus VAE in the standard scenario (OVAE). Experiments in the following subsections will involve only BVAE and OVAE. Table 2 reports the Mean Reconstruction Error (MRE) calculated as the average cosine similarity distance between an instance in the explanation set and its reconstructed text when converted to TF-IDF vectors. Since the OVAE generates texts that are similar in meaning but more diverse in words, the MRE is for all datasets higher (worse) compared to the BVAE. However, this does not mean the generated texts are not useful.

Both BVAE and SVAE have their encoders \(\zeta \) and decoders \(\eta \) implemented as a single LSTM layer. This design choice is inspired by the widely successful usage of seq2seq models for texts viewed as sequences of characters (Sutskever et al. 2014). We fed the text into the VAE using a one-hot vectorization. In order to have control over the dimensionality of the input, and a fortiori for the computational resources required for training the VAE, we kept a maximum of 5000 distinct words for each dataset (this actually affects the polarity and liar datasets). This vocabulary was extended with the 1,000 most common English wordsFootnote 5 to provide to the VAE also knowledge about unseen words with respect to the training set. The resulting size of the input tensors was \(33 \cdot 4947 {=} 163,251 \) for the hate dataset, \(49 \cdot 5287 {=} 259,063 \) for the polarity dataset after Twitter stop-words removal for both of the previous datasets, \(32 \cdot 1666 {=} 53,312\) for the youtube dataset, \(67 \cdot 5319 {=} 356,373 \) for the liar dataset, and \(35 \cdot 3966 {=} 138,810\) for the question dataset. The numbers above represent the maximum text length (number of words) and the number of distinct words considered. These dimensionalities are manageable because the input consists of short texts. We considered \(k{=}500\) latent featuresFootnote 6 for all datasets. The number of training epochs of the BVAE are 200 for the hate, the youtube and question dataset, 250 for the polarity and the liar dataset. These numbers were chosen after observing the epoch at which the reconstruction loss stabilized.

We proceeded similarly for SVAE. As a first approach, we trained a VAE for 100 epochs for the hate and polarity datasets. Figure 4 highlights that the validation error has a minimum at roughly 30 epochs for both datasets. As expected, the MRE values after 30 epochs of training are lower (better) than after 100 epochs, see also Table 3. However, these values are considerably worse than for the BVAE (cf. Table 2). We also observed that the Kullback-Leibler loss term is much smaller than the reconstruction loss (see Fig. 4), a phenomenon also know as KL vanishing. Therefore we experimented with Kullback-Leibler annealing as e.g. described in (Bowman et al. 2016), but could not obtain good enough results to continue with the subsequent experiments using SVAE.

Finally, fine-tuning of the OVAE was done for 10 epochs and on each dataset separately. Several variants of the pretrained autoencoder are released: we used the version with 768 latent dimensions and set the parameter \(\beta = 0.5\) (weighting the Kullback-Leibler loss term). This is a trade-off between reconstruction quality of the sentences and smooth generation of new data instances, i.e., between a standard autoencoder and a full variational autoencoder. We disabled SMOTE for class balancing of sentences generated by the OVAE.

Fig. 4
figure 4

Training and validation error for SVAE. The total training loss can be further deconstructed in the reconstruction loss and the Kullback-Leibler loss. The validation error increases after roughly 30 epochs

Hyper-parameters. We set the following xspells hyper-parameters. The neighborhood generation \( neighgen \) is run with \(N {=} 600\), \(n {=} 200\), \(\tau {=}40\%\). For the latent decision tree, we used the default parameters of the CART implementationFootnote 7. Finally, with regards to the explanation hyper-parameters, we set \(u{=}v{=}5\) (counter-)exemplars, and \(h{=}5\) most frequent words for exemplars and for counter-exemplars. Finally, unless otherwise stated, we adopt the diversity-based counter-exemplar selection, setting \(k=5\) for the \({ nn _k}()\) function in (2).

Compared approaches. We compare xspells against lime (Ribeiro et al. 2016), and, for the qualitative part, also against anchor. We cannot compare against shap (Lundberg and Lee 2017) because it is not immediate how to adapt it to text classifiers. As discussed in Sect. 2, we do not compare to adaptions of IntGrad (Sundararajan et al. 2017) or LRP (Bach et al. 2015), originally proposed for image classifiers, to explain text classifiers because they are not model-agnostic.

Table 4 Explanations returned by xspells for texts labeled as hate in the hate dataset for BVAE and OVAE. Three exemplars (E) and two counter-exemplars (C) for each tweet. Relative word frequencies over \(u=v=5\) (counter-)exemplars in parenthesis
Table 5 Explanations returned by xspells for texts labeled as negative in the polarity dataset for BVAE and OVAE. Three exemplars (E) and two counter-exemplars (C) for each tweet. Relative word frequencies over \(u=v=5\) (counter-)exemplars in parenthesis
Table 6 Explanations returned by xspells for texts labeled as spam in the youtube dataset for BVAE and OVAE. Three exemplars (E) and two counter-exemplars (C) for each tweet. Relative word frequencies over \(u=v=5\) (counter-)exemplars in parenthesis
Table 7 Explanations returned by xspells for texts labeled as fake news in the liar dataset for BVAE and OVAE. Three exemplars (E) and two counter-exemplars (C) for each tweet. Relative word frequencies over \(u=v=5\) (counter-)exemplars in parenthesis
Table 8 Explanations returned by xspells for texts labeled as entity in the question dataset for BVAE and OVAE. Three exemplars (E) and two counter-exemplars (C) for each tweet. Relative word frequencies over \(u=v=5\) (counter-)exemplars in parenthesis

5.2 Qualitative evaluation

In this section, we qualitatively evaluate the explanations by xspells and contrast them with the ones by lime and anchor. Example texts labeled as hate/negative/spam/fake news/entity are shown in Table 4 to Table 8. Each table contains three exemplars and two counter-exemplars for both the BVAE and the OVAE.

Table 4 reports an example for the hate dataset. This dataset is only used for research purposes, we clearly distance from any form of discrimination expressed in there. In particular, we avoid reproducing extremely offensive words by writing “N" instead of the N-word. Explanations based on the BVAE for DNN and RF are clearly offensive when it comes to exemplars, while the opposite is true with counter exemplars. Exemplars in the case of OVAE are instead more diverse, whilst counter-exemplars of DNN and RF coincide in this case. In both cases, top words in the exemplar class tend to be very strong/offensive and they can be clearly associated to hate speech.

Table 5 shows an example for the polarity dataset. A clear difference between both BVAE and OVAE is the sentences length. Instances generated by the OVAE are in general longer, nevertheless meaningful. This is emphasized by a lower relative word frequency. All generated exemplar sentences are different to each other, whereas one counter-exemplar is picked up both for DNN and RF in the case of OVAE. Instances explained by the BVAE contain the top word “bad” in their factual (negative) class with a high relative word frequencyFootnote 8. It can be clearly assigned to a negative sentiment. In this example, we have two cases where the black box prediction was not correct, i.e., a misclassification occurred. This is the case for the BVAE RF and OVAE DNN. We can see that exemplars and counter-exemplars do not contrast very well against each other, i.e., do not clearly distinguish different classes. This is consistent, for instance, with a possible overfitting of the black box in the neighborhood of the instance to explain. More specifically, in the case of the BVAE DNN, we can see that exemplars seem to praise the movie, while counter-exemplars criticize it. This is in-line with the fact that the label was in-correctly predicted as positive.

Generated sentences for the youtube dataset are rather short, as displayed in Table 6. Both VAEs are able to pick up key words from the spam comments as most frequent words in their explanations, e.g. “subscribe”, “check” or “video”. This is mirrored by the exemplar sentences from the BVAE, which are very similar to each other. For the OVAE, sentences are more diverse, but transport a similar meaning. For the BVAE, most frequent counter-factual words do not show a pattern whereas in the OVAE example for RF, we can repeatedly observe the words “love” and “song”. There is also a misclassification in this example in the case of the OVAE RF. It can be notable that the exemplars and counter-exemplars have switched in the meaning that they convey. Exemplars show interest, while counter-exemplars are spammy.

In Table 7 are the results for the liar dataset. In this example, counter exemplar sentences seem to make outlandish claims, while exemplar sentences seem more grounded. In the case of the BVAE, the top words don’t contrast each other very well. In the OVAE, we can see that more negative words appear in the exemplar case.

Finally, the results regarding the question dataset, can be seen in Table 8. In this example, both black boxes for the OVAE have misclassified the sentence. In the case of the BVAE, we can see that the exemplars correctly show that the answer to the question asked is with respect to an entity, while the counter-exemplars ask either for definitions or explanations. This is not the case for the OVAE, as the exemplars are making geographical inquiries or asking questions regarding descriptions.

We contrast explanations of xspells against those from lime reported in Table 9. The texts from all the datasets except the liar text are correctly explained by lime. For the youtube comment, for instance, we observe that the most important words are for both black box models “check" and “out", but receive no further information. Overall, since lime extracts words from the text under analysis, it can only provide explanations using such words. On the contrary, the (counter-)exemplars of xspells consist of several and diverse texts which are (most of the times) close in meaning to the text under analysis, but including different wordings that help the user better grasp the reasons behind black box decision. A similar conclusion holds for anchor, which produces as explanations a subset of the words in the text – called anchors. For the youtube example and the RF black box, it returns the anchors “out" and “youtube", with a precision of 100%. This means that when both words are present in a randomly generated neighbor instance, then the black box labels the text as spam. anchor outputs also exemplar texts, obtained from the random generation of the neighborhood. Differently from lime, however, the randomness can be driven by distance in some embedding space between a word in the text and others words with the same POS tag. On our sample text, using the embedding provided by the BERT transformers Devlin et al. (2019), the following top five exemplars are returnedFootnote 9 for the youtube text:

  • cut out the video on youtube

  • check out this page on youtube

  • check out movie trailers on youtube

  • file out play list on youtube

  • get out loud online on youtube

These exemplars have the same syntactic structure of the text to explain (and the same size). On the positive side, this leads to syntactically valid texts, yet not necessarily meaningful (as the last two ones). This is not true, however, for tweets, which lack the formal structure of sentences, and then relying on POS tagging is not effective. For instance, the exemplars returned for the sample text of the polarity dataset are meaniningless. Moreover, exemplars of anchor do not show much variation with respect to the text to explain. Examples of using anchor on the rest of the datasets can be found on our Github repositoryFootnote 10.

Table 9 Explanations returned by lime for texts labeled as hate (hate dataset), negative (polarity dataset), spam (youtube dataset), fake news (liar dataset), or entity (question dataset). lime word importance in parenthesis
Table 10 Mean and standard deviation of fidelity. The higher the better

5.3 Fidelity evaluation

Surrogate models adopted by explanation approaches should mimic the decision logic of the black box. Here, we show that xspells performs better than lime under such a perspective. We evaluate the faithfulness (Freitas 2013; Guidotti et al. 2019b) of the latent decision tree adopted by xspells by measuring how well it reproduces the behavior of the black box b in the neighborhood of the text x to explain – a metric known as fidelity. Let Z be the neighborhood of z in the latent space generated at line 2 of Alg. 1 and \( ldt \) be the surrogate decision tree computed at line 5. The fidelity metric is \(|\{ y \in Z \ |\ ldt (y) = b(\eta (y)) \}|/|Z|\), namely the accuracy of \( ldt \) assuming as ground truth the black box. The fidelity values over all instances in the explanation set are aggregated by taking their average and standard deviation.

We compare xspells against lime, which adopts as surrogate model a linear regression over the feature space of words and generates the neighborhood using a purely random strategy. Table 10 reports the average fidelity and its standard deviation for lime and xspells in its variants depending on the type of VAE adopted. We notice how on every dataset and on every black box xspells fidelity is markedly higher than the one of lime independently from the VAE adopted.

Fig. 5
figure 5

Accuracy (usefulness) of xspells (with BVAE/OVAE) vs a baseline for an increasing number of exemplars and counter-exemplars adopted as neighbors. Best viewed in color

5.4 Usefulness evaluation

In this section, we investigate on the usefulness of xspells explanations for the user. How can we evaluate the usefulness of explanations? The gold standard would require to run lab experiments involving human evaluators. Inspired by Kim et al. (2016), we provide here an indirect evaluation by means of a k-Nearest Neighbor (k-NN) classifier (Tan et al. 2016). For a text x in the explanation set, first we randomly select n exemplars and n counter-exemplars from the output of xspells. Then, a 1-NN classifierFootnote 11 is trained over such (counter-)exemplars. Finally, we test 1-NN over the text x and compare the prediction of 1-NN with the label b(x) predicted by the black box. In other words, the 1-NN approximates a human in assessing the (counter-)exemplars usefulness. The accuracy computed over all x’s in the explanation set is a proxy measure of how good/useful are (counter-)exemplars at delimiting the decision boundary of the black box. We compare such an approach with a baseline (or null) model consisting of a 1-NN trained on n texts per class label, selected randomly from the training set and not including x.

The accuracy of the two approaches are reported in Fig. 5 by varying the number n of exemplars and counter-exemplars. xspells neatly overcomes the baseline. The difference is particularly marked for when n is small. Even though the difference tend to decrease for large n’s, large-sized explanations are less useful in practice due to cognitive limitations of human evaluators. Moreover, xspells performances are quite stable w.r.t. n, i.e., even one or two exemplars and counter-exemplars are sufficient to let the 1-NN classifier distinguish the label assigned to x in an accurate way. The use of BVAE compared to OVAE does not significatively change in the accuracy of xspells.

5.5 Stability evaluation

Let us contrast xspells with lime as per the stability of their explanations. Stability is a key requirement, which heavily impacts users’ trust on post-hoc explainability methods (Rudin 2019). For local approaches, the generation of the neighborhood introduces randomness in the process, leading to different explanations for a same instance in different runs of the method, or to disproportionately different explanations for two close instances (Alvarez-Melis and Jaakkola 2018). Instability of the surrogate models learned from the neighborhood instance may exacerbate the problem Guidotti and Ruggieri (2019). Extensions of lime tackle the problem by avoiding sampling of neighbor instances in favor of training data (Zafar and Khan 2021), by adopting a denoising autoencoder to compute neighbor instances over a more robust space (Shankaranarayana and Runje 2019), or by determining the number of instances needed to statistically guarantee stability of the explanation (Zhou et al. 2021). In addition, methods using text embedding, such as ours, suffer from the variability introduced by the encoding-decoding of texts in the latent space. Several metrics of stability can be devised (Alvarez-Melis and Jaakkola 2018; Guidotti and Ruggieri 2019), either independent or specific to the explanation representation (Visani et al. 2021). A possible choice is to use sensitivity analysis with regard to how much an explanation varies on the basis of the randomness in the explanation process. We measure here stability as a relative notion, that we call coherence. For a given text x in the explanation set, we consider its closest text \(x^c\) and its k-th closest text \(x^f\), again in the explanation set. A form of Lipschitz condition (Alvarez-Melis and Jaakkola 2018) would require that the distance between the explanations e(x) and \(e(x^f)\), normalized by the distance between x and \(x^f\), should not be much different than the distance between the explanations e(x) and \(e(x^c)\), again normalized by the distance between x and \(x^c\). Stated in words, normalized distances between explanations should be as similar as possible. Formally, we introduce the following coherence index:

$$\begin{aligned} \mathcal {C}_x = \frac{ dist _e(e({x^{f})}, e({x}))/ dist (x^{f}, x)}{ dist _e(e({x^{c}}), e({x}))/ dist (x^{c}, x)}\end{aligned}$$

where we adopt as distance function \( dist \) the cosine distance between the TF-IDF representation of the texts, and as distance function \( dist _e\) the Jaccard distance between the 10 most frequent words in each explanation (namely, the W set). In experiments, we set \(x^f\) to be the \(k=10\)-closest text w.r.t. x. For comparison, the coherence index is computed also for lime, with Jaccard similarity calculated between the sets of 10 words (a.k.a. features) that lime deems more relevant.

Table 11 Mean and stdev of the coherence index \(\mathcal {C}_x\). The closer to 1 the better

Table 11 reports the average coherence over the explanation set. xspells has a slightly worse level of coherence, yet statistically significant (p-value \(< 0.01\)) in most cases except the following ones: the youtube dataset and OVAE RF/DNN, of the polarity dataset and BVAE RF or OVAE RF/DNN, and of the liar dataset and BVAE RF or OVAE RF. This is expected, as lime’s explanations are subsets of the original text to explain, hence more stable across similar instances. xspells trade-offs a slightly worse stability for diversity of exemplars, and, a fortiori, of the most frequent words in them. Finally, contrasting BVAE with OVAE, we observe the latter has lower mean and standard deviation values in almost all cases.

Fig. 6
figure 6

Density estimation of the objective function \(h_z(S)\) (see (2)) for S selected by the distance-based and by the diversity-based counter-exemplar strategies. Best viewed in color

5.6 Diversity of counter-exemplars

Let us contrast here the diversity-based with the distance-based selection of counter-exemplars (see Sect. 4.3), showing that the former has a better impact on diversity of counter-exemplars. Their impact on usefulness and stability metrics are comparable (experiments not reported here for lack of space). Here, we consider their impact on the objective function \(h_z()\) (see formula (2)). Since the diversity-based approach optimizes such an objective function, we expect it to perform better than the distance-based approach. The point is to quantify the degree of diversity it can capture.

Figure 6 shows the (kernel density estimates of the) distributions of \(h_z(S)\) when S is the set of v counter-exemplars selected by the two strategies for each instance in the explanation set. The densities of the distance-based approach approximately resemble a normal distribution whereas the densities for the diversity-based approach are skewed towards higher values of the objective function. The difference between the two densities is statistically significant (\(p < 0.01\) in the Kolmogorov-Smirnov test), irrespective of the datasets and the VAEs. This is expected, as for the diversity-based approach, the greedy algorithm adopted (see CSG in Appendix 1) provides a theoretical guarantee of approximating the optimal solution, while this is not the case for the distance-based approach. Finally, for the diversity-based strategy, the OVAE has a more peaked distribution than the BVAE, i.e., the objective function for the selected counter-exemplars is on average closer to the optimal solution. This can be attributed to the different latent spaces of the two VAEs.

6 Conclusion

We have presented xspells, a local model-agnostic explanation approach for black box short text classifiers. The key feature of xspells is the adoption of variational autoencoders for generating meaningful synthetic texts in a latent space. We considered both a pre-trained variational autoencoder (OVAE) and two ones trained on a subset of the available data (SVAE and BVAE). The latent space is also revealed to be essential for inducing a decision tree which helps in characterizing exemplar and counter-factual exemplar texts. Diversity of counter-exemplars is modeled as an optimization problem. The approach advances over baseline explainers, such as lime, which only highlight the contribution of words already in the text to explain. Experiments showed that xspells also exhibits better fidelity and usefulness, while trading off stability with diversity.

The proposed approach has some clear limitations. First, we will consider extending the explanations returned by xspells with logic rules, which convey information at a more abstract level than exemplars. Such rules can be extracted from the decision tree on the latent space, but have to be decoded back to rules on texts – a challenging task. Second, while the framework is general enough to cover multi-class black boxes, where the class labels are more than two, specific experimental analysis is required to properly validate the approach. For example, by comparing the usage of multi-class decision trees against transformation to binary problems (one vs. one, or one vs. rest). Extension of the framework are instead required to tackle multi-label classification, where more than one label might be assigned to an instance, and ranking of class labels (Sebastiani 2002). Third, xspells could be extended to account for long texts. While the theoretical framework does not change, the implementation of VAEs does not scale to large dimensionalities of the input/output. A possible direction is to use word2vec embeddings (Goldberg and Levy 2014). Fourth, we could rely on semantic and linguistic resources (Altinel and Ganiz 2018), such a thesaurus or domain ontologies, to empower both synthetic text generation and to enrich the expressiveness of the (counter-)exemplars. Finally, a user study to assess the perceived usefulness of the explanations of xspells would be definitively required, e.g., through crowdsourcing or lab experiments (Förster et al. 2020) or by comparing with human-annotated explanations (Wiegreffe and Marasović 2021). In fact, automatic measures have been shown to correlate moderately with human judgements (Nguyen 2018). Even more challenging is to incorporate in the design of the explanations, cognitive and social aspects regarding the users of xspells, i.e., moving towards Human-centered Explainable (AI Ehsan and Riedl 2020).