In this section, we present qualitative and quantitative experimental analyses of the faithfulness, usefulness, and stability of xspells explanationsFootnote 2. The xspells system has been developed in Python, and it relies on the CART decision tree algorithm as implemented by the scikit-learn library, and on VAEs implemented with the keras libraryFootnote 3.
5.1 Experimental Settings
We experimented with the proposed approach on two datasets of tweets. The hate speech dataset (hate)
[8] contains tweets labeled as hate, offensive or neutral. Here, we focus on the 1,430 tweets that belong to the hate class, and on the 4,163 tweets of the neutral class. The polarity dataset (polarity)
[30] contains tweets about movie reviews. Half of these tweets are classified as positive reviews, and the other half as negative ones. These two datasets are remarkable examples where a black box approach is likely to be used to remove posts or to ban users, possibly in an automated way. Such extreme actions risk hurting people's right to free speech. Explanations of the black box decisions are then of primary relevance both to account for the action and to test/debug the black box.
For both datasets, we use 75% of the available data for training a black box machine learning classifier. The remaining 25% of the data is used for testing the black box decisions. More specifically, 75% of that testing data is used for training the autoencoder, and 25% for explaining black box decisions (explanation set). Dataset details are reported in Table 1 (left).
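For concreteness, the split can be sketched with scikit-learn as follows; the placeholder data, random seed, and variable names are illustrative and not part of the original experimental code.

```python
from sklearn.model_selection import train_test_split

tweets = [f"tweet {i}" for i in range(100)]   # placeholder dataset

# 75% of the data trains the black box; of the remaining 25%,
# 75% trains the autoencoder and 25% forms the explanation set.
bb_train, rest = train_test_split(tweets, train_size=0.75, random_state=42)
vae_train, explanation_set = train_test_split(rest, train_size=0.75, random_state=42)
```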
Table 1. Datasets description, black box models accuracy, and VAE MRE.
We trained and explained the following black box classifiers: Random Forest
[38] (RF) as implemented by the scikit-learn library, and Deep Neural Networks (DNN) implemented with the keras library. For the RF, we transformed texts into their TF-IDF weight vectors
[38], after removing stop-words, including Twitter stop-words such as “rt”, hashtags, URLs and usernames. A randomized cross-validation search was then performed for parameter tuning. Parameters for the RF models were set as follows: 100 decision trees, Gini split criterion, \(\sqrt{m}\) random features, where m is the total number of features, and no limit on tree depth. The DNNs adopted have the following architecture. The first layer is a dense embedding layer. It takes as input a sparse vector representation of each text (subject to the same pre-processing steps as for the RF, but without the TF-IDF weighting), obtained by using a Keras tokenizerFootnote 4 to turn the text into an array of integers and a padder so that all vectors have the same length. This way, we allow the network to learn its own dense embeddings of size 64. The embedding layer is followed by a dropout layer at 0.25. Afterwards, the DNN is composed of three dense layers with sizes 64, 512 and 128. The central layer is an LSTM
[20] that captures the sequential nature of texts and has size 100. After that, there are three dense layers with sizes 512, 64 and 32. The dense layers adopt the ReLU activation function. Finally, a sigmoid activation function is used for the final classification. We adopted binary cross-entropy as the loss function and the Adam optimizer. We trained the DNN for 100 epochs. Classification performance is reported in Table 1 (center-right).
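A minimal Keras sketch of the architecture described above is given below; the vocabulary size, the padded sequence length, and the single-unit output layer are assumptions made for illustration, as they are not fixed by the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10000   # assumed tokenizer vocabulary size
max_len = 50         # assumed padded sequence length

model = keras.Sequential([
    keras.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 64),      # learned dense embeddings of size 64
    layers.Dropout(0.25),
    layers.Dense(64, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.LSTM(100),                      # central layer capturing sequential structure
    layers.Dense(512, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"), # final binary sentiment output (assumed single unit)
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=100)
```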
We designed the VAEs used in the experiments with both the encoder \(\zeta \) and the decoder \(\eta \) consisting of a single LSTM layer. We fed the text into the VAE using a one-hot vectorization that takes as input tensors of dimension \(33 \cdot 5368 \,{=}\, 177,144\) for the hate dataset, and \(48 \cdot 5308 \,{=}\, 254,784\) for the polarity dataset, after stop-word removal. These numbers represent the maximum text length times the number of distinct words considered. In order to provide the VAE with knowledge also about words unseen in its training set, we extended the vocabulary with the 1000 most common English wordsFootnote 5. We considered \(k\,{=}\,500\) latent features for both datasetsFootnote 6. Table 1 (right) reports the Mean Reconstruction Error (MRE), calculated as the average cosine distance between the original and the reconstructed texts when converted to TF-IDF vectors. We set the following xspells hyper-parameters. The neighborhood generation \( neighgen \) is run with \(N \,{=}\, 600\), \(n \,{=}\, 200\), and \(\tau \,{=}\,40\%\). For the latent decision tree we used the default parameters of the CART implementation. Finally, with regard to the explanation hyper-parameters, we set \(u\,{=}\,v\,{=}\,5\) (counter-)exemplars, and \(h\,{=}\,5\) most frequent words for exemplars and for counter-exemplars.
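The one-hot vectorization feeding the VAE can be illustrated with the following sketch (using the hate dataset dimensions); the function and variable names are illustrative and not taken from the actual implementation.

```python
import numpy as np

max_len, vocab_size = 33, 5368   # maximum text length, number of distinct words (hate dataset)

def one_hot_encode(token_ids):
    """token_ids: integer word indices of a stop-word-filtered text."""
    x = np.zeros((max_len, vocab_size), dtype=np.float32)   # 33 * 5368 = 177,144 entries
    for pos, idx in enumerate(token_ids[:max_len]):
        x[pos, idx] = 1.0
    return x
```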
In the experiments we compare xspells against lime
[33]. We cannot compare against shap
[25] and anchor
[34] because it is not immediately clear how to practically employ them to explain sentiment classifiers. Other approaches such as IntGrad
[37] or LRP
[1] could theoretically be used to explain sentiment classifiers. However, first, they are not model-agnostic but tied to DNNs, and second, they are typically used for explaining image classifiers.
Table 2. Explanations returned by xspells for texts classified as hate in the hate dataset, and as negative in the polarity dataset. Three exemplars (E) and two counter-exemplars (C) for each tweet. Relative word frequencies in parenthesis.
5.2 Qualitative Evaluation
In this section, we qualitatively compare xspells explanations with those returned by lime. Tables 2 and 3 show sample explanations for both experimental datasets, and considering the RF black box sentiment classifier.
The first and second tweets in Table 2 belong to the hate dataset and are classified as hate. Looking at the exemplars returned by xspells, the hate sentiment emerges from the presence of the word “hate”, from sexually degrading references, and from derogatory adjectives. On the other hand, the counter-exemplars refer to women and to work from a positive perspective. The second tweet of the hate dataset follows a similar pattern. The focus this time is on the word “retard”, used here with negative connotations. Differently from xspells, the explanations returned by lime in Table 3 for the same tweets show that the hate sentiment is mainly due to the words “faggot” and “retards”, but they provide no further detail, hence giving the user only a limited understanding.
Table 3. Explanations returned by lime for tweets classified as hate in the hate dataset, and as negative in the polarity dataset. lime word importance in parenthesis.
The usefulness of the exemplars and counter-exemplars of xspells is even more apparent for the polarity dataset, where the RF correctly assigns the negative sentiment to the sample tweets in Table 2. For the first tweet, xspells recognizes the negative sentiment captured by the RF and provides exemplars containing negative words such as “trash”, “imperfect”, and “extremely unfunny” as negative synonyms of “eccentric”, “forgettable”, and “doldrums”. The counter-exemplars show the positive connotation and context that words must have to turn the sentiment into positive. On the contrary, lime (Table 3) is not able to capture such complex words and focuses on terms like “off”, “debut”, or “enough”. For the second tweet, xspells is able to generate exemplars similar in meaning to the tweet under investigation: the tweet starts positive (or appears to), but reveals/hides a negative sentiment in the end. In this case, the most frequent words alone are not very useful. Indeed, (the surrogate linear classifier of) lime mis-classifies the second tweet as positive, giving importance to the word “work”, which, however, is not the focus of the negative sentiment.
Overall, since lime extracts words from the text under analysis, it can only provide explanations using such words. On the contrary, the (counter-)exemplars of xspells consist of texts which are close in meaning but use different wordings, helping the user better grasp the reasons behind the black box decision.
Table 4. Mean and standard deviation of fidelity. The higher the better.
5.3 Fidelity Evaluation
We evaluate the faithfulness
[11, 17] of the surrogate latent decision tree adopted by xspells by measuring how well it reproduces the behavior of the black box b in the neighborhood of the text x to explain – a metric known as fidelity. Let Z be the neighborhood of x in the latent space generated at line 2 of Algorithm 1 and \( ldt \) be the surrogate decision tree computed at line 5. The fidelity metric is \(|\{ y \in Z \ |\ ldt (y) = b(\eta (y)) \}|/|Z|\), namely the accuracy of \( ldt \) assuming as ground truth the black box. The fidelity values over all instances in the explanation set are aggregated by taking their average and standard deviation.
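As a sketch of this computation, assuming ldt is a fitted scikit-learn decision tree over the latent space, decode implements \(\eta \), and black_box implements b (all names are illustrative):

```python
import numpy as np

def fidelity(Z, ldt, black_box, decode):
    """Fraction of latent neighbors on which the surrogate tree agrees with the black box."""
    surrogate_labels = ldt.predict(Z)                                   # ldt(y) for each y in Z
    black_box_labels = np.array([black_box(decode(y)) for y in Z])      # b(eta(y))
    return float(np.mean(surrogate_labels == black_box_labels))
```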
We compare xspells against lime, which adopts as surrogate model a linear regression over the feature space of words and generates the neighborhood using a purely random strategy. Table 4 reports the average fidelity and its standard deviation. On the hate dataset, xspells reaches almost perfect fidelity for both black boxes, while lime's performance is markedly lower for the RF black box. On the polarity dataset, the difference is less marked, but still in favor of xspells. A Welch's t-test shows that the difference in fidelity between xspells and lime is statistically significant (p-value \(< 0.01\)) in all cases from Table 4.
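The significance test mentioned above corresponds to SciPy's t-test with equal_var=False (Welch's variant); the per-instance fidelity arrays in the sketch below are placeholders, not the actual experimental values.

```python
from scipy.stats import ttest_ind

fidelity_xspells = [0.99, 1.00, 0.98, 1.00, 0.99]   # placeholder per-instance fidelities
fidelity_lime    = [0.91, 0.88, 0.93, 0.90, 0.89]   # placeholder per-instance fidelities

t_stat, p_value = ttest_ind(fidelity_xspells, fidelity_lime, equal_var=False)  # Welch's t-test
print("significant at the 1% level:", p_value < 0.01)
```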
5.4 Usefulness Evaluation
How can we evaluate the usefulness of xspells explanations? The gold standard would require running lab experiments involving human evaluators. Inspired by
[21], we provide here an indirect evaluation by means of a k-Nearest Neighbor (k-NN) classifier
[38]. For a text x in the explanation set, we first randomly select n exemplars and n counter-exemplars from the output of xspells. Then, a 1-NN classifierFootnote 7 is trained over such (counter-)exemplars. Finally, we test the 1-NN over the text x and compare its prediction with the sentiment b(x) predicted by the black box. In other words, the 1-NN approximates a human in assessing the usefulness of the (counter-)exemplars. The accuracy computed over all x's in the explanation set is a proxy measure of how good/useful the (counter-)exemplars are at delimiting the decision boundary of the black box. We compare such an approach with a baseline (or null) model consisting of a 1-NN trained on n texts per sentiment, selected randomly from the training set and not including x.
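A sketch of this proxy evaluation with scikit-learn follows; get_exemplars (returning the 2n (counter-)exemplar texts with their black box labels) and b are hypothetical stand-ins for the xspells output and the black box.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

def usefulness_accuracy(explanation_set, get_exemplars, b, n=5):
    vectorizer = TfidfVectorizer().fit(explanation_set)
    hits = 0
    for x in explanation_set:
        texts, labels = get_exemplars(x, n)           # n exemplars + n counter-exemplars
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(vectorizer.transform(texts), labels)  # 1-NN over the (counter-)exemplars
        hits += int(knn.predict(vectorizer.transform([x]))[0] == b(x))
    return hits / len(explanation_set)
```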
The accuracies of the two approaches are reported in Fig. 2, varying the number n of exemplars and counter-exemplars. xspells neatly overcomes the baseline. The difference is particularly marked when n is small. Even though the difference tends to decrease for large n's, large-sized explanations are less useful in practice due to the cognitive limitations of human evaluators. Moreover, xspells performance is quite stable w.r.t. n, i.e., even one or two exemplars and counter-exemplars are sufficient to let the 1-NN classifier distinguish the sentiment assigned to x in an accurate way.
Table 5. Mean and stdev of the coherence index \(\mathcal {C}_x\). The closer to 1 the better.
5.5 Stability Evaluation
Stability of explanations is a key requirement, which heavily impacts users' trust in explainability methods
[35]. Several metrics of stability can be devised
[18, 27]. A possible choice is to use sensitivity analysis with regard to how much an explanation varies on the basis of the randomness in the explanation process. Local methods relying on random generation of neighborhoods are particularly sensitive to this problem. In addition, our method suffers from the variability introduced by the encoding-decoding of texts in the latent space. Therefore, we measure stability here as a relative notion, which we call coherence. For a given text x in the explanation set, we consider its closest text \(x^c\) and its k-th closest text \(x^f\), again in the explanation set. A form of Lipschitz condition
[27] would require that the distance between the explanations e(x) and \(e(x^f)\), normalized by the distance between x and \(x^f\), should not be much different from the distance between the explanations e(x) and \(e(x^c)\), again normalized by the distance between x and \(x^c\). Stated in words, normalized distances between explanations should be as similar as possible. Formally, we introduce the following coherence index:
$$\begin{aligned} \mathcal {C}_x = \frac{ dist _e(e(x^{f}), e(x)) / dist (x^{f}, x)}{ dist _e(e(x^{c}), e(x)) / dist (x^{c}, x)} \end{aligned}$$
where we adopt as distance function \( dist \) the cosine distance between the TF-IDF representations of the texts, and as distance function \( dist _e\) the Jaccard distance between the 10 most frequent words in each explanation (namely, the W set). In the experiments, we set \(x^f\) to be the \(k\,{=}\,10\)-th closest text w.r.t. x. For comparison, the coherence index is computed also for lime, with the Jaccard distance calculated between the sets of the 10 words (a.k.a. features) that lime deems most relevant.
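The following sketch makes the computation of \(\mathcal {C}_x\) concrete; it assumes a fitted TF-IDF vectorizer and the 10-word sets of the explanations are available, and all names are illustrative rather than taken from the actual implementation.

```python
from sklearn.metrics.pairwise import cosine_distances

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def coherence_index(x, x_c, x_f, W_x, W_xc, W_xf, vectorizer):
    """C_x for text x, its closest text x_c and its k-th closest text x_f;
    W_* are the sets of the 10 most frequent words of each explanation."""
    V = vectorizer.transform([x, x_c, x_f])           # TF-IDF representations
    dist_xc = cosine_distances(V[0], V[1])[0, 0]      # dist(x^c, x)
    dist_xf = cosine_distances(V[0], V[2])[0, 0]      # dist(x^f, x)
    return (jaccard_distance(W_xf, W_x) / dist_xf) / (jaccard_distance(W_xc, W_x) / dist_xc)
```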
Table 5 reports the average coherence over the explanation set. xspells and lime have comparable levels of coherence, with an even split of cases where one overcomes the other. A Welch's t-test shows that the difference in coherence indexes between xspells and lime is statistically significant (p-value \(< 0.01\)) in only one case, namely for the polarity dataset and the RF black box model.