Benchmarking and survey of explanation methods for black box models

The rise of sophisticated black-box machine learning models in Artificial Intelligence systems has prompted the need for explanation methods that reveal how these models work in a way that is understandable to users and decision makers. Unsurprisingly, the state of the art currently exhibits a plethora of explainers providing many different types of explanations. With the aim of providing a compass for researchers and practitioners, this paper proposes a categorization of explanation methods from the perspective of the type of explanation they return, also considering the different input data formats. The paper accounts for the most representative explainers to date, also discussing similarities and discrepancies of the returned explanations through their visual appearance. A companion website to the paper is provided as a continuously updated resource for new explainers as they appear. Moreover, a subset of the most robust and widely adopted explainers is benchmarked with respect to a repertoire of quantitative metrics.


Introduction
Today AI is one of the most important scientific and technological areas, with a tremendous socio-economic impact and a pervasive adoption in many fields of modern society. The impressive performance of AI systems in prediction, recommendation, and decision-making support is generally reached by adopting complex Machine Learning (ML) models that "hide" the logic of their internal processes. As a consequence, such models are often referred to as "black-box models" [59,47,95]. Examples of black-box models used within current AI systems include deep learning models and ensembles such as bagging and boosting models. The high performance of such models in terms of accuracy has fostered the adoption of non-interpretable ML models, even though the opaqueness of black-box models may hide potential issues inherited from training on biased or unfair data [77]. Thus there is a substantial risk that relying on opaque models may lead to adopting decisions that we do not fully understand or, even worse, that violate ethical principles. Companies are increasingly embedding ML models in their AI products and applications, incurring a potential loss of safety and trust [32]. These risks are particularly relevant in high-stakes decision-making scenarios, such as medicine, finance, and automation. In 2018, the European Parliament introduced in the GDPR 4 a set of clauses for automated decision making in terms of a right of explanation for all individuals, i.e., the right to obtain "meaningful explanations of the logic involved" when automated decision making takes place. Also, in 2019, the High-Level Expert Group on AI presented the ethics guidelines for trustworthy AI 5. Despite divergent opinions among legal scholars regarding these clauses [53,121,35], everybody agrees that the implementation of such a principle is urgent and that it is a huge open scientific challenge.
As a reaction to these practical and theoretical ethical issues, in recent years we have witnessed the rise of a plethora of explanation methods for black-box models [59,3,13], both from academia and from industry. Thus, eXplainable Artificial Intelligence (XAI) [87] emerged as the field investigating methods to produce or complement AI systems so as to make the internal logic and the outcome of a model accessible and interpretable, i.e., human-understandable.
This work aims to provide a fresh account of the ideas and tools supported by current explanation methods, or explainers, from the perspective of the different explanations they offer 6. We categorize explainers w.r.t. the nature of the explanation they return, providing a comprehensive ontology of the explanations produced by available explainers and taking into account the three most popular data formats: tabular data, images, and text. We also report extensive examples of the various explanations together with qualitative comparisons, and we include a quantitative numerical comparison of a subset of the explanation methods aimed at testing their faithfulness, stability, robustness, and running time.
The rest of the paper is organized as follows. Section 2 summarizes existing surveys on explainability in AI and interpretability in ML and highlights the differences between this work and previous ones. Then, Section 3 presents the proposed categorization, based on the type of explanation returned by the explainer and on the data format under analysis. Sections 4, 5, and 6 present the details of the most recent and widely adopted explanation methods, together with a qualitative and quantitative comparison. Finally, Section 8 summarizes the crucial aspects that emerged from the analysis of the state of the art and future research directions.

Related Works
The widespread need for XAI in the last years has caused an explosion of interest in the design of explanation methods [52]. For instance, the books [90,105] present in detail the most well-known methodologies to make general machine learning models interpretable [90] and to explain the outcomes of deep neural networks [105].
In [59], the classification is based on four categories of problems, and the explanation methods are classified according to the problem they are able to solve. The first distinction is between explanation by design (also named intrinsic interpretability) and black-box explanation (also named post-hoc interpretability [3,92,26]). The second distinction in [59] further classifies the black-box explanation problem into model explanation, outcome explanation, and black-box inspection. Model explanation, achieved by global explainers [36], aims at explaining the whole logic of a model. Outcome explanation, achieved by local explainers [102,84], aims at understanding the reasons for a specific outcome. Finally, the aim of black-box inspection is to retrieve a visual representation for understanding how the black-box works. Another crucial distinction, highlighted in [86,59,3,44,26], is between model-specific and model-agnostic explanation methods. This classification depends on whether the technique adopted can explain only a specific black-box model or can be adopted on any black-box.
In [50], the focus is on proposing a unified taxonomy to classify the existing literature. The following key terms are defined: explanation, interpretability, and explainability. An explanation answers a "why question", justifying an event. Interpretability consists of describing the internals of a system in a way that is understandable to humans. A system is called interpretable if it produces descriptions that are simple enough for a person to understand, using a vocabulary that is meaningful to the user. An alternative, but similar, classification of definitions is presented in [13], with a specific taxonomy for explainers of deep learning models. The leading concept of the classification is Responsible Artificial Intelligence, i.e., a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability, and accountability at its core. Similarly to [59], in [13] the term interpretability (or transparency) is used to refer to a passive characteristic of a model that makes sense to a human observer. On the other hand, explainability is an active characteristic of a model, denoting any action taken with the intent of clarifying or detailing its internal functions. Further taxonomies and definitions are presented in [92,26]. Another branch of the literature focuses on the quantitative and qualitative evaluation of explanation methods [105,26]. Finally, we highlight that the literature reviews related to explainability focus not just on ML and AI but also on social studies [87,24], recommendation systems [131], model agents [10], and domain-specific applications such as health and medicine [117].
In this survey we revisit the taxonomy proposed in [59], but from a data-type perspective. In light of the works mentioned above, we believe that an updated systematic categorization of explanation methods based on the type of explanation returned, together with a comparison of the explanations, is still missing in the literature.
Table 1: Examples of explanations, divided by data type and explanation type.

TABULAR
- Rule-Based (RB): a set of premises that the record must satisfy in order to meet the rule's consequence. Example: r = Education ≤ College → "≤ 50k".
- Feature Importance (FI): a vector containing a value for each feature; each value indicates the importance of the feature for the classification.

IMAGE
- Saliency Maps (SM): a map which highlights the contribution of each pixel to the prediction.
- Concept Attribution (CA): an attribution computed w.r.t. a target "concept" given by the user. For example, how sensitive is the output (a prediction of zebra) to a concept (the presence of stripes)?

TEXT
- Sentence Highlighting (SH): a map which highlights the contribution of each word to the prediction.
- Attention Based (AB): a matrix of scores which reveals how the words in the sentence are related to each other.

ALL DATA TYPES
- Counterfactuals (CF): the user is provided with a series of examples similar to the input query but with a different class prediction. Tabular: q = Education ≤ College → "≤ 50k", c = Education ≥ Master → "≥ 50k". Image: q = (image) → "3", c = (image) → "8". Text: q = "The movie is not that bad" → "positive", c = "The movie is that bad" → "negative".

Furthermore, we systematically present a qualitative comparison of the explanations that also helps understand how to read the explanations returned by the different methods 7.

Categorization of Type of Explanations
In this survey, we present explanations and explanation methods acting on the three principal data types recognized in the literature: tabular data, images, and text [59]. In particular, for each of these data types, we distinguish different types of explanations, illustrated in Table 1. A table appearing at the beginning of each subsequent section summarizes the explanation methods by grouping them according to the classification illustrated in Table 1. Besides, in every section we present the meaning of each type of explanation. The acronyms reported in capital letters in Table 1, in this section, and in the following are used in the remainder of the work to quickly categorize the various explanations and explanation methods. We highlight that the nature of this work is tied to testing the available libraries and toolkits for XAI. Therefore, the presentation of the existing methods focuses on the most recent works (specifically from 2018 to the date of writing) and on those papers providing a usable implementation that is nowadays widely adopted.
Fig. 1: Existing taxonomy for the classification of explanation methods.

Existing XAI Taxonomy for Explanation Methods
In this section, we briefly recall the existing taxonomies and classifications of XAI methods present in the literature [59,3,50,13,105,26], to allow the reader to complete the proposed explanation-based categorization of explanation methods. We summarize the fundamental distinctions adopted to annotate the methods in Figure 1.
The first distinction separates explainable-by-design methods from black-box explanation methods:
- Explainable-by-design methods are INtrinsically (IN) explainable methods that return a decision whose reasons are directly accessible, because the model is transparent.
- Black-box explanation methods are Post-Hoc (PH) explanation methods that provide explanations for a non-interpretable model that takes the decisions.
The second differentiation distinguishes post-hoc explanation methods into global and local:
- Global (G) explanation methods aim at explaining the overall logic of a black-box model. Therefore, the explanation returned is a global, complete explanation valid for any instance.
- Local (L) explainers aim at explaining the reasons for the decision of a black-box model on a specific instance.
The third distinction categorizes the methods into model-agnostic and model-specific:
- Model-Agnostic (A) explanation methods can be used to interpret any type of black-box model.
- Model-Specific (S) explanation methods can be used to interpret only a specific type of black-box model.
To provide the reader with a self-contained review of XAI methods, we complete this section by rephrasing succinctly and unambiguously the definitions of explanation, interpretability, transparency, and complexity:
- Explanation [13,59] is an interface between humans and an AI decision-maker that is both comprehensible to humans and an accurate proxy of the AI. Consequently, explainability is the ability to provide a valid explanation.
- Interpretability [59], or comprehensibility [51], is the ability to explain or to provide the meaning in terms understandable to a human. Interpretability and comprehensibility are normally tied to the evaluation of the model complexity.
- Transparency [13], or equivalently understandability or intelligibility, is the capacity of a model to be interpretable by itself. Thus, the model allows a human to understand its functioning without explanations of its internal structure or of the algorithmic means by which it processes data internally.
- Complexity [42] is the degree of effort required by a user to comprehend an explanation. Complexity can take into account the user's background or any time limitations on the understanding.

Evaluation Measures
The validity and the utility of explanation methods should be evaluated in terms of goodness, usefulness, and satisfaction of explanations. In the following, we describe a selection of established methodologies for the evaluation of explanation methods, both from the qualitative and the quantitative point of view. Moreover, depending on the kind of explainers under analysis, additional evaluation criteria may be used. Qualitative evaluation is important to understand the actual usability of explanations from the point of view of the end-user: whether they satisfy human curiosity, provide meaning, and foster safety, social acceptance, and trust. In [42], a systematization of evaluation criteria into three major categories is proposed:
1. Functionally-grounded metrics aim to evaluate interpretability by exploiting formal definitions that are used as proxies. They do not require humans for validation. The challenge is to define the proxy to employ, depending on the context. As an example, we can validate the interpretability of a model by showing improvements w.r.t. another model already proven to be interpretable by human-based experiments.
2. Application-grounded evaluation methods require human experts able to validate the specific task and explanation under analysis [124,114]. They are usually employed in specific settings. For example, if the model is an assistant in the decision-making process of doctors, the validation is done by the doctors.
3. Human-grounded metrics evaluate the explanations through humans who are not experts. The goal is to measure the overall understandability of the explanation in simplified tasks [78,73]. This validation is most appropriate for testing general notions of the quality of an explanation.
Moreover, in [42,43] several other aspects are considered: the form of the explanation; the number of elements the explanation contains; the compositionality of the explanation, such as the ordering of FI values; the monotonicity between the different parts of the explanation; and uncertainty and stochasticity, which take into account how the explanation was generated, such as the presence of random generation or sampling.
In quantitative evaluation, the focus is on the performance of the explainer and on how close the explanation method f is to the black-box model b. Concerning quantitative evaluation, we can consider two different types of criteria. In the first criterion, we group the metrics that are often used in the literature [102,103,58,115]. One of the most used metrics in this setting is fidelity, which aims to evaluate how well f mimics the black-box decisions. There are different specializations of fidelity, depending on the type of explainer under analysis [58]. For example, in methods that create a surrogate model g to mimic b, fidelity compares the predictions of b and g on the instances Z used to train g. Another measure of closeness to b is stability, which aims at validating how consistent the explanations are for similar records. The higher the value, the better the model is at presenting similar explanations for similar inputs. Stability can be evaluated by exploiting the Lipschitz constant [8] as L_x = max_{x' ∈ N_x} ||e_x − e_{x'}|| / ||x − x'||, where x is the explained instance, e_x its explanation, and N_x is a neighborhood of instances x' similar to x.
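As a minimal sketch of the Lipschitz-based stability estimate (the function name, the uniform sampling scheme, and the neighborhood radius are our assumptions, not prescribed by [8]; `explain_fn` is any function mapping an instance to its explanation vector):

```python
import numpy as np

def lipschitz_stability(explain_fn, x, n_neighbors=50, radius=0.1, seed=0):
    """Estimate L_x = max ||e_x - e_x'|| / ||x - x'|| over a sampled
    neighborhood N_x of the explained instance x.
    Lower values indicate more stable explanations."""
    rng = np.random.default_rng(seed)
    e_x = explain_fn(x)
    ratios = []
    for _ in range(n_neighbors):
        # draw a neighbor x' of x uniformly within the given radius
        x_prime = x + rng.uniform(-radius, radius, size=x.shape)
        denom = np.linalg.norm(x - x_prime)
        if denom > 0:
            ratios.append(np.linalg.norm(e_x - explain_fn(x_prime)) / denom)
    return max(ratios)
```

For a linear explainer such as e_x = 2x, the ratio is exactly 2 for every neighbor, so the estimate recovers the true Lipschitz constant.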
Besides the synthetic ground truth experimentation proposed in [55], a strategy to validate the correctness of the explanation e = f(b, x) is to remove the features that the explanation method f found important and observe how the performance of b degrades. These metrics are called deletion and insertion [97]. The intuition behind deletion is that removing the "cause" will force the black-box to change its decision. Among the deletion methods there is faithfulness [8], which is tailored for FI explainers. It aims to validate whether the relevance scores indicate true importance: we expect higher importance values for attributes that greatly influence the final prediction 8. Given a black-box b and the feature importance e extracted from an importance-based explainer f, the faithfulness method incrementally removes each of the attributes deemed important by f. At each removal, the effect on the performance of b is evaluated. These values are then employed to compute the overall correlation between feature importance and model performance. This metric corresponds to a value between −1 and 1: the higher the value, the better the faithfulness of the explanation. In general, a sharp drop and a low area under the probability curve indicate a good explanation. The insertion metric takes the complementary approach. Monotonicity is an implementation of an insertion method: it evaluates the effect on b of incrementally adding each attribute in order of increasing importance. In this case, we expect the black-box performance to increase as more and more features are added, thereby resulting in monotonically increasing model performance. Finally, other standard metrics, such as accuracy, precision, and recall, are often evaluated to test the performance of the explanation methods. The running time is also an important evaluation criterion. 8 An implementation of faithfulness is available in aix360, presented in Section 7.
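The deletion-style faithfulness computation can be sketched as follows (a simplified version under our own assumptions: each feature is "removed" by replacing it with a baseline value, and the drop in the model's output score stands in for the drop in performance; the aix360 implementation differs in details):

```python
import numpy as np

def faithfulness_correlation(predict, x, importances, baseline=0.0):
    """Deletion-style faithfulness check for FI explanations: remove each
    attribute (replace it with a baseline value) and correlate the claimed
    importance with the observed drop in the model's output.
    Returns a value in [-1, 1]; higher means more faithful."""
    base_pred = predict(x)
    drops = []
    for i in range(len(x)):
        x_del = x.copy()
        x_del[i] = baseline          # "remove" feature i
        drops.append(base_pred - predict(x_del))
    return np.corrcoef(importances, drops)[0, 1]
```

For a linear model whose true coefficients are reported as importances, the drops match the importances exactly and the correlation is 1.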

Explanations for Tabular Data
In this section we present a selection of approaches for explaining decision systems acting on tabular data. In particular, we present the following types of explanations: Feature Importance (FI, Section 4.1), Rules (RB, Section 4.2), and Prototypes (PR) and Counterfactuals (CF) (Section 4.3). Table 2 summarizes and categorizes the explainers. After the presentation of the explanation methods, we report experiments obtained from their application on two datasets 9: adult and german. We trained the following ML models: Logistic Regression (LG), XGBoost (XGB), and CatBoost (CAT).

Feature Importance
Feature importance is one of the most popular types of explanation returned by local explanation methods. The explainer assigns to each feature an importance value which represents how much that particular feature was important for the prediction under analysis. Formally, given a record x, an explainer f(·) models a feature importance explanation as a vector e = {e_1, e_2, ..., e_m}, in which each value e_i ∈ e is the importance of the i-th feature for the decision made by the black-box model b(x). To understand the contribution of each feature, both the sign and the magnitude of each value e_i are considered. W.r.t. the sign, if e_i < 0, the feature contributes negatively to the outcome y; otherwise, if e_i > 0, the feature contributes positively. The magnitude, instead, represents how large the contribution of the feature is to the final prediction y. In particular, the greater the value of |e_i|, the greater its contribution; e_i = 0 means that the i-th feature shows no contribution to the output decision. An example of a feature-based explanation is e = {age = 0.8, income = 0.0, education = −0.2}, y = deny. In this case, age is the most important feature for the decision deny, income does not affect the outcome, and education has a small negative contribution. LIME, Local Interpretable Model-agnostic Explanations [102], is a model-agnostic explanation approach which returns explanations as feature importance vectors. The main idea of lime is that the explanation may be derived locally from records generated randomly in the neighborhood of the instance to be explained. The key factor is that it samples instances both in the vicinity of x (which have a high weight) and far away from x (low weight), exploiting π_x, a proximity measure able to capture the locality. We denote by b the black-box and by x the instance we want to explain. To learn the local behavior of b, lime draws samples weighted by π_x. It samples these instances around x by drawing nonzero elements of x uniformly at random. This gives lime a perturbed sample of instances {z ∈ R^d} to feed to the model b to obtain b(z). These samples are then used to train the explanation model g(·): a sparse linear model on the perturbed samples. The local feature importance explanation consists of the weights of the linear model. A number of papers focus on overcoming the limitations of lime, providing several variants of it. dlime [130] is a deterministic version in which the neighbors are selected from the training data by agglomerative hierarchical clustering. ilime [45] randomly generates the synthetic neighborhood using weighted instances. alime [108] runs the random data generation only once at "training time". kl-lime [96] adopts a Kullback-Leibler divergence to explain Bayesian predictive models. qlime [23] also considers nonlinear relationships using a quadratic approximation.
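The procedure above can be illustrated with a minimal numpy sketch (a didactic toy, not the official `lime` package: Gaussian perturbations and a dense weighted least-squares surrogate stand in for lime's interpretable representation and sparse linear model):

```python
import numpy as np

def lime_like(predict, x, n_samples=500, kernel_width=1.0, seed=0):
    """Sketch of a LIME-style local surrogate: perturb x, weight samples by
    a proximity kernel pi_x, fit a weighted linear model g, and return its
    coefficients as the local feature importance vector."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(0.0, 1.0, size=(n_samples, x.size))  # neighborhood of x
    y = np.apply_along_axis(predict, 1, Z)                  # black-box outputs b(z)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)            # pi_x: closer => higher weight
    sw = np.sqrt(w)                                         # weighted least squares
    A = np.hstack([np.ones((n_samples, 1)), Z])             # intercept + features
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]                                         # feature importances e
```

For a sparse surrogate, lime additionally fits a regularized linear model on an interpretable binary representation; the sketch keeps only the locality-weighting idea.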
Figure 2 reports examples of lime explanations relative to our experimentation on adult (top) and german (bottom) 10. We fed the same record into two black-boxes and then explained it. Interestingly, for adult, lime considers a similar set of features as important (even if with different importance values) for the two models: out of 6 features, only one differs. A different scenario is obtained by applying lime on german: different features are considered important by the two models. Moreover, the confidence of the prediction is quite different between the two models: both predict the output label correctly, but CAT has a higher confidence value, suggesting that this could be the cause of the differences between the two explanations.
SHAP, SHapley Additive exPlanations [84], is a local, model-agnostic explanation method which comes in several variants. All of them compute shap values: a unified measure of feature importance based on the Shapley values 11, a concept from cooperative game theory. In particular, the different explanation models proposed by shap differ in how they approximate the computation of the shap values. All the explanation models provided by shap are called additive feature attribution methods and respect the following definition: g(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i, where z' ∈ {0, 1}^M is a simplified version of the input x (z' ≈ x), φ_i ∈ R is the effect assigned to the i-th feature, and M is the number of simplified input features. shap has three properties: (i) local accuracy, meaning that g(x) matches b(x); (ii) missingness, which allows features with x_i = 0 to have no attributed impact on the shap values; (iii) consistency, meaning that if a model changes so that the marginal contribution of a feature value increases (or stays the same), the shap value also increases (or stays the same). The construction of the shap values allows employing them both locally, in which each observation gets its own set of shap values, and globally, by exploiting collective shap values. There are five strategies to compute shap values: KernelExplainer, LinearExplainer, TreeExplainer, GradientExplainer, and DeepExplainer. In particular, the KernelExplainer is an agnostic method, while the others are specifically designed for different kinds of ML models.
In our experiments with shap we applied: (i) the LinearExplainer to the LG models, (ii) the TreeExplainer to the XGB models, and (iii) the KernelExplainer to the CAT models. In Figure 2 we report the application of shap on adult through a force plot. The plot shows how each feature contributes to pushing the output model value away from the base value, which is an average of the output model values over the training dataset. The red features push the output value higher, while the blue ones push it lower. For each feature, the actual value for the record under analysis is reported. Only the features with the highest shap values are shown in this plot. In the first force plot, the features pushing the value higher contribute more to the output value, as can be noted by looking at the base value (0.18) and the actual output value (0.79). In the force plot on the right, the output value is 0.0, and it is interesting to see that only Age, Relationship, and Hours Per Week contribute to pushing it lower. Figure 3 (left and center) depicts the decision plots: in this case, we can see the contribution of all the input features in decreasing order of importance. In particular, the line represents the feature importance for the record under analysis and starts at its corresponding observation's predicted value. In the first plot, predicted as class > 50k, the feature Occupation is the most important, followed by Age and Relationship. In the second plot, instead, Age, Relationship, and Hours Per Week are the most important features. Besides the local explanations, shap also offers a global interpretation of the model driven by the local interpretations. Figure 3 (right) reports a global decision plot that represents the feature importance of 30 records of adult. Each line represents a record, and the predicted value determines the color of the line.
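The Shapley values that shap approximates can be computed exactly for a small number of features by enumerating all coalitions (a didactic sketch, not shap's API; `value_fn` is a hypothetical set function, e.g., the expected model output when only the features in the coalition are known):

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: phi_i is the sum over coalitions S not containing
    i of |S|! (n - |S| - 1)! / n! * (value_fn(S u {i}) - value_fn(S)).
    Exponential in n_features, which is why shap approximates the computation."""
    n = n_features
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):                    # coalition sizes 0 .. n-1
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value_fn(frozenset(S) | {i})
                                    - value_fn(frozenset(S)))
    return phi
```

For a purely additive game, each player's Shapley value equals its individual payoff, which gives a quick sanity check of the enumeration.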
DALEX [19] is a post-hoc, local and global, model-agnostic explanation method. Regarding local explanations, dalex contains an implementation of a variable attribution approach [104]. It consists of a decomposition of the model's predictions, in which each decomposition can be seen as a local gradient and used to identify the contribution of each attribute. Moreover, dalex contains the ceteris-paribus profiles, which allow for a what-if analysis by examining the influence of a variable while fixing the others. Regarding global explanations, dalex contains different exploratory tools: model performance measures, variable importance measures, residual diagnostics, and partial dependence plots. Figure 4 reports some local explanations obtained by applying dalex to an XGB model on adult. On the left, two explanation plots are reported for a record classified as class > 50k. On the top, there is a visualization based on Shapley values, which highlights as most important the feature Age (35 years old), followed by Occupation. At the bottom, there is a Breakdown plot, in which the green bars represent positive changes in the mean predictions, while the red ones are negative changes. The plot also shows the intercept, which is the overall mean value of the predictions. It is interesting to see that Age and Occupation are the most important features that positively contributed to the prediction in both plots. In contrast, Sex is positively important for the Shapley values but negatively important in the Breakdown plot. In the right part of Figure 4 we report a record classified as < 50k. In this case, there are important differences in the features considered most important by the two methods: for the Shapley values, Age and Relationship are the two most important features, while in the Breakdown plot Hours Per Week is the most important one.
CIU, Contextual Importance and Utility [9], is a local, model-agnostic explanation method. ciu is based on the idea that the context, i.e., the set of input values being tested, is a key factor in generating faithful explanations. The authors suggest that a feature that is important in one context may be irrelevant in another. ciu explains the model's outcome based on the contextual importance (CI), which approximates the overall importance of a feature in the current context, and on the contextual utility (CU), which estimates how good the current feature values are for a given output class. Technically, ciu computes the values of CI and CU by exploiting Monte Carlo simulations. We highlight that this method does not require creating a simpler model to derive the explanations.
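A Monte Carlo estimate of CI and CU can be sketched as follows (a toy sketch with hypothetical names, not the ciu package API; we assume a known range for the feature and for the model output, and the usual definitions CI = output range caused by the feature over the total output range, CU = position of the current output within that range):

```python
import random

def ci_cu(predict, x, feat, feat_range, out_range=(0.0, 1.0), n=2000, seed=0):
    """Estimate Contextual Importance (CI) and Contextual Utility (CU) for one
    feature by varying it over its range while the context (the other feature
    values) stays fixed."""
    rng = random.Random(seed)
    lo, hi = feat_range
    outs = []
    for _ in range(n):
        z = list(x)
        z[feat] = rng.uniform(lo, hi)   # Monte Carlo sample in the feature's range
        outs.append(predict(z))
    cmin, cmax = min(outs), max(outs)
    absmin, absmax = out_range
    y = predict(list(x))
    ci = (cmax - cmin) / (absmax - absmin)          # importance in this context
    cu = (y - cmin) / (cmax - cmin) if cmax > cmin else 0.5  # utility of current value
    return ci, cu
```

For a model that simply outputs the feature itself, CI approaches 1 (the feature fully determines the output) and CU reflects where the current value sits in the output range.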
NAM, Neural Additive Models [6], is a different extension of generalized additive models (gam). This method aims to combine the performance of powerful models, such as deep neural networks, with the inherent intelligibility of generalized additive models. The result is a model able to learn graphs that describe how the prediction is computed. nam trains multiple deep neural networks in an additive fashion, such that each neural network attends to a single input feature.

Fig. 6: Explanations of anchor and lore for adult to explain an XGB model.

Rule-based Explanation
Decision rules give the end-user an explanation about the reasons that lead to the final prediction. The majority of explanation methods for tabular data fall in this category, since decision rules are human-readable. A decision rule r, also called a factual or logic rule [58], has the form p → y, in which p is a premise, composed of Boolean conditions on feature values, while y is the consequence of the rule. In particular, p is a conjunction of split conditions of the form x_i ∈ [v_i^(l), v_i^(u)], where x_i is a feature and v_i^(l), v_i^(u) are lower and upper bound values in the domain of x_i, extended with ±∞. An instance x satisfies r, or r covers x, if every Boolean condition of p evaluates to true for x. If the instance x to explain satisfies p, the rule p → y represents a candidate explanation of the decision g(x) = y. Moreover, if the interpretable predictor mimics the behavior of the black-box in the neighborhood of x, we can further conclude that the rule is a candidate local explanation of b(x) = g(x) = y. We highlight that, in the context of rules, we can also find the so-called counterfactual rules [58]. Counterfactual rules have the same structure as decision rules, with the only difference that the consequence y of the rule is different from b(x). They are important to explain to the end-user what should be changed to obtain a different output. An example of a rule explanation is r = {age < 40, income < 30k, education ≤ Bachelor}, y = deny. In this case, the record {age = 18, income = 15k, education = Highschool} satisfies the rule above. A possible counterfactual rule, instead, can be: r = {income > 40k, education ≥ Bachelor}, y = allow.
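The coverage check described above (does r cover x?) amounts to evaluating a conjunction of interval conditions. A minimal sketch, where the ordinal encoding of the categorical features is our own assumption:

```python
import math

def covers(premise, record):
    """A rule premise is a conjunction of split conditions x_i in [lo, hi];
    r covers x iff every condition evaluates to true (use +/-inf for open ends)."""
    return all(lo <= record[f] <= hi for f, (lo, hi) in premise.items())

# r = {age < 40, income < 30k, education <= Bachelor} -> deny,
# with education encoded ordinally (Highschool=2, Bachelor=3 -- assumed encoding)
# and age < 40 written as <= 39 for integer ages.
premise = {"age": (-math.inf, 39),
           "income": (-math.inf, 29_999),
           "education": (-math.inf, 3)}
record = {"age": 18, "income": 15_000, "education": 2}  # the example from the text
```

With these definitions, `covers(premise, record)` holds for the example record, while any record violating a single condition (e.g., age = 45) is not covered.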
ANCHOR [103] is a model-agnostic system that outputs rules as explanations. The approach's name comes from the output rules, called anchors. The idea is that, for decisions on which the anchor holds, changes in the rest of the instance's feature values do not change the outcome. Formally, given a record x, r is an anchor if r(x) = b(x). To obtain the anchors, anchor perturbs the instance x, obtaining a set of synthetic records employed to extract anchors with precision above a user-defined threshold. First, since the synthetic generation of the dataset may lead to a massive number of samples, anchor exploits a multi-armed bandit algorithm [72]. Second, since the number of all possible anchors is exponential, anchor uses a bottom-up approach and a beam search. Figure 6 reports some rules obtained by applying anchor to an XGB model on adult. The first rule has a high precision (0.96) but a very low coverage (0.01). It is interesting to note that the first rule contains Relationship and Education Num, which are the features highlighted by most of the explanation models proposed so far. In particular, in this case, for a classification > 50k, the Relationship should be husband and the Education Num at least bachelor degree. Education Num can also be found in the second rule, in which case it has to be less than or equal to College, followed by the Marital Status, which can be anything other than married with a civilian. This rule has an even better precision (0.97) and suitable coverage (0.37). LORE, LOcal Rule-based Explainer [58], is a local, model-agnostic method that provides faithful explanations in the form of rules and counterfactual rules. lore is explicitly tailored for tabular data. It exploits a genetic algorithm for creating the neighborhood of the record to explain. Such a neighborhood produces a more faithful and dense representation of the vicinity of x w.r.t. lime. Given a black-box b and an instance x, with b(x) = y, lore first generates a synthetic set Z of neighbors through
a genetic algorithm.Then, it trains a decision tree classifier g on this set labeled with the black-box outcome b(Z).From g, it retrieves an explanation that consists of two components: (i) a factual decision rule, that corresponds to the path on the decision tree followed by the instance x to reach the decision y, and (ii) a set of counterfactual rules, which have a different classification w.r.t.y.This counterfactual rules set shows the conditions that can be varied on x in order to change the output decision.In Figure 6 we report the factual and counterfactual rules of lore for the explanation of the same records showed for anchor.It is interesting to note that, differently from anchor and the others models proposed above, lore explanations focuses more on the Education Num, Occupation, Capital Gain and Capital Loss, while the features about the relationship are not present.
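The precision and coverage values reported for anchor's rules can be estimated directly from the perturbed samples. A minimal sketch of this estimation (the helper function and the toy black-box are illustrative, not anchor's actual implementation):

```python
import numpy as np

def anchor_stats(rule, Z, predict, x):
    """Estimate precision and coverage of a candidate anchor over
    perturbed samples Z. rule is {feature_index: required_value}."""
    y = predict(x[None, :])[0]
    # records on which the anchor's conditions hold
    holds = np.all([Z[:, f] == v for f, v in rule.items()], axis=0)
    coverage = holds.mean()
    if not holds.any():
        return 0.0, 0.0
    # fraction of covered records that keep the original prediction
    precision = (predict(Z[holds]) == y).mean()
    return precision, coverage

predict = lambda Z: (Z[:, 0] == 1).astype(int)   # toy black-box
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(1000, 3))           # perturbations of x
x = np.array([1, 0, 1])
prec, cov = anchor_stats({0: 1}, Z, predict, x)
```

Since the toy black-box depends only on feature 0, the rule "feature 0 = 1" has precision 1 and coverage around one half of the perturbations.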
RuleMatrix [88] is a post-hoc agnostic explainer tailored for the visualization of the extracted rules. First, given a training dataset and a black-box model, rulematrix executes a rule induction step, in which a rule list is extracted by sampling the input data and their labels as predicted by the black-box. Then, the extracted rules are filtered based on confidence and support thresholds. Finally, rulematrix outputs a visual representation of the rules. The user interface allows for several analyses based on plots and metrics, such as fidelity.
One of the most popular ways of generating rules is extracting them from a decision tree. Due to their simplicity and interpretability, decision trees are widely used to explain the overall behavior of black-box models. Many works in this setting are model-specific, as they exploit structural information of the black-box model under analysis.
TREPAN [36] is a model-specific global explainer tailored for neural networks. Given a neural network b, trepan generates a decision tree g that approximates the network by maximizing the gain ratio and the model fidelity.
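The real trepan queries the network for new samples and optimizes gain ratio jointly with fidelity, but the underlying global-surrogate idea can be sketched with off-the-shelf components. In this illustrative sketch a random forest stands in for the neural network, and fidelity is measured as the agreement between surrogate and black-box:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# opaque model standing in for the neural network
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# surrogate tree trained on the *black-box* labels, not the ground truth
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0)
surrogate.fit(X, black_box.predict(X))

# fidelity: how often the surrogate agrees with the black-box
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
```

The shallow tree is the returned explanation; fidelity quantifies how trustworthy that explanation is.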
DecText is a global model-specific explainer tailored for neural networks [22]. The aim of dectext is to find the most relevant features. To achieve this goal, dectext resembles trepan, with the difference that it considers four different splitting methods. Moreover, it also adopts a fidelity-based pruning strategy to reduce the size of the final explanation tree. In this way, dectext can maximize the fidelity while keeping the model simple.
MSFT [31] is a model-specific global post-hoc explanation method that outputs decision trees starting from random forests. It is based on the observation that, even if random forests contain hundreds of different trees, these trees are quite similar, differing only in a few nodes. Hence, the authors propose dissimilarity metrics to summarize the random forest's trees using a clustering method. Then, for each cluster, an archetype is retrieved as an explanation.
CMM, Combined Multiple Model procedure [41], is a model-specific global post-hoc explanation method for tree ensembles. The key point of cmm is data enrichment. Given an input dataset X, cmm first modifies it n times and learns a black-box on each of the n variants. Then, random records are generated and labeled using a bagging strategy on the black-boxes. In this way, the authors are able to increase the size of the dataset used to build the final decision tree.
STA, Single Tree Approximation [132], is a model-specific global post-hoc explanation method tailored for random forests, in which the decision tree used as an explanation is constructed by exploiting hypothesis testing to find the best splits.
SkopeRules is a post-hoc, agnostic model, both global and local12, based on the rulefit [48] idea of defining an ensemble method and then extracting the rules from it. skope-rules employs fast algorithms such as bagging or gradient boosted decision trees. After extracting all the possible rules, skope-rules removes rules that are redundant or too similar according to a similarity threshold. Differently from rulefit, the scoring method does not solve an L1 regularization problem; instead, the weights are assigned depending on the precision score of each rule. We can employ skope-rules in two ways: (i) as an explanation method for the input dataset, describing its characteristics by rules; (ii) as a transparent method, by outputting the rules employed for the prediction. In Figure 7, we report the rules extracted by skope-rules with the highest precision and recall for each class on adult. Similarly to the models analyzed so far, we find Relationship and Education among the features in the rules. In particular, in the first rule, for > 50k, Education has to be at least a bachelor degree, while for the other class it has to be at least fifth or sixth grade. Interestingly, Capital Gain and Capital Loss are also mentioned, which were considered important by few models, such as lore. We also tested skope-rules as a rule-based classifier, obtaining a precision of 0.68 on adult.
Moreover, with skope-rules, it is possible to explain the entire dataset by rules without considering the output labels, or to obtain a set of rules for each output class. We tested both options, but we report only the case of rules for each class: Figure 7 shows the rule with the highest precision and recall for each class on adult.
Scalable-BRL [127] is an interpretable probabilistic rule-based classifier that optimizes the posterior probability of a Bayesian hierarchical model over the rule lists. The theoretical part of this approach is based on [81]. The particularity of scalable-brl is its scalability, due to a specific bit-vector manipulation.
GLocalX [1] is a rule-based explanation method that exploits a novel approach: the local-to-global paradigm. The idea is to derive a global explanation by subsuming local logical rules. GLocalX starts from an array of factual rules and, in a hierarchical bottom-up fashion, merges rules covering similar records and expressing the same conditions. GLocalX finds the smallest possible set of rules that is (i) general, meaning that the rules should apply to a large subset of the dataset, and (ii) highly accurate. The final explanation proposed to the end-user is a set of rules. In [1] the authors validated the model in constrained settings: limited or no access to data or local explanations. A simpler version of GLocalX is presented in [107]: there, the final set of rules is selected through a scoring system based on rule generality, coverage, and accuracy.

Prototypes
A prototype, also called archetype or artifact, is an object representing a set of similar records. It can be (i) a record from the training dataset close to the input data x; (ii) a centroid of a cluster to which the input x belongs; or (iii) even a synthetic record, generated following some ad-hoc process. Depending on the explanation method considered, different definitions and requirements for finding a prototype are adopted. Prototypes serve as examples: the user understands the model's reasoning by looking at records similar to her own.
MMD-CRITIC [74] is a "before the model" methodology, in the sense that it only analyzes the distribution of the dataset under analysis. It produces prototypes and criticisms as explanations for a dataset using Maximum Mean Discrepancy (MMD). The former explain the dataset's general behavior, while the latter represent points that are not well explained by the prototypes. mmd-critic selects prototypes by measuring the difference between the distribution of a candidate set of instances and that of the whole dataset: the instances nearest to the data distribution are called prototypes, and the farthest ones criticisms. mmd-critic shows only minority data points that differ substantially from the prototypes but belong to the same category; criticisms are selected from parts of the dataset underrepresented by the prototypes, with an additional constraint to ensure the criticisms are diverse.
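The greedy prototype-selection step of mmd-critic can be sketched in a few lines. This is an illustrative simplification (criticism selection is omitted, and the RBF kernel, normalization, and toy data are assumptions of the sketch):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def greedy_prototypes(X, m, gamma=0.5):
    """Greedily pick m prototypes maximizing an MMD-based objective:
    reward points close to the data distribution, penalize redundancy."""
    K = rbf_kernel(X, X, gamma)
    col_mean = K.mean(axis=0)      # average similarity of each point to the data
    chosen = []
    for _ in range(m):
        best_j, best_obj = None, -np.inf
        for j in range(len(X)):
            if j in chosen:
                continue
            S = chosen + [j]
            # squared-MMD objective (up to an additive constant)
            obj = 2 * col_mean[S].mean() - K[np.ix_(S, S)].mean()
            if obj > best_obj:
                best_j, best_obj = j, obj
        chosen.append(best_j)
    return chosen

# two well-separated clusters: one prototype should land in each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])
protos = greedy_prototypes(X, m=2)
```

The redundancy penalty is what prevents both prototypes from being drawn from the same cluster.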
ProtoDash [61] is a variant of mmd-critic. It is an explainer that employs prototypical examples and criticisms to explain the input dataset. Differently from mmd-critic, protodash associates non-negative weights with the prototypes, indicating the importance of each of them. In this way, it can capture even complicated structures.
Privacy-Preserving Explanations [21] is a local post-hoc agnostic explainability method that outputs prototypes and shallow trees as explanations. It is the first approach that considers the concept of privacy in explainability by producing privacy-protected explanations. To achieve a good trade-off between privacy and comprehensibility of the explanation, the authors construct the explainer by employing microaggregation to preserve privacy. In this way, they obtain a set of clusters, each with a representative record c i for the i-th cluster. From each cluster, a shallow decision tree is extracted to provide an exhaustive explanation while retaining good comprehensibility due to the limited depth of the tree. When a new record x arrives, a representative record and its associated shallow tree are selected: the representative c i closest to x is chosen, depending on the decision of the black-box.
PS, Prototype Selection [20], is an interpretable model composed of two parts. First, ps seeks a set of prototypes that best represent the data under analysis, using a set cover optimization problem with constraints on the properties the prototypes should have. Each record in the original input dataset D is then assigned to a representative prototype. Second, the prototypes are employed to learn a nearest neighbor rule classifier.
TSP, Tree Space Prototype [116], is a local, post-hoc and model-specific approach tailored for explaining random forests and gradient boosted trees. The goal is to find prototypes in the tree space of the tree ensemble b. Given a notion of proximity between trees, with variants depending on the kind of ensemble, tsp is able to extract prototypes for each class. Different variants allow selecting a different number of prototypes for each class.

Counterfactuals
Counterfactuals describe a dependency on the external facts that led to a particular decision made by the black-box model. They focus on the differences that would yield the opposite prediction w.r.t. b(x) = y. Counterfactuals are often regarded as the prototypes' opposite. The general form a counterfactual explanation should have is formalized in [122]: b(x) = y was returned because the variables of x had values x 1 , x 2 , . . . , x n ; had x instead had values x' 1 , x' 2 , . . . , x' n , with all the other variables constant, b(x') = ¬y would have been returned, where x' is the record x with the suggested changes. An ideal counterfactual should alter the values of the variables as little as possible, finding the closest setting under which ¬y is returned instead of y. Counterfactual explainers can be divided into three categories: exogenous, which generate the counterfactuals synthetically; endogenous, which draw the counterfactuals from a reference population, and hence can produce more realistic instances w.r.t. the exogenous ones; and instance-based, which exploit a distance function to detect the decision boundary of the black-box. There are several desiderata in this context: efficiency, robustness, diversity, actionability, and plausibility, among others [122,71,69]. To better understand this complex context and the many available possibilities, we refer the interested reader to [15,120,25]. In [25] a study is presented that evaluates the understandability of factual and counterfactual explanations. The authors analyzed the mental model theory, which states that people construct models that simulate the assertions described. They conducted experiments on a group of people, highlighting that people prefer reasoning with mental models and find it challenging to consider probability, calculus, and logic. There are many works in this area of research; hence, we briefly present only the most representative methods in this category.
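An endogenous counterfactual search can be sketched in a few lines: pick, from a reference population, the record with a different black-box outcome that is closest to x. The function name and the toy black-box are illustrative, not taken from any specific explainer:

```python
import numpy as np

def endogenous_counterfactual(x, X_ref, predict):
    """Return the closest reference record with a different black-box
    outcome (an 'endogenous' counterfactual drawn from real data)."""
    y = predict(x[None, :])[0]
    candidates = X_ref[predict(X_ref) != y]
    # L1 distance favours counterfactuals that change x as little as possible
    dists = np.abs(candidates - x).sum(axis=1)
    return candidates[np.argmin(dists)]

predict = lambda X: (X[:, 0] > 0.5).astype(int)   # toy black-box
rng = np.random.default_rng(1)
X_ref = rng.uniform(0, 1, size=(500, 3))
x = np.array([0.2, 0.9, 0.1])                     # predicted as class 0
cf = endogenous_counterfactual(x, X_ref, predict)
```

Because the counterfactual comes from the reference population, it is realistic by construction, which exogenous (synthetic) generation cannot guarantee.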
MAPLE [99] is a post-hoc local agnostic explanation method that can also be used as a transparent model due to its internal structure. It combines random forests with feature selection methods to return feature-importance-based explanations. maple builds on two methods: SILO and DStump. SILO is employed to obtain a local training distribution based on the random forest's leaves. DStump, instead, ranks the features by importance. maple considers the best k features from DStump to solve a weighted linear regression problem. In this case, the explanation is the set of coefficients of the local linear model, i.e., the estimated local effect of each feature.
CEM, Contrastive Explanations Method [40], is a local, post-hoc and model-specific explanation method, tailored for neural networks, which outputs contrastive explanations. cem has two components: Pertinent Positives (PP), which can be seen as prototypes and are the minimal and sufficient factors that have to be present to obtain the output y; and Pertinent Negatives (PN), which are counterfactual factors that should be minimally and necessarily absent. cem is formulated as an optimization problem over a perturbation variable δ. In particular, given x to explain, cem considers x' = x + δ, where δ is a perturbation applied to x. During the process, two values of δ are minimized: δ p for the pertinent positives, and δ n for the pertinent negatives. cem solves the optimization problem with a variant that employs an autoencoder to evaluate the closeness of x' to the data manifold. ceml [14] is also worth mentioning: a Python toolbox for generating counterfactual explanations, suitable for ML models designed in Tensorflow, Keras, and PyTorch.
DICE, Diverse Counterfactual Explanations [91], is a local, post-hoc and agnostic method that solves an optimization problem with several constraints to ensure feasibility and diversity of the returned counterfactuals. Feasibility is critical in the counterfactual context since it avoids suggesting unattainable changes. As an example, consider a classifier that determines whether to grant loans. If the classifier denies the loan to an applicant, the cause may be low income. However, a counterfactual such as "You have to double your salary" may be unfeasible, and hence it is not a satisfactory explanation. Feasibility is achieved by imposing constraints on the optimization problem: the proximity constraint from [122], the sparsity constraint, and user-defined constraints. Besides feasibility, another essential factor is diversity, which provides different ways of changing the outcome class.
FACE, Feasible and Actionable Counterfactual Explanations [100], is a local, post-hoc agnostic explanation method that focuses on returning "achievable" counterfactuals. Indeed, face uncovers "feasible paths" for generating counterfactuals, i.e., shortest paths under density-weighted distance metrics. It can thus extract counterfactuals that are coherent with the input data distribution. face generates a graph over the data points; the user can select the prediction, the density, the weights, and a condition function. face updates the graph according to these constraints and applies a shortest path algorithm to find all the data points that satisfy the requirements.
CFX [7] is a local, post-hoc, and model-specific method that generates counterfactual explanations for Bayesian Network Classifiers. The explanations are built from relations of influence between variables, indicating the reasons for the classification. This method's main achievement is that it can find pivotal factors for the classification task: factors that, if removed, would give rise to a different classification.

Transparent methods
In this section we present some transparent methods tailored for tabular data. In particular, we first present models that output feature importance, then methods that output rules.
EBM, Explainable Boosting Machine [93], is an interpretable ML algorithm. Technically, ebm is a variant of a Generalized Additive Model (gam) [64], i.e., a generalized linear model that incorporates nonlinear forms of the predictors. ebm uses a boosting procedure to train the model: it cycles over the features in a round-robin fashion, training one feature function at a time, to mitigate the effects of co-linearity. In this way, the model learns the best set of feature functions, which can be exploited to understand how each feature contributes to the final prediction. ebm is implemented in the interpretml Python library13. We trained an ebm on adult. In Figure 5 we show a global explanation reporting the importance of each feature used by ebm. We observe that Marital Status is the most important feature, followed by Relationship and Age. In Figure 5 we also show an inspection of the feature Education Num, illustrating how the prediction score changes depending on the value of the feature, together with two examples of local explanations for ebm. For the first record, predicted as > 50k, the most important feature is Education Num, which is Master for this record. For the second record, predicted as < 50k, the most important feature is Relationship. This feature is important for both records: in the first (husband) it pushes the prediction higher, while in the second (own-child) it pushes it lower.
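The round-robin training of one feature function at a time can be sketched as follows. This is a hypothetical regression miniature with histogram-based shape functions, not interpretml's actual ebm (which adds bagging, interaction detection, and many refinements); all names are illustrative:

```python
import numpy as np

def fit_gam_boost(X, y, n_rounds=50, lr=0.1, n_bins=16):
    """Cyclically boost one per-feature 'shape function' at a time,
    each stored as a table of per-bin values."""
    n, d = X.shape
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(d)]
    bins = np.stack([np.digitize(X[:, j], edges[j]) for j in range(d)], axis=1)
    shapes = np.zeros((d, n_bins))
    pred = np.zeros(n)
    for _ in range(n_rounds):
        for j in range(d):                      # round-robin over features
            resid = y - pred
            # best constant update per bin = mean residual in that bin
            upd = np.array([resid[bins[:, j] == b].mean()
                            if (bins[:, j] == b).any() else 0.0
                            for b in range(n_bins)])
            shapes[j] += lr * upd
            pred += lr * upd[bins[:, j]]
    return shapes, pred

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (1000, 3))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1])   # additive target; feature 2 irrelevant
shapes, pred = fit_gam_boost(X, y)
mse = ((y - pred) ** 2).mean()
```

Each row of `shapes` is exactly the kind of per-feature curve that ebm visualizes: the irrelevant feature's shape stays near zero, while the informative ones recover the additive components.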
TED [65] is an intrinsically transparent approach that requires as input a training dataset in which each record is paired with its explanation. Explanations can be of any type, such as rules or feature importance. For the training phase, the framework allows using any ML model capable of dealing with multilabel classification. In this way, the model can classify the input record and return its associated explanation. A possible limitation of this approach is the need to create the explanations fed to the training phase. ted is implemented in aix360.
SLIPPER [34] is a transparent rule learner based on a modified version of Adaboost. It outputs compact and comprehensible rules by imposing constraints on the rule builder.
LRI [123] is a transparent rule learner that achieves good performance while providing interpretable rules as explanations. In lri, each class of the training set is represented by an unordered set of rules. The rules are obtained by an induction method that adaptively weights the cumulative error, without pruning. When a new record is considered, all the available rules are tested on it. The output class is the one whose rule set is best satisfied by the record under analysis.
MlRules [39] is a transparent rule induction algorithm that solves classification tasks through probability estimation. Rule induction is performed with boosting strategies, but maximum likelihood estimation is applied for rule generation.
RuleFit [48] is a transparent rule learner that exploits an ensemble of trees. As a first step, it creates an ensemble model using gradient boosting. The rules are then extracted from the ensemble: each path in each tree is a rule. After extraction, the rules are weighted according to an optimization problem based on L1 regularization.
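The rule-extraction step can be sketched with scikit-learn's tree internals. This is a simplified illustration of the idea (the L1-regularized weighting step is omitted, and the rule-formatting helper is ours, not rulefit's API):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def extract_rules(tree, feature_names):
    """Walk a fitted sklearn tree and return every root-to-leaf path
    as a conjunctive if-then rule."""
    t = tree.tree_
    rules = []
    def recurse(node, conds):
        if t.children_left[node] == -1:          # leaf node
            rules.append(" AND ".join(conds) or "TRUE")
            return
        f, thr = feature_names[t.feature[node]], t.threshold[node]
        recurse(t.children_left[node], conds + [f"{f} <= {thr:.2f}"])
        recurse(t.children_right[node], conds + [f"{f} > {thr:.2f}"])
    recurse(0, [])
    return rules

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
gbm = GradientBoostingClassifier(n_estimators=5, max_depth=2,
                                 random_state=0).fit(X, y)
all_rules = [r for est in gbm.estimators_.ravel()
             for r in extract_rules(est, ["x0", "x1", "x2"])]
```

On this toy task, the extracted rules naturally concentrate on the only informative feature, x0.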
IDS, Interpretable Decision Sets [78], is a transparent and highly accurate model based on decision sets. Decision sets are sets of independent, short, accurate, and non-overlapping if-then rules; hence, they can be applied independently.

Quantitative Comparison
We validated the explanation models by considering the two most important metrics in the context of tabular data: fidelity and stability. In particular, we evaluated lime, shap14, anchor and lore. The fidelity results are reported in Table 3. The fidelity values are relatively high for all the methods, highlighting that the local surrogate models are good at mimicking their black-box models. Regarding the feature-importance-based models, lime shows higher fidelity values w.r.t. shap, especially for adult. In particular, shap has lower values for the CAT models (on both german and adult), suggesting that it may not be good at explaining this kind of ensemble model. Concerning the rule-based models, the fidelity is high for both of them. However, we remark that anchor shows lower fidelity values for the CAT model on german, a behavior similar to that of shap. We also compared lime and shap on faithfulness and monotonicity. Overall, we did not find any model to be monotonic, and hence we do not report those results. The faithfulness results are reported in Table 3. For adult, the faithfulness is quite low, especially for lime.
The model with the highest faithfulness is CAT explained by shap. Regarding german, instead, the values are higher, highlighting a better faithfulness overall. However, also on this dataset shap has a better faithfulness w.r.t. lime. Table 4 reports the results of the stability analysis. For this metric, a high value means that the model presents high instability, i.e., we can obtain quite different explanations for similar inputs. None of the methods is remarkably stable according to this metric.
Runtime Analysis. Table 5 shows the explanation runtimes, approximated as orders of magnitude. Overall, feature importance explanation algorithms are faster than rule-based ones. In particular, shap is the most efficient, followed by lime. We remark that the computation time of lore depends on the number of neighbors generated by the genetic algorithm (in this case, we considered 1000 samples). anchor, as well as skope-rules, instead requires a minimum precision (we selected a minimum precision of 0.40).

Discussion
In the context of tabular data, many explanation methods have been proposed. The most explored area is that of feature-importance-based explainers, such as lime and shap. These methods provide an importance value for each feature of the input. This is suitable for domain experts who know the meaning of the features employed, but it may be too difficult for a common end-user to understand, especially when the way such importance values are obtained is complex. In contrast, rule-based explanations, prototypes, and counterfactuals are more suitable for the common end-user due to their logical structure and the similarity-by-example they exploit. This is particularly true for decision rules complemented by counterfactual ones, as in lore: the end-user can understand why she received that outcome, and she also gets a suggestion about what to change to achieve another classification. However, fewer methods have been proposed in this context w.r.t. feature importance explanations. In particular, the majority of rule- and prototype-based explainers are intrinsic, and the few post-hoc ones, on average, require more time to provide an explanation w.r.t. feature importance ones. Regarding the post-hoc prototype-based models, there are some interesting approaches, but no code is available for them, suggesting that they are still at an early stage of development. During the past few years, counterfactuals have witnessed a particularly great interest. Overall, even if rules, prototypes, and counterfactuals seem to be the best solution, there are still several open questions and challenges in this research area, such as improving the efficiency and the accuracy of these explanation algorithms, as well as considering the constraints of the domain in which the model is employed.

Explanations for Image Data
This section presents the state-of-the-art solutions proposing explanations for decision systems acting on image data. In particular, we distinguish the following types of explanations: Saliency Maps (SM, Section 5.1), Concept Attribution (CA, Section 5.2), Prototypes (PR, Section 5.3) and Counterfactuals (CF, Section 5.4). Table 6 summarizes and categorizes the explanation methods acting on image data. For the experiments, we considered three datasets15: mnist, cifar in its 10-class flavor, and imagenet. We chose these datasets because they are the most widely used and offer different types of classes with various image dimensions. On these three datasets, we trained the models most used in the literature to evaluate the explanation methods: for mnist and cifar we trained a CNN with two convolutional and two linear layers, while for imagenet we used the VGG16 network [111].

Saliency Maps
A Saliency Map (SM) is an image in which a pixel's brightness represents how salient the pixel is. Formally, a SM is modeled as a matrix S whose dimensions are the sizes of the image we want to explain, and whose values s ij are the saliency values of the pixels ij. The greater the value of s ij, the greater the saliency of that pixel. To visualize a SM, we can use, for example, a divergent color map ranging from red to blue: a positive value (red) means that the pixel ij has contributed positively to the classification, while a negative one (blue) means that it has contributed negatively. There are two approaches for creating SMs. The first one assigns a saliency value to every pixel. The second one segments the image into different pixel groups and then assigns a saliency value to each group.
LIME, already presented in Section 4, can also be used to retrieve SMs for classifiers working on images. For images, the perturbation is done by segmentation. More in detail, lime divides the input image into segments called superpixels. Then it creates the neighborhood by randomly substituting the superpixels with a uniform, possibly neutral, color. This neighborhood is then fed into the black-box, and a sparse linear model is learned on top. An example of such a superpixel explanation is shown in Figure 8. The superpixel segmentation is critical to obtain a good explanation. For small resolution images, the segmentation in lime does not work out of the box, resulting in the algorithm selecting the whole image as a single superpixel; to obtain a decent result, the user needs to tune the segmentation parameters. Recently, much research has improved and extended lime [109,96,130,23]16.
ε-LRP, Layer-wise Relevance Propagation [17], is a model-specific method which produces post-hoc local explanations for any type of data. ε-lrp explains the classifier's decisions by decomposition. The ε-lrp redistribution process was introduced for feed-forward neural networks [12]. Mathematically, it redistributes the prediction y
backwards using local redistribution rules until it assigns a relevance score R i to each pixel value. Let a i be the neuron activations at layer l, R j the relevance scores associated with the neurons at layer l + 1, and w ij the weight connecting neuron i to neuron j. The simple ε-lrp rule redistributing relevance from layer l + 1 to layer l is: R i = Σ j (a i w ij / (Σ i' a i' w i'j + ε)) R j , where the small stabilization term ε is added to prevent division by zero. Intuitively, this rule redistributes relevance proportionally from layer l + 1 to each neuron in layer l based on the connection weights. The final explanation is the relevance of the input layer. Figure 8 shows some examples of ε-lrp in the third row. As with all pixel-wise explanation methods, the algorithm works very well on mnist, while larger images are more difficult to address. A variant of ε-lrp is spray [80], which builds a spectral clustering on top of the local instance-based ε-lrp explanations. Similar work is done in [82]: it starts from the ε-lrp of the input instance and finds the LRP attribution relevance for a single input of interest x.
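A single backward step of the ε-lrp rule above can be checked numerically. This small sketch (shapes and values are illustrative) also verifies that relevance is, up to the stabilizer, conserved across layers:

```python
import numpy as np

def lrp_layer(a, w, R_next, eps=1e-9):
    """One backward step of the epsilon-LRP rule: redistribute the relevance
    R_next of layer l+1 to layer l through activations a and weights w.

    a: activations at layer l, shape (n_l,)
    w: weights, shape (n_l, n_{l+1})
    R_next: relevances at layer l+1, shape (n_{l+1},)"""
    z = a[:, None] * w                       # contributions a_i * w_ij
    denom = z.sum(axis=0)
    denom = denom + eps * np.sign(denom)     # stabilizer against division by zero
    return (z / denom * R_next[None, :]).sum(axis=1)

a = np.array([1.0, 2.0, 0.5])
w = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 2.0]])
R_next = np.array([4.0, 3.0])
R = lrp_layer(a, w, R_next)   # relevance of layer l
```

Applying this step layer by layer down to the input yields the pixel-wise saliency map.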
INTGRAD, Integrated Gradients [115], is a model-specific method that produces post-hoc local explanations for any type of data. intgrad utilizes the gradients of a black-box along with sensitivity techniques like ε-lrp. For this reason, it can be applied only to differentiable models. Formally, given b and x, let x' be the baseline input17. intgrad constructs a path from x' to x and computes the gradients of points along the path. For example, with images, the points are taken by overlapping x on x' and gradually modifying the opacity of x. Integrated gradients are obtained by cumulating the gradients of these points. Formally, the integrated gradient along the i-th dimension for an input x and baseline x' is defined as: IntGrad i (x) = (x i − x' i ) × ∫ 0 1 ∂b(x' + α(x − x'))/∂x i dα, where ∂b(x)/∂x i is the gradient of b(x) along the i-th dimension. An example of intgrad explanations is in Figure 8. The saliency maps obtained tend to have more uniform pixels than ε-lrp. As shown before, ε-lrp highlights that when predicting the "deer" the most salient regions are in the background. However, an arbitrary choice of the baseline can cause issues. For example, a black baseline image could cause the method to lower the importance of black pixels in the source image. This problem is due to the difference between the image's pixels and the baseline (x i − x' i ) present in the integral equation. Expected Gradients [46] tries to overcome this problem by averaging intgrad over different baselines.
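In practice the integral is approximated by a Riemann sum over points along the path. A minimal sketch with an analytic toy model (function names and the toy model are illustrative) also demonstrates the completeness property, i.e., the attributions summing to b(x) − b(x'):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint Riemann-sum approximation of intgrad:
    (x - x') * mean over alpha of grad f(x' + alpha * (x - x'))."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# toy differentiable model b(x) = x0^2 + 3*x1 and its analytic gradient
f = lambda x: x[0] ** 2 + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0], 3.0])

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(grad_f, x, baseline)
```

For this quadratic toy model the midpoint rule is exact, and the attributions sum to f(x) − f(baseline) = 7.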
DEEPLIFT [110] is a model-specific and data-agnostic explainer which produces post-hoc local explanations. It computes SMs in a backward fashion similarly to ε-lrp, but it uses a baseline reference as in intgrad. deeplift uses slopes, instead of gradients, describing how the output y = b(x) changes as the input x differs from a baseline x'. Like ε-lrp, an attribution value r is assigned to each unit i of the neural network going backward from the output y. This attribution represents the relative effect of the unit activated at the original network input x compared to the activation at the baseline reference x'. deeplift computes the starting values of the last layer L as the difference between the outputs on the input and on the baseline. Then, it uses a recursive equation to compute the attribution values of layer l from those of layer l + 1, down to the input layer: r i = Σ j (w ij l+1,l (a i − a' i ) / Σ i' w i'j l+1,l (a i' − a' i' )) r j , where w ij l+1,l are the weights of the network between layer l and layer l + 1, and a (resp. a') are the activation values on the input x (resp. the baseline x'). As for intgrad, picking a baseline is not trivial and might require domain experts. The SMs obtained with deeplift are very similar to those obtained with ε-lrp (Figure 8).
SMOOTHGRAD [112] is a post-hoc model-specific and data-agnostic explanation method. SMs tend to be noisy, especially pixel-wise ones, and smoothgrad tries to overcome this problem by smoothing the noise in the SMs. Usually, a SM is created directly from the gradient of the model's output signal w.r.t. the input, ∂y/∂x. smoothgrad augments this process by smoothing the gradients with a Gaussian noise kernel: it takes x, applies Gaussian noise to it, and retrieves the SM of every perturbed image using the gradient. The final SM is the average of these maps. Formally, given a saliency method f(x) which produces a saliency map s, its smoothed version f̂ can be expressed as: f̂(x) = (1/n) Σ k=1..n f(x + N(0, σ 2 )), where n is the number of samples and N(0, σ 2 ) is the Gaussian noise. In [4,5] some weaknesses of smoothgrad are shown: people tend to evaluate SMs based on what they expect to see. For example, in a bird image, we want to see the shape of a bird; however, this does not mean that this is what the network is looking at. Figure 9 highlights this problem (Fig. 9: visual comparison of saliency maps obtained by taking the gradient of the output y w.r.t. the input image x (center) and smoothgrad (bottom); on all three images, including the seashore image on the far right, smoothgrad drastically changes the saliency map, focusing it on the subject of the image and completely changing the original values). We obtained the SMs by taking the gradient of the output w.r.t.
the input, and then we applied smoothgrad. We observe that the SMs completely change their behavior, moving toward the subject of the image.
SHAP, presented in Section 4, offers two explainers that can be employed for deep networks tailored to image classification: deep-shap and grad-shap. deep-shap is a high-speed approximation of shap values for deep learning models that builds on a connection with deeplift. The implementation differs from the original deeplift by using as baseline a distribution of background samples instead of a single value, and by using Shapley equations to linearize non-linear components of the black-box such as max, softmax, products, divisions, etc. grad-shap, instead, is based on intgrad and smoothgrad [115,112]. intgrad values are slightly different from shap values and require a single reference value to integrate from. To adapt them so that they approximate shap values, grad-shap reformulates the integral as an expectation and combines that expectation with sampling reference values from the background dataset, as done in smoothgrad. We tested both deep-shap and grad-shap experimentally; the results are shown in Figure 10. deep-shap outputs a saliency map explaining every class of the input image, while grad-shap produces a pixel-wise saliency map similar to those shown before.
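smoothgrad's averaging step is simple enough to sketch directly. Here a deliberately jagged one-dimensional gradient stands in for a network's noisy saliency; the function and constants are illustrative assumptions of the sketch:

```python
import numpy as np

def smoothgrad(grad_f, x, n=200, sigma=0.1, seed=0):
    """smoothgrad: average the raw gradient map over n noisy copies of x."""
    rng = np.random.default_rng(seed)
    return np.mean([grad_f(x + rng.normal(0.0, sigma, x.shape))
                    for _ in range(n)], axis=0)

# stand-in for a network's jagged saliency: the gradient of
# f(x) = sum(x + 0.05*sin(50x)) is linear on average but noisy pointwise
grad_f = lambda x: 1.0 + 2.5 * np.cos(50.0 * x)

x = np.array([0.3, 0.7])
raw = grad_f(x)            # noisy single-point gradient
smooth = smoothgrad(grad_f, x)
```

The raw gradient swings wildly from point to point, while the smoothed map converges toward the underlying average sensitivity, which is exactly the denoising effect described above.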
XRAI [70] is based on intgrad and inherits its properties. Differently from intgrad, xrai first over-segments the image. It iteratively tests each region's importance, fusing smaller regions into larger segments based on attribution scores. It is divided into three steps: segmentation, attribution, and region selection. The segmentation is repeated several times with different segments to reduce the dependency on the particular image segmentation. For attribution, xrai uses intgrad averaged over black and white baselines. Finally, to select regions, xrai leverages the fact that, given two regions, the one that sums to the more positive value should be more important to the classifier. From this observation, xrai starts with an empty mask and then selectively adds the regions that yield the maximum gain in total attributions per area. The saliency maps obtained from xrai are very different from those already presented; Figure 8 shows some examples. Like all segmentation-based methods, xrai performs at its best with high-resolution images. However, it still obtains good results on low-resolution images.
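The region-selection step can be sketched as a greedy loop over candidate segments; the per-pixel attributions and region masks are assumed to be given (in the real method they come from intgrad and repeated over-segmentation):

```python
import numpy as np

def xrai_rank_regions(attribution, regions):
    """Greedily order region masks by attribution gain per unit area.

    attribution: 2-D array of per-pixel attribution scores.
    regions: list of boolean masks (same shape), e.g. from over-segmentation.
    Returns the region indices in the order they would be added to the mask.
    """
    selected = np.zeros(attribution.shape, dtype=bool)
    remaining = list(range(len(regions)))
    order = []
    while remaining:
        def gain_per_area(i):
            new = regions[i] & ~selected          # pixels not yet covered
            area = new.sum()
            return attribution[new].sum() / area if area else -np.inf
        best = max(remaining, key=gain_per_area)
        order.append(best)
        selected |= regions[best]
        remaining.remove(best)
    return order
```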
GRADCAM [106] is a model-specific post-hoc local explainer for image data. It uses the gradient information flowing into the last convolutional layer of a CNN to assign saliency values to each neuron for a particular decision. Convolutional layers naturally retain spatial information, which is lost in fully-connected layers, so we can expect the last convolutional layers to offer the best compromise between high-level semantics and detailed spatial information. To create the SM, gradcam takes the feature maps a produced at the last convolutional layer. Then, it computes the gradient of the output of a particular class y^c w.r.t. every feature map activation k, i.e., ∂y^c/∂a^k. This yields a tensor of dimensions [k, v, u], where k is the number of feature maps and v, u are the height and width of the maps. gradcam computes a saliency value for every feature map by pooling over the spatial dimensions. The final heatmap is calculated as a weighted sum of these values. Notice that this results in a coarse heatmap of the same size as the convolutional feature maps; an up-sampling technique is applied to the final result to produce a map of the initial image dimension. From Figure 8 it is clear that this coarse-grained heatmap style is very characteristic of gradcam. These heatmaps highlight very different parts of the image compared to the other methods.
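A minimal sketch of the Grad-CAM weighting-and-pooling step, assuming the feature maps and the gradients of the class score have already been extracted from the network:

```python
import numpy as np

def gradcam_heatmap(feature_maps, gradients):
    """Coarse Grad-CAM heatmap from the last convolutional layer.

    feature_maps: array of shape [k, v, u] (k maps of size v x u).
    gradients: dy_c/da_k, same shape -- gradient of the class score
    w.r.t. each feature-map activation.
    """
    # Global-average-pool the gradients to one weight per feature map.
    weights = gradients.mean(axis=(1, 2))              # shape [k]
    # Weighted sum of the feature maps, then ReLU.
    cam = np.tensordot(weights, feature_maps, axes=1)  # shape [v, u]
    return np.maximum(cam, 0.0)
```

The result is the coarse [v, u] heatmap; the up-sampling to the original image size is left out of the sketch.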
GRADCAM++ [27] extends gradcam, solving some of its issues. The spatial footprint of an object in an image is essential for gradcam's visualizations to be robust: if there are multiple objects with slightly different orientations or views, different feature maps may be activated with differing spatial footprints, and the ones with smaller footprints fade away in the final sum. gradcam++ fixes this problem by taking a weighted average of the pixel-wise gradients. In particular, gradcam++ reformulates gradcam by explicitly coding the structure of the weights α_k^c as α_k^c = Σ_i Σ_j w_ij^{kc} · ReLU(∂y^c/∂a_ij^k), where ReLU is the Rectified Linear Unit activation function and w_ij^{kc} are the weighting coefficients for the pixel-wise gradients for class c and convolutional feature map a^k. The idea is that α_k^c captures the importance of a particular activation map a^k, and positive gradients are preferred because they indicate visual features that increase the output neuron's activation rather than suppress it.
RISE [97] is a model-agnostic method that produces post-hoc local explanations on image data. To produce a saliency map for an image x, rise generates N random masks M_i with values in [0, 1]. The input image x is element-wise multiplied with these masks M_i, and the result is fed to the base model. The saliency map is obtained as a linear combination of the masks M_i weighted by the black-box predictions on the corresponding masked inputs. The intuition behind this is that b(x ⊙ M_i) is high when the pixels preserved by mask M_i are essential.
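A minimal sketch of this procedure; note that the real method generates low-resolution masks and upsamples them, while here plain per-pixel random masks are used for brevity, and `predict` is a hypothetical black-box scoring function:

```python
import numpy as np

def rise_saliency(predict, x, n_masks=500, p=0.5, seed=0):
    """RISE sketch: saliency as a mask average weighted by black-box scores.

    predict: maps a masked image to the class score b(x * M) in [0, 1].
    x: 2-D grayscale image (kept simple for the sketch).
    """
    rng = np.random.default_rng(seed)
    saliency = np.zeros_like(x, dtype=float)
    for _ in range(n_masks):
        m = (rng.random(x.shape) < p).astype(float)  # random binary mask
        saliency += predict(x * m) * m
    # Normalise by the expected number of times each pixel is preserved.
    return saliency / (n_masks * p)
```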

Qualitative and Quantitative Comparison of Saliency Maps
In Figure 8, we report the SMs obtained with every method tested. The segmentation used by lime is very poor with small images, as in some cases it results in super-pixels as big as the whole image. On the other hand, the segments produced by xrai are much clearer. For the majority of images, the SMs returned by the various explainers are very similar, but we can observe conflicts. For instance, in cifar we can assume that the background is useful to predict the class deer, but we do not know how: some explainers highlight the top background while others the bottom one, so it is difficult to understand. Moving to bigger images, these conflicts become more evident. Let us look at the ice hockey image. The class in the dataset here is "puck", the hockey disk. lime highlights the ice as important, while other methods (xrai and gradcam++) highlight the stick of the player. gradcam highlights the fans, while rise highlights the hockey player. Thus, for the same image, we can obtain very different explanations. Moving to the second image from imagenet (the mask), we can observe that all the methods capture the same pattern: a straw hat in the background triggered the class "shower cap" while the correct one was "mask". In the "seashore" image of imagenet, we have an island in the sea. The top three predicted classes are seashore (0.91), promontory (0.04), and cliff (0.01). Half of the tested methods, like lime, smoothgrad, rise, and gradcam, were fooled into indicating the promontory as important for the class "seashore". We can conclude that SMs are very fragile when there are multiple classes in the image, even if these classes have very low predicted probability.
To investigate further the performance of the analyzed methods, we computed the deletion and insertion metrics, discussed in Section 3.2. For a query image, we substitute pixels in order of the importance scores given by the explanation method: for insertion, we start from a blurred image and slowly insert pixels; for deletion, we substitute pixels with black ones. After every substitution, we query the black box with the image, obtaining an accuracy. The final score is the area under the curve (AUC) [62] of accuracy as a function of the percentage of substituted pixels. Figure 11 shows an example of this metric computed on the hockey image of imagenet.
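The deletion variant of this metric can be sketched as follows, with `predict` a hypothetical black-box scoring function for the class of interest:

```python
import numpy as np

def deletion_auc(predict, x, saliency, steps=10):
    """Deletion metric: black out pixels by decreasing saliency, track score.

    predict: black-box score for the class of interest on an image.
    Returns the area under the score-vs-fraction-deleted curve
    (the lower, the better the explanation ranked the pixels).
    """
    order = np.argsort(saliency.ravel())[::-1]    # most salient first
    img = x.astype(float).ravel().copy()
    scores = [predict(img.reshape(x.shape))]
    for chunk in np.array_split(order, steps):
        img[chunk] = 0.0                          # delete this batch of pixels
        scores.append(predict(img.reshape(x.shape)))
    y = np.asarray(scores)
    dx = 1.0 / steps                              # equal fraction per step
    return float(((y[:-1] + y[1:]) / 2.0).sum() * dx)   # trapezoidal rule
```

The insertion variant is symmetric: start from a blurred image and reveal pixels in the same order, so that a higher AUC is better.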
For every dataset, we computed this metric on a set of 100 samples and averaged the results, which are shown in Table 7. Insertion scores decrease as the image dimension of the dataset grows because the images carry more information and more pixels have to be inserted to raise the accuracy. Deletion scores decrease as well: since there is more information, it is presumably easier to decrease the accuracy. The best methods are highlighted in bold; rise is the best in three out of five experiments, followed by deeplift and ε-lrp. Segmentation-based methods (lime, xrai, gradcam, gradcam++) struggle with low-resolution images.

Concept Attribution
Most ML models are designed to operate on low-level features, like edges and lines in a picture, that do not correspond to high-level concepts that humans can easily understand. The authors of [4,128] pointed out that feature-based explanations applied to state-of-the-art complex black-box models can yield non-sensible explanations. Concept-based explainability constructs the explanation based on human-defined concepts rather than representing the inputs based on features and internal model (activation) states. This idea of high-level features is more familiar to humans, who are more likely to accept it. For example, a low-level explanation for images is to assign to every pixel a saliency value. Although it is possible to look at every pixel and infer its numerical value, these values make no sense to humans: we do not say that the 5th pixel of an image has a value of 28. Instead, a CA method quantifies, for example, how much the concept "stripes" has contributed to the prediction of the class "zebra". Formally, given a set of images [x^(1), x^(2), ..., x^(n)] belonging to a concept C, with x^(i) ∈ C, CA methods can be thought of as a function f : (b, [x^(i)]) → e which assigns a score e to the concept C based on the predictions and the values of the black box b on the set [x^(i)].
TCAV, Testing with Concept Activation Vectors [75], is a model-agnostic method that produces post-hoc global explanations for image classifiers. tcav provides a quantitative explanation of how important a concept is for the prediction. Every concept is represented by a particular vector called Concept Activation Vector (CAV), created by interpreting an internal state of a neural network in terms of human-friendly concepts. tcav uses directional derivatives to quantify the degree to which a user-defined concept is vital to a classification result.

Fig. 12: tcav scores for three concepts, ice, hockey player, and cheering people (fans), for the class puck of imagenet. On the left, the query image; in the center, some samples of the images tested in tcav as concepts; on the right, the histogram of the scores with errors. The hockey player image has been classified as a puck, but the saliency maps differ widely across methods. Here we can see that the ice and the hockey players are important concepts, while the background fans are not significant.
For example, how sensitive a prediction of "zebra" is to the presence of "stripes". tcav requires two main ingredients: (i) concept-containing inputs and negative samples (random inputs), and (ii) a pre-trained ML model on which the concepts are tested. The concept-containing and random inputs are fed into the model to test how well the trained model captured a particular concept. A linear classifier is trained to distinguish the activations of the network due to concept-containing vs. random inputs; the result of this training is the CAV. Once CAVs are defined, the directional derivative of the class probability along the CAV can be computed for each instance belonging to a class. The "concept importance" for a class is computed as the fraction of the class instances that are positively activated by the concept-containing inputs vs. the random inputs. In Figure 12 we can see an example of a tcav explanation. The user must collect some images of each concept, like "ice", "hockey player", and "fans". Then tcav computes a score for each of them, telling us which one has more impact on the prediction of a query image.
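A simplified sketch of the tcav score; as a cheap stand-in for the trained linear classifier, the CAV is taken here as the difference of the mean activations of the two groups:

```python
import numpy as np

def tcav_score(gradients, concept_acts, random_acts):
    """Simplified TCAV score.

    concept_acts / random_acts: layer activations for concept vs. random
    images.  The CAV is approximated by the difference of the two
    activation means (a stand-in for the linear classifier's normal).
    gradients: per-instance gradients of the class score w.r.t. the layer
    activations, for the instances of the class under study.
    Returns the fraction of instances whose directional derivative along
    the CAV is positive.
    """
    cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    cav /= np.linalg.norm(cav)
    directional = gradients @ cav      # one derivative per instance
    return float((directional > 0).mean())
```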
ACE, Automated Concept-based Explanation [49], is the evolution of tcav and does not need any concept examples: it can discover them automatically. It takes training images and segments them with a segmentation method. These super-pixels are fed into the black-box model as if they were input images and are clustered in the activation space. Then, as in tcav, we can obtain how much these clusters contributed to the prediction of a class.
ConceptSHAP [129] defines an importance score for each discovered concept. Similar to ace, conceptshap aims at having concepts consistently clustered in coherent spatial regions. conceptshap finds the importance of each individual concept from a set of m concept vectors C_s = {c_1, c_2, ..., c_m} by utilizing Shapley values.
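For a small set of concepts, the Shapley attribution can be computed exactly by enumeration; `v` is a hypothetical value function (e.g., conceptshap's completeness of the model restricted to a subset of concepts):

```python
import itertools
import math

def shapley_values(m, v):
    """Exact Shapley value of each of m concepts for value function v.

    v: maps a frozenset of concept indices to a real-valued score
    (e.g., how well the black box is approximated using only those
    concepts).  Exponential in m, so only viable for small concept sets.
    """
    phi = [0.0] * m
    players = range(m)
    for i in players:
        others = [j for j in players if j != i]
        for r in range(len(others) + 1):
            for subset in itertools.combinations(others, r):
                s = frozenset(subset)
                # Standard Shapley weight |S|! (m - |S| - 1)! / m!
                weight = (math.factorial(len(s)) *
                          math.factorial(m - len(s) - 1) / math.factorial(m))
                phi[i] += weight * (v(s | {i}) - v(s))
    return phi
```

For an additive value function, the Shapley value of each concept recovers its individual contribution exactly.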
CaCE, Causal Concept Effect [54], is another variation of tcav. It looks at the causal effect of the presence or absence of high-level concepts on the deep learning model's prediction. tcav can suffer from confounding of concepts, which can happen if the training instances contain multiple classes, even with a low correlation. cace can be computed exactly if the concepts of interest are changed by intervening in the counterfactual data generation process.

Prototypes
Another possible explanation for images is to produce prototypical images that best represent a particular class.Human reasoning is often prototype-based, using representative examples as a basis for categorization and decision-making.Similarly, prototype explanation models use representative examples to explain and cluster data.
MMD-CRITIC [74], already presented in Section 4, can be applied to retrieve image prototypes and criticisms. Figure 13 presents an application of mmd-critic on cifar. We can extract some interesting knowledge from these methods: for example, the criticism planes are all on a white background or have a shape different from the usual one. We can conclude that in cifar most planes are in the sky and have a passenger-airplane shape.
PROTONET [28] is a model-agnostic explainer that produces post-hoc global explanations on image data. It figures out some prototypical parts of images (named prototypes) and then uses them to explain the classification.

Fig. 13: Criticisms (on the left) and prototypes (on the right), output of mmd-critic on cifar. Among the criticisms there are many planes on a white background, suggesting that the sky background is important for the plane class.

Influence Functions [76] is another variant for building prototypes. Instead of building prototypical images for a class, it tries to find the images most responsible for a given prediction using influence functions, a classic technique from robust statistics that traces a model's prediction through the learning algorithm and back to its training data, thereby identifying the training points most responsible for a given prediction. Visualizing the training points most responsible for a prediction can provide deeper insights into model behavior.

Counterfactuals
Counterfactuals are another type of explanation for images. Their application to images is similar to the one already seen for tabular data in Section 4.4. As output, counterfactual methods for images produce samples similar to the original image but with an altered prediction. Some methods output only the pixel variation, others the whole altered image.
Guided Prototypes, Interpretable Counterfactual Explanations Guided by Prototypes (guidedproto) [118], proposes a model-agnostic method to find interpretable counterfactuals. guidedproto perturbs the input image to find the image closest to the original one but with a different classification, by using an objective loss function L = c·L_pred + β·L_1 + L_2 optimized through gradient descent. The first term, c·L_pred, encourages the perturbed instance to be predicted as another class than x, while the others are regularization terms. In Figure 14 we have an example of the application of guidedproto on mnist. It is interesting to notice how easy it is to change the digit class with very few focused pixels.
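The gradient-descent search can be sketched on a toy differentiable model; the prototype-guidance term of guidedproto is omitted, and `predict_grad` is a hypothetical function returning the prediction loss and its gradient:

```python
import numpy as np

def counterfactual_search(x, predict_grad, c=1.0, beta=0.1,
                          lr=0.05, steps=200):
    """Toy counterfactual search by gradient descent.

    predict_grad: returns (L_pred, dL_pred/dz) for a candidate z, where
    L_pred is low when z is predicted as a *different* class than x.
    Objective: L = c*L_pred + beta*||z - x||_1 + ||z - x||_2^2,
    which pushes z over the decision boundary while keeping it close
    to the original instance.
    """
    z = x.astype(float).copy()
    for _ in range(steps):
        l_pred, g_pred = predict_grad(z)
        delta = z - x
        # Subgradient of the elastic-net penalty plus the prediction term.
        grad = c * g_pred + beta * np.sign(delta) + 2.0 * delta
        z -= lr * grad
    return z
```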
CEM, Contrastive Explanation Method (cem) [40], already presented in Section 4, can also be applied to image data. For images, Pertinent Positives (PP) and Pertinent Negatives (PN) are the pixels that lead to the same or to a different class w.r.t. the original instance. To create PPs and PNs, feature-wise perturbation is performed, keeping the perturbations sparse and close to the original instance through an objective function that contains an elastic-net regularizer β·L_1 + L_2. An auto-encoder is trained to reconstruct images of the training set so that the perturbed instance lies close to the training data manifold. Indeed, in Figure 14, we can see that very few pixels are returned as explanations on mnist.
L2X [29] finds the pixels that change the classification. It is based on learning a function that extracts a subset of the most informative features for each given sample using Mutual Information. l2x adopts a variational approximation to compute the Mutual Information efficiently.

ABELE [56] is a local, model-agnostic explainer that produces explanations composed of: (i) a set of exemplar and counter-exemplar images, and (ii) a saliency map. The end-user can understand the classification by looking at images similar to the one under analysis that received the same or a different prediction. Moreover, by exploiting the SM, it is possible to understand which areas of the image can be varied without impacting the outcome. abele exploits an adversarial autoencoder (AAE) to generate the local neighborhood of the record x to explain. It learns on the latent neighborhood a local decision tree, which mimics the behavior of b. Finally, exemplars and counter-exemplars are selected by exploiting the rules extracted from the decision tree. The SM is obtained by a pixel-by-pixel difference between x and the exemplars. In Figure 14 we have an example of the application of abele on mnist: green and yellow areas can change without impacting the black-box outcome, while the gray areas must remain the same to obtain the same prediction.
Runtime Analysis. Table 8 shows the explanation runtimes, approximated as orders of magnitude. We notice that gradcam and gradcam++ are the fastest methods, especially for big models like the VGG network. In general, pixel-wise saliency explanations are cheaper to obtain, while segmentation slows things down considerably, especially for high-resolution images. CA, CF, and PR methods are very slow compared to SM methods because these algorithms need additional training or rely on a search procedure.
Discussion. When dealing with images, the most diffused explanations are Saliency Maps (Section 5.1), and the literature presents a multitude of methods capable of producing this type of explanation. The problem with saliency maps is confirmation bias [4]. Also, humans do not think in terms of pixels: the explanation of a saliency map is provided in terms of pixels, which are low-level features useful only to an expert user who wants to check the robustness of the black box. For a general audience, there is the need to build explanations in terms of higher-level features called concepts. This is the goal of Concept Attribution based explanations (Section 5.2): for a concept selected by a human team, these methods compute a score that evaluates the probability that the selected concept has influenced the prediction. Concept-based explanations are a very recent type of explanation for images, with room for improvement, and a first step in the direction of human-like explanations. Human-friendly concepts make it possible to build straightforward and useful explanations. Humans still need to map images to concepts, but it is a small price to pay to augment the human-machine interaction. Other approaches are based on the idea of producing examples to support the explanation. Prototypes and Counterfactuals (Sections 5.3 and 5.4) are two similar types of explanation but with very different meanings: the goal of prototypes is to produce an example that reflects the common properties of a class, while the goal of counterfactuals is to produce examples similar to the input but with a different predicted class. The former is useful for model inspection, the latter for user experience. In particular, counterfactuals are more user-friendly since they highlight the changes to make to obtain the desired prediction.

Text
For text data, we can distinguish the following types of explanations: Saliency Maps (SM), described in Section 6.1, Attention-Based methods (AB), described in Section 6.2, and Other Methods. The goal is to understand which words are the most relevant for a specific tag assignment. We experimented on three datasets: sst, imdb, and yelp. We selected these datasets18 because they are the most used for sentiment classification and have different dimensions. On these datasets we trained different black-box models. For every explainer, we present an example of an application on one or more datasets.

Sentence Highlighting
As seen in Section 5.1, saliency-based explanations are prevalent because they present visually perceptive explanations. Saliency highlighting is the application of saliency maps to text and consists of assigning to every word a score based on the importance that word had in the final prediction. Formally, a Sentence Highlighting (SH) is modeled as a vector s which explains a classification y = b(x) of a black-box b on x. The dimensions of s correspond to the words of the sentence x we want to explain, and the value s_i is the saliency value of word i: the greater the value of s_i, the greater the importance of that word. A positive value indicates a positive contribution towards y, while a negative one means that the word contributed negatively. Some examples are reported in Figure 15. To obtain such an explanation, it is possible to adapt some of the saliency map methods presented in Section 5.1.

LIME [102], presented in Section 4, can be applied to text with a modification to the perturbation of the original input. Given an input sentence x, lime creates a neighborhood of sentences by replacing one or multiple words with spaces. A possible variation is to insert similar words instead of removing them.
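The neighborhood generation by word masking can be sketched as follows:

```python
import random

def text_neighborhood(sentence, n_samples=5, seed=0):
    """Perturb a sentence by blanking random words (lime-style).

    Returns pairs (perturbed_sentence, binary_mask) where mask[i] == 1
    means word i of the original sentence was kept.  The masks are the
    interpretable representation on which lime fits its linear model.
    """
    rng = random.Random(seed)
    words = sentence.split()
    samples = []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in words]
        kept = [w if keep else "" for w, keep in zip(words, mask)]
        samples.append((" ".join(kept), mask))
    return samples
```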
INTGRAD [115], presented in Section 4, can also be exploited to explain text classifiers. In general, gradient-based methods are challenging to apply to NLP models because the vectors representing the words are usually averaged into a single sentence vector, and since the mean operation cannot be inverted, the explainer cannot redistribute the signal back to the original vectors. intgrad is immune to this problem because the saliency values are computed as a difference with a baseline value: it computes the saliency value of a single word as a difference from the sentence without it. For a fair comparison, we substituted the words with spaces as done for lime.
DEEPLIFT [110], presented in Section 4, can also be applied on text following the same principle of intgrad.For the experiments, we adopt the same preprocessing used for lime and intgrad.
L2X [29] can produce an SH explanation for text; in particular, for text the patches are groups of words.
Qualitative and Quantitative Comparison of Sentence Highlighting. Besides the methods exposed above, we also tested a baseline method. This baseline, named Gradient × Input, takes the gradient of the black-box output w.r.t. the input and multiplies it element-wise by the input values. The results are shown in Figure 15. The highlighted words are very different among the various methods. intgrad and lime are the ones that output meaningful explanations, while deeplift struggles to diversify from the baseline. We also measured deletion/insertion and report the results in Table 10. For both metrics, all the methods perform poorly. However, removing a single word barely changes the meaning of a sentence.
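The Gradient × Input baseline itself is essentially a one-liner over the word embeddings, assuming the embedding matrix and its gradients are available:

```python
import numpy as np

def gradient_x_input(embeddings, gradients):
    """Gradient x Input baseline for sentence highlighting.

    embeddings: [n_words, d] word vectors of the sentence.
    gradients: dy/d(embedding), same shape.
    Returns one saliency score per word, obtained by summing the
    element-wise product over the embedding dimension.
    """
    return (embeddings * gradients).sum(axis=1)
```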

Attention-based Methods
Attention was proposed in [126] to improve model performance. The authors managed to show, through an attention layer, which parts of an image contributed most to producing its caption. Attention is a layer put on top of the model that, for each pixel ij of the image x, generates a positive weight α_ij, the attention weight. This value can be interpreted as the probability that pixel ij is the right place to focus on to produce the next word of the caption. Attention mechanisms allow models to look over all the information the original sentence holds and learn the context [125,18]. Therefore, it has caught the interest of XAI researchers, who started using these weights as an explanation. The explanation e of the instance x is composed of the set of attention values α, one for each feature x_i. Attention is nowadays a delicate topic: while it clearly improves the performance of models, it is less clear whether it helps gain interpretability and what its relationship with the model outputs is [67].
Attention Based Sentence Highlighting [83] is an AB mechanism to produce a heatmap explanation similar to the one used for SMs. The scores are computed for every word of the sentence by using the attention layer of the black box: the weights α_ij of the attention layer are used as scores. The higher the score, the redder the highlighting.
Attention Matrix [30] looks at the dependencies between words to produce explanations. It is a self-attention method, sometimes called intra-attention. attentionmatrix relates different positions of a single sequence to compute its internal representation. The attention of a sentence x composed of N words can be understood as an N × N matrix, where each row and column represents a word of the input sentence. The values of the matrix are the attention values of every possible combination of tokens. This matrix is a representation of values pointing from each word to every other word [119] (see Figure 16). We can also visualize this matrix with a focus on the connections between words [66], as in Figure 17, where the thickness of the lines is the self-attention value between two tokens.

Fig. 16: Saliency heat-map matrix generated by the method presented in [30]. The rows and the columns of the matrix correspond to the words in the sentence "Read the book, forget the movie!". Each value of the matrix shows the attention weight α_ij of the annotation of the i-th word w.r.t. the j-th.

Fig. 17: Representation of the attention in BERT for a sentence taken from imdb, using the visualization of [66]. The greater the attention between two words, the thicker the line. Here only the attention related to the word "sucks" is selected.
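The N × N matrix can be sketched from raw word vectors; here plain dot-product scores with a row-wise softmax stand in for the attention weights α_ij that a trained model would learn:

```python
import numpy as np

def attention_matrix(X):
    """N x N self-attention matrix for a sentence of N word vectors.

    X: [N, d] word vectors.  Scaled dot-product scores, softmax-normalised
    per row, stand in for the learned attention weights alpha_ij.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)       # each row sums to 1
```

Row i of the result can be read as how much word i attends to every other word of the sentence.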
Runtime Analysis. NLP models are usually very large, resulting in poor runtime performance. Apart from Attention Matrix methods, which are instantaneous, we notice that for all the datasets the times are roughly the same, on the order of magnitude of ten seconds. The runtime of the methods is independent of the dataset size.
Discussion. Explanations for text data are at a very early stage compared to tabular data and images. The majority of the methods focus on low-level feature explanations, giving a score to the words that make up the sentence. As said for image explanations in Section 5, these low-level explanations are useful to check the model's robustness, not to give a useful explanation to the final, inexpert user. Natural Language Processing is a very complex field, and finding a human-friendly explanation is challenging. Researchers are working in the direction of creating explanations with high-level concepts [113] and of using humans to augment these types of concepts [101], as done for Concept Attribution.

Miscellanea of Other Methods
There are other methods that are important to mention when talking about XAI for text or sequential data.
ANCHOR, presented in Section 4, can be adapted to text by perturbing a sentence through the substitution of words with the token UNK (unknown). For example, it shows how "sucks" contributed to the negative prediction of a sentence, but when coupled with "love" the sentence prediction switches to positive.
Natural Language Explanation verbalizes explanations in natural human language. Such explanations can be generated with complex deep learning models, e.g., by training a model with natural language explanations and coupling it with a generative model [101]. They can also be generated using a simple template-based approach [2].
XSPELLS [79] is a model-agnostic explainer returning exemplar and counter-exemplar sentences as explanations. It re-implements abele for text data by using LSTM layers in the autoencoder. Exemplars and counter-exemplars are selected by exploiting the rules extracted from the decision tree learned in the latent space.
LASTS, Local Agnostic Shapelet-based Time Series explainer (lasts) [60], is a variation of abele for time series. Since a text can be interpreted as a time series, we report this work here. As explanation, lasts returns exemplar and counter-exemplar time series and shapelet-based rules. Shapelets are locally discriminative subsequences characterizing the classification. An example of a rule is: "if these shapelets are present and these others are not, then x is classified as y".
DOCTORXAI [94] is a local post-hoc model-agnostic explainer acting on sequential data in the medical setting. In particular, it exploits a medical ontology to perturb the data and generate neighbors. doctorxai is designed for healthcare data, but it can in principle be applied to every type of sequential data backed by an ontology.

Explanation Toolboxes
A significant number of toolboxes for the ML explanation have been proposed during the last few years.In the following, we report the most popular Python toolkits with a brief description of the explanation models they provide 19 .
AIX360 [16] contains intrinsic, post-hoc, local, and global explainers, and it can be used with every kind of input dataset. Regarding local post-hoc explanations, different methods are implemented, such as lime [102], shap [84], cem [40], cem-maf [85], and protodash [61]. Another interesting method proposed in this toolkit is ted [65,38], which provides intrinsic local explanations and global explanations based on rules.

CaptumAI is a library built for PyTorch models. It divides the available algorithms into three categories: Primary Attribution, with methods that evaluate the contribution of each input feature to the output of a model (intgrad [115], grad-shap [84], deeplift [110], lime [102], gradcam [106]); Layer Attribution, which focuses on the contribution of each neuron of a given layer (e.g., gradcam [106] and layer-deeplift [110]); and Neuron Attribution, which analyzes the contribution of each input feature to the activation of a particular hidden neuron (e.g., neuron-intgrad [115] and neuron-grad-shap [84]).

InterpretML [93] contains intrinsic and post-hoc methods for Python and R. InterpretML is particularly interesting due to the intrinsic methods it provides: Explainable Boosting Machine (ebm), Decision Tree, and Decision Rule List. These methods offer a user-friendly visualization of the explanations, with several local and global charts. InterpretML also contains the most popular methods, such as lime and shap.

DALEX [19] is an R and Python package that provides post-hoc, model-agnostic explainers allowing local and global explanations. It is tailored for tabular data and can produce different kinds of visualization plots.

Alibi provides intrinsic and post-hoc models. It can be used with any type of input dataset, both for classification and regression tasks. Alibi provides a set of counterfactual explainers, such as cem, and, interestingly, an implementation of anchor [103]. Regarding global explanation methods, Alibi contains ale (Accumulated Local Effects) [11], a method based on partial dependence plots [59].

FAT-Forensics takes into account fairness, accountability, and transparency. Regarding intrinsic explainability, it provides methods to assess explainability under three perspectives: data, models, and predictions. For accountability, it offers a set of techniques assessing privacy, security, and robustness. For fairness, it contains methods for bias detection.

What-If Tool is a toolkit providing a visual interface from which it is possible to experiment without coding. Moreover, it can work directly with ML models built on Cloud AI Platform (https://cloud.google.com/ai-platform). It contains a variety of approaches to get feature-attribution values, such as shap [84], intgrad [115], and smoothgrad [112].

Conclusion
This paper has presented a survey of the latest advances in XAI methods, following a categorization based on data types and explanation strategies. We measured and evaluated a set of benchmarks for each explanation technique, for a comparison from both the quantitative and qualitative points of view.
Our literature review revealed interesting trends in the strategies proposed for an explanation. For tabular data, feature importance is the most widely adopted strategy, particularly for Explainable-by-Design solutions and model-agnostic black-box explanations. Rule-based explanations are gaining attention since their logical formalization enables a deeper understanding of the AI model's internal decisions. Recently, methods that explain in terms of counterfactuals have been yielding interesting results. For image data, the most widely adopted technique is the creation of Saliency Maps, which translate to the image domain the feature relevance approach of tabular data, highlighting the portions of the image relevant for the AI model outcome. However, other approaches, like Concept Attribution, Prototypes, and Counterfactuals, have been rising in recent years. The explanation techniques for text data are still limited, but it is possible to highlight a few trends. We recall Sentence Highlighting which, similarly to feature importance for tabular data, assigns a weight to the portions of the input that contributed, positively or negatively, to the outcome. Across the different data types, different approaches tend to use similar strategies. This is also evident if we look at the internals of these algorithms. For example, several methods exploit the generation of a synthetic neighborhood around an instance to reconstruct the local distribution of the data around the point under investigation. This stochastic generation is the basis of several methods, and it also explains the low performance on the stability measure (see Table 3). Another frequent strategy consists of learning a surrogate model from partial training data (sometimes created through neighborhood generation). This approach tries to bring the benefits of intrinsic methods to the context of black-box explanation.
In recent years, contributions on Explainable AI topics have been growing constantly, particularly in AI and ML. However, there is still only a restricted number of contributions focusing on the comparison of these methods. Defining a unifying metric for measuring the efficacy of explanation strategies is difficult, particularly when human-grounded evaluations are addressed. We believe that the next years of research will focus more on the human side, emphasizing human-machine interaction and aligning the generation of the explanation with the cognitive model of the final user. Some preliminary results in this direction are presented in [55,68,63]. We believe that XAI must be addressed more in the development of AI applications in the future, and we hope that this work can help in its development.

Fig. 2: TOP: lime applied to the same record of adult (a/b) and german (c/d): a/c are the explanations for the LG model and b/d for the CAT model. All the models correctly predicted the output class. BOTTOM: Force plots returned by shap explaining XGB on two records of adult: (e), labeled as class 1 (> 50K), and (f), labeled as class 0 (≤ 50K). Only the features that contributed most (i.e., with the highest shap values) to the classification are reported.

Fig. 3: shap applied to adult: a record labeled > 50K (top-left) and one labeled ≤ 50K (top-right). They are obtained by applying the TreeExplainer to an XGB model and then the decision plot, in which all the input features are shown. At the bottom, the application of shap to explain the outcomes of a set of records by XGB on adult; the interaction values among the features are reported.

Fig. 7: skoperules global explanations of XGB on adult. On the left, a rule for class > 50K; on the right, for class ≤ 50K.
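A rule-based explanation like those in Fig. 7 is typically judged by its coverage and precision. The sketch below is a hypothetical illustration (the rule, the data, and the function names are ours): it evaluates a skope-rules-style premise on a toy sample.

```python
import numpy as np

def rule_metrics(X, y, premise, target):
    """Evaluate a decision rule: coverage = fraction of records matching
    the premise; precision = fraction of matches with the target label."""
    mask = premise(X)
    coverage = mask.mean()
    precision = (y[mask] == target).mean() if mask.any() else 0.0
    return coverage, precision

# hypothetical rule in the spirit of Fig. 7: Age > 34 -> class 1 (> 50K)
X = np.array([[30.0], [40.0], [50.0], [20.0]])   # a single "Age" feature
y = np.array([0, 1, 1, 0])
cov, prec = rule_metrics(X, y, lambda X: X[:, 0] > 34, target=1)
```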

Fig. 8: Examples of saliency maps obtained with the algorithm presented in Section 5.1 on various datasets. The first row shows the original images from each dataset, with the class predicted by the original model reported above each image.

Fig. 11: Example of the Insertion (left) and Deletion (right) metric computation performed on lime and the hockey image. The area under the curve is 0.2156 for Deletion and 0.5941 for Insertion.
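The Deletion curve in Fig. 11 can be sketched as follows. This is a simplified, hypothetical implementation (the names are ours): pixels are zeroed out from most to least salient while tracking the model's class probability, and the area under the resulting curve is the metric; a lower value indicates a better saliency map.

```python
import numpy as np

def deletion_auc(predict, image, saliency):
    """Deletion-metric sketch: remove (zero out) pixels in decreasing
    order of saliency and integrate the model's probability curve."""
    order = np.argsort(saliency.ravel())[::-1]   # most salient pixels first
    img = image.ravel().copy()
    probs = [predict(img.reshape(image.shape))]
    for i in order:
        img[i] = 0.0                             # "delete" one pixel
        probs.append(predict(img.reshape(image.shape)))
    xs = np.linspace(0.0, 1.0, len(probs))
    # trapezoidal area under the probability curve
    p = np.array(probs)
    return float(np.sum((p[:-1] + p[1:]) / 2 * np.diff(xs)))

# toy model whose probability is the mean remaining intensity of a 4x4 image
image = np.ones((4, 4))
saliency = np.arange(16, dtype=float).reshape(4, 4)
auc = deletion_auc(lambda im: im.mean(), image, saliency)
```

The Insertion metric is the mirror image: pixels are added to a blank canvas in the same order, and a higher AUC is better.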

Fig. 14: (a) Explanation of cem on mnist: query in the center, Pertinent Negative on the left, and Pertinent Positive on the right. (b) Explanation of guidedproto on mnist: from left to right, the query and the closest counterfactuals labeled as 6 and 8. (c) Explanation of abele on mnist: query on the left, saliency map (SM) on the right. Green/yellow areas can be changed without impacting the prediction.

Fig. 15: Example of sentence highlighting. On top we have the scores produced by IntGrad; below, in order, LIME, DeepLift, and the baseline, which consists of multiplying the input by the gradient w.r.t. the input. The sentence is taken from imdb.

1. Completeness w.r.t. the black-box model. The metrics aim at evaluating how closely f approximates b.
2. Completeness w.r.t. a specific task. The evaluation criteria are tailored to a particular task or behavior.
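Completeness w.r.t. the black box is usually measured as fidelity: the agreement between the predictions of the surrogate f and those of the black box b on a common sample. A minimal sketch, with toy models and names of our own choosing:

```python
import numpy as np

def fidelity(b, f, X):
    """Fraction of records on which the surrogate f agrees with the
    black box b: 1.0 means f perfectly mimics b on this sample."""
    return float(np.mean(b(X) == f(X)))

X = np.array([[0.2], [0.4], [0.6], [0.8]])
b = lambda X: (X[:, 0] > 0.5).astype(int)    # black-box decision
f = lambda X: (X[:, 0] > 0.3).astype(int)    # slightly looser surrogate
score = fidelity(b, f, X)                    # agrees on 3 of 4 records
```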

Table 2: Summary of methods for explaining black boxes for tabular data. The methods are sorted by explanation type: Feature Importance (FI), Rule-Based (RB), Counterfactuals (CF), Prototypes (PR), and Decision Tree (DT). For every method we report the data type on which it can be applied, tabular only (TAB) or any data (ANY); whether it is an intrinsic model (IN) or a post-hoc one (PH); whether it is a local method (L) or a global one (G); and whether it is model-agnostic (A) or model-specific (S).

Table 3: Comparison of different explanation methods on the fidelity and faithfulness metrics. For every evaluation we report the mean and the standard deviation over a subset of 50 test set records.

Table 4: Comparison on the stability metric. We report the mean and the standard deviation over a subset of 30 test records.
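The stability metric can be sketched as a local Lipschitz estimate: explanations of slightly perturbed inputs should not differ much from the explanation of the original input. The snippet below is a hypothetical illustration (the names and parameters are ours):

```python
import numpy as np

def stability(explain, x, n_trials=10, eps=0.01, seed=0):
    """Worst-case ratio between the change of the explanation and the
    change of the input under small perturbations (lower = more stable)."""
    rng = np.random.default_rng(seed)
    e_x = explain(x)
    ratios = []
    for _ in range(n_trials):
        x_p = x + rng.normal(0.0, eps, size=x.shape)
        ratios.append(np.linalg.norm(explain(x_p) - e_x)
                      / np.linalg.norm(x_p - x))
    return max(ratios)

# a deterministic explainer that ignores sampling noise is perfectly stable
explain = lambda x: np.array([3.0, -2.0])    # constant feature importances
lip = stability(explain, np.zeros(2))
```

Explainers built on stochastic neighborhood generation typically score much worse on this measure, since two nearly identical queries can draw very different neighborhoods.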

Table 5: Explanation runtime, expressed in seconds, for explainers of tabular classifiers, approximated as orders of magnitude.

Table 7: Insertion (left) and Deletion (right) metrics expressed as AUC of accuracy vs. percentage of inserted/removed pixels.

Table 8: Explanation runtime, expressed in seconds, for explainers of image classifiers, approximated as orders of magnitude.

Table 9: Summary of methods for opening and explaining black boxes.

Table 10: Insertion (left) and Deletion (right) metrics computed on Sentence Highlighting for different datasets.