
1 Introduction

Explainable Artificial Intelligence (XAI) aims to provide insights into the decision-making processes of complex machine learning models, particularly black-box models, which are often difficult to interpret due to their inherent complexity [2]. As the adoption of machine learning models in critical applications (e.g., healthcare, finance, or business decision-making) continues to increase, understanding their decision-making rationale becomes essential for building trust, ensuring fairness, and making informed decisions based on the output of the model [12].

Surrogate models, which are interpretable white-box (WB) models trained to approximate the behavior of black-box (BB) models, have emerged as a popular approach for providing explanations in XAI [21]. These surrogate models are particularly relevant in use cases where BB models are unavailable due to security or practical concerns or when stakeholders require explanations to support their decision-making processes. For instance, in healthcare, a BB model may predict the likelihood of a patient having a specific disease based on various symptoms and test results. A surrogate model that provides the same prediction but with a different rationale could lead medical professionals to focus on the wrong symptoms or tests, resulting in incorrect treatment or mismanagement of the patient’s condition. In finance, a BB model might predict the probability of default for loan applicants based on their credit history, income, and other financial factors. A surrogate model that matches the BB model’s predictions but with different reasoning could lead to unfair lending decisions or increased financial risk for the lending institution.

However, evaluating the faithfulness of these surrogate models remains challenging. Traditional fidelity measures such as accuracy focus on the similarity of the final predictions between the BB and surrogate models. This can lead to a significant limitation, as a surrogate model might be considered faithful even if it provides the same prediction as the BB model but with a completely different rationale. In critical applications like the ones mentioned above, this can be dangerous, as decision-makers might act on unfaithful explanations, leading to suboptimal or even harmful outcomes [14].

In this paper, we address this limitation by introducing a novel metric called ShapGAP, which assesses the faithfulness of surrogate models by comparing their reasoning paths, using SHAP explanations as a proxy. ShapGAP measures the average L2 distance between the SHAP explanations of BB and WB models, providing a more comprehensive evaluation of surrogate model faithfulness that goes beyond the similarity of final predictions. The main contributions are:

  • We propose ShapGAP, a novel metric for evaluating surrogate model faithfulness that considers the reasoning paths of the models by comparing their SHAP explanations.

  • We demonstrate the effectiveness of ShapGAP through experiments with real-world datasets, comparing it against traditional fidelity measures.

  • We highlight the potential dangers and ethical concerns of relying on unfaithful explanations in critical applications, drawing on philosophical arguments for truthfulness and ethical AI.

By introducing ShapGAP, we aim to contribute to the ongoing research towards Trustworthy AI by providing a more effective method for evaluating surrogate model faithfulness that captures the essence of reasoning paths, enabling better understanding, trust, and ethical considerations in AI systems.

The rest of the manuscript is organized as follows. Section 2 introduces related work. Section 3 describes ShapGAP. Section 4 presents details about the experiments (i.e., experimental setting, datasets, and models). Section 5 discusses the reported results. Section 6 examines some ethical concerns in depth. Finally, Sect. 7 concludes the paper with final remarks and future work.

2 Related Works

The research field of XAI encompasses various approaches for extracting secondary models from primary models. Model distillation, for instance, often refers to the transfer of knowledge from a larger model (teacher) to a smaller one (student) with the aim of optimizing space and speed, though not necessarily interpretability [4, 13]. In our context, however, we focus on distillation that results in a secondary model with greater interpretability than the primary one while maintaining most of its core characteristics in terms of both behaviour and performance.

In this domain, two main threads can be identified: local surrogates and global surrogates. Local surrogates ensure fidelity within a suitably defined neighborhood [17, 21], resulting in multiple local models that collectively describe the global behavior. Conversely, global surrogates attempt to build a more interpretable model across the entire data domain, providing a bird’s-eye view of the problem. Several methods have been proposed for distilling global surrogates:

  • Pedagogical approach [6, 8]: This method trains the surrogate on queries to the primary model, assuming availability of the primary model for evaluating arbitrary synthesized data points in order to obtain labels for training the secondary model.

  • Audit approach [27]: When probing the model with new data is not possible, this approach trains the surrogate on predictions made by the primary model. This setup is sometimes more realistic in industrial scenarios where the primary model is unavailable due to security or practical concerns. A minimal sketch contrasting both setups is given after this list.
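
To make the distinction concrete, the following minimal sketch contrasts the two training setups. Variable names such as f_bb, X_synth, X_audit, and y_bb_logged are illustrative placeholders, and any interpretable estimator could play the surrogate role; this is a sketch of the general idea, not code from the cited works.

```python
from sklearn.tree import DecisionTreeClassifier

def train_pedagogical(f_bb, X_synth):
    """Pedagogical: the primary model can be queried, so arbitrary
    (possibly synthesized) points are labeled by calling it directly."""
    y_bb = f_bb.predict(X_synth)                      # query the black box
    return DecisionTreeClassifier().fit(X_synth, y_bb)

def train_audit(X_audit, y_bb_logged):
    """Audit: the primary model cannot be probed; only a fixed set of inputs
    and the predictions it already produced are available."""
    return DecisionTreeClassifier().fit(X_audit, y_bb_logged)
```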

The quality of the secondary model is typically assessed using one or two metrics, which can be referred to as task accuracy and model accuracy. Task accuracy evaluates the accuracy of the secondary model concerning the true labels in the dataset, while model accuracy (sometimes called fidelity) measures the accuracy of the secondary model with respect to the labels provided by the primary model [7, 8, 15]. As an alternative, precision and recall are also used to evaluate the resulting rule system or Bayesian network trained with the pedagogical approach [23]. Bastani et al. [6] generalize this approach to non-classification domains, incorporating suitable metrics for regression and reinforcement learning tasks.
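
As a minimal illustration of these two metrics (variable names are placeholders, not taken from any cited implementation):

```python
from sklearn.metrics import accuracy_score

def task_and_model_accuracy(bb, wb, X_test, y_test):
    """Task accuracy: surrogate vs. ground-truth labels.
    Model accuracy (fidelity): surrogate vs. black-box predictions."""
    y_wb = wb.predict(X_test)
    task_acc = accuracy_score(y_test, y_wb)
    model_acc = accuracy_score(bb.predict(X_test), y_wb)
    return task_acc, model_acc
```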

In addition to the aforementioned work, there are also studies that focus on alternative metrics for evaluating the faithfulness of explanations in different contexts. For instance, Alvarez-Melis and Jaakkola [3] propose a self-explaining neural network that provides both a prediction and relevance scores for feature or concept importance. Their faithfulness measure is based on the correlation between explanation vectors and the probability drops observed in ablation studies, offering an alternative way to assess faithfulness using explanation vectors. Although their setup is different from a global surrogate, it shares some similarity with ShapGAP in utilizing explanations for measuring faithfulness.

Alaa and van der Schaar [1] proposed another approach that employs feature importance for model comparison, although not necessarily in a global surrogate setting. They qualitatively compare the global feature importance of two models, demonstrating another perspective on assessing faithfulness and explanation quality between models. In addition, Dai et al. [9] proposed alternative fidelity measures such as ground truth fidelity, which can be adapted for comparing two models. This measure, which can be referred to as the “Top-k Percentage Accordance”, calculates the percentage of top-k features from one explanation that are also in the top-k features of another explanation. While this metric might have limitations, it demonstrates another perspective on evaluating faithfulness between models.
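
A sketch of the Top-k Percentage Accordance idea follows, assuming that features are ranked by absolute attribution value (our assumption; the original definition may differ in details):

```python
import numpy as np

def topk_accordance(expl_a, expl_b, k=5):
    """Fraction of the k most important features of explanation a (by absolute
    attribution) that also appear among the top-k features of explanation b."""
    top_a = set(np.argsort(np.abs(expl_a))[-k:])
    top_b = set(np.argsort(np.abs(expl_b))[-k:])
    return len(top_a & top_b) / k
```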

It is worth noting that these previous works do not directly address the surrogate setting, but they contribute to the broader understanding of evaluating explanations and faithfulness in various settings. This context helps to clarify how ShapGAP fits into the larger landscape of XAI research.

3 Proposed Approach: ShapGAP

Before defining ShapGAP, it is useful to recall some preliminary concepts. Shapley values [25], originally derived from cooperative game theory, represent a strategy for allocating the payoff of a cooperative game among the players based on their contributions. In the context of feature importance, each feature is considered a player, and the prediction is the payoff. The Shapley value of a feature quantifies its average marginal contribution across all possible feature combinations. SHapley Additive exPlanations (SHAP) [17] build on Shapley values to provide a unified measure of feature importance for machine learning models. For a given prediction, SHAP assigns an importance value to each feature, such that the sum of all values equals the difference between the prediction and the average prediction for the dataset.
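
For completeness, we recall the standard Shapley value of a feature \(i\) with respect to a value function \(v\) over the feature set \(N\):

$$\begin{aligned} \phi _i(v) = \sum _{S \subseteq N \setminus \{i\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left( v(S \cup \{i\}) - v(S) \right) \end{aligned}$$

The additivity property of SHAP then guarantees, for each explained prediction, that \(f(x) = \mathbb {E}[f(X)] + \sum _{i \in N} \phi _i\).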

We define a BB model as a function \(f_{bb}: X \rightarrow Y\), where X is the input space and Y is the output space, and a WB model as a function \(f_{wb}: X \rightarrow Y\). Given a dataset D with n instances, we compute the SHAP values for each instance \(x_i\) in D for both the BB and WB models. Let \(S_{bb}(x_i)\) and \(S_{wb}(x_i)\) represent the SHAP values for instance \(x_i\) for the BB and WB models, respectively.

To define a generic version of ShapGAP, we use a distance function \(d(\cdot , \cdot )\) that measures the dissimilarity between the SHAP explanations of the BB and WB models for instance \(x_i\). The ShapGAP metric is then the average of these distances across all instances in the dataset:

$$\begin{aligned} \text {ShapGAP}(D, d) = \frac{1}{n} \sum _{i=1}^{n} d(S_{bb}(x_i), S_{wb}(x_i)) \end{aligned}$$
(1)
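
A minimal implementation sketch of Eq. (1); the function name and signature are illustrative and not taken from the paper's code release:

```python
import numpy as np

def shapgap(S_bb, S_wb, d):
    """Generic ShapGAP of Eq. (1).

    S_bb, S_wb : arrays of shape (n_instances, n_features) holding the SHAP
                 values of the BB and WB models for the same instances.
    d          : callable taking two 1-D explanation vectors and returning
                 their dissimilarity.
    """
    S_bb, S_wb = np.asarray(S_bb), np.asarray(S_wb)
    return float(np.mean([d(s_b, s_w) for s_b, s_w in zip(S_bb, S_wb)]))
```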

We can implement the distance function \(d(S_{bb}(x_i), S_{wb}(x_i))\) using different distance measures, such as the L2 Euclidean distance and the Cosine distance. The L2 Euclidean distance, shown in Eq. (2), is more precise and faithful to the final probability values, since the SHAP explanations sum up to the output of the model. This choice emphasizes the exact contribution of each feature in the explanation and is more sensitive to differences in magnitude between the SHAP values associated with the BB and WB models.

$$\begin{aligned} \text {ShapGAP}_{L2}(D) = \frac{1}{n} \sum _{i=1}^{n} || S_{bb}(x_i) - S_{wb}(x_i) ||_2 \end{aligned}$$
(2)

However, the L2 Euclidean distance has some limitations. It is sensitive to outliers, which means that a single instance with a large disparity in SHAP values can significantly affect the overall distance. Therefore, the L2 Euclidean distance might not represent the overall similarity well in the presence of outliers.

On the other hand, the Cosine distance (see Eq. (3)) is more relaxed and focuses on the similarity in the direction of the SHAP explanations rather than their magnitude. This choice allows for identifying surrogate models with similar reasoning paths even if the magnitude of their feature contributions differs. By being more scale-agnostic with respect to the magnitude of the explanations, the Cosine distance might be better suited for cases where the focus is on the general structure of the explanation rather than the exact values of the SHAP contributions.

$$\begin{aligned} \text {ShapGAP}_{Cos}(D) = \frac{1}{n} \sum _{i=1}^{n} ( 1 - \frac{S_{bb}(x_i) \cdot S_{wb}(x_i)}{|| S_{bb}(x_i) ||_2 || S_{wb}(x_i) ||_2}) \end{aligned}$$
(3)

By offering these two distance measures, the ShapGAP metric can accommodate different application requirements and preferences, providing more flexibility in the evaluation of surrogate model faithfulness.
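
Both variants can also be computed directly in vectorized form; the following sketch assumes explanation matrices of shape (n_instances, n_features) with no all-zero explanation vectors:

```python
import numpy as np

def shapgap_l2(S_bb, S_wb):
    """ShapGAP_L2 of Eq. (2): mean Euclidean distance between explanations."""
    S_bb, S_wb = np.asarray(S_bb), np.asarray(S_wb)
    return float(np.mean(np.linalg.norm(S_bb - S_wb, axis=1)))

def shapgap_cos(S_bb, S_wb):
    """ShapGAP_Cos of Eq. (3): mean cosine distance between explanations."""
    S_bb, S_wb = np.asarray(S_bb), np.asarray(S_wb)
    dots = np.sum(S_bb * S_wb, axis=1)
    norms = np.linalg.norm(S_bb, axis=1) * np.linalg.norm(S_wb, axis=1)
    return float(np.mean(1.0 - dots / norms))
```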

To compute SHAP values, we use the widely adopted shap package, which offers efficient implementations for various types of models. For tree-like models such as random forests and decision trees, the package provides the fast TreeSHAP algorithm. Likewise, for linear models, an efficient method based on the coefficients of the model is available. In cases where the BB model is neither tree-based nor linear, the package offers a model-agnostic method called KernelSHAP, which can be applied to any model at the expense of increased computational cost. By leveraging these implementations, we can calculate the ShapGAP metric for a diverse range of surrogate models, ensuring that the metric remains flexible and adaptable to various application scenarios.
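
As an illustration of how the two explanation matrices might be obtained with the shap package (a sketch with placeholder variable names; the exact return shape of shap_values depends on the shap version and the model type):

```python
import shap

def shap_explanations(bb_model, wb_model, X):
    """SHAP matrices for a tree-based BB (e.g. a random forest) and a
    tree-based WB surrogate, both already fitted."""
    S_bb = shap.TreeExplainer(bb_model).shap_values(X)
    S_wb = shap.TreeExplainer(wb_model).shap_values(X)
    # Some shap versions return one attribution array per class for classifiers;
    # in that case, keep the array for the class of interest (e.g. the positive class).
    if isinstance(S_bb, list):
        S_bb, S_wb = S_bb[1], S_wb[1]
    return S_bb, S_wb

# Linear surrogates:        shap.LinearExplainer(linear_model, X_background)
# Model-agnostic fallback:  shap.KernelExplainer(bb_model.predict_proba, X_background)
```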

4 Experimental Section

In this section, we present the experimental setting used to validate both the ShapGAP L2 and Cosine variants, comparing them against Task Accuracy, Fidelity Accuracy (in the sense of model accuracy, i.e., the accuracy with respect to the labels predicted by the BB model), and ShapLength [18]. ShapLength, a model-agnostic metric, enables the comparison of fundamentally different models, such as Logistic Regression and Decision Trees, in terms of their explanation complexity. By examining ShapLength, we can assess the trade-offs between faithfulness and simplicity in surrogate models and gain insights into the complexity of models and their explanations, making the analysis more comprehensive.

To align with the scenarios outlined in the introduction, we use two popular datasets: the Breast Cancer dataset [26], which contains 569 instances with 30 features, and the German Credit dataset from UCI [10], which includes 1,000 instances with 20 features. Both datasets are widely used for benchmarking machine learning models in the context of binary classification. For surrogation, we employ the audit approach and perform 10-fold cross-validation, computing the quality metrics on the test set of each fold; the reported results are averaged across all folds. To explore various surrogate models, we train Decision Trees by varying two parameters: max_depth, which takes values of 3, 4, or 5, and ccp_alpha, a regularization parameter for cost complexity pruning, which takes values of 0.001, 0.01, and 0.1. For the German Credit dataset, which contains categorical columns, we preprocess the data by one-hot encoding them.
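
A condensed sketch of this evaluation loop is given below. Variable and function names are hypothetical, shap_explanations and shapgap_l2 refer to the sketches in Sect. 3, and the actual scripts are those released online.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def audit_cv(X, y, max_depths=(3, 4, 5), ccp_alphas=(0.001, 0.01, 0.1), seed=0):
    """10-fold CV with the audit approach: per fold, fit the BB, train surrogate
    DTs on its predictions, and collect the test-set metrics.
    X, y are NumPy arrays; results are averaged afterwards per configuration."""
    results = []
    for train, test in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        bb = RandomForestClassifier(random_state=seed).fit(X[train], y[train])
        y_bb_train = bb.predict(X[train])          # audit labels for the surrogate
        y_bb_test = bb.predict(X[test])
        for depth in max_depths:
            for alpha in ccp_alphas:
                wb = DecisionTreeClassifier(max_depth=depth, ccp_alpha=alpha,
                                            random_state=seed).fit(X[train], y_bb_train)
                y_wb_test = wb.predict(X[test])
                S_bb, S_wb = shap_explanations(bb, wb, X[test])
                results.append({
                    "max_depth": depth, "ccp_alpha": alpha,
                    "task_acc": np.mean(y_wb_test == y[test]),
                    "fidelity_acc": np.mean(y_wb_test == y_bb_test),
                    "shapgap_l2": shapgap_l2(S_bb, S_wb),
                })
    return results
```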

For the sake of experimental reproducibility, everything required for running the experiments is available online.

Fig. 1. Comparative analysis of surrogate models using Task Accuracy, Fidelity Accuracy, ShapGAP (L2 and Cosine distance), and ShapLength for the Breast Cancer dataset. The plot displays the performance of various Decision Trees (DT) and Logistic Regression models. The key findings show that although Logistic Regression has high Task Accuracy and Fidelity Accuracy, its large ShapGAP highlights the potential danger in using it as a global surrogate. Decision Trees with lower Fidelity Accuracy but smaller ShapGAP provide more faithful explanations.

Fig. 2. Comparison of Task Accuracy, Fidelity Accuracy, ShapGAP (L2 and Cosine), and ShapLength for different surrogate models (Logistic Regression and Decision Trees) on the German Credit dataset. Each point represents a surrogate model, with its position determined by the respective metrics. The plot illustrates how Logistic Regression achieves similar Task Accuracy to the black-box model but has higher ShapLength complexity and significantly higher ShapGAP compared to Decision Trees, indicating that its explanations may be less faithful to the black-box model.

5 Discussions of Results

Our analysis of the results on both the Breast Cancer (Fig. 1) and German Credit (Fig. 2) datasets reveals some important insights about the relationships between Task Accuracy, Fidelity Accuracy, complexity, and ShapGAP. In both experiments, Logistic Regression (LR) performs better than Decision Trees (DT) in terms of both Task Accuracy and Fidelity Accuracy. However, when we consider the ShapGAP metric, the LR model exhibits a very high ShapGAP compared to the DT models, indicating that its explanations are unfaithful to the BB model despite its superior accuracy.

In the Breast Cancer dataset, the LR model exhibits both high Task and Fidelity Accuracy, suggesting its potential as an effective global surrogate for the BB model. The additional advantage of lower complexity in the LR model strengthens this proposition. However, the high ShapGAP value warns against this, as it would lead to unfaithful explanations of the BB model.

In the German Credit dataset, it is interesting to observe that the LR model achieves Task Accuracy similar to that of the BB model, but with higher complexity. One possible reason for this observation is that the simpler structure of the LR model, with its linear decision boundary, allows it to effectively capture the underlying patterns in the data, but at the expense of using more features on average, as reflected in its higher ShapLength complexity. On the other hand, the more expressive nature of the Random Forest BB model, with its ensemble of DT models, enables it to capture complex relationships among features and potentially represent the patterns in the data with fewer features on average.

Given that a WB model such as LR performs almost on par with the BB model in terms of Task Accuracy, readers may wonder whether it is necessary to employ a more complex, less interpretable BB model in this case. Indeed, this question has already been posed in a few papers, such as [19, 22], which argue that if a WB model performs well on the task, it should be used directly for both prediction and explanation, discarding the BB model. Of course, this choice is also motivated by the nature of the task and its specific requirements. In high-stakes scenarios, such as medical diagnosis, it might be preferable to accept slightly lower Task Accuracy in exchange for higher explainability, making the use of a WB model more appropriate. In other scenarios, like movie or song recommendations, the trade-off between accuracy and explainability might be less critical, and the choice between WB and BB models may depend on other factors. The decision ultimately depends on how much a given accuracy improvement is worth and on the importance of explainability in the given context.

Overall, ShapGAP reveals that LR models behave in a very different way compared to DT models. While LR models can achieve higher accuracy, in some cases, it might be preferable to use DT models as global surrogates due to their explanations being more faithful to the BB model. This illustrates the value of the ShapGAP metric in guiding the selection of surrogate models based on the faithfulness of their explanations, in addition to their performance on the task.

6 Ethical Considerations in Surrogate Explanations

The ethical implications of unfaithful surrogate explanations in critical applications are manifold. Unfaithful explanations can lead to misinformed decision-making, adversely affecting individuals’ lives and well-being, especially in sensitive domains like healthcare, justice and finance [5]. This misalignment can also hinder the identification and correction of AI model shortcomings, exacerbating existing societal inequalities [20].

Accurate and reliable surrogate explanations are vital for ensuring that AI systems align properly with the ethical principles guiding AI development. Such explanations enable users to understand and scrutinize model behavior, allowing them to make fully informed choices about the use of AI systems. Informed consent is a fundamental ethical principle that should be upheld in AI development and deployment, as it preserves users’ autonomy and agency [11].

Unfaithful surrogate explanations can erode trust in AI systems, which is essential for the successful adoption of AI technologies [24]. Legal and liability issues can arise due to misaligned explanations, making responsibility attribution for errors or adverse outcomes challenging [28]. Furthermore, unfaithful explanations can mask biases and discrimination, perpetuating societal inequalities.

AI developers have an ethical obligation to provide truthful and accurate explanations. In summary, ShapGAP offers a means to evaluate and compare surrogate models, promoting responsible development and deployment of Trustworthy AI by helping developers in their pursuit of more faithful explanations. Ensuring truthfulness and accuracy in AI explanations is essential for preserving human values, promoting fairness, and fostering transparency and accountability in agreement with ethical guidelines [28].

7 Conclusions, Limitations, and Future Work

In this paper, we introduced the ShapGAP metric, a novel approach for evaluating the faithfulness of surrogate models by comparing their SHAP explanations with those of the black-box model. Through two illustrative case studies, we demonstrated the utility of ShapGAP in revealing the unfaithfulness of models that may otherwise appear as strong global surrogates based on Task Accuracy, Fidelity Accuracy, and complexity metrics like ShapLength.

The ShapGAP metric has the potential to improve the trustworthiness and utility of surrogate models, particularly in high-stakes applications such as healthcare and finance, where faithful explanations are crucial for decision-making. Our experimental results emphasize the importance of considering faithfulness as an essential criterion for surrogate model evaluation and selection.

While ShapGAP offers a promising approach for evaluating surrogate model faithfulness, there are some limitations to consider:

  • Computational Expense: SHAP explanations can be computationally expensive, especially for complex models or large datasets. Although various approximation methods exist, the computational cost of calculating SHAP values may still pose a challenge in some scenarios. Nevertheless, the utility of ShapGAP as an evaluation metric makes it worthwhile to compute these explanations despite the potential computational burden.

  • Approximate Nature: Though SHAP explanations are widely adopted in the XAI community, they are only approximations of the reasoning paths followed by the underlying models. Furthermore, the SHAP computation itself yields approximations of the true Shapley values. Despite these limitations, we believe that a reasonable approximation is still useful for assessing surrogate model faithfulness, as it provides insights into the models’ reasoning processes that are otherwise unavailable or incomparable.

  • Dependency on SHAP: Our approach relies on SHAP explanations to compare reasoning paths, which may limit its applicability to other explanation methods. Although SHAP has gained widespread acceptance in the XAI community, future research could explore alternative approaches for evaluating surrogate model faithfulness using different explanation methods.

To address these limitations, future work could focus on expanding the scope of ShapGAP to incorporate other explanation methods or on reducing the computational cost of producing SHAP explanations. Moreover, SHAP, as it stands, provides a first-order explanation: it presents the impact of each individual feature without considering interactions between features, so nuances in feature relationships might be missed. To account for this, we plan to take into account SHAP interaction values [16], which capture the synergistic or antagonistic effects between feature pairs, as a basis for an enhanced ShapGAP. In addition, more in-depth investigations into the factors that contribute to high or low ShapGAP scores, and into how to optimize surrogate models for faithfulness, could be valuable. Further research might also delve into the role of ShapGAP in model selection and evaluation pipelines, its integration into automated machine learning (AutoML) frameworks, and its potential impact on the design or training of surrogate models for improved faithfulness. Finally, future research could investigate potential biases or limitations of SHAP explanations on more datasets and explore methods to mitigate their impact on the evaluation of surrogate model faithfulness.