1 Introduction

Artificial Intelligence (AI) or Machine Learning (ML) is rapidly evolving and disrupting various sectors, such as finance, healthcare, business (e.g., logistics, the labor market), education, and urban development. Besides the many benefits AI can create, multiple negative implications can be identified for each sector (Păvăloaia and Necula 2023). One of the recurring challenges concerning AI is the need for transparency: many AI models are opaque and operate on a black-box basis, which makes it difficult—or sometimes impossible—to interpret and explain a decision that has been made. Explainable Artificial Intelligence (XAI) has therefore recently emerged as a much-needed research field. Next to the obvious focus on the predictive performance of AI models, model explainability is necessary for users, developers, and other stakeholders of real-life AI applications. Not only do people generally want an explanation for an algorithm-based decision, but legislation also backs up this need. For example, the European Union's General Data Protection Regulation (GDPR), which came into force in 2018, states that subjects of algorithmic decision-making are entitled to insights about the logic involved. Users can ask for explanations of data-driven decisions that significantly influence their lives (Goodman and Flaxman 2017). Such decisions fall under high-risk applications of AI, such as credit scoring and employment services. People want, and are entitled to, an answer to why their loan is denied or why they are not hired for a job.

Reaching a certain level of explainability in AI models is possible either by developing models that are inherently more interpretable (but sometimes have less predictive power) or by using post-hoc XAI techniques to generate explanations after predictions have been made with a black-box model. Even though seemingly good explanations for a model's decision can be generated with a post-hoc XAI method, and the model and its decisions are consequently qualified as transparent, research on the uniformity of these explanations is rather scarce. Many different post-hoc XAI methods exist and each method can generate different explanations for the same predicted outcome. As a result, different stakeholders might prefer the explanations of one specific XAI method over those of another. This raises the question of whether the transparency objective of XAI is achieved. In the literature, this phenomenon has recently been called the disagreement problem (Krishna et al. 2022; Neely et al. 2021; Roy et al. 2022).

Miller (2019) draws on knowledge from psychology, sociology, and cognitive science to identify what makes a "good" explanation. They argue that explanations are contrastive, selected, and social, and that probabilities most likely do not matter. The first means that people generally do not ask why a certain decision was made in isolation; they wonder why it was made instead of another one. The second points to the fact that even though multiple explanations can justify a decision, people tend to select one or two causes as the explanation. The third means an explanation always depends on the beliefs of the user, and the last refers to the preference for causes over a probability or statistical relationship. These insights stress the usefulness of counterfactual (CF) explanations, a post-hoc example-based XAI method that highlights a set of features that, when changed, alter a decision made by a model (Arrieta et al. 2020).

Evaluating the quality of counterfactual explanations in varying contexts and for different users is complex, which has led to a diversity of counterfactual algorithms (Verma et al. 2020; Guidotti 2022). As with other types of post-hoc XAI explanations, research on the consistency of counterfactual explanations is limited, which poses a risk: disagreeing counterfactual explanations could lead to ethical issues and transparency concerns in XAI, especially if one party controls which explanation is selected.

Our research focuses on the disagreement problem in popular counterfactual (CF) explanation methods, an area that remains less explored than feature importance explanations in Explainable AI (XAI). Two primary reasons highlight the unique nature of the disagreement problem in counterfactual explanations. First, counterfactuals explain decisions, while feature importance explanations explain prediction scores (Fernández-Loría et al. 2020). This difference directly relates to the contexts where the disagreement problem is most critical: the high-risk scenarios outlined by the AI Act, where the emphasis is on understanding decisions—such as denial of credit, job rejections, or medical diagnoses—rather than on interpreting prediction scores. Second, unlike feature importance methods that include all features, counterfactual explanations often focus on a small, selective subset of features. This selectiveness can introduce bias in the explanations. For example, explanations with specific features could be chosen to hide the fact that a model relies on unethical features (see an elaborated example in Sect. 3). Our study aims to analyze the disagreement problem in counterfactual explanation algorithms, considering their unique challenges and their significant role in ensuring the ethical and transparent use of AI.

It is noteworthy that, although the disagreement problem has received attention in the context of feature importance methodologies, a comprehensive quantification of this challenge within counterfactual algorithms remains unaddressed. Recognizing the considerable potential for misuse, our research aims to bridge this critical gap in the literature by conducting a comprehensive analysis of the disagreement problem amongst counterfactual explanation algorithms.

In Sect. 2, we situate counterfactual explanations in the diverse landscape of post-hoc XAI techniques and explain how a lack of consistent evaluation methods for these techniques can lead to ambiguity in their explanations and to ethical consequences. In Sect. 3, we quantify the disagreement amongst ten different counterfactual explanation methods, next to Anchors and SHAP. Section 4 discusses the disagreement problem for counterfactual explanations and addresses potential ways to deal with it. The paper ends with conclusions and future research in Sect. 5.

2 The diverse landscape of post-hoc explanations

Post-hoc explanation methods are a subcategory of XAI concerned with explaining decisions made by complex black-box models after these models have been trained. In contrast to intrinsic explanation methods, they do not try to create interpretable white-box models, but focus on explaining existing complex models (Linardatos et al. 2020). These methods are particularly interesting because their explanations seem to bypass the accuracy-explainability trade-off (Huysmans et al. 2006), which states that model performance often comes at the cost of model interpretability. Post-hoc methods are able to explain complex models and can thus, in theory, achieve both high performance and explainability at the same time. However, the quality of post-hoc explanations has often been a point of discussion (Fernández-Loría et al. 2020; Doshi-Velez and Kim 2017).

Because these explanation methods are applied to models that are not intrinsically explainable, it is difficult to assess the quality of such explanations. The field of XAI evaluation has come up with different metrics to quantify this quality; however, no consensus has been reached so far. Since we cannot strictly quantify the quality of a post-hoc explanation method, many methods are proposed and used. This has led to ambiguity amongst explanations: explanations for the same instance differ depending on the post-hoc explanation method used (the disagreement problem). The quantified lack of uniformity in explanations has already been investigated for several post-hoc explanation methods (Krishna et al. 2022; Neely et al. 2021; Roy et al. 2022); however, to the best of our knowledge, this problem has not yet been investigated for counterfactual explanations, which is the main contribution of this work.

We first give an overview and classification of the post-hoc explanation methods used for comparison in this work in Sect. 2.1. In Sect. 2.2, we discuss how post-hoc explanation methods are currently evaluated and address some core issues regarding this topic. Lastly, Sect. 2.3 elaborates on the existing research on the disagreement problem and the need to apply this research to counterfactual explanations.

2.1 Counterfactual explanations and recent post-hoc explanation methods

The most popular post-hoc explanation methods can be divided into two groups: feature-based techniques (also called attribution methods) and example-based (also called instance-based) techniques (Dwivedi et al. 2023; Molnar 2018). The first group contains methods like local interpretable model-agnostic explanations (LIME), Shapley additive explanations (SHAP), and other feature importance techniques. The second group, example-based post-hoc explanation methods, contains Anchors and counterfactuals. We briefly explain the methods used in our experiments: SHAP, Anchors, and counterfactual explanations. Figure 1 provides a figurative example of how the different XAI methods explain why a person (the instance) is predicted not to get their loan approved. For a more detailed description and examples, we refer to Molnar (2018) and the works mentioned below.


(LIME and) SHAP


SHAP (Lundberg and Lee 2017) and LIME (Ribeiro et al. 2016) are similar in the sense that the impact of a certain feature is measured in relation to the predicted outcome. The basic idea of LIME is to sample instances in the neighborhood of the instance that is given to the prediction model and then train an interpretable model, such as a linear regression or a decision tree, to explain this neighborhood. The interpretable model can consequently be used to explain the prediction made by the actual black-box model. For tabular data, which are used in the experiments of this work, the issue is to define the neighborhood of an instance. If LIME were to sample only closely around the given instance, chances are high that all predictions would be exactly the same and LIME could not capture how predictions change. Therefore, samples are taken broadly, e.g., by using a normal distribution. A major disadvantage of LIME is that the explanations differ depending on the samples used, which makes the explanations unstable and manipulable. Therefore, we do not include LIME in our comparisons.

A better feature-based technique can be found in SHAP, which combines the locality of LIME with the concept of Shapley values from coalitional (or cooperative) game theory. The contribution of each feature (player) to the prediction made by the model (the outcome of the game) for a given instance is calculated. Moreover, the contribution of coalitions of players (multiple features) is examined. The average marginal contribution of a feature value across all coalitions is called the Shapley value. Because there are \(2^k\) possible coalitions, for which models need to be trained, calculating all the Shapley values is computationally expensive. Therefore, by using LIME-inspired sampling, the SHAP algorithm decreases the computation time.
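As an illustration, the snippet below sketches how such Shapley-value attributions could be obtained with the shap Python package. The toy data stand in for a real tabular data set, and the exact API and return format may differ between shap versions; this is a minimal sketch, not the configuration used in our experiments.

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Toy tabular data standing in for a real credit-scoring set (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# KernelSHAP approximates Shapley values by sampling feature coalitions against
# a background sample instead of enumerating all 2^k coalitions.
explainer = shap.KernelExplainer(model.predict_proba, X_train[:100])
x_instance = X_train[:1]
shap_values = explainer.shap_values(x_instance)  # one contribution per feature (per class)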


Anchors


Anchors or scoped rules (Ribeiro et al. 2018) are high-precision, easy-to-understand if-then rules. They portray feature conditions together with a predicted outcome. The rules are called Anchors because changes to features other than the ones mentioned will not result in another prediction. In contrast to, e.g., LIME, Anchors provide a region of instances to describe the model's behavior. They are consequently less instance-specific. For example, imagine a person applying for a loan at a bank. This person is 50 years old, has a monthly income of $2000, is male, and currently has $5000 in debt. A model has predicted that the loan application should be declined. The corresponding Anchor could then be: if the monthly income is lower than $5000 and the age is higher than 35, then predict that the loan application will be declined.
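To make the idea concrete, the sketch below estimates the precision of this anchor by resampling the features that are not part of the rule and checking how often the model's prediction remains 'declined'. The toy model, feature layout, and perturbation scheme are illustrative assumptions, not the actual Anchors search procedure of Ribeiro et al. (2018).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: columns are [age, monthly income, gender flag, debt].
rng = np.random.default_rng(0)
X_train = np.column_stack([
    rng.integers(18, 70, 1000),
    rng.integers(1000, 10000, 1000),
    rng.integers(0, 2, 1000),
    rng.integers(0, 20000, 1000),
]).astype(float)
y_train = (X_train[:, 1] >= 5000).astype(int)        # approve only if income >= $5000
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

x = np.array([50.0, 2000.0, 1.0, 5000.0])             # the declined applicant

# Anchor from the example: "income < $5000 and age > 35 => declined".
# Age and income are held fixed, so every perturbed instance satisfies the
# anchor by construction; only the non-anchored features are resampled.
perturbed = np.tile(x, (1000, 1))
perturbed[:, 2] = rng.integers(0, 2, 1000)            # resample gender
perturbed[:, 3] = rng.integers(0, 20000, 1000)        # resample debt
same_prediction = model.predict(perturbed) == model.predict(x.reshape(1, -1))[0]
print("anchor precision:", same_prediction.mean())    # close to 1 for a good anchor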


Counterfactual explanations


Counterfactual explanations describe a combination of feature changes that would alter the predicted class (Martens and Provost 2014). In other words, they determine which features should change to change the prediction and are consequently sometimes called what-if statements. As mentioned in Sect. 1, this type of explanation is especially human-friendly because it is contrastive and selective (Miller 2019). Counterfactual explanations are somewhat the opposite of Anchors. To revisit the same example: the person asking for a loan wants to know why he will not get one. A counterfactual explanation could then be: if your monthly income rises to $5000, you will get the loan.
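The sketch below illustrates the general idea with a naive greedy search that changes one feature at a time towards a reference value until the predicted class flips. It is a simplified illustration under placeholder assumptions (binary 0/1 classes, training-set medians as candidate values), not a faithful reimplementation of any of the published algorithms discussed later.

import numpy as np

def greedy_counterfactual(model, x, X_train, max_changes=5):
    # Greedily replace one feature at a time by the training-set median until
    # the predicted class flips. Assumes a scikit-learn-style binary classifier
    # with classes encoded as 0/1.
    cf = x.astype(float).copy()
    original_class = model.predict(x.reshape(1, -1))[0]
    reference = np.median(X_train, axis=0)            # candidate replacement values
    changed = []
    for _ in range(max_changes):
        best_feature, best_score = None, -np.inf
        for j in range(len(x)):
            if j in changed:
                continue
            candidate = cf.copy()
            candidate[j] = reference[j]
            # Score: probability mass moved towards the opposite class.
            score = model.predict_proba(candidate.reshape(1, -1))[0][1 - original_class]
            if score > best_score:
                best_feature, best_score = j, score
        if best_feature is None:
            break
        cf[best_feature] = reference[best_feature]
        changed.append(best_feature)
        if model.predict(cf.reshape(1, -1))[0] != original_class:
            return cf, changed                        # counterfactual found
    return None, changed                              # no counterfactual within budget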

Because of their many benefits and the variety of quality measures to optimize for, a sprawl of different counterfactual methods has come into existence. These methods can produce different explanations for the same instance, which is what we investigate in this work.

Guidotti (2022) and Verma et al. (2020) give an overview of counterfactual explanation techniques; since then, the state of the art has unfolded further with, e.g., the introduction of NICE (Brughmans et al. 2023), a counterfactual generation algorithm that simultaneously achieves 100% coverage, model-agnosticism, and fast counterfactual generation for different types of classification models.

Fig. 1: Figurative toy example of LIME, Anchors and counterfactual explanations for a loan approval predictive model, based on Brughmans et al. (2023), Molnar (2018) and Ribeiro et al. (2016)

2.2 Ambiguity due to a lack of consistent evaluation metrics for post-hoc explanations

As referred to in Sects. 1 and 2.1, evaluating XAI methods is a research field in its infancy today, even though a strong need for evaluation methods has been identified by multiple authors, such as Rosenfeld (2021). One reason for the limited amount of research in this field may be the simple fact that evaluating XAI methods is difficult, especially for post-hoc explanation methods: because they explain black-box models, we do not, by definition, know the logic involved in a decision made by such models.

Vilone and Longo (2021) divide XAI evaluation techniques into two groups: those that involve human-centered evaluations and those that evaluate with objective metrics. The first requires human participants to give qualitative or quantitative feedback on XAI explanations, typically through surveys. For the second, more than 35 metrics have been proposed in the literature to date to evaluate XAI explanations. Examples of these metrics are, among others, actionability (knowledge is useful to the end user), efficiency (computational speed of the algorithm), simplification (minimal features), and stability (similar instances should receive similar explanations). The authors conclude that the boom in the number of evaluation metrics calls for a general consensus among researchers on how an explanation should be evaluated.

Note that these objective metrics are sometimes hard to quantify; qualitative quality properties are therefore often approximated numerically. For counterfactual explanations, popular properties are proximity, sparsity, and plausibility (Verma et al. 2020). Proximity is used in some form in every counterfactual algorithm. It measures the total change suggested by the counterfactual explanation with a distance metric (typically the L1 or L2 distance) (Van Looveren and Klaise 2021; Mothilal et al. 2020; Wexler et al. 2019). Intuitively, less change is better than more change in most situations. Sparsity is a special case of proximity: it refers to the number of features in the explanation (the L0 distance) (Karimi et al. 2020; Dandl et al. 2020; Laugel et al. 2018). The argument is that shorter explanations are more comprehensible for humans than longer ones (Miller 1956). Finally, plausibility is a more conceptual property that refers to the closeness to the data manifold (Pawelczyk et al. 2020). For example, in a credit scoring context, advising someone to wait 200 years to get a loan is not plausible.
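These properties are typically operationalised as simple distance computations between the original instance and the counterfactual instance. The sketch below shows common operationalisations; the plausibility proxy used here (distance to the closest training instance) is only one of several options discussed in the literature.

import numpy as np

def counterfactual_quality(x, cf, X_train):
    # x and cf are 1-D numpy arrays; X_train is the training data (n x k).
    diff = cf - x
    proximity_l1 = np.sum(np.abs(diff))                          # total change (L1)
    proximity_l2 = np.sqrt(np.sum(diff ** 2))                    # total change (L2)
    sparsity_l0 = int(np.sum(diff != 0))                         # number of changed features (L0)
    plausibility = np.min(np.linalg.norm(X_train - cf, axis=1))  # closeness to the data manifold
    return {"L1": proximity_l1, "L2": proximity_l2, "L0": sparsity_l0,
            "plausibility": plausibility}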

Counterfactual explanations have an additional advantage compared to feature importance methods. The latter estimate the influence of each feature on the predicted score. These estimates potentially suffer from bias, and features that have almost no influence on the model's decision might be labeled important (Fernández-Loría et al. 2020). Counterfactual explanations do not suffer from this bias: applying the suggested changes of a counterfactual explanation will always lead to a change in prediction. Consequently, a counterfactual explanation is always 'correct' (in the sense that it leads to a class change). However, counterfactual explanations are a simplification of all the information involved in the decision-making. Therefore, different explanations contain different bits of information, and while every counterfactual explanation is 'correct', it is not guaranteed to be useful.

The ambiguity of measuring the quality of counterfactual explanations has led to the development of many counterfactual algorithms and possibly as many different explanations (Verma et al. 2020; Guidotti 2022). As a result, a stakeholder who wants to use counterfactual explanations is presented with many options. This might be an advantage or it can lead to the disagreement problem.

2.3 The disagreement “problem”

The disagreement problem in XAI arises when different interpretability methods, used to explain a given AI model, produce conflicting or contradictory explanations. Because of a lack of broadly used evaluation methods, this is often the case, resulting in explanations that are generally inconsistent and thus ambiguous. Neely et al. (2021) raise the question of whether agreement is suitable as an evaluation criterion for XAI methods. If agreement were used as an evaluation criterion, low agreement would mean that only a few of the XAI methods are right, while the others are far from ideal. However, low agreement is not necessarily a bad thing.

Ambiguity can actually be valuable or can result in ethical consequences (Martens 2022); it all depends on the context in which XAI methods are used (Bordt et al. 2022). Mothilal et al. (2020) argue that diversity among counterfactual explanations is beneficial. For one, it increases the chance of generating usable explanations. For example, when someone is denied a loan according to a prediction model, and the only counterfactual explanation is to change their sex or lower their level of education, this explanation is arguably not useful. Some people prefer an actionable explanation, such as 'increase your income by $X'. This actionability is not uniform across decision subjects. Therefore, providing multiple explanations increases the chance that one explanation is useful for a specific user.

Bordt et al. (2022) examine when ambiguity in explanations is problematic. They differentiate between a cooperative and an adversarial context. In a cooperative context, all stakeholders have the same interests. For example, in most medical applications of AI, both doctors and patients have the same goal: to improve or manage the patient's health. In adversarial contexts, this is not the case; different parties have opposing interests. For example, when a student is denied admission to a prestigious university, the student is interested in challenging this decision. Another example is an autonomous car crashing into a wall to avoid a pedestrian: insurance companies have other interests than the owner of the car or the developers of the software that steers the car's driving decisions. A final example is a denied bank loan, where the bank and the client have different interests. In these cases, it might not be in the model user's best interest to look for the most correct or elaborate explanation of a decision. If diverse explanations are available, the model user will most likely choose the explanation that fits their interests best. An adversarial context can lead to all kinds of ethical issues (Martens 2022). Aïvodji et al. (2019) examine the use of post-hoc explanations to fairwash or rationalize decisions made by an unfair prediction model, while Slack et al. (2020) and Lakkaraju and Bastani (2020) investigate the discriminatory characteristics of explanations. Imagine a model using a prohibited feature such as gender or race, or a feature that is linked to one of these, e.g., zip code, when other, more palatable explanations are available. The model user could choose to ignore the discriminatory explanations and use another one instead. When considering the ethical consequences of disagreement, consensus amongst explanations might be desired. Consensus between explanations could therefore be seen as a training objective to increase user trust (Schwarzschild et al. 2023; Hinns et al. 2021): if two explanations agree, the ethical consequences of choosing one XAI method over another are less severe.

In fact, the scope for explanation providers to manipulate explanations extends beyond the selection of explanation algorithms. Goethals et al. (2023) identify a total of six stages in which explanations can be influenced. Besides algorithm selection, users can also change the parameters of XAI algorithms to influence the explanations. Furthermore, some algorithms are non-deterministic and each run can result in a distinct explanation, which can also be exploited. Less obvious is that manipulation can happen in earlier stages, as changing the training data, the predictive model, or the test data can also lead to different explanations. Our quantitative assessment is limited to the algorithm selection stage. However, our recommendations on how to move forward with disagreement in Sect. 4, and especially the call for transparency, apply to all stages in the framework of Goethals et al. (2023).

2.3.1 Related work

The ethical issues related to the selection of model explanations can only arise if there actually is ambiguity. Neely et al. (2021) were the first to measure the disagreement problem in XAI. They compare LIME, Integrated Gradients, DeepLIFT, Grad-SHAP, Deep-SHAP, and attention-based explanations with a rank correlation (Kendall's \(\tau\)) metric. They conclude that there is only low agreement between the explanations of these methods, between 0.19 and 0.27 depending on the data set used. Krishna et al. (2022) expand the previous study by comparing LIME, KernelSHAP, Vanilla Gradient, Gradient x Input, Integrated Gradients, and SmoothGrad, once again finding disagreement amongst the explanations of different methods, especially when model complexity increases. Instead of only using a rank correlation metric, they use feature agreement, (signed) rank agreement, sign agreement, and rank correlation. Depending on the type of data (tabular, text, or image data), they use different subsets of the above-mentioned evaluation metrics. For tabular data, which are used in this work, they find the rank and signed rank agreement to be significantly lower than the feature agreement. They find the feature agreement to be between 79.1% and 100% when looking at the top 5 features, and 100% when looking at the top 7 features. Next to a quantitative comparison, the authors also perform a qualitative study on how practitioners handle the disagreement problem: 84% of the practitioners interviewed by Krishna et al. (2022) mention encountering the disagreement problem on a day-to-day basis. They report that there is no principled evaluation method to decide which explanations to use; therefore, they simply choose to generate explanations with the XAI method they are most familiar with. Han et al. (2022) extend the study of Krishna et al. (2022) to investigate why the disagreement problem exists for these methods. They conclude that different XAI methods approximate a black-box model over different neighborhoods by applying different loss functions. If two explanations are trained to predict different sets of perturbations, then the explanations are each accurate in their own domain and may disagree. A more focused disagreement study can be found in Roy et al. (2022), where the explanations of LIME and SHAP are investigated for one single defect prediction model. They calculate the feature, rank, and sign agreement also proposed by Krishna et al. (2022) and conclude that LIME and SHAP disagree more on the ranking of important features than on the feature agreement or the sign agreement of the features.

Table 1 Literature overview of the quantitative evaluation of the disagreement between post-hoc XAI methods

Table 1 gives an overview of the scarce literature on the quantitative evaluation of disagreement between XAI methods in relation to our work. To the best of our knowledge, the disagreement problem has not yet been quantified for counterfactual XAI methods. This is remarkable because there has recently been a boom in the number of such algorithms, increasing a malicious user's potential for exploiting the disagreement problem in this category of XAI algorithms. Furthermore, as mentioned in Sect. 1, within counterfactual explanations malicious users have a greater capacity to, for instance, circumvent specific sensitive attributes in the explanation, such as gender, than with feature importance algorithms. In the latter approach, every feature is considered, which facilitates the detection of any inclusion of sensitive attributes in the model. With counterfactuals, a malicious user can deliberately select an explanation that omits gender, even if gender played a role in the model's decision. Consequently, the affected individual may never realize that their gender contributed to the decision. Because counterfactual explanations consistently yield a valid rule, in the sense that following them alters the prediction, the individual affected by the decision is likely to readily accept the provided explanation as the reason behind the initial prediction. We therefore argue that it is important to quantify the disagreement amongst counterfactual explanation algorithms, which is elaborated upon with an example in Sect. 3.

It should be noted that, morally, the disagreement problem for counterfactual explanations is similar to the Rashomon effect, as discussed by Hasan and Talbert (2022). This effect concerns the diversity among multiple counterfactual explanations generated by the same counterfactual algorithm for the same instance and classifier. One explanation might say to change feature A (e.g., wait 5 years to get a loan), while another might say to change feature B while not adapting A (e.g., make sure your income increases by $500 to get a loan immediately), which initially seems like a contradiction as well. The DiCE algorithm, for example, is explicitly focused on generating multiple explanations (Mothilal et al. 2020). In contrast to the Rashomon effect, the disagreement problem concerns diversity amongst different counterfactual explanation algorithms for the same instance and classifier. At the heart of the matter, however, the Rashomon effect and the disagreement problem face the same ethical issues and moral hazards: who chooses which explanation will be used?

3 The quantified disagreement amongst counterfactual explanation methods

In this section, we quantify the disagreement between counterfactual explanation methods. We first illustrate the problem and research questions with an example in Sect. 3.1. Section 3.2 clarifies the large-scale experimental setup. Section 3.3 answers the research questions by providing metrics to quantify the disagreement amongst counterfactual algorithms and applying these metrics in our large-scale experimental setup. Note that measuring the disagreement between counterfactual explanations comes with some new challenges. First of all, counterfactual explanations provide a set of features without ranking them, which makes measures such as (signed) rank agreement unusable. Second, counterfactual explanations vary in size: some algorithms might suggest six feature changes while others suggest only two.

3.1 Example

Table 2 illustrates the disagreement problem for an example instance retrieved from the Adult data set (Dua and Graff 2017). This data set can be used to predict whether a person has an annual income higher or lower than $50,000. The person depicted in the instance is predicted to have an income lower than $50,000. Ten different counterfactual explanation algorithms subsequently generated counterfactual instances indicating which features should change in the original instance in order to change the prediction.

First, it should be noted that one of the counterfactual explanation methods, CBR, is not able to find a counterfactual instance for the given original instance, while the others do find one. When a certain instance needs to be explained, it is consequently useful that other explanation methods are able to find explanations, and thus that some disagreement amongst methods exists. However, in an adversarial context, a malicious counterfactual-generating user who wishes to avoid a certain feature, e.g., sex or race, is able to do so by simply selecting a counterfactual method that does not include these features, i.e., that does not propose changing them with respect to the original instance, such as DiCE, NICE (plaus) or NICE (spars). Imagine we are predicting whether or not this person would qualify for a loan from a bank. A prediction model that uses features like sex or race would then be unethical and discriminatory. The decision maker would be able to hide this fact by secretly choosing a counterfactual method that does not include sex or race in the explanations. This way, the unfair prediction model can still be "rationally" explained. Vice versa, if the malicious user explicitly wants to include a certain feature in an explanation, e.g., hours per week, they can do so with CFproto, NICE (none) or NICE (plaus), once again by simply choosing among the diverse explanations, without any need to impose constraints on the counterfactual search. This shows the arbitrariness/disagreement of the methods and the power it gives to the user of counterfactual generating methods: a user can include, as well as avoid, almost any desired feature in the given explanation.

This example clearly illustrates the possible existence of the disagreement problem and the ethical consequences that can result from it. In the following sections, we examine how easily the disagreement problem can be abused by malicious agents and which driving factors cause it.

Table 2 Example instance (Adult data set)

3.2 Experimental setup

Table 3 gives an overview of the 40 tabular data sets we use in our study. Note that this number is significantly higher than the 4 to 5 data sets used in previous studies of the disagreement problem (see Table 1), which allows us to draw more general conclusions with confidence.

A test set is created for each data set comprising 20% of the data, with a minimum of 200 instances. This means that, e.g., for the threeOf9 data set, we do not use 102 instances in the test set, but 200. The remaining data are used as the training set for a Random Forest classifier (RF) and an Artificial Neural Network (ANN). The final two columns of Table 3 display the AUC values obtained for both classifiers. The hyper-parameters of both models are tuned using five-fold cross-validation. Subsequently, we generate counterfactual explanations with all algorithms for a random sample of 200 instances from the test set. In total, we generate 200 explanations for 10 counterfactual algorithms, Anchors, and SHAP, for 2 classifiers on 40 data sets, resulting in a sample of 192,000 explanations.
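A schematic version of this setup is sketched below. The hyper-parameter grids and the explainers dictionary (mapping a method name to a callable that wraps the corresponding counterfactual, Anchors, or SHAP implementation) are placeholders and not the exact configuration used in our experiments.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

def run_dataset(X, y, explainers, n_explained=200):
    # 20% test set, but at least 200 instances.
    test_size = max(0.2, n_explained / len(X))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)

    models = {}
    for name, est, grid in [
        ("RF", RandomForestClassifier(random_state=0), {"n_estimators": [100, 500]}),
        ("ANN", MLPClassifier(max_iter=1000, random_state=0),
         {"hidden_layer_sizes": [(32,), (64, 32)]}),
    ]:
        search = GridSearchCV(est, grid, cv=5, scoring="roc_auc")  # five-fold CV tuning
        models[name] = search.fit(X_tr, y_tr).best_estimator_

    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_te), size=min(n_explained, len(X_te)), replace=False)
    # One list of explanations per (classifier, explanation method) pair.
    return {
        (clf_name, expl_name): [explain(model, X_te[i]) for i in idx]
        for clf_name, model in models.items()
        for expl_name, explain in explainers.items()
    }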

Table 3 Descriptive statistics and performance metrics of all 40 binary data sets
Table 4 Overview of the counterfactual algorithms used for comparison

We selected a total of 12 post-hoc explanation methods to study the disagreement problem. Our focus lies on ten counterfactual algorithms that are suited for tabular data, are model-agnostic, and have their code publicly available. They are listed in Table 4, and we refer to Appendix 1, Table 13, for the parameter values used for every algorithm. The final column of Table 4 indicates the type of heuristic used by each counterfactual algorithm. The final selection includes the following counterfactual algorithms: DiCE (Mothilal et al. 2020), CFproto (Van Looveren and Klaise 2021), WIT (Wexler et al. 2019), CBR (Keane and Smyth 2020), SEDC (Fernández-Loría et al. 2020), GeCo (Schleich et al. 2021), and four variants of the NICE algorithm (Brughmans et al. 2023), selected to investigate the uniformity of their explanations. We refer to their respective manuscripts for detailed descriptions of the different counterfactual algorithms. Moreover, we also look at their disagreement with both SHAP and Anchors. As mentioned in Sect. 2.2, there is no consensus on what defines the quality of a counterfactual explanation, which has resulted in many algorithms optimizing explanations for different evaluation metrics. When we compare algorithms optimized or evaluated for different metrics, some form of disagreement is expected. However, when comparing explanations from algorithms that optimize for the same metric, one might expect less disagreement.

We notice two distinct groups in Table 4 (divided by a horizontal line). The first six algorithms optimize for plausibility and the last four do not. Hence, we call the first group Plaus and the second group Prox. We make this distinction because every algorithm that optimizes for plausibility also optimizes for proximity (or sparsity) in some way: CBR starts from the closest case, CFproto has proximity in its loss function, WIT selects the closest counterfactual instance from the training set, GeCo selects the fittest candidate in each iteration based on proximity, NICE (none) selects the closest counterfactual instance from the training set, and lastly, NICE (plaus) has sparsity in its reward function.

The second group (Prox) is only interested in providing counterfactual instances that are close to the original instance. We group algorithms that optimize for sparsity and for proximity together because sparsity is a special case of proximity, as explained in Sect. 2.2. Moreover, even though DiCE and NICE (prox) mainly optimize for proximity, both have an indirect optimization for sparsity: DiCE includes a sparsity-enhancing step by reverting as many features as possible to their original value as long as the predicted class does not change, and NICE (prox) has a sparsity loss embedded in its proximity function. We therefore define the Prox group as algorithms that optimize for any distance metric (e.g., the L0, L1, or L2 distance). We start our counterfactual disagreement analysis in Sect. 3.3 by looking at the counterfactual algorithms and the two groups globally, after which we also include pairwise comparisons between the individual algorithms.

For the counterfactual explanations, we consider a feature to be present in the explanation if the counterfactual instance changes that feature compared to the original instance. For Anchors, we consider a feature present simply when it is mentioned in the Anchor explanation. For SHAP, we take the seven most important features as the features present in an explanation (based on Miller (1956) and Krishna et al. (2022)).
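In other words, each explanation is reduced to a set of feature indices before the disagreement metrics are computed. A minimal sketch of this reduction, where the Anchors explanation is assumed to be passed as a list of the feature indices mentioned in the rule:

import numpy as np

def explanation_features(method, original, explanation, shap_top_k=7):
    # Counterfactual: the features changed with respect to the original instance.
    if method == "counterfactual":
        return set(np.flatnonzero(explanation != original))
    # SHAP: the top-k features by absolute attribution (k = 7 here).
    if method == "shap":
        return set(np.argsort(np.abs(explanation))[::-1][:shap_top_k])
    # Anchors: the features mentioned in the rule (given as indices).
    if method == "anchors":
        return set(explanation)
    raise ValueError(f"unknown method: {method}")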

3.3 Results

3.3.1 To what extent can counterfactual disagreement be abused by malicious agents?

The main issue with disagreement amongst counterfactual explanations is that malicious users can select certain explanations to rationalize decisions made by unfair or discriminating models. This can be done by either avoiding certain features to convince stakeholders that they are irrelevant or the other way around, by including certain features to insinuate that they are the main driver of the decision-making process.

We first check how easy it is to exclude a certain feature from a counterfactual explanation. This can be done by looking at the percentage of features that are not present in at least one explanation. Equation (1) formalizes this metric, which we call relative feature exclusion. The numerator counts the unique features that are not present in the explanations of methods a to n; this number is divided by the total number of features \(|F_D|\) in a data set D. Tables 5 and 6 show the average relative feature exclusions for the different data sets, XAI methods, and classifiers:

$$\begin{aligned} \text {Relative feature exclusion}_{[a,n]} = \frac{|(F_D \setminus E_a) \cup (F_D \setminus E_b) \cup \dots \cup (F_D \setminus E_n)|}{|F_D|} \end{aligned}$$
(1)

Next, we investigate the possibility of including an arbitrary feature in a counterfactual explanation. For this, we introduce a metric called relative feature span, see Eq. (2). It measures the percentage of all features that are present in at least one explanation. The numerator equals the absolute feature span and measures the size of the union of all explanations of all explanation methods a to n in the comparison. The absolute feature span divided by \(|F_D|\) is the relative feature span. A higher feature span most likely results from a higher disagreement amongst methods; consequently, the user will be able to choose many features as part of the explanation. The maximum relative feature span of 1 is achieved when every single feature is used in at least one explanation. These relative feature spans are shown in Tables 7 and 8:

$$\begin{aligned} \text {Relative feature span}_{[a,n]} = \frac{|E_a \cup E_b \cup \dots \cup E_n|}{|F_D|} \end{aligned}$$
(2)
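Both metrics can be computed directly from the explanation feature sets. A minimal sketch, where each explanation is a set of feature indices and n_features is \(|F_D|\):

def relative_feature_exclusion(explanations, n_features):
    # Eq. (1): share of features absent from at least one explanation.
    all_features = set(range(n_features))
    excludable = set().union(*(all_features - e for e in explanations))
    return len(excludable) / n_features

def relative_feature_span(explanations, n_features):
    # Eq. (2): share of features present in at least one explanation.
    return len(set().union(*explanations)) / n_features

# Usage with two hypothetical explanation sets over a 7-feature data set:
e_a, e_b = {0, 2, 5}, {2, 3}
print(relative_feature_exclusion([e_a, e_b], 7))  # 6/7: only feature 2 is in both explanations
print(relative_feature_span([e_a, e_b], 7))       # 4/7: features {0, 2, 3, 5} are reachable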

If we revisit the example of Sect. 3.1 and assume that the user only has the first two counterfactual explanation algorithms, CFproto and WIT, available, the relative feature exclusion between these two methods amounts to \(\frac{|(F_D {\setminus } E_{CFproto}) \cup (F_D {\setminus } E_{WIT})|}{|F_{Adult}|}\) or 57.1%, meaning that 57.1% of the features can be avoided by the user when using only these two methods. The relative feature span of both methods amounts to \(\frac{|E_{CFproto} \cup E_{WIT}|}{|F_{Adult}|}\) or 85.7%. This means that 85.7% of the features are present in the explanations of CFproto and WIT, and can consequently be chosen by the user. If all counterfactual explanation methods of Table 2 are available to the user, the overall relative feature exclusion equals 100%. This means that any feature can be left out of the explanation if all methods are available to the user. The overall relative feature span equals 92.9%; only the feature 'capital loss' is never used in the explanations.

Our results in Tables 5 and 6 show that excluding certain features is particularly easy when multiple explanations are available. To obtain the relative feature exclusion for every data set, we first calculate the relative feature exclusion for each of the 200 counterfactual explanations individually and then average these numbers. The average relative feature exclusion is over 99.6% for both classifiers over all counterfactual methods, and 99.8% if we include Anchors and SHAP. For many data sets, the average relative feature exclusion is even 100.0%, meaning that for every instance that has to be explained, any feature of choice can be excluded from the explanation. These results show that it is fairly easy to avoid sensitive features in order to falsely justify model decisions.

Including an arbitrary desired feature in an explanation seems slightly more difficult. The average relative feature span for all counterfactual methods (depicted in Tables 7 and 8) is 62.1% (63.3%) for the RF (ANN) classifier, and 73.9% (72.8%) if we include Anchors and SHAP. However, there are still data sets with a relative feature span of 100.0%. Similarly to the relative feature exclusion, we obtain the relative feature span for every data set by first calculating it for each of the 200 counterfactual explanations individually and then averaging these results. Relative feature spans vary tremendously over different data sets, groups of XAI methods, and classifiers. We refer to Sect. 3.3.2 for a more detailed examination of the drivers of this counterfactual disagreement.

To conclude, a malicious agent can easily both exclude and include desired features when multiple counterfactual algorithms are available. Especially excluding certain features to hide their influence in the prediction model, while still using them to generate predictions, can easily be done by leveraging the disagreement problem.

Table 5 Relative feature exclusion for the RF classifier
Table 6 Relative feature exclusion for the ANN classifier
Table 7 Relative feature span for the RF classifier
Table 8 Relative feature span for the ANN classifier

3.3.2 What are the drivers of counterfactual disagreement?

The disagreement problem makes it easy for malicious agents to exclude or include certain features. While the variation in feature exclusion is minimal, Tables 7 and 8 show that the feature span does vary. In this section, we investigate the drivers of this variation: whether the data set, the counterfactual algorithms, or the classifier causes the variation in counterfactual disagreement. This should help to identify when the possibility of feature disagreement is high.


Data set


Tables 7 and 8 show that there is a lot of variance over the different data sets. Over all counterfactual explanations, the relative feature span varies from around 20% to 100% for both classifiers. However, this variance seems to be random: there is no observed relationship between the characteristics of the data set and the disagreement metrics. Table 9 shows that the number of features in the data set and the AUC of the trained models have only a very weak correlation with these disagreement metrics.

Table 9 Correlation between data set characteristics and relative feature span or relative feature exclusion

Counterfactual algorithms


As shown in Table 4, the counterfactual algorithms can be divided into two groups, plaus and prox: those that optimize for plausibility and those that only optimize for proximity. It might be that the existing disagreement only originates from the disagreement between these groups and not from the disagreement within them. To verify this, we also calculated the relative feature span within these groups in Tables 7 and 8, and a noticeable difference can be seen: the span within the plaus group is 20.1% to 23.6% higher than the span within the prox group.

Figure 2 visually stresses the difference between the two groups with box plots. The center of gravity for the prox group is not only lower but also less broad than that of the plaus group: even though the variance stretches over the entire x-axis, the bulk of the relative feature spans for this group lies between 22% and 42% (20% and 39%) for the RF (ANN) classifier. In contrast, the relative feature span for the plaus group lies mainly between 39% and 86% (40% and 86%) for the RF (ANN) classifier.

To add more detail, we split the plaus group into exogenous and endogenous counterfactual algorithms. Endogenous counterfactual algorithms provide explanations based on observed instances, while exogenous counterfactual algorithms give explanations based on unobserved instances (Crupi et al. 2022). In our experiments, NICE (none), NICE (plaus), CBR, WIT and GeCo are endogenous; CFproto is exogenous. On average, the exogenous algorithm results in explanations with a higher feature span than the endogenous ones. However, the Plaus Endo group has a wider spread when looking at the bulk of the relative feature spans.

This difference can simply be explained by the difference in sparsity between both groups, as seen in Table 11. Sparsity here refers to the number of features changed in a counterfactual explanation, normalised by the total number of features in the data set. Optimizing for proximity (or for sparsity directly) has a direct effect on this number of features in the explanations. Therefore, the average sparsity of the prox group is much lower than that of the plaus group. Consequently, having fewer features on average in each explanation also results in a lower relative feature span for this group. In fact, if we were to look at only one counterfactual algorithm, the relative feature span would equal the normalized sparsity and the relative feature exclusion would equal 100% minus the normalized sparsity.

Furthermore, the plaus group seems to account for most of the feature span of all counterfactual explanations. The difference between the plaus group and the group of all counterfactual explanations is less than 5% for both classifiers.

Fig. 2: Box plots of the relative feature span for different algorithm groups for the RF and ANN classifiers

To first get a grasp of how similar the explanations of different counterfactual algorithms are, we examine the pairwise scaled L0 distances between the counterfactual instances of the counterfactual methods. This metric counts the number of features that two explanations have in common and divides this number by the total number of features in the data set.

$$\begin{aligned} \text {L0 distance}_{ab} = \frac{|E_a \cap E_b|}{|F_D|} \end{aligned}$$
(3)

To quantify the pairwise disagreement amongst counterfactual explanations, Anchors, and SHAP, we first introduce a new measure called feature disagreement in Eq. (4). This measure is similar to the feature agreement metric introduced by Krishna et al. (2022) but adapted to the variable explanation sizes of which counterfactuals consist.

$$\begin{aligned} \text {Feature disagreement}_{ab} = \frac{|E_a \setminus E_b|}{|E_a|} \end{aligned}$$
(4)

When comparing the explanations \(E_a\) and \(E_b\) of two methods A and B, the feature disagreement of A with B is equal to the size of \(E_a \setminus E_b\) (the features of \(E_a\) that are not in \(E_b\)) divided by the size of \(E_a\). It measures the relative number of features that are in \(E_a\) but not in \(E_b\). When the feature disagreement equals 1, none of the features of \(E_a\) are also in \(E_b\).

It could be argued that the Jaccard similarity, being an existing metric, could be used to quantify the pairwise disagreement amongst counterfactual methods. However, feature disagreement has two advantages over Jaccard similarity. First, it contains information about the direction of disagreement: the feature disagreement of counterfactual A with counterfactual B is not equal to the feature disagreement of counterfactual B with counterfactual A, whereas Jaccard similarity is symmetric. Second, the feature disagreement of counterfactual A with counterfactual B tells us the percentage of features in counterfactual A that contribute to a higher feature span on top of counterfactual B. For these reasons, we continue with this metric and refer to Appendix 1 for the Jaccard similarity analysis.
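A minimal sketch of both measures on explanation feature sets, with hypothetical example sets:

def feature_disagreement(e_a, e_b):
    # Eq. (4): share of features in explanation A that are absent from B.
    # Note the asymmetry: feature_disagreement(a, b) != feature_disagreement(b, a).
    return len(e_a - e_b) / len(e_a) if e_a else 0.0

def jaccard_similarity(e_a, e_b):
    # Symmetric alternative, reported in Appendix 1.
    union = e_a | e_b
    return len(e_a & e_b) / len(union) if union else 1.0

# Hypothetical explanation sets:
e_a, e_b = {0, 1, 4}, {1, 2}
print(feature_disagreement(e_a, e_b))  # 2/3: features 0 and 4 are unique to A
print(feature_disagreement(e_b, e_a))  # 1/2: feature 2 is unique to B
print(jaccard_similarity(e_a, e_b))    # 1/4: one shared feature out of four in total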

Let us revisit the example in Sect. 3.1 and once again assume the user only has the first two counterfactual explanation algorithms, CFproto and WIT, available. The pairwise scaled L0 distance between CFproto and WIT equals \(\frac{|E_{CFproto} \cap E_{WIT}|}{|F_{Adult}|}\) or 35.7%. The feature disagreement of CFproto with WIT equals \(\frac{|E_{CFproto} {\setminus } E_{WIT}|}{|E_{CFproto}|}\) or 50%: CFproto has 50% unique features with respect to WIT. Vice versa, the feature disagreement of WIT with CFproto equals \(\frac{|E_{WIT} {\setminus } E_{CFproto}|}{|E_{WIT}|}\) or 28.6%: WIT has 28.6% unique features with respect to CFproto.

The pairwise scaled L0 distances are shown in Table 14 in Appendix B. We visualize these distances in a 2D plot by using multidimensional scaling (MDS) in Fig. 3a and b. Note that the numbers on the x- and y-axes of these figures have no interpretable meaning; only the relative Euclidean distances between two points are meaningful. The closer two points lie together, the more similar the resulting counterfactual instances of the corresponding methods are. NICE (prox) and NICE (spars) optimize for very similar metrics with the same optimization method and therefore result in very similar counterfactual instances. The same can be said for NICE (none) and WIT: both methods use real instances from the training set in their explanations, and those instances seem to be quite close to each other. Surprisingly, SEDC and CBR are similar as well. GeCo provides counterfactual instances that are the farthest away from all other methods. This might be because GeCo's explanations contain many features in general: Table 11 shows that GeCo on average uses around 74% of all features in a single counterfactual explanation.
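For reference, such a 2D layout can be produced with multidimensional scaling on a precomputed dissimilarity matrix, as sketched below with scikit-learn. The matrix values here are hypothetical (the real ones are in Table 14), and if the overlaps of Eq. (3) are used they would first have to be converted into dissimilarities, e.g., as one minus the scaled overlap.

import numpy as np
from sklearn.manifold import MDS

# Hypothetical symmetric dissimilarity matrix between four explanation methods.
D = np.array([[0.0, 0.2, 0.6, 0.5],
              [0.2, 0.0, 0.7, 0.4],
              [0.6, 0.7, 0.0, 0.3],
              [0.5, 0.4, 0.3, 0.0]])

# Only the relative Euclidean distances in the resulting layout are meaningful.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # one (x, y) point per explanation method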

Fig. 3: Multidimensional scaling (MDS) for the L0 distance between the counterfactual explanation methods (blue: Prox, red: Plaus)

The pairwise feature disagreement identifies which algorithms disagree the most with others while taking into account the number of features present in their explanations. Table 10 shows that most post-hoc explanation methods have a high number of features that are not present in CBR, SEDC or SHAP (the darkest columns in Table 10), even though the sparsity of these methods is not necessarily low (see Table 11). Overall, GeCo is the counterfactual algorithm that generates the most features that are not available in the explanations of other algorithms. Once again, this can be largely attributed to the fact that GeCo has the worst sparsity, meaning that it has the most features in its generated explanations.

The high feature disagreement of SHAP and Anchors with counterfactual explanations confirms that the disagreement between different types of post-hoc explanation methods is larger than the disagreement among counterfactual explanation methods.

Lastly, Table 12 presents the average pairwise disagreement between the counterfactual method in the first column and the other counterfactual algorithms within the same group (intra) and between groups (inter). For example, the intra-group average for CBR (a member of the plaus group) equals the average of the relative feature disagreements between CBR and the other members of the plaus group: CFproto, WIT, GeCo, NICE (none) and NICE (plaus). The inter-group average, on the other hand, amounts to the average of the relative feature disagreements between CBR and the members of the prox group: DiCE, NICE (prox), NICE (spars), and SEDC. It is clear that for the plaus group the intra-group averages are significantly lower than the inter-group averages, whereas for the prox group the inter-group averages are lower. Since counterfactual algorithms of the prox group generate explanations with fewer features than the other group, the chances of disagreement are higher. Vice versa, it is easier to find agreement when many features are present in the explanations, which is the case in the plaus group.

Table 10 Relative feature disagreement for the RF and ANN classifier
Table 11 The average L0-distance (normalised sparsity) between the instance to explain and the counterfactual instance
Table 12 Intra and inter group relative feature disagreement

Classifier


Surprisingly, the classifier used does not have a critical influence on the size of the disagreement problem. For each data set, the difference in the average relative feature span between both classifiers is minimal. Moreover, the correlation between the results for both classifiers is more than 99%. The same conclusion can be drawn from the L0 distances between the counterfactual instances in Fig. 3 and from the relative feature disagreements in Table 10: both metrics show little variation between the RF and the ANN.

In conclusion, both the data set and the group of counterfactual algorithms determine the variation in the disagreement metrics. Perhaps more surprisingly, we find that the classifier has little to no influence on the results and the variation obtained.

4 Discussion

Our experiments reveal severe disagreement amongst various counterfactual explanation methods, suggesting that this disagreement may surpass the feature disagreement observed for feature-importance techniques in previous research (Krishna et al. 2022; Roy et al. 2022; Neely et al. 2021). Note that direct comparisons with these prior studies are challenging due to their focus on rank and sign agreement of the top k features, metrics not directly applicable to counterfactual explanations. Our analysis centers on a variant of the feature agreement metric used by Krishna et al. (2022), adapted to the variable explanation sizes of counterfactual methods. Krishna et al. (2022) reported one hundred percent feature agreement among the tested feature-importance techniques for \(k = 7\). This finding starkly contrasts with the significantly higher feature disagreement rates identified in our study, highlighting the unique challenges and potentially greater biases within counterfactual explanation methods.

This discrepancy is not entirely surprising, given the intrinsic challenges associated with evaluating counterfactual explanations (cf. Sect. 2.2). The diversity of desired properties these methods aim to optimize contributes to the diversity in their outcomes. One resolution might be to identify an optimal counterfactual property, providing a unified optimization objective for all counterfactual algorithms and thereby minimizing the disagreement. However, the field is currently far from reaching a consensus on what this property should be. Perhaps we should ask ourselves whether the field of XAI will ever be able to fully address this problem, or whether it is simply a consequence of the field's core objective: XAI seeks to explain decisions made by complex predictive algorithms in terms understandable to humans. The gap between the complexity of these predictive algorithms and human comprehension is vast. Therefore, the resulting explanations are often simplifications, leading to an inevitable loss of information. Most counterfactual explanations capture only a portion of this information, representing a single point of view on the prediction logic involved. It is these varied points of view that lead to the disagreement problem. Different stakeholders inevitably have preferences for distinctive points of view, amplifying this challenge. This is compounded by information asymmetry, where one stakeholder (the decision maker) possesses more information and can determine what is shared with others (the decision subject). Such a scenario potentially opens the door to manipulative practices. Consequently, there is a pressing need for awareness and for mechanisms that circumvent the disagreement problem as much as feasible.

Currently, we identify three potential solutions, though none emerges as flawless. The first is to present all counterfactual explanations to the decision subject. This approach empowers decision subjects with the autonomy to select the explanation that best suits them, effectively limiting the opportunity for manipulation by the decision maker. However, this presents challenges. The information might be too extensive for the decision subject, undermining the primary objective of XAI, which is to make algorithmic decision-making comprehensible. Moreover, there is a risk of providing more information than necessary. Companies invest heavily in predictive modeling, and the confidentiality of these models often preserves their competitive edge; sharing too much information could inadvertently reveal insights to competitors. The ideal scenario would strike a balance, sharing sufficient information to validate the decision while avoiding excessive disclosure about the model. Yet, there is no guarantee that such a balance can be attained.

A second potential solution is to summarize multiple counterfactual explanations into one combination, as suggested by Fernández et al. (2022) and Carrizosa et al. (2024). However, these approaches introduce an additional step in the framework of Goethals et al. (2023). Further research would be needed to determine whether these combinatorial algorithms (dis)agree in their explanations. The usefulness and properties of these combinations should also be further investigated.

We conclude that the current state of research has yet to sufficiently address the disagreement problem from a technical point of view. Currently, the most viable solution might be to give the decision subject insight into the process of arriving at explanations. Providing transparency should ensure that decision makers can be held accountable for their actions. Ideally, such a principle would be enforced by regulation. This is particularly crucial in high-risk decision-making scenarios where individuals bear the consequences of algorithmic processes. One potential method to enforce this transparency is to introduce an audit mechanism when deploying these XAI algorithms. Decision makers could then provide only one or a limited number of explanations to the decision subject, but they would need to document and submit the reasoning that led to the deployment of these XAI algorithms for review by a third party. This entity would then evaluate whether the interests of the decision subject have been appropriately considered. However, this oversight should not be confined to the selection of the XAI algorithm alone. As mentioned in Sect. 2.3, Goethals et al. (2023) identify several more avenues for manipulation in the decision-making process. To close these vulnerabilities and maximize accountability, the entire decision-making process (data selection, modeling, parameter choices, algorithm selection, etc.) should be thoroughly justified and presented to a third party for evaluation.

5 Conclusion and future research

In our large-scale empirical analysis on 40 data sets, yielding 192,000 generated explanations, we provide evidence of the existence of the disagreement problem amongst different counterfactual explanation algorithms. If a malicious agent has the option to choose between the 10 counterfactual algorithms examined in our experiments, it is very easy to exclude features of their choice from an explanation. Including a feature of choice is slightly more difficult, but in many cases the relative feature span is still 100%, giving the decision maker full freedom to include a certain feature.

Moreover, we conclude that the size of the disagreement problem is highly dependent on the data set and the counterfactual methods used, and not so much on the classifier used. However, we want to stress again that, in contrast to other post-hoc explanation methods, disagreement between counterfactual explanations does not mean that any explanation is wrong. On the one hand, a counterfactual explanation cannot be wrong, as the suggested feature changes will by definition lead to a class change. On the other hand, these suggested changes may not be useful to certain stakeholders. Situations with high disagreement between counterfactual explanations therefore signal that one single explanation fails to capture the full complexity of a decision made by a prediction model.

By proving the existence of the disagreement problem amongst counterfactual explanation methods, we demonstrate the potential rise of ethical issues. Especially in an adversarial context, where the goals of the stakeholders are not aligned, these ethical issues occur when users are able to choose which explanations are used, giving them a lot of power. To avoid moral issues, this power should ideally be in the hands of the decision subject, as they carry the possibly life-changing consequences of the decision. We discussed how giving the explanatory decision power to the decision subject can in turn create new issues. Some of these issues could potentially be solved by developing new algorithms that combine many counterfactual explanations into one. However, this may simply move the problem to the level of the algorithms that combine these counterfactual explanations. Therefore, we argue that currently the most viable solution might be to provide transparency: decision makers should document all decisions that led to the explanations, allowing a third party to determine whether the interests of the decision subject are taken into account.