1 Introduction

Machine Learning (ML), a field of Artificial Intelligence, paves the way for emerging technologies and advancements in a wide variety of sectors. ML provides solutions in manufacturing, such as predictive maintenance [1, 2]; banking, including credit scoring [3, 4] and risk management [5]; insurance, for fraud detection [6] and damage estimation [7]; and healthcare, improving the efficiency of care delivery [8] and aiding in the diagnosis or prognosis of numerous diseases [9, 10]. Additionally, ML finds applications in education, specifically in predicting student learning performance [11], and in social media, for the detection of online hate speech [12].

Even though ML elevates those sectors, societal and ethical issues may arise in high-risk scenarios. For example, credit-scoring algorithms have been found to discriminate between minority and majority populations, driving minorities towards poverty and homelessness [13]. Owing to concerns about performance, bias, and poor trust, an insurance company pulled the plug on an AI tool designed to detect fraud in claims through videos [14]. Reports of inappropriate patient treatments [15], as well as the use of biased risk prediction models [16], have raised concerns in both society and the research community. Thus, legal frameworks and regulations from many sources, such as the General Data Protection Regulation (GDPR) [17] of the EU, the European AI Act [18], and the Equal Credit Opportunity Act of the US, aim to establish requirements that every ML-powered system should satisfy.

One of these requirements is explainability, which led to the establishment of the Explainable AI (XAI) area [19, 20]. Interpretable ML (IML), a subfield focused on the interpretation of machine learning models, has attracted the attention of the research community [21, 22]. Not every machine learning model, especially deep learning models, can provide explanations on its own; therefore, IML has introduced explainability techniques to cover models that are not intrinsically interpretable. Such explanations can take the form of Feature Importance, Rules, and Counterfactual explanations, among others. In particular, Feature Importance (FI) techniques estimate the influence of each feature on the prediction. Each type of explanation has its own set of evaluation metrics, and for FI, they include robustness [23], faithfulness [24], infidelity [25], and truthfulness [26].

Nevertheless, techniques that generate feature importance explanations can only provide approximate information when applied to very complex models. Therefore, it is necessary to effectively evaluate the output of such techniques. Additionally, researchers and practitioners face challenges in selecting the most suitable explainability technique from the numerous available options. Consequently, an ensemble of explainability techniques or an automatic selection tool can be highly valuable. One approach to address the ensembling of explainability techniques is through aggregation, such as using averaging techniques or optimisation [27]. However, research in this area is limited, and these methods heavily rely on explainability metrics. Several metrics have been proposed to assess the quality of an explanation, depending on its form, including fidelity [28], coverage [29], and stability [30]. Nonetheless, most of these metrics are not useful for the end user.

Explaining the explanations is another interesting direction, and argumentation can be a first step in this direction. In particular, argumentation is the study of how conclusions can be reached through a logical chain of reasoning, that is, claims based, soundly or not, on premises [31]. IML and argumentation both aim to persuade someone to accept the legitimacy of a decision. In the philosophy of science, it is debatable whether explanations are arguments or not. An intriguing point of view distinguishes between arguments and explanations, stating that arguments are used to justify something in dispute, whereas explanations are used to provide the meaning of something incomprehensible [32].

In this work, based on our previous preliminary work [26], we aim to combine three concepts: a complete and user-oriented explainability metric, called truthfulness; a meta-explanation technique that ensembles multiple explanation techniques based on this metric; and a complementary argumentation framework.

Section 2 of the paper covers the necessary theoretical concepts, while Section 3 presents related studies. Section 4 introduces our technique, and Section 5 evaluates it through a series of experiments. Finally, in Section 6, we discuss the findings and provide our concluding remarks and future plans.

2 Background

In this section, we introduce the basic notions that underlie our approach. We will discuss machine learning and interpretable machine learning concepts, as well as a few argumentation frameworks.

2.1 Machine learning

Machine Learning (ML) is a cutting-edge technology that forms the core of new and innovative products. We can use ML to solve both supervised and unsupervised problems. In this paper, we focus on supervised problems such as binary classification and regression. Thus, given a dataset D, containing instances \(x_i \in X \subseteq \mathbb {R}^l\), where l is the size of the feature space \(F=[f_1, f_2, \dots , f_l]\), and their predictions \(y_i \in Y \subseteq \mathbb {R}\), we can train a model P to predict y given an instance x, \(P(x)=y\).

Based on the data type, \(x_i\) can have different shapes. In tabular data, \(x_i\) has l values corresponding to the l different \(f_i\) features. When dealing with textual data, such as sentences, we can have multiple representations. The simplest is to use Bag-of-Words or TF-IDF vectors, which are one-dimensional and express each sentence \(x_i\) as a vector with l different values, where l is equal to the size of the vocabulary. We can also have more complex representations, such as word embeddings, which can be two-dimensional. These representations, given a fixed sentence length s, express each word of the sentence as a vector of size e. If \(x_i\) represents multivariate time-series data, then it contains \(l\times m\) values, that is, l values for the l different \(f_i\) features across m time-steps. Finally, when dealing with images, we must handle three-dimensional inputs. The first two dimensions represent the image’s resolution, while the third expresses its colour channels. Therefore, we can deal with either 1D (tabular or textual data), 2D (textual or time-series data) or 3D (image data) inputs.
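To make these input shapes concrete, the following minimal sketch illustrates the four cases with NumPy arrays; the dimensions used are hypothetical placeholders, not those of our datasets.

```python
import numpy as np

l, m = 14, 50          # hypothetical: l features, m time-steps
s, e = 250, 768        # hypothetical: sentence length s, embedding size e
h, w, c = 128, 128, 3  # hypothetical: image resolution h x w, c colour channels

x_tabular = np.zeros(l)          # 1D: one value per feature f_1, ..., f_l
x_text_bow = np.zeros(l)         # 1D: Bag-of-Words / TF-IDF over a vocabulary of size l
x_text_emb = np.zeros((s, e))    # 2D: one embedding of size e per word, fixed length s
x_timeseries = np.zeros((m, l))  # 2D: l sensor values across m time-steps
x_image = np.zeros((h, w, c))    # 3D: resolution plus colour channels
```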

We can choose from a variety of ML models according to the task, data type, and size of the dataset, ranging from traditional algorithms (logistic regression and support vector machines) to ensemble algorithms (random forests and XGBoost), and deep neural networks such as CNNs, LSTMs, and Transformers. In this work, we will focus on neural networks and will employ three different types. The first type of neural network is linear, with only an input layer and an output layer; the data, regardless of shape, are handled by the network via a flattening layer. The second type concerns network architectures that are designed specifically for the task, while the third type is a more complex version of the second. These networks contain feed-forward, convolutional, recurrent, bidirectional, and attention layers, allowing us to showcase our approach on a wide range of networks.

2.2 Interpretable machine learning

With the increasing adoption of ML, there is a need for more transparent and understandable decision systems in many sectors. IML, a subfield of XAI, aims to make ML models more accessible and transparent. Some ML models, like linear models or decision trees, are intrinsically interpretable, while others, like ensembles or neural networks, are most of the time more complex and uninterpretable. As a result, we require techniques to explain them.

IML approaches might be global, revealing an ML system’s whole structure and working mechanism, or local, explaining a specific decision. We can also distinguish between techniques that are applicable to any ML model, known as model-agnostic techniques, and techniques that are limited to specific ML algorithms or architectures, known as model-specific techniques. For example, RuleFit is a global, model-specific technique [33], while LIME is a local, model-agnostic technique [34].

Another aspect is the applicability of an explainability technique to different data types. Some algorithms are applicable only to specific data types, while others are data-type independent. For example, LionForests [35] is a data-type-specific algorithm applicable only to tabular data, while Anchors [36] is a data-type-independent algorithm. Furthermore, we can distinguish explainability techniques based on how they provide explanations. There are numerous ways to present an explanation. Several techniques generate rule-based explanations, while others use weights to indicate the importance of input features.

The latter has been expressed using various terms, such as attribution importance, saliency maps, and feature importance, among others. In this work, we will use the last term, feature importance (FI). Depending on the explainability technique, FI explanations can be global or local. Therefore, given a model P, an instance \(x_i\) (as presented in Section 2.1), and an FI explainability technique Z, the explanation will be \(Z(P,x_i)=[z_1, z_2, \dots , z_l]\), where \(z_j\) corresponds to a weight – a.k.a. attribution or importance score – for the \(j^{th}\) value of instance \(x_i\).
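In code, this notation can be mirrored by a minimal, hypothetical interface: an FI technique is any function that maps a model P and an instance \(x_i\) to one importance score per input value. The names below are illustrative and do not correspond to a specific library.

```python
from typing import Callable
import numpy as np

# Z(P, x_i) = [z_1, ..., z_l]: an FI explainer maps a model P and an instance x_i
# to a vector of importance scores, one per value of x_i.
FIExplainer = Callable[[Callable[[np.ndarray], float], np.ndarray], np.ndarray]

def random_explainer(P: Callable[[np.ndarray], float], x_i: np.ndarray) -> np.ndarray:
    """Toy baseline: assigns a random importance score z_j to each value of x_i."""
    return np.random.uniform(-1, 1, size=x_i.shape)
```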

A variety of algorithms have been proposed in this field. LIME [34], SHAP [37], and Permutation Importance [38] are among the most well-known model-agnostic feature importance explainability algorithms. A plethora of model-specific algorithms, on the other hand, have also been proposed. In neural networks, algorithms exploiting the backpropagation operation, like Layer-wise Relevance Propagation (LRP) [39] and Integrated Gradients (IG) [40], retrieve the influence of the input on the output.

2.3 Argumentation

Argumentation theory is a fundamental concept in AI with numerous applications, one of which is in the criminal justice field [41]. Argumentation procedures show step-by-step how a decision was reached. Therefore, argumentation is considered highly interpretable [42]. However, that is not always the case. Every argumentation procedure is based upon an argumentation framework, and depending on the framework employed, some argumentation procedures are interpretable but not explainable. Classical argumentation based on logic, as proposed by Hunter et al. [43], is a simple, yet explainable argumentation framework with many capabilities.

Argumentation based on Classical Logic (CL) concerns a framework defined exclusively with logic rules and terms. In this framework, an argument is a sequence of inferences leading to a claim. Specifically, an argument is a pair \(\langle \Phi , \alpha \rangle \) such that \(\Phi \) is consistent (\(\Phi \nvdash \perp \)), \(\Phi \vdash \alpha \), and \(\Phi \) is a minimal subset of \(\Delta \) (a knowledge base), which means that there is no \(\Phi '\subset \Phi \) such that \(\Phi ' \vdash \alpha \). \(\vdash \) represents the classical consequence relation. In this framework, counterarguments, the defeaters, are defined as well. \(\langle \Psi , \beta \rangle \) is a counterargument for \(\langle \Phi , \alpha \rangle \) when the claim \(\beta \) contradicts the support \(\Phi \). Furthermore, two more specific notions of counterargument are defined: undercut and rebuttal. Some arguments specifically contradict other arguments’ support, which leads to the undercut notion. An undercut for an argument \(\langle \Phi , \alpha \rangle \) is an argument \(\langle \Psi , \lnot (\phi _1 \wedge \dots \wedge \phi _n)\rangle \) where \(\{\phi _1, \dots , \phi _n\} \subseteq \Phi \). If two arguments directly oppose each other, we have the most direct form of dispute, represented by the concept of a rebuttal. An argument \(\langle \Psi , \beta \rangle \) is a rebuttal for an argument \(\langle \Phi , \alpha \rangle \) if \(\beta \leftrightarrow \lnot \alpha \).

Argumentation begins when an initial argument is put forward and some claim is made. This leads to an argumentation tree Tr whose root node is the initial argument. Objections can be posed in the form of counterarguments, represented in Tr as children of the initial argument. These counterarguments can in turn be countered, giving rise to deeper levels of the tree. Finally, a judge function decides whether a Tr is Warranted or Unwarranted, based on marks assigned to each node as either undefeated U or defeated D. A Tr is judged as Warranted, Judge(Tr) = Warranted, if Mark\((A_{r}) = U\), i.e., the root node \(A_r\) of Tr is undefeated. For every node \(A_i \in \) Tr, if there is a child \(A_j\) of \(A_i\) such that Mark(\(A_j) = U\), then Mark\((A_{i}) = D\); otherwise Mark\((A_i) = U\).
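As a minimal sketch of this marking and judging procedure, assuming the argumentation tree is represented as nested nodes (the node contents below are illustrative):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ArgumentNode:
    claim: str
    children: List["ArgumentNode"] = field(default_factory=list)  # counterarguments

def mark(node: ArgumentNode) -> str:
    """A node is defeated (D) if any child is undefeated (U); otherwise it is undefeated."""
    return "D" if any(mark(child) == "U" for child in node.children) else "U"

def judge(root: ArgumentNode) -> str:
    """A tree is Warranted iff its root argument is marked as undefeated."""
    return "Warranted" if mark(root) == "U" else "Unwarranted"

# The root argument is attacked by a counterargument, which is itself attacked by an
# undefeated leaf; the counterargument is therefore defeated and the tree is Warranted.
root = ArgumentNode("the explanation is trusted",
                    [ArgumentNode("the importance of f_2 is untruthful",
                                  [ArgumentNode("the alteration of f_2 behaved as expected")])])
print(judge(root))  # Warranted
```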

3 Related work

In this section, we will present feature importance evaluation metrics found in the literature, as well as a few meta-explanation techniques we identified.

3.1 Evaluation

One key evaluation metric in the IML research area is fidelity. It was first used to evaluate the performance of surrogate models and their ability to mimic the black box models they were explaining. We can define fidelity as the accuracy of a surrogate model on a test set with respect to the complex model’s decisions. This metric, however, had several shortcomings, because it was not user-centric and could not be applied to non-surrogate explainability techniques. Influenced by fidelity, faithfulness and faithfulness-based metrics were therefore introduced [44].

While the origins of the initial faithfulness-based measure are unclear, one of the first studies to propose it aimed at evaluating sentence-level explanations in text classification tasks [24]. In this study, for a given instance, the sentence with the highest importance score was removed and the change in the prediction was recorded. The higher the change in the prediction, the better the explanation. A different definition of faithfulness was also provided by another study [23], measuring the correlation between importance and prediction by continuously removing the most important elements from the input and observing the output.

Several variations on faithfulness were also introduced. Decision Flip (most informative token) removes the most informative token and rewards the explanation if and only if the prediction changes [45], whereas Decision Flip (fraction of tokens) identifies the number of important tokens that must be removed to flip the model decision [46].

Two other metrics, comprehensiveness and sufficiency, were introduced as faithfulness alternatives [47]. The former evaluates the explanation by deleting a set of elements from the input and observing the change in the prediction, whereas the latter does so by preserving only the important ones and removing the rest.

Monotonicity, also known as PP Correlation, is another similar metric [48]. It adds elements in descending order of priority, beginning with an empty input. The prediction should increase proportionally to the importance of the new elements. The correlation between the prediction and importance scores is then used to calculate monotonicity.

In a previous work of ours, Truthfulness was introduced as a faithfulness-based metric that focuses only on the polarity of the feature importance weights [26]. It analyses every element of the input and, by making different alterations, monitors the model’s behaviour. An additional metric, influenced by faithfulness and truthfulness, proposed to consider both importance correlation and polarity consistency, is the Faithfulness Violation Test [49]. This metric captures both the correlation between the importance scores and the change in the probability, while it also examines whether the sign of the explanation weights correctly indicates the polarity of input impact, similarly to truthfulness.

Fig. 1 Workflow of MetaLion

In addition to the metrics mentioned in this discussion, numerous other metrics are available. The Quantus GitHub repository offers various variations of the faithfulness metric, as well as metrics related to robustness, complexity, randomization, and other related concepts [50].

A few studies introduced datasets with ground truth rationales, which are golden explanations. Rationales can be used to evaluate explainability techniques using traditional ML metrics like the F1 score and the area under the precision-recall curve (AUPRC). One work proposed ERASER, a benchmark for NLP models, which includes datasets containing both document labels and snippets of text recognised as explanations by annotators [47]. However, in real applications, most datasets do not contain ground truth information regarding the explanations, and as such these evaluation approaches cannot be applied. Furthermore, we can only assume that humans are capable of annotating unbiased rationales [51]. Nevertheless, such benchmarks remain useful for comparing newly proposed explainability techniques.

3.2 Meta-explanations / aggregation

Different aggregation procedures were first introduced in a very interesting study [27]. Attempting to combine multiple explanation techniques, metrics such as sensitivity, faithfulness, and complexity [52] are used over different combination strategies, among which Mean and Median are presented. Through experimentation, it is suggested that aggregating leads to a smaller error compared to the error of an explanation produced by a single technique. Moreover, another combination strategy is presented: for a given instance, a set of near neighbours is identified, explanations are extracted for the predictions of these neighbours, and the final explanation is the aggregation of these explanations, weighted by the distance of the neighbours to the original instance. The latter was designed to lower sensitivity and complexity.

A recent study introduced a method, called inXAI, that enables the combination of explanations provided by multiple techniques, using specific evaluation metrics to do so [53]. In their experiments, the authors use LIME, SHAP and Anchors to ensemble explanations. They select three metrics, namely stability, consistency, and area under the loss curve, to ensemble the weights produced by the techniques into one. One issue with this approach is that the consistency metric requires creating and using different ML models to produce explanations. Another shortcoming of the framework is that it only enables weighting using comparative evaluation metrics across several models/explainers; it does not guarantee that the final explanations are correct or acceptable to the end user. This method was evaluated only on an image classification use case.

Finally, another work on ensembling explanations introduces EBEC, a method for correcting global explanations of non-differentiable ML models with a non-differentiable importance score [54]. The central idea of EBEC is to train multiple ML models on a dataset to identify different local minima, then produce global explanations using an explainability technique (SHAP in that work), and finally combine them by solving an optimisation problem that guarantees certain qualitative properties. Based on their evaluation, the authors conclude that EBEC works effectively on three different tabular datasets.

4 Truthful meta-explanations supported by arguments

In this work, we present a threefold contribution to the IML community. Focusing exclusively on FI explainability techniques, we first formulate the definition of the truthfulness metric. Then, we present a meta-explanation technique that combines multiple explanations in an ensemble fashion. Finally, we also present how arguments can enhance the meta-explanation process, which uses the truthfulness metric. All of these are visible in the workflow of Fig. 1.

To begin, we state a few assumptions that must hold for our technique to be theoretically sound. Assumption 1 ensures that the ML model to which we apply our technique is able to provide continuous predictions. This is a necessary property for the metric we formulate in the following section (Section 4.1).

Assumption 1

The machine learning model \(P(x) = y\) can provide continuous predictions \(y \in \mathbb {R}\). A classification model, for example, should be able to provide predictions in the form of probabilities of good quality (e.g. neural networks or probabilistic models). In our technique, a classification model that produces inadequate probability estimates, such as decision trees, would not yield satisfactory results. A regression model, on the other hand, always produces continuous outputs.

The second assumption (Assumption 2) concerns the explainability techniques utilised in the approach. The number and type of techniques used in the ensemble to produce one final explanation are not limited, with the only requirement being that they provide weights representing a (local) monotonic relation to the prediction of a specific label, or that they are perceived by end users as such. This is critical since a few explainability techniques, such as SHAP, provide the contribution of a feature to the prediction without assuming any local or global monotonicity. Nonetheless, based on this proposed contribution, end users still perceive the relationship between a feature and the output as having monotonic behaviour when altered.

Assumption 2

FI techniques produce, or are perceived as producing, \(z_j\) weights with a local or global monotonic notion.

4.1 Truthfulness metric

The first contribution of this paper concerns the truthfulness metric. Truthfulness is a user-inspired evaluation metric that simulates a user’s behaviour with respect to an explanation. It addresses issues of other faithfulness-based metrics by evaluating all feature importance elements and taking into account all signs (Positive, Negative and Neutral). Before formally defining the metric, we present an example. The following explanation concerns the prediction of a customer’s loan disapproval in a bank.

Fig. 2 Feature importance weights assigned to the features of the example

The customer received the explanation shown in Fig. 2. She observed that her friend’s income of $1,000 (Co-applicant Income) had minimal impact on her decision, which was a disapproval with a score below the minimum threshold. Consequently, she decided to involve her mother, who had a slightly higher income ($1,200), as a co-applicant. Surprisingly, this change had no effect on the outcome, as she received the same score and, therefore, the same decision (disapproval), although she anticipated that it would improve slightly. As a result, the customer perceived the explanation as dishonest and lost trust in the predictive system.

Influenced by this, we suggest a metric that, given an explanation, performs a few tests to ensure that the explanation provided to the end-user is truthful. This procedure shares similarities with counterfactual techniques [55], but it differs in its objective. While counterfactual techniques aim to switch the class in a classification problem, our goal here is to observe the change in the probability of the predicted class, which may not necessarily result in a switched prediction. Hence, for each feature importance score \(z_j\) assigned to the feature values \(v_i^j\) of \(x_i\), we both increase and decrease the feature values, and we observe if the model behaves as expected with respect to the feature importance.

In this work, we will focus on four types of data: textual, image, tabular, and time-series. The words are the features in textual tasks, and the importance concerns a word and, in some cases, its position. When dealing with images, feature importance is used to describe either a single pixel or a group of pixels known as superpixels. Each feature in tabular data has its own importance score. Lastly, in time-series, feature importance can refer to either a sensor’s time-step value or a sensor throughout the entire time-window. In all these cases, feature importance can be either Positive, Negative, or Neutral, as described in Definition 1.

Definition 1

The importance assigned to a feature can be IMP \(\in \) {\(1=\) Positive (\(z_i>0\)), \(-1=\) Negative (\(z_i<0\)), or \(0=\) Neutral (\(z_i=0\))}.

Algorithm 1 Process of determining the alternative values for a feature

Let us now discuss how we alter a feature value \(v_i^j\). Given a set of samples \(X'\), we measure each feature’s distribution statistics, namely its min, max, mean and standard deviation. Then, as presented in Algorithm 1, we calculate a noise or the alternative values directly. This noise is small, and therefore the alterations are local. The procedure differs across data types. In textual datasets, it replaces the word corresponding to the examined feature importance score with an empty string. In images, we compute a noise which lightens and darkens a pixel or a superpixel; for superpixels, we also have to employ an image segmentation algorithm to identify the superpixels of an image. Regarding tabular data, we create a noise which both increases and decreases the feature value, while in time-series, we increase and decrease with the calculated noise either a specific time-step of a sensor, or the whole time-window of a sensor. We set three different noise levels, “weak”, “normal”, and “strong”, in the cases where this is applicable (image, tabular and time-series data).
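Since Algorithm 1 is given as pseudocode, the following is a minimal sketch of the tabular case only, assuming the noise is a fraction of each feature's standard deviation and that altered values are clipped to the observed range; the noise-level factors are illustrative assumptions, not the exact values of our implementation.

```python
import numpy as np

NOISE_FACTOR = {"weak": 0.05, "normal": 0.1, "strong": 0.25}  # illustrative factors

def alternative_values(X_sample: np.ndarray, j: int, v_ji: float, level: str = "normal"):
    """Return an increased and a decreased alternative for feature j of one instance,
    based on the feature's distribution statistics in a reference sample X_sample."""
    column = X_sample[:, j]
    f_min, f_max, f_std = column.min(), column.max(), column.std()
    noise = NOISE_FACTOR[level] * f_std  # small noise, so the alteration remains local
    v_inc = min(v_ji + noise, f_max)     # clip altered values to the observed range
    v_dec = max(v_ji - noise, f_min)
    return v_inc, v_dec
```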

Definition 2

The alteration of the value of a feature can be ALT \(\in \) {\(1=\) Increasing (\(v_{j,i}'>v_{j,i}\)), \(-1=\) Decreasing (\(v_{j,i}'<v_{j,i}\))}, where \(v_{j,i}'\) the altered value.

Fig. 3 Example of altering a feature of an instance for the four different data types

In Fig. 3, we show an example of each data type. The textual example demonstrates how to remove (decrease) a feature, in this case, the word “John”. In the image example, we can see that the first alteration involves lightening the kitten’s ear by increasing the values of the superpixel, and the second alteration involves darkening the ear by decreasing the values of the pixels. In the tabular example, we make two changes to the “Age” feature. We increase “Age” from 25 to 27, while also decreasing it to 23. Finally, in the time series, we increase the first sensor’s readings by 0.1 at each time step and decrease them by the same value.

We have discussed a feature’s importance score and introduced the concept of alterations in the values of features across different data types. We now introduce the concept of expected behaviour. Given a feature importance score \(z_j\) for \(f_j\), we have two alternative values for that feature for a specific instance \(x_i\). We can request the ML model to predict the modified instance \(x_i' \in \{x_i^{inc}, x_i^{dec}\}\), where \(x_i^{inc}\) is the same instance but with a higher value for the examined feature and \(x_i^{dec}\) has a lower value. Then, regarding the feature importance score \(z_j\), we evaluate whether the model’s predictions \(P(x_i^{inc})\) and \(P(x_i^{dec})\) behave as expected.

Definition 3

The expected behaviour of the model M can be EXP \(\in \) {\(1=\) Increasing (\(P_{M}(x_i) - P_{M}(x_i') < \delta \)), \(-1=\) Decreasing (\(P_{M}(x_i) - P_{M}(x_i') > -\delta \)), \(0=\) Remaining Stable (\(|P_{M}(x_i') - P_{M}(x_i)|< \delta \))}, where \(x_i'\) is the instance with the altered value, while the tolerance \(\delta \) is defined either manually by the user or is set to a default value (0.0001).

As presented in Definition 3, the model’s prediction can behave in three ways: it can increase, decrease, or remain stable. If the feature importance score \(z_j\) is positive, we expect the prediction to increase for the \(x_i^{inc}\) modified instance and to decrease for \(x_i^{dec}\). If \(z_j\) is negative, we expect the predictions for the two alterations, \(x_i^{inc}\) and \(x_i^{dec}\), to decrease and increase, respectively. In the case of a neutral feature importance score \(z_j\), we anticipate that the prediction will remain stable for both alterations. We also use a \(\delta \) tolerance value. This allows an importance score to be evaluated as truthful in cases where the prediction changes only marginally, for example, from 0.75 to 0.7502, and thus avoids penalising negligible mistakes. However, setting \(\delta =0\) leads to a stricter evaluation. In the experiments section (Section 5), we examine different \(\delta \) values. Table 1 summarises all of these.

Table 1 Truthfulness matrix [(t)ruthful and (u)ntruthful states]

As a result, we argue that a feature importance score is truthful if and only if the behaviour of the model’s prediction regarding the alterations is as expected. This is also included in Definition 4. It is worthwhile to provide an example. The ML model predicts \(P_{M}(x_i)=0.7\) for a random instance \(x_i\), and the feature \(f_1\), with a value of \(v_{1,i} = 1\), has an IMP \(z_1 = 0.5\) (positive). We use Gaussian noise to increase and decrease the value of the feature based on its distribution. We change the value to \(v_{1,i}^{inc} = 1.21\) and \(v_{1,i}^{dec} = 0.85\), for \(x_i^{inc}\) and \(x_i^{dec}\), respectively. Then, we observe the model’s predictions by querying the ML model. In this example, the prediction for \(x_i^{inc}\) increased from 0.7 to 0.85, while the prediction for \(x_i^{dec}\) remained stable. As a result, we can conclude that the behaviour in the second alteration was not as expected, and hence the feature importance score is untruthful.

Definition 4

The importance assigned to a feature can be defined as truthful when the expected changes to the output of the M model \(P_{M}(x_i')\) are correctly observed with respect to the alterations that occur in the value of this feature. Thus, for both values of ALT and a given IMP, the IMP\(\times \)ALT=EXP must be in accordance with the truthfulness matrix (Table 1).

The truthfulness metric analyses the feature importance scores individually and penalizes those that are deemed untruthful. For an instance with \(|F|\) features, we examine the importance scores assigned to each feature’s value one by one. If a score is deemed truthful, we increment the truthfulness score by one. However, if a score is considered untruthful, we do not increment the score. Finally, we can optionally normalize the truthfulness score to the range of [0, 1] by dividing it by the number of features. While in the experiments below we do not normalize the scores, normalization can help the comparison of multiple explanation techniques across different models. This process can be mathematically formulated as follows:

$$\begin{aligned} Truthfulness(Z(P,x_i)) = T(Z(P,x_i)) = \frac{1}{|F|}\sum _{j=1}^{|F|}evaluate(z_j,x_i) \end{aligned}$$
(1)
$$\begin{aligned} evaluate(z_j,x_i) = {\left\{ \begin{array}{ll} 1 &{} \text {if } z_j \text { is truthful with respect to the}\\ &{} \text {alterations of } x_i,\ x_i' \in \{x_i^{inc}, x_i^{dec} \}\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
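A minimal sketch of this metric for a single instance is given below, assuming the model P returns a continuous score (Assumption 1) and that the two altered copies per feature are produced as in Algorithm 1. The tolerance \(\delta \) and the IMP\(\times \)ALT = EXP convention follow Definitions 1-4 and Table 1, while the normalisation of (1) is kept optional, as in our experiments.

```python
import numpy as np

def behaviour(p_orig: float, p_alt: float, delta: float = 1e-4) -> set:
    """Observed behaviours compatible with Definition 3 (they may overlap due to delta)."""
    observed = set()
    if p_orig - p_alt < delta:
        observed.add(+1)                 # (leniently) increasing
    if p_orig - p_alt > -delta:
        observed.add(-1)                 # (leniently) decreasing
    if abs(p_alt - p_orig) < delta:
        observed.add(0)                  # remaining stable
    return observed

def evaluate(z_j: float, p_orig: float, p_inc: float, p_dec: float, delta: float = 1e-4) -> int:
    """1 if the importance score z_j is truthful for both alterations (Table 1), else 0."""
    imp = int(np.sign(z_j))
    ok_inc = imp * (+1) in behaviour(p_orig, p_inc, delta)  # EXP = IMP x ALT, increased value
    ok_dec = imp * (-1) in behaviour(p_orig, p_dec, delta)  # EXP = IMP x ALT, decreased value
    return int(ok_inc and ok_dec)

def truthfulness(P, x_i, z, altered, delta: float = 1e-4, normalise: bool = False) -> float:
    """altered[j] = (x_inc, x_dec): the two altered copies of x_i for feature j."""
    p_orig = P(x_i)
    score = sum(evaluate(z[j], p_orig, P(x_inc), P(x_dec), delta)
                for j, (x_inc, x_dec) in enumerate(altered))
    return score / len(z) if normalise else score
```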

Before proceeding with the meta-explanation ensembling technique, we argue why we use only two alternative values to evaluate the importance score of a feature for a given instance. We assume that if there is monotonicity between these two values, there will be monotonicity in the intermediate values as well. We make this assumption to minimise the computational cost, considering that many explainability techniques have high response times. We achieve three things with this choice:

  • Faster evaluation: Given a set of explainability approaches and their response times, truthfulness has a low computing cost when applying two alterations per feature, as opposed to more.

  • Integration to a meta-explanation technique: Being lighter computationally, we can use this metric in a meta-explanation technique to produce even better explanations. In the following section, we introduce a meta-explanation technique that makes use of the truthfulness metric.

  • Reduce the environmental impact: Given that truthfulness necessitates re-querying the ML model twice for each feature, utilising a larger set of alternative values rather than two would considerably increase the cost for larger feature sets. As a result, we anticipate a lower environmental impact by selecting only two alternative values.

4.2 Meta-explanation technique

Based on the truthfulness metric, we introduce a meta-explanation technique called MetaLion. Unlike other recent research, we employ the truthfulness metric to combine multiple explanation techniques and provide a more accurate local explanation.

Given E explainability techniques and an examined instance \(x_i\), whose prediction by a model M is \(P_M(x_i)\), to ensemble the different explanations \(Z = [Z_0, Z_1, \dots , Z_{E-1}]\), we calculate the truthfulness of each explanation. We also measure the average change of the output given the two alterations \(x_i'\) of each feature \(f_j\), \(ac_j = \frac{1}{2}(|P_M(x_i)-P_M(x_i^{inc})|+|P_M(x_i)-P_M(x_i^{dec})|)\).

The first step in performing the ensembling of the multiple explanations is to determine the candidate importance scores for each feature based on their truthfulness. Algorithm 2 illustrates this. For each feature, we verify the truthfulness of the importance scores assigned by the various explainability techniques and save the truthful ones for use in the next step.

Algorithm 2 Identification of candidate truthful feature importance scores

Having identified the truthful importance scores from each explanation for each feature of an examined instance, we present our ensemble procedure in Algorithm 3. First, we sort the features by average change. Then, starting with the feature with the greatest absolute change, we examine the candidate importance scores and choose the one with the highest absolute value. We save this value, which we will use in the following steps. We proceed to the next feature, which has the second largest average change. Again, we choose the highest absolute importance score among the candidate scores, which has to be lower in absolute value than the previously selected score (line 13). The order in which features and their importance scores are handled is vital. It is possible for a technique to have a truthful score but an incorrect magnitude (e.g., an importance score of 1 instead of 0.4). Therefore, by prioritising the feature with the highest average change and adjusting the weights accordingly, we can better meet the user’s expectations. We assign a zero value when a feature has no candidate feature importance scores.

Let us discuss an example using this algorithm. In Table 2, we have an instance \(x_i = [0.27, 0.12, 1,5, 6, -3]\) and three explanations \(Z = [Z_0, Z_1, Z_2]\) regarding its prediction. We calculate the truthfulness of each explanation. The truthful importance scores of each technique are represented with a green check mark in Table 2, and the untruthful scores with a red cross mark. The average change (AC) of the output for each feature, based on the two alterations, is also provided. Following Algorithm 3, we first select the score of \(f_2\), which has the highest AC score (\(ac_2=0.8\)). Among the three truthful importance scores, we select the highest, \(z_2^0=1\), the one from \(Z_0\). Then, we proceed to the next feature, \(f_3\). There is only one truthful importance score, \(z_3^2=-0.4\), provided by \(Z_2\). In the same fashion, we select the most appropriate importance score for each feature, until we have a complete explanation.

Table 2 Example of evaluation of three techniques and the meta-explanation
Algorithm 3 Truthfulness-based meta-explanation algorithm
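Putting Algorithms 2 and 3 together, the sketch below illustrates the ensembling step, assuming we already have each seed technique's importance scores with per-feature truthfulness flags and the average change \(ac_j\) of each feature; details such as whether the cap comparison is strict are simplifying assumptions rather than the exact implementation.

```python
import numpy as np

def meta_lion(importances: np.ndarray, truthful: np.ndarray, avg_change: np.ndarray) -> np.ndarray:
    """importances: (E, |F|) scores from E seed techniques; truthful: (E, |F|) booleans;
    avg_change: (|F|,) average output change per feature. Returns one ensembled explanation."""
    meta = np.zeros(importances.shape[1])
    cap = np.inf  # the previously selected score caps the next one (by absolute value)
    for j in np.argsort(-avg_change):                  # descending average change (Algorithm 3)
        candidates = importances[truthful[:, j], j]    # truthful candidates only (Algorithm 2)
        candidates = candidates[np.abs(candidates) <= cap]
        if candidates.size == 0:
            meta[j] = 0.0                              # no usable truthful candidate: neutral
            continue
        meta[j] = candidates[np.argmax(np.abs(candidates))]
        cap = abs(meta[j])
    return meta
```

Applied to the example of Table 2, \(f_2\) (the highest AC) would receive \(z_2^0=1\), and \(f_3\) would then receive \(z_3^2=-0.4\), matching the selections described above.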

With this ensembling procedure, we achieve two things. The first is that we gather more truthful importance scores in the final explanation. Moreover, we re-rank the importance scores of the features using the truthfulness evaluation and our ensembling algorithm. Later in the experiments, we will discuss the effectiveness of our approach.

4.3 Argumentation

The argumentation framework we designed to provide justifications for the truthfulness evaluation was a very useful component of Altruist, our earlier preliminary work [26]. While we do not re-formulate the entire argumentation system in this section, we do re-formulate the atoms which form the arguments utilised in our system, to make them more descriptive. More information about the theoretical formulation of the framework can be found in our earlier work [26]. The original atoms were the following:

  • a: The explanation is untrusted

  • b: The explanation is trusted

  • \(c_j\): The importance \(z_j\) is untruthful

  • \(d_j\): The importance \(z_j\) is truthful

  • \(e_{j,ALT}\): The model’s behaviour by altering \(f_j\)’s value is not according to its importance

  • \(f_{j,ALT}\): The evaluation of the alteration of \(f_j\)’s value was performed and the model’s behaviour was as expected, according to its importance.

We are re-phrasing the last two atoms, \(e_{j,ALT}\) and \(f_{j,ALT}\), as seen below:

  • \(e_{j,ALT}\): The model’s behaviour by altering \(f_j\)’s value from X to Y (ALT) is not according to its importance \(z_j\)

  • \(f_{j,ALT}\): The evaluation of the alteration of \(f_j\)’s value from X to Y (ALT) was performed and the model’s behaviour was as expected (EXP), according to its importance \(z_j\).

This way, we do not modify the theoretical argumentation framework supporting our system, but we make the last two atoms used in the arguments more descriptive. While we present a complete example in the qualitative experiments, a brief example is shown below:

  • \(e_{2,INC}\): The model’s behaviour by altering \(f_2\)’s value from 25 to 26 (increased) is not according to its importance \(z_2\)

  • \(f_{2,INC}\): The evaluation of the alteration of \(f_2\)’s value from 25 to 26 (increased) was performed and the model’s behaviour was as expected (increased), according to its importance \(z_2\).

An example showcasing the enhanced argumentation framework is presented in the qualitative experiments (Section 5.3).

5 Experiments

In this section, we will test the truthfulness metric on several types of datasets, as well as our meta-explanation technique, in a series of quantitative experiments. We will also conduct a qualitative evaluation of the explanations produced by the meta-explanation technique.

5.1 Setup

We will begin by describing the datasets we used, the preprocessing procedures we utilised, the predictive models we employed, and the explainability techniques we included in our experiments.

5.1.1 Datasets

We included the following datasets in our experiments to cover a variety of the critical sectors that use ML in their workflows, as presented in Section 1. We incorporated the Turbofan Engine Degradation Simulation (TEDS) dataset [56, 57] for the manufacturing sector’s predictive maintenance scenario, which aims to predict the remaining useful lifetime (RUL) of engines using time-series data. The second dataset, Credit Card Approval Prediction (CCA), contains information about bank customers (tabular data), as well as information regarding their debt payments (if any); the goal is to determine whether a client is eligible for a credit card. A dataset for Hurricane Damage Estimation (HDE) of properties using satellite images [58], in a classification manner, is incorporated in our experiments to cover the insurance sector. Finally, data for the identification of Acute Ischemic Strokes (MedN) through brain MRI reports (medical notes - text) [59] connect the experiments to the healthcare sector. More information about the datasets is visible in Table 3, while their preprocessing is described in Section 5.1.2 and in the GitHub repository “MetaLion: Truthful Meta Explanations”.

Table 3 Information about the datasets incorporated in our experiments. *After preprocessing (R: Regression, BC: Binary Classification)

5.1.2 Preprocessing

While the extended preprocessing is accessible in our GitHub repository, here we mention a few crucial preprocessing steps. We suggest that other researchers and users apply similar preprocessing towards more explainable end-to-end systems.

Starting with the time-series dataset, TEDS, we scale our data to [0.1, 1] after reducing the available features and retaining only the measurements from 14 sensors. We chose to scale our data to this range rather than [0, 1] because none of the 14 sensors have measurements equal to 0, but only positive values. This is a critical decision in explainability. Often, the value 0 has a neutral notion in terms of explainability and, more precisely, feature importance. As a result, local techniques such as LIME can generate a positive weight that, when multiplied by the 0 value, is neutralised. Furthermore, using a time window of 50 time-steps, we create examples of 14 measurements across 50 time-steps, with the RUL value at the most recent time-step as the target variable. One final preprocessing step is to scale the output, the RUL value, to [0, 1], which is easier for a neural network to learn.

In our tabular data, on the other hand, there exist features with positive, negative, and zero values. In this case, we would like to scale them to \([-1,1]\) while keeping the centre at zero. To address this, we use maximum absolute value scaling. In terms of image preprocessing, we performed augmentation by randomly flipping and rotating the samples, and scaled them from [0, 255] to [0, 1]. For the textual dataset, we applied a symbol-removal process to each document and reduced the maximum number of words from 380 to 250, because relatively few documents were longer than 250 words. Finally, we used the BioBERT [60] pretrained transformer to obtain word-level embeddings for each document.
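As an illustration of these scaling choices, the following sketch uses scikit-learn with small placeholder arrays (the actual pipelines and data loaders are available in the repository):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

rng = np.random.default_rng(0)
X_teds = rng.uniform(1, 100, size=(200, 14))    # placeholder for the 14 sensor measurements
X_cca = rng.normal(0, 3, size=(200, 20))        # placeholder tabular features
X_hde = rng.integers(0, 256, size=(10, 64, 64, 3)).astype(float)  # placeholder images

# TEDS: scale to [0.1, 1] so that no value collapses to the "neutral" 0,
# which would cancel out multiplicative importance weights (e.g. in LIME).
X_teds_scaled = MinMaxScaler(feature_range=(0.1, 1.0)).fit_transform(X_teds)

# CCA: features mix positive, negative and zero values, so use maximum-absolute-value
# scaling to reach [-1, 1] while keeping zero at the centre.
X_cca_scaled = MaxAbsScaler().fit_transform(X_cca)

# HDE: rescale pixel intensities from [0, 255] to [0, 1].
X_hde_scaled = X_hde / 255.0
```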

Fig. 4 The three different architecture formats used in our experiments

5.1.3 Model architectures

We utilized three distinct model architecture formats in our study. The first architecture, referred to as NN1, is a linear network consisting of input and output layers for all data types, with an additional layer for flattening 2D or 3D data. NN1 is similar to linear regression models. The second architecture, NN2, is a deep neural network specifically designed for each data type. For instance, we employed recurrent and feedforward layers for time series data, feedforward layers for tabular data, convolutional and feedforward layers for image data, and bidirectional recurrent, one-dimensional convolutional, and feedforward layers for textual data. The third architecture, NN3, is a more complex version of NN2 with additional layers, more neurons, and different activation functions. While NN3 networks may seem unnecessarily complex, they are essential for facilitating our research.

The various architecture formats mentioned above are depicted in Fig. 4. For the classification datasets, the models in our study utilised a sigmoid activation function in the output layer, whereas for the regression dataset (TEDS) we employed a linear activation function. For detailed information on the activation functions used in the hidden layers and the specific layer configurations for each dataset, please refer to the implementation available in our GitHub repository. In terms of the training process, a batch size of 32 was employed for CCA, HDE, and MedN, and 512 for TEDS. The number of epochs varied for each dataset and model, and the specific values can be found on our GitHub page.

Table 4 Predictive performance of the three architectures (NN1, NN2, NN3) on the four datasets

All models in our study were trained using separate training and validation sets, and their performance was evaluated on dedicated test sets. The performance of each model on the four individual tasks is presented in Table 4. It is noteworthy that the linear neural network (NN1) consistently performs worse than the other models.

Nevertheless, we included the linear neural network model in our study to assess the performance of our metric in the simplest case. This allows us to evaluate explanations for fully interpretable models accurately and serves as a baseline for comparison with more complex models. By including this model, we can ensure that our metric is capable of correctly evaluating explanations even in cases where the model’s behaviour is transparent and easily interpretable. We conducted this test, influenced by recent research on evaluating explainability techniques using ground-truth synthetic explanations [61].

In all cases, NN2 performs as well as or better than NN3; in particular, NN3 does not outperform NN2 in the HDE and MedN datasets. NN2 also represents the most commonly used type of architecture among the three neural networks. However, such complex models as NN3 are necessary to stress the gradient-based explainability techniques.

5.1.4 Explainability techniques

In our experiments, we used three different explainability techniques: one model-agnostic and two model-specific (neural-specific). These are LIME, IG, and LRP, which we discussed in Section 2.2. We utilised the original Python library for LIME, while for the IG and LRP techniques we employed the iNNvestigate library. The default parameters of IG, LRP, and LIME were used for all datasets, except for HDE and MedN, where, for computational reasons, we adjusted the number of neighbours for LIME to 100 and 250, respectively. Lastly, we employed a random explanation as a baseline. This random explanation served as a reference point and allowed us to assess the impact of random noise on the meta-explanation techniques.

One more explainability technique would be to exploit the real weights of the linear neural models (NN1). Those weights are ground truth interpretations. However, we will treat our NN1 as a black box model, and we will use these ground truth interpretations to evaluate our metric. Our metric should provide a perfect score for these interpretations, as they are correct and real.

We use Mean, Median, and inXAI as meta-explanation techniques, as well as our proposed technique, MetaLion, presented in Section 4. For each feature, the Mean meta-explanation technique computes the mean value across the importance scores provided by the different explainability techniques, while Median takes the median value. For example, if LIME, IG, and LRP provided importance scores of 0.2, 0.3, and 0.7 for the feature “age”, Mean will assign an importance score of 0.4 to “age”, while Median will assign 0.3. For inXAI, we follow the original implementation, which builds upon three metrics to weight the seed explanations [53].
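In code, the toy example for the feature “age” corresponds to the following (the scores are the illustrative values from above):

```python
import numpy as np

scores_for_age = np.array([0.2, 0.3, 0.7])   # LIME, IG and LRP scores for "age"
mean_score = scores_for_age.mean()           # approximately 0.4
median_score = np.median(scores_for_age)     # 0.3
```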

We apply these explainability techniques to all datasets. Regarding TEDS, we apply them at both the time-step and the sensor level: in the former, feature importance scores are assigned to each time-step of each sensor, while in the latter, importance scores are assigned to each sensor across the whole time-window. For the CCA dataset, each feature is assigned a feature importance score, while in HDE, importance scores are assigned to each superpixel. Finally, in the MedN dataset, the explainability techniques assign importance scores to each term (word).

5.2 Quantitative experiments

The quantitative experiments we conducted include the evaluation of the truthfulness metric using ground truth information from NN1, and a comparison of the different explainability techniques we chose, the meta-explanation techniques we used, and the one we designed, accompanied by an ablation study. Additionally, we study the influence of the noise and \(\delta \) values on the evaluation, and we compare truthfulness to other metrics: complexity, stability, and consistency. All the metrics employed in our study were measured exclusively on the test set.

5.2.1 Truthfulness evaluation

The first part of the experiments focuses on the evaluation of explanations provided by linear neural models (NN1). Those are ground truth, and therefore we want to assess if the truthfulness metric presented in Section 4.1 correctly identifies these explanations.

Table 5 Truthfulness evaluation of linear models (NN1) on the different datasets
Table 6 Avg. number of identified untruthful features per explainability technique

In Table 5, we present the assessment of the inherent explanation of NN1s on the different datasets. We use three different levels of noise (\(noise \in [\)“weak”, “normal”, “strong”]), and four different \(\delta \) values (\(\delta \in [0, 0.0001, 0.001, 0.01]\)). We know that NN1s are interpretable. Therefore, their explanations must be 100% truthful. In the results of the table, we can see that in TEDS (sensor level), CCA, and MedN, it perfectly evaluates the linear interpretations. However, in HDE, which is at the superpixel level, it assigns on average 1.29 wrong weights out of 29.71 superpixels.

This is reasonable: in our implementation, if the value of a pixel is increased (decreased) above (below) the maximum (minimum) value after the alteration, the value is reset to the maximum (minimum). Then, considering that superpixel scores are obtained by averaging the per-pixel importance scores, if a pixel is not increased (decreased) as much as the others due to this limitation, the possibility of observing unexpected behaviours increases. On the other hand, if we were conducting the experiment at the pixel level, this issue would not have occurred.

Consider the following example. Suppose a superpixel consists of 3 pixel values [0.8, 0.6, 0.9], each within the range [0, 1], with importance weights \([0.2, 0.05, -0.35]\), and the prediction is \(sigmoid(0.2\times 0.8 + 0.05\times 0.6 - 0.35\times 0.9) = sigmoid(-0.125) = 0.469\). The average importance, which is the weight of the superpixel, is approximately \(-0.033\). If we make a positive alteration of 0.2, the 3 values become [1, 0.8, 1]. Notice that 0.9 changed to 1 instead of 1.1, in order not to violate the range [0, 1]. The prediction accordingly changes to \(sigmoid(0.2\times 1 + 0.05\times 0.8 - 0.35\times 1) = sigmoid(-0.11) = 0.472\). While a positive alteration, given a negative superpixel weight, should lead to a decreased prediction, this did not happen. If we had allowed the value 0.9 to reach 1.1, the prediction would have been \(sigmoid(-0.145) \approx 0.464\), hence decreased, as expected. We are aware of this limitation, but based on the experiments, it appears to be rare. Therefore, we decided to maintain the restriction of keeping the altered values within their respective ranges.

Given these findings, we can conclude that when the explanations are correct, truthfulness accurately evaluates a technique. We shall put it to the test in non-linear, complex models in the following experiments.

5.2.2 Explainability techniques evaluation

Let us examine how truthful the different explainability techniques are on the four selected datasets. For the noise and \(\delta \) parameters, we choose “normal” and 0.0001 (the default parameters), respectively. In Table 6, for the four different datasets, we can observe how correctly the different explainability techniques assign weights to the predictions of NN2 and NN3. Let us focus on the first four rows, which correspond to the typical explainability techniques and their evaluations.

The two explanation strategies that perform best in these cases are IG and LRP. There is no apparent winner between the two, since IG outperforms LRP in TEDS while performing similarly in CCA, whereas LRP outperforms IG in HDE and MedN. LIME, on the other hand, is the worst explanation approach, performing worse than random explanations in three of the four cases.

5.2.3 Meta-explanation techniques comparison

The performance of the meta-explanation techniques, also known as ensembles, is shown in the same table (Table 6). As we showed in the previous section, there is no clear winner among the different explainability techniques in several cases. Given the need for a meta-explanation, an ensembling technique appears to be a good choice. Mean, which averages the explanations, proves to be a promising approach based on the literature. Inspired by this, we included in the experiments a meta-explanation technique based on Median (which selects the median weight among those suggested), as well as the inXAI technique and the MetaLion technique proposed in Section 4.2.

Examining the results in Table 6, we can see that our meta-explanation technique drastically reduces the number of identified untruthful features in all cases. All four meta-explanation techniques use the four explainability techniques as input. MetaLion reduces the untruthful features by 66% compared to the original techniques, and by 64% compared to the other three meta-explanation techniques.

Ablation study

We also conduct an ablation study on the seed explanation techniques. We remove each seed technique one at a time and monitor how the meta-explanation techniques perform. The results are shown in Table 7. The performance of the meta-explanation techniques employing all four explanation techniques is shown in the first row. In the following rows, we omit the techniques in this order: IG, LRP, LIME, and Random. The red up arrow indicates that omitting the specific explanation technique reduced the meta-explanation technique’s performance compared to the original performance. The green down arrow indicates that performance improved, as the average number of untruthful elements in the meta-explanations decreased.

Table 7 Ablation study regarding the meta-explanation techniques

When LIME or Random is omitted, MetaLion appears to perform slightly worse, whereas the other three perform slightly better than their original performance. This occurs because the erroneous explanations introduce noise into these three meta-explanation techniques. MetaLion, on the other hand, detects the correct elements even in these noisy explanations and uses only them, discarding any potentially noisy ones. Based on this, we can conclude that our technique is resilient to noisy and even contradictory seed explanation techniques.

Another intriguing observation is that the Mean and Median meta-explanation techniques are susceptible to changes in the seed explainability techniques. On average, the performance of the two techniques changes by 0.60 and 0.77, respectively. Contrarily, MetaLion and inXAI appear to be more stable: on both neural networks, their average change across all datasets is 0.56 and 0.38, respectively. We assume that adding more seed explainability techniques would further strengthen these observations.

5.2.4 Parameters impact

In the prior experiments, we identified IG as the best explainability technique, and our meta-explanation technique as the best among the meta-explanation techniques. We select these two, as well as inXAI and LIME, the former being a possible competitor and the latter probably the most popular explainability technique, to analyse how they perform given different noise and \(\delta \) values, as tested on the NN2 model.

Table 8 Performance of IG, LIME, inXAI, and MetaLion technique on truthfulness based on different noise and \(\delta \) values

In Table 8, we can see the performance of the techniques for the different noise and \(\delta \) values on the NN2 neural network. In every case, “weak” noise and higher \(\delta \) values (e.g., 0.01) produce higher truthfulness scores and do not allow us to easily distinguish between techniques. On the other hand, “strong” noise and \(\delta =0\) are very strict and punitive. Therefore, we suggest the use of “normal” noise with a small \(\delta =0.0001\), but not 0.

Table 9 Comparison of truthfulness and complexity between IG, LIME, inXAI, and MetaLion with different \(\delta \) values
Table 10 Stability per explainability technique

5.2.5 Comparison with other metrics

One last quantitative experiment compares the truthfulness metric to other well-known explainability metrics, namely complexity, stability, and consistency. Complexity measures the number of non-zero weights included in an explanation. Lower scores in this metric suggest lower complexity, which means more comprehensible explanations. We can also include a complexity threshold: importance scores that fall below this threshold are regarded as zero, reducing the number of non-zero elements. In the results shown in Table 9, where we use “normal” noise, we use the \(\delta \) values as the complexity threshold as well. For \(\delta =0.0001\), the evaluation of the different techniques is unclear, and we cannot easily choose the best technique. For example, in the TEDS dataset, IG and LIME have the same complexity, but they have very different truthfulness scores, with LIME making twice as many mistakes as IG.
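A minimal sketch of the complexity measure as used here, with the \(\delta \) value doubling as the threshold below which an importance score is treated as zero:

```python
import numpy as np

def complexity(z: np.ndarray, threshold: float = 1e-4) -> int:
    """Number of importance scores whose magnitude exceeds the threshold."""
    return int(np.sum(np.abs(z) > threshold))

print(complexity(np.array([0.5, 0.00003, -0.2, 0.0])))  # 2
```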

Another intriguing finding from the complexity metric is that MetaLion, which always attains the best truthfulness score, also has the lowest complexity in three of the four test cases. This means that the meta-explanation technique reduces the number of elements shown to the end user, making explanations shorter and easier to understand, while guaranteeing that the remaining importance scores are truthful. This happens because the meta-explanation technique simply replaces the incorrect components with zero, and since its number of untruthful elements is always lower, this replacement is most likely correct. As a result, it not only ensembles the seed explainability techniques but also corrects them.

Additionally, we evaluate all seed and meta-explanation techniques using the stability and consistency metrics. Examining the stability and consistency of the explanations allows us to assess the robustness and reliability of the techniques and to compare how these metrics align with the truthfulness scores, providing a more comprehensive analysis of explanation quality across multiple dimensions.

Stability, also known as robustness, requires consistent explanations for similar inputs. To evaluate stability, we use local Lipschitz continuity, as discussed in [53], a relaxed notion of continuity that measures the maximum difference between the explanations of points within a defined neighbourhood. The neighbourhood is determined by a distance criterion \(\epsilon \), which ensures proximity between points.

For the CCA dataset, we set \(\epsilon \) to the default value of 0.3. For TEDS, however, 0.3 was inadequate, as it resulted in the same score for all techniques, so we selected a higher value of 3. For datasets with higher-dimensional feature spaces, such as HDE and MedN, we searched for an \(\epsilon \) value that could differentiate the techniques to some extent and set it to 50. Although we would have preferred to increase this value further, the associated computational cost limited our search.
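A minimal sketch of such a local Lipschitz estimate is given below; the random sampling of the \(\epsilon \)-neighbourhood is a simplification of the optimisation described in [53].

```python
import numpy as np

def local_lipschitz(x, explain, epsilon, n_samples=50, seed=0):
    """Estimate max ||e(x) - e(x')|| / ||x - x'|| over sampled points x'
    within an epsilon-neighbourhood of x; lower values indicate more stable
    explanations. explain(x) returns the importance vector of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    e_x = np.asarray(explain(x), dtype=float)
    worst = 0.0
    for _ in range(n_samples):
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction) + 1e-12
        x_prime = x + rng.uniform(0.0, epsilon) * direction
        dist = np.linalg.norm(x - x_prime)
        if dist > 0.0:
            ratio = np.linalg.norm(e_x - np.asarray(explain(x_prime),
                                                    dtype=float)) / dist
            worst = max(worst, ratio)
    return worst
```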

Table 10 presents the performance of each technique in terms of stability. While there is no clear winner, several interesting findings can be observed. On the TEDS dataset, LIME demonstrates superior stability compared to all other techniques, and among the meta-explanation techniques, Median shows promising results. On the CCA dataset, Random explanations exhibit the highest stability, while among the meta-explanation techniques, inXAI performs well. On the HDE dataset, all techniques achieve perfect stability, indicating the need for a larger \(\epsilon \) value. On the MedN dataset, IG and LRP achieve perfect stability scores, while all meta-explanation techniques demonstrate adequate performance. Lastly, we observe that MetaLion is highly influenced by the instability of the seed techniques; removing unstable ones, for example Random on TEDS, is expected to increase its stability.

The last metric we will discuss is consistency, which measures the degree of variation in explanations generated by a technique for different models. Specifically, we consider our three models: NN1, NN2, and NN3. Consistency evaluates how different the explanations provided by a technique for these models are for each instance. It is important to note that the Random explainability technique produces the same explanation for an instance across all three models, leading to inflated consistency scores.
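Consistency can be sketched as the average pairwise disagreement between the explanations a technique produces for the same instance under the different models; the Euclidean distance used below is an assumption, as the exact formulation may differ.

```python
import numpy as np
from itertools import combinations

def consistency(x, models, explain):
    """Average pairwise distance between the explanations a technique
    produces for the same instance under different models (lower means
    more consistent). explain(model, x) returns the importance vector."""
    explanations = [np.asarray(explain(m, x), dtype=float) for m in models]
    distances = [np.linalg.norm(a - b)
                 for a, b in combinations(explanations, 2)]
    return float(np.mean(distances))
```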

As seen in Table 11, on TEDS the seed techniques exhibit challenges with consistency, while the meta-explanation techniques perform better. On CCA, both the seed and meta-explanation techniques perform similarly, except for inXAI, which outperforms the others. A similar pattern is observed on HDE, while on MedN, IG demonstrates perfect consistency compared to the other techniques. Overall, inXAI produces the most consistent explanations across all datasets, with IG, Mean, and MetaLion close behind.

However, neither metric identifies a single technique as the best, even though the inXAI meta-explanation technique, which aggregates the seed techniques by optimising these two metrics, was included in our study. In contrast, the truthfulness metric does identify the best technique, MetaLion, which attempts to optimise this metric.

When comparing inXAI and MetaLion, it should also be kept in mind that the former directly optimises metrics such as stability, consistency, and the area under the loss curve, so it is expected to perform slightly better on those metrics.

5.3 Qualitative experiments

In this section, we present two examples, one from the textual (MedN) dataset and one from the image (HDE) dataset, comparing the explanations provided by IG, LRP, and Mean with those of our meta-explanation technique. In both cases, we use the NN3 neural model. Moreover, we showcase how the argumentation framework can be employed to provide richer explanations.

Table 11 Consistency per explainability technique
Fig. 5 MedN Example: Explanation provided by different explainability techniques for a specific instance. Red indicates that the word has a positive influence on the prediction, whereas blue indicates a negative effect. A green check mark indicates that the influence is correct, while a red cross indicates that it is incorrect.

We start with an instance from the MedN dataset, shown in the first row of Fig. 5. The neural network assigned this medical report a 93% probability of concerning acute ischemic stroke. Focusing on a few specific words, “Diffusion”, “Restriction” (1st occurrence), and “Restriction” (2nd occurrence), all of them have been assigned a truthful weight by our meta-explanation technique, and the corresponding arguments are presented below (a template for rendering such arguments is sketched after the list). For example, the word “Diffusion” should, according to IG, LRP, Mean, and inXAI, have a negative weight; yet, when it is removed, the probability drops. Therefore, the weight should have been positive, as our meta-explanation technique correctly assigned. The corresponding argument is \(f_{Diffusion,DEC}\).

  • \(f_{Diffusion,DEC}\): The evaluation of the alteration of Diffusion’s value 1 to 0 (DEC) was performed and the model’s behaviour was as expected DEC (93% to 90%), according to its importance \(z_{Diffusion}=0.75\).

  • \(f_{Restriction-1st,DEC}\): The evaluation of the alteration of \(Restriction-1st\)’s value 1 to 0 (DEC) was performed and the model’s behaviour was as expected DEC (93% to 86%), according to its importance \(z_{Restriction-1st}=0.92\).

  • \(f_{Restriction-2nd,INC}\): The evaluation of the alteration of \(Restriction-2nd\)’s value 1 to 0 (INC) was performed and the model’s behaviour was as expected INC (93% to 95%), according to its importance \(z_{Restriction-2nd}=-0.29\).
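Such arguments follow a fixed template; a hypothetical rendering function (the name `render_argument` and its parameters are ours, not part of the framework) could reproduce, for instance, the \(f_{Diffusion,DEC}\) argument as follows.

```python
def render_argument(feature, alteration, expected, p_before, p_after,
                    importance, delta=1e-4):
    """Render a textual argument such as f_{Diffusion,DEC} from an evaluated
    alteration. Illustrative template following the word-level arguments
    above; it is not the exact wording produced by the framework.

    alteration: description of the change, e.g. "value 1 to 0".
    expected:   "DEC" or "INC", the behaviour implied by the importance sign.
    """
    change = p_after - p_before
    observed = "DEC" if change < -delta else "INC" if change > delta else "NONE"
    verdict = "as expected" if observed == expected else "not as expected"
    return (f"f_{{{feature},{expected}}}: The evaluation of the alteration of "
            f"{feature}'s {alteration} ({expected}) was performed and the "
            f"model's behaviour was {verdict} {observed} ({p_before:.0%} to "
            f"{p_after:.0%}), according to its importance "
            f"z_{feature} = {importance}.")

# e.g. render_argument("Diffusion", "value 1 to 0", "DEC", 0.93, 0.90, 0.75)
```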

Fig. 6 HDE Example: Explanation provided by different explainability techniques for a specific instance. Red indicates that the segment has a positive influence on the prediction, whereas blue indicates a negative effect. Green numbers indicate that the weight of the segment is correct, while red numbers indicate that it is incorrect.

In the second example (Fig. 6), the examined instance is classified as “damaged” with a 60% probability. We can see the different explanations produced by the techniques at the segment (superpixel) level. A red highlight indicates that the segment is crucial for the prediction of this class, while a blue highlight points to the other class. Our meta-explanation technique manages both to highlight fewer parts of the image and to contain fewer untruthful weights. Segments 1, 9, and 27 are discussed below. The weights assigned to all of them are correct in MetaLion, in contrast to the other techniques. Segment 1 is negative according to IG, LRP, Mean, and inXAI, while it should have been positive; the arguments supporting this are \(f_{Segment_1,DEC}\) and \(f_{Segment_1,INC}\). The same holds for segments 9 and 27.

  • \(f_{ Segment _1, DEC }\) The evaluation of the alteration of \( Segment _1\)’s value by \(-0.07\) (DEC) was performed and the model’s behaviour was as expected DEC (60.5% to 60.1%), according to its importance \(z_{Segment_1}=0.20\).

  • \(f_{ Segment _1, INC }\) The evaluation of the alteration of \( Segment _1\)’s value by \(+0.07\) (INC) was performed and the model’s behaviour was as expected INC (60.5% to 60.9%), according to its importance \(z_{Segment_1}=0.20\).

  • \(f_{ Segment _{9}, DEC }\) The evaluation of the alteration of \( Segment _{9}\)’s value by \(-0.26\) (DEC) was performed and the model’s behaviour was as expected DEC (60.5% to 39.2%), according to its importance \(z_{ Segment _{9}}=0.24\).

  • \(f_{ Segment _9, INC }\) The evaluation of the alteration of \( Segment _9\)’s value by \(+0.26\) (INC) was performed and the model’s behaviour was as expected INC (60.5% to 62.7%), according to its importance \(z_{Segment_9}=0.24\).

  • \(f_{ Segment _{27}, DEC }\) The evaluation of the alteration of \( Segment _{27}\)’s value by \(-0.14\) (DEC) was performed and the model’s behaviour was as expected INC (60.5% to 62.5%), according to its importance \(z_{ Segment _{27}}=-0.75\).

  • \(f_{ Segment _{27}, INC }\) The evaluation of the alteration of \( Segment _{27}\)’s value by \(+0.14\) (INC) was performed and the model’s behaviour was as expected DEC (60.5% to 52.3%), according to its importance \(z_{ Segment _{27}}=-0.75\).

We believe that presenting such arguments to the user can help them evaluate and choose the optimal technique from among multiple options, as well as enhance their trust in the system. Specifying the reasons why an importance score is regarded as truthful or untruthful can considerably increase trust in the explanation.

6 Conclusions

Machine learning models must be interpretable when used in high-risk applications. There are several techniques for explaining a model’s decisions, as well as evaluation methods for determining the quality of the explanations. However, because there are so many alternatives, determining the best explanation technique for a given application can be challenging. While evaluation can assist in this process, it is not always sufficient. It would therefore be extremely beneficial to provide a user-friendly evaluation metric that enables the combination of various explanation techniques via a meta-explanation technique.

The metric we introduced, truthfulness, is a good basis for a meta-explanation technique, whereas alternative faithfulness-based metrics produce outputs that are difficult to incorporate into a local ensembling/meta-explanation technique. Truthfulness is appropriate because it filters out the importance scores of each explanation that have the incorrect polarity given the model’s behaviour under a few alterations. Based on this filtering, the meta-explanation technique can readily select the most appropriate truthful score among the candidates, based on the change in the output. As a result, relying on fully unsupervised components, both explanation techniques and metrics, the meta-explanation we provide appears to be an appropriate choice.

Through large-scale experimentation, we explored the ability of truthfulness to accurately assess explanations, intrinsic or not, on four datasets of varied data types. Then, through an ablation study, we discussed the performance of the meta-explanation techniques, demonstrating that MetaLion always performs better as the number of input (seed) explanations increases, even when noisy, contradictory explanations are included, whereas the other methods perform better when noisy or erroneous explanations are omitted. This allows us to freely add explanations as seeds to our meta-explanation, weighing only the additional computational overhead, since doing so improves performance while remaining robust to potential noise.

Although we used IG, LIME, and LRP in our experiments, our technique is not limited to them. To improve the performance of the meta-explanation technique, the end user can easily replace or add new ones. Nonetheless, they must always keep in mind that the additional explanation techniques must be consistent with the two assumptions (Assumptions 1 and 2).

We also discussed the two truthfulness parameters, noise and \(\delta \), to help users select the values most appropriate for their applications. A “strong” noise with a \(\delta \) value of 0 is recommended for a more stringent evaluation. However, in most circumstances, “normal” noise with a \(\delta \) value of 0.0001 suffices, since such a difference in prediction is almost negligible. We also compared truthfulness to the complexity, stability, and consistency metrics. Based on this comparison, we highlighted that our meta-explanation technique almost always delivers both the most truthful and the least complex explanations, especially as the \(\delta \) value increases, while achieving decent performance on the other metrics as well.

Lastly, we presented a qualitative experiment with two examples from two distinct datasets, contrasting our meta-explanation technique, MetaLion, with various explainability techniques; MetaLion is less complex and more truthful. The truthfulness of each importance score is additionally supported by a few arguments, which help the end user trust the explanation. Those arguments are rephrased compared to the originals presented in our preliminary work [26].

6.1 Limitations

While our work offers several advantages, it is important to acknowledge its limitations. Firstly, concerning the truthfulness metric, the alterations made may not fully capture the behaviour of the underlying model. This limitation could affect the evaluation of explanations.

Additionally, the truthfulness metric depends on models that produce continuous outputs in regression tasks or probabilities in classification tasks. As a result, explanations of models such as decision trees may not be suitable candidates for evaluation with this metric, or with other faithfulness-based metrics. Conversely, explanations of neural networks, where logits or the outputs of activation functions (such as softmax or sigmoid) can be used directly, as well as explanations of probabilistic models and ensembles, can be appropriately evaluated using this metric.

Furthermore, it is important to note that MetaLion, like the seed explanation techniques it builds upon, does not consider feature dependencies. This limitation arises from the nature of the seed techniques themselves. It is an area that requires further exploration and consideration in future research.

6.2 Future directions

Our future objectives encompass several aspects. Firstly, we aim to conduct a large-scale experiment to compare the truthfulness metric with other metrics, investigating any potential correlations between them. Additionally, we seek to explore methods to enhance the truthfulness metric for even more accurate evaluations. One approach is to investigate alternative techniques, such as interpolation, to capture the polarity more accurately within a given range of values, replacing the current two alterations performed for each feature value.

Moreover, we plan to extend our analysis to incorporate different machine learning models, such as Transformers, and explore their compatibility with MetaLion. This expansion would also involve considering various tasks, including multi-class and multi-label classification, to assess the applicability of the approach across different domains. Furthermore, we intend to enhance both the argumentation framework and MetaLion itself. By refining these components, we aim to improve the overall effectiveness and usability of the approach. In addition, we aim to investigate the feasibility of incorporating alternative metrics, such as stability, consistency, and others, in MetaLion’s ensembling procedure.

Another potential future direction is to incorporate explanation techniques that consider feature dependencies and to investigate the applicability of MetaLion in such scenarios. By exploring the use of MetaLion in conjunction with such techniques, we can assess its usability and effectiveness in capturing and presenting explanations in more complex and interdependent feature spaces. Finally, we aim to conduct a human-centred experiment to investigate end users’ preference for meta-explanation techniques, and to carry out experiments involving domain experts, similar to those presented in recent studies [62].