1 Introduction

Nowadays, black-box machine learning (ML) models are extensively employed in applications that frequently have an impact on human lives (e.g. lending, hiring, insurance, or access to welfare services) [9, 23, 46]. In this context, understanding and trusting ML models is crucial. However, the internal workings of many ML models remain opaque, and the questions raised to scrutinise and debug their behaviour usually go unanswered, e.g. Why did we receive this result? What changes could lead to an alternative outcome?

When ML systems engage humans in the loop, they are expected to meet at least one of two requirements: (1) explain the model prediction and (2) provide helpful suggestions for assisting humans to achieve their desired outcome [22, 27, 43]. For example, consider a bank loan application problem in which a user asks for a loan from an online banking service and the decision is made automatically by an intelligent agent. Such a decision (classification) can be challenged by the loan applicant in the case of an unfavourable outcome. Therefore, the loan applicant should be provided with an explanation of the factors involved in the classification and suggestions about how to change an unfavourable outcome. These two requirements are fulfilled with factual and counterfactual explanations in the field of Explainable Artificial Intelligence (Explainable AI or XAI for short) [2, 10, 35, 37, 38]. Factuals refer to what is observed in the actual scenario (e.g. ranking the most important input factors [25, 32, 53]). In contrast, counterfactuals refer to simulated imaginary scenarios (e.g. increased income could lead to an alternative outcome [50]) in the application domain [29]. These hypothetical scenarios convey information relative to the original input by “specifying necessary minimal changes in the input so that a favourable outcome is obtained”, which is also called a Counterfactual Explanation (CE) [37]. It is worth noting that CE has been deemed acceptable for the General Data Protection Regulation (GDPR) in the European Union [49].

Wachter et al. [50] proposed one of the earliest methods for generating CEs, which involves adjusting input features to achieve a desired outcome such as loan approval. Although various approaches have been proposed subsequently [5, 12, 19, 26], they do not provide human-centred explanations (e.g. explanations containing actionable information grounded in user requirements). CEs generated by current approaches may recommend impractical actions (e.g. extreme input modifications) due to the lack of consideration for user feedback.

User feedback can contribute to addressing the above-mentioned problems and thus improve the generation of meaningful and actionable explanations. Accordingly, we propose the User Feedback-based Counterfactual Explanations (UFCE) algorithm, which allows the user to specify the scale of input modifications so that the generated CEs recommend actionable information and lead to viable outcomes. In previous work [40, 41], the authors already dealt with the notion of feedback, but the scope of user involvement was limited to defining the neighbourhood that minimises the proximity constraint. Here, we extend that work by exploiting the mutual information (see Sect. 4.2) of key contributors in the feature space and by considering user constraints to define a feasible subspace (see Sect. 4.3) in which to search for CEs. The proposed algorithm guides the search in this subspace to find the minimal changes (perturbations) to the input that can alter the classification result as required. UFCE introduces three methods to perform minimal changes in the input features (see Sect. 4.4), i.e. suggesting single-feature, double-feature, and triple-feature changes in the input at a time. The adherence of these minimal changes to the subspace confirms the suggested actions as feasible. The mutual information of features guides the decision of which features to perturb (see Sect. 4.4). UFCE is a deterministic, model-agnostic, and data-agnostic approach for tabular datasets. In this paper, the focus is on binary classification problems, while the extension of UFCE to multi-class classification problems will be addressed in future work.

In addition, UFCE is subjected to a rigorous evaluation focussing on the sparsity, proximity, actionability, plausibility, and feasibility of explanations. Compared to existing solutions, UFCE not only demonstrates enhanced performance in terms of these evaluation metrics but also makes a substantial contribution to the current state of the art.

Fig. 1 Example of decision surface with counterfactual instance space in the neighbourhood of test instance x. The yellow, black, and green dots (\(z_1\), \(z_2\), \(z_3\), \(z_4\), \(z_5\)) are the counterfactual instances: \(z_3\) is invalid; \(z_1\), \(z_2\), and \(z_4\) are valid and actionable; and \(z_5\) is valid but not actionable due to not adhering to the user-defined feature range for Mortgage (assuming bank loan data)

Figure 1 shows an illustrative example of the counterfactual instance space. The decision boundary divides this space into a ‘loan denied space’ and a ‘loan approved space’. The blue dots (points) represent instances with a loan-approved outcome in the actual space of the loan data, and the red dot (x) represents an instance with a loan-denied outcome (the test instance). The counterfactual instances (\(z_1\), \(z_2\), \(z_3\), \(z_4\), and \(z_5\)) are produced by the smallest feature modifications to x within its nearest neighbours. The nearest neighbourhood is shown with a dotted curved line on the decision boundary in the loan-approved space (see Sect. 4.3). The instance \(z_3\) represents changes in x that are insufficient to flip its outcome to loan approved, whereas \(z_1\), \(z_2\), \(z_4\), and \(z_5\) represent changes that are sufficient to do so. In addition, \(z_1\), \(z_2\), and \(z_4\) change mutually informed features while adhering to the subspace, which guarantees that they are feasible counterfactuals that convince the model to alter its outcome (i.e. reach the desired outcome). In contrast, \(z_5\) does not adhere to the user-specified feasible ranges of features, making it non-actionable (unfeasible), while \(z_3\) does not convince the prediction model to alter its outcome. To illustrate the non-actionable counterfactual \(z_5\), we assume a 2D plot labelled with the start and end of the ranges of the Income and Mortgage features, which specify the actionable subspace defined by the user; \(z_5\) violates the range of Mortgage, making it non-actionable and unfeasible.

In summary, the main contributions in this paper are as follows:

  1. We propose the UFCE algorithm that can generate actionable explanations complying with user preferences in mixed-feature tabular settings.

  2. We provide experimental evidence by simulating different kinds of user feedback that an end user could plausibly provide. We observed that user feedback influences the generation of feasible CEs, and UFCE turned out to be the most promising algorithm compared to two well-known CE generation algorithms: Diverse Counterfactual Explanations (DiCE) and Actionable Recourse (AR).

  3. We analyse the proposed UFCE algorithm and evaluate its performance on five datasets with two ML models (logistic regression and multi-layer perceptron) in terms of widely used evaluation metrics, including sparsity, proximity, actionability, plausibility, and feasibility. We present the results obtained from a simulation-based experimental setup for UFCE, DiCE, and AR, and observe how UFCE outperforms DiCE and AR in terms of proximity, sparsity, and feasibility. Note that this paper presents the experimental results only for LR due to page constraints; for the results regarding the MLP model, the reader is referred to the supplementary material.

  4. We implemented our algorithm as open-source software in Python, and it is made publicly available to support further investigations.

The rest of the paper is organised as follows: Sect. 2 revisits the related work, including methods and frameworks. Section 3 introduces the preliminaries, the problem statement, and the evaluation metrics. Section 4 introduces the new UFCE algorithm. Section 5 details the experimental setting and discusses the results reported in the empirical study. Section 6 discusses the overall findings. Finally, Sect. 7 draws some conclusions and points out future work.

2 Related Work

Table 1 Categorisation of CE algorithms: optimisations (OPT), heuristic search (HS), decision-tree (DT), fuzzy rules (FR), instance-based (IB), and instance and neighbourhood based (IB-NB)

XAI has witnessed substantial growth over the last decade. Our research focuses on CEs as a means to explain AI; consequently, we restrict our investigation to CE research in the subsequent paragraphs. For a comprehensive understanding of XAI and its existing algorithms, we recommend referring to Ali et al. [4] and Holzinger et al. [18]. This section briefly reviews the CE generation algorithms related to our work. The selected papers discussed below are purposefully chosen as they closely align with our approach. Wachter et al. [50] introduce the preliminary idea of CEs and evaluate their compliance with regulations. They frame the process of generating counterfactuals as an optimisation problem that seeks the minimal distance between two data points by minimising an objective function using gradient descent. Their principal objective is to determine the proximity of data points, which is evaluated using a distance metric suited to the dataset, such as L1/L2 or customised distance functions. The DiCE algorithm [26] emphasises feasibility and diversity of explanations, which are optimised by applying gradient descent to a loss function. DiCE assumes feature independence during perturbations for counterfactual generation. Yet, real-world features often correlate and share mutual information, challenging the universal applicability of the feature independence assumption. Feedback-based Counterfactual Explanation (FCE) [41] builds upon the strengths of the state-of-the-art explanatory method given by Wachter et al. [50]. FCE focuses on defining a neighbourhood space around the instance of interest with user feedback and finds the minimally distant CE in its proximity that provides a favourable outcome. Ustun et al. initially addressed the issue of actionability in CE generation by introducing the AR algorithm [47], which handles categorical features by discretising numerical ones; however, discretisation can be a weakness since it encodes feature values into a different format, and decoding new values after perturbations is likely to introduce a precision problem. Its mixed integer programming model optimises the recourse to comply with restrictions prohibiting immutable features from being modified.

Counterfactual Local Explanations via Regression (CLEAR) [51] is based on heuristic search strategies that discover CEs using local decisions that minimise a specific cost function at each iteration. First, CLEAR explains single predictions through “boundary counterfactuals” (b-counterfactuals) that specify the minimal adjustments required for the observation to “flip” the output class in the case of binary classification. Second, explanations are created by building a regression model that aims to approximate the local input–output behaviour of the ML system. Growing Spheres (GS) [24] is a generative algorithm that expands a sphere of artificial instances around the instance of interest until the decision boundary of the classification model is crossed and the closest counterfactual to the instance of interest is retrieved. This algorithm creates candidate counterfactuals at random in all directions of the feature space without considering their feasibility, actionability, or (un)realistic nature in practice.

Decision tree-based explainers uncover CEs using the tree's structure to simulate the behaviour of the opaque ML model. These techniques first approximate the black-box model with a tree and then employ the tree structure to extract CEs [14, 39]. Local Rule-based Explainer (LORE) [14] is a decision-tree approach that uses factual and counterfactual rules to explain why a choice was made in a given situation. It starts by sampling the local data for the explanation using a genetic algorithm. Then, LORE uses the sampled neighbourhood records of a particular instance to train a decision tree, which supports the generation of an explanation in the form of decision rules and counterfactuals. Another algorithm for factual and counterfactual explanations [39] assesses the length and complexity of rules in a fuzzy rule-based classification system (FRBCS) to estimate the conciseness and relevance of the explanations produced from these rules.

Feasible and Actionable CE (FACE) [30] is an instance-based CE generation algorithm that extracts CEs from similar examples in the reference dataset. It accounts for actionable explanations based on multiple data paths and follows feasible paths determined by a shortest-distance criterion defined on density-weighted metrics. Thus, FACE constructs a graph on the selected data points and applies a shortest-path algorithm (Dijkstra's algorithm) to find the feasible data points for generating CEs. The Nearest-Neighbour CE (NNCE) [33] is also an instance-based explainer that chooses the examples in the dataset that are most similar to the instance of interest but associated with an output class different from the actual one. The computational expense of calculating distances between input instances and every instance in the dataset with a different outcome is a shortcoming of both FACE and NNCE.

We categorise and summarise the above-mentioned CE algorithms in Table 1. The strategies are optimisation (OPT), heuristic search (HS), decision-tree (DT), instance-based (IB), and a hybrid of instance and neighbourhood based (IB-NB). If an approach is model-agnostic, we use a check mark (\(\checkmark \)); otherwise, the specific classifier is tagged. If it can process all types of data, we write ALL; otherwise, the specific data type is mentioned, such as Tabular (TAB), Image (IMG), or Text (TXT). A \(\checkmark \) is also used to indicate support for mixed features (categorical and numerical) and user constraints. A link is provided when the source code is available.

Even though many alternative algorithms exist for CE generation, only DiCE, AR, and FCE handle user constraints as an input, and they are, therefore, the only CE generation algorithms directly comparable to the new algorithm proposed in this work. Accordingly, in the rest of this manuscript, we focus on designing and evaluating a novel CE generation algorithm enriched with user feedback.

3 Preliminaries

In this section, we introduce the notation (i.e. expressions and terms) used throughout the paper (Sect. 3.1), the problem statement (Sect. 3.2), and evaluation metrics for counterfactual explanations (Sect. 3.3).

3.1 Expressions and Terms

Table 2 The definitions of frequently used expressions and terms

In the field of XAI, different important notions and concepts are explained with multiple expressions and terms. The definitions of these expressions and terms cannot be universally applied across counterfactual approaches and contexts, so they must first be fixed for this work. The specific meanings of frequently used expressions and terms throughout the paper are presented in Table 2.

3.2 Problem Statement

Let us assume we are given a point \(x = (x_1,\ldots ,x_d)\), where d is the number of features. Each feature takes values either in (a subset of) \(\mathbb {R}\), in which case we call it a numerical feature, or in (a subset of) \(\mathbb {N}\), in which case we call it a categorical feature (binary categories). For categorical features, we use natural numbers as a convenient way to identify their categories but disregard orders. For example, for the categorical feature ‘Online’, 0 might mean no, and 1 might mean yes. Thus, \(x \in \mathbb {R}^{d_1} \times \mathbb {N}^{d_2}\), where \(d_1 + d_2 = d\).

A counterfactual instance for a test instance x is an instance \(z \in \mathbb {R}^{d_1} \times \mathbb {N}^{d_2}\) such that, given an ML (black-box) classification model \(f: \mathbb {R}^{d_1} \times \mathbb {N}^{d_2} \rightarrow y\), where \(y=\{0, 1\}\) is the set of decisions or classes (this study considers a binary classification task), \(f(x) \ne f(z)\) holds (we follow the notation from [48]).

Our approach accounts for the fact that features may be dependent on or independent of each other when computing z, as frequently happens in real-world practice. We handle these dependencies by exploiting the mutual information (MI) shared among the features and utilising it in the selection of features to perturb (see Sect. 4.2).

A CE is an intervention in x that reveals how x needs to be changed to obtain z. We now examine the traditional setting in which multiple candidates z exist; for the sake of clarity and without compromising generality, we seek the most suitable \(z^*\) with

$$\begin{aligned} \begin{array}{l} z^* = {\text {argmin}}_{z} \delta (z, x) \\ \text {with} \quad \delta (z, x) = prox\_Jac(z, x) + \lambda \cdot prox\_Euc(z, x) \\ \text {subject to} \quad f(z)=t \quad \text {and { z} is plausible.} \end{array} \end{aligned}$$
(1)

where f is an ML model, t denotes the desired output, and \(\delta \) is the distance function. We wish z to be close to x under \(\delta \), which handles categorical and numerical features as a linear combination of their respective distances. The categorical distance \(prox\_Jac\) is defined as follows:

$$\begin{aligned} prox\_Jac(z, x) = 1 - Jacc(z, x) \end{aligned}$$
(2)

where \(prox\_Jac(z, x)\) measures the distance between categorical features using \(Jacc(\cdot , \cdot )\), the Jaccard index. The result is a value between 0 and 1, where 0 indicates that x and z are identical in terms of categorical features and 1 indicates that all categorical features in x differ from those in z. The numerical distance \(prox\_Euc(z,x)\) measures the Euclidean distance between x and z over numeric features. The parameter \(\lambda \) balances the influence of the two distances (re-scaling factor). To make the addition of both distances possible on the same scale, we normalise the numerical distance to [0, 1].
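
To make the combined distance concrete, the following minimal Python sketch implements the spirit of Eqs. (1)–(2). The set-based Jaccard variant, the assumption that numerical features are already normalised to [0, 1], and the default \(\lambda = 1\) are illustrative choices and not necessarily those of the UFCE implementation; cat_idx and num_idx denote the column positions of the categorical and numerical features.

```python
import numpy as np

def prox_jac(z_cat, x_cat):
    # 1 - Jaccard index over categorical features (Eq. 2); the index is taken
    # over the sets of (feature position, value) pairs, so identical
    # categorical parts yield a distance of 0 and fully different ones 1.
    z_set = {(i, v) for i, v in enumerate(z_cat)}
    x_set = {(i, v) for i, v in enumerate(x_cat)}
    if not z_set and not x_set:
        return 0.0
    return 1.0 - len(z_set & x_set) / len(z_set | x_set)

def prox_euc(z_num, x_num):
    # Euclidean distance over numerical features, assumed normalised to [0, 1]
    z_num = np.asarray(z_num, dtype=float)
    x_num = np.asarray(x_num, dtype=float)
    return float(np.linalg.norm(z_num - x_num))

def delta(z, x, cat_idx, num_idx, lam=1.0):
    # Combined distance of Eq. (1): categorical part + lambda * numerical part
    z, x = np.asarray(z, dtype=object), np.asarray(x, dtype=object)
    return (prox_jac(z[cat_idx], x[cat_idx])
            + lam * prox_euc(z[num_idx], x[num_idx]))
```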

3.3 Evaluation Metrics for Counterfactual Explanations

Evaluating the quality of CEs is an important task, as it helps to ensure that the explanations provided are accurate and informative. Here we define the evaluation metrics that we use to analyse the quality of CEs in the empirical study (Sect. 5).

Sparsity is defined as the average number of changed features in the CE with respect to the test instance (also expressed as the percentage of feature changes). A small value is desirable because the user can often only reasonably focus on and intervene upon a limited number of features (even if this amounts to a higher total cost than intervening on all the features). The sparsity of z is computed as \( \frac{1}{d}\sum _{i=1}^{d} \mathbb {1}[z_{i} \ne x_{i}]\) (counting 1 when \(z_{i} \ne x_{i}\) and 0 otherwise), where d is the number of features, and \(z_{i}\) and \(x_{i}\) denote the ith feature in z and x, respectively.

Proximity is defined as the distance from z to x. We measure two types of proximity, \(prox\_Jac\) for categorical features (see Eq. 2), and \(prox\_Euc\) for numerical features (see Eq. 3):

$$\begin{aligned} prox\_Euc(z,x) = \sum _{i=1}^{d} \frac{ \sqrt{ \left( z_{i}-x_{i}\right) ^2}}{MAD_i} \end{aligned}$$
(3)

Using Euclidean distance, Eq. (3) measures the distance \(prox\_Euc(z, x)\) between x and z for numerical features, weighted by the median absolute deviation (MAD) of the respective feature. The result is a value that represents the distance between x and z, with higher values indicating greater distance, i.e. lower proximity.

Plausibility evaluates whether a CE is plausible or not (Boolean outcome). We used the Local Outlier Factor (LOF) algorithm, which computes the local density deviation of a given data point with respect to its neighbours [7]. LOF is merely one choice among several outlier detection techniques (e.g. Isolation Forests [1]) that may be used to assess plausibility.

Actionability counts the average number of feature changes (also expressed as the percentage of feature changes) suggested from the user-specified list of features. It is similar to sparsity; however, the changes are not counted over all features but only over the user-specified subset.

Feasibility accounts, at the same time, for validity, actionability, and plausibility. Validity refers to the accomplishment of the desired outcome t given z, such that \(f(z)=t\). In addition, we use a threshold on the proportion of feature changes in z that must be fulfilled for the counterfactual to be admitted as actionable. The following equation measures the feasibility of z (a Boolean outcome):

$$\begin{aligned} \begin{array}{ll} z_{feasible}=(z_{valid}=1) \text { and } (z_{plausible}=1) \\ \quad \text { and } (z_{actionable} >= Actionable_{threshold}) \end{array} \end{aligned}$$
(4)

Finally, we also record the required computational time, i.e. the average time taken (seconds) to generate one CE.
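
The evaluation metrics above can be sketched in Python as follows. The LOF configuration (20 neighbours), the 30% actionability threshold (taken from the experimental setup in Sect. 5.2), and the definition of actionability as the share of changed features that belong to the user-specified list are assumptions of this sketch rather than fixed choices of the paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def sparsity(z, x):
    # Fraction of features changed between counterfactual z and test instance x
    return float(np.mean(np.asarray(z, dtype=object) != np.asarray(x, dtype=object)))

def actionability(z, x, user_features, all_features):
    # Share of the changed features that belong to the user-specified list
    changed = [f for f, zi, xi in zip(all_features, z, x) if zi != xi]
    if not changed:
        return 0.0
    return sum(f in user_features for f in changed) / len(changed)

def plausibility(z, X_train):
    # Boolean: z is not flagged as an outlier by Local Outlier Factor
    lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
    return bool(lof.predict(np.asarray(z, dtype=float).reshape(1, -1))[0] == 1)

def feasibility(z, x, f, t, X_train, user_features, all_features, act_threshold=0.3):
    # Eq. (4): valid, plausible, and sufficiently actionable
    valid = f.predict(np.asarray(z, dtype=float).reshape(1, -1))[0] == t
    return bool(valid
                and plausibility(z, X_train)
                and actionability(z, x, user_features, all_features) >= act_threshold)
```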

4 The Counterfactual Generation Pipeline

Researchers have suggested several guidelines for the practical utility of CEs, including the minimal effort to align with user preferences [8]. In this paper, we consider the interplay between the desiderata of user preferences (i.e. that CEs require only a subset of the features to be changed) and feasibility measures (i.e. that CEs remain feasible and cost-effective). The main building blocks and pipeline of UFCE are shown in Fig. 2 and they will be described in detail in the rest of this section. The main components are input (a trained ML model on a dataset, a test instance to be explained along with user constraints), UFCE (counterfactual generation mechanism that includes feature selection, finding nearest neighbours, and calling the different perturbation methods for counterfactual generation), and output (tabular presentation of generated counterfactuals).

Fig. 2 Counterfactual explanation generation pipeline of UFCE

4.1 User Preferences

A counterfactual instance might be close to being realised in the feature space; still, due to limitations in the real world, it might not always be feasible [11, 36, 50]. Therefore, enabling users to impose constraints on feature manipulation is natural and intuitive. These constraints can be imposed in two ways: first, the user may specify which features can be modified; second, he/she can set the feasible range for each feature, within which the counterfactual instances must be located. For instance, a constraint could be “income cannot exceed $10,000”. Given the feature values that the user provides as a starting point, we seek minimal changes to those values that result in an instance for which the black-box model makes a different (often a specific favourable) decision.

A particular novelty of our approach is that it perturbs only the subset of features that are deemed relevant and actionable by the user. This amounts to selecting features that have a higher impact on the target outcome and adhere to the user-defined constraints. Note that the user-defined constraints could be presumed to embody the user's intentions in the form of feasible feature ranges; however, these requirements are necessary preconditions and do not ensure that the explanations are aligned with the user's cognitive abilities [42, 44]. It is worth noting that we evaluate CEs only on the defined quantitative evaluation metrics (see Sect. 3.3), because human-grounded evaluation is out of the scope of this work.

4.2 Mutual Information of Features

Different strategies and assumptions have been exploited in the literature for feature selection and feature dependencies to account for CEs [28, 47, 52]. We use the Mutual Information (MI) of features. MI stands out as a robust feature selection method capturing both linear and non-linear relationships within data. Unlike some linear approaches, it does not assume a specific data distribution, enhancing its versatility across diverse datasets. MI effectively identifies informative and non-redundant features, making it particularly valuable in scenarios with complex, non-linear relationships. Its robustness to outliers and applicability in high-dimensional spaces further contribute to its effectiveness as a feature selection tool. The MI of two random variables, in probability theory and information theory [34], measures the extent of their mutual dependence. Unlike the correlation coefficient, which is restricted to linear dependence of real-valued random variables, MI captures differences in the joint distribution of the two variables in a comprehensible way. It evaluates the “amount of information” about one variable that can be gained by observing the other. MI is closely related to the concept of entropy, which measures the expected “amount of information” held in a random variable and is a fundamental idea in information theory [20]. The MI of two random variables \(x_i\) and \(x_j\) is given in Eq. (5):

$$\begin{aligned} I(x_i, x_j) = H(x_i, x_j) - H(x_i|x_j) - H(x_j|x_i) \end{aligned}$$
(5)

where \(H(x_i|x_j)\) and \(H(x_j|x_i)\) are the conditional entropies and \(H(x_i, x_j)\) is the joint entropy of \(x_i\) and \(x_j\) (the ith and jth variables). A value of zero for \(I(x_i, x_j)\) indicates that \(x_i\) and \(x_j\) are independent, while large values indicate strong dependence. We compute MI with the scikit-learn package in Python, which employs non-parametric techniques based on entropy estimation from k-nearest neighbour distances, as outlined in [21].

When treating features as independent, each feature is perturbed individually within a defined range; these perturbations are termed single-feature perturbations. When accounting for feature dependence, we select tuples (pairs and triplets of features) based on their MI and sort them in descending order (from higher to lower MI). These sorted tuples also help to decide which tuple to use (the one shared with the user-specified features) for the perturbations. In the case of pairs, the \(I(\cdot , \cdot )\) function directly provides the MI scores for each pair of features, from which we select the top pairs. To form a triplet, we extend each of the top pairs with one feature from the user-specified list (with no repetition of features within a triplet). The order of tuples is preserved with respect to the actual data distribution. The perturbations of these tuples are termed double-feature and triple-feature perturbations, respectively (see Sect. 4.4). It is worth noting that, for triplets, we do not exploit MI to uncover causal relationships [6, 16, 17], which is out of the scope of this work.
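
A minimal sketch of how feature pairs could be ranked by mutual information with scikit-learn is shown below. It assumes a purely numerical feature matrix X (a NumPy array) and estimates pairwise MI with mutual_info_regression, treating one feature of each pair as the "target"; the UFCE implementation relies on mutual_info_classif and may score pairs differently.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import mutual_info_regression

def mi_ranked_pairs(X, features):
    # Rank feature pairs by their mutual information, highest first
    idx = {f: i for i, f in enumerate(features)}
    scores = {}
    for fi, fj in combinations(features, 2):
        mi = mutual_info_regression(X[:, [idx[fi]]], X[:, idx[fj]],
                                    random_state=0)[0]
        scores[(fi, fj)] = mi
    return sorted(scores, key=scores.get, reverse=True)
```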

4.3 Nearest Neighbourhood (NN)

Under the problem statement considered in Sect. 3.2, a hyper-rectangle, also known as a box, is formed by the potential perturbations that could affect a test instance x. This box encompasses all the possible paths to a counterfactual z that could be reached from x through these perturbations. In other words, these paths prescribe one or more changes in x to reach z. Extreme perturbations in one or more features could push z outside the plausible space (a region not covered by the training set). To overcome this implausibility issue, we restrict the perturbations to a neighbourhood of x in the desired space (e.g. the loan-approved space in Fig. 1), adhering to the user constraints. This neighbourhood is computed from the desired space using a KDTree (a k-dimensional space-partitioning data structure). Some studies use the data instances in this neighbourhood directly as counterfactuals; however, this practice is discouraged in recent papers due to the concern of data leakage [9, 13, 31]. We instead utilise these neighbours for further perturbations to satisfy Eq. (1). The process of perturbation (described in Sect. 4.4) is guided by the MI shared among the features.
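
The neighbourhood search can be sketched with SciPy's KDTree as follows; the radius value and the assumption that desired_space is a purely numerical NumPy array are illustrative.

```python
import numpy as np
from scipy.spatial import KDTree

def find_nearest_neighbours(desired_space, x, radius):
    # FNN sketch: instances of the desired space (e.g. the loan-approved
    # region of the training data) lying within `radius` of the test instance x
    desired_space = np.asarray(desired_space, dtype=float)
    tree = KDTree(desired_space)
    idx = tree.query_ball_point(np.asarray(x, dtype=float), r=radius)
    return desired_space[idx]
```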

4.4 Perturbations

In this section, we describe the process of perturbing x to generate the counterfactual instance z, with the aim of answering the following questions:

  • Question (i): how are the features to perturb in x selected?

  • Question (ii): to what extent are the selected features perturbed?

  • Question (iii): what mechanism is used to update the feature values?

Regarding Question (i), the features are selected from the MI scores and the user-specified list of features to be modified. As our approach performs perturbations on subsets of features, there are three options. The first method perturbs a single feature at a time, exploiting the user-specified list of features. The second and third methods perturb two and three features simultaneously, respectively, and use tuples formed with MI scores (described in Sect. 4.2).

To answer Question (ii), we define a Python dictionary (key-value storage) that stores the user preferences regarding the perturbations of each feature, called the perturbation map: \(p = \{ x_{1}: [p_{1^l}, p_{1^u}],\ldots , x_d:[p_{d^l}, p_{d^u}] \}\), where \(p_{i^l}\) and \(p_{i^u}\) are the lower and upper bounds of the user-specified interval for the ith feature. For example, if the ith feature represents the ‘income’ of a loan applicant, then \(p_{i^l}\) tells by how much the ‘income’ may decrease at most, and \(p_{i^u}\) tells by how much it may increase at most. Credit managers may be able to define this information precisely from their experience. In addition, lay users can also impose these constraints in agreement with their requirements. For categorical features, as we deal only with binary categories, \(p_{j^l}\) and \(p_{j^u}\) hold the current and the new possible value (category) of the jth categorical feature (i.e. if \(p_{j^l}\) is 0, then \(p_{j^u}\) would be 1).
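
For illustration, a perturbation map for a loan applicant could look as follows; the feature names and bounds are hypothetical and only serve to show the expected structure.

```python
# Hypothetical perturbation map: numerical features get a [lower, upper]
# interval, binary categorical features get [current value, alternative value].
p = {
    "Income":   [45000, 55000],   # income may move between 45k and 55k
    "Mortgage": [0, 20000],       # mortgage may grow up to 20k
    "CCAvg":    [1.5, 3.0],       # average credit-card spending
    "Online":   [0, 1],           # categorical: may switch from 0 to 1
}
```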

To answer Question (iii), several strategies can be considered to identify relevant and meaningful changes to the input features by performing a specific search in the actual space. For example, gradient-based (optimisation) approaches search for perturbations that minimise the difference between x and z by adjusting feature values iteratively in the direction of the gradient. Unfortunately, when user-defined constraints are in place, this strategy can become less effective because of the uncertainty about convergence and the higher costs, in terms of changes, incurred in the CEs [26]. As an alternative, rule-based search uses a set of predefined rules to guide the search for perturbations, but this approach requires comprehensive domain knowledge to define the rules [14].

Hybrid search combines several approaches, and its effectiveness depends on the specific problem and the characteristics of the data. We customised a hybrid search for mixed-feature perturbations (perturbing numerical and categorical features). It involves predicting the value of a feature using a supervised ML model trained on the available dataset, excluding the feature whose value is being predicted. Thus, feature values are predicted using their respective prediction models: a regressor for numerical features and a classifier for categorical features (for each feature, a separate model is trained to predict its value given the rest of the known feature values). The predicted values must adhere to p to be considered legitimate perturbation values.
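
A minimal sketch of this per-feature prediction step is given below; it assumes a pandas DataFrame X_df with the training data and uses a linear regressor and a logistic classifier as the per-feature predictors, whereas the actual implementation may use other model families.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_feature_predictors(X_df, numf, catf):
    # For every feature, fit a model that predicts it from all remaining
    # features: a regressor for numerical features, a classifier for
    # categorical ones.
    predictors = {}
    for feat in list(numf) + list(catf):
        X_rest = X_df.drop(columns=[feat]).values
        y_feat = X_df[feat].values
        model = LinearRegression() if feat in numf else LogisticRegression(max_iter=1000)
        predictors[feat] = model.fit(X_rest, y_feat)
    return predictors
```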

Finally, let us briefly discuss the three methods provided by UFCE for finding counterfactuals.

The first method is based on single-feature perturbations, in which numerical and categorical features are perturbed separately. For this method, we do not predict new feature values from a learning model; rather, we use the values from the subspace formed by \(p_{i^l}\) and \(p_{i^u}\). This subspace contains ordered values uniformly distributed from \(p_{i^l}\) to \(p_{i^u}\). For iterative perturbations, the new feature values are taken from the midpoint of the subspace, traversing it with the notion of binary search. In other words, when a perturbed instance z does not provide the desired outcome, the next value is taken from the centroid of the remaining subspace to continue the cycle of perturbations. In the case of categorical features, the category is reversed (i.e. 0 to 1 or 1 to 0).

The second and third methods perform double- and triple-feature perturbations, respectively. The first feature in the tuple follows the same notion as single-feature perturbations. After perturbing the first feature, the value of the second feature is predicted with the previously trained ML model. Subsequently, both feature values help to predict the value of the third feature. This process is repeated until a valid and plausible z is found (further details are provided in Sect. 4.5).

4.5 Algorithmic Details

Algorithm 1 presents the pseudocode of UFCE and its sub-components. The input parameters of UFCE are the test instance x, the perturbation map p, the \(desired\_space\) (e.g. the loan-approved training space), the categorical features catf, the numerical features numf, the list of protected features protectf (these are decided with domain knowledge and their obvious nature, e.g. Family cannot be suggested to increase or decrease), the list of all features (features), the desired outcome t, the black-box model f, the training data X, and a dictionary step (holding the feature distribution to be used in the single-feature method). The output is a set of CEs (\(C_{cf}\)). The list of features to change, f2change, is obtained from the keys of p (line 1). The MI among the features is computed by calling the sub-routine CMI (short for COMPUTE_MUTUAL_INFORMATION, line 2), which provides a sorted list of feature pairs \(mi\_{pair}\) with MI scores. The nearest neighbourhood nn of x is mined by calling the sub-routine FNN (short for FIND_NEAREST_NEIGHBOURS) in the desired space within a specific radius such that \(y=t\) (line 3). A neighbourhood (subspace) is then computed which adheres to the user constraints in terms of feature ranges. This subspace is created by calling the routine INTERVALS, which takes nn, p, f2change, and x as input and outputs the subspace that adheres to the user constraints and intersects with the neighbourhood in the desired space (line 4). Then, the different variations of perturbations to find CEs (single, double, and triple feature) are called (lines 5–7); they return the counterfactual instances \(z^1\), \(z^2\), and \(z^3\), respectively, which specify the single-, double-, and triple-feature changes required to achieve the desired outcome t. Finally, the set \(C_{cf}\) contains the counterfactuals from all three variations of feature perturbations (line 8).

Algorithm 1 UFCE

Algorithm 2 UFCE sub-routines

Algorithm 2 presents the pseudocode of the sub-routines (functions), as follows.

  • The FNN function (lines 1–5) has three arguments, namely \(desired\_space\), x, and radius. It creates a KDTree object, which holds the k-dimensional space partitions based on the \(desired\_space\). Then, it finds the indices of the nearest neighbours of x within a certain radius using the \(query\_ball\_point\) method. Finally, it returns the nearest neighbours corresponding to those indices.

  • The CMI function (lines 6–16) has two arguments, namely features and X. It initialises an empty dictionary \(feature\_pairs\) (key-value storage) and an empty list \(mi\_pairs\). It then iterates through all the feature pairs in features and computes their MI scores using the mi_classif (short for mutual_info_classif) function provided by scikit-learn (line 9); for conciseness, we represent a feature pair with \(<{f_i, f_j}>\), while the actual implementation follows the structure of a nested for-loop. The MI score is used as a key in \(feature\_pairs\) to store the corresponding feature pair. The \(feature\_pairs\) dictionary is sorted in descending order of the MI scores and stored in a new dictionary P (storing feature pairs). Finally, the function iterates through the keys of P and appends the corresponding feature pair to \(mi\_pairs\) if it is not already present. The function returns \(mi\_pairs\).

  • The INTERVALS function (lines 17–25) has four arguments, namely nn, p, f2change, and x. It initialises an empty key-value storage subspace. It then iterates through all the features in p (the perturbation map) and sets their corresponding lower and upper bounds. If the upper bound is greater than or equal to the maximum value of the corresponding feature in the nearest neighbourhood, the upper bound is set to that maximum value. Similarly, the lower bound is validated (it is verified against the user constraints because it could be large enough to fall outside the actual distribution). The lower and upper bounds are then stored in the subspace using the feature as the key. The function returns the subspace (a sketch is given below).
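
A possible sketch of the INTERVALS routine is given below; the feature_index mapping and the symmetric clipping of the lower bound are assumptions made for illustration.

```python
import numpy as np

def intervals(nn, p, f2change, x, feature_index):
    # Clip the user-specified bounds so the search subspace stays inside
    # the nearest neighbourhood nn; feature_index maps feature names to
    # column positions (an assumption of this sketch).
    subspace = {}
    nn = np.asarray(nn, dtype=float)
    for feat in f2change:
        lower, upper = p[feat]
        col = feature_index[feat]
        nn_min, nn_max = nn[:, col].min(), nn[:, col].max()
        upper = min(upper, nn_max)   # do not exceed the neighbourhood
        lower = max(lower, nn_min)   # keep the lower bound inside the data
        subspace[feat] = [lower, upper]
    return subspace
```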

Algorithm 3 UFCE (Single_F)

Algorithm 3 presents the pseudocode for the single-feature perturbations of UFCE. The function Single_F (lines 1–16) has the following input parameters: x, catf, p, f, t, and step. The input instance x is the one to be explained. The function iterates over each feature i in the perturbation map p. If i is not a categorical feature, the function performs a binary-search-inspired traversal of the feature values to find the minimum value mid such that changing the ith feature value of x to mid results in the target outcome t and a plausible explanation z (plausibility is verified using the outlier detection algorithm, LOF, described in Sect. 3.3). If the binary search fails to find such a value in the range [start, end], where start and end are the lower and upper bounds of the feature range (this subspace is discretised uniformly), the search continues in the lower and upper halves from mid, where step is a dictionary holding the step size of each feature used to traverse to the next element. If i is a categorical feature, the function sets the value of the ith feature in z to its reverse value \(1-end\) and checks whether \(f(z) = t\) and z is a plausible explanation. If the condition is met, z is returned. Finally, the function returns the resulting explanation z.
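
A simplified sketch of the Single_F traversal for one numerical feature is given below; it assumes that the test instance value lies at the lower end of the user interval (as in the constraint levels of Sect. 5.2), that x is a dict whose value order matches the model's training features, and that is_plausible wraps an outlier check such as LOF.

```python
import numpy as np

def single_f(x, feature, start, end, f, t, is_plausible, max_iter=20):
    # Binary-search-style traversal of the user interval [start, end]
    # for one numerical feature of the dict-like test instance x.
    lo, hi, best = start, end, None
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        z = dict(x)
        z[feature] = mid
        z_vec = np.array([list(z.values())], dtype=float)
        if f.predict(z_vec)[0] == t and is_plausible(z_vec):
            best, hi = z, mid     # valid and plausible: try a smaller change
        else:
            lo = mid              # outcome not flipped yet: move further from x
    return best                   # None if no counterfactual in the interval
```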

Algorithm 4 UFCE (Double_F)

Algorithm 4 presents the pseudocode for the double-feature perturbations of UFCE. The function Double_F (lines 1–23) has the following input arguments: X, x, subspace, \(mi\_pair\), catf, numf, features, protectf, f, and t. This function aims to find a data point z that satisfies the condition \(f(z) = t\) while performing double-feature perturbations on the input x.

The function first iterates over each pair in \(mi\_{pair}\), the list of feature pairs ordered by their MI scores. It then checks whether features i and j are in the valid subspace. If so, it checks whether i is in numf, whether j is in numf or catf, and whether neither belongs to protectf (protected features). If all these conditions are satisfied, it generates a uniformly random set of values within the range (start, end) of the ith feature as \(traverse\_space\) and iterates over each value of this sorted set. In the meantime, a regressor h and a classifier g are trained to predict the value of feature j. To predict the value of j, they use z, which contains a copy of x with the value of the ith feature set to the mid value of \(traverse\_space\). The function then removes the column corresponding to the jth feature from z and, depending on whether j is a categorical or a numerical feature, applies the classifier g or the regressor h to predict the new value of the jth feature. It checks whether the resulting data point satisfies the condition \(f(z) = t\) and, if so, returns it. Otherwise, it reduces the size of \(traverse\_space\) (deleting the values from start to the midpoint) and continues with the next value. In the prediction mechanism, the first feature provides a space to move for further perturbations (in the uniform, or respective feature, distribution from start to end), and the second feature value is predicted by the respective predictor. After line 18, the vertical dotted line represents further cases (if applicable) of different combinations of numerical and categorical features, handled accordingly.

If i and j are both categorical features and neither is present in protectf, the \(Double\_F\) function sets the i and j features to the maximum values within their respective ranges of p in the subspace (reversing their values) and checks whether the resulting data point z satisfies the condition \(f(z) = t\). If so, it returns z.

Algorithm 5 Computing the distance of Counterfactual Explanations using \(\delta \)

Algorithm 5 finds the most suitable CE \(z^*\) that satisfies the desired outcome t by iteratively computing the distance between the given test instance x and all the candidate counterfactual instances \(z \in Z\) that satisfy \(f(z) = t\). The distance metric \(\delta (z, x)\) is computed as a weighted addition of \(prox\_Jac\) and \(prox\_Euc\), where the former measures the distance between categorical features and the latter the Euclidean distance between numerical features. The algorithm returns the instance \(z^*\) with the smallest distance \(\delta ^*\).
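
The selection step of Algorithm 5 can be sketched as follows, with delta standing for any distance function of the form of Eq. (1) (e.g. the earlier sketch); candidate instances are assumed to be numerical vectors.

```python
import numpy as np

def best_counterfactual(candidates, x, f, t, delta):
    # Among candidates that reach the desired outcome t, return the one
    # closest to x under the distance function delta.
    best, best_dist = None, float("inf")
    for z in candidates:
        if f.predict(np.asarray(z, dtype=float).reshape(1, -1))[0] != t:
            continue              # keep only valid counterfactuals
        d = delta(z, x)
        if d < best_dist:
            best, best_dist = z, d
    return best, best_dist
```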

5 Experiments and Results

We concentrate on empirical findings that help us respond to what we see as essential research questions:

  • (RQ1) Does user feedback (user constraints) affect the quality and computation of CEs?

  • (RQ2) How do randomly taken user constraints affect the generation of CEs?

  • (RQ3) What is the behaviour of UFCE on multiple datasets?

We have performed three experiments to answer the three research questions. The experimental settings are presented in Sect. 5.1. Then, the next sections (Sects. 5.2, 5.3, and 5.4) answer the above research questions.

5.1 Experimental Settings

Table 3 Dataset details: size (#instances), features (total features), Num-Cat (#Numerical and #Categorical features), classes (total classes), positive class (percentage), and average fivefold cross-validation (CV) accuracy with standard deviation for logistic regression and multi-layer perceptron

5.1.1 Datasets

We used five datasets to test the CE methods under study: two publicly available datasets with mixed data types from Kaggle competitions and three datasets from the KEEL dataset repository [3], to provide readers with a complete data analysis. All these datasets are for binary classification, and they can be downloaded from the UFCE project repository in the format required to run the experiments described in the rest of this paper. Detailed information on the datasets (i.e. their name, size, number of features, numerical and categorical counts, number of classes, and percentage of the positive class) is presented in Table 3.

5.1.2 Machine Learning Model

In this paper, we employed two types of ML models: Logistic Regression (LR) and Multi-Layer Perceptron (MLP). LR, a linear model, was used with default hyperparameters. We then explored MLP, a neural network model capable of non-linear transformations; the MLP configuration included a single hidden layer with 100 neurons, in addition to the other default hyperparameters. Table 3 presents the average fivefold cross-validation (CV) accuracy, along with the standard deviation, for LR and MLP.

It is important to note that this paper presents detailed experimental results only for LR due to page constraints. For further details on the results reported with the MLP model, the reader is referred to the supplementary material.

5.1.3 Counterfactual Explainer Methods

DiCE [26] strives to provide diverse CEs, and its implementation also covers categorical features. In our experiments, we used the standard DiCE library for the comparison of results. AR [47] focuses on the issue of actionability; it also adheres to restrictions that prevent immutable features from being altered. DiCE and AR were chosen because their public libraries support the imposition of user constraints. The AR explainer works with LR, and we ran it using the actionable-recourse library. For MLP, the AR explainer could not operate because its design is tailored to LR; consequently, we conducted that experiment exclusively with the UFCE and DiCE explainers, as elaborated in the supplementary material. The proposed UFCE approach is model- and data-agnostic for tabular datasets.

5.2 (RQ1) Effects of User Constraints on the Performance and Computation of Counterfactual Explanations

Table 4 (RQ1) The performance comparison in terms of generation of feasible counterfactuals (\(\%\)) for very limited (VL), limited (L), medium (M), flexible (F), and more flexible (MF) constraints

This experiment details how different levels of user constraints (user feedback) can affect the generation of CEs. The different levels of user constraints are configured to perturb the test instances to generate CEs. These constraints help to form the perturbation map p that guides the sub-processes of UFCE to generate counterfactuals. A specific percentage (in absolute value) of the median absolute deviation of the actual data distribution is computed as the user-specified perturbation limit for each numerical feature. We consider five levels of constraints, named very limited, limited, medium, flexible, and more flexible. These levels are assumed to simulate scenarios in which different users specify different constraints, as follows:

  • Very limited—\(20\%\) of the median absolute deviation of the relevant data.

  • Limited—\(40\%\) of the median absolute deviation of the relevant data.

  • Medium—\(60\%\) of the median absolute deviation of the relevant data.

  • Flexible—\(80\%\) of the median absolute deviation of the relevant data.

  • More flexible—\(100\%\) of the median absolute deviation of the relevant data.

The Bank Loan dataset is considered for this experiment. For example, the median absolute deviation of the feature ‘Income’ is 50.10. Accordingly, in this case, ‘very limited’ corresponds to 10.02, ‘limited’ is 20.04, ‘medium’ is 30.06, ‘flexible’ is 40.08, and ‘more flexible’ is 50.10.

The lower bound \(p_{i^l}\) of the perturbation map p is initialised with the value of \(x_i\), and the upper bound \(p_{i^u}\) with \(x_i\) plus the respective percentage (i.e. \(20\%\), \(40\%\), \(60\%\), \(80\%\), or \(100\%\)) of the MAD of the ith feature. This process is repeated for all the features taking part in the perturbations. For categorical features, the feature values are reversed at all five levels of constraints. The map p is updated iteratively for each level of user constraints, and the respective counterfactuals are computed.
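
A sketch of how such a perturbation map could be built for one constraint level is given below; it assumes a pandas DataFrame X_df with the training data, a dict-like test instance x, and binary categorical features, and it computes the MAD directly as the median of absolute deviations from the median.

```python
def build_perturbation_map(x, X_df, numf, catf, level=0.2):
    # Build the perturbation map p for one constraint level: `level` is the
    # fraction of each numerical feature's median absolute deviation
    # (0.2 = 'very limited', ..., 1.0 = 'more flexible').
    p = {}
    for feat in numf:
        mad = (X_df[feat] - X_df[feat].median()).abs().median()
        p[feat] = [x[feat], x[feat] + level * mad]   # lower bound = current value
    for feat in catf:
        p[feat] = [x[feat], 1 - x[feat]]             # binary category reversed
    return p
```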

We ran the experiment on a pool of 50 test instances; for each test instance, counterfactuals were generated at all levels of user constraints with UFCE, DiCE, and AR (our approach includes its three variations). DiCE was configured in two ways: (i) DiCE-UF takes as input the same user feedback as UFCE; and (ii) basic DiCE does not take any specific user feedback as input, but after its counterfactuals are generated, we verify whether or not they adhere to the desired user feedback ranges. AR was configured with the input features that were supposed to be changed according to the user feedback, and its generated counterfactuals were checked afterwards for adherence to the user feedback. For all methods, all the features were assumed to form the user-specified list of features to change.

For each test instance, each counterfactual explainer was configured to generate the 5 best counterfactuals. Then, the proximity cost was calculated for each CE, and the counterfactual nearest to the test instance was chosen (provided that it was feasible) for further evaluation. A CE is feasible if it is actionable and plausible. To fulfil this requirement, we considered a CE actionable when at least \(30\%\) of its total suggested feature changes involved features from the user-specified list, and it was not an outlier. Table 4 presents the consolidated results of feasible counterfactuals (in \(\%\)) for each method at all five levels of user feedback, and Fig. 3 plots the average results for all evaluation metrics. Similarly, the time per counterfactual (in seconds) was recorded for all methods and is presented in Table 5.

Fig. 3 (RQ1) Performance of CE methods for different evaluation metrics (with error bars of st. dev.)

The performance of generating feasible counterfactuals improves as we move from ‘very limited’ to ‘more flexible’ in Table 4. In general, UFCE2 and UFCE3 performed better than the other methods. UFCE3 took more time on average than the other methods, especially when ‘more flexible’ user constraints were in place. The reason behind the higher time of UFCE3 is the multiple combinations of features and the wider subspace to explore.

In general, UFCE surpassed DiCE, DiCE-UF, and AR in all configurations of user constraints for feasible counterfactual generation. Regarding computational time, UFCE1 and AR were faster than the other methods at generating counterfactuals. The reason behind the generally better performance of UFCE is its targeted perturbations, which look for counterfactuals that are valid, plausible with respect to the reference population, and actionable within the user-defined limits.

Further, Fig. 3 plots the average results for the different evaluation metrics. In each plot, the CE methods are placed on the x-axis and the metric scores on the y-axis. Lower values are better for Proximity-Jac, Proximity-Euc, and Sparsity, while higher values are better for Feasibility. Proximity-Jac represents the percentage of categorical features utilised; UFCE1 and UFCE2 did not use any categorical features for generating CEs, whereas DiCE is the method utilising the most categorical features. Proximity-Euc represents the Euclidean distance of the generated CE from the test instance; DiCE turned out to be the most expensive method in terms of suggested changes, whereas the UFCE variations performed better than the other methods. Sparsity represents the number of features changed in the generated CE; DiCE showed a higher sparsity value, meaning that it incurred multiple feature changes, which also led to a higher Proximity-Euc, while UFCE generally performed better. Similarly, UFCE performed better than the other methods at generating feasible counterfactuals.

This experiment has shown that user feedback has a clear influence on the generation of counterfactuals. It is evident that when the user constraints are flexible (at least equal to the median absolute deviation), the results improve for each method that incorporates user feedback, to the extent of its capacity.

5.3 (RQ2) How Do the Randomly Taken User Preferences Affect the Generation of CEs?

Table 5 The performance of different CE methods, in terms of average time per CE (seconds)
Table 6 (RQ2) The performance comparison in terms of generation of feasible CEs (in \(\%\)) for Monte Carlo-like random generation of user feedback
Fig. 4 (RQ2) Random user preferences for CE generation: the bar plots depict the evaluation results for different evaluation metrics (with error bars of st. dev.)

The second experiment is similar to the one described in Sect. 5.2; the only change concerns the user feedback. In this experiment, we worked with randomly drawn user preferences rather than any presuppositions. This time, we initialised the perturbation map p with randomly chosen upper bounds \(p_{i^u}\) (the lower bounds are the actual test instance values). This is not an actual Monte Carlo simulation but rather a random sampling of the upper bound of each feature used in the generation of counterfactuals. For a pool of 50 test instances, randomly generated user feedback is used (in a real scenario, the user would provide it as an affordable recourse), and this process is repeated 10 times.
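
The random user feedback of this experiment could be simulated as sketched below; sampling the upper bound uniformly between the test instance value and the feature's observed maximum is an assumption of this sketch, since the paper only states that the upper bounds were chosen at random.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_perturbation_map(x, X_df, numf):
    # Lower bound is the test instance value; the upper bound is sampled
    # uniformly between that value and the feature's observed maximum.
    p = {}
    for feat in numf:
        upper = rng.uniform(x[feat], X_df[feat].max())
        p[feat] = [x[feat], upper]
    return p
```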

The results for the feasibility metric are presented in Table 6. UFCE3 surpassed the other methods in the average results reported for randomly drawn user constraints. Figure 4 illustrates the average results for the other evaluation metrics. We can observe that UFCE performed better than the other methods for proximity and sparsity. We considered the average metric scores over all the generated counterfactuals; for example, in the case of sparsity, if a method generated only \(24\%\) feasible counterfactuals, then we computed the sparsity of the generated counterfactuals only and took their mean. In the case of UFCE1, the sparsity value is 1, which means UFCE1 changed only 1 feature out of the total number of features, whereas DiCE changed around \(6\%\) of the features on average. Regarding sparsity, UFCE1 is therefore better than DiCE, as it suggests a smaller proportion of feature changes to obtain a CE.

The reason for the better feasibility (e.g. actionability) results of UFCE is that it already employs the subset of features from the user-specified list and performs targeted perturbations during counterfactual generation. Furthermore, when DiCE and AR were required to generate CEs on the same subset of features as UFCE, they were only able to generate from \(4\%\) to \(6\%\) of the CEs; that being the case, DiCE and AR were allowed to exploit the rest of the features in their search for counterfactuals. It is worth noting that DiCE-UF was able to generate around \(40\%\) feasible counterfactuals. To do so, DiCE-UF takes into account the given (randomly generated) user feedback in the same way as UFCE does, but without considering mutual information; in consequence, it achieved a smaller percentage of feasible counterfactuals than UFCE. Taking the user feedback into account in the generation mechanism restricted it from making extreme changes when searching for CEs. In general, UFCE (and its variations) showed better results than DiCE and AR.

5.4 (RQ3) What is the Behaviour of UFCE on Multiple Datasets?

We conducted a third experiment to compare UFCE with the other counterfactual methods on five datasets. This experiment follows the same setup as the one described in Sect. 5.2. Nevertheless, the user constraints are now fixed to a threshold equal to \(50\%\) of the median absolute deviation (MAD) of each feature in the actual data distribution of each test fold. Each dataset was split into 5 test folds, and the mean results of CE generation over all folds are reported.

Table 3 (introduced in Sect. 5.1) contains the details about the different datasets, their features, and the ML models' fivefold cross-validation (CV) mean accuracy. The comparative results (means over the different test folds) on the five datasets are presented for proximity-Jac, proximity-Euc, sparsity, actionability, plausibility, and feasibility in Table 7. Figure 5 provides complementary bar plots to facilitate the interpretation of the numbers reported in Table 7.

Table 7 (RQ3) Comparative results on multiple datasets for different evaluation metrics
Fig. 5 (RQ3) Comparative results for CE generation on multiple datasets: the bar plots depict the evaluation results for the different evaluation metrics (with error bars of st. dev.)

We can observe that UFCE performed better in most of the evaluation metrics on multiple datasets. For each evaluation metric, the best result among all methods on each dataset is highlighted in bold in Table 7. Regarding proximity-Jac, three datasets contain categorical features: UFCE1 utilised \(0.6\) (\(60\%\)) of the categorical features for the Bank Loan dataset, while for the Graduate and Movie datasets no UFCE variation utilised categorical features. This is positive in the sense that the user does not have to change the category of a feature, which in some cases is not viable (e.g. changing the gender feature in some real-world datasets). Regarding proximity-Euc, UFCE1 performed better than the others on the Graduate, Bank Loan and Movie datasets, while UFCE2 performed better than the others on the Wine and Bupa datasets.

Regarding sparsity, UFCE1 performed better than all other methods by suggesting only one feature change. Regarding actionability, UFCE1 and DiCE-UF shared the best performance on the Movie dataset; UFCE1 performed best on the Bupa dataset; and UFCE3 performed best on the Graduate, Bank Loan, and Wine datasets. Regarding plausibility, UFCE2 and UFCE3 performed best on the Graduate dataset; AR performed best on the Bank Loan dataset; AR and UFCE3 shared the best performance on the Wine dataset; AR, UFCE1, and UFCE2 shared the best performance on the Bupa dataset; and AR and DiCE shared the best performance on the Movie dataset. Regarding feasibility, UFCE2 and UFCE3 shared the best performance on the Graduate dataset; UFCE3 performed best on the Bank Loan, Wine, and Movie datasets; and UFCE1 and UFCE2 shared the best performance on the Bupa dataset.

Finally, we can draw some conclusions about how well the various UFCE variations performed regarding proximity and sparsity. The generated counterfactuals are meaningful and easy to understand because they are situated relatively near to the considered test cases and involve only a few modifications. UFCE produces coherent counterfactuals while adhering to user-defined actionability restrictions. The counterfactuals produced by UFCE are drawn from the data distribution of the same ground-truth class, which yields a high plausibility score. This ensures that the created counterfactuals are plausible and, consequently, feasible.
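A plausibility check of this kind can be approximated generically by testing whether a counterfactual lies within the data distribution of the target class, for example with a local outlier factor model as in the sketch below; this is a hypothetical approximation for illustration only, not the exact procedure implemented in UFCE.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def is_plausible(counterfactual, X_target_class, n_neighbors=10):
    """Heuristic check: does the counterfactual lie within the data distribution
    of the desired (target) class? Returns True if it is scored as an inlier."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    lof.fit(X_target_class)  # ground-truth instances of the desired class
    return lof.predict(np.asarray(counterfactual).reshape(1, -1))[0] == 1
```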

Accordingly, UFCE consistently exhibited better results across all five datasets under study. These positive outcomes suggest robustness and effectiveness in various scenarios. Acknowledging the nuanced nature of performance evaluation, these findings provide promising indications of the efficacy of UFCE across similar tabular datasets.

6 Discussion

To gain insight into the overall outcomes of the conducted experiments involving the LR and MLP black-box models with the employed counterfactual explainers across the different evaluation metrics, polar plots were used to visualise their performance. In Fig. 6, the left polar plot depicts the performance of UFCE and DiCE for the MLP model, while the right polar plot displays the performance of UFCE, AR, and DiCE for the LR model. Each plot features a radial axis representing the five datasets, with the different evaluation metrics positioned at various angles. The efficacy of a counterfactual explainer is represented by the extent of its coverage in the polar plot, with distinct colours indicating the different explainers. A greater coverage area within the polar plot corresponds to better explainer performance.
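A simplified radar-style polar plot of this kind can be produced as in the following sketch; the scores shown are purely illustrative, and the layout is simplified (normalised metric scores on the radial axis), so it does not reproduce the actual figure.

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["prox-Jac", "prox-Euc", "sparsity", "actionability", "plausibility", "feasibility"]
# Illustrative (not actual) scores normalised to [0, 1] for two explainers
scores = {"UFCE1": [0.8, 0.9, 1.0, 0.6, 0.7, 0.6],
          "DiCE":  [0.4, 0.3, 0.2, 0.3, 0.5, 0.3]}

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, vals in scores.items():
    vals = vals + vals[:1]
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.25)  # larger filled area = better overall coverage
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.legend(loc="upper right")
plt.show()
```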

Fig. 6 Overall performance of the counterfactual explainers: (left) results when MLP was employed as black-box; (right) results when LR was employed as black-box

For example, the red, yellow, and blue colours, which represent UFCE1, UFCE2, and UFCE3, respectively, cover the largest polar areas in both plots. This means that these methods performed better on multiple metrics across the datasets.

In the left plot (Fig. 6), UFCE1 performed best in sparsity for 5 datasets, proximity-Euc for 4 datasets, and proximity-Jac for 1 dataset. UFCE2 performed better in proximity-Euc, plausibility, actionability, and feasibility for 1 dataset, while it performed best in proximity-Jac for 2 datasets. UFCE3 demonstrated superior performance in actionability for 2 datasets, plausibility for 3 datasets, and feasibility for 4 datasets. DiCE-UF achieved better results in plausibility for 1 dataset.

In the right plot (Fig. 6), UFCE1 performed better in sparsity for 5 datasets, proximity-Euc for 3 datasets, and proximity-Jac for 2 datasets. Likewise, UFCE3 demonstrated superior performance in plausibility for 2 datasets, actionability for 3 datasets, and feasibility for 4 datasets. AR showed better results in plausibility for 2 datasets; DiCE-UF performed best in actionability for 2 datasets; and DiCE exhibited better performance in plausibility for 1 dataset.

The results presented in the polar plots demonstrate the effectiveness of UFCE across different evaluation metrics and datasets when compared to other explainers such as DiCE and AR. UFCE consistently outperforms DiCE and AR in providing explanations for both LR and MLP black-box models. This superior performance is evident across various metrics including sparsity, proximity, plausibility, actionability, and feasibility.

A significant benefit of UFCE lies in its capacity to provide explanations that are not just concise and sparse, but also actionable and feasible. For instance, UFCE1 emphasises sparsity and proximity, whereas UFCE3 delivers actionable and feasible counterfactuals, represented by the red and blue shaded areas in the polar plots, respectively. This capability is especially valuable in practical scenarios where users aim to minimise adjustments while demanding actionable and feasible explanations from the explainer system.

Moreover, the confirmation of UFCE's efficacy on the employed black-box models highlights its adaptability and robustness, and implies its potential for efficient application across both linear and non-linear ML models.

7 Conclusion and Future Work

Even though the rules governing interpretable algorithms are still in their early stages, regulations demand explanations that provide actionable information fulfilling human needs. The customers of explainable systems have been empowered by laws to obtain actionable information. In specific domains, for example credit scoring, the Equal Credit Opportunity Act in the United States was designed to make customers aware of adverse actions [45]. Our approach strives to provide actionable information by involving the user, so as to gain trust in the generated CEs. Revealing actionable information could benefit domain experts in debugging and diagnosis. On the other hand, actionable information could also be used for reverse engineering to learn the model's behaviour (model internals, which the model owners never want to reveal). We tried to balance this trade-off by confining the user to customising the information only to some extent, while respecting their right to explanations.

This study introduces a novel methodology (UFCE) for generating user feedback-based CEs, which addresses the limitations of existing CE methods in explaining the decision-making process of complex ML models. UFCE allows for the inclusion of user constraints to determine the smallest set of feature modifications while considering feature dependence and evaluating the feasibility of suggested changes. Three experiments conducted using benchmark evaluation metrics demonstrated that UFCE outperformed two well-known CE methods regarding proximity, sparsity, and feasibility. The third experiment, conducted on five datasets, demonstrated the feasibility and robustness of UFCE on tabular datasets. Furthermore, the results indicated that user constraints influence the generation of feasible CEs. Therefore, UFCE can be considered an effective and efficient approach for enriching ML models with accurate and practical CEs. The software and data are available as open source, for the sake of open science, at the GitHub repository of UFCE.Footnote 12

In its present form, UFCE handles binary classification problems. In future research, we intend to extend our approach to multi-class classification, thereby making it suitable for a broader range of classification tasks. Future work will also extend user involvement with a series of human-centred evaluations to increase the usefulness of the developed framework. One prospect is human-grounded evaluation, which could be achieved by analysing the user's comprehension of the explanations. More specifically, we plan to design a cognitive framework for assessing the comprehension of explanations in a user study.