1 Introduction

Artificial Intelligence (AI) has been applied as a powerful technology in various fields [10, 23, 24, 27], enabling an innovative society and changing our lifestyles. Although AI technology is booming, many problems remain unsolved, and the black box problem is the most pressing among them [2, 8, 16, 30, 33]. To understand AI models, Explainable AI (XAI) [5, 6, 19, 29] has become a crucial research topic. Generally, XAI models are of two types: intrinsic (rule-based) or post hoc [22]. Intrinsic models obtain interpretability by restricting the machine learning model itself, e.g., linear regression, logistic analysis, and Grad-CAM [28]. In contrast, post hoc models apply interpretation after training, such as Local Interpretable Model-agnostic Explanations (LIME) [26, 37] and SHAP [18, 20]. Rule-based intrinsic and post hoc models do not share standard metrics; therefore, they can only be compared through their factor rankings and cannot be combined directly. However, the kernel SHAP method makes ensembling of factor contributions possible.

The SHAP method has been used in various fields [1, 3, 4, 9, 12, 17, 31, 34, 35] and has been shown to be robust [21, 32, 36]. SHAP enables us to explain black box models and to identify the local and global reasons for a prediction or classification. SHAP methods come in two kinds: model-agnostic (kernel SHAP) and model-specific (Tree SHAP, Deep SHAP) [18, 20]. The model-specific SHAPs are designed to explain particular model classes, reducing the computation or approximation loss for complex models, but they can only be used for those models. In contrast, kernel SHAP can be applied to any model. However, does kernel SHAP explain models correctly and reliably? This needs to be clarified.

To overcome these challenges, we propose an ensemble XAI model that yields a more reliable factor importance ranking. Our proposed ensemble XAI model combines the models’ performance with their SHAP values. We first used kernel SHAP to explain various machine-learning models on six datasets. Then we combined the models’ accuracy and SHAP values using our proposed cross-ensemble methodology to obtain a more reliable factor ranking. Finally, we compared our results with those of other models. Our main contributions are summarized as follows:

  • We proposed an ensemble XAI methodology to calculate the factor importance ranking, which can be used for both classification and regression models.

  • Our proposed methodology ranks factor importance in a comparatively stable and reliable manner.

  • Our analysis identified new essential risk factors for diabetes based on non-objective-oriented census data in Japan.

  • Our study paves the way for trustworthy AI research.

The rest of this paper is organized as follows. Sect. 2 introduces the necessary background on SHAP and our proposed methodology. Sect. 3 describes the six datasets used. Sect. 4 presents the detailed results of our study. Sect. 5 discusses our findings and future research directions. Finally, Sect. 6 concludes the paper.

2 Methodology

Our proposed methodology is based on SHAP, which is an enhancement of LIME. We first introduce the basics of the LIME method (Subsect. 2.1: Introduction of LIME) and the general kernel SHAP method (Subsect. 2.2: Introduction of kernel SHAP) to establish the theoretical foundation of our proposed methodology. Then, we explain our proposed cross-ensemble feature ranking methodology in Subsect. 2.3: Cross-ensemble factor ranking.

2.1 Introduction of LIME

LIME is a concrete implementation of local surrogate models. Surrogate models are trained to approximate the predictions of the underlying model. Instead of training a global surrogate model, LIME focuses on training local surrogate models to explain individual predictions. LIME generates a new dataset consisting of perturbed samples and the corresponding predictions of the black box model. On this new dataset, LIME then trains an interpretable (generally linear) model, which is weighted by the proximity of the sampled instances to the instance of interest. Mathematically, local surrogate models with interpretability constraints can be expressed as follows:

$$\begin{aligned} \zeta (x) = arg\min _{g\in G} L\left( f, g, \pi _x \right) + \Omega \left( g\right) \end{aligned}$$
(1)

The explanation is defined as a model \(g \in G\), where G is a class of potentially interpretable models, such as linear models, decision trees, and falling rule lists. The domain of g is \(\lbrace 0,1 \rbrace ^{d}\), i.e., g acts over the presence or absence of the interpretable components. Moreover, \(\Omega (g)\) measures the complexity of g, e.g., for linear models, the number of nonzero weights. The model being explained is denoted \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\). LIME uses \(\pi _x\) as a proximity measure between an instance z and x to define a locality around x. Finally, let \(L(f, g, \pi _x)\) be a measure of how unfaithful g is in approximating f in the locality defined by \(\pi _x\). To ensure both interpretability and local fidelity, LIME minimizes \(L(f, g, \pi _x)\) while keeping \(\Omega (g)\) low enough to be interpretable by humans.

LIME minimizes the locality loss \(L(f, g, \pi _x)\) without making any assumptions about f, i.e., it is model-agnostic. Thus, to learn the local behavior of f as the interpretable inputs vary, LIME approximates \(L(f, g, \pi _x)\) by drawing samples weighted by \(\pi _x\). LIME samples instances around x by drawing nonzero elements of x uniformly at random (where the number of such draws is also uniformly sampled). Given a perturbed sample \(z^\prime \in \lbrace 0, 1\rbrace ^{d}\) (which contains a fraction of the nonzero elements of \(x\)), LIME recovers the sample in the original representation \(z \in {\mathbb {R}}^d\) and obtains \(f(z)\), which is used as a label for the explanation model. Given this dataset \(Z\) of perturbed samples with the associated labels, LIME optimizes Eq. 1 to obtain an explanation \(\zeta (x)\).

The steps for training a local surrogate model in LIME are as follows (a minimal code sketch is given after the list):

  • Select the instance of interest for which an explanation of its black box prediction is desired.

  • Perturb the dataset and get the black box predictions for these new points.

  • Weight the new samples according to their proximity to the instance of interest.

  • Train a weighted, interpretable model on the newly selected dataset with the variations.

  • Explain the prediction by interpreting the local model.
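As an illustration of these steps, the following minimal sketch (our own, not the authors’ implementation; the synthetic data, the random-forest black box, and the Gaussian perturbation and proximity kernel are assumptions made for the example) fits a weighted linear surrogate around one tabular instance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data and black-box model (any classifier with predict_proba works).
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def lime_explain(x, predict_proba, n_samples=1000, kernel_width=0.75):
    """Fit a weighted linear surrogate g around the instance x (cf. Eq. 1).

    Step 1: the instance of interest x is passed in by the caller.
    """
    d = x.shape[0]
    # Step 2: perturb the instance and query the black box for these new points.
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, d))
    f_z = predict_proba(Z)[:, 1]
    # Step 3: weight the samples by proximity to x (an exponential kernel pi_x).
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Step 4: train a weighted, interpretable (linear) surrogate model.
    surrogate = Ridge(alpha=1.0).fit(Z, f_z, sample_weight=weights)
    # Step 5: the coefficients are the local explanation for x.
    return surrogate.coef_

print(lime_explain(X[0], black_box.predict_proba))
```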

2.2 Introduction of kernel SHAP

The goal of SHAP is to explain the prediction for an instance x by computing each feature’s contribution to the prediction of a model. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance x act as players in a coalition, and Shapley values tell us how to distribute the prediction fairly among the features. SHAP specifies the explanation as:

$$\begin{aligned} f(x) = g\left( z^\prime \right) = \phi _0 + \sum \limits _{i=1}^M \phi _i z_i^\prime \end{aligned}$$
(2)

where \(z^\prime \in \lbrace 0,1\rbrace ^M\), M is the number of simplified input features, and \(\phi _i \in {\mathbb {R}}\). The vector \(z^\prime\) is the simplified (coalition) representation of \(x\), and M equals the dimensionality of the original feature space. In kernel SHAP, \(g\left( z^\prime \right)\) is a linear model. Unlike in LIME, the contribution of feature i to the explanation of \(x\) becomes

$$\begin{aligned} \phi _i\left( f,x\right) = \sum \limits _{z^\prime \subseteq x } \pi _x (z^\prime ) \left[ f_x(z^\prime ) -f_x(z^\prime \setminus {i}) \right] \end{aligned}$$
(3)

where \(z^\prime\) is a coalition vector representing \(x\). \(f_x(z^\prime )\) is the model output when \(z_i^\prime\) is 1 (feature i present), while \(f_x(z^\prime {\setminus }{i})\) is the model output when \(z_i^\prime\) is 0 (feature i absent).

In the kernel SHAP method,

$$\begin{aligned} \Omega (g) = 0 \end{aligned}$$
(4)

and the kernel of \(\pi _x\) becomes

$$\begin{aligned} \pi _x (z^\prime ) = \frac{(M-1)}{\left( {\begin{array}{c}M\\ \vert z^\prime \vert \end{array}}\right) \vert z^\prime \vert \left( M- \vert z^\prime \vert \right) } \end{aligned}$$
(5)

where \(\vert z^\prime \vert\) is the number of nonzero entries in \(z^\prime\), and \(z^\prime \subseteq x\) denotes all \(z^\prime\) vectors whose nonzero entries are a subset of the nonzero entries of \(x\).
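As a brief worked example of Eq. 5 (illustrative values only, not from the paper): for M = 4 features and a coalition with \(\vert z^\prime \vert = 2\) present features,

$$\begin{aligned} \pi _x (z^\prime ) = \frac{4-1}{\left( {\begin{array}{c}4\\ 2\end{array}}\right) \cdot 2 \cdot (4-2)} = \frac{3}{24} = 0.125, \end{aligned}$$

whereas coalitions with \(\vert z^\prime \vert = 1\) or 3 receive weight 0.25. Coalitions of intermediate size thus receive comparatively low weight, while nearly empty or nearly full coalitions receive the largest weights.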

The explanation model \(g(z^\prime )\) matches the original model \(f(x)\) when \(x=h_x(z^\prime )\), where \(\phi _0=f(h_x({\textbf {0}}))\) corresponds to the model output when all simplified features are absent. Therefore, the loss function of kernel SHAP becomes

$$\begin{aligned} L \left( {\widehat{f}}, g, \pi _x \right) = \sum _{z^\prime \in Z} \left[ {\widehat{f}}\left( h_x \left( z^\prime \right) \right) - g\left( z^\prime \right) \right] ^2 \pi _x \left( z^\prime \right) \end{aligned}$$
(6)

Kernel SHAP estimates the contributions for an instance \(x\) in five steps (a usage sketch follows the list):

  • Sample coalitions \(z_k^\prime \in \lbrace 0,1 \rbrace ^M\), \(k\in \lbrace 1,\dots ,K \rbrace\), where M is the number of features (1: feature present in the coalition, 0: feature absent)

  • Get the prediction for each \(z_k^\prime\) by first mapping \(z_k^\prime\) back to the original feature space and then applying the model \({\widehat{f}}\): \({\widehat{f}}\left( h_x(z_k^\prime ) \right)\)

  • Compute the weight for each \(z_k^\prime\) with the SHAP kernel (Eq. 5)

  • Fit a weighted linear model

  • Return Shapley values \(\phi _i\): the coefficients from the linear model.
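The following minimal usage sketch (ours, not the paper’s experiments) corresponds to the five steps above; it assumes the shap package’s KernelExplainer and shap.sample utilities, whose exact return shapes may vary slightly across versions, and uses placeholder data and a placeholder model:

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data and model (not the paper's datasets).
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
model = LogisticRegression().fit(X, y)

def f(data):
    """Single-output wrapper: probability of the positive class."""
    return model.predict_proba(data)[:, 1]

# A small background sample represents "feature absent" when forming coalitions.
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(f, background)

# Local explanation for one instance: one phi_i per feature.
phi = explainer.shap_values(X[:1])

# Additivity check (Eq. 2): phi_0 + sum_i phi_i should approximate f(x).
print(explainer.expected_value + phi.sum(), f(X[:1])[0])
```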

Even though kernel SHAP can help us understand each factor’s contribution within each model, the factor ranking differs from model to model (Fig. 2). Theoretically, because kernel SHAP is based on an approximate calculation, the kernel SHAP values of different methods will differ. Therefore, a single kernel SHAP value cannot represent the true ranking of factor importance, even though the kernel SHAP method helps us explain the models and even reveals relations among factors. In particular, the importance ranking of critical factors is vital in healthcare and medicine. Therefore, our proposed cross-ensemble factor importance ranking becomes necessary and meaningful.

2.3 Cross-ensemble factor ranking

From the introduction of kernel SHAP and Eq. 6, we can see that the approximated linear model uses only the predicted output of one model (details in Subsect. 2.1), and this predicted output is affected by how good the prediction model is. Meanwhile, because the model-agnostic explanation method only approximates the prediction outputs of models, the kernel SHAP values of all models share the same metric when we analyze one dataset. Our tests (Fig. 2) show that the factors’ ranking in each model changes when we use kernel SHAP to analyze a single dataset. Therefore, a better factor contribution ranking method is needed, and our proposed methodology addresses this problem.

Moreover, the quality of each model should also be considered when calculating factor importance. Thus, we use the models’ accuracies to adjust the ranking of factors (Eq. 8): if one model has higher accuracy, it carries more weight in the final factor importance ranking. Accordingly, we propose the cross-ensemble factor contribution calculation shown in Eq. 9.

$$\begin{aligned} I_j= \sum _{i=1}^{k}\phi _i \end{aligned}$$
(7)

That is, the importance of a factor in model j is the sum of all its local contributions.

$$\begin{aligned} W_j= & {} \frac{\exp (Acc_j)}{\sum _{j=1}^{N} \exp (Acc_j)} \end{aligned}$$
(8)
$$\begin{aligned} I= \sum _{j=1}^{N} W_j * I_j \end{aligned}$$
(9)

Here, \(Acc_j\) is the accuracy of classification or regression model j, N is the number of analytical approaches applied to one dataset, and \(I_j\) is the factor importance obtained from approach j.

Our proposed methodology uses N approaches to analyze one dataset and obtains the kernel SHAP values of each approach. Then the kernel SHAP values of N−1 approaches are chosen iteratively from the N approaches to calculate the ensemble factor importance using Eq. 9. Finally, the average of all N ensemble iterations is used as the final cross-ensemble factor importance ranking. In our proposed methodology, the performance (accuracy) of each model is taken into account by weighting the factor contributions (Eq. 8); a model with relatively high accuracy is given a higher weight in our factor ranking calculation. A minimal code sketch of this computation follows.
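The sketch below reflects our reading of Eqs. 7–9 and the leave-one-out averaging described above; the function and argument names are ours, not the authors’, and the softmax weights are renormalized over each subset of N−1 models (an assumption):

```python
import numpy as np

def ensemble_importance(shap_values, accuracies):
    """Cross-ensemble factor importance (a sketch of Eqs. 7-9).

    shap_values: dict model_name -> array (n_samples, n_features) of local SHAP values
    accuracies:  dict model_name -> accuracy (or R^2 for regression) of that model
    """
    names = list(shap_values)
    # Eq. 7: per-model global importance = sum of the local contributions.
    I = {m: shap_values[m].sum(axis=0) for m in names}

    def weighted_importance(subset):
        # Eq. 8: softmax weights over the subset's accuracies (renormalized).
        acc = np.array([accuracies[m] for m in subset])
        w = np.exp(acc) / np.exp(acc).sum()
        # Eq. 9: accuracy-weighted combination of per-model importances.
        return sum(w_j * I[m] for w_j, m in zip(w, subset))

    # Leave-one-out: combine every subset of N-1 models, then average the N results.
    rounds = [weighted_importance([m for m in names if m != left_out])
              for left_out in names]
    return np.mean(rounds, axis=0)
```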

The overall calculation flow is shown in Fig. 1. First, nine general and robust machine learning classification models were applied to the five classification-task datasets: logistic regression, naive Bayes classification, quadratic discriminant analysis (QDA), k-nearest neighbors classification, AdaBoost, decision tree (DT), random forest classification, XGBoost, and multi-layer perceptron (MLP) classification. Furthermore, six regression methods were used for the regression-task dataset (household dataset). Then, for each dataset, kernel SHAP was used to explain each model and obtain the contribution (global SHAP value) of each factor, and the resulting importance ranking of factors was reviewed (Fig. 2). After obtaining the kernel SHAP values of the factors, we used our proposed methodology to calculate the final factor importance ranking. Finally, we compared our proposed methodology’s results with those of XGBoost, as shown in Sect. 4. A condensed code sketch of this flow is given after Fig. 1.

Fig. 1 Our proposed methodology flow chart
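For illustration, a condensed version of this flow for one classification dataset might look as follows (a sketch with an abbreviated model list and a placeholder scikit-learn dataset, not the paper’s experiments); the resulting accuracies and SHAP values feed the cross-ensemble step of Subsect. 2.3:

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder dataset (not one of the paper's six); 0.7/0.3 split as in Sect. 3.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# Abbreviated model zoo (the paper uses nine classifiers).
models = {
    "logistic": LogisticRegression(max_iter=5000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
}

background = shap.sample(X_tr, 50)
acc, shap_vals = {}, {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    acc[name] = clf.score(X_te, y_te)

    def f(data, clf=clf):
        return clf.predict_proba(data)[:, 1]

    explainer = shap.KernelExplainer(f, background)
    # Explain a small test subsample to keep Kernel SHAP tractable.
    shap_vals[name] = explainer.shap_values(shap.sample(X_te, 20))

# acc and shap_vals are the inputs to the cross-ensemble step (Eqs. 7-9),
# e.g. the ensemble_importance sketch shown in Subsect. 2.3.
```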

3 Data source

Five open datasets and one non-open dataset (Table 1) were used to test our proposed methodology. Among the datasets used, four open datasets are for classification: the Pima Indians Diabetes database (PIDD) [13], the Mendeley open diabetes dataset [25], the date fruit dataset [15], and the heart disease dataset [7], while the open household dataset [14] is for the regression task. All open datasets can be downloaded from the internet. PIDD is a small diabetes dataset containing 768 samples and eight diabetes-related factors. Similarly, the Mendeley open diabetes dataset contains 11 risk factors for diabetes, such as BMI, HbA1c, and age, while the heart disease dataset has 17 factors. The housing dataset is a commonly used regression dataset containing nine factors. Our proposed methodology was tested on these classification and regression datasets. Moreover, the proposed method was also used to analyze census data from the Ministry of Health, Labour and Welfare (MHLW) [11], aiming to identify possible new risk factors associated with diabetes (Table 1). Because the MHLW dataset is not objective-oriented, we used the newest MHLW data (2018) and removed samples with null values. After this pre-processing, 12,736 balanced samples were analyzed; a sketch of the pre-processing is given below.
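A minimal sketch of this pre-processing, assuming a hypothetical file name, a hypothetical target column, and class balancing by downsampling the majority class (the actual MHLW schema and balancing procedure are not reproduced here):

```python
import pandas as pd

# Hypothetical file and column names; the MHLW 2018 schema is an assumption.
df = pd.read_csv("mhlw_2018.csv")

# Remove samples containing null values, as described above.
df = df.dropna()

# Balance the classes by downsampling the majority class
# (the paper reports 12,736 balanced samples after pre-processing).
target = "diabetes"  # hypothetical target column name
n_minority = df[target].value_counts().min()
df_balanced = df.groupby(target).sample(n=n_minority, random_state=0)
```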

For all models, the samples of each dataset were divided into training (0.7) and test (0.3) sets. The performance of all models is shown in Table 2. Kernel SHAP was then used to explain each model separately, and the factors’ importance ranking for each model was reviewed (Fig. 2). We then applied the proposed methodology to calculate the new factor ranking, whose results are shown in the results section. Moreover, our proposed methodology’s results were compared with those of XGBoost; the comparison is shown in Fig. 4.

Table 1 Description of the datasets analyzed in this study
Table 2 Performance of the classification models on the classification-task datasets in our study

4 Results

After using kernel SHAP to explain the models and testing our proposed methodology on six datasets, we can clearly assess the efficiency of our proposed methodology. Our results are divided into two parts. Subsect. 4.1 presents the results of single kernel SHAP, which show its poor stability. Subsect. 4.2 presents the results of our proposed ensemble SHAP values, together with the comparison of our methodology against the XGBoost method, where the difference is apparent.

4.1 Results of kernel SHAP

Even though SHAP can help us understand the factor contributions within each model, the results of single kernel SHAP (Fig. 2) show that both the factor contribution values and the factor orders differ across models. The factors’ importance rankings also differ, especially for typical well-established high-risk factors such as age and gender in the Mendeley diabetes dataset. Similarly, the factor contribution order in the PIDD dataset also varies, specifically for BMI, age, and the diabetes pedigree function. In the PIDD dataset, age, an essential risk factor for diabetes, is the lowest-contributing factor in the MLP model and the second lowest in DT and XGBoost. Similarly, the obesity factor in the MHLW dataset makes the lowest and second lowest contribution to diabetes in the DT and logistic models, respectively, whereas it makes the highest contribution in the QDA model and the second largest contribution in the naive Bayes model. The other factors also have different importance ranks in each model on the MHLW dataset. Moreover, room space, number of rooms, and public pension become comparatively important factors in the DT and XGBoost models. The PIDD, date fruit, and household datasets show similar situations: the factor orders in the classification or regression (household dataset) models change from model to model (Fig. 2). All the single kernel SHAP results show that a single SHAP value cannot represent the true importance ranking of the factors, even though the kernel SHAP method helps us explain the models and reveal relations among factors. Since knowing the factors’ ranking is essential in healthcare and medicine, our proposed cross-ensemble factor importance calculation methodology is instrumental.

Fig. 2 The kernel SHAP values for various datasets

4.2 Results of ensemble factors importance

After applying our proposed methodology, the factor contribution ranking became stable across all datasets (Fig. 3). Moreover, we also compared our results with another robust feature ranking method, XGBoost. The results in Fig. 3 show that the final factor importance order of our proposed methodology is consistent with human knowledge and better than that of the XGBoost method.

Meanwhile, Fig. 4 shows that our proposed ensemble XAI (SHAP) methodology can highlight the differences among factors, giving the essential factors a high ranking while placing the less important factors lower. Especially in the final ensemble factor importance for the household data, only one factor (total bedrooms) is paramount, whereas the factors have nearly the same level of importance in the XGBoost analysis. Similarly, compared with the XGBoost results, the separation among factors produced by our proposed methodology is also more distinct in the other five datasets. Notably, the factors’ importance ranking in the MHLW dataset reveals some significant knowledge: the room space factor is more important than the generally known factors age, gender, and obesity. Similarly, alcohol drinking is more important than smoking. Meanwhile, the mental-health situation is more critical than the obesity factor in our analysis (Fig. 4). This alerts Japanese people to pay attention to their mental health to help prevent diabetes.

Fig. 3 Our proposed ensemble SHAP values for various datasets

Fig. 4 Our proposed ensemble SHAP results compared with XGBoost

5 Discussion

In this study, we compared the results of kernel SHAP for various machine learning models and found that a single SHAP model cannot explain the models at the level of human knowledge. We therefore proposed an ensemble XAI-based factor contribution ranking methodology and verified its effectiveness. Our results confirm that the proposed method solves the problem of unstable factor importance rankings in both classification and regression models. Furthermore, in each tested dataset, the factors’ importance orders became stable under our ranking procedure, which shows that the proposed methodology is efficient in calculating the factors’ importance ranking. Moreover, the resulting factor order is consistent with ordinary human knowledge.

Compared with the general single kernel SHAP method, the proposed methodology offers a comparatively stable and reliable factor importance ranking. Moreover, in some areas, a reliable factor importance ranking supplies efficient guidance. For example, in the diabetes and heart disease datasets, besides the commonly known insulin and HbA1c factors, age is the third most important factor in the Mendeley diabetes dataset, which indicates that older people should be more careful about their health. Similarly, in the PIDD dataset, BMI is more important than age, which suggests that preventing the spread of diabetes, inspiring people to care more about their health, and preventing obesity are urgent. Notably, the ensemble XAI results on the MHLW dataset show that Japanese citizens should pay more attention to their living conditions (house space, drinking, and smoking) and mental status to prevent diabetes.

Indeed, there are some limitations to our study. At present, only SHAP is used to combine the factor contributions; more reliable explanation methods will be incorporated in future studies. Meanwhile, only the factor importance ranking problem was addressed in our research; a prospective study must also consider how to explain the correlations among factors efficiently. Nevertheless, our proposed methodology yields a steady factor importance ranking, which helps us obtain comparatively reliable explainable results.

6 Conclusion and future work

This study compared the results of kernel SHAP for various machine learning models and found that a single SHAP model cannot explain the models at the level of human knowledge. We therefore proposed an ensemble XAI-based factor ranking methodology and verified its effectiveness on six datasets. Our proposed methodology solves the unstable factor importance ranking problem of kernel SHAP and offers a stable and more reliable factor importance ranking for classification and regression models. Our study paves the way for building reliable AI models. Furthermore, our study also identified some significant factors of diabetes. Our future studies will focus on building trustworthy AI models by using the knowledge identified with XAI methods. We will also explore the possibility of building knowledge-based small AI models.