1 Introduction

Artificial Intelligence (AI) technology is now used in many fields of our society [1,2,3,4,5]. AI technology enables an innovative society and changes our lifestyles; examples include autonomous driving [6,7,8,9], face recognition systems [10,11,12,13], and computer-aided detection in the medical area [14,15,16,17]. However, AI models are generally built on large datasets with huge numbers of parameters, so-called big AI models, especially in computer vision (e.g., the diffusion model [18]) and natural language processing. The powerful Large Language Model (LLM) family of Generative Pre-trained Transformer (GPT) models [19] makes our daily work more convenient and may even change how we work in the future. GPT models have been applied in various fields [20], and the many transformer-based models [20,21,22,23,24,25,26,27] point to the possibility of Artificial General Intelligence (AGI). However, even with current AI technology, big data-based AI models are impossible in some research fields. For example, in the medical and biomedical areas, big data are not always available for building big AI models. Andrew Ng has stressed the importance of "big AI in small data" [28], underlining the necessity of efficient AI models for small datasets. Moreover, the millions of parameters in big AI models also consume enormous energy, so research on energy-saving small AI models is urgently needed. Therefore, we propose building AI models based on prior knowledge.

Besides the problems of big AI models and their huge numbers of parameters, other limitations still exist in AI research. The black box problem is one of the most pressing issues in AI studies [29,30,31,32,33]; it lowers the reliability of AI models. Meanwhile, current AI models are statistical-analysis-based rather than logic-theory-based models, which leaves uncertainty in current AI models even when the big AI models are efficient. Therefore, understanding AI models becomes necessary.

To clarify AI models, Explainable AI (XAI) [34,35,36,37,38] has become a prominent topic in AI research. Currently, two kinds of XAI models exist: intrinsic (rule-based) and post hoc models [39]. Intrinsic models explain themselves by restricting the rules of the machine learning models, e.g., linear regression, logistic analysis, and Grad-CAM [40]. In contrast, post hoc models interpret models after training, such as Local Interpretable Model-agnostic Explanations (LIME) [41, 42] and SHapley Additive exPlanations (SHAP) [43]. The SHAP method is currently the most robust surrogate explanation model. It has been used in many fields [44,45,46,47,48,49,50,51,52,53] and has been shown to be robust [54,55,56]. SHAP methods allow us to interpret black box models and to know the local and global reasons for a prediction or classification. There are also two kinds of SHAP methods: model-agnostic (Kernel SHAP) and model-specific (Tree SHAP, Deep SHAP) [43, 57]. The model-specific SHAP methods are designed to explain specific models in order to reduce the computation or loss for complex models, so they can only be used in particular situations. In contrast, Kernel SHAP can be used for any model type. However, the SHAP method is a causal-inference-based methodology, and the logic inside AI models still needs to be clarified; SHAP only increases the transparency of AI models in some aspects. Research on AI reliability and transparency is therefore still urgently needed. Are there also ways to explain AI models by constraining the rules of the models? This still needs to be explored.

Even though the SHAP method explains AI models only in some aspects, it already supplies us with some knowledge about AI models. Research by Fei-Fei Li [58] shows that human interaction can improve the performance of AI models, and the latest GPT-4 models [19] also confirm the need for human feedback in large AI models. These observations show that human-knowledge-integrated AI models are a promising research direction in AI studies. Currently, reinforcement learning models [59] give rewards during decision-making, while knowledge distillation [60] models filter the knowledge (the weights in layers) of AI models. Is there another efficient way to use knowledge in AI models? Can we make human-knowledge-integrated AI models possible? Furthermore, how can we integrate knowledge into AI models efficiently? Our research takes one significant step toward answering these questions. In this study, we propose knowledge-integrated AI transformer models to improve the trust and efficiency of AI models. The main contributions of our study are summarized as follows:

  • Prior knowledge-integrated transformer AI models were proposed in our study.

  • Our proposed methodology paves the way to improve the transparency and reliability of AI models.

  • Our study is a significant technical attempt at researching small and trustable AI models.

  • Our proposed methodology demonstrates the feasibility of building knowledge-integrated neural network models.

  • Our research helps us understand the logic of attention models.

The rest of this paper is organized as follows. We give a brief literature review in Sect. 2. Our proposed methodology is introduced in Sect. 3. Section 4 describes the datasets used. We show the detailed results of our study in Sect. 5 and discuss our findings in Sect. 6. Finally, we conclude and discuss our future research directions in Sect. 7.

2 Literature review

2.1 Literature about prior knowledge

Some studies have focused on building logic-based, trustable, explainable AI models [61,62,63,64]. Besides XAI, which explores and explains AI models to improve their reliability, some other studies try to build trustable AI models directly. Philip Slingerland et al. proposed adapting trustable AI models to space mission autonomy [65], while Robin Cohen et al. [66] sketched ways in which trust modeling may be leveraged towards trustable AI. To the best of our knowledge, few studies propose building knowledge-integrated AI models as a way to build trustable AI models. Meanwhile, some researchers state that AI models with human input can perform better [58].

Yann LeCun [67] proposed a world model, arguing that we can build models that learn the way humans do. We humans use our knowledge to make decisions and solve problems. Can AI models also integrate knowledge to become more reliable? In particular, can AI models combine human knowledge to optimize themselves? Integrating knowledge to build AI models has become a new research topic, yet few researchers have focused on building knowledge-integrated AI models. Meanwhile, what counts as human knowledge, and how can human knowledge be integrated into AI models? There is currently no standard. Therefore, we propose using prior knowledge to build models. However, what can be treated as prior knowledge? While some studies use pre-trained models as prior knowledge, we propose using XAI results, in particular to build AI models on small datasets at a time when most research focuses on big data-based big AI models [19].

2.2 Literature about transformer models

At present, the transformer models [68, 69], which are the base models of generative AI, have become a prominent topic in AI. The attention mechanism [70] is the primary structure of the transformer model. Using attention, we can check the connections among factors, as in the research using attention to predict the connections among language tokens [23]. Even though the attention of the transformer model is also based on Neural Network (NN) models, attention can help us understand AI models in some aspects. The attention in LLM models can show the relationships among tokens, and since attention was adopted in computer vision, vision transformer models can indicate which areas of an image are important [71]. Therefore, we also integrated prior knowledge to build transformer models for tabular data and compared our results with another tabular-data transformer model, the Feature Tokenization Transformer (FTT) [72]. Using self-attention models, we aim to clarify the relationships among the input features, so that we can understand the AI models in some aspects.

3 Methodology

Fig. 1 The proposed methodology flowchart

In this study, we propose a knowledge-integrated self-attention transformer model. Unlike attention mechanisms that use various methods to adjust the NN model weights, we propose using ensemble SHAP values as knowledge to build transformer models. We first propose an ensemble SHAP value calculation method to acquire more reliable knowledge. Then, we use this prior knowledge as the input of self-attention transformer models. The whole methodology structure is shown in Fig. 1.

Currently, the SHAP methodology is one of the most robust XAI methods and can be used to explain various models; research has also confirmed the efficiency and robustness of SHAP methods [52, 54,55,56]. Because the SHAP value is calculated based on causal inference theory, the SHAP values differ across models. To balance the effect caused by the various models, we propose an ensemble SHAP value that considers all models' accuracies and kernel SHAP values. Therefore, we use the ensemble SHAP values, rather than a hybrid SHAP value, as knowledge to build our models. The details are introduced in the following subsection.

3.1 Ensemble XAI to acquire knowledge

SHAP explains the prediction for an instance x by computing each feature value's contribution to the prediction of a model. The SHAP explanation method computes Shapley values from coalitional game theory: the feature values of a data instance act as players in a coalition, and Shapley values tell us how to distribute the prediction fairly among the features. SHAP approximates the original model with a new function \(f(x) = g\left( z^\prime \right) = \phi _0 + \sum \nolimits _{i=1}^M \phi _i z_i^\prime\), where \(z^\prime \in \lbrace 0,1\rbrace ^M\), M is the number of simplified input features, and \(\phi _i \in \mathbb {R}\) is treated as the local factor importance. Here \(z^\prime\) represents a simplified version of \(x\) in the same feature space of size M. In kernel SHAP, \(g\left( z^\prime \right)\) is a linear model. The explanation of \(x\) is

$$\begin{aligned} \phi _i\left( f,x\right) = \sum \limits _{z^\prime \subseteq x^\prime } \pi _x (z^\prime ) \left[ f_x(z^\prime ) -f_x(z^\prime \setminus {i}) \right] \end{aligned}$$
(1)

where \(f_x(z^\prime )\) is the model output when \(z_i^\prime\) is 1, while \(f_x(z^\prime {\setminus }{i})\) is the output when \(z_i^\prime\) is set to zero. The kernel is \(\pi _x (z^\prime ) = \frac{(M-1)}{\left( {\begin{array}{c}M\\ \vert z^\prime \vert \end{array}}\right) \vert z^\prime \vert \left( M- \vert z^\prime \vert \right) }\), where \(\vert z^\prime \vert\) is the number of nonzero entries in \(z^\prime\) and \(z^\prime \subseteq x\) represents all \(z^\prime\) vectors whose nonzero entries are a subset of the entries in \(x\).
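For concreteness, the kernel weight above can be computed directly. The following is a minimal sketch (not part of the original implementation), assuming only the number of simplified features M and the coalition size \(\vert z^\prime \vert\) as inputs:

```python
from math import comb

def kernel_shap_weight(M: int, z_size: int) -> float:
    """Kernel SHAP weight pi_x(z') for a coalition with `z_size` nonzero
    entries out of M simplified features; the empty and full coalitions
    receive (effectively) infinite weight and are handled separately."""
    if z_size == 0 or z_size == M:
        raise ValueError("empty and full coalitions have infinite weight")
    return (M - 1) / (comb(M, z_size) * z_size * (M - z_size))

# Example: with M = 8 features, singleton coalitions are weighted most heavily.
print(kernel_shap_weight(8, 1))  # 0.125
print(kernel_shap_weight(8, 4))  # 0.00625
```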

In kernel SHAP, \(\phi _0=f(h_x(0))\) is set to 0, and the loss function of kernel SHAP becomes

$$\begin{aligned} L \left( \widehat{f}, g, \pi _x \right) = \sum _{z^\prime \in Z} \left[ \widehat{f}\left( h_x \left( z^\prime \right) \right) - g\left( z^\prime \right) \right] ^2 \pi _x \left( z^\prime \right) \end{aligned}$$
(2)

Kernel SHAP estimates the contribution of an instance \(x\) by approximating the model around \(x\) with a linear model and treating the weights of the linear model as the local factor contributions \(\phi _i\). Although kernel SHAP can help us understand the factor contributions in each model, the factor rankings differ from model to model: because kernel SHAP is based on approximation, the kernel SHAP values of different models will differ [73]. When the SHAP method fits its linear model, it only uses the predicted output of one model, so the explanation is affected by how good that prediction model is. Therefore, a single kernel SHAP value cannot represent the true ranking of factor importance, and the goodness of a model should also be considered when calculating factor importance. Meanwhile, because the model-agnostic explanation method only approximates the predicted outputs of models, the kernel SHAP values of all models share the same scale when we analyze one dataset, which makes our proposed ensemble SHAP method feasible. We therefore use the accuracy of each model to adjust the ranking of the factors: if one model has higher accuracy, it is weighted more heavily in the ensemble SHAP calculation. The calculation is shown in Algorithm 1, where \(Acc_j\) is the accuracy of one classification or regression model, N is the number of analytical approaches for one dataset, and \(I_j\) is the factor importance ranking of one analysis. Even though we use the local ensemble SHAP values as input to build our proposed self-attention transformer model, we also check the global ensemble SHAP values to confirm that our proposed ensemble SHAP method is efficient on the datasets used, which can be calculated as follows:

$$\begin{aligned} I = \sum _{j=1}^{N-1}W_j I_j = \sum _{j=1}^{N-1}\frac{\exp (Acc_j)}{\sum _{j=1}^{N-1} \exp (Acc_j)} \sum _{i=1}^{M}\phi _i \end{aligned}$$
(3)

Our previous study already confirmed the efficiency of the ensemble SHAP method [73].

Algorithm 1 Knowledge acquisition
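As a concrete illustration of the knowledge-acquisition step, the sketch below implements an accuracy-weighted (softmax) ensemble of per-model kernel SHAP values in the spirit of Algorithm 1 and Eq. (3). The array shapes and the use of the mean absolute value for global importance are our assumptions, not the paper's exact implementation:

```python
import numpy as np

def ensemble_shap(shap_values, accuracies):
    """Accuracy-weighted ensemble of per-model kernel SHAP values.

    Assumed shapes: `shap_values` is (n_models, n_samples, n_features),
    holding the kernel SHAP values of each classifier; `accuracies` is a
    vector of the corresponding model accuracies Acc_j.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    acc = np.asarray(accuracies, dtype=float)

    # Softmax over model accuracies: more accurate models get larger weights W_j.
    weights = np.exp(acc) / np.exp(acc).sum()

    # Weighted sum over models -> local ensemble SHAP value per sample/feature.
    local_ensemble = np.tensordot(weights, shap_values, axes=(0, 0))

    # Global importance per feature (here: mean absolute local ensemble value).
    global_importance = np.abs(local_ensemble).mean(axis=0)
    return local_ensemble, global_importance
```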

3.2 Prior knowledge to build transformer model

Our proposed methodology flow is shown in Fig. 1. We treat the proposed ensemble SHAP values as prior knowledge. Firstly, nine general and robust machine learning classification models, namely logistic regression, Naive Bayes classification, Quadratic Discriminant Analysis, k-nearest neighbors classification, AdaBoost, a general decision tree, random forest classification, XGBoost, and Multi-Layer Perceptron classification, were used to perform classification on the three open classification-task datasets and the one non-open dataset. Then, kernel SHAP was used to explain each classification model and obtain the contribution (local SHAP value) of the factors for each model, and the importance ranking of the factors was also reviewed. After we obtained the kernel SHAP values of the factors, we used our proposed ensemble methodology to calculate the importance of the factors. Finally, we used the ensemble SHAP values as prior knowledge to build the self-attention transformer models. We compared our proposed knowledge-integrated self-attention transformer model with the FTT and other machine learning or NN models. Moreover, we also checked the self-attention of each transformer block in the FTT model and in our proposed model. To confirm the efficiency of our models, we tested self-attention transformer models of various depths in this study: 2, 4, 8, and 12 layers. By reviewing the self-attention of each layer, we can understand the difference between the FTT model and our proposed model; the attention in transformer models can also help us understand how AI models operate. After checking the differences among transformer models of various depths, we also compared the average self-attention of our proposed transformer models with that of the FTT models (Fig. 7) and with the ordinary coefficients among input factors (Fig. 8). The apparent differences are shown in the Results section and discussed.
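To make the last step concrete, the sketch below shows one possible way to feed local ensemble SHAP values into a small self-attention classifier. It is an illustration under our own assumptions (each feature's ensemble SHAP value is treated as a one-dimensional token; the hidden sizes are arbitrary), not the exact architecture used in the experiments:

```python
import torch
import torch.nn as nn

class KnowledgeTransformer(nn.Module):
    """Toy knowledge-integrated self-attention classifier (illustrative only)."""

    def __init__(self, n_features, d_model=32, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)  # one token per input feature
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, shap_inputs):                     # (batch, n_features)
        tokens = self.embed(shap_inputs.unsqueeze(-1))  # (batch, n_features, d_model)
        encoded = self.encoder(tokens)                  # self-attention over features
        return self.head(encoded.mean(dim=1))           # pool tokens, then classify

# Example: 8 features (as in PIDD), a batch of 4 samples of local ensemble SHAP values.
model = KnowledgeTransformer(n_features=8)
logits = model(torch.randn(4, 8))
```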

4 Data source

Table 1 Datasets used in this analysis

To verify the efficiency of our proposed models, three open datasets and one non-open dataset were used to test our proposed methodology. The three open datasets are classification datasets: the Pima Indians Diabetes Database (PIDD) [74], the Mendeley open diabetes dataset [75], and the heart disease dataset [76]. All the open datasets can be downloaded from the Internet. The PIDD is a small diabetes dataset containing 768 samples and eight diabetes factors: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age. Similarly, the US open diabetes dataset contains 11 risk factors for diabetes (BMI, HbA1c, age, etc.), while the heart disease dataset has 17 factors. Moreover, the proposed method was also used to analyze census data from the Ministry of Health, Labour and Welfare (MHLW) [77]. The MHLW dataset is not objective-oriented; we used the newest MHLW (2018) data and deleted samples with null values. Finally, after pre-processing the datasets, 12,736 balanced samples were used to test our proposed methodology.

In our proposed methodology, the samples of each dataset are divided into two parts: one part is used to acquire prior knowledge, and the other part is used to train our proposed model. As shown in Table 1, to treat SHAP values as model input, we used 80% of the data to obtain ensemble SHAP values, treated the ensemble SHAP values as new input to the self-attention transformer models, and compared their performance with FTT and other machine learning models [logistic regression, k-nearest neighbors, decision tree, Multi-Layer Perceptron (MLP), AdaBoost, Naive Bayes classification, Quadratic Discriminant Analysis (QDA), and XGBoost]. We used kernel SHAP to explain the various classification models separately, and the factor importance rankings of each model were reviewed. Then, we used the proposed ensemble SHAP values to build the self-attention transformer models. Finally, we checked the performance of our proposed models. Details are shown in the Results section.
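A minimal sketch of the split described above is given below; the stand-in data and the 80/20 proportion are only illustrative (the exact split per dataset follows Table 1):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with 8 features, mimicking the PIDD layout.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# One part is used only to fit the nine classifiers and compute the ensemble
# SHAP values ("knowledge"); the remainder trains and evaluates the
# knowledge-integrated transformer.
X_knowledge, X_model, y_knowledge, y_model = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)
```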

5 Results

In this study, we propose using ensemble SHAP values as knowledge to build self-attention transformer models. We then checked the performance and self-attention of our proposed transformer models, and compared their self-attention with that of the FTT models and with the ordinary factor coefficients to confirm the efficiency of our proposed transformer models. All the results are shown as follows.

5.1 Model performance comparison of proposed transformer models

To confirm the efficiency of the proposed ensemble SHAP method, the final global factor importance (global ensemble SHAP value) is shown in Fig. 2. The global ensemble SHAP values show the differences among factors more clearly and fit general human common sense better. After using the ensemble SHAP results as prior knowledge to build the self-attention transformer models, we compared our proposed models with the FTT models and other classification models. The results (model accuracy, Acc) are shown in Table 2. On the MHLW dataset, our proposed models do not reach the same level of performance as the other classification methods because we only used 20% of the data to acquire knowledge; nevertheless, the knowledge-based transformer models achieve nearly the same performance (bold results in Table 2) as the FTT models. However, our proposed prior-knowledge-integrated transformer model performs better (bold results in Table 2) than the FTT models on the PIDD and heart disease datasets, even though, for the heart disease dataset, we used only 20% of the data to acquire knowledge and build the knowledge-integrated self-attention transformer models. Moreover, the attention of our proposed self-attention transformer models became more stable than that of the general FTT models, as shown in the following subsection.

Table 2 Model performance comparison of classification models for the classification task datasets in our study
Fig. 2 Ensemble SHAP factor importance for the four datasets

5.2 The self-attention comparison of transformer models

To understand the theory of the transformer models, we also checked the self-attention of each transformer block. We compared the FTT models and our proposed self-attention transformer models. The details are shown in Figs. 3, 4, 5 and 6.

When we check the self-attention across the various transformer models, the attention in each transformer block changes randomly in the FTT models. In contrast, in our proposed self-attention transformer models, the attention of each transformer block is stable across all four datasets. To rule out randomness, we tested our proposed self-attention transformer models with 2, 4, 8, and 12 self-attention layers. The results are shown in Figs. 3, 4, 5 and 6. The self-attention of our proposed knowledge-integrated transformer models is more stable than that of the general FTT models, especially in the shallower transformer models: with 2 self-attention layers the attention patterns are identical, and with 4 and 8 self-attention layers they are nearly identical. Moreover, when we compare the self-attention of our proposed transformer models with the coefficients among features, we find that the factors' self-attention (Fig. 7) of our proposed transformer models resembles the factor coefficients of general machine learning models (Fig. 8). In contrast, the FTT models' self-attention appears randomly distributed and has lower similarity with the factor coefficients (Fig. 8). When we use our proposed prior knowledge as input, the coefficients among factors also become more similar (the colors in Fig. 8 become similar within each dataset).

Fig. 3 The self-attention of the proposed transformer models and the general FTT models (Diabetes dataset): a self-attention in each layer of the 2-layer FTT model; b self-attention in each layer of the 2-layer proposed model; c self-attention in each layer of the 4-layer FTT model; d self-attention in each layer of the 4-layer proposed model; e self-attention in each layer of the 8-layer FTT model; f self-attention in each layer of the 8-layer proposed model; g self-attention in each layer of the 12-layer FTT model; h self-attention in each layer of the 12-layer proposed model

Fig. 4 The self-attention of the proposed transformer models and the general FTT models (PIDD dataset): a self-attention in each layer of the 2-layer FTT model; b self-attention in each layer of the 2-layer proposed model; c self-attention in each layer of the 4-layer FTT model; d self-attention in each layer of the 4-layer proposed model; e self-attention in each layer of the 8-layer FTT model; f self-attention in each layer of the 8-layer proposed model; g self-attention in each layer of the 12-layer FTT model; h self-attention in each layer of the 12-layer proposed model

Fig. 5 The self-attention of the proposed transformer models and the general FTT models (Heart disease dataset): a self-attention in each layer of the 2-layer FTT model; b self-attention in each layer of the 2-layer proposed model; c self-attention in each layer of the 4-layer FTT model; d self-attention in each layer of the 4-layer proposed model; e self-attention in each layer of the 8-layer FTT model; f self-attention in each layer of the 8-layer proposed model; g self-attention in each layer of the 12-layer FTT model; h self-attention in each layer of the 12-layer proposed model

Fig. 6 The self-attention of the proposed transformer models and the general FTT models (MHLW dataset): a self-attention in each layer of the 2-layer FTT model; b self-attention in each layer of the 2-layer proposed model; c self-attention in each layer of the 4-layer FTT model; d self-attention in each layer of the 4-layer proposed model; e self-attention in each layer of the 8-layer FTT model; f self-attention in each layer of the 8-layer proposed model; g self-attention in each layer of the 12-layer FTT model; h self-attention in each layer of the 12-layer proposed model

Fig. 7 The average self-attention comparison between the FTT models and the proposed models (8 layers): a average self-attention of the FTT model (US Diabetes); b average self-attention of the proposed model (US Diabetes); c average self-attention of the FTT model (PIDD); d average self-attention of the proposed model (PIDD); e average self-attention of the FTT model (Heart disease); f average self-attention of the proposed model (Heart disease); g average self-attention of the FTT model (MHLW); h average self-attention of the proposed model (MHLW)

Fig. 8 The coefficients of the prior-knowledge-integrated input: a the factor coefficients of the US Diabetes dataset; b the factor coefficients of the PIDD dataset; c the factor coefficients of the Heart disease dataset; d the factor coefficients of the MHLW dataset

6 Discussion

In this study, we used the proposed ensemble SHAP values as knowledge to build knowledge-based self-attention transformer models. Our prior-knowledge-integrated models perform better than the non-knowledge-integrated FTT models, which confirms that the proposed knowledge-integrated transformer model is a viable research idea. Moreover, when we treated the ensemble SHAP values as knowledge and inserted that knowledge into the transformer models, the self-attention of our knowledge-integrated transformer models became more stable than that of the general FTT model on all four tested datasets. The stable self-attention of each layer verifies that the knowledge inserted into the transformer models influences them. Meanwhile, the stable self-attention of the transformer models suggests that we may interpret AI models directly rather than using surrogate methods [41,42,43] to explain them. Moreover, our study confirms that a knowledge-integrated AI methodology is achievable, and our results confirm that a small AI model with knowledge is feasible for future research. Like the study of Fei-Fei Li [58], our research also confirms that inserting knowledge into AI models can improve the performance of traditional artificial intelligence methods. Moreover, our results suggest that a small AI model based on a small dataset [28] is possible.

While reinforcement learning models give rewards during decision-making and knowledge distillation filters the weights of NN models, our proposed model uses prior knowledge as the input of the transformer model, which can be transferred to other datasets and widely used in natural language processing, computer vision, and voice analysis. Our results also confirm that our proposed model works for classification tasks. Moreover, our study found that the attention of the transformer model becomes stable, which suggests that we may eventually understand the logic of NN models and make deep learning AI models transparent and reliable in the future.

Our proposed knowledge-integrated AI models use less data than general AI models while achieving comparable performance, and our results confirm that our proposal is efficient. Our study shows that knowledge-integrated small AI models are viable and efficient. Meanwhile, the attention results of our proposed transformer models show that the knowledge-integrated transformer models differ from the general transformer models: the self-attention of our proposed models is stable in each layer, unlike that of general transformer models (Figs. 3, 4, 5, 6). This suggests that there are underlying, not-yet-defined rules in NN models. We can explore the real neural connections of NNs in future studies and make AI models more transparent and reliable.

Certainly, there are also some limitations to our study. The prior knowledge used here differs from natural human experience; in future work, we will look for quantified human knowledge to test our model. Nevertheless, our proposed methodology is a significant attempt at building trustworthy small AI models and will inspire more studies on reliable AI. Meanwhile, our study also confirms that a small AI model with a small dataset is feasible, at a time when nearly all research effort is focused on large AI models, which is difficult in some research areas and wastes energy.

7 Conclusion

In this study, we designed knowledge-integrated AI models that use prior knowledge to build transformer models. Our results confirm the feasibility of our proposed methodology. Meanwhile, our research shows that research on trustable, logic-based AI models built on small data is feasible in the future. Indeed, there are some limitations to our study, and more future work on trustable AI is still needed. However, our research inspires future studies on theory-based, trustable AI models in small-data settings and paves the way for explaining and understanding the logic and theory of black-box AI models.

8 Future scope

Our future work will continue to explore the possibility of building transparent and reliable AI models, hoping to clarify the logic inside NN models. Meanwhile, we will also consider applying our proposed model in real-life settings, especially in the medical and healthcare fields, where there are generally not enough data to build big AI models.