1 Introduction

Artificial Intelligence (AI) has made remarkable progress in recent years, with notable implementations by tech giants such as Google, Meta (formerly Facebook), and Amazon. These companies have integrated AI into various services and products, such as Google's AI-powered search algorithms (Diamant 2017), Meta's recommendation systems (Hutson 2020), and Amazon's intelligent robotics applications (Diamant 2017). AI is expected to have an even more significant impact in the future (Makridakis 2017). The AI market in finance is expanding globally and is predicted to continue growing (OECD 2021). Despite these achievements, the increasing complexity of AI models has led to the use of opaque black-box models, which reduces their interpretability (Guidotti et al. 2018). As a result, it is pivotal to develop and evaluate solutions to address this problem, especially in sensitive areas such as finance, where high accuracy is not the only requirement for expanding the adoption of AI systems.

Machine Learning (ML) is classified as a subset of Artificial Intelligence (AI) because of its emphasis on algorithms that autonomously generate predictions and decisions by learning from data (Rius 2023; Yao and Zheng 2023; Pokhariya et al. 2022). AI has a wider scope than ML alone. Some AI systems, for instance, may be built upon an ML model but not be entirely dependent on it; conversely, other systems may function solely on rule-based logic and not employ any learned models (Sprockhoff et al. 2023; Liubchenko 2022). It is therefore correct to assert that ML is a component of AI, but not all AI applications depend exclusively on ML methods, which demonstrates the variety and complexity within the wider domain of AI.

Explainable Artificial Intelligence (XAI) offers a means to explore, explain, and comprehend intricate systems. However, evaluating and determining the explicable nature of an explanation is a complex and challenging issue. Providing understandable and interpretable financial explanations is vital to establish trust and accountability in decision making. XAI research underscores the importance of explaining decisions in finance, and authors often compare finance and medicine to emphasise the need for explanations in both fields (Silva et al. 2019; Sovrano and Vitali 2022).

The ML community has not yet reached an agreement on the specific criteria that should be employed to determine explainability. Attempts have been made to clarify concepts such as trustworthiness, interpretability, explainability, and reliability (Gilpin et al. 2018; Ribeiro et al. 2016; Lipton 2018; Došilović et al. 2018). Although numerous studies distinguish between interpretability and explainability, the two terms are frequently used interchangeably (Linardatos et al. 2020), and imprecise definitions of either can lead to misleading conclusions (Rudin 2019). Interpretability is the degree to which a human can understand and explain the cause of a decision made by a model (Cartwright 2023; Zeng et al. 2023; Doshi-Velez and Kim 2017; Miller 2019); it provides users with a grasp of the underlying concepts, logic, and algorithms governing the AI system (Linardatos et al. 2020). Explainability, by contrast, concerns the extent to which the internal reasoning and mechanisms of a machine learning system can be made transparent: the more that is revealed about what happens during training and decision making, the more explainable the model is (Linardatos et al. 2020). According to Gilpin et al. (2018), interpretability is inadequate on its own, and explainability is an additional fundamental requirement. Following Linardatos et al. (2021) and Doshi-Velez and Kim (2017), this study treats interpretability as the broader of the two concepts. Nevertheless, to avoid confining either term to particular contexts, we use interpretable and explainable interchangeably throughout.

Systems without transparent decision-making procedures are difficult to trust. In finance, XAI is of the highest priority because of the intricate structure of its models and the potential consequences of incorrect or biased decisions in tasks such as credit scoring, bankruptcy forecasting, fraud detection, and portfolio optimisation. Other areas of interest include exploring how XAI can improve investment decision-making and portfolio management, and examining the ethical and legal implications of using opaque AI systems in financial services. Additionally, XAI techniques can enhance the comprehensibility of predictive models in areas such as loan underwriting, insurance pricing, stock price prediction, and regulatory compliance.

In this review, the terms ML and AI are also used interchangeably to refer to the AI methods used in the surveyed research. This approach allows for a more comprehensive view of the field, as it encompasses the wide range of techniques that belong to AI, ensures clarity and consistency throughout the study, and allows researchers to communicate their findings to a broader audience more effectively.

We categorise the XAI methods employed in finance based on their specific financial applications, identify existing research gaps, and propose potential areas for future investigation. The purpose of this research is threefold: first, we report the review results that examine the current advances and trends in XAI for various financial tasks by summarising the techniques used; second, we help practitioners select the appropriate XAI technique for their particular financial use case by offering the categorisation; and third, we help researchers pinpoint the specific areas and applications of XAI methods that need further investigation. Our work also makes other contributions, as we review a large body of recent and related work and attempt to cover all explainability features previously explored. A total of 138 chosen studies were evaluated based on four research questions (RQs):

  • Which financial AI applications have benefited from the implementation of XAI?

  • How do the XAI methods differ depending on the application and task?

  • How are new XAI methods developed and applied?

  • What are the challenges, requirements, and unresolved issues in the literature related to the application of XAI in finance?

The structure of this article can be summarised as follows: the Background section (Sect. 2) provides a concise introduction to the critical advancements in XAI and explains essential terminology; it is designed for readers who need to gain a broad understanding of XAI. The methodology for conducting the systematic review can be found in Sect. 3, which describes the search for relevant scientific articles, the gathering of the necessary data, and the analysis performed. The results of the systematic analysis of the selected scientific articles, organised around the research questions outlined earlier, are presented in Sect. 4. Sections 6, 7 and 8 contain the discussion, future work, recent developments, and conclusions.

2 Background

2.1 XAI in general

To promote the integration of AI, XAI strives to improve transparency by devising methods that empower end users to understand, place confidence in, and efficiently control AI systems (Adadi and Berrada 2018; Arrieta et al. 2020; Saeed and Omlin 2023). Early work on rule-based expert systems already identified the need for explanations (Biran and Cotton 2017), and the advent of Deep Learning (DL) systems subsequently pushed XAI to the forefront of academic inquiry (Saeed and Omlin 2023). Many practical challenges now employ ML models that require high predictive accuracy, e.g., stock price prediction (Carta et al. 2021), online banking fraud detection (Achituve et al. 2019), and bankruptcy prediction (Cho and Shin 2023). Increasing the complexity of a model often leads to higher predictive accuracy, but also reduces its capacity to provide explainable predictions (Linardatos et al. 2020). Black-box models, including deep learning models such as deep neural networks (DNNs) or generative adversarial networks (GANs), as well as ensemble models like XGBoost and Random Forests, achieve the highest levels of performance. However, these models are difficult to explain, as noted by Yang et al. (2022a, b) and Linardatos et al. (2020). Conversely, white-box or intrinsic models, such as linear models, rule-based models, and decision trees, provide a straightforward structure that simplifies the interpretation of outcomes.

We will primarily focus on presenting the fundamental concepts of XAI, as several comprehensive surveys have been conducted to categorise and elucidate it (see Arrieta et al. 2020; Adadi and Berrada 2018; Belle and Papantonis 2021). A number of terms are frequently used in academic discourse on XAI systems to define their key characteristics. The methods vary according to their stage (ante-hoc, post-hoc), scope (local, global), type (model-specific, model-agnostic), and area of application (finance, medicine, education, transportation, ecology, agriculture, etc.).

Explanations can be produced at two stages. The ante-hoc stage involves the interpretation and analysis of the data, the specification of data sources, and the construction of the model; assessing the data at this point is essential, as it provides significant insights and improves the understanding of the model (Saranya and Subhashini 2023). These techniques are known as white-box approaches because they are deliberately engineered to preserve a simple, transparent structure; they are intrinsic and do not require additional explanation methods. In the post-hoc stage, an explanation is generated after the ML model has been constructed and necessitates an additional explanation method (Hassija et al. 2024). Explainability is categorised as local or global depending on whether the explanation is derived from a particular piece of data (local) or from the entire model (global) (Vilone and Longo 2021; Angelov et al. 2021).

Model-agnostic methods function autonomously without depending on any specific ML model. Typically, these techniques are post-hoc and aim to tackle the problem of comprehending intricate models such as Convolutional Neural Networks (CNNs). They provide explanations that are not tied to a particular model structure, making them adaptable and broadly applicable to various types of models (Zolanvari et al. 2021). In contrast, model-specific methods depend on the structure of a particular model and are restricted to it (Kinger and Kulkarni 2024). The sketch below illustrates this distinction.
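To make these distinctions concrete, the following minimal sketch (our own illustration on synthetic data, not taken from any reviewed study) applies scikit-learn's permutation importance, a model-agnostic, post-hoc, global technique, to two different classifiers, and contrasts it with a model-specific attribute that only exists for the intrinsically interpretable (ante-hoc) linear model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0).fit(X_train, y_train),
    "logistic_regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
}

# Model-agnostic, post-hoc, global: permutation importance only queries the
# model's predictions, so the same call works for both models.
for name, model in models.items():
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    print(name, result.importances_mean.round(3))

# Model-specific: coefficients exist only for the (ante-hoc, intrinsically
# interpretable) linear model; the forest exposes no comparable attribute.
print("logistic regression coefficients:", models["logistic_regression"].coef_.round(3))
```

Because permutation importance only queries predictions, the same call works for any model, whereas the coefficient-based explanation is tied to the linear model's structure.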

Ethical, legal, and practical factors drive the desire for explainability (Eschenbach 2021). The European Union (EU) regulates AI through the AI Act, which requires specific criteria for the development and use of XAI (European Commission 2020). Achieving explainability through the creation of advanced and intricate XAI models is a challenging undertaking, and several barriers impede this process. Frasca et al. (2024) highlight the trade-off between performance and explainability, noting that higher accuracy often requires complex models. They also address challenges such as identifying pivotal characteristics, scaling, and model acceptability, highlighting the need for significant computational resources and for aligning explanations with user intuitions.

2.2 Artificial intelligence in finance

The significance of AI becomes evident when one observes the concerted efforts of governments to regulate its usage and practical implementation. One notable initiative in this regard is the establishment of the European Commission's High-Level Expert Group on AI (HLEG) in 2020. HLEG suggests a comprehensive definition of AI and sets ethical standards for developing and deploying robust artificial intelligence systems. The primary objective of HLEG is to propose recommendations for policy enhancements and address AI's social, ethical, and legal dimensions. As defined by HLEG, AI encompasses both software and hardware components, which collaborate to collect and analyse data from the environment. The AI system acquires knowledge through this analysis and formulates decisions to achieve specific objectives. The adaptability of the AI system is formed by evaluating past actions and their consequences within the operational environment. This evaluation can be accomplished using either symbolic rules or numerical models (Samoili et al. 2020). Overall, the efforts undertaken by governments and expert groups like HLEG exemplify the recognition of AI's significance and the commitment to ensuring its responsible and beneficial integration into various domains.

AI has become a game-changing and innovative tool in many industries, including finance. The ability to predict bankruptcy, for instance, is a crucial task for financial institutions and one to which AI approaches have been extensively applied. Several studies (Verikas et al. 2010; Balcaen and Ooghe 2006; Dimitras et al. 1996) have demonstrated the immense potential of AI in revolutionising decision-making, mitigating risks, and enhancing profitability within the financial domain. Leveraging the capabilities of AI empowers financial institutions to gain a competitive advantage and elevate their customer service offerings. The work of Cao (2021) provided a thorough, multidimensional, and problem-oriented economic-financial overview of recent research on AI in finance, which formed the basis for the financial domain categories used here.

Despite the potential of AI to improve and develop financial products, there are challenges to taking full advantage of innovative algorithms (Harkut and Kasat 2019). In their recent study, Eluwole and Akande (2022) identified the main difficulties in using AI in the financial sector. These challenges include aspects such as accuracy, consistency, transparency, trust, ethics, legal considerations, governance regulations, competence gaps, localisation issues, and the intricacies of ML design and integration.

Researchers often employ several types of ML techniques to evaluate and improve the efficiency of their models (Dastile et al. 2022; Zhu et al. 2022; Wang et al. 2019). Therefore, we employed the term "multi-approach AI technique" to denote the utilisation of multiple AI methods to address one financial task.

2.3 XAI in finance

The Royal Society (2019) states that as the use of AI technologies in decision-making processes increases, individuals need to understand how AI works. This need arises from concerns about bias, compliance with policies and regulations, and the ability of developers to understand AI systems. Miller (2019) specifies that improved explainability of AI can increase confidence in the results. Arrieta et al. (2020) suggest that interpretability can also ensure fairness, reliability, and truthfulness in AI decision making and that XAI aims to create more understandable models while maintaining efficiency and allowing humans to trust AI systems and control them.

Adadi and Berrada (2018) identified different perspectives on and motives for using explanations, spanning a spectrum that ranges from justifying and controlling AI/ML approaches to discovering new insights and improving the accuracy of classification or regression tasks. Ensuring the interpretability of AI/ML approaches becomes crucial in identifying the input features that significantly influence the outcomes. Once fully understood, a model can be combined with specialised knowledge to generate an advanced model with enhanced capabilities. Credit scoring and risk management are frequently researched topics in XAI finance research (Demajo et al. 2020; Chlebus et al. 2021; Misheva et al. 2021; Bussmann et al. 2020).

2.3.1 Explainability categories

The comprehensive evaluation of XAI methods, encompassing numerous techniques, procedures, and performance measurements, has been investigated and reported in published surveys such as Doshi-Velez and Kim (2017) and Zhou et al. (2021). To categorise explanatory approaches, Belle and Papantonis (2021) developed a classification system that is widely used to analyse and compare different XAI techniques and practices and that provides valuable information on the advantages and limitations of each approach.

Feature relevance explanation: The main goal of feature relevance explanation is to evaluate the influence of a feature on the model outcome. One of the significant contributions to this field, particularly in XAI, is the Shapley Additive Explanations (SHAP) method (Lundberg and Lee 2017). In this method, a linear model is constructed around the instance to be explained, and the coefficients are interpreted as the significance of the features. However, this approach is seen as an indirect means of generating explanations, as it only focusses on the individual contribution of each feature and does not provide information about the dependencies between features. In cases where features are strongly correlated, the resulting scores may be inconclusive or contradictory, and this must be considered when conducting an analysis. For example, the study by Torky et al. (2024) presents a unique XAI model that can independently identify the underlying reasons behind financial crises and clarify the process of selecting relevant features. The authors used the pigeon optimiser to enhance the feature selection process. Subsequently, a Gradient Boosting classifier, employing a subset of the most influential attributes, is used to determine the sources of financial crises.
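As a minimal illustration of how feature relevance scores are typically obtained in practice, the sketch below applies SHAP to a gradient boosting classifier trained on synthetic, credit-style data; it assumes the open-source shap package and is not the setup of any cited study:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for a credit-default dataset.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one attribution per feature per instance

# Local view: contribution of each feature to a single prediction (in log-odds).
print("instance 0 attributions:", shap_values[0].round(3))

# Global view: mean absolute SHAP value per feature across the dataset.
print("global importance:", abs(shap_values).mean(axis=0).round(3))
```

Averaging the absolute SHAP values across instances yields the kind of global feature ranking that many of the reviewed studies report alongside local explanations.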

Explanation by simplification: Explanations by simplification approximate complex models with simpler ones. The main challenge is to ensure that the simpler model is flexible enough to approximate the complex model accurately; its effectiveness is commonly assessed by comparing classification accuracy between the two models. Rule-based learners and decision tree techniques are simplification methods for model-agnostic explanations. Model-specific explanations can also use distillation, rule-based learners, and decision trees to provide simplified explanations. An example of this explanation category is the Weighted Soft Decision Forest (WSDF), which combines multiple soft decision trees by assigning weights to each tree's output in order to simulate the credit scoring process. Another example is credit risk assessment decision support systems that utilise the recursive rule extraction algorithm and decision trees; these systems generate interpretable rules for machine learning-based credit evaluation techniques (Hayashi 2016; Zhang et al. 2020a, b, c).
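A minimal global-surrogate sketch, assuming synthetic data and scikit-learn, illustrates the idea: a shallow decision tree is trained to mimic a more complex ensemble, and its fidelity is measured as agreement with the black box's predictions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=0)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
black_box_preds = black_box.predict(X)

# The surrogate is trained to mimic the black box, not the original labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, black_box_preds)

# Fidelity: how often the simple model agrees with the complex one.
fidelity = accuracy_score(black_box_preds, surrogate.predict(X))
print(f"surrogate fidelity to the black box: {fidelity:.2f}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(5)]))
```

The printed rules play the role of the simplified explanation, and the fidelity score indicates how faithfully they reflect the complex model.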

Local explanation: Explaining model predictions for a specific case at the level of a statistical unit is known as local explainability. This approach can help companies and individuals identify the crucial factors that contribute to their financial challenges (Hadash et al. 2022). Local explanation methods include rule-based learning, linear approximations, and counterfactuals. The Local Interpretable Model-Agnostic Explanations (LIME) technique developed by Ribeiro et al. (2016) is a commonly used tool that helps understand black-box model behaviour and identify the key features used to make predictions by perturbing the inputs and observing how the predictions change. Because LIME fits a locally linear surrogate, its accuracy may be limited by its dependence on an external model. Counterfactual explanations describe causal scenarios in which Event A's absence would result in Event B's absence, providing an understanding of model predictions and recommendations for future action (Johansson et al. 2016). Park et al. (2021) examined the problem of predicting bankruptcy by showing that the feature importance measured by tree-based models can be reproduced using LIME on bankruptcy datasets. They highlighted the potential to extract meaningful feature importance from other models that lack built-in feature measurement capabilities but demonstrate higher accuracy.
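The sketch below shows a typical LIME call on synthetic tabular data; it assumes the open-source lime package, and the feature names are hypothetical labels added only for readability:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=4, n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

feature_names = ["income", "debt_ratio", "age", "late_payments"]  # hypothetical labels
explainer = LimeTabularExplainer(
    X,
    feature_names=feature_names,
    class_names=["repaid", "default"],
    discretize_continuous=True,
)

# LIME perturbs the instance, queries the black box, and fits a weighted
# linear model around it; the weights form the local explanation.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```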

Visual explanation: The purpose of visual explanation is to create visual representations that simplify understanding of how a model works, even when dealing with high-dimensional problems. These methods capture the decision boundary and the interactions among features, making visualisations a helpful tool, particularly for individuals without professional expertise. The decision support tool for lending decisions developed by Chen et al. (2022) employs interactive visualisations to clarify the structure and behaviour of time-series models; it allows users to investigate the model and gain a precise understanding of how conclusions are derived.
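As a simple example of a visual explanation, the following sketch (our illustration, using scikit-learn's partial dependence plots on synthetic data) shows how the average effect of individual features on the prediction can be visualised:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=1000, n_features=4, n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# One curve per selected feature; steeper curves indicate stronger influence.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1, 2])
plt.tight_layout()
plt.show()
```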

Transparent models: Models that are transparent in their decision-making process are valuable for improving understanding of how inputs relate to outputs, allowing users to comprehend the relationship between the data provided and the outcomes generated by the model. Linear regression models are a classic example, because the model coefficients indicate the significance or weight of each input feature in producing the outcome and explain how an additional unit of input linearly affects the output. Decision trees and rule-based learners are also transparent models, because they use an explicit set of decision-making rules that can be examined to determine how the output was generated. K-nearest-neighbour models can also be transparent, depending on the simplicity of the distance metric used to determine the nearest neighbours, which allows the model to provide a transparent representation of its decision-making process (Clement et al. 2023).
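A minimal sketch of a transparent model, using a logistic regression on synthetic data with hypothetical feature names, shows how its coefficients can be read directly as feature weights:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=4, n_informative=3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

feature_names = ["income", "debt_ratio", "age", "late_payments"]  # hypothetical labels
for name, coef in zip(feature_names, model.coef_[0]):
    # A positive coefficient raises the log-odds of the positive class by
    # `coef` for each one-unit increase in the feature, holding others fixed.
    print(f"{name}: {coef:+.3f}")
```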

Frameworks: A framework is commonly understood as a methodology incorporating multiple XAI methods to complete a single task (Linardatos et al. 2020). To assess a model's performance or demonstrate its approximation capabilities, researchers frequently use multiple XAI methods. Using various methods can improve the explanation and make comparing different methods easier. For example, Chen et al. (2022) created a machine learning model that is globally interpretable. This model includes an interactive visualisation and provides different summaries and explanations for each decision made. It offers case-based reasoning explanations by considering similar past cases, identifies the key features that influence the model's prediction, and provides customised concise explanations for specific lending decisions.

The summary of the different categories and their connections in the XAI landscape within the finance industry is illustrated in Fig. 1. The visual representation aims to improve comprehension of the intricate XAI techniques and their uses in finance.

Fig. 1
figure 1

Venn diagram illustrating the relationship between AI and XAI

3 SLR methodology

The systematic review followed the principles of Kitchenham and Charters (2007), which ensured a credible and verified framework for performing a thorough examination. The review process consisted of three main phases: planning, conducting, and preparing the report, as shown in Fig. 2.

Fig. 2
figure 2

SLR process stages

3.1 Identifying the research questions

A systematic literature review is proposed to examine the different methods used to provide XAI in the financial industry. The primary objective of this review is to provide valuable information on the various techniques employed across multiple domains and tasks and their corresponding evaluations. As shown in Table 1, four research questions were formulated to guide this literature review.

Table 1 List of the research questions

These questions are designed to systematically address the key aspects of XAI in finance, which include the financial tasks to which these methods are applied, the novel XAI approaches that have been developed and implemented, and the obstacles encountered during their execution.

After identifying the research questions, we provide a detailed explanation of the methodology used to search the relevant literature. This initial step provides comprehensive details on the databases employed in the search process, the search strategies and terms implemented, and the criteria applied to evaluate the relevance and value of each article chosen for review. An analysis of the primary article metadata, containing bibliometric information about the articles, is also provided. These data include a variety of essential information, such as the title of the article, the authors' names, the publication date, the journal in which the article was published, the number of citations obtained by the article, the keywords used, and the country of the authors (Donthu et al. 2021).

3.2 Searching specified databases for relevant articles

The initial starting point for a literature search is to identify suitable digital sources and literature databases. This approach facilitates a targeted keyword search, leading to the discovery of pertinent studies (Levy and Ellis 2006). An extensive search of the Scopus and Web of Science (WOS) databases from 2005 to 2022 was conducted because of their comprehensive coverage of finance and AI applications (Zhu and Liu 2020; Pranckutė 2021). We selected these databases for their rich content and because they include highly respected financial journals (Singh et al. 2021). They ensure that peer-reviewed articles published in leading international journals and conference proceedings are included, which helped us maintain high quality standards. These databases also offer a comprehensive range of bibliometric analysis tools, enabling users to access and export bibliographic data customised to their research requirements.

We conducted queries on the two databases outlined earlier to collect articles, specifically targeting sources recognised for their comprehensive coverage of relevant literature (Table 2). We performed a comprehensive literature search using meticulously chosen keywords and filters to guarantee the relevance of the articles' content. We carefully chose the keywords to cover a broad spectrum of studies pertinent to our research question. This comprehensive search was designed to find all relevant research while limiting the inclusion of irrelevant literature.

Table 2 Article collection query

A total of 773 papers were discovered in the initial search results. Subsequently, we applied a screening approach to assess the suitability of these results for inclusion in our review.

When conducting the SLR, various inclusion and exclusion criteria were used to filter relevant articles (Table 3). The inclusion criteria consisted of identifying articles that contained keywords from Table 2 and were directly relevant to assessing XAI applications in finance. Only peer-reviewed articles written in English and with full-text availability in the databases were considered for inclusion. The review excluded articles on the philosophy of XAI, technical reports, review articles, and duplicates. This was done to ensure that the reviewed articles provided concrete examples and empirical evidence of XAI applications in finance; technical reports often contain highly specialised information, and review articles summarise existing research and thus serve a different purpose from this review. Finally, editorials and opinion or viewpoint pieces on a particular topic were also excluded.

Table 3 Inclusion and exclusion criteria

The review process comprises four primary stages: identification, screening, eligibility determination, and categorisation of potential research articles. The PRISMA diagram (Page et al. 2021) systematically documented the procedure, with Fig. 3 illustrating each phase of the literature search. The identification phase involved searching the Scopus and Web of Science databases for related articles from 2005 to 2022 using specific keywords relevant to the topic. The screening phase then involved reviewing the titles and abstracts to exclude unrelated studies. During the eligibility phase, we conducted full-text reviews of the articles to determine whether they met the established inclusion criteria. Finally, the categorisation phase involved selecting the final articles for the review. The PRISMA chart provided an explicit and transparent overview of the entire process, allowing for easy analysis and replication of the study.

Fig. 3
figure 3

The PRISMA flowchart provides a detailed checklist and flow diagram of how the literature search was conducted

By following the steps mentioned above, we can provide a comprehensive and well-substantiated evaluation, eventually contributing to the advancement of expertise and understanding in the field of financial data science.

3.3 Descriptive analysis of the data

This section presents the results of the quantitative analysis performed on the selected publications. The research was carried out mainly by analysing bibliometric data using R Biblioshiny, a graphical web-based user interface for the Bibliometrix package. This application enables the visual representation of statistical data from articles, making it easier to analyse the data visually (Aria and Cuccurullo 2017).

Table 4 presents a concise overview of the bibliometric data collected from 138 published works by 397 authors. Of these, 77 were published in peer-reviewed journals, and the remainder were presented in conference papers and proceedings. Based on the data, there has been a significant increase in the number of studies concentrating on XAI in finance, with 80% of the reviewed works published between 2020 and 2022. This upward trend highlights the growing interest in implementing XAI in the financial sector and underscores the need for further research to explore its possibilities and address its challenges.

Table 4 Summary statistics of the collected articles

All articles that were reviewed in this research used author-defined keywords to aid in the indexing process within bibliographic databases. The word cloud technique was used to compare the keywords with the ones extracted from the abstracts. Figure 4 highlights the frequency of words with different font sizes and colours.

Fig. 4
figure 4

A word cloud displaying (a) author-defined keywords and (b) keywords extracted from abstracts

The word cloud clearly highlights the author-defined keywords, with explainable AI, credit scoring, deep learning, and credit risk being the most prevalent. Conversely, less prominent keywords include cost-sensitivity, financial forecasting, explainability, and credit score prediction. This distribution shows that research publications prioritise the advancement of deep learning, with a particular focus on credit evaluation and risk management. The word cloud containing abstract-extracted keywords reveals that the terms credit risk, credit scoring, artificial intelligence, and neural networks are the most prevalent, while random forest, prediction accuracy, and financial markets are less prevalent.

A keyword analysis helped us better understand the chosen subjects. The trend topics plotted in Fig. 5 visually represent the progress of the XAI approach and relevant finance topics. The graph shows each term's year and frequency of usage through bubbles, where bubble size reflects the relative usage frequency of the respective term. This analytical tool is invaluable for professionals keen to explore their field of study: researchers can pinpoint areas that require further investigation by identifying recent trends and issues. Furthermore, this resource highlights research areas that link XAI in finance with other pertinent topics, such as deep learning and risk management. These articles offer critical insights and are at the forefront of current research, serving as a valuable foundation for future exploration in this field.

Fig. 5
figure 5

Trend Topics plot

The geographical distribution of the number of publications on a global scale is illustrated on the map in Fig. 6. The grey regions indicate areas that do not have any publications during the specified time frame.

Fig. 6
figure 6

Publications by country of the first author and top ten countries in terms of publication number

China produced the most publications (n = 33), followed by Italy (n = 17) and Germany (n = 11). These results indicate that China has a strong presence in XAI-in-finance research, showcasing its significant involvement in the research domain, while Italy and Germany also make essential contributions to the academic discourse.

4 Review results and main findings

After reviewing some relevant features of the publications, we next present the results of a comprehensive study of the 138 articles that address the aforementioned questions. The results represent the state of XAI research in the financial sector and are organised around (i) which financial AI applications have benefited from the implementation of XAI; (ii) how the XAI methods used differ depending on the application and task; (iii) how new XAI methods are developed and applied; and (iv) what challenges, requirements, and unresolved issues remain in the literature on the application of XAI in finance.

4.1 RQ1: which financial AI applications have benefited from the implementation of XAI?

We employ a categorisation approach, rooted in the work of Cao (2021), to analyse the objectives of AI in the selected papers and its widespread application in finance. This categorisation comprises several significant domains, including financial market analysis and forecasting; agent-based economics and finance; intelligent investment, optimisation, and management; innovative credit, loan, and risk management; intelligent marketing analysis, campaigns, and customer care; and smart blockchain. A brief overview of the finance domains explored in the articles is presented in Fig. 7, along with the specific AI techniques used and the XAI approaches developed in response. The figure summarises the information provided in the articles and gives a visual representation of the relationships between the various finance domains, AI techniques, and XAI methods used in the research, enabling a quick and efficient understanding of the essential findings and implications of financial AI and XAI research. Appendix A contains a complete table that provides a detailed analysis of the financial domains examined in the research, the specific AI methods used, and the corresponding references.

Fig. 7
figure 7

The selected articles from different application domains were divided into clusters based on the finance area, the AI method, and the explanation form. The number of articles that fall under each of these categories is given in parentheses

The selected articles were evaluated in a structured manner, focusing first on the financial area, then on the specific AI methodology applied and the corresponding prediction task. Our findings indicate that 44% of the selected articles were published in the domain of credit management. Additional financial tasks covered include risk management, automated and intelligent investing, financial time-series analysis, loan management, cross-market analysis, and other related tasks.

Identifying XAI applications across finance domains has highlighted the importance of credit management as a critical undertaking, considering its substantial impact on several sectors of the financial industry. According to Moscato et al. (2021), good credit management is essential not only in the banking industry but also in other financial sectors. It ensures financial institutions' stability and profitability, while also facilitating the flow of capital for businesses and individuals. Conducting a credit management assessment involves many important steps, one of which is incorporating XAI into the machine learning system as described in the articles. These tasks include assessing credit scores (62%), conducting risk assessments (35%), making well-informed credit decisions, and identifying suitable credit classifications.

A total of 23 academic papers have been published on the subject of risk management, encompassing a diverse array of financial risks. These risks include fraudulent activities, bankruptcies, financial risks, potential bank failures, and the need for precise identification of banknotes. Regarding the specific risks examined, fraudulent activities accounted for 39%, bankruptcies accounted for 13%, financial risks accounted for 34%, and the possibility of bank failure and the need for accurate banknote identification each accounted for 4%.

Automated and intelligent investing encompasses a wide range of financial investment decisions, the most prominent of which are portfolio management and optimisation (47%), and recommendation engine evaluation (13%). The remaining 40% comprises alternative methods of selecting financial investments (Fig. 8).

Fig. 8
figure 8

Distribution of financial application

4.2 RQ2: How do the XAI methods differ depending on the application and task?

This section presents a simple classification system based on the explainability categories as they relate to different financial domains. Figure 9 visually represents the main types of explanation used in explanation design and documented in the literature (Belle and Papantonis 2021), specifically feature-relevance explanations, simplification-based explanations, local explanations, and visual explanations. Several studies used multiple methods that fall into different XAI categories; through our research, we have classified such methodologies as frameworks. Our findings indicate that within the financial industry, most studies involving XAI focus on three key areas: feature relevance explanation, explanation by simplification, and local explanation. Specifically, 45% of studies concentrate on feature relevance explanation, 18% on explanation by simplification, and 15% on local explanation.

Fig. 9
figure 9

The number of articles on the categories of XAI and their application in finance

Table 5 presents a detailed examination of several XAI techniques. The table provides comprehensive details about the scope, type, and specific association of each method with AI methodologies. According to the definition in Sect. 2.1, ante-hoc XAI methods are inherently designed to be interpretable and explainable. In contrast, post-hoc methods involve generating explanations after the ML model has been constructed and require an additional method for explanation. Moreover, explanations can be categorised as either local or global, depending on whether they are derived from a particular data point (local) or the whole model (global).

Table 5 XAI techniques and the specific AI methods to which they relate

4.2.1 Feature relevance explanation

To explain the decision-making process of a particular method, feature-relevance explanations attempt to quantify the impact of each input variable. Variables with higher values are considered more critical to the model. Feature relevance methods are most commonly used with ensemble models such as Random Forest and XGBoost, as well as with neural network algorithms, as shown in Fig. 10.

Fig. 10
figure 10

Feature relevance methods used with artificial intelligence methods and specific XAI methods

The Shapley Additive Explanations (SHAP) method is widely used in the field of XAI (Lundberg and Lee 2017) and is also frequently mentioned in this review. The main goal of this approach, based on a game-theoretic concept, is to create a linear model that focusses on the specific case to be explained; the coefficients of this model are then interpreted as the importance of the features. Gradient boosting methods, such as LightGBM and gradient boosting decision trees, have been employed to predict credit default (de Lange et al. 2022), large price drops in the S&P 500 index (Ohana et al. 2021), and subjective levels of financial well-being (Madakkatel et al. 2019), and to solve practical marketing problems (Nakamichi et al. 2022); the SHAP method has been incorporated into these models to identify the crucial explanatory variables. Credit risk assessment (Cardenas-Ruiz et al. 2022), economic growth rate and crisis prediction (Park and Yang 2022), analysis of relationships in capital flows (Nakamichi et al. 2022), and detection of accounting anomalies in financial statement audits (Müller et al. 2022) utilise various types of neural networks, including Convolutional Neural Networks (CNNs), long short-term memory (LSTM) network-based models, and Autoencoder Neural Networks (AENNs), with SHAP used to improve the accuracy and interpretability of these models. Shapley values have also been combined with Random Forest models to obtain both accurate predictions and reasonable explanations, and with a dynamic Markowitz portfolio optimisation model to improve the trustworthiness of robotic advisors in making investment decisions and to explain how the robotic advisor chooses portfolio weights (Babaei et al. 2022). A Random Forest method for predicting the financial performance of small and medium-sized enterprises (SMEs) demonstrated that a reduced number of balance sheet indicators could accurately forecast and clarify SME default and expected return (Babaei et al. 2023); the Shapley value method (Shapley 1953) was incorporated to increase interpretability, gradually removing the variables with the least explanatory power.

Several authors implement the feature importance method, which rests on the same principle as the SHAP approach, to provide interpretability by emphasising the characteristics that significantly impact the model outcome. A combination of linear quantile regression and neural networks was used to predict bank loan losses, together with a new feature importance measurement method that estimates the strength, direction, interaction, and other nonlinear effects of different variables on the model (Kellner et al. 2022). To address the lack of interpretability in enterprise credit rating models, Guo et al. (2023) used Convolutional Neural Networks with attribute and sequence attention modules. Zhou et al. (2020) present a generic time-series prediction framework that uses deep neural networks and a triple attention mechanism to achieve satisfactory performance on multi-modality and multi-task learning problems in financial time-series analysis. For credit risk assessment, a novel machine learning method, the Least Squares Support Feature Machine (LS-SFM), has been proposed; it introduces a single-feature kernel and a sampling method to reduce the cost of misclassification and provide interpretable results (Chen and Li 2006). Another study proposes a novel buying-winners-and-selling-losers (BWSL) strategy for stock investing called AlphaStock, which uses reinforcement learning and interpretable deep attention networks to address quantitative trading challenges and achieve an investment strategy that balances risk and return (Wang et al. 2019). Carta et al. (2021) introduced a graphical user interface that includes generating phrases from global articles to identify high-impact words in the market, creating features for a decision tree classifier, and predicting whether stock prices will be above or below a certain threshold.

In their empirical study, Wang et al. (2024) compared the performance of SHAP-value-based feature selection and importance-based feature selection in the context of fraud detection. They found that traditional feature importance methods typically yield a single score that represents the overall importance of each feature throughout the entire data set. On the other hand, SHAP values provide a more detailed explanation by giving significance values for each feature for individual predictions. This allows for the capture of interaction effects and a better understanding of the model's behaviour.

4.2.2 Explanation by simplification

Simplifying models is commonly achieved through feature extraction and rule-based methods. These techniques are widely regarded as effective ways to provide a set of interpretable features that explain decisions (Speith 2022). As shown in Fig. 11, the XAI methods discussed in this section are typically used alongside boosting methods, neural networks, and other approaches that involve multiple techniques.

Fig. 11
figure 11

Explanation by simplification methods used with AI techniques and specific XAI methods

Various simplification models have been proposed to detect credit card fraud and provide accurate and personalised financial recommendations and risk assessments. One such model is the non-linear model GBDT (Gradient Boosting Decision Tree), which uses cross features and trains a linear regression model for each transaction with MANE (Model-Agnostic Non-linear Explanations) (Tian and Liu 2020). Another model, the Weighted Soft Decision Forest (WSDF), aggregates multiple soft decision trees using a weighting mechanism of each tree's output to simulate the credit scoring process (Zhang et al. 2020a, b, c). Credit risk assessment decision support systems like Continuous Re-RX, Re-RX with J48graft, and Sampling Re-RX use the recursive rule extraction algorithm and decision tree to generate interpretable rules for machine learning-based credit assessment techniques (Hayashi 2016). The TreeSHAP method groups borrowers based on their financial characteristics to measure credit risk related to peer-to-peer lending platforms (Bussmann et al. 2021). Furthermore, a personalised mutual fund recommender system was proposed using a knowledge graph structure and embedding functions that provide both general and complex explanations, which was evaluated using a mutual fund transaction dataset (Hsu et al. 2022). All these models aim to provide accurate, interpretable, and personalised financial recommendations and risk assessments, and they can be applied in Big Data environments.

4.2.3 Local explanation

Local explanations are methods used to explain the decision-making process of machine learning models at the level of an individual instance or statistical unit, breaking a complex problem or model down into smaller, more manageable, and understandable parts to provide insight into the key factors affecting a specific prediction (Arrieta et al. 2020). These explanations can be generated through various techniques that focus on explaining part of the system's function, such as LIME, counterfactuals, and others (Fig. 12). Using local explanations improves the transparency and interpretability of machine learning models, thereby increasing confidence in their results.

Fig. 12
figure 12

Local explanation methods used with AI methods and specific XAI methods

A local explanation method that addresses the opacity issue in ML models for credit assessment involves a counterfactual-based interpretability assessment technique. This technique utilises counterfactuals to assess the model's robustness under different scenarios by calculating the decision boundary while accounting for variations in the dataset (Bueff et al. 2022). In the realm of financial news prediction, the LIME method is used to provide investors with clear explanations of how stock prices are predicted using financial news headlines. Future research directions are also identified, such as multilingual predictions, automated predictions from financial news websites, and the integration of emotion-based GIFs (Gite et al. 2021). The emphasis is placed on the importance of feature engineering in finance, and a feature selection approach is proposed to improve predictive performance by identifying relevant features for each stock. Additionally, a genetic algorithm-based approach has been proposed to generate feature-weighted, multiobjective counterfactuals to enhance the interpretability of bankruptcy prediction, which has been experimentally shown to outperform other counterfactual generation methods (Cho and Shin 2023).
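To illustrate the counterfactual idea in its simplest form, the sketch below (our own naive illustration on synthetic data, not the method of any cited study) greedily increases a single feature of an instance until the classifier's decision flips, revealing how much change would be needed to obtain a different outcome:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=4, n_informative=3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def simple_counterfactual(x, feature, step=0.05, max_steps=200):
    """Increase a single feature until the predicted class flips (if it ever does)."""
    original_class = model.predict([x])[0]
    candidate = x.copy()
    for _ in range(max_steps):
        candidate[feature] += step
        if model.predict([candidate])[0] != original_class:
            return candidate
    return None  # no flip found within the search budget

x = X[0]
cf = simple_counterfactual(x, feature=2)
if cf is not None:
    print("change in feature 2 needed to flip the decision:", round(cf[2] - x[2], 3))
```

Practical counterfactual methods such as those cited above search over all features under realism and cost constraints, but the underlying question is the same: what is the smallest change that alters the decision?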

4.2.4 The relationship between XAI, AI methods, and financial problems

We employed heatmaps to visually depict the correlation between XAI methods, AI approaches, and the specific financial tasks they address. These maps illustrate the values of the primary variables using a grid of coloured squares positioned along each axis; analysing the hue and position of these squares reveals the relationship between the two variables (Ferreira et al. 2013). Figures 13 and 14 use heatmaps to show two such relationships: between the XAI method used and the AI method, and between the XAI method used and the goal of the problem. Plotting these variables on the two axes of the heatmap and colouring the squares to represent the values of the main variable (in this case, the XAI method) makes it possible to see how the two variables are related and how commonly they were used together. The sketch below illustrates how such a heatmap can be produced.
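A minimal sketch of how such a heatmap can be built from a table of tagged articles is given below; the article tags are invented placeholders, not the actual labels assigned to the 138 reviewed papers:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tags for a handful of articles (placeholders only).
articles = pd.DataFrame({
    "xai_method": ["SHAP", "SHAP", "LIME", "Counterfactual", "SHAP", "LIME"],
    "ai_method": ["Boosting", "Neural network", "Neural network",
                  "Neural network", "Boosting", "Multi-approach"],
})

# Count how often each (XAI method, AI method) pair occurs.
counts = pd.crosstab(articles["xai_method"], articles["ai_method"])

fig, ax = plt.subplots()
im = ax.imshow(counts.values)  # colour encodes the co-occurrence count
ax.set_xticks(range(len(counts.columns)))
ax.set_xticklabels(counts.columns, rotation=45, ha="right")
ax.set_yticks(range(len(counts.index)))
ax.set_yticklabels(counts.index)
fig.colorbar(im, ax=ax, label="number of articles")
plt.tight_layout()
plt.show()
```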

Fig. 13
figure 13

Relationship between the XAI method and the type of AI

Fig. 14
figure 14

Relationship between XAI Methods and Problem Types

Using a feature-expression heatmap, a group of converging associations was identified between the XAI method, the specific type of AI implemented, and the particular problem type at hand. As stated previously, for credit assessment problems, the most commonly used XAI methods have been feature importance and transparent methods, and many authors have employed frameworks of XAI methods. There is an evident correlation between frameworks and multi-approach AI techniques, and the SHAP technique is employed in conjunction with boosting methods, neural networks, and multi-approach AI techniques.

4.3 RQ3: how are new XAI methods developed and applied?

The development of new XAI methods is often guided by an intuitive understanding of what makes an explanation valuable. Despite ongoing efforts to develop new techniques that extract information from AI systems to explain their outcomes, the current state of research is still insufficient to fully assess the validity and reliability of these explanations and their impact on user experience and behaviour. Although research in finance has increased significantly, progress in establishing techniques or metrics to assess the effectiveness of explanation-generation methods and the quality of the resulting explanations has been comparatively slow. Only 16 of the reviewed articles evaluated new XAI techniques or frameworks (Fig. 15).

Fig. 15
figure 15

The selected articles were divided into clusters based on the financial task, the AI method, the XAI category, and the newly created XAI method. The number of articles that fall under each of these categories is given in parentheses.

Several innovative techniques have been introduced in XAI to enhance the interpretability of complex models. One such technique (Han and Kim 2019) represents the first attempt to ensure both the accuracy and the interpretability of a banknote and counterfeit detection system using XAI. To achieve this goal, the authors proposed a new visualisation approach, pGrad-CAM, that addresses the limitations of an existing method by overcoming the problem of blank activation maps that can occur when using the Grad-CAM technique. Unlike Grad-CAM, which relies on weighted sums of feature maps determined by the average gradient values, pGrad-CAM evaluates positive gradient areas pixel by pixel even when the average gradient value is negative, resulting in richer and more informative activation maps. Another technique, ViCE (Gomez et al. 2020), provides counterfactual explanations for model predictions and allows users to understand the decision process by presenting, through an interactive interface, the minimal changes required to modify a decision for each sample; a use case demonstrates the effectiveness of the tool on a credit dataset, and its modular black-box design allows it to be customised and improved in the future. Meanwhile, Counterfactual Explanations as Interventions in Latent Space (CEILS) (Crupi et al. 2022) has the dual aim of incorporating causality into the creation of counterfactual explanations and using them to make practical suggestions for recourse, while at the same time being a method that can conveniently be added to existing counterfactual generators. It is a first step towards giving end users realistic explanations and actionable recommendations on achieving their desired outcome in automated decision-making processes. RESHAPE (Reconstruction Error SHapley Additive exPlanations Extension) (Müller et al. 2022) creates attribute-level explanations for AENNs and provides a thorough evaluation framework for benchmarking XAI methods in financial audit environments. A new black-box, high-fidelity explanatory method called MANE (Model-Agnostic Non-linear Explanations) (Tian and Liu 2020) explains why a deep learning model classifies a transaction as fraudulent and identifies the key characteristics that contribute to the decision. The approach uses an aggregation strategy to extract cardholder behaviour patterns from transaction data and a Gradient Boosting Decision Tree (GBDT) to discover cross-features that approximate the non-linear local boundary of the deep learning model and improve interpretability. The study shows that the proposed approach outperforms state-of-the-art techniques on a real financial dataset, introduces a highly accurate explanatory model that characterises behavioural patterns and cross-features, and allows linear regression models to fit the local non-linear boundary of the deep learning model. Finally, La Gatta et al. (2021) introduced two new XAI methods: CASTLE and PASTLE. The CASTLE approach integrates the global knowledge learned by a black-box model during its training into local explanations, which is achieved through a clustering algorithm that identifies instances that share similar characteristics and are classified homogeneously by the model. The PASTLE method, in turn, improves on standard feature importance explanations by providing information about the modifications required in the target instance to increase or decrease the likelihood of the label assigned by the model.

Two studies present new metrics to evaluate and improve XAI systems. Sovrano and Vitali (2022) introduce the DoX Pipeline, a model-agnostic approach to assessing the explainability of AI systems through knowledge graph extraction and a new metric based on Achinstein's theory of explanations. The proposed metric is tested on two real-world explainable AI systems in healthcare and finance that use artificial neural networks and TreeSHAP, and the results demonstrate the feasibility of quantifying explainability in natural language information. Gorzałczany and Rudziński (2016) proposed a new approach to the design of fuzzy rule-based classifiers (FRBCs) for financial data classification that addresses interpretability issues by optimising both accuracy and interpretability requirements during the design phase of the classifiers. The approach includes an original measure of complexity-dependent interpretability, an efficient implementation of strong fuzzy partitions for attribute domains, a simple and computationally efficient representation of the classifier's rule base, and original genetic operators for processing rule bases. The approach outperformed 24 alternative methods in interpretability while remaining highly competitive in accuracy and computational efficiency.

Several frameworks have been proposed to improve the transparency, auditability, and explainability of complex credit scoring models, including TAX4CS, the three Cs of interpretability, and a lending decision framework. The Transparency, Auditability, and Explainability for Credit Scoring (TAX4CS) framework proposes a structured approach to performing explanatory analysis of complex credit scoring models and emphasises the importance of assessing the suitability of the model rather than focussing solely on interpretability (Bücker et al. 2022). A framework called "the three Cs of interpretability" (Silva et al. 2019) encompasses the concepts of completeness, correctness, and compactness, developed in an ensemble model that satisfies the requirements of correctness and diversity and generates comprehensive and concise explanations for predictive models. Evaluation of the model on three datasets, including financial and biomedical ones, demonstrated its superior predictive performance compared to individual models such as a Deep Neural Network, a Scorecard, and a Random Forest, while providing accurate, compact, and diverse explanations consistent with expert analysis. The proposed lending decision framework (Chen et al. 2022) includes an interpretable machine learning model, an interactive visualisation tool, and various types of summaries and explanations for each decision. It is a two-layer additive risk model that can be decomposed into meaningful subscales, and the online visualisation tool allows the model to be explored to understand how it arrived at its decision. The framework also offers three types of explanation: case-based reasoning, essential features, and a custom sparse explanation. Finally, a new general-purpose framework, the Counterfactual Conditional Heterogeneous Autoencoder (C-CHVAE), allows the generation of counterfactual feature sets for tabular data without requiring a specific distance or cost function. The framework can be applied to various autoencoder architectures that can model heterogeneous data and estimate conditional log-likelihoods of the changing and unchanging inputs. In addition, C-CHVAE does not require a distance function for the input space and is not restricted to specific classifiers (Pawelczyk et al. 2020).

4.4 RQ4: What are the challenges, requirements, and unresolved issues in the literature related to the application of XAI in finance?

Based on the literature review, we present the challenges that need to be resolved in the field of explainability of AI models. In conducting this comprehensive review, we identified four main challenges: determining the intended audience for an explanation, for example whether it targets technical specialists or end users; the scarcity of genuinely new XAI techniques, as opposed to explainable machine learning methods, used to tackle financial problems; the absence of explainability evaluation metrics, which play a significant role in building trust in AI models in finance; and the security of the information provided to the XAI model.

4.4.1 Who is the audience for the explanation method?

The rationale for employing XAI is that the outcomes generated by AI models are more likely to be perceived as credible by end users when accompanied by an accessible explanation in nontechnical language (Hagras 2018).

Some authors specify the intended audience for their explanation method. For example, the RESHAPE method (Müller et al. 2022) is proposed as a tool to facilitate the application of complex neural networks in financial audits by presenting auditors with detailed but understandable explanations, thereby increasing their confidence in opaque models. This indicates that the resulting degree of interpretability may only be accessible to experts in the field.

Mohammadi et al. (2021) found that by handling different types of constraints, black-box encoding can enable stakeholders and explanation providers to consider individual-specific situations where a person may request that their personal limitations be considered in the provided explanation. Access to reliable and efficient tools to explain algorithmic decisions becomes crucial as stakeholders adopt more complex neural models to make decisions.

4.4.2 New XAI technique or explainable machine learning method?

For several studies (Obermann and Waack 2016; Cao 2021; Chaquet-Ulldemolins et al. 2022a, b; Liu et al. 2021a, b; Zhou et al. 2020; Dong et al. 2021), it was difficult to discern whether the authors had introduced a new explainable AI method or were instead presenting an explainable machine learning method, that is, one aimed at improving the ML method itself rather than developing a new XAI method.

As an example of the latter, consider DTONN, a proposed interpretable personal credit evaluation model that combines the interpretability of decision trees with the high prediction accuracy of neural networks (Chen et al. 2020). This example shows that researchers often seek interpretability through combinations of existing methods rather than by inventing new XAI methods.

4.4.3 Explainability evaluation metrics?

One of the limitations of XAI techniques is the insufficient availability of evaluation metrics that determine the quality of the explanations provided. Choosing the most appropriate explanation method can be a complex task, as there is no single, universally accepted definition of what constitutes a satisfactory explanation; this can vary depending on the individual and the specific circumstances involved. Further investigation is required to establish suitable standards for selecting and evaluating explanatory techniques in various contexts.

Some researchers use the central tendency measures of mode and median to assess the effectiveness of an explanation method and compare it with the most recent XAI techniques. According to the results, CASTLE outperformed Anchors by 6% in terms of accuracy, indicating that explanations can help users understand black-box models (La Gatta et al. 2021a, b). The accuracy of MANE, the XAI method proposed for credit card fraud detection, was assessed using RMSE as a local approximation accuracy metric and the positive classification rate (PCR) to validate the accuracy of the features selected by the explanatory model (Tian and Liu 2020). To assess the efficacy of integrating financial domain knowledge into the design and training of neural networks, Zheng et al. (2021) used the mean absolute percentage error on both the training and test sets as a performance metric. It can be inferred that, in most cases, the evaluation of the machine learning method is prioritised over the effectiveness of the XAI method.
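To make these evaluation practices concrete, the following is a minimal sketch of how such performance and fidelity metrics can be computed with NumPy. The function names, the agreement-style definition of the positive classification rate, and the commented usage lines are assumptions for illustration, not the exact implementations used in the cited studies.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, usable e.g. as a local approximation
    fidelity measure between a black-box model and its explainer."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes non-zero targets)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def positive_classification_rate(black_box_labels, surrogate_labels):
    """Share of instances on which the explanatory (surrogate) model agrees
    with the black-box prediction; a simple agreement-style metric."""
    black_box_labels = np.asarray(black_box_labels)
    surrogate_labels = np.asarray(surrogate_labels)
    return float(np.mean(black_box_labels == surrogate_labels))

# Hypothetical usage with arrays produced elsewhere in a pipeline:
# print(rmse(black_box_scores, surrogate_scores))
# print(mape(actual_returns, predicted_returns))
# print(positive_classification_rate(bb_predictions, explainer_predictions))
```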

4.4.4 Information security

The security of the XAI model is paramount, as any breach of confidentiality could expose it to malicious exploitation, allowing attackers to deceive the model and compromise the integrity of its results. Attackers can manipulate a model into producing a different output by injecting carefully crafted information (Ghorbani et al. 2019). This risk is a critical consideration for XAI, because even minor modifications to the original AI model or its inputs can significantly alter the outcome. Such attacks on loan or credit management systems could be catastrophic and have significant economic consequences.

4.4.5 Impact of XAI on the finance area

Transparency and regulatory compliance are paramount in the finance industry (Zheng et al. 2019). In this regard, the use of XAI can greatly benefit financial institutions by helping them pursue both objectives efficiently and accurately. XAI offers greater transparency and traceability than traditional AI, making it a more suitable choice for financial institutions aiming to comply with regulations and establish trust with their stakeholders.

The evaluated research has yielded empirical evidence highlighting XAI's potential for improving both transparency and predictive performance. Ghandar and Michalewicz (2011) investigated whether model interpretability could benefit intelligent systems in finance. Their experimental results demonstrated that interpretability objectives can yield rules with superior prediction capability while avoiding excessive complexity, underscoring the significance of interpretability in financial modelling.

XAI is crucial for financial professionals to understand the variables that affect risk predictions. This, in turn, leads to more accurate risk assessments and builds confidence among decision makers in the information provided by AI models. Lange et al. (2022) created an XAI model to predict credit defaults, utilising a unique data set of unsecured consumer loans. The authors combined LightGBM with SHAP to interpret how explanatory variables affect predictions, illustrating how XAI methods can improve the interpretability and reliability of AI models in banking. Models that incorporate XAI methods in this way usually outperform state-of-the-art models (Bussmann et al. 2020).
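As an illustration of this kind of pipeline, the following is a minimal sketch, not the authors' implementation, that pairs a LightGBM classifier with TreeSHAP attributions on synthetic stand-in data. The feature names and hyperparameters are hypothetical, and the exact shap plotting calls may vary slightly across library versions.

```python
import lightgbm as lgb
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an unsecured consumer-loan data set (names are illustrative).
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4, random_state=0)
feature_names = ["income", "debt_ratio", "loan_amount", "age", "n_late_payments", "tenure"]
X = pd.DataFrame(X, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees for default prediction.
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# TreeSHAP attributions of the model's output.
explainer = shap.Explainer(model)
shap_values = explainer(X_test)

shap.plots.beeswarm(shap_values)      # global view: which variables drive default risk
shap.plots.waterfall(shap_values[0])  # local view: attribution for a single applicant
```

The beeswarm plot corresponds to the global view of variable influence discussed above, while the waterfall plot explains an individual credit decision, which is the level of detail loan officers and auditors typically require.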

Ensuring the trustworthiness of recommendations in financial XAI is a critical challenge that demands our attention. Misleading explanations can cause users to rely on inaccurate results, eroding their confidence and trust. Legal regulations also drive the requirement to validate machine learning outcomes. The European Union has achieved a significant breakthrough in regulating AI technology by introducing the AI Act. This pioneering legal framework is designed to encourage the responsible and ethical use of AI in diverse industries and applications. The enactment of this act represents a significant step forward in ensuring that AI is used in a safe and beneficial manner.

5 Discussion and areas for further investigation

The main goal of this review was twofold: first, to examine and organise the tasks encompassed by XAI in finance and to highlight the challenges in generating explainable models; and second, to suggest potential avenues for future studies on implementing XAI in finance. To advance XAI in the field of finance, we present the following challenges and directions for further research:

  1. To fully exploit the potential of XAI, it is essential to focus on explaining the decisions made by the underlying model, and providing a clear and concise description of each variable is a critical part of this. When numeric coding is used, the meaning of each variable should be indicated; to avoid confusion, names should be spelled out rather than abbreviated. For instance, the variable "RM" can be written as "risk measure". Variables are sometimes presented in numeric coding without any legend or expansion of the abbreviation, which can lead to confusion and misinterpretation (a minimal variable-legend sketch is given after this list).

  2. Research shows that explainable AI is vital for credit assessment and risk management. However, other sensitive areas, such as portfolio optimisation, internet financing, and fraud detection, are comparatively less explored. Further research applying XAI methods in these less studied areas is therefore crucial.

  3. There is currently a lack of definitive guidance on selecting an appropriate XAI method for a given scenario, considering the multitude of options available. Furthermore, a degree of ambiguity surrounds the formulation of suitable explanations and how contextual factors may influence them (Ben David et al. 2021). Researchers frequently apply multiple XAI methods to multiple datasets to assess a model's performance or demonstrate its approximation capabilities; using diverse datasets can effectively demonstrate a model's robustness and flexibility in handling various data distributions (Pawelczyk et al. 2020). In a recent study, Tian and Liu (2020) proposed assessing the effectiveness of AI techniques in finance on the task of predicting Chinese companies' listing statuses, examining decision cost and classification accuracy as critical metrics of model performance. This perspective can be applied in future XAI-in-finance work to evaluate the effectiveness of machine learning techniques.

  4. To grasp the application of XAI in finance, it is beneficial to examine the frameworks used by researchers who implemented XAI techniques to improve machine learning explainability. The integration of LIME and SHAP techniques can substantially enhance interpretability by elucidating both the global and local contexts of the model (a combined LIME and SHAP sketch follows this list). Further investigation is needed to determine the most commonly used combinations of XAI methods and their specific purposes.

  5. Feature relevance explanations are among the most commonly used XAI techniques in finance. However, once "significant" or "relevant" features have been identified, we may still be left with high-dimensional feature vectors that require additional analysis, for example examining correlations or employing various metrics to determine similarity or closeness between data points. Although such an interpretation may act only as a mathematical depiction, it is feasible to use these methods to restructure a black-box model internally.

  6. Decision trees are a commonly utilised intrinsic tool in financial XAI. They are highly adaptable and can act both as an intrinsic method and as a post-hoc method for approximating more complicated models: they can function as a model and an explanation in themselves, as well as approximate complex models such as neural networks or gradient-boosting machines (a surrogate-tree sketch follows this list). We recommend avoiding complex models with unclear operations and no decision rule explanations. Although decision trees with many levels are theoretically comprehensible, the extended length of the decision rules along a path makes them difficult to understand, and the explainability task cannot be realised due to model complexity. We propose to expand the meaning of the term XAI, not only to indicate that the model is inherently explainable, but also to require that the model's behaviour be explained clearly and concisely, for example through diagrams or visuals, so that anyone, regardless of their level of knowledge, can easily and quickly understand the behaviour of the model.

Close collaboration between financial professionals and AI developers is crucial when deploying a financial AI model to improve the financial decision process. Financial experts possess the necessary expertise to determine whether a model's behaviour conforms to appropriate standards. Calculating bias and fairness metrics is imperative to ensure that the models employed do not display discriminatory behaviour; this is vital to maintaining ethical and moral standards and must be considered for the sake of all concerned parties (a minimal fairness-metric sketch follows below). It is also of utmost importance to adhere to explanatory methods that are demonstrably faithful and reliable with respect to the model, and to avoid techniques that are not. The factors mentioned above make it necessary to use XAI methods when implementing financial AI, and increasing the use of explainable techniques in financial AI research would significantly enhance its relevance to financial practice.
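As a concrete illustration of such checks, the following is a minimal sketch of two common group-fairness metrics (demographic parity difference and equal opportunity difference) computed directly with NumPy; the function names, group encoding, and commented usage are assumptions for illustration rather than a prescribed methodology.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in approval (positive prediction) rates between two groups,
    encoded as 0 and 1."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_0 = y_pred[group == 0].mean()
    rate_1 = y_pred[group == 1].mean()
    return float(abs(rate_0 - rate_1))

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute difference in true-positive rates between two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)
        tpr.append(y_pred[mask].mean())
    return float(abs(tpr[0] - tpr[1]))

# Hypothetical usage with model outputs on a held-out credit data set:
# dpd = demographic_parity_difference(loan_approved, applicant_group)
# eod = equal_opportunity_difference(actually_repaid, loan_approved, applicant_group)
```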

6 Recent developments

We conducted a final search in May 2024 to ensure the completeness of this review and identify any recently published studies. This search led to new studies on XAI measurement metrics by Giudici and Raffinetti (2023) and on the Key Risk Indicators for AI Risk Management (KAIRI) framework by Giudici et al. (2024). It also identified Aletheia, a new Python package designed to facilitate the interpretation of deep ReLU networks (Sudjianto et al. 2020), and PiML, a Python toolbox developed specifically for interpretable machine learning (Sudjianto et al. 2023).

As stated in Sect. 4.4.3, one drawback of XAI approaches is the lack of measurement criteria that accurately assess the quality of the explanations given. To ensure that AI is trustworthy and to properly control its use in the financial sector, Giudici and Raffinetti (2023) extended the use of Lorenz Zonoids to create measurement tools for four main trustworthiness criteria: S.A.F.E. (Sustainability, Accuracy, Fairness, and Explainability). They applied the approach to bitcoin price prediction, showcasing its superiority over state-of-the-art techniques. Giudici et al. (2024) developed the Key Risk Indicators for AI Risk Management (KAIRI) framework, which consists of statistical tools that can analyse the prediction output of any machine learning model, regardless of the data structure and model used, and can be used to measure the level of safety of any AI application. Sudjianto et al. (2020) introduce Aletheia, a toolkit that incorporates practical case studies of credit risk to enhance the understanding of decision-making processes in deep ReLU networks through network analysis, allowing researchers to inspect the logic of the network directly, detect issues, and optimise it. In a further publication, Sudjianto et al. (2023) present PiML, a Python toolbox that not only offers an expanding selection of interpretable models but also includes an enhanced collection of diagnostic tests. The widespread adoption of PiML by several financial organisations since its introduction in May 2022 highlights its significant potential in the banking industry.

Recent advances in XAI thus include measurement frameworks such as S.A.F.E. and KAIRI, as well as interpretability toolkits such as Aletheia and PiML. These developments aim to enhance the reliability and risk control of AI systems, with a specific emphasis on their practical use in the financial industry.

7 Conclusions

In finance, it is imperative to carefully evaluate the potential limitations and drawbacks of AI technologies before making any decisions, and it is recommended to integrate XAI methodology when using these technologies. XAI instils confidence in decisions made with AI and provides benefits that extend beyond a mere understanding of those decisions: by implementing it, financial institutions can validate the outcomes of their decisions, thereby supporting their credibility and trustworthiness.

This review provides a comprehensive analysis of the XAI methods used in the financial industry, focussing on evaluating their efficacy across diverse domains and tasks. In the financial sector, it is imperative to exercise caution when relying solely on sophisticated AI methodologies, as they can lead to unforeseen biases and unexpected problems. The study revealed that traditional machine learning methods are used more frequently in finance than complex multilevel AI algorithms. The analysis also identifies notable trends in XAI research, highlighting LIME and SHAP as the most commonly adopted techniques.

The review emphasises the challenges and unresolved issues that arise when implementing XAI techniques in finance. XAI has the potential to facilitate the practical application of AI techniques; however, it is important to consider the obstacles that may hinder its implementation, including the challenge of identifying the target audience, the scarcity of new XAI techniques, the absence of explainability evaluation metrics, and the security of the data used in XAI models. Addressing these challenges will be crucial to ensuring the successful integration of XAI in real-world scenarios. Furthermore, the review presents a comprehensive overview of potential applications and future research directions for XAI in finance, including portfolio optimisation, internet financing, and fraud detection.

Moreover, collaboration between financial professionals and AI developers can make XAI more specific to finance by integrating not only model-agnostic techniques but also methods tailored to financial applications.

This review highlights empirical examples that demonstrate the potential benefits of XAI in the financial industry, provides valuable insights into the potential of XAI in finance, and contributes to the ongoing discussion on its use in this industry. The adoption of AI algorithms in the financial sector has been on the rise, and XAI is expected to have a significant impact on decision-making processes in the future. However, the challenges mentioned above must be addressed, and this presents an opportunity for researchers to work in the field of financial XAI.