1 Introduction

Money laundering is the act of making money obtained through illegal operations, such as the trafficking of illegal drugs, appear to have come from legitimate commercial activities. Money from illegal activities is viewed as dirty, and the process therefore “launders” it to make it appear clean.

More formally, money laundering is the process by which criminals conceal the original ownership and control of the proceeds of their illicit activity by making those proceeds appear to have come from a legitimate source [1].

Money can be laundered in a number of ways, ranging from very simple schemes to very complicated ones. One of the most popular methods is to use a legal cash-based business owned by criminals to launder money. If criminals ran a shop, for instance, they might inflate the daily cash collections to move the dirty money from the shop into the shop’s bank account. The prevention of financial crime has therefore become increasingly important for banks. With the growth of modern technology and global communication, money laundering is rising drastically, causing banks to suffer significant losses, from the world’s largest fine ($1.9 billion) imposed on HSBC to fines of millions of dollars imposed on other banks worldwide [1].

Money laundering is a dynamic three-stage process: (1) placement, which moves cash away from its source and places it into circulation through financial institutions or other legitimate organizations; placement can be carried out in many ways; (2) layering, which makes laundering activity more difficult to detect and uncover by making the trail of illegal proceeds hard for law enforcement agencies to follow; and (3) integration, which moves the previously laundered money back into the economy, mainly through financial organizations or the banking system, so that it appears to be normal business earnings.

To fight money laundering, many systems have been introduced. Most of them are rule-based systems [2] that generate a lot of false positive alerts. These false positives waste compliance time and increase the risk that high-risk alerts will be missed among them. The alternative is to use machine learning methods. Anti-money laundering (AML) with machine learning is either supervised or unsupervised. Due to the lack of real-world labeled data, most newly introduced systems use unsupervised techniques such as clustering and anomaly detection. The drawback of this approach is that not every suspicious behavior is outlier behavior: money launderers constantly try to replicate normal customer behavior [3]. To overcome these challenges, this paper proposes a new framework based on supervised learning techniques. The contributions of this article are summarized as follows:

  • Introducing a machine learning model that works together with a rule-based system, takes its results, suppresses false positive alerts, and ranks true positive alerts with a risk-based mechanism.

  • Selecting the features from rule-based model alerts, generated by AML scenarios designed by AML specialists, which gives the solution the advantage of interpretability.

  • Applying the solution to real-world data and having real experts review the results and provide feedback, which makes the results accountable.

  • Applying hyperparameter tuning with Optuna [4], which gives excellent results with our machine learning model.

The organization of this article is as follows: in Sect. 2, we present an overview of related work. Section 3 is devoted to the presentation of the proposed framework. Section 4 describes our experimental evaluation approach and the different parameters used, along with a discussion of the results obtained. Finally, Sect. 5 concludes this paper.

2 Related works

Anti-money laundering has for decades been one of the hot areas to be addressed with machine learning. Domashova et al. [5] used Optuna with boosting algorithms to evaluate organizations based on non-transactional characteristics such as the organization’s age, authorized capital size, founder composition, and so on. The issue with this approach is that it neglects the fact that money laundering is a behavioral process that depends on patterns of transactions. In [6], the authors utilized four supervised learning algorithms on a bitcoin dataset to find money laundering. Unified terminologies related to the AML field were proposed in [7]. That paper points to two major fields to work on: customer risk profiling and suspicious behavior. The authors also point out a major challenge, which is the lack of public data. In [8], the authors utilized transactions, senders, and receivers to build graphs that help reduce false positives. They used real banking data, but many types of transactions in real data do not contain both sides of the transaction, and some types have only one side.

Weber et al. [9] pointed out that reducing the false positives of AML systems will also reduce the true positives, and that keeping the rule-based model is important because it is interpretable. In [10], the authors applied supervised learning techniques such as XGBoost to real transactional data divided into three groups: normal transactions, fired event transactions, and suspected STR transactions; they achieved an 82% AUC. Ahmed et al. [11] used a gradient boosting algorithm to detect money laundering activities at the transaction and account levels. Some other studies focused on filtering watch lists based on machine learning in AML [12]. Based on a 10,000-transaction dataset, the authors of [13] used decision tree and support vector machine classifiers to determine whether a transaction is legal or not, and they discovered that decision trees outperform SVM on their customized dataset. Xia et al. [14] proposed a money laundering prediction model based on graph convolutional neural networks (GCN) and long short-term memory (LSTM) over transactions; it gives good precision results but, on the other hand, loses many true positive transactions.

The authors in [15] utilized real-world transactional data. They built monthly profiles with the sums and counts of each type of transaction, such as cash and wires, to identify money laundering. They used regression and classification techniques such as logistic regression, logistic regression with lasso, K-NN, and XGBoost. The F1 score and accuracy were used in that paper to compare the results, and the authors concluded that traditional logistic regression with a binary outcome performs better than the other models. Raiter [16] utilized logistic regression, random forest, artificial neural network, and support vector machine with transactional features such as the amounts of each transaction type. They used the accuracy score to compare the results, which is not appropriate for this problem: the nature of AML and fraud data is that it contains very rare events, so the accuracy score gives a misleading picture of the model. In [17], social network analysis was applied to fight money laundering; the authors introduced a prediction model that uses social networks to predict customer risk and the involvement of accounts belonging to customers engaged in money laundering.

The deep learning approach proposed in [18] for anti-money laundering (AML) has several strengths. Firstly, it replaces predefined rules with automatically extracted latent features, allowing for a more flexible and adaptable system that can detect new patterns and anomalies in transaction sequences. Secondly, the use of recurrent and Transformer encoder layers enables the model to capture long-term dependencies and complex relationships between transactions, leading to better performance in reducing false positives and retaining true positives. Additionally, the experiment with a large dataset from Spar Nord Bank and the subsequent expert review of 26 clients reported to the Danish authorities suggest that the approach has real-world potential for improving AML compliance.

However, there are also some limitations to consider. Firstly, the proposed approach requires significant data processing and model training, which may pose challenges for smaller institutions with limited resources. Additionally, the reliance on transaction sequences may overlook other important sources of information, such as customer profiles and external data sources. Finally, the experiment with a small sample of high-risk clients prompts further questions about the approach’s scalability and generalizability to different populations and contexts.

Overall, while the proposed deep learning approach shows promise for enhancing AML systems, it is important to consider both its strengths and limitations when evaluating its potential impact on financial institutions and regulatory compliance.

However, these works mainly focus on transactions or customer profiles, whereas our research concentrates on the alerts generated from customer scenarios. Because these alerts already quantify the behavior, the features in this article are the alerts produced by the AML system rather than the raw transactions. Moreover, the results of previous works were not measured effectively: focusing solely on reducing false positives without also tracking true positive events hides the trade-off between reducing false positives and not losing too many true positives. We therefore employ F-beta with different \(\beta \) values to comply with the requirements of the financial authorities.

3 The proposed ASXAML framework

In this paper, we propose a novel anti-money laundering framework called automatic suppression based on XGBoost for AML (ASXAML). This section presents the basic idea, key features, and architecture of the proposed AML framework; then, the steps of the ASXAML framework are described.

The architecture of the ASXAML framework is described in Fig. 1, which shows the flow of the cycle after applying the proposed framework. Data are extracted from the core banking system; then transformation and load jobs are applied to place the data in the target AML format. The AML scenarios then analyze the transactions and other data to generate alerts. These alerts pass through our ASXAML framework before being presented to investigators. The ASXAML framework predicts the importance level of the alerts grouped by customer and suppresses those whose predicted importance falls below the cut-off value; the alerts above the cut-off value are raised for investigation.

Fig. 1 Anti-money laundering solution architecture
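As a rough illustration of this suppression step, the sketch below scores each customer's alert profile with an already trained classifier and forwards only the customers whose predicted importance exceeds a cut-off value. The names `model`, `customer_profiles`, and `cutoff` are hypothetical placeholders, not objects defined in the paper.

```python
# Minimal sketch of the alert-suppression logic described above (illustrative only).
import pandas as pd

def suppress_alerts(model, customer_profiles: pd.DataFrame, cutoff: float = 0.5) -> pd.DataFrame:
    """Score each customer's alert profile; keep only customers above the cut-off."""
    scores = model.predict_proba(customer_profiles)[:, 1]  # probability of the "important" class
    scored = customer_profiles.copy()
    scored["importance_score"] = scores
    # Customers below the cut-off are suppressed; the rest are raised for investigation.
    return scored[scored["importance_score"] >= cutoff]
```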

Current AML systems that depend on a rule-based model produce a lot of false positives, which drain compliance time and effort. Our main goal is to intercept the alerts raised by the rule-based model, filter these alerts with our framework, and raise only the important ones. We therefore propose the ASXAML framework to cover the drawbacks of the previous work. Figure 2 shows the steps of the ASXAML framework, which we demonstrate in the next subsections.

Fig. 2 The architecture of the ASXAML framework

3.1 Data preparation

The first phase of the ASXAML framework involves the data preparation process. This phase encompasses the data acquisition, the extraction, and the transformation of the data into a format that is suitable for processing by the model. Additionally, it describes the original features of the dataset, details of data insights and variable analysis, and some basic descriptors to help understand the final data structure.

3.2 Dataset description

We use a real-world banking dataset for all our experiments. For privacy concerns, we are unable to reveal the name of the bank or provide precise information; however, we do provide approximations to characterize the data where we can.

As shown in Table 1, the dataset used is real data that contains 46 scenarios, and the number of customers with alerts is more than 210K. Each customer can have more than one alert for different scenarios, and the same scenario can fire on the same customer multiple times during the period. The number of important entities (suspected of money laundering) is 6420 customers, and the number of false positive customers is 204,576, so the data are highly unbalanced. We took a sample of 10K from the false positive customers and labeled it 0, and all 6420 important customers were labeled 1. The dataset was divided into two subsets, a training dataset and a testing dataset, following a split ratio of 70% for training and 30% for testing. This partitioning maintains the distribution of the samples for each class in the dataset.

Table 1 Data summary
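For illustration only, the undersampling and stratified 70/30 split described above could be reproduced along the following lines; the column names `customer_id` and `label` are assumptions, not the actual schema of the banking dataset.

```python
# Sketch of the undersampling and stratified train/test split (assumed column names).
import pandas as pd
from sklearn.model_selection import train_test_split

def build_train_test(customers: pd.DataFrame, n_false_positives: int = 10_000, seed: int = 42):
    """Undersample the false-positive class and split 70/30 while preserving class ratios."""
    important = customers[customers["label"] == 1]                      # 6420 customers, label 1
    false_pos = customers[customers["label"] == 0].sample(n=n_false_positives, random_state=seed)
    data = pd.concat([important, false_pos], ignore_index=True)

    X = data.drop(columns=["customer_id", "label"])
    y = data["label"]
    # stratify=y keeps the class distribution identical in the training and testing subsets
    return train_test_split(X, y, test_size=0.30, stratify=y, random_state=seed)
```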

3.2.1 Data acquisition

The data utilized in this study are a real-world dataset obtained from a financial institution that has requested to remain anonymous due to privacy concerns. During this phase, the handling of missing values, removal of useless or highly correlated data, and data transformations were performed to optimize the information provided to the models and adapt the data to the behavior of an anti-money laundering department.

3.2.2 Data extraction

The data utilized in this study consist of the alerts generated by the AML system based on 46 configured scenarios in the financial institution over a period of 6 months. The scenarios include, but are not limited to, large cash deposits on a daily basis, structuring the deposits over multiple days, transferring money to high-risk countries, and receiving money from non-customers. The data were divided into two groups, the important group and the not-important group.

The important group includes alerts that were deemed important by the financial institution investigator and reported to the Money Laundering Combating Unit (MLCU), alerts that were investigated and raised to manager level, and alerts that required significant investigation even though they were not money laundering cases. The not-important group includes alerts that were determined to be false positives by investigators. As most alerts in the AML solutions are false positives, only random samples were selected from the not-important category, with a sample size of four events from the not-important group for each important event, while all important alerts were included in this study. The proportion of important alerts reported to the MLCU is typically between 2 and 5% of the total events produced by the AML system [19].

3.2.3 Customer profiling

The customer profiling step is a critical part of the ASXAML framework, which aims to build a customer profile based on the historical fired alerts in the last six months prior to the last closed alert in the not-important group, and before the last case or report on the customer in the important group. To ensure the integrity and accuracy of the data, the framework excludes all alerts that were fired after the reported case, as these alerts were not considered when the investigator made their decision.

The data extracted include the customer number, the scenario name, and the date of the fired alert. The gathered data are then grouped by customer number, so that each customer has only one record with all their alerts placed next to it. Table 2 illustrates a sample of the extracted data, while Table 3 shows the structure of the prepared data: the number in each alert-related cell represents the number of alerts of that scenario that occurred for this customer over the period, and the label represents whether this customer is important or not.

Table 2 Sample of data extracted
Table 3 Profiling customers

The profiling of all customers enables the framework to group customers with similar alerts and classify them accordingly. Ultimately, the framework can use the generated profiles to predict instances of money laundering activity, as it accurately captures customer behavior and detects any unusual or suspicious activities.
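A minimal sketch of this profiling step is given below, assuming an alert table with columns `customer_no`, `scenario_name`, and `alert_date` (illustrative names): the alerts are counted per customer and scenario so that each customer becomes a single row, as in Table 3.

```python
# Sketch of customer profiling: one row per customer, one column per scenario,
# each cell holding the number of alerts fired in the profiling window.
import pandas as pd

def build_customer_profiles(alerts: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Pivot raw alerts into per-customer scenario counts and attach the important/not label.

    `labels` is assumed to be a Series of 0/1 labels indexed by customer number.
    """
    profiles = pd.pivot_table(
        alerts,
        index="customer_no",      # one record per customer
        columns="scenario_name",  # one column per AML scenario
        values="alert_date",
        aggfunc="count",          # number of alerts of that scenario
        fill_value=0,
    )
    return profiles.join(labels.rename("label"), how="left")
```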

3.3 Feature selection

Feature selection is a crucial step in machine learning to select relevant and important features while avoiding irrelevant, redundant, or unimportant ones. Recursive feature elimination with cross-validation (RFECV) is a widely used method for feature selection [20]. The main idea of RFECV is to recursively fit a model and repeatedly eliminate the least important variable. The process continues until all variables have been ranked, and the order of elimination reflects their importance.

To implement the RFE algorithm, a machine learning algorithm such as random forest, Naïve Bayes, logistic regression, or gradient boosting is needed to evaluate the importance of the features. In this paper, random forest is used as the classifier for the RFE model. The initial selection of features significantly affects the features selected later in the RFE model, as the splits are not the same when the model is rebuilt [21]. The selected features therefore vary with each RFE run, and k-fold cross-validation is used to address this challenge when selecting the important features for automatic suppression.

In the RF-RFECV strategy, the best variables are selected based on the highest cross-validated recall of each RFE-with-cross-validation model. This technique is used for feature selection in this paper.
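A sketch of the RF-RFECV step under these choices (random forest as the base estimator, recall as the cross-validation score) is shown below; `X_train` and `y_train` are assumed to be the customer profiles and labels from the data-preparation phase.

```python
# Sketch of RF-RFECV: recursively eliminate features, keeping the subset
# with the highest cross-validated recall.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

def select_features(X_train, y_train, n_folds: int = 5):
    selector = RFECV(
        estimator=RandomForestClassifier(random_state=42),
        step=1,                               # drop one feature per iteration
        cv=StratifiedKFold(n_splits=n_folds),
        scoring="recall",                     # recall drives the selection
        n_jobs=-1,
    )
    selector.fit(X_train, y_train)
    return list(X_train.columns[selector.support_]), selector
```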

3.4 Classification model

The ASXAML framework uses Extreme Gradient Boosting (XGBoost) in the classification phase. XGBoost is an ensemble learning method that learns from labeled data. It is sometimes not sufficient to depend on one learning model and only one result; ensemble learning combines multiple learners to enhance the result. The final model is a combination of several models whose results are aggregated to obtain a better model than any of the individual ones. XGBoost was chosen because it performs well on real-world classification problems. XGBoost supports parallel processing and can use all the cores of the machine it runs on. It is highly scalable and can effectively deal with classification and preprocessing of data. XGBoost is flexible and can be integrated with many platforms; it is not tied to any specific one and can be used from multiple programming languages, such as C++, Python, R, and Java. XGBoost can transform a weak learner into a strong learner by boosting that learner through its optimization process, and it avoids over-fitting through regularization, whether trees or linear models are used [22].

XGBoost has an internal cross-validation function, so there is no need for external packages to apply cross-validation. It can deal with missing values and allows the user to set a custom objective function and define custom evaluation metrics. In addition to the previous features, XGBoost has been used in many winning entries of machine learning competitions, such as those on Kaggle [23]. Therefore, XGBoost is a good choice for dealing with the money laundering problem. Next, we discuss the mathematical model of XGBoost.
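For orientation, a bare-bones way to plug the selected features into an XGBoost classifier through its Python scikit-learn API is sketched below; the parameter values are placeholders, since the actual hyperparameters are tuned with Optuna in Sect. 3.5.

```python
# Illustrative XGBoost setup for the classification phase (placeholder parameters).
from xgboost import XGBClassifier

def train_classifier(X_train, y_train):
    model = XGBClassifier(
        n_estimators=200,        # number of boosted trees (assumed value)
        eval_metric="logloss",   # internal evaluation metric
        n_jobs=-1,               # use all available cores
        random_state=42,
    )
    model.fit(X_train, y_train)
    return model
```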

3.4.1 Mathematical model of XGBoost

The objective function in the XGBoost model is:

$$\begin{aligned} Obj(\theta )=L(\theta )+R(\theta ) \end{aligned}$$
(1)

where \(\theta \) denotes the model parameters to be fitted given the independent variables \(x_i\) and the dependent variables \(y_i\). L is the training loss function, and R is the regularization term, which controls the complexity of the model and helps to avoid over-fitting. A common choice for L is the mean squared error:

$$\begin{aligned} L(\theta )=\sum _{i}(y_i - \tilde{y_i})^2 \end{aligned}$$
(2)

and the prediction \(\tilde{y_i}\) can be written in the form:

$$\begin{aligned} \tilde{y_i}=\sum _{t=1}^{T} f_t(x_i),f_t \in F \end{aligned}$$
(3)

where T is the number of trees and F is the set of all possible classification and regression trees (CARTs). In XGBoost, instead of training many independent trees, the trees are created sequentially; each tree depends on the previous ones, and the objective of each tree is to improve on the previous result:

$$\begin{aligned} \tilde{y_i}^{(0)}= & {} 0 \end{aligned}$$
(4)
$$\begin{aligned} \tilde{y_i}^{(1)}= & {} f_1(x_i) =\tilde{y_i}^{(0)} + f_1(x_i) \end{aligned}$$
(5)
$$\begin{aligned} \tilde{y_i}^{(2)}= & {} f_1(x_i) + f_2(x_i) =\tilde{y_i}^{(1)} + f_2(x_i) \end{aligned}$$
(6)

...

$$\begin{aligned} \tilde{y_i}^{(T)} = \sum _{t=1}^{T} f_t(x_i) = \tilde{y_i}^{(T-1)} + f_T(x_i) \end{aligned}$$
(7)

The objective function then becomes:

$$\begin{aligned} Obj^{(T)}=\sum _{i=1}^{n}(y_i - (\tilde{y_i}^{(T-1)} + f_T(x_i)))^2 + \sum _{t=1}^{T}\omega (f_t) \end{aligned}$$
(8)

3.5 Hyperparameter tuning using Optuna

In the present study, the hyperparameter tuning phase employed Optuna, an automatic hyperparameter optimization software framework specifically designed for machine learning [4]. Optuna offers several advantages, such as a dynamic define-by-run API that enables users to define their search space dynamically, an efficient implementation of searching and pruning strategies, and a versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to lightweight experiments conducted via an interactive interface. Additionally, Optuna is an open source framework that outperforms many black-box frameworks while being easy to use and set up in various environments. By leveraging Optuna, the present study aimed to obtain new hyperparameters automatically, without requiring interventions from money laundering experts [24]. Specifically, Optuna was utilized to tune the hyperparameters of the XGBoost model, and the optimized hyperparameters were then fed into the model to score the data.
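A minimal sketch of this tuning step, assuming an objective that maximizes cross-validated F-beta (\(\beta \) = 2 in experiment #1) over the hyperparameters discussed in Sect. 4 (booster, alpha, lambda, colsample_bytree, subsample), is given below; the search ranges are illustrative, not the exact study configuration.

```python
# Sketch of Optuna tuning for XGBoost with F-beta as the maximization objective.
import optuna
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def tune_hyperparameters(X_train, y_train, beta: float = 2.0, n_trials: int = 1000):
    scorer = make_scorer(fbeta_score, beta=beta)

    def objective(trial: optuna.Trial) -> float:
        model = XGBClassifier(
            booster=trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
            reg_alpha=trial.suggest_float("alpha", 1e-8, 10.0, log=True),    # L1 regularization
            reg_lambda=trial.suggest_float("lambda", 1e-8, 10.0, log=True),  # L2 regularization
            colsample_bytree=trial.suggest_float("colsample_bytree", 0.5, 1.0),
            subsample=trial.suggest_float("subsample", 0.5, 1.0),
            n_jobs=-1,
            random_state=42,
        )
        # Cross-validated F-beta is the objective value Optuna maximizes.
        return cross_val_score(model, X_train, y_train, scoring=scorer, cv=5).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study
```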

3.6 Validating the results

In this step, the confusion matrix was used to obtain the results of the models and to compare all the models used. The confusion matrix (Fig. 3) describes the performance of the classification models.

Fig. 3 Confusion matrix

From the confusion matrix, we can conclude the following measures:

  • Recall = true positive rate = \(\frac{\text {true positives}}{\text {true positives} + \text {false negatives}}\)

  • Precision = positive predictive value = \(\frac{\text {true positives}}{\text {true positives} + \text {false positives}}\)

  • F-measure = harmonic mean of recall and precision = \(\frac{2 \times \text {precision} \times \text {recall}}{\text {precision} + \text {recall}}\)

These are the standard measures for most classification problems. We also use the most important measure for this problem [25], which is F-beta. F-beta can be used to emphasize one measure over the other; in our case, we need to concentrate on recall, while false positives remain important.

$$\begin{aligned} F_\beta = (1 + \beta ^2) \frac{precision*recall}{(\beta ^2 * precision) + recall } \end{aligned}$$
(9)
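As an illustration, these measures can be computed from the confusion matrix as sketched below. Here, the “reduced false positives” value reported in Sect. 4 is interpreted as the fraction of not-important customers that the model closes automatically, i.e. TN / (TN + FP); this interpretation is an assumption based on the wording of the results.

```python
# Sketch of the evaluation step: confusion-matrix metrics plus F-beta (Eq. 9).
from sklearn.metrics import confusion_matrix, fbeta_score, precision_score, recall_score

def evaluate(y_true, y_pred, beta: float = 2.0) -> dict:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f_beta": fbeta_score(y_true, y_pred, beta=beta),
        # Assumed definition: share of not-important customers suppressed automatically.
        "reduced_false_positives": tn / (tn + fp),
    }
```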

4 Experiments and results

In this section, we describe the experiments performed on a real dataset. The following subsections present and analyze the results.

4.1 Performance evaluation

In this section, we present the evaluation of the ASXAML framework on real-world datasets. The dataset was partitioned into two distinct subsets, a training dataset and a testing dataset, with a division ratio of 70% for training and 30% for testing. This stratified partitioning ensured an equitable representation of samples from all classes. To evaluate the performance of the proposed framework, we conducted five experiments using different combinations of feature selection methods, classification algorithms, and optimization settings. The first experiment used the RFECV feature selection method with XGBoost and Optuna, the second experiment used only XGBoost and Optuna, the third experiment used only XGBoost with default parameters, the fourth experiment varied the \(\beta \) value of the Optuna optimization function, and the fifth experiment used multiple classifiers, including SVM, RF, NB, KNN, and DT. The reason for conducting experiments with different combinations of methods and algorithms is to assess the effectiveness and robustness of the ASXAML framework in different scenarios and to compare its performance with other commonly used classifiers. The experiments were conducted on Google Colaboratory, which provides a free Jupyter notebook environment with GPU support for running machine learning experiments [26].

4.1.1 Experiment #1

In this experiment, Optuna with \(\beta \) = 2 as the optimization function and XGBoost with feature selection were both employed. We used multiple cross-fold validations, from two folds to ten folds, but as there would be numerous figures, we only show the findings of the two- and five-fold validations. Figure 4 shows the result of applying the feature selection method (RFECV) to the 46 features with 2 cross-fold validation. The optimal number of features chosen by RFECV, which represents the highest score, is 16. The two lines in this figure display the result of each cross-validation applied; the x-axis is the number of features, and the y-axis shows the recall obtained by RFECV on that fold.

Fig. 4 RFECV feature selection method for 2 cross-fold (C-F) validations

After applying Optuna to tune the hyperparameters, Fig. 5 shows the hyperparameters that affect the model the most. Booster has three options: gbtree, gblinear, or dart; gbtree and dart use tree-based models, while gblinear uses linear models. The second most important hyperparameter in this experiment was alpha, an L1 regularization term on the weights (analogous to Lasso regression). The third most important parameter was colsample_bytree, the subsample ratio of columns when constructing each tree, followed by subsample, which denotes the fraction of observations randomly sampled for each tree; subsampling occurs once for every tree constructed. Lambda, the L2 regularization term on the weights (analogous to Ridge regression), did not affect the model in this experiment. It is noticeable that the booster hyperparameter has the most important effect in this experiment. Out of 1000 Optuna trials, the best trial for the XGBoost model was trial 421, with an F-beta of 0.86 (Fig. 5); the majority of the trials had objective values (F-beta) above 0.80, which indicates that most of the trials produced excellent outcomes.

Fig. 5 Hyperparameter importance and Optuna trials for 2 cross-fold validations
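For reference, Optuna exposes these importances programmatically; a minimal sketch, assuming the `study` object returned by the tuning step sketched in Sect. 3.5, is:

```python
# Sketch: retrieving and plotting hyperparameter importances from an Optuna study.
import optuna

def report_importances(study: optuna.study.Study) -> dict:
    importances = optuna.importance.get_param_importances(study)  # e.g. {"booster": 0.96, ...}
    optuna.visualization.plot_param_importances(study).show()     # interactive Plotly figure
    return importances
```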

The second observation in this experiment concerns five cross-fold validations. In this case, the best number of features according to five cross-fold validations was 44, as shown in Fig. 6. It was again found that the booster hyperparameter has the greatest influence on the outcome (see Fig. 7). The best trial was trial 770, with an F-beta of 0.86, which is close to the result obtained with two cross-fold validations. The objective value of the majority of trials is over 0.80, which indicates that the majority of trials offer very good results.

Fig. 6 RFECV feature selection method for 5 cross-fold (C-F) validations

Fig. 7 Hyperparameter importance and Optuna trials for 5 cross-fold validations

The comparison of all cross-fold validation settings, from 2 to 10 cross-folds, is shown in Fig. 8. The comparison metrics are recall, reduced false positives, and F-beta, which are also displayed in Table 4. It is noted that all settings provide excellent results overall and that the variation in terms of F-beta is not very high. It is further noted that F-beta is a well-suited measure for this particular problem because it balances lowering the false positives against maintaining a high recall. The 9- and 7-fold validations had the best overall F-beta of 0.86. The relevant hyperparameters affecting the model for each cross-validation setting are shown in Table 5, and it can be seen that the booster hyperparameter is the most crucial hyperparameter in many settings. The best two results of applying XGBoost after RFECV and Optuna were as follows: for 9 cross-folds, F-beta = 0.86 and the recall of the suspicious customers (important customers) was 0.92; the proportion of reduced false positive customers is 0.73, which means that only 0.27 of customers, with a 0.92 true positive hit, will be investigated. For 7 cross-folds, F-beta = 0.86 and the recall of the suspicious customers was 0.93; the proportion of reduced false positive customers is 0.71, which means that only 0.29 of customers, with a 0.93 true positive hit, will be investigated.

Fig. 8 Comparison between F-beta, recall, and reduced false positives for different numbers of cross-folds

Table 4 Comparison between recall, F-beta and reduced false positives for all C-F validations
Table 5 Hyperparameters importance for all combinations of C-F validations

After applying Optuna (1000 trials) and comparing the results of the various cross-fold validations, ranging from 2 to 10 folds, it is evident that the final results are quite similar and do not differ significantly. This means that our framework continues to demonstrate its value even when used with various configurations.

4.1.2 Experiment #2

In this experiment, XGBoost with Optuna and \(\beta \) = 2 as the optimization function was utilized without any feature selection method. Booster, with an importance score of 0.96, is the key hyperparameter affecting the model when Optuna is applied (see Fig. 9). Alpha was the second most significant hyperparameter, while colsample_bytree, subsample, and lambda had minimal influence on the model. Optuna conducted 1000 trials, with the best trial, number 537, yielding an F-beta of 0.86.

Fig. 9 Hyperparameter importance and Optuna trials for experiment #2

The result of applying XGBoost and Optuna was as follows: F-beta = 0.86, and the recall of the important customers was 0.90. The proportion of reduced false positive customers is 0.76, which means that only 0.24 of customers, with a 0.90 true positive hit, will be investigated. Figure 10 compares the outcomes of experiment #1 and experiment #2. In this comparison, we used 5 cross-fold validation, which is the RFECV default, to represent experiment #1. As shown, the recall and F-beta in experiment #1 are superior to those of experiment #2; however, experiment #2 performs better in terms of reducing false positives.

Fig. 10 Comparison of ASXAML results in experiment #1 versus experiment #2 in terms of F-beta, recall, and reduced false positives

4.1.3 Experiment #3

In this experiment, XGBoost was applied with its default parameters, without RFECV and without Optuna. The results were as follows: F-beta = 0.81, and the recall of the suspicious customers (important customers) was 0.79. The proportion of reduced false positive customers is 0.95, which means that investigators will investigate only 0.05 of customers, with a 0.79 true positive hit.

4.1.4 Experiment #4

In this experiment, the Optuna maximization function was varied over multiple \(\beta \) values, ranging from 1 to 2, and cross-fold values of 2 and 5 were used for feature selection. According to Fig. 11, the recall with 2 cross-folds starts at 0.80 and rises as \(\beta \) is increased. On the other hand, the reduced false positives start at 0.94 and decrease as \(\beta \) is raised. At \(\beta \) = 1.4 and 1.5, the two metrics are more evenly balanced, while at \(\beta \) = 2 the bias toward recall sets in and the false positive reduction is at its lowest. The same trend was present with 5 cross-folds, with no noticeable differences. This indicates that the degree of watchfulness toward customers suspected of money laundering is up to the financial institution: it can increase \(\beta \) while still reducing false positives by at least 0.70, or aim for \(\beta \) = 1.4 or 1.5 if a balance is needed.

Fig. 11 Optuna XGB results with different \(\beta \) values as the maximization function on C-F 2 and 5
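A simple way to reproduce this sweep, reusing the hypothetical `tune_hyperparameters` and `evaluate` helpers sketched earlier, is shown below; the exact grid of \(\beta \) values is an assumption.

```python
# Sketch of the beta sweep in experiment #4: retune and re-evaluate for each beta value.
from xgboost import XGBClassifier

def sweep_beta(X_train, y_train, X_test, y_test, betas=(1.0, 1.2, 1.4, 1.5, 1.8, 2.0)):
    results = {}
    for beta in betas:
        params, _ = tune_hyperparameters(X_train, y_train, beta=beta)
        model = XGBClassifier(
            booster=params["booster"],
            reg_alpha=params["alpha"],
            reg_lambda=params["lambda"],
            colsample_bytree=params["colsample_bytree"],
            subsample=params["subsample"],
            n_jobs=-1,
            random_state=42,
        ).fit(X_train, y_train)
        results[beta] = evaluate(y_test, model.predict(X_test), beta=beta)
    return results  # per-beta recall, F-beta, and reduced false positives
```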

4.1.5 Experiment #5

In this experiment, we applied other classifiers, namely the support vector machine (SVM), random forest (RF), K-nearest neighbor (KNN), Naive Bayes (NB), and decision tree (DT) classifiers, with default parameters. These classifiers were tested to find the best overall classifier for our problem. Table 6 illustrates the performance of the five approaches; the best F-beta results were obtained by KNN, with an F-beta of 0.82, a recall of 0.82, and a false positive reduction of 0.90, and by NB, with an F-beta of 0.83, a recall of 0.81, and a false positive reduction of 0.95.

Table 6 SVM, KNN, DT, RF and NB classifiers
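An illustrative way to run this comparison with default parameters, using the hypothetical `evaluate` helper from Sect. 3.6, is sketched below; Gaussian Naive Bayes is assumed for the NB variant.

```python
# Sketch of experiment #5: baseline classifiers with default parameters,
# compared with the same metrics used for ASXAML.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def compare_baselines(X_train, y_train, X_test, y_test, beta: float = 1.5):
    baselines = {
        "SVM": SVC(),
        "RF": RandomForestClassifier(random_state=42),
        "KNN": KNeighborsClassifier(),
        "NB": GaussianNB(),
        "DT": DecisionTreeClassifier(random_state=42),
    }
    return {name: evaluate(y_test, clf.fit(X_train, y_train).predict(X_test), beta=beta)
            for name, clf in baselines.items()}
```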

The results from experiment #1 are compared to all employed classifiers in Fig. 12, using five cross-fold validation with \(\beta \) equal to 1.5. It is evident that, while most of the models achieve greater false positive reduction than the proposed model, they fall short in recall and F-beta. As a result, even though the reduction achieved by these approaches is very good (0.95 for NB, for example), their recall is rather poor (0.81). On the one hand, this means that the NB model mistakenly closed 366 money laundering customers out of the 1926 customers in the test data. On the other hand, while still achieving an adequate amount of false positive reduction, the ASXAML framework closes fewer money laundering customers by mistake (231), and when \(\beta \) is increased to 2 this number drops to 149, achieving a higher recall value. This is because the other algorithms only aim to reduce false positives, which makes them miss the crucial money laundering events and close them by mistake, putting the bank at a greater risk of fines.

Fig. 12 Comparison of different classifier models in terms of F-beta, recall, and reduced false positives

The overall results show that the proposed ASXAML framework performs better than the other classifiers. There is a gap of at least 0.05 in recall between ASXAML and the other classifiers, which affects the number of important customers that slip away and are closed by mistake. In this problem, it is important to close false positives without losing too many important customers, and it is clear that the proposed framework provides this balance without jeopardizing either goal. Our framework's dynamic nature relies on three main components, RFECV, Optuna, and the XGBoost model, making it adaptable for use with different datasets. The adaptability of the system is underscored by its reliance solely on rule-based scenarios and the requisite involvement of investigators, a characteristic common to all financial institutions equipped with AML systems. This renders the system applicable across diverse datasets and a wide spectrum of financial institutions.

5 Conclusion

In conclusion, we have introduced a novel framework for enhancing anti-money laundering (AML) systems built on rule-based models. Our framework involves extracting features from alerts generated by AML scenarios created by experts. These alerts are grouped by customer and fed into the RFECV feature selection method, which selects the best features before passing them to Optuna for hyperparameter tuning of the XGBoost model. The resulting classifier uses the best parameters generated by Optuna to classify customers as either important or not, striking a balance between reducing false positives and not missing important customers.

Our experimental results using five cross-fold validation with \(\beta \) equal to 1.5 demonstrate that our framework outperforms most of the other classifiers, achieving a higher recall value and F-beta score. While other approaches, such as Naive Bayes, achieved a high false positive reduction (0.95), their recall was poor (0.81), leading to the incorrect exclusion of 366 money laundering customers out of 1926 in the test data. In contrast, the ASXAML framework, even with a lower false positive reduction, missed only 231 money laundering customers, and increasing \(\beta \) to 2 further reduced the number to 149, demonstrating the framework's ability to detect crucial money laundering events while minimizing the risk of fines for the bank.

It is imperative to acknowledge potential drawbacks in the application of this methodology. Firstly, human investigators must abstain from participating in fictitious actions. Secondly, the occurrence of true positive cases is significantly less frequent when contrasted with the prevalence of false positives. Additionally, the necessity for historical actions implies that the proposed solution necessitates a minimum duration of coexistence with a rule-based model. Lastly, the quality of data is contingent upon the rigor and relevance of the predefined scenarios.

Future work will involve applying this framework to other features, such as customer profiling features, and utilizing clustering methods to identify the most important customer groups.