1 Overview of Existing Financial Fraud Detection Systems and Their Limitations

Financial institutions deliver several kinds of critical services, usually managing high-volume transactions: among others, card payments, online banking, and transactions enabled by PSD2 and generated through open banking APIs [1]. Each of these services is targeted by specific frauds that aim to make illegal profits through unauthorized access to someone’s funds. The opportunity to make conspicuous, easy earnings, often while remaining anonymous – and therefore unpunished – incentivizes the continuous development of novel and creative fraud schemes tailored to specific services, forcing fraud analysts into a never-ending “cat-and-mouse” game, laboring to detect new kinds of fraud before valuable assets are compromised or the damage becomes significant.

The situation has undoubtedly been complicated by the considerable scale of the problem and by its constant growth, driven by the exponential increase in e-commerce payments recorded in recent years [2] and by European policies supporting the digital transformation and the “Digital Single Market” strategy [3]. Over the years, academia and the industry players of the cybersecurity market have seen an opportunity in this context: the former to create innovative solutions to the problem, the latter to build businesses around technological solutions that support analysts and automate detection processes. Rule-based expert systems (RBESSs) are a very simple implementation of artificial intelligence (AI), which leverages rules to encode, and therefore represent, knowledge from a specific area in the form of an automated system [4].

RBESSs try to mimic the reasoning process of a human expert in a subject matter when solving a knowledge-intensive problem. Systems of this kind consist of a set of decision rules (if-then-else) that are interpreted and applied to specific features extracted from datasets. A wide range of problems can be managed with these very simple models, and many commercial products/services have been built on RBESSs to detect fraud in suspicious banking, financial, and e-commerce transactions [5], by calculating a risk score based on user behaviors such as repeated log-in attempts or “too-quick-to-be-human” operations, unusual foreign or domestic transactions, operations that are unusual given the user’s transaction history (abnormal amounts of money involved), and unlikely execution days/times (e.g., weekends or 3 am) [6].

Based on the risk score, the rules deliver a final decision on each analyzed transaction: blocking it, accepting it, or putting it on hold for an analyst’s review. The rules can easily be updated over time, and new rules can be added to address emerging threats. Nevertheless, as the number of fraud detection rules grows and the more rules are combined to detect complex fraud cases, the more likely the rules are to conflict with each other through semantic inconsistencies. When this happens, the rule-based system performs inefficiently [7], for example, by automatically accepting fraudulent transactions (the “infamous” false negatives) or by blocking innocuous ones (false positives). Furthermore, the decision process led by anti-fraud analysts can be severely affected by an increasingly large and rapidly changing set of rules, which can compromise their inference capabilities during investigations.

The standard practice in the cybersecurity industry has long been to block potentially fraudulent traffic by adopting a set of rigid rules. A rule-based fraud detection engine aims at identifying only high-profile fraudulent patterns. This method is rather effective in mitigating fraud risks and in giving clients a sense of protection by discovering well-known fraud patterns. Nevertheless, rule-based fraud detection solutions have demonstrated in the field that they cannot keep pace with the increasingly sophisticated techniques adopted by fraudsters to compromise valuable assets: malicious actors easily reverse-engineer a preset of fixed thresholds, and fixed rules are of no help in detecting emerging threats, nor do they adapt to previously unknown fraud schemes. These drawbacks should be given adequate consideration, especially in light of the operational costs (OPEX) caused by erroneous evaluations of fraud detection engines (false positives and false negatives).

Example Rule 1: Block Transaction by IP Location

Input: u, user; tu, user’s transaction under analysis; tu[], user’s transactions’ collection

Output: ts, transaction status is blocked, await or pass

1: if tu.IPAddress.GeoLocate.Country is not in tu[].IPAddress.GeoLocate.Country and tu.amount > €200 then

2:    ts = await   // transaction is temporarily stopped by rule

3: else ts = pass   // transaction is considered not dangerous

Example Rule 2: Block Frequent Low-Amount Transaction

Input: u, user; tu, user’s transaction under analysis; tu[], user’s transactions’ collection

Output: ts, transaction status is blocked, await or pass

1: if tu.amount < €5 then

2:  for i = 10 to 1 do

3:   if tu[i].amount < €5 then // check for existing recent similar transactions

4:    ts = await // transaction is temporarily stopped by rule

5:    exitfor

6:   else ts = pass  // transaction is considered not dangerous

7:  endfor
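The two rules above can also be expressed in executable form. Below is a minimal Python sketch; the `Transaction` fields (`amount`, `ip_country`) and function names are illustrative assumptions, while the €200/€5 thresholds and the ten-transaction window follow the pseudocode:

```python
from dataclasses import dataclass

# Illustrative types; field names are assumptions, not a real fraud API.
@dataclass
class Transaction:
    amount: float     # transaction amount in EUR
    ip_country: str   # country geolocated from the transaction's IP address

def rule_ip_location(tx, history, threshold=200.0):
    """Rule 1: hold a transaction above the threshold whose IP country
    never appears in the user's transaction history."""
    known = {t.ip_country for t in history}
    if tx.ip_country not in known and tx.amount > threshold:
        return "await"  # transaction is temporarily stopped by the rule
    return "pass"       # transaction is considered not dangerous

def rule_frequent_low_amount(tx, history, limit=5.0, window=10):
    """Rule 2: hold a low-amount transaction when a similar one appears
    among the user's last `window` transactions."""
    if tx.amount < limit and any(t.amount < limit for t in history[-window:]):
        return "await"
    return "pass"
```

Note how easily such fixed thresholds could be reverse-engineered by a fraudster, which is exactly the weakness discussed above.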

Another strong limitation of rule-based detection engines is the potential lack of data to analyze: the more innovative the fraud scheme, the fewer data you will find in analyzed transactions. This lack of data could simply mean that the necessary information is not collected and stored, or that the information is present but the necessary details are missing, or that the information cannot be correlated with other information. As fraud rates intensify, so does their complexity: complementary methodologies for fraud detection need to be implemented.

It is well known that the fraud phenomenon is endemic and can never be totally eradicated, only mitigated more or less effectively. In this framework, wide areas for improvement still exist. One need that has been only partially addressed is, quite simply, to improve the fraud detection rate; to decrease the number of false positives, since analyzing them requires considerable resources; and at the same time to reduce the number of false negatives, which impact organizations negatively first through the costs of the undetected frauds, but also by generating a misleading sense of security in users: if the system does not detect frauds, this does not mean there are none. Improving the efficiency and effectiveness of a fraud detection system makes it possible to implement mechanisms for automated or semi-automated decisions, ensuring high-impact business results and a significant reduction of CAPEX and OPEX.

Over the years, several approaches have been developed and tested to improve the effectiveness of rule-based detection methodologies, but current trends suggest that promising results can be obtained by adopting analytics at scale, based on an agile data foundation and solid machine learning (ML) technologies.

2 On Artificial Intelligence-Based Fraud Detection

Artificial intelligence can address many of these limitations and identify risky transactions more effectively. Machine learning methods improve as the dataset to which they are fitted grows: the more fraudulent operation samples they are trained on, the better they recognize fraud. This principle does not apply to rule-based systems, as they never evolve by learning. Furthermore, a data science team should be aware of the risks associated with rapid model scaling: if the model misses a fraud and labels it incorrectly, this will lead to false negatives in the future. With the machine learning approach, machines can take on the routine tasks and repetitive work of manual fraud analysis, while specialists can spend time making higher-level decisions.

From a business perspective, a more efficient fraud detection system based on machine learning would reduce costs through efficiencies generated by higher automation, reduced error rates, and better resource usage. In addition, the finance/insurance stakeholders could address new types of frauds, minimizing disruptions for legitimate customers and therefore increasing client trust and security.

Over the past decade, intense research on machine learning for credit card fraud detection has resulted in the development of supervised and unsupervised techniques [8, 9].

Supervised techniques are based on a set of past operations for which the label (also called the outcome or class) of the transaction is known. In credit card fraud detection problems, the label is “trusted” (the transaction was made by the cardholder) or “fraudulent” (the transaction was made by a scammer). The label is usually assigned a posteriori, either following a customer complaint or after thorough investigations of suspect transactions. Supervised techniques use the labeled past transactions to train a fraud prediction model, which, for each new analyzed transaction, returns the likelihood that it is fraudulent.
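As a minimal sketch of this supervised setting (assuming scikit-learn, a random forest classifier, and toy two-feature transactions; none of these choices is prescribed by the text above):

```python
# Hypothetical example: train on labeled past transactions, then return
# the likelihood that a new transaction is fraudulent.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy labeled history: features are [amount_eur, hour_of_day];
# label 1 = "fraudulent", 0 = "trusted" (assigned a posteriori).
X_train = np.array([[20, 14], [35, 10], [15, 16],
                    [900, 3], [750, 2], [980, 4]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For each new analyzed transaction, the model returns a fraud likelihood.
new_tx = np.array([[820, 3]])
fraud_probability = model.predict_proba(new_tx)[0, 1]
```

In practice the feature set is far richer (merchant, geolocation, transaction history aggregates), but the workflow is the same: labeled past transactions in, a fraud likelihood per new transaction out.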

Unsupervised outlier detection techniques do not use the transaction label and aim to characterize the data distribution of transactions. These techniques work on the assumption that outliers from the transaction distribution are fraudulent; they can therefore be used to detect types of fraud that have never been seen before, because their reasoning is not based on transactions observed in the past and labeled as fraudulent. It is worth noting that their use also extends to clustering and compression algorithms [10]. Clustering allows for the identification of separate data distributions for which different predictive models should be used, while compression reduces the dimensionality of the learning problem. Several works have adopted one or the other of these techniques to address specific issues in fraud detection, such as class imbalance and concept drift [11,12,13].
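To illustrate the unsupervised side, here is a small sketch using scikit-learn’s `IsolationForest` as one possible outlier detector (the chapter does not prescribe a specific algorithm; the data and contamination rate are illustrative):

```python
# Unlabeled transactions: mostly ordinary daytime payments ([amount, hour])
# plus one large night-time payment that falls outside the distribution.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[25, 12], [30, 11], [22, 14], [28, 13], [26, 15],
              [27, 12], [24, 13], [29, 14], [23, 11], [5000, 3]])

# No labels are used: the detector characterizes the data distribution
# and flags points far from it as candidate frauds.
detector = IsolationForest(contamination=0.1, random_state=0)
labels = detector.fit_predict(X)   # -1 = outlier, 1 = inlier
outliers = X[labels == -1]
```

Because no labeled fraud history is required, the same mechanism can flag fraud schemes that were never observed before, which is precisely its complementary value next to supervised models.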

Both approaches are needed and should work together: supervised techniques learn from past fraudulent behavior, while unsupervised techniques aim at detecting new types of fraud. By choosing an optimal combination of supervised and unsupervised artificial intelligence techniques, it is possible to detect previously unseen forms of suspicious behavior while quickly recognizing the subtler patterns of fraud that have previously been observed across huge numbers of accounts. In the literature, there are already works showing the combination of unsupervised and supervised learning as a solution for fraud detection in financial transactions [14,15,16].

But this is not enough: it is well known that the more effective machine learning techniques are, the greater the required volume and variety of data. Fraud models trained on big data are more accurate than models that rely on a relatively thin dataset. To process big data volumes and properly train AI models while transaction data streams are inspected for fraud attempts, open-source solutions based on cloud computing and big data frameworks and technologies can come to our aid.

In the next section, we present an innovative microservice-based system that leverages ML techniques and is designed and developed on top of the most cutting-edge open-source big data technologies and frameworks.

It can significantly improve the detection rate of fraud attempts while they are occurring, through real-time ML detection on the financial transactions and continuous batch ML retraining. It integrates two layers, stream and batch, to handle periodic ML retraining while real-time ML fraud detection occurs.

3 A Novel AI-Based Fraud Detection System

To overcome the aforementioned limits, this chapter describes the design, development, and deployment of an integrated batch and stream system that handles automated retraining while real-time ML prediction occurs and combines supervised and unsupervised AI models for real-time big data processing and fraud detection.

More specifically, it addresses the challenge of financial crime and fraud detection in the scope of the European Union-funded INFINITECH project, under Grant Agreement No. 856632.

To this end, the Alida solution is adopted (https://home.alidalab.it/). Alida is an innovative data science and machine learning (DSML) platform for rapid big data analytics (BDA) application prototyping and deployment.

DSML solutions are entering a phase of greater industrialization. Organizations are starting to understand that they need to add more agility and resilience to their ML pipelines and production models. DSML technologies can help improve many of the processes involved and increase operational efficiency. But to do that, they must be endowed with automation capabilities (automation enables better dissemination of best practices, reusability of ML artifacts, and enhanced productivity of data science teams), support for fast prototyping of AI and big data analytics applications, and operationalization of ML models to accelerate the passage from proof of concept (PoC) to production. With the Alida solution, the R&D lab of Engineering aims to respond promptly to these needs in the field of data science and ML, as well as the operationalization of data analytics.

Through Alida, users can design their stream/batch workflows by choosing the BDA services from the catalog, which big data set to process, run, and monitor the execution. The resulting BDA applications can be deployed and installed in another target infrastructure with the support of a package manager that simplifies the deployment within the target cluster. Alida is designed and developed on top of the most cutting-edge open-source big data technologies and frameworks. Being cloud-native, it can scale computing and storage resources, thanks to a pipeline orchestration engine that leverages the capabilities of Kubernetes (https://kubernetes.io/) for cloud resource and container management.

Fig. 15.1
figure 1

Alida platform

Alida (Fig. 15.1) provides a web-based graphical user interface (GUI) for both stream and batch BDA workflow design. Through it, users can design a workflow and directly run and monitor its execution, or schedule it. In addition, the execution, scheduling, and monitoring functionalities are also available through the Alida APIs. Alida provides an extensible catalog of services (the building blocks of BDA workflows) that covers all phases, from ingestion to preparation, analysis, and data publishing. The catalog can be used both through a web interface and through specific APIs, which allow the registration of new BDA services. In summary, Alida is a very useful and straightforward tool for rapid BDA application prototyping: users design their own stream/batch workflows by choosing BDA services from the catalog, choose which big data sets to process, and run and monitor the execution. The successful execution of a workflow results in the creation of a BDA application that can be deployed and installed in another target infrastructure with the support of package managers that simplify deployment activities within the cluster. These applications offer appropriate APIs to access the services they provide.

Thanks to Alida, it was possible to design, execute, and deploy a bundle of two workflows according to the architecture shown in Fig. 15.2.

Fig. 15.2
figure 2

AI-based fraud detection system architecture

In the batch layer, the AI-based fraud detection system provides:

  • A preprocessing step on transaction data, where datasets on several kinds of transactions are properly filtered and joined to obtain a single unlabeled dataset

  • Clustering of such unlabeled data to create labeled samples

  • Random forest model training, periodically repeated to retrain the model with new data (it is worth noting that the analyst’s feedback also contributes to the training)

  • Feeding of the supervised model retraining by means of the fraud analyst’s feedback
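The batch-layer steps above can be sketched schematically as follows, assuming scikit-learn’s `KMeans` and `RandomForestClassifier`, with a smallest-cluster heuristic standing in for the unspecified cluster-to-label rule (all of these choices are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def batch_retrain(transactions, analyst_feedback=None):
    """transactions: (n, d) array of preprocessed, unlabeled features.
    analyst_feedback: optional (X, y) pair of analyst-confirmed cases."""
    # 1. Cluster the unlabeled data; as a stand-in heuristic, treat the
    #    smallest cluster as the suspicious one (label 1).
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(transactions)
    sizes = np.bincount(km.labels_)
    y = (km.labels_ == np.argmin(sizes)).astype(int)
    X = transactions

    # 2. The analyst's feedback from the stream layer also feeds training.
    if analyst_feedback is not None:
        fx, fy = analyst_feedback
        X = np.vstack([X, fx])
        y = np.concatenate([y, fy])

    # 3. (Re)train the random forest on the combined labeled data.
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Periodically rerunning `batch_retrain` on fresh transactions plus accumulated analyst feedback captures the retraining loop described above.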

In the stream layer, real-time fraud detection is handled by the random forest algorithm on the basis of new input transaction data: it analyzes all the transactions and reports the suspicious ones as a resulting data stream ready to be visualized. The fraud attempts are investigated by the analyst, who marks them as “suspicious” upon confirming that the transaction was fraudulent; such feedback feeds the retraining within the batch layer. In this way, the fraud detection performance improves over time.
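The stream-layer scoring loop can be summarized with a small sketch; the function name, threshold, and feature layout are illustrative assumptions, with `model` being any trained classifier exposing scikit-learn’s `predict_proba`:

```python
# Illustrative stream layer: score each incoming transaction with the
# current model and emit the suspicious ones for the analyst to review.
def detect_stream(model, transaction_stream, threshold=0.5):
    """Yield (features, fraud_probability) for transactions the model
    considers suspicious. Analyst-confirmed cases would then be fed back
    to the batch layer for the next retraining cycle."""
    for features in transaction_stream:
        p = model.predict_proba([features])[0, 1]
        if p >= threshold:
            yield features, p
```

In the actual system this loop runs over a real-time transaction stream rather than an in-memory list, but the contract is the same: transactions in, flagged suspects out.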

4 Real-Time Cybersecurity Analytics on Financial Transactions’ Data Pilot in INFINITECH

In order to test the overall system end to end, the real-time transaction generator (Fig. 15.2) produces synthetic transactions, emulating the behavior of traditional financial transaction software. The objective is to obtain a stream of data that feeds the fraud detection system, whose effectiveness is demonstrated by the cybersecurity and fraud detection pilot of INFINITECH.

Moreover, a realistic synthetic dataset was created that is consistent with the real data present in the data operations environment. The synthetic dataset is provided by Poste Italiane, starting from randomly generated personal data with no correlation with real people.

It contains about one million records, including fraudulent transactions (SEPA bank transfer transactions), associated with ten thousand users and occurring over a nine-month period (from 1 January 2020 to 30 September 2020).

Nevertheless, the invocation of a pseudo-anonymization tool is necessary to make the execution of the abovementioned pilot more realistic and compliant with data protection regulations; thus, transaction data will be pseudo-anonymized before leaving the banking system by using a tool available through the INFINITECH platform.

Finally, a visualization tool addresses the advanced visualization requirements by offering fraud analysts a variety of visualization formats, spanning from simple static charts to interactive charts with several layers of information and customization. It consists of a set of functionalities supporting the execution of the visualization process: dataset selection, dataset preview generation, visualization-type selection, visualization configuration, visualization generation, and an interactive dashboard. The visualization tool is a project hosted in the presentation group of the official INFINITECH GitLab repository; it will be integrated with other components adopted by the pilot in a dedicated environment, and its development process is backed by a CI/CD pipeline (Fig. 15.3) as per the INFINITECH blueprint reference.

Fig. 15.3
figure 3

AI-based fraud detection system deployment scheme

5 Conclusions

This chapter addresses one of the major challenges in the financial sector, real-time cybersecurity analytics on financial transactions’ data, presenting an innovative way to integrate supervised and unsupervised AI models by exploiting appropriate technological tools able to process large amounts of transaction data.

As previously stated, a supervised model is a model trained on a rich set of properly “tagged” transactions. This happens by ingesting massive amounts of tagged transaction details in order to learn patterns that best reflect legitimate behaviors. When developing a supervised model, the amount of clean, relevant training data is directly correlated with model accuracy. Unsupervised models are designed to spot anomalous behavior. In these cases, a form of self-learning is adopted to surface patterns in the data that are invisible to other forms of analytics.

A lambda architecture was designed where both real-time and batch analytics workflows are handled in an integrated fashion. Such architecture is designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. This approach is designed to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data while simultaneously using real-time stream processing to provide views of online data. In our case, data preprocessing and model training are handled in the batch layer, while the real-time fraud detection is handled on the basis of new input transaction data within the speed layer. The solution presented in this chapter aims at supporting in an innovative and effective way fraud analysts and automating the fraud detection processes.