1 Introduction

Activities in the anti-money laundering and counter-terrorist financing (AML/CTF) sphere are an important part of maintaining a stable financial and economic sector. As an autonomous national regulatory authority, the Bank of Slovenia (BOS) supervises the compliance of individual banks, conducts inspections and serves as a consulting body when regulations are drafted [1]. As the world becomes ever more interconnected and geographical borders blur, high-volume money transfers between countries and entities are more ubiquitous than ever. With financial transfer systems that are interconnected and easy to use across the world and within the EU, efficient and timely supervision is needed more than ever.

With malicious entities moving faster and outpacing current detection and prevention mechanisms, it is becoming clear that in the digital age of data expansion, manual, human-based detection methods cannot cope with the sheer number of transactions and the volume of additional data that needs to be processed. This is exemplified by not-so-recent events [2] that clearly show that additional data and automatic early warning systems for risky transaction patterns should be put in place both at the level of banks and at the level of the supervisory authority. Such a system benefits supervisory authorities by enabling broader and more efficient supervision, and individual banks by providing a tool for early discovery and reporting of problematic scenarios.

Artificial intelligence and machine learning methods are naturally suited to the problems facing supervisory authorities and banks. The screening tool developed as part of the Infinitech project augments existing tools and provides new information. Its main goal is to process and analyse transaction data from different sources, enrich it with additional external information specific to anti-money laundering and counter-terrorist financing needs, and efficiently combine data of different levels of granularity. The screening tool aims to recognize unusual transaction patterns and relationships in the combined data that indicate typologies and risks of money laundering (ML) or terrorist financing (TF) at the level of specific financial institutions. Results of automated screening and flagged patterns are automatically presented to domain experts for further exploration and consideration. Automation increases the number of transactions that can be analysed, removes possible human mistakes and enables processing and ingestion of data from additional sources that would be untraceable for a human analyst. Patterns reported by the screening tool are weighted according to their relative riskiness and put into context, where domain experts can further explore and analyse them in the enriched context before making a more informed decision.

The screening tool pipeline is presented in Fig. 13.1. The pipeline is composed of several individual and mostly independent components neatly wrapped in containers following the Infinitech way of implementation. This way, the whole tool suite can easily be deployed on-site and enable flexibility and seamless integration. Following the Infinitech philosophy, additional tools and detectors can be easily included and integrated with the screening tool suite without bespoke modifications. The tool suite is built for performance with huge volumes of data and can be used in almost real time while also providing a batch mode for the exploration of historical data and scenario modelling.

Fig. 13.1 Screening tool pipeline overview

2 Related Work

The use of machine learning on graph datasets has seen enormous progress in the last few years and has been actively applied to many problems previously thought to be hard. Transaction data is naturally modelled as a graph structure (nodes as actors, transactions as edges, with additional nodes and attributes for metadata). Both supervised and unsupervised learning approaches have been proposed for the task of detecting suspicious patterns in financial transaction data.

2.1 Anti-money Laundering with Well-Defined Explanatory Variables

In [3] the authors propose unsupervised methods that can be used on high-dimensional client profiles. A visualization method based on dimensionality reduction is proposed: the high-dimensional feature vectors describing customers are projected into 2D space using a procedure such as PCA. This makes it possible for supervision personnel to visually identify outlier groups (clusters) of potentially risky subjects. Additionally, a peeling algorithm is proposed for anomaly detection: in each step, the most extreme outlier according to a chosen distance function (the authors propose the Mahalanobis distance) is removed and marked as an anomaly, and the process is repeated for any number of steps.
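As an illustration of the peeling idea, the sketch below (plain NumPy, with a synthetic data matrix standing in for real client profiles) repeatedly removes the point with the largest Mahalanobis distance from the remaining data.

```python
# A minimal sketch of the peeling algorithm described in [3]: repeatedly
# remove the most extreme outlier according to the Mahalanobis distance.
# The data matrix X is a hypothetical stand-in for client profiles.
import numpy as np

def mahalanobis_peeling(X, n_steps=10):
    """Return indices of the n_steps most extreme points, peeled one by one."""
    remaining = np.arange(len(X))
    anomalies = []
    for _ in range(n_steps):
        data = X[remaining]
        mean = data.mean(axis=0)
        # Pseudo-inverse guards against a singular covariance matrix.
        cov_inv = np.linalg.pinv(np.cov(data, rowvar=False))
        diff = data - mean
        dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
        worst = np.argmax(dist)                  # most extreme remaining point
        anomalies.append(remaining[worst])
        remaining = np.delete(remaining, worst)  # "peel" it off and repeat
    return anomalies

# Example: 500 synthetic 8-dimensional client profiles
X = np.random.default_rng(0).normal(size=(500, 8))
print(mahalanobis_peeling(X, n_steps=5))
```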

Such methods can be used for any type of anomaly detection problem in n-dimensional Euclidean space. It is, however, crucial to apply them according to the end goal of identifying suspicious transactions. It is important to use features that expose activities that are known to be common among money laundering groups.

A client profiling approach based on k-means clustering of customer profiles was described in [4]. The explanatory variables (feature vectors) were constructed through interactions with domain experts. The optimal number of clusters was estimated using the silhouette coefficient and the sum of squared errors. After clustering, classification rules were generated and tested with multiple rule-generation algorithms, and their relevance for targeting high-risk customers was assessed manually by domain experts. The authors show that it is essential to include features relevant to the given problem, such as account type, account age and transaction volume. The end result of such goal-driven development is a set of distinct clusters that are named according to the goal (e.g. “Group of risk”, “Standard customer”) and can be clearly described using a set of rules.
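A minimal sketch of this kind of profiling, assuming scikit-learn and purely synthetic profile features (the feature names in the comment are illustrative only), could look as follows.

```python
# Cluster hand-crafted client feature vectors with k-means and estimate the
# number of clusters with the silhouette coefficient, in the spirit of [4].
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Hypothetical profile features: account age (days), monthly volume, #transactions
profiles = rng.normal(size=(1000, 3))
X = StandardScaler().fit_transform(profiles)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```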

Additionally, Mastercard’s Brighterion claims to be using unsupervised learning for their AML products [5].

In [6] a supervised learning technique that operates directly on transactions (rather than entities) is presented. A gradient boosting model is trained to predict suspicious transactions based on past Suspicious Activity Reports (SARs). The feature vectors used in training incorporate information both about the entity as a whole and about the individual transaction, such as an indication of any previous bankruptcies related to the entity, the company sector type, the activity level and the amount of transactions in the last 2 months grouped by transaction type. The authors show the model is accurate and efficient, outperforming the bank’s rule-based approach.
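A hedged sketch of such a transaction-level supervised setup is shown below; the CSV file, column names and label are hypothetical stand-ins, and categorical features are assumed to be numerically encoded already.

```python
# A minimal sketch of a supervised setup in the spirit of [6]: gradient
# boosting on per-transaction feature vectors with past SAR labels.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("transactions_labelled.csv")   # hypothetical export
features = ["amount", "entity_activity_level", "entity_had_bankruptcy",
            "sector_code", "tx_count_last_2_months"]   # assumed numeric columns
X, y = df[features], df["reported_in_sar"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```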

2.2 Machine Learning on Graph Structured Data

While manually engineered client feature vectors offer greater model explainability, there may still be information hidden in transaction networks that is not captured this way. Recent progress in graph machine learning has made it possible to capture deeper structural information about node neighbourhoods.

Nodes in a graph can roughly be described in two distinct ways: what communities they belong to (homophily) or what roles they have in the network (structural equivalence).

The DeepWalk algorithm [7] generates node representations in continuous vector space by simulating random walks on the network and treating these walks as sentences. The generated embeddings can then be used for any prediction or classification task. The random walk idea has been further extended by node2vec [8]. DeepWalk and node2vec have been successfully used in domains such as biomedical networks [9] and recommendation systems [10].
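The following compact DeepWalk-style sketch (uniform random walks fed to gensim's Word2Vec; the karate-club graph stands in for a transaction graph) illustrates the idea; it omits the biased walks that distinguish node2vec.

```python
# A DeepWalk-style sketch in the spirit of [7, 8]: simulate random walks on a
# graph and treat them as "sentences" for Word2Vec.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()   # stand-in for a transaction graph

def random_walks(graph, num_walks=10, walk_length=20, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(graph.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])   # Word2Vec expects tokens
    return walks

model = Word2Vec(random_walks(G), vector_size=64, window=5,
                 min_count=0, sg=1, workers=1, epochs=5)
embedding = model.wv[str(0)]   # 64-dimensional vector for node 0
```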

Another algorithm, struc2vec, proposed in [11], works in a similar manner to DeepWalk, except that it performs random walks on a modified version of the original network that better encodes structural similarities between nodes. Embeddings generated by struc2vec may perform better on structural equivalence-based tasks than node2vec. Struc2vec embeddings have been shown to generally perform better than node2vec or DeepWalk on link prediction tasks in biomedical networks [9].

DeepWalk, node2vec and struc2vec are inherently transductive, i.e. require the whole graph to be able to learn embeddings of individual nodes. This might present a challenge for real-world applications on large evolving networks. Therefore, other representation learning techniques have been proposed recently that are inductive, i.e. can be trained only on parts of the network and can be directly applied to new, previously unseen networks and nodes.

Graph Convolutional Networks (GCNs) [12] are models that leverage node features to capture dependence between nodes via message passing. The original GCN idea is further extended to inductive representation learning by Hamilton et al. [13].
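A single GCN message-passing step can be written in a few lines of NumPy; the sketch below implements the symmetric normalization H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W) from [12] on a toy graph and is purely illustrative.

```python
# One GCN message-passing layer: aggregate normalized neighbour features.
import numpy as np

def gcn_layer(A, H, W):
    """H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)       # ReLU activation

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # toy adjacency
H = np.random.default_rng(0).normal(size=(3, 4))              # node features
W = np.random.default_rng(1).normal(size=(4, 2))              # learnable weights
print(gcn_layer(A, H, W))
```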

A (Variational) Graph Autoencoder—(V)GAE—model has been proposed by [14] to generate node representations. The authors demonstrate that an autoencoder using a GCN encoder and a simple scalar product decoder generates meaningful node representations of the Cora citation dataset. It is, however, unclear how such a model would perform on graphs where structural equivalence plays a more significant role in descriptions of nodes.

A temporal variation of GCN called T-GCN has been proposed [15]. The authors propose a model that first aggregates spatial information with a GCN layer separately at each time step and then connects time steps with Gated Recurrent Units (GRUs) to finally yield predictions. Such a model outperforms other state-of-the-art techniques (incl. SVR and GRU) on a traffic prediction task on real-world urban traffic datasets, aggregating both spatial and temporal information. Additionally, a temporal graph attention (TGAT) layer has been recently proposed as an alternative for inductive representation learning on dynamic graphs [16].

2.3 Towards Machine Learning on Graph Structured Data for AML

Transaction networks, on which money laundering detection can be performed, usually contain only a small fraction of known suspicious subjects (if any). Such class imbalances need to be dealt with in order to produce robust supervised models.

In [17] the proposed method for detecting money laundering patterns is based on node representation learning on graphs. A transaction graph is constructed from real-world transaction data from a financial institution; the undirected, unweighted graph covers a fixed time period, and a small number of subjects is known to be suspicious with regard to money laundering. Accounts outside the bank’s country are aggregated by country. Node representations are learned using DeepWalk and are then classified with three binary classifiers: a Support Vector Machine (SVM), a Multi-Layer Perceptron (MLP) and Naive Bayes. The extreme class imbalance is overcome using the widely used SMOTE algorithm for synthetic oversampling of the minority class; another strategy tested is undersampling of the majority class and random duplication of the minority class. The model is then evaluated on a part of the ground truth (not oversampled) entities not included in the training. The best results are achieved using the MLP and random duplication (although the differences between results may be insignificant), while SMOTE is suggested to produce slightly more stable models.
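A minimal sketch of this imbalance-handling setup, assuming the imbalanced-learn and scikit-learn APIs and random placeholder embeddings instead of real DeepWalk vectors, might look like this.

```python
# Oversample the minority (suspicious) class with SMOTE on the training set,
# then train an MLP on node embeddings, in the spirit of [17].
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64))            # placeholder DeepWalk-style embeddings
y = (rng.random(5000) < 0.01).astype(int)  # ~1% "suspicious" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # oversample train only

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```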

An adaptation of SMOTE for graphs, named GraphSMOTE, is proposed in [18]. GraphSMOTE generates synthetic nodes of the minority class (and not just synthetic embeddings as described in [17]) that are inserted into the graph, on which training can then be performed directly. The authors show that variants of GraphSMOTE outperform traditional training techniques for imbalanced datasets including weighted loss and variations of SMOTE and generalize well across different imbalance ratios.

In [19] the authors experiment with predicting suspicious entities in four separate directed networks published in the ICIJ Offshore Leaks Database. Some nodes are marked as blacklisted by matching against international sanction lists. The datasets are highly imbalanced, containing less than 0.05% blacklisted nodes. In the first part, embedding algorithms are used in a similar way as in [17]; in contrast to oversampling, a one-class SVM (O-SVM) is used to predict suspicious entities. The model is then evaluated on a proportion of the ground truth data only. Struc2vec mostly outperforms node2vec in terms of the AUC score, although the difference varies significantly across the four datasets. Additionally, node centrality measures (PageRank, Eigenvector Centrality, Local Clustering Coefficient and Degree Centrality) are used as features describing the importance of nodes in the networks. Among these, PageRank alone mostly performs best, outperforming struc2vec as well, which highlights the role of node centrality in such tasks. All experiments were conducted on undirected, directed and reversely directed versions of the graphs; the best results were generally achieved using reversed networks, although the results vary significantly across datasets.
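The centrality-plus-one-class-SVM setup can be sketched as follows with networkx and scikit-learn; the random directed graph and the nu parameter are illustrative assumptions.

```python
# Compute node centrality features on a directed graph and fit a one-class SVM
# to flag the most "unusual" nodes, loosely following the setup in [19].
import numpy as np
import networkx as nx
from sklearn.svm import OneClassSVM

G = nx.gnp_random_graph(500, 0.01, seed=0, directed=True)   # toy directed graph
nodes = list(G.nodes())

pr = nx.pagerank(G)
ec = nx.eigenvector_centrality_numpy(G)
cl = nx.clustering(G.to_undirected())
deg = dict(G.degree())

features = np.column_stack([
    [pr[n] for n in nodes],
    [ec[n] for n in nodes],
    [cl[n] for n in nodes],
    [deg[n] for n in nodes],
])

svm = OneClassSVM(nu=0.01).fit(features)   # ~1% of nodes expected to be outliers
flagged = [n for n, s in zip(nodes, svm.predict(features)) if s == -1]
print(f"{len(flagged)} nodes flagged as potentially suspicious")
```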

A novel method called Suspiciousness Rank Back and Forth (SRBF) inspired by the PageRank algorithm is additionally introduced in [19]. The authors show that it generally performs better in detecting suspicious subjects compared to both degree centrality measures and the mentioned graph embedding algorithms, achieving the overall best score in 3 out of 4 datasets.

The methods in [19] are evaluated against the list of known blacklisted entities. However, it remains a challenge to validate entities that appear high risk but are not on the original ground-truth blacklist. The use of Open-Source Intelligence (OSINT) is proposed in [20] for verifying these predicted high-risk entities. Such methods, although requiring much manual work, have proven successful in some cases where enough information was found online to uncover potential hidden links.

In [21] an experiment is conducted using the labelled Elliptic Dataset of Bitcoin transactions. Around 2% of the nodes in the dataset are marked as “illicit” (e.g. known to have belonged to dark markets), while some (21%) are labelled “licit” (e.g. belonging to well-established currency exchanges). The data is spread across 49 time steps. Each node is accompanied by approximately 150 features constructed from transaction information and aggregated neighbourhood features. The authors describe an inductive approach to predicting the suspiciousness of nodes using GCNs and compare it with different classification methods on the objective of predicting whether a node is licit or illicit. The classifiers tested are a Multi-Layer Perceptron (MLP), Random Forest and Logistic Regression trained with a weighted cross-entropy loss to prioritize illicit nodes, as they are the minority class.

The best classification results are achieved by Random Forests using node features only. An additional performance improvement is achieved by making the Random Forest model “graph-aware” by concatenating GCN embeddings to the mentioned features. Using GCNs alone in a supervised setting did not yield results as good; however, using EvolveGCN to capture temporal dynamics yielded slightly better results than a pure GCN. An intriguing observation arises when looking at the accuracy of the proposed models over time: when a large dark market is closed at a certain time step, the accuracy of all models drops significantly and does not recover in subsequent time steps. The robustness of models to such events is a major challenge to address.

In [21] the authors also point to future work combining the strengths of Random Forests and GCNs by using a differentiable version of decision trees, as proposed by [22], in the last GCN layer instead of Logistic Regression.

Anomaly detection on graphs can also be used to detect anomalous activities. In [23] a fully unsupervised approach for detecting anomalous edges in a graph is presented. A classifier is trained on equal numbers of existing and non-existing edges to predict whether an edge exists between two nodes; existing edges that the classifier labels as non-existing can then be viewed as anomalous. The approach has been tested on real-world datasets such as online social networks.
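A hedged sketch of this edge-anomaly idea, using simple hand-picked edge features rather than learned embeddings, is given below.

```python
# Train a classifier on existing edges vs sampled non-edges; existing edges
# scored as "non-existing" are candidate anomalies (in the spirit of [23]).
import random
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()                       # stand-in network
rng = random.Random(0)
deg = dict(G.degree())

def edge_features(u, v):
    # Simple illustrative features: endpoint degrees and common neighbours.
    common = len(list(nx.common_neighbors(G, u, v)))
    return [deg[u], deg[v], common]

pos = list(G.edges())
neg, nodes = [], list(G.nodes())
while len(neg) < len(pos):                       # same number of non-edges
    u, v = rng.sample(nodes, 2)
    if not G.has_edge(u, v):
        neg.append((u, v))

X = np.array([edge_features(u, v) for u, v in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression().fit(X, y)

# Existing edges predicted as non-existing are flagged as anomalous.
scores = clf.predict(np.array([edge_features(u, v) for u, v in pos]))
anomalous = [e for e, s in zip(pos, scores) if s == 0]
print(anomalous)
```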

3 Data Ingestion and Structure

3.1 Data Enrichment and Pseudo-Anonymization

At the core of the screening tool is efficient ingestion, storage and representation of large multigraphs of transaction data from two transactional data sources. The pure transaction graph is valuable and provides a backbone for further analysis, but the data is also enriched to enable deeper exploration and an easier understanding of the results. Additional information, tailored specifically to anti-money laundering scenarios, is ingested (and automatically updated) alongside it. Due to the high sensitivity of transaction data, privacy concerns and legal considerations, the ingestion pipeline follows a specially tailored pseudo-anonymization and enrichment process. The general dataflow is presented in Fig. 13.2.

Fig. 13.2 Data ingestion and preparation overview

Transaction data from different sources is enriched with company-specific information provided by the public business registry. The additional company-related information targets money laundering and terrorist financing specifically, as well as general features for discovering anomalous behaviour patterns, e.g. company size, capitalization type, ownership structure, company registration date, possible company closure date, capital origin etc. Account information is provided by eRTR (the public accounts register), and the information required for AML scenarios is added: account registration, possible account closure, account owner etc. An important piece of information is the account type; an account can be used by a private person, be associated with a specific company, or in some cases both. Meta-parameters supporting more efficient identification of suspicious patterns are ingested periodically, such as the EU list of high-risk third countries together with the risk level assigned to the countries on the list.

Subject to legal and privacy constraints, a future version of the screening tool is expected to also include data on transactions to high-risk foreign countries. This list of transactions is curated by the office for the prevention of money laundering, as mandated by the applicable AML and CTF law. Its inclusion will provide broader information on specific entities and improve the tracking of suspicious money paths.

After ingestion and data enrichment, all directly identifiable company and account data is pseudo-anonymized. Initial exploration showed that great care must be taken during the pseudo-anonymization process. Some information is inevitably and irrecoverably lost due to pseudo-anonymization as a way of satisfying privacy concerns (multiple accounts of the same private person, company name, account numbers), but other information might be lost unintentionally. Original transaction data and data on transactions to risky countries must be paired before the pseudo-anonymization process, as reporting standards differ and make post-pseudo-anonymization pairing impossible. Another negative side effect of pseudo-anonymization is that it can easily conceal data quality issues. For example, badly structured account numbers, manually entered data and other small inconsistencies that are practically invisible when processing data at the transaction level before pseudo-anonymization, along with spurious whitespace, letter and number similarities, string collations etc., produce completely different pseudo-anonymized results that are hard to detect. Multiple cross-checks and data validation procedures were used during the development phase to uncover a few such pitfalls and adjust the pseudo-anonymization process accordingly. Cross-checks for data quality control are also in place in the production environment to monitor the full data flow.

The pseudo-anonymizer developed during the project is able to anonymize an incoming stream of data, reusing the salt information from previous invocations. This enables easier ingestion of data at later stages, combined with new data and ingestion from different platforms, while still preserving as much information about graph structure and connectivity as possible. Due to the high volume of data and the need for automatic ingestion, the pseudo-anonymization service is provided as one of the Infinitech components, a standalone Kubernetes service able to mask data in real time.
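A minimal sketch of such deterministic, salted masking, assuming a keyed hash (HMAC-SHA256) and illustrative field names rather than the project's actual implementation, is shown below; reusing the same salt across invocations keeps tokens stable, so graph connectivity is preserved.

```python
# Deterministic, salted pseudo-anonymization sketch: the same identifier
# always maps to the same token as long as the same salt is reused.
import hmac
import hashlib

def pseudonymize(value: str, salt: bytes) -> str:
    """Deterministically mask a sensitive identifier with a keyed hash."""
    # Normalize before masking: stray whitespace or case differences would
    # otherwise produce a completely different token (see the data quality
    # discussion above).
    normalized = "".join(value.split()).upper()
    digest = hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]                    # shortened token for readability

salt = b"salt-loaded-from-previous-invocation"   # reused between batches
record = {"iban": "SI56 0123 4567 8901 234", "amount": 1200000.0}
record["iban"] = pseudonymize(record["iban"], salt)
print(record)
```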

3.2 Data Storage

Ingested and pseudo-anonymized data is transformed and stored to enable fast and efficient queries. Due to the large and fast data ingestion, data is stored in three separate databases. The raw database is a simple PostgreSQL container storing pseudo-anonymized, unnormalized raw data. The data ingestion engine first automatically normalizes the raw data and merges it with existing data stored in the master PostgreSQL database, which is specifically configured for high-performance reading and time-based indexing. The master database serves as the core data source for the Fourier transform-based feature calculation (Sect. 5.2), stream story and feature generation. After normalization, a battery of sanity checks is performed to confirm the schema validity of the ingested data and check for inconsistencies and anonymization errors. Normalized data is imported into the open-source version of the neo4j database, which serves as a backbone for graph-related searches, pattern recognition and anomaly detection.

The graph database schema is specifically tailored to enable easy exploration of transaction neighbourhoods and execution of parameterized queries. Great care was taken to provide proper indexing support for time-based analysis of neighbourhood root system evolution and for obtaining financial institution-level data. This enables efficient exploration of transaction structure evolution in relative time and detection of specific risky typologies. In conjunction with the vanilla neo4j database, neo4j Bloom is also configured and included in the platform for more in-depth analysis and further exploration.

Both the database components and the normalization and sanity-check services are provided as self-contained Docker images in line with the Infinitech way and are fully configurable for ease of use.

4 Scenario-Based Filtering

BOS acts as a supervisory authority and as such oversees risk management and risk detection, both at the level of individual financial institutions and at the level of the financial sector. The screening tool combines data from multiple transactional data sources, merges and normalizes it into a single data source and enables efficient data exploration at multiple levels of granularity. As a supervisory authority, one of the main goals is to ensure that financial institutions develop a proper control environment for suspicious transaction detection and reporting. Great care was taken to provide suitable explanations and interpretations of risky scenarios that were flagged by the tool but not recognized by the individual financial institutions.

Domain expert knowledge and initial exploration of historical data, combined with transaction data involving high-risk third countries, showed that the surest way to obtain highly interpretable and quickly actionable insights is to first apply scenario-based filtering. Typical scenarios used in simple detection come in simple, human-readable, rule-based forms:

Newly established companies, or companies closed soon after establishment, that receive large sums of money from foreign accounts and pay similar amounts to private accounts pose a high risk.

Company risk increases with the risk level of the payer/receiver country.

The main advantages of rule-based scenarios are straightforward explainability, composability and easy correspondence with existing anti-money laundering regulations. For a supervisory authority, providing a clear explanation for flagging and further processing increases the ability to pursue claims and improves the actionability of findings. Furthermore, explainable rule-based filters can easily be implemented later in on-site checks. A direct result of collaboration with experts on existing rules for evaluating and flagging risky behaviour was the development of special parameterized, generalized rules. They encompass common money laundering and terrorist financing transaction patterns and map them to database queries for further exploration, as presented in Sect. 6.
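As an illustration, a generalized rule such as the first one above could be mapped to a parameterized graph query roughly as follows; the node labels, relationship types and property names are assumptions about the schema, not the actual PAMLS model.

```python
# Sketch: a parameterized rule expressed as a Cypher query and executed
# through the official neo4j Python driver. Schema names are illustrative.
from neo4j import GraphDatabase

QUERY = """
MATCH (c:Company)-[:OWNS]->(a:Account)<-[t:TRANSACTION]-(src:Account)
WHERE c.registered >= date() - duration({months: $max_company_age_months})
  AND t.amount >= $min_amount
  AND src.country IN $high_risk_countries
RETURN c, a, t, src
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    result = session.run(QUERY,
                         max_company_age_months=6,
                         min_amount=100_000,
                         high_risk_countries=["XX", "YY"])
    flagged = [record.data() for record in result]
driver.close()
```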

The rule-based approach also translates nicely into the language of typologies. Specific transaction patterns can be interpreted as time-based parts of the money laundering process (preparation, diffusion…) or graph-based parts (the part where money is rewired to individuals previously associated with the company, the part where money crosses the EU border…). Since not all transactions are included, detecting even parts of scenarios can provide good insight and lead to further exploration. Such parts are referred to as typologies and are usually named after the historical cases in which they were discovered or prominent. Detecting and correctly identifying such parts on subgraphs (transactions of a specific bank, a combination of banks where bank accounts typically mix…) and in specific time frames is easy with generalized, parameterized rule-based queries; the results translate directly into the language and context already used by experts and are therefore actionable.

Rule-based filtering proved flexible enough to easily detect specific patterns in historical data that were immediately interpreted as a phase of a potential money laundering scenario. Although flexible, even the generalized rules are still easily interpretable and can be put in the context of anti-money laundering during analysis.

5 Automatic Detection and Warning Systems

5.1 Pattern Detection

The screening tool will provide automatic detection of suspicious typologies and flagging of risky scenarios. Automatic detection and early, real-time warning systems give the supervisory authority additional data and additional risk-based analysis, easing and supporting the supervisory process whenever a domain expert decides to act on them.

Figure 13.3 shows a specific flagged scenario detected and further explored in the screening tool. Nodes in the graph correspond to different scenario entities: yellow nodes are accounts owned by companies, blue nodes are accounts owned by private persons, green nodes are companies and orange nodes are the objects flagged and monitored for the chosen scenario. In this specific scenario, the orange nodes represent companies closed less than 6 months after registration, corresponding to the scenario in which companies closed soon after establishment process an anomalous amount of transactions. Edges between nodes correspond to specific dependencies. Edges (in red) between bank account nodes correspond to transactions, with edge thickness indicating the transferred amount (on a log scale). Edges between account nodes and company nodes correspond to account ownership; a company might own one or more accounts, and the combination of transactions from the same company might tell a completely different story than an isolated bank account. Lastly, edges between the flagged orange nodes and company nodes indicate which company was flagged, with the specific flagging information and results stored as attributes of the orange node. Node labels correspond to anonymized company and bank account identifiers.

Fig. 13.3 Large transaction forwarding

In this scenario, a company received two large transactions (1.2M €) on the same day and made two payments of practically the same amount to two different individual bank accounts. The proxy company existed for less than 4 months.

Figure 13.4 shows another discovered pattern. This time the company nodes are coloured in purple.

Fig. 13.4 Transaction crumbling

In the second detected example, the same entity performs several identical transactions to two entities that were opened and closed within a short period of time, and to a third entity opened one year later.

Both scenarios exhibit signs of a potential money laundering phase called layering. The main goal of layering is to make the source of illegal money difficult to detect through various means: shell companies, or receiving and processing payments through foreign and non-EU countries, specifically countries with lax AML standards. The second example shows possible use of smurfing, where payments are divided into smaller amounts to evade ordinary controls and transferred through multiple accounts to disperse ownership information.

The third anomaly is presented in Fig. 13.5: a single short-lived company disperses multiple transactions to private person accounts within a short time span. To make it even more anomalous, all transactions cross the country border, making efficient money tracing more difficult.

Fig. 13.5 Dispersing transactions to multiple private accounts

A future direction for the automatic detection of specific scenarios is to use a relative time scale to compare the local relationship evolution of similar company types. This will enable better modelling of graph representations pertaining to the transaction histories of various company types (the transaction structure of utility and retail companies is widely different from that of companies offering financial services to institutional clients).

5.2 Data Exploration Using Fourier Transforms

Financial data is inherently temporal, and it is reasonable to assume that there are recurring patterns in the trading activity of subjects. Let ϕ_j be the time series of the daily number of outgoing transactions made by entity j. (The same analysis as demonstrated here can be performed with any other time series.) We use the efficient Fast Fourier Transform (FFT) algorithm [24] to compute the frequency-domain representation A_j(ν) of each ϕ_j.

Baselines of the resulting spectra are corrected using the I-ModPoly algorithm as described in [25] and implemented in the BaselineRemoval Python library (https://pypi.org/project/BaselineRemoval/), using polynomial degree 1. The resulting frequency-domain spectrum of each client is resampled by linear interpolation at n = 1000 equally spaced points throughout the frequency range, yielding an n-dimensional vector representation χ_j of each entity. Finally, each χ_j is normalized by dividing by \( \sum_{t=1}^{|\phi_j|} \phi_j(t) \), i.e. the entity's total number of outgoing transactions.

K-means clustering with k = 5 and Euclidean distance is performed on the resulting feature vectors. Entities that do not have enough data points recorded for quality spectra are filtered out with a simple rule: accounts with fewer than 365 non-zero data points in ϕ_j are not considered for this analysis. Our dataset is sparse; less than 1% of all entities in the dataset remain available for frequency-domain analysis after filtering.
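The pipeline described above can be condensed into the following sketch; the input series are synthetic, and the BaselineRemoval calls assume the API documented on PyPI.

```python
# FFT of each entity's daily transaction-count series, I-ModPoly baseline
# correction (degree 1), resampling to n = 1000 points, normalization by total
# volume, and k-means with k = 5, mirroring the parameters in the text.
import numpy as np
from BaselineRemoval import BaselineRemoval
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_days, n_entities, n_points = 730, 50, 1000
# Daily outgoing-transaction counts per entity (synthetic placeholder).
series = rng.poisson(lam=3.0, size=(n_entities, n_days)).astype(float)

vectors = []
for phi in series:
    spectrum = np.abs(np.fft.rfft(phi))                  # amplitude spectrum A_j(nu)
    corrected = BaselineRemoval(spectrum).IModPoly(1)    # I-ModPoly, degree 1
    freqs = np.fft.rfftfreq(n_days, d=1.0)               # cycles per day
    grid = np.linspace(freqs[0], freqs[-1], n_points)    # 1000 equally spaced points
    resampled = np.interp(grid, freqs, corrected)
    vectors.append(resampled / phi.sum())                # normalize by total volume
X = np.vstack(vectors)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```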

Figure 13.6 shows that all clusters have three distinct peaks, corresponding to frequencies that are multiples of \( \frac{1}{7} \); there are inherent weekly dynamics in the data, as no transactions are processed during weekends. There are nevertheless distinct differences between clusters. Clusters 1 and 3 are highly similar to the Euclidean barycentre, containing mostly weekly, but also some other (monthly) frequencies. Entities in clusters 2 and 5 exhibit strong monthly dynamics (for example, they make payments once a month), while subjects in cluster 4 exhibit weekly dynamics (e.g. they make a single transaction on each working day and none on weekends).

Fig. 13.6 Barycentre spectra of FFT clusters. Approximate number of members of each cluster shown in brackets

6 Use Cases

The full working platform for anti-money laundering and supervision joins all previously described segments in a simple and intuitive platform. The platform takes care of data ingestion, pseudo-anonymization, data enrichment, automatic anomaly detection and pattern flagging. The Neo4j Bloom tool, combined with the web-based graphical user interface for graph exploration currently under development, allows supervisory and compliance experts to manually inspect flagged scenarios in a broader context.

Automatic scenario detection provides the user with suggested parameterized queries that combine scenario-based analysis with computed risk measures.

Figure 13.7a shows an example of a parameterized query from the available scenarios suggested by automatic detection. In this case, the user can manually configure the relative time window of interest. Figure 13.7b shows an example result produced by such a query. Depending on the granularity and the selected timescale, certain features are more pronounced. The example scenario shows that transaction neighbourhoods form a highly partitioned graph with one large component covering more than half of the graph and many smaller components. Small components are of particular interest, as they usually exhibit irregular behaviour. The screening tool enables easy exploration, further processing and application of ML algorithms to specific clusters, for example the one seen in close-up in Fig. 13.7c.

Fig. 13.7 Parameterized query with resulting graph structure. (a) Parameterized query example. (b) Transaction graph corresponding to parameterized query. (c) Medium level cluster close up
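The component structure described above can be explored with a few lines of networkx; the random graph below is a stand-in for the subgraph returned by a parameterized query.

```python
# Split a transaction subgraph into weakly connected components and surface
# the small ones, which tend to exhibit irregular behaviour.
import networkx as nx

G = nx.gnp_random_graph(2000, 0.0008, seed=0, directed=True)   # toy directed graph
components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)

largest = components[0]
small = [c for c in components[1:] if len(c) <= 10]
print(f"largest component covers {len(largest) / G.number_of_nodes():.0%} of nodes")
print(f"{len(small)} small components selected for closer inspection")

# Each small component can then be extracted and passed to further analysis.
subgraphs = [G.subgraph(c).copy() for c in small]
```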

Exploring different graph clusters usually raises additional questions about the underlying cluster data. As the data is preprocessed, additional transaction and company-related data is readily included and can be uncovered by clicking on specific nodes or edges, as seen in Fig. 13.9a. The presentation of additional data depends on its confidentiality status. Confidential data such as account numbers is presented in pseudo-anonymized form due to privacy and legal concerns. Data that does not represent personal, identifying or confidential information (e.g. the transaction date) can be presented in full. Depending on the anonymizer settings, different authorization levels can be given access to different pseudo-anonymization levels (bank account numbers, BIC numbers, full dates, specific company type).

Additional close-up examples in Fig. 13.9b and c show further transaction topologies with different features. Nodes can also be enriched with additional metrics calculated during anomaly detection and presented together with this information and with similar nodes. As before, each can be separately explored and escalated if needed.

The close-up view can be further specialized into a hierarchical view, which exposes additional structural data dependencies. Two close-ups of such views are presented in Fig. 13.8a and b. The hierarchical view is especially useful to quickly assess cluster structure and to separate highly connected nodes connecting multiple clusters, as seen in Fig. 13.9.

Fig. 13.8 (a) and (b) Hierarchical views close up

Fig. 13.9 Information and close up examples. (a) Additional transaction info. (b) Sending money between company accounts. (c) Star-type typology with transactions to a private person

Due to the highly specific nature of some companies and the unusual transaction patterns of others (utility companies, municipalities, large companies with foreign subsidiaries), a special whitelist is being implemented. Since all data is pseudo-anonymized, it is hard, and in some cases almost impossible, to reason about specific corner cases manually. An externally provided whitelist can be fully anonymized and used in two different ways. One can either fully exclude whitelisted companies from the anomaly detection calculations, or perform all the calculations on the whole dataset and only filter the whitelisted companies out during presentation and manual exploration. The second option is particularly useful, as whitelisted entities can be annotated and excluded from flagging but still viewed. An important example is whitelisting a utility company for having a large number of transactions from private accounts while still calculating and presenting risk scores for its overseas transactions.

7 Summary

The screening tool is part of the comprehensive PAMLS platform, which enables automated data capture and automatic data quality controls. The screening tool accomplishes three specific goals. Firstly, it enables data acquisition and enrichment from different transactional data sources. Secondly, it provides automatic screening for known, potentially suspicious transaction patterns. Finally, it allows a supervisory authority to investigate the enriched transaction space and discover new potentially suspicious patterns in order to assess the inherent risk of financial institutions and apply a commensurate level of supervision. The PAMLS platform is intended for big data processing and detection of potentially risky transaction patterns. The screening tool is one of the tools in PAMLS and maps enriched transactional data sources into the space of graphs, where parameterized queries easily map risk typologies to graph topologies, enabling human-friendly investigation and pattern recognition. The tool is still in the development phase and will need to be further validated by domain experts. The need for similar solutions is evident both at the level of individual financial institutions and at the level of supervisory authorities.