1 Introduction

Small and Medium-sized Enterprises (SMEs) are vital to economies and societies worldwide. According to the OECD [1], SMEs in its member countries account for 99% of all businesses and almost 60% of value added. In Europe, SMEs likewise hold a vital role, accounting for 99.8% of all enterprises in the EU-28 non-financial business sector (NFBS), generating 56% of its added value and driving employment with 66% of NFBS jobs [2]. Despite their global importance, many SMEs struggle to keep up with the pace of change in their digital transformation journey; the complex and challenging environment, with its technological disruptions and radical business changes, is the main barrier. Additionally, the ongoing COVID-19 pandemic, altering customer behavior and business trends, has made it necessary for SMEs to push their e-commerce activities and the utilization of online channels, while provoking significant liquidity concerns for businesses unable to adopt new digital tools [3].

The introduction of digital technologies and state-of-the-art analytics tools can empower SMEs by reducing operating costs and saving time and valuable resources. This holds especially for firms with reduced economic activity and smaller production volumes, which also tend to have limited market reach and lower negotiating power with stakeholders [4].

The introduction of modern predictive and descriptive analytics can affect almost every aspect of an SME’s operation, leading to data-driven strategic decision-making processes. The digital transformation of an SME could offer a new perspective on its business and financial management, leading to a competitive advantage through increased productivity and quality control, new marketing techniques, and the ability to identify new markets and foresee business opportunities.

However, the digital transformation journey of an SME poses various risks and challenges. Based on the 2019 OECD SME Outlook [1], the main barriers toward SME digitalization are the limited digital skills currently found in most SME management teams, the inability to identify, attract, and retain suitable employees, the lack of required resources or financing options, and strict regulations regarding data protection.

Of course, not all sectors face the same challenges and barriers in their digital transformation. The generation of data and the utilization of data analytics appear to be highest in the financial sector [5], where large financial software providers offer commercial data analytics applications to SMEs through cloud computing services, allowing SMEs to access tailored AI services even when they lack the resources to develop them internally [6]. Other commercialized applications utilizing Big Data analytics are included in ERP or accounting software packages, with stand-alone Business Financial Management (BFM) software and tools also being available commercially. Most of the offered solutions are geared toward analyzing historical transaction data residing within the ERP system. Banks, which retain a variety of data on their SME customers as required for their core activities, could offer a solution by utilizing all available data to provide data analytics tools that increase their customers’ business financial management efficiency, while offering value-added services on top of their core business. In this direction, banks can harness all available operational and customer data to provide accurate business insights and analytical services to SMEs, resulting, as noted by Winig [7], in an increased customer base and engagement.

However, developing personalized, segmented services is not an easy task for a bank, especially considering the enormous variation found in the SME market, from business model and goals to business scale, as it poses a variety of business and technical challenges.

This chapter introduces a data-driven approach to facilitate the development of personalized value-adding services for SME customers of the Bank of Cyprus, under the scope of the European Union-funded INFINITECH project, grant agreement no. 856632.

As a detailed presentation of all developed microservices and their interconnections would far exceed the constraints of this chapter, we showcase the development process and the new possibilities unlocked by the foundation of the proposed mechanism, namely the BFM Hybrid Transaction Categorization Engine. The interconnection of the various microservices is illustrated through a brief presentation of the underlying DL model used for categorical time-series forecasting, namely the BFM Cash Flow Prediction Engine. The categorization of all SME transactions is required in order to label all historical data and unlock most features of an innovative BFM toolkit. Based on the classified data, the developed cash flow prediction model is one of the key BFM tools, adding value to SMEs by providing a holistic approach to income and expense analysis. For the transaction categorization model, a hybrid approach was followed: initially, a rule-based step approach was applied based on transaction, account, and customer data aggregation; then, at an operational phase, a tree-based ML algorithm is also implemented, creating a smart classification model with a high degree of automation and the ability to take users’ re-categorizations into account. Given the lack of prominent research, the provided dataset, sourced from a real-world banking scenario, was labeled based on various rules and the input of banking experts, incorporating various internal categorical values present in the dataset. In this direction, 20 Master Categories with 80 respective Sub-categories tailor-made for SMEs were created.
As expected in a real-world banking scenario, the generated transaction categories were highly imbalanced; thus, a CatBoost [8] model was preferred based on the findings of [9], where the most well-established boosting models were reviewed on multi-class imbalanced datasets, concluding that CatBoost is superior to other boosting algorithms on multi-class imbalanced conventional datasets. The second microservice illustrated, the Cash Flow Prediction Engine, aims at an accurate and highly scalable time-series model which can predict the inflows and outflows of SMEs per given category. To achieve this, after exploring traditional time-series forecasting models and newly introduced ML/DL approaches, deep learning techniques were utilized to provide information regarding the future cash flow of SMEs. The analysis focuses on probabilistic time-series forecasting, utilizing a Recurrent Neural Network model.

2 Conceptual Architecture of the Proposed Approach

The conceptual architecture and workflow of the various components included in the developed Business Financial Management platform are presented in Fig. 12.1.

Fig. 12.1
figure 1

BFM platform conceptual architecture

Bank of Cyprus (BoC) is developing a testbed based on AWS cloud computing services. The pilot’s cloud infrastructure is being developed as a blueprint testbed, with other pilots of the INFINITECH project utilizing cloud solutions similar to the one being established. For the data collection process, tokenized data from designated BoC databases, as well as data from open sources and SME ERP/accounting software, will be migrated to the data repository of the BoC testbed. Upon returning to the BoC datastore, a reverse pseudonymization (i.e., mapping tokenized IDs back to users) is performed so that the respective analytic output reaches the designated SME clients. The pilot utilizes both historical and real-time data for its various Business Financial Management tools, as the need to provide real-time business intelligence that relies on live data is crucial for the pilot’s development. Based on the INFINITECH Reference Architecture, which in turn builds on the BDVA RA, the pilot’s workflow is depicted in Fig. 12.2.
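The tokenization and reverse pseudonymization step described above can be sketched as a simple token vault kept inside the bank's datastore. This is a minimal illustrative sketch only; the class and method names are hypothetical, and BoC's actual tokenization scheme is not detailed in this chapter.

```python
import secrets

class Pseudonymizer:
    """Illustrative token vault: customer IDs are replaced by random tokens
    before data leaves the bank, and mapped back on return (hypothetical
    sketch, not the bank's actual scheme)."""

    def __init__(self):
        self._to_token = {}     # customer_id -> token
        self._to_customer = {}  # token -> customer_id

    def tokenize(self, customer_id: str) -> str:
        # Random tokens carry no information about the customer, so data
        # shared with the testbed cannot be re-identified without the vault.
        if customer_id not in self._to_token:
            token = secrets.token_hex(8)
            self._to_token[customer_id] = token
            self._to_customer[token] = customer_id
        return self._to_token[customer_id]

    def reverse(self, token: str) -> str:
        # "Reverse pseudonymization": only possible inside the bank's
        # datastore, where the mapping table is kept.
        return self._to_customer[token]
```

In this design the mapping table never leaves the bank, so analytic outputs computed on tokenized data can still be routed back to the correct SME client.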

Fig. 12.2
figure 2

BFM platform within the INFINITECH Reference Architecture

Following the respective INFINITECH CI/CD process and taking into consideration the EBA 2017 guidelines on outsourcing banking data to the cloud, the first two components were deployed as depicted in Fig. 12.3.

Fig. 12.3
figure 3

Initial component development in INFINITECH testbed

All components are packaged as Docker containers following a microservices approach and then deployed in a Kubernetes cluster (i.e., AWS EKS). More specifically, every component is deployed in a Pod within the cluster with specific features (e.g., the capability of auto-scaling on demand). Deploying all components under the same namespace thus defines a sandbox that allows interaction and connection between them.

3 Datasets Used and Data Enrichment

3.1 Data Utilized for the Project

For the development of the data analytics components presented, the utilized dataset was provided by Bank of Cyprus as a real-world use case within the scope of the INFINITECH project. All data received by the testbed, except the various internal codes kept by the bank for operational purposes, had already been tokenized, with no practical way of retrieving customer-related personal and/or sensitive information. This anonymized dataset includes transaction, customer, and account data of over a thousand SMEs for the years 2017–2020, exceeding 3.5 million data entries. Despite the existence of the various internal codes, all transaction data were unlabeled, with no initial indication of their underlying category. The key data source, both for the transaction categorization and for the cash flow prediction model, was the available transaction data, which contained more than 40 variables, ranging from basic information such as Date, Amount, Credit/Debit indicators, and Description, to more sophisticated information such as Touchless Payment indicators, Merchant Category Codes, and Standing Order indicators. The SME accounts dataset contained tokenized information on the customers’ individual accounts, available balances, and the respective NACE Rev. 2 code, a statistical classification of economic activities used across the European Community [10] which the bank uses to identify an SME’s operating sector. The availability of SMEs’ economic activity is significant for this chapter and for our future work as well, since external national and European data that use the NACE code system horizontally can be fed into our models to provide more personalized sector information. The datasets used for the BFM tools’ operation and for future intended data enrichment are summarized in Table 12.1.

Table 12.1 Datasets used within the BFM toolkit

3.1.1 Data Enrichment

Besides the data originating from the bank, various external sources, as presented in Table 12.1, will be utilized in order to provide accurate business insights and additional information to the SMEs. Data from external sources are retrieved by the various (historical, external, or real-time) data collectors, based on REST APIs developed within the designed microservices. Data enrichment has three main objectives:

  1. Offer more holistic financial management potential to the SMEs via open-banking data integration.

  2. Provide sector-specific information and personalized insights by utilizing open-source market data.

  3. Offer account reconciliation and additional innovative services by ingesting ERP/accounting data into the BFM platform’s underlying ML/DL models.

However, data enrichment will not be implemented until future work on upcoming data analytics components is completed; the details of how the external data are retrieved and how the required data streams are designed are out of this chapter’s scope.

4 Business Financial Management Tools for SMEs

The proposed solution offers a variety of data analytics microservices, all aiming to assist SMEs in monitoring their financial health, getting a deeper understanding of their operating costs, allocating resources with supported budget predictions, and retrieving useful information relevant to their underlying business sector.

Since a detailed presentation of all developed microservices and their underlying interconnections would far exceed the constraints of this chapter, focus is given to the development process of the BFM toolkit’s foundation, namely the Hybrid Transaction Categorization Engine. Besides yielding a smart personalized classification engine that takes the user’s re-categorizations into account and offers a high degree of automation, this task can assist in the categorization of open banking data and offers fertile ground for applying explainable AI frameworks to better comprehend the outcomes of our classification ML model.

The interconnection of the various microservices is showcased with a brief presentation of the DeepAR RNN model utilized [11], where the produced categories are taken into account to serve the time-series forecasting needs of the model, offering highly personalized and accurate probabilistic predictions to SMEs per Master Category.

The rest of the analytics components are briefly described just to offer a glimpse of the BFM tools developed and how they can empower the SMEs, utilizing a set of descriptive and predictive analytics services based on their personalized needs.

4.1 Hybrid Transaction Categorization Engine

As mentioned above, the classification of SME transactions is vital for the further development of financial management microservices. The absence of labeled data is the main challenge when developing a transaction categorization model, and two prevalent approaches arise when creating a classification model. The first is to use unsupervised machine learning techniques to create clusters with no prior knowledge of the expected outcomes. However, this cannot be applied in our transaction categorization scenario: labels in the finance sector are fixed and of a distinct nature, and clustering suffers from difficulty in interpreting its outcomes, leading to a less robust and interpretable model. The second, and proposed, approach is to initially hand-label a representative subset based on expert knowledge, creating a rule-based model, which can then be integrated with a supervised machine learning model, offering a high degree of update automation and transaction re-classification.

4.1.1 Rule-Based Model

A step approach was followed for the rule-based model, incorporating various internal codes of the bank, some of them case-specific and interpretable only by the banking experts (e.g., Transaction Type Code) and others used universally in the business world (e.g., Merchant Category Code and NACE codes). Before mapping the above variables to given categories, it was vital to capture all transactions between accounts belonging to the same SME, as those would be classified as “Transfers between own accounts.” The exact flowchart of the rule-based model and the shift to a hybrid transaction engine is illustrated in Fig. 12.4.

Fig. 12.4
figure 4

Hybrid classification model flowchart
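The stepwise rule-based logic described above can be sketched as a cascade of lookups. All code-to-category mappings below are hypothetical placeholders for illustration, not the bank's actual rules or category names.

```python
# Illustrative rule-based categorization steps; every mapping here is a
# hypothetical placeholder, not the bank's actual rule set.
MCC_MAP = {"4511": "Travel Expense", "5411": "Inventory Purchase"}
NACE_MAP = {"62.01": "IT Services Expense"}
TXN_TYPE_MAP = {"FEE": "Banking Expense", "ATM": "Cash Withdrawal"}

def categorize(txn: dict, own_accounts: set) -> str:
    # Step 1: transfers between accounts of the same SME
    if txn.get("counterparty_account") in own_accounts:
        return "Transfers between own accounts"
    # Step 2: bank-internal Transaction Type Codes (root categories)
    if txn.get("txn_type") in TXN_TYPE_MAP:
        return TXN_TYPE_MAP[txn["txn_type"]]
    # Later steps: universal merchant-level codes (MCC, counterparty NACE)
    if txn.get("mcc") in MCC_MAP:
        return MCC_MAP[txn["mcc"]]
    if txn.get("nace") in NACE_MAP:
        return NACE_MAP[txn["nace"]]
    # Fallback: left for the user (and later the ML model) to resolve
    return "Uncategorized Expense"
```

The ordering matters: own-account transfers must be captured first, before any merchant-code mapping is attempted, mirroring the flowchart in Fig. 12.4.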

4.1.2 CatBoost Classification Model

A key challenge in developing the hybrid classification model is alleviating the bias introduced by the rule-based model. In this direction, it was crucial to enable the user to re-categorize a given transaction. This process of updating the existing knowledge and adapting the model to re-categorizations is fundamental for continuous model optimization, i.e., increased accuracy and personalization.

To this end, a CatBoost model is periodically retrained in order to adapt to changes made by the end users (i.e., SMEs). CatBoost is a novel algorithm for gradient boosting on decision trees that can handle categorical features natively during training. Developed by Yandex researchers, it is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks at Yandex and other companies. CatBoost can run on CPU as well as GPU, which accelerates the training process. Like all gradient-based boosting approaches, CatBoost builds trees in two phases: choosing the tree structure and then setting the values of the leaves for the fixed tree. One important improvement of CatBoost is its unbiased gradient estimation to control overfitting: in each boosting iteration, to estimate the gradient of each sample, it excludes that sample from the training set of the current ensemble model. Another improvement is the automatic transformation of categorical features to numerical ones without any preprocessing phase. CatBoost is applicable to both binary and multi-class problems.

Given the nature of the different steps utilized in the rule-based model, and aiming at high efficiency and categorization accuracy, a hybrid model was preferred instead of a fully AI one, since the first two steps of the rule-based process provide accurate categorization results, producing root categories such as cash withdrawals, deposits, and banking fees.

In more detail, as the root categories produced by the first two steps of the rule-based model (i.e., transfers between own accounts and Transaction Type Code mapping) are predefined, the aim of this model is to learn and mimic the last three steps, while also taking into account the changes made by the user at an operational phase. The number of remaining Master Categories that can be produced in these steps is 16, so evaluation was performed as a multi-class task in terms of various metrics. Given that the dataset was highly imbalanced, finding a proper normalization factor was another challenge, which was overcome through hyper-parameter optimization.

The main outcome derived from the results is that the model can learn the utilized rules, incorporating key merchant features (i.e., NACE code and Merchant Code), and correctly categorize transactions with 98% accuracy. Furthermore, it is worth mentioning that some of the transactions were categorized as “Uncategorized Expense.” The transactions falling into this category are expected to be categorized by the respective SME; consequently, when the model is retrained, the additional knowledge gained from the SME-performed categorization will be incorporated into the model.
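The overall hybrid flow, with fixed root rules, a learned model for the remaining categories, and user re-categorizations folded in at the next periodic retrain, can be sketched as follows. The class and method names are illustrative, and the classifier is a pluggable placeholder rather than the actual CatBoost configuration.

```python
class HybridCategorizer:
    """Sketch of the hybrid engine: deterministic root rules first, then a
    trainable model for the remaining categories, with user corrections
    collected for the next periodic retrain (names are illustrative)."""

    def __init__(self, rule_step, model):
        self.rule_step = rule_step  # returns a root category or None
        self.model = model          # any classifier with fit/predict
        self.feedback = []          # (features, corrected_label) pairs

    def predict(self, txn, features):
        category = self.rule_step(txn)  # steps 1-2: predefined root categories
        if category is not None:
            return category
        return self.model.predict([features])[0]  # remaining steps: learned

    def recategorize(self, features, corrected_label):
        # User correction: stored and incorporated at the next retrain
        self.feedback.append((features, corrected_label))

    def retrain(self, X, y):
        # Periodic retrain on historical labels plus accumulated feedback
        X2 = X + [f for f, _ in self.feedback]
        y2 = y + [label for _, label in self.feedback]
        self.model.fit(X2, y2)
```

Note that rule-produced root categories bypass the model entirely, so retraining only ever affects the categories the model is responsible for.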

4.1.3 Explainable AI in Transaction Categorization

Although statistics, with its various hypothesis tests and the systematic study of variable importance, is well established and studied, the same does not apply to the explainability of various machine and deep learning techniques. Even though multiple measures and metrics of performance have been extensively applied, some of them can be rather misleading, as they do not convey the justification behind the decision made. Stronger forms of interpretability offer several advantages, from trust in model predictions and error analysis to model refinement.

In the context of our research, the interpretation of the results is as important as the results themselves, as it can lead to significant technical insights regarding the transaction categorization engine’s evaluation. Explainable methods can be categorized into two general classes:

  1. Built-in feature importance of the machine learning model

  2. Post hoc feature importance utilizing models such as LIME [12] and SHAP [13]

In our classification scenario, both LIME and SHAP techniques were leveraged as a qualitative evaluation of the results.

The SHAP values denoting the importance of each feature included in the CatBoost model are depicted in Fig. 12.5. It is evident in the figure that the model learned the rules based on the Merchant Code ID (i.e., MCCCodeID) and the transaction beneficiary’s NACE code, as these two are the most important features. Additionally, the significance of the Account Key (skAcctKey) as the third most important feature strengthens the proposed user-oriented updating approach.

Fig. 12.5
figure 5

SHAP values of each feature included in the CatBoost model
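For intuition about what the SHAP values in Fig. 12.5 represent, the sketch below computes exact Shapley values for a toy three-feature model by enumerating feature coalitions. This is the textbook definition only; it scales exponentially with the number of features, and production libraries (including CatBoost's built-in SHAP support) use efficient approximations instead. The toy scoring function stands in for the classifier and is not the chapter's model.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for model f over len(x) features.
    f is evaluated on hybrids of x and baseline: features inside the
    coalition take their value from x, the rest from the baseline.
    Toy-scale only: cost grows exponentially in the feature count."""
    n = len(x)

    def value(subset):
        hybrid = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(hybrid)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Classic Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(set(subset) | {i}) - value(set(subset)))
    return phi

# Toy "model": a linear scorer standing in for the classifier's output
f = lambda v: 3 * v[0] + 2 * v[1] + v[2]
phi = shapley_values(f, x=[1, 1, 1], baseline=[0, 0, 0])
```

A useful sanity check is the efficiency property: the per-feature contributions sum exactly to the difference between the model's output at `x` and at the baseline, which is what makes SHAP plots like Fig. 12.5 additive decompositions of a prediction.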

Apart from the feature importance based on SHAP analysis, for the CatBoost evaluation it is important to qualitatively check some of the outcomes using Local Interpretable Model-agnostic Explanations (LIME), a recent technique capable of explaining the outcomes of any classifier or regressor through local, sample-based approximations with other interpretable models.

Figure 12.6 offers examples of how the LIME framework can assist in interpreting the predictions of our transaction categorization model. For instance, as illustrated in the figure, the first transaction is categorized as “Banking Expense” with a probability of 39%, with the features contributing toward this outcome being the MCCCode, the NACE code, and the specific account. Likewise, with a probability of 23% it can be categorized as “Uncategorized Expense,” and with a probability of 16% as “Selling and Distribution Expense,” with the features contributing toward these decisions also depicted. In the second example, the given category is “Selling and Distribution Expense” with a confidence level of 96% (Fig. 12.6).

Fig. 12.6
figure 6

LIME analysis of three specific transaction categorization examples, explaining the features and the value contribution of the model outcomes with their respective probabilities
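LIME's core idea, perturb the instance, weight perturbed samples by proximity, and fit a simple local surrogate, can be illustrated with the minimal sketch below. It is a deliberate simplification: it fits one weighted least-squares slope per feature rather than the joint sparse linear model the actual `lime` library fits, and the kernel width and sample count are arbitrary illustrative choices.

```python
import math
import random

def lime_weights(predict, x, n_samples=500, width=0.75, seed=0):
    """Minimal sketch of LIME's local-surrogate idea for one instance x:
    sample Gaussian perturbations around x, weight them by an exponential
    proximity kernel, and estimate a weighted regression slope per feature
    (one-at-a-time simplification of the real joint linear surrogate)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0, 1) for xi in x]
        dist2 = sum((a - b) ** 2 for a, b in zip(x, z))
        w = math.exp(-dist2 / width ** 2)  # closer samples count more
        samples.append((z, predict(z), w))

    coefs = []
    sw = sum(w for _, _, w in samples)
    for i in range(len(x)):
        mz = sum(w * z[i] for z, _, w in samples) / sw
        my = sum(w * y for _, y, w in samples) / sw
        cov = sum(w * (z[i] - mz) * (y - my) for z, y, w in samples) / sw
        var = sum(w * (z[i] - mz) ** 2 for z, _, w in samples) / sw
        coefs.append(cov / var)  # local slope = feature's local influence
    return coefs

# Toy black box standing in for the classifier's score for one class
coefs = lime_weights(lambda v: 2 * v[0] - v[1], x=[0.5, 0.5])
```

The recovered local slopes play the role of the signed feature contributions shown in Fig. 12.6: a positive slope pushes the prediction toward the explained class, a negative one away from it.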

4.1.4 Paving the Way for Open Data Enrichment: Word Embeddings in Transaction Descriptions

The Transaction Categorization Engine is enriched with another innovative contribution: creating word embeddings from the transaction descriptions. These embeddings are used in the transaction categorization model and serve as common ground for integrating open banking data with the proprietary (internal) data utilized. This approach not only increases the categorization accuracy but also paves the way for classifying transactions provided by other institutions as part of PSD2; the embeddings can also be used as features in complementary downstream machine learning processes such as fraud detection, which is partly implemented in our Transaction Monitoring microservice. This effort, however, raises another challenge, as the information in short texts is often insufficient, which makes them hard to classify. Recently, continuous word representations in high-dimensional spaces have had a great impact on the Natural Language Processing (NLP) community through their ability to capture, in an unsupervised manner, syntactic and semantic relations between words, phrases, and even complete documents. Employing these representations has produced very promising results in language modeling and translation, with the help of available large text bases. Motivated by the success of continuous word representations in the NLP world, this work proposes representing the financial transaction data in a continuous embedding space to take advantage of the large amount of unlabeled financial data. The resulting vector representations of transactions are similar for semantically similar financial concepts. We argue that, by employing these vector representations, one can automatically extract information from the raw financial data, and we performed experiments to show the benefits of these representations. In Fig. 12.7, a TensorBoard example of a Word2Vec Skip-Gram [14] model, built on transaction descriptions related to “olympicair,” is illustrated, presenting grouping indications of similar transaction categories.

Fig. 12.7
figure 7

Tensorboard Illustration of word embeddings example created through transaction descriptions
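The pair-extraction step at the heart of Skip-Gram training can be sketched as below: each token in a transaction description is paired with its neighbors inside a context window, and these pairs are what the embedding model is trained on. The sample descriptions are made up for illustration, and the embedding training itself is omitted.

```python
def skipgram_pairs(descriptions, window=2):
    """Generate (target, context) training pairs for a Word2Vec Skip-Gram
    model from transaction descriptions. Only the pair-extraction step is
    shown; training the embedding vectors themselves is omitted."""
    pairs = []
    for text in descriptions:
        tokens = text.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is never its own context
                    pairs.append((target, tokens[j]))
    return pairs

# Illustrative descriptions (not actual bank data)
pairs = skipgram_pairs(["card payment olympicair athens",
                        "pos olympicair ticket"])
```

Because tokens such as “olympicair” repeatedly co-occur with travel-related context words across many descriptions, their learned vectors end up close together, which is the grouping effect visible in the TensorBoard projection of Fig. 12.7.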

4.2 Cash Flow Prediction Engine

The second microservice showcased is the Cash Flow Prediction Engine, which aims to accurately predict the cash inflows and outflows of each SME for the categories produced by our hybrid transaction categorization model. The engine’s objective and the nature of the task required transforming the data into a time-series representation, enabling experimentation with various general forecasting models as well as prevalent DL models. Thus, the transaction amounts were both resampled and aggregated per specific account and date. The forecasting models considered included a batch of SARIMA variations, which, however, were unable to cover the complex needs of our Cash Flow Prediction Engine. Facebook’s Prophet, a modular regression forecasting model with interpretable parameters that can be intuitively adjusted by analysts with domain knowledge about the time-series, as presented in [15], was also examined. However, the results were not as satisfying when predicting transaction inflows and outflows in our scenario, as opposed to [16], where the model is compared with the DeepAR model for forecasting food demand, showing promising results. The relevant plots are depicted in Fig. 12.8, showing the estimators on specific time-series; in each plot, the predicted mean value (green line) and the actual values (blue line) are depicted, along with green gradient areas denoting two confidence intervals (i.e., 50% and 95%). DeepAR, a DL approach implementing an RNN-based model close to the one described in [11], was chosen as the most suitable one, originating from the open-source GluonTS toolkit [16]. More specifically, the chosen model applies a methodology for producing accurate probabilistic forecasts, based on training an autoregressive Recurrent Neural Network model on many related time-series. RNNs have a concept of “memory” that helps them store the states or information of previous inputs to generate the next output of the sequence. The RNN, which predicts the mean and the variance of the underlying time-series, is coupled with a Monte Carlo simulation yielding results represented as a distribution. Moreover, the chosen model learns seasonal behavior patterns from the given covariates, which strengthens its forecasting capabilities. As expected, while configuring the model to our forecasting needs, various long-established time-series forecasting challenges arose. These can be summarized as (a) the “cold start” problem, referring to time-series with a small number of transactions; (b) the stationarity–seasonality trade-off, where it is assumed that predictable time-series must be stationary, without trend and seasonality factors present; (c) the existence of noisy data and observed outliers; and (d) the adequacy of the dataset’s length for applying ML/DL techniques. These challenges were addressed with the use of surrogate data, DeepAR model optimization, injected transaction thresholds, and hyper-parameter configuration, details which, however, exceed the showcase purposes of the Cash Flow Prediction Engine in this chapter. As for the evaluation scheme of the presented model, since cross-validation methods are a pitfall in time-series forecasting scenarios, as they may result in significant overlap between train and test data, the optimal approach is to simulate models in a “walk-forward” sequence, periodically retraining the model to incorporate the specific chunks of transaction data available at each point in time.
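The walk-forward evaluation scheme described above can be sketched as an expanding-window split generator. Parameter names and sizes are illustrative; the actual retraining cadence of the engine is not specified in this chapter.

```python
def walk_forward_splits(n_obs, initial_train, test_horizon):
    """Yield expanding-window (train_idx, test_idx) splits for a
    'walk-forward' evaluation: each fold retrains on all observations up to
    the cutoff and tests on the next horizon, so the test set never overlaps
    the training set and no future data leaks into training."""
    cutoff = initial_train
    while cutoff + test_horizon <= n_obs:
        train_idx = list(range(0, cutoff))
        test_idx = list(range(cutoff, cutoff + test_horizon))
        yield train_idx, test_idx
        cutoff += test_horizon  # roll forward; the model is retrained per fold

# Example: 100 daily observations, first 60 for initial training,
# then evaluate in 10-day chunks (illustrative sizes)
splits = list(walk_forward_splits(n_obs=100, initial_train=60, test_horizon=10))
```

Unlike shuffled cross-validation, every fold here respects temporal order, which matches how the deployed model is periodically retrained on the transaction data available at each point in time.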

Fig. 12.8
figure 8

Examples of DeepAR application for Cash Flow Prediction

4.3 Budget Prediction Engine

Having not only a good budget in place but also an effective real-time budget monitoring and adjustment capability is essential for the business success of an SME. The Budget Prediction Engine takes into consideration cash flow, benchmark, macroeconomic, and other available SME data, which are key to deriving smart budget targets. The derived smart budgets dynamically consider a changing environment and provide actionable insights on potentially required budget adjustments.

A microservice allows the user to set budgets per category and to allocate available resources. The set budgets will be evaluated not only against already scheduled invoices (inflows and outflows) but also against predictions derived from historical income and spending. The budget prediction engine is closely connected to the cash flow prediction model presented above, as it utilizes the same DeepAR model at its core.

4.4 Transaction Monitoring

A major objective of the BFM smart advisor is to reduce the administrative burden for the SME. The transaction monitoring engine supports this purpose by acting as a kind of transaction guard that identifies abnormal transactions: those that show an irregularly high transaction amount for the specific merchant, originate from a new merchant, signal a double-charging notification, or represent potentially fraudulent transactions. The transaction guard also “watches out” for transactions that could be of significant interest to the business, such as those relating to refunds or insurance claims.

4.5 KPI Engine

The KPI engine delivers key metrics that allow the SME to easily understand the state of its financial health and performance in real time. Besides the actual diagnosis, the engine comes not only with smart alerts that immediately point out anomalies but also with a comparison of actual versus best-practice target values, accompanied by a strong indication of how best-practice figures can potentially be achieved and/or current values be improved. Altogether, the KPI engine effectively guides the SME in its decision-making process and ultimately contributes toward a stable financial environment.

4.6 Benchmarking Engine

Benchmarking has long been underestimated within the SME sector and is often avoided due to its cost and time impact. The benchmarking engine focuses on bringing valuable comparison insights to the SME in a cost- and time-effective way by comparing the respective SME with other SMEs operating in the same or a similar environment and under the same attributes. As a result, the SME can locate key areas (e.g., operations, functions, or products) for improvement and take action accordingly to potentially increase its customer base, sales, and profit.

4.7 Invoice Processing (Payments and Receivables)

Invoices represent a vital input to other engines like the Cash Flow Prediction or KPI Engine. Furthermore, the retrieved invoice data are also utilized to derive VAT and other provisioning insights. Today, SMEs invest significant effort in the invoice monitoring, collection, and reconciliation process. The Invoice Engine supports these processes and, over and above that, can assist liquidity management by integrating with factoring services.

5 Conclusion

This chapter illustrates how the utilization of state-of-the-art data analytics tools and technologies, combined with the integration of available banking data and external data, can offer a new perspective on SME business financial management. The proposed mechanism offers automation and personalization, increasing the productivity of both SMEs and financial institutions. The provided BFM tools empower SMEs through a deeper understanding of their operation and their financial status, leading to a more data-driven decision-making model. Respectively, financial institutions can harness all available data and offer personalized value-added services to SMEs on top of their core business. The data generated by the BFM tools assist banks in better understanding their SME customers and their transaction behaviors, identifying their financial needs, and supporting the design of tailor-made financial products for the SMEs. Moreover, the conceptual architecture presented, which is based on the INFINITECH RA, enables new perspectives in the fields of data management, analytics, and testbed development, allowing the effortless introduction of new SME microservices and the refinement of existing ones, all aiming at increased business financial management capabilities.