1 Introduction

1.1 Background

In recent years, banks and other financial institutions have been accelerating their digital transformation. As part of this transformation, financial organizations produce unprecedented amounts of data about their financial and insurance processes, while using advanced digital technologies (e.g., big data, artificial intelligence (AI), Internet of Things (IoT)) to collect, analyze, and fully leverage the generated data assets [1]. Furthermore, recent regulatory developments (e.g., the 2nd Payment Services Directive (PSD2) in Europe) [2] facilitate the sharing of data across financial organizations in order to enable the development of innovative digital finance services based on novel business models. These business models aim at lowering the barriers for new market players (e.g., payment service providers (PSPs) in the PSD2 context) to develop and roll out new services in ways that boost customer satisfaction and create new revenue streams.

Given the proliferation of data assets in the finance sector and their sharing across financial organizations, the vast majority of digital transformation applications for the finance and insurance sectors are data-intensive. This holds for applications in different areas such as retail banking, corporate banking, payments, investment banking, capital markets, insurance services, and financial services security [3, 4]. These applications leverage very large datasets from legacy banking systems (e.g., customer accounts, customer transactions, investment portfolio data), which they combine with other data sources such as financial market data, regulatory datasets, social media data, real-time retail transactions, and more. Moreover, with the advent of IoT devices and applications (e.g., Fitbit devices, smartphones, smart home devices) [5], several FinTech and InsurTech applications take advantage of contextual data associated with finance and insurance services to offer better quality of service at a more competitive cost (e.g., personalized healthcare insurance based on medical devices and improved car insurance based on connected car sensors). Furthermore, alternative data sources (e.g., social media and online news) provide opportunities for new, more automated, personalized, and accurate services [6]. In addition, recent advances in data storage and processing technologies (including advances in AI and blockchain technologies [7]) provide new opportunities for exploiting the above-listed massive datasets and are stimulating more investments in digital finance and insurance. Overall, financial and insurance organizations take advantage of big data and IoT technologies in order to improve the accuracy and cost-effectiveness of their services, as well as the overall value that they provide to their customers. Nevertheless, despite early deployment instances, there are still many challenges that must be overcome before the full potential of big data and AI in the finance and insurance sectors can be leveraged.

1.2 Big Data Challenges in Digital Finance

1.2.1 Siloed Data and Data Fragmentation

One of the most prominent challenges faced by banks and financial organizations is the fragmentation of data across different data sources such as databases, data lakes, transactional systems (e.g., e-banking), and OLAP (online analytical processing) systems (e.g., customer data warehouses). This is the reason why financial organizations are creating big data architectures that provide the means for consolidating diverse data sources. As a prominent example, the Bank of England has recently established a “One Bank Data Architecture” based on a centralized data management platform. This platform facilitates the big data analytics tasks of the bank, as it permits analytics over significantly larger datasets [8]. The need for reducing data fragmentation has also been underlined by financial institutions following the financial crisis of 2008, when several financial organizations had no easy way to perform integrated risk assessments because different exposures (e.g., on subprime loans or exchange-traded funds (ETFs)) were siloed across different systems. Therefore, there is a need for data architectures that reduce data fragmentation and take advantage of siloed data in developing integrated big data analytics and machine learning (ML) pipelines, including deep learning (DL).

1.2.2 Real-Time Computing

Real-time computing refers to IT systems that must respond to changes within definite time constraints, usually in the order of milliseconds or seconds. In the finance and insurance sectors, real-time constraints apply where a response must be given in order to provide services to users or organizations and are in the order of seconds or less [9]. Examples range from banking applications to cybersecurity. Contrary to data-intensive applications in other industrial sectors (e.g., plant control in industrial automation), most real-world financial applications are not real time and are usually addressed by allocating more computing resources (e.g., central processing units (CPUs), graphics processing units (GPUs), memory) to the problem. However, in the case of ML/DL and big data pipelines, algorithms can take a significant amount of time and become useless in practical cases (e.g., when responses arrive too late to be used). In these cases, a quantitative assessment of the computing time of algorithms is needed in order to configure resources that provide acceptable response times.

1.2.3 Mobility

The digital transformation of financial institutions includes a transition to mobile banking [10]. This refers to the interaction of customers and financial organizations through mobile channels. Therefore, there is a need to support mobile channels when developing big data, AI, and IoT applications for digital finance, but also when collecting and processing input from users and customers.

1.2.4 Omni-channel Banking: Multiple Channel Management

One of the main trends in banking and finance is the transition from conventional multichannel banking to omni-channel banking [11]. The latter refers to seamless and consistent interactions between customers and financial organizations across multiple channels. Omni-channel banking/finance focuses on integrated customer interactions that comprise multiple transactions, rather than individual financial transactions. Therefore, banking architectures must provide the means for supporting omni-channel interactions by creating unified customer views and managing interactions across different channels. The latter requires the production of integrated information about the customer based on the consolidation of multiple sources. Big data analytics is the cornerstone of omni-channel banking, as it enables the creation of unified views of customers and the execution of analytical functions (including ML) that track, predict, and anticipate customer behaviors.

1.2.5 Orchestration and Automation: Toward MLOps and AIOps

Data-intensive applications are realized and supported by specialized IT and data administrators. Recently, more data scientists and business analysts have become involved in the development of such applications. Novel big data and AI architectures for digital finance and insurance must support data administrators and data scientists by providing the means for orchestrating data-intensive applications and managing them through easily created workflows and data pipelines. In this direction, there is a need for orchestrating functionalities across different containers. Likewise, there is a need for automating the execution of data administration and data pipelining tasks, which is also key to implementing novel data-driven development and operation paradigms like MLOps [12] and AIOps [13].

1.2.6 Transparency and Trustworthiness

During the last couple of years, financial organizations and customers of digital finance services have raised the issue of transparency in the operation of data-intensive systems as a key prerequisite for the wider adoption and use of big data and AI analytics systems in finance sector use cases. This is particularly important for use cases involving the deployment and use of ML/DL systems that operate as black boxes and are hardly understandable by finance sector stakeholders. Hence, a key requirement for ML applications in the finance sector is to be able to explain their outcomes. As a prominent example, a recent paper by the Bank of England [14] illustrates the importance of providing explainable and transparent credit risk decisions. Novel architectures for big data and AI systems must support transparency in ML/DL workflows based on the use of explainable artificial intelligence (XAI) techniques (e.g., [15, 16]).

The above list of challenges for big data systems in finance is not exhaustive. For instance, there are also non-technological challenges, such as the need to reengineer business processes in a data-driven direction and the need to upskill and reskill digital finance workers so that they can understand and leverage big data systems. Likewise, there are regulatory compliance challenges, stemming from the need to comply with many and frequently changing regulations. To address the technological challenges, the digital finance and insurance sectors could greatly benefit from a reference architecture (RA) [17] that provides a blueprint solution for developing, deploying, and operating big data systems.

1.3 Merits of a Reference Architecture (RA)

RAs are designed to facilitate the design and development of concrete technological architectures in the IT domain. They help reduce development and deployment risks through the use of a standard set of components and related structuring principles for their integration. When a system is designed without an RA, organizations may accumulate technical risks and end up with a complex and nonoptimal implementation architecture. Furthermore, RAs help improve the overall communication between the various stakeholders of a big data system. Overall, the value of RAs can be summarized in the following points:

  • Reduction of development and maintenance costs of IT systems

  • Facilitation of communication between important stakeholders

  • Reduction of development and deployment risks

The importance of reference architectures for big data and high-performance computing (HPC) systems has led global IT leaders (e.g., Netflix, Twitter, LinkedIn) to publicly present architectural aspects of their platforms [18]. Furthermore, over the years, various big data architectures for digital finance have been presented as well. Nevertheless, these architectures do not address the above-listed challenges of emerging digital finance and insurance systems. The main goal of this chapter is to introduce a novel reference architecture (RA) for big data and AI systems in digital finance and insurance, which is designed to address the presented challenges. The RA extends architectural concepts presented in earlier big data architectures for digital finance while adhering to the principles of the reference model of the Big Data Value Association (BDVA) [19], which has recently been transformed into the Data, AI and Robotics (DAIRO) Association. The presented RA is centered on the development and deployment of data analytics pipelines that leverage data from various sources. It is intended to assist financial organizations and other relevant stakeholders (e.g., integrators of big data solutions) in developing novel data-intensive systems (including ML/DL systems) for the finance and insurance sectors. The merits of the architecture are illustrated by means of some sample data science pipelines that adhere to the RA. Note also that many of the systems that are illustrated in subsequent chapters of the book have been developed and deployed based on the structuring principles and the list of reference components of the presented RA. In essence, this chapter provides a high-level overview of the RA, while the following chapters provide more information on how specific big data technologies and applications have leveraged the reference architecture.

1.4 Chapter Structure

The chapter is structured as follows:

  • Section 2 following this introduction reviews big data architectures for digital finance. It illustrates why and how existing architectures are mostly application-specific and not appropriate for providing broad coverage and support for big data and AI systems in digital finance.

  • Section 3 introduces the RA by means of complementary viewpoints, including a logical view, as well as development and deployment considerations.

  • Section 4 presents how the RA supports the development of a set of sample data pipelines (including ML/DL systems). Specifically, it illustrates that the RA can support the development of various data-intensive systems.

  • Section 5 concludes the chapter and connects it to other chapters of the book. As already outlined, the RA of this chapter has served as a basis for the design and development of various big data systems and technologies that are presented in subsequent chapters.

2 Related Work: Architectures for Systems in Banking and Digital Finance

2.1 IT Vendors’ Reference Architectures

Many RAs and solution blueprints for big data in digital finance have been recently proposed by prominent IT vendors. As a prominent example, IBM has introduced an RA for big data management as a layered architecture [20]. It comprises layers for data source collection, data organization and governance, as well as analysis and infusion of data. The data source layer comprises structured (e.g., relational database management systems (DBMSs), flat files), semi-structured (such as Extensible Markup Language (XML)), and unstructured (video, audio, digital, etc.) sources. The architecture specifies traditional data sources (e.g., enterprise banking systems, transactional processing systems) and new data sources (e.g., social media data and alternative data). The data organization and governance layer specifies different data management systems (e.g., data lakes, data warehouses), which support various types of data, including both data at rest and data in motion. The layer comprises real-time analytical processing functionalities, which support data transfer at a steady high-speed rate in order to support many zero-latency (“business real-time”) applications. Likewise, it comprises data warehousing functionalities that provide raw and prepared data for analytics consumption. Also, a set of shared operational data components own, rationalize, manage, and share important operational data for the enterprise. Moreover, the RA specifies a set of crosscutting functionalities, which are marked as “foundational architecture principles” and are present across all the above layers. These include security and multi-cloud management functionalities.

Microsoft’s RA for digital finance provides a logical banking technology architecture schema [21]. It enables high-value integration with other systems through a wide array of industry-standard integration interfaces and techniques (e.g., interfaces from ISO (International Organization for Standardization), BIAN (Banking Industry Architecture Network), and IFX (Interactive Financial eXchange)). In this way, it reduces the costs of managing and maintaining data-intensive solutions in the banking industry. Additionally, the Microsoft RA offers an industry-leading set of robust functionalities defined and exploited both in the banks’ data centers and in the cloud. These functionalities extend across the overall IT stack, from core operations to the end user, and constitute a valuable framework for applications like fraud detection. The Microsoft RA provides master data management (MDM), data quality services (DQS), and predefined BI semantic models (BISM), which complement business intelligence (BI) capabilities delivered via pre-tuned data warehouse configurations, near real-time analytics delivered through high-performance computing (HPC), and complex event processing (CEP). Overall, the Microsoft RA organizes and sustains massive volumes of transactions along with robust functionalities in bank data centers.

WSO2 offers a modular platform for the implementation of connected digital finance applications [22]. The philosophy of the platform is to divide complex systems into simpler individual subsystems that can be more flexibly managed, scaled, and maintained. It emphasizes flexibility given the need to comply with a rapidly changing landscape of business and regulatory requirements. From an implementation perspective, it comprises various applications, data management systems, and toolkits. The platform architecture comprises various operational systems that feed a data warehouse to enable analytical processing and data mining. On top of the data warehouse, several enterprise applications are implemented, including accounting applications and reporting applications. The WSO2 platform includes a private PaaS (platform as a service) module that supports the integration of many financial applications in the cloud. It also specifies a business process server, which orchestrates workflows of services across different business units of a financial organization, but also of services that span different banks and financial institutions. To support service-oriented architectures, WSO2 prescribes an enterprise service bus (ESB).

The Hortonworks Data Platform (HDP) supports a Hadoop stack for big data applications [23]. It is a centralized, enterprise-ready platform for the storage and processing of any kind of data. When combined with NoSQL databases (e.g., CouchDB), HDP can deliver significant business value and intelligence. The architecture boosts accuracy by bringing precise analytics insights to Hadoop. It also provides scalability and operational performance. The combination of NoSQL systems with HDP enables the implementation of many big data scenarios. For instance, it is possible to execute deep analytics when pulling data from CouchDB into Hadoop. Likewise, it is also possible to train machine learning models and then cache them in Couchbase. Big data analytics with HDP consists of three main phases, namely, data pooling and processing, business intelligence, and predictive analytics. HDP supports various finance sector use cases, including risk management, security, compliance, digital banking, fraud detection, and anti-money laundering. The combined use of HDP and NoSQL databases enables the integration of an operational data store (ODS) with analytics capabilities in the banking environment.

2.2 Reference Architectures of Standardization Organizations and Industrial Associations

In 2017, the Big Data Value Association (BDVA) introduced a general-purpose reference model (RM) that specifies the structure and building blocks of big data systems and applications. The model has horizontal layers encompassing aspects of the data processing chain and vertical layers addressing crosscutting issues (e.g., cybersecurity and trust).

The BDVA reference model is structured into horizontal and vertical concerns [19]:

  • Horizontal concerns cover specific aspects along the data processing chain, starting with data collection and ingestion and extending to data visualization. It should be noted that the horizontal concerns do not imply a layered architecture. As an example, data visualization may be applied directly to collected data (the data management aspect) without the need for data processing and analytics.

  • Vertical concerns address crosscutting issues, which may affect all the horizontal concerns. In addition, vertical concerns may also involve nontechnical aspects.

Even though the BDVA RM is not an RA in the IT sense, its horizontal and vertical concerns can be mapped to layers in the context of a more specific RA. The RM can serve as a common framework for locating big data technologies on the overall IT stack. It addresses the main concerns and aspects to be considered for big data systems.

Apart from the BDVA, the National Institute of Standards and Technology (NIST) has also developed a big data RA [24] as part of its big data program. The NIST RA introduces a conceptual model that is composed of five functional components: the data provider, the data consumer, the system orchestrator, the big data application provider, and the big data framework provider. Data flows, algorithm/tool transfers, and service usage between the components can be used to denote different types of interactions. Furthermore, the activities and functional component views of the RA can be used for describing a big data system, where roles, sub-roles, activities, and functional components within the architecture are identified.

One more reference architecture for data-intensive systems has been defined by the Industrial Internet Consortium (IIC). Specifically, the Industrial Internet Reference Architecture (IIRA) has been introduced to provide structuring principles for industrial Internet of Things (IIoT) systems [25]. The IIRA adapts the ISO architecture specification (ISO/IEC/IEEE 42010) [26] to the needs of IIoT systems and applications. It specifies a common architecture framework for developing interoperable IoT systems for different vertical industries. It is an open, standards-based architecture, which has broad applicability. Due to its broad applicability, the IIRA is generic, abstract, and high-level. Hence, it can be used to drive the structuring principles of an IoT system for finance and insurance, without however specifying its low-level implementation details. The IIRA presents the structure of IoT systems from four viewpoints, namely, business, usage, functional, and implementation viewpoints. The functional viewpoint specifies the functionalities of an IIoT system in the form of the so-called functional domains. Functional domains can be used to decompose an IoT system into a set of important building blocks, which are applicable across different vertical domains and applications. As such, functional domains are used to conceptualize concrete functional architectures. The implementation viewpoint of the IIRA is based on a three-tier architecture, which follows the edge/cloud computing paradigm. The architecture includes an edge, a platform, and an enterprise tier, i.e., components that are applicable to the implementation of IoT-based applications for finance and insurance. Typical examples of such applications are usage-based insurance applications, which leverage data from IoT devices in order to calculate insurance premiums for applications like vehicle and healthcare insurance.

2.3 Reference Architectures of EU Projects and Research Initiatives

In recent years, various EU-funded projects and related research initiatives have also specified architectures for big data systems. As a prominent example, the BigDataStack project has specified an architecture that drives resource management decisions based on data aspects, such as the deployment and orchestration of services [27]. A relevant architecture is presented in Fig. 1.1, including the main information flows and interactions between the key components.

Fig. 1.1 H2020 BigDataStack architecture for big data systems

As presented in the figure, raw data are ingested through the gateway and unified API component into the storage engine of BigDataStack, which enables storage and data migration across different resources. The engine offers solutions for both relational and non-relational data, an object store to manage data as objects, and a CEP engine to deal with streaming data processing. The raw data are then processed by the data quality assessment component, which enhances the data schema in terms of accuracy and veracity and provides an estimation of the quality of the corresponding datasets. Data stored in the object store are also enhanced with relevant metadata to track information about objects and their dataset columns.

These metadata can be used to show that an object is not relevant to a query and therefore does not need to be accessed from storage or sent through the network. The defined metadata are also indexed, so that during query execution objects that are irrelevant to the query can be quickly filtered out from the list of objects to be retrieved for query processing. This functionality is achieved through the data skipping component of BigDataStack. Furthermore, the overall storage engine of BigDataStack has been enhanced to enable adaptations during runtime (i.e., self-scaling) based on the corresponding loads.

Given the stored data, decision-makers can model their business workflows through the process modeling framework, which incorporates two main components. The first component is process modeling, which provides an interface for business process modeling and the specification of end-to-end optimization goals for the overall process (e.g., accuracy, overall completion time, etc.). The second component refers to process mapping. Based on the analytics tasks available in the Catalogue of Predictive and Process Analytics and the specified overall goals, the mapping component identifies analytics algorithms that can realize the corresponding business processes. The outcome of the component is a model in a structural representation, e.g., a JSON file that includes the overall workflow and the mapping of business processes to specific analytics tasks, considering several (potentially concurrent) overall objectives for the business workflow. Subsequently, through the Data Toolkit, data scientists design, develop, and ingest analytics processes/tasks into the Catalogue of Predictive and Process Analytics. This is achieved by combining a set of available or under-development analytics functions into a high-level definition of the user’s application. For instance, they define executables/scripts to run, as well as the execution endpoints per workflow step. Data scientists can also declare input/output data parameters, analysis configuration hyperparameters (e.g., the k in a k-means algorithm), execution substrate requirements (e.g., CPU, memory limits, etc.) as service-level objectives (SLOs), as well as potential software packages/dependencies (e.g., Apache Spark, Flink, etc.).

As another example, the H2020 BOOST 4.0 project has established an RA for big data systems in the manufacturing sector [28]. The RA consists of a number of layers at its core, alongside a factory dimension and a manufacturing entity dimension [29].

The core layers represent a collection of functionalities/components performing a specific role in the data processing chain and consist of the integration layer, the information and core big data layer, the application layer, and the business layer. The integration layer facilitates the access and management of external data sources such as PLM (product lifecycle management) systems, production data acquisition systems, open web APIs, and so on. The information and core big data layer is composed of components belonging to four different sublayers:

  • Data management: groups together components facilitating data collection, preparation, curation, and linking

  • Data processing: groups together architectures that focus on data manipulation

  • Data analytics: groups together components that support descriptive, diagnostic, predictive, and prescriptive data analysis

  • Data visualization: groups together algorithm components that support data visualization and user interaction

The application layer represents a group of components implementing application logic that supports specific business functionalities and exposes the functionality of lower layers through appropriate services.

The business layer forms the overall manufacturing business solution across the five BOOST 4.0 domains (networked commissioning and engineering, cognitive production planning, autonomous production automation, collaborative manufacturing networks, and full equipment and product availability) and five process life cycle stages (smart digital engineering, smart production planning and management, smart operations and digital workplace, smart connected production, smart maintenance and service). Alongside the core layers, there are a number of other crosscutting aspects that affect all layers:

  • Communication aims to provide mechanisms and technologies for reliable transmission and receipt of data between the layers.

  • Data-sharing platforms allow data providers to share their data as a commodity, covering specific data, for a predefined period of time, and with a guarantee of reversibility at the end of the contract.

  • Development engineering and DevOps cover tool chains and frameworks that significantly increase productivity in terms of developing and deploying big data solutions.

  • Standards cover the different standard organizations and technologies used by BOOST 4.0.

  • Cybersecurity and trust cover topics such as device and application registration, identity and access management, data governance, data protection, and so on.

2.4 Architectures for Data Pipelining

Managing the flow of information is an integral part of every enterprise looking to generate value from its data. This process can be complicated due to the number of data sources and the volume of data. In these situations, pipelines can help, as they simplify the flow of data by eliminating manual steps and automating the process. Pipeline architectures are simple and powerful: They are inspired by the Unix technique of connecting the output of an application to the input of another via pipes on the shell. They are suitable for applications that require a series of independent computations to be performed on data [30]. Any pipeline consists of filters connected by pipes. A filter is a component that performs some operation on input data to transform them into output data. The latter are passed to other component(s) through a pipe. The pipe is a directional connector that passes a stream of data from one filter to the next. A pump is a data source and is the first element of the pipeline.

A pump could be, for example, a static file, a data lake, a data warehouse, or any device continuously creating new data. Data can be ingested in two ways: batch ingestion and streaming ingestion. With batch ingestion, data are extracted following an external trigger and processed as a whole. Batch processing is mostly used for data transfer and is more suitable in cases where acquiring exhaustive insights is more important than getting faster analytics results. With streaming ingestion, in contrast, sources transfer data units one by one. Stream processing is suitable in cases where real-time data are required for applications or analytics.

Finally, the sink is the last element of the pipeline. It could be another file, a database, a data warehouse, a data lake, or a screen. A pipeline is often a simple sequence of components. However, its structure can also be very complex: in fact, in principle, a filter can have any number of input and output pipes.

The pipeline architecture has various advantages: (i) It makes it easy to understand the overall behavior of a complex system, since it is a composition of behaviors of individual components. (ii) It supports the reuse of filters while easing the processes of adding, replacing, or removing filters from the pipeline. This makes a big data system easy to maintain and enhance. (iii) It supports concurrent execution that can boost performance and timeliness.
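To make the pipes-and-filters pattern concrete, the following minimal sketch implements a pump, two filters, and a sink as composable Python generators. It is an illustrative example only, not part of any platform discussed in this chapter, and the record fields and filter logic are hypothetical.

```python
import csv
import io

def pump(csv_text):
    """Pump: emit one record (dict) per CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row

def filter_valid(records):
    """Filter 1: drop records with a non-positive amount."""
    for r in records:
        if float(r["amount"]) > 0:
            yield r

def filter_enrich(records):
    """Filter 2: add a derived field (amount expressed in cents)."""
    for r in records:
        r["amount_cents"] = int(float(r["amount"]) * 100)
        yield r

def sink(records):
    """Sink: here simply print; in practice write to a store or dashboard."""
    for r in records:
        print(r["tx_id"], r["amount_cents"])

# The pipeline is the composition of pump, filters, and sink; the "pipes" are
# simply the generator connections, and records flow lazily, one at a time.
raw = "tx_id,amount\n1,10.50\n2,-3.00\n3,7.25\n"
sink(filter_enrich(filter_valid(pump(raw))))
```

Because each filter only depends on its input stream, filters can be reused, reordered, or replaced without touching the rest of the pipeline, which is exactly the maintainability benefit described above.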

Data pipelines carry raw data from different data sources to data warehouses for data analysis and business intelligence. Developers can build data pipelines by writing code and interfacing with SaaS (software-as-a-service) platforms. In recent years, many data analysts prefer using data pipeline as a service (DPaaS) offerings, which do not require coding. Hence, when adopting data pipelines, businesses can either build their own or use a DPaaS. Developers write, test, and maintain the code required for a data pipeline using different frameworks and toolkits, for example, workflow management tools like Airflow and Luigi. Likewise, solutions like KNIME [31] enable the handling of pipelines without the need for coding (i.e., “codeless” pipeline development).

Apache Airflow is an open-source tool for authoring, scheduling, and monitoring workflows [32]. Airflow can be used to author workflows as directed acyclic graphs (DAGs) of tasks. Its scheduler executes tasks on an array of workers while following the specified dependencies. Its rich command line utility makes it easy to perform complex operations on DAGs. Moreover, its rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. Airflow also provides a simple query interface to write SQL and get results quickly, as well as a charting application for visualizing data.
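As an illustration, the sketch below defines a minimal three-step Airflow DAG (extract, transform, load). The DAG name and the Python callables are hypothetical placeholders, and parameter names may vary slightly across Airflow versions.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical placeholder callables for the three pipeline steps.
def extract():
    print("pull raw transactions from the source system")

def transform():
    print("clean and aggregate the transactions")

def load():
    print("write the prepared dataset to the warehouse")

with DAG(
    dag_id="sample_finance_pipeline",   # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies define the directed acyclic graph: extract -> transform -> load.
    t_extract >> t_transform >> t_load
```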

Luigi is a Python package for building complex pipelines of batch jobs [33]. The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. It is suitable for chaining many tasks and automating them. The tasks can be anything but are typically long-running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, and more. Luigi helps to stitch many tasks together, where each task can be a Hive query, a Hadoop job in Java, a Spark job in Scala or Python, a Python snippet, dumping a table from a database, or anything else. It makes it easy to build up long-running pipelines that comprise thousands of tasks and take days or weeks to complete. Since Luigi takes care of a lot of the workflow management, the user can focus on the tasks themselves and their dependencies. Luigi also provides a toolbox of several common task templates. It includes support for running Python MapReduce jobs in Hadoop, as well as Hive and Pig jobs. It also comes with file system abstractions for the Hadoop Distributed File System (HDFS) and local files, which ensure that file system operations are atomic.
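The following minimal sketch shows Luigi's task model: each task declares its dependencies (`requires`), its output target (`output`), and its work (`run`). The file names and task logic are hypothetical placeholders.

```python
import luigi

class ExtractTransactions(luigi.Task):
    """First task: produce a raw dump (here, a hypothetical placeholder file)."""

    def output(self):
        return luigi.LocalTarget("raw_transactions.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("tx_id,amount\n1,10.50\n2,7.25\n")

class AggregateTransactions(luigi.Task):
    """Second task: depends on the extract step and writes an aggregate value."""

    def requires(self):
        return ExtractTransactions()

    def output(self):
        return luigi.LocalTarget("total_amount.txt")

    def run(self):
        with self.input().open("r") as f:
            rows = f.read().splitlines()[1:]   # skip the CSV header
        total = sum(float(r.split(",")[1]) for r in rows)
        with self.output().open("w") as f:
            f.write(str(total))

if __name__ == "__main__":
    # Run the whole dependency chain with the local scheduler.
    luigi.build([AggregateTransactions()], local_scheduler=True)
```

Because targets are checked before tasks run, already-completed steps are skipped on re-execution, which is what makes long pipelines of thousands of tasks manageable.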

The KNIME Analytics Platform is open-source software for designing data science workflows and reusable components accessible to everyone [31]. It is very intuitive, as it enables users to create visual workflows with a drag-and-drop-style graphical interface. The KNIME Hub offers a library of components that enable the following:

  (i) Blend data from any source, including simple text formats (CSV, PDF, XLS, JSON, XML, etc.), unstructured data types (images, documents, networks, molecules, etc.), or time series data. It is also possible to connect to a host of databases and data warehouses.

  (ii) Shape data by deriving statistics (mean, quantiles, and standard deviation), applying statistical tests to validate a hypothesis, or performing correlation analysis and more within workflows. Many components are available to clean, aggregate, sort, filter, and join data either on a local machine, in a database, or in distributed big data environments. In addition, features can be extracted and selected to prepare datasets for machine learning with genetic algorithms, random search, or backward and forward feature elimination.

  (iii) Apply ML/DL techniques by building machine learning models for tasks like classification, regression, dimension reduction, or clustering, using advanced algorithms including deep learning, tree-based methods, and logistic regression.

  (iv) Optimize model performance based on hyperparameter optimization, boosting, bagging, stacking, or building complex ensembles.

  (v) Validate models by applying performance metrics such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC), while performing cross-validation to guarantee model stability.

  (vi) Build and apply explainable AI (XAI) models like LIME [15] and SHAP/Shapley values [34].

  (vii) Discover and share insights, based on advanced and versatile visualizations.

2.5 Discussion

The previous paragraphs presented a wide range of reference architectures and reference models for big data systems, including architectures developed for the digital finance and insurance sectors. The presented architectures illustrate the main building blocks of a reference architecture for big data and AI in digital finance, such as interfaces to data sources, data streaming modules, modules for handling and processing data at rest, data warehouses and analytics databases, data integration and interoperability modules, as well as data visualization components. Moreover, they outline the main crosscutting functions such as data governance and cybersecurity functions. Furthermore, they illustrate implementation architectures based on multitier systems such as edge/cloud systems comprising an edge, a platform, and a cloud tier. They also outline powerful concepts and tools, such as the data pipelining concept and frameworks that support the development and deployment of data pipelines.

Despite this rich set of concepts, there is no single architecture that could flexibly support the development of the most representative data-intensive use cases in the target sectors. Most of the presented architectures either target specific use cases of the sector (e.g., fraud detection and anti-money laundering) or come at a very abstract level that makes them applicable to multiple sectors. Moreover, most of them specify a rigid structure for big data applications, which limits the flexibility offered to banks, financial organizations, and integrators of big data applications in finance and insurance. Therefore, there is a need to introduce a more flexible approach based on the popular pipelining concept: Instead of prescribing a rigid (and monolithic) structure for big data and AI applications in digital finance, the following sections opt to define these applications as collections of data-driven pipelines. The latter are built from a set of well-defined components, spanning different areas such as data preprocessing, machine learning, data anonymization, data filtering, data virtualization, and more. The reference architecture that is introduced in the following section provides a set of layered architectural concepts and a rich set of digital building blocks, which enable the development of virtually any big data or AI application in digital finance and insurance. This offers increased flexibility in defining data-driven applications in the sector, in ways that subsume most of the rigid architectures outlined in earlier paragraphs.

3 The INFINITECH Reference Architecture (INFINITECH-RA)

3.1 Driving Principles: INFINITECH-RA Overview

The purpose of an RA for big data systems in digital finance and insurance is to provide a conceptual and logical schema for solutions to a very representative class of problems in this sector. The H2020 INFINITECH project develops, deploys, and validates over 15 novel data-driven use cases in digital finance and insurance, which constitute a representative set of use cases. The development of the RA that is presented in the following paragraphs (i.e., the INFINITECH-RA) is driven by the requirements of these applications. Likewise, it has been validated and used to support the actual development of these use cases, some of which are detailed in later chapters of the book.

The INFINITECH-RA is the result of the analysis of a considerable number of use cases, including their requirements (i.e., user stories) and constraints (e.g., regulatory, technological, organizational). Furthermore, the INFINITECH-RA considers state-of-the-art technologies and similar architectures in order to provide best practices and blueprints that enable relevant stakeholders (e.g., end users in financial organizations, business owners, designers, data scientists, developers) to develop, deploy, and operate data-driven applications. The INFINITECH-RA is largely based on the concept of data pipelines. Therefore, it can be used to understand how data can be collected and how models and technologies should be developed, distributed, and deployed.

The INFINITECH-RA is inspired by state-of-the-art solutions and technologies. It is based on the BDVA RM in order to provide an abstraction that solves a general class of use cases, including the ones of the INFINITECH project. It exploits microservice technologies and DevOps methodologies. Specifically, data components in the INFINITECH-RA are encapsulated in microservices in line with a loosely coupled approach. The latter is preferred over tightly coupled, intertwined applications. In the context of the INFINITECH-RA, data can be pipelined into different microservices to perform different types of processing, while data can be streamed or stored in data stores. In line with big data management best practices, data movements are limited as much as possible, while crosscutting services are specified to provide value-added functionalities across all the different layers of the architecture.

To understand the development and use of the INFINITECH-RA, the following concepts are essential:

  • Nodes: In the INFINITECH-RA, a node is a unit of data processing. Every node exposes interfaces (APIs) for data management, in particular for consuming and producing data (i.e., IN and OUT interfaces). From an implementation perspective, nodes are microservices that expose REST APIs (a minimal sketch of such a node is provided at the end of this subsection).

  • BDVA RM Compliance: The INFINITECH-RA layers can be mapped to layers of the BDVA RM. Each node belongs to some layer of the RA.

  • Pipeline Concept: Nodes with IN/OUT interfaces can be stacked up to form data pipelines, i.e., a pipelining concept is supported. Moreover, nodes are loosely coupled in the RA, i.e., they are not connected until a pipeline is created. Data can flow in all directions (e.g., an ML node can push back data into a data layer).

  • Vertical Layers of Nodes: Every node stacks with other compatible nodes, i.e., whether two nodes can be connected or not depends on their IN/OUT interfaces. Therefore, the RA can be segmented into vertical layers (called bars) that group compatible nodes. A node can belong to one or more vertical bars.

In line with the above concepts, the INFINITECH-RA is:

  • Layered: Layers are a way of grouping nodes in the same way as the BDVA RM has “concerns.”

  • Loosely Coupled: There are no rules to connect nodes in a predefined way or in a rigid stack.

  • Distributed: Computing and nodes can be physically deployed and distributed anywhere, e.g., on premise, on a cloud infrastructure, or across multiple clouds.

  • Scalable: Nodes provide the means for distributing computing at edges, at HPC nodes (GPUs), or centrally.

  • Multi-workflow: The INFINITECH-RA allows for simple pipelines and/or complex data flows.

  • Interoperable: Nodes can be connected to other nodes with compatible interfaces in a way that boosts interoperability.

  • Orchestrable: Nodes can be composed in different ways, allowing the creation of virtually infinite combinations.
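To make the node concept more concrete, the following minimal sketch implements a single RA node as a microservice with REST IN/OUT interfaces. The use of FastAPI, the anonymization logic, and all field names are illustrative assumptions, not components prescribed by the INFINITECH-RA.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="anonymizer-node")  # hypothetical node of the security layer

class Record(BaseModel):
    customer_id: str
    iban: str
    balance: float

OUTBOX: list[Record] = []  # in-memory buffer standing in for the node's output store

@app.post("/in")  # IN interface: upstream nodes push data here
def consume(record: Record) -> dict:
    # Illustrative anonymization step: mask the IBAN before forwarding the record.
    masked = Record(
        customer_id=record.customer_id,
        iban="****" + record.iban[-4:],
        balance=record.balance,
    )
    OUTBOX.append(masked)
    return {"status": "accepted"}

@app.get("/out")  # OUT interface: downstream nodes pull processed data from here
def produce() -> list[Record]:
    return OUTBOX
```

A pipeline is then formed by wiring the OUT interface of one node to the IN interface of the next, typically by an orchestrator that deploys each node as a container.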

3.2 The INFINITECH-RA

The INFINITECH reference architecture has been specified based on the “4+1” architectural view model [35]. The “4+1” architectural view model is a methodology for designing software architectures based on five concurrent “views.” These views address the concerns of the different stakeholders who deal with the platform and the architecture, from the management, development, and user perspectives. In this context, the logical viewpoint of the RA illustrates the range of functionalities or services that the system provides to the end users. The following paragraphs illustrate the logical view of the architecture while providing some development and deployment considerations as well.

3.2.1 Logical View of the INFINITECH-RA

The logical view of the RA is presented in Fig. 1.2. Rather than identifying and specifying functional blocks and how they are interconnected, the INFINITECH-RA is defined by a set of “principles” for building pipelines or workflows. These principles constitute the guidelines that drive specific implementations of digital finance and insurance solutions. At a high level, the RA can be seen as a pipelined mapping of nodes referring to the BDVA reference model and crosscutting layers.

Fig. 1.2 Logical view of the INFINITECH-RA and mapping to BDVA RM

Figure 1.3 provides a high-level logical view of a specific solution instance that is structured according to the INFINITECH-RA. The generic nodes that are depicted in the figure are only examples and do not refer to any specific tool or application. They represent generic services that belong to a general class of applications performing the functionality of the corresponding layer in the BDVA RM. Hence, the RA is generic and makes provision for flexibly specifying and implementing nodes to support the wide array of requirements of data-intensive applications in digital finance.

Fig. 1.3 Instance of the INFINITECH-RA: logical view of components

In this logical view of the INFINITECH-RA, the various components are grouped into the following layers:

  • Data Sources: This layer comprises the various data sources (e.g., database management systems, data lakes holding unstructured data, etc.) of a big data application.

  • Ingestion: This is a data management layer that is associated with importing, semantically annotating, and filtering data from the data sources.

  • Security: This layer manages data clearance before any further storage or processing. It provides functions for security, anonymization, cleaning of data, and more.

  • Management: This layer is responsible for data management, including persistent data storage in the central repository and data processing enabling advanced functionalities such as hybrid transactional and analytical processing (HTAP).

  • Analytics: This layer comprises the ML, DL, and AI components of the big data system.

  • Interface: This layer defines the data to be produced and provided to the various user interfaces.

  • Crosscutting: This layer comprises service components that provide functionalities orthogonal to the data flows such as authentication, authorization, and accounting.

  • Data Model: This is a crosscutting layer for modeling and semantics of data in the data flow.

  • Presentation and Visualization: This layer is associated with the presentation of the results of an application in some form like a dashboard.

The RA does not impose any pipelined or sequential composition of nodes. However, each different layer and its components can be used to solve specific problems of the use case.

3.2.2 Development Considerations

From a development viewpoint, the RA complies with a microservice architecture implementation, including services that interact through REST APIs. All microservices run in containers (e.g., Docker) managed by an orchestration platform (e.g., Kubernetes). Development and testing activities are based on a DevOps methodology, which emphasizes a continuous integration (CI) approach. Every time a developer pushes changes to the source code repository, a CI server automatically triggers a new build of the component and deploys the updated container to an integration environment on Kubernetes. This enables continuous testing of the various components against an up-to-date environment, which speeds up development and avoids painful integration problems at the end of the cycle. Continuous delivery (CD) is an extension of that process, which automates the release process so that new code can be deployed to target environments, typically test environments, in a repeatable and automated fashion.

This process is enhanced based on a DevSecOps approach, which integrates security into the software development life cycle from the beginning of the development process. With DevSecOps, security is considered throughout the process and not just as an afterthought at the end of it, which produces a safer and more solid design. Moreover, this approach avoids costly release delays and rework due to non-compliance, as issues are detected early in the process. DevSecOps introduces several security tools into the CI/CD pipeline, so that different kinds of security checks are executed continuously and automatically, giving developers quick feedback on whether their latest changes introduce a vulnerability that must be corrected.

To facilitate communication and collaboration between teams and to improve model tracking, versioning, monitoring, and management, there is also a need for standardizing and automating the machine learning process. In practice, this is often very complicated, because ML works in heterogeneous environments: For example, ML models are often developed on a data scientist’s notebook, ML training is done in the cloud to take advantage of available resources, while execution of the software in production takes place on premise. The first step toward establishing an MLOps approach for INFINITECH-RA compliant applications is the standardization of the various development environments. In this direction, the Kubernetes platform along with Docker containers provides the abstraction, scalability, portability, and reproducibility required to run the same piece of software across different environments. As a second step, it is necessary to standardize the workflow used for constructing and building ML models. In the case of the development of INFINITECH-RA compliant applications, software like Kubeflow enables the development of models and boosts their portability. ML workflows are defined as Kubeflow pipelines, which consist of data preparation, training, testing, and execution steps. Each step is implemented in a separate container, and the output of each step is the input to the following step. Once compiled, such a pipeline is portable across environments.
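As an illustration, the sketch below defines a two-step pipeline with the Kubeflow Pipelines (kfp) Python SDK v2. The component names and logic are hypothetical placeholders, and details may differ between kfp SDK versions.

```python
from kfp import dsl, compiler

@dsl.component
def prepare_data() -> str:
    # Hypothetical preparation step; in practice this would read from a feature store.
    return "prepared-dataset-reference"

@dsl.component
def train_model(dataset: str) -> str:
    # Hypothetical training step; each component runs in its own container.
    print(f"training on {dataset}")
    return "model-reference"

@dsl.pipeline(name="sample-training-pipeline")
def training_pipeline():
    prep_task = prepare_data()
    train_model(dataset=prep_task.output)  # the output of one step feeds the next

if __name__ == "__main__":
    # Compile the pipeline into a portable definition that can run on a Kubeflow cluster.
    compiler.Compiler().compile(
        pipeline_func=training_pipeline,
        package_path="training_pipeline.yaml",
    )
```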

3.2.3 Deployment Considerations

From a deployment perspective, the INFINITECH-RA leverages a microservice architecture. Microservices are atomic components, which can be developed and updated individually while being part of a wider end-to-end solution. This eases the development of individual components, yet it can complicate the management of a complete solution that comprises multiple microservices. In this context, the INFINITECH-RA provides a tool for managing the life cycle of microservices, including their deployment, scaling, and management. This is based on Kubernetes and uses two main concepts:

  • Namespaces, i.e., logical groupings of sets of Kubernetes objects to which it is possible to apply policies, for example, limits on how many hardware resources can be consumed by all objects, or constraints on whether the namespace can be accessed by, or can access, other namespaces.

  • PODs, the simplest units among Kubernetes objects. A POD typically encapsulates one container, yet in the case of complex applications it can encapsulate more than one container. Each POD has its own storage resources, a unique network IP, an access port, and options related to how the container(s) should run.

The Kubernetes namespace enables the logical isolation of objects (mainly PODs) from other namespaces.

An environment that enables testing of services/components as a separate namespace in the same Kubernetes cluster is provided to support CI/CD processes. This development environment is fully integrated with the CI/CD tools. Moreover, automated replication functionalities are provided, which facilitate the creation of a similar environment using appropriate scripts. In this way, an “infrastructure as code” paradigm is supported.
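For illustration, the following minimal sketch uses the official Kubernetes Python client to create a namespace and attach a resource quota to it, mirroring the namespace concept described above. The namespace name and limits are hypothetical, and in practice the same result is usually achieved declaratively with manifests applied by the CI/CD pipeline (the "infrastructure as code" approach mentioned above).

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g., a developer workstation).
config.load_kube_config()
v1 = client.CoreV1Api()

# Hypothetical namespace for one isolated testing environment.
namespace = client.V1Namespace(metadata=client.V1ObjectMeta(name="uc-testing"))
v1.create_namespace(body=namespace)

# Attach a quota so that all objects in the namespace stay within agreed limits.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="uc-testing-quota", namespace="uc-testing"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.cpu": "4", "requests.memory": "8Gi", "pods": "20"}
    ),
)
v1.create_namespaced_resource_quota(namespace="uc-testing", body=quota)
```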

4 Sample Pipelines Based on the INFINITECH-RA

The following subsections present some sample reference solutions for common use case scenarios.

4.1 Simple Machine Learning Pipeline

The reference architecture enables the design and execution of classical machine learning pipelines. Typical machine learning applications are implemented in line with popular data mining methods like CRISP-DM (cross-industry standard process for data mining) [36]. CRISP-DM includes several phases, including data preparation, modeling, and evaluation. In a typical machine learning pipeline, data are acquired from various sources and prepared in line with the needs of the target machine learning model. The preparation of the data includes their segmentation into training data (i.e., data used to train the model) and test data (i.e., data used to test the model). The model is usually evaluated against the requirements of the target business application. Figure 1.4 illustrates how a typical machine learning pipeline can be implemented/mapped to the layers and functionalities of the INFINITECH-RA, and a minimal code-level sketch of such a pipeline is provided after the figure. Specifically, the following layers/parts are envisaged:

  • Data Sources: At this level reside the data sources of the big data/AI application. These may be of different types, including databases, data stores, data lakes, files (e.g., spreadsheets), and more.

  • Ingestion: At this level of the pipeline, data are accessed based on appropriate connectors. Depending on the type of the data sources, other INFINITECH components for data ingestion can be used such as data collectors and serializers. Moreover, conversions between the data formats of the different sources can take place.

  • Data Processing: At this level, data are managed. Filtering functions may be applied, and data sources can be joined toward forming integrated datasets. Likewise, other preprocessing functionalities may be applied, such as the partitioning of datasets into training and test segments. Furthermore, if needed, this layer provides the means for persisting data at scale, but also for accessing it through user-friendly logical query interfaces. The latter functionalities are, however, not depicted in the figure.

  • Analytics: This is the layer where the machine learning functions are placed. A typical ML application entails the training of machine learning models based on the training datasets, as well as the execution of the learned model on the test dataset. It may also include the scoring of the model based on the test data.

  • Interface and Visualization: These are the layers where the model and its results are visualized in line with the needs of the target application. For example, in Fig. 1.4, a portfolio construction proposal is generated and presented to the end user.

Fig. 1.4 Simple machine learning pipeline based on the INFINITECH-RA
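The following minimal sketch illustrates how such a pipeline looks at the code level, using scikit-learn. The CSV file, column names, and model choice are hypothetical and merely stand in for the ingestion, processing, and analytics layers described above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Ingestion: read a (hypothetical) consolidated dataset produced by the ingestion layer.
data = pd.read_csv("customer_features.csv")

# Processing: simple filtering and partitioning into training and test segments.
data = data.dropna()
X = data.drop(columns=["churned"])   # feature columns (assumed names)
y = data["churned"]                  # target label (assumed name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Analytics: train the model on the training segment and score it on the test segment.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Interface/visualization: expose the evaluation result to a dashboard or report.
print(f"Test AUC: {auc:.3f}")
```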

4.2 Blockchain Data-Sharing and Analytics

The INFINITECH-RA can also facilitate the specification, implementation, and execution of blockchain-empowered scenarios for the finance sector, where, for example, several discrete entities (including banks, insurance companies, clients, etc.) may be engaged in data-sharing activities. Such scenarios can empower financial organizations to update customer information as needed while being able to access an up-to-date picture of the customer’s profile. This is extremely significant for a plethora of FinTech applications, including yet not limited to Know Your Customer/Business, Credit Risk Scoring, Financial Products Personalization, Insurance Claims Management, and more. Let’s consider a use case in which Financial Organization A (e.g., a bank or an insurance company) wishes to be granted access to a customer of another financial organization (e.g., another bank within the same group or a different entity altogether). Suppose that legal constraints are properly handled (e.g., consents for the sharing of the data have been granted, and all corresponding national and/or European regulations such as GDPR are properly respected and abided by). Suppose also that all data connectors (required for the retrieval of the data from the raw information sources) and data curation services are in place, enabling the collection and preprocessing of the raw information sources; such services include the cleaner, required for cleaning the raw data sources; the anonymizer, required for anonymizing the same information sources; and the harmonizer, required for mapping the cleaned and properly anonymized information sources to the Common INFINITECH Information Model. Under these assumptions, the following flow is envisaged (Fig. 1.5):

Fig. 1.5 Blockchain data-sharing and analytics pipeline

  • The blockchain encryptor encrypts the properly preprocessed data so that they can be saved in the private collection of Organization A (a minimal code-level sketch of this encrypt-and-hash pattern is provided after the flow).

  • The blockchain authenticator authenticates (the user of) Organization A, so that access to update the ledger is granted.

  • Once authenticated, the blockchain writer inserts the encrypted data into the private collection of Organization A, and the hash of the data is submitted to the ledger.

  • Organization B requests access to the data of a customer from Organization A.

  • The blockchain authenticator authenticates (the user of) Organization B that initiates the request so as to grant or decline the request to access private-sensitive data.

  • Once (the user of) Organization B is authenticated, the smart contract executor translates the submitted query and queries the private collection of Organization A.

  • The blockchain authenticator authenticates (the user of) Organization B, so that access to read the ledger is granted.

  • The blockchain reader retrieves the queried information from Organization A’s private collection.

  • The smart contract executor generates a new transaction and triggers the blockchain authenticator so as to authenticate (the user of) Organization A, in order to grant access to update the ledger.

  • Once authenticated, the blockchain writer submits the encrypted data to the ledger, and a new record on the same ledger is created, containing the metadata of the contract (organizations involved, date created, metadata of the encrypted data transmitted, validity of the contract) and the actual encrypted data.

  • Organization A sends the decryption key to Organization B out of band, and the blockchain decryptor decrypts the queried data.
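The sketch below illustrates the encrypt-and-hash pattern underlying the first steps of this flow: data are symmetrically encrypted before being stored in a private (off-ledger) collection, while only the hash of the payload is anchored on the shared ledger. The use of Fernet encryption and the dictionary-based stand-ins for the private collection and the ledger are illustrative assumptions; a real deployment would rely on the platform's blockchain components (e.g., a permissioned ledger with private data collections).

```python
import hashlib
import json

from cryptography.fernet import Fernet

# Hypothetical customer record shared between Organization A and Organization B.
record = {"customer_id": "C-1001", "kyc_status": "verified", "risk_score": 0.27}

# Blockchain encryptor: symmetric encryption before anything leaves Organization A.
key = Fernet.generate_key()            # later shared with Organization B out of band
payload = json.dumps(record, sort_keys=True).encode()
ciphertext = Fernet(key).encrypt(payload)

# Blockchain writer: the encrypted data go to the private collection,
# while only a hash of the payload is anchored on the shared ledger.
private_collection = {"C-1001": ciphertext}          # stand-in for off-ledger storage
ledger = [{"customer_id": "C-1001",
           "payload_hash": hashlib.sha256(ciphertext).hexdigest()}]

# Blockchain reader/decryptor on Organization B's side: verify integrity, then decrypt.
retrieved = private_collection["C-1001"]
assert hashlib.sha256(retrieved).hexdigest() == ledger[0]["payload_hash"]
print(json.loads(Fernet(key).decrypt(retrieved)))
```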

4.3 Using the INFINITECH-RA for Pipeline Development and Specification

The INFINITECH-RA provides a blueprint for constructing big data solutions according to standards-based flows and components. The development of any use case based on INFINITECH-RA can be carried out based on the following considerations:

  • Use Case Design: A use case is considered as a transformation from some data sources to some data destinations. This is called the “stretching phase,” where a pipeline or workflow is defined from the different sources to the resulting data. The resulting INFINITECH-RA-based solution realizes the total transformation of the sources into the destinations once the appropriate workflow is defined and implemented.

  • Data Access and Ingestion: Each data source is ingested via a first layer of data management components (e.g., serializers and database connectors). Data sources are accessible through different types of data technologies (e.g., SQL, NoSQL, IoT streams, blockchains’ data), but also as raw data such as text, images, videos, etc. Connectors must be provided to support the full variety of the supported target data store management systems.

  • Data Preprocessing: Data must not be stored in the INFINITECH platform data store unless they are properly preprocessed and “cleared.” Clearing can involve filtering, deletion, anonymization, encryption of raw data, and more. The “clearing” can be managed via the security layer of the RA model or the crosscutting services.

  • Data Storage: Most use cases require storing data in some data store. In the case of big data applications, data stores that handle big data must be used. Moreover, these data stores must be able to scale out to support the diversity of data sources and workloads.

  • Analytics: Analytics work on top of the data (stored or streamed); they consume data and might produce new data, which are stored in some data repository as well.

Nevertheless, real-world use cases can be much more complicated than the methodology proposed above. They are intertwined with existing infrastructures and data processing components and cannot be easily stretched into a “pipelined” workflow as suggested. In these cases, the boundaries of the “computational nodes” that provide basic interfaces must be identified. A complex infrastructure can be decomposed into more than one microservice, with many endpoints distributed among the different microservices so as to keep the functionalities homogeneous. Furthermore, a large infrastructure can be encapsulated (wrapped) behind a microservice interface that exposes its basic functions.

5 Conclusions

A reference architecture for big data systems in digital finance can greatly assist stakeholders in structuring, designing, developing, deploying, and operating big data, AI, and IoT solutions. It serves as a communication device for stakeholders while at the same time providing a range of best practices that can accelerate the development and deployment of effective big data systems. This chapter has introduced such a reference architecture (i.e., the INFINITECH-RA), which adopts the concept and principles of data pipelines. The INFINITECH-RA is in line with the principles of the BDVA RM. In practice, it extends and customizes the BDVA RM with constructs that permit its use for digital finance use cases.

The chapter has illustrated how the INFINITECH-RA can support the development of common big data/AI pipelines for digital finance applications. In this direction, the chapter has provided some sample pipelines. The presented RA has served as a basis for the implementation of the applications and use cases that are described in later chapters of the book, including various digital finance and insurance use cases. The INFINITECH-RA is therefore referenced in several of the subsequent chapters of the book, which is the reason why it has been introduced as part of this first chapter.