1 Introduction

1.1 Motivation

Data is the new oil that moves businesses. The value propositions of banking, financial services, and insurance organizations depend on the information they can process. Companies need to process more data in less time and in a more personalized way. However, current database technology has limitations that prevent a single database engine from providing the required ingestion speed and the capacity to transform data into a usable form at the volume a corporation needs.

To overcome these constraints, companies use approaches based on complex platforms that blend several technologies. This complexity increases the total cost of ownership, lengthens the time to market, and creates long-run friction when adapting to new business opportunities. According to McKinsey, a midsize organization (between $5B and $10B in operating expenses) may spend around $90M–$120M to create and maintain these architectures, mainly because of the architectural complexity and data fragmentation. McKinsey's advice is to simplify the way financial and insurance organizations use information, which will greatly impact the way a company does business.

McKinsey also notes that a company can reduce expenses by up to 30% by simplifying the data architecture, in combination with other activities such as off-loading data infrastructure, improving engineers' productivity, and pausing expensive projects.

1.2 Data Pipelining Architectural Pattern Catalogue and How LeanXcale Simplifies All of Them

In the current data management landscape, there are many different families of databases, each with different capabilities. Because different databases are used for different purposes, data needs to be moved from one database into another. This practice is called data pipelining. In the next section, we first give a brief taxonomy of the different kinds of databases and the functions for which they are used, and then we describe the most common data pipelines.

In the next sections, we discuss groups of data pipelining architectural patterns targeting specific tasks.

At LeanXcale, we are looking at how to simplify these data pipelines and adopt a uniform, simple approach to them. Data pipelines get complicated mainly because of the mismatch of capabilities across the different kinds of systems, and they may become very complex when real-time requirements are involved. Many architectural patterns are commonly used to deal with different data pipeline constraints.

LeanXcale envisions a holistic solution to the issue of data pipelining that ingests data as fast as needed, works with current and historical data, handles aggregates efficiently, and can handle them at any scale. This holistic solution aims at minimizing the TCO (total cost of ownership) by reducing the number of storage systems needed to build a data pipeline and at minimizing the duration of the data pipelining, or even performing it in real time. In the following sections, we address each of the identified architectural patterns for data pipelining and show how LeanXcale can greatly simplify them.

Finally, we discuss instantiations of these patterns in the context of the INFINITECH European project where a large number of pilots are being run. We also discuss how LeanXcale has been leveraged to simplify these data pipelines in those pilots.

2 A Taxonomy of Databases for Data Pipelining

This section surveys databases, the building blocks of data pipelines, giving a taxonomy at different levels of abstraction, from high-level capabilities (operational vs. informational) to the different flavors of NoSQL databases.

2.1 Database Taxonomy

2.1.1 Operational Databases

Operational databases store data in persistent media (disk). They allow data to be updated while it is being read. The consistency guarantees given under concurrent reads and writes vary. Because they can be used for mission-critical workloads, operational databases might provide capabilities for attaining high availability that tolerate node failures, and, in some cases, they can even tolerate data center disasters that would otherwise lead to the loss or unavailability of a whole data center. The source of these disasters can be a natural event like a flood or a fire, the loss of electric power, the loss of Internet connectivity, a denial-of-service attack resulting in the loss of CPU power and/or network bandwidth, the saturation of some critical resource like DNS (domain name service), and more.

2.1.2 Data Warehouses

Data warehouses are informational databases, typically with much bigger persistent storage than operational databases. They are designed to query data only after ingesting it: they do not allow modifications; data is simply loaded and then queried. They focus on speeding up queries by means of OLAP (online analytical processing) capabilities, attained by introducing intra-query parallelism, typically using parallel operators (intra-operator parallelism). They often specialize the storage model to accelerate analytical queries, e.g., using a columnar model or an in-memory architecture.

2.1.3 Data Lakes

Data lakes are used as scalable, cheap storage in which to keep historical data at affordable prices. The motivation for keeping this historical data might be legal data retention requirements, but more recently the motivation comes from the business side: having enough data to train machine learning models more effectively by reaching a critical mass of data in terms of time span and detail. Some organizations use data lakes as cheap data warehouses when the queries are not especially demanding. A data lake might require more than an order of magnitude more resources than a data warehouse to answer an analytical query within a given target response time, while the price follows an inverse relationship.

2.2 Operational Database Taxonomy

Operational databases can be further divided into three broad categories.

2.2.1 Traditional SQL Databases

Traditional SQL (structured query language) databases are characterized by two main features. First, they provide SQL as query language. Second, they provide the so-called ACID guarantees over the data. The ACID (atomicity, consistency, isolation, durability) properties will be discussed in detail in the following paragraphs. The main limitation of traditional SQL databases is their scalability: they either do not scale out or scale only logarithmically, meaning that their cost grows exponentially with the scale of the workload to be processed. They typically provide mechanisms for high availability that preserve the ACID properties, which is technically known as one-copy consistency. The second limitation is that they ingest data very inefficiently, so they are not able to insert or update data at high rates.
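To make the ACID guarantees concrete, the following minimal sketch shows atomicity with a transfer between two accounts: either both updates are applied or neither is. SQLite is used here purely as a stand-in for any traditional SQL database, and the table, constraint, and amounts are invented for illustration.

# Minimal illustration of ACID atomicity with a traditional SQL database.
# SQLite is a stand-in; the schema and the amounts are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fails, so neither update is persisted

print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# -> [(1, 100), (2, 0)]: the transfer is all-or-nothing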

2.2.2 NoSQL Databases

NoSQL databases have been introduced to overcome the limitations of traditional SQL databases. There are four main kinds of NoSQL databases, detailed in Sect. 2.3. Basically, they address the lack of flexibility of the relational schema, which is too rigid and forces all the fields of each row to be known in advance; schema changes are very disruptive, typically leaving the database, or at least the involved tables, unavailable during the change. NoSQL databases do not provide ACID consistency guarantees. On the other hand, most of them can scale out, although not all kinds have this ability, and some of them scale out either not linearly or not to large numbers of nodes.

2.2.3 NewSQL Databases

NewSQL databases appear as a new approach to address the requirements of SQL databases, trying to remove part or all of their limitations. NewSQL databases are designed to bring new capabilities to traditional SQL databases by leveraging approaches from NoSQL and/or new data warehouse technologies. Some try to improve the scalability of storage, normally by relying on some NoSQL technology or adopting an approach similar to some NoSQL databases. Scaling queries was an already solved problem. However, scaling inserts and updates had two problems: the inefficiency of ingesting data and the inability to scale out transactional management (the ACID properties) to large scale. Some NewSQL databases have tried to overcome the lack of scalability of data ingestion, while others address the lack of scalability of transactional management.

2.3 NoSQL Database Taxonomy

NoSQL databases are usually distributed (execute on a shared-nothing cluster with multiple nodes), have different flavors, and are typically divided into four categories that are explained in the following subsections.

2.3.1 Key-Value Data Stores

Key-value data stores are schemaless and allow any value to be associated with a key. In most cases, they attain linear scalability: each instance of the key-value data store processes a fraction of the load. Since operations are based on an individual key-value pair, scalability does not pose any challenge and is achieved most of the time. The schemaless approach provides a lot of flexibility. As a matter of fact, each row can potentially have a totally different schema. Obviously, that is not how key-value data stores are used, but they allow the schema to evolve without any major disruption. Of course, queries have to do the extra work of understanding rows with different schema versions, but since schemas are normally additive (they add new columns or new variants), this is easy to handle. Key-value data stores excel at ingesting data very efficiently: since they are schemaless, they can store the data as is. This is very inefficient for querying, which is why they provide very limited query capabilities, such as getting the value associated with a key. In most cases, they are based on hashing, so they are unable to perform basic range scans and only provide full scans, which are very expensive since they traverse all the table rows. Examples of key-value data stores are Cassandra and DynamoDB.
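The following conceptual sketch illustrates why single-key operations scale almost linearly under hash partitioning and why range scans do not. The node names and the helper function are invented for illustration; real stores such as Cassandra or DynamoDB differ in the details.

# Conceptual sketch of hash partitioning in a key-value data store.
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for(key: str) -> str:
    """Hash partitioning: each key maps to exactly one node."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# A put/get touches a single node, so adding nodes adds capacity almost linearly.
print(node_for("customer:42"))

# A range scan ("all keys between A and B") cannot use the hash: the keys in the
# range are scattered over every node, so the store must resort to a full scan.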

2.3.2 Document-Oriented Databases

Document-oriented databases are technologies that support semi-structured data written in a popular language such as JSON or XML. Their main capability is storing data in one of these languages efficiently and querying it in an effective way. Representing such data in a relational database is a nightmare, and querying the resulting relational schema is an even worse one. This is why they have succeeded. Some of them scale out in a limited way and not linearly, while others do better and scale out linearly. However, they do not support the ACID properties and are inefficient at querying data that is structured in nature. Structured data can be queried one to two orders of magnitude more efficiently with SQL databases. Examples in this category are MongoDB and Couchbase.
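As a small sketch of why semi-structured data fits document stores, one self-contained document can replace several joined relational tables, and nested, repeated fields can be queried directly. The document shape and the filter below are invented for illustration; real document stores (e.g., MongoDB) offer such nested-field queries natively.

# Sketch of querying a nested, repeated field in a semi-structured document.
policy = {
    "policy_id": "P-001",
    "holder": {"name": "Alice", "country": "ES"},
    "claims": [
        {"year": 2022, "amount": 1200},
        {"year": 2023, "amount": 300},
    ],
}

# No joins and no schema migration are needed to ask for the recent claims.
recent_claims = [c for c in policy["claims"] if c["year"] >= 2023]
print(recent_claims)  # -> [{'year': 2023, 'amount': 300}]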

2.3.3 Graph Databases

Graph databases specialize in storing and querying data modeled as a graph. Graph data represented in a relational format becomes very expensive to query. The reason is that to traverse a path from a given vertex in the graph, one has to perform many queries, one per edge stemming from the vertex, repeated as many times as the length of the longest path sought in the graph. This results in too many invocations to the database. If the graph does not fit in memory, the problem is even bigger, since disk accesses will be involved in most of the queries. Also, the queries cannot be expressed in SQL and have to be performed programmatically. Graph databases, on the other hand, have a query language in which a single invocation solves the problem. Data is stored to maximize the locality of a vertex with its contiguous vertexes. However, when the graph does not fit in a single node and the database becomes distributed, graph databases start suffering from the same problem, losing their efficiency and any performance gain as the system grows. At some point, a relational solution becomes more efficient than the graph solution for a large number of nodes. A widely used graph database is Neo4J.
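The sketch below shows the access pattern that hurts relational storage of graphs: finding the vertexes reachable within N hops issues one lookup per visited vertex, which in the relational pattern would be one database round trip each. The edge list and helper are invented for illustration.

# Sketch of hop-by-hop traversal, one "query" per visited vertex.
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)

def neighbors(vertex):
    # In the relational pattern, this is one round trip to the database
    # (SELECT dst FROM edges WHERE src = ?), repeated for every vertex visited.
    return adjacency[vertex]

def reachable(start, hops):
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {n for v in frontier for n in neighbors(v)} - seen
        seen |= frontier
    return seen - {start}

print(reachable("a", 3))  # the set {'b', 'c', 'd', 'e'}, at the cost of one lookup per visited vertex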

2.3.4 Wide-Column Data Stores

Wide-column data stores provide more capabilities than key-value data stores. They typically perform range partitioning, thus supporting range queries, and they might even support some basic filtering. They also support vertical partitioning, which can be convenient when the number of columns is very high. Although still largely schemaless, they have some notion of schema, yet a quite flexible one. Examples of this kind of data store are Bigtable and HBase.

3 Architectural Patterns Dealing with Current and Historical Data

In this section, an overview of two common patterns used to deal with current and historical data is given.

3.1 Lambda Architecture

The lambda architecture combines batch processing techniques with data streaming to be able to process data in real time. The lambda architecture is motivated by the lack of scalability of operational SQL databases. The architecture consists of three layers, as illustrated in Fig. 3.1.

1. Batch Layer

Fig. 3.1 Lambda architecture

This layer is based on append-only storage, typically a data lake such as the ones based on HDFS. It then relies on MapReduce for processing new batches of data in the form of files. The batch layer provides a view in a read-only database. Depending on the problem being solved, the output might need to fully recompute all the data to be accurate. After each iteration, a new view of the current data is provided. This approach is quite inefficient but solves a scalability problem that used to have no solution, such as the processing of tweets at Twitter.

2. Speed Layer

This layer is based on data streaming. In the original system at Twitter, it was implemented with the Storm data streaming engine. It basically processes new data to complement the batch view with the most recent data. This layer does not aim at accuracy but at bringing more recent data into the global view achieved with the architecture.

3. Serving Layer

This layer processes the queries over the views provided by both the batch and speed layers. Batch views are indexed so that queries can be answered with low response times and combined with the real-time view, thus providing answers over both real-time and historical data. This layer typically uses some key-value data store to implement the indexes over the batch views.

The main shortcoming of the lambda architecture is its complexity and the need for totally different code bases for each layer, which have to be coordinated and kept fully synchronized. Maintenance of the platform is very hard, since debugging implies understanding layers of totally different natures, technologies, and approaches.
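As a toy sketch of the serving layer's job, the merge step below combines the accurate-but-stale batch view with the fresh-but-partial speed view. The per-account counters and the identifiers are invented for illustration.

# Minimal sketch of the serving layer in a lambda architecture.
batch_view = {"acct-1": 120, "acct-2": 45}   # recomputed periodically from the data lake
speed_view = {"acct-1": 3, "acct-3": 7}      # maintained by the streaming layer since the last batch

def serve(account_id: str) -> int:
    """Answer a query by merging the historical and the real-time partial results."""
    return batch_view.get(account_id, 0) + speed_view.get(account_id, 0)

print(serve("acct-1"))  # -> 123: batch history plus the most recent events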

3.2 Beyond Lambda Architecture

By means of LeanXcale, the lambda architecture is totally trivialized: the three data management technologies and the three code bases with ad hoc code for each query are substituted by a single database manager with declarative queries in SQL. In other words, the lambda architecture is simply replaced by the LeanXcale database, which provides all the capabilities of the lambda architecture without any of its complexity and development and maintenance cost. LeanXcale scales out operational storage linearly, thus solving one of the key shortcomings of operational databases that motivated the lambda architecture.

The second obstacle of operational databases is their inefficiency in ingesting data, which makes them too expensive even for ingestion rates they can handle. As the database grows, the cache becomes ineffective, and each insert requires reading a leaf node of the B+ tree, which in turn requires first evicting a node from the cache and writing it to disk. This means that every write requires at least two IOs. LeanXcale solves this issue by providing the efficiency of key-value data stores in ingesting data, thanks to the blending of SQL and NoSQL capabilities using a new variant of LSM trees. With this approach, updates and inserts are cached in an in-memory search tree and periodically propagated all together to the persisted B+ tree, which greatly increases the locality of updates and inserts on each leaf of the B+ tree and amortizes the cost of each IO among many rows.
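The sketch below illustrates this write path in a generic LSM-style fashion: inserts land in an in-memory buffer and are propagated to the persisted tree in large, key-ordered batches, so many rows share the cost of each IO. This is a conceptual illustration only, not LeanXcale's actual data structure; the class, threshold, and keys are invented.

# Generic LSM-style write buffer: absorb writes in memory, flush in sorted batches.
class WriteBuffer:
    def __init__(self, flush_threshold=4):
        self.buffer = {}                 # stand-in for the in-memory search tree
        self.flush_threshold = flush_threshold
        self.persisted = []              # stand-in for the persisted B+ tree

    def upsert(self, key, row):
        self.buffer[key] = row
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One key-ordered propagation amortizes the IO cost over many rows.
        batch = sorted(self.buffer.items())
        self.persisted.extend(batch)
        self.buffer.clear()

store = WriteBuffer()
for k in ["k3", "k1", "k4", "k2"]:
    store.upsert(k, {"value": k.upper()})
print(store.persisted)   # flushed as one sorted batch of four rows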

The third issue solved by LeanXcale, and not by the lambda architecture, is ease of querying. The lambda architecture requires each query to be developed programmatically with three different code bases, one per layer. In LeanXcale, queries are simply written in SQL and are automatically optimized, unlike the programmatic queries in the lambda architecture, which require manual optimization across the three code bases. The fourth issue solved is the cost of recurrent aggregation queries. In the lambda architecture, this is typically handled in the speed layer using data streaming. In LeanXcale, online aggregates enable real-time aggregation without the problems of operational databases and provide a low-cost solution with low response times.

3.3 Current Historical Data Splitting

Other, more traditional architectures are based on combining an operational database with a data warehouse, as shown in Fig. 3.2. The operational database deals with the more recent data, while the data warehouse deals with historical data. In this architecture, queries can only see either the recent data or the historical data, but not a combination of both as in the lambda architecture. A periodic process copies data from the operational database into the data warehouse. This process has to be performed very carefully, since it can hamper the quality of service of the operational database; it is most of the time implemented with ETL tools and is typically run over the weekend in businesses whose main workload comes during weekdays.

Fig. 3.2 Current historical data splitting architectural pattern

Another problem with this architecture is that the data warehouse, or at least the tables being loaded, typically cannot be queried while it is being loaded. This forces the data warehouse's time to be split between loading and processing. When loading happens daily, the day is split between the two: processing consumes a number of hours that depends on the analytical queries to be answered daily, and the remaining hours of the day are left as the window for loading data. At some point, the data warehouse cannot ingest more data because the loading window is exhausted. We call this architectural pattern current historical data splitting.
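The periodic copy behind this pattern can be sketched as follows: rows newer than the last watermark are extracted from the operational database and loaded into the warehouse during the loading window. SQLite stands in for both systems; the table name, the watermark column, and the schedule are invented for illustration.

# Sketch of the periodic ETL copy from the operational database to the warehouse.
import sqlite3, time

operational = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (operational, warehouse):
    db.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, ts REAL, amount REAL)")
operational.executemany("INSERT INTO trades VALUES (?, ?, ?)",
                        [(1, time.time() - 90000, 10.0), (2, time.time(), 25.0)])

def nightly_etl(last_watermark: float) -> float:
    # Extract only the rows added since the previous run, then load them in bulk.
    rows = operational.execute(
        "SELECT id, ts, amount FROM trades WHERE ts > ?", (last_watermark,)).fetchall()
    with warehouse:
        warehouse.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
    return max((r[1] for r in rows), default=last_watermark)  # new watermark

watermark = nightly_etl(0.0)
print(warehouse.execute("SELECT COUNT(*) FROM trades").fetchone())  # -> (2,)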

3.4 From Current Historical Data Splitting to Real-Time Data Warehousing

In this pattern, data is split between an operational database and a data warehouse or a data lake. The current data is kept in the operational database and the historical data in the data warehouse or data lake. However, queries across all the data are not supported by this architectural pattern. With LeanXcale, a new pattern, called real-time data warehousing, will be used to solve this problem. This pattern will rely on an innovation introduced in LeanXcale, namely, the ability to split analytical queries between LeanXcale and an external data warehouse. Basically, older fragments of data will be copied into the data warehouse periodically. LeanXcale will keep the recent data and some of the more recent historical data, while the data warehouse will keep only historical data. Queries over recent data will be solved by LeanXcale, and queries over historical data will be solved by the data warehouse. Queries across both kinds of data will be solved using a federated query approach, leveraging LeanXcale's capabilities to query across different databases and innovative techniques for join optimization. In this way, the bulk of the historical part of the query is performed by the data warehouse, while the rest of the query is performed by LeanXcale. This approach enables delivering real-time queries over both recent and historical data, giving a 360° view of the data.
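As a generic illustration of the federated idea (not LeanXcale's actual mechanism), the sketch below splits a query spanning recent and historical data at a time threshold, answers each part against the store that owns it, and merges the partial aggregates. The two in-memory "stores" and the threshold are invented for illustration.

# Generic sketch of splitting a query between a recent store and a historical store.
SPLIT_TS = 1_700_000_000          # data newer than this lives in the operational store

recent_rows = [(1_700_000_500, 10.0), (1_700_000_900, 5.0)]      # operational store
historical_rows = [(1_600_000_000, 7.0), (1_650_000_000, 3.0)]   # data warehouse

def total_amount(since_ts: int) -> float:
    recent = sum(a for ts, a in recent_rows if ts >= max(since_ts, SPLIT_TS))
    historical = sum(a for ts, a in historical_rows if since_ts <= ts < SPLIT_TS)
    return recent + historical    # merge the two partial results into one answer

print(total_amount(1_500_000_000))   # -> 25.0, combining both stores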

4 Architectural Patterns for Off-Loading Critical Databases

4.1 Data Warehouse Off-Loading

Since the saturation of the data warehouse is a common problem, another architectural pattern has been devised to deal with it: the so-called data warehouse off-loading (Fig. 3.3). This pattern relies on creating small views of the data contained in the data warehouse and storing them on independent databases, typically called data marts. Depending on the size of the data and the complexity of the queries, data marts can be handled by operational SQL databases, or they might need a data manager with OLAP capabilities, which might be another data warehouse or a data lake plus an OLAP engine that works over data lakes.

Fig. 3.3 Data warehouse off-loading architectural pattern

4.2 Simplifying Data Warehouse Off-Loading

Data warehouse off-loading is typically motivated either by cost reasons or by the saturation of the data warehouse. For data warehouse off-loading, data marts are built on other database managers, resulting in a more complex architecture that requires multiple ETLs and copies of the data. With LeanXcale, this issue can be solved in two ways. One way is to use operational database off-loading to LeanXcale with the dataset of the data mart. The advantage of this approach with respect to data warehouse off-loading is that the data mart contains real-time data instead of obsolete data copied via a periodic ETL. The second way is to use database snapshotting, taking advantage of LeanXcale's fast and highly efficient loading. Thus, a snapshot of the data would be periodically stored in LeanXcale with the same or higher freshness than a data mart would have. The advantage is that the copy would come directly from the operational database instead of from the data warehouse, thus resulting in fresher data.

4.3 Operational Database Off-Loading

There are cases with real-time or quasi-real-time requirements where the database snapshotting pattern (see Sect. 4.5) does not solve the problem. In this case, a CDC (change data capture) system is used that captures changes in the operational data and injects them into another operational database. CDC is applied only over the fraction of the data that will be processed by the other operational database. The workload is not performed over the operational database for technical or financial reasons. The technical reason is that the operational database cannot handle the full workload and some processes need to be off-loaded to another database. The financial reason is that the operational database can handle the workload, but the price, typically with a mainframe, is very high. This architectural pattern is called operational database off-loading and is illustrated in Fig. 3.4.
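The consuming side of this pattern can be sketched as replaying a stream of captured change events against the off-loaded database. The event format, the target table, and SQLite as the off-loaded store are invented for illustration; real CDC tools differ in their event formats and delivery guarantees.

# Sketch of applying a CDC change stream to an off-loaded database.
import sqlite3

offloaded = sqlite3.connect(":memory:")
offloaded.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT)")

change_events = [
    {"op": "insert", "id": 1, "email": "a@example.com"},
    {"op": "update", "id": 1, "email": "a@bank.example"},
    {"op": "delete", "id": 1},
]

def apply_change(event: dict) -> None:
    # Each captured change is replayed against the off-loaded copy.
    with offloaded:
        if event["op"] == "insert":
            offloaded.execute("INSERT INTO customer VALUES (?, ?)", (event["id"], event["email"]))
        elif event["op"] == "update":
            offloaded.execute("UPDATE customer SET email = ? WHERE id = ?", (event["email"], event["id"]))
        elif event["op"] == "delete":
            offloaded.execute("DELETE FROM customer WHERE id = ?", (event["id"],))

for ev in change_events:
    apply_change(ev)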

Fig. 3.4 Operational database off-loading architectural pattern

4.4 Operational Database Off-Loading at Any Scale

One of the main limitations of operational database off-loading is that only a fraction of the data can be off-loaded to a single database, due to the lack of scalability of traditional SQL operational databases. Typically, this approach is adopted to reduce the financial cost of processing on mainframes, which can handle very high workloads. However, traditional SQL databases, with much more limited capacity and little scalability, limit what can be done by means of this pattern. LeanXcale can support even the full set of changes performed on the mainframe, thanks to its scalability, so it does not set any limitation on the data size or the rate of data updates/inserts.

4.5 Database Snapshotting

In some cases, the problem is that the operational database cannot handle the whole workload due to its lack of scalability, and part of this workload can be performed without real-time requirements. In these cases, the database, or the relevant part of its data, is copied into another operational database during the time the operational database is not being used, normally weekends or nights, depending on how long the copy takes. If the copy process takes less than one night, it is performed daily. If it takes more than one day, it is performed during the weekend. Finally, if it takes more than a weekend, this architectural pattern cannot be applied. This architectural pattern is called database snapshotting and is depicted in Fig. 3.5.

Fig. 3.5 Database snapshotting architectural pattern

4.6 Accelerating Database Snapshotting

LeanXcale can avoid database snapshotting altogether if it is used as the operational database. This is possible thanks to the linear scalability of LeanXcale, which removes the need to off-load part of the workload to other databases. However, in many cases, organizations are not ready to migrate their operational database because of the large amount of code relying on specific features of the underlying database. This is the case with mainframes running large COBOL programs and batch programs in JCL (job control language). In this case, one can rely on LeanXcale to provide a more effective snapshotting or even substitute snapshotting with operational database off-loading, thus obtaining real-time data. In the case of snapshotting, thanks to the efficiency and speed of data ingestion of LeanXcale, snapshotting can be performed daily instead of weekly, since load processes that used to take days are reduced to minutes. The main benefit is that data freshness changes from weekly to real time. This ingestion speed is achieved thanks to LeanXcale's capability of ingesting and querying data with the same efficiency regardless of the dataset size, which is attained by means of bidimensional partitioning. Bidimensional partitioning exploits the timestamp in the key of historical data to partition tables on a second dimension. Tables in LeanXcale are partitioned horizontally through the primary key, but they are then automatically split on the time dimension (or an auto-increment key, whatever is available) to guarantee that each table partition fits in memory, so the load becomes CPU-bound, which is very fast. Traditional SQL databases get slower as data grows, since the B+ tree used to store the data grows in both number of levels and number of nodes. Thanks to bidimensional partitioning, LeanXcale keeps the time to ingest data constant. Queries are also sped up thanks to the parallelization of all algebraic operators below joins (intra-operator parallelism).
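The idea of bidimensional partitioning can be sketched as assigning each row to a partition identified by its primary-key bucket and its time bucket, so that the actively written partition stays small enough to fit in memory regardless of table size. The bucket sizes, the hash on the primary key, and the key layout below are invented for illustration and are not LeanXcale's internals.

# Conceptual sketch of bidimensional (primary key x time) partitioning.
import hashlib

PK_BUCKETS = 8
TIME_BUCKET_SECONDS = 3600          # e.g., one partition per hour per PK bucket

def partition_for(primary_key: str, ts: float) -> tuple:
    pk_bucket = int(hashlib.md5(primary_key.encode()).hexdigest(), 16) % PK_BUCKETS
    time_bucket = int(ts // TIME_BUCKET_SECONDS)
    return (pk_bucket, time_bucket)

# New rows always land in the newest time bucket of their PK bucket, so the write
# path works over a bounded, memory-resident partition however large the table grows.
print(partition_for("sensor-17", 1_700_003_000))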

5 Architectural Patterns Dealing with Aggregations

5.1 In-Memory Application Aggregation

Some systems tackle the problem of recurrent aggregate queries by computing the aggregates in memory on the application side. These in-memory aggregates are computed and maintained as time progresses. The recurrent aggregation queries are answered by reading the in-memory aggregations, while access to the detailed data is served by the operational database, typically using sharding. This pattern is depicted in Fig. 3.6 and is called in-memory application aggregation.
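A minimal sketch of this pattern is shown below: the application keeps running aggregates up to date as events arrive, and recurrent aggregation queries read them directly instead of scanning the detailed data. The event shape and names are invented for illustration; note that a crash loses the in-memory state, one of the drawbacks discussed later.

# Sketch of in-memory application aggregation maintained as events arrive.
from collections import defaultdict

totals = defaultdict(float)      # e.g., total traded amount per account
counts = defaultdict(int)

def on_event(account: str, amount: float) -> None:
    totals[account] += amount    # maintained incrementally, in application memory
    counts[account] += 1

for acct, amt in [("acct-1", 10.0), ("acct-2", 4.5), ("acct-1", 2.5)]:
    on_event(acct, amt)

def average(account: str) -> float:
    return totals[account] / counts[account]   # answered without touching the database

print(average("acct-1"))   # -> 6.25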

Fig. 3.6 In-memory application aggregation

5.2 From In-Memory Application Aggregation to Online Aggregation

LeanXcale removes the need for in-memory application aggregations, and with it all the problems around them, such as the loss of data in the event of failures and, more importantly, all the development and maintenance cost of the code required to perform the in-memory aggregations. In-memory aggregations work as long as they can be computed in a single node; when multiple nodes are required, they become extremely complex and often out of reach of technical teams. Instead, LeanXcale's online aggregates are leveraged to compute the aggregations for recurrent aggregation queries. LeanXcale keeps internally the relationship between tables (called parent tables) and the aggregate tables built from the inserts in these tables (called child aggregate tables). When aggregation queries are issued, the query optimizer uses new rules to automatically detect which aggregations on the parent table can be accelerated by using the aggregations in the child aggregate table. The result is a transparent improvement of all aggregations on the parent table that can exploit the child table aggregates, obtained by simply declaring a child aggregate table.
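The parent/child aggregate-table idea can be illustrated with the toy sketch below: an insert into the parent (detail) table incrementally updates the child aggregate table, so an aggregation query only reads the pre-computed row. This is a conceptual illustration, not LeanXcale's engine or SQL surface; all names are invented.

# Toy sketch of a parent table with an incrementally maintained child aggregate table.
parent_table = []                       # detail rows: (account, amount)
child_aggregate = {}                    # account -> (sum, count)

def insert(account: str, amount: float) -> None:
    parent_table.append((account, amount))
    s, c = child_aggregate.get(account, (0.0, 0))
    child_aggregate[account] = (s + amount, c + 1)     # maintained on the write path

def sum_per_account(account: str) -> float:
    # The aggregation is answered from the child table: one lookup instead of a
    # scan over all detail rows in the parent table.
    return child_aggregate.get(account, (0.0, 0))[0]

for acct, amt in [("acct-1", 10.0), ("acct-1", 2.5), ("acct-2", 7.0)]:
    insert(acct, amt)
print(sum_per_account("acct-1"))        # -> 12.5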

5.3 Detail-Aggregate View Splitting

A typical and important workload is to ingest high volumes of detailed data and compute recurrent aggregate analytical queries over this data. This workload has been addressed with more specific architectures. The architectural pattern uses sharding to store fractions of the detailed data and a federator at the application level that queries the individual sharded database managers to get the result sets of the individual aggregate queries and then aggregates them manually to compute the aggregate query over the logical database. The aggregated views are generated periodically by means of an ETL (extract, transform, load) process that traverses the data from the previous period in the detailed operational database, computes the aggregations, and stores them in the aggregate operational database. The recurrent queries are processed over the aggregation database; since it already contains the pre-computed aggregates, the queries are light enough to be computed by an operational database. This architectural pattern is called detail-aggregate view splitting and is depicted in Fig. 3.7. One of its main shortcomings is that the aggregate queries have an obsolete view of the data, since they miss the data from the last period. Typical period lengths go from 15 min to hours or a full day.
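The application-level federator described above can be sketched as a scatter-gather: the aggregate query is sent to every shard of the detailed data, and the partial results are merged by the application to produce the aggregate over the logical database. The shards are plain in-memory lists here, and all names and data are invented for illustration.

# Sketch of an application-level federator merging per-shard aggregate results.
shards = [
    [("EUR", 100.0), ("USD", 40.0)],    # shard 0 of the detailed data
    [("EUR", 60.0)],                    # shard 1
    [("USD", 10.0), ("EUR", 5.0)],      # shard 2
]

def shard_aggregate(shard):
    """Per-shard aggregate query (conceptually: SELECT currency, SUM(amount) ... GROUP BY currency)."""
    partial = {}
    for currency, amount in shard:
        partial[currency] = partial.get(currency, 0.0) + amount
    return partial

def federated_aggregate():
    merged = {}
    for partial in map(shard_aggregate, shards):   # one query per shard
        for currency, subtotal in partial.items():
            merged[currency] = merged.get(currency, 0.0) + subtotal
    return merged

print(federated_aggregate())    # -> {'EUR': 165.0, 'USD': 50.0}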

Fig. 3.7 Detail-aggregate view splitting architectural pattern

5.4 Avoiding Detail-Aggregate View Splitting

LeanXcale does not need the detail-aggregate view splitting pattern. As a matter of fact, by taking advantage of LeanXcale's online aggregates, aggregate tables are built incrementally as data is inserted. This increases the cost of ingestion, but since LeanXcale is more than one order of magnitude more efficient than the market-leading operational SQL database, it can still ingest the data more efficiently despite the online aggregation. Recurrent aggregation analytical queries then become almost costless, since each aggregation has already been computed incrementally and the queries only have to read a single row or a small number of rows to provide the answer.

6 Architectural Patterns Dealing with Scalability

6.1 Database Sharding

The databases typically used for the above architectures are SQL operational databases, and since they do not scale, they require an additional architectural pattern called database sharding (Fig. 3.8). Sharding overcomes the lack of scalability (or of linear scalability) of an operational database by storing fractions of the data on different, independent database servers. Thus, each database server handles a workload small enough for it, and by aggregating the power of many different database manager instances, the system can scale. The main shortcoming of this architecture is that queries can no longer be performed over the logical database, since each database manager instance only knows about the fraction of the data it is storing and cannot query any other data. Another major shortcoming is that there are no transactions across database instances; there are no consistency guarantees, neither in the presence of concurrent reads and writes nor in the event of failures.
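The sketch below shows the essence of application-level sharding: each row is routed to one independent database instance by its shard key, so single-key operations scale, but no single instance can answer queries or run transactions over the whole logical database. The shard count, key, and SQLite instances are invented for illustration.

# Minimal sketch of application-level database sharding.
import sqlite3

shards = [sqlite3.connect(":memory:") for _ in range(3)]
for db in shards:
    db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

def shard_for(customer_id: int) -> sqlite3.Connection:
    return shards[customer_id % len(shards)]        # routing logic lives in the application

def insert_order(customer_id: int, amount: float) -> None:
    with shard_for(customer_id) as db:              # a transaction on one shard only
        db.execute("INSERT INTO orders VALUES (?, ?)", (customer_id, amount))

insert_order(7, 99.5)
insert_order(8, 10.0)   # lands on a different instance; no ACID guarantee spans both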

Fig. 3.8 Database sharding

6.2 Removing Database Sharding

LeanXcale does not need the database sharding architectural pattern (Fig. 3.8) thanks to its linear scalability. Thus, what used to require programmatically splitting the data ingestion and queries across independent database instances is not needed anymore. LeanXcale can scale out linearly to hundreds of nodes.

7 Data Pipelining in INFINITECH

Modern applications used by data-driven organizations, such as those in the finance and insurance sector, require processing data streams along with data persistently stored in a database. A key requirement for such organizations is that the processing must take place in real time, providing real-time results, alerts, or notifications in order, for instance, to detect fraudulent financial transactions the moment they occur, detect possible indications of money laundering, or provide real-time risk assessment, among other needs. Toward this direction, stream processing frameworks have been used during the last decade to process streaming data coming from various sources in combination with data persistently stored in a database, which can be considered data at rest. However, processing data at rest introduces an inherent, significant latency, as data access involves expensive I/O operations, which is not suitable for stream processing. Due to this, various architectural designs have been proposed and are used in the modern landscape to deal with such problems. They tend to formulate data pipelines, moving data from different sources to other data management systems in order to allow for efficient processing in real time. However, they are far from being considered intelligent, and each of the proposed approaches comes with its own barriers and drawbacks, as has been extensively explained in the previous sections.

A second key requirement for data-driven organizations in the finance and insurance sector is to be able to cope with diverse workloads and continue to provide results in real time even when there is a burst of incoming data from a stream. This might happen when a stream consumes data feeds from social media for sentiment analysis and an important event or incident takes place, making the social community respond by posting an increased number of tweets or articles. Another example is an unexpected currency devaluation, which will most likely trigger numerous financial transactions as people and organizations change their currencies. The problem with the current landscape is that modern stream processing frameworks allow only static deployments of data streams consisting of several operators, sized to serve an expected input workload. To make things worse, the data management solutions used in such scenarios are difficult to scale out to support an increased load. To cope with this, architectural solutions such as the ones described previously are adopted, with all their inherent drawbacks and technology barriers.

In order to cope with these two requirements and overcome the current barriers of the modern landscape, we envision the INFINITECH approach for intelligent data pipelines, which can be further exploited by its parallelized data stream processing framework. In INFINITECH, we provide a holistic approach for data pipelines that relies on the key innovations and technologies provided by LeanXcale. The INFINITECH intelligent data pipelines break through the current technological barriers of dealing with different types of storage, of using different types of databases for persistently storing data while allowing efficient query processing, of handling aggregates, and of dealing with snapshots of data. By having LeanXcale as the base for the INFINITECH intelligent data pipelines, we can solve the problem of data ingestion at very high rates, removing the need for database off-loading. Moreover, the online aggregates remove the issues of having to pre-calculate the results of complex analytical queries, which lead to inconsistent and obsolete results. The integration of LeanXcale with the Apache Flink stream processing framework and with tools for change data capture (CDC) enables the deployment of such intelligent data pipelines. Finally, our solution allows for parallelized data stream processing via the ability of the deployed operators to save and restore their state, thus allowing online reconfiguration of the streaming clusters, which enables elastic scalability by programmatically scaling those clusters.

In what follows, we describe the use cases of data pipelining in INFINITECH, which data pipeline patterns they use, and how they have been improved by means of LeanXcale.

Current historical data splitting is an architectural pattern that in INFINITECH is used by a use case handling real-world data for novel health insurance products, where data is split between an operational database and a data warehouse. The main issue of this approach is that the analytic algorithms can only rely on the historical data already moved and cannot provide real-time business intelligence, as the analysis is always performed on obsolete datasets. With LeanXcale, the AI tools of this insurance pilot can now perform the analysis on real-time data that combines both current and historical data.

The data warehouse off-loading architectural pattern is used in INFINITECH by a use case from the Bank of Cyprus related to business financial management that delivers smart business advice. The data stored in its main data warehouse is periodically moved to data marts, and the analytical queries used to provide smart business advice target the data marts, which contain small views of the overall schema. The main drawback of this approach is having to maintain several data stores at the same time. In this use case, a single database with all the information from the data marts is held by LeanXcale, thus avoiding the ETL processes between the data warehouse and the data marts and avoiding storing some information multiple times. At the same time, the analytical processing will now access more recent data, since it can be updated as frequently as needed, even in real time.

There are several use cases in INFINITECH that rely on the database snapshotting architectural pattern. One of them, from the National Bank of Greece, provides personalized investment portfolio management for their retail customers and periodically copies parts of the operational database into different databases. Their analytical algorithms make use of a snapshot of this operational database. Since the snapshot is not updated very frequently, it prevents them from performing real-time business intelligence. By introducing LeanXcale, the personalized investment portfolio management provided by the National Bank of Greece can have a database snapshot that is updated as frequently as needed, even in real time using the CDC architectural pattern. That way, the analysis can be as close to real time as needed, or even in real time.

The operational DB off-loading architectural pattern is used in an INFINITECH pilot related to personalized insurance products based on IoT data from connected vehicles. The data ingested from the IoT sensors cannot be processed in the operational database. After the raw data has been ingested, it is preprocessed and stored in an additional database accessed by the AI algorithms, to prevent the AI processes from interfering with the operational database. This pilot now uses LeanXcale as its only database, removing the need to maintain different data management technologies; all data accesses, whether for data ingestion or for analytical processing, can target a single instance.

Regarding database sharding, there are several use cases among the organizations participating in INFINITECH, but since the focus is on analytical pipelines, none of the targeted use cases actually relies on this pattern.

INFINITECH has scenarios for real-time identification of financial crime and for anti-money laundering supervision, from CaixaBank in Spain and the Central Bank of Slovenia, that require computing aggregates over streaming data. The drawback of this architectural approach is that the aggregated result is stale, as it was calculated at a point in time earlier than that of the current value. LeanXcale is being used for its online aggregations to solve this issue and provide real-time aggregations that are fully persisted.

In INFINITECH, the detail-aggregate view splitting architectural pattern was initially used for real-time risk assessment in investment banking, implemented by JRC Capital Management Consultancy. The real-time risk assessment solution provided by JRC Capital Management Consultancy has evolved using LeanXcale. Now the incoming data stream is ingested into LeanXcale, exploiting its ability to support data ingestion at very high rates, while its online aggregates are used by the AI algorithms to retrieve aggregated results with a latency of milliseconds. The real-time identification of financial crime and the anti-money laundering supervision use cases also benefit from LeanXcale and its online aggregates. Both scenarios implement detection over real-time data that is also updated in real time.

8 Conclusions

Financial services and insurance companies are in a race to be more efficient in processing all the available information. Despite the popularity of a wide set of highly specialized data engines for specific challenges, none of them solves most of the frequent use cases on its own. A group of complex architectural patterns blending different kinds of databases has emerged to solve the most common situations. Nevertheless, they greatly increase the total cost of ownership, mainly due to their complexity. They also reduce the business value, since most of them result in exploiting stale data from batch processes performed weekly or daily.

LeanXcale, thanks to its disruptive technology, leads to simpler architectures with fresher or even real-time data. These architectures speed up the development process due to their simplicity, reducing the time from requirement collection and idea inception to production-ready software. They are more affordable to maintain, since fewer specialists, servers, and database licenses are required. Additionally, their processing capabilities provide a more agile way to speed up business processes while reducing operational risk. Finally, these architectures can really support the creation of new revenue streams by building new ways to satisfy customer needs, using the oil of our times: data.