
1 Introduction

The back ends of traditional finance and insurance companies rely on complex processes to execute their daily activities, such as risk, debt, or credit scoring computation. As companies become more and more data-driven and customer relationships move from purely off-line to omnichannel, further use cases are emerging, such as highly personalized marketing or driving-style assessment based on IoT sensors. The current trend is for organizations to make increasing use of all the available information.

As data volumes grow large, data analysis takes longer than the business would like. To partially speed up processing, complex data platforms combine different data technologies to compute these operations. Organizations have many different data sources and typically keep many different operational databases, of varying natures, to store their data, including both SQL and NoSQL databases. These operational databases only keep recent data, for instance, the data from the last 12 months. On the other hand, organizations use slower analytical databases, such as data warehouses, data marts, and data lakes, to perform analytical tasks over this data, such as analysis, reporting, and machine learning. In some cases, they also use data streaming technologies to process data on the fly and compute continuous queries over streaming data. These analytical databases are designed to house large volumes of historical data, typically data older than 12 months.

Data pipelines combine current and historical data while, at the same time, enriching and transforming this data. These transformations often involve aggregating data to compute KPIs and summarize information for reporting purposes. For these transformation processes, it is a common practice to create intermediate staging databases to enrich and transform the data before generating the targeted final reports or other output.

As a result of this complexity, platform development and maintenance costs increase, which may hurt the process agility that the business requires.

2 Challenges in Data Pipelines in Digital Finance and Insurance

McKinsey has noted that “Yesterday’s data architecture can’t meet today’s need for speed, flexibility and innovation” [1]. The key […] is agility. The right architecture provides:

  • IT cost savings

  • Productivity improvements

  • Reduced regulatory and operational risks

  • Delivery of new capabilities and services

Furthermore, McKinsey highlights the issues with a complex data architecture, showing that a midsize organization with $5 billion in operating costs ends up spending $250 million on its IT architecture [2].

In what follows, we will briefly review these challenges.

2.1 IT Cost Savings

The costs of operation and support are proportional to processing time: the shorter the processing, the smaller the technical teams’ expenses. This cost structure, where expenses are proportional to processing time, is exemplified by cloud platforms. Thus, increasing performance implies reducing IT costs.

2.2 Productivity Improvements

The time of data analysts and data scientists is split between the time they spend working with the data manually and the time during which computer algorithms process the data. Intuitively, the productivity of data analysts and data scientists increases as processing time is reduced.

2.3 Reduced Regulatory and Operational Risks

Digital finance and insurance companies typically handle the information of millions to hundreds of millions of customers, which results in massive data volumes. In these companies, it is very common to have daily, weekly, and monthly batch processes to handle such massive data. These processes raise many challenges, as they involve ingesting huge amounts of data while being limited by a time window. Daily processes normally have to be completed overnight, before the next business day, to avoid disturbing the operational databases when extracting the daily updates. Any problem in the nightly process will make the data unavailable the next day, thereby causing business problems. For this reason, the process should be short enough that it can be repeated in case something goes wrong. For the weekly processes, the time window is the weekend; since these processes take more than one day, if they fail, they cannot be completed over the weekend. The monthly processes, because they have to deal with one to two orders of magnitude more data than the daily processes, take 1–2 weeks of processing, thus potentially impacting the business very seriously.

2.4 Delivery of New Capabilities and Services

Competition forces companies to release new services and products that are better and less costly for the customers. Data-driven finance and insurance companies leverage their capability to process more information in less time, thus creating a competitive advantage. Current state-of-the-art data platforms do not offer effective ways of managing all the different data points and datasets, which are typically spread across different types of databases and data stores. Thus, the opportunity cost of being unable to develop a new product in time can jeopardize a company’s market share.

Having seen the importance of speeding up these data pipelines, what can be done to accelerate them? To understand how these processes can be accelerated, we need to understand what takes time in each of their steps. This is discussed in the following section.

3 Regular Data Pipeline Steps in Digital Finance and Insurance

Banking and insurance data pipelines typically consist of three steps: data ingestion, data transformation, and generation of the required output.

3.1 Data Ingestion

The first step is the ingestion of data from the relevant data sources, typically stored in operational databases. This data ingestion is massive and takes many hours. The challenge here is to accelerate data ingestion without sacrificing the querying speed that becomes crucial for the next steps. A NoSQL solution, in particular a key-value data store, can accelerate data ingestion by one to two orders of magnitude. But why such acceleration? To answer that question, we need to dig into how traditional SQL and NoSQL solutions ingest data.

Traditional SQL solutions store database tables in a persistent data structure called a B+ tree. A B+ tree is a search tree that enables data to be found with logarithmic complexity. Figure 2.1 illustrates how this tree works. The data reachable from a node is split across its subtrees, and the keys in the node provide the split points. In the example, a node has three children: the first keeps all the rows with primary keys lower than k1, the last one those higher than or equal to k2, and the middle one the rest of the rows. Since the B+ tree is a balanced tree, it is guaranteed that the row with the searched primary key will be reached in a logarithmic number of node accesses, as illustrated in Fig. 2.1.

Fig. 2.1 Logarithmic search in a B+ tree
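To make the search path concrete, the following minimal Python sketch (an illustrative node layout, not any specific engine) descends one node per level using the split keys, so the number of visited nodes grows logarithmically with the table size.

```python
# Minimal sketch of logarithmic search in a B+ tree (illustrative, simplified).
import bisect

class Node:
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys            # split points, e.g., [k1, k2]
        self.children = children    # inner node: one more child than keys
        self.rows = rows            # leaf node: primary key -> row

    def is_leaf(self):
        return self.children is None

def search(node, key):
    """Descend one node per level, so the cost is O(log n) node accesses."""
    while not node.is_leaf():
        # Child 0 holds keys < keys[0]; the last child holds keys >= keys[-1].
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node.rows.get(key)

leaves = [Node([], rows={1: "row-1", 5: "row-5"}),
          Node([], rows={10: "row-10", 15: "row-15"}),
          Node([], rows={20: "row-20", 25: "row-25"})]
root = Node([10, 20], children=leaves)
print(search(root, 15))   # "row-15", found after visiting only two nodes
```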

B+ trees store the data in the leaf nodes. The database engine also implements a read-write cache of tree blocks, an LRU (least recently used) cache, which is illustrated in Fig. 2.2. This cache typically holds the upper part of the tree.

Fig. 2.2 B+ tree and associated block cache

When rows are inserted in a scattered manner, the target leaf node has to be read from disk, which means that a block first has to be evicted from the cache and written to disk (under high ingestion rates, all cached blocks are dirty) before the leaf node can be read. In large tables, inserting a row will typically mean reading multiple blocks from disk and evicting multiple blocks from the cache, since several tree levels will be missing from the cache. Therefore, the cost of inserting one row is that of reading and writing one or more blocks, i.e., two or more IOs per row, which is very expensive.
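To make this cost concrete, the following minimal Python sketch (illustrative block and cache sizes, not a real engine) counts the block reads and dirty-block write-backs caused by scattered inserts against an LRU block cache; it comes out close to two IOs per inserted row.

```python
from collections import OrderedDict
import random

BLOCK_ROWS = 100          # rows per leaf block (assumed)
CACHE_BLOCKS = 1_000      # blocks the LRU cache can hold (assumed)
TABLE_ROWS = 10_000_000   # rows already in the table (assumed)

cache = OrderedDict()     # block_id -> dirty flag, kept in LRU order
reads = writes = 0

def insert_into(block_id):
    """Touch the leaf block for one inserted row: hit, or evict + read on a miss."""
    global reads, writes
    if block_id in cache:
        cache.move_to_end(block_id)        # cache hit: no IO
    else:
        if len(cache) >= CACHE_BLOCKS:     # cache full: evict the LRU block
            _, dirty = cache.popitem(last=False)
            if dirty:                      # under heavy ingestion, all blocks are dirty
                writes += 1                # one IO to write the evicted block back
        reads += 1                         # one IO to read the missing leaf
    cache[block_id] = True                 # the insert dirties the block

random.seed(0)
N = 100_000
for _ in range(N):                         # scattered inserts over the whole key space
    insert_into(random.randrange(TABLE_ROWS // BLOCK_ROWS))

print(f"IOs per inserted row: {(reads + writes) / N:.2f}")   # close to 2
```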

The fastest NoSQL databases at ingesting data are key-value data stores and wide column databases. Hash-based solutions do not support even simple range queries, so they are unusable for the vast majority of data pipeline use cases. Other solutions are based on the approach of sorted string tables (SSTables), mimicking what is done in Google Bigtable. This approach relies on storing inserts and updates in a cache (see Fig. 2.3). This cache is typically based on a skip list, which is a kind of probabilistic search tree. When the cache reaches a threshold, it is serialized into a buffer sorted by the primary key (hence the name sorted string table, since the key is treated just as a string, actually an array of bytes), and the buffer is written into a file. This process is repeated. As can be seen in Fig. 2.3, every block that is written carries as many rows as fit in the block, so writes cannot be more efficient. However, there is an important trade-off. Reads now require scanning all the files related to the range that is being queried. The aforementioned process typically results in tens of files, which makes reading one to two orders of magnitude more expensive. To avoid further degradation of reads, there is also a compaction process that merges the files related to the same range. This compaction process is typically very intrusive and is often a stop-the-world process.

Fig. 2.3 Key-value data stores based on SSTables
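The write path just described can be sketched as follows in Python (a deliberately tiny, illustrative model: a plain dictionary stands in for the skip-list cache, and in-memory lists stand in for SSTable files). It shows why writes are cheap while a range read has to consult every run until compaction merges them.

```python
import bisect

class TinyLSM:
    """Illustrative sketch of the SSTable-style write path, not a real engine."""
    MEMTABLE_LIMIT = 4      # tiny flush threshold so the demo produces several runs

    def __init__(self):
        self.memtable = {}  # stands in for the skip-list write cache
        self.runs = []      # each flush produces an immutable sorted run (an "SSTable")

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            # Serialize the cache sorted by key; blocks are written completely full.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def range_query(self, lo, hi):
        # Reads must consult the memtable *and* every run, which is why reads
        # become one to two orders of magnitude more expensive than in a B+ tree.
        out = {k: v for k, v in self.memtable.items() if lo <= k <= hi}
        for run in reversed(self.runs):              # newer runs win over older ones
            keys = [k for k, _ in run]
            for i in range(bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)):
                out.setdefault(run[i][0], run[i][1])
        return sorted(out.items())

    def compact(self):
        # Stop-the-world compaction: merge all runs into a single sorted run.
        merged = {}
        for run in self.runs:                        # later (newer) runs overwrite older ones
            merged.update(run)
        self.runs = [sorted(merged.items())]

db = TinyLSM()
for k in [5, 1, 9, 3, 7, 2, 8, 6]:
    db.put(k, f"row-{k}")
print(len(db.runs), "runs;", db.range_query(2, 7))   # the read touches every run
```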

In traditional SQL databases, such queries are quite efficient: reaching the first row is logarithmic, thanks to the search tree nature of the B+ tree, and from there the following rows are read consecutively. Thus, range queries in SQL databases are substantially more efficient than in NoSQL ones.

Dealing with both current and historical data is very challenging because table size increases dramatically, which degrades ingestion performance in traditional SQL operational databases. The reason is that, as the table grows, ingesting data becomes more and more expensive: there are more levels in the B+ tree, and each row costs more and more blocks to be read from disk and evicted from the cache. In addition, rows become more and more scattered, further degrading the efficiency of data ingestion. This behavior is depicted in Fig. 2.4.

Fig. 2.4 Speed of ingestion with increasing DB sizes

3.2 Data Transformation

The second step is the process of analyzing and enriching the data, which typically involves aggregating information to compute KPIs using SQL aggregation queries. Since they deal with large amounts of data, these queries may take a long time to compute, as they have to traverse millions to billions of rows. Data warehouses are good at answering these kinds of queries. However, they do not support processes that enrich and transform the data, since they support only read-only queries, not updates.

NoSQL databases provide no, or very limited, aggregation support. In hash-based solutions, full scans of the table are necessary since they cannot perform range queries. The NoSQL solutions able to perform range queries are mostly based on SSTables, which makes them more than one order of magnitude more expensive and, thus, not competitive. Additionally, NoSQL solutions fail to handle concurrent updates of the data due to the lack of ACID properties, so many enrichment processes that require updating data are simply not supported. Doing aggregation in real time with NoSQL does not work since concurrent updates result in the lost update problem. For instance, consider two concurrent operations trying to add 10 and 20 to an aggregation row with a value of 0: both operations read 0 and then write 10 and 20, respectively. One write is applied first and the other second, for instance, 10 is written and then 20. The final result is 20 instead of the expected 30. This lost update problem is due to the lack of ACID properties.
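The lost update from the example above can be reproduced with a minimal Python sketch (a plain dictionary plays the role of the store, and the interleaving is written out explicitly rather than produced by real threads):

```python
# Minimal sketch of the lost update problem on an aggregation row
# (illustrative interleaving, not a specific database API).

store = {"daily_total": 0}

def read(key):
    return store[key]

def write(key, value):
    store[key] = value

# Two concurrent clients want to add 10 and 20 to the same row.
# Interleaving without transactional isolation:
a = read("daily_total")       # client A reads 0
b = read("daily_total")       # client B reads 0 (before A writes)
write("daily_total", a + 10)  # A writes 10
write("daily_total", b + 20)  # B writes 20, overwriting A's update

print(store["daily_total"])   # 20, not the expected 30: A's update was lost
```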

3.3 Generating the Required Output

The third step is to generate the required reports or output files, which involves running large, simple queries to extract the data of interest. Again, these kinds of queries are not the most appropriate for traditional operational SQL databases, since those are optimized for many short queries rather than for large analytical queries. NoSQL databases are simply bad at this step since they typically do not support the query expressiveness necessary to extract the data.

Finally, all three steps need to be performed at different scales, from small to very large. NoSQL databases scale well for the part they handle well, that is, data ingestion, but they do not support the other steps. SQL databases either do not scale out (if they are centralized) or scale out logarithmically (the case for shared storage and cluster replication, the two available approaches).

4 How LeanXcale Simplifies and Accelerates Data Pipelines

LeanXcale is an SQL database that can be integrated with all the applications that have been developed over the last 40 years using relational databases, making it simple to transform and use data. However, beyond the regular capabilities of an SQL database, LeanXcale has a set of unique features that make it optimal for simplifying and accelerating data pipelines. These features have two effects: data insertion is sped up, thanks to an efficient key-value storage engine and bidimensional partitioning, and processing can be parallelized across several execution threads without database locking, thanks to online aggregation and linear horizontal scalability.

LeanXcale overcomes the challenges of data pipelines based on the features that are detailed in the following paragraphs.

4.1 High Insertion Rates

The first feature addresses the first of the above-listed challenges: how to ingest data efficiently. The goal is to ingest large amounts of data in times that are at least an order of magnitude lower than those of traditional SQL operational databases. LeanXcale is a NewSQL database, and it adopts an approach similar to LSM trees. To maximize the locality of accesses, LeanXcale also has a cache of updates/inserts, as we saw in the SSTable approach adopted by some NoSQL databases. However, instead of writing the updates into SSTable files, which highly degrades read performance, LeanXcale propagates the updates into the B+ tree. Since there are many updates/inserts, many of them go to the same leaf node, thus amortizing disk block accesses across many rows instead of a single row, as is the case in traditional SQL databases. It must be noted that the cache of read blocks is kept as in traditional SQL databases. This approach is depicted in Fig. 2.5.

Fig. 2.5 LeanXcale B+ tree and write and read caches
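The amortization effect can be illustrated with a minimal Python sketch (this is only the idea, not LeanXcale's actual engine; the block size and flush threshold are assumptions): pending rows are buffered in a write cache and then applied grouped by target leaf block, so one block read/write pair is shared by many rows.

```python
# Sketch of the amortization idea described above (not LeanXcale's actual engine):
# buffer incoming rows in a write cache, then apply them grouped by target leaf
# block, so one block read/write is shared by many rows instead of one pair per row.
from collections import defaultdict

BLOCK_ROWS = 100              # rows per leaf block (assumed)
write_cache = []              # in-memory cache of pending inserts
block_ios = 0

def leaf_of(key):
    return key // BLOCK_ROWS  # stand-in for descending the B+ tree to a leaf

def flush():
    """Propagate the cached rows into the tree, one leaf block at a time."""
    global block_ios, write_cache
    by_leaf = defaultdict(list)
    for key, row in write_cache:
        by_leaf[leaf_of(key)].append((key, row))
    for leaf, rows in by_leaf.items():
        block_ios += 2        # read + write the leaf once for *all* its rows
    write_cache = []

for key in range(10_000):     # ingest 10,000 rows
    write_cache.append((key, f"row-{key}"))
    if len(write_cache) >= 5_000:
        flush()

print("block IOs per row:", block_ios / 10_000)   # ~0.02 instead of ~2
```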

4.2 Bidimensional Partitioning

LeanXcale also deals efficiently with long historical data. For this purpose, it uses a novel mechanism called bidimensional partitioning. The rationale behind this mechanism is that insertions often have temporal locality around a timestamp column or an auto-increment column. The mechanism leverages this observation to split data fragments automatically so that they do not get too big. Each fragment contains the rows within a primary key range, which is how the data within a table is partitioned across nodes, as shown in Fig. 2.6.

Fig. 2.6 Bidimensional partitioning in a distributed setting

The automatic partitioning is done on a second dimension, namely time, based on the timestamp column, as depicted in Fig. 2.7.

Fig. 2.7 Bidimensional partitioning

With this mechanism, newly ingested data lands on the newest data fragment, while the previous fragment cools down, gets evicted from the cache, and is written to disk. Thus, only the hot data fragments are kept in memory, and a load that was inherently IO-bound becomes CPU-bound and reaches high efficiency.
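The routing idea can be sketched as follows in Python (illustrative only: the key ranges and the daily time buckets are assumptions, not LeanXcale's actual fragmentation policy). Rows are placed by primary key range on one dimension and by time bucket on the other, so new fragments are opened automatically as time advances and old ones cool down.

```python
from datetime import datetime

# Assumed key ranges (one per node) and day-granularity time buckets.
KEY_RANGES = [(0, 1_000_000), (1_000_000, 2_000_000)]

fragments = {}   # (key_range, day_bucket) -> list of rows

def route(key, ts, row):
    """Pick the fragment for a row: key range (node dimension) x time bucket."""
    key_range = next(r for r in KEY_RANGES if r[0] <= key < r[1])
    bucket = ts.date()                                   # a new fragment per day
    fragments.setdefault((key_range, bucket), []).append(row)

route(42,        datetime(2024, 5, 1, 10, 0), {"id": 42,        "amount": 10})
route(42,        datetime(2024, 5, 2,  9, 0), {"id": 42,        "amount": 20})
route(1_500_000, datetime(2024, 5, 2,  9, 5), {"id": 1_500_000, "amount": 5})

for (key_range, bucket), rows in sorted(fragments.items()):
    print(key_range, bucket, len(rows), "row(s)")        # three separate fragments
```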

4.3 Online Aggregates

As mentioned above, one of the typical tasks in data pipelines is computing aggregates over the ingested data. These aggregates can be very expensive since they are computed over millions to billions of rows. In LeanXcale, we invented a novel mechanism, online aggregates, to alleviate this problem. Online aggregates leverage a new concurrency control method (patent filed by LeanXcale) called semantic multiversion concurrency control. This invention makes it possible to aggregate over a row in real time, with high levels of concurrency (e.g., many thousands of updates per second) and without any contention, while providing full ACID consistency. SQL databases do not do aggregations in real time since they result in high levels of contention: traditional concurrency control, based on locking, forces the updates of the aggregate to happen one by one. When one update is performed on the aggregation row, the next update cannot be applied until that transaction commits and remains blocked on the lock held over the row.

In LeanXcale, aggregations over the same row do not conflict because we leverage semantic multiversion concurrency control, which prevents any contention. Thus, the aggregations can be computed progressively as data is ingested. The idea is that aggregate tables are created for the targeted aggregations. When a row is inserted, the aggregations in which it participates are updated by updating the corresponding rows in the relevant aggregate tables. When the data ingestion ends, the aggregations are already computed. Thus, a heavy and long analytical aggregation query in a traditional SQL database is transformed by LeanXcale into a very light query that reads just one value, or a list of values, rendering these queries almost costless.
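The following minimal Python sketch illustrates the general idea of conflict-free aggregate updates (it is not LeanXcale's patented semantic multiversion concurrency control): each concurrent update records a commutative delta instead of performing a read-modify-write, so the lost update from Sect. 3.2 cannot occur and the visible value is always the base value plus all committed deltas.

```python
# Sketch of the general idea behind online aggregates (not LeanXcale's mechanism):
# concurrent transactions record commutative deltas that never conflict.
import threading

class OnlineAggregate:
    def __init__(self, base=0):
        self.base = base
        self.deltas = []                    # committed deltas (append-only)
        self._lock = threading.Lock()       # protects the list itself, not the value

    def add(self, delta):
        # Commutative update: no read of the current value, so no lost update.
        with self._lock:
            self.deltas.append(delta)

    def value(self):
        # Reading the aggregate is a single light lookup, not a scan of raw rows.
        return self.base + sum(self.deltas)

daily_total = OnlineAggregate()
threads = [threading.Thread(target=daily_total.add, args=(d,)) for d in (10, 20)]
for t in threads: t.start()
for t in threads: t.join()
print(daily_total.value())                  # 30: both concurrent updates are kept
```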

4.4 Scalability

LeanXcale provides horizontal scalability by scaling the three layers of the database, namely, the storage engine, the transactional manager, and the query engine (see Fig. 2.8). The storage layer is a proprietary technology called KiVi, which is a relational key-value data store. It follows the same principles as key-value and wide column data stores and splits data into fragments according to primary key ranges. However, unlike NoSQL data stores, it implements all relational algebra operators except joins. This allows any operator below a join in the query plan to be pushed down to KiVi. In this way, it is possible to obtain a high level of efficiency in processing SQL, despite the distribution. For instance, let’s assume we want to compute the sum of a column over a large table with one billion rows that is spread across ten servers. In a traditional SQL database, the billion rows would be sent to the query engine, where they would be summed. In LeanXcale, each of the ten servers performs the sum over its tenth of the rows locally and sends a single message with the locally aggregated value to the query engine. In this way, LeanXcale avoids sending a billion rows over the network.

Fig. 2.8 Architecture of LeanXcale platform
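The pushdown of the example above can be sketched in a few lines of Python (purely illustrative data and partitioning): each "server" aggregates its local fragment and ships one partial result, so the query engine combines ten numbers instead of a billion rows.

```python
# Sketch of the aggregation pushdown described above (illustrative, assumed
# partitioning): storage servers aggregate locally and ship one value each.
def storage_server_sum(local_rows, column):
    """Runs on each storage server: aggregate locally, return a single value."""
    return sum(row[column] for row in local_rows)

def query_engine_sum(servers, column):
    """Runs on the query engine: combine one partial result per server."""
    return sum(storage_server_sum(rows, column) for rows in servers)

# Ten "servers", each holding a fragment of the table (tiny stand-in data).
servers = [[{"amount": i} for i in range(s * 100, (s + 1) * 100)] for s in range(10)]
print(query_engine_sum(servers, "amount"))   # 499500 == sum(range(1000))
```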

The query engine scales out by distributing the database clients across different query engine instances. Each query engine instance takes care of a fraction of the clients and executes the queries sent by them.

The transactional manager is one of the main inventions of LeanXcale. Scaling out transactional management had remained an unsolved problem for three decades. LeanXcale solved it by decomposing the ACID properties and scaling out each of them in a composable manner. The ACID properties are:

  • Atomicity: In the event of a failure, a transaction is all or nothing, that is, either all its updates take effect or it is as if the transaction was never executed.

  • Consistency: Transaction code should be correct: if it starts from a correct database state, the database state should remain correct after the transaction.

  • Isolation: The effect of the concurrent execution of transactions should fulfill a correctness criterion, i.e., serializability, which means that it should be equivalent to a serial execution of the transactions.

  • Durability: After a transaction commits, its updates cannot be lost, even in the event of failures.

Traditional SQL operational databases rely on a centralized transactional manager (see Fig. 2.9), which becomes one of the main bottlenecks that prevents scaling out.

Fig. 2.9 Central transactional manager providing all the ACID properties

In LeanXcale, these ACID properties get decomposed as illustrated in Fig. 2.10.

Fig. 2.10 LeanXcale decomposition of the ACID properties and scale out of them

Consistency is enforced by the application; the database simply helps by providing the ability to define integrity constraints that are automatically checked. Thus, we focus on the other three ACID properties, and we split isolation into the isolation of reads and of writes, which are attained by different mechanisms. Atomicity is scaled out by using different transaction manager instances called local transaction managers (LTMs). Each of them handles the lifecycle of a subset of the transactions, and atomicity is enforced as part of this lifecycle. Isolation of writes has to handle write-write conflicts. These are distributed by means of hashing, and each conflict manager handles the conflict management of a set of buckets of primary keys. Durability is scaled out by means of loggers. Each logger handles a subset of the log records, for instance, the log records from a set of LTMs. Isolation of reads is a little more involved. We actually use two components, the commit sequencer and the snapshot server, which do not need to be scaled out because the work they perform per update transaction is so small that they can handle hundreds of millions of update transactions per second.
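As a simple illustration of how write-write conflict detection can be spread out by hashing (the bucket and manager counts below are arbitrary assumptions, not LeanXcale's configuration), each primary key maps to a bucket and each bucket belongs to exactly one conflict manager:

```python
# Sketch of distributing write-write conflict detection by hashing (illustrative):
# each conflict manager owns a set of hash buckets of primary keys, so conflict
# checks for different keys scale out across managers.
import hashlib

NUM_BUCKETS = 16            # assumed
NUM_CONFLICT_MANAGERS = 4   # assumed deployment size

def bucket_of(primary_key: str) -> int:
    digest = hashlib.sha256(primary_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def manager_of(primary_key: str) -> int:
    # Buckets are assigned to conflict managers in a round-robin fashion.
    return bucket_of(primary_key) % NUM_CONFLICT_MANAGERS

for key in ("customer:42", "customer:43", "policy:7"):
    print(key, "-> bucket", bucket_of(key), "-> conflict manager", manager_of(key))
```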

LeanXcale avoids contention during commit management by not committing the transactions sequentially (Fig. 2.11). LeanXcale basically commits all transactions in parallel without any coordination and thus treats commitment as an embarrassingly parallel problem. However, one aspect of transaction commit is decoupled from it: the visibility of the committed state. Basically, transactions are committed in parallel, but their updates are not made visible right away. Transactions are assigned a commit timestamp (by the commit sequencer) and are serialized according to this order. As long as there are no gaps in the commit order, the state is made visible by advancing the current snapshot. All transactions with a commit timestamp equal to, or lower than, this snapshot are visible. In this way, LeanXcale can process an arbitrarily high number of commits in parallel, which enables it to scale out linearly up to hundreds of nodes.

Fig. 2.11 Traditional transactional databases vs. LeanXcale
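The visibility rule can be sketched with a few lines of Python (illustrative only, not LeanXcale's implementation): commits arrive in parallel and out of order, and the snapshot advances only over gap-free prefixes of the commit timestamp sequence.

```python
# Sketch of the visibility rule described above (illustrative): transactions
# commit in parallel with pre-assigned commit timestamps, and the snapshot only
# advances while there are no gaps in the sequence of committed timestamps.
class SnapshotServer:
    def __init__(self):
        self.committed = set()   # commit timestamps reported as committed
        self.snapshot = 0        # everything <= snapshot is visible to readers

    def report_commit(self, commit_ts):
        self.committed.add(commit_ts)
        # Advance the snapshot over every consecutive committed timestamp.
        while self.snapshot + 1 in self.committed:
            self.snapshot += 1

s = SnapshotServer()
for ts in (2, 3, 1, 5):          # commits finish out of order, in parallel
    s.report_commit(ts)
    print(f"committed ts={ts}, visible snapshot={s.snapshot}")
# ts=2 -> snapshot 0 (gap at 1); ts=3 -> 0; ts=1 -> 3; ts=5 -> 3 (gap at 4)
```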

5 Exploring New Use Cases: The INFINITECH Approach to Data Pipelines

As described above, LeanXcale’s features speed up data pipeline processing time, which results in IT cost savings, productivity improvements, and reduced operational risks. Furthermore, LeanXcale makes it possible to tackle new use cases. LeanXcale participates in the H2020 INFINITECH project as technical leader. Below we discuss the work done in INFINITECH to uncover new ways to satisfy the needs of financial and insurance customers as well as internal needs.

INFINITECH aims to break through the current technology barriers that keep modern AI architectures and solutions in their infancy, as there is no adequate support for real-time business intelligence in the finance and insurance sector. Data tends to be located in isolated sources, and data pipelines are not efficient for real-time analytical processing over datasets that span different data stores (hot and historical data combined with data streams). This limits the creation of novel business models and their validation via scenarios that could exploit real-time risk assessment algorithms, the detection of fraudulent financial transactions on the fly, or the identification of potential business opportunities.

Toward this aim, INFINITECH is developing a unified data management layer via intelligent data pipelines in order to fulfill the needs of the finance and insurance sector efficiently. This means providing unified access over fragmented data stored in different silos and allowing for integrated query processing that combines static (or persistently stored) data with in-flight data coming from streaming sources. This will allow a cost-effective execution of advanced data mining algorithms and the real-time identification of events of interest. Moreover, by combining a multitude of advanced analytic techniques via its intelligent data pipelines, it facilitates data access and processing for analytical functions, providing low-latency results that enable the algorithms to deliver real-time analytics.

Using the INFINITECH data pipelines, analytic results do not suffer from having to access only an outdated snapshot of the dataset that resides in a data warehouse after the execution of a periodic batch process, a barrier that otherwise forbids real-time business intelligence. Additionally, INFINITECH aims to provide a stream processing framework that allows queries to be executed over both data at rest and data in-flight. As analytical queries are expensive, their typical response times are not suitable for streaming operators, which makes the database the bottleneck of such architectures. Moreover, the ingestion of streaming data into persistent data stores cannot be supported by traditional SQL databases, while the use of NoSQL stores would sacrifice transactional semantics, which is not an option for applications in the insurance and finance domains. The INFINITECH data pipelines remove these barriers of current architectures, while also automating the parallelization of data streaming operators. This allows those operators to be dynamically and horizontally scaled at run time, relying on the fact that the data management layer can absorb the increased input workload, which might also force it to scale out.

6 Conclusion

Data is the pivot around which financial and insurance companies revolve in order to develop their competitive advantages. To process their ever-increasing volumes of data, these organizations have built data pipelines. Typically, these platforms are not agile enough, even when organizations develop complex and expensive architectures that blend different kinds of data stores. Beyond the benefits that a regular SQL database provides, LeanXcale reduces data access time independently of data size and allows efficient process parallelization. This combination of capabilities helps to reduce data pipeline complexity and the total cost of ownership. More importantly, however, it uncovers new ways of generating value with use cases that were previously not possible.