1 Introduction

Driven by the increasing connectivity between data generation (e.g., Online Transaction Processing, OLTP for short) and data consumption (e.g., Online Analytical Processing, OLAP for short), real-time analytics over new data has attracted much interest from academia and industry [18, 78, 111, 168, 175]. The fundamental motivation behind this trend is that much information is most valuable when it first appears, and its value usually decays over time [8, 44, 95]. Meanwhile, data is increasingly consumed by intelligent algorithms (e.g., AI-assisted ones [118, 171]) rather than human readers in order to make timely decisions automatically [8, 14, 56].

Nevertheless, for historical reasons, the connection between OLTP and OLAP has traditionally been handled by an Extract-Transform-Load (ETL) workflow [164, 166], which usually takes minutes to hours. This workflow can be good enough for traditional business intelligence, where data is processed in large batches and execution itself takes minutes to hours (e.g., daily reporting [144] and data mining [75]). In these scenarios, the data freshness loss (i.e., the time consumed by ETL) and the execution time are of the same magnitude, and the loss is tolerable. However, for real-time applications that must make time-critical decisions within seconds (e.g., real-time pricing [60], fraud detection [36, 136], smart industry [161, 174]), ETL is too slow and expensive. This trend highlights the importance of high data freshness for analysis. We compare the goals and scopes of time-critical decisions and traditional business intelligence in Fig. 1.

In response, Hybrid Transactional and Analytical Processing (HTAP) emerged to handle OLTP and OLAP requests in a single system, thus eliminating ETL. Specifically, HTAP systems are characterized by their ability to interleave transactions and analytical queries [63, 103]. That is, analytical queries can observe the latest writes made by transactions, and transactions can promptly use the results of analytical queries as payloads. We summarize the common target scopes of HTAP databases in Sect. 3.

Such technical coherence between OLTP and OLAP raises new design challenges for data storage, execution engines, and query optimizers. To tackle these challenges, tremendous efforts have been made in this area, resulting in a rich literature of related papers and solutions (see Table 1). The adopted architectures and design choices vary significantly. To systematically organize existing designs, we broadly classify HTAP systems based on their data format designs, a criterion also adopted by a previous study [126]. In contrast, our survey greatly extends that two-way classification of HTAP by summarizing common issues and discussing how potential solutions can be combined.

In particular, we first study monolithic HTAP systems, which serve OLTP and OLAP workloads with an identical data format design. Typically, OLTP and OLAP workloads exhibit different data access patterns: OLTP operates on a single data tuple at a time and accesses many of its attributes, while OLAP typically accesses a massive number of rows at a time but only a subset of their attributes. Monolithic HTAP systems strive for a “one-size-fits-all” format design and allow OLAP to make progress regardless of concurrent OLTP. Nevertheless, monolithic HTAP may incur significant performance issues when serving mixed workloads. We summarize three common issues: contradictory format demands, performance degradation of MVCC, and weak performance isolation. In response to these common issues, several new techniques for data grouping, MVCC, in-memory snapshots, and read/write splitting have been proposed. We discuss them in Sect. 4.

We then study HTAP systems using hybrid data formats, which independently optimize the underlying data formats for OLTP and OLAP. Although hybrid data formats mitigate the contradictory format demands, they bring new performance issues. We summarize six common issues for HTAP with a hybrid data format: data synchronization, data consistency, the gap in write efficiency, hybrid data access, performance isolation, and sharding strategy. To understand the design rationales, we further classify HTAP with hybrid data formats into four typical architectures: row-oriented, column-oriented, separated, and hybrid architectures. Based on our taxonomy, we discuss the existing solutions to the six common issues one by one in Sect. 5.

To enlighten new HTAP designers, we summarize the pros and cons of existing solutions (monolithic design, column-oriented hybrid design, row-oriented hybrid design, separated hybrid design, and hybrid architecture with hybrid data formats) across various scopes (transaction performance, query performance, storage overhead, data freshness, performance isolation, scalability, performance stability, and system complexity). This can be a good starting point for newcomers (see Sect. 6). We also present readers with emerging applications, benchmarks, and future directions.

Fig. 1

A comparison between the targets of HTAP databases and traditional business intelligence [35]. HTAP databases provide real-time analytics to maximize the value of data

1.1 Contributions

To the best of our knowledge, our survey takes the first step to deeply review the existing HTAP architectures, compare the underlying technologies, and give a comprehensive overview of HTAP regarding system design, applications, and benchmarks. Our contributions include:

  • We systematically summarize the common design goals of HTAP. We then intensively study the common design issues and existing solutions to enlighten and guide researchers and practitioners in this area.

  • We present a new fine-grained taxonomy, which comprehensively organizes HTAP architectures based on the storage layer. This provides us with a uniform methodology to compare different implementations and trade-offs.

  • We present readers with cutting-edge real-time applications powered by HTAP systems. These applications can serve as practical templates for developing new ones and motivate new research in HTAP.

  • Last but not least, we also present the emerging benchmarks, unique evaluation metrics, and standard evaluation methods of HTAP.

Table 1 This table lists the HTAP systems from academia and industry since 2014

1.2 Related surveys and research collections

There are two tutorials related to our survey. Özcan et al. presented a tutorial on the taxonomy of HTAP systems at SIGMOD 2017 [126]. It covers multiple data processing systems, including NoSQL [157], SQL-on-Hadoop [2, 68, 162], and Spark SQL [17, 22, 109, 179]. However, given the intensive research and development efforts since then, the architectures and technologies of newly emerging HTAP systems (Table 1) are not covered by this tutorial (especially designs that appeared after 2017). Compared to previous designs, these new HTAP systems are more tightly coupled and leverage multiple advanced techniques from data warehouse systems. For instance, column stores are regarded as a common optimization approach in HTAP systems with hybrid data formats (Sect. 5). Moreover, as discussed in Sect. 1, our paper is more than a two-way classification. It greatly extends the taxonomy in [126] to cover the common design issues of HTAP and organize the corresponding solutions.

Li et al. summarized the pros and cons of existing HTAP systems in a SIGMOD 2022 tutorial [103]. Different from our work, their classification is limited to the selected systems (i.e., a new HTAP system may not fall into any of their categories) and does not discuss HTAP systems with monolithic data formats (Sect. 4). In contrast, we comprehensively review different HTAP architectures, applications, benchmarks, metrics, and evaluation methods. Moreover, our survey deeply explores the dependencies and trade-offs between different design choices, providing a more fine-grained, hierarchical taxonomy to guide future research.

1.3 Survey organization

We first give a general view of ETL, a brief history of HTAP, and a discussion of other related systems in Sect. 2. Then, we summarize the common design goals and define two crucial metrics of HTAP in Sect. 3. After that, we review HTAP architectures with monolithic and hybrid data formats in Sects. 4 and 5, respectively. Based on our taxonomy, we discuss the lessons we learned in Sect. 6. In Sect. 7, we summarize the cutting-edge applications and standard benchmarks for HTAP. In Sect. 8, we analyze the future directions and challenges. Finally, we conclude this survey in Sect. 9.

Fig. 2

An overview of our taxonomy. We classify existing HTAP systems into two main streams based on data format and discuss each stream’s design issues and potential solutions

2 Background

2.1 Extract-transform-load workflow

Traditional Extract-Transform-Load workflows, which we term the first generation of ETL, extract data from various sources and then transform and load it into a data warehouse. Without HTAP, ETL is a mandatory step in projects that implement decision-making information or knowledge management systems within organizations. However, traditional ETL is a long and costly process. It can be adequate for routine queries (e.g., daily reports) and large-scale analysis (e.g., Spark SQL) but lacks adequate support for real-time analysis. In particular, the workflow periodically issues heavy ETL queries against the private storage of OLTP (e.g., MySQL), then extracts data updates from the query results, and finally merges the updates into the private storage of OLAP (e.g., MonetDB).
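To make the first-generation workflow concrete, the following is a minimal, hypothetical sketch of such a periodic pull-based ETL loop; sqlite3 stands in for both the OLTP source (e.g., MySQL) and the OLAP warehouse (e.g., MonetDB), and the table schema, the `updated_at` high-water mark, and the transformation rule are illustrative assumptions rather than part of any cited system.

```python
import sqlite3, time

# Illustrative stand-ins: sqlite3 plays the role of the OLTP source and the
# OLAP warehouse; all table and column names are made up.
oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at REAL)")
olap.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at REAL)")

last_loaded_ts = 0.0  # high-water mark of the previous ETL run

def etl_cycle():
    """Extract rows changed since the last run, transform them, and load them."""
    global last_loaded_ts
    # Extract: heavy query against the OLTP store's private storage.
    rows = oltp.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_loaded_ts,),
    ).fetchall()
    # Transform: e.g., drop invalid orders (placeholder rule).
    rows = [r for r in rows if r[1] >= 0]
    # Load: append the delta into the OLAP store (a real merge/upsert is omitted).
    olap.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    olap.commit()
    if rows:
        last_loaded_ts = max(r[2] for r in rows)

# The cycle runs periodically, so data freshness is bounded by the period.
oltp.execute("INSERT INTO orders VALUES (1, 9.5, ?)", (time.time(),))
oltp.commit()
etl_cycle()
```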

Recent work on ETL focuses on improving its timeliness to cater to the demand for real-time data ingestion; we term this the second generation of ETL. For instance, several works [164, 166] exploit task parallelization to speed up the workflow. Other works [34] use batch processing to improve the throughput of the transformation phase. As both HTAP and real-time ETL strive to provide high-performance OLTP and OLAP over fresh data, they share a similar design target.

However, real-time ETL differs from HTAP in several design scopes and thus has a different application domain. First, a real-time ETL workflow is designed as a service transparent to the underlying database, and it should be compatible with different database engines and backends. In contrast, HTAP is a built-in service inside the database and is fully managed by the database engine. To capture changed data, an HTAP database with hybrid data formats usually triggers a monitor inside the database instead of issuing ETL queries, which is studied as change data capture (CDC) in the literature [156]. Second, as a real-time ETL workflow processes data in batches, it can clean the data before ingestion using a custom program (e.g., specific code written by the database administrator). In contrast, such user-defined data cleaning is usually not supported by HTAP.

Due to these two design differences, a real-time ETL workflow usually leads to worse data freshness than HTAP. Meanwhile, due to its flexibility, ETL is better suited to big-data analysis (e.g., MapReduce and Spark). Some previous studies [34, 126] indeed blur the boundary between real-time ETL and HTAP. In our survey, we differentiate them because they have evolved into two different research directions with different optimization focuses in recent years. Our classification is based on whether data synchronization is a built-in service of the database.

2.2 A brief history of HTAP

The term “HTAP” was first introduced to the public in 2014 by a Gartner report [130]. Even before the term was coined, multiple works [12, 26, 62, 85, 86, 102, 121, 132] had striven to handle both OLTP and OLAP workloads simultaneously, formerly referred to as hybrid OLTP & OLAP, mixed workload processing, or OLxP. Although these research prototypes may have been phased out in recent HTAP products, some of their key techniques and designs have become the foundation for further development.

In recent years, driven by the increasing demand for all-in-one data management solutions and real-time analytics, more and more HTAP systems have been proposed and developed. We list these advanced HTAP systems in Table 1. These systems implement HTAP in different ways, and each makes different design decisions due to different design goals. To systematically study the performance trade-offs behind these decisions, we broadly classify existing HTAP systems into two main streams, and each stream contains multiple sub-streams (Fig. 2). We comprehensively discuss their design decisions (e.g., storage model, synchronization model, and execution engine) and design rationales in Sects. 4 and 5.

Briefly, the first stream uses a monolithic data format with an optimized data structure to run transactions and queries simultaneously (e.g., PTree [158] and Diva [90]). Typical designs allow OLAP queries to observe transaction updates by sharing the same data copy. However, this may incur physical and logical resource conflicts between OLTP and OLAP. Moreover, it forgoes the opportunity to independently optimize the data engines and layouts for OLTP and OLAP. As such, common OLAP optimizations (e.g., batch iteration [4] and late materialization [4, 5]) may not be adopted in these systems. The second stream leverages hybrid data formats (e.g., TiDB [78] and BatchDB [111]). These systems run analytical queries and transactions on dedicated, separately optimized storage and thus allow for better performance optimization (i.e., independent storage engines, independent execution optimization, and independent scalability). Nevertheless, managing a hybrid data format is not free: data synchronization between the two formats incurs additional overhead. For such hybrid architectures, a key challenge is how to efficiently maintain a consistent view for analytical queries while preserving the atomicity and isolation guarantees of the DBMS.

2.3 Other systems with a similar design goal as HTAP

Besides real-time ETL and HTAP, some other lines of work share a similar design goal: connecting data generation and consumption in a real-time manner (a long-standing goal in data processing). Compared to HTAP, the specific focus of these systems is somewhat different, and thus they are tailored for different applications and deployment scenarios. We discuss two popular and representative categories below.

Streaming Systems. Originally, streaming systems did not support transactional semantics and provided only weak atomicity guarantees. That is, given two SQL operations in a single transaction, traditional streaming systems treat them as two individual events. When processing these two events on the OLAP side, a streaming system only guarantees that they are applied in order, while an OLAP query may still observe the intermediate results of the transaction (i.e., the effect of only the first SQL operation in our example).

Recent works support transactional analysis within the streaming pipeline. They focus on optimizing consistent event processing and adopt concurrency control models from the database community for correctness. However, even when using a transactional streaming system for OLAP, the database administrator still needs to configure the input events (which are essentially the updates from OLTP) using change data capture.

To summarize, traditional streaming systems are good for real-time alerting and monitoring but lack efficient support for transactions. Emerging transactional streaming systems do not have built-in database services and rely on additional stateful operations to process transactional requests.

Data Lakes. The concept of data lakes is to process different data sources in a single system, thus providing one-stop data management (as HTAP does). However, unlike HTAP, data lakes target all structured, semi-structured, and unstructured data and store data in its native format. In contrast, HTAP only processes structured data generated by transactions and analyzes that data directly for better data freshness. In addition, HTAP generally provides stronger consistency and isolation guarantees than data lakes.

3 Design goals and evaluation metrics

By definition, HTAP systems target high-performance OLTP and OLAP. However, besides high transaction throughput and low query latency, unique challenges can always arise when mixing these two types of workloads, leading to specific design goals. We summarize these goals as follows.

  • [Transparent Query Execution] As a unified system, database users should not be required to understand the working logic of the systems, nor should they identify query types or transactions manually [168]. HTAP systems should provide a unified interface for transactions and queries and route them automatically.

  • [High Data Freshness] A significant motivation of HTAP is to eliminate ETL, thus providing intelligent insights into fresh data at generation speed. For that reason, freshness is commonly mentioned as an essential design goal of HTAP [39, 58, 78, 83, 103, 111, 150, 154, 168].

  • [Strong Performance Isolation] Performance isolation refers to the ability to maintain the performance of a specified workload (e.g., OLTP) while another workload changes. Generally, in an HTAP application, transactions often play a mission-critical role in production, and performance degradation of OLTP directly degrades application quality. Thus, performance isolation is always treated as a significant design goal of practical HTAP systems [78, 103, 111, 150, 154, 168].

  • [Strong Consistency Across OLTP and OLAP] Strong consistency is always good to have. Multiple HTAP systems (e.g., [39, 78, 168]) target maintaining strong consistency across OLTP and OLAP. By doing so, application developers do not need to spend extra effort handling data consistency issues in the application layer, largely reducing the development complexity.

  • [Excellent Elasticity and Scalability] In HTAP scenarios, the consumption of physical resources (e.g., CPU and disk IO) may fluctuate significantly depending on the mixture of different types of workloads. Thus, supporting excellent elasticity and scalability can also be a significant design goal of HTAP systems [131, 138, 168].

Given such design goals, several evaluation metrics have been proposed. We highlight two HTAP-specific ones (i.e., data freshness and performance isolation). In contrast, the metrics for other design goals can fit into traditional ones (e.g., consistency models from OLTP are still applicable for HTAP).

3.1 Metrics of data freshness

To gauge data freshness, the metric should quantify how recently each analytical query sees the view (a.k.a. snapshot) of OLTP data in an HTAP system [117]. We summarize existing freshness metrics into two categories. The first category uses the occupied space ratio for the metric, while the second uses time intervals.

3.1.1 Space-based freshness metrics

Raza et al. propose freshness-rate in [138]. Assume the OLTP and OLAP engines have two separate private storages. Freshness-rate represents the ratio of data tuples that are identical between the two private storages. We show an example in Fig. 3, where the data in the red box is visible to OLTP and the data in the blue box is visible to OLAP. As OLTP continuously generates new data, the data visible to OLTP and OLAP may differ.

Accordingly, a higher freshness-rate means better data freshness. When the two engines share the same data storage (e.g., using the monolithic data format with a single data copy), the freshness-rate metric will always be 1. Instead, when their storage is independent (e.g., using a hybrid data format and maintaining two data copies), this metric should generally be less than 1.

$$\begin{aligned} \textit{freshness-rate} = \frac{count(tuples_{OLTP} \cap tuples_{OLAP})}{count(tuples_{OLTP} \cup tuples_{OLAP})} \end{aligned}$$
(1)

Thus, HTAP systems can achieve a high freshness-rate either by the two engines sharing the same data storage or by speeding up the transfer of the corresponding delta (i.e., the grey data log in Fig. 3).
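The following minimal Python sketch illustrates Eq. (1); representing each store as a set of (key, version) pairs is our own simplifying assumption for illustration, not the formulation in [138].

```python
def freshness_rate(oltp_tuples, olap_tuples):
    """Eq. (1): ratio of tuple versions shared by both stores.

    Both arguments are sets of (key, version) pairs (an illustrative choice).
    """
    union = oltp_tuples | olap_tuples
    if not union:
        return 1.0  # two empty stores are trivially in sync
    return len(oltp_tuples & olap_tuples) / len(union)

# Example: OLTP holds a newer version of key 2 that OLAP has not absorbed yet.
oltp = {(1, 10), (2, 20), (3, 30)}
olap = {(1, 10), (2, 15), (3, 30)}
print(freshness_rate(oltp, olap))  # 0.5
```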

Fig. 3

This diagram shows the comparison of three freshness metrics

3.1.2 Time-based freshness metrics

Another freshness metric, freshness-score, is based on the time dimension and is more intuitive than freshness-rate. Briefly, the freshness-score for a single analytical query q is defined as the quantitative measure below, where \(t_{q}^{s}\) denotes the start time of q and \(t_{q}^{fns}\) denotes the commit time of the first transaction that updates the data read by the query but is not seen by q.

$$\begin{aligned} \textit{freshness-score}_q \ = \ max(0, t_{q}^{s} - t_{q}^{fns}) \end{aligned}$$
(2)

We show an example in Fig. 3. As freshness-score is defined for each query, we need to run the query multiple times to measure the performance of the whole system. Then, the freshness score of an HTAP system is defined as the aggregation of the freshness scores of all analytical queries that are continuously executed in the database:

$$\begin{aligned} \textit{freshness-score} = AVG(\textit{freshness-score}_q) \end{aligned}$$
(3)

Accordingly, a smaller freshness-score means better data freshness. In particular, freshness-score = 0 implies that the HTAP system can always provide the most recent version of the operational data to all analytical queries. freshness-score = \(\alpha \) seconds means that, on average, the snapshot used by the analytical queries is outdated by \(\alpha \) seconds.
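A minimal sketch of Eqs. (2) and (3) follows; it assumes the commit time of each query's first not-seen transaction is already known, which, as noted in the paragraph that follows, is the hard part in practice.

```python
def freshness_score(queries):
    """Eqs. (2)-(3): average per-query staleness in seconds.

    `queries` is a list of (t_start, t_first_not_seen_commit) pairs; obtaining
    the second timestamp is an assumption of this sketch.
    """
    per_query = [max(0.0, t_s - t_fns) for t_s, t_fns in queries]
    return sum(per_query) / len(per_query)

# Three queries whose snapshots are 0.2s, 0.0s, and 1.0s stale, respectively.
print(freshness_score([(10.2, 10.0), (11.0, 11.5), (12.0, 11.0)]))  # 0.4
```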

Note that identifying the first not-seen transaction can be challenging in a practical HTAP system, as we may not have a centralized sequencer that records the entire serializable execution history. Evaluating this metric may require more effort, especially when the database is partitioned and spans multiple machines.

Besides freshness-score, visibility delay is another time-based freshness metric. By definition, visibility delay is the time interval within which updates to the database become visible to OLAP queries. Different from freshness-score, it may not directly represent the delay observed by users but instead focuses on performance evaluation inside the system. This makes visibility delay much easier to measure, as it does not rely on finding the first transaction not seen by the analytical queries. As a result, visibility delay is one of the most popular metrics evaluated in previous papers [18, 34, 39, 78, 97, 111, 150, 175]. We show an example of visibility delay in Fig. 3, which captures the time window between data becoming visible to OLTP and becoming visible to OLAP.

3.2 Evaluation method of performance isolation

To measure performance isolation, a simple approach suggested by Gartner is to instruct clients of one workload type (e.g., OLTP clients) to sustain a configured throughput (e.g., about half of the peak throughput) and to let clients of the other type (e.g., OLAP clients) saturate the system [47, 150]. As the number of the latter clients increases, the less the former workload's performance degrades, the better the performance isolation. We show an example in Fig. 4: ideal isolation implies no performance drop and thus plots as a horizontal line. This approach is followed by [18, 78, 111, 150, 168].

Milkai et al. extend this approach by visualizing the throughput frontier in a 2D chart [117]. In the chart, the x-axis represents OLTP throughput and the y-axis represents OLAP throughput. Each point in the chart indicates that the HTAP system can simultaneously achieve that pair of OLTP and OLAP throughputs. A frontier is then generated by taking, for each fixed OLTP throughput, the maximum achievable OLAP throughput. An HTAP system achieves ideal performance isolation when the OLTP and OLAP workloads each perform as if executed independently, which plots as a rectangular frontier in the chart (see the example in Fig. 4). Compared to the previous approach, the chart provides a more comprehensive view of the performance isolation property.
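The following toy Python sketch illustrates the first evaluation method; `run_mixed_workload` is a placeholder for a real benchmark driver, and the 3%-per-client degradation is a made-up number used only to produce output.

```python
# OLTP clients sustain a fixed target rate while the number of OLAP clients is
# swept; the retained OLTP throughput at each step quantifies isolation.
def run_mixed_workload(oltp_target_tps, olap_clients):
    # Placeholder measurement: pretend 3% of OLTP throughput is lost per OLAP client.
    degradation = 0.03 * olap_clients
    return oltp_target_tps * max(0.0, 1.0 - degradation)

baseline = run_mixed_workload(oltp_target_tps=5000, olap_clients=0)
for n in (1, 2, 4, 8, 16):
    achieved = run_mixed_workload(5000, n)
    print(f"{n:2d} OLAP clients -> OLTP throughput retained: {achieved / baseline:.0%}")
```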

Fig. 4

This diagram compares two isolation evaluation methods

4 HTAP with monolithic data format

Fig. 5

Faster Scan of MVCC. This diagram provides an overview of the existing approaches that reform the MVCC chain for better scan performance. It shows the access path of a column scan operator on column A with timestamp 150

4.1 Common issues

We now discuss the challenges and solutions for HTAP systems with a monolithic data format. Sharing the same physical data format across OLTP and OLAP workloads is the most straightforward approach to executing analytical queries on the fresh data generated by transactions. Most of the systems belonging to this stream evolved from legacy databases.

A few decades ago, OLTP and OLAP workloads were not clearly divided and were mixed in traditional databases (e.g., MySQL) without any HTAP context. More recently, several works have been proposed to make the monolithic design meet HTAP's performance requirements. Our survey focuses on these HTAP-specific optimizations.

In general, a unified architecture has strengths in reducing data redundancy and eliminating the burden of managing consistency [90]. However, it suffers from several performance issues:

  • [Issue1: Contradictory Format Design] OLTP workloads prefer a row-oriented data format for efficient tuple inserts and updates; however, this format is inefficient for read-only analytical queries, which only need to read a small subset of attributes (over massive numbers of rows) from disk (or from memory) [4], wasting I/O.

  • [Issue2: Performance Degradation of MVCC] Multi-version concurrency control (MVCC) is widely adopted in OLTP DBMSs [172] to allow readers to make progress regardless of concurrent writers. However, the increased length of version chains degrades the scan performance of analytical queries. Conversely, the long lifetime of analytical queries blocks garbage collection (GC) and causes storage overhead [88]. We further detail this issue in Sect. 4.3.

  • [Issue3: Weak Performance Isolation and Scalability] Simply marrying OLAP to OLTP breaks performance isolation. This is because analytical queries are usually compute-intensive: they take hundreds to thousands of times longer to execute and consume far more resources than short-lived transactions [61, 83]. In such cases, analytical queries may block the execution of transactions and cause resource starvation [32, 146]. In addition, scalability can be a major problem for monolithic HTAP. The read and write sides may have different scale-out needs: OLTP workloads may be CPU-intensive, while OLAP workloads typically require more memory for storing intermediate query results.

In the rest of this section, we dive into these research problems posed by HTAP and show the key technologies that solve (or mitigate) these performance issues in the monolithic HTAP architecture. We show potential solutions for Issue1 in Sect. 4.2, mitigations for Issue2 in Sects. 4.3 and 4.4, and mitigations for Issue3 in Sect. 4.5. Most of these solutions are orthogonal to each other and can be combined in a single HTAP database.

4.2 New data format for HTAP

Several works strive to overcome the contradictory performance properties by redesigning the existing formats. Their new format designs aim to provide OLTP performance comparable to a row-oriented data format and OLAP performance comparable to a column-oriented data format. We introduce some of the representative works below.

Partition Attributes Across (PAX) was proposed in [10] by Ailamaki et al. Generally, PAX is designed for on-disk deployment and specifies how to group rows and columns together. Instead of storing data row-by-row within a disk block as row stores do, PAX stores rows column-by-column within a single disk block. This differs from a pure column store, which stores each column in separate disk blocks. The key difference is that for a table with 10 attributes, a pure column store spreads the data of each original tuple across 10 different disk blocks, whereas in PAX, all data for each tuple can be found in a single disk block. Since a disk block is the minimum granularity with which data can be read from disk, in PAX, even if a query accesses only 1 of the 10 columns, it is impossible to read only that single column from disk, because each disk block contains data for all 10 attributes of the table. Overall, PAX achieves the CPU efficiency of column stores while maintaining the disk I/O properties of row stores.
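As a toy illustration of the layout difference (not the actual page format of any system), the sketch below shows how three tuples of a hypothetical three-attribute table would be placed inside one block under a row layout versus PAX.

```python
# Illustrative tuples of a hypothetical table (id, name, age).
tuples = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]

# Row layout: the values of one tuple are contiguous inside the block.
row_block = [value for tup in tuples for value in tup]
# -> [1, 'alice', 30, 2, 'bob', 25, 3, 'carol', 41]

# PAX layout: the block is split into per-attribute minipages, so a scan of one
# attribute touches a contiguous region, yet all attributes of a tuple still
# live in the same block (unlike a pure column store).
pax_block = {
    "id":   [tup[0] for tup in tuples],
    "name": [tup[1] for tup in tuples],
    "age":  [tup[2] for tup in tuples],
}
print(row_block)
print(pax_block)
```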

Several commercial systems adopted PAX and described how they use it in their products, e.g., Spanner [20] and Vectorwise [27]. Besides, several subsequent research works also follow the PAX data format and propose new optimizations. For instance, Jin et al. reorder columns within a single disk block and place frequently co-accessed columns in nearby positions [81]. This is effective for wide tables with many columns, as random disk seeks account for a large proportion of the I/O cost when reading only a few of the many columns.

4.3 Making MVCC HTAP-friendly

4.3.1 Faster scan on MVCC

In MVCC, a version storage scheme specifies how the system stores versions and what information each version contains [172]. We show an example of multi-version storage layouts in Fig. 5. We term the basic MVCC implementation the native approach. In particular, the native approach records each item's versions in a linked list and tags each version with a version ID (e.g., its commit timestamp). In our example, A1, A2, and A3 are three data items. Each of them has multiple versions generated by OLTP updates. The versions are organized in chronological order (e.g., A1 has five versions, and the latest version is tagged 180).

To perform a scan, the executor traverses each version chain from front to back until the valid data versions are found. For instance, assume an OLAP query performs a scan with timestamp 150; the required data versions are shown in grey. To find these versions, the scan starts from A1's initial version (i.e., 100) and moves forward along the chain (i.e., to 120, and so on) until it reaches a version whose ID exceeds 150 (or the chain ends). The scan on A1 thus stops at the version with the largest version ID that is smaller than or equal to 150. After finding the version for A1, the scan continues with A2 and A3. Significant performance issues arise when the version chains are long and the number of required data items is huge.
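The following is a minimal sketch of this native scan, assuming a simple chronological list per item; the timestamps are illustrative and do not exactly reproduce the chains in Fig. 5.

```python
# Commit timestamps of each item's versions, ordered front (oldest) to back.
chains = {
    "A1": [100, 120, 140, 160, 180],
    "A2": [110, 130, 170],
    "A3": [120, 150],
}

def visible_version(chain, read_ts):
    """Return the last version whose commit timestamp does not exceed read_ts."""
    visible = None
    for commit_ts in chain:          # front-to-back traversal
        if commit_ts > read_ts:
            break                    # every later version is also invisible
        visible = commit_ts
    return visible

snapshot = {item: visible_version(chain, 150) for item, chain in chains.items()}
print(snapshot)  # {'A1': 140, 'A2': 130, 'A3': 150}
```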

Several works have been proposed to reform the multi-version storage layer to support faster scans. Among these proposals, the key idea is to maintain pointers between stable data versions (i.e., versions that have already been committed) to speed up version traversal, enhancing scan performance at the cost of update complexity. Due to their different design goals, the strategies vary in when and how to construct cross-chain links and in how they trade off update efficiency, scan efficiency, and storage overhead.

We summarize existing proposals into two categories. The first category adds cross-chain pointers to adjacent items, so queries can skip unnecessary data versions when scanning over the version chains. As shown in Fig. 5, when traversing the version chain of A1, the system records a cross-chain pointer to A2 while accessing version 120 of A1. Thus, when reading A2, the scan begins at the recorded position (i.e., version 120 of A2) instead of at the head of the version chain.

For correctness, a safe cross-chain pointer always points from a version with a higher (or equal) version ID to a version with a lower (or equal) version ID. Thus, the required version cannot be skipped by this optimization. With cross-chain pointers, the executor can find the target versions with less traversal effort (see Fig. 5). A practical implementation is vWeaver, which features a frugal variant of skip lists [133, 178] and leverages a probabilistic (coin-flipping) search algorithm to decide whether to generate a cross-chain pointer.
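A toy sketch of this idea follows (loosely in the spirit of the first category, not vWeaver's actual skip-list design); the chains, timestamps, and the recorded pointer are all hypothetical.

```python
# While scanning A1, a pointer into A2's chain is recorded at A1's version 120:
# it targets A2's newest version with timestamp <= 120, so it is "safe" and a
# later scan of A2 with read_ts >= 120 can skip A2's older prefix.
chains = {
    "A1": [100, 120, 140, 160, 180],
    "A2": [50, 80, 110, 130, 170],     # illustrative timestamps
}
cross_pointer = {("A1", 120): ("A2", 2)}   # A2[2] = 110 <= 120

def scan_from(item, read_ts, start=0):
    visible = None
    for ts in chains[item][start:]:
        if ts > read_ts:
            break
        visible = ts
    return visible

target, start = cross_pointer[("A1", 120)]
print(scan_from(target, 150, start))   # 130, skipping versions 50 and 80
```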

The second category links together, when new data items are added to the multi-version chains, the data items that will (potentially) be accessed in the same snapshot, so that scans can proceed vertically through these pointers and efficiently skip unrelated data versions. We show an example in Fig. 5: when finding the required versions for a given timestamp (e.g., 150), the scan operation directly traverses the chain of snapshots and follows the woven pointers to A1, A2, and A3. Compared to the first category, the second category is more aggressive and may yield higher scan performance, while it also incurs much more complexity in version management.

To our knowledge, P-Tree [158] is an instance of the second category. Although the P-Tree is essentially an in-memory tree index, it holds all data items in the index and thus works as an in-memory MVCC store. This property distinguishes it from the MVCC indexes we introduce later. In particular, the P-Tree is a nested tree structure that uses linked snapshots for fast versioning. With the nested structure, P-Trees traverse the linked snapshots to locate versions and read them through nested pointers.

In summary, both categories speed up data scanning by sacrificing update performance and incurring extra storage overhead. To our knowledge, existing proposals are mainly designed for flat schemas; how to effectively support complex data types (e.g., arrays and nested fields) is still an open problem. Other limitations of the two approaches include: (1) when the values in a column are tiny (compared to the size of the pointers), the pointers can become too expensive and ultimately negate the version-skipping benefits; (2) pointers complicate garbage collection and can cause more fragmentation in storage. It should also be noted that other implementation details (e.g., the garbage collection policy and the concurrency model) can be critical to end-to-end performance and further affect HTAP design, but we do not discuss them due to space considerations.

4.3.2 MVCC Index

Another challenge raised by HTAP is how to support concurrent index updating. Even though several efforts [23, 112, 148, 160, 169] were devoted to building concurrent tree-based data structures before HTAP emerged, they may not be directly applicable to HTAP scenarios. The reason is that the data access pattern of analytical queries differs from that of typical read-only transactions, e.g., it includes more wide-range scans.

Taking the B+ tree [49] as an example, a lookup operation has to protect its access to inner and leaf nodes since other operations (e.g., inserts and deletes) may change them concurrently. Thus, the two types of operations block each other, leading to sub-optimal performance (e.g., long tail latency for OLAP). This calls for a new mechanism to efficiently split and insert inner and leaf nodes in the presence of scan-oriented data access.

Vegito [150] poses the design challenges of the MVCC index in HTAP and proposes an epoch-based updating mechanism that parallelizes updates within the same epoch (with both task parallelism and data parallelism) to reduce conflicts. By design, Vegito provides slightly stale snapshots to analytical queries.

Diva [90] proposes a new provisional version indexing scheme for HTAP based on the observation that data versions are continuous and visible only for a sliding time window. The idea is to co-locate a record and its first old version in the main index and to store the remaining versions in a separate version space (i.e., the provisional version index). This separation lets Diva conduct rapid version searching and prompt cleaning of stale data versions simultaneously. Note that prompt garbage collection benefits scan operations (by reducing the length of version chains) and alleviates storage costs.

4.4 In-memory snapshot algorithm

In addition to conducting analytical queries on the MVCC storage, another straightforward approach is to take an in-memory snapshot to serve OLAP workloads. Using separate consistent snapshots efficiently eliminates concurrency issues (e.g., race conditions) as well as the performance degradation of MVCC (i.e., Issue2).

Several efficient snapshot algorithms have been proposed to construct a snapshot with low overhead, including Copy-on-Write (CoW) [104, 106] and Zigzag [37], and multiple database vendors and research prototypes adopt the in-memory snapshot approach for hybrid transactional and analytical processing, e.g., Hyper-O [85], SwingDB [116], AnKer [149], Kvell+ [100], and update-aware NDP [167].

Nevertheless, executing analytical queries on an in-memory snapshot has the following limitations. First, the application scope is limited: in-memory snapshots are primarily used for in-memory databases deployed on a single machine, and it can be difficult to scale out when the database is partitioned into multiple shards. Second, in-memory snapshots incur additional memory footprint overhead, which can become severe when serving large-scale, update-intensive applications. We summarize substantial research works by classifying them into two categories: OS-assisted fork and bypass-OS snapshotting.

4.4.1 OS fork

Fork is a copy-on-write mechanism implemented by operating systems. Hyper-O [85] leverages the Unix fork() to generate in-memory snapshots for analytical queries on a single machine. In particular, fork() is used to spawn child processes that share their entire virtual memory with the parent process. In practice, Hyper forks a new snapshot, and thus starts a new OLAP query session process, periodically (or on demand). The fork is always executed between two transactions to guarantee consistency and isolation. An incoming analytical query is assigned to a specific OLAP query session, and ongoing transactions are not blocked. However, a process fork in Unix can be expensive when snapshots are taken frequently. As a result, this snapshot mechanism was abandoned in later versions of Hyper [31, 124].
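The copy-on-write effect of fork() can be demonstrated with a few lines of POSIX-only Python; this is a deliberate simplification of Hyper-O's mechanism, where the child would actually run a full OLAP session.

```python
import os

# The forked child sees a copy-on-write snapshot of the parent's memory taken
# at fork time, so an "analytical query" in the child is unaffected by later
# "transactional" updates in the parent.
data = list(range(8))              # in-memory table owned by the OLTP process

pid = os.fork()
if pid == 0:                       # child: OLAP session on a consistent snapshot
    print("OLAP snapshot sum:", sum(data))   # sees the state as of the fork (28)
    os._exit(0)
else:                              # parent: OLTP keeps mutating its own pages
    data[0] = 1_000_000            # invisible to the child (copy-on-write)
    os.waitpid(pid, 0)
    print("OLTP current sum:", sum(data))    # 1,000,028
```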

Scale-out Hyper (Scyper) [121] is a variant of Hyper-O that aims to scale Hyper-O out horizontally (without partitioning). Scyper uses REDO logs to copy data from the primary (i.e., the single machine of Hyper-O) to secondary replicas and lets OLAP queries execute on the secondaries. For load balancing and consistency, Scyper schedules OLAP queries through a centralized coordinator.

4.4.2 Bypass OS snapshotting

Sharma et al. study the overhead of in-memory snapshot algorithms and propose a new lightweight snapshot mechanism, vm_snapshot, in AnKer [149]. Their key observation is that, unlike other snapshotting purposes (e.g., generating checkpoints), HTAP workloads require taking snapshots at a much higher frequency to realize better data freshness. In response, AnKer introduces a new custom system call (vm_snapshot) and integrates the concept of rewiring directly into the Linux kernel.

This co-design between the underlying OS components and the database overcomes the restrictions of the OS. Individual snapshots in AnKer retain a few short version chains instead of computing fully transparent snapshots.

4.5 Leveraging replication mechanism

Recall Issue3 of monolithic HTAP. To achieve isolation when processing OLTP and OLAP workloads on a single machine, several methods have been proposed to isolate CPU, memory, I/O bandwidth, and network traffic (e.g., binding CPU cores, restricting memory and network usage, etc.). We refer readers to Sect. 5.5.5 for more details.

Besides processing OLTP and OLAP workloads together, leveraging the replication mechanism is another approach to handling HTAP workloads. For instance, our previous example (i.e., Scyper) in Sect. 4.4.1 falls into this approach. Scyper can be configured to serve OLTP on the primary and OLAP on the secondaries, providing performance isolation between the two types of workloads.

It should be noted that replication is also commonly used in legacy OLTP databases (e.g., MySQL [122] and PostgreSQL [84]) for fault tolerance [41, 72, 73] and better performance (e.g., read/write division [50, 108]). OLAP queries can be naively executed on a replica for performance isolation and scalability, which is also known as read/write splitting in the industrial community. In addition, several optimizations can be applied to customize this approach, providing better performance in the context of HTAP.

For instance, PostgreSQL-SR [117] replicates data by replaying Write-Ahead Logging (WAL) records in a streaming manner (i.e., without waiting for the WAL to be filled), which makes the replica stay more up-to-date. To serve a hybrid workload, the primary is generally used for OLTP, and the replica serves read-only queries. Separating OLTP and OLAP workloads into different processing nodes (machines) provides physical resource isolation.

We assume the replication approach discussed in this section adopts the same physical data format (e.g., row-oriented) between primary and secondaries (a.k.a. peer replicas) and will discuss the approaches using different data formats in Sect. 5.

Fig. 6

An overview of different HTAP architectures with hybrid data format. The taxonomy is based on the different data formats adopted in the persisted storage layer, which are the root causes that influence the design decisions of in-memory storage, processing engine, data synchronization, data access path, and query optimization

5 HTAP with hybrid data format

5.1 Common issues

We now discuss the case of HTAP with hybrid data formats. Column-oriented data storage has become a paradigmatic choice for OLAP DBMSs (e.g., MonetDB [28], SnowFlake [55], ClickHouse [45], RedShift [74], and Vectorwise [180]) in recent decades. The key feature of a column-oriented database is that it serializes all of the values of a column together [145]. Compared to row-oriented storage, column-oriented storage has the potential to reduce the amount of data read by orders of magnitude (reading only the relevant columns) and to benefit from column-specific optimizations (e.g., column-oriented compression and execution [3, 5], and new join algorithms [4]).

Given this, following the philosophy that one size does not fit all, many developers turn to hosting hybrid data formats (i.e., a row store and a column store) in a single database, which has attracted particular interest from industry. The advantages of using a hybrid data format are significant: it provides the opportunity to optimize OLTP and OLAP independently and to glue them together in a lightweight manner. However, it does raise new challenges. Below, we summarize the common issues in implementing HTAP databases with hybrid data formats.

  • [Issue#1: Data Synchronization] How to keep the two stores (a.k.a. data copies) abreast of each other without introducing much additional overhead [78, 111] is one of the foremost challenges of a hybrid format design. A satisfactory synchronization method should meet the following requirements: First, it should cause minimal perturbation to normal OLTP and OLAP processing. Second, it should keep the hybrid storage up-to-date, i.e., updates from OLTP are shipped and applied to the column store continuously. Third, it should scale with the number of transactions and analytical queries.

  • [Issue#2: Data Consistency] Along with data synchronization, it is non-trivial to guarantee the consistency of an HTAP database that maintains multiple data copies in hybrid data formats. A safe way to guarantee strong consistency is to always keep an intact snapshot of the whole row store in the column store. However, this approach may incur high coordination costs, as the column store must always ask the entire row store for snapshots. In addition, the problem becomes more complex when the row store or column store is distributed or partitioned into multiple shards.

  • [Issue#3: Gap in Write Efficiency] Regardless of the specific implementations adopted by the row store and column store, there is a common gap in write efficiency between the two stores. Specifically, a row store is optimized for data updates and tuple inserts, while it is much more costly for a column store to absorb all newly generated data, since each update is scattered across multiple columns and the column store is read-optimized, grouping values of the same column from different rows together.

  • [Issue#4: Hybrid Data Access] Serving a request against a hybrid data format may require hybrid data access. That is, to serve a single query, the execution engine may need to access both the row and column stores and combine the results to answer the query. This is challenging because transforming between different data formats is not free.

    Moreover, the execution engines of OLTP and OLAP have contradictory optimization demands. OLTP engines commonly use volcano-style per-tuple iterators [66] (e.g., MySQL [122]), which are well suited to processing small amounts of data. In contrast, per-tuple iterators (or even small batch sizes) largely limit parallelism in OLAP and incur frequent, avoidable function calls. This suggests a dilemma: an execution engine with a fixed block iteration size is either sub-optimal for OLTP workloads or sub-optimal for OLAP workloads.

  • [Issue#5: Performance Isolation] Compared to the monolithic architecture, hybrid data formats mitigate the contention between OLTP and OLAP by attaching each type of workload to its preferred data format. However, isolating the performance of the two types of workloads still requires careful design, since data is continuously synchronized from the row store to the column store. The design of data synchronization and hybrid data access may break the isolation between the row and column stores: the row store may need to generate additional logs for synchronization, and the column store must absorb all updates in a timely manner.

  • [Issue#6: Sharding Strategy] Partitioning the database into multiple shards is an important method for scaling out. Generally, OLTP and OLAP have different scale-out needs and different spike patterns, and data that is read-hot is not necessarily write-hot. When grouping row and column data into shards, an efficient strategy should co-locate the frequently accessed data according to the different workload characteristics.

Table 2 Implementations of HTAP with hybrid data format (1)

5.2 A taxonomy of existing architectures

To handle these issues, multiple technologies have been proposed and many design decisions made. Most of these design choices are highly dependent on each other and rely on assumptions about the underlying layers. We classify HTAP systems with hybrid data formats into four categories to systematically study the trade-offs of these design decisions and technologies. The classification is based on the data formats adopted in the persisted storage layer.

Note that we do not differentiate systems based on their deployment model (e.g., shared-nothing, shared-disk, or shared-everything) or deployment scope (e.g., within a data center using RDMA or across data centers using a wide-area network). Specific optimizations in these respects (e.g., using RDMA for fast data synchronization) can be important for a given HTAP database. However, we do not cover them in our survey, as it targets high-level design guidelines, and these optimizations rely on specific hardware assumptions and deserve a more detailed investigation of their own.

Figure 6 gives an overview of the four architectures. We illustrate the major characteristics of each category below. Row-native architectures persist row-oriented data and construct an in-memory column store as secondary storage. To keep the column store up-to-date, a typical design synchronizes new updates directly from the in-memory execution engine. When building a new column store (either on the primary node or on a stand-by replica), the system populates it with data from the persisted storage layer. This architecture represents the road to equipping row-oriented DBMSs with a plug-in column data format. Typical implementations are SQL Server [58], Oracle Dual [92], and AlloyDB [65].

As mentioned previously, we do not restrict our classification by deployment model. That is, row-native architectures may also be deployed across multiple machines (nodes) using a shared-nothing architecture. An example is PolarDB-IMCI [168], which separates OLTP and OLAP workloads onto different nodes. Unlike the separated architecture discussed later, it relies on replicating a row data copy to construct its column stores on a new node.

Unlike row-native architectures, column-native architectures typically persist column-oriented data and construct an in-memory delta store (i.e., update-optimized storage) for efficient updating. The newly generated updates are then merged periodically into the persisted column store. This architecture is commonly adopted by HTAP systems originating from OLAP DBMSs. Notable implementations include SAP HANA [152], MemSQL [38] (renamed SingleStore [153] in 2020), and NoisePage [105].

Separated architectures usually maintain row and column stores individually. Updates are shipped from the row store to the column store in a streaming manner (with consistency guarantees). To alleviate the gap in write efficiency between the row store and the column store, an in-memory delta store typically resides on the column side. As in column-native architectures, the updates in the delta store are merged into the persisted column store periodically. This architecture represents the road to building an HTAP database on the shoulders of existing row and column stores. Typical implementations include TiDB [78], HeatWave [123], F1 Lightning [175], IBM BLU [137], and ByteHTAP [39].

The key concept of hybrid architectures is the adaptive storage layer [6, 12, 19, 57]. Adaptive storage automatically evolves the format of the stored data to achieve the best performance for both OLTP and OLAP workloads. For instance, HTAP systems belonging to this architecture keep the data that is frequently updated by OLTP in row format and group the data that is always read together into column format. The shift of data format depends on historical statistics of data access patterns, i.e., frequently updated tuples are stored in rows, while frequently scanned tuples are stored in columns. A major difference between the separated and hybrid architectures is that a hybrid architecture does not store a full copy of the data in either rows or columns. To the best of our knowledge, hybrid architectures are still limited to research efforts due to engineering complexity and performance reliability. This architecture represents the road to building an HTAP database from scratch. Typical research prototypes include H\(_2\)O [12], FSM [19], and Proteus [6].

In the rest of this section, we dive into the technical aspects of the four architectures regarding HTAP-specific issues. Sections 5.3 and 5.4 introduce the essential background on row and column stores, which is the foundation of hybrid HTAP. We compare the designs of row and column stores and show how Issue3 and Issue4 arise.

Section 5.5 focuses on the solutions to the HTAP-specific issues under the four architectures. In particular, Sects. 5.5.1 and 5.5.2 present the solutions to Issue1. Section 5.5.3 discusses Issue2. Section 5.5.4 shows the designs for handling Issue3. Section 5.5.5 details Issue4. Sections 5.5.6 and 5.5.7 target Issue5. Finally, Sect. 5.5.8 discusses Issue6.

5.3 Design of row store

Row stores in HTAP inherit their designs from existing OLTP systems. MVCC is one of the common choices: all HTAP systems in Table 2 use multi-version data structures in their row stores. Additionally, indices are built over the version storage to speed up updates and lookups. In the rest of this subsection, we discuss the index and storage designs of the representative HTAP databases in Table 2. It should be noted that, in hybrid HTAP, the implementation of the row store mainly influences the performance of OLTP. Thus, the design of the row store may not be specific to HTAP, but it implies significant design considerations for OLTP. Nevertheless, when hybrid plans are enabled, the physical design of the row store becomes part of an HTAP-specific problem (to be illustrated in Sect. 5.5.7); thus, we briefly list the design choices of representative HTAP systems here for interested readers.

Choice of Indexes. Most existing HTAP systems follow the classical template and use B-Tree indices (or variants thereof) [49, 67]. Several systems use hash indices and skip lists, which are optimized for lookup efficiency at the cost of update overhead. In turn, the LSM-Tree adopted by TiDB offers better write performance at the cost of slower reads.

Design of Version Storage. Following the taxonomy in [172], the version storage of the HTAP systems in Table 2 lies in two typical implementations: append-only storage and delta storage. Append-only storage always allocates an empty slot for the new version and applies the modifications to the data in the newly allocated version slot. In contrast, delta storage only creates a delta version that contains the modified values instead of the entire tuple.

This makes the two designs differ in read and write efficiency. Each version in the append-only storage can be accessed independently, making it more suitable for performing read requests. In contrast, delta storage is ideal for updating since it reduces the overhead for memory allocation.
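The following toy sketch contrasts the two schemes on a single tuple; the field names are made up and no real storage engine is implied.

```python
# Append-only storage materializes a full tuple per version, while delta
# storage keeps a base version plus per-version deltas of only changed fields.
base = {"id": 1, "name": "alice", "balance": 100}
append_only = [dict(base)]            # list of full tuple versions, newest last
delta_base, deltas = dict(base), []   # base version plus a chain of deltas

def update(changes):
    new_version = dict(append_only[-1])   # append-only: copy the whole tuple...
    new_version.update(changes)           # ...and apply the modification
    append_only.append(new_version)
    deltas.append(dict(changes))          # delta: record only the changed fields

update({"balance": 80})
update({"name": "alicia"})

# Reading the latest version: direct for append-only; delta storage must replay.
reconstructed = dict(delta_base)
for d in deltas:
    reconstructed.update(d)
print(append_only[-1] == reconstructed)   # True
```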

Overall, different row store designs target different performance aspects. However, they all target efficient updates (using MVCC to improve concurrency) and are well suited to full-row inserts.

5.4 Design of column store

5.4.1 Storage layer of column store

The critical decision with the column store is how the columns are physically structured. We classify existing designs of column stores in HTAP into two categories.

Native Column Store. Columns are collections of values of the same attribute across rows. A common approach to storing column data is to partition each collection into multiple segments with a fixed number of rows. Each column segment can then be encoded and stored independently (i.e., on different pages) [94, 110]. Data can be ordered within each segment by the column attribute or by the primary key [155]. As a result, every insert into a column store results in a collection of physical inserts on different segments, causing a gap in write efficiency (i.e., Issue3) compared to the write-optimized indexes and version storage of the row stores in Sect. 5.3.
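A toy sketch of such segmentation follows, with an unrealistically small segment size of two rows for readability; it only illustrates why a single row insert fans out into one physical write per column segment.

```python
SEGMENT_ROWS = 2                        # illustrative, real segments hold many rows
rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]
columns = {"id": [], "name": [], "age": []}   # each value is a list of segments

def insert_row(row):
    for col, value in zip(columns, row):
        segments = columns[col]
        if not segments or len(segments[-1]) == SEGMENT_ROWS:
            segments.append([])         # start a new segment (a new page)
        segments[-1].append(value)      # one physical write per column

for r in rows:
    insert_row(r)
print(columns)
# {'id': [[1, 2], [3]], 'name': [['alice', 'bob'], ['carol']], 'age': [[30, 25], [41]]}
```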

Table 3 Implementations of HTAP with hybrid data format (2)

In-memory Column Index. Instead of storing columns directly, an alternative approach is to build a column index on top of the row store. In-memory column indexes serve as an in-memory data buffer that speeds up column-oriented data access. The initial idea of the column index dates back to 2008, when Abadi et al. proposed index-only plans in [4]. By creating a collection of indices that covers all of the columns used in an analytical query, it is possible to answer the query without ever going to the underlying row-oriented store. Thus, in-memory column indexes essentially serve as column-oriented data storage for OLAP queries.

SQL Server [94] introduced the column index in 2011. Instead of directly building indices on the row-oriented tables, it stores columns in column segments; the column index is built on top of these segments, serving as a dictionary. Oracle Dual [92] builds column indices on its heap tables. A heap table [79] is a row-oriented table representation that stores data in no particular order (i.e., without a clustered index). Hence, operations on the in-memory column index never restructure the physical data organization of the heap table. In turn, updates and deletes are handled as in-place operations within the heap table and do not affect the column index. This loosely coupled structure makes adding column indices to an existing database safe and convenient, as a plug-in extension. PolarDB-IMCI [168] implements its column index as a secondary index based on InnoDB's row-based buffer pool. As with index-only plans [4], it can create indices that cover all or a subset of the columns.

Fig. 7 Data synchronization model. This diagram shows how data is synchronized from OLTP storage into OLAP storage when a read-write transaction is scheduled. The steps with dotted lines can be processed asynchronously for better performance if the OLAP storage does not provide a strongly consistent guarantee

5.4.2 Execution engine

A specialized execution engine plays a vital role in performing analytical queries. As shown in Table 2, almost all selected HTAP systems embrace this OLAP technology ecosystem. Nevertheless, it leads to Issue4, as discussed previously. In this subsection, we provide the background on such engine optimizations.

Column-oriented Compression and Execution. Column stores exhibit a compression pattern different from row stores: a column store can compress values from many rows at a time. Several well-studied compression algorithms have been proposed in the literature, e.g., run-length encoding (RLE) [141]. Building on compression, late materialization [4, 5] lets execution engines operate directly on compressed data and defer decoding until all predicates have been applied. This optimization enables database systems to construct fewer tuples at runtime, reducing I/O and computation costs. However, how to perform late materialization across different data formats with different compression strategies is still an open problem.
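As a concrete illustration of operating on encoded data, the sketch below (illustrative only, not any particular system's encoding) run-length encodes a column and evaluates an equality predicate directly on the runs, deferring per-row materialization.

```python
# Run-length encoding (RLE) of a column and predicate evaluation on the runs.
def rle_encode(values):
    runs = []  # list of (value, run_length)
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def count_matches(runs, predicate_value):
    # The predicate is applied once per run instead of once per row,
    # which is the essence of executing directly on compressed data.
    return sum(length for value, length in runs if value == predicate_value)

runs = rle_encode(["DE", "DE", "DE", "FR", "FR", "DE"])
assert runs == [("DE", 3), ("FR", 2), ("DE", 1)]
assert count_matches(runs, "DE") == 4
```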

Vectorization and Intra-query Parallelization. Vectorized query processing [4, 29] and intra-query parallelism [11, 107] are also critical to the performance of analytical queries. However, they are strongly coupled to the design of the column-oriented data format and may not be available in a row-store engine. Solutions are still needed to enable efficient data access when both row and column engines are desired (see our discussion in Sect. 5.5.7).
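A minimal sketch of vectorized (batch-at-a-time) filtering is given below, assuming NumPy is available; real engines operate on fixed-size vectors of encoded column values rather than Python arrays, and the batch size is illustrative.

```python
import numpy as np

# Batch-at-a-time (vectorized) filtering: the predicate is evaluated on a whole
# vector of column values at once instead of one tuple at a time.
BATCH = 4096

def filter_greater_than(column, threshold):
    selected = []
    for start in range(0, len(column), BATCH):
        batch = column[start:start + BATCH]
        mask = batch > threshold          # one SIMD-friendly comparison per batch
        selected.append(batch[mask])
    return np.concatenate(selected) if selected else np.empty(0)

prices = np.array([3.0, 12.5, 7.1, 20.0])
print(filter_greater_than(prices, 10.0))  # [12.5 20. ]
```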

5.5 Road to HTAP

Given the designs of the row store and column store above, we now present the design choices for bridging the two data formats in a single system, as well as the potential solutions to the aforementioned issues. In addition, each sub-section ends with a summary of the pros and cons of the candidate solutions to guide future development. Table 3 summarizes existing designs of representative systems.

5.5.1 Data synchronization model

Existing data synchronization models fall into two categories: transaction-based and storage-based. Figure 7 shows how the writes of a transaction T are synchronized and can finally be observed in both stores. Depending on the consistency guarantees the HTAP system provides, synchronization in both models can be done either synchronously or asynchronously. When data synchronization is done asynchronously (i.e., the steps with dotted lines are done in the background), it removes data transfer and communication from the critical path of transaction processing, thus improving performance isolation.

Using the transaction-based model, the coordinator of transaction T performs a double write to both the row and column stores. Once a storage node has received the full copy of all writes in T's write set and is ready to commit (e.g., all needed latches are acquired), it sends an acknowledgment message to the transaction coordinator. Then, the coordinator finalizes the transaction with a commit message. To enforce consistency, the coordinator notifies the client of the success of T only after it has received the acknowledgments of the commit message from both stores.
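The commit path of the transaction-based model can be summarized with the following sketch; the store and transaction classes are illustrative stubs, and failure handling, latching, and logging are omitted.

```python
# Sketch of the transaction-based synchronization model (dual write).
from dataclasses import dataclass, field

@dataclass
class Txn:
    id: int
    write_set: dict = field(default_factory=dict)

class Store:
    def __init__(self):
        self.data, self.pending = {}, {}
    def prepare(self, txn):          # buffer writes, acquire latches, then ack
        self.pending[txn.id] = txn.write_set
        return True
    def commit(self, txn_id):        # make buffered writes visible, then ack
        self.data.update(self.pending.pop(txn_id))
        return True

def dual_write_commit(stores, txn):
    # 1. Ship the full write set to both stores and wait for their acks.
    if not all(s.prepare(txn) for s in stores):
        return False
    # 2. Finalize with a commit message; the client is notified of success
    #    only after both stores acknowledge the commit.
    return all(s.commit(txn.id) for s in stores)

row_store, column_store = Store(), Store()
ok = dual_write_commit([row_store, column_store], Txn(1, {"k": "v"}))
assert ok and row_store.data == column_store.data == {"k": "v"}
```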

Alternatively, the storage-based model synchronizes data through an ordered log. The units of log records can be transactions or read/write operations. To commit a transaction T, the coordinator performs a write operation on the row store. Then, a log containing the set of writes is shipped from the row store to the column store. Once writing in both stores has finished, the coordinator can finalize the transaction. Only after the row store has received the acknowledgment of the commit message from the column store and T has been committed locally can it send an acknowledgment of the commit message to the coordinator. Finally, the coordinator notifies the client of the success of T.

Pros and Cons. The transaction-based model is usually adopted in row-native architectures (see Table 3). This model benefits development and management, as correctness is provided at the transaction level; that is, the storage layer in this model is not required to understand the semantics of transactions. Moreover, the column stores in this architecture are fully in memory, so transaction-based synchronization does not incur much overhead on transactional processing. On the contrary, the column-native, separated, and hybrid architectures usually adopt the storage-based model to decouple data synchronization from transactional processing for better flexibility and scalability.

5.5.2 Data synchronization approach

Dual Write. In line with the synchronization model, HTAP systems with row-native architectures adopt dual writes to perform an atomic write operation on both the row store (with its in-memory buffer pool) and the in-memory column store, as introduced in Sect. 5.5.1.

Delta Merge. For HTAP systems with column-native architectures, the database continuously merges the new data generated in the in-memory row store into the persisted column store. To enable efficient processing on top of the storage-based synchronization model, a general optimization organizes the internal storage representation into multiple stages [105, 152]. For instance, SAP HANA [152] features two provisional in-memory delta stores (i.e., L1-Delta and L2-Delta) instead of a single row store. L1-Delta is a row store optimized for fast inserts, deletes, field updates, and record projection. For performance reasons, L1-Delta does not perform any data compression. L2-Delta is organized in the column-store format and employs dictionary encoding for better memory usage. Different from the persisted column store, its dictionary is unsorted. With the two provisional delta stores, a data update is first caught by L1-Delta. When L1-Delta has absorbed more updates than a specific threshold (e.g., 10,000 rows), the updates in L1-Delta are merged into L2-Delta. Similarly, L2-Delta only merges its updates into the main store (i.e., the persisted column store) when the number of updates exceeds a threshold. This workflow mitigates the overhead of transforming row-oriented tuples into column-oriented ones, since the processing is divided into multiple stages and tuples can be batched.
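The staged merge can be sketched as follows; the thresholds and data structures are illustrative and do not reflect SAP HANA's actual implementation.

```python
# Illustrative sketch of a two-stage delta merge (L1 -> L2 -> main store).
L1_THRESHOLD = 10_000   # rows absorbed by L1 before merging into L2
L2_THRESHOLD = 100_000  # rows absorbed by L2 before merging into the main store

l1_delta, l2_delta, main_store = [], [], []

def apply_update(row):
    # New updates are first caught by the write-optimized, uncompressed L1 delta.
    l1_delta.append(row)
    if len(l1_delta) >= L1_THRESHOLD:
        # Merge into the column-organized, dictionary-encoded L2 delta as a batch.
        l2_delta.extend(l1_delta)
        l1_delta.clear()
    if len(l2_delta) >= L2_THRESHOLD:
        # Only then is the expensive merge into the persisted column store paid.
        main_store.extend(l2_delta)
        l2_delta.clear()
```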

Shared Log or Consensus. For separated architectures, the data synchronization approach deserves a more careful design to handle Issue1. BatchDB [111], F1 Lightning [175], Vegito [150], Janus [18], and Heatwave [123] ship updates via a shared log. All committed transactional updates are saved in logical logs (i.e., logs that record read/write operations without depending on the physical format). Logs are continuously shipped from the row store to the column store. A key challenge is guaranteeing that updates belonging to the same transaction are consistently observed in the column store. Specifically, queries should never observe a partial view of an atomic transaction, and transactions' updates should be observed in commit order.

To handle this challenge, BatchDB [111] groups transactions into batches, and the updates become visible only when a batch has been successfully applied. To form a transaction batch, BatchDB assigns a batch ID to each transaction. The invariant is that a transaction can only belong to one batch ID, thus providing atomicity. Similarly, Vegito [150] tags each transaction with an epoch ID. Transactions within the same epoch are applied and observed together. F1 Lightning [175] reuses the safe timestamp mechanism of Spanner [50]. The safe timestamp indicates the watermark of visibility to OLAP. Each transaction in F1 is assigned a timestamp, and the transaction is visible if and only if its timestamp is smaller than the safe timestamp.
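A minimal sketch of batch/epoch-based visibility is shown below; it is illustrative only and not the actual BatchDB, Vegito, or F1 Lightning implementation. Updates become visible to OLAP only when the whole batch they belong to has been applied.

```python
# Epoch/batch-based visibility for shared-log synchronization (illustrative).
applied_epoch = 0          # highest epoch fully applied to the column store
column_data = {}

def apply_batch(epoch, txn_updates):
    """Apply all transactions of one epoch atomically, then advance the watermark."""
    global applied_epoch
    for write_set in txn_updates:       # txn_updates is ordered by commit order
        column_data.update(write_set)
    applied_epoch = epoch               # updates become visible only now

def visible_to_olap(txn_epoch):
    # A transaction's updates may be read by analytical queries only if its epoch
    # is at or below the applied watermark (cf. F1's safe timestamp).
    return txn_epoch <= applied_epoch

apply_batch(1, [{"a": 1}, {"b": 2}])
assert visible_to_olap(1) and not visible_to_olap(2)
```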

Another straightforward option is reusing a consensus mechanism [77, 120] (e.g., state machine replication [147]). Consensus is essentially an abstraction of the shared-log approach with clear safety guarantees and a mature toolkit. Therefore, we classify them into the same category.

Wiser [21] relies on consensus protocols to achieve a serial order of writes. A query can only see updates made before it in a serializable order generated by the consensus protocol (i.e., Raft [125]). TiDB [78] introduces Raft learners to improve the performance of data synchronization. A learner in Raft does not participate in leader elections, and log replication from the leader to a learner is asynchronous. Thus, adding more learners will not significantly impact the consensus group's performance, because the leader does not need to wait for their responses. Also, to support the hybrid format, data is transformed from row-oriented to column-oriented when it is synchronized to the Raft learners.

Compared to shared logs, the benefits of leveraging a well-studied consensus protocol for data synchronization are threefold: proven correctness, well-studied implementations, and strong compatibility.

Reform Format. For the hybrid architecture, the database performs data synchronization by reforming the data format directly. For instance, Proteus [6] changes the storage format by reading a consistent data snapshot into memory and bulk-loading the data into the target storage format. The changes are generally triggered by a storage advisor that continuously monitors data access patterns.

Pros and Cons. The design choices for data synchronization largely depend on the underlying storage layer. The four data synchronization approaches mentioned in this sub-section are tied to their corresponding architectures. Within each architecture, multiple optimizations can be applied to make synchronization more efficient.

In addition to minimizing performance overhead, a significant design goal is to achieve higher data freshness. The design of data synchronization approaches can also contribute to data freshness. For instance, batch-based log shipping may have worse data freshness than streaming shipping.

5.5.3 Consistency model

Given the sophisticated data synchronization models and approaches, the consistency model is worth revisiting. Different from monolithic HTAP (Sect. 4), where the consistency model can simply be inherited (since each analytical query can be treated as a read-only transaction, e.g., in MySQL), the consistency model in hybrid HTAP may degrade. This is because data synchronization in hybrid HTAP breaks the consistency boundary between the two data formats. To make matters worse, data synchronization is typically done asynchronously to ensure better performance isolation between OLTP and OLAP workloads.

Three of the foremost consistency models (a.k.a. isolation levels) used in the database community are read committed, snapshot isolation, and serializability. To the best of our knowledge, most existing hybrid HTAP systems [6, 18, 92, 111, 150, 152, 175] provide snapshot isolation, even when transactional processing promises a stronger consistency guarantee.

Under snapshot isolation, analytical queries observe a complete data view (i.e., a snapshot). As these HTAP databases usually provide a stronger model on the OLTP side, the degradation of the consistency model should be carefully noted: application programmers or database administrators must handle the interleaved use of transactions and analytical queries. A straightforward question is whether HTAP databases can ensure a stronger consistency model. The answer is yes, but possibly at the cost of performance, and it requires careful design. Ensuring stronger consistency for HTAP in a lightweight manner is still an open problem.

For instance, TiDB [78] is a practical implementation that allows analytical queries to read the latest writes made by transactions, at the cost of longer latency. Specifically, it relies on the Raft learner to catch up with and absorb all committed transactions from the Raft leader before an analytical query can actually be scheduled. Obviously, this approach imposes additional overhead on the Raft leader and needs to block analytical queries for a while. PolarDB-IMCI uses an asynchronous replication mechanism and can ensure different consistency models (e.g., strict serializability) through its proxy layer. The proxy node keeps track of the log sequence numbers (LSNs) of both the row store and the column store, and it may only route queries to the column store when the column store's LSN is not less than the row store's LSN.
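The proxy-level check can be sketched as follows; the names and logic are illustrative, not PolarDB-IMCI's actual implementation.

```python
# Sketch of proxy-side consistency enforcement via log sequence numbers (LSNs).
class Proxy:
    def __init__(self):
        self.row_lsn = 0        # latest LSN committed in the row store
        self.column_lsn = 0     # latest LSN replayed by the column store

    def route(self, is_analytical):
        # Under a strong consistency requirement, an analytical query is sent to
        # the column store only once it has replayed at least up to the row
        # store's current LSN; otherwise the proxy waits or falls back.
        if is_analytical and self.column_lsn >= self.row_lsn:
            return "column_store"
        return "row_store"

proxy = Proxy()
proxy.row_lsn, proxy.column_lsn = 100, 90
assert proxy.route(is_analytical=True) == "row_store"   # column store lags behind
proxy.column_lsn = 100
assert proxy.route(is_analytical=True) == "column_store"
```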

5.5.4 Delta buffer and meta-data

Recall Issue3: due to the gap in write efficiency between the row store and the column store, when the write rate in the row store is high, the column store may not be able to absorb all updates in a timely manner. To make matters worse, this lag continuously accumulates. Thus, to bridge the efficiency gap, an HTAP DBMS commonly maintains a delta buffer and meta-data for recent transactions and updates. An overview of the delta buffers and meta-data adopted by existing hybrid HTAP systems is shown in Table 3. Next, we describe these designs in more detail.

Without introducing additional physical storage space, a row-native architecture can directly treat the row store as a delta buffer for the column store, because the row and column stores are always co-located in this architecture. As such, when performing a data update, the dual write does not need to finish the column store's writing process immediately; instead, it only needs to update a lightweight piece of meta-data. The meta-data distinguishes whether an update has already been merged into the column store or is only resident in the row store. The actual format changes can then be done in the background. To serve an analytical query, the executor fetches most data from the column store and a small portion of unmerged updates from the row store. The trade-off is also significant: the reading process becomes more expensive.

A typical implementation is used by SQL Server [93]. SQL Server uses the Tail Index as its meta-data, which covers rows not yet included in the column store. Inserting a new row or row version into the table consists of inserting the row into the in-memory Hekaton table (i.e., its row store) and adding it to all indices, including the Tail Index (i.e., the meta-data). A background task then eventually copies the data to the CSI (i.e., its column store). Deleting a row from the table consists of first locating the target row in the Hekaton table and deleting it from there. If the column store contains a valid copy, a row with its row ID is inserted into the Deleted Rows Table, which logically deletes the row from the column store. An update operation can be implemented as an insert followed by a deletion: the row is first updated in the Hekaton table, and the new version is included in the Tail Index but not immediately added to the column store. If an old version is included in the column store, its row ID is inserted into the Deleted Rows Table to logically delete the old version from the column store.

Oracle Dual [92] and AlloyDB [65] do not index the updates directly. Instead, they create a transaction map as their meta-data to record the committed transactions that have not yet appeared in the column store. Whenever a transaction commits in the row store, the transaction map records its transaction ID and write set. After the updates of a committed transaction have been successfully merged into the column store, the entry with its transaction ID is removed. Compared to the Tail Index in SQL Server, this approach simplifies data modifications (inserts, updates, and deletes) at the cost of read performance.

PolarDB-IMCI [168] adopts a different approach. Instead of leveraging the row store as a temporary delta buffer, it writes data directly to the column store when a transaction is ready to commit. To alleviate the write-efficiency problem, the column store is append-only, and all updates in PolarDB-IMCI are out-of-place. PolarDB-IMCI moves the tasks of column encoding, data compression, and sorting to a background process. Thus, data modifications on the column store do not incur additional overhead compared to the row store. Inserting a new row into a PolarDB-IMCI table is straightforward: it comprises a standard insert in the row store and an append-only insert in the column store. Like SQL Server, deleting a row from the column store is conducted by adding a delete mask to a bitmap (i.e., the meta-data). For an update operation, instead of performing an in-place update in the column store, the update is executed as a delete in the bitmap plus an append-only insert.
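The out-of-place update path can be sketched as below; the structures are hypothetical, and compression as well as background reorganization are omitted.

```python
# Illustrative sketch of append-only column-store writes with a delete bitmap.
rows = []              # append-only column-store rows (row id == position)
delete_bitmap = set()  # meta-data: row ids that are logically deleted
pk_index = {}          # primary key -> current row id

def insert(pk, values):
    rows.append(values)
    pk_index[pk] = len(rows) - 1

def delete(pk):
    delete_bitmap.add(pk_index.pop(pk))   # logical delete: set the delete mask

def update(pk, values):
    delete(pk)                            # no in-place update:
    insert(pk, values)                    # delete mask + append-only insert

insert("k1", {"price": 10})
update("k1", {"price": 12})
assert 0 in delete_bitmap and rows[pk_index["k1"]]["price"] == 12
```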

Fundamentally, the row store is essentially a delta buffer in column-native architectures. Since data modifications (i.e., inserts, updates, and deletes) are periodically merged into the persisted column store, the column store always lags behind the row store. The subtlety is that, like a purpose-built delta buffer, the row store does not hold a complete data copy and only contains a small portion of incremental data. Thus, the overall scan performance is not largely affected by the less efficient scan on the row store. Meta-data in this architecture enhances performance on the delta buffer.

For instance, SAP HANA [152] adopts a secondary index structure (named the delta index) on its L2-Delta (i.e., its delta buffer) to optimally support point-query access patterns. NoisePage [105] uses a traditional B-tree on its delta buffer. MemSQL [153] uses skiplists as its memory-optimized delta indices. Skiplists can point to rows directly since the delta buffer is fully in memory; this distinguishes them from the indices of row-native architectures, as it avoids the indirection caused by page tables.

For separated architectures, the persisted row store and column store are stored independently: they can reside in a single persisted storage or be separated onto different machines. As such, to handle the write-efficiency gap, a typical solution is allocating additional storage space co-located with the column store to serve as a delta buffer. Meta-data in this architecture serves the same purpose as in the column-native architecture.

TiDB [78] proposes DeltaTree (an optimized variant of the log-structured merge-tree (LSM-tree) [127]) as its delta buffer. Specifically, DeltaTree consists of a delta space and a stable space. In the delta space, data modifications are recorded in append-only logs in a serialized order, achieving near-optimal write efficiency. However, these modifications are usually stored in many small files and incur large I/O overhead when read. To enhance read performance, DeltaTree continuously compacts these small records into larger ones and then flushes the larger logs into the stable space. Note that the separation of delta space and stable space realizes the same goal as L1-Delta and L2-Delta in SAP HANA (Sect. 5.5.2), i.e., dividing the heavy merge process into multiple stages, thus amortizing the merge overhead across both read and write operations.

TiDB also includes an important piece of meta-data: the delta index. The delta index is developed to accelerate read operations, and its implementation is based on a B-tree. Briefly, reads on log-structured storage require a sequential scan of the logs to find the latest data version. The delta index speeds up this sequential scan by recording, for each index entry, a valid (but possibly not the latest) data version and its location in the delta space. Read operations can then leverage the delta index to skip old versions and start the scan from the recorded location instead of scanning from beginning to end.
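The effect of a delta index can be sketched as follows (illustrative only): instead of scanning the log from the beginning, a read starts from the recorded position of a known-valid (possibly stale) version and scans forward.

```python
# Illustrative sketch of a delta index over an append-only delta log.
delta_log = []      # ordered list of (key, value) modifications
delta_index = {}    # key -> position of some valid (not necessarily latest) version

def append(key, value, update_index=False):
    delta_log.append((key, value))
    if update_index:                            # the index may lag behind the log,
        delta_index[key] = len(delta_log) - 1   # pointing at an older valid version

def read_latest(key):
    # Start from the indexed position instead of position 0, then scan forward
    # to pick up any newer versions of the same key.
    start = delta_index.get(key, 0)
    latest = None
    for k, v in delta_log[start:]:
        if k == key:
            latest = v
    return latest

append("a", 1, update_index=True); append("b", 2); append("a", 3)
assert read_latest("a") == 3
```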

F1 Lightning [175] adopts another data structure for its delta buffer: PAX [9]. In PAX, rows are first divided into row groups and then stored column-wise within each row group (see Sect. 4.2). PAX can exhibit better cache and memory-bandwidth utilization than a purely row-oriented data format, favoring the performance of retrieving new data. The index over PAX is a sparse B-tree index on the primary key whose leaf entries track the key range of each row group. Although data modifications on PAX are not as efficient as on an LSM-tree, they practically mitigate the write-efficiency problem. PAX therefore balances performance between range scans and point lookups.
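The PAX layout can be sketched as below (illustrative row-group size): rows are grouped into fixed-size row groups and stored column-wise inside each group.

```python
# Illustrative sketch of a PAX layout: rows are grouped, then stored
# column-wise within each row group.
ROW_GROUP_SIZE = 2

def to_pax(rows, columns):
    groups = []
    for start in range(0, len(rows), ROW_GROUP_SIZE):
        group = rows[start:start + ROW_GROUP_SIZE]
        # within a row group, values of the same column are stored contiguously
        groups.append({col: [r[col] for r in group] for col in columns})
    return groups

rows = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
print(to_pax(rows, ["id", "amt"]))
# [{'id': [1, 2], 'amt': [10, 20]}, {'id': [3], 'amt': [30]}]
```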

Hybrid architectures do not rely on a delta buffer and meta-data directly. However, the format change does incur additional overhead. An adaptive threshold is embedded in the storage advisor [6, 19] to avoid changing formats too frequently.

Pros and Cons. Multiple data structures can be used for the delta buffer, e.g., a row-oriented format, an LSM-tree, or PAX. Overall, these data structures offer better write efficiency than column stores, thus filling the write-efficiency gap between row stores and column stores. On the other hand, they make different trade-offs among write performance, range-scan performance, and point-lookup performance. Meta-data is carefully designed on top of the delta buffer to further optimize read/write performance: it either marks the unmerged data modifications or indexes the incremental data. Nevertheless, maintaining meta-data increases system complexity.

Fig. 8 An example of how HTAP systems generate execution plans for both OLTP and OLAP requests (diversion approach)

Fig. 9 An example of how HTAP systems generate execution plans for both OLTP and OLAP requests (unified optimizer)

5.5.5 Performance isolation

In HTAP, data synchronization (Sects. 5.5.1 and 5.5.2), the delta buffer (Sect. 5.5.4), and meta-data (Sect. 5.5.4) make it challenging to realize fine-grained performance isolation between OLTP and OLAP. A summary of the existing solutions adopted by HTAP systems is shown in Table 3.

Fig. 10 Hybrid plan. An example of the hybrid query plan generated by an anonymous HTAP system for TPC-H Q21

Instance-level isolation is a native property that separates different workloads onto different instances. An instance can be either a stand-alone physical machine or a virtualized resource group (e.g., a virtual machine [64] or container [70]). Separated architectures (e.g., TiDB, Heatwave, F1 Lightning) embrace this isolation naturally. PolarDB-IMCI [168] achieves this isolation by separating read-write and read-only nodes; by doing so, the OLAP workload is only scheduled on the read-only nodes.

HTAP systems that share a single instance for both OLTP and OLAP workloads require a more careful design for performance isolation, which shares the same issue with monolithic HTAP (see Sect. 4.5). Several methods have been proposed to isolate CPU, memory, I/O bandwidth, and network traffic usage. We illustrate each of them below.

CPU Isolation. Binding CPU cores [58, 110], thread pools [6], and their variants are common approaches to separating CPU resources. They can be implemented either in the programming-language layer (e.g., the goroutine pool in Go [163]) or in the OS kernel layer (e.g., control groups in the Linux kernel [71]).
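A minimal sketch of core binding on Linux is shown below, assuming Python's os.sched_setaffinity is available; the core assignments are illustrative, and production systems typically implement this with native thread pools or cgroups instead.

```python
import os

# Pin the current (e.g., OLAP worker) process to a dedicated set of cores so that
# background analytical work cannot steal CPU time from transaction processing.
# Core ids are illustrative; sched_setaffinity is Linux-specific.
OLTP_CORES = {0, 1, 2, 3}
OLAP_CORES = {4, 5, 6, 7}

def bind_current_process(cores):
    os.sched_setaffinity(0, cores)   # pid 0 refers to the calling process

# e.g., inside an OLAP worker process:
# bind_current_process(OLAP_CORES)
```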

Memory Isolation. Memory isolation is achieved by adaptive scheduling or configuring a fixed quota for memory usage. For instance, Greenplum [110] utilizes a memory scheduler in the database kernel to control memory usage among different workloads. TiDB [78] limits the default access table size on TiKV for analytical queries to 500 MB.

I/O Bandwidth and Network Traffic. Ideally, the isolation of I/O bandwidth and network traffic can be achieved in the same way as memory isolation. However, it becomes non-trivial when I/O bandwidth and network traffic are consumed asynchronously by background processes [110], e.g., asynchronous data merges (Sect. 5.5.4) and asynchronous data synchronization (Sect. 5.5.2). Worse still, such bandwidth-consuming processes may have complex dependencies on each other, making this an open problem that continuously calls for solutions.

Table 4 A Comparison of Physical Operation in OLTP and OLAP

5.5.6 Query optimization and execution

Recall that one of the most important design principles of HTAP is providing users (applications) with OLTP and OLAP in a single database system through a transparent interface (see Sect. 3). Thus, it demands a unified interface for both transactions and analytical queries. Specifically, when a user (application) submits a database request to the HTAP system, the database should take responsibility for optimizing the request, executing it, and generating the results automatically (i.e., Issue5).

To the best of our knowledge, there are two approaches (Figs. 8 and 9). In the first approach (Fig. 8, the diversion approach), database requests are diverted into OLTP and OLAP requests using a routing-based mechanism. After receiving user requests, such systems rely on embedded user-level hints or a middleware layer to differentiate OLTP and OLAP queries, and they then execute the queries on the appropriate engines and stores. Typical implementations are PolarDB-IMCI [168] and ByteHTAP [39].

The other approach handles the diversion problem with a unified optimizer (Fig. 9). The optimizer can generate both OLTP and OLAP query plans. Thus, the diversion (i.e., choosing an appropriate executor and store format) can be made inside the optimizer, which drives the request to its desired execution engine based on request properties (i.e., using the OLTP executor for point reads and writes and the OLAP executor for large scans).

We introduce several implementations of this approach. For instance, F1 Lightning [175] generates logical plans using its F1 optimizer (i.e., an optimizer designed for OLTP) and considers Lightning-only indexes and views during physical planning. SQL Server [58] analyzes a given workload with its Database Engine Tuning Advisor (DTA) and recommends column stores when suitable. TiDB [78] extends its query optimizer to explore physical plans that access both row and column stores. Oracle Dual [92] adds column indices to its optimizer as an alternative execution method for high-speed table scans.

Pros and Cons. The advantages of the first approach are that it can reuse well-developed optimizers and executors and independently absorb advanced technologies from the OLTP and OLAP communities. However, it is non-trivial to distinguish requests before they have been planned, optimized, and executed, especially between read-only transactions (e.g., point queries) and analytical queries (e.g., scan-heavy queries). The second approach can leverage the knowledge of the optimizer for the diversion decision and thus can be more precise than the first approach. Moreover, it provides the opportunity for hybrid plans, which we discuss in the next subsection.

5.5.7 Hybrid plan

A hybrid plan leverages the row store, the column store, and both the OLTP and OLAP executors for a single query, and it has the potential to produce a more efficient execution plan for HTAP. The fundamental motivation behind the hybrid plan is that a row store with specific indices (e.g., a clustered B+-tree index) is more suitable for short-range scans than a column store [58], since the index allows efficient point and short-range lookups. Recent studies [58, 87] point out that access path selection in modern databases should be based on a selectivity threshold, along with the underlying hardware properties, system design parameters, and data format. Thus, the row store may be optimal for some sub-queries of a complex analytical query, while the column store may be optimal for others. Naively binding analytical queries to the column store and OLAP executors leaves much of the performance potential of HTAP systems unrealized.
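A simplified sketch of threshold-based access path selection is shown below; the cost constants are made up for illustration, and real cost models additionally account for hardware properties, compression, and concurrency.

```python
# Illustrative access-path selection based on estimated selectivity.
INDEX_LOOKUP_COST = 50.0   # per matching row, via the row store's B+-tree
COLUMN_SCAN_COST = 1.0     # per stored row, via the column store

def choose_access_path(table_rows, matching_fraction):
    # matching_fraction: estimated fraction of rows that satisfy the predicate
    index_cost = matching_fraction * table_rows * INDEX_LOOKUP_COST
    scan_cost = table_rows * COLUMN_SCAN_COST
    return "row_store_index" if index_cost < scan_cost else "column_store_scan"

assert choose_access_path(1_000_000, 0.0001) == "row_store_index"   # highly selective
assert choose_access_path(1_000_000, 0.2) == "column_store_scan"    # large scan
```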

In addition to the data access path, many optimized query operator implementations rely on the characteristics of the underlying storage. For example, the index-merge join (or sort-merge join) algorithm requires the data source to be ordered by the join key; otherwise, it incurs a computation-heavy sorting phase. Thus, an available sorted data source (either in the row store or the column store) may potentially speed up query execution. Table 4 provides a case study on the physical implementations of query operators in two popular OLTP and OLAP databases (i.e., MySQL and ClickHouse).

To the best of our knowledge, several HTAP systems have implemented hybrid plans to optimize OLAP performance. For instance, TiDB [78] adds the column store as an alternative access path in its cost model and lets the query optimizer choose the access path with the lowest estimated cost during optimization. Specifically, the query optimizer can choose both row and column stores to access different tables in the same query, but it does not consider retrieving the same table from different data sources. SQL Server [59] develops a Database Engine Tuning Advisor (DTA) to recommend a suitable combination of B+-tree (i.e., row store) and column store indexes for a given workload. After a column store index is built, SQL Server's optimizer treats it simply like any other secondary index.

Fig. 11 Comparison of sharding strategies. This figure is a reprint of Fig. 1 in [82]

In addition, HTAP databases with a hybrid architecture have to deal with hybrid plans by nature, as they do not hold a full data copy in either row or column format. For example, Proteus [6] generates physical execution plans based on its workload models and the current storage layout, and it reuses past decisions to accelerate these processes.

We show a practical example of the hybrid plan generated by TiDB in Fig. 10. The query plan is for TPC-H [52] Query 21 with a scale factor of 100. We ran the query using the same parallelism (i.e., 16 threads) in the database engines, and thus the three candidate plans (i.e., the OLTP plan, the OLAP plan, and the hybrid plan) used similar CPU and memory resources. Overall, the hybrid plan achieves a better execution time (3.54 s) than both the OLTP plan (5.60 s) and the OLAP plan (4.28 s).

Unsurprisingly, the scans of TABLE nation, supplier, stock, lineitem:l1, and order on the column store are significantly faster than on the row store. However, for TABLE lineitem:l2, the scan performance on the row store is better, i.e., 28.65 ms compared to 286.18 ms on the column store. This is because the query is highly selective for TABLE lineitem:l2 (i.e., a theta join) and a sorted index exists for the accessed column. Thus, it incurs lower CPU, memory, and I/O consumption than a sequential scan on the column store.

Moreover, due to the sorted order, if TABLE order is accessed on the row store, it can be joined with a range-join algorithm instead of a hash-join algorithm, which can be more efficient. Thus, in the hybrid plan, even though the scan on the row store (504.00 ms) takes slightly longer than on the column store (470.91 ms), the optimizer should still select the row store as the data source.

Nevertheless, there is no free lunch. With hybrid plans, the optimizer's search space becomes much larger, especially when both row and column stores are available. Even though the available options in a traditional OLTP or OLAP database are much more limited, such databases already suffer from long optimization times for complex analytical queries [99, 114, 115]. Heuristics must be adopted to prune the search space and keep optimization times within reasonable bounds [58].

To our knowledge, existing solutions have not yet explored these aspects well and leave them as future work (as acknowledged in [59]). Our vision is that the emerging machine-learning-assisted optimization approaches (a.k.a. learned optimizers) may help optimizers handle the complexity and pruning challenges, since many of them have achieved excellent results on typical OLAP-only workloads (to be further illustrated in Sect. 8). Meanwhile, the models used for the hybrid architecture (e.g., the learned decisions in Proteus) may also be helpful in finding heuristics.

Table 5 A comparison of different formats and architecture choices in various aspects

5.5.8 Sharding strategy

Partitioning databases into multiple data shards can improve query processing performance, increase database manageability [1], and achieve high scalability. Similar to format design, which groups data into storage blocks (e.g., disk pages), sharding strategies also abstract a way of grouping data. The key difference is that format design specifies the smallest unit used to organize data, whereas sharding strategies consider scaling the organized data across multiple machines or data sites. Briefly, there are three primary ways of partitioning a relational database: horizontal, vertical, and irregular. We show a comparison of the three strategies in Fig. 11.

In horizontal partitioning, each partition is a separate data store, and all partitions share the same schema. This partitioning scheme is natural for row stores, as a row is the smallest unit in a horizontal partition. In vertical partitioning, each partition holds a subset of the columns in the data store. Column stores can therefore benefit from this scheme when two commonly accessed columns are located in the same partition. Finally, irregular partitioning is typically adopted by HTAP systems with a hybrid architecture (e.g., Proteus [6]). Irregular partitioning does not shard the database at the granularity of rows or columns but extends the existing abstraction of the hybrid architecture by spanning hybrid formats across multiple partitions. Similar to the adaptive policy applied within a single partition, it adaptively shifts data across partitions based on workload characteristics.
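For concreteness, the sketch below (with an illustrative toy relation and a hash-style shard assignment) contrasts horizontal and vertical partitioning of the same table.

```python
# Illustrative horizontal vs. vertical partitioning of a small relation.
rows = [
    {"id": 1, "name": "a", "amt": 10},
    {"id": 2, "name": "b", "amt": 20},
    {"id": 3, "name": "c", "amt": 30},
]

def horizontal_partition(rows, num_shards):
    # every shard holds full rows and shares the same schema
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[row["id"] % num_shards].append(row)
    return shards

def vertical_partition(rows, column_groups):
    # every shard holds a subset of the columns for all rows
    return [[{c: r[c] for c in group} for r in rows] for group in column_groups]

print(horizontal_partition(rows, 2))
print(vertical_partition(rows, [["id", "name"], ["id", "amt"]]))
```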

Setting aside the irregular partitioning strategy used by HTAP with a hybrid architecture, which essentially blurs the boundary between row and column stores, it is challenging to devise a consistent and efficient partitioning strategy for the two types of data formats (i.e., rows and columns). Existing strategies can be classified into two categories: symmetric and asymmetric.

Symmetric partitioning adopts the same partition strategy for both row store and column store. Thus, each data partition in the column store can be precisely mapped to a data partition in the row store, simplifying data management by eliminating cross-partition data synchronization. Typical implementations include Vegito [150] and TiDB [78].

On the other hand, asymmetric partitioning has the potential to generate a more efficient partitioning strategy by customizing the strategy for each data store and workload independently. For instance, SAP HANA [96] maintains its row store on a single physical machine without partitioning, while its column store is independently partitioned and distributed across multiple smaller physical machines. This enables it to avoid cross-partition two-phase commit (2PC) [119] for OLTP workloads while serving partition-friendly OLAP workloads in a more scalable way. Janus [18] also uses horizontal partitioning for the row store and vertical partitioning for the column store by default.

6 Discussion and lessons learned

In the previous sections, we have examined the design of different HTAP systems. In this section, we aim to shed light on the reasons why these designs persisted and provide coarse-grained design guidelines for newcomers.

In practice, most existing HTAP systems have evolved from legacy OLTP or OLAP architectures to serve HTAP needs, and their evolution routes diverge from the root. For instance, SQL Server, Oracle, and PolarDB-IMCI start from an existing OLTP system and strive to provide plug-in OLAP acceleration using columnar data formats. In turn, SingleStore and SAP HANA start from an existing OLAP system and try to provide essential OLTP performance on top of the column store. The separated design bridges two existing OLTP and OLAP systems using an approach similar to real-time ETL (see Sect. 2.1); the difference is that, for better data freshness, the separated design implements data synchronization inside the database and uses system-specific designs at the cost of sacrificing versatility. The hybrid architecture seems to be the best native approach for handling HTAP. However, several open problems remain when deploying it in industry. First, the performance of such a design relies heavily on the algorithms that reform data formats, which inevitably incurs instability in complex real-world applications. Second, debugging and managing such a design is always tricky. Third, it introduces considerable complexity in managing secondary indexes, since the indexed data may have different formats and redundant copies.

To guide an HTAP system designer in selecting data formats and architectures from scratch, we summarize our main findings concerning the pros and cons of different architectures in Table 5. This is a coarse-grained guideline, and practical databases are much more complex than our abstraction. The data format and architecture choice is important, but it is not everything: specific optimizations also contribute to the overall performance.

7 Applications and benchmarks

Thus far, we have reviewed the different HTAP architectures and their unique challenges and solutions. We now briefly discuss HTAP applications and benchmarks.

7.1 Real-time applications

Real-time applications drive the development of HTAP. In addition to bridging existing OLTP and OLAP applications, several cutting-edge real-time applications are proposed in the literature, including analytical applications, dashboards, and real-time prediction using machine learning.

For instance, fraud detection applications [36, 135, 136] analyze the continuously generated transactions to prevent money or property from being obtained through false pretenses. System monitoring applications [46, 170] derive real-time system metrics swiftly based on data logs. Social trading [129, 176] applications offer a sense of trading community to traders by monitoring the influx of retail traders. Innovative industry applications [134, 174] automate the management of manufacturing and supply chains by conducting timely decisions based on fresh information.

A thorough review of these applications is beyond the scope of this paper; however, we list their desirable requirements for data freshness and consistency in Table 6 and several key inspirations below. First, automatic decision-making applications desire data freshness as high as possible. In contrast, human-in-the-loop applications desire data freshness close to the boundary of human reaction time (i.e., 150–300 ms). Second, mission-critical applications (e.g., fraud detection) have stringent consistency requirements for generating analytical results with high quality, while routine analysis services may tolerate some inconsistent data. This suggests that the consistency model of a full-fledged HTAP system should be configurable, further influencing the design of HTAP (e.g., the data synchronization approaches).

Table 6 HTAP applications
Table 7 HTAP benchmarks

7.2 Benchmarks

As more and more HTAP systems have been developed, a benchmark with representative data schema and workloads is essential for evaluating existing system designs and directing future development. For this purpose, several HTAP benchmarks are proposed in the literature (see Table 7).

Most existing HTAP benchmarks mix OLTP and OLAP workloads from existing benchmarks. We observe that these benchmarks usually target a much simpler querying scenario than a traditional data-warehouse benchmark (e.g., TPC-DS).

Specifically, many existing HTAP benchmarks use TPC-C [53] for OLTP workloads and TPC-H [52] for OLAP workloads. Among them, CH-benCHmark is one of the most representative ones, which is widely adopted by the existing HTAP systems, e.g., BatchDB [111], TiDB [78], Vegito [150], etc. In particular, CH-benCHmark [48] unifies the schema of TPC-C and TPC-H by keeping the TPC-C schema unmodified and adding some necessary tables to fulfill equivalent queries from TPC-H. The benchmark can be scaled by partitioning a database into multiple warehouses. Each warehouse is a data shard and can be deployed across multiple physical machines or data sites. Generally, CH-benCHmark keeps transactions and queries unmodified from the TPC-C and TPC-H and thus has a similar complexity as the well-studied benchmarks.

HTAPBench [47] follows a similar schema design, combining the schemas of TPC-C and TPC-H. Additionally, HTAPBench proposes a new evaluation metric: QpHpW. Instead of evaluating transactions and queries with independent metrics (e.g., throughput for OLTP and latency for OLAP), QpHpW combines the throughput of transactions and analytical queries in a single equation, and can thus serve as a representative metric for HTAP. Like CH-benCHmark, HTAPBench keeps transactions and queries unmodified from TPC-C and TPC-H.

HTAPTrick [117] bridges TPC-C and TPC-H with a star schema (i.e., TPC-H SSB [128]). A star schema provides a clear division between dimension tables and fact tables, where dimension tables typically join fact tables on their primary keys. As star schemas can provide performance enhancements for read-only reporting applications, they have become popular in the OLAP community. HTAPTrick follows this design to explore real-world use cases. To run TPC-C on the star schema, HTAPTrick rewrites the transactions while keeping their logic unmodified. As such, HTAPTrick has the same complexity as TPC-C and TPC-H, while the queries can benefit from improved performance for operations such as joins thanks to the star schema.

TPC-HC [158] is another benchmark over TPC-C and TPC-H. Different from the aforementioned benchmarks, TPC-HC uses the schema directly from TPC-H and integrates TPC-C transactions by rewriting them and tailoring them to the schema of TPC-H. By doing so, the transactions of TPC-HC are simpler than TPC-C, while the OLAP queries are the same as TPC-H.

CBTR [25] is a composite benchmark for evaluating order-to-cash applications. CBTR is based on a global enterprise’s data set and thus targets real-world workloads. Nevertheless, to our knowledge, CBTR is not open-sourced.

Besides mixing OLTP and OLAP workloads, several studies [63, 103, 126, 130] argue that HTAP transactions should be able to contain analytical operations inside transactions. These benchmarks aim to model a widely observed behavior pattern: making a quick decision while consulting real-time analysis. To our knowledge, Kang et al. have proposed a benchmark called Olxpbench, where a real-time query is embedded inside transactions. Overall, the benchmark contains three domain-specific schemas and workloads.

Due to the integration of query operators, the transactions in Olxpbench are much more complex and have longer execution times, which may lead to a larger contention window (i.e., the period during which a transaction can conflict with others). The complexity of the OLAP side remains the same as in the original benchmarks. For interested readers, the OLAP queries in Subenchmark are the same as the queries in TPC-H; Fibenchmark involves many scan-intensive queries but fewer joins; and the OLAP queries of Tabenchmark have moderate scan and join complexity.

8 Future directions

Although many novel technologies and architectures have been proposed to make HTAP practical, some further opportunities and challenges exist. In this section, we outline and discuss some prospective and interesting research directions.

8.1 Software and hardware co-design

Co-designing software and hardware in HTAP is currently on the rise due to the increased adoption of accelerators (e.g., FPGAs and GPUs). There are several existing works worth noting. Polynesia [30] proposed an energy-efficient in-memory HTAP implementation, which leverages hardware and software co-designed components (called islands). Each island is specialized for specific types of queries (either for OLTP or OLAP). In particular, islands cooperate with processing-in-memory (PIM) hardware and specifically optimized algorithms to speed up the computation.

In addition, GPU databases accelerate queries by dividing query processing into many fine-grained data-parallel GPU tasks. Several works [16, 98, 139] have explored such heterogeneous HTAP databases by executing OLAP queries on GPUs. However, some open problems remain. For instance, the scalability of heterogeneous HTAP is limited, and the interleaving between CPU and GPU processing is complex. Note that such interleaving can be unavoidable when an analytical operator is embedded in a transaction (see Sect. 7.2).

Other hardware, such as persistent memory, secure hardware, and remote direct memory access NICs, may also be worth attention. How to develop HTAP systems on this new hardware remains unexplored.

8.2 Serverless and cloud-native HTAP

The advent of new serverless and cloud-native architectures has introduced new opportunities for HTAP. The benefits of adapting HTAP systems to cloud-native architectures include elastic resource scheduling, independent fault tolerance between storage and computing, stringent service-level agreements (SLAs) to ensure reliability, and easy manageability with reduced operational cost [101].

Several cloud-native HTAP systems have been proposed. For instance, MemSQL [131] (renamed SingleStore in 2022) absorbs two key innovations from cloud-native designs into its HTAP database. First, it separates the storage and computing layers to provide excellent resource elasticity in both layers. Second, it unifies its table storage for row and column data to mitigate the cost of storing redundant data copies. PolarDB-IMCI [168] is another example that follows the separation of storage and computing layers. By adopting such a design, PolarDB-IMCI is able to scale up in seconds and scale out in minutes to handle OLAP workload spikes. Moreover, it introduces a new checkpoint mechanism for fast recovery from a single point of failure using cloud facilities.

Although these existing explorations are exciting, the design choices are based on their unique architecture, leaving other aspects unexplored (e.g., the cloud-native architecture for monolithic HTAP). Furthermore, memory disaggregation has become more and more popular nowadays, especially because the next generation of in-datacenter networks can be ultra-fast (e.g., powered by CXL). Generally, the disaggregated memory architecture detaches CPUs and memory by abstracting them into resource pools to reduce the total cost of resource ownership in a cloud-native environment. How to leverage this concept to guide the design of HTAP is still an open problem.

8.3 Beyond column store

So far, we have treated the column store as the default data format complementing the row store. In practice, although real-world applications usually prefer row stores for transactions due to the relational model, they may desire another specialized data format for analytical queries. For instance, complementary to column stores, databases with novel data models are increasingly popular for analytical workloads, e.g., graph databases [15, 33], spatial databases [76, 173], and time-series databases [24, 140]. To absorb the advancements of these new databases, a well-customized HTAP system demands a re-design of the data synchronization approach, delta buffer, meta-data, execution engine, and query optimizer.

To our knowledge, several works have explored customized HTAP designs for graph databases. Jibril et al. proposed a new data synchronization approach when building HTAP over a graph database. In particular, they use an update-friendly table in an optimized sparse-matrix format to propagate transactional updates. Another work introduces GART [151], which performs dynamic graph analytical processing on datasets generated by relational OLTP. In particular, GART customizes an efficient and mutable compressed sparse row representation for graph scans and proposes a coarse-grained MVCC scheme to reduce the temporal and spatial overhead of versioning. These issues can be specific to graph HTAP and may not be studied in existing HTAP systems that use column stores.

8.4 Optimizer for HTAP workloads

Designing optimizers for an HTAP database is challenging. As evident in Sect. 5.5.7, extending both data sources and operators is accompanied by an exponential increase in the plan search space. Thus, efficient heuristics to prune the search space are highly desirable. Existing methods [58, 78] only consider leveraging cost functions to select the access path (i.e., the data source) while neglecting store-specific query operators (see Table 4). Additionally, as asynchronous data synchronization causes a visibility delay between the row and column stores, row stores in HTAP always have better data freshness than column stores. As such, interesting trade-offs must be made to achieve the best timeliness of queries: accelerating OLAP queries using column stores or retrieving the required data from row stores for better freshness.

In addition, learned optimizers [114] have attracted a lot of research interest and shown practical gains by learning a mapping between an incoming query and the execution strategy [103]. For instance, Bao [113] steers query optimizers using reinforcement learning. Specifically, Bao automatically learns from the changes in query workloads, data, and schema and provides per-query optimization hints.

In HTAP, how to efficiently utilize those learning algorithms is unclear. Unlike predicting query latency on a static data copy, ongoing transactions may influence the latency due to physical and logical resource contention (e.g., latches on a data object). To forecast the best query plan, a learned optimizer may need to learn the properties of OLTP workloads before estimating the cost.

As pointed out by previous papers [177], learning-based optimizers cannot work well for dynamic workloads. Some recent works also consider combining learning with existing cost models. For instance, Yu et al. propose a hint-based candidate generation method that leverages a learning-based method to generate highly beneficial hints and uses a cost-based method to supplement the hints and generate complete plans as candidates. Yang et al. take a similar approach while identifying important parameters within the cost model and using a fast-learning model to adjust them for each specific hardware and software configuration.

In our vision, combining the learning approach with the traditional cost model (as well as statistics) can be a moderate solution for HTAP optimizers.

8.5 Learned data format

Recall the design of HTAP with a hybrid architecture: a system with a hybrid architecture can learn the physical format design efficiently, thus blurring the boundary between row and column stores. Existing approaches (e.g., [6, 7]) learn the cost of data accesses and predict their latency under different storage formats. For instance, Tiresias [7] makes predictions by collecting observed latencies and access histories to build predictive models in an online manner, enabling autonomous storage and index adaptation.
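A toy sketch of the underlying idea is shown below; it is a conceptual illustration, not the model used by Tiresias or Proteus. The advisor maintains online latency estimates per storage format from observed accesses and recommends the format with the lower estimate.

```python
# Toy online format advisor: exponentially weighted latency estimates per format.
ALPHA = 0.2
estimates = {"row": None, "column": None}

def observe(fmt, latency_ms):
    # update the running estimate from each observed access latency
    prev = estimates[fmt]
    estimates[fmt] = latency_ms if prev is None else (1 - ALPHA) * prev + ALPHA * latency_ms

def recommend():
    known = {f: e for f, e in estimates.items() if e is not None}
    return min(known, key=known.get) if known else "row"

observe("row", 12.0); observe("column", 3.5)
assert recommend() == "column"
```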

To our knowledge, these learning algorithms are commonly used for the hybrid architecture and have not been well explored in other architectures. For instance, in a row-native architecture, the system may also need an advisor to suggest which column (or column segment) is most valuable for constructing an index when memory space is limited. We regard these explorations and developments as future work.

9 Conclusion

Hybrid transactional/analytical processing (HTAP) is an increasingly important subject of research and development. It introduces massive technical challenges and opens many opportunities to the database community.

In this paper, we systematically review existing HTAP architectures. We summarize the common HTAP-specific issues and the challenges of implementing an efficient HTAP system. We also compare different design choices under the assumptions of their underlying architectures and discuss how the proposed methods handle the aforementioned challenges. Moreover, we discuss cutting-edge applications, benchmarks, and future directions to push forward research in this area.