Adaptive update handling for graph HTAP

Hybrid transactional/analytical processing (HTAP) workloads on graph data can benefit significantly from GPU accelerators. However, exploiting the full potential of GPU processing requires dedicated graph representations, which typically make in-place updates difficult. In this paper, we present an adaptive update handling approach in a graph database system for HTAP workloads. We discuss and evaluate strategies for propagating transactional updates from an update-friendly table storage to a GPU-optimized sparse matrix format for analytics.


Introduction
Storing, managing, and analyzing big graphs or a large number of smaller graphs is an important and recent data management problem in many non-standard application domains. This task is addressed by several architectures for different data models: either by managing graphs in key-value stores [1], as RDF in triple stores [2], mapped to relational databases [3], stored natively by graph database systems for property graphs [4], or by dedicated graph analytics frameworks [5].
A special challenge in the context of graph data management is the support for HTAP (hybrid transactional/analytical processing) workloads: processing both transactional updates and queries as well as complex graph analytics tasks (e.g. pathfinding, centrality analysis, etc.) on the same dataset while guaranteeing the freshness of the results of both parts.

Recent work and experiments [6] have shown that processing analytical tasks on graphs with GPU support delivers significant performance benefits. Figure 1 shows results of a simple experiment (setup description in Sect. 6) for illustration purposes: we ran the same task (Shortest Path with Single Source, SPSS) both on an NVIDIA A100 GPU using the Gunrock framework [7] and on an Intel Xeon-based graph database system where a property graph is stored natively in update-friendly in-memory data structures (basically, tables for nodes, relationships, and their properties). The graphs in the experiments contain approximately 2 million nodes/2.7 million relationships (graph1), 2 million nodes/5.5 million relationships (graph2), and 4.8 million nodes/68.9 million relationships (graph3).
Because of these benefits, several GPU-based graph analytics frameworks have been developed, such as Tigr [8] and Grus [9], which leverage the parallelization potential of GPUs for graph operations. However, these frameworks rely on dedicated data structures and graph representations (e.g. based on adjacency lists or matrices) which are poorly suited, if at all, to updating portions of the graph efficiently.
As a consequence, either one has to accept performance losses for the analytical part of the workload by forgoing the performance benefits of the GPU, or the graph has to be stored in two different representations. The latter raises the problem of update propagation.
In this work, we present a hybrid architecture of a graph database system that addresses this challenge. Based on our Poseidon system [10], which manages a property graph natively in persistent memory (Intel Optane DCPMM) and supports transactional workloads, we show the integration of GPU-accelerated graph analytics. For this purpose, the graph representation on the GPU is the Compressed Sparse Row (CSR) format, which is the most commonly used sparse matrix format [11]. Although GPU-based dynamic data structures that are more amenable to updates are now being developed, most of the existing GPU frameworks for graph analytics have static data structures, particularly the CSR, as their underlying data structures [7][8][9]. For our setting, the CSR is considered a replica (although in a different format) of the main graph representation in persistent memory (PMem). Thus, updates by transactions on the main graph representation have to be propagated to the GPU part in order to execute the analytics on the most recent version of the graph. However, due to its highly compact layout, the CSR format inherently makes updates on the graph difficult. This challenge leads to the following research questions, which we tackle in this paper:
RQ1 Can the graph representation on the GPU be updated incrementally or is a complete rebuild required?
RQ2 Where are the updates/rebuilds performed? At the CPU (host) side or at the GPU (device) side?
RQ3 When is the GPU part updated or rebuilt as a trade-off between freshness and costs?
RQ4 How can consistency of queries on both parts be guaranteed?
In the following, we discuss different strategies to address the aforementioned research questions as well as their consequences. Furthermore, we show the results of our experimental evaluation with the goal of an adaptive graph update handling approach in an HTAP setting. We note that the problem of update propagation also applies to dynamic data structures: if a dynamic data structure is used instead of a static one, the transactional updates on the main graph would still need to be propagated to the GPU to update the dynamic data structure (i.e. the graph replica). However, in this paper, we focus on the CSR as the representative static data structure. We outline our contributions as follows:
- We introduce an incremental CSR update approach, which adapts the concept of a delta store to the CSR while guaranteeing consistency between the two graph data representations: the CSR and the main graph.
- We present an adaptive update handling approach for graph HTAP which switches the update mode between the delta and the rebuild approach.
- We develop a cost model based on which the adaptive update handling approach decides when to use the delta approach and when to rebuild.
- We implement the above concepts in our HTAP graph database system Poseidon. Based on empirical evaluations, we show the performance benefits of the delta approach and the advantage of our adaptive approach.

Related work
This section reviews related work in the existing literature. It begins with data structures dedicated to graph analytics, both static and dynamic ones, and then turns to differential data structures, discussing work based on the concept of a delta store.

Data structures for GPU graph analytics
The performance benefits of GPUs in graph analytics are realized by considering the architecture of a GPU in the data structure design. Due to the limited size of GPU memory and its SIMT execution model, processing on GPUs leverages data structures such as the Compressed Sparse Row (CSR) that allow for compact data representation and regular, yet parallel, data access. Although the CSR is the most widely adopted format [11], other representations are also used, such as the Compressed Sparse Column (CSC), Coordinate (COO), ELL, Edge List (EL), Adjacency Matrix (AM), BitMatrix (BM), etc., each impacting performance depending on factors such as the characteristics of the graph algorithm and the structure of the input graph [12]. However, they all have in common that they are not well suited, if at all, for updates by transactions. With current works that make use of these static data structures, such as Tigr [8], Grus [9], Gunrock [7] etc., one first reads the graph data for the initial representation of the graph in these data structures and, as the way to handle updates, rebuilds these structures when the original graph changes [13]. More recently, there has been work on dynamic data structures for graphs on GPUs. These data structures relax the high degree of compactness of the aforementioned formats in order to accommodate insert and delete operations while supporting queries. DCSR [14], for example, divides the column indices and edge values arrays into segments. GPMA [15] is basically a Packed Memory Array (PMA) [16] adapted for analytical graph processing on the GPU. PMA is a binary tree structure whose nodes sort elements with interspaces to accommodate updates. The data manager of Hornet [17] stores adjacency lists in arrays of equally sized blocks. Both the edge capacities of blocks and the number of blocks in block-arrays are powers of two.
For each block-array, a vectorized bit tree is used to track its free blocks for reclamation and reuse before a new block-array of the same block edge capacity is allocated. Furthermore, Hornet manages an array of B+-trees, where each B+-tree tracks all the block-arrays with a specific block edge capacity. Awad et al. [18] present a hash-table-based dynamic data structure where the adjacency list of each node is stored in a hash table. Nodes are mapped to their hash tables via pointers in a fixed-size array. A concurrent map is used for the hash table if edges are associated with a weight value; otherwise, a concurrent set is used. All update operations are first batched and then executed in a phase-concurrent manner. Thus, it suffers from the limitation that updates are not executed concurrently with read-only queries.
The aforementioned static data structures are mainly the underlying data structures in existing GPU-based graph analytics frameworks [7][8][9]. Although the problem of update propagation applies to both static and dynamic data structures, we focus on the static data structures in this paper using CSR as the representative use case. However, our approach also extends to the dynamic data structures.

Differential data structures
Column stores are optimized for analytical query processing, typically where a few columns are scanned for a huge number of tuples. This is because they avoid unnecessary data access to all other columns that are not needed in the scan, thereby reducing I/O overhead and increasing cache hits. Additionally, techniques such as data clustering, data compression, and data replication further enhance the performance of analytical workloads on column stores. However, column stores are no panacea. Similarly to static graph data structures, they are not well-suited for updates. The columnar data layout, data compression, and data replication make updates in column stores very expensive. As a solution, systems such as C-Store [19] split the storage into a write store for transactional updates and a read store. The write store handles updates in a differential fashion. Analytical queries are executed by scanning both the read store and the write store, merging the differential updates or deltas from the write store. There is a value-based approach to these deltas as well as a position-based approach [20]. MonetDB [21] is another column store that buffers updates in delta columns. These delta columns are periodically merged to update the main table columns when they exceed a certain size.
RateupDB [22] is a dual-store relational HTAP system with support for GPU-based analytics. It has a read-only AlphaStore and a read-write DeltaStore, both of which are column stores. An execution on the GPU starts with the creation of two snapshots by the CPU: the IDs of tuples that are not visible to the query and the column data for all tuples in the DeltaStore that are visible to the query. The CPU then ships these snapshots along with the original column in the AlphaStore to the GPU, after which the GPU filters out the right tuples for the query execution. SAP HANA [23] manages a row-oriented L1-delta and a column-oriented L2-delta on top of a main column-oriented store, hence performing a two-step merge: L1-delta-to-L2-delta and L2-delta-to-main.
Although the concepts of a delta store in these works are similar to ours, they all target relational databases; none of them addresses graph HTAP or data structures for GPU-accelerated graph analytics.

Background
This section gives an overview of our graph system Poseidon. Poseidon is a native graph database that is based on the property graph model and optimized for PMem [10]. All nodes and relationships, as well as their properties, are stored in PMem. In addition to PMem, Poseidon also supports storing the graph on disk. We briefly discuss some aspects of Poseidon. Figure 2 depicts a simplified pictorial representation of Poseidon's storage. Poseidon implements the property graph data model using separate persistent tables of nodes, relationships, and property sets. A table is a linked list of fixed-sized chunks, where each chunk is an array of equally-sized object records. All the records of any given chunk are of the same type: node, relationship, or property set. In order to optimize sequential access to PMem and utilization of its bandwidth, chunks are cache-line-aligned and have sizes in multiples of 256 bytes.

Storage
To store node and relationship objects as equally-sized records, properties are outsourced to separate tables. Equal sizing of records allows for accessing the records using their 8-byte integer offsets instead of 16-byte persistent pointers. This enables failure-atomic writes to PMem, reduces the storage space for pointers by half, and avoids costly dereferencing of persistent pointers. Moreover, property keys and variable-length property values are encoded using a persistent dictionary with bidirectional translation, thereby reducing expensive writes to PMem.
To efficiently reclaim the memory space of deleted records, Poseidon makes use of a bitmap in each chunk to mark slots as either used or free. Allocations and deallocations are done at the chunk level instead of record-wise, which amortizes the overhead of expensive PMem allocations. Chunks are linked by persistent pointers, forming a linked list. Thus, a scan over all nodes in a graph entails traversing this linked list of chunks.

Apart from PMem, Poseidon also provides the option of using disk as a storage backend. Here, the linked list of chunks is persisted as a file on disk, where each chunk represents a page in the file. For this storage mode, Poseidon implements a buffer pool for caching the pages of the file in DRAM. Moreover, Poseidon offers an in-memory mode, where the entire graph is maintained as a volatile linked list of chunks in DRAM.
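The chunked table layout described above can be sketched as follows. This is an illustrative model, not Poseidon's actual code: the names Chunk and RecordTable and the capacity of 64 records per chunk are assumptions made for the example (real chunks are byte-sized multiples of 256).

```python
# Sketch of a chunked record table: fixed-size chunks of equally sized
# records, a per-chunk bitmap of used/free slots, and chunks linked in a list.
# Records are addressed by integer offsets rather than pointers.

class Chunk:
    CAPACITY = 64  # records per chunk (illustrative value)

    def __init__(self):
        self.records = [None] * Chunk.CAPACITY
        self.used = [False] * Chunk.CAPACITY  # bitmap: slot used or free
        self.next = None                      # link to the next chunk

class RecordTable:
    def __init__(self):
        self.head = Chunk()

    def insert(self, record):
        """Place the record in the first free slot; allocate a new chunk if full."""
        chunk, base = self.head, 0
        while True:
            for slot, in_use in enumerate(chunk.used):
                if not in_use:
                    chunk.records[slot] = record
                    chunk.used[slot] = True
                    return base + slot  # stable integer offset, not a pointer
            if chunk.next is None:
                chunk.next = Chunk()  # chunk-level allocation amortizes cost
            chunk, base = chunk.next, base + Chunk.CAPACITY

    def delete(self, offset):
        """Mark the slot free; the space is reclaimed by reusing the slot."""
        chunk = self.head
        while offset >= Chunk.CAPACITY:
            chunk, offset = chunk.next, offset - Chunk.CAPACITY
        chunk.used[offset] = False

    def scan(self):
        """A full scan traverses the linked list of chunks, skipping free slots."""
        chunk, base = self.head, 0
        while chunk is not None:
            for slot, in_use in enumerate(chunk.used):
                if in_use:
                    yield base + slot, chunk.records[slot]
            chunk, base = chunk.next, base + Chunk.CAPACITY
```

Addressing records by 8-byte offsets, as the text explains, halves the space of 16-byte persistent pointers and enables failure-atomic writes; the sketch mirrors this by returning offsets from insert.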

Transaction processing
The underlying concurrency control mechanism in Poseidon is the Multi-Version Timestamp Ordering (MVTO) protocol. Updates of an arbitrary number of objects are supported within a single transaction with snapshot isolation guarantees. Each graph object (node or relationship) stores the transaction identifier txn-id of the write transaction that locks it. The txn-id is assigned to a transaction at the time the transaction starts. The graph object stores additional data for concurrency control. These include a read timestamp rts, which is the txn-id of the latest transaction that read the object. The object also maintains a begin timestamp bts and an end timestamp ets that denote the validity of the object, i.e. which transactions are allowed to access the object. Furthermore, a graph object stores a pointer to a volatile list of dirty versions of the object. Modifications made by transactions are persisted, in a failure-atomic way, to the original copies in PMem at commit time.

Write transactions
A transaction T with timestamp t has to first acquire a lock, using a compare-and-swap (CAS) instruction, on a graph object o before it is allowed to write to o. T can only acquire the lock if no other transaction has the lock on o and the rts of o is not greater than t, i.e. no transaction newer than T has read o. If the lock acquisition fails, T aborts. If T acquires the lock, it updates the latest version o_i of the object by creating a new version o_(i+1) and writing to it. The ets of o_i is set to t. Since it is now an older version, it is unlocked for read transactions with timestamps between its bts and t. As for o_(i+1), it remains locked and its bts and ets are set to t and ∞, respectively. This new version stays in DRAM as part of the volatile dirty list until T commits. If o is a newly inserted object, i.e. if T made the initial insertion of o into the graph, it is directly stored in PMem but remains locked until the end of T.

Read transactions
A transaction T with timestamp t is granted read access on an object o if o is not locked by a write transaction and t lies between the bts and ets of o. The access starts from the most recently committed version, i.e. the one persisted in PMem, followed by the dirty versions in DRAM if the persisted version is not valid for T. The dirty list is traversed until a valid version is found. Upon reading, T sets the rts of the read version to t if the value of rts is less than t, i.e. if it was not read by a transaction newer than T. In the event another transaction has a lock on the same version of o that T tries to read, T aborts.
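The MVTO bookkeeping of the preceding two subsections can be condensed into a small executable sketch. The field names bts, ets, and rts follow the text; the classes and the single lock_owner field are simplifications for illustration and are not Poseidon's actual implementation (e.g. the real system uses a CAS instruction and persists committed versions to PMem).

```python
# Simplified MVTO sketch: each object keeps a version chain; writers lock the
# object and install a new version, readers pick the version whose
# [bts, ets) interval contains their timestamp.
INF = float("inf")

class Version:
    def __init__(self, value, bts, ets=INF):
        self.value, self.bts, self.ets, self.rts = value, bts, ets, 0

class GraphObject:
    def __init__(self, value, creator_ts):
        self.lock_owner = None  # txn-id holding the write lock, or None
        self.versions = [Version(value, creator_ts)]

    def read(self, t):
        """Return the version valid for a reader with timestamp t, or abort."""
        if self.lock_owner is not None and self.lock_owner != t:
            raise RuntimeError("abort: object locked by a writer")
        for v in reversed(self.versions):  # newest version first
            if v.bts <= t < v.ets:
                v.rts = max(v.rts, t)      # remember the newest reader
                return v.value
        raise RuntimeError("abort: no visible version")

    def write(self, t, value):
        """Acquire the lock (CAS in the real system) and install a new version."""
        latest = self.versions[-1]
        if self.lock_owner is not None or latest.rts > t:
            raise RuntimeError("abort: lock held or read by a newer txn")
        self.lock_owner = t
        latest.ets = t                     # old version now valid for [bts, t)
        self.versions.append(Version(value, bts=t))

    def commit(self):
        self.lock_owner = None             # unlock at commit
```

For instance, after a writer with timestamp 5 commits, a reader with timestamp 3 still sees the old version while a reader with timestamp 7 sees the new one, and a later writer with timestamp 6 aborts because a newer transaction already read the object.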

Query processing
The byte-addressability of PMem is particularly useful with respect to access patterns of graph traversal operations. However, there is a need to make optimizations for sequential data access since PMem data access is not block-oriented anymore. Poseidon addresses this using a multithreaded push-based query engine. Moreover, multithreaded processing along with efficient cache utilization and various execution modes allows for hiding the higher latency that PMem has compared to DRAM.
The query engine of Poseidon provides a set of graph-specific algebra operators such as NodeScan, ForeachRelationship, and Expand, alongside standard relational operators like Filter, Project, and Join. Moreover, it provides operators for graph update operations: CreateNode for inserting a new node into a graph, CreateRelationship for creating a relationship between two existing nodes in the graph, UpdateNode for updating the properties of a node, DeleteRelationship for removing a relationship from the graph, DeleteNode for deleting a node, and DeleteDetach for deleting a node along with all of its incoming and outgoing relationships.

Update handling
In this section, we present the common update handling approach, i.e., to fully rebuild the data structure. Then we elaborate on our delta approach, how it surpasses the rebuild approach in terms of performance, and the factors that determine when the delta is the better of the two approaches. We conclude the section with our adaptive approach and what metrics to use for that purpose.

Rebuild approach
A CSR expresses an adjacency matrix. An adjacency matrix M ∈ {0, w}^(n×n) is a matrix representation of the topology of a graph. Each relationship/edge in the graph is represented by an entry w at a coordinate (u, v) in M, where u and v are the source and destination nodes/vertices of the relationship, respectively, and w is the relationship weight. Entries with weight value 0 denote unconnected nodes. A CSR essentially provides information about the non-zero entries in M, linearized in three one-dimensional arrays: the edge values array stores the non-zero weight values (i.e., all ws for each u in M), the column indices array stores the column indices of these values (i.e., all vs for each u in M), and the row offsets array stores the offsets of the values (in the first two arrays) for each row (i.e., each u in M). Figure 3 shows an example graph along with its CSR representation in Fig. 4.
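The three-array layout just described can be illustrated with a minimal build routine. The function name and the per-row sorting are choices made for this sketch; it is not the construction code of any particular framework.

```python
# Build the three CSR arrays (row offsets, column indices, edge values)
# from a weighted edge list.

def build_csr(num_nodes, edges):
    """edges: iterable of (u, v, w) with 0 <= u, v < num_nodes."""
    adj = [[] for _ in range(num_nodes)]
    for u, v, w in edges:
        adj[u].append((v, w))
    row_offsets, col_indices, edge_values = [0], [], []
    for u in range(num_nodes):
        for v, w in sorted(adj[u]):           # neighbors of u, sorted by column
            col_indices.append(v)
            edge_values.append(w)
        row_offsets.append(len(col_indices))  # offsets array has n + 1 entries
    return row_offsets, col_indices, edge_values
```

The adjacency list of node u is then the slice col_indices[row_offsets[u]:row_offsets[u+1]], which is exactly the regular, compact access pattern that makes the CSR attractive for GPUs.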
CSR and other static data structures offer a compact representation and contiguous storage of graph data, leading to reduced memory consumption, regular memory access, and low interconnect bandwidth consumption when transferring data between the host (CPU) and the device (GPU). Intrinsically, however, they do not lend themselves easily to updates. When the graph data changes, the default way to handle updates is a complete rebuild of such static data structures. This introduces overhead, especially if the graph data is accessed from memory devices with relatively high access latency such as PMem. The overhead becomes more significant when the graph data is fetched from disk (typically in a textual file format), which has even higher access latency and additionally requires deserializing the data into main memory. For the initial CSR build, this cost is inevitable. However, for subsequent updates to the graph data, repetitively rebuilding the CSR on an HTAP graph system in order to run graph analytics on the most recently committed snapshot of the graph data eventually counterbalances the benefits of using the GPU for accelerated analytics in the first place. Therefore, we leverage our delta approach to avoid this high cost of CSR rebuilds.

Delta approach
Presently, systems that make use of static data structures such as the CSR for graph analytics resort to a full rebuild of the data structure when the underlying graph data changes. With respect to research question RQ1 (see Sect. 1), we propose to use deltas in order to avoid the overhead of repeatedly rebuilding the CSR. Each delta represents the new state of the adjacency list associated with an updated node in the graph. At commit time, every transaction, besides committing the updates it made to the main graph, stores deltas for its updates in a delta store. Upon execution of analytics, the delta store is scanned for these deltas, and the deltas are merged to update the CSR so that the analytics execute on the latest committed snapshot of the graph, in accordance with the freshness requirements of HTAP.
Some important factors to consider are when to update the CSR, where the CSR update or merging of deltas takes place (either on the CPU or on the GPU), when to propagate the updated CSR (if the CSR update is done on the CPU), and when to propagate the deltas (if the CSR update is done on the GPU). With regard to when to update the CSR, this can be done each time an analytics task is to be executed, after a certain time period has elapsed since the last CSR update, when the CSR delta store reaches a certain size threshold, or a combination of those. This decision can be made efficiently with a cost model. As for where to update the CSR, updating the CSR on the CPU would require transferring the whole CSR across the interconnect to the GPU; in contrast, updating the CSR on the GPU requires only the deltas to be transferred, which decreases the transfer overhead across the interconnect.
Our solution broadly consists of three steps, which we explain below using Fig. 5 for illustration.

Delta append
In the context of our HTAP system, Poseidon, the GPU-accelerated analytics are run on a CSR representation of the main graph data in PMem, which is being updated by transactions. Our delta approach is implemented such that during commit (denoted by the first vertical dashed line in Fig. 5), in addition to persisting updates to the main graph, each transaction appends deltas to the delta store for the new states of the adjacency lists that changed as a result of the updates it has made to the main graph. Upon the arrival of analytics (denoted by the second vertical dashed line in Fig. 5), the deltas are merged to update the CSR so that the analytics are executed on a CSR representation of the most recently committed snapshot of the main graph. The update operations that modify the arrays of a CSR are the creation and deletion of nodes and relationships. To identify each delta by a node, we associate each of these operations with the appropriate node(s): for the addition and deletion of a relationship, the delta is associated with only the source node of the relationship (in a directed graph) or with both the source and destination nodes (in an undirected graph). For the addition and deletion of a node, the delta is associated with the added or deleted node. Corresponding deltas are likewise stored for any addition or deletion of relationships that accompanies the addition or deletion of a node. At commit time, a transaction stores a delta for each of these associated nodes, where a delta consists of the ID of the node, its current column indices, and the corresponding edge values.
To ensure that the main graph and its CSR representation are consistent under transactional updates, each transaction at commit time additionally stores, in each of its deltas, (i) its timestamp and (ii) a flag to indicate whether the delta has been previously used to update the CSR or not (each delta is used only once in a CSR update).
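A delta record carrying the fields named above (node ID, column indices, edge values, commit timestamp, used flag) can be sketched as follows. The class names Delta and DeltaStore and the append_commit helper are illustrative assumptions for this sketch, not Poseidon's interfaces.

```python
# Sketch of a delta record and the commit-time append: each delta holds the
# full new state of one node's adjacency, plus the transaction timestamp and
# a used flag for the consistency bookkeeping described in the text.
from dataclasses import dataclass

@dataclass
class Delta:
    node_id: int
    col_indices: list  # the node's full adjacency (new state), not a diff
    edge_values: list
    txn_ts: int        # commit timestamp, used by the visibility check
    used: bool = False # set once the delta has been merged into the CSR

class DeltaStore:
    def __init__(self):
        self.deltas = []

    def append_commit(self, txn_ts, touched_nodes):
        """touched_nodes: {node_id: (col_indices, edge_values)} after the txn."""
        for nid, (cols, vals) in touched_nodes.items():
            self.deltas.append(Delta(nid, list(cols), list(vals), txn_ts))
```

Note that each delta stores the complete current adjacency list of the node rather than a difference, which is what later allows a merge to simply replace the corresponding CSR row.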

Delta scan
In order to update the CSR by incorporating the updates since the last CSR update (or since the initial CSR build), the delta store is first scanned to retrieve the deltas. This begins with the triggering of a special transaction T_s. While T_s is scanning the delta store, regular update transactions that are committing may store their respective deltas in the delta store. Therefore, T_s conducts a visibility check on each delta. Concerning research question RQ4, this check guarantees data consistency. It is based on the transaction timestamps stored in the deltas, as an extension of the underlying MVTO in Poseidon. As the CSR update itself is done transactionally, all the deltas (denoted by the green triangles in Fig. 5) that are visible to the transaction T_s executing the CSR update operation and have not been used in a previous CSR update are merged to update the CSR, thus reflecting the latest committed snapshot of the main PMem graph data. All the deltas (denoted by the red triangles in Fig. 5) that have been used in a CSR update are marked using the flag mentioned earlier. We delay deleting such used deltas individually until all deltas in the delta store have been used for a CSR update; then we clear all the deltas at once to reclaim the memory space.
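The visibility check performed by T_s can be sketched as a simple filter over the delta store. Deltas are plain dictionaries here for self-containment; the function name scan_visible is an assumption of this sketch.

```python
# Sketch of the T_s scan: only deltas committed at or before T_s's snapshot
# timestamp and not yet consumed by an earlier CSR update are returned, and
# each returned delta is marked as used (each delta feeds exactly one update).

def scan_visible(delta_store, ts_scan):
    visible = []
    for d in delta_store:
        if d["txn_ts"] <= ts_scan and not d["used"]:
            visible.append(d)
            d["used"] = True
    return visible
```

Deltas committed after T_s started (here, the one with timestamp 12) are left untouched and will be picked up by the next CSR update.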

Delta merge
Consider regular update transactions (not T_s) T_0, T_1, ..., T_n that store deltas after a CSR update. Each delta stored by a transaction T_i either overwrites another delta associated with the same node by an older transaction T_j (j < i), is overwritten by another delta associated with the same node ID by a newer transaction T_k (k > i), or is the only delta associated with that node. It follows that overwritten deltas are not valid for a CSR update. Therefore, for a CSR, CSR_0, and transactions T_0, T_1, ..., T_n that store deltas, updating CSR_0 to CSR_1 amounts to merging the surviving deltas:

CSR_1 = merge(CSR_0, ⋃_{i=0}^{n} (dist_{T_i} ∪ owrt_{T_i}))

where dist_{T_i} is the set of deltas by transaction T_i uniquely associated with their respective nodes, while owrt_{T_i} are the deltas that were last overwritten by T_i. The deltas (dist_{T_i} and owrt_{T_i}) that are visible to T_s during the delta store scan described above are used in the delta merge. Algorithm 1 describes the delta merge operation. The node IDs are first grouped into two sets L and U based on the maximum node ID in the old CSR (see Fig. 5) before the updates in the deltas, i.e. L contains the IDs of nodes that existed in the old CSR and U contains the IDs of newly inserted nodes (Lines 1 to 5). For nodes that were not updated since the last CSR update, their corresponding entries in the old CSR are copied to the new CSR (Lines 9 and 16). For newly inserted and updated nodes, the entries in the new CSR are taken from the deltas (Lines 12 and 20). Note that this description of the CSR is simplified, since a CSR has three arrays, not one.
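An executable sketch of the merge step, working on all three CSR arrays, could look as follows. It is a simplification of Algorithm 1 under the assumptions that the newest delta per node wins (resolving the overwriting described above) and that deltas arrive as (node_id, cols, vals, txn_ts) tuples; merge_deltas is a name chosen for this sketch.

```python
# Delta merge sketch: rows for updated (L) or newly inserted (U) nodes come
# from the latest delta per node; untouched rows are copied from the old CSR.
# A CSR is a (row_offsets, col_indices, edge_values) triple.

def merge_deltas(old_csr, old_num_nodes, deltas, new_num_nodes):
    row_off, col_idx, edge_val = old_csr
    latest = {}  # node_id -> (cols, vals); sorting by ts makes the newest win
    for node_id, cols, vals, ts in sorted(deltas, key=lambda d: d[3]):
        latest[node_id] = (cols, vals)
    new_off, new_col, new_val = [0], [], []
    for u in range(new_num_nodes):
        if u in latest:                       # updated or newly inserted node
            cols, vals = latest[u]
        elif u < old_num_nodes:               # untouched: copy the old row
            cols = col_idx[row_off[u]:row_off[u + 1]]
            vals = edge_val[row_off[u]:row_off[u + 1]]
        else:                                 # new node without stored edges
            cols, vals = [], []
        new_col.extend(cols)
        new_val.extend(vals)
        new_off.append(len(new_col))
    return new_off, new_col, new_val
```

Because each delta carries the full new adjacency of its node, the merge never has to patch individual entries; it simply rebuilds each affected row in one pass over the node IDs.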

Adaptive approach
Each of the approaches discussed above has its advantages, depending on the workload pattern. Therefore, to leverage the strengths of both, we use an adaptive mechanism in which the more suitable approach is always chosen. The delta approach saves the cost of a complete CSR rebuild. Hence, it is more suitable for scenarios where the workloads between successive CSR updates or successive executions of analytics are not bulk updates (we define a measure for this in Sect. 5). Rebuilding the CSR in such scenarios would repeatedly incur performance penalties, degrading the overall system performance. The delta approach is thus better suited for (ideal) HTAP freshness requirements than the rebuild approach. However, rebuilding the CSR is the more suitable option (i) if the overhead of the delta approach exceeds the overhead of a full rebuild, e.g. due to heavy transactional updates to the main graph triggering a huge number of deltas to be stored, (ii) if the time interval between successive analytics is much longer than the rebuild time, as in cases where several analytics tasks are first queued and then executed as a batch, or (iii) if the system trades off freshness for reduced cost.
We use a cost model to make the decision as to when to use which approach (see Sect. 5).

Cost model
As mentioned earlier, our adaptive mechanism uses a cost model to decide when to use the delta approach and when to rebuild the CSR. This is part of our addressing of research question RQ3. One measure that could be used is the time interval between consecutive CSR updates or the time between successive executions of analytics: as long as the time period since the last CSR update is below a determined threshold, the delta approach is maintained, i.e. transactions store deltas for their updates at commit time, which are used afterwards to update the CSR before the execution of analytics; otherwise, the CSR is rebuilt. Another metric that can be used to decide when to switch the update handling approach is the total size of the delta store, particularly in cases where the storage or memory budget is limited.
However, we use the number of deltas as the metric for our cost model. Our delta approach for update propagation in graph HTAP serves as a direct alternative to a costly rebuild, especially when intermittent transactional updates interleave executions of analytics. Although the delta approach avoids the performance overhead of a rebuild, it nonetheless introduces a different overhead: (1) the time to append the deltas to the delta store during transaction commit (see Sect. 4.2.1), (2) the time to scan the delta store for the deltas with which to update the CSR for the execution of analytics (see Sect. 4.2.2), and (3) the time for the actual delta merge operation (see Sect. 4.2.3).
With the cost model, we aim to determine up to what number of deltas the delta time, i.e. the total time needed to store the deltas, scan the delta store for the deltas, and merge them afterwards to update the CSR, is less than the time of a full rebuild. We figure, and we validate in Sect. 6, that the delta time comprises three time components, t_1, t_2 and t_3, each of which is correlated to the number of stored deltas, and an additional time component t_4 that is correlated to the size of the graph. t_1 and t_2 are the time to store the deltas and the time to scan the delta store, while the sum of t_3 and t_4 is the time to merge the deltas. The correlations between the number of deltas and t_2 and t_3 as well as the correlation between the size of the graph and t_4 are linear, and hence we model them using linear models. The correlation between the number of deltas and t_1 approximately fits a linear model as well. The CSR rebuild time t_rebuild is linearly correlated to the size of the graph, as we show in Sect. 6; similarly, we fit it to a linear model. Therefore, the delta time for d stored deltas is

t_delta = t_1 + t_2 + t_3 + t_4 = (m_1 · d + c_1) + (m_2 · d + c_2) + (m_3 · d + c_3) + t_4    (1)

where m_1, m_2, and m_3 are the gradients and c_1, c_2, and c_3 are the y-intercepts of the corresponding equations modelling the correlations. We directly extrapolate the time components t_1, t_2, and t_3 that are correlated to the number of deltas from the model equations for a given number of stored deltas. In the case of t_4 and t_rebuild, however, we do not extrapolate the times for each instance of storing the deltas, because the change in graph size due to the deltas is only a small fraction of the overall graph size. In other words, the number of deltas is a small percentage of the graph and hence results in a negligible change in the values of t_4 and t_rebuild (see Sect. 6). Therefore, we keep the cost model based on the initial graph size and update the values of t_4 and t_rebuild only after successive instances of CSR updates, once the changes in t_4 and t_rebuild due to the cumulative change in graph size over these CSR update instances are no longer negligible.
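The linear cost model above can be sketched in a few lines. The gradients, intercepts, and times used in the example are made-up placeholder values, not measured ones, and the function names are assumptions of this sketch.

```python
# Sketch of the cost model: t_delta as a linear function of the number of
# stored deltas d, and the largest d for which the delta approach still beats
# a full rebuild by the required margin b_min (percent).

def delta_time(num_deltas, m, c, t4):
    """t_delta = t1 + t2 + t3 + t4, with each t_i = m_i * num_deltas + c_i."""
    t1, t2, t3 = (m[i] * num_deltas + c[i] for i in range(3))
    return t1 + t2 + t3 + t4

def max_deltas(b_min, t_rebuild, m, c, t4):
    """Largest number of deltas for which the delta approach still saves at
    least b_min percent of the rebuild time, i.e.
    t_delta <= (1 - b_min / 100) * t_rebuild, solved for num_deltas."""
    budget = (1 - b_min / 100) * t_rebuild - sum(c) - t4
    return int(budget / sum(m))
```

Because every term is linear in the number of deltas, the threshold can be solved for in closed form once the model parameters are fitted, so the commit-time check reduces to a single comparison against a precomputed number.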
As threshold, we set #max, the maximum number of deltas for which a minimum required benefit, b_min, is achieved. We define b_min as the minimum percentage reduction of the rebuild time that the delta approach must yield:

b_min = ((t_rebuild − t) / t_rebuild) × 100.

Solving t(#max) = (1 − b_min/100)·t_rebuild for the number of deltas yields

#max = ((1 − b_min/100)·t_rebuild − t4 − (c1 + c2 + c3)) / (m1 + m2 + m3).    (4)

b_min and t_rebuild are known values: b_min is a system configuration parameter, while t_rebuild is obtained upon the initial CSR build or extrapolated from Eq. (3). Likewise, t4 is extrapolated as modeled in Eq. (2). As for m1, m2, m3, c1, c2, and c3, we obtain their values by fitting the respective model equations with data points. Fitting these model equations requires a training phase. Although this incurs a training overhead, the overhead is insignificant: the training is done only once, and it amortizes across all subsequent uses of #max by the adaptive mechanism.

The value of #max is used each time an update transaction commits, as depicted in Fig. 6. Before any update transaction appends its deltas, it first checks a delta mode flag (initially, the flag is ON). If the flag is ON, the transaction checks the number of deltas in the delta store (including its own). If they are fewer than #max, it stores its deltas. Otherwise, it skips storing its deltas and sets the delta mode flag OFF. Subsequent transactions equally do not store deltas while the flag is OFF, since the delta approach would no longer be faster by b_min%; the CSR will hence be rebuilt instead. At this point, all deltas in the delta store are deleted, since they will not be used in any subsequent update of the CSR. However, the delta mode flag is not switched back ON until the next execution of analytics, for which the CSR is rebuilt. Upon rebuilding the CSR, the flag is turned ON, and subsequent transactions again store deltas for their updates in the delta store. In sum, the delta approach is used whenever the flag is ON, while the CSR is rebuilt when the flag is OFF.
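The commit-time logic around the delta mode flag can be sketched roughly as follows. This is a simplified, single-threaded sketch with hypothetical names; the actual system applies this check on each committing update transaction as depicted in Fig. 6.

```python
# Simplified sketch of the adaptive decision taken at commit time.
# n_max corresponds to #max from the cost model; names are illustrative.

class AdaptiveDeltaStore:
    def __init__(self, n_max):
        self.n_max = n_max          # threshold #max from the cost model
        self.delta_mode = True      # delta mode flag, initially ON
        self.deltas = []            # the delta store

    def on_commit(self, txn_deltas):
        """Called by each committing update transaction."""
        if self.delta_mode:
            if len(self.deltas) + len(txn_deltas) < self.n_max:
                self.deltas.extend(txn_deltas)   # store the deltas
                return "delta"
            # benefit would drop below b_min: stop collecting deltas
            self.delta_mode = False
            self.deltas.clear()                  # deltas will not be merged
        return "rebuild"

    def on_csr_rebuild(self):
        # once the CSR has been rebuilt for analytics, resume delta mode
        self.delta_mode = True
```

Whenever `on_commit` returns "rebuild", the next analytics execution triggers a full CSR rebuild, after which `on_csr_rebuild` switches the flag back ON.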

Evaluation
In this section, we first describe aspects of our experimental environment: hardware setup, dataset and query workload. We subsequently address RQ1 (see Sect. 1) in Sect. 6.3 by evaluating our delta approach in comparison with a complete rebuild approach. Afterwards, in Sect. 6.4, we conduct an evaluation of the delta time components discussed in Sects. 4.2 and 5. We evaluate our cost-model-based adaptive approach (see Sects. 4.3 and 5) in Sect. 6.5. We start with the calibration of our cost model in Sect. 6.5.1, followed by an evaluation of the model in Sect. 6.5.2. As for RQ2 (see Sect. 1), we evaluate the merging of deltas to update the CSR both on CPU and GPU in Sect. 6.6.

Setup
We conduct our experiments on two machines: Machine 1 A dual-socket Intel Xeon Gold 5215 with 10 cores per socket running at a maximum of 3.40GHz. The machine is equipped with 384GB DRAM, 1.5TB Intel Optane DC Persistent Memory Module (DCPMM) operating in AppDirect mode, 4x 1.0TB Intel SSD DC P4501 Series connected via PCIe 3.1 and runs on CentOS 7.9 with Linux Kernel 5.14. We use the Intel Persistent Memory Development Kit (PMDK) version 1.11 and create an ext4 filesystem on the PMem DIMMs, mounted with the DAX option to enable direct loads and stores, thereby bypassing the OS cache.
Machine 2 A dual-socket AMD EPYC 7F52 with a total of 32 cores clocked at a maximum frequency of 3.5GHz. It has 528GB DRAM and runs CentOS 7.9. We carry out the analytics experiments leveraging the Gunrock framework on an NVIDIA A100 Tensor Core GPU with 40GB memory, connected to this machine via a PCIe 4.0 interconnect. We use NVCC version 11.2.

(Fig. 6: Adaptive mechanism, applied on each committing update transaction)

Data and workload
We use the Linked Data Benchmark Council's Social Network Benchmark (LDBC SNB) dataset at scale factors (SF) 1, 3, 10, and 30. The LDBC-SNB [24] is based on a social network of different entity types interconnected by relationships, modeled as per the label property graph model. We load the data into our system as the main graph which is updated by transactions. Building a CSR representation from this main graph entails a full scan of the tables containing nodes and relationships in order to retrieve the adjacency list of each node and populate the CSR arrays therewith.
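For illustration, such a full CSR build from the scanned relationships can be sketched as follows. This is a simplified stand-in for the table scan described above, using a plain edge list instead of the system's node and relationship tables; all names are ours.

```python
# Illustrative full CSR build: count degrees, prefix-sum them into the
# offsets array, then scatter each relationship into the targets array.

def build_csr(num_nodes, relationships):
    """relationships: list of (src, dst) pairs; returns (offsets, targets)."""
    degree = [0] * num_nodes
    for src, _ in relationships:
        degree[src] += 1
    # exclusive prefix sum over the degrees gives the CSR offsets
    offsets = [0] * (num_nodes + 1)
    for v in range(num_nodes):
        offsets[v + 1] = offsets[v] + degree[v]
    # scatter the adjacency of each node into its slot range
    targets = [0] * len(relationships)
    cursor = offsets[:-1].copy()   # next free slot per node
    for src, dst in relationships:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets
```

The two passes over the relationships mirror why a rebuild scales with the size of the whole graph rather than with the number of recent updates.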
As discussed already in Sect. 4.2, the update operations that change the content of the arrays in a CSR are node and relationship create and delete operations. Therefore, we make use of the following set of four basic operations:
Update A: Selects a node with a given ID in the graph and creates an outgoing and an incoming relationship to connect it to two other existing nodes in the graph.
Update B: Creates a new node and inserts it into the graph by creating an outgoing and an incoming relationship to connect it to two other nodes that already exist in the graph.
Update C: Selects the relationship connecting two given nodes in the graph and deletes the relationship.
Update D: Selects a node with a given ID in the graph and deletes it along with all its outgoing and incoming relationships.
We run these operations as transactional updates throughout our experiments. The executions of Update A, B, C & D are distributed as 66%, 22%, 11%, and 1%, respectively.
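As a rough illustration of the delta records these operations generate, the four operations can be sketched as follows. The record shapes here are our own simplification for illustration, not the system's actual delta format.

```python
# Hypothetical delta records for the four basic update operations; each
# committing transaction appends such records to the delta store instead
# of touching the CSR directly.

def deltas_update_a(node_id, out_target, in_source):
    # Update A: connect an existing node via one outgoing and one incoming edge
    return [("add_rel", node_id, out_target), ("add_rel", in_source, node_id)]

def deltas_update_b(new_node_id, out_target, in_source):
    # Update B: create a node, then connect it like Update A
    return [("add_node", new_node_id)] + deltas_update_a(new_node_id, out_target, in_source)

def deltas_update_c(src, dst):
    # Update C: delete the relationship between two given nodes
    return [("del_rel", src, dst)]

def deltas_update_d(node_id, incident_rels):
    # Update D: delete a node together with all incident relationships
    return [("del_rel", s, d) for s, d in incident_rels] + [("del_node", node_id)]
```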

Delta vs rebuild
We start with cases of transactional updates that generate a relatively low number of deltas, in which the delta approach would clearly yield performance benefits. We use the set of operations, Update A, B, C & D, distributed as mentioned above. Figure 7 depicts the total time of the CSR update with the delta approach (t in Sect. 5) in comparison with the time of a CSR rebuild (t_rebuild in Sect. 5). We execute 1K update operations on machine 1 on each of the SF1 and SF10 graphs. The time to store the deltas in the delta store, scan the delta store to retrieve them, and merge them to update the CSR is 173.00 ms for SF1 and 1.19 s for SF10, while the time to rebuild the CSR is 6.57 s and 60.03 s, respectively, an improvement of approximately 97% and 98%. As expected, the delta approach outperforms the rebuild approach particularly when the transactional workloads between successive CSR updates are not bulky. This addresses research question RQ1.

Delta time components
Next, we carry out transactional updates on the graph with an increasing number of stored deltas by varying the number of update operations. Figure 8 depicts the CSR update times with the delta approach on the SF1 (left) and SF10 (right) graphs for 10K to 100K and 100K to 1M update operations, respectively. We further break down the time of our delta approach into its constituent time components with regard to the three steps of the delta approach explained in Sect. 4.2 and Sect. 5. These time components are: (1) the time for appending deltas into the delta store (denoted as t1 in Sect. 5), (2) the time for scanning the delta store upon analytics in order to retrieve the corresponding deltas (denoted as t2 in Sect. 5), and (3) the time for merging the deltas to update the CSR so that the analytics are executed on the fresh version of the graph (denoted as the sum of t3 and t4 in Sect. 5). Figures 9, 10 and 11 show the correlations between the number of deltas and the time of appending the deltas, scanning the delta store, and merging the deltas, respectively. The last two correlations are linear and we model them as such, while the first correlation approximately fits a linear model as well. Since the delta merge essentially comprises updating entries of the CSR for updated and newly inserted nodes based on their respective deltas (Lines 12 and 20 of Algorithm 1) and copying entries of the CSR for nodes that remain unchanged (Lines 9 and 16 of Algorithm 1), we compare the delta merge times with the time to directly copy the CSR. We depict this as two time components in Fig. 12: CSR update and CSR copy. The former (i.e. t3 in Sect. 5) is directly correlated to the number of stored deltas, as seen in Fig. 12, while the latter (i.e. t4 in Sect. 5) is correlated to the size of the graph, as we show in Fig. 14.
Moreover, we take a closer look and highlight the time to store the deltas, the time to scan the delta store for the deltas, and the time to merge the deltas to update the CSR as percentages of the total time of the delta approach for the CSR update. For a relatively low number of deltas, most of the time is initially spent in the delta merge function. For example, with ∼2K deltas, the percentages of the delta store, delta scan, and delta merge times are 2%, 0.2%, and 97.8%, respectively. However, as the number of deltas grows, as depicted in Fig. 13, the time of storing the deltas eventually becomes the dominant part, and the delta scan time gradually surpasses the delta merge time. Our cost model accounts for this, since it models each delta time component separately (see Eq. (2)).

Adaptive approach
We evaluate our adaptive approach described in Sect. 4.3 based on the cost model presented in Sect. 5. This pertains to the research question RQ1.

Cost model calibration
For our adaptive mechanism, we calculate the maximum number of stored deltas, #max, for which the delta approach yields performance benefits as per the minimum percentage improvement on the CSR rebuild time, b_min, set in the cost model. We set b_min at 20%. t_rebuild and t4 for the SF10 graph are 60.03 s and 1.03 s, respectively. Note that, as mentioned in Sect. 5, both t4 and t_rebuild are linearly correlated to the size of the graph. These linear correlations are depicted in Figs. 14 and 15, respectively, using the SF1, SF3, SF10 and SF30 graphs with approximately 3M nodes & 17M relationships, 9M nodes & 53M relationships, 30M nodes & 177M relationships, and 89M nodes & 541M relationships, respectively. The correlations fit linear models, and both t_rebuild and t4 are therefore directly extrapolated from the linear models for other graph sizes. We conduct the training phase to fit data points to the models and obtain the values of m1, m2, m3, c1, c2 and c3 in Eq. (2). Lastly, using Eq. (4), we calculate the value of #max for our setting to be 4,702,568, i.e. the maximum number of deltas for which the delta approach is at least 20% faster than rebuilding the CSR.
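The calculation of #max from the calibrated model can be sketched as follows, solving the delta-time model for the number of deltas at which the benefit drops to b_min. The numeric values in the usage example are illustrative, not the calibrated ones from our setting.

```python
# Sketch of Eq. (4): the number of deltas n at which the delta time
# equals (1 - b_min/100) * t_rebuild. Parameter names mirror the
# gradients (m1..m3) and intercepts (c1..c3) of the fitted models.

def n_max(b_min, t_rebuild, t4, m1, c1, m2, c2, m3, c3):
    budget = (1 - b_min / 100) * t_rebuild        # allowed delta time
    return (budget - t4 - (c1 + c2 + c3)) / (m1 + m2 + m3)
```

For instance, with b_min = 20%, t_rebuild = 100 s, t4 = 4 s, zero intercepts, and gradients summing to 4 s per delta, the threshold evaluates to 19 deltas.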

Cost model evaluation
Having obtained #max, we evaluate the adaptive approach as depicted in Fig. 16. We execute update operations in steps of 100K, starting from 100K. The horizontal line Rebuild in Fig. 16 denotes the CSR rebuild time, while Threshold denotes the corresponding delta time for #max as predicted by our cost model. Note that in Fig. 16 we only show a range of deltas around #max, since that is the point of particular interest with respect to the adaptive mechanism's switching between the delta and rebuild modes. The cost model predicts the delta time for any given number of deltas and compares it with the Threshold. The areas prone to premature or late adaptive switching due to delta time prediction errors are therefore those around #max, where the difference between the delta time and the Threshold is small; hence, we limit the evaluation to numbers of deltas close to #max. In principle, the adaptive approach would lie either below the Threshold line, on the Threshold line, or on the Rebuild line, but errors in predicting the delta time from the cost model can place it between the Threshold line and the Rebuild line. For instance, with ∼4.5M deltas, the adaptive mechanism selects the delta approach, as it is faster by at least 20%. With ∼4.6M deltas, the adaptive mechanism still selects the delta approach because, based on the predicted time from the cost model, it is still faster by at least 20%; in actuality, however, the delta approach is only 18.98% faster, which is why the adaptive approach lies slightly above the Threshold line. With ∼4.8M deltas, the delta approach is less than 20% faster based on the time prediction from the cost model, and as a result the adaptive mechanism switches over to rebuilding the CSR.
Note that, in addition to being a system configuration parameter, b_min can also serve as an error-bounding mechanism. To show the effectiveness and accuracy of our cost model, Fig. 17 compares the delta times predicted by our cost model for a varying number of deltas with the actual delta times obtained from the experiments. The accuracy of our cost model is approximately 96%.
It is worth mentioning that the adaptive switch to CSR rebuild does not mean that rebuild is used from that point onward. As previously mentioned in Sect. 5, the delta mode flag is turned ON following the next CSR rebuild. In Fig. 16, for example, the ∼4.8M deltas are deleted and the delta mode flag is kept OFF. When the next analytics task arrives, the CSR is updated using the rebuild approach as per the adaptive mechanism. Thereupon, the flag is switched back ON, and the adaptive mechanism uses the delta approach for the next CSR update as long as the newly stored deltas do not exceed #max.
In this evaluation, we update the CSR upon execution of analytics, thereby ensuring that the analytics are executed on the most recent version of the graph. With respect to RQ3 (see Sect. 1), this is a trade-off between CSR update cost and freshness. An alternative is to avoid the cost by updating the CSR only occasionally. This alternative fits workloads with bulky transactional updates. However, it (i) trades off freshness and (ii) cannot leverage the delta approach, since the number of deltas would exceed #max well before the time of the occasional CSR update. Our approach guarantees freshness at an overall differential update cost that, depending on the number of newly appended deltas between successive CSR updates, can still be lower than the cost of an occasional CSR rebuild; this number of deltas depends entirely on the workload pattern. Moreover, executing analytics on stale data, especially in an HTAP setting with freshness requirements, defeats the purpose of having an HTAP system in the first place.

Delta merge on CPU and GPU
Lastly, we address research question RQ2, concerning where to carry out the CSR updates. When updating the CSR by way of rebuilding, we do this on the CPU (host), since the main graph resides on the host. Rebuilding the CSR entails scanning the entire main graph for the adjacency list of each node and populating the CSR arrays with the adjacencies. The rebuild itself is a sequential operation dominated by memory copies and, as a result, suited for the CPU rather than the GPU. Additionally, rebuilding the CSR on the GPU (device) would require copying the main graph as is to the GPU. With respect to updating the CSR by way of our delta approach, however, we evaluate merging the deltas on both CPU and GPU. In the first case, the delta merge is carried out on the CPU and, after merging the deltas, the updated CSR is transferred to the GPU for the execution of analytics; there is thus no transfer of the deltas themselves, since the deltas are stored, scanned, and merged altogether on the host. In the second case, the deltas are transferred to the GPU and merged there for the execution of analytics. Hence, merging on the GPU (where only the deltas are transferred) has less data transfer overhead than merging on the CPU (where the entire updated CSR is transferred to the GPU), since the deltas are a small fraction of the entire CSR. Figure 18 illustrates this difference in transfer time between the two for a varying number of deltas, whereby merging on the GPU incurs much less transfer time than merging on the CPU. Apart from the transfer time, we also compare the two based on the update time, i.e. the time for the actual delta merge operation. As shown in Fig. 19, the merge operation is much faster on the CPU than on the GPU. Overall, taking into account both the transfer time in Fig. 18 and the update time in Fig. 19, it is faster to carry out the delta merge on the CPU than on the GPU, since the update time is always much more significant than the transfer time, even for a large number of deltas. Nevertheless, it is noteworthy that the delta merge on the GPU is still faster than rebuilding the CSR when the number of stored deltas is below a threshold. Hence, our adaptive approach could be extended to incorporate merging deltas on the GPU; however, we do not include that in this paper.
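A simplified CPU-side delta merge in the spirit of Algorithm 1, copying the entries of untouched nodes and rewriting the entries of touched ones, can be sketched as follows. The input format, a mapping from each touched node to its fully merged adjacency list, is our own simplification of the per-node deltas.

```python
# Simplified delta merge into a CSR: unchanged nodes' entries are plain
# copies from the old arrays; touched nodes' entries are rewritten from
# their merged adjacency lists. Names are illustrative.

def merge_deltas(offsets, targets, new_adjacency):
    """new_adjacency: dict node -> full new adjacency list after deltas."""
    num_nodes = len(offsets) - 1
    new_offsets, new_targets = [0], []
    for v in range(num_nodes):
        if v in new_adjacency:                       # node touched by deltas
            new_targets.extend(new_adjacency[v])
        else:                                        # unchanged: copy old slice
            new_targets.extend(targets[offsets[v]:offsets[v + 1]])
        new_offsets.append(len(new_targets))
    return new_offsets, new_targets
```

The copy branch corresponds to the graph-size-dependent component t4, while the rewrite branch corresponds to the delta-dependent component t3 of the cost model.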

Conclusion and future work
In this paper, we presented an adaptive update handling approach in a graph HTAP setting where transactional workloads update the main graph on the host while analytics are offloaded to the GPU for accelerated execution. We addressed the problem of update handling where the transactional updates to the main graph need to be propagated to the CSR replica of the graph on the GPU. We adapted the concept of a delta store as a faster update handling approach than rebuilding. We then formulated an adaptive mechanism that switches between the delta and rebuild approaches based on a cost model. Our approach can be applied to other HTAP systems with separate or specialized storage and engines for transactions and analytics. In the course of our contributions in this paper, we tackled the four research questions presented in Sect. 1. We addressed RQ1 in Sects. 6.3, 6.4 and 6.5, RQ2 in Sect. 6.6, RQ3 in Sect. 6.5.2 and RQ4 in Sect. 4.2.
In future work, we plan to extend our delta approach of update propagation to dynamic data structures as well.