ChronoDB [25] is a versioned key-value store and the bottom layer in our architecture. Its main responsibilities are persistence, versioning, branching and indexing. As all other components in our architecture rely on this store, we formalized its data structures and operations during the design phase.
Formal foundation
Salzberg and Tsotras identified three key query types which have to be supported by a data store in order to provide the full temporal feature set [62]. For versioning purposes, this set can be reused by restricting the features to timestamps instead of time ranges. This gives rise to the following three types of possible queries:
- Pure-Timeslice Query: Given a point in time (e.g., date and time), find all keys that existed at that time.
- Range-Timeslice Query: Given a set of keys and a point in time, find the value for each key which was valid at that time.
- Pure-Key Query: Given a set of keys, for each key find the values that comprise its history.
We use these three core query types as the functional requirements for our formalization approach. For practical reasons, we furthermore require that inserted entries never have to be modified again. In this way, we can achieve a true append-only store. In order to maintain the traceability of changes over time (e.g., for auditing purposes [R8]), we also require that the history of a key must never be altered, only appended.
The key concept behind our formalism is based on the observation that temporal information always adds an additional dimension to a dataset. A key-value format has only one dimension, which is the key. By adding temporal information, the two resulting dimensions are the key, and the time at which the value was inserted. Therefore, a matrix is a very natural fit for formalizing the versioning problem, offering the additional advantage of being easy to visualize. The remainder of this section consists of definitions which provide the formal semantics of our solution, interleaved with figures and (less formal) textual explanations.
Definition 1
Temporal Data Matrix
Let T be the set of all timestamps with \(T \subseteq \mathbb {N}\). Let \(\mathcal {S}\) denote the set of all non-empty strings and K be the set of all keys with \(K \subseteq \mathcal {S}\). Let \(\mathbb {B}\) denote the set of all binary strings with \(\mathbb {B} \subseteq \{0,1\}^+ \cup \{null, \epsilon \}\). Let \(\epsilon \in \mathbb {B}\) be the empty binary string with \(\epsilon \ne null\). We define the Temporal Data Matrix \(\mathcal {D} \in \mathbb {B}^{\infty \times \infty }\) as:
$$\begin{aligned} \mathcal {D}: T \times K \rightarrow \mathbb {B} \end{aligned}$$
We define the initial value of a given Temporal Data Matrix D as:
$$\begin{aligned} D_{t,k} := \epsilon \qquad \forall t \in T, \forall k \in K \end{aligned}$$
In Definition 1, we define a Temporal Data Matrix, which is a key-value mapping enhanced with temporal information [R2, R3]. Note that the number of rows and columns in this matrix is infinite. In order to retrieve a value from this matrix, a key string and a timestamp are required. We refer to such a pair as a Temporal Key. The matrix can contain an array of binary values in every cell, which can be interpreted as the serialized representation of an arbitrary object. The formalism is therefore not restricted to any particular value type. The dedicated null value (which is different from all other bit-strings and also different from the \(\epsilon \) values used to initialize the matrix) will be used as a marker that indicates the deletion of an element later in Definition 3.
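To make the notion of a Temporal Key concrete, the following Java sketch pairs a key string with a timestamp and orders such pairs first by key and then by timestamp. The class name, fields and methods are illustrative assumptions and are not taken from the ChronoDB API.

```java
// Illustrative sketch only: a Temporal Key pairs a user key (element of K)
// with a timestamp (element of T). Not the actual ChronoDB implementation.
public final class TemporalKey implements Comparable<TemporalKey> {

    private final String key;
    private final long timestamp;

    public TemporalKey(String key, long timestamp) {
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must be a non-empty string");
        }
        if (timestamp < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        this.key = key;
        this.timestamp = timestamp;
    }

    public String getKey() { return this.key; }
    public long getTimestamp() { return this.timestamp; }

    // Order first by key, then by timestamp; this matches the primary index
    // ordering discussed later in this section.
    @Override
    public int compareTo(TemporalKey other) {
        int cmp = this.key.compareTo(other.key);
        return cmp != 0 ? cmp : Long.compare(this.timestamp, other.timestamp);
    }
}
```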
In order to guarantee the traceability of changes [R8], entries in the past must not be modified, and new entries may only be appended to the end of the history, not inserted at an arbitrary position. We use the notion of a dedicated now timestamp for this purpose.
Definition 2
Now Operation
Let D be a Temporal Data Matrix. We define the function \(now: \mathbb {B}^{\infty \times \infty } \rightarrow T\) as:
$$\begin{aligned} now(D) = max(\{t | k \in K, D_{t,k} \ne \epsilon \} \cup \{0\}) \end{aligned}$$
Definition 2 introduces the concept of the now timestamp, which is the largest (i.e., latest) timestamp at which data has been inserted into the store so far, initialized at zero for empty matrices. This particular timestamp will serve as a safeguard against temporal inconsistencies in several operations. We continue by defining the temporal counterparts of the put and get operations of a key-value store.
Definition 3
Temporal Write Operation
Let D be a Temporal Data Matrix. We define the function \(put: \mathbb {B}^{\infty \times \infty } \times T \times K \times \mathbb {B} \rightarrow \mathbb {B}^{\infty \times \infty }\) as:
$$\begin{aligned} put(D,t,k,v) = D' \end{aligned}$$
with \(v \ne \epsilon \), \(t > now(D)\) and
$$\begin{aligned} D'_{i,j} := {\left\{ \begin{array}{ll} v &{} \hbox {if } t = i \wedge k = j\\ D_{i,j} &{} \hbox {otherwise}\\ \end{array}\right. } \end{aligned}$$
The write operation put replaces a single entry in a Temporal Data Matrix by specifying the exact coordinates and a new value for that entry. All other entries remain the same as before. Please note that, while v must not be \(\epsilon \) in the context of a put operation (i.e., a cell cannot be “cleared”), v can be null to indicate a deletion of the key k from the matrix. Also, we require that an entry must not be overwritten. This is given implicitly by the fact that each put advances the result of now(D), and further insertions are only allowed after that timestamp. Furthermore, write operations are not permitted to modify the past in order to preserve consistency and traceability, which is also asserted by the condition on the now timestamp. This operation is limited in that it modifies only one key at a time. In the implementation, we generalize it to allow simultaneous insertions on several keys via transactions.
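As a minimal illustration of Definitions 2 and 3, the following Java sketch implements now and put on a sparse in-memory representation of the matrix (one sorted row of timestamps per key). All names, the sentinel standing in for the null marker, and the data structure choice are assumptions for illustration, not ChronoDB's actual internals.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch of Definitions 2 and 3 on a sparse matrix representation:
// one sorted row (timestamp -> value) per key. Absent map entries correspond to epsilon.
public final class TemporalMatrixWrite {

    // Sentinel (compared by reference) standing in for the formal 'null' deletion marker.
    public static final byte[] DELETION_MARKER = new byte[0];

    // now(D): the largest timestamp at which any entry was inserted, or 0 for an empty matrix.
    public static long now(Map<String, NavigableMap<Long, byte[]>> matrix) {
        long max = 0L;
        for (NavigableMap<Long, byte[]> row : matrix.values()) {
            if (!row.isEmpty()) {
                max = Math.max(max, row.lastKey());
            }
        }
        return max;
    }

    // put(D, t, k, v): only appends strictly after now(D); never overwrites existing entries.
    public static void put(Map<String, NavigableMap<Long, byte[]>> matrix,
                           long timestamp, String key, byte[] value) {
        if (value == null) {
            throw new IllegalArgumentException("use DELETION_MARKER to delete a key");
        }
        if (timestamp <= now(matrix)) {
            throw new IllegalArgumentException("writes must occur strictly after now(D)");
        }
        matrix.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }
}
```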
Definition 4
Temporal Read Operation
Let D be a Temporal Data Matrix. We define the function \(get: \mathbb {B}^{\infty \times \infty } \times T \times K \rightarrow \mathbb {B}\) as:
$$\begin{aligned} get(D,t,k) := {\left\{ \begin{array}{ll} D_{u,k} &{} \text{ if } u \ge 0 \wedge D_{u,k} \ne null\\ \epsilon &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
with \(t \le now(D)\) and
$$\begin{aligned} u := max(\{x | x \in T, x \le t, D_{x,k} \ne \epsilon \} \cup \{-1\}) \end{aligned}$$
The function get first attempts to return the value at the coordinates specified by the key and timestamp (\(u = t\)). If that position is empty, we scan for the entry in the same row with the highest timestamp and a non-empty value, considering only entries with lower timestamps than the request timestamp. In the formula, we have to add \(-1\) to the set from which u is chosen to cover the case where there is no other entry in the row. If there is no such entry (i.e., \(u = -1\)) or the entry is null, we return the empty binary string, otherwise we return the entry with the largest encountered timestamp.
This process is visualized in Fig. 4. In this figure, each row corresponds to a key, and each column to a timestamp. The depicted get operation is working on timestamp 5 and key ‘d’. As \(D_{5,d}\) is empty, we attempt to find the largest timestamp smaller than 5 where the value for the key is not empty, i.e., we move left until we find a non-empty cell. We find the result in \(D_{1, d}\) and return v1. This is an important part of the versioning concept: a value for a given key is assumed to remain unchanged until a new value is assigned to it at a later timestamp. This allows any implementation to conserve memory on disk, as writes only occur if the value for a key has changed (i.e., no data duplication is required between identical revisions). Also note that we do not need to update existing entries when new key-value pairs are being inserted, which allows for pure append-only storage. In Fig. 4, the value v1 is valid for the key ‘d’ for all timestamps between 1 and 5 (inclusive). For timestamp 0, the key ‘d’ has value v0. Following this line of argumentation, we can generalize and state that a row in the matrix, identified by a key \(k \in K\), contains the history of k. This is formalized in Definition 5. A column, identified by a timestamp \(t \in T\), contains the state of all keys at that timestamp, with the additional consideration that value duplicates are not stored as they can be looked up in earlier timestamps. This is formalized in Definition 6.
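The corresponding read operation can be sketched on the same sparse representation used above; the floorEntry call performs the “move left until a non-empty cell is found” step. Names and data structures are again illustrative assumptions.

```java
import java.util.Map;
import java.util.NavigableMap;

// Illustrative sketch of Definition 4 (temporal get) on the sparse row-per-key
// representation. Not the actual ChronoDB internals.
public final class TemporalMatrixRead {

    // Sentinel (compared by reference) standing in for the formal 'null' deletion marker.
    public static final byte[] DELETION_MARKER = new byte[0];

    // get(D, t, k): return the value written at the largest timestamp u <= t for key k,
    // or null (i.e., epsilon) if the row is empty up to t or the latest entry is a deletion.
    public static byte[] get(Map<String, NavigableMap<Long, byte[]>> matrix,
                             long timestamp, String key) {
        NavigableMap<Long, byte[]> row = matrix.get(key);
        if (row == null) {
            return null; // the key never existed
        }
        // "Move left" from the requested timestamp to the closest earlier entry.
        Map.Entry<Long, byte[]> entry = row.floorEntry(timestamp);
        if (entry == null || entry.getValue() == DELETION_MARKER) {
            return null; // no entry up to t, or the key was deleted
        }
        return entry.getValue();
    }
}
```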
Definition 5
History Operation
Let D be a Temporal Data Matrix, and t be a timestamp with \(t \in T, t \le now(D)\). We define the function \(history: \mathbb {B}^{\infty \times \infty } \times T \times K \rightarrow 2^{T}\) as:
$$\begin{aligned} history(D,t,k) := \{x | x \in T, x \le t, D_{x,k}\ne \epsilon \} \end{aligned}$$
In Definition 5, we define the history of a key k up to a given timestamp t in a Temporal Data Matrix D as the set of timestamps less than or equal to t that have a non-empty entry for key k in D. Note that the resulting set will also include deletions, as null is a legal value for \(D_{x,k}\) in the formula. The result is the set of timestamps where the value for the given key changed. Consequently, performing a get operation for these timestamps with the same key will yield different results, producing the full history of the temporal key.
Definition 6
Keyset Operation
Let D be a Temporal Data Matrix, and t be a timestamp with \(t \in T, t \le now(D)\). We define the function \(keyset: \mathbb {B}^{\infty \times \infty } \times T \rightarrow 2^{K}\) as:
$$\begin{aligned} keyset(D,t) := \{x | x \in K, get(D,t,x)\ne \epsilon \} \end{aligned}$$
As shown in Definition 6, the keyset in a Temporal Data Matrix changes over time. We can retrieve the keyset at any desired time by providing the appropriate timestamp t. Note that this works for any timestamp in the past, in particular we do not require that a write operation has taken place precisely at t in order for the corresponding key(s) to be contained in the keyset. In other words, the precise column of t may consist only of \(\epsilon \) entries, but the key set operation will also consider earlier entries which are still valid at t. The version operation introduced in Definition 7 operates in a very similar way, but returns tuples containing keys and values, rather than just keys.
Definition 7
Version Operation
Let D be a Temporal Data Matrix, and t be a timestamp with \(t \in T, t \le now(D)\). We define the function \(version: \mathbb {B}^{\infty \times \infty } \times T \rightarrow 2^{K \times \mathbb {B}}\) as:
$$\begin{aligned} version(D,t) := \{\langle k,v\rangle | k \in keyset(D,t), v = get(D,t,k)\} \end{aligned}$$
Figure 5 illustrates the key set and version operations by example. In this scenario, the key set (or version) is requested at timestamp \(t = 5\). We scan each row for the latest non-\(\epsilon \) entry and add the corresponding key of the row to the key set, provided that a non-\(\epsilon \) right-most entry exists (i.e., the row is not empty) and is not null (the value was not removed). In this example, keyset(D, 5) would return \(\{a,c,d\}\), assuming that all non-depicted rows are empty. b and f are not in the key set, because their rows are empty (up to and including timestamp 5), and e is not in the set because its value was removed at timestamp 4. If we requested the key set at timestamp 3 instead, e would be in the key set. The operation version(D, 5) returns \(\{ \langle a,v0\rangle , \langle c, v2\rangle , \langle d, v4\rangle \}\) in the example depicted in Fig. 5. The key e is not represented in the version because it did not appear in the key set.
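A sketch of the history, keyset and version operations on the same illustrative sparse representation used in the previous sketches; names and structures are assumptions, not the ChronoDB implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;

// Illustrative sketch of Definitions 5-7 on the sparse row-per-key representation.
public final class TemporalMatrixQueries {

    // Sentinel (compared by reference) standing in for the formal 'null' deletion marker.
    public static final byte[] DELETION_MARKER = new byte[0];

    // history(D, t, k): all timestamps <= t at which the key was written (including deletions).
    public static Set<Long> history(Map<String, NavigableMap<Long, byte[]>> matrix,
                                    long timestamp, String key) {
        NavigableMap<Long, byte[]> row = matrix.get(key);
        return row == null
                ? new HashSet<>()
                : new HashSet<>(row.headMap(timestamp, true).keySet());
    }

    // keyset(D, t): all keys whose latest entry at or before t exists and is not a deletion.
    public static Set<String> keyset(Map<String, NavigableMap<Long, byte[]>> matrix,
                                     long timestamp) {
        Set<String> keys = new HashSet<>();
        for (Map.Entry<String, NavigableMap<Long, byte[]>> row : matrix.entrySet()) {
            Map.Entry<Long, byte[]> latest = row.getValue().floorEntry(timestamp);
            if (latest != null && latest.getValue() != DELETION_MARKER) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }

    // version(D, t): the keyset at t, paired with the values valid at t.
    public static Map<String, byte[]> version(Map<String, NavigableMap<Long, byte[]>> matrix,
                                              long timestamp) {
        Map<String, byte[]> result = new HashMap<>();
        for (String key : keyset(matrix, timestamp)) {
            result.put(key, matrix.get(key).floorEntry(timestamp).getValue());
        }
        return result;
    }
}
```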
Table 2 Mapping capabilities to operations [25]
With the given set of operations, we are able to answer all three kinds of temporal queries identified by Salzberg and Tsotras [62], as indicated in Table 2. Due to the restrictions imposed onto the put operation (see Definition 3), data cannot be inserted before the now timestamp (i.e., the history of an entry cannot be modified). Since the validity range of an entry is determined implicitly by the empty cells between changes, existing entries never need to be modified when new ones are being added. The formalization therefore fulfills all requirements stated at the beginning of this section.
Implementation
ChronoDB is our implementation of the concepts presented in the previous section. It is a fully ACID compliant, process-embedded, temporal key-value store written in Java. The intended use of ChronoDB is to act as the storage backend for a graph database, which is the main driver behind numerous design and optimization choices. The full source code is freely available on GitHub under an open-source license.
Implementing the matrix
As the formal foundation includes the concept of a matrix with infinite dimensions, a direct implementation is not feasible. However, a Temporal Data Matrix is typically very sparse. Instead of storing a rigid, infinite matrix structure, we focus exclusively on the non-empty entries and expand the underlying data structure as more entries are being added.
There are various approaches for storing versioned data on disk [15, 46, 50]. We reuse existing, well-known and well-tested technology for our prototype instead of designing custom disk-level data structures. The temporal store is based on a regular B\(^{+}\)-Tree [61]. We make use of the implementation of B\(^{+}\)-Trees provided by the TUPL library. In order to form an actual index key from a Temporal Key, we concatenate the actual key string with the timestamp (left-padded with zeros to achieve equal length), separated by an ‘@’ character. Using the standard lexicographic ordering of strings, we obtain an ordering as shown in Table 3. This implies that our B\(^{+}\)-Tree is ordered first by key, and then by timestamp. The advantage of this approach is that we can quickly determine the value of a given key for a given timestamp (i.e., get is reasonably fast), but the keyset (see Definition 6) is more expensive to compute.
Table 3 Ascending Temporal Key ordering by example [25]
The put operation appends the timestamp to the user key and then performs a regular B\(^{+}\)-Tree insertion. The temporal get operation requires retrieving the next lower entry with the given key and timestamp.
This is similar to regular B\(^{+}\)-Tree search, except that the acceptance criterion for the search in the leaf nodes is “less than or equal to” instead of “equal to”, provided that nodes are checked in descending key order. TUPL natively supports this functionality. After finding the next lower entry, we need to apply a post-processing step in order to ensure correctness of the get operation. Using Table 3 as an example, if we requested aa@0050 (which is not contained in the data), searching for the next-lower key produces a@1000. The key string in this temporal key (a) is different from the one which was requested (aa). In this case, we can conclude that the key aa did not exist up to the requested timestamp (50), and we return null instead of the retrieved result.
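A minimal sketch of this lookup, with a java.util.TreeMap standing in for the TUPL-backed B\(^{+}\)-Tree: floorEntry plays the role of the “less than or equal to” leaf search, followed by the post-processing check on the key part. The padding width, helper names and the omission of deletion markers are simplifying assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the temporal get on the composite-key index.
// A TreeMap stands in for the TUPL-backed B+-Tree; deletion markers are omitted.
public final class TemporalIndexLookup {

    // Build a composite key of the form "<key>@<zero-padded timestamp>".
    // Four digits are used here to match the examples in Table 3.
    private static String compose(String userKey, long timestamp) {
        return userKey + '@' + String.format("%04d", timestamp);
    }

    public static byte[] temporalGet(TreeMap<String, byte[]> index,
                                     String userKey, long timestamp) {
        // Search for the largest composite key <= "<userKey>@<timestamp>".
        Map.Entry<String, byte[]> entry = index.floorEntry(compose(userKey, timestamp));
        if (entry == null) {
            return null; // nothing at or before the requested coordinates
        }
        // Post-processing: the found entry must belong to the same user key.
        String foundKey = entry.getKey().substring(0, entry.getKey().lastIndexOf('@'));
        if (!foundKey.equals(userKey)) {
            return null; // e.g., request "aa@0050" found "a@1000": key "aa" did not exist yet
        }
        return entry.getValue();
    }
}
```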
Due to the way we set up the B\(^{+}\)-Tree, adding a new revision to a key (or adding an entirely new key) has the same runtime complexity as inserting an entry into a regular B\(^{+}\)-Tree. Temporal search also has the same complexity as regular B\(^{+}\)-Tree search, which is \(\mathcal {O}(\hbox {log}(n))\), where n is the number of entries in the tree. By construction, our implementation scales equally well when faced with one key and many versions, many keys with one revision each, or any distribution in between [R5]. An important property of our data structure setup is that, regardless of the versions-per-key distribution, the data structure never degenerates into a list, maintaining an access complexity of \(\mathcal {O}(\hbox {log}(n))\) by means of regular B\(^{+}\)-Tree balancing without any need for additional algorithms.
Branching
Figure 6 shows how the branching mechanism works in ChronoDB [R4]. Based on our matrix formalization, we can create branches of our history at arbitrary timestamps. To do so, we generate a new, empty matrix that will hold all changes applied to the branch it represents. We would like to emphasize that existing entries are not duplicated. We therefore create lightweight branches. When a get request arrives at the first column of a branch matrix during the search, we redirect the request to the matrix of the parent branch, at the branching timestamp, and continue from there. In this way, the data from the original branch (up to the branching timestamp) is still fully accessible in the child branch.
For example, as depicted in Fig. 7, if we want to answer a get request for key c on branch branchA and timestamp 4, we scan the row with key c to the left, starting at column 4. We find no entry, so we redirect the call to the origin branch (which in this case is master), at timestamp 3. Here, we continue left and find the value \(c_{1}\) on timestamp 1. Indeed, at timestamp 4 and branch branchA, \(c_{1}\) is still valid. However, if we issued the same query on master, we would get \(c_{4}\) as our result. This approach to branching can also be employed recursively in a nested fashion, i.e., branches can in turn have sub-branches. The primary drawback of this solution is related to the recursive “backstepping” to the origin branch during queries. For deeply nested branches, this process introduces a considerable performance overhead, as multiple B\(^{+}\)-Trees (one per branch) need to be opened and queried in order to answer the request. This happens more often for branches which are very thinly populated with changes, as this increases the chances of our get request scan ending up at the initial column of the matrix without encountering an occupied cell. The operation which is affected most by branching with respect to performance is the keyset operation (and all other operations that rely on it), as it requires a scan on every row, leading to potentially many backstepping calls.
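The following sketch illustrates this recursive backstepping. The branch representation and field names are assumptions, and deletion markers are omitted for brevity.

```java
import java.util.Map;
import java.util.NavigableMap;

// Illustrative sketch of the branching lookup: if a branch matrix has no entry for the
// requested key at or before the requested timestamp, the call is redirected to the
// origin branch at the branching timestamp. Not the actual ChronoDB classes.
public final class BranchSketch {

    public final String name;
    public final BranchSketch origin;        // null for the master branch
    public final long branchingTimestamp;    // timestamp at which this branch was created
    public final Map<String, NavigableMap<Long, byte[]>> matrix; // this branch's own changes

    public BranchSketch(String name, BranchSketch origin, long branchingTimestamp,
                        Map<String, NavigableMap<Long, byte[]>> matrix) {
        this.name = name;
        this.origin = origin;
        this.branchingTimestamp = branchingTimestamp;
        this.matrix = matrix;
    }

    public byte[] get(String key, long timestamp) {
        NavigableMap<Long, byte[]> row = this.matrix.get(key);
        if (row != null) {
            Map.Entry<Long, byte[]> entry = row.floorEntry(timestamp);
            if (entry != null) {
                return entry.getValue(); // found in this branch's own matrix
            }
        }
        if (this.origin == null) {
            return null; // reached master without finding an entry
        }
        // "Backstep" into the origin branch, capped at the branching timestamp.
        return this.origin.get(key, Math.min(timestamp, this.branchingTimestamp));
    }
}
```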
Caching
A disk access is always slow compared to an in-memory operation, even on a modern solid state drive (SSD). For that reason, nearly all database systems include some way of caching the most recent query results in main memory for later reuse. ChronoDB is no exception, but the temporal aspects demand a different approach to the caching algorithm than in regular database systems, because multiple transactions can simultaneously query the state of the stored data at different timestamps. Due to the way we constructed the Temporal Data Matrix, the chance that a given key does not change at every timestamp is very high. Therefore, we can potentially serve queries at many different timestamps from the same cached information by exploiting the periods in which a given key does not change its value. For the caching algorithm, we apply some of the ideas found in the work of Ramaswamy [57] in a slightly different way, adapted to in-memory processing and caching idioms.
Figure 8 displays an example for our temporal caching approach which we call Mosaic. When the value for a temporal key is requested and a cache miss occurs, we retrieve the value together with the validity range (indicated by gray background in the figure) from the persistent store, and add the range together with its value to the cache. Validity ranges start at the timestamp in which a key-value pair was modified (inclusive) and end at the timestamp where the next modification on that pair occurred (exclusive). For each key, the cache manages a list of time ranges called a cache row, and each range is associated with the value for the key in this period. As these periods never overlap, we can sort them in descending order for faster search, assuming that more recent entries are used more frequently. A cache look-up is performed by first identifying the row by the key string, followed by a linear search through the cached periods. We have a cache hit if a period containing the requested timestamp is found. When data is written to the underlying store, we need to perform a write-through in our cache, because validity ranges that have open-ended upper bounds potentially need to be shortened due to the insertion of a new value for a given key. The write-through operation is fast, because it only needs to check if the first validity range in the cache row of a given key is open-ended, as all other entries are always closed ranges. All entries in our cache (regardless of the row they belong to) share a common least recently used registry which allows for fast cache eviction of the least recently read entries.
In the example shown in Fig. 8, retrieving the value of key d at timestamp 0 would result in adding the validity range [0; 1) with value v0 to the cache row. This is the worst-case scenario, as the validity range only contains a single timestamp, and can consequently be used to answer queries only on that particular timestamp. Retrieving the same key at timestamps 1 through 4 yields a cache entry with a validity range of [1; 5) and value v1. All requests on key d from timestamp 1 through 4 can be answered by this cache entry. Finally, retrieving key d on a timestamp greater than or equal to 5 produces an open-ended validity period of \([5;\infty )\) with value v2, which can answer all requests on key d with a timestamp larger than 4, assuming that non-depicted columns are empty. If we inserted a key-value pair of \(\langle d, v3\rangle \) at timestamp 10, the write-through operation would need to shorten the last validity period to [5; 10) and add a cache entry containing the period \([10;\infty )\) with value v3.
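A simplified sketch of a single cache row of the Mosaic cache, assuming that Long.MAX_VALUE represents an open-ended validity range; the shared least-recently-used registry and all concurrency concerns are omitted, and the names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of one "Mosaic" cache row: a list of validity periods
// (sorted in descending order of their lower bounds) with the cached value per period.
public final class MosaicCacheRow {

    private static final class Period {
        final long from;   // inclusive lower bound
        long to;           // exclusive upper bound; Long.MAX_VALUE represents an open end
        final byte[] value;
        Period(long from, long to, byte[] value) {
            this.from = from;
            this.to = to;
            this.value = value;
        }
    }

    private final List<Period> periods = new ArrayList<>();

    // Cache hit if some cached period contains the requested timestamp.
    public byte[] get(long timestamp) {
        for (Period p : this.periods) {
            if (p.from <= timestamp && timestamp < p.to) {
                return p.value;
            }
        }
        return null; // cache miss: the caller loads value and validity range from the store
    }

    // Called after a cache miss: store the loaded value together with its validity range.
    public void put(long from, long to, byte[] value) {
        this.periods.add(new Period(from, to, value));
        this.periods.sort((a, b) -> Long.compare(b.from, a.from)); // keep descending order
    }

    // Write-through on a commit at 'timestamp': shorten an open-ended head period (if any)
    // and cache the new value with an open-ended validity range.
    public void writeThrough(long timestamp, byte[] newValue) {
        if (!this.periods.isEmpty() && this.periods.get(0).to == Long.MAX_VALUE) {
            this.periods.get(0).to = timestamp; // [from, infinity) becomes [from, timestamp)
        }
        this.periods.add(0, new Period(timestamp, Long.MAX_VALUE, newValue));
    }
}
```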
Incremental commits
Database vendors often provide specialized ways to batch-insert large amounts of data into their databases that allow for higher performance than the usage of regular transactions. ChronoDB provides a similar mechanism, with the additional challenge of keeping versioning considerations in mind along the way. Even when inserting large amounts of data into ChronoDB, we want the history to remain clean, i.e., it should not contain intermediate states where only a portion of the overall data was inserted. We therefore need to find a way to conserve RAM by writing incoming data to disk while maintaining a clean history. For this purpose, the concept of incremental commits was introduced in ChronoDB. This mechanism allows mass-inserting (or mass-updating) data in ChronoDB by splitting it up into smaller batches while maintaining a clean history and all ACID properties for the executing transaction.
Figure 9 shows how incremental commits work in ChronoDB. The process starts with a regular transaction inserting data into the database before calling commitIncremental(). This writes the first batch (timestamp 2 in Fig. 9) into the database and releases it from RAM. However, the now timestamp is not advanced yet. We do not allow other transactions to read these new entries, because there is still data left to insert. We proceed with the next batches of data, calling commitIncremental() after each one. After the last batch has been inserted, we conclude the process with a call to commit(). This merges all of our changes into one timestamp on disk. In this process, the last change to a single key is the one we keep. In the end, the timestamps from the first incremental commit (exclusive) to the final commit (inclusive) will have no changes (as shown in timestamps 3 and 4 in Fig. 9). With the final commit, we also advance the now timestamp of the matrix and allow all other transactions to access the newly inserted data. By delaying this step until the end of our operation, we keep the possibility to roll back our changes on disk (for example, in case the process fails) without violating the ACID properties for all other transactions. Also, if data generated by a partially complete incremental commit process is present on disk at database start-up (which occurs when the database is unexpectedly shut down during an incremental commit process), these changes can be rolled back as well, which gives incremental commit processes “all or nothing” semantics.
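The following sketch shows how a bulk import could use this mechanism. The commitIncremental() and commit() calls are named above; the transaction interface, the put signature, the rollback call and the batch size are assumptions for illustration only.

```java
// Illustrative usage of incremental commits. The surrounding transaction API is a
// sketched assumption, not the actual ChronoDB interface.
public final class BulkImportExample {

    interface ChronoTransaction {
        void put(String key, Object value);
        void commitIncremental(); // writes the current batch to disk, does not advance 'now'
        void commit();            // merges all batches into one version and publishes it
        void rollback();          // discards all batches written so far
    }

    public static void bulkImport(ChronoTransaction tx,
                                  Iterable<java.util.Map.Entry<String, Object>> entries) {
        final int batchSize = 10_000; // assumed batch size; tune to the available RAM
        int inBatch = 0;
        try {
            for (java.util.Map.Entry<String, Object> entry : entries) {
                tx.put(entry.getKey(), entry.getValue());
                if (++inBatch >= batchSize) {
                    tx.commitIncremental(); // flush this batch to disk and release it from RAM
                    inBatch = 0;
                }
            }
            tx.commit(); // final commit: advances 'now', makes all data visible at one timestamp
        } catch (RuntimeException e) {
            tx.rollback(); // "all or nothing": partially written batches are discarded
            throw e;
        }
    }
}
```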
A disadvantage of this solution is that there can be only one concurrent incremental commit process on any data matrix. This process requires exclusive write access to the matrix, blocking all other (regular and incremental) commits until it is complete. However, since we only modify the head revisions and now does not change until the process ends, we can safely perform read operations in concurrent transactions, while an incremental commit process is taking place. Overall, incremental commits offer a way to insert large quantities of data into a single timestamp while conserving RAM without compromising ACID safety at the cost of requiring exclusive write access to the database for the entire duration of the process. These properties make them very suitable for data imports from external sources, or large scale changes that affect most of the key-value pairs stored in a matrix. This will become an important factor when we consider global model evolutions in the model repository layer [R3]. We envision incremental commits to be employed for administrative tasks which do not recur regularly, or for the initial filling of an empty database.
Supporting long histories
In order to create a sustainable versioning mechanism, we need to ensure that our system can support a virtually unlimited number of versions [R2, R5]. Ideally, we also should not store all data in a single file, and old files should remain untouched when new data is inserted (which is important for file-based backups). For these reasons, we must not constrain our solution to a single B-Tree. The fact that past revisions are immutable in our approach led to the decision to split the data along the time axis, resulting in a series of B-Trees. Each tree is contained in one file, which we refer to as a chunk file. An accompanying meta file specifies the time range which is covered by the chunk file. The usual policy of ChronoDB is to maximize sharing of unchanged data as much as possible. Here, we deliberately introduce data duplication in order to ensure that the initial version in each chunk is complete. This allows us to answer get queries within the boundaries of a single chunk, without having to navigate to the previous one. As each access to another chunk has CPU and I/O overhead, we should avoid accesses on more than one chunk to answer a basic query. Without duplication, accessing a key that has not changed for a long time could potentially lead to a linear search through the chunk chain which contradicts the requirement for scalability [R5].
The algorithm for the “rollover” procedure outlined in Fig. 10 works as follows.
In Line 1 of Algorithm 1, we fetch the latest timestamp where a commit has occurred in our current head revision chunk. Next, we calculate the full head version of the data in Line 2. With the preparation steps complete, we set the end of the validity time range to the last commit timestamp in Line 3. This only affects the metadata, not the chunk itself. We now create a new, empty chunk in Line 4, and set the start of its validity range to the split timestamp plus one (as chunk validity ranges must not overlap). The upper bound of the new validity range is infinity. In Line 5 we copy the head version of the data into the new chunk. Finally, we update our internal look-up table in Line 6. This entire procedure only modifies the last chunk and does not touch older chunks, as indicated by the grayed-out boxes in Fig. 10.
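Since Algorithm 1 itself is not reproduced here, the following Java sketch mirrors the six steps described above. The chunk and factory interfaces are assumptions that stand in for ChronoDB's actual chunk management.

```java
import java.util.Map;
import java.util.NavigableMap;

// Illustrative sketch of the rollover procedure; interfaces are assumed stand-ins.
public final class RolloverSketch {

    interface Chunk {
        long lastCommitTimestamp();               // latest commit in this chunk
        Map<String, byte[]> headVersion();        // full head version of the data
        void setValidToInclusive(long timestamp); // update the chunk's metadata file
        void putAll(Map<String, byte[]> entries); // copy entries into the chunk
    }

    interface ChunkFactory {
        Chunk createEmptyChunk(long validFrom);   // new chunk file, valid [validFrom, infinity)
    }

    public static void rollover(Chunk headChunk, ChunkFactory factory,
                                NavigableMap<Long, Chunk> chunksByLowerBound) {
        long splitTime = headChunk.lastCommitTimestamp();         // Line 1: last commit timestamp
        Map<String, byte[]> head = headChunk.headVersion();       // Line 2: materialize head version
        headChunk.setValidToInclusive(splitTime);                 // Line 3: close old validity range (metadata only)
        Chunk newChunk = factory.createEmptyChunk(splitTime + 1); // Line 4: new chunk; ranges must not overlap
        newChunk.putAll(head);                                    // Line 5: duplicate head version into new chunk
        chunksByLowerBound.put(splitTime + 1, newChunk);          // Line 6: update the time range look-up
    }
}
```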
The look-up table that is being updated in Algorithm 1 is a basic tree map which is created at start-up by reading the metadata files. For each encountered chunk, it contains an entry that maps its validity period to its chunk file. The periods are sorted in ascending order by their lower bounds, which is sufficient because overlaps in the validity ranges are not permitted. For example, after the rollover depicted in Fig. 10, the time range look-up would contain the entries shown in Table 4.
Table 4 Time range look-up [26]
In our implementation, we employ a tree map for the look-up shown in Table 4, because its purpose is to quickly identify the correct chunk to address for an incoming request. Incoming requests have a timestamp attached, and this timestamp may occur exactly at a split, or anywhere between split timestamps. As this process is triggered very often in practice and the time range look-up map may grow quite large over time, we are considering implementing a cache based on the least-recently-used principle that contains the concrete resolved timestamp-to-chunk mappings in order to cover the common case where one particular timestamp is requested more than once in quick succession.
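A minimal sketch of such a look-up, using java.util.TreeMap and its floorEntry method; the class and method names are assumptions about the concrete implementation.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch of the time range look-up: a tree map from the lower bound of
// each chunk's validity period to the chunk (file).
public final class ChunkLookup<C> {

    private final NavigableMap<Long, C> chunksByLowerBound = new TreeMap<>();

    public void register(long validFrom, C chunk) {
        this.chunksByLowerBound.put(validFrom, chunk);
    }

    // Resolve the chunk responsible for a request timestamp: the entry with the greatest
    // lower bound <= timestamp. Validity ranges never overlap, so sorting by lower bound
    // is sufficient, and the look-up is logarithmic in the number of chunks.
    public C resolve(long requestTimestamp) {
        Map.Entry<Long, C> entry = this.chunksByLowerBound.floorEntry(requestTimestamp);
        return entry == null ? null : entry.getValue();
    }
}
```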
With this algorithm, we can support a virtually unlimited number of versions [R6] because new chunks always only contain the head revision of the previous ones, and we are always free to roll over once more should the history within the chunk become too large. We furthermore do not perform writes on old chunk files anymore, because our history is immutable. Regardless, thanks to our time range look-up, we have close to \(\mathcal {O}(\hbox {log}(n))\) access complexity to any chunk, where n is the number of chunks.
This algorithm is a trade-off between disk space and scalability. We introduce data duplication on disk in order to provide support for large histories. The key question that remains is when this process happens. We require a metric that indicates the amount of data in the current chunk that belongs to the history (as opposed to the head revision) and thus can be archived if necessary by performing a rollover. We introduce the Head–History–Ratio (HHR) as the primary metric for this task, which we define as follows:
$$\begin{aligned} HHR(e, h) = {\left\{ \begin{array}{ll} e, &{} \text {if } e = h\\ \frac{h}{e-h}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
where e is the total number of entries in the chunk, and h is the size of the subset of entries that belong to the head revision (excluding entries that represent deletions). By dividing the number of entries in the head revision by the number of entries that belong to the history, we get a proportional notion of how much history is contained in the chunk that works for datasets of any size. It expresses how many entries we will “archive” when a rollover is executed. When new commits add new elements to the head revision, this value increases. When a commit updates existing elements in the head revision or deletes them, this value decreases. We can employ a threshold as a lower bound on this value to determine when a rollover is necessary. For example, we may choose to perform a rollover when a chunk has an HHR value of 0.2 or less. This threshold will work independently of the absolute size of the head revision. The only case where the HHR threshold is never reached is when exclusively new (i.e., never seen before) keys are added, steadily increasing the size of the head revision. However, in this case, we would not gain anything by performing a rollover, as we would have to duplicate all of those entries into the new chunk to produce a complete initial version. Therefore, the HHR metric properly captures this case by never reaching the threshold, thus never indicating the need for a rollover.
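A small sketch of the HHR computation and the rollover check, following the formula above; the 0.2 threshold is the example value from the text.

```java
// Illustrative computation of the Head-History-Ratio (HHR) and the rollover check.
public final class HeadHistoryRatio {

    // e: total number of entries in the chunk; h: entries belonging to the head revision.
    public static double hhr(long e, long h) {
        return e == h ? e : (double) h / (double) (e - h);
    }

    public static boolean needsRollover(long totalEntries, long headEntries, double threshold) {
        return hhr(totalEntries, headEntries) <= threshold;
    }

    public static void main(String[] args) {
        // Example: roll over when the HHR drops to 0.2 or below.
        System.out.println(needsRollover(1200, 200, 0.2)); // 200 / 1000 = 0.2 -> true
        System.out.println(needsRollover(1200, 600, 0.2)); // 600 / 600  = 1.0 -> false
    }
}
```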
Secondary indexing
There are two kinds of secondary indices in ChronoDB. On the one hand, there are indices which are managed by ChronoDB itself (“system indices”), and on the other hand there are user-defined indices. As indicated in Table 3, the primary index for each matrix in ChronoDB has its keys ordered first by user key and then by version. In order to allow for efficient time range queries, we maintain a secondary index that is first ordered by timestamp and then by user key. Further system indices include an index for commit metadata (e.g., commit messages) that maps from timestamp to metadata, as well as auxiliary indices for branching (branch name to metadata).
User-defined indices [R5] help to speed up queries that request entries based on their contents (rather than their primary key). An example for such a query is: find all persons where the first name is ‘Eva’. Since ChronoDB stores arbitrary Java objects, we require a method to extract the desired property value to index from the object. This is accomplished by defining a ChronoIndexer interface. It defines the index(Object) method that, given an input object, returns the value that should be put on the secondary index. Each indexer is associated with a name. That name is later used in a query to refer to this index. The associated query language provides support for a number of string matching techniques (equals, contains, starts with, regular expression...), numeric matching (greater than, less than or equal to...) as well as Boolean operators (and, or, not). The query engine also performs optimizations such as double negation elimination. Overall, this query language is certainly less expressive than other languages such as SQL. Since ChronoDB is intended to be used as a storage engine and embedded in a database frontend (e.g., a graph database), these queries will only be used internally for index scans while more sophisticated expressions are managed by the database frontend. Therefore, this minimalistic Java-embedded DSL has proven to be sufficient.
An essential drawback of this query mechanism is that the number of properties available for querying is determined by the available secondary indices. In other words, if there is no secondary index for a property, that property cannot be used for filtering. This is due to ChronoDB being agnostic to the Java objects it is storing. In the absence of a ChronoIndexer, it has no way of extracting a value for an arbitrary request property from the object. This is a common approach in database systems: without a matching secondary index, queries require a linear scan of the entire data store. When using a database frontend, this distinction is blurred, and the difference between an index query and a non-index query is only noticeable in how long it takes to produce the result set.
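The following sketch illustrates a user-defined indexer. The ChronoIndexer interface and its index(Object) method are named above; the concrete interface declaration, the example Person class, and the commented registration and query calls are assumptions for illustration only.

```java
// Illustrative sketch of a user-defined indexer for the "firstName" index.
// The interface declaration and all usage shown in comments are assumptions,
// not the literal ChronoDB API.
public final class SecondaryIndexExample {

    interface ChronoIndexer {
        String index(Object object); // extract the value to put on the secondary index
    }

    static final class Person {
        final String firstName;
        Person(String firstName) { this.firstName = firstName; }
    }

    // Indexer that extracts the first name from Person objects.
    static final class FirstNameIndexer implements ChronoIndexer {
        @Override
        public String index(Object object) {
            return object instanceof Person ? ((Person) object).firstName : null;
        }
    }

    // Hypothetical usage: register the indexer under a name, then refer to that name
    // in a query, e.g. "find all persons where the first name is 'Eva'":
    //
    //   db.indexManager().addIndexer("firstName", new FirstNameIndexer());
    //   tx.find().where("firstName").isEqualTo("Eva").getKeys();
}
```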
In contrast to the primary index, entries in the secondary index are allowed to have non-unique keys. For example, if we index the “name” attribute, then there may be more than one entry where the name is set to “John”. We therefore require a different approach than the temporal data matrices employed for the primary index. Inspired by the work of Ramaswamy et al. [57], we make use of explicit time windows. Non-unique indices in versioned contexts are special cases of the general interval stabbing problem [31].
Table 5 Secondary indexing in ChronoDB
Table 5 shows an example of a secondary index. As such a table can hold all entries for all indices, we store the index for a particular entry in the “index” column. The branch, keyspace and key columns describe the location of the entry in the primary index. The “value” column contains the value that was extracted by the ChronoIndexer. “From” and “To” express the time window in which a given row is valid. Any entry that is newly inserted into this table initially has its “To” value set to infinity (i.e., it is valid for an unlimited amount of time). When the corresponding entry in the primary index changes, the “To” value is updated accordingly. All other columns are effectively immutable.
In the concrete example shown in Table 5, we insert three key-value pairs (with keys e1, e2 and e3) at timestamp 1234. Our indexer extracts the value for the “name” index, which is “john” for all three values. The “To” column is set to infinity for all three entries. Querying the secondary index at that timestamp for all entries where “name” is equal to “john” would therefore return the set containing e1, e2 and e3. At timestamp 5678, we update the value associated with key e2 such that the indexer now yields the value “jack”. We therefore need to terminate the previous entry (row #2) by setting the “To” value to 5678 (upper bounds are exclusive), and inserting a new entry that starts at 5678, has the value “jack” and an initial “To” value of infinity. Finally, we delete the key e3 in our primary index at timestamp 7890. In our secondary index, this means that we have to limit the “To” value of row #3 to 7890. Since we have no new value due to the deletion, no additional entries need to be added.
This tabular structure can now be queried using well-known techniques also employed by SQL. For typical queries, the branch and index are fixed, the value is constrained by a search string and a condition (e.g., “starts with [jo]”), and we know the timestamp for which the query should be evaluated. We process the timestamp by searching only for entries where
$$\begin{aligned} From \le timestamp < To \end{aligned}$$
in addition to the conditions specified for the other columns. Selecting only the entries for a given branch is more challenging, as we need to traverse the origin branches upwards until we arrive at the master branch, performing one subquery for each branch along the way and merging the intermediate results accordingly.
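A simplified sketch of evaluating such a query for a single branch; the row structure and names are assumptions, and the recursive merge with origin branches is only indicated in a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: filter secondary index rows by index name, branch, value condition
// and the From <= timestamp < To time window. Not the actual ChronoDB query engine.
public final class SecondaryIndexQuerySketch {

    static final class IndexRow {
        String index, branch, keyspace, key, value;
        long from; // inclusive
        long to;   // exclusive; Long.MAX_VALUE represents "infinity"
    }

    public static List<String> query(List<IndexRow> rows, String index, String branch,
                                     long timestamp,
                                     java.util.function.Predicate<String> condition) {
        List<String> matchingKeys = new ArrayList<>();
        for (IndexRow row : rows) {
            boolean inTimeWindow = row.from <= timestamp && timestamp < row.to;
            if (row.index.equals(index) && row.branch.equals(branch)
                    && inTimeWindow && condition.test(row.value)) {
                matchingKeys.add(row.key);
            }
        }
        // For child branches, these results would additionally be merged with a subquery
        // against the origin branch at the branching timestamp (omitted here).
        return matchingKeys;
    }
}
```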
Transaction control
Consistency and reliability are two major goals in ChronoDB. It offers full ACID transactions with the highest possible read isolation level (serializable, see [38]). Figure 11 shows an example with two sequence diagrams with identical transaction schedules. A database server is communicating with an Online Analytics Processing (OLAP [10]) client that owns a long-running transaction (indicated by gray bars). The process involves messages (arrows) sending queries with timestamps and computation times (blocks labeled with “c”) on both machines. A regular Online Transaction Processing (OLTP) client wants to make changes to the data which is analyzed by the OLAP client. The left diagram shows what happens in a non-versioned scenario with pessimistic locking. The server needs to lock the relevant contents of the database for the entire duration of the OLAP transaction, otherwise we risk inconsistencies due to the incoming OLTP update. We need to delay the OLTP client until the OLAP client closes the transaction. Modern databases use optimistic locking and data duplication techniques (e.g., MVCC [6]) to mitigate this issue, but the core problem remains: the server needs to dedicate resources (e.g., locks, RAM...) to client transactions over their entire lifetime.
With versioning, the OLAP client sends the query plus the request timestamp to the server. This is a self-contained request; no additional information or resources are needed on the server, and yet the OLAP client achieves full isolation over the entire duration of the transaction, because it always requests the same timestamp. While the OLAP client is processing the results, the server can safely allow the modifications of the OLTP client, because it is guaranteed that any modification will only append a new version to the history. The data at the timestamp on which the OLAP client is working is immutable. Client-side transactions act as containers for transient change sets and metadata, most notably the timestamp and branch name on which the transaction is working. Security considerations aside, transactions can be created (and disposed) without involving the server.
An important problem that remains is how to handle situations in which two concurrent OLTP transactions attempt to change the same key-value pair. ChronoDB allows selecting from several conflict-handling modes (e.g., reject, last writer wins) or providing a custom conflict resolver implementation.
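As a small illustration of why versioned reads require no server-side transaction state, the following sketch shows a self-contained read request carrying branch, timestamp and key; all names and fields are assumptions.

```java
// Illustrative sketch: a versioned read request is fully self-contained, so the server
// needs no per-transaction bookkeeping to provide serializable read isolation.
public final class VersionedReadRequest {

    public final String branch;  // e.g., "master"; fixed for the client transaction
    public final long timestamp; // fixed for the lifetime of the client transaction
    public final String key;     // the requested key

    public VersionedReadRequest(String branch, long timestamp, String key) {
        this.branch = branch;
        this.timestamp = timestamp;
        this.key = key;
    }

    // Because the data at 'timestamp' is immutable, repeating this request at any later
    // point in time is guaranteed to return the same result (repeatable reads).
}
```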