Robust and Scalable Content-and-Structure Indexing (Extended Version)

Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest, publicly-available source code archive.

Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that combine a value predicate on the content of an attribute and a path predicate on the location of this attribute in the hierarchical structure.
CAS indexes support the efficient processing of CAS queries. There are two important properties that we look for in a CAS index: robustness and scalability. Robustness means that a CAS index optimizes the average query runtime over all possible queries. It ensures that an index can efficiently deal with a wide range of CAS queries. Many existing indexes are not robust since their performance depends on the individual selectivities of a query's path and value predicates. If either the path or value selectivity is high, these indexes produce large intermediate results even if the combined selectivity is low. This happens because existing solutions either build separate indexes for, respectively, content and structure [26] or prioritize one dimension over the other (i.e., content over structure or vice versa) [6,11,42]. Scalability means that even for large datasets an index can be efficiently created and updated, and is not constrained by the size of the available memory. Existing indexes are often not scalable since they rely on in-memory data structures that do not scale to large datasets. For instance, with the memory-based CAS index [43] it is impossible to index datasets larger than 100 GB on a machine with 400 GB main memory.
We propose RSCAS, a robust and scalable CAS index. RSCAS's robustness is rooted in a well-balanced integration of the content and structure of the data in a single index. Its scalability is due to log-structured merge (LSM) trees [33] that combine an in-memory structure for fast insertions with a series of read-only disk-based structures for fast sequential reads and writes.
To achieve robustness we propose to interleave the path and value bytes of composite keys in a balanced manner. A well-known technique to interleave composite keys is the z-order curve [31,34], but applying the z-order curve to paths and values is subtle. Often the query performance is poor because of long common prefixes, varying key lengths, different domain sizes, and data skew. The paths in a hierarchical structure have, by their very nature, long common prefixes, but the first byte following a longest common prefix separates data items. We call such a byte a discriminative byte and propose a dynamic interleaving that interleaves the discriminative bytes of paths and values alternatingly. This leads to a well-balanced partitioning of the data with a robust query performance. We use the dynamic interleaving to define the RSCAS index for semi-structured hierarchical data. The RSCAS index is trie-based and efficiently supports the basic search methods for CAS queries: range searches and prefix searches. Range searches enable value predicates that are expressed as a value range, and prefix searches support path predicates that contain wildcards and descendant axes. Crucially, tries in combination with dynamically interleaved keys allow us to efficiently evaluate path and value predicates simultaneously.
To scale the RSCAS index to large datasets and support efficient insertions, we use LSM trees [33] that combine an in-memory RSCAS trie with a series of disk-resident RSCAS tries whose size is doubling in each step. RSCAS currently supports only insertions since our main use case, indexing an append-only archive, does not require updates or deletes. The in-memory trie is based on the Adaptive Radix Tree (ART) [21], which is a memory-optimized trie structure that supports efficient insertions. Whenever the in-memory RSCAS trie reaches its maximum capacity, we create a new disk-based trie. Since disk-based RSCAS tries are immutable, we store them compactly on disk and leave no gaps between nodes. We develop a partitioning-based bulk-loading algorithm that builds RSCAS on disk while, at the same time, dynamically interleaving the keys. This algorithm works well with limited memory but scales nicely with the amount of memory to reduce the disk I/O during bulk-loading.
Main contributions:
- We develop a dynamic interleaving to interleave paths and values in an alternating way using the concept of discriminative bytes. We show how to compute this interleaving by a hierarchical partitioning of the data. We prove that our dynamic interleaving is robust against varying selectivities (Section 5).
- We propose the trie-based Robust and Scalable Content-and-Structure (RSCAS) index for semi-structured hierarchical data. Dynamically interleaved keys give RSCAS its robustness. Its scalability is rooted in LSM trees that combine a memory-optimized trie for fast in-place insertions with a series of disk-optimized tries (Section 6).
- We propose efficient algorithms for querying, inserting, bulk-loading, and merging RSCAS tries. A combination of range and prefix searches is used to evaluate CAS queries on the trie-based structure of RSCAS. Insertions are performed on the in-memory trie using lazy restructuring. Bulk-loading creates large disk-optimized tries in the background. Merging is applied when the in-memory trie overflows to combine it with a series of disk-resident tries (Section 7).
- We conduct an experimental evaluation with three real-world and one synthetic dataset. One of the real-world datasets is Software Heritage (SWH) [2], the world's largest archive of publicly-available source code. Our experiments show that RSCAS delivers robust query performance with up to two orders of magnitude improvements over existing approaches, while offering comparable bulk-loading and insertion performance (Section 8).

Application Scenario
As a practical use case we deploy a large-scale CAS index for Software Heritage (SWH) [13], the largest public archive of software source code and its development history. At its core, Software Heritage archives version control systems (VCSs), storing all recorded source code artifacts in a giant, globally deduplicated Merkle structure [28] that stores elements from many different VCSs using cryptographic hashes as keys. VCSs record the evolution of source code trees over time, an aspect that is reflected in the data model of Software Heritage [35]. The data model supports the archiving of artifacts, such as file blobs (byte sequences, corresponding to tree leaves), source code directories (inner nodes, pointing to sub-directories and files, giving them local path names), commits (called revisions in this context), releases (commits annotated with memorable names such as "1.0"), and VCS repository snapshots. Nodes in the data model are associated with properties that are relevant for querying. Examples of node properties are: cryptographic node identifiers, as well as commit and release metadata such as authors, log messages, timestamps, etc.
Revisions are a key piece of software development workflows. Each of them, except the very first one in a given repository, is connected to the previous "parent" revision, or possibly multiple parents in case of merge commits. These connections allow the computation of the popular diff representations of commits that show how and which files have been changed in any given revision. Computing diffs for all revisions in the archive makes it possible to look up all revisions that have changed files of interest.
Several aspects make Software Heritage a relevant and challenging use case for CAS indexing. First, the size of the archive is significant: at the time of writing, the archive consists of about 20 billion nodes (the total file size is about 1 PiB, but we will not index within files, so this measure is less relevant). Second, the archive grows constantly by continuously crawling public data sources such as collaborative development platforms (e.g., GitHub, GitLab), Linux distributions (e.g., Debian, NixOS), and package manager repositories (e.g., PyPI, NPM). The archive growth rate is also significant: the amount of archived source code artifacts grows exponentially over time, doubling every 2 to 3 years [38], which calls for an incremental indexing approach to avoid indexing lag. For instance, during 2020 alone the archive ingested about 600 million new revisions and 3 billion new file blobs (i.e., file contents never seen before).
Last but not least, without the CAS queries proposed in this paper, the current querying capabilities for the archive are quite limited. Entire software repositories can be looked up by full-text search on their URLs, providing entry points into the archive. From there, users can browse the archive, reaching the desired revisions (e.g., the most recent revision in the master branch since the last time a repository was crawled) and, from there, the corresponding source code trees. It is not possible to query the "diff", i.e., find revisions that modified certain files in a certain time period, which is limiting for both user-facing and research-oriented queries (e.g., in the field of empirical software engineering).
With the approach proposed in this paper we offer functionality to answer CAS queries like the following: Find all revisions from June 2021 that modify a C file located in a folder whose name begins with "ext".
This query consists of two predicates. First, a content predicate on the revision time, which is a range predicate that matches all revisions from the first to the last day of June 2021. Second, a structure predicate on the paths of the files that were touched by a revision. We are only interested in revisions that modify files with .c extension and that are located in a certain directory. This path predicate can be expressed as /**/ext*/*.c, with the wildcard ** matching folders that are nested arbitrarily deeply in the filesystem of a repository and the wildcard * matching all characters in a directory or file name.

Related Work
Two CAS indexing techniques have been investigated in related work: (a) creating separate indexes for content and structure, and (b) combining content and structure in one index. We call these two techniques separate CAS indexing and combined CAS indexing, respectively.
Separate CAS indexing creates dedicated indexes for, respectively, the content and the structure of the data. Mathis et al. [26] use a B+ tree to index the content and a structural summary (i.e., a DataGuide [17]) to index the structure of the data. The DataGuide maps each unique path to a numeric identifier, called the path class reference (PCR), and the B+ tree stores the values along with their PCRs. Thus, the B+ tree stores (value, ⟨nodeId, PCR⟩) tuples in its leaf nodes, where nodeId points to a node whose content is value and whose path is given by PCR. To answer a CAS query we must look at the path index and the value index independently. The subsequent join on the PCR is slow if intermediate results are large. Mathis et al. assume that there are few unique paths and the path index is small (fewer than 1000 unique paths in their experiments). Kaushik et al. [20] present an approach that combines a 1-index [29] to evaluate path predicates with a B+ tree to evaluate value predicates, but they do not consider updates.
A popular system that implements separate indexing is Apache Lucene [1], which is a scalable and widely deployed indexing and search system that underpins Apache Solr and Elasticsearch. Lucene uses different index types depending on the type of the indexed attributes. For CAS indexing, we represent paths as strings and values as numbers. Lucene indexes strings with finite state transducers (FSTs), which are automata that map strings to lists of sorted document IDs (called postings lists). Numeric attributes are indexed in a Bkd-tree [36], which is a disk-optimized kd-tree. Lucene answers conjunctive queries, like CAS queries, by evaluating each predicate on the appropriate index. The indexes return sorted postings lists that must be intersected to see if a document matches all predicates of a conjunctive query. Since the lists are sorted, the intersection can be performed efficiently. However, the independent evaluation of the predicates may yield large intermediate results, making the approach non-robust. To scale to large datasets, Lucene implements techniques that are similar to LSM trees [33] (cf. Section 6). Lucene batches insertions in memory before flushing them as read-only segments to disk. As the number of segments grows, Lucene continuously compacts them by merging small segments into a larger segment.
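The intersection of sorted postings lists described above can be sketched as follows. This is an illustrative merge-based intersection with made-up document IDs, not Lucene's actual implementation (which uses iterators and skip lists):

```python
# Sketch of how a system like Lucene intersects sorted postings lists to
# answer a conjunctive (CAS-like) query.

def intersect(a, b):
    """Intersect two ascending lists of document IDs in O(len(a) + len(b))."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # document matches both predicates
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Documents matching the path predicate vs. the value predicate.
path_hits  = [3, 7, 12, 15, 21, 40]
value_hits = [7, 9, 15, 22, 40, 51]
print(intersect(path_hits, value_hits))  # [7, 15, 40]
```

Note that even though the intersection itself is fast, both input lists must be fully materialized first, which is exactly the large-intermediate-result problem discussed next.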
The problem with separate CAS indexing is that it is not robust. If at least one predicate of a CAS query is not selective, separate indexing approaches generate large intermediate results. This is inefficient if the final result is small. Since the predicates are evaluated on different indexes, we cannot use the more selective predicate to prune the search space.
Combined CAS indexing integrates paths and values in one index. A well-known and mature technology are composite indexes, which are used, e.g., in relational databases to index keys that consist of more than one attribute. Composite indexes concatenate the indexed attributes according to a specified ordering. In CAS indexing, there are two possible orderings of the paths and values: the PV-ordering orders the paths before the values, while the VP-ordering orders the values first. The ordering determines what queries a composite index can evaluate efficiently. Composite indexes are only efficient for queries that have a small selectivity for the attribute appearing first. In our experiments we use the composite B+ tree of Postgres as the reference point for an efficient and scalable implementation of composite indexes.
IndexFabric [11] is another example of a composite CAS index. It uses a PV-ordering, concatenating the (shortened) paths and values of composite keys, and storing them in a disk-optimized PATRICIA trie [30]. IndexFabric shortens the paths to save disk space by mapping long node labels to short strings (e.g., mapping label 'extension' to 'e'). During query evaluation IndexFabric must first fully evaluate the path predicate before it can look at the value predicate since it orders paths before values in the index. Since it uses shortened paths, it cannot evaluate wildcards within a node label (e.g., ext* to match extension, exterior, etc.). IndexFabric does not support bulk-loading.
The problem with composite indexes is that they prioritize the dimension appearing first. The selectivity of the predicate in the first dimension determines the query performance. If it is high and the other selectivity is low, the composite index performs badly because the first predicate must be fully evaluated before the second predicate can be evaluated. As a result, a composite index is not robust.
Instead of concatenating dimensions, it is possible to interleave dimensions. The z-order curve [31,34], for example, is obtained by interleaving the binary representation of the individual dimensions and is used in UB-trees [37] and k-d tries [32,34,39]. Unfortunately, the z-order curve deteriorates to the performance of a composite index if the data contains long common prefixes [43]. This is the case in CAS indexing where paths have long common prefixes. The problem with common prefixes is that they are the same for all data items and do not prune the search space during a search. Interleaving a common prefix in one dimension with a non-common prefix in the other dimension means we prune keys in one dimension but not the other [25].
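The classic bit-level z-order (Morton) interleaving behind UB-trees and k-d tries can be sketched as follows; the widths and inputs are illustrative:

```python
# Bit-level z-order (Morton) interleaving of two fixed-width integers.
# If one dimension's bits come from a long common prefix (identical across
# all keys), those interleaved bits cannot prune the search space.

def morton(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y; x contributes the higher bit of each pair."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)   # bit i of x -> bit 2i+1 of z
        z |= ((y >> i) & 1) << (2 * i)       # bit i of y -> bit 2i   of z
    return z

print(bin(morton(0b1111, 0b0000)))  # 0b10101010
print(bin(morton(0b0000, 0b1111)))  # 0b1010101
```

With variable-length, skewed byte strings such as paths, this static scheme interleaves many non-discriminating prefix bytes, which is the deterioration described above.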
LSM trees [33] are used to create scalable indexing systems with high write throughput (see, e.g., AsterixDB [5], BigTable [10], Dynamo [12], etc.). They turn expensive in-place updates that cause many random disk I/Os into out-of-place updates that use sequential writes. To achieve this, LSM trees combine a small in-memory tree R_0^M with a series of disk-resident trees R_0, R_1, ..., each tree being T times larger than the tree on the previous level. Insertions are performed exclusively in the main-memory tree R_0^M. Modern LSM tree implementations, see [23] for an excellent recent survey, use sorted string tables (SSTables) or other immutable data structures at multiple levels. Generally, there are two merge policies: leveling and tiering. With the leveling merge policy, each level i contains exactly one structure; when the structure at level i grows too big, it is merged with the structure at level i + 1. A structure on level i + 1 is T times larger than a structure on level i. Tiering maintains multiple structures per level: when a level i fills up with T structures, they are merged into a structure on level i + 1. We discuss the design decisions regarding LSM trees and RSCAS in Section 6.2.
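The leveling merge policy can be illustrated with a toy simulation; the buffer capacity M, size ratio T, and sorted-list "runs" are made-up simplifications, not the structures used by RSCAS:

```python
# Toy LSM tree with the leveling merge policy: an in-memory buffer of
# capacity M, and disk levels where level i holds one sorted run of up to
# M * T**(i+1) entries.

def lsm_insert(levels, buffer, key, M=4, T=2):
    """Insert key; flush the buffer as a sorted run when it reaches M entries."""
    buffer.append(key)
    if len(buffer) < M:
        return
    run = sorted(buffer)                  # out-of-place: flush a sorted run
    buffer.clear()
    for i in range(len(levels) + 1):
        if i == len(levels):
            levels.append(run)            # open a new, deeper level
            return
        run = sorted(levels[i] + run)     # leveling: merge into the one run
        levels[i] = []
        if len(run) <= M * T ** (i + 1):  # run fits level i's capacity
            levels[i] = run
            return

levels, buf = [], []
for k in range(12):
    lsm_insert(levels, buf, k)
print(levels)  # [[], [0, 1, 2, ..., 11]]: level 0 overflowed into level 1
```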
An LSM tree requires an efficient bulk-loading algorithm to create a disk-based RSCAS trie when the in-memory trie overflows. Sort-based algorithms sort the data and build an index bottom-up. Buffer-tree methods bulk-load a tree by buffering insertions in nodes and flushing them in batches to their children when a buffer overflows. Neither sort- nor buffer-based techniques [7,3,8] can be used for RSCAS because our dynamic interleaving must look at all keys to correctly interleave them. We develop a partitioning-based bulk-loading algorithm for RSCAS that alternatingly partitions the data in the path and value dimensions to dynamically interleave paths and values.
The combination of the dynamic interleaving with wildcards and range queries makes it hard to embed RSCAS into an LSM-tree-based key-value (KV) store. While early, simple KV-stores did not support range queries at all, more recent KV-stores create Bloom filters for a predefined set of fixed prefixes [27], i.e., only range queries using these prefixes can be answered efficiently. SuRF was one of the first approaches able to handle arbitrary range queries by storing minimum-length prefixes in a trie so that all keys can be uniquely identified [45]. This was followed by Rosetta, which stores all prefixes for each key in a hierarchical series of Bloom filters [24]. KV-stores supporting range queries without filters have also been developed. EvenDB optimizes the evaluation of queries exhibiting spatial locality, i.e., keys with the same prefixes are kept close together and in main memory [16]. REMIX offers a globally sorted view of all keys with a logical sorting of the data [46]. The evaluation of range queries boils down to seeking the first matching element in a sorted sequence of keys and scanning to the end of the range. CAS queries follow a different pattern. During query evaluation, we simultaneously process a range query in the value dimension and match strings with wildcards at arbitrary positions in the path dimension. The prefix shared by the matching keys ends at the first wildcard, which can occur early in the path. We prune queries with wildcards by regularly switching back to the more selective value dimension.

Data Representation
We use composite keys to represent the paths and values of data items in semi-structured hierarchical data.
Definition 1 (Composite Key) A composite key k is a two-dimensional key that consists of a path k.P and a value k.V, and each key stores a reference k.R as payload that points to the corresponding data item in the database.
Example 1 Composite keys can be extracted from popular semi-structured hierarchical data formats, such as JSON and XML. In the context of SWH we use composite keys k to represent that a file with path k.P is modified (i.e., added, changed, or deleted) at time k.V in revision k.R. Table 1 shows the set K_1..9 = {k_1, ..., k_9} of composite keys (we use a sans-serif font to refer to concrete instances in our examples). We write K_2,5,6,7 to refer to {k_2, k_5, k_6, k_7}. Composite key k_2 denotes that the file /crypto/ecc.h$ was modified on 2019-07-20 in revision r_2. In the same revision, file /crypto/ecc.c$ is also modified, see key k_3. ◻
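A composite key can be sketched as a small immutable record. This is a minimal illustration; the end-of-string byte and the big-endian 64-bit timestamp follow the byte-string representation described below, and the concrete timestamp encoding is an assumption for the example:

```python
import struct
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class CompositeKey:
    P: bytes   # path, terminated by the end-of-string byte 0x00
    V: bytes   # value, here a big-endian 64-bit modification timestamp
    R: str     # reference (payload) to the data item, e.g. a revision id

# Key k_2: file /crypto/ecc.h modified on 2019-07-20 in revision r_2.
ts = int(datetime(2019, 7, 20, tzinfo=timezone.utc).timestamp())
k2 = CompositeKey(b"/crypto/ecc.h\x00", struct.pack(">Q", ts), "r2")
```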

We represent paths and values as byte strings that we access byte-wise. We visualize them with one-byte ASCII characters for the path dimension and italic hexadecimal numbers for the value dimension, see Table 1. To guarantee that no path is a prefix of another we append the end-of-string character $ (ASCII code 0x00) to each path. Fixed-length byte strings (e.g., 64-bit numbers) are prefix-free because of their fixed length. We assume that the path and value dimensions are binary-comparable, i.e., two paths or values are <, =, or > iff their corresponding byte strings are <, =, or >, respectively [21]. For example, big-endian integers are binary-comparable while little-endian integers are not.
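Binary comparability and prefix-freeness can be checked directly; a small sketch using Python's struct module:

```python
import struct

# Big-endian encodings are binary-comparable: their byte-wise order agrees
# with the numeric order. Little-endian encodings are not.
a, b = 1, 256
assert (struct.pack(">Q", a) < struct.pack(">Q", b)) == (a < b)  # agrees
assert (struct.pack("<Q", a) < struct.pack("<Q", b)) != (a < b)  # disagrees

# Appending the end-of-string byte 0x00 ('$') makes paths prefix-free:
# "/crypto" is a prefix of "/crypto/ecc.c", but not after termination.
assert b"/crypto/ecc.c".startswith(b"/crypto")
assert not b"/crypto/ecc.c\x00".startswith(b"/crypto\x00")
```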
Let s be a byte string. |s| denotes the length of s and s[i] denotes the i-th byte in s; the left-most byte of a byte string is byte one. s[i] = ε is the empty string if i > |s|. s[i, j] denotes the substring of s from position i to j.

Definition 2 (Longest Common Prefix) The longest common prefix lcp(K, D) of a set of keys K in dimension D is the longest prefix s that all keys k ∈ K share in dimension D, i.e., the longest byte string s such that s = k.D[1, |s|] for all k ∈ K.

Content-and-Structure (CAS) Queries
Content-and-structure (CAS) queries contain a path predicate and a value predicate [26]. The path predicate is expressed as a query path q that supports two wildcard symbols: the descendant axis ** matches zero to any number of node labels, while the wildcard * matches zero to any number of characters in a single label.
Definition 3 (Query Path) A query path q is denoted by q = /λ_1/λ_2/.../λ_m. Each label λ_i is a string λ_i ∈ (A ∪ {*})^+, where A is an alphabet and * is a reserved wildcard symbol. The wildcard * matches zero to any number of characters in a label. We call λ_i = ** the descendant axis; it matches zero to any number of labels.
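The semantics of the two wildcards can be illustrated by translating a query path into a regular expression. This is only a sketch of the matching semantics, not the paper's trie-based evaluation algorithm, and the sample paths are made up:

```python
import re

# '**' matches zero or more whole labels; '*' matches zero or more
# characters within a single label (labels never contain '/').

def query_path_to_regex(q: str) -> "re.Pattern[str]":
    regex = ""
    for label in q.strip("/").split("/"):
        if label == "**":
            regex += "(?:/[^/]+)*"                              # descendant axis
        else:
            regex += "/" + re.escape(label).replace(r"\*", "[^/]*")
    return re.compile("^" + regex + "$")

pattern = query_path_to_regex("/**/ext*/*.c")
assert pattern.match("/fs/ext4/inode.c")
assert pattern.match("/ext4/inode.c")        # '**' may match zero labels
assert not pattern.match("/fs/ext4/inode.h")
```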

Interleaving of Composite Keys
We integrate path k.P and value k.V of a key k by interleaving them. Table 2 shows three common ways to integrate k.P and k.V of key k_9 from Table 1. Value bytes are written in italic and shown in red, path bytes are shown in blue. The first two rows show the path-value and value-path concatenations (I_PV and I_VP), respectively. The byte-wise interleaving I_BW in the third row interleaves one value byte with one path byte. Note that none of these interleavings is well-balanced. Even the byte-wise interleaving is unbalanced, since all value bytes are interleaved with a single label of the path (/Sources).

Table 2: Key k_9 is interleaved using different approaches.
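The three static integrations from Table 2 can be sketched as follows; the example key is made up:

```python
# Illustrative versions of path-value concatenation (I_PV), value-path
# concatenation (I_VP), and byte-wise interleaving (I_BW).

def i_pv(p: bytes, v: bytes) -> bytes:
    return p + v

def i_vp(p: bytes, v: bytes) -> bytes:
    return v + p

def i_bw(p: bytes, v: bytes) -> bytes:
    out = bytearray()
    for i in range(max(len(p), len(v))):
        if i < len(v):
            out.append(v[i])   # one value byte ...
        if i < len(p):
            out.append(p[i])   # ... followed by one path byte
    return bytes(out)

p, v = b"/Sources/read.c\x00", b"\x00\x00\x01\x71\x5b\x5a\x7d\xc0"
# All eight value bytes are consumed while i_bw is still inside the first
# path label, so the byte-wise interleaving is not well-balanced.
```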

Theoretical Foundation -Dynamic Interleaving
We propose the dynamic interleaving to interleave the paths and values of a set of composite keys K, and show how to build the dynamic interleaving through a recursive partitioning that groups keys based on the shortest prefixes that distinguish keys from one another. We introduce the partitioning in Section 5.1 and highlight in Section 5.2 the properties that we use to construct the interleaving. In Section 5.3 we define the dynamic interleaving with a recursive partitioning and develop a cost model in Section 5.4 to analyze the efficiency of interleavings. The dynamic interleaving adapts to the specific characteristics of paths and values, such as common prefixes, varying key lengths, differing domain sizes, and the skew of the data. To achieve this we consider the discriminative bytes.
Definition 5 (Discriminative Byte) The discriminative byte dsc(K, D) of keys K in dimension D is the first byte at which the keys differ in dimension D, i.e., dsc(K, D) = |lcp(K, D)| + 1.

Table 3 illustrates the position of the discriminative bytes in the path and value dimensions for various sets of composite keys K. Set K_9 = {k_9} contains only a single key. In this case, the discriminative bytes are the first position after the end of k_9's byte strings in the respective dimensions. For example, k_9's value is eight bytes long, hence the discriminative value byte of {k_9} is the ninth byte. ◻

Discriminative bytes are crucial during query evaluation since at their positions the search space can be narrowed down. Our interleaving alternates between discriminative path and value bytes in a round-robin fashion. Each discriminative byte partitions a set of keys into subsets, which we recursively partition further.
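Definitions 2 and 5 can be sketched for a single dimension as follows; the sample paths are in the style of Table 1 but chosen for illustration:

```python
# Longest common prefix of a set of byte strings, and the discriminative
# byte: the first position (1-based, as in the paper) where the keys differ.

def lcp(keys):
    prefix = keys[0]
    for k in keys[1:]:
        i = 0
        while i < min(len(prefix), len(k)) and prefix[i] == k[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

def dsc(keys):
    return len(lcp(keys)) + 1

paths = [b"/crypto/ecc.c\x00", b"/crypto/ecc.h\x00", b"/crypto/Kconfig\x00"]
assert lcp(paths) == b"/crypto/"
assert dsc(paths) == 9   # byte 9 ('e' vs 'K') separates the keys
```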

ψ-Partitioning
The ψ-partitioning of a set of keys K in dimension D groups composite keys together that have the same value at the discriminative byte in dimension D. Thus, K is split into at most 2^8 = 256 non-empty partitions, one partition for each possible value (0x00 to 0xFF) of the discriminative byte in dimension D.
Definition 6 (ψ-Partitioning) The ψ-partitioning of a set of keys K in dimension D is ψ(K, D) = {K_1, ..., K_m} iff:
1. (Correctness) All keys in a set K_i have the same value at K's discriminative byte in dimension D.
2. (Maximality) Keys in different sets K_i and K_j, i ≠ j, have different values at K's discriminative byte in dimension D.
3. (Completeness) Every key in K is assigned to a set K_i, and all K_i are non-empty.
Let k ∈ K be a composite key. We write ψ k (K, D) to denote the ψ-partitioning of k with respect to K and dimension D, i.e., the partition in ψ(K, D) that contains key k.
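The ψ-partitioning can be sketched for a single dimension as follows. This is a simplified illustration (byte strings only, no second dimension or payload), with hypothetical keys:

```python
from collections import defaultdict

def lcp_len(keys):
    """0-based index of the discriminative byte (length of the lcp)."""
    i = 0
    while all(len(k) > i for k in keys) and len({k[i] for k in keys}) == 1:
        i += 1
    return i

def psi_partition(keys):
    """Group keys by their value at the discriminative byte (Definition 6)."""
    d = lcp_len(keys)
    groups = defaultdict(list)
    for k in keys:
        groups[k[d]].append(k)   # same value at the discriminative byte
    return list(groups.values())

parts = psi_partition([b"/crypto/ecc.c", b"/crypto/ecc.h", b"/crypto/cts.c"])
# Byte 9 ('e' vs 'c') splits the set into {ecc.c, ecc.h} and {cts.c}.
assert sorted(len(p) for p in parts) == [1, 2]
```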

Properties of the ψ-Partitioning
We work out four key properties of the ψ-partitioning. The first two properties, order-preserving and prefix-preserving, allow us to evaluate CAS queries efficiently while the other two properties, guaranteed progress and monotonicity, help us to construct the dynamic interleaving.
Lemma 1 (Order-Preserving) The ψ-partitioning ψ(K, D) = {K_1, ..., K_m} is order-preserving in dimension D, i.e., all keys in a set K_i are either strictly greater or strictly smaller in dimension D than all keys from any other set K_j, j ≠ i.

All proofs can be found in Appendix A.
The next two properties allow us to efficiently compute the dynamic interleaving of composite keys. Guaranteed progress ensures that each step partitions the data: when we repeatedly apply ψ(K, D), we eventually narrow a set of keys down to a single key. For each set of keys that ψ(K, D) creates, the position of the discriminative byte in dimension D increases. This property of the ψ-partitioning holds since each set of keys is built based on the discriminative byte, and to ψ-partition such a set further we need a discriminative byte that is positioned further down in the byte string. For the alternate dimension D̄ (D̄ = P if D = V, and D̄ = V if D = P), the position of the discriminative byte remains unchanged or increases.
Lemma 4 (Monotonicity of Discriminative Bytes) Let K_i be one of the partitions of K after ψ-partitioning K in dimension D. In dimension D, the position of the discriminative byte in K_i is strictly greater than in K, i.e., dsc(K_i, D) > dsc(K, D). In the alternate dimension D̄, the discriminative byte in K_i is equal to or greater than in K, i.e., dsc(K_i, D̄) ≥ dsc(K, D̄).

Example 8 The discriminative path byte of K_1..9 is 2 while the discriminative value byte of K_1..9 is 5, as shown in Table 3. For partition K_1,4,8,9, which is obtained by partitioning K_1..9 in the value dimension, the discriminative path byte is 10 while the discriminative value byte is 7. For partition K_4,8,9, which is obtained by partitioning K_1,4,8,9 in the path dimension, the discriminative path byte is 14 while the discriminative value byte is still 7. ◻

Monotonicity guarantees that each time we ψ-partition a set K we advance the discriminative byte in at least one dimension. Thus, we make progress in at least one dimension when we dynamically interleave a set of keys.
These four properties of the ψ-partitioning are true because we partition K at its discriminative byte. If we partitioned the data before this byte, we would not make progress and the monotonicity would be violated, because every byte before the discriminative byte is part of the longest common prefix. If we partitioned the data after the discriminative byte, the partitioning would no longer be order- and prefix-preserving. Skipping some keys by sampling the set is not an option, as this could lead to an (incorrect) partitioning using a byte located after the actual discriminative byte.
Example 9 K_1..9's discriminative value byte is byte five. If we partitioned K_1..9 at value byte four, we would get {K_1..9} and make no progress since all keys have 0x00 at value byte four; the discriminative path and value bytes would remain unchanged. If we partitioned K_1..9 at value byte six, we would get {K_1,4,8,9, K_2,3,6,7, K_5}, which is neither order- nor prefix-preserving in V. Consider keys k_3, k_6 ∈ K_2,3,6,7 and k_5 ∈ K_5. The partitioning is not order-preserving in V. The partitioning is not prefix-preserving in V since the longest common value prefix in K_2,3,6,7 is 00 00 00 00, which is not longer than the longest common value prefix of keys from different partitions: lcp(K_2,3,6,7 ∪ K_5, V) = 00 00 00 00. ◻

Dynamic Interleaving
To compute the dynamic interleaving of a composite key k ∈ K we recursively ψ-partition K while alternating between dimensions V and P. In each step, we interleave a part of k.P with a part of k.V. The recursive ψ-partitioning yields a partitioning sequence of the sets to which k belongs. It continues until we reach a set K_n that contains at most τ keys, where τ is a threshold (explained later). The recursive ψ-partitioning alternates between dimensions V and P until we run out of discriminative bytes in one dimension, which means ψ_k(K_i, D) = K_i. From then on, we can only ψ-partition in the alternate dimension D̄ until we run out of discriminative bytes in this dimension as well, that is ψ_k(K_i, D) = ψ_k(K_i, D̄) = K_i, or until we reach a K_n that contains at most τ keys. The partitioning sequence is finite due to the monotonicity of the ψ-partitioning (see Lemma 4), which guarantees that we make progress in at least one dimension in each step.

Definition 7 (Partitioning Sequence)
The partitioning sequence ρ(k, K, D) = ((K_1, D_1), ..., (K_n, D_n)) of a composite key k ∈ K is the recursive ψ-partitioning of the sets to which k belongs. The pair (K_i, D_i) denotes the partitioning of K_i in dimension D_i. The partitioning stops when K_n contains at most τ keys or when K_n cannot be further ψ-partitioned in any dimension (K_n.D = ⊥ in this case). ρ(k, K, D) is defined in Figure 1.
Example 10 Below we illustrate the step-by-step expansion of ρ(k_9, K_1..9, V) to obtain k_9's partitioning sequence. We set τ = 2. Note the alternating partitioning in V and P; we only deviate from this alternation if partitioning in one of the dimensions is not possible. Had we set τ = 1, K_8,9 would be partitioned once more in the path dimension. ◻

To compute the full dynamic interleaving of a key k we set τ = 1 and continue until the final set K_n contains a single key (i.e., key k). To interleave only a prefix of k and keep a suffix non-interleaved we increase τ. Increasing τ stops the partitioning earlier and speeds up the computation. An index structure that uses dynamic interleaving can tune τ to trade off the time it takes to build the index against the time it takes to query it. In Section 6 we introduce a memory-optimized and a disk-optimized version of our RSCAS index; they use different values of τ to adapt to the underlying storage.
We determine the dynamic interleaving I DY (k, K) of a key k ∈ K via k's partitioning sequence ρ. For each element in ρ, we generate a tuple with strings s P and s V and the partitioning dimension of the element. The strings s P and s V are composed of substrings of k.P and k.V , ranging from the previous discriminative byte up to, but excluding, the current discriminative byte in the respective dimension. The order of s P and s V in a tuple depends on the dimension used in the previous step: the dimension that has been chosen for the partitioning comes first. Formally, this is defined as follows:

Definition 8 (Dynamic Interleaving) Let k ∈ K be a composite key and let ρ(k, K, V ) = ((K 1 , D 1 ), . . . , (K n , D n )) be the partitioning sequence of k. The dynamic interleaving I DY (k, K) = (t 1 , . . . , t n , t n+1 ) of k is a sequence of tuples with t i = (s P , s V , D i ) if D i−1 = P and t i = (s V , s P , D i ) if D i−1 = V , where s P = k.P [dsc(K i−1 , P ) .. dsc(K i , P ) − 1] and s V = k.V [dsc(K i−1 , V ) .. dsc(K i , V ) − 1]. To correctly handle the first tuple we define dsc(K 0 , V ) = 1, dsc(K 0 , P ) = 1 and D 0 = V . The last tuple t n+1 = (s P , s V , R) stores the non-interleaved suffixes s P = k.P [dsc(K n , P ) .. |k.P |] and s V = k.V [dsc(K n , V ) .. |k.V |] along with revision k.R.

Example 11 We compute the tuples for the dynamic interleaving I DY (k 9 , K 1..9 ) = (t 1 , . . . , t 5 ) of key k 9 using the partitioning sequence ρ(k 9 , K 1..9 , V ) from Example 10. The necessary discriminative path and value bytes can be found in Table 3. Table 4 shows the details of each tuple of k 9 's dynamic interleaving with respect to K 1..9 . The final dynamic interleavings of all keys from Table 1 are displayed in Table 5. We highlight in bold the values of the discriminative bytes at which the paths and values are interleaved, e.g., for key k 9 these are bytes 5D, S, and 97. ◻

Unlike static interleavings I(k) that interleave a key k in isolation, the dynamic interleaving I DY (k, K) of k depends on the set of all keys K to adapt to the data. The result is a well-balanced interleaving (compare Tables 2 and 5).
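The construction above can be sketched in a few lines of Python. This is illustrative only: keys are modeled as plain (path, value) byte-string pairs, dsc() is a simplified "first differing byte" helper, and the tuple order is fixed to (s P , s V , D) instead of depending on the previous dimension.

```python
# Illustrative sketch of dynamic interleaving (not the paper's implementation).
# A key is a (path, value) pair of byte strings; dim 0 = path P, dim 1 = value V.

def dsc(keys, dim):
    """First byte position at which the keys differ in dimension dim;
    the minimum length if they do not differ (keys assumed prefix-free)."""
    strings = [k[dim] for k in keys]
    n = min(len(s) for s in strings)
    for i in range(n):
        if any(s[i] != strings[0][i] for s in strings):
            return i
    return n

def psi_partition(keys, dim):
    """Group keys by their value at the discriminative byte in dim."""
    d = dsc(keys, dim)
    groups = {}
    for k in keys:
        groups.setdefault(k[dim][d], []).append(k)
    return list(groups.values())

def interleave(key, keys, tau=1):
    """Alternate psi-partitioning in V and P, emitting one tuple per step;
    the final tuple holds the non-interleaved suffixes."""
    tuples, prev, dim = [], [0, 0], 1          # start in the value dimension
    while len(keys) > tau:
        if len(psi_partition(keys, dim)) == 1:  # out of discriminative bytes
            dim = 1 - dim
            if len(psi_partition(keys, dim)) == 1:
                break
        cur = [dsc(keys, 0), dsc(keys, 1)]
        tuples.append((key[0][prev[0]:cur[0]],  # path substring
                       key[1][prev[1]:cur[1]],  # value substring
                       'P' if dim == 0 else 'V'))
        keys = next(g for g in psi_partition(keys, dim) if key in g)
        prev, dim = cur, 1 - dim
    tuples.append((key[0][prev[0]:], key[1][prev[1]:], 'suffix'))
    return tuples
```

Concatenating the path (or value) components of the emitted tuples reconstructs the original path (or value), mirroring how a dynamically interleaved key is spread over a root-to-leaf path later on.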
In Section 7 we propose efficient algorithms to dynamically interleave composite keys and analyze them for different key distributions.

Efficiency of Interleavings
We propose a cost model to measure the efficiency of interleavings that organize the interleaved keys in a tree-like search structure. Each node represents the ψ-partitioning of the composite keys by either path or value, and the node branches on each distinct value of a discriminative path or value byte. We simplify the cost model by assuming that the search structure is a complete tree with fanout o where every root-to-leaf path contains h edges (h is the height). Further, we assume that all nodes on one level represent a partitioning in the same dimension φ i ∈ {P, V } and we use a vector φ = (φ 1 , . . . , φ h ) to specify the partitioning dimension on each level. We assume that the numbers of P s and V s in φ are equal. Figure 2 visualizes this scheme.
To answer a query we start at the root and traverse the search structure to determine the answer set. In the case of range queries, more than one branch must be followed. A search follows a fraction of the o outgoing branches originating at a node. We call this fraction the selectivity of a node (or just selectivity). We assume that every path node has a selectivity of ς P and every value node has a selectivity of ς V . The cost Ĉ of a search, measured in the number of nodes visited during the search, is Ĉ(φ) = Σ i=1..h Π j=1..i (o · ς φ j ), i.e., the expected number of nodes reached on each level, summed over all levels below the root.

If a workload is well-known and consists of a small set of specific queries, it is highly likely that an index adapted to this workload will outperform RSCAS. For instance, if ς V ≪ ς P for all queries, then a VP-index shows better performance than an RSCAS index. However, it performs badly for queries deviating from that workload (ς V > ς P ). Our goal is an access method that can deal with a wide range of queries in a dynamic environment in a robust way, i.e., avoiding a bad performance for any particular query type. This is motivated by the fact that modern data analytics utilizes a large number of ad-hoc queries to do exploratory analysis. For example, in the context of building a robust partitioning for ad-hoc query workloads, Shanbhag et al. [41] found that after analyzing the first 80% of a real-world workload the remaining 20% still contained 57% completely new queries. We aim for a good average performance across all queries.
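The cost model can be made concrete with a small Python sketch. The formula mirrors the node-counting argument above; the concrete numbers (o = 16, h = 6, the selectivity pair) are our own, chosen only for illustration. Exhaustively enumerating all 2^h dimension vectors shows that the perfectly alternating vector minimizes the combined cost of a query and its complementary query:

```python
# Sketch of the cost model: a complete tree with fanout o and height h,
# where level i partitions in dimension phi[i] and the search follows a
# fraction sel[phi[i]] of the o branches at every visited node.
from itertools import product

def cost(phi, sel_p, sel_v, o=16):
    """Estimated visited nodes: root plus, per level, the product of
    (o * selectivity) over the dimensions on the path so far."""
    sel = {'P': sel_p, 'V': sel_v}
    visited, width = 1.0, 1.0
    for d in phi:
        width *= o * sel[d]
        visited += width
    return visited

def complementary_cost(phi, sel_p, sel_v):
    """Cost of a query plus the cost of its complementary query."""
    return cost(phi, sel_p, sel_v) + cost(phi, sel_v, sel_p)

h, q = 6, (0.01, 0.5)                  # illustrative height and selectivities
alternating = ('V', 'P') * (h // 2)
assert all(complementary_cost(alternating, *q)
           <= complementary_cost(phi, *q) + 1e-9
           for phi in product('PV', repeat=h))
```

The intuition matches Theorem 1 below: a level-i term is a product of i per-level factors, and summing a query with its complement makes that term smallest when the counts of P s and V s in every prefix of φ are as balanced as possible.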

Definition 9 (Robustness) A CAS-index is robust if it optimizes the average performance and minimizes the variability over all queries.
State-of-the-art CAS-indexes are not robust because they favor either path or value predicates. As a result they show a very good performance for one type of query but run into problems for other types of queries. To illustrate this problem we define the notion of complementary queries.
Definition 10 (Complementary Query) Given a query Q = (ς P , ς V ) with path selectivity ς P and value selectivity ς V , there is a complementary query Q ′ = (ς ′ P , ς ′ V ) with path selectivity ς ′ P = ς V and value selectivity ς ′ V = ς P . ◻

Table 5: The dynamic interleaving of the composite keys in K 1..9 . The values at the discriminative bytes are written in bold.
Theorem 1 Consider a query Q with selectivities ς P and ς V and its complementary query Q ′ with selectivities ς ′ P = ς V and ς ′ V = ς P . There is no interleaving that on average performs better than the dynamic interleaving with a perfectly alternating vector φ DY , i.e., ∀φ: Ĉ(Q, φ DY ) + Ĉ(Q ′ , φ DY ) ≤ Ĉ(Q, φ) + Ĉ(Q ′ , φ).

Theorem 1 shows that the dynamic interleaving has the best query performance for complementary queries. It follows that for any set of complementary queries Q, the dynamic interleaving has the best performance.
Theorem 2 There is no interleaving φ that in total performs better than the dynamic interleaving φ DY over all queries Q, i.e., ∀φ: Σ Q∈Q Ĉ(Q, φ DY ) ≤ Σ Q∈Q Ĉ(Q, φ).

This also holds for the set of all queries, since for every query there exists a complementary query. Thus, the dynamic interleaving optimizes the average performance over all queries and, as a result, a CAS index that uses dynamic interleaving is robust.
We now turn to the variability of the search costs and show that they are minimal for dynamic interleavings.
Theorem 3 Given a query Q (with ς P and ς V ) and its complementary query Q ′ (with ς ′ P = ς V and ς ′ V = ς P ), there is no interleaving that has a smaller variability than the dynamic interleaving with a perfectly alternating vector φ DY .

Similar to the results for the average performance, Theorem 3 can be generalized to the set of all queries.
Note that in practice the search structure is not a complete tree and the fractions ς P and ς V of children that are traversed at each node are not constant. We previously evaluated the cost model experimentally on real-world datasets [43] and showed that the estimated and true cost of a query are off by a factor of two on average, which is a good estimate for the cost of a query.

Figure 4: The RSCAS trie for the composite keys K 1..9 .

Robust and Scalable CAS (RSCAS) Index
Data-intensive applications require indexing techniques that make it possible to efficiently index, insert, and query large amounts of data. The SWH archive, for example, stores billions of revisions, and every day millions of revisions are crawled from popular software forges. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to support querying and updating the content and structure of big hierarchical data. For robustness, the RSCAS index uses our dynamic interleaving to integrate the paths and values of composite keys in a trie structure. For scalability, RSCAS is implemented as a log-structured merge (LSM) tree that combines a memory-optimized trie with a series of disk-optimized tries (see Figure 5).

Structure of an RSCAS Trie
RSCAS tries support CAS queries with range and prefix searches. Each node n in an RSCAS trie includes a dimension n.D, a path substring n.s P , and a value substring n.s V . They correspond to fields t.D, t.s P and t.s V in the dynamic interleaving of a key (see Definition 8). Substrings n.s P and n.s V are variable-length strings. Dimension n.D is P or V for inner nodes and ⊥ for leaf nodes. Leaf nodes additionally store a set of suffixes, denoted by n.suffixes. This set contains non-interleaved path and value suffixes along with references to data items in the database. Each dynamically interleaved key corresponds to a root-to-leaf path in the RSCAS trie.
Definition 11 (RSCAS Trie) Let K be a set of composite keys and let R be a trie. Trie R is the RSCAS trie for K iff the following conditions are satisfied.
1. A tuple sequence (t 1 , . . . , t m , t m+1 ) is the dynamic interleaving of a key k ∈ K iff there is a root-to-leaf path (n 1 , . . . , n m ) in R such that t i .D = n i .D, t i .s P = n i .s P , and t i .s V = n i .s V for 1 ≤ i ≤ m. Suffix t m+1 is stored in leaf node n m , i.e., t m+1 ∈ n m .suffixes.
2. R does not include duplicate siblings, i.e., no two sibling nodes n and n ′ , n ≠ n ′ , in R have the same values for s P , s V , and D, respectively.

Figure 4 shows the RSCAS trie for keys K 1..9 . The values at the discriminative bytes are highlighted in bold.

Example 13
The dynamic interleaving I DY (k 9 , K 1..9 ) from Table 5 is mapped to the root-to-leaf path (n 1 , n 2 , n 4 , n 6 ) in the RSCAS trie. Tuple t 5 is stored in n 6 .suffixes. Key k 8 is stored in the same root-to-leaf path. For key k 1 , the first two tuples of I DY (k 1 , K 1..9 ) are mapped to n 1 and n 2 , respectively, while the third tuple is mapped to n 3 . ◻

RSCAS Index
The RSCAS index combines a memory-optimized RSCAS trie for in-place insertions with a sequence of disk-based RSCAS tries for out-of-place insertions to get good insertion performance for large data-intensive applications. LSM trees [33,36] have pioneered combining memory- and disk-resident components, and are now the de-facto standard for building scalable index structures (see, e.g., [5,10,12]). We implement RSCAS as an LSM trie that fixes the size ratio between two consecutive tries at T = 2 and uses the leveling merge policy with full merges (this combination is also known as the logarithmic method in [36]). Leveling optimizes query performance and space utilization in comparison to the tiering merge policy at the expense of a higher merge cost [22,23]. Luo and Carey show that a size ratio of T = 2 achieves the maximum write throughput for leveling, but may have a negative impact on latency [22]. Since query performance and space utilization are important to us, while latency does not play a large role (due to batched updates in the background), we choose this setup. If needed, the LSM trie can be improved with the techniques presented by Luo and Carey [22,23]. One such improvement is partitioned merging, where multiple tries with non-overlapping key ranges can exist at the same level; when a trie overflows at level i, it only needs to be merged with the overlapping tries at level i + 1. Partitioned merges reduce the I/O during merging since not all data at level i needs to be merged into level i + 1.
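The merge pattern of the logarithmic method with T = 2 can be sketched with a toy simulation. Sorted lists stand in for tries, and the capacity M = 4 is arbitrary; only the overflow-and-merge pattern matters here.

```python
class LsmIndex:
    """Toy simulation of the logarithmic method (size ratio T = 2,
    leveling with full merges). Sorted lists stand in for tries."""

    def __init__(self, m):
        self.m = m               # in-memory capacity
        self.memory = []         # stand-in for the in-memory trie
        self.levels = []         # levels[i] holds 0 or m * 2**i keys

    def insert(self, key):
        self.memory.append(key)
        if len(self.memory) >= self.m:
            self.flush()

    def flush(self):
        i = 0                    # find the first empty disk level R_i
        while i < len(self.levels) and self.levels[i]:
            i += 1
        if i == len(self.levels):
            self.levels.append([])
        # collect all keys from memory and levels 0..i-1, bulk-load level i
        keys = self.memory + [k for lvl in self.levels[:i] for k in lvl]
        self.levels[i] = sorted(keys)
        for j in range(i):       # delete the previous tries
            self.levels[j] = []
        self.memory = []
```

The level occupancy behaves like a binary counter of flushes: with m = 4, after 16 insertions (4 flushes) levels 0 and 1 are empty and level 2 holds all 16 keys.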
Our focus is to show how to integrate a CAS index with LSM trees. We do not address aspects related to recovery and multi-user synchronization. These challenges, however, exist and must be handled by the system. Typical KV-stores use write-ahead logging (WAL) to make their system recoverable and multi-version concurrency control (MVCC) to provide concurrency. These techniques are also applicable to the RSCAS index.
The in-memory RSCAS trie R M 0 is combined with a sequence of disk-based RSCAS tries R 0 , . . . , R k that grow in size as illustrated in Figure 5. The most recently inserted keys are accumulated in the in-memory RSCAS trie R M 0 where insertions can be performed efficiently. When R M 0 grows too big, the keys are migrated to a disk-based RSCAS trie R i . A query is executed on each trie individually and the result sets are combined. We only consider insertions since deletions do not occur in the SWH archive. When R M 0 is full, we look for the first disk-based trie R i that is empty. We (a) collect all keys in tries R M 0 and R j , 0 ≤ j < i, (b) bulk-load trie R i from these keys, and (c) delete all previous tries.
Example 14 Assume we set the number of keys that fit in memory to M = 10 million, which is the number of new keys that arrive every day in the SWH archive, on average. When R M 0 overflows after one day we redirect incoming insertions to a new in-memory trie and look for the first non-empty trie R i . Assuming this is R 0 , the disk-resident trie R 0 is bulk-loaded with the keys in R M 0 . After another day, R M 0 overflows again and this time the first non-empty trie is R 1 . Trie R 1 is created from the keys in R M 0 and R 0 . At the end R 1 contains 20M keys, and R M 0 and R 0 are deleted. ◻ An overflow in R M 0 does not stall continuous indexing since we immediately redirect all incoming insertions to a new in-memory trie R M ′ 0 while we bulk-load R i in the background. In order for this to work, R M 0 cannot allocate all of the available memory. We need to reserve a sufficient amount of memory for R M ′ 0 (in the SWH archive scenario we allowed R M 0 to take up at most half of the memory). During bulk-loading we keep the old tries R M 0 and R 0 , . . . , R i−1 around such that queries have access to all indexed data. As soon as R i is complete, we replace R M 0 with R M ′ 0 and R 0 , . . . , R i−1 with R i . In practice neither insertions nor queries stall as long as the insertion rate is bounded. If the insertion rate is too high and R M ′ 0 overflows before we finish bulk-loading R i , we block and do not accept more insertions. This does not happen in the SWH archive since with our default of M = 10^8 keys (about 8 GB memory) trie R M ′ 0 overflows every ten days and bulk-loading the trie on our biggest dataset takes about four hours.

Storage Layout
The RSCAS index consists of a mutable in-memory trie R M 0 and a series of immutable disk-based tries R i . For R M 0 we use a node structure that is easy to update in-place, while we design R i for compact storage on disk.

Memory-Optimized RSCAS Trie
The memory-optimized RSCAS trie R M 0 provides fast in-place insertions for a small number of composite keys that fit into memory. Since all insertions are buffered in R M 0 before they are migrated in bulk to disk, R M 0 is in the critical path of our indexing pipeline and must support efficient insertions. We reuse the memory-optimized trie [43] that is based on the memory-optimized Adaptive Radix Tree (ART) [21]. ART implements four node types that are optimized for the hardware's memory hierarchy and that have a physical fanout of 4, 16, 48, and 256 child pointers, respectively. A node uses the smallest node type that can accommodate the node's child pointers. Insertions add node pointers and when a node becomes too big, the node is resized to the next appropriate node type. This ensures that not every insertion requires resizing, e.g., a node with ten children can sustain six deletions or seven insertions before it is resized. Figure 6 illustrates the node type with 256 child pointers; for the remaining node types we refer to Leis et al. [21]. The node header stores the dimension D, the lengths l P and l V of substrings s P and s V , and the number of children m.
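The node-type selection can be sketched as follows. The capacities are ART's four node types; the helper names are ours, for illustration only.

```python
# Sketch of ART-style adaptive node sizing: a node uses the smallest of the
# four node types (4, 16, 48, 256 child slots) that fits its children, and
# is only resized when an insertion overflows its current type.

NODE_TYPES = (4, 16, 48, 256)

def node_type_for(m):
    """Smallest node type that can hold m child pointers (1 <= m <= 256)."""
    for cap in NODE_TYPES:
        if m <= cap:
            return cap
    raise ValueError("a discriminative byte has only 256 possible values")

def grow_if_needed(current_cap, m):
    """On insertion, resize only when the current type overflows."""
    return node_type_for(m) if m > current_cap else current_cap
```

For instance, a node with ten children lives in the 16-slot type; six further insertions still fit, and only the seventh forces a resize to the 48-slot type.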
Substrings s P and s V are implemented as variable-length byte vectors. The remaining space of an inner node (beige-colored in Figure 6) is reserved for child pointers. For each possible value b of the discriminative byte there is a pointer (possibly NULL) to the subtree where all keys have value b at the discriminative byte in dimension D. The structure of leaf nodes is similar, except that leaf nodes contain a variable-length vector with references k.R instead of child pointers.
For the memory-optimized RSCAS trie we set the partitioning threshold τ = 1 meaning that R M 0 dynamically interleaves keys completely. This provides fast and fine-grained access to the indexed keys.

Disk-Optimized RSCAS Trie
We propose a disk-resident RSCAS trie to compactly store dynamically-interleaved keys on disk. Since a disk-resident RSCAS trie is immutable, we optimize it for compact storage. To that end we store nodes gapless on disk and we increase the granularity of leaf nodes by setting τ > 1. We look at these techniques in turn.

We store nodes gapless on disk since we do not have to reserve space for future in-place insertions. This means a node can cross page boundaries, but we found that in practice this is not a problem. We tested various node clustering techniques to align nodes to disk pages. The most compact node clustering algorithm [19] produced a trie that was 30% larger than with gapless storage, as it kept space empty on a page if it could not add another node without exceeding the page size. Besides being simpler to implement and more compact, the gapless storage yields better query performance because less data needs to be read from disk.

In addition to the gapless storage, we increase the granularity of leaf nodes by setting τ > 1. As a result the RSCAS index contains fewer nodes but the size of leaf nodes increases. We found that by storing fewer but bigger nodes we save space because we store less meta-data like node headers, child pointers, etc. In Section 8.4.1 we determine the optimal value for τ .

Figure 7 shows how to compactly serialize nodes on disk. Inner nodes point to other nodes, while leaf nodes store a set of suffixes. Both node types store the same four-byte header that encodes dimension D ∈ {P, V, ⊥}, the lengths l P and l V of the substrings s P and s V , and a number m. For inner nodes m denotes the number of children, while for leaf nodes it denotes the number of suffixes. Next we store substrings s P and s V (exactly l P and l V bytes long, respectively).
After the header, inner nodes store m pairs (b i , ptr i ), where b i (1 byte long) is the value at the discriminative byte that is used to descend to this child node and ptr i (6 bytes long) is the position of this child in the trie file. Leaf nodes, instead, store m suffixes; for each suffix we record substrings s P and s V along with their lengths and the revision r (a 20-byte SHA-1 hash).
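The gapless inner-node layout can be sketched with Python's struct module. The exact bit layout of the four-byte header is not specified above, so this sketch assumes one byte each for D, l P , l V , and m (illustrative only; real substring lengths and m = 256 would need a different packing).

```python
# Illustrative serialization of a gapless inner node: 4-byte header,
# then s_P and s_V, then m (byte, 6-byte file offset) child entries.
import struct

DIMS = {'P': 0, 'V': 1, 'LEAF': 2}   # assumed encoding of D in {P, V, bottom}

def serialize_inner(dim, s_p, s_v, children):
    """children: list of (byte_value, file_offset) pairs."""
    out = struct.pack('<BBBB', DIMS[dim], len(s_p), len(s_v), len(children))
    out += s_p + s_v                          # variable-length substrings
    for b, ptr in children:                   # (b_i, ptr_i) pairs
        out += struct.pack('<B', b) + ptr.to_bytes(6, 'little')
    return out

def deserialize_header(buf):
    d, lp, lv, m = struct.unpack_from('<BBBB', buf, 0)
    s_p = buf[4:4 + lp]
    s_v = buf[4 + lp:4 + lp + lv]
    return d, s_p, s_v, m
```

Because every field length is either fixed or recorded in the header, nodes can be written back to back without padding, which is what makes the gapless layout possible.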

Algorithms
We propose algorithms for querying, inserting, bulk-loading, and merging RSCAS tries. Queries are executed independently on all in-memory and disk-based RSCAS tries and the results are combined. Insertions are directed at the in-memory RSCAS trie alone. Merging is used whenever the in-memory RSCAS trie overflows and applies bulk-loading to create a large disk-optimized RSCAS trie.

Querying RSCAS
We traverse an RSCAS trie in pre-order to evaluate a CAS query, skipping subtrees that cannot match the query. Starting at the root node, we traverse the trie and evaluate at each node part of the query's path and value predicate. Evaluating a predicate on a node returns MATCH if the full predicate has been matched, MISMATCH if it has become clear that no node in the current node's subtree can match the predicate, and INCOMPLETE if we need more information. In case of a MISMATCH, we can safely skip the entire subtree. If both predicates return MATCH, we collect all revisions r in the leaf nodes of this subtree. Otherwise, we traverse the trie further to reach a decision.

Query Algorithm
Algorithm 1 shows the pseudocode for evaluating a CAS query on an RSCAS trie. It takes the following parameters: the current node n (initially the root node of the trie), a query path q, and a range [v l , v h ] for the value predicate. Furthermore, we need two buffers buff P and buff V (initially empty) that hold, respectively, all path and value bytes from the root to the current node n. Finally, we require state information s to evaluate the path and value predicates (we provide details as we go along) and an answer set W to collect the results.
First, we update buff V and buff P by adding the information in s V and s P of the current node n (line 1).
For inner nodes, we match the query predicates against the current node. MatchValue computes the longest common prefix between buff V and v l and between buff V and v h . The position of the first byte at which buff V and v l differ is lo and the position of the first byte at which buff V and v h differ is hi. If buff V [lo] < v l [lo], we know that the node's value lies below the lower bound, hence we return MISMATCH. If buff V [hi] > v h [hi], the node's value lies outside of the upper bound and we return MISMATCH as well. If buff V contains a complete value (e.g., all eight bytes of a 64-bit integer) and v l ≤ buff V ≤ v h , we know that all values in the subtree rooted at n match and we return MATCH. In all other cases we cannot make a decision yet and return INCOMPLETE. The values of lo and hi are kept in the state to avoid recomputing the longest common prefixes from scratch for each node; instead we resume the search from the previous values of lo and hi.
Function MatchPath matches the query path q against the current path prefix buff P . It supports symbols * and ** to match any number of characters in a node label, respectively any number of node labels in a path. As long as we do not encounter any wildcards in the query path q, we directly compare (a prefix of) q with the current content of buff P byte by byte. As soon as a byte does not match, we return MISMATCH. If we successfully match the complete query path q against a complete path in buff P (both terminated by $), we return MATCH. Otherwise, we return INCOMPLETE. When we encounter wildcard * in q, we match it successfully to the corresponding label in buff P and continue with the next label. A wildcard * itself will not cause a mismatch (unless we try to match it against the terminator $), so we either return MATCH if it is the final label in q and buff P or INCOMPLETE. Matching the descendant-axis ** is more complicated. We store in state s the current position where we are in buff P and continue matching the label after ** in q. If at any point we find a mismatch, we backtrack to the next path separator after the noted position, thus skipping a label in buff P and restarting the search from there. Once buff P contains a complete path, we can make a decision between MATCH or MISMATCH.
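The wildcard semantics of MatchPath can be sketched as follows. This is a deliberately simplified version: unlike the incremental algorithm described above, it only reaches a MATCH/MISMATCH verdict once the buffered path is complete (terminated by $), and answers INCOMPLETE before that; the label-wise recursion for ** stands in for the backtracking over path separators.

```python
# Simplified sketch of MatchPath semantics: '*' matches exactly one label,
# '**' matches any number of labels (the descendant axis).

MATCH, MISMATCH, INCOMPLETE = 'MATCH', 'MISMATCH', 'INCOMPLETE'

def labels(path):
    """Split '/usr/src/f.c$' into ['usr', 'src', 'f.c']."""
    return path.rstrip('$').strip('/').split('/')

def match_labels(q, p):
    if not q:
        return not p
    if q[0] == '**':                 # descendant axis: skip 0..n labels
        return any(match_labels(q[1:], p[i:]) for i in range(len(p) + 1))
    if not p:
        return False
    if q[0] == '*' or q[0] == p[0]:  # '*' matches one whole label
        return match_labels(q[1:], p[1:])
    return False

def match_path(query, buffered):
    if not buffered.endswith('$'):   # path still incomplete at this node
        return INCOMPLETE
    return MATCH if match_labels(labels(query), labels(buffered)) else MISMATCH
```

The real algorithm decides earlier (byte by byte, and MISMATCH as soon as a non-wildcard byte diverges); the sketch only captures which complete paths a query path with * and ** accepts.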
The algorithm continues by checking the outcomes of the value and path matching (line 5). If one of the outcomes is MISMATCH, we stop the search since no descendant can match the query. Otherwise, we continue with the matching children of n (lines 6-8). Finding the matching children follows the same logic as described above for MatchValue and MatchPath. If node n.D = P and we have seen a descendant axis in the query path, all children of the current node match.
As soon as we reach a leaf node, we iterate over each suffix t in the leaf to check if it matches the query using the same functions as explained above (lines 10-14). If the current buffers indeed match the query, we add the reference t.R to the result set.
- Starting at n 1 , we update buff V to 00 00 00 00 and buff P to /. MatchValue matches four value bytes and returns INCOMPLETE. MatchPath matches one path byte and also returns INCOMPLETE. Since both functions return INCOMPLETE, we have to traverse all matching children. Because n 1 is a value node, we look for all matching children whose value for the discriminative value byte is between 5E and 5F. Nodes n 7 and n 8 satisfy this condition.
- Node n 7 is a leaf. We iterate over each suffix (there are two) and update the buffers accordingly. For the first suffix with path substring 3/inode.c$ we find that MatchPath and MatchValue both return MATCH. Hence, revision r 4 is added to W . The next suffix matches the value predicate but not the path predicate and is therefore discarded.
- Next we look at node n 8 . We find that v l [5] < buff V [5] < v h [5], thus all values of n 8 's descendants are within the bounds v l and v h , and MatchValue returns MATCH. Since n 8 .s P is the empty string, MatchPath still returns INCOMPLETE and we descend further. According to the second byte in the query path, q[2] = f, we must match letter f, hence we descend to node n 10 , where both predicates match. Therefore, revision r 6 is added to W .

Updating Memory-Based RSCAS Trie
All insertions are performed in the in-memory RSCAS trie R M 0 where they can be executed efficiently. Inserting a new key into R M 0 usually changes the position of the discriminative bytes, which means that the dynamic interleaving of all keys located in the affected node's subtree is invalidated.

Example 17
We insert the key k 10 = (/crypto/rsa.c$, 00 00 00 00 5F 83 B9 AC, r 8 ) into the RSCAS trie in Figure 4. First we traverse the trie starting from root n 1 . Since n 1 's substrings completely match k 10 's path and value we traverse to child n 8 . In n 8 there is a mismatch in the value dimension: k 10 's sixth byte is 83 while for node n 8 the corresponding byte is BD. This invalidates the dynamic interleaving of keys K 2,3,7 in n 8 's subtree. ◻

Lazy Restructuring
If we want to preserve the dynamic interleaving, we need to re-compute the dynamic interleaving of all affected keys, which is expensive. Instead, we relax the dynamic interleaving using lazy restructuring [44]. Lazy restructuring resolves the mismatch by adding exactly two new nodes, n par and n sib , to RSCAS instead of restructuring large parts of the trie. The basic idea is to add a new intermediate node n par between node n, where the mismatch happened, and n's new sibling node n sib that represents the newly inserted key. We put all bytes leading up to the position of the mismatch into n par , and all bytes starting from this position remain in nodes n and n sib . After that, we insert node n par between node n and its previous parent node n p .

Example 18 Figure 8 shows the rightmost subtree of Figure 4 after it is lazily restructured when k 10 is inserted. Two new nodes are created, parent n par = n ′ 8 and sibling n sib = n ′′ 8 . Additionally, n 8 .s V is updated. ◻

Lazy restructuring is efficient: it adds exactly two new nodes to R M 0 , thus the main cost is traversing the trie. However, while efficient, lazy restructuring introduces small irregularities that are limited to the dynamic interleaving of the keys in the subtree where the mismatch occurred. These irregularities do not affect the correctness of CAS queries, but they slowly separate (rather than interleave) paths and values if insertions repeatedly force the algorithm to split the same subtree in the same dimension. Since R M 0 is memory-based and small in comparison to the disk-based tries, the overall effect on query performance is negligible.
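The split mechanics can be sketched in one dimension with a toy Node class (the real nodes carry both s P and s V plus the dimension D; class and variable names here are ours):

```python
# Sketch of lazy restructuring: on a mismatch at byte position pos inside
# node n's substring, a new parent n_par takes the common prefix, n keeps
# the diverging rest, and a new sibling n_sib holds the new key's remainder.

class Node:
    def __init__(self, s, children=None, keys=None):
        self.s = s                        # substring stored in this node
        self.children = children or {}    # first byte of child.s -> child
        self.keys = keys or []

def lazy_restructure(parent, edge, n, pos, new_suffix, key):
    """Exactly two nodes are added; the rest of the trie is untouched."""
    n_par = Node(n.s[:pos])               # common prefix moves up
    n_sib = Node(new_suffix, keys=[key])  # remainder of the inserted key
    n.s = n.s[pos:]                       # n keeps its diverging rest
    n_par.children = {n.s[0]: n, n_sib.s[0]: n_sib}
    parent.children[edge] = n_par         # splice n_par in place of n
    return n_par
```

Mirroring Example 17's mismatch (sixth value byte 83 versus BD), splitting a node with substring 5F BD 9A at position 1 yields a parent holding 5F with two children whose substrings start with BD and 83, respectively.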

Example 19
After inserting k 10 , root node n 1 and its new child n ′ 8 both ψ-partition the data in the value dimension, violating the strictly alternating property of the dynamic interleaving, see Figure 8. ◻

Inserting Keys with Lazy Restructuring
Algorithm 2 inserts a key k in R M 0 rooted at node n. If R M 0 is empty (i.e., n is NIL) we create a new root node in lines 1-3. Otherwise, we traverse the trie to k's insertion position. We compare the key's path and value with the current node's path and value by keeping track of positions g P , g V , i P , i V in strings k.P, k.V, n.s P , n.s V , respectively (lines 8-11). As long as the substrings at their corresponding positions coincide we descend. If we completely matched key k, we have reached a leaf node and add k.R to the current node's suffixes (lines 12-14). If during the traversal we cannot find the next node to descend to, the key has a new value at a discriminative byte that did not exist before in the data. We create a new leaf node and set its substrings s P and s V to the still unmatched bytes in k.P and k.V , respectively (lines 20-22). If we find a mismatch between the key and the current node in at least one dimension, we lazily restructure the trie (lines 15-17).

Bulk-Loading a Disk-Based RSCAS Trie
We create and bulk-load a new disk-based RSCAS trie whenever the in-memory trie R M 0 overflows. The bulk-loading algorithm constructs RSCAS while, at the same time, dynamically interleaving a set of keys. Bulk-loading RSCAS is difficult because all keys must be considered together to dynamically interleave them. The bulk-loading algorithm starts with all non-interleaved keys in the root partition. We use partitions during bulk-loading to temporarily store keys along with their discriminative bytes. Once a partition has been processed, it is deleted.

Algorithm 3: LazyRestructuring(k, n, n p , g P , g V , i P , i V )
Definition 12 (Partition) A partition L = (g P , g V , size, ptr) stores a set K of composite keys. g P = dsc(K, P ) and g V = dsc(K, V ) denote the discriminative path and value byte, respectively. size = |K| denotes the number of keys in the partition. L is either memory-resident or disk-resident, and ptr points to the keys in memory or on disk. ◻

Example 20 Root partition L 1..9 = (2, 5, 9, •) in Figure 9a stores keys K 1..9 from Table 1. The longest common prefixes of L 1..9 are type-set in bold-face. The first bytes after these prefixes are L 1..9 's discriminative bytes g P = 2 and g V = 5. We use placeholder • for pointer ptr; we describe later how to decide if partitions are stored on disk or in memory. ◻

Bulk-loading starts with root partition L and breaks it into smaller partitions using the ψ-partitioning until a partition contains at most τ keys. The ψ-partitioning ψ(L, D) groups keys together that have the same prefix in dimension D, and returns a partition table where each entry in this table points to a new partition L i . We apply ψ alternatingly in dimensions V and P to interleave the keys at their discriminative bytes. In each call, the algorithm adds a new node to RSCAS with L's longest common path and value prefixes.

Example 21 Figure 9 shows how the RSCAS trie from Figure 4 is built. In Figure 9b we extract L 1..9 's longest common path and value prefixes and store them in the new root node n 1 . Then, we ψ-partition L 1..9 in dimension V and obtain a partition table (light green) that points to three new partitions: L 1,4,8,9 , L 5,6 , and L 2,3,7 . We drop L 1..9 's longest common prefixes from these new partitions. We proceed recursively with L 1,4,8,9 . In Figure 9c we create node n 2 as before and this time we ψ-partition in dimension P and obtain two new partitions. Given τ = 2, L 1 is not partitioned further, but in the next recursive step, L 4,8,9 would be partitioned one last time in dimension V . ◻
To avoid scanning L twice (first to compute the discriminative bytes; second to compute ψ(L, D)) we make the ψ-partitioning proactive by exploiting that ψ(L, D) is applied hierarchically. This means we pre-compute the discriminative bytes of every new partition L i ∈ ψ(L, D) as we ψ-partition L. As a result, by the time L i itself is ψ-partitioned, we already know its discriminative bytes and can directly compute the partitioning. Algorithm 6 in Section 7.4 shows how to compute the root partition's discriminative bytes; the discriminative bytes of all subsequent partitions are computed proactively during the partitioning itself. This halves the number of scans over the data during bulk-loading.
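The proactive computation can be sketched as follows, with keys as flat byte strings in a single dimension. The reference-key idea is as described above; the function and variable names are ours, and keys are assumed prefix-free (e.g., $-terminated), so the length of the first key is a valid upper bound for the child's discriminative byte.

```python
# Sketch of proactive psi-partitioning: while distributing keys by their
# value at the discriminative byte g, simultaneously compute each child
# partition's own discriminative byte by comparing against a reference key.

def proactive_partition(keys, g):
    """keys: prefix-free byte strings; g: this partition's discriminative
    byte position. Returns {byte_value: [child_keys, child_g]} in one scan."""
    table = {}
    for k in keys:
        b = k[g]
        if b not in table:
            table[b] = [[k], len(k)]      # first key: child_g upper-bounded
        else:
            child, child_g = table[b]
            ref = child[0]                # reference key for this child
            i = g + 1                     # bytes up to g are equal by construction
            while i < min(child_g, len(k), len(ref)) and k[i] == ref[i]:
                i += 1
            child.append(k)
            table[b][1] = min(child_g, i) # shrink toward the first diff position
    return table
```

Comparing only against a fixed reference key suffices: if two non-reference keys first differ at position p, at least one of them differs from the reference key at a position no larger than p, so the minimum over reference comparisons equals the set's discriminative byte.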

Bulk-Loading Algorithm
The bulk-loading algorithm (Algorithm 4) takes three parameters: a partition L (initially the root partition), the partitioning dimension D (initially dimension V ), and the position in the trie file where the next node is written to (initially 0). Each invocation adds a node n to the RSCAS trie and returns the position in the trie file of the first byte after the subtree rooted in n. Lines 1-3 create node n and set its longest common prefixes n.s P and n.s V , which are extracted from a key k ∈ L from the first byte up to, but excluding, the positions of L's discriminative bytes L.g P and L.g V . If the number of keys in the current partition exceeds the partitioning threshold τ and L can be ψ-partitioned, we break L up further. In lines 5-6 we check if we can indeed ψ-partition L in D and switch to the alternate dimension D̄ otherwise. In line 8 we apply ψ(L, D) and obtain a partition table T , which is a 2^8-long array that maps the 2^8 possible values b of a discriminative byte (0x00 ≤ b ≤ 0xFF) to partitions. We write T [b] to access the partition for value b (T [b] = NIL if no partition exists for value b). ψ(L, D) drops L's longest common prefixes from each key k ∈ L since we store these prefixes already in node n. We apply Algorithm 4 recursively on each partition in T with the alternate dimension D̄, which returns the position where the next child is written to on disk. We terminate if partition L contains no more than τ keys or cannot be partitioned further. In that case we iterate over all remaining keys in L and store their non-interleaved suffixes in the set n.suffixes of leaf node n (lines 16-19). Finally, in line 22 we write node n to disk at the given offset in the trie file. Algorithm 5 implements ψ(L, D).
We organize the keys in a partition L at the granularity of pages so that we can seamlessly transition between memory- and disk-resident partitions. A page is a fixed-length buffer that contains a variable number of keys. If L is disk-resident, L.ptr points to a page-structured file on disk; if L is memory-resident, L.mptr points to the head of a singly-linked list of pages. Algorithm 5 iterates over all pages in L and for each key k in a page, line 6 determines the partition T[b] to which k belongs by looking at its value b at the discriminative byte. Next we drop the longest common path and value prefixes from k (lines 7-8). We proactively compute T[b]'s discriminative bytes whenever we add a key k to T[b] (lines 10-17). Two cases can arise. If k is T[b]'s first key, we initialize partition T[b]. If L fits into memory, we make T[b] memory-resident, else disk-resident. We initialize g_P and g_V with one past the length of k in the respective dimension (lines 9-12). These values are valid upper bounds for the discriminative bytes since keys are prefix-free. We store k as a reference key for partition T[b].
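The page-granular organization can be sketched as follows. The page size and the length-prefixed record format are illustrative assumptions, not the paper's layout; the point is that a partition is a sequence of fixed-length pages that can equally live in memory (a list standing in for the linked list of pages) or be flushed to a page-structured file.

```python
PAGE_SIZE = 64  # illustrative page size in bytes

class PagedPartition:
    """Keys are appended to fixed-length pages; records must fit in a page."""
    def __init__(self):
        self.pages = [bytearray()]

    def add(self, key):
        blob = len(key).to_bytes(2, "big") + key   # length-prefixed record
        if len(self.pages[-1]) + len(blob) > PAGE_SIZE:
            self.pages.append(bytearray())         # start a new page
        self.pages[-1] += blob

    def keys(self):
        """Iterate over all keys, page by page, as Algorithm 5 does."""
        for page in self.pages:
            i = 0
            while i < len(page):
                n = int.from_bytes(page[i:i + 2], "big")
                yield bytes(page[i + 2:i + 2 + n])
                i += 2 + n

records = [bytes([i]) * 20 for i in range(5)]
p = PagedPartition()
for r in records:
    p.add(r)
```

With 22-byte records and 64-byte pages, two records fit per page, so five records occupy three pages, and iterating the pages returns the records in insertion order.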

Merging RSCAS Tries Upon Overflow
When the memory-resident trie R_0^M reaches its maximum size of M keys, we move its keys to the first disk-based trie R_i that is empty using Algorithm 6. We keep pointers to the root nodes of all tries in an array. Algorithm 6 first collects all keys from tries R_0^M, R_0, . . . , R_{i−1} and stores them in a new partition L (lines 2-4). Next, in lines 5-11, we compute L's discriminative bytes L.g_P and L.g_V from the substrings s_P and s_V of the root nodes of the i tries. Finally, in lines 12-14, we bulk-load trie R_i and delete all previous tries.
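The prefix-based computation in lines 5-11 can be sketched for one dimension as follows (a simplified model: root prefixes are plain byte strings, positions are 1-based as in the paper, and the two dimensions are handled by two separate calls rather than in parallel). The merged partition's discriminative byte is one past the prefix common to all roots' stored prefixes, bounded above by the in-memory root's own discriminative byte.

```python
def merged_dsc(mem_prefix, disk_prefixes):
    """Discriminative byte (1-based) of the union of all tries' keys in one
    dimension, derived from the root prefixes alone: mem_prefix is the
    in-memory root's stored prefix s, disk_prefixes are the disk roots'."""
    g = len(mem_prefix) + 1            # upper bound from the in-memory trie
    for s in disk_prefixes:
        i = 0                          # first byte at which s diverges
        while i + 1 < g and i < len(s) and s[i] == mem_prefix[i]:
            i += 1
        g = min(g, i + 1)
    return g
```

For example, merging roots with prefixes `/a/b`, `/a/c`, and `/a/bd` gives discriminative byte 4 (the keys first diverge at the fourth byte); with no disk tries, the bound from the in-memory root alone is returned.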

Total I/O Overhead During Bulk-Loading
The I/O overhead is the number of page I/Os without reading the input and writing the output. We use N, M, and B for the number of input keys, the number of keys that fit into memory, and the number of keys that fit into a page, respectively [4]. We analyze the I/O overhead of Algorithm 4 for a uniform data distribution, which yields a balanced RSCAS trie, and for a maximally skewed distribution, which yields an unbalanced RSCAS trie. For uniform data the ψ-partitioning splits a partition into equally-sized partitions. Thus, with a fixed fanout f the ψ-partitioning splits a partition into f partitions, 2 ≤ f ≤ 2^8. For maximally skewed data RSCAS deteriorates to a trie whose height is linear in the number of keys in the dataset.

Example 23
We use the same parameters as in the previous example but assume maximally skewed data. There are 16 − ⌈4/2⌉ · 2 = 12 levels before the partitions fit into memory. For example, at level i = 1 we write and read ⌈(16 − 1)/2⌉ = 8 pages for L_{1,2}. In total, the I/O overhead is 144 pages. ◻

Note that, since RSCAS is trie-based and keys are encoded by the path from the root to the leaves, the height of the trie is bounded by the length of the keys. The worst case is very unlikely in practice because it requires that the length of the keys is linear in the number of keys. Typically, the key length is at most tens or hundreds of bytes. We show in Section 8 that building RSCAS performs close to the best case on real-world data.

Amortized I/O Overhead During Insertions
Next, we consider the amortized I/O overhead of a single insertion during a series of N insertions into an initially empty trie. Note that M − 1 out of M consecutive insertions incur no disk I/O since they are handled by the in-memory trie R_0^M. Only every M-th insertion bulk-loads a new disk-based trie.

Setup
Environment. We use a Debian 10 server with 80 cores and 400 GB of main memory. The machine has six 2 TB hard disks configured in a RAID 10 setup. The code is written in C++ and compiled with g++ 8.3.0.
Datasets. We use three real-world datasets and one synthetic dataset. Table 6 provides an overview.
- GitLab. The GitLab data from SWH contains archived copies of all publicly available GitLab repositories up to 2020-12-15. The dataset contains 914 593 archived repositories, which correspond to a total of 120 071 946 unique revisions and 457 839 953 unique files. For all revisions in the GitLab dataset we index the commit time and the modified files (equivalent to "commit diffstats" in version control system terminology). In total, we index 6.9 billion composite keys similar to Table 1.
- ServerFarm. The ServerFarm dataset [43] mirrors the file systems of 100 Linux servers. For each server we installed a default set of packages and randomly picked a subset of optional packages. In total there are 21 million files. For each file we record the file's full path and size.
- Amazon. The Amazon dataset [18] contains hierarchically categorized products. For each product its location in the hierarchical categorization (the path) and its price in cents (the value) are recorded. For example, the shoe 'evo' has path /sports/outdoor/running/evo and its price is 10 000 cents.
- XMark. The XMark dataset [40] is a synthetic dataset that models a database for an internet auction site. It contains information about people, regions (subdivided by continent), etc. We generated the dataset with scale factor 500 and index the numeric attribute 'category'.
Previous Results. In our previous work [44] we compared RSCAS to state-of-the-art research solutions. We compared RSCAS to the CAS index by Mathis et al. [26], which indexes paths and values in separate index structures. We also compared RSCAS to a trie-based index where composite keys are integrated with four different methods: (i) the z-order curve with surrogate functions to map variable-length keys to fixed-length keys, (ii) a label-wise interleaving where we interleave one path label with one value byte, (iii) the path-value concatenation, and (iv) the value-path concatenation. Our experiments showed that these approaches do not provide robust CAS query performance because they may create large intermediate results.
Compared Approaches. This paper compares RSCAS to scalable, state-of-the-art, industrial-strength systems. First, we compare RSCAS to Apache Lucene [1], which builds separate indexes for the paths and values. Lucene creates an FST on the paths and a Bkd-tree [36] on the values. Lucene evaluates CAS queries by decomposing queries into their two predicates, evaluating the predicates on the respective indexes, and intersecting the sorted posting lists to produce the final result. Second, we compare RSCAS to composite B+ trees in Postgres. This simulates the two possible c-order curves that concatenate the paths and values (or vice versa). We create a table data(P, V, R), similar to Table 1, and create two composite B+ trees on attributes (P, V) and (V, P), respectively.
Parameters. Unless otherwise noted, we set the partitioning threshold to τ = 100 based on experiments in Section 8.4.1. The number of keys M that the main-memory RSCAS trie R_0^M can hold is M = 10^8.
Artifacts. The code and the datasets used for our experiments are available online.²

Impact of Datasets on RSCAS's Structure
In Figure 10 we show how the shape (depth and width) of the RSCAS trie adapts to the datasets. Figure 10a shows the distribution of the node depths in the RSCAS trie for the GitLab dataset. Because of its trie-based structure, not every root-to-leaf path in RSCAS has the same length (see also Figure 4). The average node depth is about 10, with 90% of all nodes occurring no deeper than level 14. The expected depth is log_f̄⌈N/τ⌉ = log_8⌈6.9 × 10^9 / 100⌉ ≈ 8.7, where N is the number of keys, τ is the partitioning threshold that denotes the maximum size of a leaf partition, and f̄ is the average fanout. The actual average depth is higher than the expected depth since the GitLab dataset is skewed and the expected depth assumes a uniformly distributed dataset. In the GitLab dataset the average key length is 80 bytes, but the average node depth is 10, meaning that RSCAS successfully extracts common prefixes. Figure 10b shows how the fanout of the nodes is distributed. Since RSCAS ψ-partitions the data at the granularity of bytes, the fanout of a node is upper-bounded by 2^8, but in practice most nodes have a smaller fanout (we cap the x-axis in Figure 10b at fanout 40 because there is a long tail of high fanouts with low frequencies). Nodes that ψ-partition the data in the path dimension typically have a lower fanout because most paths contain only printable ASCII characters (of which there are about 100), while value bytes span the entire available byte spectrum.
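A quick check of the arithmetic behind the expected-depth estimate, with N = 6.9 billion keys, τ = 100, and average fanout f̄ = 8 as stated above:

```python
import math

# Expected depth of a balanced RSCAS trie: log_fbar(ceil(N / tau)).
N, tau, fbar = 6_900_000_000, 100, 8
depth = math.log(math.ceil(N / tau), fbar)
```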

The shapes of the RSCAS tries on the ServerFarm and Amazon datasets closely resemble that of the trie on the GitLab dataset, see the second and third row in Figure 10. This is to be expected since all three datasets contain a large number of unique paths and values, see Table 6. As a result, the data contains a large number of discriminative bytes that are needed to distinguish keys from one another. The paths in these datasets are typically longer than the values and contain more discriminative bytes. In addition, as seen above, the discriminative path bytes typically ψ-partition the data into fewer partitions than the discriminative value bytes. As a consequence, the RSCAS tries on these three datasets are narrower and deeper than the RSCAS trie on the XMark dataset, which has only seven unique paths and about 390k unique values in a dataset of 60M keys. Since the majority of the discriminative bytes in the XMark dataset are value bytes, the trie is flatter and wider on average, see the last row in Figure 10.

Table 7 shows twelve typical CAS queries with their query path q and the value range [v_l, v_h]. We show for each query the final result size and the number of keys that match the individual predicates. In addition, we provide the selectivities of the queries. The selectivity σ (σ_P) [σ_V] is computed as the fraction of all keys that match the CAS query (path predicate) [value predicate]. A salient characteristic of the queries is that the final result is orders of magnitude smaller than the results of the individual predicates. Queries Q_1 through Q_6 on the GitLab dataset increase in complexity. Q_1 looks up all revisions that modify one specific file in a short two-hour time frame. Thus, Q_1 is similar to a point query with very low selectivity in both dimensions. The remaining queries have a higher selectivity in at least one dimension. Q_2 looks up all revisions that modify one specific file in a one-month period.
Thus, its path selectivity is low but its value selectivity is high. Query Q_3 does the opposite: its path predicate matches all changes to GPU drivers using the ** wildcard, but we only look for revisions in a very narrow one-day time frame. Q_4 mixes the * and ** wildcards multiple times and puts them in different locations of the query path (in the middle and towards the end). Q_5 looks for changes to all Makefiles, using the ** wildcard at the front of the query path. Similarly, Q_6 looks for all changes to files named inode (all file extensions are accepted with the * wildcard). The remaining six queries on the other three datasets are similar.

Figure 11 shows the runtime of the six queries Q_1, . . . , Q_6 on the GitLab dataset on cold caches (note the logarithmic y-axis). We clear the operating system's page cache before each query (later we repeat the same experiment on warm caches). We start with the runtime of query Q_1 in Figure 11a. This point query is well suited for existing solutions because both predicates have low selectivities and produce small intermediate results. Therefore, the composite VP and PV indexes perform best. No matter which attribute is ordered first in the composite index (the paths or the values), the index can quickly narrow down the set of possible candidates. Lucene, on the other hand, evaluates both predicates and intersects the results, which is more expensive. RSCAS is in between Lucene and the two composite indexes. Q_2 has a low path but high value selectivity. Because of this, the composite PV index outperforms the composite VP index, see Figure 11b. Evaluating this query in Lucene is costly since Lucene must fully iterate over the large intermediate result produced by the value predicate. RSCAS, on the other hand, uses the selective path predicate to prune subtrees early during query evaluation.
For query Q_3 in Figure 11c, RSCAS performs best, closely followed by the composite VP index, for which Q_3 is the best case since Q_3 has a very low value selectivity. While Q_3 is the best case for VP, it is the worst case for PV, and indeed PV's query runtime is an order of magnitude higher. For Lucene the situation is similar to query Q_2, except that it is the path predicate that produces a large intermediate result (rather than the value predicate). Query Q_4 uses the * and ** wildcards at the end of its query path. The placement of the wildcards is important for all approaches. Query paths that have wildcards in the middle or at the end can be evaluated efficiently with prefix searches. As a result, RSCAS's query performance remains stable and is similar to that for queries Q_1, . . . , Q_3. Queries Q_5 and Q_6 are more difficult for all approaches because they contain the descendant axis at the beginning of the query path. Normally, when the query path does not match a path in the trie, the mismatched node and its subtrees need not be considered anymore because no path suffix can match the query path.

Query Performance
The ** wildcard, however, may skip over mismatches, so the query path's suffix may still match. For this reason, Lucene must traverse the entire FST that it uses to evaluate path predicates. Likewise, the composite PV index must traverse large parts of the index because the keys are ordered first by the paths in the index. The VP index can use the value predicate to prune subtrees that do not match the value predicate before looking at the path predicate. RSCAS uses the value predicate to prune subtrees when the path predicate does not prune anymore because of wildcards and therefore delivers the best query runtime. In Figure 12 we show the runtime of queries Q_7, . . . , Q_12 on the remaining three datasets (again on cold caches). The absolute runtimes are lower because the datasets are considerably smaller than the GitLab dataset, see Table 6, but the relative differences between the approaches are comparable to the previous set of queries, see Figure 11.
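To see why a leading ** prevents pruning, consider a toy label-level matcher (illustrative only; the actual systems evaluate query paths over their own index structures, not label lists). A label mismatch alone cannot discard a subtree when ** may absorb the mismatched label:

```python
def matches(labels, pattern):
    """pattern entries: a literal label, '*' (exactly one label),
    or '**' (any number of labels, including zero)."""
    if not pattern:
        return not labels
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # Either ** matches nothing here, or it absorbs the next label.
        return matches(labels, rest) or (bool(labels) and matches(labels[1:], pattern))
    return (bool(labels)
            and (head == "*" or labels[0] == head)
            and matches(labels[1:], rest))
```

For example, `["**", "Makefile"]` matches `["usr", "lib", "Makefile"]` even though `usr` mismatches `Makefile`, which is exactly why a subtree cannot be pruned on that first mismatch.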
We repeat the same experiments on warm caches, see Figure 13 (the y-axis shows the query runtime in milliseconds). Note that we did not implement a dedicated caching mechanism and rely solely on the operating system's page cache. When the caches are hot, CPU usage and memory accesses become the main bottlenecks. Since RSCAS produces the smallest intermediate results, it requires the least CPU time and the fewest memory accesses. As a result, RSCAS consistently outperforms its competitors, see Figure 13. To evaluate the impact of the number of levels on the query performance we ran an experiment for an RSCAS index with 10^9 keys from the GitLab dataset. By varying the memory size to accommodate, respectively, 2^20, 2^26, and 2^30 keys, we obtained an RSCAS index with 1, 4, and 7 levels (tries), respectively. The total running time for queries Q_1 to Q_6 is shown in Figure 14.

Scalability
RSCAS uses its LSM-based structure to gracefully and efficiently handle large datasets that do not fit into main memory. We discuss how to choose threshold τ , the performance of bulk-loading and individual insertions, the accuracy of the cost model, and the index size.

Calibration
We start by calibrating the partitioning threshold τ, i.e., the maximum number of suffixes in a leaf node. We calibrate τ in Figure 15 on a 100 GB subset of the GitLab dataset. Even on this 100 GB subset, bulk-loading RSCAS with τ = 1 takes more than 12 hours, see Figure 15a. When we increase τ, the recursive bulk-loading algorithm terminates earlier (see lines 15-21 in Algorithm 4), hence fewer partitions are created and the runtime improves. Since the bulk-loading algorithm extracts from every partition its longest common prefixes and stores them in a new node, the number of nodes in the index also decreases as we increase τ, see Figure 15b. As a result, leaf nodes get bigger and store more uninterleaved suffixes. This negatively affects the query performance and the index size, see Figures 15c and 15d, respectively. Figure 15c shows the average runtime of the six queries Q_1, . . . , Q_6. A query that reaches a leaf node must scan all suffixes to find matches. Making τ too small decreases query performance because more nodes need to be traversed; making τ too big decreases query performance because a node must scan many suffixes that do not match a query. According to Figure 15c, values τ ∈ [10, 100] give the best query performance. Threshold τ also affects the index size, see Figure 15d. If τ is too small, many small nodes are created and each such node incurs storage overhead in terms of node headers, pointers, etc., see Figure 7. If τ is too big, leaf nodes contain long lists of suffixes for which we could still extract common prefixes if we ψ-partitioned them further. As a consequence, we choose medium values for τ to get a good balance between bulk-loading runtime, query performance, and index size. Based on Figure 15 we choose τ = 100 as the default value. A more detailed quantitative analysis of how τ affects these parameters can be found in Appendix B.

Bulk-Loading Performance
Bulk-loading is a core operation that we use in two situations. First, when we deploy RSCAS for an existing system with large amounts of data, we use bulk-loading to create the initial index. Second, our RSCAS index uses bulk-loading to create a disk-based RSCAS trie whenever the in-memory RSCAS trie R_0^M overflows. We compare our bulk-loading algorithm with bulk-loading of composite B+ trees in Postgres (Lucene does not support bulk-loading). Figure 16 evaluates the performance of the bulk-loading algorithms for RSCAS and Postgres. We give the systems 8 GB of main memory. For a fair comparison, we set the fill factor of the composite B+ trees in Postgres to 100% to make them read-optimized and as compact as possible, since disk-based RSCAS tries are read-only. We compare the systems on our biggest dataset, the GitLab dataset, in Figure 16. The GitLab dataset contains 6.9 billion keys and has a size of 550 GB. Figure 16a confirms that bulk-loading RSCAS takes roughly the same time as bulk-loading the PV and VP composite indexes in Postgres (notice that RSCAS and the PV composite index have virtually the same runtime, thus PV's curve is barely visible). The runtime and disk I/O of all algorithms increase linearly in Figure 16a, which means it is feasible to bulk-load these indexes efficiently for very large datasets. Postgres creates a B+ tree by sorting the data and then building the index bottom up, level by level. RSCAS partitions the data and builds the index top down. In practice, both paradigms perform similarly, both in terms of runtime (Figure 16a) and disk I/O (Figure 16b).

Insertion Performance
New keys are first inserted into the in-memory trie R_0^M and are written to disk when R_0^M overflows. We evaluate insertions into R_0^M in Figure 17a and look at the insertion speed when R_0^M overflows in Figure 17b. For the latter case we compare RSCAS's on-disk insertion performance to Lucene's and Postgres'.
Since R_0^M is memory-based, insertions can be performed quickly, see Figure 17a. For example, inserting 100 million keys takes less than three minutes, with one insertion taking 1.7 µs on average. In practice, the SWH archive crawls about one million revisions per day and, since a revision modifies on average about 60 files in the GitLab dataset, there are 60 million insertions into the RSCAS index per day, on average. Therefore, our RSCAS index can easily keep up with the ingestion rate of the SWH archive. Every two days, on average, R_0^M overflows and a new disk-based RSCAS trie R_i is bulk-loaded. In Figure 17b we show how the RSCAS index performs when R_0^M overflows. In this experiment, we set the maximum capacity of R_0^M to M = 100 million keys and insert 600 million keys, thus R_0^M overflows six times. Typically, when R_0^M overflows we bulk-load a disk-based trie in a background process, but in this experiment we execute all insertions in the foreground in one process to show all times. As a result, we observe a staircase runtime pattern, see Figure 17b. A flat part, where insertions are performed efficiently in memory, is followed by a jump, where a disk-based trie R_i is bulk-loaded. Not all jumps are equally high since their height depends on the size of the trie R_i that is bulk-loaded. When R_0^M overflows, the RSCAS index looks for the smallest i such that R_i does not exist yet and bulk-loads it from the keys in R_0^M and all R_j, j < i. Therefore, a trie R_i, containing 2^i M keys, is created for the first time after 2^i M insertions. For example, after M insertions we bulk-load R_0 (M keys); after 2M insertions we bulk-load R_1 (2M keys) and delete R_0; after 3M insertions we again bulk-load R_0 (M keys); after 4M insertions we bulk-load R_2 (4M keys) and delete R_0 and R_1; and so on. Lucene's insertion performance is comparable to that of RSCAS, but insertions into Postgres' B+ trees are expensive in comparison.
This is because insertions into Postgres' B+ trees are executed in-place, causing many random updates, while insertions in RSCAS and Lucene are done out-of-place.
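This flush-and-merge accounting is easy to simulate; sizes below are in units of M (a minimal sketch of the policy, not the actual trie code):

```python
def flush_sequence(num_overflows):
    """Return, per overflow of the in-memory trie, the pair (i, size):
    which disk trie R_i is bulk-loaded and how many units of M it receives."""
    disk, log = [], []
    for _ in range(num_overflows):
        i = 0                    # first empty disk-based trie
        while i < len(disk) and disk[i] is not None:
            i += 1
        size = 1 + sum(disk[j] for j in range(i))  # memory (M) + R_0..R_{i-1}
        for j in range(i):
            disk[j] = None       # merged tries are deleted
        if i == len(disk):
            disk.append(None)
        disk[i] = size           # stands in for bulk-loading R_i
        log.append((i, size))
    return log
```

Running four overflows reproduces the sequence above: R_0 gets M keys, then R_1 gets 2M, then R_0 gets M again, then R_2 gets 4M. It also reproduces the totals discussed in Section 8.4.4: with M = 200 million and 600 million insertions, 800 million keys are bulk-loaded in total; with M = 300 million, 900 million.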

Evaluating the Cost Model
We evaluate the cost model from Lemma 5 that estimates the I/O overhead of our bulk-loading algorithm for a uniform data distribution and compare it to the I/O overhead of bulk-loading the real-world GitLab dataset. The I/O overhead is the number of page transfers to read/write intermediate results during bulk-loading. We multiply the I/O overhead by the page size to get the number of bytes that are transferred to and from disk. The cost model in Lemma 5 has four parameters: N, M, B, and f (see Section 5.4).
We set fanout f = 10 since this is the average fanout of a node in RSCAS for the GitLab dataset, see Figure 10a. The cost model assumes that M keys fit into memory and B keys fit into a page. Therefore, we set B = ⌈16 KB / 80 B⌉ = 205, where 16 KB is the page size and 80 B is the average key length (see Section 8.2). Similarly, with a memory size of 8 GB we can store M = ⌈8 GB / 80 B⌉ = 100 million keys in memory. In Figure 18a we compare the actual and the estimated I/O overhead to bulk-load RSCAS as we increase the number of keys N in the dataset, keeping the memory size fixed at M = 100 million keys. The estimated and actual cost are close and within 15% of each other. In Figure 18b we vary the memory size and fix the full GitLab dataset as input. The estimated cost is constant from M = 100 to M = 400 million keys because of the ceiling operator in log_f⌈N/M⌉, which computes the number of levels of the trie in Lemma 5. If we increase M to 800 million keys, the trie in the cost model has one level less before partitions fit entirely into memory; therefore the I/O overhead decreases and remains constant thereafter since only the root partition does not fit into main memory.

In Figure 19a we compare the actual and the estimated I/O overhead to insert N keys one-by-one into RSCAS, setting M = 100 × 10^6. We compute the estimated I/O overhead by multiplying the amortized cost of one insertion according to Lemma 7 with the number of keys N. We observe a staircase pattern for the actual I/O overhead because of the repeated bulk-loading when the in-memory trie overflows after every M insertions. Next we fix N = 600 million keys and increase M in Figure 19b. In general, increasing M decreases the actual and estimated overhead because less data must be bulk-loaded. But this is not always the case. For example, the actual I/O overhead increases from M = 200 to M = 300 million keys. To see why, we have to look at the tries that need to be bulk-loaded.
For M = 200 we create three tries: after M insertions R_0 (200 mil.), after 2M insertions R_1 (400 mil.), and after 3M insertions again R_0 (200 mil.), for a total of 800 million bulk-loaded keys. For M = 300 we create only two tries: after M insertions R_0 (300 mil.) and after 2M insertions R_1 (600 mil.), for a total of 900 million bulk-loaded keys.

Figure 20 shows the size of the RSCAS, Lucene, and Postgres indexes for our four datasets. The RSCAS index is between 30% and 80% smaller than the input size (i.e., the size of the indexed keys). The savings are highest for the XMark dataset because it has only seven unique paths and therefore the RSCAS trie has fewer nodes since there are fewer discriminative bytes. But even for a dataset with a large number of unique paths, e.g., the GitLab dataset, RSCAS is 43% smaller than the input. RSCAS's size is comparable to that of the other indexes since all the indexes require space linear in the number of keys in the input.

We propose the RSCAS index, a robust and scalable index for semi-structured hierarchical data. Its robustness is rooted in a well-balanced integration of paths and values in a single index using a new dynamic interleaving. The dynamic interleaving does not prioritize a particular dimension (paths or values), making the index robust against queries with high individual selectivities that produce large intermediate results and a small final result. We use an LSM design to scale the RSCAS index to applications with a high insertion rate. We buffer insertions in a memory-optimized RSCAS trie that we continuously flush to disk as a series of read-only, disk-optimized RSCAS tries. We evaluate our index analytically and experimentally. We prove RSCAS's robustness by showing that it has the smallest average query runtime over all queries among interleaving-based approaches. We evaluate RSCAS experimentally on three real-world datasets and one synthetic dataset.
Our experiments show that the RSCAS index outperforms state-of-the-art approaches by several orders of magnitude on real-world and synthetic datasets. We showcase RSCAS's scalability by indexing the revisions (i.e., commits) of all public GitLab repositories archived by Software Heritage, for a total of 6.9 billion modified files in 120 million revisions. In our future work we plan to support deletions. In the in-memory RSCAS trie we plan to delete the appropriate leaf node and efficiently restructure the trie if necessary. To delete keys from the disk-resident RSCAS tries we plan to flag the appropriate leaf nodes as deleted to avoid expensive restructuring on disk. As a result, queries need to filter out flagged leaf nodes. Whenever a new disk-based trie is bulk-loaded, we remove the elements previously flagged for deletion. It would also be interesting to implement RSCAS on top of a high-performance platform, such as an LSM-tree-based key-value store; the main challenge would be to adapt range filters to our complex interleaved queries.

A Proofs
Proof (Lemma 1) Consider the ψ-partitioning ψ(K, D) = {K_1, . . . , K_m}. Let K_i ≠ K_j be two different partitions of K and let k′ ∈ K_i and k′′ ∈ K_j be two keys belonging to these partitions. Since the paths and values of our keys are binary-comparable [21], the most significant byte is the first byte and the least significant byte is the last byte. Therefore, k′.D is smaller (greater) than k′′.D iff k′.D is smaller (greater) than k′′.D at the first byte at which the two keys differ in dimension D. All keys in K have the same longest common prefix s = lcp(K, D) in dimension D and their discriminative byte is g = |s| + 1 = dsc(K, D). By Definition 6, keys k′ and k′′ share the same longest common prefix s in D, i.e., k′.D[1, g−1] = k′′.D[1, g−1] = s, and they differ at the discriminative byte: k′.D[g] ≠ k′′.D[g]. Therefore, if k′.D[g] < k′′.D[g], we know that k′.D < k′′.D (and similarly for >). By the correctness constraint in Definition 6, all keys in K_i have the same value at position g and are therefore all smaller or all greater than the keys in K_j, which all have the same value k′′.D[g]. ◻

Proof (Lemma 2) By Definition 6, all keys k in a partition K_i have the same value k.D[dsc(K, D)] at the discriminative byte of K in dimension D. Therefore, dsc(K, D) is no longer a discriminative byte in K_i; instead dsc(K_i, D) > dsc(K, D). Let K_i ≠ K_j be two different partitions. Again by Definition 6, we know that dsc(K_i ∪ K_j, D) = dsc(K, D) since any two keys from these two partitions differ at byte dsc(K, D). Substituting lcp(K, D) = dsc(K, D) − 1 concludes the proof that ψ(K, D) is prefix-preserving in D. ◻

Proof (Lemma 3)
Since not all keys in K are equal in dimension D, there must be at least two keys k_1 and k_2 that differ in dimension D at the discriminative byte g = dsc(K, D), i.e., k_1.D[g] ≠ k_2.D[g]. According to the disjointness constraint of Definition 6, k_1 and k_2 must be in two different partitions of ψ(K, D). Hence, |ψ(K, D)| ≥ 2. ◻

Proof (Lemma 4)
The first line states that K_i ⊂ K is one of the partitions of K. From Definition 6 it follows that the value k.D[dsc(K, D)] is the same for every key k ∈ K_i. From Definition 5 it follows that dsc(K_i, D) ≠ dsc(K, D). By removing one or more keys from K to get K_i, the keys in K_i become more similar compared to those in K. That means it is not possible for the keys in K_i to differ at a position g < dsc(K, D). Consequently, dsc(K_i, D) ≮ dsc(K, D), and the same holds for the alternate dimension D̄: dsc(K_i, D̄) ≮ dsc(K, D̄). Thus dsc(K_i, D) > dsc(K, D) and dsc(K_i, D̄) ≥ dsc(K, D̄). ◻

Proof (Theorem 1) We begin with a brief outline of the proof. We show for each level l that the sum of the costs of query Q and its complementary query Q′ on level l is smallest with the vector φ_DY = (V, P, V, P, . . .) of our dynamic interleaving. Since this holds for any level l, it also holds for the sum of costs over all levels l, 1 ≤ l ≤ h, and this proves the theorem.
We only look at search trees with a height h ≥ 2, as for h = 1 we do not actually have an interleaving (and the costs are all the same). W.l.o.g., we assume that the first level of the search tree always starts with a discriminative value byte, i.e., φ_1 = V. Let us look at the cost for one specific level l for query Q and its complementary query Q′. We distinguish two cases: l is even or l is odd.
l is even: The cost for a perfectly alternating interleaving for Q for level l is equal to o^l ς_V^{l/2} ς_P^{l/2}. This is the same cost as for Q′, so adding the two costs gives us 2 o^l ς_V^{l/2} ς_P^{l/2}. For a non-perfectly alternating interleaving with the same number of ς_V and ς_P multiplicands up to level l we have the same cost as for our dynamic interleaving, i.e., 2 o^l ς_V^{l/2} ς_P^{l/2}. Now let us assume that the number of ς_V and ς_P multiplicands is different for level l (there must be at least one such level l). Assume that for Q we have r multiplicands of type ς_V and s multiplicands of type ς_P, with r + s = l and, w.l.o.g., r > s. This gives us o^l ς_V^s ς_P^s ς_V^{r−s} + o^l ς_V^s ς_P^s ς_P^{r−s} = o^l ς_V^s ς_P^s (ς_V^{r−s} + ς_P^{r−s}) for the sum of the costs of Q and Q′.
We have to show that $2 o^l \varsigma_V^{l/2} \varsigma_P^{l/2} \leq o^l \varsigma_V^s \varsigma_P^s (\varsigma_V^{r-s} + \varsigma_P^{r-s})$. Dividing both sides by $o^l \varsigma_V^s \varsigma_P^s$ and setting $x = (r-s)/2$ (note that $r - s = l - 2s$ is even and positive), this is equivalent to $2 \varsigma_V^x \varsigma_P^x \leq \varsigma_V^{2x} + \varsigma_P^{2x}$, i.e., to $(\varsigma_V^x - \varsigma_P^x)^2 \geq 0$, which always holds.

$l$ is odd: W.l.o.g. we assume that for computing the cost for a perfectly alternating interleaving for $Q$, there are $\lceil l/2 \rceil$ multiplicands of type $\varsigma_V$ and $\lfloor l/2 \rfloor$ multiplicands of type $\varsigma_P$. This results in $o^l \varsigma_V^{\lfloor l/2 \rfloor} \varsigma_P^{\lfloor l/2 \rfloor} (\varsigma_V + \varsigma_P)$ for the sum of the costs for $Q$ and $Q'$.
For a non-perfectly alternating interleaving, we again have $o^l \varsigma_V^s \varsigma_P^s (\varsigma_V^{r-s} + \varsigma_P^{r-s})$ with $r + s = l$ and $r > s$. What is left to prove is $o^l \varsigma_V^{\lfloor l/2 \rfloor} \varsigma_P^{\lfloor l/2 \rfloor} (\varsigma_V + \varsigma_P) \leq o^l \varsigma_V^s \varsigma_P^s (\varsigma_V^{r-s} + \varsigma_P^{r-s})$. Dividing both sides by $o^l \varsigma_V^s \varsigma_P^s$ and substituting $a = \varsigma_V$, $b = \varsigma_P$, and $x = \lfloor l/2 \rfloor - s$ (so that $r - s = 2x + 1$), we have to show that $a^x b^x (a + b) \leq a^{2x+1} + b^{2x+1}$, which is equivalent to $(a^x - b^x)(a^{x+1} - b^{x+1}) \geq 0$. This term can only become negative if one factor is negative and the other is positive. Let us first look at the case $a < b$: since $0 \leq a, b \leq 1$, it immediately follows that $a^x \leq b^x$ and $a^{x+1} < b^{x+1}$, i.e., neither factor is positive. Analogously, from $a > b$ (and $0 \leq a, b \leq 1$) it immediately follows that $a^x \geq b^x$ and $a^{x+1} > b^{x+1}$, i.e., neither factor is negative. In both cases the product is greater than or equal to 0. ◻
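The level-cost inequality at the heart of this proof is easy to spot-check numerically. The following sketch (illustrative names; the common factor $o^l$ is dropped, and selectivities are drawn uniformly from $[0,1]$) verifies that the perfectly alternating exponent split minimizes the summed level cost of $Q$ and $Q'$ over all splits $r + s = l$:

```python
import math
import random

def paired_cost(sv, sp, r, s):
    """Cost of Q plus cost of Q' at one level, with r multiplicands of one
    selectivity and s of the other (common factor o^l dropped)."""
    return sv**r * sp**s + sp**r * sv**s

def alternating_cost(sv, sp, l):
    """Perfectly alternating split: ceil(l/2) vs floor(l/2) multiplicands."""
    return paired_cost(sv, sp, math.ceil(l / 2), l // 2)

random.seed(42)
for _ in range(1000):
    sv, sp = random.random(), random.random()
    for l in range(2, 10):
        best = alternating_cost(sv, sp, l)
        for r in range(l + 1):
            # The alternating split is never beaten by any other split.
            assert best <= paired_cost(sv, sp, r, l - r) + 1e-12
```

The check exercises both the even and the odd case of the proof, since $l$ ranges over both.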

Proof (Theorem 2)
We assume a tree with fanout $o$ and height $h$. With the dynamic interleaving, the dimension on each level alternates, i.e., $\varphi_{DY} = (V, P, V, P, \ldots)$. We need to show that the total cost for a set $\mathcal{Q}$ of complementary queries is minimal with $\varphi_{DY}$, i.e., that for every interleaving vector $\varphi$ the following inequality holds: $\sum_{Q \in \mathcal{Q}} \mathrm{cost}(Q, \varphi_{DY}) \leq \sum_{Q \in \mathcal{Q}} \mathrm{cost}(Q, \varphi)$. First, we double the cost on each side: $2 \sum_{Q \in \mathcal{Q}} \mathrm{cost}(Q, \varphi_{DY}) \leq 2 \sum_{Q \in \mathcal{Q}} \mathrm{cost}(Q, \varphi)$. This is the same as counting the cost of each query and its complementary query twice: $\sum_{Q \in \mathcal{Q}} \big(\mathrm{cost}(Q, \varphi_{DY}) + \mathrm{cost}(Q', \varphi_{DY})\big) \leq \sum_{Q \in \mathcal{Q}} \big(\mathrm{cost}(Q, \varphi) + \mathrm{cost}(Q', \varphi)\big)$, since $\mathcal{Q}$ contains the complementary query $Q'$ of every query $Q \in \mathcal{Q}$. Since by Theorem 1 each summand on the left-hand side is smaller than or equal to its corresponding summand on the right-hand side, the sum on the left-hand side is smaller than or equal to the sum on the right-hand side. ◻

Proof (Theorem 3)
Similar to the proof of Theorem 1, we show for every level $l$ that the absolute difference between the costs of query $Q$ and its complementary query $Q'$ on level $l$ is smallest for the dynamic interleaving vector $\varphi_{DY} = (V, P, V, P, \ldots)$.
Again, we only look at search trees with a height $h \geq 2$ and, w.l.o.g., we assume that the first level of the search tree always starts with a discriminative value byte, i.e., $\varphi_1 = V$. Let us look at the difference in costs for one specific level $l$ for query $Q$ and its complementary query $Q'$. We distinguish two cases: $l$ is even or $l$ is odd.
$l$ is even: The cost for a perfectly alternating interleaving for $Q$ for level $l$ is equal to $o^l(\varsigma_V \cdot \varsigma_P \cdots \varsigma_V \cdot \varsigma_P)$, while the cost for $Q'$ is equal to $o^l(\varsigma'_V \cdot \varsigma'_P \cdots \varsigma'_V \cdot \varsigma'_P)$, which is equal to $o^l(\varsigma_P \cdot \varsigma_V \cdots \varsigma_P \cdot \varsigma_V)$. This is the same cost as for $Q$, so subtracting one cost from the other gives us 0.
For a non-perfectly alternating interleaving with the same number of $\varsigma_V$ and $\varsigma_P$ multiplicands up to level $l$ we have the same difference in costs as for our dynamic interleaving, i.e., 0. Now let us assume that the number of $\varsigma_V$ and $\varsigma_P$ multiplicands is different for level $l$ (there must be at least one such level $l$). Assume that for $Q$ we have $r$ multiplicands of type $\varsigma_V$ and $s$ multiplicands of type $\varsigma_P$, with $r + s = l$ and, w.l.o.g., $r > s$. This gives us $o^l \varsigma_V^s \varsigma_P^s \varsigma_V^{r-s} - o^l \varsigma_V^s \varsigma_P^s \varsigma_P^{r-s}$ for the absolute value of the difference in costs, which is always greater than or equal to 0.

$l$ is odd: W.l.o.g. we assume that for computing the cost for a perfectly alternating interleaving for $Q$, there are $\lceil l/2 \rceil$ multiplicands of type $\varsigma_V$ and $\lfloor l/2 \rfloor$ multiplicands of type $\varsigma_P$. This results in $o^l \varsigma_V^{\lfloor l/2 \rfloor} \varsigma_P^{\lfloor l/2 \rfloor} (\varsigma_V - \varsigma_P)$ for the difference in costs between $Q$ and $Q'$.
For a non-perfectly alternating interleaving, we again have $o^l \varsigma_V^s \varsigma_P^s \varsigma_V^{r-s} - o^l \varsigma_V^s \varsigma_P^s \varsigma_P^{r-s} = o^l \varsigma_V^s \varsigma_P^s (\varsigma_V^{r-s} - \varsigma_P^{r-s})$ with $r + s = l$ and $r > s$. What is left to prove is $o^l \varsigma_V^{\lfloor l/2 \rfloor} \varsigma_P^{\lfloor l/2 \rfloor} (\varsigma_V - \varsigma_P) \leq o^l \varsigma_V^s \varsigma_P^s (\varsigma_V^{r-s} - \varsigma_P^{r-s})$. W.l.o.g., assume that $\varsigma_V > \varsigma_P$ (if $\varsigma_V < \varsigma_P$, we just have to switch the minuend with the subtrahend in the subtractions and the roles of $\varsigma_V$ and $\varsigma_P$ in the following); then all numbers in the inequality are greater than or equal to 0. Dividing both sides by $o^l \varsigma_V^s \varsigma_P^s$ and substituting $a = \varsigma_V$, $b = \varsigma_P$, and $x = \lfloor l/2 \rfloor - s$ (so that $r - s = 2x + 1$), this means showing that $a^x b^x (a - b) - (a^{2x+1} - b^{2x+1}) \leq 0$. The left-hand side can be factorized into $-(a^x - b^x)(a^{x+1} + b^{x+1})$. Since $a > b \geq 0$, the factor $a^x - b^x$ is greater than or equal to 0, while the factor $a^{x+1} + b^{x+1}$ is greater than 0, so the product with the leading minus sign is less than or equal to 0. ◻
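As for Theorem 1, the claim admits a quick numeric spot-check. The sketch below (illustrative names; common factor $o^l$ dropped) verifies that the perfectly alternating split minimizes the absolute cost difference between $Q$ and $Q'$ at a level:

```python
import math
import random

def cost_gap(sv, sp, r, s):
    """Absolute difference between the level costs of Q and Q',
    with r and s selectivity multiplicands (factor o^l dropped)."""
    return abs(sv**r * sp**s - sp**r * sv**s)

random.seed(7)
for _ in range(1000):
    sv, sp = random.random(), random.random()
    for l in range(2, 10):
        # Alternating split: ceil(l/2) vs floor(l/2) multiplicands.
        gap_dy = cost_gap(sv, sp, math.ceil(l / 2), l // 2)
        for r in range(l + 1):
            # No other split yields a smaller gap between Q and Q'.
            assert gap_dy <= cost_gap(sv, sp, r, l - r) + 1e-12
```

For even $l$ the alternating gap is exactly 0, matching the first case of the proof.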

Proof (Lemma 7)
A key moves through a series of indexes in our indexing pipeline. First, it is stored in $R_0^M$ at no I/O cost; after that, it moves through a number of disk-based indexes $R_0, \ldots, R_k$. Importantly, when a key in an index $R_i$ is moved, it always moves to a larger index $R_j$, $j > i$.
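The invariant that a key only ever moves to a strictly larger disk-based index can be illustrated with a toy LSM merge policy. This is a sketch, not the actual RSCAS merge logic: the buffer capacity `C`, the doubling level capacities, and the merge rule are assumptions made for illustration only.

```python
# Toy LSM pipeline: an in-memory buffer (R0M) plus disk runs R_0, R_1, ...
# where R_i holds up to C * 2**i keys. When a level overflows, its keys
# merge into the next level. We record the sequence of disk levels each
# key occupies and check that it is strictly increasing.
C = 4  # capacity of the in-memory buffer (assumed)

def insert_all(n):
    memtable = []                      # in-memory R0M, no I/O cost
    runs = []                          # runs[i] = keys currently in R_i
    history = {k: [] for k in range(n)}
    for k in range(n):
        memtable.append(k)
        if len(memtable) == C:         # flush R0M into R_0
            spill, memtable = memtable, []
            level = 0
            while True:
                if level == len(runs):
                    runs.append([])
                runs[level].extend(spill)
                for key in spill:
                    history[key].append(level)
                if len(runs[level]) <= C * 2**level:
                    break
                spill, runs[level] = runs[level], []   # merge down
                level += 1
    return history

hist = insert_all(64)
# Every key's sequence of disk levels is strictly increasing:
assert all(levels == sorted(set(levels)) for levels in hist.values())
```

Under this policy each key is written once per level it passes through, which is the basis for the amortized I/O argument.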

B Tuning τ
The following provides more details on how to calibrate the partitioning threshold $\tau$ discussed in Section 8.4.1. In particular, we quantify the effects leading to the shape of the curves depicted in Figure 15. Although we have not been able to develop a closed formula (so far), the results below are important steps towards such a formula.

The diagrams in Figure 15(a) and Figure 15(b), showing the impact of $\tau$ on the number of nodes and the overall time for bulk-loading the index, are not particularly interesting for the calibration, as there is a clear relationship: the larger the value of $\tau$, the earlier we can stop the partitioning and the smaller the number of created nodes. Consequently, the execution time for bulk-loading goes down, as we can skip more and more partitioning steps.

Figure 15(c) and Figure 15(d) are much more interesting, as in both figures two effects counteract each other. We start with Figure 15(d), showing the impact of $\tau$ on the index size. First of all, $\tau$ influences the total amount of metadata stored on each disk page: the fewer leaf nodes per page we have, the smaller the amount of this metadata. Assuming that we actually fill every leaf node with exactly $\tau$ key suffixes and that we store $d$ bytes of metadata per leaf node, the overhead is $\frac{N_p \cdot d}{\tau}$ per disk page (with $N_p$ being the number of input keys per disk page). This reciprocal function flattens out quickly and explains why the curve in Figure 15(d) drops at the beginning.

However, there is a second effect at play. The more key suffixes share a leaf node (i.e., the larger the value of $\tau$), the higher the probability that they share a common prefix that has not been factored out, because we stopped the partitioning early. We found estimating the expected value of overlapping prefixes in leaf nodes very hard to do, as it depends on the data distribution. A simplified version can be computed as follows. Assume we have $n$ different prefixes $x_1, x_2, \ldots, x_n$ that appear in the key suffixes stored in leaf nodes and that all prefixes have the same likelihood of appearing. Moreover, let $u$ be the number of unique prefixes in a leaf node; then we have $\tau - u$ prefixes that are stored multiple times (the first time a prefix shows up in a leaf node is fine, but every subsequent appearance adds to the overhead). For a leaf node, let $R_i$ be a random variable that is 1 if $x_i$ is in the node, and 0 otherwise. Then the number $X$ of different prefixes found in the node is $\sum_{i=1}^{n} R_i$. Due to the linearity of expectation, $E[X] = \sum_{i=1}^{n} E[R_i] = \sum_{i=1}^{n} P(R_i = 1)$. The probability of a prefix $x_i$ appearing in a leaf node at least once is equal to one minus the probability that it does not appear at all, which is equal to $1 - (\frac{n-1}{n})^{\tau}$. Thus, $E[X] = n\big(1 - (\frac{n-1}{n})^{\tau}\big)$, which means that the expected overhead is $\tau - E[X]$ per leaf node. This rises slowly for small values of $\tau$, but ascends more quickly for larger values of $\tau$. Nevertheless, this is still a simplification, as it counts the prefixes rather than summing their lengths.
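Under the stated uniformity assumption, the expected per-leaf overhead $\tau - E[X]$ is straightforward to compute; the sketch below (illustrative names, example values of $n$ and $\tau$ chosen arbitrarily) also checks the formula against a small Monte-Carlo simulation in which node contents are drawn uniformly from the $n$ prefixes:

```python
import random

def expected_duplicates(n, tau):
    """Expected number of duplicated prefix occurrences in a leaf that
    holds tau suffixes drawn uniformly from n equally likely prefixes:
    tau - E[X] with E[X] = n * (1 - ((n - 1) / n) ** tau)."""
    return tau - n * (1 - ((n - 1) / n) ** tau)

def simulate_duplicates(n, tau, trials=20000, seed=1):
    """Monte-Carlo estimate of the same quantity."""
    random.seed(seed)
    total = 0
    for _ in range(trials):
        node = [random.randrange(n) for _ in range(tau)]
        total += tau - len(set(node))   # duplicated occurrences
    return total / trials

n, tau = 50, 16
model = expected_duplicates(n, tau)
mc = simulate_duplicates(n, tau)
assert abs(model - mc) < 0.1  # the closed form matches the simulation
```

For $\tau = 1$ the overhead is 0, and it grows superlinearly in $\tau$, matching the right-hand rise of the curve in Figure 15(d).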
We now turn to the impact of the threshold $\tau$ on the query runtime (Figure 15(c)). The threshold determines how many internal nodes we have that distinguish subsets of keys. For $\tau = 1$, when evaluating a query, we visit a path down the trie containing all discriminative bytes. When increasing $\tau$, the path shortens, as we skip the final discriminative bytes. Mapping $\tau$ to the path length is not straightforward, as this again depends on the distribution of the keys. Assuming that every discriminative byte splits a set of keys into $b$ subsets and that the full length of a trie path for $\tau = 1$ is $l$, the number of internal nodes visited by a query is equal to $l - \log_b \tau$. The curve of this function drops at the beginning, but then quickly flattens out, explaining the left-hand part of Figure 15(c). The second effect of increasing $\tau$ is that we access more and more keys that are not relevant for our query. The irrelevant keys just happen to be in the same leaf node, because we no longer distinguish them from the relevant keys. There is at least one key in the node that satisfies the query; for the other keys, we compute the probability that they are relevant. We cannot just use the selectivity $\sigma_c$ of the complete query, since we need the selectivity $\sigma_s$ of the suffix stored in the leaf node. Thus, the expected number of keys in a leaf node not satisfying the query predicate is $(1 - \sigma_s)(\tau - 1)$. Assuming uniform distribution and independence, we can estimate $\sigma_s$ given $\sigma_c$: if the length of the suffix in the leaf node is $\frac{1}{s}$ of the total length of a key, then $\sigma_s = \sqrt[s]{\sigma_c}$. Since all selectivities are within the range $[0, 1]$, $\sigma_s \geq \sigma_c$, which means that $(1 - \sigma_s) \leq (1 - \sigma_c)$, so $(1 - \sigma_s)(\tau - 1)$ is usually a relatively flat ascending line. This explains the shape of the curve on the right-hand side of Figure 15(c).
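The two counteracting effects can be combined into a simple runtime model. This is a sketch under the uniformity and independence assumptions above; the parameter values and the weighting of leaf-scan work against node visits are arbitrary illustrations, not measured constants:

```python
import math

def visited_internal_nodes(l, b, tau):
    """Trie path length l shrinks by log_b(tau) as tau grows."""
    return max(l - math.log(tau, b), 0.0)

def irrelevant_keys_per_leaf(sigma_c, s, tau):
    """Expected non-matching keys scanned in a leaf: (1 - sigma_s)(tau - 1)
    with sigma_s = sigma_c ** (1 / s) (suffix is 1/s of the key length)."""
    sigma_s = sigma_c ** (1.0 / s)
    return (1 - sigma_s) * (tau - 1)

def query_cost(tau, l=20, b=4, sigma_c=0.001, s=3, leaf_weight=0.1):
    # total = shrinking path cost + growing leaf-scan cost
    return visited_internal_nodes(l, b, tau) \
        + leaf_weight * irrelevant_keys_per_leaf(sigma_c, s, tau)

costs = {tau: query_cost(tau) for tau in (1, 16, 256, 4096)}
# The model first drops (shorter trie paths) and eventually rises again
# (more irrelevant keys per leaf), reproducing the U-shape of Figure 15(c).
assert costs[1] > costs[16]
assert costs[256] < costs[4096]
```

Because neither term depends on the total number of input keys, this model is consistent with the observation below that $\tau$ can be calibrated on a sample.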
While the distribution of the keys has a direct impact on all of these parameters, as far as we can see, the total number of input keys does not directly influence them. Consequently, we can use a sample to calibrate τ (as we have done in Section 8.4.1).