The B2-tree is a variation of the classic B-tree; its core structure is based on the B+-tree layout. We extend this layout by embedding another tree into each page, as emphasized by the name B2-tree. We refer to this structure as the embedded tree. It serves to improve lookup performance while keeping the impact on space consumption and on the throughput of insert and delete operations minimal. Our implementation also features some commonly known optimization techniques, such as the derivation of a shortened separator during a split [26,27,28].
Other approaches that combine or nest different index structures have already proved their potential. Masstree, for instance, showed considerable performance improvements [23]. However, Masstree is not designed to be used in conjunction with paging-based storage engines. Another point of concern is the direct correlation between the outer trie height and the indexed data. The inflexible maximum span length of eight bytes may lead to a relatively low utilization and fanout of the lower tree levels when indexing strings, which is usually caused by the sparse distribution of characters found in string keys. This is not unique to Masstree: ART’s fanout on lower tree levels also decreases in such usage scenarios [7]. B+-trees, in contrast, feature a uniform tree height by design, since the tree height does not depend on the data distribution. However, comparison-based index structures such as the B+-tree are often outperformed by trie-based indexes in point accesses [7].
Our approach intends to combine the benefits of both worlds, namely the uniform tree height of B+-trees and trie-based lookup mechanics, while still featuring a page-based architecture. The trie-based embedded tree on each page determines a limited search space in which the queried key may reside. However, we still use a comparison-based search on these limited subranges. This design aims to improve the general cache-friendliness of the search procedure on each page.
The Embedded Tree
In the following we present the inner page layout of our B2-tree; the general outline is shown in Fig. 2. As already mentioned, the general page organization follows the common B+-tree architecture; hence, payloads are only stored in leaf nodes. Leaf nodes are also interlinked, as in the original B+-tree, in order to maintain the high range scan throughput usually achieved by B+-trees.
The embedded tree itself is composed of a few different node types. First, we define the decision node, which acts like a B-tree node by directing incoming queries to the corresponding child. Probably the main difference to a B-tree node is that decision nodes operate on a fixed-size decision boundary, represented by a single byte in our implementation, whereas B+-tree nodes usually operate on multiple bytes at once. We hence decompose keys into smaller units of information, similar to how the trie data structure operates [24, 25]. Nodes of this type direct the search to the first child for which the currently investigated byte of the search key is less than or equal to the separator. The fanout of this node type is also limited in order to improve data locality. Like B-tree nodes, decision nodes can be arranged hierarchically. This node type bears some similarity to the branch node found in Patricia tries [21]. However, Patricia’s branch nodes only compare for equality, whereas our decision nodes use the range of bytes to determine the position of the corresponding child. In Fig. 2 this type is illustrated as a divided rectangular shape. Fig. 3 illustrates the memory layout for this node type. Note that inner decision nodes and their leaf counterparts share the same layout; they differ only in the interpretation of their two-byte payloads. Leaf nodes terminate the search for a queried key even if it is not fully processed; the remainder of the queried key is then processed by the subsequent comparison-based search.
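The routing rule of a decision node can be sketched as follows. This is a minimal illustration under stated assumptions, not the layout shown in Fig. 3: the function name `decision_child`, the representation of a node as a sorted list of separator bytes, the extra rightmost child for bytes larger than every separator, and the treatment of an exhausted key as a zero byte are all hypothetical.

```python
from bisect import bisect_left


def decision_child(separators: list[int], key: bytes, depth: int) -> int:
    """Return the index of the child to follow for the byte key[depth].

    separators is a sorted list of byte values (assumed representation).
    The search is directed to the first child whose separator is greater
    than or equal to the inspected key byte. A byte larger than every
    separator falls through to one additional last child (assumption).
    """
    # Assumption: a key that ends before `depth` is treated as a zero byte.
    byte = key[depth] if depth < len(key) else 0
    return bisect_left(separators, byte)
```

With the separators `a` and `m`, a key starting with `k` is routed to the second child, because `m` is the first separator that is not smaller than `k`.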
The second node type we define is the span node. A span node stores the byte sequence that forms the longest common prefix found in the subtree rooted at the current node. Its memory layout is shown in Fig. 3. This node type can be compared to the extension concept of the Patricia trie [21]; however, span nodes have two additional outgoing edges to handle non-equality. Note that, because the nodes are stored in an order-preserving layout, there is no need to store a next pointer within the span node: the child node directly succeeds the span node. In Fig. 2 span nodes are illustrated as rounded rectangles. Span nodes are necessary to advance the queried key past the length of a span if the current subtree has a common prefix. At the following key depth, decision nodes can once again decide whether a queried key is part of a certain range.
Obviously, the content of a span node does not have to match the corresponding excerpt of the queried key exactly. In case the stored span does not match, three scenarios can occur. Firstly, the size of the span may exceed the remaining length of the queried key. In that case the input is logically padded with zero bytes. This may lead to the second case, where the input is smaller than the content of the span node. Any further comparisons with subsequent nodes are then meaningless. Hence, we introduce the concept of a virtual edge pointing from each span to its leftmost child, a so-called virtual node. We refer to the edge itself as the minimum edge. In Fig. 2 such an edge and its corresponding node are always colored gray to emphasize that they are not part of the physical tree structure. We follow this edge every time the input is less than the content of the span node. Note that encountering a fully processed input key implies that the minimum edge of a span node has to be taken. Fig. 2 illustrates the usage of this concept with the insertion of the Wikipedia URL after the construction of the embedded tree. This URL does not match the second span node; hence, it is delegated to the virtual node labeled “3”.
The last case where the input is greater than the span node’s content is completely symmetrical to the minimum edge situation. Therefore, a second virtual edge and node pair exists for every span node to handle the greater than case. We will cover the algorithmic details more elaborately in Sect. 3.2.
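The choice between the regular child and the two virtual edges of a span node can be illustrated with a short sketch. The function name `span_edge` and the string labels it returns are hypothetical; the order of the checks (greater first, then smaller or exhausted) is an assumption made so that an exhausted key that still compares greater is sent to the maximum edge.

```python
def span_edge(span: bytes, key: bytes, skip: int) -> str:
    """Decide which outgoing edge of a span node a key has to follow.

    skip is the offset in the key at which the span comparison starts.
    Returns "child", "minimum", or "maximum" (hypothetical labels).
    """
    excerpt = key[skip:skip + len(span)]
    # The span reaches past the end of the key: the key is fully processed.
    exhausted = len(excerpt) < len(span)
    # Logical zero padding of the missing key bytes.
    padded = excerpt.ljust(len(span), b"\x00")
    if padded > span:
        return "maximum"
    if padded < span or exhausted:
        return "minimum"
    return "child"
```

A key that ends inside the span, such as `ab` against the span `abc`, is padded to `ab\x00` and therefore follows the minimum edge.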
Fig. 2 also illustrates the range array, which stores the positions of key-value pairs. These positions define limited search spaces on the page. The array serves two purposes. First, it eliminates the need to alter the actual contents of the embedded tree during insert and removal operations, which simplifies modification operations significantly. Second, it enables the use of the aforementioned minimum and maximum edges.
During a lookup on a page this array is used to translate the output \(r_{i}\) of a query on the embedded tree into a position \(j\) on the actual page. Each lookup on the embedded tree itself yields an index into this array. The array, in turn, contains indices into the page indirection vector [29], whereas the indirection vector itself points to any data that does not fit into a slot within the indirection vector [29]. A resulting index thereby specifies an upper limit for the search of a queried key, whereas the directly preceding element specifies the lower limit. In Fig. 2 the annotated positions are colored according to their origin. The very first position is colored green; this special element ensures that the lower limit for a search can always be determined. Indices originating from virtual edges are colored gray, whereas blue is used for regular positions. We denote these indices as \(r_{i}\), where \(i\) represents the corresponding position within the range array, which acts as an array of prefix sums. Each \(r_{i}\) occupies two bytes within each leaf node; the memory layout is illustrated in Fig. 3.
Insertion and removal operations, which are to be performed on the overlying page, also affect the embedded tree. More precisely, this affects the search range given by the embedded tree where the actual operation took place and all subsequent search ranges, since adjusting an upper boundary of one particular search range also implies that subsequent search ranges have to be shifted in order to retain their original limits. This is achieved by simply adjusting the values within the range array for the directly affected search range and every following search range.
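The role of the range array can be sketched in a few lines. The names `search_bounds` and `shift_ranges` are hypothetical, the array is modeled as a plain Python list whose first element is the special lower-limit entry, and treating the upper limit as exclusive is an assumption of this sketch.

```python
def search_bounds(range_array: list[int], i: int) -> tuple[int, int]:
    """Translate an embedded-tree result i into (lower, upper) positions.

    range_array[0] is the special first element that provides a lower
    limit for the first range, so valid results satisfy i >= 1.
    The upper limit is treated as exclusive (assumption).
    """
    if not 1 <= i < len(range_array):
        raise IndexError("embedded-tree result outside the range array")
    return range_array[i - 1], range_array[i]


def shift_ranges(range_array: list[int], i: int, delta: int) -> None:
    """Adjust range i and every following range after a page modification.

    delta is +1 after an insertion into range i and -1 after a removal.
    The embedded tree itself is left untouched.
    """
    for k in range(i, len(range_array)):
        range_array[k] += delta
```

Starting from `[0, 3, 7, 10]`, an insertion into the second range yields `[0, 3, 8, 11]`: the second range grows by one entry and the third range keeps its original elements.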
Construction
One aspect we have not covered so far is the construction of the embedded tree structure. The construction routine is triggered each time a page is split or merged and also periodically depending on the number of changes since the last invocation.
The construction routine always starts by determining the longest common prefix of the given range of entries, beginning at the very first byte of each entry. We refer to the position of the currently investigated byte as the key depth, which is zero in the first invocation. On the first invocation, the range spans all entries on the current page. Based on the length of the longest common prefix, a root node is created: if the length is zero, a decision node is created, otherwise a span node. In the latter case, the newly created node contains the string forming the longest common prefix. Afterwards, the construction routine recurses, increasing the key depth to shift the observation point past the longest common prefix.
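The first construction step can be sketched as follows. The names `longest_common_prefix` and `make_root` and the tuple returned for a node are hypothetical; the sketch also assumes that all keys in the range already agree on their first `depth` bytes, which holds in the recursion described here.

```python
def longest_common_prefix(keys: list[bytes], depth: int) -> bytes:
    """Longest common prefix of all keys, starting at byte offset depth.

    Assumes all keys share their first `depth` bytes. Under that
    assumption it suffices to compare the smallest and the largest key.
    """
    first, last = min(keys), max(keys)
    end = depth
    while end < len(first) and end < len(last) and first[end] == last[end]:
        end += 1
    return first[depth:end]


def make_root(keys: list[bytes], depth: int = 0) -> tuple[str, bytes, int]:
    """Return (node kind, span content, key depth for the recursion)."""
    prefix = longest_common_prefix(keys, depth)
    if prefix:
        # Span node: the recursion continues past the common prefix.
        return ("span", prefix, depth + len(prefix))
    # Decision node: the key depth stays unchanged.
    return ("decision", b"", depth)
```

For the keys `http://a`, `http://b`, and `https://c`, the first invocation produces a span node containing `http`, and the recursion at key depth 4 produces a decision node.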
The creation of a decision node is more involved. Here we investigate the byte at the current key depth of the key in the middle of the given range. Subsequently, using the concrete value of this byte, a search on the entries to the right of that key is performed. This search determines the lower bound key index with regard to that value at the current key depth. In some cases, the resulting index may lie right at the upper limit of the given key range. For this reason, we also search in the opposite direction and take the index that divides the provided range of keys more evenly. This procedure is repeated on both resulting subranges until either the size of a subrange falls below a certain threshold or the physical node structure of the current decision node does not have enough space to accommodate another entry. Once a decision node is constructed, the construction routine recurses on each subrange; this time, however, the key depth remains unchanged. This process is repeated until each final subrange is at most as large as our threshold value.
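One possible reading of the split-point selection is sketched below. It is an interpretation, not the authors' routine: the name `choose_split` is hypothetical, the boundaries of the run of equal bytes around the middle key are found with linear scans instead of a dedicated search, and ties are resolved in favor of the left boundary.

```python
from typing import Optional


def choose_split(keys: list[bytes], depth: int, lo: int, hi: int) -> Optional[int]:
    """Pick a split index s with lo < s < hi for the sorted keys[lo:hi].

    The split is placed at a boundary between two different byte values
    at position `depth`, as close to the middle of the range as possible.
    Returns None if all keys carry the same byte at `depth`.
    """
    def byte_at(i: int) -> int:
        # Assumption: keys that end before `depth` count as a zero byte.
        return keys[i][depth] if depth < len(keys[i]) else 0

    mid = (lo + hi) // 2
    value = byte_at(mid)
    right = mid
    while right < hi and byte_at(right) == value:    # search to the right
        right += 1
    left = mid
    while left > lo and byte_at(left - 1) == value:  # opposite direction
        left -= 1
    # Discard boundaries that coincide with the limits of the range.
    candidates = [s for s in (left, right) if lo < s < hi]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs(s - mid))
```

If the run of equal bytes extends to the upper limit of the range, only the boundary found in the opposite direction remains, which mirrors the situation described above.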
Key Lookup
On the page level, the general lookup principle is performed as in a regular B+-tree. The only difference is the applied search procedure. We start by querying the embedded tree which yields an upper limit for the search on the page records within the indirection vector. With the upper limit known, the lower limit can be obtained by fetching the previous entry from the range array. Afterwards, a regular binary search on the limited range of entries will be performed.
Querying the embedded tree not only yields the search range but also further information about the queried key’s relationship to the longest common prefix prevailing in the resulting search range. This relationship is encoded in skip: the stored value corresponds to the length of the longest common prefix within the returned search range. A valid value also indicates that the key’s prefix is equal to this longest common prefix. This information can be exploited to optimize the subsequent search procedure by only comparing the suffixes.
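A minimal sketch of the page-level search shows how the search range and skip could be combined. It assumes that the page is modeled as a sorted list of complete keys, that the upper limit of a range is exclusive, and that an invalid skip is represented by the hypothetical constant `INVALID_SKIP`; the function name `lookup_on_page` is likewise an assumption.

```python
from typing import Optional

INVALID_SKIP = -1  # hypothetical marker for "no usable common prefix"


def lookup_on_page(keys: list[bytes], range_array: list[int],
                   i: int, skip: int, key: bytes) -> Optional[int]:
    """Binary search for key inside the range selected by the embedded tree.

    i is the index into the range array and skip the length of the common
    prefix already confirmed by the embedded tree (or INVALID_SKIP).
    Returns the position of the key on the page, or None if it is absent.
    """
    lower, upper = range_array[i - 1], range_array[i]
    # With a valid skip, only the suffixes have to be compared.
    offset = skip if skip != INVALID_SKIP else 0
    suffix = key[offset:]
    lo, hi = lower, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid][offset:] < suffix:
            lo = mid + 1
        else:
            hi = mid
    if lo < upper and keys[lo][offset:] == suffix:
        return lo
    return None
```

An empty range, where the lower and upper limits coincide, simply yields `None` without any key comparison.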
Algorithm 1 depicts a recursive formulation of the embedded tree traversal algorithm. For each node it visits, it checks whether the node is a span node. If a span node is encountered, we compare the stored span with the corresponding key excerpt at the position defined by skip. The result of this comparison is the difference between the stored span and the key excerpt. In this step we also determine whether the key is fully processed, meaning that the byte sequence stored within the span node exceeds the remaining input key. Three cases have to be differentiated at this point.
Firstly, the obtained difference stored in diff may be greater than zero, meaning that the span did not match. This also implies that the remaining subtree cannot be evaluated for this particular input key, so one of the outgoing virtual edges must be taken. Implementation-wise, this edge is realized by a call to MaximumLeaf. It traverses the remaining subtree by choosing the edges corresponding to the largest values. The final result is thus the rightmost node of the remaining subtree.
The second case, where the excerpt of the input key is smaller, is mostly analogous. However, the condition must now include not only whether diff is smaller than zero but also whether the input key has been fully processed during the span comparison. An input key that is shorter than the sum of all span nodes that led to the key’s destination search range is logically padded with zeros. This leads to another interesting observation. Consider two keys of different lengths whose longest common prefix is the complete first key, with all remaining bytes of the second key set to zero. The index structure has to be able to handle both keys. However, from the point of view of the embedded tree, both keys are considered equal. This implies that the embedded structure has to ensure that both keys are mapped into the same search range. It is therefore up to the construction procedure to handle such situations accordingly. The subsequent binary search has to handle everything from there on.
The third and last case, where the key excerpt matches the span node, should be the usual outcome for most input keys. We obviously have to account for the actual length of the span to advance the queried key beyond this byte sequence. Hence, the point of observation on the key has to be shifted accordingly. This is also the case where skip is adjusted accordingly. It holds the accumulated length of all span nodes which were encountered during the lookup, or an invalid value if one of the span nodes did not match or more precisely if diff evaluated to a non-zero value. The subsequent call to either MaximumLeaf or MinimumLeaf thereupon returns an invalid value for the skip entry in the result tuple.
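The three cases can be put together in a compact traversal sketch. It is not a transcription of Algorithm 1: the traversal is written iteratively, nodes are modeled as nested tuples instead of the byte layout of Fig. 3, leaves are plain integers that index the range array, and a single hypothetical helper `extreme_leaf` stands in for both MinimumLeaf and MaximumLeaf. The constant `INVALID_SKIP` is an assumed encoding of the invalid skip value.

```python
from bisect import bisect_left

INVALID_SKIP = -1  # hypothetical encoding of an invalid skip value

# Assumed node model:
#   ("span", span_bytes, child)
#   ("decision", sorted_separator_bytes, children)  with one child more
#   than separators; an int is a leaf holding an index into the range array.


def extreme_leaf(node, rightmost: bool) -> int:
    """Follow the smallest or largest edges down to a leaf."""
    while not isinstance(node, int):
        node = node[2] if node[0] == "span" else node[2][-1 if rightmost else 0]
    return node


def traverse(node, key: bytes) -> tuple[int, int]:
    """Return (index into the range array, skip) for the given key."""
    skip = 0
    while not isinstance(node, int):
        if node[0] == "span":
            _, span, child = node
            excerpt = key[skip:skip + len(span)]
            exhausted = len(excerpt) < len(span)
            padded = excerpt.ljust(len(span), b"\x00")  # logical zero padding
            if padded > span:                  # case 1: maximum edge
                return extreme_leaf(child, True), INVALID_SKIP
            if padded < span or exhausted:     # case 2: minimum edge
                return extreme_leaf(child, False), INVALID_SKIP
            skip += len(span)                  # case 3: span matched
            node = child
        else:
            _, separators, children = node
            byte = key[skip] if skip < len(key) else 0
            node = children[bisect_left(separators, byte)]
    return node, skip
```

For a tree with the span `http`, a decision node on `:`, a second span `s://`, and a decision node on `m`, the key `https://wiki` matches both spans and ends in the last leaf with a skip of 8, whereas `ftp://x` fails the first span and is sent to the leftmost leaf with an invalid skip.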
Key Insertion
We have already briefly discussed how the insertion of new entries affects the embedded tree and the results it yields. Two cases have to be addressed: either there is enough free space on the affected page to accommodate the new entry, or there is not. A new entry can be inserted as usual if the page has enough free space left. However, this also requires some value adjustments within the range array in order to reflect the change. The latter case, where the page does not hold enough free space for the new entry, leads to a page split. Splitting a page additionally results in roughly half of the embedded tree becoming obsolete.
For a simple insertion that does not lead to a page split, updating the embedded tree is trivial. We first determine the affected \(r_{i}\) in the range array where the insertion takes place. The updated search range is then defined by the preceding value and the value at \(r_{i}\), which has to be incremented, since the search range grew by exactly one entry. In Fig. 2 these index values are denoted as \(j\), and they are stored within the range array. However, this change must also be reflected in all subsequent search ranges. Therefore, all the following entries within the range array have to be incremented as well, in order to point to their original elements. By conducting this change, subsequent index values will then span all the original search spaces, which were valid up to the point where the insertion occurred.
The case where an insertion triggers a page split has to be handled differently. A split usually implies that approximately half of the embedded tree represents the entries on the original page, whereas the other half would represent the entries on the newly created page. Consequently, the index values defining the search ranges of one page are now obsolete. Although the structure could be updated to correctly represent the new state of both pages, we instead opted to reconstruct the embedded trees. This allows us to utilize the embedded structure to a higher degree, since the current state of both pages can be captured more accurately. Having a newly split page also ensures that roughly half of the available space is used. We can thus construct a more efficient embedded tree, which specifies smaller search ranges. In turn, smaller ranges can be used to direct incoming searches more efficiently.
Key Deletion
Deletion is handled mostly analogously. However, the repeated deletion of entries that define the border between two ranges may lead to empty ranges. This is no issue per se: the subsequently executed search routine just has to handle such a scenario accordingly. As is the case with insertions, the deletion of entries also requires further actions. Directly affected search ranges have to be resized accordingly. Hence, the corresponding \(j\) values within the range array have to be decremented in order to reflect those changes. All subsequent values also have to be decremented in order to point to their original elements on the page.
Space Requirements
Another interesting aspect is the space requirement of the embedded tree structure. In the following we will analyze the worst-case space consumption in that regard. We start by determining an upper bound for the space consumption of a path through the embedded tree to its corresponding section of the page which defines a search range.
For now, we only consider the space required by the structure itself, not the contents of span nodes. The total length of the contents of all span nodes along a path forms the longest common prefix of a certain page section, which the second part of this analysis takes into account. Furthermore, a node in the context of the following first part refers to a compound construction of a decision node and a zero-length span node. This represents the worst-case space consumption scenario, where each decision node is followed by a span. Similar to the analysis of ART’s worst-case space consumption per key [6], a space budget \(b(n)\) in bytes is defined for each node \(n\). This budget has to accommodate the size required by the embedded tree to encode the path to that section. \(x\) denotes the worst-case space consumption for a path through the embedded tree in bytes. The total budget for a tree is recursively given by the sum of the budgets of its children minus the fixed space \(s(n)\) required for the node itself. Formally, the budget for a tree rooted in \(n\) can be defined as
$$b(n)=\left\{\begin{array}[]{ll}x&\;\text{isTerminal}(n)\\ \sum_{c\in\text{children}(n)}b(c)-s(n)&\;\text{else.}\end{array}\right.$$
Hypothesis: \(\forall n:b(n)\geq x\).
Proof
We prove \(b(n)\geq x\) by induction over the length of a path through the tree.
Base case: The base case for the terminal node \(n\), i.e., a page section, is trivially fulfilled since \(b(n)=x\).
Inductive step: Every inner node has at least two children, and its fixed size satisfies \(s(n)\leq x\), since \(x\) bounds the space of a whole path that contains the node. Hence,
$$\begin{aligned}b(n)&=\sum_{c\in\text{children}(n)}b(c)-s(n)\\ &\geq\underbrace{b(c_{1})+b(c_{2})}_{\text{at least two children per node}}-\;x\\ &\geq 2x-x=x\quad\text{(induction hypothesis).}\end{aligned}$$
Conclusion: Since both cases have been proved as true, by mathematical induction the statement \(b(n)\geq x\) holds for every node \(n\). \(\square\)
An upper bound for the payload of the span nodes is obtained by assigning the complete size of the prefix of each section to the section itself. Assigning the complete prefix directly to a section implies that the embedded tree does not share snippets of the complete prefix between multiple sections; each span node therefore corresponds directly to one search range defined by the embedded tree. The absence of shared span nodes thus maximizes the space consumption of the embedded tree. An upper bound for the space consumption of the embedded tree is given by
$$\sum_{r\in\text{searchRanges}(p)}\left(l(r)+x\right)$$
where \(l(r)\) yields the size of the longest common prefix of the search range \(r\) within page \(p\).
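The bound is easy to evaluate for a concrete page. The following sketch uses a hypothetical function name and made-up prefix lengths purely for illustration; the value 15 for \(x\) is the one stated later in the text for a decision node plus an empty span node.

```python
def embedded_tree_upper_bound(prefix_lengths: list[int], x: int) -> int:
    """Evaluate the sum over all search ranges r of (l(r) + x), in bytes.

    prefix_lengths holds l(r) for every search range r of one page;
    x is the worst-case space consumption of a path through the tree.
    """
    return sum(length + x for length in prefix_lengths)


# Illustrative input (not measured data): four search ranges whose
# longest common prefixes are 7, 8, 8, and 12 bytes long.
bound = embedded_tree_upper_bound([7, 8, 8, 12], x=15)  # 35 + 4 * 15 = 95
```

The prefix lengths contribute 35 bytes and the structure contributes 4 times 15 bytes, which gives a worst-case bound of 95 bytes for this example.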
We can therefore conclude that the additional space required by the embedded tree mostly depends on how many search ranges are created and on the size of the common prefixes within them. Our choice of roughly 32 elements per search range yielded the best results on all tested datasets; however, this parameter may require further tuning in different scenarios. In our setting, the space consumption of the embedded structure never exceeded 0.5 percent of the page. Note that the prefix of each key within the same search range does not have to be stored; the B2-tree may therefore also be used to compress the stored keys.
In the following we analyze how modern CPUs may benefit from the B2-tree’s architecture. Both AMD’s and Intel’s current x86 lineups feature L1 data caches with a size of 32 KiB, 8-way associativity, and 64-byte cache lines. Our previous worst-case analysis showed that the size of the embedded tree is mostly influenced by the size of common prefixes. The constant parameter \(x\), on the other hand, can be set to 15, which is the size of a decision node and an empty span node. With the aforementioned setup of 32 elements per search range and a page size of 64 KiB, we can assume that the embedded structure, excluding span nodes, fits into a couple of cache lines; our evaluation also supports this assumption.
Efficient lookups within the limited search ranges are the second important objective of our approach. With the indirection vector being the entry point for the subsequent binary search, it is beneficial to prefetch most of the accessed slots. In our implementation, each slot within the indirection vector occupies exactly 12 bytes. Therefore, with 32 elements per search range, only six cache lines are required to accommodate the entire section of the indirection vector. Recall that it is a common optimization strategy to store the prefix of a key within the indirection vector as an unsigned integer variable. The B2-tree, however, uses this space to store a substring of each key, since the prefixes are already part of the embedded tree. We refer to this substring as the infix. The stored infix values within the indirection vector are usually more decisive, since the embedded tree has already confirmed the equality of all prefix bytes. Overall, this implies that fewer indirection steps have to be taken to fetch the remainder of a key.
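The cache-line count follows from a short calculation: 32 slots of 12 bytes occupy 384 bytes, which is exactly six lines of 64 bytes. The sketch below reproduces this arithmetic; the function name and the `start_offset` parameter are hypothetical, and the count of six assumes that the section starts on a cache-line boundary.

```python
def cache_lines_touched(elements: int = 32, slot_bytes: int = 12,
                        line_bytes: int = 64, start_offset: int = 0) -> int:
    """Number of cache lines covered by one section of the indirection vector.

    start_offset is the byte offset of the section relative to a
    cache-line boundary (assumption: 0 means perfectly aligned).
    """
    first_line = start_offset // line_bytes
    last_line = (start_offset + elements * slot_bytes - 1) // line_bytes
    return last_line - first_line + 1
```

With the default parameters the function returns 6. A section that starts in the middle of a cache line can touch one additional line.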
Concurrency
The B2-tree was designed with concurrent access via optimistic latching approaches in mind. While this approach adapts well to most vanilla B-tree implementations, other architectures may require additional logic. This section covers all adaptations and changes required by the B2-tree in order to ensure correctness in the presence of concurrent accesses.
Optimistic latching approaches often require additional checks in order to guarantee thread safety. Leis et al. [16] list two issues that may arise through the use of speculative locking techniques such as OLC. The first aspect concerns the validity of memory accesses. Any pointer obtained during a speculative read may point to an invalid address due to concurrent write operations to the pointer’s source. Readers hence have to ensure that the pointer value was read from a safe state. This issue can be prevented by introducing additional validation checks. Before accessing the address of a speculatively obtained pointer, the reader has to compare its stored lock version with the version currently stored within the node. Any information obtained before the validation has to be considered invalid if those versions differ. Usually, an operation is restarted upon encountering such a situation.
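The validate-and-restart pattern can be illustrated with a single-threaded model. This is a conceptual sketch only: the class `VersionedNode`, the function `optimistic_read`, and the restart limit are hypothetical, the version is a plain counter rather than an atomic lock word, and a concurrent writer is simulated by calling `write` from inside the read callback.

```python
class VersionedNode:
    """A node guarded by a version counter (stand-in for an optimistic lock)."""

    def __init__(self, payload: bytes) -> None:
        self.version = 0
        self.payload = payload

    def write(self, payload: bytes) -> None:
        # Every modification bumps the version so that readers can detect it.
        self.payload = payload
        self.version += 1


def optimistic_read(node: VersionedNode, read, max_restarts: int = 16):
    """Run read(node) speculatively and restart if the version changed."""
    for _ in range(max_restarts):
        seen = node.version
        result = read(node)
        if node.version == seen:  # validation: nothing changed meanwhile
            return result
        # Otherwise the result is discarded and the operation restarts.
    raise RuntimeError("too many restarts")
```

A read that is interrupted by a write observes a changed version, discards its result, and succeeds on the next attempt.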
Secondly, algorithms have to be designed such that their termination is guaranteed in the presence of write operations performed by interleaving threads. Leis et al. discuss one such potential issue concerning the intra-node binary search implementation. They note that its design has to ensure that the search loop can terminate in the presence of concurrent writes [16]. Optimistically operating algorithms therefore have to ensure that speculatively obtained pointers are never accessed without validation and that termination in the presence of concurrent writes is guaranteed.
However, the presented traversal algorithm does not guarantee termination without the introduction of further logic. One main aspect concerns the observation that span nodes can contain arbitrary byte sequences. It is hence possible to construct a key containing a byte sequence that resembles a valid node. Such a node may also contain links pointing to itself. An incoming searcher may then end up in a cycle due to previous modifications performed by an interleaving writer which had conducted modifications to the embedded structure in said manner.
To prevent issues such as the one described, certain countermeasures have to be taken. We have to ensure that the traversal progresses with every new node and that node pointers do not exceed the boundary of their containing page. We could have used the validation scheme presented by Leis et al. [16], which would require a validation of the optimistic lock’s version after each node fetch. However, we can instead use the fact that in our implementation each parent node has a smaller address than any of its children: checking that every obtained node pointer is larger than the current one and lies within the boundary of the current page guarantees progress. Note that any search range obtained through the embedded tree is also a possible candidate leading to invalid reads. We hence have to ensure that each obtained boundary value also lies within the boundary of the currently processed page. Our binary search implementation, which is performed directly afterwards, trivially fulfills the previously described termination requirement.
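The progress and boundary checks can be sketched on a simplified model in which the embedded tree is reduced to a map from node offsets to child offsets. The names `RestartLookup` and `follow_links` are hypothetical, and raising an exception stands in for restarting the lookup.

```python
class RestartLookup(Exception):
    """Signals that a speculative read was inconsistent."""


def follow_links(links: dict, page_size: int, start: int = 0) -> int:
    """Follow child offsets until a node without a child is reached.

    links maps a node offset to the offset of its child (None for a leaf).
    Every step must move to a strictly larger offset inside the page, so
    the loop terminates after at most page_size steps even if the page
    content was corrupted by an interleaving writer.
    """
    offset = start
    while links.get(offset) is not None:
        child = links[offset]
        if not offset < child < page_size:
            raise RestartLookup(f"invalid child offset {child}")
        offset = child
    return offset
```

A node that links to itself, as in the cycle scenario described above, fails the strict-progress check and triggers a restart instead of an endless loop.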
Insert and delete operations do not require any further validation steps, since they do not depend on any unvalidated speculative reads and exclusive locks will be held during such operations anyway.