Predecessor on the Ultra-Wide Word RAM

We consider the predecessor problem on the ultra-wide word RAM model of computation, which extends the word RAM model with 'ultrawords' consisting of $w^2$ bits [TAMC, 2015]. The model supports arithmetic and boolean operations on ultrawords, in addition to 'scattered' memory operations that access or modify $w$ (potentially non-contiguous) memory addresses simultaneously. The ultra-wide word RAM model captures (and idealizes) modern vector processor architectures. Our main result is a simple, linear space data structure that supports predecessor in constant time and updates in amortized, expected constant time. This improves the space of the previous constant time solution that uses space in the order of the size of the universe. Our result holds even in a weaker model where ultrawords consist of $w^{1+\epsilon}$ bits for any $\epsilon>0 $. It is based on a new implementation of the classic $x$-fast trie data structure of Willard [Inform. Process. Lett. 17(2), 1983] combined with a new dictionary data structure that supports fast parallel lookups.


Introduction
Let S be a set of n w-bit integers. The predecessor problem is to maintain S under the following operations.
• predecessor(x): return the largest y ∈ S such that y ≤ x.
• insert(x): add x to S.
• delete(x): remove x from S.
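To pin down the semantics of the three operations, the following is a minimal reference sketch using a sorted Python list. It is not the data structure of this paper: queries take logarithmic time and updates linear time, and the class name `SortedListPredecessor` is a hypothetical name chosen for illustration.

```python
import bisect


class SortedListPredecessor:
    """Reference semantics only: O(log n) queries, O(n) updates."""

    def __init__(self):
        self._keys = []  # sorted, distinct integers

    def insert(self, x):
        i = bisect.bisect_left(self._keys, x)
        if i == len(self._keys) or self._keys[i] != x:
            self._keys.insert(i, x)

    def delete(self, x):
        i = bisect.bisect_left(self._keys, x)
        if i < len(self._keys) and self._keys[i] == x:
            self._keys.pop(i)

    def predecessor(self, x):
        """Largest y in S with y <= x, or None if no such y exists."""
        i = bisect.bisect_right(self._keys, x)
        return self._keys[i - 1] if i > 0 else None
```

Note that predecessor(x) returns x itself when x is in S, matching the definition above.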
On the word RAM model of computation, the complexity of the problem is well-understood with the following tight upper and lower bound on the time for operations given by Pȃtraşcu and Thorup [34].
From the upper bound perspective, the first branch matches dynamic fusion trees [34], the second branch is based on an extension of the techniques from Beame and Fich [5], and the last branch is based on an extension of dynamic van Emde Boas trees [39]. Note that the lower bound implies that we cannot support operations in constant time for general n and w. Hence, a natural question is whether practical models of computation capturing modern hardware allow us to overcome the superconstant lower bound. One such model is the RAM with byte overlap (RAMBO) by Brodnik et al. [15]. This model extends the word RAM model by adding a set of special words that share bits; flipping a bit in one word will also affect all the other words that share that bit. The precise model is determined by the layout of the shared bits. It is feasible to make hardware based on this model, and prototypes have been built [28]. In the RAMBO model, Brodnik et al. [15] gave a predecessor data structure using constant time per operation with $O(2^w/w)$ space (counting both regular words and shared words). They also gave a randomized version of the solution that uses constant time with high probability and reduces the regular space to O(n) (but still needs $\Omega(2^w/w)$ space for the shared words). In both cases, the total space is near-linear in the size of the universe.
More recently, Farzan et al. [22] introduced the ultra-wide word RAM model (UWRAM). The UWRAM extends the word RAM model by adding special ultrawords of $w^2$ bits. The model supports standard boolean and arithmetic operations on ultrawords, as well as scattered memory operations that access w words in memory in parallel. The UWRAM model captures (and idealizes) modern vector processing architectures [16,35,37] (see Section 2 for details of the model). Farzan et al. [22] showed how to simulate algorithms for the RAMBO model on the UWRAM at the cost of increasing the space by a polylogarithmic factor. Simulating the above RAMBO solution for the predecessor problem, they gave a solution to the predecessor problem on the UWRAM using worst case constant time for all operations and $O(w2^w)$ space.

Our Results
We revisit the predecessor problem on the UWRAM and show the following main result.
Theorem 1 Given a set of n w-bit integers, we can construct an O(n) space data structure on a UWRAM that supports predecessor in constant time and insert and delete in amortized expected constant time. The result holds even when ultrawords consist of $w^{1+\epsilon}$ bits for any fixed $\epsilon > 0$.
Compared to the previous result of Farzan et al. [22], Theorem 1 significantly reduces the space from $O(w2^w)$ to linear while maintaining constant time for operations (note that query time is worst-case, while updates are amortized expected). Furthermore, our result works in a weaker model where ultrawords consist of only $w^{1+\epsilon}$ bits for any arbitrarily small $\epsilon > 0$. In this restricted model we limit our reliance on the powerful scattered memory operations by allowing them to access only $w^\epsilon$ words in memory in parallel.
A key component in our solution is a new dictionary data structure of independent interest that supports fast parallel lookups on the UWRAM. We define the problem as follows. Recall that an ultraword X consists of $w^2$ (or $w^{1+\epsilon}$) bits. We view X as divided into w (or $w^\epsilon$) words of w consecutive bits each, numbered from right to left starting from 0. The ith word in X is denoted $X\langle i \rangle$ (we discuss the model in detail in Section 2). Given a set S of n w-bit integers, the $w^\epsilon$-parallel dictionary problem is to maintain S under the following operations.
• pMember(X): return an ultraword I where $I\langle i \rangle = 1$ if $X\langle i \rangle \in S$ and $I\langle i \rangle = 0$ otherwise.
• insert(x): Add x to S.
• delete(x): Remove x from S.
Thus, pMember takes an ultraword X of $w^\epsilon$ integers and returns an ultraword encoding which of these integers are in S. To the best of our knowledge, the $w^\epsilon$-parallel dictionary problem has not been studied before. We show the following result.
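The input and output format of pMember can be illustrated with a sequential reference sketch in which an ultraword is modelled as a Python integer. The word size `W = 8` and the helper names `pack`, `unpack` and `p_member_reference` are assumptions made for this illustration; the sketch only fixes the semantics and does not run in constant time.

```python
W = 8  # toy word size in bits (assumption for the example)


def pack(words, w=W):
    """Pack [x_0, x_1, ...] into one integer; x_0 is the rightmost component."""
    X = 0
    for i, x in enumerate(words):
        X |= (x & ((1 << w) - 1)) << (i * w)
    return X


def unpack(X, count, w=W):
    """Inverse of pack for a known number of components."""
    return [(X >> (i * w)) & ((1 << w) - 1) for i in range(count)]


def p_member_reference(S, X, count, w=W):
    """Component i of the result is 1 if component i of X is in S, else 0."""
    return pack([1 if x in S else 0 for x in unpack(X, count, w)], w)
```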
Theorem 2 Given a set of n w-bit integers on a UWRAM with $w^{1+\epsilon}$-bit ultrawords for any fixed $\epsilon > 0$, we can construct an $O(n + w^\epsilon)$-space data structure that supports pMember queries in worst case constant time and insert and delete in amortized expected constant time.
Note that the queries are worst-case constant time, while the updates are amortized expected constant time. The time bounds of Theorem 2 thus match the well-known dynamic perfect hashing structure of Dietzfelbinger et al. [20] (which is also the basis of our solution), except that the queries are parallel. The space is linear except for the additive $w^\epsilon$ term, which is needed even for storing the input to the pMember query.

Techniques
Our results are achieved by novel and efficient parallel implementations of well-known sequential data structures.
Our parallel dictionary structure of Theorem 2 is based on the dynamic perfect hashing structure of Dietzfelbinger et al. [20]. This is a two-level data structure similar to the classic static perfect hashing structure of Fredman et al. [23]. At the first level, a universal hash function partitions the input into smaller subsets, each of which is then resolved at the second level using another universal hash function mapping the elements into sufficiently large tables. The structure supports (sequential) membership queries in worst-case constant time by evaluating the hash functions and navigating the structure accordingly. Updates are supported in amortized expected constant time by carefully rebuilding and rehashing the structure during execution. At any point in time the structure never uses more than O(n) space. We show how to parallelize the evaluation of a universal hash function (the simple and practically efficient multiply-shift hash function). Then, using the scattered memory access operations, we show how to access the corresponding entries in the structure in parallel. Our technique requires only small changes to the structure of Dietzfelbinger et al. [20] and we can directly apply their update operations to our solution. Thus, we are able to parallelize the worst-case constant time sequential membership query while maintaining the amortized expected constant update time bound of Dietzfelbinger et al. [20], leading to the bounds of Theorem 2.
We first show Theorem 1 for the simpler case $\epsilon = 1$ that corresponds to the original UWRAM model of [22]. Our data structure is based on the x-fast trie of Willard [40] combined with our parallel dictionary structure of Theorem 2. The x-fast trie consists of the trie T of the binary representation of the input set. Also, at each level i, the structure stores a dictionary containing the length-i prefixes of the input set. In total, this uses O(nw) space. The x-fast trie supports predecessor queries in O(log w) time by binary searching the levels (with the help of the dictionaries) to find the longest common prefix of the query and the input set. Though not designed for it, we can implement updates on the x-fast trie in O(w) time by directly updating each level of the dictionary accordingly. Our new predecessor structure, which we call the xtra-fast trie, instead stores the compact trie of the binary representation of the input set (i.e., the trie where paths of nodes with a single child are merged into a single edge). We store a dictionary representing the prefixes (as in the x-fast trie) using our parallel dictionary structure of Theorem 2, but now only for the branching nodes in the compact trie. This reduces the space to O(n). To support predecessor queries for an integer x, we generate all w prefixes of x and apply a parallel membership query on these in the dictionary. We show how to identify the longest match in parallel, which in turn allows us to identify the predecessor. In total this takes worst-case constant time for the predecessor query. To handle updates, we show how to modify the trie efficiently using scattered memory access operations and a constant number of dictionary updates, leading to the expected amortized constant time bound of Theorem 1.
We generalize our result for Theorem 1 to arbitrary $\epsilon > 0$ as follows. The main challenge is that pMember now supports only $w^\epsilon$ member queries in parallel, so we cannot search for all prefixes of x simultaneously. Instead, we adapt ideas from the y-fast trie by Willard [40] to our xtra-fast trie. The y-fast trie works as follows. Partition the input set S into O(n/w) sets $S_1, \ldots, S_t$ where each $S_i$ consists of w consecutive values from S, i.e., where $\max(S_i) < \min(S_{i+1})$ for each i. Build an x-fast trie over the set $S' = \{\max(S_i) \mid i = 1, \ldots, t-1\}$, which takes O(n) space since $|S'| = O(n/w)$, and a balanced binary search tree over each $S_i$. To determine predecessor(x), do a predecessor query in the x-fast trie to determine the set $S_i$ containing the predecessor of x and do a predecessor query in $S_i$, both of which take O(log w) time. Insertions are supported by inserting x into the appropriate set $S_i$. If $S_i$ subsequently becomes too large (e.g., larger than 2w), split $S_i$ into two and add an additional element to $S'$ in the x-fast trie. This takes O(w) time, which is constant when amortized over the Ω(w) insertions necessary for $S_i$ to grow too large. Deletions are supported similarly. In our data structure we use dynamic fusion trees by Pȃtraşcu and Thorup [34] for each $S_i$, which solve the predecessor problem on sets of size $w^{O(1)}$ in linear space and constant time per operation. We build an uncompacted xtra-fast trie over $S'$, i.e., the xtra-fast trie that also includes non-branching nodes. To support fast queries and updates for an integer x, we use the scattered memory operations to simulate a $w^\epsilon$-way search (as opposed to a binary search) to find the longest common prefix between x and $S'$. Each round shrinks the range of remaining candidate prefix lengths by a factor $w^\epsilon$, leading to a running time of $O(\log_{w^\epsilon} w) = O(1/\epsilon)$, i.e., constant for any fixed $\epsilon$.
In our data structures we only need to store a constant number of ultrawords during the computation. This is important since modern vector processor architectures only have a limited number of ultraword registers.

Outline
In Section 2 we describe the UWRAM model of computation and some useful procedures. In Sections 3 and 4 we show how to do parallel hash function evaluation and $w^\epsilon$-parallel dictionaries, proving Theorem 2. Finally, in Section 5 we prove Theorem 1 for $\epsilon = 1$, which we generalize to arbitrary $\epsilon > 0$ in Section 6.

The Ultra-Wide Word RAM Model
The word RAM model of computation [25] consists of an unbounded memory of w-bit words and a standard instruction set including arithmetic, boolean, and bitwise operations (denoted '&', '|' and '∼' for and, or and not) and shifts (denoted '$\ll$' and '$\gg$') such as those available in standard programming languages (e.g., C). We make the standard assumption that we can store a pointer into the input in a single word and hence w ≥ log n, where n is the size of the input, and for simplicity we assume that w is even. We denote the address of x in memory as addr(x), and the address of an array is the address of its first index. The time complexity of a word RAM algorithm is the number of instructions and the space is the number of words stored by the algorithm. The ultra-wide word RAM (UWRAM) model of computation [22] extends the word RAM model with special ultrawords of $w^2$ bits (in Section 6 we consider the case where ultrawords have $w^{1+\epsilon}$ bits for any fixed $\epsilon > 0$). As in [22], we distinguish between the restricted UWRAM that supports a minimal set of instructions on ultrawords consisting of addition, subtraction, shifts, and bitwise boolean operations, and the multiplication UWRAM that additionally supports multiplications. We extend the notation for bitwise operations and shifts to ultrawords. The UWRAM (both restricted and multiplication) also supports contiguous and scattered memory access operations, as described below. The time complexity is the number of instructions (on standard words or ultrawords) and the space complexity is the number of words used by the algorithms, where each ultraword is counted as w words. The UWRAM model captures (and idealizes) modern vector processing architectures [16,35,37]. See also Farzan et al. [22] for a detailed discussion of the applicability of the UWRAM model.

Instructions and Componentwise Operations
Recall that ultrawords consist of $w^2$ bits. We often view an ultraword X as divided into w words of w consecutive bits each, which we call the components of X. We number the components in X from right-to-left starting from 0 and use the notation $X\langle i \rangle$ to denote the ith word in X (see Figure 1). We will also use the notation $X = \langle x_{w-1}, \ldots, x_0 \rangle$, denoting that $X\langle i \rangle = x_i$.
We define a number of useful componentwise operations on ultrawords that we will need for our algorithms in the following. Let X and Y be ultrawords. The componentwise addition of X and Y, denoted X + Y, is the ultraword Z such that $Z\langle i \rangle = (X\langle i \rangle + Y\langle i \rangle) \bmod 2^w$. We define componentwise subtraction, denoted X − Y, and componentwise multiplication, denoted XY, similarly. The componentwise comparison of X and Y is the ultraword Z such that $Z\langle i \rangle = 1$ if $X\langle i \rangle < Y\langle i \rangle$ and 0 otherwise. Given another ultraword I where each component is either 0 or 1, we define the componentwise blend of X, Y, and I to be the ultraword Z such that $Z\langle i \rangle = X\langle i \rangle$ if $I\langle i \rangle = 0$ and $Z\langle i \rangle = Y\langle i \rangle$ if $I\langle i \rangle = 1$. Except for componentwise multiplication, all of the above componentwise operations can be implemented in constant time on the restricted UWRAM using standard word-level parallelism techniques [12,25] (see Appendix A for details on blend). For our purposes, we will need componentwise multiplication as an instruction (for evaluating hash functions in parallel) and thus we include this in the instruction set of the UWRAM. This is the UWRAM model that we will use throughout the rest of the paper. Note that all of the componentwise operations are widely supported directly in modern vector processing architectures. For instance, a componentwise multiplication (e.g., the vpmullw operation) is defined in Intel's AVX2 vector extension [17].
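As an illustration of the word-level parallelism behind these operations, the sketch below simulates componentwise addition and blend on a toy ultraword stored in a Python integer. The parameters `W = 8` and `N = 4` and all function names are assumptions for the example, and the blend convention (take X where the selector is 0, Y where it is 1) is the one used when combining the even and odd hash results in Section 3. The masking tricks are standard ones and not necessarily those of Appendix A.

```python
W = 8                      # toy word size in bits
N = 4                      # toy number of components per ultraword
ALL = (1 << (W * N)) - 1   # every bit of an ultraword
HIGH = sum(1 << (i * W + W - 1) for i in range(N))  # top bit of each component


def pack(vals):
    """Component 0 occupies the rightmost W bits."""
    return sum((v & ((1 << W) - 1)) << (i * W) for i, v in enumerate(vals))


def unpack(X):
    return [(X >> (i * W)) & ((1 << W) - 1) for i in range(N)]


def cw_add(X, Y):
    """Componentwise addition mod 2^W using a single wide addition."""
    low = (X & (ALL ^ HIGH)) + (Y & (ALL ^ HIGH))  # carries stop at the top bits
    return low ^ ((X ^ Y) & HIGH)                  # add the top bits carry-free


def blend(X, Y, I):
    """Z<i> = X<i> where I<i> = 0 and Z<i> = Y<i> where I<i> = 1."""
    sel = I * ((1 << W) - 1)   # expand each 0/1 component to all-0/all-1
    return X ^ ((X ^ Y) & sel)
```

Both functions use a constant number of wide operations, independent of the number of components.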
We will need componentwise operations on components that are small constant multiples of w. In particular, we will need a 2w-bit componentwise multiplication that multiplies w/2 components of w bits and returns the w/2 resulting components of 2w bits. Specifically, let $X = \langle 0, x_{w-2}, \ldots, 0, x_2, 0, x_0 \rangle$ and $Y = \langle 0, y_{w-2}, \ldots, 0, y_2, 0, y_0 \rangle$, i.e., X and Y store w/2 components aligned at the even positions. The 2w-bit componentwise multiplication of X and Y is the ultraword $Z = \langle z^+_{w-2}, z^-_{w-2}, \ldots, z^+_0, z^-_0 \rangle$ where $z^+_i$ and $z^-_i$ are the leftmost and rightmost w bits, respectively, of the 2w-bit product of $x_i$ and $y_i$. We can implement 2w-bit componentwise multiplication using standard techniques in constant time on the UWRAM. See Appendix A for details.
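The layout of the result of a 2w-bit componentwise multiplication can be checked with a small sequential reference sketch. The function name `mul_2w` and the explicit loop are assumptions for illustration; the constant-time implementation is the one deferred to Appendix A.

```python
def mul_2w(X, Y, w, count):
    """Reference 2w-bit componentwise multiplication.

    X and Y hold w-bit values at even component positions (odd positions 0).
    The 2w-bit product of pair i fills components i (low half) and i + 1
    (high half) of the result.
    """
    mask = (1 << w) - 1
    Z = 0
    for i in range(0, count, 2):
        x = (X >> (i * w)) & mask
        y = (Y >> (i * w)) & mask
        Z |= (x * y) << (i * w)   # product is below 2^(2w), so pairs never overlap
    return Z
```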
Finally, the UWRAM model supports the compress operation that, given X, returns the word that results from concatenating the rightmost bit of each component of X.We do not need the corresponding inverse spread operation, defined by Farzan et al. [22].

Memory Access
The UWRAM supports standard memory access operations that read or write a single word or a sequence of w contiguous words. More interestingly, the UWRAM also supports scattered access operations that access w memory locations (not necessarily contiguous) in parallel. Given an ultraword A containing w memory addresses, a scattered read loads the contents of the addresses into an ultraword X, such that $X\langle i \rangle$ contains the contents of memory location $A\langle i \rangle$. Given ultrawords X and A, a scattered write sets the contents of memory location $A\langle i \rangle$ to be $X\langle i \rangle$. Scattered memory accesses capture the memory model used in IBM's Cell architecture [16]. They also appear (e.g., vpgatherdd) in Intel's AVX2 vector extension [17]. Scattered memory access operations were also proposed by Larsen and Pagh [27] in the context of the I/O model of computation. Note that while the addresses for scattered writes must be distinct, we can read simultaneously from the same address. We can use this to efficiently copy x into all w components of an ultraword X. To do so, create the ultraword $\langle 0, \ldots, 0 \rangle$ by left-shifting any ultraword by $w^2$ bits, write x to address 0, and do a scattered read on $\langle 0, \ldots, 0 \rangle$. We say that we load x into X.
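The scattered operations and the load trick can be simulated as follows, with memory modelled as a Python list of words. The toy parameters `W = 8` and `N = 4` and the function names are assumptions for the example; each loop stands in for a single UWRAM instruction.

```python
W, N = 8, 4            # toy word size and number of components
MASK = (1 << W) - 1


def component(X, i):
    return (X >> (i * W)) & MASK


def scattered_read(memory, A):
    """X<i> = memory[A<i>]; repeated addresses are allowed."""
    X = 0
    for i in range(N):
        X |= (memory[component(A, i)] & MASK) << (i * W)
    return X


def scattered_write(memory, A, X):
    """memory[A<i>] = X<i>; the addresses must be distinct."""
    addrs = [component(A, i) for i in range(N)]
    if len(set(addrs)) != N:
        raise ValueError("scattered write needs distinct addresses")
    for i, addr in enumerate(addrs):
        memory[addr] = component(X, i)


def load(memory, x):
    """Copy x into every component: write x to address 0, then gather from <0,...,0>."""
    memory[0] = x & MASK
    zero = 0
    return scattered_read(memory, zero)
```

As in the text, load overwrites the word at address 0, so that address is assumed to be reserved for this purpose.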

Computing Multiply-Shift in Parallel
We show how to efficiently compute a universal hash function in parallel. The multiply-shift hashing scheme is a standard and practically efficient family of universal hash functions due to Dietzfelbinger et al. [19]. For some integer $1 \le c \le w$, define the class $H_c = \{h_a \mid 0 < a < 2^w \text{ and } a \text{ is odd}\}$ of hash functions where $h_a(x) = (ax \bmod 2^w) \gg (w - c)$. Each function in $H_c$ maps from w-bit to c-bit integers. The class $H_c$ is universal in the sense that for any $x \neq y$ and for $h_a \in H_c$ selected uniformly at random, it holds that $P[h_a(x) = h_a(y)] \le 2/2^c$. We will show how to evaluate w such functions in constant time. Given the ultrawords $X = \langle x_{w-1}, \ldots, x_0 \rangle$, $A = \langle a_{w-1}, \ldots, a_0 \rangle$ and $C = \langle 2^{c_{w-1}}, \ldots, 2^{c_0} \rangle$, we compute the ultraword $H = \langle h_{w-1}(x_{w-1}), \ldots, h_0(x_0) \rangle$ where $h_i(x) = (a_i x \bmod 2^w) \gg (w - c_i)$. To do so we first evaluate the functions in two rounds of w/2 functions each, and then combine the results.
Step 1: Evaluate the hash function on the even indices. We construct an ultraword $H_{\text{even}}$ containing the values $h_i(x_i)$ at all even indices i. First construct the ultrawords $C' = \langle 0, 2^{c_{w-2}}, \ldots, 0, 2^{c_0} \rangle$ and $T = \langle 0, a_{w-2} x_{w-2} \bmod 2^w, \ldots, 0, a_0 x_0 \bmod 2^w \rangle$. To do so, we do a componentwise multiplication of C with the constant $M = \langle 0, 1, \ldots, 0, 1 \rangle$ and componentwise multiplications of A, X, and M. Then, we do a 2w-bit multiplication of $C'$ and T and right shift the result by w. This produces the ultraword $H_{\text{even}} = \langle \star, h_{w-2}(x_{w-2}), \ldots, \star, h_0(x_0) \rangle$. Thus, all even indices in $H_{\text{even}}$ store the resulting hash values of the integers at the even indices in the input. We will not need the values in the odd indices (resulting from the 2w-bit multiplication and the right shift) and therefore these are marked with a wildcard symbol $\star$.
Step 2: Evaluate the hash function on the odd indices. Symmetrically, we now construct the ultraword $H_{\text{odd}}$ containing $h_i(x_i)$ at all odd indices i. To do so, repeat step 1 and modify the shifting to align the computation for the odd indices. More precisely, right shift X, C and A by w and repeat step 1, then left shift the result by w to align the results back to the odd positions. This produces the ultraword $H_{\text{odd}} = \langle h_{w-1}(x_{w-1}), \star, \ldots, h_1(x_1), \star \rangle$.
Step 3: Combine the results. Finally, we combine the results by blending $H_{\text{even}}$ and $H_{\text{odd}}$ using $I = \langle 1, \ldots, 1 \rangle - M$, producing the ultraword H consisting of the even indices of $H_{\text{even}}$ and the odd indices of $H_{\text{odd}}$.
This takes constant time since componentwise multiplication, 2w-bit multiplication, shifting, blending, loading 1 into $\langle 1, \ldots, 1 \rangle$, and componentwise subtraction all run in constant time. Hence, we can evaluate each case of w/2 hash functions in constant time and combine the results in constant time. In summary, we have the following result, which we refer to as Lemma 1: w multiply-shift hash functions can be evaluated on w inputs in parallel in constant time on the UWRAM.
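The three steps above can be simulated on Python integers to check the alignment of the even and odd rounds. This is a sketch under stated assumptions: the toy parameters `W = N = 8`, the helper names, and the loop-based stand-ins for the componentwise instructions are all chosen for illustration, and each $c_i$ is taken to be below W so that $2^{c_i}$ fits in one component.

```python
W = 8                       # toy word size in bits
N = 8                       # toy number of components (even)
MASK = (1 << W) - 1
FULL = (1 << (W * N)) - 1


def pack(vals):
    return sum((v & MASK) << (i * W) for i, v in enumerate(vals))


def unpack(X):
    return [(X >> (i * W)) & MASK for i in range(N)]


def cw_mul(X, Y):
    """Stand-in for the componentwise multiplication instruction (mod 2^W)."""
    return pack([(x * y) & MASK for x, y in zip(unpack(X), unpack(Y))])


def mul_2w(X, Y):
    """Stand-in for 2w-bit multiplication of the even components."""
    xs, ys = unpack(X), unpack(Y)
    return sum((xs[i] * ys[i]) << (i * W) for i in range(0, N, 2))


M = pack([1, 0] * (N // 2))     # <0, 1, ..., 0, 1>: ones at the even indices


def eval_even(X, A, C):
    """Step 1: hash values at even indices, unspecified values at odd indices."""
    c_even = cw_mul(C, M)
    t_even = cw_mul(cw_mul(A, X), M)
    return mul_2w(c_even, t_even) >> W   # high halves move to even positions


def parallel_multiply_shift(X, A, C):
    h_even = eval_even(X, A, C)
    h_odd = (eval_even(X >> W, A >> W, C >> W) << W) & FULL   # step 2
    sel = (pack([1] * N) - M) * MASK     # all-ones at odd components
    return h_even ^ ((h_even ^ h_odd) & sel)   # step 3: blend
```

The high w bits of the product $2^{c_i} \cdot (a_i x_i \bmod 2^w)$ equal $(a_i x_i \bmod 2^w) \gg (w - c_i)$, which is why the right shift by w in step 1 leaves the hash values at the even positions.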

The $w^\epsilon$-Parallel Dictionary
We now show how to construct the $w^\epsilon$-parallel dictionary of Theorem 2. Throughout the section we assume that $\epsilon = 1$, but the result generalizes to any $\epsilon > 0$ in a straightforward manner. Our data structure is based on a dictionary by Dietzfelbinger et al. that implements a dynamic perfect hashing strategy [20]. Their dictionary already supports insert and delete in amortized expected constant time. Furthermore, it supports sequential member queries (i.e., "is x ∈ S?") in worst case constant time. We will show that we can use scattered memory operations to run w member queries simultaneously, thus implementing pMember in constant time.

Dynamic Perfect Hashing
In this section we briefly describe the contents of the data structure of Dietzfelbinger et al. [20]. Note that we use the multiply-shift hashing scheme, while they use another class of universal hash functions. Multiply-shift satisfies all the necessary constraints and the analysis from [20] still works. It does however incur a constant-factor space overhead for our arrays since the range of a multiply-shift function is a power of two.
The main idea of the data structure is as follows. Let S be a set of w-bit integers. Choose $h \in H_c$ and partition S into $2^c = \Theta(n)$ sets $S_0, \ldots, S_{2^c - 1}$ where $S_i = \{x \mid x \in S \text{ and } h(x) = i\}$. Each set $S_i$ is stored in a separate array using a hash function $h_i$. Dietzfelbinger et al. show how to implement the operations insert and delete such that they maintain that $h_i$ has no collisions on $S_i$.
The data structure consists of the following.
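• For each $S_i$, store an array $T_i$ of size $2^{c_i}$ such that $T_i[h_i(x)] = x$ for each $x \in S_i$, i.e., the position that x hashes to stores x. Here $h_i(x) = (a_i x \bmod 2^w) \gg (w - c_i)$. If there is no $x \in S_i$ that hashes to j, then $T_i[j] = 2^{w-1}$ if j = 0 and $T_i[j] = 0$ otherwise. We claim that $h_i(0)$ is always zero and $h_i(2^{w-1})$ is never zero, so it follows from this construction that $x \in S_i$ if and only if $T_i[h_i(x)] = x$. The first claim is immediate, and for the second we have $h_i(2^{w-1}) = (a_i 2^{w-1} \bmod 2^w) \gg (w - c_i) = 2^{w-1} \gg (w - c_i) = 2^{c_i - 1} \ge 1$. The second step follows since $a_i$ is odd; then $a_i 2^{w-1} = 2^{w-1} + (a_i - 1) 2^{w-1}$, and the latter term is 0 modulo $2^w$ since $a_i - 1$ is even. The last step follows because $c_i \ge 1$.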
• An array T of size $2^c$. At index T[i] we store the 5-tuple $(\mathrm{addr}(T_i), 2^{c_i}, a_i, \star, \star)$ where the entries marked $\star$ are bookkeeping values used by insert and delete. Note that $2^{c_i}$ and $a_i$ encode $h_i$.
• The integers a and $2^c$ representing the top-level hash function $h(x) = (ax \bmod 2^w) \gg (w - c)$, as well as addr(T).
It follows from this construction that $x \in S$ if and only if $T_i[h_i(x)] = x$ where i = h(x). Dietzfelbinger et al. show that the data structure uses linear space, that member runs in worst-case constant time, and that insert and delete run in amortized expected constant time [20].
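The role of the two filler values can be checked with a small static sketch of the two-level layout. This is not the dynamic structure of Dietzfelbinger et al.: it builds the tables once by retrying random multipliers, it omits the bookkeeping values, and the word size `W = 16`, the table sizes and all function names are assumptions for the example.

```python
import random

W = 16  # toy word size in bits


def h(a, c, x):
    """Multiply-shift: (a*x mod 2^W) >> (W - c) for odd a and 1 <= c <= W."""
    return ((a * x) & ((1 << W) - 1)) >> (W - c)


def empty_value(j):
    """Filler for an unused slot: 2^(W-1) in slot 0 and 0 in every other slot."""
    return 1 << (W - 1) if j == 0 else 0


def build_bucket(keys, rng):
    """Retry random odd multipliers until h_i has no collisions on the bucket."""
    c = max(1, (len(keys) ** 2).bit_length())   # table of more than |S_i|^2 slots
    while True:
        a = rng.randrange(1, 1 << W, 2)
        table = [empty_value(j) for j in range(1 << c)]
        for x in keys:
            j = h(a, c, x)
            if table[j] != empty_value(j):
                break                            # collision, try another multiplier
            table[j] = x
        else:
            return a, c, table


def build(keys, seed=0):
    """Static two-level structure over distinct W-bit keys."""
    rng = random.Random(seed)
    c = max(1, len(keys).bit_length())
    a = rng.randrange(1, 1 << W, 2)
    parts = [[] for _ in range(1 << c)]
    for x in keys:
        parts[h(a, c, x)].append(x)
    return a, c, [build_bucket(p, rng) for p in parts]


def member(structure, x):
    """x is in S if and only if T_i[h_i(x)] = x where i = h(x)."""
    a, c, buckets = structure
    a_i, c_i, table = buckets[h(a, c, x)]
    return table[h(a_i, c_i, x)] == x
```

Because 0 always hashes to slot 0 and $2^{w-1}$ never does, neither filler value can be mistaken for a stored key, so a single comparison decides membership even for the keys 0 and $2^{w-1}$.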
Extending the Data Structure. We extend this data structure by storing the constant $M = \langle 0, 1, \ldots, 0, 1, 0, 1 \rangle$ from Section 3 used to evaluate multiply-shift functions in parallel. This increases the space of the data structure to O(n + w). Note that linear space in w is needed even to store the input to a pMember query.

Parallel Queries
In this section, we begin by describing a single member query, before we show how to run w copies of the member query in parallel to support pMember. We compute member(x) as follows.
1. Compute j = h(x).
2. Let q = addr(T) + 5j = addr(T[j]) (recall that each index in T stores five words). Read the values stored at q, q + 1 and q + 2 to get respectively $\mathrm{addr}(T_j)$, $2^{c_j}$ and $a_j$, the first three words stored at T[j].
3. Check whether the value stored at $\mathrm{addr}(T_j) + h_j(x)$ is x.
The parallel algorithm runs this algorithm for all w inputs simultaneously. Given $X = \langle x_{w-1}, \ldots, x_0 \rangle$ we implement pMember(X) as follows. Each of the steps below executes the corresponding step above in parallel for each of the w inputs.
Step 3: Check whether the inputs are present in the dictionary. Do a scattered read of P + K, where $P\langle i \rangle = \mathrm{addr}(T_j)$ and $K\langle i \rangle = h_j(x_i)$ for $j = h(x_i)$, and name the result R. Then $R\langle i \rangle = T_j[h_j(x_i)]$ where $j = h(x_i)$. Return the result I of componentwise equality between X and R. That is, $I\langle i \rangle = 1$ if $X\langle i \rangle = R\langle i \rangle$ and $I\langle i \rangle = 0$ otherwise.
Evaluating the hash functions in steps 1 and 2 takes constant time according to Lemma 1. The remaining operations are scattered reads, loads and componentwise operations, all of which run in constant time. Since there is only a constant number of operations, pMember runs in constant time. This concludes the proof of Theorem 2.
Note that both the algorithm for parallel hashing and the dictionary generalize to the case with $w^{1+\epsilon}$-bit ultrawords and $w^\epsilon$ inputs in a straightforward manner. In this case, the space is $O(n + w^\epsilon)$ since the ultraword constants use only $w^\epsilon$ space.

Satellite Data
Suppose we associate some value data(x) with each x ∈ S. We extend the data structure to support the following operation, where $X = \langle x_{w-1}, \ldots, x_0 \rangle$ as above.
• pRetrieve(X): returns a pair (I, D) where I is the result of pMember(X) and $D\langle i \rangle = \mathrm{addr}(\mathrm{data}(x_i))$ if $x_i \in S$.
We return addr(data(x)) instead of data(x) since the data would not fit into an ultraword if data(x) requires more than one word to store. We extend the data structure as follows to support pRetrieve. Store two words for each index in $T_i$. For each $x \in S_i$, the first word in $T_i[h_i(x)]$ stores x and the second stores addr(data(x)). The remaining entries store either 0 or $2^{w-1}$, as above.
To do the retrieval, first compute I = pMember(X). However, in step 3, multiply K by $\langle 2, \ldots, 2 \rangle$ before the scattered read since each index in $T_i$ now stores two words. Also, add $\langle 1, \ldots, 1 \rangle$ to $P + \langle 2, \ldots, 2 \rangle K$ and do a scattered read to compute the ultraword D. The space of the data structure remains O(n + w) (assuming that data(x) uses constant space), and pRetrieve runs in constant time.

The xtra-fast Trie
In this section we prove Theorem 1 for the special case where $\epsilon = 1$, i.e., where ultrawords consist of $w^2$ bits. We generalize our result to arbitrary $\epsilon > 0$ in Section 6. Our data structure, the xtra-fast trie, supports predecessor in worst case constant time and insert and delete in amortized expected constant time. In our description we assume that we have keys of w − 1 bits each and we give a solution that uses O(n + w) space. At the end of this section we will reduce the space to O(n) and extend the solution to w-bit keys, proving Theorem 1 for $\epsilon = 1$.

Data Structure
Consider the compacted trie T over the binary representation of the elements in S. For each node v ∈ T define str(v) to be the bitstring encoded by the path from the root to v in T. Also let min(v) and max(v) be the smallest and largest leaves in the subtree of v, respectively. By min(v) and max(v) we refer both to a leaf and to the value the leaf represents.
For each edge (u, v) ∈ T , let label(u, v) be str(u) followed by the first bit on the edge (u, v).Define key(u, v) to be label(u, v) followed by a single 1-bit and w−|label(u, v)|−1 zeroes.Note that |key(u, v)| = w and that the keys of two distinct edges in T always differ.See Figure 2 for an example.
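The key construction can be written out directly on bitstrings. The function name `edge_key` is a hypothetical helper for illustration; the checked values are the keys listed for the example trie of Figure 2, where w = 7.

```python
def edge_key(label: str, w: int) -> int:
    """label(u, v) followed by a single 1-bit and w - |label| - 1 zeroes."""
    if not 1 <= len(label) <= w - 1:
        raise ValueError("a label has between 1 and w - 1 bits")
    return int(label + "1" + "0" * (w - len(label) - 1), 2)
```

The position of the rightmost 1-bit encodes the length of the label, which is why labels of different lengths can never produce the same key.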
We define the exit edge for an integer x to be the edge in T where the match of x ends.In other words, it is the edge (u, v) ∈ T such that label(u, v) is a prefix of x and |label(u, v)| is maximum.See Figure 2 for an example.It is possible that x has no exit edge if the root has fewer than two children.
Our data structure consists of the following:
• A sorted, doubly linked list L of the leaves of T, i.e., the elements of S.
• A dictionary D supporting parallel queries using Theorem 2. For each edge (u, v) ∈ T we store an entry in D with the key key(u, v) and data(u, v) = (addr(min(v)), addr(max(v))). Here, addr(min(v)) and addr(max(v)) are the addresses to the corresponding elements in L, and we denote the addresses to min(v) and max(v) as the min- and max-pointer of (u, v).
• The two ultraword constants M and H described in the next section.
Storing L and the ultraword constants takes O(n + w) space combined. Since T is compacted there are O(n) entries in D, so by Theorem 2 the dictionary also uses O(n + w) space.
Figure 2: The dashed edge and nodes illustrate how the trie would change if x = 110101 were inserted. The exit edge for x is (u, v) since we match the bitstring 1101 but do not match the next 1 on (u, v). Similarly, the exit edge for 100100 is (s, t). We have that $\mathrm{key}(u, v) = \mathrm{label}(u, v)\underline{1000} = 110\underline{1000}$ where the underlined part is what we append to the labels to disambiguate the keys. Similarly, $\mathrm{key}(r, s) = 1\underline{100000}$ and $\mathrm{key}(s, t) = 10\underline{10000}$. The dictionary entry of (s, u) has $\mathrm{key}(s, u) = 11\underline{10000}$, and the min- and max-pointer of (s, u) are addr(min(u)) and addr(max(u)). Similarly, the min-pointer of (r, s) is to min(s) = min(t) and the max-pointer is to max(s) = max(u). Note that if we insert x we would have to update the min-pointer of (s, u), since x < min(v). However, the min-pointer of (r, s) remains unchanged since min(t) < x.

Predecessor Queries
The main idea of the predecessor query for x is to first find the exit edge of x by simultaneously searching for all prefixes of x in D. Then we use the min- and max-pointer of the exit edge to find the predecessor of x. If x has no exit edge, then the root does not have an outgoing edge matching the leftmost bit of x. If the leftmost bit of x is 1, the predecessor of x is the largest leaf in the left subtree of the root, and otherwise x has no predecessor. Assuming that x has an exit edge, the procedure has three steps.
Step 1: Compute all prefixes of x. Let $b_{w-2} b_{w-3} \cdots b_0$ be the binary representation of x of length w − 1. We compute the ultraword $\hat{X}$ with $\hat{X}\langle i \rangle = b_{w-2} \cdots b_{w-1-i}\, 1\, 0^{w-i-1}$ for $0 \le i \le w - 1$. That is, $\hat{X}\langle i \rangle$ contains the prefix of x of length i followed by a 1-bit and w − i − 1 zeroes. Thus, for any edge (u, v) ∈ T such that label(u, v) is the length-i prefix of x, we have $\hat{X}\langle i \rangle = \mathrm{key}(u, v)$. We compute $\hat{X}$ as follows.
Let M be the constant such that $M\langle i \rangle$ consists of i consecutive 1-bits followed by w − i consecutive 0-bits. Let H be the constant where the (i + 1)th leftmost bit in $H\langle i \rangle$ is 1 and the remaining bits are zeroes. First load x into X such that $X = \langle x, x, \ldots, x \rangle$. Then compute $\hat{X} = (X \,\&\, M) \mid H$.
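The mask computation of step 1 can be simulated on a Python integer as follows. The sketch assumes that the (w − 1)-bit key is stored left-aligned in its w-bit word, i.e., shifted left by one position, so that masking with M keeps exactly the leading bits; this alignment is an assumption of the example, as is the function name `prefix_keys`.

```python
def prefix_keys(x, w):
    """Return the ultraword whose component i holds the length-i prefix of x,
    then a 1-bit, then zeroes, for i = 0, ..., w - 1."""
    word = (x << 1) & ((1 << w) - 1)      # assumption: key left-aligned in its word
    X = sum(word << (i * w) for i in range(w))                     # load x
    M = sum((((1 << i) - 1) << (w - i)) << (i * w) for i in range(w))
    H = sum((1 << (w - 1 - i)) << (i * w) for i in range(w))
    return (X & M) | H
```

With w = 7 and x = 110101 from the running example, component 3 of the result is 1101000, which is key(u, v) for the edge with label 110.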
Step 2: Find the exit edge (u, v) of x. First do $(I, P) = \mathrm{pRetrieve}(\hat{X})$ on D. Then compute c = compress(I) such that the ith rightmost bit in c is 1 if $I\langle i \rangle = 1$ and zero otherwise. Note that x has no exit edge if c = 0. Find the index k of the leftmost bit in c that is 1 (see [24]). Then $\hat{X}\langle k \rangle = \mathrm{key}(u, v)$ where (u, v) is the exit edge of x. Furthermore, the values stored at the addresses $P\langle k \rangle$ and $P\langle k \rangle + 1$ are the min- and max-pointers of (u, v), respectively.
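Selecting the longest match from the membership ultraword amounts to a compress followed by a most-significant-bit computation. The sketch below simulates both on Python integers; the function names are assumptions, the loop stands in for the single compress instruction, and `int.bit_length` stands in for the constant-time word RAM routine of [24].

```python
def compress(X, w, count):
    """Concatenate the rightmost bit of each of the count components of X."""
    c = 0
    for i in range(count):
        c |= ((X >> (i * w)) & 1) << i
    return c


def longest_match_index(I, w, count):
    """Index k of the leftmost 1-bit of compress(I), or None if there is no match."""
    c = compress(I, w, count)
    return c.bit_length() - 1 if c else None
```

In the example of Figure 2, the prefixes of x = 110101 of lengths 1, 2 and 3 match the labels of (r, s), (s, u) and (u, v), so the selected index is 3.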
Step 3: Find the predecessor of x. Use the min- and max-pointer of (u, v) found in step 2 to retrieve min(v) and max(v). If x ≥ max(v) then return max(v), otherwise return the element immediately left of min(v) in L. Note that there might not be an element immediately left of min(v) if x is smaller than everything in S, in which case x has no predecessor.
Since we search for all prefixes of x and take the edge corresponding to the longest prefix found, we find the exit edge (u, v) of x. If x ∈ S, then x = v = max(v) and we correctly return that x is the predecessor of itself. If x ∉ S then the path to where x would have been located if it were in T branches off (u, v) either to the left (if x < min(v)) or right (if x > max(v)). In the first case, predecessor(x) is the element located immediately left of min(v) in T, and in the second case predecessor(x) is max(v).
By Theorem 2 the parallel dictionary query in step 2 takes worst case constant time. Finding the leftmost bit that is 1 takes constant time on the word RAM [24]. The remaining operations are standard operations available in the model, so the procedure runs in constant time.
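The query logic can be checked end to end with a sequential simulation. In this sketch a Python dict plays the role of D, a sorted list plays the role of L, the loop over prefix lengths stands in for the single pRetrieve call, and the edge data stores the values min(v) and max(v) directly rather than addresses. The static `build` routine and all function names are assumptions for illustration and say nothing about how updates are handled.

```python
import bisect


def key(prefix, length, w):
    """w-bit key: the length-bit prefix, a 1-bit, then zeroes."""
    return (prefix << (w - length)) | (1 << (w - length - 1))


def lcp_len(a, b, nbits):
    """Length of the longest common prefix of two distinct nbits-bit integers."""
    return nbits - (a ^ b).bit_length()


def build(keys, w):
    """Return (L, D) for distinct (w-1)-bit keys: sorted list and edge dictionary."""
    L, D, nbits = sorted(keys), {}, w - 1

    def add_edges(lo, hi, depth):        # L[lo:hi] share a prefix of length depth
        split = lo
        while split < hi and not (L[split] >> (nbits - depth - 1)) & 1:
            split += 1
        for a, b in ((lo, split), (split, hi)):
            if a == b:
                continue
            label = L[a] >> (nbits - depth - 1)           # str(u) plus first edge bit
            D[key(label, depth + 1, w)] = (L[a], L[b - 1])   # (min(v), max(v))
            if b - a > 1:                                  # v is a branching node
                add_edges(a, b, lcp_len(L[a], L[b - 1], nbits))

    if L:
        add_edges(0, len(L), 0)
    return L, D


def predecessor(L, D, x, w):
    nbits, best = w - 1, None
    for i in range(1, nbits + 1):        # stands in for one parallel lookup
        k = key(x >> (nbits - i), i, w)
        if k in D:
            best = k                     # the longest matching label wins
    if best is None:                     # no exit edge
        return L[-1] if L and (x >> (nbits - 1)) & 1 else None
    mn, mx = D[best]
    if x >= mx:
        return mx
    pos = bisect.bisect_left(L, mn)      # the paper follows a list pointer instead
    return L[pos - 1] if pos > 0 else None
```

The binary search in the last step is only a convenience of the simulation; in the data structure the element left of min(v) is reached in constant time through the doubly linked list.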

Insertions
The main idea of the insertion procedure is as follows. Since T is compacted, inserting a new leaf x will cause only a constant number of edges to be inserted and removed, so we can make these changes sequentially. Furthermore, some of the at most w − 1 edges on the path from the root to x might have their min- or max-pointers changed, and we will update these edges in parallel.
Consider inserting x = 110101 in the trie in Figure 2. When x is inserted we add a new leaf for x, as well as a new node p at the location where the path to x branches off the exit edge (u, v) of x.This removes the edge (u, v), but adds the three new edges (u, p), (p, x) and (p, v).Furthermore, we must update the min-pointer of (s, u), because min(v) was replaced by x as the smallest leaf under u.On the other hand, we do not update the min-pointer of (r, s) because min(t) is smaller than x.Note that we do not explicitly store internal nodes and therefore do not add p anywhere in the data structure.
We now describe the insertion procedure.First we note that if x does not have an exit edge it is because the root does not have an outgoing edge which shares the same leftmost bit as x.This case is easily solved by adding an edge from the root to the new leaf x and adding x to either the start or end of L. We will now assume that x has an exit edge, and also that x branches off its exit edge to the left; the other case is symmetric.
Step 1: Find the predecessor of x. Do a predecessor query as described in Section 5.2, which determines
• The exit edge (u, v) of x, label(u, v) and data(u, v) = (addr(min(v)), addr(max(v))).
• The result (I, P) of pRetrieve(X) on D.
Step 2: Insert x in L. Insert x immediately to the right of its predecessor in L.
Step 3: Update edges. We insert (u, p), (p, x) and (p, v) and remove (u, v) from D. We find the labels of the three edges to insert as follows. We have that label(u, p) = label(u, v) since (u, p) is the edge (u, v) shortened by adding the node p and since only the first character of the edge affects the label. By definition, label(p, x) and label(p, v) consist of str(p) with a zero and a one appended, respectively. We compute str(p) by finding the longest common prefix p of x and min(v). To do so, do bitwise XOR between x and min(v) and find the index k of the leftmost bit that is 1 in the result (see [24]). Now k indicates the leftmost bit where x and min(v) differ. To extract the longest common prefix, clear bit k and all bits to its right: with bit positions counted from 0 at the least significant bit, compute $p = x \,\&\, \sim((1 \ll (k+1)) - 1)$. Given the labels we can easily construct the keys for the edges. We now construct the satellite data for the edges. Both the min- and max-pointer for (p, x) are addr(x) since x is a leaf. For (p, v) they are addr(min(v)) and addr(max(v)), which were determined during the predecessor query. Finally, the min-pointer for (u, p) is addr(x) and the max-pointer is addr(max(v)).
Step 4: Update min-pointers. We update the min-pointers for the edges on the path from the root to u that are incorrect after inserting x. Note that inserting x cannot invalidate any max-pointers since we assumed that x branched off its exit edge to the left. The edges that must be updated are exactly those that have a min-pointer to min(v), since x has replaced min(v) as the smallest leaf under u.
Consider the result (I, P) from the pRetrieve query. We begin by setting $I_k = 0$ for the index k corresponding to the exit edge (u, v) of x (we know k from the predecessor query). The indices in I that now contain 1 indicate the edges on the path from the root to u.
Next we identify the edges that need to be updated by creating $I'$ where $I'_i = 1$ if and only if both $I_i = 1$ and what is stored at address $P_i$ is the address of min(v). To do so, first do a scattered read of P and store the result in M. Now M contains addr(min(b)) for each edge (a, b) on the path to u. (Note that the value of $P_i$ is arbitrary if $I_i = 0$, i.e. if no edge has the length-i prefix of x as its label.) Load addr(min(v)) into the ultraword V. Let E be the result of componentwise equality between M and V. Then $E_i = 1$ if and only if what is stored at address $P_i$ is addr(min(v)). Finally compute $I' = I \,\&\, E$. Now we use P and $I'$ to update the incorrect min-pointers. First, load the address of the node for x into U. Then compute B by blending M (the result of the scattered read of P) and U conditioned on $I'$ such that
$$B_i = \begin{cases} M_i & \text{if } I'_i = 0 \text{ (i.e. the value already at the address } P_i\text{)} \\ U_i & \text{if } I'_i = 1 \text{ (i.e. the address of } x\text{)} \end{cases}$$
Finally, do a scattered write of B to the addresses in P. Hence, what is stored at the address $P_i$ remains the same if $I'_i = 0$ and is replaced by the address of x otherwise.
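The parallel pointer update of step 4 can be simulated sequentially to check the logic. The sketch below is only an illustration under stated assumptions: Python lists stand in for ultrawords, a list `memory` indexed by address stands in for RAM, unused components of P are assumed to hold a harmless dummy address 0, and all names and addresses are hypothetical.

```python
def update_min_pointers(memory, I, P, addr_min_v, addr_x):
    """Redirect every min-pointer on the path that points to min(v) so it points to x.

    memory: list indexed by address. I, P: per-component bit and address,
    with the bit of the exit edge already cleared in I.
    Components with I[i] == 0 are assumed to hold a dummy address.
    """
    M = [memory[a] for a in P]                       # scattered read of P
    E = [int(m == addr_min_v) for m in M]            # componentwise equality with V
    I_new = [i & e for i, e in zip(I, E)]            # components to overwrite
    U = [addr_x] * len(P)                            # address of x in every component
    B = [u if sel else m for sel, m, u in zip(I_new, M, U)]  # blend of M and U
    for a, b in zip(P, B):                           # scattered write of B to P
        memory[a] = b
    return I_new

# Hypothetical layout: addresses 3, 5 and 7 hold the min-pointers of
# (r, s), (s, u) and the exit edge (u, v); 200 plays the role of addr(min(v)).
memory = [0, 0, 0, 100, 0, 200, 0, 200]
I = [1, 1, 0, 0, 0, 0]          # exit-edge component already set to 0
P = [3, 5, 7, 0, 0, 0]
selected = update_min_pointers(memory, I, P, addr_min_v=200, addr_x=300)
```

Only the min-pointer of (s, u) at address 5 is redirected to the address of x. The pointer of (r, s) is left alone because it does not point to min(v), and the exit edge is skipped because its bit in I was cleared beforehand.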
The predecessor query in step 1 takes constant time. The operations in step 2 and step 4 are all standard RAM or UWRAM operations, except for finding the leftmost 1-bit, which takes constant time [24]. The dictionary updates in step 3 run in amortized expected constant time by Theorem 2. Since the rest of step 3 consists of standard operations, the running time for insertions is amortized expected constant.
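The longest-common-prefix computation used in step 3 can be illustrated with ordinary integers. This is a sketch assuming bit positions are counted from 0 at the least significant bit; Python's `int.bit_length` stands in for the constant-time leftmost-1-bit operation, and the function name is hypothetical.

```python
def branch_point(x, min_v):
    """Return (k, p) for the node p where x branches off towards min(v).

    k is the index of the leftmost differing bit, counted from 0 at the
    least significant bit; p equals x with bit k and all lower bits cleared.
    """
    diff = x ^ min_v
    if diff == 0:
        raise ValueError("x and min(v) are equal, so there is no branch point")
    k = diff.bit_length() - 1        # leftmost 1-bit of the XOR
    p = x & ~((1 << (k + 1)) - 1)    # keep only the common prefix bits
    return k, p

# Example from Figure 2: x = 110101 and min(v) = 110110.
k, p = branch_point(0b110101, 0b110110)
prefix = format(p >> (k + 1), "b")   # common prefix as a bitstring (leading zeros dropped)
```

In the example, x and min(v) first differ at bit 1, so the common prefix is 1101, which matches the bitstring that x shares with the exit edge in Figure 2.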

Deletions
The deletion procedure is essentially the inverse of the insertion procedure. We assume that x is the left child of its parent p; the other case is symmetric.
Step 1: Find x. Do a predecessor query for x. Since x ∈ S, the predecessor of x is itself. This determines
• The position of x in L.
• The exit edge (p, x) for x, along with label(p, x). Since x ∈ S, this edge must end in the leaf for x.
• The result (I, P) of pRetrieve(X) on D.
Step 2: Update min-pointers. If p is the root (i.e. if |label(p, x)| = 1) we remove the edge (p, x) from D and remove x from L, which completes the deletion of x. Otherwise p is an internal node and must have another child, which we denote by v. Consider the edges on the path to p. Any min-pointer to x should be replaced by the address of min(v), since min(v) is the successor of x and also in the subtree of all of these edges. We find min(v) in the node immediately right of x in L. As we did for insertions, replace any min-pointer that is an address of x by the address of min(v) in parallel using I and P.
Step 3: Delete edges. We delete (p, x) and (p, v) from D. Determine label(p, v) by flipping the last bit in label(p, x). Using the labels we easily find the keys. Note that we do not explicitly delete the edge (u, p) or insert the edge (u, v). These two edges share the same key, and the min-pointer of (u, p) was changed to the address of min(v) in step 2.
Step 4: Update L. Remove x from L.
Steps 1, 2 and 4 all take constant time (see Sections 5.2 and 5.3). The two deletions in step 3 take amortized expected constant time according to Theorem 2. The remainder of step 3 takes constant time, so deletions run in amortized expected constant time.

Reducing to Linear Space and Supporting w-bit Keys
Here, we reduce the space to O(n) and show how to support w-bit keys, concluding the proof of Theorem 1.
The O(w) term in the space bound above is due to the w-parallel dictionary D and O(1) ultraword constants. To avoid this when n = o(w), we will initially support predecessor, insert and delete using the dynamic fusion tree by Pȃtraşcu and Thorup [34] (based on the fusion tree by Fredman and Willard [24]), which uses linear space and supports all three operations in constant time for sets of size $w^{O(1)}$. Simultaneously, we build the ultraword constants we need over the course of Θ(w) insertions, maintaining linear space. When n ≥ w, the constants have been built and we move all elements into the trie. If at any point n ≤ w/2, we move all elements from the trie into a fusion tree and remove the trie and the ultraword constants, leaving us with linear space and Θ(w) insert operations in which to rebuild the constants. Updates still run in amortized expected constant time since we always do Ω(w) updates before we move O(w) elements.
To extend the solution to work with w-bit keys, we partition the input set S into $S_0$ and $S_1$ where $S_i$ = {s | s ∈ S and the leftmost bit of s is i}, and store an xtra-fast trie for each set. Suppose the leftmost bit of an integer x is i. An insert, delete or predecessor operation on x is performed on the data structure for $S_i$. Additionally, if i = 1 and the predecessor query on $S_1$ returns that x has no predecessor, we return the largest element in $S_0$, or report that x has no predecessor if $S_0$ is empty.
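The routing by leftmost bit can be sketched as follows. This is an illustration only: `SortedStub` is a hypothetical sorted-list stand-in for an xtra-fast trie over (w − 1)-bit keys, and storing the keys with the leftmost bit stripped is an assumption made here, not something the text prescribes.

```python
from bisect import bisect_left, bisect_right

class SortedStub:
    """Stand-in for an xtra-fast trie over (w-1)-bit keys (a plain sorted list)."""

    def __init__(self):
        self.keys = []

    def insert(self, x):
        i = bisect_left(self.keys, x)
        if i == len(self.keys) or self.keys[i] != x:
            self.keys.insert(i, x)

    def delete(self, x):
        i = bisect_left(self.keys, x)
        if i < len(self.keys) and self.keys[i] == x:
            self.keys.pop(i)

    def predecessor(self, x):
        i = bisect_right(self.keys, x)
        return self.keys[i - 1] if i > 0 else None

    def maximum(self):
        return self.keys[-1] if self.keys else None


class LeadingBitSplit:
    """Route each w-bit key to S_0 or S_1 by its leftmost bit."""

    def __init__(self, w):
        self.w = w
        self.low_mask = (1 << (w - 1)) - 1
        self.parts = (SortedStub(), SortedStub())   # S_0 and S_1

    def _lead(self, x):
        return (x >> (self.w - 1)) & 1

    def insert(self, x):
        self.parts[self._lead(x)].insert(x & self.low_mask)

    def delete(self, x):
        self.parts[self._lead(x)].delete(x & self.low_mask)

    def predecessor(self, x):
        i = self._lead(x)
        y = self.parts[i].predecessor(x & self.low_mask)
        if y is not None:
            return y | (i << (self.w - 1))          # restore the leftmost bit
        if i == 1:
            return self.parts[0].maximum()          # largest element of S_0, or None
        return None


ds = LeadingBitSplit(6)
for s in [0b001000, 0b001010, 0b001011, 0b101000,
          0b101010, 0b110110, 0b110111, 0b111100]:
    ds.insert(s)
```

With the set from Figure 2, a query for 100000 finds no predecessor in $S_1$ and therefore falls back to the largest element of $S_0$, which is 001011.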

The xtra-fast Trie With Smaller Ultrawords
In this section we show how to match the bounds of Theorem 1 when ultrawords consist of only $w^{1+\epsilon}$ bits (i.e. $w^\epsilon$ components) for any fixed $\epsilon > 0$. The model is otherwise exactly as described in Section 2.
As mentioned, our data structure is based on the y-fast trie by Willard [40] (see Section 1.2). We partition the input set S into O(n/w) sets $S_1, \ldots, S_t$ where each $S_i$ consists of w consecutive values from S, i.e., where $\max(S_i) < \min(S_{i+1})$ for each i (note that $|S_t| < w$ is possible). We build a dynamic fusion tree [34] over each $S_i$ and an uncompacted xtra-fast trie T over a set $S'$ with one representative per $S_i$ (as in the y-fast trie), i.e., the xtra-fast trie where we include non-branching nodes. The size of $S'$ is O(n/w) and each root-to-leaf path has length O(w), so storing the uncompacted trie uses $O(n + w^\epsilon)$ space, where the additional $w^\epsilon$ is due to the $w^\epsilon$-parallel dictionary. We also store a collection B of ultraword constants (to be described shortly) that increases the space to O(n + w). Note that we use the same trick as in Section 5.5 to reduce this to linear in n.
We answer predecessor(x) as in the y-fast trie by first determining the predecessor of x in T, and then finding the predecessor in the corresponding dynamic fusion tree, the latter of which takes constant time [34]. We show that we can find the longest common prefix between x and T in constant time, from which it follows that we can find the predecessor of x in T in constant time (see Section 5.2). In the y-fast trie this is done by binary searching over the binary representation of x, taking O(log w) time. We speed up the process by doing a $w^\epsilon$-way search instead, reducing the running time to $O(\log_{w^\epsilon} w) = O(1/\epsilon)$, or constant for any fixed $\epsilon$. To do so, we first construct the ultraword $X_R$ that contains the labels corresponding to the prefixes of x of length $w^{1-\epsilon}, 2w^{1-\epsilon}, \ldots, w^\epsilon w^{1-\epsilon}$. We then do a pMember query in the dictionary for T, compress the resulting ultraword (yielding a word indicating which labels were found), and find the most significant bit to determine the longest prefix found. This eliminates all but $w/w^\epsilon$ prefixes as candidates for the longest common prefix with T, and we recurse on this range. To construct the correct labels we use the ultraword constants in B.
Recall that in Section 5.2 we use the constants M and H to compute the labels for the parallel member query by X = (X & M) | H. We can compute any collection of $w^\epsilon$ prefix-labels of x in this way, provided that we use the correct constants. We let B encode a B-tree of degree $\Theta(w^\epsilon)$ over M and H, allowing us to perform the $w^\epsilon$-way search. Consider some node v in B that has k + 1 children. In v we store k of the values from M in an array $M_v$ and the corresponding k values from H in another array $H_v$, ensuring that $k \le w^\epsilon$ so that each array fits into an ultraword. We additionally store k and the pointers to the children of v. When we visit v during the search, we read $M_v$ and $H_v$ into the k least significant components of two ultrawords. If $k < w^\epsilon$, the $w^{1+\epsilon} - kw$ most significant bits of these ultrawords will contain some values that are irrelevant to the parallel member query; we zero out these bits by doing bitwise & with $(1 \ll kw) - 1$. This does not cause false positives to occur in the pMember query since no edge in T has the label 0 due to how labels are constructed. We then compute the k prefix-labels of x, do the parallel lookup in the dictionary, compress the result, and find the most significant bit to determine which child of v to continue the search in. Since B is a B-tree over O(w) values it uses O(w) space. Furthermore, the height of the tree is $O(1/\epsilon)$ since the branching factor is $\Theta(w^\epsilon)$. We use constant time per node, concluding the proof of the predecessor query.
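The multiway search can be simulated sequentially to see why few rounds suffice. The sketch below is not the B-tree of constants described above: it computes the candidate prefix lengths on the fly, uses a Python set of (length, value) pairs as a hypothetical stand-in for the parallel dictionary, and treats the parameter `b` as the number of components (playing the role of $w^\epsilon$). Each loop iteration models one parallel membership query.

```python
def build_prefixes(keys, w):
    """All (length, value) prefixes of the keys: the nodes of an uncompacted trie."""
    return {(l, s >> (w - l)) for s in keys for l in range(1, w + 1)}

def longest_prefix_length(x, w, b, prefixes):
    """Length of the longest prefix of x in `prefixes`, using b-way search.

    Each loop iteration models one parallel membership query on at most b labels.
    Returns (length, rounds).
    """
    lo, hi = 0, w            # the answer lies in [lo, hi]; length 0 is the root
    rounds = 0
    while lo < hi:
        step = -(-(hi - lo) // b)                       # ceil((hi - lo) / b)
        lengths = list(range(lo + step, hi + 1, step))  # at most b candidates
        found = [(l, x >> (w - l)) in prefixes for l in lengths]
        rounds += 1
        # Prefixes of a trie are downward closed, so `found` is a run of
        # True values followed by a run of False values.
        j = sum(found)
        if j < len(lengths):
            hi = lengths[j] - 1      # first missing length bounds the answer
        if j > 0:
            lo = lengths[j - 1]      # longest length confirmed present
    return lo, rounds

w, b = 8, 4
keys = [0b00100000, 0b10101000, 0b11011000, 0b11110000]
prefixes = build_prefixes(keys, w)
length, rounds = longest_prefix_length(0b11010101, w, b, prefixes)
```

With w = 8 and b = 4 every query finishes within two rounds, compared with the three rounds a binary search over eight prefix lengths may need. The range shrinks by roughly a factor of b per round, which is the source of the $O(\log_{w^\epsilon} w)$ bound.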
We also support insertions as in the y-fast trie. We determine which set $S_i$ to add the new element to and update that dynamic fusion tree in constant time. If $S_i$ becomes too large we split it (by deleting and reinserting each element in another dynamic fusion tree) and add a separator element to T. This takes expected O(w) time in total (the expectation is from adding at most w new edges to the $w^\epsilon$-parallel dictionary), which is expected constant when amortized over the Ω(w) updates between splits. Deletions are supported similarly.

Conclusion and Open Problems
We have studied the predecessor problem on the UWRAM model of computation. We have given a linear space data structure that supports predecessor queries in worst case constant time and updates in amortized expected constant time, even when ultrawords consist of only $w^{1+\epsilon}$ bits for any fixed $\epsilon > 0$.
Furthermore, we have shown how to implement a w-parallel dictionary on the UWRAM. The dictionary supports w (or $w^\epsilon$) simultaneous membership queries in worst case constant time and updates in amortized expected constant time.
We wonder if it is possible to achieve constant time with high probability for all operations in the predecessor problem. The limiting factor for our solution is the time for updates in the w-parallel dictionary. There are dictionaries that achieve constant time with high probability for all operations in the word RAM model, e.g. [18]. However, such dictionaries seem to require hash functions that are difficult to evaluate in parallel on the UWRAM. For instance, [18] uses the modulo operator, for which we cannot see an obvious way to make a component-wise version.
Step 1: Compute $x_i^+$, $x_i^-$, $y_i^+$ and $y_i^-$ for all even i. We first construct $X^+$ and $X^-$ such that $X^+_i = x^+_i$ and $X^-_i = x^-_i$ for even i and zero otherwise, and similarly for Y. Compute the integer $m = 2^{w/2} - 1$ which consists of w/2 zeroes followed by w/2 ones. Load m into M. Compute $X^- = X \,\&\, M$ and $X^+ = (X \gg w/2) \,\&\, M$. Compute $Y^+$ and $Y^-$ in the same way.
Step 2: Compute the products of the w/2-bit integers. Use componentwise multiplication to compute each of the ultrawords $X^+Y^+$, $X^+Y^-$, $X^-Y^+$ and $X^-Y^-$. Since each component of $X^+$, $X^-$, $Y^+$ and $Y^-$ is a (w/2)-bit integer, no overflow occurs. The odd components still store 0.
Step 3: Align and add the products. Align the products by left-shifting them the amount specified in Equation 2, i.e. compute $(X^+Y^+) \ll w$, $(X^+Y^-) \ll w/2$ and $(X^-Y^+) \ll w/2$, and leave $X^-Y^-$ unshifted.
Add the aligned ultrawords using componentwise addition for 2w-bit components (see e.g. Hagerup [25]) and return the result. See Figure 3 for an illustration. Since the sum of the terms added together in a 2w-bit component exactly corresponds to the multiplication of two w-bit integers, the addition will not overflow.
Bitwise &, left- and right-shifts, componentwise multiplication and componentwise additions for arbitrary component sizes all run in constant time. Each step uses a constant number of these operations, so the procedure runs in constant time.
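The three steps can be checked on ordinary integers. The following sketch assumes w is even and uses Python lists as a stand-in for the components of an ultraword; the function names are hypothetical. Equation 2 is not reproduced in this section, so the shift amounts w, w/2, w/2 and 0 used below are those stated in the caption of Figure 3.

```python
def multiply_via_halves(x, y, w):
    """Multiply two w-bit integers using only (w/2)-bit by (w/2)-bit products."""
    half = w // 2
    m = (1 << half) - 1                      # w/2 zeroes followed by w/2 ones
    x_lo, x_hi = x & m, (x >> half) & m      # lower and upper half of x
    y_lo, y_hi = y & m, (y >> half) & m      # lower and upper half of y
    # Step 3: align the four partial products and add them.
    return (((x_hi * y_hi) << w)
            + ((x_hi * y_lo) << half)
            + ((x_lo * y_hi) << half)
            + (x_lo * y_lo))

def componentwise_multiply(X, Y, w):
    """Apply the same computation to each pair of components."""
    return [multiply_via_halves(x, y, w) for x, y in zip(X, Y)]

products = componentwise_multiply([3, 255, 0], [5, 255, 9], 8)
```

Each partial product fits in w bits, and the aligned sum equals the ordinary product of the two w-bit inputs, so it fits in a 2w-bit component.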

Figure 1: The layout of an ultraword X.

Figure 2: An xtra-fast trie for S = {001000, 001010, 001011, 101000, 101010, 110110, 110111, 111100}. The dashed edge and nodes illustrate how the trie would change if x = 110101 were inserted. The exit edge for x is (u, v) since we match the bitstring 1101 but do not match the next 1 on (u, v). Similarly, the exit edge for 100100 is (s, t). We have that key(u, v) = label(u, v)1000 = 1101000 where the underlined part (here the suffix 1000) is what we append to the labels to disambiguate the keys. Similarly, key(r, s) = 1100000 and key(s, t) = 1010000. The dictionary entry of (s, u) has key(s, u) = 1110000, and the min- and max-pointer of (s, u) are addr(min(u)) and addr(max(u)). Similarly, the min-pointer of (r, s) is to min(s) = min(t) and the max-pointer is to max(s) = max(u). Note that if we insert x we would have to update the min-pointer of (s, u), since x < min(v). However, the min-pointer of (r, s) remains unchanged since min(t) < x.

Figure 3: Illustrates step 3 of 2w-bit multiplication. Each of the products $X^+Y^+$, $X^+Y^-$, $X^-Y^+$ and $X^-Y^-$ is left-shifted by respectively w, w/2, w/2 and 0 by shifting in zeroes from the right. Then they are added together using componentwise addition for 2w-bit components. Since what we sum up in a 2w-bit component adds up to the product of two w-bit integers, we only need 2w bits to store the result. Hence the addition will not overflow.