Concurrent Disjoint Set Union

We develop and analyze concurrent algorithms for the disjoint set union (union-find) problem in the shared memory, asynchronous multiprocessor model of computation, with CAS (compare and swap) or DCAS (double compare and swap) as the synchronization primitive. We give a deterministic bounded wait-free algorithm that uses DCAS and has a total work bound of $O(m \cdot (\log(np/m + 1) + \alpha(n, m/(np)))$ for a problem with $n$ elements and $m$ operations solved by $p$ processes, where $\alpha$ is a functional inverse of Ackermann's function. We give two randomized algorithms that use only CAS and have the same work bound in expectation. The analysis of the second randomized algorithm is valid even if the scheduler is adversarial. Our DCAS and randomized algorithms take $O(\log n)$ steps per operation, worst-case for the DCAS algorithm, high-probability for the randomized algorithms. Our work and step bounds grow only logarithmically with $p$, making our algorithms truly scalable. We prove that for a class of symmetric algorithms that includes ours, no better step or work bound is possible.


Introduction
As data sets get bigger and bigger, it becomes more and more important to harness the potential of parallelism to solve computational problems -even linear time is too slow. In the late twentieth century, many beautiful and efficient algorithms were developed in the PRAM (parallel random access machine) model, which assumes a memory shared among many synchronized processors. In practice, however, synchronization is expensive or may not be possible. A weaker model that has attracted much attention in the distributed systems community is the APRAM (asynchronous parallel random access machine) model, in which a common memory is shared among many unsynchronized processors. In the most general version of this model, any processor can be arbitrarily slow compared to any other.
Obtaining efficiency bounds in the APRAM model is extremely challenging: the use of locks, for example, seems to make it impossible to guarantee efficiency, since one process could set a lock and then go to sleep indefinitely, blocking progress by any other process that needs access to the same resource. To overcome this problem, systems researchers have invented synchronization primitives that do not use locks, notably CAS (compare and swap) [Her91a], transactional memory [HM93], and others. These primitives allow at least the possibility of obtaining good efficiency bounds for asynchronous concurrent algorithms. Yet, except for "embarrassingly parallel" computations, this possibility is almost unrealized. Indeed, we know of only one example of a concurrent data structure (other than our work, to be described) for which a work bound without a term at least linear in the number of processes has been obtained. This is an implementation by Ellen and Woefel [EW13] of a fetch-and-increment object.
An important problem in data structures that could benefit from an efficient concurrent algorithm is disjoint set union, also known as the union-find problem. The simplest version of this problem requires maintaining a collection of disjoint sets, each containing a unique element called its representative, under two operations: find (x ): return the representative of the set containing element x.
unite(x, y): if elements x and y are in different sets, unite these sets into a single set and designate some element in the new set to be its representative; otherwise, do nothing.
Each initial set is a singleton, whose representative is its only element. Note that the implementation is free to choose the representative of each new set produced by a unite. This freedom simplifies concurrent implementation, as we discuss in Section 4. Other versions of the problem add operations for initializing singleton sets and for maintaining and retrieving information about the sets such as names or sizes. We study the simplest version but comment on extensions in Section 9.
Applications of sequential disjoint set union include storage allocation in compilers [LA02], finding minimum spanning trees using Kruskal's algorithm [Kru56], maintaining the con-nected components of an undirected graph under edge additions [Tar75a], testing percolation [SW11], finding loops and dominators in flow graphs [Tar73,Tar74,FGMT13], and finding strong components in directed graphs. Some of these applications, notably finding connected components [Tar72, SV82, JM97, HZ01, RMCDS12, LT18] and finding strong components, are on immense graphs and could potentially benefit from the use of concurrency to speed up the computation. For example, model checking requires finding strong components in huge, implicitly defined directed graphs [Wol02,Blo15,BLvdP16]. There are sequential linear-time strong components algorithms [Tar72,Sha81], but these may not be fast enough for this application. The sequential algorithms use depth-first search [Tar72], which apparently cannot be efficiently parallelized [Rei85]. If one had an efficient concurrent disjoint set union algorithm one could use it in combination with breadth-first search to potentially speed up model checking. This application, described to the second author by Leslie Lamport, was the original motivation for our work.
The classical sequential solution to the disjoint set union problem is the compressed tree data structure [GF64,Fis72,HU73,Tar75b,TvL84]. With appropriate tree linking and path compaction rules, m operations on sets containing a total of n elements take O(α(n, m/n)) time [Tar75b,TvL84,GKLT14], where α is a functional inverse of Ackermann's function, defined in Section 3. Three linking rules that suffice are linking by size [Tar75b], linking by rank [TvL84], and linking by random index [GKLT14]; three compaction rules that suffice are compression [Tar75b,TvL84,GKLT14], splitting [TvL84,GKLT14], and halving [TvL84,GKLT14].
Perhaps surprisingly, there has been almost no previous research on wait-free concurrent disjoint set union. We have found only one such effort, that of Anderson and Woll [AW91]. Their work contains a number of significant ideas that are the genesis of our results, but it has many flaws that reveal the subtlety of the problem. We use their concurrency model. In one of our linking algorithms we use DCAS (double compare and swap), as a synchronization primitive, whereas they used only the weaker CAS (compare and swap) primitive.
Anderson and Woll considered an alternative formulation of the problem in which sets do not have representatives and the two operations are same-set(x, y), which returns true if x and y are in the same set and false otherwise, and unite(x, y), which combines the sets containing x and y into a single set if these sets are different. (We discuss same-set further in Section 4.) They attempted to develop an efficient concurrent solution that combines linking by rank with a concurrent version of path halving. They claimed a bound of O(m · (p + α(m, 1)) on the total work, where p is the number of processors. (They did not treat n as a separate parameter.). Their linking method can produce paths of Ω(p) nodes of equal rank. The O(mp) term in their work bound accounts for such paths. Their proof of their upper bound is not correct, because they did not consider interference among different processes doing halving on intersecting paths. Whether or not their bound is correct, it is easy to show that their algorithm can take Ω(np) work to do n − 1 unite operations, compared to the O(nα(n, 1)) time required by one process. Thus in the worst case their work bound gives essentially no speedup.
Anderson and Woll also claimed a work bound of O(m · (α(m, 1) + log * p)) for a synchronous PRAM algorithm that uses deterministic coin tossing [CV86] to break up long paths of equalrank nodes. They provided no details of this algorithm and no proof of the work bound. We think that their bound is incorrect and that the work bound of their algorithm is Ω(n log p), since it is easy to construct sets of operations that do linking by rank exactly but such that concurrent finds with halving take Ω(log p) steps per find, even on a PRAM. See Section 8. Deterministic coin tossing does seem to be a good idea, however: we conjecture that it can give an efficient (but complicated) deterministic set union algorithm in the APRAM model using only CAS for synchronization, at the cost of a multiplicative log * p factor in the work bound.
In this paper we apply the ideas of Anderson and Woll and some additional ones to develop several efficient concurrent algorithms for disjoint set union. We give three concurrent implementations of unite, one deterministic and the other two randomized. The deterministic method uses DCAS to do linking by rank. The randomized methods use only CAS: one does linking by random index, the other does randomized linking by rank. We also give two concurrent implementations of path splitting, one-try and two-try splitting. The former is simpler, but we are able to prove slightly better bounds for the latter, bounds that we think are tight for the problem.
We prove that any of our linking methods in combination with one-try splitting does set union in O m · log np 2 m + 1 + α n, m np 2 work, and in combination with two-try splitting in O m · log np m + 1 + α n, m np work. Each set operation takes O(log n) steps. These bounds are worst-case for deterministic linking and high-probability for randomized linking.
The O(log n) step bound per operation holds even without path splitting; without splitting, the work bound is O(m log n). The work and step bounds for randomized linking by rank hold even for an adversarial scheduler, provided that scheduling is based only on information sent to the scheduler, or we allow a form of CAS that writes a random bit. The work and step bounds for linking by random index hold provided that the randomization is independent of the order in which the unite operations are executed, or, more precisely, independent of the "linearization order" of the unite operations. (We define linearization order in Section 2.) We also show that Ω m · log np m + 1 + α n, m np work is needed in the worst case for any algorithm satisfying a symmetry assumption, which implies that our work bound for two-try splitting is best possible for such algorithms.
The remainder of our paper contains 8 sections. Section 2 describes our concurrency model. Section 3 describes the compressed tree data structure and sequential algorithms for disjoint set union. Section 4 presents concurrent linking by index, a special case of which is concurrent linking by random index, and one-try and two-try splitting. Section 5 presents preliminary versions of deterministic and randomized linking by rank. These versions rely on some simplifying assumptions that we eliminate in Section 6. Section 7 gives upper bounds on the total work of our algorithms. Section 8 presents lower bounds. Section 9 contains some final remarks and open problems.

Concurrency Model
Our concurrency model is the same as that of Anderson and Woll: a shared memory multiprocessor, otherwise known as an asynchronous random-access machine (APRAM). We assume that p processes run concurrently but asynchronously, each doing a different set operation. Each process has a private memory. In addition, all processes have access to a shared memory that supports concurrent reads but not concurrent writes.
To provide synchronization of writes to shared memory, we use the compare and swap primitive CAS(x, y, z ). Given the address x of a block of shared memory and two values y and z, this operation tests whether block x holds value y; if so, it stores value z in block x (overwriting y) and returns true; if not, it returns false. We also consider the two-block extension DCAS(u, v, w, x, y, z ). Given the addresses u and x of two blocks of shared memory and four values v, w, y, and z, this operation tests whether block u holds value v and block x holds value y; if both are true, it stores value w in block u and value z in block x and returns true; if not, it returns false. These operations are atomic: once one starts, it completes before any other operation can read, write, CAS, or DCAS the affected block or blocks. Although both CAS and DCAS return a value indicating success or failure, many of our algorithms do not actually use these values.
In one version of our randomized linking algorithm we use the following randomized version of CAS: atomic operation CAS(x, null, flip) tests whether the value of x is null and, if so, sets the value of x equal to true or false, each with probability 1 /2. Such a randomized atomic write operation has been used in algorithms for achieving consensus [CIL87].
Many current hardware designs include CAS as an instruction; DCAS was supported on the Motorola 68030 [Mot] but not on any current hardware, as far as we know. As we demonstrate in Section 5, it is straightforward to implement linking by rank using DCAS, but much harder using only CAS.
We study concurrent algorithms for disjoint set union that are linearizable [HW90] and bounded wait-free [Her91b]. To be linearizable means that (i) the outcome of a concurrent execution is the same as if each set operation were executed instantaneously at some distinct time (its linearization time) during its actual execution and (ii) the sequential execution sequence given by the linearization times is correct; that is, all find operations produce answers that are correct at their linearization times. The linearization times define a total order of the operations, called the linearization order. Although we focus on linearizable algorithms, some applications of disjoint set union may not require linearizability for correctness. We briefly discuss this issue in Section 9, and leave further investigation as an open problem.
To be bounded wait-free means that every operation finishes in a bounded number of its own steps. The total work done by a concurrent solution is the total number of steps done by all processes to complete all operations.
Two weaker progress properties than bounded wait-freedom are wait-freedom and lockfreedom [HS08]. A concurrent solution is wait-free if every process is guaranteed to finish. It is lock-free if every operation can execute its next step when it chooses to do so, and at least one process is guaranteed to finish its operation. In general a lock-free solution need not be wait-free, and a wait-free solution need not be bounded wait-free. In our version of disjoint set union, the number of elements is fixed, which makes it easy to guarantee bounded wait-freedom. This remains true if we add an operation that allows the creation of singleton sets containing new elements, as long as the total number of set operations is bounded. If we allow an unbounded number of singleton sets to be created, then our solutions are no longer wait-free, but they remain lock-free. In this case there are no meaningful work bounds.

Data Structure and Sequential Algorithms
Our concurrent disjoint set union algorithms use the same data structure as the best sequential algorithms: a compressed forest. This forest contains one rooted tree per set, whose nodes are the elements of the set and whose root is the set representative. Each node x has a pointer x.p, to its parent if it has a parent or to itself if it is a root. The root of the tree is the representative of the set.
The sequential algorithm for find (x ) follows parent pointers from x until reaching a node v that points to itself, optionally compacts the find path (the path of ancestors from x to v ) by replacing the parent of one or more nodes on the find path by a proper ancestor of its parent, and returning v. Naïve find does no compaction. Three good compaction rules are compression, splitting, and halving. Compression replaces the parent of every node on the find path by the root v. Splitting replaces the parent of every node on the find path by its grandparent. Halving replaces the parent of every other node on the find path by its grandparent, starting with x. See Figure 1.
The sequential implementation of unite(x, y) does find (x ) and find (y), returning the roots u and v of the trees containing x and y, respectively, and tests whether u = v. If u = v, it links u and v by making one the parent of the other. Three good linking rules are linking by size, linking by rank, and linking by random index. Linking by size maintains the size (number of nodes) of each tree in its root, and makes the root of the tree of larger size the parent of the other, breaking a tie arbitrarily. Linking by rank maintains a non-negative integer rank for each root, initially zero, and makes the root of larger rank the parent of the other, breaking a tie by adding one to the rank of one of the roots. Linking by index chooses a fixed total order of the nodes and makes the root of larger index the parent of the other. Linking by random index is the special case of linking by index that chooses the total order of nodes uniformly at random. Figure 1: The results of running find (8) on the original path with the three different types of compaction: compression links all the nodes on the find path directly to the root, thereby "compressing" the path; halving links alternating nodes on the find path to their grandparents, thereby creating a path of "half" the length with nodes hanging off; splitting links every node on the find path to its grandparent, thereby "splitting" one path into two.
Linking by size, rank, or random index combined with naïve find, compression, splitting or halving gives an algorithm that takes O(logn) time for an operation on a set or sets containing n elements, worst-case for deterministic linking, high-probability for linking by random index. Use of compaction improves the amortized time per operation: any combination of compression, splitting, or halving with linking by size, rank, or random index gives an algorithm that takes O(m · α(n, m/n)) time to do m operations on sets containing a total of n elements. The bound is worst-case for linking by size or rank, average-case for linking by randomized index. Here is a functional inverse of Ackermann's function defined as follows. We recursively define A k (n) for non-negative integers k and n as follows: A 0 (n) = n + 1; A k (0) = A k-1 (1) if k > 0; A k (n) = A k -1 (A k (n -1)) if k > 0 and n > 0.
For a non-negative integer n and non-negative real-valued d, Lemma 1. A k (n) < min{A k+1 (n), A k (n + 1)}, i.e., A k (n) is strictly increasing in k and n.
. Thus the lemma also holds for k and n = 0. Let k > 0 and n > 0. Suppose the lemma holds for k' < k and all n, and for k and n -1.
Corollary 1. α(n, d) is non-decreasing in n and non-increasing in d.
Our goal is to extend at least one sequential set union algorithm to the concurrent model of Section 2 and to obtain an almost-linear work bound that grows sublinearly with p, the number of processes. For convenience in stating bounds, we assume that 2 ≤ p ≤ n ≤ m, and that there is at least one unite of different elements. We denote the base-two logarithm by lg.

Concurrent Linking and Splitting
Concurrency significantly complicates the implementation of the set operations. One complication is that processes can interfere with each other by trying to update the same field at the same time, requiring our algorithms to be robust to such interference. Consider doing unites concurrently. To do unite(x,y), we can start as in the sequential case by finding the roots u and v of the trees containing x and y, respectively. Then we can try to link u and v by doing a CAS to make v the parent of u or vice-versa. But we must allow for the possibility that the CAS can fail, for example if it tries to make v the parent of u but in the meantime some other process makes another node the parent of u. If this happens we must retry the unite. When retrying, we start the new finds at u and v rather than at x and y, to avoid revisiting nodes. Anderson and Woll [AW91] proposed this method; the following pseudocode implements it. Method link (u, v ), to be defined, tries to make one of two roots u and v the parent of the other.
unite(x, y): In this and subsequent implementations, asterisks denote linearization points. The linearization point of a unite is the linearization point of the successful link if there is one, or the inequality test that fails if no link is successful.
Concurrency also imposes constraints on the linking rule. We need to prevent concurrent links from creating a cycle of parent pointers other than a loop at a root. For example, three concurrent links might make v the parent of u, w the parent of v, and u the parent of w. The simplest way to prevent such cycles is to do linking by index, which we can implement using CAS. We denote the total order of nodes by "<" . The following pseudocode implements linking by index: The linearization point of the link is its CAS. A link is successful if its CAS returns true. For any total order, linking by index guarantees acyclicity. Linking by random index is the special case of linking by index that chooses the total order uniformly at random.
With this implementation of link, a link can succeed even though the new parent itself becomes a child of another node at the same time. Fortunately this affects neither correctness nor efficiency. We could prevent this anomaly by using DCAS to do links, which allows us to guarantee that the new parent remains a root. But this has two drawbacks. First, it uses DCAS, whereas our goal is to use only CAS if possible. Second, if all links are done using DCAS, the total work can be linear in p, as we discuss in Section 5.1.
Next we consider finds. Concurrent naïve finds do not interfere with each other, since such finds do not change the data structure. Thus we can do such finds exactly as in the sequential case. The following pseudocode implements concurrent naïve find: The linearization point of a find is the last update of v.
Concurrent finds with compaction can interfere with each other. Consider a sequential find with splitting. Let u be the current node visited by the find. One step of the find consists of setting v = u.p; setting w = v.p; and, if v = w, replacing u.p by w and then setting u = v. Steps continue until v = w, when the find finishes by returning v. The only update to the data structure in a step is the replacement of u.p by w. We obtain a concurrent version of splitting by using CAS(u.p, v, w ) to do the update. The following pseudocode implements this method, which we originally presented in [JT16] and which is based on Anderson and Woll's version of find with halving: The linearization point of a find is the last update of w. We call this method one-try splitting because it tries once to update u.p and then changes the current node from u to v, whether or not the update of u.p has succeeded.
Concurrent splits can produce anomalies that are not possible if splits are sequential, as a simple example shows. (See Figure 2.) Suppose a, b, c, d, e is a path in a tree built by linking by index, and that four processes, 1, 2, 3, and 4 begin concurrent finds with one-try splitting starting at a, a, b, and b, respectively. We denote the local variables of process i by u i , v i , w i . First, process 1 sets u 1 = a, v 1 = a.p = b, and w 1 = b.p = c. Second, process 3 sets u 3 = b, v 3 = c, w 3 = d, and replaces b.p by d. Third, process 4 sets u 4 = b, v 4 = d, w 4 = e, and replaces b.p by e. Fourth, process 2 sets u 2 = a, v 2 = b, and w 2 = d. Fifth, process 1 replaces a.p by c. Sixth, process 2 attempts to replace a.p by d but fails, because process 1 changed a.p after process 2 read it. Observe that just before process 1 replaces a.p by c, c is not an ancestor of a, even though it was when process 1 read it. This threatens correctness. Furthermore, even though the failure of process 2 to update a.p guarantees that a.p has changed since process 2 read it, the new value of a.p, namely c, is not an ancestor of the current grandparent of a, namely e, violating a property used in the analysis of sequential splitting. Finally, even though the new parent c of a is higher in index than the old parent b of a (as we prove in Theorem 1), the new grandparent d of a is lower than the old grandparent e of a.
Figure 2: Interference in concurrent splitting: process 1 updates a's parent to a node that is not its ancestor; process 2's CAS fails. These difficulties do not occur in the sequential setting.
Fortunately, correctness requires only a weak property of compaction, one that holds for one-try splitting and many other methods. Suppose we do linking by index. We define the union forest to be the set of trees such that the parent of a node x is the first value of x.p other than x ; if x.p = x throughout the computation, x is a root in the union forest. We call a compaction method valid if it visits nodes on a single path in the union forest, each vertex visit takes O(1) steps, and each replacement of a parent y by another node z (of which there may be none) is such that z is a proper ancestor of y in the union forest.
Theorem 1. Linking by index in combination with any valid compaction method maintains the invariant that the parents define a set of trees that partition the nodes into the correct disjoint sets, and the parent of any non-root node has higher index than the node. Further-more each set operation stops in O(h) steps, where h is the height (maximum number of edges on a path) of the union tree, so the algorithm is bounded wait-free.
Proof. An induction on the number of parent changes using the transitivity of "<" shows that the parent of any node never has smaller index than the node. This implies that the only cycles are loops at roots. Parent changes done during compactions do not change the node partition defined by the trees. A link that makes v the parent of u must be such that u is a root before the link, u < v, and u and v are in the trees containing the two nodes x and y that are the inputs to the unite that does the link. It follows by induction on the number of parent changes that at all times the trees correctly partition the nodes. Correctness of the linearization points follows in a straightforward way by induction on the number of parent changes.
Since the nodes visited during a find are on a single path in the union forest, and each node visit takes O(1) steps, each find stops in O(h) steps. (Our assumption that there is at least one unite of different elements implies h > 0.) The nodes visited during a unite are on two paths in the union forest. Consider the node visits in the order they occur. Each node visit takes O(1) steps, but a node can be visited many times. This can only happen while it is a root; once it becomes a child, it can only be visited once more (as the input to a find).
Consider the nodes u and v just before an execution of the test "u = v" in unite. Each of u and v was a root at some time during the find that computed it. If the test "u = v" succeeds, whichever of u and v is smaller in the total order will be a child after the next link (whether or not the link succeeds). Suppose without loss of generality it is u. We charge the next visits to u and v to u becoming a child. There are at most 2h such events. It follows that the total number of node visits during the unite, and hence the total number of steps, is O(h).
Having dealt with correctness, we discuss concurrent compaction in more detail. The monotonicity of parents (each new parent is higher in index than the old one) allows us to extend the analysis of sequential splitting to one-try splitting, although the extension is not straightforward. On the other hand, the analysis of sequential halving relies on monotonicity of grandparents, which fails in the concurrent setting, as our example above shows. Anderson and Woll [AW91] claimed a good work bound for their concurrent version of halving, but they overlooked the problem of non-monotonicity. We see no way to get a good work bound for their method.
Even though we can prove good efficiency bounds for one-try splitting, we can prove slightly better bounds for a related compaction method that tries to change each parent pointer twice instead of once. We call this method two-try splitting. The following pseudocode implements find with two-try splitting: The linearization point of a find is the last assignment to w. If every attempted parent change succeeds, the effect of a single two-try split is to replace the parent of every other node on the find path by its great-grandparent. This splits the original path into two paths, each containing half the nodes on the original path, but the split is different from that produced by one-try splitting: if the nodes on the original path are numbered consecutively from 1, the latter produces a path of nodes 1, 3, 5, 7. . . and another path of nodes 2, 4, 6, 8. . . ; the former produces a path of nodes 1, 4, 5, 8, 9. . . and another path of nodes 2, 3, 6, 7, 10, 11. . .
A variant that has the same work bounds as two-try splitting is conditional two-try splitting, in which the second try occurs only if the first one fails. We omit a detailed discussion of this variant, since its pseudocode is a bit longer and it is unclear whether avoiding extra parent changes improves efficiency.
Both one-try and two-try splitting are valid compaction methods, so Theorem 1 holds for both of them.
We conclude this section by presenting Anderson and Woll's concurrent implementation of same-set, which gives an extension of our algorithms to their formulation of the problem. It is easy to do same-set(x, y) in the sequential setting: find the root u of the tree containing x, find the root v of the tree containing y, and test whether u = v. As Anderson and Woll observed, this does not suffice in the concurrent setting, because u might no longer be a root when the equality test occurs, possibly invalidating the test. Their solution has three cases. If u = v, return true: x and y are in the same tree when the test occurs, and remain in the same tree. If u = v, test whether u is still a root. If so, return false: x and y were in different trees when v was computed, since u and v were different roots. If not, redo the computation: do new finds from u and v, and repeat the test or tests. The following pseudocode implements this method: The linearization point of a same-set is the last assignment to v. All our analyses of find and unite extend to include same-set as an allowed operation.

Concurrent Linking by Rank
To obtain a good work bound, we combine one-try or two-try splitting with a good linking method. Linking by random index is one such method, but our analysis of it assumes that the scheduling of CAS instructions is independent of the random node order. This assumption is questionable; if it fails, the work bound becomes much worse as a function of p, as we show in Section 8. To overcome this, we develop two concurrent versions of linking by rank, one deterministic and one randomized, both of which have good work bounds. To simplify our descriptions, we assume for the moment that the rank and parent of a node can be stored in a single block of memory that is updatable by one CAS instruction. In Section 6 we show how to eliminate this assumption.
Both of our versions of linking by rank are refinements of a generic method. The generic method links roots of different ranks using CAS, and links roots of the same rank using method elink, to be defined. The rank of node u is u.r, initially zero. The following pseudocode implements the generic method: Given two roots u and v with ranks r and s, respectively, this method compares r to s. If r < s, it uses a CAS to make v the parent of u while guaranteeing that neither the parent nor the rank of u changes in the meantime. If r > s, it proceeds symmetrically. If r = s, it does an elink to link u and v. A link is successful if its CAS returns true or its elink is successful, in which case the linearization point of the link is its CAS or that of its elink.
Our two versions of linking by rank differ only in their implementation of elink.

Linking by Rank via DCAS
A simple way to do elink (u, v, r ) is to use a DCAS to make v the parent of u and increment the rank of v while guaranteeing that the ranks and parents of u and v do not change in the meantime.. The following pseudocode implements this idea: An elink is successful if its DCAS returns true, in which case the linearization point of the elink is its DCAS.
Our first version of linking by rank uses this implementation of elink. The rank of a node can never decrease, and can increase only while the node is a root. It follows that the rank of a child is always strictly less than that of its parent. Linking by rank is an implicit form of linking by index: the successful links respect any total order consistent with the final ranks of nodes. Thus Theorem 1 holds for this method.
The following lemma and theorem extend known bounds on sequential linking by rank [TvL84] to linking by rank via DCAS: Lemma 2. With linking by rank via DCAS, the sum of ranks is at most n − 1, the number of nodes of rank k is at most n/2 k , and the maximum rank and the height of the union forest are at most lg n.
Proof. For a node to increase in rank by 1, it must be a root, and another root must become its child at the same time. It follows that the number of rank increments, and hence the sum of ranks, is at most n -1, one per root that becomes a child. An induction on k shows that at most n/2 k nodes can ever attain rank k. The bounds on the maximum rank and the height of the union forest follow, since no node can have rank exceeding lgn.
Theorem 2. Linking by rank via DCAS in combination with any valid compaction method maintains the invariant that the parents define a set of trees that partition the nodes into the correct disjoint sets, and the rank of a child is less than that of its parent. Furthermore each set operation stops in O(log n) steps, so the algorithm is bounded wait-free.
Proof. The first half of the theorem follows by induction on the number of steps as in the proof of the first half of Theorem 1. A find takes O(log n) steps by the argument in the proof of Theorem 1, since the height of the union forest is O(log n) by Lemma 2. We prove the bound for unites by an extension of the argument in the proof of Theorem 1. The nodes visited during a unite are on two paths in the union forest, and on each path they are visited in increasing order by rank. Each node visit takes O(1) steps, but roots can be visited many times. We charge each repeated visit to a root either to a root becoming a child or to a root increasing in rank. Consider the nodes u and v just before an execution of the test "u = v" in unite. Each of u and v was a root at some time during the find that computed it. Suppose the test "u = v" succeeds. The next execution of link sets r to the rank of u and s to the rank of v. If r < s, then after the CAS either the rank of u has increased or u has become a child, whether or not the CAS succeeds. We charge the next visits to u and v to the rank increase of u or to u becoming a child. The symmetric argument applies if r > s. If r = s, at least one of u and v has increased in rank or become a child after the elink. We charge the next visits to u and v to whichever of these events has occurred. There are at most 2 lg n roots that become children and at most 2 lg n rank increases by Lemma 2, since for each of the two paths in the union forest the rank increases sum to at most lg n. It follows that the total number of node visits during the unite, and hence the total number of steps, is O(log n).
The efficiency of this linking method (though not its correctness) depends critically on using CAS to link nodes of different ranks, reserving DCAS for the equal-rank case. An attempted link of equal-rank nodes u and v using DCAS fails only if some other process makes u or v a non-root, or increases the rank of u or v. In the proof of Theorem 2 we charge extra node visits resulting from the failure of the DCAS to whichever of these events occurs. If we were to use DCAS to try to make a node v the parent of a node u of lower rank, the DCAS could fail because another process made v the parent of another node w. This changes neither the parent nor rank of u, nor of v, leaving us with no event to charge for extra node visits. In the worst case, O(n) such links could produce Ω(pn) failures, resulting in total work linear in p. Using CAS to link nodes of different ranks eliminates these failures. Although we can avoid such interference in the disjoint set union problem as we have defined it, this is much harder to do in some extensions of the problem, as we discuss in Section 9.

Randomized Linking by Rank
To link equal-rank nodes using CAS, we need to do the parent change and the rank increment separately. The question is which one to do first. Making this decision randomly gives an approximation to linking by rank that produces few enough rank ties that we are able to get good work bounds. Since this method allows rank ties, we use linking by index to break such ties, in order to prevent the creation of non-trivial cycles of parent pointers. Assume that "<" is an arbitrary total order of the nodes. To link two equal-rank roots u and v such that u < v, we flip a fair coin. If it comes up heads, we attempt to make v the parent of u; if it comes up tails, we attempt to increase the rank of u. The following pseudocode implements this idea. Random Boolean method flip returns true with probability 1 /2 and false otherwise, independent of all other flips.
An elink is successful if it does a CAS that changes a parent pointer, in which case the linearization point of the elink is its CAS.
Our second version of linking by rank uses this implementation of elink. Observe that the CAS done after a flip is almost the same whether the flip returns true or false, the only difference being the updated field (parent or rank, respectively). In our analysis we shall assume that the success or failure of the CAS following a flip is independent of the outcome of the flip. In Section 6 we describe how to modify the implementation to eliminate the need for this independence assumption. Randomized linking by rank is an implicit form of linking by index: links respect the total order defined by final node ranks with ties broken by "<" .
Lemma 3. With randomized linking by rank, (i) any node x has O(1) ancestors of the same rank, in expectation; (ii) the sum of ranks is at most n in expectation and n + O(n 1 /2 ) with probability 1 − 1/n c for any constant c > 0, where the constant factor in the "O" depends on c; (iii) the expected number of nodes of rank at least k is at most n/2 k , and with probability at least 1 − n/2 k , all nodes have rank less than k; (iv) the maximum rank is at most lg n + 3 in expectation and is at most (c + 1) lg n with probability at least 1 − 1/n c for any positive constant c; (v) the depth of the union forest is at most 3 lg n + 9 in expectation and O(log n) with probability at least 1 − 1/n c for any constant c > 0, where the constant factor inside the "O" depends on c; and (vi) for a large enough constant c and any k > 0, the expected number of nodes of rank less than k and height at least ck in the union forest is at most n/2 k .
Proof. Consider only flips that result in successful CAS operations. Each such flip produces a rank increment with probability 1 2 ; otherwise, it makes a root into a child.
(i) The probability that a node has k ancestors of the same rank is at most 1/2 k-1 . Summing gives the bound.
(ii) There are at most n -1 flips that make a root into a child. The sum of ranks, which is the number of rank increments, is thus at most the number of heads in a sequence of coin flips containing at most n tails, which is at most n in expectation, and at most n + O(n 1/2 ) with probability 1n c for any constant c > 0 by a Chernoff bound [CL05], with the constant factor in the "O" depending on c.
(iii) The rank of a given node is at least k with probability at most 1/2 k . The expected number of nodes of rank at least k is thus at most n/2 k . By a union bound, all nodes have rank less than k with probability at least 1n/2 k .
(iv) For c > 0, the probability that the maximum rank is at least lgn + c is at most n/2 lgn + c = 1/2 c . It follows that the expected maximum rank is at most lgn + ∞ i=1 i/2 i lgn + 2 lgn + 3, and the probability that the maximum rank exceeds (c + 1)lgn is at most 1/2 clgn = 1/n c .
(v) A node gains a proper ancestor of the same rank with probability at most 1 /2. Thus the expected number of proper ancestors of the same rank as that of a given node is at most ∞ i=1 i/2 i = 2, which implies by (ii) that the expected depth of the union forest is at most 3lgn + 9. Let c > 0. By (ii) the maximum node rank is at most (c + 3)lgn with probability at least 1 -1/n c+2 . For any node, the probability that it has at least (b + c + 3)lgn proper ancestors in the union forest is at most the probability that a sequence of fair coin flips containing at most (c + 3) lg n heads contains at least blgn tails. By a Chernoff bound, for b sufficiently large, this probability is at most 1 -1/n c+2 . The probability that at least one of the n nodes has more than (b + c + 3)lgn proper ancestors in the union forest is at most 2n/n c+2 ≤ 1/n c .
(vi) Let x be any node. We claim that for some c > 0 the probability that x has an ancestor y of rank less than k such that the path from x to y contains ck edges is at most 1/2 k . Part (vi) follows from the claim by a union bound. To prove the claim, consider the edges on the path from x to y in the union forest. Call such an edge good if its ends have different ranks and bad otherwise. Each edge has probability at least 1 /2 of being good, independent of the status of all other edges. The claim follows by a Chernoff bound.
Theorem 3. Randomized linking by rank in combination with any valid compaction method maintains the invariant that the parents define a set of trees that partition the nodes into the correct disjoint sets. The parent of any non-root node has rank no less than that of the node, and if the ranks are equal, the parent has larger index. Each set operation stops in O(log n) steps with probability 1 − 1/n c for any c > 0, where the constant factor in the "O" depends on c, so the algorithm is bounded wait-free with probability 1 − 1/n c for any c > 0.
Proof. The theorem follows from parts (iv) and (v) of Lemma 3 by a proof like those of Theorems 1 and 2.
Remark 1. We can modify randomized linking by rank to guarantee that it is bounded waitfree without affecting any of our bounds by placing an upper bound of n − 1 on the maximum rank: when linking two nodes of maximum rank, we do not do a flip but always try to make the smaller-index node a child of the larger.

Linking by Random Index
With an appropriate definition of rank, Lemma 3 and Theorem 3 hold for linking by random index, under a strong independence assumption. We define the rank of node x to be lgnlg(nx + 1). Thus node n has rank lgn, nodes n -1 and n -2 have rank lgn -1, and so on. The rank of a child is no greater than that of its parent. We use these ranks only in the analysis; the implementation of the algorithm does not use them.
We assume that the random node order is independent of the linearization of the unite operations. More precisely, we assume that the node order and linearization are generated together in the following way. The implementation maintains a set U of unordered pairs {u, v} that are candidates for linking, initially empty, and a partial order P of the nodes, initially empty, that is a total order on the nodes of any set defined by the links done so far, and that leaves any two nodes in different sets unordered. To do link (u, v ), a process adds the unordered pair {u, v} to U.
The scheduler sequentially removes pairs from U in arbitrary order. When removing a pair {u, v} from U, the scheduler performs three actions. First, it modifies P by merging the total orders of the sets containing u and v, with each possible merged order equally likely.
This unites the sets containing u and v. The link corresponding to the pair {u, v} succeeds. Third, the scheduler deletes from U all other pairs containing u or v. Each link corresponding to such a pair fails. When a pair is deleted from U, the process that added the pair to U proceeds with its next operations, which are the recomputing of its u and v.
The updating of P maintains the invariant that the total order of the nodes in any set defined by the links done so far is uniformly random. If there is more than one set after all unites have been done, we can extend the final partial order to a total order by merging the total orders on the final sets, with each possible merged order equally likely. The result is a uniformly random permutation of the nodes, equivalent to a uniformly random numbering of the nodes from 1 to n. Thus this implementation does linking by random index, subject to the restriction imposed on the scheduler. The execution does not change if we initially number the nodes uniformly at random but reveal to the scheduler only the total order within each set formed so far.
This implementation restricts the behavior of linking by random index in at least two different ways: in the actual implementation, the CAS operation for a link is defined by an ordered pair, not an unordered pair, so the scheduler gets information about the node order before it needs to decide which such CAS to do next. Also, in the actual implementation a link can make a root the child of a non-root, which cannot happen with the scheduler constraint. We think the latter restriction is inconsequential, but the former is significant. We thus view our analysis of linking by random index as suggestive, not definitive.
Lemma 4. With linking by random index, if the scheduler restriction holds, then parts (i), (v), and (vi) of Lemma 3 are true, as well as the following strengthened versions of parts (ii), (iii), and (iv): (ii) the sum of ranks is at most n, (iii) the number of nodes of rank at least k is at most n/2 k , and (iv) the maximum node rank is at most lg n.
Proof. Parts (ii), (iii), and (iv) are immediate from the definition of ranks. Parts (i), (v), and (vi) follow as in the proof of Lemma 3 from the following claim: (*) Given a node x, each successive ancestor of x in the union forest has probability at least ½ of having higher rank then its parent, independent for each ancestor.
To prove (*), consider a node x. Given an execution of the algorithm, we modify the linearization by delaying links uniting sets not containing x until such a set is a subset of a set about to be united with x. That is, let S be the current set containing x, let {u, v} be the next pair deleted from U with u but not v in S, and let S' be the current set containing v. We modify the schedule to delay all the links forming S' until just before the link of u and v. This does not change the steps done by the execution, only their linearization order. We generate the partial order P by generating a numbering of the nodes incrementally and revealing to the scheduler only the total order within each set constructed so far. Initially we assign a number uniformly at random to x. Subsequently when the scheduler removes a pair {u, v} from U, if u and/or v is unnumbered we assign it a number chosen uniformly at random from the numbers not yet assigned.
Suppose the scheduler removes a pair {u, v} from P with u the root of the tree containing x, and v in another tree, and that the corresponding link succeeds. Let S and S', respectively, be the sets containing u and v just before {u, v} is removed from U. Just before the links forming S' are done, the only numbered nodes are those in S, and u has the largest number. Among the numbers larger than that of u, at least half have rank larger than that of u. When S' is formed, its nodes are numbered uniformly at random from among the unassigned numbers. Given that a number assigned to a node in S' is larger than that of u, it has probability at least 1 /2 of being larger than the rank of u. Since v has largest number among the nodes in S', if the link makes v the parent of u the rank of v is greater than that of u with probability at least 1 /2. The claim (*) follows by induction on the number of steps. Proof. The theorem follows from parts (iv) and (v) of Lemma 4 in the same way that Theorem 3 follows from parts (iv) and (v) of Lemma 3.

Indirection and Helping
The algorithms in Sections 5.1 and 5.2 require that CAS (and DCAS in 5.1) support testing and updating of storage blocks able to store both the parent and the rank of a node. In this section we present two ways to eliminate the need for blocks to contain multiple fields. Both methods increase the number of steps per operation, but by at most a constant factor. We also discuss how to modify the implementation of the randomized linking algorithm of Section 5.2 to eliminate the need for an unrealistic independence assumption in its analysis.
The first method to reduce the block size, proposed by Anderson and Woll [AW91] is to use indirection. Specifically, each node contains only one field, a pointer to a ledger that contains the parent and rank of the node. To do a link via a CAS, a process creates a new ledger containing the updated information for the node being linked and then uses a CAS to attempt to replace the old ledger of the node by the new one. A link via a DCAS is similar, except that the process creates two new ledgers and uses a DCAS to replace the old ledgers of the two affected nodes. Parent updates done by splitting are done directly on the appropriate ledgers, without allocating new ones. The ledger method requires a way to allocate ledgers, and care must be taken to avoid the reuse of ledgers. The algorithm of Section 5.1 needs at most 3n − 2 + 2p = O(n) ledgers, one per initial set plus at most two per successful link plus at most two per process. The algorithm of Section 5.2 needs O(n) ledgers with high probability. If ledgers are used to implement randomized linking by rank, the independence assumption needed by the analysis becomes much weaker and quite realistic: the success or failure of a CAS can depend on all inputs to the CAS, in particular the ledger addresses, but not on the contents of the ledgers.
Allocating ledgers efficiently is itself a challenging problem, which Anderson and Woll ignored. One way to do it is to use the concurrent fetch and increment method of Ellen and Woelfel [EW13]. If ledgers are allocated individually, the number of steps to allocate a ledger is O(log p). If ledgers are allocated in groups of O(log p), the amortized time per allocation is O(1) and the number of ledgers used will be O(n) provided p = O(n/ log n). The problem of ledger allocation deserves more serious study.
The second method is helping, as described for example in [HFP02]. The idea is to allow processes to complete the tasks of other processes. We number the processes from 1 to p. Each node u has an extra field, u.process, which can hold a process number or 0, and is initially 0. Each process has a descriptor in which it records a sequence of steps it wants to perform. To update a node, a process writes appropriate instructions into its descriptor and then does a CAS or DCAS to write its process number into the process field of the affected node or nodes.
Any other process that wants to update a node containing a non-zero process number must first execute the instructions in the corresponding descriptor. When the last instruction is executed, the process number in the affected node or nodes is reset to 0, allowing further updates to the node or nodes. As long as the number of instructions needed to do an update is bounded by a constant, the use of descriptors increases the total work by only a constant factor.
In using helping to link nodes of equal rank, we have to solve the ABA problem: A helping process, having completed the instructions in a descriptor, resets the process number in the relevant node to zero, but in the meantime the process being helped has reset the process number to zero, initiated a new update, and set the process number to its own number again. The helping process has no way to detect that a new update has been initiated by the process that initiated the old one. In our application, we can solve the ABA problem by using the monotonicity of ranks. This requires that a CAS be able to update the rank and the process number of a node as an atomic operation. Since ranks are small and in any realistic application p n, we think this is a reasonable assumption.
The deterministic algorithm of Section 5.1 does links using helping as follows. Each descriptor contains two nodes and a rank. To link root x of rank r to root y, a process, say process i, writes x, y, and r into its descriptor. If y has rank greater than r, it uses a CAS to write i into node x while verifying that the process number of x is 0 and the rank of x is r. If y has rank r, it uses a DCAS to write i into both x and y while verifying that the process numbers of x and y are 0 and the ranks of x and y are r. A process wanting to update a node that finds a non-zero process number i in the node reads the corresponding descriptor. Suppose the descriptor contains nodes x and y and rank r. The process sets z = y.p and does a CAS to set the parent of x to z while verifying that x was a root before the update. It then tests whether y is a root of rank r. If so, it does a CAS to change the rank of y to r + 1 and the process number of y to 0 while verifying that the rank and process number of y were r and i before the update.
This method uses a couple of optimizations. It does not reset the process number of a node that becomes a non-root, since no subsequent link will try to change its parent or rank. Instead of making y the parent of x, it makes y.p the parent of x. The reason to do the link this way is that some other process can make y a non-root just before process i does its CAS or DCAS to write i into x, or into x and y. If this happens, y.p will have rank greater than r when x becomes its child, preserving the invariant that ranks strictly increase from child to parent. If this does not happen and the rank of y is r, the helping process adds one to the rank of y and resets the process number of y to 0.
The randomized algorithm of Section 5.2 does helping using descriptors containing a node y, a rank r, and a flag whose value is null, true, or false. A flag of true indicates that y should become the parent of the root containing the process number of the descriptor; a flag of false indicates that the rank of this node should be changed from r to r + 1. If process i wants to link root x of rank r to root y, it writes y and r into its descriptor. If y has rank greater than r, it sets the flag to true; if the rank of y equals r, it flips a fair coin and sets the flag correspondingly. Then it uses a CAS to set the process number of x equal to i while verifying that the rank and process number of x were r and 0 before the update. A process wanting to update a node x that finds a non-zero process number i in x reads the corresponding descriptor. If the flag is true it does a CAS to set the parent of x to y while verifying that x was a root before the update. If the flag is false it does a CAS to set the rank and process number of x to r + 1 and 0 while verifying that they were r and i before the update.
The analysis of this method relies on the same independence assumption as the method using ledgers: scheduling decisions are independent of the contents of descriptors. A variant of the method is to set the flag after the process number of the descriptor is written into x : the first step of a helping process is to change the flag from null to true or false using the randomized CAS operation mentioned in Section 2. If this operation is available, no independence assumption is needed. This implementation of randomized linking by rank satisfies the Anderson-Woll requirement that a randomized algorithm be efficient even if the scheduler knows the outcome of previous random choices. We think, though, that it is reasonable to assume that the scheduler makes its decisions only on the basis of the inputs to the CAS operations, or that it cannot read the private memories of the processes. If either of these assumptions hold, we do not need randomized CAS.

Upper Bounds
The results of Sections 5 and 6 give us the following theorem: Theorem 5. With any of the three linking methods of Section 5 combined with any valid compaction method, the total work is O(m log n). This bound is worst-case for the deterministic linking method, high-probability for the randomized methods. If randomized linking by rank is implemented as described in Section 6, the bound is valid even for an adversarial scheduler.
Proof. The theorem is immediate from the results of Sections 5 and 6.
The use of splitting instead of naïve find improves the total work bounds significantly if p n. We show this by extending the analysis of sequential splitting [TvL84, GKLT14] to one-try and two-try splitting.
We define the density d of a set union problem instance to be m/(np) if splitting is two-try, m/(np 2 ) if splitting is one-try. We shall obtain a bound of O(m · (α(n, d) + log(1 + 1/d))) on the total work if either kind of splitting is used in combination with any of the three linking methods. The main obstacle we encounter in extending the sequential analysis to the concurrent setting is accounting for unsuccessful CAS operations. Accounting for such operations adds the logarithmic term to the work bound.
We call a problem instance sparse if d < 1 and dense otherwise. The logarithmic term in the work bound dominates only in sparse instances. We start with the analysis of dense instances, which is simpler than that of sparse ones.
We call a child a zero child if its rank is the same as that of its parent. Zero children only exist if a randomized linking method is used. With deterministic linking by rank or linking by random index, the maximum rank is lgn, but with randomized linking by rank, the maximum rank is unbounded, although large ranks occur with exponentially small probability (Lemma 3 part (iii)). Our first result handles the high-rank case.
Lemma 5. With randomized linking by rank, the probability that the maximum rank is n or greater is at most 1 − n/2 n . The contribution to the expected total work of instances in which the maximum rank is n or greater is O(m).
Proof. Part (iii) of Lemma 3 gives the first part of the lemma. If the maximum rank is k, the number of steps done during any find or unite operation is O(n + k). The contribution to the expected running time of instances in which the maximum rank is n or greater is thus bounded by a constant times m k≥n (n + k) /2 k = O(m).
Henceforth when considering randomized linking by rank we assume all ranks are less than n.

The Dense Case
Throughout this section we assume d ≥ 1.
Lemma 6. The number of finds is O(m), worst-case unless randomized linking by rank is used, in which case the bound is with high probability.
Proof. There are at most two finds per unite plus at most two per process per root that increases in rank or becomes a child, for a total of O(m + np) = O(m). For randomized linking, this bound follows from part (ii) of Lemmas 3 and 4 and is high-probability for randomized linking by rank, worst case for linking by random index.
We call a node low if its rank is less than d and high otherwise. The number of nodes of rank at least n is zero if linking is deterministic or by randomized index, and is zero with probability at least 1 − n/2 n by if linking is randomized by rank. During a find, a visit to a node is an iteration of the find loop in which the node is the value of u. (See the pseudocode in Section 4.) Lemma 7. The number of visits to low nodes during finds is O(m), worst-case if linking is deterministic, expected if randomized.
Proof. Consider three successive visits to low nodes during a find, to u, v, and w. Let I be the interval of time between the visits of u and w. We claim that at least one of the following events occurs during I : u or v becomes a child, u.p.r increases, or u or v loses an ancestor of the same rank. The number of such events for fixed u and v is O(d): a node only becomes a child once, its parental rank can increase at most d times before it exceeds d and its parent is not low; a node has O(1) ancestors of the same rank in expectation by part (i) of Lemmas 3 and 4. We charge the visit to u to the corresponding event (or any such event if there is more than one). Each event is charged for at most 2p visits, at most two per process. (The factor of two comes from the two nodes associated with a visit, the node itself and the next node visited.) Summing over all nodes, we obtain a bound of O(npd) = O(m) on visits to low nodes. Suppose the claim is false. The u and v are children when u is visited. After the CAS following the visit to u, the parent of u has changed; after the CAS following the visit to v, the parent of v has changed. If either u or v is a zero child when u is visited, at least one of them becomes a non-zero child or loses an ancestor of the same rank during I. Thus neither u nor v is a zero child when u is visited. But then the rank of the parent of u increases by the time the CAS after the visit to u finishes, making the claim true.
Bounding visits to high nodes is more complicated. For each high child x, we measure the progress of compaction by keeping track of an increasing function of the rank of the parent of x, called the count of x. We define counts using Ackermann's function. Our formulation is an extension of that of Kozen [Koz92]. We define the level x.a of a high node x, and the index x.b and count x.c of a high child x, as follows: We bound the range of levels, indices, and counts by using the properties of Ackermann's function: Lemma 8. If x is a high node, 0 ≤ x.a ≤ α(n, d) and x.a = 0 if and only if x.r = x.p.r. If x is a high child, 0 ≤ x.b < x.r and 0 ≤ x.c < (α(n, d) + 1)x.r. The values of x.a and x.c never decrease, and if x.a or x.b increases, x.c increases by at least as much.
Proof. Since A 0 (x.r ) = x.r + 1, x.a = 0 if and only if x.r = x.p.r, and x.a ≥ 0 if it is defined. If x is a high node, A α(n,d) (x.r) ≥ A α(n,d) ( d ) > n > x.p.r. Thus x.a is defined and is at most α(n, d). (Here for randomized linking by rank we use the assumption that all ranks are less than n.) Suppose x is a high child. If x.a = 0, x.r. The bounds on x.c follow from those on x.a and x.b. While x is a root, x.a = 0. Once x is a child, x.r is constant and x.p.r cannot decrease, so x.a cannot decrease by Lemma 1. While x.a is constant, x.b cannot decrease for the same reason. If x.a increases by one, x.b can decrease by at most x.r -1, resulting in an increase of at least one in x.c. If x.a increases by more than k, x.c increases by at least (k -1)x.r + 1.
Lemma 9. The sum of the counts of all high children is O (nα(n, d)), worst-case unless linking is randomized by rank, in which case the bound is high-probability.
Proof. By Lemma 8, the sum of the counts of high children is O (α(n, d)) times the sum of the ranks of all nodes. By Lemmas 2 and 4, the sum of ranks is less than n for deterministic linking by rank and linking by random index. For randomized linking by rank, it is O(n) with high probability by Lemma 3.
The following lemma is the key to the analysis of splitting.
Lemma 10. Consider a time t at which u is a high child whose parent v is also a (high) child. Let w the parent of v at time t, and let u.a, v.a, and w.r be the levels of u and v and the rank of w at time t, respectively. Suppose that at time t or later the parent of u changes from v to a node x of rank at least w.r. If v.a > u.a, the parent change increases u.a and u.c by at least v.a -u.a; if v.a = u.a, the parent change increases u.c by at least 1 or causes u to lose an ancestor of the same rank.
Proof. Let u.r and v.r be the ranks of u and v at time t, respectively. Let x.r be the rank of x when it becomes the parent of u. Since A v.a-1 (u.r ) < A v.a-1 (v.r ) ≤ w.r ≤ x.r, the level of u after the parent change is at least v.a. If v.a > u.a, the parent change increases the level and hence the count of x by at least v.au.a by Lemma 8. Suppose v.a = u.a. If u.a = 0, the parent change causes u to lose v as an ancestor. Suppose u.a > 0. Since A u.a (u.b + 1) = A u.a-1 (A u.a (u.i )) ≤ A u.a-1 (v.r ) ≤ w.r ≤ x.r, the parent change increases either the level or the index of u and hence increases the count of u.
To count visits to high nodes, we use a credit argument. One credit pays for one high-node visit. We allocate a certain number of credits to each find when it starts, and additional credits when high nodes increase in count or lose ancestors of the same rank. We show via a credit invariant that these credits suffice to pay for all the high-node visits. A bound on the total number of credits gives a bound on the number of high-node visits.
We begin by analyzing two-try splitting: even though it is more complicated than one-try splitting, its analysis is simpler. We call a find active while it is being executed. When a find starts, we allocate it α (n, d ) + 1 credits. When the count of a high child increases by k, we allocate 2k credits to each active find, for a total of at most 2pk. When a high child loses an ancestor of the same rank, we allocate one credit to each active find, for a total of at most p.
Lemma 11. With two-try splitting, the number of allocated credits is O (mα(n, d)), worstcase if linking is deterministic, average-case if randomized.
Proof. By Lemma 6, the number of credits allocated to finds when they start is O (mα(n, d)). By Lemma 9, the number of credits allocated to finds as a result of increases in count is O(npα(n, d)) = O(mα(n, d)). By Lemmas 3 and 4, the expected number of credits allocated to finds as a result of nodes losing ancestors of the same rank is O(np) = O(m).
Lemma 12. With two-try splitting, just after a high node u is visited by a find, the find has at least u.a credits.
Proof. We prove the lemma by induction on the number of high-node visits done by a find. When the find starts, it has O(α(n, d) + 1 credits. The first visit costs one, leaving α(n, d), which is enough to make the lemma true just after this visit. Suppose the lemma holds just after u is visited, and let v be the next node visited. We denote by unprimed and primed values their values just after the visit to u and just before the visit to v, respectively. The lemma holds after the visit to v provided that the find accrues at least v.a'u.a + 1 credits between the visits to u and v. To show that this happens, we need the following crucial inequality, which follows from Lemma 10: To prove (*), we refer to the implementation of two-try splitting. Let t be the first time u.p = v. Time t is after the visit to u, since u.p changes between the first and second times that the find sets its variable v after the visit to u, as a result of the first CAS after the visit to u succeeding or failing. Let w be the parent of v at time t. Consider the change to u.p resulting from the second CAS after the visit to u. This change satisfies the hypothesis of Lemma 9, since the new parent of u must have been the parent of v at time t or later. By Lemma 10, just after this change to u.p, the level of u is at least the level of v at time t. Since levels are non-decreasing, (*) holds.
Between the visits to u and v, the find accrues at least 2(u.a'u.a + v.a'v.a) = (v.a'u.a) + (u.a'u.a) + (v.a'v.a) + (u.a'v.a) credits as a result of level increases.. Each of the last three terms is non-negative, the last one by (*). Thus the find accrues at least v.a' u.a + 1 credits between the visits, unless the levels of u and v are equal and unchanging between the visits. Suppose the levels of u and v are equal and unchanging between the visits. By Lemma 10, the find accrues at least one credit when the parent of u changes from v.
Lemma 13. With two-try splitting, the number of visits to high nodes is O (mα(n, d)), worstcase if linking is deterministic, average-case if randomized.
Proof. The lemma is immediate from Lemmas 11 and 12. Now we extend the analysis to one-try splitting. The proof of Lemma 12 fails for one-try splitting, because a CAS done by one process, say process 1, can fail as a result of a successful CAS done by another process, say process 2, that sets its value of v before process 1's most recent high-node visit. That is, time t in the proof of Lemma 10 can precede the visit. This invalidates the use of Lemma 10 in the proof.
To overcome this problem, we allocate additional credits to node count increases, and we allow active finds to shift some of their credits to the other active finds. Specifically, when a find starts, we allocate it α(n, d) + 1 normal credits. When a high child loses an ancestor of the same rank, we allocate one normal credit to each active find. When the count of a high node increases by k, we allocate 2k normal credits and 2k(p − 1) extra credits to each active find. When a CAS in a find succeeds, we shift a 1/(p − 1) fraction of the find's extra credits to each other active find. Shifted extra credits become normal; that is, we shift a credit at most once.
Lemma 14. With one-try splitting, the number of allocated credits is O(mα(n, d)), worstcase if linking is deterministic, average-case if randomized.
Proof. The bound holds for normal credits by the proof of Lemma 11. By Lemma 9, the number of extra credits allocated to finds as a result of increases in count is O(np 2 α(n, d)) = O(mα(n, d)) since d = m/(np 2 ).
Lemma 15. With one-try splitting, just after a high node u is visited by a find, the find has at least u.a normal credits.
Proof. The proof is an extension of that of Lemma 12. Consider a find, say find 1. The credits allocated to the find when it starts make the lemma true just after its first high-node visit. Suppose the lemma holds just after find 1 visits u, and let v be the next node it visits. We consider three cases. If the CAS after the visit of find 1 to u succeeds, the lemma holds just after the visit to v by an argument like that in the proof of Lemma 12. (This case does not use shifted credits.) Suppose this CAS fails, because a CAS done by another find, say find 2, changes u.p from v to another value. Let t be the last time that find 2 set its variable v before its successful CAS. If t is after find 1 visits u, the lemma holds just after the visit to v by an argument like that in the proof of Lemma 12, again without the use of shifted credits.
The third, new case is if t precedes the visit of find 1 to u. Let t', t", and t" ' be the times find 1 visits u, find 2 does its CAS, and find 1 visits v, respectively. We denote by unprimed, primed, double-primed, and triple-primed values their values at times t, t', t", and t" ', respectively. Applying Lemma 9 to time t and the successful CAS of find 2 gives u.a" ≥ v.a; and, if u.a ≤ v.a, the count of u increases by at least 1 or u loses an ancestor of the same rank when find 2 does its CAS.
Lemma 16. With one-try splitting, the number of visits to high nodes is O (mα(n, d)), worstcase if linking is deterministic, average if randomized.
Proof. The lemma is immediate from Lemmas 14 and 15.

The Sparse Case
In this section we modify the analysis of Section 7.1 to handle sparse instances. Throughout this section we assume d < 1. We need to change the definition of low and high nodes, add an additional node type, middle, and (for the purpose of the analysis only) redefine the ranks of nodes.
Let l = lg(1 + 1/d ). Since d < 1, l > 1. A node is low if its rank is less than l and its height is less than cl, where c is the constant in part (vi) of Lemmas 3 and 4; middle if its rank is less than l but its height is at least cl, and high if its rank is at least l. Middle nodes can exist only if linking is randomized.
Lemma 17. The number of non-low nodes is at most 2nd, as is the sum of the ranks of such nodes. This bound is worst-case if linking is deterministic, average-case if randomized.
Proof. By part (vi) of Lemmas 3 and 4, the expected number of middle nodes is at most n/2 l ≤ n/2 lg(1/d) = nd if linking is randomized. (It is zero if not.). By Lemma 2 or part (iii) of Lemma 3 or 4 depending on the linking method, the number of high nodes is also at most n/2 l ≤ nd, worst-case if linking is deterministic or by randomized index, average-case if by randomized rank. The bound on the sum of ranks follows from the node bound by the argument in the proof of Lemma 2 if linking is deterministic, by that in the proof of part (ii) of Lemma 3 or 4 if randomized.
Lemma 18. The number of finds that visit at least one non-low node is O(m), worst-case if linking is deterministic, average-case if randomized.
Proof. Consider the finds during unites that visit at least one non-low node. At most two per unite also visit a low node. Of those that visit only non-low nodes, there are at most two per unite plus at most 2p per non-low node that becomes a child or has a rank increase, two per process doing a unite while the event in question takes place. By Lemma 17, the number of such finds is O(npd) = O(m).
Lemma 19. The number of visits to low nodes is O(ml), worst-case.
Proof. The analysis of node visits in the proof of Theorem 2 restricted to nodes of rank less than l and height less than cl gives a bound of O(l) low-node visits for each find and unite.
Lemma 20. If linking is randomized, the expected number of visits to middle nodes is O(ml).
Proof. By Lemma 18, the expected number of finds that visit middle nodes is O(m). During such a find, each visit to a middle node except the last two is followed by a middle node losing a child of the same rank or the parent of a middle node x increasing in rank. The latter can only happen l times before x has a parent that is not a middle node; subsequently, x can only be the last middle node visited during a find. We charge each visit to a middle node other than the last two of a find to the corresponding event. The charge per event is at most p, and the expected number of events is at most ndl rank increases and O(nd) losses of same-rank ancestors, the latter by part (i) of Lemma 3 or 4. Such events account for O(npdl) = O(ml) visits. Adding the last two per find gives the lemma.
To count visits to high nodes, we define the effective rank of a high node x to be x.er = x.rl + 1. We define levels of high nodes and indexes and counts of high children, using effective ranks in place of ranks. Since the effective rank of a high node is at least one, levels of high nodes and indices and counts of high children are well-defined. We allocate credits exactly as in Section 7.1. Lemmas 8 and 10 remain true. By Lemma 17, the sum of counts of high children is O(ndα(n, d)), worst-case if linking is deterministic, high-probability if randomized. We allocate credits exactly as in Section 5.1. Lemmas 12 and 15 remain true. If splitting is two-try, the number of allocated credits is O((m + ndp)α(n, d)) = O(mα(n, d)) since d = m/(np); if splitting is one-try, it is O((m + ndp 2 )α(n, d)) = O(mα(n, d)) since d = m/(np 2 ). We conclude that Lemmas 11, 13, 14, and 16 hold in the sparse case (with the new definition of a high node).

The Total Work Bound
Combining the results of Sections 7.1 and 7.2, we obtain the following theorem: Theorem 6. With any of the three linking methods of Section 5 and either one-try or two-try splitting, the total work is O(m(α(n, d)+log(1+1/d))), worst-case if linking is deterministic, average-case if randomized, where d = m/(np 2 ) if splitting is one-try, d = m/(np) if splitting is two-try.

Lower Bounds
In this section, we derive lower bounds on the worst-case and amortized efficiency of set union algorithms. In the first subsection, we prove lower bounds on the work efficiency of the algorithms described in this paper by explicitly providing worst-case executions-both the operations and the adversarial schedules. At a high level, our executions are constructed by the following observations and steps. For each algorithm, we describe operations that build a tree of logarithmic height using unite operations. We observe that shadowing schedules in which all processes are scheduled in lock-step while performing the same expensive find operations result in worst-case behavior. We apply a shadowing schedule to processes performing a find on the deepest node in the aforementioned tree to prove that the logarithmic term in our upper bounds is tight. Then, we combine the idea of shadowing schedules with previous sequential lower bounds of Tarjan et al. and Fredman et al. [TvL84,FS89] to show that the inverse-Ackermann term in our upper bounds is tight. Our algorithmic lower bounds section proves that our amortized upper bound analyses are tight when find operations are done with two-try splitting.
In the second subsection, we show general lower bounds that apply to the concurrent set union problem. First, we prove that, in the worst-case, any concurrent set-union algorithm must do at least Ω(log min{n, p}) work in expectation for a single operation. When p = n ω( 1 log log n ) , this lower bound is stronger than the sequential lower bound of Ω log n log log n given by Fredman and Saks [FS89] in the cell probe model. It also shows a separation in work complexity between the sequential and concurrent versions of the set-union problem, since Blum [Blu85] presented an algorithm that does at most O log n log log n work per operation in the sequential setting. Furthermore, whenever log p = Θ(log n), i.e. when p = n , this lower bound establishes that randomized linking with any form of compaction yields an algorithm with optimal expected work per operation. Finally, we generalize the worst-case lower bound using shadowing schedules to show that our algorithm obtained by combining randomized linking with two-try splitting is optimal amongst a class of symmetric algorithms that includes all known algorithms for the concurrent disjoint set union problem.

Algorithmic Lower Bounds
In order to prove the tightness of the inverse-Ackermann term in our upper bounds, we recall a sequential cell probe lower bound on the set union problem given by Fredman and Saks.
Lemma 21 ( [FS89]). Let A be any randomized algorithm that solves the sequential set union problem. For any fixed number of nodes n, and any M ≥ n, there is a sequence of operations σ M , that makes A perform Ω(M α(n, M/n)) expected work.
We use Lemma 21 to establish a concurrent lower bound.
Lemma 22. Let A be any of the algorithms we have described for concurrent set-union. There is some sequence of m operations using p processes on n nodes that requires Ω m · α n, m np work in expectation.
Proof. Any concurrent algorithm is also a sequential algorithm if it is run by a single process. So, for any given M ≥ n, we can take a worst-case sequence σ M of operations from Lemma 21. That is, a single process running the sequence of operations σ M will perform Ω(M α(n, M/n)) work in expectation. In the remainder of the proof, we use shadowing schedules, in which processes run in lock-step with each other and thereby do not gain locally from any compaction attempts of other processes, to get the lower bound.
We consider two cases for m: Case 1: If m ≥ np, then we choose M = m/p ≥ n. If each of the p processes runs σ M and is scheduled in lock-step (so that the processes all walk up find sequences together and do not benefit from each other's compaction attempts), then the total number of operations is pM = m and the total amount of work is Ω(pM α(n, M/n)) = Ω mα n, m np .
Case 2: If m < np, we choose M = n. Then, σ M performed by a single process takes Ω(nα(n, 1)) expected work. We observe that m/n < p and assign m/n processes the operation sequence σ M , thus assigning m operations. If the processes are scheduled in lock-step, the total expected work performed by them is Ω(m/n · nα(n, 1)) = Ω mα n, m np .
We can also show that the logarithmic term log( np m + 1) is an amortized lower bound for all our algorithms. The schedule that builds binomial trees with a single process and makes all the processes shadow each other up the longest branch of these trees yields the lower bound.
Lemma 23. For randomized linking by rank and linking by DCAS, regardless of what type of compaction is used in find operations, and for any positive integer k ∈ [1, n], there is a sequence of k − 1 unite operations that will build a tree with k nodes with height Ω(log k).
Proof. For simplicity, we initially assume that k is a power of 2. Let B j be the binomial tree of height j. All the nodes are initially in singleton trees, i.e. B 0 trees. We proceed in lg k rounds. In round r, we start with k 2 r−1 trees of type B r−1 , and simply unite their roots pairwise. After lg k rounds we end up with a single tree of type B lg k . If k were not a power of 2, we perform the above procedure with the largest power of 2 less than k and simply unite the remaining nodes to the root of the main tree.
A slightly more complex construction allows us to prove a similar lemma for linking by index.
Lemma 24. For randomized linking by index, regardless of what type of compaction is used in find operations, and for any positive integer k ∈ [1, n], there is a sequence of k − 1 unite operations by a single process that will build a tree with k nodes in which the depth of a uniformly randomly picked node is Ω(log k) in expectation.
Proof. The proof is constructive. The construction of these trees is inspired by binomial trees, and is done in multiple rounds such that each round fully finishes before the next round starts. Without loss of generality let k be a power of 2, as otherwise we could just use the greatest power of 2 less than k in the following construction. Initially, we let the nodes be in singleton trees T 1,1 , . . . , T k,1 . In each round we will combine pairs of trees, and each tree T will have a designated node ν(T ). In the initial trees the designated node is the only node. In the first round we combine pairs of trees by performing unite(ν(T 1,1 ), ν(T 2,1 )), unite(ν(T 3,1 ), ν(T 4,1 )), . . . , unite(ν(T k−1,1 ), ν(T k,1 )) to produce tree T 1,2 , . . . , T k/2,2 . The designated node ν(T i,2 ) is chosen to be one of the designated nodes of the subtrees that formed T i,2 . We call this process of picking the new designated nodes as a subset of the old ones refining. The subsequent rounds are done similarly by combining pairs of trees from the previous round and refining designated nodes, until only the tree T 1,lg k remains.
We now make the following observations about this process: 1) All trees T i,r of a given round r have the same number of nodes 2 r .
2) A designated node always has depth at most 2. (This follows from the way find does compactions.) 3) A node of depth δ in any of the trees T i,r has at most 1 2 δ · |T i,r | successors.
The links in the rounds raise the depth of half the nodes due to (1), and the compactions in the find operations of the rounds reduce the average depth of a node in the forest by at most 1 4 due to (2) and (3). Thus, each round increases the average depth of a node in the forest by at least 1 2 − 1 4 = 1 4 . Since there are log(k) rounds, the proof is complete.
Combining the previous lemmas yields our best algorithmic lower bound result.
Lemma 25. Let A be a concurrent disjoint set union algorithm obtained by combining linking by DCAS, randomized linking by rank, or linking by random index with find with no compaction, one-try splitting, or two-try splitting. There is a schedule of m operations on n nodes by p processes that forces A to perform Ω m log np m + 1 work. The bound holds in expectation for the linking by random index algorithm even under the independence assumption.
Proof. We prove the theorem for linking by the DCAS and randomized linking by rank algorithms first. The lower bound is non-trivial only when m/p < n. In this case, we describe a particular sequence of operations and schedule that performs the requisite work. Divide the nodes into m/p groups of size n/(m/p) = np/m. Lemma 23 allows us to link each group of nodes into a tree of height Ω m log np m + 1 . For each such tree, perform find(x) on the deepest node x of that tree simultaneously with each of the processes. Now consider the schedule in which processes shadow each other in all the finds. In this schedule, each process does Ω m log np m + 1 work per find, and one find per group. The total number of operations is m/p × p = m, and the total amount of work is Ω m log np m + 1 .
In the case of the linking by random index algorithm under the independence assumption, we modify the above argument by replacing the use of Lemma 23 with Lemma 24, and performing find(x) on a uniformly randomly picked node x in the tree.
Combining the previous lemmas yields our best algorithmic lower bound result.
Theorem 7. Let A be a concurrent disjoint set union algorithm obtained by combining linking by DCAS, randomized linking by rank, or linking by random index with find with no compaction, one-try splitting, or two-try splitting. There is a schedule of m operations on n nodes by p processes that forces A to perform Ω m log np m + 1 + α n, m np work. The bound holds in expectation for the linking by random index algorithm even under the independence assumption.
Proof. Combine the results of Lemmas 22 and 25.
As the final algorithmic lower bound, we prove that the independence assumption we have been using to analyze the linking by random index algorithm is indeed necessary. In particular, we present a super-logarithmic work lower bound for the algorithm if the independence assumption does not hold.
Lemma 26. Concurrent set union via the linking by random index algorithm performs Ω(m √ p) expected work to do m = n √ p operations if √ p ≤ n, regardless of which compaction rule the find operations use.
Proof. We will show an explicit example. Assume √ p < n, and pick a set S of √ p nodes.
Let p/2 processes attempt to do unite(x, y) where each pair of x, y in S is tried by at least one process. The scheduler can wait to see the outcomes of the node comparisons and decide to schedule the processes so that the nodes of S get linked into a linear path of length √ p.
If the remaining p/2 processes all perform find(x) where x is chosen randomly from S, and are scheduled in lock-step, they will perform, in expectation, Ω( √ p) work each, since the expected depth of x is √ p/2.
Performing the same process on each of the n √ p sets of nodes leads to Ω(np) work to do m = n √ p operations. The average operation takes Ω( √ p) work.

Problem Lower Bounds
In this subsection, we prove that any concurrent set-union algorithm must do Ω(log min{n, p}) work for a single operation in the worst case. Furthermore, we build on the worst-case lower bound to show an Ω(log(np/m + 1)) amortized work lower bound for all symmetric algorithms, where we say an algorithm is symmetric if: 1) The algorithm's code for the unite and find procedures does not use process ids.
2) The algorithm does not use the return values of CAS operations.
All our algorithms and all algorithms known to us can be made symmetric without effecting the upper bound analyses of the algorithms. For instance, this can be done if we assume that all CAS operations return false. This does not effect the correctness of our algorithms since we only use the return value of a CAS operation in the unite operation to determine if a link has been successful. However, if we do not perform this check atomically, our algorithms remain correct, since a process that has successfully united two trees together will realize this shortly afterwards when its u and v pointers meet at the root of the united tree. The work efficiency analysis increases by at most a factor of two, because we can always imagine the case where processes work in pairs (p, q), and each pair performs operations together and are always scheduled in lock-step. In this case, if p ever performs a successful link CAS(u.p, u, v) and returns, then q's attempt to perform the same link will fail; thus q will only return after it traverses the whole tree and discovers that some other process has already finished the link it wanted to do. The modification we propose to symmetrize the algorithm will simply make p do the same work as q.
Our lower bounds make use of a result on a problem called "wake-up". The wake-up problem on k processes asks for a wait-free algorithm with two properties: (i) every process returns a boolean value and at least one process returns true, and (ii) a process may return true only if every process has already executed at least one step. The following Lemma is a lower bound on the complexity of wake-up that follows straightfowardly from Jayanti's lower bound in [Jay98b].
Lemma 27 ( [Jay98a, Jay98b]). For any k process wake-up algorithm that uses variables supporting read, write, and CAS, there is a schedule in which some process performs Ω(log k) steps in expectation.
We solve wake-up via set-union to get our lower bounds below.
Lemma 28. Reduction 1 (below) solves the wake-up problem for k processes using a disjoint set union instance with k + 1 nodes, in which each process executes one unite and two find operations.
Proof. Let q 1 , . . . , q k be the k processes and let the nodes be labelled 0, . . . , k. The reduction below correctly solves wake-up because: Reduction 1 : q j 's code in wake-up solution.
1: procedure WakeUp 2: unite(j − 1, j); x ← find(0); y ← find(k); return x = y (i) the last process to complete unite finds that the representatives of nodes 0 and k are the same and thus returns true, and (ii) no process returns true before all processes have completed unite, since no representative of k can be the same as any representative of 0 until they are in the same set, i.e. until the last of the unite operations is linearized.
Theorem 8. Let A be a linearizable wait-free concurrent disjoint set union algorithm using read, write, and CAS. There is a schedule of m operations on n nodes by p processes that forces A to perform Ω(log min{n, p}) work in expectation.
Proof. Instantiate Lemma 28 with k = min{n − 1, p}. The most expensive of the three set union operations of the process that performs the most work in the adversarial schedule of Lemma 27 must do Ω(log min{n, p}) expected work.
Corollary 2. The disjoint set union algorithm obtained by combining randomized linking with any form of find described in this paper gives an algorithm with optimal worst-case work per operation up to constant factors when log p = Θ(log n).
Remark 2. Theorem 2 shows that our set union algorithms with randomized linking have optimal work per operation when p = n ε for constant ε.
Remark 3. Theorem 8 establishes a separation in worst-case work complexity between sequential and concurrent set-union when p = n ω( 1 log log n ) since Blum's sequential set-union algorithm has a worst-case work complexity of O( log n log log n ) [Blu85].
Lemma 29. Let A be a linearizable wait-free symmetric concurrent disjoint set union algorithm using read, write, and CAS. There is a schedule of m operations on n nodes by p processes that forces At to perform Ω(m log(np/m + 1)) work.
Proof. Divide the n nodes into g = m p groups of size k + 1, where k = np 8m (disregard any additional nodes); label the groups G 1 , . . . , G g . Note that m ≥ p and m ≥ n 2 , so k ≤ p 4 . We divide the p 2 (out of the p) processes into two sets A = {q 1 , . . . , q k } and B = {q k+1 , . . . , q p/2 }. Note that |B| ≥ p 4 and |A ∪ B| = p 2 .
Consider running the wake-up algorithm of Lemma 28 on the k processes in A using the k +1 nodes in G 1 . By Lemma 27 there is a schedule σ 1 in which some process q i performs Ω(log k) steps. Assign to each process in B the same sequence of three set union operations that q i performs, and define schedule σ 1 to be the schedule σ 1 , with the processes of B interleaved in to run in lockstep with q i . In this schedule, the processes q 1 , . . . , q p/2 perform Ω(p log k) work to do p set union operations. Repeating this procedure on each group G j produces schedules σ j , each of which performs Ω(p log k) work. Therefore, in the concatenated schedule of σ 1 σ 2 · · · σ g , the processes q 1 , . . . , q p/2 perform a total of Ω(gp log(k+1)) = Ω(m log(np/m+1)) work to do a total of gp = m operations.
Theorem 9. Let A be any linearizable wait-free symmetric concurrent disjoint set union algorithm using read, write, and CAS. There is a schedule of m operations on n nodes by p processes that forces A to perform Ω m log np m + 1 + α n, m np work in expectation.
Proof. Combine the results of Lemma 22-whose argument applies to all symmetric algorithmswith Lemma 29.
Remark 4. Theorem 9 shows that the set union algorithm obtained by combining randomized linking with two-try splitting has optimal amortized work efficiency amongst all symmetric algorithms (up to a constant factor).
The ideas behind our collection of upper and lower bounds lead us to make the following conjecture about the expected work complexity of concurrent disjoint set union.
Conjecture 1. The expected work complexity of the concurrent set union problem is Θ m · log np m + 1 + α n, m np .
In light of Theorem 6, which shows that randomized linking with two-try splitting satisfies the conjectured upper bound, a refutation of Conjecture 1 would imply a more efficient algorithm than randomized linking with two-try splitting. On the other hand, a demonstration of the conjecture would involve proving a universal lower bound; namely, showing that Theorem 9 holds for all algorithms (as opposed to only symmetric ones). While this paper proves a universal lower bound on the worst-case complexity of a single operation, it does not prove any universal lower bounds on the total work complexity of m operations. The only such lower bound that is known for the problem is exponentially weaker than the conjectured one. We state this bound, by Jayanti et al., below.
Theorem ( [JTBA19] ). Let A be any linearizable wait-free concurrent disjoint set union algorithm using read, write, and CAS. There is a schedule of m operations on n nodes by p processes that forces A to perform Ω m log log np m + 1 + α n, m n work in expectation.

Remarks and Open Problems
We have presented three linking methods and two splitting methods for concurrent disjoint set union. With any of the linking methods, with or without compaction, the number of steps per operation is O(log n), worst-case if linking is deterministic, high-probability if randomized. With any of the linking methods and either of the splitting methods, the total work is O(m(α(n, d) + log(1 + 1/d))), worst-case if linking is deterministic, averagecase if randomized, where the problem density d is m/(np 2 ) if splitting is one-try, m/(np) if splitting is two-try. No matter what the density, the cost of concurrency is at most a factor of log p, making our algorithms truly scalable. The proofs of the bounds for linking by random index require assuming that the scheduler is non-adversarial, as discussed in Section 5.3. The bounds differ for the two splitting methods only for a narrow range of densities: if m/n = O(1) or m/n = Ω(p 2 ), the bounds are the same; if m/n = ω(1) and m/n = o(p 2 ), the bounds differ by a factor of at most log p.
The O(log n) step bound is tight for all our algorithms. The work bounds for splitting are almost tight: any symmetric algorithm (as defined in Section 8) has a work bound of Ω m · log np m + 1 + α n, m np . We conjecture that the same lower bound can be shown for asymmetric algorithms also (Conjecture 1), but leave the proof or refutation of this statement as an open problem. We also leave as an open problem finding an efficient deterministic algorithm that uses only CAS, not DCAS. We think the deterministic coin tossing method of Cole et al. [CV86] can be used, but the resulting algorithm will be significantly more complicated than the one we have proposed, and the number of steps per operation and the total work bound will contain an additional factor of log * p.
In some applications of disjoint set union, such as computing flow graph information [Tar74,Tar73] each set has a name or some other associated value, such as the number of elements in the set. We can extend the compressed tree date structure to support set values by storing these in the set roots. In the sequential setting, it is easy to update set value information in O(1) time during a link. But in the concurrent setting, updating the value in the new root during a link requires a DCAS or some more-complicated implementation using CAS. Updating root values using DCAS invalidates our analysis. Consider n singleton sets {1}, {2}, . . . , {n}. Suppose p = n, and unite(1, n), unite(2, n),. . ., unite(n -1, n) are performed concurrently using linking by rank via DCAS. Assume the tie-breaking total order is numeric. At most one link will succeed initially, say the link of 1 and n. After this link, all nodes except n will still have rank 0, and n will have rank 1. The algorithm of Section 5.1 does all the remaining links concurrently using CAS, since none affects node n.
But if each such link needs to update the value in node n, the remaining links must be done one-at-a-time, resulting in overall work Ω(np).
We think this problem can be overcome, and that the concurrent set union problem with set values can be solved in a work bound that is quasi-linear in m and poly-logarithmic in p. But doing so may well require relaxing the linearization requirement: instead of continuing to try to link each node i and n, suppose the algorithm does a different set of links that reduce the contention. For example, the algorithm could link roots in pairs, then the remaining roots in pairs, and so on. The set resulting from all the links would be the same, but the intermediate sets would not correspond to any linearization of the original unites. Even though it violates linearization, such an algorithm might suffice in many if not all applications.
An algorithm of this kind needs a mechanism to restructure the links. We think some sort of binary tree structure, like the one used by Ellen and Woelfel [EW13] in their fetch-andincrement algorithm but more dynamic, may suffice. We leave open the development of this idea or some other idea to solve the problem of concurrent sets with values.
Finally, we think our methods will fruitfully extend to a distributed-memory, message-passing setting.