Constructing the R* Consensus Tree of Two Trees in Subcubic Time

The previously fastest algorithms for computing the R* consensus tree of two given (rooted) phylogenetic trees with a leaf label set of cardinality n run in Θ(n3) time (Bryant and Berry in Adv. Appl. Math. 27(4):705–732, 2001; Kannan et al. in SIAM J. Comput. 27(6):1695–1724, 1998). In this manuscript, we describe a new \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$O(n^{2} \sqrt{\log n})$\end{document}-time algorithm to solve the problem. This is a significant improvement because the R* consensus tree is defined in terms of a set \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {R}_{\mathit{maj}}$\end{document} which may contain Ω(n3) elements, so any direct approach that explicitly constructs \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {R}_{\mathit{maj}}$\end{document} requires Ω(n3) time.


Introduction
Phylogenetic trees are leaf-labeled trees commonly used to describe the evolutionary history of a set of objects such as biological species or languages [1,9,10,16,18]. Typically, in a phylogenetic tree, each leaf represents one of the objects being studied and the branching structure of the tree shows the assumed evolutionary relationships among the objects. Depending on the available data, it can be difficult to infer an accurate phylogenetic tree. For example, small changes in the input data may result in trees with very different structures. A consensus tree is a single phylogenetic tree which summarizes the branching information contained in an input collection of phylogenetic trees with identical leaf label sets. Consensus trees are useful when different data sets or different tree inference methods have produced a set of trees with the same leaf labels and slightly conflicting structures, yet a single tree is required to represent all of them [1,3,7,9,14,18] (indeed, phylogenetic analyses often output alternative trees for the same set of species [1]). Also, by exclusion, a consensus tree indicates areas of conflict in the input trees [3]. Furthermore, consensus trees are sometimes used as a basis for new phylogenetic inferences [3].
Over the years, many types of consensus trees have been defined and studied in detail. For a survey, see, e.g., [3], Chap. 30 in [9], Chap. 3.2 in [18], or Chap. 8.4 in [19]. Different types of consensus trees use different criteria to resolve conflicts among the input trees, so their mathematical properties vary. Thus, the most suitable type of consensus tree to use in practice depends on the particular application. In this paper, we focus on the so-called R* consensus tree. One advantage of the R* consensus tree is that it provides a statistically consistent estimator of the species tree topology when combining a set of gene trees, as recently demonstrated by Degnan et al. [8]; moreover, R* consensus outperformed other methods such as majority-rule consensus in the study conducted by [8]. For the case of two input trees, it is known [3] that the R* consensus tree is equivalent to the RV-III tree introduced in [14].
The R* consensus tree is defined in Sect. 2 below. 1 The idea is: For any set of input phylogenetic trees on a fixed leaf label set L, there are certain small binary trees called rooted triplets (each containing exactly three leaf labels from L) that occur as embedded subtrees more frequently than others, and the R* consensus tree includes as many of these rooted triplets as possible and none of the others. In short, if the rooted triplet xy|z for any {x, y, z} ⊆ L is consistent with more input trees than each of the two rooted triplets xz|y and yz|x is, then xy|z belongs to a set named R maj ; the R* consensus tree is the phylogenetic tree τ having the largest possible number of internal nodes such that every rooted triplet consistent with τ belongs to R maj .
The goal of this paper is to develop a fast algorithm for constructing the R* consensus tree for two input trees T 1 and T 2 with a leaf label set L of cardinality n. The previous algorithms for this problem require Θ(n 3 ) time [4,14], which was believed to be optimal because when T 1 and T 2 have similar branching structures, |R maj | = Ω(n 3 ). Our main result is that the time complexity of the problem is in fact subcubic. In order to obtain a subcubic-time algorithm, we have to avoid explicitly constructing the set R maj . For this purpose, we compute the values of a function named s R maj associated to R maj by using a novel formulation based on distances from leaves to lowest common ancestors of pairs of leaves, and then apply an algorithm by Bryant and Berry [4] to compute the Apresjan clusters of s R maj . Next, we check each of the obtained Apresjan clusters to find all strong clusters of R maj , which are subsequently used to build the R* consensus tree. In total, the running time of our new algorithm R*_consensus_tree is O(n 2 √ log n). (The preliminary conference version of this paper [13] described a slightly slower O(n 2 log n)-time method.) The paper is organized as follows. Section 2 introduces the basic definitions and terminology used throughout the paper and states some simple properties of clusters and the R* consensus tree. Next, our algorithm R*_consensus_tree is presented in Sect. 3. Two crucial steps of the algorithm (how to compute the values of the function s R maj efficiently and how to determine if a cluster is a strong cluster of R maj ) are described in detail in Sects. 4 and 5. Finally, Sect. 6 discusses open problems for further research.

Basic Definitions
A phylogenetic tree is a rooted, unordered, distinctly leaf-labeled tree in which every internal node has at least two children. From here on, "tree" means "phylogenetic tree", and every leaf in a tree is identified with its label.
We shall use the following terminology and notation. For any tree T and any node u of T , the subtree of T rooted at u is denoted by T [u]. The set of all leaves in a tree T is written as Λ(T ). To simplify the presentation, every node in a tree is considered to be an ancestor as well as a descendant of itself. For any nodes u, v in a tree, in case u is a descendant of v and u = v then we call u a proper descendant of v. For any set A of nodes in a tree T , the lowest common ancestor of A in T is the node w such that: (1) every node in A is a descendant of w; and (2) w is a proper descendant of x for every other node x which is an ancestor of all nodes in A. The lowest common ancestor of any two nodes u and v in a tree T is denoted by lca T (u, v).
A triplet is a tree with three leaves. Any non-binary tree containing exactly three leaves {x, y, z} is called a fan triplet and is written as x|y|z. On the other hand, any binary tree with exactly three leaves {x, y, z} is called a rooted triplet, and is denoted by xy|z if the lowest common ancestor of x and y is a proper descendant of the lowest common ancestor of x and z. Note that there are precisely four different triplets for any set of three leaf labels {x, y, z}, namely x|y|z, xy|z, xz|y, and yz|x; see Fig. 1.
For any tree T and {x, y, z} ⊆ Λ(T ), the fan triplet x|y|z is said to be consistent with T if lca T (x, y) = lca T (x, z) = lca T (y, z). Similarly, the rooted triplet xy|z is consistent with T if lca T (x, y) is a proper descendant of lca T (x, z)

Strong Clusters and Apresjan Clusters
Below, let R be a given set of triplets over a leaf label set L = r∈R Λ(r) such that for each {x, y, z} ⊆ L, at most one of x|y|z, xy|z, xz|y, and yz|x belongs to R. A cluster of L is any non-empty subset of L. We define two special types of clusters: Definition 1 A cluster A of L is called a strong cluster of R if aa |x ∈ R for all a, a ∈ A and x ∈ L \ A. Furthermore, L as well as every singleton set of L is also defined to be a strong cluster of R. There is a connection between the strong clusters of R and the Apresjan clusters of s R : Lemma 2 Every strong cluster of R is an Apresjan cluster of s R .
Proof Let C be a strong cluster of R. Consider any fixed a, a ∈ C and x ∈ L \ C. We need to show that s R (a, a ) > s R (a, x).
First of all, for every y ∈ L \ C, we have aa |y ∈ R by the definition of a strong cluster, which gives s R (a, a ) = |{aa |y : aa |y ∈ R}| ≥ |L \ C|. In the same way, ab|x ∈ R holds for every b ∈ C, which (along with the requirement that for each {x, y, z} ⊆ L, at most one of x|y|z, xy|z, xz|y, and yz|x belongs to R) implies that ax|b ∈ R. Thus, s R (a, x) = |{ax|y : ax|y ∈ R}| ≤ |(L \ C) \ {x}| < |L \ C|. Since s R (a, x) < s R (a, a ), C is an Apresjan cluster of s R by definition.

R* Consensus Trees
Let T 1 and T 2 be two given trees with Λ(T 1 ) = Λ(T 2 ) = L. For any {a, b, c} ⊆ L, define #ab|c as the number of trees T i for which ab|c ∈ r(T i ). The set of "majority rooted triplets" R maj is defined as {ab|c : a, b, c ∈ L and #ab|c > #ac|b, #bc|a}. An R* consensus tree of T 1 and T 2 is a tree τ with Λ(τ ) = L which satisfies r(τ ) ⊆ R maj and which maximizes the number of internal nodes. 2 The next two lemmas describe some useful properties of the strong clusters of R maj .

Lemma 3 Let T be a tree with Λ(T ) = L and r(T )
Proof If u is a leaf or the root of T then Λ(T [u]) is trivially a strong cluster of R maj . If u is an internal non-root node then for any two a, a ∈ Λ(T [u]) and any x ∈ Λ(T [u]), the triplet aa |x belongs to R maj , so Λ(T [u]) is a strong cluster of R maj .

Lemma 4
There exists a tree τ such that {Λ(τ [u]) : u is a node in τ } equals the set of strong clusters of R maj .
Proof First observe that for any two strong clusters A and B of R maj , either A B, B A, or A ∩ B = ∅. To prove this claim, suppose for the sake of contradiction that there are x, y, z ∈ L such that x ∈ A \ B, y ∈ B \ A, and z ∈ A ∩ B. Since A is a strong cluster, xz|y ∈ R maj . Since B is a strong cluster, yz|x ∈ R maj . But by the definition of R maj , we cannot have both of xz|y and yz|x in R maj . The claim follows.
As a direct consequence, the strong clusters of R maj form a nested hierarchy (laminar family) on L. By Theorem 13.21 of [17], the collection of strong clusters of R maj can be represented by a rooted tree τ together with a function π : L → {nodes of τ } such that the strong clusters of R maj are in one-to-one-correspondence with the nodes of τ in the following sense: For any strong cluster A and any x ∈ L, it holds that x ∈ A if and only if π(x) is a descendant of the node τ (A) corresponding to A (as before, every node is also defined to be a descendant of itself). 3 Next, we show that: • For every x ∈ L, the node π(x) must be a leaf: Suppose that for some x ∈ L, the node π(x) is an internal node. Consider the strong cluster {x}. By the above, the node π(x) is equal to or a proper descendant of the node τ ({x}). Since π(x) has at least one child, there exists some node v in τ which is a proper descendant of τ ({x}). The (non-empty) strong cluster corresponding to node v is therefore a proper subset of {x}, which is impossible. Hence, π cannot map any element of L to an internal node of τ . • Every internal node of τ has at least two children: If A and B are strong clusters of R maj with A B then there exists some strong cluster C of R maj such that C B and C ∩ A = ∅. For example, take any c ∈ B \ A and use the fact that {c} is a strong cluster of R maj .
By these observations, there exists a tree τ leaf-labeled by L where every internal node has at least two children such that each strong cluster of

Theorem 1 An R* consensus tree of T 1 and T 2 always exists. In particular, it includes every strong cluster of R maj and no other clusters of L.
Proof By Lemma 4, there exists a tree τ that includes all the strong clusters of R maj and does not include any other clusters. For any aa |x ∈ r(τ ), there exists a node u in τ such that a, a ∈ Λ(τ [u]) and x ∈ Λ(τ [u]), i.e., a, a ∈ A and x ∈ A for some strong cluster A of R maj , which implies that aa |x ∈ R maj . Thus, r(τ ) ⊆ R maj .
To prove the optimality of τ , suppose that there exists a tree τ which satisfies r(τ ) ⊆ R maj and has more internal nodes than τ . By Lemma 3, the set of leaves in each rooted subtree of τ forms a strong cluster of R maj . Since τ has more internal nodes than τ , it follows that τ includes some strong cluster of R maj which τ does not include. This contradicts Lemma 4. Hence, τ is an R* consensus tree of T 1 and T 2 .

Constructing a Tree from a Set of Clusters
Here, we address the problem of constructing a tree from a given set of clusters. We first recall a problem known as the directed perfect phylogeny problem with binary characters. Suppose M is a binary matrix. Let I be the set of rows and J the set of columns in M. A directed perfect phylogeny for M is a tree T such that: (1) the leaves of T are bijectively labeled by I ; and (2) each c ∈ J is associated with exactly one Algorithm R*_consensus_tree Input: Two trees T 1 , T 2 with Λ(T 1 ) = Λ(T 2 ) Output: The R* consensus tree of T 1 and T 2 1: Define L := Λ(T 1 ) = Λ(T 2 ) and compute s R maj (a, b) for all a, b ∈ L as described in Sect. 4 2: Compute the Apresjan clusters of s R maj according to Lemma 1 3: for each Apresjan cluster A of s R maj do 4: Determine if A is a strong cluster of R maj as described in Sect. 5 5: end for 6: Let C be the set of strong clusters of R maj , and build a tree T which includes all clusters in C and no other clusters of L in accordance with the method in Lemma 5 7: Output T   [19]. Now, suppose we are given a set L of leaf labels and a set C of clusters of L. To build a tree that includes all clusters in C and no other clusters of L (if such a tree exists), we first construct a binary matrix M of size |L| × |C| whose rows represent the leaf labels in L and whose columns represent the clusters in C by letting each entry M[i, j ] = 1 if and only if the ith leaf label belongs to the j th cluster. Then, we try to build a directed perfect phylogeny T for M by applying Gusfield's algorithm. If such a T exists, it must be unique since all columns of M are distinct (see the comments in Sect. 2 in [11]). However, T might include some clusters not present in C; to ensure that this is not the case, count the number of nodes in T and check if it is equal to |C|. This gives: Lemma 5 Given a set L of leaf labels and a set C of clusters of L, we can build a tree that includes all clusters in C and no other clusters of L (or determine that no such tree exists) in O(|L| · |C|) time.

Constructing the R* Consensus Tree
Based on the properties mentioned in Sects. 2.2-2.4, we can construct the R* consensus tree of two given trees T 1 and T 2 as outlined in our main algorithm R*_consensus_tree (displayed in Fig. 3).
The strategy of Algorithm R*_consensus_tree is as follows: First compute s R maj and all the Apresjan clusters of s R maj . 4 Next, check each Apresjan cluster to 4 Recall from Sect. 2.2 that for a set R of triplets over a leaf label set L, the function s R is defined as s R (a, b) = {y : ab|y ∈ R} for each a, b ∈ L with a = b, and s R (a, a) = |L| − 1 for each a ∈ L. see if it is a strong cluster of R maj (by Lemma 2, the set of strong clusters of R maj is a subset of the set of Apresjan clusters of s R maj ). Finally, construct a tree T that includes all the strong clusters of R maj and no other clusters of L by the method in Lemma 5. According to Theorem 1, the resulting T is the R* consensus tree of T 1 and T 2 .
We now analyze the time complexity of Algorithm R*_consensus_tree. We shall explain how to implement step 1 in O(n 2 √ log n) time in Sect. 4 (step 1 is in fact the only step that does not take O(n 2 ) time). In step 2, the Apresjan clusters of s R maj are computed by running the algorithm of Bryant and Berry from [4], which takes O(n 2 ) time according to Lemma 1 above. Next, there are O(n) Apresjan clusters to consider in the loop of step 3 by Lemma 1, and each one is checked in O(n) time in step 4, as detailed in Sect. 5 below. Finally, in step 5, the set C satisfies |C| = O(n) because C is a subset of the Apresjan clusters, so applying the method of Lemma 5 takes O(n 2 ) time.
In summary, we have the following theorem.
Theorem 2 Algorithm R*_consensus_tree constructs the R* consensus tree of T 1 and T 2 in O(n 2 √ log n) time.
The following two sections are devoted to implementing steps 1 and 4 of R*_consensus_tree efficiently. R maj (a, b) for All a, b ∈ L with a = b

Computing s
From the definition of R maj , we have:
We introduce the following three auxiliary functions: For any a, b ∈ L, the function count r,r (a, b) tells us how many times a triplet involving a and b and some other leaf w occurs as a rooted triplet of the form ab|w in both T 1 and T 2 . Similarly, count r,f (a, b) counts how many times a triplet involving a and b and some other leaf w occurs as a rooted triplet of the form ab|w in T 1 and as a fan triplet of the form a|b|w in T 2 (and analogously for count f,r (a, b)). Then, Definition 2 and Lemma 6 immediately imply: r (a, b) + count r,f (a, b) + count f,r (a, b).
(Observe that three auxiliary count-functions suffice to express the value of s R maj (a, b). For our purposes, it is unnecessary to define a fourth auxiliary function count f,f (a, b) that would count how many fan triplets of the form a|b|w that are consistent with both T 1 and T 2 .) To compute count r,r , count r,f , and count f,r , we could preprocess T 1 and T 2 in O(n) time so that lowest common ancestor queries in T 1 and T 2 could be answered in O(1) time [2,12]. Then, for any given a, b ∈ L, a brute-force solution would easily obtain all of count r,r (a, b), count r,f (a, b), and count f,r (a, b) in O(n) time by checking T 1 || {a,b,w} and T 2 || {a,b,w} for every w ∈ L \ {a, b}. This approach would therefore take O(n 3 ) time to compute count r,r (a, b), count r,f (a, b), and count f,r (a, b) for all a, b ∈ L. In the rest of this section, we show how to obtain these values more efficiently.
First, in Sect. 4.1, we reformulate the count-functions in terms of distances from leaves to lowest common ancestors of leaves. All of these distances may be computed in O(n 2 ) time. Then, based on this alternative formulation, for any fixed leaf label a ∈ L, Sect. 4.  (a, b), and count f,r (a, b) for all b ∈ L \ {a} for each fixed a ∈ L. Therefore, we only need O(n 2 √ log n) time in total to obtain s R maj (a, b) for all a, b ∈ L according to Corollary 1. The next lemma states the relationship between d and e and the triplets consistent with T . For an example, refer to Fig. 4(b). The first part of Lemma 7 is a special case of Theorem 1 in [15], which was derived by Lee et al. [15] to solve a different problem known as the maximum agreement subtree problem (MAST).

Proof For the first part, d a,T (x) < d a,T (w) if and only if lca T (a, x) is a proper descendant of lca T (a, w). This is equivalent to ax|w ∈ t (T ) by the definition of "consistent with T " for a rooted triplet and the definition of t (T ).
For

the second part, a|x|w ∈ t (T ) if and only if lca T (a, x) = lca T (a, w) = lca T (x, w) by the definition of "consistent with T " for a fan triplet and the definition of t (T ). The condition lca T (a, x) = lca T (a, w) = lca T (x, w) directly implies that d a,T (x) = d a,T (w) and e a,T (x) = e a,T (w). On the other hand, if d a,T (x) = d a,T (w)
and e a,T (x) = e a,T (w) then lca T (a, x) = lca T (a, w); denote this node by v. Since e a,T (x) and e a,T (w) are two different children of v, we have lca T (x, w) = v and thus lca T (a, x) = lca T (a, w) = lca T (x, w).
Algorithm Compute_count_rr Input: a ∈ L Output: The values of count r,r (a, b) for all b ∈ L \ {a} 1: Initialize the data structure for offline orthogonal range counting on an (n × n)-grid G containing the following n − 1 points: : n] 4: end for 5: Make a query to the data structure for G to ask how many points lie inside each of the n − 1 rectan- Set count r,r (a, b) to the reported number of points for the rectangle Q b 8: end for

Fig. 5 Algorithm Compute_count_rr
The formula for count f,r (a, b) follows by symmetry.

Computing count r,r (a, b) for All b ∈ L \ {a}
Suppose a ∈ L is fixed. This section describes how to compute count r, Our algorithm is named Compute_count_rr and is listed in Fig. 5. It works as follows. Suppose that G is an integer grid of size (n × n) and each b ∈ L \ {a} is represented by the point (d a,T 1 (b), d a,T 2 (b)) on G. Then: Proof Lemma 8 states that count r,r (a, b) equals |{w : d a,T 1 (b) < d a,T 1 (w) and d a,T 2 (b) < d a,T 2 (w)}|. Every element w that contributes to this expression satisfies d a,T 1 (b) < d a,T 1 (w) and d a,T 2 (b) < d a,T 2 (w), and thus, d a, According to Lemma 9, we can obtain count r,r (a, b) by calculating how many points lie inside the corresponding rectangle on G. Actually, we have to do this for all of the n − 1 rectangles defined by the elements of L \ {a}. So to implement Lemma 9 efficiently, we borrow a data structure by Chan and Pǎtraşcu [6] for offline orthogonal range counting on G. More precisely, we use the following result (Corollary 2.3 in [6]): Lemma 10 (Chan and Pǎtraşcu [6]) Given n points and n axis-aligned rectangles in the plane, we can count the number of points inside each rectangle in O(n √ log n) total time.
Note that one query to the data structure returns the number of points in each of the rectangles. By Lemma 10, the running time of Algorithm Compute_count_rr becomes O(n √ log n).

Computing count r,f (a, b) for All b ∈ L \ {a}
Again, suppose a ∈ L is fixed. Here, we show how to compute count r,f (a, b) (and by symmetry, count f,r (a, b)) for all b ∈ L \ {a} in O(n) time. We assume that the values of d a,T 1 (b), d a,T 2 (b), e a,T 1 (b), e a,T 2 (b) for all b ∈ L \ {a} have been precomputed. For any given a, b, we need to count how many leaves w appear in a rooted triplet of the form ab|w in t (T 1 ) and in a fan triplet of the form a|b|w in t (T 2 ) at the same time. Our idea is to first use T 2 to partition L \ {a} into disjoint subsets C 1 , . . . , C p in such a way that for every pair of elements x, y in the same C i , it holds that d a,T 2 (x) = d a,T 2 (y). Also, further partition each C i into C i,1 , . . . , C i,q so that for every pair of elements x, y in the same C i,j , it holds that e a,T 2 (x) = e a,T 2 (y). The second part of Lemma 7 immediately yields:

Lemma 11
For any b, w ∈ L \ {a}, the fan triplet a|b|w belongs to t (T 2 ) if and only if b and w belong to the same C i -set but different C i,q -sets.
Then, for any b ∈ C i,j , the number of fan triplets of the form a|b|w that are consistent with T 2 equals |C i \ C i,j |. However, what we need to know is when a fan triplet a|b|w from T 2 also satisfies ab|w ∈ t (T 1 ). To also include this crucial condition on T 1 , we refine Lemma 11 in the following way: Proof By Lemma 8, count r,f (a, b) = |{w : d a,T 1 (b) < d a,T 1 (w), d a,T 2 (b) = d a,T 2 (w), and e a,T 2 (b) = e a,T 2 (w)}|. Since d a,T 2 (b) = d a,T 2 (w) if and only if w ∈ C i , and moreover, e a,T 2 (b) = e a,T 2 (w) if and only if w ∈ C i,j , we have count r,f (a, b) = |{w : d a,T 1 (b) < d a,T 1 (w), w ∈ C i , and w ∈ C i,j }|. Next, to count the total number of leaves w from C i \ C i,j that satisfy d a,T 1 (b) < d a,T 1 (w), just count how many leaves in C i that satisfy the condition and then subtract the number of leaves in C i,j that also satisfy it. This gives the desired formula for count r,f (a, b).
Our algorithm Compute_count_rf is specified in Fig. 6. It uses Lemma 12 to compute count r,f (a, b) for all b ∈ L \ {a} for any fixed a ∈ L.  count r,f (a, b). The correctness follows from the correctness of Lemma 12.
Algorithm Compute_count_rf can be implemented to run in O(n) time by using radix sort to sort the elements inside each C i -and C i,j -set in steps 3 and 16. Each b ∈ L \ {a} belongs to exactly one C i -set and exactly one C i,j -set, and is therefore treated once in steps 4-13 and once in steps 17-26.  Partition C i into C i,1 , . . . , C i,q so that for every pair of elements x, y in the same C i,j , it holds that e a,T 2 (x) = e a,T 2 (y) 15: for every C i,j do 16: Sort the elements of C i,j according to their d a,T 1 -values and denote the resulting ordering

Determining if a Cluster Is a Strong Cluster
This section shows how to check whether any given cluster A of L with |A| ≥ 2 is a strong cluster of R maj in O(n) time (recall from the definitions in Sect. 2.2 that A is trivially a strong cluster if |A| = 1). Our solution relies on the next lemma, illustrated in Fig. 7. For any node u of a tree T , each subtree of T rooted at a child of u is said to be attached to u.

Lemma 13 Let
Proof (⇒) Let A be a strong cluster of R maj . We shall prove that both conditions (1) and (2) hold. We first prove (1). For the sake of obtaining a contradiction, suppose that for some i ∈ {1, 2}, there exist a ∈ A, x ∈ L \ A where both a and x are in the same subtree attached to u i . Then, lca T i (a, x) is a proper descendant of u i . The node u i is defined as the lowest common ancestor of A, so lca T i (a, a ) = u i for some a ∈ A. However, then lca T i (a, x) is a proper descendant of lca T i (a, a ), giving ax|a ∈ t (T i ). This implies that aa |x cannot belong to R maj , which contradicts the fact that A is a strong cluster of R maj . Thus, condition (1) must hold.
To prove condition (2), we first show that there exist two leaves a, a ∈ A such that lca T 1 (a, a ) = u 1 and lca T 2 (a, a ) = u 2 . Observe that for both i = 1, 2, the leaves from A are located in at least two different subtrees attached to u i . Thus, there exists at least one pair a, a ∈ A with lca T 1 (a, a ) = u 1 . Now, assume on the contrary that for every pair a, a ∈ A such that lca T 1 (a, a ) = u 1 , lca T 2 (a, a ) is a proper descendant of u 2 . Consider any pair of different subtrees attached to u 1 that contain elements from A, and denote their two sets of leaves by A 1 and A 2 . By the assumption, lca T 2 (x, y) is a proper descendant of u 2 for all x ∈ A 1 and y ∈ A 2 , and thus all leaves in A 1 ∪ A 2 must appear in a single subtree attached to u 2 . By repeating this argument, all elements of A appear in one subtree attached to u 2 , which is impossible. Therefore, there exists a pair a, a ∈ A with lca T 1 (a, a ) = u 1 such that lca T 2 (a, a ) is not a proper descendant of u 2 , i.e., lca T 2 (a, a ) = u 2 .
Finally, we are ready to prove (2). Take any a, a ∈ A with lca T 1 (a, a ) = u 1 and lca T 2 (a, a ) = u 2 as in the preceding paragraph. To obtain a contradiction, suppose that X 1 ∩ X 2 = ∅. Then there exists an x ∈ X 1 ∩ X 2 , and moreover, x|a|a ∈ t (T 1 ) and x|a|a ∈ t (T 2 ) because the subtree attached to u i for i = 1, 2 that contains x cannot contain a or a by condition (1) above. But then aa |x ∈ R maj , contradicting the fact that A is a strong cluster of R maj . Hence, (2) holds. Algorithm Check_if_strong_cluster Input: A cluster A of L with |A| ≥ 2 Output: "yes", if A is a strong cluster of R maj ; "no", otherwise 1: Let u 1 and u 2 be the lowest common ancestor of A in T 1 and T 2 , respectively 2: for i = 1, 2 do 3: if any subtree of T i attached to u i contains leaves from A as well as from L \ A then 4: return "no" 5: end if 6: end for 7: return "yes" 10: else 11: return "no" 12: end if (⇐) Suppose A satisfies the two conditions stated in the lemma. We will prove that for every a, a ∈ A and c ∈ L \ A, it holds that aa |c ∈ R maj . The leaf c belongs to exactly one of the three disjoint sets X 1 , X 2 , and L \ (A ∪ X 1 ∪ X 2 ), which gives three cases:

a ) is a proper descendant of lca T i (a, c)
for both i = 1, 2. Thus, aa |c ∈ R maj . -If c ∈ X 1 : Then c is not a descendant of u 2 in T 2 because c ∈ X 2 , so aa |c ∈ t (T 2 ) always holds. There are two subcases depending on whether or not lca T 1 (a, a ) is a proper descendant of u 1 . If yes, then aa |c ∈ t (T 1 ), and thus aa |c ∈ R maj . If no, then lca T 1 (a, a ) = u 1 , and thus a|a |c ∈ t (T 1 ), which also yields aa |c ∈ R maj . -If c ∈ X 2 : Then it follows that aa |c ∈ R maj in the same way as for the case c ∈ X 1 above.
By definition, A is a strong cluster of R maj .
Armed with Lemma 13, it is straightforward to check if any given cluster A of L is a strong cluster of R maj in O(n) time as shown in Algorithm Check_if_strong_ cluster in Fig. 8.

Concluding Remarks
In this paper, we have shown how to construct the R* consensus tree of two input trees in O(n 2 √ log n) = o(n 3 ) time.
It is an open problem to determine if the time complexity can be further improved to O(n 2 ) or better. The bottleneck of our algorithm R*_consensus_tree is the computation of count r,r in Sect. 4.2, which uses O(n √ log n) time for each fixed a ∈ L to obtain the values of count r,r (a, b) for all b ∈ L \ {a}.
We remark that for the restricted case where both input trees are binary, the problem becomes much easier because then the R* consensus tree is equivalent to the so-called RV-I tree in [14]. The RV-I tree differs slightly from the R* consensus tree in general in that if ab|c belongs to one of t (T 1 ) and t (T 2 ), and a|b|c belongs to the other, then ab|c may never be consistent with the output tree; however, for the special case of binary trees, neither t (T 1 ) nor t (T 2 ) contain any fan triplets so R maj equals the set of all rooted triplets that are consistent with both T 1 and T 2 , and this is precisely the "set of triples which are resolved identically in T 1 and T 2 " in Definition 5.1 in [14]. Furthermore, according to Theorem 5.1 in [14], the RV-I tree of T 1 and T 2 is always equivalent to the strict consensus tree (see also [3,7,9,18,19]) of T 1 and T 2 , which can be constructed in O(n) time by a classical algorithm of Day [7]. Thus, there is a large gap in the time complexity of constructing the R* consensus tree of two trees in the binary case (O(n) time) and in the general, non-binary case (O(n 2 √ log n) time). The definition of the set of majority rooted triplets R maj as well as the definition of an R* consensus tree can be naturally extended to the case of k > 2 input trees; see [3]. The currently fastest algorithm for the case k > 2, outlined in [3], runs in O(kn 3 3 ). It seems difficult to extend the approach used in this paper because it would yield an exponential number (in k) of cases in Lemma 6 and thus express s R maj (a, b) as the sum of an exponential number of auxiliary count-functions, causing the total running time to be exponential in k.