1 Introduction

Phylogenetic trees are leaf-labeled trees commonly used to describe the evolutionary history of a set of objects such as biological species or languages [1, 9, 10, 16, 18]. Typically, in a phylogenetic tree, each leaf represents one of the objects being studied and the branching structure of the tree shows the assumed evolutionary relationships among the objects. Depending on the available data, it can be difficult to infer an accurate phylogenetic tree. For example, small changes in the input data may result in trees with very different structures. A consensus tree is a single phylogenetic tree which summarizes the branching information contained in an input collection of phylogenetic trees with identical leaf label sets. Consensus trees are useful when different data sets or different tree inference methods have produced a set of trees with the same leaf labels and slightly conflicting structures, yet a single tree is required to represent all of them [1, 3, 7, 9, 14, 18] (indeed, phylogenetic analyses often output alternative trees for the same set of species [1]). Also, by exclusion, a consensus tree indicates areas of conflict in the input trees [3]. Furthermore, consensus trees are sometimes used as a basis for new phylogenetic inferences [3].

Over the years, many types of consensus trees have been defined and studied in detail. For a survey, see, e.g., [3], Chap. 30 in [9], Chap. 3.2 in [18], or Chap. 8.4 in [19]. Different types of consensus trees use different criteria to resolve conflicts among the input trees, so their mathematical properties vary. Thus, the most suitable type of consensus tree to use in practice depends on the particular application. In this paper, we focus on the so-called R* consensus tree. One advantage of the R* consensus tree is that it provides a statistically consistent estimator of the species tree topology when combining a set of gene trees, as recently demonstrated by Degnan et al. [8]; moreover, R* consensus outperformed other methods such as majority-rule consensus in the study conducted by [8]. For the case of two input trees, it is known [3] that the R* consensus tree is equivalent to the RV-III tree introduced in [14].

The R* consensus tree is defined in Sect. 2 below.Footnote 1 The idea is: For any set of input phylogenetic trees on a fixed leaf label set L, there are certain small binary trees called rooted triplets (each containing exactly three leaf labels from L) that occur as embedded subtrees more frequently than others, and the R* consensus tree includes as many of these rooted triplets as possible and none of the others. In short, if the rooted triplet xy|z for any {x,y,z}⊆L is consistent with more input trees than each of the two rooted triplets xz|y and yz|x is, then xy|z belongs to a set named \(\mathcal {R}_{\mathit{maj}}\); the R* consensus tree is the phylogenetic tree τ having the largest possible number of internal nodes such that every rooted triplet consistent with τ belongs to \(\mathcal {R}_{\mathit{maj}}\).

The goal of this paper is to develop a fast algorithm for constructing the R* consensus tree for two input trees T 1 and T 2 with a leaf label set L of cardinality n. The previous algorithms for this problem require Θ(n 3) time [4, 14], which was believed to be optimal because when T 1 and T 2 have similar branching structures, \(|\mathcal {R}_{\mathit{maj}}| = \varOmega (n^{3})\). Our main result is that the time complexity of the problem is in fact subcubic. In order to obtain a subcubic-time algorithm, we have to avoid explicitly constructing the set \(\mathcal {R}_{\mathit{maj}}\). For this purpose, we compute the values of a function named \(s_{\mathcal {R}_{\mathit{maj}}}\) associated to \(\mathcal {R}_{\mathit{maj}}\) by using a novel formulation based on distances from leaves to lowest common ancestors of pairs of leaves, and then apply an algorithm by Bryant and Berry [4] to compute the Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\). Next, we check each of the obtained Apresjan clusters to find all strong clusters of \(\mathcal {R}_{\mathit{maj}}\), which are subsequently used to build the R* consensus tree. In total, the running time of our new algorithm R*_consensus_tree is \(O(n^{2} \sqrt{\log n})\). (The preliminary conference version of this paper [13] described a slightly slower O(n 2logn)-time method.)

The paper is organized as follows. Section 2 introduces the basic definitions and terminology used throughout the paper and states some simple properties of clusters and the R* consensus tree. Next, our algorithm R*_consensus_tree is presented in Sect. 3. Two crucial steps of the algorithm (how to compute the values of the function \(s_{\mathcal {R}_{\mathit{maj}}}\) efficiently and how to determine if a cluster is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\)) are described in detail in Sects. 4 and 5. Finally, Sect. 6 discusses open problems for further research.

2 Preliminaries

2.1 Basic Definitions

A phylogenetic tree is a rooted, unordered, distinctly leaf-labeled tree in which every internal node has at least two children. From here on, “tree” means “phylogenetic tree”, and every leaf in a tree is identified with its label.

We shall use the following terminology and notation. For any tree T and any node u of T, the subtree of T rooted at u is denoted by T[u]. The set of all leaves in a tree T is written as Λ(T). To simplify the presentation, every node in a tree is considered to be an ancestor as well as a descendant of itself. For any nodes u,v in a tree, in case u is a descendant of v and uv then we call u a proper descendant of v. For any set A of nodes in a tree T, the lowest common ancestor of A in T is the node w such that: (1) every node in A is a descendant of w; and (2) w is a proper descendant of x for every other node x which is an ancestor of all nodes in A. The lowest common ancestor of any two nodes u and v in a tree T is denoted by lca T(u,v).

A triplet is a tree with three leaves. Any non-binary tree containing exactly three leaves {x,y,z} is called a fan triplet and is written as x|y|z. On the other hand, any binary tree with exactly three leaves {x,y,z} is called a rooted triplet, and is denoted by xy|z if the lowest common ancestor of x and y is a proper descendant of the lowest common ancestor of x and z. Note that there are precisely four different triplets for any set of three leaf labels {x,y,z}, namely x|y|z, xy|z, xz|y, and yz|x; see Fig. 1.

Fig. 1
figure 1

The four different triplets x|y|z, xy|z, xz|y, and yz|x leaf-labeled by {x,y,z}

For any tree T and {x,y,z}⊆Λ(T), the fan triplet x|y|z is said to be consistent with T if lca T(x,y) = lca T(x,z) = lca T(y,z). Similarly, the rooted triplet xy|z is consistent with T if lca T(x,y) is a proper descendant of lca T(x,z) = lca T(y,z). Let T||{x,y,z} denote the unique triplet with leaf label set {x,y,z} which is consistent with T. Finally, for any tree T, let r(T) be the set of all rooted triplets which are consistent with T, i.e., define r(T)={T||{x,y,z}:{x,y,z}⊆Λ(T) and T||{x,y,z} is a rooted triplet}, and define t(T) as the set of all triplets (rooted triplets and fan triplets) consistent with T, i.e., t(T)={T||{x,y,z}:{x,y,z}⊆Λ(T)}. (See Fig. 2 for some examples.) It follows that |t(T)|=Θ(|Λ(T)|3) for any tree T, and |r(T)|=Θ(|Λ(T)|3) when T is a binary tree because |r(T)|=|t(T)| in this case.

Fig. 2
figure 2

The rooted triplet xy|z and the fan triplet x|c|d are consistent with the tree T. Hence, xy|zr(T), xy|zt(T), and x|c|dt(T)

2.2 Strong Clusters and Apresjan Clusters

Below, let \(\mathcal {R}\) be a given set of triplets over a leaf label set \(L = \bigcup_{r \in \mathcal {R}} \varLambda (r)\) such that for each {x,y,z}⊆L, at most one of x|y|z, xy|z, xz|y, and yz|x belongs to \(\mathcal {R}\). A cluster of L is any non-empty subset of L. We define two special types of clusters:

Definition 1

A cluster A of L is called a strong cluster of  \(\mathcal {R}\) if \(aa'|x \in \mathcal {R}\) for all a,a′∈A and xLA. Furthermore, L as well as every singleton set of L is also defined to be a strong cluster of \(\mathcal {R}\).

Definition 2

For each a,bL with ab, define \(s_{\mathcal {R}}(a,b) = |\{y \in L \setminus \{a,b\} : ab|y \in \mathcal {R}\}|\), and for each aL, define \(s_{\mathcal {R}}(a,a) = | L | - 1\). A cluster A of L is called an Apresjan cluster of  \(s_{\mathcal {R}}\) if \(s_{\mathcal {R}}(a,a') > s_{\mathcal {R}}(a,x)\) for all a,a′∈A and xLA.

Write n=|L|. By Theorem 2.3 and Corollary 2.1 of [4], the following holds:

Lemma 1

(Bryant and Berry [4])

  1. 1.

    There are O(n) Apresjan clusters of  \(s_{\mathcal {R}}\).

  2. 2.

    Given the values of  \(s_{\mathcal {R}}(a,b)\) for all a,bL, the Apresjan clusters of  \(s_{\mathcal {R}}\) can be computed in O(n 2) time.

There is a connection between the strong clusters of \(\mathcal {R}\) and the Apresjan clusters of \(s_{\mathcal {R}}\):

Lemma 2

Every strong cluster of  \(\mathcal {R}\) is an Apresjan cluster of  \(s_{\mathcal {R}}\).

Proof

Let C be a strong cluster of \(\mathcal {R}\). Consider any fixed a,a′∈C and xLC. We need to show that \(s_{\mathcal {R}}(a,a') > s_{\mathcal {R}}(a,x)\).

First of all, for every yLC, we have \(aa'|y \in \mathcal {R}\) by the definition of a strong cluster, which gives \(s_{\mathcal {R}}(a,a') = |\{aa'|y : aa'|y \in \mathcal {R}\}| \geq |L \setminus C|\). In the same way, \(ab|x \in \mathcal {R}\) holds for every bC, which (along with the requirement that for each {x,y,z}⊆L, at most one of x|y|z, xy|z, xz|y, and yz|x belongs to \(\mathcal {R}\)) implies that \(ax|b \not\in \mathcal {R}\). Thus, \(s_{\mathcal {R}}(a,x) =|\{ax|y : ax|y \in \mathcal {R}\}| \leq |(L \setminus C) \setminus \{x\}| <|L \setminus C|\). Since \(s_{\mathcal {R}}(a,x) < s_{\mathcal {R}}(a,a')\), C is an Apresjan cluster of \(s_{\mathcal {R}}\) by definition. □

2.3 R* Consensus Trees

Let T 1 and T 2 be two given trees with Λ(T 1)=Λ(T 2)=L. For any {a,b,c}⊆L, define #ab|c as the number of trees T i for which ab|cr(T i ). The set of “majority rooted triplets” \(\mathcal {R}_{\mathit{maj}}\) is defined as {ab|c:a,b,cL and #ab|c>#ac|b,#bc|a}. An R* consensus tree of T 1 and T 2 is a tree τ with Λ(τ)=L which satisfies \(r(\tau) \subseteq \mathcal {R}_{\mathit{maj}}\) and which maximizes the number of internal nodes.Footnote 2

The next two lemmas describe some useful properties of the strong clusters of \(\mathcal {R}_{\mathit{maj}}\).

Lemma 3

Let T be a tree with Λ(T)=L and \(r(T) \subseteq \mathcal {R}_{\mathit{maj}}\). For any node u of T, Λ(T[u]) is a strong cluster of  \(\mathcal {R}_{\mathit{maj}}\).

Proof

If u is a leaf or the root of T then Λ(T[u]) is trivially a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). If u is an internal non-root node then for any two a,a′∈Λ(T[u]) and any \(x \not\in \varLambda (T[u])\), the triplet aa′|x belongs to \(\mathcal {R}_{\mathit{maj}}\), so Λ(T[u]) is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). □

Lemma 4

There exists a tree τ such that {Λ(τ[u]):u is a node in τ} equals the set of strong clusters of  \(\mathcal {R}_{\mathit{maj}}\).

Proof

First observe that for any two strong clusters A and B of \(\mathcal {R}_{\mathit{maj}}\), either A⊆̷B, B⊆̷A, or AB=∅. To prove this claim, suppose for the sake of contradiction that there are x,y,zL such that xAB, yBA, and zAB. Since A is a strong cluster, \(xz|y \in \mathcal {R}_{\mathit{maj}}\). Since B is a strong cluster, \(yz|x \in \mathcal {R}_{\mathit{maj}}\). But by the definition of \(\mathcal {R}_{\mathit{maj}}\), we cannot have both of xz|y and yz|x in \(\mathcal {R}_{\mathit{maj}}\). The claim follows.

As a direct consequence, the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) form a nested hierarchy (laminar family) on L. By Theorem 13.21 of [17], the collection of strong clusters of \(\mathcal {R}_{\mathit{maj}}\) can be represented by a rooted tree τ together with a function \(\pi : L \rightarrow \{\text{nodes of~$\tau$}\}\) such that the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) are in one-to-one-correspondence with the nodes of τ in the following sense: For any strong cluster A and any xL, it holds that xA if and only if π(x) is a descendant of the node τ(A) corresponding to A (as before, every node is also defined to be a descendant of itself).Footnote 3

Next, we show that:

  • For every xL, the node π(x) must be a leaf: Suppose that for some xL, the node π(x) is an internal node. Consider the strong cluster {x}. By the above, the node π(x) is equal to or a proper descendant of the node τ({x}). Since π(x) has at least one child, there exists some node v in τ which is a proper descendant of τ({x}). The (non-empty) strong cluster corresponding to node v is therefore a proper subset of {x}, which is impossible. Hence, π cannot map any element of L to an internal node of τ.

  • Every internal node of τ has at least two children: If A and B are strong clusters of \(\mathcal {R}_{\mathit{maj}}\) with A⊆̷B then there exists some strong cluster C of \(\mathcal {R}_{\mathit{maj}}\) such that C⊆̷B and CA=∅. For example, take any cBA and use the fact that {c} is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\).

By these observations, there exists a tree τ leaf-labeled by L where every internal node has at least two children such that each strong cluster of \(\mathcal {R}_{\mathit{maj}}\) equals Λ(τ[u]) for some rooted subtree τ[u] of τ. □

Say that a tree T includes a cluster A of L if T contains a node u such that Λ(T[u])=A.

Theorem 1

An R* consensus tree of T 1 and T 2 always exists. In particular, it includes every strong cluster of  \(\mathcal {R}_{\mathit{maj}}\) and no other clusters of L.

Proof

By Lemma 4, there exists a tree τ that includes all the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) and does not include any other clusters. For any aa′|xr(τ), there exists a node u in τ such that a,a′∈Λ(τ[u]) and \(x \not\in \varLambda (\tau[u])\), i.e., a,a′∈A and \(x \not\in A\) for some strong cluster A of \(\mathcal {R}_{\mathit{maj}}\), which implies that \(aa'|x \in \mathcal {R}_{\mathit{maj}}\). Thus, \(r(\tau) \subseteq \mathcal {R}_{\mathit{maj}}\).

To prove the optimality of τ, suppose that there exists a tree τ′ which satisfies \(r(\tau') \subseteq \mathcal {R}_{\mathit{maj}}\) and has more internal nodes than τ. By Lemma 3, the set of leaves in each rooted subtree of τ′ forms a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). Since τ′ has more internal nodes than τ, it follows that τ′ includes some strong cluster of \(\mathcal {R}_{\mathit{maj}}\) which τ does not include. This contradicts Lemma 4. Hence, τ is an R* consensus tree of T 1 and T 2. □

2.4 Constructing a Tree from a Set of Clusters

Here, we address the problem of constructing a tree from a given set of clusters. We first recall a problem known as the directed perfect phylogeny problem with binary characters. Suppose M is a binary matrix. Let I be the set of rows and J the set of columns in M. A directed perfect phylogeny for M is a tree T such that: (1) the leaves of T are bijectively labeled by I; and (2) each cJ is associated with exactly one node T(c) of T in such a way that for any xI, it holds that M[x,c]=1 if and only if leaf x belongs to the subtree of T rooted at node T(c). The directed perfect phylogeny problem with binary characters (DPPB) is to, given a binary matrix M, construct a directed perfect phylogeny for M (if one exists). For examples and more details, refer to, e.g., Sect. 17.3 in [10] or Sect. 7.2.2 in [19]. DPPB can be solved in O(|I|⋅|J|) time by Gusfield’s algorithm; see Sect. 17.3.4 in [10] or Sect. 7.2.2.1 in [19].

Now, suppose we are given a set L of leaf labels and a set C of clusters of L. To build a tree that includes all clusters in C and no other clusters of L (if such a tree exists), we first construct a binary matrix M of size |L|×|C| whose rows represent the leaf labels in L and whose columns represent the clusters in C by letting each entry M[i,j]=1 if and only if the ith leaf label belongs to the jth cluster. Then, we try to build a directed perfect phylogeny T for M by applying Gusfield’s algorithm. If such a T exists, it must be unique since all columns of M are distinct (see the comments in Sect. 2 in [11]). However, T might include some clusters not present in C; to ensure that this is not the case, count the number of nodes in T and check if it is equal to |C|. This gives:

Lemma 5

Given a set L of leaf labels and a set C of clusters of L, we can build a tree that includes all clusters in C and no other clusters of L (or determine that no such tree exists) in O(|L|⋅|C|) time.

3 Constructing the R* Consensus Tree

Based on the properties mentioned in Sects. 2.22.4, we can construct the R* consensus tree of two given trees T 1 and T 2 as outlined in our main algorithm R*_consensus_tree (displayed in Fig. 3).

Fig. 3
figure 3

Algorithm R*_consensus_tree

The strategy of Algorithm R*_consensus_tree is as follows: First compute \(s_{\mathcal {R}_{\mathit{maj}}}\) and all the Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\).Footnote 4 Next, check each Apresjan cluster to see if it is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) (by Lemma 2, the set of strong clusters of \(\mathcal {R}_{\mathit{maj}}\) is a subset of the set of Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\)). Finally, construct a tree T that includes all the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) and no other clusters of L by the method in Lemma 5. According to Theorem 1, the resulting T is the R* consensus tree of T 1 and T 2.

We now analyze the time complexity of Algorithm R*_consensus_tree. We shall explain how to implement step 1 in \(O(n^{2} \sqrt{\log n})\) time in Sect. 4 (step 1 is in fact the only step that does not take O(n 2) time). In step 2, the Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\) are computed by running the algorithm of Bryant and Berry from [4], which takes O(n 2) time according to Lemma 1 above. Next, there are O(n) Apresjan clusters to consider in the loop of step 3 by Lemma 1, and each one is checked in O(n) time in step 4, as detailed in Sect. 5 below. Finally, in step 5, the set C satisfies |C|=O(n) because C is a subset of the Apresjan clusters, so applying the method of Lemma 5 takes O(n 2) time.

In summary, we have the following theorem.

Theorem 2

Algorithm  R*_consensus_tree constructs the R* consensus tree of T 1 and T 2 in \(O(n^{2} \sqrt{\log n})\) time.

The following two sections are devoted to implementing steps 1 and 4 of R*_consensus_tree efficiently.

4 Computing \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\) for All a,bL with ab

From the definition of \(\mathcal {R}_{\mathit{maj}}\), we have:

Lemma 6

For any a,b,cL, \(ab|c \in \mathcal {R}_{\mathit{maj}}\) if and only if either

  1. 1.

    ab|ct(T 1)∩t(T 2); or

  2. 2.

    ab|ct(T 1) and a|b|ct(T 2); or

  3. 3.

    a|b|ct(T 1) and ab|ct(T 2).

We introduce the following three auxiliary functions:

For any a,bL, the function \(\mathit{count}_{r,r}(a,b)\) tells us how many times a triplet involving a and b and some other leaf w occurs as a rooted triplet of the form ab|w in both T 1 and T 2. Similarly, \(\mathit{count}_{r,f}(a,b)\) counts how many times a triplet involving a and b and some other leaf w occurs as a rooted triplet of the form ab|w in T 1 and as a fan triplet of the form a|b|w in T 2 (and analogously for \(\mathit{count}_{f,r}(a,b)\)). Then, Definition 2 and Lemma 6 immediately imply:

Corollary 1

For every a,bL with ab,

$$s_{\mathcal {R}_{\mathit{maj}}}(a,b) =\mathit{count}_{r,r}(a,b) + \mathit{count}_{r,f}(a,b) + \mathit{count}_{f,r}(a,b).$$

(Observe that three auxiliary \(\mathit{count}\)-functions suffice to express the value of \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\). For our purposes, it is unnecessary to define a fourth auxiliary function \(\mathit{count}_{f,f}(a,b)\) that would count how many fan triplets of the form a|b|w that are consistent with both T 1 and T 2.)

To compute \(\mathit{count}_{r,r}\), \(\mathit{count}_{r,f}\), and \(\mathit{count}_{f,r}\), we could preprocess T 1 and T 2 in O(n) time so that lowest common ancestor queries in T 1 and T 2 could be answered in O(1) time [2, 12]. Then, for any given a,bL, a brute-force solution would easily obtain all of \(\mathit{count}_{r,r}(a,b)\), \(\mathit{count}_{r,f}(a,b)\), and \(\mathit{count}_{f,r}(a,b)\) in O(n) time by checking T 1||{a,b,w} and T 2||{a,b,w} for every wL∖{a,b}. This approach would therefore take O(n 3) time to compute \(\mathit{count}_{r,r}(a,b)\), \(\mathit{count}_{r,f}(a,b)\), and \(\mathit{count}_{f,r}(a,b)\) for all a,bL. In the rest of this section, we show how to obtain these values more efficiently.

First, in Sect. 4.1, we reformulate the \(\mathit{count}\)-functions in terms of distances from leaves to lowest common ancestors of leaves. All of these distances may be computed in O(n 2) time. Then, based on this alternative formulation, for any fixed leaf label aL, Sect. 4.2 describes an \(O(n \sqrt{\log n})\)-time method for computing \(\mathit{count}_{r,r}(a,b)\) for all bL∖{a}, and Sect. 4.3 describes an O(n)-time method for computing \(\mathit{count}_{r,f}(a,b)\) for all bL∖{a}. (By symmetry, we can obtain \(\mathit{count}_{f,r}(a,b)\) for all bL∖{a} in O(n) time with the same technique as in Sect. 4.3.) To summarize, after O(n 2) time preprocessing, \(O(n \sqrt{\log n})\) time is enough to compute \(\mathit{count}_{r,r}(a,b)\), \(\mathit{count}_{r,f}(a,b)\), and \(\mathit{count}_{f,r}(a,b)\) for all bL∖{a} for each fixed aL. Therefore, we only need \(O(n^{2} \sqrt{\log n})\) time in total to obtain \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\) for all a,bL according to Corollary 1.

4.1 Distances from Leaves to Lowest Common Ancestors

For any tree T and any a,bΛ(T), let d a,T (b) be the distance (i.e., the number of edges) between a and lca T(a,b) in T, and let e a,T (b) be the child of lca T(a,b) which is an ancestor of b. Without loss of generality, define e a,T (a)=∅. See Fig. 4(a) for an illustration of these definitions.

Fig. 4
figure 4

(a) Illustrating the definitions of d a,T (b) and e a,T (b). (b) The tree T satisfies d a,T (x)<d a,T (z) so by Lemma 7, ax|z must be consistent with T; similarly, d a,T (x)=d a,T (y) while e a,T (x)≠e a,T (y), so a|x|y is consistent with T

Note that the values of d a,T (b) and e a,T (b) for all a,bΛ(T) can be obtained in O(n 2) time in total by performing n simple traversals of T, where n=|Λ(T)|. More precisely, for each leaf aΛ(T), spend O(n) time to compute d a,T (b) and e a,T (b) for all bΛ(T) as follows: Let d a,T (a):=0 and e a,T (a):=∅. Start at a and follow the upwards path P from a to the root of T while using a variable d to remember the distance from a to the current location on P. Whenever a new node u on P is visited, let C be the set of children of u not belonging to P, and for each cC, traverse the subtree of T rooted at c and assign the values d a,T (b):=d and e a,T (b):=c for all leaves b that are encountered.

The next lemma states the relationship between d and e and the triplets consistent with T. For an example, refer to Fig. 4(b). The first part of Lemma 7 is a special case of Theorem 1 in [15], which was derived by Lee et al. [15] to solve a different problem known as the maximum agreement subtree problem (MAST).

Lemma 7

For any tree T and any three distinct a,x,wΛ(T), it holds that:

  • d a,T (x)<d a,T (w) if and only if ax|wt(T);

  • d a,T (x)=d a,T (w) and e a,T (x)≠e a,T (w) if and only if a|x|wt(T).

Proof

For the first part, d a,T (x)<d a,T (w) if and only if lca T(a,x) is a proper descendant of lca T(a,w). This is equivalent to ax|wt(T) by the definition of “consistent with T” for a rooted triplet and the definition of t(T).

For the second part, a|x|wt(T) if and only if lca T(a,x)=lca T(a,w)=lca T(x,w) by the definition of “consistent with T” for a fan triplet and the definition of t(T). The condition lca T(a,x)=lca T(a,w)=lca T(x,w) directly implies that d a,T (x)=d a,T (w) and e a,T (x)≠e a,T (w). On the other hand, if d a,T (x)=d a,T (w) and e a,T (x)≠e a,T (w) then lca T(a,x)=lca T(a,w); denote this node by v. Since e a,T (x) and e a,T (w) are two different children of v, we have lca T(x,w)=v and thus lca T(a,x)=lca T(a,w)=lca T(x,w). □

Next, consider two trees T 1 and T 2 with Λ(T 1)=Λ(T 2)=L. We have:

Lemma 8

For any a,bL with ab:

  • \(\mathit{count}_{r,r}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\).

  • \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w), d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w) \}|\).

  • \(\mathit{count}_{f,r}(a,b) =|\{ w :d_{a,T_{1}}(b) = d_{a,T_{1}}(w), e_{a,T_{1}}(b) \neq e_{a,T_{1}}(w)\), and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\).

Proof

By Lemma 7, \(d_{a,T_{1}}(b)<d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b)<d_{a,T_{2}}(w)\) mean that ab|wt(T 1)∩t(T 2). Hence, \(\mathit{count}_{r,r}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\).

Again, by Lemma 7, \(d_{a,T_{1}}(b)<d_{a,T_{1}}(w)\), \(d_{a,T_{2}}(b)=d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w)\) mean that ab|wt(T 1) and a|b|wt(T 2). Therefore, \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\), \(d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w) \}|\).

The formula for \(\mathit{count}_{f,r}(a,b)\) follows by symmetry. □

4.2 Computing \(\mathit{count}_{r,r}(a,b)\) for All bL∖{a}

Suppose aL is fixed. This section describes how to compute \(\mathit{count}_{r,r}(a,b)\) for all bL∖{a} in \(O(n \sqrt{\log n})\) time, assuming all values of \(d_{a,T_{1}}(b)\), \(d_{a,T_{2}}(b)\) have been precomputed.

Our algorithm is named Compute_count_rr and is listed in Fig. 5. It works as follows. Suppose that G is an integer grid of size (n×n) and each bL∖{a} is represented by the point \((d_{a,T_{1}}(b)\), \(d_{a,T_{2}}(b))\) on G. Then:

Fig. 5
figure 5

Algorithm Compute_count_rr

Lemma 9

For each bL∖{a}, the value of \(\mathit{count}_{r,r}(a,b)\) equals the number of points inside the rectangle  \([(d_{a,T_{1}}(b) + 1) : n] \times [(d_{a,T_{2}}(b) + 1) : n]\) in G.

Proof

Lemma 8 states that \(\mathit{count}_{r,r}(a,b)\) equals \(|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\). Every element w that contributes to this expression satisfies \(d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w)\), and thus, \(d_{a,T_{1}}(w) \in [(d_{a,T_{1}}(b) + 1) : n]\) and \(d_{a,T_{2}}(w) \in [(d_{a,T_{2}}(b) + 1) : n]\). □

According to Lemma 9, we can obtain \(\mathit{count}_{r,r}(a,b)\) by calculating how many points lie inside the corresponding rectangle on G. Actually, we have to do this for all of the n−1 rectangles defined by the elements of L∖{a}. So to implement Lemma 9 efficiently, we borrow a data structure by Chan and Pǎtraşcu [6] for offline orthogonal range counting on G. More precisely, we use the following result (Corollary 2.3 in [6]):

Lemma 10

(Chan and Pǎtraşcu [6])

Given n points and n axis-aligned rectangles in the plane, we can count the number of points inside each rectangle in \(O(n \sqrt{\log n})\) total time.

Note that one query to the data structure returns the number of points in each of the rectangles. By Lemma 10, the running time of Algorithm Compute_count_rr becomes \(O(n \sqrt{\log n})\).

4.3 Computing \(\mathit{count}_{r,f}(a,b)\) for All bL∖{a}

Again, suppose aL is fixed. Here, we show how to compute \(\mathit{count}_{r,f}(a,b)\) (and by symmetry, \(\mathit{count}_{f,r}(a,b)\)) for all bL∖{a} in O(n) time. We assume that the values of \(d_{a,T_{1}}(b)\), \(d_{a,T_{2}}(b)\), \(e_{a,T_{1}}(b)\), \(e_{a,T_{2}}(b)\) for all bL∖{a} have been precomputed.

For any given a,b, we need to count how many leaves w appear in a rooted triplet of the form ab|w in t(T 1) and in a fan triplet of the form a|b|w in t(T 2) at the same time. Our idea is to first use T 2 to partition L∖{a} into disjoint subsets C 1,…,C p in such a way that for every pair of elements x,y in the same C i , it holds that \(d_{a,T_{2}}(x) = d_{a,T_{2}}(y)\). Also, further partition each C i into C i,1,…,C i,q so that for every pair of elements x,y in the same C i,j , it holds that \(e_{a,T_{2}}(x) = e_{a,T_{2}}(y)\). The second part of Lemma 7 immediately yields:

Lemma 11

For any b,wL∖{a}, the fan triplet a|b|w belongs to t(T 2) if and only if b and w belong to the same C i -set but different C i,q -sets.

Then, for any bC i,j , the number of fan triplets of the form a|b|w that are consistent with T 2 equals |C i C i,j |. However, what we need to know is when a fan triplet a|b|w from T 2 also satisfies ab|wt(T 1). To also include this crucial condition on T 1, we refine Lemma 11 in the following way:

Lemma 12

Let bL∖{a} and suppose bC i,j . Then, \(\mathit{count}_{r,f}(a,b) =|\{ w \in C_{i} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}| -|\{ w \in C_{i,j} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}|\).

Proof

By Lemma 8, \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w), d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w) \}|\). Since \(d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\) if and only if wC i , and moreover, \(e_{a,T_{2}}(b) = e_{a,T_{2}}(w)\) if and only if wC i,j , we have \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w), w \in C_{i}\), and \(w \not\in C_{i,j} \}|\). Next, to count the total number of leaves w from C i C i,j that satisfy \(d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\), just count how many leaves in C i that satisfy the condition and then subtract the number of leaves in C i,j that also satisfy it. This gives the desired formula for \(\mathit{count}_{r,f}(a,b)\). □

Our algorithm Compute_count_rf is specified in Fig. 6. It uses Lemma 12 to compute \(\mathit{count}_{r,f}(a,b)\) for all bL∖{a} for any fixed aL. For each b, define \(s(b) = |\{ w \in C_{i} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}|\) and \(ss(b) = |\{ w \in C_{i,j} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}|\), where bC i,j . Then, by Lemma 12, \(\mathit{count}_{r,f}(a,b) = s(b) - ss(b)\). The algorithm computes the values of s(b) for all leaves b in each set C i in steps 4–13 by simple counting and by using a variable named c to keep track of how many consecutive elements from C i that have the same \(d_{a,T_{1}}\)-value. Next, ss(b) for all leaves b in each set C i,j are obtained analogously in steps 17–26. At last, step 30 assigns the value s(b)−ss(b) to \(\mathit{count}_{r,f}(a,b)\). The correctness follows from the correctness of Lemma 12.

Fig. 6
figure 6

Algorithm Compute_count_rf

Algorithm Compute_count_rf can be implemented to run in O(n) time by using radix sort to sort the elements inside each C i - and C i,j -set in steps 3 and 16. Each bL∖{a} belongs to exactly one C i -set and exactly one C i,j -set, and is therefore treated once in steps 4–13 and once in steps 17–26.

5 Determining if a Cluster Is a Strong Cluster

This section shows how to check whether any given cluster A of L with |A|≥2 is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) in O(n) time (recall from the definitions in Sect. 2.2 that A is trivially a strong cluster if |A|=1). Our solution relies on the next lemma, illustrated in Fig. 7. For any node u of a tree T, each subtree of T rooted at a child of u is said to be attached to u.

Fig. 7
figure 7

In this example, A⊆̷L is a cluster that does not include any of the leaves i,j,k,l,v,w,x,y,z. Nodes u 1 and u 2 are the lowest common ancestors of A in T 1 and T 2, respectively. The leaves v,w,x,y,z are not descendants of u 1 in T 1, and i,j,k,l are not descendants of u 2 in T 2. According to the condition in Lemma 13, A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) because: (1) for each subtree attached to u 1 and u 2, either all its leaves belong to A or none of them do; and (2) the two sets X 1={i,j,k,l} and X 2={v,w,x,y,z} (each consisting of leaves that are descendants of u 1 and u 2 but do not belong to A) are disjoint

Lemma 13

Let AL with |A|≥2 and define u 1 and u 2 to be the lowest common ancestor of A in T 1 and T 2, respectively. A is a strong cluster of  \(\mathcal {R}_{\mathit{maj}}\) if and only if:

  1. (1)

    For i=1,2, every subtree U attached to u i in T i satisfies either Λ(U)⊆A or Λ(U)⊆LA; and

  2. (2)

    X 1X 2=∅, where X i =Λ(T i [u i ])∖A.

Proof

(⇒) Let A be a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). We shall prove that both conditions (1) and (2) hold.

We first prove (1). For the sake of obtaining a contradiction, suppose that for some i∈{1,2}, there exist aA, xLA where both a and x are in the same subtree attached to u i . Then, \(lca^{T_{i}}(a, x)\) is a proper descendant of u i . The node u i is defined as the lowest common ancestor of A, so \(lca^{T_{i}}(a,a') = u_{i}\) for some a′∈A. However, then \(lca^{T_{i}}(a,x)\) is a proper descendant of \(lca^{T_{i}}(a,a')\), giving ax|a′∈t(T i ). This implies that aa′|x cannot belong to \(\mathcal {R}_{\mathit{maj}}\), which contradicts the fact that A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). Thus, condition (1) must hold.

To prove condition (2), we first show that there exist two leaves a,a′∈A such that \(lca^{T_{1}}(a,a') = u_{1}\) and \(lca^{T_{2}}(a,a') = u_{2}\). Observe that for both i=1,2, the leaves from A are located in at least two different subtrees attached to u i . Thus, there exists at least one pair a,a′∈A with \(lca^{T_{1}}(a,a') = u_{1}\). Now, assume on the contrary that for every pair a,a′∈A such that \(lca^{T_{1}}(a,a') = u_{1}\), \(lca^{T_{2}}(a,a')\) is a proper descendant of u 2. Consider any pair of different subtrees attached to u 1 that contain elements from A, and denote their two sets of leaves by A 1 and A 2. By the assumption, \(lca^{T_{2}}(x,y)\) is a proper descendant of u 2 for all xA 1 and yA 2, and thus all leaves in A 1A 2 must appear in a single subtree attached to u 2. By repeating this argument, all elements of A appear in one subtree attached to u 2, which is impossible. Therefore, there exists a pair a,a′∈A with \(lca^{T_{1}}(a,a') = u_{1}\) such that \(lca^{T_{2}}(a,a')\) is not a proper descendant of u 2, i.e., \(lca^{T_{2}}(a,a') = u_{2}\).

Finally, we are ready to prove (2). Take any a,a′∈A with \(lca^{T_{1}}(a,a') = u_{1}\) and \(lca^{T_{2}}(a,a') = u_{2}\) as in the preceding paragraph. To obtain a contradiction, suppose that X 1X 2≠∅. Then there exists an xX 1X 2, and moreover, x|a|a′∈t(T 1) and x|a|a′∈t(T 2) because the subtree attached to u i for i=1,2 that contains x cannot contain a or a′ by condition (1) above. But then \(aa'|x \not\in \mathcal {R}_{\mathit{maj}}\), contradicting the fact that A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). Hence, (2) holds.

(⇐) Suppose A satisfies the two conditions stated in the lemma. We will prove that for every a,a′∈A and cLA, it holds that \(aa'|c \in \mathcal {R}_{\mathit{maj}}\). The leaf c belongs to exactly one of the three disjoint sets X 1, X 2, and L∖(AX 1X 2), which gives three cases:

  • If cL∖(AX 1X 2): Then \(lca^{T_{i}}(a,a')\) is a proper descendant of \(lca^{T_{i}}(a,c)\) for both i=1,2. Thus, \(aa'|c \in \mathcal {R}_{\mathit{maj}}\).

  • If cX 1: Then c is not a descendant of u 2 in T 2 because \(c \not\in X_{2}\), so aa′|ct(T 2) always holds. There are two subcases depending on whether or not \(lca^{T_{1}}(a,a')\) is a proper descendant of u 1. If yes, then aa′|ct(T 1), and thus \(aa'|c \in \mathcal {R}_{\mathit{maj}}\). If no, then \(lca^{T_{1}}(a,a') = u_{1}\), and thus a|a′|ct(T 1), which also yields \(aa'|c \in \mathcal {R}_{\mathit{maj}}\).

  • If cX 2: Then it follows that \(aa'|c \in \mathcal {R}_{\mathit{maj}}\) in the same way as for the case cX 1 above.

By definition, A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). □

Armed with Lemma 13, it is straightforward to check if any given cluster A of L is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) in O(n) time as shown in Algorithm Check_if_strong_cluster in Fig. 8.

Fig. 8
figure 8

Algorithm Check_if_strong_cluster, based on Lemma 13

6 Concluding Remarks

In this paper, we have shown how to construct the R* consensus tree of two input trees in \(O(n^{2} \sqrt{\log n}) = o(n^{3})\) time. It is an open problem to determine if the time complexity can be further improved to O(n 2) or better. The bottleneck of our algorithm R*_consensus_tree is the computation of \(\mathit{count}_{r,r}\) in Sect. 4.2, which uses \(O(n \sqrt{\log n})\) time for each fixed aL to obtain the values of \(\mathit{count}_{r,r}(a,b)\) for all bL∖{a}.

We remark that for the restricted case where both input trees are binary, the problem becomes much easier because then the R* consensus tree is equivalent to the so-called RV-I tree in [14]. The RV-I tree differs slightly from the R* consensus tree in general in that if ab|c belongs to one of t(T 1) and t(T 2), and a|b|c belongs to the other, then ab|c may never be consistent with the output tree; however, for the special case of binary trees, neither t(T 1) nor t(T 2) contain any fan triplets so \(\mathcal {R}_{\mathit{maj}}\) equals the set of all rooted triplets that are consistent with both T 1 and T 2, and this is precisely the “set of triples which are resolved identically in T 1 and T 2” in Definition 5.1 in [14]. Furthermore, according to Theorem 5.1 in [14], the RV-I tree of T 1 and T 2 is always equivalent to the strict consensus tree (see also [3, 7, 9, 18, 19]) of T 1 and T 2, which can be constructed in O(n) time by a classical algorithm of Day [7]. Thus, there is a large gap in the time complexity of constructing the R* consensus tree of two trees in the binary case (O(n) time) and in the general, non-binary case (\(O(n^{2} \sqrt{\log n})\) time).

The definition of the set of majority rooted triplets \(\mathcal {R}_{\mathit{maj}}\) as well as the definition of an R* consensus tree can be naturally extended to the case of k>2 input trees; see [3]. The currently fastest algorithm for the case k>2, outlined in [3], runs in O(kn 3) time. It can be implemented by explicitly constructing the sets r(T i ) for every input tree T i using O(kn 3) total time, then obtaining \(\mathcal {R}_{\mathit{maj}}\) in O(kn 3) time by finding the most frequently occurring rooted triplet (if it exists) for each {x,y,z} in the leaf label set in O(k) time, and finally applying the O(n 3)-time strong cluster algorithm from Corollary 2.2 in [4] to \(\mathcal {R}_{\mathit{maj}}\). An important open problem is to reduce the time complexity to o(kn 3). It seems difficult to extend the approach used in this paper because it would yield an exponential number (in k) of cases in Lemma 6 and thus express \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\) as the sum of an exponential number of auxiliary \(\mathit{count}\)-functions, causing the total running time to be exponential in k.