Computing the Rooted Triplet Distance Between Phylogenetic Networks

The rooted triplet distance measures the structural dissimilarity of two phylogenetic trees or phylogenetic networks by counting the number of rooted phylogenetic trees with exactly three leaf labels (called rooted triplets, or triplets for short) that occur as embedded subtrees in one, but not both, of them. Suppose that N1=(V1,E1)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_1 = (V_1, E_1)$$\end{document} and N2=(V2,E2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_2 = (V_2, E_2)$$\end{document} are phylogenetic networks over a common leaf label set of size n, that Ni\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_i$$\end{document} has level ki\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k_i$$\end{document} and maximum in-degree di\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_i$$\end{document} for i∈{1,2}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i \in \{1,2\}$$\end{document}, and that the networks’ out-degrees are unbounded. Write N=max(|V1|,|V2|)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = \max (|V_1|, |V_2|)$$\end{document}, M=max(|E1|,|E2|)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M = \max (|E_1|, |E_2|)$$\end{document}, k=max(k1,k2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k = \max (k_1, k_2)$$\end{document}, and d=max(d1,d2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d = \max (d_1, d_2)$$\end{document}. Previous work has shown how to compute the rooted triplet distance between N1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_1$$\end{document} and N2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_2$$\end{document} in O(nlogn)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathrm {O}(n \log n)$$\end{document} time in the special case k≤1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k \le 1$$\end{document}. For k>1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k > 1$$\end{document}, no efficient algorithms are known; applying a classic method from 1980 by Fortune et al. in a direct way leads to a running time of Ω(N6n3)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Omega }(N^{6} n^{3})$$\end{document} and the only existing non-trivial algorithm imposes restrictions on the networks’ in- and out-degrees (in particular, it does not work when non-binary vertices are allowed). In this article, we develop two new algorithms with no such restrictions. Their running times are O(N2M+n3)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathrm {O}(N^{2} M + n^{3})$$\end{document} and O(M+Nk2d2+n3)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathrm {O}(M + N k^{2} d^{2} + n^{3})$$\end{document}, respectively. We also provide implementations of our algorithms, evaluate their performance on simulated and real datasets, and make some observations on the limitations of the current definition of the rooted triplet distance in practice. Our prototype implementations have been packaged into the first publicly available software for computing the rooted triplet distance between unrestricted networks of arbitrary levels.


Background
Phylogenetic trees are commonly used in biology to represent evolutionary relationships, with the leaves corresponding to species that exist today and internal vertices to ancestor species that existed in the past [1]. When studying the evolution of a fixed set of species, different available data and tree reconstruction methods can lead to trees that look structurally different. Quantifying this difference is essential to make better evolutionary inferences, which has led to the proposal of several phylogenetic tree distance measures in the literature. For example, to evaluate the accuracy of a tree reconstruction method M , one can perform the following steps a number of times [2]: First generate a random phylogenetic tree T and let a sequence evolve down the edges of T according to some chosen model of sequence evolution, then apply the method M to reconstruct a tree T ′ , and finally measure the distance between T and T ′ . Some phylogenetic tree distance measures that are based on counting how many times certain features differ in the two trees are the Robinson-Foulds distance [3], the rooted triplet distance [4] for rooted trees, and the unrooted quartet distance [5] for unrooted trees. Other distance measures are the nearest-neighbor interchange distance (introduced independently in [6] and [7]), the path-length-difference distance [8], the subtree prune-and-regraft distance [9], the maximum agreement subtree [10], and the subtree moving tree edit distance [11].
The rooted phylogenetic network model is an extension of the rooted phylogenetic tree model that allows internal vertices to have more than just one parent [12]. Such networks can describe more complex evolutionary relationships involving reticulation events such as horizontal gene transfer and hybridization. As in the case of phylogenetic trees, it is useful to have distance measures for comparing phylogenetic networks. Therefore, in this article, we study a natural generalization [13] of the rooted triplet distance for phylogenetic trees to rooted phylogenetic networks and present two new algorithms for computing it.

Problem Definitions
For any vertex u in a directed acyclic graph, let in(u) and out(u) be the in-degree and out-degree of u. The vertex u is called a leaf if out(u) = 0 , and an internal vertex if out(u) ≥ 1 . Formally, a rooted phylogenetic network N ′ is a directed acyclic graph with one vertex of in-degree 0 (from here on called the root of N ′ and denoted by r(N � ) ), distinctly labeled leaves, and no vertices with both in-degree 1 and out-degree 1. A vertex u in N ′ is called a reticulation vertex if in(u) ≥ 2 holds. If N ′ has no reticulation vertices, i.e., if all vertices in N ′ have in-degree at most 1, then N ′ is a rooted phylogenetic tree. Below, when referring to a "tree", we imply a "rooted phylogenetic tree", and when referring to a "network", we imply a "rooted phylogenetic network".
For the rest of this subsection, suppose that N ′ is a network. A directed edge from a vertex u to a vertex v in N ′ is denoted by u → v . A path from u to v in N ′ is denoted by u ⇝ v . Let the height of u, written as h(u), be the number of edges in a longest path from u to a leaf in N ′ . By definition, if v is a parent of u in N ′ then h(v) > h(u) . We will use L(N � ) to refer to both the set of leaves in N ′ as well as to the set of leaf labels in N ′ since they are in one-to-one correspondence.
The level of a network was introduced by Choy et al. [14] as a parameter to measure the treelikeness of a network, with the special case of a level-0 network being a tree and a level-1 network a so-called galled tree [15] in which all underlying cycles are disjoint. The level is defined as follows. Let U(N � ) be the undirected graph created by replacing every directed edge in N ′ with an undirected edge. An undirected, connected graph H is called biconnected if it has no vertex whose removal makes H disconnected. A maximal subgraph of U(N � ) that is biconnected is called a biconnected component of U(N � ) . (Observe that the biconnected components of U(N � ) are edge-disjoint but not necessarily vertex-disjoint.) For any biconnected component of U(N � ), its corresponding subgraph in N ′ will be referred to as a block of N ′ . We say that N ′ is a level-k network, or equivalently N ′ has level k, if every block of N ′ contains at most k reticulation vertices. Figure 1 shows a level-2 and a level-3 network.
If B is a block of N ′ consisting of more than two vertices and one edge and B contains at most one vertex that has one or more outgoing edges to vertices not belonging to B then B is called uninformative. See Fig. 2 for an illustration.
Next, a rooted triplet is a tree with three leaves. If it is binary we say that is a rooted resolved triplet, and if it is non-binary we say that is a rooted fan triplet. We say that the rooted fan triplet x|y|z is consistent with N ′ if and only if there exists a vertex u in N ′ such that there are three directed paths of non-zero length from u to x, from u to y, and from u to z that are vertex-disjoint except for in u. Similarly, we say that the rooted resolved triplet xy|z is consistent with N ′ if and only if N ′ contains two vertices u and v such that there are four directed paths of non-zero length from u to v, from v to x, from v to y, and from u to z that are vertex-disjoint except for in u and v, and furthermore, the path from u to z does not pass through v. See Fig. 1 for an example. From here on, by "disjoint paths"  . 1 N 1 is a level-2 network and N 2 is a level-3 network with L(N 1 ) = L(N 2 ) = {a 1 , a 2 , a 3 , a 4 } . In this example, D(N 1 , N 2 ) = 6 . Some shared triplets are: a 1 |a 2 |a 4 , a 3 a 4 |a 2 , a 1 a 3 |a 2 . Some triplets consistent with only one network are: a 1 |a 3 |a 4 , a 2 a 3 |a 1 we imply "vertex-disjoint paths of non-zero length". Moreover, when referring to a "triplet", we imply a "rooted triplet". Given two networks N 1 = (V 1 , E 1 ) and N 2 = (V 2 , E 2 ) built on the same leaf label set Λ , the rooted triplet distance D(N 1 , N 2 ) , or triplet distance for short, is the number of triplets over Λ that are consistent with exactly one of N 1 and N 2 . Let S(N 1 , N 2 ) be the total number of shared triplets, i.e., triplets that are consistent with both N 1 and N 2 . Then: Note that a shared triplet contributes a +1 to S(N 1 , N 1 ) , S(N 2 , N 2 ) , and S(N 1 , N 2 ) , e.g., the triplet a 1 |a 2 |a 4 in Fig. 1. On the other hand, a triplet from either network that is not shared contributes a +1 to either S(N 1 , N 1 ) or S(N 2 , N 2 ) , and a 0 to S(N 1 , N 2 ) . As an example, a 1 |a 3 |a 4 in Fig. 1 contributes a +1 to S(N 1 , N 1 ) and a 0 to S(N 2 , N 2 ) and S(N 1 , N 2 ). Let S r (N 1 , N 2 ) and S f (N 1 , N 2 ) be the number of shared resolved and shared fan triplets, respectively. Then S(N 1 , N 2 ) = S r (N 1 , N 2 ) + S f (N 1 , N 2 ) , which implies that D(N 1 , N 2 ) can be obtained by considering shared resolved triplets and shared fan triplets separately.
The rest of this article is focused on how to compute D(N 1 , N 2 ) efficiently. We shall use the following notation to express the time complexities of various algorithms. For i ∈ {1, 2} , the network N i has vertex set V i and edge set E i . The level of N i is k i and the maximum in-degree taken over all vertices in N i is d i . We assume that the two given networks N 1 and N 2 have the same leaf label set Λ , and write n = |Λ| , To simplify the descriptions of the algorithms, we will also assume that: (i) there is no vertex u satisfying both in(u) > 1 and out(u) = 0 , i.e., all leaves have in-degree at most 1; and (ii) there are no uninformative blocks in N 1 and N 2 . Assumption (i) The block drawn with solid edges is an uninformative block because it only has one vertex u with outgoing edges to vertices not in the block u 1 3 is justified because every leaf u with in-degree larger than 1 can be replaced by an internal vertex to which a leaf with the same leaf label as u is attached, and the resulting network will be consistent with exactly the same triplets as before. Assumption (ii) is justified because first each uninformative block can be replaced by an edge, and then each vertex with in-degree 1 and out-degree 1 can be eliminated by contracting its outgoing edge; the resulting network will be consistent with the same triplets as the original network. If necessary, checking the input networks N 1 and N 2 and modifying them to ensure that they comply with (i) and (ii) before running the algorithms takes O(M) time, e.g., by using Hopcroft-Tarjan's algorithm [16] to identify the biconnected components of U(N 1 ) and U(N 2 ).

Previous Work
The rooted triplet distance was introduced by Dobson [4] in 1975 for trees, and generalized to networks by Gambette and Huber [13] in 2012. See also [17,Section 3.2] for a short discussion about the definition. Table 1 lists the time complexities of some previously known algorithms and our new ones for computing D(N 1 , N 2 ) . When k = 0 , both N 1 and N 2 are trees. This case has been extensively studied in the literature [4,[18][19][20][21][22][23][24], with the most efficient algorithms in theory and practice [19,20,24] running in O(n log n) time. For k = 1 , an O(n 2.687 )-time algorithm based on counting 3-cycles in an auxiliary graph was given in [17], and a faster, O(n log n)-time algorithm that transforms the input to a constant number of instances with k = 0 was given in [25]. All of these algorithms allow the vertices in the input networks to have arbitrary Table 1 Previous and new results for computing D(N 1 , N 2 ) , where N 1 and N 2 are two phylogenetic networks built on the same leaf label set Λ Notation: n = |Λ| is the number of leaf labels, N = max(|V 1 |, |V 2 |) is the maximum number of vertices, M = max(|E 1 |, |E 2 |) is the maximum number of edges, k = max(k 1 , k 2 ) is the maximum level, and d = max(d 1 , d 2 ) is the maximum in-degree of the two networks degrees. Moreover, software implementations of the fast algorithms for k = 0 and k = 1 are available [20,[23][24][25].
For k > 1 , much less is known. In a special "binary degree" case where the phylogenetic networks' roots have out-degree 2 and all other internal vertices have either in-degree 2 and out-degree 1, or in-degree 1 and out-degree 2, one can adapt a technique developed by Byrka et al. [27] for a problem related to finding a network consistent with as many resolved triplets as possible from a given set. They showed how to preprocess any fixed network N � = (V, E) satisfying the binary degree constraints so that checking if a resolved triplet is consistent with N ′ can be done efficiently. Below, we shall refer to this preprocessing as constructing a data structure D such that D can be used to determine whether any specified resolved triplet is consistent with N ′ in O(1) time. The proof of Lemma 2 in [27] showed how to build D in O(|V| 3 ) time. According to Remark 1 in [27], this can be further improved to O(|V| + |V|k 2 ) , where k is the level of N ′ . The rooted triplet distance can thus be computed in O(N + Nk 2 + n 3 ) time in a straightforward way when N 1 and N 2 obey the special binary degree constraints. A limitation of D is that it can only support consistency queries for resolved triplets, while a network with no restrictions on the vertices' degrees may also contain fan triplets.
In the general case, when N 1 and N 2 have unbounded degrees and unbounded levels, it is possible to compute D(N 1 , N 2 ) by iterating over all 4 n 3 triplets, and for each such triplet applying the classic directed acyclic graph pattern matching algorithm in [26] to determine its consistency with N 1 and N 2 . However, this leads to a time complexity of Ω(N 6 n 3 ) . To see this, let P in Theorem 3 in [26] be a resolved triplet and G a phylogenetic network N i with |V i | vertices. P has two internal nodes and four edges, so the algorithm will consider |V i | 2 ways of mapping the two internal nodes of P to vertices in N i , and for each one, construct a configuration graph G ′ with Ω((|V i | + 1) 4 ) vertices and look for a path in G ′ . Hence, the algorithm will use Ω(|V i | 6 ) time for each resolved triplet to check if it occurs in N i , i.e., Ω(N 6 n 3 ) time in total.

New Results
Here, we develop two algorithms that significantly improve upon the time complexity of computing the rooted triplet distance in the general, unbounded case. The running time of our first algorithm is O(N 2 M + n 3 ) . One key insight is that a technique of Perl and Shiloach for identifying two disjoint paths between two pairs of vertices in a directed acyclic graph [28] can be extended to check if a fan triplet or a resolved triplet is embedded in a phylogenetic network, leading to the useful concepts of a fan graph and a resolved graph. Our second algorithm then augments these ideas with so-called block trees and contracted block networks to obtain a running time of O(M + Nk 2 d 2 + n 3 ) . Neither algorithm has a strictly better time complexity than the other one for all possible inputs. In the special case where N 1 and N 2 follow the binary degree constraints of Byrka et al. [27], the time complexity reduces to O(N + Nk 2 + n 3 ) , matching the bound in [27]. We also provide implementations of our algorithms, evaluate their performance on simulated and real datasets, and make some observations on the limitations of the current definition of the rooted triplet distance in practice. Our prototype implementations have been packaged into the first publicly available software for computing the triplet distance between two unrestricted networks of arbitrary levels.

Organization of the Article
Section 2 describes our first new algorithm and Sect. 3 the second one. Section 4 presents an implementation of both our algorithms and experiments illustrating their practical performance. Finally, Sect. 5 gives some concluding remarks.

A First Approach
This section presents an algorithm that computes D(N 1 , Overview. The algorithm consists of a preprocessing step and a triplet distance computation step. For the preprocessing step, we extend a technique introduced by Perl and Shiloach [28] to construct suitably defined auxiliary graphs that compactly encode disjoint paths within N 1 and N 2 . Two graphs, the fan graph and resolved graph, are created that enable us to check the consistency of any fan triplet and any resolved triplet, respectively, with N 1 and N 2 in O(1) time. In the triplet distance computation step, we compute D(N 1 , N 2 ) by iterating over all possible 4 n 3 triplets and using the fan and resolved graphs to check the consistency of each triplet with N 1 and N 2 efficiently.

Preprocessing
Let G = (V, E) be a directed acyclic graph and s 1 , t 1 , s 2 , and t 2 four vertices in G. Perl and Shiloach [28] gave an algorithm that can find two vertex-disjoint paths, one from s 1 to t 1 and one from s 2 to t 2 , in O(|V||E|) time or determine that no such pair of paths exists. They achieve this by creating a directed graph G � = (V � , E � ) in O(|V||E|) time, with the property that the existence of such a pair of vertex-disjoint paths in G is equivalent to the existence of a directed path from ⟨s 1 , s 2 ⟩ to ⟨t 1 , t 2 ⟩ in G ′ , where ⟨s 1 , s 2 ⟩ and ⟨t 1 , t 2 ⟩ are vertices in G ′ . A fan triplet or resolved triplet involves more than two vertex-disjoint paths, and below we show how to extend the technique by Perl and Shiloach [28] to determine if a given network has the necessary vertex-disjoint paths that would imply the consistency of a given triplet with the network.
Proof (←) Let x|y|z be any fan triplet consistent with N i . By definition, there exists an internal vertex q in N i and three disjoint paths (except for in q), one from q to x, one from q to y, and one from q to z. Denote these paths by (q, x 0 , x 1 , … , x a ) , (q, y 0 , y 1 , … , y b ) , and (q, z 0 , z 1 , … , z c ) , where x a = x , y b = y , and z c = z . Then N f i also contains the following three paths: • (s, (q, y 0 , z 0 )) : This can be seen from q → y 0 ∈ E i and q → z 0 ∈ E i . • ((q, y 0 , z 0 ), (x 0 , y 0 , z 0 )) : This follows from the fact that q → x 0 ∈ E i and h(q) > h(y 0 ), h(z 0 ).
By concatenating the three paths above, we get a path in N f i from s to (x, y, z).
, where x t = x , y t = y , and z t = z . Let S 1 = (x 1 , x 2 , … , x t ) , S 2 = (y 1 , y 2 , … , y t ) , and S 3 = (z 1 , z 2 , … , z t ) , where x t = x , y t = y , and z t = z , be three sequences of vertices from N i obtained from P.
We prove by induction that the three paths obtained by following the sequences S 1 , S 2 , and S 3 are disjoint paths in N i . Consider any j ∈ {1, 2, … , t} . When j = t , all three vertices x t , y t , and z t are different according to the definition of V f i . For j < t , by the inductive hypothesis we have that (x j+1 , … , x t ) , (y j+1 , … , y t ) and (z j+1 , … , z t ) yield disjoint paths. In addition, by the definition of the fan graph N f i , for every j ∈ {1, 2, … , t − 1} , one of the following three cases holds: (1) x j ≠ x j+1 only, (2) y j ≠ y j+1 only, and (3) z j ≠ z j+1 only. In case (1), note that y j = y j+1 and z j = z j+1 , which means that (x j+1 , … , x t ) , (y j , … , y t ) and (z j , … , z t ) yield disjoint paths. We now show that x j cannot appear in any of these three paths. It holds that h(x j ) ≥ max(h(y j ), h(z j )) , so for ≥ j + 1 and y ≠ y j , we have h(x j ) > h(y ) . Similarly, for ≥ j + 1 and z ≠ z j , we have h(x j ) > h(z ) . Together with the fact that x j , y j , and z j are different according to the definition of N f i , we deduce that the three paths obtained from (x j , … , x t ) , (y j , … , y t ) , and (z j , … , z t ) are disjoint. Cases (2) and (3) can be argued in the same way. Thus, following S 1 , S 2 , and S 3 yields three disjoint paths.
Finally, since P contains a directed edge from s to (x 1 , y 1 , z 1 ) , N i contains an edge from x 1 to y 1 and an edge from x 1 to z 1 . Therefore, the three paths in N i that start at the internal vertex x 1 and then follow the sequences S 1 , S 2 , and S 3 , respectively, are disjoint paths (except for in x 1 ) to x, y, and z. By definition, x|y|z is consistent with N i .

The Resolved Graph
For any network N i , let its resolved graph N r and E r i includes the following directed edges: time, and has the property described in the following lemma:

Lemma 2.3 Consider a network N i and its resolved graph
; such that the three paths are vertex-disjoint except for in x 0 and y j , the first path does not pass through y j , and it holds that x a = x , y b = y , and z c = z.
Let x be a vertex on the first path satisfying h( , with x t = x , y t = y , and z t = z . By the definitions, we have x 1 → y 1 ∈ E i , x q = x q+1 , y q = y q+1 , and y q → z q+1 ∈ E i . Define three sequences of vertices from N i as follows: We claim that following the sequences S 1 , S 2 , and S 3 yields three disjoint paths in N i . (This claim is shown below.) The claim and the fact that N r i contains an edge from s to (x 1 , y 1 ) and an edge from (x q , y q ) to (x q+1 , y q+1 , z q+1 ) then imply that N i contains a path from x 1 to x, a path from x 1 to y q , a path from y q to y, and a path from y q to z that make yz|x consistent with N i .
To prove the claim, we show that the paths obtained by following the sequences of vertices listed below are disjoint: To prove that the paths obtained by following the sequences in (a) are disjoint we use induction. By the definition of N r i , we know that x q ≠ y q . For the inductive hypothesis, assume that the paths obtained from (x j+1 , … , x q ) and (y j+1 , … , y q ) are disjoint. Again by definition, there are two cases: (1) x j ≠ x j+1 only; and 1 3 (2) y j ≠ y j+1 only. For (1), we have y j = y j+1 and h(x j ) ≥ h(y j ) , thus for > j + 1 and y ≠ y j , we have h(x j ) > h(y ) . Together with x j ≠ y j , we can see that x j does not appear in (y j , … , y q ) . Case (2) can be handled in the same way. Thus, the paths from (a) are disjoint.
For (b), the induction proof from the proof of Lemma 2.1 immediately implies that the three paths are disjoint.
To show that the paths obtained from (c) are disjoint, let j ∈ {1, … , q} be the largest index such that x j ≠ x q . We know from the paths in (b) that x q = x q+1 does not appear in (y q+1 , … , y t ) , so we only need to prove that ( . We consider the following two cases: , using the same argument as in (1), we have that (x 1 , … , x g ) and (y , … , y t ) are disjoint. It only remains to show that x j does not appear in (y , … , y t ) . If we assume that x j appears in (y , … , y t ) then because y ≠ x j , we would have h(y ) > h(x j ) , which leads to a contradiction.
For the paths from (d), similar arguments as in (c) can be applied since To show that the paths from (e) are disjoint, because , meaning that the paths from (e) are disjoint.
Finally, to show that the paths from (f) are disjoint, by definition we have x q = x q+1 and h(y q ) ≥ h(x q ) . So for every > q + 1 and x ≠ x q , it holds that h(y q ) > h(x ) . Since we also have that x q ≠ y q , the paths from (f) are disjoint. ◻

Corollary 2.4
Let N i be a given network and r ′ a dummy leaf attached to r(N i ) . For any two different leaves x and y in N i that are not r ′ , there are two paths from some internal vertex z ≠ r(N i ) in N i to x and y that are disjoint, except for in z, if and only if s can reach (r � , x, y) in N r i .

Triplet Distance Computation
Algorithm 1 summarizes the steps for computing the triplet distance between two networks N 1 and N 2 . The main procedure, D(), uses Equation (1.1) to calculate D (N 1 , N 2 ) . It first builds the fan table A f i and the resolved table A r i for each N i , i ∈ {1, 2} , in a preprocessing step, and then relies on the procedure S() for counting shared triplets. The shared fan triplets and shared resolved triplets are counted by iterating over all possible triplets and using the fan and resolved tables to determine the consistency of any triplet with each of the two networks. The correctness is ensured by Lemmas 2.1 and 2.3.
To analyze the running time, building the data structures N r

A Second Approach
In this section, we show how to compute D(N 1 , Overview. Algorithm 1 in the previous section computed D(N 1 , N 2 ) by iterating over all possible triplets and using the fan and resolved tables for N 1 and N 2 to identify which triplets were consistent with both networks. To refine this idea, for every block of N i , we will define a network of approximately the same size as the block, which we call a contracted block network. For every such contracted block network, we build a fan and resolved graph and the corresponding fan and resolved table. Furthermore, by replacing the blocks of N i by single vertices, we obtain a tree structure called the block tree. The new algorithm in this section combines the block tree and all the fan and resolved tables of the contracted block networks of N i to efficiently determine whether or not any specified triplet is consistent with N i .

Preprocessing
Let N i be a network. Note that every block B of N i contains one vertex whose height is greater than the heights of all other vertices in B. This vertex will be called the root of B and denoted by r(B). If B contains only one edge u → v and v ∈ L(N i ) then B is called a leaf block; otherwise, B is called a non − leafblock . Recall from Sect. 1.2 that we assume without loss of generality that: (i) all leaves have in-degree at most 1 (so that every leaf has a leaf block); and (ii) the input networks have no uninformative blocks. Lemma 3.1 presents an important property of the blocks in N i .

Lemma 3.1 All blocks of a given network N i are edge-disjoint.
Proof For the purpose of obtaining a contradiction, suppose that N i has two different blocks B 1 = (V 1 , E 1 ) and B 2 = (V 2 , E 2 ) that share an edge. Define , and U(B) be the subgraphs of U(N i ) corresponding to B 1 , B 2 , and B. Since U(B 1 ) and U(B 2 ) are connected graphs that share an edge, U(B) is also connected. Furthermore, if any vertex is removed from B, U(B) will still be connected. Therefore, U(B 1 ) and U(B 2 ) are not maximal biconnected subgraphs of U(N i ) , which means B 1 and B 2 are not blocks of N i . Hence, we have reached a contradiction and the lemma follows. ◻

The Block Tree
From a high-level perspective, we will remove the cycles in U(N i ) by replacing the non-leaf blocks by internal nodes to obtain a rooted tree on the leaf label set L(N i ) . A similar idea was previously used by Choy et al. in the proof of Lemma 2 in [14] to bound the number of reticulation vertices in a network, and later by Byrka et al. [27] to efficiently check if a resolved triplet is consistent with a network. Below, we will show that it is also useful for checking if a fan triplet is consistent with a network. Formally, let T i = (V � , E � ) be a rooted tree, from now on referred to as the block tree, with vertex set V ′ and edge set E ′ constructed as follows: Figure 4 gives an example of a network N i and its block tree T i . The set of blocks in N i and the vertex set V � − r(T i ) , i.e., the set of all vertices of T i except the root, are in one-to-one correspondence. An edge b 1 → b 2 in T i means that the corresponding blocks B 1 and B 2 in N i do not have the same root and the root vertex r(B 2 ) is a shared vertex between B 1 and B 2 . Note that by the definition of a block, an edge connecting two vertices can define a block of its own (for example, block B 9 in Fig. 4).
The following lemma states some properties of T i .

Lemma 3.2 Let
be the block tree of a given network N i . The block tree T i is a rooted tree that has n leaves, is connected as well according to the construction. Next, we prove that T i is a tree by contradiction. Suppose that U(T i ) has a cycle.
, and with r(B) being a vertex in both B 1 and B 2 . By the definition of N i , the root r(N i ) has a path to every vertex in N i , so r(B 1 ) and r(B 2 ) must have a common ancestor. This means that the two blocks B 1 and B 2 could be merged to create an even larger block that contains both of them, contradicting that B 1 and B 2 are blocks of N i . Thus, T i is a rooted tree.
Next, we count the number of vertices and edges in T i . By assumption (i) mentioned above, there are no leaves with in-degree greater than 1 in N i . Thus, N i contains n leaf blocks and there will be exactly n leaves in T i . To count the internal vertices in T i , we distinguish between vertices having in-degree 1 and out-degree 1, from now on referred to as extra vertices, and non-extra vertices. First, to count the non-extra vertices in T i , observe that if we were to contract its extra vertices, i.e., add an edge from the parent of every such vertex u to the child of u and then remove u, we would obtain a tree T �� i = (V �� , E �� ) with n leaves in which every internal vertex has in-degree 1 and out-degree at least 2. This means that |V �� | = O(n) and |E �� | = O(n) . Secondly, to count the extra vertices, observe that any extra vertex Since the set of blocks of N i and the set V � − r(T i ) are in one-to-one correspondence, we also have: The following lemma shows that the block tree T i can be built efficiently: Proof Constructing T i when the blocks of N i are given is performed by scanning the vertices of N i and the list of components that every vertex belongs to, while adding edges to T i according to the definition of V ′ and E ′ . This requires O(|V i |) time. Finding the blocks takes O(|E i |) time by applying the algorithm by Hopcroft and Tarjan in [16]. Lastly,

Contracted Block Networks
Each block in N i can be viewed as a network, to which we may apply the techniques from Sect. 2 for detecting those triplets that are anchored within. To be able to do so, we first take each block B, make some adjustments to it as described next, and call the resulting network C . Proof If k i = 0 then B consists of a single edge of N i , meaning that C B is a binary tree on three leaves (a leaf of the form s j , its copy leaf s ′ j , and the artificial leaf r ′ ). In this case, |V � | = 5 , |E � | = 4 , and |L(C B )| = 3.
If k i ≥ 1 , there are two possibilities. If B contains only one edge then C B is a binary tree on three leaves as in the case k i = 0 above. Otherwise, proceed as follows to derive the bounds. Call a non-reticulation vertex of C B that is the parent of at least two internal vertices of C B a branching vertex (e.g., v 2 and v 7 in Fig. 6), and a nonreticulation vertex of C B that is the parent of exactly one internal vertex a path vertex (e.g., v 3 , v 5 , v 6 , v 8 , v 10 , and v 13 in Fig. 6). We apply a technique from [27] to count the branching vertices and note that every branching vertex is the beginning of at least one new directed path that has to end at a reticulation vertex. Since each reticulation vertex can end at most d i such paths and there are at most k i reticulation vertices in C B , the number of branching vertices is at most k i d i . Every path vertex is the parent of either a branching vertex or a reticulation vertex, and every reticulation vertex has at most d i parents, so the number of path vertices is at most 2k i d i . Therefore, the total number of internal vertices is at most k i (3d i + 1) . Next, at most two leaves are attached to each internal vertex, so |L(C B )| ≤ 2k i (3d i + 1) and |V � | ≤ 3k i (3d i + 1) . As for the edges, there are at most k i d i edges ending at reticulation vertices, at most k i d i edges ending at branching vertices, at most 2k i d i edges ending at path vertices, and |L(C B )| edges ending at leaves. Adding them together gives |E � | ≤ 10k i d i + 2k i . Hence, the lemma statement holds for every k i ≥ 0 . ◻

Constructing All Contracted Block Networks Efficiently
We first introduce some additional notation. For a given network N i and a block B in N i , a leaf x in N i is said to associate with B if there exists a vertex u in B such that u ≠ r(B) and x ∈ L u B . As an example, in Fig. 5, the leaf a 16 associates with B, but the leaves a 2 and a 3 do not associate with B. For any leaf x associated with a block B of N i , define:   is updated once the final set in which the leaf l will reside has been determined. After contracting all the edges, we also add the artificial leaf r ′ j .
Step 1 is performed by using the algorithm from [16], which takes O(|E i |) time.
Step 2  We first express the total size of the contracted block networks in terms of N. When C B x is constructed from B x , each vertex in B x will either be deleted or remain and introduce at most two leaves, so c(x) ≤ 3 ⋅ b(x) . Next, since the blocks decompose N i into edge-disjoint subgraphs by Lemma 3.1, and the total number of times that blocks overlap each other is equal to the number of edges E ′ in the block tree T i ,

Checking If a Triplet is Consistent with a Network
Sections 3.2.1 and 3.2.2 below describe how to determine if any given fan or resolved triplet, respectively, is consistent with N i in O(1) time, assuming that the data structures from Sect. 3.1 have already been built. A more precise definition of triplet consistency that can associate specific locations in the network to triplets that are consistent with it will be needed in this section. Let B be a block of a network N i . We say that x|y|z is a fan triplet consistent with B if and only if there exists a vertex u in B such that there are three directed paths in N i from u to x, from u to y, and from u to z that are disjoint except for in u. We also say that x|y|z is rooted at u in B. Since u belongs to N i , this means that x|y|z is rooted at u in N i as well. Next, we say that xy|z is a resolved triplet consistent with B if and only if there exist two vertices u and v ( u ≠ v ) in B such that there are four directed paths in N i from u to v, from v to x, from v to y, and from u to z that are disjoint except for in u and v, and the path from u to z does not pass through v. Moreover, we say that xy|z is rooted at u and v in B and in N i .
Observe that if x|y|z is a fan triplet consistent with a block B, then it is also consistent with N i . In the same way, if xy|z is a resolved triplet consistent with B, it is also consistent with N i .

Checking a Fan Triplet
First, we show how to determine if a given fan triplet x|y|z is consistent with a given block B (Lemma 3.8). The procedure, named IsFanInBlock, requires that the lowest common ancestor (in the block tree T i ) of x and y, the lowest common ancestor of x and z, and the lowest common ancestor of y and z are the same, and that this node corresponds to the block B being examined.
After that, the procedure IsFanInBlock is used as a subroutine in another procedure, named IsFan, to determine if a given fan triplet x|y|z is consistent with a network (Lemma 3.9). Whenever IsFanInBlock's requirement on the lowest common ancestors cannot be met, IsFan instead considers the different cases for the locations of the lowest common ancestor of every pair (x, y), (x, z), and (y, z) in T i . Since every vertex in T i except r(T i ) corresponds to a block in N i , it can then apply the available data structures to determine if N i has the necessary disjoint paths.

Lemma 3.8
Let N i be a given network and T i its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on N i . Consider any x, y, z ∈ Λ such that the lowest common ancestor of every pair (x, y), (x, z), and (y, z) is a node w in T i . If w ≠ r(T i ) , Algorithm 2 determines whether or not the fan triplet x|y|z is consistent with the block B in N i corresponding to w in O(1) time.
Proof For every l ∈ {x, y, z} , we let p l = p B (l) , p � l = p � B (l) , q l = q B (l) , and h l be the height of q l in N i . By construction (see Lemmas 3.4 and 3.6), we know that p x , p y , and p z are not the root of C B . The algorithm uses the tables Q and P to check all the possible cases for the values of p x , p y , p z , q x , q y , and q z , and return a true or false value, indicating a positive and a negative answer respectively. We have the following cases: We have q x = q y = q z and x|y|z is rooted at q x . Hence, x|y|z is consistent with B (e.g., a 5 |a 6 |a 7 in Fig. 5).
W.l.o.g., assume true for ((h x = h y ) ∧ (h x > h z )) : Then, we have q x = q y ∧ q x ≠ q z and x|y|z is rooted at q x . Hence, x|y|z is consistent with B (e.g., a 5 |a 6 |a 8 in Fig. 5). (c) h x ≠ h y ≠ h z : Then q x ≠ q y ≠ q z , thus x|y|z is not consistent with B (e.g., a 13 |a 14 |a 20 in Fig. 5).
(a) h x = h y : We have q x = q y . If p ′ x |p x |p z is a fan triplet in C B , then x|y|z is rooted at q x , thus x|y|z is consistent with B (e.g., a 8 |a 9 |a 15 in Fig. 5). If p ′ x |p x |p z is not a fan triplet in C B , x|y|z is not rooted at any vertex in B, thus x|y|z is not consistent with B (e.g., a 8 |a 9 |a 11 in Fig. 5). (b) h x ≠ h y : Then q x ≠ q y and either q x or q y was contracted when creating C B .
Moreover, both x and y are now in the set of leaves defined by p x . Since we also have p z ≠ p x , the triplet x|y|z is not consistent with B (e.g., a 7 |a 8 |a 15 in Fig. 5).
3. p x ≠ p y ≠ p z : If p x |p y |p z is consistent with C B , then there exists a vertex u in B such that x|y|z is rooted at u. Hence, x|y|z is consistent with B (e.g., a 8 |a 11 |a 16 in Fig. 5). If p x |p y |p z is not consistent with C B , x|y|z is not rooted at any vertex in B, thus x|y|z is not consistent with B (e.g., a 14 |a 16 |a 17 in Fig. 5).
In every case above, testing if a fan triplet is consistent with C B translates to finding a path that starts from s in C f B and ends in a vertex of C f B defined by the leaves of the fan triplet. Hence, every case can be handled in O(1) time. In Algorithm 2, the above cases are summarized in a procedure. ◻ Lemma 3.9 Let N i be a given network and T i its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on N i . For any x, y, z ∈ Λ , Algorithm 3 determines whether or not the fan triplet x|y|z is consistent with Proof For a block B of N i and a vertex u in B that can reach a leaf x of N i , define h B (x) to be the height of q B (x) in N i . In Algorithm 3 we have the procedure for testing the consistency of the fan triplet x|y|z. It considers the following cases: 1. x|y|z is consistent with T i : Let w be the lowest common ancestor of x, y, and z in T i .

3
(a) w = r(T i ) : x|y|z is rooted at r(N i ) , thus x|y|z is consistent with N i (e.g., a 23 |a 9 |a 20 in Fig. 4). (b) w ≠ r(T i ) : w corresponds to a block B in N i , thus we use Lemma 3.8 to determine if x|y|z is consistent with B. If x|y|z is consistent with B, then it is also consistent with N i . If x|y|z is not consistent with B, then it is not consistent with N i (e.g., a 3 |a 9 |a 12 in Fig. 4).
2. xy|z ∨ xz|y ∨ yz|x is consistent with T i . Assume w.l.o.g. that xy|z is consistent with T i . Let w = lca(x, y) in T i and = lca(x, z) in T i , and let B be the block in N i corresponding to w and F the block in N i corresponding to : is not the parent of w in T i : then x|y|z is not rooted at any vertex in N i , thus x|y|z is not consistent with N i (e.g., a 2 |a 4 |a 13 in Fig. 4).
is the parent of w in T i . By the definition of T i , B is rooted at a vertex u of F that is not r(F): i. (p B (x) = p B (y)) : then x|y|z is not rooted at any vertex in N i , thus x|y|z is not consistent with N i (e.g., a 2 |a 3 |a 4 in Fig. 4).
where r ′ is the dummy leaf in C B (see Corollary 2.2), then x|y|z is rooted at r(N i ) , thus x|y|z is consistent with N i (e.g., a 1 |a 11 |a 15 in Fig. 4). Otherwise, x|y|z is not rooted at any vertex in N i , thus x|y|z is not consistent with N i (e.g., a 12 |a 13 |a 15 in Fig. 4).
we have q F (x) = q F (y) , thus h F (x) = h F (y) . Using Corollary 2.2, if r � |p B (x)|p B (y) is a fan triplet in C B , where r ′ is the dummy leaf in C B , then x|y|z is rooted at q F (x) , thus x|y|z is a fan triplet in N i (e.g., a 1 |a 4 |a 8 in Fig. 4). Otherwise, x|y|z is not rooted at any vertex in N i , thus x|y|z is not consistent with N i (e.g., a 1 |a 24 |a 8 in Fig. 4).
we have q F (x) = q F (y) and h F (x) = h F (y) . Hence, x|y|z is not consistent with N i (e.g., a 1 |a 4 |a 21 in Fig. 4).
is a fan triplet in C F , then x|y|z is rooted at q F (x) . Hence, x|y|z is consistent with N i (e.g., a 1 |a 4 |a 9 in Fig. 4). Otherwise, x|y|z is not rooted at any vertex of N i , thus x|y|z is not consistent with N i (e.g., a 1 |a 4 |a 12 in Fig. 4).

Checking a Resolved Triplet
The strategy for determining if a given resolved triplet xy|z is consistent with a network is analogous to the case of fan triplets just described. The procedure IsResolvedInBlock (see Lemma 3.10) first considers consistency with a block B in the case where it holds in the block tree T i that the lowest common ancestor of x and y, the lowest common ancestor of x and z, and the lowest common ancestor of y and z are the same. Next, the procedure IsResolved (see Lemma 3.11) uses IsResolvedInBlock and the available data structures to take care of the general case. Lemma 3.10 Let N i be a given network and T i its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on N i . Consider any x, y, z ∈ Λ such that the lowest common ancestor of every pair (x, y), (x, z), and (y, z) is a node w in T i . If w ≠ r(T i ) , Algorithm 4 determines whether or not the resolved triplet xy|z is consistent with the block B in N i corresponding to w in O(1) time.
Proof Like in the case of fan triplets in Lemma 3.8, for every l ∈ {x, y, z} , we let p l = p B (l) , p � l = p � B (l) , q l = q B (l) , and h l be the height of q l in N i . By construction (see Lemmas 3.4 and 3.6), we know that p x , p y , and p z are not the root of C B . The algorithm uses the tables Q and P to check all the possible cases for the values of p x , p y , p z , q x , q y , and q z , and return a true or false value, indicating a positive and a negative answer respectively. We have the following cases: x |p x |p y is a fan triplet in C B , then xy|z is rooted at q z and q x , thus xy|z is consistent with B (e.g., a 14 a 17 |a 13 in Fig. 5). If p ′ x |p x |p y is not consistent with C B , xy|z is not rooted at any pair of vertices in B, thus xy|z is not consistent with B (e.g., a 14 a 16 |a 13 in Fig. 5.). 2. h z ≤ h x : Since p x = p z , the resolved triplet xy|z cannot be consistent with B (e.g., a 14 a 17 |a 20 in Fig. 5).
3. p x ≠ p y ≠ p z : If p x p y |p z is consistent with C B , then there exist two different vertices u, v in B such that xy|z is rooted at u and v, thus xy|z is consistent with B (e.g., a 12 a 13 |a 18 in Fig. 5). If p x p y |p z is not consistent with C B , xy|z is not rooted at any pair of vertices in B, thus xy|z is not consistent with B (e.g., a 12 a 18 |a 13 in Fig. 5).
Similarly to fan triplets, testing if a resolved triplet is consistent with C B translates to finding a path that starts from s in C r B and ends in a vertex of C r B defined by the leaves of the resolved triplet. Hence, every case can be handled in O(1) time. Algorithm 4 summarizes the above cases in a procedure. ◻ Lemma 3.11 Let N i be a given network and T i its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on N i . For any x, y, z ∈ Λ , Algorithm 5 determines whether or not the resolved triplet xy|z is consistent with Proof For a block B of N i and a vertex u in B that can reach a leaf x of N i , define h B (x) to be the height of q B (x) in N i . In Algorithm 5 we have the procedure for testing the consistency of the resolved triplet xy|z. We consider the following cases, which are similar to the cases for fan triplets in Lemma 3.9: 1. x|y|z is consistent with T i : Let w be the lowest common ancestor of x, y, and z in T i .
(a) w = r(T i ) : xy|z is not rooted at any pair of vertices in N i , thus xy|z is not consistent with N i (e.g., a 23 a 9 |a 20 in Fig. 4). (b) w ≠ r(T i ) : w corresponds to a block B in N i , thus we use Lemma 3.10 to determine if xy|z is consistent with B. If xy|z is consistent with B, then it is also consistent with N i . If xy|z is not consistent with B, then it is not consistent with N i (e.g., a 1 a 9 |a 12 in Fig. 4).
2. xy|z ∨ xz|y ∨ yz|x is consistent with T i . Assume w.l.o.g. that xy|z is consistent with T i . Let w = lca(x, y) in T i and = lca(x, z) in T i , and let B be the block in N i corresponding to w and F the block in N i corresponding to : is not the parent of w in T i : then there exists a vertex u in B and a vertex v in F such that xy|z is rooted at v and u, thus xy|z is consistent with N i (e.g., a 2 a 4 |a 13 in Fig. 4).
is the parent of w in T i . By the definition of T i , B is rooted at a vertex u of F that is not r(F). We consider the following cases: . Then, xy|z is rooted at either r(B) and q B (x) , or q F (z) and q B (x) , or r(F) and q B (x) . Hence, xy|z is consistent with N i (e.g., a 2 a 3 |a 4 in Fig. 4).
where r ′ is the dummy leaf in C B , then there exists a vertex u in B such that xy|z is rooted at r(N i ) and u. Hence, xy|z is consistent with N i (e.g., a 11 a 13 |a 15 in Fig. 4). Otherwise, xy|z is not rooted at any pair of vertices in N i , thus xy|z is not consistent with N i (e.g., a 1 a 13 |a 15 in Fig. 4). iii. (p B (x) ≠ p B (y)) ∧ ( ≠ r(T i )) : where r ′ is the dummy leaf in C B , then there exists a vertex u in B such that xy|z is rooted at q F (x) and u. Hence, xy|z is consistent with N i (e.g., a 1 a 4 |a 8 in Fig. 4). Otherwise, xy|z is not rooted at any pair of vertices in N i , thus xy|z is not consistent with N i (e.g., a 1 a 25 |a 22 in Fig. 4).
we have q F (x) = q F (y) and h F (x) = h F (y) . Then, there exists a vertex u in B such that xy|z is rooted at q F (z) and u, thus xy|z is consistent with N i (e.g., a 1 a 4 |a 21 in Fig. 4).
where r ′ is the dummy leaf in C B , then there exists a vertex u in B such that xy|z is rooted at either r(B) and u, or q F (z) and u, or r(F) and u. If p F (x)p � F (x)|p F (z) is consistent with C F , then w.l.o.g. if h F (x) > h F (y) we have that xy|z is rooted at some vertex u of F and q F (x) . In both cases, xy|z is consistent with N i (e.g., a 1 a 4 |a 12 in Fig. 4).
If both cases are false, xy|z is not rooted at any pair of vertices in N i , thus xy|z is not consistent with N i (e.g., a 1 a 25 |a 26 in Fig. 4).

Triplet Distance Computation
Our second algorithm for computing the triplet distance between two given networks N 1 and N 2 is listed in Algorithm 6. It has the same basic structure as the algorithm in Sect.

Implementation and Experiments
This section presents the implementations of the two algorithms from Sects. 2 and 3, and experimental results demonstrating their practical performance. Both simulated and real datasets were used in the experiments.

Algorithm Implementation
From here on, the algorithm from Sect. 2 will be referred to as NTDfirst and the algorithm from Sect. 3 as NTDsecond. Both algorithms were implemented in the C++ programming language and the source code is publicly available at:

https://github.com/kmampent/ntd
Since no other implementations for computing the rooted triplet distance between two networks of arbitrary levels are available, the correctness of our program code was verified by trying a large number of pairs of input networks under varying parameters and making sure that the output of NTDfirst (which is simple to implement) was identical to the output of NTDsecond in all cases.

The Setup
The experiments were performed on a machine with 16GB RAM and Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz. The operating system was Ubuntu 16.04.2 LTS, and the compiler used was g++ 5.4 with cmake 3.11.0.

Experiment 1: Performance
The first set of experiments were designed to measure the running times and memory usage of our implementations of NTDfirst and NTDsecond. To do so systematically, we used simulated datasets. The Input. Given three parameters n, p, and e, where n ≥ 1 is an integer, 0 ≤ p ≤ 1 , and e ≥ 0 is an integer, an input network N ′ was built according to the following method: • Generate a random rooted binary tree T with n leaves in the uniform model [29]. • For each internal vertex w in T except r(T), contract the edge between the parent of w and w with probability p. • For each vertex w in T, let d(w) be the number of edges on the path from r(T) to w. Let N � = T.
• Until e edges have been added or it is impossible to add any more edges: Add an edge between two vertices in N ′ chosen uniformly at random, under the constraint that an edge u → v is created in N ′ only if d(u) < d (v) . (In other words, if the total number of edges that can be added is y and y < e , then only add those y edges.) Experimental Results. We applied NTDfirst and NTDsecond to pairs of networks generated with the method above for varying values of n, p, and e, and measured their running times and memory usage. In the graphs shown below, every data point corresponds to the average taken over 30 runs with a set of fixed parameters. Reticulation events are typically rare in nature [30], so we used relatively small values for e, i.e., e ≤ 50 when n ≤ 500 , to make the experiments more realistic.
The results of Experiment 1 are reported below.
1. The two algorithms' running times and memory usage increase as n increases according to the plots in Figs. 7 and 8. The first figure shows the CPU time in seconds taken when p = 0 and e ∈ {10, 20, 30, 40, 50} . For NTDfirst we used 10 ≤ n ≤ 230 , and for NTDsecond we used 10 ≤ n ≤ 500 . Space is the reason behind the restrictions on n. As can be seen in Fig. 8a, at n = 230 the memory usage of NTDfirst is getting close to the limit of the available 16GB RAM. When n ≥ 240 , the memory requirements exceed the limit, and the operating system initiates highly time-consuming communication with the disk. 2. Both algorithms take more time as the parameter e increases due to the additional edges in the generated networks, with NTDsecond suffering more than NTDfirst. Again, see to build the highly time-and memory-consuming fan and resolved graph on the entire input network, and instead build several such graphs on smaller blocks. Figure 9 shows that a larger value of e implies a higher level k as well as fewer non-leaf blocks in N ′ , which in turn implies more time spent by NTDsecond building the fan and resolved graphs. An extreme situation is when e is so large that N ′ has a really small number of non-leaf blocks, one of which is roughly as large as N ′ itself. Then, given that the preprocessing of NTDsecond is more complex than that of NTDfirst, NTDsecond will be slower than NTDfirst. An example of where this happens can be found in Fig. 10a when the parameters are n = 90 , p = 0 , and e = 50.
In contrast, when p is large, e.g., p = 0.8 in Fig. 10b, the effect of e on the running times is small. This holds especially for NTDsecond. There will be fewer internal vertices in the generated networks, which means that the number of edges that can be added decreases as well. 3. The effect of the parameter p on the relative running times of the two algorithms is shown in Fig. 11. In general, the difference in the two algorithms' running times becomes smaller as the value of p increases. For certain combinations of the parameters such as n = 90 , p = 0 , and e = 50 in Fig. 11c, NTDfirst is faster than NTDsecond, as observed earlier.

Experiment 2: Limitations of the Rooted Triplet Distance
The second set of experiments applied the algorithms to real datasets. The goal was to see how informative the current definition of the rooted triplet distance is in practice when comparing phylogenetic networks, and to investigate any potential shortcomings. The Input. For the real datasets, we borrowed six networks from Table S4 in [31] that describe biologically motivated alternative 'scenarios' for the evolutionary history of the Viola genus. They are named N A , N B , N C , N D , N E , and N F below. The first five networks correspond to the five scenarios A, B, C, D, and E in [31], and N F is "Scenario E, CHAM and MELVIO resolved", which is actually the same as scenario E but with two of the subclades (overlapping subtrees) expanded. Only two of the six networks are shown here; the network N B is displayed in Fig. 12a, and N D in Fig. 12b. For the other four networks' branching structures, the reader is referred to Table S4 in [31].
The networks in Table S4 in [31] were inferred from a set of multilabeled trees. (A multilabeled tree is a generalization of a phylogenetic tree in which identical leaf labels are allowed to occur more than once.) The method that was used to construct the networks is explained in detail in Step 3 ("Inference of the Most Parsimonious Network from Multilabeled Gene Trees") in the MateRIals and Methods-section of [31]. Table S4 in [31] also provides these multilabeled trees. In order to represent the multilabeled trees as distinctly leaf-labeled trees as well, [31] replaced any repeated leaf label x by unique leaf labels of the form x.1, x.2, … , x.i ; e.g., one Note that for all s ∈ {A, B, C, D, E, F} , the number of leaf labels in T s is larger than than the number of leaf labels in N s due to the leaf relabeling process just described to obtain distinctly leaf-labeled trees.
In our implementations, the input trees are represented in standard Newick format and the input networks in extended Newick format [32]. We employ the graph-theoretic standard adjacency list to store the input networks, making it easy to support different input formats at the same time.
Experimental Results. We used the trees T s and networks N s , where s ∈ {A, B, C, D, E, F} , from Table S4 in [31], as explained above. In the experiments, we computed the rooted triplet distance between each T s and N s and also between pairs of these networks. According to Equation

Fig. 12
The networks N B and N D from [31] computing D(T F , N F ) , with NTDfirst requiring only 0.18 seconds to run and NTDsecond 0.05 seconds. Our findings are summarized in Tables 2 and 3. By inspecting the tables, Experiment 2 reveals two ways that the current definition of the rooted triplet distance for networks could be improved: 1. Table 2 (T s , T s ) ). This is because of the resolved triplets that arise when N s is created from a multilabeled tree using the method in [31], and the fan triplets that are created whenever a leaf x is replaced by x.1, … , x.i in N s . Consequently, it would be desirable to give less weights to such triplets. A more flexible definition of the rooted triplet distance that can assign different weights to different triplets could therefore be useful. 2. Next, Table 3 lists the triplet distance D(N s , N s � ) for all pairs s, s � ∈ {A, B, C, D, E} .
The networks N A , … , N E have identical leaf label sets, but the leaf label set of N F is different, which is why N F is excluded from Table 3. Interestingly, although the two networks N B and N D are structurally different (see Fig. 12), their triplet distance is 0. This suggests that alternative definitions of the rooted triplet distance for networks may be better in practice, as discussed in Sect. 5 below.  1 3

Final Remarks
We have developed two new algorithms for computing the rooted triplet distance between two phylogenetic networks over the same leaf label set. We have also presented an implementation of the algorithms and evaluated their performance on simulated and real datasets. Future work involves creating new algorithms that are even more efficient than the algorithms given here, as well as to research variants of the studied problem that may provide more biologically meaningful ways for comparing networks. An example of such a variant is motivated by the experiments on the real dataset in Sect. 4.4. Recall that the two networks N B and N D were structurally different, yet their triplet distance was 0. The reason is that, unlike in the case of trees, the same triplet can appear several times in a network, and for two networks N 1 and N 2 to be compared, if a triplet appears 1000 times in N 1 and only once in N 2 , it would contribute 0 under

Fig. 13
Next to every vertex marked with a circle is the number of different pairs of disjoint paths from that vertex to the leaves with labels Tridens and Chilenium, and the number of different disjoint paths from the root to the vertex. With definition A of multiplicity for resolved triplets, the resolved triplet | appears (4 + 6 + 2 + 1 + 2 + 1) ⋅ 1 + 1 ⋅ 2 = 18 times in N B and (5 + 5 + 8 + 2 + 2 + 2 + 1 + 1) ⋅ 1 + 1 ⋅ 2 = 28 times in N D . With definition B, this triplet appears 7 times in N B and 9 times in N D the current definition of D (N 1 , N 2 ) . However, extending the definition of the triplet distance for networks to capture information about the frequencies of triplets in the networks can be done in different ways, leading to different outcomes. For example, consider the following two alternative definitions of multiplicity for a resolved triplet  In summary, definition B seems somewhat simpler to compute than definition A, but it fails to distinguish between certain cases that definition A can handle.
To determine under what circumstances definition B is good enough in practice is an open problem and a future research topic.
Finally, Cardona et al. [33] gave an alternative generalization of the rooted triplet distance from trees to networks. While the extension proposed by Gambette and Huber [13] is closer to the definition of the widely studied rooted triplet distance for trees, efficient algorithms for Cardona et al.'s extension might also be useful. However, as pointed out in [13] and [33], neither one of them yields a metric for all classes of phylogenetic networks (see Corollary 1 in [13] and Figs. 19 and 20 in [33]), so another open problem is to find even more informative generalizations.