Information geometry for phylogenetic trees

We propose a new space of phylogenetic trees which we call wald space. The motivation is to develop a space suitable for statistical analysis of phylogenies, but with a geometry based on more biologically principled assumptions than existing spaces: in wald space, trees are close if they induce similar distributions on genetic sequence data. As a point set, wald space contains the previously developed Billera–Holmes–Vogtmann (BHV) tree space; it also contains disconnected forests, like the edge-product (EP) space but without certain singularities of the EP space. We investigate two related geometries on wald space. The first is the geometry of the Fisher information metric of character distributions induced by the two-state symmetric Markov substitution process on each tree. Infinitesimally, the metric is proportional to the Kullback–Leibler divergence, or equivalently, as we show, to any f-divergence. The second geometry is obtained analogously but using a related continuous-valued Gaussian process on each tree, and it can be viewed as the trace metric of the affine-invariant metric for covariance matrices. We derive a gradient descent algorithm to project from the ambient space of covariance matrices to wald space. For both geometries we derive computational methods to compute geodesics in polynomial time and show numerically that the two information geometries (discrete and continuous) are very similar. In particular, geodesics are approximated extrinsically. Comparison with the BHV geometry shows that our canonical and biologically motivated space is substantially different.


Introduction
Evolutionary relationships between species are represented by phylogenetic trees, in which the leaves represent present-day species, internal vertices represent speciation events, and edge lengths represent the degree of evolutionary divergence between species (Semple and Steel 2003). Evolutionary relationships are often subject to a high degree of uncertainty, and so it is natural to consider the space of all possible relationships and probability distributions on this space. Billera et al. (2001) were the first to construct a space of all phylogenetic trees on a fixed set of leaves. This space, known as Billera-Holmes-Vogtmann or BHV tree space, has a very rich geometry: in particular there is a unique geodesic, or shortest length path, between any two points in the space. BHV tree space is a so-called C AT (0) space (Billera et al. 2001), meaning it has globally non-positive curvature, and many of its attractive geometric properties follow from this condition. A polynomial time algorithm for computing geodesics and their lengths was subsequently developed (Owen and Provan 2011). A number of statistical methods for analysing samples of phylogenetic trees have been established, which rely fundamentally on the geometry of BHV tree space by transferring conventional multivariate statistical methods into the new geometrical context. Algorithms have been developed for computing sample means (Bačák 2014;Miller et al. 2015), for constructing confidence regions for the population mean (Willis 2019), and for performing principal component analysis (Nye 2011;Feragen et al. 2013;Nye 2014;Nye et al. 2017). An alternative geometry for phylogenetic trees, known as the tropical tree space (Speyer and Sturmfels 2004;, arises from regarding phylogenetic trees as distance matrices between the species at the leaves. Statistical methods such as calculation of sample means  and principal component analysis (Yoshida et al. 2019), have also been developed in tropical tree space.
In the BHV and tropical tree spaces, trees are regarded primarily as geometric or algebraic objects, without specific consideration to how phylogenetic trees are estimated or interpreted. Phylogenetic trees are typically inferred from genetic sequence data via Markov models of sequence evolution over the edges of the tree (Yang 2006), and we are only concerned with such trees. Each phylogenetic tree can therefore be regarded as a probability model for genetic sequence data, and a space of all tree-like probability models can be constructed. This idea was first considered by Kim (2000), and then developed more formally in subsequent papers (Moulton and Steel 2004;Gill et al. 2008). The space is known as the phylogenetic orange space or edge-product space. While the space has been studied from the viewpoint of algebraic geometry (Zwiernik and Smith 2012;Engström et al. 2013), metric geometry on the space has received little attention. Recently, methods for approximately computing 'probabilistic' metrics on the edge-product space have been developed (Garba et al. 2018). These metrics are defined by mapping each tree to its associated distribution on sequence data, and using a metric between these probability distributions. Specifically, each tree represents a distribution on characters, where a character is a map from the N leaves of the tree to some alphabet of letters Ω. The Hellinger and Jensen-Shannon metrics are defined between distributions on Ω N and are pulled back to give metrics between trees. Exact calculation of these metrics involves summation over all possible characters, and so when N is large, Garba et al. (2018) use a simulation procedure to estimate the distance between any pair of trees. The probabilistic metrics have substantially different properties than the BHV and tropical metrics. For example, if all the edge lengths in a pair of given trees are scaled up linearly, then the BHV and tropical distance between the trees both scale in the same way, while in contrast the probabilistic metrics eventually tend to zero. This is because the letters at the leaves of the trees become independent from one another as the edge lengths increase, due to genetic saturation. The distributions on characters represented by the two trees therefore converge to one another as the edge lengths are scaled up, and the distance tends to zero (see Fig. 2 in Garba et al. (2018)).
The metrics studied by Garba et al. (2018) arise from embedding tree space into the larger 'ambient' space of all distributions on characters. They are obtained from the lengths of 'chordal paths' in the ambient space which do not generally lie within the embedded tree space, and are hence called extrinsic metrics. In contrast, the BHV metric is an intrinsic metric, since it is obtained from the lengths of paths lying within tree space. For trees sharing a common branching pattern the BHV metric agrees with the corresponding extrinsic metric obtained via an embedding into Euclidean space. The statistical methods developed on BHV tree space rely heavily on the intrinsic nature of the metric, and this motivated us to seek intrinsic analogs of the probabilistic metrics.
The aim of this article is to realize intrinsic metrics and their associated geodesics in a new space of forests, the wald space, 1 that is related to the edge-product tree space (for the subtle, yet essential differences see the discussion in Sect. 6), when the underlying assumptions are similar to those for the probabilistic metrics. We assume that the infinitesimal distance between two trees is measured using the Fisher information matrix. We show that this is equivalent to assuming the infinitesimal squared distance is the Kullback-Leibler divergence, or equivalently, any f -divergence. Our approach uses ideas from information geometry, which is the study of Riemannian differential geometry on spaces of probability distributions. The purpose of developing this geometry on this space of forests is with the ultimate aim of obtaining statistical methods analogous to those on other tree spaces. The probabilistic metrics and the information geometry have an important advantage over the BHV and tropical geometries: they have by definition a direct biological interpretation in terms of the evolution of genetic sequences. In the information geometry, two trees are close when they determine similar distributions of characters, and as a result they would be potentially indistinguishable if inferred from experimental samples of sequence data. Conversely, trees are distant in the information geometry when they induce substantially different distributions. In contrast, the BHV and tropical metrics are defined more abstractly without reference to evolutionary models or processes. Examples of the biological interpretation of the probabilistic metrics were given by Garba et al. (2018).
Our approach has two main parts. First, we consider geodesics in the information geometry when the model associated with each phylogenetic tree is the two-state symmetric Markov process. This is the simplest discrete Markov model of sequence evolution, for which there are two letters in the alphabet, Ω = {0, 1}. This model is introduced in Sect. 2 along with a formal definition of the wald space and a brief review of BHV space. The thesis of Garba (2019) contains some comparisons of results obtained using the two-state model versus models with the DNA alphabet. Geodesics in wald space are constructed locally by numerically integrating a certain differential equation determined by the assumptions on the Riemannian metric. We explore geodesics on the space of unrooted trees with 5 leaves, for which visualization is relatively straightforward, and compare the results with those for BHV tree space. This forms Sect. 3 of the paper. Secondly, in order to improve computational tractability, we consider an alternative continuous-valued model of evolution on each tree. This consists of a Gaussian process which approximates the two-state Markov process by matching its moments. The continuous random variables at the leaves of the tree have a multivariate normal distribution with zero mean, for which the covariance matrix is related to the matrix of path lengths between the leaves. Numerically solving the differential equations for geodesics is much faster under this set of assumptions, and the geodesics closely resemble those for the two-state model. However, solutions are still restricted to trees sharing a common branching pattern, or topology. The definition of the Gaussian process on trees and numerical solution of geodesics in the corresponding information geometry are described in Sect. 4. The information geometry of multivariate normal distributions with zero mean corresponds to a certain geometry on the space of symmetric positive definite matrices, known as the Fisher-Rao or affine-invariant geometry, and the map from the wald space to covariance matrices is an isometric embedding in this space. The geometry on the space of symmetric positive definite matrices is analytically tractable, and geodesics can be computed in polynomial time. The embedding therefore gives intrinsic and extrinsic metrics on the wald space. We describe a projection algorithm from the space of symmetric positive definite matrices into the embedded wald space. We then use this algorithm to project geodesics in the ambient space down into wald space in various ways to obtain approximate geodesics between trees with different topologies. The embedding in the space of symmetric positive definite matrices and the associated geometry is described in Sect. 5. We conclude in Sect. 6 with a detailed discussion of the promises and challenges of our new wald space.

Phylogenetic trees
For N = 2, 3, . . . we define U N to be the set of unrooted phylogenetic trees on N taxa. More specifically, a tree T is an element of U N if it satisfies the following conditions. First, T contains exactly N vertices with degree 1, which are called leaves, and these are bijectively labelled 1, . . . , N . Secondly, T must contain no vertices with degree 2. Thirdly, each edge e in T is assigned a length e ≥ 0 with e = 0 if e contains a leaf. An edge in a tree is called a pendant edge if it contains a leaf; otherwise it is called an internal edge. Similarly, the vertices which are not leaves are called internal vertices.
The edge lengths e on any given tree T ∈ U N define a path length distance between any pair of leaves. The path length on T between u, v ∈ {1, . . . , N } will be denoted uv .
Each tree T ∈ U N contains at most 2N − 3 edges, in which case the tree is called fully resolved or bifurcating, and all internal vertices have degree 3. Trees with fewer edges are called unresolved, and for N > 3, these contain at least one vertex with degree 4 or more. Trees which contain only the N pendant edges joined at a single degree-N internal vertex are called star trees.
A tree T is rooted when some internal point ρ ∈ T is labelled as being the root. This is conveniently achieved by adding an additional taxon labelled 0 to the tree via a pendant edge of length zero. It follows that the set of rooted phylogenetic trees satisfies the same conditions as U N , except the leaves are bijectively labelled 0, 1, . . . , N , and the pendant edge containing taxon 0 has zero length. We will work with unrooted trees, but our results are easily transferred to the space of rooted trees via this relationship.
Every fully resolved tree will correspond to a fully resolved BHV-tree (reviewed in Sect. 2.2) and to a fully resolved wald, as introduced below in Sect. 2.4. In both BHV tree space and in wald space, unresolved trees will be identified with other trees with certain internal edges having zero length, so that conceptually a missing edge is the same as a zero length edge. Billera et al. (2001) defined a space of phylogenetic trees, subsequently known as BHV tree space, and described its geometry. BHV tree space can be described via an embedding in R d for dimension d which increases exponentially with the number of leaves. However, we have chosen to describe BHV tree space in a way different from the original authors, and we define it as a quotient space. As a result, the wald space introduced in the next section is a superset of BHV tree space when the spaces are regarded simply as sets, clarifying the relationship between the two spaces. Importantly, we allow internal edges on trees to have length zero, and under the quotient these are equivalent to trees with those edges missing. A second difference is that while Billera et al. (2001) worked with rooted trees, we work with unrooted trees. As described in Sect. 2.3, the distribution on binary characters determined by a tree does not depend on the root position under the two-state symmetric model, and so unrooted trees are more natural to work with.

Billera-Holmes-Vogtmann tree space
BHV tree space is defined using the notion of splits, where a split is a bipartition of the leaf labels 1, . . . , N into two disjoint sets. Cutting an edge of a tree induces such a bipartition of the leaves, and so each edge on a tree corresponds to a split, and the terms split and edge can be used interchangeably. The set of splits represented by a tree is called its topology.
Arbitrary sets of splits do not typically determine valid tree topologies: the splits of a tree must satisfy a compatibility condition. For example, the splits {1, 2}, {3, 4, . . . , N } and {1, 3}, {2, 4, . . . , N } are incompatible, since leaf 1 cannot be grouped next to both 2 and 3 on the same tree. For any topology τ with k internal edges, 0 ≤ k ≤ N − 3, the set of trees in U N with that topology is bijectively parametrized by R N >0 × O τ where the first term in the product parametrizes the pendant edge lengths that, by definition, are strictly positive, and O τ = R k ≥0 parametrizes the internal edge lengths. The set O τ is called the orthant associated with topology τ , and we identify the set of all trees with topology τ with R N >0 × O τ . Under this identification, the set of all trees U N , as defined in Sect. 2.1, is the disjoint union where the disjoint union is taken over all possible topologies τ . The unrooted BHV tree space U N is obtained by taking the quotient of U N with respect to an equivalence relation: Two trees in U N are equivalent under ∼ if and only if they are identical modulo the presence of internal splits with zero length, as shown in Fig. 1.
The quotient space factorizes as where the first term parametrizes the lengths of the pendant edges and the space BHV N parametrizes the topology and internal edge lengths of the BHV-trees. When τ is fully resolved, O τ is called a maximal orthant. Unresolved trees correspond to points on the boundaries of maximal orthants; they can be obtained from fully resolved trees by shrinking internal edge lengths down to zero. Since there are (2N −5)!! fully resolved unrooted topologies, BHV N can be thought of as being constructed by gluing this number of maximal orthants together along their boundaries, where two points are identified if they correspond to the same tree. For example, when N = 4, there are three fully resolved topologies, each of which contains a single internal edge. The space BHV 4 therefore consists of three copies of R ≥0 glued together at the origin. The origin corresponds to the star trees, while the location along When an internal edge from a fully resolved topology is contracted down to length zero (left to centre), there are two fully resolved topologies which can be obtained by expanding out an alternative edge (right). A, B, C, D represent subtrees. The operation of contracting an internal edge and expanding out an alternative edge is called nearest neighbour interchange. It follows that at each codimension-1 boundary, three maximal orthants are glued together each of the three copies of R ≥0 gives the length of the internal edge in each of the three possible fully resolved topologies. For N = 5 there are 15 possible unrooted tree topologies, each of which contains two internal edges. It follows that BHV 5 consists of 15 copies of R 2 ≥0 glued along their boundaries. At each codimension-1 boundary, three maximal orthants are joined together. This is because when a single internal edge is contracted to length zero, a degree 4 vertex is obtained, and there are 3 possible ways to add in an edge, including the original edge, in order to obtain a fully resolved topology, as illustrated by Fig. 2.
The metric on BHV N is constructed as follows. The basic idea is that for trees with the same fully-resolved topology but different vectors of internal edge lengths, say

The two-state symmetric Markov model
Genetic sequence evolution is typically modelled using discrete-valued continuoustime Markov processes defined over the edges of a tree T (Yang 2006;Bryant et al. 2005). DNA sequence evolution is modelled by associating to each point t ∈ T , a random variable X (t) which takes values in an alphabet {A, C, G, T }. In this paper, however, we will consider the two-state symmetric Markov process with alphabet Ω = {0, 1}. This simplification is made in order to make the mathematics more tractable and for computational speed. Nonetheless, some of the calculations using the two-state symmetric can readily be performed using DNA models. More details are given in the thesis of Garba (2019) in which simulations show similarity of geometries obtained from the two-and the four-state process. The transition probability of the symmetric two-state model is defined in terms of the path length t 1 t 2 between any two points t 1 , t 2 ∈ T : (2.1) The stationary distribution of this Markov process is Bern(1/2), and the process is assumed to be in its stationary state over the tree. As a result, for all t ∈ T , X (t) has a marginal Bernoulli distribution, X (t) ∼ Bern(1/2). While the random variables X 1 , . . . , X N at the leaves of the tree have the same marginal distributions, they are not independent since the tree imposes a dependence structure. The following lemma determines certain moments of the process X (t) giving insight on the dependence structure of X 1 , . . . , X N . The proof is straightforward using the transition probabilities in Eq. (2.1).
Lemma 2.1 1. If X 1 , . . . , X N are the random variables at the leaves of a tree T ∈ U N determined by the discrete Markov process defined above, then Cov (X u , X v ) = 1 4 exp(− uv ) where uv is the path length between leaves u and v. 2. If t 1 , t 2 ∈ T are path length t 1 t 2 apart, then the conditional distribution of X (t 2 ) given X (t 1 ) = ω ∈ {0, 1} has variance 1 4 1 − exp(−2 t 1 t 2 ) . It is straightforward to simulate realizations of X (t) in the following way. First simulate a Poisson process with rate 1 independently on each edge of the tree. The positions of the simulated events correspond to points at which X (t) changes parity. Secondly, pick any point t 0 ∈ T which is not a change point and sample X (t 0 ) from Bern(1/2). The change points generated from the Poisson process then determine the value of X (t) for all other t ∈ T . The distribution obtained is independent of the choice of t 0 , because the Markov process is reversible. In particular, the Markov process is independent of the choice of t 0 , which could be considered as a root.
Under the model, each edge length can be interpreted as the expected number of change points that occur over the edge. Internal edges are allowed to have length zero, which means that no change in X (t) occurs over the edge. On the other hand, when edges are long, the number of changes is likely to be large, and the letters at either end of the edge are weakly correlated. Biologists refer to this effect as saturation. A fixed change of edge length δ therefore has more effect on the distribution of characters when applied to a short edge as opposed to a long edge in some given tree. For example, an increase of δ = 0.1 to an edge with length = 0.1 approximately doubles the probability that the letters at either end of the edge are different, but the same change to an edge of length = 10 has almost no effect on this probability, which due to saturation is very close to 1/2. This idea becomes important when we consider defining distances between trees via the information they represent, in particular using the probability mass function of the nontrivial distribution of (X 1 , . . . , X N ).
Remark 2.1 1. The probability mass function of (X 1 , . . . , X N ) determined by T is denoted p T (s) where s ∈ {0, 1} N is called a binary character. Given any binary character s, the values of p T (s) can be evaluated via a recursive algorithm (Bryant et al. 2005), described in Appendix A. Appendix A also contains a modified form of the algorithm which is used to compute exactly the derivatives of p T (s) with respect to the edge lengths. 2. Also, as shown in Appendix A, p T (s) is a multivariate polynomial in 1 + e k and 1 − e k where k ranges over the edge lengths in T .
3. Crucially, the map T → p T from fully resolved trees in U N to the space of probability mass functions on {0, 1} N is injective up to the equivalence relation introduced in Sect. 2.2 (Rogers 1997;Allman et al. 2008). This has two implications: first that the probability mass function p T uniquely characterizes each element of U N , and secondly that metrics on distributions of characters pull back to define a metric between fully resolved trees, as described in Sect. 3.2.

A new forest space: the wald space
The following wald space gives an alternative viewpoint of phylogenetic trees by regarding them as Markov models for sequence evolution (Kim 2000;Moulton and Steel 2004;Gill et al. 2008). We will give a description of the wald space that is related to the edge-product space of previous authors, by defining it as a quotient space which adds trees with infinitely long edges to BHV tree space. As described in the introduction, each probabilistic metric considered by Garba et al. (2018) converges to zero in the limit as all the edge lengths in a pair of trees simultaneously tend to infinity. This behaviour indicates that shortest paths might cross a tree with infinitely long edges, which is why we add this boundary at infinity. Furthermore, we allow pendant edges of length zero under certain conditions. Thus, in the wald space, edge lengths e take values in R ≥0 ∪ {∞}. It is convenient to reparametrize to the λparametrization by defining weights λ e = 1 − exp(− e ), so λ e ∈ [0, 1]. Under this transformation, BHV N becomes a set of unit cubes, rather than orthants, glued along faces where λ e = 0 for one or more edges. The wald space is defined by imposing additional gluing rules on faces where λ e = 1. In order to be able to identify trees with infinitely long edges along faces where λ e = 1, we construct the wald space from forests, that is, disjoint unions of unrooted trees. We start with a preliminary space leading to the definition of the wald space further below. Let W N be the collection of forests satisfying the following necessary and sufficient conditions for each F ∈ W N .
1. The forest F contains exactly N labelled vertices, these are called leaves and labelled 1, . . . , N . 2. There are no unlabelled vertices in F of degree 0, 1 and 2. 3. For every pair of leaves u, v in the same tree in F, at least one edge e on the unique path from u to v satisfies λ e > 0.

(a)
The pendant edge with weight 1 is removed, and the resulting edges between A and B are replaced by a single edge with weight λ AB . The term F in both panels refers to other disconnected components in the forests Clearly, U N ⊂ W N . The condition on the edge weights ensures that no pair of leaves is coincident and consequently that metrics are always well-defined, as described in Sect. 3.2. We impose an equivalence relation ∼ on W N , defined by the following two rules.
BHV boundary rule: Given F 1 , F 2 ∈ W N , suppose all internal edges with λ e = 0 are removed from the forests, and the vertices at either end of each such edge are merged. If the resulting forests are identical, then F 1 ∼ F 2 . The rule is the same as that in Fig. 1.
Boundary at infinity rule: Suppose F ∈ W N contains an edge with λ e = 1, and that F is modified as follows. The edge with λ e = 1 is removed, disconnecting the tree it belongs to. If this results in any unlabelled vertex having degree 2, then those vertices are removed. If v is such a vertex, and the two adjacent edges e,ẽ have weights λ e , λẽ ∈ [0, 1], then e,ẽ are replaced by a single edge with weight λ e + λẽ − λ e λẽ, as is further explained below. Now suppose F 1 , F 2 ∈ W N , and this process of modifying unit-weight edges is applied to both forests. Then F 1 ∼ F 2 if the resultant forests are identical, as illustrated in Fig. 3.
The wald space W N is defined to be the quotient W N ∼ and it immediately follows that as sets U N ⊂ W N , but the geometry imposed on W N will be completely different from the geometry of the BHV space.
The boundary rule at infinity requires some explanation. The rule declares that edges of weight λ e = 1 (or equivalently length e = ∞) can be deleted from a forest F, but unlike the BHV rule for which the vertices at the ends of the edge are merged, edge removal disconnects a tree in F. When resulting degree-2 vertices are removed, the edge length is preserved so that the new edge has length e + ẽ . The corresponding weight λ is given by λ = 1 − exp(−( e + ẽ )) = λ e + λẽ − λ e λẽ. Unlike the BHV boundary rule, in which finitely many trees are identified in each equivalence class, infinitely many combinations of edge weights λ e , λẽ give rise to the same value λ e + λẽ − λ e λẽ. It follows that an uncountable collection of forests can be identified into a single equivalence class in W N .
In the edge-product space (Moulton and Steel 2004;Gill et al. 2008), an alternative parametrization is used, defining μ e = 1 − λ e = exp(− e ) to be the weight of edge e. This parametrization has the advantage that sums of edge lengths 1 + · · · + m become products of edge weights μ 1 × · · · × μ m (hence the name 'edge-product'). The boundary at infinity rule is simpler under this parametrization: the weights in Fig. 3 panel (b) become μ e , μẽ and 0 on the left and μ e μẽ on the right. However, under the μ-parametrization, the BHV boundary with e = 0 lies on faces of cubes with μ e = 1, whereas the boundary at infinity has μ e = 0. We prefer to work with the λ-parametrization since it gives a more intuitive interpretation of the weights, i.e. e = 0 corresponds to λ e = 0 and e = ∞ corresponds to λ e = 1. Forests which contain more than one connected component lie in the faces of cubes with at least one λ e = 1. Since the pendant edges can be expanded out to infinite length, they are also subject to the boundary at infinity rule, and so the representation of pendant edge lengths in W N is not via a product geometry, as it is for BHV tree space. While the star trees correspond to all internal edges having zero length, W N also contains a point which consists of N isolated vertices.
The BHV boundaries enable tree topologies to be changed via nearest neighbour interchange (NNI) operations (as illustrated by Fig. 2). The boundary at infinity corresponds to a different topological operation, called tree bisection and reconnection (TBR) (Allen and Steel 2001). Under this operation, an edge e in a tree can be expanded up to the boundary λ e = 1. Removing the edge bisects the tree, and the two components can be reconnected by an edgeẽ with λẽ = 1 placed arbitrarily between the two trees. Reducing the weight λẽ down from 1 then gives a tree with a topology different from the original tree. It follows that there exist continuous paths in the wald space between trees with different topologies, which pass through the boundary at infinity and, as a result, change tree topology via TBR operations. This is in contrast to BHV tree space in which paths between trees of different topologies involve only NNI operations, as edges are contracted down to length zero and alternative edges are expanded out.
While the set W N was defined above via an equivalence relation on forests, we also need to understand how it parametrizes Markov models and then characterise its elements again as probability mass functions on {0, 1} N . The two-state symmetric Markov process extends from being defined on trees to forests by taking the process on each connected component in a forest to be independent of the other components. This defines a distribution p F on {0, 1} N for each F ∈ W N . In fact the distribution uniquely determines the equivalence class of F, and vice versa, as the following lemma shows.
A proof is given in the Appendix.
Note that the forest consisting of N isolated vertices corresponds to the random variables X 1 , . . . , X N being independent, and this can be obtained from any tree by expanding all edges one after the other, as, by definition, there is at least one edge between any two leaves.

Information geometry for the two-state symmetric model
Information geometry provides methods for constructing metrics and geodesics on parametrized sets of probability distributions. In this section we embed wald space in the space of distributions of two-state characters, and investigate the corresponding information geometry analytically and computationally.

Geometry of embeddings
is a metric space and θ is injective. We will say that X is embedded in Y , and refer to Y as the ambient space. The embedding can be used to construct certain metrics on X . First, since θ is injective, d pulls back to define a metric on X which we denote d X : for all x 1 , x 2 ∈ X . The pullback metric is often called the induced extrinsic metric and, it is simply the restriction of d to X ⊆ Y , and so when the context is clear, it is also denoted d. The probabilistic metrics described in Sect. 3.2 are constructed in this way. A second metric, called the induced intrinsic metric and denoted d * (x 1 , x 2 ), is defined as the infimum of the length of all possible paths in X between x 1 , x 2 ∈ X ⊆ Y when path length is measured using the metric d. If no path with finite length exists between x 1 and x 2 then d * (x 1 , x 2 ) = ∞, in which case d * is not a metric. Details of this construction of the induced intrinsic metric are given by Bridson and Haefliger (2011). The metrics d and d * , if well-defined, give X the structure of a length space, which is a space in which the metric between points x 1 , x 2 is the infimum of the lengths of paths between those points. Length spaces are similar to geodesic metric spaces, except that the infimum is not necessarily achieved by a path lying within the space; in a geodesic metric space a minimum length path exists between every pair of points, and so every geodesic metric space is a length space. An example of a length space which is not a geodesic metric space is R 2 with the origin removed and the Euclidean metric. Points antipodal to the origin cannot be joined by a geodesic, but the distance between them is the infimum of the lengths of paths joining the points.
In order to illustrate the relationship between d and d * , consider the example of the embedding of the unit sphere X = S 2 in Y = R 3 equipped with the Euclidean metric d. For any two points x 1 , x 2 , d(x 1 , x 2 ) is the length of the straight line segment in the ambient space R 3 joining the points. This metric is usually called the chordal metric on S 2 . However, when we consider paths between x 1 , x 2 which are restricted to lie in S 2 , the shortest paths (with respect to d) are great circles, and the induced metric d * is the arc length metric. In fact, S 2 is a geodesic metric space, since the infimum of path length is always achieved by a great circle.
In the following Sect. 3.2, the wald space W N will be embedded in the space of distributions of characters. Later in Sect. 4 it will be embedded in the space of N × N symmetric positive definite matrices. Each embedding will be used to construct metrics on W N .

Probabilistic metrics
Here we briefly describe the probabilistic metrics developed by Garba et al. (2018) since these will be used for comparison with other metrics. The Kullback-Leibler divergence is a commonly used measure of the difference between two distributions. Given two probability mass functions p, q on characters {0, 1} N , the Kullback-Leibler divergence from q to p is defined as provided p(s) = 0 only when q(s) = 0. The Kullback-Leibler divergence is not a metric since it is not symmetric. However, metrics can be defined as follows: the Jensen-Shannon metric d J S is defined by and the Hellinger metric d H is defined by Recently, probabilistic metrics have been developed which are based on distributions of gene trees instead of distributions of characters (Adams and Castoe 2020). The Kullback-Leibler divergence, squared Jensen-Shannon metric and squared Hellinger metric are all examples of a more general class of distances between probability distributions known as f -divergences. Given any convex function The squared Jensen-Shannon metric and squared Hellinger metric can also be obtained by using more complicated functions f , cf. Sason and Verdu (2016). Now, let F ∈ W N be a forest representative of a wald [F] ∈ W N . As described in Sect. 2.4, the distributions at leaves of different trees of F are independent. For two leaves in the same tree in F, some degree of evolution occurs between them since by definition of W N no two leaves are coincident. Therefore, all characters are possible, giving It follows that the Kullback-Leibler divergence is always well-defined between distributions of characters corresponding to forest representatives from the wald space. Since by Lemma 2.2 the map [F] → p F is injective for [F] ∈ W N the Jensen-Shannon and Hellinger metrics pull back to define extrinsic metrics on the wald space W N (analogously to (Garba et al. 2018)). As already mentioned in the introduction, statistical methods rely heavily on the intrinsic nature of metrics, and thus we aim for more geometrical structure in the next section by imposing the Fisher information metric (a Riemannian metric) onto the wald space.

A two-state process geometry for the wald space
BHV tree space U N and wald space W N both do not have the structure of a manifold globally, but the interior of each maximal orthant is a manifold parametrized by or λ. Therefore we consider first the information geometry on the subspaces of wald space corresponding to a fixed fully resolved tree topology-here every wald has only one single tree representative, since every wald corresponding to a forest with more than one component, as well as a wald containing a pendant edge with length zero, lies on the boundary of unit cubes corresponding to fully resolved tree topologies. Secondly, we establish global results about the constructed geometry of W N .
Thus suppose τ is a fully resolved tree topology, and that trees with this topology are parametrized by = ( 1 , . . . , 2N −3 ) ∈ R N >0 × O τ . Let p (s) be the probability mass function p T (s) associated with tree T determined by τ, . Recalling that p (s) > 0 for all s, due to (3.2), the Fisher information matrix at is This defines a Riemannian inner product on the tangent space of R N >0 × O τ at (that is a copy of R 2N −3 ) which gives a way to measure the lengths of paths. Specifically, if p and p +δ lie infinitesimally close on a path, then the squared path length between them is defined to be i, j δ i g i j ( )δ j . Standard results from Riemannian differential geometry show that if (t) is a path in R N >0 × O τ then it is locally a geodesic (i.e. it minimizes path length) if it satisfies the differential equation The matrix g i j is the inverse of g i j i.e. k g ik g k j = δ i j where δ i j is the Kronecker delta. It is important to note that the geodesic equation and loci of solutions are invariant under changes of parametrization, and so the equations can be formulated using lengths i or the weights λ i . On the boundary, however, this is no longer necessarily true. The Riemannian metric defined by the Fisher information matrix is related to the Kullback-Leibler divergence and other f -divergences by the following lemma.
where the error term consists of third-order products of the elements of δ and In other words, the norm of the perturbation, as measured with respect to the Riemannian inner product, is proportional to the f -divergence of p +δ from p .
The proof is given in the Appendix. Since the lemma applies to an arbitrary fdivergence, the term on the right-hand side of Eq. (3.5) can be the Kullback-Leibler divergence or the squared Jensen-Shannon metric, for example. Lemma 3.1 gives the fundamental assumption behind the geometries we construct on W N : that distances are locally measured by the infinitesimal Kullback-Leibler divergence between probability distributions associated with trees, or equivalently, by any f -divergence. As a corollary of Lemma 3.1, it follows that the metric defined by Eq. (3.3) is positive definite (i.e. the metric is not semi-Riemannian). This is because the map from trees to distributions of characters is injective, and since D f ( p; q) > 0 for all p = q, it follows that the right-hand side of Eq.

Any path which realizes the distance d * ([F], [G]) is a solution of Eq. (3.4) at any point in the interior of a maximal orthant.
The proof is given in the Appendix. Valid choices for the metric d in Theorem 3.1 include the Jensen-Shannon metric or Hellinger metric. Theorem 3.1 establishes W N with metric d * as a length space. Finiteness of d * shows, for example, that points in W N corresponding to disconnected forests (or equivalently, trees with infinite edge lengths) are at a finite distance away from orthant interiors. The second assertion implies a scaling of the induced intrinsic metric under changes of the function f , which, in turn, substantiates our conclusions drawn from Lemma 3.1 that the geometry of W N induced by d is invariant under the choice of f .

Numerical investigation of the geometry
The geodesic equation (3.4) can be solved numerically on the interior of any maximal orthant given some initial conditions (0) = 0 and d (0)/dt = v 0 . As described in Sect. 2.3, the first and second derivatives of p (s) with respect to the edge lengths i can be computed analytically. Calculation of g i j ( ) consists of a sum over all 2 N possible characters involving the derivatives of p (s). The Christoffel symbols can similarly be calculated as sums over characters. A fourth-order Runge-Kutta method was used to integrate the ODEs.
This numerical scheme was used to construct and visualize geodesics on a single orthant in W 5 . The particular topology and edge lengths for the orthant are represented by the Newick string ((1 : 1 , 2 : 2 ) : 6 , 3 : 3 , (4 : 4 , 5 : 5 ) : 7 ). Parameters 1 , . . . , 5 are the pendant edge lengths and 6 , 7 the lengths of the two internal edges. We restricted to N = 5 leaves in order to enable easy visualization of geodesics. Integration was stopped whenever any internal edge was assigned a length ≤ 0, corresponding to the boundary of the orthant. If this occurred for a pendant edge, the pendant edge length was given value zero at that step, and integration was continued. Figure 4 shows typical results. The figure shows the orthant representing the two internal edge lengths, with geodesics 'fired' from some fixed starting point 0 in 24 different directions. Also marked on the plots are contours of distance from the starting point. Each panel shows results for a different initial tree 0 . It is evident that the geodesics are not the same as BHV geodesics, which are straight lines radiating from the initial point with equally spaced circular contours of distance. Figure 4 shows curved geodesics with irregularly spaced contours of distance. Contours appear to stack up towards the origin and codimension-1 BHV boundaries, but are more spaced out as geodesics move out towards the boundaries at infinity. This is more obvious when initial internal edges are long (top row of Fig. 4). On the other hand, the geodesics are more similar to the BHV geodesics when internal edges are short and pendant edges are long, as in the bottom two rows of the right hand column. In all cases, in contrast to BHV tree space, geodesics seem to be slightly attracted toward the star trees. The pendant edges do not behave as they do in BHV tree space: they can change value even when their initial velocity is zero. Figure 5 provides more detail for certain geodesics in Fig. 4. The graphs in the figure show each edge weight λ i versus time, and the time is proportional to distance travelled. The λ-parametrization was used for these plots since it shows the results most clearly. Panel (a) shows that, when contours become more and more widely spaced, the boundary at infinity (λ 6 = λ 7 = 1) can be reached in finite time, rather than asymptotically. This shows that points corresponding to trees with infinitely long edges are a finite distance away from the starting point, as established by Theorem 3.1. In the -parametrization, the internal edge lengths rapidly blow up to infinity after time t = 0.8. It follows that the shortest path between two trees with finite edge lengths might involve trees with infinitely long edges. On the other hand, for some panels in Fig. 4, the contours become increasingly close as BHV boundaries are approached, but as panel (b) in Fig. 5 shows, points on BHV boundaries are in fact finitely close to orthant interiors, since the boundary is reached in finite time. Figure 6 replots the top left and middle left panels from Fig. 4 using the λparametrization, so that the boundary at infinity corresponds to edges of the unit square with weight 1. These plots suggest that trees in which one of the two internal edges is finitely long are 'repellant' since the geodesics fired in the North and East compass directions end up passing through the disconnected forest with λ 6 = λ 7 = 1. Indeed, the points on the boundary with λ 6 = λ 7 = 1 appear to be 'attractive', with geodesics pulled round to pass through these points and arriving in finite time.
While these results show how the information geometry on W N differs substantially from the BHV geometry, the method for constructing geodesics by integrating the geodesic differential equation suffers from several disadvantages. First, it requires summation over the elements of {0, 1} N which makes it infeasible for large N (exponential computation time in N ). Secondly, only local geodesics are computed by solving the initial value problem (geodesic "shooting" or "firing") for the differential equation valid only in maximal orthants. Thirdly, for practical applications, an algorithm which takes two points in wald space and joins them by a globally shortest path is more useful, and so it is desirable to solve the boundary value problem for geodesic construction. The next section attempts to deal with some of these shortcomings.

Information geometry for a Gaussian process on trees
In this section we develop the information geometry of a continuous-valued Markov process associated to each tree which is more computationally and analytically tractable than the information geometry for the symmetric two-state Markov process. It has the advantage that the geodesic equation (3.4) can be solved numerically much faster than the corresponding equation for the symmetric two-state model, but the solutions for the two models are very similar.

Definition of the Gaussian process
Our aim is to construct a Gaussian process which is a continuous-valued analogue of the symmetric two-state process by matching the moments specified in Lemma 2.1. Consider the Ornstein-Uhlenbeck process Z (t) on T which satisfies where t 1 t 2 is the path length distance between t 1 and t 2 on T . The stationary distribution is the standard normal distribution N (0, 1), and we assume the process is stationary over T . The Markov process satisfies the detailed balance equation, and so is reversible in its stationary state. As a result, realizations of the process can be simulated by fixing an arbitrary root t 0 ∈ T for the tree, simulating Z (t 0 ) from N (0, 1) and then using Eq. (4.1) to simulate a realization Z (t) for any other t ∈ T . Reversibility of the process ensures the distribution obtained is invariant of the choice of root. A short calculation shows that the covariance matrix of the random variables Z 1 , . . . , Z N at the leaves of T is given by Cov (Z u , Z v ) = exp(− uv ). Since uu = 0 for all u = 1, . . . , N , this gives Var (Z u ) = 1. Similarly, the conditional distribution of Z (t 2 ) given Z (t 1 ) = z has variance 1 − exp(−2 t 1 t 2 ). These moments match those in Lemma 2.1 up to the constant factor of 1/4, and we will show later that this factor makes no difference to the geometry obtained. The process Z (t) can be thought of in two ways. First, it approximates the binomial random variables obtained when many independent binary characters evolve under the symmetric two-state process. Secondly, it could be regarded as an evolutionary model of a continuous trait for which the observations are standardized to have zero mean and unit variance in each population. Mean-reverting Gaussian processes like this are sometimes used to model continuous traits for which there is some constant optimal value for survival (Hansen and Martins 1996). The definition of Z (t) extends from trees to forests by taking the process to be independent on each connected component. Given a forest F ∈ W N , the distribution of the random variables Z 1 , . . . , Z N at the leaves is multivariate normal with zero mean and covariance matrix S F where (4.2) The path distance uv is the sum of the lengths e on edges e between u and v, and is taken to be infinite when u and v are in different components of F. This defines a map F → φ F from forests to multivariate normal distributions where φ F is the probability density function of N (0, S F ). A similar result to Lemma 2.2 holds: whenever F 1 ∼ F 2 , the distributions φ F 1 and φ F 2 are the same, so the map is well-defined on W N .

A Gaussian process geometry for the wald space
The information geometry of multivariate normal distributions with zero mean has been studied previously (Lenglet et al. 2006) and is analytically tractable. The theory described in Sect. 3 needs adapting to account for the change from discrete to continuous characters. The Fisher information metric of W N in Eq. (3.3) becomes an integral over R N rather than a sum, and the mass function p is replaced with the density function φ of the Gaussian corresponding to a fully resolved tree with edge lengths : The geodesic equation (3.4) remains the same. Although p T (s) and its derivatives could be evaluated exactly for the two-state symmetric model, evaluation of the metric and Christoffel symbols required a sum over all characters. For the continuous model, the corresponding integrals have closed form, as we describe below. Gaussian distributions with zero mean are parametrized by their covariance matrices, namely by the space of N × N symmetric positive definite matrices, which we will denote S + N . The set of covariance matrices associated with forests forms a subset within S + N , as determined by the following theorem. Theorem 4.1 1. The covariance matrix S F associated to any F ∈ W N , as defined by Eq. (4.2), is positive definite so lies in S + N , and 2. the map [F] → S F from W N to S + N is injective and so it determines a well-defined embedding of W N into S + N . The proof is given in the Appendix. For Gaussians with zero mean, it can be shown that the Fisher information metric at a point with covariance matrix S is where X , Y are matrices in the tangent space at S, i.e. symmetric matrices (Lenglet et al. 2006). This expression is obtained by evaluating the integral in Eq. (4.5) An entry above is zero if u and v are in different connected components, or equivalently, if they are separated by an infinitely long edge. By differentiating Eq. (4.5), it can be seen that the tangent space at S is spanned by matrices of the form σ e • S for each edge e, where • denotes the Hadamard matrix product. The Fisher information metric (4.3) for W N for a fully resolved tree becomes where i, j = 1, . . . , 2N −3 index edges. Algebraic expressions for the first and second derivatives of the Fisher information metric can similarly be obtained, and hence for the Christoffel symbols. Note that scaling S by some positive constant has no effect on the metric, and so the factor of 1/4 difference between the covariance matrices obtained from the discrete process X (t) and continuous process Z (t) has no effect on the geometry. The inner product and its derivatives can be computed in polynomial time and the paths obtained by integrating the geodesic ODE for the continuous Markov model within orthant interiors in W N resemble those for the two-state model very closely. Namely, results for the same initial conditions as Fig. 4 were obtained but omitted, since the plots were almost indistinguishable from those for the two-state model. However, the inner products defined using the two different models are not identical: both inner products can be written down explicitly in the case N = 2, using the transition probabilities for the discrete model and Eq. (4.6) for the continuous model. The two inner products differ when the length of the single edge in the tree is small, but converge as the edge length tends to infinity.
Using Eq. (4.6) and its derivatives, we derived algebraic expressions for the Riemannian curvature tensor and the sectional curvatures at any point in W N . We implemented these expressions in R, and hence evaluated these quantities for certain trees in W 5 . We found that at randomly selected points in W 5 , and hence in all spaces with N ≥ 5, the sectional curvatures had mixed signs. As a result, there is no global sign condition on curvature like that for BHV tree space, which is globally non-positively curved.

N
The information geometry on Gaussians with zero mean can equivalently be regarded as a geometry for the space of N × N symmetric positive definite matrices S + N . This is a useful viewpoint to adopt, first because it highlights the fact that the geometry we develop on W N is based entirely on the covariance between the leaves induced by the Markov process Z (t), and secondly, because other metrics on S + N have been studied (Dryden et al. 2009). These alternative metrics on S + N could in turn define different metrics on W N , although they will not be considered any further in this paper. The metric on S + N obtained from the information geometry of Gaussian distributions with zero mean will be denoted d cov . The metric and its associated geodesics in S + N can be computed in polynomial time (Lenglet et al. 2006). The main idea in this section is to use the analytically tractable geometry in S + N , combined with a projection algorithm from the ambient space S + N to the embedded space W N , to construct approximate geodesics in W N .
Given the embedding [F] → S F of W N within S + N , we can consider the intrinsic metric d * cov on W N induced by d cov . By construction, the induced metric corresponds to the information geometry on W N for the continuous Markov model considered in Sect. 4. The following theorem is analogous to Theorem 3.1 for the discrete Markov substitution models. A proof is given in the appendix. Eq. (4.5). Lenglet et al. (2006) give formulae for the distance and geodesics between pairs of points in S + N . The distance between S 1 , S 2 ∈ S + N is defined by

Theorem 5.1 For any [F], [G] ∈ W N the induced intrinsic metric d * cov ([F], [G]) is finite and therefore well-defined. Any path which realizes the distance d * cov ([F], [G]) is a solution of Eq. (3.4) at any point in the interior of a maximal orthant, where the Riemannian inner product is given by
where log denotes the matrix logarithm. Since the map from W N to S + N is injective, d cov pulls back to define an extrinsic metric on W N : In fact, the space S + N equipped with d cov has globally non-positive curvature (Skovgaard 1984;Ballmann et al. 1985) and so there is a unique geodesic between any two points in the ambient space. The point at proportion t ∈ [0, 1] along the geodesic between S 1 , S 2 ∈ S + N is Equations (5.1) to (5.3) involve eigen-decompositions of N × N matrices, and so can be computed in O(N 4 ) steps. Figure 7 shows a comparison of BHV metric d BHV , tropical metric, path distance metric, Jensen-Shannon metric d J S and d cov for every pair of trees in a sample of 100 trees obtained by bootstrap replication during maximum likelihood inference of a phylogenetic tree. The trees were inferred using the MrBayes software (Huelsenbeck and Ronquist 2001), and a sample data set of DNA from 12 primates provided with the software. The path difference metric (Steel and Penny 1993) between trees T , T is ( u,v ( uv − uv ) 2 ) 1/2 . The Jensen-Shannon metric was calculated exactly by summing over all 2 12 characters for the two-state model, as described in Sect. 3.2. The covariance metric was calculated using Eqs. (5.1) and (5.2). The BHV and tropical metrics are quite closely correlated, as are the Jenson-Shannon and covariance matrices. The path difference metric has a similar correlation with the BHV, tropical and Jensen-Shannon metrics (approximately 0.7-0.8), but is relatively weakly correlated with d cov . This suggests that the BHV and tropical metrics are based on features of the data which are rather distinct from those for the Jensen-Shannon and covariance metrics. The covariance metric has the advantage over the Jensen-Shannon metric of being computable in polynomial time.

Projection into wald space
To approximate geodesics in the extrinsic covariance metric d cov of W N , we construct a projection from S + N onto W N , that is, given S 0 ∈ S + N , we aim to find an element [F] ∈ W N which minimizes d cov (S 0 , S F ). Suppose that F is a fully resolved tree with edge lengths . The expression for d cov (S 0 , S F ) 2 can be differentiated with respect to edge lengths of F and gives Moakher (2005)). Moreover, ∂ i S F = S F • σ i where • denotes the Hadamard or element-wise matrix product and σ i is the split matrix associated with edge i, as defined in Eq. (4.4). This analytic expression for the derivative can be used to implement a gradient descent algorithm. Within each maximal orthant O τ the edge lengths were updated according to the rule where k denotes the edge lengths at iteration k and S k the corresponding covariance matrix. The step size α k was determined using the Barzilai-Borwein method. Two versions of the algorithm were used. In the first, the algorithm was halted whenever an internal edge was assigned a negative length. As a result, the algorithm was constrained to lie within the orthant O τ containing the initial tree. This algorithm was used for N = 5 by running the algorithm 15 times, each time with an initial tree in one of the 15 maximal orthants of U 5 , and the overall tree closest to S 0 found. The second version of the algorithm was able to cross codimension-1 BHV boundaries as follows. If an edge length was assigned a negative value, then trees in the two corresponding neighbouring orthants to τ were considered, taking the absolute value of elements in k+1 as edge lengths. The tree at step k + 1 was taken to be whichever of these two trees was closest to S 0 .
In general, the closest point in W N to a covariance matrix S 0 ∈ S + N is not necessarily unique as illustrated in Fig. 8. Moreover, the gradient descent algorithm can converge to local minima, and so the result obtained is sensitive to the tree used to initialize the algorithm.

Construction of geodesics in W N via projection of extrinsic geodesics
Since construction of geodesics in S + N between any two given points and projection from S + N into W N can both be performed efficiently, we aim to combine these algorithms to give an efficient means of constructing geodesics within W N between any two given end points. A naive approach, given [F 1 ], [F 2 ] ∈ W N , is to simply project the extrinsic geodesic between S F 1 and S F 2 from S + N into W N . This approach works for the example of the unit sphere S 2 embedded within R 3 : the projection of the chord between two points in the sphere is a great circle between those two points. However, this approach fails for W N ⊆ S + N since the projected paths are often discontinuous and jump between different orthants, as illustrated in Fig. 8.
The following recursive algorithm for constructing an approximate geo-desic in W N is therefore proposed, which is intended to overcome this issue. Let t i = i/k for i = 0, . . . , k and suppose [F 1 ], [F 2 ] ∈ W N . The algorithm outputs a sequence The idea is that at each iteration, a new extrinsic geodesic is constructed from the previous point [G i−1 ] to the destination [F 2 ], a small step is taken along that geodesic, and that point is projected into W N to give [G i ]. For the results in this paper, the projection at Step 3 was performed using the second version of the projection algorithm described in Sect. 5.1, rather than using the less efficient search over all orthants. The gradient descent for the projection to obtain [G i ] at Step 3 was initialized using the edge lengths from the forest [G i−1 ].
This algorithm has the disadvantage that it is not symmetric under swapping the end points [ The quality of the approximate geodesics produced by the symmetrized algorithm can be assessed by comparing them to geodesics in a single orthant constructed by integrating the geodesic equation as described in Sect. 4.2. Given an initial tree [F 1 ] ∈ W 5 and an initial velocity for , the geodesic equation was integrated until the path obtained reached some specified length. The final point reached was taken to be [F 2 ], and the symmetrized algorithm was then used to obtain an approximate geodesic between [F 1 ] and [F 2 ]. In all cases, the paths obtained with the two methods matched Fig. 10 Comparison of approximate geodesics in W 5 constructed beween trees F 1 and F 2 from (5.4) in neighbouring orthants. The vertical axis 7 represents a codimension-1 BHV boundary between two orthants. When, due to a nearest neighbor interchange, crossing it, 6 tends to zero, another edge appears, with negative length corresponding to the negative values on the 6 axis. Three approximate geodesics are shown: (i) construction via the recursive algorithm from F 1 and F 2 , (ii) using the same algorithm but reversing the end-points, and (iii) construction via the symmetrized algorithm very closely, with the quality of the match improving for shorter internal edge lengths. Figure 9 shows typical results.
In contrast to the methods presented in Sects. 3 and 4, these algorithms are not based on 'firing' geodesics, and can produce approximate geodesics between end points in different orthants. Figures 10 and 11 show results obtained when the end points are separated by 1 or 2 nearest neighbour interchange operations respectively in W 5 . More precisely, we consider trees corresponding to Newick strings In both figures, the approximate geodesics constructed using the recursive algorithm are not symmetric under interchange of the end points, and differ from the paths obtained using the symmetric algorithm. The lengths of the paths can be computed by using large k and summing d cov between successive points in the output. In all the examples we explored, the symmetrized algorithm produced shorter paths. The approximate geodesics shown in the figures are significantly different from BHV geodesics, which consist of straight (or once broken) lines between the end points in both plots.
We apply the symmetrized algorithm to investigate the distance from trees in the interior of a maximal orthant to the star stratum on the boundary of that orthant.  11 Comparison of approximate geodesics constructed between trees F 1 and F 3 from (5.4) in orthants separated by two nearest neighbour interchanges. Three neighbouring orthants in W 5 are shown, and the bottom right-hand orthant does not correspond to a valid tree topology. As in Fig. 10, negative values on axes correspond to negative lengths of new edges following setup. For λ 0 ∈ (0, 1], let G 1 = G(λ 0 ) be the forest corresponding to the Newick string ((1 : λ 0 , 2 : λ 0 ) : λ 0 , (3 : λ 0 , 4 : λ 0 )) in λ-parametrization. This is a fully resolved 4-taxon tree on which each edge has weight λ 0 . By symmetry, the edges on the tree in the star stratum closest to G(λ 0 ) must all have equal weight λ ∈ (0, 1]. Thus, let G k = F(λ) be the star tree corresponding to the Newick string (1 : λ, 2 : λ, 3 : λ, 4 : λ), again in λ-parametrization. Figure 12 shows for each λ 0 ∈ {0.1, 0.5, 0.9, 0.95} the approximated values of d * cov [G(λ 0 )], [F(λ)] against λ. Obviously, F(λ) is closest for λ slightly larger than λ 0 with distance decreasing as λ 0 → 1. This suggests that the star stratum is closer to the tree G(λ 0 ) than the forest consisting of 4 isolated points (obtained from F(λ) as λ → 1), even for λ 0 values close to 1. Note, though, that the forest is a boundary point of the star stratum. For any G(λ 0 ) the distance to F(λ) tends to infinity as λ → 0. Indeed, S F(0) / ∈ S + 4 is not of full rank.

Discussion
In order to do statistical inference on data sets of phylogenetic trees one needs a structure rich enough to enable the use of geometric statistical methods. Recent research has produced geometries such as the BHV and tropical tree spaces and statistical methods adapted to these geometries. Based on a more principled set of underlying assumptions by regarding phylogenetic trees as probability models for genetic sequence data, we have developed a canonical and biologically motivated geometry on tree space by applying tools from information geometry, giving the wald space. In particular, for different values of λ 0 unlike previous related work ( Garba et al. 2018) in which various extrinsic metrics were considered, in this paper we have focused on developing intrinsic metrics and their associated geodesics to explore and to enable accessing the geometry, for this is a key ingredient for statistical inference on non-Euclidean spaces. There are two main difficulties with achieving our aim. First, the discrete-valued Markov process on trees with genetic alphabet Ω characterizes trees as probability models with sample space Ω N where N is the number of phylogenetic taxa. Therefore, calculations of distances and construction of geodesics involve summations over |Ω| N terms, which becomes infeasible for large N . In order to establish computational tractability, we generalized the discrete-valued probability model to a continuousvalued Markov process in a canonical way and applied the information geometry again.
Secondly, information geometry is formulated for parametrized probability models that are a manifold, whereas tree space is a union of manifolds having different dimensions due to the orthants (representing forests with different number of edges) being glued together in a certain way. One has to be careful to compare the structure to the one defined in Moulton and Steel (2004), for example, as we are not including forests with coincident leaves and furthermore we consider a different topology induced by the Fisher information metric. We tackled this issue of not having a single connected parametrized manifold by using the continuous-valued Markov models to embed wald space in the ambient space of symmetric positive definite matrices, which has an analytically tractable geometry and thus allows for approximation of geodesics in the embedded space W N . Our computational results show that the geometry obtained is significantly different from the BHV and tropical geometries, partly due to the inclusion of trees with infinitely long edges in wald space.
Several questions about the geometry of the wald space W N remain. While we have shown that trees with infinitely long edges are a finite distance away from other trees (Theorems 3.1 and 5.1), computational results suggest that parts of this subspace are repulsive and are avoided by geodesics (see Figs. 6 and 12). An explanation for this behaviour might be obtained via calculations or results about curvature for such points of W N , but further investigation is required. Secondly, Theorems 3.1 and 5.1 establish W N as length spaces for the two induced intrinsic metrics we study. It is desirable to strengthen these results and prove that the distance between every pair of points in the space is realized by at least one path, so that our wald space is a geodesic metric space as opposed to a length space. It appears that such a proof requires thorough analysis of the condition on edge weights which excludes trees with coincident leaves. Furthermore, the methods and results presented in Sect. 5.2 represent a first step towards the development of more sophisticated and efficient algorithms for the construction of information geodesics in wald space via the embedding in the space of covariance matrices. A more thorough evaluation of the computational cost as N increases could be carried out, and a more rigorous treatment might establish convergence properties for the symmetrized algorithm. Alternatively, existing algorithms taken from computational Riemannian geometry could be adapted to work in wald space (see Schmidt et al. (2006) for example) and might offer better performance.
The underlying motivation for this work has been to obtain a novel geometric framework for the space of phylogenetic trees which has more principled biological justification than existing geometries, and which can be used to develop statistical methods for analysing data sets of trees. Ultimately, realizing this aim is still some way off. For example, given a sample of points {x 1 , . . . , x n } ⊆ X in a metric space (X , d), the Fréchet meanx ∈ X is a point which minimises the sum of squared distances to the data:x In general, the Fréchet mean does not always exist, or it can fail to be unique, but in globally non-positively curved spaces such as (S + N , d cov ) and (U N , d BHV ) there exists a unique Fréchet mean (Bridson and Haefliger 2011). Development of methods for calculation of a Fréchet mean using an intrinsic information metric in wald space seems very challenging, and the curvature calculations in Sect. 4.2 have implications for the existence and uniqueness of Fréchet means. On the other hand, given any sample of trees in W N , there is a unique inrinsic Fréchet mean in S + N and an algorithm for computing the mean is given by Lenglet et al. (2006). Our projection algorithm could be used to project this to an extrinsic mean back into W N . Properties of the projected Fréchet mean tree could be investigated.
In comparison to the BHV and tropical metrics, the intrinsic information metrics have the advantage of interpretability in terms of genetic substitutions and the distri-butions of characters represented by two trees. This suggests the information metrics might be better suited for statistical tasks such as hypothesis testing. In the BHV and tropical geometries, contraction and expansion of edges offer the means of moving between different topologies. In the wald space, additional topological transformations are possible via expanding edges to infinite length, and these correspond to tree bisection and reconnection (TBR) operations. Many applications in phylogenetics require searches over the space of phylogenetic trees, and movement along information geodesics in the wald space might have advantages over existing methods. give and zero otherwise. This and Equation (6.2) can be applied recursively to compute the terms p T v (s v |ω) for each vertex v ∈ T , starting at the leaves and working up the tree to the root v 0 . Finally, p T (s) can be computed using Equation (6.1), and it follows from the recursion that p T (s) is a multivariate polynomial with arguments of the form 1 + e − k and 1 − e − k , where k ranges over the edges of T . The coefficients of the polynomial depend on the topology of T . Equations (6.1) and (6.2) can also be used to differentiate p T (s) with respect to any edge length parameter. These derivatives are required in Sect. 3. Suppose e is an edge of T and we wish to compute the derivative ∂ p T (s)/∂ e . Since we are free to choose v 0 , the calculation is simplified if we let v 0 be an internal vertex at one end of edge e. We can order the vertices v i attached to v 0 so that e = (v 0 , v 1 ). Equation (6.2) then gives where the p T v i terms can be calculated recursively. Second derivatives of the mass function can be calculated analytically in a similar way.

Appendix B: Proof of Lemma 2.2
First, suppose F 1 ∼ F 2 . The BHV boundary rule does not affect the distribution on characters induced by a tree, because the same distribution is obtained whether an edge of length zero is present in a tree or not. Similarly, if F 1 , F 2 are equal modulo an application of the boundary rule at infinity, then p F 1 (s) = p F 2 (s) since an edge with weight 1 results in independence between the letters at leaves at either side of the edge under the transition probabilities in Eq. (2.1). Specifically, if v 0 , v 1 are vertices at the ends of an edge e with λ e = 1, then where ω ∈ {0, 1}, so the conditional distribution of X (v 1 ) is the same as its marginal. The map [F] → p F from elements of W N to distributions on characters is therefore well-defined. In fact, the work of Allman et al. (2008) shows the map is injective, and this establishes the lemma. For the second part of the theorem, suppose that D f 0 = d 2 f 0 is a f 0 -divergence, where d f 0 is a metric. Further, suppose that F 1 and F 2 are given by λ 1 and λ 2 , respectively. It suffices to consider λ 1 , λ 2 from the same topology and sufficiently close such that the image t → λ(t), λ(0) = λ 1 and λ(1) = λ 2 of the geodesic from p λ 1 to p λ 2 in the metric induced by the Riemannian metric g i j lies fully in a convex λ coordinate patch and has finite Euclidean length there, say L. Hence, for every n ∈ N, there are δλ ( j) with δλ ( j) ≤ L/n, j ∈ {1, . . . , n} such that λ 2 = λ 1 + n j=1 δλ ( j) . Then the second assertion of the theorem follows from (6.3), setting c = f (1)/ f 0 (1), as n → ∞, because The second equality sign holds because the constants in the individual summands O |δλ ( j) | 3 (1 ≤ j ≤ n) can be bounded by the supremum of absolute values of the gradient of p λ with respect to λ, in the coordinate patch, as can be seen from the last lines of the proof of Lemma 3.1. The third part of the theorem, which states that minimal length paths satisfy the geodesic equation locally, is part of the standard theory for Riemannian geometry on manifolds, e.g. (Lee 1997, Sect. 4). cherry and permuting the labels of F results in a tree with covariance P T S F P (with permutation matrix P), where positive definiteness is preserved.
Let e N −1 and e N be the edges incident to leaves N − 1 and N , respectively. Since F is bifurcating, there is exactly one edge, say e 0 , incident to e N −1 and e N . Let S F = (s uv ) N u,v=1 . The tree G ∈ W N −1 obtained by deleting e N and leaf N and merging e 0 and e N −1 toẽ with weight λẽ = 1 − (1 − λ e 0 )(1 − λ e N −1 ) has covariance S G = (s uv ) N −1 u,v=1 , which is by induction positive definite. Using this and Sylvester's criterion ( so det(S F ) is linear in x:=(1−λ e N ) 2 . By symmetry of the cherry, det(S F ) is also linear in y:=(1 − λ e N −1 ) 2 as well. We write g(x, y) = det(S F ). For x = 0, we have s N u = 0 for u < N and s N N = 1, so g(0, y) = det(S F ) = det(S G ) > 0 for all y ∈ [0, 1], and similarly g(x, 0) > 0 for all x ∈ [0, 1]. Furthermore, g(1, 1) = 0, since in that case the last two rows of S F coincide. Since g is linear in x and in y, respectively, we have g(x, y) > 0 for all (x, y) ∈ [0, 1] 2 \ {(1, 1)}, so that det(S F ) > 0 for all (λ e N −1 , λ e N ) ∈ [0, 1] 2 \ {(0, 0)}. If λ e N −1 = λ e N = 0, we would have d N (N −1) = 0, but this is not allowed by the definition of W N .
We also need to show that the map [F] → S F is injective on W N where [F] denotes the equivalence class of F ∈ W N . This is trivial, however, since whenever F 1 , F 2 ∈ W N are in different equivalence classes, the matrix of distances between the leaves is different. is bound from above along the path. Working in the λ-parametrization of an orthant, Eq. (4.5) becomes N u,v=1 . Each element of the matrix is therefore a polynomial in the elements of λ, and their derivatives with respect to λ are also polynomials. Recalling that the tangent space of W N at S λ in a maximal orthant is spanned by ∂ i S λ , where i ∈ {1, . . . , 2N − 3} ranges over the edges in that maximal orthant, Eq. (4.6) becomes The first term in this product is bounded on a geodesic path from F 1 to F * , since S λ is positive definite and its eigenvalues are bound away from zero. The other two terms are also bounded from above, because the derivatives of S λ are polynomials in λ, Thus |g i j (λ)| ≤ C for some constant C at all points along that path, and the same argument as for Theorem 3.1 shows that d * cov [F 1 ], [F * ] is finite.