Maximum Likelihood Estimation of Log-Concave Densities on Tree Space

Phylogenetic trees are key data objects in biology, and methods for phylogenetic reconstruction are highly developed. The space of phylogenetic trees is a nonpositively curved metric space, and statistical methods that exploit this property to analyze sets of trees have recently been developed. Meanwhile, in Euclidean space, log-concave maximum likelihood estimation has emerged as a new nonparametric method for probability density estimation. In this paper, we derive a sufficient condition for the existence and uniqueness of the log-concave maximum likelihood estimator on tree space, and we propose estimation algorithms for one and two dimensions. Since many factors affect inferred trees, it is difficult to specify the distribution of sample trees. The class of log-concave densities is nonparametric, yet estimation can be carried out by maximum likelihood without selecting hyperparameters. We numerically compare the estimation performance with a previously developed kernel density estimator. In our examples where the true density is log-concave, our estimator attains a smaller integrated squared error when the sample size is large. We also conduct numerical experiments on clustering using the Expectation-Maximization (EM) algorithm and compare the results with k-means++ clustering based on the Fréchet mean.


Introduction
Inference of phylogeny is one of the key problems in biology. Phylogenetic trees represent the evolutionary histories of taxa. Methods for phylogenetic reconstruction from present-day genetic data are highly developed, and trees can now be reconstructed easily under various models and inference methods (Felsenstein, 2004).
However, due to the uncertainty of inference, incomplete lineage sorting, or irregular biological processes such as horizontal transfer, different gene loci usually indicate different evolutionary histories (Reid et al., 2014). The classical approach to this problem is to find a consensus tree (Bryant, 2003), a single tree that summarizes, according to some rule, the multiple trees inferred from different loci. Billera et al. (2001) introduced the space of phylogenetic trees with n leaves, tree space, which is a geodesic metric space with nonpositive curvature. Recent research has shifted to the statistical analysis of sets of trees in this space, enabling a geometric perspective. These efforts include point estimation by the Fréchet mean (Benner et al., 2014), principal component analysis (Nye, 2011), outlier detection using a kernel density estimator (Weyenberg et al., 2014, 2017), and the construction of confidence sets (Willis, 2019).
Since numerous factors affect inferred phylogenetic trees, a parametric specification of their distribution is difficult, and the risk of model misspecification is high. A nonparametric approach imposes fewer constraints on the distribution and is therefore generally preferable in this setting. The kernel density estimator proposed by Weyenberg et al. (2014, 2017) is designed for this purpose.
In Euclidean space, log-concave density estimation has emerged as a new shape-constrained, nonparametric method of density estimation. Log-concave densities are the class of probability densities whose logarithms are concave functions. Compared to classical smoothing approaches such as kernel density estimation, which require hyperparameters such as bandwidths to be specified, log-concave density estimation has the advantage that it can be carried out automatically. Concretely, it has been shown that the maximum likelihood estimator exists in both the one-dimensional and multi-dimensional cases and that its computation reduces to a convex optimization problem in R^n for n sample points (Cule et al., 2010).
In this paper, we show that the maximum likelihood estimator of a log-concave density in tree space exists under certain conditions and that the estimation can be implemented in the one-dimensional case. Although we do not derive an exact algorithm for the multidimensional case, owing to the difficulty of finding the closure of convex hulls, we show that the maximum likelihood estimator can be calculated approximately in the two-dimensional case.
The remaining sections are organized as follows. Section 2 reviews basic concepts of tree space and defines some notions of convex analysis in Hadamard spaces.
Section 3 presents our main results on maximum likelihood estimation in tree space, in one and several dimensions. Section 4 explains how to calculate the maximum likelihood estimator in the one-dimensional and two-dimensional cases. Section 5 presents simulation studies: we compare the density estimation performance with the kernel density estimator, and we compare clustering with a mixture of log-concave densities to the k-means++ approach based on the Fréchet mean.

Phylogenetic Trees and Tree Space
A phylogenetic tree is modeled as a tree with labels only on its leaves. We call a tree with n + 1 labeled leaves an n-tree. An n-tree has one root leaf representing the common ancestor and n leaves representing present-day taxa. Internal edges are edges that are not directly connected to leaves. In a binary n-tree, it is easy to see that the number of internal edges is n − 2. Nonbinary trees have fewer internal edges, since they can be obtained by contracting some internal edges of binary trees. The number of distinct topologies for binary n-trees is (2n − 3)!! (Felsenstein, 1978). Note that edges in different tree topologies can be regarded as the same if they partition the leaves in the same way. If each internal edge of an n-tree has a positive length, we call it a metric n-tree. Billera et al. (2001) modeled the space of metric n-trees as follows. First, each binary tree topology is modeled as a positive Euclidean orthant of dimension n − 2, with each axis corresponding to the length of an internal edge of that topology. These orthants are glued together along the axes representing the same edges. Nonbinary trees constitute the boundaries of these orthants, since they are obtained by contracting some internal edges. In particular, the n-tree without any internal edges is the point at the center of tree space connected to every orthant; we call this point the origin in this paper. The space of metric n-trees obtained in this way is called tree space and is denoted T_n.
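As a small concreteness check, the (2n − 3)!! count of binary topologies can be computed directly; the function name below is ours, not from the paper.

```python
def num_binary_topologies(n):
    """Number of binary n-tree topologies: the double factorial (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # 3 * 5 * ... * (2n - 3)
        count *= k
    return count
```

For instance, this returns 3 for T_3 and 15 for T_4, matching the orthant counts used below.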
In this paper, we mainly discuss results in the one-dimensional and two-dimensional cases. By dimension, we mean the dimension of each orthant, i.e., the dimension of the space viewed as a cubical complex. Therefore, by the p-dimensional tree space we mean the space T_{p+2}. One-dimensional tree space has only 3!! = 3 topologies, and each orthant representing a topology is a half-line connected to the other two at the origin, the point representing the trifurcating tree (Figure 1). Two-dimensional tree space T_4 has 5!! = 15 topologies, each orthant is a two-dimensional nonnegative orthant of the Euclidean plane, and three different orthants are connected along each axis (Figure 1). It is known that if we represent each axis as a vertex and each orthant as an edge connecting two vertices, the whole space T_4 can be depicted as the Petersen graph, as shown in Figure 1. For details, see Billera et al. (2001) or Lubiw et al. (2020), for example.

Geometry of Tree Space
Tree space is readily formulated as a metric space. Since it is a union of Euclidean orthants, we can define the distance between two points in the same orthant as the usual Euclidean distance. For two points in distinct orthants, we define the distance as the infimum of the lengths of paths connecting them.
The infimum is attained, since tree space consists of finitely many orthants, and once the sequence of orthants traversed by a path is fixed, there is a unique path of minimum length.
Paths attaining the minimum distance are called geodesics, and a metric space with a geodesic between any two points is called a geodesic metric space.
Tree space is also known to be a Hadamard space, i.e., a complete metric space of nonpositive curvature; see, for example, Bačák (2014) for a proof. This property is key to many important results in this space. In Hadamard spaces, the geodesic connecting any two points is guaranteed to be unique. Owen and Provan (2011) developed an algorithm that finds geodesics in tree space in O(p^4) time, where p denotes the dimension of tree space.
In tree space, geodesics are piecewise linear paths through sequences of orthants, and in some cases they are cone paths, consisting of two segments that connect the endpoints through the origin. In particular, in two-dimensional tree space T_4, whether the geodesic between two points is a cone path depends only on the "angle" between them; see, for example, Lubiw et al. (2020) for a detailed account. Nonpositive curvature also yields the existence and uniqueness of the Fréchet mean. The Fréchet mean µ of samples X_1, . . ., X_n ∈ H with weights w_1, . . ., w_n is defined as the minimizer of the weighted sum of squared distances:

µ = argmin_{x ∈ H} Σ_{i=1}^n w_i d(x, X_i)^2.   (1)

It is a generalization of the arithmetic mean, with which it coincides in Hilbert spaces under uniform weights. It is known that the Fréchet mean can be calculated using the proximal point algorithm (Bačák, 2013); see, for example, Bačák (2014) for proofs of results regarding the Fréchet mean in Hadamard spaces.
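In the one-dimensional space T_3, geodesics are simple enough that an inductive (Sturm-type) approximation of the Fréchet mean can be sketched in a few lines. This is a heuristic illustration, not the cited proximal point algorithm: we cycle through the samples deterministically for simplicity, and the point representation and function names are our own.

```python
def geodesic_point(p, q, lam):
    """Point a fraction lam in [0, 1] along the geodesic from p to q in T3.
    A point is (orthant, d), d being the distance from the origin (d == 0: origin)."""
    (op, dp), (oq, dq) = p, q
    if op == oq:
        return (op, (1.0 - lam) * dp + lam * dq)
    if dp == 0.0:
        return (oq, lam * dq)
    if dq == 0.0:
        return (op, (1.0 - lam) * dp)
    # distinct orthants: the geodesic is the cone path through the origin
    t = lam * (dp + dq)
    return (op, dp - t) if t <= dp else (oq, t - dp)

def frechet_mean(points, passes=200):
    """Inductive mean: repeatedly step toward each sample with step size 1/k.
    For samples in a single orthant this reproduces the arithmetic mean."""
    x, k = points[0], 1
    for _ in range(passes):
        for pt in points:
            k += 1
            x = geodesic_point(x, pt, 1.0 / k)
    return x
```

For two points, the Fréchet mean is the geodesic midpoint; e.g., the midpoint of points at distances 1 and 3 in different orthants lies 1 unit into the second orthant.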

Convex Sets and Concave Functions on Hadamard Spaces
In this section, we define several concepts of convex sets and concave functions on Hadamard spaces.
First, we define convex sets as in Bačák (2014). Let (H, d) be a Hadamard space, and for points x, y ∈ H let [x, y] denote the unique geodesic between them. For λ ∈ [0, 1], we informally write (1 − λ)x + λy for the point of the geodesic [x, y] at distance λd(x, y) from x. We say that A ⊆ H is convex if [x, y] ⊆ A for all x, y ∈ A.
Note that a product of Hadamard spaces is again a Hadamard space; in particular, H × R is a Hadamard space. A function f : H → [−∞, ∞] is concave if its hypograph, hypo f = {(x, α) ∈ H × R : α ≤ f(x)}, is a convex subset of H × R. An equivalent definition of concavity can be given in the following way: given any points x, y ∈ H, if α < f(x) and β < f(y), then f((1 − λ)x + λy) > (1 − λ)α + λβ for every λ ∈ (0, 1). Given any set S ⊆ H, the convex hull of S is the smallest convex set in H containing S.
This set exists because the intersection of (possibly infinitely many) convex sets is convex and H itself is convex. We denote the convex hull of S by conv S. In Hadamard spaces, it is known that the convex hull can be written as in the following lemma.
Lemma 1 (Bačák (2014), Lemma 2.1.8). For S ⊆ H, put C_0 = S and for n ∈ N recursively define C_n as the union of all geodesics with endpoints in C_{n−1}. Then conv S = ∪_{n∈N} C_n.

This lemma indicates that the convex hull is obtained by taking geodesics between points repeatedly, countably many times. Next, we define the important concept of the concave hull of a function. Given any function f on H, the concave hull of f, conc f, is defined as the least concave function minorized by f. Its existence and uniqueness in Euclidean space is a well-known result; see chapter 5 of Rockafellar (1970) for a detailed account. Here we show existence and uniqueness in the Hadamard case, but, as we will see, the proof proceeds in the same way as in the Euclidean case.
Lemma 2. For any function f : H → [−∞, ∞], the concave hull conc f exists and is unique.
An important implication of the proof of Lemma 2 is how to derive the concave hull of a function: take the convex hull of its hypograph and then the pointwise supremum. In the remaining sections, our interest is in functions of the following form: given sample points X_1, . . ., X_n and a vector y ∈ R^n,

f_y(x) = y_i if x = X_i (i = 1, . . ., n), and f_y(x) = −∞ otherwise.

In this case, the hypograph of the function f_y is a union of vertical half-lines:

hypo f_y = ∪_{i=1}^n {X_i} × (−∞, y_i].

In the Euclidean case, the convex hull of this set is characterized by convex combinations of the n points X_1, . . ., X_n, so that the concave hull of f_y is a piecewise linear function. In Hadamard spaces, however, the convex combination of n points is not well defined, in that associativity does not hold. What we do know, by Lemma 1, is that any point (x, µ) in the convex hull of this set can be obtained by repeatedly taking convex combinations of two points. We use the notation h_y = conc f_y for this concave hull; h_y is the least concave function with h_y(X_i) ≥ y_i (i = 1, . . ., n).
Also, note that, as in the Euclidean case, the restriction of a convex (resp. concave) function to a convex set is again convex (resp. concave). This again follows because the intersection of two convex sets is convex, so the hypograph of the restricted function is also convex.
Following the usual convention in convex analysis, we regard a concave function f whose domain is a convex set S ⊆ H as a function on the whole space H that takes the value −∞ outside S.
Another condition usually assumed together with concavity is upper semicontinuity.
A function f is upper semicontinuous at x ∈ H if lim sup_{y→x} f(y) ≤ f(x); if f is upper semicontinuous at every point x ∈ H, we say it is upper semicontinuous on H. This is equivalent to the hypograph of f being closed.
Similarly to concave hulls, we can define the upper semicontinuous hull of a function f as the least upper semicontinuous function minorized by f. That this function exists follows from the fact that (possibly infinite) intersections of closed sets are closed. Furthermore, it is possible to define the upper semicontinuous concave hull of a function f as the least upper semicontinuous concave function minorized by f. Concretely, this function is derived by intersecting the hypographs of all upper semicontinuous concave functions minorized by f; it can also be characterized as the function whose hypograph is cl(conv hypo f). We denote the upper semicontinuous concave hull of the function f_y by h̄_y.

Probability Measure on Tree Space
In this section, we introduce one of the possible constructions of a probability measure on tree space.This construction is due to Willis (2019).
Let (T_n, d) be tree space. This space consists of (2n − 3)!! Euclidean (n − 2)-dimensional nonnegative orthants. Using this, any set A in T_n can be written as a union of Euclidean sets, A = ∪_{i=0}^{(2n−3)!!} A_i, where A_0 denotes the set of trees in A that lie on the boundary of T_n, and A_i (i ≥ 1) denotes the set of trees in A that lie in the i-th positive orthant. Now let ν_B be the Lebesgue measure on R^{n−2}, and define ν by ν(A) = Σ_{i=1}^{(2n−3)!!} ν_B(A_i), so that boundary sets receive measure zero. Then ν is σ-additive, and by completion of measures, ν becomes a complete measure.
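For T_3, where each orthant is a half-line, this construction reduces to summing interval lengths orthant by orthant; a toy sketch (the representation and function name are ours):

```python
def nu_measure(parts):
    """nu(A) for a set A in T3 described orthant by orthant: parts maps an
    orthant label to a list of disjoint intervals (a, b) with 0 <= a <= b.
    Boundary trees (here, only the origin) receive measure zero."""
    return sum(b - a for intervals in parts.values() for (a, b) in intervals)
```

For example, a set covering one unit of the first half-line and 1.5 units of the second has ν-measure 2.5.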
In the remaining sections, we will use measure ν as a base measure and consider densities with respect to it.
Existence and Uniqueness of the Maximum Likelihood Estimator

One-Dimensional Case
In this section, we consider the simplest tree space, T_3. Let F_0 be the set of log-concave probability densities with respect to the base measure ν on this space. The following theorem shows that maximum likelihood estimation in T_3 can be performed in the same manner as in the Euclidean case.
Theorem 3. Let (X_1, . . ., X_n) be a sample from a density f on T_3. Then, with probability 1, the maximum likelihood estimator f̂_n of f exists and is unique; i.e., f̂_n is the unique maximizer in F_0 of the log-likelihood function l(f) = Σ_{i=1}^n log f(X_i).

The proof is given in the supplementary material.
Remark.Although our main interest is in tree space, it is simple to see that the above argument also applies to the space of N ∈ N half-lines connected at the origin.

Multidimensional Case
In this section, we consider the general p-dimensional tree space, T_{p+2}. We let F̄_0 be the set of upper semicontinuous log-concave densities on T_{p+2} and l(f) = Σ_{i=1}^n log f(X_i) denote the log-likelihood function. Unlike in the one-dimensional tree space and in Euclidean space, the maximum likelihood estimator might not even exist in this case. Before deriving a sufficient condition for its existence, we first show that when a maximizer exists, it is unique, as stated in the following theorem.
Theorem 4. Let (X_1, . . ., X_n) be a sample from some density on T_{p+2}. Suppose that a maximizer of l(f) in F̄_0 exists. Then the maximizer is unique ν-almost everywhere.
Here, the last inequality in the proof is the inequality between the arithmetic and geometric means, and equality holds if and only if the two maximizers coincide ν-almost everywhere. The next theorem provides a sufficient condition on the sample points under which the maximum likelihood estimator exists and is unique almost everywhere.
Theorem 5. Let (X_1, . . ., X_n) be a sample from a density f on T_{p+2}, n ≥ p + 1. Let C_n = conv{X_1, . . ., X_n}, and suppose that, for each orthant O_i, one of the following conditions holds: (a) . . .; (b) . . .; (c) cl(C_n) ∩ O_i contains only points on the boundaries, say B_j (j = 1, . . ., J), and each B_j is connected to at least one orthant O_j in which (b) holds.
Then a maximum likelihood estimator f̂_n of f in F̄_0 exists; i.e., some f̂_n ∈ F̄_0 maximizes the log-likelihood function l(f) = Σ_{i=1}^n log f(X_i). Moreover, it is unique outside a set of ν-measure zero that does not intersect int(C_n).
Note that this sufficient condition holds with probability approaching one as n goes to infinity. Also, observe that the condition is easy to check whenever the convex hull of the samples in T_{p+2} can be calculated.
The proof is given in the supplementary material. In the proofs of Theorem 3 and Theorem 5, it is shown that the maximum likelihood estimator for n samples is parametrized by an n-dimensional vector y ∈ R^n. Concretely, the logarithm of the maximum likelihood estimator f̂_n has the form h̄_y, the least upper semicontinuous concave function with h̄_y(X_i) ≥ y_i. Note that in the one-dimensional case the least concave function is automatically upper semicontinuous, i.e., h_y = h̄_y. It is also shown that we need to maximize the objective function ψ_n(y) defined as

ψ_n(y) = 1 + (1/n) Σ_{i=1}^n h̄_y(X_i) − ∫ exp(h̄_y(x)) dν(x).

We also see in the proof that, in some cases, adding a new point outside the existing convex hull does not increase the measure of the convex hull of the new point set. The next example illustrates this.
Example 6 (nonuniqueness). Suppose we have three points X_1, X_2, X_4 ∈ T_4 as depicted in Figure 2. Then conv{X_1, X_2, X_4} is the triangle X_1X_2X_4. Suppose we add X_3 and consider conv{X_1, X_2, X_4, X_3}. This set is the union of the triangle X_1X_2X_4 and the segment X_4X_3, and it is clear that the measure has not increased; i.e., ν(conv{X_1, X_2, X_4, X_3}) = ν(conv{X_1, X_2, X_4}). Hence, "extending" the support of f to include the segment X_3X_4 while preserving log-concavity would not change the value of the likelihood function, and the extension of f would also be a density. Thus ν-almost-sure uniqueness holds, but strict uniqueness does not hold in this case.
Cases in which the maximum likelihood estimator does not exist occur when the convex hull has a measure-zero intersection with some orthant.
Example 7 (nonexistence). Suppose we have three samples X_1, X_2, X_3 ∈ T_4 as depicted in Figure 2, and let X_4 denote the origin; the geodesics between these sample points are necessarily cone paths. It is clear that the assumption of Theorem 5 is not satisfied. Now assume that the segments X_1X_4 and X_2X_4 have the same length and that the segment X_3X_4 is three times as long. Let log f be an affine-like function with log f(X_3) = y + ∆ and log f(X_2) = log f(X_1) = y − ∆/3; then log f(X_4) = y. One can see that, as ∆ goes to +∞, the second term of ψ_n grows linearly to +∞ while the last integral term converges to zero. Thus ψ_n is monotonically increasing in ∆ and does not attain a maximum. Therefore, the maximum likelihood estimator does not exist in this case.

Densities That Bend at the Boundary
Although the nonparametric class of log-concave densities seems large enough to model various densities, some densities that arise naturally are only slightly non-log-concave.
In this section, we focus on the simplest one-dimensional tree space T_3 and discuss a way to incorporate densities that are not log-concave at the boundary. Let O_1, O_2, O_3 denote the three half-lines, and let f_i, a function on R_{≥0}, be the restriction of a density f : T_3 → R to O_i. Denote by ∂f_i/∂n_i(0) the exterior derivative of f_i at the origin, defined as the one-sided limit lim_{h→0+} (f_i(h) − f_i(0))/h. For example, for a Brownian motion started at a point x_0 ∈ R_{>0} in the orthant O_1, the probability density at time t in each orthant can be written as in (9), where φ(x; µ, σ^2) denotes the probability density function of the normal distribution on R with mean µ and variance σ^2. Density (9) is not log-concave, since it "bends" at the origin: computing the exterior derivatives shows that f is not log-concave around the origin.
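The explicit form of density (9) is not reproduced here. As an illustration only, one natural candidate with exactly this bending behavior is the transition density of Walsh Brownian motion on three half-lines with equal weights 1/3 — an assumption of ours, not necessarily the authors' formula. It is continuous at the origin, but its one-sided derivatives there differ across orthants:

```python
import math

def phi(x, mu, var):
    """Density of the normal distribution N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bent_density(orthant, x, x0, t):
    """Candidate 'bending' density on T3: Walsh Brownian motion with equal
    weights 1/3, started at distance x0 > 0 in orthant 1, observed at time t.
    (A hypothetical stand-in for the elided density (9).)"""
    if orthant == 1:
        return phi(x, x0, t) - (1.0 / 3.0) * phi(x, -x0, t)
    return (2.0 / 3.0) * phi(x, -x0, t)
```

Summing the numeric integral over the three half-lines recovers total mass one, and the three one-sided values at the origin agree while the slopes do not, which is the bending discussed above.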
Another example can be constructed by considering a simple coalescent process of three lineages from three different species.
Example 8. Consider the species tree drawn on the left of Figure 3, with one lineage drawn from each of the three species. In a neutral coalescent process (Kingman, 1982) with n lineages, the time until some two lineages coalesce follows an exponential distribution with rate n(n − 1)/2. In the multispecies coalescent (Rannala and Yang, 2003), one assumes that two lineages can coalesce only after the corresponding two species have coalesced. Let T be the time during which exactly two clades of species exist, as shown in Figure 3. The distribution of the length of the internal edge in each orthant i can then be written down explicitly; see the supplementary material for the derivation.
This density is illustrated in Figure 3 for the case T = 1, and we can clearly see that it is not log-concave. In fact, the exterior derivatives can be computed explicitly, and they satisfy a Kirchhoff-type condition at the origin. Densities that bend at the origin in this way can be incorporated into the model: we relax the convexity condition on hypo(log f) to a pair of weaker conditions, under which, in particular, (0, y_0) ∈ hypo(log f). We denote by G_0 the class of densities satisfying these two conditions. Any log-concave density f satisfies them, since y_0 ≤ (1 − λ)f(x_1) + λf(x_2); at the same time, they allow bending at the origin to a certain extent. Indeed, both densities considered in this section belong to G_0.
We can now apply arguments similar to those for the one-dimensional log-concave case in section 3.1 and derive the existence and uniqueness of the maximum likelihood estimator in G_0.
Note that the resulting estimator is not required to satisfy the Kirchhoff-type condition.
Calculation of the Maximum Likelihood Estimator in One and Two Dimensions
To maximize the function ψ_n, we need to be able to calculate h_y for a given input vector y. In the one-dimensional case, this problem reduces to the calculation of the convex hull of the points {(X_1, y_1), . . ., (X_n, y_n)}. We explain below that this further reduces to a convex hull computation in R^2, so the algorithm for the log-concave maximum likelihood estimator is implementable.
In the two-dimensional case, the calculation of h̄_y requires an algorithm that finds the closure of the convex hull of the points {(X_1, y_1), . . ., (X_n, y_n)} in T_4 × R. We do not know whether such an algorithm can be obtained exactly, but we show below that an approximate convex hull can be computed.

Calculation of the Least Concave Function on T_3
In order to obtain the maximum likelihood estimator (MLE), we need to be able to calculate the concave hull h_y of a function of the type f_y(X_i) = y_i (i = 1, . . ., n), f_y = −∞ elsewhere. As discussed in section 2.3, concave hulls can be found by first taking the convex hull of the hypograph and then taking the supremum. In this case, the problem is to find the convex hull D_n of a finite set of sample points in T_3 × R.
T_3 is a space consisting of three half-lines R_{≥0} joined at the origin. If all sample points lie in one orthant, the situation is the same as in the Euclidean case, so we consider the case where the sample points come from at least two orthants. Let y_0 be the maximum, over all pairs of sample points lying in different orthants, of the height at which the segment in T_3 × R connecting them crosses the origin. Then (0, y_0) must be in the convex hull D_n. Furthermore, if we take the convex hull of the sample points in each orthant including (0, y_0) and take the union of these hulls, the resulting set, say E_n, is convex.
Obviously E_n is included in D_n; thus, by the minimality of convex hulls, E_n in fact equals D_n. Since y_0 can be calculated by taking geodesics between all pairs of sample points, D_n is easily computed using a convex hull algorithm in Euclidean space R^2.
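The reduction above can be sketched directly: compute y_0 from all cross-orthant pairs, then evaluate the concave envelope within the query's orthant. For clarity we use a quadratic-time pairwise-interpolation evaluation instead of an explicit hull routine; representation and names are ours.

```python
def h_y_T3(samples, y, query):
    """Least concave function h_y on T3 with h_y(X_i) >= y_i, evaluated at query.
    samples: list of (orthant, d) with d >= 0; y: list of heights;
    query: (orthant, d). Returns -inf outside the hull of the samples."""
    # height y0 at the origin: the highest crossing point of all chords
    # connecting samples that lie in different orthants
    y0 = max([yi for (o, d), yi in zip(samples, y) if d == 0.0],
             default=float("-inf"))
    for i, ((oi, di), yi) in enumerate(zip(samples, y)):
        for (oj, dj), yj in zip(samples[i + 1:], y[i + 1:]):
            if oi != oj and di > 0.0 and dj > 0.0:
                y0 = max(y0, (yi * dj + yj * di) / (di + dj))
    qo, qd = query
    if qd == 0.0:
        return y0
    # within the query's orthant the problem is one-dimensional Euclidean:
    # take the upper envelope of that orthant's points plus (0, y0)
    pts = ([(0.0, y0)] if y0 > float("-inf") else [])
    pts += [(d, yi) for (o, d), yi in zip(samples, y) if o == qo]
    best = float("-inf")
    for a, ya in pts:
        for b, yb in pts:
            if a < b and a <= qd <= b:
                lam = (qd - a) / (b - a)
                best = max(best, (1 - lam) * ya + lam * yb)
            elif a == qd:
                best = max(best, ya)
    return best
```

For samples in different orthants, the chord through the origin determines the value at the cone point, and queries in an empty orthant fall outside the support.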
Similar arguments hold if we allow a bend at the origin, as discussed in section 3.3; we only need to alter the definition of y_0 accordingly.

Approximation of h̄_y on T_4
In the two-dimensional case, it is not even known whether the convex hull of a finite number of points is closed. Recently, Lubiw et al. (2020) constructed an algorithm, based on linear programming, that finds the closure of the convex hull of a finite number of points in single-vertex 2D CAT(0) complexes, including tree space. Although we cannot transfer this algorithm directly to the product space of T_4 and R, Lubiw et al. (2020) also showed that, by iteratively updating the values at the boundaries, one can construct a sequence of sets converging to the desired convex hull. With a slight modification of this argument, as we show below, we can create a similar situation, so that by stopping after some number of iterations we obtain an approximation of the convex hull.
To see this, let S_0 be the set of all sample points {(X_i, y_i) | X_i ∈ T_4, y_i ∈ R}. For simplicity, we only consider the situation where the origin is in the convex hull (if the convex hull of the sample points does not contain the origin, the situation is the same as in the Euclidean case). We call a geodesic in the space T_4 × R a cone path if it crosses a point of the form (0, y_l), where 0 denotes the origin of T_4; that is, a geodesic is a cone path when its projection onto T_4 is a cone path. Also, for any nonnegative orthant O of T_4, we write S_{l−1}(O) for the points of S_{l−1} contained in O × R. The l-th iteration proceeds as follows. First, we take the geodesics between points in H_{l−1} that are cone paths and find their values at the origin, which can be represented as (0, y_{l,k}). We then find the maximum and minimum of {y_{l,k}}, denoted y_{l1} and y_{l0} respectively, and initialize T_l with S_{l−1} ∪ {(0, y_{l1}), (0, y_{l0})}. Although it is practically difficult to find the exact maximum and minimum values, an approximation algorithm can be used; we discuss this approximation in the supplementary material.
Now, for each pair of points in T_l, take the geodesic between them and add all its intersection points with the boundaries to T_l. Here, by a boundary we mean the two-dimensional plane formed by an axis of T_4 and R. Then, for each boundary of T_4, collect all points of T_l lying on it, including the points at the origin, which are of the form (0, y_{li}). Since each boundary is a two-dimensional plane, these points can be regarded as points in R^2, and we take the usual two-dimensional Euclidean convex hull in this plane. We then discard from T_l any points that are not vertices of this convex hull, and define S_l = T_l. Further, let O be any nonnegative orthant of T_4 and S_l(O) be the points of S_l that are also in O × R.
Define H_l(O) = conv(S_l(O)), which can be computed easily since S_l(O) can be embedded into a nonnegative orthant of R^3, and let H_l = ∪_O H_l(O). This construction is similar to the two-dimensional CAT(0) case (Lubiw et al., 2020); the differences are that here we must first determine the maximum and minimum values at the origin, and that instead of preserving only the extreme points on a one-dimensional axis, we must keep all vertices of the convex hulls of the points on the boundaries.
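The per-boundary pruning step uses an ordinary planar convex hull; a standard Andrew monotone-chain sketch (not tied to the paper's implementation):

```python
def convex_hull_2d(points):
    """Vertices of the convex hull of 2-D points, counterclockwise
    (Andrew's monotone chain; collinear points are discarded)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for chain, seq in ((lower, pts), (upper, list(reversed(pts)))):
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
    return lower[:-1] + upper[:-1]
```

Points interior to the hull, such as the center of a square, are pruned, exactly as the construction discards non-vertices of each boundary hull.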
To validate the first step, we first show the following lemma.
Lemma 9. If S_{l−1} is included in conv(S_0), then so is S_l.
The next theorem ensures that H_l converges to conv(S_0).
Theorem 10. Assume p, q ∈ H_{l−1}. Then the geodesic from p to q, denoted γ_{p,q}, is included in H_l. This indicates that ∪_l H_l = conv(S_0).
Proof. If γ_{p,q} is a cone path, let (0, y_{0pq}) be the point at which the geodesic crosses the origin. By construction, S_l includes points (0, y_{l1}) and (0, y_{l0}) such that y_{l0} ≤ y_{0pq} ≤ y_{l1}.
Therefore, H_l includes (0, y_{0pq}), and thus all points of γ_{p,q} are included in H_l.
If γ_{p,q} is not a cone path, the simplest case is that p and q lie in the same orthant.
In such a case, by construction, γ_{p,q} ⊆ H_{l−1} as well; thus, in particular, γ_{p,q} ⊆ H_l.
Otherwise, write p = (p_x, p_y) and q = (q_x, q_y) with p_x, q_x ∈ T_4 and p_y, q_y ∈ R. We can take two nonnegative orthants O_p and O_q, containing p_x and q_x respectively, such that O_p and O_q are connected with at most one other orthant between them. Note that three connected orthants can be embedded in part of R^2 in the manner shown on the left of Figure 4. Now we can take a_p, b_p, c_p ∈ S_{l−1}(O_p) with the following property: the triangle a_p b_p c_p contains the points of the form (p_x, y_{pp}) with y_{pp} ≥ p_y; these are the points that form a triangle "over" the point p. In the same way, we can take a_q, b_q, c_q ∈ S_{l−1}(O_q) such that the triangle a_q b_q c_q contains the points of the form (q_x, y_{qq}) with y_{qq} ≥ q_y. The case when the triangles exactly contain p and q is drawn in Figure 4. Let r = (x_r, y_r) be the intersection point of γ_{p,q} with one of the boundaries, say B. Also, in the Euclidean embedding of O_p, O_q, and B as in Figure 4, let −B denote the axis in the direction opposite to B (Figure 4). Recall that a "boundary" here means the Euclidean plane that an axis of T_4 forms in the space T_4 × R.
Consider the embedded space R^3 containing O_p and O_q. If we allow paths to cross the part of the Euclidean space in which no orthant is embedded (the hatched region in Figure 4), it is easy to see from Euclidean geometry that we can find two straight lines between points of {a_p, b_p, c_p} and {a_q, b_q, c_q}, say γ_1 and γ_2, such that the segment between the intersection points of γ_1 and γ_2 with B or −B, say r_1 = (x_{r1}, y_{r1}) and r_2 = (x_{r2}, y_{r2}), contains a point of the form (x_r, y_r′) with y_r′ ≥ y_r. If both r_1 and r_2 lie in B, then we have just shown that (x_r, y_r′) ∈ H_l. If r_1 or r_2 does not lie in B, say r_1, then the segment from r_1 to r_2 still passes "under" (0, y_{l1}): if (0, y_{r_1 r_2}) denotes the origin point contained in the segment r_1r_2, this point lies on some cone path between a point of the triangle a_p b_p c_p and a point of the triangle a_q b_q c_q. Thus the segment from (0, y_{l1}) to r_2 passes "over" (x_r, y_r′); i.e., it contains a point of the form (x_r, y_r″) with y_r″ ≥ y_r′ ≥ y_r. In summary, by setting either ỹ_r = y_r′ or ỹ_r = y_r″, we can find a point (x_r, ỹ_r) ∈ H_l with ỹ_r ≥ y_r. A similar argument shows that we can find a point (x_r, ŷ_r) ∈ H_l with ŷ_r ≤ y_r. This implies r = (x_r, y_r) ∈ H_l, and consequently that the whole of γ_{p,q} is included in H_l.

In Lubiw et al. (2020), a similar fact was established, and the authors proceed to reduce the problem of finding the closure of a convex hull to a linear programming problem. Such a reduction is difficult in our case, however, since we have no prior knowledge of how many vertices conv(S_0) has on each boundary. We therefore content ourselves with the approximation algorithm derived from the theorem above. Note that if the maximum and minimum values at the origin can be found exactly in the first step and the sets S_l converge after a finite number of iterations m, then the set H_m is exactly the desired convex hull. Moreover, H_m is closed in this case, so the function h̄_y computed from this set is exactly the least upper semicontinuous concave function.

Altering the Objective Function
Sections 4.1 and 4.2 show that we can calculate the function h̄_y for a given y in the one-dimensional and two-dimensional cases, at least approximately. As noted at the beginning of this section, in order to find the maximum likelihood estimator we must solve an optimization problem of the form max_{y∈R^n} ψ_n(y). As in the Euclidean case (Cule et al., 2010), this objective is difficult to optimize directly, but by suitably altering it we obtain a convex optimization problem. The new objective function σ_n(y) is defined as

σ_n(y) = −(1/n) Σ_{i=1}^n y_i + ∫ exp(h̄_y(x)) dν(x),

which is to be minimized. The convexity of h̄_y with respect to y can be derived in exactly the same manner as in the Euclidean case (Cule et al., 2010), and it ensures that σ_n is a convex function of y. Therefore, we can use ordinary convex optimization solvers to calculate the MLE.
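To illustrate the altered objective in the simplest setting — all samples in a single orthant, so that everything is one-dimensional Euclidean — σ_n can be evaluated in closed form segment by segment, and its midpoint convexity in y checked numerically. The quadratic-time envelope and the form of σ_n here follow the Euclidean definition of Cule et al. (2010) under our assumptions; function names are ours.

```python
import math

def concave_envelope(xs, ys):
    """Least concave majorant of the points (xs[i], ys[i]), evaluated at the xs.
    Brute-force pairwise interpolation, chosen for clarity over speed."""
    n = len(xs)
    env = []
    for i in range(n):
        best = ys[i]
        for a in range(n):
            for b in range(n):
                if xs[a] <= xs[i] <= xs[b] and xs[a] < xs[b]:
                    lam = (xs[i] - xs[a]) / (xs[b] - xs[a])
                    best = max(best, (1 - lam) * ys[a] + lam * ys[b])
        env.append(best)
    return env

def sigma_n(xs, ys):
    """Euclidean-case objective sigma_n(y) = -mean(y) + integral of exp(h_y),
    with h_y the least concave majorant supported on [xs[0], xs[-1]] (xs sorted)."""
    env = concave_envelope(xs, ys)
    integral = 0.0
    for i in range(len(xs) - 1):
        a, b, ha, hb = xs[i], xs[i + 1], env[i], env[i + 1]
        if abs(hb - ha) < 1e-12:
            integral += (b - a) * math.exp(ha)
        else:
            # exact integral of exp(linear function) over the segment [a, b]
            integral += (b - a) * (math.exp(hb) - math.exp(ha)) / (hb - ha)
    return -sum(ys) / len(ys) + integral
```

Checking σ_n((y_1 + y_2)/2) ≤ (σ_n(y_1) + σ_n(y_2))/2 on sample vectors mirrors the convexity argument above.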

Numerical Study
In this section, we present simulation results. First, we show that log-concave densities can be estimated properly in one and two dimensions. We also test the estimation of one-dimensional densities that bend at the origin. Then we compare these results with the kernel density estimator proposed by Weyenberg et al. (2014). Finally, we give an example of clustering with the log-concave mixture model.

Estimation of One-Dimensional Log-Concave Densities
For one-dimensional examples, we consider the following two types of log-concave densities, using O_i (i = 1, 2, 3) to denote the three orthants of T_3.

Case 1. Normal-like density f_1 with mean equal to the point x_0 in O_1 that is 1 unit from the origin.

Case 2. Exponential-like density f_2.

We generated ten samples each for sample sizes 100, 200, 300, 500, and 1000 from the true density and calculated the integrated squared error (ISE). The average ISE versus sample size is reported in Figure 5.
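The ISE on T_3 decomposes into a sum of one-dimensional integrals over the three rays glued at the origin. A minimal numerical sketch using the trapezoidal rule (helper names and the truncation point are our own choices, not the paper's):

```python
import math  # used in the example below

def ise_on_T3(f_hat, f_true, t_max=10.0, n_grid=2000):
    """Approximate the integrated squared error between two densities on
    T_3 (three half-lines glued at the origin). f_hat and f_true are lists
    of three callables: the density restricted to each ray, as a function
    of the distance from the origin. Integrals are truncated at t_max."""
    h = t_max / n_grid
    total = 0.0
    for fh, ft in zip(f_hat, f_true):
        for k in range(n_grid):
            t0, t1 = k * h, (k + 1) * h
            d0 = (fh(t0) - ft(t0)) ** 2
            d1 = (fh(t1) - ft(t1)) ** 2
            total += 0.5 * (d0 + d1) * h  # trapezoid on [t0, t1]
    return total
```

For instance, comparing the density e^(−t) placed on one ray against the zero function gives approximately ∫_0^∞ e^(−2t) dt = 1/2.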
Our simulation results show that the log-concave MLE dominates the kernel density estimator for all sample sizes.

Estimation of Two-Dimensional Normal-like Densities
We denote each orthant of T 4 by the indices of the axes at the boundary as in Figure 1.
We compare our log-concave MLE to the kernel density estimator (Weyenberg et al., 2014) in terms of ISE with respect to the true density. We generated ten samples each for sample sizes 50, 100, 200, 300, 500, and 1000 from the true density and computed the MLE and the average ISE. The estimation results are illustrated in Figure 6. In Case 3, the kernel density estimator performs better for small samples, but when a large number of samples is available, the log-concave MLE starts to outperform the kernel density estimator. This result is similar to the Euclidean case (Cule et al., 2010). On the other hand, in Case 4, the log-concave MLE dominates the kernel density estimator. This is because the log-concave MLE can find the support of the true density with enough data, while the kernel density estimator cannot.

Estimation of One-Dimensional Densities That Bend at the Origin
As one-dimensional densities that bend at the origin, we take the two densities considered in Section 3.3. Concretely, we consider the following two densities:

Case 5. Normal density f_5 constructed from Brownian motion at time 5, with starting position equal to the point in O_1 that is 1 unit from the origin.

Case 6. Density f_6 of the coalescent time in Example 8, with T = 1.

As in the previous experiment with one-dimensional log-concave densities, we generated ten samples each for sample sizes 100, 200, 300, 500, and 1000 from the true density and computed the MLE and the average ISE. The estimation results are illustrated in Figure 7. The maximum likelihood estimator in G_0 dominated the kernel density estimator for all sample sizes.

Clustering Example
In this section, we consider the problem of clustering in the space T_4. Let x_1 be a point in the orthant {0, 1} and x_2 a point in the orthant {2, 3}, both located at coordinates (1, 1) in their orthants. We consider two normal-like densities on T_4, g_1 and g_2. To compare the results with other methods, we use the k-means++ algorithm (Vassilvitskii and Arthur, 2006) applied to this space as an alternative clustering method. Concretely, we implemented the usual k-means++ algorithm with the Fréchet mean instead of the usual arithmetic mean. It is simple to see that the sum of squared distances from the cluster centers strictly decreases at each iteration of the k-means++ or k-means algorithm (MacQueen, 1967), as in the Euclidean case. Thus the algorithm terminates within a finite number of steps.
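On the one-dimensional space T_3 the same idea can be sketched compactly: encode a point as (ray index, distance from origin), so the geodesic distance is the absolute difference on a shared ray and the sum of distances through the origin otherwise, and the Fréchet mean of a cluster can be found ray by ray, since the sum of squared distances restricted to one ray is a clipped quadratic. The following is our own illustrative sketch (seeding as in k-means++ is omitted; names are ours), not the paper's T_4 implementation:

```python
def dist(a, b):
    """Geodesic distance on T_3 between points a = (ray, r) and b."""
    (oa, ra), (ob, rb) = a, b
    return abs(ra - rb) if oa == ob else ra + rb

def frechet_mean_T3(points):
    """Fréchet mean on T_3: minimize the sum of squared distances over
    each of the three rays (the per-ray minimizer is a clipped average)."""
    best = None
    n = len(points)
    for o in range(3):
        s_same = sum(r for (oo, r) in points if oo == o)
        s_other = sum(r for (oo, r) in points if oo != o)
        r = max((s_same - s_other) / n, 0.0)  # clip at the origin
        cost = sum(dist((o, r), p) ** 2 for p in points)
        if best is None or cost < best[0]:
            best = (cost, (o, r))
    return best[1]

def kmeans_T3(points, centers, iters=20):
    """Lloyd-style iterations with Fréchet means as cluster centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[j].append(p)
        centers = [frechet_mean_T3(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers
```

For well-separated clusters on two different rays, the iteration recovers the per-ray cluster means.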
The results of clustering are shown using the Petersen graph in Figure 8. We see from these results that, in this case, the clusters estimated by the log-concave mixture are more representative of the true density. This is a natural result, as the true density is indeed log-concave and the two mixed normal densities have different variances.

Conclusion
In this paper, we showed that the maximum likelihood estimator exists in tree space under certain conditions. In one dimension, the maximum likelihood estimator exists with probability one and is unique, the same result as in the Euclidean case. In multiple dimensions, there are conditions under which the MLE does not exist. However, if we restrict our attention to a particular situation, we have shown that the MLE exists and is unique ν-almost surely. We have also presented an algorithm to calculate the MLE exactly in the one-dimensional case and approximately in the two-dimensional case. We compared the results with a previously developed kernel density estimator and confirmed that our estimator dominates in well-modeled cases with a large enough sample size.
The method derived here is promising in that it gives a new nonparametric approach to density estimation on tree space. As we saw in a two-dimensional example, it is also able to respect the support of the sample distribution when that support is convex, which might affect the estimation accuracy considerably in some situations. Unlike kernel density estimation, we need neither the selection of smoothing parameters nor the calculation of any (approximate) normalizing constant. Although the computation is much slower than kernel density estimation, the accuracy of density estimation can be expected to be high when the log-concavity assumption is not far from the properties of the true density. The developed method offers an intermediate choice between fully unconstrained nonparametric approaches, which have difficulty achieving high estimation accuracy, and parametric approaches, for which it is difficult to specify correct models on tree space.
For future research, it is important to derive theoretical properties of the estimator on tree space. In Euclidean space, the log-concave MLE is known to be strongly consistent, and even if the model is misspecified, it converges to the "log-concave projection" of the true density (the log-concave density that minimizes the Kullback-Leibler divergence from the true density). It is crucial to investigate whether these properties hold on tree space as well.
It is also of interest whether we can introduce different constraints on the density in order to include other types of densities on this space, or to lessen the computational load. Densities that bend at the boundary, which we considered only in the one-dimensional case, are a candidate class for this purpose. Finally, further simulation studies in both well-modeled and misspecified cases are necessary for practical purposes.
As tree space is constructed for modeling the space of phylogenetic trees, one immediate interest of further research is the applicability to biological data and problems. As we only have (approximate) algorithms for one and two dimensions, corresponding to the case where a maximum of four taxa are present, the applicability seems to be limited for now.
Lemma 13. Suppose H has the geodesic extension property. Then any upper-semicontinuous concave function g on H that is bounded on its domain S is continuous on int(S).
Proof. Let g|_{int(S)} denote the restriction of g to int(S). Then the hypograph of this function is hypo(g|_{int(S)}) = {(x, µ) | x ∈ int(S), µ ≤ g(x)}. We first show that the interior of this set is given by T = {(x, µ) | x ∈ int(S), µ < g(x)}.

A.2 Proof of Theorem 5
Proof of theorem. The proof is essentially a modification of the proof of Theorem 1 of Cule et al. (2010).
By Lemma 1, C_n = ∪_{l=0}^∞ D_l, where D_l denotes the set defined in the proof of the previous lemma. Let F be the set of upper-semicontinuous log-concave functions on T_{p+2}, and consider the maximization of the function ψ_n over F. First, we can show that for any g ∈ F, there exists f satisfying (37) such that ψ_n(f) ≥ ψ_n(g). To see this, first assume g(x) = 0 for some x ∈ C_n. Lemma 1 implies that there exists l ∈ N such that x ∈ D_l. If l = 1, we have g(X_k) = 0 for some k, implying ψ_n(g) = −∞. Otherwise, take the minimum such l; then one can take some x_{1,l−1}, x_{2,l−1} ∈ D_{l−1} and µ ∈ (0, 1) such that x = (1 − µ)x_{1,l−1} + µx_{2,l−1}. By concavity, −∞ = log g(x) ≥ (1 − µ) log g(x_{1,l−1}) + µ log g(x_{2,l−1}). This implies that at least one of g(x_{1,l−1}) and g(x_{2,l−1}) must be zero.
Repeating this process, we can show that there exists k ∈ {1, . . ., n} such that g(X_k) = 0, resulting again in ψ_n(g) = −∞. Therefore, if g(x) = 0 for some x ∈ C_n, then any f with (37) satisfies ψ_n(f) ≥ ψ_n(g). If g(x) = 0 for some x ∈ cl(C_n) \ C_n, then one can take a sequence {x_i} ⊂ C_n such that x_i converges to x. The upper-semicontinuity implies that lim sup_{i→∞} g(x_i) ≤ g(x) = 0. Then for all ε > 0, there exists i such that g(x_i) < ε.
This, together with the previous argument, leads to min_{k=1,...,n} g(X_k) < ε. Since ε can be taken arbitrarily small, min_{k=1,...,n} g(X_k) = 0, which again leads to ψ_n(g) = −∞. On the other hand, because cl(C_n) is bounded by Lemma 12, if we take f ∈ F to be finite, positive, and upper-semicontinuous on cl(C_n), then f is upper-semicontinuous on dom f and ψ_n(f) > −∞. Thus, we can restrict our attention to f such that f(x) > 0 for all x ∈ C_n. Now, assume that g(x) > 0 for all x ∈ C_n. Then, by letting f be the restriction of g to cl(C_n), it is easy to see that f satisfies (37), log-concavity, and upper-semicontinuity. Furthermore, ψ_n(f) ≥ ψ_n(g), and the inequality becomes strict if ν({x ∈ dom g | x ∉ cl(C_n)}) > 0. This means that we can restrict attention to densities that satisfy (37), or to ones that satisfy it except on some set of ν-measure zero outside cl(C_n).
Secondly, we can show that for any upper-semicontinuous g, there exists f whose logarithm has the form h̄_y and which satisfies ψ_n(f) ≥ ψ_n(g). This is achieved by putting y_i = log g(X_i), y = (y_1, . . ., y_n), and f = exp(h̄_y). Note that by Lemma 15, dom h̄_y = cl(C_n). Also, by Lemma 13, h̄_y is continuous relative to int(C_n). Because of this continuity, if g(x) ≠ exp(h̄_y)(x) at some x ∈ int(C_n), then ψ_n(exp(h̄_y)) > ψ_n(g). This shows that a maximizer of ψ_n, if it exists, has to take the form exp(h̄_y) on int(C_n).
Next, we can also show that we can restrict attention to the case f ∈ F_0. For any upper-semicontinuous g, assume ∫_{T_{p+2}} g(x) dν(x) = c < ∞. By setting f = g/c, we obtain f ∈ F_0 and ψ_n(f) ≥ ψ_n(g), with equality attained only when c = 1.
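A one-line verification of this normalization step, assuming (as in the adjusted criterion of Cule et al., 2010) that ψ_n(f) = (1/n) Σ_{i=1}^n log f(X_i) − ∫ f dν + 1, so that ∫ (g/c) dν = 1 gives:

```latex
\psi_n\!\left(\tfrac{g}{c}\right) - \psi_n(g)
  = \Big(\tfrac{1}{n}\sum_{i=1}^n \log g(X_i) - \log c\Big)
    - \Big(\tfrac{1}{n}\sum_{i=1}^n \log g(X_i) - c + 1\Big)
  = c - 1 - \log c \;\ge\; 0,
```

with equality if and only if c = 1, since c − 1 − log c is nonnegative and vanishes only at c = 1.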
For the existence of a maximizer of l(f), it only remains to show that ψ_n has a maximizer f = exp(h̄_y) ∈ F_0 with (37). This is seen as follows. For an arbitrary function f = exp(h̄_y) with (37), let X_max = arg max_{X_i} g(X_i) and O_max be one of the nonnegative orthants of T_{p+2}.

As mentioned in the main text, finding the maximum and minimum values at the origin in the first step of this approximation algorithm is not a simple optimization problem.
For example, the problem of finding the maximum value at the origin among all geodesics between points in two specific orthants, say O_1 and O_2, can be written in the following way: given points {(x_i, y_i)} ⊂ O_1 × R and {(p_j, q_j)} ⊂ O_2 × R,

max_{µ, λ}  ( ‖Σ_i µ_i x_i‖ · Σ_j λ_j q_j + ‖Σ_j λ_j p_j‖ · Σ_i µ_i y_i ) / ( ‖Σ_i µ_i x_i‖ + ‖Σ_j λ_j p_j‖ )

s.t.  Σ_i µ_i = 1,  Σ_j λ_j = 1,  0 ≤ µ_i, λ_j ≤ 1,  and the geodesic between Σ_i µ_i x_i and Σ_j λ_j p_j is a cone path. (46)

Although the last, nonconvex constraint can be removed when we are considering some combinations of two orthants, the objective function is also nonconvex. The next is an example that shows nonconvexity in a very simple case.
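For intuition, the quantity maximized in (46) is the value at the origin of the linear interpolation of function values along a cone path, under the assumption that the cone path from a to b crosses the origin at parameter ‖a‖/(‖a‖ + ‖b‖). A sketch of this evaluation (the helper name is ours):

```python
import math

def origin_value(a, val_a, b, val_b):
    """Value at the origin of the linear interpolation of function values
    along a cone path from point a (value val_a) to point b (value val_b)
    lying in different orthants. The path reaches the origin at parameter
    t = ||a|| / (||a|| + ||b||)."""
    norm_a = math.hypot(*a)
    norm_b = math.hypot(*b)
    t = norm_a / (norm_a + norm_b)
    return (1 - t) * val_a + t * val_b
```

Equivalently, this equals (‖b‖·val_a + ‖a‖·val_b)/(‖a‖ + ‖b‖); the ratio of norms of the weighted combinations is what makes the objective nonconvex in (µ, λ).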

Figure 1: Depiction of tree space. (Left): one-dimensional tree space T_3, consisting of three half-lines, each corresponding to a topology. (Center): part of two-dimensional tree space. (Right): Petersen graph. Each edge represents a two-dimensional orthant in T_4, with endpoints representing the axes constituting its boundaries. Vertices corresponding to axes are indexed from 0 to 9 for convenience.

Figure 2: Illustration of Example 6 and Example 7. The two trees depicted are the topologies that the two orthants represent. Geodesics between these orthants are always cone paths.

Nye et al. (2014) considered a diffusion process on T_3 and defined a natural generalization of the Gaussian density as a solution to the heat equation. If the diffusion starts from the point

Figure 3: (Left): An example of how three lineages inside a species tree coalesce. T represents the length of the internal edge in the species tree. (Center): The probability density in the orthants O_1 and O_2. (Right): The logarithm of the probability density in the orthants O_1 and O_2.

Figure 4: (Left): embeddings of three connected orthants into the Euclidean space R^2. (Center): an illustration that the path pq goes under the segment r_1 r_2. The three blue points are included in the blue plane. (Right): the central figure seen from the direction of the y-axis.

Figure 8: Clustered points displayed in the Petersen graph (Figure 1) with labels. (Left): data points with markers indicating the true clusters to which they belong. (Center): clustering results using the log-concave mixture (89% accuracy). (Right): clustering results using k-means++ (77% accuracy). Note that closeness in this graph does not necessarily mean closeness in T_4.
Thus, the set T is open. For any α ∈ R, the intersection of the two open sets int(hypo(g|_{int(S)})) and {(x, µ) | µ > α} is the open set {(x, µ) | x ∈ int(S), α < µ < g(x)}, and thus its projection {x ∈ int(S) | g(x) > α} is open. This implies the lower-semicontinuity, and thus the continuity, of g on int(S).