1 Introduction

Inference of phylogeny is one of the key problems in biology. Phylogenetic trees represent the evolutionary histories of taxa. Methods of phylogenetic reconstruction from present-day gene data are highly developed, and trees can now be reconstructed easily under various models and inference methods (Felsenstein 2004).

However, due to the uncertainty of inference, incomplete lineage sorting, or irregular biological processes such as horizontal transfer, different gene loci usually indicate different evolutionary histories (Reid et al. 2014). The classical approach to this problem is to find a consensus tree (Bryant 2003), a single tree that summarizes, according to some rule, the multiple trees inferred from different loci. Billera et al. (2001) introduced the space of phylogenetic trees with n leaves, tree space, which is a geodesic metric space with nonpositive curvature. Recent research has shifted to the statistical analysis of sets of trees in such spaces, enabling a geometrical perspective. These efforts include point estimation by the Fréchet mean (Bačák 2014a; Sturm 2003), principal component analysis (Nye 2011), outlier detection using a kernel density estimator (Weyenberg et al. 2014, 2017), and construction of confidence sets (Willis 2019).

Since numerous factors affect inferred phylogenetic trees, a parametric specification of the distribution of inferred trees is difficult, and the risk of model misspecification is high. In this sense, a nonparametric approach, which generally places fewer constraints on the distribution, is desirable. The kernel density estimator proposed in Weyenberg et al. (2014, 2017) is designed for this purpose.

In Euclidean space, log-concave density estimation has emerged as a shape-constrained, nonparametric method of density estimation. Log-concave densities are the class of probability densities whose logarithms are concave functions. Compared to classical smoothing approaches such as kernel density estimation, in which we need to specify hyperparameters such as bandwidths, log-concave density estimation has the advantage that it is fully automatic. Concretely, it has been shown that the maximum likelihood estimator exists in both the one-dimensional and multi-dimensional cases and that its calculation can be reduced to an N-dimensional convex optimization problem for N sample points (Cule et al. 2010).

In this paper, we show that the maximum likelihood estimator of log-concave density in tree space exists under some conditions and that the estimation can be implemented in the one-dimensional case. Although we do not derive an algorithm for multidimensional cases due to the difficulty of finding the closure of convex hulls, we also show that we can approximately calculate the maximum likelihood estimator in the two-dimensional case.

The remaining sections are organized as follows. In Sect. 2, we review basic concepts of tree space and define some concepts of convex analysis in Hadamard spaces. Section 3 presents our main results for maximum likelihood estimation in tree space, in one and multiple dimensions. Section 4 explains how to calculate the maximum likelihood estimator in the one-dimensional and two-dimensional cases. Section 5 shows the results of simulation studies: we compare the performance of density estimation with the kernel density estimator, and we compare clustering with a mixture of log-concave densities to the k-means++ approach using the Fréchet mean.

2 Preliminaries

2.1 Phylogenetic trees and tree space

A phylogenetic tree is modeled as a tree with labels only on the leaves. We call a tree with \(n+1\) labeled leaves an n-tree. An n-tree has one root leaf representing the common ancestor and n leaves representing present taxa. Internal edges are edges that are not directly connected to leaves. In binary n-trees, the number of internal edges is \(n-2\). Nonbinary trees have fewer internal edges since they can be obtained by contracting some internal edges of binary trees. The number of binary n-tree topologies is \((2n-3)!!\) (Felsenstein 1978). Note that edges in different trees are identified if they partition the leaves in the same way. If each internal edge in an n-tree has a positive length, we call it a metric n-tree.

Billera et al. (2001) constructed the space of metric n-trees as follows. First, each binary tree topology is modeled as a positive Euclidean orthant of dimension \(n-2\), with each axis corresponding to the length of an internal edge of that topology. These orthants are glued together along the axes representing the same edges. Nonbinary trees constitute the boundaries of these orthants since they are obtained by contracting some internal edges. In particular, the n-tree without any internal edges is the point located at the center of tree space, connected to every orthant. We call this point the origin in this paper. The space of metric n-trees obtained in this way is called tree space and is denoted as \({\mathcal {T}}_n\).

In this paper, we mainly discuss the results in the one-dimensional or two-dimensional case. By dimension, we mean the dimension of each orthant; thus, by the p-dimensional tree space, we mean the space \({\mathcal {T}}_{p+2}\). In the one-dimensional tree space, we only have \(3!! = 3\) topologies, and each orthant representing a topology is a half-line connected to the other two at the origin, the point representing a trifurcating tree (Fig. 1). Two-dimensional tree space \({\mathcal {T}}_4\) has \(5!! = 15\) topologies, each orthant being a nonnegative quadrant of the Euclidean plane, and three different orthants are connected to each axis (Fig. 1). It is known that if we represent each axis as a vertex and each orthant as an edge connecting two vertices, the whole space \({\mathcal {T}}_4\) can be depicted graphically as a Petersen graph, as shown in Fig. 1. For details, see Billera et al. (2001) or Lubiw et al. (2020), for example.

Fig. 1 Depiction of tree space. (Top): one-dimensional tree space \({\mathcal {T}}_3\). (Center): part of two-dimensional tree space. (Bottom): Petersen graph. Vertices corresponding to axes are indexed from 0 to 9 for convenience

2.2 Geometry of tree space

Tree space can be readily formulated as a metric space. Since it is a union of Euclidean orthants, we can simply define the distance between two points in the same orthant as the usual Euclidean distance. For two points in distinct orthants, we set the distance between them as the infimum of the lengths of paths with endpoints at those two points. The infimum is attained since tree space consists of a finite number of orthants, and once the sequence of orthants a path traverses is fixed, there is a unique path of minimum length. The paths attaining these minimum distances are called geodesics, and a metric space with a geodesic between any two points is called a geodesic metric space.
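As a concrete illustration, the following minimal sketch computes geodesic distances on \({\mathcal {T}}_3\); the (orthant, coordinate) encoding of points is our own device for illustration, not part of the construction above.

```python
def dist_t3(x, y):
    """Geodesic distance on the one-dimensional tree space T_3.

    A point is encoded as a pair (orthant, coord), with orthant in {1, 2, 3}
    and coord >= 0 the distance from the origin.  Within one orthant the
    geodesic is a Euclidean segment; between different orthants it passes
    through the origin, so the distance is the sum of the two coordinates.
    """
    (ox, cx), (oy, cy) = x, y
    return abs(cx - cy) if ox == oy else cx + cy

# dist_t3((1, 0.5), (1, 2.0)) -> 1.5;  dist_t3((1, 0.5), (3, 2.0)) -> 2.5
```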

Tree space is also known to be a Hadamard space; i.e., it is a complete metric space with nonpositive curvature. See, for example, Bačák (2014b) for proof.

This property is key to many important results in this space. In Hadamard spaces, the geodesic connecting any two points is guaranteed to be unique. Owen and Provan (2011) developed an algorithm that finds geodesics in tree space in \(O(p^4)\) time, where p denotes the dimension of tree space. In tree space, geodesics are piecewise segments through sequences of orthants; in some cases they are cone paths, geodesics consisting of two segments that connect the endpoints through the origin. In particular, in two-dimensional tree space \({\mathcal {T}}_4\), whether the geodesic between two points is a cone path depends only on the “angle” between them. See, for example, Lubiw et al. (2020) for a detailed account of this property. Nonpositive curvature also yields the existence and uniqueness of the Fréchet mean. For a given sample \(X_1,\ldots ,X_N\) in a Hadamard space \(({\mathcal {H}},d)\) and weights \(w_1,\ldots , w_N > 0\), the Fréchet mean \({\bar{X}}\) is defined as the minimizer of the weighted sum of squared distances:

$$\begin{aligned} {\bar{X}} = \mathop {\mathrm {arg\,min}}\limits _{x\in {\mathcal {H}}}\sum _{i=1}^{ {N}} w_i d(x,X_i)^2. \end{aligned}$$
(1)

It is a generalization of the arithmetic mean since they coincide with uniform weights in Hilbert spaces. It is known that the Fréchet mean can be calculated using the proximal point algorithm (Bačák 2013). See, for example, Bačák (2014b) for proof of results regarding the Fréchet mean in Hadamard spaces.
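As an illustration, here is a minimal sketch that approximates the Fréchet mean on \({\mathcal {T}}_3\) by Sturm's inductive mean, repeatedly moving the running estimate a fraction \(1/k\) along the geodesic toward the next sample point. This is a simple stand-in for, not an implementation of, the proximal point algorithm of Bačák (2013); it reuses the (orthant, coordinate) encoding of the previous sketch and assumes uniform weights.

```python
import random

def geopoint(x, y, lam):
    """Point at fraction lam in [0, 1] along the geodesic from x to y in T_3."""
    (ox, cx), (oy, cy) = x, y
    if ox == oy:
        return (ox, (1 - lam) * cx + lam * cy)
    t = lam * (cx + cy)                    # arc length travelled from x
    return (ox, cx - t) if t <= cx else (oy, t - cx)

def frechet_mean_t3(sample, sweeps=200, seed=0):
    """Sturm's inductive mean: S_{k+1} = (1 - 1/(k+1)) S_k + (1/(k+1)) X."""
    rng = random.Random(seed)
    m, k = sample[0], 1
    for _ in range(sweeps):
        for x in rng.sample(sample, len(sample)):  # one shuffled pass over the data
            k += 1
            m = geopoint(m, x, 1.0 / k)
    return m
```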

2.3 Convex sets and concave functions on Hadamard spaces

In this section, we define several concepts of convex sets and concave functions on Hadamard spaces.

First, we define convex sets as in Bačák (2014b). Let \(({\mathcal {H}},d)\) be a Hadamard space. For any points \(x,y \in {\mathcal {H}}\), we use the notations \(\gamma _{x,y}\) or [x, y] to denote the unique geodesic between them. Also, for \(\lambda \in [0,1]\), we use the notations \(\gamma _{x,y}(\lambda )\) or \((1-\lambda ) x + \lambda y\) to denote the point on the geodesic [x, y] at distance \(\lambda d(x,y)\) from x. We say \(A\subseteq {\mathcal {H}}\) is convex if for any points \(x,y\in A\), \([x,y] \subseteq A\).

A function \(\psi :{\mathcal {H}}\rightarrow [-\infty , \infty ]\) is concave if its hypograph \({{{\,\textrm{hypo}\,}}\psi } = \{(x,\mu ) \in {\mathcal {H}} \times {\mathbb {R}}\mid \mu \le \psi (x)\}\) is convex. Note that a product of Hadamard spaces is also a Hadamard space, and thus, in particular, \({\mathcal {H}}\times {\mathbb {R}}\) is also a Hadamard space.

Given any set S in \({\mathcal {H}}\), the convex hull of S is the smallest convex set in \({\mathcal {H}}\) containing S. This set exists because the intersection of (possibly infinitely many) convex sets is convex and \({\mathcal {H}}\) itself is convex. We denote the convex hull of S as \({{\,\textrm{conv}\,}}S\). In Hadamard spaces, it is known that the convex hull can be written as in the following lemma.

Lemma 1

(Bačák (2014b), Lemma 2.1.8) For \(S\subseteq {\mathcal {H}}\), put \(D_0 = S\) and for \(k\in {\mathbb {N}}\), recursively define \(D_k\) by \(D_k = \{x\in {\mathcal {H}} \mid x \in [y,z] \text { for some }y,z\in D_{k-1}\}\). Then,

$$\begin{aligned} {{\,\textrm{conv}\,}}S = {\bigcup _{k=0}^\infty } {D_k} . \end{aligned}$$
(2)

This lemma indicates that the convex hull is obtained by taking geodesics between points infinitely many times. Next, we define the important concept of the concave hull of a function. Given any function \(\psi \) on \({\mathcal {H}}\), the concave hull of \(\psi \), \({{\,\textrm{conc}\,}}\psi \), is defined as the least concave function minorized by \(\psi \). Its existence and uniqueness in Euclidean space is a well-known result; see chapter 5 of Rockafellar (1970) for a detailed account. Here we give the existence and uniqueness in the Hadamard case.

Lemma 2

For any function \(\psi :{\mathcal {H}}\rightarrow [-\infty , \infty ]\), the concave hull \({{\,\textrm{conc}\,}}\psi \) exists and is unique. Furthermore, \({{\,\textrm{conc}\,}}\psi \) can be written as follows:

$$\begin{aligned} {{\,\textrm{conc}\,}}\psi (x) = \sup \{\mu \mid (x,\mu )\in {{\,\textrm{conv}\,}}({{\,\textrm{hypo}\,}}\psi )\}. \end{aligned}$$
(3)

The proof is given in the supplementary material. An important implication of Lemma 2 is how to derive the concave hull of a function: calculate the convex hull of its hypograph, and take the supremum.

Another condition usually assumed together with concavity is upper-semicontinuity. A function \(\psi :{\mathcal {H}}\rightarrow [-\infty ,\infty ]\) is upper-semicontinuous at x if \(\limsup _{y\in {\mathcal {H}}:d(x,y)\rightarrow 0} \psi (y) \le \psi (x)\). If \(\psi \) is upper-semicontinuous at all points \(x\in {\mathcal {H}}\), it is upper-semicontinuous on \({\mathcal {H}}\); this is equivalent to the hypograph of \(\psi \) being closed.

Similar to concave hulls, we can define the upper-semicontinuous hull of a function \(\psi \) as the least upper-semicontinuous function minorized by \(\psi \). That this function exists follows from the fact that (possibly infinite) intersections of closed sets are closed. Furthermore, it is possible to define the upper-semicontinuous concave hull of a function \(\psi \) as the least upper-semicontinuous concave function minorized by \(\psi \). Concretely, this function is derived by taking the intersection of the hypographs of all upper-semicontinuous concave functions minorized by \(\psi \). It can also be characterized as the function whose hypograph is \({{\,\textrm{cl}\,}}({{\,\textrm{conv}\,}}{{\,\textrm{hypo}\,}}\psi )\), where \({{\,\textrm{cl}\,}}\) is the closure operator. We denote the upper-semicontinuous concave hull of a function \(\psi \) as \({\overline{{{\,\textrm{conc}\,}}}}\psi \).

2.4 Log-concave density

A function \(f: {\mathcal {H}}\rightarrow [0,\infty ]\) is called log-concave if its logarithm \(\log f: {\mathcal {H}}\rightarrow [-\infty ,\infty ]\) is concave. Figure 2 illustrates two example densities on \({\mathcal {T}}_3\): a Laplace-like density \(f_l(x) \propto \exp (-d(x,x_0))\) and a t-like density \(f_t(x)\propto (1+d(x,x_0)^2/2)^{-3/2}\). From the form of the log-densities, we can see that \(f_l\) is an example of a log-concave density while \(f_t\) is an example of a density that is not log-concave.

Fig. 2 Illustration of densities \(f_l\) and \(f_t\) (Top) and log-densities \(\log f_l\) and \(\log f_t\) (Bottom). In each figure, the right side of the origin corresponds to the orthant \(O_1\) containing \(x_0\). Since these densities take exactly the same values on the other two orthants, the left side of the origin represents both of those orthants, \(O_2\) and \(O_3\)

The class of log-concave probability densities is well studied in Euclidean spaces. In particular, it contains many commonly used densities, such as the multivariate normal, the exponential, and the gamma distribution with shape parameter greater than or equal to 1. A recent review of this topic can be found in Samworth (2018).

In terms of the estimation of log-concave densities in Euclidean spaces, Cule et al. (2010) showed that for an i.i.d. sample \(X_1,\ldots ,X_N~(N\ge p+1)\) from a density on p-dimensional Euclidean space \({\mathbb {R}}^p\), the log-concave maximum likelihood estimator exists with probability 1. Here, the log-concave maximum likelihood estimator is the log-concave density f that maximizes the log-likelihood function:

$$\begin{aligned} {l_N(f) = \sum _{i=1}^N \log f(X_i). }\end{aligned}$$
(4)

We outline their key findings about the form of the log-concave maximum likelihood estimator here since a similar argument can be applied in the case of tree space as well.

The authors show that, by introducing a Lagrange multiplier term, the maximization of (4) can be reformulated as the maximization of the following objective function over log-concave functions:

$$\begin{aligned} {{\mathcal {L}}_N(f) = \frac{1}{N}\sum _{i=1}^N \log f(X_i) - \int _{{\mathbb {R}}^p} f(x)dx. }\end{aligned}$$
(5)

The maximization of (5) yields a probability density \({\hat{f}}_N\) whose logarithm is a “tent” function: \(\log {\hat{f}}_N\) is a piecewise linear concave function that has its “knots” at some selected sample points and takes finite values only on the convex hull of the sample points (Fig. 3). Intuitively, this holds because among all log-concave densities f satisfying \(f(X_i) \ge {\hat{f}}_N(X_i)~(i=1,\ldots ,N)\), the “tent” function minimizes the Lagrange term in equation (5).

Fig. 3 An example of a “tent” function on \({\mathbb {R}}\). \(X_1,\ldots ,X_5\) are the given sample points

For any \(y = (y_1,\ldots ,y_N)^\top \in {\mathbb {R}}^N\), let \(\psi _y:{\mathbb {R}}^p \rightarrow [-\infty ,\infty ]\) denote the following function:

$$\begin{aligned} {\psi _y(x) = {\left\{ \begin{array}{ll} y_i &{} \text {if } x=X_i \\ -\infty &{} \text {otherwise} \end{array}\right. }. } \end{aligned}$$
(6)

Then, using the terminology of Sect. 2.3, any tent function can be characterized as the concave hull of \(\psi _y\) for a suitable \(y\in {\mathbb {R}}^N\). Therefore, the log-concave maximum likelihood estimator in Euclidean spaces is included in the following function class:

$$\begin{aligned} \{\exp ({{\,\textrm{conc}\,}}\psi _y(x)) \mid y\in {\mathbb {R}}^N\}. \end{aligned}$$
(7)

The problem of finding the log-concave maximum likelihood estimator can now be turned into that of finding a suitable N-dimensional vector y.
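For intuition, the following sketch evaluates a one-dimensional tent function on \({\mathbb {R}}\): the concave hull \({{\,\textrm{conc}\,}}\psi _y\) is the upper convex hull of the points \((X_i, y_i)\), interpolated piecewise linearly and equal to \(-\infty \) outside the convex hull of the sample. The function name and the monotone-chain construction are our own illustration; distinct \(X_i\) are assumed.

```python
import numpy as np

def tent(X, y, xs):
    """Evaluate conc(psi_y) at the query points xs (1D Euclidean case)."""
    order = np.argsort(X)
    Xs, ys = np.asarray(X, float)[order], np.asarray(y, float)[order]
    hull = []  # indices of upper-hull vertices, built by a monotone chain
    for i in range(len(Xs)):
        while len(hull) >= 2:
            x1, y1 = Xs[hull[-2]], ys[hull[-2]]
            x2, y2 = Xs[hull[-1]], ys[hull[-1]]
            # drop the middle point if it lies on or below the new chord
            if (y2 - y1) * (Xs[i] - x1) <= (ys[i] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(i)
    # piecewise-linear interpolation between hull knots; -inf outside
    return np.interp(xs, Xs[hull], ys[hull], left=-np.inf, right=-np.inf)

# tent([0.0, 1.0, 2.0], [0.0, 2.0, 0.0], np.array([0.5, 1.5])) -> [1.0, 1.0]
```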

2.5 Probability measure on tree space

In this section, we introduce one of the possible constructions of a probability measure on tree space. This construction is due to Willis (2019).

Let \(({\mathcal {T}}_{n}, d)\) be tree space. Note that this space consists of \((2n-3)!!\) Euclidean \((n-2)\)-dimensional nonnegative orthants. Now, one can write any set A in \({\mathcal {T}}_{n}\) as a union of Euclidean sets, \(A = \cup _{i=0}^{(2n-3)!!} A_i\). Here, \(A_0\) denotes the set of trees in A that lie on the boundary of \({\mathcal {T}}_n\), and \(A_i\) denotes the set of trees in A that lie in the i-th positive orthant. Let \(\nu _B\) be the Lebesgue measure on \({\mathbb {R}}^{n-2}\), and define \(\nu \) by \(\nu (A) = \sum _{i=1}^{(2n-3)!!} \nu _B(A_i)\). This \(\nu \) is sigma-additive, and by completion of measures, \(\nu \) becomes a complete measure.

In the remaining sections, we consider densities with respect to the measure \(\nu \).

3 Existence and uniqueness of maximum likelihood estimator

3.1 One-dimensional case

In this section, we consider the simplest tree space, \({\mathcal {T}}_3\). Let \({\mathcal {F}}_0^{(1)}\) be the set of log-concave probability densities with respect to the base measure \(\nu \) on this one-dimensional tree space. The following theorem indicates that maximum likelihood estimation in \({\mathcal {T}}_3\) could be performed in the same manner as in the Euclidean case.

Theorem 3

Let \((X_1, \ldots , X_N)\) be an independent sample from some density on \({\mathcal {T}}_3\), with \(N \ge 2\). Then, with probability 1, the maximum likelihood estimator \({\hat{f}}_N\) of f exists and is unique: i.e., \({\hat{f}}_N\) is the unique maximizer in \({\mathcal {F}}_0^{(1)}\) of the log-likelihood function \(l_N(f) = \sum _{i=1}^N \log f(X_i)\).

The proof is given in the supplementary material.

Remark 1

Although our main interest is in tree space, it is simple to see that the above result also applies to the space of \(K\in {\mathbb {N}}\) half-lines connected at the origin.

3.2 Multidimensional case

In this section, we consider the general p-dimensional tree space, \({\mathcal {T}}_{p+2}\). We let \(\bar{{\mathcal {F}}}_0^{(p)}\) be the set of upper-semicontinuous log-concave densities in \({\mathcal {T}}_{p+2}\) and \(l_N(f) = \sum _{i=1}^N \log f(X_i)\) denote the log-likelihood function. The maximum likelihood estimator in this case, unlike in the cases of the one-dimensional tree space and Euclidean space, might not even exist. Before deriving a sufficient condition for the existence of the MLE, we first show that when the maximizer exists, its uniqueness can be stated as in the following theorem.

Theorem 4

Let \((X_1, \ldots , X_N)\) be a sample from some density on \({\mathcal {T}}_{p+2}\). Suppose that a maximizer of \(l_N(f)\) in \(\bar{{\mathcal {F}}}_0^{(p)}\) exists. Then the maximizer is unique \(\nu \)-almost everywhere.

Proof

Suppose \(f_1, f_2\in \bar{{\mathcal {F}}}_0^{(p)}\) both maximize \(l_N(f)\). If we put \(f(x)=\{f_1(x)f_2(x)\}^{1/2}/\int _{{\mathcal {T}}_{p+2}}\{f_1(z)f_2(z)\}^{1/2}d\nu (z)\), f is log-concave and

$$\begin{aligned} l_N(f)&=\frac{1}{2} \sum _{i=1}^N \{\log f_1(X_i) + \log f_2(X_i)\} \nonumber \\&\quad - N\log \int _{{\mathcal {T}}_{p+2}}\{f_1(z)f_2(z)\}^{1/2}d\nu (z) \nonumber \\&\ge l_N(f_1) - N\log \int _{{\mathcal {T}}_{p+2}}\frac{f_1(z)+f_2(z)}{2}d\nu (z) \nonumber \\&= l_N(f_1). \end{aligned}$$
(8)

Here, the inequality is the inequality of arithmetic and geometric means, and the final equality holds because \(\int _{{\mathcal {T}}_{p+2}}\{f_1(z)+f_2(z)\}/2 \,d\nu (z) = 1\). Equality in the arithmetic–geometric step holds if and only if \(f_1(x) = f_2(x)~\nu \mathrm {-a.e.}\) Since \(f_1\) is a maximizer, \(l_N(f)\le l_N(f_1)\), so equality must hold throughout, and hence \(f_1 = f_2~\nu \mathrm {-a.e.}\) \(\square \)

The next theorem provides a sufficient condition for sample points such that the maximum likelihood estimator exists and is unique almost everywhere. For a set S, let \({{\,\textrm{int}\,}}(S)\) denote the interior of the set.

Theorem 5

Let \((X_1, \ldots , X_N)\) be a sample from a density f on \({\mathcal {T}}_{p+2}\), \(N\ge p+1\). Let \(C_N = {{\,\textrm{conv}\,}}\{X_1, \ldots , X_N\}\), and suppose the following condition holds:

  1.

    For each \(X_i\), there exists a nonnegative orthant \(O_j\) such that \(X_i\in O_j\) and \({{\,\textrm{cl}\,}}(C_N)\cap O_j\) is a p-dimensional set: i.e., \(\nu ({{\,\textrm{cl}\,}}(C_N)\cap O_j) > 0\).

Then, the maximum likelihood estimator \({\hat{f}}_N\) in \(\bar{{\mathcal {F}}}_0^{(p)}\) exists: i.e., \({\hat{f}}_N\in \bar{{\mathcal {F}}}_0^{(p)}\) is a maximizer of the log-likelihood function \(l_N(f) = \sum _{i=1}^N \log f(X_i)\). Moreover, it is unique outside some set of \(\nu \)-measure zero that does not intersect \({{\,\textrm{int}\,}}(C_N)\).

Note that this sufficient condition holds with probability approaching one as N goes to infinity. This can be seen as follows. With probability 1, each \(X_i\) is included in at least one nonnegative orthant that has positive probability under the true distribution. For any orthant with positive probability under the true density, the probability that it contains fewer than \(p+1\) sample points approaches zero as \(N\rightarrow \infty \).

The proof is given in the supplementary material. As in the Euclidean case, the proofs of Theorem 3 and Theorem 5 show that the maximum likelihood estimator can be characterized as the maximizer of the following function \({\mathcal {L}}_N(f)\) over the class of upper-semicontinuous log-concave functions:

$$\begin{aligned} {{\mathcal {L}}_N(f) = \frac{1}{N}\sum _{i=1}^N \log f(X_i) - \int _{{\mathcal {T}}_{p+2}}f(x)d\nu (x). } \end{aligned}$$
(9)

In addition, as in the Euclidean case, the maximizer of \({\mathcal {L}}_N(f)\) has the form \(\exp ({\bar{h}}_y(x))\) for a suitably selected \(y\in {\mathbb {R}}^N\), where \({\bar{h}}_y= {\overline{{{\,\textrm{conc}\,}}}} \psi _y\) and \(\psi _y: {\mathcal {T}}_{p+2}\rightarrow [-\infty ,\infty ]\) is defined by equation (6). We call this function \({\bar{h}}_y\) the upper-semicontinuous least concave function with \({\bar{h}}_y(X_i)\ge y_i\). As a result, we need to find \(y\in {\mathbb {R}}^N\) that maximizes the function \(\rho _N: y\mapsto {\mathcal {L}}_N(\exp ({\bar{h}}_y))\).

Also, note that in the one-dimensional case, the concave hull of \(\psi _y\) is automatically upper-semicontinuous and \({{\,\textrm{conc}\,}}\psi _y = {\overline{{{\,\textrm{conc}\,}}}}\psi _y\). We denote this concave hull by \(h_y = {{\,\textrm{conc}\,}}\psi _y\) and call it the least concave function with \(h_y(X_i) \ge y_i\).

In multidimensional cases, the uniqueness of the log-concave maximum likelihood estimator is only assured in a \(\nu \)-almost everywhere sense outside \({{\,\textrm{int}\,}}(C_N)\). This is because adding a new point outside the existing convex hull sometimes does not increase the measure of the convex hull of the new set of points. The following is an example.

Example 1

(Nonuniqueness) Suppose we have three points \(X_1, X_2, X_4 \in {\mathcal {T}}_4\) as depicted in Fig. 4. Then \({{\,\textrm{conv}\,}}\{X_1, X_2, X_4\}\) is the triangle \(X_1X_2X_4\). Suppose that we add \(X_3\) and consider \({{\,\textrm{conv}\,}}\{X_1, X_2, X_3, X_4\}\). This set is the union of the triangle \(X_1X_2X_4\) and the segment \(X_3X_4\), so the measure has not increased: i.e., \(\nu ({{\,\textrm{conv}\,}}\{X_1, X_2, X_4\}) = \nu ({{\,\textrm{conv}\,}}\{X_1, X_2, X_3, X_4\})\). Suppose that we have obtained \(f\in \bar{{\mathcal {F}}}_0^{(p)}\) that satisfies \(f(x)>0~(x\in {{\,\textrm{conv}\,}}\{X_1, X_2, X_4\})\) and \(f(x)=0~(x\not \in {{\,\textrm{conv}\,}}\{X_1, X_2, X_4\})\). Then “extending” the support of f to include the segment \(X_3X_4\) while preserving log-concavity would not change the value of the likelihood, and the extension of f would also be a density. Thus, \(\nu \)-almost everywhere uniqueness holds, but strict uniqueness does not hold in this case.

Fig. 4 Illustration of Example 1 and Example 2. The two trees depicted are the topologies that the two orthants represent. Geodesics between them are always cone paths

Cases in which the maximum likelihood estimator does not exist occur when the convex hull has a measure-zero intersection with some orthant.

Example 2

(Nonexistence) Suppose we have three sample points \(X_1, X_2, X_3 \in {\mathcal {T}}_4\) as depicted in Fig. 4. Geodesics between points in these two orthants are necessarily cone paths, and it is clear that the assumption of Theorem 5 is not satisfied. Now, assume that the segments \(X_1X_4\) and \(X_2X_4\) have the same length and that the segment \(X_3X_4\) is three times as long. Let \(\log f\) be an affine-like function with \(\log f(X_3) = y+\Delta \) and \(\log f(X_2) = \log f(X_1) = y-\Delta /3\). Then \(\log f(X_4) = y\), and letting \(z(\Delta ) = (y-\Delta /3, y-\Delta /3, y+\Delta )^\top \),

$$\begin{aligned} \rho _N(z(\Delta )) = y +\frac{\Delta }{9} - \int _{C_N} f(x)\,d\nu (x). \end{aligned}$$
(10)

One can see that as \(\Delta \) goes to \(+\infty \), the second term grows linearly to \(+\infty \) while the integral term converges to zero. Thus, \(\rho _N\) is increasing and divergent in \(\Delta \) and does not attain a maximum. Therefore, the maximum likelihood estimator does not exist in this case.

3.3 Densities that bend at the boundary

Although the nonparametric class of log-concave densities seems large enough to model various densities, there are naturally arising densities that are not log-concave around the boundary. In this section, we focus on the simplest one-dimensional tree space \({\mathcal {T}}_3\) and discuss a way to incorporate densities that are not log-concave at the origin. Let \(O_1,O_2,O_3\) represent the three half-lines, and let the restriction of a density \(f:{\mathcal {T}}_3\rightarrow {\mathbb {R}}\) to each half-line \(O_i\) be \(f_i\), a function on \({\mathbb {R}}_{\ge 0}\). Denote the exterior derivative at the origin by \(\partial f_i / \partial n_i(0)\), defined as:

$$\begin{aligned} \frac{\partial f_i}{\partial n_i}(0) = \lim _{h\rightarrow +0} \frac{f_i(0)-f_i(h)}{h}. \end{aligned}$$
(11)

Nye and White (2014) considered a diffusion process on \({\mathcal {T}}_3\) and defined a natural generalization of the Gaussian density as a solution to the heat equation. If the diffusion starts from the point \(x_0\in {\mathbb {R}}_{>0}\) in the orthant \(O_1\), the probability density \(f^{(D)}\) at time t in each orthant can be written as follows:

$$\begin{aligned} f^{(D)}_1(x; x_0, t)&= \phi (x;x_0,t) - \frac{1}{3}\phi (x;-x_0,t) \\ f^{(D)}_i(x; x_0, t)&= \frac{2}{3}\phi (x;-x_0,t) \quad (i\in \{2,3\}) \end{aligned}$$
(12)

Here, \(\phi (x;\mu ,\sigma ^2)\) denotes the normal density on \({\mathbb {R}}\) with mean \(\mu \) and variance \(\sigma ^2\).

Density (12) is not log-concave since it “bends” at the origin (Fig. 5). In fact, the exterior derivatives are

$$\begin{aligned} \frac{\partial f^{(D)}_1}{\partial n_1}(0; x_0, t)&= -\frac{4}{3}\phi ^\prime (0;x_0,t), \\ \frac{\partial f^{(D)}_i}{\partial n_i}(0; x_0, t)&= \frac{2}{3}\phi ^\prime (0;x_0,t) \quad (i\in \{2,3\}). \end{aligned}$$
(13)

These values make the density non-log-concave around the origin.

Fig. 5 (Top): Illustration of densities \(f^{(D)}(x;x_0,1)\) and \(f^{(C)}(x; 1)\). (Bottom): Log-densities \(\log f^{(D)}(x; x_0, 1)\) and \(\log f^{(C)}(x; 1)\). \(x_0\) is the point in \(O_1\) that is one unit away from the origin. The right side of the origin corresponds to \(O_1\) and the left side corresponds to both \(O_2\) and \(O_3\)

Another example can be constructed by considering a simple multispecies coalescent process of three lineages from three different species. Suppose the species tree topology corresponds to the orthant \(O_1\) and let T denote the length of the internal edge of the species tree. Then, the probability density of gene trees \(f^{(C)}\) can be derived as follows (see supplementary material for the detailed derivation):

$$\begin{aligned} f^{(C)}_1(x; T)&= -\frac{1}{6}\exp (-x-T) + \frac{1}{2}\exp (-|x-T|) \\ f^{(C)}_i(x; T)&= \frac{1}{3}\exp (-T-x) \quad (i\in \{2,3\}). \end{aligned}$$
(14)

The exterior derivatives are

$$\begin{aligned} \frac{\partial f^{(C)}_1}{\partial n_1}(0; T)&= -\frac{2}{3}\exp (-T), \\ \frac{\partial f^{(C)}_i}{\partial n_i}(0; T)&= \frac{1}{3}\exp (-T) \quad (i\in \{2,3\}), \end{aligned}$$
(15)

so log-concavity does not hold.

In these two examples, the densities restricted to each half-line \({\mathbb {R}}_{\ge 0}\) are log-concave. The only non-log-concave part is at the origin, where the one-sided slopes of the density have different absolute values. Concretely, the following relationship holds:

$$\begin{aligned} \frac{\partial f_1}{\partial n_1}(0) \le 0, ~~~~~\frac{\partial f_1}{\partial n_1}(0) = -2\frac{\partial f_{j}}{\partial n_{j}}(0)~~~~~j\in \{2,3\}. \end{aligned}$$
(16)

We can also see that both densities satisfy the following Kirchhoff-type condition:

$$\begin{aligned} \sum _{i=1}^3 \frac{\partial f_i}{\partial n_i}(0) = 0. \end{aligned}$$
(17)
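Indeed, substituting (13) and (15) into (17) gives a quick check:

$$\begin{aligned} \sum _{i=1}^3 \frac{\partial f^{(D)}_i}{\partial n_i}(0)&= -\frac{4}{3}\phi ^\prime (0;x_0,t) + 2\cdot \frac{2}{3}\phi ^\prime (0;x_0,t) = 0, \\ \sum _{i=1}^3 \frac{\partial f^{(C)}_i}{\partial n_i}(0)&= -\frac{2}{3}\exp (-T) + 2\cdot \frac{1}{3}\exp (-T) = 0. \end{aligned}$$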

These densities that bend at the origin can be incorporated into the model. We relax the convexity condition on \({{\,\textrm{hypo}\,}}(\log f)\) to the following:

  1.

    \({{\,\textrm{hypo}\,}}(\log f)\cap (O_i\times {\mathbb {R}})\) is convex for all \(i\in \{1,2,3\}\).

  2.

    For \(x_1\in O_i\) and \(x_2\in O_j\) with \(i\ne j\), let \(\lambda =d(x_1,0)/d(x_1,x_2)\), and

    $$\begin{aligned} y_0 =&\min \left\{ \frac{2(1-\lambda )\log f(x_1) + \lambda \log f(x_2)}{2-\lambda }, \right. \nonumber \\&\left. \frac{(1-\lambda )\log f(x_1) + 2\lambda \log f(x_2)}{1+\lambda } \right\} . \end{aligned}$$
    (18)

    Then, \((0,y_0)\in {{\,\textrm{hypo}\,}}(\log f)\).

We denote by \({\mathcal {G}}_0\) the class of densities satisfying these two conditions. Any log-concave density f satisfies them since \(y_0 \le (1-\lambda )\log f(x_1) + \lambda \log f(x_2)\). The conditions also allow bending at the origin to a certain extent; in fact, the two densities considered in this section are included in \({\mathcal {G}}_0\).

We can now exploit arguments similar to those for the one-dimensional log-concave case in Sect. 3.1 and derive the existence and uniqueness of the maximum likelihood estimator in \({\mathcal {G}}_0\). Note that the resulting estimator is not required to satisfy the Kirchhoff-type condition.

4 Calculation of maximum likelihood estimator in one and two dimensions

Given sample points \((X_1,\ldots ,X_N)\in {\mathcal {T}}_{p+2}\), maximizing the function \(\rho _N\) requires calculating \({\bar{h}}_y={\overline{{{\,\textrm{conc}\,}}}}\psi _y\) for a given input vector \(y = (y_1,\ldots ,y_N)^\top \in {\mathbb {R}}^N\). In the one-dimensional case, this problem reduces to finding the convex hull of the points \(\{(X_1, y_1), \ldots , (X_N, y_N)\}\) in \({\mathcal {T}}_3\times {\mathbb {R}}\). We explain below that this can be reduced to a convex hull calculation in \({\mathbb {R}}^2\). In the two-dimensional case, we need to find the closure of the convex hull of the points \(\{(X_1, y_1), \ldots , (X_N, y_N)\}\) in \({\mathcal {T}}_4\times {\mathbb {R}}\). It is not known whether an exact algorithm exists, but we show below that obtaining an approximate convex hull is possible.

4.1 Calculation of the least concave functions on \({\mathcal {T}}_3\)

As discussed in Sect. 2.3, concave hulls can be found by first taking the convex hull of the hypograph and then taking the supremum. In this case, the problem is to find the convex hull \(E_N\) of a finite set of sample points in \({\mathcal {T}}_3\times {\mathbb {R}}\).

\({\mathcal {T}}_3\) is a space consisting of three half-lines \({\mathbb {R}}_{\ge 0}\) connected at the origin. If all sample points are from one orthant, the situation is the same as the Euclidean case, so we consider the case in which the sample points come from at least two orthants. Let \(y_0 = \max \{ (1-\lambda )y_i + \lambda y_j~|~ i,j\in \{1,\ldots ,N\}, \lambda \in [0,1], \gamma _{X_i,X_j}(\lambda ) = 0\}\). Then, \((0,y_0)\) must be in the convex hull \(E_N\). Furthermore, if we compute the convex hull of the sample points in each orthant together with \((0,y_0)\) and take their union, this set, say \(F_N\), is convex. Clearly \(F_N\) is included in \(E_N\); conversely, since \(F_N\) is a convex set containing the sample points, the minimality of the convex hull gives \(E_N\subseteq F_N\), so \(F_N\) in fact equals \(E_N\). Since \(y_0\) can be calculated by taking geodesics between all pairs of sample points, one can calculate \(E_N\) easily using a convex hull algorithm in the Euclidean plane \({\mathbb {R}}^2\).
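A minimal sketch of the computation of \(y_0\), reusing the (orthant, coordinate) encoding of the sketches in Sect. 2 (for points in different orthants, the geodesic meets the origin at \(\lambda = c_i/(c_i + c_j)\), where \(c_i, c_j\) are the distances from the origin):

```python
from itertools import combinations

def origin_value(points):
    """Value y_0 assigned to the origin of T_3 by the concave hull.

    points: list of ((orthant, coord), y_i) with orthant in {1, 2, 3} and
    coord > 0.  For X_i, X_j in different orthants, the chord between
    (X_i, y_i) and (X_j, y_j) crosses the origin at height
    (1 - lam) * y_i + lam * y_j with lam = c_i / (c_i + c_j); y_0 is the
    highest such chord.
    """
    y0 = float("-inf")
    for ((oi, ci), yi), ((oj, cj), yj) in combinations(points, 2):
        if oi != oj:
            lam = ci / (ci + cj)
            y0 = max(y0, (1 - lam) * yi + lam * yj)
    return y0
```

With \((0, y_0)\) appended to each orthant's point set, the per-orthant hulls can then be computed with any planar convex hull routine (e.g., scipy.spatial.ConvexHull).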

Similar arguments hold if we allow for the bend at the origin discussed in Sect. 3.3. Concretely, we only need to alter the definition of \(y_0\) to

$$\begin{aligned} y_0 =&\max \left\{ \min \left\{ \frac{\lambda y_j + 2(1-\lambda )y_i}{2-\lambda }, \frac{2\lambda y_j + (1-\lambda )y_i}{1+\lambda }\right\} \nonumber \right. \\&\left. ~|~ i < j, \lambda \in [0,1], \gamma _{X_i,X_j}(\lambda ) = 0\right\} . \end{aligned}$$

4.2 Approximation of \({\bar{h}}_y\) on \({\mathcal {T}}_4\)

In the two-dimensional case, it is not even known whether the convex hull of a finite number of points is closed. Recently, Lubiw et al. (2020) constructed an algorithm, based on linear programming, that finds the closure of the convex hull of a finite number of points in single-vertex 2D CAT(0) complexes, including two-dimensional tree space \({\mathcal {T}}_4\). Although we cannot transfer the algorithm directly to the product space of \({\mathcal {T}}_4\) and \({\mathbb {R}}\), Lubiw et al. (2020) also showed that by iteratively updating the values at the boundaries, one can construct a sequence of sets that converges to the desired convex hull. As we show below, a slight modification of this argument lets us stop at some iteration number and obtain an approximation of the convex hull.

To see this, let \(S_0\) be the set of all sample points \(\{(X_i, y_i)\mid X_i\in {\mathcal {T}}_4, y_i\in {\mathbb {R}}\}\). For simplicity, we only consider the situation where the origin is in the convex hull (if the convex hull of the sample points does not contain the origin, the situation is the same as the Euclidean case). We call a geodesic in the space \({\mathcal {T}}_4 \times {\mathbb {R}}\) a cone path if it crosses a point of the form \((0, y_j)\), where 0 denotes the origin in \({\mathcal {T}}_4\): that is, we call a geodesic a cone path when its projection onto \({\mathcal {T}}_4\) is a cone path. Also, for any nonnegative orthant O in \({\mathcal {T}}_4\), define \(S_0(O)=S_0\cap (O\times {\mathbb {R}})\), \(H_0(O) = {{\,\textrm{conv}\,}}(S_0(O))\), and \(H_0 = \cup _{O} H_0(O)\). Following the notation of Lubiw et al. (2020), we define sets \(S_k\subseteq {\mathcal {T}}_4\times {\mathbb {R}}~(k=0,1,2,\ldots )\) (which Lubiw et al. (2020) call the k-th skeletons) and \(H_k\subseteq {\mathcal {T}}_4\times {\mathbb {R}}~(k=0,1,2,\ldots )\) iteratively as follows.

First, we take the geodesics between points in \(H_{k-1}\) that are cone paths and find their values at the origin, which can be represented as \((0,y_{k,j})\). We then find the maximum and minimum of \(\{y_{k,j}\}\) and denote them by \(y_{k1}\) and \(y_{k0}\), respectively. We initialize \(T_k\) with \(S_{k-1}\cup \{(0, y_{k1}), (0, y_{k0})\}\). Although it is practically difficult to find the exact maximum and minimum values, we can use an approximation algorithm; we discuss this approximation in the supplementary material.

Now, for each pair of points in \(T_k\), take the geodesic between them and add all intersection points with the boundaries to \(T_k\). Here, by a boundary, we mean the two-dimensional plane formed by an axis of \({\mathcal {T}}_4\) and \({\mathbb {R}}\). Then, for each boundary in \({\mathcal {T}}_4\), take all points of \(T_k\) lying on it, including points at the origin, which are of the form \((0,y_{ki})\). Since each boundary is a two-dimensional plane, these points can be regarded as points in \({\mathbb {R}}^2\), and we take the usual two-dimensional Euclidean convex hull in this space. Then discard from \(T_k\) any points that are not vertices of this convex hull, and set \(S_k = T_k\). Further, let O be any nonnegative orthant in \({\mathcal {T}}_4\) and \(S_k(O)\) be the points of \(S_k\) that are also in \(O\times {\mathbb {R}}\). Define \(H_k(O) = {{\,\textrm{conv}\,}}(S_k(O))\), which can be computed easily since \(S_k(O)\) can be embedded into a nonnegative orthant of \({\mathbb {R}}^3\), and \(H_k = \cup _O H_k(O)\).

This construction is similar to the two-dimensional CAT(0) case (Lubiw et al. 2020). The differences are that here we must first determine the maximum and minimum values at the origin, and that instead of keeping only the extreme points on a one-dimensional axis, we must keep all vertices of the convex hull of the points on each two-dimensional boundary.

To validate the first step, we first show the following lemma.

Lemma 6

If \(S_{k-1}\) is included in \({{\,\textrm{conv}\,}}(S_0)\), then so is \(S_k\).

Proof

\(S_{k-1} \subseteq {{\,\textrm{conv}\,}}(S_0)\) indicates \(H_{k-1}\subseteq {{\,\textrm{conv}\,}}(S_{0})\), and thus \({{\,\textrm{conv}\,}}(H_{k-1})\subseteq {{\,\textrm{conv}\,}}(S_{0})\). By construction, \(S_k \subseteq {{\,\textrm{conv}\,}}(H_{k-1}) \subseteq {{\,\textrm{conv}\,}}(S_{0})\). \(\square \)

The next theorem ensures that \(H_k\) converges to \({{\,\textrm{conv}\,}}(S_0)\).

Theorem 7

Assume \(a,b\in H_{k-1}\). Then the geodesic from a to b, \(\gamma _{a,b}\), is included in \(H_k\). This indicates that \(\cup _k H_k = {{\,\textrm{conv}\,}}(S_0)\).

The proof is given in the supplementary material.

In Lubiw et al. (2020), a similar fact was established, and the authors proceed to reduce the problem of finding the closure of a convex hull to a linear programming problem. However, such a reduction is difficult in our case, given that we have no prior knowledge of how many vertices \({{\,\textrm{conv}\,}}(S_0)\) has on each boundary. We content ourselves here with the approximation algorithm derived from the previous theorem. Note that if we can find the exact maximum and minimum values at the origin in the first step and the sets \(S_k\) converge after a finite number of iterations T, the set \(H_T\) is exactly the desired convex hull. Furthermore, \(H_T\) is closed in this case, so the function \({\bar{h}}_y\) calculated from this set is the exact least upper-semicontinuous concave function.

4.3 Altering the objective function

Sections 4.1 and 4.2 show that we can calculate the function \({\bar{h}}_y\) for given y in the one-dimensional and two-dimensional cases, at least approximately. As shown at the beginning of this section, in order to find the maximum likelihood estimator, we need to solve an optimization problem of the form \(\max _{y\in {\mathbb {R}}^N} \rho _N(y) = {\mathcal {L}}_N(\exp ({\bar{h}}_y))\). As in the Euclidean case (Cule et al. 2010), the objective function is not a convex function of y, but by properly altering it, we can obtain a convex optimization problem. In the d-dimensional case, the new objective function \(\sigma _N(y)\) is defined as

$$\begin{aligned} \sigma _N (y) = - \frac{1}{N} \sum _{i=1}^{N} y_i + \int _{{\mathcal {T}}_{d+2}} \exp ({\bar{h}}_y(x))\,d\nu (x). \end{aligned}$$
(19)

The convexity of \({\bar{h}}_y(x)\) in y for each fixed x can be derived in exactly the same manner as in the Euclidean case (Cule et al. 2010). This ensures that \(\sigma _N\) is a convex function of y. Therefore, we can utilize ordinary convex optimization solvers to calculate the MLE.
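A minimal sketch of the resulting optimization, with the hull routines of Sects. 4.1 and 4.2 abstracted into a hypothetical evaluator eval_hbar (our own placeholder) that returns the values of \({\bar{h}}_y\) on a quadrature grid covering the support. Cule et al. (2010) use Shor's r-algorithm for the nonsmooth convex objective; a derivative-free method such as Nelder–Mead suffices for small N.

```python
import numpy as np
from scipy.optimize import minimize

def sigma_N(y, X, eval_hbar, grid):
    """Convex objective (19): -(1/N) * sum(y) + integral of exp(hbar_y).

    eval_hbar(X, y, grid) -> values of the least concave function hbar_y
    at the quadrature nodes `grid` (supplied by the hull routines of
    Sects. 4.1-4.2; `eval_hbar` is a placeholder, not a library call).
    """
    f = np.exp(eval_hbar(X, y, grid))          # exp(-inf) = 0 off the hull
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))  # trapezoid rule
    return -np.mean(y) + integral

# e.g., in the Euclidean 1D case with the `tent` evaluator of Sect. 2.4:
# res = minimize(sigma_N, y_init, args=(X, tent, grid), method="Nelder-Mead")
# the fitted log-density is then x -> tent(X, res.x, x)
```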

5 Numerical study

In this section, we give two simulation results. First, we estimate some log-concave distributions in one and two dimensions; we also estimate one-dimensional densities that bend at the origin. Second, we give an example of clustering with the log-concave mixture model.

For the first simulation, we compare results with the kernel density estimator. The following bandwidth selection algorithms are considered:

  1.

     [NN0.2]: Nearest neighbor approach adopted by Weyenberg et al. (2014), where we adaptively set the bandwidth at each center to the distance to its \(m=0.2N\)-th closest point.

  2.

     [CV]: Least squares cross-validation approach (Rudemo 1982; Bowman 1984).

  3.

     [OPT_NN]: Nearest neighbor approach with the numerically optimal proportion with respect to the integrated squared error (ISE).

  4.

     [OPT_FIX]: Fixed bandwidths selected to numerically optimize ISE.

Note that we use the true density to evaluate and optimize ISE for the last two approaches. These approaches are thus infeasible in practice, but they are included as benchmarks. ISE is numerically optimized for each sample, so these bandwidths are sample-dependent as well. NN0.2 and OPT_NN are adaptive methods that use varying bandwidths for each center, while CV and OPT_FIX use fixed bandwidths for all centers.

5.1 Estimation of one-dimensional log-concave densities

For one-dimensional examples, we consider the following two types of log-concave densities, using \(O_i~(i=1,2,3)\) to denote the three orthants in \({\mathcal {T}}_3\).

Case 1.:

Normal-like density \(f_1\) with mean equal to the point in \(O_1\) that is 1 unit from the origin, \(x_0\): i.e., for a point \(x\in {\mathcal {T}}_3\), \(f_1(x) \propto \exp (-d(x,x_0)^2/2)\).

Case 2.:

Exponential-like density \(f_2\), defined as follows: Let the support of the density be \({{\,\textrm{supp}\,}}(f_2) = \{x\in O_1\mid d(x,0)\le 1\}\cup O_2\cup O_3\), and for \(x\in {{\,\textrm{supp}\,}}(f_2)\),

$$\begin{aligned} f_2(x) \propto \exp (-d(x,x_0)). \end{aligned}$$
(20)

Because the density functions on the orthants \(O_2\) and \(O_3\) are identical, the generation of samples from these distributions can be implemented using ordinary sampling techniques in \({\mathbb {R}}\). We generated 100 samples each for sample sizes 100, 200, 300, 500, and 1000 from the true density and calculated the ISE. The average ISE versus sample size is reported in Fig. 6.
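For instance, a sampler for the density \(f_1\) of case 1 can be written as the following sketch: the three orthant masses are proportional to \(\Phi (1), 1-\Phi (1), 1-\Phi (1)\), and within each orthant the coordinate follows a truncated normal (the function name and encoding are our own).

```python
import numpy as np
from scipy.stats import norm, truncnorm

def sample_f1(n, seed=None):
    """Draw n points from the normal-like density f_1 on T_3 (case 1).

    Returns (orthant, coord) pairs; x_0 sits in orthant 1 at distance 1.
    On O_1 the coordinate density is prop. to exp(-(u-1)^2/2), u > 0,
    i.e. N(1,1) truncated to (0, inf); on O_2 and O_3 it is N(-1,1)
    truncated to (0, inf).  Orthant masses: Phi(1), 1-Phi(1), 1-Phi(1).
    """
    rng = np.random.default_rng(seed)
    w = np.array([norm.cdf(1.0), norm.sf(1.0), norm.sf(1.0)])
    orthants = rng.choice([1, 2, 3], size=n, p=w / w.sum())
    out = []
    for o in orthants:
        if o == 1:
            u = truncnorm.rvs(-1.0, np.inf, loc=1.0, scale=1.0, random_state=rng)
        else:
            u = truncnorm.rvs(1.0, np.inf, loc=-1.0, scale=1.0, random_state=rng)
        out.append((int(o), float(u)))
    return out
```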

Fig. 6 Average ISE for sample sizes 100, 200, 300, 500, and 1000 in case 1 (Top) and case 2 (Bottom). LCMLE denotes the log-concave MLE, and KDE denotes the kernel density estimator

Our simulation results show that in case 1, the log-concave MLE is comparable to the kernel density estimator with optimized bandwidths (OPT_NN, OPT_FIX) and dominates the practical bandwidth selection approaches. In case 2, the log-concave MLE performs better than every kernel density estimator. This could be because the density in case 2 is not supported on the whole space: the log-concave MLE restricts its support to the convex hull of the sample points and can therefore be considered better at detecting the shape of the support as long as it is convex.

5.2 Estimation of two-dimensional normal-like densities

We denote each orthant of \({\mathcal {T}}_4\) by the indices of the axes at the boundary as in Fig. 1. For instance, the orthant with axes 0 and 1 is denoted by \(\{0,1\}\). We consider densities of the following types:

Case 3.:

Positive truncated multivariate normal density \(f_3\) with mean 0 and variance I, supported on all orthants in \({\mathcal {T}}_4\):

$$\begin{aligned} f_3(x) \propto \exp (-d(x,0)^2/2).\end{aligned}$$
(21)
Case 4.:

Positive truncated multivariate normal density \(f_4\) with mean 0 and variance I, supported on orthants \(\{0,1\}, \{1,6\}, \{6,8\}, \{3,8\}, \{3,4\}, \{0,4\}\): For x in one of the above orthants,

$$\begin{aligned} f_4(x) \propto \exp (-d(x,0)^2/2). \end{aligned}$$
(22)

Because of the nonpositive curvature of tree space, these densities are log-concave.

Generation of samples from these distributions can be implemented by sampling from the positive truncated two-dimensional normal density. We generated 100 samples each for sample sizes 100, 200, 300, 500, and 1000 from the true density and calculated the MLE and the average ISE. The estimation results are illustrated in Fig. 7.

Fig. 7 (Top): average ISE in case 3. (Bottom): average ISE in case 4

In case 3, the kernel density estimator performs better in small samples, but with larger samples, the log-concave MLE outperforms it. This result is similar to the Euclidean case (Cule et al. 2010). On the other hand, in case 4, the log-concave MLE dominates the kernel density estimator. This is again because the log-concave MLE can effectively approximate the support of the true density with enough data, while the kernel density estimator cannot.

5.3 Estimation of one-dimensional densities that bend at the origin

As one-dimensional densities that bend at the origin, we take the two densities considered in Sect. 3.3. Concretely, we consider the following two densities:

Case 5.:

Normal-like density \(f_5\) constructed from the diffusion process (equation (12)) at time 5, with starting position the point in \(O_1\) that is 1 unit away from the origin:

$$\begin{aligned} f_{5,1}(x)&= \phi (x;1,5) - \frac{1}{3}\phi (x;-1,5) \end{aligned}$$
(23)
$$\begin{aligned} f_{5,j}(x)&= \frac{2}{3}\phi (x;-1,5) ~~~~~ (j\in \{2,3\}) \end{aligned}$$
(24)
Case 6.:

Density of coalescent time (equation (14)) \(f_6\) with \(T=1\): i.e., for a point \(x\in {\mathcal {T}}_3\),

$$\begin{aligned} f_{6,1}(x) =&-\frac{1}{6}\exp (-x-1) + \frac{1}{2}\exp (-|x-1|) \end{aligned}$$
(25)
$$\begin{aligned} f_{6,j}(x) =&\frac{1}{3}\exp (-x-1) ~~~~~j\in \{2,3\} \end{aligned}$$
(26)

As in the previous experiment with one-dimensional log-concave densities, the generation of samples is simple. We generated 100 samples each for sample sizes 100, 200, 300, 500, and 1000 from the true density and calculated the MLE and the average ISE. The estimation results are illustrated in Fig. 8.

Fig. 8 Average ISE for sample sizes 100, 200, 300, 500, and 1000 in case 5 (left) and case 6 (right). LCMLE with bend denotes the MLE in \({\mathcal {G}}_0\), and KDE denotes the kernel density estimator

The maximum likelihood estimator in \({\mathcal {G}}_0\) again performs comparably to the kernel density estimator with optimized bandwidths, while it performs better than the practical approaches for almost all sample sizes.

5.4 Clustering example

In this section, we consider the problem of clustering in the space \({\mathcal {T}}_4\). Let \(x_1, x_2, x_3\) denote points in the orthants \(\{0,1\}, \{2,3\}, \{5,8\}\) located at coordinates \((4,4), (1,1), (1,1)\), respectively. We consider three normal-like densities on \({\mathcal {T}}_4\), \(g_1, g_2\), and \(g_3\), defined as follows:

$$\begin{aligned} g_1(x)&\propto \exp (-d(x,x_1)^2/16), \end{aligned}$$
(27)
$$\begin{aligned} g_2(x)&\propto \exp (-8d(x,x_2)^2), \end{aligned}$$
(28)
$$\begin{aligned} g_3(x)&\propto \exp (-8d(x,x_3)^2), \end{aligned}$$
(29)

Note that \(g_1, g_2, g_3\) differ in variance as well as mean.

Generation of samples from each component can be implemented in the following way. First, we divide the space \({\mathcal {T}}_4\) into two regions: one is the set of points whose geodesic to the center \(x_i\) of the component is a cone path, and the other is its complement. The probability of each region can be easily computed using the cumulative distribution function of the normal distribution on \({\mathbb {R}}^2\). Because the density in the former region depends only on the distance from the origin, sampling from it can be reduced to sampling from a one-dimensional density. Sampling from the latter region is possible via sampling from a normal density on \({\mathbb {R}}^2\), because the geodesics in this case can be embedded into \({\mathbb {R}}^2\).

We generated a sample of size 200 from the mixture of \(g_i\), \(f=\sum _{i=1}^3 \pi _i g_i\) with proportions \(\pi _1=0.4, \pi _2 = 0.3, \pi _3 = 0.3\), and we attempted to cluster these points using log-concave MLE.

As with Gaussian mixtures in Euclidean space, we can construct an expectation-maximization (EM) algorithm to solve the clustering problem. We assume the number of clusters \(K=3\) is known and model the density as:

$$\begin{aligned} {\hat{f}}=\sum _{i=1}^3 {\hat{\pi }}_i\, {\hat{g}}_i \end{aligned}$$
(30)

where \({\hat{g}}_i\) is assumed to be log-concave. Cule et al. (2010) explained the EM algorithm applied to the log-concave mixture model, and we adopt their framework here.

To compare the results with other methods, we use the k-means++ algorithm (Vassilvitskii and Arthur 2006) applied to this space as an alternative method. Concretely, we implemented the k-means++ algorithm with the Fréchet mean instead of the usual arithmetic mean. It is simple to see that the sum of squared distances from the cluster centers strictly decreases at each iteration of the k-means++ or k-means algorithm (MacQueen 1967); thus, the algorithm terminates within a finite number of steps. The Fréchet mean is computed using the proximal point algorithm developed in Bačák (2013). A sketch of this baseline follows.
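The following sketch shows the structure of the baseline. It reuses dist_t3, geopoint, and frechet_mean_t3 from the sketches in Sect. 2 and therefore runs on \({\mathcal {T}}_3\); on \({\mathcal {T}}_4\) one would substitute Owen–Provan geodesic distances and proximal-point means. The function name and defaults are our own.

```python
import random

def kmeanspp_t3(pts, K, iters=20, seed=0):
    """k-means++ seeding followed by Lloyd iterations with Fréchet-mean centers."""
    rng = random.Random(seed)
    centers = [rng.choice(pts)]
    while len(centers) < K:                       # D^2-weighted seeding
        d2 = [min(dist_t3(p, c) ** 2 for c in centers) for p in pts]
        centers.append(rng.choices(pts, weights=d2, k=1)[0])
    labels = [0] * len(pts)
    for _ in range(iters):
        labels = [min(range(K), key=lambda j: dist_t3(p, centers[j]))
                  for p in pts]                   # assignment step
        centers = [frechet_mean_t3([p for p, l in zip(pts, labels) if l == j]
                                   or [centers[j]])  # keep empty clusters fixed
                   for j in range(K)]             # update step
    return labels, centers
```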

The results of clustering are shown using the Petersen graph in Fig. 9. We see from these results that in this case, the clusters estimated by the log-concave mixture estimate are more representative of the true density. This is a natural result, as the true density is indeed log-concave, and the mixed normal densities have different variances.

Fig. 9 Clustered points displayed in the Petersen graph (Fig. 1) with labels. (Left): data points with markers indicating the true clusters to which they belong. (Center): clustering results using the log-concave mixture (97% accuracy). (Right): clustering results using k-means++ (57% accuracy). Note that closeness in this graph does not necessarily mean closeness in \({\mathcal {T}}_4\)

6 Conclusion

In this paper, we showed that the maximum likelihood estimator within the class of log-concave densities on tree space exists under certain conditions. In one dimension, the maximum likelihood estimator exists with probability one and is unique, the same result as in the Euclidean case. In multiple dimensions, the MLE can fail to exist; however, restricting attention to a particular situation, we have shown that the MLE exists and is unique \(\nu \)-almost everywhere. We have also presented an algorithm to calculate the MLE exactly in the one-dimensional case and approximately in the two-dimensional case. We compared the results with the previously developed kernel density estimator and confirmed that our estimator dominates in well-modeled cases with a large enough sample size.

The method derived here is promising in that it gives a new nonparametric approach to density estimation on tree space. As we saw in a two-dimensional example, it can also respect the support of the sample distribution when that support is convex, which might considerably affect estimation accuracy in some situations. Unlike kernel density estimation, it requires neither the choice of smoothing parameters nor the calculation of any (approximate) normalizing constant. Although the computation is much slower than for kernel density estimates, the accuracy of density estimation can be expected to be high when the log-concavity assumption is not far from the properties of the true density. The developed method offers an intermediate choice between fully unconstrained nonparametric approaches, whose accuracy is difficult to improve, and parametric approaches, for which it is difficult to specify correct models on tree space.

We also note that, with a slight modification to the statements, the main theoretical results on the existence and uniqueness of the log-concave maximum likelihood estimator (Theorem 4 and the existence part of Theorem 5) hold in the CAT(0) orthant spaces defined by Miller et al. (2015), and in locally compact CAT(0) polyhedral complexes.

For future research, it is important to derive theoretical properties of the estimator on tree space. In Euclidean space, the log-concave MLE is known to be strongly consistent, and even if the model is misspecified, it converges to the “log-concave projection” of the true density (the log-concave density that minimizes the Kullback–Leibler divergence from the true density). It is crucial to investigate whether these properties hold on tree space as well. It is also of interest whether different constraints on the density can be introduced, either to include other types of densities on this space or to lessen the computational load; the densities that bend at the boundary, which we considered only in the one-dimensional case, are a candidate class for this purpose. Finally, further simulation studies in both well-modeled and misspecified cases are necessary for practical purposes.

As tree space was constructed for modeling the space of phylogenetic trees, one immediate direction of further research is applicability to biological data and problems. As we only have (approximate) algorithms for one and two dimensions, corresponding to at most four present taxa, applicability seems limited for now. However, it might be possible to seek lower-dimensional representations of large trees, for example by grouping taxa that are irrelevant to the inconsistency of the multiple trees. It is of interest how well our method with these modifications performs compared to existing methods on biological data. Another possible way to utilize these lower-dimensional estimation results for large trees is to combine them with tree-construction methods based on subtrees, in particular quartets and rooted triples. Further careful assessment of the log-concavity assumption for data distributions is also of interest. Furthermore, an attempt to improve computational efficiency is called for, as our method is currently not suited for use with large datasets.