The Shape of Phylogenies Under Phase-Type Distributed Times to Speciation and Extinction

Phylogenetic trees describe relationships between extant species, but beyond that their shape and their relative branch lengths can provide information on broader evolutionary processes of speciation and extinction. However, currently many of the most widely used macro-evolutionary models make predictions about the shapes of phylogenetic trees that differ considerably from what is observed in empirical phylogenies. Here, we propose a flexible and biologically plausible macroevolutionary model for phylogenetic trees where times to speciation or extinction events are drawn from a Coxian phase-type (PH) distribution. First, we show that different choices of parameters in our model lead to a range of tree balances as measured by Aldous’ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta $$\end{document}β statistic. In particular, we demonstrate that it is possible to find parameters that correspond well to empirical tree balance. Next, we provide a natural extension of the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta $$\end{document}β statistic to sets of trees. This extension produces less biased estimates of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta $$\end{document}β compared to using the median \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta $$\end{document}β values from individual trees. Furthermore, we derive a likelihood expression for the probability of observing an edge-weighted tree under a model with speciation but no extinction. Finally, we illustrate the application of our model by performing both absolute and relative goodness-of-fit tests for two large empirical phylogenies (squamates and angiosperms) that compare models with Coxian PH distributed times to speciation with models that assume exponential or Weibull distributed waiting times. In our numerical analysis, we found that, in most cases, models assuming a Coxian PH distribution provided the best fit.


Introduction
Understanding how biodiversity is maintained and changed throughout time has been of long-standing interest in evolutionary biology (Quental and Marshall 2010;Morlon 2014). Fossil records are commonly used to make inferences about changes through time in speciation and extinction rates (Simpson 1944;Stanley 1998;Morlon et al. 2011). However, most clades do not possess sufficiently complete fossil records to make such inferences (Ricklefs 2007;Quental and Marshall 2010). In contrast, dated molecular trees are increasingly available; nevertheless, these "reconstructed phylogenies" only give relationships between extant species (Nee et al. 1992(Nee et al. , 1994aStadler 2013b). These reconstructed phylogenies can also be used to study how diversification processes change throughout time (Nee et al. 1994a), although some have argued that the use of reconstructed phylogenies needs to be accompanied with availability of fossil records (Quental and Marshall 2010;Morlon 2014;). However, reconstructed phylogenies remain useful to study diversification and diversity dynamics when accompanied by biologically well-justified constraints (Louca and Pennell 2020).
Several mathematical models have been proposed for studying macroevolutionary processes. These range from the constant-rate birth and death (crBD) model where speciation and extinction rates are assumed to be constant through time (Nee et al. 1994b), to models where speciation and extinction rates change according to species age (Hagen et al. 2015), to models where an evolving trait can affect speciation and extinction rates (Maddison et al. 2007;FitzJohn 2012). For models under the general birth-death process, in which speciation and extinction rates can vary over time, a recent paper by Louca and Pennell (2020) shows that many parameter choices are indistinguishable as they generate the same expected lineage-through-time (LTT) plot. Despite the problems identified by Louca and Pennell (2020), these fitted parameters still provide some insight into speciation and extinction rates or structure of relationships between species through time (Harvey and Pagel 1991;Stadler 2013b).
Given a choice of a model, various methods can be applied to use empirical (or simulated) data such as branch lengths from reconstructed trees to estimate the parameters of the model. For example, it is possible to derive an expression for the likelihood of observing these branch lengths and find the best-fitting parameters of the model using maximum-likelihood estimation (MLE) to make inference about the speciation and extinction rates (Morlon et al. 2011). In order to see which model fits empirical data best, we can assess models via the likelihood ratio test (LRT) or the Akaike's Information Criterion (AIC) (Anderson and Burnham 2004) or via the comparison of their simulated LTT plot, which counts the number of species that existed at each given time in the past, with an empirical LTT plot (Morlon 2014). Then, given a model with best choice of parameters, we can assess whether it fits well to the empirical data by comparing tree balance or tree topology and branch length distributions from empirical and simulated trees generated from the model. The balance of a phylogenetic tree describes the branching pattern of the tree, ranging from imbalanced shape where sister clades tend to be very different in sizes to balanced shape where the clades are of similar sizes. Tree balance is important for understanding macroevolutionary dynamics on a tree (Hagen et al. 2015) as it gives indication of heterogeneity of diversification rate across the tree without requiring information on branch lengths. Several statistics for assessing tree balance have been proposed in the literature. These include the Colless index (Colless 1982), the Sackin index (Sackin 1972) and Aldous' β (Aldous 1996)-Section 3.3 of Steel (2016) gives a detailed description of all three measures. In this paper, we focus exclusively on the β statistic as, unlike the other two statistics, it is easily comparable between trees of different size. The β statistic arises as a parameter of the Aldous' β-splitting model; in this model β is in the range [−2, ∞) where values close to − 2 mean that taxa are likely to split into unbalanced subsets and large values mean that splits are likely to be balanced. Many models in phylogenetics fail to resemble empirical datasets which often have β value around − 1 (Aldous 1996). For example, the simplest macroevolutionary model is the pure birth model, also known as the Yule-Harding (YH) model (Yule 1925), where each species is equally likely to speciate. It has been shown that trees under this model have the expected value β = 0 (Aldous 1996;Hagen et al. 2015). In other words, the YH model predicts trees that are too balanced compared to empirical data (Aldous 1996(Aldous , 2001. Likewise, models that include diversity-dependent (Etienne et al. 2012) and time-dependent speciation and extinction have been shown to produce the same expected tree balance as the YH model (Lambert and Stadler 2013). These models fall under a general class of species-speciation-exchangeable models as described in Stadler (2013b). This suggests that this class of models is not adequate to explain the macroevolutionary dynamics that has produced empirical trees.
Another statistic that has been widely used to compare empirical trees with macroevolutionary models is the γ statistic. The γ statistic was introduced in Pybus and Harvey (2000) and unlike the tree balance statistics it makes use of the branch lengths. The statistic is designed to have a zero mean standard normal distribution under a pure birth model. Negative values of γ mean that more diversification has occurred earlier in the tree than expected under a pure birth model, i.e., the edges nearer the root tend to be shorter relative to the other edges. Correspondingly, positive values of γ mean that more diversification has occurred later in the tree and that edges nearer the root tend to be relatively longer. It has been shown that γ values for empirical phylogenies tend to be below 0, which has sometimes been taken to indicate a slowdown in the diversification rate (Phillimore and Price 2008; Rabosky and Lovette 2008;Morlon et al. 2010).
In this paper, we construct a stochastic model for generating species phylogenies in which we apply Coxian PH distributions (Neuts 1981;Marshall and McClean 2004) for times to speciation and times to extinction. PH distributions describe the time to absorption in a continuous-time Markov chain (CTMC) with a single absorbing state and a finite number of non-absorbing states. Biologically, this could be thought of as a species passing through different phases where it may be more or less likely to speciate depending on a current underlying phase (Fig. 1). While these phases need not represent any particular biological state, the PH distribution gives great flexibility to model different ways that rates of speciation may depend on a species' age. Similarly, Each phase corresponds to a non-absorbing state in a CTMC and speciation corresponds to the single absorbing state. At the start, the species directly goes to phase 1 where it can either undergo speciation or move to the next phase with certain rates. The process can continue up to a finite number of n phases, each corresponding to a different rate of speciation times to extinction can also be modeled using PH distributions. We show that different parameter choices for age-dependent speciation rates produce phylogenetic trees that can range from highly balanced to highly unbalanced. In particular, we find parameters that give similar tree balance statistics to empirical trees. An additional contribution of the paper is that we develop a new approach for computing the β statistic based on a set of trees rather than computing β from a single tree. We suggest that this approach leads to more accurate estimates of the β statistic compared to computing β for single trees and then taking an average and that this is particularly true for trees with fewer extant species.
For a special case of our model, in which only speciation (and not extinction) occurs, we derive a likelihood expression for the probability of observing any edgeweighted tree. For two very large phylogenies-squamates (Zheng and Wiens 2016) and angiosperms (Zanne et al. 2014)-we perform model selection for different clades of both trees to compare our Coxian PH model for the speciation process to the exponential and Weibull distributions.
The rest of our paper is structured as follows. In the mathematical methods section we: (1) summarize the key properties of the PH distribution, (2) introduce some examples of Coxian PH distributions, (3) present our method for calculating the β statistic for a set of trees, (4) and derive a likelihood expression based on our model for fitting empirical branch length data. The next section contains simulations that: (1) demonstrate the use of treeset β, (2) show that the model can produce trees with a wide range of tree shapes, (3) examine how well fitted models do in recovering the speciation process in scenarios with and without extinction. In the section on empirical data we apply our model to two large published phylogenies-squamates (Zheng and Wiens 2016) and angiosperms (Zanne et al. 2014). In summary, we find that Coxian PH distributions are a useful tool for studying macroevolutionary dynamics.

PH Distribution and Relevant Properties
In this section, we introduce the PH distribution and some of its key properties.
Definition 1 (Continuous PH distributions) Let {X (t) : t ≥ 0} be a continuous time Markov chain defined on state space S =Ŝ {n + 1}, whereŜ = {1, 2, . . . , n} is the set of non-absorbing states and n + 1 is an absorbing state, initial distribution vector α = [α i ] i∈Ŝ , and generator matrix where Q = [Q i, j ] i, j∈Ŝ is a square matrix with dimension n that records the transition rates between non-absorbing states i, j ∈Ŝ, q = [Q i,n+1 ] i∈Ŝ is a column vector that records the transition rates from non-absorbing states i ∈Ŝ to the absorbing state n + 1, and 0 is the row vector with corresponding dimension. By the definition of generator matrix Q, we have Q i,i < 0, for all i, Q i, j ≥ 0 for i = j, and Q1 + q = 0, where q is the exit rate vector. Let Z = inf {t ≥ 0 : X (t) = n + 1} be the random variable recording the time until absorption, then Z is said to be continuous PH distributed with parameters α and Q, which we denote Z ∼ PH (α, Q).
Theorem 1 (The cumulative distribution and density functions of continuous PH distribution) Suppose Z ∼ PH (α, Q), then the cumulative distribution and the probability density function of Z are given, respectively, by and its mean and variance are given by Proof of this theorem is originally given in Neuts (1975), and a clear exposition is given in Verbelen (2013).
Definition 2 (Coxian PH distribution) If α and Q are defined as where 0 < p i ≤ 1 and λ 1 , . . . , λ n > 0 for all i = 1, 2, . . . n − 1, then we say that the random variable T ∼ PH (α, Q) follows Coxian PH distribution. Cumani (1982) showed that any acyclic PH (APH) distribution (including Coxian PH distributions), that is, a distribution with an upper triangular generator matrix (Asmussen et al. 1996), can be restructured to a canonical form such as shown above and thus only requires 2n parameters as opposed to n 2 + n parameters for a general PH distribution. This reduction in the number of parameters makes it computationally simpler to fit parameters (Thummler et al. 2006). Further, Cumani (1982) and Dehon and Latouche (1982) showed that for any APH distribution, there exists an equivalent representation as a Coxian PH distribution with λ 1 ≤ λ 2 ≤ · · · ≤ λ n .
To fit a PH distribution to data it is necessary to fix the number of non-absorbing states. Thummler et al. (2006) stated that it is difficult to fit general PH distributions if the number of non-absorbing states is larger than four, due to the increased computational cost and the dependence on the initial values. They also state that having a PH distribution of low order (less than four non-absorbing states) is not sufficient to get parameter values that correspond to small coefficients of variation (CV).
In Sects. 2.3 and 3.2 where we simulate data under different conditions, we focus solely on PH distributions with four non-absorbing states. In Sect. 4, where we fit models to empirical data, we explore a wider range of options for the number of non-absorbing states.

Coxian-Based Macro-Evolutionary Model
Now, we develop a stochastic model for generating species phylogenies, in which we assume that the time spent by each newly formed lineage before the next speciation or extinction event is drawn from a Coxian PH distribution. Our model is a special case of the well-studied Bellman-Harris model which allows any distribution of waiting times to extinction or speciation (Bellman and Harris 1948). This model is discussed in Hagen and Stadler (2018) and they provide an R package (Hagen and Stadler 2018) that allows users to simulate trees under a general Bellman-Harris model. However, while it is possible to simulate trees under this very general class of models, it is not possible to fit parameters of a general Bellman-Harris distribution to empirical data. A novelty of our approach is that we are able derive a likelihood expression for the probability of observing a reconstructed phylogeny under our model in the case with no extinction and that we can therefore fit parameters.
In our model, we primarily focus on symmetric speciation. This means that after a speciation event two "child" species are created that are identical and of age 0. Thus, each branch length on a given tree can be thought of as an independent random variable drawn from the imposed Coxian PH distribution. We also consider asymmetric speciation in which the "parent" species is considered to continue and one new "child" species is created with age 0. Both symmetric and asymmetric speciation modes are supported by the R package TreeSimGM (Hagen and Stadler 2018).
We also construct two examples of the Coxian PH distribution as given in Definition 2. We parameterize the two examples so as to enforce either monotonically increasing or monotonically decreasing rates of absorption. In Example 1, the rate of speciation (or extinction) decreases as species get older, and in Example 2 the rate of speciation (or extinction) increases as species get older. We chose a parameterization with three free variables (x, y and z), as this gives flexibility to pick instances of each example with a given mean and variance, while at the same time reducing the number of free parameters for faster computational time (Okamura and Dohi 2016). Moreover, these two examples follow canonical form 3 of an APH distribution as stated in Okamura and Dohi (2016) (see also the derivation of the form by Cumani 1982 where 0 < x ≤ 1, 0 < y < 1, z ≥ 2 and q is the exit rate vector.
The restrictions on x and y imply that each entry of the exit rate vector q is less than the preceding entry.
Here, the restrictions on x and y imply that each entry of the exit rate vector q is greater than the preceding entry.
From now on, we refer Examples 1 and 2 as PH Dec and PH Inc , respectively. By standard theory of the PH distribution, the first and second moments of the Coxian PH distribution in PH Dec and PH Inc are given by respectively. The derivations of Eqs. 10 and 11 are shown in "Appendix."

Computingˇfor a Set of Trees
We propose a new approach for estimating the tree-balance statistic β from a set of rooted trees {T 1 , . . . , T M }, which can be either empirical trees or simulated trees under some model of interest. For each subtree with four or more tips in each tree in {T 1 , . . . , T M } we compute the probability q n (i, β) of observing i tips on the left out of the n tips of that subtree. This is done using Eq. 4 from Aldous (1996), where a n (β) is the normalizing constant. We note that subtrees of size 2 or 3 are not of interest as there is only one possible division of the tips. In the case where the tree size is too large, the above expression is not numerically tractable, so we use the following approximation instead (which is also used in the apTreeShape package (Bortolussi et al. 2006)), given by whereâ n (β) is the normalizing constant. (Justification for the approximation in Eq. 13 is given in "Appendix.") We then use numerical optimization to find the value of β in the range [−2, 10] which maximizes the product of all the q n (i, β) values. This is the maximum likelihood estimate of β for the set of trees. Our custom R script, based on maxlik.betasplit function from the apTreeShape package (Bortolussi et al. 2006) to estimate β from sets of trees, is available as a Supplementary Material on Dryad (https://doi.org/10. 5061/dryad.w9ghx3fpk).

Fitting PH Distributions to Branch Length Data
In this section, we propose a method for finding parameters of a PH distribution using branch length data from a phylogenetic tree. We assume that the time until a speciation event on a branch follows a PH distribution and that there is no extinction. We write the likelihood expression using parameters from the PH distribution to calculate the probability of observing a tree with a given number of extant species.
Assuming that a tree evolves under a symmetric speciation mode, and that times to speciation events are drawn from a PH distribution, we can treat each branch length on the tree as independently drawn from the same PH distribution. We illustrate this in Fig. 2, in which the lengths of internal branches and pendant branches are denoted by In general, we denote the lengths of internal and pendant branches by b i , for i = 1, . . . , k, andb j , for j = 1, . . . , , where the total number of internal branches and pendant branches is denoted by k and , respectively. Here, because we consider the root branch, we note that k = − 1. Both internal and pendant branches follow a PH distribution with parameter α and rate matrix Q, that is, b i ,b j ∼ PH (α, Q). It follows from the properties of the PH distribution (Neuts 1981), that the likelihood of observing an internal branch of length b i is the probability density of the distribution along the branch given by αe Qb i q and the likelihood of observing a pendant branch of lengthb j is the probability that the branch has survived until time t (i.e., one minus the cumulative probability of the distribution) given by αe Qb j 1, where 1 is a column vector of ones. Therefore, by independence of the branch lengths, the likelihood of observing tree T can be written as, with α = [1, 0, . . . , 0], since we apply Coxian PH distribution. Note that if we consider all the possible permutations on the tips of the tree, then the likelihood becomes, Given the branch lengths of a single tree T , we perform numerical optimization to find parameter values that maximize the likelihood equation given in Eq. 14. In the case of the general Coxian PH model this amounts to finding the best values of p i 's and λ i 's as in Definition 2, for PH Dec and PH Inc it means finding the best values of x, y, and z.
Alternatively, given the branch lengths of a tree set {T 1 , . . . , T M }, we apply maximum likelihood estimation to maximize the product where we assume trees are independent and apply Eq. 14 to compute the likelihood of observing the individual trees T 1 , . . . , T M .

Fig. 3
Original tree T and the reconstructed tree T * , where S (in black) denotes observed speciation events and E (in red) denotes unobserved extinction events. Speciation events S (in red) followed by extinction events E (in red) on the original tree T , are not observed on the branches (highlighted in red) of the reconstructed tree T * . The internal branch 1 on T * was born at time t 1 • and the next observed speciation event on that branch (on T * ) was at time t 1 • . The external branch 2, 3, and 4 on T * was born at time t 2 • , t 3 • , and t 4 • , respectively, and no speciation events are observed on that branch (on T * ) (Color figure online) To optimize parameters for the exponential and Weibull distribution, we derive an equivalent expression to Eq. 14 for both distributions. The likelihood expression for the exponential distribution is given by and for the Weibull distribution where ψ and φ are scale and shape parameters, respectively. Then, we apply maximum likelihood estimation to search for λ > 0 that maximizes Eq. 16. Similarly, we search for ψ > 0 and φ > 0 parameters that maximize Eq. 16.
Finally, we consider a birth-and-death process (BDP) with constant birth rate λ and constant death rate μ. The likelihood expression for the reconstructed tree under such BDP is given in Eq. 20 of Nee et al. (1994b), it is a conditional probability conditioning on the survival of both original branches descending from the root.
Note that the likelihood for the reconstructed tree under any process that includes extinction events needs to consider the possibility that speciation events that end with extinction may occur on internal or external branches and so are not observed on the reconstructed tree (see Fig. 3).
Below, we present our alternative likelihood formula for the reconstructed tree under a BDP. This formula provides new physical interpretations given by Eqs. 19-23, in the context of the dynamics of the process driving the evolution of the phylogenetic tree in time.
Assume that t is the age of the tree with 0 is the time at the start of the root branch and let x i be the elapsed time from the end of the internal branch i until the end of tree T . That is, if internal branch i is born at time t i • and gives birth at time t i • to another branch, then For the external branch j descending from the internal branch i, we have its branch length given bỹ b j = t − t i • = x i . Then, the likelihood of observing a reconstructed species tree T * is given by where and D (1) where by Kendall (1948) is the probability that a branch born at time zero becomes extinct by time z. Solving the above equations gives The derivation of the differential equations for D (1) t (z) and G x,t (z) along with their solutions and some intuition are shown in "Appendices 6.4 and 6.5." Next, we apply our likelihood expression in Eq. 19 to the reconstructed tree T * in Fig. 3 (ignoring the age of the root) to see that where L Nee (T * | λ, μ) is the likelihood expression given in Eq. 20 in Nee et al. (1994b), and x 2 is the elapsed time from the starting time of the two original branches descending from the root until the end of the tree T * , as defined in Nee et al. (1994b). This relationship is as expected, since L Nee (T * | λ, μ) is a conditional probability of observing the tree T * given that both original branches have survived until the end of the tree.

Comparing Treesetˇto the StandardŤ
o compare β values estimated from individual trees to those estimated for a set of trees, we performed the following simulation. We simulated sets of 1000 trees using TreeSimGM package (Hagen and Stadler 2018), where each set of trees had the same number of extant tips n ∈ {10, 20, 30, . . . , 200} and their times to speciation were drawn from PH distribution with rate matrix We note that the structure of the exit rate vector q implies that the probability of getting absorbed from later states is less likely than from earlier states. We then repeated the above procedure for sets of trees evolving under the YH model. The YH case is interesting because it is representative of a wider class of models that are known to have E(β) = 0 (Aldous 2001). For each set of trees, we computed individual estimates of β for each tree as well as a β estimate for the entire tree set. We also computed 95% confidence intervals for the estimated β values, denotedβ, from each tree set. In order to get the lower and upper bound for the confidence intervals, we performed a numerical search over 500 equidistant points betweenβ − 5 × S E β andβ to find the point that corresponds to the lower bound and 500 equidistant points betweenβ andβ + 5 × S E β to find the point that corresponds to the upper bound. The lower and upper bounds were chosen such that their likelihood is equal to the likelihood of the MLE minus a half of the chi-square value with 1 degree of freedom; this gives a 95% confidence interval (Pawitan 2001). The standard error forβ, S E β , was evaluated using where I β is the Fisher information ofβ.
The results are summarized in Fig. 4. For both of the generating processes, the distribution of β values is right-skewed (Fig. 4a, c) and the median value for individual trees is higher than the value estimated using the entire tree set particularly for trees with fewer tips (Fig. 4b, d). For the trees generated under the YH process, when estimating the value of β for trees with fewer extant tips we obtained β ≈ 0 when applying the method based on treesets, but median β > 0 for estimates based on individual trees (Fig. 4c, d). We conclude that the method based on treesets is more accurate for the Yule process, as evidenced by the 95% confidence interval in Fig. 4d. The β values estimated from different sets of trees concentrate around β = 0 in agreement with the theoretical value for trees evolving under the YH model. We think that the upwards bias in estimation of β arises because, for trees with fewer tips, it not unlikely to get a tree that is maximally balanced (or close to it) and in this case the maximum likelihood procedure for fitting β prefers to make β as large as possible.

Coxian-PH Models can Generate a Range of Tree Shapes
In Hagen et al. (2015), the authors found that using a Weibull distribution for age dependent speciation had an effect on tree balance (as measured by the β statistic), whereas using a Weibull distribution for extinction had an effect on diversification (as measured by the γ statistic). To test if using PH distributions gives similar results, we simulated trees using the two examples PH Dec and PH Inc . We did not see obvious changes in the β and γ statistics under different parameter values using PH Inc , so we only report results for PH Dec . The simulation procedure was a follows: -As an example, we set z = 10 and mean waiting time to both speciation and extinction E PH (X ) = 2. The choice of E PH (X ) scales the branch lengths of generated phylogenies, but results will be invariant to this choice of the mean since we only consider tree balance and relative branch lengths. Likewise, the z parameter is chosen arbitrarily as long as it is larger than or equal to 2 in order to preserve a decreasing rate as described in PH Dec . -We then selected 4 pairs of parameters 0 < x ≤ 1 and 0 < y < 1 to give a wide range of coefficients of variation (CV). We found choices of x and y where CV = σ μ ∈ {30.08, 13.50, 5.56, 1.49}. These 4 pairs of x and y are as follows: We also note that fixing either x or y parameters gives less flexibility in choosing (y, z) or (x, z) pairs corresponding to a wide range of CV. -Using the TreeSimGM package (Hagen and Stadler 2018) in R, we generated 300 trees with 100 extant tips in which times to speciation followed a PH distribution  , 20, 30, . . . , 200} tips (a-d). Estimates of β from treesets are indicated by red dots (a) and blue dots (c). Trees are simulated according to either Coxian PH distribution for times to speciation events (a) or the YH process (c). The area of 95% confidence interval of β values from treesets following Coxian PH distribution and the YH process are plotted in (b), (d), respectively. The black lines represent the treeset β values, and the gray area represents the confidence interval for each treeset β value. The red lines represent the median β values from individual trees. The blue-dashed line represents the theoretical β value for the YH trees (β = 0) (Color figure online) with parameters x, y and z, while times to extinction followed an exponential distribution with rate λ = 0.25. The main goal in choosing trees of size 100 was to have trees that were large enough for β to be accurately estimated for individual trees, but small enough to have reasonable running time. We repeated this procedure for both symmetric and asymmetric speciation modes. Then we repeated everything again but using an exponential distribution for the times to speciation (with λ = 20) and the PH distributions described above for the times to extinction. -We measured the effect of different parameter choices above on tree balance using the β statistic. We computed the β statistic both for individual trees, using the apTreeshape package (Bortolussi et al. 2006), and for sets of trees based on our new approach. We also measured the effect on relative branch lengths as measured by the γ statistic (Pybus and Harvey 2000), which we computed using the APE package (Paradis et al. 2004).
The results are presented in Fig. 5. Tree balance is affected by varying the parameters for times to speciation (Fig. 5a), in particular, there are choices of model parameters that match the tree-shape statistics of empirical phylogenies (β = −1). Tree balance is not significantly affected by the parameters for times to extinction (Fig. 5b). In contrast to the behavior of β, relative branch lengths, as measured by the γ statistic are not affected by the parameters for times to speciation (Fig. 5c), while they are affected by the parameters for times to extinction (Fig. 5d). We did not observe a significant difference in our results between the symmetric and asymmetric speciation modes. These results are congruent with what was found in Hagen et al. (2015).

Fitting Coxian-PH Distributions to Branch Length Data
In this section, we test if the maximum likelihood approach outlined in Sect. 2.4 is able to fit the speciation process well in cases where: (a) there is no extinction, and (b) the generating model includes extinction. As an example to illustrate the bias introduced by not considering the extinction process in the likelihood function in Eq. 16, we simulated trees using the PH Dec distribution with known parameter values, for the speciation process and an exponential distribution for the extinction process with rate λ ∈ {0, 0.1, 0.4}, and then fitted the parameters of the PH Dec distribution to the generated branch length data. In total, we generated 50 trees with 50 extant tips each, using TreeSimGM package (Hagen and Stadler 2018), which produced 4900 branches.
Using Eq. 14-16, we found the parameters x, y, and z that maximized the likelihood of observing the given set of branch lengths. The optimization was carried out using the built-in R function, optim, with the "L-BFGS-B" method (Byrd et al. 1995) and multiple starting points for x, y, z, followed by local optimization using the "Nelder-Mead" method (Nelder and Mead 1965).
To compare the fitted distribution to the generating distribution we plotted the density of the fitted distribution and the known distribution used to simulate the data. Additionally, using the fitted parameters x, y and z, we generated trees with the same number of tips as in the simulated data, and compared their distribution of branch lengths with that of the simulated trees. Note that we cannot simply compare the branch length histogram from trees generated under the known distribution with its fitted frequency density plot since the generated trees are truncated at some time t (the tree's age). Therefore, to compare distributions of branch lengths we used the two sample Kolmogorov-Smirnov (KS) test of the null hypothesis that both simulated and fitted log branch lengths come from the same distribution (using the built-in ks.test function in R). The results of this analysis are shown in Figs. 6, 7, 8 and Table 1.
In the scenario without extinction (Fig. 6) the fitting process was able to recover the parameters since the generated trees do not assume extinction, the KS statistic found no significant difference in the log branch lengths produced by the true generating model and the fitted PH Dec model (Table 1). In the scenarios that included extinction, the fitting process was not able to correctly recover the true generating model (Figs. 7,  8). The bias in estimating the speciation process becomes more apparent as we increase the extinction rate (Fig. 8).

Empirical Data
In this section, we apply the techniques developed in Sect. 2.4 to two large empirical phylogenies (Zheng and Wiens 2016;Zanne et al. 2014). In order to view these phylogenies and to extract clades of interest, we used Dendroscope 3 software (Huson and Scornavacca 2012). For each dataset, we compared nine models. These included models where the speciation process followed a PH distribution: the general Coxian distribution (Definition 2) with 3, 4, 5, and 6 non-absorbing states, and the two examples PH Dec and PH Inc developed in Sect. 2.2, one model where the speciation process follows an exponential distribution, one where it follows a Weibull distribution, and one where we fit to the constant rate birth-death model (crBD) using the likelihood formula of observing a tree conditioned on survival in Eq. 20 in Nee et al. (1994b) or using the likelihood in Eq. 27. We note that our likelihood formula as in Eq. 14 does not consider permutation on the tips of tree, so it differs from the likelihood from the crBD model by (N − 1)! where N denotes the number of tips on tree.
Our general approach for model comparison was to use the Akaike Information Criterion (AIC) (Akaike 1998) which is essentially the log likelihood penalized according to the number of parameters used in the model. We followed the approach suggested in Anderson and Burnham (2004) which is that models with an AIC difference (ΔAIC) of less than two are essentially as good as the best model, and models with ΔAIC less than 6 should not be discounted. In addition to assessing relative goodness-of-fit via the AIC, and bearing in mind that all of our models are likely to be wrong given that they ignore extinction, we also assessed absolute goodness-of-fit using the KS statistic to compare fitted branch length densities to empirical branch length densities.
Lastly we show the hazard rate function for speciation from the best-fitting model for each clade. We were interested to see how different these would be to the constant hazard rate assumed by most macroevolutionary models or the monotonically decreasing hazard rate given by a Weibull distribution.

Squamate Phylogeny
We fit the models under consideration to the branch lengths from the squamate phylogeny in Zheng and Wiens (2016). We also examined three major clades of the tree separately, namely the gekkota clade (1318 branches), the iguania clade (1936 branches), and the anguimorpha clade (200 branches), to see if there are any notable differences.
The model comparison results are summarized in Table 2. The general Coxian model is strongly preferred for the overall tree and for all the clades being studied. In particular, the general Coxian model with three non-absorbing states fits best, but the model with four non-absorbing states is essentially indistinguishable. Additionally, fitting to the PH Inc example model is significantly worse than other distributions. Moreover, fitting to the crBD model returns zero extinction rate for all the cases and returns the same parameter values for speciation process, comparable to the model that follows exponential speciation rate without extinction. The absolute goodness-of-fit of different models is assessed in Fig. 9. Visually both general Coxian PH distribution with three and four non-absorbing states give fairly similar densities. These two appear to fit better compared to the other distributions (in agreement with the AIC results in Table 2). Both of these distributions seem to capture the tail behavior fairly well, but do a poorer job of matching the density for shorter branch lengths. The lack of fit to the reconstructed squamate tree and to most clades is supported by the KS tests which show a significant difference between the empirical branch lengths and branch lengths of 10 simulated trees from each bestfitting distribution (Table 3). We use the phytools package (Revell 2012) to simulate trees under the crBD model. Given that earlier results (Hagen et al. (2015) and Fig. 5d) show that the extinction process affects relative branch lengths, we hypothesize that this result could be due to ignoring extinction events in the models. Interestingly, all the distributions, except for PH Inc , show a good fit between the empirical branch lengths of the anguimorpha clade and branch lengths of 10 simulated trees from each of these best-fitting distributions (Table 3). We note that this result could be due to the clade having a relatively small number of extant tips (101 tips); therefore, there is a lack of

(d)
Branch Lengths from the Anguimorpha Clade Density Fig. 9 Histograms of empirical branch length density from the whole squamate tree (a), the iguania clade (b), the gekkota clade (c), and the anguimorpha clade (d) with the fitted branch length densities from the six distributions mentioned above. The yellow and blue lines are the fitted densities using the general Coxian PH distribution defined in Definition 2 with 3 and 4 non-absorbing states, respectively, the red line is the fitted density using the Coxian PH Dec example, the green line is the fitted density using the Coxian PH Inc example, and black and orange lines are the fitted density using Weibull and exponential distributions, respectively. The fitted densities for the general Coxian PH distribution with 5 and 6 non-absorbing states are not included because in most cases the distribution with 3 and 4 non-absorbing states fit better, while the fitted density from the crBD model is not included because it is identical to the fitted density using exponential distribution (see Table 2) (Color figure online) power to distinguish between models. Alternatively, extinction may occur at a lower rate in this clade compared to the other clades. The hazard rate functions for speciation from the best-fitting general Coxian PH distribution with four non-absorbing states for each the overall squamate phylogeny and the three major clades are shown in Fig. 10. For the overall tree and for each clade, the instantaneous rate of speciation seems to show a slight decline (almost constant) as species get older. The constant rate birth-death (crBD) model is the only model that includes extinction in this comparison. The numbers (3,4,5,6) in the row labels for the general Coxian PH indicate the number of non-absorbing states. For the crBD model from Nee et al. (1994b), we adjust the log likelihood by subtracting log(( − 1)!) where is the number of tips on tree. We select the model that has the lowest AIC value as the base model and compute ΔAIC= AIC other model − AIC best model . We use branch lengths from (a) the whole reconstructed squamate tree; and from different clades from the tree, namely (b) the gekkota clade, (c) the iguania clade, and (d) the anguimorpha clade For the reconstructed squamate tree, the iguania, and the gekkota clades, the resulting p values from these KS statistics are all significant ( p < 0.05), indicating that branch lengths drawn from these fitted distributions are significantly different than the empirical branch lengths. However, in the case of the anguimorpha clade, the p value are not significant ( p > 0.05) in most distributions, except for PH Inc . This indicates that branch lengths drawn from these fitted distributions are not statistically different compared to the empirical branch lengths

Angiosperm Phylogeny
To see how each model performs on an even larger tree, we also fit branch lengths from four different clades of the angiosperm phylogeny of (Zanne et al. 2014). The four different clades we use are: the monocotyledoneae clade (14,118 branches), the magnoliidae clade (2092 branches), the superrosidae clade (11,323 branches), and the superasteridae clade (20,016 branches). The model comparison results are summarized in Table 4. The general Coxian model are very strongly preferred over all the other models for all of the individual clades. Additionally, fitting to the model that follows PH Inc example is significantly worse than other distributions. Moreover, unlike the results in Table 2, the general Coxian model with four non-absorbing states fit best in this case. Interestingly, fitting

Fig. 10
Hazard rate functions for speciation show the change in the instantaneous probability of speciation as species age, as determined using the best-fitting general Coxian PH distribution with four non-absorbing states for the whole squamate tree (a), the iguania clade (b), the gekkota clade (c), and the anguimorpha clade (d). For each tree, branches are scaled by dividing each branch length leading to speciation event with height of the tree to the crBD model to this set of empirical data returns non-zero extinction rate for all of the individual clades and it fits better compared to the model following an exponential speciation rate without extinction. The absolute goodness-of-fit of different models is assessed in Fig. 11. Visually, both general Coxian PH distributions with three and four non-absorbing states give fairly similar densities. These two appear to fit better compared to the other distributions (in agreement with the AIC results in Table 4). Both of these distributions seem to capture the tail behavior fairly well, but do a poorer job of matching the density for shorter branch lengths. The lack of fit is supported by the KS tests which show a significant difference between the empirical branch lengths and branch lengths of 10 simulated trees from each best-fitting distribution (Table 5). Again, we hypothesize that this result could be due to ignoring extinction events in the model. Here, as with the squamate data, we observe that the density of the fitted distribution of PH Inc , which imposes increasing speciation rates as species age, does not follow the shape of the empirical histograms for any of the clades (Fig. 11). The hazard rate functions for speciation from the best-fitting general Coxian PH distribution with four non-absorbing states for the four major clades of the angiosperm phylogeny are shown in Fig. 12. The instantaneous rate of speciation declines in each case and the rate of decline appears to be different in major clades of the angiosperm tree.  The constant rate birth-death (crBD) model is the only model that includes extinction in this comparison. The numbers (3, 4, 5, 6) in the row labels for the general Coxian PH indicate the number of non-absorbing states. For the crBD model from Nee et al. (1994b), we adjust the log likelihood by subtracting log(( − 1)!) where is the number of tips on tree. We select the model that has the lowest AIC value as the base model and compute ΔAIC= AIC othermodel − AIC bestmodel . We use branch lengths from different clades of the angiosperm phylogeny, namely (a) the monocotyledoneae clade, (b) magnoliidae clade, (c) superasteridae clade and (d) the superrosidae clade

Discussion and Conclusion
Our macroevolutionary model for phylogenetic trees where times to speciation or extinction events are drawn from a Coxian PH distribution can produce phylogenetic trees with a range of tree shapes. The model provides a good fit to empirical data compared to exponential and Weibull distributions. The idea of applying PH distributions is motivated by the following two properties. First, it is well known that PH distributions are dense in the field of all positive-valued distributions (Asmussen et al. 1996), and thus, they are very flexible when fitting to empirical distributions. In particular, it implies that waiting times to either speciation or extinction events that follow any positive real-value distributions, such as exponential and Weibull, are well approximated using PH distribution with some given structure. Second, evolution of species trees or a species tree can be modeled as a forward-in-time process which follows an acyclic PH distribution. It is also known in the literature that any acyclic PH distribution can be represented as a Coxian PH distribution (Cumani 1982;Asmussen et al. 1996). Using a Coxian distribution is particularly useful here because its structure allows for the process to reach the absorbing state from any of the non-absorbing states, as described in Definition 2. This implies, using a general Coxian PH distribution, we can create an example where either speciation or extinction rates decrease or increase over time, by only changing parameter values inside the rate matrix Q, such as ones in PH Dec and PH inc . However, we recommend using the general Coxian PH distribution when used to fit to empirical data. We have demonstrated that trees generated under our model can have a range of different levels of tree balance as measured by the β statistic (Fig. 5). Thus, it is possible to fit parameters of our model to empirical tree shapes. The ability to get tree shapes that vary from the uniform distribution on ranked tree shapes (URT) in our model is expected based on the work of Lambert and Stadler (2013). A model with Coxian PH distributed times to speciation and exponentially distributed times to extinction is in class 4 of the scheme given in Lambert and Stadler (2013), in which the speciation process depends on a non-heritable trait (in this case species age).
In our simulations, we found that tree balance is mainly controlled by the speciation process and is largely invariant to the extinction process. In contrast to the behavior of β, the relative branch lengths, as measured by the γ statistic, are to a large extent

(d)
Branch Lengths from the Superrosidae clade Density Fig. 11 Histograms of empirical branch length density from the monocotyledoneae clade (a), the magnoliidae (b), the superasteridae clade (c) and the superrosidae clade (d) with the fitted branch length densities from the six distributions mentioned earlier. The yellow and blue lines are the fitted densities using the general Coxian PH distribution defined in Definition 2 with 3 and 4 non-absorbing states, respectively, the red line is the fitted density using the Coxian PH Dec example, the green line is the fitted density using the Coxian PH Inc example, and black and orange lines are the fitted density using Weibull and exponential distributions, respectively. The fitted densities for the general Coxian PH distribution with 5 and 6 non-absorbing states are not included because in most cases having the distribution with less number of non-absorbing states (e.g., four non-absorbing states) fit better (see Table 4) (Color figure online) controlled by the extinction process, but relatively invariant to the speciation process. Interestingly, unlike the β statistic where we found model parameters that gave values around −1, we did not find any model parameters that led to negative values of γ . We also found that using symmetric or asymmetric speciation modes did not have much effect on tree balance. These findings agree with the results in Hagen et al. (2015) in which speciation and extinction processes were modeled using Weibull distribution. We proposed a method of computing the β statistic based on sets of trees. We have demonstrated that computing the β statistic based on individual trees can be upwardly biased, particularly for trees with smaller numbers of taxa. For trees generated by a  YH process, computing the β statistic based on sets of trees gives a more accurate result (Fig. 4). This approach of computing a β value for a set of trees is useful in the context of simulated tree data, but beyond simulation studies, there may be other contexts where it is useful to estimate β for a set of trees. For example, when studying bio-geographic patterns researchers may have multiple species trees for the same set of geographic regions. It would also be possible to compute a single β value for a set of gene trees. We derived a likelihood expression for the probability of observing any reconstructed tree (Eqs. 14-16) that has evolved with PH distributed times to speciation (and no extinction); we applied it to both simulated and empirical data by applying the maximum likelihood method. We note that fitting parameters based on branch lengths taken from trees that include extinction, produces some bias in estimation of the speciation process (Fig. 7). The bias becomes more apparent with increasing rates of extinction (Fig. 8). In future work, we aim to generalize Eq. 14 to include extinction. Such an extension can potentially be done in a similar manner as the derivation for the likelihood under a BDP process as described in Eq. 19. Once we derive a generalized likelihood function, we will compare its performance with likelihood functions that consider both speciation and extinction events, such as in Rabosky (2006).
In Sect. 2.4, we have also given a different approach for deriving the likelihood expression of observing a tree evolving under a constant rate birth-death process. This expression in Eq. 19 provides new physical interpretations in the context of the process driving the evolution of phylogenetic tree, and it also has a nice relationship with the formula in Nee et al. (1994b) as described in Eq. 27. In terms of fitting the model to empirical data, we note that the likelihood must be conditioned on the survival of the original two branches descending from the root of the tree as seen in Eq. 27. This agrees with what Stadler (2013a) stated in her paper.
Finally, we have fitted the parameters of our model to the empirical data consisting of branch lengths from various clades in the squamate and angiosperm reconstructed phylogenies (Zheng and Wiens 2016;Zanne et al. 2014). In both cases, we found that the extra flexibility permitted by the Coxian PH distribution was favored by the AIC over the simpler Weibull and Exponential models. Interestingly, in both cases, the model using the Coxian PH distribution without extinction process still fits better than the constant rate birth-death model from Nee et al. (1994b) that includes extinction. Moreover, in one example, fitting using the Coxian PH distribution with three nonabsorbing states is preferable, but fitting using the distribution with four non-absorbing states is mostly preferred. Meanwhile, fitting to the same distribution with more than four non-absorbing states was always less favorable in the examples we looked at while also adding more computational time.
In the squamate phylogeny (Zheng and Wiens 2016), all the clades we examined (iguania, gekkota, anguimorpha) showed rates of speciation that declined slightly as species got older (Fig. 10). The whole squamate phylogeny also showed slight declining rates of speciation (almost constant rate). On the other hand, two clades (monocotyledoneae, superasteridae) from angiosperm phylogeny (Zanne et al. 2014) considered in this study showed apparent declining rates of speciation as species got older (Fig. 12a, c), while the other clades in the phylogeny (magnoliidae, superrosidae) only showed rates of speciation that decreased slightly (Fig. 12b, d). We caution against reading too much into these results as the model does not include extinction or account for incomplete sampling.
In summary, we have demonstrated that our macroevolutionary model with Coxian PH distribution, provides a better fit to empirical phylogenies, when compared to models with other distributions, including exponential and Weibull (Tables 2, 4). We conclude that it is necessary to use distributions with sufficient complexity, such as Coxian PH distributions, to provide a better fit to empirical phylogenies.
whereâ n (β) is a normalizing constant and B(x, y) is beta function.

Equivalent Formula of q n (i,ˇ) for Large n and i
Here, we show the work to approximate Eqs. 30 and 31 for large n and i, where n is the number of extant tips on the tree and i is the number of left tips on the tree. We use this approximation due to computational limitation of evaluating gamma function for large number. The formula also appears in the maxlik.betasplit from the TreeSimGM package (Hagen and Stadler 2018).
Lemma 1 Given large n and i, Eqs. 30 and 31 can be approximated using the following formula,q n (i, β) = 1 a n (β) whereâ n (β) is the normalizing constant forq n (i, β).
Proof Recall the Stirling's approximation for gamma function is given by Then, we claim that Lemma 2 Proof By Stirling's approximation with z = x + β and z = x + α, we have We observe here that 1 + β−α x+α x+α+1/2 → e β−α as x → ∞ and 1 + β Recall that q n (i, β) = 1 a n (β) . Then, we apply Eq. 47 for large n and i, q n (i, β) = 1 a n (β) That is, q n (i, β) =q n (i, β) if and only ifâ n (β) = a n (β) n 2β . To verify the result, we conduct a simulation for n = 500 and β = −1 (see Fig. 13). Estimated q n (i, β) for large n Theoretical q n (i, β) for large n

Expression of First and Second Moments from Coxian PH Distribution
In this section, we derive the expressions for first and second moments from a Coxian PH distribution, then we also derive those expressions for the two examples of a Coxian PH distribution used on this paper. The structure of the rate matrix Q follows canonical form 3 described in Okamura and Dohi (2016). Consider a Coxian PH distribution with four non-absorbing states defined by its rate matrix given as follows where 0 < p 1 , p 2 , p 3 ≤ 1. Furthermore, we have the condition that λ 1 ≥ λ 2 ≥ λ 3 ≥ λ 4 based on the result in Cumani (1982) and Dehon and Latouche (1982) for acyclic PH distributions. In order to derive the expression of first and second moments of a Coxian PH distribution, we compute the inverse matrix in Eq. 51 using the identity matrix of the same size and performing elementary row operations to derive I|(Q) −1 from (Q|I). Therefore, and QQ −1 = I where I is the identity matrix. Furthermore, Hence, the expressions for first and second moments from a Coxian PH distribution with the initial probability distribution α = [1, 0, 0, 0] and the rate matrix given by Eq. 51 are as follows E PH X 2 = 2αQ −2 1 = 2 1 0 0 0 ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ Next, to get the expressions for first and second moment from PH Dec , we use Eqs. 52 and 53 and the following substitutions, λ 1 = z, λ 2 = 1 + x, λ 3 = 1 + x 2 , λ 4 = x 3 , p 1 = 1 − y, p 2 = 1 − y 2 , p 3 = 1 − y 3 .

Deriving and Solving the Differential Equation of D (1) t (z)
In this section, we show the derivation of the differential equation of the probability of observing a reconstructed external branch with length z on a tree with age t, D (1) t (z), shown in Eq. 23, using physical interpretations. Then, we derive the solution to the differential equation as shown in Eq. 26.
We can write D (1) t (z) by conditioning on the time of the first event on that external branch with elapsed time z on a tree with age t. That is, (1) the branch has not undergone any observable event yet at time z, which occurs with probability e −(λ+μ)z or (2) the branch has a child at some time u ≤ z, which occurs with probability e −(λ+μ)u λ, and so the two branches evolve independently of each other where the child branch We solve the integrals in I and I I , , m = μe (μ−λ)z , dm = (μ − λ)mdz Thus, ln μe (μ−λ)z + λ − μ λ(μ − λ) ln λ − μe (μ−λ)z + c = 2 μ μ − λ ln μe (μ−λ)z − ln λ − μe (μ−λ)z + c.