On asymptotic joint distributions of cherries and pitchforks for random phylogenetic trees

Tree shape statistics provide valuable quantitative insights into evolutionary mechanisms underpinning phylogenetic trees, a commonly used graph representation of evolutionary relationships among taxonomic units ranging from viruses to species. We study two subtree counting statistics, the number of cherries and the number of pitchforks, for random phylogenetic trees generated by two widely used null tree models: the proportional to distinguishable arrangements (PDA) and the Yule-Harding-Kingman (YHK) models. By developing limit theorems for a version of extended Pólya urn models in which negative entries are permitted for their replacement matrices, we deduce the strong laws of large numbers and the central limit theorems for the joint distributions of these two counting statistics for the PDA and the YHK models. Our results indicate that the limiting behaviour of these two statistics, when appropriately scaled using the number of leaves in the underlying trees, is independent of the initial tree used in the tree generating process.


Introduction
As a common mathematical representation of evolutionary relationships among biological systems ranging from viruses to species, phylogenetic trees retain important signatures of the underlying evolutionary events and mechanisms which are often not directly observable, such as rates of speciation and expansion (Mooers et al. 2007;Heath et al. 2008). To utilise these signatures, one popular approach is to compare empirical shape indices computed from trees inferred from real datasets with those predicted by neutral models specifying a tree generating process (see, e.g. Blum and François 2006;Hagen et al. 2015). Moreover, topological tree shapes are also informative for understanding several fundamental statistics in population genetics (Ferretti et al. 2017;Arbisser et al. 2018) and important parameters in the dynamics of virus evolution and propagation (Colijn and Gardy 2014).
This paper focuses on two subtree counting statistics: the number of cherries (i.e., nodes that have precisely two descendent leaves) and that of pitchforks (i.e., nodes that have precisely three descendent leaves) in a tree. These statistics are related to monophyletic structures in phylogenetic trees (Rosenberg 2003) and have been utilised recently to study evolutionary dynamics of pathogens (Colijn and Gardy 2014). For example, the asymptotic frequency of cherries in pathogen trees generated by some models can be used to estimate the basic reproduction number (Plazzotta and Colijn 2016) and to study the impact of the underlying contact network over which a pathogen spreads (Metzig et al. 2019). Various properties concerning these statistics have been established in the past decades for the following two fundamental random phylogenetic tree models: the Yule-Harding-Kingman (YHK) model (Rosenberg 2006; Disanto and Wiehe 2013; Holmgren and Janson 2015) and the proportional to distinguishable arrangements (PDA) model (McKenzie and Steel 2000; Chang and Fuchs 2010; Wu and Choi 2016; Choi et al. 2020).
In this paper we are interested in the limiting behaviour of the joint cherry and pitchfork distributions for the YHK and the PDA models. In a seminal paper, McKenzie and Steel (2000) showed that cherry distributions converge to a normal distribution, which was later extended to pitchforks and other subtrees by Chang and Fuchs (2010). More recently, Holmgren and Janson (2015) studied subtree counts in the random binary search tree model, and their results imply that the cherry and pitchfork distributions converge jointly to a bivariate normal distribution under the YHK model. This is further investigated by Wu and Choi (2016) and Choi et al. (2020), where numerical results indicate that convergence to bivariate normal distributions holds under both the YHK model and the PDA model. Our main results, Theorems 1 and 2, provide a unifying approach to establishing the convergence of the joint distributions to bivariate normal distributions for both models, as well as a strong law stating that the joint counting statistics converge almost surely (a.s.) to a constant vector. Moreover, our results indicate that the limiting behaviour of these two statistics, when appropriately scaled, is independent of the initial tree used in the tree generating process.
Our approach is based on a general model in probability theory known as the Pólya urn scheme, which has been developed over the past few decades and applied to various growth phenomena with an underlying random tree structure (see, e.g. Mahmoud (2009) and the references therein). For instance, the results by McKenzie and Steel (2000) are based on a version of the urn model in which the off-diagonal elements in the replacement matrix are all non-negative. However, such technical constraints pose a central challenge for studying pitchfork distributions, as negative entries in the resulting replacement matrix are not confined only to the diagonal (see Sects. 4 and 5). To overcome this limitation, we study a family of extended Pólya urn models under certain technical assumptions in which negative entries are allowed for their replacement matrices (see Sect. 3). Inspired by the martingale approach used by Bai and Hu (2005), we present a self-contained proof of the limit theorems for this extended urn model, with the dual aims of completeness and accessibility. Our approach differs from a popular framework in which discrete urn models are embedded into continuous-time Markov chains known as branching processes (see, e.g. Janson (2004) and the references therein).
We summarize the contents of the rest of the paper. In the next section, we collect some definitions concerning phylogenetic trees and the two tree-based Markov processes. In Sect. 3, we introduce the urn model and a version of the Strong Law of Large Numbers and the Central Limit Theorem that are applicable to our study. We apply these two theorems to the YHK process in Sect. 4, and the PDA process in Sect. 5. These results are then extended to unrooted trees in Sect. 6. The proofs of the main results for the urn model are presented in Sect. 7, with a technical lemma included in the appendix. We conclude this paper in the last section with a discussion of our results and some open problems.

Preliminaries
In this section, we present some basic notation and background concerning phylogenetic trees, random tree models, and urn models. Throughout this paper, n is a positive integer greater than two unless stated otherwise.

Phylogenetic trees
A tree T = (V(T), E(T)) is a connected acyclic graph with vertex set V(T) and edge set E(T). A vertex is referred to as a leaf if it has degree one, and an interior vertex otherwise. An edge incident to a leaf is called a pendant edge, and we let E•(T) be the set of pendant edges in T. A tree is rooted if it contains exactly one distinguished degree one node designated as the root, which is not regarded as a leaf and is usually denoted by ρ, and unrooted otherwise. Moreover, the orientation of a rooted tree is from its root to its leaves. Other than those in Sect. 6, all trees considered in this paper are rooted and binary, that is, each interior vertex has precisely two children.
A phylogenetic tree on a finite set X is a rooted binary tree with leaves bijectively labelled by the elements of X. The set of binary rooted phylogenetic trees on {1, 2, ..., n} is denoted by T_n. See Fig. 1 for examples of trees in T_7 and T_8. Given an edge e in a phylogenetic tree T on X and a taxon x ∉ X, let T[e; x] be the phylogenetic tree on X ∪ {x} obtained by attaching a new leaf with label x to the edge e. Formally, let e = (u, v) and let w be a vertex not contained in V(T). Then T[e; x] is the tree with vertex set V(T) ∪ {w, x} and edge set obtained from E(T) by replacing the edge (u, v) with the three edges (u, w), (w, v) and (w, x). See Fig. 1 for an illustration of this construction, where the tree T_2 = T_1[e_1; 8] is obtained from T_1 by attaching leaf 8 to the edge e_1. We simply use T[e] instead of T[e; x] when the taxon name x is not essential.
Removing an edge in a phylogenetic tree T results in two connected components; the connected component that does not contain the root of T is referred to as a subtree of T , also commonly known as a fringe subtree. A subtree is called a cherry if it has two leaves, and a pitchfork if it has three leaves. Following the notation by Choi et al. (2020), let A(T ) and B(T ) be the number of pitchforks and cherries contained in T . For example, in Fig. 1 we have A(T 2 ) = 1 and B(T 2 ) = 3.

The YHK and the PDA processes
Let T n be the set of phylogenetic trees with n leaves. In this subsection, we introduce the two tree-based Markov processes investigated in this paper: the proportional to distinguishable arrangements (PDA) process and the Yule-Harding-Kingman (YHK) process. Our description of these two processes is largely based on that in Choi et al. (2020), which is adapted from the Markov processes as described by Steel (2016, Section 3.3.3).
Under the YHK process (Yule 1925; Harding 1971), starting with a given tree T_m in T_m with m ≥ 2, a random phylogenetic tree T_n in T_n is generated as follows: (i) select a uniform random permutation (x_1, ..., x_n) of (1, ..., n); (ii) take the given tree T_m, with its leaves relabelled by x_1, ..., x_m, as the current tree and set k = m; (iii) while k < n, choose a pendant edge e uniformly at random from the pendant edge set of T_k, let T_{k+1} = T_k[e; x_{k+1}], and increase k by one.
The PDA process can be described using a similar scheme; the only difference is that in Step (iii) the edge e is uniformly sampled from the edge set of T k , instead of the pendant edge set. Furthermore, under the PDA process, Step (i) can also be simplified by using a fixed permutation, say (1, 2, . . . , n). In the literature, the special case m = 2, for which T 2 is the unique tree with two leaves, is also referred to as the YHK model and the PDA model, respectively. For n ≥ 4, let A n and B n be the random variables A(T ) and B(T ), respectively, for a random tree T in T n . The probability distributions of A n (resp. B n ) are referred to as pitchfork distributions (resp. cherry distributions). In this paper, we are mainly interested in the limiting distributional properties of (A n , B n ).
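The two growth schemes above are straightforward to simulate, which is useful for sanity-checking the asymptotic results established later in the paper. The following Python sketch (the function names are ours; leaf labels are omitted since only the tree shape matters for counting) grows a rooted binary tree by repeatedly attaching a new leaf to a uniformly chosen pendant edge (YHK) or to a uniformly chosen edge, including the root edge (PDA), and then counts cherries and pitchforks by a post-order leaf count.

```python
import random

def new_leaf():
    return {"children": []}

def grow_tree(n, model, rng):
    """Grow a rooted binary tree with n leaves under the YHK or PDA process,
    starting from the unique two-leaf tree. The root rho has a single child,
    so a tree with k leaves has 2k - 1 edges, matching the rooted setting."""
    inner = {"children": [new_leaf(), new_leaf()]}
    rho = {"children": [inner]}
    edges = [(rho, 0), (inner, 0), (inner, 1)]  # edges as (parent, child-index)
    for _ in range(n - 2):
        if model == "YHK":   # sample among pendant edges only
            pool = [e for e in edges if not e[0]["children"][e[1]]["children"]]
        else:                # "PDA": sample among all edges, incl. the root edge
            pool = edges
        parent, i = rng.choice(pool)
        # attaching to (parent, i): subdivide the edge and hang a new leaf
        mid = {"children": [parent["children"][i], new_leaf()]}
        parent["children"][i] = mid
        edges.append((mid, 0))
        edges.append((mid, 1))
    return rho

def _leaves_below(node, counts):
    """Return the number of leaves below node, tallying 2- and 3-leaf subtrees."""
    if not node["children"]:
        return 1
    k = sum(_leaves_below(c, counts) for c in node["children"])
    if k == 2:
        counts["cherries"] += 1
    elif k == 3:
        counts["pitchforks"] += 1
    return k

def count_cherries_pitchforks(tree):
    """Return (A(T), B(T)) for a tree produced by grow_tree."""
    counts = {"cherries": 0, "pitchforks": 0}
    _leaves_below(tree["children"][0], counts)
    return counts["pitchforks"], counts["cherries"]
```

Averaging (A_n/n, B_n/n) over many simulated trees should approach the constants derived in Sects. 4 and 5, namely (1/6, 1/3) for the YHK model and (1/8, 1/4) for the PDA model.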

Modes of convergence
Let X, X_1, X_2, ... be random variables on some probability space (Ω, F, P). To study the urn model we will use the following four modes of convergence (see, e.g. Grimmett and Stirzaker (2001, Section 7.2) for more details). First, X_n is said to converge to X almost surely, written X_n →a.s. X, if {ω ∈ Ω : X_n(ω) → X(ω) as n → ∞} is an event with probability 1. Next, X_n is said to converge to X in r-th mean, where r > 0, written X_n →r X, if E(|X_n|^r) < ∞ for all n and E(|X_n − X|^r) → 0 as n → ∞. Furthermore, X_n is said to converge to X in probability, written X_n →p X, if P(|X_n − X| > ε) → 0 as n → ∞ for all ε > 0. Finally, X_n converges to a random variable Y in distribution, also termed weak convergence or convergence in law, written X_n →d Y, if P(X_n ≤ x) → P(Y ≤ x) as n → ∞ for every point x at which the distribution function of Y is continuous. In particular, X_n →p X holds whenever X_n →a.s. X holds or X_n →r X holds for some r > 0.

Miscellaneous
Let 0 = (0, . . . , 0) be the d-dimensional zero row vector. Let e = (1, . . . , 1) be the d-dimensional row vector whose entries are all one, and for 1 ≤ j ≤ d, let e j denote the j-th canonical row vector whose j-th entry is 1 while the other entries are all zero.
Let diag(a_1, ..., a_d) denote a diagonal matrix whose diagonal elements are a_1, ..., a_d. Furthermore, 0_{d×d} is the d × d matrix whose entries are all zero. Here Z′ denotes the transpose of Z, where Z can be either a vector or a matrix.

Urn models
In this section, we briefly recall the classical Pólya urn model and some of its generalisations. The Pólya urn model was studied by Pólya (1930) and can be traced back to Markov (see, e.g. Johnson and Kotz (1977, Section 1.2)). It has been applied to describe evolutionary processes in biology and computer science; several such applications in genetics are discussed by Johnson and Kotz (1977, Chapter 5) and by Mahmoud (2009, Chapters 8 and 9). In a general setup, consider an urn with balls of d different colours containing C_{0,i} balls of colour i ∈ {1, 2, ..., d} at time 0. At each time step, a ball is drawn uniformly at random and returned with some extra balls, depending on the colour selected. The reinforcement scheme is often described by a d × d matrix R: if the colour of the ball drawn is i, then we return the selected ball along with R_{ij} balls of colour j, for every j ∈ {1, 2, ..., d}, where a positive value of R_{ij} means adding R_{ij} balls and a negative value of R_{ij} means removing |R_{ij}| balls from the urn. Such a matrix is termed the replacement matrix in the literature. For instance, the replacement matrix R is the identity matrix for the original Pólya urn model with d colours: at each time point, the selected ball is returned with one additional ball of the same colour. We restrict our attention to tenable urn processes, that is, those for which it is always possible to add or remove balls according to the matrix R.
Let C_n = (C_{n,1}, ..., C_{n,d}) be the row vector of dimension d that represents the ball configuration at time n for an urn model with d colours, in which each entry is necessarily non-negative and at least one of these entries is greater than 0. Then the sum of the C_{n,i}, denoted by t_n, is the number of balls in the urn at time n. Note that throughout this paper, t_n is always a number greater than 0. Recall that a vector is referred to as a stochastic vector if each entry in the vector is a non-negative real number and the sum of its entries is one. Denote the stochastic vector associated with C_n by Ĉ_n, that is, we have Ĉ_{n,i} = C_{n,i}/t_n for 1 ≤ i ≤ d.
Let F_n be the information of the urn's configuration from time 0 up to n, that is, the σ-algebra generated by C_0, C_1, ..., C_n. Let R denote the replacement matrix. Then, for every n ≥ 1, we have

C_n = C_{n−1} + χ_n R, (1)

where χ_n is a random row vector of length d such that for i = 1, ..., d,

P(χ_n = e_i | F_{n−1}) = Ĉ_{n−1,i}. (2)

Since precisely one entry in χ_n is 1 and all others are 0, it follows that E[χ_n | F_{n−1}] = Ĉ_{n−1} and E[χ_n′ χ_n | F_{n−1}] = diag(Ĉ_{n−1}).
We state the following assumptions about the replacement matrix R:

(A1) Tenable: It is always possible to draw balls and follow the replacement rule, that is, we never get stuck in following the rules (see, e.g. Mahmoud (2009, p. 46)).

(A2) Small: All eigenvalues of R are real, and the maximal eigenvalue λ_1 = s is positive with λ_1 > 2λ for all other eigenvalues λ of R.

(A3) Strictly Balanced: The column vector e′ is a right eigenvector of R corresponding to λ_1, and one of the left eigenvectors corresponding to λ_1 is a stochastic vector. Note that e′ being a right eigenvector implies t_n = t_0 + ns, and hence the urn models discussed here are balanced, as commonly known in the literature.

(A4) Diagonalisable: R is diagonalisable over the real numbers. That is, there exists an invertible matrix U with real entries such that

U^{−1} R U = diag(λ_1, λ_2, ..., λ_d), (3)

where λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d are the eigenvalues of R.
For the matrix U in (A4) and 1 ≤ j ≤ d, let u_j = U e_j′ denote the j-th column of U, and v_j = e_j U^{−1} the j-th row of U^{−1}. Then u_j and v_j are, respectively, right and left eigenvectors corresponding to λ_j. Furthermore, since v_i u_j = e_i U^{−1} U e_j′ = e_i I e_j′, where I is the identity matrix, we have

v_i u_j = 1 if i = j, and v_i u_j = 0 otherwise. (4)

In view of (A3), (A4) and (4), for simplicity the following convention is used throughout this paper:

u_1 = e′ and v_1 is a stochastic vector. (5)
Furthermore, the eigenvalue λ_1 is referred to as the principal eigenvalue, and u_1 and v_1 specified in (5) as the principal right and principal left eigenvector, respectively. Motivated by adaptive clinical trial problems, Bai and Hu (2005) derived limit results for an urn model by martingale techniques. Moreover, they considered random replacement matrices but required the replacement matrix to have non-negative entries. On the other hand, the limit results derived in Janson (2004) are based on an embedding of the urn model into a continuous time branching process under certain non-trivial technical assumptions on the associated branching process. In this paper, we prove first and second order limit results for an urn model whose replacement matrix may contain negative entries. As mentioned earlier, our proofs are based on the martingale approach for urn models used by Bai and Hu (2005). Under assumptions (A1)-(A4), the exact expression for the limiting variance matrix agrees with the one obtained by Bai and Hu (2005) and by Janson (2004). Notice that the assumption of real eigenvalues in (A2) and real eigenvectors in (A4) is chosen to make our proof more accessible to a wider audience by simplifying the expressions and the proofs. Indeed, our proof can be extended to the case where the dominating eigenvalue λ_1 is real while the eigenvalues λ_2, ..., λ_d are complex-valued with real parts less than λ_1/2, as in one of the cases studied in Janson (2004).
The limit of the urn process and the rate of convergence to the limiting vector depend on certain spectral properties of the matrix R (see, e.g. Janson (2004) or Bai and Hu (2005)). In our context, it suffices to consider the extended Pólya urn model under the aforementioned assumptions (A1)-(A4), for which Theorems 1 and 2 below give the Strong Law of Large Numbers and the Central Limit Theorem. Our proofs, which are adapted from those of Bai and Hu (2005), are presented in Sect. 7.

Theorem 1 Under assumptions (A1)-(A4), we have, as n → ∞,

n^{−1} C_n → s v_1 almost surely and in L^r for every r > 0,

where s is the principal eigenvalue and v_1 is the principal left eigenvector.
Let N (0, Σ) be the multivariate normal distribution with mean vector 0 and covariance matrix Σ.
Theorem 2 Under assumptions (A1)-(A4), we have

n^{−1/2} (C_n − n s v_1) →d N(0, Σ) as n → ∞,

where s is the principal eigenvalue, v_1 is the principal left eigenvector, and Σ is the limiting covariance matrix, whose exact expression agrees with those obtained by Bai and Hu (2005) and Janson (2004).

Remark 1 During the reviewing process of this paper, a reviewer suggested that an alternative approach to establishing Theorems 1 and 2 might be based on Janson (2004, Theorems 3.21 & 3.22 and Remark 4.2) and a result on the super-critical Galton-Watson process (see, e.g. Athreya and Ney (1972, Theorem 2(i) in Section III.7)), which could potentially lead to a stronger version of the results presented here.

Limiting distributions under the YHK model
A cherry is said to be independent if it is not contained in any pitchfork, and dependent otherwise. Similarly, a pendant edge is independent if it is contained in neither a pitchfork nor a cherry. In this section, we study the limiting joint distribution of the random variables A_n (i.e., the number of pitchforks) and B_n (i.e., the number of cherries) under the YHK model. To study the joint distribution of cherries and pitchforks, we extend the urn models used in McKenzie and Steel (2000) (see also Steel (2016, Section 3.4)) as follows. Each pendant edge in a phylogenetic tree is designated as one of the following four types:

(E1): a type 1 edge is a pendant edge in a dependent cherry (i.e., contained in both a cherry and a pitchfork);
(E2): a type 2 edge is a pendant edge in an independent cherry (i.e., contained in a cherry but not a pitchfork);
(E3): a type 3 edge is a pendant edge contained in a pitchfork but not a cherry;
(E4): a type 4 edge is an independent pendant edge (i.e., contained in neither a pitchfork nor a cherry).
It is straightforward to see that any pendant edge in a phylogenetic tree with at least two leaves belongs to one and only one of the above four types. Furthermore, the numbers of pitchforks and independent cherries in a tree are precisely half of the numbers of type 1 and type 2 edges, respectively.
As illustrated in Fig. 2, the composition of the types of the pendant edges in T[e], the tree obtained from T by attaching an extra leaf to a pendant edge e, is determined by the composition of pendant edge types in T and the type of e, as follows. When e is type 1, the number of type 4 edges in T[e] increases by one compared with that in T, while the number of edges of each of the other three types is unchanged. This holds because T[e] and T have the same number of cherries and the same number of pitchforks (see T_3 and T_4 in Fig. 2). When e is type 2, the number of type 2 edges decreases by two while the numbers of type 1 and type 3 edges increase by two and one, respectively. This is because in this case one independent cherry is replaced by one pitchfork (see T_2 and T_3 in Fig. 2). When e is type 3, one pitchfork is replaced by two independent cherries, hence the number of type 2 edges increases by four while the numbers of edges of type 1 and of type 3 decrease by two and one, respectively (see T_5 and T_6 in Fig. 2). Finally, when e is type 4, one independent pendant edge is replaced by one independent cherry, and hence the number of type 2 edges increases by two and that of type 4 edges decreases by one (see T_4 and T_5 in Fig. 2).

Fig. 2 (i) An example of sample paths of the YHK model evolving from T_2 with two leaves to T_6 with six leaves. The labels of the leaves are omitted for simplicity. The type of each pendant edge is indicated by the circled number next to it; for 2 ≤ i ≤ 5, the edge selected in T_i to generate T_{i+1} is highlighted in bold and the associated edge type is indicated in the circled number above the arrows. (ii) The associated urn model with four colours, derived from the types of pendant edges in the trees. Note that in vector form we have C_0 = (0, 2, 0, 0), C_1 = (2, 0, 1, 0), C_2 = (2, 0, 1, 1), C_3 = (2, 2, 1, 0), and C_4 = (0, 6, 0, 0).
Using the dynamics described in the last paragraph, we can associate a YHK process starting with a tree T_m with a corresponding urn process (C_0, R) as follows. The urn model contains four colours in which colour i (1 ≤ i ≤ 4) is designated for type i edges. In the initial urn C_0 = (C_{0,1}, ..., C_{0,4}), the number C_{0,i} is precisely the number of type i edges in T_m. Furthermore, the replacement matrix R is the following 4 × 4 matrix, whose i-th row records the change in the numbers of type 1 to type 4 edges when a type i edge is selected:

R =
[  0   0   0   1 ]
[  2  −2   1   0 ]
[ −2   4  −1   0 ]
[  0   2   0  −1 ]    (8)

Given an arbitrary tree T, let α(T) = (|E_1(T)|, |E_2(T)|, |E_3(T)|, |E_4(T)|) be the type vector associated with T, where |E_i(T)| counts the number of type i pendant edges in T. The following result will enable us to obtain the joint distribution of pitchforks and cherries for the YHK model. Moreover, it also implies that the asymptotic behaviour of these two statistics, when appropriately scaled, is independent of the initial tree used in the YHK process.
Theorem 3 Suppose that T_m is an arbitrary phylogenetic tree with m leaves, m ≥ 2, and that T_n is a tree with n leaves generated by the YHK process starting with T_m. Then, as n → ∞, we have

n^{−1} α(T_n) → v_1 almost surely, (9)

and

n^{−1/2} (α(T_n) − n v_1) →d N(0, Σ), (10)

where v_1 = (1/3, 1/3, 1/6, 1/6) and Σ is the covariance matrix determined by R via Theorem 2.

Proof For k ≥ 0, let C_k = α(T_{m+k}). Then (C_k, R) is the urn model of 4 colours derived from the pendant edge decomposition of the YHK process. Therefore, it is a tenable model with C_0 = α(T_m) and replacement matrix R as given in (8).

Note that R is diagonalisable over the reals, with U^{−1} R U = diag(1, 0, −2, −3) for an invertible real matrix U. Therefore, R satisfies condition (A4). Next, (A2) holds because R has eigenvalues λ_1 = 1, λ_2 = 0, λ_3 = −2 and λ_4 = −3, where s = λ_1 = 1 is the principal eigenvalue. Furthermore, put u_i = U e_i′ and v_i = e_i U^{−1} for 1 ≤ i ≤ 4. Then (A3) follows by noting that u_1 = (1, 1, 1, 1)′ is the principal right eigenvector and v_1 = (1/6)(2, 2, 1, 1) is the principal left eigenvector.

Since (A1)-(A4) are satisfied by the replacement matrix R, by Theorem 1 it follows that k^{−1} C_k → s v_1 = v_1 almost surely as k → ∞. By Theorem 2 we have

k^{−1/2} (C_k − k v_1) →d N(0, Σ) as k → ∞. (12)

Therefore, noting that α(T_n) = C_{n−m}, we have

n^{−1} α(T_n) → v_1 almost surely and n^{−1/2} (α(T_n) − n v_1) →d N(0, Σ) as n → ∞.

Here the convergence follows from (12) and the fact that √(n−m)/√n converges to 1 and m v_1/√n converges to 0 when n approaches infinity.

By Theorem 3, it is straightforward to obtain the following result on the joint distribution of cherries and pitchforks, which also follows from a general result by Holmgren and Janson (2015, Theorem 1.22).
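The spectral claims in the proof can be verified in exact arithmetic. The Python check below uses an integer Laplace expansion of the determinant; the matrix entries and the eigenvalue list 1, 0, −2, −3 are our transcription of the material above, so this is a consistency check rather than part of the proof.

```python
from fractions import Fraction

# replacement matrix (8) for the YHK edge-type urn
R = [[0, 0, 0, 1],
     [2, -2, 1, 0],
     [-2, 4, -1, 0],
     [0, 2, 0, -1]]

def det(M):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, a in enumerate(M[0]):
        if a:
            minor = [row[:j] + row[j + 1:] for row in M[1:]]
            total += (-1) ** j * a * det(minor)
    return total

def char_poly_at(M, lam):
    """det(M - lam * I), evaluated exactly."""
    d = len(M)
    return det([[M[i][j] - (lam if i == j else 0) for j in range(d)]
                for i in range(d)])

eigenvalues = [1, 0, -2, -3]
all_roots = all(char_poly_at(R, lam) == 0 for lam in eigenvalues)

# principal eigenvectors: the row sums give R e' = e' (so s = 1), and v1 R = v1
row_sums = [sum(row) for row in R]
v1 = [Fraction(x, 6) for x in (2, 2, 1, 1)]
v1R = [sum(v1[i] * R[i][j] for i in range(4)) for j in range(4)]
```

Here all_roots is True, row_sums equals [1, 1, 1, 1], and v1R equals v1, confirming that s = 1 with principal left eigenvector (1/3, 1/3, 1/6, 1/6).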
Corollary 1 Under the YHK model, for the joint distribution (A_n, B_n) of pitchforks and cherries we have, as n → ∞,

n^{−1} (A_n, B_n) → (1/6, 1/3) almost surely,

and n^{−1/2} ((A_n, B_n) − n (1/6, 1/3)) converges in distribution to a bivariate normal distribution with mean vector 0.

Proof Consider the YHK process {T_n}_{n≥2} starting with a tree T_2 with two leaves. Denote the i-th entry in α(T_n) by α_{n,i} for 1 ≤ i ≤ 4. Then the corollary follows from Theorem 3 by noting that we have A_n = α_{n,1}/2 and B_n = (α_{n,1} + α_{n,2})/2.
The above result is consistent with the previously known exact formulas for the means, variances and covariance of the joint distribution of cherries and pitchforks under the YHK model for n ≥ 7 (see, e.g., Wu and Choi (2016); Choi et al. (2020)).
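The limiting frequencies in Theorem 3 and Corollary 1 can also be checked numerically by iterating the four-colour urn directly (a Python sketch; the matrix entries transcribe the pendant-edge dynamics described above, and the long-run colour frequencies should approach v_1 = (1/3, 1/3, 1/6, 1/6)).

```python
import random

# Edge-type replacement matrix for the YHK process (rows: drawn type 1-4),
# transcribing the pendant-edge dynamics described in the text.
R = [
    [ 0,  0,  0,  1],   # type 1 drawn: one new type-4 edge
    [ 2, -2,  1,  0],   # type 2 drawn: independent cherry becomes a pitchfork
    [-2,  4, -1,  0],   # type 3 drawn: pitchfork becomes two independent cherries
    [ 0,  2,  0, -1],   # type 4 drawn: independent edge becomes a cherry
]

def run_yhk_urn(steps, rng):
    counts = [0, 2, 0, 0]          # C_0 for the two-leaf starting tree
    for _ in range(steps):
        total = sum(counts)
        r = rng.random() * total
        acc = 0
        for i, c in enumerate(counts):
            acc += c
            if r < acc:
                break
        for j, d in enumerate(R[i]):
            counts[j] += d
        assert min(counts) >= 0    # tenability along the whole path
    return counts

rng = random.Random(7)
c = run_yhk_urn(20000, rng)
freqs = [x / sum(c) for x in c]    # should be close to (1/3, 1/3, 1/6, 1/6)
```

Since each row of R sums to 1, the total count after n steps is t_n = 2 + n, in line with assumption (A3).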

Limiting distributions under the PDA model
In this section, we study the limiting joint distribution of the random variables A_n (i.e., the number of pitchforks) and B_n (i.e., the number of cherries) under the PDA model.
To study the PDA model, in addition to the four edge types (E1)-(E4) considered in Sect. 4, which partitions the set of pendant edges, we need two additional edge types concerning the internal edges. Specifically, (E5): a type 5 edge is an internal edge adjacent to an independent cherry; (E6): a type 6 edge is an internal edge that is not type 5.
For 1 ≤ i ≤ 6, let E_i(T) be the set of edges of type i. Then the edge sets E_1(T), ..., E_6(T) form a partition of the edge set of T. That is, each edge in T belongs to one and only one E_i(T). Furthermore, let β(T) = (|E_1(T)|, ..., |E_6(T)|) be the type vector associated with T, where |E_i(T)| counts the number of type i edges in T.
As illustrated in Fig. 3, the composition of the types of the edges in T[e] is determined by the composition of edge types in T and the type of the selected edge e, as follows. When e is type 1, the numbers of type 4 and type 6 edges each increase by one. When e is type 2, the number of type 2 edges decreases by two and that of type 5 edges decreases by one, while the numbers of type 1, type 3 and type 6 edges increase by two, one and two, respectively. When e is type 3, the numbers of type 1, type 3 and type 6 edges decrease by two, one and one, respectively, while the numbers of type 2 and type 5 edges increase by four and two. When e is type 4, the number of type 4 edges decreases by one while the numbers of type 2 and type 5 edges increase by two and one, respectively. Finally, when e is type 5, the change it causes is the same as that of a type 2 edge, and when e is type 6, the change it causes is the same as that of a type 1 edge. Therefore, we can associate a PDA process starting with a tree T_0 with a corresponding urn process (C_0, R) as follows. The urn model contains six colours in which colour i (1 ≤ i ≤ 6) is designated for type i edges. In the initial urn C_0 = (C_{0,1}, ..., C_{0,6}), the number C_{0,i} is precisely the number of type i edges in T_0. Furthermore, the replacement matrix R is the following 6 × 6 matrix:

R =
[  0   0   0   1   0   1 ]
[  2  −2   1   0  −1   2 ]
[ −2   4  −1   0   2  −1 ]
[  0   2   0  −1   1   0 ]
[  2  −2   1   0  −1   2 ]
[  0   0   0   1   0   1 ]    (16)

Note that the replacement matrix for the YHK model in (8) is a submatrix of the replacement matrix in (16), and the last (respectively, second last) row in (16) is the same as its first (respectively, second) row. These two observations are direct consequences of the dynamics described above. The theorem below describes the asymptotic behaviour of β(T_n), which enables us to deduce the asymptotic properties of the joint distribution of the number of pitchforks and the number of cherries for the PDA model in Corollary 2. Moreover, it also implies that the asymptotic behaviour of these two statistics, when appropriately scaled, is independent of the initial tree used in the PDA process.
Theorem 4 Suppose that T_m is an arbitrary phylogenetic tree with m leaves, m ≥ 2, and that T_n is a tree with n leaves generated by the PDA process starting with T_m. Then, as n → ∞, we have

n^{−1} β(T_n) → 2 v_1 almost surely, (17)

and

n^{−1/2} (β(T_n) − 2n v_1) →d N(0, Σ), (18)

where v_1 = (1/16)(2, 2, 1, 3, 1, 7) and Σ is the covariance matrix determined by R via Theorem 2.

Proof One can check that the replacement matrix R in (16) satisfies (A1)-(A4), with principal eigenvalue s = 2, principal right eigenvector u_1 = e′, and principal left eigenvector v_1 = (1/16)(2, 2, 1, 3, 1, 7). The remainder of the proof is similar to the final part of the proof of Theorem 3, and hence we only outline the main steps. For k ≥ 0, let C_k = β(T_{m+k}). Since (A1)-(A4) are satisfied by the replacement matrix R, by Theorem 1 it follows that k^{−1} C_k → s v_1 = 2 v_1 almost surely as k → ∞. By Theorem 2 we have k^{−1/2}(C_k − 2k v_1) →d N(0, Σ) as k → ∞. Therefore, noting that β(T_n) = C_{n−m}, we obtain (17) and (18).

Similar to Corollary 1, by Theorem 4 it is straightforward to obtain the following result on the joint distribution of cherries and pitchforks.

Corollary 2 Under the PDA model, for the joint distribution (A_n, B_n) of pitchforks and cherries we have, as n → ∞,

n^{−1} (A_n, B_n) → (1/8, 1/4) almost surely,

and n^{−1/2} ((A_n, B_n) − n (1/8, 1/4)) converges in distribution to a bivariate normal distribution with mean vector 0.
Proof Consider the PDA process {T_n}_{n≥2} starting with a tree T_2 with two leaves. Denote the i-th entry in β(T_n) by β_{n,i} for 1 ≤ i ≤ 6. Then the corollary follows from Theorem 4 by noting that we have A_n = β_{n,1}/2 and B_n = (β_{n,1} + β_{n,2})/2.
The above result is consistent with the previously known exact formulas for the means, variances and covariance of the joint distribution of cherries and pitchforks under the PDA model for n ≥ 7 (see, e.g., Wu and Choi (2016); Choi et al. (2020)).
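The structural observations about the matrix (16) stated above can be checked mechanically in exact rational arithmetic (a Python sketch; the entries of R6 and R4 are our transcription of the edge-type dynamics, and v_1 is the vector appearing in Theorem 4).

```python
from fractions import Fraction

# 6x6 replacement matrix (16) for the PDA process (rows: drawn edge type 1-6)
R6 = [
    [ 0,  0,  0,  1,  0,  1],
    [ 2, -2,  1,  0, -1,  2],
    [-2,  4, -1,  0,  2, -1],
    [ 0,  2,  0, -1,  1,  0],
    [ 2, -2,  1,  0, -1,  2],   # type 5: same change as type 2
    [ 0,  0,  0,  1,  0,  1],   # type 6: same change as type 1
]
# 4x4 replacement matrix (8) for the YHK process
R4 = [[0, 0, 0, 1], [2, -2, 1, 0], [-2, 4, -1, 0], [0, 2, 0, -1]]

# (i) each row sums to s = 2, so the urn is balanced with t_n = t_0 + 2n
row_sums = [sum(row) for row in R6]
# (ii) the YHK matrix is the top-left submatrix of (16)
top_left = [row[:4] for row in R6[:4]]
# (iii) v1 = (2,2,1,3,1,7)/16 is a left eigenvector for the eigenvalue 2
v1 = [Fraction(x, 16) for x in (2, 2, 1, 3, 1, 7)]
left = [sum(v1[i] * R6[i][j] for i in range(6)) for j in range(6)]
```

Here row_sums is [2, 2, 2, 2, 2, 2], top_left equals R4, rows 5 and 6 duplicate rows 2 and 1, and left equals 2·v1, consistent with the principal eigenvalue s = 2 and the principal left eigenvector in Theorem 4.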

Unrooted trees
Although rooted phylogenetic trees are often preferred by biologists as time is explicitly shown, it is also important to consider unrooted phylogenetic trees. Indeed, many methods for building trees from real data can usually do so only up to the placement of the root, and thus produce unrooted trees first and then determine the root position (see, e.g. Steel (2016, Section 1.3)). In this section, we extend our results in Sects. 4 and 5 to unrooted phylogenetic trees. Formally, deleting the root ρ of a rooted phylogenetic tree and suppressing its adjacent interior vertex r results in an unrooted tree (see Fig. 4). The set of unrooted phylogenetic trees on {1, 2, ..., n} is denoted by T_n. The YHK process on unrooted phylogenetic trees is similar to that on rooted ones stated in Sect. 2.2; the only difference is that at step (ii) we shall start with an unrooted phylogenetic tree T_m in T_m for m ≥ 3. A similar modification suffices for the PDA process on unrooted phylogenetic trees; see Choi et al. (2020) for more details. Note that the concepts of cherries and pitchforks can be naturally extended to unrooted trees in T_n for n ≥ 6. Moreover, let A_n and B_n be the random variables counting the number of pitchforks and cherries in a random tree in T_n.
To associate urn models with the two processes on unrooted trees, note that for a tree T in T_n with n ≥ 6, we can decompose the edges in T into the six types similar to those for rooted trees, and hence define α(T) and β(T) correspondingly. Furthermore, the replacement matrices are the same as in the rooted case; that is, the replacement matrix for the YHK process is given in (8) and the one for the PDA process is given in (16).

Fig. 4 Example of sample paths for the PDA process on unrooted trees and the associated urn model. Two sample paths of the PDA process evolving from T_5: one ends with T^1_7 using the edges in bold and the other with T^2_7 using the edges in grey. Leaf labels are omitted for simplicity. Note that in vector form we have β(T^1_6) = (4, 0, 2, 0, 0, 3) and β(T^2_6) = (0, 6, 0, 0, 3, 0).
See the two examples in Fig. 4. We emphasize that the condition n ≥ 6 is essential here: for instance, there is no appropriate type assignment for the edge e_2 of the tree T_5 in Fig. 4 in our scheme, as neither type 3 nor type 4 satisfies the requirement of a valid urn model. This observation is in line with the treatment of unrooted trees in Choi et al. (2020). Note that there is only one unrooted tree shape for n = 4 and one for n = 5, and there are only two tree shapes for trees in T_6 (as depicted by T^1_6 and T^2_6 in Fig. 4). In particular, putting α^1_6 = (4, 0, 2, 0) and α^2_6 = (0, 6, 0, 0), for each T in T_6 we have either α(T) = α^1_6 or α(T) = α^2_6. Now we extend Theorem 3 and Corollary 1 to the following result concerning the limiting behaviour of the YHK process. Similar to the rooted version, the asymptotic behaviour of the frequencies of cherries and pitchforks, when appropriately scaled, is independent of the initial trees used in the unrooted YHK process.

Theorem 5 Suppose that T_m is an arbitrary unrooted phylogenetic tree with m leaves, m ≥ 6, and that T_n is an unrooted tree with n leaves generated by the YHK process starting with T_m. Then, as n → ∞,

n^{−1} α(T_n) → v_1 almost surely and n^{−1/2} (α(T_n) − n v_1) →d N(0, Σ), (24)

where v_1 = (1/3, 1/3, 1/6, 1/6) and Σ is given in Eq. (10). In particular, as n → ∞,

n^{−1} (A_n, B_n) → (1/6, 1/3) almost surely, and n^{−1/2} ((A_n, B_n) − n (1/6, 1/3)) converges in distribution to a bivariate normal distribution with mean vector 0. (25)

Proof The proof of (24) follows an argument similar to that for Theorem 4. To establish (25), consider the YHK process {T_n}_{n≥2} starting with a tree T_2 with two leaves. For n ≥ 6, let α_n = α(T_n) and let α_{n,i} denote the i-th entry in α(T_n) for 1 ≤ i ≤ 4.
Finally, combining Theorem 4, Corollary 2, and an argument similar to the proof of Theorem 5 leads to the following result concerning the limiting behaviour of the unrooted PDA process, whose proof is hence omitted.

Theorem 6 Suppose that T_m is an arbitrary unrooted phylogenetic tree with m leaves, m ≥ 6, and that T_n is an unrooted tree with n leaves generated by the PDA process starting with T_m. Then, as n → ∞,

n^{−1} β(T_n) → 2 v_1 almost surely and n^{−1/2} (β(T_n) − 2n v_1) →d N(0, Σ), (26)

where v_1 = (1/16)(2, 2, 1, 3, 1, 7) and Σ is given in Eq. (18). In particular, as n → ∞,

n^{−1} (A_n, B_n) → (1/8, 1/4) almost surely, and n^{−1/2} ((A_n, B_n) − n (1/8, 1/4)) converges in distribution to a bivariate normal distribution with mean vector 0. (27)

Proofs of Theorems 1 and 2
In this section, we shall present the proofs of Theorems 1 and 2. To this end, it is more natural to consider Y_n := C_n U, a linear transform of C_n. Next we introduce

ξ_k := Y_k − E[Y_k | F_{k−1}], k ≥ 1. (28)

For 1 ≤ j ≤ d, consider the following numbers

b_{n,n}(j) = 1 and b_{n,k}(j) = ∏_{i=k}^{n−1} (1 + λ_j/t_i) for 0 ≤ k < n. (29)

Moreover, we introduce the following diagonal matrix for 0 ≤ k ≤ n:

B_{n,k} = diag( b_{n,k}(1), ..., b_{n,k}(d) ). (30)

Then we have the following key observation:

Y_n = Y_0 B_{n,0} + Σ_{k=1}^{n} ξ_k B_{n,k}. (31)

To see that (31) holds, let Q_k = I + t_{k−1}^{−1} R for 1 ≤ k ≤ n, where I is the identity matrix. Then, by (1) and (2), we have

C_n = C_{n−1} Q_n + (χ_n − Ĉ_{n−1}) R, and hence C_n = C_0 Q_1 ⋯ Q_n + Σ_{k=1}^{n} (χ_k − Ĉ_{k−1}) R Q_{k+1} ⋯ Q_n. (32)

Since U^{−1} Q_k U = I + t_{k−1}^{−1} diag(λ_1, ..., λ_d) holds for 1 ≤ k ≤ n and Y_n = C_n U, it is straightforward to see that (31) follows from transforming (32) by a right multiplication of U.

Next, we shall present several properties concerning ξ_k. To this end, consider the sequence of random vectors

τ_k := χ_k − E[χ_k | F_{k−1}] = χ_k − Ĉ_{k−1}, k ≥ 1. (33)

Furthermore, since the entries in χ_k are either 0 or 1 and E[χ_k | F_{k−1}] = Ĉ_{k−1}, the random vector τ_k is bounded. As a bounded martingale difference sequence, τ_k is uncorrelated. To see this, assume that ℓ < k; then we have

E[τ_ℓ′ τ_k] = E[ E[τ_ℓ′ τ_k | F_{k−1}] ] = E[ τ_ℓ′ E[τ_k | F_{k−1}] ] = 0,

where the first equality follows from the law of total expectation and the second from τ_ℓ being F_{k−1}-measurable. A similar argument shows E[τ_ℓ τ_k′] = 0. Consequently, we have the following expression showing that distinct τ_k and τ_ℓ are uncorrelated:

E[τ_ℓ′ τ_k] = 0 for all k ≠ ℓ. (34)

Moreover, putting

Γ_k := diag(Ĉ_{k−1}) − Ĉ_{k−1}′ Ĉ_{k−1}, (35)

we have

E[τ_k′ τ_k | F_{k−1}] = E[χ_k′ χ_k | F_{k−1}] − Ĉ_{k−1}′ Ĉ_{k−1} = Γ_k, (36)

where the last equality follows from (2). This implies that the entries of E[τ_k′ τ_k] = E[Γ_k] are bounded uniformly in k. Note that ξ_k is a 'linear transform' of τ_k in that combining (1) and (28) leads to

ξ_k = τ_k R U. (37)

Note this implies that ξ_k is a martingale difference sequence in that E[ξ_k | F_{k−1}] = 0 = E[ξ_k]. Furthermore, by (35) and (37) we have

E[ξ_k′ ξ_k | F_{k−1}] = U′ R′ Γ_k R U. (38)

Together with (34) and (36), for all k, ℓ ≥ 1 we have

E[ξ_ℓ′ ξ_k] = 0 for ℓ ≠ k, and the entries of E[ξ_k′ ξ_k] are bounded uniformly in k. (39)

Since u_1 = U e_1′ = e′ is a right eigenvector of R corresponding to s, by (37) we have

ξ_k e_1′ = τ_k R U e_1′ = s τ_k e′ = s (χ_k e′ − Ĉ_{k−1} e′) = 0, (40)

where the last equality follows from χ_k e′ = 1 and E[χ_k | F_{k−1}] e′ = Ĉ_{k−1} e′ = 1. Note that for n > 1 and ρ < 1, we have

Σ_{k=1}^{n} k^{−ρ} ≤ K_ρ n^{1−ρ} (41)

for a constant K_ρ depending only on ρ. Furthermore, we present the following two results on the entries of B_{n,k}, whose proofs are elementary calculus and are included in the appendix.

Lemma 1 Under assumptions (A2) and (A3), there exists a constant K such that
|b_{n,0}(j)| ≤ K n^{λ_j/s} and |b_{n,k}(j)| ≤ K (n/k)^{λ_j/s} (42)

hold for 1 ≤ j ≤ d and 1 ≤ k ≤ n. Furthermore, we have

Corollary 3 Assume that {Z_n} is a sequence of random variables such that

Proof of Theorem 1
Proof Recall that Y_n = C_n U for n ≥ 1. Hence, it is sufficient to show that

n^{−1} Y_n → s e_1 almost surely as n → ∞, (45)

because s e_1 U^{−1} = s v_1 and n^{−1} C_n = n^{−1} Y_n U^{−1}. Furthermore, as the sequence of random vectors n^{−1} C_n is bounded, its L^r convergence follows from the almost sure convergence.
To establish (45), we restate the decomposition (31) as

Y_n = Y_0 B_{n,0} + Σ_{k=1}^{n} ξ_k B_{n,k}, (46)

where {ξ_k} is the martingale difference sequence in (28) and B_{n,k} is the diagonal matrix in (30). Next we claim that the j-th entry of E(Y_n), denoted by y_{n,j}, satisfies

lim_{n→∞} y_{n,1}/n = s and lim_{n→∞} y_{n,j}/n = 0 for 2 ≤ j ≤ d. (47)

Since E[ξ_k] = 0, taking expectations in (46) gives y_{n,j} = Y_{0,j} b_{n,0}(j). When j = 1, we have

y_{n,1} = Y_{0,1} b_{n,0}(1) = t_0 ∏_{i=1}^{n} (1 + s/t_{i−1}) = t_n,

where we used the fact that u_1 = e and hence t_0 = C_0 u_1 = Y_{0,1}. Therefore we have y_{n,1}/n = t_n/n → s as n → ∞. On the other hand, for 2 ≤ j ≤ d there exist two constants K_1 and K such that

|y_{n,j}| = |Y_{0,j}| |b_{n,0}(j)| ≤ K_1 |b_{n,0}(j)| ≤ K n^{λ_j/s},

where the last inequality follows from Lemma 1. Since λ_j < s, it follows that y_{n,j}/n → 0 as n → ∞. This completes the proof of (47).

For simplicity, let Z_n := Y_n − E(Y_n). Then Y_n = Z_n + E(Y_n), and by (47) it follows that to establish (45), it remains to show that

n^{−1} Z_n → 0 almost surely as n → ∞. (48)

Denote the j-th entry of Z_n by Z_{n,j}; then from (46) we have

Z_{n,j} = Σ_{k=1}^{n} b_{n,k}(j) ξ_k e_j. (49)

Since (48) is equivalent to Z_{n,j}/n → 0 almost surely for each 1 ≤ j ≤ d, (50)
the remainder of the proof is devoted to establishing (50). It is straightforward to see that (50) holds for j = 1 because by (40) and (49) we have Z_{n,1} = Σ_{k=1}^{n} b_{n,k}(1) ξ_k e_1 = 0.
Thus in the remainder of the proof, we may assume that 2 ≤ j ≤ d holds. Note that

E[Z_{n,j}²] = E[(Σ_{k=1}^{n} b_{n,k}(j) ξ_k e_j)²] = Σ_{k,ℓ=1}^{n} b_{n,k}(j) b_{n,ℓ}(j) E[(ξ_k e_j)(ξ_ℓ e_j)] = Σ_{k=1}^{n} b_{n,k}(j)² E[e_j⊤ ξ_k⊤ ξ_k e_j].

Here the third equality follows from (39). As E[e_j⊤ ξ_k⊤ ξ_k e_j], the (j, j)-entry of the matrix E[ξ_k⊤ ξ_k], is bounded above by a constant K_1 in view of (39), there exist constants K_2 and K so that

E[Z_{n,j}²] ≤ K_1 Σ_{k=1}^{n} b_{n,k}(j)² ≤ K_2 Σ_{k=1}^{n} (n/k)^{2λ_j/s} ≤ K n

holds for all n ≥ 1. Here the second inequality follows from Lemma 1 and the third one from (41), in view of λ_j < s/2 for 2 ≤ j ≤ d.
Since E(Z_{n,j}) = 0, for ε > 0 using the Chebyshev inequality we get

P( |Z_{n,j}| > nε ) ≤ K/(ε² n) (51)

for all n ≥ 1.
Consider the subsequence {Z'_{n,j}} of {Z_{n,j}} with Z'_{n,j} = Z_{n²,j} for n ≥ 1. Then for ε > 0 we have

Σ_{n≥1} P( |Z'_{n,j}| > n²ε ) ≤ Σ_{n≥1} K/(ε² n²) < ∞,

where the first inequality follows from (51). Thus, by the Borel–Cantelli lemma, it follows that

n^{−2} Z'_{n,j} → 0 almost surely as n → ∞. (52)
Next, consider Δ_{n,j} := max_{n² ≤ k < (n+1)²} |Z_{k,j} − Z_{n²,j}|. Since for each ℓ > 0 the entries of χ_ℓ and RU are all bounded above, there exists a constant K, independent of ℓ and j, bounding the corresponding increments. Consequently, we obtain an estimate which yields

n^{−2} Δ_{n,j} → 0 almost surely as n → ∞. (53)
Now, for each k > 0, consider the natural number n with n² ≤ k < (n + 1)². Then we have

|Z_{k,j}|/k ≤ ( |Z_{n²,j}| + Δ_{n,j} )/n². (54)

Note that when k → ∞, the natural number n satisfying n² ≤ k < (n + 1)² also approaches ∞. Thus combining (52), (53), and (54) leads to Z_{k,j}/k → 0 almost surely, which completes the proof of (50), and hence also of the theorem.

Proof of Theorem 2
Proof For each n ≥ 1, consider the following two sequences of random vectors:

X_{n,k} := n^{−1/2} ξ_k B_{n,k} and S_{n,k} := Σ_{ℓ=1}^{k} X_{n,ℓ}, 1 ≤ k ≤ n,

where {ξ_k}_{k≥1} is the martingale difference sequence in (28) and B_{n,k} is the diagonal matrix in (30). Then for each n ≥ 1, the sequence {X_{n,k}}_{1≤k≤n} is a martingale difference sequence, and {S_{n,k}}_{1≤k≤n} is a mean zero martingale.
Recalling that Y_n = C_n U, by (31) we have

n^{−1/2} (Y_n − n s e_1) = S_{n,n} + Z_n, where Z_n := n^{−1/2} (Y_0 B_{n,0} − n s e_1).

A key step in our proof is to show that

S_{n,n} converges in distribution to N(0, Σ), (57)

where N(0, Σ) denotes a normal distribution with mean vector 0 and variance–covariance matrix Σ. We shall show that Theorem 2 follows from (57). To this end, we claim that Z_n → 0 as n → ∞. Indeed, we have Z_n e_1 = n^{−1/2}(t_n − ns) = n^{−1/2} t_0 → 0. Furthermore, by Lemma 1 there exists a constant K such that

|Z_n e_j| = n^{−1/2} |Y_{0,j}| |b_{n,0}(j)| ≤ n^{−1/2} |Y_{0,j}| K n^{λ_j/s} for 2 ≤ j ≤ d,

which converges to 0 since λ_j/s < 1/2.
What remains is to prove (57). Define

Φ(n) := Σ_{k=1}^{n} E[X_{n,k}⊤ X_{n,k} | F_{k−1}],

the conditional variance matrix of the martingale array; in view of (4), (63) is then equivalent to the entrywise convergence

e_i⊤ Φ(n) e_j → e_i⊤ Σ e_j for 1 ≤ i, j ≤ d. (64)

Since B_{n,k} is a diagonal matrix and ξ_k e_1 = 0 in view of (40), this implies e_1⊤ Φ(n) = 0. A similar argument shows Φ(n) e_1 = 0, and hence (64) holds when i = 1 or j = 1. It remains to consider the case 2 ≤ i, j ≤ d. Since both B_{n,k} and Λ are diagonal matrices, the (i, j)-entry of Φ(n) can be written as a weighted average of the form treated in Corollary 3, and the convergence follows from Corollary 3 and (65). Since S_{n,n} is a mean zero random vector and B_{n,k} is a diagonal matrix, we have

V(S_{n,n}) = E[S_{n,n}⊤ S_{n,n}] = (1/n) Σ_{k,ℓ=1}^{n} B_{n,k} E[ξ_k⊤ ξ_ℓ] B_{n,ℓ} = (1/n) Σ_{k=1}^{n} B_{n,k} E[ξ_k⊤ ξ_k] B_{n,k},

where the third equality follows from (39). Furthermore, an argument similar to the proof of (63) shows that lim_{n→∞} V(S_{n,n}) = Σ.
Therefore Σ is positive semi-definite, because the matrix V(S_{n,n}) is necessarily positive semi-definite for each n ≥ 1.
Following the Cramér–Wold device for the multivariate central limit theorem (see, e.g. Durrett (2019, Theorem 3.10.6)), fix an arbitrary row vector w = (w_1, …, w_d) in R^d \ {0} and put s_{n,k} = S_{n,k} w⊤ and x_{n,k} = X_{n,k} w⊤. Furthermore, since the matrix Σ is positive semi-definite, we can introduce σ² := w Σ w⊤ ≥ 0. Then for establishing (57) it suffices to show that

s_{n,n} converges in distribution to N(0, σ²). (67)

Since {x_{n,k}}_{1≤k≤n} is a martingale difference sequence and {s_{n,k}}_{1≤k≤n} is an array of mean zero martingales, the martingale central limit theorem (see, e.g. Hall and Heyde (2014, Corollary 3.2)) implies that (67) follows from (68) together with the conditional Lindeberg-type condition: for every ε > 0,

Σ_{k=1}^{n} E[x_{n,k}² I_{A_{n,k,ε}} | F_{k−1}] → 0 in probability, (69)

where I_{A_{n,k,ε}} is the indicator variable of A_{n,k,ε} := {|x_{n,k}| > ε}. Now (68) follows from

Σ_{k=1}^{n} E[x_{n,k}² | F_{k−1}] = w Φ(n) w⊤ → σ²,

where the convergence follows from (63). To see that (69) holds, note that by (37) we have

X_{n,k} = Σ_{j=1}^{d} (X_{n,k} e_j) e_j⊤ = Σ_{j=1}^{d} n^{−1/2} λ_j b_{n,k}(j) (τ_k u_j) e_j⊤, 1 ≤ k ≤ n.
In particular, the first entry of X_{n,k} vanishes, because τ_k u_1 = 0 holds for k ≥ 1 in view of (40). Putting ρ = λ_2/s, we have λ_j/s ≤ ρ < 1/2 for 2 ≤ j ≤ d in view of (A2) and (A4). Furthermore, there exists a constant K_0 > 0 independent of n and k such that

|x_{n,k}| ≤ Σ_{j=2}^{d} n^{−1/2} |w_j λ_j τ_k u_j| |b_{n,k}(j)| ≤ K_0 n^{−1/2} (n/k)^ρ ≤ K_0 n^{−1/2} max(1, n^ρ) (72)

holds for 1 ≤ k ≤ n. Here the second inequality follows from Lemma 1 and the fact that |w_j λ_j τ_k u_j| is bounded above by a constant independent of k; the last inequality follows from (n/k)^ρ ≤ max( (n/1)^ρ, (n/n)^ρ ). Now let A_{n,ε} := {K_0 n^{−1/2} max(1, n^ρ) > ε}, which is either the empty set, if n is sufficiently large, or the whole probability space otherwise. Then by (72) we have A_{n,k,ε} ⊆ A_{n,ε}, and hence for all ε > 0 and each n, we have I_{A_{n,k,ε}} ≤ I_{A_{n,ε}} for all 1 ≤ k ≤ n. Furthermore, since ρ < 1/2 and K_0 > 0, we have I_{A_{n,ε}} → 0 (73). Consequently, we have

γ*_n := Σ_{k=1}^{n} E[x_{n,k}² I_{A_{n,k,ε}} | F_{k−1}] ≤ I_{A_{n,ε}} Σ_{k=1}^{n} E[x_{n,k}² | F_{k−1}] → 0,

where we have used the fact that I_{A_{n,ε}} is deterministic, hence F_n-measurable and independent of F_n (and all its sub-sigma-algebras); the convergence follows from (70) and (73). Since γ*_n is almost surely non-negative, this completes the proof of (69), the last step in the proof of the theorem.
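As a standalone empirical illustration of this central limit behaviour in the tree setting (not the construction used in the proof), the sketch below simulates the cherry count of YHK (Yule) trees by uniform leaf splitting and checks that the scaled fluctuations have mean close to 0 and variance close to 2/45, the asymptotic variance for the Yule cherry count reported by McKenzie and Steel (2000).

```python
import numpy as np

rng = np.random.default_rng(1)

def yule_cherries(n_leaves, rng):
    """Cherry count of a YHK (Yule) tree grown by uniform leaf splitting:
    splitting a leaf that lies in a cherry leaves the count unchanged,
    otherwise a new cherry is created."""
    x, n = 1, 2                        # two leaves form one cherry
    while n < n_leaves:
        if rng.random() >= 2 * x / n:  # picked leaf is not in a cherry
            x += 1
        n += 1
    return x

n, reps = 1000, 300
samples = np.array([(yule_cherries(n, rng) - n / 3) / np.sqrt(n)
                    for _ in range(reps)])

# Scaled fluctuations are approximately N(0, 2/45); note 2/45 is about 0.044.
assert abs(samples.mean()) < 0.05
assert abs(samples.var() - 2 / 45) < 0.02
```

The centring by n/3 uses the exact Yule mean of the cherry count; the tolerances are deliberately loose so the check is robust to Monte Carlo noise.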

Discussion
Inspired by a martingale approach developed by Bai and Hu (2005), we present in this paper the strong law of large numbers and the central limit theorem for a family of Pólya urn models in which negative off-diagonal entries are allowed in their replacement matrices. This leads to a unified approach to proving corresponding limit theorems for the joint vector of cherry and pitchfork counts under the YHK model and the PDA model. In other words, the results for both models are derived from Theorems 1 and 2, using different replacement matrices. Furthermore, our results on unrooted trees are also derived directly from Theorems 1 and 2, without the need for a detour via rooted trees. For each of these random tree models, we show that the joint vector of cherry and pitchfork frequencies converges almost surely to a deterministic vector and the appropriately scaled fluctuations converge in distribution to a bivariate normal distribution. Interestingly, such convergence results do not depend on the initial tree used in the generating process.
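The almost sure convergence for the YHK model can be illustrated by a small simulation. The sketch below encodes the leaves of a growing Yule tree by four types (cherry leaves inside a pitchfork, cherry leaves outside a pitchfork, lone pitchfork leaves, and all remaining leaves); this encoding is our illustrative choice and need not match the colour scheme used in the paper. Cherries and pitchforks are then read off the type counts and their proportions compared with the limits 1/3 and 1/6.

```python
import numpy as np

rng = np.random.default_rng(42)

# Leaf-type urn for the YHK (Yule) process (illustrative encoding):
# 0 = cherry leaf inside a pitchfork, 1 = cherry leaf outside a pitchfork,
# 2 = lone leaf of a pitchfork, 3 = any other leaf.
# Then cherries = (C[0] + C[1]) / 2 and pitchforks = C[2].
R = np.array([[ 0,  0,  0,  1],
              [ 2, -2,  1,  0],
              [-2,  4, -1,  0],
              [ 0,  2,  0, -1]], dtype=float)

C = np.array([0.0, 2.0, 0.0, 0.0])   # initial tree: a single cherry
n_leaves = 20000
while C.sum() < n_leaves:            # each draw splits one uniformly chosen leaf
    C = C + R[rng.choice(4, p=C / C.sum())]

n = C.sum()
cherries = (C[0] + C[1]) / 2
pitchforks = C[2]

# Almost sure limits under YHK: cherries/n -> 1/3 and pitchforks/n -> 1/6.
assert abs(cherries / n - 1 / 3) < 0.02
assert abs(pitchforks / n - 1 / 6) < 0.02
```

The limiting proportions solve the left-eigenvector equation v R = s v with v e = 1, which is how the deterministic limit vector arises in the urn framework.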
The results presented here also lead to several broad directions that may be interesting to explore in future work. The first direction concerns a more detailed analysis of convergence. For instance, the central limit theorems presented here should be extendable to a functional central limit theorem (see, e.g. Gouet (1993)), a follow-up project that we will pursue. Furthermore, it remains to establish the rate of convergence for the limit theorems (see Laulin (2020) for some recent results on urns with two colours). For example, a law of the iterated logarithm would add considerable information to the strong law of large numbers by providing a more precise estimate of the size of the almost sure fluctuations of the random sequences in Theorems 3 and 4.
The second direction concerns whether the results obtained here can be extended to other tree statistics and tree models. For example, the two tree models considered here, the YHK and the PDA, can be regarded as special cases of some more general tree generating models, such as Ford's alpha model (see, e.g. Chen et al. (2009)) and the Aldous beta-splitting model (see, e.g. Aldous (1996)). Therefore, it is of interest to extend our studies on subtree indices to these two models as well. Furthermore, instead of cherry and pitchfork statistics, we can consider more general subtree indices such as k-pronged nodes and k-caterpillars (Rosenberg 2006;Chang and Fuchs 2010).
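For concreteness, the two statistics themselves are straightforward to compute directly from a tree. The sketch below counts cherries and pitchforks of a rooted binary tree represented by nested 2-tuples, a hypothetical lightweight representation chosen for illustration.

```python
def tree_stats(t):
    """Return (leaves, cherries, pitchforks) of a rooted binary tree given as
    nested 2-tuples; any non-tuple object is a leaf.  A cherry is an internal
    node with two leaf children; a pitchfork is an internal node whose subtree
    has exactly three leaves."""
    if not isinstance(t, tuple):
        return 1, 0, 0
    nl, cl, pl = tree_stats(t[0])
    nr, cr, pr = tree_stats(t[1])
    n = nl + nr
    return n, cl + cr + (n == 2), pl + pr + (n == 3)

# Caterpillar on four leaves: one cherry and one pitchfork.
assert tree_stats(((("a", "b"), "c"), "d")) == (4, 1, 1)
# Balanced tree on four leaves: two cherries and no pitchfork.
assert tree_stats((("a", "b"), ("c", "d"))) == (4, 2, 0)
```

The same recursion extends immediately to k-pronged nodes and k-caterpillars by testing for other subtree leaf counts.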
Finally, it would be interesting to study tree shape statistics for several recently proposed graphical structures in evolutionary biology. For instance, one can consider aspects of tree shapes that are related to the distribution of branch lengths (Ferretti et al. 2017;Arbisser et al. 2018) or relatively ranked tree shapes (Kim et al. 2020). Furthermore, less is known about shape statistics in phylogenetic networks, in which non-tree-like signals such as lateral gene transfer and viral recombinations are accommodated (Bouvel et al. 2020). Further understanding of their statistical properties could help us design more complex evolutionary models that may in some cases provide a better framework for understanding real datasets.
Proof Since the lemma holds for λ = 0 in view of F_{n,m}(ℓ, 0) = 1, we assume that λ ≠ 0 in the remainder of the proof. For simplicity, put L := max{1, −(ℓ + λ)}.
First we shall establish (77). To this end, we may assume m > L, and hence m + ℓ + λ > 0. Furthermore, recall the following result on the ratio of gamma functions: for a fixed number y ∈ R, we have

lim_{x→∞} Γ(x + y) / ( Γ(x) x^y ) = 1,

which follows from Stirling's formula for the gamma function; see also Jameson (2013, p. 398) for an alternative approach. Therefore, putting G_{m,k} accordingly, we obtain the required limits; here the second limit holds because the limit of ln G_{m,0} being 0 implies that its limit superior is also 0. Together with G_{m,k} = G_{m+k,0} for k ≥ 0, this leads to (80) and (81). This completes the proof of (77).
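The gamma-ratio limit above is easy to check numerically; the sketch below works with log-gamma to avoid floating-point overflow for large arguments.

```python
import math

def gamma_ratio(x, y):
    """Gamma(x + y) / (Gamma(x) * x**y), computed via log-gamma so that large
    arguments do not overflow."""
    return math.exp(math.lgamma(x + y) - math.lgamma(x) - y * math.log(x))

# The ratio tends to 1 as x -> infinity, for any fixed y.
for y in (-1.5, 0.3, 2.0):
    assert abs(gamma_ratio(500.0, y) - 1.0) < 1e-2
```

For y = 2 the ratio is exactly (x + 1)/x, which makes the 1/x rate of convergence visible by hand.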
With Lemma 2, we now present a proof of Lemma 1.
where in the third inequality we use the fact that (41) implies that (1/n) Σ_{k=1}^{n} (n/k)^{ρ_i} is bounded. Therefore it follows that Δ_n → 0 as n → ∞. Since |b_{n,k}(i)| ≤ K (n/k)^{ρ_i}, a similar argument can be adopted to show that Δ*_n → 0 as n → ∞, completing the proof of Lemma 1.
Finally, we conclude the appendix with the following proof of Corollary 3.
(83) Furthermore, let N_0 be the smallest integer greater than 1 such that both N_0 > −(λ_i + t_0)/s and N_0 > −(λ_j + t_0)/s hold. Then we have a_{n,k} > 0 for all n ≥ k ≥ N_0.
We shall next show that (1/n) Σ_{k=1}^{n} a_{n,k} E[|Z_k − Z|] → 0.
For simplicity, put β_k := E[|Z_k − Z|] for k ≥ 1. Then {β_k}_{k≥1} is a sequence of non-negative numbers which converges to 0. Thus there exists a constant K_1 > 0 such that β_k < K_1 holds for all k ≥ 1. Next, fix an arbitrary number ε > 0. By (83), let N_1 = N_1(ε) be the smallest integer greater than N_0 such that

(1/n) Σ_{k=1}^{n} a_{n,k} < 1/(1 − ρ) + ε

holds for all n > N_1.
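The threshold N_1 exists because of the Riemann-sum limit behind (41), namely (1/n) Σ_{k=1}^{n} (n/k)^ρ → 1/(1 − ρ) for ρ < 1. The sketch below checks this limit numerically; it illustrates the asymptotics of the weights rather than the specific a_{n,k} of the corollary.

```python
def weighted_avg(n, rho):
    """(1/n) * sum_{k=1}^{n} (n/k)**rho, the normalising average behind (41)."""
    return sum((n / k) ** rho for k in range(1, n + 1)) / n

# For rho < 1 the average converges to 1/(1 - rho), the Riemann sum of
# the integral of x**(-rho) over (0, 1].
for rho in (-1.0, 0.25, 0.45):
    assert abs(weighted_avg(200_000, rho) - 1 / (1 - rho)) < 1e-2
```

For ρ = −1 the sum is exactly (n + 1)/(2n), so the limit 1/2 can also be verified by hand.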