0–1 laws for pattern occurrences in phylogenetic trees and networks

In a recent paper, the question of determining the fraction of binary trees that contain a fixed pattern known as the snowflake was posed. We show that this fraction tends to 1, providing two very different proofs: a purely combinatorial one that is quantitative and specific to this problem; and a proof using branching process techniques that is less explicit, but also much more general, as it applies to any fixed pattern and can be extended to other trees and networks. In particular, it follows immediately from our second proof that the fraction of d-ary trees (resp. level-k networks) that contain a fixed d-ary tree (resp. level-k network) tends to 1 as the number of leaves grows.


Introduction
Phylogenetic trees (and networks) are the primary way of representing evolutionary relationships in biology and related fields (e.g. language evolution, epidemiology). Typically, the leaves of a tree are labelled by extant species, and the (unlabelled) interior vertices represent branching events that correspond to ancestral speciation events. A binary phylogenetic tree is an unrooted tree with labelled leaves and unlabelled interior vertices of degree 3. This class of trees represents the most 'informative' description of evolution, since vertices of degree greater than 3 typically describe the unknown order of an ancestral species radiation (a 'soft polytomy'), whereas vertices of degree 2 are essentially redundant. Accordingly, binary phylogenetic trees play a key role in phylogenetics, and are the focus of this paper. In addition, a rooted binary phylogenetic tree is a rooted tree with labelled leaves and unlabelled interior vertices of out-degree 2 (when directed away from the root vertex).
A phylogenetic tree for a set X of species is typically inferred from a sequence of discrete characters (functions c_1, c_2, ..., c_k, where c_i is a function from X into some discrete set S_i). A natural measure of how well c_i is described by a phylogenetic tree T is to let f(c_i, T) denote the minimum number of edges of T that need to be assigned different states, over all possible ways of assigning states from S_i to the interior vertices of T. In general, f(c_i, T) ⩾ |c_i(X)| − 1, and if we have equality, then c_i is said to be homoplasy-free on T (this is equivalent to saying that c_i could have evolved on T from some ancestral vertex without reversals or convergent evolution; see [7]).
A natural question is the following: for a phylogenetic tree T, what is the smallest size N(T) of some set of characters for which T is the only tree on which each of these characters is homoplasy-free? It is easily seen that if T is the only tree for which each character in a given set is homoplasy-free, then T must be binary. Moreover, when the sets S_i all have size 2, then it is easily shown that N(T) ⩾ |X| − 3. However, if no restriction is placed on the size of the sets S_i, then N(T) turns out to be independent of |X|; in fact, N(T) ⩽ 4 [3]. A recent paper [4] exactly characterised the set of binary trees T for which N(T) = 4: they are precisely the trees that contain a 'snowflake' (defined shortly). The authors of [4] then posed the problem of determining the asymptotic proportion of binary trees that contain a snowflake as |X| → ∞.
In this short note, we first provide an explicit combinatorial proof that the proportion of binary trees that contain a snowflake tends to 1 (we also show that the same limit applies to birth-death trees). We then provide a second proof using branching process techniques. Although, in the specific case of snowflakes in phylogenetic trees, this proof is less informative than the first one, it is also much more general, as it covers not only snowflakes but any finite pattern, and not only binary trees, but also other classes of trees and networks (including phylogenetically relevant ones such as level-k networks).

Snowflakes in binary trees: A combinatorial approach

Preliminaries
Let B(n) be the set of binary phylogenetic trees on the leaf set [n] = {1, ..., n}, and let B(n) = |B(n)| be the number of such trees. Define R(n) and R(n) = |R(n)| similarly for rooted binary phylogenetic trees. Then B(n) = (2n − 5)!! and R(n) = (2n − 3)!!. The following result is from [2], and its proof follows by a standard application of the Lagrange inversion formula.

Lemma 1. The number N(n, k) of forests consisting of k rooted binary phylogenetic trees on disjoint leaf sets that partition a set of size n is given by:

N(n, k) = (2n − k − 1)! / ((k − 1)! · 2^{n−k} · (n − k)!).

Note that there is a canonical decomposition of any tree T ∈ B(n + 2), obtained by considering the path from leaf n + 1 to leaf n + 2 and the ordered forest of rooted trees that attach to this path. This leads to a bijection between ordered forests on n leaves and B(n + 2). In particular, for n ⩾ 1,

B(n + 2) = Σ_{k=1}^{n} k! · N(n, k),   (1)

with B(2) = 1 corresponding to the empty forest. A snowflake in a tree T ∈ B(n) is a subtree of T with a distinguished interior vertex v and six interior vertices at distance 2 from v. We refer to v as the central vertex of the snowflake (see Fig. 1). Observe that an interior vertex v in T is the central vertex of at least one snowflake if and only if the distance from v to each leaf of T is at least 3.
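As a quick numerical sanity check of Lemma 1 and the bijection behind Eqn. (1), one can verify B(n + 2) = Σ_k k! · N(n, k) for small n (a minimal sketch; the helper names are ours):

```python
from math import factorial

def dfact(m):
    # double factorial, with the convention (-1)!! = 1
    return 1 if m <= 0 else m * dfact(m - 2)

def N(n, k):
    # Lemma 1: forests of k rooted binary phylogenetic trees
    # whose leaf sets partition [n]
    return factorial(2*n - k - 1) // (factorial(k - 1) * 2**(n - k) * factorial(n - k))

def B(n):
    # number of unrooted binary phylogenetic trees on n leaves
    return dfact(2*n - 5)

# the bijection with ordered forests gives Eqn. (1)
for n in range(1, 12):
    assert B(n + 2) == sum(factorial(k) * N(n, k) for k in range(1, n + 1))
print("ok")
```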
Figure 1: A snowflake with central vertex v; each of the 12 circles represents a rooted tree on one or more leaves.
Let S(n) denote the set of ordered pairs (T, v), where T ∈ B(n) and v is the central vertex of a snowflake in T .

Lemma 2.
For n ⩾ 12,

|S(n)| = (12! / (2^9 · 3!)) · N(n, 12) = (2n − 13)! / (2^{n−4} · (n − 12)!).

Proof. We have

|S(n)| = (12! / (2^9 · 3!)) · N(n, 12),

where N(n, 12) enumerates the forests of 12 rooted trees (represented by circles in Fig. 1) and 12!/(2^9 · 3!) counts the number of distinct ways to arrange these 12 rooted trees around the central vertex. Thus, by Lemma 1,

|S(n)| = (12! / (2^9 · 3!)) · (2n − 13)! / (11! · 2^{n−12} · (n − 12)!),

which reduces to the expression in the lemma.
Next, for a given tree T ∈ B(n), let X_T denote the number of vertices of T that are the central vertex of at least one snowflake in T. Let X_n denote the random variable X_T, where T is chosen uniformly at random from B(n). By Lemma 2, we have:

Corollary 3. E(X_n) = |S(n)|/B(n) = 4 · (n − 2)! · (2n − 13)! / ((n − 12)! · (2n − 4)!), and so lim_{n→∞} E(X_n)/n = 1/128.

This corollary implies that lim inf_{n→∞} P(X_n ⩾ 1) ⩾ 1/128 > 0, since X_n ⩽ n and therefore P(X_n = 0) ⩽ 1 − E(X_n)/n. In particular, P(X_n = 0) does not converge to 1.

The asymptotic certainty of a snowflake
We now establish the following result.

Theorem 4. P(X_n = 0) → 0 as n → ∞; that is, the proportion of binary trees in B(n) that contain a snowflake tends to 1.

Proof. We show that the variance of X_n is o(n^2). This implies that P(X_n = 0) → 0 as n → ∞, by Chebyshev's inequality and Corollary 3.
By Corollary 3, it suffices to show that E(X_n^2) = E(X_n)^2 + o(n^2). Now, E(X_n^2) = Y(n)/B(n), where Y(n) is the number of ordered triples (T, v_1, v_2) such that T ∈ B(n) and v_1 and v_2 are central vertices of snowflakes of T. Moreover, for any tree T ∈ B(n), the number of ordered pairs (v_1, v_2) where v_1 and v_2 are central vertices of snowflakes of T and d(v_1, v_2) ⩽ 4 (this includes the case where v_1 = v_2) is O(n), where d(v_1, v_2) denotes the number of edges of T on the path between v_1 and v_2; indeed, each vertex of a binary tree has a bounded number of vertices within distance 4 of it.

Thus, letting W(n) denote the number of ordered triples (T, v_1, v_2) as above with d(v_1, v_2) ⩾ 5, it suffices to show that W(n)/B(n) = E(X_n)^2 + o(n^2).

Now observe that any such ordered triple (T, v_1, v_2) with d(v_1, v_2) ⩾ 5 decomposes as shown in Fig. 2.

Figure 2: The decomposition of T given two vertices (v_1, v_2) that are centres of snowflakes and at distance at least 5 apart from each other. The triangles represent the rooted trees attached to the path between the two snowflakes; the total number of leaves in these trees is i ⩾ 0 (when i = 0, this forest is empty and d(v_1, v_2) = 5).
This decomposition allows us to write:

W(n) = (22!/2^{16}) · Σ_{i=0}^{n−22} \binom{n}{i} · B(i + 2) · N(n − i, 22).   (2)

In this expression:

• The term B(i + 2) is from Eqn. (1), since this counts the number of ways to select an ordered collection of trees that contain a total of i leaves (the forest denoted by triangles on the path in Fig. 2). The condition that i ⩾ 0 recognises that d(v_1, v_2) ⩾ 5, and i ⩽ n − 22 because each of the 22 circled trees in Fig. 2 has at least one leaf in order for v_1 and v_2 to be the centres of snowflakes.
• The term \binom{n}{i} is the number of ways of selecting the i leaf labels, from the total leaf set of size n, that will label the leaves of the trees indicated by triangles in Fig. 2.
• The term N (n − i, 22) is the number of choices for the 22 circled trees (which form a forest of 22 rooted trees on a total of n − i leaves).
• The term 22!/2^{16} counts the number of distinct ways to attach the forest of the 22 circled rooted subtrees to the backbone tree (with v_1, v_2 as distinguished vertices), by the orbit-stabilizer theorem.
Eqn. (2) expresses W(n) as a summation; however, we can use generating function techniques to obtain a concise exact expression for W(n)/B(n), namely:

W(n)/B(n) = 16 · (n − 2)! · (2n − 22)! / ((n − 22)! · (2n − 4)!).   (3)

Theorem 4 then follows directly from Eqn. (3), since

W(n)/B(n) = n^2/2^{14} + o(n^2) = E(X_n)^2 + o(n^2).

Thus, it remains to establish Eqn. (3). For notational convenience, let k = 22. Since B(i + 2) = R(i + 1), we can rewrite Eqn. (2) as:

W(n) = (k!/2^{16}) · Σ_{i=0}^{n−k} \binom{n}{i} · R(i + 1) · N(n − i, k).   (4)

We now use generating functions. Let

R(x) = Σ_{n⩾1} R(n) x^n/n! = 1 − √(1 − 2x),

which is the exponential generating function for the number of rooted binary phylogenetic trees. Note that

Σ_{n⩾k} N(n, k) x^n/n! = R(x)^k/k!.   (5)

Thus, by Eqn. (4), letting W(x) = Σ_n W(n) x^n/n!, we have:

W(x) = (k!/2^{16}) · R′(x) · R(x)^k/k! = (1/(2^{16}(k + 1))) · d/dx [R(x)^{k+1}],

and so, by Eqn. (5),

W(x) = (k!/2^{16}) · Σ_n N(n + 1, k + 1) x^n/n!.

Consequently, W(n) = (k!/2^{16}) · N(n + 1, k + 1); recalling that k = 22, and applying Lemma 1, gives:

W(n)/B(n) = (22!/2^{16}) · N(n + 1, 23)/B(n) = 16 · (n − 2)! · (2n − 22)! / ((n − 22)! · (2n − 4)!),

which establishes Eqn. (3) and thereby the theorem.
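The key identity behind this computation, Σ_i \binom{n}{i} R(i + 1) N(n − i, k) = N(n + 1, k + 1), and the closed form (3) can both be checked numerically (a sketch with our own helper names):

```python
from math import comb, factorial

def dfact(m):
    # double factorial, with (-1)!! = 1
    return 1 if m <= 0 else m * dfact(m - 2)

def N(n, k):
    # Lemma 1
    return factorial(2*n - k - 1) // (factorial(k - 1) * 2**(n - k) * factorial(n - k))

def R(n):  # rooted binary phylogenetic trees on n leaves
    return dfact(2*n - 3)

def B(n):  # unrooted binary phylogenetic trees on n leaves
    return dfact(2*n - 5)

k = 22
for n in range(k, k + 5):
    lhs = sum(comb(n, i) * R(i + 1) * N(n - i, k) for i in range(n - k + 1))
    assert lhs == N(n + 1, k + 1)

# closed form (3): W(n)/B(n) = 16 (n-2)! (2n-22)! / ((n-22)! (2n-4)!)
n = 60
W = (factorial(22) // 2**16) * N(n + 1, 23)
assert W * factorial(n - 22) * factorial(2*n - 4) == \
       16 * B(n) * factorial(n - 2) * factorial(2*n - 22)
print("ok")
```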
An alternative class of models for generating random binary trees in biology are birth-death processes. Under a fairly wide range of conditions (see [6, 7]), these models give rise to the same probability distribution on tree shapes, namely the Yule-Harding distribution. If we suppress the root, the resulting random tree T̃_n ∈ B(n) has a simple construction (regardless of the underlying birth-death rates in the model), as follows. Starting with the tree on two leaves, select one of the existing pendant edges (i.e. edges incident with a leaf) uniformly at random and attach the next leaf to the midpoint of this edge, subdividing it. For Yule-Harding trees, snowflakes are also asymptotically certain, by the following much shorter argument.
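This construction is easy to simulate. The sketch below is our own code (with leaves as positive integers and internal vertices as negative ids); it grows an edge list by repeatedly attaching the next leaf to a uniformly chosen pendant edge:

```python
import random

def yule_harding(n):
    # start with the tree on two leaves, then attach leaves 3..n,
    # each time subdividing a uniformly chosen pendant edge
    edges = [frozenset({1, 2})]
    new_internal = 0
    for leaf in range(3, n + 1):
        pendant = [e for e in edges if any(v > 0 for v in e)]
        e = random.choice(pendant)  # uniform pendant edge
        new_internal -= 1           # fresh internal vertex id
        u, v = tuple(e)
        edges.remove(e)
        edges += [frozenset({u, new_internal}),
                  frozenset({v, new_internal}),
                  frozenset({leaf, new_internal})]
    return edges

tree = yule_harding(10)
print(len(tree))  # an unrooted binary tree on 10 leaves has 2*10 - 3 = 17 edges
```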
Proposition 5. The probability that T̃_n contains a snowflake tends to 1 as n grows.
Proof. Let T_n denote the Yule-Harding tree (with its root). If n_1(T_n) denotes the number of leaves of one of the two subtrees of T_n incident with the root, then n_1(T_n) is uniformly distributed on {1, ..., n − 1} (see e.g. [1]). In particular, both of these subtrees have at least √n leaves with probability 1 − o(1) as n grows. Since the two subtrees of T_n are also described by the Yule-Harding distribution (conditionally on their numbers of leaves), it follows that each of these two subtrees consists of two subtrees that each have at least n^{1/4} leaves with probability 1 − o(1) as n grows. Continuing this argument two steps further, the root of T_n is the root of a complete balanced binary tree of depth 4 (with 16 non-empty pendant subtrees) with probability tending to 1 as n grows. Thus, if we now suppress the root vertex, the resulting tree T̃_n contains a snowflake with probability that tends to 1 as n grows.
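The proof only uses the uniform root-split property, which makes it easy to probe numerically. A minimal sketch (our own code), assuming only that a subtree on m leaves splits uniformly on {1, ..., m − 1}:

```python
import random

def balanced_to_depth(m, depth):
    # does the Yule-Harding tree on m leaves contain a complete
    # balanced binary tree of the given depth below its root?
    if depth == 0:
        return True
    if m < 2:
        return False
    left = random.randint(1, m - 1)  # the root split is uniform
    return (balanced_to_depth(left, depth - 1)
            and balanced_to_depth(m - left, depth - 1))

trials = 2000
for n in (30, 300, 30000):
    freq = sum(balanced_to_depth(n, 4) for _ in range(trials)) / trials
    print(n, freq)  # the frequency increases toward 1 as n grows
```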

A generic approach using branching process techniques
In this section, we prove a 0-1 law for pattern occurrences that applies not only to snowflakes but to any finite pattern, and not only to uniform binary trees but also to other trees and even networks. This 0-1 law follows readily from standard tools of modern probability theory, namely local limits of size-conditioned Galton-Watson trees, so even though we could not find it in the literature, it will not come as a surprise to people familiar with these tools. Nevertheless, it does not seem to be known in the mathematical phylogenetics community, despite having relevant applications there.
The idea of the proof is that some random phylogenetic trees or networks can be 'chopped up' into smaller parts that are almost independent of each other.If these parts are large enough, then each of them has a positive probability of containing the pattern of interest; the 0-1 law then follows from a Borel-Cantelli argument.
The caveat in this argument is that it may not be obvious how to chop up the random tree or network of interest into constituents that are 'almost independent'.The notion of local limit provides a convenient way to tackle this issue, namely, by making it possible to study some large trees or networks using a limiting object that consists of truly independent parts.

Prerequisites
In this section, we give an overview of the minimal prerequisites for the proof of our 0-1 law. In particular, some notions and results will not be presented in full generality. Complete and self-contained introductions to these tools can be found in [9], for the general notion of local limit, and in [5], for local limits of size-conditioned Galton-Watson trees.

Local limits
The notion of local limit of a sequence of rooted graphs formalizes the idea that the structure of a rooted graph G_n, 'as seen from its root', converges as n → ∞.
What makes this interesting is that, after giving a rigorous meaning to lim_n G_n, quantities such as lim_n f(G_n) can sometimes be computed as f(lim_n G_n); when lim_n G_n has a simple structure, the latter can be much easier to compute.
There are several ways to formalize this idea. In the case of ordered trees, which is all we need for our main result, a standard way to do so is to embed all trees in the Ulam-Harris tree and to say that a sequence of trees (T_n) converges locally to a tree T if and only if the out-degrees of T_n converge pointwise to the out-degrees of T. If the trees are locally finite (i.e. if all vertices have a finite degree), then, letting [T]_k denote the ball of radius k centered on the root of T, this is equivalent to saying that for all fixed k, there exists N such that [T_n]_k = [T]_k for all n ⩾ N.

This framework makes it possible to talk about convergence in distribution of a sequence of random trees (T_n) to a (possibly infinite) random tree T: the sequence (T_n) converges in distribution to T if and only if, for every fixed k and every finite tree τ, P([T_n]_k = τ) → P([T]_k = τ). Moreover, all of the usual results from probability theory regarding the convergence of functionals of T_n apply. For instance, T_n converges in distribution to T if and only if E(f(T_n)) → E(f(T)) for all bounded continuous functions f. However, many functions of interest are not continuous for the local topology. Thus, in order to use lim_n T_n to compute lim_n f(T_n), one must take care to justify either the continuity of f for the local topology, or the interchange of limits for the particular sequence (T_n) of interest.
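As an illustration of this encoding (a minimal sketch of our own), an ordered tree can be stored as a map from Ulam-Harris words to out-degrees, and the ball [T]_k is then a simple truncation:

```python
def ball(tree, k):
    # tree: dict mapping Ulam-Harris words (tuples of ints) to out-degrees;
    # [T]_k keeps vertices at distance <= k from the root and turns
    # the vertices at distance exactly k into leaves
    return {w: (d if len(w) < k else 0) for w, d in tree.items() if len(w) <= k}

# a root with two children; the first child itself has one child
T = {(): 2, (1,): 1, (2,): 0, (1, 1): 0}
print(ball(T, 1))  # {(): 2, (1,): 0, (2,): 0}
```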

Size-conditioned Galton-Watson trees
Galton-Watson trees have a natural ordering that makes it convenient to treat them as ordered trees: by doing so, for any fixed ordered tree τ, the probability that a Galton-Watson tree T with offspring distribution X is equal to τ is

P(T = τ) = Π_{v∈τ} P(X = d^+(v)),

where d^+(v) denotes the out-degree of v in τ. In this paper, we use the notation T ∼ GW(X) to indicate that T is a Galton-Watson tree with offspring distribution X. By a slight abuse of notation, we also use the notation GW(X) to denote a generic Galton-Watson tree.
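For instance, with P(X = 2) = P(X = 0) = 1/2, the formula gives probability (1/2) · (1/2)^2 = 1/8 to the 'cherry' (a root whose two children are leaves). This is easy to confirm by simulation (our own sketch, encoding a tree by its depth-first sequence of out-degrees):

```python
import random

def sample_gw(p2=0.5, max_vertices=1000):
    # sample GW(X) with P(X=2)=p2, P(X=0)=1-p2, as the depth-first
    # list of out-degrees; returns None if the tree grows too large
    degs, todo = [], 1
    while todo > 0:
        if len(degs) >= max_vertices:
            return None
        d = 2 if random.random() < p2 else 0
        degs.append(d)
        todo += d - 1
    return degs

trials = 100_000
hits = sum(sample_gw() == [2, 0, 0] for _ in range(trials))
print(hits / trials)  # close to P(X=2) * P(X=0)^2 = 0.125
```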
A size-conditioned Galton-Watson tree is a Galton-Watson tree conditioned to have exactly n vertices. Of course, there are conditions on the offspring distribution X and on n for this conditioning to make sense: for instance, a Galton-Watson tree whose offspring distribution is almost surely positive cannot be conditioned to be finite; similarly, since rooted binary trees always have an odd number of vertices (we are not counting the root edge here), a Galton-Watson tree whose offspring distribution takes values in {0, 2} cannot be conditioned to have an even number of vertices.
The central role of size-conditioned Galton-Watson trees in combinatorial probability theory, and their relevance here, comes from the two following points:

• For various classes of random trees, it is possible to sample uniformly at random using size-conditioned Galton-Watson trees. This is the case, for example, of uniform leaf-labelled d-ary trees, as detailed in the next section.

• Under some fairly general assumptions on the offspring distribution, the local limit of size-conditioned Galton-Watson trees has a very specific structure known as Kesten's size-biased tree. This is detailed in Section 3.1.4.

Uniform d-ary trees as size-conditioned Galton-Watson trees
In this section, we recall how to obtain uniform leaf-labelled d-ary trees from size-conditioned critical Galton-Watson trees. But first, let us clarify a few points of vocabulary when talking about d-ary trees and ordered d-ary trees:

• By a d-ary tree, we mean a tree such that the degree of every vertex is either equal to 1 (the leaves) or to d + 1 (the internal vertices). Except for the tree consisting of a single edge, every d-ary tree has (k + 1)d + 2 vertices, for some k ⩾ 0: k + 1 internal vertices and (k + 1)d − k + 1 leaves.
As seen above, in the case d = 2, there are B(n) = (2n − 5)!! such trees with n labelled leaves, each of which has 2n − 3 edges.
• By an ordered d-ary tree, we mean an ordered tree in which every vertex has in-degree 1, except for the root, which has in-degree 0; and where the out-degree of every vertex is either 0 or d.Each such tree has kd + 1 vertices, for some k ⩾ 0: k internal vertices and (d − 1)k + 1 leaves.
For d = 2, there are C_{n−1} such trees with n leaves, where C_k denotes the k-th Catalan number.
Finally, recall that ordered trees are intrinsically labelled. For instance, the Ulam-Harris labelling (also known as the Neveu notation) assigns a word to each vertex of the tree in the following way: the root is labelled with the empty word, and the k-th child of a vertex with label w gets the label wk. The link between ordered trees and rooted vertex-labelled trees is thus straightforward: there are exactly Π_v d^+(v)! ways to order any rooted vertex-labelled tree, where the product runs over the vertices of the tree.
Since this probability is the same for every tree τ with n vertices, this concludes the proof of Proposition 6. Similarly, the pushforward by ψ ∘ φ of the uniform distribution on T_{n−1} × S_n is the uniform distribution on T̃_n; since ψ ∘ φ is the construction described in Proposition 8, this concludes the proof of that proposition.
Remark 9. This proof implies that, for all d ⩾ 2 and all n = d · i + 1, the number of ordered d-ary trees with n vertices is (1/n) · \binom{n}{i} (a Fuss-Catalan number).
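This count can be cross-checked by elementary dynamic programming on the generating-function equation T = 1 + x · T^d for ordered d-ary trees, where x marks internal vertices (our own sketch):

```python
from math import comb

def count_dary_ordered(d, imax):
    # T[i] = number of ordered d-ary trees with i internal vertices,
    # computed by iterating T <- 1 + x * T^d coefficient-wise
    T = [1] + [0] * imax
    for _ in range(imax):
        P = [1] + [0] * imax  # will hold T^d, truncated at degree imax
        for _ in range(d):
            Q = [0] * (imax + 1)
            for a, pa in enumerate(P):
                for b, tb in enumerate(T):
                    if a + b <= imax:
                        Q[a + b] += pa * tb
            P = Q
        T = [1] + P[:imax]
    return T

for d in (2, 3, 4):
    T = count_dary_ordered(d, 8)
    for i in range(9):
        n = d * i + 1  # number of vertices
        assert T[i] == comb(n, i) // n
print("ok")
```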

Kesten's size-biased tree
As already mentioned, the local limit of size-conditioned Galton-Watson trees has a simple, universal structure. In what follows, we state this result for critical Galton-Watson trees (that is, such that the expected value of the offspring distribution X is equal to 1). However, criticality is not as restrictive as it may seem, because many non-critical Galton-Watson trees can be turned into equivalent critical Galton-Watson trees via exponential tilting (that is, there exists an exponential tilting of the offspring distribution that yields a critical Galton-Watson tree with the same conditional distribution on the set of trees with n vertices as the original Galton-Watson tree; see [5, Section 4]).
The following theorem is not stated in full generality; see [5,Theorem 7.1] for a more general statement.
Theorem 10. Let X be an integer-valued random variable such that E(X) = 1, E(X^2) < ∞ and P(X = 0) > 0. Let T ∼ GW(X) be a Galton-Watson tree with offspring distribution X, and let T_n ∼ (T | #T = n), for all n such that P(#T = n) > 0. Then the local limit of T_n is the infinite random tree T* obtained by the following procedure:

1. Start with a semi-infinite path v_1, v_2, ..., and let v_1 be the root of T*. This path will be referred to as the spine of T*.

2. Give each spine vertex v_i an independent number X*_i of children, where X* has the size-biased distribution of X, i.e. P(X* = k) = k P(X = k) for all k; one of these children, chosen uniformly at random, is identified with v_{i+1}.

3. Graft an independent Galton-Watson tree GW(X) onto each of the remaining children of the spine vertices.
Our 0-1 law is the following result.

Theorem 12. Let X and (T_n) be as in Theorem 10, and let τ be any fixed finite tree that GW(X) contains with positive probability. Then P(T_n ⊃ τ) → 1 as n → ∞.

Proof. Let X* have the size-biased distribution of X, and let T* denote the local limit of T_n, i.e. the Kesten tree associated with X. Let v_1 be the root of T*, and v_1, v_2, ... the vertices on its spine.
Let S_k = Σ_{i=1}^{k} (d^+(v_i) − 1) denote the total number of edges coming out of the spine of T* from vertices v_i at distance less than k from the root. Note that (S_k)_{k⩾1} is a random walk whose increments are distributed as X* − 1. Since X* ⩾ 1 almost surely and since P(X* > 1) > 0, we have S_k → ∞ almost surely as k → ∞.
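For a concrete critical example, take X ∼ Poisson(1); its size-biased version satisfies X* − 1 ∼ Poisson(1), so S_k is a random walk with unit positive drift. A small simulation (our own sketch, using Knuth's Poisson sampler):

```python
import math
import random

def poisson1():
    # Knuth's method for sampling Poisson(1)
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def spine_walk(kmax):
    # S_k = sum of (X*_i - 1); for X ~ Poisson(1), X* - 1 ~ Poisson(1)
    S, path = 0, []
    for _ in range(kmax):
        S += poisson1()
        path.append(S)
    return path

print(spine_walk(1000)[-1])  # S_k grows roughly linearly, so S_k -> infinity
```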
Next, let D denote the diameter of τ (i.e. the maximal distance between two of its vertices) and let p := P([GW(X)]_D ⊃ τ) > 0 be the probability that a Galton-Watson tree with offspring distribution X contains τ in the ball of radius D centered on its root. Note that, for all i and all k ⩾ i + D,

P([T*]_k ⊅ τ) ⩽ E((1 − p)^{S_i}),

because [T*]_k contains the S_i balls of radius D centered on the roots of the S_i independent Galton-Watson trees that are grafted on the first i vertices of the spine of T* in its construction. Using that p > 0 and that S_i → ∞ almost surely, we get: for every ε > 0, there exists i such that, for all k ⩾ i + D,

P([T*]_k ⊃ τ) ⩾ 1 − ε/2.   (6)

As a result, since T_n converges locally in distribution to T*, we also have: for every fixed k and every ε > 0, there exists N such that, for all n ⩾ N,

P([T_n]_k ⊃ τ) ⩾ P([T*]_k ⊃ τ) − ε/2.   (7)

Combining Inequalities (6) and (7) finishes the proof. Indeed, for any ε > 0, taking i as in (6), and then N as in (7) with the same ε and k = i + D, ensures that P([T_n]_k ⊃ τ) ⩾ 1 − ε for all n ⩾ N. Since P(T_n ⊃ τ) ⩾ P([T_n]_k ⊃ τ), we have proved that

P(T_n ⊃ τ) → 1 as n → ∞,

which is what we needed.

Corollaries: patterns in d-ary trees and level-k networks
We conclude this paper by providing two examples of applications of Theorem 12.
One is a direct corollary that generalizes Theorem 4 on snowflakes in binary trees; the other is an application to level-k networks. Since some relevant classes of phylogenetic trees and networks can be characterized by the fact that they contain or exclude certain fixed-size patterns (and since, more generally, such patterns can affect the outcome or performance of some algorithms), Theorem 12 likely has many other relevant applications in mathematical phylogenetics.

Proposition 6. Let X ∼ d × Bernoulli(1/d), and let T ∼ GW(X). Then, letting #T denote the number of vertices of T, for any n such that P(#T = n) > 0, the size-conditioned tree T_n ∼ (T | #T = n) has the uniform distribution on the set of ordered d-ary trees with n vertices.

Remark 7. Since for d-ary trees the number of leaves is a deterministic function of the total number of vertices, Proposition 6 also holds if we condition on the number of leaves. ⋄

Proof. Let τ be any fixed ordered d-ary tree with n vertices. Recalling that all such trees have the same number i := (n − 1)/d of internal vertices,

P(T = τ) = Π_{v∈τ} P(X = d^+(v)) = (1/d)^i · (1 − 1/d)^{n−i}.
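For d = 2 and n = 5, there are exactly two ordered binary trees (with depth-first out-degree sequences (2, 2, 0, 0, 0) and (2, 0, 2, 0, 0)), and Proposition 6 predicts that conditioning GW(X) on having 5 vertices makes them equally likely. A quick simulation (our own sketch):

```python
import random

def gw_degs(p2=0.5, cap=50):
    # GW tree with offspring 2 (prob p2) or 0, as a depth-first
    # out-degree sequence; None if the tree exceeds cap vertices
    degs, todo = [], 1
    while todo > 0 and len(degs) < cap:
        d = 2 if random.random() < p2 else 0
        degs.append(d)
        todo += d - 1
    return degs if todo == 0 else None

counts = {}
for _ in range(200_000):
    t = gw_degs()
    if t is not None and len(t) == 5:
        counts[tuple(t)] = counts.get(tuple(t), 0) + 1

print(counts)  # the two shapes appear with approximately equal frequency
```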

Proposition 8. Let T_{n−1} have the uniform distribution on the set of ordered d-ary trees with n − 1 leaves, and let T̃_n be the tree obtained by: (1) grafting a leaf to the root of T_{n−1} and labelling the n leaves of the resulting tree uniformly at random; and (2) discarding the ordering and the rooting of the resulting tree. Then T̃_n has the uniform distribution on the set of d-ary trees with n labelled leaves.

Proof. Let us start by introducing some notation. We denote by:

• T_n the set of ordered d-ary trees with n leaves;
• T̃_n the set of d-ary trees with n labelled leaves;
• C_n the set of ordered d-ary trees with n − 1 leaves, where the root has out-degree 1 and where the leaves and the root are labelled;
• S_n the set of permutations of {1, ..., n}.

With this notation, the following hold:

(i) Since the leaves of a tree T ∈ T_{n−1} are already intrinsically labelled by the ordering of T, by adding a root edge to T and labelling the root and the n − 1 leaves of the resulting tree, we get a bijection φ from T_{n−1} × S_n to C_n.

(ii) For any T ∈ T̃_n, by choosing one of the n leaves as the root, and then an ordering for the d children of each of the (n − 2)/(d − 1) internal vertices of T, we get a bijection from T̃_n × {1, ..., n} × (S_d)^{(n−2)/(d−1)} to C_n.

Point (i) means that the pushforward by φ of the uniform distribution on T_{n−1} × S_n is the uniform distribution on C_n, whereas Point (ii) implies that, if we let ψ denote the canonical projection from C_n to T̃_n, then the pushforward by ψ of the uniform distribution on C_n is the uniform distribution on T̃_n.
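For d = 2, the two counts of C_n in this proof give the identity n! · C_{n−2} = n · 2^{n−2} · (2n − 5)!!, which is easy to verify (our own sketch):

```python
from math import comb, factorial

def dfact(m):
    # double factorial, with (-1)!! = 1
    return 1 if m <= 0 else m * dfact(m - 2)

def catalan(k):
    return comb(2 * k, k) // (k + 1)

# d = 2: |T_{n-1}| = C_{n-2} ordered trees, |T~_n| = (2n-5)!! labelled
# trees, and both |T_{n-1}| * n! and |T~_n| * n * (2!)^{n-2} count C_n
for n in range(3, 15):
    assert factorial(n) * catalan(n - 2) == dfact(2 * n - 5) * n * 2 ** (n - 2)
print("ok")
```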