Enumeration of binary trees compatible with a perfect phylogeny

Evolutionary models used for describing molecular sequence variation suppose that at a non-recombining genomic segment, sequences share ancestry that can be represented as a genealogy—a rooted, binary, timed tree, with tips corresponding to individual sequences. Under the infinitely-many-sites mutation model, mutations are randomly superimposed along the branches of the genealogy, so that every mutation occurs at a chromosomal site that has not previously mutated; if a mutation occurs at an interior branch, then all individuals descending from that branch carry the mutation. The implication is that observed patterns of molecular variation from this model impose combinatorial constraints on the hidden state space of genealogies. In particular, observed molecular variation can be represented in the form of a perfect phylogeny, a tree structure that fully encodes the mutational differences among sequences. For a sample of n sequences, a perfect phylogeny might not possess n distinct leaves, and hence might be compatible with many possible binary tree structures that could describe the evolutionary relationships among the n sequences. Here, we investigate enumerative properties of the set of binary ranked and unranked tree shapes that are compatible with a perfect phylogeny, and hence, the binary ranked and unranked tree shapes conditioned on an observed pattern of mutations under the infinitely-many-sites mutation model. We provide a recursive enumeration of these shapes. We consider both perfect phylogenies that can be represented as binary and those that are multifurcating. The results have implications for computational aspects of the statistical inference of evolutionary parameters that underlie sets of molecular sequences.


Introduction
Coalescent and mutation models are used in population genetics to estimate evolutionary parameters from samples of molecular sequences (Marjoram and Tavaré 2006). The central idea is that observed molecular variation is the result of a process of mutation along the branches of the genealogy of the sample. This genealogy is a timed tree that represents the ancestral relationships of the sample at a chromosomal segment. Consisting of a tree topology and its branch lengths, the genealogy is a nuisance parameter that is modeled as a realization of the coalescent process dictated by evolutionary parameters, which are in turn inferred by integrating over the space of genealogies. For large sample sizes, however, this integration is computationally challenging because the state space of tree topologies grows superexponentially with the number of sampled sequences.
Recently, a coarser coalescent model known as the Tajima coalescent (Tajima 1983; Sainudiin et al. 2015), coupled with the infinitely-many-sites mutation model (Kimura 1969), has been introduced for population-genetic inference problems (Palacios et al. 2019). Whereas the standard coalescent model (Kingman 1982) induces a probability measure on the space of ranked labeled tree topologies, the Tajima coalescent induces a probability measure on the space of ranked unlabeled tree topologies. Removing the labels of the tips from the tree topology, as in the Tajima coalescent, reduces the cardinality of the space of tree topologies substantially, shrinking computation time in inference problems.
Under infinitely-many-sites mutation, only a subset of tree topologies (labeled or unlabeled) is compatible with an observed data set, so that the computational complexity of inference varies among data sets. Hence, Cappello et al. (2020a) used importance sampling to approximate cardinalities of the spaces of labeled and unlabeled ranked tree shapes conditioned on a data set of molecular sequences, demonstrating a striking reduction of the cardinality of the space of ranked unlabeled tree shapes versus the labeled counterpart when conditioning on observed data with few mutations. Here, we extend beyond the approximate work of Cappello et al. (2020a) and obtain exact results. We provide a recursive algorithm for exact computation of the cardinality of the spaces of labeled and unlabeled ranked tree shapes compatible with a sequence data set. We provide a number of other enumerative results relevant for inference of tree topologies in phylogenetics and population genetics. Python code for enumeration is available at https://colab.research.google.com/drive/1cAx2xyn7OtmG-F-9nxJ3CHRc7e7AjuCj?usp=sharing.

Fig. 1 A A ranked labeled tree shape. B A ranked unlabeled tree shape. C An unranked unlabeled tree shape. D An unranked labeled tree shape. The ranked unlabeled tree shape in (B) is obtained by discarding leaf labels from the ranked labeled tree shape in (A). The unranked labeled tree shape in (D) is obtained by discarding the sequence of internal node ranks in (A). The unranked unlabeled tree shape in (C) is obtained by discarding the sequence of internal node ranks in (B) or the leaf labels in (D)

Types of trees
The coalescent is a continuous-time Markov chain with values in the space $\mathcal{P}_n$ of partitions of $[n] = \{1, 2, \ldots, n\}$ (Kingman 1982). The process starts with the trivial partition of n singletons, labeled {1}, {2}, ..., {n}, at time 0; at each transition, two blocks are chosen uniformly at random to merge into a single block. The process ends with a single block with label {1, 2, ..., n}. In the standard coalescent, the holding times are exponentially distributed with rate $\binom{k}{2}$ when there are k blocks. Transition probabilities for the coalescent can be factored into two independent components, a pure death process and a discrete jump chain. A full realization of the process can be represented by a timed rooted binary tree: a genealogy. The tips of the genealogy are labeled by {1, 2, ..., n}. Figure 1A shows a realization of the jump process, a ranked labeled tree shape.
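The description above can be turned into a short simulation. The following is a minimal sketch, not the authors' code: the function name `simulate_coalescent` and its return format are assumptions made for illustration. It draws a holding time with rate C(k, 2) while k blocks remain and then merges two blocks chosen uniformly at random.

```python
import random
from math import comb


def simulate_coalescent(n, seed=None):
    """Simulate one realization of the standard (labeled) coalescent.

    Returns a list of (time, merged_block) pairs, one per transition.
    The name and output format are illustrative assumptions.
    """
    rng = random.Random(seed)
    blocks = [frozenset([i]) for i in range(1, n + 1)]  # n singletons at time 0
    t = 0.0
    events = []
    while len(blocks) > 1:
        k = len(blocks)
        t += rng.expovariate(comb(k, 2))   # holding time, rate C(k, 2)
        i, j = rng.sample(range(k), 2)     # two distinct blocks, uniformly
        merged = blocks[i] | blocks[j]
        blocks = [b for idx, b in enumerate(blocks) if idx not in (i, j)]
        blocks.append(merged)
        events.append((t, merged))
    return events


events = simulate_coalescent(8, seed=0)
```

Discarding the times in `events` leaves the jump chain, that is, a ranked labeled tree shape.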
A lumping of the standard coalescent process, called the Tajima coalescent (Sainudiin et al. 2015), consists in removing the labels of the tips of the genealogy. The pure death process of the lumped process is the same as the standard coalescent. The discrete jump chain can be described as a simple urn process (Janson and Kersting 2011). Start with an urn of n balls labeled 0; at the ith transition, draw two balls and return one to the urn with label i. The process ends when there is a single ball with label n − 1 in the urn. A full realization of the urn process can be represented as a ranked unlabeled tree shape with internal nodes labeled by the transition index.
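The urn process can be sketched in a few lines. This is an illustrative sketch under the description above; the name `tajima_urn` and the encoding of a ranked unlabeled tree shape as a list of merged label pairs are assumptions, not notation from the paper.

```python
import random


def tajima_urn(n, seed=None):
    """Simulate the urn description of the Tajima jump chain.

    Returns a list whose (i-1)th entry holds the labels of the two balls
    drawn at transition i. Label 0 denotes a leaf; label j > 0 denotes the
    internal node created at transition j.
    """
    rng = random.Random(seed)
    urn = [0] * n                      # n balls, all labeled 0
    merges = []
    for i in range(1, n):              # transitions 1, ..., n - 1
        a, b = rng.sample(range(len(urn)), 2)   # two distinct balls
        merges.append(tuple(sorted((urn[a], urn[b]))))
        for idx in sorted((a, b), reverse=True):
            urn.pop(idx)               # remove the two drawn balls
        urn.append(i)                  # return one ball labeled i
    assert urn == [n - 1] or n == 1
    return merges


merges = tajima_urn(5, seed=1)
```

Each internal label 1, ..., n − 2 is drawn exactly once before the process stops, and the leaf label 0 is drawn n times in total.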
A ranked labeled tree shape of size n, denoted by $T^L_n$, is a rooted binary labeled tree of n leaves with a total ordering for the internal nodes. Without loss of generality, we use label set [n] to label the n leaves. The space of ranked labeled tree shapes with n leaves will be denoted by $\mathcal{T}^L_n$. Figure 1A shows an example of a ranked labeled tree shape with n = 8 leaves. Ranked labeled tree shapes are also known as labeled histories.
A ranked unlabeled tree shape of size n, denoted by $T^R_n$, is a rooted binary unlabeled tree of n leaves with a total ordering for the internal nodes. The space of ranked unlabeled tree shapes with n leaves will be denoted by $\mathcal{T}^R_n$. Figure 1B shows an example of a ranked unlabeled tree shape with n = 8 leaves. We will refer to a ranked unlabeled tree shape simply as a ranked tree shape; these ranked tree shapes are also known as unlabeled histories, or Tajima trees. Figure 2 shows all ranked unlabeled tree shapes with 3, 4, 5, and 6 leaves.

Fig. 2 An enumeration of all possible ranked tree shapes with 3, 4, 5, and 6 leaves

An unranked unlabeled tree shape of size n, denoted by $T_n$, is a rooted binary unlabeled tree of n leaves with unlabeled internal nodes. The space of unranked (unlabeled) tree shapes with n leaves will be denoted by $\mathcal{T}_n$. Figure 1C shows an example of an unranked unlabeled tree shape with n = 8 leaves. These shapes are also called unlabeled topologies or Otter trees (Otter 1948).
An unranked labeled tree shape of size n, denoted by $T^X_n$, is a rooted binary labeled tree of n leaves with unlabeled internal nodes. The space of unranked labeled tree shapes with n leaves will be denoted by $\mathcal{T}^X_n$. Figure 1D shows an example of an unranked labeled tree shape with n = 8 leaves. These tree shapes are also called labeled topologies.

Mutations on trees
Many generative models of neutral molecular evolution assume that a process of mutation is superimposed on the genealogy as a continuous-time Markov process. In the infinitely-many-sites mutation model, every mutation along the branches of the tree occurs at a chromosomal site that has not previously mutated (Kimura 1969). Therefore, if a mutation occurs at an interior branch along the genealogy, all sequences descended from that branch carry the mutation. Because every site can mutate at most once, the sequence of mutated sites can be encoded as a binary sequence, with 0 denoting the ancestral type and 1 denoting the mutant type at any site; we assume that the ancestral type is known, and that it is denoted by 0. Figure 3A shows a realization of the Tajima coalescent together with a realization of mutations from the infinitely-many-sites mutation model with 5 individuals and 4 mutated sites. In what follows, we assume that we observe molecular data only as binary sequences at the tips of the tree.

Fig. 3 Tajima coalescent and infinitely-many-sites generative model of binary molecular data. A A Tajima genealogy of 5 individuals, with 4 superimposed mutations depicted as gray squares. The root is labeled by the ancestral type 0000, and the leaves are labeled by the genetic type at each of the four mutated sites. The first two leaves from left to right are labeled 0001 because one mutation occurs in their path to the root. The third and fourth individuals have three mutations in their path to the root and are labeled 1110; the last individual is labeled 1000 because only one mutation occurs along its path to the root. The order and label of the mutations are unimportant; however, it is assumed that the same position, or site, in a sequence of 0s and 1s corresponds across individuals. For ease of exposition, we label the mutations a, b, c and d. The first site corresponds to mutation a, the second to b, the third to c, and the fourth to d. B Left, a perfect phylogeny representation of the observed data at the tips of (A). Data consist of 3 unique haplotypes 0001, 1110 and 1000, with frequencies 2, 2, and 1. The corresponding frequencies are the labels of tips of the perfect phylogeny. Right, perfect phylogeny topology obtained by removing the edge labels of the perfect phylogeny. C The only three ranked tree shapes compatible with the perfect phylogeny topology in (B)

Observed binary molecular sequence data as a perfect phylogeny
The perfect phylogeny algorithm, proposed by Gusfield (1991), generates a graphical representation of binary molecular sequence data that have been produced according to the infinitely-many-sites mutation model. Label individual sequences 1, 2, ..., n, and label mutated or "segregating" sites a, b, .... The original algorithm generates a rooted tree structure known as a perfect phylogeny, with tips labeled 1, 2, ..., n and with edges labeled a, b, ..., that is in bijection with the observed "labeled data." An edge can have no labels, one label, or more than one label. Perfect phylogenies have been central to coalescent-based inference algorithms, in which maximum likelihood or Bayesian estimates of the evolutionary parameters that have given rise to the particular distribution of mutations and clade sizes on the perfect phylogeny are sought by importance sampling or Markov chain Monte Carlo (Griffiths and Tavaré 1994; Stephens and Donnelly 2000; Palacios et al. 2019; Cappello et al. 2020b).
In this study, we assume that individual sequences are not uniquely labeled, but instead, are identified by their sequences of 0s and 1s, or haplotypes. Hence, the number of tips in our perfect phylogeny is the number of unique haplotypes, and the labels at the tips correspond to the observed frequencies of the haplotypes. For the genealogy in Fig. 3A, Fig. 3B shows the perfect phylogeny of the data observed at its tips.
The key assumption of the bijection between sequence data sets and perfect phylogenies is that if a site mutates once, then all descendants of the lineage on which the mutation occurred must also have the mutation, and no other individuals will have the mutation. That is, every unique mutation, or site, partitions the sample of haplotypes into two groups: those with the mutation and those without the mutation. Hence, we group sites that induce the same partition on the haplotypes, and we call each such group of sites a mutation group.
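As an illustration of haplotype frequencies and mutation groups, the sketch below summarizes the five sequences of Fig. 3A. It is not the perfect phylogeny algorithm itself; the function name `summarize` and the 0-based site indices are assumptions made for this example.

```python
from collections import Counter, defaultdict


def summarize(sequences):
    """Return haplotype frequencies and mutation groups for binary sequences.

    sequences: list of equal-length strings over {'0', '1'}, 0 = ancestral.
    Sites (0-based indices) that are carried by the same set of haplotypes
    are placed in the same mutation group.
    """
    freqs = Counter(sequences)                    # haplotype -> frequency
    haplotypes = sorted(freqs)
    groups = defaultdict(list)
    for site in range(len(sequences[0])):
        carriers = frozenset(h for h in haplotypes if h[site] == "1")
        if carriers:                              # ignore unmutated sites
            groups[carriers].append(site)
    return freqs, dict(groups)


# Tip data of Fig. 3A; sites 0, 1, 2, 3 correspond to mutations a, b, c, d.
data = ["0001", "0001", "1110", "1110", "1000"]
freqs, groups = summarize(data)
```

In this data set, sites b and c are carried by the same haplotype and therefore fall in one mutation group, whereas a and d each form their own group.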
In this study, we are not concerned with the mutation labels, and hence, we remove the edge labels of the perfect phylogeny (right side of Fig. 3B), so that we consider only the topology of the perfect phylogeny. In dropping the edge labels, we treat a perfect phylogeny topology as a perfect phylogeny. Henceforth, a perfect phylogeny is a multifurcating rooted tree with k leaves, representing k distinct haplotypes, each labeled by a positive integer $(n_i)_{1 \leq i \leq k}$, with $\sum_{i=1}^{k} n_i = n$. We use the symbol $\Pi_n$ to denote the space of perfect phylogenies of size n sequences, and we use $\pi \in \Pi_n$ to denote a perfect phylogeny with n sequences.
The most extreme unresolved perfect phylogeny with n tips, namely the perfect phylogeny that is compatible with all ranked tree shapes with n tips, has two representations. It can be written as a star, in which the root has degree n and is the only internal node, that is, π = (1, 1, ..., 1). It can also be written as a single node π = (n). For our purposes, with mutations discarded, the star and single-node perfect phylogenies are indistinguishable, and they will be represented as a single-node perfect phylogeny. Details of the algorithm for generating the perfect phylogeny from binary molecular data can be found in Cappello et al. (2020a), which presents a slight modification to Gusfield's algorithm (Gusfield 1991).
We say that a binary tree T is compatible with a perfect phylogeny π if the tree can be reduced to π by collapsing internal edges of T. The number of tree shapes, ranked or unranked, that are compatible with a perfect phylogeny gives the cardinality of the corresponding posterior sampling tree space in statistical inference from sequence data sets. Given a perfect phylogeny $\pi \in \Pi_n$, we are interested in calculating the number of compatible ranked tree shapes with n leaves and the number of compatible unranked tree shapes with n leaves.

Known enumerative results
In advance of our effort to count tree shapes compatible with a perfect phylogeny, we state some known enumerative results for the unconstrained spaces of ranked labeled tree shapes, unranked labeled tree shapes, ranked unlabeled tree shapes, and unranked unlabeled tree shapes (Steel 2016).
Let $L_n = |\mathcal{T}^L_n|$ denote the cardinality of the space of ranked labeled trees with n leaves. Then

$$L_n = \prod_{i=2}^{n} \binom{i}{2} = \frac{n!\,(n-1)!}{2^{n-1}}. \tag{1}$$

The product is obtained by noting that for each decreasing i from n to 2, there are $\binom{i}{2}$ ways of merging two labeled branches. The sequence of values of $L_n$ begins 1, 1, 3, 18, 180, 2700, 56,700.
Let $X_n = |\mathcal{T}^X_n|$ denote the number of unranked labeled trees with n leaves. We have

$$X_n = \prod_{i=2}^{n} (2i-3) = \frac{(2n-2)!}{2^{n-1}\,(n-1)!}. \tag{2}$$

To generate trees in $\mathcal{T}^X_n$ from trees in $\mathcal{T}^X_{n-1}$, a pendant edge connected to the nth label can be placed along each of the 2n − 3 edges of a tree with n − 1 leaves, including an edge above the root. $X_n$ is obtained as the solution to the recursion $X_n = (2n-3) X_{n-1}$, with $X_1 = 1$. The sequence of values of $X_n$ begins 1, 1, 3, 15, 105, 945, 10,395.
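The two labeled counts are easy to check numerically. The sketch below is illustrative only (the function names are assumptions); it computes $L_n$ as the product of the merge counts C(i, 2) and $X_n$ from the recursion $X_n = (2n-3) X_{n-1}$, and reproduces the sequences quoted above.

```python
from math import comb, factorial


def ranked_labeled(n):
    """L_n: product of C(i, 2) for i = 2, ..., n (equals 1 for n = 1)."""
    out = 1
    for i in range(2, n + 1):
        out *= comb(i, 2)
    return out


def unranked_labeled(n):
    """X_n from X_n = (2n - 3) X_{n-1} with X_1 = 1."""
    out = 1
    for m in range(2, n + 1):
        out *= 2 * m - 3
    return out


L_values = [ranked_labeled(n) for n in range(1, 8)]
X_values = [unranked_labeled(n) for n in range(1, 8)]

# The product for L_n agrees with the closed form n! (n - 1)! / 2^(n - 1).
closed_form_ok = all(
    ranked_labeled(n) * 2 ** (n - 1) == factorial(n) * factorial(n - 1)
    for n in range(1, 15)
)
```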
The number of ranked tree shapes with n tips is the (n − 1)-th Euler zigzag number (Stanley 2012). Let $R_n = |\mathcal{T}^R_n|$ denote the number of ranked tree shapes with n leaves. We have the following recursion, valid for $n \geq 2$ with $R_1 = R_2 = 1$:

$$R_{n+1} = \frac{1}{2} \sum_{k=0}^{n-1} \binom{n-1}{k} R_{k+1} R_{n-k}. \tag{3}$$

The sequence of values of $R_n$ begins 1, 1, 1, 2, 5, 16, 61. For n ≥ 1, if the tree has n + 1 tips, and hence n interior nodes, then the root divides the tree into two ranked subtrees $T^R_1$ and $T^R_2$, where $T^R_1$ has k interior nodes, 0 ≤ k ≤ n − 1, and $T^R_2$ has n − 1 − k interior nodes. There are $\binom{n-1}{k}$ ways of interleaving the k and n − 1 − k interior nodes of $T^R_1$ and $T^R_2$, such that the relative orderings of the interior nodes of $T^R_1$ and $T^R_2$ are preserved in the interleaving. The number of possible ranked tree shapes with such a configuration is $\binom{n-1}{k} R_{k+1} R_{n-k}$. Summing over the possibilities for k from 0 to n − 1, and acknowledging that the identity of $T^R_1$ and $T^R_2$ can be interchanged, we get Eq. 3.
Let $S_n = |\mathcal{T}_n|$ denote the number of unranked tree shapes with n leaves. We have the following recursion, with $S_1 = 1$:

$$S_{2n-1} = \sum_{k=1}^{n-1} S_k S_{2n-1-k}, \qquad S_{2n} = \sum_{k=1}^{n-1} S_k S_{2n-k} + \frac{1}{2} S_n (S_n + 1). \tag{4}$$

$S_n$ is the nth Wedderburn-Etherington number (Harding 1971). The sequence begins 1, 1, 1, 2, 3, 6, 11. When the number of leaves is 2n − 1, the root divides the tree shape into two subtree shapes $T_1$ and $T_2$ with k and 2n − 1 − k leaves, for k = 1, 2, ..., n − 1.
When the number of leaves is even, the root divides the tree shape into subtree shapes with k and 2n − k leaves for k = 1, 2, ..., n − 1 or two subtree shapes with n leaves; these tree shapes are indistinguishable in $S_n$ cases and distinguishable in $\frac{1}{2} S_n (S_n - 1)$ cases.
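Both unlabeled recursions can be evaluated with memoization. The following is a minimal sketch, with function names `R` and `S` chosen here to mirror the symbols in the text; the ranked recursion is applied for trees with at least three leaves, with $R_1 = R_2 = 1$ as base cases.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def R(n):
    """Number of ranked unlabeled tree shapes with n leaves."""
    if n <= 2:
        return 1
    m = n - 1  # the tree has m interior nodes; split the other m - 1 at the root
    total = sum(comb(m - 1, k) * R(k + 1) * R(m - k) for k in range(m))
    return total // 2  # the two root subtrees are interchangeable


@lru_cache(maxsize=None)
def S(n):
    """Number of unranked unlabeled tree shapes with n leaves."""
    if n == 1:
        return 1
    # Root splits into subtrees with k < n - k leaves.
    total = sum(S(k) * S(n - k) for k in range(1, (n + 1) // 2))
    if n % 2 == 0:
        half = n // 2
        total += S(half) * (S(half) + 1) // 2  # two subtrees of equal size
    return total


R_values = [R(n) for n in range(1, 8)]
S_values = [S(n) for n in range(1, 8)]
```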

Enumeration for binary perfect phylogenies
To count ranked and unranked tree shapes compatible with a perfect phylogeny, we first consider binary perfect phylogenies: those perfect phylogenies for which the outdegree of any node, traversing from root to tips, is either 0 (leaves or taxa) or 2 (internal nodes). We then consider multifurcating perfect phylogenies in Sect. 4.

Lattice structure of binary perfect phylogenies
The binary perfect phylogenies for a set of n tips possess a structure that will assist in enumerating binary ranked and unranked trees compatible with a set of sequences. In particular, we can make the set $\Pi_n$ of all binary perfect phylogenies of [n] into a poset by defining $\pi \leq \sigma$ if either σ is the same as π, or if σ can be obtained by sequentially collapsing pairs of pendant edges, or cherries, of π. We then say π is a refinement of σ. For example, π = (2, 3) refines σ = (5). We say that two binary perfect phylogenies in $\Pi_n$ are comparable if they are equal or if one is a refinement of the other. An example of two perfect phylogenies that are not comparable is π = (2, 3) and σ = (4, 1).
Given two binary perfect phylogenies $\pi_1$ and $\pi_2$ in $\Pi_n$, their meet, denoted $\pi_1 \wedge \pi_2$, is the largest perfect phylogeny that refines both $\pi_1$ and $\pi_2$. Similarly, the join of two binary perfect phylogenies $\pi_1 \vee \pi_2$ is the smallest perfect phylogeny that is refined by both $\pi_1$ and $\pi_2$. Formal definitions of these notions appear in Definition 1.
Under the meet and join operations, we will see in Theorem 5 that the poset $\Pi_n \cup \{\emptyset\}$ is a lattice $\mathcal{L}_n = (\Pi_n \cup \{\emptyset\}, \wedge, \vee)$. As a lattice, $\mathcal{L}_n$ possesses a Hasse diagram with a minimal and a maximal element. The maximal element of $\mathcal{L}_n$ is the single node perfect phylogeny (n) and the minimal element is ∅. Figures 4 and 5 show the Hasse diagrams of $\mathcal{L}_2$, $\mathcal{L}_3$, $\mathcal{L}_4$, $\mathcal{L}_5$.
The following proposition extends properties of the meet and join operations. It is proved in the "Appendix".
if $\pi_2 = \pi_4$; $(\pi_1 \vee \pi_3, \pi_2 \vee \pi_4) \wedge (\pi_1 \vee \pi_4, \pi_2 \vee \pi_3)$ otherwise, with the convention that $(\pi, \emptyset) = \emptyset$. That is, the join of two perfect phylogenies is the meet of the two perfect phylogenies formed by merging two subtrees at the root. These four subtrees (two per newly formed perfect phylogeny) correspond to the joins of all pairs of subtrees, one from each of the original perfect phylogenies. In the particular case that the two original perfect phylogenies share one of the subtrees descending from the root, the join of the two perfect phylogenies is the perfect phylogeny that merges, at the root, the shared subtree with the join of the two different subtrees, one from each of the original perfect phylogenies. In the case that no two pairs of subtrees, one from each of the original perfect phylogenies, have the same size, the join is the maximal single node perfect phylogeny (n).
4. For all $\pi_1, \pi_2, \pi_3 \in \Pi_n$, $\pi_1 \wedge (\pi_2 \vee \pi_3) = (\pi_1 \wedge \pi_2) \vee (\pi_1 \wedge \pi_3)$, and $\pi_1 \vee (\pi_2 \wedge \pi_3) = (\pi_1 \vee \pi_2) \wedge (\pi_1 \vee \pi_3)$.
Note that the meet and join operations are symmetric and that pairs $(\pi_1, \pi_2)$ are unordered; for convenience, we have expanded expressions in parts 1 and 3 of the proposition that could potentially be simplified using the symmetry.
We illustrate the operations in Definition 1 by considering a series of examples.
To make use of the operations ∧ and ∨ for counting binary ranked and unranked trees compatible with a perfect phylogeny, we need a theorem that shows that the two operations ∧ and ∨ induce the same order. That is, we will show that $(\Pi_n \cup \{\emptyset\}, \wedge, \vee)$ is a lattice.

Unranked unlabeled tree shapes compatible with a binary perfect phylogeny
With the lattice structure of the binary perfect phylogenies established, we are now equipped to calculate the number of compatible unranked unlabeled tree shapes with n leaves. Notice that an unranked unlabeled tree shape can be transformed into a perfect phylogeny with the same number of tips by assigning the count 1 to all leaves. We use $P(T_n)$ to denote the perfect phylogeny with n tips that corresponds to the unranked unlabeled tree shape $T_n$.
Definition 6 (Unranked unlabeled tree shape $T_n$ compatible with a perfect phylogeny $\pi \in \Pi_n$). An unranked unlabeled tree shape with n leaves, $T_n$, is compatible with a perfect phylogeny $\pi \in \Pi_n$, if (1) a one-to-one correspondence exists between the k leaves of π with leaf counts $n_1, n_2, \ldots, n_k$ and k disjoint subtrees of $T_n$ containing $n_1, n_2, \ldots, n_k$ leaves, respectively; and (2) $P(T_n) \leq \pi$, that is, $P(T_n)$ is a refinement of π.
We use the symbol $G_c(\pi) = \{T_n : T_n \text{ is compatible with } \pi\}$ to denote the set of unranked unlabeled tree shapes compatible with a perfect phylogeny $\pi \in \Pi_n$. For a perfect phylogeny π consisting of a single leaf with leaf count n, the number of compatible unranked unlabeled tree shapes is simply the number of unranked unlabeled tree shapes of size n, or $|G_c(\pi)| = S_n$. Figure 7 shows an example of an unranked unlabeled tree shape compatible with a perfect phylogeny of sample size 7.
Proposition 7 For $n_1, n_2 \geq 1$, the number of unranked unlabeled tree shapes compatible with a cherry perfect phylogeny $(n_1, n_2) \in \Pi_n$ is

$$|G_c(n_1, n_2)| = \begin{cases} S_{n_1} S_{n_2}, & n_1 \neq n_2, \\ \frac{1}{2} S_{n_1} (S_{n_1} + 1), & n_1 = n_2. \end{cases}$$

Proof By Definition 6, an unranked unlabeled tree shape is compatible with the perfect phylogeny $\pi = (n_1, n_2)$ if it possesses two subtrees, one with $n_1$ leaf descendants and another with $n_2$ leaf descendants. Decomposing an unranked unlabeled tree shape at its root, the number of shapes with this property is $S_{n_1} S_{n_2}$ for $n_1 \neq n_2$ and $\frac{1}{2} S_{n_1} (S_{n_1} + 1)$ for $n_1 = n_2$.
Propositions 7 and 8 provide a recursive formula for calculating the number of tree shapes compatible with a binary perfect phylogeny. For example, examining Fig. 6A, the number of tree shapes compatible with (4, 2) is $S_4 S_2 = 2$, and the number of tree shapes compatible with ((4, 2) [...]. Table 1 shows the number of tree shapes compatible with certain perfect phylogenies of sample size 10.
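The cherry case can be checked directly. The sketch below is illustrative and covers only cherry perfect phylogenies $(n_1, n_2)$, using the two cases given in the proof above (a product of subtree counts for unequal sizes, and $\frac{1}{2} S_{n_1}(S_{n_1}+1)$ for equal sizes); the function names are assumptions. Summing over all root splits of n recovers $S_n$, which serves as a consistency check.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def S(n):
    """Number of unranked unlabeled tree shapes with n leaves (S_1 = 1)."""
    if n == 1:
        return 1
    total = sum(S(k) * S(n - k) for k in range(1, (n + 1) // 2))
    if n % 2 == 0:
        total += S(n // 2) * (S(n // 2) + 1) // 2
    return total


def unranked_cherry(n1, n2):
    """Unranked unlabeled tree shapes compatible with the cherry (n1, n2)."""
    if n1 != n2:
        return S(n1) * S(n2)
    return S(n1) * (S(n1) + 1) // 2


example = unranked_cherry(4, 2)  # the (4, 2) example in the text
```

For instance, the cherries (1, 5), (2, 4) and (3, 3) together account for all unranked tree shapes with 6 leaves.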

Ranked unlabeled tree shapes compatible with a binary perfect phylogeny
Next, for a binary perfect phylogeny, we compute the number of compatible ranked unlabeled tree shapes with n leaves.
Definition 9 (Ranked unlabeled tree shape $T^R_n$ compatible with a perfect phylogeny $\pi \in \Pi_n$). A ranked unlabeled tree shape with n leaves, $T^R_n$, is compatible with a perfect phylogeny $\pi \in \Pi_n$ if the unranked unlabeled tree shape $T_n$ obtained by removing the ranking from $T^R_n$ is compatible with π.
Proposition 10 For $n_1, n_2 \geq 1$, the number of ranked unlabeled tree shapes compatible with a cherry perfect phylogeny $(n_1, n_2) \in \Pi_n$ is

$$\begin{cases} \binom{n_1+n_2-2}{n_1-1} R_{n_1} R_{n_2}, & n_1 \neq n_2, \\ \frac{1}{2} \binom{2n_1-2}{n_1-1} R_{n_1}^2, & n_1 = n_2 \end{cases}$$

(the second case applies for $n_1 = n_2 \geq 2$; the cherry (1, 1) has exactly one compatible ranked tree shape).
Proof By Definition 9, a ranked unlabeled tree shape $T^R$ is compatible with the perfect phylogeny $\pi = (n_1, n_2)$ if the associated unranked unlabeled tree shape T obtained by removing the ranking of $T^R$ is compatible with π. By Definition 6, the unranked unlabeled tree shape T is compatible with the perfect phylogeny $\pi = (n_1, n_2)$ if it possesses two subtrees, one with $n_1$ leaf descendants and another with $n_2$ leaf descendants.
We decompose a ranked unlabeled tree at its root into subtrees of size $n_1$ and $n_2$. If $n_1 \neq n_2$, then the $n_1 - 1$ interior nodes of the subtree with $n_1$ leaves and the $n_2 - 1$ interior nodes of the subtree with $n_2$ leaves can be interleaved in $\binom{n_1+n_2-2}{n_1-1}$ ways. If $n_1 = n_2$, then the two ranked subtrees can be the same in $R_{n_1}$ ways, each with $\frac{1}{2} \binom{2n_1-2}{n_1-1}$ ways of interleaving the two ranked unlabeled subtrees; the two ranked subtrees can differ in $\frac{1}{2} (R_{n_1}^2 - R_{n_1})$ ways, each with $\binom{2n_1-2}{n_1-1}$ ways of interleaving the subtrees.
Proposition 11 For $n_1, n_2 \geq 1$ and $\pi_1 \in \Pi_{n_1}$, $\pi_2 \in \Pi_{n_2}$, the number of ranked unlabeled tree shapes compatible with a binary perfect phylogeny $\pi = (\pi_1, \pi_2) \in \Pi_n$ is [...]
Proof If $\pi_1 \wedge \pi_2 = \emptyset$, then the number of ranked tree shapes compatible with $(\pi_1, \pi_2)$ is simply the product of the number of ranked tree shapes compatible with $\pi_1$, the number of ranked tree shapes compatible with $\pi_2$, and the number of ways of interleaving their rankings.
If $\pi_1 \wedge \pi_2 \neq \emptyset$, then certain ranked tree shapes can be compatible with both $\pi_1$ and $\pi_2$, i.e., compatible with $\pi_1 \wedge \pi_2$. We therefore have three cases: the two perfect phylogenies are the same, one is a refinement of the other (two possible ways), or neither is a refinement of the other. The cardinalities in these cases are [...], respectively, all multiplied by the possible number of interleavings of the rankings $\binom{2n_1-2}{n_1-1}$.
Propositions 10 and 11 provide a recursive formula for calculating the number of ranked tree shapes compatible with a binary perfect phylogeny. For Fig. 6A, the number of ranked tree shapes compatible with (4, 2) is $\binom{4}{3} R_4 R_2 = (4)(2) = 8$, and the number of ranked tree shapes compatible with ((4, 2), 6) is [...]. Table 1 shows the number of ranked unlabeled tree shapes compatible with some of the perfect phylogenies of sample size 10. We can observe that these numbers exceed corresponding numbers of unranked unlabeled tree shapes compatible with the perfect phylogenies, just as the numbers of ranked unlabeled tree shapes exceed the numbers of unranked unlabeled tree shapes (Sect. 2.4).
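The cherry count for ranked shapes can be verified in the same way as for unranked shapes. The sketch below is an illustration restricted to cherries $(n_1, n_2)$ and follows the root decomposition in the proof of Proposition 10; the function names are assumptions, and the degenerate cherry (1, 1), which has a single compatible ranked shape, is handled explicitly.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def R(n):
    """Number of ranked unlabeled tree shapes with n leaves."""
    if n <= 2:
        return 1
    m = n - 1
    return sum(comb(m - 1, k) * R(k + 1) * R(m - k) for k in range(m)) // 2


def ranked_cherry(n1, n2):
    """Ranked unlabeled tree shapes compatible with the cherry (n1, n2)."""
    ways = comb(n1 + n2 - 2, n1 - 1)   # interleavings of interior nodes
    if n1 != n2:
        return ways * R(n1) * R(n2)
    if n1 == 1:
        return 1                       # two leaves joined at the root
    return ways * R(n1) ** 2 // 2      # equal sizes: subtrees interchangeable


example = ranked_cherry(4, 2)  # the (4, 2) example in the text
```

Summing `ranked_cherry(k, n - k)` over k = 1, ..., ⌊n/2⌋ recovers $R_n$, because every ranked tree shape is compatible with exactly one cherry.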
For the ranked unlabeled tree shapes compatible with a binary perfect phylogeny, we can examine the asymptotic growth of the number of compatible ranked unlabeled tree shapes in particular families of binary perfect phylogenies. For a fixed integer value x ≥ 1, consider the family of binary perfect phylogenies $B_x(n) = (x, n - x)$ as n increases. These are cherry phylogenies with labels x and n − x at their two leaves. Let $b_x(n)$ be the number of ranked unlabeled tree shapes compatible with $B_x(n)$. Among the integer sequences $b_1(n), b_2(n), b_3(n), \ldots$, the next proposition shows that $b_2(n)$ has the fastest asymptotic growth. In other words, as n grows large, the value of x for which the number of ranked unlabeled tree shapes compatible with the perfect phylogeny $B_x(n)$ is asymptotically largest is x = 2.
Proof For a fixed integer value x ≥ 0, let $\beta_x = (x + 1, n - x + 1)$ be a binary perfect phylogeny with two leaves, labeled by x + 1 (say to the left of the root) and n − x + 1 (to the right of the root). The set of ranked unlabeled tree shapes compatible with $\beta_x$ corresponds to the set of ranked unlabeled tree shapes with n + 1 internal nodes (n + 2 leaves), x internal nodes for the left root subtree, and n − x internal nodes for the right root subtree.
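We consider an increasing sequence of values of n. Supposing n > 2x so that the root subtrees of $\beta_x$ cannot have the same sample size, we apply Proposition 11, finding that the number of ranked unlabeled tree shapes compatible with $\beta_x$ is

$$\binom{n}{x} e_x e_{n-x}, \tag{10}$$

where $e_i$ is the number of ranked unlabeled tree shapes with i internal nodes. Following Eq. 3, the integer $e_i$ is the ith Euler number, $e_i = R_{i+1}$. The exponential generating function of the sequence $(e_i)$ is (Brent and Harvey 2013)

$$\sum_{i=0}^{\infty} e_i \frac{z^i}{i!} = \sec z + \tan z. \tag{11}$$

We can write the ratio $q_i = e_i / i!$ as (Flajolet and Sedgewick 2009, p. 269; Brent and Harvey 2013)

$$q_i = 2 \left( \frac{2}{\pi} \right)^{i+1} \sum_{k=0}^{\infty} \frac{(-1)^{k(i+1)}}{(2k+1)^{i+1}}. \tag{12}$$

As i becomes large, by applying singularity analysis to Eq. 11, or by computing directly from Eq. 12, we have the asymptotic relation

$$q_i \sim 2 \left( \frac{2}{\pi} \right)^{i+1}. \tag{13}$$

With $q_x = e_x / x!$, we rewrite Eq. 10 as $n!\, q_x q_{n-x}$. Letting n → ∞ for a fixed x, we can use Eq. 12 to rewrite $q_x$, and because x is constant as n grows, we can use Eq. 13 for the asymptotic value of $q_{n-x}$. Hence, for increasing values of n, the number of ranked tree shapes compatible with the perfect phylogeny $\beta_x$ behaves asymptotically like the product of n! and

$$4 \left( \frac{2}{\pi} \right)^{n+2} c_x, \tag{14}$$

where

$$c_x = \sum_{k=0}^{\infty} \frac{(-1)^{k(x+1)}}{(2k+1)^{x+1}},$$

which for odd x equals $(1 - 2^{-(x+1)})\, \zeta(x+1)$. Note that $\zeta(s) = \sum_{k=1}^{\infty} 1/k^s$ is the Riemann zeta function. If x is even, then $c_x$ is an alternating series with decreasing terms and leading term 1, so that $c_x < 1$. Among odd values of x, we have $c_1 = \frac{3}{4} \zeta(2) = \pi^2/8 \approx 1.2337$ for x = 1. For odd x ≥ 3, we have $c_x \leq c_3 = \frac{15}{16} \zeta(4) = \pi^4/96 \approx 1.0147$. Hence, $c_1 > 1$ exceeds $c_x$ both for even x and for all odd x ≥ 3.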
Because $c_x$ has its maximum at x = 1, from Eq. 14, we conclude that the product $q_x q_{n-x}$ grows asymptotically fastest for x = 1. In particular, as n → ∞, the value of x for which the binary perfect phylogeny $\beta_x$ has the largest number of compatible ranked unlabeled tree shapes is x = 1, that is, when $\beta_x = \beta_1 = (2, n)$.
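The conclusion of the proof can be checked numerically for a moderate n. The sketch below is illustrative and uses the indexing of the proof, in which $\beta_x = (x + 1, n - x + 1)$ has count $\binom{n}{x} e_x e_{n-x}$. The function names are assumptions, and the constant $c_x$ is taken, as an assumption consistent with the value $c_1 = \pi^2/8$ quoted above, to be the series $\sum_{k \geq 0} (-1)^{k(x+1)} / (2k+1)^{x+1}$, approximated by a truncated partial sum.

```python
from math import comb, pi


def euler_numbers(m):
    """Return [e_0, ..., e_m], where e_i counts ranked unlabeled tree
    shapes with i internal nodes (e_i = R_{i+1})."""
    e = [1, 1]
    for n in range(1, m):
        # 2 e_{n+1} = sum_k C(n, k) e_k e_{n-k} for n >= 1
        e.append(sum(comb(n, k) * e[k] * e[n - k] for k in range(n + 1)) // 2)
    return e[: m + 1]


def count_beta(n, x, e):
    """Ranked unlabeled tree shapes compatible with beta_x, for n > 2x."""
    return comb(n, x) * e[x] * e[n - x]


def c(x, terms=20000):
    """Truncated series for the constant c_x (an approximation)."""
    return sum((-1) ** (k * (x + 1)) / (2 * k + 1) ** (x + 1)
               for k in range(terms))


n = 40
e = euler_numbers(n)
best_x = max(range(0, 20), key=lambda x: count_beta(n, x, e))
```

With n = 40, the maximizing index among x = 0, ..., 19 is already x = 1, corresponding to the cherry with leaf counts 2 and n.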
We also obtain the following corollary.

Ranked labeled tree shapes compatible with a labeled binary perfect phylogeny
Propositions 7, 8, 10 and 11 provide recursive formulas for enumerating unranked unlabeled tree shapes and ranked unlabeled tree shapes compatible with a binary perfect phylogeny. In these cases, a perfect phylogeny representation does not use individual sequence labels; the labels of the tips of the perfect phylogeny are simply counts of numbers of sequences. We now consider labeled perfect phylogenies that partition the set of labeled individual sequences. We still use the parenthetical notation described in Sect. 2.3 to denote a labeled perfect phylogeny, for example π = (2, 3); however, it must be understood that this labeled perfect phylogeny partitions the sampled sequences into two different sets of labeled sequences. Consider $\{x_1, x_2\}$ and $\{x_3, x_4, x_5\}$ in the perfect phylogeny of Fig. 8B. We are now interested in calculating the number of ranked labeled tree shapes compatible with a labeled binary perfect phylogeny. Figure 8C shows all the ranked labeled tree shapes compatible with the labeled perfect phylogeny. For ranked labeled tree shapes, the enumeration follows a simple recursive expression.
Definition 14 (Ranked labeled tree shape $T^L_n$ compatible with a labeled perfect phylogeny $\pi \in \Pi^L_n$). A ranked labeled tree shape with n leaves, $T^L_n$, is compatible with a perfect phylogeny $\pi \in \Pi^L_n$ if the unranked unlabeled tree shape $T_n$ obtained by removing the ranks and the labels from $T^L_n$ is compatible with π and the one-to-one correspondence between the k leaves of π and the k disjoint subtrees of $T^L_n$ corresponds to the same partition of the individual sequences.
Proposition 15 For $n_1, n_2 \geq 1$ and $\pi_1 \in \Pi^L_{n_1}$, $\pi_2 \in \Pi^L_{n_2}$, the number of ranked labeled tree shapes compatible with a labeled binary perfect phylogeny $\pi = (\pi_1, \pi_2)$ is [...]
Proof We can count the number of ranked labeled tree shapes by dividing π at the root into two subtrees, one with $n_1$ leaves and perfect phylogeny $\pi_1$, and the other with $n_2$ leaves and perfect phylogeny $\pi_2$, both partitioning the sampled sequences.
The number of such trees is the product of the numbers of ranked labeled trees for the two subtrees and the number of ways of interleaving the internal nodes of the two subtrees. In this case, the two perfect phylogenies π 1 and π 2 can never be identical because they correspond to different sets of sequences.
Counts for the number of ranked labeled tree shapes for some of the perfect phylogenies of 10 taxa (with an arbitrary labeling) appear in Table 1. Given a perfect phylogeny in the table, we can observe that the number of ranked labeled tree shapes far exceeds the number of ranked unlabeled tree shapes.
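The recursive count described in the proof can be sketched as follows. This is an illustration under stated assumptions: a labeled binary perfect phylogeny is encoded as a nested pair whose leaves are integers (leaf counts), a single leaf with count n contributes $L_n = n!(n-1)!/2^{n-1}$ ranked labeled shapes, and each internal node multiplies the counts of its two subtrees by the number of interleavings of their interior nodes. The encoding and function names are hypothetical.

```python
from math import comb, factorial


def L(n):
    """Number of ranked labeled tree shapes with n leaves."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def size(pi):
    """Total number of sequences in a (nested-pair) perfect phylogeny."""
    return pi if isinstance(pi, int) else size(pi[0]) + size(pi[1])


def ranked_labeled_compatible(pi):
    """Ranked labeled tree shapes compatible with a labeled binary
    perfect phylogeny given as an int (leaf) or a pair (left, right)."""
    if isinstance(pi, int):
        return L(pi)
    left, right = pi
    n1, n2 = size(left), size(right)
    interleavings = comb(n1 + n2 - 2, n1 - 1)
    return (interleavings
            * ranked_labeled_compatible(left)
            * ranked_labeled_compatible(right))


example = ranked_labeled_compatible((2, 3))  # the perfect phylogeny of Fig. 8B
```

For the labeled perfect phylogeny (2, 3), the sketch returns 9, in agreement with the nine ranked labeled tree shapes described for Fig. 8C.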
We can obtain a result analogous to Proposition 12: among the binary labeled perfect phylogenies $B_x(n) = (x, n - x)$, we characterize the one compatible with the largest number of ranked labeled tree shapes. Let $b_x(n)$ denote the number of ranked labeled tree shapes compatible with $B_x(n)$.

Proposition 16 For $n \geq 2$, among the labeled perfect phylogenies $B_x(n)$ with $1 \leq x \leq \lfloor n/2 \rfloor$, the number $b_x(n)$ of compatible ranked labeled tree shapes is largest for $x = 1$.

Proof Applying Proposition 15, we have $b_x(n) = \binom{n-2}{x-1} L_x L_{n-x}$. Simplifying with Eq. 1, we obtain $b_x(n) = [n!\,(n-2)!/2^{n-2}] \binom{n}{x}^{-1}$. As it is quickly verified that the binomial coefficients $\binom{n}{x}$ increase monotonically from $x = 1$ to $x = \lfloor n/2 \rfloor$, $b_x(n)$ decreases monotonically from $x = 1$ to $x = \lfloor n/2 \rfloor$.
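The simplification step in this proof can be written out explicitly. The derivation below is a sketch that assumes Eq. 1 is the standard count of ranked labeled tree shapes, $L_n = n!\,(n-1)!/2^{n-1}$; the symbols are those of the proof.

```latex
\begin{align*}
b_x(n) &= \binom{n-2}{x-1} L_x L_{n-x} \\
       &= \frac{(n-2)!}{(x-1)!\,(n-x-1)!}
          \cdot \frac{x!\,(x-1)!}{2^{x-1}}
          \cdot \frac{(n-x)!\,(n-x-1)!}{2^{n-x-1}} \\
       &= \frac{(n-2)!\; x!\,(n-x)!}{2^{n-2}}
        = \frac{n!\,(n-2)!}{2^{n-2}} \binom{n}{x}^{-1}.
\end{align*}
```

As a check under the same assumption, $n = 5$ gives $b_1(5) = 18$ and $b_2(5) = 9$; the latter agrees with the nine ranked labeled tree shapes of Fig. 8C for the perfect phylogeny $(2, 3)$.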

Unranked labeled tree shapes compatible with a labeled binary perfect phylogeny
Continuing with the labeled perfect phylogenies from Sect. 3.4, we now count the unranked labeled tree shapes compatible with a labeled binary perfect phylogeny. Consider again the sets $\{x_1, x_2\}$ and $\{x_3, x_4, x_5\}$ in the perfect phylogeny of Fig. 8B. Each row of Fig. 8C corresponds to one of the unranked labeled tree shapes compatible with this labeled perfect phylogeny.
Definition 17 (Unranked labeled tree shape $T^{X}_{n}$ compatible with a labeled perfect phylogeny $\pi \in \Pi^{L}_{n}$). An unranked labeled tree shape with $n$ leaves, $T^{X}_{n}$, is compatible with a perfect phylogeny $\pi \in \Pi^{L}_{n}$ if the unranked unlabeled tree shape $T_n$ obtained by removing the labels from $T^{X}_{n}$ is compatible with $\pi$, and the one-to-one correspondence between the $k$ leaves of $\pi$ and the $k$ disjoint subtrees of $T^{X}_{n}$ corresponds to the same partition of the individual sequences.

Fig. 8 Coalescent and infinitely-many-sites generative model of binary molecular data. A A genealogy of 5 individuals, with 2 superimposed mutations depicted as gray squares. The root is labeled by the ancestral type 00, and the leaves are labeled by the genetic type at each of the two mutated sites. The first two leaves from left to right are labeled 01 because one mutation occurs in their path to the root. The third, fourth and fifth individuals have one mutation in their path to the root and are labeled 10. The order and labels of the mutations are unimportant; however, the individual labels $x_1, x_2, x_3, x_4, x_5$ are important. For ease of exposition, we label the mutations a, b. The first site corresponds to mutation a, and the second to b. B Left, a labeled perfect phylogeny representation of the observed data at the tips of (A). Data consist of 2 unique haplotypes 01 and 10, with frequencies 2 and 3, respectively. The corresponding frequencies are the labels of the tips of the perfect phylogeny; however, it is understood that the two leaves correspond to $\{x_1, x_2\}$ and $\{x_3, x_4, x_5\}$, respectively. Right, perfect phylogeny topology obtained by removing the edge labels of the perfect phylogeny. C The nine ranked labeled tree shapes compatible with the labeled perfect phylogeny topology in (B). Note that in (C), if we ignore the branching order and drop the internal node labels, the three trees in each row are equivalent, so that each row corresponds to one of the three unranked labeled tree shapes compatible with the labeled perfect phylogeny topology in (B).

Proposition 18 For $n_1, n_2 \geq 1$ and $\pi_1 \in \Pi^{L}_{n_1}$, $\pi_2 \in \Pi^{L}_{n_2}$, the number of unranked labeled tree shapes compatible with a labeled binary perfect phylogeny $\pi = (\pi_1, \pi_2)$ is

$$G^{X}_{c}(\pi_1, \pi_2) = G^{X}_{c}(\pi_1) \, G^{X}_{c}(\pi_2).$$

Proof We divide $\pi$ at the root into two subtrees, one with $n_1$ leaves and perfect phylogeny $\pi_1$, and the other with $n_2$ leaves and perfect phylogeny $\pi_2$. The subtrees must partition the sampled sequences in the same way as $\pi$. The number of such trees is simply the product of the numbers of unranked labeled trees for the two subtrees. As in Proposition 15, perfect phylogenies $\pi_1$ and $\pi_2$ are not identical because they correspond to different sets of sequences; with the ranking dropped, unlike in Proposition 15, we need not consider the number of ways of interleaving the internal nodes of the two subtrees.
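The product rule of Proposition 18 admits an equally short sketch. As before, this is an illustration rather than the authors' code: a labeled binary perfect phylogeny is assumed to be encoded as nested pairs of integers, and the base case for a leaf of size $n$ is assumed to be the number of rooted binary labeled trees on $n$ tips, $(2n-3)!! = 1 \cdot 3 \cdots (2n-3)$. The function names are hypothetical.

```python
def unranked_labeled_total(n):
    """Assumed base case: (2n-3)!! rooted binary labeled trees on n tips (1 for n <= 2)."""
    total = 1
    for odd in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        total *= odd
    return total


def unranked_labeled_compatible(pi):
    """Count unranked labeled tree shapes compatible with pi, following Proposition 18."""
    if isinstance(pi, int):
        return unranked_labeled_total(pi)
    left, right = pi
    # No interleaving factor: without ranks, the two subtrees are counted independently.
    return unranked_labeled_compatible(left) * unranked_labeled_compatible(right)


print(unranked_labeled_compatible((2, 3)))  # 3
```

For $\pi = (2, 3)$ the sketch returns 3, matching the three rows of Fig. 8C.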
For some of the perfect phylogenies of 10 taxa (with an arbitrary labeling), counts for the number of unranked labeled tree shapes appear in Table 1. The number of unranked labeled tree shapes far exceeds the number of unranked unlabeled tree shapes, and it generally exceeds the number of ranked unlabeled tree shapes.
types that are compatible with a perfect phylogeny, we must consider multifurcating perfect phylogenies. We proceed by reducing the multifurcating case to the binary case that has already been solved.
We now consider a multifurcating perfect phylogeny that consists of a single internal node subtending $k$ leaves with labels $n_1, n_2, \ldots, n_k$. An example is depicted in Fig. 9. Because multiple leaves can each correspond to groups with the same number of taxa, so that the same numerical label can be assigned to many of those leaves, it is convenient to denote the vector of unique labels by $a = (a_1, a_2, \ldots, a_s)$ and the corresponding vector of their multiplicities by $m = (m_1, m_2, \ldots, m_s)$, where $m_j$ denotes the number of leaves with label $a_j$, $1 \leq j \leq s \leq k$. In the example of Fig. 9, $a = (2, 3)$ and $m = (2, 2)$, as two leaves ($m_1 = 2$) have label 2 ($a_1 = 2$) and two leaves ($m_2 = 2$) have label 3 ($a_2 = 3$).
The lattice structure enables us to count the number of ranked unlabeled tree shapes compatible with a multifurcating perfect phylogeny $\pi = (n_1, n_2, \ldots, n_k)$. We use a recursive inclusion-exclusion principle with label vector $a$ and multiplicities $m$. The key idea is to decompose the computation into a sum over all possible binary perfect phylogenies, applying Propositions 10 and 11 to each binary perfect phylogeny. To recursively generate all possible binary perfect phylogenies from $\pi$, we define the operator $B_{i,j}(\pi)$ that collapses two leaves with labels $a_i$ and $a_j$ in $\pi$. For example B 2,3 (2, 2, 3, 4) = ((2, 3)

Annotations on the sums of Eq. 21 include: collapsing two pendant edges with the same leaf values; collapsing all pairs containing two distinct pairs of pendant edges, each pair with the same leaf values; and collapsing a pair of edges with different leaf values and collapsing a pair of edges with the same leaf values.

To interpret Eq. 21 as an inclusion-exclusion formula, notice that the first two sums that are added on the right-hand side of Eq. 21 correspond to enumerations of single events (so that the sum is analogous to a union $\cup A_i$), and the following three sums that are subtracted correspond to intersections of pairs of these events (analogous to intersections $A_i \cap A_j$). Equation 21 provides a recursive approach for counting the number of ranked unlabeled tree shapes compatible with a multifurcating perfect phylogeny by expressing the calculation in terms of binary perfect phylogenies. The recursive application of the equation proceeds until all terms reach $\sum_{i=1}^{s} m_i = 2$, when the binary perfect phylogenies are reached.
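To make the decomposition into binary perfect phylogenies concrete, the following sketch enumerates the distinct binary resolutions of a single multifurcating node by repeatedly collapsing pairs of children, in the spirit of the operator $B_{i,j}$. It is a brute-force illustration and not the recursion of Eq. 21: duplicate resolutions are removed with a set instead of being corrected for by inclusion-exclusion. The tuple encoding and the function names are assumptions made for this example.

```python
from itertools import combinations


def canonical(a, b):
    """Join two subtrees into an unordered pair with a deterministic child order."""
    return tuple(sorted((a, b), key=repr))


def binary_resolutions(leaves):
    """Set of distinct binary perfect phylogeny topologies for one multifurcating node.

    `leaves` is a tuple whose entries are integer leaf labels or already-collapsed
    subtrees (nested pairs). Leaves with equal labels are interchangeable.
    """
    items = tuple(sorted(leaves, key=repr))
    if len(items) == 1:
        return {items[0]}
    found = set()
    for i, j in combinations(range(len(items)), 2):
        merged = canonical(items[i], items[j])  # collapse one pair of children
        rest = tuple(x for k, x in enumerate(items) if k not in (i, j))
        found |= binary_resolutions(rest + (merged,))
    return found


for tree in sorted(binary_resolutions((2, 2, 3)), key=repr):
    print(tree)
```

For the labels (2, 2, 3) the sketch yields the two topologies ((2, 2), 3) and ((2, 3), 2). The number of collapse sequences grows quickly with $k$, so this is suitable only for small examples.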
For counting the number of unranked unlabeled tree shapes compatible with $\pi = (n_1, n_2, \ldots, n_k)$, we simply replace $G^{T}_{c}$ with $G_{c}$ in Eq. 21 and use Propositions 7 and 8 in place of Propositions 10 and 11.
To count the number of ranked labeled tree shapes compatible with a labeled multifurcating perfect phylogeny $\pi = (n_1, n_2, \ldots, n_k)$, we assume that although any leaf in the perfect phylogeny can have multiplicity larger than one, each leaf is uniquely defined by its associated taxa, all of which are assumed to have different labels. Therefore, we take $a = (n_1, n_2, \ldots, n_k)$ and $m = (1, 1, \ldots, 1)$. Equation 21 then reduces to a simpler form, with a term for collapsing two pairs of pendant edges.
The enumeration makes use of Proposition 15.
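For small labeled examples, the count can be cross-checked by brute force: enumerate every binary resolution of the multifurcating node and add the Proposition 15 counts. The sketch below rests on two assumptions that are not taken from the text: that each ranked labeled tree shape is compatible with exactly one binary resolution when all leaves are distinguishable, so that a plain sum does not double-count, and that the base case for a leaf of size $n$ is $n!\,(n-1)!/2^{n-1}$. Group names, the dictionary encoding, and function names are hypothetical.

```python
from itertools import combinations
from math import comb, factorial


def ranked_labeled_total(n):
    """Assumed base case: n! (n-1)! / 2^(n-1) ranked labeled tree shapes on n tips."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def count(node, sizes):
    """Return (number of sequences, Proposition 15 count) for one binary resolution."""
    if isinstance(node, str):  # a leaf, identified by its group name
        return sizes[node], ranked_labeled_total(sizes[node])
    (n1, c1), (n2, c2) = count(node[0], sizes), count(node[1], sizes)
    return n1 + n2, comb(n1 + n2 - 2, n1 - 1) * c1 * c2


def resolutions(items):
    """Distinct binary resolutions of a node whose children are all distinguishable."""
    items = tuple(sorted(items, key=repr))
    if len(items) == 1:
        return {items[0]}
    found = set()
    for i, j in combinations(range(len(items)), 2):
        merged = tuple(sorted((items[i], items[j]), key=repr))
        rest = tuple(x for k, x in enumerate(items) if k not in (i, j))
        found |= resolutions(rest + (merged,))
    return found


def star_ranked_labeled(sizes):
    """Sum the Proposition 15 counts over all binary resolutions of a labeled star."""
    return sum(count(tree, sizes)[1] for tree in resolutions(tuple(sizes)))


print(star_ranked_labeled({"A": 1, "B": 1, "C": 1, "D": 1}))  # 18
```

With four groups of size 1, every ranked labeled tree shape on four tips is compatible, and the sketch returns 18, the value of $4!\,3!/2^{3}$.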
Table 1 Number of trees compatible with example perfect phylogenies of 10 taxa. The entries in the table are obtained by repeated use of Propositions 7 and 8 for unranked unlabeled tree shapes, Propositions 10 and 11 for ranked unlabeled tree shapes, Proposition 15 for ranked labeled tree shapes, and Proposition 18 for unranked labeled tree shapes. An arbitrary labeling of the perfect phylogeny is assumed for counting the associated ranked and unranked labeled tree shapes. Figure 10 shows the corresponding partial Hasse diagram of the lattice of binary perfect phylogenies with 10 taxa.

Fig. 9 Example of all possible binary perfect phylogeny topologies for a given multifurcating perfect phylogeny topology. The binary perfect phylogenies are obtained from a multifurcating perfect phylogeny by resolving multifurcating nodes into sequences of bifurcations.

Conclusion
The infinitely-many-sites mutation model is a popular model of molecular variation for problems of population genetics (Wakeley 2008) and related areas (Jones et al. 2020), in which constraints are imposed on the space of trees that can explain the observed patterns of molecular variation. A realization of the coalescent model on a genealogy and a superimposed infinitely-many-sites mutation model can be summarized as a perfect phylogeny. Here, we have examined combinatorial properties of the genealogical tree structures that are compatible with a perfect phylogeny, demonstrating that the binary perfect phylogenies possess a lattice structure (Theorem 5). We have used this lattice structure to provide recursive enumerative results counting the trees (unranked unlabeled trees, ranked unlabeled trees, ranked labeled trees, and unranked labeled trees) compatible with binary perfect phylogenies. Further, for multifurcating perfect phylogenies, we have exploited a recursive inclusion-exclusion principle to decompose a multifurcating perfect phylogeny into all possible binary perfect phylogenies, extending the utility of our lattice approach from bifurcating structures to more general structures.
In our enumerative results, the count of the number of trees of a specified type that are compatible with a perfect phylogeny is obtained by a decomposition of the perfect phylogeny at its root. The number of associated trees is obtained by counting trees for each subtree immediately descended from the root of the perfect phylogeny and, where appropriate, by counting interleavings of nodes within those trees, by taking care to avoid double-counting, or both. This same technique was applicable for each of the types of trees we considered, appearing in Sects. 3.2, 3.3, 3.4, 3.5, and 4. We have provided examples for relatively small cases with $n = 10$ taxa (Table 1, Fig. 10). Owing to the recursive structure of the computation, the decomposition itself proceeds rapidly from the root through the internal nodes, so that a count can be quickly obtained even if the number itself is large. Our algorithmic implementation in Python does have a computational precision limitation, but it accommodates numbers up to the order of $10^{290}$.
We obtained results concerning the cherry perfect phylogenies with the largest numbers of ranked unlabeled, unranked labeled, and ranked labeled tree shapes (Propositions 12, 16, and 19), and it will be informative to seek a similar result for the unranked unlabeled case. The result in Proposition 12 on asymptotic growth of the number of ranked unlabeled tree shapes compatible with a binary perfect phylogeny is reminiscent of a result concerning "lodgepole" trees. A number of studies have examined another combinatorial structure for evolutionary trees, the number of "coalescent histories" associated with a labeled species tree and its matching labeled gene tree. These coalescent histories encode different evolutionary scenarios possible for the coalescence of gene lineages on a species tree. Disanto and Rosenberg (2015) found that the lodgepole trees, a class of trees in which cherry nodes with 2 descendants successively branch from a single species tree edge, possess a particularly large number of coalescent histories. Similarly, in Proposition 12, as $n$ increases, the number of ranked unlabeled tree shapes compatible with a cherry perfect phylogeny is largest when the perfect phylogeny has one subtree with sample size 2.
Perfect phylogenies have been widely studied in varied estimation problems, for the "perfect phylogeny problem" asking whether a perfect phylogeny can be constructed from data given on a set of characters (Agarwala and Fernández-Baca 1993;Kannan and Warnow 1997;Felsenstein 2004;Gusfield 2014;Steel 2016), statistical inference of evolutionary parameters under the coalescent (Griffiths and Tavaré 1994;Stephens and Donnelly 2000;Tavaré 2004;Palacios et al. 2019;Cappello et al. 2020b), and algorithmic estimation of haplotype phase from diploid data (Gusfield 2002;Bafna et al. 2004;Gusfield 2014). However, the literature on perfect phylogenies has largely focused on such applications and on algorithmic problems of obtaining perfect phylogenies from data under various constraints, with little emphasis on the enumerative combinatorics of the perfect phylogenies themselves, and of their associated refinements. In describing a lattice for the binary perfect phylogenies with sample size n, this study suggests that the mathematical properties of sets of perfect phylogenies as combinatorial structures per se can be informative. The link to coalescent histories suggests possible connections to related concepts such as "ancestral configurations" (Wu 2012; Disanto and Rosenberg 2017), which also can be described in terms of lattices (Alimpiev and Rosenberg 2022); it will be useful to consider perfect phylogenies alongside such structures arising in the combinatorics of evolutionary trees.
Finally, returning to considerations of coalescent-based inference from sequences, recall that inference of evolutionary parameters from a given perfect phylogeny is performed by integrating over the space of genealogies. A standard approach to inference integrates over the space of ranked labeled tree shapes generated by the Kingman coalescent (Drummond et al. 2012). However, this inference is computationally intractable for large sample sizes. We have observed a striking reduction in the cardinality of the set of ranked (and unranked) unlabeled tree shapes compatible with an observed perfect phylogeny, relative to the number of ranked (and unranked) labeled tree shapes compatible with an observed perfect phylogeny (Tables 1 and 2). This observation contributes to a growing branch of the area of coalescent-based inference (Sainudiin et al.; Palacios et al. 2015, 2019; Cappello et al. 2020a) that can make use of ranked unlabeled trees to estimate the evolutionary parameters.

Table 2 Ratio of the number of unranked labeled and unranked unlabeled tree shapes and ratio of the number of ranked labeled and ranked unlabeled tree shapes compatible with three perfect phylogenies of 10, 20 and 50 taxa

Appendix: Proof of Theorem 5
To prove Theorem 5, we must verify four pairs of conditions concerning perfect phylogenies $\pi \in \Pi_n \cup \{\emptyset\}$. Note that any binary perfect phylogeny $\pi \in \Pi_n \cup \{\emptyset\}$ is equal to $\emptyset$, $(n)$, or $(\pi_1, \pi_2)$ for two non-empty binary perfect phylogenies $\pi_1 \in \Pi_{n_1}$ and $\pi_2 \in \Pi_{n_2}$, where $1 \leq n_1, n_2 < n$ and $n_1 + n_2 = n$. Hence, we must demonstrate the four pairs of conditions for perfect phylogeny pairs that include $\emptyset$, $(n)$, or both, and for perfect phylogeny pairs that include neither $\emptyset$ nor $(n)$. Because perfect phylogenies can be decomposed into smaller perfect phylogenies, we proceed by induction on $n$, with a base case of $n = 1$. In the inductive step, we assume that $(\Pi_k \cup \{\emptyset\}, \wedge, \vee)$ is a lattice for all $k$, $1 \leq k < n$. We then verify that it follows that $(\Pi_n \cup \{\emptyset\}, \wedge, \vee)$ is a lattice. We start with Condition 2, which is trivial.

Condition 2: $x \wedge y = y \wedge x$ and $x \vee y = y \vee x$
For all $n$, Condition 2 of the definition of a lattice is trivially satisfied, as the operations $\wedge$ and $\vee$ are symmetric by definition. In subsequent derivations, we frequently apply Condition 2 without always noting its application.

Demonstrating Condition 3 requires that we verify a pair of conditions for each of the eight choices of $(x, y, z)$ for $x, y, z \in \Pi_1 \cup \{\emptyset\}$. Demonstrating Condition 4 requires that we verify a pair of conditions for each of the four choices of $(x, y)$. The 16 verifications for Condition 3 and eight verifications for Condition 4 all quickly follow by Definition 1 (1-4). Hence, $(\Pi_1 \cup \{\emptyset\}, \wedge, \vee)$ is a lattice.
Thus, Eq. 24 holds, so that Eq. 23 holds for the case in which subtrees are shared at the root.
Proof In the first case, $\pi_1 \wedge (\pi_2 \vee \pi_3)$ is the largest perfect phylogeny that refines both $\pi_1$ and $(\pi_2, \pi_3)$. This perfect phylogeny then refines both $\pi_1$ and $\pi_2$ or both $\pi_1$ and $\pi_3$; that is, it is the largest of $(\pi_1, \pi_2)$ and $(\pi_1, \pi_3)$, and this largest perfect phylogeny corresponds to their join. In the second case, $\pi_1 \vee (\pi_2 \wedge \pi_3)$ is the smallest perfect phylogeny that is refined by both $\pi_1$ and $(\pi_2 \vee \pi_3)$, that is, refined by both $\pi_1$ and $\pi_2$, and by both $\pi_1$ and $\pi_3$. This corresponds to the meet of the perfect phylogeny that is refined by $\pi_1$ and $\pi_2$ and the perfect phylogeny that is refined by $\pi_1$ and $\pi_3$.
Proof The meet of two incomparable perfect phylogenies $\pi$ and $\sigma$ is the largest perfect phylogeny that refines both $\pi$ and $\sigma$. If $\pi$ or $\sigma$ or both are perfect phylogenies with all tips labeled 1, for example $((1, 1), 1)$, then the only refinement of $\pi$ and $\sigma$ is $\emptyset$ and the result holds by Definition 1. Otherwise, a refinement of each perfect phylogeny can be obtained sequentially by branching any tip with label greater than 1, until a common perfect phylogeny is reached or until two perfect phylogenies with all tips labeled 1 are reached. If a common perfect phylogeny $\gamma$ is reached, we then have $\gamma \vee \pi = \pi$ and $\gamma \vee \sigma = \sigma$ and the result holds. If no common perfect phylogeny is reached, then $\gamma = \emptyset$ and the result holds. Similarly, the join of two incomparable perfect phylogenies is the smallest perfect phylogeny refined by both $\pi$ and $\sigma$. Since $\pi$ and $\sigma$ are not comparable, neither $\pi$ nor $\sigma$ is $(n)$; therefore, we can sequentially collapse pairs of pendant edges of $\pi$ and $\sigma$ until a common perfect phylogeny is reached or $(n)$ is reached. In both cases, the result holds.