Counting and optimising maximum phylogenetic diversity sets

In conservation biology, phylogenetic diversity (PD) provides a way to quantify the impact of the current rapid extinction of species on the evolutionary ‘Tree of Life’. This approach recognises that extinction not only removes species but also the branches of the tree on which unique features shared by the extinct species arose. In this paper, we investigate three questions that are relevant to PD. The first asks how many sets of species of given size k preserve the maximum possible amount of PD in a given tree. The number of such maximum PD sets can be very large, even for moderate-sized phylogenies. We provide a combinatorial characterisation of maximum PD sets, focusing on the setting where the branch lengths are ultrametric (e.g. proportional to time). This leads to a polynomial-time algorithm for calculating the number of maximum PD sets of size k by applying a generating function; we also investigate the types of tree shapes that harbour the most (or fewest) maximum PD sets of size k. Our second question concerns optimising a linear function on the species (regarded as leaves of the phylogenetic tree) across all the maximum PD sets of a given size. Using the characterisation result from the first question, we show how this optimisation problem can be solved in polynomial time, even though the number of maximum PD sets can grow exponentially. Our third question considers a dual problem: If k species were to become extinct, then what is the largest possible loss of PD in the resulting tree? For this question, we describe a polynomial-time solution based on dynamical programming.


Introduction
Advances in molecular genetics and computational techniques over recent decades have allowed biologists to reconstruct evolutionary relationships among thousands of species (Jetz et al. 2014;Upham et al. 2019). However, as fast as this 'Tree of Life' is being assembled, many of these species are heading to extinction because of anthropogenic impacts (Davis et al. 2018). This extinction of species also entails the loss of features and genetic variation through the differential pruning of the underlying tree structure. The impact on this tree is often estimated by the reduced sum of edge lengths measured in evolutionary time (Faith 1992). For example, if all 575 bird species classified as 'imperilled' were to disappear from the bird phylogeny (from ∼10,000 species), this would result in the loss of 2.7 billion years of evolution (Jetz et al. 2014).
The ancestral relationships between a set of species are generally modelled using phylogenetic trees (Felsenstein 2004), and one measure of how much of a tree is spanned by a subset of species is the phylogenetic diversity (PD) measure (precise definitions are provided in the next section; here, we give an informal description). In simple terms, every non-empty set of species defines a minimal subtree which connects those species to the root of the tree, and the length of every branch in that subtree is summed to give a PD score for the set overall. The greater the PD score, the more diverse a set of species is assumed to be. To illustrate, Fig. 1 shows the relative ancestry of the species x 1 , x 2 , . . . , x 7 . Solid edges are those used in the calculation of the PD score for species x 3 , x 4 and x 7 . Thus the PD score of {x 3 , x 4 , x 7 } is 16. Note that the PD score of {x 4 , x 7 } is 10.
An important concern of conservationists is preventing the extinction of species and the subsequent reduction of biodiversity. For a phylogenetic tree, an extinction is represented by the removal of that species' leaf from the tree. This also removes the edge which connected that species to the rest of the tree, lowering the PD score. In the case where more than one species becomes extinct, the combined effect can be much larger than the sum of individual extinctions. We can interpret Fig. 1 as representing the extinction of x 1 , x 2 , x 5 , and x 6 . Notice that the simultaneous extinction of x 1 and x 2 has caused the removal of a third edge, that connecting their least common ancestor to the rest of the tree. These types of dependencies can lead to large differences in PD scores among sets of equal size.
Research has been conducted to assess the usefulness of the PD measure to inform conservation strategy. To this end, sets of species which attain the maximum value of PD (for a given number of species) have been used as a benchmark against a measured response, to be contrasted with random selections of species (Tucker et al. 2019). However, the sets of a given size which maximise PD are not unique. In applications of PD, we see the algorithms which generate such sets being run multiple times to account for this. For example, Molina-Venegas et al. (2021)[p. 586] and Mazel et al. (2018)[p. 7] both performed ten runs on each phylogenetic tree under consideration because: "There are multiple subsets of size S that maximises PD in a phylogeny" and "For a given tree there are likely multiple, and possibly very many, sets of species with the same [maximum] PD", respectively. Furthermore, Mazel et al. (2017) [p. 1021] noted that: "this number will vary across simulations and could, in some case, be very large." Although the non-uniqueness of these sets is known and accounted for, their total number is not well understood. This leaves open questions about the most appropriate number of runs to perform in the above trials, and what the chances are that random selections of species also happen to form sets which maximise PD. In this paper, we investigate mathematical questions concerning the enumeration of maximum PD sets of given size, as well as identifying the sets of species of given size whose extinction would result in the largest loss of phylogenetic diversity.

Outline of the paper
We begin by stating in the next section the key mathematical definitions required in the paper. In Sect. 3, we present a new characterisation of those sets which maximise PD for each possible size (Theorem 1). This characterisation allows us to count the number of such sets on any rooted phylogenetic tree, which previous methods could not achieve concisely. Theorem 2 sets out how this process may be achieved efficiently. The conceptual approach from Theorem 1 is continued in Sect. 4, leading to Algorithm 1, which selects, in polynomial time, one of these maximising sets that is optimal against a second measure. In Sect. 5, we consider a dual problem: determining the greatest possible loss of PD if a certain number of species becomes extinct (this turns out to be equivalent to minimising PD for a given number of species). A dynamic programming approach is used to solve this problem in polynomial time for binary (or degreebounded) rooted phylogenetic trees.

Preliminaries
Phylogenetic trees. Let X be a non-empty set of taxa (e.g. species), with |X | = n. A rooted phylogenetic X -tree is a rooted tree T = (V , E), where X is the set of leaves, and all edges are directed away from a distinguished root vertex ρ, and every non-leaf, non-root vertex has out-degree at least 2. In addition, when |X | = 1, the tree consisting of a single vertex is a rooted phylogenetic X -tree. All edges drawn in this paper will be directed down the page. If all non-leaf vertices of T have out-degree 2, we say that T is binary.
Three types of restrictions on T will be useful. For A ⊆ X , the A-subtree of T is the minimal tree which connects the leaves of A to the root vertex ρ. In order for an A-subtree to be a phylogenetic tree, we suppress any non-root vertices with out-degree 1 which arise during its construction. For a set of vertices A subtree of T is pendant if it can be disconnected from ρ by deleting a single edge of T . As shorthand, for an arbitrary set A and element x, If a cluster has size two or three, we call it, respectively, a cherry or a triple. A cluster of size four which contains two distinct cherries is called a fork. In the rooted phylogenetic tree T 1 of Fig. 1 the set {x 1 , x 2 , x 3 } is a triple, the set {x 4 , x 5 , x 6 , x 7 } is a fork, and each of {x 1 , x 2 }, {x 4 , x 5 } and {x 6 , x 7 } is a cherry. Phylogenetic diversity. The edges of every rooted phylogenetic tree considered in this paper are positively weighted. Let T be a rooted phylogenetic X -tree, and let : E(T ) → R >0 be a function which assigns a positive real-valued length (e) to each edge e ∈ E(T ). Suppose that u, v ∈ V (T ) are two vertices of T connected by a directed path from u to v (this path is unique if it exists). Then the distance from u to v, denoted d (u, v), is the sum of the lengths of the edges in this path. If an edge e is subdivided into two edges e 1 and e 2 , we require (e 1 ) + (e 2 ) = (e). If is such that for any two distinct leaves x and y we have d(ρ, x) = d(ρ, y), we say that satisfies the ultrametric condition.
For a non-empty subset Y of X , we define the phylogenetic diversity of Y on T , denoted by P D (T , ) (Y ), to be the sum of the edge lengths of the Y -subtree. That is, It will be usual for us to remove the subscript notation and write P D(Y ) when it is clear which rooted phylogenetic tree and edge length function we refer to. We also write P D(T ) to denote the phylogenetic diversity of the entire X -tree T , in place of P D (T , ) (X ). Let T be a rooted phylogenetic X -tree whose edges are assigned a positive realvalued weighting, and let A ⊆ X . If |A| = k and P D(A) ≥ P D(Y ) for all Y ⊆ X with |Y | = k, then we call A a size-k maxPD set. Similarly, if |A| = k and P D(A) ≤ P D(Y ) for all Y ⊆ X with |Y | = k, then we call A a size-k minPD set. To illustrate, Fig. 2 shows an example of a size-3 maxPD and an example of a size-3 minPD sets for the same rooted phylogenetic tree.

The number of maxPD sets on rooted phylogenetic trees
Given a rooted phylogenetic X -tree T , with |X | = n and a weighting on E(T ), a natural question is to find a subset Y of X of size t whose extinction minimises PD loss. The solution to this question is to take Y to be X − W , where W is a subset of X of size n − t that maximises P D(W ). It turns out that a greedy algorithm provably constructs such sets W of k = n −t leaves (Pardi and Goldman 2005;Steel 2005). This result relies on an underlying combinatorial 'strong exchange property' that induces a greedoid structure on maximal PD sets of given size.
Although a greedy algorithm will output a maxPD set, it does not give a clear indication of how many distinct maxPD sets exist for T . Such an algorithm begins with an empty set of leaves and iteratively adds k leaves, based on which leaf adds most to the running total of PD at each iteration. There may be multiple steps at which a choice has to be made between equally-good options. By altering the procedure for breaking ties when they occur, it is possible to discover numerous size-k maxPD sets. This effect is most pronounced for rooted phylogenetic X -trees satisfying the ultrametric condition. For example, Fig. 1 and Fig. 2 show two of the 20 different size-3 maxPD sets for the rooted phylogenetic tree T 1 .
All possible maxPD sets can be obtained by using a greedy algorithm, by taking each option separately when presented with ties (Steel 2005, Theorem 1). However, this process can become quite involved even for small phylogenetic trees. Moreover, each maxPD set could be counted multiple times, as greedy algorithms sometimes select the same set of leaves in different orders.
In this section, we present a more straightforward method for determining exactly how many maxPD sets exist on a given rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition. Firstly, by deleting certain edges near the root vertex, we partition the leaf set into disjoint subsets. Then we use a generating function which takes the sizes of these subsets and outputs the number of maxPD sets as a coefficient.

Counting maxPD sets in an ultrametric context
We restrict our attention to the problem of counting maxPD sets on a rooted phylogenetic X -tree T whose edge lengths satisfy the ultrametric condition. Suppose T = (V , E), and let 1 ≤ k ≤ |X |. It turns out that the minimal subtrees of T connecting size-k maxPD sets to the root all contain particular subsets of edges. Furthermore, these common edges induce a subtree of T containing the root vertex. Our approach is to determine which edges of T will be in common to all size-k maxPD sets. From there, we can enumerate these maxPD sets by analysing the forest that results from deleting the common edges.
For example, all twenty size-3 maxPD sets of T 1 from Fig. 1 (with a score of 16) can be found by checking the 35 possible sets of 3 leaves. Comparing these, we see that all of the minimal subtrees of T 1 that connect a size-3 maxPD set to the root of T 1 contain both edges incident with the root, as well as exactly 3 out of the 4 edges descending from the two highest non-root vertices.
We extend the metaphor that the ultrametric condition produces clock-like trees and consider time to run down the page. Vertices at the same height are therefore contemporary and, in particular, the leaves are in the present. Let d be a non-negative real number and let Note that d k may not be defined for all k < n. However, if d k is defined, we call k a branching value, and d k a branching distance. In other words, d k is the most recent time for which T [R(d)] has exactly k connected components, if such a time exists. For example, the rooted phylogenetic tree T 2 in Fig. 3 has {1, 2, 4, 7, 9, 11} as its set of branching values. The forests T 2 [R(d 4 )] and T 2 [R(d 7 )] are shown below T 2 in the same figure.
If k is not a branching value, we will be interested in the nearest integers which are. We write k + to denote the smallest branching value of at least k, and k − to denote the x 7 x 8 x 9 x 10 x 11 x 7 x 8 x 9 x 10 x 11 2 4 2 3 3 4 2 2 2 2 x 10 x 11 2 2 2 2 1 1 1 1 We first prove Lemma 1. Let X = {x 1 , . . . , x n }. We define T k = (V , E ) to be the rooted tree derived from T (by adding vertices to subdivide edges as necessary) where, for each Lemma 1 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition, and let A ⊆ X with |A| = k. Then For all 1 ≤ i ≤ k, the path from ρ to v i lies withinT k . Therefore the total contribution of these paths to P D T k (A) cannot exceed P D(T k ). This means P D T k (A) must be less than or equal to P D(T k ) plus a contribution of (at most) d k − from each of the k elements of A. Thus P D T k (A) ≤ P D(T k ) + kd k − , and the lemma holds.

Lemma 2 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition. Let A ⊆ X with |A| = k, and let d be a branching distance of T . If one component of T [R(d)] contains no members of A, while a second component of T [R(d)] contains two or more distinct members of A, then A is not a size-k maxPD set.
Proof Assume some component of T [R(d)] contains two (distinct) leaves, say x 1 , x 2 , of A. The PD contribution of adding x 1 to A − x 1 cannot exceed d because all edges of T in the path from ρ to x 2 have already been counted towards P D(A − x 1 ). In particular,

Theorem 1 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition. Let A ⊆ X , and let |A| = k. Then A is a size-k maxPD set if and only if A contains at least one leaf from each component of T [R(d k − )], and at most one leaf from each component of T [R(d k + )].
Proof First suppose that A is a size-k maxPD set. Assume some component of T [R(d k + )] contains two (distinct) leaves, say x 1 ,x 2 , of A. Since k + ≥ k, there must be some component of T [R(d k + )] which has no leaf in A. Then, by Lemma 2, the set A cannot be a maxPD set, contradicting our initial supposition. Thus A contains at most one leaf from each component of Next, assume that some component of that contains two or more leaves of A. Then, by Lemma 2, the set A cannot be a maxPD set, again contradicting our initial supposition. Thus A contains at least one leaf from each component of , and at most one leaf from each component of T [R(d k + )]. By Lemma 1, the value P D(T k ) + kd k − is an upper bound for the PD score of size-k sets. We show that P D(A) achieves this bound.
Notice that the components of contains at most one leaf, the paths from v i to x i , and from v j to x j contain no common edges, for any distinct 1 ≤ i, j ≤ k. So in total, the collection of all paths v i to x i for all i contributes exactly kd k − to P D(A). Furthermore, since A contains at least one leaf from each component of the maximum possible value, and hence A is a size-k maxPD set.
When k is a branching value for a rooted phylogenetic X -tree T , then Hence, as a direct consequence of Theorem 1, we obtain the following result.

Corollary 1 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition. Let A ⊆ X with |A| = k. If k is a branching value of T , then A is a size-k maxPD set if and only if A contains exactly one leaf from each component of T [R(d k )].
Theorem 1 can be used to count the number of size-k maxPD sets for a rooted phylogenetic X -tree T whose edge lengths are ultrametric. Note that this result does not require T to be binary. Let m(T , k) denote the number of size-k maxPD sets on T . The next proposition derives m(T , k) when k is a branching value of T . The case when k is not a branching value of T is covered separately. We express the forest T [R(d k )] as a union of components κ i (k) for i ∈ {1, 2, . . . , k}, and write λ(κ i (k)) for the number of leaves in κ i (k).

Proposition 1 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition. If k is a branching value for T , then
Proof By Corollary 1, each maxPD set contains exactly one leaf from component   Fig. 4). Let us call this phylogeny P. We calculate the number of size-8 and size-16 maxPD sets for P.
If k is not a branching value for a rooted phylogenetic X -tree T , calculating the number of size-k maxPD sets is not as immediate. In this case, we use a generating function to determine m(T , k). The following lemma is presented for a more general context.

Lemma 3
Suppose that C = (X i j : i = 1, . . . , n j ; j = 1, . . . , r ) is an array of disjoint sets, and let n i j denote the size of set X i j . Let N C (k) be the number of sets of size k that can be obtained by selecting at most one element from each set X i j but in such a way that at least one element is selected from i X i j for each value of j. Then N C (k) is the coefficient of x k in the polynomial (1 + n i j x) . (1) Proof For each integer j ≥ 1, let p j (x) = −1 + n j i=1 (1 + n i j x), and for each l ≥ 0, let c l j denote the coefficient of x l in p j (x). Then c 0 j = 0 for each j, and for l > 0, the coefficient c l j is the number of ways of selecting l elements from n j i=1 X i j in such a way that at most one element is selected from the (pairwise-disjoint) sets (X i j : i = 1 . . . , n j ) and at least one element is selected (since l > 0). Now p C,k (x) = r j=1 p j (x) and so the coefficient of x k in p C,k (x) is the sum (call it S k ) of the terms c l 1 ,1 c l 2 ,2 · · · c l r ,r across all choices of (l 1 , l 2 , . . . , l r ) for which l 1 + l 2 + · · · + l r = k, and l m > 0 for all m (this second condition holds because c 0 j = 0 for all j). Since the sets X i j are pairwise-disjoint (across all choices of i, j), we have S k = N C (k) as required.
Let T be a rooted phylogenetic X -tree, and let k be a positive integer such that k ≤ |X |. We write p T ,k (x) instead of p C,k when C is constructed from componentconnected clusters of T , using the branching values k − and k + . If k is a branching value of T , we have n j = 1 for all j, and the result coincides with Proposition 1. In the general case, Lemma 3 gives a polynomial-time algorithm to compute m(T , k). Proof There are at most n different branching values for T (one for every non-leaf vertex, and the value n) from which to select the appropriate k − and k + values. Determining the components of a forest can be achieved in O(n 2 ) time. Once the components have been determined, the polynomial p T ,k (x) is calculated, and the coefficient of x k is extracted. With a naïve approach of sequentially multiplying factors, this can be completed in time O(n 3 ).

Theorem 2 Let T be a rooted phylogenetic X -tree
The following example highlights a nice property of the generating function p T ,k (x) for a rooted phylogenetic tree T . Calculating the number of size-k maxPD sets for some non-branching value k also gives the number of size-m maxPD sets for every positive integer m in the interval [k − , k + ]. Example. Consider the rooted phylogenetic tree T 2 in Fig. 3. Firstly, 4 is a branching value for T 2 . Thus the number of its size-4 maxPD sets is the product of the number of leaves in each of the four components of T 2 [R(d 4 )]. That is, m(T 2 , 4) = 3·1·4·3 = 36.
However, 5 is not a branching value for T 2 , so we use the generating function approach to find m(T 2 , 5). First note that the greatest branching value less than 5 is 4 and the least branching value greater than 5 is 7. The forests Lastly, we arrange the entries of C(T 2 ) so that each column contains precisely those leaves that share a component of T 2 [R(d 4 )]. Thus we have 4 columns in C(T 2 ), and set r = 4 in Eqn. 1. As T 2 is binary, there are at most two components of T 2 [R(d 7 )] contained in any component of T 2 [R(d 4 )]. We use an empty set as a placeholder, if required, to ensure that C(T 2 ) is a rectangular array. As such, we are able to set n j = 2 for all j in Eqn. 1. The completed array is The generating function for T 2 , when k = 5, is calculated below.
(1 + n i j x) Hence T 2 has 84 maxPD sets of size 5. We have also determined that T 2 has 64 size-6 maxPD sets, 16 of size 7, and confirmed that there are 36 maxPD sets of size 4.

Bounding m(T, k), and its value for a certain family of trees
The shape of a rooted phylogenetic tree impacts the components at each branching distance, and hence the number of maxPD sets which exist. In this section we restrict ourselves to rooted binary phylogenetic trees with the ultrametric constraint on edge lengths. By 'shape', we refer to both the particular branching structure of a tree and the relative distances of the vertices from its leaves (see Steel 2016, Ch. 3). Here, we begin to address the question of which tree shapes and values of k give the most size-k maxPD sets across a fixed number of leaves. First we consider the lower and upper bounds for the number of size-k maxPD sets when k is a branching value.

Moreover, these bounds are sharp.
Proof If k is a branching value for T , then, by Proposition 1, the value m(T , k) is the product of the number of leaves in the k components. Let S be the multiset {λ(κ i (k)) : 1 ≤ i ≤ k}, that is, the multiset containing the number of leaves of each component.
To find the lower bound for m(T , k), note that if a and b are positive integers with 1 < a ≤ b, then (a − 1)(b + 1) < ab. Hence the product of elements of the multiset (S − {a, b}) ∪ {a − 1, b + 1} will be less than the product of elements of S. This exchange of elements can continue until only one element is greater than 1. Thus the minimum product of k positive integers which sum to n is n − k + 1, achieved with one value of n − k + 1 and k − 1 values of 1. This bound is achieved by rooted caterpillar trees (i.e. rooted phylogenetic X -trees with exactly one cherry).
On the other hand, the maximum such product is bounded above by ( n k ) k . This follows from the fact that the arithmetic mean, AM(S), of a multiset of positive integers S is greater than or equal to the geometric mean, G M(S), of the same multiset. Thus This maximum is obtained when k is a divisor of n and all components contain n k leaves.
Let T be a rooted binary phylogenetic X -tree. If k is not a branching value, it is possible that m(T , k) exceeds the upper bound given in Proposition 2. For example, the tree T 2 from Fig. 3 has n = 11, and m(T 2 , 5) = 84 ≥ 11 5 5 ≈ 51.5.
We have seen above that if T is a rooted caterpillar tree, then m(T , k) is as small as possible. The highly asymmetric structure of caterpillar trees restricts the possible maxPD sets they contain. In contrast, we now consider m(T , k) values across the family of fully symmetric rooted trees (with constant edge lengths) and include cases when k is not a branching value.
We say T is a perfect unit-length tree if the edge lengths of T satisfy the ultrametric condition, and all edges of T have length 1. Perfect unit-length trees have 2 α leaves, where α ∈ N is the height of the tree (the number of edges between the root and any leaf).

Proposition 3 Let T be a perfect unit-length tree of height α ∈ N, and let n denote the number of leaves of T . Let k be a positive integer such that k ≤ n, and let β be the unique non-negative integer such that
The values of k that maximise m(T , k) are k = 2n 3 for all n and, additionally, k = 2n 3 + 1 when n ≡ 1 (mod 3).
Proof Firstly, if k is a branching value, then k = 2 β and each component has size 2 α−β . Thus, by Proposition 1, m(T , k) = (2 α−β ) k , which coincides with Eqn. 2. Furthermore, if k is not a branching value, we have k − = 2 β−1 and k + = 2 β . Then by Lemma 3, m(T , k) is the coefficient of x k in the polynomial Taking the binomial expansion of the last expression we determine that 2 β−1 k−2 β−1 · 2 2 β +(α−β−1)k is the coefficient of x k in p T ,k (x). This establishes Eqn. 2.
To find the value of k which maximises m(T , k), we first show that m(T , k) ≤ m(T , n −k) when k ≤ n 2 . Let A be a size-k maxPD set for some k ≤ n 2 . By Theorem 1, A contains at most one leaf from each cherry of T . Then X − A contains at least one leaf from every cherry (which are the components of T [R(d (n−k) − )]), and at most one leaf from each component of T [R(d (n−k) + )], as these are all single-leaf components. This implies X − A is a size-(n − k) maxPD set by Theorem 1, and thus there are at least as many size-(n − k) maxPD sets as size-k ones. Hence the value of k that maximises m(T , k) will be greater than or equal to n 2 . In the case when k ≥ n 2 , since 2 α = 2 β = n, the expression in Eqn. 2 simplifies to This ratio is monotonically decreasing as k increases, and equals 1 when 3k = 2n − 2. Our maximal value of m(T , k) will be found at the smallest k ≥ n 2 for which m (T ,k+1) m(T ,k) ≤ 1, namely when k = 2n 3 . Note that when n ≡ 1 (mod 3) and k = 2n 3 , we have m(T ,k+1) m(T ,k) = 1, so we get an equal number of maxPD sets for the two consecutive values k and k + 1. Table 1 shows the growth of m(T , k) as n increases. We note that for n = 16, the perfect unit-length tree does not provide the largest value of m(T , k). Figure 5 shows a rooted binary phylogenetic X -tree T 3 on 16 leaves which contains 1809 size-8 maxPD sets (thus having 17 more maxPD sets than the perfect unit-length tree on 16 leaves can achieve for its optimal value of k = 10 or k = 11). For T 3 we have x 10 x 11 x 12 x 13 x 14 x 15 x 16 Fig. 5 A rooted binary phylogenetic tree with more maxPD sets (for its optimal value of k = 8) than the perfect unit-length tree with the same number of leaves (for its optimal value of k = 10, 11) literature that, in general, maxPD sets are far from unique. This provides scope for evaluating the collection of maxPD sets against other strategic considerations. In developing strategies for conservation planning, PD is often seen as one measure to be used in conjunction with others (for examples of this, see Cadotte and Tucker 2018;Isaac et al. 2007;Kling et al. 2019). For instance, we may wish to incorporate benefit-cost ratios of focussed conservation spending, or employ IUCN categorisations into the analysis. This section provides an algorithm to optimise a further measure across the collection of maxPD sets.
Here, we frame the further measure in terms of a real-valued linear function on the leaves. Each leaf is assigned a function value, and the linear function score of a set of leaves is the weighted sum of the function values of the constituent leaves. We seek a size-k set which has as large a linear function score as possible among the size-k maxPD sets. By suitably modifying the linear function, the problem can be rephrased as maximising the unweighted sum across maxPD sets. Thus, for a function φ : X → R we want to determine max x∈A φ(x) : A is a size-k maxPD set .
We note that it is not always possible to achieve this result by simply adding the function score of each leaf to the length of its incident pendant edge, and then finding a size-k maxPD set of the resulting rooted phylogenetic tree. We provide a counterexample using the tree T 1 from Fig. 2. Consider the function Adding the function values to appropriate pendant edges, results in a tree with a unique size-3 maxPD set {x 5 , x 6 , x 7 }. However this set is not a maxPD set of the original tree T 1 .
For a rooted phylogenetic tree T , MaximiseLinearSum selects a set A consisting of k leaves of T in the following manner. Initially, it determines the components of T [R(d k − )] ('tall' components) and those of T [R(d k + )] ('short' components). For every short component, it keeps (in the set of 'potential' leaves P) one leaf x such that φ(x) is maximal for that short component. It then discards all other leaves from further consideration. In every tall component, it adds one leaf x to A from the leaves retained in P such that φ(x) is maximal for that tall component. Finally, from the remaining k + − k − leaves under consideration, it chooses k − k − with the largest φ values. In presenting the algorithm, we make use of the following notation. For a pendant subtree C of T , write X C for the set of leaves in C. For S ⊆ X , let φ(S) = {φ(x) : x ∈ S}.

Algorithm 1: MaximiseLinearSum
Input: a rooted phylogenetic X -tree T whose edge lengths satisfy the ultrametric condition, a positive integer k ≤ |X |, φ : X → R Output: a size-k maxPD subset A ⊆ X , with the largest linear function score among all maxPD sets choose one leaf m from the set {x ∈ P : φ(x) = max φ(X C ∩ P)};

10
A ← A ∪ m; 11 P ← P − m 12 end 13 for each of the k − k − largest elements φ(x) of φ(P) add x to A; 14 return A Proposition 4 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition. The MaximiseLinearSum algorithm outputs a maxPD set of T .
Proof The for-loop from Lines 4 to 7 ensures that A cannot contain more than one leaf from any short component. The for-loop from Lines 10 to 13 ensures that A contains at least one leaf from every large component. Since Line 15 ensures that |A| = k, it follows by Theorem 1, that A is a maxPD set of T .
Proposition 5 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition, and let φ : X → R be a function on the leaves of T . Let k be a positive integer such that k ≤ |X |. Then MaximiseLinearSum applied to T , φ, and k correctly outputs a size-k maxPD set with the largest function score among all maxPD sets.
We give a proof of this result shortly, but first give a short description of our approach. The algorithm MaximiseLinearSum was designed to construct a set containing the largest possible values of φ while obeying the constraints imposed by Theorem 1 to ensure the selection of a maxPD set. Suppose that A is a size-k maxPD set of T . The proof considers two possible cases when A is not a valid output of the algorithm, and exhibits a size-k maxPD set with a greater linear function score in each. Finally, outside of these two cases we prove that the linear function score of A must be at least as large as that of any other size-k maxPD set.
Proof Suppose that A is a size-k maxPD set of T . For the result to hold, either A is a valid output of MaximiseLinearSum or there is a size-k maxPD set B, distinct from A, such that x∈A φ(x) < x∈B φ(x). One of the following three conditions holds: In particular, the strictness of this inequality means that b j / ∈ A. In this case, the leaf b j would always be selected by Line 9 of MaximiseLinearSum, precluding A from being a valid output of this algorithm. Moreover, since Condition 1 fails to hold, no element of {a 1 , a 2 Let P be the set of 'potential leaves' as used in Algorithm 1. Then by our construction ofB, we have B −B ⊆ P −Ā. The set A is a valid output of MaximiseLinearSum if and only if the elements of A −Ā have the k − k − largest φ values among elements of P −Ā. The latter condition is equivalent to Hence, under all three conditions, either A is a valid output of MaximiseLinear-Sum or x∈A φ(x) < x∈B φ(x) for some size-k maxPD set B of T , as required.
Proposition 6 Let T be a rooted phylogenetic X -tree whose edge lengths satisfy the ultrametric condition, and let |X | = n. Then MaximiseLinearSum runs in time O(n 2 ).
Proof By Theorem 2, Line 1 can be completed in O(n 2 ). We show that this subroutine dominates the time taken for MaximiseLinearSum to run.
Determining The algorithm MaximiseLinearSum makes use of the component constraints on maxPD sets to solve this problem for rooted phylogenetic X -trees whose edge lengths satisfy the ultrametric condition. For phylogenetic trees whose edge lengths do not satisfy the ultrametric condition, the determination of appropriate connected components requires a further algorithm (Manson KD, in preparation).
We note that an alternative approach to solving this more general problem comes from the area of lexicographic multi-objective linear programming (Cococcioni et al. 2018, see Section 2). The optimisation can be phrased as a max-flow min-cost problem, in a similar manner to that used in (Bordewich et al. 2009). However this approach relies on first scaling every length of the phylogenetic tree by a suitably large number. Determining an appropriate value for the scaling factor can prove difficult unless the edge lengths are restricted to take only rational values. For some trees with real-valued edge lengths this step requires a pairwise comparison of the PD scores across all sets of k leaves (Manson KD, in preparation).

Maximum possible loss of PD in a tree if k species become extinct ('minPD')
In Sect. 3, we were interested in finding sets of k species which contained as much diversity as possible. However, it is also worth considering the dual problem: determining how must PD could be lost if k extant species were to become extinct (i.e. a 'worst case scenario' in biodiversity conservation in the face of widespread extinction pressure). More precisely, we consider the problem of determining the maximum possible PD loss if a given number species were to become extinct. Formally, let T = (V , E) be a rooted phylogenetic X -tree and let the function : E(T ) → R >0 assign a positive real-valued length (e) to each edge e ∈ E(T ).
Suppose that each species in a subset Y of the leaf set X of T is removed from the tree. The resulting loss of PD, which we denote here as Δ (T , ) (Y ) is given by This is equivalent to the concept of 'exclusive molecular phylodiversity' as described in Lewis and Lewis (2005). 1 Notice that finding a subset Y of X of size k to maximise Δ (T , ) (Y ) is equivalent to finding a subset W (= X − Y ) of X of size k = |X | − k to minimise P D (T , ) (W ). Unlike the max-PD question, this minimisation question is not solved by the greedy algorithm (Moulton et al. 2007). However, as discussed in Sect. 6 of (Spillner et al. 2008), minimal PD scores can be found using dynamic programming. In particular, (Blum et al. 1994, Sect. 3.1) describe an algorithm for an equivalent problem (referred to as the i-tree problem). Here we present a detailed description of this algorithm using the terminology of phylogenetic trees.
We call a set of k leaves which has the smallest PD score across all sets of size k, a size-k minPD set. In this section, we present a polynomial-time dynamic programming approach to finding minPD scores. For simplicity, we initially restrict our attention to rooted binary phylogenetic trees; however, we show that the same idea extends to rooted phylogenetic trees for which each vertex has bounded out-degree. Note that in this section, we do not require the branch lengths to satisfy the ultrametric condition.
Given a rooted phylogenetic X -tree T , and an integer 0 ≤ k ≤ |X |, let ϕ T (k) be the minimum PD score across all size-k subsets of X . When k > |X |, ϕ T (k) is undefined, and when k = 0, we set ϕ T (0) = 0. For the case when T is a single vertex, we define ϕ T (k) = 0. Proposition 7 gives the dynamic programming equation when T is binary.
Proposition 7 Let T be a rooted binary phylogenetic X -tree and let e 1 and e 2 be the two edges of T incident with the root. Let e 1 have length 1 and e 2 have length 2 . Finally, let T 1 and T 2 denote the (maximal) pendant subtrees formed by the deletion of e 1 and e 2 respectively. For all k ∈ {1, 2, . . . , |X |}, where I k j >0 takes the value 1 if k j > 0; otherwise, I k j >0 = 0.
Proof We proceed by induction on the number of vertices in T . For the base case, take the tree T consisting of a single vertex. Since T has no edges, it has a PD score of 0, which corresponds to ϕ T (k) for all k ≥ 0 by definition. Suppose that Eqn. 4 fails to give the minimum PD score for some rooted binary phylogenetic X -tree T . We write ϕ i as shorthand for ϕ T i for i = 1, 2. Furthermore, suppose that ϕ i (k ) equals the size-k minPD score in T i for all k ≤ k and i ∈ {1, 2}.
Let A be such a set of k leaves of T , with k 1 leaves in T 1 and k 2 leaves in T 2 .
If k 2 = 0, then P D T (A) = 1 + P D T 1 (A) ≥ 1 + ϕ i (k 1 ) by the inductive assumption. Thus the PD score of A is not lower than the calculated minimum; hence, k 2 = 0. Similarly, k 1 = 0. Consequently, For A to have a PD score lower than ϕ T (k), we must have This implies that either P D T 1 (A ∩ T 1 ) < ϕ 1 (k 1 ) or P D T 2 (A ∩ T 2 ) < ϕ 2 (k 2 ), contradicting our inductive assumption. Therefore, no such set A exists, and ϕ T (k) calculates a minPD score of size k in T .
We now present an algorithm which utilises Proposition 7 to calculate a minPD score for a rooted binary phylogenetic X -tree T = (V , E). For a vertex v ∈ V (T ), we use the notation ϕ v (k) in place of ϕ T v (k), where T v is the pendant subtree of T for which vertex v has in-degree 0. Additionally, ϕ ρ (k) = ϕ T (k). Note that the root vertex ρ of T will always appear last in the ordered list L defined in the algorithm. The algorithm MinPDScore computes the minimum PD score for a rooted binary phylogenetic tree T when selecting k of its leaves. This dynamic programming approach calculates the minPD score for pendant subtrees of T , which are then combined to calculate the minPD score for T as a whole. Additionally, by tracking the indicator function values as we go, a corresponding size-k minPD set can be determined.
Proposition 8 Let T = (V , E) be a rooted binary phylogenetic X -tree, and let 0 ≤ k ≤ n, where n = |X |. The algorithm MinPDScore applied to T and k calculates the minimum PD score for a size-k set of leaves of T in time O(n 4 ).

MinPD scores for non-binary rooted phylogenetic trees
The algorithm MinPDScore can be adapted for a non-binary rooted phylogenetic tree with bounded out-degree. Specifically, Line 10 of the algorithm is adjusted, and an upper bound on the out-degree of every vertex is required to ensure that the modified algorithm runs in polynomial time.
Let {e 1 , e 2 , . . . , e t } denote the set of edges incident with the root of T , and let T i denote the subtree of T descending from e i . Set i = (e i ) for i ∈ {1, 2, . . . , t}, and let K (k, t) = k = (k 1 , ..., k t ) : k i ≥ 0 for all i and t i=1 k i = k .
Then, in place of Eqn. 4, we use Eqn. 5 which applies the same notation as Proposition 7.

Concluding remarks
Phylogenetic diversity provides a formal way to quantify recent (and possible future) biodiversity loss, resulting from the current high rate of species extinction. For example, PD has become an integral part of the Zoological Society of London's 'EDGE of Existence' programme for monitoring biodiversity risk (Isaac et al. 2007). PD is more nuanced than simply counting species extinctions, since PD explicitly incorporates the evolutionary relationships among species, and thus provides a proxy for measuring the richness of features that make species unique (Faith 1992;Wicke et al. 2021).
In this paper, we have investigated new combinatorial questions concerning PD that arise in its application to large data-sets. In particular, we have described a precise way to count the number of maxPD sets of given size on a given tree (in the usual ultrametric setting) and derived some bounds on the growth rate for these numbers. We have also described further mathematical results that establish polynomial-time algorithms to (i) optimise a linear function (across the species at the tips of the tree) over all maxPD sets and (ii) determine the greatest possible loss of PD on a tree if k species were to become extinct (this last question amounts to determining minPD sets of given size).
Our results suggest a number of questions. In future work, we hope to characterise the tree shapes that have the largest number of maxPD sets (of any given size). A further question is to count the number of minPD sets in the binary ultrametric setting. For caterpillars on n leaves (and ultrametric edge lengths), the number of size-k minPD sets is 1 unless k = 1, in which case there are n min PD sets. To see this, observe that a size-k minPD set in a caterpillar is the one that contains the k leaves with the shortest pendant edges (removing any of these leaves to replace it with one of the n −k unchosen leaves would necessarily add more to the PD score than what was lost by not counting the removed pendant edge). A related question is to categorise the trees which have a unique minPD set.