# Fast Phylogenetic Biodiversity Computations Under a Non-uniform Random Distribution

DOI: 10.1007/978-3-319-31957-5_16

- Cite this paper as:
- Tsirogiannis C., Sandel B. (2016) Fast Phylogenetic Biodiversity Computations Under a Non-uniform Random Distribution. In: Singh M. (eds) Research in Computational Molecular Biology. RECOMB 2016. Lecture Notes in Computer Science, vol 9649. Springer, Cham

## Abstract

Computing the phylogenetic diversity of a set of species is an important part of many ecological case studies. More specifically, let \(\mathcal {T}\) be a phylogenetic tree, and let *R* be a subset of its leaves representing the species under study. Specialists in ecology want to evaluate a function \(f(\mathcal {T},R)\) (a *phylogenetic measure*) that quantifies the evolutionary distance between the elements in *R*. In most applications, however, it is also important to examine how \(f(\mathcal {T},R)\) behaves when *R* is selected at random. The standard way to do this is to compute the mean and the variance of *f* over all subsets of leaves in \(\mathcal {T}\) that consist of exactly \(|R| = r\) elements. For certain measures, there exist algorithms that compute these statistics under the condition that all subsets of *r* leaves are equiprobable. Yet so far no algorithm can do this exactly when the leaves in \(\mathcal {T}\) are weighted with unequal probabilities. As a consequence, in this general setting, specialists compute the statistics of phylogenetic measures using methods that are both inexact and very slow.
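To make the uniform setting described above concrete, the following is a minimal brute-force sketch, not the algorithms of the paper. The toy tree, its edge weights, and all function names are hypothetical. It also assumes the common definitions of the two measures: \(\mathrm {PD} \) as the total edge weight of the minimal subtree connecting the leaves in *R* (the unrooted variant), and \(\mathrm {MPD} \) as the average path cost over all pairs of distinct leaves in *R*.

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical toy tree: child -> (parent, weight of the edge to the parent).
PARENT = {
    "x": ("root", 1.0),
    "a": ("x", 2.0),
    "b": ("x", 3.0),
    "c": ("root", 4.0),
}
LEAVES = ["a", "b", "c"]


def path_to_root(node):
    """Edges (named by their child endpoint) on the path from node to the root."""
    edges = []
    while node in PARENT:
        edges.append(node)
        node = PARENT[node][0]
    return edges


def distance(u, v):
    """Path cost between u and v: edges on exactly one of the two root paths."""
    diff = set(path_to_root(u)) ^ set(path_to_root(v))
    return sum(PARENT[e][1] for e in diff)


def pd(subset):
    """Assumed unrooted PD: cost of the minimal subtree connecting the subset."""
    paths = [set(path_to_root(leaf)) for leaf in subset]
    union = set().union(*paths)
    common = set.intersection(*paths)  # edges above the common ancestor
    return sum(PARENT[e][1] for e in union - common)


def mpd(subset):
    """Assumed MPD: mean path cost over all pairs of distinct leaves."""
    pairs = list(combinations(subset, 2))
    return sum(distance(u, v) for u, v in pairs) / len(pairs)


def uniform_stats(measure, r):
    """Mean and variance of a measure over all equiprobable r-subsets of leaves."""
    values = [measure(s) for s in combinations(LEAVES, r)]
    return mean(values), pvariance(values)


if __name__ == "__main__":
    print("PD :", uniform_stats(pd, 2))
    print("MPD:", uniform_stats(mpd, 2))
```

Enumerating all subsets takes time exponential in the number of leaves, so this sketch is useful only as a reference for checking faster methods on very small trees.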

We present the first exact and efficient algorithms for computing the mean and the variance of phylogenetic measures when leaf subsets of fixed size are selected from \(\mathcal {T}\) under a non-uniform random distribution. In particular, let \(\mathcal {T}\) be a tree that has *n* nodes and depth *d*, and let *r* be a non-negative integer. We show how to compute in \(O((d+\log n) n \log n)\) time and *O*(*n*) space the mean and the variance for any measure that belongs to a well-defined class. We show that two of the most popular phylogenetic measures belong to this class: the Phylogenetic Diversity (\(\mathrm {PD} \)) and the Mean Pairwise Distance (\(\mathrm {MPD} \)). The random distribution that we consider is the Poisson binomial distribution restricted to subsets of fixed size *r*. We also provide a stronger result: for the \(\mathrm {PD} \) and the \(\mathrm {MPD} \) specifically, we describe algorithms that compute, in a batched manner, the mean and variance on \(\mathcal {T}\) for *all* possible leaf-subset sizes in \(O((d+\log n) n \log n)\) time and *O*(*n*) space.
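The restricted Poisson binomial distribution can be illustrated with a small brute-force sketch. It assumes the usual reading of that distribution: each leaf is included independently with its own probability, and the result is conditioned on exactly *r* leaves being chosen. The distance table, the probabilities, and the function names are hypothetical, and the enumeration is exponential, so this is a reference computation rather than the efficient algorithm described above.

```python
from itertools import combinations
from math import prod

# Hypothetical pairwise path costs between three leaves.
DIST = {frozenset("ab"): 5.0, frozenset("ac"): 7.0, frozenset("bc"): 8.0}


def mpd(subset):
    """Mean pairwise distance of a leaf subset, looked up in DIST."""
    pairs = list(combinations(subset, 2))
    return sum(DIST[frozenset(p)] for p in pairs) / len(pairs)


def subset_distribution(probs, r):
    """Probability of each r-subset under the conditioned inclusion model.

    probs maps each leaf to its independent inclusion probability.
    A subset's weight is prod(p_i) over chosen leaves times prod(1 - p_j)
    over the rest, normalised over all subsets of size exactly r.
    """
    weights = {}
    for subset in combinations(sorted(probs), r):
        chosen = set(subset)
        weights[subset] = prod(
            p if leaf in chosen else 1.0 - p for leaf, p in probs.items()
        )
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def mean_and_variance(measure, probs, r):
    """Exact mean and variance of a measure by enumerating all r-subsets."""
    dist = subset_distribution(probs, r)
    mu = sum(pr * measure(s) for s, pr in dist.items())
    var = sum(pr * (measure(s) - mu) ** 2 for s, pr in dist.items())
    return mu, var


if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.5, "c": 0.8}  # illustrative values only
    print(subset_distribution(probs, 2))
    print(mean_and_variance(mpd, probs, 2))
```

When all inclusion probabilities are equal, every subset of size *r* receives the same weight, so the computation reduces to the uniform case.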

For the \(\mathrm {PD} \) and \(\mathrm {MPD} \), we implemented the algorithms that perform batched computations of the mean and variance. We also developed alternative implementations that compute the same output in \(O((d+\log n) n^2)\) time. For both types of implementations, we conducted experiments and measured their performance in practice. Despite the difference in theoretical performance, we show that the algorithms that run in \(O((d+\log n) n^2)\) time are more efficient in practice and numerically more stable. We also compared the performance of these algorithms with standard inexact methods that can be used in case studies. We show that our algorithms are far faster, making it possible to process much larger datasets than before. Our implementations will become publicly available through the R package PhyloMeasures.
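For comparison, an inexact estimate can be sketched as follows. The abstract does not specify which inexact methods were benchmarked; this sketch assumes a sampling-based estimator, here plain rejection sampling that draws leaves independently and keeps only draws of exactly *r* leaves. The distance table, the probabilities, and the function names are hypothetical.

```python
import random
from itertools import combinations

# Hypothetical pairwise path costs between three leaves.
DIST = {frozenset("ab"): 5.0, frozenset("ac"): 7.0, frozenset("bc"): 8.0}


def mpd(subset):
    """Mean pairwise distance of a leaf subset, looked up in DIST."""
    pairs = list(combinations(subset, 2))
    return sum(DIST[frozenset(p)] for p in pairs) / len(pairs)


def sample_subset(probs, r, rng):
    """Rejection sampling: redraw until exactly r leaves are included."""
    while True:
        subset = [leaf for leaf, p in probs.items() if rng.random() < p]
        if len(subset) == r:
            return subset


def monte_carlo_mean(measure, probs, r, n_samples, seed=0):
    """Estimate the mean of a measure from n_samples accepted random subsets."""
    rng = random.Random(seed)
    total = sum(measure(sample_subset(probs, r, rng)) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.5, "c": 0.8}  # illustrative values only
    print(monte_carlo_mean(mpd, probs, 2, n_samples=20000))
```

The estimate carries sampling error that shrinks only with the square root of the number of samples, and many draws are rejected when subsets of size *r* are unlikely, which is consistent with such methods being both inexact and slow.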