# Fast Phylogenetic Biodiversity Computations Under a Non-uniform Random Distribution

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)

## Abstract

Computing the phylogenetic diversity of a set of species is an important part of many ecological case studies. More specifically, let $$\mathcal {T}$$ be a phylogenetic tree, and let R be a subset of its leaves representing the species under study. Ecologists want to evaluate a function $$f(\mathcal {T},R)$$ (a phylogenetic measure) that quantifies the evolutionary distance between the elements of R. In most applications, however, it is also important to examine how $$f(\mathcal {T},R)$$ behaves when R is selected at random. The standard way to do this is to compute the mean and the variance of f over all subsets of leaves in $$\mathcal {T}$$ that consist of exactly $$|R| = r$$ elements. For certain measures, there exist algorithms that compute these statistics exactly, provided that all subsets of r leaves are equiprobable. So far, however, no algorithm can do this exactly when the leaves in $$\mathcal {T}$$ are weighted with unequal probabilities. Consequently, in this general setting, specialists resort to methods that are both inexact and very slow.
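To make the notion of a phylogenetic measure concrete, the following minimal Python sketch evaluates two measures named later in this abstract, PD and MPD, on a toy tree. The tree, its edge lengths, and all function names are hypothetical. The sketch assumes the unrooted variant of PD (the total edge length of the minimal subtree connecting R) and takes MPD to be the average path length over all unordered pairs of distinct leaves in R; the definitions used in the paper may differ in detail.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy tree: child -> (parent, length of the edge to the parent).
PARENT = {
    "x": ("root", 1.0), "y": ("root", 2.0),
    "a": ("x", 2.0), "b": ("x", 3.0),
    "c": ("y", 1.0), "d": ("y", 4.0),
}

def edges_to_root(parent, node):
    """Return the edges (named by their lower endpoint) from node up to the root."""
    edges = []
    while node in parent:
        edges.append(node)
        node = parent[node][0]
    return edges

def pd(parent, subset):
    """Unrooted PD: total length of the minimal subtree connecting the subset."""
    below = Counter()
    for leaf in subset:
        below.update(edges_to_root(parent, leaf))
    # An edge belongs to the connecting subtree iff it separates the subset,
    # i.e. some but not all of the selected leaves lie below it.
    return sum(parent[e][1] for e, k in below.items() if k < len(subset))

def path_length(parent, u, v):
    """Length of the tree path between nodes u and v."""
    eu, ev = set(edges_to_root(parent, u)), set(edges_to_root(parent, v))
    return sum(parent[e][1] for e in eu ^ ev)

def mpd(parent, subset):
    """Mean path length over all unordered pairs of distinct leaves (needs >= 2 leaves)."""
    pairs = list(combinations(subset, 2))
    return sum(path_length(parent, u, v) for u, v in pairs) / len(pairs)

R = ["a", "b", "c"]
print(pd(PARENT, R), mpd(PARENT, R))  # prints: 9.0 6.0
```

Both functions take time proportional to the depth of the tree per selected leaf (or per pair), which is adequate for a single subset but says nothing about the distribution of the measure over random subsets.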

We present the first exact and efficient algorithms for computing the mean and the variance of phylogenetic measures when leaf subsets of fixed size are selected from $$\mathcal {T}$$ under a non-uniform random distribution. In particular, let $$\mathcal {T}$$ be a tree that has n nodes and depth d, and let r be a non-negative integer. We show how to compute in $$O((d+\log n) n \log n)$$ time and O(n) space the mean and the variance for any measure that belongs to a well-defined class. We show that two of the most popular phylogenetic measures belong to this class: the Phylogenetic Diversity ($$\mathrm {PD}$$) and the Mean Pairwise Distance ($$\mathrm {MPD}$$). The random distribution that we consider is the Poisson binomial distribution restricted to subsets of fixed size r. We also provide a stronger result: for the $$\mathrm {PD}$$ and the $$\mathrm {MPD}$$, we describe algorithms that compute, in a batched manner, the mean and variance on $$\mathcal {T}$$ for all possible leaf-subset sizes in $$O((d+\log n) n \log n)$$ time and O(n) space.
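As a reference point for the quantities these algorithms compute, the sketch below obtains the mean and variance of PD by brute-force enumeration of all r-subsets. It assumes that the size-restricted Poisson binomial distribution gives a subset R of size r a probability proportional to $$\prod _{i \in R} p_i \prod _{i \notin R} (1-p_i)$$, where $$p_i$$ is the selection probability of leaf i. The toy tree, the leaf probabilities, and the function names are hypothetical, and PD is again taken in its unrooted form. Enumeration needs exponential time and is not the method of the paper; it is only useful for checking results on tiny inputs.

```python
from collections import Counter
from itertools import combinations
from math import prod

# Hypothetical toy tree: child -> (parent, length of the edge to the parent).
PARENT = {
    "x": ("root", 1.0), "y": ("root", 2.0),
    "a": ("x", 2.0), "b": ("x", 3.0),
    "c": ("y", 1.0), "d": ("y", 4.0),
}
# Hypothetical, unequal selection probabilities of the leaves.
LEAF_PROB = {"a": 0.5, "b": 0.2, "c": 0.7, "d": 0.4}

def pd(parent, subset):
    """Unrooted PD (assumed variant): length of the subtree connecting the subset."""
    below = Counter()
    for leaf in subset:
        node = leaf
        while node in parent:
            below[node] += 1
            node = parent[node][0]
    return sum(parent[e][1] for e, k in below.items() if k < len(subset))

def subset_weight(probs, subset):
    """Unnormalised probability of drawing exactly this subset."""
    chosen = set(subset)
    return prod(p if leaf in chosen else 1.0 - p for leaf, p in probs.items())

def moments(measure, parent, probs, r):
    """Exact mean and variance of a measure over all r-subsets, by enumeration."""
    subsets = list(combinations(sorted(probs), r))
    weights = [subset_weight(probs, s) for s in subsets]
    total = sum(weights)  # probability that an unrestricted draw has size r
    values = [measure(parent, s) for s in subsets]
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, var

for r in range(1, len(LEAF_PROB) + 1):
    print(r, moments(pd, PARENT, LEAF_PROB, r))
```

When all leaf probabilities are equal, the weights cancel and the result reduces to the plain average over all r-subsets, which is the equiprobable setting handled by earlier algorithms.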

We implemented the batched algorithms for the $$\mathrm {PD}$$ and the $$\mathrm {MPD}$$. We also developed alternative implementations that compute the same output in $$O((d+\log n) n^2)$$ time. We ran experiments to measure the practical performance of both types of implementation. Despite their weaker theoretical bound, the algorithms that run in $$O((d+\log n) n^2)$$ time are more efficient in practice and numerically more stable. We also compared these algorithms with standard inexact methods that can be used in case studies. Our algorithms are substantially faster, making it possible to process much larger datasets than before. Our implementations will become publicly available through the R package PhyloMeasures.
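The abstract does not say which inexact methods were used in the comparison. One natural candidate is Monte Carlo sampling, sketched below purely as an illustration: leaf subsets of size r are drawn by rejection (each leaf is kept independently with its own probability, and the draw is repeated until exactly r leaves are kept), and the measure is averaged over the samples. The pairwise distances, the leaf probabilities, and the function names are hypothetical, and MPD is assumed to be the average distance over all unordered pairs of selected leaves.

```python
import random
from itertools import combinations

# Hypothetical pairwise path lengths between the four leaves of a toy tree.
DIST = {
    ("a", "b"): 5.0, ("a", "c"): 6.0, ("a", "d"): 9.0,
    ("b", "c"): 7.0, ("b", "d"): 10.0, ("c", "d"): 5.0,
}

def mpd(subset):
    """Mean distance over all unordered pairs of the subset (needs >= 2 leaves)."""
    pairs = list(combinations(sorted(subset), 2))
    return sum(DIST[p] for p in pairs) / len(pairs)

def sample_subset(probs, r, rng):
    """Rejection sampling: keep each leaf independently, retry until the size is r."""
    while True:
        chosen = [leaf for leaf, p in probs.items() if rng.random() < p]
        if len(chosen) == r:
            return chosen

def estimate_mean(measure, probs, r, samples, seed=0):
    """Monte Carlo estimate of the mean of a measure over random r-subsets."""
    rng = random.Random(seed)
    total = sum(measure(sample_subset(probs, r, rng)) for _ in range(samples))
    return total / samples

probs = {"a": 0.5, "b": 0.2, "c": 0.7, "d": 0.4}
print(estimate_mean(mpd, probs, r=3, samples=10_000))
```

The error of such an estimate shrinks only with the square root of the number of samples, and the rejection loop becomes slow when subsets of size r are unlikely under the leaf probabilities, which illustrates why an estimator of this kind is both inexact and costly.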
