# Fast Phylogenetic Biodiversity Computations Under a Non-uniform Random Distribution

## Abstract

Computing the phylogenetic diversity of a set of species is an important part of many ecological case studies. More specifically, let \(\mathcal {T}\) be a phylogenetic tree, and let *R* be a subset of its leaves representing the species under study. Ecologists want to evaluate a function \(f(\mathcal {T},R)\) (a *phylogenetic measure*) that quantifies the evolutionary distance between the elements of *R*. In most applications, however, it is also important to examine how \(f(\mathcal {T},R)\) behaves when *R* is selected at random. The standard way to do this is to compute the mean and the variance of *f* over all subsets of leaves in \(\mathcal {T}\) that consist of exactly \(|R| = r\) elements. For certain measures, there exist algorithms that compute these statistics under the condition that all subsets of *r* leaves are equiprobable. So far, though, no algorithm does this exactly when the leaves of \(\mathcal {T}\) are weighted with unequal probabilities. As a consequence, in this general setting, specialists resort to computing the statistics of phylogenetic measures with methods that are both inexact and very slow.

We present, for the first time, exact and efficient algorithms for computing the mean and the variance of phylogenetic measures when leaf subsets of fixed size are selected from \(\mathcal {T}\) under a non-uniform random distribution. In particular, let \(\mathcal {T}\) be a tree that has *n* nodes and depth *d*, and let *r* be a non-negative integer. We show how to compute, in \(O((d+\log n) n \log n)\) time and *O*(*n*) space, the mean and the variance of any measure that belongs to a well-defined class. We show that two of the most popular phylogenetic measures belong to this class: the Phylogenetic Diversity (\(\mathrm {PD} \)) and the Mean Pairwise Distance (\(\mathrm {MPD} \)). The random distribution that we consider is the Poisson binomial distribution restricted to subsets of fixed size *r*. We also provide a stronger result: for the \(\mathrm {PD} \) and the \(\mathrm {MPD} \), we describe algorithms that compute, in a batched manner, the mean and variance on \(\mathcal {T}\) for *all* possible leaf-subset sizes in \(O((d+\log n) n \log n)\) time and *O*(*n*) space.
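To make the quantities in this paragraph concrete, the following brute-force sketch enumerates every leaf subset of size *r* on a tiny tree and computes the mean and variance of the \(\mathrm {PD} \) and \(\mathrm {MPD} \) directly from the definitions. It is not the algorithm of the paper: it takes exponential time and is useful only as a reference on very small inputs. The toy tree, the leaf probabilities, and all function names are hypothetical. The sketch also assumes the unrooted definition of \(\mathrm {PD} \) (total edge length of the minimal subtree spanning *R*) and takes the probability of a subset *R* to be proportional to \(\prod_{i \in R} p_i \prod_{j \notin R} (1-p_j)\).

```python
from itertools import combinations
from math import prod

# Hypothetical toy tree: child -> (parent, edge length). "root" has no entry.
TREE = {
    "x": ("root", 1.0),
    "a": ("x", 2.0),
    "b": ("x", 3.0),
    "c": ("root", 4.0),
}
LEAVES = ["a", "b", "c"]


def path_to_root(node):
    """Edges (named after their child endpoint) between `node` and the root."""
    edges = set()
    while node in TREE:
        edges.add(node)
        node = TREE[node][0]
    return edges


def pd(subset):
    """Assumed unrooted PD: total length of the minimal subtree spanning `subset`."""
    paths = [path_to_root(v) for v in subset]
    if len(paths) < 2:
        return 0.0
    # An edge is in the spanning subtree iff some, but not all, leaves lie below it.
    spanning = set().union(*paths) - set.intersection(*paths)
    return sum(TREE[e][1] for e in spanning)


def mpd(subset):
    """Mean path length over all unordered pairs of distinct leaves in `subset`."""
    pairs = list(combinations(subset, 2))
    if not pairs:
        return 0.0
    # The path between u and v uses the edges on exactly one of the two root paths.
    dist = lambda u, v: sum(TREE[e][1] for e in path_to_root(u) ^ path_to_root(v))
    return sum(dist(u, v) for u, v in pairs) / len(pairs)


def moments(measure, probs, r):
    """Exact mean and variance of `measure` over size-r subsets, by enumeration."""
    subsets = list(combinations(LEAVES, r))
    # Unnormalised weight of a subset under the size-restricted Poisson binomial.
    weights = [
        prod(probs[v] if v in s else 1.0 - probs[v] for v in LEAVES)
        for s in subsets
    ]
    total = sum(weights)
    values = [measure(s) for s in subsets]
    mean = sum(w * x for w, x in zip(weights, values)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
    return mean, var


if __name__ == "__main__":
    leaf_probs = {"a": 0.5, "b": 0.5, "c": 0.2}  # hypothetical leaf probabilities
    print(moments(pd, leaf_probs, 2))
    print(moments(mpd, leaf_probs, 2))
```

When all leaf probabilities are equal, the weights cancel and the sketch reduces to the uniform setting in which every subset of *r* leaves is equiprobable.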

For the \(\mathrm {PD} \) and \(\mathrm {MPD} \), we implemented the algorithms that perform batched computations of the mean and variance. We also developed alternative implementations that compute the same output in \(O((d+\log n) n^2)\) time. We ran experiments on both types of implementations and measured their performance in practice. Despite the difference in theoretical performance, we show that the algorithms that run in \(O((d+\log n) n^2)\) time are more efficient in practice and numerically more stable. We also compared the performance of these algorithms with standard inexact methods that can be used in case studies. We show that our algorithms are much faster, making it possible to process much larger datasets than before. Our implementations will become publicly available through the R package PhyloMeasures.
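The abstract does not say where the two running times come from, so the following is only an illustration of a quadratic-time building block that is standard for this distribution, not a description of the authors' implementation. For leaf probabilities \(p_1,\dots,p_n\), a simple dynamic program computes, for every subset size *k* at once, the total probability mass of all subsets of size *k*, which is the normalising constant of the size-restricted distribution. The function names are hypothetical.

```python
def size_pmf(probs):
    """pmf[k] = probability that exactly k leaves are selected when leaf i is
    picked independently with probability probs[i] (Poisson binomial pmf).

    Runs in O(n^2) time: each leaf multiplies the running polynomial
    by ((1 - p) + p * z), one coefficient at a time.
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # leaf not selected
            nxt[k + 1] += mass * p       # leaf selected
        pmf = nxt
    return pmf


def inclusion_probability(probs, i, r):
    """P(leaf i is in R | |R| = r), for 1 <= r <= len(probs)."""
    rest = probs[:i] + probs[i + 1:]
    return probs[i] * size_pmf(rest)[r - 1] / size_pmf(probs)[r]


if __name__ == "__main__":
    leaf_probs = [0.5, 0.5, 0.2]  # hypothetical leaf probabilities
    print(size_pmf(leaf_probs))
    print([inclusion_probability(leaf_probs, i, 2) for i in range(3)])
```

The same coefficients could be obtained faster by multiplying the linear factors with fast polynomial multiplication, at the cost of floating-point round-off in the transform. Whether this trade-off explains the stability difference reported above is an assumption, not something the abstract states.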

## Keywords

Leaf Node, Pairwise Distance, Phylogenetic Diversity, Simple Path, Subset Size