The Rarefaction of Phylogenetic Diversity: Formulation, Extension and Application
Like other measures of diversity, Phylogenetic Diversity (PD) increases monotonically and asymptotically with increasing sample size. This relationship can be described by a rarefaction curve tracing the expected PD for a given number of accumulation units. Accumulation units represent individual organisms, collections of organisms (e.g. sites), or even species (or equivalent), giving individual-based, sample-based and species-based curves respectively. The formulation for the exact analytical solution for the rarefaction of PD is given in an expanded form to demonstrate congruence with the classic formulation for the rarefaction of species richness. Rarefaction is commonly applied as a standardisation for diversity values derived from differing numbers of sampling units. However, the solution can be simply extended to create measures of phylogenetic evenness, phylogenetic beta-diversity and phylogenetic dispersion, derived from individual-based, sample-based and species-based curves respectively. This extension, termed ∆PD, is simply the initial slope of the rarefaction curve and is related to entropy measures such as PIE (Probability of Interspecific Encounter) and Gini-Simpson entropy. The application of rarefaction of PD to sample standardisation and measurement of phylogenetic evenness, phylogenetic beta-diversity and phylogenetic dispersion is demonstrated. Future prospects for PD rarefaction include the recognition of evolutionary hotspots (independent of species richness), the basis for ecological theory such as phylogeny-area relationships, and the prediction of unseen biodiversity.
KeywordsAlpha diversity Beta diversity Evenness Phylogeny Sampling curves
Phylogenetic Diversity (PD) is a simple, intuitive and effective measure of biodiversity. The PD of a set of taxa, represented as the tips of a phylogenetic tree, is the sum of the branch lengths connecting those taxa (Faith 1992). PD is a particularly flexible measure because it can be applied to any set of relationships among entities that can be reasonably portrayed as a tree. Thus, the tips do not, by necessity, need to represent species but could be higher taxa, Operational Taxonomic Units, Evolutionarily Significant Units, individual organisms or unique haplotypes. Further, the tree itself might not portray evolutionary relationships but instead be, for example, a cluster dendrogram portraying functional relationships among taxa (Petchey and Gaston 2002).
Since the original formulation by Faith (1992), PD has come to be not just a single measure equating to a phylogenetically weighted form of richness, but rather a general class of measures dealing with various aspects of alpha and beta-diversity (Faith 2013). The common feature of this class of measures is the summation of branch lengths rather than the counting of tips. By substituting branch segments (intervals between nodes on a phylogenetic tree) for species, and including a weighting for the length of that segment, it is possible to modify many of the classic measures of Species Diversity (SD) to a PD equivalent (Faith 2013). By this means, phylogenetically weighted measures of endemism (Faith et al. 2004; Rosauer et al. 2009), ecological resemblance (Ferrier et al. 2007; Nipperess et al. 2010), and entropy (Chao et al. 2010, and chapter “Phylogenetic Diversity Measures and Their Decomposition: A Framework Based on Hill Numbers”) have been developed, for example.
Beside the units by which sampling effort is measured, Gotelli and Colwell (2001) distinguished between “accumulation curves” and “rarefaction curves”, based on the process by which the sampling curve is calculated. An accumulation curve plots a single ordering of individuals or samples (or species) against a cumulatively calculated concave diversity measure. The jagged shape of the resulting curve is highly dependent on the, often arbitrary, order of the accumulation units. To resolve this problem, rarefaction curves instead plot the expected value of the diversity measure against the corresponding number of accumulation units. Rarefaction can be achieved using an algorithmic procedure of repeated random sub-sampling of the full set of accumulation units and calculating the mean diversity (Gotelli and Colwell 2001). However, Hurlbert (1971) and Simberloff (1972) showed that expected diversity can be calculated using an exact analytical solution, obviating the need for computer-intensive repeated sub-sampling. Initially, this solution was for individuals-based rarefaction curves, but it has since been shown that the same solution applies to sample-based rarefaction (Kobayashi 1974; Ugland et al. 2003; Mao et al. 2005; Chiarucci et al. 2008).
The original purpose of rarefaction was to allow the comparison of datasets with differing amounts of sampling effort (Sanders 1968). Assemblages can be compared “fairly” when rarefied to the same number of accumulation units (Gotelli and Colwell 2001). However, rarefaction has broader application than this single purpose. Depending of the unit of accumulation, the shape of the rarefaction curve provides information on ecological evenness (Olszewski 2004) and beta-diversity (Crist and Veech 2006). Rarefaction of species richness also forms the basis of estimators of species richness, including unseen species (Colwell and Coddington 1994). In the case of PD, species-based rarefaction curves also allow for a measure of phylogenetic dispersion (Webb et al. 2002), effectively the expected PD for some given number of species (Nipperess and Matsen 2013). A solution for the rarefaction of PD is therefore desirable as it will allow for these applications to be realised for phylogenetically explicit datasets.
Rarefaction of Phylogenetic Diversity, using an algorithmic solution of repeated sub-sampling, has now been done several times (see for example Lozupone and Knight 2008; Turnbaugh et al. 2009; Yu et al. 2012). However, an analytical solution for PD rarefaction, similar to that determined by Hurlbert (1971) for species richness, is preferable both because its results are exact (not dependent on the number of repeated subsamples) and substantially more computationally efficient. Nipperess and Matsen (2013) recently published just such a solution for both the mean and variance of PD under rarefaction. This solution is quite general, being applicable to rooted and unrooted trees, and even allowing partition of the tree into smaller components than the individual branch segments. As a result, the solution is given in a very generalised form and its relationship with classic rarefaction formula for species richness is not immediately clear.
In this chapter, I provide a detailed formulation for the exact analytical solution for expected (mean) Phylogenetic Diversity for a given amount of sampling effort. This formulation is for the specific but common case of a rooted phylogenetic tree where whole branch segments are selected under rarefaction. I use the same form of expression as used by Hurlbert (1971) to demonstrate the direct relationship between rarefaction of PD and rarefaction of species richness. I do not include a solution for variance of PD under rarefaction due to its complexity when given in this form and instead refer the reader to Nipperess and Matsen (2013). I extend this framework to show how the initial slope of the rarefaction curve (∆PD) can be used as a flexible measure of phylogenetic evenness, phylogenetic beta-diversity or phylogenetic dispersion, depending on the unit of accumulation. I apply PD rarefaction and the derived ∆PD measure to real ecological datasets to demonstrate its usefulness in addressing ecological questions. Finally, I discuss some future directions for the extension and application of PD rarefaction.
The following is a demonstration of the application of PD rarefaction, and the derived ∆PD statistics, to real ecological datasets. These applications are not intended to provide definitive answers to ecologically important questions but are, rather, simple demonstrations of how PD rarefaction can allow new analyses to be undertaken and, hopefully, new insights gained.
In all these applications, I have used published data on mammals. This is principally for convenience as mammals (Bininda-Emonds et al. 2007) and birds (Jetz et al. 2012) are the only major taxonomic groups for which comprehensive species-level supertrees are available. I have used an updated version of the mammal supertree of Bininda-Emonds et al. (2007) published as supplementary material by Fritz et al. (2009). In this supertree, all branch lengths are measured in units of time (millions of years between branching events), allowing for a straight-forward interpretation of PD as cumulative evolutionary history (Proches et al. 2006).
All analyses were conducted using the statistical software, R version 2.15.2 (R Core Team 2012). Phylogenetic information was processed using the ape package in R (Paradis et al. 2004). PD rarefaction analyses used the phylodiv, phylocurve and phylorare functions, written by the author and available from: http://davidnipperess.blogspot.com.au.
Standardisation of Sampling
The most commonly used application for rarefaction is standardisation to allow comparisons to be made between datasets with differing amounts of sampling effort. Standardisation can be achieved by rarefying all datasets back to a common (typically the minimum) number of accumulation units (Sanders 1968; Gotelli and Colwell 2001).
Law et al. (1998) surveyed bats in ten State Forests of the south-west slopes region of New South Wales, Australia. Survey methods were a combination of ultra-sonic detectors, harp-traps, mist-nets and trip-lines. For the purposes of this demonstration, only data from the harp-traps will be used. A harp-trap is a rectangular frame, stringed vertically with nylon line, placed so as to intercept the flight path of low-flying bats (Tidemann and Woodside 1978). A bat striking the nylon lines of the trap will tumble down into a collecting bag at the bottom.
Comparison of diversity measures for bat assemblages for ten state forests of the south-west slopes region of New South Wales, Australia
Species richness (S)
Phylogenetic diversity (PDN)
Standardised phylogenetic diversity (PD15)
The extension of PD rarefaction to ∆PD allows for the measurement of phylogenetic evenness, which is essentially a measure of the distribution of individuals among branches in a phylogenetic tree (Webb and Pitman 2002). A phylogenetically even community is one where the most evolutionarily distinct species are also the most abundant. Because ∆PD will increase with both increasing phylogenetic evenness and phylogenetic diversity, it is more correctly a measure of entropy (Jost 2006), directly comparable to the PIE and Gini-Simpson indices. It has a particularly close relationship with the quadratic entropy measure of Rao (1982). Rao’s quadratic entropy measures the average distance between individuals in an assemblage. When that distance is measured as patristic distance (path length on a phylogenetic tree), ∆PD will be approximately half of Rao’s quadratic entropy. ∆PD is also similar in intent, but not in form, to the phylogenetic entropy index of Allen et al. (2009).
Comparison of diversity measures for bat assemblages from four habitats along a disturbance gradient in the Selva Lacandona, Chiapas, Mexico
Phylogenetic evenness (∆PD)
Phylogenetic component (∆PD/PIE)
The trend in phylogenetic evenness may simply be reflecting the abundance distribution among species. To determine the phylogenetic contribution to phylogenetic evenness, ∆PD was divided by the PIE index (Table 2). Since PIE is the probability that the second randomly selected individual is a different species to the first, we can divide ∆PD by PIE to get the expected branch length of that species (conditional on the second individual being a different species). This value is related to phylogenetic dispersion (∆PD from a species-based rarefaction curve) but differs due to the conditional probability structure, and effectively measures the pure phylogenetic contribution to ∆PD independent of the abundance distributions among species. We see, in this case, that the phylogenetic component generally decreases with increasing disturbance (Cacao being the exception), supporting the notion that disturbance favours more closely related species.
Phylogenetic beta-diversity is effectively the turnover of branch lengths between samples in space and/or time. Like its species-level equivalent, phylogenetic beta-diversity can be measured on a pair-wise basis (Lozupone and Knight 2005; Bryant et al. 2008; Nipperess et al. 2010) or as a single value for a set of samples (Anderson et al. 2010). Rarefaction of PD provides a means for deriving a single value of beta-diversity for a set of samples of any size via the ∆PD measure, which is a phylogenetic analogue of the additive partitioning approach of Crist and Veech (2006).
Comparison of diversity measures for small mammal assemblages of sites in the Tanami Desert and Uluru-Kata Tjuta National Park, Northern Territory, Australia
No. of sites
Species richness (alpha)
Species richness (gamma)
Species beta diversity (additive)
Phylogenetic beta diversity (∆PD)
Phylogenetic dispersion is a measure of the average phylogenetic distance among species (or tips) (Webb et al. 2002) and is in effect a measure of tree shape (Davies and Buckley 2012). ∆PD provides a simple, intuitive measure of dispersion as the expected gain in PD of adding a second randomly selected species to the first. It can also be seen as a means of correcting for variation in species richness among samples, as it is well known that PD increases with species richness (Rodrigues and Gaston 2002).
I generated PD rarefaction curves and ∆PD values for the mammal faunas of 71 of the 79 terrestrial ecoregions recognised by Olson et al. (2001) as constituting the Australasian biogeographic realm. Data were sourced from the wildfinder database (http://worldwildlife.org/pages/wildfinder) of the World Wildlife Fund. Eight ecoregions were excluded from the analysis because they had less than two species and thus a ∆PD value could not be calculated.
Comparison of diversity measures for mammal assemblages of selected ecoregions of the Australasian biogeographic realm
Phylogenetic diversity (Ma)
Phylogenetic dispersion (∆PD)
New Caledonia dry forests
Central range montane rainforests
Mount Lofty woodlands
As demonstrated here, rarefaction of PD has a straightforward application in standardising PD across samples so that they can be compared directly. Further, depending on the accumulation unit, the rarefaction formula can be extended to the calculation of metrics of phylogenetic evenness, phylogenetic beta-diversity and phylogenetic dispersion. However, the application of the PD rarefaction formula and its extension to other metrics is still very much in its infancy. Here I will outline some future directions for PD rarefaction.
Rarefaction by units of species allows for the comparison of locations while controlling for variation in species richness. This can easily be done by either rarefying all locations to a given number of species (Nipperess and Matsen 2013) or via ∆PD as demonstrated here. This kind of correction has previously been done by including species richness as an explanatory variable in a statistical model and taking the residuals (Davies et al. 2008) or by comparison to a null model derived by repeated subsampling (Davies et al. 2007). The latter method is often used as a statistical test of phylogenetic dispersion (also known as phylogenetic structure) where random draws are taken from a species pool, representing a null community assembly process (Webb 2000). Such methods are no longer necessary as the exact relationship between species richness and PD is described by the rarefaction curve (Nipperess and Matsen 2013). Further, the exact analytical solution is computationally efficient, allowing for practical application to very large datasets.
By removing the effect of species richness, we can identify “evolutionary hotspots” with higher than expected phylogenetic diversity (Davies et al. 2008; Nipperess and Matsen 2013) on a regional or global scale. We can then use the standardised PD values (called relative PD by Davies et al. 2007) to explore the environmental, ecological and historical processes that lead to the observed patterns of high or low phylogenetic dispersion (Kooyman et al. 2013). Ultimately, we may be able to develop the theory to predict these patterns (Davies et al. 2007), in a similar vein to what has been done for species richness (Arrhenius 1921; MacArthur and Wilson 1963; Rosindell et al. 2011). For example, the relationship of species richness with area is well known but the phylogeny-area relationship has only recently begun to be explored (Morlon et al. 2011). Rarefaction curves have an obvious connection to species-area curves (Olszewski 2004) and thus the development of PD rarefaction may well improve understanding of the phylogeny-area relationship. In particular, species-based rarefaction of PD allows for the separation of species diversity effects from those purely explained by phylogeny.
It is possible to predict how much Phylogenetic Diversity is yet to be sampled from the observed rarefaction curve. Rarefaction is the basis of several species diversity estimators, which attempt to calculate total diversity (including unseen species) for a set of individuals or samples by effectively extending the curve beyond the observed sampling depth (Colwell and Coddington 1994). It follows that a useful extension of PD rarefaction would be a PD estimator that predicts unseen branch length, given the observed rate of accumulation of PD. It is important to note that PD rarefaction calculates the expected branch length gained by adding additional accumulation units but does not predict where on the tree these branches will come from. Similarly, a biodiversity estimator based on PD rarefaction may be able to predict the amount of PD not yet sampled but would not be able to predict where these unseen branches would be added to an existing tree. This would be, nevertheless, an exciting development.
It has recently been proposed that the standardisation of samples for species diversity should not be done by rarefaction to the same size (i.e. no. of individuals), but rather by sample completeness (Alroy 2010; Jost 2010; Chao and Jost 2012). Completeness, when measured by a statistic known as coverage (Good 1953), is the proportion of individuals in a community that are represented by species in a sample from that community (Chao and Jost 2012). When samples differ in their coverage, they should be standardised to equal coverage before a “fair” comparison can be made. Much like expected species richness, the coverage of a sample can be estimated from the sample size and the distribution of individuals among the species in the sample (Chao and Jost 2012). Given that standardisation by sample completeness has been shown to yield a less biased comparison of species richness between communities (Chao and Jost 2012), it would be desirable to have a similar method of standardisation for PD. Since rarefaction of coverage is mathematically related to rarefaction of sample size, the recent work on estimating PD from sample size will no doubt form the basis from which estimated PD for sample coverage will be developed.
Finally, a general issue when considering any PD measure is uncertainty regarding the length of branches and the topology (branching pattern) of the tree. All PD measures (including those presented here) assume that the branch lengths and their arrangement in the tree are perfectly known. This is obviously an abstraction, although PD can be surprisingly robust to this source of variation (Swenson 2009). One solution to this dilemma is to calculate PD, including rarefied PD, for a large number of possible trees and report the mean and confidence limits. The output from a Bayesian phylogenetic analysis is a large number of trees, each with their own topology and corresponding branch lengths (see for example Jetz et al. 2012) and so lends itself well to this approach. However, when the possible trees number in the thousands and tens of thousands, this is obviously computationally intensive. An analytical solution, directly incorporating uncertainty into the calculation, would therefore be desirable. This is not an easy extension of the PD rarefaction solution because both variation in branch length and topology (affecting the probability of encountering internal branches) would need to be taken into account. It is worth remembering that phylogenetic relationships are not the only source of uncertainty when investigating real ecological communities – neither the abundance, nor even the presence (occupancy), of species are necessarily known with precision.
The formulation for the rarefaction of Phylogenetic Diversity (PD) is given in expanded form to show its simplicity and its connection to the classic formula for the rarefaction of species richness (Hurlbert 1971; Simberloff 1972). The method is exact and efficient and should be preferred over the algorithmic (Monte Carlo) solution involving repeated random sub-sampling. Further, the extension to the calculation of ∆PD provides a flexible and general framework for the measurement of biodiversity as phylogenetic evenness, phylogenetic beta-diversity or phylogenetic dispersion. The applications of PD rarefaction and ∆PD presented here are hopefully useful in improving understanding of the importance of rarefaction in ecology and in guiding future applications of the method. There are, I believe, exciting prospects for PD rarefaction in the future, including as a general method for standardising PD by removing variation with species richness, and for predicting unseen (i.e. un-sampled) PD. The recent availability of comprehensive phylogenies (Bininda-Emonds et al. 2007; Jetz et al. 2012) and rich data on species occurrences (Flemons et al. 2007), coupled with analytical advances such as PD rarefaction, allows us to better understand the distribution of Phylogenetic Diversity on the surface of the Earth and the processes giving rise to that distribution. This is valuable for its own sake but will also inform efforts to conserve as much of the Tree of Life as possible in the face of future extinctions (Rosauer and Mooers 2013).
I would like to thank Frederick Matsen (Fred Hutchinson Cancer Research Center), Daniel Faith (Australian Museum), Peter Wilson (Macquarie University), Anne Chao (National Tsing Hua University) and Robert Colwell (University of Connecticut) for their input regarding the rarefaction of PD and what it all might mean. Daniel Miranda-Esquivel (Universidad Industrial de Santander) and Pedro Cardoso (University of Helsinki) provided helpful feedback on an earlier version of this chapter. This research was supported by the Australian Research Council (DP0665761 and DP1095200).
- Faith DP, Lozupone CA, Nipperess DA, Knight R (2009) The cladistic basis for the phylogenetic diversity (PD) measure links evolutionary features to environmental gradients and supports broad applications of microbial ecology’s ‘Phylogenetic Beta Diversity’ framework. Int J Mol Sci 10:4723–4741CrossRefPubMedPubMedCentralGoogle Scholar
- R Core Team (2012) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/
Open Access This chapter is distributed under the terms of the Creative Commons Attribution-Noncommercial 2.5 License (http://creativecommons.org/licenses/by-nc/2.5/) which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.