When Being “Most Likely” Is Not Enough: Examining the Performance of Three Uses of the Parametric Bootstrap in Phylogenetics
I show that three parametric-bootstrap (PB) applications that have been proposed for phylogenetic analysis can be misleading as currently implemented. First, I show that simulating a topology estimated from preliminary data, in order to determine the sequence length at which the best tree obtained from more extensive data should be correct with a desired probability, delivers an accurate estimate of this length only in topological situations in which most preliminary trees are expected to be both correct and statistically significant, i.e., when no further analysis would be needed. Otherwise, one obtains strong underestimates of the length, or similarly biased values for incorrect trees. Second, I show that PB-based topology tests that use as the null hypothesis the most likely tree congruent with a pre-specified topological relationship (an alternative to the unconstrained most likely tree), and that simulate this tree for P value estimation, produce excessive type I error (from 50% to 600% and higher) when they are applied to null data generated by star-shaped or dichotomous four-taxon topologies. Simulating the most likely star topology for P value estimation instead results in correct type I error rates, even when the null data are generated by a dichotomous topology. This is a strong indication that the star topology is the correct default null hypothesis for phylogenies. Third, I show that PB-estimated confidence intervals (CIs) for the length of a tree branch are generally accurate, although in some situations they can be strongly over- or underestimated relative to the “true” CI. Attempts to identify a biased CI through a further round of simulations were unsuccessful. Tracing the origin and propagation of parameter-estimate error through the CI estimation exercise showed that the sparseness of site patterns that are crucial to the estimation of pivotal parameters can allow homoplasy to bias these estimates and, ultimately, the PB-based CI estimation.
In conclusion, I stress that statistical techniques that simulate models estimated from limited data need to be carefully calibrated, and I argue that pattern-sparseness assessment will be the next frontier in the statistical analysis of phylogenies, an effort that will require combining the merits of black-box maximum-likelihood approaches with insights from intuitive, site-pattern-oriented approaches such as parsimony.
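To make the third application concrete, the following is a minimal sketch of a PB-based confidence interval for a branch length. It is not the implementation used in the article. As a simplifying assumption, it treats the simplest possible case: a single branch separating two sequences under the Jukes-Cantor (JC69) model, with a percentile interval. The function names `jc69_p`, `jc69_distance`, and `pb_branch_ci`, and the example counts, are hypothetical and chosen only for illustration.

```python
import math
import random


def jc69_p(d):
    """Probability that a site differs between two sequences at JC69 distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(k, n):
    """ML estimate of the JC69 distance from k differing sites out of n.

    Returns infinity when the observed difference is saturated (>= 3/4).
    """
    p_hat = k / n
    if p_hat >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def pb_branch_ci(k, n, replicates=1000, level=0.95, seed=0):
    """Percentile parametric-bootstrap CI for a two-sequence JC69 branch length.

    The model estimated from the data (d_hat) is itself simulated, so any error
    in d_hat propagates into the interval.
    """
    rng = random.Random(seed)
    d_hat = jc69_distance(k, n)
    p = jc69_p(d_hat)

    sims = []
    for _ in range(replicates):
        # Simulate n independent sites under the estimated model.
        k_star = sum(rng.random() < p for _ in range(n))
        # Re-estimate the branch length from the simulated data.
        sims.append(jc69_distance(k_star, n))
    sims.sort()

    alpha = 1.0 - level
    lower = sims[math.floor(alpha / 2.0 * (replicates - 1))]
    upper = sims[math.ceil((1.0 - alpha / 2.0) * (replicates - 1))]
    return d_hat, lower, upper


if __name__ == "__main__":
    # Hypothetical data: 30 differing sites in an alignment of 200 sites.
    d_hat, lower, upper = pb_branch_ci(30, 200)
    print(f"d_hat = {d_hat:.4f}, 95% PB CI = [{lower:.4f}, {upper:.4f}]")
```

Because the replicates are generated from the estimated distance rather than the true one, the interval inherits whatever bias the estimate carries. This two-sequence toy case has no site-pattern sparseness or homoplasy on a tree, so it only illustrates the mechanics of the procedure, not the failure modes reported above.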