Abstract
We consider the effects of fully or partially random sequences on the estimation of four-taxon phylogenies. Fully or partially random sequences occur when whole sequences for a subset of taxa, or some sites within those sequences, are independent of the sequence data for the other taxa. Random sequences can arise from misalignment or from sites that evolve at very fast rates in some portions of a tree, a situation that occurs especially in analyses involving deep divergence times. One might reasonably speculate that random sites will only add noise to the estimation of a phylogeny. We show that when a random sequence is added to a three-taxon alignment, it is more likely to be placed as a neighbor of the sequence corresponding to the longest branch in the three-taxon tree. Surprisingly, when only about half of the sites show randomness, a long-branch-repels form of small-sample bias occurs, and when a minority of sites show randomness this becomes a long-branch-attraction bias again. The most serious bias, one that does not vanish with increasing sequence length, occurs when more than one sequence is partially random. If there is a large amount of overlap in the random sites for two sequences, those two sequences will be attracted to each other; otherwise, they will repel each other. Random sequences or sites can, therefore, cause complicated biases in phylogenetic inference. We suggest performing analyses with and without potentially saturated sequences and/or misaligned sites, to check that these biases are not affecting the inferred branching pattern.
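The first result above, that a fully random sequence tends to be placed next to the longest branch, can be illustrated with a toy simulation. The sketch below is only an illustration under stated assumptions, not the analysis behind the abstract: it assumes a Jukes-Cantor model on a three-taxon star tree, uncorrected p-distances, and the four-point criterion for choosing the neighbor of the random sequence. The branch lengths, sequence length, replicate count, and all function names are hypothetical choices made for this example. It shows only the direction of the effect and does not reproduce any small-sample analysis.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, t, rng):
    """Evolve seq along a branch of length t (expected substitutions
    per site) under the Jukes-Cantor model."""
    # Probability that a site ends in a base different from its start.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_diff:
            # Under Jukes-Cantor the three other bases are equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return out


def p_distance(x, y):
    """Uncorrected proportion of sites at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def partner_of_random(n_sites, branch_lengths, rng):
    """Return the taxon paired with a fully random sequence by the
    four-point criterion (smallest sum of within-pair distances)."""
    root = [rng.choice(BASES) for _ in range(n_sites)]
    seqs = {name: evolve(root, t, rng) for name, t in branch_lengths.items()}
    # The random sequence is drawn independently of the other three.
    rand = [rng.choice(BASES) for _ in range(n_sites)]
    names = list(seqs)

    def score(x):
        y, z = [n for n in names if n != x]
        return p_distance(rand, seqs[x]) + p_distance(seqs[y], seqs[z])

    return min(names, key=score)


def simulate(n_reps=200, n_sites=200, seed=1):
    """Count how often each taxon is chosen as the random sequence's neighbor."""
    rng = random.Random(seed)
    # Hypothetical branch lengths: taxon C sits on the long branch.
    branch_lengths = {"A": 0.1, "B": 0.1, "C": 1.0}
    counts = {name: 0 for name in branch_lengths}
    for _ in range(n_reps):
        counts[partner_of_random(n_sites, branch_lengths, rng)] += 1
    return counts


counts = simulate()
print(counts)
```

In this setup the random sequence is roughly equally distant from A, B, and C, so the four-point criterion is decided mainly by which pair of the remaining taxa is closest. A and B are closest to each other, which leaves the random sequence paired with the long-branch taxon C.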
Acknowledgments
E.S. and A.J.R. are supported by the Natural Sciences and Engineering Research Council of Canada. M.S. is supported by Genome Atlantic/Genome Canada. A.J.R. thanks the Canadian Institute for Advanced Research Program in Evolutionary Biology and the Canadian Institutes of Health Research for fellowship support. A.J.R. is supported by NSERC Operating Grant 227085-00, the Alfred P. Sloan Foundation and a Peter Lougheed/CIHR New Investigator Award. This collaboration is part of the Prokaryotic Genome Evolution and Diversity Project of Genome Atlantic/Genome Canada.
Additional information
[Reviewing Editor: Dr. J. Rasmus Nielson]
About this article
Cite this article
Susko, E., Spencer, M. & Roger, A.J. Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments. J Mol Evol 61, 351–359 (2005). https://doi.org/10.1007/s00239-004-0352-9
DOI: https://doi.org/10.1007/s00239-004-0352-9