
Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments

Journal of Molecular Evolution

Abstract

We consider the effects of fully or partially random sequences on the estimation of four-taxon phylogenies. Sequences are fully or partially random when whole sequences, or some sites within them, for a subset of taxa are independent of the sequence data for the other taxa. Random sequences can result from misalignment or from sites that evolve at very fast rates in some portions of a tree, a situation that arises especially in analyses involving deep divergence times. One might reasonably speculate that random sites will only add noise to the estimation of a phylogeny. We show that when a random sequence is added to a three-taxon alignment, it is more likely to be placed as a neighbor of the sequence corresponding to the longest branch in the three-taxon tree. Surprisingly, when only about half of the sites are random, a long-branch-repels form of small-sample bias occurs, and when a minority of sites are random, the bias becomes long-branch attraction again. The most serious bias, one that does not vanish with increasing sequence length, occurs when more than one sequence is partially random. If the random sites of two sequences overlap to a large extent, those two sequences will be attracted to each other; otherwise, they will repel each other. Random sequences or sites can therefore cause complicated biases in phylogenetic inference. We suggest performing analyses with and without potentially saturated sequences and/or misaligned sites to check that these biases are not affecting the inferred branching pattern.
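The claim that a fully random sequence tends to join the longest branch can be illustrated with a toy simulation. The sketch below is not the authors' analysis; it rests on assumptions chosen for brevity: sites evolve under the Jukes-Cantor (JC69) model on a three-taxon star tree, the fourth taxon D is drawn uniformly at random at every site, and the four-taxon topology is chosen by counting parsimony-informative site patterns. The function names and the branch lengths in the usage example are hypothetical.

```python
import math
import random
from collections import Counter

BASES = "ACGT"


def evolve(base, t, rng):
    """Evolve one base along a branch of length t (expected substitutions per site) under JC69."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    if rng.random() < p_change:
        # Under JC69 the three alternative bases are equally likely.
        return rng.choice([b for b in BASES if b != base])
    return base


def simulate_neighbor(n_sites, branch_lengths, rng):
    """Return the taxon ('A', 'B' or 'C') that the random taxon D pairs with, or None on a tie."""
    support = {"A": 0, "B": 0, "C": 0}
    for _ in range(n_sites):
        root = rng.choice(BASES)
        site = {name: evolve(root, t, rng) for name, t in branch_lengths.items()}
        a, b, c = site["A"], site["B"], site["C"]
        d = rng.choice(BASES)  # random taxon: independent of A, B and C
        # Only two-by-two patterns are parsimony-informative for four taxa.
        if d == a and b == c and d != b:
            support["A"] += 1
        elif d == b and a == c and d != a:
            support["B"] += 1
        elif d == c and a == b and d != a:
            support["C"] += 1
    best = max(support.values())
    winners = [name for name, count in support.items() if count == best]
    return winners[0] if len(winners) == 1 else None


def neighbor_frequencies(n_reps, n_sites, branch_lengths, seed=0):
    """Count how often D is placed next to each taxon over n_reps simulated alignments."""
    rng = random.Random(seed)
    return Counter(simulate_neighbor(n_sites, branch_lengths, rng) for _ in range(n_reps))


if __name__ == "__main__":
    # Hypothetical branch lengths: A and B short, C long.
    lengths = {"A": 0.05, "B": 0.05, "C": 0.5}
    print(neighbor_frequencies(n_reps=200, n_sites=200, branch_lengths=lengths, seed=1))
```

Under these assumptions, the pattern in which A and B agree while C differs is the most common among the three original taxa, so a random D matches C in that pattern more often than it produces support for either alternative pairing. Varying the branch lengths, or making only a fraction of D's sites random, is a simple way to explore the other regimes described in the abstract, although this parsimony-based toy is not guaranteed to reproduce them.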



Acknowledgments

E.S. and A.J.R. are supported by the Natural Sciences and Engineering Research Council of Canada. M.S. is supported by Genome Atlantic/Genome Canada. A.J.R. thanks the Canadian Institute for Advanced Research Program in Evolutionary Biology and the Canadian Institutes of Health Research for fellowship support. A.J.R. is supported by NSERC Operating Grant 227085-00, the Alfred P. Sloan Foundation, and a Peter Lougheed/CIHR New Investigator Award. This collaboration is part of the Prokaryotic Genome Evolution and Diversity Project of Genome Atlantic/Genome Canada.

Author information


Correspondence to Edward Susko.

Additional information

[Reviewing Editor: Dr. J. Rasmus Nielson]


About this article

Cite this article

Susko, E., Spencer, M. & Roger, A.J. Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments. J Mol Evol 61, 351–359 (2005). https://doi.org/10.1007/s00239-004-0352-9


  • DOI: https://doi.org/10.1007/s00239-004-0352-9
