Journal of Molecular Evolution, Volume 61, Issue 3, pp 351–359

Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments

  • Edward Susko
  • Mathew Spencer
  • Andrew J. Roger


We consider the effects of fully or partially random sequences on the estimation of four-taxon phylogenies. Fully or partially random sequences occur when whole subsets of sequences, or some sites for subsets of sequences, are independent of the sequence data for the other taxa. Random sequences can arise from misalignment or because sites evolve at very fast rates in some portions of a tree, a situation that occurs especially in analyses involving deep divergence times. One might reasonably speculate that random sites will only add noise to the estimation of a phylogeny. We show that when a random sequence is added to a three-taxon alignment, it is more likely to be a neighbor of the sequence corresponding to the longest branch in the three-taxon tree. Surprisingly, when only about half of the sites show randomness, a long-branch-repels form of small sample bias occurs, and when a minority of sites show randomness, this becomes a long-branch-attraction bias again. The most serious bias, one that does not vanish with increasing sequence length, occurs when more than one sequence is partially random. If there is a large amount of overlap in the random sites for two sequences, those two sequences will be attracted to each other; otherwise, they will repel each other. Random sequences or sites can, therefore, cause complicated biases in phylogenetic inference. We suggest performing analyses with and without potentially saturated sequences and/or misaligned sites, to check that these biases are not affecting the inferred branching pattern.
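
As a rough illustration of what "partially random" means here, and of the suggested with-and-without check, the following Python sketch simulates a four-taxon alignment, overwrites a block of sites in one sequence with random nucleotides, and compares the topology chosen with and without those sites. It is not the method used in the paper: the Jukes-Cantor distance and the four-point criterion are assumed here only as simple stand-ins, all function names are hypothetical, and the suspect sites are taken as known, which would not be the case in practice.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def evolve(seq, p_change, rng):
    """Copy seq; each site switches to a different nucleotide with probability p_change."""
    return "".join(
        rng.choice([b for b in NUCLEOTIDES if b != base]) if rng.random() < p_change else base
        for base in seq
    )


def simulate_four_taxa(n_sites, p_change, rng):
    """Simulate sequences A, B, C, D on the tree ((A,B),(C,D)) (illustrative model only)."""
    root = "".join(rng.choice(NUCLEOTIDES) for _ in range(n_sites))
    left, right = evolve(root, p_change, rng), evolve(root, p_change, rng)
    return [evolve(left, p_change, rng), evolve(left, p_change, rng),
            evolve(right, p_change, rng), evolve(right, p_change, rng)]


def randomize_sites(seq, sites, rng):
    """Overwrite the given site indices with uniformly random nucleotides."""
    chars = list(seq)
    for i in sites:
        chars[i] = rng.choice(NUCLEOTIDES)
    return "".join(chars)


def drop_sites(seqs, sites):
    """Remove the given site indices from every sequence in the alignment."""
    excluded = set(sites)
    return ["".join(ch for i, ch in enumerate(s) if i not in excluded) for s in seqs]


def jc_distance(a, b):
    """Jukes-Cantor distance from the proportion of differing sites (capped near saturation)."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    p = min(p, 0.7499)  # avoid log of a non-positive number for saturated pairs
    return -0.75 * math.log(1 - 4 * p / 3)


def four_point_topology(seqs):
    """Return the split whose two within-pair distances have the smallest sum."""
    a, b, c, d = seqs
    sums = {
        "AB|CD": jc_distance(a, b) + jc_distance(c, d),
        "AC|BD": jc_distance(a, c) + jc_distance(b, d),
        "AD|BC": jc_distance(a, d) + jc_distance(b, c),
    }
    return min(sums, key=sums.get)


if __name__ == "__main__":
    rng = random.Random(1)
    seqs = simulate_four_taxa(1000, 0.1, rng)
    suspect = range(0, 500)  # assumed known here; in practice these must be identified
    seqs[3] = randomize_sites(seqs[3], suspect, rng)
    print("with suspect sites:   ", four_point_topology(seqs))
    print("without suspect sites:", four_point_topology(drop_sites(seqs, suspect)))
```

Repeating the simulation over many seeds, sequence lengths, and fractions of randomized sites would give a crude view of how often the two analyses disagree. A single run says little about bias.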


Keywords: Biased estimation; Long branch attraction; Phylogeny; Random sequences



E.S. and A.J.R. are supported by the Natural Sciences and Engineering Research Council of Canada. M.S. is supported by Genome Atlantic/Genome Canada. A.J.R. thanks the Canadian Institute for Advanced Research Program in Evolutionary Biology and the Canadian Institutes for Health Research for fellowship support. A.J.R. is supported by NSERC Operating Grant 227085-00, the Alfred P. Sloan Foundation and a Peter Lougheed/CIHR New Investigator Award. This collaboration is part of the Prokaryotic Genome Evolution and Diversity Project of Genome Atlantic/Genome Canada.



Copyright information

© Springer Science+Business Media, Inc. 2005

Authors and Affiliations

  • Edward Susko (1)
  • Mathew Spencer (1, 2)
  • Andrew J. Roger (3)

  1. Genome Atlantic, Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
  2. Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
  3. Genome Atlantic, Canadian Institute for Advanced Research, Program in Evolutionary Biology, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
