Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments

Susko, Edward; Spencer, Mathew; Roger, Andrew J.

doi:10.1007/s00239-004-0352-9

Published: 21 July 2005

Journal of Molecular Evolution
Volume 61, pages 351–359 (2005)

Edward Susko¹, Mathew Spencer¹˒² & Andrew J. Roger³

Abstract

We consider the effects of fully or partially random sequences on the estimation of four-taxon phylogenies. Fully or partially random sequences occur when whole sequences, or some sites within a subset of sequences, are independent of the sequence data for the other taxa. Random sequences can arise from misalignment or because sites evolve at very fast rates in some portions of a tree, a situation that occurs especially in analyses involving deep divergence times. One might reasonably speculate that random sites will only add noise to the estimation of a phylogeny. We show that when a random sequence is added to a three-taxon alignment, it is more likely to be placed as a neighbor of the sequence corresponding to the longest branch in the three-taxon tree. Surprisingly, when only about half of the sites show randomness, a long-branch-repels form of small-sample bias occurs, and when a minority of sites show randomness this becomes a long-branch-attraction bias again. The most serious bias, one that does not vanish with increasing sequence length, occurs when more than one sequence is partially random. If there is a large amount of overlap in the random sites for two sequences, those two sequences will be attracted to each other; otherwise, they will repel each other. Random sequences or sites can, therefore, cause complicated biases in phylogenetic inference. We suggest performing analyses with and without potentially saturated sequences and/or misaligned sites, to check that these biases are not affecting the inferred branching pattern.
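The first result in the abstract (a random sequence tends to be placed next to the longest branch) can be explored with a small simulation. The sketch below is not the authors' method. It assumes a star-shaped three-taxon tree, the Jukes-Cantor substitution model, uncorrected p-distances, and a four-point (minimum pair-sum) criterion for choosing the four-taxon topology. The taxon names `A`, `B`, `C`, `R`, the branch lengths, and all function names are hypothetical choices made for illustration.

```python
import math
import random

BASES = "ACGT"


def random_sequence(n_sites, rng):
    """Draw a sequence with each base chosen uniformly and independently."""
    return [rng.choice(BASES) for _ in range(n_sites)]


def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under Jukes-Cantor (assumed model).

    branch_length is in expected substitutions per site.
    """
    # Jukes-Cantor probability that a site differs at the end of the branch.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    return [
        rng.choice([b for b in BASES if b != base]) if rng.random() < p_diff else base
        for base in seq
    ]


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def partner_of_random(seqs):
    """Return the taxon grouped with the random sequence R.

    Uses the four-point criterion: of the three possible splits,
    pick the one with the smallest sum of within-pair distances.
    """
    def d(x, y):
        return p_distance(seqs[x], seqs[y])

    sums = {
        "A": d("R", "A") + d("B", "C"),
        "B": d("R", "B") + d("A", "C"),
        "C": d("R", "C") + d("A", "B"),
    }
    return min(sums, key=sums.get)


def tally_partners(branch_lengths, n_sites=200, n_reps=200, seed=1):
    """Count how often each taxon is inferred as the neighbor of R."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0, "C": 0}
    for _ in range(n_reps):
        root = random_sequence(n_sites, rng)
        # Star tree: every taxon descends directly from the root sequence.
        seqs = {name: evolve(root, t, rng) for name, t in branch_lengths.items()}
        # R is generated independently of the other three sequences.
        seqs["R"] = random_sequence(n_sites, rng)
        counts[partner_of_random(seqs)] += 1
    return counts


# Hypothetical branch lengths: C sits on the long branch.
print(tally_partners({"A": 0.1, "B": 0.1, "C": 1.0}))
```

In this toy setting the distances from `R` to the other taxa are all close to saturation, so the choice of split is driven mainly by which pair of real taxa is closest to each other. Because `A` and `B` are the close pair, `R` is usually paired with the long-branch taxon `C`. Changing the branch lengths or randomizing only a fraction of the sites in `R` is one way to probe the other regimes the abstract describes, though this distance-based shortcut may not reproduce them.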

Acknowledgments

E.S. and A.J.R. are supported by the Natural Sciences and Engineering Research Council of Canada. M.S. is supported by Genome Atlantic/Genome Canada. A.J.R. thanks the Canadian Institute for Advanced Research Program in Evolutionary Biology and the Canadian Institutes for Health Research for fellowship support. A.J.R. is supported by NSERC Operating Grant 227085-00, the Alfred P. Sloan Foundation and a Peter Lougheed/CIHR New Investigator Award. This collaboration is part of the Prokaryotic Genome Evolution and Diversity Project of Genome Atlantic/Genome Canada.

Author information

Authors and Affiliations

Genome Atlantic, Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 3J5
Edward Susko & Mathew Spencer
Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 4H7
Mathew Spencer
Genome Atlantic, Canadian Institute for Advanced Research, Program in Evolutionary Biology, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 4H7
Andrew J. Roger

Corresponding author

Correspondence to Edward Susko.

Additional information

[Reviewing Editor: Dr. J. Rasmus Nielson]

About this article

Cite this article

Susko, E., Spencer, M. & Roger, A.J. Biases in Phylogenetic Estimation Can Be Caused by Random Sequence Segments. J Mol Evol 61, 351–359 (2005). https://doi.org/10.1007/s00239-004-0352-9

Received: 03 December 2004
Accepted: 10 March 2005
Issue Date: September 2005