Automated Removal of Noisy Data in Phylogenomic Analyses

Journal of Molecular Evolution

Abstract

Noisy data, especially in combination with misalignment and model misspecification, can have an adverse effect on phylogeny reconstruction; however, effective methods to identify such data are few. One particularly important class of noisy data is saturated positions. To avoid potential errors related to saturation in phylogenomic analyses, we present an automated procedure involving the step-wise removal of the most variable positions in a given data set coupled with a stopping criterion derived from correlation analyses of pairwise ML distances calculated from the deleted (saturated) and the remaining (conserved) subsets of the alignment. Through a comparison with existing methods, we demonstrate both the effectiveness of our proposed procedure for identifying noisy data and the effect of the removal of such data using a well-publicized case study involving placental mammals. At the least, our procedure will identify data sets requiring greater data exploration, and we recommend its use to investigate the effect on phylogenetic analyses of removing subsets of variable positions exhibiting weak or no correlation to the rest of the alignment. However, we would argue that this procedure, by identifying and removing noisy data, facilitates the construction of more accurate phylogenies by, for example, ameliorating potential long-branch attraction artefacts.
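The general shape of such a procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: column variability is scored as the number of distinct states, simple p-distances stand in for the pairwise ML distances, and the stopping rule (a Pearson correlation threshold `min_r`) is hypothetical, since the abstract does not state the exact criterion. All function names are likewise invented here.

```python
from itertools import combinations
from math import sqrt


def column_variability(alignment, col):
    """Assumed variability score: number of distinct states in a column."""
    return len({seq[col] for seq in alignment})


def p_distances(alignment, cols):
    """Pairwise proportion of differing sites over the given columns.

    A simple stand-in for the ML distances named in the abstract.
    """
    return [
        sum(a[c] != b[c] for c in cols) / len(cols)
        for a, b in combinations(alignment, 2)
    ]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def strip_noisy_columns(alignment, step=1, min_r=0.5):
    """Return indices of columns kept after step-wise removal.

    Columns are ranked from most to least variable. The removed set grows
    by `step` columns at a time for as long as its pairwise distances
    correlate only weakly (r < min_r, a hypothetical threshold) with the
    distances from the remaining columns.
    """
    n_cols = len(alignment[0])
    ranked = sorted(
        range(n_cols),
        key=lambda c: column_variability(alignment, c),
        reverse=True,
    )
    cut = 0
    for k in range(step, n_cols, step):
        removed, kept = ranked[:k], ranked[k:]
        r = pearson(p_distances(alignment, removed),
                    p_distances(alignment, kept))
        if r >= min_r:
            break  # the removed set now carries signal shared with the rest
        cut = k
    return sorted(ranked[cut:])


# Toy alignment: columns 0-3 share one signal, column 4 is saturated.
toy = ["AAAAA", "AAAAC", "CCCCG", "CCCCT"]
kept_columns = strip_noisy_columns(toy)
```

On the toy alignment, the single saturated column is removed and the four columns sharing a common signal are kept. On real data, the distance model, the step size and the threshold would all need to be chosen with care.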


Abbreviations

ML: Maximum likelihood

MP: Maximum parsimony

OTU: Operational taxonomic unit


Acknowledgments

O.B.E. received funding support in part granted by the Deutsche Forschungsgemeinschaft (BI 825/3-2).

Author information

Corresponding author

Correspondence to Vadim V. Goremykin.

About this article

Cite this article

Goremykin, V.V., Nikiforova, S.V. & Bininda-Emonds, O.R.P. Automated Removal of Noisy Data in Phylogenomic Analyses. J Mol Evol 71, 319–331 (2010). https://doi.org/10.1007/s00239-010-9398-z


  • DOI: https://doi.org/10.1007/s00239-010-9398-z
