Abstract
The density of taxon sampling and the number and kind of characters are central to the ultimate goals of phylogenetic reconstruction: tree robustness and improved accuracy. In molecular phylogenetics, DNA sequence repositories such as GenBank are potential sources for expanding datasets in two dimensions, taxa and characters, to the level of “supermatrices.” However, missing characters or genomic regions are generally considered a major impediment to this endeavor. Here we used the angiosperm order Caryophyllales to systematically assess the impact of missing data when expanding taxon sampling and the number of characters in phylogenetic reconstruction. Our analyses show that expanding taxon sampling ~13-fold improved phylogenetic assessment of the Caryophyllales despite up to 38% missing data. Expanding the number of characters in the dataset, which allowed up to a 100-fold increase in the amount of missing data and the inclusion of entries with about 40% of genomic regions missing, did not negatively affect tree structure or robustness; on the contrary, it improved both. These results are timely given ongoing efforts to achieve a detailed assessment of the tree of life.
Acknowledgments
The authors thank J. Gordon Burleigh for his contributions to this manuscript; and D. and P. Soltis, S. Brockington, and M. Moore, as well as the Missouri Botanical Garden and the Royal Botanic Garden at Kew for providing DNA samples for several taxa. We thank M. Barthet for help in designing a primer, A. Hinckle for helping with specimen collection, S. Newman for assistance in laboratory work, and A. Ferraioli for assistance with figures. We also thank two anonymous reviewers for their comments and suggestions. This work is part of the AToL-Angiosperm project supported by grants from the National Science Foundation, USA (EF-043105 and REU-477683 3) to K.W.H.
Electronic supplementary material
Below are the links to the electronic supplementary material.
Online Resource 4 (ESM_4) MP strict consensus tree based on the matK/trnK intron dataset for 51 Caryophyllales taxa (0.3% missing data). Percent bootstrap values greater than 50% are noted on branches
Online Resource 5 (ESM_5) Summary of the MP strict consensus tree based on matK/trnK intron data with expanded taxon sampling (652 taxa with 38% missing data). Percent bootstrap values greater than 50% are noted on branches
Online Resource 6 (ESM_6) ML tree based on the five genomic regions (rbcL, atpB, ndhF, matK, and trnK intron) for 136 Caryophyllales taxa. Percent bootstrap values greater than 50% are noted on branches. (a) Expanded details for the “AAC” and “raphide” clades. (b) Expanded details for the “succulents” clade. (c) Expanded details for the “FTPP” and “carnivorous” clades
Online Resource 7 (ESM_7) MP strict consensus tree based on the dataset of five genomic regions (rbcL, atpB, ndhF, matK, and trnK intron) for 136 taxa (5GR-136; 46% missing data). Percent bootstrap values greater than 50% are noted on branches. (a) The FTPP and carnivorous clades have been collapsed. (b) FTPP and carnivorous clades are expanded
Online Resource 8 (ESM_8) ML tree based on the matK/trnK intron dataset for 51 Caryophyllales taxa. Branch lengths are noted on the branches
Cite this article
Crawley, S.S., Hilu, K.W. Impact of missing data, gene choice, and taxon sampling on phylogenetic reconstruction: the Caryophyllales (angiosperms). Plant Syst Evol 298, 297–312 (2012). https://doi.org/10.1007/s00606-011-0544-x