Background

In phylogenetic studies, it has become increasingly common to sequence large numbers of individuals at many loci (e.g., [14]). While these multilocus datasets provide the potential to improve the accuracy of phylogeny inferences over large sets of taxa, for a variety of reasons, topologies of trees inferred at different loci might not match [5]. One source of this gene tree discordance is incomplete lineage sorting, the phenomenon in which sets of sampled lineages fail to coalesce in the population in which they are first capable of coalescing [6]. With incomplete lineage sorting, several species tree inference methods—including Concatenation [7], Democratic Vote Consensus [8], Greedy Consensus [9], Majority-Rule Consensus [9], Matrix Representation with Parsimony [10], and Minimize Deep Coalescences [11]—can be misled by discordance of gene trees across loci.

Numerous approaches have been used for estimating species tree topologies from multilocus sequence data. Consensus methods construct species tree topologies from gene trees according to deterministic rules applied to the input set of trees [12, 13]. These methods take as input a set of gene trees produced from individual loci, allowing for separate input evolutionary histories at each locus. Because genetic lineages in different species sometimes have relatively few sequence differences, however, the information in a locus can be insufficient to accurately infer gene trees, thereby allowing incorrect gene trees to adversely influence the constructed species tree (e.g., [14, 15]). Concatenation methods concatenate a set of multiple alignments and construct a tree from the concatenated alignment, treating the estimated “super-gene” tree as a species tree estimate [6, 13, 16]. Because concatenation combines all loci to form a single locus, and because different loci can have different evolutionary histories that are disregarded in the analysis of the concatenated alignment, the analysis of loci in this way can lead to incorrect species tree inferences [7]. Consensus and concatenation have in common that they are “two-stage” methods, in which species trees are inferred in two steps—inference of gene trees and then species tree inference for consensus, and inference of the super-gene tree followed by a conceptually substantial though methodologically trivial pronouncement that this tree is the species tree estimate for concatenation [6, 12, 13]. A third class of approaches can be labeled “single-stage” methods, in which species trees are inferred by simultaneously modeling the evolution of sequences among all sampled loci to output a species tree estimate [1720]. These single-stage model-based methods often have desirable statistical properties, but because they typically explore large spaces of possibilities rather than algorithmically constructing estimated trees, they can be computationally intensive and applicable only to smaller datasets.

Properties of species tree inference methods can be examined using a variety of frameworks, including theory, simulations, and empirical assessments. Theoretical investigations are often concerned with limiting properties as the number of loci approaches infinity [911, 15, 2127]. In-depth explorations of inference methods often rely on simulation studies, which are commonly used to investigate the performance of species tree inference methods on simulated multilocus datasets [10, 2832]. These theoretical and simulation-based studies have the advantage of knowing the true species tree, but the disadvantage that the scenarios they examine lack the complexity of empirical data.

An alternative approach is to evaluate methods on an empirical dataset in which the space of parameter values is defined by the evolutionary history of a group of species. Recent studies have empirically investigated the performance of species tree methods from multilocus datasets in a variety of organisms, including birds [3, 3336], insects [37, 38], newts [39], plants [40], primates [41, 42], rice [1], rodents [43], snakes [44], and yeast [4, 16, 36, 45, 46]. While some of these studies constructed highly-supported species trees, others did not, possibly due to high levels of genealogical discordance resulting from incomplete lineage sorting, hybridization, and ancient rapid radiations.

In one such study, [47] found that samples from a multilocus dataset of North American pines displayed widespread genealogical discordance. This pattern of incomplete lineage sorting is a common feature of long-lived shrubs and trees (e.g., [4850]), and likely arises from factors such as large effective population sizes and long generation times [51]. Because gene tree discordance is needed for different algorithms to produce different estimates, high levels of gene tree discordance make North American pines an interesting group in which to compare species tree inference methods.

In this study, we take an empirical approach to the evaluation of species tree inference methods by examining the performance of 72 strategies for inferring species tree topologies using multilocus data from North American pines. Each “phylogenetic inference strategy” consists of three components: a method of constructing species trees from gene trees (e.g., consensus or concatenation), a gene tree inference method (e.g., maximum likelihood, maximum parsimony, or neighbor-joining), and an outgroup species. Our framework thus focuses on two-stage inference strategies that can be separated into gene tree inference and species tree inference steps, so that the effect of the choices of gene tree and species tree estimators can be directly evaluated. We examine ∼47 kilobases (kb) of sequence spanning 121 nuclear loci sequenced in 120 individuals from eight ingroup species of Pinus subsection Strobus (Table 1) and three outgroup species of Pinus subsection Gerardianae. With our empirical approach, unlike in simulation-based and theoretical evaluations, the true species tree is not known. It is still possible, however, to evaluate properties of species tree estimators without knowledge of a true tree, by comparing the features of species trees inferred by different methods. We apply techniques from multivariate statistical analysis to sets of inferred species trees to compare characteristics of species trees estimated by different strategies and to identify groups of strategies that behave similarly.

Table 1 Ranges and morphological characteristics differentiating eight North American species of Pinus subsection Strobus

Methods

North American white pine dataset

A total of 120 individuals were sequenced in eight ingroup species of North American white pines from Pinus subsection Strobus (Pinus albicaulis Engelmann, P. ayacahuite Ehrenberg ex Schlechtendal, P. chiapensis (Martínez) Andresen, P. flexilis James, P. lambertiana Douglas, P. monticola Douglas ex D. Don, P. strobiformis Engelmann, and P. strobus L.) and three outgroup species from Pinus subsection Gerardianae (P. bungeana Zuccarini ex Endlicher, P. gerardiana Wallich ex D. Don, P. squamata X. W. Li), the identified sister lineage to Pinus subsection Strobus[47, 54]. Sequencing was conducted on haploid templates generated from DNA extractions of seed megagametophyte tissue; as a single haploid sequence was generated for each individual at each locus, no phasing was necessary. Gene sequences were obtained from 245 putative nuclear loci chosen from among ∼7,500 loci recently resequenced for loblolly pine (Pinus taeda L., http://loblolly.ucdavis.edu/bipod/ftp/) using single pass, bidirectional Sanger sequencing of PCR products amplified from haploid megagametophyte tissue excised from seeds of each species. Further description of laboratory protocols appears in [55]. Sequence data were pre-processed and organized using PINESAP[56], a bioinformatics pipeline that combines PHRED[57], PHRAP[58], and MUSCLE[59, 60] to call bases and align sequencing reads. Reported nucleotide sequences consisted only of A, C, G, T, missing, and gap information, with no other ambiguity codes used. After pre-processing, the data were manually assembled and aligned using CODONCODE (CodonCode Corporation, Dedham, MA). Bases were called using a minimum PHRED score [57, 61] of 25 for aligned bases. All polymorphisms were visually validated. All alignments were further aligned to resequencing data from P. taeda (unpublished data) using the profile-profile option in MUSCLE[59, 60]. These alignments are publicly available as part of the Dendrome project (http://loblolly.ucdavis.edu/bipod/ftp/). GenBank accession numbers for sequences in the study appear in Additional file 1: Table S1.

Of 245 loci sequenced initially, 37 were dropped from further consideration due to low overall quality of sequence reads. An additional 15 loci were discarded due to possible chloroplast or mitochondrial contamination, on the basis of BLAST analysis against pine organellar sequences in GENBANK[62]. Two loci were dropped due to sequence similarity to retroelement-like proteins, leaving 191 high-quality nuclear gene alignments. We then eliminated 70 loci for which at least one of the 11 species contained no data. This filter reduced the dataset to 121 loci, covering ∼47 kb of aligned sequence data.

Coding regions (i.e., site annotations) could be confidently identified for 112 of the 121 loci by further analysis using TBLASTX against protein-coding genes in Arabidopsis, Oryza, Picea, and Populus. For these 112 loci, the gene for the highest-scoring TBLASTX hit, in combination with the expressed sequence tag from loblolly pine, was used to identify coding regions. Site annotations for each alignment were validated with BLASTP analysis of the amino acid sequences derived from the inferred coding intervals against the gene that was used to derive the site annotations. For the data from the 112 annotated loci, ∼62% represents exonic regions, ∼18% represents intronic regions, ∼1% is from 5’ UTRs, and ∼19% is from 3’ UTRs. Because 112 of the loci could be confidently identified as belonging to coding regions, with a substantial fraction of exonic sequence, the data likely contain a mixture of non-neutral and neutral regions.

Overview of the analysis

The procedure for obtaining results for each of the 72 phylogenetic inference strategies (listed in Additional file 2: Table S1) appears in Figure 1. For a given strategy, we started from a dataset D with L loci. To generate distributions on the set of clades inferred by a given strategy, we used the bootstrap, creating bootstrap replicates by randomly choosing with replacement B sets of L loci. As many of the loci are coding and the eight pine species are closely related, we chose not to bootstrap across sites within a locus to ensure that bootstrapped alignments would contain reasonable levels of variation. Next, we applied a gene tree inference method to each bootstrap replicate dataset. Based on the set of inferred gene trees in a bootstrap replicate, we then applied a species tree construction method to estimate a species tree topology with one of the three outgroup species. For each phylogenetic inference strategy, we constructed B=1000 independent bootstrap datasets, thereby estimating 1000 species tree topologies. From these topologies, we created a list of clades, each with a corresponding count of its number of appearances in the 1000 bootstrap replicates. Clade lists were then analyzed to assess differences among the estimates produced by different strategies.

Figure 1
figure 1

Flow diagram representing the procedure in which we obtained results on the behavior of phylogenetic inference strategies. A boldface number attached to a downward arrow indicates the number of phylogenetic inference strategies generated by the box immediately above the arrow. Absence of a number indicates a value of 1. The number of strategies for a particular path from the topmost box to the bottom layer is calculated as the product of the boldface numbers visited during the traversal of the path. The number of phylogenetic inference strategies analyzed is 72, the sum over all paths from the topmost box to boxes in the bottom layer.

Creating datasets

Our final set of 121 loci contains many loci that are highly conserved across multiple species. Because of the high level of conservation, for these loci, little information exists for identifying relationships among lineages. Thus, if methods for inferring gene trees were applied to certain loci, the resulting gene trees would be highly unresolved and would therefore provide little information to species tree construction methods. This issue motivates the construction of datasets that attempt to reduce the chance of inferring highly unresolved trees, and that provide phylogenetic inference strategies with the maximal amount of sequence data available.

We therefore analyzed four carefully selected subsets of the initial dataset (Table 2; Additional file 3). Two of these are datasets of multiple alignments that contain information on a single individual per species ( D s and D s , 0 ). The other two contain information on multiple individuals per species ( D p and D p , 0 ). These four datasets are constructed such that each possesses desirable properties for certain strategies in the collection of 72 phylogenetic inference strategies, providing the strategies with as much information as possible to infer resolved phylogenies. For example, because it is desirable for a pair of species to have nonzero distance, we require pairs of distinct species to be separated by at least one observed mutation. Furthermore, because it is desirable to minimize missing data, we choose individuals that yield minimal missing data in a multiple alignment. One of the two datasets with a single individual sampled per species is optimized for locus-by-locus gene tree inference ( D s , 0 ), whereas the other is optimized for gene tree inference from a concatenated alignment ( D s ). Similarly, one of the two datasets with multiple individuals sampled per species is optimized for locus-by-locus gene tree inference ( D p , 0 ), whereas the other is optimized for gene tree inference using multiple loci simultaneously ( D p ). The procedures used for constructing these datasets appear in Sections on “Datasets with one individual per species” and “Datasets with multiple individuals per species”.

Table 2 Datasets

Let S k , k=1,2,…,11, denote the set of individuals from pine species k, considering eight ingroup species (S1,S2,…,S8) and three outgroup species (S9,S10,S11). Denote the amount of overlapping non-gap non-missing sequence between a pair of individuals x and y by n x y and denote the number of non-gap non-missing nucleotide differences between a pair of individuals x and y by d x y (0≤d x y n x y ). Further, denote the final dataset of L=121 loci by D={ A 1 , A 2 ,, A L }, where A is the set of aligned sequences at locus for individuals from all 11 species. It is from dataset that we create the four optimized datasets as summarized in Table 2 and Figure 2.

Figure 2
figure 2

Schematic for creating the four subsets D s , D s , 0 , D p , and D p , 0 from dataset. For the matrices of datasets , D s , D s , 0 , D p , and D p , 0 (see Table 2), each row is an individual and each column is a locus. Thick black lines in these matrices separate the individuals in different species. Gray boxes indicate missing sequences. (A) At each locus, a single sequence from each species (indicated in red) is selected from dataset . These selected sequences are used to create D s such that there exists a single sequence sampled per species at each locus. Sequences from a subset of loci in D s (indicated in yellow) are used to create dataset D s , 0 such that each locus has at least one nucleotide difference between each distinct pair of species other than pairs from distinct outgroups. (B) Dataset D p is the full starting dataset . At each locus , a distance matrix is created according to eq. 2. Sequences from a subset of loci (indicated in red) in D p are used to create dataset D p , 0 such that each locus has a nonzero p-distance between each distinct pair of species other than pairs from distinct outgroups. Observe that the D p , 0 matrix includes loci 3 and 7, which are not included in the D s , 0 matrix. Loci 3 and 7 are included in D p , 0 but not in D s , 0 because in D p , 0 , pairs of species contain at least one pair of individuals with different sequences, whereas in D s , 0 , at least one pair of the 11 selected individuals have identical sequences. Therefore, the set of loci in D p , 0 is a superset of the set of loci in D s , 0 , and the number of loci in D p , 0 is always greater than or equal to the number of loci in D s , 0 .

Datasets with one individual per species

The first dataset, D s , consists of alignments with a single individual sampled per species at each locus (not necessarily the same individual across loci). That is, we generate a dataset of multiple alignments at each of L loci with only one individual per species, thereby creating multiple alignments of 11 individuals. This dataset is used by phylogenetic inference strategies that utilize the concatenation-based species tree construction methods with the maximum likelihood, maximum parsimony, and neighbor-joining gene tree inference methods (see “Inferring gene trees” and “Inferring species trees”). To create D s , we choose the subset of 11 sequences A s at locus by first maximizing the total overlap sequence n( A s )= x , y A s , x y n xy and then, if there is a tie for the overlap n( A s ), maximizing the total number of substitutions d( A s )= x , y A s , x y d xy . In other words, for any other set of aligned sequences A R A at locus with a set R of only one individual sampled per species, the amount of overlapping non-gap non-missing sequence in A R is no larger than in A s , i.e., n( A R )n( A s ). We note that the quantity n x y represents a calculation only on a pair of individuals x and y, whereas n( A s ) considers all 11 2 pairs of individuals. Further, for any other set of aligned sequences A R A at locus with a set R of only one individual sampled per species and n( A R )=n( A s ), the total number of pairwise sequence differences in A R is no larger than in A s , i.e., d( A R )d( A s ). If multiple sets R of 11 individuals share the same values of n and d, then we choose the set of 11 individuals randomly among the tied sets. We choose the “optimal” set of 11 individuals at each locus in this way both to maximize the sequence contributions of individual loci to the inference of gene trees (maximizing n) and to maximize the potential for creating resolved gene trees (maximizing d).

The second dataset, D s , 0 , is a subset of D s with Ls,0L loci that consists of only those loci in D s for which there exists at least one nucleotide difference between each distinct pair of species (other than pairs of outgroup species). In other words, for any pair of individuals x and y with x,y A s and xy, d x y ≥0, and d x y >0 if x, y, or both are from species 1 through 8. This condition of at least one nucleotide difference between species pairs assists in constructing gene trees that are bifurcating. Dataset D s , 0 is used by phylogenetic inference strategies that utilize consensus methods with maximum likelihood, maximum parsimony, and neighbor-joining (see “Inferring gene trees” and “Inferring species trees” for details).

Datasets with multiple individuals per species

The third dataset, D p , is identical to our starting dataset . Thus, strategies that use D p consider all available sequences. Dataset D p is used by phylogenetic inference strategies that employ the concatenation-based species tree construction methods with the neighbor-joining gene tree inference method using multiple individuals (see “Inferring gene trees” and “Inferring species trees”).

Consider a dataset D of L loci sampled randomly with replacement from D p . Define

P ij all = 0 , i = j A D x , y A d xy 1 { x S i , y S j } A D x , y A n xy 1 { x S i , y S j } , i j ,
(1)

where the indicator random variable 1 { x S i , y S j } equals 1 if xS i and yS j and 0 otherwise. The distance matrix defined by eq. 1 is used to estimate gene trees for all strategies applied to D p . Given distinct species S i and S j , P ij all represents the p-distance (fraction of nucleotide differences; [63]) averaged over pairs of individuals, one from species i and the other from species j. Note that eq. 1 represents a weighted rather than unweighted average for the mean p-distance between species i and j. Although the distance is weighted, it is the same as a distance between pairs of species calculated on a concatenated alignment.

The fourth dataset, D p , 0 , is a subset of D p with Lp,0L loci. This subset consists of only those loci in D p for which there exists a pair of individuals in each distinct pair of species (other than pairs from distinct outgroups) with at least one nucleotide difference between them. Define

P ij = 0 , i = j x , y A d xy 1 { x S i , y S j } x , y A n xy 1 { x S i , y S j } , i j ,
(2)

where 1 { x S i , y S j } is an indicator random variable that equals 1 if xS i and yS j and 0 otherwise. The numerator of P ij represents the number of pairwise sequence differences, summed over pairs of individuals, one from species S i and the other from species S j , at locus . The denominator represents the sum across pairs of individuals, one from S i and the other from S j , of the non-gap non-missing sequence shared between pairs of individuals at locus . To construct D p , 0 , we create a subset of D p that consists only of those loci in D p for which the p-distance ( P ij >0) is nonzero between each distinct pair of species (excluding pairs from distinct outgroups). This dataset is utilized by phylogenetic inference strategies that employ consensus methods with gene trees inferred by neighbor-joining using multiple individuals (see “Inferring gene trees” and “Inferring species trees”). Similarly to dataset D s , 0 , this condition of a nonzero p-distance between species pairs assists in constructing bifurcating gene trees. We note that the species tree estimation approach taken in this study neither requires pairs of individuals in the same species to have nonzero distances nor to have distances of zero. We only enforce that the distance calculated between pairs of species is nonzero.

Inferring gene trees

For each of the four datasets D s , D s , 0 , D p , and D p , 0 , we inferred gene trees from bootstrap samples [6365] that contain loci randomly sampled with replacement from the dataset. For strategies applied to datasets D s and D s , 0 , we inferred gene trees from sequence alignments by applying either maximum likelihood (ML; [63], ch. 9) under a general time-reversible substitution model ([63], ch. 13), maximum parsimony (MP; [63], ch. 1), or neighbor-joining (NJ; [63], ch. 11) to a p-distance matrix calculated between pairs of alignments. For strategies applied to D p and D p , 0 , we inferred gene trees by applying neighbor-joining to the Pall and Pp-distance matrices, respectively. We term the method for inferring gene trees from the Pall and Pp-distance matrices “neighbor-joining using multiple individuals” (M). Gene trees were inferred using PAUP[66]. Note that the estimation of gene trees on the scale explored in this study would be computationally intensive on the full set of sampled individuals; thus, we do not consider gene tree inference directly from alignments with multiple lineages sampled within species, and when exploring multiple lineages (as in M), we do so only with distance matrices between pairs of species rather than pairs of lineages.

Inferring species trees

We view as a species tree inference method any method that outputs a species tree estimate. The six species tree inference methods in this study are Concatenation [16, 67], SuperMatrix Rooted Triple (SMRT; [22]), STEAC [21], STAR [21], Rooted Triple Consensus (RTC; [68]), and Minimize Deep Coalescences (MDC; [69, 70]). Concatenation and SMRT are concatenation-based, and STEAC, STAR, RTC, and MDC are consensus methods. Because we have adopted a unified two-stage framework for phylogenetic inference strategies in which gene trees are first inferred by one approach and species trees are then inferred from gene trees by a second approach, we did not investigate single-stage approaches such as BEST [18, 19], and *BEAST [20] that bypass gene tree inference or that perform gene tree inference simultaneously with species tree inference. Our analysis pipeline explores the performance of two-stage inference strategies when the roles of gene tree and species tree inference are separated, and it therefore requires that strategies estimate species trees from inferred gene trees and that they permit different gene tree inference methods to provide input to a given species tree method. The six species tree methods investigated in this article satisfy both of these conditions, whereas species tree methods such as BUCKy [17], BEST [18, 19], and *BEAST [20] do not. Further, the methods we have selected are well-suited to a computationally intensive bootstrap approach included in our pipeline for generating distributions of species tree topologies, and the more computationally intensive of the single-stage methods would not be easily accommodated within this framework. Given the large number of two-stage methods available, it would not be possible to be comprehensive; we have thus chosen a limited number of methods that represent a range of underlying principles. Our choice of methods permits a diverse set of criteria for estimating species trees to be evaluated, and the conceptual differences in the underlying methods enable some differentiation in behavior across methods.

Consider a set of L loci (multiple alignments) with m ingroup and one outgroup species. Concatenation methods concatenate the L alignments to create a single “super locus” consisting of an alignment of the m+1 species across L loci. From this alignment, a gene tree is inferred by either maximum likelihood, maximum parsimony, or neighbor-joining—note that the definition of Concatenation does not require that gene trees be estimated using any specific method—and is then taken as the species tree estimate. Similarly, SMRT creates a concatenated alignment of the m+1 species from a set of L alignments. However, SMRT then constructs from this concatenated alignment all m 3 concatenated alignments of three ingroup species and an outgroup species. Rooted three-taxon gene trees are then inferred from each of the m 3 concatenated alignments. A supertree algorithm is then applied to the set of rooted three-taxon gene trees to estimate an m-taxon species tree topology. This study uses the Modified Mincut supertree algorithm implemented in the program SUPERTREE[71] to construct a species tree from rooted three-taxon gene trees.

Consider a set of (m+1)-taxon gene trees (m ingroup and one outgroup species) inferred at each of L loci. STEAC estimates a species tree topology by using estimated mean coalescence times. For distinct species S i and S j , the mean coalescence time is computed as the estimated coalescence time for S i and S j , averaged over all L gene trees. This resulting mean is placed into a matrix of distances between species, to which neighbor-joining is applied to estimate the species tree topology (using the R PHYBASE package).

STAR estimates a species tree topology by using average coalescence ranks. It assumes that the rank of the root of a gene tree is equal to the number of species in the tree (m + 1 in our case). An internal node of a gene tree is then assigned one less than the rank of its immediate ancestor. For distinct species S i and S j , the average coalescence rank is computed as the rank of the node that connects S i and S j , averaged over all L gene trees. Similarly to STEAC, these average coalescence ranks specify a matrix of distances between species pairs. Neighbor-joining is applied to the matrix to estimate the species tree topology using PHYBASE.

RTC estimates a species tree from rooted three-taxon tree topologies. At each locus , =1,2,…,L, RTC finds the set of m 3 rooted tree topologies of three ingroup and one outgroup species that are displayed by the inferred gene tree at locus . RTC then applies quartet puzzling [72] to the m 3 L topologies to estimate the species tree topology (using the program TRIPLEC).

A coalescence event between a pair of lineages is considered “deep” if the coalescence does not occur in the first population in which the pair of lineages is capable of coalescing. Given a gene tree, the number of deep coalescences on a species tree is defined as the total number of “extra lineages”, summed across branches of the species tree topology, that is needed to fit the gene tree within the species tree topology. Here, the number of extra lineages for a branch is one fewer than the number of lineages that survive to the ancestral node of the branch; if incomplete lineage sorting does not occur, then only one lineage persists from a branch to a more ancestral branch, and there are no extra lineages. For a set of L gene trees, the number of deep coalescences for a species tree is the total number of deep coalescences for the species tree given a gene tree, summed across the L gene trees. MDC estimates a species tree topology by minimizing the number of deep coalescences. That is, MDC finds a species tree topology for which the number of deep coalescences that will fit the set of L gene trees within the species tree topology is minimal. This study utilizes the MDC implementation in PHYLONET[70].

Multivariate analysis

We aim to determine which of the 72 phylogenetic inference strategies perform similarly, and we use multivariate analyses to define clusters of strategies that provide similar species tree estimates. Consider a 72×145-dimensional data matrix S in which rows represent strategies and columns represent 145 observed clades, among the k = 2 8 - 1 8 k =246 possible non-trivial clades (i.e., clades that contain more than one species and fewer than all analyzed species) of eight species. Entry S i j in column i and row j of S is the number of times that strategy i infers clade j in 1000 bootstrap replicates across loci.

Principal components analysis (PCA) was applied to S to create a 72×2-dimensional matrix V, with rows representing strategies and the first and second columns representing the first and second principal components, respectively. Plotting strategies onto the space defined by these principal components yields a two-dimensional spatial “map” of phylogenetic inference strategies. We similarly applied multidimensional scaling (MDS) to a distance matrix for all 72 2 pairs of strategies, computing pairwise distance as the mean Robinson-Foulds distance [73] across all 106 pairs of bootstrap trees, and extracting the first two components. We calculated the Robinson-Foulds distance using TREEDIST in PHYLIP.

To compare spatial maps of phylogenetic inference strategies, we used Procrustes analysis [7476]. In particular, we compared the spatial distribution of a subset of 72-r strategies when analyzed alone to the spatial distribution for all strategies. The comparison enabled us to quantify the influence that a set of r strategies with a particular feature (i.e., species tree construction method, gene tree inference method, or outgroup species) has on the full spatial distribution. Consider a proper subset Σ={σ1,σ2,…,σ72-r} of the full set of strategies. Consider a (72-r)×145-dimensional data matrix S Σ in which rows represent the strategies in set Σ and columns represent observed clades. S Σ is a submatrix of S, in which the rows corresponding to strategies in Σ are selected from S. Consider a (72-r)×2 target matrix X and a (72-r)×2 comparison matrix Y. X is matrix V restricted to the set of strategies Σ. Y represents the first two principal components in the PCA applied to matrix S Σ . Now consider a (72-r)×2 matrix Z=b Y T+C, where b is a scaling factor, T is a 2×2 matrix that rotates and reflects Y, and C is a (72-r)×2 matrix that has constant columns and that is used to translate the matrix. Procrustes analysis seeks to find b, T, and C to minimize the sum of squared differences between X and some (72-r)×2 matrix Z=b Y T+C. That is, Z is formally defined as Z = argmin Z { i = 1 72 - r j = 1 2 ( X ij - Z ij ) 2 }. The dissimilarity measure between X and Z is computed as

i = 1 72 - r j = 1 2 ( X ij - Z ij ) 2 i = 1 72 - r j = 1 2 ( X ij - μ j ) 2 ,
(3)

where μ j = 1 72 - r i = 1 72 - r X ij is the jth entry of the centroid of X. This measure takes the sum of squared differences between points on the spatial maps defined by X and Z and normalizes it by the sum of squared differences between the points on the spatial map defined by X and their centroid.

Define a cluster as a set of strategies and let the centroid of a cluster be the location in the 145-dimensional space of clades whose coordinates are the means of those of all strategies in the cluster. Hierarchical clustering was performed by first creating a matrix of Euclidean distances between all 72 2 pairs of 145-dimensional vectors represented by the matrix S. Define the within-cluster sum of squared Euclidean distance as the squared Euclidean distance between a point in a cluster and the cluster centroid, summed over all points in the cluster. From the 72×72-dimensional matrix of Euclidean distances between strategies, a dendrogram relating the strategies was constructed using the Ward algorithm [77], which iteratively merges clusters until all points are contained within a single cluster. For a given iteration, two clusters are merged if their merged cluster has a smaller within-cluster sum of squared Euclidean distances than any other potential merged cluster. The nesting of clusters created by the algorithm defines the dendrogram.

We performed K-means clustering on the 72 145-dimensional vectors, using K clusters, K=2,3,…,9. Given K, strategies were separated into K clusters on the basis of the squared Euclidean distance between all pairs of the strategies in a 145-dimensional space. We ran 104 replicates with random starting locations. Each replicate yielded a total within-cluster sum of squared distances for the set of K clusters, representing the within-cluster sum of squared distances between points in a cluster and the cluster centroid, summed over all K clusters. We then chose the set of cluster assignments that had the minimum total within-cluster sum of squared distances, where the minimum was taken over all 104 replicate starting locations.

To compute the Pearson correlation coefficient between a pair of strategies, we only used points in the 145-dimensional vector that were nonzero in both strategies being compared.

Results

We accounted for the variable outcomes of individual phylogenetic inference strategies by applying the strategies to bootstrap datasets instead of their respective full datasets. Our analysis identified 145 distinct clades observed in the set of 72 phylogenetic inference strategies, among 246 possible non-trivial clades on eight species, across 1000 bootstrap replicates for each strategy. From these clades, we created a 72×145 matrix S in which each row is a strategy and each column is a clade. The value of S i j , the cell in row i and column j, is the number of times among the 1000 bootstrap replicates that strategy i inferred a species tree with clade j. This summarized dataset S of clade counts was used for all further analyses.

Clade size

We first investigated the level of balance [7881] in the tree topologies inferred by each phylogenetic inference strategy. The distribution of clade sizes (number of taxa within a clade) provides a basis for measuring tree topological balance. Topologies with numerous small clades tend to be more balanced than topologies with large clades. For example, consider the topologies T b a l =(((A B)(C D))((E F)(G H))) and T u n b a l =(((((((A B)C)D)E)F)G)H). Topology T b a l is the most balanced eight-taxon topology whereas T u n b a l is the most unbalanced eight-taxon topology. Considering non-trivial clades, T b a l has four clades of size two and two clades of size four. T u n b a l has one clade each of size two, three, four, five, six, and seven. Thus, the clades of T b a l are smaller than those of T u n b a l . The mean clade size for T b a l is ∼2.67 and the mean clade size for T u n b a l is 4.5.

Figure 3A displays the cumulative distribution of clade sizes for each of the 72 phylogenetic inference strategies, considering all 1000 bootstrap replicate species trees for each strategy. This cumulative distribution increases most quickly for strategies based on MDC, for which most of the distribution is located in clades of size two. By contrast, it increases most slowly for strategies based on SMRT and STEAC, for which much of the distribution is located in clades of size six and seven. Figure 3B displays a bar graph of the mean clade size for each of the 72 phylogenetic inference strategies. This graph shows that among all six species tree construction methods, the 12 MDC strategies have the smallest mean clade size as well as the smallest variance in mean clade size across the 12 combinations of outgroup and gene tree inference method. In contrast, SMRT and STEAC in general have the largest mean clade size. However, all 12 SMRT strategies infer trees with large mean clade size, whereas the mean clade size of STEAC varies across the 12 combinations of outgroup and gene tree inference method. Interestingly, the mean clade size averaged over all 12 strategies based on MDC is ∼2.79, a value that is close to the mean clade size for T b a l of ∼2.67.

Figure 3
figure 3

Distribution of clade size for phylogenetic inference strategies. (A) Cumulative distribution of clade sizes. Each bar represents a strategy, of which there are 12 per color. (B) Mean clade size for phylogenetic inference strategies, calculated as the mean size of all clades inferred across 1000 bootstrap replicates. Vertical lines centered at the top of vertical bars represent standard errors of mean clade sizes.

Clustering of strategies

We next used PCA, MDS, hierarchical clustering, K-means clustering, and correlation analysis on the matrix of clades S to identify phylogenetic inference strategies that perform similarly. Figure 4 displays plots of the first two principal components, which account for 38.94% and 18.96% of the variation across strategies, respectively. Figure 4A shows that separate clusters are formed by strategies that are based on Concatenation, SMRT, and STEAC, and that strategies based on STAR, RTC, and MDC form a cluster together. Further, a larger cluster is formed by strategies that are based on Concatenation, SMRT, and STEAC, and another larger cluster is formed by strategies that are based on STAR, RTC, and MDC. These larger clusters have a simple interpretation in that one of the larger clusters contains topologically-based strategies (STAR, RTC, and MDC) and the other contains strategies that are not strictly topologically-based (Concatenation, SMRT, and STEAC). Strategies are classified as topologically-based if they only use information on tree topologies to construct a species tree. In contrast, strategies are classified as not strictly topologically-based if they use information other than the gene tree topologies, such as sequence or branch length information, to construct a species tree. Relabeling the points in Figure 4A according to gene tree inference method, Figure 4B shows that strategies that are based on M (i.e., multiple individuals) form a cluster, and that there are no separate clusters for strategies that are based on ML, MP, or NJ. Figure 4C, which labels points according to outgroup, shows that no strategies separate into clusters based on the choice of outgroup. When we apply MDS to Robinson-Foulds distances between the sets of bootstrap replicate trees produced by pairs of strategies (Figure 4), we obtain similar observations of the clusters of strategies, detecting an important role for M and for the difference between topologically-based and non-topologically-based strategies, and no strong signal for the outgroup choice.

Figure 4
figure 4

Principal components analysis (PCA) and multidimensional scaling (MDS) of phylogenetic inference strategies. PCA was applied to 72 phylogenetic inference strategies, in which each strategy is viewed as a point in a 145-dimensional space of clades. MDS was applied to a distance matrix between all pairs of strategies, where the distance between a pair of strategies was the mean Robinson-Foulds distance between all pairs of bootstrap trees from one strategy and another strategy (i.e., mean of 106 comparisons). The plots show the first and second components. On a given row, each plot represents the same 72 points in the space of components 1 and 2; the three plots are colored differently to highlight different features. (A) Colors represent different methods for constructing species trees. (B) Colors represent different gene tree inference methods. (C) Colors represent different outgroups. The points on each graph represent different combinations of the three factors that form phylogenetic inference strategies. Each line in part A represents the resultant vector (scaled by a constant to lie within the span of the 72 points) for all 12 points of a certain method for constructing species trees. Each line in part B represents the resultant vector for all 18 points of a gene tree inference method (scaled by a constant). Each line in part C represents the resultant vector for all 24 points of an outgroup (scaled by a constant). The scaling constants in parts A, B, and C are distinct. Each of the shaded regions in parts A and B is a convex hull of the points from a particular method.

From Figure 4, we can see that much of the variation across the 72 phylogenetic inference strategies, as explained by PCA and MDS, is caused by M. Strategies based on M are more similar in clade outcomes to other strategies based on M than they are to other strategies that are not based on M. The magnitude of this effect can be quantified using Procrustes analysis, which demonstrates that M has a large influence on the spatial relationship among all other phylogenetic inference strategies (Additional file 2: Figure S1).

Figure 5 shows the results of our cluster and correlation analyses. The main clusters formed by phylogenetic inference strategies involve strategies based on the species tree construction methods Concatenation, SMRT, STEAC, and MDC or the gene tree inference method M (Figure 5). The clusters of strategies formed by K-means and the large groupings of strategies formed by hierarchical clustering are quite similar. Additionally, the correlation coefficient between clade vectors inferred by pairs of phylogenetic inference strategies is generally higher for pairs of strategies that are placed into the same cluster by either K-means or hierarchical clustering than for pairs of strategies that are not placed into the same cluster (Figure 5).

Figure 5
figure 5

Cluster and correlation analysis of phylogenetic inference strategies. Each leaf of the dendrogram corresponds to a different phylogenetic inference strategy for obtaining the rooted phylogeny of eight ingroup pine species. Blue squares directly below the dendrogram indicate the features used to construct a rooted phylogeny for the eight pine species. The first six rows below the dendrogram represent different species tree construction methods. The next four rows below the dendrogram represent gene tree inference methods. The following three rows below the dendrogram represent the outgroup species. The dendrogram was constructed by hierarchical clustering using the Ward algorithm [77] applied to a matrix of Euclidean distances between all 72 2 pairs of 145-dimensional vectors (each dimension representing a distinct clade). The remaining nine rows below the outgroups show the results of K-means clustering applied to the 72 145-dimensional vectors with K clusters, K=2,3,…,9. Below the cluster analysis is a heat map of the correlation coefficients between all 72 2 pairs of phylogenetic inference strategies. An entry in the heat map represents the Pearson correlation coefficient between a pair of strategies by only using points in the 145-dimensional vector that were nonzero in both strategies being compared.

Interestingly, the clustering of strategies by PCA and MDS in Figure 4 matches well with the groupings observed in Figure 5, which is likely driven by similar signals. In Figure 5, three large clusters are represented by the subtree to the left of the root of the dendrogram (i.e., the blue color in the K-means clustering with K=3) and by two subtrees to the right of the root (i.e., the pink and orange colors at K=3). The two subtrees to the right of the root (or pink and orange clusters defined by K-means clustering) involve strategies that are based on M (pink K-means cluster or left subtree on the right of the root of the dendrogram) or strategies that are based on species tree construction methods that are topologically-based (orange K-means cluster or right subtree on the right of the root of the dendrogram). That is, strategies that correspond to the orange cluster are based on either STAR, RTC, or MDC. In contrast, the subtree to the left of the root (or the blue cluster defined by K-means clustering) contains only strategies that use species tree construction methods that are not strictly topologically-based (i.e., Concatenation, SMRT, or STEAC).

From Figures 4 and 5, we find that phylogenetic inference strategies form three basic clusters: a cluster that involves strategies that are based on M, a cluster that involves strategies that are topologically-based, and a cluster that involves strategies that are not strictly topologically-based.

Clade flow

Following the “haplotype flow” computations of [82], we can view “clade flow” as a proportion of clades inferred by one phylogenetic inference strategy that are also inferred by another strategy. Figure 6 displays a heat map that represents a form of clade flow, where the cell at row i and column j in the heat map represents the fraction of clades inferred by strategy i that were not inferred by strategy j. By definition, the heat map is not symmetric. As can be seen from the mostly white and yellow boxes for rows corresponding to strategies based on M, these strategies tend to infer clades that are supported by other strategies. That is, if a species tree topology is inferred by a strategy that is based on M, then clades displayed by that topology will often also be present on species tree topologies inferred by other strategies. In Additional file 2: Figure S1, strategies based on M contribute to the most variation across strategies. A possible explanation for this observation is that the flow of clades is largely unidirectional. That is, if a strategy is based on M, then clades that are inferred by that strategy also tend to be supported by other strategies; however, if a strategy not based on M infers a clade, then that clade is not often supported by strategies based on M. Because clades inferred by strategies based on M also tend to be supported by other strategies, it follows that strategies based on M tend to infer clades that are also supported by other strategies based on M. This sharing of clades among strategies based on M causes those strategies to be more similar to each other than they are to strategies not based on M. In contrast to the results for M, as can be seen from the mostly dark boxes in rows for strategies based on MDC, strategies based on MDC tend to infer clades that are not supported by other strategies (especially when compared with strategies based on M).

Figure 6
figure 6

Heat map representing the “flow” of clades between phylogenetic inference strategies. We use clade flow to measure the proportion of clades inferred by one strategy that were not inferred by a different strategy. Phylogenetic inference strategies are ordered using the dendrogram. The cell at row i and column j represents the fraction of clades inferred by strategy i that were not inferred by strategy j. By definition, the heat map is asymmetric. Darker colors indicate lower levels of “flow” from a row to a column.

Similarly to the behavior of MDC, strategies that are based on Concatenation, SMRT, and STEAC together with ML, MP, or NJ share more clades with other such strategies (mostly white and yellow boxes) than with the remaining strategies (mostly dark boxes). In contrast, as was observed with M, strategies based on STAR and RTC together with ML, MP, or NJ share similar numbers of clades among other such strategies as with the remaining strategies (mostly yellow boxes). These results suggest that strategies that are topologically-based (i.e., STAR and RTC) tend to infer clades that are also supported both by other topologically-based strategies and by strategies that are not strictly topologically-based, whereas strategies that are not strictly topologically-based (i.e., Concatenation, SMRT, and STEAC) tend to infer clades that are not supported by strategies that are strictly topologically-based (i.e., STAR, RTC, and MDC).

Representative topologies

We next wanted to use a set of representative species tree topologies to highlight similarities and differences in topologies constructed by various strategies. Topologies were estimated using the Greedy Consensus algorithm [12] applied to clade counts. Because our previous results (Figures 4-5) indicate that the choice of outgroup species does not strongly influence the overall inferred topologies, it is sensible to average across outgroups. Therefore, we first present topologies for each of the 24 species tree–gene tree inference method pairs constructed from clade counts that were averaged over the three outgroups (Figure 7). Next, to obtain a clearer picture of the types of topologies that are inferred by the six species tree inference methods, we present topologies for each of the six species tree inference methods, constructed from clade counts that were averaged over gene tree inference methods and outgroup species (Figure 8). Finally, to assess the influence that various gene tree inference methods have on the overall inferred species tree topology, we present topologies for each of the four gene tree inference methods, constructed from clade counts that were averaged over species tree inference methods and outgroup species (Figure 9).

Figure 7
figure 7

Rooted consensus trees of phylogenetic inference strategies averaged over outgroups. For a given subset of the 72 phylogenetic inference strategies considered, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created using the clade counts in the set. Each clade count in the set has a maximum value of 1000, because each element of the set is an average over values that each have a maximum of 1000. Each consensus tree is the greedy consensus tree based on clade counts averaged over outgroup species. These trees disregard branch-length information.

Figure 8
figure 8

Rooted consensus trees of phylogenetic inference strategies averaged over outgroups and gene tree inference methods. For a given subset of the 72 phylogenetic inference strategies considered in this article, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created using the clade counts in the set. Each clade count in the set has a maximum value of 1000, because each element of the set is an average over values that each have a maximum of 1000. These trees disregard branch-length information. (A) Trees constructed using the 12 strategies that utilize Concatenation; (B) SMRT; (C) STEAC; (D) STAR; (E) RTC; (F) MDC.

Figure 9
figure 9

Rooted consensus trees of phylogenetic inference strategies averaged over outgroups and species tree construction methods. For a given subset of the 72 phylogenetic inference strategies considered in this article, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created using the clade counts in the set. Each clade count in the set has a maximum value of 1000, because each element of the set is an average over values that each have a maximum of 1000. These trees disregard branch-length information. (A) Trees constructed using the 18 strategies that utilize ML; (B) MP; (C) NJ; (D) M.

Figure 7 displays 24 topologies with clade support values for each combination of a species tree construction method and a gene tree inference method. The clade {P. chiapensis, P. strobus} is generally highly supported, appearing for 22 of 24 strategies, with support ranging from 382 to 982 among 1000 bootstrap replicates. The smallest support values for {P. chiapensis, P. strobus} occur in strategies that use SMRT with ML, MP, and NJ, producing support values of 382, 406, and 395, respectively. The largest support values for this clade occur in strategies that use M, with values ranging from 824 to 982. Further, although strategies based on SMRT with ML, MP, and NJ yield lower support values than other strategies, when SMRT is combined with M, the support for {P. chiapensis, P. strobus} is 905. In addition, although two of the strategies based on STEAC do not support {P. chiapensis, P. strobus}, when STEAC is combined with M, the support for the clade is 982. Another clade that is highly supported is {P. ayacahuite, P. flexilis, P. strobiformis}. This clade is observed across all strategies, with support among non-MDC strategies out of 1000 bootstrap replicates ranging from 858 to 1000. Strategies that use MDC with ML, MP, and NJ yield support values for {P. ayacahuite, P. flexilis, P. strobiformis} of 560, 407, and 415, respectively. However, using MDC with M yields a support value of 933 for {P. ayacahuite, P. flexilis, P. strobiformis}. Across the 24 trees, the topological positions of P. albicaulis, P. lambertiana, and P. monticola are variable and are generally poorly supported. Each of these species is found in a variety of positions across all trees.

Figure 8 displays six topologies with clade support values for each species tree construction method. Similarly to Figure 7, the clade {P. chiapensis, P. strobus} is generally highly supported across all six species tree construction methods, with support ranging from 522 to 876 among 1000 bootstrap replicates. Also, as in Figure 7, the clade {P. ayacahuite, P. flexilis, P. strobiformis} is highly supported across all six species tree construction methods, with support ranging from 579 to 999 among 1000 bootstrap replicates. From these topologies, we can also observe that in agreement with the clade size distribution, strategies based on Concatenation, SMRT, and STEAC tend to produce more unbalanced trees than strategies based on STAR, RTC, and MDC (Figure 3). Strategies based on Concatenation, SMRT, and STEAC support topologies in which P. lambertiana is on the opposite side of the root from the other seven species. In contrast, strategies based on STAR, RTC, and MDC place P. monticola and P. lambertiana as sister species. These results support the observations from Figures 4, 5, and 6 that strategies based on species tree construction methods that are topologically-based behave differently from strategies that are not strictly topologically-based.

Figure 9 displays four topologies with clade support values, considering each gene tree inference method and combining species tree construction methods for each gene tree inference method. As in Figures 7 and 8, the clades {P. chiapensis, P. strobus} and {P. ayacahuite, P. flexilis, P. strobiformis} are generally highly supported across all four gene tree inference methods, with supports among 1000 bootstrap replicates respectively ranging from 610 to 931 and from 858 to 988.

Discussion

In this article, we have empirically evaluated strategies for inferring species tree topologies from multilocus sequence data. We have found that MDC tends to infer balanced topologies, whereas SMRT and STEAC tend to infer more unbalanced topologies. This bias toward balanced topologies exhibited by MDC is a consequence of the nature of the criterion that MDC uses to construct species trees, reflecting a theoretical finding that species trees with more balance have lower deep coalescence costs [83].

The strategies that we have examined fall into three classes in terms of the species tree inferences they produce: strategies applied only to datasets including all available sequenced individuals (i.e., M), topologically-based strategies (i.e., STAR, RTC and MDC), and strategies that are not strictly topologically-based (i.e., Concatenation, SMRT, and STEAC). While it is not unexpected that some approaches would behave similarly, it is surprising that strategies did not cluster based on the dataset or approach used (e.g., consensus or concatenation). Instead, strategies that take quite different species tree construction approaches (e.g., consensus-based STEAC and concatenation-based Concatenation and SMRT) form a cluster. Topologically-based strategies tend to infer clades that are supported by other strategies, whereas strategies that are not strictly topologically-based tend to infer clades that are not always well-supported by other strategies. For example, clades inferred from strategies that are not strictly topologically-based tend not to appear on trees that were inferred from strategies that are topologically-based. A possible reason for this observation could be a lack of phylogenetic signal to properly infer branch lengths; that is, if a phylogenetic inference strategy is not strictly topologically-based, then sequences with little phylogenetic signal (e.g., due to low substitution rate or short length) can strongly influence the species tree inferred by that strategy. Because STEAC uses branch-length information to infer a species tree topology, sequences with little signal can reduce its performance relative to topologically-based methods such as STAR [15].

Although our main goal has been to use North American pines to investigate relationships among phylogenetic inference strategies, our results also provide some information about the phylogenetic placement of the pine species in the study. This analysis is the first multilocus study to provide substantial confidence for a sister relationship of P. chiapensis and P. strobus. Pinus chiapensis is a threatened species of Mexico and Guatemala whose phylogenetic affinity has been uncertain. Morphological and molecular evidence have been used to alternately argue for a sister relationship between P. chiapensis and P. strobus, from eastern North America, or between P. chiapensis and P. monticola, from western North America [47]. Here, 22 of 24 trees in Figure 7 grouped P. chiapensis and P. strobus as sister taxa, mostly with reasonably high bootstrap support. When phylogenetic inference strategies were averaged over either gene tree methods (Figure 8) or species tree methods (Figure 9), the {P. chiapensis, P. strobus} clade was always recovered.

The close phylogenetic affiliation of P. ayacahuite, P. flexilis, and P. strobiformis has long been suspected, as these three species represent similar forms that are continuously distributed from southern British Columbia and Alberta into Honduras [84]. Here, the {P. ayacahuite, P. flexilis, P. strobiformis} clade is well-supported, although relationships among these three species are less clear. Two possibilities, namely ((P. flexilis, P. strobiformis), P. ayacahuite) and ((P. ayacahuite, P. strobiformis), P. flexilis), appear more likely based on our analysis (Figures 7, 8 and 9). Interestingly, Figure 7 finds that the ((P. ayacahuite, P. strobiformis), P. flexilis) clade is well-supported by all strategies that use M.

Beyond these clades, the full phylogeny of this group of pines remains unclear. Considering the trees inferred in Figure 7, relationships among P. albicaulis, P. lambertiana, P. monticola, and the clades {P. chiapensis, P. strobus} and {P. ayacahuite, P. flexilis, P. strobiformis} are not stable across inference strategies, and bootstrap support is generally low. We might have expected greater resolution in this study, due to the exhaustive sample of the ingroup, extensive intraspecific sampling, large molecular dataset, and ease of species delimitation (the eight ingroup species include well-defined taxa that are morphologically, ecologically, phenologically, and generally geographically distinct).

We can attribute the lack of resolution in the pine phylogeny to several possible sources. First, the loci in the study were chosen because they amplify across a broad range of taxa from subgenus Strobus (only eight of whose members are included here), and might therefore be more slowly evolving and less informative for phylogenetic inference than typical loci. Thus, the size of the dataset might not be indicative of its information content for phylogenetic inference. Second, we have focused on strategies that have been implemented for ease of comparison and have not explored the full collection of available methods (e.g., [6, 20, 85, 86]), nor have we considered such techniques as investigation of different subsets of taxa or loci on the basis of the strategies that we have studied (e.g., [16]). A study with a primary goal of resolving the pine phylogeny might achieve greater resolution through analyses that deviate from our standardized procedure. Third, the speciation events of interest might have occurred fast enough that retention of ancestral polymorphisms, as has been observed elsewhere among conifers [47, 48, 50, 87], might inhibit convergence on a stable, well-supported topology. Further work with more loci or faster-evolving loci will be important for distinguishing among these alternatives.

One caveat for interpreting our results is that except in our analyses based on M, we only considered a single lineage sampled within a species. Information on multiple lineages of the same species can have a significant effect on the performance of species tree inference methods, and many methods can use information on coalescences within and between species as part of the inference process (e.g., [1921, 23, 2527, 8890]). Therefore, it is important to keep in mind that we have used one of a number of potential schemes for sampling individuals within our data, as sampling scheme can have an impact on the efficacy of species tree estimators [15, 20, 36, 91, 92].

Another caveat is that some of the datasets were obtained from procedures designed to maximize information content at each locus. These optimization procedures yielded datasets with one sequence sampled per species. Because the sequences within these optimized datasets are no longer randomly sampled within each species, a possible concern is that our results are not representative of random samples. This concern might be warranted when considering the inferred relationships of the various pine species in Figures 7, 8 and 9. However, as the strategies applied to each of these optimized datasets retain their general relationships across datasets (e.g., those that are topologically-based and those that are not strictly topologically-based), the conclusions drawn in this article should also hold for large noncoding datasets. Additionally, it is important to mention that an identical dataset is not used for all strategies considered, and notably, strategies that used M relied on different datasets from those that used ML, MP, and NJ. Although this is a limitation, the clustering pattern suggests it is not a major concern. In our principal components (Figure 4), cluster (Figure 5), and correlation (Figure 5) analyses, though the strategies were split by whether they used M or were either topologically-based or not strictly topologically-based, these three categories do not precisely map onto the different datasets. Therefore, though the dataset varies across the 72 strategies, other factors beyond the difference in datasets are contributing substantially to the difference in results.

Finally, a third caveat is that to maintain a uniform procedure across strategies, we did not estimate the nucleotide substitution model before applying maximum likelihood. This choice might have caused some systematic bias in ML gene tree estimates by overparametrizing the substitution model. However, we found that our inference strategies did not cluster by whether maximum likelihood, maximum parsimony, or neighbor-joining was used (see Figures 4B and 5), suggesting that any systematic bias due to using a general time-reversible substitution model did not drive our observed clustering patterns.

Conclusions

Based on a collection of two-stage strategies that we have investigated, representing a subset of all available methods, our analyses have highlighted several aspects of phylogenetic inference strategies that enable recommendations for inferring rooted phylogenies from large-scale multilocus data. First, it is beneficial to examine multiple strategies [93], considering some methods that use only topological information (e.g., STAR, RTC, and MDC) and others that also incorporate additional information (e.g., Concatenation, SMRT, and STEAC). If species tree topologies returned by these different classes of species tree construction methods agree, then an investigator can be more confident in the inferred tree topology. Second, estimates should not be based solely on species tree construction methods that appear to be biased toward certain types of topologies (e.g., MDC). Instead, it is preferable to utilize these types of methods in conjunction with other approaches. For example, after obtaining an unbalanced inferred tree from an inference method, if MDC also infers the same unbalanced topology, then we can be more confident that the true species topology is actually unbalanced. Finally, it is best to utilize as much information as is available on individuals at every locus. That is, if multiple individuals are sampled within a species at a given locus, then we should use all available sequence data from the species (i.e. as many sampled individuals as possible, as in strategies that are based on M). This point is supported by the observation that clades inferred by M tend to “flow” to other strategies (Figure 6). Based on our findings, we recommend the joint consideration of multiple approaches to estimating species trees that originate in different locations in the space of methods and that exhibit diverse properties in their species tree estimates.

Availability of supporting data

The data used in this study are available in Additional file 3.