Background

Many biological processes can be better understood in the framework of reliable phylogenetic analyses. This is true not only for evolutionary systematics and phylogenetics, including the Tree of Life (TOL), but also for our understanding of diversification at the subcellular, cellular and organismal levels of integration. One well documented example is the postulated whole-genome duplication (WGD) that occurred during the evolution of some species belonging to the Saccharomycotina [1]. Only with a correctly inferred phylogeny was it possible to distinguish between "pre-WGD" and "post-WGD" species of Saccharomycotina. Other examples concern the evolution of metabolic pathways [2], the structure of genomes [3, 4], life styles [5], and pathogenicity [6].

Until recently, our understanding of the (fungal) TOL has been based on two approaches, which basically differ in the number of species and genes considered: (1) few genes and a large number of species; (2) a large number of genes and few species. The clear advantages of the first approach are, first, the availability of many sequences, e.g. of the rDNA locus, in publicly available databases (e.g. that of the National Center for Biotechnology Information, NCBI) and, second, the relative ease of generating complete or partial sequences of a few genes for a large number of species. In addition, the rDNA loci are universally present in all branches of the TOL, universal primers are well known, and the loci have been explored successfully in many branches of the TOL. The disadvantage of the rDNA loci, however, is that the deeper branches are usually less well supported [7]. In response, various authors started to include multiple protein coding genes in their phylogenetic analyses [8–10]. Unfortunately, the rationale behind the selection of these protein coding genes is not always clear, and discrepancies and incongruences between individual gene trees may result in unresolved phylogenetic trees [7, 8]. This may be due to different evolutionary rates and/or different origins of the genes, e.g. whether they are nuclearly encoded (e.g. RPB1 and RPB2) or mitochondrial in origin (e.g. ATP6). In the second approach, large numbers of genes have been used for phylogenetic studies in an attempt to address the shortcomings of the first approach. This was first applied to prokaryotes [11] and, more recently, to eukaryotes as well [12–14]. A large selection of genes and/or proteins is concatenated and used for inferring phylogenetic relationships, thereby increasing the phylogenetic signal considerably [12, 14–17]. However, although this approach resolved the fungal phylogenetic tree [12, 14, 16, 17], it also suffers from some limitations. For instance, it does not take into consideration the evolutionary history of each individual gene and it depends on the availability of complete genome data.

Here, we explored the usefulness of comparing the cophenetic correlation coefficients (CCCs) among distance matrices of individual gene trees in order to make a phylogenetically meaningful selection of orthologs for further phylogenomic studies as well as large scale TOL and barcoding applications. We used the fungal kingdom as an example because it represents one of the major clades of life, with an estimated 1.5 million species [18], of which only approximately 80,000 have been described. Moreover, the fungi are morphologically, metabolically and ecologically highly diverse and, importantly, the number of completely sequenced genomes is high among the eukaryotes.

Candidate proteins to be considered for TOL and/or barcoding studies were assessed from 33 fungal proteomes by (i) comparing the distance matrices of the individual orthologous proteins (KOGs) with each other, (ii) comparing these with that of a well supported guide tree [14], and (iii) analyzing their phylogenetic signal. The method presented here may be universally applied for the selection of markers in various TOL and barcoding studies.

Results and Discussion

The 33 genomes investigated shared 4852 KOGs, of which 70 were single copy proteins. The function of these 70 KOGs was assessed from the Saccharomyces cerevisiae genome database [19] (Additional file 1). The corresponding systematic name, standard name, description, chromosome number and knock out phenotype are presented in Table 1 (Additional file 1). For 32 genes the knock out phenotype in S. cerevisiae was lethal (Table 1) [19], suggesting that they code for essential proteins. Genes coding for the 70 KOG proteins are distributed over almost all chromosomes of S. cerevisiae, the exception being chromosome VI (Table 1), and thus represent the genome as a whole.

Table 1 Correlation values of KOG distance matrices compared to that of KOG2671, KOG functional category, and the systematic name, systematic deletion and chromosome number of the Saccharomyces cerevisiae (Sce) ORFs [19] corresponding to the single protein KOGs.

Comparing the CCC values among the 531 distance matrices analyzed before [14] (531 × 531 comparisons) using Pearson's correlation indicated that KOG2671 is the single copy protein with the highest correlation value, 0.96 (Additional file 2). This KOG2671 protein (putative RNA methylase according to the KOG annotation) corresponds to ORF YOL124c of S. cerevisiae [Catalytic subunit of an adoMet-dependent tRNA methyltransferase complex (Trm11p-Trm112p), required for the methylation of the guanosine nucleotide at position 10 (m2G10) in tRNAs; contains a THUMP domain and a methyltransferase domain]. The CCC values of the remaining 69 single copy KOGs were compared with that of KOG2671. Any of the next five single protein KOGs present in the list of 531 KOG proteins [14], namely KOG2728 (Uncharacterized conserved protein with similarity to phosphopantothenoylcysteine synthetase/decarboxylase), KOG0991 (Replication factor C, subunit RFC2), KOG0340 (ATP-dependent RNA helicase), KOG0809 (SNARE protein TLG2/Syntaxin 16), and KOG3786 (RNA polymerase II accessory factor Cdc73p), could also have been used as a starting point for this comparison, because their correlation values ranged between 0.95 and 0.96 (Additional file 2). The correlation values between the distance matrix of KOG2671 and that of each of the remaining 69 KOG proteins ranged from 0.08 to 0.93 (Table 1), and were statistically significant (Additional file 3). The majority of the KOGs (i.e. 64 of the 70 KOGs) gave correlation values higher than 0.50 (Table 1). As an example, we constructed a phylogenetic tree based on concatenation of these 64 KOGs (Fig. 1), which is in accordance with previously published trees. Four KOGs gave CCC values below 0.36 (Table 1), indicating that they have different phylogenetic signals. This is supported by the resulting phylogenetic tree, which shows a different topology (Additional file 4) from that based on 64 KOGs (Fig. 1). For instance, the Pezizomycotina formed a sister clade to the Basidiomycetes, and S. pombe occurred as a basal lineage to both of them, but without statistical support (Additional file 4).

Figure 1

Phylogenetic relationship of 33 complete fungal genomes. The same tree topology is given by concatenation of 30, 40, 50, 60 and 64 KOG proteins with correlation values >0.50 when compared to the reference KOG2671 distance matrix. Asp. = Aspergillus, Can. = Candida, Cry. = Cryptococcus, Sac. = Saccharomyces, Ash. = Ashbya. Phyla: I = Ascomycota, II = Basidiomycota, III = Zygomycota. Subphyla: IA = Saccharomycotina, IB = Pezizomycotina, IC = Taphrinomycotina, IIA = Agaricomycotina, IIB = Ustilaginomycotina, IIIA = Mucormycotina. IB1 = Sordariomycetes, IB2 = Leotiomycetes, IB3 = Eurotiomycetes, IB4 = Dothideomycetes. Support values indicated on the branches were obtained by bootstrap analysis using 100 replicates. * indicates support values of 98–100%.

Among the KOG proteins with CCC values above 0.50 are many proteins involved in cellular processes and signaling. The other three KOG categories [20], namely information storage and processing, metabolism, and poorly characterized, seem to be less informative (Fig. 2). When the KOG proteins are concatenated in increasing numbers (e.g. the 10 with the highest CCC values, the 20 with the highest CCC values, and so on), the CCC values remain above 0.8 until 44 proteins have been concatenated (Fig. 2). Thereafter, the CCC values show a sharp decline, indicating that KOG proteins 44–64 have different phylogenetic signals. Interestingly, the topology of the phylogenetic trees stabilizes after the concatenation of 40 proteins (Additional file 5). After concatenation of only 10 and 20 proteins, the lineage with C. glabrata, S. kluyveri, K. lactis and A. gossypii, that of C. lusitaniae, D. hansenii, C. guilliermondii and C. albicans, and the Euascomycete lineage of C. globosum, N. crassa, M. grisea and F. graminearum showed varying topologies (Additional file 5). Bootstrap values of most branches were high irrespective of the number of proteins concatenated (Fig. 1, Additional file 5). However, for two branches that received lower bootstrap values, labeled 7 and 9 in Additional file 5, the maximum value (85%) was obtained after concatenation of 40 KOG proteins. The A. gossypii-K. lactis-Sac. kluyveri lineage (labeled as branches 4 and 5 in Additional file 5) received only low support, and this was true even after concatenation of 531 orthologues [14]. This most likely indicates that further improvement can only be obtained by additional species sampling in this lineage. In summary, we estimate that 40–45 concatenated single copy protein KOGs are needed to fully resolve the fungal TOL. Below this number the tree topology may be different, and above this number the CCC values as well as the support values tend to drop.

Figure 2

Graph representing the number of concatenated KOGs (x-axis) per functional KOG category (information storage and processing; cellular processes and signaling; metabolism; poorly characterized), and the correlation values between the KOG2671 distance matrix and each distance matrix of the 70 KOGs (right y-axis). The left y-axis illustrates the cumulative values of each KOG functional category when they are concatenated. The KOG protein corresponding to each number on the x-axis is listed in Table 1, and the corresponding functional category is given in Supplemental Table 1.

Reevaluating fungal TOL

In all phylogenetic trees using 10–64 concatenated single KOG proteins, the clades I, II and III correspond to the Ascomycota, Basidiomycota and Zygomycota phyla, respectively (Fig. 1, Additional file 5), thus agreeing with analyses using a supertree method [16], a super alignment using restricted orthology [21], and concatenation of six genes [10], and 153 [15] and 531 proteins [14], respectively. Not surprisingly, the Ascomycota formed a sister clade to the Basidiomycota, with the Zygomycota forming a basal lineage.

The Ascomycota are well represented because of the number of available sequenced genomes, and are subdivided into the subphyla Pezizomycotina, Saccharomycotina and Taphrinomycotina (Fig. 1). The Saccharomycotina (clade IA) formed a sister group to the Pezizomycotina (clade IB), with the Taphrinomycotina (clade IC) forming a basal lineage to both (Fig. 1). The resolution of the Saccharomycotina and Pezizomycotina is in agreement with previous phylogenomic analyses [10, 16, 21].

The phylogenetic structure of the subphylum Saccharomycotina in our tree (Fig. 1) is similar to that based on a combination of 153 protein families [15], but slightly differs from that based on an analysis using six combined genes [10]. Noticeable differences are the positions of D. hansenii, C. guilliermondii, C. lusitaniae and C. albicans. In our analysis and the study of Fitzpatrick et al. [16], these four species formed a single cluster (Fig. 1), while in the six-gene analysis [10], C. albicans clusters with C. guilliermondii, and D. hansenii with C. lusitaniae.

Within the Saccharomycotina, seven species evolved after WGD [1], namely S. cerevisiae, S. bayanus, S. castellii, S. kudriavzevii, S. mikatae, S. paradoxus and C. glabrata. The basal position of C. glabrata among these species agrees with results from Fitzpatrick et al. [16], but only after removal of fast evolving site classes in their dataset. The phylogenetic structure of the Saccharomyces sensu stricto species, S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii and S. bayanus, agreed with previous results of Rokas et al. [12] and Kuramae et al. [14], but was somewhat different from that obtained by comparative genomic hybridization (CGH) [22] and a four-gene analysis [8] (Additional file 6). In the CGH study the positions of S. mikatae and S. kudriavzevii differ, whereas in the four-gene analysis S. cerevisiae, S. paradoxus and S. mikatae occupied different positions.

The subphylum Pezizomycotina is divided into four clades: Sordariomycetes (clade IB1), Leotiomycetes (clade IB2), Eurotiomycetes (clade IB3) and Dothideomycetes (clade IB4) (Fig. 1). The phylogenetic positions of the Sordariomycetes, Leotiomycetes and Dothideomycetes have been a matter of controversy. According to our analysis, the Sordariomycetes and Leotiomycetes are sister clades, which is in agreement with other studies [10, 16, 23], although the tree in the latter study was only weakly supported. All these results are, however, in disagreement with data resulting from a four-gene analysis [9], in which the Dothideomycetes occurred as a sister clade to the Sordariomycetes. The position of Stagonospora nodorum (Dothideomycetes) as a basal lineage in the Pezizomycotina is highly supported in our analysis (> 90% bootstrap) (Fig. 1, Additional file 5) and agrees with data from James et al. [10] and Robbertse et al. [17], who used maximum parsimony. However, in analyses based on a supertree method, on 153 concatenated proteins [15] and on four genes [9], S. nodorum was positioned next to the Eurotiomycetes [9, 16, 21] or close to the Sordariomycetes and Leotiomycetes [16].

All analyses using concatenated proteins with CCC values above 0.50 (Fig. 1, Additional file 5) positioned S. pombe (Taphrinomycotina) as a basal lineage within the phylum Ascomycota, which is in concordance with many other studies [10, 14–16, 21] using different sets of genes or orthologous proteins and different methods of analysis [15]. However, in another study [15], part of the concatenated orthologues resulted in a different position, which was explained by assuming a different evolutionary origin of these proteins.

The topology of the few basidiomycetous species included, representing only the two subphyla Agaricomycotina (clade IIA with Coprinopsis cinerea, Phanerochaete chrysosporium, Cryptococcus neoformans var. neoformans, C. neoformans var. grubii) and Ustilaginomycotina (clade IIB with Ustilago maydis) (Fig. 1, Additional file 5), agrees with previous studies [10, 16].

Our method of protein selection using CCC values of individual protein distance matrices seems a useful approach, as the resulting phylogenetic trees are largely in agreement with those published elsewhere and, importantly, most of the branches are well supported. The resulting selection of proteins may also be used to analyze the majority of fungal species for which a full genome is not yet available, in order to improve our understanding of the fungal TOL.

The performance of our method relative to the recent AFTOL study [10] was assessed by comparing CCC values between the protein distance matrix of reference KOG2671 and that based on the combined data set of six AFTOL genes. The correlation value obtained was 0.73, indicating that our reference protein has a rather similar phylogenetic signal to that of the AFTOL genes. However, the inclusion of more genes increases the phylogenetic signal, as demonstrated in our analysis (Fig. 1, Additional file 5), which may contribute to the resolution of discordant branches, such as that of the A. gossypii-K. lactis-S. kluyveri clade.

Conclusion

In short, the set of proteins resulting from our studies presents a good selection to be elaborated in further studies on fungal TOL, which may include many non-sequenced species. As the proteins were selected across the fungal kingdom and because they represent single KOG proteins, they may also be suitable for the development of molecular barcodes. This proposed method is universal and can be extended easily to bacterial and archaeal TOLs as well as other eukaryote lineages of TOL.

Methods

Assignment of genomes to KOG

In this study we used the complete genomes of 33 fungi and one metazoan (Caenorhabditis elegans) (Table 2). The group orthology framework presented in the KOG database [20] was the basis of our analyses. KOGs of Caenorhabditis elegans, Saccharomyces cerevisiae S288c and Schizosaccharomyces pombe were obtained from the KOG database [24]. Thirty-one proteomes (Ashbya gossypii, Aspergillus fumigatus, Asp. nidulans, Botrytis cinerea, Candida albicans, Can. glabrata, Can. guilliermondii, Can. lusitaniae, Chaetomium globosum, Coccidioides immitis, Coprinopsis cinerea, Cryptococcus neoformans var. neoformans, Cryp. neoformans var. grubii, Debaryomyces hansenii, Fusarium graminearum, Kluyveromyces lactis, Magnaporthe grisea, Neurospora crassa, Phanerochaete chrysosporium, Rhizopus oryzae, Saccharomyces cerevisiae RM11-1a, Sac. bayanus, Sac. castellii, Sac. kluyveri, Sac. kudriavzevii, Sac. mikatae, Sac. paradoxus, Sclerotinia sclerotiorum, Stagonospora nodorum, Ustilago maydis and Yarrowia lipolytica) were assigned to orthologous groups using the STRING program as described before [25].

Table 2 Genome sources, genome size (Mb), number of KOGs assigned to each genome used in the study

Comparison of KOGs represented by single protein

In order to avoid problems of paralogy we selected only those 70 KOGs represented by a single protein and shared by the 33 complete fungal genomes. First, the proteins of each KOG that fulfilled this criterion were aligned with Clustal X [26]. Second, poorly aligned positions and divergent regions in each KOG alignment were removed by using Gblocks 0.91b [27]. The threshold parameters used were: minimum number of sequences for a conserved position = 50% of the number of sequences + 1, minimum number of sequences for a flank position = 85% of the number of sequences, maximum number of contiguous nonconserved positions = 8, minimum length of a block = 10, gap positions not allowed, similarity matrices used. Third, the distance matrix (percent divergence) of each KOG protein was calculated between all pairs of sequences in the multiple alignment of that KOG. Finally, all KOG protein distance matrices were compared with each other (70 × 70) by Pearson's correlation.
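The last two steps above (a percent-divergence distance matrix per KOG, followed by Pearson's correlation between matrices) can be sketched as follows. This is a minimal illustration and not the software used in this study: the handling of gapped columns, the function names and the toy sequences are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def percent_divergence(seq_a, seq_b):
    """Percentage of differing residues over columns where neither sequence has a gap.

    Skipping gapped columns is an assumption of this sketch.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    mismatches = sum(1 for a, b in pairs if a != b)
    return 100.0 * mismatches / len(pairs)


def distance_vector(alignment, taxa):
    """Upper triangle of the pairwise distance matrix, in a fixed taxon order."""
    return [percent_divergence(alignment[x], alignment[y])
            for x, y in combinations(taxa, 2)]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy alignments (not real KOG data), four taxa each.
taxa = ["sp1", "sp2", "sp3", "sp4"]
kog_a = {"sp1": "MKVLAAGT", "sp2": "MKVLAAGS", "sp3": "MRVLGAGS", "sp4": "MRILGTGS"}
kog_b = {"sp1": "GHTWLLKE", "sp2": "GHTWLLKD", "sp3": "GHSWLIKD", "sp4": "GHSWMIKD"}

r = pearson(distance_vector(kog_a, taxa), distance_vector(kog_b, taxa))
print(f"correlation between the two distance matrices: {r:.2f}")
```

Only the upper triangle of each matrix is used, because the matrices are symmetric and the diagonal is zero; both vectors must list the taxon pairs in the same order for the correlation to be meaningful.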

Selection of the reference KOG distance matrix

The distance matrices of the 531 KOGs used by Kuramae et al. [14] were calculated. Then, the correlation matrix values between distance matrices were determined by Pearson's correlation as described. To find the KOG distance matrix to be used as reference we selected the single copy KOG protein with the highest correlation value. This reference distance matrix was then compared to the distance matrices of the remaining 69 KOGs selected.
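The selection of the reference can be sketched as below. The text does not state how the "highest correlation value" is aggregated over the 531 comparisons, so this example assumes, purely for illustration, that it is the highest mean correlation with all other KOG distance matrices. The KOG identifiers and values are hypothetical.

```python
def mean_correlation(corr, kog):
    """Mean correlation of one KOG's distance matrix with all the others."""
    others = [value for other, value in corr[kog].items() if other != kog]
    return sum(others) / len(others)


def pick_reference(corr, single_copy):
    """Single copy KOG with the highest mean correlation (assumed criterion)."""
    # Sorting first makes the choice deterministic if two candidates tie.
    return max(sorted(single_copy), key=lambda kog: mean_correlation(corr, kog))


# Hypothetical symmetric correlation table among three KOG distance matrices.
corr = {
    "KOG_A": {"KOG_A": 1.0, "KOG_B": 0.9, "KOG_C": 0.8},
    "KOG_B": {"KOG_A": 0.9, "KOG_B": 1.0, "KOG_C": 0.4},
    "KOG_C": {"KOG_A": 0.8, "KOG_B": 0.4, "KOG_C": 1.0},
}
single_copy = {"KOG_B", "KOG_C"}  # KOG_A is treated as not single copy here

reference = pick_reference(corr, single_copy)
print("reference:", reference)

# Correlation of every other single copy KOG with the chosen reference.
to_reference = {k: corr[reference][k] for k in single_copy if k != reference}
print(to_reference)
```

In this toy table KOG_A has the highest mean correlation overall, but it is excluded because only single copy KOGs are eligible as reference.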

Phylogenetic analysis

KOG proteins whose distance matrices had correlation values higher than 0.50 when compared to the reference KOG distance matrix were concatenated and aligned, the poorly aligned regions were removed, and a phylogenetic analysis was performed by Maximum Likelihood (PHYML) [28]. The amino acid substitution model used was JTT [29]. The number of substitution rate categories was 2. The model of rate heterogeneity was Gamma distribution rates with 4 categories. We used Caenorhabditis elegans as outgroup for all phylogenetic tree reconstructions. Groups of 10, 20, 30, 40, 50, 60 and 64 KOG proteins were selected according to decreasing cophenetic correlation values and used to build phylogenetic trees, and their support values were assessed using 100 replicates.
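The grouping step can be sketched as follows: KOGs are filtered at the 0.50 threshold, ranked by decreasing correlation with the reference, and cut into nested sets of 10, 20 and so on up to the full set, whose aligned sequences are then joined per taxon. The function names, KOG labels and correlation values are hypothetical; tree building itself is left to an external program and is not shown.

```python
def rank_above_threshold(corr_to_ref, threshold=0.50):
    """Return KOG ids with correlation above the threshold, best first."""
    kept = [(kog, r) for kog, r in corr_to_ref.items() if r > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [kog for kog, _ in kept]


def nested_groups(ranked, step=10):
    """Top-10, top-20, ... subsets, ending with the full ranked list."""
    if not ranked:
        return []
    sizes = list(range(step, len(ranked), step)) + [len(ranked)]
    return [ranked[:n] for n in sizes]


def concatenate(alignments, kogs, taxa):
    """Join the aligned sequences of the chosen KOGs, taxon by taxon."""
    return {t: "".join(alignments[k][t] for k in kogs) for t in taxa}


# Hypothetical correlation values: 64 above 0.50 and 6 below it.
corr_to_ref = {f"K{i:02d}": 0.90 - 0.005 * i for i in range(64)}
corr_to_ref.update({f"LOW{i}": 0.10 + 0.05 * i for i in range(6)})

ranked = rank_above_threshold(corr_to_ref)
groups = nested_groups(ranked)
print([len(g) for g in groups])

# Hypothetical two-KOG, two-taxon alignments to show the concatenation step.
toy = {"K00": {"sp1": "MKV", "sp2": "MRV"}, "K01": {"sp1": "GHT", "sp2": "GHS"}}
print(concatenate(toy, ["K00", "K01"], ["sp1", "sp2"]))
```

Because each group is a prefix of the same ranked list, every larger set contains all proteins of the smaller ones, which is what allows the effect of adding lower-ranked proteins to be followed step by step.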

Comparison of the KOG reference and AFTOL combined genes

For this comparison we used 24 genomes present in AFTOL for which entire genome data are available to calculate the distance matrix of the alignment from AFTOL [30]. The six combined genes distance matrix from AFTOL and the distance matrix of our reference KOG2671 were compared by Pearson's correlation.
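Because the two data sets cover different taxa, both distance matrices have to be restricted to the shared genomes, in the same order, before they are correlated. A minimal sketch of that step is given below; the nested-dictionary representation, the function names and the toy distances are assumptions for illustration, not the AFTOL data.

```python
from itertools import combinations
from math import sqrt


def symmetric(pair_distances):
    """Build a nested-dict distance matrix from {(taxon_x, taxon_y): distance}."""
    matrix = {}
    for (x, y), d in pair_distances.items():
        matrix.setdefault(x, {})[y] = d
        matrix.setdefault(y, {})[x] = d
    return matrix


def shared_distance_vectors(m1, m2):
    """Restrict both matrices to their common taxa and flatten them identically."""
    taxa = sorted(set(m1) & set(m2))
    pairs = list(combinations(taxa, 2))
    return taxa, [m1[x][y] for x, y in pairs], [m2[x][y] for x, y in pairs]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical distances: the two matrices share only sp1, sp2 and sp3.
reference = symmetric({("sp1", "sp2"): 10, ("sp1", "sp3"): 30, ("sp2", "sp3"): 25,
                       ("sp1", "sp4"): 40, ("sp2", "sp4"): 42, ("sp3", "sp4"): 35})
combined = symmetric({("sp1", "sp2"): 5, ("sp1", "sp3"): 14, ("sp2", "sp3"): 13,
                      ("sp1", "sp5"): 20, ("sp2", "sp5"): 21, ("sp3", "sp5"): 19})

taxa, v_ref, v_comb = shared_distance_vectors(reference, combined)
print(taxa, f"{pearson(v_ref, v_comb):.2f}")
```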