Background

Trait correlations are widespread across life. These correlations can be intrinsic to an organism’s biology or the result of extrinsic factors. Intrinsic biological factors that can lead to associations between traits include overlapping biochemical or genetic pathways [1, 2], promiscuous enzymes [3, 4], and other forms of pleiotropy [5, 6]. Such factors can facilitate the evolution of novel traits from preexisting traits [7,8,9]. The environment or niche of an organism is a composite of many factors (e.g., temperature, carbon availability, and salinity), which select for suites of traits or trait syndromes that are compatible with that specific habitat or niche [10].

Trait syndromes can include a set of traits that collectively provide a fitness advantage, either additively or non-additively, in the environment and have been described for distinct groups of species that are associated with certain environments [11,12,13,14]. For example, stress-resistant syndrome is a common suite of traits that enables plants to survive in stressful environments, such as low-resource environments [15]. Stress-resistant plants tend to have lower rates of growth and photosynthesis, high root-to-shoot ratios, and additional adaptations to low-nutrient conditions. A second example is domestication syndrome, a collection of traits associated with the genetic change of an organism from a wild progenitor to a domesticated form, which is prevalent among domesticated animals [16, 17] and plants [18]. In animals, the characteristics of this syndrome can include reduction in tooth size, increased docility, reduction in brain size, and many others [17]. Alternatively, correlations between traits or suites of traits may be due to phylogeny, the retention of ancestral traits in descendant species [19, 20]. Intrinsic, extrinsic, and phylogenetic factors are not mutually exclusive, so inferences about interactions among traits must consider each of these possibilities, a task undertaken in relatively few comprehensive studies [21,22,23,24,25], none of which have considered microbes or metabolic traits.

The budding yeast subphylum Saccharomycotina is among the most extensively characterized higher taxonomic ranks [26]. It includes the well-known model system Saccharomyces cerevisiae and the human commensal and pathogen Candida albicans, and its more than 1000 known species descend from a common ancestor that lived approximately half a billion years ago. These yeasts display considerable genetic, phenotypic, and ecological diversity and provide a unique opportunity to quantify trait associations and elucidate the mechanisms that drive them [27,28,29,30,31,32,33,34,35,36]. Furthermore, these yeasts have been extensively described by leading taxonomists, who have scored growth phenotypes for a large number of phenotypic traits in The Yeasts: A Taxonomic Study [26]. In addition to its comprehensive nature, an advantage of this phenotypic dataset is that the methods used to score yeasts for trait presence were uniform across species; therefore, there are fewer biases than would occur by combining multiple published datasets. We used this qualitative dataset of 48 traits in 784 budding yeast species to test whether physiological associations involving nitrogen and carbon source utilization (i.e., assimilation or consumption), sugar fermentation, and growth temperature traits were driven by intrinsic (biological/functional) or extrinsic (environmental) factors. We identified pervasive positive and negative correlations between traits among wild yeast species and found that the structures of metabolic networks are dominant factors that drive these associations, while environment plays an important secondary role.

Results and Discussion

Since individual traits do not evolve independently of each other, we hypothesized that the yeast phenotypic dataset would include positive (traits that tend to co-occur) and negative (traits that tend to not occur together) associations. We quantified pairwise associations among 48 traits across 784 species. Most traits were conditional growth, such as growth in a medium with a single carbon source, and most species were represented by a single strain, the taxonomic type strain. We verified a subset of the dataset by growing 240 yeast species on four carbon sources: galactose, maltose, sucrose, and raffinose (Additional file 1: Table S1). We found that 94% of the growth results matched the data found within The Yeasts: A Taxonomic Study, leading us to conclude that this dataset is sufficiently accurate and reproducible.

For each trait pair, we quantified observed trait pair counts (i.e., the number of times both traits in a pairwise set were present across all species) and compared them to a distribution of counts based on randomized or permuted datasets (n = 10,000) to determine significance. The average of these randomized counts represents our expected value, and the difference between the observed and expected counts represents the strength of the trait pair association. We found many (n = 211) significantly positive (n = 104) and negative (n = 107) pairwise associations among traits (Fig. 1 and Additional file 2: Table S2). Clustering traits based on association strength revealed that traits that shared similar associations formed significant trait clusters (p < 0.05, Additional file 3: Figure S1), which often involved similar physiological functions (e.g., fermentation traits), chemical bonds (e.g., glucosides), and functional groups (e.g., sugar alcohols) (Additional file 2: Table S2), suggesting that the biological properties of traits could be an important factor affecting trait associations. Most negative associations occurred with sugar fermentation, growth on DL-lactate, and growth at 37 °C (88/107, Fig. 1, bottom), each of which will be highlighted below. In contrast, positive associations were broadly spread across traits and were not driven by a few individual traits (Fig. 1, bottom).
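The observed-versus-expected comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the toy matrix, the function names, the count of swap attempts, and the two-sided empirical p-value are all hypothetical choices made for the example. The null matrices are generated with a checkerboard swap that preserves row and column sums.

```python
import random

# Hypothetical toy presence/absence matrix: rows are species, columns are traits.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]

def co_occurrence(matrix, a, b):
    """Number of species (rows) in which traits a and b are both present."""
    return sum(1 for row in matrix if row[a] == 1 and row[b] == 1)

def swap_permute(matrix, n_swaps, rng):
    """Return a permuted copy that keeps every row sum and column sum.

    Each attempt picks two rows and two columns; if the 2x2 submatrix is a
    checkerboard (10/01 or 01/10) it is flipped, which leaves the margins
    unchanged. n_swaps counts attempts, not successful flips (an assumption).
    """
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        is_checkerboard = (
            m[r1][c1] == m[r2][c2]
            and m[r1][c2] == m[r2][c1]
            and m[r1][c1] != m[r1][c2]
        )
        if is_checkerboard:
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m

def association_strength(matrix, a, b, n_perm=200, n_swaps=1000, seed=0):
    """Observed minus expected co-occurrence and an empirical p-value.

    The two-sided p-value definition below is an assumption for illustration.
    """
    rng = random.Random(seed)
    observed = co_occurrence(matrix, a, b)
    null = [co_occurrence(swap_permute(matrix, n_swaps, rng), a, b)
            for _ in range(n_perm)]
    expected = sum(null) / n_perm
    extreme = sum(1 for x in null
                  if abs(x - expected) >= abs(observed - expected))
    p_value = (extreme + 1) / (n_perm + 1)
    return observed - expected, p_value
```

A positive strength indicates that two traits co-occur more often than in the margin-preserving null matrices, and a negative strength indicates that they co-occur less often.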

Fig. 1

Traits showed pairwise positive and negative associations, but the numbers and strength of association varied among traits. Bottom: Stacked bar graph displaying the proportion of negative (red), positive (blue), and not significant (white) associations for all traits. Top: Heat map of pairwise associations among traits. The color of a box represents the type of association: negative (red), positive (blue), and not significant (white). The strength of the association (the difference between the observed and expected counts) is displayed by the saturation of the color. We used a hierarchical cluster analysis to determine significant trait clusters with similar pairwise trait associations (Additional file 3: Figure S1; q < 0.05). Selected trait clusters are represented by the colors along the left-hand side and bottom of the graph

Intrinsic biological properties explain the largest proportion of trait variation

Multiple factors could conceivably contribute to the positive and negative trait associations we detected. We quantified the extent to which phylogeny, biological properties, and isolation environments contributed to the variance of trait associations [37]. This variance decomposition showed that biological properties (i.e., the functional attributes and structures of molecules consumed) largely drove the variation in trait associations. Specifically, 47.9% of the variance was explained by biological properties alone (R² = 0.479, Fig. 2), while phylogeny and isolation environments explained only 0.0424% (R² = 0.000424, Fig. 2) and 0.636% (R² = 0.00636, Fig. 2) of the variance on their own, respectively. In some instances, multiple factors jointly explained the variance of trait associations. The largest proportion of these complex factors occurred between biological properties and isolation environments, which together explained 8.12% of the variance (R² = 0.0812, Fig. 2). Only 0.0541% of the variance could be explained by both phylogeny and isolation environments or by both phylogeny and biological properties. In total, some combination of these three factors explained 56.8% of the variance in trait associations, leaving 43.2% unexplained. These results show that the vast majority (98%) of the explained variation among trait associations involves biological properties intrinsic to the organism, while isolation environments play an important secondary role, often in conjunction with biological properties.
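The share of explained variance that involves biological properties can be checked with a short calculation from the fractions reported above. This is only an arithmetic sketch: the variable names are hypothetical, and the small overlap between phylogeny and biological properties is left out of the numerator, so the result is a slight underestimate.

```python
# Variance fractions (R^2) as reported in the text and Fig. 2.
bio_alone = 0.479
phylo_alone = 0.000424
env_alone = 0.00636
bio_env_joint = 0.0812
total_explained = 0.568

# Lower bound: the small phylogeny-by-biology overlap is not included here.
involving_bio = bio_alone + bio_env_joint
share = involving_bio / total_explained
unexplained = 1 - total_explained

print(f"share of explained variance involving biological properties: {share:.3f}")
print(f"unexplained fraction: {unexplained:.3f}")
```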

Fig. 2

Variation in trait associations can largely be explained by biological properties. We used variance partitioning to measure the amount of variation in pairwise trait associations that could be explained by biological properties (purple), isolation environments (red), and phylogeny (green). Each circle represents a single factor, while the overlap among circles represents interactions among factors (e.g., the interaction between biological properties and isolation environments is represented by the pink area). The values provided in each area represent the amount each factor contributes to the variation. In total, the factors represented here explain 0.568 (or 56.8%) of the variance in trait associations. The residuals represent the proportion of the variance that is unexplained

Limited effects of evolutionary history

Despite the limited amount of variance explained by phylogeny in our full dataset, we tested individual traits for a significant phylogenetic signal by calculating D across individual traits using a phylogeny of 561 species [38]. D is a measure of the character dispersion of binary traits and is similar to the more commonly used Blomberg’s K and Pagel’s λ statistics, which are used for quantitative data. Negative D values represent phylogenetic clustering, whereas values greater than 1 indicate phylogenetic over-dispersion. When D is not significantly different from 0, the trait is evolving according to a Brownian motion process that tracks phylogeny. When D does not significantly differ from 1, that trait is randomly distributed across the phylogeny. Although we detected some phylogenetic signal for most individual traits (i.e., D significantly differed from 1 for 47/48 traits), these analyses also rejected simple Brownian motion for the majority of traits (i.e., D significantly differed from 0 for 31/48 traits), suggesting that factors other than phylogeny predominate (Additional file 4: Table S3). Along with the variance decomposition (Fig. 2), these results suggest that, although phylogenetic history is significantly associated with some individual traits, it is not the major driver of the observed trait associations.

Environmental factors help drive trait associations

To test the role of isolation environments in trait associations further, we compiled and categorized, from The Yeasts: A Taxonomic Study, the environments from which each species had been isolated, and we quantified the association between isolation environments and traits using the same methods used to quantify trait associations (Additional file 5: Table S4). Yeasts have been isolated from many environments, including insects, plants, and food (Additional file 6: Figure S2). These environments were scored hierarchically, ranging from general categories (e.g., insects) to specific categories (e.g., ants). The broader categories increased the sample sizes available for statistical analyses, but they may also aggregate cryptic ecologies.

Isolation environments differed by the numbers and types of traits positively and negatively associated with them (p < 0.05, Fig. 3 and Additional file 7: Table S5). The strengths of these associations also varied among traits and environments, partly due to differences in power. Significant associations between traits and environments were generally concordant with known ecologies. For example, glucose and sucrose fermentation were positively associated with fruit, fermented substrates, and drinks or juice. These associations were not driven solely by the genus Saccharomyces and included many non-Saccharomyces yeasts known to be important for fermentation and spoilage of drinks, including yeasts that have been commercialized as oenological starter cultures due to their fermentation capabilities and their contributions to the chemical compositions of wines (e.g., Hanseniaspora uvarum, Torulaspora delbrueckii, Metschnikowia pulcherrima, and Lachancea thermotolerans [39]). H. uvarum and T. delbrueckii have also both been shown to be involved in the early stages of spontaneous fermentation of grapes, demonstrating that these yeasts ferment fruits in their natural environments [39, 40].

Fig. 3

Traits are positively and negatively associated with a subset of isolation environments (q < 0.05). The numbers and strengths of these associations varied among traits and environments. Bottom: Stacked bar graph displaying the proportion of negative (red), positive (blue), and not significant (white) associations for each isolation environment with at least one significant association. Top: Heat map of pairwise associations among traits and isolation environments. The color of a box represents the type of association: negative (red), positive (blue), and not significant (white). The strength of the association (the difference between observed and expected trait counts) is displayed by the saturation of the color

Growth at 37 °C was positively associated with isolation from cacti (p < 0.001), which prefer warmer climates. Yeast communities isolated from the family Cactaceae were previously associated with growth at high temperatures, although species identifications were not possible in this classic 1986 study [41]. Our analyses extend this pre-molecular research to specific yeast taxa (e.g., Pichia cactophila and Clavispora opuntiae), provide further statistical support for this association, and highlight its relative importance in a more comprehensive picture of yeast ecology. Similarly, our analyses statistically supported the previously hypothesized positive associations between growth on DL-lactate and isolation from oaks and cacti [41].

As expected, we found that growth at 37 °C was also positively associated with yeasts being classified as pathogenic (p = 0.0047). This oft-stated association [26, 28] stems from the need to survive elevated temperatures within endothermic hosts, but to our knowledge, it has never been formally tested across a broad taxonomic scale. In contrast, growth at 37 °C was negatively associated with isolation from insects (p = 0.03) and fungi (p < 0.001), which are not endothermic. More intriguingly, the endothermic pathogenic/commensal lifestyle could help explain the unexpected negative associations that we observed between growth at 37 °C and the ability to utilize a variety of carbon sources (15/33 carbon sources, Table 1) because emerging evidence suggests that carbon starvation is frequently encountered by yeasts growing in mammalian hosts [42,43,44].

Table 1 Utilization of diverse carbon sources is negatively associated with growth at high temperatures

The ability to survive in certain habitats requires suites of traits, while other traits are expendable [10, 15]; therefore, both negative and positive trait associations could be affected by extrinsic factors related to isolation environments. To explore the extent to which extrinsic factors contributed to trait associations further, we tested whether the number of trait associations we saw in each specific environment was more than that expected by chance. If there were more significant observed trait associations (negative or positive) within a given environment than expected, we would conclude that the specific isolation environment tested was not important to the significant trait associations. Instead, for all isolation environments, we found no significant differences between the observed and expected associations for both the positive (Fig. 4a) and negative (Fig. 4b) associations. In contrast, when we removed the effects of the environment by drawing species from random environments and repeating our analyses as a control, we found significant deviations from the number of positive and negative associations. These results demonstrate that the environment is an important factor in explaining observed trait associations, even if it may often be acting in concert with intrinsic biological factors.

Fig. 4

Isolation environments contribute to positive and negative associations among traits. We calculated the deviations for our observed data for both a positive and b negative associations. All of the deviations for the observed data were close to zero, suggesting that isolation environments contribute to the trait associations. The saturation of the bars represents sample sizes. In the insets, we removed the effect of the environment by randomly sampling species without regard to their environment for multiple sample sizes (4 ≤ n ≤ 217) and reproduced the environment data on the same scale for contrast (isolation versus random). Note that removing the effect of environment led to significant deviations from expectations for both positive (inset in a) and negative (inset in b) associations

Networks of intrinsic biological factors affecting traits

Given the amount of the variance in trait associations explained by biological properties, we examined the role of biological factors in more detail. In particular, promiscuous enzymes and pathway overlap are major features of yeast carbon metabolism that could potentially underlie many of the scored traits and the significant trait clusters (p < 0.05, Fig. 1 and Additional file 3: Figure S1). Shared enzymes and pathways are expected to lead primarily to positive associations, and indeed, we found an enrichment for positive associations relative to negative associations among carbon source utilization traits (p = 7.67 × 10⁻⁶). Across trait pairs (n = 496), 75 pairs were significantly positively associated, while only 34 were significantly negatively associated (Fig. 5a and Additional file 8: Table S6). Moreover, DL-lactate (n = 20) and methanol (n = 7) accounted for 79% of all negative associations with other carbon sources, and most carbon sources were not negatively associated with any carbon source beyond these two (Fig. 5b).

Fig. 5

Significant associations among carbon metabolism traits were biased toward positive associations. a Bar graph displaying the total number of positive (blue) and negative (red) associations for each carbon source (Additional file 5: Table S4). b Negative association network between carbon utilization traits. The width of the edge between nodes represents the strength of the association between carbon sources. Node color represents similar molecular structures among compounds, and shading behind nodes represents significant communities within the network. Note that, among negative associations, the majority (79%) of significant associations were with DL-lactate and methanol

To explore the role and cause of trait associations among carbon utilization traits further, we generated a network of all significant positive associations (Fig. 6). We detected five distinct communities or subnetworks within the complete network using the Clauset–Newman–Moore algorithm, a modularity maximization method that detects communities by searching for subdivisions with high modularity [45]. Many of the carbon sources within each subnetwork shared similar molecular structures and functional properties (Additional file 9: Table S7). For example, we detected a subnetwork that consisted exclusively of Glucosides, which all contain a glucose moiety linked to other chemical groups (p = 3.57 × 10⁻⁵, Fig. 6a). Similar types of glucosidic bonds are often cleaved by promiscuous enzymes [46], and our findings suggest that these biochemical properties at least partly explain trait associations across macroevolutionary timescales. The Contains Galactose subnetwork (p = 0.0001, Fig. 6a) included all three glucosides not found in the Glucosides subnetwork (the galactosides melibiose, raffinose, and lactose, Fig. 6c), as well as galactose and its sugar alcohol, galactitol. The two trisaccharides in the dataset even had significant edges connecting them to their constituent disaccharide moieties (raffinose to melibiose and sucrose, as well as melezitose to sucrose, Fig. 6b).
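To make the notion of "high modularity" concrete, the sketch below computes the modularity score Q of a given partition for an unweighted, undirected network. It only evaluates the quantity that modularity maximization methods such as Clauset–Newman–Moore try to increase; it does not implement the greedy merging procedure itself. The node labels and edges are a hypothetical toy graph (two triangles joined by one edge), not the carbon utilization network from this study.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities c of [ e_c / m - (d_c / (2 m))^2 ], where m is the
    number of edges, e_c the number of edges with both ends inside c, and d_c
    the summed degree of the nodes in c.
    """
    m = len(edges)
    internal = {}  # e_c per community
    degree = {}    # d_c per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Hypothetical toy network: two triangles connected by a single bridge edge.
toy_edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]
two_groups = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
one_group = {node: "all" for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_merged = modularity(toy_edges, one_group)
```

In this toy graph, splitting the nodes into the two triangles gives a higher Q than placing every node in a single community, which is the kind of subdivision a modularity maximization method would favor. For real analyses, an established implementation (for example, `greedy_modularity_communities` in the networkx Python library) would be a more appropriate choice than this sketch.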

Fig. 6

Carbon source utilization trait associations form communities within the network that contain traits with similar structures and/or properties. a Network of positive associations among carbon utilization traits. The width of the edge between nodes represents the strength of the association between carbon sources. Node color represents similar molecular structures among compounds, and shading behind nodes represents significant communities within the network. Each subnetwork is labeled with a description of the biochemical structures or pathways captured by that subnetwork. b, c Reproduction of network highlighting compounds containing specific monosaccharide moieties. d–i Reproduction of network highlighting negative (red) and positive (blue) associations with the fermentation of specific sugars. Note that fermentation of a specific sugar is always positively associated with utilization of that sugar because utilization is a prerequisite for fermentation; in some cases, related sugars are also positively correlated. Trehalose fermentation may be an exception to the general trend of negative associations with other carbon sources because many yeasts synthesize trehalose internally [74]

Finally, the Sugar Alcohols & Pentose Phosphate Pathway subnetwork was enriched (p = 1.81 × 10⁻⁵) for pentoses and sugar alcohols; 30 interactions occurred among nodes within the subnetwork, compared to 13 outside the subnetwork. Sugar alcohols accumulate as intermediate products during pentose utilization in many microbes [47], including yeasts [48, 49], suggesting that overlapping biochemical pathways may explain the association between the utilization of these two types of carbon sources. Strikingly, the ability to ferment various sugars was negatively associated with the ability to utilize sugar alcohols and pentoses (9 out of 11 pentoses/sugar alcohols, p = 0.006, Fig. 1 and Fig. 6d–i). Efficient fermentation of pentoses, such as xylose and arabinose, is one of the central challenges of bioenergy research, and this broad negative association across yeasts further underscores that only a handful of yeast species have naturally evolved pentose fermentation [50,51,52,53].

Biochemical pathways associated with intrinsic biological factors

The enrichment for positive associations among carbon sources and the communities of carbon sources with similar properties and structures suggests that the ability to utilize one carbon source can increase the potential to utilize similar carbon sources. These findings provide an empirical and evolutionary underpinning for a theoretical model of bacterial central carbon metabolism, which proposed that metabolisms viable on one carbon source can be preadapted to multiple other carbon sources as a result of shared pathways [7]. Furthermore, the enrichment for specific molecular properties, chemical structures, and pathways within communities suggests an underlying genetic and biochemical basis for the suite of carbon metabolism traits present within a given yeast species [54].

To determine whether the observed macroevolutionary patterns of metabolic trait associations could indeed be explained by the shared genetic and biochemical pathways, we measured the extent to which the presence of several well-characterized genes from model organisms could explain growth on various carbon sources. We focused on 79 species with extensive trait information, fully sequenced genomes, and a well-resolved phylogeny [55]. For traits, we focused on two significant communities within the positive carbon metabolism network (Fig. 6) and searched the genomes for homologs of genes known to enable the utilization of those carbon sources in model organisms (Additional file 10: Table S8).

For the Contains Galactose community, the GAL1, GAL7, and GAL10 genes, which encode the enzymes required for galactose utilization, were positively associated with galactose utilization (p < 0.05, Additional file 10: Table S8 and Additional file 11: Figure S3a) [56]. To test whether interacting pathways could indeed explain some of the significant edges within communities, we also tested whether galactose utilization was positively associated with the presence of MEL1 and LAC12, which encode galactosidases that are required for melibiose and lactose utilization, respectively. Indeed, even though MEL1 and LAC12 are not required for galactose utilization, we found significant associations across macroevolutionary timescales (p < 0.05, Additional file 11: Figure S3a, Additional file 12: Table S9).

The Glucosides community was rich in individual genes (e.g., MAL11 and IMA5) associated with growth on multiple carbon sources, including maltose, melezitose, and sucrose (p < 0.05, Additional file 10: Table S8 and Additional file 11: Figure S3b). These results suggest that these genes are pleiotropic and could be responsible for the utilization of multiple carbon sources, in line with an extensive body of research in Saccharomyces showing enzyme promiscuity [57], as well as with the demonstration that the deletion of MAL1 and MAL2 in Ogataea polymorpha prevented this distantly related yeast from growing on multiple carbon sources, including maltose, sucrose, and melezitose [58]. Our comprehensive analyses suggest that the established mechanisms from these model systems are general and further show how these promiscuous enzymes have led to positive trait associations across a broad taxonomic range.

Finally, we quantified whether genes that are involved in the metabolism of carbon sources within each community were more likely to co-occur. We found a significant difference in the frequency of co-occurrence among genes that were from the same community (81.1%) versus when they were from two different communities (60.2%) (χ² = 7.98, p = 0.005, Additional file 11: Figure S3c). In other words, genes involved in the utilization of positively associated carbon sources co-occurred more frequently than those of randomly associated carbon sources. The significant co-occurrence of genes associated with the utilization of positively associated carbon sources provides further support that intrinsic biological properties contribute to positive trait associations.

Conclusions

As seen in non-microbial taxonomic groups [22, 25], trait pairs in the budding yeast subphylum show significant positive and negative associations of varying strengths. Positive associations are particularly common among carbon metabolism traits, relative to negative associations, especially those involving compounds with similar molecular structures and pathways. These correlations suggest that the ability to metabolize individual carbon sources can increase the potential to utilize additional chemically related carbon sources. Among negative associations, there are no absolute exclusions where the presence of one trait is perfectly correlated with the absence of another, suggesting that any subtle trade-offs that may occur can be overcome across macroevolutionary timescales. One caveat to our study is that these data are a qualitative measure of growth. Measuring correlations between quantitative growth parameters, such as lag, growth rate, and saturation, may reveal trade-offs or interactions among traits that cannot be seen with the current data. An interesting avenue of future research would be to measure quantitatively the growth parameters of all species and look for positive and negative associations.

The presence of negative associations could conceivably be explained by intrinsic biological factors leading to trade-offs or due to extrinsic factors. For example, highly fermentative yeasts might be intrinsically poor pentose fermenters due to a metabolic trade-off, possibly explaining the challenges encountered by biofuel researchers [59,60,61]. Alternatively, negative associations could also be explained by extrinsic factors, such as adaptation to the environment or an ecological niche. Under this second model, environments that select for robust pentose metabolism might not typically favor highly fermentative species. Similarly, there could be a trade-off between the utilization of a broad array of carbon sources and growth at high temperatures, as suggested by research demonstrating impaired growth on different carbon sources at 37 °C in Candida albicans [43]. Alternatively, yeasts may simply encounter a more limited range of carbon sources within mammalian and endothermic hosts [44], resulting in the negative association between carbon utilization breadth and growth at 37 °C, perhaps through the neutral loss of metabolic pathways [56]. The intrinsic explanations posit limitations to adaptability imposed by the metabolic network, while the extrinsic explanations propose a dominant role for the ecological niche in determining which traits are retained or acquired. An interesting future research avenue will be to determine whether similar patterns are observed in other large clades of diverse eukaryotic microbes, or perhaps even in bacteria where widespread horizontal gene transfer may further diminish the role of phylogeny.

By accounting for both intrinsic and extrinsic factors, our results demonstrate that both genetics and environment contribute to trait associations. Indeed, these factors likely interact to create a positive feedback loop, in which an organism’s genes ultimately underlie the traits needed for survival in an environment, and the environment then puts additional selective pressure on those genes, leading to their retention or further adaptation [62]. Together, these forces shape the traits present in an organism, strengthen correlations between those traits among organisms, and select for common suites of traits or trait syndromes in diverse clades.

Methods

Species

We examined trait correlations in 784 Saccharomycotina species [26] from 50 different isolation environments. We quantified the phylogenetic signal across our trait data for 578 species and curated isolation data for 831 yeast species.

Trait data

Qualitative trait data for 75 traits were curated from The Yeasts: A Taxonomic Study [26] (Additional file 13: Table S10). The trait data are largely based on the type strain of a species; however, in some cases, multiple strains were assessed for a species and condensed into one value by the taxonomist. Traits were scored by multiple taxonomists, but they used standardized protocols to limit the biases of multiple techniques [26]. When multiple strains were qualitatively assessed for a species and only some strains grew, the trait was scored as “variable.” In the dataset, 6% of the trait matrix was marked as variable. The occurrence of two traits being variable within the matrix was less than 1%; therefore, all traits scored as variable were considered positive and scored as a 1 (Additional file 14: Table S11). Trait data were not available for every trait for every species; therefore, any trait that was evaluated for presence or absence in fewer than 80% of the species was removed from all analyses. Species that were lacking trait data for more than 80% of the traits considered in our study were also removed from analyses, leaving 784 species and 48 traits with sufficient data.
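The curation steps described above (encoding "variable" scores as present, then filtering traits and species by data coverage) can be sketched as follows. This is a minimal illustration under stated assumptions: the score symbols (`+`, `-`, `v`), the species and trait names, the function names, and the order of the two filters are hypothetical and are not taken from the original dataset or analysis code.

```python
def encode(score):
    """Map a qualitative score to 1, 0, or None; 'variable' counts as present."""
    # Hypothetical symbols: "+" present, "-" absent, "v" variable.
    return {"+": 1, "v": 1, "-": 0}.get(score)  # anything else is missing

def curate(table, min_trait_coverage=0.8, max_species_missing=0.8):
    """table maps species -> {trait: raw score}; returns a filtered 1/0/None table."""
    species = list(table)
    traits = sorted({t for row in table.values() for t in row})
    coded = {s: {t: encode(table[s].get(t)) for t in traits} for s in species}

    # Keep traits that were scored in at least 80% of the species.
    kept_traits = [
        t for t in traits
        if sum(coded[s][t] is not None for s in species) / len(species)
        >= min_trait_coverage
    ]
    if not kept_traits:
        return {}

    # Drop species lacking data for more than 80% of the retained traits
    # (computing this over retained traits only is an assumption).
    kept_species = [
        s for s in species
        if sum(coded[s][t] is None for t in kept_traits) / len(kept_traits)
        <= max_species_missing
    ]
    return {s: {t: coded[s][t] for t in kept_traits} for s in kept_species}

# Hypothetical toy input.
toy_table = {
    "sp1": {"glucose": "+", "galactose": "v", "rare": "+"},
    "sp2": {"glucose": "+", "galactose": "-", "rare": "-"},
    "sp3": {"glucose": "+", "galactose": "+"},
    "sp4": {"glucose": "-", "galactose": "+"},
    "sp5": {},
}
curated = curate(toy_table)
```

In the toy input, the trait `rare` is dropped because it was scored in only two of five species, and `sp5` is dropped because it has no data for any retained trait.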

Growth validation

We validated the growth of 240 yeast species in our dataset on four carbon sources: galactose, maltose, sucrose, and raffinose. We inoculated the type strains of all species in yeast extract peptone dextrose and allowed them to grow for 3 days. After the initial inoculation, the cultures were arrayed into a 96-well plate, and a pinner was used to inoculate a 96-well plate containing minimal media plus 2% sugar. The cultures grew for a week and were then scored for growth. We repeated the growth validation experiment three times. A species was scored as growing on a carbon source if it grew in at least two of the three replicates. After blindly scoring the yeasts for growth, we compared the results to the trait data from The Yeasts: A Taxonomic Study.

Imputations

Because the trait data contained missing values, we tested three methods for handling them: (1) setting all missing values to 0, (2) setting all missing values to 1, and (3) imputing each missing value as either 1 or 0. Overall, the third method correctly predicted trait presence or absence 79% of the time, more often than either of the other methods (Additional file 15: Table S12). Trait associations were generally insensitive to how missing values were handled; the methods agreed on trait associations 99% of the time. We therefore performed the remaining analyses using the third method (imputation).

Each missing value was replaced with a 1 or 0 based on the proportion of 0 values observed for its species, P(s), and for its trait, P(t):

$$ {\displaystyle \begin{array}{l}P(s)=\frac{n_0}{n_s},\\ {}P(t)=\frac{n_0}{n_t},\end{array}} $$

where \( n_0 \) is the number of 0 values found for that species or trait, and \( n_s \) and \( n_t \) are the total numbers of data points for that species and trait, respectively. The proportion of 0 values was also calculated for the total matrix, T(z):

$$ T(z)=\frac{n_0}{n_r\times {n}_c}, $$

where \( n_r \) and \( n_c \) are the total numbers of rows and columns in the matrix, respectively. These values were then multiplied to determine the probability that the missing data for that cell would be 0, P(0):

$$ P(0)=T(z)\times \left(P(s)\times P(t)\right). $$

When P(0) was greater than 0.5, the missing value was set to 0, and when it was less than 0.5 the missing value was set to 1. This imputation method quantitatively accounts for the observation that some traits are more common than others, while outperforming approaches encoding all missing values as 0 or 1.
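The imputation rule can be sketched in Python as follows. This is an illustrative reading of the equations above, not the authors' code. The function name is hypothetical. Two details are assumptions: \( n_s \) and \( n_t \) are taken to count only non-missing entries, and a value of exactly 0.5 (which the text does not cover) is set to 1.

```python
def impute_missing(matrix):
    """Impute None entries of a species-by-trait 0/1 matrix (list of lists).

    P(0) = T(z) * P(s) * P(t); the cell becomes 0 if P(0) > 0.5, else 1.
    Assumptions: n_s and n_t count observed (non-missing) entries only,
    and the tie P(0) == 0.5 is resolved as 1.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])

    # T(z): zeros in the whole matrix divided by n_r * n_c.
    t_z = sum(row.count(0) for row in matrix) / (n_rows * n_cols)

    def prop_zero(values):
        observed = [v for v in values if v is not None]
        return observed.count(0) / len(observed) if observed else 0.0

    p_s = [prop_zero(row) for row in matrix]                              # per species
    p_t = [prop_zero([row[j] for row in matrix]) for j in range(n_cols)]  # per trait

    imputed = [row[:] for row in matrix]
    for i in range(n_rows):
        for j in range(n_cols):
            if matrix[i][j] is None:
                p0 = t_z * p_s[i] * p_t[j]
                imputed[i][j] = 0 if p0 > 0.5 else 1
    return imputed
```

Observed entries are copied unchanged, so the function can be applied directly to the filtered trait matrix.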

Associations

We permuted the trait presence/absence matrix (n = 10,000 permutations) using a swap algorithm (n = 1000 swaps per permutation) that maintains the row and column sums of the original matrix. Permutations were performed using the R package picante (v. 1.6–2) [63]. To determine whether traits were positively or negatively associated, for each trait pair, we counted the number of species in which both traits were observed (1,1) and the number in which only a single trait was present (1,0 or 0,1), respectively:

$$ {\displaystyle \begin{array}{l}{\mathrm{Positive}}_{\mathrm{obs}}={\mathrm{obs}}_{\left( 1, 1\right)},\\ {}{\mathrm{Negative}}_{\mathrm{obs}}={\mathrm{obs}}_{\left(1,0\right)}+{\mathrm{obs}}_{\left(0,1\right)}.\end{array}} $$

These values were also calculated for the permuted data, and the expected trait pair counts were determined by calculating the mean of those permuted observations:

$$ {\displaystyle \begin{array}{l}\overline{{\mathrm{Pos}}_{\mathrm{exp}}}=\frac{\sum {\exp}_{\left(1,1\right)}}{10{,}000},\\ {}\overline{{\mathrm{Neg}}_{\mathrm{exp}}}=\frac{\sum \left({\exp}_{\left(1,0\right)}+{\exp}_{\left(0,1\right)}\right)}{10{,}000}.\end{array}} $$

The strength of each association (Str) was calculated as the absolute difference between the observed and mean expected values for the positive and negative associations, respectively:

$$ {\displaystyle \begin{array}{l}{\mathrm{Str}}_{\mathrm{pos}}=\left|{\mathrm{Pos}\mathrm{itive}}_{\mathrm{obs}}-\overline{{\mathrm{Pos}}_{\mathrm{exp}}}\right|,\\ {}{\mathrm{Str}}_{\mathrm{neg}}=\left|{\mathrm{Neg}\mathrm{ative}}_{\mathrm{obs}}-\overline{{\mathrm{Neg}}_{\mathrm{exp}}}\right|.\end{array}} $$
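The permutation scheme and the strength calculation can be illustrated with a short Python sketch. It uses a generic checkerboard swap, which preserves row and column sums; it is not the picante implementation, which may differ in detail. The function names are hypothetical. Two further assumptions: `n_swaps` counts attempted swaps, and every permutation starts from the original matrix.

```python
import random


def swap_permute(matrix, n_swaps, rng):
    """Return a copy of a 0/1 matrix shuffled by checkerboard swaps.

    Each attempt picks two rows and two columns; if the 2x2 submatrix is
    [[1,0],[0,1]] or [[0,1],[1,0]] it is flipped, which leaves all row
    and column sums unchanged. n_swaps counts attempts (an assumption).
    """
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        if m[r1][c1] == m[r2][c2] and m[r1][c2] == m[r2][c1] and m[r1][c1] != m[r1][c2]:
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m


def pair_counts(matrix, a, b):
    """Count species with both traits (1,1) and with exactly one (1,0 or 0,1)."""
    both = sum(1 for row in matrix if row[a] == 1 and row[b] == 1)
    one = sum(1 for row in matrix if row[a] != row[b])
    return both, one


def association_strength(matrix, a, b, n_perm=10000, n_swaps=1000, seed=0):
    """Observed counts, mean permuted counts, and |obs - exp| for one trait pair."""
    rng = random.Random(seed)
    pos_obs, neg_obs = pair_counts(matrix, a, b)
    pos_sum = neg_sum = 0
    for _ in range(n_perm):
        # Assumption: each permutation restarts from the original matrix.
        pos, neg = pair_counts(swap_permute(matrix, n_swaps, rng), a, b)
        pos_sum += pos
        neg_sum += neg
    pos_exp, neg_exp = pos_sum / n_perm, neg_sum / n_perm
    return {
        "pos_obs": pos_obs, "neg_obs": neg_obs,
        "pos_exp": pos_exp, "neg_exp": neg_exp,
        "str_pos": abs(pos_obs - pos_exp), "str_neg": abs(neg_obs - neg_exp),
    }
```

For quick experimentation, `n_perm` and `n_swaps` can be reduced; the defaults mirror the values reported in the text.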

We calculated binomial confidence intervals to determine significant associations using the R package Hmisc (v. 4.0–0) [64] and corrected for multiple tests across associations with the Benjamini–Hochberg correction. An adjusted p value (q) < 0.05 was accepted as significant, and decreasing the significance cutoff to q < 0.01 did not affect our general conclusions. For all significant associations, we reported the observed [the count of either (1,1) or (1,0 or 0,1) within the actual dataset] and expected values [the mean of either (1,1) or (1,0 or 0,1) for the randomized dataset (n = 10,000)] for that association (Additional file 1: Table S1). Since negative and positive associations were calculated separately, if a trait pair was significant as neither a positive nor a negative association, we reported the observed count for whichever showed the larger difference between observed and expected (Additional file 1: Table S1). The same method was applied to determine trait–isolation environment associations.
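The multiple-testing step can be illustrated with a small Python sketch of the standard Benjamini–Hochberg adjustment. It covers only the correction itself; how p values were derived from the binomial confidence intervals is not specified here and is not reproduced. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values (q values), in input order.

    The i-th smallest p value is scaled by m / i, and a running minimum
    taken from the largest p value downward enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An association would then be called significant when its q value falls below the chosen cutoff, for example `q < 0.05`.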

Significant trait clusters

We determined whether traits showed similar patterns of associations via cluster analysis. We calculated a dissimilarity matrix, using a Euclidean distance, for all trait pairs using the difference between observed and expected association values for both the positive and negative associations. All associations that were not statistically significant were set to a value of 0. Traits were clustered using Ward’s method in the R package pvclust (v. 2.0–0) [65].

Variance partitioning of trait associations

Adjusted bi-multivariate statistics (\( {R}_a^2 \)) were computed using the varpart() function in the R package vegan (v. 2.4–3). This statistic estimates the contributions of the independent variables (phylogeny, biological properties, and isolation environments) to the response variable (trait association). Three matrices were used for the independent variables. For the phylogenetic matrix, D1/D2 sequences from the rDNA locus for 578 Saccharomycotina species in our association analysis and the outgroup basidiomycete Cryptococcus neoformans were used to construct a phylogenetic tree of the subphylum Saccharomycotina [26, 66, 67]. All sequences were aligned using MAFFT (v. 7.305) [68, 69]. RAxML-HPC BlackBox (v. 8.2.9) was applied to build the phylogenetic tree under the GTRCAT model for nucleotide sequences; 1000 bootstrap replicates were used to assess the reliability of internal branches (Additional file 16: Table S13) [69, 70].

The inferred maximum likelihood tree was then used to make a distance matrix using the cophenetic.phylo() function in the R package ape (v. 4.1). The biological properties matrix contained the number of each type of carbon source (defined in Additional file 9: Table S7) utilized by a species (e.g., the number of hexoses utilized by S. cerevisiae), and the isolation environments matrix was a binary matrix indicating, for each species, whether or not it was isolated from each environment. The response variable was a trait association matrix that recorded, for each species, whether two traits (e.g., sucrose and maltose utilization) were both present (1) or one trait was present while the other was absent (0).

Phylogenetic signal of traits

The D1/D2 maximum likelihood tree was used to detect the phylogenetic signal in our trait data. We determined whether there was a phylogenetic signal for individual traits by calculating D, a measure of dispersion for binary traits [38], and testing for a significant departure from both random associations (D = 1) and the clumping expected under a Brownian model of evolution (D = 0). D was calculated using the phylo.d() function in the R package caper (v. 0.5.2) [71]. We used 1000 permutations to test whether D was significantly different from random associations and from clumping.

Isolation environments

Isolation environments were manually scored and curated from the “Ecology” section for each species available in The Yeasts: A Taxonomic Study [26]. Isolation environments were classified into specific isolation conditions (e.g., oak, ant, and beer) and broad isolation conditions (e.g., tree, insect, and fermentation). Each isolation environment was scored as a 1 or 0 to represent species presence and absence in each environment (Additional file 5: Table S4). If the isolation environment of a species was unknown, it was classified as unknown. Analyses were performed for isolation environments that contained four or more species.

Quantifying direct and indirect associations

To determine whether the negative and positive associations among traits were indirectly caused by the environment, we calculated the average difference between the observed and expected numbers of trait associations for each environment using a co-occurrence matrix [72]. Any deviations from 0 would suggest the trait associations observed were driven by something other than the isolation environment. We removed the effects of the environment by randomly drawing species, regardless of their isolation environment, for a range of sample sizes (n = 4, 11, 26, 47, 76, 147, 217). For each sample size, we randomly drew that number of species from our trait data set and ran the analysis described above. We did this for each sample size 1000 times and calculated the average difference for each sample size.
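The random-draw baseline can be sketched in Python as below. This is a simplified illustration, not the analysis used in the study. The per-pair expectation used here (the product of the two trait counts divided by the number of sampled species) is a stand-in chosen for the example and is an assumption, as are the function names.

```python
import random
from itertools import combinations


def mean_obs_minus_exp(rows):
    """Average (observed - expected) co-occurrence over all trait pairs.

    Expected co-occurrence for a pair is n_a * n_b / n, i.e. independence
    of the two traits (a stand-in for the co-occurrence model in the text).
    """
    n = len(rows)
    diffs = []
    for a, b in combinations(range(len(rows[0])), 2):
        obs = sum(1 for r in rows if r[a] == 1 and r[b] == 1)
        exp = sum(r[a] for r in rows) * sum(r[b] for r in rows) / n
        diffs.append(obs - exp)
    return sum(diffs) / len(diffs)


def subsample_baseline(matrix, sample_sizes, n_draws=1000, seed=0):
    """Mean difference for random species draws, ignoring isolation environment."""
    rng = random.Random(seed)
    result = {}
    for size in sample_sizes:
        if size > len(matrix):
            continue  # cannot draw more species than are available
        draws = [mean_obs_minus_exp(rng.sample(matrix, size)) for _ in range(n_draws)]
        result[size] = sum(draws) / n_draws
    return result
```

Calling `subsample_baseline(matrix, [4, 11, 26, 47, 76, 147, 217])` would return one average per sample size, which can be compared with the per-environment averages.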

Carbon metabolism network analyses

To quantify whether there was an enrichment of positive associations in our association data, the data were limited to carbon metabolism traits that had at least one significant association, and we used a two-sided Fisher’s exact test. Positive and negative association networks were created in the R package igraph (v. 1.0.1) [73], and carbon trait communities in these networks were determined through the Clauset–Newman–Moore algorithm (the fastgreedy.community function in igraph), an algorithm that maximizes modularity. We determined whether there was enrichment for specific molecular properties or functions within each subnetwork using two-sided Fisher’s exact tests.
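For readers unfamiliar with the test, a two-sided Fisher’s exact test on a 2 × 2 table can be sketched in Python as follows. The function name and the example table are hypothetical, and the layout of the contingency tables used in the study is not specified here. The two-sided p value sums the probabilities of all tables that are no more likely than the observed one, with margins held fixed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties with p_obs are included.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, p_value)
```

For example, `fisher_exact_two_sided(3, 1, 1, 3)` evaluates a table in which three of four items in each row fall on the diagonal.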

Gene carbon analysis

Gene presence was detected using TBLASTX and BLASTN searches with query sequences [53] from the characterized pathways in model organisms (e.g., S. cerevisiae) against 79 previously curated genome assemblies [55], using an e value cutoff of \( 10^{-10} \). We also collapsed the MAL12 and IMA1–4 genes into a single IMA/MAL group since these genes are closely related paralogs [57]. For each carbon source, we quantified how often each gene was present and there was growth on the carbon source, as well as the sum of how often a gene was absent and there was growth plus how often a gene was present and there was no growth. We used a \( \chi^2 \) test to detect associations between growth and gene presence and corrected for multiple tests across associations with the Benjamini–Hochberg correction. We also tested whether genes of positively associated traits co-occurred more frequently than those of traits that showed random associations using a test for equal proportions. We included two groups of genes in our analyses. The first group (Contains Galactose) consisted of GAL1, GAL7, GAL10, LAC12, and MEL1. The second group (Glucosides) comprised MAL11, MAL13, MAL6, the IMA/MAL-collapsed genes, IMA5, and SUC2. We tested whether genes within each group co-occurred more frequently than genes that were from the two different groups. All statistical analyses were done in R.
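As an illustration of testing for an association between gene presence and growth, the Python sketch below computes a Pearson \( \chi^2 \) statistic for a 2 × 2 table of gene presence by growth. It is a stand-in under stated assumptions and not the exact test configuration used in the study. The function name and table layout are hypothetical, and no continuity correction is applied, whereas R's chisq.test applies one to 2 × 2 tables by default.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test of independence for [[a, b], [c, d]].

    Hypothetical layout: rows = gene present / absent,
    columns = growth / no growth.
    Returns (statistic, p value) with 1 degree of freedom and no
    continuity correction. All four margins must be non-zero.
    """
    observed = ((a, b), (c, d))
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    total = a + b + c + d

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value
```

The resulting p values, one per gene and carbon source, would then be adjusted across tests with the Benjamini–Hochberg procedure.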