Introduction

Temperature optimum (T opt) is arguably one of the most conserved phenotypic traits. For a cell to achieve a specific T opt, most or all of its genes need to be functional at this temperature; for example, all the essential genes of a thermophilic organism must encode thermostable proteins. Furthermore, although genomes may evolve quickly through, for example, frequent horizontal gene transfer, this does not mean that phenotypes in general, and T opt in particular, change rapidly through successive generations. On the contrary, it seems reasonable to assume that T opt, or more specifically protein thermostability and cell-membrane thermostability, represents a barrier to fixation of horizontally acquired genes. Although frequent horizontal gene transfer may result in genomes being a mosaic of genes from various lineages, the changes in T opt over the same time frame may be very small.

How fast is the evolution of T opt? Addressing this question requires a measure of how T opt changes as a function of evolutionary distance. The 16S rRNA gene has for decades been a useful marker for measuring evolutionary relatedness among prokaryotes (Woese and Fox 1977; Woese 1987) and is a cornerstone of present day prokaryotic taxonomy (Coenye et al. 2005; Stackebrandt et al. 2002). Phylogenetic clusters derived from 16S rRNA gene sequences commonly contain organisms with similar T opt values, suggesting that although the 16S rRNA gene evolves extremely slowly, its resolution as a phylogenetic marker is sufficient to make inferences about T opt evolution.

Phenotypic evolution can be divided into two components: the rate of evolution (tempo) and the mechanisms driving the rates (mode) (Kinnison and Hendry 2001; Simpson 1944). Quantitative characterization of these components is not only important for understanding the evolution, but may also have practical implications for microbial ecology. For example, knowing the mode of T opt evolution in a microbial lineage may help constrain the expected T opt values of uncultured organisms. This would be particularly useful for the organisms within thermophilic lineages as these organisms are typically detected in hydrothermal environments characterized by steep temperature gradients. For example, a few grams of a sample from the wall of a marine hydrothermal chimney may comprise micro niches with temperatures ranging from 0 to more than 300°C (Pagé et al. 2008). Thus, predicting T opt of detected organisms from a temperature measurement of a bulk sample, if available at all, would be uninformative.

Here, we analyze the mode of T opt evolution within the Thermotogaceae family, which has a large number of cultured type strains distributed in 7 described genera. Members of Thermotogaceae are moderately thermophilic to hyperthermophilic, fermentative bacteria, typically isolated from hydrothermal systems. Our findings are consistent with the hypothesis that T opt evolves according to a Brownian motion evolutionary model. We show how phenotypes of uncultured members of Thermotogaceae can be predicted based on known trait values of cultured relatives.

Data and methods

Database construction

We developed the software SPOT (sequence and phenotype organizing tool) to construct a database of 16S rRNA gene sequence information and T opt information. SPOT, along with a user’s manual and the dataset analyzed here, can be downloaded from http://webber.uib.no/geobio/spot. Phenotypic data were compiled from the literature as follows: Fervidobacterium pennavorans (Friedrich and Antranikian 1996), Fervidobacterium changbaicum (Cai et al. 2007), Fervidobacterium gondwanense (Andrews and Patel 1996), Fervidobacterium islandicum (Huber et al. 1990), Fervidobacterium nodosum (Patel et al. 1985), Geotoga petraea (Davey et al. 1993), Geotoga subterranea (Davey et al. 1993), Kosmotoga olearia (DiPippo et al. 2009), Marinitoga camini (Wery et al. 2001), Marinitoga hydrogenitolerans (Postec et al. 2005), Marinitoga piezophila (Alain et al. 2002), Marinitoga okinawensis (Nunoura et al. 2007), Petrotoga mexicana (Miranda-Tello et al. 2004), Petrotoga miotherma (Davey et al. 1993), Petrotoga mobilis (Lien et al. 1998), Petrotoga olearia (L’Haridon et al. 2002), Petrotoga sibirica (L’Haridon et al. 2002), Petrotoga halophila (Miranda-Tello et al. 2007), Thermosipho africanus (Huber et al. 1989), Thermosipho atlanticus (Urios et al. 2004), Thermosipho geolei (L’Haridon et al. 2001), Thermosipho japonicus (Takai and Horikoshi 2000), Thermosipho melanesiensis (Antoine et al. 1997), Thermotoga elfii (Ravot et al. 1995), Thermotoga hypogea (Fardeau et al. 1997), Thermotoga lettingae (Balk et al. 2002), Thermotoga maritima (Huber et al. 1986), Thermotoga naphthophila (Takahata et al. 2001), Thermotoga neapolitana (Jannasch et al. 1988), Thermotoga petrophila (Takahata et al. 2001), Thermotoga subterranea (Jeanthon et al. 1995), Thermotoga thermarum (Windberger et al. 1989), and Kosmotoga shengliensis (Feng et al. 2010; Nunoura et al. 2010). Accession numbers of 16S rRNA gene sequences together with corresponding organism names and T opt values are given in Supplementary Table 1.

Brownian motion simulation

Temporal evolution of a single trait can be modeled as a Gaussian random walk, which assumes that the long-term dynamics of an evolving population is governed by the mean and variance of the distribution of evolutionary steps. In the limit of very small steps, the random walk approaches a one-dimensional Wiener process, and the latter is commonly referred to as Brownian motion model in the evolutionary biology literature (Felsenstein 1985). We simulated one-dimensional Brownian motions (BM) by calculating the cumulative sum of random deviates drawn from a normal distribution, whose mean and standard deviation are the governing parameters, describing the rate (standard deviation) and directionality (mean) of evolution in a phenotypic trait. We define a BM model as equivalent to an unbiased random walk with a mean of zero. Modeling evolutionary changes as random steps is motivated by our uncertainty regarding the role of different microevolutionary processes and the need to minimize mechanistic assumptions, but does not imply that evolution is random with respect to underlying causal factors.
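The simulation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `simulate_bm` and the seed are assumptions, while the step parameters (mean 0, standard deviation 0.1, 50,000 steps) follow the values given in the legend of Fig. 1.

```python
import random
from itertools import accumulate


def simulate_bm(n_steps, sd, mean=0.0, seed=None):
    """Return one random-walk trajectory of length n_steps + 1, starting at 0.

    Each step is a normal deviate; sd sets the rate and mean the
    directionality. mean=0 gives the unbiased walk called BM in the text.
    """
    rng = random.Random(seed)
    steps = (rng.gauss(mean, sd) for _ in range(n_steps))
    # The trait value at each time step is the cumulative sum of the steps.
    return [0.0] + list(accumulate(steps))


# Hypothetical usage with the parameters quoted in the Fig. 1 legend.
trajectory = simulate_bm(n_steps=50_000, sd=0.1, seed=1)
print(len(trajectory), trajectory[-1])
```

Setting `mean` to a non-zero value would turn the same function into a directional (biased) walk, which is why the mean is kept as an explicit parameter.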

Linking T opt variability with evolutionary distance

SPOT links an evolutionary distance matrix with phenotypic variability and produces a tab-delimited output file including organism names, distances, and differences in T opt (∆T opt) for all unique pairs of organisms. Whenever T opt values were reported as a range, differences in T opt values between two strains were based on midpoint values. 16S rRNA distance matrices were constructed using the following approach: 16S rRNA gene sequences from all strains were aligned using the SINA webaligner (Pruesse et al. 2007) (http://www.arb-silva.de/aligner/). A distance matrix was generated from the alignment (E. coli positions 43–1371) using DNADIST with Jukes–Cantor correction and otherwise default settings as implemented in ARB (Ludwig et al. 2004). Comparisons involving branch lengths obtained from phylogenetic trees were done using in-house Perl scripts.
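The pairing step can be illustrated with a short Python sketch. It is not SPOT itself: the function names, the strain labels, and all numerical values below are hypothetical, and `jukes_cantor` only restates the standard Jukes–Cantor formula, d = −(3/4) ln(1 − 4p/3), rather than reproducing DNADIST.

```python
import math
from itertools import combinations


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def midpoint(t_opt):
    """Collapse a (low, high) range to its midpoint; pass single values through."""
    if isinstance(t_opt, tuple):
        return (t_opt[0] + t_opt[1]) / 2.0
    return float(t_opt)


def pair_table(t_opt, dist):
    """Return (name_a, name_b, distance, delta_T_opt) for every unique pair."""
    rows = []
    for a, b in combinations(sorted(t_opt), 2):
        d = dist[frozenset((a, b))]
        rows.append((a, b, d, midpoint(t_opt[a]) - midpoint(t_opt[b])))
    return rows


# Hypothetical strains, T opt values (deg C) and distances, for illustration only.
T_OPT_EXAMPLE = {"strain_A": 80.0, "strain_B": (65.0, 70.0), "strain_C": 55.0}
DIST_EXAMPLE = {
    frozenset(("strain_A", "strain_B")): 0.05,
    frozenset(("strain_A", "strain_C")): 0.12,
    frozenset(("strain_B", "strain_C")): 0.10,
}

for row in pair_table(T_OPT_EXAMPLE, DIST_EXAMPLE):
    print("\t".join(str(field) for field in row))
```

With n organisms the table has n(n − 1)/2 rows, which for 33 organisms gives the 528 unique pairs analyzed in this study.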

Construction of phylogenetic trees and calculation of independent contrasts

The SINA-generated 16S rRNA gene sequence alignment (see above) was exported to ARB (version 5.0) (Ludwig et al. 2004), where trees were produced using three different algorithms: (1) maximum likelihood by PhyML (Guindon and Gascuel 2003) applying the Hasegawa–Kishino–Yano nucleotide substitution model and where a discrete-gamma model (Yang et al. 1994) is implemented to accommodate rate variation among sites (four substitution rate categories were used and the gamma distribution parameter was estimated by maximizing the likelihood of the phylogeny), (2) maximum parsimony by Phylip DNAPARS (Felsenstein 1989), and (3) neighbor joining (Saitou and Nei 1987) with Jukes–Cantor (Jukes and Cantor 1969) correction. All trees were constructed from the same alignment (E. coli positions 43–1371) and by applying a bacterial positional variability filter (pos_var_Bacteria_102), leaving 1357 valid columns. For other parameters in the tree constructions, default settings were used.

Trees were exported from ARB in Newick format and converted to Nexus trees using in-house Perl scripts. Nexus trees were imported by Mesquite (version 2.72) (http://mesquiteproject.org/mesquite/mesquite.html), where phylogenetically independent contrasts (IC) (Felsenstein 1985) were calculated using the PDAP (Garland et al. 1999) module (version 1.14). In PDAP, phenotypic data from tip organisms in a phylogenetic tree are used to estimate the phenotypes of internal nodes. IC are calculated as (T opt(node1) − T opt(node2)) × (L branch)^(−0.5), where node1 and node2 are two sister nodes in the tree and L branch is a corrected branch length between the sister nodes (see Garland et al. (2005) for a worked example).
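For a single pair of sister nodes, the contrast calculation can be sketched as follows. This is an illustration of the standard independent-contrasts recursion (Felsenstein 1985), not the PDAP implementation; the function name `contrast` and the numerical values are hypothetical, and L branch is taken here to be the sum of the two (corrected) branch lengths leading to the sister nodes.

```python
import math


def contrast(x1, v1, x2, v2):
    """Standardized contrast for two sister nodes.

    x1, x2: trait values (e.g. T opt) of the sister nodes.
    v1, v2: branch lengths leading to them from their common ancestor.
    Returns (contrast, ancestral estimate, extra length to add to the
    branch below the ancestor when the recursion moves down the tree).
    """
    ic = (x1 - x2) / math.sqrt(v1 + v2)
    # The ancestral value is a weighted mean; shorter branches weigh more.
    ancestor = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    extra = v1 * v2 / (v1 + v2)
    return ic, ancestor, extra


# Hypothetical sister tips: T opt of 80 and 70 deg C, branch lengths 0.02 each.
ic, ancestor, extra = contrast(80.0, 0.02, 70.0, 0.02)
print(ic, ancestor, extra)
```

Repeating this step from the tips toward the root yields one contrast per internal node, so a fully resolved tree of 33 tips gives 32 contrasts.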

A possible method for prediction of trait values of uncultured organisms

Generally, if we let X i be the value of a continuous phenotypic trait X in strain i, then a BM evolutionary model predicts that at a certain evolutionary distance (D), independent observations of (X i − X j) are normally distributed with expectation zero and variance proportional to D (Felsenstein 1985). Thus, observations of (X i − X j) at different distances can be standardized by dividing by the square root of the distance, obtaining values of S = (X i − X j) × D^(−0.5), which are also normally distributed with expectation 0 and variance σ^2. |S| will have a half-normal distribution with a standard deviation, s, which can be estimated from a sample of independent pairs of organisms with known trait values and known distances. A given value of s can further be used to estimate σ from the relationship s^2 = σ^2 (1 − 2/π). Given a 95% confidence interval (−1.96σ, 1.96σ) of S, the confidence interval of (X i − X j) is [−1.96σD^0.5, 1.96σD^0.5]. Therefore, if we have a pair of organisms (i, j) with known evolutionary distance D, but where only one of the trait values (X j) is known, the 95% confidence interval for the unknown trait value (X i) can be expressed as

$$ X_{\text{i}} = X_{\text{j}} \pm 1.96\,\sigma\,D^{0.5} \qquad (1) $$
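The estimation of σ and the interval in Eq. 1 can be sketched in Python as below. The function names are assumptions introduced for illustration, and the example call combines σ = 26.3 (the non-phylogenetic estimate reported later in the text) with a hypothetical known T opt of 70°C and a hypothetical distance of 0.10.

```python
import math
import statistics


def estimate_sigma(pairs):
    """Estimate sigma from (trait difference, distance) pairs.

    Uses the half-normal relationship s^2 = sigma^2 (1 - 2/pi), where s is
    the standard deviation of |S| and S = difference / sqrt(distance).
    """
    abs_s = [abs(dx) / math.sqrt(d) for dx, d in pairs]
    s = statistics.stdev(abs_s)
    return s / math.sqrt(1.0 - 2.0 / math.pi)


def predict_interval(x_known, distance, sigma, z=1.96):
    """Confidence interval for the unknown trait value (Eq. 1)."""
    half_width = z * sigma * math.sqrt(distance)
    return x_known - half_width, x_known + half_width


# Hypothetical query: known relative at 70 deg C, distance 0.10, sigma 26.3.
low, high = predict_interval(70.0, 0.10, 26.3)
print(f"{low:.1f} to {high:.1f}")
```

Because the half-width grows with the square root of D, halving the distance to the closest cultured relative narrows the interval by a factor of about 1.4 rather than 2.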

Cross-validation

Cross-validation of each predictive equation was performed by iterative re-sampling (5,000 bootstrap replicates) of the original dataset, randomly assigning species into a training set (17 species) and a validation set (16 species). In each iteration, σ values were estimated from the random training set and used in Eq. 1. Next, each organism in the validation set was paired with a random organism in the training set and their observed T opt difference (∆T opt) was compared with the predicted value. A prediction was considered to be erroneous when 1.96σD^0.5 < |∆T opt|, i.e., when the observed difference fell outside the predicted 95% confidence interval. Error rates were defined as the total proportion of erroneous predictions.
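The re-sampling scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure as implemented by the authors: the function names are hypothetical, the toy dataset is generated by placing species along a single simulated Brownian trajectory (not a real phylogeny), and one error rate is recorded per iteration.

```python
import math
import random
import statistics
from itertools import combinations


def estimate_sigma(pairs):
    """Sigma from (difference, distance) pairs via s^2 = sigma^2 (1 - 2/pi)."""
    abs_s = [abs(dx) / math.sqrt(d) for dx, d in pairs]
    return statistics.stdev(abs_s) / math.sqrt(1.0 - 2.0 / math.pi)


def is_error(sigma, distance, delta, z=1.96):
    """True when the observed difference falls outside the predicted interval."""
    return z * sigma * math.sqrt(distance) < abs(delta)


def cross_validate(t_opt, dist, n_train, n_iter, seed=None):
    """Return one validation error rate per random training/validation split."""
    rng = random.Random(seed)
    names = sorted(t_opt)
    error_rates = []
    for _ in range(n_iter):
        shuffled = rng.sample(names, len(names))
        train, valid = shuffled[:n_train], shuffled[n_train:]
        sigma = estimate_sigma(
            (t_opt[a] - t_opt[b], dist(a, b)) for a, b in combinations(train, 2)
        )
        errors = 0
        for v in valid:
            partner = rng.choice(train)  # random training organism
            errors += is_error(sigma, dist(v, partner), t_opt[v] - t_opt[partner])
        error_rates.append(errors / len(valid))
    return error_rates


def toy_dataset(n_species, sigma, seed):
    """Hypothetical data: species sampled along one Brownian trajectory."""
    rng = random.Random(seed)
    times = sorted(rng.uniform(0.0, 0.3) for _ in range(n_species))
    t_opt, position, value, last = {}, {}, 65.0, 0.0
    for i, t in enumerate(times):
        value += rng.gauss(0.0, sigma * math.sqrt(t - last))
        last = t
        t_opt[f"sp{i:02d}"] = value
        position[f"sp{i:02d}"] = t
    return t_opt, (lambda a, b: abs(position[a] - position[b]))


t_opt, dist = toy_dataset(n_species=33, sigma=30.0, seed=1)
rates = cross_validate(t_opt, dist, n_train=17, n_iter=200, seed=2)
print("median error rate:", statistics.median(rates))
```

The median is printed because, as with the real data, the distribution of per-iteration error rates need not be symmetric.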

Results and discussion

The mode of T opt evolution in Thermotogaceae

Following Gingerich (1993, 2009), we use the distribution of log rate against log distance, i.e., a log-rate-interval (LRI) plot, to assess the mode of evolution (e.g., random, directional, or static). In the case of BM simulations (Fig. 1a–c), the slope (Fig. 1c) in the LRI plot is close to the theoretical expectation of −0.5, and the intercept value is close to the true rate used to generate the time series. On a normal scale (Fig. 1b), rates of change over time appear low and almost constant across a wide range of distances, but with a sharp increase when the time of divergence is very short (i.e., close relatives). This pattern is expected under a BM model, because short time intervals can show rapid directional change, whereas long time intervals incorporate more fluctuations and reversals, which tend to reduce the net change, and thus the apparent rate (Gingerich 1993; Sadler 1981). The precision of the slope estimate is a function of the number of observations, and to establish a significance criterion we calculated the mean and 95% confidence interval of the BM slope by repeated random walk simulations (5,000 bootstrap replicates), given the same number of observations as the Thermotogaceae dataset. The resulting confidence interval on the BM slope [−0.644, −0.363] can be used to test the null hypothesis that the observed T opt evolution does not significantly deviate from that expected under a BM model.
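A minimal Python sketch of the LRI slope calculation is given below. It is illustrative only: the function names are hypothetical, and for speed the simulated net change over an interval of ∆t steps is drawn directly from a normal distribution with standard deviation sd × √∆t (which has the same distribution as the sum of ∆t independent steps) instead of sampling pairs from stored trajectories as in Fig. 1. The sampling of intervals also differs from that used in the study, so the spread of slopes is not expected to reproduce the confidence interval quoted above.

```python
import math
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def lri_slope(pairs):
    """Slope of log10(rate) against log10(interval).

    pairs: iterable of (absolute change, interval). Zero changes are
    skipped because their logarithm is undefined.
    """
    xs, ys = [], []
    for change, interval in pairs:
        if change > 0 and interval > 0:
            xs.append(math.log10(interval))
            ys.append(math.log10(change / interval))
    return ols_slope(xs, ys)


def simulate_pairs(n_pairs, sd, max_interval, seed=None):
    """Draw (|dx|, dt) pairs from an unbiased Brownian motion.

    Assumption: intervals are spread evenly on a log scale between 1 and
    max_interval time steps.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        dt = int(10 ** rng.uniform(0.0, math.log10(max_interval)))
        dx = rng.gauss(0.0, sd * math.sqrt(dt))
        pairs.append((abs(dx), dt))
    return pairs


slope = lri_slope(simulate_pairs(n_pairs=528, sd=0.1, max_interval=50_000, seed=1))
print(f"LRI slope: {slope:.3f}")
```

As a sanity check on the method, changes that grow exactly as the square root of the interval give a slope of −0.5, changes that do not grow at all (stasis) give −1, and changes proportional to the interval (directional change) give 0.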

Fig. 1

Log-rate-interval analysis of evolutionary mode. a Distribution of absolute differences (|∆x|) between simulated trait values as a function of time interval (∆t) over which the difference is measured, from a set of 100 simulated evolutionary trajectories. Each trajectory is modeled as a one-dimensional Brownian motion over 50,000 time steps (generations) by calculating the cumulative sum of random deviates drawn from a Gaussian distribution with a mean of zero and standard deviation of 0.1. Pairwise differences are sampled randomly to match the number of pairs in the Thermotogaceae dataset. Blue line represents the 95% confidence interval as derived from Eq. 1. b The same data as in a, but plotted as rates (change per time interval) against time interval, showing the characteristic abrupt increase in rates at very short time intervals. c On a log–log scale, the rates tend to follow a straight line. If we estimate the slope of that line repeatedly under iterated random sampling (5,000 replicates), we obtain a mean slope (−0.5; red) close to the theoretical expectation for a Brownian motion (−0.5), with a 95% confidence interval (CI) of [−0.644, −0.363]. d Absolute pairwise differences in optimal growth temperature (∆T opt) in the Thermotogaceae, as a function of evolutionary distance (D = 16Sdist) between pairs (black). Independent contrasts (green) are plotted as a function of distance (D = branch length) based on three different algorithms: maximum likelihood (IC ML), maximum parsimony (IC MP), and neighbor joining (IC NJ). Blue line represents 95% confidence interval as derived from Eq. 1. e Same data as in d, but plotted as rates against evolutionary distance. As in the simulated data (b), rates increase abruptly at short distances. f LRI plot of the data in e, omitting pairs for which the ∆T opt value is zero. To account for inaccuracy and binning of the reported T opt values, we added random noise on the interval [−2.5, 2.5] to T opt and estimated the mean (−0.493; red) and 95% confidence interval (CI = [−0.542, −0.442]) of the resulting distribution of LRI slopes (5,000 replicates). The range of estimated slopes is consistent with a Brownian motion process (c). The three different independent contrasts reconstructions give different slopes (IC ML = −0.808, IC MP = −0.671, IC NJ = −0.397; not shown), all of which fall within the 95% confidence interval of a Brownian motion [−0.948, −0.067], for the same number of contrasts (N = 32).

Our dataset contains 33 members of Thermotogaceae, yielding 528 unique pairs of organisms, with corresponding DNADIST-derived evolutionary distances (16Sdist) and differences in T opt (∆T opt) (Fig. 1d). For each pair, we calculated an average rate of T opt evolution defined as ∆T opt/16Sdist, which shows the characteristic abrupt increase in rate at small distances (Fig. 1e). Inaccuracies in the reported T opt values, many of which are binned into 5-degree intervals, would pose a serious problem only if that inaccuracy varied in a systematic manner on the phylogeny, e.g., if T opt were measured differently within one subgroup of Thermotogaceae than in other subgroups. We do not have any reason to suspect that this is the case. Instead, we assume that over- and underestimation of T opt is randomly distributed over the phylogeny. However, binning may have an effect on the LRI analysis by underestimating T opt variance, potentially biasing the slope estimates toward stasis. In our view, true T opt values are likely to fall within a 5-degree interval around the reported value. To account for this uncertainty, we added noise to the T opt values in the form of uniformly random deviates drawn from the interval [−2.5, 2.5]. This allows us to calculate a mean LRI slope (−0.493) and 95% confidence interval [−0.542, −0.442] by bootstrapping (5,000 replicates), which more realistically accounts for the inaccuracy of the T opt data (Fig. 1f). The range of estimated LRI slopes for Thermotogaceae falls well within the 95% confidence interval for the BM model, consistent with the hypothesis that T opt evolution occurs as BM.
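The noise-addition step can be sketched in Python as follows. All names and data here are hypothetical: the six strains, their binned T opt values, and their positions on a line (used only to generate toy pairwise distances) are invented for illustration, and the percentile interval is one simple way of summarizing the replicate slopes.

```python
import math
import random
from itertools import combinations


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)


def lri_slope(t_opt, dist):
    """LRI slope over all unique pairs, omitting pairs with zero difference."""
    xs, ys = [], []
    for a, b in combinations(sorted(t_opt), 2):
        delta, d = abs(t_opt[a] - t_opt[b]), dist(a, b)
        if delta > 0:
            xs.append(math.log10(d))
            ys.append(math.log10(delta / d))
    return ols_slope(xs, ys)


def jitter(t_opt, half_width, rng):
    """Add uniform noise on [-half_width, half_width] to every T opt value."""
    return {k: t + rng.uniform(-half_width, half_width) for k, t in t_opt.items()}


def percentile_interval(values, level=0.95):
    """Central interval of the sorted replicate values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[int(tail * last)], ordered[int((1.0 - tail) * last)]


# Hypothetical binned T opt values (deg C) and toy positions defining distances.
T_OPT = {"A": 55.0, "B": 60.0, "C": 65.0, "D": 70.0, "E": 80.0, "F": 80.0}
POSITION = {"A": 0.0, "B": 0.03, "C": 0.08, "D": 0.15, "E": 0.27, "F": 0.30}


def toy_dist(a, b):
    return abs(POSITION[a] - POSITION[b])


rng = random.Random(1)
slopes = [lri_slope(jitter(T_OPT, 2.5, rng), toy_dist) for _ in range(1000)]
low, high = percentile_interval(slopes)
print(f"mean slope {sum(slopes) / len(slopes):.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Note that the pair E–F has ∆T opt = 0 in the binned data and would be dropped from an LRI plot; after jittering it contributes a small non-zero difference, which is one way in which the noise counteracts the effect of binning.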

A problem with applying LRI to all possible pairs is that different pairs of organisms are not independent because related organisms share parts of their evolutionary history (phylogenetic autocorrelation). In order to evaluate whether the LRI results are biased by such autocorrelation, we performed the LRI analysis on independent contrasts (IC) generated by three different algorithms (Fig. 1d–f). Since the number of IC is much smaller than the number of pairs in the DNADIST-based dataset, the BM 95% confidence interval is correspondingly much wider [−0.948, −0.067]. The LRI slopes for all three IC reconstructions fall within that interval. Moreover, analyses in PDAP (Garland et al. 1999) showed no significant correlation between IC and the square root of branch length, further supporting our conclusion.

A framework for predicting T opt in Thermotogaceae

The distribution in Fig. 1d can be used to quantitatively constrain T opt predictions. For example, all 74 pairs observed to have 16Sdist <0.10 show ∆T opt <10°C. Intuitively, one might infer that the T opt of an uncultured organism is likely to differ by <10°C from the T opt of a cultured close relative given that 16Sdist <0.10. The goal of this study is to assign quantitative measures of confidence on such predictions. LRI analysis shows that the evolution of T opt in Thermotogaceae is consistent with a BM model. We can therefore take advantage of the properties of the BM model to obtain a general prediction of T opt according to Eq. 1 by letting X represent T opt values. As explained earlier, σ can be estimated in two ways: (1) a non-phylogenetic approach based on the values of (X k − X l) × D^(−0.5), where X k and X l represent known T opt values for two strains k and l, respectively, and D is 16Sdist. Based on all 528 unique pairs, we found an estimated value of σ = 26.3. (2) A phylogenetic approach, whereby σ can be estimated from the square root of the variance of IC values. Depending on the algorithm used to obtain the evolutionary distances, this approach yields σ estimates in the range 29.8–33.3 (Table 1). The consistency between the different σ estimates is encouraging given the different assumptions and underlying distance measures.
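As a purely illustrative worked example of Eq. 1, take the non-phylogenetic estimate σ = 26.3 and a distance of D = 0.10; the reference T opt of 70°C used below is a hypothetical value chosen for the arithmetic and does not refer to any particular strain. The half-width of the 95% confidence interval is then

$$ 1.96\,\sigma\,D^{0.5} = 1.96 \times 26.3 \times \sqrt{0.10} \approx 16.3\,^{\circ}\text{C} $$

so the predicted interval for the unknown T opt would be approximately 53.7 to 86.3°C.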

Table 1 Overview of T opt prediction equations generated from four different approaches to measuring evolutionary distance

Cross-validation

Cross-validation by bootstrapping shows that the sigma values are robust to subsampling and stable regardless of the method used (Table 1). Furthermore, median T opt prediction error rates are generally lower than the theoretical expectation of 0.05 (Table 1). Note that the error rate distributions are highly skewed (hence we report median values) with rare instances of large error rates (>0.25).

Constraining the T opt in ‘mesotogas’

Currently, all cultivated isolates of the Thermotogaceae family are either moderate thermophiles, thermophiles, or hyperthermophiles. However, 16S rRNA gene sequences from members of this family have been detected in mesothermic environments or enrichment cultures. The corresponding lineages have informally been designated as ‘mesotogas’ (Nesbø et al. 2006, 2010). In the absence of a mesophilic isolate, their existence remains unproven. Nevertheless, we used the equations in Table 1 to address the following questions: are uncultured mesotogas unlikely to be mesophiles given their relatedness to thermophilic organisms? Are all Thermotogaceae detected in mesothermic environments likely to be mesophiles? Here, we analyzed a dataset recently published by Nesbø et al. (2010) describing five putative lineages of mesotogas. Four of these (M1, M2, M4, and M5) are represented by at least one near full-length 16S rRNA gene sequence. Using the equations in Table 1, we predicted confidence intervals of T opt values of mesotoga members given their evolutionary distance to the closest cultured relative (Table 2). The predictions are of limited precision because the mesotogas are only distantly related to the described species (DNADIST and branch length values were in the range of 0.11–0.26). Yet, our results indicate that it is reasonable to assume that mesotoga lineages M4 and M5 comprise mesophilic organisms, whereas lineages M1 and M2 seem to be at least moderately thermophilic. It is still possible that lineages M1 and M2 are mesophilic and have temperature optima lower than predicted (Table 2), but if they are, the rate of T opt evolution in the branches leading to these lineages is unexpectedly high under a BM model.

Table 2 Predicted confidence intervals of T opt from selected members of ‘mesotogas’

Conclusion

The evolution of T opt within Thermotogaceae is consistent with a Brownian motion model of phenotypic evolution. Based on this model, we developed a general method for T opt prediction of uncultured members of this family. Cross-validation shows that the predictions are accurate (low error rates), stable under different phylogenetic reconstructions, and robust to taxonomic sampling. Similar analyses can be performed on any continuous characteristic (e.g., regarding pressure, pH, salinity, growth rate, genome size, and % GC content) and within any microbial group, which would greatly enhance the value of functional studies and provide valuable insight into evolutionary dynamics.