Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

Arias, Pablo Millán; Butler, Joseph; Randhawa, Gurjit S.; Soltysiak, Maximillian P. M.; Hill, Kathleen A.; Kari, Lila

doi:10.1038/s41598-023-42518-y

Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

Article
Open access
Published: 26 September 2023

Volume 13, article number 16105, (2023)
Cite this article

Download PDF

You have full access to this open access article

Scientific Reports

Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

Download PDF

Pablo Millán Arias¹^na1,
Joseph Butler²^na1,
Gurjit S. Randhawa³,
Maximillian P. M. Soltysiak²,
Kathleen A. Hill² &
…
Lila Kari¹

1932 Accesses
87 Altmetric
15 Mentions
Explore all metrics

Abstract

This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of $\sim 700$ extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, $1\le k \le 6$. The supervised learning resulted in high accuracies for taxonomic classifications at $2\le k \le 6$, and medium to medium-high accuracies for environment category classifications of the same datasets at $3\le k \le 6$. For $k=3$, our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.

The Dynamic Genomes of Acidophiles

The Evolutionary Space Model to be Used for the Metagenomic Analysis of Molecular and Adaptive Evolution in the Bacterial Communities

Towards the biogeography of prokaryotic genes

Article 15 December 2021

Introduction

Life exists in the most unusual and extreme environments on our planet. Biodiversity exists in environments such as volcanoes, deep-sea trenches, and polar regions, that are characterized by extreme physical conditions (temperature, radiation, pressure, salinity, pH, etc.), and pose unique challenges to life¹. Organisms that are able to survive, and sometimes thrive, in extreme conditions are known as extremophiles, characterized by their peculiar phenotypic adaptations and yet underexplored in their genome composition. The study of the genome organization and diversity of extremophiles is of particular interest, as it can provide insights into the mechanisms of adaptation and the evolution of biodiversity in extreme environmental conditions^2,3,4. In particular, the study of microbial extremophiles has yielded important research reagents and gained popularity more recently due to their potential applications in biorefineries⁵, as sources of industrially-relevant biocatalysts⁶, and due to explorations of microbial dormancy and survivability in outer space^7,8. This study uses several machine learning algorithms and an alignment-free methodology to uncover evidence suggesting that microbial adaptations to extreme temperatures and pH conditions imprint a discernible environmental component in their genomic signatures.

As it has been previously observed, extremophiles have developed a wide range of structural, biochemical and metabolic strategies that support cell viability in high-stress environments, and there is evidence that the adaptive mechanisms arising with convergent evolution of extremophilic taxa can be observed at the proteomic and genomic levels⁹. At the proteomic level, diverse organisms living in certain extreme environments have a strong amino acid compositional bias, attributed in part to convergent proteomic adaptations^3,10,11. At the genomic level, codon usage patterns in the genomes of different extremophilic organisms are linked to the physiochemical characteristics imposed by selective pressures experienced in their respective environments^2,12,13. In addition to the localized influences of the selection pressures upon the proteome, open reading frame sequence composition is also influenced by nucleic acid level adaptations associated with structural functions¹⁴. In particular, high Guanine+Cytosine (G+C) content was observed to be correlated with DNA stability in thermophiles of diverse taxa¹⁵ and is the major factor influencing tRNA stability in hyperthermophiles³, while the fraction of Adenine+Guanine (A+G) content in coding DNA was observed to be correlated with the optimal growth temperature (OGT) in thermophiles^16,17,18. These observations all suggest the hypothesis of a correlation between adaptation to extreme environments and specific genome composition patterns.

One way to approach the concept of genome composition is to study “genomic signatures”¹⁹, a general term used for a variety of quantitative measures, pervasive along a genome, that can be used to discriminate between genomes of different species²⁰. In the last two decades, numerous studies have confirmed the effectiveness of alignment-free methods that use genomic signatures for the purpose of genome analysis²¹, comparison²², and sensitive taxonomic classification^23,24,25, even without supervision²⁶. These findings confirmed the existence of a strong phylogenetic signal that is present in (genome-wide, pervasive) genomic signatures, and offered a different perspective complementing alignment-based taxonomic comparisons and distinctions. In particular, genomic signatures based on k-mer (subwords of length k) frequency profiles have been widely used to classify organisms at different taxonomic levels, from Kingdom to species subtypes^{23,24,25,27,28,29,30}.

These findings suggested that differences and similarities in genomic signatures can be attributed to phylogeny. However, the possibility exists that there are other contributors to the differentiating power of genomic signatures³¹. Of special interest are the genomes of extremophile microbes of diverse taxa, which present a unique case marked by convergent adaptations to extreme physical environments. As shared phenotypic adaptations can also impact genome sequence composition (i.e., codon usage, nucleotide biases), genomic signature analyses may identify instances of convergent evolution. Prior work on thermophiles has also suggested some genomic composition biases, such as dinucleotide¹⁹ or tetranucleotide³² frequencies, that are pervasive across the genome. In addition, other research suggests that amino acid compositional biases and codon usage patterns in coding regions may contribute to the existence of a detectable pervasive genomic signature, strong enough to differentiate between taxonomically related organisms that live at opposite environmental extremes^10,18.

This paper provides comprehensive quantitative evidence suggesting that adaptation to extreme temperatures or pH introduces a discernible environmental component in the genomic signature of microbial extremophiles. Herein, a genomic signature is defined as the k-mer frequency vector of a 500 kbp DNA fragment, arbitrarily selected to represent a genome, where k is a fixed positive integer $1\le k \le 6$. This hypothesis was tested using both supervised and unsupervised machine learning algorithms, on a prokaryote dataset comprising 693 high-quality genomes from bacterial or archaeal organisms adapted to extreme temperature conditions, or extreme pH conditions. Supervised machine learning has proved effective in using genomic signatures for taxonomic classification, on data that was problematic for traditional alignment-based algorithms due to its sparseness, complexity, and high dimensionality^24,25,28. Thus, the first approach was to use supervised machine learning methods to learn the taxonomic and potential environmental components of genomic signatures. To this end, several supervised learning algorithms were trained on genomic signatures labelled with either taxonomic labels or with environment category labels. Each classifier was then used to obtain a taxonomic classification (if it was trained using taxonomic labels), or an environment category classification (if it was trained using environment category labels). The classification accuracies obtained were high for the taxonomic classification, and medium to medium-high for the environment category classification, suggesting the presence of an environmental component in the genomic signature in addition to its taxonomic component. For further insight, interpretability tools of supervised learning were used to determine the features of the genomic signatures (specific k-mers) that were most relevant to the environment category classification, and our findings were compared with existing literature on codon usage and amino acid compositional bias in extremophiles.

The presence of an environmental component in the genomic signature was independently confirmed by an unsupervised clustering analysis of data, whereby the first step was to assess several unsupervised learning algorithms for their ability to learn the taxonomic structure of unlabelled data. The most performant clustering algorithms were then used to identify several candidate organisms, with similar genomic signatures in spite of large taxonomic differences. Of these, additional stringent tests based on supervised learning classifications, in challenging scenarios, identified exemplars of hyperthemophile bacteria and archaea whose genomic signatures were grouped together as similar, under all classification and clustering scenarios, by all machine learning algorithms used.

The main contributions of this paper are:

An extensive supervised machine learning analysis of a dataset, augmented with literature references and annotations, of $\sim$700 high-quality microbial extremophile genomes (temperature and pH), at various scales. The results suggest the presence of an environmental component in the genomic signature of microbial extremophiles (temperature, pH) for values $3 \le k \le 6$, in addition to a strong taxonomic component for values $2\le k \le 6$. Subsets of 3-mers that contribute to this enviromental component are also identified, together with an assessment of the relative importance of their contribution.
An unsupervised clustering-based analysis of the aforementioned dataset, providing independent support of the hypothesis of the presence of an environmental component in the genomic signature of these extremophiles. This analysis also identified a hyperthemophile bacterium, Thermocrinis ruber, and three hyperthermophile archaea, Pyrococcus furiosus, Thermococcus litoralis, and Pyrococcus chitonophagus, as exemplars whose genomic signatures are grouped together as similar, by all machine learning algorithms used, in spite of their vast taxonomic differences.

Overall, the results of machine learning analyses, corroborated in the exemplar cases by observations of shared characteristics of the isolating environments, suggest the existence of an environmental component that co-exists with a strong taxonomic component in the genomic signatures of organisms living in extreme temperatures or extreme pH conditions. To the best of our knowledge, this study is the most comprehensive examination to date of the genomic signature of prokaryotic extremophiles, at various scales, of a substantial, well-curated dataset of extremophile genomes.

Materials and methods

Datasets

The data in this study were collected through a systematic literature search focused on identifying extremophilic microbes adapted to environments of extreme temperature and extreme pH. The search was conducted on the PubMed Database (accessed September 2022) and Google Scholar (accessed September 2022) for primary research articles and reviews, and identified 768 microbial species or strains for which extremophilic characteristics were recorded. Subsequently, these species/strains were identified in the Genome Taxonomy Database (GTDB; release R207 April 8, 2022, accessed February 2023), the gold-standard database for taxonomy³³, and only GTDB species representative genomes with reported completeness of over 95%, and contamination of under 5% were selected. Species/strains were mapped to their identified extremophilic characteristic(s), along with genome assembly numbers provided by GTDB for each given organism. The extremophilic characteristic(s) was validated for each organism by searching PubMed with the given strain/species name, and identifying a primary article/review and/or reliable BacDive database (accessed February 2023) entry to confirm the accuracy of the characteristic(s). Entries lacking consistent observations related to the growth characteristics of the respective microbe were removed from the dataset.

For this study, we used the following definitions, based on the Optimal Growth Temperature (OGT), respectively Optimal Growth pH (OGpH): Psychrophile (OGT of < 20 $^{\circ }$C)³¹, mesophile (OGT of 20-45$^{\circ }$C)³¹, thermophile (OGT of 45-80$^{\circ }$C)³¹, and hyperthermophile (OGT of > 80$^{\circ }$C)³¹, acidophile (OGpH < pH 5)³¹ and alkaliphile (OGpH > pH 9)³¹. The dataset was then curated for 154 descriptors, so as to be in accordance with the temperature and pH intervals used in the above definitions. Fourteen entries could not be validated and were discarded from the dataset.

This selection process resulted in 693 annotated high-quality extremophile microbial genome assemblies. These high-quality assemblies were then used to form two datasets according to two extremophile characteristic(s), as follows. The first dataset, called the Temperature Dataset, is composed of 148 psychrophile genomes (8 archaeal, 140 bacterial), 190 mesophile genomes (84 archaeal, 106 bacterial), 183 thermophile genomes (67 archaeal, 116 bacteria), and 77 hyperthermophile genomes (70 archaeal, 7 bacterial) for a total of 598 organism genomes (229 archaeal, 369 bacterial) (Table 1). The second dataset, called the pH Dataset, is composed of 100 acidophile genomes (39 archaeal, 61 bacterial) and 86 alkaliphile genomes (30 archaeal, 56 bacterial), for a total of 186 organisms (69 archaeal, 117 bacterial) (Table 2). Note that 91 organisms were identified to belong to both the Temperature Dataset and the pH Dataset. The datasets are described in Supplementary Table S1, with assembly metadata provided in Supplementary Table S2. The proportions of both the Temperature Dataset and the pH Dataset are described in terms of genus, organized by domain and by environment category, and are described in greater detail in Supplementary Data S1. As well, phylogenetic trees organized by domain (Bacteria, Archaea) and environment category (temperature, pH) are accessible in Supplementary Data S2.

The selection of a genomic fragment s to represent the genome of an organism is a process that has to consider several factors, including fragment length, taxonomic level, and computational complexity of the algorithms used. For methods that rely on k-mer frequency for sequence classification, some studies^34,35 suggest the relation $k = log_{4}(|s|)$, where $|s|$ is the minimum length of sequence s that is necessary, in theory, to obtain statistical significance. However, in practice, longer sequences are needed. For example, another study²⁶ used sequence length of 500 kbp in conjunction with $k=6$ to cluster bacterial sequences at the family level, even though in theory a length of 4,096 bp would have sufficed for this value of k. In this study, each genome (assembly) was represented by a single, arbitrarily selected, 500 kbp DNA fragment. The values for k used in this study, namely $1\le k \le 6$, were empirically selected so as to balance the trade-off between classification accuracy and computational complexity, and to explore multiple scales of the k-mer analysis.

More precisely, a DNA fragment was arbitrarily selected to represent each DNA genome/assembly, as follows. First, the contigs of the assembly were sorted by length, from the longest to the shortest. Then, if the longest contig was longer than 500 kbp, then a 500 kbp fragment was randomly selected to be the representative DNA sequence for that genome. Otherwise, the sorted contigs were concatenated one by one, until the desired length of 500 kbp was reached, and this became the DNA representative sequence for that genome. The k-mers were counted starting from the beginning of the representative DNA sequence, by using a sliding window with step size 1. To avoid spurious k-mers that could arise from the concatenation of contigs, the N character was added as a separator between contigs or contig fragments, but no k-mers that contain N were considered when calculating k-mer counts. Also note that the inserted letters N were not counted towards the length of the DNA sequence representing each genome/assembly. To eliminate the variable of the strand orientation of the uploaded DNA sequences, the final k-mer frequency vector of a sequence was computed as the sum between the vector of its k-mer counts and the corresponding vector of k-mer counts of its reverse complement³⁶. In the remainder of this paper, a k-mer and its reverse complement will be considered to be indistinguishable, and only the canonical k-mer of a pair (the first, in alphabetical order, of the two reverse complementary k-mers) will be listed.

Table 1 Composition of the Temperature Dataset: 598 DNA fragments from microbial genomes/species (369 DNA fragments from bacterial genomes, and 229 DNA fragments from archaeal genomes).

Full size table

Table 2 Composition of the pH Dataset: 186 DNA fragments from microbial genomes/species (117 DNA fragments from bacterial genomes, and 69 DNA fragments from archaeal genomes).

Full size table

Sequence classification using supervised machine learning

To test the hypothesis of the existence of an environmental component in the genomic signature of microbial extremophiles, the two previously described datasets (Temperature, and pH) were classified using supervised machine learning algorithms, and the average accuracy of each classification was computed. For each dataset, computational experiments were performed using six different classifiers, and different values of k, as detailed below. In addition, for each computational experiment, three different scenarios for labelling the training dataset were analyzed, as follows:

(1)
All DNA sequences used in training were labelled taxonomically, by their domain (Bacteria or Archaea),
(2)
All DNA sequences used in training were labelled by their environment category (psychrophile, mesophile, acidophile, etc.),
(3)
All DNA sequences used in training were labelled with pseudo-labels sampled from a discrete uniform distribution. The discrete uniform distribution was Unif(0, 3) in the case of the Temperature Dataset (four possible labels), and respectively Unif(0, 1) in the case of the pH Dataset (two possible labels). This third scenario was introduced as a control, and it was expected to result in predictions of the correct pseudo-labels with probabilities equal to the sampling probability for each dataset. Note that an alternative sampling strategy would be to sample the pseudo-labels according to the distribution of the environment category labels in the dataset. The results associated with this alternative sampling strategy can be found in Supplementary Table S3.

The six different classifiers used for these classification tasks were selected as being representative algorithms of four main categories in the classification of DNA sequences. Support Vector Machines (SVM) were selected as a representative of Kernel Methods, with a radial basis function kernel³⁷. Random Forest was selected as a representative of Tree-Based Methods, with the Gini index as the classification criteria³⁸. The third algorithm was an Artificial Neural Network (ANN), with a simple and versatile architecture consisting of an input layer, two fully connected hidden layers, (512 and 64 neurons), each one followed by a Rectified Linear Unit (ReLU) and a Dropout layer with a dropout rate of 0.5, and an output layer. Lastly, a Digital Signal Processing framework²⁵ was considered, whereby pairwise distances between numerical representations of DNA sequences are computed and then used in conjunction with Linear Discriminant (MLDSP-1), with Quadratic SVM (MLDSP-2), or with Subspace Discriminant (MLDSP-3) machine learning algorithms.

Two different types of computational experiments were performed for each of the two datasets (Temperature and pH), supervised machine learning classifier (six classifiers), value of k ($1\le k \le 6$), and training data labelling (taxonomy, environment category, random).

In the first type of tests, called restriction-free, the predictive power of the algorithms was tested using standard stratified 10-fold cross-validation, as follows. The dataset was split into 10 distinct subsets, called folds, and a model was trained using 9 of the folds as training data; the resulting model was validated on the remaining part of the data (i.e., it was used as a test set to compute a performance measure such as accuracy). The performance measure reported by 10-fold cross-validation was calculated as the average of the classification accuracy for each of the 10 possible test sets.

The second type of tests, called restricted, or non-overlapping genera, was designed to address the possibility that a correct environment category label classification may be influenced by a contributing taxonomic component. For example, one goal was to ensure that a DNA sequence was not classified as a hyperthermophile simply due to its similarity to DNA sequences of the same genus, that happened to belong to the same hyperthermophile category. To this end, we adopted a modified 10-fold cross-validation approach, whereby all sequences of the same genus appeared in exactly one fold. At the same time, to align with the principles of stratified cross-validation, the distribution of the labels in each fold was kept the same as the distribution of the corresponding labels in the entire dataset. In this restricted (non-overlapping genera) scenario, if a DNA sequence is in the test set, then no other sequence of the same genus is present in the training set. This approach attempts to disentangle, at the genus level, the taxonomic component from the environmental component of the genomic signature.

As an independent method for assessing the environmental component of the genomic signature, we employed interpretability tools for machine learning methods. Global interpretability tools were preferred as they are useful in understanding the general mechanisms in the data through a global importance measure. Given the high correlation between the k-mers, the mean decrease in impurity (MDI) for Random Forest was selected as a k-mer global importance measure, and then used to learn the actual k-mers that were relevant to the environment category classification. (This measure was preferred over the widely adopted global-agnostic Permutation Feature Importance method, as that method is not suitable for handling highly correlated features³⁹.) The methodology used to determine the relevant k-mers is as follows. First, a one-vs-all classifier was trained for each environment category present in the dataset, using stratified 10-fold cross-validation. Second, the MDI algorithm was used to compute the global importance of each k-mer in each fold, and the average taken over all folds was used to create a ranked list of k-mers, in decreasing order of their contribution. Finally, for each environment category, the “most relevant subset of k-mers” was computed, defined as the subset of the ranked k-mer list that was sufficient to classify the dataset with the same classification accuracy as when all k-mers were used in that classification.

Unsupervised learning for sequence clustering

In unsupervised learning, no labels are provided for the DNA sequences in the dataset, and various algorithms are used to cluster similar genomic signatures, and to explore the structure of the space of the genomic signatures in the dataset.

Two groups of tests with unsupervised learning algorithms were performed in this study: parametric clustering algorithms (that take the number of expected clusters as an input parameter), and non-parametric clustering algorithms (that determine automatically the number of clusters). In the first group, four parametric clustering algorithms were used: K-means, Gaussian Mixture Model, K-medoids, and iDeLUCS⁴⁰. The computation of the cluster label assignments for each sequence in the Temperature and the pH Datasets was performed with various values of the parameter n_clusters (the expected number of clusters) in each algorithm, n_clusters $\in \{2,4,8\}$ for the Temperature Dataset, and respectively n_clusters $\in \{2,4\}$ for the pH Dataset, based on the number of potential true clusters in each dataset.

For each dataset, the strength of each of the two components of the signature (taxonomic, environmental) was assessed by comparing the clustering accuracies in two scenarios, the first where the clustering was assessed against the true taxonomic groups, and the second when the clustering was assessed against the true environment category groups. In each case, the performance was evaluated using the unsupervised clustering accuracy metric⁴¹, defined as:

$$\begin{aligned} ACC =\frac{\sum _{i=1}^{n} \textbf{1}[l_{i}=f\left( c_{i}\right) ]}{n}, \end{aligned}$$

(1)

where n is the total number of sequences and, for each $1\le i\le n$ and corresponding DNA sequence $x_i$ we have that: The true taxonomic label of $x_i$ is denoted by $l_i$; the numerical cluster label that the algorithm assigns to $x_i$ is denoted by $c_i$; an optimal mapping f, calculated for example by the Hungarian algorithm⁴², maps numerical cluster labels to true taxonomic labels (specifically, $f(c_i)$ denotes the true taxonomic label assigned by f to the cluster labelled $c_i$); and $\textbf{1}[k = i]\in \{0,1\}$ is an indicator function, equal to 1 if and only if $k = i$.

In the second type of test, we assessed whether the clusters of each dataset at the lowest possible taxonomic level (genus) can be discovered by non-parametric clustering algorithms. For this purpose, we used two non-parametric clustering algorithms, HDBSCAN⁴³ and iterative medoids²⁷, combined with three different dimensionality reduction techniques: Variational autoencoders (VAE)²⁷, Deep Contrastive Learning (CL) and Uniform Map Approximation (UMAP)⁴⁴. We also used iDeLUCS⁴⁰, which is semi-parametric, in the sense that its parameter n_clusters (herein $=300$) represents an upper limit of the number of clusters found by the algorithm. These seven clustering algorithms were used to recover the lowest taxonomic groups. The following metrics were defined to assess the quality of the found clusters: the completeness of each cluster (defined as the number of occurrences of the most common genus present in the cluster, divided by the total number of sequences of that genus in the dataset), and the contamination of each cluster (defined as the number of sequences that belong to the most common genus in the cluster, divided by the cluster size). The overall quality of each clustering algorithm was then calculated as the total number of clusters that are at least 50% complete, and at most 50% contaminated.

Results

Supervised machine learning analysis of the Temperature Dataset and the pH Dataset

Supervised classification by taxonomy, environment category, and random label assignment

Several supervised machine learning computational tests were performed to classify the Temperature Dataset and the pH Dataset, respectively, using (1) taxonomy labels (domain), (2) environment category labels, and (3) randomly assigned environment category labels (four for the Temperature Dataset, respectively two for the pH Dataset). More specifically, six supervised machine learning algorithms were used to classify the two datasets, for several k-mer lengths, $1\le k \le 6$. The classification tests were performed under two scenarios, (a) restriction-free, using stratified 10-fold cross-validation, and (b) restricted, using stratified 10-fold cross-validation with non-overlapping genera.

The classification accuracies for the restriction-free case are summarized in Table 3. For $k=6$, classifications using taxonomy labels for training resulted in high classification accuracies of over 97.49% for the Temperature Dataset and over 94.18% for the pH Dataset, across all six classification models. Classifications using environment category labels resulted in medium-high classification accuracies of over 77.59% for the Temperature Dataset and over 84.95% for the pH Dataset, across all six classification models. Classifications using randomly assigned labels resulted in the expected low accuracies of at most 28.09% for the Temperature Dataset and at most 50.06% for the pH Dataset, across all six classification models.

The classification accuracies for the restricted (non-overlapping genera) case are summarized in Table 4. For $k=6$, classifications using taxonomy labels for training resulted in high classification accuracies of over 95.30% for the Temperature Dataset and over 91.90% for the pH Dataset, across all six classification models. Classifications using environment category labels resulted in medium classification accuracies of over 61.90% for the Temperature Dataset and medium-high accuracies of over 79.24% for the pH Dataset, across all six classification models. Classifications using randomly assigned labels resulted in the expected low accuracies of at most 27.90% for the Temperature Dataset and at most 55.62% for the pH Dataset, across all six classification models.

In both the restriction-free and the restricted cases, the classification of genomic signatures for $k=1$ corresponds exactly to a classification based on the G+C content of the sequences (this is due to k-mers being counted from a DNA fragment together with its reverse complement). As seen from Table 3, the supervised classification accuracies for $k=1$ were relatively low for taxonomic classifications, and even lower for the environment category classifications. These results suggest that previous observations³ of high G+C content of archaeal tRNA sequences being correlated with DNA stability in high temperature environments ($\ge 60 ^\circ C)$ may not generalize to pervasive genomic signatures and to larger datasets. This inference is also supported by the single nucleotide composition summary for the datasets in this study, see Supplementary Data S3.

Overall, we first note that for both the Temperature Dataset and the pH Dataset, the classification accuracies improved with higher values of k. Second, we observe that, for both datasets, the classification accuracies obtained when using a random label assignment were approximately equal to the probabilities that a sequence had one of the environment category labels (around 25% in the case of the four temperature labels, and around 50% in the case of the two pH labels). Third, note that the classification accuracies in the restricted scenario were slightly lower than in the restriction-free scenario, for both the taxonomic and the environment category classifications. This decrease could be partly attributed to the decrease in the amount of training data in the restricted scenario. This being said, even in the restricted scenario, the environment category classification accuracies were significantly higher than those for the random label assignment scenario.

Most importantly, these supervised machine learning classification experiments suggest the presence of an environmental component in the genomic signature of temperature and pH microbial extremophiles, able to provide discriminating power for k values $3\le k\le 6$. This environmental component of the genomic signature appears to co-exist with a stronger taxonomic component, able to provide discriminating power for k values $2\le k\le 6$.

Table 3 Classification accuracies of six supervised learning classifiers trained on the Temperature Dataset and pH Dataset, in the restriction-free scenario, for three different label assignments (taxonomy, environment category, and random label assignment), and values of $1 \le k \le 6$. The classification accuracy in each cell is calculated using standard stratified 10-fold cross-validation.

Full size table

Table 4 Classification accuracies of six supervised learning classifiers trained on the Temperature Dataset and pH Dataset, in the restricted scenario, for three different label assignments (taxonomy, environment category, and random label assignment), and values of $1\le k \le 6$. The classification accuracy in each cell is calculated using stratified 10-fold cross-validation with non-overlapping genera.

Full size table

Sets of k-mers relevant to environment category classifications computed by interpretablity tool of a supervised learning algorithm

Of the six supervised classifiers used in the previous section, in this section we use the Mean Decrease in Impurity (MDI) algorithm for the Random Forest classifier to compute a global measure of feature importance. This serves as an interpretability tool that provides insight into the relative contribution of each feature (k-mer) to the successful classification.

To this end, we first conducted 10-fold cross-validation on a one-vs-all Random Forest classifier, which achieves a specific accuracy for each environment category. In the four computational experiments associated with the Temperature Dataset, the psychrophile category was correctly separated from the other sequences in the Temperature Dataset with 86.31% accuracy, the mesophile category with 71.14% accuracy, the thermophile category with 75.22% accuracy, and the hyperthermophile category with 89.62% accuracy. Similarly, in the two computational experiments associated with the pH Dataset, the alkaliphile category was classified with 86.45% accuracy, and the acidophile category with 83.76% accuracy.

We then used the trained models obtained in these computational experiments in conjunction with Random Forest’s interpretability tool, the MDI algorithm, to compute a global importance measure for each k-mer (for $k = 6$, the maximum value analyzed) to determine their relative contribution to the one-vs-all environment category classification. This global importance can be visualized using the Frequency Chaos Game Representation ($fCGR_k$)²⁰ to identify potential patterns, as seen in Figure 1. A visual inspection of Figure 1 suggests that the set of 6-mers that is relevant in distinguishing DNA sequences from a given environment category from the rest of the dataset is specific to that environment category.

To confirm these findings and supplement the analysis with previous observations on codon usage patterns and amino acid compositional biases in extremophiles, we also examined the value $k=3$. Note that not all the 3-mers identified by our method as relevant to the classification are codons, because 3-mers are not counted only from coding sequences or translation frames. For each environment category, the MDI algorithm was used to identify the specific 3-mers that are relevant for each of the one-vs-all Random Forest environment category classifications.

To investigate further the concept of “relevance” and explore its connection with the over-representation and under-representation of codons/amino acids as described in the literature, we computed the histograms of the 3-mers’ deviation from the dataset mean, for each dataset and environment category. Figures 2 and 3 display these histograms, and single out (in green) the 3-mers relevant for each environment category in the Temperature Dataset (Figure 2) and the pH Dataset (Figure 3). To complement this analysis, Tables 5 and 6 list the sets of relevant 3-mers displayed in Figures 2 and 3, respectively, alongside with the relevant literature on biological observations of codon/amino acid compositional biases associated with extreme temperature and pH environments. Note that each set of relevant 3-mers listed in an environment category panel in Figure 2 (Figure 3), ordered left-to-right alphabetically on the x-axis of the panel, corresponds to a set of relevant 3-mers in a matching environment category column in Table 5 (Table 6), ordered top-to-bottom alphabetically by the abbreviation of the amino acid they would encode if they were codons.

As seen in Tables 5 and 6, the majority of our findings regarding over- and under-representation of 3-mers match existing observations in the literature about codon/amino acid bias in extremophiles’ genomic sequences. Disagreements could be due to several factors. First, the 3-mers are not codons: They are counted from an arbitrarily selected 500 kbp DNA fragment representing a genome, and their frequency profile (the genomic signature) has been shown to be quasi-constant along a genome. Thus, some 3-mers could be relevant for the one-vs-all classification of a temperature/pH category in ways that are unrelated to transcriptional or proteomic adaptations. Second, the fact that a 3-mer is found to be relevant for a temperature/pH category indicates that it belongs to a set of 3-mers that collectively contribute to distinguishing sequences in that temperature/pH category from the rest of the dataset. In this sense, the concept of “relevant k-mer set” is more general, and the fact that a k-mer belongs to the relevant set of k-mers for a classification does not necessarily imply that it is over- or under-represented in the genomic sequences of that environment category.

Table 5 Over- and under-representation of the relevant 3-mers, found by our method to be collectively associated with genomic signatures of temperature-adapted prokaryotic extremophiles. The symbol $\uparrow$ ($\downarrow$) indicates over-representation (under-representation) of a 3-mer/codon. Matched arrows, e.g., ($\uparrow , \uparrow ^{ref}$) indicate that both our method and reference ref agree in their finding. Mismatched arrows indicate disagreement. See Supplementary Table S4 for details on the observations in biological literature.

Full size table

Table 6 Over- and under-representation of the relevant 3-mers, found by our method to be collectively associated with genomic signatures of pH-adapted prokaryotic extremophiles. The symbol $\uparrow$ ($\downarrow$) indicates over-representation (under-representation) of a 3-mer/codon. Matched arrows, e.g., ($\downarrow , \downarrow ^{ref}$) indicate that both our method and reference ref agree in their finding. Mismatched arrows indicate disagreement. See Supplementary Table S5 for details of observations in biological literature.

Full size table

Unsupervised clustering of the temperature dataset and the pH dataset

The supervised learning computational experiments suggested the existence of an environmental component in the genomic signature of microbial extremophiles, in both a restriction-free scenario and a restricted scenario where sequences from the same genus as the test sequence were absent from training.

It should be noted that the datasets considered in this study are not comprehensive, since the discovery and sequencing of genomes of extremophilic organisms is an ongoing difficult process given the challenging environments in which they are found, which are difficult to reproduce in order to culture and further characterize microbial extremophiles⁵⁶. In particular, the datasets’ sparsity and sampling bias do not allow computational experiments in restricted scenarios at taxonomic levels higher than the genus level. This is because such restrictions could eliminate many of the labelled sequences from the cross-validation training sets, rendering them insufficient in size for supervised learning purposes.

To address this challenge, in this section we explore the genomic signatures of the Temperature Dataset and pH Dataset through an unsupervised clustering approach. In unsupervised clustering, no taxonomic or environment category labels for DNA sequences are used during the entire process of learning, and ground-truth labels are used exclusively for the evaluation of the quality of clustering (if applicable). In a first set of tests, we applied parametric unsupervised algorithms for the task of clustering both datasets with different values for the parameter n_cluster (the number of clusters). When compared to the highest taxonomic level (domain), the ACC measure ( Eq. 1) for the clustering assignments computed by each algorithm suggest that for the Temperature Dataset, all algorithms can partially cluster sequences according to their real taxonomic labels at n_clusters = 2, with iDeLUCS (68$\%$) outperforming the others by a small margin (see Table 7 for accuracies). For the pH Dataset, all algorithms are unsuccessful at separating by domain (see Table 8 for accuracies). For values of the parameter n_clusters greater than 2, the accuracy increases for both datasets, but the increase is more significant for the pH Dataset where the ACC increases by $\sim 30\%$, which suggests that there is a good separation by environment category within each domain in the pH Dataset. Overall, the unsupervised clustering accuracy computed using taxonomic labels as ground truth, is higher than when computed using environment category labels as ground truth. This confirms the supervised machine learning results in the previous section, suggesting that the taxonomic component is stronger than the environmental component of genomic signatures.

Table 7 Accuracies (ACC) of the unsupervised clustering of the Temperature Dataset, for several parametric clustering algorithms, and several values of the pre-specified number of clusters. For each value of the number of clusters parameter, the unsupervised clustering accuracies are computed using the taxonomic labels as ground truth (top row), respectively the environment category labels as ground truth (bottom row).

Full size table

Table 8 Accuracies (ACC) of the unsupervised clustering of the pH Dataset, for several parametric clustering algorithms, and several values of the pre-specified number of clusters. For each value of the number of clusters parameter, the unsupervised clustering accuracies are computed using the taxonomic labels as ground truth (top row), respectively the environment category labels as ground truth (bottom row).

Full size table

In a second set of tests, six different non-parametric algorithms (the number of clusters is discovered by the algorithm instead of being given as a parameter) and the semi-parametric algorithm iDeLUCS were employed to cluster both datasets. Subsequently, all clusters obtained from each algorithm were compared with GTDB labels at the genus level, hereafter referred to as true genera, and only those clusters meeting the predefined quality criteria ($>50\%$ completeness, and $<50\%$ contamination, see Methods) were selected for evaluation. The outcomes, presented in Figure 4, speak to the effectiveness of deep learning-based clustering methodologies in accurately recovering the true genera, as well as illustrate the importance of choosing appropriate algorithms for specific datasets. For the datasets in this study, the method combining VAE with Iterative Medoids (VAE+IM)²⁷ demonstrated superior performance in recovering clusters that meet the predefined quality criteria. Specifically, VAE+IM successfully recovered 61 out of a total of 93 true genera represented by more than two sequences in the Temperature Dataset, and 31 out of a total of 37 true genera represented by more than two sequences in the pH Dataset.

Based on this analysis, the five algorithms that were able to recover at least 20% of the total number of true genera were VAE+HDBSCAN, CL+HDBSCAN, VAE+IM, UMAP+HDBSCAN, and iDeLUCS for the Temperature Dataset, respectively VAE+HDBSCAN, CL+HDBSCAN, VAE+IM, CL+IM, and iDeLUCS for the pH Dataset. These five algorithms were thus selected as source of information for subsequent analysis, since they performed best when compared to true genera groupings.

Microbial extremophiles from different domains, with similar genomic signatures

Following the selection of the five top performing unsupervised clustering algorithms in the previous section, the clusters discovered by these algorithms were used in conjunction with a majority voting scheme, to determine concrete “candidates,” that is, concrete exemplars of taxonomically different organisms that were clustered together presumably due to the environmental component of their genomic signatures. This computational process identified a list of pairs of hyperthermophilic, alkaliphilic, and acidophilic candidate sequences, each belonging to a different taxonomic domain, which were nevertheless grouped together by the majority of the aforementioned unsupervised clustering algorithms.

Of these candidates, we then proceeded to select sequences for which the unexpected results of the clustering could be independently confirmed by (i) supervised machine learning for the prediction of environment category, by (ii) supervised machine learning for the prediction of the taxonomic labels, and by (iii) observations of shared characteristics of their isolating environments. In these experiments, both thermophiles and hyperthermophiles were treated as part of a single environment category called “high-temperature,” so as to enhance the rigour of the confirmation procedure, given the lack of definitive knowledge of the precise threshold that separates these two environment categories from each other.

The experimental design was aimed to devise challenging scenarios that would clearly demonstrate the presence of the environmental component in the genomic signature of each candidate. To this end, for each candidate sequence to be tested, a challenge training set was created by selecting all DNA sequences of organisms from the opposite domain (i.e., Archaea or Bacteria), as well as sequences within the same domain but under a different environment category. The classifiers were then trained to perform two different tasks.

In experiments (i), a classifier was trained to predict the environment category of a candidate test sequence, as follows. For instance, if the test sequence was of a hyperthermophilic bacteria, the training set comprised all archaeal sequences (different domain), together with all the mesophilic and psychrophilic bacterial sequences (same domain, different environment category). The objective was to determine if the hyperthermophilic bacterial test sequence would be assigned the correct label “high-temperature,” despite the absence of high-temperature bacterial sequences in the training set. If this were the case, it would indicate that the correct temperature label assignment was due to the similarity of this bacterial sequence to other high-temperature archaeal sequences in the dataset, further suggesting that the environmental component overrides the taxonomical component in the genomic signature of the candidate sequence.

In experiments (ii), a classifier was trained to predict the domain of each candidate test sequence, as follows. For example, if the candidate test sequence was of a hyperthermophilic archaeon, the training set comprised all bacteria sequences (different domain), together with all the mesophilic and psychrophilic archaeal sequences (same domain, different environment category). The objective was to determine if the hyperthermophilic archaeal sequence would be assigned the incorrect label “Bacteria.” If this were indeed the case, it would indicate that the assignment of this archaeal sequence to domain Bacteria was likely due to its similarity to the high-temperature bacterial sequences, further suggesting that the environmental component overrides the taxonomic component of the candidate sequence.

All candidate sequences generated by the unsupervised clustering experiment underwent both computational experiments (i) and (ii). Of these, the following four sequences were assigned by the majority of the classifiers (SVM, Random Forest, ANN, MLDSP) to the correct environment category in experiment (i), and to the incorrect domain in experiment (ii): the bacterial sequence Thermocrinis ruber – Accession ID: GCA_000512735.1, and the three archaeal sequences, Pyrococcus furiosus DSM 3638 (formerly Pyrococcus sp000211475 ) – Accession ID: GCA_000007305.1, Thermococcus litoralis DSM 5473 (formerly Thermococcus litoralis NS-C) – Accession ID: GCA_000246985.3, and Pyrococcus chitonophagus (formerly known as Thermococcus chitonophagus) – Accession ID: GCA_002214605.1. Note that the current release of Genome Taxonomy Database (GTDB release R214 April 28, 2023) defines Thermococcus litoralis as a strain type of species Thermococcus alcaliphilus. In this study, we refer to it as “Thermococcus litoralis,” given its classification in the database version used for creation of the dataset.

Indeed, in experiments (i), all environment-trained classifiers correctly predicted these four microbial sequences as belonging to the high-temperature environment category, in spite of the fact that all genomic sequences used to train the classifier to predict temperature conditions were from a different domain than that of the test sequence. Moreover, in experiments (ii) all taxonomy-trained classifiers erroneously predicted the genomic sequences of these microbial extremophiles as belonging to a different domain, likely due to their environmental characteristic.

For biological corroboration (iii), a literature search was undertaken in an attempt to correlate the candidate species to the context of phenotypic traits and the characteristics of the isolating environments. It was determined that few phenotypic traits were congruent between candidates, including gram negative cell walls, OGpH falling within the neutrophilic range (pH 5.0 to 9.0) for each candidate, presence of intergenic sequences, and emissions of light hydrocarbons from the nearby environment^{57,58,59,60,61,62,63,64}. However, several more phenotypic traits display dissimilarities between organisms, as described in Supplementary Table S6. The particular environments each of the species was initially isolated from were analyzed in greater detail, and it was found that two Joint Genome Institute’s Genomes OnLine Database-derived (JGI-GOLD) ecosystem classifiers describe the isolating environment of all 4 species, as follows: ID 4027 for P. furiosus and P. chitonophagus, and ID 3991 for T. litoralis and T. ruber^58,65,66. The descriptors for these classifiers are “aquatic marine hydrothermal vent” and “aquatic thermal hot springs” respectively⁶⁵.

Although these environments are classified differently by JGI-GOLD, as ID 3991 and ID 4027 respectively, the descriptors accurately describe these environments due to the presence of hydrothermal systems^62,67. Note that T. litoralis (ID 3991) has been recently isolated from the Guaymas Basin, albeit from a geographic site of the Guaymas basin that was different from the isolation site of P. chitonophagus⁶⁸ (Supplementary Table S7).

For additional insight, the pairwise distance matrix of all genomic signatures generated by ML-DSP²⁵ for each dataset, with $k=6$, was analyzed. The pairwise distance matrix of the Temperature Dataset revealed that the DNA fragment with the shortest distance from that of Thermocrinis ruber (bacterium) belonged to Thermococcus_A litoralis (archaeon) with a distance value of 0.0327 (the distance ranges between 0 and 1, with 0 the minimum distance, between identical sequences, and 1 the maximum distance).

Discussion

We note that the six supervised machine learning algorithms produced highly accurate taxonomic classifications of extremophile prokaryotic genome sequences, and medium to medium-high accurate environment category classifications of the same sequences. These results suggest that, in addition to the taxonomic information present in the genomic signatures of extremophiles, a distinct k-mer frequency profile associated with each environment category also exists. Thus, if the bacteria and archaea sequences in the training set are labelled by environment category, then the supervised learning algorithms will likely assign a new sequence to its correct environment category, regardless of its taxonomy. Also note that the classification accuracies obtained when the datasets were taxonomy-labelled and environment category-labelled were both significantly higher than those obtained when the same datasets were assigned random labels. These findings are consistent with the claim that these taxonomic and environment category classifications are not due to chance, and support the hypothesis of the presence of both a taxonomic and an environmental component in the genomic signatures of microbial extremophiles.

Additional analyses revealed that the classification accuracies obtained in restriction-free supervised classification scenarios were higher than those obtained in the restricted (non-overlapping genera) supervised classification scenarios. However, even in the restricted scenario, the accuracies of classifications by the environment category were higher than those in the control “random label” scenario. Together, these findings suggest that the taxonomic component of the genomic signature is stronger than the environmental component, but that the latter is discernible and it provides discriminating power.

Note that, while the subsets of 3-mers relevant for the environment category classification that were identified by the MDI algorithm provide insights into the relations between genomic signatures and extreme environmental conditions, caution should be taken when interpreting the results. This is because the experiment prioritized classification accuracy, and the identified subsets of relevant 3-mers may partially reflect a correlation between taxonomy and environment. In other words, especially due to the bias and sparsity of both datasets, it is likely that some taxonomic information may also have influenced the process of computational discovery of these subsets of relevant 3-mers. This being said, the overlap between the aforementioned subsets of relevant 3-mers and codon usage patterns and amino acid compositional biases found to be associated with extreme environments in the biological literature, still suggest a detectable environmental component of genomic signatures in temperature and pH-adapted microbial extremophiles. Future work is needed to explore the possibility of multiple environmental components influencing the genomic signatures of polyextremophiles.

The use of unsupervised learning algorithms for exploring the space of genomic signatures holds significant value, as these algorithms effectively discover clusters of genomic fragments possessing similar genomic signatures, free from the influence of any human annotations. Since the precise definition of the term “genomic signature” entails differentiation of genetically distant organisms from each other, a high-performing clustering algorithm should primarily yield clusters corresponding to the true genera within the dataset. That being said, ascertaining causality for fragments assigned to erroneous clusters proves challenging, given the potential for similar genomic signatures to coincide with taxonomic information at a lower level, as well as the inherent systematic errors in each algorithm. For that reason, in the present study, the identification of pairs exhibiting a similar environmental component in their genomic signature based on the clustering assignments, relies predominantly on the consensus of the high-performing clustering algorithms. Furthermore, only pairs of fragments originating from organisms in different domains were retained. Additional confirmation steps by supervised learning in challenging scenarios were applied to the remaining pairs, and four hyperthermopilic exemplars successfully passed all these stringent tests. It is thus possible that other candidates from the list identified by unsupervised clustering could be viable, such as pairs for which only some of the supervised tests yielded successful results. One such example is the pair of acidophilic organisms Thermoanaerobacterium thermosaccharolyticum (bacterium) and Caldisphaera lagunensis (archaeon) in the pH Dataset, which were clustered together in spite of their domain-level taxonomic differences. Further analysis is needed to confirm such additional pairs, by, e.g., an analysis that utilizes, as a genome representative, multiple DNA fragments combined into a single genomic signature.

The dataset in this paper is, to the best of our knowledge, the largest and most comprehensive to date for the study of the genomic signatures of extremophilic microbes. Larger and more balanced datasets, combined with extensive literature and database searches, could yield more nuanced bioinformatic analyses in future studies.

Conclusion

This paper demonstrates the successful application of supervised machine learning algorithms for highly accurate taxonomic classifications of extremophile prokaryotic genome sequences, and medium to medium-high accurate classifications of the same sequences based on their environment category (hyperthermophile, psychrophile, acidophile, alkaliphile, etc). The use of k-mer frequency vectors of arbitrarily selected 500 kbp DNA fragments as genomic signatures, reveals a strong taxonomic component for $2 \le k \le 6$, and a discernible environmental component for $3 \le k\le 6$. Furthermore, specific k-mer profiles associated with distinct environment categories are identified, with partial agreement with previous observations in the literature using alignment-based analyses. Finally, these findings are confirmed using unsupervised learning clustering algorithms, which also reveal specific exemplar organisms for which the environmental component appears to be at least as strong as the taxonomic component of their genomic signature. This multi-pronged approach, applied to a substantial dataset, significantly strengthens the hypothesis of an environmental component in the genomic signature of microbial extremophiles adapted to extreme temperature or pH environmental conditions.

Data availability

All sequence data used in this paper is publicly available for download at NCBI. The unique assembly accession IDs of all the sequences and their respective labels used in this study are listed in Supplementary Material, Table S1. The representative DNA fragments used in this paper are available at https://github.com/Kari-Genomics-Lab/Extreme_Env/.

References

Rothschild, L. J. & Mancinelli, R. L. Life in extreme environments. Nature 409, 1092–1101. https://doi.org/10.1038/35059215 (2001).
Article ADS CAS PubMed Google Scholar
Wang, Q., Cen, Z. & Zhao, J. The survival mechanisms of thermophiles at high temperatures: An angle of omics. Physiology 30, 97–106. https://doi.org/10.1152/physiol.00066.2013 (2015).
Article CAS PubMed Google Scholar
Saunders, N. F. W. et al. Mechanisms of thermal adaptation revealed from the genomes of the Antarctic Archaea Methanogenium frigidum and Methanococcoides burtonii. Genome Res. 13, 1580–1588. https://doi.org/10.1101/gr.1180903 (2003).
Article CAS PubMed PubMed Central Google Scholar
Raymond-Bouchard, I. et al. Conserved genomic and amino acid traits of cold adaptation in subzero-growing Arctic permafrost bacteria. FEMS Microbiol. Ecol. 94, fiy023. https://doi.org/10.1093/femsec/fiy023 (2018).
Article CAS Google Scholar
Turner, P., Mamo, G. & Karlsson, E. N. Potential and utilization of thermophiles and thermostable enzymes in biorefining. Microb. Cell Fact. 6, 9. https://doi.org/10.1186/1475-2859-6-9 (2007).
Article CAS PubMed PubMed Central Google Scholar
Zeldes, B. M. et al. Extremely thermophilic microorganisms as metabolic engineering platforms for production of fuels and industrial chemicals. Front. Microbiol. 6, 1209. https://doi.org/10.3389/fmicb.2015.01209 (2015).
Article PubMed PubMed Central Google Scholar
Horne, W. H. et al. Effects of desiccation and freezing on microbial ionizing radiation survivability: Considerations for Mars sample return. Astrobiology 22, 1337–1350. https://doi.org/10.1089/ast.2022.0065 (2022).
Article ADS PubMed PubMed Central Google Scholar
Ott, E. et al. Molecular repertoire of Deinococcus radiodurans after 1 year of exposure outside the International Space Station within the Tanpopo mission. Microbiome 8, 150. https://doi.org/10.1186/s40168-020-00927-5 (2020).
Article CAS PubMed PubMed Central Google Scholar
Xu, S., Wang, J., Guo, Z., He, Z. & Shi, S. Genomic convergence in the adaptation to extreme environments. Plant Commun. 1, 100117. https://doi.org/10.1016/j.xplc.2020.100117 (2020).
Article PubMed PubMed Central Google Scholar
Friedman, R., Drake, J. W. & Hughes, A. L. Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics 167, 1507–1512. https://doi.org/10.1534/genetics.104.026344 (2004).
Article CAS PubMed PubMed Central Google Scholar
Zeldovich, K. B., Berezovsky, I. N. & Shakhnovich, E. I. Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol. 3, 1–11. https://doi.org/10.1371/journal.pcbi.0030005 (2007).
Article ADS CAS Google Scholar
Singer, G. A. & Hickey, D. A. Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 317, 39–47. https://doi.org/10.1016/S0378-1119(03)00660-7 (2003).
Article CAS PubMed Google Scholar
Lobry, J. & Necşulea, A. Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes. Gene 385, 128–136. https://doi.org/10.1016/j.gene.2006.05.033 (2006).
Article CAS Google Scholar
Forsdyke, D. R. Neutralism versus selectionism: Chargaff’s second parity rule, revisited. Genetica 149, 81–88. https://doi.org/10.1007/s10709-021-00119-5 (2021).
Article CAS PubMed PubMed Central Google Scholar
Bao, Q. et al. A complete sequence of the T. tengcongensis genome. Genome Res. 12, 689–700 (2002).
Article CAS PubMed PubMed Central Google Scholar
Lao, P. J. & Forsdyke, D. R. Thermophilic bacteria strictly obey Szybalski’s transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res. 10, 228–236. https://doi.org/10.1101/gr.10.2.228 (2000).
Article CAS PubMed PubMed Central Google Scholar
Karlin, S., Mrázek, J. & Campbell, A. M. Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol. 179, 3899–3913. https://doi.org/10.1128/jb.179.12.3899-3913.1997 (1997).
Article CAS PubMed PubMed Central Google Scholar
Khan, M. F. & Patra, S. Deciphering the rationale behind specific codon usage pattern in extremophiles. Sci. Rep. 8, 15548. https://doi.org/10.1038/s41598-018-33476-x (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Karlin, S. & Burge, C. Dinucleotide relative abundance extremes: A genomic signature. Trends Genet. 11, 283–290. https://doi.org/10.1016/S0168-9525(00)89076-9 (1995).
Article CAS PubMed Google Scholar
Deschavanne, P. J., Giron, A., Vilain, J., Fagot, G. & Fertil, B. Genomic signature: Characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol. 16, 1391–1399 (1999).
Article CAS PubMed Google Scholar
Zielezinski, A., Vinga, S., Almeida, J. & Karlowski, W. M. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol. 18, 186. https://doi.org/10.1186/s13059-017-1319-7 (2017).
Article PubMed PubMed Central Google Scholar
Karamichalis, R., Kari, L., Konstantinidis, S., Kopecki, S. & Solis-Reyes, S. Additive methods for genomic signatures. BMC Bioinform. 17, 313. https://doi.org/10.1186/s12859-016-1157-8 (2016).
Article CAS Google Scholar
Solis-Reyes, S., Avino, M., Poon, A. & Kari, L. An open-source k-mer based machine learning tool for fast and accurate sub-typing of HIV-1 genomes. PLoS ONE 13, e0206409. https://doi.org/10.1371/journal.pone.0206409 (2018).
Article CAS PubMed PubMed Central Google Scholar
Liang, Q., Bible, P. W., Liu, Y., Zou, B. & Wei, L. DeepMicrobes: Taxonomic classification for metagenomics with deep learning. NAR Genomics Bioinform.https://doi.org/10.1093/nargab/lqaa009 (2020).
Article Google Scholar
Randhawa, G. S., Hill, K. A. & Kari, L. ML-DSP: Machine learning with digital signal processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomicshttps://doi.org/10.1186/s12864-019-5571-y (2019).
Article PubMed PubMed Central Google Scholar
Millán Arias, P., Alipour, F., Hill, K. A. & Kari, L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS ONE 17, e0261531. https://doi.org/10.1371/journal.pone.0261531 (2022).
Article CAS PubMed PubMed Central Google Scholar
Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560. https://doi.org/10.1038/s41587-020-00777-4 (2021).
Article CAS PubMed Google Scholar
Fiannaca, A. et al. Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinform. 19, 198. https://doi.org/10.1186/s12859-018-2182-6 (2018).
Article CAS Google Scholar
Zhang, P., Jiang, Z., Wang, Y. & Li, Y. CLMB: Deep contrastive learning for robust metagenomic binning. In Research in Computational Molecular Biology, 326–348 (2022).
Randhawa, G. S. et al. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS ONE 15, e0232391. https://doi.org/10.1371/journal.pone.0232391 (2020).
Article CAS PubMed PubMed Central Google Scholar
Merino, N. et al. Living at the extremes: Extremophiles and the limits of life in a planetary context. Front. Microbiol. 10, 780. https://doi.org/10.3389/fmicb.2019.00780 (2019).
Article PubMed PubMed Central Google Scholar
Dexter Dyer, B., Kahn, J. M. & LeBlanc, D. M. Classification and regression tree (CART) analyses of genomic signatures reveal sets of tetramers that discriminate temperature optima of archaea and bacteria. Archaea 2, 159–167. https://doi.org/10.1155/2008/829730 (2008).
Article Google Scholar
Tsai, M.-H., Liu, Y.-Y., Soo, V.-W. & Chen, C.-C. A new genome-to-genome comparison approach for large-scale revisiting of current microbial taxonomy. Microorganisms 7, 161. https://doi.org/10.3390/microorganisms7060161 (2019).
Article PubMed PubMed Central Google Scholar
De la Fuente, R., Díaz-Villanueva, W., Arnau, V. & Moya, A. Genomic signature in evolutionary biology: A review. Biology 12, 322. https://doi.org/10.3390/biology12020322 (2023).
Article CAS PubMed PubMed Central Google Scholar
Bonnici, V. & Manca, V. Informational laws of genome structures. Sci. Rep. 6, 28840. https://doi.org/10.1038/srep28840 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Kislyuk, A., Bhatnagar, S., Dushoff, J. & Weitz, J. S. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinform. 10, 316. https://doi.org/10.1186/1471-2105-10-316 (2009).
Article CAS Google Scholar
Vapnik, V. The Nature of Statistical Learning Theory (Springer, New York, 2000).
Book MATH Google Scholar
Ho, T. K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, 278–282. https://doi.org/10.1109/ICDAR.1995.598994 (1995).
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T. & Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 9, 307. https://doi.org/10.1186/1471-2105-9-307 (2008).
Article CAS Google Scholar
Millan Arias, P., Hill K. A., & Kari, L. iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences. Bioinformatics 39, btad508. https://doi.org/10.1093/bioinformatics/btad508 (2023).
Article PubMed PubMed Central Google Scholar
Xie, J., Girshick, R. & Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of The 33rd International Conference on Machine Learning, vol. 48, 478–487 (2016).
Kuhn, H. W. The Hungarian method for the assignment problem. Naval Res. Log. Q. 2, 83–97. https://doi.org/10.1002/nav.3800020109 (1955).
Article MathSciNet MATH Google Scholar
McInnes, L., Healy, J. & Astels, S. HDBSCAN: Hierarchical density based clustering. J. Open Source Softw. 2, 205. https://doi.org/10.21105/joss.00205 (2017).
Article ADS Google Scholar
McInnes, L., Healy, J., Saul, N. & Grossberger, L. Umap: Uniform manifold approximation and projection. The J. Open Source Softw. 3, 861. https://doi.org/10.21105/joss.00861 (2018).
Article Google Scholar
Bowman, J. S. & Deming, J. W. Alkane hydroxylase genes in psychrophile genomes and the potential for cold active catalysis. BMC Genomics 15, 1120. https://doi.org/10.1186/1471-2164-15-1120 (2014).
Article CAS PubMed PubMed Central Google Scholar
Riley, M. et al. Genomics of an extreme psychrophile, Psychromonas ingrahamii. BMC Genomics 9, 210. https://doi.org/10.1186/1471-2164-9-210 (2008).
Article CAS PubMed PubMed Central Google Scholar
de Farias, S. & Bonato, M. Preferred codons and amino acid couples in hyperthermophiles. Genome Biol. 3, preprint0006.1. https://doi.org/10.1186/gb-2002-3-8-preprint0006 (2002).
Article Google Scholar
Bauvois, C. et al. Crystal structure of the cold-active aminopeptidase from Colwellia psychrerythraea, a close structural homologue of the human bifunctional Leukotriene A4 Hydrolase. J. Biol. Chem. 283, 23315–23325. https://doi.org/10.1074/jbc.M802158200 (2008).
Article CAS PubMed Google Scholar
Goldstein, R. A. Amino-acid interactions in psychrophiles, mesophiles, thermophiles, and hyperthermophiles: insights from the quasi-chemical approximation. Protein Sci. Public. Protein Soc. 16, 1887–1895. https://doi.org/10.1110/ps.072947007 (2007).
Article CAS Google Scholar
Szilágyi, A. & Závodszky, P. Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits: Results of a comprehensive survey. Structure 8, 493–504. https://doi.org/10.1016/S0969-2126(00)00133-7 (2000).
Article PubMed Google Scholar
Panja, A. S., Maiti, S. & Bandyopadhyay, B. Protein stability governed by its structural plasticity is inferred by physicochemical factors and salt bridges. Sci. Rep. 10, 1822. https://doi.org/10.1038/s41598-020-58825-7 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Kumar, S., Tsai, C.-J. & Nussinov, R. Factors enhancing protein thermostability. Protein Eng. Des. Sel. 13, 179–191. https://doi.org/10.1093/protein/13.3.179 (2000).
Article CAS Google Scholar
Hait, S., Mallik, S., Basu, S. & Kundu, S. Finding the generalized molecular principles of protein thermal stability. Proteins Struct. Funct. Bioinform. 88, 788–808. https://doi.org/10.1002/prot.25866 (2020).
Article CAS Google Scholar
Horikoshi, K. Alkaliphiles: Some applications of their products for biotechnology. Microbiol. Mol. Biol. Rev. 63, 735–750. https://doi.org/10.1128/MMBR.63.4.735-750.1999 (1999).
Article CAS PubMed PubMed Central Google Scholar
Mukhtar, S., Rashid, N., Farhan Ul Haque, M. & Malik, K. A. Metagenomic approach for the isolation of novel extremophiles. In Microbial Extremozymes, 55–66, https://doi.org/10.1016/B978-0-12-822945-3.00010-5 (2022).
Zhu, D. et al. Recent development of extremophilic bacteria and their application in biorefinery. Front. Bioeng. Biotechnol. 8, 483. https://doi.org/10.3389/fbioe.2020.00483 (2020).
Article PubMed PubMed Central Google Scholar
Fiala, G. & Stetter, K. O. Pyrococcus furiosus sp. nov., represents a novel genus of marine heterotrophic archaebacteria growing optimally at 100 $^{\circ }$C. Archiv. Microbiol.145, 56–61, https://doi.org/10.1007/BF00413027 (1986).
Huber, R. et al.Thermococcus chitonophagus sp. nov., a novel, chitin-degrading, hyperthermophilic archaeum from a deep-sea hydrothermal vent environment. Archiv. Microbiol.164, 255–264, https://doi.org/10.1007/BF02529959 (1995).
Neuner, A., Jannasch, H. W., Belkin, S. & Stetter, K. O. Thermococcus litoralis sp. nov.: A new species of extremely thermophilic marine archaebacteria. Archiv. Microbiol. 153, 205–207, https://doi.org/10.1007/BF00247822 (1990).
Huber, R. et al. Thermocrinis ruber gen. nov., sp. nov., A pink-filament-forming hyperthermophilic bacterium isolated from Yellowstone National Park. Appl. Environ. Microbiol. 64, 3576–3583. https://doi.org/10.1128/AEM.64.10.3576-3583.1998 (1998).
Article ADS CAS PubMed PubMed Central Google Scholar
Capaccioni, B., Tassi, F. & Vaselli, O. Organic and inorganic geochemistry of low temperature gas discharges at the Baia di Levante beach, Vulcano Island, Italy. J. Volcanol. Geoth. Res. 108, 173–185. https://doi.org/10.1016/S0377-0273(00)00284-5 (2001).
Article ADS CAS Google Scholar
Orlando, V., Franco, T., Dario, T., Robert, P. J. & Antonio, C. Submarine and inland gas discharges from the Campi Flegrei (Southern Italy) and the Pozzuoli Bay: Geochemical clues for a common hydrothermal-magmatic source. Procedia Earth Planet. Sci. 4, 57–73. https://doi.org/10.1016/j.proeps.2011.11.007 (2011).
Article ADS CAS Google Scholar
Bazylinski, D. A., Farrington, J. W. & Jannasch, H. W. Hydrocarbons in surface sediments from a Guaymas Basin hydrothermal vent site. Org. Geochem. 12, 547–558. https://doi.org/10.1016/0146-6380(88)90146-5 (1988).
Article ADS CAS Google Scholar
Clifton, C., Walters, C. & Simoneit, B. Hydrothermal petroleums from Yellowstone National Park, Wyoming, USA. Appl. Geochem. 5, 169–191. https://doi.org/10.1016/0883-2927(90)90047-9 (1990).
Article ADS Google Scholar
Mukherjee, S. et al. Twenty-five years of Genomes OnLine Database (GOLD): Data updates and new features in vol 9. Nucleic Acids Res. 51, D957–D963. https://doi.org/10.1093/nar/gkac974 (2023).
Article CAS PubMed Google Scholar
Belkin, S., Wirsen, C. O. & Jannasch, H. W. A new sulfur-reducing, extremely thermophilic eubacterium from a submarine thermal vent. Appl. Environ. Microbiol. 51, 1180–1185. https://doi.org/10.1128/aem.51.6.1180-1185.1986 (1986).
Article ADS CAS PubMed PubMed Central Google Scholar
Sherman, L. et al. Mercury isotopic composition of hydrothermal systems in the Yellowstone Plateau volcanic field and Guaymas Basin sea-floor rift. Earth Planet. Sci. Lett. 279, 86–96. https://doi.org/10.1016/j.epsl.2008.12.032 (2009).
Article ADS CAS Google Scholar
Zhou, Z. et al. Genome- and community-level interaction insights into carbon utilization and element cycling functions of Hydrothermarchaeota in hydrothermal sediment. mSystems 5, e00795-19. https://doi.org/10.1128/mSystems.00795-19 (2020).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors thank Flora Li and Mohammed Essa for their assistance in the process of data collection, and the anonymous Reviewers for their insightful and constructive suggestions. This work has been supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grants [RGPIN-2023-03663 to L.K, R3511A12 to K.A.H, and RGPIN-2022-03547 to G.R], and Compute Canada Research Platforms & Portals Grant [616 to K.A.H].

Author information

These authors contributed equally: Pablo Millán Arias and Joseph Butler.

Authors and Affiliations

School of Computer Science, University of Waterloo, Waterloo, ON, Canada
Pablo Millán Arias & Lila Kari
Department of Biology, University of Western Ontario, London, ON, Canada
Joseph Butler, Maximillian P. M. Soltysiak & Kathleen A. Hill
School of Mathematical and Computational Sciences, University of Prince Edward Island, Charlottetown, PE, Canada
Gurjit S. Randhawa

Authors

Pablo Millán Arias
View author publications
You can also search for this author in PubMed Google Scholar
Joseph Butler
View author publications
You can also search for this author in PubMed Google Scholar
Gurjit S. Randhawa
View author publications
You can also search for this author in PubMed Google Scholar
Maximillian P. M. Soltysiak
View author publications
You can also search for this author in PubMed Google Scholar
Kathleen A. Hill
View author publications
You can also search for this author in PubMed Google Scholar
Lila Kari
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

L.K., K.A.H., M.S. designed the computational experiment, J.B., P.M.A. and M.S. collected the data. J.B. curated the data, P.M.A. and G.R. conducted the computational experiments. J.B., K.A.H., M.S. provided biological interpretation of the findings. P.M.A., L.K., K.A.H., J.B., and G.R. contributed to the writing of the manuscript, and all authors analysed the results and reviewed the manuscript.

Corresponding author

Correspondence to Pablo Millán Arias.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Supplementary Information 3.

Supplementary Information 4.

Supplementary Information 5.

Supplementary Information 6.

Supplementary Information 7.

Supplementary Information 8.

Supplementary Information 9.

Supplementary Information 10.

Supplementary Information 11.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Arias, P.M., Butler, J., Randhawa, G.S. et al. Environment and taxonomy shape the genomic signature of prokaryotic extremophiles. Sci Rep 13, 16105 (2023). https://doi.org/10.1038/s41598-023-42518-y

Download citation

Received: 06 June 2023
Accepted: 11 September 2023
Published: 26 September 2023
DOI: https://doi.org/10.1038/s41598-023-42518-y
Springer Nature Limited

Associated content

Genomics of extreme environments

Collection 02 March 2022

Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

Abstract

Similar content being viewed by others

Introduction

Materials and methods

Datasets

Sequence classification using supervised machine learning

Unsupervised learning for sequence clustering

Results

Supervised machine learning analysis of the Temperature Dataset and the pH Dataset

Supervised classification by taxonomy, environment category, and random label assignment

Sets of k-mers relevant to environment category classifications computed by interpretablity tool of a supervised learning algorithm

Unsupervised clustering of the temperature dataset and the pH dataset

Microbial extremophiles from different domains, with similar genomic signatures

Discussion

Conclusion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's note

Supplementary Information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation