Background

According to the biogenic (organic) theory, petroleum hydrocarbons originate from the ancient remains of detrital matter buried and diagenetically modified in marine sediments. This organic matter is then gradually converted to petroleum compounds enriched in aromatic and aliphatic hydrocarbons (HCs) via the sequential activity of aerobic and anaerobic microorganisms [1,2,3]. In addition to their role in the formation of oil HCs, microbes play a crucial role in the biological integration of these HCs into the actively cycled carbon pool [4]. Microbial HC degradation occurs through a cascade of enzymatic reactions in three main steps: (i) activation or attacking the HC-bond, (ii) producing signature intermediate compounds, and (iii) conversion of signature degradation intermediates to central cell metabolites, followed by their mineralization to CO2. Microorganisms must overcome and break the stability and energy in carbon-hydrogen bonds in order to degrade HCs. Since HCs are structurally diverse, a plethora of enzymes are involved in their activation and degradation, and consequently, the energy that needs to be invested in the initial degradation step varies. Vast number of microorganisms can degrade different HCs according to their enzymatic repertoire and available energy [5]. Microorganisms have evolved to degrade different HCs under both aerobic and anaerobic conditions. However, biodegradation typically occurs much faster under aerobic conditions, in part due to the availability of thermodynamically favorable electron acceptors that leads to higher energy yield [6], but also because of the action of some HC-degrading enzymes requires oxygen as substrate or cofactor. Similar to all biological pathways, rate-limiting key enzymes drive the main steps of HC degradation.

Under aerobic conditions, oxygenase enzymes initiate the degradation of different aliphatic or aromatic compounds by adding one (mono-oxygenase) or two (di-oxygenase) oxygen molecules. Saturated aliphatic compounds such as alkane and cycloalkane (studied here) are converted to their corresponding carboxylic acid in this process. Catechol/gentisate derivatives are intermediate compounds during aerobic degradation of mono- and polycyclic aromatic HCs. They are then de-aromatized via subsequent meta/ortho cleavage. Intermediate compounds produced during the degradation of aliphatic and aromatic HCs converge to the β-oxidation and tricarboxylic acid (TCA) cycle [7]. While enzymes involved in the downstream part of the degradation process are widespread across living cells shared by many metabolic pathways, the mono/di-oxygenase enzymes catalyzing the first hydroxylation of aliphatic/aromatic compounds are crucial for the initial step in the HC degradation process and likely rate-limiting. Accordingly, microorganisms carrying the enzymes for such initial degradation will be rate-controlling drivers of HC degradation.

The capacity of microbial isolates to metabolically degrade oil HCs have been frequently studied [8,9,10,11]. However, our knowledge has been mainly limited to cultivated microorganisms until recently. The present study provides a systemic and genome-resolved view of hydrocarbon degradation potential in the growing database of archaeal and bacterial genomes. To provide this extensive view, we compiled a database of enzymes involved in the aerobic degradation pathway of aliphatic (short-chain and long-chain n-alkanes) and aromatic HCs (toluene, phenol, xylene, benzene, biphenyl, and naphthalene). We then explored the distribution of these enzymes in 24,692 publicly available archaeal (n = 1246) and bacterial (n = 23,446) genomes via AnnoTree [12] and manually confirmed all annotations. We focused on the microbial genomes containing enzymes for complete/near complete degradation of specific HCs and suggested that lineages with the great genetic potential to degrade a broad range of HC compounds can be exploited for bioremediation purposes. We also reconstructed the phylogenetic relationships of the recovered key HC degradation enzymes to investigate their evolution and explore the potential role of horizontal gene transfer. Several microorganisms contain multiple copies of key HC degrading genes across their genome. We thus explored whether these copies are likely to have been acquired through HGT or if they are likely to be paralogs.

Having a genome-resolved view, we also surveyed these genomes with regard to their ecological strategies by leverging information on their GC content and genome size to see whether all genomes containing key genes for HC degradation adopt similar growth strategy in term of canonical r and K strategists.

Results and discussion

HC degradation across domain bacteria

alkB/M and almA/ladA genes are alkane mono-oxygenases that initiate the degradation of short (C5–C15) and long-chain alkanes (> C15), respectively. The alkB/M is rubredoxin-dependent, while almA and ladA are flavin-dependent mono-oxygenases.

The genes pheA (phenol 2-monooxygenase), xylM (xylene monooxygenase), xylX (toluate/benzoate 1,2-dioxygenase subunit alpha), todC1 (benzene 1,2-dioxygenase subunit alpha), and tmoA (toluene-4-monooxygenase system, hydroxylase component subunit alpha) for monocyclic, and bphA1 (biphenyl 2,3-dioxygenase subunit alpha), ndoB (naphthalene 1,2-dioxygenase system, large oxygenase component) for polycyclic compounds code for catalytic domains of ring hydroxylating oxygenases (RHOs) that add -OH group(s) to compounds undergoing degradation (Supplementary Figure S1). We explored the distribution of these genes and associated degradation pathways in a total of 23,446 representatives out of 143,512 bacterial genomes available in release 89 of the GTDB database that has been annotated via Annotree [12]. These annotated genomes are dominated by representatives of phyla Proteobacteria (32.5%), Actinobacteriota (13.3%), Bacteroidota (12.13%), and Firmicutes (8.01%) (Supplementary Figure S2 and Supplementary Table S4). Among the 123 represented bacterial phyla, 58 phyla had ≤ five genomes available per phylum and combined only represented 0.57% of the explored genomes. To avoid misinterpretations due to this uneven taxonomic distribution of representative genomes, we explored the contribution of members of each phylum in the HC degradation process by showing what proportion of microbes containing each HC degrading enzyme exist in each phylum (panel A of Supplementary Figures S3-S6). We also analyze the percentage of members of each phylum containing each HC degrading enzyme to ensure that we consider the contributions of underrepresented phyla in the HC degradation (panel B of Supplementary Figures S3-S6).

As expected, representatives of the phylum Proteobacteria (Pseudomonadales and Burkholderiales orders) presented the highest abundance of aliphatic and aromatic HC degrading enzymes, followed by Actinobacteriota and Bacteroidota for aliphatic and Actinobacteriota and Firmicutes for aromatic HC degrading enzymes (Supplementary Figures S3 and S4, panel A).

Underrepresented phyla remain mainly uncultured and are notably underexplored for metabolic potential (58 of 123 phyla, n = 131 genomes). Our analyses revealed that representatives of these taxa contain HC degrading enzymes involved in both the initiation and downstream steps of HC degradation processes. For example, phyla Tectomicrobia (Entotheonella), Binatota, Firmicutes_K, and Firmicutes_E contained mono-aromatic HC degradation enzymes (Fig. 2). In addition to these phyla, we annotated enzymes involved in the degradation of aliphatic HC in representatives of phyla SAR324, Eremiobacterota (Baltobacterales), Bdellovibrionota_B, and Chloroflexota_B (Fig. 1).

Fig. 1
figure 1

Distribution of aliphatic hydrocarbon-degrading genes across domain bacteria at phylum level. Each circle of the heatmap represents a gene involved in HC degradation. Various compounds are shown in different colors, as represented in the color legend at the bottom of the figure. Genes marked with an asterisk represent key enzymes of the degradation pathway. Numbers written on each row's edge indicate the number of screened genomes in that phylum in the AnnoTree website (adopted from GTDB R89). The color gradient for genes of each compound indicates the percentage of HC degrading members of each phylum

Other enzymes in the degradation pathways beyond the key genes for the initial degradation (Supplementary Table S1) are typically involved in several degradation pathways and are broadly distributed accordingly. As an example, the process of converting catechol to non-aromatic compounds with further conversion to intermediates of the TCA cycle (e.g., acetaldehyde and pyruvate) (Supplementary Figure S1) is shared among degradation pathways of xylene, naphthalene/phenanthrene, and phenol (blue color in Fig. 2). The name of genes involved in mentioned part are xylE/dmpB/nahH (catechol oxygenase), xylF/dmpD/nahN (2-hydroxymuconate semialdehyde hydrolase), xylG/dmpC/nahI (2-hydroxymuconic semialdehyde dehydrogenase), xylH/nahJ (2-hydroxymuconate tautomerase), xylI/dmpH/nahK (4-oxalocrotonate decarboxylase), xylJ/nahL (2-hydroxypent-2,4-dienoate hydratase), xylK/bphI/nahM (4-hydroxy-2-oxovalerate aldolase), xylR/pdeR (transcriptional regulatory protein). These ring-cleavage enzymes are also involved in the degradation of aromatic amino acids. Our analysis showed that representatives of phyla Firmicutes (mainly from the orders Bacillales and Staphylococcales), Firmicutes_I, Firmicutes_K, Firmicutes_E, Firmicutes_G, Firmicutes_H, Eremiobacterota, Deinococcota, Chloroflexota, Campylobacterota, Myxococcota and Bdellovibrionota play a significant role in this part of HC degradation process (blue color in Fig. 2).

Fig. 2
figure 2

Distribution of aromatic hydrocarbon-degrading genes across domain bacteria at phylum level. Each circle of the heatmap represents a gene involved in HC degradation. Various compounds are shown in different colors, as represented in the color legend at the bottom of the figure. Genes marked with an asterisk represent key enzymes of the degradation pathway. Numbers written on each row’s edge indicate the number of screened genomes in that phylum in the AnnoTree website (adopted from GTDB R89). The color gradient for genes of each compound indicates the percentage of HC degrading members of each phylum

Distribution of key genes involved in the degradation of alkanes

At lower taxonomic rank, the alkB/M and ladA genes were differently distributed across members of phyla Gammaproteobacteria, Alphaproteobacteria, and Actinobacteriota, hinting at their capacity for degrading hydrocarbons of variable chain length. Altogether 2089 genomes in orders Mycobacteriales (23.95%), Rhodobacterales (20.46%), Pseudomonadales (17.13%), Flavobacteriales (8.3%), Burkholderiales (6.16%), Cytophagales (3.66%), Propionibacteriales (2.47%), Rhizobiales (1.89%), and Chitinophagales (1.81%) contained alkB/M genes, while ladA was present in 2154 genomes from Pseudomonadales (21.05%), Rhizobiales (16.27%), Burkholderiales (14.44%), Actinomycetales (13.44%), Mycobacteriales (13.05%), Bacillales (4.74%), Enterobacterales (3.7%), Acetobacterales (2.31%), Streptomycetales (1.91%). There were also several representative genomes with ladA gene which have not been previously reported and belonged to Tectomicrobia, UBA8248, and SAR324 phyla (Figs. 3 and 4, panel B, Supplementary Table S6).

Fig. 3
figure 3

Maximum-likelihood phylogenetic reconstruction of amino acid sequences of alkB/M protein recovered from genomes (short-chain length alkane monooxygenase). A Major clusters of alkB/M genes according to the reconstructed phylogeny. The scale bar indicates 0.8 branch distance. B Bar plot representations of the distribution of recovered genes at the order level. The detailed information of the fraction “others” is provided in Supplementary Table S6

Fig. 4
figure 4

Maximum-likelihood phylogenetic reconstruction of amino acid sequences of ladA protein recovered from genomes (long-chain length alkane monooxygenase). A Major clusters of ladA genes. The scale bar indicates 0.6 branch distance. B Bar plot representations of the distribution of recovered genes at the order level. The detailed information of the fraction “others” is provided in Supplementary Table S6

An indirect role of Cyanobacteria in HC degradation, especially in microbial mats, has been previously reported. These primary producers often have the nitrogen-fixing ability and can fuel and promote aerobic and anaerobic sulfate/nitrate-reducing HC degrading microorganisms in microbial mats [13]. There are also reports of a minor role of some Cyanobacteria members like Phormidium, Nostoc, Aphanothece, Synechocystis, Anabaena, Oscillatoria, Plectonema, and Aphanocapsa in direct HC degradation [14, 15]. In this study, we detected the presence of long-chain alkane degrading genes, ladA, in different members of Cyanobacteria with 0.31 and 12.54% of genomes in this phylum containing ladA (in Elainella saxicola, Nodosilinea sp000763385) and almA genes (in Synechococcales, Cyanobacteriales, Elainellales, Phormidesmiales, Thermosynechococcales, Gloeobacterales, Obscuribacterales), respectively.

Phylogenetic reconstruction of recovered alkB/M and ladA genes grouped them into five and nine main clades, respectively (Figs. 3 and 4, panel A). The branching pattern of these clades partially followed the taxonomic signal of the genomes they were retrieved from, specifically for the most dominant phyla. However, as it is evident in Figs. 3 and 4 some branches contained alkB/M and ladA genes originating from genomes belonging to distantly related phyla. The placement of phylogenetically diverse groups in one branch hint at horizontal transfer of these genes between microbial taxa [16]. Additionally, apart from the chromosomal type, both alkB/M and ladA genes have previously been reported to be also located on plasmids (OCT and pLW1071), corroborating their potential for horizontal transfer. For instance, there are reports on the intraspecies transfer of alkB/M among Pseudomonas members [17]. Placement of ladA gene originating from genomes affiliated to rare microbial taxa among clusters V-IX of the ladA phylogeny put forward the possibility of a prominent role for Actinobacteriota and Firmicutes members in expanding the distribution of this gene among rare taxa as representatives of Actinobacteriota and Firmicutes taxa frequently contain these genes (Fig. 4).

We also detected several genomes with multiple copies of the alkB gene that were not necessarily branching together in the reconstructed alkB phylogeny, hinting at the probability of either gene duplication, paralogue occurrence, or HGT. Examples of these genomes with more than 6 copies of alkB/M are Polycyclovorans algicola (10), Nevskia ramose (7), Zhongshania aliphaticivorans (7), Solimonas aquatic (7), Immundisolibacter cernigliae (6), and Rhodococcus qingshengii (6). Multiple copies have also been detected in representatives of the genera Nocardia, Rhodococcus, and Alcanivorax (the full list of genomes with multiple copies of alkB gene is available in Supplementary Table S6).

Furthermore, the ladA gene was also detected in Mycolicibacterium dioxanotrophicus, Cryobacterium_A sp003065485, Kineococcus rhizosphaerae, Microbacterium sp003248605, Paenibacillus_S sp001956045, Pararhizobium polonicum, Mycolicibacterium septicum, and Microbacterium sp000799385 with six copies in each genome. Several examples were also present in genera Pseudomonas_E, Bradyrhizobioum, Rhizobioum, and Paraburkholderia, which had more than one copy (904 genomes) (Supplementary Table S6).

The presence of multiple copies of alkane hydroxylase genes has been hypothesized to enable cells to use an expanded range of n-alkanes or to adapt to different environmental conditions. However, the exact evolutionary rationale has not yet been established [18, 19]. To evaluate this hypothesis, we compared different sequences of each gene in an individual genome (mentioned above for ladA and alkB) using BLAST (Supplementary Table S7). The results showed that the identity of multiple gene copies in a single genome was in the range of 30 to 70 percent, while they are still predicted to have the same function. This wide range of BLAST identity between these gene copies suggests that these genes potentially originated from different sources and were transferred horizontally. We further explore in more detail the case for 21 xylX gene copies in Immundisolibacter cernigliae below.

Distribution of key genes (ring-hydroxylating oxygenases (RHOs)) involved in the degradation of aromatic HCs

Genomes containing RHOs (2761 genomes, 16 phyla) present an overall lower phylogenetic diversity than alkane mono-oxygenases (4669 genomes, 21 phyla for both alkB/M and ladA). In general, AlkB/M and LadA enzymes consist of FA_desaturase (PF00487) and Bac_luciferase-like mono-oxygenase (PF00296) domains, respectively (Supplementary Table S5). They act non-specifically on a wide range of alkanes of different chain lengths. Therefore, they are likely to be more widespread in genomes, especially because alkane compounds do not exclusively originate from petroleum. For instance, in pristine marine ecosystems, primary producers such as Cyanobacteria can release long chain-length aliphatic compounds (e.g., pentadecane, heptadecane). Alkane-producing Cyanobacteria include prominent and globally abundant genera such as Prochlorococcus and Synechococcus [20, 21]. Therefore, marine microorganisms are broadly exposed to aliphatic compounds with different chain lengths, even in environments without oil spills or industrial influence. This can explain why marine ecosystems host a plethora of hydrocarbonoclastic bacteria [22, 23]. This would probably be the reason why the number of genomes with alkB/M is higher than RHO-bearing genomes.

Enzymes XylX, NdoB, BphA1, and TodC1 are composed of two pfam domains, PF00355 (Rieske center) and PF00848 (Ring_hydroxyl_A). These common domains impact the branching in the phylogenetic tree and lead to the neighboring branching of these mentioned genes (Fig. 5).

Fig. 5
figure 5

Maximum-likelihood phylogenetic reconstruction of amino acid sequences of ring-hydroxylating oxygenase (RHO) protein recovered from genomes. A Major clusters of RHO genes. The scale bar indicates 1.0 branch distance. B Heatmap representations of the distribution of recovered genes at the order level

RHO enzymes are predominantly present in Burkholderiales, Pseudomonadales, Sphingomonadales, Caulobacterales, and Nevskiales orders of the phylum Proteobacteria (35 different Proteobacterial orders) (Fig. 5, B part). However, a significant number of pheA (phenol 2-monooxygenase) gene and, to a lesser degree, xylX and tmoA genes were also present in Actinobacteriota phylum (9 different Actinobacteriotal orders) (Fig. 5, B part).

Sphingomonadales are prominent bacteria in the rhizosphere and are also abundant in littoral zones of inland waters. Accordingly, we suggest that these bacteria may have evolved a capacity to degrade different aromatic compounds in response to the high concentrations of aromatic secondary metabolites typically seen in the plant rhizosphere. Additionally, Sphingomonadales are known for their large plasmids with intraspecies transmission [24].

Representatives of Burkholderiales were one of the other prominent members with RHO genes. This potential was mainly attributed to Paraburkholderia, Caballeronia, Burkholderia, Cupriavidus, Ralstonia and Massilia genera. All of the mentioned genera in our study had a genome size between 6.5–7 Mb and have been reported to have multi-chromosomes (multipartite) [25]. According to the reports, the second chromosome has a prominent role in niche adaptation in Burkholderiaceae family [25]. Their HC degradation capability might be related to this feature.

Among Pseudomonadales, representatives of Pseudomonas Acinetobacter, Halomonas, Marinobacter, Marinobacterium and Psychrobacter genera were detected to contain RHO genes in their genome. There are multiple reports of representatives of the Pseudomonas genus to degrade various HCs, where they mainly organize their degrading genes on conjugal plasmids [26, 27].

Acinetobacter representatives also had RHO genes. Acinetobacter spp. can produce bioemulsifier/biosurfactants and are naturally transformable species that might be relevant for their high degradation potential [28, 29]. Marinobacter and Halomonas genera were frequently reported as hydrocarbonoclastic bacteria which were isolated from oil-polluted marine water and sediment samples. Representatives of the Halomonas genus are also capable of EPS/biosurfactant production that accelerates the degradation of HC compounds, especially under saline conditions [30,31,32].

In our study the main RHO-bearing genome in the Caulobacterales order belonged to Hyphomonas genus. Representatives of this genus have been isolated from marine aromatic hydrocarbon polluted sites with the capability to degrade HCs [33, 34]. Nevskiales aromatic degradation potential was mainly detected in Nevskia, Polycyclovorans, and Solimonas genera. A Nevskia representative was reported to have biosurfactant production ability and Polycyclovorans representatives were shown to prefer HC as a nutrient compared to glucose [35, 36].

Among all investigated RHO genes, the highest phylogenetic diversity was observed in tmoA (208 genomes in 12 phyla and 38 orders) and xylX (1486 genomes in 9 phyla and 38 orders) genes (Fig. 5, B part). In the case of tmoA gene, it might be due to the wide range of HC compounds susceptible to this enzyme (e.g., benzenes, some PAHs, and alkenes) [37, 38]. Therefore, in our survey here we see that a diverse set of genera harbor tmoA gene and could potentially degrade different types of HCs (Fig. 5, B part).

Underrepresented microbial groups with a limited number of RHO genes also featured tmoA, xylX, and pheA genes. Myxococcota, Acidobacteriota, Chloroflexota, Firmicutes_I,E,K, and Cyanobacteria with tmoA gene were clustered separately, reflecting their distinct protein sequence and the lower possibility of HGT among these groups. For xylX, Eremiobacterota affiliated genes were placed together with genes from Gammaproteobacteria, and Tectomicrobia, Binatota, Chloroflexota, and Firmicutes_I were placed in separate branches near Actinobacteriota. In addition, Acidobacteriota, Eremiobacterota, and Campylobacteria with pheA gene were nested within Alphaproteobacteria members. The phylogeny of RHO genes was also more consistent with taxonomy than the phylogeny of alkB/M and ladA.

Bionatota, a recently described phylum shown to be efficient in HC degradation, harbored todC1, bphA1 (in Binatales order), and xylX (Bin18, Binatales) genes from RHOs and ladA (in Bin18) from alkane hydroxylases. Representatives of this phyla have been reported to play a role in methane and alkane metabolism [39]. However, we also noted the further potential of Binatales and Bin18 orders of this phylum in aromatic HC degradation.

RHOs can be located either on the chromosome or plasmid, depending on the organism. For instance, todC1, bphA1, and tmoA genes were reported to be on the chromosome [40], while in another study, they were detected on a plasmid [41]. Other RHOs, including xylX, xylM, pheA, and ndoB have mainly been reported to be hosted by plasmids [40, 38]. One of the main examples is TOL plasmid (carry genes for xylene and toluene degradation) which has been reported to be transferable between Pseudomonas spp. Rhizobium spp. and Erwinia spp. [42, 43].

Multiple copies of RHO genes in one genome were detected for xylX and pheA. Immundisolibacter cernigliae surprisingly contained 21 variants of xylX. This genome also had six copies of alkB/M and was isolated from a PAH-contaminated site [44]. The high HC degradation potential of other members of this genus has also been reported in the marine ecosystem [45, 46]. Rugosibacter aromaticivorans (containing 5, 2 and 2 copies of xylX, ndoB, and tmoA genes, respectively), Pseudoxanthomonas_A spadix_B (with 4, 2 and 2 copies of xylX, todC1 and bphA1 genes, respectively), Thauera sp002354895 (4), Pigmentiphaga sp002188635 (4) are other examples of genomes that have multiple copies of the xylX gene. Although xylX gene was detected in Actinobacteriota, multiple copies in a genome were seen only among the Proteobacteria phylum.

The pairwise BLAST identity among 21 different variants of the xylX gene present in the Immundisolibacter cernigliae genome ranged between 35 to 81 percent. Among these 21 xylX copies, three sequences (i.e., xylX numbers 18, 19, and 22 in Supplementary Figure S7) showed higher BLAST identity percentages with the xylX gene of the Rugosibacter genus compared to other xylX copies present in the Immundisolibacter cernigliae genome itself (Supplementary Table S7 and Supplementary Figure S7). Several xylX copies of I. cernigliae (i.e., xylX numbers 10, 11, 13, and 15 in Supplementary Figure S7) had more edges than others in the network, and their interactions (Supplementary Figure S7, highlighted in red) represent their similarity with xylX copies of Caballeronia, Sphingobium, and Pseudoxanthomonas, Pseudomonas, and Thauera genera. In addition, xylX number 5 and 7 of Immundisolibacter had almost similar blast identities with the xylX gene of Pigmentiphaga genus and other xylX copies in I. cernigliae Itself. These results suggest that copies of the xylX gene in I. cernigliae potentially originate from horizontal transfer from other microbial groups.

On the other hand, Glutamicibacter mysorens (4), Enteractinococcus helveticum (4), and many other genomes from the Castellaniella, Kocuria, and Halomonas genera, had several pheA copies in their individual genomes. To a lesser degree, tmoA gene was present in multiple copies in Pseudonocardia dioxanivorans (4), Rhodococcus sp003130705 (3), Amycolatopsis rubida (3) and Zavarzinia compransoris_A (3) genera.

While bphA1 and todC1 have different KO identifiers (Supplementary Table S1), our manual checks showed that they had the same conserved domain based on NCBI CD-Search [47]. We kept both annotations for cases where one gene was annotated with both KO identifiers. Previous studies also report similar homology and substrate specificity between todC1 and bphA1 [41].

XylM, as one of the enzymes mediating the initial steps in toluene/xylene degradation, showed the lowest abundance and phylogenetic diversity (27 genomes in 1 phylum and 6 orders). Toluene/benzene can generally be degraded through different routes and three of the most prevalent approaches were studied here. XylX, TodC1, and TmoA are the initial oxygenase enzymes of these three pathways. They are diverse in starting the degradation and composed of different domains, while downstream degradation converges to catechol derivatives as intermediates. xylM can also initiate toluene degradation in addition to xylene. In this pathway xylX then converts produced benzoate to catechol. Therefore, while we see a lower diversity and abundance of genomes harboring xylM, we want to highlight the possibility of the presence of alternative degradation enzymes in different microorganisms that can degrade the same compound and initiate the degradation process.

As the number of rings in aromatic compounds increases, the number and diversity of microbial groups capable of degrading them decreases, and microbial groups with ndoB (naphthalene 1,2-dioxygenase) accordingly showed the lowest abundance after xylM gene. The genomes hosting ndoB had limited phylogenetic diversity (35 genomes in 1 phylum and 6 orders) and were found mainly in representatives of Alphaproteobacteria (Sphingomonadales (17) and Caulobacterales (2)) and Gammaproteobacteria (Pseudomonadales (5), Burkholderiales (1), Nevskiales (1)).

Ecological strategy of HC degrading bacteria

Microorganisms are broadly divided into two main functional growth categories, i.e., oligotrophic/slow-growing/K-strategist or copiotrophic/fast-growing/r-strategist. These ecological strategies are associated with the genome size that, in turn, directly correlates with the GC content [48]. To get further insights into the ecological strategies of organisms that feature HC degrading genes, we compared the distribution of GC content and estimated genome size. This analysis revealed that HC degrading genes were present in genomes with a broad genome size range (1.34 to 16.9 Mb) and GC content (26.9 to 76.6%) (Supplementary Figure S8, data available in Supplementary Table S8). Genomes containing HC degrading enzymes with GC percent equal to or lower than 30% mainly had alkB gene and taxonomically belonged to Flavobacteriales order (genome sizes in the range of 1.4 to 4.2 Mb). The largest genome included in this study was Minicystis rosea from the phylum Myxococcota (genomes size of 16.9 Mb), which also contained alkB. In this phylogenetic reconstruction, the alkB gene of Minicystis rosea clustered together with homologs from representatives of phylum Gammaproteobacteria (Immundisolibacter and Cycloclasticus genera) (Fig. 3). The large genome size of Minicystis rosea, together with the phylogenetic placement of its alkB gene together with phylum Gammaproteobacteria representatives, hint at the horizontal transfer of this gene to the Minicystis rosea genome. These analyses indicate that HC degradation ability is present in both K-strategist and r-strategists microorganisms. Earlier studies have shown that r-strategist serves as the principal HC degraders after oil spills and at other point sources of pollution in marine environments [49,50,51]. Indeed, most obligate hydrocarbonoclastic bacteria are r-strategists (Proteobacteria domain) and are mainly reported to be isolated from marine samples [52]. The r-strategists are adapted to live in oligotrophic environments with transient nutrient inputs and rapid consumption of substrates upon episodic inputs by means of fast growth and population expansion [53]. In contrast, studies on oil-polluted soil samples suggest a predominance of K-strategists, especially in the harsh conditions (High concentration of HC, soil dryness, etc.) commonly seen in many such soil environments [54,55,56]. Hosting multiple copies of genes coding for HC degrading enzymes seems to be a shared feature in both r- and K-strategists with small and large genome sizes alike and appears to be a universal evolutionary strategy for HC degradation.

Genome-level analysis of HC degradation

Microorganisms are known to use division of labor or mutualistic interactions to perform HC degradation in the environment [57, 58]. However, 92 genomes (less than 0.5%) of 23,446 investigated bacterial genomes do in fact, contain all the enzymes required to degrade at least one HC compound completely. These 92 genomes all belong to Actinobacteriota (n = 25) and Proteobacteria (n = 67) (Fig. 6).

Fig. 6
figure 6

Genomes with complete/near complete degradation pathways of different HCs. Colors represent the type of HC that microbial genomes could degrade. Rows represent the type of HCs and columns show the name of genomes. Orders belonging to Proteobacteria and Actinobacteriota phyla are written in blue and red, respectively. KEGG orthologous accession number of enzymes for the complete degradation process of each compound is written at the figure’s bottom

Microorganisms have evolved two pathways for naphthalene degradation that involve the production of either catechol or gentisate as aromatic degradation intermediate (Supplementary Figure S1). Catechol can in turn, be further degraded via meta- or -ortho cleavage. Several microorganisms, including Novosphingobium naphthalenivorans, Pseudomonas_E fluorescens_AQ, Pseudomonas_E frederiksbergensis_E, and Herbaspirillum sp000577615, feature both of the mentioned pathways and even have the genes to potentially perform ortho and meta cleavage simultaneously (Fig. 6).

Moreover, Cupriavidus pauculus_A (long-chain alkanes and also biphenyl), Cycloclasticus sp002700385 and Paraburkholderia_B oxyphila (cycloalkane and xylene/benzene), Pigmentiphaga sp002188465 (cycloalkane and phenol), Rhodococcus sp003130705, Burkholderia puraquae, and Paraburkholderia_B mimosarum (toluene and biphenyl) contain the genetic potential to degrade more than one HC compound autonomously (Fig. 6).

Members of Burkholderiales had the genetic potential to degrade even more diverse compounds individually, while Actinobacteriota representatives mainly had the potential to contribute to the degradation of aliphatic compounds. This ability was also apparent in Figs. 1, 3, and 4. The potential for autonomous HC degradation wasn’t detected in genomes of more rare bacterial phyla. Moreover, none of the archaeal genomes investigated in this study contained all genes for the complete degradation of HCs.

HC degradation across domain archaea

Generally, HC degradation ability seems to be less prevalent among archaea as compared to bacteria. The phylum Halobacterota had the highest proportion of enzymes involved in the degradation of both aliphatic (n = 14 enzymes of aliphatic degradation pathway) and aromatic (n = 25 enzymes of aromatic degradation pathway) compounds among the studied archaea (Supplementary Figure S9). The alkB enzyme, responsible for short-chain alkane degradation, was detected in two copies in a single member of the phylum Nanoarchaeota (ARS21 sp002686215). This gene was clustered together with alkB identified in Gammaproteobacteria representatives (GCA-002705445 order) (Fig. 3). Genes needed to initiate degradation of long-chain alkanes and cyclododecane/cyclohexane as well as cyclopentane degradation via ladA and cddA/chnB genes were more prevalent among Halobacterota representatives (75 genomes in 7 families; Haloferacaceae, Haloarculaceae, Natrialbaceae, Halococcaceae, Halalkalicoccaceae, Haloadaptaceae, and Halobacteriaceae) (Figs. 4 and Supplementary Figure S9). Among investigated RHOs, only tmoA that initiates toluene degradation was present in 5 Sulfolobales and 2 Thermoproteales genomes of the phylum Crenarchaeota (Fig. 5). Detected archaeal tmoA and ladA genes branched separately from bacteria in the phylogenetic trees (Figs. 4 and 5). Apart from alkB, gene duplications were present in several genomes for both tmoA (Sulfolobus and Acidianus genera) and ladA (Halopenitus persicus and Halopenitus malekzadehii).

Key enzymes needed to initiate HC degradation were rarely present in archaea (Figs. 3, 4, and 5), indicating that Archaea might not play a significant role in the typically rate-limiting initial degradation of HCs. However, several studies report the ability of halophilic archaeal isolates (e.g., Halorubrum sp., Halobacterium sp., Haloferax sp., Haloarcula sp.) to degrade both aliphatic (n-alkanes with chain lengths up to C18 and longer) and aromatic (e.g., naphthalene, phenanthrene, benzene, toluene and p-hydroxybenzoic acid) HCs and use them as their sole source of carbon [59,60,61]. This may imply that archaea carry alternative and hitherto unknown enzymes for triggering HC degradation. However, there is no complete genome information available for the mentioned isolates to screen them for the presence of alternative degrading enzymes [11]. The Haloferax sp., capable of using a wide range of HCs as its sole source of carbon, present in the AnnoTree database (RS_GCF_000025685.1), contained none of the key degrading genes. The AnnoTree website chooses representative genomes having completeness of higher than 90%, which reduces the likelihood of incompleteness of the studied genome as a reason for the absence of these genes. Therefore, alternative HC degrading genes that are present in the accessory part of the genomes might be responsible for the observed degradation.

On the other hand, the recent reconstruction of three metagenome-assembled Thermoplasmatota genomes (Poseidonia, MGIIa-L2, MGIIb-N1) from oil-exposed marine water samples (not included in the GTDB release89) contained enzymes involved in alkane (alkB) and xylene (xylM) degradation [46]. Hence as these global genome depositories continue to expand, we may have to revise or update our findings.

A total number of 597 archaeal genomes contain enzymes involved in the degradation of aromatic compounds regarding the conversion of catechol to TCA intermediates. This is observed in the phyla Halobacterota (176 genomes in Haloferacaceae, Haloarculaceae, Natrialbaceae, Halococcaceae, Halobacteriaceae, Methanocullaceae, Methanoregulaceae, Methanosarcinaceae, Archaeoglobaceae, and some other methano-prefixed families), Thermoplasmatota (175 genomes in Poseidoniales, Marine Group III, Methanomassiliicoccales, UBA10834, Acidiprofundales, DHVEG-1, UBA9212), and Crenarchaeota (110 genomes in Nitrospherales, Desulfurococcales, Sufolobales, Thermoproteales). This widespread capacity for degrading downstream intermediates in aromatic HC degradation implies that archaea interact closely with bacteria in HC degradation.

Conclusions

HCs are ubiquitously distributed in the biosphere and do not exclusively originate from oil. In this study, the distribution of genes of key HC degrading enzymes involved in the degradation of certain HCs (aliphatic and aromatic types) is provided at genome resolution for both archaeal and bacterial domains. Over the last decades, extensive environmental genome and metagenome sequencing has significantly increased the number of available microbial genomes and enriched contemporary genomic databases. The genome-based taxonomy using average nucleotide identity (ANI) or relative evolutionary divergence adopted by the Genome Taxonomy Database; GTDB [62, 63] as a reproducible method has in parallel revised and updated some taxonomic ranks. The order Oceanospirillales, as an example, is a well-known taxon in the marine oil degradation context, and its representatives have been frequently reported as one of the main HC degrading members in response to oil pollution [52, 64, 65]. Nonetheless, this taxonomic rank has been removed from the genome-based taxonomy, and its members have been mainly placed in the order Pseudomonadales [66]. This could potentially cause a communication gap between the existing literature and new research. An updated comprehensive metabolic survey of Bacteria and Archaea for HC degradation potential at genome resolution could thus help bridge this gap. Our extensive survey shows that a greater diversity of bacteria contain genes involved in aliphatic HC degradation compared to aromatic HCs. Few genomes were detected to contain all necessary genes to carry out complete degradation pathways. This reiterates previous findings that microbes generally cooperate for HC degradation by “division of labor” and a community perspective would therefore be crucial for predicting the fate of oil HCs in the ecosystem. According to our results and prevalence of genes for known HC degradation enzymes, archaea could potentially make a small contribution to aerobic HC degradation however, based on genes they carry they could play a siginificant role in peripheral routes to degrade intermediate compounds produced by bacterial community. We detected HC degradation ability among both r and K strategists and found signals of gene duplication and horizontal transfer of key HC degrading genes among microbes. This could be an efficient way to increase degradation capability among microbial members and potentially help them adapt to the available pool of HCs in their ecosystem.

Materials and methods

Data collection of HC degrading enzymes

Representative compounds from each category of HCs, including saturated aliphatic (short/long-chain alkanes) and alicyclic (cyclohexane/cyclododecane), compounds with mono-aromatic (toluene, phenol, xylene, and benzene), and poly-aromatic (PAHs) (naphthalene, phenanthrene, and biphenyl as representatives) hydrocarbons were selected to survey the distribution of Bacteria and Archaea capable of their degradation under aerobic conditions.

A complete list of enzymes involved in the degradation pathway of mentioned HCs was compiled from previous reports [67,68,69,70,71,72,73,74]. We explored these enzymes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [75], Pfam [76], TIGRFAMs [77], InterPro [78], and UniProt [79] databases. The accession number of enzymes in each mentioned database, their function, name, reaction (if available), EC number, and additional information are provided in Supplementary Table S1. The workflow overview is provided in Supplementary Figure S10. The CYP153 gene is another essential alkane hydroxylase that is less prevalent but present in HC degrading microorganisms lacking alkB/M [80]. However, there were some limitations in detecting this gene in silico via AnnoTree. A proper model to search for this gene has not been developed (no KEGG/TIGRfam accessions), and the only way to search for this gene through the AnnoTtree web server was to use a Pfam domain (PF00067). However, the PF00067 domain is a universal domain present in every CYP (cytochrome p450) gene, making it difficult (almost impossible) to differentiate between potential substrates and confirm whether it is involved in alkane degradation. Due to this limitation, we decided to skip this enzyme and only work on the groups we could confidently annotate.

Distribution of HC degrading enzymes among bacterial and archaeal representative genomes

The distribution of the compiled HC degrading enzymes described in Supplementary Table S1 was assessed across domains Bacteria and Archaea using AnnoTree (http://annotree.uwaterloo.ca) [12]. AnnoTree database is providing functional annotations for 24,692 genome representatives in the genome taxonomy database (GTDB) release 89. The phylogenetic classification of genomes is derived from the GTDB database (release R89). In total, the annotation information for 18, 10, and 90 enzymes involved in the degradation process of alkane, cycloalkane, and aromatic HCs, respectively, were analyzed. Genome hits were collected at the thresholds of percent identity ≥ 50, e-value cut off ≤ 1e−5, subject/query percent alignment ≥ 70 for KEGG annotations, and e-value cut off ≤ 1e−5 for Pfam and TIGRFAMs annotations. For each HC degrading enzyme, we first checked KEGG annotations. If there were no KEGG accession numbers for the enzyme, the second priority was TIGRFAMs; otherwise, the Pfam annotation was considered. The table contains information for the distribution of HC degrading enzymes of each pathway present in representative genomes from bacteria and archaea domains, as is shown in Supplementary Table S2 and Supplementary Table S3, respectively.

Phylogeny of bacteria and archaea augmented with the abundance of HC degrading enzymes

Evolview, a web-based tool for phylogenetic tree visualization, management, and annotation, was used to present the distribution view of HC degrading enzymes in representative genomes across bacterial/archaeal phylogenomic trees [81, 82].

The phylogenomic tree of bacteria and archaea in the Newick format, at the phylum level (123 and 14 leaves, respectively), was adopted from the AnnoTree website (November 21st, 2020). Trees were uploaded as the reference tree in Evolview. According to the abundance tables of HC degrading enzymes prepared for each degradation pathway, four heatmaps were plotted for bacteria and archaea domains (separately for aliphatic and aromatic compounds).

Single gene phylogeny

To provide the evolutionary history of key enzymes in each HC degradation pathway, the protein sequence of that enzyme was manually confirmed by inspecting their conserved domains using the NCBI web CD-Search tool (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) [47]. In this step, we have uploaded all detected genes retrieved from AnnoTree to the NCBI web CD-Search tool and manually checked their annotations and inspected their annotated domains. Validated amino acid sequences were then aligned using Kalign3 software [83], and their phylogenetic tree was reconstructed using FastTree2 [84].