Background

Carbohydrate-active enzymes (CAZymes) are enzymes involved in the assembly, modification or deconstruction of carbohydrates [1]. Based on amino acid sequence similarities, CAZymes are divided into several classes, including glycosyltransferases (GT) [2, 3], glycoside hydrolases (GH) [4,5,6], polysaccharide lyases (PL) [7, 8], carbohydrate esterases (CE) [8], and auxiliary activities (AA) [9], all of which are catalogued in the CAZy database. The huge diversity and complexity of natural glycans have stimulated studies aimed at uncovering novel CAZymes. As a result, the number of CAZyme families keeps growing, with about four new GH families added per year [10]. This broad diversity makes CAZymes attractive for a wide range of biotechnological applications, including animal feed, biocatalysis, agriculture, biorefinery, glycoengineering and biobleaching [11,12,13,14,15,16]. Along with classical methods, various omics approaches are now applied to the discovery of potential CAZymes. These “omics” technologies include proteomics, transcriptomics, metagenomics, metabolomics and whole genome sequencing [13, 17, 18].

Systematic genome sequencing has largely fueled the discovery of novel plant biomass-degrading enzymes [10]. Studies have shown that bacteria and fungi are the main producers of CAZymes in nature. Among them, extremophilic microorganisms have received special attention because of their capacity to live in extreme conditions such as high temperature, pressure, alkalinity, acidity, or salinity, thanks to their extremozymes [19]. Owing to their robustness, extremozymes are capable of functioning under harsh conditions more effectively than enzymes from other microorganisms [20]. Accordingly, thermophilic enzymes offer great potential for application in biotechnology, opening the possibility of performing biocatalysis at higher temperatures, which can be beneficial in some industrial settings [21, 22]. The study of thermophilic microorganisms has therefore expanded in recent years, including genome profiling and exploration of CAZyme content [23]. It has been demonstrated that carbohydrate-active enzymes work in conjunction with other CAZymes and proteins, and that their genes form clusters of physically linked genes called polysaccharide utilization loci (PULs) [24,25,26]. These clusters, which occur in bacteria of the Bacteroidetes phylum, have been progressively identified in the Firmicutes phylum as well [27].

The thermophilic anaerobic strain Caldicoprobacter algeriensis TH7C1T was isolated from the hydrothermal hot spring of Guelma. It was classified as a novel species in the Caldicoprobacter genus [28] and was shown to produce some thermophilic enzymes [29, 30]. However, its exploitation, in particular the discovery of its enzyme content such as CAZymes, was hampered by culture constraints, namely anaerobic growth and a high optimal temperature (65 °C).

To understand its plant biomass-degrading machinery and to discover new CAZymes of potential interest for biotechnological applications, we report, for the first time, the genome sequence of C. algeriensis TH7C1T. We also report the prediction of CAZyme-encoding genes and the identification of gene clusters acting on polysaccharides.

Results

Genome sequence and analysis

The genome sequencing of C. algeriensis TH7C1T yielded 473,434 Illumina reads with an average coverage of 34.55x. The de novo assembly resulted in 45 contigs with a total length of 2,535,023 bp (Accession number PRJNA743054) and an overall GC content of 44.9%. A circular genome map of C. algeriensis was constructed, showing contigs, GC content, and GC skew (Fig. 1).

Fig. 1

Circular genome map of Caldicoprobacter algeriensis TH7C1T generated using CGView. From outside to inside, ring 1 represents the 45 assembled contigs. Ring 2 shows the GC skew, with green indicating positive values and purple indicating negative values. The GC content is represented by the innermost ring
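The two sequence statistics plotted in the map can be illustrated with a minimal Python sketch. It assumes the usual definitions, GC content = (G + C) / (A + C + G + T) and GC skew = (G - C) / (G + C), computed over sliding windows. The window size, step and toy sequence are hypothetical, and the code is not the CGView implementation.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    a, t = seq.count("A"), seq.count("T")
    total = a + c + g + t
    return (g + c) / total if total else 0.0


def gc_skew(seq: str) -> float:
    """GC skew (G - C) / (G + C); 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_windows(seq: str, window: int, step: int):
    """Yield (start, gc_content, gc_skew) for every full window of seq."""
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        yield start, gc_content(chunk), gc_skew(chunk)


if __name__ == "__main__":
    toy = "GGGCATATGCCCGGAT"  # toy sequence, not taken from the genome
    for start, gc, skew in sliding_windows(toy, window=8, step=4):
        print(start, round(gc, 3), round(skew, 3))
```

A positive skew in a window means G outnumbers C on the plotted strand, which corresponds to the green segments of ring 2.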

The overall genome statistics of C. algeriensis are close to those from Caldicoprobacter faecalis, Caldicoprobacter oshimai and Caldicoprobacter guelmensis (Table 1).

Table 1 Comparison of genome features between C. algeriensis, C. faecalis, C. oshimai and C. guelmensis

Gene prediction performed with the RAST server resulted in 2720 features, including 2666 protein coding sequences (CDSs) classified in 226 SEED subsystems, and 53 RNA genes. Figure 2 shows the subsystem category distribution following RAST annotation. The largest shares of the subsystem features are allocated to Amino Acids and Derivatives and to Carbohydrate metabolism, with 15.83% and 10.71%, respectively. DFAST annotation revealed 2425 CDSs and 53 RNA genes, covering 85.3% of the genome; the average length of the CDSs is 297 bp.

Fig. 2

An overview of the RAST annotation and subsystems distribution for the C. algeriensis genome

Analysis of genome stability using RAST and the CRISPRCasFinder server revealed two CRISPR arrays, located in contig 13 (evidence level 4) and contig 2 (evidence level 1). This analysis also revealed three Cas clusters in contig 13, assigned to CAS, CAS-TypeIIID and CAS-TypeIB. Another Cas cluster, belonging to CAS-TypeIE, is located in contig 12.

CAZymes annotation

Sequences submitted to the dbCAN server were annotated automatically for CAZymes using the HMMER3.0 package and the dbCAN CAZyme database (see Additional file 1: Table S1). This analysis identified 97 genes associated with glycan assembly and breakdown. The most abundant class predicted in this genome was the glycoside hydrolases, with 57 CAZyme-encoding genes divided into 32 different families.
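As an illustration of how HMMER domain hits can be turned into CAZyme family calls, the sketch below parses the per-domain table written by `hmmscan --domtblout` and keeps hits that pass an E-value and HMM-coverage filter. The thresholds (independent E-value below 1e-15, coverage above 0.35), the gene names and the toy rows are assumptions for illustration; the filtering parameters actually used in this study are not stated.

```python
from typing import Iterator, List, NamedTuple

# Toy hmmscan --domtblout rows (hypothetical genes and scores, illustration only).
SAMPLE_DOMTBLOUT = """\
# target accession tlen query accession qlen E-value score bias ...
GH3.hmm - 227 gene_0001 - 750 1.2e-60 200.1 0.1 1 1 3.0e-63 1.5e-60 199.8 0.1 2 226 40 270 39 271 0.95 -
CE4.hmm - 130 gene_0002 - 310 4.0e-05 20.3 0.0 1 1 1.0e-07 2.0e-05 19.9 0.0 10 40 55 86 50 90 0.80 -
GT2.hmm - 170 gene_0003 - 420 8.0e-31 105.2 0.0 1 1 2.0e-33 1.0e-30 104.9 0.0 5 150 12 160 10 165 0.92 -
CBM50.hmm - 40 gene_0004 - 280 5.0e-21 70.4 0.2 1 1 3.0e-23 1.0e-20 69.8 0.2 1 10 200 209 198 212 0.90 -
"""


class DomainHit(NamedTuple):
    family: str          # profile name without the .hmm suffix
    gene: str            # query protein identifier
    i_evalue: float      # independent (per-domain) E-value
    hmm_coverage: float  # aligned fraction of the profile HMM


def parse_domtblout(text: str) -> Iterator[DomainHit]:
    """Yield one DomainHit per non-comment row of hmmscan --domtblout text."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        family = f[0][:-4] if f[0].endswith(".hmm") else f[0]
        hmm_len = int(f[2])                        # tlen: length of the profile
        hmm_from, hmm_to = int(f[15]), int(f[16])  # aligned span on the profile
        coverage = (hmm_to - hmm_from + 1) / hmm_len
        yield DomainHit(family, f[3], float(f[12]), coverage)


def filter_hits(hits, max_evalue: float = 1e-15,
                min_coverage: float = 0.35) -> List[DomainHit]:
    """Keep hits passing both thresholds (assumed values, see lead-in)."""
    return [h for h in hits
            if h.i_evalue < max_evalue and h.hmm_coverage > min_coverage]


if __name__ == "__main__":
    for hit in filter_hits(parse_domtblout(SAMPLE_DOMTBLOUT)):
        print(hit.gene, hit.family, hit.i_evalue, round(hit.hmm_coverage, 2))
```

Requiring both a low E-value and a minimum profile coverage discards short partial matches, such as the toy CBM50 row, that would otherwise pass on E-value alone.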

The glycoside hydrolase family with the most members in C. algeriensis was GH109, with 9 predicted genes, followed by GH3 with 6 genes and GH2 and GH13 with 4 genes each. The GH109 family, which contains members involved in the deconstruction of galactomannans, was widely represented in this genome. Interestingly, CAZymes belonging to this family had not been identified as major catalysts in previous studies of biomass-degrading potential in hot springs. The GH3 family is represented by 6 predicted enzymes with hemicellulose-hydrolyzing and debranching activities such as glucosidase, xylosidase and glucanase. GH3 has been reported as the most abundant GH family for oligosaccharide degradation in hot spring ecosystems [31]. The other abundant glycoside hydrolases predicted in this genome belong to the GH2 and GH13 families, which catalyze the degradation of oligosaccharides and starch, respectively.

The second most frequent CAZyme class in this genome is the glycosyltransferases (GTs), with 20 encoding genes. GTs catalyze the transfer of sugar residues from activated donor molecules to saccharide or non-saccharide acceptor molecules to form glycosidic linkages. This finding corroborates a previous exploration of biomass-degrading enzyme potential in hot spring ecosystems [31], which showed that glycoside hydrolases and glycosyltransferases are widespread groups of CAZymes in thermophilic microbial communities.

The output from dbCAN2 also included multiple hits corresponding to carbohydrate esterases (CEs), represented by 6 predicted genes attributed to the CE1, CE4 and CE9 families. CEs act on ester bonds in carbohydrates, accelerating the degradation of polysaccharides and facilitating the access of glycoside hydrolases. The most abundant CEs in the C. algeriensis genome belong to the CE4 family, which acts on acetylated xylan and chitin. Members of the CE1 and CE9 families are involved in the hydrolysis of xylan and acetylglucosamine, respectively.

The remaining putative CAZyme was attributed to the polysaccharide lyases (PL), represented by only one predicted gene. This genome also encodes 14 carbohydrate-binding modules (CBMs). The majority of predicted CBMs belong to CBM4 and CBM50. CBM4 comprises modules that recognize xylan, 1,3-glucan, 1,3-1,4-glucan, 1,6-glucan, and amorphous cellulose, while CBM50 modules mediate the binding of enzymes that cleave chitin or peptidoglycan. They were found associated with GH genes or other CBMs. The CAZyme gene predictions and the protein-encoding gene sequences are available as supplementary material (Additional files 1: Table S1 and 2).

CAZyme-encoding genes were also compared with the CAZy database by querying the genome with DIAMOND through the dbCAN meta server. This fast BLAST-like search showed identities between 35 and 83% with their nearest neighbors (Table 2).
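The identity range reported for the nearest neighbors can be derived from a DIAMOND result table along the following lines. This is a minimal sketch, assuming the default 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and taking the highest-bitscore hit per query as its nearest neighbor. The rows are invented toy values, not results from this study.

```python
# Toy DIAMOND rows in default tabular order (hypothetical genes and subjects).
TOY_ROWS = [
    ("gene_0001", "refA", 83.0, 700, 119, 0, 1, 700, 1, 700, 0.0, 1200.0),
    ("gene_0002", "refB", 35.0, 300, 195, 4, 5, 300, 8, 305, 1e-40, 150.0),
    ("gene_0002", "refC", 30.0, 120, 84, 2, 10, 129, 3, 122, 1e-08, 90.0),
    ("gene_0003", "refD", 61.5, 410, 158, 1, 1, 410, 2, 411, 1e-150, 520.0),
]
TOY_TSV = "\n".join("\t".join(map(str, row)) for row in TOY_ROWS)


def best_hits(tsv_text: str) -> dict:
    """Map each query to (subject, percent identity, bitscore) of its top hit."""
    best = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        pident, bitscore = float(cols[2]), float(cols[11])
        # Keep only the highest-scoring subject for each query gene.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best


def identity_range(best: dict) -> tuple:
    """Return (min, max) percent identity over the best hits."""
    idents = [pident for _, pident, _ in best.values()]
    return min(idents), max(idents)


if __name__ == "__main__":
    hits = best_hits(TOY_TSV)
    print(identity_range(hits))
```

Selecting by bitscore rather than by percent identity avoids ranking a short, highly identical fragment above a full-length match.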

Table 2 Comparison of predicted CAZymes of C. algeriensis with those available in CAZy database using DIAMOND tool in dbCAN

PUL annotation and CGC prediction

To examine the presence of Gram-positive polysaccharide utilization loci (gpPUL) in the genome of C. algeriensis, we used the Basic Local Alignment Search Tool for translated nucleotide queries (BLASTX) available in dbCAN-PUL. This tool queries sequences against the PUL proteins held in the dbCAN-PUL repository. This analysis resulted in a large number of sequence similarities (11,320) (see Additional file 1: Table S2), including 36 CAZymes, 21 transporters (TCs) and 6 signal transduction proteins (STPs). The PUL with the highest number of hits to our query sequences is PUL0390, with a total of 10 hits. This PUL is predicted to be capable of degrading acetylated glucuronoxylan.

CAZyme gene cluster (CGC) prediction via dbCAN2 with CGC-Finder unveiled 33 CGCs, defined by the presence of at least one CAZyme, one transporter and one transcription factor encoding gene (Fig. 3 and Additional file 1: Table S3).

Fig. 3

Schematic representation of the 33 predicted CAZyme Gene Clusters (CGCs) showing the organization of genes in each cluster. CAZyme genes are colored red, TC (Transporter Classification) genes green and TF (Transcription Factor) genes blue. Non-signature genes, which can be inserted between signature genes, are colored gray

CAZyme gene labels are based on CAZyme domain assignment; TC genes were predicted by searching against the TCDB, and TF genes by searching against the transcription factor families in Pfam and Superfamily. The gene organization of the predicted clusters is shown in Fig. 3.
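The cluster definition used here (at least one CAZyme, one TC and one TF gene, with non-signature genes allowed in between) can be sketched as a scan over the genes of one contig in genomic order. This is a simplified illustration, not the CGC-Finder code: the function name `find_cgcs`, the gene identifiers and the `max_gap` value of two intervening non-signature genes are assumptions.

```python
from typing import List, Optional, Tuple

Gene = Tuple[str, Optional[str]]  # (gene id, signature label or None)
REQUIRED = {"CAZyme", "TC", "TF"}  # signature classes that must all be present


def find_cgcs(genes: List[Gene], max_gap: int = 2) -> List[List[str]]:
    """Return gene-id lists for runs of signature genes forming a CGC.

    genes must be ordered along one contig. Consecutive signature genes
    belong to the same run when at most max_gap non-signature genes
    separate them; a run is reported only if it contains every class
    in REQUIRED.
    """
    clusters: List[List[str]] = []
    run: List[int] = []  # indices of signature genes in the open run

    def close(indices: List[int]) -> None:
        if indices and REQUIRED <= {genes[i][1] for i in indices}:
            span = genes[indices[0]:indices[-1] + 1]
            clusters.append([gene_id for gene_id, _ in span])

    for idx, (_, label) in enumerate(genes):
        if label not in REQUIRED:
            continue  # non-signature gene: may sit inside a cluster
        if run and idx - run[-1] - 1 > max_gap:
            close(run)  # gap too large: finish the current run
            run = []
        run.append(idx)
    close(run)
    return clusters


if __name__ == "__main__":
    contig = [("g1", "TF"), ("g2", None), ("g3", "CAZyme"), ("g4", "TC"),
              ("g5", None), ("g6", None), ("g7", None),
              ("g8", "CAZyme"), ("g9", "CAZyme")]
    print(find_cgcs(contig, max_gap=2))
```

In the toy contig, g1 to g4 form one cluster, while g8 and g9 lie more than two non-signature genes away and lack a TC and a TF gene, so they are not reported.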

The sequence similarity results were used to determine carbohydrate utilization ecotypes. Among the predicted CGCs, 20 contain CAZymes with no similarity to proteins in the repository. Based on the combination of enzymes in each predicted CGC and on gene similarities with entries in the dbCAN-PUL database, we predicted a target polysaccharide for each cluster (Table 3). The determination of carbohydrate utilization ecotypes provides insight into their biotechnological potential.

Table 3 Targeted substrates predicted for CAZymes genes clusters

Discussion

Extremophilic microorganisms are of prime interest for biotechnological applications. They possess great potential to degrade plant biomass thanks to their enzymes [20]. Previous studies have shown that they are efficient producers of CAZymes [32, 33]. In the present work, we gained insight into the profile of genes involved in carbohydrate metabolism (the CAZome) in the thermophilic and anaerobic Caldicoprobacter algeriensis TH7C1T. This strain, classified as a novel species in the Caldicoprobacter genus, was isolated from a hot spring. Because of its demanding culture conditions, we used whole genome sequencing to unveil the capability of C. algeriensis for polysaccharide utilization through complex machineries that include efficient carbohydrate-active enzymes. The C. algeriensis TH7C1T genome consists of 2,535,023 bp with 44.9% GC content, which is similar to the already sequenced Caldicoprobacter species C. faecalis, C. oshimai and C. guelmensis.

In this study, we report for the first time the CAZyme repertoire of a thermophilic bacterium assigned to the Caldicoprobacter genus. CAZyme prediction via the dbCAN server using the predicted amino acid sequences of C. algeriensis unveiled 97 CDSs encoding CAZymes, representing 4% of protein coding genes. This percentage is within the range of CAZyme-encoding genes estimated for microbial genomes in general [1] and for genomes of previously reported thermophilic Firmicutes, such as BZ3, isolated from a new thermophilic compost-derived consortium (4%) [34], and the thermophilic bacterium Caldanaerobacter sp. strain 1523vc, isolated from a hot spring of Uzon Caldera (3.6%) [35]. Among the predicted CAZymes, the most abundant class was the glycoside hydrolases (GH), at about 58% of CAZymes, the highest percentage of glycosidases reported in genomes and metagenomes from hot spring ecosystems. C. algeriensis also stands out for having the greatest diversity of GH families (32) compared with other thermophilic genomes [34, 36, 37]. These GHs include the major families for hemicellulose and cellulose metabolism. Based on this, we speculate that C. algeriensis possesses great potential to degrade carbohydrates more effectively than previously described strains.
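The proportions quoted in this Discussion follow from the gene counts given in the Results, as the short calculation below illustrates. The text does not state whether the RAST or the DFAST CDS count was used as the denominator for the share of CAZyme-encoding genes, so both are computed here as an assumption; the dictionary keys are hypothetical names.

```python
# Counts taken from the Results section of this article.
GENOME_COUNTS = {
    "cazyme_genes": 97,   # predicted CAZyme-encoding genes (dbCAN)
    "gh_genes": 57,       # glycoside hydrolase genes
    "gt_genes": 20,       # glycosyltransferase genes
    "cds_rast": 2666,     # protein coding sequences predicted by RAST
    "cds_dfast": 2425,    # protein coding sequences predicted by DFAST
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    c = GENOME_COUNTS
    print("CAZymes / DFAST CDSs:", percent(c["cazyme_genes"], c["cds_dfast"]))
    print("CAZymes / RAST CDSs: ", percent(c["cazyme_genes"], c["cds_rast"]))
    print("GH share of CAZymes: ", percent(c["gh_genes"], c["cazyme_genes"]))
    print("GT share of CAZymes: ", percent(c["gt_genes"], c["cazyme_genes"]))
```

With these counts, the DFAST denominator gives 4.0% and the RAST denominator 3.6%, so the quoted 4% is consistent with the DFAST gene count.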

When glycoside hydrolase families were ranked by relative abundance, GH109 and GH3 were the most represented. These two families are responsible for the degradation of hemicelluloses and oligosaccharides, respectively. As reported previously in thermophilic microbial consortia and hot spring samples, the other abundant class of CAZymes was the glycosyltransferases (GT), with 20% of predicted CAZymes. This large diversity of biomass degradation-related genes encoded by the C. algeriensis genome supports studies showing the importance of the Firmicutes phylum in the deconstruction of structural plant polysaccharides [27]. This group of bacteria has been shown to be among the 6 predominant phyla in hot spring ecosystems [36, 38]. Given that they are nutritionally specialized [27], they develop a battery of endo- and exo-acting carbohydrate-active enzymes and transporters responsible for the cleavage of particular carbohydrates. Earlier studies reported that these genes are organized in clusters involved in polysaccharide degradation and transport, forming Gram-positive polysaccharide utilization loci (gpPUL) [27]. In our study, we report for the first time the existence of CAZyme gene clusters in the Caldicoprobacteraceae.

PULs were analyzed based on gene homology with PULs available in the dbCAN-PUL database. The results showed 11,320 gene similarities in CAZymes, transporters and signal transduction proteins across all PULs in the dbCAN repository, with identities between 18.7% and 80.7%. To further analyze the carbohydrate utilization ability of C. algeriensis, we performed CAZyme gene cluster analysis via the CGC finder in the dbCAN2 meta server. We obtained 33 CAZyme gene clusters. Among them, 22 CGCs, including 19 GH families, were predicted to be involved in cellulose and hemicellulose hydrolysis (GH3/GH5/GH2/GH10/GH30/GH35/GH38/GH4), glycogen degradation (debranching enzymes; GH3/GH13_9/GH67/GH94) and starch utilization (GH13_39). The most abundant CAZyme family identified in CGCs was GH109. Nine genes, which typically encode α-N-acetylgalactosaminidase and β-N-acetylhexosaminidase, were found in seven clusters (CGC11, 13, 16, 24, 26, 28 and 29). GH109 genes were combined with genes of other GH families, GH65/13/51 in CGC16 and GH2 in CGC29, supporting the view that the synergistic action of many CAZymes is required for polysaccharide cleavage [39]. Interestingly, comparison of the GH109 genes with genes from PULs available in the database revealed no significant similarity. Thus, we suggest that C. algeriensis encodes new gene clusters not identified previously. Indeed, few studies have characterized GH109 family members [20], and the CAZy database lists only 7 GH109 nagalases as characterized. Members of this family are particularly interesting for their ability to convert red blood cell (RBC) A-antigens into H-antigens, turning type-A blood into universal donor type-O blood [40, 41].

C. algeriensis also encodes six CAZyme gene clusters that include members of GT families. As reported previously in extremophilic ecosystems, most GT genes belonged to the GT2 and GT4 families [20, 36, 42]. These two families have been reported to perform the synthesis of alpha and beta glycans and glycoconjugates. GT4 contains a large variety of enzymes, including enzymes involved in the synthesis of lipopolysaccharide and of the antibiotic avilamycin A [43]. Owing to the difficulty of purifying these membrane-associated enzymes and investigating their biochemical features, only a small number of GTs have been characterized. Nevertheless, they have been described as offering potential opportunities in biotechnological applications such as biomedicine, cell biology and the pharmaceutical industry. Consequently, an in-depth analysis of genes belonging to these families is very important.

Carbohydrate esterases were also identified in two CAZyme gene clusters (CGC2 and CGC5), related to the CE1 and CE19 families. CE1 constitutes the largest family of esterases, with 5062 entries listed in the CAZy database. Members of family CE1 are known to target xylan, while CE19 family members are involved in pectin degradation. Recently, carbohydrate esterases have shown great potential in several industrial applications such as the food, pulp and paper, biofuel, animal feed, medical and pharmacological industries [44, 45].

Gene similarity analysis revealed 11 other genes, in addition to GH109, with no homologs in the PUL database, including genes belonging to the GT2, GT4 and CE19 CAZyme families. Thus, the C. algeriensis genome could be a source of novel thermophilic enzymes with strong potential for biotechnological applications.

Conclusions

The present work constitutes the first study targeting the CAZyme repertoire of bacteria belonging to the Caldicoprobacteraceae based on whole genome sequencing. The prediction of CAZyme-encoding genes highlighted the high potential of C. algeriensis for the degradation of structural plant polysaccharides. Detailed analysis of the predicted genes unveiled complex machineries involved in the metabolism of these major components of the plant cell wall and put the emphasis on newly identified enzymes. In-depth characterization of the specificity of each of these enzymes is the next challenge; it will allow the involvement of these loci in carbohydrate metabolism, and their potential industrial applications, to be understood at the molecular level.

Methods

Sampling and DNA extraction

Strain C. algeriensis TH7C1T was isolated from the hydrothermal hot spring of Guelma [28]. Genomic DNA was extracted as previously described [46] with some modifications. Briefly, cells harvested in the exponential phase were suspended in TRIS–HCl (pH 8.0), EDTA, NaCl and incubated in the presence of lysozyme at 37 °C. Sodium dodecyl sulfate was added to 1% and the incubation continued until clarification was complete. Chloroform extractions were carried out and followed by ethanol precipitation. The DNA was drawn out of solution by being wound around a glass rod.

Sequencing and functional annotation

The DNA isolated from C. algeriensis TH7C1T was used to generate Illumina shotgun paired-end sequencing libraries, which were sequenced with a MiSeq instrument and the MiSeq reagent kit version 3 (2 × 250 bp paired-end reads), as recommended by the manufacturer (Illumina, San Diego, CA, USA), at the IBISBA CSIC-CellFactory_MM platform. Quality filtering using Trimmomatic version 0.36 resulted in 473,434 paired-end reads, giving an approximate genome coverage of 30x. The sequence was assembled using the SPAdes Genome Assembler version 3.15.2. Assembled contigs were submitted to the Rapid Annotation Server (RAST) (http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/RapidAnnotationServer) [47] and the DFAST server (https://dfast.nig.ac.jp/) for prediction of protein coding sequences (CDSs). The Circular Genome Viewer (CGView server) [48] was used to construct a circular graphical map of C. algeriensis TH7C1T. Carbohydrate-active enzyme (CAZyme) searches were performed using the HMMER3.0 package (http://hmmer.org/) available from dbCAN (http://csbl.bmb.uga.edu/dbCAN/) [49]; this search is run against Pfam Hidden Markov Models (HMMs). DIAMOND, available from the dbCAN meta server, was used for fast BLAST-like searches against the CAZy database.
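Sequencing depth is commonly estimated as the total number of sequenced bases divided by the genome length. The sketch below applies that relation to the figures given here. Two points are assumptions: the count of 473,434 is treated as individual reads rather than read pairs, and the nominal 250 bp read length is used because the mean length after trimming is not reported, so the result is only an upper bound under those assumptions.

```python
def mean_coverage(n_reads: int, mean_read_length: float, genome_length: int) -> float:
    """Average depth = total sequenced bases / genome length."""
    return n_reads * mean_read_length / genome_length


def read_length_for_coverage(coverage: float, n_reads: int, genome_length: int) -> float:
    """Mean read length that would yield the given coverage."""
    return coverage * genome_length / n_reads


GENOME_LENGTH = 2_535_023  # bp, assembled genome size from this study
N_READS = 473_434          # reads after quality filtering (assumed single reads)

if __name__ == "__main__":
    # Upper bound with untrimmed 250 bp reads (assumption, see lead-in).
    print(round(mean_coverage(N_READS, 250, GENOME_LENGTH), 2))
    # Mean trimmed read length implied by a 30x depth under the same assumption.
    print(round(read_length_for_coverage(30, N_READS, GENOME_LENGTH), 1))
```

Because quality trimming shortens reads, a depth estimated from trimmed reads is expected to fall below the value obtained with the nominal read length.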

Polysaccharide utilization loci (PULs) were analyzed via the dbCAN-PUL meta server [50]. The CAZyme gene cluster (CGC) Finder available in dbCAN2 was used to annotate carbohydrate-active enzyme clusters. CGCs were defined as genomic regions containing at least one CAZyme gene, one transporter (TC) gene, and one transcription factor (TF) gene. The genome sequence has been submitted to the NCBI public genomic database under accession number PRJNA743054.

Prediction of CRISPR-Cas sequences (Clustered Regularly Interspaced Short Palindromic Repeats) in the genome was performed using the CRISPRCasFinder server [51].