Background

Clostridium thermocellum exhibits one of the highest rates of degradation of cellulosic substrates, which is facilitated by large extracellular multi-subunit enzyme systems termed cellulosomes [1–3]. It also has productivity advantages associated with thermophilic growth conditions. The bacterium has many attributes that are of interest for fundamental research. It also has the potential to be used in industrial-scale consolidated bioprocessing (CBP) (without added enzymes) of lignocellulosic biomass into ethanol for the displacement of petroleum products [4–8].

The C. thermocellum ATCC 27405 genome was originally submitted to the US Department of Energy (DOE) Joint Genome Institute (JGI; Walnut Creek, CA, USA) for sequencing by JHD Wu (University of Rochester, Rochester, NY, USA) and ME Himmel (National Renewable Energy Laboratory (NREL), Golden, CO, USA). The genome was sequenced using the Sanger method, made available in November 2003 [GenBank:CP000568], and represented the first genome sequence for this species. Repetitive sequences such as transposases and those present in cohesin domains made closing this genome challenging and the genome sequence was not finished until 2007. The C. thermocellum ATCC 27405 genes were originally predicted using two gene modeling programs, Glimmer [9] and Critica [10], as part of a JGI annotation pipeline. The gene prediction program Prodigal [11] was developed at Oak Ridge National Laboratory (ORNL; Oak Ridge, TN, USA) and incorporated into the JGI annotation pipeline after the initial ATCC 27405 genome annotation. We have found that its use has improved the gene prediction models for several bacteria [12, 13]. As a result, we applied Prodigal to the C. thermocellum genome sequence and report an update to the C. thermocellum ATCC 27405 genome annotation in this study.

Previous studies have suggested that C. thermocellum coordinates its cellulosomal subunit composition depending on the growth substrate [14, 15] and growth rates [16]. Such studies are important for designer cellulosome engineering studies, developing efficient industrial enzyme cocktails, metabolic engineering, and synthetic biology endeavors [17]. Biomass from the monocot switchgrass (Panicum virgatum) and the woody dicot black cottonwood (Populus trichocarpa) have been proposed as model bioenergy crops for the USA [18]. In order to gain insights into the C. thermocellum genes required for growth on either pretreated switchgrass or Populus we generated whole genome DNA microarray profiles for its growth on biomass for the first time. We have also developed an effective method to isolate high quality RNA from C. thermocellum during these biomass fermentations with initial solid substrate loadings of 5 g/L.

RNA sequencing (RNA-seq) has recently been used for prokaryotic transcriptome analysis [19–21]. It has several advantages over a microarray platform, such as a greater dynamic range of reads relative to the intensity of probe signal on a microarray platform. The technology allows for the identification of new transcripts and transcriptional start sites at a higher resolution than would be available on a tiling array. RNA-seq technologies and statistical approaches for transcriptome analyses are developing rapidly [22–26], and debate remains over the ideal methods for data normalization and which statistical methods are most useful to help identify biologically-relevant effects.

A comprehensive comparison of different normalization methods for Illumina data has been reported previously [22]. We tested five RNA-seq normalization strategies: trimmed mean of M component (TMM); reads per million (RPM) scaling; reads per kilobase per million (RPKM); upper quartile scaling (UQS); and a newly developed method called kernel density mean of M component (KDMM). Each method is a scaling type method whose corresponding scaling factors are calculated based on the geometric mean for KDMM, arithmetic mean for RPM, geometric mean divided by arithmetic mean for TMM, and the 75th percentile for UQS. We compared the results from these different normalization methods with microarray data derived from the same cDNA using an established expression microarray platform to offer useful suggestions for future RNA-seq studies.
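The scaling idea behind these methods can be illustrated with a minimal Python sketch. It covers only RPM, RPKM, and UQS as they are commonly defined; KDMM and TMM are left out because their exact procedures are not spelled out here. The function names, the toy count matrix, and the gene lengths are hypothetical, and the sketch takes the 75th percentile over all genes, which is an assumption (some implementations exclude zero counts first).

```python
import numpy as np

def rpm(counts):
    """Reads per million: divide each sample (column) by its total reads / 1e6."""
    counts = np.asarray(counts, dtype=float)
    return counts / (counts.sum(axis=0) / 1e6)

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase per million: RPM divided by gene length in kb."""
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1e3
    return rpm(counts) / lengths_kb

def upper_quartile_scale(counts):
    """Rescale each sample so all samples share the same 75th percentile."""
    counts = np.asarray(counts, dtype=float)
    q75 = np.percentile(counts, 75, axis=0)
    # Assumption: the common target is the mean of the per-sample 75th percentiles.
    return counts * (q75.mean() / q75)

# Hypothetical toy data: 4 genes (rows) x 2 samples (columns).
counts = np.array([[100, 200], [300, 600], [500, 1000], [100, 200]])
gene_lengths_bp = [1000, 2000, 500, 1000]

rpm_values = rpm(counts)
rpkm_values = rpkm(counts, gene_lengths_bp)
uqs_values = upper_quartile_scale(counts)
```

In this toy example the second sample is simply sequenced twice as deeply as the first, so all three scalings bring the two columns onto the same scale; real datasets differ in composition as well as depth, which is where the methods diverge.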

Results

Genome reannotation and updated microarray probe sequences

Improvements in DNA sequencing technologies, assembly, and gene prediction algorithms have facilitated continuous updates to sequenced genomes [12, 13, 27–29]. The latest annotation of the C. thermocellum ATCC 27405 genome has 3,175 candidate protein coding sequences (CDSs) predicted using Prodigal [GenBank:CP000568.2] [11]. Previously reported proteomics data was used to confirm predicted gene models [30] (see Additional file 1 for all peptides used for annotation confirmation and Additional file 2 for peptides used to update open reading frame (ORF) start sites and include new genes). Compared to the primary C. thermocellum ATCC 27405 annotation, 130 CDSs have been added or converted from pseudo genes into genes and 65 former CDSs were deleted or converted into pseudo genes (see Additional file 3 for examples of peptide hits used to update the genome annotation). Other modifications include the merging of two former genes into a single ORF and the modification of transcriptional start sites. A comparison of the annotation versions can be found at: http://genome.ornl.gov/microbial/cthe/. We have updated our microarray dataset to reflect the new gene numbers where probes originally designed to intergenic regions are now acknowledged to target a newly annotated gene (see Additional file 4 for microarray probe gene assignment update and Additional files 5 and 6 for details).

Biomass characterization

Of interest to us were any inherent compositional differences between the two biomasses. Quantitative saccharification of pretreated biomass samples revealed that there was more glucose in the Populus biomass (646 mg/g of biomass SD ± 13.6) compared to the switchgrass pretreated biomass (522.5 mg/g of biomass SD ± 9.3) and reflects the cellulose component of the two biomasses. The levels of xylose and arabinose differed between the biomasses with almost four times the amount in switchgrass (xylose: 72.5 mg/g of biomass SD ± 0.4; arabinose: 7.1 mg/g of biomass SD ± 1.0) relative to Populus (xylose: 19.4 mg/g of biomass SD ± 1.6; arabinose: 1.6 mg/g of biomass SD ± 0.2). This is a reflection of the hemicellulose compositional differences, in particular the arabinoxylan component that predominates in the cell wall of switchgrass [31].

Samples of the pretreated biomasses used as substrates for the fermentations were analyzed by inductively coupled plasma emission spectroscopy (ICP-ES) for elemental compositional differences that could influence the fermentation performance. The pretreated material was also compared to untreated biomass to identify any elemental differences associated with the pretreatment procedure. In both biomasses the pretreatment procedure appeared to introduce chromium, molybdenum, and titanium, which were significantly (P <0.001) different between pretreated and unpretreated biomass (Additional file 7).

Calcium was present in the untreated material at levels of 1,388 mg/kg and 2,868 mg/kg of Populus and switchgrass, respectively. The calcium was removed more efficiently from the Populus biomass with the amount in the pretreated biomass decreasing to 34.3 mg/kg, whereas levels remained high after pretreatment in the switchgrass biomass (1,918 mg/kg) (Additional file 7). Pretreatment efficiently reduced the levels of potassium, magnesium, manganese, phosphorus, strontium, and zinc from both biomasses. The divalent cations barium, calcium, copper, iron, manganese, nickel, strontium, and zinc as well as the phosphorus and sulfur content were higher in pretreated switchgrass compared to Populus (Additional file 7). The only significantly different element that was higher in pretreated Populus relative to switchgrass was molybdenum, which was likely introduced during the pretreatment procedure (Additional file 7).

Growth characterization on biomass

Inocula were similar at the beginning of the experiment, and cell count data taken at 12 hours and 37 hours postinoculation confirmed the fermentations were actively growing (Additional file 8). C. thermocellum doubled approximately 2.7 times (SD ± 0.8) and 4.4 times (SD ± 1.3) when grown on Populus at 12 hours and 37 hours postinoculation, respectively. Similarly, cell doubling data from switchgrass fermentations showed C. thermocellum doubled 3.6 times (SD ± 1.2) and 5.6 times (SD ± 0.90) at 12 hours and 37 hours postinoculation, respectively. These time points were chosen for analysis as they correlate with exponential and early stationary phase based on the fermentation product formation and cell counts (Additional file 8). Analysis of the fermentation medium over time revealed that C. thermocellum grown on pretreated Populus substrate had greater concentrations of the major fermentation products, ethanol and acetic acid, compared to growth on switchgrass, with approximately 1.6 times greater yields on the former substrate (Table 1). Ratios of the major fermentation products (acetic acid:ethanol) were 2.20 and 2.05 for Populus and switchgrass, respectively. Lactic acid is typically a minor fermentation product, and was present at less than 0.06 g/L in each of the fermentations. Quantitative saccharification revealed that between 58% and 64% of glucose present in the Populus biomass was utilized during the 37-hour fermentation compared to the range of approximately 43% to 49% glucose conversion that occurred during the fermentation using switchgrass as the substrate (Table 1).
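As a reading aid for the doubling figures, the relationship between cell counts and number of doublings can be sketched in Python. This assumes doublings are derived as the base-2 logarithm of the ratio of final to initial cell counts, which is the usual convention but is not stated explicitly here; the function names are hypothetical.

```python
import math

def doublings(initial_count, final_count):
    """Number of population doublings implied by two cell counts."""
    return math.log2(final_count / initial_count)

def fold_increase(n_doublings):
    """Fold increase in cell number implied by a number of doublings."""
    return 2.0 ** n_doublings

# Under this convention, 2.7 doublings is roughly a 6.5-fold increase in cells.
populus_12h_fold = fold_increase(2.7)
```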

Table 1 Major C. thermocellum fermentation products and residual biomass sugars

Normalization and transcriptome analysis

RNA-seq is an alternative technology to microarrays in transcriptome analysis. This study sought to identify changes in the transcript profile of C. thermocellum ATCC 27405 grown on the substrates of pretreated Populus and switchgrass and whether these profiles were maintained across the two gene expression analytical platforms. RNA-seq reads gave a genome depth coverage of at least 580× (Additional file 9) and gave data for 3,370 genes (98.4% of the annotated genes). Fluorescence intensity values from the microarrays gave data on 3,157 genes (92.2% of the annotated genes). Data was collected for 3,088 genes on both platforms, constituting 90% of the 3,424 predicted genes (both protein coding and non-protein coding) in the latest version of the C. thermocellum ATCC 27405 genome. Correlations of log2 transformed fluorescent intensity counts for the array or log2 transformed read counts for the RNA-seq of the biological replicates for each condition gave Pearson R correlations ranging from 0.93 to 0.97 in the array and 0.94 to 0.98 in the RNA-seq (Additional file 10). A comparison of the array intensity values and RNA-seq read counts across the two transcriptomic techniques gave Spearman correlation coefficients ranging from 0.83 to 0.88 for each of the growth and substrate comparisons (Additional file 11).
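The two kinds of correlation reported here can be sketched in Python with NumPy alone. This is an illustration rather than the analysis pipeline used in the study: the function names are hypothetical, and the pseudocount added before the log2 transform is an assumption made so that zero counts remain defined.

```python
import numpy as np

def log2_counts(counts, pseudocount=1.0):
    """log2 transform; the pseudocount is an assumption to handle zero counts."""
    return np.log2(np.asarray(counts, dtype=float) + pseudocount)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranked values."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

Pearson correlation on log2 values suits replicate-to-replicate comparisons within one platform, while the rank-based Spearman coefficient is the more natural choice across platforms, where intensities and read counts are on different scales.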

While microarray data normalization strategies are well established, an ideal method for RNA-seq normalization has yet to be defined. A comprehensive comparison of different normalization methods for Illumina data has been reported previously [22]. In this study, we tested five RNA-seq normalization strategies, KDMM, TMM, RPM, RPKM, and UQS, and compared the results of differential gene expression to microarray data obtained from the same cDNA (Additional file 12). We found normalization had significant effects on the distribution of the read counts (Additional file 12). Expression profiles from the UQS and KDMM normalization schemes were almost indistinguishable and replicates had similar RNA-seq distributions (Additional file 12). The TMM normalization method appeared to introduce greater variation into this RNA-seq dataset compared to the pre-normalized data (Additional file 12). Both RPM and RPKM shifted the distribution of reads markedly, which influenced the final results by dramatically reducing their overall expression values (Figure 1, Additional file 12). The other three strategies had less of an effect in terms of shifting the overall distributions (Figure 1, Additional file 12).

Figure 1

Two-way clustering of normalized RNA-seq and microarray log2 transformed read counts or probe fluorescent intensities for all genes, respectively, for C. thermocellum grown on switchgrass or Populus 12 hours and 37 hours postinoculation. RNA-seq read counts normalized by RPM, RPKM, KDMM, UQS, or TMM in the JMP Genomics 6 software suite were plotted with the microarray probe fluorescent intensities normalized by the LOESS method. Genes were clustered into a default of ten clusters based on similarity of expression patterns across all transcriptomic platforms and normalization techniques. KDMM, kernel density mean of M component; RNA-seq, RNA sequencing; RPKM, reads per kilobase per million; RPM, reads per million; TMM, trimmed mean of M component; UQS, upper quartile scaling.

Normalized intensity values were used to identify highly expressed genes (Additional file 13). A subset of cellulosomal and cellulose utilization-related genes with a range of expression levels from the array and RNA-seq data normalized with the KDMM strategy are given in Table 2. Featured in this list are the glycoside hydrolase Cel48S (Cthe_2089) and the scaffoldin CipA (Cthe_3077), which are known to be abundant proteins in the cellulosome [32]. A gene (Cthe_0271) was highly expressed on both biomasses and is predicted to encode a protein with a putative function as a type 3A cellulose-binding protein. Cthe_0271 was identified in a recent study as the most highly expressed gene when C. thermocellum was grown on both cellulose and cellobiose, indicating that the data generated in this study is consistent with published reports of C. thermocellum grown on various substrates [14, 16]. Also highly expressed on both biomass substrates and at both time points was a transport system (Cthe_0391-0393), recently identified as specific for cellotriose transport [33]. A non-cellulosomal highly expressed gene was Cthe_3383, which has a putative AgrD function (Additional file 13). This gene was a new addition to the C. thermocellum ATCC 27405 genome annotation and reflects the necessity of updating genomes as annotation algorithms improve and knowledge expands. We also compared mapped reads to bioinformatic predictions for small RNAs and in several cases found experimental data supported one model over another (Additional file 14). We expect these data will be useful to refine future sRNA models.

Table 2 Subset of relative expression values for cellulosome-related genes

Altered gene regulation and validation of expression differences

A summary of genes that passed the significance threshold of a false discovery rate (FDR) of <0.05 in one of the comparisons is shown in Table 3. A complete list of altered gene regulation in each of the conditions is given in Additional file 15. We found that 2,351 genes were considered significantly different by microarray based on a threshold of an FDR of <0.05 in any one of the four growth or substrate comparisons. A 2-fold filter for differential gene expression narrows the differences between the technologies in terms of the numbers of genes identified as significantly differentially expressed (Table 3, Additional file 15). TMM normalization performed poorly based on statistical testing of the RNA-seq data, with only ten genes considered significantly differentially expressed and only five of these overlapping with the array. This is likely due in part to the greater variation seen post-normalization compared to the pre-normalized data (Additional file 12).
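The combination of an FDR threshold with a 2-fold filter can be sketched as follows. The FDR procedure used in the study is not named here, so the Benjamini-Hochberg adjustment below is an assumption chosen because it is widely used; function names and the example values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def significant_genes(p_values, log2_fold_changes, fdr=0.05, min_abs_log2_fc=1.0):
    """Boolean mask of genes with adjusted p < fdr and at least a 2-fold change."""
    q = benjamini_hochberg(p_values)
    lfc = np.asarray(log2_fold_changes, dtype=float)
    return (q < fdr) & (np.abs(lfc) >= min_abs_log2_fc)

# Hypothetical example: four genes with raw p-values and log2 fold changes.
mask = significant_genes([0.01, 0.04, 0.03, 0.005], [2.0, 0.5, -1.5, -0.2])
```

A threshold of 1 on the log2 scale corresponds to the 2-fold filter, applied symmetrically to up- and downregulated genes.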

Table 3 Summary of genes passing significance and 2-fold differential expression thresholds

RNA-seq data normalized by RPM, UQS, or KDMM identified 117, 104, and 192 significantly differentially expressed genes, respectively. Of the significantly differentially expressed genes from the RPM method, 50 were in common with the array; however, genes in the array that had the greatest expression differences were not detected in the RPM normalized data (Figure 2). Of the 104 genes identified with UQS normalization, 41 were in common with the array. RNA-seq data normalized with the KDMM strategy had the highest number of genes (73) in common with the previously validated array [34] (Table 4). Six genes exhibiting a broad expression range from samples harvested 12 hours postinoculation were selected for confirmation by RT-qPCR. Expression data from the array or KDMM normalized RNA-seq data compared to RT-qPCR data had correlation coefficient values of R² = 0.92 and 0.95, respectively (Additional file 16), thus confirming that the data from both analytical platforms were of high quality.
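Counting genes in common between platforms amounts to a set intersection, as in the minimal Python sketch below. The function name and the placeholder gene identifiers are hypothetical and do not correspond to genes from this study.

```python
def overlap_summary(array_genes, rnaseq_genes):
    """Split two lists of significant genes into shared and platform-specific sets."""
    array_set, rnaseq_set = set(array_genes), set(rnaseq_genes)
    return {
        "shared": sorted(array_set & rnaseq_set),
        "array_only": sorted(array_set - rnaseq_set),
        "rnaseq_only": sorted(rnaseq_set - array_set),
    }

# Placeholder identifiers for illustration only.
summary = overlap_summary(
    ["gene_a", "gene_b", "gene_c"],
    ["gene_b", "gene_c", "gene_d"],
)
```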

Figure 2

Venn diagram of genes identified as significantly (FDR <0.05) differentially expressed (± 1, log2 scale) by microarray and RNA-seq data normalized by KDMM or RPM. Those genes common between the KDMM and microarray strategies but not present in the RPM analytical strategy are outlined in the accompanying table. FDR, false discovery rate; KDMM, kernel density mean of M component; RNA-seq, RNA sequencing; RPM, reads per million.

Table 4 Seventy-three genes significantly (FDR <0.05) and differentially expressed (± 1, log2 scale) that were in common between RNA-seq normalized by the KDMM strategy and microarray data

Growth stage-specific changes in gene expression

C. thermocellum expression profiles can vary based on growth rate [16]. No genes showed consistent patterns of regulation at 12 hours relative to 37 hours postinoculation on both substrates using stringent criteria, which may reflect relative differences in growth (Additional file 8). By 37 hours, eight genes were consistently expressed at levels 2-fold or greater than at the earlier sampling time point, irrespective of the substrate. These eight genes included those encoding proteins related to spore formation (Cthe_0964 (also lysine biosynthesis), Cthe_1084, and Cthe_1759), a glycosyltransferase (Cthe_1085), and genes involved in nucleotide and amino sugar metabolism (Cthe_2642 and Cthe_2644) (Table 4). Other genes affected in the growth stage comparison include an anti-sigma factor (Cthe_1437) and a putative ABC transporter subunit (Cthe_2573). These genes potentially contribute to the transition of the cells from log to stationary phase.

Substrate-specific gene expression

Comparison of differentially expressed genes permitted the identification of genes that were only affected on one of the biomass substrates. Six genes were upregulated during growth on Populus relative to switchgrass 12 hours after inoculation, with the patterns of expression consistent across the two analytical platforms. These genes met the FDR <0.05 and ≥2-fold difference in gene expression requirements, and included genes encoding glycoside hydrolase and CenC carbohydrate-binding proteins (Cthe_1256 and Cthe_1257) (Table 4). Two genes in a genomic locus, encoding a predicted Radical SAM domain protein and an AgrB protein (Cthe_1309 and Cthe_1310), were upregulated on Populus at 12 hours relative to switchgrass. Interestingly, these two genes are upstream of a new addition to the C. thermocellum genome with predicted AgrD functions (Cthe_3348), suggesting signaling or bacteriocin-like production specific to the substrate. Gene Cthe_2531 is predicted to be involved in sulfate transport and was upregulated when C. thermocellum was grown on Populus. Three other genes from this cluster were also upregulated but did not pass the significance threshold in the RNA-seq analysis. Conversely, on switchgrass, three genes related to phosphate transport (Cthe_1603, Cthe_1604, and Cthe_1605) were upregulated. These genes are part of a putative high affinity phosphate transport system we have identified only in strain ATCC 27405, and this system is distinct from the common Na/Pi symporters found in all C. thermocellum strains examined to date. One Na/Pi symporter (Cthe_0064) in C. thermocellum ATCC 27405 was among the top 5% most highly expressed genes on both biomasses (Additional file 9).

Two genes (Cthe_1480 and Cthe_1481) with hypothetical function annotations were upregulated on switchgrass and met the significance criteria. The expression patterns of these genes were maintained in the comparison at 37 hours postinoculation. They have a general function prediction as members of the RND family of exporters and are well conserved in bacteria. Interestingly, none of these genes were identified in a study of C. thermocellum ATCC 27405 grown on pure cellulose or pure cellobiose [16], suggesting that the regulation of these genes was specific to the lignocellulosic biomasses used in the current study.

Differential expression of cellulosome genes and central carbon metabolism

Consistent expression patterns for cellulosomal-related genes identified in both the RNA-seq (KDMM) and array included two known cellulosome genes. Cthe_0624 (CelJ), encoding a glycoside hydrolase family 9 enzyme with predicted endoglucanase functions, was upregulated in the early growth stage on switchgrass relative to the later growth stage, while no differences were identified on Populus. This protein was reported as highly abundant in a proteome study of C. thermocellum grown on cellobiose, cellulose, and switchgrass [14]. Cthe_1890, encoding a protein with a type 1 dockerin domain, had higher expression in the later growth stage on switchgrass relative to the 12-hour sample. A gene (Cthe_1256), predicted to encode a glycoside hydrolase family 3 enzyme that converts a variety of glucans into β-D glucose, was upregulated on Populus relative to switchgrass at 12 hours postinoculation.

Discussion

An accurate and complete representation of an organism’s genome sequence and its functional annotation is requisite for systems biology studies and genome-scale engineering for synthetic biology [35]. New technologies (for example DNA sequencing [26]), algorithms (for example Prodigal [11]), and biological features (for example sRNA [36]) expand our knowledge of genomes. However, the majority of genome sequences and annotations are rarely updated. Re-annotation has been suggested as an essential component for assaying and understanding systems biology data [37] and wiki-based solutions have been recommended to facilitate genome updates [38]. In this study, we used the gene prediction program Prodigal to update the C. thermocellum ATCC 27405 gene models. The methodology, accuracy, and specificity improvements incorporated into Prodigal have been described [11]. RNA-seq analysis and proteomic analysis performed using two-dimensional liquid chromatography (LC)-tandem mass spectrometry (MS/MS) offer the possibility of searching continuously updated genome databases with previously obtained information. This is an important advantage since it is likely that further improvements will be made to C. thermocellum gene models and annotations in the future.

We were able to develop a protocol to obtain high quality RNA from C. thermocellum grown on biomass for the first time and to enrich mRNA by subtractive hybridization so that greater than 99.6% of the reads did not map to the 5S, 16S, and 23S rRNA gene sequences. This protocol development opens up new possibilities for future RNA-seq studies of industrially-relevant biomass fermentations. In our transition to a transcriptomic analytical platform based on RNA-seq we sought to compare and contrast the relatively new technology of RNA-seq to an established custom designed microarray. The cross-platform comparisons described here are among the best that we are aware of, with Spearman correlation coefficients ranging from 0.83 to 0.88 (Additional file 11).

Normalization strategies remove experimental noise from transcriptomic datasets prior to analyses used to determine biological differences in samples of interest. In microarray analyses, known biases include variation in dye incorporation rates and hybridization of material to the platform [39]. In RNA-seq analyses, distinct biases relate to the depth of sequencing, the length and GC content of genes, and mapping approach [39–42]. We found that normalization of the RNA-seq data had dramatic effects on the final results of our data (Figure 1, Additional file 12). KDMM and UQS gave similar distribution and clustering profiles. The KDMM normalization method was the preferred regime in this study as it provided more results in common with the array data. The KDMM method uses a scaling factor based on the geometric mean of the mapped reads and the UQS method scales read count distributions so that the 75th percentiles are consistent after normalization [39]. Both TMM and RPM performed poorly with our dataset. TMM gave the fewest genes (10) identified in the analysis of variance (ANOVA) as significantly differentially expressed, which was likely due to greater variation post-normalization (Additional file 12). TMM is a conservative normalization method that performs well where datasets have a consistent number of mapped reads across samples [22]. The number of reads that mapped uniquely for given samples differed as much as approximately 2-fold between the largest and smallest totals (Additional file 7). The C. thermocellum sample that was run with the PhiX sequencing control had the fewest reads that mapped to the genome, and inconsistencies in the number of mapped reads are likely to explain why the other methods performed better than TMM in this instance. Although the RPKM method is widely used, there are reports that it can bias estimates of differential expression [40, 43]. In this study, many genes that were identified as having the largest expression differences in the array and KDMM normalized RNA-seq data, such as phosphate and sulfate transport genes, were not identified in significance testing using data normalized by the RPM (Figure 2) or similar RPKM method (Additional file 15).

A number of studies have investigated RNA-seq, mapping methods, technical variability and reproducibility, normalization, and statistical testing methods. However, the field of RNA-seq is still relatively new and rapidly evolving. Differential expression measurements cannot be estimated with any confidence if a single biological replicate is employed. We employed two biological replicate fermentations on each biomass with samples taken at two time points, 12 hours and 37 hours postinoculation, but we expect that as sequencing costs continue to decrease, more biological replicates will be used to increase statistical power. This will allow for greater confidence in RNA-seq differential expression estimates. We used the NimbleGen call files for the microarray data, which uses outlier detection and then summarizes unique probe intensity values into one value for three technical array replicates for each biological replicate. We also employed the Kenward-Roger method to estimate the degrees of freedom in the mixed model analyses of the array data. The array analysis had considerably more statistical power (six expression estimates per gene per condition) compared to the RNA-seq dataset (two expression estimates per gene per condition). Our array data and RNA-seq data generally agreed, although different genes were categorized as significant or did not meet criteria for certain comparisons (Table 3, Additional file 15). We have made the datasets available so that others may compare and contrast different methods and analyses.

The yields of the major fermentation products were approximately 1.4-fold higher after 37 hours on Populus compared to switchgrass with normalization to the original biomass loading. The results of this study suggest more favorable growth of C. thermocellum when pretreated Populus was the substrate. Hemicelluloses present in these two lignocellulosic substrates differ, with glucuronoxylan in hardwoods such as Populus while grasses have predominantly arabinoxylans [44, 45]. The dilute acid pretreatment of each of the biomass substrates should solubilize the majority of hemicelluloses from the biomass, which are then removed by numerous wash steps. It is likely, however, that residual material is left, as well as remaining quantities of inhibiting compounds derived from the pretreatment and breakdown of the hemicelluloses. Examples of inhibitor byproducts from pretreatment include vanillin, hydroxymethylfurfural (HMF), furfural, and syringic acid [46]. Lignin remains after pretreatment and can influence the accessibility of C. thermocellum to cellulose in the biomass substrate. The degree of cellulose polymerization after pretreatment may be another factor that differs between the two biomasses that could influence the fermentation performance [47, 48]. ICP-ES analysis also revealed differences in calcium removal efficiency (Additional file 7), with the majority of calcium removed during pretreatment of Populus while two-thirds remained after pretreatment of switchgrass. The data suggests that under the pretreatment and process conditions used in this study the dilute acid pretreated Populus was a more accessible substrate for C. thermocellum fermentation compared to the pretreated switchgrass. Alternatively, the species biomass may have differed in the proportion of bound versus free calcium. Nonetheless, different pretreatment strategies and process conditions will be required for optimal conversion of different biomass feedstocks into different biofuels [49].

From both the microarray and the RNA-seq data we could identify C. thermocellum genes that were highly expressed when grown on these two complex biomass substrates. The cellotriose transport system (Cthe_0391-0393) was among genes that were highly expressed on both substrates. Dextrins of length 3 to 5 are the preferred substrate of C. thermocellum [50], and this particular transporter is one of five involved in carbohydrate transport and the only one with a specificity for cellotriose [33]. Three other systems transport glucans ranging from one to five glucose subunits with variable substrate affinities and the last is specific for laminaribiose [33]. High-level expression of the cellotriose transport system on Populus and switchgrass suggests the majority of the cellulose in these biomasses is processed by the C. thermocellum cellulosome into cellotriose. Other highly expressed genes included cellulosomal genes such as CipA (primary non-catalytic scaffoldin unit) and CelS (exoglucanase) (Table 2), which is in agreement with earlier data [14]. Identifying highly expressed genes on various substrates is useful for strain engineering as it can expand the repertoire of available promoter sequences to facilitate enhanced cellulosic conversion.

More than 70 dockerin-containing proteins and potential cellulosome-related subunits have been identified in the C. thermocellum ATCC 27405 genome [14]. Of interest in the current study were those genes that encode enzymes or proteins with functions related to cellulosomal degradation of biomass and that were differentially regulated when C. thermocellum was grown on switchgrass compared to Populus (Additional file 15). For example, the genomic locus Cthe_1256-1257 that encodes a glycoside hydrolase and a carbohydrate-binding protein exhibited higher expression on Populus at 12 hours compared to switchgrass (Table 4). Cthe_1257 may encode a protein with potential for cellulose binding, while Cthe_1256 lacks a signal peptide and is predicted to function as a β-glucosidase cleaving imported dextrins to yield β-D glucose. These gene expression differences indicate a degree of specificity of the C. thermocellum response to different substrate availability while growing on the two biomasses. A glycoside hydrolase (Cthe_0624) was upregulated at 12 hours on switchgrass compared to 37 hours on switchgrass with no differences identified on Populus. The glycoside hydrolase (Cthe_0624) amino acid sequence includes a signal peptide and has xylan and lichenan hydrolase activities as well as activity against crystalline cellulose [51].

Cellulosomes are naturally shed at the end of C. thermocellum growth, which was exploited by an affinity purification method and proteomics approach to show C. thermocellum cellulosomal compositional changes occurred in response to different carbon sources [14]. One surprising aspect of the current study was that larger differences in cellulosomal genes were not observed at the level of transcription for the two biomasses, which may be a reflection of the pretreatment procedure efficiently homogenizing the carbohydrate components of the two biomasses. Although C. thermocellum cannot use xylose, we observed cellulosomal xylanases (Cthe_1398, Cthe_1838, Cthe_1963, Cthe_2590, and Cthe_2972) were among the most highly expressed genes (top 10%) suggesting this activity is important to access its preferred substrates. Xylanases showed little to no differential expression under the conditions assayed in this study despite bulk differences in xylose content of the two biomass substrates. An earlier study also reported highly expressed xylanase proteins on switchgrass [14] but high-level expression was not found for chemostat growth on purified cellulose [16], which shows the value in exploring a range of substrates and including those of industrial relevance. It is worth noting that the growth conditions, ‘omic’ level, and detection technologies were quite different between the current transcriptomic and earlier proteomic studies. Further systematic, integrated omic studies will be required to reveal more of this organism’s complex regulatory control mechanisms.

A putative Pst high-affinity phosphate transport system was expressed at a higher level on switchgrass compared to Populus 12 hours postinoculation, while one member of a sulfate transport system was upregulated on Populus. Other members of the sulfate transport system were highly differentially expressed in both the RNA-seq and array; however, they did not pass the significance threshold for the RNA-seq. Differences in phosphorus and sulfur contents for pretreated biomasses were observed (Additional file 7); however, the defined medium (MTC) used to suspend each biomass substrate was identical and replete for phosphate and sulfate for pure cellulose fermentations. Phosphate and sulfate uptake genes were not upregulated during growth on pure cellulose or cellobiose [16]. The corresponding binding proteins for ABC transporters often have high degrees of specificity that can distinguish the phosphate and sulfate oxyanions despite their similarities [52], although there is little data on these systems for C. thermocellum. Phosphate is required for C. thermocellum carbohydrate breakdown as the bacteria favor transport of cellodextrins over monomeric sugars. Cellodextrins enter C. thermocellum cells via ATP-dependent ABC transport systems and, once inside, a phosphate anion acts as a nucleophile for phosphorolytic cleavage [53, 54]. Multiple uncharacterized phosphate transport systems exist in the ATCC 27405 genome including two putative Na+/Pi co-transporters (Cthe_0064 and Cthe_2810), a putative Pit transporter (Cthe_3000), as well as the Pst system differentially expressed between the two biomass substrates. The Pst transporter is typically only induced under conditions of phosphate starvation [55–58], which would indicate that cells in the switchgrass fermentations were limited in phosphate despite sufficient phosphate being provided in the MTC medium for growth of this organism on pure cellulose or cellobiose.

We observed a greater amount of divalent cations in the switchgrass compared to Populus, but at levels relatively insignificant compared to those provided in the MTC medium. Differences in medium ion composition may have influenced chemical speciation through formation of compounds such as insoluble metallophosphates, or disruption of ion exchange. Alternatively, one or more compounds generated during the switchgrass fermentation may have interfered with the C. thermocellum Na+/Pi symporter, leading to upregulation of the energetically more expensive high-affinity phosphate transport system. We observed approximately twice as much molybdenum in pretreated Populus versus switchgrass (Additional file 7) and factors such as this may have interfered with sulfate uptake and/or iron-sulfur proteins involved in metabolism. Differences in the expression of C. thermocellum anion transporters (phosphate and sulfate) may indicate part of a coordinated system for osmoadaptation and/or pH stasis, with variation in the ash composition of the two biomasses influencing the osmotic balance of the cell [59, 60]. Further studies are required to investigate the physiological status of C. thermocellum during industrially-relevant fermentations.

Much higher expression from gene locus Cthe_1479-1481 occurred on switchgrass relative to Populus at both sampling time points. These genes are well conserved in bacteria and are currently annotated as members of the RND exporter family. This type of transport system is typically associated with Gram-negative bacteria, where it acts to remove toxic compounds from the cell [61]. Inhibitory compounds are generated from the pretreatment processing of biomass substrates [47], and despite extensive washing of the pretreated biomass, residual compounds are likely to remain in low quantities. Thus, it is conceivable that a toxic compound liberated solely from switchgrass is removed from the cell via this efflux system, and this could be a possible target for strain development. A recent study identified arabitol, a putative fermentation inhibitor, as liberated during C. thermocellum fermentation on switchgrass [47]. We also observed greater expression in genes related to urea uptake and metabolism at 37 hours compared to 12 hours on Populus (switchgrass failed to meet one or both of the threshold criteria), which coincided with increases in ethanol concentrations. A previous study showed that the largest response of C. thermocellum to ethanol shock treatment was in genes and proteins related to nitrogen uptake and metabolism [34].

Three spore-related genes were upregulated at 37 hours compared to 12 hours on both biomasses, which indicated that cells were priming for transition to stationary phase. C. thermocellum ATCC 27405 is inefficient at sporulation, converting between 0 and 7% of resting cells into spores after stressor application [62]. An agr-dependent quorum sensing mechanism for Clostridium acetobutylicum sporulation regulation and granulose formation has been recently described [63]. However, early signal sensing and transduction mechanisms for sporulation in Clostridia are not as well defined as for Bacillus subtilis [64]. Cthe_3383, among the most highly expressed C. thermocellum genes during growth on biomass substrates (Additional files 14 and 15), is a newly predicted gene that encodes a small (40 aa) putative hypothetical protein (putative autoinducer prepeptide), and is adjacent to genes annotated as having roles in sporulation. At a separate genomic locus we observed differential gene expression for two genes on the different biomass substrates (Cthe_1309 and Cthe_1310) (Additional file 15), with higher expression occurring during fermentation on Populus at 12 hours postinoculation. The latter gene is predicted to encode an accessory gene regulator B. Interestingly, a new addition to the genome, Cthe_3348, is directly downstream of Cthe_1310 and is predicted to encode a 54 amino acid AgrD-like peptide. The agrD gene was highly expressed but, unlike the two upstream genes, was not considered differentially expressed. The role, if any, that Cthe_3383 and Cthe_3348 play in signaling and the C. thermocellum sporulation regulatory cascade remains to be elucidated (for alignment see Additional file 14).

Conclusions

The results suggest a high degree of concordance in differential gene expression measurements between the three transcriptomic platforms. We observed few transcriptomic differences for C. thermocellum cellulosome-related genes for cells fermenting either dilute acid pretreated Populus or switchgrass, which may indicate that under this pretreatment regime they sense and respond to similar carbohydrate profiles during active growth. We observed differential expression of sulfate- and phosphate-related genes, which may point to aspects of metabolism that merit further consideration during industrially relevant fermentations. We have identified new and highly expressed genes and our update to the ATCC 27405 genome will be useful for follow-on studies.

Microarrays and RNA-seq each have respective biases that can interfere with differential expression determinations, and in this study RNA-seq normalization methods dramatically affected downstream analyses. RNA-seq offers important advantages for transcriptomic profiling and it will inevitably replace microarrays as the preferred method. However, DNA microarray testing and analysis has evolved over many years through studies such as the MicroArray Quality Control (MAQC) project [65, 66], and further studies and cost reductions in sequencing are similarly required to develop RNA-seq analyses.

Methods

Genome reannotation

A gene modeling program termed Prodigal [11] was applied to the C. thermocellum ATCC 27405 genome sequence, followed by a round of manual curation in combination with proteomics data analysis [30] to ensure no peptide evidence existed for any deleted genes (data derived from Yang et al. [30] and reported in Additional files 1, 2, and 3). A six-frame translation generated predicted ORFs and a search of available peptide data against these ORFs resulted in three groups: 1) peptides that fall within an existing gene call; 2) those that have one end within an existing gene call and the other outside, which were used to correct the start and end coordinates for a gene; and 3) those that were not within an existing gene and were used to add a new gene. In addition, the following criteria were assessed: whether the peptide hit is unique or matches several places in the genome; the number of times the peptide was detected; peptide BLAST percent identity and length of match; transcription level via RNA-seq data from this study at the start of a gene/ORF, 100 bp upstream, and average coverage; Prodigal score for coding potential; start codon used; Prodigal score for ribosome binding site (RBS); manually checked RBS; and similar sequences and their start sites, identified by blasting the ORF against the National Center for Biotechnology Information (NCBI) non-redundant database. Predicted genes were annotated using an automated annotation pipeline, as described previously [13]. The current annotation and a comparison to the earlier versions can be found at http://genome.ornl.gov/microbial/cthe/.
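
The three peptide groups described above can be illustrated with a minimal sketch. It assumes that each peptide has already been mapped to genomic coordinates; the names Interval and classify_peptide and the labels "within", "partial", and "novel" are hypothetical and are not part of the pipeline used in this study. For brevity the sketch compares strand only and ignores reading frame.

```python
from typing import NamedTuple, Sequence


class Interval(NamedTuple):
    """A feature on the genome (hypothetical helper type)."""
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def classify_peptide(peptide: Interval, genes: Sequence[Interval]) -> str:
    """Assign a mapped peptide to one of the three groups.

    "within"  : peptide lies entirely inside an existing gene call
    "partial" : peptide overlaps a gene call but extends beyond one end
    "novel"   : peptide does not overlap any gene call on its strand
    """
    overlaps_a_gene = False
    for gene in genes:
        if gene.strand != peptide.strand:
            continue
        # Full containment takes priority over partial overlap.
        if gene.start <= peptide.start and peptide.end <= gene.end:
            return "within"
        if peptide.start <= gene.end and gene.start <= peptide.end:
            overlaps_a_gene = True
    return "partial" if overlaps_a_gene else "novel"


if __name__ == "__main__":
    # Illustrative coordinates only; these are not real Cthe gene calls.
    gene_calls = [Interval(100, 400, "+"), Interval(600, 900, "-")]
    for pep in (Interval(150, 180, "+"), Interval(380, 430, "+"),
                Interval(450, 500, "+")):
        print(pep, classify_peptide(pep, gene_calls))
```

In such a scheme, "partial" peptides would flag candidate corrections to start or end coordinates and "novel" peptides would flag candidate new genes, each still subject to the manual criteria listed above.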

Pretreatment

The biomass substrates used in the fermentations were dilute acid pretreated switchgrass (Panicum virgatum cultivar Alamo; SWG) and dilute acid pretreated Populus (Populus trichocarpa x Populus deltoides F1 hybrid; POP). The biomasses were milled to -20/+80 mesh size and pretreated with dilute sulfuric acid at 0.050 g/g of dry biomass at 190°C for 1 minute residence time (flow-through mode) and 25% (w/w) total solids using a Sunds reactor at the NREL [14, 67]. The pretreated biomasses were washed with Milli-Q H2O (Millipore, Billerica, MA, USA) until less than 0.1 g/L glucose could be detected in the wash eluent, and dried prior to fermentations [47].

Compositional analysis of biomass

Trace elements were determined by ICP-ES. The samples for ICP-ES were prepared using a method based on the United States Environmental Protection Agency (USEPA) SW-846 Method 3050A. Pretreated and unpretreated biomass samples were oven dried and a 2 g sample digested by sequentially heating in nitric acid, hydrogen peroxide, and hydrochloric acid. The samples were filtered through Whatman 41 filter paper (Whatman, Maidstone, UK) and the volume made up to 50 mL with deionized (DI) water. Aliquots (5 mL) were subjected to ICP-ES analysis in an Optima 3000 DV ICP Emission Spectrometer (PerkinElmer, Waltham, MA, USA) with yttrium used as an internal standard [68].

Fermentations

Overnight inoculum cultures of C. thermocellum ATCC 27405 were grown anaerobically in 50 mL serum bottles containing 5 g/L Avicel in MTC medium [69]. Five 40 mL aliquots from these cultures were used to inoculate the 5-L Twin BIOSTAT B plus fermenters (Sartorius Stedim Biotech, Göttingen, Germany) (total volume 2 L) for a final inoculum of 10%. Two replicate fermentations were performed for each biomass. The dry weight basis of the loading of the biomass in each fermenter was 5 g/L in MTC medium. The fermenters were run at 58°C, 250 rpm, and pH-controlled at 7.0 with 3 N NaOH. Time = 0 samples were taken immediately postinoculation of the fermenter vessels. At 12 hours and 37 hours postinoculation, 50 mL samples were removed for transcriptomic analyses.
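
As a quick consistency check of the volumes given above, the inoculum percentage and the dry biomass required per vessel can be computed as follows. This is a sketch only: the function names are hypothetical, the 2 L total volume is assumed to include the inoculum, and the biomass mass per fermenter is derived here rather than stated in the text.

```python
def inoculum_percent(n_aliquots: int, aliquot_ml: float, total_ml: float) -> float:
    """Inoculum volume as a percentage of the total culture volume."""
    return 100.0 * n_aliquots * aliquot_ml / total_ml


def biomass_grams(loading_g_per_l: float, volume_l: float) -> float:
    """Dry biomass mass implied by a loading (g/L) and a working volume (L)."""
    return loading_g_per_l * volume_l


if __name__ == "__main__":
    # Five 40 mL aliquots into a 2 L (2000 mL) total volume.
    print(inoculum_percent(5, 40.0, 2000.0))
    # 5 g/L dry-weight loading in a 2 L working volume.
    print(biomass_grams(5.0, 2.0))
```

Under these assumptions, five 40 mL aliquots (200 mL) in 2 L correspond to the stated 10% inoculum, and a 5 g/L loading corresponds to 10 g of dry biomass per fermenter.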

Samples were removed periodically from the fermenter vessel to determine cell counts and monitor fermentation product formation and residual carbohydrates (Additional file 8). Samples for cell counts were diluted with Milli-Q H2O when necessary and a 10 μL aliquot was loaded onto a hemocytometer counting chamber for counting. Cell counts were performed in triplicate for each fermenter at a given time point.
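
A minimal sketch of how replicate chamber counts might be converted to a cell density is given below. The chamber geometry is not specified in this study, so the default conversion factor (10,000 squares per mL, that is, a counting square holding 0.1 μL) is an assumption that must be adjusted to the chamber actually used; the function name cells_per_ml is hypothetical.

```python
def cells_per_ml(counts, dilution_factor=1.0, squares_per_ml=10_000):
    """Estimate cell density from replicate counts per counting square.

    counts          : cells counted per square in each replicate
    dilution_factor : fold dilution applied before loading (1.0 = undiluted)
    squares_per_ml  : ASSUMPTION, 10,000 corresponds to a square that holds
                      0.1 uL (1 mm x 1 mm x 0.1 mm deep); change as needed
    """
    if not counts:
        raise ValueError("at least one count is required")
    mean_count = sum(counts) / len(counts)
    return mean_count * dilution_factor * squares_per_ml


if __name__ == "__main__":
    # Illustrative triplicate counts from a 10-fold diluted sample.
    print(cells_per_ml([52, 48, 50], dilution_factor=10))
```

Averaging before scaling mirrors the triplicate counts taken for each fermenter at a given time point.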

Fermentation residues were analyzed for carbohydrate composition using quantitative saccharification assay ASTM E 1758–01 (ASTM 2003), NREL/TP 510–42618, and HPLC method NREL/TP 510–42623. Cell-free samples from the fermenters were analyzed for metabolites (acetic acid, lactic acid, and ethanol) and residual carbohydrates (cellobiose, glucose, xylose, and arabinose) using a LaChrom Elite HPLC System (Hitachi High Technologies America, Pleasanton, CA, USA) equipped with a refractive index detector (model L-2490), as previously described [47].

RNA isolation

Cells pelleted from an 8 mL sample drawn from each fermenter were resuspended in 1.5 mL of TRIzol (Invitrogen, Carlsbad, CA, USA) and used for cell lysis by bead beating with 0.8 g of 0.1 mm glass beads (BioSpec Products, Bartlesville, OK, USA) with 3 × 20 seconds bead beating treatments at 6,500 rpm in a Precellys 24 high-throughput tissue homogenizer (Bertin Technologies, Montigny-le-Bretonneux, France). The RNA from each cell lysate was purified, DNaseI-treated, and quantity and quality assessed, as previously described [34]. Purified RNA of high quality (RIN >8) was pooled from the same fermentation samples and depleted of rRNA using Ribo-Zero rRNA Removal Kit for Gram-positive bacteria (Epicentre, Madison, WI, USA). The sample was then concentrated with RNA Clean & Concentrate-5 (Zymo Research, Irvine, CA, USA) following the manufacturer’s protocol.

Library preparation

Depleted RNA was used as the starting material for the Epicentre ScriptSeq mRNA-Seq Library Preparation Kit (Illumina-compatible) utilizing the FailSafe PCR Enzyme Mix (Epicentre) and following the manufacturer’s protocol. cDNA tagged with standard adaptors was eluted with 20 μL of Buffer EB provided in the MinElute PCR Purification Kit (Qiagen, Venlo, Netherlands) according to the ScriptSeq protocol. Cycles were increased to 14 during amplification and samples were purified using the MinElute PCR Purification Kit and eluted with 20 μL of Buffer EB. The final mRNA-seq library was quantified with a Qubit fluorometer (Invitrogen) and library quality was assessed with Bioanalyzer High Sensitivity DNA Chip (Agilent, Santa Clara, CA, USA).

Samples were diluted to 2 nM, denatured, and further diluted to 6 pM. These were run on a cBot (Illumina, San Diego, CA, USA) (SR_Amp_Lin_Block_Hyb_V7) overnight to cluster on a version 1.5 flow cell. The mRNA-seq libraries were analyzed on a HiSeq 2000 (Illumina) platform with a SR50 sequencing kit for a single read of 51 cycles. The lane containing the F188 12-hour Populus sample included a phiX DNA control.

RNA-seq analysis

Raw reads were mapped to the genome [GenBank:CP000568.1] using CLC Genomics Workbench version 5.5.1 (CLC bio, Aarhus, Denmark) with the default settings for prokaryote genomes. Uniquely mapped reads were log2 transformed on importation into JMP Genomics version 6 (SAS Institute, Cary, NC, USA). Data were normalized using default settings for each of the four normalization strategies (see Additional file 12 for pre- and post-normalization distribution curves) and any genes with no read counts were removed prior to ANOVA analysis. Filtering was applied to identify differentially expressed genes as those with an FDR <0.05 and an absolute log2 fold change greater than 1. Raw RNA-seq data have been deposited in the NCBI Sequence Read Archive (SRA) [SRA:060947] and we have made mapped reads and data available through the BioEnergy Science Center (BESC) KnowledgeBase http://bobcat.ornl.gov/besc/index.jsp [70]. Samples in the SRA series [SRA:060947] are labeled accordingly with the accession number given in square brackets. C. thermocellum harvested after growth on Populus for 12 hours: F185_Ctherm_Pop_12 hr [SRR:620218] and F188_Ctherm_Pop_12 hr [SRR:620325]. C. thermocellum harvested after growth on Populus for 37 hours: F185_Ctherm_Pop_37 hr [SRR:620219] and F188_Ctherm_Pop_37 hr [SRR:620327]. C. thermocellum harvested after growth on switchgrass for 12 hours: F186_Ctherm_Swg_12 hr [SRR:620229] and F187_Ctherm_Swg_12 hr [SRR:620532]. C. thermocellum harvested after growth on switchgrass for 37 hours: F186_Ctherm_Swg_37 hr [SRR:620238] and F187_Ctherm_Swg_37 hr [SRR:620324]. Note that the same nomenclature of fermenter number (F185, F186, F187, and F188), biomass substrate (Pop and Swg), and time point of sampling (12 hours and 37 hours) is used for naming the samples in the microarray Gene Expression Omnibus (GEO) submission; see details below.
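
The filtering step (FDR <0.05 and an absolute log2 fold change greater than 1) can be sketched as follows. The analysis in this study was performed in JMP Genomics; the sketch substitutes a plain Benjamini-Hochberg adjustment as a stand-in for the FDR calculation, which is an assumption, and the function names, gene labels, and values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differentially_expressed(genes, log2_fc, pvalues,
                             fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes with adjusted p < fdr_cutoff and |log2 FC| > lfc_cutoff."""
    qvalues = benjamini_hochberg(pvalues)
    return [g for g, lfc, q in zip(genes, log2_fc, qvalues)
            if q < fdr_cutoff and abs(lfc) > lfc_cutoff]


if __name__ == "__main__":
    # Illustrative values only; these are not results from this study.
    genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
    log2_fc = [2.5, -1.8, 0.4, 3.0]
    pvalues = [0.001, 0.004, 0.002, 0.6]
    print(differentially_expressed(genes, log2_fc, pvalues))
```

Both conditions must hold, so a gene with a large fold change but a poor adjusted p-value, or a significant gene with a small fold change, is excluded.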

Microarray sample labeling, hybridization, scan, and statistical analysis of array data

RNA-seq libraries were also used for hybridization to the microarray. Beginning with 100 ng of cDNA, half volume Cy3 labeling reactions were undertaken for all eight samples according to the manufacturer’s protocols. Cy3 labeling efficiency was assessed by NanoDrop ND-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and determined to fall within the range of 20 to 24 pmol/μg. Hybridizations were conducted using a 12-bay hybridization station (BioMicro Systems, Salt Lake City, UT, USA) and the arrays dried using a MAUI Wash System (BioMicro Systems). Microarrays were scanned with a SureScan High-Resolution DNA Microarray Scanner (5 μm) (Agilent), and the images were quantified using NimbleScan software (Roche NimbleGen, Madison, WI, USA).

Raw data were log2 transformed and imported into the statistical analysis software JMP Genomics 6.0 (SAS Institute). The data were normalized together using a single round of the LOESS normalization algorithm within JMP Genomics, and distribution analyses conducted before and after normalization were used as a quality control step. An ANOVA was performed in JMP Genomics to determine differential gene expression levels via a direct comparison of the two biomasses and time points using the FDR testing method (P <0.05) and Kenward-Roger degrees of freedom method. Microarray data have been deposited in the NCBI GEO database [GSE:47010]. Samples in the GEO series [GSE:47010] are labeled accordingly with the specific GEO sample accession number given in square brackets. C. thermocellum harvested after growth on Populus for 12 hours: F185_Pop_12 hr_rep1 [GSM:1142896] and F188_Pop_12 hr_rep1 [GSM:1142902]. C. thermocellum harvested after growth on Populus for 37 hours: F185_Pop_37 hr_rep1 [GSM:1142897] and F188_Pop_37 hr_rep1 [GSM:1142903]. C. thermocellum harvested after growth on switchgrass for 12 hours: F186_Swg_12 hr_rep1 [GSM:1142898] and F187_Swg_12 hr_rep1 [GSM:1142900]. C. thermocellum harvested after growth on switchgrass for 37 hours: F186_Swg_37 hr_rep1 [GSM:1142899] and F187_Swg_37 hr_rep1 [GSM:1142901].

RT-qPCR analysis

Microarray data were validated using RT-qPCR, as described previously [34]. Six genes representing a range of gene expression values based on microarray hybridizations were analyzed using qPCR from cDNA derived from different time point samples. Oligonucleotide sequences of the primers targeting the six genes selected for qPCR analysis were: Cthe_0344_F CGACTTCCCGAACCAGATAA, Cthe_0344_R GCAGCGGCTATCTTCATTTC; Cthe_0482_F GAGCAGGGATTGGTAATGGA, Cthe_0482_R TACCGCAAGACCTACAAGCA; Cthe_1481_F AGTCATATCCGAAAACATGG, Cthe_1481_R TTGTAGTCGTCAAGGGAAGT; Cthe_1604_F GTGTCCCCGCTATTGCTAAA, Cthe_1604_R ATGGGTAAAATGCCGAATGA; Cthe_1951_F AAAATAAAAGCCCAGGATTC, Cthe_1951_R GCATTATCCTGAAGTTCGTC; and Cthe_2531_F CGGAAAGGACATTGTCATCC, Cthe_2531_R CAAAGCCAGGGTTACGACAT.
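
For readers who wish to sanity-check the primer set, the following sketch computes length, GC fraction, and a rough melting temperature for each oligonucleotide listed above. The Wallace rule (2°C per A/T, 4°C per G/C) is used purely as a simple approximation for short oligonucleotides; it is an assumption of this sketch and not the primer design method used in this study. The function names are hypothetical.

```python
# Primer sequences as listed in the text (5' to 3').
PRIMERS = {
    "Cthe_0344_F": "CGACTTCCCGAACCAGATAA",
    "Cthe_0344_R": "GCAGCGGCTATCTTCATTTC",
    "Cthe_0482_F": "GAGCAGGGATTGGTAATGGA",
    "Cthe_0482_R": "TACCGCAAGACCTACAAGCA",
    "Cthe_1481_F": "AGTCATATCCGAAAACATGG",
    "Cthe_1481_R": "TTGTAGTCGTCAAGGGAAGT",
    "Cthe_1604_F": "GTGTCCCCGCTATTGCTAAA",
    "Cthe_1604_R": "ATGGGTAAAATGCCGAATGA",
    "Cthe_1951_F": "AAAATAAAAGCCCAGGATTC",
    "Cthe_1951_R": "GCATTATCCTGAAGTTCGTC",
    "Cthe_2531_F": "CGGAAAGGACATTGTCATCC",
    "Cthe_2531_R": "CAAAGCCAGGGTTACGACAT",
}


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(f"{name}\t{len(seq)} nt\tGC={gc_fraction(seq):.2f}"
              f"\tTm~{wallace_tm(seq)} C")
```

A nearest-neighbor model would give more accurate melting temperatures; the Wallace rule is shown only because it can be verified by hand.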