Background

The development of powerful new DNA sequencing technologies has yielded tools with the potential to revolutionize scientific approaches to biological questions [1]. These technologies can be used for a variety of applications, including genome sequencing, identification of DNA-methylation sites, population studies, chromatin immunoprecipitation (ChIP-Seq), and transcriptome studies (RNA-Seq). For RNA-Seq, cDNA is generated from an mRNA-enriched total RNA preparation and sequenced using high-throughput technology. Here, we used the Illumina Genome Analyzer to characterize the transcriptome of stationary phase Listeria monocytogenes 10403S and its isogenic ΔsigB mutant, which lacks the general stress response sigma factor, σB.

L. monocytogenes, a Gram-positive foodborne pathogen of the phylum Firmicutes, is the etiological agent of listeriosis. As 20% of human listeriosis cases result in death, with an estimated annual death toll of ~500 in the US alone [2], this disease is a considerable public health concern. As a foodborne pathogen (with 99% of human illnesses caused by a foodborne route of infection [2]), this bacterium also presents challenging food safety concerns due to its ability to survive and grow under many conditions that are typically applied to control bacterial populations in foods, such as low pH, low temperature, and high salt [3–5]. The alternative general stress response sigma factor, σB, is an essential component of a regulatory mechanism that contributes to the ability of L. monocytogenes to respond to and survive exposure to harsh environmental conditions [6].

Sigma factors are dissociable subunits of prokaryotic RNA polymerase responsible for enzyme recognition of a conserved DNA sequence encoding a transcriptional promoter site. Promoter recognition specificities of bacterial RNA polymerase are determined by the transient association of an appropriate sigma factor with core polymerase in response to conditions affecting the cell [7]. The regulon of a single alternative sigma factor can include hundreds of transcriptional units, thus sigma factors provide an effective mechanism for simultaneously regulating large numbers of genes under appropriate conditions [7]. Critical phenotypic functions regulated by alternative sigma factors range from bacterial sporulation [8] to stress response systems [6, 9].

Through microarray analyses, the σB regulon in L. monocytogenes has been reported to encompass more than 200 genes, including both virulence and stress response genes, many of them up-regulated upon entry into stationary phase [10–12]. However, interpretation of microarray analyses is dependent on the quality of existing genome annotations, which are rarely experimentally verified. Further, transcripts that do not correspond to annotated features (e.g., noncoding RNA transcripts) cannot be identified. In addition, the utility of microarrays is limited by the genomic variation that exists among bacterial strains (i.e., ideally, a unique microarray should be constructed for each strain to be analyzed) and by technical biases such as cross-hybridization. Hence, microarray data can be difficult to analyze and, occasionally, misleading [13, 14]. Although interpretation of RNA-Seq data also relies on the availability of a genome sequence, RNA-Seq is probe- and annotation-independent and is therefore free of cross-hybridization and low-hybridization biases, enabling genome-wide identification of all transcripts, including small noncoding RNAs (ncRNAs). Moreover, because RNA-Seq technology can generate multiple reads corresponding to each transcribed nucleotide on the genome, it is usually possible to identify 5' and 3' transcript ends with high resolution [15]. Therefore, in combination with bioinformatics tools, RNA-Seq data can be used to identify transcriptional promoters and terminators. We used L. monocytogenes as a model system to explore application of RNA-Seq for the dual purposes of genome-wide transcriptome characterization in a bacterial pathogen and comprehensive quantification of target gene expression for the alternative sigma factor, σB.

Results

RNA-Seq provided comprehensive coverage of the L. monocytogenes transcriptome

RNA-Seq analyses were performed on two independent replicate RNA samples collected from both the L. monocytogenes strain 10403S and an otherwise isogenic ΔsigB mutant (FSL A1-254) that had been grown to stationary phase. cDNA was generated from mRNA-enriched total RNA preparations from each strain and sequenced using the Illumina Genome Analyzer to yield a total number of reads for each sample ranging from 3,300,716 to 5,236,748 (Table 1). As the 10403S genome has not been completely closed, the sequence reads were aligned to a 10403S pseudochromosome that was created for this study using the completely closed genome of the L. monocytogenes strain EGD-e (accession no. AL591824) as a reference (see Materials and Methods for details). The total number of reads matching regions other than rRNA and tRNA ranged from 451,548 to 683,746, yielding between 5× and 7.6× coverage of the pseudochromosome. Between 87.3% and 92.1% of the reads in a given RNA-Seq run matched uniquely to the 10403S pseudochromosome and thus were used in subsequent analyses. Reads that did not match the 10403S pseudochromosome (i.e., reads that showed > 2 mismatches to the pseudochromosome) represented between 6.7% and 12.6% of the reads sequenced; another 0.1% to 0.7% of the reads matched to at least two different locations on the pseudochromosome and, therefore, were removed before further analyses. Reads identified as "matching two locations" did not include those matching rRNA genes, as the 10403S pseudochromosome created for this study was designed with only one unique rRNA gene sequence.

Table 1 Summary of RNA-Seq coverage data

To allow for quantitative comparisons among genes and runs, the coverage for each run was normalized for the total number of reads in each run and for gene size. The normalized data are presented as the Gene Expression Index (GEI), which is expressed as the number of reads per 100 bases [16]. Although in silico analyses suggested that the sequencibility (i.e., the portion of the pseudochromosome that could yield unique 32 nt reads) of the 10403S pseudochromosome was 99.6% (Additional file 1: Sequencibility text file), approximately 77.5% of the genome was covered by reads from at least one of the four runs, suggesting that more than 20% of the genome is not transcribed or is transcribed at low levels.
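The normalization described above can be sketched as follows. This is a minimal illustration, assuming that each run is first scaled to a common library size and that reads are then expressed per 100 bases of gene length; the function name, the scaling target, and the example numbers are hypothetical and are not taken from the study's Materials and Methods.

```python
def gene_expression_index(gene_reads, gene_length_bp, run_total_reads, target_total_reads):
    """Reads per 100 bases after scaling the run to a common library size.

    The scaling scheme is an assumption made for illustration only.
    """
    if gene_length_bp <= 0 or run_total_reads <= 0:
        raise ValueError("gene length and run total must be positive")
    # Scale raw counts so that runs of different depth become comparable.
    scale = target_total_reads / run_total_reads
    normalized_reads = gene_reads * scale
    # Express the normalized count per 100 bases of gene length.
    return normalized_reads / gene_length_bp * 100.0


# Hypothetical example: 150 reads on a 1,000 bp gene, in a run of 500,000
# usable reads that is scaled to a common size of 600,000 reads.
example_gei = gene_expression_index(150, 1000, 500_000, 600_000)
print(round(example_gei, 2))
```

Normalizing for both run depth and gene length is what makes GEI values comparable between genes of different sizes and between runs with different read totals.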

RNA-Seq coverage correlated with qRT-PCR transcript levels indicating that RNA-Seq data are quantitative

We evaluated whether average GEI for specific genes correlated with transcript levels that had been measured using TaqMan qRT-PCR, the current gold standard for quantification of mRNA [17]. Based on transcript levels for 9 and 5 genes in 10403S and ΔsigB, respectively, log-transformed average GEI and log-transformed TaqMan qRT-PCR absolute copy numbers were correlated (p-value < 0.001; adj. R² = 0.83; Figure 1; Additional file 2: RNA-Seq average GEI and TaqMan qRT-PCR absolute copy number of select genes), supporting the conclusion that RNA-Seq provides reliable quantitative estimates of transcript levels in L. monocytogenes. RNA-Seq was previously reported to provide quantitative data on transcript levels in yeast [15] and, more recently, in Burkholderia cenocepacia [16]; thus, our findings extend this important correlation to a new prokaryotic system.
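A comparison of this kind can be illustrated with a small least-squares fit on log-transformed values. The sketch below is an assumption-laden example: the function name and the input numbers are hypothetical (they are not data from this study), and it uses ordinary least squares with a single predictor to obtain an adjusted R².

```python
import math


def log_log_fit(x, y):
    """Ordinary least squares of log10(y) on log10(x).

    Returns (slope, intercept, adjusted R^2) for a one-predictor model.
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - mean_y) ** 2 for b in ly)
    r2 = 1.0 - ss_res / ss_tot
    # Adjust for one predictor: (n - 1) / (n - 2).
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2


# Hypothetical values for illustration only (a perfect power-law relation).
hypothetical_gei = [1.0, 10.0, 100.0, 1000.0]
hypothetical_copies = [2.0e2, 2.0e3, 2.0e4, 2.0e5]
slope, intercept, adj_r2 = log_log_fit(hypothetical_gei, hypothetical_copies)
print(slope, intercept, adj_r2)
```

Log transformation is used because both GEI and copy number span several orders of magnitude, so a fit on the raw scale would be dominated by the few most highly expressed genes.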

Figure 1

Correlation between qRT-PCR and RNA-Seq. Correlation between qRT-PCR and RNA-Seq data for selected genes in L. monocytogenes 10403S (red) and the ΔsigB strain (blue). The selected genes are: ctc, gadA, gap, opuCA, rpoB (qRT-PCR data from both strains were available for these 5 genes), flaA, inlA, plcA and sigB (only qRT-PCR data from 10403S were available for these 4 genes).

Stationary phase L. monocytogenes transcribed at least 83% of annotated genes

Among the 2888 annotated coding sequences (CDS) in the 10403S pseudochromosome, 2417 (83.7%) showed an average GEI ≥ 0.7 in 10403S (average of two biological replicates), suggesting that at least 83% of the annotated L. monocytogenes genes are transcribed in stationary phase (Additional file 3: Cumulative frequency of average GEI in L. monocytogenes 10403S; see Materials and Methods for calculation of coverage, rationale for defining transcribed genes, and criteria for classifying transcript levels as low, medium or high). Of these 2417 genes, 654 (22%) had high transcript levels, 586 (20.0%) had medium transcript levels, and 1177 (41.0%) had low transcript levels. A total of 471 genes (17%) had GEI < 0.7 and were considered "not transcribed". RNA-Seq data allowed visual examination of transcript units, aiding in identification of genes that are transcribed monocistronically or as part of an operon (Figure 2). A total of 355 transcription units appeared to represent operons; these units were identified and annotated (Additional file 4: Access database). A total of 1107 (38.3%) of the annotated 10403S CDS were located in these putative operons. Further experimental data are necessary to validate our predictions of transcription unit structure, as some genes may have rho-dependent terminators that were not identified in this study and may therefore be transcribed monocistronically despite showing GEI similar to those of their neighboring genes.
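The binning of genes by average GEI can be sketched as a simple threshold classifier. Only the 0.7 cut-off for "transcribed" is stated in the text; the medium and high cut-offs below are hypothetical placeholders, since the actual criteria are given in Materials and Methods and are not reproduced here.

```python
TRANSCRIBED_CUTOFF = 0.7  # average GEI threshold stated in the text


def classify_transcript_level(avg_gei, medium_cutoff, high_cutoff):
    """Assign a transcript-level class from an average GEI.

    medium_cutoff and high_cutoff are hypothetical parameters supplied
    by the caller; they are not values from the study.
    """
    if not TRANSCRIBED_CUTOFF <= medium_cutoff <= high_cutoff:
        raise ValueError("cut-offs must satisfy 0.7 <= medium <= high")
    if avg_gei < TRANSCRIBED_CUTOFF:
        return "not transcribed"
    if avg_gei < medium_cutoff:
        return "low"
    if avg_gei < high_cutoff:
        return "medium"
    return "high"


# Hypothetical cut-offs of 10 and 25, chosen only to exercise the function.
for value in (0.3, 2.0, 12.0, 40.0):
    print(value, classify_transcript_level(value, 10.0, 25.0))

# Fraction of annotated CDS at or above the 0.7 cut-off, from the counts in the text.
transcribed_percent = round(100 * 2417 / 2888, 1)
print(transcribed_percent)
```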

Figure 2

View of RNA-Seq data using the Artemis genome browser. This region of the 10403S chromosome includes six coding genes, i.e. LMRG_02429 to LMRG_02435, and the 5' end of LMRG_02436; genes are represented as blue arrows. The top part of the figure shows normalized RNA-Seq coverage (i.e. the number of reads that match an annotated gene after normalization across runs) with red and blue lines representing the two 10403S replicates and the green and black lines representing the ΔsigB strain. The horizontal line indicates a normalized RNA-Seq coverage of 49.16 reads. The middle part of the figure shows the three positive frames of translation with the coding regions and vertical black bars representing stop codons. The last line shows putative operons (white bars), a terminator (purple bar) downstream of LMRG_02430 and the chromosome coordinates. Notice the difference in coverage between LMRG_02431 (downstream of the terminator) and the other genes. All genes in the figure have sequencibility of 100% (See Additional file 1: Sequencibility text file for a complete sequencibility plot).

The three genes with the highest average GEI in 10403S all encoded predicted ncRNAs, including tmRNA, 6S and LhrA (Table 2). The annotated CDS (as annotated in EGD-e [18]) with the highest average GEI were lmo2257, fri, and lmo1847, which encode a hypothetical CDS, iron-binding ferritin, and an ABC transporter, respectively. Other genes with well defined functions and high average GEI include flaA, which encodes a flagellin protein, sod, which encodes a superoxide dismutase involved in detoxification, and cspB and cspL, which encode cold-shock proteins involved in adaptation to atypical conditions (Table 2).

Table 2 Genes with highest GEI

Both positive and negative associations were observed between GEI and the TIGR classification of sets of genes to physiological role categories (http://cmr.jcvi.org/cgi-bin/CMR/RoleIds.cgi; Table 3). For example, genes involved in protein synthesis and protein fate showed higher average GEI in stationary phase 10403S as compared to genes involved in other functions, while genes involved in viral functions and amino acid biosynthesis were significantly associated with low average GEI in 10403S. Moreover, a significant positive association was observed between codon bias and the average GEI in 10403S (p-value < 0.001; linear regression analysis).

Table 3 Associations between GEI and role categories

Identification and annotation of noncoding RNAs (ncRNAs)

Overall, we identified 67 ncRNAs (Additional file 5: ncRNAs identified by RNA-Seq) that showed average GEI ≥ 0.7 in 10403S, indicating that these ncRNAs are transcribed in stationary phase L. monocytogenes (see Materials and Methods for more details on ncRNA annotation). Among the 67 ncRNAs identified as transcribed in the present study, 60 matched ncRNAs previously described in L. monocytogenes (Additional file 5: ncRNAs identified by RNA-Seq) [19–22]. These 60 ncRNAs included 6S RNA, tmRNA, and several S-box RNA and T-box leader RNA molecules. A total of 7 putative ncRNAs identified here were not previously identified in L. monocytogenes and did not match ncRNA entries in Rfam (Table 4). The regions representing these putative ncRNAs showed contiguous coverage by RNA-Seq reads (i.e., at least 100 bp completely covered by RNA-Seq reads), but did not fully match annotated genes. Overall, 36 of the ncRNAs recently identified by tiling microarray analyses in L. monocytogenes strain EGD-e [20] were not identified in this study (see Additional file 6: ncRNAs previously described in L. monocytogenes strain EGD-e but not identified in this study for a list of these EGD-e ncRNAs). The most likely explanations for the absence of these EGD-e ncRNAs in 10403S are one or more of the following: (i) low (<0.7 GEI) or no RNA-Seq coverage in 10403S (indicating no transcription in stationary phase 10403S or loss of small RNAs during RNA isolation); (ii) the homolog may be absent in the L. monocytogenes 10403S genome (e.g., for EGD-e RliC; Table S3); (iii) ncRNAs determined to be antisense RNA in EGD-e [20] were not identified in 10403S, as the RNA-Seq protocol did not provide for directional reads; (iv) the corresponding 10403S genome region has not been completely sequenced and closed (e.g., for EGD-e LhrC, which falls in a repetitive region in the EGD-e chromosome [19]); and (v) the EGD-e ncRNA did not meet our criterion of 100 bases of contiguous coverage.

Table 4 New L. monocytogenes ncRNAsa identified in this study

Three putative ncRNAs with high GEI covered either part or all of each of three annotated CDS, suggesting that ncRNAs overlap with these CDS or that some putative CDS actually encode ncRNAs rather than proteins. Specifically, LMRG_01574 (lmo2257), LMRG_02926 (no homolog in EGD-e), and LMRG_1986 (lmo2711) overlapped with lhrA (partial overlap), with the bacterial RNAse P class B ncRNA (full overlap), and with the bacterial signal recognition particle RNA (partial overlap), respectively. In concert with our findings, lmo2257 was previously hypothesized not to be a CDS [19, 21].

RNA-Seq identified 96 annotated CDS and one ncRNA as σB-dependent and provided comprehensive data on transcript levels for genes in the σB regulon

Our RNA-Seq data analyses identified a total of 96 genes as up-regulated by σB (Additional file 7: Genes up-regulated by σB). No annotated genes were identified as significantly down-regulated by σB in this study. Although various genes have been identified previously as down-regulated by σB [10, 12, 20], we have observed that genes with significantly higher transcript levels in the ΔsigB strain (i.e., genes identified as down-regulated by σB): (i) are likely to be indirectly regulated by σB, as σB is a transcriptional activator, (ii) generally show a lower fold-difference in transcript levels between the parent strain and the ΔsigB strain as compared to genes identified as up-regulated by σB [10], and (iii) have not been consistently identified as down-regulated by σB between different studies, even in microarray studies using the same strain and condition (see Figure 3, which indicates that only 7 genes were identified as down-regulated by σB in both of two separate studies with strain 10403S). Down-regulation of genes by σB thus appears stochastic as compared to up-regulation by σB. Overall, our findings suggest that RNA-Seq combined with stringent criteria for detection of statistically significant differences in transcript levels (i.e., the requirement for statistical significance for all four binomial comparisons) may generate fewer false positives as compared to some microarray-based approaches.
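The stringent "all four comparisons" requirement can be sketched as below. This is an illustrative sketch, assuming the thresholds given in the Figure 3 caption (q < 0.05 and fold change ≥ 2.0 for each pairing of a 10403S replicate with a ΔsigB replicate); the function name, the data layout, and the example values are hypothetical, and the binomial test and q-value calculation themselves are not shown.

```python
from itertools import product


def is_sigb_dependent(comparisons, q_cutoff=0.05, min_fold_change=2.0):
    """Return True only if all four replicate pairings are significant.

    comparisons maps (parent_replicate, mutant_replicate) to a tuple of
    (q_value, fold_change), with replicates numbered 1 and 2.
    """
    pairings = list(product((1, 2), (1, 2)))  # the four replicate pairings
    return all(
        pair in comparisons
        and comparisons[pair][0] < q_cutoff
        and comparisons[pair][1] >= min_fold_change
        for pair in pairings
    )


# Hypothetical gene that passes in all four pairings.
passes_all = {(1, 1): (0.001, 8.0), (1, 2): (0.002, 7.5),
              (2, 1): (0.001, 9.1), (2, 2): (0.004, 6.8)}
# Hypothetical gene that fails one pairing on fold change and q-value.
fails_one = {(1, 1): (0.001, 3.0), (1, 2): (0.010, 2.4),
             (2, 1): (0.020, 2.2), (2, 2): (0.080, 1.7)}

print(is_sigb_dependent(passes_all), is_sigb_dependent(fails_one))
```

Requiring agreement across every pairing trades sensitivity for specificity: a gene that is significant in three of four comparisons is excluded, which is consistent with the conservative definition described in the text.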

Figure 3

σB-dependent genes identified by RNA-Seq and microarray analyses. Venn diagram of σB-dependent genes identified in stationary phase cells in this study and in previous microarray studies of stationary phase L. monocytogenes [10, 12]. Numbers in bold are the number of up-regulated annotated CDS identified as σB-dependent in each study; numbers followed by down arrows are down-regulated σB-dependent genes. No down-regulated σB-dependent genes were identified by RNA-Seq. The 13 genes identified as σB-dependent in stationary phase only by RNA-Seq, but not by previous microarray studies of L. monocytogenes 10403S, include 5 genes that had been found to be σB-dependent, by microarray studies [10] in salt stressed cells (see Table 5). In a number of instances, (e.g. opuCB, rsbX; See Additional file 8: Comparison of genes found to be σB-dependent by microarray analysis and not by RNA-Seq) genes with significantly different transcript levels in both microarrays [10, 12] had significant binomial probabilities (q < 0.05) and a fold change ≥ 2.0 for most of the possible combinations (i.e. 10403S replicate 1 vs ΔsigB replicate 1; 10403S replicate 1 vs ΔsigB replicate 2; 10403S replicate 2 vs ΔsigB replicate 1; 10403S replicate 2 vs ΔsigB replicate 2), but not for all four comparisons and these genes were, therefore, not identified as showing significant differences in normalized RNA-Seq coverage (based on our conservative definition of genes with significant differences in normalized RNA-Seq coverage); see Additional file 8: Comparison of genes found to be σB-dependent by microarray analysis and not by RNA-Seq for detailed RNA-Seq data for genes identified as σB-dependent by microarrays, but not by RNA-Seq.

As illustrated in Figure 4A, RNA-Seq data are useful for predicting multi-gene operons controlled by a given regulator such as σB. Thirty-eight of the 96 genes up-regulated by σB are organized into a total of 20 operons, including (i) opuCABCD, which encodes the subunits of a glycine betaine/carnitine/choline ABC transporter, (ii) lmo0781-lmo0784, which encode the four subunits of a putative mannose-specific phosphotransferase system, (iii) lmo2484-lmo2485, which encode a putative membrane-associated protein and a putative transcriptional regulator similar to PspC, respectively, and (iv) lmo0133 and lmo0134 (Figure 4A), which encode proteins similar to E. coli YjdI and YjdJ, respectively.

Figure 4

Examples of σB-dependent transcripts identified by RNA-Seq. In each panel (A, B, and C), red and blue lines represent normalized RNA-Seq coverage (i.e. the number of reads that match an annotated gene after normalization across runs) in the two 10403S replicates, and green and black lines represent normalized RNA-Seq coverage in the ΔsigB strain replicates; the number at the top right of each panel indicates the normalized RNA-Seq coverage represented by the horizontal line shown. Panel (A) depicts LMRG_02382 and LMRG_02383 (shown as blue bars), which form an operon (indicated by a long white bar) with a defined Rho-independent terminator (purple bar) downstream of LMRG_02383; the three positive frames of translation, with the coding regions in blue and stop codons shown as vertical black bars, are also shown. A σB-dependent promoter (red bar) was identified upstream of the operon, and the RNA-Seq coverage data clearly show that the transcription of this operon is positively regulated by σB (i.e. almost no coverage was obtained from the ΔsigB strain). Panel (B) depicts SbrE (Rli47), a σB-dependent noncoding RNA (ncRNA) for which a Rho-independent terminator and a σB-dependent promoter were identified; annotated features as well as positive and negative frames of translation are shown at the bottom, with stop codons shown as vertical black bars. Panel (C) shows the 5' end of LMRG_01602, illustrating the position of a σB-dependent promoter in relation to the start codon of the gene and the transcriptional start site determined by RNA-Seq. The black triangle indicates the transcriptional start site determined by RACE-PCR as previously described by Kazmierczak et al. [23].

One-sided Fisher's exact tests were used to determine if σB-dependent genes are over-represented within specific TIGR role categories. Genes identified as σB-dependent were over-represented among genes involved in cellular functions (q-value = 0.045). σB-dependent genes in this category include genes involved in pathogenesis (inlA, inlB, inlH), adaptation to atypical conditions (lmo0515, lmo0669, lmo2673, lrtC), detoxification (lmo1433, lmo2230), cell division (lmo1624) and an unknown protein that may be involved in toxin production and resistance (lmo0321).
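A one-sided Fisher's exact test for over-representation reduces to an upper-tail hypergeometric probability on a 2×2 table. The sketch below shows that calculation with the standard library only; the function name and the example counts are hypothetical, and the multiple-testing correction that turns p-values into q-values is not shown.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value P(X >= a) for the 2x2 table [[a, b], [c, d]].

    a: sigB-dependent genes in the role category
    b: sigB-dependent genes outside the category
    c: other genes in the category
    d: other genes outside the category
    """
    row1 = a + b          # all sigB-dependent genes
    col1 = a + c          # all genes in the category
    total = a + b + c + d
    denominator = comb(total, row1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts for illustration only.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(p_value)
```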

We evaluated RNA-Seq transcript levels for the 96 σB-dependent genes identified here (Additional file 7: Genes up-regulated by σB). The average fold change (10403S GEI/ΔsigB GEI) for the 96 σB-dependent genes ranged from 2.6 to 479.4. The σB-dependent genes with the highest average GEI in 10403S were lmo2158, lmo1602, and lmo0539, which encode a protein similar to B. subtilis YwmG, an unknown protein, and a tagatose-1,6-diphosphate aldolase, respectively (Table 5).

Table 5 Summary of genes up-regulated by σB

An ~500 nt σB-dependent ncRNA was identified between lmo2141 and lmo2142 (Figure 4B); this ncRNA was recently designated rli47 [20]. To be consistent with the nomenclature for other σB-dependent ncRNA [21], we propose that rli47 be named sbrE (sigma B-dependent RNA). Although BLASTX searches (using 6 possible reading frames) and searches against the Pfam database did not yield significant matches, a σB-dependent promoter was identified upstream of the transcript and a Rho-independent terminator was found by TransTermHP (Figure 4B). The sequence for this putative ncRNA was also present in 17 other L. monocytogenes genomes, including EGD-e (GenBank accession no. NC_003210), F2365 (GenBank accession no. NC_002973), and 15 unfinished genome sequences by the Broad Institute (http://www.broad.mit.edu/annotation/genome/listeria_group/MultiHome.html), as well as in one L. innocua (GenBank accession no. NC_003212) and one L. welshimeri (GenBank accession no. NC_008555) genome. The 514 nt sbrE (rli47) sequence was 96.6% conserved among the 18 L. monocytogenes genomes.

HMM showed that 84% of σB-dependent genes and operons identified by RNA-Seq are preceded by σB promoters and therefore, appear to be directly regulated by σB

An HMM representing L. monocytogenes σB-dependent promoters was dynamically created by using an initial training set of experimentally verified L. monocytogenes σB-dependent promoters to search the RNA-Seq data. The final model yielded a total of 5,387 motifs with scores > 5.00 bits throughout the pseudochromosome sequence. Among these motifs, we identified 65 possible σB-dependent promoter sequences upstream of genes and operons identified as σB-dependent based on RNA-Seq data (see Figure 5 for the L. monocytogenes σB promoter sequence logo). Because some of the genes with experimentally validated σB promoters were not found to be significantly up-regulated by σB in our study (e.g. prfA and the rsbV operon) and because the ltrC promoter, which was in the initial training set, had a score below our threshold of 5.00 bits in the final search, our annotation does not include all promoters present in the training set (i.e., only promoters identified upstream of genes that were significantly up-regulated by σB in the present study were annotated). Specifically, σB-dependent promoter sequences were found upstream of 15 of the 20 putative σB-dependent operons, 49 of the 58 monocistronic σB-dependent genes, and the one σB-dependent ncRNA identified here (Figure 4B). We compared RNA-Seq defined transcriptional start sites for 8 genes with σB promoters to transcriptional start sites determined by Rapid Amplification of cDNA Ends PCR (RACE-PCR) in a previous study [23]. Transcriptional start sites identified with RNA-Seq were located between 0 and 29 bases downstream (and therefore sometimes 3') of start sites determined by RACE-PCR (see Figure 4C for LMRG_01602 transcriptional start site mapped by RACE-PCR and RNA-Seq), indicating that RNA-Seq successfully approximates transcriptional start sites, but sometimes does not provide full sequence coverage to the 5' end of a transcript. Some transcriptional start sites could not be specifically mapped to a σB promoter site using RNA-Seq, as some genes (e.g. opuCA) have multiple promoters. A dendrogram of the putative σB promoter sequences showed no apparent clustering of these promoter sequences by either average GEI in 10403S or by σB-dependence (average fold change). These results suggest that additional regulatory elements or mechanisms other than promoter sequence per se (e.g., RNA stability) also influence transcript levels and/or σB-dependence for these genes (data not shown).
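The idea of scanning a sequence for motifs that score above a bit threshold can be illustrated with a much simpler stand-in for the HMM: a fixed-width position weight matrix (PWM) with log-odds scores in bits. This is not the model used in the study. The training sites and the scanned sequence below are hypothetical toy strings, a uniform background of 0.25 per base is assumed, and, unlike an HMM, this sketch cannot model a variable-length spacer between promoter elements.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds PWM (bits) from equal-length, uppercase ACGT sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all training sites must have the same length")
    pwm = []
    for position in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[position]] += 1.0
        total = sum(counts.values())
        # Log-odds against a uniform background of 0.25 per base.
        pwm.append({base: math.log2((counts[base] / total) / 0.25) for base in BASES})
    return pwm


def scan(sequence, pwm, threshold_bits=5.0):
    """Return (start, score) for every window scoring above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score > threshold_bits:
            hits.append((start, score))
    return hits


# Hypothetical toy training sites and sequence (not real promoter data).
toy_sites = ["GTTTAA", "GTTTAT", "GTTTTA"]
toy_pwm = build_pwm(toy_sites)
toy_hits = scan("CCCGTTTAACCC", toy_pwm)
print(toy_hits)
```

In an iterative scheme such as the one described in the text, hits that lie upstream of regulated genes would be added to the training set and the model rebuilt; that loop is omitted here.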

Figure 5

Logo of the σB promoter. This logo was created from the alignment of 65 σB promoters identified in this study.

RNA-Seq successfully identifies a number of previously identified as well as novel σB-dependent genes

To evaluate the ability of RNA-Seq to identify L. monocytogenes σB-dependent genes, we compared the σB-dependent genes identified here with those identified in two independent microarray studies by our research group. Specifically, we compared our results with microarray data reported by (i) Raengpradub et al. [10], who identified σB-dependent genes using L. monocytogenes strains and growth conditions identical to those in this study, and by (ii) Ollinger et al. [12], who identified σB-dependent genes by comparing transcripts from L. monocytogenes 10403S with a PrfA* (G155S) allele [24], which constitutively expresses the PrfA-regulated virulence genes [24–26], with those from an isogenic ΔsigB mutant grown to stationary phase under the same conditions used here. Further, we compared our results with those from a microarray study using another L. monocytogenes strain (EGD-e) and its isogenic ΔsigB mutant, grown under similar conditions (i.e., growth to early stationary phase [11]). Among the 96 σB-dependent annotated CDS identified in the present study, 72 were also identified as σB-dependent in previous microarray studies of stationary phase L. monocytogenes 10403S cells [10, 12] (Figure 3). In addition, 64 (66.7%) of the 96 σB-dependent genes identified here were identified as positively regulated by σB in L. monocytogenes strain EGD-e cells grown to early stationary phase (8 h growth in BHI) [11]. Overall, 12 genes identified as σB-dependent in stationary phase cells in both previous microarray studies by our group [10, 12] were not identified as σB-dependent by the RNA-Seq experiments reported here (Figure 3); 9 of these genes showed a σB-dependent promoter based on the HMM analyses in this study and are likely to be directly regulated by σB (see Additional file 8: Comparison of genes found to be σB-dependent by microarray analysis and not by RNA-Seq for further details on these genes).

Finally, a total of 13 annotated CDS identified as σB-dependent by RNA-Seq (including 9 genes that also showed a σB-dependent promoter in our HMM analysis) had not been identified as σB-dependent in either of the previous microarray studies with strain 10403S grown to stationary phase [10, 12] (see Table 3). Among these 13 genes not previously identified as σB-dependent in stationary phase L. monocytogenes 10403S, five had previously been identified as σB-dependent in salt-stressed cells [10], including the well-characterized virulence genes inlA and inlB, which have also been shown by qRT-PCR and promoter mapping to be directly regulated by σB [27]. In addition, two of these 13 genes had been identified as positively regulated by σB in L. monocytogenes strain EGD-e [11], even though they had not been identified as σB-dependent in previous microarray studies of strain 10403S [10, 12]. For one of these genes (i.e. lmo0265), the microarray probe (designed based on the genome of L. monocytogenes strain EGD-e) showed a low hybridization index (HI; % match between strain-specific sequence and oligonucleotide probe) to 10403S (< 80%). Interestingly, lmo2003, which encodes a transcription regulator similar to the GntR family, was identified as σB-dependent by RNA-Seq, but had not been previously identified as σB-dependent in either 10403S or EGD-e.

Discussion

In this study, we used deep RNA sequencing to define and characterize the transcriptomes of L. monocytogenes strain 10403S and an otherwise isogenic ΔsigB mutant, which does not express the general stress-response sigma factor, σB. The data generated using this approach showed that (i) at least 83% of annotated L. monocytogenes genes are transcribed in stationary phase cells; and (ii) stationary phase L. monocytogenes transcribes 67 ncRNAs, including one σB-dependent ncRNA and seven ncRNAs that, to our knowledge, have not previously been identified in L. monocytogenes. Additionally, RNA-Seq data provided for quantitation of transcript levels and approximate identification of transcriptional start sites on a genome scale. Use of a novel, iterative, dynamic HMM, in combination with RNA-Seq data, identified putative σB-dependent promoters and further defined the L. monocytogenes σB regulon.

The majority of annotated L. monocytogenes genes are transcribed in stationary phase cells

While genome sequencing and microarray approaches have provided important insight into the biology of prokaryotic organisms, including a number of human bacterial pathogens, identification of all genes and their transcriptional patterns remains a major challenge in all areas of biology. Our results demonstrate that global probe-independent approaches for transcriptome characterization are valuable tools for analyzing bacterial transcriptomes [16, 28, 29]. A major challenge that currently hinders analysis of transcriptomic data generated by approaches such as RNA-Seq is the ability to differentiate between genes with low levels of transcription and background levels of coverage. Several approaches have been used to define cut-off values between background GEI and GEI indicative of low transcript levels (e.g., [15, 30, 31]). We chose a comparative analysis of L. monocytogenes 10403S transcript levels with those of a mutant strain that does not express a transcription factor (i.e., the alternative sigma factor σB) as a novel approach for robustly defining background RNA-Seq coverage. Our results show that a number of σB-dependent genes were solely σB-dependent (at least under the conditions used here), as supported by the lack of detectable RNA-Seq coverage in the ΔsigB strain, despite considerable RNA-Seq coverage of the same genes in the isogenic parent strain 10403S. This is an important observation as a number of σB-dependent L. monocytogenes genes are also activated by other sigma factors (e.g., σA [32, 33]). Using the average GEI for L. monocytogenes genes that were solely σB-dependent in the ΔsigB strain as a conservative cut-off value for transcribed genes, we found that approximately 83% of L. monocytogenes 10403S annotated CDS were transcribed in stationary phase cells. These transcribed genes include 355 putative operons, which cover a total of 1,107 genes, indicating that a considerable proportion of L. monocytogenes genes appear to be transcribed polycistronically. In comparison, a recent study using a tiling microarray identified 517 polycistronic operons that encompass 1,719 genes in L. monocytogenes EGD-e [20]. Taken together, these data indicate that the majority of annotated L. monocytogenes genes are transcribed. This conclusion is consistent with results from a whole-genome tiled microarray transcriptome study of E. coli MG1655 [34], which reported transcription of 4052 E. coli MG1655 genes in bacteria grown under different conditions, suggesting that about 98% of the E. coli MG1655 genes are transcribed.
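The background cut-off strategy described above can be sketched in a few lines: average the GEI observed in the ΔsigB strain over genes treated as solely σB-dependent, then call a gene transcribed in the parent strain if its GEI reaches that value. The function names, gene identifiers, and numbers below are hypothetical and serve only to illustrate the logic.

```python
def background_cutoff(mutant_gei_by_gene, solely_sigb_dependent):
    """Mean GEI in the mutant strain over genes assumed to be solely sigB-dependent.

    Residual coverage of these genes in the mutant is treated as background.
    """
    values = [mutant_gei_by_gene[gene] for gene in solely_sigb_dependent]
    if not values:
        raise ValueError("at least one reference gene is required")
    return sum(values) / len(values)


def transcribed_genes(parent_gei_by_gene, cutoff):
    """Sorted list of genes whose parent-strain GEI is at or above the cut-off."""
    return sorted(gene for gene, gei in parent_gei_by_gene.items() if gei >= cutoff)


# Hypothetical data for illustration only.
mutant_gei = {"refA": 0.4, "refB": 0.6, "refC": 1.1}
parent_gei = {"geneA": 0.2, "geneB": 5.0, "geneC": 12.0}

cutoff = background_cutoff(mutant_gei, ["refA", "refB", "refC"])
print(round(cutoff, 2), transcribed_genes(parent_gei, cutoff))
```

Anchoring the cut-off to genes that have lost their only known activator gives an empirical estimate of background coverage rather than an arbitrary threshold.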

Our results also demonstrate that RNA-Seq coverage levels (generated with the Illumina Genome Analyzer System) correlate well with quantitative RT-PCR-based mRNA transcript level data. Therefore, in combination with results from previous studies (e.g., in yeast [15, 31], human cell lines [35], human tissue [36], murine tissue [30]), our findings indicate that RNA-Seq tools can be broadly applied in biological studies to enable quantitative analysis of transcript levels. We also found a positive correlation between RNA-Seq-based transcript levels and codon bias, consistent with the well-documented observation that genes with high codon bias are often highly expressed [37–39]. Genes in four role categories, including (i) signal transduction, (ii) viral functions, (iii) amino acid biosynthesis, and (iv) transport and binding, were significantly associated with lower transcript levels. These categories include a number of genes that encode proteins predominantly required for growth and survival under specialized environmental conditions (e.g., viral replication genes) or under conditions other than stationary phase (e.g., amino acid biosynthesis may be less important in stationary phase than during exponential growth as sufficient amino acids from dead bacteria are likely to be available for scavenging), and/or proteins that may only be required in small amounts. On the other hand, we found that genes in seven role categories, including (i) cellular processes, (ii) DNA metabolism, (iii) protein fate, (iv) protein synthesis, (v) purines, pyrimidines, nucleosides, and nucleotides, (vi) transcription, and (vii) genes encoding proteins with unknown functions, showed, on average, higher transcript levels in stationary phase L. monocytogenes. These findings suggest that genes in these particular categories are important for bacterial cells transitioning from exponential growth to stationary phase.

Overall, the L. monocytogenes genes with the highest transcript levels were ncRNAs, specifically the transfer-messenger RNA (tmRNA) and 6S RNA, consistent with the observation that tmRNAs are involved with bacterial recovery from a variety of stresses including entry into stationary phase, amino acid starvation, and heat shock [40]. 6S RNA accumulates in cells during stationary phase; cells lacking 6S RNA have reduced fitness relative to wildtype stationary phase cells [41]. In addition to down-regulating some housekeeping genes, 6S RNA has been shown to up-regulate expression of some σS-dependent genes in Gram-negative bacteria [41]. σS is the stationary phase stress response alternative sigma factor in E. coli [42]. Taken together, we hypothesize that 6S RNA plays a critical role in the ability of L. monocytogenes to survive stationary phase associated stress conditions.

Specific protein-encoding genes with very high transcript levels in stationary phase L. monocytogenes include fri, sod, cspB, and cspL, all genes with some previous evidence for contributions to L. monocytogenes stationary phase and stress survival [43–49]. flaA, which encodes a flagellin protein, was also highly transcribed in stationary phase cells at 37°C. Although L. monocytogenes has been reported to show flagellar motility only when grown at ≤ 30°C [50, 51], our results are consistent with the observation that strain 10403S, which was used in this study, has been shown to express flagellin at 37°C [51]. Interestingly, we also found some annotated CDS without known function to be highly transcribed, including lmo1847 and lmo1849, which encode putative ABC transporters based on BLAST and Pfam [52] searches, respectively, and lmo1468, which encodes an unknown protein.

RNA-Seq identifies ncRNA molecules in L. monocytogenes, including a σB-dependent ncRNA, in 10403S

Using RNA-Seq, we found 67 previously identified or putative ncRNAs that were transcribed in stationary phase L. monocytogenes. Of these, 7 represent ncRNAs that have not been identified previously as transcribed in L. monocytogenes. Sixty of the ncRNAs identified here have previously been reported by Toledo-Arana et al. [20], Nielsen et al. [53], Mandin et al. [22] and Christiansen et al. [19]. Interestingly, 16 L. monocytogenes ncRNAs with similarities to ncRNAs identified in other bacterial organisms are putative riboswitches. We also found that sbrE (rli47), which has no homologies to ncRNA entries in Rfam, appears to be directly regulated by σB, based on the considerably higher transcript levels (186 fold) present in the parent strain as compared to the sigB-null mutant, consistent with results from a recent tiling microarray study [20]. As the RNA isolation procedure used here selected against small RNA molecules (see Materials and Methods for details), it is likely that additional small ncRNAs not detected here (e.g., some small ncRNAs identified by Toledo-Arana et al. [20]), are also transcribed in stationary phase L. monocytogenes 10403S.

Prior to this study, L. monocytogenes ncRNAs, including potential σB-dependent ncRNAs [53], had been identified using in silico modeling [22, 53], co-precipitation with the RNA-binding protein Hfq [19], and, most recently, tiling microarrays [20]. While, among these approaches, tiling microarrays [20] provided the most comprehensive characterization of L. monocytogenes ncRNAs, deep RNA sequencing also identified a large number of transcribed L. monocytogenes ncRNAs, including ncRNAs with no similarities to previously identified ncRNAs. Our results, taken together with previous studies that have identified numerous novel transcripts with RNA-Seq in bacteria (S. meliloti [28], B. cenocepacia [16], V. cholerae [29]), yeast [15, 31], mouse [30], Arabidopsis [54], human cell lines [35, 55], and human tissue [36], clearly show the power of this technique for characterizing bacterial transcriptomes and ncRNAs.

The L. monocytogenes σB regulon is composed of at least 96 genes, including 82 genes and 1 ncRNA that are preceded by putative σB promoters

As alternative sigma factors, such as σB, are known to play critical roles in gene regulation across bacterial genera [33], we used L. monocytogenes 10403S and an isogenic ΔsigB null mutant as a model system for exploring the use of RNA-Seq, in combination with in silico analyses, for characterization of transcriptional blueprints associated with bacterial regulatory elements. In our study, RNA-Seq identified 96 annotated CDS and one ncRNA SbrE (Rli47) that are up-regulated by σB. Quantitative RT-PCR experiments also confirmed σB-dependent transcript levels of SbrE (Rli47) (Mujahid et al., unpublished). Among the 96 σB-dependent annotated CDS identified in this study, 74 (77.1%) [10] and 81 (84.4%) [12] were also identified as σB-dependent in stationary phase cells in two previous microarray studies using the same strain background. Also, 63 of the 96 σB-dependent genes identified here were reported as positively regulated by σB in another L. monocytogenes strain (EGD-e) grown to early stationary phase [11]. Twelve genes were identified as σB-dependent in both previous microarray studies performed with the same L. monocytogenes strain background and the same conditions used here, but were not identified as σB-dependent by RNA-Seq in this study. This disparity is likely due to the fact that the thresholds and statistical cut-offs used to define σB-dependent genes were very stringent in the present study (e.g., a q-value < 0.05 in all four comparisons).

Overall, in addition to confirming a previously identified σB-dependent ncRNA [20], RNA-Seq identified 13 genes that had not been defined as σB-dependent in previous microarray studies of stationary phase L. monocytogenes 10403S cells [10, 12], including 5 genes that had been identified as σB-dependent in salt stressed cells, but not in stationary phase cells. One gene not previously identified as σB-dependent was lmo2003, which encodes a transcription regulator similar to the GntR family. The GntR family of regulators has been characterized as global regulators of primary metabolism in a number of bacteria [56–58]. This finding further supports that L. monocytogenes σB appears to be involved in a number of transcriptional regulatory networks [6]. Increasing evidence indicates that regulatory RNAs also contribute to regulatory networks that involve L. monocytogenes σB. For example, in addition to the σB-dependent SbrE ncRNA described here, tiling array analyses also identified additional σB-dependent ncRNAs. While previous in silico studies in L. monocytogenes strain EGD-e [53] identified four putative σB-dependent ncRNAs (i.e., SbrA, SbrB, SbrC, SbrD), only SbrA was confirmed in vivo as σB-dependent in EGD-e [20, 53]. Even though our RNA-Seq analyses in 10403S identified SbrA transcripts, transcript levels for this ncRNA were not σB-dependent under the conditions used in our study. The fact that SbrA was not found to be σB-dependent in 10403S may be due to differences in strains or growth conditions used (e.g., Nielsen et al. [53] and Toledo-Arana et al. [20] used strain EGD-e, while we used strain 10403S). Further studies in different L. monocytogenes strains will thus be needed to understand the full complexity of regulatory networks in this pathogen, including those involving σB and ncRNAs.

The quantitative nature of RNA-Seq allowed us to also identify highly transcribed σB-dependent genes, including lmo2158 (which encodes a protein similar to the B. subtilis YwmG), lmo1602 (which encodes an unknown protein), and lmo0539 (which encodes a tagatose-1,6-diphosphate aldolase). Interestingly, none of these genes encode proteins that appear to contribute to any of the presently recognized σB-dependent phenotypes in L. monocytogenes, such as acid resistance [9, 59], oxidative stress resistance [59, 60], or virulence [27, 33, 61, 62]. As there are no published reports of construction and characterization of null mutations in these highly transcribed σB-dependent genes, our data clearly suggest that σB and the σB regulon make additional important contributions to L. monocytogenes physiology that remain to be characterized.

In conjunction with appropriate bioinformatics tools, such as the iterative, dynamic HMM developed in this study to identify putative σB promoters, RNA-Seq data also allowed mapping of approximate transcriptional start and termination sites. Specifically, putative σB-dependent promoters were identified upstream of (i) 49 monocistronic σB-dependent genes, (ii) 15 σB-dependent operons (covering a total of 40 genes), and (iii) 1 σB-dependent ncRNA. By comparison, in the absence of genome wide transcriptional start site data, a previous study that solely relied on HMM and genome sequence data identified putative σB-dependent promoters upstream of only 40 genes that had been identified as σB-dependent by microarray analyses [10]. Our data reported here show that the majority of σB-dependent genes are directly regulated by σB and illustrate the power of combining RNA-Seq data and bioinformatics approaches for characterizing transcriptional regulatory systems. Specifically, combining transcriptional start site information with an HMM that identifies promoter motifs (e.g., the motif for σB-dependent promoters) provides a powerful approach for identifying genes directly regulated by a given transcription factor. This approach facilitates rapid genome-wide identification of putative transcriptional start sites, which currently represents a critical bottleneck in genome-wide characterization of transcriptional regulation and regulatory networks, as many current strategies for promoter mapping (e.g., primer extension, rapid amplification of cDNA ends (RACE-PCR), RNAse protection assays) are time- and labor-intensive.

Conclusions

Using the human foodborne pathogen L. monocytogenes as a model system, we have shown that RNA-Seq provides a powerful approach to (i) rapidly, comprehensively, and quantitatively characterize prokaryotic genome-wide transcription profiles without hybridization bias, and (ii) characterize putative transcriptional start sites and operon structures. We also show that RNA-Seq transcriptomic evaluation of a bacterial strain bearing a deletion in a transcriptional regulator in comparison with its parent strain can provide rapid, comprehensive insights into the blueprints of prokaryotic transcriptional regulation. Such tools and approaches will revolutionize our ability to characterize genome-wide transcriptional regulatory networks, with wide ranging applications from medicine to ecology, e.g., by providing a means to quickly characterize transcriptional networks contributing to pathogen transmission and virulence as well as environmental growth and gene expression in bacteria used for specific purposes, such as bio-remediation. When applied to both genome and transcriptome sequencing, novel high throughput sequencing approaches can also provide rapid and comprehensive characterization of bacterial genomes, representing an important tool for initial rapid characterization of novel and emerging bacterial pathogens.

Methods

Strains and growth conditions

RNA-Seq was performed on the L. monocytogenes parent strain 10403S and a previously described [9] isogenic mutant (ΔsigB, FSL A1-254) with an internal non-polar deletion of sigB, which encodes the stress response alternative sigma factor σB.

Prior to RNA isolation, bacteria were grown in 5 ml Brain Heart Infusion (BHI) broth (BD Difco, Franklin Lakes, NJ) at 37°C with shaking (230 rpm) for 15 h, followed by transfer of a 1% inoculum to 5 ml pre-warmed BHI. After growth to OD600 ~ 0.4, a 1% inoculum was transferred to a 300 ml nephelo flask (Bellco, Vineland, NJ) containing 50 ml pre-warmed BHI. This culture was incubated at 37°C with shaking until cells reached stationary phase (defined as growth to OD600 = 1.0, followed by incubation for an additional 3 h). Two independent growth replicates and RNA isolations were performed for each strain.

RNA isolation, integrity and quality assessment

RNA isolation was performed as previously described [10]. Briefly, RNAProtect bacterial reagent (Qiagen, Valencia, CA) was added according to the manufacturer's instructions to the cultures grown to stationary phase; treated cells were stored at -80°C (for no longer than 24 h) until RNA isolation was performed. Bacterial cells were treated with lysozyme followed by 6 sonication cycles at 18W on ice for 30 s. Total RNA was isolated and purified using the RNeasy Midi kit (Qiagen) according to the manufacturer's protocol; RNA molecules <200 nt in length are not recovered well with this procedure, according to the manufacturer. RNA was eluted from the column using RNase-free water. Total RNA was incubated with RQ1 DNase (Promega, Madison, WI) in the presence of RNasin (Promega) to remove remaining DNA. Subsequently, RNA was purified using two phenol-chloroform extractions and one chloroform extraction, followed by RNA precipitation and resuspension of the RNA in RNAse free TE (10 mM Tris, 1 mM EDTA; pH 8.0; Ambion, Austin, TX). UV spectrophotometry (Nanodrop, Wilmington, DE) was used to quantify and assess purity of the RNA.

Efficacy of the DNase treatment was assessed by TaqMan qPCR analysis of DNA levels for two housekeeping genes, rpoB [63] and gap [33]. qPCR was performed using TaqMan One-Step RT-PCR Master Mix Reagent and the ABI Prism 7000 Sequence Detection System (all from Applied Biosystems, Foster City, CA). Each RNA sample was run in duplicate and standard curves for each target gene were included for each assay to allow for absolute quantification of residual DNA. Data were analyzed using the ABI Prism 7000 Sequence Detection System software as previously described [64]. Normalization and log transformation were performed as described by Kazmierczak et al. [23]. All samples showed log copy numbers ≤ 1.5 and Ct values > 35 for both rpoB and gap, indicating negligible levels of DNA contamination. As a final step, RNA integrity was assessed using the 2100 Bioanalyzer (Agilent, Foster City, CA).

mRNA enrichment

Removal of 16S and 23S rRNA from total RNA was performed using MicrobExpress™ Bacterial mRNA Purification Kit (Ambion) according to the manufacturer's protocol with the exception that no more than 5 μg total RNA was treated per enrichment reaction. Each RNA sample was divided into multiple aliquots of ≤ 5 μg RNA and separate enrichment reactions were performed for each sample. Enriched mRNA samples were pooled and run on the 2100 Bioanalyzer (Agilent) to confirm reduction of 16S and 23S rRNA prior to preparation of cDNA fragment libraries.

Preparation of cDNA fragment libraries

Ambion RNA fragmentation reagents were used to generate 60-200 nucleotide RNA fragments with an input of 100 ng of mRNA. Following precipitation of fragmented RNA, first strand cDNA synthesis was performed using random N6 primers and Superscript II Reverse Transcriptase, followed by second strand cDNA synthesis using RNaseH and DNA pol I (Invitrogen, CA). Double-stranded cDNA was purified using Qiaquick PCR spin columns according to the manufacturer's protocol (Qiagen).

RNA-Seq using the Illumina Genome Analyzer

The Illumina Genomic DNA Sample Prep kit (Illumina, Inc., San Diego, CA) was used according to the manufacturer's protocol to process double-stranded cDNA for RNA-Seq, including end repair, A-tailing, adapter ligation, size selection, and pre-amplification. Amplified material was loaded onto independent flow cells; sequencing was carried out by running 36 cycles on the Illumina Genome Analyzer.

The quality of the RNA-Seq reads was analyzed by assessing the relationship between the quality score and error probability; these analyses were performed on Illumina RNA-Seq quality scores that were converted to phred format (http://www.phrap.com/phred/). Quality scores are reported in Additional file 9: Distribution of quality scores for all RNA-Seq runs.
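
The relationship between a phred-format quality score and the error probability can be illustrated with a minimal Python sketch. This is not the code used in this study; the function names are hypothetical, and the conversion from Illumina scores assumes that the input uses the log-odds (Solexa-style) definition, which is an assumption about the encoding rather than something stated above.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


def solexa_to_phred(q_solexa):
    """Convert a log-odds (Solexa-style) score to a phred score.

    Assumes Q_solexa = -10 * log10(p / (1 - p)); this input encoding is an
    assumption made for illustration only.
    """
    return 10.0 * math.log10(10 ** (q_solexa / 10.0) + 1.0)


if __name__ == "__main__":
    # Print the error probability for a few example phred scores.
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

For high scores the two scales are nearly identical, so the conversion mainly affects low-quality base calls.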

RNA-Seq data will be available in the NCBI GEO Short Read Archives: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15651.

RNA-Seq alignment and coverage

The program nucmer, which is part of the MUMmer package (http://mummer.sourceforge.net/), was used to align the 10403S unfinished genome sequences (available at http://www.broad.mit.edu/annotation/genome/listeria_group/MultiHome.html as supercontigs 5.1 to 5.21) against the finished genome sequence of the L. monocytogenes reference strain EGD-e [18] to create a pseudochromosome for 10403S. Creation of the 10403S pseudochromosome was performed using the order and orientation of the 10403S supercontigs provided by the alignment with EGD-e; the assembled pseudochromosome was 2.87 Mb long. The annotation of the genes in the individual 10403S supercontigs, as provided by the Broad Institute (http://www.broad.mit.edu/annotation/genome/listeria_group/MultiHome.html), was then mapped to the 10403S pseudochromosome (Additional file 10: Genbank (gbk) file with ncRNAs identified here). The 5S, 16S and 23S rRNA genes as well as the various tRNA genes in 10403S were identified using blastn and the EGD-e annotated rRNA and tRNA genes as a reference (Genbank ID: AL591824).

Based on quantitative analyses of RNA-Seq data, throughout this manuscript, transcript levels of a given gene are reported as the Gene Expression Index (GEI), which is expressed as number of reads per 100 bases. To obtain the GEI, the 10403S pseudochromosome was used to align Illumina RNA-Seq reads. These alignments were performed using the whole genome alignment software Eland (Illumina), which reports unique alignments of the first 32 bases of each read, allowing up to 2 mismatches. Coverage at each base position along the pseudochromosome was calculated by enumerating the number of reads that align to a given base. The coverage for each base from the first to last nt in an annotated CDS was summed then divided by 32 (i.e., the length of each aligned read) to obtain the RNA-Seq coverage for that gene before normalization. The following data were discarded prior to further analyses: (i) reads with more than 2 mismatches, (ii) reads that matched to multiple locations, (iii) reads that did not map to the chromosome, and (iv) reads that mapped to the 16S or 23S genes (Table 1). Reads identified as "matching two locations" did not include those matching rRNA genes as the 10403S pseudochromosome created for this study was designed with only one unique rRNA gene sequence. Reads matching the 16S and 23S genes were removed prior to normalizing the total number of aligned reads across the four samples because of the technical bias introduced by our deliberate partial removal of 16S and 23S transcripts from the samples. Despite removal of 16S and 23S rRNA, in a given run, between 1,860,817 and 3,138,329 reads aligned to the 23S gene and between 434,263 and 760,863 reads aligned to the 16S gene. In a given run, between 101,419 and 242,246 reads matched the 5S rRNA gene and between 7,778 and 62,699 reads matched the various tRNA genes present in the pseudochromosome.
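
The per-base coverage and per-gene read count described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in this study: the function names are hypothetical, coordinates are 0-based with half-open gene intervals, strand is ignored, and the input is assumed to be the start positions of reads that already passed the filters listed above.

```python
READ_LEN = 32  # length of the aligned portion of each read, as stated in the text


def per_base_coverage(read_starts, genome_len, read_len=READ_LEN):
    """Count, for every base, the number of aligned reads overlapping it."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_len + 1)
    for start in read_starts:
        end = min(start + read_len, genome_len)
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    running = 0
    for delta in diff[:genome_len]:
        running += delta
        coverage.append(running)
    return coverage


def gene_read_equivalents(coverage, gene_start, gene_end, read_len=READ_LEN):
    """Sum coverage over [gene_start, gene_end) and divide by the read length."""
    return sum(coverage[gene_start:gene_end]) / read_len


if __name__ == "__main__":
    cov = per_base_coverage([0, 10], genome_len=100)
    print(gene_read_equivalents(cov, 0, 100))
```

Dividing the summed coverage by 32 means that a read lying only partly inside a gene contributes a fraction of a read to that gene.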

Because of the inherent differences in the total number of reads among the four runs, the total number of reads for each run was normalized to the run with the highest coverage (i.e. ΔsigB replicate 2, Table 1). The ratio of total number of reads for ΔsigB replicate 2 to the total number of reads for 10403S replicate 1, 10403S replicate 2, or ΔsigB replicate 1 was used as a multiplier to normalize the approximate number of reads matching a given gene (Table 1). The GEI was then obtained by dividing the normalized number of reads matching each gene by the gene length. The average GEI was the number of reads that match each nt in a given gene after normalization; this value represented the average of the 2 biological replicates for a given strain and is presented as reads per 100 bases (as opposed to reads per 1 base) to simplify identification of differences. The distribution of the coefficient of variation for each gene between replicates is depicted in Additional file 11: Coefficient of variation among RNA-Seq replicates by strain.
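
The normalization and GEI calculation can be sketched as follows. The function names and the read totals in the example are hypothetical and chosen only for illustration; the actual totals are those given in Table 1.

```python
def normalization_multipliers(total_reads):
    """Scale every run to the run with the highest total number of reads."""
    reference = max(total_reads.values())
    return {run: reference / n for run, n in total_reads.items()}


def gei(reads_in_gene, multiplier, gene_length):
    """Gene Expression Index: normalized reads per 100 bases of gene length."""
    return reads_in_gene * multiplier / gene_length * 100.0


def average_gei(replicate_geis):
    """Average GEI across the biological replicates of one strain."""
    return sum(replicate_geis) / len(replicate_geis)


if __name__ == "__main__":
    # Illustrative totals only; not the values from Table 1.
    totals = {"wt_rep1": 1_000_000, "wt_rep2": 2_000_000}
    mult = normalization_multipliers(totals)
    geis = [gei(50, mult["wt_rep1"], 1000), gei(120, mult["wt_rep2"], 1000)]
    print(average_gei(geis))
```

The run with the highest total receives a multiplier of 1, so its read counts are left unchanged.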

Identification of transcribed annotated CDS

Sequence reads matching annotated CDS in the 10403S genome were used to identify those annotated CDS that were transcribed under the experimental conditions used. As our RNA-Seq analyses included both a wildtype strain and an isogenic mutant with a deletion in a transcriptional regulator (i.e., the alternative sigma factor σB), our data also provide a novel approach for characterizing background RNA-Seq coverage for genes that are not transcribed, similar to a previous approach that used background RNA-Seq coverage of so-called "gene deserts" in human chromosomes to characterize background average GEI [65]. The observations that (i) eight genes that showed average GEI between 8.64 reads and 96.43 reads per 100 bases in the parent strain showed 0 reads per 100 bases in the ΔsigB strain; (ii) 42 genes with average GEI of 1.21 to 73.81 reads per 100 bases in the parent strain showed between 0.01 and 0.7 reads per 100 bases in the ΔsigB strain; and (iii) 0.7 reads per 100 bases is the approximate median of the average GEI in σB-dependent genes in the ΔsigB strain, clearly indicate that extremely low background RNA-Seq coverage is expected for genes that are not transcribed. Overall, 50/96 σB-dependent genes show an average GEI < 0.7 in the ΔsigB strain (Additional file 7: Genes up-regulated by σB); genes with GEI < 0.7 reads are overrepresented in the ΔsigB strain (Figure 6). It is not unexpected that some σB-dependent genes showed average GEI ≥ 0.7 as a number of genes are not solely dependent on σB and will still be transcribed in the absence of σB (e.g., opuCABCD operon [32, 66, 67]). Based on these observations, we set an average GEI ≥ 0.7 as a conservative cut-off to identify genes that are transcribed (i.e., we define genes with average GEI ≥ 0.7 as being transcribed as the RNA-Seq data indicate that non-specific reads [e.g., from DNA] are highly unlikely to provide average GEI ≥ 0.7).

Figure 6

Average gene expression indices for σB-dependent genes. The histogram shows the average GEI of σB-dependent genes in 10403S (red) and the ΔsigB (blue) strains. GEIs were grouped in intervals of 0.7, i.e., the first bar represents genes with GEIs between 0 and 0.7; the second bar represents GEIs between > 0.7 and ≤ 1.4, etc. Genes with average GEI ≥ 50 were grouped together.

Depending on RNA-Seq coverage, genes were classified into four categories, including (i) not transcribed (average GEI < 0.7), (ii) low transcript levels (average GEI ≥ 0.7 and < 10), (iii) medium transcript levels (average GEI ≥ 10 and < 25), and (iv) high transcript levels (average GEI ≥ 25). While cut-offs between low, medium, and high transcript level categories were somewhat arbitrary, they were chosen to yield a relative distribution of genes into these categories similar to the distribution of yeast genes into low, medium, and high expression categories reported previously by Nagalakshmi et al. [15].
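
The four categories translate directly into a small classification function. The sketch below uses the cut-offs given above; the function name and the category labels are hypothetical.

```python
def classify_transcript_level(avg_gei):
    """Assign a gene to a transcript-level category from its average GEI."""
    if avg_gei < 0.7:
        return "not transcribed"
    if avg_gei < 10:
        return "low"
    if avg_gei < 25:
        return "medium"
    return "high"


if __name__ == "__main__":
    for value in (0.3, 0.7, 12.5, 96.43):
        print(value, classify_transcript_level(value))
```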

Annotation of Rho-independent terminators and putative operons

Potential operons were manually annotated based on the continuity of a similar level of RNA-Seq coverage across consecutive genes and the (i) absence of putative Rho-independent terminators between genes, and/or (ii) presence of a putative Rho-independent terminator at the end of a putative operon. Putative Rho-independent terminators in the 10403S pseudochromosome were identified using the program TransTermHP v2.04 [68].

Discovery and annotation of regions transcribing ncRNAs

To aid in identification of transcribed ncRNAs, ncRNAs previously identified in L. monocytogenes EGD-e [19–22] were mapped onto the 10403S pseudochromosome and, where supported by RNA-Seq coverage, were identified as transcribed in 10403S in this study.

New putative ncRNAs (i.e., ncRNAs not previously reported or previously identified by Rfam) were manually identified using the genome browser Artemis [69]. Specifically, regions not matching annotated genes, but showing contiguous coverage by RNA-Seq reads (i.e., regions that contain at least 100 bp completely covered by RNA-Seq reads) were designated putative ncRNAs. Further, RNA-Seq reads that did not cover an entire annotated CDS, but showed partial contiguous coverage within a CDS, were also designated as putative ncRNAs. All ncRNAs, including those reported in previous publications [19, 20, 22, 53], those identified by Rfam, and those with no matches to the Rfam database were annotated into a Genbank (gbk) file that is available as Additional file 10: Genbank (gbk) file with ncRNAs identified here. ncRNAs identified by RNA-Seq, but with no matches to the Rfam database were designated "putative ncRNA" and received designations from rli64 to rli70. The presence of rho-independent transcriptional terminators was used to assign the strand of putative ncRNAs. For two instances where terminators were not observed, the ncRNAs were annotated on both strands.
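
Although new putative ncRNAs were identified manually in Artemis, the first criterion (at least 100 bp of contiguous RNA-Seq coverage outside annotated genes) can be expressed as a simple scan. The sketch below is an illustration only; the function name is hypothetical, coordinates are 0-based and half-open, and it does not cover the second criterion (partial contiguous coverage within a CDS) or strand assignment by terminators.

```python
def candidate_ncrna_regions(coverage, gene_intervals, min_len=100):
    """Return [start, end) runs of covered, unannotated bases >= min_len long."""
    annotated = [False] * len(coverage)
    for start, end in gene_intervals:
        for i in range(start, min(end, len(coverage))):
            annotated[i] = True

    regions = []
    run_start = None
    for i, depth in enumerate(coverage):
        eligible = depth > 0 and not annotated[i]
        if eligible and run_start is None:
            run_start = i  # a new covered, unannotated run begins here
        elif not eligible and run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(coverage) - run_start >= min_len:
        regions.append((run_start, len(coverage)))
    return regions


if __name__ == "__main__":
    cov = [0] * 50 + [3] * 120 + [0] * 30 + [2] * 60 + [1] * 150
    print(candidate_ncrna_regions(cov, [(200, 260)]))
```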

Differential expression analysis

To identify genes that showed significantly different transcript levels in the parent strain (10403S) and the ΔsigB strain, statistical analyses were performed using the normalized RNA-Seq coverage of each coding gene (as annotated by the Broad Institute). Normalized RNA-Seq coverage (i.e. the number of reads that match an annotated CDS after normalization across runs) was used in lieu of the GEI (in which the normalized RNA-Seq coverage number is divided by the gene length) for statistical analyses. Corresponding analyses were also performed for each region encoding a putative ncRNA transcript identified as described above. A coverage file of normalized RNA-Seq coverage is available in Additional file 12: Coverage file with the normalized RNA-Seq coverage for the 4 RNA-Seq runs.

For each gene, a binomial probability was calculated for the normalized RNA-Seq coverage, using each of the four possible comparisons between the 10403S and ΔsigB transcripts (i.e. 10403S replicate 1 vs ΔsigB replicate 1; 10403S replicate 1 vs ΔsigB replicate 2; 10403S replicate 2 vs ΔsigB replicate 1; 10403S replicate 2 vs ΔsigB replicate 2). The binomial probability was calculated under the hypothesis that genes that are not regulated by σB will show the same normalized number of reads in the two strains (p = 0.5 and q = 0.5). For a gene to be considered up-regulated by σB, the binomial probability of observing as many reads in the ΔsigB strain as those observed for 10403S had to be < 0.05 for each of the four possible combinations. Conversely, for a gene to be considered down-regulated by σB, the binomial probability of observing as many reads as those observed for ΔsigB had to have q-values < 0.05 for each of the four possible combinations. To control for multiple comparisons, a False Discovery Rate (FDR) approach was used. q-values (representing the FDR) were calculated using the program Q-Value [70] for R. Only genes with q-values < 0.05 and fold change ≥ 2 or ≤ 0.5 among all four possible comparisons between 10403S and ΔsigB were considered significantly up-regulated or down-regulated by σB.
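
The pairwise comparison scheme can be sketched as follows. This is one possible reading of the test described above, not the code used in this study: it computes a one-sided lower-tail binomial probability for the ΔsigB read count under p = 0.5, rounds normalized coverage to integers, and applies the 0.05 threshold to raw probabilities, whereas the study applied it to q-values from the Q-Value program. All function names are hypothetical.

```python
from itertools import product
from math import comb


def lower_tail_half(k, n):
    """P(X <= k) for X ~ Binomial(n, 0.5), computed with exact integers."""
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


def pairwise_tests(wt_reps, mut_reps):
    """Tail probability and fold change for each parent vs mutant replicate pair."""
    results = []
    for wt, mut in product(wt_reps, mut_reps):
        wt_n, mut_n = round(wt), round(mut)
        p_value = lower_tail_half(mut_n, wt_n + mut_n)
        fold = wt / mut if mut > 0 else float("inf")
        results.append((p_value, fold))
    return results


def is_upregulated(results, alpha=0.05, min_fold=2.0):
    """Require significance and fold change in every pairwise comparison."""
    return all(p < alpha and fold >= min_fold for p, fold in results)


if __name__ == "__main__":
    # Illustrative normalized coverage values for two replicates per strain.
    print(is_upregulated(pairwise_tests([100, 120], [10, 12])))
    print(is_upregulated(pairwise_tests([50, 55], [48, 52])))
```

Requiring all four comparisons to pass makes the call conservative, because a gene is rejected if any single replicate pair fails either criterion.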

Iterative HMM-based promoter identification

An initial training set containing 17 experimentally validated σB-dependent promoter motifs was used to build a Hidden Markov Model (HMM) of these motifs (Additional file 13: σB-dependent promoters used for HMM search). HMM construction and searches were performed using the program hmmer version 1.8.5. The HMM was constructed from unaligned sequences (using hmmt) and then used to search the 10403S pseudochromosome (using the hmmls tool). The null frequencies of each nucleotide used were those observed in the L. monocytogenes genome (i.e., A/T = 0.31 and G/C = 0.19).

To identify new promoter motifs that could be added to the training set, we used an iterative HMM approach. In each HMM iteration, the only hits added to the training set were those that met four conservative criteria: (i) location within 100 bp upstream of the start codon of an annotated CDS (or 100 bp upstream of the first nt for the manually annotated noncoding genes), (ii) q-values < 0.05 (from the binomial probabilities) for σB dependence of a given gene (based on RNA-Seq data), (iii) fold change ≥ 2 among all possible comparisons between 10403S and ΔsigB, and (iv) a score higher than the lowest score for which 50% of the motifs fall in noncoding regions (i.e. for each iteration, we adaptively chose a threshold score such that 50% of the motifs that score higher than this threshold lie in noncoding regions). After adding all hits that met these criteria (in a given iteration) to the training set, a new model was built and used to search the 10403S pseudochromosome. This process was repeated until no new motifs could be added to the training set; the final training set can be found in Additional file 13: σB-dependent promoters used for HMM search. When no new motifs that matched our criteria were discovered, the model was considered complete and the results from the last search were used for promoter identification. The final model was used to search the 10403S pseudochromosome for potential σB promoters. Potential σB promoters identified by this HMM upstream of σB-dependent genes and the σB-dependent putative ncRNA were visually evaluated. Potential σB promoters identified by HMM were considered probable σB promoters if the promoter was within 50 bp upstream of the transcriptional start site (as identified by RNA-Seq). In some instances, the transcriptional start site was not discernable due to an upstream gene transcript that overlapped with a σB-dependent gene transcript or because the gene had a low average relative normalized RNA-Seq coverage. For these instances, putative promoters were considered if they were located within 200 bp from the start codon of the σB-dependent gene. σB-dependent genes with probable σB promoters are described in Figure 7; the σB promoter sequence logo is presented in Figure 5 (http://weblogo.berkeley.edu/ [71]).
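
The control flow of the iterative search and the adaptive score threshold can be sketched as follows. This sketch does not build or search an HMM; model construction and searching are represented by a caller-supplied function. All names are hypothetical, the threshold is interpreted as "at least 50%" of hits at or above the score being noncoding, and scores are assumed to be distinct.

```python
def adaptive_threshold(hits, min_noncoding_fraction=0.5):
    """Lowest score s such that the required fraction of hits scoring >= s
    lie in noncoding regions; None if no score qualifies.

    hits: iterable of (score, is_noncoding) pairs with distinct scores.
    """
    ordered = sorted(hits, key=lambda h: h[0], reverse=True)
    threshold = None
    noncoding = 0
    for rank, (score, is_noncoding) in enumerate(ordered, start=1):
        noncoding += bool(is_noncoding)
        if noncoding / rank >= min_noncoding_fraction:
            threshold = score
    return threshold


def grow_training_set(initial, build_and_search, passes_criteria):
    """Repeat build-search-filter until no new motif can be added."""
    training_set = set(initial)
    while True:
        hits = build_and_search(training_set)
        new_hits = {h for h in hits if passes_criteria(h)} - training_set
        if not new_hits:
            return training_set
        training_set |= new_hits


if __name__ == "__main__":
    example = [(9.0, True), (8.0, True), (7.0, False),
               (6.0, False), (5.0, False), (4.0, False)]
    print(adaptive_threshold(example))
```

In this arrangement the score threshold would be recomputed inside `passes_criteria` at each iteration, alongside the location, q-value, and fold change criteria.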

Figure 7

Alignment of the 65 putative σB-dependent promoters identified in this study. EGD-e homologs of the genes or operons downstream of a given promoter are indicated on the left. Positions 3 to 6 in the alignment represent the -35 region, while positions 24 to 29 represent the -10 region. Darker nucleotides are more conserved than lighter nucleotides in the alignment. Boxed gene names indicate promoters that have been experimentally validated (e.g., by RACE-PCR).

Correlation of RNA-Seq relative coverage (GEI) with TaqMan absolute transcript copy number

Average GEI was correlated with absolute transcript copy numbers quantified by TaqMan qRT-PCR. qRT-PCR-based transcript level data for selected genes in L. monocytogenes grown under the same conditions used here (i.e., stationary phase) were obtained from previous studies and unpublished work (see Additional file 2: RNA-Seq average GEI and TaqMan qRT-PCR absolute copy number); qRT-PCR methods are detailed in Raengpradub et al. [10]. These qRT-PCR data were used to calculate absolute transcript copy numbers (using a standard curve as described by Sue et al. [64]); values were log transformed.
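
As an illustration of this comparison, the following Python sketch log-transforms a set of values and fits an ordinary least-squares line. It is a sketch under stated assumptions rather than the analysis performed here (the analyses in this study were carried out in R): the base-10 logarithm is assumed because the text does not name a base, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log10_transform(values: Sequence[float]) -> List[float]:
    """Base-10 log transform; the base is an assumption, not stated in the text."""
    return [math.log10(v) for v in values]


def least_squares(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

A call such as `least_squares(log10_transform(copy_numbers), gei_values)` would regress average GEI on log copy number; whether GEI is also log transformed before fitting is a choice left open in this sketch.
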

Statistical Analyses

One-sided Wilcoxon rank sum tests were used to assess whether genes in certain role categories showed lower or higher average GEI in 10403S than genes in other role categories. One-sided Fisher's exact tests were used to assess whether σB-dependent genes were overrepresented in certain TIGR role categories (http://cmr.jcvi.org/cgi-bin/CMR/RoleIds.cgi). Linear regression analysis was used to assess correlations between average GEI and qRT-PCR data, as well as between codon bias and average GEI in 10403S. The effective number of codons used in a gene (Nc), a measure of codon bias, was calculated using the program "chips" implemented in the EMBOSS package [72]. All tests were carried out in R (version 2.7.0; http://www.r-project.org/). Correction for multiple testing was performed using the procedure reported by Benjamini & Hochberg [73], as implemented in the program Q-Value [70]. Significance was set at 5%.
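
Two of these procedures are simple enough to sketch directly. The Python code below illustrates a one-sided Fisher's exact test for overrepresentation and the Benjamini & Hochberg adjustment. It is not the R or Q-Value code used in this study; the function names and the layout of the 2×2 table are assumptions made for the example.

```python
from math import comb
from typing import List, Sequence


def fisher_overrepresentation(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value, P(X >= a).

    Assumed 2x2 layout:
        a = sigma B-dependent genes in the role category
        b = sigma B-dependent genes in other categories
        c = other genes in the role category
        d = other genes in other categories
    """
    total = a + b + c + d
    in_category = a + c
    dependent = a + b
    upper = min(in_category, dependent)
    # Hypergeometric upper tail: sum over tables at least as extreme as observed.
    tail = sum(
        comb(in_category, x) * comb(total - in_category, dependent - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(total, dependent)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini & Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With a 5% significance level, a role category would be reported as enriched when its adjusted value from `benjamini_hochberg` falls below 0.05.
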

Data access

RNA-Seq data will be available in the NCBI GEO Short Read Archives. All RNA-Seq data are provided in an Access database file (Additional file 4: Access database). This database contains information on the annotated CDS and ncRNAs, including their 10403S locus name, 10403S start and end coordinates, length, strand, EGD-e locus, EGD-e gene name, EGD-e common name, EGD-e role category, codon bias, GEI, average GEI in the 10403S and ΔsigB strains, fold change for the four possible comparisons involving the two replicates of the 10403S and ΔsigB strains, q-values of the binomial tests, operon annotation, promoter annotation, the list of σB-dependent genes identified in this study, and data from three other microarray studies of the σB regulon in L. monocytogenes: Ollinger et al. [12], Hain et al. [11], and Raengpradub et al. [10].