Background

In prokaryotes, several adjacent genes on the same strand of DNA are often co-transcribed as a polycistronic mRNA, forming a multi-gene transcription unit called an operon. Furthermore, in addition to protein- and RNA-coding genes, some parts of a non-coding sequence and the opposite strand of a coding sequence can also be transcribed under certain conditions, generating non-coding RNAs (ncRNAs) [1, 2] and anti-sense RNAs (asRNAs) [3, 4], respectively. Accumulating body of evidence suggest that ncRNAs [1, 2] and asRNAs [3, 4] may play important roles in the physiology of prokaryotes. Therefore, a full understanding of the transcriptomes of prokaryotic cells is necessary to annotate the functional elements in their genomes and to reconstruct the gene transcriptional networks in their cells. However, experimental determination of operon structures, ncRNAs and asRNAs by traditional molecular biology methods is time-consuming and labour-intensive. As a result, no single prokaryote has so far had all of its operon structures, ncRNA and asRNAs characterized using such methods. For instance, even for the most well-studied model bacteria E. coli K12 and B. subtilis, only 3,409 [5] and 736 [6] operons have been determined in their genomes using these methods, respectively, after decades of research while not each of their genes has been assigned to an operon. On the other hand, although a great progress has been made in computational prediction of operons [714] and small RNA genes [1518], the accuracy of these predictors is still low [13, 19], and they can only predict the static longest possible operons without considering possible alternative operons [714].

In the past few years, increasing applications in prokaryotes of whole genome directional (strand-specific) tiling array and directional RNA-seq techniques have completely changed our way to study and our view of the architecture and complexity of prokaryotic transcriptomes (for a thorough review, see [2022]). For example, using a combination of whole genome directional tiling array and RNA-seq techniques, Guell et al.[23] found that operon utilizations in the reduced parasitic M. pneumoniae genome were highly variable and dynamic, almost half of the 139 identified multi-gene operons showed varying levels of (dynamic) expression in a staircase-like manner. Under different conditions, large operons could be transcribed as smaller sub-operons, resulting in many alternative transcripts, suggesting that the operon structures in M. Pneumonia were highly complex and dynamic, a phenomenon that was comparable to the alternative splicing in eukaryotes [23]. They also identified a large number of ncRNAs and asRNAs expressed under various culture conditions, hence a much larger portion of the genome was transcribed than originally anticipated [23]. Similar results were observed in many other taxonomically distinct species, such as epsilon proteobacteria H. pylori[24]; firmicutes B. sutiblis[25] and B. anthracis[26]; cyanobacteria Synechocystis sp. PCC6803 [27]; euryarchaeota Halobacterium salinarum NRC-1 [28]; and bacteroidia Porphyromonas gingivalis[29], to only name a few. However, not all these surprising observations were noted in some other studies. For instance, prevalent alternative operon utilizations were not reported in many studies in a variety of prokaryotes, such as B. subtilis[30], Salmonella entericaserovar Typhi [31], Burkholderia cenocepacia[32], Caulobacter crescentus[33], Staphylococcus aureus[34], Vibrio cholera[35], Chlamydia trachomatis[36], Chlamydia pneumonia[37], Clostridium beijerinckii NCIMB 8052 [38], Listeria monocytogenes[39], Anabaena sp. strain PCC 7120 [40], Synechococcuselongatus PCC 7942 [41], and Sulfolobus solfataricus P2 [42]. Contradictory results have also been reported. For instance, although Rasmussen et al.[30] did not note alternative operon utilizations in B. subtilis, more recently, Nicolas et al.[25] observed highly prevalent condition-dependent operon utilizations using a similar tiling array technique. Moreover, although most of these studies found extensive asRNA and ncRNA transcriptions, the levels of their prevalence could vary quite differently from different studies even in the same strains. For instance, although Selinger et al.[43] reported that up to 4,000 E. coli K12 genes had asRNA transcriptions using directional tilling arrays, Dornenburg et al.[44] only identified about 1,000 asRNAs in the same strain under similar growth conditions using directional RNA-seq. These discrepancies can be due to different experimental conditions and methods used in these studies. Nevertheless, they inevitably raise the question: are the prevalent alternative operon utilizations, asRNA and ncRNA transcriptions ubiquitous phenomena in all prokaryotes or only prevalent in some specific species?

E. coli K12 is probably the best known free living model organism [45, 46], where novel biological hypotheses and computational algorithms can be tested. Indeed, it is mainly through the studies in E. coli K12 that we have understood many fundamental biological processes, including the mechanisms of gene transcriptional regulation [4749]. As a result, the E. coli K12 genome is in fact the best understood among all the model organisms in almost all aspects [50, 51]. Since the finishing of its genome sequence in 1997 [52], almost all newly developed high throughput technologies have been applied to this bacterium. As a result, 4,501 genes have been experimentally or computationally identified in the MS1655 strain, and 3,384 (75%) of them have been assigned a biochemical function [51]. Of these 3,384 genes with an assigned function, 2,941 (87%) had their functions characterized experimentally (66% of the total encoded genes) [46, 51]. The products of the 918 genes with experimentally characterized function catalyze 1,008 metabolic reactions, which constitute the best understood metabolic network [51]. As for its transcriptomes and transcriptional regulatory networks, RegulonDB database [53] that is dedicated to compiling all experimentally verified relevant information in E. coli K12 has documented 3,409 operons (including singleton genes), 1,878 promoters, 1,940 transcription factor binding sites of 175 transcription factors (TF) in the regulatory region of 703 operons, and 2,697 TF-target gene regulations [53]. Furthermore, more than a hundred of ncRNAs and asRNAs have been experimentally identified in the E. coli[5456]. More recently, Cho et al.[57] applied a combination of tiling array, 5’-end RNA deep sequencing, RNAP ChIP-chip and proteomics analyses to reveal the transcription unit architecture in the E. coli K12 genome. They identified 4,661 transcription units, many alternative Transcription Start Sites (TSSs), alternative operons and ncRNAs under a few cultural conditions. In another study, Mendoza-Vargas et al.[58] identified ~1,500 new TSSs using a modified 5’-RACE method and a 5’-end RNA sequencing method in the genome. Consequently, after more than 40 years intensive molecular genetics research in this bacterium, including the recent high throughput studies [43, 44, 57, 58], our experimentally validated knowledge of the transcriptome and gene regulatory systems in E. coli K12 is the most complete currently available for any organism [46, 51]. However, ironically, our understanding about the complexity of the transcriptomes in this model bacterium is rather limited compared to its counterpart model Gram-positive bacterium B. subtilis[25]. In particular, large scale dynamic and alternative operon utilizations under various conditions have not been reported in E. coli K12, so do they exist in this bacterium? Furthermore, how many asRNAs and ncRNAs are transcribed in E. coli K12 given the aforementioned inconsistent results [43, 44]?

Technically, compared to directional tiling array techniques, directional RNA-seq methods are more suitable and powerful tools for understanding the complexity of the prokaryotic transcriptomes due to their single-nucleotide resolution, higher dynamic range, and lower noise levels, thus they have gained increasing popularity [59]. One important step in RNA-seq data analysis is to accurately assemble all meaningful transcripts in their full-length, so that correct conclusions can be drawn from tens of thousands of RNA-seq short reads generated by next generation sequencing (NGS) technologies. However, it has been recently released [23, 24, 28, 29, 60, 61] and we will indicate later in this paper, that the coverage of reads generated by the current RNA-seq techniques on transcribed regions is highly non-uniform. More seriously, there are even numerous uncovered parts in transcribed regions, leading to gaps in otherwise a continuous mapping in the region [6267]. These highly non-uniform coverage and uncovered gaps make the transcripts assembly and quantitative analyses highly challenging tasks [23, 60, 6871]. Several technical problems in the current RNA-seq library construction protocols and sequencing technologies have been identified responsible for the non-uniform coverage and gaps. First, the chemical RNA fragmentation methods used in many protocols may have a bias to break or degrade some sequences [72]. Second, the random primer based reverse transcription may preferentially transcribe some sequences [66, 73]. Third, ligases may preferentially link the adaptors to some sequences [7476]. Fourth, PCR amplification is well-known for introducing GC content-dependent bias in libraries [7780]. Fifth, it was recently found that sequencing errors could be biased to some specific sequences, making such sequences missing from the reads [81]. Moreover, prokaryotic RNAs are more labile than their counterparts in eukaryotes, thus segments of some RNAs can be more easily lost during the library preparation. Although some of these problems can be avoided by new technical development, such as using FRET-seq for amplification-free sequencing to avoid GC content-dependent PCR bias [82], or using single RNA molecular sequencing for longer reads to ease the assembly problem [83, 84], no effective routine technique has yet been developed to avoid all these problems.

On the other hand, although several transcriptome assemblers using RNA-seq short reads have been developed in the past few years, they are mainly for reconstructing alternative isoforms in eukaryotes [70]. These assemblers can be classified into two basic categories [70]: the reference-based assemblers when a reference genome sequence is used, and the de novo assemblers when a reference genome is not used. The reference-based assemblers usually involve two steps: RNA-seq reads are first mapped to the reference genome using an aligner, such as BLAT [85], TopHat [86] or Bowtie [87], and then a graph representing all possible isoforms from overlapping reads is constructed, and the isoforms are resolved by traversing the graph. Examples of this strategy include Cufflinks [71] and Scripture [88]. The de novo assemblers such as Trinity [89], Oases [90], TransAByss [91], Rnnotator [92], and Multiple-k [93], generally assemble isoforms based on a De Bruijn graph constructed using overlapping reads. The advantage of de novo strategy is that it can assemble transcripts when a reference genome is not available and can recover transcripts that are missing in the genome assembly. However, de novo transcriptome assembly is very sensitive to sequencing errors, in particular, missing and chimerical reads in the dataset, thus their accuracy is generally lower than the reference-based approaches [70].

De novo transcriptome assembly in prokaryotes can also be more challenging in prokaryotes owing to the prevalence of uncovered gaps caused by the aforementioned technical reasons and the unique prosperities of their RNAs. Fortunately, with thousands of sequenced prokaryotic genomes available now, transcriptome assembly in prokaryotes can often be done using the reference-based approaches. However, the only reference-based transcriptome assembler for prokaryotes that we are aware of is a Hidden Markov Model (HMM)-based method for reconstructing operons in Bacillus anthracis[94], yet no tool was delivered from this research. Furthermore, there are at least two limitations in this method. First, the prevalently uncovered gaps were not explicitly treated in this method [94], thus the interrupted partial transcripts could not be effectively bridged. Second, although this method attempted to model transcripts of different transcription levels using different expression states, it did not allow transitions among the states [94]. Thus, without an effective method to correct the high non-uniformity of the read coverage along a transcript [65, 72, 73, 75, 81], this method can break a transcript into smaller fragments. Because of the lack of a good prokaryotic assembler, currently prokaryotic transcripts were assembled by either simply stitching the two covered segments if the gap between them is shorter than a cutoff [26], or determining 5’ and 3’ ends of transcripts via a probability-based approach [41], or relying on an additional source of information for the assembly, such as tiling array data that tend to have a more even and consecutive coverage along transcribed regions albeit at lower resolution [23, 25]. As RNA-seq becomes a routine technique for probing transcriptomes in prokaryotes, an efficient and accurate full-length transcripts assembly algorithm and tool tailored to prokaryotes are urgently needed in the research community.

To gain a better understanding of the complexity of the transcriptomes in E. coli K12, we have profiled the transcriptomes of the bacterium under different culture conditions and growth phases using a highly specific directional RNA-seq technique that can capture various types of transcripts in the cells, including mRNAs, asRNAs, and ncRNAs. To assemble all types of full length transcripts using the directional RNA-seq short reads, we have developed a new Hidden Markov Model based algorithm, TruHMM (TRancription Unit assembly by a Hidden Markov Model), attempting to addresses the highly non-uniform read coverage and uncovered gap problems of current RNA-seq techniques. TruHMM differs from the earlier HMM-based algorithm [94] in several ways (for details see Methods and Discussion). In particular, TruHMM overcomes the aforementioned limitations of the earlier method by allowing a transcript to have highly non-uniform coverage at different positions, and explicitly addressing the uncovered gap problem using a sliding window-based centroid read counting strategy in a pre-processing step. Furthermore, TruHmm can also predict alternative operons and TSSs of the assembled transcripts. When evaluated on sets of known operons, asRNAs and ncRNAs in E. coli K12, TruHMM was able to assemble various types of transcripts with rather high accuracy. The parameters trained in E. coli K12 can be applied to an earlier directional RNA-seq dataset in H. pylori[24] with similarly high accuracy, and vice versa, thus TruHMM is also very robust. Based on the transcripts assembled in TruHMM, we found that 46.9 ~ 63.4% of expressed operons were utilized in their putative alternative forms, 72.23 ~ 89.54% open reading frames had putative asRNA transcriptions and 51.37 ~ 72.74% intergenic regions had putative ncRNA transcriptions under different culture conditions and growth phases. Thus, it seems that there are more prevalent alternative operon utilizations as well as asRNA and ncRNA transcriptions in E. coli K12 than originally anticipated, and they may play important roles in the physiology of the bacterium.

Results

Our directional RNA-seq libraries are highly strand-specific and can capture various types of RNAs

We prepared the directional RNA-seq libraries from seven E. coli K12 samples collected at the log phase growth in LB, and different time points under heat shock (HS) or phosphorus starvation (M-P) treatments, denoted as LB, HS15 min, HS30 min, HS60 min, M-P0 h, M-P2 h, and M-P4 h to reflect the treatment and sampling time point. The experimental procedure of our work is listed in Additional file 1: Figure S1. The libraries were sequenced on either the Illumina GAII or the HiSeq 2000 platform. Specifically, the sample LB was sequenced using the GAII platform, samples HS30 min, HS60 min, M-P0 h, and M-P2 h were sequenced using the HiSeq 2000 platform, whereas samples HS15 min and M-P4 h were sequenced using both the platforms. Each sample sequenced using the HiSeq 2000 platform was repeated twice (technical replicates). The reads obtained from different platforms for the same sample are highly correlated (Additional file 1: Figure S2), thus the data for the same sample were combined for the analysis. A total of 330,611,663 reads were generated from the seven samples. The mapping statistics of the samples are summarized in Additional file 1: Table S1 showing that 23.07 ~ 44.18% of reads could be uniquely mapped to the genome, resulting in 7,735,369 ~ 29,581,761 uniquely mapped reads in each sample, corresponding to a sequencing depth of 93 ~ 355 times of the genome. Of the 47.08 ~ 63.04% multiple mapped reads in each sample, over 99.6% were from duplicated tRNA/rRNA genes (data not shown). Thus discarding these multiple mapped reads does not affect our analysis of mRNA, asRNA and ncRNA transcriptions. Furthermore, as shown in Figure 1, in all the samples over 90% and less than 10% of the total mapped nucleotides were mapped to the sense strand and intergenic regions, respectively, with only 0.35 ~ 0.95% of the total mapped nucleotides mapped to the antisense strand. Moreover, as shown in Additional file 1: Figure S3, our uniquely mapped reads consisted of well-balanced different sizes of RNA insertions, indicating that, in additional to mRNA, our library preparation protocol could potentially capture small RNA species such as asRNAs and ncRNAs, which were otherwise left out by a typical size selection step in other library preparation protocols. All these results indicate that our sequence reads are highly strand-specific and of high quality, which is consistent with an earlier result using a similar library construction protocol [61]. The seven sequence datasets have been submitted to the Gene Expression Omnibus (GEO) database with accession number GSE48151.

Figure 1
figure 1

Strand specificity of the directional RNA-seq libraries. The percentage of total nucleotides mapped to sense strand, antisense strand and intergenic regions is shown for the seven samples.

Uncovered-gaps in transcribed regions are prevalent and read coverage is highly non-uniform

However, as shown in Figure 2, even with such deeply sequencing coverage, less than 60% genes in the genome had their length completely covered by at least one read, while only less than 90% genes in the genome had at least 10% of their length covered by at least one read, suggesting that some transcribed regions were not covered by the reads, leaving uncovered gaps in transcribed regions. The same problem has been widely noted in both eukaryotes [6163, 66, 67, 95] and prokaryotes [24, 60] due to the aforementioned technical artefacts of the current RNA-seq techniques [65, 72, 73, 75, 81]. In fact, we found that this uncovered gap problem was even more serious in many published prokaryotic datasets we have reanalyzed, a typical example from [60] is shown in Additional file 1: Figure S4. These prevalent uncovered gaps may be also partially caused by the loss of some RNA fragments during the library preparation due to the highly labile nature of prokaryotic RNAs as mentioned earlier. Our data seems to support this hypothesis, as the percentage of gene body coverage in our samples collected under heat shock treatment were generally lower than that in other treatments, in particular, after 30 and 60 min heat shock (Figure 2). It is well known that RNAs have a shorter living time at a higher temperature. It is because of this uncovered gap problem that we define a gene with ≥50% of the length covered by at least one read to be sufficiently expressed. Also, this 50% cutoff was chosen, as all the samples except HS60min had over 80% of genes with at least 50% length being covered (Figure 2). Additionally, as shown in Figure 3, our libraries were also biased to the 5’-end of transcription units, which is consistent with the earlier results [24, 57, 58].

Figure 2
figure 2

Distribution of the genes with more than the indicated percentage of their length covered by at least one read in the samples. Less than 60% of genes have their length completely covered by at least one read. Over 80% genes have over 50% of their length covered by at least one read except for sample HS60 min.

Figure 3
figure 3

Reads are biased to the 5’-end of operons. The sufficiently expressed known multiple-gene operons (Additional file 2) and singleton operons are equally divided into 20 bins, and the average expression values in each bin of all operons in each sample were displayed. The top 10% most highly expressed genes were excluded from the calculation.

Furthermore, we also found that the read coverage along genes were highly non-uniform (an example is shown in Additional file 1: Figure S5). Interestingly, the pattern of non-uniform coverage did not depend on the culture conditions and growth phase; rather, it strongly depended on the positions in the transcribed region (Additional file 1: Figure S5). Such highly non-uniform read coverage along a transcribed region has been widely noted in recent studies [23, 24, 28, 29, 60, 61], and were shown to be caused by several technical artifices in current RNA-seq techniques [66, 7281]. Clearly, both the uncovered gaps and highly non-uniform read coverage along transcribed regions make the full-length transcript assembling and alternative operon identification challenging tasks.

TruHMM assembles operons with high accuracy

We used the 476 experimentally verified operons in RegulonDB (Additional file 2) to train the parameters of the HMM and applied the leave-one-out strategy to test our TruHMM algorithm. To compensate for the negative effect of uncovered gaps in the expressed regions on assembling, we used a centroid coverage value in a sliding window to represent the reads coverage for each nucleotide of DNA (see Methods). Meanwhile, we do not want to increase false positives by mistakenly bridging irrelevant reads using such a strategy. To find an appropriate widow size for this purpose, we plotted the distributions of interoperonic and gap lengths shown in Figure 4, which suggest that the optimal window size might be shorter than 41 nt. Therefore, we evaluated the performance of our algorithm when the window size varied from 1 to 41 nt with an increment of 10 nt on all the seven samples using the leave-one-out validation strategy (Methods). As shown in Figure 5, when evaluated using the adjacent operon pairs (neighbouring gene pairs within an operon, for details see Methods), our algorithm was very robust for the choice of the window size in the range of 11 ~ 21 nt (the mean values for each metric are ≥ 94%). Particularly, when the window size L = 11 nt, the algorithm achieved probably the best-balanced performance (the mean values for each metric are ≥ 95.87%), especially in terms of the three most important measures: sensitivity, specificity and accuracy. When evaluated using the entire operon structure, our algorithm still achieved very good performance with all the five metrics being over 94.6% for window size of 11 ~ 21 nt (Figure 6), and the best performance (the mean values for each metric are ≥ 95.3%) was also obtained when L = 11 nt. Therefore, we chose L = 11 nt for our further analysis. We also evaluated the effect of sequencing depth on the performance of our algorithm. As shown in Additional file 1: Table S2 using M-P4 h as an example, when the sequencing depth is over 153 times of genome size, our algorithm was very robust to the sequencing depth.

Figure 4
figure 4

Cumulative distributions of the length of interoperonic regions and the length of gaps in sufficiently expressed regions.

Figure 5
figure 5

Evaluation of the algorithm based on operon pairs in the seven samples. The dashed horizontal line is at the 95.87% level, and the vertical bars indicate standard errors.

Figure 6
figure 6

Evaluation of the algorithm based on entire operon structures in the seven samples. The dashed horizontal line is at the 95.3% level, and the vertical bars indicate standard errors.

The performance of TruHMM is robust

To evaluate the performance of TruHMM and the robustness of its parameters on different organisms and datasets, we first applied TruHMM with the parameters trained on the E. coli K12 dataset to the earlier directional RNA-seq datasets of H. pylori generated under five different culture conditions [24]. We then trained the algorithm using an H. pylori training set (Additional file 3, and see Methods) based on the results in [24], and applied the algorithm with the trained parameters to both the H. pylori and E. coli K12 RNA-seq datasets. Remarkably, the operons reconstructed in both H. pylori and E. coli K12 using the E. coli- or H. pylori-trained parameters are exactly the same (data not shown), and have high accuracy measured by all the five metrics (Figures 5 and 6, and Additional file 1: Table S3 and S4). This might be explained by the fact that the parameters of the algorithm trained on the H. pylori training sets and on the E. coli K12 training sets are almost the same (Additional file 1: Table S5), although our E. coli and the earlier H. pylori RNA-seq datasets were generated by quiet different methods. These results unambiguously demonstrate that the performance of our algorithm is highly robust, thus parameters trained in one organism can be well extended to other organisms, at least in our tested datasets. The assembled operons in H. pylori for each sample are listed in Additional file 4.

The boundaries of operons can largely be captured by our libraries and assembled by TruHMM

We next evaluated the ability of TruHMM to define operon boundaries, i.e., the TSSs and transcription termination sites (TTSs) of assembled transcripts. However, an accurate evaluation of predicted operon boundaries is complicated by the recently discovered fact that alternative TSSs and TTSs are far more prevalent than previously thought [2325, 57, 58] and the lack of a gold standard TSS and TTS datasets because although some different TSSs and TTSs are documented for some operons in RegulonDB, they were generally characterized in different studies under various conditions that are not necessarily the same as we used in this study. Thus, we evaluated our reconstructed TSSs by the following alternative ways. First, we wanted to know how many experimentally verified TSS in RegulonDB could be recovered by the boundaries of our assembled operons in any of the seven samples. If two known TSSs in RegulonDB are within 10nt from each other, we considered them as the same one in our evaluation. Thus, there are 1,742 known TSSs (Additional file 5) associated with the genes transcribed in at least one of our seven samples. We considered a known TSS was recovered by our predicted TSS if they were at most 50nt from each other. Using this criterion, 908 out of 1,742 (~52.1%) known TSS were recovered by a total of our 5,706 predicted TSSs (Additional file 5). Second, as for the remaining 4,798 predicted TSSs with no match to a known TSS, 2,830 of which appeared in at least two samples, thus they are likely to be novel true TSSs. For example, although genes b2628-b2627 on the reverse strand is documented as an operon in RegulonDB, there is no TSS documented for gene b2628. We predicted a potential TSS in the upstream intergenic region of b2628 (2,763,486) in five samples (Additional file 1: Figure S5). The remaining 1,968 predicted TSSs appeared only in one sample. The 4,798 predicted TSSs are listed in Additional file 6. The low coverage of known TSSs in RegulonDB does not necessarily indicate the inaccuracy of our prediction, considering the prevalence of alternative TSSs utilizations under different conditions and the fact that TSSs in RegulonDB were mostly characterized by different researchers, and under different conditions. Therefore, the limited number of TSSs in RegulonDB might be the major reason.

Third, we checked whether there is a potential σ70 binding site (Pribnow box) near the predicted TSSs. To this end, we used the motif profile of the Pribnow boxes (Additional file 1: Figure S6A) found by MEME [96] in 539 (31%) out of 1742 upstream promoter sequences to scan for the potential Pribnow box in the [-100 nt, 100 nt] interval around the predicted TSSs. According to the distribution of the scanning scores in the random promoter sequences (see Methods), when a score is greater than 4.5487, the corresponding empirical p-value would be smaller than 0.05. In all, 1,327 (47%) out of the 2,830 predicted putative TSSs appearing in multiple samples harbour a putative σ70 binding site around predicted TSSs with a p-value ≤ 0.05 (Additional file 1: Figure S6B and Additional file 7), and 1,150 out of the 1,968 (58.4%) predicted putative TSSs appearing in only one sample bear a putative σ70 binding site with p-value ≤ 0.05 around the predicted TSSs (Additional file 1: Figure S6C and Additional file 7). However, the predicted TSSs appearing in multiple samples are more likely to be genuine ones since around 80% of which have a potential σ70 binding site located around the [-50 nt, 50 nt] interval of the predicted TSSs, compared to the rather evenly distributed Pribnow box positions of predicted TSSs appearing in a single sample (Figure 7).

Figure 7
figure 7

Distribution of the Pribnow box start position relative to predicted TSS appearing in multiple samples (black dots) or in a single sample (red dots).

Lastly, Sharma et. al[24] have determined 735 primary TSSs (defined as the most frequently used TSS by an annotated transcript, supplementary information of [24]) in H. pylori, using dRNA-seq technique that enriches the reads coverage on the 5’ end of a transcript. Therefore, the TSSs determined in this study could be a good dataset to test the accuracy of our algorithm. Specifically, we compared our predicted TSSs in H. pylori using their directional RNA-seq datasets with their TSSs determined by dRNA-seq. On average, 73.12% of our predicted TSSs in each sample are located within the [-50 nt, 50 nt] interval around a TSS determined by dRNA-seq (Additional file 1: Table S6). Thus our algorithm has achieved a rather high specificity. Our predicted TSSs in each of the five samples, located within the [-50 nt, 50 nt] interval around a verified TSS are listed in Additional file 4. Furthermore, we used the primary TSS to check the recall rate (sensitivity) of our program. Our program detected 558 (~76%) out of the 735 total primary TSSs. The majority of the verified TSSs recalled by our algorithm had a dominant coverage on the 5’ end of the transcript, one of such cases is shown in Additional file 1: Figure S7. By contrast, the majority of primary TSSs missed by our algorithm did not have a dominant read coverage on the 5’-end, two such cases are shown in Additional file 1: Figure S8. The primary TSSs both covered and missed by TruHMM are listed in Additional file 8. The much higher recovery rate of known TSSs by our algorithm in H. pylori than in E. coli K12 might be due to the fact that the gold standard dataset in H. pylori were derived from the same conditions as the RNA-seq datasets that we used for assembling the transcripts, while the datasets in RegulonDB were derived under various conditions.

As for the TTS predictions, our algorithm recovered 148 out of 221 (~67%) known TTSs associated with expressed genes in E. coli K12 (Additional file 5), which is higher than the recovery rate of known TSSs, even though the mapped reads are strongly biased to the 5’-ends (Figure 3). The lower recovery rates of known 5’ ends (TSS) compared to 3’ ends (TTS) might indicate that operons utilize more alternative TSSs than TTSs under different conditions. In other words, the predicted TSSs without a match with a known TSS in RegulonDB are likely to be novel alternative TSSs used in different conditions. Taken together, all these results strongly suggest that most of the predicted TSSs and TTSs are likely to be true transcription boundaries. The assembled operons and their alternative TSSs in each sample are listed in Additional file 9. However, as also demonstrated in earlier studies [24, 57, 58], to more accurately detect TSSs and TTSs of transcripts/operons, in particular TSSs, in addition to directional RNA-seq datasets, special datasets targeted to the 5’-endof transcripts are clearly needed, such as dRNA-seq data [24] and datasets for the more recently discovered transcription start site RNAs (tssRNAs) [97].

Condition-dependent alternative operon utilizations appear to be prevalent in E. coli K12

As summarized in Additional file 1: Table S7, our algorithm detected more than 2,000 operons involving more than 4,200 genes in each sample. There were 1,121 consistent operons that were transcribed in at least two of the seven samples, and 207 of which were multiple-gene operons (Additional file 10). Of these 207 consistent multiple-gene operons, 206 were expressed in all the seven samples except the operon istR-1-istR-2/b4616, which was not expressed in the samples HS60min and M-P2 h (Additional file 10). Figure 8 shows an example of a consistent operon hemCDXY encoding enzymes for tetrapyrrole synthesis. Although all the four genes were consistently expressed and continuously covered by the reads under different cultures and growth phases, they had similar position-dependent non-uniform read coverage along the operon, again indicating the non-uniform coverage of the libraries.

Figure 8
figure 8

Position-dependent non-uniform coverage of the reads along the hem operon hemCDXY. The vertical axis is the number of reads covered at the positions. The orange and dark green bars at the button of the graph represent the reverse and forward strands, respectively. Segments with arrows represent genes. The graphs were generated using IGB. To make the expression levels for the four genes comparable in different samples, the same scale (1,200) of the vertical axis is used for all the samples. Although this four-gene operon was consecutively covered by the reads under different cultures and growth phases, there are highly similar patterns of position-dependent non-uniform coverage of the reads along the operon in the samples.

Furthermore, we consider a non-consistent operon as an alternative operon if it shares a portion of genes with another operon in other samples. As shown in Additional file 1: Table S7, from 981 (46.9%) to 1,815 (63.4%) alternative operons were detected in each sample. Thus around half of the reconstructed operons in each sample have at least one putative alternative form, a number comparable to that found in M. Pneumonia[23] and other prokaryotes [24, 25, 28, 29], indicating that like many other prokaryotes [20, 2225], E. coli K12 seems to express enormous alternative operons under different culture conditions and growth phases, a phenomenon that is more prevalent than previously expected. An interesting example is the 14-gene operon phnCDEFGHIJKLMNOP coding for proteins responsible for the assimilation of C-P bond-containing phosphonates under phosphorus starvation conditions [98]. In the LB, and heat shock samples (HS15 min, HS30min and HS60 min), this operon was transcribed in several short suboperons (Additional file 1: Table S8 and Additional file 9) with low expression levels, whereas under phosphorus starvation (samples M-P2 h and M-P4 h), the phn genes were transcribed as a large operon phnCDEFGHIJKLMNOP with high expression levels (Figure 9 and Additional file 9), which is consistent with previous observations [98]. In fact, this 14-gene operon and its suboperons have been studied previously by several groups [98101]. The phnCDE suboperon encoding a phosphonate transport system, was transcribed in the sample M-P0 h, and phnF is a repressor for this suboperon [102]. Moreover, the products of the genes phnGHIJKLM are essential for the C-P bond cleaving activity [103]. More recently, Jochimsen et. al[101] have shown that phnGHIJK encodes a protein complex essential for organophosphonate utilization; this suboperon was detected in the sample HS15 min. Furthermore, genes phnNP function as downstream processing enzymes [104], whereas the phnO gene is unnecessary for transport or catalysis, and may therefore have a regulatory role [103]. Finally, as shown in Figure 9, the phnCDEFGHIJKLMNOP operon displayed varying/decreasing expression levels along the operon, another form of the complexity of prokaryotic transcriptomes in addition to alternative operon utilization [23]. However, further investigation of this phenomenon is out of the scope of this work.

Figure 9
figure 9

Reads coverage of the genes in the phn operon. The vertical axis is the number of reads covered at the positions. The orange and dark green bars represent the forward and reverse strands, respectively. Segments with arrows represent genes. Genes from the right to left are yjdN, phnC, phnD, phnE, phnF, phnG, phnH, phnI, phnJ, phnK, phnL, phnM, phnN, phnO and phnP. The graphs were generated using IGB. To make the expression levels for the 14 genes in different samples visible and comparable, the same vertical axis scale (50) is used for the LB and HS treatments, and the same vertical axis scale (450) is used for M-P treatments. Some positions with low read coverage cannot be shown while some other positions with high coverage are truncated. Note the varying levels of coverage and gaps along the operon under different cultures and growth phases, and again the similar position-dependent non-uniform coverage of the reads along the operon.

Another interesting example is the alternative utilization of the 13-gene operon fliFGHIJKLMNOPQR encoding proteins in the flagella of E. coli K12 (Additional file 1: Table S9 and Additional file 9). Although the fli operon was expressed as a 13-gene polycistron in the sample LB, it was split into short suboperons under the treatments of heat shock or phosphorus starvation in a time dependent manner (Additional file 1: Table S9). For example, at the beginning of heat shock (the sample HS15 min), the fli operon was divided into four suboperons, then it was further split into six to seven suboperons (samples HS30 min and HS60 min). Interestingly, it has been shown that heat shock reduces bacterial mobility possibly through the regulatory interactions between the heat shock system and the flagellum/chemotaxis system [105]. Moreover, it has been shown that inorganic phosphorus is necessary for the motility of bacteria [106]. However, the underlying mechanisms of these observations are largely unknown. Therefore, our results might provide a possible molecular explanation of these earlier observations: the extreme conditions (heat shock/phosphorus starvation) alter the expression of flagella proteins by changing the patterns of alternative usages of the fli operon, thus influence the motility of the bacterial cells.

Condition-dependent asRNA and ncRNA transcriptions appear to be prevalent in E. coli K12

Intriguingly, about 0.35 ~ 0.95% and 5.27 ~ 9.79% of our uniquely mapped read were mapped to the antisense strand of annotated open reading frames (ORFs) and intergenic regions, respectively (Figure 1). We consider the assembled transcripts from these reads as putative asRNAs and ncRNAs, respectively. As shown in Figure 10, majority of these putative asRNAs and ncRNAs have a length of 20 ~ 200 nt, while some can be > 1,000 nt long. Interestingly, majority (72.23 ~ 89.54%) of ORFs were predicted to have asRNA transcriptions (Additional file 1: Table S10), which is consistent with an earlier studies showing that 3,000 ~ 4,000 ORFs had asRNA transcriptions using tiling array [43]. However, a recent study [44] identified only about 1,000 asRNA in the same genome under similar growth conditions using directional RNA-seq. This discrepancy might be due to different techniques and analysis methods used. Furthermore, 1,942 ~ 2,780 (51.37 ~ 72.74%) out of the 3,808 intergenic regions had putative ncRNA transcriptions in a condition- and/or growth phase-dependent manner (Additional file 1: Table S10). To evaluate the accuracy of our assembled asRNAs and ncRNAs, we compared them with the 112 known asRNA and ncRNAs compiled by Storz’s group [55, 56] and RegulonDB [53], and found that our results recovered 102 (91%) of these 112 known asRNA and ncRNAs (Additional file 11). Thus, TruHMM has also achieved rather high sensitivity in assembling asRNAs and ncRNAs. However, the authenticity and functions of the remaining putative novel asRNAs and ncRNAs need to be further investigated. The assembled putative asRNAs and ncRNAs in the seven samples are listed in Additional file 12 and Additional file 13, respectively.

Figure 10
figure 10

Distribution of the length of assembled asRNA and ncRNAs. For clarity, only the range of 1 ~ 400 nt is shown, but some asRNA can be longer than 1,000 nt.

Some hypothetical genes are transcribed

Although E. coli K12 is probably the best studied and understood model organism, researchers have not completely defined even its coding genes. For instance, there are still 36 sequences labelled as hypothetical protein genes as of this writing in the RegulonDB [53]. Interestingly, we found that all these 36 hypothetical genes were transcribed in at least one of our seven samples (Additional file 14), and 21 (b0050, b0137, b1356, b1382, b1419, b1446, b1457, b1607, b1952, b1998, b3471, b3638, b3937, b4325, b4335, b4336, b4593, b4596, b4610, b4615 and b4620) of them were expressed in all the seven samples, suggesting that they are highly likely to be true protein coding genes. Furthermore, 20 of them formed multi-gene operons with other known genes (Additional file 14). The functions of these known genes might provide hints to possible functions of the associated hypothetical genes for “guilt by association”.

Discussion

Although a few high throughput studies have attempted to delineate the architecture of E. coli K12 transcriptomes [43, 44, 57, 58], they mainly focused on identifying TSSs [57, 58], promoters [58] and other features [57]. Thus we still lack a good understanding of the level of the complexity of the transcriptomes in E. coli K12, from which we gained most of our knowledge about transcription in bacteria, but the more recent revolutionary view of the high complexity and dynamics of prokaryotic transcriptomes. Therefore, there is an urgent need for a better understanding of the complexity of the transcriptomes in this most widely-used model Gram-negative bacterium, in particular, when the same highly complex and dynamic transcriptomes have recently been revealed in its counterpart model Gram-positive bacterium B. Subtilis[25]. To fill the gap, we have profiled the transcriptomes in E. coli K12 during the course of heat shock and phosphorus starvation conditions using a highly strand-specific RNA-seq method that can capture various forms RNA transcripts, in conjunction with a highly accurate full-length transcript assembler, TruHMM. Indeed, as has been widely reported in many other prokaryotes [2429], we have also identified numerous putative novel and/or alternative operons and TSSs, as well as novel putative asRNAs and ncRNAs in E. coli K12. More importantly, the transcription patterns of these putative alternative operons, asRNAs and ncRNAs were highly dependent on the growth phases and culture conditions of the bacterium, suggesting that they might play important roles in the physiology of the bacterium. In the future, it would be very interesting to study how the alternative operons, asRNAs and ncRNAs are related to transcriptional and translational regulations and cellular functions, in particular in responses to environmental cues. Furthermore, the molecular mechanisms that lead to the highly complex and dynamic transcriptomes in E. coli K12 and other organisms also warrant further investigations.

Based on the ever increasing body of evidence [2022], and the data presented in current study, it is highly likely that prokaryotes generally have highly dynamic and complex transcriptomes to cope with environmental changes. The failure to observe such highly complex and dynamic transcriptomes in some earlier studies [3142], and the inconsistent results in E. coli K12 and B. subtilis[25, 30], might well be due to the limitations of experimental and computational methods used in these studies. For instance, although an earlier study [30] did not detect alternative operon utilizations in B. subtilis using tiling arrays under two culture conditions, a more recent study [25] observed highly prevalent condition-dependent operon utilizations as well as numerous asRNA and ncRNA transcriptions using higher resolution tiling arrays and more sophisticated computational analysis in ~120 culture conditions. Furthermore, although Selinger et al.[43] found that up to 3,000 ~ 4,000 E. coli K12 genes had asRNA transcriptions using directional tilling arrays, Dornenburg et al.[44] only identified about 1,000 asRNAs in the same genome under similar growth conditions using a directional RNA-seq technique. Our results is in excellent agreement with the former results [43], as we detected that 72.23 ~ 89.54% annotated genes have putative asRNA transcriptions (Additional file 1: Table S10). Thus again asRNA transcription appears to be more prevalent than originally anticipated in E. coli K12. With the continuous drop in costs of the NGS technologies, directional RNA-seq becomes a routine technique to profile transcriptomes in thousands of sequenced prokaryotic genomes. We expect that highly complex and dynamic transcriptomes will be identified in more and more prokaryotes using improved directional RNA-seq techniques and analysis tools. The experimental methods and the transcripts assembler that we developed in this study can add in these efforts.

Specifically, our directional RNA-seq libraries preparation method based on the Illumina small RNA-seq prep method is highly strand-specific, avoiding potential genomic DNA contaminations. Our method is also capable to capture various types RNA transcripts, including mRNA and small RNAs such as asRNAs and ncRNAs, eliminating the need to prepare two libraries targeted to mRNAs and small RNAs separately [34]. Additionally, before the advent of a routine full-length RNA sequencing technology, reference-based assembly of full-length transcripts is probably the best choice and a necessary step to analyze the transcriptomes using RNA-seq short reads. Due to the highly labile nature and various technical biases introduced during the sequencing library preparation [66, 7280] and the sequencing process per se [81], transcribed regions are highly non-uniformly covered, and more seriously, a considerable portion of a transcribed region may not be covered by the reads, resulting in uncovered gaps in transcribed regions [6267]. Our assembler TruHMM has effectively addressed these issues. TruHMM differs from an earlier HMM based method for analyzing transcriptomes in B. anthracis[94] in the several important aspects, and overcomes its shortcomings. First, by arbitrarily dividing read coverage values of genes into several bins, the earlier HMM [94] contains multiple expression states that are not directly connected, thus in principle it cannot assemble transcripts with highly non-uniform coverage. By contrast, TruHMM uses only a single state to model a wide range of read coverage along a transcript, thus it is able to assemble transcripts with highly non-uniform coverage. Second, the earlier method assumes a first order dependence of the mapped reads [94], which cannot effectively bridge the larger and prevalent uncovered gaps along a transcribed region as we see in our and other RNA-seq datasets. In contrast, we treat the gap problem explicitly by using a sliding-window based centroid coverage values, which as we have demonstrated in this paper, can largely relieve the gap problem. Third, the earlier method empirically assigns emission probabilities to several expression states [94]. By contrast, we derived the emission probabilities by fitting our centroid read coverage values to a Poisson distribution, which nicely models the highly non-uniform read coverage phenomenon (Figure 11). Lastly, by using a post processing strategy, our algorithm can accurately predicted TSSs, whereas the early method lacks such capability. For these reasons, our algorithm has largely solved the highly non-uniform coverage problem as well as the prevalent gap problem in assembling prokaryotic transcripts using RNA-seq short reads. Indeed, when evaluated on the seven RNA-seq datasets that we generated in E. coli K12 as well as datasets produced in H. pylori, TruHMM has achieved rather high performance in assembling operons (Figures 5 and 6, and Additional file 1: Tables S3 and S4) and locating TSSs (Figure 7, and Additional file 1: Table S6) in both our E. coli K12 datasets and the earlier H. pylori datasets. TruHmm also was able to accurately assemble asRNAs and ncRNAs as it recovered 102 (91%) of the 112 known such RNAs in E. coli K12 [3] (Additional file 11). Equally importantly, the performance of TruHMM also is very robust as we have demonstrated that the E. coli-trained parameters can be used to assemble the transcripts in H. pylori and vice versa, while achieving in both the cases exactly the same results as being done using the parameters trained on their own verified operons. Therefore, one can use our trained parameters to assemble transcripts in a different organism when enough known operons in the organism of interest are not available for training the parameters.

Figure 11
figure 11

QQ-plot comparing the distribution of centroid coverage values of the positive training set in all the samples but LB with the fitted Poisson distribution. Deviation of a data point from the line y = x indicates its deviation from the theoretical Poisson distribution. Parameters of the Poisson distribution are estimated using the maximum likelihood method.

Another interesting and rather prevalent phenomenon called dynamic operon transcription is recently revealed by transcriptome profiling studies in M. Pneumonia[23] and B. subtilis[25] using high density tiling arrays that give more uniform signal coverage along genes albeit at lower resolution [23, 25]. Dynamic operon transcription is characterized by varying levels of transcription along an operon, resulting in staircase like transcription levels between adjacent genes in the operon [23, 25]. This phenomenon also is clearly seen in our datasets (examples are shown in Figure 9). However, TruHMM in its current form is unable to detect such dynamic operon transcription events due to the highly non-uniform read coverage along genes in an operon. Furthermore, if multiple alternative operons start at the same TSS, but terminate at different TTS in the same sample, TruHMM will fail to detect such coexisting alternative operons in the same sample. Clearly, to solve these problems, one needs to transform the highly non-uniform read coverage along the genes into a more uniform one by effectively correcting the aforementioned technical biases in the current RNA-seq methods, or relies on a better sequencing technology with minimal read bias, or capable of sequencing transcripts in their full-length. In addition, TruHMM might not be able to separate overlapping transcripts if the downstream transcript has no outstanding primary TSS. Finally, additional sequencing library targeted to the intact 5’-end of RNAs might be needed in order to identify all possible TSSs in a sample.

Conclusions

Using a highly efficient and strand-specific RNA-seq method combined with a highly accurate and robust algorithm and tool, TruHMM for assembling full-length transcriptomes, we showed that alternative operon utilizations in E. coli K12 appear to be more prevalent than originally anticipated, and that a large portion of ORFs and intergenic regions of the genome appear to have asRNA and ncRNA transcriptions, respectively. Furthermore, the patterns of alternative operon, asRNA and ncRNA transcriptions are dependent on the culture conditions and growth phases of the bacterium, thus they might play important roles in the physiology of the bacterium. Furthermore, with the recognition of the highly complex nature of prokaryote transcriptomes and the wide application of RNA-seq techniques in the prokaryotes research community, TruHMM can also be very useful for biologists to reveal the complexity of transcriptomes and the underlying molecular mechanisms in all sequenced prokaryotic genomes.

Methods

Bacterial culture

A frozen stock of Escherichia coli K12 strain MG1655 (a gift from Dr. Todd Steck, Department of Biology, the University of North Carolina at Charlotte) was thawed, inoculated in LB medium in a test tube by 1:100 dilution and cultured overnight at 37°C and 250 rpm. The cells were then transferred to fresh LB medium in a flask by 1:100 dilutions, and cultured at 37°C and 250 rpm. When the cells grew to the log phase with an optical density at 610 nm [OD610] of 0.87, they were spun down at 3,200 g for 25 min. For heat shock treatment (HS), the cell pellets were resuspended in the same volume of MOPS medium (100 ml of 10X MOPS mixture, 880 ml of sterile H2O, 10 ml (0.132 M) KH2PO4 and 10 ml of 20% glucose, Teknova, Hollister, CA), and incubated at 48°C and 250 rpm. For phosphorus-starvation treatment (M-P), the cell pellets were resuspended in the MOPS medium without KH2PO4. Three milliliter cell suspension were collected in a tube containing 1.5 ml RNA Later (Invitrogen) immediately after the cell pellets were resuspended in the indicated medium (0 min) and at the indicated time points thereafter (HS:15 min, 30 min and 60 min; M-P: 0 hrs, 2 hrs, 4 hrs). Cells were spun down at 6,000 g, 8 min and -4°C, and the pellets were resuspended in 1.5 ml of RNAlater. The samples were stored at -80°C until use.

Isolation and enrichment of mRNA

Total RNA was isolated using a RiboPure™ -Bacteria Kit (Ambion) following the manufacturer’s instructions. Once isolated, ~10 μg total RNA was treated with 8 units DNase (Invitrogen) twice to remove genomic DNA, and the complete removal of DNA was confirmed bythe absence of the product of 35 cycles PCR amplification of a 196 bp fragment of the crp gene (5’-primer: AGCATATTTCGGCAATCCAG; 3’-primer: TACAGCGTTTCCGCTTTTTC). To enrich mRNAs and other transcripts, majority of rRNAs were removed from the DNase-treated total RNA using a MICROBExpress kit (Ambion) following the manufacturer’s instructions.

Construction of directional RNA-seq libraries

In our early stage of experiments, sequencing was done on an Illumina GAII platform at the sequencing core facility of the University of North Carolina at Chapel Hill, and the directional RNA-seq libraries were constructed by following an Illumina’s instruction using their Small RNA Sample Prep Kit with some modifications. Briefly, after the purified mRNA was fragmented using a RNA fragmentation kit (Ambion), the fragmented RNA was treated with Antarctic phosphatase (NEB) to remove the 5’-tri-phosphate groups of RNAs with an intact 5’-end. A mono-phosphate group was then added back to the 5’-end of fragmented RNAs by polynucleotide kinase (PNK, NEB) in the presence of 10 mM ATP. The v1.5 sRNA 3’ Adaptor (5’/5rApp/ATCTCGTATGCCGTCTTCTGCTTG/3ddC/) was ligated to the 3’-end of fragmented RNAs using truncated T4 ligase 2 (NEB), and the SRA 5’ RNA adaptor (5’GUUCAGAGUUCUACAGUCCGACGAUC) was ligated to the 5’-end of fragmented RNAs using T4 ligase. To preserve short inserts from small RNAs we omitted the size selection step after PCR application of inserts. In our later experiments, sequencing was done on an Illumina HiSeq 2000 platform at David H. Murdock Research Institute of the North Carolina Research Campus (Kannapolis, NC), and we constructed the directional RNA-seq libraries using Illumina’s TruSeq Small RNA Sample Prep Kit, so that multiplex sequencing can be achieved by using the barcoded PCR primers. The details of the method will be described elsewhere (Dong, Li and Su). Briefly, after similar treatments as described above, the 5’ Adapter (RA5: 5’ GUUCAGAGUUCUACAGUCCGACGAUC), and 3’ Adapter (RA3: 5’ TGGAATTCTCGGGTGCCAAGG) were ligated to 5’- and 3’-end of fragmented RNAs, respectively. Reverse transcription-PCR (RT-PCR) was performed using SuperScript II Reverse Transcriptase Kit using the SRA RT primer, followed by 16 cycles of PCR amplification. Again, the size selection was omitted on PCR products to preserve short inserts from possible small RNAs. Single-end sequencing on the Illumina GA II platform was done with 76 cycles, while that on the HiSeq 2000 platform was done with 100 cycles. Some samples (HS15 min and M-P4 h) were sequenced on both platforms.

Mapping and filtering RNA-seq reads

The genome sequence and annotation files of E. coli K12 substr. MG1655 were obtained from NCBI (http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/), and the experimentally verified operons in the bacterium were downloaded from RegulonDB [53] (http://regulondb.ccg.unam.mx/). Additional 112 experimentally verified small RNAs in E. coli were obtained from Storz’s group (http://cbmp.nichd.nih.gov/segr/ecoli_rnas.html). A total of 4,501 annotated genes (also including pseudo genes and small RNAs) are included in this analysis. As the reads were not size-selected during the library construction, we trimmed the 3’ adapters attached to some short insertions. Adapter-free reads with lengths of <10 nt were discarded; the remaining reads were mapped to the E. coli K12 genome using Bowtie [87]. For the reads of length 10–14, 15–29 and ≥30 nt, up to 1, 2, and 3 mismatches were allowed, respectively. Since over 99.6% of the multiple mapped reads in each sample were from duplicated tRNA/rRNA genes (data not shown), only uniquely mapped reads were used for further analysis. The alignment of mapped reads to the reference genome was visualized by Integrated Genome Browser (IGB) [107]. To map the directional RNA-seq reads of H. pylori[24], we trimmed the polyA tails of the original datasets, which were introduced during the library preparation, and mapped the reads to the reference genome using Bowtie with the same parameter settings as for E. coli K12.

Normalization of the mapped counts

Normalization of the mapped read counts is crucial for differential expression detection using RNA-seq [108], as different samples may have different total read counts, i.e. sequencing depths, as well as various biases mentioned earlier. The most commonly used normalization methods include reads per kilobase of exon model (or ORF) per million mapped reads (RPKM) [62], fragments per kilobase of transcript per million fragments mapped reads (FPKM) [71], the hypergeometric model [109] and other more recent sophisticated model-based methods [63, 64, 66, 67, 77, 78, 110, 111]. However, it has been shown that these global normalization methods are strongly affected by a small proportion of highly expressed genes in the published datasets, leading to biased estimation of gene expression levels across different conditions [108]. As shown in Figure 12, our datasets are no exception to the problem as around 10% of genes with the highest number of mapped nucleotides contribute up to 80% ~ 90% of mapped nucleotides in the gene-coding regions across all the seven samples. Inspired by [108] and also for computational efficiency, in this study we used N* defined as the total nucleotide counts minus the counts of the top 10% of genes with the highest counts to scale the gene expression levels in each sample, instead of using the total counts of mapped nucleotides in each sample.

Figure 12
figure 12

Impact of highly expressed genes on the mapped nucleotides in coding regions. Genes were sorted in the descending order of their number of mapped nucleotides in reads. The top 10 percent of genes with the highest read counts contribute to around 80% ~90% mapped nucleotides in the coding regions.

Furthermore, because our mapped reads have different lengths (see Results), instead of using the mapped read counts per gene, we used the mapped nucleotide counts per gene to measure the gene expression levels defined as “Nucleotides Per Kilo base of transcript per Billion nucleotides mapped” (NPKB):

NPKB = n N * 10 9 × L 10 3 ,
(1)

Where n is the number of nucleotides of the reads mapped to the transcript, N* our normalization factor defined above, and L the length of the transcript. Clearly, when all reads have the same length, NPKB and RPKM differ by a constant scaling factor. A similar method has been used earlier [60], except that our NPKB is further normalized by the global scaling factor N* in each sample.

Training the HMM

An HMM is a machine-learning algorithm that can be used to decode the path of hidden states that generate a sequence. In this paper, we use an HMM to infer whether or not a segment of a strand of DNA is consecutively transcribed given the expression values obtained from the mapped reads. The model consists of two states: the expression state E and non-expression state N (Figure 13).

Figure 13
figure 13

Structure of the HMM for assembling operons/transcripts using RNA-seq reads. E represents the expression state and N the non-expression state, Letters r 1 , r 2 ,…,r n  are the emission values of E, μ contig  is the mean length of sufficiently expressed contigs in the positive training set; and s 1 , s 2 ,…, s N  are the emission values of N, and μ zero  is the mean length of the non-expressed regions in the negative training set.

Selection of expressed adjacent operon pairs

A gene was considered to be sufficiently expressed if over 50% of its length was covered by at least one read and at least 20 nt of both of its termini were covered by at least one read. We used the 476 experimentally verified operons in RegulonDB (Additional file 2) to train the parameters of the HMM and to evaluate the performance of our algorithm. Since these operons were not necessarily expressed in our samples, and alternative operon utilizations could be very prevalent, as the first step to construct a positive operon set in a sample, we selected a pair of adjacent genes in a known operon (adjacent operon pair) if they met the following two criteria: 1) both genes were sufficiently expressed and over 50% of the length of their intergenic region were covered by at least one read in the sample; and 2) the correlation between the expression levels of the two genes and their intergenic region was greater than a cutoff. To compute the correlation between the expression levels of the two genes and their intergenic region, we extended the two ends of the intergenic region into the two flanking genes to double its length or extended until the other end of either gene was reached (Figure 14A). We equally divided the extended intergenic region as well as the intergenic region into n bins, and thus the expression levels (NPKB) over these bins formed two n-element vectors (Figure 14B). Pearson correlation coefficient (PCC) between the two vectors was used to quantify the correlation between the expression levels of the two genes and their intergenic region. To find an appropriate cutoff, we similarly divided a sufficiently expressed gene as well as its central half into n equal bins, and computed the correlation of the expression levels between the whole gene and its central half. We reason that for an expressed adjacent operon pair, the PCC value between the intergenic region and the extended intergenic region should follow the same distribution of the PCC value between the central half of an expressed gene and the whole gene, since an adjacent operon pair and their intergenic region should be expressed in a similar way as the different parts of a gene. The distribution of the PCC value between the central half and the whole gene (n = 4) is shown in Figure 14C. We chose 0.3 as the cutoff for our second criterion to select positive adjacent operon pair since this would allow us to include over 60% of sufficiently expressed genes.

Figure 14
figure 14

Selection of known adjacent operon pairs for training and evaluation. A: The intergenic region between two adjacent genes in an operon is doubled by extending its two ends in the two flanking genes. B: A sufficiently expressed gene is equally divided into n bins, and its central half is further equally divided into n bins. The NPKB values for each bin of a gene and of its central portion are a 1 ,…,a i ,…, a j , …, a n and b 1 ,…, b i ,…, b j ,…, b n , respectively. An extended intergenic region is similarly divided by treating it as a “gene” with the intergenic region being the central portion of the “gene”. C: Distribution of PCC values between the two vectors for sufficiently expressed genes with a bin size n = 4. We choose 0.3 as the cutoff of PCC value since 60.1% of sufficiently expressed genes can be included.

Positive and negative training sets

To train the HMM, we constructed a positive training set in a sample by simply stitching the known adjacent operon pairs that met the two criteria described above to form a large operon if they are parts of a known operon according to RegulonDB. These positive training sets in the seven samples are listed in Additional file 2. To construct a relatively large negative training set in a sample, we included all the uncovered regions in the genome excluding the ones inside the sufficiently expressed genes in the sample.

Positive and negative testing sets

We evaluated the operon prediction accuracy using two methods: one was based on adjacent operon pairs, and the other on the entire operon structure using all the gene pairs of a known operon. For the first method, we constructed a positive testing set in a sample, consisting of sufficiently expressed adjacent operon pairs, and a negative testing set consisting of known adjacent non-operon pairs that were both sufficiently expressed in the sample. A known adjacent non-operon pair consisted of either the first gene of a known operon and its immediate upstream gene, or the last gene in a known operon and its immediate downstream gene, as long as the intergenic region of the gene pair had at least one uncovered region, regardless of its length. For the second method, we constructed a positive testing set in a sample, consisting of all pair-wise combinations of the genes in a sufficiently expressed operon, and a negative testing set consisting of the gene pairs between the genes of the operon and the immediate upstream or immediate downstream gene, given that the known adjacent non-operon pairs had no overlapping un-translated region (UTR) and that all these relevant genes were sufficiently expressed.

Leave-one-out cross validation

We employed a leave-one-out cross validation strategy to evaluate the performance of our algorithm. Specifically, we used the positive training sets and negative training sets in (n-1) samples to train the emission and transition probabilities of the HMM, and used the positive testing set and the negative testing set in the remaining sample to test the trained model.

Training emission probabilities

The number of reads mapped to a specific position (nucleotide) in the reference genome is denoted as “coverage” of the position in this paper. To deal with the uncovered gap problem, we used a sliding window to compute the centroid coverage of each position on a strand of DNA, assuming that if the flanking regions of a position are transcribed, it is very likely that the position itself also is transcribed. Specifically, given a window size L (L is an odd number), the centroid coverage of the nucleotide i in the middle of the window is defined as:

Centroid i = log 10 9 N * 1 L k = i - L - 1 / 2 i + L - 1 / 2 Coverage k + 1 ,
(2)

Where i is the i-th position (nucleotide) on the chromosome. N* the normalization factor defined in equation (1), L the window size, and Coverage (k) the coverage of position k on the genome. Note that a pseudo count of 1 is added to the coverage value of each window. The optimal window size is determined by balancing two goals with opposite effects: to cover as many gaps as possible and to exclude as many interoperonic regions as possible. See Results for the details of window size selection.

The emission signals of the states E (r 1 , r 2 , …,) and N (s 1 , s 2 , …) are the centroid coverage values of the nucleotides in the reference genome. We used the positive training sets to estimate the emission probabilities of the signals of E. The distribution of centroid coverage values of the positive training set from all samples except LB is shown in Figure 11. The QQ plot indicates that the centroid coverage values of the positive training set approximately follow a Poisson distribution, which is consistent with the earlier results [108]. Thus, the emission probability of the centroid coverage values in the state E could be computed by the Poisson distribution, whose parameters were estimated with the maximum likelihood method. Since our negative training set were virtually not covered by reads, the signals that the state N emits should be the centroid coverage values with zero coverage,

log ( 10 9 N * ( 1 L k = i - L - 1 / 2 i + L - 1 / 2 0 + 1 ) ) .
(3)

We arbitrarily assigned a high probability 1-10-20 for N to emit this value, and a low probability10-20 for N to emit any other values. The value 10-20 is also a pseudo probability to avoid zero probability for decoding the HMM later.

Training transition probabilities

We chose to model the lengths of both expressed and non-expressed regions with geometric distributions, though other distributions may provide a fit. To this end, let P ij be the transition probability from state i to j. To estimate the transition probabilities P EE and P EN , i.e., the probability to stay in the state E and to transit from the state E to the state N, respectively, let X be the length of a consecutively expressed region of the genome. Under the Markov assumption, X should follow a geometric distribution,

P X = n = P EE n · 1 - P EE
(4)

Similarly, let Y be the length of a consecutively non-expressed (uncovered) region of genome, then Y also follows a geometric distribution,

P Y = n = P NN n · 1 - P NN .
(5)

To generate “full length transcripts training sets”, we simply stitched overlapping reads along the body of a known gene or operon to assembly larger contigs. We consider as sufficiently expressed contigs those that cover at least 50% of a known gene or an adjacent operon pair of a known operon. We used the lengths of such contigs to estimate the probability of staying in the state E as P EE  = E(X)/(E(X) + 1), where E(X) is the mean length of sufficiently expressed regions. E(X) can be determined from the sufficiently expressed contigs in the samples. For example, using such contigs from all the samples except LB, we obtained E(X) = 1,537 nt and P EN  = 0.0006503 (Figure 15A). Notably, the vast majority of contigs have a length shorter than 8,000 nt. Furthermore, we used the lengths of non-expressed regions in the negative training sets to estimate the probability of remaining in the state N as P NN  = E(Y)/(E(Y) + 1), where E(Y) also can be determined from raw coverage data, for example, E(Y) = 127 nt, and P NN  = 0.005773 for all the negative training sets from all samples except LB (Figure 15B). The derivation of transition probabilities estimations is given in Additional file 1: Figure S9. The QQ plot indicates that although not precisely, the lengths of the sufficiently expressed contigs can be largely modelled as a geometric distribution (Figures 15A and C), in particular when the length of contigs is shorter than 7,000 nt. However, the lengths of non-expression regions could not be modelled by a geometric distribution (Figures 15B, D), probably because of the uncovered gaps in the expressed regions, which were much shorter than authentic non-expressed regions. Nevertheless, we found that this deviation had little effects on the performance of the algorithm (see Results). We should point out that although several previous studies have shown that the lengths of exons in eukaryotes or ORFs in prokaryotes do not follow a geometric distribution [112, 113], and we have confirmed this in E. coli K12 (Additional file 1: Figures S10A and C), it is not very surprising that the lengths of prokaryotic mRNA transcripts largely follow a geometric distribution (Figures 15A and C). This result might be due to the fact that the length of a prokaryotic mRNA transcript is not limited by the lengths of its constituent ORFs, rather, it also depends on the lengths of the 5’ UTR, constituent intergenic regions and 3’ UTRs. The lengths of the UTR regions are known to follow geometric distributions, at least in eukaryotes [112, 114]. In addition, the lengths of all of the intergenic sequences are known to follow a geometric distribution [112] (Additional file 1: Figures S10B and D). Therefore the lengths of prokaryotic mRNA transcripts behave very differently from those of ORFs.

Figure 15
figure 15

Distributions of the lengths of sufficiently expressed contigs and non-expressed regions in all the samples except LB. A: Histogram of the lengths of sufficiently expressed contigs (bin size =50 nt). The curve is the geometric distribution with the success probability p = 0.0006503 estimated by the maximum likelihood method. The inset is a blow-up view of the region of length 1 ~ 7,000 nt. B: Histogram of the lengths of non-expressed regions (bin size =50 nt). The curve is the geometric distribution with p = 0.00577 estimated by the maximum likelihood method. C: QQ-plot of the lengths of the sufficiently expressed contigs against the fitted geometric distribution. D: QQ-plot for the lengths of non-expressed regions against the fitted geometric distribution.

Reconstruction of full length transcripts/operons

We used the Viterbi algorithm [115] to decode the path of the states that best explains the centroid coverage values of a region of DNA. If a string of adjacent genes are connected by a consecutive sequence of expressed states, then these genes are predicted to form an operon. Furthermore, we stitched two candidate adjacent operons, for instance, A-B and B-C, to obtain the full length transcripts/operons A-B-C. If over half of the length of a terminal gene is predicted to be expressed, this gene is considered as a member of the predicted operon, otherwise the expressed part of the terminal gene is only considered as the UTR of the operon. The TSS and TTS of an assembled operon/transcript were determined by the locations of its 5’-end and the 3’-end, respectively.

However, errors could be introduced in the assembled operon/transcripts, and thus need to be fixed. Specifically, due to the short length of the reads, if a sub-operon/transcript overlaps with an upstream operon/transcript that are expressed in a sample, the algorithm will assemble the two operons/transcripts into a single one, missing the downstream sub-operon/transcript. Furthermore, if multiple alternative operons with different TSSs are transcribed in a sample, the assembled transcripts will be the possible longest alternative operon used in the sample. To identify such possible alternative operons as well as their TSSs, we applied to each assembled operon/transcript the following procedure based on the observation that there is often an abrupt increase in the read coverage at a TSS. The procedure attempts to identify a possible TSS inside an assembled transcript by detecting the position at which an abrupt increase in the coverage occurs in the upstream region of a transcribed gene. Specifically, for each assembled operon/transcript with a long 5’ UTRs (>50 nt), we used two sliding windows of size 2w1 and 2w2 around the position i, [i-w1, i + w1] and [i-w2, i + w2], w1 > w2 > 0, to scan each position of the 5’ UTR associated with the first gene in the operon, and compute coverage ratios r1(i) and r2(i) between the downstream and upstream half windows, defined as follows,

γ 1 i = k = i + 1 i + w 1 ( Coverage k + 1 ) / ( k = i - w 1 i - 1 Coverage k + 1 ) forwardstrand k = i - w 1 i - 1 ( Coverage k + 1 ) / ( k = i + 1 i + w 1 Coverage k + 1 ) reversestrand
(6)
γ 2 i = k = i + 1 i + w 2 ( Coverage k + 1 ) / ( k = i - w 2 i - 1 Coverage k + 1 ) forwardstrand k = i - w 2 i - 1 ( Coverage k + 1 ) / ( k = i + 1 i + w 2 Coverage k + 1 ) reverse strand
(7)

Since there must be a TSS associated with the first gene of an assembled transcript, we predict position j in the 5’ UTR, with the largest sum of ratios γ1(j) + γ2(j) as the TSS associated with the first gene in the assembled transcripts, i.e.,

j = ArgMax i γ 1 i + γ 2 i .
(8)

To identify potential alternative TSSs for the downstream genes of the assembled transcripts, we used a rather strict threshold of 5-fold for the ratio γ1(j), to guarantee that there is an outstanding ‘jump’ of read coverage in the downstream of position j. In both cases, we set w1 = 80 nt and w2 = 10 nt. The TTSs were simply determined by the locations of the 3’-end of the assembled operons/transcripts.

The algorithm was encoded in C++ and perl. The software package is open-source, and can be downloaded from http://bioinfolab.uncc.edu/TruHmm_package/. We provide users the option to train their model if enough known operons are available in their genomes of interest. Otherwise users can apply our algorithm using the default settings without the need of any training.

Motif detection in promoters

We applied MEME [96] to search for σ70 binding sites (Pribnow box) within 25 nt upstream of 1,742 experimentally verified TSSs. The motif profile was then used to scan for the potential Pribnow box within the [-100 nt, 100 nt] interval around the predicted TSS by the scoring function (formula (9,10,11)) we developed before [116, 117]:

S M t = max i = 1 L I i log p i , h i q h i
(9)
I i = b A , C , G , T p i , b log p i , b q b / a
(10)
a = n + 1 n + 4 log n + 1 - log ( n + 4 ) - 1 n + 4 b A , C , G , T log q b - n n + 4 log min b A , C , G , T q b

To estimate the statistical significance of motif scores, we used a 3rd-order Markov model to generate 50,000 random sequences based on the transition probabilities learned from the set of experimentally verified promoters in E. coli K12. The distribution of the motif scores in the random sequences was used to define an empirical p-value.

Performance metrics

To evaluate the performances of our algorithm, we use the following metrics.

Sensitivity = Re call = TPR = TP TP + FN Specificity = 1 - FPR = TN FP + TN Accuracy = TP + TN TP + FP + TN + FN Precision = TP TP + FP F - factor = 2 × Re call × Pre cision Re call + Pre cision

Where, TP (true positive) = Number of known operon pairs accurately classified as operon pairs by the model.

FP (False Positive) = Number of non-operon pairs falsely classified as operon pairs by the model.

FN (False Negative) = Number of known operon pairs falsely classified as non-operon pairs by the model.

TN (True Negative) = Number of non-operon pairs accurately classified as non-operon pairs by the model.

Sensitivity, i.e. TPR (True Positive Rate or recall) is the proportion of known operon pairs that can be correctly identified as operon pairs by the model. Specificity, i.e. 1-FPR (False Positive Rate) is the proportion of non-operon pairs that are correctly classified as non-operon pairs. Accuracy combines the two metrics to quantify the overall performance of the model. A high Accuracy value represents a low total error rate. Precision denotes the proportion of predicted positives that are true positives. F-factor combines Recall and Precision and normalized them to an idealized value.