A novel short L-arginine responsive protein-coding gene (laoB) antiparallel overlapping to a CadC-like transcriptional regulator in Escherichia coli O157:H7 Sakai originated by overprinting
Due to the DNA triplet code, it is possible that the sequences of two or more protein-coding genes overlap to a large degree. However, such non-trivial overlaps are usually excluded by genome annotation pipelines and, thus, only a few overlapping gene pairs have been described in bacteria. In contrast, transcriptome and translatome sequencing reveals many signals originating from the antisense strand of annotated genes. We analyzed one such gene pair in more detail.
A small open reading frame of Escherichia coli O157:H7 strain Sakai (EHEC), designated laoB (L-arginine responsive overlapping gene), is embedded in reading frame −2 in the antisense strand of ECs5115, encoding a CadC-like transcriptional regulator. This overlapping gene shows evidence of transcription and translation in Luria-Bertani (LB) and brain-heart infusion (BHI) medium based on RNA sequencing (RNAseq) and ribosomal-footprint sequencing (RIBOseq). The transcriptional start site is 289 base pairs (bp) upstream of the start codon and transcription termination is 155 bp downstream of the stop codon. Overexpression of LaoB fused to an enhanced green fluorescent protein (EGFP) reporter was possible. The sequence upstream of the transcriptional start site displayed strong promoter activity under different conditions, whereas promoter activity was significantly decreased in the presence of L-arginine. A strand-specific translationally arrested mutant of laoB provided a significant growth advantage in competitive growth experiments in the presence of L-arginine compared to the wild type, which returned to wild type level after complementation of laoB in trans. A phylostratigraphic analysis indicated that the novel gene is restricted to the Escherichia/Shigella clade and might have originated recently by overprinting leading to the expression of part of the antisense strand of ECs5115.
Here, we present evidence of a novel small protein-coding gene laoB encoded in the antisense frame −2 of the annotated gene ECs5115. Clearly, laoB is evolutionarily young and it originated in the Escherichia/Shigella clade by overprinting, a process which may cause the de novo evolution of bacterial genes like laoB.
Keywords: Overlapping gene, Overprinting, Small protein, De novo gene, EHEC
Abbreviations
EGFP: Enhanced green fluorescent protein
laoB: L-arginine responsive overlapping gene
LDF: Linear discriminant functions
MOD medium: Modified Bacillus-growth medium
OD600: Optical density at 600 nm
ORF: Open reading frame
PCR: Polymerase chain reaction
RACE: Rapid amplification of copy-DNA ends
RCV: Ribosomal coverage value
RPKM: Reads per kilobase per million mapped reads
rpm: Revolutions per minute
TSS: Transcription start site
The DNA triplet code is constructed such that the majority of amino acids (AA) can be encoded by more than one codon, a property known as the degeneracy of the genetic code. Codon position three shows the highest degeneracy (wobble position), whereas position one is only slightly degenerate and position two is not degenerate. Thus, a DNA double strand contains six possible reading frames, each of which has the capacity to encode a protein, and it is feasible that the sequences of two or more protein-coding genes overlap. Most generally, overlapping genes (OLGs) share at least one nucleotide between the coding regions of two genes. When the reading frame of the evolutionarily older, established gene (mother gene) is defined as frame +1, a same-strand overlap in reading frames +2 or +3 relative to the annotated gene is possible. Same-strand OLGs originate from programmed ribosomal frameshift [2, 3] or programmed transcriptional realignment. Additionally, a second gene can overlap the mother gene antisense in frames −1, −2 or −3. It is under debate which antisense frame is preferred for the occurrence of OLGs. In E. coli, most long antisense open reading frames (ORFs) are detected in frame −1. However, this finding might be caused by codon bias of the mother gene [6, 7]. Whereas Krakauer predicted the highest constraints on frame −2, in E. coli more long ORFs are found in frame −2 than in −3 [6, 8]. Lèbre and Gascuel investigated the constraints of OLGs at the AA level and detected the highest constraints on frame −3 due to a high number of “forbidden dipeptides” within the encoded protein, which would cause a stop codon in the established gene.
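The six reading frames of a double-stranded sequence can be illustrated with a minimal Python sketch. The function names are hypothetical, and the frame labels are an assumption made only for this illustration: "+k" and "-k" simply denote an offset of k−1 nucleotides on the given strand and on its reverse complement, respectively. Conventions for numbering antisense frames relative to a mother gene differ between publications, so these labels need not match the frame −2 designation used for laoB.

```python
# Illustrative sketch: split a DNA sequence into its six reading frames.
# Frame labels are an assumption of this example (offset on each strand),
# not necessarily the convention used to name antisense frames in the text.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete codons, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to their codon lists."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = codons(seq, k)
        frames[f"-{k + 1}"] = codons(rc, k)
    return frames


def longest_stop_free_run(codon_list):
    """Length (in codons) of the longest stretch without a stop codon."""
    best = run = 0
    for codon in codon_list:
        run = 0 if codon in STOP_CODONS else run + 1
        best = max(best, run)
    return best


if __name__ == "__main__":
    for label, frame in six_frames("ATGAAATAGCCCTTTGGG").items():
        print(label, frame, longest_stop_free_run(frame))
```

A long stop-free run in an antisense frame is only a prerequisite for an overlapping gene; evidence of transcription and translation is still required.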
In prokaryotes, many genes are organized in operons, which are transcribed as a polycistronic mRNA. In these cases, trivial same-strand overlaps of only a few base pairs are very common and facilitate translational coupling. In contrast, almost no long OLGs (overlap ≥90 bp) have been described in bacteria [11, 12, 13, 14]. Longer OLGs are well known in viral genomes, where they probably contribute to genome size reduction: in viruses, 38% of all AA are encoded by overlapping sequences, and in many cases the OLGs encode accessory proteins with unusual sequence composition, such as many disordered regions [15, 16, 17].
OLGs may originate by overprinting: by chance, an overlapping reading frame is expressed in a bacterial population. However, encoding two functional genes at one locus leads to severe constraints on sequence evolution, since many mutations will influence the AA sequences of two genes carrying completely different functions [1, 8, 9]. This may be one reason why the overprinting hypothesis has been neglected as being rather unlikely [7, 19]. Instead, the hypothesis of gene duplication followed by subfunctionalization or neofunctionalization has been favored for the origin of novel genes.
Here, we present an initial functional characterization of the novel OLG laoB of EHEC, the expression of which was seen in transcriptome data and ribosomal profiling. laoB overlaps antiparallel to the annotated gene ECs5115, and this overlapping gene pair is a novel example of this seemingly rare form of bacterial gene organization. We propose that laoB originated very recently by overprinting.
Determination of transcriptional start site by 5′ rapid amplification of copy-DNA ends (RACE)
An overnight culture of Escherichia coli O157:H7 strain Sakai (GenBank accession NC_002695, EHEC)  was inoculated 1:100 in 0.5 × LB with 400 mM NaCl and incubated at 37 °C and 150 rpm until an OD600 of 0.8 was reached. Total RNA of 500 μl EHEC culture was isolated with Trizol and the remaining DNA was digested using 2 U TURBO™ DNase (Thermo Fisher Scientific). The 5’RACE System for Rapid Amplification of cDNA Ends, Version 2.0 (Invitrogen) was used according to the manual. After the second polymerase chain reaction (PCR), the dominant product was excised from the agarose gel and purified with the GenElute™ Gel Extraction Kit (Sigma-Aldrich). The PCR product was Sanger sequenced by Eurofins with oligonucleotide laoB+25R.
Determination of transcriptional stop site by 3’RACE
Total RNA of 500 μl EHEC overnight culture in LB medium was isolated using Trizol and the remaining DNA was digested using 2 U TURBO™ DNase (Thermo Fisher Scientific). The 5′/3’ RACE Kit, 2nd Generation (Roche Applied Science) was applied according to the manual, but instead of an oligo-dT primer for cDNA synthesis the gene specific primer laoB-12F was used. A nested PCR was performed for product amplification. The dominant product was excised from the agarose gel, purified with the GenElute™ Gel Extraction Kit (Sigma-Aldrich) and Sanger sequenced (Eurofins) with oligonucleotide laoB+31F.
Cloning of pProbe-NT plasmids and determination of promoter activity
The genomic region 300 bp upstream of the transcriptional start site was amplified by PCR and restriction enzyme cut sites for SalI and EcoRI were introduced. The PCR products were cloned into the plasmid pProbe-NT and transformed into Escherichia coli Top10. The plasmid sequence was verified by Sanger sequencing (Eurofins). Overnight cultures of E. coli Top10 + pProbe-NT (negative control) and pProbe-NT-PromoterTSS were used for 1:100 inoculation of 10 ml 0.5 × LB medium with 30 μg/ml kanamycin. Promoter activity was investigated in 0.5 × LB medium under each of the following conditions: plain LB, at pH 5, at pH 8.2, plus 400 mM NaCl, plus 0.5 mM CuCl2, plus 2 mM formic acid, plus 2.5 mM malonic acid, or plus 10 mM L-arginine. Cultures were incubated at 37 °C and 150 rpm until an OD600 of 0.5 was reached. Then, the cells were pelleted, washed once with phosphate-buffered saline (PBS) and resuspended in 1 ml PBS. The OD600 was adjusted to 0.3 and 0.6. Four 200 μl aliquots of each OD-adjusted suspension were pipetted into a black microtiter plate and the fluorescence was measured (Wallac Victor3, Perkin Elmer Life Science, excitation 485 nm, emission 535 nm, measuring time 1 s). The fluorescence of E. coli Top10 without vector was subtracted as background. To measure promoter activity after L-arginine supplementation, the experiment was repeated in depleted modified Bacillus-growth (MOD) medium without L-glutamic acid, L-arginine, and L-aspartic acid, since these AA are easily convertible within the cell. Depleted MOD medium and depleted MOD medium supplemented with 10 mM L-arginine were tested. The experiments were performed in triplicate. Significance of changes was calculated by the two-tailed Student’s t-test.
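The read-out described above (background subtraction followed by a Student's t-test) can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the equal-variance (pooled) form of the t statistic is assumed, and the two-tailed p-value would still have to be looked up from the t distribution with the returned degrees of freedom, for example with a statistics package.

```python
# Minimal sketch: background-corrected fluorescence and the pooled-variance
# Student's t statistic. Function names are hypothetical; the p-value lookup
# is intentionally left out because the standard library has no t distribution.
from math import sqrt
from statistics import mean, variance


def background_corrected(readings, background):
    """Subtract the fluorescence of the vector-free strain from each reading."""
    return [r - background for r in readings]


def student_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for two independent samples,
    assuming equal variances (classical Student's t-test)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2
```

With three replicates per condition, as in the experiments described, the test has four degrees of freedom.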
Cloning of a C-terminal LaoB-EGFP fusion protein and overexpression of LaoB-EGFP protein
The laoB sequence without the stop codon was amplified by PCR and restriction enzyme cut sites for PstI and NcoI were introduced. The PCR product was cloned into the plasmid pEGFP and transformed into Escherichia coli Top10. The plasmid sequence was verified by Sanger sequencing (Eurofins). For overexpression of the fusion protein, overnight cultures of E. coli Top10 + pEGFP and E. coli Top10 + pEGFP-laoB were inoculated 1:100 in 10 ml 0.5 × LB medium with 120 μg/ml ampicillin in duplicate. Cultures were incubated at 37 °C and 150 rpm until an OD600 of 0.3 was reached. For one culture of each strain, protein expression was induced using 10 mM isopropyl-β-D-1-thiogalactopyranoside (IPTG). Incubation of induced and uninduced cultures was continued for 1 h. Cells were pelleted, washed once with PBS and the pellet was resuspended in 1 ml PBS. The OD600 was adjusted to 0.3 and 0.6. Four 200 μl aliquots of each OD-adjusted bacterial suspension were pipetted into a black microtiter plate and the fluorescence was measured as before. The experiment was performed in triplicate. Significance of changes was calculated by the two-tailed Student’s t-test.
Cloning of a translationally arrested laoB mutant
For cloning of the genomic knock-out mutant ∆laoB, the method described by Kim et al. was adapted. The mutations introduced do not change the AA sequence of the overlapping gene ECs5115. The pHA1887 fragment and the selection cassette were amplified by PCR from the plasmid pTS2Cb. Three consecutive point mutations, leading to a premature stop codon (5th codon) and the deletion of a restriction enzyme cut site (see below), were introduced into the laoB sequence by PCR using the oligonucleotides HA3laoB-139F and SM5laoBmut+42R (3′ mutation fragment) and SM3laoBmut-16F and HA5laoB+183R (5′ mutation fragment). The plasmid pTS2Cb-∆laoB was obtained by Gibson Assembly, for which the four PCR fragments contain overlapping sequences. In a total reaction volume of 20 μl, 200 fmol of each PCR fragment and the NEBuilder® HiFi DNA Assembly Master Mix (NEB) were incubated at 50 °C for 4 h. Two μl of the reaction were transformed into E. coli Top10 and plated on LB agar with 120 μg/ml ampicillin and 20 μg/ml chloramphenicol. Next, the mutation cassette was amplified by PCR using pTS2Cb-∆laoB as template and the PCR product of correct size was purified from an agarose gel (GenElute™ Gel Extraction Kit; Sigma-Aldrich). EHEC was transformed with the plasmid pSLTS and, subsequently, with 75 ng of the mutation cassette. After incubation for 3 h at 30 °C and 150 rpm in SOC medium, the cells were plated on LB-agar plates with 120 μg/ml ampicillin and 20 μg/ml chloramphenicol and incubated at 30 °C. One colony per plate was suspended in PBS. One hundred μl of a 1:10 dilution in PBS were plated on LB agar with 120 μg/ml ampicillin and 100 ng/ml anhydrotetracycline for I-SceI induction and incubated at 30 °C overnight. Several colonies were streaked on LB agar with 20 μg/ml chloramphenicol and on plain LB agar and incubated at 37 °C overnight. Colonies that were only able to grow on plain LB were selected and the genomic area surrounding the introduced point mutations was amplified by PCR.
In addition to the premature stop codon, the restriction enzyme cut site for MnlI was deleted, which was screened for by restriction digest of the PCR products with this enzyme. Correct introduction of the three point mutations was assumed for PCR products that were not digested by MnlI and was confirmed by Sanger sequencing (Eurofins).
Competitive growth assays
The experiment was performed in biological triplicates. Significance was calculated by the two-tailed Student’s t-test.
Complementation of EHEC ∆laoB
To complement the genomic laoB knockout mutation, the intact laoB ORF was supplied in trans on a plasmid. First, the sequence of laoB was amplified by PCR and restriction enzyme cut sites for NcoI and HindIII were introduced. The PCR product was cloned into the plasmid pBAD/Myc-His-C and the plasmid was transformed into EHEC ∆laoB. As a negative control, a plasmid containing the mutated laoB gene (∆laoB) was cloned. Next, competitive growth experiments were performed as described above using EHEC ∆laoB + pBAD-laoB (complementation) and EHEC ∆laoB + pBAD-∆laoB (control). Both overnight cultures were supplemented with 120 μg/ml ampicillin and the cultures were mixed in equal ratio. Ten ml of either 0.5 × LB or 0.5 × LB + 20 mM L-arginine were inoculated 1:3000 in quadruplicates. Induction of the laoB frame (present either as wild type or as ∆laoB) was performed with 0.002% arabinose. After incubation at 37 °C and 150 revolutions per minute (rpm) for 18 h, plasmids were isolated using the GenElute™ Plasmid Miniprep Kit (Sigma-Aldrich). Using 20 ng of isolated plasmid, PCR was performed with the oligonucleotides pBAD+208F and pBAD+502R. The PCR products were Sanger sequenced (Eurofins) and the ratio of intact laoB to translationally arrested ∆laoB was determined in percent. The experiment was performed in biological triplicates. Significance of changes was calculated by the two-tailed Student’s t-test.
Transcriptome and translatome sequencing
RNAseq and RIBOseq data sets of Hücker et al. were investigated with respect to translated ORFs located in antisense to annotated genes. Briefly, the bacteria had been grown under the following growth conditions: LB medium at 37 °C, harvested at OD600 0.4; BHI medium at 37 °C, harvested at OD600 0.1; and BHI medium supplemented with 4% NaCl at 14 °C, harvested at OD600 0.1. An ORF was considered translated when (i) it was covered with at least one read per kilobase per million mapped reads (RPKM ≥ 1), (ii) ≥ 50% of the ORF was covered with RIBOseq reads, and (iii) the ribosomal coverage value (RCV) was at least 0.25 in both biological replicates. All three requirements were met for laoB, which was verified by visual inspection using the Artemis genome browser.
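The three criteria can be written down as a short Python sketch. Several points are assumptions of this illustration rather than statements from the methods above: the RCV is taken to be the ratio of the RIBOseq RPKM to the RNAseq RPKM, the RPKM threshold is applied to the RIBOseq signal, and all three criteria (not only the RCV) are required in every replicate. The class and function names are hypothetical.

```python
# Sketch of the "translated ORF" decision rule. Assumptions (labeled in the
# lead-in): RCV = RIBOseq RPKM / RNAseq RPKM; RPKM threshold applied to the
# RIBOseq signal; all criteria checked per biological replicate.
from dataclasses import dataclass


@dataclass
class Replicate:
    rnaseq_rpkm: float   # transcriptome signal
    riboseq_rpkm: float  # translatome (ribosomal footprint) signal
    orf_coverage: float  # fraction of the ORF covered by RIBOseq reads (0..1)


def rpkm(read_count, orf_length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads."""
    return read_count * 1e9 / (orf_length_bp * total_mapped_reads)


def rcv(rep):
    """Ribosomal coverage value, assumed here to be translatome/transcriptome."""
    if rep.rnaseq_rpkm <= 0:
        return 0.0
    return rep.riboseq_rpkm / rep.rnaseq_rpkm


def is_translated(replicates, min_rpkm=1.0, min_coverage=0.5, min_rcv=0.25):
    """Apply the three thresholds to every biological replicate."""
    return all(
        rep.riboseq_rpkm >= min_rpkm
        and rep.orf_coverage >= min_coverage
        and rcv(rep) >= min_rcv
        for rep in replicates
    )
```

Keeping the thresholds as keyword arguments makes it easy to explore how sensitive a candidate ORF call is to the chosen cutoffs.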
Bioinformatics methods to characterize laoB
Prediction of σ70 promoters
The region 550 bp upstream of the start codon of laoB was searched for the presence of a σ70 promoter with the program BPROM (Softberry). The linear discriminant functions (LDF) score is a measure of promoter strength; an LDF score of 0.2 indicates the presence of a σ70 promoter with 80% accuracy and specificity.
Prediction of alternative σ-factors
The search for alternative σ-factors was performed manually. The sequence 50 bp upstream of the detected transcription start site (TSS) was compared to the consensus motifs of σ28, σ32, and σ54.
Prediction of ρ-independent terminators
The region 300 bp downstream of the stop codon of laoB was searched for the presence and folding energy of a ρ-independent terminator using the program FindTerm (Softberry).
Prediction of Shine-Dalgarno (SD) sequence
The free energy ∆G° of the region 30 bp upstream of the start codon of laoB was calculated according to Ma et al. The perfect SD sequence taAGGAGGt has a ∆G° of −9.6. A ∆G° of −2.9 is considered the threshold for the presence of an SD sequence.
Detection of annotated homologs
The AA sequence of LaoB, encoded by laoB, was used to query the GenBank database with blastp using default parameters.
LaoB was submitted to the software PredictProtein. The methods PROFphd (secondary structure), TMSEG (transmembrane helices), DISULFIND (disulfide bonds) and LocTree2 (subcellular localization) were used.
Phylogenetic tree construction
For evolutionary analysis of laoB and ECs5115, tblastn was used with an e-value cutoff of 0.001 and at least 50% identity; tblastn allows a search for nucleotide sequences homologous to a protein sequence query in all genomic sequences of the database, independent of their annotation status [39, 40]. For the short LaoB sequence, tblastn was not sensitive enough to detect all existing genomic sequences; hence, hits matching ECs5115 were used for subsequent laoB analysis. Continuous laoB ORFs were detected in a total of 497 Escherichia and 18 Shigella strains (see results). However, a large number of genes had the very same laoB sequence. Thus, 11 LaoB-encoding sequences, representing the diversity of continuous laoB genes, were chosen as examples. Likewise, exemplary ECs5115 sequences within a broad range of different sequence identities were downloaded from the database and used for phylogenetic analysis. Multiple sequence alignments of ECs5115 and LaoB homologs were conducted using MUSCLE implemented in MEGA6. The automated alignments were manually checked and adapted where necessary. Parts encoding sequences homologous to LaoB were manually identified in the −2 frame of ECs5115. Sequences with no obvious similarity to laoB were identified by pairwise alignments of the nucleotide sequence of the −2 frame of the respective ECs5115 homolog (EMBOSS Needle). The area which aligned to laoB was translated to its AA sequence and further aligned outside the initial region by multiple sequence alignments, as before. The laoB sequence was found to be discontinuous only outside the Escherichia/Shigella clade: indel-like sequence insertions and internal stop codons are present in sequences of bacteria outside Escherichia/Shigella, which encode peptide fragments shorter than 41 AA or AA sequences that are very different from LaoB.
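The hit selection by e-value and identity can be sketched as a small Python filter. The sketch assumes that the tblastn results were exported in the standard 12-column tabular BLAST+ format (`-outfmt 6`), which is not stated in the methods; it also treats both thresholds as inclusive. The function names and the example hit lines are hypothetical.

```python
# Sketch: keep tblastn hits with e-value <= 0.001 and identity >= 50%.
# Assumes BLAST+ tabular output (-outfmt 6) with the default column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore


def parse_hit(line):
    """Parse one tab-separated hit line into the fields needed for filtering."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "subject": fields[1],
        "pident": float(fields[2]),
        "evalue": float(fields[10]),
    }


def filter_hits(lines, max_evalue=1e-3, min_identity=50.0):
    """Return hits passing both thresholds; blank and comment lines are skipped."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_hit(line)
        if hit["evalue"] <= max_evalue and hit["pident"] >= min_identity:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    example = [
        "LaoB\tgenomeA\t97.5\t41\t1\t0\t1\t41\t500\t378\t1e-20\t80.1",
        "LaoB\tgenomeB\t45.0\t40\t22\t0\t1\t40\t900\t781\t1e-5\t40.0",
    ]
    for hit in filter_hits(example):
        print(hit["subject"], hit["pident"], hit["evalue"])
```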
Reference phylogenetic trees of the strains and species examined were constructed according to Fellner et al. Briefly, a concatenated sequence of the housekeeping genes 16S rDNA, atpD, adk, gyrB, purA and recA was used. The sequences were aligned using ClustalW in MEGA6. Columns with gaps or ambiguities were removed. The final dataset contains 7484 positions. The best nucleotide substitution model was searched for using MEGA6. The final Maximum-Likelihood tree was calculated using Neighbor Joining and bootstrapped 1000 times. The best nucleotide substitution model for tree construction was identified to be the General Time Reversible (GTR) model, which had the lowest Bayesian Information Criterion (123,336.358). The non-uniformity of evolutionary rates among substitution sites was modeled using a discrete Gamma distribution with five rate categories (+G, parameter = 0.5494). The log likelihood value of the final tree was −61,963.20.
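The concatenation of per-gene alignments and the removal of columns with gaps or ambiguities can be illustrated with a short Python sketch. In the study this step was carried out in MEGA6; the code below is only an illustration with hypothetical function names, and it assumes that each gene alignment lists the strains in the same order and that only A, C, G and T count as unambiguous.

```python
# Sketch: concatenate aligned housekeeping genes per strain and drop every
# alignment column that contains a gap or an ambiguous nucleotide.
# Assumption: only A, C, G, T are treated as unambiguous characters.

UNAMBIGUOUS = set("ACGT")


def concatenate(per_gene_alignments):
    """Join the aligned sequences of each strain across genes.

    per_gene_alignments: list of alignments, each a list of equally long
    strings with the strains in the same order.
    """
    if not per_gene_alignments:
        return []
    n_strains = len(per_gene_alignments[0])
    return ["".join(gene[k] for gene in per_gene_alignments) for k in range(n_strains)]


def strip_columns(aligned):
    """Remove all columns in which any sequence has a gap or ambiguity code."""
    if not aligned:
        return []
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must have equal length")
    keep = [
        i for i in range(len(aligned[0]))
        if all(seq[i].upper() in UNAMBIGUOUS for seq in aligned)
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]
```

The number of columns that survive this filter corresponds to the "positions" of the final dataset.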
Detection of a transcribed and translated antiparallel overlapping ORF
Table 1 Transcription and translation of laoB (part 1A) and its mother gene ECs5115 (part 1B) at the three different growth conditions indicated. The RPKM values of the transcriptome (RNAseq) and the translatome (RIBOseq) data for the overlapping novel gene and the annotated mother gene are listed, including the RCV, indicating their translatability. ORF coverage is the fraction of a gene sequence which is covered by RIBOseq reads. In addition, the corresponding data for the putative overlapping gene laoA (compare Fig. 2a) are shown (part 1C). Growth conditions listed in each part: LB, 37 °C; BHI, 37 °C; BHI + 4% NaCl, 14 °C.
Characterization of laoB promoter region
Expression of a LaoB-EGFP fusion protein
The translationally arrested mutant ∆laoB shows a growth advantage in arginine-containing media
To find a potential phenotype, competitive growth experiments with EHEC wild type and ∆laoB were performed. The equal-ratio mixture of wild type and mutant was incubated under different conditions and a potential growth advantage was determined from the ratio of the wild type and mutant genes at the endpoint. When LB medium was supplemented with 20 mM L-arginine, a phenotype was detected: EHEC ∆laoB displayed a significant growth advantage, indicated by a ratio of wild type to mutant of 15:85 (Fig. 5b). Thus, the competitive index (CI) is 13.6. No phenotype was found under any other condition tested (Additional file 3).
Intact laoB, cloned into pBAD-myc-His-C, should restore the phenotype of EHEC wild type. Therefore, competitive growth experiments using EHEC ∆laoB carrying pBAD-laoB against EHEC ∆laoB + pBAD-∆laoB (mutant control) were performed. Expression of laoB was induced with arabinose. As expected, in plain LB, the ratio between the two strains tested did not change, regardless of whether the plasmid-borne laoB was induced or not (Fig. 5c). In contrast, in LB supplemented with 20 mM L-arginine, the strain carrying the functional laoB copy showed a significant growth disadvantage when induced with arabinose. This reflects the competitive growth phenotype of the wild type strain compared to the mutant strain (Fig. 5b). Thus, translationally arrested laoB can be complemented in trans.
Phylostratigraphic analysis of laoB
This study provides evidence for a novel overlapping gene pair, laoB/ECs5115, in EHEC. Transcription and translation of a short ORF, embedded in the antisense reading frame −2 to a CadC-like transcriptional regulator, were detected by RNAseq and RIBOseq at optimal growth conditions. Translational knockout of the ORF by a premature stop codon resulted in a significant growth advantage of the mutant strain over the wild type strain in competitive growth in LB medium supplemented with L-arginine. Consistently, the activity of the putative σ32 promoter is repressed by L-arginine. Whether laoB is part of an overlapping operon together with laoA, located upstream of laoB, is unknown. laoA was not examined, since its transcription and translation appeared to be very weak under the conditions tested.
Is laoB a protein-coding gene?
laoB might function as a novel non-coding RNA (ncRNA) instead of a novel protein-coding gene. However, this appears to be unlikely for the following reasons. First and most importantly, the experimental data presented here confirm the protein-coding character of laoB, since the ORF is covered by RIBOseq reads (Fig. 1). RIBOseq signals clearly indicate active translation of an RNA molecule [44, 45]. In LB medium, the ORF has a very high RCV (Table 1A), which is much higher than the mean RCV of 1.55 that we found for short annotated EHEC genes [26, 43]. In addition, stable translation into a protein was further confirmed by the expression of a LaoB-EGFP fusion protein (Fig. 4). Second, a translationally arrested mutant led to a clear phenotype, which could be complemented by the wild type sequence in trans using just the laoB ORF without any adjacent sequence attached (Fig. 5). If laoB functioned as an antisense ncRNA, it would regulate its targets by base pairing with complementary mRNAs. It appears unlikely that a translationally arrested mutant, which changes only ~0.5% of the nucleotides of the complete laoB transcript, would exert such a dramatic phenotype. Third, an SD sequence is present 15 bp upstream of the start codon (Fig. 2c). The distance of the SD sequence to the start codon is within the natural range observed and the detected sequence is close to the SD consensus motif, resulting in strong ribosomal binding. Finally, the laoB ORF has been annotated in E. albertii as a protein-coding gene.
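The figure of roughly 0.5% can be reproduced from numbers given elsewhere in the text: three point mutations, a TSS 289 bp upstream of the start codon, a 41 AA protein, and transcription termination 155 bp downstream of the stop codon. The short Python sketch below makes the arithmetic explicit. It assumes that the ORF comprises 41 sense codons plus one stop codon and that the two distances are measured from the first nucleotide of the start codon and the last nucleotide of the stop codon, respectively; the function name is hypothetical.

```python
# Worked arithmetic: fraction of the laoB transcript changed by the three
# point mutations. Assumes ORF = 41 sense codons + 1 stop codon and that the
# 5' and 3' distances flank the ORF without overlap.


def mutated_fraction(n_mutations, utr5_nt, n_codons, utr3_nt, stop_codon_nt=3):
    """Return (transcript length in nt, fraction of mutated nucleotides)."""
    transcript_nt = utr5_nt + 3 * n_codons + stop_codon_nt + utr3_nt
    return transcript_nt, n_mutations / transcript_nt


if __name__ == "__main__":
    length, fraction = mutated_fraction(3, 289, 41, 155)
    print(f"{length} nt, {100 * fraction:.2f}% mutated")
```

Under these assumptions the transcript is 289 + 123 + 3 + 155 = 570 nt long, so three substitutions affect about 0.53% of its nucleotides.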
Putative function of LaoB
The results presented in this work provide first hints towards a potential LaoB function. The region 300 bp upstream of the determined TSS shows significant promoter activity under all conditions investigated (Fig. 3). The laoB promoter is probably recognized by the alternative σ-factor σ32, since a sequence very similar to the σ32 consensus motif is present at the proper distance to the TSS (Fig. 2b). The first T of the −35 box and the A of the −10 box are completely conserved in σ32 promoters and both nucleotides are present in the σ32 promoter region of laoB. Additionally, the spacer between the −35 and −10 box has the optimal length of 14 bp, and σ32 promoters with this spacer length tolerate a substitution of the second C of the tetra-C motif of the −10 box without losing promoter strength, which is also the case here. In addition, the distance between the −10 box and the TSS is in the optimal range of 6 bp. Transcription of heat shock genes is induced by σ32. Accordingly, transcription of laoB is almost switched off at cold stress (Table 1A). The σ32 stress regulon includes chaperones, transcription factors, DNA/RNA surveillance proteins, and many membrane-associated proteins. In this study, the promoter has the highest activity in LB supplemented with NaCl and at acidic conditions (Fig. 3). Interestingly, σ32 is also the master regulator of the transcription factor PhoPQ, which is also induced at acid stress.
In our hands, EHEC ∆laoB only showed a clear phenotype after supplementing the medium with L-arginine (Fig. 5b). As a proteinogenic AA, L-arginine is involved in many central metabolic pathways. Bacteria synthesize L-arginine from glutamate or take it up from the environment by three different transporters. Arginine can be utilized as sole carbon and nitrogen source and is the substrate for the synthesis of polyamines. Here, high L-arginine concentrations resulted in a significantly reduced activity of the laoB promoter, and the EHEC wild type had a clear growth disadvantage in competitive growth. These observations would agree with the speculation that LaoB might be involved in enhancing L-arginine uptake. In many EHEC reservoirs, nutrient concentrations, including L-arginine concentrations, are low and efficient uptake represents an advantage. The high arginine concentrations used in this study are unlikely to occur naturally. Therefore, under environmental conditions that are low in arginine, intact LaoB may confer a growth advantage. The hypothesis that LaoB somehow interacts with arginine transport is supported by two observations: a high proportion of small proteins (LaoB has a size of only 41 AA) associates with the cell membrane, in which transporters are located [50, 51], and the σ32 regulon includes many membrane proteins. However, testing this speculation and further functional characterization of LaoB must await future studies.
Origin of laoB by overprinting
The time of origin of an OLG can be estimated by phylostratigraphic analysis, comparing the phylogenetic distribution of the mother gene and the overlapping gene [18, 52]. The intact laoB ORF is only present in Escherichia and Shigella strains (Fig. 6), while the annotated gene ECs5115 has a much broader taxonomical distribution (i.e., a higher conservation level) and is present in both Gram-negative and Gram-positive bacteria (Additional file 5). We conclude that laoB originated recently and might be an interesting example of de novo gene birth by overprinting [18, 52, 53]. This would mean that a number of point mutations in the ECs5115 sequence created the laoB ORF, including its regulatory sequences, after the Escherichia/Shigella clade separated from Salmonella. One may postulate that a weak σ32 promoter sequence was already present at the proper location by chance and, later, may have been further optimized by additional point mutations, leading to increased transcription of the novel ORF. The resulting (m)RNA must have been used as template for translation, perhaps based on a weak ribosomal binding site which happened to be present upstream of the start codon. One must then assume that the AA chain was functional ab initio by chance, conferring a fitness advantage to the cell. At this early evolutionary stage, a novel gene is volatile and the process is reversible, such that the novel ORF can be lost again. A fitness gain related to L-arginine metabolism may have led to fixation of the functional allele in the population by Darwinian evolution. Because EHEC colonizes many hosts and environments, which requires expression of different sets of genes [56, 57], LaoB might improve its fitness in one of those species-specific niches. Alternatively, the novel ORF could have been fixed by neutral evolution together with the mutated mother gene.
Later on, extension at the 3′ end by the loss of a stop codon may occur, leading to an elongation of the novel protein; this would be more likely than 5′ elongation due to regulatory elements in the 5’UTR. This speculative order of events has some similarities to the proto-gene hypothesis of Carvunis et al., which deals with the potential de novo origin of short genes in intergenic regions of the yeast S. cerevisiae.
In EHEC, only two other antiparallel overlapping gene pairs, in which a young gene also may have originated recently de novo by overprinting, have been characterized functionally [13, 53]. For E. coli K12, two additional antiparallel overlapping gene pairs have been described, yghX/modA and tnpA/astA, which might also have originated by overprinting. Another OLG pair exists in Streptomyces coelicolor: the knock-outs of the antiparallel overlapping genes dmdRI and adm both show a phenotype. In addition, an annotated OLG pair exists in Bacillus subtilis.
Whether de novo birth of genes in antisense to annotated genes is more frequent than presumed is still open for discussion, but has been suggested by Haycocks and Grainger based on the frequent binding of transcriptional regulators in intragenic locations. Whereas gene duplication followed by neofunctionalization or subfunctionalization, which is the established theory for the origin of new genes, produces just variants of existing sequences, overprinting would allow for the rapid creation of true novelty.
Strand-specific RNAseq and RIBOseq are well suited to identify translated ORFs located in antisense to annotated genes. Frequent antisense transcription is observed in all RNAseq experiments, but almost all signals have been interpreted as ncRNA. However, RIBOseq has already confirmed translation of many antisense RNAs in eukaryotes [66, 67, 68], and this method identified numerous overlooked small genes in the intergenic regions of different bacteria [69, 70, 71]. Therefore, improved genome annotation algorithms are required which do not systematically dismiss small and/or overlapping genes [8, 72, 73]. Integration of transcriptomic, translatomic, and other experimental data into annotation pipelines would increase specificity and sensitivity for the prediction of novel small genes [74, 75, 76]. Additionally, improved proteomic methods are necessary which do not miss small non-annotated proteins [77, 78]. In any case, functional characterization of novel short genes overlooked to date presents a major future challenge to experimental microbiology. In this paper, we provide an initial functional characterization of, and evidence for overprinting of, a small protein encoded in antisense to an annotated protein-coding gene. We assume such overprinting events could be significant for EHEC (e.g., [13, 14]) and possibly for other bacteria.
This work was funded by the Deutsche Forschungsgemeinschaft DFG (KE740/13–1,2,3, and SCHE316/3–1,2,3). The funding body had no role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.
Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files. The raw sequencing data used are available at the Sequence Read Archive (SRA, NCBI) under the accession SRP113660.
SMH, SSch and KN designed and planned the study. SMH performed the 3′ and 5′ RACE experiments, the promoter activity assays, the competitive growth experiments and the complementation. SSi identified the optimal position for the strand-specific knock-out mutant and RW cloned the mutants. IAS performed the expression of the EGFP-LaoB fusion proteins. Data from those experiments were analyzed by SMH with the help of SSi, RW and IAS. SV performed the phylostratigraphic analysis of the overlapping gene pair. SMH drafted the manuscript with the help of all other authors. SSch and KN supervised writing and critically edited the manuscript. All authors have read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 4. Sharma V, Firth AE, Antonov I, Fayet O, Atkins JF, Borodovsky M, Baranov PV. A pilot study of bacterial genes with disrupted ORFs reveals a surprising profusion of protein sequence recoding mediated by ribosomal frameshifting and transcriptional realignment. Mol Biol Evol. 2011;28(11):3195–211.
- 14. Fellner L, Simon S, Scherling C, Witting M, Schober S, Polte C, Schmitt-Kopplin P, Keim DA, Scherer S, Neuhaus K. Evidence for the recent origin of a bacterial protein-coding, overlapping orphan gene by evolutionary overprinting. BMC Evol Biol. 2015;15(1):283.
- 21. Hücker SM, Simon S, Scherer S, Neuhaus K. Transcriptional and translational regulation by RNA thermometers, riboswitches and the sRNA DsrA in Escherichia coli O157:H7 Sakai under combined cold and osmotic stress adaptation. FEMS Microbiol Lett. 2017;364(2).
- 26. Hücker SM, Ardern Z, Goldberg T, Schafferhans A, Bernhofer M, Vestergaard G, Nelson CW, Schloter M, Rost B, Scherer S. Discovery of numerous novel small genes in the intergenic regions of the Escherichia coli O157:H7 Sakai genome. PLoS One. 2017;12(9):e0184119.
- 34. Yachdav G, Kloppmann E, Kajan L, Hecht M, Goldberg T, Hamp T, Honigschmid P, Schafferhans A, Roos M, Bernhofer M, et al. PredictProtein--an open resource for online prediction of protein structural and functional features. Nucleic Acids Res. 2014;42(Web Server issue):W337–43.
- 43. Neuhaus K, Landstorfer R, Simon S, Schober S, Wright PR, Smith C, Backofen R, Wecko R, Keim DA, Scherer S. Differentiation of ncRNAs from small mRNAs in Escherichia coli O157:H7 EDL933 (EHEC) by combined RNAseq and RIBOseq - ryhB encodes the regulatory RNA RyhB and a peptide, RyhP. BMC Genomics. 2017;18(1):216.
- 47. Koo BM, Rhodius VA, Campbell EA, Gross CA. Dissection of recognition determinants of Escherichia coli σ32 suggests a composite −10 region with an ‘extended −10’ motif and a core −10 element. Mol Microbiol. 2009;72(4):815–29.
- 52. Pavesi A, Magiorkinis G, Karlin DG. Viral proteins originated de novo by overprinting can be identified by codon usage: application to the “gene nursery” of Deltaretroviruses. PLoS Comput Biol. 2013;9(8):e1003162.
- 57. Landstorfer R, Simon S, Schober S, Keim D, Scherer S, Neuhaus K. Comparison of strand-specific transcriptomes of enterohemorrhagic Escherichia coli O157:H7 EDL933 (EHEC) under eleven different environmental conditions including radish sprouts and cattle feces. BMC Genomics. 2014;15:353.
- 60. Kurata T, Katayama A, Hiramatsu M, Kiguchi Y, Takeuchi M, Watanabe T, Ogasawara H, Ishihama A, Yamamoto K. Identification of the set of genes, including nonannotated morA, under the direct control of ModE in Escherichia coli. J Bacteriol. 2013;195(19):4496–505.
- 62. Wang LF, Park SS, Doi RH. A novel Bacillus subtilis gene, antE, temporally regulated and convergent to and overlapping dnaE. J Bacteriol. 1999;181(1):353–6.
- 65. Dornenburg JE, Devita AM, Palumbo MJ, Wade JT. Widespread antisense transcription in Escherichia coli. MBio. 2010;1(1).
- 71. Neuhaus K, Landstorfer R, Fellner L, Simon S, Marx H, Ozoline O, Schafferhans A, Goldberg T, Rost B, Küster B, et al. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC). BMC Genomics. 2016;7:133.
- 73. ÓhÉigeartaigh SS, Armisén D, Byrne KP, Wolfe KH. SearchDOGS bacteria, software that provides automated identification of potentially missed genes in annotated bacterial genomes. J Bacteriol. 2014;196(11):2030–42.
- 75. Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muniz-Rascado L, Garcia-Sotelo JS, Weiss V, Solano-Lira H, Martinez-Flores I, Medina-Rivera A, et al. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res. 2013;41(Database issue):D203–13.
- 78. Willems P, Ndah E, Jonckheere V, Stael S, Sticker A, Martens L, Van Breusegem F, Gevaert K, Van Damme P. N-terminal proteomics assisted profiling of the unexplored translation initiation landscape in Arabidopsis thaliana. Mol Cell Proteomics. 2017;16(6):1064–80.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.