Background

Non-protein-coding RNA (ncRNA) elements play an important role in biological pathways, such as gene regulation [1-4]. It has been shown that conformational features of many such RNA elements play a major part in their biological function [5,6]. In bacteria, RNA structural rearrangements can have a major effect on the expression of their downstream coding sequences (reviewed by [7]), a process known as cis-regulation. A classic example, and one of the earliest such elements discovered, is the complex regulatory mechanism that takes place upstream of the tryptophan operon in Escherichia coli during its expression [8]. Regulation of the tryptophan biosynthetic operon, however, is achieved through very different mechanisms in other organisms, such as B. subtilis and Lactobacillus lactis (reviewed by [9]). With much attention given to protein-coding genes in the past, the introduction of ncRNAs gene finders have become a relatively new area of genomic research [10]. Currently, many general-purpose [11-13] as well as ncRNA-specific gene finders, such as [14-16] are available.

Riboswitches

An interesting group of RNA regulatory elements are riboswitches. Riboswitches are defined as regulatory elements that take part in biological pathways by selectively binding to a specific ligand or metabolite, or uncharged tRNAs, without the need for protein factors. Environmental factors such as pH [17], ion concentration [18-20], and temperature [21,22] can also trigger RNA conformational changes affecting gene regulation. Nearly all riboswitches are located in the non-coding regions of messenger RNAs [23] and are capable of regulating genes through both activation and attenuation of either transcription or translation (reviewed by [24]). Finally, other factors such as the transcription speed of RNA polymerase, the folding and unfolding rates of the aptamer of the riboswitch, and the binding rates of the metabolites add other dimensions to categorizing riboswitches. These and other factors influence the RNA switching mechanism to be kinetically or thermodynamically driven. In addition to thermodynamics-based approaches, RNA-kinetics have been gaining momentum in riboswitch-mediated regulation studies at the system level. Lin and Thirumalai [25] introduces a kinetic feedback-loop network model that describes the functions of riboswitches using experimental data from flavin mononucleotide (FMN) riboswitch.

Originally found through sequence homology upstream of bacterial coding regions [26-28], riboswitches have been shown to be more abundant than previously expected. They have also been found in cooperative or tandem arrangements [23]. It is speculated that there are at least 100 more undiscovered riboswitches in already sequenced bacterial genomes [23]. Conformational factors are essential to ligand-binding specificity of riboswitches. Many riboswitches can discriminate between similar small molecules with the aid of their structural geometry. For instance, the thiamine pyrophosphate (TPP) and S-adenosylmethionine (SAM) riboswitches measure the length of the ligand that binds to them [29-31].

RNA secondary structure

The secondary structural topology of the RNA is very effective in scaffolding the tertiary conformation. Secondary structure mainly consists of a two-dimensional schema that depicts the base-pairing interactions within the RNA structure and is dominated by Watson-Crick base-pairing. One major computational method to predict RNA secondary structure is minimization of its free energy (MFE) within a thermodynamic ensemble, such as the Boltzmann ensemble [32,33]. State-of-the-art thermodynamic models have proven to be effective in RNA secondary structural predictions in most cases. An example of where such predictions fail would be Hammerhead type I ribozyme where loop tertiary interactions have a dominating effect on the structural conformation [34]. Centroids of the Boltzmann ensemble are also used for RNA secondary structural predictions [35]. In many cases, such a prediction is more similar to the structure inferred from comparative sequence analysis than the MFE structure is [35]. In addition, Stochastic context-free grammars (SCFG) have shown to be effective in secondary structural prediction of various RNA regulatory elements. Nawrocki and Eddy, 2013 [13] have shown that more sophisticated grammars, designed to mirror the thermodynamic models can improve the prediction accuracy of structures, once trained on known RNA structures based on maximum-likelihood criteriaa.

Most of the discovered prokaryotic RNA regulatory elements (including riboswitches) are located upstream of the genes they regulate. They act as cis-regulatory elements and exhibit strong secondary structural conservation. Some exceptions to cis-regulation are two trans-acting SAM riboswitches [36] and an antisense regulation of a vitamin B 12-binding riboswitch [37] in Listeria monocytogenes. Insights into structural and functional complexity of riboswitches already discovered are offered in [38]. Purine riboswitches are good examples of secondary structural conservation. The add adenine riboswitch from V. vulnificus and the xpt guanine riboswitch from B. subtilis have very similar secondary and tertiary conformations, despite different crystal packing interactions, pH, and Mg crystallization conditions [39]. In fact, investigation of secondary-structural homology upstream of genomic regions containing the same genes has led to the discovery of more cis-regulatory elements in bacteria [40,41], making them the major current approach for riboswitch identification.

The fact that riboswitch discovery is mainly based on homology makes it difficult to assess how much secondary structural conservation is expected to be prevalent in undiscovered riboswiches. Furthermore, structural homology is not always successful in finding riboswitches. Despite [42]’s rigorous sequence and structural homology searches based on the SAM-I riboswitch, the SAM-IV riboswitch could not be detected. The authors further hypothesized that the structural diversity of riboswitches could be far greater than what has been already observed. Serganov and Nudler, 2013 [38] suggest that there may not even be an interconnection between the structures of riboswitches and the nature of their cognate metabolites and consequently, the biochemical and structural information gathered so far may not be as useful in riboswitch validation as expected. The above limitations of homology-based riboswitch identification methods indicate the need for an alternative approach.

Conformational dynamics

While secondary-structure conformational features are very descriptive of many classes of riboswitches, their folding dynamics are also critical. A typical example is the TPP riboswitch which can fold into alternative structures depending on the presence of the TPP ligand. The tertiary structure stabilized in the presence of TPP is shown in Figure 1A [43]. Both the ligand-bound and the unbound secondary structures necessary for TPP riboswitch regulatory function are shown in Figure 1B. One of the major computational tools to explore possible folding trajectories is the free energy landscape. The free energy landscape was originally defined for protein folding [44]. In a typical RNA free energy landscape, possible conformations are shown with their corresponding free energy and pairwise distances from one another. In an effort to investigate the thermodynamic equilibrium of RNA folding, Quarta et al. [45] presented a case study of the energy landscape of the TPP riboswitch where the base-pairing distances between the structural possibilities form two major clusters. The clusters corresponded to native and ligand-bound structural conformations. After repeating this process for various choices of elongation of the TPP riboswitch, they showed that for certain ranges of length, each cluster corresponds to one of the two structures of the riboswitch (see Figure 1C).

Figure 1
figure 1

Energy Landscape of The TPP Riboswitch.A: Tertiary structure of an E. coli TPP (or thi-box) riboswitch bound to thiamine pyrophosphate [43]. The image was generated by the Jmol from the PDB:2hoj structure taken from the Rfam website [89]. B: Ligand-bound and unbound secondary structures of a TPP riboswitch in B. subtilis, taken from [45]. C: Energy landscape of the B. subtilis riboswitch taken from [45]. Set-1 and Set-2 clusters correspond to the two mutually exclusive secondary structures of the TPP riboswitch. Pairwise Base-pairing distance used as a measure of distance between two structures. Please refer to [45] for detailed information about the figure and clustering details.

In [46], the dynamics of energy landscapes across elongation of various riboswitches were investigated and it was shown that such landscapes have different clustering dynamics across kinetically and thermodynamically driven riboswitches. This work highlights the fact that even in a kinetically-driven regulation scenario, investigation of the dynamics of the thermodynamic equilibrium across the elongation can be informative. In a more recent work, energy landscape analyses led to strong evidence of evolutionary co-variation of base-pairs that favor a conserved alternative structure of the purine riboswitch [47]. In addition, prediction of structural switching in RNA has been addressed by [48,49] using abstract shapes to represent different secondary structural conformations. Freyhult et al. 2007 [50,51] examined the lowest free energy structural conformations having a certain base-pairing distance to the actual structure of the RNA to explore the structural neighbors of an intermediate, biologically active structure. A more recent work [52] presents an ingenious and a significant decrease of computational consumption of estimating the likelihood of structural neighbors. However, to date there is no computational method that can identify the diverse and structurally complex riboswitches with high confidence.

Investigation into the folding dynamics of the nascent RNA based on free energy sampling and pair-wise distances can be computationally costly. Finding a sample size that sufficiently reflects the RNA folding space behavior can be difficult and prone to model parameter biases. Furthermore, even if optimized parameters and sufficient samples were available, it would still be difficult to make comparisons across RNA elements. The latter is mainly due to the fact that the characteristics of such folding distributions (here, free energy vs. structural distance within a given ensemble of secondary structures) are not well understood.

One statistic to evaluate the distribution characteristics of any probabilistic model is the Shannon entropy [53]. While the conformation with maximum-likelihood under a given SCFG is referred to as the optimum structure under that model, all of the other sub-optimal conformations can be associated with a probability. Hence, the Shannon entropy (expected log-likelihood) of such a probabilistic folding space is \(H(S)=-\sum _{s\in S} p(s)\log {p(s)}\), where S is the folding space containing all possible secondary structures s valid on the desired RNA sequence, each of which associated with the corresponding probabilities of occurrence p(s). Here, the notion of probability can also be interpreted as the frequency of occurrence of a particular conformation for the RNA sequence. Alternative formulations and approximations of Shannon entropy exist in RNA secondary structure studies, such as [54]. Exact calculations of Shannon entropy under a given SCFG as a probabilistic secondary structural folding model, however, was done in [55] and shown to be computationally convenient achievable in polynomial time O(n 3), where n is the length of the RNA sequence. In an independent work, [56] also offered an algorithm to calculate the Shannon entropy of the stochastic context-free grammar BJK [57] with parameter sets derived from a given alignment. Other measures of structural diversity such as ensemble diversity computed by RNAfold -p in the Vienna RNA Software Package [58] also exist. In this work, structural diversity is measured by the exact RNA secondary structural information theoretic-uncertainty (or here, Shannon entropy) of the complete SCFG-modeled folding space of the RNA, as computed by [55]. From hereon, we refer to this measure as structural entropy. We investigated the significance of structural entropy of RNAs with more than one biologically functional secondary structural conformation. A diverse set of prokaryotic RNA elements, validated to have such potential were used for this purpose. The performance of structural entropy to distinguish riboswitches was compared to other similar features under different negative-control sets. We then made an attempt to develop a computational method for riboswitch identification via structural entropy on a genome-wide level. The goal of the presented results of the genome-wide tests, however, is mainly exploratory and aim to investigate the genomic regions or elements that the developed method is highly sensitive to.

It has been previously shown that both high and low structural entropy values of certain classes of ncRNAs can be potentially significant. For instance, for certain riboswitches, GC-composition was co-associated with significantly high structural entropy, regardless of model accuracy to RNA secondary structure [55]. This observation raised the possibility that RNAs under selective pressure to have alternative folds, may have higher (not lower) structural entropy than expected. As discussed previously in [55], this seemingly nonintuitive observation is not theoretically impossible. The above intuition lies at the center of the proposed methodology, as will be shown.

Our approach

Folding models

The folding model for which the structural entropy of the RNA is computed is very critical. SCFG folding models can be very lightweight and consist of only few grammar rules and parameters, or they can be very sophisticated consisting of thousand parameters [13,59]. In [55], it was shown that the structural entropy value is very model sensitive. On the other hand, parameters of SCFG models are usually set by maximizing their prediction accuracy using maximum-likelihood approaches. There is no guarantee, however, that folding models optimized for such criteria also preserve information about folding dynamics of such RNAs. Increasing the accuracy of folding models under current approaches may be done at the expense of altering the folding space of possible structures under that model, thus losing the information about folding dynamics of the RNA. In order to avoid potential biases in our preliminary examination, it was essential that we include models not trained to best predict secondary structure in addition to models that do. Two different SCFG models were chosen for this study, one being a structurally unambiguous SCFG model with parameters trained to best predict RNA secondary structure, and one being a structurally ambiguous model with symmetric rules and probabilities. The theoretical implications of structural ambiguity may fall outside the scope of this work and the interested reader can refer to [55]. Here, we merely treat them as two different folding models.

Gathering data

There is a significant amount of sequence and/or structural similarity within each class of riboswitch. This is due to the fact that these riboswitches have been discovered using sequence and/or structural homology. Here, however, we are interested in capturing the universal characteristics of RNAs with alternative fold(s), mainly riboswitches, as a basis for an identification method for conformational switches. In order for our method to be less biased towards a specific structural conformation, we avoided using homologous RNA sequences or sequences that belong to closely related organisms, where possible. We also resorted to only evaluating riboswitches that have been experimentally validated to be functional rather than computationally discovered ones. The data set gathered in this work is a compromise between the above considerations and the need to include a diverse set of riboswitches in our data set. Although the attempt to computationally extract a universal feature from the diversity of prokaryotic riboswitches each having unique structural and biological characteristics is a great oversimplification, it serves as a common ground for comparing various features that aim to capture the RNA conformational dynamics as a whole.

Negative controls

One of the main challenges of our test, was the preparation of a reliable negative control. Folding models deployed here are very lightweight and simplistic, giving rise to potential unrelated dependencies to the factors such as genomic composition of RNA sequence. Therefore, gathering real biological sequences that are as similar to RNA sequences as possible while not having potential for alternative fold(s) is very critical to the significance of our test. Here, we relied on the following sets of negative controls: 1. dinucleotide shuffles of riboswitches (generated using [60]), 2. Mutagenesis; Structural mutants of the gathered sequences experimentally tested for not being functional, 3. The reverse complements (or antisense sequences) of gathered riboswitches, and 4. Sequences of the non-coding regions that are likely to be riboswitches. The choice of antisense as a negative control is explained in the Methods section.

Comparison to other methods

Two additional measures of structural diversity were used to assess the significance of structural entropy values in collected data. The first measure was the base-pairing entropy [54] of the BJK model BJKbp as defined in ([61] Eq. 3). For more information see the Methods section. The second measure, denoted as Sil, was obtained from clustering the RNA energy landscape. The Sil value reflects how well the energy landscape clusters into two. Calculations for Sil were according to [46]. We then compared the performance of classifiers designed to distinguish riboswitches from various negative controls. In order to evaluate the performance of structural entropy to detect alternative fold, we compared it to measures from RNAShapes [49] and FFTbor [52] predictions. These measures corresponded to energy disequilibrium of alternative folds: p1/p2, where p1 was the highest value in the predictions of the corresponding software and p2 was the second highest value. For RNAShapes, p1 is the probability of the most likely abstract shape of structure, whereas p2 is the second most. For FFTbor, p1 is the probability of the MFE structure and p2 is the probability of an alternative folding scenario where the structure has a particular base-pair distance with the MFE structure. Features used in this work are shown in Table 1. Please see the Methods section for further details.

Table 1 List of various sequence and structural features used throughout the work

Results and discussion

The two lightweight SCFG folding models used to calculate structural entropy are denoted here as BJK and RND models, which are taken from the literature (Please see the Methods section). RNA encoded sequence from Bacteria validated to have potential for two alternative folds were gathered from the literature (see Table 2) as representatives of RNAs having potential for alterative folding. This generally consisted of riboswitches and some other ribo-regulators, although we refer to all these sequences as riboswitches, here. A subset of such sequences were selected as the positive control set of sequences having two structures. The criterion for selecting such a subset was minimum length of the RNA that exhibits alternative folds for each sequence. This criterion is further explained in Methods. The resulting set of length variant sequences are described in Tables 2 and 3.

Table 2 Data collection
Table 3 Genomic locations of collected sequences

Mutagenesis

To investigate the relationship of various structural features to the folding space of the riboswitches, we compared their wild-type value to that of structural and non-structural mutants. By structural mutants, we mean those mutant sequences that were designed to disrupt either of the two biologically functional conformations of the riboswitch. These structural mutants, whose regulatory functions had been experimentally investigated, were gathered from the literature. These mutant sequences may not have been naturally occurring biological sequences. Nevertheless, having very similar sequence features to their wild type, this enables us to evaluate the variations of structural features with respect to loss of functionality given closest possible negative controls. The percentage of change in feature values for mutants relative to the wild type is shown in Table 4. If there is a relationship between the features and alternative folds, one would expect the values corresponding to structural mutants (denoted as YES) to be significantly less than that of the wild type and non-structural mutants (denoted as NO). A simplified criterion to calculate the performance of each feature was to define true positives as negative values in structural mutants and true negatives as zero or positive values in non-structural mutants. Hence, we calculated sensitivity and specificity of each feature to structural mutants under the above criterion. Here, Sensitivity and Specificity symbolize the performance of a classifier that, based on the diversity value of the wild type and a non-functional mutant, predicts if the mutant is a structural mutant (denoted YES) or not. The classifier rule here is that structural mutants must have a lower value. Performance of each feature is shown in Table 5. The performance of the base-pairing entropy BJKbp is higher than other features on average. This suggests that structural mutants are expected to have lower base-pairing entropy than non-structural mutants and wild type 83.33 percent of the times, while non-structural mutants are expected to equal or higher values than the while type 83.33 of the times. The performance of the structural entropy under the same folding model BJK was slightly lower, while being higher than those for the RND and Sil features. Features BJKbp, BJK, and RND corresponding to the B. subtilis Magnesium structural mutants M5 and M6 were all positive, implying that our hypothesis of higher structural entropy and alternative fold does not hold for this riboswitch. The average silhouette index of energy landscapes (Sil) has a much better performance for the mentioned riboswitches. This could either be because SCFG models fail to capture conformational dynamics of this riboswitch or the thermodynamic equilibrium between its alternative folds is more subtle than expected.

Table 4 Mutagenesis
Table 5 Mutagenesis results

Sense-antisense classification results

Classification of the RNA sequence into riboswitches and antisense sequences was done using binomial logistic regression. Sequence features, such as Length L, Minimum Free Energy MFE, GC-composition GC, and structural entropy were used for classification. The MFE value was included as a relative measure of structural stability. An initial investigation of the power of selected features in sense-antisense discrimination was done through cross-validation for all 104 (52 riboswitches and 52 antisense) sequences. Binomial logistic regression classification probabilities were assigned to each sequence based on the other 104 sense and antisense sequences. It is shown in Table 6 that features {L,GC,GU,Sil} result in the highest true positive rate, lowest false positive rate, and highest area under the receiver operating characteristic (ROC) curve. This result suggests that the folding space of the riboswitch sequence is expected to be different than that of its antisense, since the Sil feature is based on the clustering of the energy landscape, although further investigation into this assumption is needed.

Table 6 Classification performance using cross validation

The performance of classifiers that involved uracil composition were more dependent on sequence features rather that structure and subsequently more prone to data fitting. The reason is that uracil composition can be different the sense and antisense. Excluding classifiers that incorporate uracil composition (i. e. forth set of rows in Table 6) showed that the features sets {L,MFE,GC,RND} and {L,MFE,GC,BJK} had a fairly acceptable performance. The performance of the corresponding feature sets were higher than the {L,MFE,GC} classifier. Furthermore, inclusion of uracil composition into the classifier lowered performance (See {L,MFE,GC,U} in Figure 2). It is noteworthy to recall that the above classifiers neither represent the most informative features of the data nor are they tuned for best performance (Please refer to Methods section for details on calculating performance). Therefore, structural entropy may be informative in sense-antisense classification since L and GC are equal for each pair of sense and antisense. The performance of {L,MFE,GC,BJKbp} was also higher than {L,MFE,GC} but slightly lower than feature sets that incorporate structural entropy values. The ROC curve corresponding to these classifiers is shown in Figure 2.

Figure 2
figure 2

Classification ROC curve. ROC curves of 104-fold binomial simple logistic classifiers on all the 52 riboswitch sequences and their antisense sequences. classifier features shown in legend. WekaⒸopen source software package used to assess probability distributions. Resolution (0.01) was used to calculate true and false positive rates. Corresponding threshold for a given false positive rate was used to calculate the true positive rate. Values strictly higher than threshold were used to calculate true positive rates.

The sense-antisense results show that the above approach has a high false positive rate, should it be used for riboswitch discovery. It also does not fully address all questions about its performance. For instance, is structural entropy orthogonal to results of other available methods that aim to capture conformational switches, or is it highly correlated with them? What is the significance of incorporating features L, GC, and MFE in the classifiers? How much of the performance of the above classifiers, such as {L,GC,GU,Sil}, reflect structural characteristics of the riboswitches and how much is due to other features of the data? How generalizable are the results and what is the performance of classification when tested on riboswitches from distant organisms with very different genomic compositions? Can we distinguish riboswitches from their antisense and other similar sequences with no structure, simultaneously? Does the structural entropy of a typical riboswitch tend to be lower or higher than that of its antisense sequence? To address the above questions, first we calculated the correlation between structural entropy features and other results from gathered tools. For this purpose, we performed the correlations for all riboswitches and their antisense sequences, totalling 104 sequences. Correlation values between structural entropy and other conformational features are illustrated in Table 7 for both folding models, RND and BJK, along with correlation values corresponding to structural entropy normalized to sequence length. By inspection, we can see that structural entropy is not necessarily highly correlated with other features, suggesting the possibility that it may contain additional information about RNA sequences, in general. We neither rigorously calculated the significance of correlation values nor did we further evaluate the orthogonality of structural entropy to other features. We then selected classifiers {L,GC,GU,Sil}, {L,MFE,GC,RND}, {L,MFE,GC,BJK}, and {L,MFE,GC} for further investigation. They are referred to as LGCGUSIL, LMFEGCRND, LMFEGCBJK, and LMFEGC. We divided our data into training and test sets, each having different average GC-composition. Sequence segments and their corresponding structures are included in sections Training set and Test set in Appendix. We then evaluated the performance of the classifiers from the training set using the test set.

Table 7 Correlations between entropy values and other approaches

Random shuffles

First, in order to assess the relationship between the performance of the above binomial-logistic-regression classifiers to structural features, we performed dinucleotide shuffling test [60]. We originally generated 10 dinucleotide shuffles for each of the riboswitches in the training and test sets. We used riboswitches of the training set and their corresponding random shuffles to estimate binomial logistic-regression coefficients. We then used the coefficients to classify sequences of the test set and their corresponding random shuffles. However, the classification performance and corresponding ROC curves were highly dependent on the MFE feature (data not shown). In order to both have a better insight into the structural entropy feature and a more clear comparison to other methods, we then filtered both the random shuffles of the training and test sets to have similar MFE values to their corresponding riboswitches and repeated the test (please see Methods section and Table 8 for information on the filtered dinucleotide shuffles). The ROC curves and performance values can be seen in Figure 3 and Table 9, respectively. The classifier LGCGUSIL poorly distinguishes riboswitches from random shuffles compared to the other classifiers. This suggests that the high performance of this classifier in sense-antisense classification was not necessarily due to structural features. Although the RNAShapes classifier has higher performance as shown in Table 8, it only corresponds to one point of the ROC curve. In order to have a more comprehensive measure of performance, we calculated the area under the ROC curve for the classifiers. The performance of most classifiers is roughly similar with the LMFEGCRND having the second highest value after LMFEGC (see Figure 3 and area under ROC curve in the legend). Although filtering only for those dinucleotide shuffles having similar MFE would seem a reasonable negative control, we did not further investigate random shuffle test. Preparing a random-shuffle negative control ensemble of sequences with similar length, composition, and MFE values that homogeneously represent all riboswitches may be a formidable task and not necessarily helpful with evaluating our approach that focuses on real biological sequences as negative control.

Figure 3
figure 3

Classification ROC Curve. ROC curves of binomial logistic classifiers on the riboswitches of the test set (23 sequences) and their filtered random shuffles (151 sequences). Classifier features shown in legend. SPSSⒸwas used to estimate logistic regression coefficients derived from the training set consisted of the 29 riboswitches and 216 sequences as negative control. Resolution (0.01) was used to calculate true and false positive rates. Corresponding threshold for a given false positive rate was used to calculate the true positive rate. Values strictly higher than threshold were used to calculate true positive rates. In the legend, the area under the ROC curve corresponding to each classifier is shown.

Table 8 Average and standard deviation values of Length, MFE, and GC-compositions of the training and test sets along with their filtered dinucleotides shuffles
Table 9 Classification performance in dinucleotide shuffling test

To evaluate the significance of feature L, we considered both constant choice of length and variable choice of length for riboswitches. Apart from the antisense sequence, untranslated regions (UTR) shorter than 80 nt have been selected as another negative set, since they are unlikely to contain structures over such a short length. Some riboswitch sequence segments, however, were selected to be shorter than this length. The length of the corresponding UTR (from transcription binding site to the translation start codon) for riboswitches, however, were not shorter than 80 nt. UTRs corresponding to the σ-70 in E. coli with distance less than 80 nt from the translation start codon were used here as sequences that do not contain structure. 30 sequence segments were selected from σ-70 E. coli UTRs shorter than 80 nt (see Table 10 for information on sequence locations). The section Methods extensively discusses the criteria for selecting the subsets, dividing the riboswitches and E. coli UTRs into training/test sets, as well as information on data sets. Average and standard deviation of features L, MFE, and GC for the training and test sets are shown in Table 11. The free energy of the centroid structure [35] calculated by CentroidFoldⒸ [62], denoted here as CFE, was also used as a substitute for MFE.

Table 10 Short UTR collection
Table 11 Riboswitch statistics

The performance of the tri-state classifier was evaluated by estimating classifier parameters from multinomial logistic regression of the training sets and then calculating the correct classification of sequences having zero (E. coli) riboswitch structure, one possible structure (antisense), or two (riboswitch) structures that are in the test set. We also evaluated the performance of classifiers that incorporate features from FFTbor and RNAShapes software packages calculated according to Methods section. Feature sets {L,MFE,GC,FFTbor} denoted as LMFEGCFFTbor and {L,MFE,GC,RNAShapes} denoted as LMFEGCShapes were included in the test.

Classification performance values are denoted in Table 12 along with sensitivity of each classifier. Sensitivity of tri-state classifiers were defined here as total number of correctly classified sequences divided by total number of sequences classified. Model LMFEGCBJK resulted in both highest sensitivity (80.2%) and highest percentage of correctly classified riboswitches (91.3%). Performance of other classifiers was in the same range. Further tests are needed to make better comparison between the performance of the classifiers. The low performance of the LMFEGCRND model shows that classification is potentially sensitive to features length, GC-composition, and MFE, since they are different between the training and the test sets (see Table 11). Furthermore, choice of modeling is very critical in designing sense-antisense classification. The BJK model, being a more accurate folding model leads to higher performance.

Table 12 Classification performance

Regression coefficients of the classifiers corresponding to riboswitches are shown in Column β 2 of Table 13. Coefficients corresponding to MFE and structural entropy, are the second and forth values, respectively. For both the LMFEGCRND and LMFEGCBJK models, MFE coefficients are negative while structural entropy coefficients are positive (values are normalized with respect to antisense). This implies that if we input a riboswitch and its reverse complement to the regression-based classifiers, the strand with lower Minimum Free Energy and higher structural entropy is more likely to be the riboswitch. Hence, we have reason to believe that despite having a more stable structure, riboswitches tend to have higher structural entropy than expectedb. We find this observation significant, since they are consistent across two different folding models. 3D-plots of the MFE, GC-composition, and structural entropy values under the RND model for sequences of the training set are depicted in Figure 4. Top and bottom views of the grid-view of values normalized to sequence length roughly shows this distinction.

Figure 4
figure 4

Structural Entropy vs. GC-comp. and MFE. 3D-plot of features MFE, Length and Structural Entropy of the training set sequences classifier under the RND model. Grid-view of different sets of sequences are shown in the top and Bottom views, riboswitches, E. coli UTRs, and antisense sequences. Axes R N D/L and M F E/L show Structural Entropy and MFE normalized by the length of the sequence, respectively. Euclidean distance to actual values was used to generate the grids.

Table 13 Logistic regression coefficients of classifiers

Testing the classifiers on constant lengths of sequences (for all training and test sets) did not increase performance (see Table 14 in Appendix), although the RND was significant for sequences of length 150 nt in the training set. Constant length selection was based on extending (or shortening) the original choice of length of sequences from both 5’ and 3’ directions such that the center of the sequence does not change. We refer to this original choice of length as the actual length, hereon. We chose this scheme for simplicity. Other sequence selection methods may be preferred, since the alternative fold may occur on varying parts (5’ or 3’) of the riboswitch sequence, in general. Substitution of CFE feature instead of MFE feature resulted in lower performance of classifiers (comparing Tables 10, 11, 12, 13, 14 and 15 in Appendix).

Table 14 Classification performance for different choices of length
Table 15 Classification performance using centroid free energy

Association with high entropy

Mutagenesis results suggested an association between alternative folds and higher structural entropy. Furthermore, regression approaches to estimate the structural entropy of the riboswitch with respect to various sequence and structural features such as MFE lead to higher classification performance in discriminating riboswitches from their antisense control, compared to classifiers that do not incorporate the structural entropy measure. Similar to Mutagenesis results, we observed that riboswitches tend to have higher structural entropy than what is expected of their antisense sequences. Dinucleotide shuffles test also showed an slight increase in the specificity of classifiers using one of the structural entropy features (The RND model) compared to other models and methods. We hypothesize that the structural entropy value of riboswitches may be a significant factor within the context of their length, GC-composition, and folding stability (here, MFE). In specific, conformational switches (here riboswitches) show a slight but consistent increase in their entropy values than structural mutants or antisense. The higher entropy was not, however, observed in the dinucleotide shuffling test. Under this hypothesis and the limitations of our tests from both mutagenesis and sense-antisense classification, the B. subtilis Magnesium and SAM-I riboswitches seem to be two immediate outliers.

Base-pairing entropy feature had higher performance in the Mutagenesis results, while the energy landscape index led to higher performance in sense-antise classification. Putting Mutagenesis and sense-antisense results in one perspective, however, suggests a more consistent conclusion about the structural entropy compared to the other two features.

Some of the challenges in our approach to develop riboswitch identifiers were choices of sequence segment and folding model. We found it very difficult to find a subset of sequence segments from riboswitches for our training set that had the highest structural entropy. These difficulties included but were not limited to high sensitivity of structural entropy to sequence length and location and the possible varying lengths of riboswitches that have alternative structures. We arbitrarily included varying lengths of riboswitches in our training set rather than constant length, since the performance of classifiers with constant length was either lower or similar to those with varying length. Methods based on optimization of both sequence length of riboswitches and modeling their folding dynamics may prove rewarding in this regard.

The optimum length of a sequence segment that leads to identifying riboswitches can vary from one organism to another; Constant length of 100 nt segments for E. coli are more suitable, while 157 nt segments lead to higher performance for B. subtilis riboswitches. Results about sequence segments, however, had low significance due to low number of riboswitches tested in each case. We only propose that it may be possible that riboswitches from different organisms may have different ranges of sequence lengths over which alternative structure prediction becomes significant. Optimizing search parameters on a new organism sequence is potentially a difficult task. One alternative may be evaluating the behavior of structural entropy-based classifiers on data sets that are peculiar to that organism. We have not explored this approach.

Choice of model

Classification performance of sense-antisense, genome-wide sliding window tests, and mutagenesis all suggest that the BJK folding model is more sensitive to changes in the folding space than the structurally ambiguous RND model. The classification performance of the LMFEGCBJK model both on the test set and on the B. subtilis riboswitches is high given the right sequence segment is chosen. Also, the RND model does very poorly in distinguishing the folding space of riboswitch mutants from that of their wild types. On the other hand, binomial logistic regression based classification of sense and antisense of all riboswitches assigns slightly higher ROC area to the classifier that deploys the RND model (see Figure 2). Furthermore, riboswitch identifiers based on the RND model are more robust in terms of sequence positioning than their BJK counterparts. The RND model only enforces Watson-Crick and G-U base-pairing and is fairly a simplistic structural model. The acceptable performance of the RND model in genome-wide approaches may be due to having less structural constraints than BJK. It may be possible that training secondary structural folding models to predict RNA secondary structures comes at the cost of loss in folding dynamics information. Overall, design of a more efficient SCFG model (possibly a heavyweight folding model) to characterize better the riboswitch folding space is another bottleneck of this approach, since modeling techniques to capture conformational features have not been developed. Current RNA structure modeling is centered around prdiction of the RNA secondary structure rather than its conformational dynamics.

Genome-wide analysis

Sequence segments predicted to have potential for alternative fold for the two B. subtilis and E. coli intergenic regions are presented in Genome-wide scan results. The power of regression based classifiers as riboswitch predictors is not significantly high; None of the known B. subtilis riboswitches fell in the top 100 hits in genomic scans, though certain of those hits are known regulatory elements. The lack of high performance implies sensitivity of our approach to training set genomic features. Exploring other classification schemes, such as neural networks (for instance, similar to [63]) as well as incorporating a different negative-control training data than the antisense may lead to higher performance. In order to develop an organism-specific riboswitch predictor, one may gather sequences with no structure from the target organism and deploy it as a negative control for a classifier that takes structural entropy as a feature.

Many hits fell immediately upstream of operons, which could be indicative of cis-regulation. Our genome-wide scan results show dependency of genomic features such as the uracil composition. Furthermore, results presented for various genome-wide scans cannot be taken into account individually, since the above methodology is a length based method. In other words, in order to identify genomic regions with highest likelihood of having a riboswitch, it is essential to combine results of genome-wide scans under different window sizes. Such combining of results also seems organism-specific. The optimum length(s) of genome-wide window scans for the riboswitch identification can be different from one organism to another. The complexity of such a test and its corresponding rigorous statistical analyses fell outside the scope of this work. Here, we limited ourselves to few genomic scans and present a subset of regions that are riboswitch candidates in most of the genome-wide tests.

The cotH gene

The top two sequence segments predicted to be riboswitches are both upstream of cotH gene and in close proximity of one another. In fact, a 628 nt long segment is classified to be a riboswitch (four consecutive sequence segments). The 5’ half of this segment, {3717412 nt - 3717725 nt}, contains the top two hits which are also predicted to be riboswitches by the model LMFEGCBJK in position {3717098 nt - 3717725 nt} on the complementary strand of B. subtilis. Naclerio et al., 1996 [64] discuss possible regulation in the vicinity of cotH gene. They also stated that no homology to this gene was revealed in the sequences presented in the data bank at the time. They hypothesized that this gene plays an important role in the formation of the spore coat. A more recent paper [65] reports about the cotH promoter mapping 812 bp upstream of beginning of its coding region. This region covers the top two hits we have. In fact, a 200 nt scan reveals that many consecutive segments belonging to this region have significant RND entropy values (<0.05). Most interestingly, the segment with highest RND entropy value on a genome-wide level and under the 200 nt window occurs 399 nt upstream of the cotH gene at location {3717398-3717597 +}. The authors also talk about cotG and cotH genes and that they are both divergently transcribed by σ-K and a potential for extensive secondary RNA structures in this unusually long untranslated region. The cotG is located in the forward strand. There are also many hits around 2000 nt upstream of the cotG-containing operon under various sliding window tests. An interesting observation about the nucleotide composition of the top hit reveals that it uniquely contains periodic runs of 6 consecutive thymines with periodicity of 12 and 15 interchangeably. A search for similar runs of thymines was done on both strands of B. subtilis by relaxing periodicity from 10 nt to 18 nt and constraining it to having at least six consecutive runs of 6-thymines using the pattern locator software [66]. The only two hits were found on the reverse strand and overlapping with the top hit: {3717502-3717606} and {3717367-3717468}.

The most significant structural entropy value for the longest window size (200 nt) on the B. subtilis genome occurred in an unusually extensive secondary structure within that genome. It may be possible that longer RNA structures contain segments having significantly high entropy (structural entropy) on a genome-wide scale. This implies that longer RNAs could potentially have a uniquely high number of secondary structural conformations. This unusually high secondary structural diversity may be related to their regulatory role. We have not yet examined the secondary structural space of other long secondary structures in various organisms. The significantly high structural entropy feature, however, may be typical of other longer secondary strucures. In a recent study on the newly discovered class of RNAs known as long ncRNAs (lncRNA), Cloutier et al. [67] show that yeast lncRNAs are involved in the timing of gene expression. It may be possible that their proposed lncRNA-dependent quick shift of gene expression is related to a high potential for diverse secondary structural conformations.

The BSU tRNA 75 operon

The sequence segment with highest classification probability that also has signficant MFE and structural entropy values is located about 2277 nt of the upstream region of the BSU tRNA 75 Operon. The antisense control of this segment is located in a putative transcriptional regulator. It is interesting, however, that this hit occurs upstream of a tRNA operon. A 200 nt scan reveals more hits upstream of this operon that have significant structural entropy values some of which are closer to the tRNA operon (around 2000 nt upstream). From locations of tRNA operon [68], it can be seen that out of the five consecutive tRNA genes with isotypes Glu, Val, Thr, Tyr, and Gln, the Thr operon has attenuation [69]. Although the long distance from the downstream translation start codon does not make this a reliable riboswitch prediction, the significance of hits in the intergenic region upstream of the Thr gene and the fact that the other top hit in our classification approach was located in a long RNA, suggest the possibility that there may be a long regulatory RNA residing upstream of the mentioned tRNA operon, raising the interesting possibility of a putative riboswitch regulating an attenuation mechanism. Another possiblity is the existence of an open reading frame downstream of the hit. We applied GeneMark [70] and GLIMMER [71,72] gene prediction programs on the 2000 nt long downstream sequence of this hit. The Bacillus anthracis gene model was used in GeneMark, since B. subtilis was not available. Both programs had similar gene prediction results, showing the possibility of three open reading frames on the complementary strand, with the closest one ending 46 nt downstream of the hit.

lysP

One of the most significant hits in our classification under the 157 nt scan occurs immediately upstream of the lysP gene. The segment corresponding to this location also has the most significant (highest) RND entropy value while having significantly low MFE (p-Val. <0.05) on a genome-wide level. This is also true for the 150 nt window scan. Furthermore, the 200 nt scan assigns significantly high structural entropy (RND p-Val. <0.05) as well as classification probability of higher than 0.8 for this location. The 150 nt-long segment is located at {3421066-3421215 -} between the lysine permease and BSU MISC RNA 54. Other adjacent hits that overlap BSU MISC RNA 54 do not have such high structural entropy or classification probability. It may be possible that this segment plays a crucial role in regulating the downstream gene.

Cross-organism riboswitch candidates

The B. subtilis (BSU16090) had a homologue in the E. coli operon: (sdhA, sdhB, sdhC, sdhD, sucA, sucB, sucC, sucD), that contains the sucC as its last gene, with e-value 3e-141 (see Methods section for details). The genes were associated with top hits on their upstream in B. subtilis and E. coli with probabilities 0.897 and 0.905, respectively. Also, the B. subtilis sucA (BSU19370) gene had a homologue in the E. coli fixA,fixB,fixC,fix operon with e-value 1e-18, both of which were associated with high hits. Other matches were the B. subtilis tagA gene and the E. coli (rfe- wzzE- rffEDGHCA- wzxE- rffT- wzyE- rffM) operon (e-value 1e-23), the B. subtilis citB and the E. coli acnA genes (e-value 0), the B. subtilis cspB and the E. coli cspG genes (e-value 2e-17), and the B. subtilis ydaB gene and the E. coli caiA,caiB,caiC,caiD,caiE,caiT operon (e-value 3e-27). None of the above B. subtilis genes, however, had homologs in S. elongates that was associated with top hits in that organism, except for the B. subtilis ydaB gene. The ydaB gene had a homologue in S. elongates syc0203_c (e-value 7e-13) where the associated upstream region contained a hit with probability 0.714. Other homologues were also observed between B. subtilis and E. coli. However, the upstream information for those hits was not available, since they either had intergenic regions shorter than 150 nt or they contained annotated regulatory elements which were excluded from the scanning procedure.

Conclusion

Riboswitches are comprised of a diversity of biological functionality, as well as having different conformational dynamics. In this work, we made an attempt to characterize the potential for an alternative fold (switching ability) ubiquitous in various regulatory elements, regardless of their annotation and structural complexities. Various tests showed that there is in fact a relationship between higher structural entropy and RNA switching ability. This relationship was shown to be modest but consistent across various tests. Furthermore, incorporating the structural entropy feature under the simplistic and symmetric RND folding model was shown to be informative in distinguishing riboswitches from their random shuffles. Unlike results from mutagenesis and antisense tests, the structural entropy feature of riboswitches was not on average higher than their corresponding random shuffles. Given both the diversity of conformational arrangement of gathered riboswitches and the simplicity of models used, there is potential for this feature in detecting RNA conformational switching and ultimately in de novo riboswitch discovery.

Classifiers based on structural entropy optimized through sequence and structural features were devised to distinguish between the putative riboswitch and the antisense control. They were also used as riboswitch identifiers in various prokaryotes. Potential shortcomings and considerations were also explored. Factors such as, length, organism the riboswitch is taken from, and the type of riboswitch should be considered in preparing a training set for future approaches. The lightweight RND folding model tended to have a very consistent and robust result in distinguishing extensive secondary structures from other intergenic regions on genome-wide scale, regardless of test parameters.

Structural entropy using stochastic context-free grammars provides a means to better understanding the conformational dynamics of the RNA, in general. Current modeling training techniques for SCFGs focus on higher accuracy to predict RNA secondary structure, and not necessarily higher accuracy of folding space or dynamics. Interestingly, the more simplistic RND folding model used in our approach had a higher performance than the more accurate BJK model, under serveral tests. The use of a more complex and accurate folding model may not necessarily result in a better exploration of the folding dynamics of non-coding RNAs. Fully understanding the behaviour of structural entropy may require extensive exploratory investigation, both from modeling and data collection perspectives.

Methods

Data collection

Sequences with concrete evidence of alternative structures were gathered from the literature (see Table 2). Prokaryotic sequences believed not to have structure were selected from E. coli and are listed in Table 9 as negative set. 30 genome locations corresponding to σ-70 transcription factor binding sites that are less than 80 nt upstream of their corresponding start codon were randomly chosen from E. coli such that they are fairly evenly distributed across the genome. Data was manually gathered from the EcoCyc website (http://ecocyc.org/).

Mutagenesis

Sensitivity and specificity values of Table 5 were calculated from Table 4. Sensitivity: the percentage of structural mutants (annotated by YES in Table 4) having lower values than the wild type (corresponding negative value in Table 4). Specificity: the percentage of non-structural mutants (annotated by NO in Table 4) having higher or equal values with respect to the wild type (corresponding positive or zero value in Table 4). Mathews Correlation Coefficient is shown in column MCC. Features Base-pairing entropy (BJKbp), Structural entropy under BJK and RND models, (BJK) and (RND), respectively, and two-cluster average silhouette index of energy landscape (Sil) were investigated. For the case of the Bradyrhizobium ROSE-P2, entropy values were compared with −0.74 rather than zero for wild type, since the length of the 135nt-long riboswitch was decreased by 1 and this decrease in length is expected to have linear effect on structural entropy values.

Classification

Preparing the positive control set

The criterion for building the positive control set was taking the minimum-length sub-sequence for the corresponding riboswitch with evidence for alternative structures. Comprehensive structure information was not available for certain sequences. We decided to include them to increase our data set size. The structures of most sequences were experimentally validated, although a few structures of the riboswitches were inferred in combination with structural homology approaches. Only the expression platform components for the Cobalamin riboswitches were used, since they contain alterations; a typical riboswitch has an aptamer and an expression platform component, where the aptamer binds to the ligand, triggering allosteric rearrangement of the conformation of the expression platform component of the riboswitch which in turn regulates the expression of the downstream gene. Cobalamin riboswitches are also significantly longer than other sequences, e.g. Salmonella enterica serovar Typhimurium’s Cobalamin riboswitch was over 300 nt long. Including such long sequences could have been problematic, for both sensitivity of structural entropy on sequence length and the fact that RNA structures longer than 200 nt are usually predicted with low confidence under SCFG models as well as computational constraints. Also, certain sequences were excluded from the test. In the column Grouping of Table 2 we denote None for such sequences. Excluded sequences are as follows: Salmonella ATP regulatory element, located in the mgtM gene before the mgtCBR operon, was excluded since it was the only RNA in our set that had complete overlap with codons [73]. Synechococcus sp. CC9902 Downstream-peptide motif was excluded, since evidence for alteration was not available. T. tengcongensis glmS ribozyme-riboswitch was excluded, since the glmS ribozyme does not undergo “large conformational changes concomitant with ligand binding” [74] and is the only RNA element in our gathered data that functions as a self-cleaving ribozyme upon binding to glucosamine-6-phosphate (GlcN6P) ligand [75]. Synechococcus elongatus glnA motif was excluded, since no evidence of alteration was available. Schistosoma Mansoni Hammerhead type I ribozyme was excluded, since its structure does not alter. The pseudoknotted marine metagenome Hammerhead type II ribozyme was also excluded, since there was no evidence of alteration of the secondary structure. Finally, Bacillus subtilis yxkD motif was excluded, since there was no concrete biological evidence for it being a functional riboswitch or ribo-regulator, although it is predicted partially to have an alternative structure [76].

Choice of reverse-complement (Antisense) as a negative control

As we know, riboswitches are under selective pressure to preserve their potential for alternative folds due to their biological role. The reverse-complement of the RNA or the antisense sequence was here assumed not to have potential for two alternative structures; they may have at most one structure, since they are complementary to the sense. A cis-regulator has an alternative fold, typically through conformational rearrangements of the expression platform to be able to regulate the expression of the downstream genes in the same mRNA, while the antisense is not necessarily under such evolutionary pressure (See Background section for exceptions). Experimental work to back this assumption on the gathered riboswitches was not found. The following assumption was made: On average, a given set of validated riboswitch sequences are expected to contain more RNA switching features than their corresponding antisense sequences. Advantages and disadvantages of the choice of reverse complement (or antisense) are as follows: Advantages: In the reverse complement, Watson-Crick pairing positions are kept intact. This implies that the folding space of the antisense may contain secondary structural features similar to the MFE structure of the sense sequence, making it a near negative control. Other more established negative controls such as various sequence shuffling techniques do not guarantee this. In addition, high correlation of structural entropy values to nucleotide composition and length of the sequence [55] make antisense a convenient choice of negative control. Disadvantages: There may be possible co-association with other sequence features such as U-composition. Also, G-U pairs may differ between sense and antisense structures.

Training and test sets

The positive control set was divided into the training and test sets. While most gathered sequences were in the two organisms, B. subtilis and E. coli, they cover a variety of biological functions and structures. We were interested in a method that can identify potential for the RNA to have an alternative secondary structure from a thermodynamic perspective regardless of a specific function or a secondary structural conformation. Hence, the categorization was done such that each of the training and test sets would contain as many adiverse sequences and structures as possible. Furthermore, the training sequences contain those from E. coli while the test set contains those of B. subtilis. Riboswitches that did not exist in both gram-positive and gram-negatives were evenly distributed between the two sets. Division of data into training and test sets was a compromise between having as diverse riboswitches as possible and being able to assess significance of classification on riboswitches from phylogenetically distant organisms, namely the gram-positive B. subtilis and the gram-negative E. coli. In the column Grouping of Table 2, the categorization of each sequence is shown. There are a total of 29 sequences in the training set and 23 sequences, in the test set. The 30 E. coli UTRs were divided into sets of 17 and 13 for training and test sets, respectively. The categorization was selected such that for an extension of 100 nt UTRs upstream of their corresponding start codons, GC-composition, and the minimum free energy having similar distribution in both sets. A further internal control was the reverse-complement (or antisense) sequence of the riboswitch, adding additional sets of sequences of size 29 and 23 sequence to the training and test sets, respectively. Various classifications in this work always use antisense sequences of riboswitches of identical length for training and test sets, unless indicated otherwise. For the dinucleotide shuffling test, we originally generated 10 dinucleotide shuffles for each riboswitch using [60]. We then filtered the sequences and selected those having more similar MFE values to riboswitches based on the following criterion. For each of the training and test sets, we calculated average and standard deviation of the MFE values. We then eliminated those random shuffles having higher MFE values than average minus one standard deviation, for the training and test sets separately. The resulting statistics can be seen in Table 8. As we can see, the statistics for other features such as length and GC-composition were also altered.

Classification criterion

Classification probabilities of having an alternative fold (riboswitch), possibly only one fold (antisense), or no riboswitch structure (E. coli UTR) were assigned to each sequence based on multinomial logistic regression of sequences in the training. SPSS 16.0Ⓒsoftware was used to estimate the corresponding parameter vectors. Such parameter vectors were then used to calculate the probabilities of sequences in the test set to belong to each class using their features. All true and false positive rates presented (except the ROC curves) are based on maximum likelihood, where the probability with maximum value determines the class of the sequence in the test set. In this work and all presented tables, the term probability (or likelihood score) for being potential riboswitches refers to “trained output logistic model score” assigned to the sequence. Entropy calculations were done according to [55]. Two different lightweight context-free secondary structural models were used as folding distribution models. The first model, denoted here as BJK, was developed by Knudsen/Hein and originally used in the Pfold package [77,78]. The structurally unambiguous grammar was subsequently used in [79] under the name G6 to predict RNA secondary structure using different training sets for RNA secondary structures. Model parameters used here correspond to the benchmark-trained version of this grammar [79] and will be referred to as the BJK model. Average sensitivity and specificity values for the BJK model on the test set of riboswitches are 75.6 and 76.3, respectively. The second model, denoted here as RND was introduced in [55] under the name RND10. This model consists of a structurally ambiguous simple grammar with symmetric rules and probabilities set according to [55]. Also, an effort was done to convert non-stacking heavyweight grammars from [13,59]. Such grammars aim at mirroring the state-of-the-art thermodynamic folding models and are extremely sophisticated, requiring their specific software implementation. The translation of these models into our simple implementation eliminated many of its features. The resulting model did not yield the original accuracy to predict RNA secondary structure, nor was its entropy showing any significant performance in the classifier (data not shown). Minimum-free-energy calculation was done by ViennaⒸSoftware Package [58] using default parameters. Base-pairing entropy for the BJK model, denoted here as BJKbp, was calculated as defined in [54] (implementation by [55]). BJKbp calculation is according to ([61] Eq. 3), where natural logarithm is used for base-pairing entropy calculations: \(-(1/n)\times \sum _{i < j} P_{i,j}\times \log P_{i,j}\). Symbol n denotes length of the RNA sequence and P i,j denotes the probability of pairing in positions i and j, under the given SCFG model. The two-cluster average Silhouette index of energy landscape, denoted here as Sil, was calculated according to the pipeline used in [46] with the exception that we did not account for pseudoknots and only used MFE predictions of the ViennaⒸSoftware Package for prediction of structures. Also, in the case of random shuffles, we only used the fixed number of 500 structural samples to do the calculations. RNAShapes [49] was used to derive an array of most possible abstract shapes resembling RNA secondary structure. We used parameters -e [-500,10] -p to calculate different conformation probabilities. we used the ratio of the two probabilities p1/p2, where p1 and p2 are the highest and the second highest probabilities associated with the predictions. One sequence lead to only one possible conformation as an output, for which we arbitrary chose 100. A similar trend was followed using FFTbor [52] software package.

We also tried to explore GC composition information upstream of gathered sequences relative to that in the riboswitch which did not lead to significantly better results. Sequence-similarity method such, as BLASTⒸ and profile Hidden Markov Models were also examined as classifiers with the mentioned training and test sets. The pipeline was implemented according to [80]. These methods did not result in significant classification performance even after lowering the corresponding threshold to insignificant values.

Genome-wide scan of the B. subtilis, E. coli, and S. elongatus intergenic regions

Bacillus subtilis subsp. subtilis str. 168 (taxid:224308), Escherichia coli str. K-12 substr. MG1655 (GenBankⒸID: U00096.2) and Synechococcus elongatus PCC 6301 were downloaded from the National Center for Biotechnology Information (NCBI)Ⓒ [81,82]. The newer version of E. coli str. K-12 genome (gb|U00096.3) was not used, since operon and σ-70 UTR locations were given in the old version. Corresponding locations of E. coli riboswitches in the old version were used, where necessary. The operon-location information file for B. subtilis was downloaded from [83]. Candidates consisted of sequence segments of lengths 100 nt, 150 nt, and 157 nt. Each intergenic region was divided into segments of such length such that the most downstream segment in each intergenic region ends at the start codon. Only intergenic regions higher than 150 nt were considered. True positives were defined based on sequence segments that had maximum overlap with the original coordinates of the riboswitches. The same process was applied to the E. coli genome. Operon locations for the E. coli genome were downloaded from the RegulonDB website [84]. Operon locations in E. coli also contained RNA elements in our data set. Hence, results for the genome-wide scan of E. coli did not contain any sequence within the operon locations and only contained non annotated regions. S. elongatus gene locations were downloaded from the MicrobesOnline Operon Predictions website [85,86]. S. elongatus intergenic regions were chosen in a similar fashion. Computational complexity of the genome-wide scan on the available cluster and using parallel computation took roughly several days for a window size of 200 nt and overlap of 100 nt.

Cross-organism riboswitch candidate selection

We specifically looked for homologous genes that have riboswitch candidates in their upstream region in different organisms. First, from the top 100 hits in B. subtilis, we collected the genes that are associated with hits having maximum distance of 600 nt to the start codon. We then used tblastx to find their homologues in E. coli with an e-value threshold of 1e-6.

Endnotes

aPseudoknots, another RNA structural feature, are a kind of base-pairings that resemble structural knots and cannot be predicted via context-free grammars. Predictions of pseudoknots based on minimum free energy and context-sensitive grammars are possible, though computationally expensive [87]. bStructural entropy was observed to be positively correlated with the MFE for random computer-generated sequences and structures under the BJK model (the lower the MFE the lower its entropy). While structural entropy under the RND model was observed to be independent of MFE (data not shown). Hence, higher entropy (structural entropy) and lower MFE of riboswitches is an unexpected observation, at least from a random sequence perspective. cStructure partially validated, partially predicted via Vienna Software where not available. dTen nucleotides added to the structure with no structure. ePartially predicted. fStructural Homology Inferred. gPredicted by pknotsRGⒸprogram [88]. hStructural Homology Inferred. iPredicted by Vienna. jPartially predicted by Vienna.

Appendix

Additional tables

Performance of different classifiers was evaluated for both constant and variable choices of length for the riboswitches (Table 14). The performance of classifiers that substitute centroid free energy for minimum free energy is shown in Table 15.

Collected riboswitch sequences

Riboswitch sequences and their corresponding secondary structures were collected from the literature. Riboswitches used in the training set are shown in Training set in Appendix. Riboswitches used in the test set are shown in Test set in Appendix. Sequences excluded from classification are shown in Excluded set.

Training set

Test set

Excluded set

Genome-wide scan results

Bacillus subtilis

In order to gain a better insight into the structural entropy of various riboswitches within their genomic context, in an exploratory attempt, we examined both the performance of classifiers and the high entropy (structural entropy) hypothesis on a genome level. In order to assess the performance of the classifier, we used regression values derived from the same training set to assign likelihood of being a riboswtich to various genomic locations of the B. subtilis genome. As a departure from classification method, we also searched for sequence segments that support our hypothesis; for that we collected sequence segments having both significantly high structural entropy and low MFE. Finally we explored the high structural entropy hypothesis in an inter-organism test, where we looked for homologous genes across the three distant organisms of B. subtilis, E. coli, and S. elongatus, that are all being associated with significantly high structural entropy values upstream of their coding region.

Eleven of the 23 riboswitches of the test set were also located in B. subtilis. The performance of the three tri-state classifiers on the eleven riboswitches and all other intergenic regions of the gram-positive bacterium, Bacillus subtilis are shown in Tables 16 and 17 in Appendix, for the actual variable lengths and constant lengths of the test set, respectively. In order to have a broader view of classification performance for the B. subtilis, we also replaced the antisense with sequences from intergenic regions, having same length and GC-composition as the sense sequences. The negative set consisted of eleven sequences and denoted here as F P 3. Performance values are reported in Table 18 in Appendix.

Table 16 Classification performance in B. subtilis
Table 17 Classification performance for different choices of length in B. subtilis
Table 18 Classification performance in B. subtilis under 157 nt Length

Operon coordinates were taken from [83]. Performance of classifiers was higher for length 157 nt rather than lengths 100 nt, 150 nt, or 200 nt. This was true even though overlapping sliding windows were used for those lengths (sequence segment with highest overlap was selected as positive hit). In addition, we can see from Table 17 in Appendix that as window size increased, the number of intergenic regions classified as riboswitches (T P 2 %) decreased. The classification performance of the LMFEGCRND model, however, was maximum at length 157 nt. (Length 157 was found using a rough optimization of various constant-length sequence selection and under the LMFEGCBJK model). We further examined the 157 nt length for two different sets of tests. In the first case, 157 nt-long segments were selected centered at the riboswitch (routine procedure) and in the second case, 157 nt extension of the 5’ start of sequences were chosen. Classification performance is shown in Tables 16 and 19 in Appendix. Performance was very sensitive to positioning of the sequence segment of constant length. For the case of 5’ selections, the LMFEGCBJK model outperformed other models having T P %=90.9 while the centered-segment test had a performance even lower than choosing random positioning. Hence, the LMFEGCBJK is more suitable for high performance where computational complexity is not an issue. For faster genome-wide tests where examining all sequence positions is not possible the LMFEGCRND seemed more appropriate (T P=81.8%) and was based on selection of segments in a non-overlapping fashion, starting at the start codon for each operon. Selecting segments centered at riboswitches resulted in poor performance in B. subtilis.

Table 19 Classification performance in B. subtilis under constant length

The performance of classifiers on the eleven riboswitches were highly dependent on the length and positioning of sequence segments to be tested (see Tables 16, 17 and 19). Furthermore, various riboswitches had different sensitivities to such features (data not shown). We found that sequence segments of length 157 nt resulted in higher performance compared to other lengths tested. Also, without knowledge of the exact location of the riboswitch, the LMFEGCRND model outperformed the LMFEGCBJK model, though the LMFEGCBJK model had a significantly higher performance if sequence segments were positioned at the right locations on the riboswitch. The likelihood for such desired positioning is very low; 1/W L for each riboswitch, where WL is the length of the non-overlapping sliding window.

The ranking of B. subtilis riboswitches using their actual length and constant length of 157 nt are shown in Tables 20 and 21, respectively. Classification probability of the LMFEGCRND model corresponding to the sequence segment overlapping with the TPP riboswitch (0.76) was higher than that of other riboswitches with ranking empirical p-value 0.0122. Results for the SAM-I riboswitch, however, were very poor. The actual length of the SAM-I riboswitch used in this study was also 157 nt.

Table 20 B. subtilis riboswitches ranking under actual-length test
Table 21 B. subtilis riboswitches ranking under constant-length test

Table 22 contains the top 50 best hits from each strand of the B. subtilis intergenic regions and their corresponding probability values. Sequence segments having classification probabilities higher than or equal to 0.8 fall in the top 50. Plot of structural entropy under the RND model and uracil composition of the sequence segments form the B. subtilis showed that structural entropy values were correlated with higher uracil composition (see Figure 5). This may have been partly due to the fact that uracil can bind with more nucleotides to form base-pairs under folding models. In order to suppress the effect of high uracil composition, we sorted the top hits having uracil compositions within the range of known riboswitches in B. subtilis in Table 23. The location distribution of these hits can be seen in Figure 6. As we can see, the top hits do not seem to be associated with a specific genomic region as a whole. Sequence segments predicted to be riboswitches were not uniformly distributed across the genome. In order to investigate sequence location of segments with significant structural entropy values, regardless of their regression probabilities, we sorted segments having significantly low MFE (empirical p-value <0.05) while also having structural entropy values on the high 50 percentile. Hits with significant values that had GC and U compositions within the range of known riboswitches in B. subtilis are shown in Table 24. Interestingly, all of the hits also had significant structural entropy p-values (<0.05). P-values are calculated empirically and separately for each choice of window size in the genome-wide scan. Finally, significantly high structural entropy values of the 200 nt window scan that also have probability values higher than 0.8, along with other significant hits, are available in Tables 25 and 26 regardless of their MFE or nucleotide compositions.

Figure 5
figure 5

Structural Entropy vs. Uracil-comp. in B. subtilis. Entropy Distribution of the 157 nt window-scan results. 28340 candidate segments of B. subtilis against uracil composition. Blue denotes all segments. Red denotes those with classification probabilities under the LMFEGCRND are higher than 0.8. Green denotes the eleven bonafide riboswitches of the test set that are in B. subtilis.

Figure 6
figure 6

Genomic Distribution. Left: Bacillus subtilis. Distribution of locations of sequence segments of the non-overlapping 157 nt window scan of the B. subtilis intergenic regions. Location of all segments tested is depicted as grey. Location of segments with uracil composition between 0.2484 and 0.40127 and probabilities higher than 0.8 under the classifier LMFEGCRND are shown in red. Outer circle represents the direct strand while the inner circle represents the complementary strand. 72.2% (39 out of 54) of hits on the forward strand are located in the first half of the genome. 69.6% (32 out of 46) of the hits on the reverse strand are located in the second half of the genome. Right: Escherichia coli. Distribution of locations of sequence segments of the 50 nt-overlapping 100 nt window scan of the E. coli non-annotated intergenic regions. Location of all segments tested is depicted as grey. Location of segments with uracil composition between 0.23 and 0.34 and probabilities higher than 0.768 under the classifier LMFEGCRND are shown in red (please see Table 28 for more information on the hits). Outer circle represents the direct strand while the inner circle represents the complementary strand. 64.2% (34 out of 53) of hits on the forward strand are located in the first half of the genome. 60% (36 out of 60) of the hits on the reverse strand are located in the second half of the genome.

Table 22 Top classification hits in B. subtilis
Table 23 Top classification hits in B. subtilis uracil-comp. constrained
Table 24 Top entropy hits in B. subtilis filtered for GC-comp. and uracil-comp.
Table 25 Top entropy hits in B. subtilis forward strand
Table 26 Top entropy hits in B. subtilis reverse strand

Escherichia coli and Synechococcus elongatus

Nine out of the 29 riboswitches in the training set are from the E. coli genome. As a test of the generality of the results on B. subtilis, we evaluated the performance of the three classifiers on various constant-length riboswitches, 100 nt, 150 nt, 157 nt, and 200 nt on E. coli. The performance of the LMFEGCRND classifier for the 100 nt-constant length was slightly higher than other tests (data not shown). Hence, the 100 nt constant-length window scan of 50 nt overlap was used to examine the intergenic regions of E. coli. The operon coordinates were taken from RegulonDB website [84]. Top 50 hits on each strands are available in Table 27. Top 50 hits having uracil compositions within the range of known riboswitches are organized in Table 28. The genomic distribution of the latter set is shown in Figure 6. Sequence segments having significant MFE and high structural entropy values are sorted in Tables 29 and 30 for significant and insignificant structural entropy values, respectively.

Table 27 Top classification hits in E. coli
Table 28 Top classification hits in E. coli uracil-comp. constrained
Table 29 Top entropy hits of E. coli filtered for GC- and uracil-comp
Table 30 Top entropy hits in E. coli

Synechococcus elongatus is another gram-negative bacterium which belongs to cyanobacteria. This organism is able to survive in freshwater environments with low nutrients. Intergenic regions of the sequenced strain Synechococcus elongatus PCC 6301 were scanned for riboswitch identification using the LMFEGCRND classifier with sliding window of 150 nt and 75 nt overlaps. Scanning procedure was similar to other organisms. The resulting top 100 hits (top 50 hits of the forward and reverse strands) are sorted in Table 31 according to their classification probabilities of being riboswitches under the LMFEGCRND classifier.

Table 31 Top classification hits in S. elongatus

Genome-wide scan results: tables

Classification performance on the B. subtilis riboswitches is shown in Table 16. Classification performance on B. subtilis riboswitches using different choices of length is shown in Table 17. Performance values for the choice of 157 nt as the riboswitch length are shown in Tables 18 and 19 for maximum overlap and relative positioning, respectively. Ranking of probabilities associated with different riboswitches are shown in Tables 20 and 21. Genome-wide scan results for B. subtilis are shown in Tables 22, 23, 24, 25 and 26 (please refer to Table captions for further explanation). Results for E. coli are shown in Tables 27, 28, 29 and 30. Results for S. elongatus are shown in Table 31.