Introduction

Bacterial small non-coding RNAs (sRNAs) are transcripts that, instead of encoding proteins, function directly at the level of RNA in the cell1,2. They are usually 50–250 nucleotides in length. sRNAs have emerged as key regulators of gene expression in pathogenic bacteria3. They are involved in diverse biological processes, including bacterial virulence4,5, oxidative stress response6 and cell-to-cell communication in quorum sensing7. Recent advances in high-throughput techniques, such as RNA sequencing (RNA-Seq)8,9 and tiling arrays10, have identified and characterized a number of sRNAs and provided valuable insights into bacterial physiology. However, large-scale experimental identification of sRNAs is still lagging for several species. There is therefore an urgent need to develop efficient computational tools for identifying sRNAs.

Several computational methods have been developed in the recent past for the identification of sRNAs in bacteria using comparative genomics11,12,13,14, primary sequence15,16 and secondary structure17,18,19 features. QRNA11 used pairwise alignments to identify novel sRNAs in bacteria. This technique employed pair hidden Markov models (pair-HMMs) and a pair stochastic context-free grammar (pair-SCFG) to classify an alignment as structured RNA (RNA), coding RNA (COD) or something else (OTH). The RNA, COD and OTH models assumed, respectively, that the mutation pattern is significantly conserved in homologous RNA secondary structures, that the aligned sequences encode homologous proteins and that mutations occur in a simple position-independent manner. Alifoldz12 introduced a combination of free energy and covariance to discriminate functional from random RNAs. The MSARI13 algorithm applied the idea of detecting conserved RNA secondary structures among candidate orthologs by multiple sequence alignment. This program used RNAFOLD20 to predict the RNA secondary structure and CLUSTALW21 to build the sequence alignment. zMFold14 offered a new shuffling program in Perl (shuffle-pair.pl) for pairwise alignments that simultaneously preserves key features of the alignment. It used an alignment dataset along with real and shuffled genomic sequences as inputs for a panel of published tools to identify novel non-coding RNAs. Carter et al.15 developed a machine learning approach using neural networks (NN) and support vector machines (SVM) to predict novel functional RNAs in genomic sequence. The authors calculated twenty sequence composition features (4 mono-nucleotide and 16 di-nucleotide) and six structural motif features for all sequence windows and used them in training and testing of the NN and SVM. Klein et al.16 proposed a screening technique for identifying novel non-coding RNAs by using both GC content bias and QRNA-based comparative analysis.
RNAz217 identified thermodynamically stable and evolutionarily conserved RNA secondary structures in multiple sequence alignments and subsequently filtered candidate sRNAs. It measured thermodynamic stability with a z-score and evolutionary conservation with a structure conservation index (SCI). Dynalign II18, an update of Dynalign19, is a software package that predicts the common secondary structure of two RNA homologs and incorporates the prediction of inserted domains into its dynamic programming algorithm. However, a major shortcoming of the available computational methods to identify sRNAs is that most of them favour either sensitivity or specificity, but not both22, and therefore generate a high number of either false positives or false negatives. This underscores the necessity to develop a tool that identifies sRNAs with higher accuracy and nearly equal sensitivity and specificity (so that both the false positive and false negative rates are very low). Machine learning techniques have been extensively used to identify different classes of small non-coding RNA molecules, namely microRNAs (miRNAs)23,24,25,26,27,28,29,30,31,32,33,34 and transfer RNAs23,24. Among these techniques, SVM25,26,27,28,29,30,31,32 has been used more widely than random forest (RF)33,34 and neural network (NN)35,36 approaches. Several excellent reviews are also available on this topic37,38,39,40.

In this study, we introduce an SVM classifier, evaluated with a 10-fold cross-validation technique41, that incorporates the primary sequence and secondary structure features of sRNAs to efficiently identify them in bacteria. Experimentally-validated sRNAs of Salmonella Typhimurium LT2 (SLT2) have been used to develop an optimum SVM model for identifying potential new sRNAs. The proposed SVM model efficiently identifies sRNAs with nearly equal sensitivity and specificity. We have also validated our proposed SVM model on experimentally-validated sRNAs of Escherichia coli (E. coli) K-12 and Salmonella Typhi (S. Typhi) Ty2. In addition, we have applied a sliding window-based method to identify sRNAs from the complete genome of a particular strain. All the source code, the help files and the proposed best SVM model are freely available at http://www.bicniced.org/RKB_profile.htm.

Results

Selection of best features

Different features and their combinations were used to achieve greater accuracy with nearly equal sensitivity and specificity. The secondary structure features performed worse than the primary sequence features (Table 1). The tri-nucleotide composition (64 features) and all nucleotides composition (84 features: 4 mono- + 16 di- + 64 tri-nucleotide) performed slightly better than the other primary sequence feature combinations. The tri-nucleotide and all nucleotides composition features achieved accuracies of 88.19% (sensitivity of 84.61% and specificity of 91.78%) and 87.92% (sensitivity of 89% and specificity of 86.83%) at threshold values of 0.3 and −0.2, respectively.

Table 1 Performance measures on different combination of features in SLT2 dataset, using RBF kernel of SVM.

Classification performance on imbalanced datasets

We also tested the performance of our proposed method on imbalanced datasets, to more closely resemble real-world scenarios in which the negative dataset is larger than the positive dataset. As shown in Table 2, the tri-nucleotide composition feature performed somewhat better when the negative dataset was double the size of the positive dataset. The tri-nucleotide composition feature (P:N = 1:2) achieved an accuracy of 88.35% with sensitivity of 85.11% and specificity of 91.58% at a threshold value of 0.0.

Table 2 SVM performance measures on balanced and imbalanced SLT2 datasets.

Kernel-wise performance of SVM

We tested different kernels of SVM and combinations of parameters related to them along with other performance measures to achieve the best accuracy. As shown in Table S1, linear (accuracy of 85.53% with sensitivity of 84.50% and specificity of 86.56%), polynomial (accuracy of 87.79% with sensitivity of 84.56% and specificity of 91.03%) and RBF (accuracy of 88.35% with sensitivity of 85.11% and specificity of 91.58%) kernels of SVM performed better than sigmoid kernel (accuracy of 65.26% with sensitivity of 31.33% and specificity of 99.19%). The RBF kernel performed slightly better than the linear and polynomial kernels.

Comparison of proposed SVM method with other machine learning methods

We compared our proposed SVM method with other frequently used machine learning methods, namely random forest and multilayer perceptron. As shown in Table 3, the proposed SVM method (accuracy of 88.35% with sensitivity of 85.11% and specificity of 91.58%) performed marginally better than random forest (accuracy of 81.18%, sensitivity of 68.48% and specificity of 95.88%) and multilayer perceptron (accuracy of 85.71%, sensitivity of 81.87% and specificity of 89.56%).

Table 3 Performance comparison of different machine learning methods.

Comparison with other basic computational methods for identifying sRNAs

We compared the performance of our proposed model on SLT2 dataset with that of the other methods as estimated previously by Arnedo et al.22 with the same dataset. As shown in Table 4, our SVM model attained an accuracy of 88.35% with sensitivity of 85.11% and specificity of 91.58%. In contrast, all other methods except zMFold favoured specificity over sensitivity and performed poorly in terms of accuracy, sensitivity and specificity (Table 4). Arnedo et al.22 achieved an accuracy of 72.5% and sensitivity and specificity of 67% and 78%, respectively, which was inferior to our results. Thus, the data clearly indicate that our proposed method performed the best among different techniques to predict sRNA.

Table 4 Performance measures of the individual methods on SLT2 dataset.

Validation of proposed SVM model on other experimentally verified sRNAs

For the purpose of validation, we applied the SVM model to experimentally-verified sRNAs of other bacteria, which were not used in the original training and testing datasets. Our method achieved accuracies of 81.25% (sensitivity 73.75% and specificity 88.75%) and 88.82% (sensitivity 89.47% and specificity 88.16%) for sRNAs of E. coli K-12 and S. Typhi Ty2, respectively (Table 5). The result showed that the proposed model attained similar performance on SLT2, E. coli K-12 and S. Typhi Ty2 datasets.

Table 5 Performance measures of proposed SVM model on experimentally verified sRNAs of other bacteria that were not used in training and testing datasets.

Performance on genome-wide identification of sRNAs

Finally, we applied a sliding window-based approach to identify sRNAs from the complete genomes of S. Typhimurium LT2, S. Typhi Ty2 and E. coli K-12 using the SVM model we developed. As shown in Tables 6 and S2–S4, the sliding window-based method achieved efficacy comparable to that of the SVM model applied to known sRNA sequences: the sensitivities of the former were 89.09%, 83.33% and 67.39%, compared with 85.11%, 89.47% and 73.75% for the latter, for SLT2, S. Typhi Ty2 and E. coli K-12, respectively.

Table 6 Performance of sliding window-based approach for identifying sRNAs from complete genome.

Discussion

The role of sRNAs is more diverse than anticipated. Therefore, sRNA identification is of particular interest. The experimental techniques are ideal for the identification of sRNAs and exploration of their role in individual species. However, optimal techniques are lacking for many species. This limitation may be complemented by computational methods.

Several computational methods have been introduced over the years to identify sRNAs of bacteria, but most of them perform poorly at identifying positive and negative sRNAs simultaneously. For example, methods such as MSARI, RNAz2, vsFold, Alifoldz and Dynalign showed good performance in identifying negative sRNAs (non-sRNAs), but poor ability in identifying positive sRNAs (Table 4). On the other hand, zMFold was efficient in identifying positive, but not negative sRNAs. As a result, these methods generated a high number of either false negatives (FN) or false positives (FP). Other tools, such as QRNA and the one proposed by Arnedo et al., showed slightly better performance for the identification of both positive and negative sRNAs simultaneously, but their accuracies were suboptimal.

In this study, we introduced a computational method to identify sRNAs of bacteria using SVM. The proposed SVM model simultaneously minimized the false negatives (FN) and false positives (FP). As a result, the model achieved good accuracy with nearly equal sensitivity and specificity (Tables 1, 2, 3 and 4). This method showed better performance than the existing techniques (Table 4). In addition, our method showed that simple nucleotide sequence features (tri-nucleotide composition) can efficiently predict sRNAs in bacteria (Tables 1 and 2). Finally, we applied the proposed SVM model to E. coli K-12 and S. Typhi Ty2 datasets for the purpose of validation. On these datasets, which were not used for training or testing, our model achieved performance similar to that on SLT2.

To identify sRNAs from a complete genome, we used a sliding window-based approach. This method achieved good sensitivity for the experimentally verified SLT2 and S. Typhi Ty2 datasets. However, the performance for the E. coli K-12 dataset was inferior, with a sensitivity of 67.39%, which was lower than the sensitivity of 73.75% achieved by the SVM model applied directly to the known E. coli K-12 sRNA sequences. One possible reason for this difference is that 42.5% of the experimentally verified sRNAs of E. coli K-12 overlap with protein coding regions, whereas the corresponding overlap for the experimentally verified sRNAs of SLT2 and S. Typhi Ty2 was only 9.34% and 21.05%, respectively. We also found that the median length of the 80 experimentally-validated sRNAs of E. coli K-12 is 110 nucleotides. Therefore, we tried a new window size of 110 and step size of 40 nucleotides for the complete genome of E. coli K-12. The sliding window-based approach with the new parameters achieved a sensitivity of 71.74% for the E. coli K-12 dataset, whereas a window size of 145 and step size of 45 achieved a sensitivity of 67.39%. These results indicate that proper selection of window and step size may improve the identification of sRNAs from the complete genome.

The primary goal of the present study was to predict small non-coding RNAs (sRNAs) from complete genome sequences of bacteria. Hence, we searched for simple sequence features that could efficiently identify sRNAs. We found that the simple tri-nucleotide composition feature can efficiently predict sRNAs. A literature search identified six sequence-derived features (spectrum profile, mismatch profile, subsequence profile, position-specific matrix, pseudo dinucleotide composition and local structure-sequence triplet elements) that can predict piwi-interacting RNAs (piRNA)42,43. We will use these features in our future studies to further improve the prediction model. We encountered several challenges in constructing a webserver for predicting sRNAs from complete genomes. For the genome-wide identification of sRNAs using the sliding window technique, we had to collect the protein coding table of a particular bacterium from the NCBI webserver. However, we faced difficulties in automatically downloading and formatting the protein coding table from NCBI. In addition, all the steps to predict sRNAs from a complete genome would take a few hours to complete. Hence, instead of creating a web server, we have deposited all the source code at http://www.bicniced.org/RKB_profile.htm. We have also created user guide documents (HelpForPredictingsRNAsByBarmanetal.pdf and HelpForParsingCDS.pdf) that describe the steps to predict sRNAs from a complete genome. Users can predict sRNAs from the complete genome of a bacterium of interest on their local desktop.

Methods

Data collection

A total of 193 experimentally validated sRNAs of SLT2 were collected from Arnedo et al.22. These authors originally collected sRNAs of SLT2 from the RFAM database44 as well as from previously available literature45,46,47,48,49. We retrieved a table with the name, source of identification, and start and end positions of the sRNAs from the published article (Supplementary Table S5)22. We downloaded the complete genome sequence of SLT2 (http://www.ncbi.nlm.nih.gov/nuccore/16763390?report=fasta) and extracted the exact sRNA sequences from it using the start and end positions of each sRNA. We found that 11 out of 193 sRNAs were redundant at the sequence level, differing only in their start and end positions. Since we planned to predict sRNAs using their primary sequences, we removed the redundant sRNAs from the SLT2 dataset. Thus, we finally used 182 experimentally-validated sRNAs of SLT2 as the positive dataset (Supplementary Table S6).
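The extraction and redundancy-removal step can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function name, the toy genome and the toy annotation table are hypothetical, coordinates are assumed to be 1-based and inclusive, and strand handling is ignored.

```python
def extract_unique_srnas(genome, annotations):
    """Slice sRNA sequences out of a genome and drop repeated sequences.

    genome: the genome as one string.
    annotations: iterable of (name, start, end); coordinates are assumed
    to be 1-based and inclusive (an assumption, not stated in the text).
    """
    seen = set()
    unique = []
    for name, start, end in annotations:
        seq = genome[start - 1:end].upper()
        if seq in seen:
            continue  # identical sequence already kept under another entry
        seen.add(seq)
        unique.append((name, start, end, seq))
    return unique


# Toy data: entries a and b have the same sequence at different positions.
toy_genome = "ACGTACGTTTGACCA"
toy_table = [("sRNA_a", 1, 4), ("sRNA_b", 5, 8), ("sRNA_c", 9, 12)]
kept = extract_unique_srnas(toy_genome, toy_table)
```

In this toy table the second entry repeats the sequence of the first at different coordinates, so only two entries are kept.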

We used the shuffleseq50 program to randomly shuffle the bases of the complete genome sequence of SLT2 without affecting its composition. We generated ten negative (non-sRNA) datasets by shuffling the complete genome sequence of SLT2 ten times and extracting sequences at the start and end positions of the positive sRNAs (Supplementary Tables S7–S16). Furthermore, we plotted Venn diagrams using the Venny webserver51 to ensure that all negative datasets were unique and different from the positive dataset as well as from each other (Supplementary Figures S1–S4).
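The idea behind the negative datasets can be illustrated with a short Python sketch. It does not call shuffleseq; Python's random.shuffle is swapped in as a stand-in for a composition-preserving shuffle, and the function name, seeds, toy genome and coordinates are all hypothetical.

```python
import random


def make_negative_set(genome, coordinates, seed):
    """Shuffle the whole genome once, then cut out the sRNA coordinates.

    Shuffling permutes the bases, so the overall base composition of the
    genome is unchanged. Coordinates are assumed 1-based and inclusive.
    """
    rng = random.Random(seed)
    bases = list(genome)
    rng.shuffle(bases)
    shuffled = "".join(bases)
    return [shuffled[start - 1:end] for start, end in coordinates]


toy_genome = "ACGTTGCAAGGCTTACCGAT" * 5   # 100 bases, illustrative only
toy_coords = [(1, 20), (31, 60)]          # positions of two "positive" sRNAs
# Ten independent shuffles give ten negative datasets of matching lengths.
negative_sets = [make_negative_set(toy_genome, toy_coords, seed)
                 for seed in range(10)]
```

Each negative sequence has the same length as the positive sRNA whose coordinates it reuses, which keeps the length distributions of the two classes identical.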

In order to validate our proposed SVM model in other bacteria, we collected experimentally validated sRNAs of E. coli K-12 (80) and S. Typhi Ty2 (38) from Raghavan et al.52 and Perkins et al.8, respectively. The negative datasets of E. coli K-12 and S. Typhi Ty2 were generated using the same shuffleseq program as described above.

10-fold cross-validation

We used 10-fold cross-validation to estimate the performance of our proposed SVM model. In 10-fold cross-validation, the whole dataset was divided into 10 folds of equal (or nearly equal) size. Training and testing were repeated ten times so that each time a different fold was used for testing, while the remaining 9 folds were used for training. The overall performance of the proposed SVM model was calculated as the average performance over the 10 folds.
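A minimal Python sketch of this splitting scheme is given below. The function names and the evaluate callback are hypothetical, and the split is a plain random partition; whether the original folds were stratified by class is not stated, so no stratification is assumed here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10):
    """Average a user-supplied evaluate(train, test) score over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Example: 182 positive + 182 negative sequences give 364 samples.
splits = list(k_fold_indices(364))
```

With 364 samples, four folds contain 37 samples and six contain 36, and every sample is used for testing exactly once.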

Features

We used different nucleotide composition and secondary structure features for training and testing of the SVM. A total of 84 nucleotide composition features (4 for mono-nucleotide, 16 for di-nucleotide and 64 for tri-nucleotide composition) and 3 secondary structure features (stem, loop and minimum free energy) were used as described below.

Mono-nucleotide composition: Mono-nucleotide composition of all 4 nucleotides was calculated using the following equation:

where i denotes any nucleotide “A” or “T” or “G” or “C”.

Di-nucleotide composition: Di-nucleotide composition of all 16 di-nucleotides was calculated using the following equation:

where i denotes any di-nucleotide among the 16 di-nucleotides (“AA”, “AT”, “AG”, “AC”, …, “CC”).

Tri-nucleotide composition: Tri-nucleotide composition of all 64 tri-nucleotides was calculated using the following equation:

where i denotes any tri-nucleotide among the 64 tri-nucleotides (“AAA”, “ATA”, “AGA”, …, “CCC”).
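The three composition features can be sketched in Python as shown below. The exact equations are not reproduced in this text, so the normalisation is an assumption: each k-mer count is divided by the number of overlapping k-mers in the sequence, giving a fraction (a percentage would simply multiply by 100). The function names are hypothetical.

```python
from itertools import product


def kmer_composition(seq, k):
    """Return the fraction of each k-mer, ordered over the alphabet A, T, G, C.

    Assumption: k-mers are counted in overlapping windows and divided by
    the total number of windows (len(seq) - k + 1).
    """
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ATGC", repeat=k)]
    total = len(seq) - k + 1
    if total <= 0:
        return [0.0] * len(kmers)
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[kmer] += 1
    return [counts[m] / total for m in kmers]


def all_composition_features(seq):
    """Concatenate mono- (4), di- (16) and tri-nucleotide (64) compositions."""
    return (kmer_composition(seq, 1)
            + kmer_composition(seq, 2)
            + kmer_composition(seq, 3))


features = all_composition_features("ATGCATGC")
```

The concatenated vector has 4 + 16 + 64 = 84 entries, matching the size of the "all nucleotides composition" feature set described above.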

Stem, loop and minimum free energy (MFE): RNAFOLD20 was used to calculate the number of stems, loops and minimum free energy (MFE) from the predicted secondary structure of individual sRNAs and non-sRNAs.

SVM Classifier

A Support Vector Machine (SVM) classifier was used to discriminate sRNAs from non-sRNAs. We employed the SVMlight tool provided by T. Joachims53, which allows the user to select different kernels and parameters to find a decision surface that maximizes the margin between data points of the two classes (sRNAs and non-sRNAs). We tested different kernels of SVM and combinations of their respective parameters to optimize the performance. We tested nearly 4000 combinations of parameters in each case (c [trade-off between training error and margin] and j [cost factor] for the linear kernel; d [parameter d in the polynomial function], c and j for the polynomial kernel; g [gamma], c and j for the RBF kernel; and s [parameter s in the sigmoid function], c, j and r [parameter c in the sigmoid function] for the sigmoid kernel) and reported only the best one in the Results section.
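The parameter search can be pictured as a grid enumeration. The Python sketch below only builds svm_learn command lines and does not run them. The option letters (-t, -c, -j, -d, -g, -s, -r) are SVMlight's svm_learn options; the file names and the grid values are hypothetical and far smaller than the roughly 4000 combinations tested in the study.

```python
from itertools import product

# Parameters tuned per SVMlight kernel type (-t), as listed in the text.
KERNEL_PARAMS = {
    0: ("c", "j"),            # linear
    1: ("d", "c", "j"),       # polynomial
    2: ("g", "c", "j"),       # RBF
    3: ("s", "c", "j", "r"),  # sigmoid
}


def svmlight_commands(kernel, grid, train_file="train.dat", model_file="model.dat"):
    """Build one svm_learn argument list per parameter combination.

    train_file and model_file are placeholder names (assumptions).
    """
    names = KERNEL_PARAMS[kernel]
    commands = []
    for values in product(*(grid[name] for name in names)):
        cmd = ["svm_learn", "-t", str(kernel)]
        for name, value in zip(names, values):
            cmd += ["-" + name, str(value)]
        cmd += [train_file, model_file]
        commands.append(cmd)
    return commands


# Toy grid; the values are illustrative only.
toy_grid = {"c": [0.1, 1, 10], "j": [1, 2], "g": [0.001, 0.01],
            "d": [2, 3], "s": [0.1], "r": [0, 1]}
rbf_commands = svmlight_commands(2, toy_grid)
```

Each argument list could then be passed to a process runner and the resulting model scored by cross-validation, keeping the combination with the best balance of sensitivity and specificity.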

Welch’s t-test

We employed Welch’s two sample t-test54 to find the subset of nucleotide composition features that were significantly different between sRNAs and non-sRNAs (Welch’s two sample t-test p value < 0.05). We found that 13 nucleotide composition features in sRNAs were significantly different from non-sRNAs, while in the case of non-sRNAs, 25 nucleotide composition features were found to be significantly different from sRNAs (Supplementary Table S17).
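For reference, the Welch statistic for one feature can be computed as in the following Python sketch, which uses only the standard library. It returns the t statistic and the Welch–Satterthwaite degrees of freedom; converting these to a p value requires a t-distribution function from a statistics package, which is not shown. The function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and its approximate degrees of freedom.

    a, b: values of one feature in the two groups (e.g. sRNAs, non-sRNAs).
    Unlike Student's t-test, the two variances are not assumed to be equal.
    """
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy feature values, illustrative only.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```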

Performance measures

Threshold-dependent performance measures for the binary classification problem, including sensitivity (recall), specificity, accuracy, positive predictive value (PPV or precision), Matthews correlation coefficient (MCC) and F1 score, were calculated using the following equations:

where n denotes the ratio of the size of the negative dataset to that of the positive dataset

where, True Positive (TP): sRNAs are correctly identified as sRNAs.

False Positive (FP): non–sRNAs are incorrectly identified as sRNAs.

True Negative (TN): non–sRNAs are correctly identified as non–sRNAs.

False Negative (FN): sRNAs are incorrectly identified as non–sRNAs.

A threshold-independent performance measure, the area under the receiver operating characteristic (ROC) curve (AUC), was also computed for all cases.

Genome-wide identification of sRNAs using sliding windows technique

We found that most of the experimentally-verified sRNA sequences (~85%) fall in the inter-genic regions of the genome. In order to derive the inter-genic regions of a particular genome, we collected the complete genome sequence and the protein-coding table of the corresponding strain from the NCBI genome database. If the protein-coding table was not directly available for a certain strain, we parsed the coding table from NCBI. All the coding regions were excluded and only the inter-genic regions were retained for further analysis. Additionally, inter-genic regions shorter than 50 nucleotides were also excluded.
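Deriving the inter-genic regions amounts to taking the complement of the coding intervals. A minimal Python sketch is given below; the function name and the toy coordinates are hypothetical, coordinates are assumed 1-based and inclusive, and the genome is treated as linear (wrap-around at the origin of a circular chromosome is ignored).

```python
def intergenic_regions(genome_length, coding, min_length=50):
    """Return the gaps between coding regions as (start, end) tuples.

    coding: list of (start, end) coding intervals, possibly overlapping.
    Gaps shorter than min_length nucleotides are discarded.
    """
    regions = []
    position = 1  # first base not yet covered by a coding region
    for start, end in sorted(coding):
        if start > position:
            regions.append((position, start - 1))
        position = max(position, end + 1)
    if position <= genome_length:
        regions.append((position, genome_length))
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]


# Toy example: two overlapping genes, then a 19-nt gap that is filtered out.
gaps = intergenic_regions(1000, [(101, 300), (250, 400), (420, 900)])
```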

We used a sliding window-based approach to identify sRNAs from the inter-genic regions. The selection of the window and step size plays a crucial role in identifying sRNAs from the complete genome. Previously, window sizes of 100 to 200 and step sizes of 40 to 50 nucleotides were used to predict sRNAs in prokaryotes55,56,57. We selected a window size of 145 and step size of 45 nucleotides, since the median lengths of the experimentally-verified sRNAs of SLT2 (182 sRNAs) and of the Bacterial Small Regulatory RNA Database (BSRD) (897 sRNAs) are 145.5 and 145, respectively. We generated every possible window for all the inter-genic regions of a bacterial genome and calculated the tri-nucleotide composition features of the windows. We then obtained a prediction score for each window by using our proposed SVM model. Finally, the average prediction score of the windows corresponding to an experimentally-verified sRNA was calculated. If the average prediction score was greater than 0 (the threshold value), we treated the sRNA as positive, that is, correctly predicted.
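The windowing and averaging steps can be sketched in Python as follows. Several details are assumptions: a window is taken to "correspond" to an sRNA when the two overlap by at least one nucleotide, a region shorter than the window is kept as a single window, and the window scores are toy numbers standing in for SVM outputs. All function names are hypothetical.

```python
def windows(start, end, size=145, step=45):
    """Tile one inter-genic region (1-based, inclusive) with fixed windows."""
    if end - start + 1 < size:
        return [(start, end)]  # ASSUMPTION: short region kept as one window
    last_start = end - size + 1
    return [(s, s + size - 1) for s in range(start, last_start + 1, step)]


def mean_score_for_srna(srna, scored_windows):
    """Average the scores of all windows overlapping a known sRNA."""
    s_start, s_end = srna
    hits = [score for (w_start, w_end), score in scored_windows
            if w_start <= s_end and w_end >= s_start]
    return sum(hits) / len(hits) if hits else None


def is_detected(srna, scored_windows, threshold=0.0):
    """A known sRNA counts as detected if its mean window score > threshold."""
    avg = mean_score_for_srna(srna, scored_windows)
    return avg is not None and avg > threshold


region_windows = windows(1, 300)
# Toy scores in place of real SVM outputs for the four windows.
scored = list(zip(region_windows, [0.5, -0.2, 0.3, -1.0]))
```

In this toy region, an sRNA at positions 10 to 40 overlaps only the first window and is detected, whereas one at positions 150 to 200 overlaps the last three windows, whose mean score is negative.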

Additional Information

How to cite this article: Barman, R. K. et al. An improved method for identification of small non-coding RNAs in bacteria using support vector machine. Sci. Rep. 7, 46070; doi: 10.1038/srep46070 (2017).
