Background

Initiation of protein translation in eukaryotes is either governed by a cap- and 5' end-dependent mechanism, the scanning model, or mediated in a cap- and 5' end-independent manner through an RNA element termed an "internal ribosomal entry site" (IRES) [1]. The scanning machinery, comprising the 40S ribosomal subunit and a cap-binding initiation factor complex (eIF4F, composed of eIF4E, eIF4G, and eIF4A), recognizes and binds the methylated cap structure at the 5' end of the mRNA and scans linearly downstream until it reaches an AUG codon embedded in an optimal context for translation initiation [2]. For most eukaryotic mRNAs, the first AUG encountered by the translation initiation complex acts as the initiation codon. This is termed the cardinal rule or the first AUG rule. In contrast to the scanning model, an IRES can form specific secondary and tertiary structures and interact directly with the translational machinery, without scanning from the 5' end to reach the AUG start codon.

IRES elements were first discovered in the mRNAs of the virus family Picornaviridae [3], which have a long, highly structured 5'UTR that lacks a methylated cap structure at the 5' end. Most picornaviruses also express a protease that specifically cleaves eIF4G, so that the cap-binding protein eIF4E can no longer assemble with the 43S complex (comprising eIF3 and the 40S ribosomal subunit charged with eIF2-GTP-Met-tRNA). Thus, upon picornavirus infection, host cellular protein synthesis is shut down and the viral genome is translated from the IRES without competition from cellular mRNA. The cleaved eIF4G (named p100) is able to interact with picornavirus IRESs in the absence of the eIF4E-binding domain [4]. The IRES may therefore be a virulence factor, and identifying the IRES elements of pathogenic viruses could benefit the treatment of the diseases they cause. In addition, IRESs can be employed in the development of bicistronic expression vectors, an important tool in biotechnology [5]. Developing an IRES search system (IRSS) for the prediction and identification of IRES elements in a virus genome is therefore an important goal.

Based on their predicted secondary structure and their activity in vitro, the IRES elements of picornaviruses are divided into four classes: type I, type II, hepatitis A virus (HAV) IRES and hepatitis C virus (HCV)-like IRES [6, 7]. Type I IRESs occur in enterovirus and rhinovirus genomes and are inefficient in driving translation initiation in rabbit reticulocyte lysate (RRL) [8, 9]; HeLa cell extracts are required for their optimal activity in the RRL in vitro translation system. In contrast, type II IRESs, found in cardiovirus and aphthovirus genomes, initiate translation efficiently in RRL [10, 11]. The HAV IRES can also function in the RRL system [6, 12]. However, its activity in the RRL in vitro translation system is stimulated by liver cell extracts but not by HeLa cell extracts [13]. HCV-like picornavirus IRESs were found in Porcine teschovirus and Simian picornavirus and display IRES activity in the RRL in vitro translation system [14, 15].

IRES elements of the same class might have a conserved primary sequence because of functional constraints. Unfortunately, the low homology between different IRES classes makes BLAST searches based on primary sequence inaccurate. RNA structure prediction should therefore improve the accuracy of de novo prediction of IRES elements, which otherwise depends to some extent on good fortune. Many RNA structure prediction models have been used in RNA structure simulation, but none is suited to predicting IRES elements. To set up an IRES search system (IRSS), we applied two RNA structure prediction models: comparative sequence analysis and minimum free energy structure prediction. Comparative sequence analysis [16] is the gold standard for predicting RNA secondary structure without an all-atom model. Over 97% of the base pairs in ribosomal RNA secondary structures predicted by comparative sequence analysis were later confirmed in high-resolution crystal structures [17–19]. However, comparative sequence analysis requires a large number of homologous sequences in the database. In the absence of the necessary homologous sequences, minimum free energy structure prediction can be used to predict the structure of a single RNA sequence with an average accuracy of 73% [20]. This accuracy is sufficient to serve as a starting point for building an alignment for comparative sequence analysis [21–23].

The predicted minimum free energy (MFE) structure assumes that the secondary structure is at equilibrium and provides a good approximation of the secondary structure [24]. However, the thermodynamic parameters used to evaluate conformational free energies are assumed to be free of error. Moreover, secondary structure prediction is more complicated for IRES elements than for other RNAs, because the three different IRES types all interact with eIF4 and the 40S ribosomal subunit but have diverse RNA structures. To develop an IRES search system, we combined MFE and RNA alignment modeling programs and adjusted their parameters to create a useful search platform for IRES prediction. Such a system must screen a database of virus sequences by secondary structure prediction to identify candidate IRES elements in virus genomes, especially those of positive-strand viruses with 5' untranslated regions. IRSS will help biologists to predict or discover new viral IRES elements.

Results

Evaluation of IRES structure search system by Genome scanning

To estimate the accuracy of IRES element prediction by IRSS, a known IRES structure and four viral genome sequences were input into IRSS. Enterovirus IRES domain IV [25] was selected as the first target. The whole genome sequences of four different viruses (Enterovirus 71 (U22521), Bovine Enterovirus (NC_001859), Human Rhinovirus (NC_001617) and Hepatitis C virus (NC_004102) [26–28]) were downloaded into IRSS, and the UTR2SQ.pl program was run to start RNA secondary structure prediction (see Figure 1). These four viruses were chosen because they are evolutionarily distant from one another. Both U22521 (EV71) and NC_001859 (BEV) are enteroviruses but belong to different species. Human Rhinovirus belongs to the family Picornaviridae, as do the enteroviruses, but to a different group within it. Hepatitis C virus belongs to the family Flaviviridae. When IRSS was used to search these four virus genomes (see Figure 2A, B, C and D), EV71 domain IV was successfully predicted as an IRES structure at nucleotide positions 240–444, which matches the prediction of Zell and Stelzner [29]. In this test, the L parameter of RNALfold was set to 250 (250 bases) to capture possible IRES structures, because EV71 domain IV is 205 bases long.
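As a rough illustration of the folding step described above, the following Python sketch builds an RNALfold call with the length parameter set to 250. It assumes that the ViennaRNA `RNALfold` executable is installed and on the PATH, that it reads the sequence from standard input, and that `-L` sets the maximum base-pair span. The helper names `rnalfold_command` and `run_rnalfold` are hypothetical and are not part of IRSS, whose own wrapper scripts are listed in the additional files.

```python
import subprocess


def rnalfold_command(span=250):
    """Return the argument list for RNALfold with maximum base-pair span `span`."""
    if span <= 0:
        raise ValueError("span must be a positive integer")
    return ["RNALfold", "-L", str(span)]


def run_rnalfold(sequence, span=250):
    """Fold `sequence` locally and return RNALfold's raw text output.

    Assumption: the RNALfold binary is available; the output is returned
    unparsed because its exact layout depends on the installed version.
    """
    result = subprocess.run(
        rnalfold_command(span),
        input=sequence + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With a genome held in a string `genome`, `run_rnalfold(genome, span=250)` would correspond to the L = 250 setting used in this test.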

Figure 1

Flowchart of the IRES search system. The RNA primary sequence is input at "START". RNA secondary structure is predicted with different L parameters and then compared with a known IRES structure. The alignment score indicates the likelihood of a de novo IRES. The IRES structures are displayed by the RnaViz software, and the alignment can be edited with any text editor. The six in-house programs are indicated beside the arrows and are listed in Additional file 2, Additional file 3, Additional file 4, Additional file 5, Additional file 6 and Additional file 7.

Figure 2

Four virus genome sequences containing IRES elements were tested in IRSS. With a window of 250 nucleotides, the R value of each window was calculated for (A) EV71, (B) BEV, (C) HRV and (D) HCV.

For the whole-genome search of EV71, the possible IRES structures predicted by RNALfold are shown in Figure 2A. The R value (Y axis) is the match length (ALEN) divided by the distance score (DIST). The R value at positions 242–444 is significantly higher than at other positions (see Figure 2A). The average R value for this predicted IRES domain is 2.4, whereas the average over the whole EV71 genome is 1.43 with a standard deviation of 0.14. In BEV, three predicted IRES structures around position 315 have R values above 1.8, and the R value at positions 315–549 reaches 2.03 (see Figure 2B). These predicted IRES structures are located in the 5'UTR, the expected region for potential IRES elements. Bases 1895–2137 also have a higher R value (1.72), although no previous report describes an IRES structure in this region. In principle, this region could be either a potential IRES element or an RNA structure that resembles one but lacks IRES activity. The average R value of the BEV genome is 1.37 (SD = 0.12).
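One way to read these figures is to express each peak as a number of standard deviations above the genome-wide mean. The short Python sketch below does this for the EV71 and BEV values quoted above. The study itself does not report such standardized scores, so this is only an illustrative calculation, and the function name `z_score` is our own.

```python
def z_score(value, mean, sd):
    """Distance of `value` from `mean`, in units of the standard deviation `sd`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


# Peak R value, genome-wide mean R and SD, as quoted in the text.
ev71_peak = z_score(2.4, 1.43, 0.14)   # EV71 domain IV region
bev_peak = z_score(2.03, 1.37, 0.12)   # BEV positions 315-549

print(f"EV71 peak: {ev71_peak:.1f} SD above the genome mean")
print(f"BEV peak:  {bev_peak:.1f} SD above the genome mean")
```

Under this reading both peaks lie more than five standard deviations above their genome averages, which is consistent with their being described as clearly separated from the background.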

Another picornavirus, Human Rhinovirus (HRV), was run through IRSS to test its discriminative ability. The known HRV IRES structure is located in the 5' UTR at bases 1–618. The prediction yielded two regions with higher R values (1.74 and 1.69), at nucleotides 243–476 and 2928–3158, as shown in Figure 2C. No experimental data support an IRES structure in the second region, so it may be a false positive. The average R value is 1.36 and the SD is 0.11. The last test sample is HCV, whose IRES is of a different type from that of EV71. The HCV IRES is located in the 5' UTR at nucleotides 1–340. In Figure 2D, no R value is significantly higher than the average, which indicates that IRSS cannot locate an IRES structure precisely when RNA Align is used to compare IRES elements of different types, as in EV71 and HCV. For HCV, the average R value is 1.35 (SD = 0.12). In summary, across the four viruses, matched IRES structures had R values above 1.7. Positions with R values in the ambiguous range between 1.6 and 1.7 are potential candidates for IRES structures and require closer examination.

Linear discriminant analysis of R value and IRES element prediction from virus databases

The second stage had two purposes. First, the search capability of IRSS was evaluated by replacing the individual virus genome sequences with the entire UTRdb. A known IRES element, for example the HCV IRES domain III structure, was selected as the reference for RNA comparison. Second, because known IRES elements vary in length, different length parameters (L) of RNALfold needed to be tested. Three L parameters, 100, 250 and 400, were applied to examine how well IRES elements could be discriminated in UTRdb. To determine the best R cut-off value for each L parameter, the HCV and Pestivirus 5' UTRs were designated the positive group and all other records the negative group. R values were calculated for these records, and their distributions were estimated by linear discriminant analysis (see Figure 3D, E, F and Additional file 1). The normalized R values show that the cut-off value separates the records into two groups.
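To make the cut-off idea concrete, the sketch below applies two-class linear discriminant analysis to a single variable, which reduces to a threshold on R. It assumes both groups share one pooled variance; with equal priors the threshold is the midpoint of the two group means, and with priors taken from the group sizes it shifts towards the smaller group. The function name `lda_cutoff` and the toy R values are hypothetical, and this is not the DIST.R code used by IRSS.

```python
from math import log
from statistics import mean


def lda_cutoff(positive, negative, use_size_priors=False):
    """Return the 1-D LDA decision threshold between two groups of R values."""
    n_pos, n_neg = len(positive), len(negative)
    if n_pos < 1 or n_neg < 1 or n_pos + n_neg < 3:
        raise ValueError("need both groups and at least three values in total")
    m_pos, m_neg = mean(positive), mean(negative)
    if m_pos == m_neg:
        raise ValueError("group means are identical; no threshold exists")
    # Pooled within-group variance (the shared-variance assumption of LDA).
    squares = sum((x - m_pos) ** 2 for x in positive)
    squares += sum((x - m_neg) ** 2 for x in negative)
    pooled_var = squares / (n_pos + n_neg - 2)
    cutoff = (m_pos + m_neg) / 2
    if use_size_priors:
        # Unequal priors move the boundary away from the larger group's mean.
        cutoff += pooled_var * log(n_neg / n_pos) / (m_pos - m_neg)
    return cutoff


# Toy R values, for illustration only (not data from this study).
known_ires = [1.8, 1.9, 2.0]
other_utrs = [1.3, 1.4, 1.5, 1.3, 1.4, 1.5]
print(round(lda_cutoff(known_ires, other_utrs), 2))
print(round(lda_cutoff(known_ires, other_utrs, use_size_priors=True), 2))
```

Because the negative group in UTRdb is far larger than the positive group, using size-based priors would push the threshold upwards; which choice was made in the original analysis is not stated, so both variants are shown.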

Figure 3

IRSS search results for the entire UTRdb with different length parameters. Three length parameters were applied in this study: L = 100 (A and D), 250 (B and E) and 400 (C and F). To decide the best R cut-off value, linear discriminant analysis was applied for each length parameter, as shown in (D), (E) and (F). The best R values calculated are 1.54, 1.59 and 1.55 for L = 100, 250 and 400, respectively. The curves for positive and negative IRES elements represent the R value distributions. The original alignment scores of every alignment window for all UTRdb data are shown in (A), (B) and (C).

For L = 100, the R values of all viruses in UTRdb are plotted in Figure 3A. The distributions of discriminant scores for the two groups place the appropriate cut-off at R = 1.54 (see Figure 3A and Figure 3D). With this cut-off, the negative group contains 266,192 records (square dot symbols, Figure 3A), and the positive group (circle symbols, potential IRES elements) contains 17 records belonging to HCV or Pestivirus (circle symbols located at R > 1.73); each record in the positive group is either a possible IRES structure or a false positive. The predicted IRES structures scored between R = 1.54 and 1.75, and the matched length was around 205 to 210 nucleotides. Among the records with R values above 1.54, the top fifteen scores were all HCV; in addition, related viruses of the Flaviviridae accounted for sixteen records scoring higher than 1.54 (data not shown).

For L = 250, R = 1.59 was determined as the cut-off. Circle symbols are positive controls, the HCV IRES elements with R values greater than 1.73. The negative group (square dot symbols, Figure 3B) has 247,255 records and the positive group (circle symbols, Figure 3B) has 129 records. Figure 3E shows the two groups in the linear discriminant analysis and confirms that most predicted elements are distributed below R = 1.59 (see Figure 3B and Figure 3E). In addition, some predicted IRES elements between R = 1.59 and R = 2.00 are separated from the main band, at matched lengths of 240–280 (X axis). The largest matched length is 328 nucleotides. Some of the results matched known IRES structures, including those of HCV, Bovine viral diarrhea virus (BVDV), Pestivirus, Classical swine fever virus (CSFV), Border disease virus (BDV) and Hepatitis GB virus. These are all positive-strand ssRNA viruses of the family Flaviviridae. Other viruses, such as Picornavirus, Enterovirus, Coronavirus, Sarcoma virus and HIV, also had R values higher than 1.73 (see Table 1) and, according to our search system, might have potential IRES structures in the 5'UTR.

Table 1 Top records of IRSS predicted potential IRES elements from UTR database in different L* parameters (without Flaviviridae)

For L = 400, the discriminant R value is 1.55, and the frequency distribution of the R values of the two groups is shown in Figure 3C and Figure 3F. The negative group contains 235,554 records (square dot symbols, Figure 3C) and the positive group 3,862 records (circle symbols, Figure 3C). The largest matched length is 452 nucleotides. In Figure 3C, the part of the positive group located between R = 1.55 and 2.00 at alignment lengths between 250 and 290 contains 69 records of HCV and Pestivirus. The higher L value seems to filter out many candidate IRES structures: besides Flaviviridae, predicted IRESs from only five other viruses were found in the top ten records (see Table 1). Two of them, Citrus tristeza virus and Human immunodeficiency virus 1 (HIV1), were also found at L = 250 but at different predicted positions.

Comparing the positive groups at L = 100, 250 and 400 might reveal false positives and wrongly predicted IRES structures. The distributions of the two groups at the different L values match our goal of predicting the "HCV IRES element", but the search results clearly differ between L values (see Table 1, Figure 3A, B and C). These results support the ability of our search design to predict IRES structures.

Accuracy of IRSS

To evaluate the accuracy of the IRES prediction system, two known IRES elements, HCV IRES domain III and the Pestivirus IRES, and the entire UTR database were analyzed in IRSS. BLAST and RNA comparison results (data not shown) indicate that the primary and secondary structures of the Pestivirus IRES are similar to those of HCV IRES domain III, which might be attributed to their common membership of the family Flaviviridae. For the RNA Align software, both the Pestivirus IRES and the HCV IRES were selected as the standard IRESs for IRSS. UTR database version 19 contains 39 sequences of HCV 5'UTR and 113 sequences of Pestivirus IRES, which were treated as known IRES elements to examine the accuracy of IRSS. From Figure 3D, E and F, the discriminant R values are 1.54, 1.59 and 1.55 for L = 100, 250 and 400, respectively. Sensitivity was calculated for each L value, and the best sensitivity was found at L = 250. At L = 250, the sensitivity with the HCV IRES standard was 66.7%, whereas the accuracy of Pestivirus IRES prediction reached 72.3%.
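Sensitivity here can be read as the fraction of known IRES records whose R value exceeds the cut-off for a given L. The following Python sketch shows that calculation under this reading. The function name `sensitivity` and the R values are hypothetical and serve only to illustrate the arithmetic; they are not the scores of the 39 HCV or 113 Pestivirus records.

```python
def sensitivity(known_r_values, cutoff):
    """Fraction of known IRES records scoring above `cutoff` (true positive rate)."""
    if not known_r_values:
        raise ValueError("at least one known IRES record is required")
    detected = sum(1 for r in known_r_values if r > cutoff)
    return detected / len(known_r_values)


# Toy R values for six known IRES records, evaluated at the L = 250 cut-off.
toy_known = [1.75, 1.62, 1.58, 1.80, 1.49, 1.66]
print(f"{sensitivity(toy_known, cutoff=1.59):.1%}")
```

Records that are known IRES elements but score at or below the cut-off are the false negatives that lower this figure.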

Web-based tools

The IRSS tool is available as a web-based online search at http://140.135.61.9/ires/. The original RNA prediction software and Perl script programs, such as RNALfold, RNA Align and UTR2SQ.pl, have been converted into a web service and are executed automatically. Figure 4 shows the input window and an example of the output. Input sequences can be in FASTA or plain text format, and results are returned as plain text that can be read by any word processing software. The default L parameter of the web-based IRSS is 250 and the default R value is 1.4. The IRSS web tool runs on a Linux workstation with the Fedora 6.0 operating system.
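Because the service accepts either FASTA or plain text, a client preparing input may want to normalize both forms before submission. The sketch below is one possible way to do so in Python. How the web tool itself parses input is not described, so the function `read_sequences`, its handling of whitespace and the "unnamed" label are all assumptions.

```python
def read_sequences(text):
    """Parse FASTA or plain-text input into a list of (name, sequence) pairs.

    Lines starting with '>' open a new record; any other non-empty line is
    sequence data. Plain text without a header becomes one "unnamed" record.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None or chunks:
                records.append((name or "unnamed", "".join(chunks)))
            name, chunks = line[1:].strip() or "unnamed", []
        else:
            # Drop internal whitespace and normalize case.
            chunks.append("".join(line.split()).upper())
    if name is not None or chunks:
        records.append((name or "unnamed", "".join(chunks)))
    return records


print(read_sequences(">seq1\nACGU\nacgu\n>seq2\nGGCC"))
print(read_sequences("acg u\nGGA"))
```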

Figure 4

Web-based IRES search system. (A) Input window, which accepts sequences in FASTA or plain text format. (B) Example of output data. All results with R scores higher than 1.5 after IRSS prediction can be listed. These data include potential IRES sequences, potential secondary structures and predicted minimum free energies.

Discussion

IRESs have been applied as biotechnological tools, particularly for gene expression. Functional and mutational studies have also been carried out on different IRES structures [30, 31]. Can a scientist predict potential IRES elements before doing the experiment? Many programs predict RNA secondary structures, but no available software predicts IRES elements. Recently, a database of experimentally verified IRESs was built at http://www.iresite.org [32]. This database collects the full-length sequences of all mRNAs manifesting IRES activity. Similar work on collecting experimentally verified IRES data has been done in the UTR database, which was also used in our study. To set up the IRES search tool, we modified two programs and combined them into a search workflow. RNALfold is based on the minimum free energy method; longer sequences therefore reduce its accuracy. This is the main reason why L = 250 performs better than L = 400. Minimum free energy prediction has been adopted by most RNA secondary structure prediction software; unfortunately, its sensitivity is only about 72% [20]. This explains why the sensitivity of our IRES search system does not exceed 72.3%. To overcome this problem, separate predictions for the different IRES structure types might be the better option. However, more false positives and longer CPU running times are the disadvantages, so more information is required to rule out false positives. The second program, RNA Align, can compare the primary and secondary structures of RNAs for precise, specific prediction of conserved structures such as hairpin loops, bulge loops and interior loops. On the other hand, RNA Align cannot be made faster unless it is replaced by another program or its source code is modified.

Based on the results obtained from IRSS, HCV and Pestivirus are the major members of the positive group at the different L parameters. However, the positive group may contain other viruses with potential IRES elements. For example, Simian picornavirus has a high R value (1.85) at nucleotides 430–655. Comparison of the two predicted RNA secondary structures shows that Simian picornavirus domains 2, 4 and 6 (see Figure 5B) are similar to HCV domains IIIf, IIIa and IIId (see Figure 5A), respectively. This IRES element has been demonstrated by Chard [33]. The similarity between the two whole structures is up to 41.7%, which results in the high R value region found in Simian picornavirus by IRSS. Another positive-strand ssRNA virus, Porcine enterovirus 8, was found to have a potential IRES element at nucleotide 190 with L = 250 (see Figure 5C). Its secondary structure also contains domains similar to HCV domains IIIa, IIIb, IIId and IIIf, as recently demonstrated by Pisarev [15]. The known IRES elements of the Simian picornavirus and Porcine enterovirus 8 genomes thus match our IRSS search results. Moreover, IRSS can distinguish known IRES elements in UTRdb (see Figure 3D, E and F). Our results demonstrate that IRSS not only predicts RNA secondary structures but also locates IRES elements. However, at L = 100 and 400, no Porcine enterovirus record showed a potential IRES structure, because the predicted structures vary with the L parameter, whereas two viruses of the Retroviridae, Rous Sarcoma virus and HIV, fell into the positive group in our calculation. Rous Sarcoma virus occupies the four top scores at L = 100, and one of these records, D10652, also appears at L = 250 (see Figure 5D). The R value of nucleotides 1060–1272 of D10652 is 1.75 (cut-off R = 1.59, L = 250), but this sequence lies inside the gag polyprotein coding region. This record therefore appears to be a false positive. Other Rous Sarcoma virus records (X51863 and M21526) were found to have an IRES element in the 5'UTR. This element forms a few stem loops similar to HCV domains IIIa, IIIb, IIId and IIIe (data not shown), but its conformation is smaller than those of HCV and Simian picornavirus. More experimental evidence is needed to confirm this IRES element.

Figure 5

Predicted IRES structures of Simian picornavirus, PEV-8 and RSV. (A) HCV Domain III. (B) Predicted IRES structure of Simian picornavirus in nucleotide position 433–655 (L = 250). (C) Predicted IRES structure of PEV-8 in nucleotide position 191–440 (L = 250). (D) Predicted IRES structure of RSV in nucleotide position 1060–1273 (L = 250).

Pattern searching programs and web services have been developed, such as Rfam from the Sanger Institute [34]. Rfam is a collection of multiple RNA sequence alignments that uses covariance models to represent the consensus primary sequences of non-coding RNA families. Rfam provides information not only on IRESs but also on other RNAs. In contrast, IRSS is specific to the study of IRESs. IRSS searches for IRES elements by structure comparison, which includes neighboring regions in the structure prediction and avoids the problem of short consensus primary sequences, thereby improving the prediction of IRES structures. However, IRSS requires verified IRES structures to improve the accuracy of the RNA Align program, and its approach differs from the sequence alignment used by Rfam.

In the initial test of our IRES search system, the IRES element was found successfully in Enterovirus 71 and related viruses, but the search failed when applied to other virus families. Species specificity is indeed an important factor in this test. In the second test, longer RNA sequences might make prediction difficult for the RNALfold program, which resulted in fewer positive results at L = 400. When the L parameter was 100, the shorter predicted length tended to pick out sub-structures, so that many predicted IRES elements clustered in the same area and more false positives appeared. Predicted sub-domains of an IRES element might match one of the individual HCV domains, so that the ability to capture the whole IRES element is lost. After evaluating all length parameters, we found that the intermediate size (L = 250) can cover the whole IRES structure while avoiding the disadvantages of the minimum free energy method. In addition, to improve the sensitivity of IRSS, we are preparing a newly designed program that will allow us to model interactions between the 40S rRNA and IRES domains in 3D. Furthermore, we plan to provide our IRES search system with a web-based interface that will help to define IRES elements. Finally, we believe that IRSS will provide a useful resource for locating IRESs before experimental study. IRSS can be a public resource: it can serve the scientific community not only as an analysis tool but also as a means of communication through user feedback.

Conclusion

We report a new IRES search system (IRSS), a search workflow that facilitates the prediction and analysis of IRES elements. The dicistronic test remains the gold standard for verifying IRES elements, despite being experimentally inefficient, suffering from serious translation background problems and lacking appropriate prediction software to guide it. In addition, there are many RNA structure prediction models, but none is suited to predicting IRES elements. To address this, IRSS combines a "minimum free energy structure prediction" program with a "comparative sequence analysis" program. The accuracy of IRSS is sufficient to serve as a starting point and to provide bioinformatic evidence for IRES experiments and applications. Finally, IRSS has not only been developed as a useful system for predicting IRES elements but has also been made available as a web-based service for public use.

Methods

Methodology of IRES element prediction

Two key steps form the backbone of the IRES element search system (IRSS): 1) RNA folding and 2) comparison. In the first step, the RNALfold program [35] predicts the RNA secondary structure by the minimum free energy method. In the second step, the RNA Align program [36] compares the predicted secondary structures with known IRES structures. In IRSS, primary RNA sequences are input into the search workflow (see Figure 1) with an individual length parameter (L), and the raw RNA sequences are converted into the RNALfold input format by the "UTR2SQ.pl" [see Additional file 2] and "utr_dp.pl" [see Additional file 3] programs. utr_dp.pl is the main control batch program that links the stages of IRSS. The output of RNALfold is then transformed into RNA Align format by the B2RA.pl program [see Additional file 4]. The results of RNA Align are written to two files: an aligned structure file and an alignment score file. Two statistical programs, DIST.R and sort.R [see Additional file 5 and Additional file 6], analyze the alignment scores and calculate the score distribution [37]. For viewing the RNA, B2CT.pl [see Additional file 7] converts the predicted RNA secondary structure into "connect file format" (*.ct), which is read by RnaViz [38] for display on screen and printing. The essential prerequisite of this analysis is calculating the stability of secondary structures by folding sub-sequences of length L.
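The conversion to connect file format mentioned above can be sketched in a few lines of Python. The code below turns a dot-bracket structure into the commonly used six-column CT layout (index, base, previous index, next index, pairing partner, index). It is only an illustration of the format: the source of B2CT.pl is not reproduced here, and the function names `pair_table` and `to_ct`, as well as the tab-separated layout and default title, are our assumptions.

```python
def pair_table(structure):
    """Return a list giving the 1-based partner of each position (0 if unpaired)."""
    partner = [0] * len(structure)
    stack = []
    for i, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partner[i - 1] = j
            partner[j - 1] = i
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partner


def to_ct(sequence, structure, title="sequence"):
    """Format a sequence and its dot-bracket structure as CT-style text."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    n = len(sequence)
    lines = [f"{n}\t{title}"]
    for i, (base, mate) in enumerate(zip(sequence, pair_table(structure)), start=1):
        following = i + 1 if i < n else 0  # 0 marks the 3' end
        lines.append(f"{i}\t{base}\t{i - 1}\t{following}\t{mate}\t{i}")
    return "\n".join(lines)


print(to_ct("GGAACC", "((..))", title="toy hairpin"))
```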

The fragment length (L) sets the size of the window that slides along the target sequence. This length is a variable factor in IRES prediction. In this study, the L parameter of the RNALfold program was set to 400, 250 or 100. With each of the three length parameters, IRSS can predict all possible RNA secondary structures. The algorithms of the RNA Align software consider all RNA structure operations, which can be cataloged as base-match, base-mismatch, base-deletion, arc-match, arc-breaking, arc-altering and arc-removing [36]. Each RNA alignment measures the similarity between two RNA structure sequences as the 'edit distance', which is the cost of transforming (editing) one sequence into the other. Nucleotide insertion, deletion and substitution are the three types of transformation in the edit distance. The score of an alignment between RNA structures depends on the sum of the costs computed by the 'edit distance' [36].
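The two ideas in this paragraph, a window of length L sliding along the sequence and an edit distance built from insertions, deletions and substitutions, can be sketched as follows. This is the plain string version with unit costs. RNA Align additionally scores arc operations on base pairs with its own cost table, which is not modeled here, and the function names `windows` and `edit_distance` are hypothetical.

```python
def windows(sequence, length, step=1):
    """Yield (1-based start, fragment) for a window of `length` sliding by `step`."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    last_start = max(len(sequence) - length, 0)
    for start in range(0, last_start + 1, step):
        yield start + 1, sequence[start:start + length]


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if x == y else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


for start, fragment in windows("ACGUACGU", length=4, step=2):
    print(start, fragment, edit_distance(fragment, "ACGU"))
```

A lower distance means a closer match, which is why the distance score is later combined with the match length rather than used on its own.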

Known IRES elements were selected as the standard structures for IRSS. Known RNA structures, including IRESs, in the "RNA family database of alignments and common motifs" (Rfam, http://www.sanger.ac.uk/Software/Rfam/) [34, 39] qualify for use in our system. The Rfam database contains twelve IRES models built upon consensus sequences. These models are based on similar consensus secondary structures predicted by the PFOLD program [40]. Moreover, these IRES consensus secondary structures are the major templates for the RNA alignment software. In IRSS, if the IRES elements predicted by the RNALfold program neither match the IRES models of Rfam nor retrieve at least two homologous IRESs from related species, the input data are discarded.

Practice of the IRES element search system

Four different whole virus sequences, EV71, BEV, HRV and HCV, were tested in the IRES search system. All coding sequences were downloaded from GenBank (http://www.ncbi.nlm.nih.gov) under accession numbers U22521, NC_001859, NC_001617 and NC_004102 [26–28]. The purpose of this test was to look for EV71 IRES domain IV (nucleotides 240–444) in these virus sequences using our IRES search system. Furthermore, to assess the precision of IRSS, the entire virus 5' UTR database (UTRdb, http://www.ba.itb.cnr.it/UTR/) was input into IRSS with HCV domain III (accession number: AF177037) as the target [41–43]. Domain III of the HCV IRES positions the initiation codon in the mRNA-binding cleft of the ribosome by binding the 40S subunit [44].

In the RNA Align software, two factors are considered when evaluating the IRES elements predicted by IRSS: the distance score (DIST) and the alignment match length (ALEN). DIST is the score of the secondary structure comparison, computed from the default scores that RNA Align assigns to each RNA structure operation (base-deletion, base-mismatch, arc-mismatch, arc-removing, arc-altering and arc-breaking). Because DIST increases with alignment length, the DIST score alone cannot indicate the significance of matched structures across shorter and longer alignments. DIST and ALEN are therefore combined into a ratio defined as R = ALEN/DIST. R values are collected for all predicted IRES elements, including known IRESs and potential candidate IRES elements. Linear discriminant analysis (LDA) is applied to all R values to find a discriminant line that separates the candidate IRES group from the non-IRES group. The error rate of the IRES search system is estimated by comparing known IRES structures with candidate IRES elements.
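The ratio defined above is simple to compute once ALEN and DIST are available for each aligned window. The Python sketch below calculates R and keeps the windows above a chosen cut-off, ranked from highest to lowest. The record layout (name, ALEN, DIST), the function names and the example numbers are hypothetical; only the formula R = ALEN/DIST and the cut-off of 1.59 for L = 250 come from the text.

```python
def r_value(alen, dist):
    """Return R = ALEN / DIST for one alignment window."""
    if dist <= 0:
        raise ValueError("DIST must be positive to form the ratio")
    return alen / dist


def candidate_windows(records, cutoff):
    """Return (name, R) pairs with R above `cutoff`, sorted by decreasing R.

    `records` is an iterable of (name, alen, dist) tuples.
    """
    scored = [(name, r_value(alen, dist)) for name, alen, dist in records]
    kept = [item for item in scored if item[1] > cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy alignment results, for illustration only.
toy_records = [
    ("window_a", 200, 100.0),
    ("window_b", 150, 100.0),
    ("window_c", 240, 150.0),
]
for name, r in candidate_windows(toy_records, cutoff=1.59):
    print(name, round(r, 2))
```

Dividing by DIST rewards long matches that needed few edits, so two windows with the same distance score are ranked by how much structure they actually align.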