Catalysis by RNA was discovered a quarter of a century ago. The discoveries that certain introns were capable of self-splicing [1] and that the RNA moiety of bacterial ribonuclease P (RNase P) on its own could process precursor tRNAs [2] were the first indications that catalytic remnants of a postulated RNA world had persisted until the present day. By the late 1980s, the catalytic scope of RNA had been extended by the discovery of the so-called small nucleolytic ribozymes (or RNA-based enzymes). This family consists of four members: the hammerhead [3], the hairpin [4, 5], the hepatitis delta virus (HDV) [6, 7] and the Neurospora crassa Varkud satellite (VS) [8, 9] ribozymes. All the small nucleolytic ribozymes are involved in the processing of RNA replication intermediates and catalyze a simple RNA cleavage or ligation reaction.

Most present-day ribozymes have as their substrates the conventional 3',5'-phosphodiester bonds in RNA [10]. In arguably the simplest such reaction, the RNA moiety of RNase P catalyzes the hydrolysis of precursor tRNAs (Figure 1a). More frequently, however, ribozymes catalyze a transesterification reaction, as do the small nucleolytic ribozymes (Figure 1b) and the self-splicing introns (Figure 1c,d). The small nucleolytic ribozymes catalyze the one-step cleavage of a 3',5'-phosphodiester bond, with the formation of a 2',3'-cyclic phosphate and a 5'-hydroxyl in the cleavage products (Figure 1b). Despite having the same reaction mechanism, the small nucleolytic ribozymes differ dramatically from each other in their architecture and exhibit significant variation in the pH profiles of their catalytic activity and in the metal ions required for catalysis [11]. It seems likely that this reaction mechanism is best suited to a simple, single RNA cleavage, as in the processing of multimeric replication intermediates into monomers. Other RNA-cleaving entities that use this mechanism are the in vitro selected leadzyme [12], the protein RNase A [13], and the recently discovered catalytic riboswitch glmS [14], an RNA element that controls gene expression via its ribozyme activity.

Figure 1

Biochemical reactions naturally catalyzed by RNA. (a) Precursor tRNA hydrolysis by bacterial RNase P yields a phosphate-containing 5' end of the mature tRNA and a 3'-hydroxyl group at the 5' cleavage product. (b-d) Transesterification reactions catalyzed by (b) the small nucleolytic ribozymes, (c) group I introns, and (d) group II introns, in which different chemical groups serve as the attacking nucleophile. In the small nucleolytic ribozymes (b), a defined 2'-hydroxyl attacks the neighboring 3',5'-phosphodiester bond, resulting in a 2',3'-cyclic phosphate and a 5'-hydroxyl in the respective cleavage products. In the first step of group I intron splicing (c), the 3'-hydroxyl of the exogenous guanosine (G) cofactor attacks the 5'-exon-intron junction and sets the 5' exon free, which leads to the covalent attachment of the cofactor to the 5' end of the intron. In a second transesterification reaction, the 5' exon forms a conventional 3',5' bond with the 3' exon, releasing the linear intron with the additional guanosine [1]. In group II introns (d), the conserved branch-point adenosine (A) serves as the nucleophile, leading to the formation of a lariat intron. (e) Peptide-bond formation catalyzed by the ribosome.

In contrast to this simple reaction, self-splicing of the group I and group II introns involves two consecutive reaction steps (Figure 1c,d). The first frees the 3'-OH of the 5' exon, which allows, in the second step, an attack on the phosphodiester bond at the junction between the last residue of the intron and the first residue of the 3' exon. Self-splicing group I introns make use of the 3'-hydroxyl of an exogenous guanosine as the initial attacking nucleophile; the guanosine is phosphorylated in the reaction and released (Figure 1c). In the self-splicing group I introns, the formation of an intermediate with a 2',3'-cyclic phosphodiester bond has not been observed, probably because that might entail a loss of structural integrity in the spliced exons by the formation of 2',5'-phosphodiester connectivity in the second reaction step [15]. A similar two-step strategy is adopted by the self-splicing group II introns [16, 17], but in this case the attacking nucleophile is the 2'-hydroxyl of the conserved intronic branchpoint adenosine (Figure 1d). While this forms an RNA lariat in the intron, the structural integrity of the connected exons is ensured. It should be noted that the splicing of tRNA introns in the Eukarya and the Archaea does not result from self-splicing as in the Bacteria, but starts with the action of an endonuclease, a protein enzyme, which leaves 2',3'-cyclic phosphate termini [18-20].

The persistence of the RNA world has been splendidly confirmed by the demonstration that the ribosome is a ribozyme - that is, the ribosomal RNA components are the catalytically active elements in polypeptide synthesis [21] - placing ribozyme activity at the heart of modern cells and showing that ribozymes could catalyze reactions other than the cleavage and ligation of RNA (Figure 1e). The first indications of catalytic RNA in the ribosome came from biochemical data [22] that showed persistence of ribosome catalytic activity after digestion and denaturation of the ribosomal proteins. The final proof that rRNA is the catalyst in protein biosynthesis came from crystallographic work that showed that the peptidyltransferase reaction center of the ribosome is devoid of any protein component, and is made up exclusively of rRNA residues [21].

In the past few years, a number of new catalytic RNA molecules have been discovered, including a catalytic riboswitch, and known elements have been detected at new genomic locations. Table 1 lists the currently known naturally occurring catalytic RNAs. Do we now know the full spectrum of the diversity and versatility of catalytic RNAs, or are there yet more to be discovered? In this article we will focus on the approaches used to identify novel catalytic RNA species and on the accompanying experimental and bioinformatic difficulties. To solve some of these problems, new bioinformatic tools that better integrate our current understanding of RNA architecture, molecular biology and evolution will have to be developed.

Table 1 The natural occurrence of ribozymes and riboswitches

The discovery of riboswitches and new catalytic RNAs

Riboswitches are bimodular RNAs that are made up of a ligand-binding region (an aptamer) and a domain that controls gene expression. They are usually located in the 5' untranslated regions of bacterial mRNAs, where they control the expression of the gene by binding a low molecular weight metabolite that triggers a conformational change in the RNA [23-26]. In recent years, many of these genetic control elements have been discovered, and it has become clear that they are structurally and functionally highly diverse [27, 28]. Riboswitches control gene expression at both the transcriptional and translational levels, and can act as 'on' or 'off' switches. The majority of riboswitches are negative control elements, and among these, the first catalytic riboswitch discovered - glmS [14] - employs the ultimate method of switching off gene expression: when it binds its cognate ligand it cleaves itself, thus destroying the function of the mRNA of which it is a part.

The biological function of other recently discovered catalytic RNAs is less clear. Using an ingenious in vitro selection scheme, Szostak and co-workers [29] recently discovered an HDV-ribozyme-like element in an intron of a human mRNA and have demonstrated its biochemical activity. In this scheme, a library of uniformly sized, small circular DNAs was used as templates for rolling-circle transcription; self-cleaving RNAs can thus be identified by the appearance of unit-length RNA fragments. Cedergren and co-workers identified and biochemically characterized hammerhead ribozymes in the genomes of schistosomes [30] and cave crickets [31], and, using database searching, our group recently identified novel examples of hammerhead ribozymes [32] and found two hammerhead sequences encoded at distinct loci in the genome of Arabidopsis thaliana that we have characterized as catalytically active in vitro and in vivo [33].

Ribozyme topology versus sequence conservation

To carry out RNA-based chemical catalysis, some parts of the ribozyme molecule must adopt very precise relative positions and orientations. In addition to specific recognition, there must be dynamic mechanisms for substrate binding and product release. With the notable exception of the ribosome, present-day ribozymes act on the phosphodiester backbone linking two consecutive nucleotides. Although the catalytic processes of such reactions are basically similar, they can be achieved in diverse ways and, in addition, as chemical convergence is pervasive, ribozymes display a rich repertoire of architectures that position the reactants appropriately. Furthermore, the number of conserved nucleotides and their dispersion throughout the molecule vary considerably from one ribozyme to the other: for example, the hammerhead ribozyme and the group I introns have about the same number of conserved residues - around seven - although the latter can be up to four times as large [34, 35]. The positions and relative dispositions of the conserved structural elements with respect to the beginning and end of the ribozyme motif also vary (Figure 2). Most families of ribozymes can be subdivided into classes distinguished by their highly non-homologous peripheral elements [36-38]. However, the three-dimensional architectures of the ribozyme cores belonging to the same family are expected to be similar because they are maintained by tertiary constraints which, despite the conservation of short sequence segments, can form in diverse ways.

Figure 2

The hammerhead ribozymes are based on a three-way junction and there are two main types. (a) Type I has the ends of the single-stranded RNA on stem I; (b) type III has the ends of the single-stranded RNA on stem III. For unknown reasons, potential type II ribozymes (ends of the single-stranded RNA on stem II) have never been observed. The three-dimensional architecture is maintained by coaxial stacking of stems II and III, which, through constraints in the conserved three-way junction residues [92], orients stem I so that loop-loop interactions between stems I and II form (Figure 3) [40,42]. The internal loop of stem II (IL2) is often replaced by a capping loop (CL2); similarly, CL1 in type III can be replaced by an internal loop (IL1) followed by another hairpin. Although only one structure has been fully characterized, sequence alignments show that the loop-loop interactions (mainly constituting non-Watson-Crick pairs) are very diverse.

The hammerhead ribozyme well illustrates the difficulties of identifying new ribozymes either experimentally or by in silico approaches. Indeed, an incomplete catalytic RNA fold, which did not include tertiary contacts between elements away from the catalytic site of cleavage and sequence conservation, was accepted for a long time, until the full hammerhead ribozyme was (re)discovered [39-41]. A recent crystal structure [42] shows how the presence of tertiary contacts between loops far removed from the catalytically conserved region induces conformational changes in the core that promote the active state of the ribozyme. Importantly, all those contacts involve networks of non-Watson-Crick base pairing with patterns of evolution unlike those of Watson-Crick base pairs [40, 43]. Fully biologically active hammerhead ribozymes possess structural complexity and strict sequence requirements (Figure 3b), but because of the non-Watson-Crick pairings, this is not immediately apparent from the sequence alone. In contrast, because of its convoluted pseudoknotted topology based on Watson-Crick pairs, the HDV ribozyme reveals most of its complexity immediately (Figure 3a). Incomplete hammerhead ribozymes without peripheral elements and with low sequence and structural complexity display reduced catalytic activities. Indeed, in vitro evolution, starting from random libraries, produced structurally diverse ribozymes with low activity, which contained some hammerhead variants [44]. Another experiment selecting in vitro for self-cleaving motifs with hammerhead-type biochemical activity [45] led to the conclusion that the hammerhead motif makes the most common ribozyme fold and suggested that this motif has had multiple independent origins. The long-range interactions were not considered in those two in vitro selection schemes, as their importance had not been recognized at the time. The sequences collected during the second selection scheme would enable optimization of hammerhead ribozyme activity.

Figure 3

Schematic diagrams of the interaction networks maintaining the three-dimensional architecture of two different ribozymes. (a) The HDV ribozyme [7,93]; (b) the active hammerhead ribozyme [42]. The HDV ribozyme has a convoluted pseudoknotted topology: the color lines indicate the path of the sugar-phosphate backbone. The nomenclature is as follows [75]. Each nucleotide has three edges with hydrogen bonding possibilities: the Watson-Crick edge (denoted by a circle), the Hoogsteen edge (denoted by a square) and the sugar edge (denoted by a triangle). A pairwise base-base interaction can be formed either with the attached sugar moieties on the same side of the line of approach (cis-configuration, the symbols are closed) or with the sugars on either sides of the line of approach (the trans-configuration, the symbols are open). To avoid ambiguities, when annotating tertiary contacts, the nucleotides that are involved have been boxed. When the base of a nucleotide is in the syn-conformation with respect to the sugar it is marked in bold. The rectangles indicate the position actually occupied in space by a nucleotide. In (b), the cleavage occurs 3' of the red C.

Ribozyme topology versus sequence variability

A ribozyme with a new branching activity, GIR1, has recently been experimentally identified in slime molds [46]. On the basis of its secondary structure, this ribozyme belongs to the group I intron family. It carries out the first cleavage step of a group II intron, however, leading to the formation of a small lariat with a 2',5'-linkage at the 5' end of the endonuclease mRNA of which it forms a part, thereby protecting the message from exonuclease degradation. Thus, in this case, a similar secondary structure scaffold is the basis for two ribozymes catalyzing different chemical reactions: activation of an internal O2'-hydroxyl group in the case of the new ribozyme compared with activation of an O3'-hydroxyl group of an external cofactor for the rest of the group I intron family (Figure 1c). This is yet another example of the fact that similar RNA sequences can assume two different folds and catalyze two different chemical reactions, as shown by Schultes and Bartel [47]. Minor variations could convert a starting sequence into either of these highly active ribozymes, demonstrating that the evolving paths of RNA sequence can easily cross in sequence space. Similarly, RNA folds recognizing different ligands may be very close in sequence space [48]: for example, a small series of 'neutral' mutations (that is, mutations that have no effect on secondary structure) transformed a flavin-binding aptamer into a GMP-binding aptamer [49]. Extensive networks of neutral variation in sequence space interconnect RNA regions with similar function and structure [50, 51], as confirmed by the recent elucidation of more three-dimensional RNA structures (see [43, 52] for reviews).

It is now recognized that the most common RNA-RNA binding contact is the so-called A-minor motif [53]. This occurs between two contiguous adenines in one partner RNA and the shallow/minor groove side of two stacked Watson-Crick pairs in the other. An analysis of tertiary contacts shows that the contiguous adenines can originate from a variety of local environments (for example, bulging, apical or internal loops) and that the only molecular recognition requirement in the receptor RNA is the presence of two Watson-Crick base pairs [54, 55]. In other words, coupled to the vast shape space accessible through mutations neutral for secondary structure, there are weak but crucial sequence constraints imposed by the tertiary contacts. In RNA architectures, the additional structural constraints originate from the topology of the secondary structure (junctions of helices, number of base pairs within helices, and so on). In short, RNA sequences (and thus their structure and function) are characterized by neutrality at all levels from molecular recognition between motifs to secondary structure and three-dimensional architecture.

This complex interplay between sequence conservation and neutral evolution on the one hand, and diversity in folds despite conservation in interaction protocols on the other, is central to the theoretical and experimental difficulties in identifying key regulatory RNA sequences from genomic sequence. For example, group I introns are characterized by an invariant core onto which is grafted a variety of peripheral elements [36, 56]. Long-range contacts between those non-homologous peripheral elements are necessary for biological activity. All known group I introns contain a tertiary contact between two specific paired-segment regions (regions 5 and 9; Figure 4). However, the examples that have been crystallized (for a review, see [57]) show that in each case, the contacts are achieved through different local topologies (Figure 4), each with different sequence constraints. Interestingly, in a first attempt at modeling the lariat-forming group I-like intron, GIR1, from slime molds, it was not possible to construct the usual intramolecular contacts between regions 5 and 9 [58].

Figure 4

Different local topologies can give rise to similar tertiary contacts in group I introns. (a) The invariant core of a group I intron [36,94] is illustrated in schematic form with the paired segments indicated by P and the loop regions by L. The dashed lines indicate the contacts between the peripheral elements, which are indicated by the numbers in circles. (b) Three different group I introns illustrate distinct ways of achieving a similar tertiary contact (involving non-Watson-Crick A-minor base-base interactions between a GAAA tetraloop and two stacked pairs) connecting distant regions. In each case region 9 folds towards region 5 (as indicated by the shaded region) but, in the Twort ribozyme [95] this is via a three-way junction, in the Tetrahymena ribozyme [96], it is via a large bend (this is not the natural junction, however), and in the Azoarcus ribozyme [97], it is via a kink-turn. Each motif has a different sequence and set of structural constraints [77,92].

Are there more ribozymes that catalyze 2',5'-phosphodiester bond formation or cleavage to be discovered? Scattered evidence of the occurrence of 2',5'-bonds exists throughout the literature. A 2',5'-phosphodiester bond was observed in vitro [59] and in vivo [60] during circularization of the genome of the peach latent mosaic viroid, and the HDV ribozyme, unlike the hammerhead ribozyme, has been shown to cleave 2',5'-linkages efficiently [61].

Searching genomes for ribozymes and riboswitches

Novel catalytic RNA entities can, in principle, be looked for either by database searches using defined consensus motifs from a given ribozyme or by experimentally testing candidate RNAs for biochemical activity. Both approaches have advantages and disadvantages. Database searches require RNA sequence alignments (as produced, for example, by Rfam [62]) coupled with covariance analysis [63-67]. The quality of the sequence alignment is central to this process, however, and not many databases are as carefully hand-curated as the RNase P database [68]. In database screening, the definition of what we consider to be the consensus motif of a given catalytic RNA is crucial. Even if a catalytic RNA motif is well defined, searches are complicated by the requirement to combine a complex assembly of structural (hairpin) and sequence information, which prevents simple solutions such as purely sequence-based homology searches. Generally, the tools available adequately identify isolated hairpins [69]. Given a pattern description for a catalytic RNA motif, several programs, such as PatScan [70] or RNAMOT [71], can be used to screen the public databases. Hits from such searches require further analysis, and initially, a calculation of the secondary structure is necessary - although usually not sufficient. A secondary structure, calculated using a program such as RNAfold [72], is predictive if the required helical elements of the RNA motif under consideration will form in the hit sequence. Secondary-structure prediction programs have difficulty in accurately predicting large structures, however, and can also produce vast numbers of alternative structures when scanning whole genomes [73, 74].
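To make concrete what a combined structure-and-sequence pattern search involves, the following is a minimal sketch of a descriptor-style scan for an isolated hairpin. It is not the syntax or the algorithm of PatScan or RNAMOT; the function name `find_hairpins`, the fixed stem length and the choice of a GAAA loop are assumptions made purely for illustration, and real descriptors additionally allow mismatches, bulges and variable-length elements.

```python
# Watson-Crick and G-U wobble pairs accepted in a helix (5' base, 3' base).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(seq, stem_len=4, loop_motif="GAAA"):
    """Return (start, end) indices of stem-loop candidates in seq.

    A candidate is stem_len bases, then loop_motif, then stem_len bases
    that pair with the first arm in antiparallel orientation.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper().replace("T", "U")
    total = 2 * stem_len + len(loop_motif)
    hits = []
    for start in range(len(seq) - total + 1):
        loop_start = start + stem_len
        loop_end = loop_start + len(loop_motif)
        five_arm = seq[start:loop_start]
        loop = seq[loop_start:loop_end]
        three_arm = seq[loop_end:start + total]
        if loop != loop_motif:
            continue
        # Base i of the 5' arm pairs with base (stem_len - 1 - i) of the 3' arm.
        if all((five_arm[i], three_arm[stem_len - 1 - i]) in PAIRS
               for i in range(stem_len)):
            hits.append((start, start + total))
    return hits


if __name__ == "__main__":
    print(find_hairpins("AAGGCCGAAAGGCCAA"))
```

Even this toy scan has to combine a sequence constraint (the loop) with a structural one (the complementary arms), which is why a purely sequence-based homology search cannot replace it.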

For individual sequences found in a database search, a test of their particular biochemical activity (Figure 1) might be sufficient. However, functionally similar RNA molecules frequently exhibit numerous and highly divergent sequence insertions or deletions that interrupt the pattern of secondary-structure motifs and render the computer description of a given motif inadequate for finding sequences with similar activity. Furthermore, the use of pattern-description programs is incomplete if the complexity of the RNA structure - which goes way beyond the Watson-Crick base pairing [75] - is not taken into consideration. These issues, and whether the additional, essential tertiary interactions of a given RNA motif will form, can be addressed by a combination of comparative analysis of similar ribozymes with isostericity matrices, which give the geometrically equivalent base pairs for each particular type of base-base interaction [76]. All pairwise base-base interactions present in nucleic acids have been classified into 12 families, where each family is a 4 × 4 matrix of the bases A, G, C, and U [75]. This classification allows the deduction of all possible geometrically equivalent base pairs in a given family. The isostericity matrices have been verified for several RNA motifs using structural alignments anchored in crystal structures [77]. Thus, for assumed structurally homologous positions in an RNA motif, one can compare the resulting pairwise interactions with the known isostericity matrices to assess the validity of an RNA motif assignment in an alignment [78]. As this type of analysis is an iterative process, it is worth noting that it might also lead to refinement and extension of the pattern of the consensus motif that the search was started with. If applied to large assemblies of sequence information, as has been done for the kink-turn and C-loop RNA motifs [77], this approach allows a broader description (the comprehensiveness of which is currently unknown) and refinement of a given motif.
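The check of an alignment against an isostericity matrix can be sketched as follows. This is an illustrative outline only: the table below is deliberately incomplete and lists just the four canonical pairs of the cis Watson-Crick/Watson-Crick family, which are mutually isosteric. The names `ISOSTERIC`, `"cWW"` and `check_pair_column` are assumptions introduced for the example, and the published matrices [75, 76] would have to be used for any real analysis.

```python
# Illustrative, deliberately incomplete table: only the canonical pairs of the
# cis Watson-Crick/Watson-Crick family are listed here. The published
# isostericity matrices cover all 12 families and are not reproduced.
ISOSTERIC = {
    "cWW": {"AU", "UA", "GC", "CG"},
}


def check_pair_column(alignment, i, j, family="cWW"):
    """Split aligned sequences by compatibility with a proposed interaction.

    alignment maps sequence names to aligned sequences of equal length;
    columns i and j are the positions assumed to interact. Returns two lists
    of names: sequences whose bases at (i, j) fall in the allowed set for
    the family, and sequences whose bases do not.
    """
    allowed = ISOSTERIC[family]
    compatible, incompatible = [], []
    for name, seq in alignment.items():
        pair = seq[i] + seq[j]
        if pair in allowed:
            compatible.append(name)
        else:
            incompatible.append(name)
    return compatible, incompatible


if __name__ == "__main__":
    toy = {"s1": "GAAAC", "s2": "AAAAU", "s3": "GAAAA"}
    print(check_pair_column(toy, 0, 4))
```

Sequences that land in the incompatible list either argue against the proposed interaction at those columns or point to a misalignment, which is what makes the procedure iterative.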

The analysis of co-variation of nucleotides in sequence alignments underlies most manual or automated secondary-structure determination. However, high sequence conservation (which is usually considered a marker for conservation of function) leads to serious ambiguities and difficulties in deriving secondary structures. The catalytic riboswitch glmS is a good example: the crystal structure [79] presents a different secondary structure from that deduced from sequences. The new helices involve pairings between segments, conserved at more than 95% in sequence, and thus giving no co-variation signal. The requirement for a well-defined RNA motif in database searches is also an intrinsic limitation of this approach.
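Why highly conserved segments give no co-variation signal can be seen from a simple measure such as the mutual information between two alignment columns. The sketch below is one common way of quantifying co-variation and is not claimed to be the method used in the cited studies; the function name `mutual_information` and the toy alignments are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    alignment is a list of aligned sequences of equal length.
    """
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    joint = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    # Columns 0 and 4 change together while keeping a Watson-Crick pair.
    covarying = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
    # The same pair is present but never varies.
    conserved = ["GAAAC", "GAAAC", "GAAAC", "GAAAC"]
    print(mutual_information(covarying, 0, 4))
    print(mutual_information(conserved, 0, 4))
```

In the second alignment the pair is just as real as in the first, but because neither column varies the measure is zero, so a helix formed by invariant segments is invisible to this kind of analysis.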

While novel genomic locations of a known catalytic RNA can be identified from sequence similarities, novel activities cannot be so readily discovered. For this purpose, a recently introduced in vitro selection scheme [29] can be applied. Interestingly, from the human genome, a close variant of a known catalytic RNA motif was selected and characterized as an HDV-ribozyme-like sequence rather than a new catalytic RNA. It is intriguing that these sequences were discovered by their biochemical activity and not by in silico approaches (as an active HDV ribozyme can be made by both the 'genomic' sequence and its complementary sequence despite its intricate pseudoknotted secondary structure). Furthermore, additional, as-yet structurally uncharacterized, sequences were reported [29], and so this new selection scheme might actually have identified new self-cleaving RNA entities. The selection scheme described in [29] uses DNA minicircles that cover the genomic sequence of a given organism as the templates for rolling-circle RNA transcription (Figure 5). It can thus readily monitor RNA self-cleavage, but other activities will be missed. Thus, to assess the prevalence of a given catalytic RNA motif, the combined approach using sequence and three-dimensional structure information described above is most suitable, while novel in vitro selection schemes might be designed to discover activities other than RNA cleavage in a given organism.

Figure 5

Identification of catalytic RNA from a genomic library. (a) Preparation of the genomic library. Genomic DNA is first partially digested and fragments of approximately 150 bp (blue) are gel-purified and incubated with Taq polymerase to give them 3' A overhangs. In the next step, ligation of covalently closed oligonucleotides (yellow and purple) to the library prevents the unwanted combination of DNA fragments. After removal of DNA hairpins, a T7 promoter (magenta) is then added by PCR, yielding an amplified linear library. (b) The in vitro selection scheme. The library is further amplified by PCR using a 5'-phosphorylated reverse primer and a biotinylated forward primer that allows the isolation of the phosphorylated strand using streptavidin beads. Single strands are individually circularized by ligation with a splint oligonucleotide and the second strand is added by incubation with Taq polymerase and deoxynucleoside triphosphates. The resulting nicked double-stranded library is suitable for rolling-circle transcription by T7 polymerase [98], yielding multimeric RNA species potentially encoding sites of self-cleavage (red triangles). The RNA is then incubated for self-cleavage, and active molecules (dimers) are size-selected. The scheme is completed by preparation of the next-generation DNA library using reverse transcription-PCR (RT-PCR). Modified from [29].

As pointed out earlier, most of the reactions known to be naturally catalyzed by RNA (Figure 1) involve the breakage or formation of 3',5' (and occasionally 2',5') phosphodiester bonds. RNA has the potential to catalyze other chemical reactions, however. As well as peptide formation in the ribosome, Diels-Alder cycloaddition [80] and Michael addition [81] can be catalyzed by RNA, as shown by in vitro Darwinian evolution. Thus, reactions catalyzed by RNA in nature might be more diverse than currently known. The discovery of such activities is likely to be serendipitous and made by keen observers of RNA molecular behavior.

New small or large noncoding RNAs are regularly being discovered in both bacteria and mammals. Recent evidence shows that most of the mammalian genome is transcribed in complex patterns, producing tens of thousands of novel transcripts [82, 83]. Novel RNAs are regularly predicted on the basis of their sequence conservation or secondary-structure elements [84-87]. But these predictions do not utilize information on the non-Watson-Crick base pairing or tertiary structure so crucial to the activity of many ribozymes, and, as discussed above, these features are often not well conserved in the sequence. Nor do the predictive algorithms used give any indication of what the RNA function might be. Vertebrate genomes contain a large number of conserved noncoding elements (CNEs) or ultraconserved elements [88, 89], whose biological functions and mechanisms of action remain to be established. The evidence for transcription of most of these conserved elements is, however, still scanty [89-91]. In any case, the recent additions to the list of natural catalytic RNAs indicate that there are likely to be many more to come; new algorithms will be required that use all available information to identify and classify them.