The recent analysis of the human genome [1,2] and the data available about other higher eukaryotic genomes have revealed that only a small fraction of the genetic material - about 1.5% - codes for protein. Indeed, most genomic DNA is involved in the regulation of gene expression, which can be exerted at either the transcriptional level, controlling whether a gene is transcribed or not and to what extent, or the post-transcriptional level, controlling the fate of the transcribed RNA molecules, including their stability, the efficiency of their translation and their subcellular localization. This article will review the structure, functions and mechanisms of mRNA untranslated regions.

Transcriptional control is mediated by transcription factors, RNA polymerase and a series of cis-acting elements located in the DNA, such as promoters, enhancers, silencers and locus-control elements, organized in a modular structure. It regulates the production of pre-mRNA molecules, which undergo several steps of processing before they become functional mRNAs. Introns are removed, a 7-methyl-guanylate (m7G) cap structure is added at the 5' end of the first exon, and a stretch of 100-250 adenine residues (the poly(A) tail) is added at the 3' end of the last exon, which is itself generated by endonucleolytic cleavage of the primary transcript. Sometimes the sequence of the mRNA is also altered in a process called mRNA editing, and the resulting coding sequence of the mature RNA differs from the corresponding sequence in the genome. The resultant mature mRNA, in eukaryotes, has a tripartite structure consisting of a 5' untranslated region (5' UTR), a coding region made up of triplet codons that each encode an amino acid, and a 3' untranslated region (3' UTR). Figure 1 shows these and other features of mRNAs.

Figure 1

The generic structure of a eukaryotic mRNA, illustrating some post-transcriptional regulatory elements that affect gene expression. Abbreviations (from 5' to 3'): UTR, untranslated region; m7G, 7-methyl-guanosine cap; hairpin, hairpin-like secondary structures; uORF, upstream open reading frame; IRES, internal ribosome entry site; CPE, cytoplasmic polyadenylation element; AAUAAA, polyadenylation signal.

UTRs are known to play crucial roles in the post-transcriptional regulation of gene expression, including modulation of the transport of mRNAs out of the nucleus and of translation efficiency [3], subcellular localization [4] and stability [5]. This article focuses mainly on these three functions, but UTRs may also play other roles, such as the specific incorporation of the modified amino acid selenocysteine at UGA codons of mRNAs encoding selenoproteins in a process mediated by a conserved stem-loop structure in the 3' UTR [6]. The importance of UTRs in regulating gene expression is underlined by the finding that mutations that alter the UTR can lead to serious pathology [7].

Regulation by UTRs is mediated in several ways. Nucleotide patterns or motifs located in 5' UTRs and 3' UTRs can interact with specific RNA-binding proteins. Unlike DNA-mediated regulatory signals, however, whose activity is essentially mediated by their primary structure, the biological activity of regulatory motifs at the RNA level relies on a combination of primary and secondary structure. Interactions between sequence elements located in the UTRs and specific complementary non-coding RNAs have also been shown to play key regulatory roles [8]. Finally, there are examples of repetitive elements that are important for regulation at the RNA level. For example, CUG-binding proteins may bind to CUG repeats in the 5' UTR of specific mRNAs (such as that encoding the transcription factor C/EBPβ), affecting their translation efficiency [9].

Many RNA-binding proteins involved in the cytoplasmic post-transcriptional regulation of gene expression also participate in a wide variety of regulatory processes - such as alternative pre-mRNA splicing or 3'-end processing - within the nucleus, where they act as components of heterogeneous nuclear ribonucleoproteins (hnRNPs) [10]. This functional interconnection between post-transcriptional events in the nucleus and in the cytoplasm may explain experimental observations that the nuclear history of an mRNA can affect its cytoplasmic fate [11].

Structural features of untranslated regions

Comparison of the various completed and partial genome sequences reveals some conserved aspects of the structure of UTRs (see Table 1). The average length of 5' UTRs is roughly constant over diverse taxonomic classes and ranges between 100 and 200 nucleotides, whereas the average length of 3' UTRs is much more variable, ranging from about 200 nucleotides in plants and fungi to 800 nucleotides in humans and other vertebrates. It is striking that the length of both 5' and 3' UTRs varies considerably within a species, ranging from a dozen nucleotides to a few thousand [12]. In fact, it has been shown using a mammalian in vitro system that even a single nucleotide is a sufficient 5' UTR for the initiation of translation [13].

Table 1 Features of complete UTR sequences derived from genomic entries annotated in UTRdb [47,48,50].

The genomic region corresponding to the UTRs of an mRNA may contain introns, more frequently in the 5' than in the 3' UTR. About 30% of genes in metazoa have fully untranslated 5' exons, whereas 3' UTRs, although much longer, have a much lower intron frequency, in the range 1-11% depending on the taxon (Figure 2a). Alternative UTRs can be formed from the use of different transcription-start sites, polyadenylation sites or splice donor and/or acceptor sites. These have been shown to vary in abundance with the tissue, developmental stage or disease state and can affect the pattern of gene expression considerably [14].

Figure 2

The percentage of complete UTR sequences in the different taxonomic classes that contain (a) introns or (b) upstream AUGs, upstream ORFs or IRES elements. Hum, human; mam, other mammals; rod, rodents; av, Aves; vrt, other vertebrates; lil, Liliopsidae; vir, other plants (Viridiplantae); inv, invertebrates; fun, fungi. Data are taken from UTRdb [47].

The base composition of 5' and 3' UTR sequences also differs; the G+C content of 5' UTR sequences is greater than that of 3' UTR sequences. This difference is more marked in mRNAs from warm-blooded vertebrates, whose G+C content is about 60% for 5' UTRs and 45% for 3' UTRs [15]. There is also an interesting correlation between the G+C content of 5' or 3' UTRs and that of the third codon positions of the corresponding coding sequences, and a significant inverse correlation has been observed between the G+C content of 5' and 3' UTRs and their lengths [16]. In particular, it has emerged that genes localized in large GC-rich regions of a chromosome (heavy isochores) have shorter 5' UTRs and 3' UTRs than genes located in GC-poor isochores. A similar correlation has also been shown for the coding sequence and introns [17].
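The G+C comparison described above can be illustrated with a minimal Python sketch. The function name and the two toy sequences are hypothetical and are not taken from UTRdb; the sketch only shows how a G+C fraction might be computed for a pair of UTRs.

```python
def gc_content(seq: str) -> float:
    """Return the G+C fraction of a nucleotide sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # Count G and C residues and divide by the total length.
    return sum(base in ("G", "C") for base in seq) / len(seq)


# Hypothetical toy sequences, invented for illustration only.
toy_five_utr = "GCCGCGGCAGCCGCCAUC"
toy_three_utr = "AUUUAUUAAGCUUUAAUAAAGC"

if __name__ == "__main__":
    print(f"5' UTR G+C: {gc_content(toy_five_utr):.2f}")
    print(f"3' UTR G+C: {gc_content(toy_three_utr):.2f}")
```

Applied to every entry of a real collection, the same function would allow G+C content to be compared against UTR length.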

Finally, eukaryotic mRNAs are also known to contain several types of repeat in the untranslated regions, including short interspersed elements (SINEs) such as Alu elements, long interspersed elements (LINEs), minisatellites and microsatellites. In human mRNAs, repeats are found in about 12% of 5' UTRs and 36% of 3' UTRs. A lower repeat abundance is observed in other taxa, including other mammals.

Control of translation efficiency

Translation of mRNAs can vary in efficiency, so that the amount of protein produced is modulated. This is an important level of gene regulation; indeed, a correlation between mRNA and protein abundance is seen only for secreted proteins, whereas for intracellular proteins the differing rates of translation of different mRNAs remove this correlation [18]. Features all along the mRNA can affect translation efficiency.

Structural features of the 5' UTR have a major role in the control of mRNA translation. Messenger RNAs encoding proteins involved in developmental processes, such as growth factors, transcription factors or proto-oncogenes, all of which need to be strongly and finely regulated, often have 5' UTRs that are longer than average [19], with upstream initiation codons or open reading frames (ORFs) and stable secondary structures that hamper translation efficiency (Table 2). Other specific motifs and secondary structures in the 5' UTR can also modulate translation efficiency.

Table 2 Examples of genes with 5' UTRs longer than average and with upstream ORFs and/or repeat elements

Under normal conditions, following the transport of an mRNA from the nucleus to the cytoplasm, the eIF4F protein complex assembles at the cap. This complex consists of three subunits: eIF4E, the cap-binding protein; eIF4A, which has RNA helicase activity; and eIF4G, which interacts with various other proteins, including polyadenylate-binding protein. The ATP-dependent helicase activity of eIF4A, stimulated by the RNA-binding protein eIF4B, unwinds any secondary structure in the mRNA, thus creating a 'landing platform' for the small (40S) ribosomal subunit [20]. When the concentration of ribosomes or translation factors is limiting, the poly(A) tail can cooperate with the 5' cap to enhance translation initiation through the intervention of a polyadenylate-binding protein that can physically interact with the eIF4F complex [21].

In most eukaryotic mRNAs, it is thought that translation initiates at the first AUG codon encountered by the 40S ribosomal subunit as it moves, or scans, 3' along the mRNA from the 5' m7G cap. Sequences flanking the AUG initiation codon are not random but fit a consensus sequence; in mammals, this sequence is GCCRCCaugG, and the most conserved nucleotides are the purine (R), usually A, in position -3 with respect to the AUG start codon and the guanine in position +4. The strong preference for A at position -3 and G at position +4 is also conserved in other animals and in plants and fungi. The sequence context of the first AUG codon, in particular the part located in the untranslated region, may modulate the efficiency with which it is recognized as a translation initiation codon.
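As an illustration of the two most conserved positions of the consensus, the following Python sketch inspects position -3 (purine) and position +4 (G) around a given AUG. The function name and the labels 'strong', 'adequate' and 'weak' are assumptions introduced here for illustration; the text above does not define a scoring scheme.

```python
def start_codon_context(mrna: str, aug_index: int) -> str:
    """Classify the context of the AUG whose A is at `aug_index` (0-based).

    Only the two positions highlighted in the text are inspected:
    -3 (should be a purine, A or G) and +4 (should be G).
    The three labels are illustrative, not an established scale.
    """
    seq = mrna.upper().replace("T", "U")
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")

    # The A of AUG is position +1, so -3 is three nucleotides upstream
    # and +4 is the nucleotide immediately after the G of AUG.
    minus3 = seq[aug_index - 3] if aug_index >= 3 else ""
    plus4 = seq[aug_index + 3] if aug_index + 3 < len(seq) else ""

    purine_at_minus3 = minus3 in ("A", "G")
    g_at_plus4 = plus4 == "G"

    if purine_at_minus3 and g_at_plus4:
        return "strong"
    if purine_at_minus3 or g_at_plus4:
        return "adequate"
    return "weak"


if __name__ == "__main__":
    # The mammalian consensus itself, with R written as A.
    print(start_codon_context("GCCACCAUGG", 6))
```

A fuller treatment would weight all positions of the consensus; this sketch deliberately checks only the two positions that the text singles out.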

It is noteworthy that a large fraction of 5' UTRs contain upstream AUGs, from 15% to nearly 50% depending on the organism (Figure 2b), suggesting that the 'first AUG rule' predicted by the scanning model of ribosome start-site selection is disobeyed in a large number of cases. This implies that the 40S ribosomal subunit can sometimes bypass the most upstream AUG codon, possibly because its sequence context makes it a poor initiation codon, to initiate translation at a more distal AUG. With this mechanism, called 'leaky scanning', multiple different proteins can be obtained from the same mRNA [22]. Moreover, it has been calculated that the presence of an upstream AUG correlates with a long 5' UTR and with a 'weak' start codon context of the AUG that is usually used, whereas transcripts with an optimal start-codon context have short 5' UTRs without upstream AUGs [23], suggesting that upstream AUGs may have a role in keeping the basal translational level of a gene low.
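The kind of survey summarized above can be sketched in a few lines of Python. The function names are hypothetical, and the example sequences are invented; the sketch simply lists AUG triplets in a 5' UTR and reports the fraction of a collection that contains at least one.

```python
def upstream_augs(five_utr: str) -> list[int]:
    """Return the 0-based positions of every AUG triplet in a 5' UTR."""
    utr = five_utr.upper().replace("T", "U")
    return [i for i in range(len(utr) - 2) if utr[i:i + 3] == "AUG"]


def fraction_with_upstream_aug(utrs: list[str]) -> float:
    """Fraction of 5' UTRs in a collection that contain at least one AUG."""
    if not utrs:
        return 0.0
    return sum(bool(upstream_augs(u)) for u in utrs) / len(utrs)


if __name__ == "__main__":
    # Invented toy sequences, not real UTRs.
    toy_utrs = ["GGAUGCC", "GGCC", "CCCC", "AUG"]
    print(upstream_augs("GGAUGCCAUGA"))
    print(fraction_with_upstream_aug(toy_utrs))
```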

If an in-frame stop codon is found following the upstream AUG and before the main start codon, it creates an upstream ORF. After translation of the upstream ORF and the detachment of the large (60S) ribosomal subunit, the small ribosomal subunit has multiple alternative fates, which affect translation efficiency and mRNA stability. The 40S subunit may hold onto the mRNA, resume scanning, and reinitiate translation at a downstream AUG codon, or it may leave the mRNA, thus impairing translation of the main ORF. The ability of a ribosome to reinitiate is limited in eukaryotes by the stop codon context [24] and by the length of the upstream ORF; if the upstream ORF is longer than around 30 codons [25], the ribosome cannot reinitiate. This process is known to down-regulate translation of the mRNAs for the yeast transcription factors GCN4 and YAP1, which contain upstream ORFs [26].
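The definition of an upstream ORF given above, together with the approximate 30-codon limit on reinitiation, can be expressed as a short Python sketch. The function name, the dictionary keys and the decision to count the AUG but not the stop codon are assumptions made for illustration; stop codon context, which also affects reinitiation, is not modelled.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
MAX_REINITIATION_CODONS = 30  # approximate limit quoted in the text


def find_uorfs(five_utr: str) -> list[dict]:
    """Find upstream ORFs that start and stop within a 5' UTR.

    Each AUG is followed in frame until the first stop codon; AUGs with
    no in-frame stop inside the UTR are ignored.
    """
    utr = five_utr.upper().replace("T", "U")
    uorfs = []
    for start in range(len(utr) - 2):
        if utr[start:start + 3] != "AUG":
            continue
        # Walk codon by codon from the triplet after the AUG.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                codons = (pos - start) // 3  # includes AUG, excludes stop
                uorfs.append({
                    "start": start,
                    "stop": pos,
                    "codons": codons,
                    "may_reinitiate": codons <= MAX_REINITIATION_CODONS,
                })
                break
    return uorfs


if __name__ == "__main__":
    # Invented toy sequence: AUG GCU UAA embedded in a short leader.
    print(find_uorfs("CCAUGGCUUAAGG"))
```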

Secondary structures in 5' UTRs are also important in the regulation of translation. Experimental data suggest that moderately stable secondary structures (a change in free energy (ΔG) above -30 kcal/mol) directly involving the AUG start codon do not stall the migration of the 40S ribosomal subunit; a significant decrease in the efficiency of translation is observed only when very stable structures (ΔG below -50 kcal/mol) are formed. UTR sequences with such very stable secondary structures are reported in Table 3. The inhibitory effects of these structures can be overcome by an increase in the level of eIF4A, the subunit of the eIF4F complex that promotes the unwinding of RNA secondary structures in cooperation with eIF4B and eIF4H [27].
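The two free-energy thresholds quoted above can be written as a simple decision rule. In this Python sketch the function name and the returned labels are hypothetical, and the ΔG value is assumed to have been computed beforehand by an RNA folding program; the behaviour between the two thresholds is not specified in the text and is labelled accordingly.

```python
def hairpin_effect(delta_g_kcal_per_mol: float) -> str:
    """Map a predicted folding free energy to the expected effect on scanning.

    Thresholds follow the text: structures with ΔG above -30 kcal/mol do not
    stall the 40S subunit, and structures with ΔG below -50 kcal/mol reduce
    translation efficiency. Values in between are left unclassified.
    """
    if delta_g_kcal_per_mol > -30:
        return "no stall expected"
    if delta_g_kcal_per_mol < -50:
        return "strong inhibition expected"
    return "intermediate"


if __name__ == "__main__":
    for dg in (-20.0, -40.0, -60.0):
        print(dg, hairpin_effect(dg))
```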

Table 3 Examples of 5' UTR sequences with highly stable stem-loop structures

An alternative mechanism for translation initiation, which occurs independently of the 5' cap, was discovered for the first time in picornaviruses [28]: a sequence element in the 5' UTR acts as an internal ribosome entry site (IRES). IRES elements have been found in many cellular mRNAs encoding regulatory proteins, such as proto-oncogene products like c-Myc, homeodomain proteins, growth factors (like the fibroblast growth factor FGF-2) and their receptors. The concept of IRESs has been very critically reviewed by Kozak [29], who originally defined the importance of initiation codon context. Comparative analysis of known cellular IRESs has led to the identification of a common structural motif shared by many mRNAs, including those encoding the immunoglobulin heavy chain binding protein BiP and FGF-2: a Y-shaped stem-loop just upstream of the AUG initiation codon [30] (see Table 4 and Figure 2b). It has recently been discovered that short sequence motifs complementary to the small ribosomal RNA may also act as IRESs [31].

Table 4 5' UTR sequences with experimentally proved IRES elements

Sequence elements that are the target of trans-acting RNA-binding proteins can also regulate translation. For example, the iron-responsive element (IRE) located in the 5' UTR of mRNAs encoding proteins involved in iron metabolism (ferritin, 5-aminolevulinate synthase and aconitase) may inhibit translation through the iron-dependent binding of iron regulatory proteins, which impede the normal scanning process of the small ribosomal subunit in translation initiation. In addition, most vertebrate mRNAs that encode ribosomal proteins and translation elongation factors analyzed to date contain a 5' terminal oligopyrimidine tract (TOP) consisting of 5-15 pyrimidines immediately adjacent to the m7G cap. This tract is required for coordinated translational repression during growth arrest, differentiation, development and certain drug treatments [32].
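The description of the TOP tract as 5-15 pyrimidines immediately adjacent to the cap lends itself to a small Python sketch. The function names are hypothetical, and the sketch assumes that the sequence supplied begins at the first transcribed nucleotide, which is not guaranteed for every database entry.

```python
def top_tract_length(mrna: str) -> int:
    """Length of the uninterrupted pyrimidine run at the 5' end of an mRNA."""
    seq = mrna.upper().replace("T", "U")
    length = 0
    for base in seq:
        if base in ("C", "U"):
            length += 1
        else:
            break  # first purine ends the terminal tract
    return length


def has_top(mrna: str, min_len: int = 5, max_len: int = 15) -> bool:
    """True if the 5' terminal pyrimidine run is within the 5-15 nt range."""
    return min_len <= top_tract_length(mrna) <= max_len


if __name__ == "__main__":
    # Invented toy sequence, not a real ribosomal protein mRNA.
    print(top_tract_length("CUUUCCGAUG"), has_top("CUUUCCGAUG"))
```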

Regulation of mRNA stability

The turnover of mRNAs is another crucial step in post-transcriptional regulation of gene expression, as changes in mRNA abundance may alter the expression of specific genes by affecting the abundance of the corresponding protein. Several mechanisms have been proposed to describe how mRNA degradation takes place: decay can be preceded by shortening or removal of the poly(A) tail at the 3' end and/or by removal of the m7G cap at the 5' end [33]. The turnover of an mRNA is mostly regulated by cis-acting elements located in the 3' UTR, such as the AU-rich elements (AREs), which promote mRNA decay in response to a variety of specific intra- and extra-cellular signals. AREs have been experimentally grouped into three classes: class I and II AREs are characterized by the presence of multiple copies of the pentanucleotide AUUUA, which is absent from class III AREs [34]. Class I AREs control the cytoplasmic deadenylation of mRNAs by the degradation of all parts of the poly(A) tail at the same rate, generating intermediates with poly(A) tails of 30-60 nucleotides, which are then completely degraded. These elements are found mainly in mRNAs encoding nuclear transcription factors such as c-Fos and c-Myc (the products of 'fast response' genes) and also in mRNAs for some cytokines, such as interleukins 4 and 6. The presence of one or more copies of the pentanucleotide AUUUA next to a U-rich region is the structural characteristic of class I AREs. Class II AREs mediate asynchronous cytoplasmic deadenylation; in other words, the poly(A) tail is degraded at different rates in different transcripts, generating mRNAs without poly(A) tails. Among mRNAs containing this signal are those encoding the cytokines GM-CSF, interleukin 2, tumor necrosis factor α (TNF-α) and interferon-α. Class II AREs are characterized by tandem reiterations of the AUUUA pentamer, and an AU-rich region is usually found upstream of these repeats. The mRNAs containing class III AREs, such as those encoding c-Jun, do not contain the pentanucleotide AUUUA but have only a U-rich segment; they show degradation kinetics similar to those of mRNAs containing class I AREs.
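The structural criteria for the three ARE classes can be approximated by a rough Python sketch. The function name, the regular expressions and in particular the definition of a 'U-rich' segment as five or more consecutive U residues are assumptions introduced here; the text gives no numerical threshold, and the experimentally defined classes also depend on decay kinetics that sequence alone cannot capture.

```python
import re
from typing import Optional

# Tandem reiterations of the pentamer, either overlapping (AUUUAUUUA...)
# or directly adjacent (AUUUAAUUUA...).
TANDEM_PENTAMERS = re.compile(r"AUUUA(?:UUUA)+|(?:AUUUA){2,}")
# Hypothetical working definition of a U-rich segment.
U_RICH = re.compile(r"U{5,}")


def classify_are(three_utr: str) -> Optional[str]:
    """Return 'I', 'II' or 'III' for a 3' UTR, or None if no ARE-like signal."""
    utr = three_utr.upper().replace("T", "U")
    if TANDEM_PENTAMERS.search(utr):
        return "II"
    has_pentamer = "AUUUA" in utr
    has_u_rich = U_RICH.search(utr) is not None
    if has_pentamer and has_u_rich:
        return "I"
    if not has_pentamer and has_u_rich:
        return "III"
    return None


if __name__ == "__main__":
    # Invented toy sequences illustrating each outcome.
    for utr in ("GGAUUUAGGUUUUUUGG", "GGAUUUAUUUAUUUAGG", "GGCCUUUUUUGGCC"):
        print(utr, classify_are(utr))
```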

Degradation of mRNAs can also take place following endonuclease activity, in a mechanism independent of both deadenylation and decapping. Such a mechanism has been observed for the mRNA encoding the transferrin receptor, a protein that mediates iron transfer in the cell. The degradation pathway of this mRNA involves an endonucleolytic cleavage in the 3' UTR that is mediated by the recognition of IRE structures and is regulated by the level of intracellular iron [35].

Upstream initiation codons and ORFs may also play a role in mRNA decay through the nonsense-mediated mRNA decay (NMD) pathway. The signal that triggers NMD is a nonsense codon followed by a splicing junction (the junction between two exons that remains after an intron has been removed) [36]; the presence of the splicing junction may be how normal stop codons are distinguished from premature termination codons. Indeed, normal stop codons and the 3' UTR are usually located in the last exon of the sequence and thus are not followed by a splicing junction. Exon junctions are recognized because a marker protein binds to the intron-containing transcript in the nucleus, remains bound to the exon junction after the splicing event has finished and is translocated to the cytoplasm with the processed mRNA [11]. The translation machinery usually displaces the marker protein, preventing the degradation of wild-type mRNAs. But if the ribosome encounters a stop codon that is either premature or due to the presence of an upstream ORF, it disassembles and the marker proteins at the exon junction direct the aberrant mRNA towards NMD [37]. In Saccharomyces cerevisiae (which uses a downstream sequence element, DSE, as the second signal that triggers NMD), mRNAs containing functionally active upstream ORFs, like those encoding GCN4 or YAP1, are not degraded through the NMD pathway because they contain mRNA-specific stabilizer sequence elements between the upstream ORF and the coding sequence that prevent the activation of the NMD pathway by interacting with the RNA-binding ubiquitin ligase Pub1 [38].
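The rule described above, a stop codon followed by a splicing junction, can be captured in a minimal Python sketch. The function name and the coordinate convention are assumptions made for illustration: positions are 0-based on the spliced mRNA, and each junction is given as the index of the first nucleotide of the downstream exon. Any distance requirement between the stop codon and the junction, and the yeast DSE-based mechanism, are not modelled.

```python
def stop_followed_by_junction(stop_index: int, exon_junctions: list[int]) -> bool:
    """True if at least one exon-exon junction lies downstream of a stop codon.

    `stop_index` is the position of the first nucleotide of the stop codon;
    the codon therefore occupies stop_index .. stop_index + 2.
    """
    stop_end = stop_index + 3  # first position after the stop codon
    return any(junction >= stop_end for junction in exon_junctions)


if __name__ == "__main__":
    junctions = [50, 200]  # hypothetical spliced-mRNA coordinates
    # A stop in the last exon is not followed by a junction.
    print(stop_followed_by_junction(300, junctions))
    # A premature stop, or the stop of an upstream ORF, may be.
    print(stop_followed_by_junction(100, junctions))
```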

Upstream ORFs can also regulate mRNA stability through an NMD-independent mechanism. The 5' UTR of the S. cerevisiae gene YAP2 contains two upstream ORFs that inhibit ribosomal scanning and promote mRNA decay [26]. The destabilizing effect relies on the termination codon context, which modulates translation efficiency and mRNA stability. Table 5 reports some genes in which upstream ORFs have been demonstrated to affect gene expression.

Table 5 Genes with experimentally characterized upstream ORFs in their 5' UTR

Several studies have provided evidence that many hnRNPs not only function in the nucleus but also are involved in the control of mRNA fate in the cytoplasm [10] and can regulate translation, mRNA stability and cytoplasmic localization [37]. One example is the regulation of the amyloid precursor protein (APP); increasing the level of APP is an important contributing factor to the development of Alzheimer's disease. Stability of APP mRNA is dependent on a highly conserved 29-nucleotide element located in the 3' UTR that interacts with several cytoplasmic RNA-binding proteins [39]. Very interestingly, although some of these proteins are fragments of nucleolin (which is known to shuttle between the nucleus and cytoplasm), two proteins of 39 kDa and 38 kDa are subunits of hnRNP C, seen in this study for the first time in the cytoplasm [40].

Control of mRNA subcellular localization

UTRs have a fundamental role in the spatial control of gene expression at the post-transcriptional level, which is particularly important during development. The asymmetric localization of some mRNAs leads to an asymmetry of cellular distribution of the encoded proteins; such a situation is clearly more efficient than other possible mechanisms of protein localization, because the same mRNA molecule can serve as a template for multiple rounds of translation. In many cases, mRNAs are localized as ribonucleoprotein complexes along with proteins of the translational apparatus, thus ensuring efficient localized translation.

There are three main mechanisms for the asymmetric distribution of mRNAs: active directed transport, requiring a functional cytoskeleton and specific motor proteins interacting with the targeted mRNAs; local stabilization of transcripts; and diffusion of the mRNA followed by its local entrapment. Myelin basic protein (MBP) mRNA is localized to the myelin produced by oligodendrocytes of the central nervous system through an active transport mechanism. A 21-nucleotide sequence, termed the RNA-transport signal, and an additional element, the RNA-localization region, both in the 3' UTR of MBP mRNA, are required for its transport and localization in mouse [41]. Many examples of local stabilization come from Drosophila early development: transcripts encoding the RNA-binding protein Nanos or the heat-shock protein Hsp83 are degraded everywhere in the embryo except in the posterior polar plasm. Distinct cis-acting elements located in the 3' UTR of these mRNAs mediate both degradation in the embryo as a whole and the stabilization at the pole [5]. The diffusion and entrapment mechanism is well represented by localization of Bicoid mRNA in Drosophila. The elements that regulate the anchoring of the transcript, the key step of the process, are not all characterized, but one protein involved is Staufen, a double-stranded RNA-binding protein that is essential for the immobilization of Bicoid mRNA in the anterior pole of the egg [42].

In all these cases, subcellular localization of mRNA is mediated by cis-acting elements located in the 3' UTR, but there are also examples of elements in the 5' UTR or even in the coding sequence; these are known as mRNA zip codes and interact with zip-code-binding proteins (such as Staufen). Zip codes lack any apparent similarity in their primary or secondary structure; they can have a complex secondary or tertiary structure, as in the Bicoid localization element, in which primary sequence is less important than the overall structure [43], or they can be short, defined nucleotide sequences [44], sometimes in repeated elements (such as in the case of the Xenopus localized transcript Vg1 [45]).

In conclusion, untranslated regions of mRNAs have crucial roles in many aspects of gene regulation. Further information on the structures and functions of UTRs, including the cis-acting elements found in them (Table 6) [46], can be found at our UTR home page [47] and from the UTRdb and UTRsite databases, which can be downloaded from our ftp site [48] or accessed with SRS [49] from our website [50] or the European Bioinformatics Institute [51].

Table 6 Functional elements in UTRsite collection annotated in UTRdb entries