RNA-seq: from technology to biology
- Cite this article as:
- Marguerat, S. & Bähler, J. Cell. Mol. Life Sci. (2010) 67: 569. doi:10.1007/s00018-009-0180-6
Next-generation sequencing technologies are now being exploited not only to analyse static genomes, but also dynamic transcriptomes in an approach termed RNA-seq. Although these powerful and rapidly evolving technologies have only been available for a couple of years, they are already making substantial contributions to our understanding of genome expression and regulation. Here, we briefly describe technical issues accompanying RNA-seq data generation and analysis, highlighting differences to array-based approaches. We then review recent biological insight gained from applying RNA-seq and related approaches to deeply sample transcriptomes in different cell types or physiological conditions. These approaches are providing fascinating information about transcriptional and post-transcriptional gene regulation, and they are also giving unique insight into the richness of transcript structures and processing on a global scale and at unprecedented resolution.
Keywords: High-throughput sequencing; Transcriptional control; Non-coding RNA; Post-transcriptional control; Gene expression; Splicing; Transcriptome; Genome
Regulation of gene expression is fundamental to link genotypes with phenotypes. The synthesis and maturation of RNAs are tightly controlled, and they shape complex gene expression networks that ultimately drive biological processes. These networks need to be robust as well as highly plastic in order to allow rapid adaptation to environmental or genetic perturbations. An in-depth understanding of the principles and mechanisms governing these complex gene expression programmes is important to better understand complex diseases such as cancer. For more than 10 years, microarrays have allowed the simultaneous monitoring of expression levels of all annotated genes in cell populations [2, 3]. The ability to analyse entire gene expression programmes has opened new horizons for our understanding of global processes regulating gene expression. Similarly, with the increasing realisation that RNAs transcribed from non-coding portions of genomes are playing fundamental roles, genome-wide approaches have provided valuable insights into this aspect of transcriptomes. Later generations of microarrays (referred to as “tiling arrays”), which consist of probes designed to interrogate a genome systematically irrespective of any gene annotation, have been instrumental in discovering unknown transcripts. Applying this technique to several different organisms has demonstrated that the complexity of transcriptomes has indeed been vastly underestimated. It was at this point that next-generation sequencers entered the market. These platforms allow the rapid and cost-effective generation of massive amounts of sequence data, and this breakthrough has huge potential to revolutionise the field of transcriptomics. Even though direct sequencing of cDNA libraries has been achieved before with SAGE and MPSS approaches, next-generation sequencing (NGS) technologies are more straightforward and more affordable. RNA-seq was thus born [8, 9, 10, 11].
In this review, we will first provide an overview of the strengths and challenges inherent to RNA-seq and will then highlight major biological insights gained from RNA-seq in a wide range of organisms.
RNA-seq data generation and analysis
Library preparation is a key step of RNA-seq, because it determines how closely the cDNA sequence data reflect the original RNA population. In the classic NGS protocols, which have been developed for the analysis of genomic DNA, adaptors are ligated onto sheared double-stranded DNA fragments. In order to allow the analysis of transcriptomes by NGS, these protocols have been adapted to the sequencing of cDNA. The most straightforward approach is to simply synthesise double-stranded cDNA, to which the adaptors can then be ligated. This robust protocol has been attractive, because it applies the procedures developed by the manufacturer for the analysis of genomic DNA, and it has been widely used in the original RNA-seq studies. A substantial drawback of this approach, however, is the loss of information on transcriptional direction, because the adaptors are ligated to double-stranded cDNA. An elegant study has managed to maintain strand information simply by pre-treating the RNA samples with sodium bisulphite. This chemical triggers the conversion of cytidine into uridine; widespread C–T transitions therefore “mark” the coding strand of each transcript. Six additional RNA-seq protocols that maintain strand-specificity have been published. They differ in how the adaptor sequences are inserted into the cDNA, which is achieved (1) by direct ligation of RNA adaptors to the RNA sample before reverse transcription [15, 16], (2) by addition of the adaptor sequences by template switch during reverse transcription, (3) by double-random priming coupled to solid phase extraction, (4) by direct ligation of the DNA adaptors to single-stranded cDNA [19, 20, 21], (5) by reverse transcription of in vitro polyadenylated RNA fragments followed by intramolecular ligation, or (6) by incorporation of dUTP during second strand synthesis and digestion with uracil-N-glycosylase.
These methods are likely to differ in potential biases introduced in the data, and careful comparisons will be highly interesting.
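The bisulphite idea described above can be illustrated with a minimal Python sketch. It is not the published analysis method: the function name `infer_strand`, the ungapped alignment, and the simple majority rule are all assumptions made for illustration. Both sequences are given in forward genomic orientation, so a transcript from the forward strand should show C-to-T mismatches, while a transcript from the reverse strand should show the complementary G-to-A mismatches.

```python
def infer_strand(reference: str, read: str) -> str:
    """Guess the transcribed strand from bisulphite-induced mismatches.

    Hypothetical helper: `reference` and `read` are equal-length,
    ungapped, and both written in forward genomic orientation.
    Returns "+", "-", or "?" when the evidence is tied or absent.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")

    # Conversions on a forward-strand transcript appear as C->T.
    c_to_t = sum(1 for r, q in zip(reference, read) if r == "C" and q == "T")
    # Conversions on a reverse-strand transcript appear as G->A.
    g_to_a = sum(1 for r, q in zip(reference, read) if r == "G" and q == "A")

    if c_to_t > g_to_a:
        return "+"
    if g_to_a > c_to_t:
        return "-"
    return "?"


if __name__ == "__main__":
    print(infer_strand("ACGTCCGA", "ATGTTTGA"))
```

A real pipeline would also have to map converted reads in the first place (for example against converted copies of the reference) and distinguish conversions from sequencing errors and polymorphisms; none of that is attempted here.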
NGS technologies exploit light that is emitted when the correct base (or oligonucleotides in case of SOLiD) matches the template being sequenced and is incorporated into the sequencing reaction. Thus, NGS raw outputs are image records of the light emitted by every single parallel sequencing reaction at every sequencing cycle. These raw image files represent terabytes of data and require substantial storage resources. The images are then processed in order to extract numerical signals for every base at every synthesis event from all the parallel reactions. These signals are used for base calling. Improving the quality and reliability of signal extraction and base calling has led to significant increases in the quality and throughput of NGS data [24, 25, 26].
After image and signal processing, NGS data consist of a list of short sequences together with their base call qualities. These data are fundamentally different from microarray data. With hybridisation-based techniques, the scanner returns signal intensities for each probe on the array. In the case of RNA-seq data, the number of reads mapping to any given region of the genome makes up the signal. Besides providing single base pair resolution, sequencing gives full control over which reads are included in the final analysis and hence contribute to the expression signals. Thus, RNA-seq data are countable and digital in nature. The generation of reliable RNA-seq data therefore relies heavily on proper mapping of sequencing reads to corresponding reference genomes or on their efficient de novo assembly. Mapping NGS reads with high efficiency and reliability currently faces several challenges. First, the computing resources required to map huge numbers of small reads within a reasonable time can be limiting. However, tremendous effort has been invested during the last couple of years to develop algorithms that allow mapping of millions of small reads using limited computing resources and time [27, 28, 29, 30, 31, 32, 33]. The second challenge arises from the relatively high error rate of NGS data, meaning that non-perfect matches have to be considered when mapping reads back to a genome. This issue is particularly relevant when single nucleotide polymorphisms (SNPs) are of interest to detect allele-specific expression in RNA-seq data. Distinguishing sequencing errors from SNPs requires higher sequencing depths, so that each base is sequenced multiple times and correct base calls can be made at each position, even in heterozygous samples. Analysis protocols have been developed for the detection of genetic variation at a reasonable sequencing depth and hence at affordable costs.
Library preparation and/or sequencing procedures can also introduce systematic biases and artefacts such as over-amplification of GC-rich regions and generation of duplicate sequences. A third challenge, which is also one of the most exciting features of RNA-seq data, is to identify reads containing post-transcriptionally modified or rearranged sequences which cannot be mapped directly to the reference genome. This feature will be discussed in more detail below. Finally, for cases when no good quality reference genome is available, direct de novo assembly of RNA-seq data into contigs may be useful. Several assemblers optimised for short sequence reads have been recently developed [36, 37, 38, 39, 40, 41, 42, 43, 44, 45].
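The point made above, that higher depth helps to separate sequencing errors from true SNPs, can be made concrete with a simple binomial model. The following Python sketch is only an illustration under stated assumptions and not one of the cited analysis protocols: errors are treated as independent, the per-base error rate is uniform, and the function name and the example values are hypothetical. It computes the probability that at least `k` of `depth` reads disagree with the reference purely through sequencing error.

```python
from math import comb


def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    Assumption: every read covering the position is miscalled
    independently with the same probability `error_rate`.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie between 0 and 1")
    if depth < 0 or k < 0:
        raise ValueError("depth and k must be non-negative")
    return sum(
        comb(depth, i) * error_rate**i * (1.0 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    # Half of the reads carry the variant, as expected for a heterozygous
    # site; the 1% error rate is an assumed value for illustration only.
    for depth in (4, 10, 40):
        print(depth, prob_at_least_k_errors(depth, depth // 2, 0.01))
```

Under this model, seeing the variant in half of the reads becomes increasingly hard to explain by error alone as depth grows. The model is deliberately pessimistic in one respect, since it counts any miscall rather than only miscalls to the same alternative base, and it ignores base qualities and mapping artefacts.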
Once the sequencing reads have been filtered and mapped (or assembled), it is possible to compute an expression score for every base in the genome and thus obtain transcriptome maps at the best possible resolution. The true resolution of this approach, however, depends on the amount of sequence coverage and therefore on the amount of sequences generated. Sequence coverage can be a limiting factor, especially when large genomes are analysed, due to costs and machine time required.
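As a rough illustration of a per-base expression score, the Python sketch below counts how many mapped reads cover each position of a reference sequence. It is a minimal sketch rather than the scoring used in any particular study: the input format (0-based start position and read length for each ungapped alignment) and the function name `per_base_coverage` are assumptions, and spliced reads, strand, and normalisation for sequencing depth are ignored.

```python
from typing import Iterable, List, Tuple


def per_base_coverage(
    genome_length: int, alignments: Iterable[Tuple[int, int]]
) -> List[int]:
    """Return the number of reads covering each base of the reference.

    `alignments` holds (start, length) pairs with 0-based starts.
    Reads starting outside the reference are skipped; reads running
    past the end are clipped.
    """
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (genome_length + 1)
    for start, length in alignments:
        if length <= 0 or start < 0 or start >= genome_length:
            continue
        end = min(start + length, genome_length)
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the coverage.
    coverage = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        coverage.append(running)
    return coverage


if __name__ == "__main__":
    print(per_base_coverage(10, [(0, 4), (2, 4), (8, 5)]))
```

The difference-array approach keeps the cost proportional to the number of reads plus the reference length, which matters when millions of short reads are involved.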
Applying RNA-seq to probe the breadth and depth of genome transcription
The use of NGS technologies for the analysis of RNA was pioneered by researchers working with small regulatory RNAs, possibly because this field has benefited less from microarrays: small RNAs are too short to be captured adequately with the limited resolution provided by microarrays. Sequencing of short regulatory RNAs has resulted in important and exciting papers, which have been extensively reviewed elsewhere [46, 47]. Whole transcriptome studies using RNA-seq emerged soon after. To date, transcriptomes have been sequenced for over a dozen organisms including human [14, 16, 18, 19, 20, 48, 49, 50, 51, 52, 53, 54, 55], mouse [17, 23, 56, 57, 58], budding yeast [22, 23, 59, 60, 61, 62], fission yeast, worm, fruit fly, non-model organisms [66, 67], several plants [15, 68, 69, 70, 71] and prokaryotes [21, 72, 73]. Unlike the genome, the transcriptome dynamically changes in response to the environment or to intrinsic programmes, and many studies have reported transcriptome sequences for several cell types or physiological conditions.
The countable, almost digital, nature of RNA-seq data makes them particularly attractive for the quantitative analysis of transcript expression levels. Nearly every RNA-seq study published to date has addressed this question, and they agree that RNA-seq data are highly quantitative and give reliable measurements of transcript levels in one or more conditions. The dynamic range of these data is theoretically only limited by the sequencing depth and has been reported to span at least 5 orders of magnitude. This dynamic range is well beyond the range achieved by microarrays and close to the estimated range of transcript frequencies in the cell. A few studies also looked at the ability of RNA-seq to measure differential gene expression [51, 57, 61]. These studies agree that RNA-seq performs at least as well as microarrays, provided that the sequencing depth is adequate. RNA-seq has the additional advantage that, besides differential transcript levels, levels of different splice variants or of transcripts with different UTR lengths can be assessed at the same time (see below). Producing enough reads for accurate quantification of lowly expressed transcripts, however, can still be quite expensive for large transcriptomes. In a variant of RNA-seq, only small tags at the 3′ ends of transcripts are sequenced. This assay permits the measurement of even lowly expressed transcripts with a limited number of sequencing reads [57, 74].
Besides this quantitative aspect, RNA-seq studies are enabling researchers to refine transcript annotation, providing for instance accurate maps of transcript start and end sites. This feature is of particular help for dense prokaryotic genomes, allowing confident discrimination between single gene transcriptional units and operons encompassing several genes. The analysis of transcript structures is also fundamental for the study of complex diseases such as cancer. Genomic re-arrangements or mutations can generate aberrant fusion transcripts which, if stably expressed, can lead to pathologies. Such gene fusions have been shown to be commonly associated with different types of tumours. Direct sequencing of transcriptomes, coupled with analysis pipelines allowing the detection of sequence re-arrangements and abnormal transcript structures, is a powerful tool which permits direct detection of such fusion events. Several studies have already provided proofs of principle that this approach is suitable for discovering new aberrant transcripts [19, 50]. Thus, this technological breakthrough will hopefully fuel our understanding of complex diseases.
Another characteristic of RNA-seq data is their high sensitivity, allowing the detection of the expression of substantially more transcripts in a given cell type compared to what could be detected by microarrays. RNA-seq studies also contribute to an increased list of the transcripts expressed in all organisms studied, most of these newly defined transcripts being non-coding. A high coverage RNA-seq study of the fission yeast (Schizosaccharomyces pombe) transcriptome during vegetative growth revealed that over 94% of this genome is actively transcribed at some level, including genes required only under specialised physiological conditions . This finding could reflect a small percentage of cells in the population expressing a different transcriptional programme , or it could reflect a certain amount of basal background transcription. The latter would be compatible with the suggestion that as much as 90% of all RNA Polymerase II (Pol II) initiation events represent transcriptional noise and raises the question of the biological relevance of an almost ubiquitous noisy transcription .
RNA-seq has also been used to dig deep into eukaryotic transcriptomes and reveal an intriguing new feature of eukaryotic transcription at promoters. Cryptic unstable transcripts (CUTs) are small RNA Pol II transcripts found in the budding yeast (Saccharomyces cerevisiae) which are targeted for degradation by the exosome complex immediately after synthesis . While the mechanisms regulating their processing have been extensively studied, the prevalence of CUTs in the yeast genome has remained unknown. Two studies have determined the genome-wide distributions and structures of CUTs [78, 79], using NGS to sequence a SAGE library enriched for CUTs or high-density tiling arrays, respectively. Interestingly, CUTs seem to be well-defined transcriptional units arising mostly from nucleosome-free regions (NFRs). NFRs are characteristic of eukaryotic genomes and can be found mainly in the promoters and terminators of genes . A fraction of CUTs are overlapping the 5′ ends of genes, suggesting a potential regulatory function. However, CUTs are most frequently transcribed in divergent orientation from the promoters of genes, suggesting that they could be by-products of Pol II-dependent transcription [78, 79]. These data suggest that bidirectional transcription is a widespread characteristic of eukaryotic promoters. In budding yeast, stable transcripts arising from bidirectional transcription can also be detected, suggesting that this phenomenon is not restricted to cryptic transcripts . Interestingly, these transcripts show extensive overlaps with annotated genes. A possible regulatory role of bidirectional transcription remains to be determined, but some data suggest that divergent transcripts could act as transcriptional “links” between neighbouring genes and potentially regulate their co-expression . Bidirectional transcription seems to be a conserved characteristic as it can also be detected in multicellular eukaryotes. 
Transcripts similar to yeast CUTs have been detected after inactivation of the exosome in human cells. These so-called “promoter upstream transcripts” (PROMPTs) are mostly transcribed from promoters of active genes in both directions. As in yeast, stable transcripts mapping to both strands of promoters can also be detected in metazoans [16, 82, 83, 84]. A similar class of short transcripts, 20–90 nucleotides in length, has been found in mouse ES cells, up- and downstream of the transcription start sites (TSS). Interestingly, these short divergent transcripts are not enriched in terminator or intergenic regions. Analysis of histone marks around these transcripts has revealed that marks associated with transcription elongation are present on the gene sequences but not in the antisense direction, suggesting that productive elongation occurs mostly downstream of the TSS. In this context, it is possible that these short RNAs mark regions of Pol II pausing. A similar picture has emerged in human fibroblasts, where nascent RNAs have been sequenced using NGS technology, providing an overview of the distribution of Pol II engaged in transcription at a given time. This study concludes that a large amount of Pol II is paused shortly after initiation. In addition, engaged Pol II has been detected in divergent direction relative to genes. However, the lack of sequencing reads further upstream indicates that divergent Pol II does not productively elongate transcripts. These findings suggest that regulation of transcript elongation participates in the control of gene expression. In summary, bidirectional transcription at promoters seems to be a widespread phenomenon conserved across evolution. Further investigation will now be required to understand what portion of these divergent transcription events represents useless by-products of transcription initiation and what portion plays regulatory roles.
Applying RNA-seq to interrogate post-transcriptional gene regulation
Analysis of alternative splicing by RNA-seq has been performed recently on several human tissues [48, 49, 56] and cell lines [48, 55]. The ability to globally sample every possible splice isoform has uncovered a much larger amount of alternative splicing in human tissues than previously estimated. Considering different tissues, as many as 95% of the human multi-exon genes have been found to undergo alternative splicing, with exon skipping being the most frequent form of regulation [48, 49]. These results considerably increase previous estimates, which have suggested that about two-thirds of human genes are differentially spliced. Importantly, for 92% of genes, the second most frequent isoform has a relative frequency above 15%, indicating that in most cases several isoforms of the same transcript reach substantial levels of expression. Isoforms differ mostly between tissues, while variations between individuals are two- to threefold less common. This finding indicates that tissue-specific alternative splicing is an almost universal mode of tissue-specific gene regulation. Extreme “switch-like” behaviours, where two isoforms are mutually exclusive in two distinct tissues, have also been detected. In these cases, alternative splicing can produce different proteins in different contexts. Interestingly, “switch-like” exons are characterised by conserved regulatory motifs. Different spliced isoforms can also occur together in the same tissues. An interesting study has applied RNA-seq to analyse the transcriptome of single mouse cells. The authors report 335 genes that display multiple isoforms in a single blastomere, indicating that alternative splicing can also increase the diversity of the transcriptome of a single cell during embryonic development. Similar analyses performed in fission and budding yeasts have provided interesting insights into how simpler unicellular eukaryotes exploit alternative splicing as a mode of post-transcriptional regulation [59, 63].
In fission yeast, intron retention seems to be the main event detected during sexual differentiation. This finding has confirmed and extended observations from smaller-scale studies . In addition, global splicing efficiencies and transcript expression levels seem to be positively correlated during vegetative growth and sexual differentiation, suggesting coordination between transcription and splicing . A recent RNA-seq study in budding yeast has uncovered many alternative isoforms showing differential expression between vegetative growth and response to heat-shock . Interestingly, some of these isoforms are possibly coding for proteins of different lengths. Taken together, these data show that regulation of splicing is also used by unicellular eukaryotes to control and diversify gene expression. Finally, bioinformatics tools helping to extract the respective expression levels of different transcript isoforms from RNA-seq data are becoming available and will help to refine the global picture of alternative splicing in eukaryotes [88, 89].
A related mechanism by which transcript diversity can be increased is the use of alternative polyadenylation sites. RNA-seq is particularly well suited to study polyadenylation as it allows direct sequencing of the junctions between poly(A) tails and the rest of the transcript (Fig. 2b). This approach makes it possible to disentangle several isoforms with alternative polyadenylation sites in a single sample. For example, human cells show a strong correlation of alternative splicing and alternative polyadenylation between tissues, suggesting coordination between these two processes. Interestingly, alternative introns and 3′ untranslated regions (UTRs) share common regulatory motifs, suggesting that they also share regulatory factors.
Transcriptome diversity can also be increased by editing of mRNA transcripts. This process involves deamination of adenosines into inosines, which are then read as guanosines. Editing is critical for brain function in mammals and linked to several diseases. However, the extent of this phenomenon has remained elusive. Direct sequencing of transcriptomes is the method of choice to understand how prevalent this mode of post-transcriptional regulation is (Fig. 2c). Indeed, a pioneering RNA-seq analysis of human brain and other tissues has revealed hundreds of new editing sites, many of which are located in non-coding RNAs.
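Because inosines are read as guanosines, candidate editing sites show up in mapped RNA-seq reads as positions where the reference has an A but a fraction of the reads carry a G. The Python sketch below illustrates this screening step only; it is not the pipeline of the cited study. The pileup format (a mapping from 0-based position to the base calls observed there), the function name, and the thresholds `min_depth` and `min_fraction` are all hypothetical choices for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def candidate_editing_sites(
    reference: str,
    pileup: Dict[int, Sequence[str]],
    min_depth: int = 10,
    min_fraction: float = 0.1,
) -> List[Tuple[int, float]]:
    """List (position, G fraction) for reference A positions with G reads.

    Assumes reads come from forward-strand transcripts; for transcripts
    from the reverse strand the same event would appear as T-to-C.
    Thresholds are illustrative defaults, not recommended values.
    """
    sites = []
    for pos, bases in sorted(pileup.items()):
        if reference[pos] != "A" or len(bases) < min_depth:
            continue
        g_fraction = Counter(bases)["G"] / len(bases)
        if g_fraction >= min_fraction:
            sites.append((pos, g_fraction))
    return sites


if __name__ == "__main__":
    ref = "CATG"
    pile = {0: list("CCCCCCCCCC"), 1: list("AAAAAGGGGG"), 2: list("TTTT")}
    print(candidate_editing_sites(ref, pile))
```

An A-to-G mismatch on its own is not proof of editing: genomic polymorphisms and sequencing errors produce the same signal, so in practice such candidates would need to be checked against matched genomic sequence or known variants.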
Information about protein–RNA interactions is fundamental for the understanding of regulatory networks governing the different layers of post-transcriptional control. Predicting protein–RNA binding sites is difficult, not least due to the relatively low sequence conservation of RNA binding motifs. Protein–RNA interactions can be mapped directly, however, using approaches similar to the chromatin immunoprecipitation technique used to identify protein–DNA interactions. This can be achieved in two ways: (1) RNA-binding proteins are immunoprecipitated together with their intact target transcripts (RIP), or (2) RNA-binding proteins are crosslinked to the RNAs they interact with and treated with RNase before immunoprecipitation (CLIP for crosslinking immunoprecipitation). This second approach limits the analysis to RNA fragments protected by the binding protein and is reminiscent of a footprint. The immunoprecipitated RNAs eventually need to be identified using either single-gene or genome-wide methods. NGS technologies have been successfully applied to these approaches. Several CLIP-seq (also called HITS-CLIP, for high-throughput sequencing CLIP) studies have analysed the binding patterns of human splicing regulators in different cell types and tissues [96, 97, 98]. For example, analysis of the binding patterns of the neuron-specific splicing factor Nova has demonstrated that its binding to introns determines the outcome of alternative splicing while its binding to 3′-UTRs can regulate alternative polyadenylation. RIP and CLIP-seq have also been used to characterise Ago-RNA complexes in mouse, human and fission yeast [99, 100, 101]. The Ago protein binds small RNAs to form a core RNA silencing complex. Sequencing the populations of microRNAs (miRNAs) and mRNAs bound to Ago proteins in the mouse brain has allowed direct identification of in vivo expressed miRNAs and their potential target transcripts.
RIP-seq with Ago has led to the discovery of a new class of small RNAs in humans, originating from small nucleolar RNAs (snoRNA) which can function like miRNAs .
Ribosomes are ribonucleoprotein complexes mediating the translation of RNA transcripts into proteins and are probably the most abundant RNA-binding complexes in the cell. Studying the number and position of ribosomes bound to transcripts globally can provide important information about regulation of translation. To this end, total cellular RNA is fractionated based on the number of associated ribosomes (“polysome profiling”). This technique has provided information on basic properties of the translation process. NGS technologies with their ability to detect the exact sequence of short RNA molecules have now enabled a transition from genome-wide polysome profiling to genome-wide ribosome footprinting. Similarly to the CLIP method outlined above, this approach is based on the isolation of short RNA fragments occupied by ribosomes and hence protected from degradation by an endonuclease. It permits not only the measurement of the number of ribosomes associated with different transcripts but also of their exact positions along the RNA molecules. This method, termed “ribosome profiling”, has been applied to budding yeast grown under two different physiological conditions. The ability to detect the distribution of ribosomes on transcripts at maximum resolution has revealed that the density of ribosomes is not uniform across transcripts. All transcripts contain a region of constant length at their 5′ ends showing a high density of ribosomes. This observation could explain the previously published phenomenon that short transcripts tend to be much more densely packed with ribosomes than large transcripts [103, 104]. The ribosome density found in introns and 3′-UTRs is less than 1% of the ribosome density seen in open reading frames (ORFs), indicating that retained introns are rarely translationally active. Moreover, many small upstream ORFs (uORFs) are detected in the 5′-UTRs of genes, but their functional relevance remains elusive.
The ribosome density in these uORFs is significantly higher than in other regions of the 5′-UTRs, indicating that pervasive translation occurs upstream of the ORF. Surprisingly, a substantial number of these uORFs use non-AUG start codons, thus unexpectedly increasing the scope of peptides that can be translated from a given transcript.
Conclusions and outlook
Next-generation sequencing technologies are revolutionising genomics research and beyond by enabling the much more rapid and cost-effective generation of massive amounts of sequences compared to traditional Sanger sequencing. This technological breakthrough provides an opportunity for regular research institutes and departments to engage in ambitious projects which so far have only been conceivable for large genome centres. The impact of NGS technologies on the analysis of gene regulation is particularly high. Within only two years, RNA-seq has reached a point where recent state-of-the-art technologies such as high-density tiling arrays look almost old-fashioned. It looks likely that sequencing-based approaches will largely supersede hybridisation-based approaches within a few years. RNA-seq permits the sequencing and quantifying of transcriptomes at maximal resolution and dynamic range, independently of transcript size, and above all free from any preconception (or even knowledge) of the genomes they are derived from. RNA-seq has started to change the way we think about studying the complexity and dynamics of transcriptomes and genome regulation. Early RNA-seq studies have revealed more extensively expressed genomes and more complex transcriptomes than anticipated, thus giving insight into novel regulatory mechanisms. These pioneering studies have also uncovered rich and extensive post-transcriptional regulation of transcript structures and sequences.
RNA-seq will without doubt drive many more exciting discoveries within the next few years. For example, sequencing of RNA from complex samples containing more than one organism, either collected in the wild [105, 106, 107, 108] or created in the laboratory, will ultimately provide information about transcriptome dynamics of living communities and interactions within ecosystems. On the other hand, sequencing of RNA from closely related species or members of a population will give insight into the processes linking transcriptome plasticity to phenotypic diversity and evolution. Given sufficient sequencing depth, RNA-seq analysis of cell populations adapting to changing environmental conditions could also reveal rare changes in transcript sequences that do not necessarily lead to an increase in fitness, thus helping to understand evolutionary mechanisms and dynamics. The main challenge for researchers is to creatively exploit the opportunities provided by those rapidly evolving technologies. Even more powerful sequencing approaches are already on the horizon. For example, “next-next-generation” sequencers such as the Helicos system, which can sequence millions of single molecules in parallel, are entering the market and seem to be suited to analyse RNA . Truly, progress is limited mainly by our imagination, and exciting times are certainly ahead.
We would like to thank Luis López-Maury, Rachel Imoberdorf, Vera Pancaldi, Martin Převorovský, and Brian Wilhelm for critical reading of the manuscript. Research in our laboratory is funded by Cancer Research UK and by PhenOxiGEn, an EU FP7 research project.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.