RNA-seq raw data processing

  • Alessandro Cellerino
  • Michele Sanguanini
Part of the CRM Series book series (PSNS, volume 17)


The human genome contains more than 20,000 protein-coding genes, but the complexity of the RNA population in any given human sample is at least one order of magnitude higher, because alternative splicing generates multiple isoforms from a single gene. To this one must add a growing number of non-coding RNAs and various forms of RNA editing. This high complexity raises important technical and computational questions, such as:
  • How ‘deep’ should the planned sequencing be (i.e. how many clusters should be sequenced from the cDNA libraries) to obtain a good representation of the transcript diversity?

  • Is the processing of the dataset (i.e. the identification of the gene of origin for each sequence) feasible in terms of computation time?

  • Can the complexity be reduced?

In this chapter, the problems of complexity and of mapping RNA-seq reads to the reference genome are addressed from a probabilistic and informational point of view. The issue of reducing the complexity is dealt with in Chapters 5 and 6.



Copyright information

© Scuola Normale Superiore Pisa 2018

Authors and Affiliations

  • Alessandro Cellerino (1)
  • Michele Sanguanini (2)
  1. Scuola Normale Superiore, Pisa, Italy
  2. Gonville and Caius College, University of Cambridge, Cambridge, Cambridgeshire, United Kingdom
