Extraction of Poly(A) Sites from Large-Scale RNA-seq Data

  • Min Dong
  • Guoli Ji
  • Qingshun Quinn Li
  • Chun Liang
Part of the Methods in Molecular Biology book series (MIMB, volume 1255)


The NCBI maintains the Sequence Read Archive (SRA) database, which stores RNA-Seq data generated by different NGS technologies. As the number of finished and ongoing genome and transcriptome sequencing projects keeps growing, the data in SRA are expanding rapidly and offer a rich resource for mining information that advances our understanding of biological questions such as mRNA 3′-end formation and alternative polyadenylation. We developed a bioinformatics pipeline that processes raw SRA sequence data and yields high-quality poly(A) sites and poly(A) cluster sites with detailed expression information. The pipeline is designed to be generic and can be used for polyadenylation studies in any eukaryotic species.

Key words

Polyadenylation; RNA-Seq; SRA; Poly(A) site; Data mining



This project was supported by a grant from the US National Institutes of Health (NIH-AREA) (1R15GM94732-1 A1 to CL and QQL) and by the US National Science Foundation (grant nos. IOS-0817829 and IOS-1353354 to QQL).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Automation, Xiamen University, Xiamen, China
  2. Department of Biology, Miami University, Oxford, USA
  3. Department of Computer Science and Software Engineering, Miami University, Oxford, USA
