
De Novo Approach to Classify Protein-Coding and Noncoding Transcripts Based on Sequence Composition

  • Protocol
RNA Mapping

Part of the book series: Methods in Molecular Biology (MIMB, volume 1182)

Abstract

The rapid progress of high-throughput technologies such as RNA sequencing is revealing ever more transcripts across the genome, especially in poorly annotated species. This raises the challenge of classifying newly identified transcripts as protein coding or noncoding. Here, we describe a de novo approach named the coding–noncoding index (CNCI), a signature-based tool that profiles adjoining nucleotide triplets (ANT) to distinguish protein-coding from noncoding sequences independently of known annotations. The main advantage of CNCI is its ability to accurately classify transcripts assembled from whole-transcriptome sequencing data in a cross-species manner, which allows it to be applied to all vertebrates and invertebrates using training data from well-annotated species (such as human and Arabidopsis). In this chapter, we illustrate the CNCI method in detail using an example of RNA-sequencing data generated from six biological replicates of six mouse tissues. The CNCI software is available at http://www.bioinfo.org/software/cnci.
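To make the idea of profiling adjoining nucleotide triplets concrete, the sketch below counts pairs of adjacent, non-overlapping triplets in one reading frame of a sequence. It is only an illustration of the counting step under simple assumptions: the function name `adjoining_triplet_counts` and its `frame` parameter are hypothetical, and the code is not the CNCI software or its scoring scheme.

```python
from collections import Counter


def adjoining_triplet_counts(seq, frame=0):
    """Count pairs of adjacent, non-overlapping triplets read in one frame.

    Hypothetical helper for illustration only; not part of CNCI.
    """
    # Normalize case and treat RNA (U) as DNA (T).
    seq = seq.upper().replace("U", "T")
    # Split the sequence into consecutive triplets starting at `frame`.
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    valid = set("ACGT")
    pairs = Counter()
    # Pair each triplet with the one immediately downstream of it.
    for first, second in zip(triplets, triplets[1:]):
        # Skip pairs that contain ambiguous nucleotides such as N.
        if set(first + second) <= valid:
            pairs[(first, second)] += 1
    return pairs


counts = adjoining_triplet_counts("ATGGCCATGGCCTAA")
for (first, second), n in sorted(counts.items()):
    print(first, second, n)
```

Counts of this kind, collected separately over known coding and noncoding training sequences, are the sort of composition profile that a classifier could compare; how CNCI itself weights and scores them is described in the chapter, not here.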



Author information

Corresponding author

Correspondence to Yi Zhao.


Copyright information

© 2014 Springer Science+Business Media New York

About this protocol

Cite this protocol

Luo, H., Bu, D., Sun, L., Chen, R., Zhao, Y. (2014). De Novo Approach to Classify Protein-Coding and Noncoding Transcripts Based on Sequence Composition. In: Alvarez, M., Nourbakhsh, M. (eds) RNA Mapping. Methods in Molecular Biology, vol 1182. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-1062-5_18


  • DOI: https://doi.org/10.1007/978-1-4939-1062-5_18


  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-1061-8

  • Online ISBN: 978-1-4939-1062-5

  • eBook Packages: Springer Protocols
