An Information Integration Approach for Classifying Coding and Non-Coding Genomic Data

  • Ashis Kumer Biswas
  • Baoju Zhang
  • Xiaoyong Wu
  • Jean X. Gao
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 246)


Reliable methods for classifying coding and non-coding transcripts from large-scale genomic data will help researchers annotate novel RNA transcripts. In this manuscript, we explored some of the distinguishing properties of these two classes of transcripts: features of their secondary structures, differential expression scores obtained from typical RNA-seq experiments, and G+C content scores. We trained two classification methods, Conditional Random Forest (CRF) and Support Vector Machines (SVMs), on features extracted from the genomic data, and applied the trained models to a test set comprising both classes of transcripts drawn from three well-known annotation sources. This analysis revealed important characteristics of the extracted features with respect to the classification problem. A comparative analysis shows that our method outperforms two existing state-of-the-art methods, the Coding Potential Calculator (CPC) and PORTRAIT, in classifying transcripts from the test dataset.
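Of the features named above, the G+C content score is the simplest to illustrate. The abstract does not say how the score was computed, so the following is only a minimal sketch under one common assumption: the score is the fraction of G and C bases among the unambiguous nucleotides of a transcript. The function name `gc_content` and the toy sequences are hypothetical and are not taken from the paper.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence.

    Assumption: ambiguous symbols (e.g. N) are ignored rather than
    counted in the denominator. The paper may define the score differently.
    """
    seq = sequence.upper()
    # Count only unambiguous DNA/RNA bases.
    valid = sum(seq.count(base) for base in "ACGTU")
    if valid == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / valid


# Hypothetical toy transcripts, for illustration only.
transcripts = {
    "tx_a": "ATGGCGCGCTAA",
    "tx_b": "AUUAUAUAAUUA",
}

# One feature value per transcript; in a full pipeline this would sit
# alongside the structural and expression features fed to a classifier.
features = {name: gc_content(seq) for name, seq in transcripts.items()}
```

A value like this would be only one column in the feature matrix; the secondary-structure and differential-expression features described in the abstract would need their own extraction steps.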



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, USA
  2. School of Physics and Electronic Information, Tianjin Normal University, Tianjin, China
