An Information Integration Approach for Classifying Coding and Non-Coding Genomic Data
Reliable methods for classifying coding and non-coding transcripts in large-scale genomic data will help researchers annotate novel RNA transcripts. In this manuscript we explore some of the properties that distinguish these two classes of transcripts, such as features of their secondary structures, differential expression scores obtained from typical RNA-seq experiments, and G+C content scores. We trained two classification methods, Conditional Random Forest (CRF) and Support Vector Machines (SVMs), on features extracted from the genomic data, and applied the trained models to a test set composed of both classes of transcripts drawn from three well-known annotation sources. The analysis also revealed important characteristics of the extracted features with respect to the classification problem. A comparative analysis shows that our method outperforms two existing state-of-the-art methods, CPC (Coding Potential Calculator) and PORTRAIT, in classifying transcripts from the test dataset.
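As an illustration of the simplest of the features named above, the G+C content score of a transcript can be sketched as the fraction of G and C bases in its sequence. The function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for this sketch; the abstract does not specify how the score was computed.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases of a sequence.

    Assumption for this sketch: only A, C, G, T and U are counted, so
    ambiguity codes such as N do not affect the score.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGTU")
    if counted == 0:
        # Empty or fully ambiguous sequence: define the score as 0.0.
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return gc / counted
```

A score like this would form one column of the feature vector passed to a classifier, alongside the secondary-structure and differential-expression features.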