Methods to Detect Transcribed Pseudogenes: RNA-Seq Discovery Allows Learning Through Features
The detection of transcripts at the pseudogene scale, and the measurement of their associated activity, have recently become important research topics. Pseudogenes are an integral part of many recent studies aimed at establishing a role for a variety of noncoding RNA structures, and interest in them has increased substantially following the discovery of regulatory properties and complex mechanisms of action that, while requiring further investigation, analysis, and validation, promise to have a broad impact on human disease.
Currently, relatively few methodologies are specifically designed to detect pseudogene transcripts, and tools that either replace or integrate manual annotation procedures are much needed. In particular, we believe it is justified to advance the computational treatment of pseudogenes at the whole-transcriptome level. Catalogs of human pseudogenes have started to be delivered through RNA-Seq technologies; however, only a limited number of transcriptomes have been covered. Furthermore, while most proposals have produced a targeted algorithm, used mainly for detection, few computational pipelines have been designed following a comprehensive approach that addresses both identification and quantification of transcriptional activity within a unifying methodological framework.
Given the currently incomplete evidence, the limited impact resulting from the lack of extensive testing, and the unresolved uncertainties affecting the reproducibility of results, a new computational approach is both well motivated and timely. We consider a hybrid approach that assembles a variety of computational tools, including RNA-Seq methods and machine learning applications, all applied to transcriptome data of varying complexity. Our initial strategy is to provide lists of pseudogenes to be validated against currently known examples, in order to extend our knowledge further. An ultimate goal naturally linked to this work is an automatic approach that analyzes transcriptomes to detect candidate pseudogenes through characteristic features and that enables efficient and reproducible pseudogene classification models.
Key words: Pseudogenes, Transcriptomes, RNA-Seq, Feature matrix, Predictive inference
EC thanks Laura Poliseno for fruitful discussions on the topic and for suggesting stimulating readings.