Abstract
Incomplete determination of the mRNA 5′ end sequence may lead to incorrect assignment of the first AUG codon and thus to errors in predicting the encoded protein product. Given the importance of the mouse as a model organism in biomedical research, we performed a systematic identification of coding regions at the 5′ end of all known mouse mRNAs, using an automated expressed sequence tag (EST)-based approach that we have previously described. By parsing almost 4 million BLAT alignments, we found 351 mouse loci, out of 20,221 analyzed, in which an extension of the mRNA 5′ coding region was identified. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing of the Apc2 and Mknk2 cDNAs. We also generated a list of 16,330 mouse mRNAs in which the presence of an in-frame stop codon upstream of the known start codon indicates that the coding sequence, in its current form, is complete at the 5′ end. Systematic searches in the main mouse genome databases and genome browsers showed that 82% of our results are original and have not been identified by their annotation pipelines. Moreover, the same information is not easily derivable from RNA-Seq data, owing to short sequence length and the laborious reconstruction of full-length transcript structures. In conclusion, our results improve the determination of full-length 5′ coding sequences and may help reduce errors in studies of mouse gene structure and function in biomedical research.
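The completeness criterion mentioned above (an in-frame stop codon upstream of the annotated start codon) can be illustrated with a minimal sketch. This is not the 5′_ORF_Extender software; the function name, its arguments, and the toy sequences are hypothetical, and the check is simplified to scanning the known 5′ region codon by codon in the frame of the annotated ATG.

```python
# Hypothetical illustration only; not the authors' 5'_ORF_Extender code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(mrna: str, start: int) -> bool:
    """Return True if a stop codon lies upstream of, and in frame with,
    the annotated start codon.

    mrna  -- transcript sequence, 5'->3' (RNA or cDNA alphabet)
    start -- 0-based index of the A of the annotated ATG (assumed input)
    """
    seq = mrna.upper().replace("U", "T")
    # Step backwards from the start codon three bases at a time,
    # so every codon examined is in the same reading frame as the ATG.
    for i in range(start - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    # The frame stays open up to the 5' end of the known sequence,
    # so an upstream extension of the coding region cannot be excluded.
    return False


# Toy sequences (made up): annotated ATG at index 8 in each case.
closed_frame = "CCTAGGCCATGAAA"   # TAG at 2..4 is in frame with the ATG
open_frame = "CCCGGGCCATGAAA"     # no in-frame stop upstream
shifted_stop = "CTAGGGCCATGAAA"   # TAG at 1..3 is out of frame

print(has_upstream_inframe_stop(closed_frame, 8))
print(has_upstream_inframe_stop(open_frame, 8))
print(has_upstream_inframe_stop(shifted_stop, 8))
```

A transcript for which the check returns False is only a candidate for 5′ extension: the reading frame is open to the end of the known sequence, and additional evidence, such as EST alignments reaching further upstream, would be needed to support an extended coding region.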
Acknowledgments
This work was funded by “RFO” Grants from Alma Mater Studiorum—University of Bologna to P.S. and L.V. M.C.’s fellowship is supported by a generous donation made by Illumia, Bologna, Italy. The 5′_ORF_Extender software was executed on the Apple Mac Pro “Multiprocessor Server” available at the Center for Research in Molecular Genetics “Fondazione CARISBO”, Bologna, and funded by “Fondazione CARISBO”. We are grateful to Gabriella Mattei and Michela Bonaguro for their excellent technical assistance with cDNA sequencing. We are grateful to Danielle Mitzman for her expert revision of the manuscript.
Additional information
Nucleotide sequence data reported are available in the DDBJ/EMBL/GenBank databases under the accession numbers KF481611 and KF612275.
Allison Piovesan and Maria Caracausi contributed equally to the work.
Cite this article
Piovesan, A., Caracausi, M., Pelleri, M.C. et al. Improving mRNA 5′ coding sequence determination in the mouse genome. Mamm Genome 25, 149–159 (2014). https://doi.org/10.1007/s00335-013-9498-3
DOI: https://doi.org/10.1007/s00335-013-9498-3