Abstract
The GENCODE project provides comprehensive, high-accuracy annotation of the functional elements in the human and mouse genomes. The annotations are released for the benefit of the biomedical and genomic research communities. In this chapter, we provide a basic user manual, or roadmap, to facilitate the exploration of GENCODE annotation. We give a brief history of GENCODE and the general working principles that GENCODE adopts for its annotation, and then introduce a few workflows to guide users in extracting and exploring GENCODE resources for downstream analysis. The chapter is structured as follows. We start by introducing GENCODE from a historical perspective: the needs and objectives that led to its creation, and how it became one of the most reliable sources of functional elements for the human and mouse genomes. We then provide an overview of the GENCODE database, covering the different types of annotated genes, their descriptions, basic statistics, and how they were created, with emphasis on the latest four releases. Following this overview, we describe the annotation methods adopted by the GENCODE consortium for both the human and mouse genomes, along with the validation methods. We also document the GENCODE annotation data format fields and their definitions as they appear in the GTF and GFF3 files. Next, we describe three ways to access GENCODE annotations: via the GENCODE portal, the Ensembl Genome Browser, and the UCSC Genome Browser. We conclude with three use cases showcasing how to explore the GENCODE annotation to answer research questions. Source code, an interactive user guide, and other files are available at https://github.com/smusleh/BookChapterGENCODE.
Acknowledgement
Funding: None
1.1 Electronic Supplementary Material
Supplementary Data 1.1
Interactive user guide highlighting different ways to access GENCODE annotation (HTML 10 kb)
Supplementary Data 1.2
Shell commands for the use case 1 (SH 4 kb)
Supplementary Data 1.3
MALAT1 gene and associated transcripts and exons (GFF3 22 kb)
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Musleh, S., Alazmi, M., Alam, T. (2021). GENCODE Annotation for the Human and Mouse Genome: A User Perspective. In: Abugessaisa, I., Kasukawa, T. (eds) Practical Guide to Life Science Databases. Springer, Singapore. https://doi.org/10.1007/978-981-16-5812-9_1
DOI: https://doi.org/10.1007/978-981-16-5812-9_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-5811-2
Online ISBN: 978-981-16-5812-9
eBook Packages: Biomedical and Life Sciences, Biomedical and Life Sciences (R0)