Abstract
The availability of reference genome sequences for virtually all species under active research has revolutionized biology. Analyses of genomic variations in many organisms have provided insights into phenotypic traits, evolution and disease, and are transforming medicine. All genomic data from publicly funded projects are freely available in Internet-based databases, for download or for searching via genome browsers such as Ensembl, Vega, NCBI’s Map Viewer, and the UCSC Genome Browser. These online tools generate interactive graphical outputs of relevant chromosomal regions, showing genes, transcripts, and other genomic landmarks, as well as epigenetic features mapped by projects such as ENCODE.
This chapter provides a broad overview of the major genomic databases and browsers, and describes various approaches and the latest resources for searching them. Methods are provided for identifying genomic locus and sequence information from gene names or codes and from identifiers for DNA and RNA molecules and proteins, as well as from karyotype bands, chromosomal coordinates, sequences, motifs, and matrix-based patterns. Approaches are also described for batch retrieval of genomic information, performing more complex queries, and analyzing larger sets of experimental data, for example from next-generation sequencing projects.
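As a small illustration of the motif-based searching mentioned above, the following sketch scans a DNA string for a motif written in IUPAC nucleotide codes and reports each hit as BED-style coordinates (0-based start, exclusive end). It is a minimal example under stated assumptions and is not taken from the chapter: the function names `motif_to_regex` and `find_motif`, the placeholder chromosome name `chrToy`, and the toy sequence are all hypothetical, and only the forward strand is searched.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif (hypothetical helper) into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(sequence, motif, chrom="chrToy", offset=0):
    """Return (chrom, start, end) tuples for every forward-strand hit.

    Coordinates are 0-based and half-open, as in the BED format.
    `offset` is the position of the sequence's first base on `chrom`.
    """
    # The lookahead wrapper lets overlapping occurrences be reported too.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    hits = []
    for match in pattern.finditer(sequence.upper()):
        hits.append((chrom, offset + match.start(1), offset + match.end(1)))
    return hits


# Toy sequence (invented for illustration) containing two EcoRI sites (GAATTC).
toy_sequence = "ttGAATTCaaccGAATTCgg"
ecoRI_hits = find_motif(toy_sequence, "GAATTC")
for chrom, start, end in ecoRI_hits:
    print(f"{chrom}\t{start}\t{end}")
```

A regular expression suits exact or degenerate consensus motifs; matrix-based patterns such as position weight matrices need a scoring scan instead, which this sketch does not attempt.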
Key words
- Bioinformatics
- Epigenetics
- Genome browsers
- Identifiers
- Internet-based software
- Next-generation sequencing
- Motifs
- Matrices
- Sequences
Abbreviations
- API: Application Programming Interface
- BED: Browser Extensible Data
- BLAST: Basic Local Alignment Search Tool
- BLAT: BLAST-Like Alignment Tool
- DDBJ: DNA Databank of Japan
- EBI: European Bioinformatics Institute
- EMBOSS: European Molecular Biology Open Software Suite
- ENA: European Nucleotide Archive
- ENCODE: Encyclopedia Of DNA Elements
- FTP: File Transfer Protocol
- GI: GenInfo Identifier
- GOLD: Genomes Online Database
- GRC: Genome Reference Consortium
- GUI: Graphical User Interface
- HAVANA: Human and Vertebrate Analysis and Annotation
- ID: Identifier Code
- INSDC: International Nucleotide Sequence Database Collaboration
- NCBI: National Center for Biotechnology Information
- NGS: Next-Generation Sequencing
- PWM: Position Weight Matrix
- RegEx: Regular Expression
- REST: Representational State Transfer
- ROI: Region of Interest
- RSAT: Regulatory Sequence Analysis Tools
- UCSC: University of California Santa Cruz
- URL: Uniform Resource Locator
- Vega: Vertebrate Genome Annotation
Acknowledgements
I would like to thank the numerous developers and support staff of genomic databases who provided valuable information during the researching and writing of this chapter. Grateful thanks also go to colleagues past and present who provided helpful information and advice. During the preparation of this chapter I worked in the laboratory of Dr. M. Méchali, whom I gratefully acknowledge for his guidance and support. I was supported financially by La Fondation pour la Recherche Médicale (FRM), and by the Centre National de la Recherche Scientifique (CNRS).
Copyright information
© 2017 Springer Science+Business Media New York
About this protocol
Cite this protocol
Hutchins, J.R.A. (2017). Genomic Database Searching. In: Keith, J. (eds) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_10
DOI: https://doi.org/10.1007/978-1-4939-6622-6_10
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6620-2
Online ISBN: 978-1-4939-6622-6
eBook Packages: Springer Protocols