Computational Approaches to Mine Publicly Available Databases

  • Rodger B. Voelker
  • William A. Cresko
  • J. Andrew Berglund
Part of the Methods in Molecular Biology book series (MIMB, volume 1126)


Publicly available sequence annotation data is a vital resource for researchers. Many types of information are available, including structural annotations (i.e., the locations and identities of genomic features) and functional annotations (e.g., gene expression and protein interactions). Annotation data is especially useful for interrogating next-generation sequencing data (e.g., identifying genomic features that are associated with mapped reads). Additionally, the vast amount of available data offers researchers the opportunity to mine existing data sets and make new discoveries. The ability to efficiently obtain, manipulate, and interrogate this data is a valuable and empowering skill. In this chapter, we introduce several primary data repositories and describe the most commonly encountered file formats. To highlight some of the key concepts, operations, and utilities involved in working with annotation data, we provide a fully worked example of using annotations to answer some basic questions about a particular ChIP-seq data set.

Key words

Sequence annotation; Bioinformatics; BED format; UCSC genome browser; Genomic interval operations



Copyright information

© Springer Science+Business Media, LLC 2014

Authors and Affiliations

  • Rodger B. Voelker (1)
  • William A. Cresko (2)
  • J. Andrew Berglund (3)
  1. Institutes of Molecular Biology and Ecology and Evolution, University of Oregon, Eugene, USA
  2. Department of Biology and Institute of Ecology and Evolution, University of Oregon, Eugene, USA
  3. Department of Chemistry and Institute of Molecular Biology, University of Oregon, Eugene, USA
