Genomics Data Analysis Pipelines

Biomedical Informatics for Cancer Research

Abstract

Data size and flow are rapidly increasing in cancer research, as high-throughput technologies are developed for each molecular type present in the cell, from DNA sequences through metabolite levels. To maximize its value, this data must be analyzed in a consistent, reproducible manner, which requires passing terabytes of data through preprocessing (normalization, registration, QC/QA), annotation (pathways, linking of data across molecular domains), and analysis (statistical tests, computational learning techniques). The demands on data processing are therefore enormous in terms of computational power, data storage, and data flow. In this chapter, we address some of the issues faced when developing a data analysis pipeline for this high-dimensional, high-volume data. We focus on a number of best practices important for implementing the pipeline, including the use of software design patterns, tiered storage architectures, ontologies, and links to metadata in national repositories.
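
As a rough illustration of the three stages named in the abstract (preprocessing, annotation, analysis), the following minimal Python sketch chains them as interchangeable functions. Everything in it is an assumption made for illustration: the `Record` type, the stage functions, the toy `PATHWAYS` lookup, and the threshold are hypothetical and are not taken from the chapter.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Record:
    """Hypothetical record: one gene's measurements across samples."""
    gene: str
    values: List[float]  # assumed non-empty
    annotations: Dict[str, str] = field(default_factory=dict)


# A stage takes a list of records and returns a new list of records.
Stage = Callable[[List[Record]], List[Record]]

# Toy lookup table for illustration only; a real pipeline would query
# a curated resource rather than a hard-coded dictionary.
PATHWAYS = {"TP53": "p53 signaling", "EGFR": "ErbB signaling"}


def preprocess(records: List[Record]) -> List[Record]:
    """Center each gene's values on its own mean (a stand-in for normalization)."""
    out = []
    for r in records:
        m = mean(r.values)
        out.append(Record(r.gene, [v - m for v in r.values], dict(r.annotations)))
    return out


def annotate(records: List[Record]) -> List[Record]:
    """Attach a pathway label from the toy lookup, or 'unknown'."""
    out = []
    for r in records:
        notes = dict(r.annotations)
        notes["pathway"] = PATHWAYS.get(r.gene, "unknown")
        out.append(Record(r.gene, list(r.values), notes))
    return out


def analyze(records: List[Record], threshold: float = 0.5) -> List[Record]:
    """Keep genes whose largest centered deviation exceeds the threshold
    (a placeholder for a real statistical test)."""
    return [r for r in records if max(abs(v) for v in r.values) > threshold]


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order; no stage mutates its input."""
    for stage in stages:
        records = stage(records)
    return records


if __name__ == "__main__":
    data = [Record("TP53", [1.0, 3.0]), Record("XYZ", [2.0, 2.0])]
    for rec in run_pipeline(data, [preprocess, annotate, analyze]):
        print(rec.gene, rec.values, rec.annotations)
```

Because each stage has the same signature and returns new records instead of modifying its input, stages can be reordered, replaced, or rerun independently, which is one simple way to support the consistent, reproducible processing the abstract calls for.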

Author information

Correspondence to Michael F. Ochs.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Ochs, M.F. (2010). Genomics Data Analysis Pipelines. In: Ochs, M., Casagrande, J., Davuluri, R. (eds) Biomedical Informatics for Cancer Research. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-5714-6_6

  • DOI: https://doi.org/10.1007/978-1-4419-5714-6_6

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-5712-2

  • Online ISBN: 978-1-4419-5714-6

  • eBook Packages: Medicine, Medicine (R0)
