Genomics Data Analysis Pipelines



Data volume and flow are increasing rapidly in cancer research as high-throughput technologies are developed for each molecular type present in the cell, from DNA sequences to metabolite levels. Maximizing the value of these data requires analyzing them in a consistent, reproducible manner, which means moving terabytes of data through preprocessing (normalization, registration, QC/QA), annotation (pathways, linking of data across molecular domains), and analysis (statistical tests, computational learning techniques). The demands on data processing are therefore enormous in terms of computational power, data storage, and data flow. In this chapter, we address some of the issues faced when developing a data analysis pipeline for such high-dimensional, high-volume data. We focus on a number of best practices important for implementing the pipeline, including the use of software design patterns, tiered storage architectures, ontologies, and links to metadata in national repositories.


Data Type; Analysis Pipeline; Control Vocabulary; Data Class; Unify Medical Language System



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Division of Oncology Biostatistics and Bioinformatics, Johns Hopkins University, Baltimore, USA
