Abstract
Data size and flow are rapidly increasing in cancer research as high-throughput technologies are developed for each molecular type present in the cell, from DNA sequences through metabolite levels. To maximize the value of this data, it must be analyzed in a consistent, reproducible manner, which requires passing terabytes of data through preprocessing (normalization, registration, QC/QA), annotation (pathways, linking of data across molecular domains), and analysis (statistical tests, computational learning techniques). The demands on data processing are therefore enormous in terms of computational power, data storage, and data flow. In this chapter, we address some of the issues faced when developing a data analysis pipeline for such high-dimensional, high-volume data. We focus on a number of best practices important for implementing the pipeline, including the use of software design patterns, tiered storage architectures, ontologies, and links to metadata in national repositories.
Copyright information
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Ochs, M.F. (2010). Genomics Data Analysis Pipelines. In: Ochs, M., Casagrande, J., Davuluri, R. (eds) Biomedical Informatics for Cancer Research. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-5714-6_6
DOI: https://doi.org/10.1007/978-1-4419-5714-6_6
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5712-2
Online ISBN: 978-1-4419-5714-6
eBook Packages: Medicine, Medicine (R0)