Genomics Data Analysis Pipelines

Ochs, Michael F.

doi:10.1007/978-1-4419-5714-6_6

Abstract

Data volume and flow are increasing rapidly in cancer research as high-throughput technologies are developed for each molecular type present in the cell, from DNA sequences through metabolite levels. Maximizing the value of these data requires analyzing them in a consistent, reproducible manner, which means passing terabytes of data through preprocessing (normalization, registration, QC/QA), annotation (pathways, linking of data across molecular domains), and analysis (statistical tests, computational learning techniques). The demands on data processing are therefore enormous in terms of computational power, data storage, and data flow. In this chapter, we address some of the issues faced when developing a data analysis pipeline for such high-dimensional, high-volume data. We focus on best practices important for implementing the pipeline, including the use of software design patterns, tiered storage architectures, ontologies, and links to metadata in national repositories.
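The three stages named in the abstract (preprocessing, annotation, analysis) can be illustrated with a minimal Python sketch. This is not the chapter's implementation, which is not shown in this preview; it only illustrates how fixed stages applied in a fixed order make a run repeatable. All names here (`preprocess`, `annotate`, `analyze`, `run_pipeline`, the gene and pathway labels, the `min_genes` QC rule) are hypothetical, and median centering stands in for whatever normalization a real pipeline would use.

```python
from statistics import mean, median
from typing import Callable, Dict, List

Stage = Callable[[dict], dict]


def preprocess(state: dict) -> dict:
    """Median-center each sample and drop samples that fail a toy QC rule."""
    normalized = {}
    for sample, values in state["raw"].items():
        if len(values) < state["min_genes"]:  # toy QC: too few measurements
            continue
        center = median(values.values())
        normalized[sample] = {g: v - center for g, v in values.items()}
    return {**state, "normalized": normalized}


def annotate(state: dict) -> dict:
    """Attach a pathway label to each gene from a hypothetical lookup table."""
    genes = {g for values in state["normalized"].values() for g in values}
    annotation = {g: state["pathways"].get(g, "unannotated") for g in genes}
    return {**state, "annotation": annotation}


def analyze(state: dict) -> dict:
    """Average the normalized values per pathway across all kept samples."""
    by_pathway: Dict[str, List[float]] = {}
    for values in state["normalized"].values():
        for gene, value in values.items():
            by_pathway.setdefault(state["annotation"][gene], []).append(value)
    means = {p: mean(v) for p, v in by_pathway.items()}
    return {**state, "pathway_means": means}


def run_pipeline(state: dict, stages: List[Stage]) -> dict:
    """Apply each stage in order and record which stages ran."""
    state = {**state, "log": []}
    for stage in stages:
        state = stage(state)
        state["log"].append(stage.__name__)
    return state


# Made-up input: sample -> gene -> raw intensity (illustrative values only).
result = run_pipeline(
    {
        "raw": {
            "s1": {"GENE_A": 5.0, "GENE_B": 7.0, "GENE_C": 9.0},
            "s2": {"GENE_A": 4.0, "GENE_B": 6.0, "GENE_C": 11.0},
            "s3": {"GENE_A": 1.0},  # fails the toy QC rule below
        },
        "min_genes": 2,
        "pathways": {"GENE_A": "pathway_X", "GENE_B": "pathway_Y"},
    },
    [preprocess, annotate, analyze],
)
print(result["pathway_means"])
print(result["log"])
```

Because each stage takes and returns a plain dictionary, a stage can be replaced or reordered without touching the others, and the recorded `log` gives a simple trace of what was run on a given data set.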

Author information

Authors and Affiliations

Division of Oncology Biostatistics and Bioinformatics, Johns Hopkins University, 550 North Broadway, Suite 1103, Baltimore, MD, 21205, USA
Michael F. Ochs

Corresponding author

Correspondence to Michael F. Ochs.

Editor information

Editors and Affiliations

Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, N. Broadway 550, Baltimore, 21205-2011, USA
Michael F. Ochs
USC / Norris Comprehensive Cancer Ctr., Eastlake Ave. 1441, Los Angeles, 90033, USA
John T. Casagrande
Wistar Institute, Spruce Street 3601, Philadelphia, 19104, USA
Ramana V. Davuluri

About this chapter

Cite this chapter

Ochs, M.F. (2010). Genomics Data Analysis Pipelines. In: Ochs, M., Casagrande, J., Davuluri, R. (eds) Biomedical Informatics for Cancer Research. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-5714-6_6

DOI: https://doi.org/10.1007/978-1-4419-5714-6_6
Published: 06 March 2010
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5712-2
Online ISBN: 978-1-4419-5714-6
eBook Packages: Medicine, Medicine (R0)
