Abstract
The now-routine generation of large-scale, high-throughput data in multiple dimensions (genotype, gene expression, and so on) poses a significant challenge to researchers who wish to integrate data across these dimensions in the hope of painting a more comprehensive picture of complex system behavior. This type of integration promises to elucidate the networks that drive traits associated with common human diseases such as obesity, diabetes, and atherosclerosis. However, carrying out this type of research effectively requires not only the generation of large-scale genotype and molecular profiling data but also the development and application of methods and software, together with a computing infrastructure capable of processing such data sets. Mastery of these methods and tools, and access to an appropriate computing environment, will be critical to maintaining a competitive advantage, given that future successes in biomedical research will likely demand a more comprehensive view of the complex array of interactions in biological systems and of how such interactions are influenced by genetic background, infection, environmental states, lifestyle choices, and social structures more generally. In this chapter, we detail the methodological and computing issues associated with carrying out large-scale genome-wide association studies on tens of thousands of phenotypes, where the aim is to identify those phenotypes that are intermediate between DNA variations and disease phenotypes. This type of analysis can provide insights into the molecular networks that are perturbed by DNA and environmental variations and that, as a result, induce changes in disease-associated traits, providing a path to interpreting genome-wide association study data as well as uncovering the networks that drive disease processes.
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Molony, C., Sieberts, S.K., Schadt, E.E. (2009). Processing Large-Scale, High-Dimension Genetic and Gene Expression Data. In: Handbook on Analyzing Human Genetic Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69264-5_11
DOI: https://doi.org/10.1007/978-3-540-69264-5_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69263-8
Online ISBN: 978-3-540-69264-5
eBook Packages: Biomedical and Life Sciences, Biomedical and Life Sciences (R0)