Processing Large-Scale, High-Dimension Genetic and Gene Expression Data

Abstract

The now-routine generation of large-scale, high-throughput data in multiple dimensions (genotype, gene expression, and so on) poses a significant challenge to researchers who want to integrate data across these dimensions in the hope of painting a more comprehensive picture of complex system behavior. This type of integration promises to elucidate the networks that drive disease traits associated with common human diseases such as obesity, diabetes, and atherosclerosis. However, carrying out this type of research effectively requires not only the generation of large-scale genotype and molecular profiling data but also the development and application of methods and software, together with a computing infrastructure capable of processing such data sets. Mastery of these methods and tools, and access to an appropriate computing environment, will be critical to maintaining a competitive advantage, given that future successes in biomedical research will likely demand a more comprehensive view of the complex array of interactions in biological systems and of how such interactions are influenced by genetic background, infection, environmental states, lifestyle choices, and social structures more generally. In this chapter, we detail the methodological and computing issues associated with carrying out large-scale genome-wide association studies on tens of thousands of phenotypes, where the aim is to identify those phenotypes that are intermediate between DNA variations and disease phenotypes. This type of analysis can provide insights into the molecular networks that are perturbed by DNA and environmental variations and that, as a result, induce changes in disease-associated traits, providing a path to interpreting genome-wide association study data as well as uncovering networks that drive disease processes.

Author information

Corresponding author

Correspondence to Cliona Molony.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Molony, C., Sieberts, S.K., Schadt, E.E. (2009). Processing Large-Scale, High-Dimension Genetic and Gene Expression Data. In: Handbook on Analyzing Human Genetic Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69264-5_11
