Analysis of Large Genomic Data in Silico: The EPIC-Norfolk Study of Obesity

  • Jing Hua Zhao
  • Jian’an Luan
  • Qihua Tan
  • Ruth Loos
  • Nick Wareham
Part of the Communications in Computer and Information Science book series (CCIS, volume 2)


In human genetics, advances in genotyping technologies and international collaborative projects have made large-scale data available. Our ongoing study of obesity involves Affymetrix 500k genechips on approximately 7000 individuals from the European Prospective Investigation into Cancer (EPIC) Norfolk study. Although the scale of our data is well beyond the capacity of many software systems, we have successfully performed the analysis using the Statistical Analysis System (SAS) software. Our implementation trades memory use against computing time and requires only a moderate hardware configuration. Because it is built on such an established system, it extends some earlier discussions in a more constructive and accessible way. We report our findings, give some recommendations for working with SAS, and briefly compare our approach with alternative implementations. Our work is relevant to researchers analysing large-scale data in general, and to those conducting genomewide association studies in particular.


Keywords: Data mining, genomewide association, obesity, statistical analysis system





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jing Hua Zhao (1)
  • Jian’an Luan (1)
  • Qihua Tan (2)
  • Ruth Loos (1)
  • Nick Wareham (1)
  1. MRC Epidemiology Unit, The Strangeways Research Laboratory, Worts Causeway, Cambridge CB1 8RN, UK
  2. Dept of Biochemistry, Pharmacology and Genetics, Odense University Hospital, Sdr. Boulevard 29, DK-5000, Odense C, Denmark
