An Efficient and Accurate Graph-Based Approach to Detect Population Substructure

  • Srinath Sridhar
  • Satish Rao
  • Eran Halperin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4453)


Currently, large-scale projects are underway to perform whole genome disease association studies. Such studies involve the genotyping of hundreds of thousands of SNP markers. One of the main obstacles in performing such studies is that the underlying population substructure could artificially inflate the significance of the association tests, thereby generating many false positives. Although existing tools cope well with very distinct sub-populations, closely related population groups remain a major cause of concern.

In this work, we present a graph-based approach to detect population substructure. Our method is based on a distance measure between individuals. We show analytically that when the allele frequency differences between the two populations are large enough (in the ℓ2-norm sense), our algorithm is guaranteed to find the correct classification of individuals into sub-populations.
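The abstract does not spell out the distance measure or the partitioning step, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes genotypes coded as 0, 1, or 2 minor-allele copies per SNP, uses the squared Euclidean distance as the edge weight of a complete graph on individuals, and splits that graph with a simple two-seed heuristic (take the farthest pair as seeds and assign each individual to the nearer seed). The function names and the simulated allele frequencies are hypothetical.

```python
import random

def simulate(n_per_pop, freqs_a, freqs_b, rng):
    """Draw genotypes (0, 1, or 2 minor-allele copies per SNP) for two populations."""
    def draw(freqs):
        return [sum(rng.random() < p for _ in range(2)) for p in freqs]
    genotypes = [draw(freqs_a) for _ in range(n_per_pop)]
    genotypes += [draw(freqs_b) for _ in range(n_per_pop)]
    labels = [0] * n_per_pop + [1] * n_per_pop
    return genotypes, labels

def pairwise_distance(genotypes):
    """Squared Euclidean distance between every pair (an assumed distance measure)."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((x - y) ** 2 for x, y in zip(genotypes[i], genotypes[j]))
            dist[i][j] = dist[j][i] = float(d)
    return dist

def bisect(dist):
    """Two-seed heuristic: the farthest pair become seeds; others join the nearer seed."""
    n = len(dist)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    a, b = max(pairs, key=lambda p: dist[p[0]][p[1]])
    return [0 if dist[i][a] <= dist[i][b] else 1 for i in range(n)]

def accuracy(pred, truth):
    """Fraction classified correctly, up to swapping the two cluster labels."""
    agree = sum(p == t for p, t in zip(pred, truth))
    return max(agree, len(truth) - agree) / len(truth)

# Hypothetical example: 200 SNPs with strongly different allele frequencies.
rng = random.Random(0)
genotypes, truth = simulate(20, [0.1] * 200, [0.9] * 200, rng)
pred = bisect(pairwise_distance(genotypes))
print(accuracy(pred, truth))
```

With frequency differences this large the two groups are far apart in every coordinate, which is the regime the analytical guarantee refers to. Closely related populations would need many more SNPs and a more robust partitioning step than this heuristic.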

We demonstrate the empirical performance of our algorithm on simulated and real data and compare it against existing methods, namely the widely used software STRUCTURE and the recent method EIGENSTRAT. Our new technique is highly efficient (in particular, it is hundreds of times faster than STRUCTURE), and overall it is more accurate than the two other methods in classifying individuals into sub-populations. We demonstrate empirically that, unlike the other two methods, the accuracy of our algorithm consistently increases with the number of SNPs genotyped. Finally, we demonstrate that the efficiency of our method can be used to assess the significance of the resulting clusters. Surprisingly, we find that the different methods find population substructure in each of the homogeneous populations of the HapMap project. We use our significance score to demonstrate that these substructures are probably due to over-fitting.
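The abstract does not define the significance score, so the sketch below uses a generic permutation test as a stand-in, not the authors' procedure. It assumes a separation statistic (mean between-cluster distance divided by mean within-cluster distance) and a null model that shuffles each SNP column independently across individuals, which preserves allele frequencies while destroying any substructure. A fast clustering step matters here because it is re-run on every permuted dataset. All names and parameter values are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def two_seed_split(genotypes):
    """Assumed clustering step: farthest pair as seeds, others join the nearer seed."""
    n = len(genotypes)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    a, b = max(pairs, key=lambda p: sq_dist(genotypes[p[0]], genotypes[p[1]]))
    return [0 if sq_dist(g, genotypes[a]) <= sq_dist(g, genotypes[b]) else 1
            for g in genotypes]

def separation(genotypes, labels):
    """Mean between-cluster distance divided by mean within-cluster distance."""
    within, between = [], []
    n = len(genotypes)
    for i in range(n):
        for j in range(i + 1, n):
            d = sq_dist(genotypes[i], genotypes[j])
            (within if labels[i] == labels[j] else between).append(d)
    if not within or not between:
        return 0.0
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_between / mean_within if mean_within > 0 else float("inf")

def permute_columns(genotypes, rng):
    """Shuffle each SNP across individuals: same allele counts, no structure."""
    columns = [list(col) for col in zip(*genotypes)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

def empirical_p_value(genotypes, n_perm, rng):
    """Share of permuted datasets whose clusters separate at least as well."""
    observed = separation(genotypes, two_seed_split(genotypes))
    hits = 0
    for _ in range(n_perm):
        null = permute_columns(genotypes, rng)
        if separation(null, two_seed_split(null)) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)

# Hypothetical data: two groups of 10 individuals, 100 SNPs, distinct frequencies.
rng = random.Random(1)
def draw(p, m):
    return [sum(rng.random() < p for _ in range(2)) for _ in range(m)]
structured = [draw(0.1, 100) for _ in range(10)] + [draw(0.9, 100) for _ in range(10)]
p_value = empirical_p_value(structured, 19, rng)
print(p_value)
```

A small p-value suggests the split reflects real structure. For a genuinely homogeneous sample the observed separation should look like the permuted ones, which is the kind of evidence one would use to flag a reported substructure as over-fitting.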





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Srinath Sridhar (1)
  • Satish Rao (2)
  • Eran Halperin (3)
  1. Computer Science Dept, Carnegie Mellon University
  2. Computer Science Dept, University of California, Berkeley
  3. International Computer Science Institute (ICSI), Berkeley
