Separating Populations with Wide Data: A Spectral Analysis

  • Avrim Blum
  • Amin Coja-Oghlan
  • Alan Frieze
  • Shuheng Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4835)


In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of k product distributions. We are interested in the case that individual features are of low average quality γ, and we want to use as few of them as possible to correctly partition the sample. We analyze a spectral technique that is able to approximately optimize the total data size—the product of number of data points n and the number of features K—needed to correctly perform this partitioning as a function of 1/γ for K > n. Our goal is motivated by an application in clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small.
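As a rough illustration of the kind of spectral technique the abstract refers to, the sketch below partitions a small sample from a mixture of two product distributions over binary features by splitting on the sign of the top left singular vector of the centered data matrix. It is not the algorithm analyzed in the paper. The helper names (`sample_mixture`, `spectral_partition`, `agreement`), the restriction to k = 2 balanced populations, and the reading of γ as the average squared per-feature gap between the two populations' means are all assumptions made for this example.

```python
import numpy as np


def sample_mixture(n, K, gamma, rng):
    """Draw n points with K binary features from two product distributions.

    Assumption (illustrative only): every feature's Bernoulli parameter
    differs by sqrt(gamma) between the two populations, so gamma is the
    average squared per-feature gap between the population means.
    """
    # Exactly balanced hidden labels, in random order.
    labels = rng.permutation(np.arange(n) % 2)
    delta = np.sqrt(gamma) / 2.0
    signs = rng.choice([-1.0, 1.0], size=K)
    # Row 0: feature probabilities for population 0; row 1: population 1.
    p = np.stack([0.5 + signs * delta, 0.5 - signs * delta])
    X = (rng.random((n, K)) < p[labels]).astype(float)
    return X, labels


def spectral_partition(X):
    """Split the rows of X by the sign of the top left singular vector."""
    centered = X - X.mean(axis=0)  # remove the per-feature sample mean
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    return (U[:, 0] > 0).astype(int)


def agreement(pred, labels):
    """Fraction of points labeled consistently, up to swapping the two names."""
    acc = float(np.mean(pred == labels))
    return max(acc, 1.0 - acc)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Wide data: far more features than points (K > n), weak features.
    X, labels = sample_mixture(n=60, K=3000, gamma=0.05, rng=rng)
    pred = spectral_partition(X)
    print("agreement with hidden labels:", agreement(pred, labels))
```

In this toy setting each feature is only weakly informative, but with K much larger than n the difference between the population means dominates the top singular direction, which is the intuition behind trading data points for features.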


Random Matrix, Product Distribution, Singular Vector, Graph Partitioning, Spectral Technique
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Avrim Blum (1)
  • Amin Coja-Oghlan (1)
  • Alan Frieze (1)
  • Shuheng Zhou (1)
  1. Carnegie Mellon University, Pittsburgh, PA 15213, USA
