eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data

  • Itamar Eskin
  • Farhad Hormozdiari
  • Lucia Conde
  • Jacques Riby
  • Chris Skibola
  • Eleazar Eskin
  • Eran Halperin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7821)


Recent advances in high-throughput sequencing technologies promise a better characterization of genetic variation in humans and other organisms. On many occasions, either by design or by necessity, sequencing is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario occurs naturally in metagenomics analysis, where a pool of bacteria is sequenced, and in population studies that pool DNA by design. In particular, various pooling designs have recently been suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects.

A fundamental problem with such approaches for population studies is that uncertainty in the DNA proportions of different individuals in the pools might lead to spurious associations. Fortunately, the genotype data of at least some of the individuals in the pool are often known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data to accurately estimate the proportions of the samples in the pool, even when not all individuals in the pool were genotyped (eALPS-LD). Using real data from a pooled sequencing study of Non-Hodgkin’s Lymphoma, we demonstrate that estimating the proportions is crucial, since otherwise there is a risk of false discoveries. Additionally, we demonstrate that our approach is also applicable to the quantification of species in metagenomics samples (eALPS-BCR), and is particularly suitable for metagenomic quantification of closely related species.
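The estimation problem described above can be pictured with a small sketch. It is an illustration only, not the eALPS algorithm as published: it assumes every individual in the pool is genotyped, that each read at a SNP comes from individual i with probability equal to that individual's pool proportion, and that the read then shows the alternate allele with probability equal to half the genotype dosage (clipped by a hypothetical error parameter `eps`). The function name, parameters, and toy data are all assumptions made for the example. Under that simplified model, a plain expectation-maximization loop recovers the mixture proportions from per-SNP allele counts.

```python
def estimate_proportions(genotypes, alt_counts, ref_counts, n_iter=200, eps=0.01):
    """Estimate pool proportions with EM under a simple mixture model.

    genotypes[i][j] is the alternate-allele dosage (0, 1 or 2) of individual i
    at SNP j; alt_counts[j] and ref_counts[j] are pooled read counts at SNP j.
    """
    n = len(genotypes)
    m = len(alt_counts)
    # Probability that a read from individual i at SNP j shows the alternate
    # allele, clipped away from 0 and 1 to leave room for sequencing error.
    q = [[min(max(g / 2.0, eps), 1.0 - eps) for g in row] for row in genotypes]
    p = [1.0 / n] * n  # start from equal proportions
    total = float(sum(alt_counts) + sum(ref_counts))
    for _ in range(n_iter):
        expected_reads = [0.0] * n
        for j in range(m):
            alt_mix = sum(p[i] * q[i][j] for i in range(n))
            ref_mix = 1.0 - alt_mix
            for i in range(n):
                # E-step: expected number of reads at SNP j that came from i.
                expected_reads[i] += alt_counts[j] * p[i] * q[i][j] / alt_mix
                expected_reads[i] += ref_counts[j] * p[i] * (1.0 - q[i][j]) / ref_mix
        # M-step: proportions are the expected share of reads per individual.
        p = [x / total for x in expected_reads]
    return p


# Toy data (hypothetical): three genotyped individuals, six SNPs.
genotypes = [
    [0, 2, 1, 0, 2, 1],
    [2, 0, 1, 2, 1, 0],
    [1, 1, 0, 2, 0, 2],
]
true_p = [0.5, 0.3, 0.2]
depth = 1000.0
eps = 0.01
# Expected (noise-free) counts generated from the same model the EM assumes.
alt_counts, ref_counts = [], []
for j in range(6):
    freq = sum(true_p[i] * min(max(genotypes[i][j] / 2.0, eps), 1.0 - eps)
               for i in range(3))
    alt_counts.append(depth * freq)
    ref_counts.append(depth * (1.0 - freq))

estimated = estimate_proportions(genotypes, alt_counts, ref_counts, n_iter=500)
print([round(x, 3) for x in estimated])
```

With real pools the counts are noisy and some individuals may be ungenotyped, which is the harder setting the abstract attributes to eALPS-LD; the sketch does not attempt that case.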


Keywords: Relative Abundance, Pooling, Metagenomics, Expectation-Maximization





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Itamar Eskin (1)
  • Farhad Hormozdiari (2)
  • Lucia Conde (3)
  • Jacques Riby (3)
  • Chris Skibola (3)
  • Eleazar Eskin (2, 4)
  • Eran Halperin (5, 6, 7)
  1. Applied Mathematics Department, School of Mathematical Sciences, Tel-Aviv University, Israel
  2. Computer Science Department, University of California, Los Angeles, USA
  3. Division of Environmental Health Sciences, School of Public Health, University of California, Berkeley, USA
  4. Department of Human Genetics, University of California, Los Angeles, USA
  5. Computer Science Department, Tel-Aviv University, Israel
  6. International Computer Science Institute, Berkeley, USA
  7. Molecular Microbiology and Biotechnology Department, Tel-Aviv University, Israel
