Poisson-Markov Mixture Model and Parallel Algorithm for Binning Massive and Heterogenous DNA Sequencing Reads

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9683)


A major computational challenge in analyzing metagenomic sequencing data is identifying the unknown sources of massive numbers of heterogeneous short DNA reads. A promising approach is to efficiently extract and fully exploit sequence features, namely k-mers, to bin the reads according to their sources. Shorter k-mers may capture base composition information, while longer k-mers may represent read abundance information. We present a novel Poisson-Markov mixture model (PMM) that systematically integrates the information in both long and short k-mers, and we develop a parallel algorithm that improves both read-binning performance and running time. We compare the performance and running time of our PMM approach with selected competing approaches on simulated data sets, and we demonstrate its utility on a time-course metagenomics data set. The probabilistic modeling framework is flexible and general enough to address a wide range of supervised and unsupervised learning problems in metagenomics.
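As a rough illustration of the k-mer features the abstract refers to, the sketch below counts overlapping k-mers in a single read for one short and one long value of k. It is not the authors' implementation: the helper name `kmer_counts`, the example read, and the choices k = 2 and k = 8 are assumptions made only for this example, and the abstract does not state which k values PMM uses.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(read: str, k: int) -> Counter:
    """Count every overlapping k-mer in a read (hypothetical helper).

    Windows containing ambiguous bases such as N are skipped.
    """
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


# Toy read, chosen only for illustration.
read = "ACGTACGTAC"

# Short k-mers: a small vocabulary, so counts reflect base composition.
short_counts = kmer_counts(read, 2)

# Long k-mers: mostly unique within one read; how often they recur
# across a whole data set is what relates to abundance.
long_counts = kmer_counts(read, 8)

print(short_counts)
print(long_counts)
```

In a real data set the long k-mer counts would be accumulated over all reads rather than per read; dedicated k-mer counters exist for that scale.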


Probabilistic clustering; Expectation-Maximization algorithm; Metagenomics; Next-generation sequencing (NGS); Parallel algorithm



This research is partially supported by NSF grant CCF: 1451316 to D.Z.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science, Wayne State University, Detroit, USA
