Poisson-Markov Mixture Model and Parallel Algorithm for Binning Massive and Heterogenous DNA Sequencing Reads

Wang, Lu; Zhu, Dongxiao; Li, Yan; Dong, Ming

doi:10.1007/978-3-319-38782-6_2

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9683)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

A major computational challenge in analyzing metagenomic sequencing reads is identifying the unknown sources of massive, heterogeneous sets of short DNA reads. A promising approach is to extract and exploit sequence features, namely k-mers, efficiently and thoroughly enough to bin the reads by source. Shorter k-mers may capture base-composition information, while longer k-mers may capture read-abundance information. We present a novel Poisson-Markov Mixture Model (PMM) that systematically integrates the information in both long and short k-mers, and we develop a parallel algorithm that improves both read-binning performance and running time. We compare the performance and running time of PMM with selected competing approaches on simulated data sets, and we demonstrate its utility on a time-course metagenomics data set. The probabilistic modeling framework is flexible and general enough to address a wide range of supervised and unsupervised learning problems in metagenomics.
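
The abstract states no equations, so the sketch below is only a toy illustration of the two kinds of k-mer signal it describes, not the authors' PMM. It assumes short k-mers are scored with a first-order Markov chain and long k-mer counts with a Poisson distribution. The function names, the uniform chain, the rate `lam=2.0`, and the toy reads are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count overlapping k-mers across an iterable of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts


def markov_log_prob(read, trans, start):
    """Log-probability of a read under a first-order Markov chain.

    trans[a][b] is P(next base = b | current base = a);
    start[a] is P(first base = a).
    """
    logp = math.log(start[read[0]])
    for a, b in zip(read, read[1:]):
        logp += math.log(trans[a][b])
    return logp


def poisson_log_pmf(x, lam):
    """Log of the Poisson probability mass function at count x, rate lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


reads = ["ACGTACGT", "CGTACGTA", "TTTTGGGG"]  # toy reads, not real data

# Short k-mers (k=2): composition profile of a single read.
composition = kmer_counts([reads[0]], 2)

# Long k-mers (k=5): counts pooled over all reads, used as an abundance proxy.
pooled = kmer_counts(reads, 5)
abundance = [pooled[reads[0][i:i + 5]] for i in range(len(reads[0]) - 4)]

# A uniform chain stands in for one bin's composition parameters (assumption).
bases = "ACGT"
uniform_start = {b: 0.25 for b in bases}
uniform_trans = {a: {b: 0.25 for b in bases} for a in bases}

# Two per-bin score terms for the first read, under the assumed parameters.
composition_term = markov_log_prob(reads[0], uniform_trans, uniform_start)
abundance_term = sum(poisson_log_pmf(x, lam=2.0) for x in abundance)
```

In a mixture setting, terms like these would be computed once per candidate bin and combined with mixing weights. How the paper actually couples the Poisson and Markov parts is not stated in the abstract.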

Acknowledgment

This research is partially supported by NSF grant CCF: 1451316 to D.Z.

Author information

Authors and Affiliations

Department of Computer Science, Wayne State University, Detroit, MI, 48202, USA
Lu Wang, Dongxiao Zhu, Yan Li & Ming Dong

Corresponding author

Correspondence to Dongxiao Zhu.

Editor information

Editors and Affiliations

Dept. of Computer Science, Georgia State University, Atlanta, Georgia, USA
Anu Bourgeois
Centre for Disease Control & Prevention, Atlanta, Georgia, USA
Pavel Skums
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Xiang Wan
Georgia State University, Atlanta, Georgia, USA
Alex Zelikovsky

About this paper

Cite this paper

Wang, L., Zhu, D., Li, Y., Dong, M. (2016). Poisson-Markov Mixture Model and Parallel Algorithm for Binning Massive and Heterogenous DNA Sequencing Reads. In: Bourgeois, A., Skums, P., Wan, X., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2016. Lecture Notes in Computer Science, vol 9683. Springer, Cham. https://doi.org/10.1007/978-3-319-38782-6_2

DOI: https://doi.org/10.1007/978-3-319-38782-6_2
Published: 27 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-38781-9
Online ISBN: 978-3-319-38782-6
eBook Packages: Computer Science; Computer Science (R0)
