AbundanceBin, Metagenomic Sequencing
Keywords: Synonymous Codon Usage, Metagenomic Sequence, Lower Common Ancestor, Metagenomic Dataset
Binning is unsupervised clustering of metagenomic sequences into an unknown set of species.
AbundanceBin is a binning tool utilizing the different abundances of the species in a community.
Binning is one of the challenging problems in metagenomics, and it has two main applications. The first is studying the structure of microbial communities. The second is improving the downstream analysis of metagenomic sequences, including metagenome assembly (which has been shown to be extremely difficult): assembling reads one bin at a time significantly reduces the complexity of the metagenome assembly problem.
Composition-based methods have been the main approaches to unsupervised classification of reads. These approaches rest on the observation that genome composition features (G + C content, dinucleotide frequencies, and synonymous codon usage) vary among organisms and are generally characteristic of evolutionary lineages. Tools in this category include TETRA (Teeling et al. 2004), TACOA (Diaz et al. 2009), and MetaCluster (Leung et al. 2011). Because sequence properties vary substantially along a genome, the main limitation of composition-based approaches is that they require relatively long reads (at least 800 bp), although MetaCluster (Leung et al. 2011) has been shown to bin reads of 300 bp by employing a different distance metric (Spearman footrule distance) that reduces the effect of local variations in 4-mer frequencies.
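The idea of comparing reads by the rank order of their 4-mer frequencies can be illustrated with a minimal sketch. The function names below are hypothetical, and the code is not MetaCluster's implementation; it only shows one common form of the Spearman footrule distance (the sum of absolute rank differences), applied to 4-mer frequency vectors.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized frequencies for all 4**k DNA k-mers, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def ranks(values):
    """Rank values so that 1 is the largest; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_footrule(freq_a, freq_b):
    """Sum of absolute rank differences between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(ranks(freq_a), ranks(freq_b)))
```

Because the distance depends only on ranks, a modest change in the frequency of a single 4-mer moves the distance less than it would under a metric computed on the raw frequencies.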
Note that a large collection of methods has also been developed to classify sequencing reads in a supervised manner. MEGAN (Huson and Mitra 2012) is a representative approach of this kind. These methods either use composition information (as in NBC, a naïve Bayes classifier for metagenomic sequence classification (Rosen et al. 2011)) or employ similarity searches of metagenomic sequences against a database of known genes/proteins (as in MEGAN) and assign metagenomic sequences to taxa accordingly, with or without using phylogeny. They also differ in the algorithms used for classification: MEGAN pioneered the lowest common ancestor (LCA) algorithm (Huson et al. 2007), MTR (Gori et al. 2011) improves on the LCA algorithm by considering multiple taxonomic ranks, and MetaPhyler (Liu et al. 2011) achieves better classification results by tuning the taxonomic classifier to each matching length, reference gene, and taxonomic level. Some tools in this category can classify only a subset of the metagenomic sequences rather than all of them. MLTreeMap (Stark et al. 2010) uses phylogenetic analysis of 31 marker genes for taxonomic distribution estimation. CARMA (Krause et al. 2008) searches for conserved Pfam domains and protein families in raw metagenomic sequences and classifies them into a higher-order taxonomy. The RDP classifier, which uses a naïve Bayes classifier, was designed for classification of 16S rRNA genes and was later extended to classification of 18S rRNA genes (Cole et al. 2009).
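The core of the LCA idea is simple: a read that matches reference sequences from several taxa is assigned to the deepest taxonomic node shared by all of them. The sketch below illustrates this on a simplified toy taxonomy; the `parent` table and function names are assumptions made for illustration and do not reflect MEGAN's data structures or hit-filtering rules.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root (inclusive)."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all given taxa."""
    paths = [lineage(t, parent)[::-1] for t in taxa]  # root first
    lca = None
    for nodes in zip(*paths):
        if all(n == nodes[0] for n in nodes):
            lca = nodes[0]
        else:
            break
    return lca


# Toy taxonomy (child -> parent); intermediate ranks are omitted on purpose.
parent = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Enterobacteriaceae": "Proteobacteria",
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Bacillus subtilis": "Firmicutes",
}

# A read with hits to two closely related species is placed at their family.
hits = ["Escherichia coli", "Salmonella enterica"]
assignment = lowest_common_ancestor(hits, parent)
```

A read whose hits span distant taxa ends up at a high-level node, which is why LCA-style assignment tends to be conservative for ambiguous reads.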
AbundanceBin (Wu and Ye 2011) is the first unsupervised clustering algorithm that utilizes abundance information of the species in the same microbial community to group reads into bins. The fundamental assumption of the AbundanceBin algorithm is that reads are sampled from genomes following a Poisson process, such that the sequencing reads can be modeled as a mixture of Poisson distributions.
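The mixture assumption can be written compactly. The notation here is chosen for illustration and is not taken from the original publication: x is an observed count, S is the number of bins, λ_i is the mean of the i-th Poisson component (reflecting the abundance of the corresponding species), and α_i is its mixing weight.

```latex
P(x) \;=\; \sum_{i=1}^{S} \alpha_i \, \frac{\lambda_i^{x} \, e^{-\lambda_i}}{x!},
\qquad \sum_{i=1}^{S} \alpha_i = 1, \quad \alpha_i \ge 0
```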
An expectation–maximization (EM) algorithm is used in AbundanceBin to find the parameters of the Poisson distributions (i.e., the means), which reflect the relative abundance levels of the source species. AbundanceBin then assigns reads to bins based on the fitted Poisson distributions. In an unsupervised manner, and without requiring prior knowledge of the structure of the microbial community, AbundanceBin estimates for each bin the genome size (or the concatenated genome size of species of the same or very similar abundances) and the coverage (which reflects the abundances of species). The EM algorithm needs an important parameter, the number of bins, which is unknown for most metagenomic projects. AbundanceBin solves this problem by using a recursive binning approach to determine the total number of bins automatically. The recursive binning approach works by separating a dataset into two bins and then further splitting the resulting bins. The recursive procedure continues if (1) the predicted abundance values of the two bins differ significantly; (2) the predicted genome sizes are larger than a certain threshold; and (3) the number of reads associated with each bin is larger than a certain threshold proportion of the total number of reads classified in the parent bin.
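The EM fitting step and the recursion criteria can be sketched as follows. This is a generic EM procedure for a mixture of Poisson distributions fitted to a list of counts, not AbundanceBin's actual implementation; the function names, the initialization, and every threshold in `should_split` are hypothetical placeholders chosen for illustration.

```python
import math


def poisson_logpmf(x, lam):
    """Log of the Poisson probability of count x given mean lam (> 0)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def fit_poisson_mixture(counts, n_bins=2, n_iter=200, tol=1e-8):
    """Fit a mixture of Poissons to non-negative integer counts with EM."""
    lo, hi = min(counts), max(counts)
    # Spread the initial means across the observed range (arbitrary choice).
    lams = [max(lo + (hi - lo) * (j + 0.5) / n_bins, 1e-3) for j in range(n_bins)]
    weights = [1.0 / n_bins] * n_bins
    prev_ll, resp = -math.inf, []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each count.
        resp, ll = [], 0.0
        for x in counts:
            logs = [math.log(w) + poisson_logpmf(x, l) for w, l in zip(weights, lams)]
            m = max(logs)
            norm = m + math.log(sum(math.exp(v - m) for v in logs))
            ll += norm
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate mixing weights and Poisson means.
        for j in range(n_bins):
            rj = sum(r[j] for r in resp)
            weights[j] = max(rj / len(counts), 1e-12)
            weighted_sum = sum(r[j] * x for r, x in zip(resp, counts))
            lams[j] = max(weighted_sum / max(rj, 1e-12), 1e-3)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, lams, resp


def assign_bins(resp):
    """Assign each observation to the component with the highest posterior."""
    return [max(range(len(r)), key=lambda j: r[j]) for r in resp]


def should_split(bin_a, bin_b, parent_reads,
                 min_ratio=2.0, min_genome_size=100_000, min_read_fraction=0.05):
    """Check the three recursion criteria; all thresholds are hypothetical."""
    a, b = bin_a["abundance"], bin_b["abundance"]
    abundances_differ = max(a, b) / min(a, b) >= min_ratio
    sizes_ok = min(bin_a["genome_size"], bin_b["genome_size"]) > min_genome_size
    reads_ok = min(bin_a["n_reads"], bin_b["n_reads"]) / parent_reads > min_read_fraction
    return abundances_differ and sizes_ok and reads_ok


# Made-up counts drawn from a low-abundance and a high-abundance source.
counts = [2, 3, 1, 2, 4, 3, 2, 1, 28, 30, 33, 29, 31, 27, 32, 30]
weights, lams, resp = fit_poisson_mixture(counts, n_bins=2)
labels = assign_bins(resp)
```

In a recursive scheme, a two-component fit like this would be applied to a dataset, and each resulting bin would be split again only while a check such as `should_split` returns true.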
AbundanceBin achieves accurate classification of even very short sequences sampled from species with different abundance levels, as tested on simulated and real metagenomic datasets. The software is available for download at http://omics.informatics.indiana.edu/AbundanceBin.
Integrated Binning Methods
MetaCluster 3.0 is an integrated binning method based on an unsupervised top–down separation and bottom–up merging strategy, which can bin metagenomic fragments of species whose abundance ratios range from very balanced to very different (Leung et al. 2011). MetaCluster 4.0 further improves the binning algorithm and is able to handle datasets with a large number of species (e.g., 100 species) (Wang et al. 2012). MetaCluster is available for download at http://i.cs.hku.hk/~alse/MetaCluster/.
Joint Analysis of Multiple Metagenomic Samples
Baran and Halperin proposed an abundance-based (also termed coverage-based) binning algorithm, MultBin, that operates on multiple samples of the same environment simultaneously, assuming that the different samples contain the same microbial species, possibly in different proportions (Baran and Halperin 2012). MultBin employs a k-medoids clustering algorithm to cluster reads according to their coverage across the samples. Testing of MultBin on simulated metagenomic datasets shows that integrating information across multiple samples yields more precise binning on each of the samples.
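Clustering by coverage across samples can be illustrated with a small k-medoids sketch. This is a generic alternating k-medoids procedure, not MultBin's implementation; the Manhattan distance, the deterministic initialization, and the made-up coverage vectors (one value per sample) are all assumptions for illustration, and how per-read coverage is estimated is outside the scope of the example.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, k, n_iter=100, dist=manhattan):
    """Cluster points around k medoids; returns (medoid indices, labels)."""
    # Deterministic farthest-point initialization (an arbitrary choice).
    medoids = [0]
    while len(medoids) < k:
        far = max(range(len(points)),
                  key=lambda i: min(dist(points[i], points[m]) for m in medoids))
        medoids.append(far)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist(p, points[medoids[c]]))
                  for p in points]
        # Update step: the new medoid minimizes total distance in its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            best = min(members,
                       key=lambda i: sum(dist(points[i], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Made-up coverage of six reads in two samples: (sample 1, sample 2).
coverage = [(10, 1), (11, 2), (9, 1), (1, 12), (2, 10), (1, 11)]
medoids, labels = k_medoids(coverage, k=2)
```

Reads from a species that is abundant in the first sample and rare in the second end up in a different cluster from reads with the opposite pattern, even when their coverage in any single sample would be hard to separate.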
Abundance-based (or coverage-based) binning approaches perform accurately even for extremely short reads when species differ in abundance, an ability that composition-based approaches lack because the composition of short reads varies too much. Approaches that integrate abundance and composition information, and approaches that utilize multiple samples, have shown promising binning results.