Metagenomics, the culture-independent isolation and characterization of DNA from uncultured microorganisms [1], has facilitated the analysis of the functional biodiversity harbored in the large reservoir of uncultured bacteria and archaea [24]. Although early metagenomic studies identified individual genes or activities of interest, recent advances in genome sequencing technologies have made obtaining a complete metagenomic sequence more tractable. Sequence-based approaches combined with functional expression approaches have the potential to identify novel genes important for industrial and ecological applications. Sequence-based approaches have recently been applied to DNA obtained from viruses [5, 6], seawater [710], wastewater [11, 12], sediment [13], sponges [14], acid mine drainage [15], marine worms [16], human gut [17], soil [18], and decomposing whale carcasses [18]. The analysis used to describe these communities has primarily focused on the descriptive characterization and comparison of the relative abundance of proteins that belong to specific functional categories.

Attempts to analyze metagenomic sequences have proven that a metagenomic sequence is more than just a large genome sequencing project. First, the goal of most genome sequence projects is a closed genome sequence where every nucleotide is represented by a desired number of independent sequence reads. In metagenomics, the probability of finding overlapping sequence reads is low in most environments [19]. The probability that overlapping sequence reads are from the same population of bacteria or archaea is even lower so that contigs that are formed are out of necessity chimeras of different genomes that may not even be from the same phylum [20]. Second, a closed genome represents a statistical population of the genes harbored by that organism; therefore, comparing genome sequences for the presence or absence of genes is straightforward. Since it is not possible to close a metagenome, every metagenomic sequence collection represents a statistical sample of the genomes in an environment. Therefore, it is necessary to treat the comparison of communities as a statistical problem. Third, although lab-based cultures that are sequenced do evolve, the differences between lab stocks is minimal compared to the changes faced by natural communities over short periods of time. This makes it difficult to reanalyze a community once a genome sequence has been obtained to improve annotations and understand gene expression.

Five general approaches have been taken to bring statistical analysis to the analysis of metagenomic sequences. The first adapts genomics-based approaches to metagenomics by constructing and curating databases to aid in the annotation and analysis of genes and the contigs they reside on [21, 22]. Unfortunately, although such databases provide a critical infrastructure, given the large number of ORFs that have no known function (e.g. 69% in the Sargasso Sea [7]) and the paucity of contigs formed from many sequencing projects (e.g. <1% in the soil [18]), such database searches will be of limited value for comparative metagenomics. The second approach to analyzing metagenomic sequences has been based on the comparison on the relative abundance of annotation categories within the different sequence collections and within databases of assembled genomes [9, 10, 17, 18, 23]; these methods implicitly assume that the metagenomic sequences represent a statistical population and/or that the reference databases represent the normal distribution of genes in communities. A third set of approaches attempts to assign a phylogenetic origin for a sequence fragment in the absence of a phylogenetic anchor (e.g. 16S rRNA gene) using nucleotide frequency analysis or sequence signatures [2427]. Such methods are limited for use with most environments because of the difficulty in forming contigs that are long enough to carry out a robust analysis and assume that the contigs that form are not chimeric. A fourth approach has attempted to compare communities without an annotation. These have attempted to quantify the species richness of communities based on the distribution of sequence read depth among contigs [7] and to compare the diversity of communities based on the relative frequency of different length oligonucleotides in the DNA sequence pool [28]. Finally, there have been attempts to using traditional population biology by analyzing the diversity of specific families of genes found in metagenomic collections [29].

Based on previous metagenomic sequencing efforts, we were interested in developing statistical tools to compare the richness, membership, and structure of the complement of ORFs from multiple communities in which assembly of the entire genomes is not possible. To address this problem, we adapted a set of statistical tools designed to analyze collections of 16S rRNA gene sequences to the analysis of protein coding genes [3033]. Our goal was to provide additional tools to make statistical and ecological inferences using metagenomic sequence data. Instead of using a traditional pairwise DNA distance matrix obtained from a sequence alignment of homologous genes as is done with 16S rRNA genes, we used BLAST score ratios (BSRs) to develop a distance matrix that represents the similarity of ORFs across homologous groups [34]. To make comparisons among communities, we propose grouping ORFs into operational protein families (OPFs) which are analogous to operational taxonomic units (OTUs) derived from 16S rRNA gene sequences.

Results and discussion

A new distance matrix

The goal of this aspect of the work was to develop a method to compare sequence alignments that circumvented the considerable computational effort required to obtain every possible global sequence alignment and pairwise distance. We used local alignments provided in BLAST and the resulting pairwise BLAST scores to generate BSRs. The BSRs approximate the fraction of identical amino acids between two peptide fragments so that a BSR value of 0.30 between two fragments means that they are approximately 30% identical over their full length. By analogy to the analysis of 16S rRNA gene sequences of uncultured bacteria where OTUs are developed based on a distance matrix, we propose using BSR values to define OPFs. Depending on the goals of the analysis an OPF can be defined as necessary. For illustrative purposes and based on previous implementations of BSRs for comparative genomics applications [34, 35], unless otherwise indicated we will operationally define an OPF as a collection of fragments that have a BSR greater than 0.30.

To assess the feasibility of using peptide fragments from individual sequence reads, we identified peptide fragments from the individual sequence reads used to assemble the Bacillus anthracis, str. Ames genome (GenBank Accession NC_003997, [36]), which contains 4,514 ORFs that were longer than 100 aa. From the individual sequence reads, we identified 92,220 peptide fragments longer than 100 aa. The computational effort required for the pairwise alignment and distance calculation among 92,220 ORFs was prohibitive. Because we expected a majority of the peptide pairs would not have significant similarity, we used BLAST to identify those comparisons that had significant similarity and to calculate BSRs as a surrogate for similarity or distance values (distance = 1-BSR). Instead of generating a 92,220 × 92,220 matrix with 8.5 × 109 values, we took advantage of the sparseness of the matrix to simplify the calculations and construct a set of three linked-lists in which each list contained the row, column, and BSR values of the full BSR matrix. Since the BSR for a peptide fragment compared to itself is 1.0 and the BSR for a non-significant comparison is 0.0, the corresponding entries in the linked lists could be removed. Once this was completed, there were 2.1 × 106 values, which represented a significant reduction in the memory required to store the data.


To assign peptide fragments to OPFs we rewrote the computer code for DOTUR to be compatible with sparse BSR matrices. DOTUR is used to assign collections of 16S rRNA gene sequences and to use the resulting frequency distribution of sequences among OTUs to estimate richness and diversity (Table 1). By analogy, MG-DOTUR assigns peptide fragments to OPFs and estimates the richness and diversity of OPFs for any desired OPF definition. Two classes of methods are available to estimate richness based on frequency distributions. The first uses parametric distributions such as the lognormal distribution to predict the number of unseen groups in a community [37]. Although it is often assumed that microbial communities follow a lognormal distribution, there are no published examples in the microbial ecology literature for which the observed data support such an assumption. This is primarily due to the difficulty in obtaining a sufficient number of observations to implement these methods. An alternative approach uses non-parametric estimators that do not assume an underlying frequency distribution and are relatively easy to compute. These estimators are implemented in DOTUR and MG-DOTUR.

Table 1 Tools used to describe and compare microbial communities.

Based on the observed frequency distribution of peptide fragments in each OPF0.30, we applied the Chao1, ACE, and interpolated Jackknife richness estimators to predict the OPF0.30 richness. The predicted OPF richness was approximately three times greater than the OPF richness that was observed in the assembled B. anthracis genome (Table 1). When we mapped each OPF from the closed genome to the OPFs from the individual sequence reads we found that each OPF from the closed genome was linked to an average of 3.08 (s.d. = 2.75) OPFs from the sequence reads. Further inspection showed that the multiple OPFs from the sequence reads corresponded to different regions of long ORFs from the closed genome sequence. Similar results have been observed when attempting to estimate the number of expressed genes using expressed sequence tags [38].

To overcome this problem, we developed a method of merging OPFs from the sequence reads to obtain a more meaningful OPF distribution. For two OPFs to merge, we required that the carboxyl-terminus of at least one sequence in the first OPF overlap with the amino-terminus of at least one sequence in the second OPF by at least 5 amino acids. Furthermore, we incorporated a BSR penalty so that for two OPFs to merge the overlapping region had to have a BSR greater than the BSR currently being used to form clusters. We used penalties of 0.00, 0.05, 0.10, 0.15, and 0.20 (Table 1). We then applied this merging scheme to the OPFs from the sequence reads and calculated two types of error [38]. Type I errors corresponded to the fraction of OPFs from the closed genome that mapped to multiple OPFs from the sequence reads. Type II errors corresponded to the fraction of OPFs from the sequence reads that corresponded to different OPFs from the closed genome (Table 2). We found that as we increased the penalty, the Type I error decreased and the Type II error increased. Based on this analysis, we decided to implement a penalty of 0.15 because both types of error were 7.1 and 7.4%, respectively. When the resulting frequency distribution was used to calculate collector's curves using the observed and predicted richness, the curves converged towards the true OPF richness (Fig. 1A). This was used to further validate the choice of penalty. A limitation of this approach is that the resulting number of peptide fragments in a merged OPF is a product of the length of the complete ORF and the relative abundance of the ORF in the metagenome. Therefore, we will report OPF richness from merged analysis and annotations from both merged and non-merged analyses.

Figure 1
figure 1

Analysis of the richness and community membership when peptide fragments identified in individual sequence reads were used to assemble the Bacillus anthracis str. Ames genome sequence. (A) The collector's curves for three non-parametric richness estimators and observed richness using individual sequence reads compared to the OPF richness of the assembled genome (horizontal black line). The solid lines represent the richness of non-merged OPFs and the dashed lines represent the richness of merged OPFs with a penalty of 0.15. (B) Collector's curves of parameters describing the similarity between two randomly selected subsets of peptide fragments.

Table 2 Summary of errors and richness estimates when different criteria were used to merge OPFs. OPFs were merged when at least one peptide fragment in each OPF overlapped at least 5 aa and had a BSR value that was above the user specified level by the merge penalty. The type I error rate is the fraction of OPFs from the closed genome that correspond to multiple OPFs from the individual sequence reads. The type II error rate is the fraction of OPFs from the individual sequence reads that corresponded to more than one OPF from the closed genome sequence.

Comparing membership and structure using OPFs

Other tools have been developed to compare the membership (e.g. SONS) and structure (e.g. ∫-LIBSHUFF and AMOVA) of microbial communities using 16S rRNA gene sequences. Again, by analogy we were interested in using OPFs and BSRs to compare microbial communities using metagenomic sequences. SONS uses the output of DOTUR and MG-DOTUR to complete its analysis and required no further modification for use with metagenomic sequences. ∫-LIBSHUFF and AMOVA were modified to use the sparse matrix data representation used in MG-DOTUR. The resulting programs were designated MG-LIBSHUFF and MG-AMOVA. To test these programs, we randomly divided the 92,220 B. anthracis peptide fragments into two artificial communities that were each represented by 46,110 peptide fragments.

We first applied SONS to these two communities to compare the membership and structure of the artificial communities using OPFs. We calculated the shared OPF richness using the Chao non-parametric estimator of shared richness and obtained a value of 3,561 OPFs. Although this estimate of shared richness is lower than 95% confidence interval observed for the total collection of peptide fragments using the Chao1, ACE, or Jackknife estimators, the shard Chao estimator was still increasing with additional sampling (Fig. 1B). This indicates that if sequencing had continued the estimate of shared richness would have probably overlapped eventually. The abundance-based Jaccard (Jabund) estimate of similarity was 1.00, which predicted that all of the peptide fragments belonged to shared OPFs0.30. Yue and Clayton's measure of community overlap, θ, was 0.97, which indicated that the distribution of peptide fragments among OPFs was the same in both artificial communities. These results indicate that SONS is amenable to analyzing OPFs to detect similarity between the memberships and structures of different communities.

An alternative approach to comparing community structures is to perform hypothesis tests. AMOVA uses an analysis of variance (ANOVA)-type framework to test the hypothesis that the difference in genetic diversity between two or more communities is not significantly different than the diversity within each community. We implemented this analysis in a program designated MG-AMOVA to perform a single-classification analysis. Our comparison of two randomly generated B. anthracis peptide fragment pseudo-communities revealed that the observed differences between the two pseudo-communities were not statistically significant (p > 0.05). Next we modified the program ∫-LIBSHUFF to create MG-LIBSHUFF to test the hypothesis that two communities have the same structures. As expected, the differences in structure between the two pseudo-communities were not statistically significant (P > 0.05). Each of these comparisons indicate that we can make statistical comparisons between the membership and structure of microbial communities using peptide fragments identified in single sequence reads from metagenomic data.

Acid Mine Drainage

Tyson et al. [15] used metagenomic sequencing to analyze a biofilm growing in acid mine drainage (AMD) that had a pH below 1.0. They obtained 322 archaeal and bacterial 16S rRNA gene sequences and 103,462 random paired sequence reads, which represented 76.2 Gbp of DNA. We used DOTUR to assign 16S rRNA gene sequences to nine OTUs and predicted there were an additional three OTUs (95% confidence interval [95% CI] = 0 to 8) that were not observed (Fig. 2A). The most abundant OTU was similar to Leptospirillum ferriphilum (n = 247) 16S rRNA gene sequences.

Figure 2
figure 2

Collector's curves for the OTU (A) and OPF (B) richness observed and estimated using DNA extracted from an AMD biofilm community.

Next, we used MG-DOTUR to assign 99,419 peptide fragments to 10,235 merged OPFs. The dominant merged OPF (n = 901 fragments) did not have a homolog in GenBank and the next most abundant merged OPFs' were most similar to a conserved hypothetical protein from Leptospirillum sp. Group II UBA (n = 773, EAY56482) and a transposase (n = 461, ZP_00669012). The dominant non-merged OPF did not have a homolog in GenBank (n = 114 fragments) and the next most abundant OPFs were most similar to an HNH nuclease (n = 96, ZP_01023224) and a mutator-type transposase (n = 88, ZP_00669012). The Chao1 richness estimator predicted that there were a minimum of 18,463 merged OPFs0.30 in the community (95% confidence interval [95% CI] = 17,794 to 19,191; Fig. 2B). Considering the lack of a known function for two of the most abundant OPFs in the AMD community, this analysis shows the importance of including such sequences in metagenomic sequence analyses and may indicate that subsequent analysis of this group of sequences would reveal important physiological information about the community.


Tringe et al. [18] used Minnesotan farm soil to build libraries and sequence 1,633 bacterial 16S rRNA gene fragments and 149,085 random DNA fragments, representing 76 Gbp of DNA. We previously showed that the OTU richness was approximately 2,000 [39]. The three most abundant OTUs were representatives of the Chloroflexi.

Using MG-DOTUR to analyze the random metagenomic sequence reads, the 143,422 peptide fragments clustered into 98,066 merged OPFs. The members of the dominant merged OPF had similarity to a putative two-component response regulator (n = 688; NP_254170). The next most abundant merged OPFs had similarity to a histidine kinase (n = 566; YP_386369) and a serine/threonine protein kinase (n = 371; YP_825781). The three most abundant non-merged OPFs in the soil community had homology to a putative response regulator (n = 29, NP_520928), a PadR-like transcriptional regulator (n = 21, ZP_00524755), and a Cu2+-transporting ATPase (n = 20, ZP_01060472). Because of the considerable diversity in the soil sample, an insufficient number of peptide fragments were sampled to obtain a reliable OPF richness estimate; however, using the Chao1 richness estimator we predicted that the OPF richness was at least 361,546 (95% CI = 355,613 to 367,615; Fig. 3B). Although considerable additional sequencing effort is required to obtain a reliable estimate of OPF richness, it is interesting that in spite of the relatively large OTU and OPFs richness, it was possible to assign a large number of peptide fragments to the same OPF.

Figure 3
figure 3

Collector's curves for the OTU (A) and OPF (B) richness observed and estimated using DNA extracted from an agricultural soil in Minnesota, USA.

Whalebone communities

Tringe et al. [18] compared three bacterial communities growing on the bones of two whales (AHAA and AHAI were from the same whale) at the bottom of the Pacific Ocean using 16S rRNA and metagenomic sequence analysis. Based on 16S rRNA sequence data, the three communities designated AGZO (n = 73), AHAA (n = 65), and AHAI (n = 68) had a Chao1-estimated OTU richness of at least 140 (95% CI = 67 to 366), 48 (95% CI = 29 to 121), and 19 (95% CI = 17 to 34). The most abundant OTU0.03 from each of the three communities affiliated with members of the Arcobacter sp. (n = 15), Bacteroidetes (n = 12), and Flavobacteriales (n = 19), respectively. We estimated that each of the three communities shared between 1 and 3 OTUs0.03 with any of the other communities. The lack of conservation of membership between the three communities resulted in low Jabund coefficients (0.01 to 0.19), θ values (0.04 to 0.11), and statistically significant P values when comparing the communities using AMOVA and ∫-LIBSHUFF (all p < 0.001). Although the three communities each came from similar environments, the taxonomic membership and structure of the three communities were considerably different.

We applied the newly developed statistical tools to the metagenomic sequences of the three communities to assess their genetic and functional similarities. The three communities, AGZO, AHAA, and AHAI, yielded approximately 38,000 (25 Mbp), 38,000 (25 Mbp), and 40,000 (25 Mbp) sequence reads and 38,981, 36,165, and 33,199 peptide fragments, which were over 100 aa long, respectively. The dominant merged OPFs in each community were similar to a histidine kinase (AGZO, n = 386; YP_341128) and an ABC transporter (AHAA, n = 175 and AHAI, n = 166; ZP_01203057). The most abundant non-merged OPF found in each community was homologous to a conserved hypothetical protein (AGZO: n = 22, NP_442017), RecR (AHAA: n = 9, ZP_00952890), another conserved hypothetical protein (AHAI: n = 16, ZP_00949155), and a putative transposase (AHAI: n = 16, ZP_00903285). The Chao1 OPF richness estimates for each of the communities continued to increase with additional sampling, indicating that the communities had a minimum OPF richness of 69,541 (95% CI = 67,618 to 71,550), 77,923 (95% CI = 75,699 to 80,276), and 49,120 (95% CI = 47,767 to 50,539) for the AGZO, AHAA, and AHAI communities, respectively.

Although there was an insufficient number of peptide fragments to obtain a reliable estimate of the fraction of OPF membership that was shared between any two of the three communities, we estimated that they shared at least between 10 and 20% of their OPF membership (Fig. 4). The "core" whalebone OPF membership that was shared among the three whalebone communities had a richness of at least 3,800 OPFs (approximately 2.5% of the total richness); 1,678 of these were actually observed in the sequence collection. The most commonly shared OPFs among the three communities represented a variety of activities including metal transport, sensors, and housekeeping functions (Table 3).

Figure 4
figure 4

Venn diagram comparing the OPF membership found in three whalebone microbial communities (AGZO, n = 38,981 peptide fragments; AHAA, n = 36,165; and AHAI, n = 33,199). Below each community name is the Chao1 richness estimate and the 95% confidence interval for that community. We estimated the richness of the overlapping regions based on the pairwise SA,B Chao shared richness estimates between the three communities and by pooling two communities and estimating the shared fraction with the third community. These estimates are provided on the right side of the figure.

Table 3 Summary of most abundant merged and non-merged OPFs from the three whalebone communities.

Comparison of the community structures using the peptide fragments using MG-LIBSHUFF (all p < 0.001) and MG-AMOVA (all p < 0.001) found that the structures of these three communities were significantly different. Using OPFs, θ varied between 0.39 and 0.55 indicating that there was some similarity in community structure. The ability to quantify and assess the differences in communities without exhaustive sampling of the three whalebone communities indicates the importance of applying statistical methods to metagenomic sequence data. Such analyses make comparative metagenomics amenable to ecologically-based hypothesis testing.

Comparison of the three environments

To assess the relative similarity of OTU0.03 membership between environments, we used DOTUR to cluster the 2,161 16S rRNA gene fragments from the AMD (n = 322), soil (n = 1,633), and whalebone communities (n = 206). No OTUs were shared between any two of the three communities; however, additional sampling may have identified OTUs that were shared between environments.

We compared the relative similarity of OPF membership between environments by clustering the 351,186-peptide fragments from the AMD (n = 99,419), soil (n = 143,422), and whalebone communities (n = 108,345) using MG-DOTUR and then we estimated the membership and structure overlap among the three communities (Fig. 5). Measuring the overlap of OPFs measurement among the three communities resulted in the estimate that more than 800 OPFs were shared among the five communities; this represents less than 0.3% of the total OPF richness found in the five communities. Of this pool, 774 merged OPFs and were actually observed with functions including metal transport, housekeeping, and various dehydrogenase activities (Table 4). Applications of the statistical tools to these types of comparisons will enable researchers to investigate the problem of biogeography using genome-based methods.

Figure 5
figure 5

Venn diagram comparing the pooled OPF membership found in the AMD (n = 99,419 peptide fragments), soil (n = 143,422), and whalebone (n = 108,345) microbial communities.

Table 4 Summary of most abundant merged and non-merged OPFs from the AMD, soil, and whalebone communities.

For comparison, we compared the complement of ORFs from the fully sequenced Bacillus anthracis str. 'Ames Ancestor' (GenBank accession AE017334), Bacillus cereus ATCC 10987 (AE017194), Escherichia coli K12 (U00096), Methanosarcina acetivorans C2A (AE010299), Methanosarcina barkeri str. fusaro (CP000099) genomes. We used MG-DOTUR to assign ORFs to OPFs and then we used SONS to compare the OPF0.30 overlap between these genomes, which we selected for their phylogenetic similarity and breadth. As predicted based on current understanding of phylogenetics, the more closely related organisms had the greatest OPF0.30 overlap. The comparison between B. anthracis and B. cereus yielded Jclas and θ values of 0.70 and 0.74, E. coli and Y. pestis yielded values of 0.43 and 0.20, and M. acetivorans and M. barkeri yielded values of 0.54 and 0.43. All of the other pairwise comparisons yielded values below 0.08 for both parameters. This analysis suggests that the comparisons between the OPFs0.30 identified in the metagenomic sequences represent the level of differences expected between phylogenetically disparate groups of bacteria. Furthermore, analyses using completed genome sequences may enable investigators to define the size and boundaries of so-called "pan-genomes."


We present a statistical toolbox to estimate the functional richness and overlap among communities based on peptide fragments deduced from DNA sequence data. These statistical approaches are necessary, in part, because the immense genomic diversity contained in most communities precludes the formation of contigs. There is also considerable question regarding the robustness of sequence assembly [40]. Although understanding these complex communities is tantalizing, it may prove useful to identify more communities similar to the AMD and whalebone communities that have a relatively low diversity to develop and test tools that can then be applied to soil. As sequencing technologies improve, the feasibility of obtaining nearly complete sequence coverage of the more diverse communities will improve. The rapid advances in sequencing short DNA fragments (approximately 100 bp long) in a highly parallelized manner [41] presents many new opportunities, but the method may not be amenable to metagenomic sequencing because the short sequence reads produce peptide fragments less than 100 aa long, which could make a meaningful ORF identification and analysis of functional diversity difficult.

Innovative methods have been developed to compare collections of 16S rRNA sequences, and analogous new methods are needed for comparing metagenomic sequences. For example, improving our ability to estimate and interpret the biological meaning of OPF richness will be helpful for describing the relative functional capacity of a community. Our analysis does not address the possibility that distant OPFs might serve the same biological function and that members of the same OPF might have different functions. Therefore, further work is needed to unify studies of functionally active clones into a statistical framework. For example, comparing the collection of genes conferring antibiotic resistance found in multiple environments would enable us to understand better the diversity of these genes as well as their biogeography.

Our analysis moves beyond previous attempts to compare microbial communities at the genomic level by not being dependent upon reference databases and introducing statistical rigor to the description and comparison of microbial communities. For example, previous analyses formed clusters based on similarity to reference databases and excluded those peptide fragments with no significant matches, which limited the scope of the analysis. Here, we formed OPFs using the observed data, in essence allowing the data to "speak for themselves", which allowed for a comprehensive comparison of the data. Previous analyses also based the level of similarity between communities on the observed peptide fragments as though they represented a statistical population. Here, we treated the data as a statistical sample and employed statistical tools to estimate the level of similarity between community membership and structure. These tools enable a quantitative, comprehensive, and statistically robust analysis of microbial communities at the genomic level.

Shotgun sequencing of metagenomic communities is becoming increasingly popular and routine. The results of these efforts will provide more insight if they are wrapped in robust ecological and statistical frameworks. Tools are needed to advance data analysis beyond the frequency of different COGs or KEGG categories that are found within a community. This study is a step in building such a framework to compare microbial communities functionally at the genomic level. In addition to estimating community relatedness based on metagenomic data, our approach accounts for present but unsampled peptide fragments, is independent of a subjective annotation process, and includes peptide fragments with no known function.


Genome sequence data

We obtained the 101,379 sequence reads used to assemble the Bacillus anthracis str. Ames whole genome sequence from GenBank (NC_003997). Each sequence read was evaluated by fastgenesb at the Joint Genome Institute using the same parameters used to predict the identity of peptide fragments in two previous metagenomic sequencing studies [15, 18]. We also obtained the complete complement of 4,514 ORFs from the finished genome that were longer than 100 aa. All of the predicted peptide fragments from the published metagenomic sequencing projects using an acid mine drainage biofilm [15], whalebone [18], and soil [18] were obtained from the Joint Genome Institute. Only those ORFs and peptide fragments longer than 100 aa were considered in our analyses.

Modified toolbox

DOTUR is a freely available computer program that uses a distance matrix to assign sequences to operational taxonomic units (OTUs) using either the nearest, average, or furthest neighbor clustering algorithms for all possible distances and then constructs rarefaction and collector's curves for a variety of ecological parameters [30]. These curves can be used to compare the relative richness, the number of different OTUs in a community, of two samples and to estimate the overall richness within a sample. Similarly, MG-DOTUR clusters sequences into OPFs using a BLAST table as the input. ORFs are assigned to OPFs using the furthest neighbor clustering algorithm [42], which requires that all sequences in the OPF have a pairwise BSR value greater than a specified value. Because BSRs are not necessarily symmetric (i.e. BSRij ≠ BSRji), they were forced to be symmetric by using the smaller of the two values. Once MG-DOTUR assigns sequences to OPFs, rarefaction curves of the number of OPFs observed on average as a function of ORFs sampled and collector's curves of the Chao1 [43], ACE [44], and the interpolated Jackknife [45] richness estimates as a function of ORFs sample are calculated at multiple BSR values predefined by the user. MG-DOTUR uses a switch to calculate the ACE estimator. If the coefficient of variation (γ) is greater than 0.8, then the ACE-1 estimator is calculated, otherwise the simple ACE estimator is used. This follows recommendations made by Anne Chao for use of the program SPADE [46]. This study reports results obtained by defining an OPF a group of sequences with a BSR value greater than 0.30 [34, 35] and OTU as a group of sequences that are all more than 97% identical to each other [30].

∫-LIBSHUFF [31] is a modified version of the program LIBSHUFF [32] that makes use of the integral form of the Cramér-von Mises statistic to determine whether two communities are either samples of the same statistical population, sub-samples of each other, or were drawn from different statistical populations. As employed in ∫-LIBSHUFF, the Cramér-von Mises statistic is a function of the coverage of one sequence collection onto itself (i.e. homologous coverage, CX) compared to its coverage onto another collection (i.e. heterologous coverage, CXY). Coverage is the fraction of sequences that have another sequence within a given distance of them. Application of the LIBSHUFF-style analysis requires converting BSR values into distances by subtracting the BSR value from one and setting the limits of integration from zero to 0.70. MG-LIBSHUFF calculates the ΔCXY statistic and evaluates its significance using a Monte Carlo testing procedure as described elsewhere [31, 32].

Δ C X Y = 0 1 B S R min ( C X ( D ) C X Y ( D ) ) 2 d D MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeuiLdqKaem4qam0aaSbaaSqaaiabdIfayjabdMfazbqabaGccqGH9aqpdaWdXbqaamaabmaabaGaem4qam0aaSbaaSqaaiabdIfaybqabaGccqGGOaakcqWGebarcqGGPaqkcqGHsislcqWGdbWqdaWgaaWcbaGaemiwaGLaemywaKfabeaakiabcIcaOiabdseaejabcMcaPaGaayjkaiaawMcaamaaCaaaleqabaGaeGOmaidaaaqaaiabicdaWaqaaiabigdaXiabgkHiTiabdkeacjabdofatjabdkfasnaaBaaameaacyGGTbqBcqGGPbqAcqGGUbGBaeqaaaqdcqGHRiI8aOGaemizaqMaemiraqeaaa@50E5@


D = the distance (1-BSR) that is used to determine the level of coverage.

CX(D) and CXY(D) = measures of homologous and heterologous library coverage.

BSRmin = the smallest meaningful BSR value; for this analysis set at 0.30

Population biologists have developed an analysis of variance (ANOVA)-style of analysis, which tests whether a collection of communities have similar genetic diversities using mitochondrial DNA sequences and other genetic markers. This method has been designated as either the analysis of molecular variance (AMOVA) [47] or non-parametric multivariate analysis of variance (MANOVA) [48]. This analysis has been applied for comparing bacterial communities using 16S rRNA sequences [33]. The general method is based on partitioning the sum of the squared elements in a distance matrix similar to what is done in an ANOVA. As applied in MG-AMOVA, we implement a single-classification ANOVA design to determine whether the average genetic average genetic difference between the three whalebone communities was significantly greater than the difference within a community. The total sum-squared error (SST) and within community sum-squared error (SSW) is calculated by

S S T = 1 N i = 1 N 1 j = i + 1 N ( 1 B S R i j ) 2 S S W = 1 N i = 1 N 1 j = i + 1 N ( 1 B S R i j ) 2 ε i j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabiqaaaqaaiabdofatjabdofatnaaBaaaleaacqWGubavaeqaaOGaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqWGobGtaaGcdaaeWbqaamaaqahabaWaaeWaaeaacqaIXaqmcqGHsislcqWGcbGqcqWGtbWucqWGsbGudaWgaaWcbaGaemyAaKMaemOAaOgabeaaaOGaayjkaiaawMcaamaaCaaaleqabaGaeGOmaidaaaqaaiabdQgaQjabg2da9iabdMgaPjabgUcaRiabigdaXaqaaiabd6eaobqdcqGHris5aaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4KaeyOeI0IaeGymaedaniabggHiLdaakeaacqWGtbWucqWGtbWudaWgaaWcbaGaem4vaCfabeaakiabg2da9KqbaoaalaaabaGaeGymaedabaGaemOta4eaaOWaaabCaeaadaaeWbqaamaabmaabaGaeGymaeJaeyOeI0IaemOqaiKaem4uamLaemOuai1aaSbaaSqaaiabdMgaPjabdQgaQbqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabikdaYaaaiiGakiab=v7aLnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabdMgaPjabgUcaRiabigdaXaqaaiabd6eaobqdcqGHris5aaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4KaeyOeI0IaeGymaedaniabggHiLdaaaaaa@78A3@


BSRij = the BSR value between the ith and the jth peptide fragments.

εij = 1 if i and j are in the same community, otherwise it is 0.

N = total number of peptide fragments

The sum-squared error among communities (SSA) can be calculated as SSA = SST-SSW. Significance was determined by randomizing the assignment of sequences to the sequence collections and recalculating the statistic and determining the proportion of randomizations resulting in an equal or smaller SSW value than that observed from the randomized distribution [48].

OPF-based comparisons of community membership and structure

Using the frequency that each OPF was observed in multiple communities, it has been possible to estimate the number of OPFs that are shared between communities as well as describe the overlap between community structures. Analogous to the Chao1 non-parametric richness estimator [43], Chao et al. [49] derived a non-parametric estimator of the richness shared between two communities:

S A , B C h a o = S 12 + f 11 f 1 + f + 1 4 f 2 + f + 2 + f 1 + 2 2 f 2 + + f + 1 2 2 f + 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aaSbaaSqaaiabdgeabjabcYcaSiabdkeacjabbccaGiabdoeadjabdIgaOjabdggaHjabd+gaVbqabaGccqGH9aqpcqWGtbWudaWgaaWcbaGaeGymaeJaeGOmaidabeaakiabgUcaRiabdAgaMnaaBaaaleaacqaIXaqmcqaIXaqmaeqaaKqbaoaalaaabaGaemOzay2aaSbaaeaacqaIXaqmcqGHRaWkaeqaaiabdAgaMnaaBaaabaGaey4kaSIaeGymaedabeaaaeaacqaI0aancqWGMbGzdaWgaaqaaiabikdaYiabgUcaRaqabaGaemOzay2aaSbaaeaacqGHRaWkcqaIYaGmaeqaaaaakiabgUcaRKqbaoaalaaabaGaemOzay2aa0baaeaacqaIXaqmcqGHRaWkaeaacqaIYaGmaaaabaGaeGOmaiJaemOzay2aaSbaaeaacqaIYaGmcqGHRaWkaeqaaaaakiabgUcaRKqbaoaalaaabaGaemOzay2aa0baaeaacqGHRaWkcqaIXaqmaeaacqaIYaGmaaaabaGaeGOmaiJaemOzay2aaSbaaeaacqGHRaWkcqaIYaGmaeqaaaaaaaa@61A9@


S12 = number of shared OPFs in A and B

f11 = number of shared OPFs with one observed individual in A and B

f1+, f2+ = number of shared OPFs with one or two individuals observed in A

f+1, f+2 = number of shared OPFs with one or two individuals observed in B

By a similar approach the fraction of individuals or peptide fragments that belong to a shared OPF can be estimated [50, 51]:

U e s t = i = 1 S 12 X i n t o t a l + m t o t a l 1 m t o t a l f + 1 2 f + 2 i = 1 S 12 X i n t o t a l I ( Y i = 1 ) V e s t = i = 1 S 12 Y i m t o t a l + n t o t a l 1 n t o t a l f 1 + 2 f 2 + i = 1 S 12 Y i m t o t a l I ( X i = 1 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabiqaaaqaaiabdwfavnaaBaaaleaacqWGLbqzcqWGZbWCcqWG0baDaeqaaOGaeyypa0ZaaabCaKqbagaadaWcaaqaaiabdIfaynaaBaaabaGaemyAaKgabeaaaeaacqWGUbGBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaaaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4uam1aaSbaaWqaaiabigdaXiabikdaYaqabaaaniabggHiLdGccqGHRaWkjuaGdaWcaaqaaiabd2gaTnaaBaaabaGaemiDaqNaem4Ba8MaemiDaqNaemyyaeMaemiBaWgabeaacqGHsislcqaIXaqmaeaacqWGTbqBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaaaamaalaaabaGaemOzay2aaSbaaeaacqGHRaWkcqaIXaqmaeqaaaqaaiabikdaYiabdAgaMnaaBaaabaGaey4kaSIaeGOmaidabeaaaaGcdaaeWbqcfayaamaalaaabaGaemiwaG1aaSbaaeaacqWGPbqAaeqaaaqaaiabd6gaUnaaBaaabaGaemiDaqNaem4Ba8MaemiDaqNaemyyaeMaemiBaWgabeaaaaaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGtbWudaWgaaadbaGaeGymaeJaeGOmaidabeaaa0GaeyyeIuoakiabdMeajnaabmaabaGaemywaK1aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqaIXaqmaiaawIcacaGLPaaaaeaacqWGwbGvdaWgaaWcbaGaemyzauMaem4CamNaemiDaqhabeaakiabg2da9maaqahajuaGbaWaaSaaaeaacqWGzbqwdaWgaaqaaiabdMgaPbqabaaabaGaemyBa02aaSbaaeaacqWG0baDcqWGVbWBcqWG0baDcqWGHbqycqWGSbaBaeqaaaaaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdofatnaaBaaameaacqaIXaqmcqaIYaGmaeqaaaqdcqGHris5aOGaey4kaSscfa4aaSaaaeaacqWGUbGBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaGaeyOeI0IaeGymaedabaGaemOBa42aaSbaaeaacqWG0baDcqWGVbWBcqWG0baDcqWGHbqycqWGSbaBaeqaaaaadaWcaaqaaiabdAgaMnaaBaaabaGaeGymaeJaey4kaScabeaaaeaacqaIYaGmcqWGMbGzdaWgaaqaaiabikdaYiabgUcaRaqabaaaaOWaaabCaKqbagaadaWcaaqaaiabdMfaznaaBaaabaGaemyAaKgabeaaaeaacqWGTbqBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaaaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4uam1aaSbaaWqaaiabigdaXiabikdaYaqabaaaniabggHiLdGccqWGjbqsdaqadaqaaiabdIfaynaaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaeGymaedacaGLOaGaayzkaaaaaaaa@D2BF@


Uest, Vest = fraction of sequences from A and B that belong to a shared OTU

Xi, Yi = abundance of the ith shared OTU in A and B

ntotal, mtotal = total number of sequences sampled in A and B

I(·) = if the argument, ·, is true then I(·) is 1; otherwise it is 0.

Uest and Vest can then be used to estimate an abundance-based Jaccard similarity coefficient:

J a b u n d = U e s t V e s t U e s t + V e s t U e s t V e s t MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOsaO0aaSbaaSqaaiabdggaHjabdkgaIjabdwha1jabd6gaUjabdsgaKbqabaGccqGH9aqpjuaGdaWcaaqaaiabdwfavnaaBaaabaGaemyzauMaem4CamNaemiDaqhabeaacqWGwbGvdaWgaaqaaiabdwgaLjabdohaZjabdsha0bqabaaabaGaemyvau1aaSbaaeaacqWGLbqzcqWGZbWCcqWG0baDaeqaaiabgUcaRiabdAfawnaaBaaabaGaemyzauMaem4CamNaemiDaqhabeaacqGHsislcqWGvbqvdaWgaaqaaiabdwgaLjabdohaZjabdsha0bqabaGaemOvay1aaSbaaeaacqWGLbqzcqWGZbWCcqWG0baDaeqaaaaaaaa@58D8@

To incorporate into the measure of community similarity the proportion of peptide fragments in each OPF, Yue and Clayton [52] developed the parameter θ:

θ = i = 1 S 12 X i n t o t a l Y i m t o t a l i = 1 S 1 ( X i n t o t a l ) 2 + i = 1 S 2 ( Y i m t o t a l ) 2 i = 1 S 12 X i n t o t a l Y i m t o t a l MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae8hUdeNaeyypa0tcfa4aaSaaaeaadaaeWbqaamaalaaabaGaemiwaG1aaSbaaeaacqWGPbqAaeqaaaqaaiabd6gaUnaaBaaabaGaemiDaqNaem4Ba8MaemiDaqNaemyyaeMaemiBaWgabeaaaaWaaSaaaeaacqWGzbqwdaWgaaqaaiabdMgaPbqabaaabaGaemyBa02aaSbaaeaacqWG0baDcqWGVbWBcqWG0baDcqWGHbqycqWGSbaBaeqaaaaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGtbWudaWgaaqaaiabigdaXiabikdaYaqabaaacqGHris5aaqaamaaqahabaWaaeWaaeaadaWcaaqaaiabdIfaynaaBaaabaGaemyAaKgabeaaaeaacqWGUbGBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaaaaaGaayjkaiaawMcaamaaCaaabeqaaiabikdaYaaacqGHRaWkaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGtbWudaWgaaqaaiabigdaXaqabaaacqGHris5amaaqahabaWaaeWaaeaadaWcaaqaaiabdMfaznaaBaaabaGaemyAaKgabeaaaeaacqWGTbqBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaaaaaGaayjkaiaawMcaamaaCaaabeqaaiabikdaYaaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGtbWudaWgaaqaaiabikdaYaqabaaacqGHris5aiabgkHiTmaaqahabaWaaSaaaeaacqWGybawdaWgaaqaaiabdMgaPbqabaaabaGaemOBa42aaSbaaeaacqWG0baDcqWGVbWBcqWG0baDcqWGHbqycqWGSbaBaeqaaaaadaWcaaqaaiabdMfaznaaBaaabaGaemyAaKgabeaaaeaacqWGTbqBdaWgaaqaaiabdsha0jabd+gaVjabdsha0jabdggaHjabdYgaSbqabaaaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdofatnaaBaaabaGaeGymaeJaeGOmaidabeaaaiabggHiLdaaaaaa@9ADD@


S1 and S2 = observed number of OPFs in each community.

16S rRNA sequence analysis

The three metagenomic sequencing projects were selected because they were accompanied by parallel 16S rRNA sequence collections. We obtained the sequences from the original authors and aligned the sequences using the greengenes website [53]. Aligned sequences were imported to ARB [54] and overlapping sequences were used to construct distance matrices with a Jukes-Cantor correction for multiple substitutions. Distance matrices were analyzed using DOTUR [30], ∫-LIBSHUFF [31], and MG-AMOVA as described above.

Availability of data and software

MG-DOTUR, MG-LIBSHUFF, MG-AMOVA and all sequence and analysis files are available from the authors' website [55].