Background

The development of next-generation sequencing in the past decade has increased access for researchers to analyze microbial communities [1, 2]. A common method involves deep sequencing of 16S rRNA genes and grouping bacteria at a certain level of 16S rRNA gene similarity [3]. The highest resolution bacterial sequence is referred to as an operational taxonomic unit (OTU). A challenge with microbiome analyses is that these sequences are derived from bacteria that can be incredibly diverse and simultaneously have a relatedness structure that is internally associated, both genetically and phenotypically. Metrics to account for phylogenetic closeness have been proposed specifically for microbiome data [46]. Previously, Fukuyama [7] showed that deep and shallow parts of the tree contribute differently to phylogenetic distances.

Microbiome data analysis can suffer from overgeneralization (e.g., comparing observations by alpha diversity measures such as Shannon [8] or Simpson [9] diversity) or overspecification (e.g., performing association tests for each taxon, which incurs a heavy penalty for multiple testing). A useful and common strategy to address both of these limitations is to first apply an unsupervised clustering methodology. After considering the microbiome data as a whole and identifying naturally occurring clusters of samples, these clusters can then be assessed for associations with sample characteristics of interest. Previous studies, for instance, have shown that human stool microbiome samples naturally form clusters that are associated with dietary and geographic factors [10, 11]. A problem that often confuses researchers is that clustering performance results often vary depending on the algorithm or the beta diversity metrics used, observed previously by Koren et al. [12] and Claesson et al. [13].

When performing microbiome sample clustering, both model-based methods and machine learning methods have been used. Machine learning methods, which rely on defined distance metrics, are used more frequently than model-based statistical methods, due to their efficient implementation and easy interpretation. In this paper, we focused on the partition around medoids (PAM) [14] clustering method, which is related to but considered more robust than K-means. In contrast to K-means, which can be sensitive to the effects of outliers, PAM’s optimization goal is to minimize the sum of distances to the medoids instead of minimizing the sum of the squared distances to the cluster centers. We also evaluated another frequently used method for non-Euclidean metrics, hierarchical clustering with complete linkage. This method initially bundles the closest observations (with distances defined by the longest distance between any two observations in two clusters) and gradually results in a binary tree that combines the two closest clusters.

Supplemental Figure S1 illustrates the differences between these clustering methods. The choice of clustering algorithm and distance metric together determine the performance of a machine learning clustering method. The metrics considered in this study are commonly used and include the Bray Curtis(BC) dissimilarity metric [15], the unweighted UniFrac(UU) distance [4], the weighted UniFrac distance [5], and the Aitchison distance [16], which is a Euclidean distance quantified following centered log ratio (CLR) transformation of abundances. Previously, a parameter-dependent metric, generalized UniFrac, was proposed by Chen et al [6]. The results from the generalized UniFrac are shown in Supplemental Figure S6.

In addition to PAM and hierarchical clustering, we also included a third approach, the model-based Dirichlet multinomial model (DMM) [17]. This algorithm assumes that observations of the same cluster come from the same multinomial distribution, and that parameters of the multinomial distribution have a Dirichlet distribution. This model better captures overdispersion than simply assuming a multinomial.

In this paper, we systematically compared methods for clustering microbiome observations from four published studies with either geographical or seasonal variables as the true cluster label, which enables biological interpretation of the group separation. We first applied clustering with five methods, and assessed performance of the various methods using the adjusted Rand index with the true clustering assignment. The adjusted Rand index is based on a pairwise membership agreement and is corrected for the expected value, where a score of 0 is expected with random clustering, and a score of 1 indicates perfect clustering. We then explored the relationships between the differences in performance with properties of the datasets considered. With the findings obtained from the method comparisons, we proposed a novel combined metric that provided high performance for all four datasets. We further test our findings with a clinical dataset where clusters are less separated.

Results

We used four published stool microbiome datasets to test the performance of the clustering methods. A brief summary of these datasets is shown in Table 1. Each study comprises two groups of samples with relatively distinct microbiome profiles. The Shannon diversity of the dataset is calculated as the sum of Shannon diversities over all samples in the dataset, and details related to this alpha diversity are provided in a later section. The percent of sequences in high abundance OTUs is the percent of sequences in OTUs with average abundance greater than 0.001. While the individual studies varied in their 16S rRNA gene deep sequencing methods, we processed the downloaded raw sequences using the same pipeline, including VSEARCH [18] to dereplicate and remove singletons and the UNOISE2 function in usearch [19] to identify OTUs. More information about the four example datasets is provided in Supplemental Table S1. A heatmap of the high abundance OTUs can also be found in Supplemental Figure S2.

Table 1 Summary of the example datasets

The first three studies compare cohorts separated by geography. The De Filippo [20] study consists of samples from healthy children living in a rural West African village in Burkina Faso, who consume a high-fiber, largely vegetarian diet, as well as samples from healthy European children of a similar age consuming a Mediterranean diet. The Martínez et al. [11] dataset consists of samples from the Asaro and Sausi communities in Papua New Guinea, as well as those from the USA. The dataset from Schnorr et al. [10] compares fecal samples from Italian urban adults with samples from adult hunter-gatherers living in Hadza land in Tanzania, Africa, who belong to one of the few populations that have maintained a traditional lifestyle with limited exposure to processed foods.

The fourth study, Smits et al. [21], comprises fecal samples longitudinally collected from a cohort of Hadza. Their diets are significantly affected by seasonal food availability, with more berries and honey during the wet season and more meat in the dry season. In this study, the authors found that some taxa disappear when certain foods become scarce and reappear when the seasons change. For the purpose of unsupervised learning, we used samples from the two most distinct seasons, the late dry and the early wet seasons.

The first column in Fig. 1 illustrates the adjusted Rand indices of the four datasets, with different colors indicating the beta diversity metrics using PAM clustering or Dirichlet multinomial mixture model (DMM). Because of its limited capacity to handle high-dimensional data, when fitting the DMM model, we binned the OTUs present in less than 20% of the observations into one OTU when calculating Rand indices. Also, due to the fact that Dirichlet multinomial mixture is model-based, there is no corresponding distance matrix that can be shown in a PCoA plot. The black bars in the Rand index plots and the last column in the PCoA plots are generated from our newly proposed method introduced later in this paper. Due to the generally similar results between PAM and hierarchical clustering (see Fig. 2), we presented the results in our main text only using PAM. Among these existing methods, we found that there is no clearly superior one that universally performs well in all four datasets. Interestingly, UU performs well in three of four datasets, despite lacking information regarding bacterial abundances. Similar observations for the presence/absence methods have been made by Martino et al. [22]. Moreover, the Schnorr dataset has a perfect Rand index with UniFrac distances, but a low Rand index with BC, Aitchison, and DMM, all of which ignore phylogenetic relationships between OTUs. We devote the next section to exploring aspects of these differences in performance. For the completeness of comparison, we also tested the performance of 22 less common metrics that are provided by QIIME2 [23] (Fig. S7).

Fig. 1
figure 1

Heterogeneous Rand indices with PCoA plots of four example datasets using different methods

Fig. 2
figure 2

The Rand indices of PAM and hierarchical clustering correlate for the same dataset using the same beta diversity metric

In this section, we will focus on the underperformance of BC for the Schnorr dataset, followed by the underperformance of UU for the Smits dataset. Because in reality, researchers may not know the information of the true cluster assignment, we will focus on a priori properties of the dataset, rather than the differences between the two assigned clusters.

High abundance OTUs drive the performance of the Bray-Curtis dissimilarity

For a p×n table with p OTUs and n observations, the BC distance between two observations A and B is calculated with the actual counts:

$$d^{\text{BC}}=\frac{\sum_{i=1}^{p}|n_{Ai}-n_{Bi}|}{\sum_{i=1}^{p}|n_{Ai}+n_{Bi}|}, $$

where nAi is the count for the ith OTU for observation A, and nBi is the count for the ith OTU for observation B. By examining the definition, we find that the dissimilarity of the BC metric is driven by differences in high-abundance rather than low-abundance OTUs. Here, we chose to define a metric, to potentially identify poor BC performance by summing the average abundance of “high abundance OTUs”, which are those with a mean abundance greater than 0.001, though similar results can be obtained across a range of thresholds.

As shown in the last column of Table 1 and the last column of the Supplemental Table S1, the sum of average abundance and the high-abundance OTUs for the Schnorr dataset is far below that of the other three datasets. A visual description is also provided as a heatmap in Fig. S2. Thus, the Schnorr dataset is characterized by both unusually poor clustering performance by BC and few high-abundance OTUs.

To calculate this association, we attempted to improve the performance of BC by generating novel OTUs with higher mean abundances. We did this by first generating a phylogenetic tree of the aligned OTUs, and then systematically merging OTUs starting with those most distal to the tree root. In essence, we performed “trimming” of branches from the tree, by combining the abundances of the distal OTUs to generate new, more proximal OTUs with higher mean abundances, as shown in Fig. 3A.

Fig. 3
figure 3

Illustration of the trimming process, summed average abundance of high-abundance OTUs plot, Rand index plot, and PCoA plots of the trimmed Schnorr dataset. A A schematic of a phylogenetic tree of OTUs, where the number in each node is the sum of the average abundance (left). After trimming the furthest branches, the average abundance of the new tree tip is greater than before (right). B The summed abundance of the high abundance OTUs grows with increased trimming. C The Rand index of Bray Curtis - PAM initially increases and then decreases with continued trimming of the phylogenetic tree towards the root. D The total Shannon diversity of the dataset decreases with the trimming process. E Bray-Curtis beta diversity PCoA plots show the separation of two natural clusters with no trimming, 19 levels of trimming, and 29 levels of trimming

In Fig. 3B, C, the x-axes indicate the number of levels trimmed, ranging from the first most distal branch to, at most, 34 levels proximal. Figure 3B shows how the summed abundance of high-abundance OTUs increases with more iterations of the trimming process, while Fig. 3C shows that the Rand index improves initially as distal sparse OTUs are merged together. As trimming continues, the Rand index begins to worsen, as presumably excessive binning results in loss of distinctive OTU information. The three PCoA plots of Fig. 3E correspond to the original dataset (where no levels have been trimmed), to a modified dataset where 19 levels are trimmed off, and to a modified dataset where 29 levels are trimmed off from the most distal branch towards the root. The two natural clusters begin to distinctly separate when 19 levels of branches are trimmed and are fully separated when 29 levels are trimmed. The corresponding heatmaps of the three datasets in Fig. S3 also demonstrate that trimming can increase the number of OTUs with high mean abundance.

Notably, as we progressively reduced the resolution of microbiome data during the trimming process, reflected by the decreasing Shannon diversity in Fig. 3D, the PAM clustering showed improved performance. Shannon diversity is a commonly used alpha diversity metric used to quantify the information within a dataset. The Shannon diversity for observation j is defined as:

$$\text{Shannon}_{j}=-\sum_{i=1}^{p} p_{ij}\log(p_{ij}), \text{\quad }p_{ij}> 0, $$

where pij is the abundance of the ith taxa for observation j. A reasonable explanation for the increase in Rand indices is that the trimming process actually utilizes the tree phylogenetic information, and similar thinking was previously mentioned in Gopalakrishnan et al. [24] and Peled et al. [25]. We encourage researchers to explore their datasets by trimming with the tools provided in our R package “MicrobiomeCluster.” Interestingly, datasets with clear clustering into two groups are likely to produce a bimodal histogram of pairwise distances between samples. One mode corresponds to small within-cluster distances, while the other mode corresponds to larger between-cluster distances.

As a counterexample, we chose to examine in more detail the Martínez dataset for its high clustering performance when using the BC metric. We reasoned that a high degree of discriminating information was being provided by OTUs with high mean abundances. To simulate OTUs with low mean abundances progressively, we took counts assigned to an OTU and assigned those to a new OTU in randomly selected samples from each tip of the original tree to form first-generation descendants, then grew two new branches from each of the newly formed tree tips to obtain the second-generation descendants, and so on. As zero counts do not contribute to the Shannon diversity, we essentially “grew” branches without increasing the information from the original dataset, as shown in Fig. 4A. Sequences of each OTU were randomly assigned to either of the two daughter branches. As shown in Fig. 4B, the summed abundance of the OTUs with high mean abundance is reduced. The performance of both BC- and Aitchison-based clustering show a decreasing trend, while tree-structure-based UniFrac methods are minimally affected in these simulations (Fig. 4C). With an increasing number of descendants, fewer OTUs have a high mean abundance when averaged over samples, corresponding with a drop in the performance of the BC- and Aitchison-based methods.

Fig. 4
figure 4

Illustration of branching process, abundance sum of OTUs with high abundance plot, Rand indexes plot, and PCoA plot of Martínez dataset with descendants. A A schematic of the branching process, where sequences of each OTU were randomly assigned to either of the two daughter branches. B Abundance sum of OTUs with high mean abundance plot of the Martínez Dataset with first, second, and third generation descendants. Lines in the bar plot indicate the 95th percentile intervals of the Rand indices from 200 repeated simulations. C Rand indices of the Martínez dataset with the original dataset and simulations of the first, second, and third generation descendants. Lines in the bar plot indicate the 95th percentile intervals of the Rand indices from 200 repeated simulations. D Bray-Curtis beta diversity PCoA of the original Martínez dataset, as well as examples of datasets with the addition of the first, second, and third generation novel OTUs

To visualize the effects of reduced OTU abundances, we plotted the PCoA of the original dataset, as well as examples of datasets with the addition of the first, the second, and the third generation simulated distal OTUs. As additional descendants are generated, the separation between two clusters becomes less apparent (Fig. 4D). We also plotted heatmaps of the original dataset, as well as examples of modified datasets with the addition of the first, and second generation descendants in Fig. S4. As tree branches progressively diverge, the most abundant OTUs have reduced mean abundances.

Prevalence of low-abundance OTUs inhibits clustering performance of the unweighted UniFrac distance

Another key observation from our examination of clustering performance in the four datasets is that the UU distance performs well for three of the datasets, but markedly underperforms in the Smits dataset. This dataset is distinct from the others in that samples were collected at different time points from the same group of individuals, rather than being collected from two distinct groups of subjects. Thus, the association of the microbiome is with time (distinct in two seasons), rather than the source. We would anticipate that, during the late dry season, many bacteria more suited for the early wet season will be reduced in abundance but may not be totally eliminated and can recover when the host diet eventually becomes more hospitable, and vice versa. The geographically separated clusters from the other studies are less likely to have an extensive overlap in presence of bacterial taxa, with many taxa being present only in one population and absent in the other.

To explore the determinants of when the UU metric could lead to reduced clustering performance, we focused on the unweighted nature of the measure, where only the presence or absence of an OTU is quantified rather than the abundance of an OTU. In the Smits dataset, we expected that a large number of OTUs are shared by both clusters. To identify a metric that captures the information uncaptured by UU, we propose using the total Shannon diversity, which is the sum of the Shannon diversities over all the samples of the dataset.

Because zero entries in the OTU table do not contribute to total Shannon diversity, the total Shannon diversity in a dataset is a metric that can reflect the amount of information uncaptured by UU across samples. To examine this, we simulated new repetitions of the data from the Smits dataset by gradually switching non-zero entries to 0, in essence adjusting the criteria by which an OTU would be considered “present” in a sample, as shown in Fig. 5A. In both Fig. 5B and D, the x-axis is the threshold going from 0 to 70. During the process, entries with a number of sequences less than or equal to the threshold will be converted to 0. In Fig. 5B, the performance of UU goes up with the number of entries converted to 0, whereas the performance of BC only fluctuates slightly. In Fig. 5D, the total Shannon diversity drops down sharply with the number of entries converted to 0.

Fig. 5
figure 5

An illustration of replacing low abundance OTU with 0 value, Rand index plot, PCoA plots, and the total Shannon diversity plot of the modified Smits dataset. A A schematic plot illustrates how we replace low abundance values with a 0 value when the threshold is set to be 1. B Clustering performance of unweighted UniFrac improves by increasing the threshold used to change non-zero entries to 0 in the Smits dataset, while the performance of Bray Curtis remains the same. C Unweighted UniFrac beta diversity PCoA of the original data, the dataset where entries less than 30 are converted to 0, and the dataset where entries less than 60 are converted to 0. D The total Shannon diversity of versions of the Smits dataset decreases as we raise the threshold, below which counts are converted to 0

PCoA also visually depicts improved discrimination of UU with an increasing threshold. In Fig. 5C, we plotted the PCoA of the original data, the simulated dataset where entries less than 30 are converted to 0, and the simulated dataset where entries less than 60 are converted to 0. As more non-zero entries are converted to 0, the two clusters separate further in the first coordinate.

As a counter example, we chose to examine in more detail the Martínez dataset, for which the UU displays high clustering performance. To simulate poorly discriminating data, we increasingly substituted 0 entries with a single count value. Such an alteration of the dataset imposes a minimal change to the original data. As with the Smits dataset, we quantified the median with the 95% empirical confidence intervals of Rand indices and Shannon diversities over 200 simulated datasets. In the x-axes of both Fig. 6B and C, the probability of filling a 0-value entry with 1 goes from 0 to 1, with a 0.1 increment. Figure 6B demonstrates that the total Shannon diversity indeed increases as 0-value entries are increasingly replaced with a value of 1. In turn, we found that the Rand indices of other methods are almost completely unchanged with the perturbation, while the performance of the UU-based clustering is progressively reduced in Fig. 6C.

Fig. 6
figure 6

An illustration of replacing value of 0 with value of 1, the total Shannon diversity plot, Rand index plot, and the PCoA plots of the modified Martínez datasets. A A schematic plot illustrates how we increased the low-abundance OTUs by replacing value of 0 with value of 1. B The total Shannon diversity increases with the number of 0 entries replaced with 1. C Rand index of the Martínez dataset with an increasing number of 0 entries replaced with 1. D Unweighted UniFrac beta diversity PCoA of Martínez dataset with 0%, 40%, and 60% of 0 entries replaced with 1

The PCoA plots of the original dataset, as well as a simulated dataset where 40% of 0-value entries in the OTU table are replaced with a value of 1, and a simulated dataset where 60% of 0-value entries are replaced with a value of 1 are shown in Fig. 6D. We found that, as percentage of replacement increased, the two groups of samples become progressively less distinct.

Combined metric shows stable good performance

With the above findings, it would be appealing to develop a metric that can use the information of both high- and low-abundance OTUs. An intuitive way of constructing such a new metric is to combine information from both BC and UU metrics, which tend to complement each other. After normalizing pairwise BC and UU metrics by their largest values the two metrics lie between 0 and 1. We define the distance between two observations in the new metric as:

$$d^{\text{combined}}=\alpha d^{\text{UU}}_{\text{normalized}}+(1-\alpha)d^{\text{BC}}_{\text{normalized}}. $$

Different α values change the relative contributions of UU and BC distances respectively, and we suggest using α=0.5 as a default simple choice. The Fig. S6 shows that our combined metric tends to do better than the generalized UniFrac with various parameter values, and the default choice gives a robust performance across example datasets. As shown in the last column of the bar plots and the last column of PCoA plots of Fig. 1, the proposed new metric (with α=0.5), which inherits complementary insights from the BC and the UU dissimilarities, results in high performance in Rand indices for all the datasets. As shown in Fig. S7, the combined metric also outperforms other less commonly used metrics in QIIME2.

Evaluation in a setting without strong cluster separation

We also explored a clinical setting where one does not expect global separation between groups [24]. The Gopalakrishnan study examines the relationship between microbiome and response to immunotherapy drugs for melanoma patients. Here, we consider patients who responded to (30 patients) and patients who did not respond to immunotherapy drugs (13 patients) as two clusters. There are 1455 OTUs in this dataset and the average sequencing depth is 48,765. Compared with four datasets discussed in previous sections, its abundance sum of OTUs with average abundance >0.001 is higher (0.874), while its Shannon diversity is lower (132.75). However, if we convert low abundance OTUs to 0s as we did for Smits dataset, the performance of UU improves while the Shannon diversity decreases (Fig. S5).

As shown in Fig. 7 and Fig. S6, all Rand indices are less than 0.4, with the combined metric being the highest. It is interesting to notice that the combined metric can outperform each of its individual components. The BC had the second best performance, corresponding with the presence of high-abundance OTUs. Of note, clustering performance by UU is poor, and the low Shannon diversity of the dataset would normally indicate the potential for better performance.

Fig. 7
figure 7

Rand indices plot and PCoA plots under various metrics for Gopalakrishnan dataset. A The Rand indices plot. B The PCoA plot under 5 metrics

Discussion

A quantitative recommendation regarding when to avoid certain metrics would require a systematic examination of many more datasets. We did observe that in four of the datasets we considered, hierarchical clustering, though less frequently used with microbiome data, can provide similar and sometimes superior results compared to the more commonly used PAM method (Fig. 2). Unlike PAM clustering, the number of clusters do not have to be pre-determined prior to hierarchical clustering, which is a potential advantage. For supervised settings where the group membership is known, beta diversity metrics, including the proposed combined beta metric, can be used as input to distance-based testing approaches such as PERMANOVA. For the data sets considered here, PERMANOVA produces significant p-values across virtually all metrics and data sets. This is because PERMANOVA takes the group labels as known, and tests the null hypothesis that both the centroids and dispersion are the same across groups. There may be evidence to reject this null even in settings where the groups overlap substantially.

Conclusions

Our systematic evaluation of clustering performance in these five datasets show that there is no existing clustering method that universally performs the best across all datasets. In general, the weighted UniFrac outperforms other methods across studies. When OTUs with high mean abundance are rare, clustering methods that do not consider phylogeny (i.e., BC, Aitchison, DMM) perform less well. In contrast, in a dataset with lots of low abundance OTUs, the UU tends to produce poor separation, while other methods can still perform well. To capitalize on the complementary strength of the two metrics, we propose a new metric that incorporates the BC metric and the UU metric, which gives good performance in a dataset representative of the common scenario where clusters are less separated.