Strain tracking in complex microbiomes using synteny analysis reveals per-species modes of evolution

Enav, Hagay; Paz, Inbal; Ley, Ruth E.

doi:10.1038/s41587-024-02276-2

Strain tracking in complex microbiomes using synteny analysis reveals per-species modes of evolution

Article
Open access
Published: 19 June 2024

(2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue Submit your manuscript

Strain tracking in complex microbiomes using synteny analysis reveals per-species modes of evolution

Download PDF

3732 Accesses
70 Altmetric
7 Mentions
Explore all metrics

Abstract

Microbial species diversify into strains through single-nucleotide mutations and structural changes, such as recombination, insertions and deletions. Most strain-comparison methods quantify differences in single-nucleotide polymorphisms (SNPs) and are insensitive to structural changes. However, recombination is an important driver of phenotypic diversification in many species, including human pathogens. We introduce SynTracker, a tool that compares microbial strains using genome synteny—the order of sequence blocks in homologous genomic regions—in pairs of metagenomic assemblies or genomes. Genome synteny is a rich source of genomic information untapped by current strain-comparison tools. SynTracker has low sensitivity to SNPs, has no database requirement and is robust to sequencing errors. It outperforms existing tools when tracking strains in metagenomic data and is particularly suited for phages, plasmids and other low-data contexts. Applied to single-species datasets and human gut metagenomes, SynTracker, combined with an SNP-based tool, detects strains enriched in either point mutations or structural changes, providing insights into microbial evolution in situ.

Statistical and Computational Methods for Microbial Strain Analysis

StrainGE: a toolkit to track and characterize low-abundance strains in complex microbial communities

Article Open access 07 March 2022

Fast and accurate metagenotyping of the human gut microbiome with GT-Pro

Article 23 December 2021

Main

Microbiomes are assemblages of diverse microbial species that often contain a vast diversity of strains. Within any given species, strain variation can arise from in situ evolution through several mechanisms, including the accumulation of point mutations and structural changes to the genome arising from insertions, deletions, horizontal gene transfers and intragenomic rearrangements requiring recombination. Species can differ in the main mode of genomic changes underlying strain diversification in their habitat. For instance, in the human gut, strains of Bacteroides fragilis evolve mainly through the accumulation of point mutations but not through recombination¹. In contrast, Helicobacter pylori owes its strain diversity to both recombination and the accumulation of point mutations². Point mutations and structural changes can operate on different parts of the same genome, as has been shown for many fungal phytopathogens³. Mutations and structural variations alike can lead to important functional differences between strains. For H. pylori, a single point mutation is associated with the development of gastric cancer⁴; for Staphylococcus aureus, a single point mutation is associated with jumping hosts from humans to rabbits⁵. Recombination has been shown to drive antibiotic resistance in Streptococcus pneumoniae⁶ and virulence in Neisseria meningitidis⁷. Other bacterial pathogens use recombination to generate phase variation important for immune evasion^8,9,10.

Despite the importance of structural changes to the genome in driving strain diversity, the majority of tools used to date to compare strains (as reviewed in the literature¹¹) use a single approach based on pairwise comparisons of single-nucleotide polymorphisms (SNPs) between genomes.

SNP-based genome comparison tools are also used to assign genomes to the same or different strain of a species using similarity thresholds. SNP-based tools are highly sensitive to mutation-driven diversity but are relatively insensitive to structural differences between genomes, particularly if they are based on the mapping of reads to reference genomes¹¹. This bias could lead to underestimation of strain diversity in highly recombining species. SNP-based methods may also potentially overestimate genomic diversity because of technical or biological factors, such as sequencing errors or the emergence of hypermutators with elevated point-mutation rates^12,13. These aspects alone motivate the development of tools that capture structural differences between genomes. Moreover, when used in a comparative manner with SNP-based approaches, such tools promise to reveal a more complete view of how different species generate strain diversity.

Gene synteny has been used previously to estimate evolutionary distances between genomes of eukaryotes^14,15,16; in microorganisms, gene synteny was used to identify horizontal gene transfer events¹⁷. However, microsynteny (the local conservation of genetic-marker order in genomic regions) is an overlooked component of variation for assessing genomic diversity¹⁸ and constitutes an untapped and rich source of genomic information for microbial strain comparisons. Related strains should share syntenic regions, regardless of SNP variation between homologous genes.

Here, we developed SynTracker, a tool that uses microsynteny to compare strains within species. SynTracker uses pairwise comparisons of homologous genomic regions in either metagenomic or genomic assemblies, followed by scoring the average synteny per pair of strains (average pairwise synteny score, APSS). SynTracker does not require a precompiled reference genome database but instead uses only a single reference genome per species of interest and it is relatively insensitive to SNPs. We designed SynTracker to give lower weight to SNPs and higher weight to insertions, deletions and recombination events. These genomic differences are less abundant than SNPs and are less likely to result from sequencing errors¹⁹.

We applied SynTracker to bacterial isolate genomes and to human gut metagenome datasets from mother–infant pairs (MIPs) sampled in different geographic locations and benchmarked it against existing strain-tracking tools. SynTracker outperformed existing tools in terms of both accuracy and sensitivity (that is, the total number of strain pairs identified), making it particularly suited for the analysis of low-abundance species, plasmids and phages. When used in combination with an SNP-based tool, SynTracker allowed the identification of hypermutators (large differences in SNPs but low levels of recombination, insertions and deletions) and hyper-recombinators (low differences in SNPs but high levels of genome structural variations) both within single-species datasets and a human gut metagenome. SynTracker is available as an open-source program on GitHub (https://github.com/leylabmpi/SynTracker).

Results

SynTracker compares strains within species using metagenome or genome data

SynTracker identifies synteny blocks in pairs of homologous genomic regions derived from isolate genomes, metagenomic assemblies or metagenome-assembled genomes (MAGs). As input, the pipeline accepts one genome per species of interest (bacterium, phage or plasmid), either fully or partially assembled, to be used as a reference and a collection of metagenomic assemblies (or genomes, if genomes are to be compared).

Step 1: identification of homologous regions

The reference genome is fragmented to create a collection of 1-kbp genomic regions, located 4 kbp apart (‘central regions’; Fig. 1a). Next, we convert the collection of per-sample metagenomic assemblies (or genomes) to a basic local alignment search tool (BLAST)²⁰ database and use the central regions as queries for a high-stringency nucleotide BLAST (BLASTn) search (identity = 97%, minimal query coverage = 70%; Fig. 1b) to minimize the possibility of receiving multispecies hits or hits located within regions with high copy-number variation. For each BLAST hit, we then retrieve the target sequence and the flanking 2-kbp regions upstream and downstream of the target sequence. This strategy results in high specificity when identifying homologs to the central regions, while allowing for high variance in the sequence composition of the flanking regions. These parameters can be modified by the user, according to preferences.

**Fig. 1: Illustration of the SynTracker algorithm.**

Step 2: calculation of region-specific synteny scores

Each collection of homologous ~5-kbp regions (that is, derived from a BLAST search using the same central region query) is assigned to a region-specific bin (Fig. 1c). Within each bin, we perform an all-versus-all pairwise sequence alignment to identify synteny blocks (Fig. 1d) using the DECIPHER R package²¹. Then, for each pairwise alignment, we calculate the region-specific pairwise synteny score. This score is based on two parameters: the number of synteny blocks identified in each pairwise sequence alignment and the overlap between the two sequences. The synteny score is inversely proportional to the first and directly proportional to the second (Extended Data Fig. 1).

A single synteny block in a pairwise alignment can stem from two genomic regions with a high sequence similarity. A high number of synteny blocks can result from insertions, deletions, recombination events or several SNPs located within a very close proximity in just one of the two sequences. The sequence overlap is defined as the ratio of the accumulative length of all blocks to the length of the shorter DNA region in each pairwise comparison. The region-specific pairwise synteny score has a maximal value of 1, reflecting identification of a single synteny block and overlap of 100% (Fig. 1d).

Step 3: calculation of the APSS

After calculating the per-region synteny scores in all bins, we randomly subsample n regions per single comparison of metagenomic samples (or pair of genomes) and calculate the APSS by averaging the per-region pairwise synteny scores. Pairs of samples or genomes with fewer than n regions per comparison are excluded from downstream analysis (Fig. 1e). By default, n is equal to 40, 60, 80, 100 and 200 regions per pairwise comparison.

SynTracker is sensitive to structural variants, not to SNPs

We examined SynTracker’s performance and estimated the effect of different genomic variations on the synteny scores. In a first test, we used Bacmeta²² to generate in silico simulations of the evolution of bacterial populations. We performed two types of simulations: (1) the population was evolved by introducing SNPs exclusively and (2) only insertions and deletions were introduced. At each time point, for each genomic region, we sampled 20 ‘bacteria’ and calculated all pairwise synteny scores in addition to all pairwise sequence identities (that is, three sets of 190 pairwise comparisons at each time point; Fig. 2a and Extended Data Fig. 2). In simulations using SNPs, the minimal average BLAST identities were 99.48%, 99.46% and 99.5%, for regions 1, 2 and 3. The lowest average BLAST identities in simulations based on insertions and deletions were higher, at 99.79%, 99.79% and 99.84% (P < 2.2 × 10⁻¹⁶). The minimal average synteny scores in SNP-based simulations were higher (0.995, 0.995 and 0.981) than in indel-based simulations (0.103, 0.103 and 0.206; P < 2.2 × 10⁻¹⁶) even though the mutation frequency in SNP-based simulations was tenfold higher than the indel-based simulations. The lower synteny scores of genomic regions in the indel-based simulations highlight the higher sensitivity of the synteny-based approach to indels. These results show that populations evolving exclusively through the introduction of SNPs have a marginal reduction in the synteny scores compared to populations that evolve through the introduction of insertions and deletions at a lower mutation frequency.

**Fig. 2: SynTracker shows robust strain-resolving performance using a small fraction of the genome length.**

SynTracker classifies strains using a fraction of the genome

We examined the performance of SynTracker when comparing closely related genomes and assessed APSS values as the basis for the clustering of genomes into phylogenetic groups. We used a published classification of >10,000 Escherichia coli genomes based on whole-genome nucleotide content (Mash²³), which identified 14 distinct phylogroups²⁴. We randomly selected ten genomes per phylogroup and analyzed these 140 genomes eight times, randomly selecting 20–200 regions per pairwise comparison (representing ~1.8–18.5% of the E. coli O157:H7 genome length). We used the APSS values (Fig. 1e) to construct phylogenies.

The published Mash tree demonstrates the classification of E. coli genomes into 14 phylogroups; we checked whether we could recapitulate these phylogroups using APSS. We compared the Mash-based tree and the synteny-based trees using two methods: (1) calculating the Robinson–Foulds distance (RFD) between trees²⁵ and (2) using phylogenetic information content²⁶. To determine the statistical significance of these distances, we further calculated the distances (both RFDs and phylogenetic information content) between the published Mash tree and 1 × 10⁵ randomly generated trees (Fig. 2c,d and Extended Data Fig. 4). We observed that the number of sampled regions per pairwise comparison was inversely correlated to the RFDs and positively correlated to the phylogenetic information (Extended Data Figs. 3 and 4). These results indicate that, when a greater proportion of the genome was sampled by SynTracker, the Mash-based and APSS-based trees became more similar. Regardless of the proportion of the genome sampled, for both tree comparison methods, the resulting P value was smaller than 1 × 10⁻⁵ in all subsampling values. This indicates that, even with less than 2% of the genome sampled, the APSS-based tree recapitulated the published tree. Importantly, this result indicates that synteny can be used to classify E. coli strains into the ‘correct’ phylogroup using a very small fraction of the whole genome.

In a second analysis, we used data for five Nitrosococcus oceani strains isolated in different global locations²⁷. The authors used a gene synteny-based analysis using genome content and alignments. Their analysis revealed that the genomes of four of the strains (C-107, NS58, C-27 and AFC27) were highly conserved in content and gene synteny, while strain AFC132 contained an additional gene repertoire and differed in gene synteny from the other four genomes. Using ~28% of the genome, our SynTracker analysis corroborated these findings; pairwise comparisons of strains C-107, NS58, C-27 and AFC27 resulted in high APSSs (0.968–0.998), while AFC132 showed a much lower APSS to the others (0.82–0.87) (Fig. 2e).

Setting APSS thresholds for same-strain designations

In most strain-tracking software, score thresholds are used to determine whether the same strain is present in multiple samples^28,29. To apply this concept to SynTracker results, we sought to determine an APSS threshold to use for designating two strains as the same. We note that the definition of a strain is ambiguous and highly subjective. Here, we assumed that two members of a species identified in the same individual over a time period of months likely belong to the same strain, while two members of a species colonizing different individuals likely belong to different strains. To establish an APSS threshold for the same strain for species of the human gut microbiome, we based our analysis on the longitudinal human gut metagenome³⁰. We divided the dataset into training and testing sets, consisting of 117 and 106 metagenomic samples, respectively, and calculated the APSS scores for all strain pairs from 33 species in the training set. We then used the Youden statistic³¹ to determine the APSS value that optimally classified strain pairs as the same or different, which served as the basis for the thresholds used in subsequent analyses (Methods).

To check how well SynTracker performed when tracking strains, we first used it to track strains in the Poyet et al.³⁰ testing set and determined the average specificity and sensitivity with each number of subsampled regions per pairwise comparison (Methods, Supplementary Table 8 and Extended Data Fig. 7). We then applied the APSS thresholds to a separate dataset to identify strain sharing between mothers and infants³². We observed a high proportion of strains shared between mothers and infants when infants were very young. Interestingly, as infants aged, the total number of shared strains increased, while their proportion of all pairwise comparisons decreased (Extended Data Fig. 8). These results highlight SynTracker’s utility for uncovering biologically meaningful phenomena when used as a standalone strain-comparison tool.

Synteny-based and SNP-based analyses reveal modes of genome evolution

Given SynTracker’s high sensitivity to structural genomic variations and low sensitivity to SNPs, we aimed to use it in combination with an SNP-based strain-tracking tool to examine how each of these tools captures different types of within-species genomic diversity. We applied SynTracker in combination with inStrain, a widely used strain-tracking tool with high sensitivity to SNPs²⁸. The datasets we used here were (1) a collection of 12 Neisseria gonorrhoeae clinical isolates harboring antibiotic resistance (Supplementary Table 2 and Methods); (2) a population of hypermutator E. coli isolates that emerged after the colonization of four mice with two ancestral E. coli substrains³³ (Supplementary Table 3); (3) 77 H. pylori clinical isolates obtained from six individuals² (Supplementary Table 4); and (4) a collection of Streptomyces rimosus M527 from different fermentations (Supplementary Table 5 and Methods).

First, we performed 66 pairwise comparisons for the N. gonorrhoeae isolates. This species yielded a very high correlation between the SNP-based and synteny-based strain similarity scores (Spearman’s $\rho$ = 0.985), suggesting that the genomic diversity of this species is achieved through both point mutations and structural genomic variation (Fig. 3a). SynTracker identified a larger proportion of the comparisons as the same strain compared to inStrain. Second, for the E. coli population, we performed 185 pairwise comparisons using inStrain; none of them were classified as the same strain using inStrain’s default same-strain threshold. In contrast, the SynTracker analysis classified all as belonging to the same strain using an APSS threshold of 0.955 (slightly more stringent than the threshold used above). Third, when applied to the H. pylori genome data, we performed 21–91 pairwise comparisons per participant and observed that, in three of the participants (194, 249 and 295; Extended Data Fig. 5), both tools assigned a majority of the genome pairs to the same strains. In the three other participants (322, 326 and 439), however, we detected additional subpopulations: subsets of genome pairs classified as belonging to the same strain by only one of the tools or neither (Fig. 3 and Extended Data Fig. 5). These results indicate that, within a single species and within a single host, subsets of H. pylori are generating genomic diversity using two very different modes, sometimes together and sometimes independently. Fourth, for the S. rimosus samples, we performed 185 pairwise comparisons and observed that, while all strain pairs appeared as clonal in the SNP-based analysis, the SynTracker analysis resulted in a wide range of APSS values, with some classified as different strains. This result suggests that the within-species genomic variation of S. rimosus is achieved through structural differences but not SNPs (Fig. 3d). These combined results are consistent with inStrain’s sensitivity to SNPs and SynTracker’s sensitivity to genomic structural differences. Each method provides a different view of the within-species genomic diversity. When used in combination, they powerfully synergize to generate a more comprehensive view of the modes of evolution underway.

**Fig. 3: Combined synteny-based and SNP-based analysis of simple microbial populations.**

To validate our findings, we performed genome alignments for randomly selected isolates from the E. coli, H. pylori and S. rimosus datasets. In agreement with our results, we observed that the overall genome order was more uniform in E. coli compared to H. pylori and S. rimosus (Methods and Extended Data Fig. 6).

SynTracker applied to microbiomes reveals evolutionary patterns

We applied SynTracker to fecal metagenomes obtained from 1,133 individuals (747 adults and 386 related infants) residing in three countries³⁴. In a first analysis, we used MIPs (both related and unrelated) and combined the SynTracker analysis with inStrain analysis to identify species with distinct modes of accumulating genomic diversity, that is, point mutations or insertions, deletions and recombination. In a second analysis, we excluded true MIPs and tracked strains across locations to characterize spatial patterns exhibited by different species. In both analyses, we used the same set of MAGs as a reference for both SynTracker and inStrain.

1.
Detection of hypermutators and hyper-recombinators from metagenome data. Overall, SynTracker made twice as many strain comparisons as inStrain (40,000 versus 19,000), of which ~12,000 pairwise comparisons were performed by both tools. Despite the fact that this subset of comparisons was detected by both tools, each tool classified a different set of pairs as having the highest strain similarities (Fig. 4a). To identify which taxa comprised the sets of strains flagged as most similar by each of the tools, we compared the enrichment P values of each species in the two subsets (that is, most similar 5% of the strain comparisons according to each tool). Most species showed similar enrichment in both sets, suggesting that their within-species genomic diversity originated from both SNPs and structural differences. However, a subset of species showed differential enrichment in one or the other set, indicating that their genomic diversity originated preferentially from point mutations (hypermutators) or from structural differences (hyper-recombinant or increased indel rate). Hypermutators included Phocaeicola vulgatus, B. fragilis and Alistipes putredinis, while hyper-recombinators included Phocaeicola massiliensis, Streptococcus thermophilus, Prevotella spp., Streptococcus gallolyticus and others (Fig. 4b).
Fig. 4: Combined synteny-based and SNP-based analysis of human gut microbiome reveals patterns of within-species genomic diversity.
a, Synteny-based (APSS) and SNP-based (popANI) strain similarities in human gut metagenomes collected in Gabon, Germany and Vietnam. Dots represent pairwise comparisons within a species. Dots are colored when they are in the top 5% most similar: blue by inStrain, red by SynTracker and purple by both (all others are gray). Density plots show the distribution of scores for each tool. b, Species showing enrichment of SNPs or structural differences. Points represent species. The x axis shows the ratio between the enrichment of each species (that is, the hypergeometric P value) in the most similar 5% of the comparisons by tool. Right, species enriched in SNPs; left, species enriched in structural differences. The y axis indicates the degree of enrichment.
Full size image
2.
Networks built from APSS values allow the visualization of strain patterns. To visualize strain relatedness, we used APSS scores to build networks, where nodes represent hosts and edges are weighted by their APSS. In this analysis, we used SynTracker to assess whether conspecific strains sampled from persons living in geographic proximity are more similar to each other compared to strains obtained from people living further apart. In total, we were able to assess 145 species (that is, those identified with a sufficient genome coverage in >5 hosts), 61 of which showed significantly higher APSS values in individuals living in the same province (in Gabon or Vietnam) or in the same federal state (Germany), compared to individuals living in different provinces or states (Fig. 5a). For 39 species, the difference in mean APSS between strains obtained from the same or different province or state was medium to large (|d| > 0.5; Fig. 5a).

**Fig. 5: Strain analysis of human gut metagenomes collected in Gabon, Germany and Vietnam.**

The networks highlighted strain patterns by species by country. For instance, Lachnospira rogosae and Agathobacter rectalis showed distinct strain clusters within countries (Cohen’s d = 1.09 and 0.99, respectively; Fig. 5b). Similarly, the strain comparisons for Lachnospira sp003537285 resulted in two unconnected network clusters, one from Vietnam and one from Germany (with one sole carrier from Gabon). These results suggest the within-population evolution of strains with limited dispersal. In contrast, the genomes of two E. coli subspecies (Cohen’s d = 0.09) formed a network composed of distinct clusters, each containing participants from all three countries (Fig. 5b). This pattern suggests a high degree of geographic exchange across large distances. The species A. putredinis also yielded a network suggestive of cosmopolitan strains.

The network diagrams also allow subtle patterns of strain distribution to be readily identified. For instance, Bifidobacterium longum is primarily a cosmopolitan species (according to Cohen’s d value) and its strain comparisons resulted in three clusters, each composed of participants living in a single country. Surprisingly, a fourth cluster was made up from strains obtained from participants from all three countries. This pattern suggests the coexistence of geographically constrained and cosmopolitan strains within B. longum. Taken together, the network visualization of APSS scores provides a holistic view of strain patterns to emerge from highly complex data and the simultaneous assessment of strain dynamics. Although we applied this concept to spatial patterns here, any other type of metadata could be applied to the networks to aid interpretation of the patterns.

Benchmarking SynTracker against other strain-comparison tools

To benchmark SynTracker against three state-of-the-art strain-comparison tools (MIDAS³⁵, StrainPhlAn³⁶ and inStrain²⁸), we used a test performed in the literature²⁸, in which these tools were used to classify strains as shared or nonshared at different average nucleotide identity (ANI) cutoffs. In this test, gut metagenomic samples obtained from three pairs of premature infant twins were used. As the ground truth, conspecific strains residing in unrelated infants were assumed to be different, while those residing in twin infant pairs were considered the same, based on previous findings³⁷. In a similar manner, we used SynTracker to identify shared strains in these metagenomic samples over a range of APSS values. The reference genomes used in this analysis were the species-representative genomes (SRGs; MAGs assembled from the same metagenomic samples and clustered on the basis of ANI) obtained from the literature²⁸.

To determine the performance of each of the methods, we generated a receiver operating characteristic curve (ROC) for each tool and calculated the area under the curve (AUC), which is an indicator of its performance. SynTracker achieved the highest AUC value (0.93) among the previously published tools when using 40 regions per pairwise comparison (Fig. 6a). SynTracker proved to have the highest sensitivity, detecting 112 strain pairs compared to 85, 92 and 101 in StrainPhlAn, inStrain and MIDAS, respectively. Additionally, SynTracker proved to be the most effective tool when tracking plasmid and phage strains, outperforming inStrain (Fig. 6b).

**Fig. 6: Performance of four different strain-comparison tools.**

Discussion

We introduce SynTracker, a highly performant tool for comparing segments of DNA (microbial strains) using genome microsynteny. As a standalone tool, SynTracker can be used to compare genomes or genome fragments obtained from sources such as plasmids, viruses and microbiota, with applications such as microbial strain tracking. Combined with SNP-based tools, SynTracker can be used in complex microbiomes to differentiate species whose genomic diversity results from hypermutation (increased point-mutation rate) or hyper-recombination (normal point mutation rate but high levels of genomic structural differences, such as recombination, insertions and deletions). The combination of synteny-based and SNP-based comparisons is powerfully synergistic; together, these approaches can provide new insights into the mechanisms driving species and strain diversification in microbiomes.

Although genomic information has been used for decades to delineate species, there is currently no single definition for microbial species. Defining strains is equally if not more challenging. Definitions vary by context and by researcher. For example, genomes were assigned to different strains if they presented different phenotypes and differed by only a few SNPs³³ and, in other instances, a minimal genome similarity was used to assign strains^28,35. Because of the subjectivity inherent to classifying strains as the same or different, we recommend that researchers consider avoiding such classifications and instead assess strain relatedness using measures of genomic differences, such as the APSS presented here.

The classification of genomes into strains is sometimes necessary, however. Strain delineations used in strain tracking often use a minimal cutoff value of genomic similarity to bin strains into same or different categories^28,35,38. Such thresholds are subjective and depend on the species being considered, as different taxa accumulate genomic differences at different rates, among other factors. For example, when analyzing longitudinal data, one challenge is to assess when a microbial lineage has accumulated enough differences compared to the ancestral population to be considered a different strain. The widely used software inStrain uses a popANI cutoff of 99.999%, which is based on comparisons of strains in (1) mock communities and (2) the same infant over a few days to a few weeks²⁸. Similarly, we set an APSS score cutoff of 0.95 to designate the same strain in the human gut on the basis of an analysis of the APSS of genome pairs obtained from longitudinal samples from the same or different hosts. This was based on the assumption that the same strain persists in the gut of healthy adult individuals over a time period of weeks to months, while conspecific strains inhabiting unrelated individuals likely belong to different strains.

It is important to note that this APSS cutoff was based on averaging different per-species cutoffs and may not be optimal for all settings. The researcher needs to consider their context and question and set the appropriate cutoff. Factors to consider include the evolvability of the microbial species, the type and magnitude of selection and the user’s interpretation of the term ‘strain’. We encourage users analyzing nontypical microorganisms or those who wish to define APSS cutoffs tailored to their system to perform an analysis similar to our analysis of the dataset in the literature³⁰ using a training dataset relevant to their research question.

We previously applied SynTracker to gut metagenomes from MIPs to show strain sharing between MIPs and community strain transmission³⁴. Applied to the mother–child metagenome dataset in the literature³², results from SynTracker largely corroborated those obtained from SNP-based tools for the strain persistence of early-colonizing species in the infant gut^39,40 and the acquisition of strains during late infancy^35,41. Our analysis of metagenomes from geographically distinct hosts³⁴ combined with network-graph analysis based on synteny distances revealed that a notable portion of the species we assessed are made up of strains with a limited geographic distribution, indicating that they are local to a population. Intriguingly, some species contained both geographically distinct and cosmopolitan strains.

Another powerful way to use the output of SynTracker is in combination with that of an SNP-based tool, such as inStrain. Combined, these approaches readily reveal differences between species in the nature of their genomic diversity. The genome of the human pathogen N. gonorrhoeae displays both point mutations and structural changes, such that the similarities of strain pairs according to the two methods is highly correlated. This is consistent with the ease of recombination and variable genome reported for this species⁴². Similarly, H. pylori, a species with potentially both a high rate of genome rearrangement and a high mutation frequency^43,44 displayed both in our combined analysis. This was also the dominant pattern we observed for the majority of human gut microbiome species.

The combined use of SynTracker and inStrain allowed us to detect species with outlier behavior: those enriched in either structural differences or mutations. A single-species example is S. rimosus, in which we observed a high level of structural differences with a low level of point mutations. In accord, Streptomyces are known for high rates of recombination^45,46,47. Our analysis of the human microbiome revealed a few species with this hyper-recombinator phenotype, from several phyla: Bacteroidota, Bacillota and Pseudomonadota. Two belonged to the genus Streptococcus, which was also previously found to evolve through greater recombination relative to mutation^48,49. In contrast, our combined analysis of single species revealed E. coli as a hypermutator, in line with previous observations^33,50. In the human gut microbiome, hypermutator species included A. putredinis, B. fragilis and P. vulgatus. All are members of Bacteroidales. Of these, only B. fragilis has been studied for its mode of evolution and the results were consistent with our observations¹. Generally, our single-species observations are consistent with published studies. The synergetic combination of synteny-based and SNP-based tools allows the simultaneous observation of potentially thousands of species from a metagenome.

The combined use of SNP-based and synteny-based analyses will allow researchers to delve into the phylogenetic and environmental contexts underlying higher-than-expected mutation or recombination in select taxa. E. coli grown in the laboratory displays high rates of point mutation^33,50. In our analysis of human gut metagenomes, however, E. coli was revealed to preferentially undergo recombination. Although these differences in how genomic diversity is generated may be attributed to the different subtypes of E. coli, these results illustrate the potential of combining synteny-based and SNP-based approaches to unravel the forces operating on species to drive one mode of evolution over another. Using synteny to discern strains undergoing high levels of recombination may have practical implications as functional differences can emerge from recombination and/or mutation. For instance, the respiratory pathogen S. pneumoniae develops antibiotic resistance through recombination⁶. Phase variation in many important bacterial pathogens can also occur through reversible recombination events, generating subpopulations that escape immune recognition^8,9,10. Thus, high recombination with low levels of mutation may also lead to important functional differences between strains, motivating the use of SynTracker to identify subpopulations that would be missed using ANI-based tools alone.

SynTracker can be used reliably as a standalone tool, similarly to other strain-tracking methods, while offering a number of advantages. We demonstrated SynTracker’s ability to track strains using only a fraction of the complete genome (that is, 2% of E. coli genome), while most other tools require coverage of 50% or more of the genome length. Moreover, benchmarking of SynTracker against state-of-the-art strain-comparison tools showed that SynTracker outperformed MIDAS³⁵ and StrainPhlAn³⁶ and achieved a slightly better performance score than InStrain²⁸ in terms of accuracy, while outperforming all tools in terms of sensitivity (that is, the number of strain comparisons identified). Likely because of the fact that it can work well with limited data, SynTracker achieved a significantly higher AUC than inStrain when applied to phages and plasmids. Thus, when a single tool is to be used, SynTracker is particularly recommended for phages, plasmids and other low-data-input applications, such as tracking strains of rare taxa.

Methods

SynTracker pipeline

The SynTracker pipeline consists of three main parts. In the first part, SynTracker accepts a collection of reference genomes (a single genome per species), either fully assembled or as a collection of contigs. Each per-species reference is fragmented into a collection of 1-kbp central regions, which are binned and stored together.

In the second part, SynTracker creates a BLAST database based on a user-provided collection of metagenomic assemblies or genomes. Next, it performs a BLAST search, for each of the central regions, against the newly created BLAST database with a minimal identity of 97% and a minimal query coverage of 70% (that is, 700 bp). In the final step of this part, hits for each BLAST search are retrieved, in addition to a 2-kbp region on each side of the BLAST hit using the ‘blastcmddb’ command. Hits within <2 kbp both upstream and downstream to the hit are excluded from further analysis. Each retrieved sequence is denoted by its sample of origin and matching region in the reference genome.

In the third part of the pipeline, genomic fragments are grouped by their matching region in the reference genome and pairwise alignments are conducted to identify synteny blocks in each pair of sequences. The identification of synteny blocks in each pairwise alignment is performed using the ‘FindSynteny’ function in the DECIPHER R package²¹, with parameters ‘maxGap’ and ‘maxSep’ both set to 15. Additionally, only pairwise comparisons with a minimal overlap of 4,800 bp are considered for downstream analysis. Next, per each pairwise alignment, a synteny score is calculated:

$${\rm{synScore}}={1}+{{{\log }}}_{{10}}\left(\left({\rm{Ov}/{len}}\right)/{B}\right)$$

where Ov stands for the accumulative length of the overlapping synteny blocks identified in the pairwise alignment, len denotes the length of the shorter sequence in each pair and B stands for the number of synteny blocks identified in each pairwise alignment.

In the final step of the third part of the pipeline, for each reference genome, n genomic regions are randomly selected per pair of metagenomic samples or genomes. The APSS is calculated by averaging the individual pairwise synteny scores. Pairs of samples or genomes with fewer than n regions are excluded from downstream analysis.

In silico evolutionary simulations

Calculation of the synteny scores per group of sampled cells was performed as described above. However, as the length of the genomic fragments used in the simulation was limited to 20 kbp, synteny scores were based on a single alignment of the ~20-kbp region per pair of simulated genomes. Statistical significance of the differences between BLAST identities or synteny scores in the indel-based and SNP-based simulations was performed by applying the Welch t-test on the entire set of pairwise comparisons at the specified above time point.

Classification of bacterial genomes

Calculation of APSS values for E. coli strain pairs was performed as described above and using the E. coli str. K-12 substrain MG1655 genome as a reference (National Center for Biotechnology Information (NCBI) reference sequence: NC_000913.3).

Synteny-based phylogenetic trees were constructed using n regions per pairwise comparison (n = 20–200). For each number of regions per pairwise comparison, APSS values were converted to synteny distances, which are equal to the absolute value of APSS − 1. All pairwise synteny distances were placed in a symmetric matrix that was used to calculate unweighted pair group method with arithmetic mean phylogenetic trees using the ‘phangorn’ R package⁵¹.

The full Mash distance-based phylogenetic tree, including >10,000 genomes, was imported from the literature²⁴ (‘Supplementary_Data_Fig_3c_10k_newick.nwk’ file). Tree labels were converted from MD5 to assembly accession, followed by pruning of the Mash-based tree so that only external nodes corresponding to the 140 E. coli genomes analyzed with SynTracker were kept.

The distance between each of the synteny-based phylogenetic trees and the Mash-based tree was determined by calculating the Information-based generalized RFDs using the ‘TreeDist’ R package⁵².

The significance level for each of the tree comparisons was calculated by shuffling the labels of the synteny-based tree (100,000 iterations for each comparison) and calculation of the fraction of the randomized trees with an RFD equal to or lower than the distance between the actual synteny-based tree and the Mash-based tree.

Strain comparisons in experimentally evolved E. coli populations

E. coli sequencing libraries³³ were downloaded from the NCBI Sequence Read Archive (SRA) (Supplementary Table 4) and were quality-filtered as described previously⁵³. Sequencing libraries were assembled using the SPADES genome assembler⁵⁴. Nonhypermutator isolates (bearing only few mutations compared to the reference genome) were excluded from downstream analysis. For both inStrain and SynTracker analyses, the same reference genome was used, as described in the literature³³. In the SynTracker analysis, 200 regions per pairwise comparison were used.

Strain comparisons of H. pylori clinical isolates, N. gonorrhoeae isolates and S. rimosus isolates

A total of 77 sequencing libraries of clinical H. pylori isolates², 20 libraries of S. rimosus isolates and 12 libraries derived from clinical isolates of N. gonorrhoeae were obtained from the NCBI SRA database (Supplementary Tables 3 and 5), quality-filtered and assembled as described above. For both inStrain and SynTracker analyses, the same reference genomes were used: H. pylori, GenBank: CP032479.1; S. rimosus, GCF_000331185.2_ASM33118v2; N. gonorrhoeae, GCF_900087635.2. In the SynTracker analysis, 200 regions per pairwise comparison were used, except for N. gonorrhoeae, where 60 regions per pairwise comparison were used.

Whole-genome alignments of H. pylori, N. gonorrhoeae and S. rimosus assemblies

Whole-genome alignments were performed using the Mauve software⁵⁵. First, the contigs of each assembly were ordered using the reference genomes described above and then aligned against each other using the software’s default parameters.

Combined SNP-based and synteny-based analysis of human gut metagenomes

For the combined analysis, we further used the recently published inStrain and SynTracker analyses of infant–adult pairs³⁴. To create the APSS versus popANI plot, we used the subset of strain comparisons identified by both tools (>12,000 strain pairs). The SynTracker analysis was performed using 30 regions per pairwise comparison.

To identify species with a preferential mode of within-species genomic diversity, we first calculated the enrichment of each species at the 5% most similar strain pairs according to inStrain (popANI used as the distance matrix) and SynTracker (APSS). The enrichment of each species was determined using the hypergeometric distribution probability test:

$${p}({X}={k})=(({K}/k)(({N}-{K})/(n-k)))/({N}/n)$$

where N is the number of pairwise comparisons in each dataset, n is the number of the most similar pairwise comparisons (that is, 5% of N), K is the number of pairwise comparisons in each species in the entire dataset and k is the number of pairwise comparisons in each species in the most similar subset.

Determining SynTracker’s performance

For each of the 33 studied species, we used a publicly available reference genome (Supplementary Table 9), which was fragmented into a collection of 1-kbp central regions, as described above. Next, we performed a per-sample, de novo metagenomic assembly to construct our search space (Methods and Fig. 1). The metagenomic assemblies were divided randomly into training and testing sets (117 and 106 samples, obtained from 45 and 43 donors, respectively). For both sets, we calculated eight different final APSS matrices per species after randomly selecting n regions per pairwise comparison (n = 15–200 5-kbp regions). We then classified pairwise comparisons in the training set to those originating from the same host at different time points (within-host set) and those originating from different hosts (between-host set). With the classification of pairwise comparisons in the training set used as ground truth, we generated ROCs⁵⁶ to determine the specificity and sensitivity of the classification at different APSS values. We based our curves on 100 iterations for each combination of species and subsampling value.

We next aimed to assess the performance of SynTracker using the 33-species test set. We first determined the APSS values that optimally discriminated within-host comparisons from between-host comparisons. Pairwise comparisons with APSSs higher than these ‘optimal’ values were considered as same-strain comparisons, while those below them were considered as distinct strains. This was achieved by calculating the J index³¹ for each combination of species and subsampling depth. We defined the ‘threshold APSS’ for each sampling level as the average of optimal APSS for all species for a given number of genomic regions per pairwise comparison used. Finally, we used these APSS thresholds to determine the specificity and sensitivity of our method by introducing them to the testing set. To avoid sampling bias, we repeated this process 100 times for each combination of species and sampling level and calculated the average specificity and sensitivity values. As expected, we observed a direct correspondence between the number of subsampled regions and sensitivity and specificity (Extended Data Fig. 7 and Supplementary Table 8), with maximal average sensitivity and specificity values of 98.6% and 97% for comparisons calculated using 200 regions per pairwise comparison. While a small number of regions per pairwise comparison mostly result in lower accuracy, the decision to use such values may be justified by the inclusion of additional samples in the analysis. The researcher can decide whether to prioritize increased accuracy or sample size.

Longitudinal metagenomes were obtained from the NCBI SRA database and were quality-filtered as described previously⁵³. Metagenomic samples were de novo assembled using metaSPAdes⁵⁴, with a maximal number of 20 million reads per sample. ROC curves and matching APSS thresholds for each combination of species and sampling depth in the training set were calculated using the R programing language ‘pROC’ package⁵⁷. Best threshold values were calculated by aggregating the results of 100 iterations for each combination of species and sampling depth. The reported specificity and sensitivity were calculated on the basis of 100 iterations per combination.

Benchmarking SynTracker

ROC plots for MIDAS, StrainPhlAn and instrain were based on the analyses of microbiomes from six premature infants conducted in the literature²⁸. The original table was filtered to exclude phage and plasmid comparisons performed using inStrain to reflect the results shown in the previous study²⁸.

To benchmark SynTracker, we first downloaded the same raw-read files used for the tests performed by the authors²⁸. Quality filtration and de novo assembly were performed as described above. To allow a more accurate comparison to other tools, we used the SRGs (that is, genomes representing a cluster of MAGs with ANI > 99%) assembled by the authors²⁸ as reference genomes input to SynTracker. ROC plots were prepared as described above.

Mother–infant strain transmission

Metagenomic samples were downloaded from the NCBI SRA database and were quality-filtered and assembled as described above. However, as only one of the two matching read files passed our quality filtration for some samples, we performed the metagenomic assembly using single-end reads.

Strain similarities in different geographic locations

To identify location-specific strains, we used metagenomic samples collected in Gabon, Vietnam and Germany. Sample collection, metagenomic assembly and clustering of MAGs into SRGs were performed as described previously³⁴. For our search space (Fig. 1), we used the collection of paired sample assemblies; as the reference genomes, we used the previously generated SRGs. APSS was calculated using 80 5-kbp long regions per each pairwise comparison, yielding 156 species with sufficient coverage.

To avoid the influence of mother–infant strain transmission on the within-region synteny scores, we filtered out strain comparisons between mothers and their own infants. We also removed subjects who were not born in the country in which they currently resided to eliminate comparisons between the local population and expats. The location of the subject was based on the residence place as supplied by each subject.

The statistical significance of the differences in strain comparisons within and between provinces was determined by the Wilcoxon–Mann–Whitney test, with test results corrected for multiple testing using the Benjamini–Hochberg correction, as described above. Matching effect sizes (Cohen’s d) were calculated and classified using the ‘effsize’ R package (v.0.8.1)⁵⁸. Species-specific networks were generated using the CLANS software⁵⁹, with −log₁₀(APSS) used as edge weights. Minimal APSS scores to be considered in the analysis were ~0.75. Singletons (that is, subjects with no connections in the network) were removed from the network graphs.

Recommended use of reference genomes

As state-of-the-art methods for the assembly of genomes from metagenomes yield ever-larger collections of MAGs, we propose to follow the inStrain program workflow²⁸, in which the MAG collection is clustered to create SRGs to be used as the reference genomes in our pipeline. This approach should allow expansion of the study strains for novel species.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Datasets used in this paper are specified in the Supplementary Information.

Code availability

The SynTracker pipeline is available at https://github.com/leylabmpi/SynTracker.

Custom code used for downstream analysis is available at https://github.com/leylabmpi/Synteny_analyses.

References

Zhao, S. et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host Microbe 25, 656–667 (2019).
Article CAS PubMed PubMed Central Google Scholar
Wilkinson, D. J., Dickins, B., Robinson, K. & Winter, J. A. Genomic diversity of Helicobacter pylori populations from different regions of the human stomach. Gut Microbes 14, 2152306 (2022).
Article PubMed PubMed Central Google Scholar
Lo Presti, L. et al. Fungal effectors and plant susceptibility. Annu. Rev. Plant Biol. 66, 513–545 (2015).
Article CAS PubMed Google Scholar
Sharafutdinov, I. et al. A single-nucleotide polymorphism in Helicobacter pylori promotes gastric cancer development. Cell Host Microbe 31, 1345–1358 (2023).
Article CAS PubMed Google Scholar
Viana, D. et al. A single natural nucleotide mutation alters bacterial pathogen host tropism. Nat. Genet. 47, 361–366 (2015).
Article CAS PubMed PubMed Central Google Scholar
Chewapreecha, C. et al. Dense genomic sampling identifies highways of pneumococcal recombination. Nat. Genet. 46, 305–309 (2014).
Article CAS PubMed PubMed Central Google Scholar
Joseph, B. et al. Virulence evolution of the human pathogen Neisseria meningitidis by recombination in the core and accessory genome. PLoS ONE 6, e18441 (2011).
Article CAS PubMed PubMed Central Google Scholar
Trzilova, D. & Tamayo, R. Site-specific recombination—how simple DNA inversions produce complex phenotypic heterogeneity in bacterial populations. Trends Genet. 37, 59–72 (2021).
Article CAS PubMed Google Scholar
Sánchez-Busó, L., Comas, I., Jorques, G. & González-Candelas, F. Recombination drives genome evolution in outbreak-related Legionella pneumophila isolates. Nat. Genet. 46, 1205–1211 (2014).
Article PubMed Google Scholar
McAdam, P. R. et al. Gene flow in environmental Legionella pneumophila leads to genetic and pathogenic heterogeneity within a Legionnaires’ disease outbreak. Genome Biol. 15, 504 (2014).
PubMed PubMed Central Google Scholar
Anyansi, C., Straub, T. J., Manson, A. L., Earl, A. M. & Abeel, T. Computational methods for strain-level microbial detection in colony and metagenome sequencing data. Front Microbiol. 11, 1925 (2020).
Article PubMed PubMed Central Google Scholar
Wielgoss, S. et al. Mutation rate dynamics in a bacterial population reflect tension between adaptation and genetic load. Proc. Natl Acad. Sci. USA 110, 222–227 (2013).
Article CAS PubMed Google Scholar
& Swings, T. et al. Adaptive tuning of mutation rates allows fast response to lethal stress in Escherichia coli. eLife 6, e22939 (2017).
Article PubMed PubMed Central Google Scholar
Lemoine, F., Lespinet, O. & Labedan, B. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol. Biol. 7, 237 (2007).
Article PubMed PubMed Central Google Scholar
Alexeev, N. & Alekseyev, M. A. Estimation of the true evolutionary distance under the fragile breakage model. BMC Genomics 18, 356 (2017).
Article PubMed PubMed Central Google Scholar
Zhao, T. et al. Whole-genome microsynteny-based phylogeny of angiosperms. Nat. Commun. 12, 3498 (2021).
Article CAS PubMed PubMed Central Google Scholar
Adato, O., Ninyo, N., Gophna, U. & Snir, S. Detecting horizontal gene transfer between closely related taxa. PLoS Comput. Biol. 11, e1004408 (2015).
Article PubMed PubMed Central Google Scholar
Van Rossum, T., Ferretti, P., Maistrenko, O. M. & Bork, P. Diversity within species: interpreting strains in microbiomes. Nat. Rev. Microbiol. 18, 491–506 (2020).
Article PubMed PubMed Central Google Scholar
Schirmer, M., D’Amore, R., Ijaz, U. Z., Hall, N. & Quince, C. Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data. BMC Bioinf. 17, 125 (2016).
Article Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Article CAS PubMed Google Scholar
Wright, E. S. Using DECIPHER v2.0 to analyze big biological sequence data in R. R J. 8, 352–359 (2016).
Article Google Scholar
Sipola, A., Marttinen, P. & Corander, J. Bacmeta: simulator for genomic evolution in bacterial metapopulations. Bioinformatics 34, 2308–2310 (2018).
Article CAS PubMed Google Scholar
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Article PubMed PubMed Central Google Scholar
Abram, K. et al. Mash-based analyses of Escherichia coli genomes reveal 14 distinct phylogroups. Commun. Biol. 4, 117 (2021).
Article CAS PubMed PubMed Central Google Scholar
Meilă, M. Comparing clusterings—an information based distance. J. Multivar. Anal. 98, 873–895 (2007).
Article Google Scholar
Steel, M. & Penny, D. Maximum parsimony and the phylogenetic information in multistate characters. In Parsimony, Phylogeny, and Genomics (ed. Albert, V. A.) (Oxford University Press, 2006).
Wang, L., Lim, C. K. & Klotz, M. G. High synteny and sequence identity between genomes of Nitrosococcus oceani strains isolated from different oceanic gyres reveals genome economization and autochthonous clonal evolution. Microorganisms 8, 8 (2020).
Article Google Scholar
Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).
Article CAS PubMed PubMed Central Google Scholar
Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).
Article CAS PubMed PubMed Central Google Scholar
Poyet, M. et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat. Med. 25, 1442–1452 (2019).
Article CAS PubMed Google Scholar
Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
Article CAS PubMed Google Scholar
Bäckhed, F. et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host Microbe 17, 852 (2015).
Article PubMed Google Scholar
Ramiro, R. S., Durão, P., Bank, C. & Gordo, I. Low mutational load and high mutation rate variation in gut commensal bacteria. PLoS Biol. 18, e3000617 (2020).
Article CAS PubMed PubMed Central Google Scholar
Suzuki, T. A. et al. Codiversification of gut microbiota with humans. Science 377, 1328–1332 (2022).
Article CAS PubMed PubMed Central Google Scholar
Nayfach, S., Rodriguez-Mueller, B., Garud, N. & Pollard, K. S. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. 26, 1612–1625 (2016).
Article CAS PubMed PubMed Central Google Scholar
Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
Article CAS PubMed PubMed Central Google Scholar
Raveh-Sadka, T. et al. Evidence for persistent and shared bacterial strains against a background of largely unique gut colonization in hospitalized premature infants. ISME J. 10, 2817–2830 (2016).
Article PubMed PubMed Central Google Scholar
Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 41, 1633–1644 (2023).
Article PubMed PubMed Central Google Scholar
Ferretti, P. et al. Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe 24, 133–145 (2018).
Article CAS PubMed PubMed Central Google Scholar
Lou, Y. C. et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep. Med. 2, 100393 (2021).
Article CAS PubMed PubMed Central Google Scholar
Korpela, K. et al. Selective maternal seeding and environment shape the human gut microbiome. Genome Res. 28, 561–568 (2018).
Article CAS PubMed PubMed Central Google Scholar
Quillin, S. J. & Seifert, H. S. Neisseria gonorrhoeae host adaptation and pathogenesis. Nat. Rev. Microbiol. 16, 226–240 (2018).
Article CAS PubMed PubMed Central Google Scholar
Baltrus, D. A., Guillemin, K. & Phillips, P. C. Natural transformation increases the rate of adaptation in the human pathogen Helicobacter pylori. Evolution 62, 39–49 (2008).
PubMed Google Scholar
Baltrus, D. A., Blaser, M. J. & Guillemin, K. Helicobacter pylori genome plasticity. In Microbial Pathogenomics (eds de Reuse, H. & Bereswill, S.) (S. Karger AG, 2010).
Doroghazi, J. R. & Buckley, D. H. Intraspecies comparison of Streptomyces pratensis genomes reveals high levels of recombination and gene conservation between strains of disparate geographic origin. BMC Genomics 15, 970 (2014).
Article PubMed PubMed Central Google Scholar
Doroghazi, J. R. & Buckley, D. H. Widespread homologous recombination within and between Streptomyces species. ISME J. 4, 1136–1143 (2010).
Article CAS PubMed Google Scholar
Andam, C. P., Choudoir, M. J., Vinh Nguyen, A., Sol Park, H. & Buckley, D. H. Contributions of ancestral inter-species recombination to the genetic diversity of extant Streptomyces lineages. ISME J. 10, 1731–1741 (2016).
Article CAS PubMed PubMed Central Google Scholar
Lefébure, T. & Stanhope, M. J. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition. Genome Biol. 8, R71 (2007).
Article PubMed PubMed Central Google Scholar
Lee, I. P. A. & Andam, C. P. Frequencies and characteristics of genome-wide recombination in Streptococcus agalactiae, Streptococcus pyogenes, and Streptococcus suis. Sci. Rep. 12, 1515 (2022).
Article CAS PubMed PubMed Central Google Scholar
Couce, A. et al. Mutator genomes decay, despite sustained fitness gains, in a long-term experiment with bacteria. Proc. Natl Acad. Sci. USA 114, E9026–E9035 (2017).
Article CAS PubMed PubMed Central Google Scholar
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
Article CAS PubMed Google Scholar
Smith, M. R. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics 37, 2077–2078 (2021).
Article PubMed Google Scholar
Youngblut, N. D. et al. Large-scale metagenome assembly reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other genetic diversity. mSystems 5, e01045-20 (2020).
Article PubMed PubMed Central Google Scholar
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Article CAS PubMed PubMed Central Google Scholar
Darling, A. E., Mau, B. & Perna, N. T. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE 5, e11147 (2010).
Article PubMed PubMed Central Google Scholar
Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
Article Google Scholar
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf. 12, 77 (2011).
Article Google Scholar
Torchiano, M. effsize—a package for efficient effect size computation. Zenodo https://doi.org/10.5281/zenodo.683128 (2016).
Frickey, T. & Lupas, A. CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20, 3702–3704 (2004).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank N. Youngblut, T. Suzuki, L. Lo Presti and A. Tyakht for providing comments on this work and H. Ashkenazy for providing insights and helpful review of the software. This work was supported by the Max Planck Society.

Funding

Open access funding provided by Max Planck Society.

Author information

Authors and Affiliations

Department of Microbiome Science, Max Planck Institute for Biology, Tübingen, Germany
Hagay Enav, Inbal Paz & Ruth E. Ley
Cluster of Excellence EXC 2124: Controlling Microbes to Fight Infections (CMFI), University of Tübingen, Tübingen, Germany
Ruth E. Ley

Authors

Hagay Enav
View author publications
You can also search for this author in PubMed Google Scholar
Inbal Paz
View author publications
You can also search for this author in PubMed Google Scholar
Ruth E. Ley
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

H.E. and R.E.L. designed the project and downstream analyses, interpreted the results and wrote the manuscript. H.E. designed the pipeline and performed analyses. H.E. and I.P. wrote the code of the pipeline.

Corresponding author

Correspondence to Ruth E. Ley.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Examples for the influence of the number of synteny blocks and sequence overlap on the synteny score.

Examples for the influence of the number of synteny blocks and different sequence overlap on the synteny score of a single hypothetical region. a–e Examples for the synteny scores when a single synteny block is identified, with different levels of sequence coverage. f–i Synteny scores of regions with 100% coverage and different numbers of synteny blocks.

Extended Data Fig. 2 Analysis of the genomic diversity of in-silico evolved bacterial populations.

Analysis of the genomic diversity of in-silico evolved bacterial populations (simulation 2 and 3, for the simulation of region 1, see Fig. 2a). Simulations were carried out for 3000 generations, through the exclusive introduction of SNPs at a frequency of 1*10⁻⁶ substitutions per nucleotide per generation. At each time point, 20 cells were sampled, and a pairwise comparison of the same 20 kbp region was performed using BLASTn (top panel) and SynTracker (bottom panel) (that is, 190 pairwise comparisons per timepoint). Horizontal black lines mark the group median, the red lines connect the group means (red dots). Boxes represent the interquartile range (IQR), whiskers are extended to the largest and smallest observations within the first and third quartiles +/− 1.5*IQR, correspondingly’.

Extended Data Fig. 3 The influence of the number of regions/pair on Robinson-Foulds Distances.

Correspondence between the number of regions/pairwise-comparison and the Robinson-Foulds Distance (RFD), when comparing the synteny-based and MASH-distance based phylogenetic trees. Correlation was tested using the Spearman’s rank test.

Extended Data Fig. 4 Distribution of distances between random phylogenetic trees and the MASH based tree.

Upper panel shows the Phylogenetic Information content (higher values stand for higher similarity), lower panel shows the Normalized Robinson-Foulds Distance (lower values stand for higher similarity between trees). Red curves are the distribution of the distances between 1*10⁵ random trees and the MASH based tree, the points mark the distances of the actual Synteny based trees and the MASH based tree. 20R/40R/100R/200R represent the number of regions used to generate each tree.

Extended Data Fig. 5 Combined SynTracker/inStrain analysis of H.pylori isolates.

Combined SynTracker/inStrain analysis of H.pylori isolates, obtained by biopsies. Each plot shows comparisons of isolates obtained from a single patient. Points represent specific pairwise comparisons, popANI - SNP-based strain similarity, as calculated by inStrain. Red and blue lines show the SynTracker and inStrain same-strain cutoffs. Purple points mark comparisons classified as the same strain by both tools, red by SynTrakcer, blue by inStrain, gray are pairs marked as belonging to different strains by both tools. n = 91, for all patients, excluding patient 249, n = 21.

Extended Data Fig. 6 Whole genome alignments of representative assemblies analyzed by both SynTracker and inStrain.

(a) E. coli, (b) H. pylori (obtained from patient 194), (C) S. rimosus.

Extended Data Fig. 7 Using longitudinal human metagenomes to determine Syntracker’s performance.

a. Representative synteny-based analyses of B. fragilis strains in longitudinal metagenomic samples, obtained from the same host (cyan) and different hosts (pink), using 200 and 40 regions/pairwise-comparison. Dashed red lines represent the APSS threshold differentiating between strains residing in the same host and strains residing in different hosts, as calculated in B. b. ROC plots of the same analyses shown in A. Red dots correspond to the Youden point and give the APSS score that yields the optimal combination of specificity and sensitivity. c. The mean specificity and sensitivity of the synteny approach, calculated at different numbers of regions/pairwise-comparison. R/P- regions/pairwise comparison, PW - pairwise.

Extended Data Fig. 8 Strain sharing in Mother-Infant Pairs.

The Synteny of strains residing in mothers and their infants at ages 4 days (‘newborns’), 4 months, and 1 year (Related MIP, blue boxes) was compared with the synteny of strains in unrelated mothers and infants (unrelated MIP, gray boxes). Comparisons to the right of the dashed red lines are considered strain sharing events. Stars correspond to Benjamini-Hochberg corrected p-values (one sided Wilcoxon-Mann-Whitney test): q-value * <5 × 10-2, ** < 5 × 10-3, *** <5 × 10-5. Vertical lines within boxes represent the median of each group, Boxes represent the interquartile range (IQR), whiskers are extended to the largest and smallest observations within the first and third quartiles +/− 1.5*IQR, correspondingly.

Supplementary information

Supplementary Information

Supplementary discussion.

Reporting Summary

Supplementary Tables 1–9

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Enav, H., Paz, I. & Ley, R.E. Strain tracking in complex microbiomes using synteny analysis reveals per-species modes of evolution. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02276-2

Download citation

Received: 31 March 2023
Accepted: 10 May 2024
Published: 19 June 2024
DOI: https://doi.org/10.1038/s41587-024-02276-2
Springer Nature America, Inc.

Strain tracking in complex microbiomes using synteny analysis reveals per-species modes of evolution

Abstract

Similar content being viewed by others

Main

Results

SynTracker compares strains within species using metagenome or genome data

Step 1: identification of homologous regions

Step 2: calculation of region-specific synteny scores

Step 3: calculation of the APSS

SynTracker is sensitive to structural variants, not to SNPs

SynTracker classifies strains using a fraction of the genome

Setting APSS thresholds for same-strain designations

Synteny-based and SNP-based analyses reveal modes of genome evolution

SynTracker applied to microbiomes reveals evolutionary patterns

Benchmarking SynTracker against other strain-comparison tools

Discussion

Methods

SynTracker pipeline

In silico evolutionary simulations

Classification of bacterial genomes

Strain comparisons in experimentally evolved E. coli populations

Strain comparisons of H. pylori clinical isolates, N. gonorrhoeae isolates and S. rimosus isolates

Whole-genome alignments of H. pylori, N. gonorrhoeae and S. rimosus assemblies

Combined SNP-based and synteny-based analysis of human gut metagenomes

Determining SynTracker’s performance

Benchmarking SynTracker

Mother–infant strain transmission

Strain similarities in different geographic locations

Recommended use of reference genomes

Reporting summary

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation