MAGinator enables accurate profiling of de novo MAGs with strain-level phylogenies

Zachariasen, Trine; Russel, Jakob; Petersen, Charisse; Vestergaard, Gisle A.; Shah, Shiraz; Atienza Lopez, Pablo; Passali, Moschoula; Turvey, Stuart E.; Sørensen, Søren J.; Lund, Ole; Stokholm, Jakob; Brejnrod, Asker; Thorsen, Jonathan

doi:10.1038/s41467-024-49958-8

Article
Open access
Published: 09 July 2024

Volume 15, article number 5734, (2024)

Abstract

Metagenomic sequencing has provided great advantages in the characterisation of microbiomes, but currently available analysis tools lack the ability to combine subspecies-level taxonomic resolution and accurate abundance estimation with functional profiling of assembled genomes. To define the microbiome and its associations with human health, improved tools are needed to enable comprehensive understanding of the microbial composition and elucidation of the phylogenetic and functional relationships between the microbes. Here, we present MAGinator, a freely available tool, tailored for profiling of shotgun metagenomics datasets. MAGinator provides de novo identification of subspecies-level microbes and accurate abundance estimates of metagenome-assembled genomes (MAGs). MAGinator utilises the information from both gene- and contig-based methods yielding insight into both taxonomic profiles and the origin of genes and genetic content, used for inference of functional content of each sample by host organism. Additionally, MAGinator facilitates the reconstruction of phylogenetic relationships between the MAGs, providing a framework to identify clade-level differences.

Introduction

DNA sequencing has revolutionised our ability to gain insight into microbial compositions without relying on the ability to cultivate organisms. To explore these compositions, various methods have been developed that either rely on databases of marker genes of known organisms or attempt to reconstruct the chromosomes directly from the short reads by first assembling them into longer contigs and then binning these based on co-occurrences or DNA composition. Mapping reads against marker gene databases with tools such as MetaPhlAn¹, MetaPhyler² and mOTUs³ is a fast and effective way of recovering the microbial composition, both because the library depth required can be quite shallow and because the computational requirements are smaller. However, such methodologies have limitations originating from the reliance on predefined databases, limited ability to estimate abundances at higher taxonomic resolution^4,5, and the lack of information on the functional repertoire of the identified taxa. Conversely, de novo binning strategies require high sequencing depth but can recover high-quality metagenome-assembled genomes (MAGs) from which the functional gene content can be directly linked to a specific organism. Ideally, this can recover genomes at the subspecies level that can be used in downstream analysis to generate more specific hypotheses about associations with outcomes. One example is the identification of organisms that can degrade human milk oligosaccharides (HMOs), which are an important energy source for breastfed infants. Bifidobacteria in particular have this functionality, and certain strains or subspecies have specific preferences for certain HMO types^6,7,8,9. Previously, it has been established that the presence of Bifidobacterium longum subspecies infantis (B. infantis), together with breastfeeding, plays a crucial role in mitigating the impact of antibiotics on the early-life gut microbiome⁷. This underlines the significance of being able to accurately profile the microbiome at higher resolutions than the species level.

In this work, we have developed a pipeline that takes MAGs and original reads as input and generates output including accurate abundance estimates, subspecies-level phylogenies and gene synteny clusters that can improve insights into the microbiome composition (Fig. 1 A–F). As MAGinator is dependent on the quality of the contig assembly and MAGs, the resolution and granularity of the results are influenced by these. We do this by grouping MAGs into clusters that are phylogenetically separated at a higher resolution than species and estimating the abundances of these. This is done by identifying a set of signature genes directly from the given data and refining them according to statistical modelling to pick the ideal set suitable for abundance estimation. The fidelity of our estimated abundances is demonstrated on the Critical Assessment of Metagenome Interpretation (CAMI) strain-madness dataset, where we benchmark MAGinator against similar tools. Additionally, we show the functionality of MAGinator on a public dataset of inflammatory bowel disease (IBD) patients, where we identify differentially abundant taxa between patients and controls at high phylogenetic resolution.

**Fig. 1: Schematic visualisation of the main functions of the MAGinator workflow.**

MAGinator also enables the creation of single-nucleotide variant (SNV) resolution phylogenetic trees from the signature genes. These trees are used for additional stratification of the MAGs and can be associated with metadata to reveal subspecies-level differences. We demonstrate MAGinator’s ability to obtain subspecies-level resolution for Bifidobacterium in two real-world infant datasets. In this case, the signature genes were found de novo in one dataset and then used to obtain subspecies-level resolution in the other cohort.

By combining the information from both contigs and gene content we identify synteny clusters of genes within subspecies, yielding information on shared pathways for the genes. Additionally, we show how we can associate the functional content to the identified clades, to improve hypotheses-generation on the impact of organisms, illustrated using the COPSAC₂₀₁₀ cohort.

Results

MAGinator can accurately detect strains in simulated data

The performance of MAGinator was evaluated against the top 10 taxonomic profilers found in the second round of CAMI⁵ challenges using the simulated short-read ‘strain-madness’ dataset. This dataset has been selected as it represents a heterogeneous strain environment, making strain and species detection highly relevant.

Running the MAGinator pipeline on the strain-madness data identified 73 MAG clusters. Of these, 22 clusters were present with less than 3 reads in 3 samples, so their abundance was set to 0. Of the 51 remaining entities, 30 were assigned strain-level annotation by CAMITAX¹⁰.

The profilers were compared with OPAL¹¹ (Fig. 2). For the majority of the tools, the performance decreased as the taxonomic categories became less inclusive (Fig. 2B and Suppl. Figure 2). The L1 norm measures the total error between the predicted and true abundances at each rank. From genus to species level, we observed drops in the average completeness from 82.7% to 45.6% and in the average purity from 73.6% to 36.5%. MAGinator had the best average completeness at genus (99.8%) and species levels (89.6%) (Suppl. Table 3). For purity, MAGinator ranked number 5 at the genus level (80.1%) and was the best-performing tool at the species level (90.1%). LSHVec gsa¹² had the best purity at the genus level (100%); however, at the species level its purity was 37.5%, ranking number 5 in this group (Supplementary Table 4).
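The three metrics discussed above can be illustrated with a minimal sketch. It assumes the usual presence/absence definitions (completeness as the fraction of true taxa that were predicted, purity as the fraction of predicted taxa that are true) and the L1 norm as the summed absolute abundance difference. The function name and the toy profiles are hypothetical, and this is not OPAL's implementation.

```python
def profile_metrics(true_profile, predicted_profile):
    """Compare two {taxon: relative abundance} profiles at a single rank."""
    true_taxa = {t for t, a in true_profile.items() if a > 0}
    pred_taxa = {t for t, a in predicted_profile.items() if a > 0}
    true_positives = len(true_taxa & pred_taxa)

    # Completeness: share of true taxa that were recovered.
    completeness = true_positives / len(true_taxa) if true_taxa else 0.0
    # Purity: share of predicted taxa that are actually present.
    purity = true_positives / len(pred_taxa) if pred_taxa else 0.0
    # L1 norm: total absolute abundance error over all taxa in either profile.
    all_taxa = set(true_profile) | set(predicted_profile)
    l1_norm = sum(
        abs(true_profile.get(t, 0.0) - predicted_profile.get(t, 0.0))
        for t in all_taxa
    )
    return {"completeness": completeness, "purity": purity, "l1_norm": l1_norm}


# Toy example: taxon C is missed and taxon D is a false positive.
truth = {"A": 0.5, "B": 0.25, "C": 0.25}
prediction = {"A": 0.5, "B": 0.25, "D": 0.25}
metrics = profile_metrics(truth, prediction)
```

In this toy case two of three true taxa are recovered and two of three predictions are correct, so completeness and purity are both 2/3, while the L1 norm is 0.5.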

**Fig. 2: Benchmark using OPAL for comparing taxonomic profiling results for the CAMI strain-madness data set.**

MAGinator improves detection of differentially abundant organisms

To demonstrate the advantages of quantifying bacterial taxa at high resolutions we have re-analysed a well-designed metagenomics study from Franzosa et al.¹³. We chose this because it has deep sequencing well-suited for de novo MAG construction and a discovery/replication design with two distinct cohorts. In the absence of ground truth, replicating discoveries is a compelling strategy for making sure that findings are not false discoveries.

Beta diversity analysis of the two abundance matrices (MAGinator vs. their matrix created using MetaPhlAn2¹⁴) revealed a similar separation for IBD patients vs healthy controls. For this study, MAGinator produces abundance matrices of much higher dimensionality (2140 vs 201 taxa) because of the higher resolution in taxa identification. Prevalence and/or abundance filtering might therefore be relevant for noise reduction in MAGinator-produced tables (Fig. 3A–C).

**Fig. 3: IBD case study shows similar performance of MAGinator with beta diversity and improvements in DA analysis.**

To illustrate the improved ability of MAGinator to identify differentially abundant taxa we performed a regular differential abundance (DA) hypothesis test with Wilcoxon’s rank-sum test (Fig. 3D–F). We looked for differentially abundant taxa defined as significant in the discovery cohort and replicated in the independent validation cohort. In the original analysis, 18 taxa were successfully validated in the independent cohort. With MAGinator, this increased to 213 taxa (Fig. 3 D–F).
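The discovery/replication logic described above can be sketched as follows. This is an illustration only: the rank-sum test is a plain normal approximation without tie or continuity correction, and the significance threshold, the same-direction requirement and all names and data are assumptions rather than details taken from the analysis.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share their mean rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def replicated_taxa(discovery, validation, alpha=0.05):
    """Taxa significant in both cohorts with the same direction of effect.

    Each cohort maps taxon -> (abundances in cases, abundances in controls).
    """
    hits = []
    for taxon, (cases, controls) in discovery.items():
        if taxon not in validation:
            continue
        z_disc, p_disc = rank_sum_test(cases, controls)
        z_val, p_val = rank_sum_test(*validation[taxon])
        if p_disc < alpha and p_val < alpha and z_disc * z_val > 0:
            hits.append(taxon)
    return hits


# Hypothetical abundances: taxon_a differs between groups, taxon_b does not.
discovery = {
    "taxon_a": ([10, 11, 12, 13, 14, 15], [1, 2, 3, 4, 5, 6]),
    "taxon_b": ([1, 3, 5, 7, 9, 11], [2, 4, 6, 8, 10, 12]),
}
validation = {
    "taxon_a": ([20, 21, 22, 23, 24, 25], [2, 3, 4, 5, 6, 7]),
    "taxon_b": ([2, 4, 6, 8, 10, 12], [1, 3, 5, 7, 9, 11]),
}
hits = replicated_taxa(discovery, validation)
```

With thousands of taxa, a multiple-testing correction would normally be applied before comparing cohorts; it is omitted here to keep the sketch short.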

MAGinator enables tracking of subspecies across datasets

B. infantis is a gut microbe particularly adapted to the infant's gut due to its ability to metabolise HMOs, which are complex sugars that infants cannot metabolise themselves^15,16. These capabilities are different from other major subspecies including B. longum. To demonstrate the utility of subspecies abundance estimation in MAGinator, we identified the signature gene set from one deeply sequenced infant cohort (COPSAC₂₀₁₀) and used it to track subspecies abundances on another infant cohort (CHILD) with shallower sequencing but more samples. In the MAGinator pipeline, we identified two MAG clusters; one annotated as B. infantis and one as B. longum with GTDB-tk. In MetaPhlAn output we identified only one overall abundance for the species Bifidobacterium longum. Correlation analysis of these abundances shows that summed abundances of the two subspecies B. infantis and B. longum MAG clusters, explain 87% of the variance in the MetaPhlAn species (Supplementary Figure 2). In addition, we analysed the samples from both cohorts with StrainPhlAn¹⁶ which detects strains in samples using prespecified species-level marker genes. Here, clustering of the sample-wise consensus sequences of the B. longum marker genes identified two clusters, one which clustered with reference strains of B. longum and one which clustered with reference strains of B. infantis. This result was previously shown for the CHILD cohort⁷ and here we found similar results for COPSAC₂₀₁₀ (Supplementary Figure 4). We hypothesised that this apparent duality represents the underlying balance of these two subspecies in each sample. We confirmed this by comparing the StrainPhlAn-clusters with the MAGinator relative abundances of all Bifidobacterium species, where we saw that the StrainPhlAn clusters depended on the ratio of B. infantis to B. longum (Fig. 4), but that more detailed information was accessible using the MAGinator derived relative abundances of each subspecies. 
This is an example of how de novo identification of subspecies-level MAG clusters and subsequent refinement of signature genes allows a higher resolution depiction of taxa for which the sequence coverage is sufficient in a subset of samples.

**Fig. 4: Stratification of StrainPhlAn clusters using the relative abundances of Bifidobacterium longum subspecies from MAGinator Cluster 1 indicates B.**

Additionally, we used the signature genes identified from the COPSAC cohort to track the two subspecies in the CHILD cohort. The relative abundances of the MAGinator clusters and the StrainPhlAn clusters were likewise examined (Suppl. Figure 4). When using the signature genes as a reference for the CHILD cohort MAGinator was still able to resolve the two subspecies into more well-defined clusters, yielding detailed profiling of the samples.

To estimate the fit of the signature genes for the two cohorts, we compared the read mappings and the presence of signature genes (Suppl. Figure 6A). The expected number of detected signature genes within a sample can be calculated from the number of reads that map to those genes using a negative binomial distribution¹⁷. We find that the COPSAC₂₀₁₀ cohort deviates with a mean squared error (MSE) of 103.95, whereas the CHILD cohort deviates with a MSE of 878.09, indicating that the signature genes are better suited for profiling the specific subspecies found in the COPSAC cohort. To examine the cause of this large deviation for CHILD we created a heatmap of the read mappings to the signature genes (Suppl. Figure 6B). In accordance with Suppl. Figure 6A the samples cluster into two groups, which could be due to subspecies differences. Additionally, the genes are seen to cluster into multiple groups, where a group is seen to be absent in a large proportion of the samples, indicating that these genes have not been adequately selected for this subspecies for this dataset. Thus, mapping reads from a new data set onto signature genes from a previous data set can be an advantage when sequencing depth in the latter is too limited for good assembly. Reusing signature genes is also advantageous for easy comparison of abundances between data sets. But optimal selection of adequately representative signature genes for a new data set requires running MAGinator de novo, if sequencing depth is adequate. Ideally, one might pool multiple data sets prior to running MAGinator in order to find signature genes that are equally representative for both data sets, making both abundance estimations and taxonomic entities directly comparable.

MAGinator enables de novo discovery of strains from MAG cluster phylogenies

In the above case, MAGinator’s ability to distinguish between subspecies depended on the binner clustering the subspecies into two separate MAG clusters. This may not be the case for other bacterial taxa and will vary between datasets. As an alternative, MAGinator provides sample-wise phylogenies for each MAG cluster, in which strains within the MAG cluster can be distinguished between samples. The result is presented as a maximum-likelihood tree based on sample SNVs within the signature genes, where each leaf corresponds to a sample. Because the analysis is based on read mappings rather than assembled contigs, reliable phylogenies can be constructed even for samples where the taxon was not abundant enough to yield a MAG. For example, in the COPSAC₂₀₁₀ data set, the Faecalibacterium sp900758465 MAG cluster was found by VAMB¹⁸ in 85 samples, but phylogenies were constructed for an additional 13 samples (Suppl. Figure 7).

For the per-sample phylogenies to be reliable, signature genes must have adequate read coverage and sequencing depth. Also, samples must not contain mixtures of subspecies belonging to the same MAG cluster. Thus cutoffs are set on alignment and SNV statistics to ensure reliability and can also be visualised alongside the tree as shown in Suppl. Figure 7 for visual confirmation. Using the shown median frequency of mixed SNVs it is possible to identify samples in which multiple variants of the MAG cluster were found. In COPSAC₂₀₁₀, within-sample mixtures of MAG cluster variants were rare, yielding reliable sub-species-level information for most samples. Overall, for the 716 MAG clusters in the COPSAC data set, 387 MAG clusters had no samples containing mixed alleles. 329 MAG clusters harbour samples with mixed alleles, and within these particular MAG clusters an average of 38% of the samples had mixed alleles. In summary, across all MAG clusters, 4154 of 20,765 MAGs had mixed alleles (Suppl. Table 5).

Strain diversity across environments

Within the 54 samples obtained from the honey-bee gut environment, MAGinator identified 195 MAG clusters, of which 168 were found to have one or more samples with mixed alleles. For these MAG clusters, an average of 70% of the MAGs were found to have mixed alleles. For the 148 samples from the Tara Oceans expeditions, 791 MAG clusters were found, of which 540 had at least one sample with mixed alleles. In total, 37% of the MAGs in these clusters were found to contain mixed alleles (Suppl. Table 5).

MAGinator identifies de novo gene synteny clusters aiding functional studies

MAGinator’s signature gene identification step involves clustering all genes into clusters of conserved proteins. Such protein clusters are orthologous (i.e. functionally conserved across different taxa) owing to conservative (and customisable) clustering parameters that maintain protein domain topology. Importantly, MAGinator’s protein clusters are identified de novo. This means they include the protein “dark matter” ignored by traditional database-driven profiling, even though it comprises the majority of protein diversity in most metagenomic datasets to date. A key advantage of MAGinator’s gene profiling is that each protein cluster, by definition, can be linked to the host MAG that encodes it. This enables the discovery of protein-host interactions against sample phenotypes, bridging taxonomic and functional profiling.

Genes can further be grouped into synteny clusters based on their genomic adjacency. Genes close to each other in the genome will be grouped into a synteny cluster, and they are usually part of the same pathway or have a related function. Part of the MAGinator workflow creates these synteny clusters. For the COPSAC₂₀₁₀ cohort 746,251 synteny clusters were identified with an average of 3 genes per cluster (Supplementary Figure 8A, B). In order to evaluate the accuracy of the synteny clusters, functional gene annotations were performed using eggNOG¹⁹ mapper. Subsequently, the predominant KEGG²⁰ module within each synteny cluster was determined, and the proportion of genes sharing the same annotation within the cluster was calculated (see Supplementary Figure 8C). Only synteny clusters with 5 or more genes and at least two annotated genes were included, qualifying 35,798 clusters for the analysis. For 28,341 clusters all genes in the synteny cluster were assigned the same KEGG module, and 80.5% of the modules had more than 80% agreement.
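The agreement measure used for this evaluation can be sketched as below. The sketch assumes that agreement is computed among the annotated genes of a cluster (the text does not state the denominator), and the function name and module identifiers are purely illustrative.

```python
from collections import Counter


def module_agreement(gene_modules, min_genes=5, min_annotated=2):
    """Return (predominant module, agreement) for one synteny cluster.

    gene_modules holds one entry per gene: a KEGG module ID, or None when the
    gene has no annotation. Clusters failing the size filters return None.
    """
    if len(gene_modules) < min_genes:
        return None
    annotated = [m for m in gene_modules if m is not None]
    if len(annotated) < min_annotated:
        return None
    module, count = Counter(annotated).most_common(1)[0]
    # Assumption: agreement is measured over annotated genes only.
    return module, count / len(annotated)


# Hypothetical cluster of five genes, four of them annotated.
result = module_agreement(["M00001", "M00001", "M00001", None, "M00002"])
too_small = module_agreement(["M00001", "M00001"])
```

Here three of the four annotated genes share module M00001, giving an agreement of 0.75, while the two-gene cluster is excluded by the size filter.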

Discussion

MAGinator is a pipeline for quantifying the abundances of de novo-generated MAG clusters. In contrast to reference-based abundance estimations, this allows extensive integration of abundance and functional properties for individual members of the microbial community. Furthermore, it features the generation of signature gene-derived phylogenies for MAG clusters and the discovery of gene synteny clusters. It is implemented in Snakemake to take advantage of the integrated work distribution capabilities necessary for processing large-scale metagenomics data. It features logging for ease of monitoring progress and visualisation for diagnostic purposes. We have demonstrated the functionality and utility of MAGinator via several avenues, both simulated and real datasets.

The performance of MAGinator was evaluated in comparison to existing profiling tools. We benchmarked MAGinator using the simulated strain-madness dataset produced by CAMI II. We found that MAGinator is capable of profiling samples at a comparable level to the already established tools. Notably, while many tools performed well at the genus level, a decline in performance was observed when focusing on the species-level classification. This drop in performance is expected from reference-based methods, as they are limited to identifying only what already exists in their database and are thus unable to annotate novel species. MAGinator demonstrated a notable advantage in this regard, exhibiting the highest average completeness and purity when classifying samples at the species level. This indicates that MAGinator has the ability to achieve a more accurate and precise characterisation of microbial species present in the samples. It should be noted that the high completeness by MAGinator implies a greater sensitivity in detecting and including less abundant or rare taxa in the analysis. However, it may also introduce a certain level of noise or misclassification, which influences the estimation of beta diversity.

When examining the performance of MAGinator on a real dataset, the beta diversity was comparable to the analysis carried out by Franzosa et al. Reanalysing their data demonstrates how MAGinator can be used for a metagenomic association study. With the higher resolution of MAGinator when quantifying MAG clusters investigators have the possibility of discovering differentially abundant taxa in much richer detail without compromising other parts of a traditional analysis such as PCoA. Depending on the intention of the study, and the taxonomic composition of the studied microbiomes, the high resolution can also be utilised to gain deeper insights into the subspecies taxonomies. This is for instance, relevant when analysing the Bifidobacterium longum subspecies.

B. infantis is highly relevant to investigate, as it is known for its greater capacity to metabolise HMOs compared with closely related subspecies such as B. longum. As their genomes are very similar, distinguishing them by database-dependent approaches is challenging. With StrainPhlAn, we are able to identify two mutually exclusive clusters, each representing a subspecies. However, the two MAG clusters identified with MAGinator for B. infantis and B. longum yield higher resolution in the form of individual abundance estimates for each. MAGinator is able to detect the subspecies even in samples where its abundance is low and where no MAG was produced.

These results were reproduced in the CHILD cohort using the signature genes identified in COPSAC₂₀₁₀ for the two subspecies. As samples from the CHILD cohort used in this study had lower sequencing depth, still being able to separate the subspecies is valuable. Importantly, it is worth noticing that the separation would most likely have been stronger if the signature genes had been found de novo for the specific cohort. This is supported by the read mappings to the signature genes showing a subset of the signature genes defined in COPSAC₂₀₁₀ missing in the CHILD cohort, which presumably resulted in an underestimation of the abundance for a subset of the samples. This phenomenon highlights the importance of de novo dataset-specific discovery of signature genes to yield the best possible abundance estimates of closely related taxonomic entities. A similar phenomenon would be expected when using database-derived strain marker genes.

From the COPSAC₂₀₁₀ cohort we demonstrated MAGinator’s ability to create SNV-level trees based on the sequences from the signature genes of a MAG cluster, used for more fine-grained stratification of the MAGs. Even in samples where no MAG is assembled, a reliable phylogeny can still be derived when enough reads map to the signature genes. By placing these samples in the tree, information from the closely related MAGs can be utilised to find strain-level entities, even for low-abundance samples. Tree distances can be tested against sample meta-data using e.g. PERMANOVA, thus revealing whether the added subspecies resolution is informative for the research question at hand. If so, cutting the tree at evolutionarily sensible depths could define subspecies or strains de novo that drive specific sample phenotypes. Coupling this information with MAG gene content allows for the discovery of clade-specific genes, enabling their identification in new data sets.

From the alignment of the signature genes it is also possible to identify the extent of strain diversity within the MAG cluster, by identifying samples that display a certain frequency of mixed alleles. Samples with mixed alleles harbour multiple strains of the MAG cluster.

For the COPSAC₂₀₁₀ dataset, we found that within-sample strain diversity was low. When comparing allele frequencies across other environments, such as the honey-bee gut and the ocean, a greater strain diversity is seen. While the honey-bee gut exhibits the highest proportion of MAG clusters with mixed alleles (86%), the Tara Oceans samples also exhibit a higher proportion (68%) than COPSAC₂₀₁₀ (46%). Notably, the percentage of samples with mixed alleles within the affected MAG clusters was comparable for Tara Oceans and COPSAC₂₀₁₀ (37% and 38%, respectively), whereas that number for honey-bees was 71%. These findings underline the effect of selective pressure within different environments on both strain- and species-level diversity within the microbiomes.
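The cluster-level proportions quoted above follow directly from the counts reported in the Results, as this short check shows. The dictionary keys are labels chosen for the example.

```python
# (MAG clusters with mixed alleles, total MAG clusters), from the Results.
cluster_counts = {
    "COPSAC2010": (329, 716),
    "honey-bee gut": (168, 195),
    "Tara Oceans": (540, 791),
}

# Percentage of MAG clusters with at least one sample showing mixed alleles.
percentages = {
    name: round(100 * mixed / total)
    for name, (mixed, total) in cluster_counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}% of MAG clusters contain mixed alleles")
```

This reproduces the 46%, 86% and 68% given for COPSAC₂₀₁₀, the honey-bee gut and Tara Oceans, respectively.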

Additionally, the COPSAC₂₀₁₀ cohort was used to illustrate MAGinator’s ability to group genes co-localised on the chromosome into synteny clusters, further combining the strengths of using both genes and contigs. As genes found close together are often part of the same genetic pathway or share the same function, this is a valuable insight for associating organisms with the outcomes of a study. This has been validated by functionally annotating the genes of the predicted synteny clusters, confirming that the genes found in synteny are often annotated to be part of the same metabolic pathway. Currently, accuracy is limited by MAGinator’s lack of operon awareness. As bacterial operon prediction methods improve, these could be integrated into MAGinator and eliminate such noise. Users can also eliminate noise at the expense of sensitivity by altering a number of user-modifiable parameters. Crucially, MAGinator’s protein and synteny clusters are de novo, meaning that they do not need to yield any known database hits, as is often the case for new virulence factors or antiviral defence systems. Users may find that such “dark matter” gene clusters yield particularly strong associations against sample meta-data, making them prime candidates for downstream genetic or biochemical studies aimed at deciphering their mechanisms of action.

In conclusion, we have described the development of MAGinator—a pipeline for quantifying MAG clusters and demonstrated the benefits of this approach to commonly generated data types in the metagenomics field. Through reanalysis of publicly available data, we have illustrated how insights can be gained from MAGinator at a higher taxonomic resolution than available from commonly used tools. We believe that this higher resolution is key to unlocking the potential of metagenomics to identify critical subspecies for human health and environmental investigations. MAG cluster resolution metagenomics allows for accurate integration of abundance, taxonomic and functional annotation in microbiome studies, which is needed to empower investigations in the microbiome field.

Methods

Implementation

Input

The input to the MAGinator workflow comprises a set of samples with (1) shotgun metagenomic sequenced reads, (2) their sample-wise assembled contigs, and (3) sample-wise MAGs (groups of contigs from the same genome), clustered across samples, as defined by a metagenomic binning tool (see below).

Reads should be provided in a comma-separated file giving the location of the fastq files and formatted as: SampleName,PathToForwardReads,PathToReverseReads. The contigs should be nucleotide sequences in FASTA format. The MAGs should be given as a tab-separated file including the MAG identifier and contig identifier. The sample-wise MAGs should be grouped into MAG clusters representing a taxonomic entity found across the samples, which will usually be species but can also be at the subspecies level, depending on the characteristics of the input data. MAGinator is flexible regarding which tool is used for creating the MAGs; however, we recommend using VAMB¹⁸. If other binners are used, MAG clustering across samples would have to be implemented before running MAGinator. As MAGinator relies on the input MAGs, a larger sample size is recommended. The specific number of samples depends both on the sequencing depth and on the diversity of the community being analysed. We advise the user to look at the number of MAG clusters created and assess them according to the environment being analysed.
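A quick sanity check of the two tabular inputs before launching the workflow could look like the sketch below. It is not part of MAGinator: the function names, file contents and paths are hypothetical, and the column order of the MAG table (MAG identifier first, contig identifier second) is an assumption.

```python
import csv
import io


def read_reads_table(handle):
    """Parse lines of SampleName,PathToForwardReads,PathToReverseReads."""
    samples = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 comma-separated fields, got {row!r}")
        name, forward, reverse = (field.strip() for field in row)
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = (forward, reverse)
    return samples


def read_mag_table(handle):
    """Parse tab-separated MAG/contig pairs into {mag_id: [contig_id, ...]}."""
    mags = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        mag_id, contig_id = row[0], row[1]  # assumed column order
        mags.setdefault(mag_id, []).append(contig_id)
    return mags


# Hypothetical file contents, held in memory for the example.
reads = read_reads_table(io.StringIO(
    "S1,/data/S1_R1.fastq.gz,/data/S1_R2.fastq.gz\n"
    "S2,/data/S2_R1.fastq.gz,/data/S2_R2.fastq.gz\n"
))
mags = read_mag_table(io.StringIO(
    "S1_bin1\tS1_contig_1\n"
    "S1_bin1\tS1_contig_2\n"
))
```

Failing early on malformed rows or duplicate sample names is cheaper than discovering the problem partway through a long workflow run.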

Dependencies

The dependencies to run MAGinator are mamba²¹ and Snakemake²²—all other dependencies are installed automatically by Snakemake through MAGinator. Additionally, MAGinator needs the GTDB-tk database downloaded for taxonomic annotation of MAGs and as a reference for the phylogenetic SNV-level analysis of the signature genes.

Output generated

MAGinator generates multiple outputs and intermediate files useful for additional downstream analysis (Supplementary Table 1, Supplementary Figure 1). Importantly, MAGinator outputs the taxonomy of the MAGs, the signature genes of the MAG clusters, the sample-wise relative abundances of the MAG clusters, a non-redundant gene matrix with sample-wise mapping counts, synteny clusters and inferred phylogenies for each MAG cluster along with a table presenting samples showing evidence of strain mixtures within each MAG cluster. Additionally, a folder is created containing the log information of all the jobs run by Snakemake.

Application

MAGinator is written in Python 3. It is based on a set of Snakemake²² workflows and is easily scalable to work for both single servers and compute clusters. MAGinator is implemented as a python package and is available on GitHub at https://github.com/Russel88/MAGinator. The user can adjust the individual steps of the pipeline using various parameters (Suppl. Table 2). The results in this paper are based on MAGinator v.0.1.10.

The MAGs are filtered based on a minimum size for inclusion, with a default size of 200,000 bp. The included MAGs are taxonomically annotated using GTDB-tk (v.2.1.1)²³, which calls genes using Prodigal (v.2.6.3)²⁴, identifies GTDB marker genes and places them in a reference tree. As the taxonomic annotation of the MAG clusters may be redundant, clusters with the same taxonomic assignment can be combined into one cluster, which we refer to as a Metagenomic Species (MGS), using the flag ‘--mgs_collections’. Redundant genes are identified by clustering with MMseqs2 (v.13.45111)²⁵ easy-linclust using a default clustering-coverage and sequence identity threshold of 0.8, creating a list of the representative genes along with their cluster members. The redundant genes are filtered away, leaving a non-redundant gene catalogue. The raw reads are mapped to the gene catalogue using BWA mem2 (v.2.2.1)²⁶ and counted using Samtools (v.1.10)²⁷, yielding a gene count matrix, which is used as input for the signature gene refinement and the subsequent phylogenetic clade separation and abundance estimation.
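The size filter in the first step can be illustrated with a short sketch. It assumes that MAG size means the summed length of the MAG's contigs and that the threshold is inclusive; the function name and the example data are hypothetical.

```python
def filter_mags_by_size(mag_contigs, contig_lengths, min_size=200_000):
    """Keep MAGs whose contigs sum to at least min_size base pairs.

    mag_contigs: {mag_id: [contig_id, ...]}
    contig_lengths: {contig_id: length in bp}
    Returns {mag_id: total length} for the MAGs that pass.
    """
    kept = {}
    for mag_id, contigs in mag_contigs.items():
        total_length = sum(contig_lengths[c] for c in contigs)
        if total_length >= min_size:  # assumption: threshold is inclusive
            kept[mag_id] = total_length
    return kept


# Hypothetical MAGs: one above and one below the default threshold.
mag_contigs = {"bin1": ["c1", "c2"], "bin2": ["c3"]}
contig_lengths = {"c1": 150_000, "c2": 80_000, "c3": 90_000}
kept = filter_mags_by_size(mag_contigs, contig_lengths)
```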

Signature gene identification

We previously described the method for identifying the signature genes for the data set¹⁷. In brief, signature genes are selected to ensure that they 1) are unique for the MAG cluster, 2) are present in all members of the cluster, and 3) are single-copy.

To accomplish this, the following steps are taken: Initially, the non-redundant gene count matrix is curated to discard any genes that have (redundant) cluster members originating from more than one MAG cluster, as they are thus not specific for that biological entity. Subsequently, the remaining genes within each MAG cluster are sorted based on their co-abundance correlation across the samples. As the genes are unique for the species, if they are consistently detected in similar abundance across samples, it suggests that they are single-copy. This step also mitigates differences in read mappings caused by biological or technical variations. The initial set of signature genes for each biological entity is selected from the most correlated genes. Subsequently, these signature genes are further refined and optimised by fitting them to a rank-based negative binomial model that captures the characteristics of the specific microbial composition in the input data. The signature gene set is evaluated across the samples by calculating the probability of the detected number of signature genes given the number of reads mapping to the MAG cluster. Finally, the abundance of each MAG cluster is derived from the read counts to the identified signature genes, normalised according to the gene lengths.
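The final step, deriving abundances from length-normalised signature gene counts, can be sketched as follows. The use of a plain mean over signature genes and the conversion to relative abundances are assumptions made for illustration; the model-based refinement described above is not reproduced, and all names and numbers are hypothetical.

```python
def cluster_abundance(read_counts, gene_lengths):
    """Mean length-normalised read count over one cluster's signature genes.

    read_counts: {gene_id: reads mapped in one sample}
    gene_lengths: {gene_id: gene length in bp} for the signature genes
    """
    rates = [read_counts.get(g, 0) / length for g, length in gene_lengths.items()]
    return sum(rates) / len(rates)


def relative_abundances(cluster_values):
    """Scale per-cluster abundances within a sample so that they sum to 1."""
    total = sum(cluster_values.values())
    if total == 0:
        return {cluster: 0.0 for cluster in cluster_values}
    return {cluster: value / total for cluster, value in cluster_values.items()}


# Hypothetical sample with two MAG clusters and their signature genes.
sample_counts = {"g1": 10, "g2": 5, "g3": 30}
signature_genes = {
    "cluster_A": {"g1": 1000, "g2": 500},
    "cluster_B": {"g3": 1000},
}
raw = {
    cluster: cluster_abundance(sample_counts, lengths)
    for cluster, lengths in signature_genes.items()
}
relative = relative_abundances(raw)
```

Dividing by gene length keeps long genes from dominating the estimate; in this toy sample the two genes of cluster_A give the same rate, and cluster_B ends up three times as abundant.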

SNV-level resolution phylogenetic trees

To elucidate the smaller biological differences within the MAG clusters, MAGinator infers a phylogeny based on the sequences of the signature genes. Based on the read mappings to the signature genes, the sample-specific SNVs are called using output from Samtools mpileup. An alignment for each signature gene is made for all samples containing the signature genes using MAFFT (v.7)²⁸, run with an offset value of 0.123 as no long indels are expected. MAGinator allows phylogenetic inference with either the fast method FastTree (v.2)²⁹ (default) or the more accurate but resource-intensive method IQ-TREE (v.2)³⁰. In samples where no MAG was found, the phylogenies can be used to detect rare subspecies-level entities based on just a few reads mapping to the signature genes and to infer functions and genes from closely related MAGs from other samples. The criteria for inclusion in the tree can be adjusted by the user. For a sample to be included in the phylogeny, the following three criteria have to be met: 1) a minimum fraction of non-N characters in the alignment, 2) a minimum number of detected GTDB marker genes, and 3) a minimum number of detected signature genes. The default values have been set relatively low in order to enable the placement of samples in the tree, even in cases of very low abundance. The trees can be associated with metadata to obtain clade-level differences associated with study design variables such as disease phenotype, sampling location, or environmental factors.
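The three inclusion criteria can be read as a simple conjunction of threshold checks, sketched below in Python. The function names are hypothetical and the threshold values are placeholders chosen for illustration; they are not MAGinator's defaults, which are not stated in the text.

```python
def non_n_fraction(seq):
    """Fraction of alignment characters that are not N."""
    if not seq:
        return 0.0
    return sum(1 for base in seq.upper() if base != "N") / len(seq)

def include_sample(aligned_seq, n_marker_genes, n_signature_genes,
                   min_nonn_frac=0.5, min_marker_genes=2, min_signature_genes=50):
    """Apply all three criteria; every threshold here is a placeholder value."""
    return (non_n_fraction(aligned_seq) >= min_nonn_frac
            and n_marker_genes >= min_marker_genes
            and n_signature_genes >= min_signature_genes)

# Hypothetical samples: the second has too few informative positions.
well_covered = include_sample("ACGTNNNN", n_marker_genes=3, n_signature_genes=60)
poorly_covered = include_sample("ACNNNNNN", n_marker_genes=3, n_signature_genes=60)
```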

Gene synteny

Based on the gene clustering with MMseqs2, a weighted graph is created, which reflects the adjacency of the genes on contigs. If genes are close enough in the graph, they are categorised as part of the same synteny cluster, and it is assumed that they have related functionality and/or are part of the same functional module. Clustering is determined using mcl (v.14)³¹, where the user can influence the adjacency count and the stringency of the clusters. Only immediate adjacency is considered. By default, genes found adjacent just once are included in the graph, but this can be tuned to make stricter clusters. The inflation parameter for MCL clustering of the synteny graph determines the size of the gene clusters and is, by default, set high in order to yield small and consistent clusters.
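How such a weighted adjacency graph might be assembled is sketched below in Python. This is an illustration under stated assumptions rather than MAGinator's code: contigs are represented as ordered lists of gene cluster identifiers, edge weights count how often two gene clusters are immediate neighbours, and the function name and `min_count` parameter are hypothetical. The resulting edge list is the kind of input a graph-clustering tool such as mcl could then partition.

```python
from collections import Counter

def adjacency_graph(contigs, min_count=1):
    """Count immediate neighbours along each contig; keep edges seen >= min_count times."""
    edges = Counter()
    for genes in contigs:
        for left, right in zip(genes, genes[1:]):
            if left != right:  # ignore self-adjacency of the same gene cluster
                edges[tuple(sorted((left, right)))] += 1
    return {pair: n for pair, n in edges.items() if n >= min_count}

# Hypothetical gene order on three contigs.
contigs = [["g1", "g2", "g3"], ["g3", "g2", "g4"], ["g1", "g2"]]
graph = adjacency_graph(contigs)                # every observed adjacency
strict = adjacency_graph(contigs, min_count=2)  # only adjacencies seen at least twice
```

Raising `min_count` removes the g2 and g4 edge, which was observed only once, mirroring the stricter setting described above.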

Taxonomic scope of gene clusters

The taxonomic assignment of the sample-specific MAG is done using GTDB-tk. In some cases it will not be possible to assign a taxonomy to the MAG, which could be due to contamination, the MAG originating from a currently undescribed organism, or too little information being present in the MAG. In these cases an alternative is to assign a taxonomy to the gene clusters found in the MAG. The taxonomic scope of a gene cluster is the category in which its genes are predominantly found, where the required fraction is defined by the user (default value 0.9). For example, if run with default options and a gene cluster has the assignment “Bacteria Firmicutes_A Clostridia Lachnospirales Lachnospiraceae Anaerostipes NA”, then at least 90% of the genes are found in Anaerostipes. The algorithm finds the most specific taxonomic rank that has at least 90% agreement across the genes in the cluster, as assigned by GTDB-tk.
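The rule described above, walking down the ranks until agreement drops below the chosen fraction, can be sketched in Python as follows. This is a minimal illustration, not the MAGinator implementation: the function name, the representation of each gene's lineage as a list of seven rank names, and the toy species labels are assumptions.

```python
from collections import Counter

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

def taxonomic_scope(lineages, min_fraction=0.9):
    """Most specific lineage shared by at least min_fraction of the genes; NA below it."""
    scope = []
    for depth in range(1, len(RANKS) + 1):
        prefixes = Counter(tuple(lin[:depth]) for lin in lineages
                           if len(lin) >= depth and lin[depth - 1] != "NA")
        if not prefixes:
            break
        prefix, n_genes = prefixes.most_common(1)[0]
        if n_genes / len(lineages) < min_fraction:
            break
        scope = list(prefix)
    return scope + ["NA"] * (len(RANKS) - len(scope))

# Hypothetical gene cluster: 9 of 10 genes are in Anaerostipes, split over two species.
base = ["Bacteria", "Firmicutes_A", "Clostridia", "Lachnospirales", "Lachnospiraceae"]
lineages = ([base + ["Anaerostipes", "sp_A"]] * 5
            + [base + ["Anaerostipes", "sp_B"]] * 4
            + [base + ["Blautia", "sp_C"]])
scope = taxonomic_scope(lineages)
```

With the default fraction, the genus level reaches exactly 90% agreement and is retained, while no single species does, so the species rank is reported as NA.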

Workflow design

The MAGinator workflow has been constructed so that information flows automatically between the different modules (Suppl. Figure 1).

The data goes through a series of filtering and processing steps (Fig. 1 A–F), including:

A: MAG clusters, which are composed of one or more MAGs, are provided as input.

B: The genes are clustered and redundant genes are removed.

C: Reads are mapped to the genes, creating a gene count matrix.

D: Signature genes are identified for each MAG cluster and used for abundance estimation.

E: Based on the signature genes, SNV-level resolution phylogenetic trees are created, and the taxonomic scope of gene clusters is identified.

F: Synteny-clusters of genes are identified, reflecting the adjacency of the genes on the contigs.

Benchmarking on CAMI’s simulated strain-madness data set

The construction of the strain-madness benchmarking dataset was part of the second round of CAMI challenges⁵. The data consist of 100 simulated metagenomics samples of paired-end short reads of 150 bp. The samples were run through a preprocessing workflow prior to the analysis. This involved the removal of adaptors with BBDuk (v. 38.96 http://jgi.doe.gov/data-and-tools/bb-tools/) run with the settings ‘ktrim=r k=23 mink=11 hdist=1 hdist2=0 ptpe tbo’, removal of low-quality and short reads (<75 base pairs) with Sickle (v. 1.33)³², and removal of human contamination (reference version: UCSC hg19, GRCh37.p13) using BBmap (http://jgi.doe.gov/data-and-tools/bb-tools/), leaving an average of 6.6 million reads (SD: ±2802 reads) per sample.

To generate de novo assemblies, SPAdes (v. 3.15.5)³³ was used with the -meta option and kmer sizes of 21, 33, 55 and 77; contigs shorter than 1500 bp were discarded. Read-to-assembly mapping was carried out using BWA-mem2 (v.2.2.1)²⁶ and Samtools (v.1.10)²⁷. Contig depths were assessed using Metabat2’s jgi_summarize_bam_contig_depths (v.2.12)³⁴, while contigs were binned into MAGs using VAMB (v.3.0.8)¹⁸ with default settings.

The reads, contigs and MAGs were run through the MAGinator workflow (v.0.1.16). For comparison purposes, the VAMB clusters were annotated with an NCBI Taxonomy ID using CAMITAX¹⁰. The profile was created with a custom R script and the lineage was found using NCBI’s taxonomy toolkit (https://bioinf.shenwei.me/taxonkit). As the strain identifiers from the gold standard (e.g. 1313.1) do not exist in the NCBI database, we appended an extra number to the Taxonomy ID of clusters that shared the same species-level annotation, running from 1 to the number of redundantly annotated clusters.
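The identifier scheme can be illustrated with a short Python sketch. It is an assumption-laden reconstruction rather than the custom R script used in the study: the function name and cluster names are hypothetical, the order in which suffixes are handed out (sorted cluster name) is arbitrary, and leaving uniquely annotated clusters without a suffix is an assumption.

```python
from collections import defaultdict

def add_strain_suffixes(cluster_taxids):
    """Append .1, .2, ... to Taxonomy IDs shared by several clusters."""
    by_taxid = defaultdict(list)
    for cluster in sorted(cluster_taxids):
        by_taxid[cluster_taxids[cluster]].append(cluster)
    labelled = {}
    for taxid, clusters in by_taxid.items():
        if len(clusters) == 1:
            # Assumption: uniquely annotated clusters keep the plain Taxonomy ID.
            labelled[clusters[0]] = str(taxid)
        else:
            for i, cluster in enumerate(clusters, start=1):
                labelled[cluster] = f"{taxid}.{i}"
    return labelled

# Hypothetical clusters; two share the species-level Taxonomy ID 1313.
labels = add_strain_suffixes({"cluster_a": 1313, "cluster_b": 1313, "cluster_c": 562})
```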

The data for the benchmarking were obtained from the CAMI second challenge evaluation of profiles. The profiles used for the benchmarking in this study were selected based on the best-performing tools found in the CAMI II paper. The top 10 profiles comprise DUDes³⁵ (v.0.08), LSHVec¹², MetaPhlAn2¹⁴ (v.2.9.22), MetaPhyler² (v.1.25), mOTUs³ (v.2.0.1 and v.2.5.1) and TIPP³⁶ (v.4.3.10). The profiles were compared using the Open-community Profiling Assessment tooL (OPAL) (v.1.0.11), which was run with default settings.

Franzosa et al. reanalysis

Processed taxa and metadata tables were obtained from the supplementary materials of Franzosa et al.¹³. Raw data were downloaded from ENA using the provided accessions and run through preprocessing, assembly and binning before running the entire MAGinator pipeline. Four samples failed the assembly (PRISM | 7238, PRISM | 7445, PRISM | 7947, PRISM | 8550) and were excluded from all downstream analyses, both in the original and the MAGinator-processed tables, leaving 216 samples.

Statistical methods for abundance matrices

Abundance matrices were analysed in R (v.4.1.2). Sample management and beta diversity calculations were done in {phyloseq}³⁷, along with PCoA analysis. Differential abundance testing was done with the {DAtest} R package, which uses the Wilcoxon test function (wilcox.test) from the {stats} package, with p-values adjusted by Benjamini-Hochberg false discovery rate correction. Corrected p-values less than 0.05 were considered significant.
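For readers unfamiliar with the Benjamini-Hochberg adjustment, the following self-contained Python sketch shows the standard step-up procedure and the 0.05 cutoff. It is a didactic re-implementation, not the R code used in the study, and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four differential abundance tests.
pvals = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in adj]
```

In this toy example only the first test remains significant after correction, although three raw p-values were below 0.05.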

Subspecies resolution of Bifidobacterium longum

COPSAC dataset - data characteristics and preparation

The COPSAC₂₀₁₀ cohort consists of 700 unselected children recruited during pregnancy week 24 and followed closely throughout childhood with extensive sample collection, exposure assessments and longitudinal clinical phenotyping³⁸,³⁹,⁴⁰. From the cohort, we used 662 deeply sequenced metagenomics samples taken at 1 year of age. The details of the study and sequencing protocol have previously been published⁴⁰. The samples consist of 150-bp paired-end reads, with a mean ± SD of 48 ± 15.5 million reads per sample.

The data was analysed using the same approach as for the strain-madness data set, with the exception of filtering away reads shorter than 50 bp in the preprocessing step. This workflow yielded 880 MAG clusters for the samples.

MAGinator was run using the reads, contigs and MAGs from VAMB as input, thus creating a set of signature genes for each MAG cluster found de novo in this particular dataset.

CHILD dataset - data characteristics and preparation

The Canadian Healthy Infant Longitudinal Development (CHILD) study comprises a large longitudinal birth cohort with stool collection in infancy for microbiome analysis⁴¹. Stool samples used in this analysis were sequenced to an average depth of 4.85 million reads (SD: 1.79 million), and samples which included >1 million reads after preprocessing were kept for the current analysis⁷.

We analysed a subset of the CHILD cohort, consisting of 2846 metagenomic sequenced faecal samples from infants. To overcome the shallow sequencing, the signature genes of the COPSAC₂₀₁₀ cohort were used to profile the samples instead of running MAGinator. To ensure that the process of the read mappings was identical to COPSAC, the read mapping was carried out using the full gene catalogue. Next, the read counts for the signature genes were extracted and used to derive sample-wise abundances for each MAG cluster.

Examining Bifidobacterium MAG clusters

The detection of signature genes for B. infantis in the COPSAC₂₀₁₀ (n = 662) and CHILD (n = 2846) cohorts was examined by creating a binary detection matrix and using the standard function (heatmap) with default values in R. Furthermore, we compared the abundances of all the Bifidobacterium MAG clusters derived from MAGinator with abundance estimates from MetaPhlAn 3 (v.3.0.7) and subspecies phylogenies from StrainPhlAn 3 (v.3.0.7) for the species Bifidobacterium longum. The phylogenetic tree output by StrainPhlAn was converted into a distance matrix and clustered using partitioning around medoids into two clusters. The two clusters were annotated as B. longum subsp. longum (B. longum) and B. infantis based on the placement of Bifidobacterium longum reference genomes in the phylogenetic tree.
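A binary detection matrix of this kind can be derived from a read count table as in the minimal Python sketch below. The layout (rows as samples, columns as signature genes), the function names and the threshold of at least one mapped read are assumptions made for illustration; the study's own matrix was built and plotted in R.

```python
def detection_matrix(counts, min_reads=1):
    """Convert a samples-by-genes read count table into 0/1 detection calls."""
    return [[1 if c >= min_reads else 0 for c in row] for row in counts]

def genes_detected_per_sample(binary):
    """Number of signature genes detected in each sample."""
    return [sum(row) for row in binary]

# Hypothetical counts for three samples (rows) and three signature genes (columns).
counts = [[0, 5, 2], [0, 0, 0], [7, 1, 0]]
binary = detection_matrix(counts)
detected = genes_detected_per_sample(binary)
```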

SNV-level phylogenetic trees for COPSAC dataset

For each MAG cluster, the sequences of the signature genes were used as a reference to create an SNV-level phylogenetic tree. The trees for COPSAC₂₀₁₀ were constructed with the default values of MAGinator, producing both a tree in Newick file format for each MAG cluster and files containing the statistics for the alignments. The tree for Faecalibacterium sp900758465 was visualised in R using {ggtree}⁴². The heatmaps in Suppl. Figure 7 were constructed from B) stats.tab and C) stats_genes.tab. The median frequency of bases in the signature gene alignment with mixed alleles was calculated based on positions with a minimum depth of 2 and normalised according to the gene length. A major allele frequency of at least 0.8 was required for the sample to be considered homogenous. These are also the default cutoffs, and users can adjust them to trade off sensitivity for specificity.
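One possible reading of the two cutoffs is sketched below in Python: positions covered by fewer than two reads are ignored, and a covered position is flagged as mixed when its major allele frequency falls below 0.8. The function names and the per-position dictionaries of base counts are hypothetical, and the sketch does not reproduce the median or gene-length normalisation described above.

```python
def major_allele_frequency(base_counts):
    """Share of reads at one position that support the most common base."""
    depth = sum(base_counts.values())
    return max(base_counts.values()) / depth

def mixed_positions(pileup, min_depth=2, min_major_freq=0.8):
    """Flag each sufficiently covered position as mixed (True) or homogenous (False)."""
    flags = []
    for base_counts in pileup:
        if sum(base_counts.values()) < min_depth:
            continue  # too shallow to judge
        flags.append(major_allele_frequency(base_counts) < min_major_freq)
    return flags

# Hypothetical base counts at four alignment positions in one sample.
pileup = [{"A": 10}, {"A": 8, "G": 2}, {"C": 3, "T": 3}, {"G": 1}]
flags = mixed_positions(pileup)
fraction_mixed = sum(flags) / len(flags)
```

Here the fourth position is skipped for low depth, the second sits exactly at the 0.8 cutoff and counts as homogenous, and only the third is called mixed.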

Strain mixtures within de novo MAGs across environments

To assess the degree of within-MAG cluster strain diversity in non-human-associated environments, two public datasets were included in the analysis: a study of the honey-bee gut by Engel and Ellegaard⁴³ and the Tara Oceans study⁴⁴. The raw data were processed with the same workflow as the strain-madness data and then run through MAGinator. Due to computational limitations and the size of the Tara Oceans samples, only 148 of 243 samples were successfully assembled.

Gene syntenies and functional annotation for COPSAC dataset

The non-redundant genes were annotated using eggNOG mapper (v.2.0.2)¹⁹,⁴⁵,⁴⁶. Of the 14.7 million non-redundant genes, 9.2 million were annotated. The visualisation of the synteny clusters was done with {igraph}⁴⁷.

Statistics and reproducibility

The statistical analyses in this study were conducted with R (v.4.1.2). We analysed five public datasets: COPSAC₂₀₁₀³⁸ (n = 662), CHILD⁴¹ (n = 2846), the Franzosa et al. IBD study¹³ (n = 220), Tara Oceans⁴⁴ (n = 243) and honey-bee⁴³ (n = 54). For Franzosa et al. and Tara Oceans, not all samples were successfully assembled; the failed samples were excluded from the analyses, leaving 216 and 148 samples, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All relevant data supporting the key findings of this study are available within the article and its Supplementary Information files. Supplementary dataset 1 contains the Supplementary Figs. and tables. The CAMI II strain-madness benchmarking dataset is available at https://frl.publisso.de/data/frl:6425521/strain/short_read/. The gold standard and benchmark profiles are found at https://github.com/CAMI-challenge/second_challenge_evaluation/tree/master/profiling. The dataset from Franzosa et al. used for benchmarking is available as supplementary material from their paper, and the raw data are available at ENA accession SAMN08049618. The raw COPSAC fastq files are available at NCBI under BioProject PRJNA715601. The honey-bee data are publicly available in the Sequence Read Archive (SRA) under accession SRP150166. The Tara Oceans data set is publicly available at ENA under Study accession PRJEB1787. The CHILD shotgun metagenomics sequencing data are available at NCBI BioProject PRJNA838575. Source data are provided with this paper. Availability and implementation: MAGinator is available as a Python module at https://github.com/Russel88/MAGinator.

Code availability

MAGinator is available at GitHub (https://github.com/Russel88/MAGinator)⁴⁸.

References

1. Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 11, 1633–1644 (2023).
2. Liu, B., Gibbons, T., Ghodsi, M. & Pop, M. MetaPhyler: Taxonomic profiling for metagenomic sequences. in 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 95–100 (IEEE, Hong Kong, China, 2010).
3. Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).
4. Liu, Y. et al. CSMD: a computational subtraction-based microbiome discovery pipeline for species-level characterization of clinical metagenomic samples. Bioinformatics 36, 1577–1583 (2019).
5. Meyer, F. et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
6. Underwood, M. A., German, J. B., Lebrilla, C. B. & Mills, D. A. Bifidobacterium longum subspecies infantis: champion colonizer of the infant gut. Pediatr. Res 77, 229–235 (2015).
7. Dai, D. L. Y. et al. Breastfeeding enrichment of B. longum subsp. infantis mitigates the effect of antibiotics on the microbiota and childhood asthma risk. Med. 4, 92–112.e5 (2023).
8. Asakuma, S. et al. Physiology of Consumption of Human Milk Oligosaccharides by Infant Gut-associated Bifidobacteria. J. Biol. Chem. 286, 34583–34592 (2011).
9. Ojima, M. N. et al. Priority effects shape the structure of infant-type Bifidobacterium communities on human milk oligosaccharides. ISME J. 16, 2265–2279 (2022).
10. Bremges, A., Fritz, A. & McHardy, A. C. CAMITAX: Taxon labels for microbial genomes. GigaScience 9, giz154 (2020).
11. Meyer, F. et al. Assessing taxonomic metagenome profilers with OPAL. Genome Biol. 20, 51 (2019).
12. Shi, L. & Chen, B. LSHvec: a vector representation of DNA sequences using locality sensitive hashing and FastText word embeddings. in Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics 1–10 (ACM, Gainesville Florida, 2021).
13. Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol 4, 293–305 (2018).
14. Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).
15. LoCascio, R. G., Desai, P., Sela, D. A., Weimer, B. & Mills, D. A. Broad conservation of milk utilization genes in Bifidobacterium longum subsp. infantis as revealed by comparative genomic hybridization. Appl Environ. Microbiol. 76, 7373–7381 (2010).
16. Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).
17. Zachariasen, T. et al. Identification of representative species-specific genes for abundance measurements. Bioinforma. Adv. 3, vbad060 (2023).
18. Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).
19. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
20. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, (2000).
21. QuantStack development team & Mamba contributers. Mamba (v.0.13.0). https://mamba.readthedocs.io (2020).
22. Mölder, F. et al. Sustainable data analysis with snakemake. F1000Res 10, 33 (2021).
23. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).
24. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119 (2010).
25. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
26. Vasimuddin, Md., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314–324 (IEEE, Rio de Janeiro, Brazil, 2019).
27. Li, H. et al. The sequence alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
28. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evolution 30, 772–780 (2013).
29. Price, M. N., Dehal, P. S. & Arkin, A. P. Fasttree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
30. Minh, B. Q. et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evolution 37, 1530–1534 (2020).
31. Van Dongen, S. Graph clustering via a discrete uncoupling process. SIAM J. Matrix Anal. Appl. 30, 121–141 (2008).
32. Joshi N. A., Fass J. N. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. (2011).
33. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Computational Biol. 19, 455–477 (2012).
34. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
35. Piro, V. C., Lindner, M. S. & Renard, B. Y. DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics 32, 2272–2280 (2016).
36. Nguyen, N., Mirarab, S., Liu, B., Pop, M. & Warnow, T. TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics 30, 3548–3555 (2014).
37. McMurdie, P. J. & Holmes, S. phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8, e61217 (2013).
38. Bisgaard, H. et al. Deep phenotyping of the unselected COPSAC₂₀₁₀ birth cohort study. Clin. Exp. Allergy 43, 1384–1394 (2013).
39. Stokholm, J. et al. Maturation of the gut microbiome and risk of asthma in childhood. Nat. Commun. 9, 141 (2018).
40. Li, X. et al. The infant gut resistome associates with E. coli, environmental exposures, gut microbiome maturity, and asthma-associated bacterial composition. Cell Host Microbe 29, 975–987.e4 (2021).
41. Moraes, T. J. et al. The Canadian Healthy Infant Longitudinal Development birth cohort study: biological samples and biobanking: the CHILD study: biological samples. Paediatr. Perinat. Epidemiol. 29, 84–92 (2015).
42. Xu, S. et al. Ggtree: A serialized data object for visualization of a phylogenetic tree and annotation data. iMeta 1, (2022).
43. Ellegaard, K. M. & Engel, P. Genomic diversity landscape of the honey bee gut microbiota. Nat. Commun. 10, 446 (2019).
44. Sunagawa, S. et al. Ocean plankton. Structure and function of the global ocean microbiome. Science 348, 6237 (2015).
45. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evolution 38, 5825–5829 (2021).
46. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
47. Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal, Complex Systems 1695, 1–9 (2006).
48. Zachariasen T & Russel J. MAGinator enables accurate profiling of de novo MAGs with strain-level phylogenies. https://github.com/Russel88/MAGinator, https://doi.org/10.5281/zenodo.11485929 (2024).

Acknowledgements

We express our deepest gratitude to the children and families of the COPSAC cohort studies for all their support and commitment. We acknowledge and appreciate the unique efforts of the COPSAC research team. All funding received by COPSAC is listed on www.copsac.com. The Lundbeck Foundation (Grant no R16-A1694); The Ministry of Health (Grant no 903516); Danish Council for Strategic Research (Grant no 0603-00280B) and The Capital Region Research Foundation have provided core support to the COPSAC research centre. JS has received funding from the Danish Council for Independent Research (Grant no. 8045-00081B). We thank the CHILD Cohort Study (CHILD) participant families for their dedication and commitment to advancing health research. CHILD was initially funded by CIHR and AllerGen NCE, and the metagenomic data reported here was generated with support from Genome Canada and Genome BC (274CHI).

Author information

Authors and Affiliations

Department of Health and Technology, Section of Bioinformatics, Technical University of Denmark, Lyngby, Denmark
Trine Zachariasen, Gisle A. Vestergaard, Ole Lund & Asker Brejnrod
Department of Biology, Section of Microbiology, University of Copenhagen, Copenhagen, Denmark
Jakob Russel, Søren J. Sørensen & Jakob Stokholm
Department of Pediatrics, BC Children’s Hospital, University of British Columbia, 950 West 28th Avenue, Vancouver, BC, Canada
Charisse Petersen & Stuart E. Turvey
COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
Shiraz Shah, Jakob Stokholm & Jonathan Thorsen
Danish Multiple Sclerosis Center, Department of Neurology, Copenhagen University Hospital, Rigshospitalet-Glostrup, Glostrup, Denmark
Pablo Atienza Lopez & Moschoula Passali
Department of Food Science, University of Copenhagen, Copenhagen, Denmark
Pablo Atienza Lopez

Contributions

The figures and tables were created by T.Z., P.A.L., A.B. and J.T. T.Z., A.B., J.R., J.T. and S.S. drafted the manuscript. The MAGinator software was developed and set up by T.Z. and J.R. T.Z., J.R., C.P., G.V., S.S., P.A.L., M.P., S.T., S.J.S., O.L., J.S., A.B. and J.T. provided intellectual input and aided in the theoretical aspects of shaping this study. The corresponding author had full access to the data and held the final responsibility for deciding to submit the manuscript for publication. T.Z., J.R., C.P., G.V., S.S., P.A.L., M.P., S.T., S.J.S., O.L., J.S., A.B. and J.T. guarantee that the accuracy and integrity of any part of the work have been appropriately investigated and resolved, and all have approved the final version of the manuscript. None of the authors received any honorarium, grant, or other forms of payment for creating this manuscript.

Corresponding author

Correspondence to Trine Zachariasen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Stephen Nayfach and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zachariasen, T., Russel, J., Petersen, C. et al. MAGinator enables accurate profiling of de novo MAGs with strain-level phylogenies. Nat Commun 15, 5734 (2024). https://doi.org/10.1038/s41467-024-49958-8

Received: 18 September 2023
Accepted: 21 June 2024
Published: 09 July 2024
DOI: https://doi.org/10.1038/s41467-024-49958-8