Reference databases for taxonomic classification
The growth in volume and diversity of published sequence data is reflected in the number of reference databases that exist to curate these data and make them publicly available (see for example www.oxfordjournals.org/our_journals/nar/database/cap/), with new compendia appearing yearly to meet evolving biological interests (Rigden and Fernández 2021). Many of the databases that underpin metagenomic analysis are not specific to this purpose; they tend to be joint genome curation projects involving multiple major funding bodies, in order to sustain the hands-on maintenance required to keep information current.
Metagenome databases suitable for taxonomic classification primarily revolve around archiving, mining, and annotation of individually sequenced bacterial genomes. The National Center for Biotechnology Information (NCBI), the European Nucleotide Archive, and the DNA Data Bank of Japan together make up the International Nucleotide Sequence Database Collaboration (INSDC, www.insdc.org), which serves as the primary global repository of genome sequences from all domains of life, as well as viruses. The US Department of Energy Joint Genome Institute (JGI) also hosts several bioinformatic resources to centralise access to genomic and metagenomic data. JGI's Genomes OnLine Database (GOLD, Mukherjee et al. 2021) and Integrated Microbial Genomes and Microbiomes (IMG/M, Chen et al. 2019) are complementary tools that facilitate the classification of metagenomic data. The former is a registry for genome and metagenome projects and ensures complete documentation of the metadata associated with each project. The latter specialises in access, annotation, and analysis of microbial genomes and metagenomes. IMG/M connects external taxonomic and functional annotation databases via several bioinformatics pipelines for comparative analyses.
Such integration is a key feature of metagenomic reference databases, as extensive interrelation and sharing of genomes allows for varying degrees of curation within different resources. For example, GenBank (Sayers et al. 2020) is the INSDC-supported database from which the NCBI RefSeq database (Haft et al. 2018; O'Leary et al. 2015) is curated to provide a non-redundant set of high-quality genomes with established taxonomy and comprehensive annotations. The Genome Taxonomy Database (GTDB, Parks et al. 2020, 2018) and proGenomes (Mende et al. 2017) are two subsequent databases built upon these NCBI resources, which further seek to standardise taxonomic annotation of bacterial and archaeal genomes. Both GTDB and proGenomes assign genomes to species clusters and curate non-redundant sets of genomes for different taxa. Curating standardised databases streamlines database queries based on taxonomy, which is no marginal advantage given the size and scale of these metagenome databases: GTDB and proGenomes contain ~150,000 and ~84,000 bacterial and archaeal genomes, respectively, spanning tens of thousands of species clusters. Efficient data mining, effective database management and integration, and scalable comparative genomic analyses are therefore major priorities for metagenome reference databases.
Non-bacterial reference databases
While there has been an historical emphasis on bacterial members of the microbiome, there is a growing number of reference databases that specifically support the study of non-bacterial taxa. These include fungi (the mycobiome, Lai et al. 2018), viruses (the virome, Carding et al. 2017), archaea (the archaeome, Moissl-Eichinger et al. 2018), nematodes (Harris et al. 2019), and eukaryotic parasites (Dheilly et al. 2017). Metagenome resources for more targeted questions, such as pathogen genomics, are also available; examples include the Virus Pathogen Resource (ViPR, Pickett et al. 2012) and the Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB, Aurrecoechea et al. 2010). Given their more clinical focus, such databases frequently also incorporate sample or clinical metadata relevant to each submission, either internally or via links to related resources such as the Clinical Epidemiology Database Resources (ClinEpiDB, Ruhamyankaka et al. 2020).
Current limitations of reference databases
In spite of huge efforts to curate high-quality references and maximise their accessibility, a fundamental limitation of existing databases is that they are missing the ‘dark matter’ of the global metagenome. While gauging the extent of this problem is not trivial, it has been attempted. For example, Zhang et al. (2020) matched all unique taxa identified by the Earth Microbiome Project (EMP) using 16S gene sequencing against all available reference genomes in the RefSeq database. A median of 62% of EMP taxa present in host-associated microbiomes could be matched to an existing RefSeq genome at a threshold of 97% similarity, which enables the crude estimate that the unmatched remainder, up to ~40% of mammalian holobiont bacterial species, may be missing from reference genome databases. It is less clear what proportion of eukaryotic, viral or archaeal data may be missing. However, similar issues are likely to exist, particularly as databases specialised in non-bacterial organisms tend to be comparatively small. For example, eukaryotic databases such as WormBase and VEuPathDB have in the range of 1–200 unique species for each family.
A second, related issue is that microbial reference databases tend to show distinct taxonomic bias. The historical need for microbial reference genomes to have come from organisms successfully isolated from environmental samples means this bias is weighted towards those taxa that can be successfully cultured (Browne et al. 2016). It also reflects uneven distribution of research effort towards organisms of particular interest, such as model organisms or important pathogens, as well as a bias towards microbes associated with human populations that have been disproportionately studied (Pasolli et al. 2019).
An unfortunate consequence of both these limitations is that potentially important biological associations may be lost in the proportion of mWGS reads that are returned as ‘unclassified’ in metagenomic studies. This risk is greater when considering microbial clades that are comparatively understudied, or human populations that are historically not well represented. While these limitations do not preclude use of reference databases as an essential means of taxonomic classification, they need to be accounted for when interpreting metagenomic analyses and represent a challenge that needs to be overcome if we are to improve our understanding of the microbiome contribution to host health.
Matching sequences to reference databases
Sequence pre-processing
Quality control is a fundamental upstream process in all sequence-based analysis, involving steps such as the removal of likely PCR duplicates, removal of sequencing adapters, trimming of low-quality bases from reads, and the masking or removal of low-complexity regions. One fundamental pre-processing step is the removal of sequences originating from contaminant DNA (i.e. DNA that is not the focus of the study). While this is an important consideration in single genome studies (Goig et al. 2020), it represents a particular challenge in mWGS, where host DNA is likely to be highly prevalent in samples and hence to constitute a significant fraction of the sequenced reads.
Removal of host reads is a vital precursor for both taxonomic and functional characterisation of the gut microbiome. Failure to do so may confound taxonomic estimates through misassignment of host reads to microbial taxa. This may be of particular importance for virome analysis due to the presence of endogenous retroviruses, which are estimated to make up approximately 8–10% of the human and mouse genomes (Lander et al. 2001). Failure to remove host reads may also result in chimeric assemblies, which can in turn lead to annotation of spurious proteins that can confound both taxonomic and functional analysis (Breitwieser et al. 2019).
Removal of host reads is fundamentally a case of matching sequences to a reference database (albeit one containing only the host reference genome); as such, the approaches reviewed below (local alignment and alignment-free sequence classification) are applicable to this problem.
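As an illustration, the following minimal Python sketch (using the pysam library) retains only read pairs for which neither mate aligned to the host genome. It assumes reads have already been mapped to a host reference with an aligner such as Bowtie2 or BWA, and the BAM file name is hypothetical.

```python
# Minimal sketch: retain read pairs for which neither mate aligned to the host genome.
# Assumes reads were first mapped to a host reference (e.g. with Bowtie2 or BWA) and
# that the resulting BAM file is named "host_aligned.bam" (a hypothetical name).
import pysam

non_host_ids = set()
with pysam.AlignmentFile("host_aligned.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        # Keep a pair only if both mates failed to align to the host reference.
        if read.is_unmapped and read.mate_is_unmapped:
            non_host_ids.add(read.query_name)

print(f"{len(non_host_ids)} read pairs retained for downstream microbial analysis")
```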
Local alignment
The earliest published attempts to shotgun sequence the human gut metagenome employed de novo sequence assembly, before matching reassembled genes and genomic fragments to in-house protein reference databases using BLASTP (Gill et al. 2006; Kurokawa et al. 2007). Since these landmark studies, local alignment of query sequences to reference databases with previously ascribed taxonomy has remained a cornerstone of metagenomic characterisation.
The recent massive growth in the size of both query and reference datasets has meant that early alignment tools such as BLAST are no longer computationally efficient for the analysis of metagenomic sequence data. Fast and accurate characterisation of microbial sequences remains a research priority, particularly in large-scale disease studies where microbiome datasets may extend to thousands of samples, each with tens of millions of reads (e.g. Lloyd-Price et al. 2019), or in clinical studies, where speed and accuracy of diagnosis are likely to be critical (Chiu and Miller 2019).
As with other genomics fields, metagenomics has benefited from the fact that the continuous advancement of technologies has inspired reciprocal development of many alignment tools optimised to cope with both the scale and features of sequence data produced on NGS platforms (Fonseca et al. 2012). A key element that sets apart tools suited for metagenomic analysis is their ability to efficiently index reference genomes so that they can be accessed and searched with great speed. Indexing is a particular challenge for metagenomic analysis, where reference sequence databases may be an order of magnitude larger than the databases required to represent single mammalian genomes.
For second-generation mWGS data, two of the most widely adopted tools for metagenomic sequence alignment are Bowtie (Langmead and Salzberg 2012; Langmead et al. 2009) and BWA (Li and Durbin 2009). Both use the FM-index in conjunction with the Burrows–Wheeler Transform (reviewed in Canzar and Salzberg 2017) to efficiently index and compress reference sequences for rapid searching. While both approaches achieve significant improvements in alignment speed, they also still rely on heuristic ‘seed-and-extend’ approaches (reviewed in Ahmed et al. 2016), and are hence not guaranteed to find optimal alignments for all query sequences. Such heuristic assumptions are likely to have little practical impact in situations where the resulting alignment accuracy is sufficient to correctly assign reads to divergent taxa (Al-Ghalith and Knights 2020). However, as microbiome studies move towards an increasing emphasis on the ability to accurately discriminate between closely related strains, recently developed rapid, heuristic-free short-read aligners such as BURST (Al-Ghalith and Knights 2020) are likely to become increasingly important in microbiome studies.
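To give a flavour of this indexing strategy, the short Python sketch below computes the Burrows–Wheeler Transform of a toy sequence by sorting its cyclic rotations. It is purely illustrative: production aligners build FM-indexes over whole genomes using far more memory-efficient suffix-array construction.

```python
# Minimal sketch of the Burrows-Wheeler Transform (BWT) that underlies the FM-index
# used by Bowtie and BWA. Illustrative only: real aligners do not sort rotations.
def bwt(text: str, terminator: str = "$") -> str:
    """Return the BWT of `text`: the last column of its sorted cyclic rotations."""
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("GATTACA"))  # a reversible permutation of the reference characters
```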
Current state-of-the-art short-read alignment approaches, such as BWA, have also been adapted for alignment of the longer reads typically produced by third-generation sequencing platforms (Li 2013). However, aligners such as BLASR (Chaisson and Tesler 2012) and Minimap2 (Li 2018) have also come to the fore, having been specifically designed to overcome the sequencing errors encountered on these platforms. More recently, for ONT, these approaches have been further improved by the ability to predict and model the structure of errors inherent to nanopore sequencing (Joshi et al. 2020).
Alignment-free sequence classification
The speed at which reads can be aligned using state-of-the-art short- and long-read aligners means such approaches remain viable for searching increasingly large numbers of query reads against ever-growing reference databases. However, when the exact location of a specific read within a reference genome is not important, as is the case when the primary goal is to estimate taxonomic origin, precise alignment represents an unnecessary computational cost.
The relative computational efficiency of matching exact kmers, rather than aligning long, potentially ambiguous reads, means that matching query sequences to reference databases based on the overall similarity of their respective kmer compositions is an extremely rapid way to achieve alignment-free classification of metagenomic sequences (Ren et al. 2018). For example, Kraken (Wood and Salzberg 2014), one of the most widely used metagenomic taxonomic profiling tools, divides reference genomes (by default, derived from RefSeq) into kmers, then assigns each unique kmer to the lowest taxonomic rank that represents all the genomes in which it can be found (a so-called lowest common ancestor (LCA) approach). Kmers from query sequences can then be matched to this taxonomy, and the taxonomic origin of a complete read inferred from the distribution of its constituent kmers within the underlying taxonomic tree.
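The toy Python sketch below illustrates this idea (taxonomy and sequences are invented): each reference kmer is mapped to the LCA of the genomes containing it, and a read is then classified from the taxa hit by its own kmers. It simplifies Kraken, which resolves classifications by weighting root-to-leaf paths of the taxonomic tree.

```python
# Illustrative sketch of Kraken-style kmer/LCA classification (not the Kraken code).
from collections import Counter

# Toy taxonomy: child -> parent (invented for this example).
PARENT = {"E. coli": "Escherichia", "E. fergusonii": "Escherichia",
          "Escherichia": "Enterobacteriaceae", "Enterobacteriaceae": "root"}

def lineage(taxon):
    path = [taxon]
    while taxon in PARENT:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def lca(a, b):
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)

def build_kmer_to_lca(genomes, k=5):
    """Map every reference kmer to the LCA of all genomes in which it occurs."""
    table = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            table[kmer] = taxon if kmer not in table else lca(table[kmer], taxon)
    return table

def classify(read, table, k=5):
    """Assign the read to the taxon supported by most of its kmers (simplified)."""
    hits = Counter(table[read[i:i + k]] for i in range(len(read) - k + 1)
                   if read[i:i + k] in table)
    return hits.most_common(1)[0][0] if hits else "unclassified"

genomes = {"E. coli": "ATGCGTACGTTAGC", "E. fergusonii": "ATGCGTACCTTAGC"}
print(classify("GCGTACGTTA", build_kmer_to_lca(genomes)))
```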
Resolving ambiguous classification
Not only is it possible to use the LCA to infer the taxonomy of a read based on the classification of its constituent kmers, it is also possible to use this approach to classify reads that align ambiguously to multiple reference sequences. For example, MEGAN (Huson et al. 2007) provided an early implementation of this method to assign taxonomy to locally aligned microbial sequences, which has recently been adapted to work with third-generation sequence data (Huson et al. 2018).
While LCA strategies offer a robust approach to taxonomic classification, a recent study has suggested that trends in the growth of the underlying reference databases potentially limit their ability to classify sequences at species or strain level (Nasko et al. 2018). Specifically, the authors noted that the recent massive expansion in the number of bacterial genomes in RefSeq has resulted in a rapid increase in the number of species accessions, but little expansion in the number of genera represented. This increasing species-to-genera ratio (and hence increasing number of genomes displaying a high degree of sequence homology at species level) reduces the ability of LCA approaches to accurately assign taxonomy at species level. This observation has led to a call for continued development of such methods to maximise taxonomic resolution while minimising the risk of false positives.
One potential solution to this problem is to probabilistically reassign ambiguously classified reads to their most likely taxon of origin. Such an approach has been implemented for aligned reads in the PathoID module within the PathoScope pipeline (Hong et al. 2014). An analogous approach has also been developed for Kraken (Lu et al. 2017), which, rather than reassigning individual reads, provides species-level abundance estimates based on LCA read assignments. Such approaches are likely to become increasingly relevant due to both the growing interest in understanding the microbiome at high taxonomic resolution and the increasing levels of sequence homology within taxonomic reference databases.
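A much-simplified Python sketch of this general idea is shown below: reads that stopped at a genus-level LCA are redistributed to species in proportion to each species' uniquely assigned reads. The counts are invented, and the logic only gestures at the Bayesian/expectation-maximisation re-estimation used by PathoScope and Bracken.

```python
# Redistribute ambiguous (genus-level) read counts to species in proportion to each
# species' uniquely assigned reads. Illustrative only; not the published algorithms.
def redistribute(genus_count, unique_species_counts):
    """Share `genus_count` reads among species proportionally to unique evidence."""
    total_unique = sum(unique_species_counts.values())
    if total_unique == 0:
        return dict(unique_species_counts)  # no evidence to guide reassignment
    return {species: unique + genus_count * unique / total_unique
            for species, unique in unique_species_counts.items()}

# Hypothetical counts: 300 reads stopped at the genus LCA, plus unique species hits.
print(redistribute(300, {"Bacteroides fragilis": 120, "Bacteroides ovatus": 80}))
# -> estimated species-level abundances of 300 and 200 reads, respectively
```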
Refining reference databases to reduce search space
Increasing the speed with which query sequences can be matched to sequences of known taxonomy in a reference database is one way to overcome the challenges inherent in taxonomic profiling of metagenomic sequence data. A second is to curate reference databases to remove redundant information that does not discriminate between taxa, unnecessarily lengthens search times, or both.
As with read alignment and profiling, multiple approaches have been developed that exploit different characteristics of the metagenome in order to design computationally efficient references. One such approach is to leverage the pan-genome concept, which encompasses the fact that bacterial strains of the same species consist of a core genome (present in all strains) and a dispensable genome (consisting of those genomic regions that may be present in some, but not all strains, Medini et al. 2005). The pan-genome is therefore the combination of the core and dispensable genomes for a species. This concept becomes increasingly relevant as microbial reference databases move from having one representative genome for each species, towards multiple and sometimes thousands of different strains. Zhou et al. (2018) exploited this concept by creating a reference database consisting solely of species pan-genomes. Resulting references were 2–20 times smaller in size (bp) than the total size of contributing strain genomes. Furthermore, this pan-genome database resulted in improved rates of read classification over databases including only a single representative genome for each species.
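The toy Python sketch below makes the core/dispensable distinction concrete using an invented gene presence/absence table; strain and gene names are hypothetical.

```python
# Toy sketch of the pan-genome concept: given gene content per strain, split genes
# into the core genome (present in all strains) and the dispensable genome.
strain_genes = {
    "strain_A": {"geneA", "geneB", "geneC"},
    "strain_B": {"geneA", "geneB", "geneD"},
    "strain_C": {"geneA", "geneB", "geneE"},
}

pan_genome = set.union(*strain_genes.values())
core_genome = set.intersection(*strain_genes.values())
dispensable_genome = pan_genome - core_genome

print("pan:", sorted(pan_genome))
print("core:", sorted(core_genome))
print("dispensable:", sorted(dispensable_genome))
```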
A second approach to minimising the size of a reference database, while retaining its ability to taxonomically classify query sequences, is to retain only discriminatory genes that are unique to a single species (Segata et al. 2012). MetaPhlAn is based on this concept and uses local alignment with Bowtie2 to match query mWGS reads to gene families that are selected to be both present in a species' core genome and unique to that species (Beghini et al. 2020). Taking this approach, MetaPhlAn 3 is able to efficiently represent over 13,000 microbial species with a reference database of approximately 1.1 million marker genes. Such minimal reference databases result in only a small fraction of reads from query metagenomic datasets being successfully aligned, but they are nonetheless sufficient to provide accurate taxonomic profiling of complex microbial communities. Furthermore, the lightweight design of such discriminatory gene databases means that they are likely to scale efficiently with the increasingly large amounts of data processed in single studies, when compared to approaches that depend on de novo assembly of metagenomes (Segata 2018).
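A minimal sketch of marker selection in this spirit is given below, using invented gene-family content for two species: it keeps only gene families present in every genome of a species and absent from all other species, a simplification of the actual MetaPhlAn marker-curation pipeline.

```python
# Illustrative selection of clade-specific marker gene families (not MetaPhlAn code).
# Species names are real taxa, but the gene-family content is entirely invented.
species_gene_families = {
    "Faecalibacterium prausnitzii": [{"fam1", "fam2", "fam3"}, {"fam1", "fam2", "fam4"}],
    "Roseburia intestinalis": [{"fam2", "fam5", "fam6"}, {"fam2", "fam5", "fam7"}],
}

def select_markers(species, catalogue):
    core = set.intersection(*catalogue[species])   # present in every genome of the species
    others = set().union(*(g for sp, genomes in catalogue.items()
                           if sp != species for g in genomes))
    return core - others                           # ...and absent from all other species

for sp in species_gene_families:
    print(sp, "->", sorted(select_markers(sp, species_gene_families)))
```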
The concept of restricting databases to discriminatory markers is not limited to microbial gene sets. Tu et al. (2014) introduced a method for identifying and selecting unique, discriminatory regions from reference genomes (termed genome-specific markers, GSMs), and subsequently matched query reads to these GSMs using BLAST-like approaches. More recently, CLARK (Ounit and Lonardi 2016; Ounit et al. 2015) has been developed as a kmer-based classifier, comparable to Kraken, that not only exploits the speed of kmer-based searching but also reduces the size of the reference database by storing and searching only discriminatory kmers. This has the advantage of minimising the amount of information that needs to be stored to quickly and accurately discriminate reads. However, it also means that the taxonomic level at which reads are to be classified (i.e. the level at which a single kmer is unique to a clade) must be specified before building a reference index, and hence that an LCA approach to read classification cannot be taken.
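The following toy sketch illustrates the consequence: kmers shared between taxa at the chosen rank are simply discarded when the index is built, rather than being assigned to an ancestor, so the rank cannot be revisited after indexing (sequences and taxon names are invented).

```python
# Sketch of a CLARK-like discriminative kmer index (illustrative only).
def discriminative_index(genomes_by_taxon, k=5):
    seen = {}
    for taxon, seq in genomes_by_taxon.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Mark kmers observed in more than one taxon as non-discriminative (None).
            seen[kmer] = taxon if seen.get(kmer, taxon) == taxon else None
    return {kmer: taxon for kmer, taxon in seen.items() if taxon is not None}

index = discriminative_index({"taxon_A": "ATGCGTACGTTAGC", "taxon_B": "ATGCGTACCTTAGC"})
print(len(index), "discriminative kmers retained")
```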
Beyond reference databases: the dark matter of the metagenome
As discussed, a fundamental problem of assigning taxonomy to metagenomic sequences by matching them to reference databases is that these databases are almost certainly incomplete. While LCA approaches, such as Kraken, may be able to classify reads originating from unknown microbes at higher taxonomic levels, detecting, quantifying, and characterising these microbes at finer taxonomic resolution remains a major challenge.
Reference-extended approaches
Discriminatory marker gene databases, such as those curated by MetaPhlAn, allow users to quantify the unclassified proportion of reads in an mWGS dataset. However, they provide no additional insight into reads originating from taxa missing from the reference genome databases from which they are derived. Other approaches based on sets of universal marker genes offer a potential solution to this problem. In mOTU (Milanese et al. 2019; Sunagawa et al. 2013), the authors curate a database of single-copy marker genes identified as present in all sequenced microbial genomes. They then use a hidden Markov model (HMM)-based approach to generate a profile for each marker gene, based on its sequence properties across known reference genomes (reviewed in Eddy 1996). Such profiles can be used to search for all homologs of each marker gene within de novo assemblies of metagenomic samples. All detected copies of each marker gene are clustered in a step analogous to the generation of operational taxonomic units (OTUs) from 16S gene sequence data. The relative abundance of each ‘meta-OTU’ (mOTU) is then determined, and mOTUs originating from the same genome are identified based on the correlation of their relative abundances across samples. While this innovative approach relies on the computationally challenging step of de novo metagenome assembly, the use of HMM profiles enables detection of marker gene homologs that may be absent from existing genome reference databases, thereby enabling what the authors refer to as reference-extended community profiling. In a recent study, the authors concluded that more than half of the mOTU species detected in 1693 human gut samples were absent from the proGenomes reference database (Milanese et al. 2019).
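The abundance-correlation step can be illustrated with the toy Python sketch below (marker-cluster names and abundances are invented): marker-gene clusters whose relative abundances track one another across samples are grouped into a single mOTU. This is a considerable simplification of the published method.

```python
# Group marker-gene clusters with correlated abundance profiles into one mOTU.
# Illustrative only: names, abundances, and the simple threshold are all invented.
import numpy as np

abundances = {  # rows = marker-gene clusters, columns = relative abundance per sample
    "marker_cluster_1": [0.10, 0.40, 0.05, 0.30],
    "marker_cluster_2": [0.11, 0.38, 0.06, 0.29],   # tracks cluster 1 -> same mOTU
    "marker_cluster_3": [0.50, 0.02, 0.40, 0.01],   # independent profile
}

names = list(abundances)
corr = np.corrcoef([abundances[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if corr[i, j] > 0.9:
            print(names[i], "and", names[j], "grouped into one mOTU")
```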
Sequence-based community profiling
While reference-extended approaches offer the ability to define and quantify previously uncharacterised taxa, it is also possible to compare metagenomes entirely on the basis of sequence composition, without the need to define taxonomic units. The utility of such an approach reflects the fact that changes in the composition of the gut metagenome, such as dysbiosis, are often characteristic of disease states. Tracking such changes is therefore informative, in spite of the fact that it contributes little to our understanding of the mechanisms by which microbes impact host health (Olesen and Alm 2016). With this goal in mind, kmer-based approaches once again represent a computationally efficient method by which to compare the composition of metagenomic samples. MASH (Ondov et al. 2016) is an implementation of the MinHash algorithm (reviewed in Rowe 2019), which provides an extremely fast method for approximating the proportion of kmers shared between two metagenomes. The utility of this type of approach has since been extended to account for the relative abundance of kmers when assessing samples, and to enable signatures to be searched as well as compared (Pierce et al. 2019).
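The Python sketch below gives a minimal, illustrative MinHash implementation: only the smallest s kmer hashes of each sample are retained, and the overlap of these small sketches estimates the Jaccard similarity of the full kmer sets. It is not the MASH implementation, and the sequences are invented.

```python
# Minimal MinHash sketch comparison in the spirit of MASH (illustrative only).
import hashlib

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(kmer_set, s=100):
    """Keep the s smallest hash values of the sample's kmers."""
    hashed = sorted(int(hashlib.sha1(k.encode()).hexdigest(), 16) for k in kmer_set)
    return set(hashed[:s])

def jaccard_estimate(sketch_a, sketch_b, s=100):
    # The fraction of the union's s smallest hashes found in both sketches
    # estimates the Jaccard index of the full kmer sets.
    smallest = sorted(sketch_a | sketch_b)[:s]
    return sum(h in sketch_a and h in sketch_b for h in smallest) / len(smallest)

a = kmers("ATGCGTACGTTAGCATGGCAATCGGATCGATCGGTACGATCG" * 3)
b = kmers("ATGCGTACGTTAGCATGGCAATCGGATCGATCGCTACGATCG" * 3)
print(f"estimated kmer Jaccard similarity: {jaccard_estimate(sketch(a), sketch(b)):.2f}")
```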
Genome-resolved metagenomes
The ability to re-assemble complete, high-quality microbial genomes from shotgun sequence data is arguably the apotheosis of computational metagenomic analysis, as it obviates the need to isolate and culture organisms in order to understand their genomic potential. Genomes that may never be cultured can be retrieved, their phylogeny can be established and taxonomy inferred (Almeida et al. 2019; Almeida et al. 2020), and their functions predicted through genome annotation. Ultimately, these genomes can be added to public databases (Almeida et al. 2020; Mukherjee et al. 2021), leading to the improved performance of other, reference-dependent analysis tools (Milanese et al. 2019).
Full or partially assembled genomes derived from mWGS data are now commonly referred to as metagenome-assembled genomes (MAGs). They were first generated from shotgun sequencing of biofilms by Tyson et al. (2004), who assembled 103,462 Sanger reads (76.2 Mb), then binned the resulting contigs into genomes based on a combination of their coverage and GC content. In another landmark study, Nielsen et al. (2014) analysed 396 human stool samples (23.2 billion reads, 4.5 Gb) as part of the MetaHIT consortium. They used a canopy clustering approach to enable the rapid binning of assembled microbial genes based on their co-abundance across samples. This resulted in the detection of 784 metagenomic species (defined as bins with > 700 genes) and also demonstrated the potential for MAG approaches to identify bacteriophages. More recently, massive efforts have been made to reconstruct genomes from publicly available metagenomic sequence datasets: Nayfach et al. (2019), Pasolli et al. (2019), and Almeida et al. (2019) analysed 3810, 9428, and 11,850 metagenome samples, respectively, which have collectively contributed to a novel reference catalogue of 204,938 MAGs (Almeida et al. 2020).
Fundamental steps for generating MAGs are the production of high-quality de novo assemblies from mWGS reads (through use of tools such as metaSPAdes, Nurk et al. 2017, and MEGAHIT, Li et al. 2015, reviewed in Ayling et al. 2020), followed by the accurate binning of contigs originating from the same genome. The latter step is frequently performed by comparing the coverage of contigs, as well as genome-level sequence properties such as GC content or tetranucleotide frequency (reviewed in Kang et al. 2016). A widely used exemplar approach for binning is MetaBAT (Kang et al. 2015, 2019), which employs pairwise comparisons of contigs based on abundance and tetranucleotide frequencies (TNF), followed by a graph-based clustering approach (Kang et al. 2019) to identify MAGs from one or more samples.
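As a toy illustration of composition-based binning (not MetaBAT itself), the sketch below represents each contig by its tetranucleotide frequency vector plus mean coverage and clusters contigs with k-means; contig sequences and coverages are invented, and real binners normalise and weight these features far more carefully.

```python
# Cluster contigs into bins using tetranucleotide frequencies (TNF) and coverage.
# Illustrative only: real binning tools use more robust distances and normalisation.
from itertools import product
import numpy as np
from sklearn.cluster import KMeans

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
TETRAMER_INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def tnf(seq):
    """Normalised counts of overlapping tetranucleotides in a contig."""
    counts = np.zeros(len(TETRAMERS))
    for i in range(len(seq) - 3):
        counts[TETRAMER_INDEX[seq[i:i + 4]]] += 1
    return counts / max(counts.sum(), 1.0)

contigs = {  # hypothetical contigs: (sequence, mean read coverage)
    "contig_1": ("ATGCGTACGT" * 30, 12.0),
    "contig_2": ("ATGCGTACGA" * 30, 11.5),
    "contig_3": ("GGGGCCCCTT" * 30, 55.0),
}

features = np.array([np.append(tnf(seq), cov) for seq, cov in contigs.values()])
bins = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
for name, label in zip(contigs, bins):
    print(name, "-> bin", label)
```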
The potential for MAGs to extend knowledge of the metagenome beyond reference databases is well illustrated by recent large-scale studies. For example, Pasolli et al. (2019) used MASH to establish pairwise genetic distances between 154,723 MAGs from different human body sites and 80,990 bacterial genomes from reference databases. Clustering these genomes at a 5% threshold resulted in an estimated 4,930 species, with 3,796 (77%) of these species clusters containing no previously known reference genome. Notably, novel MAGs are found at much greater frequency in non-westernised gut microbiomes (Nayfach et al. 2019; Pasolli et al. 2019), supporting the observation that genomic biases exist as much for the microbial proportion of the holobiont as they do for the host (Almeida et al. 2019; Choudhury et al. 2020). Further efforts to improve metagenomic discovery of microbial species are therefore likely to particularly benefit understanding of the microbiome contribution to host health in these populations.
Another area in which the ability to fully resolve genomes from metagenomes offers great potential is the detection and characterisation of the non-bacterial component of the gut microbiome. This is well illustrated by recent studies of crAssphage, where mining publicly available metagenome assemblies for circular metagenome-assembled genomes (cMAGs) led to the discovery of 596 crAssphage genomes, which could be clustered into approximately 221 viral ‘species’ (Yutin et al. 2021). These viruses have subsequently been shown to be globally present in the human gut, where they dominate the gut virome, and to have close biological links with the genus Bacteroides (Edwards et al. 2019; Yutin et al. 2021), which is itself a keystone taxon within the gut ecosystem. Recent success in the detection and characterisation of crAssphage is undoubtedly aided by their high relative abundance compared to other viral clades. Nonetheless, comparable metagenomic discovery of other microbes, from viruses to eukaryotes, remains a prospect for future studies.
While the potential for genome-resolved metagenomics is great, recent reviews have highlighted the challenges in this field, and in particular, the difficulties of producing high-quality assemblies of complex microbial communities where strain-level divergence may be important (Chen et al. 2020). The appearance of incompletely resolved, composite MAGs in public databases has already been reported (Shaiber and Eren 2019), and ensuring accurate and high-quality genome discovery remains a key bioinformatic challenge for this emerging field (Bowers et al. 2017).