Introduction

Recently, an increasing number of studies have investigated the human microbiome (bacteria, archaea, microbial eukaryotes, fungi, and viruses), particularly that of the gut, and its involvement in human disease (Lynch and Pedersen 2016), including metabolic (Boulangé et al. 2016), autoimmune (Proal et al. 2009), and neuropsychiatric (Kang et al. 2013) disorders. Indeed, the human gut microbiome is involved in many host functions, such as the production of enzymes that help food digestion (Bhattacharya et al. 2015), the synthesis of vitamins (e.g., biotin – vitamin B7) and other key compounds (e.g., gamma-aminobutyric acid, Barrett et al. 2012), and the development of the host immune system (Thaiss et al. 2016). Perturbation of the gut microbiota (dysbiosis) has been associated with many diseases, as in Clostridium difficile infection, which is associated with a reduction of gut microbial diversity, often resulting from antibiotic use (De La Cochetière et al. 2008). The intestinal microbiota composition can also influence drug action. For instance, Dubin et al. (2016) showed that patients with metastatic melanoma whose gut microbiota contained a higher proportion of bacteria from the phylum Bacteroidetes were less likely to develop colitis following treatment with ipilimumab.

The manipulation of the human gut microbiome has been suggested as a potential therapeutic option for different human diseases. Human faecal microbiota transplant (FMT), which involves the transfer of faeces from a healthy donor, has been a very successful treatment for C. difficile infections (Eiseman et al. 1958; Gough et al. 2011; van Nood et al. 2013), and it seems to be a promising approach to treat other diseases (e.g., ulcerative colitis, Anderson et al. 2012; insulin sensitivity, Vrieze et al. 2012). Another way to modify the gut microbiome is through the administration of probiotics (suspensions of live microorganisms, AlFaleh et al. 2012) and prebiotics (substances supporting resident beneficial microorganisms; Underwood et al. 2009; Panigrahi et al. 2017).

Advances in DNA sequencing, triggered by the development of high-throughput sequencing technologies (or next generation sequencing—NGS), nowadays make it possible to study the diversity of microorganisms present in/on the human body in a routine and inexpensive way, and without the need for cell cultures. This allows the characterisation of microorganisms that are particularly hard or (so far) impossible to culture, and of previously unknown ones (Schloss and Handelsman 2005; Human Microbiome Jumpstart Reference Strains Consortium 2010; Vartoukian et al. 2010). Two main approaches are currently in use: marker gene amplification (metagenetics, Handelsman 2009) and whole genome shotgun sequencing (metagenomics, Almeida and Pop 2015). Metagenetics approaches use polymerase chain reaction (PCR) amplification of certain phylum-specific genes (e.g., 16S ribosomal RNA (rRNA) for bacteria and archaea and 18S rRNA for fungi), followed by their sequencing. However, uncertainty exists in the accuracy of annotations at the genus and species levels, especially for those organisms that have not yet been well characterised. Metagenetics approaches are cost-effective and have been widely used for studying the association between microbiome abundance and several human traits or diseases (Shreiner et al. 2015). Metagenomics approaches, by sequencing the whole genome of all the microorganisms present in a sample, are about tenfold more costly but can potentially resolve taxonomic profiles down to the strain level, thus allowing a deeper understanding of the physiology and ecology of the microbial community. In 2008, the Human Microbiome Project (Methé et al. 2012) and the Metagenome of the Human Intestinal Tract (MetaHIT) study (Qin et al. 2010) started to characterise and generate the reference genomes of bacterial strains commonly found in association with both healthy and diseased individuals. Along with DNA sequencing, microarray biochips specific for the human microbiome, also known as phylogenetic microarrays or phylochips (Walker 2016), are nowadays available for the relative quantification of known microbiota.

Neither metagenomics nor metagenetics approaches can provide information on which microbial pathways are actually active. Indeed, DNA present in a sample could come from resident metabolising organisms, partially quiescent cells, host cells, viruses, spores, or dead microbiota. Recently, three other high-throughput technologies have emerged: (1) meta-transcriptomics, which measures the expression of RNA molecules through RNA-seq (Bashiardes et al. 2016); (2) meta-proteomics, which measures the microbial protein levels (Grassl et al. 2016); and (3) meta-metabolomics, which measures the microbial metabolite levels (Zhang and Davies 2016). These technologies, combined with metagenomics, allow a better characterisation of the physiological behaviours and dynamics of the microbial community, and of its role in human health and disease.

Metagenomics is a novel and rapidly developing discipline. Therefore, standardised protocols are currently lacking, especially for the data processing and analysis, which require high computational resources and bioinformatics expertise. In this review, we will discuss best practices for the implementation of a metagenomics project, summarised in Fig. 1, with an emphasis on quality control, a critical step often poorly described in the literature.

Fig. 1 Metagenomics analysis pipeline. Hexagons represent the analysis steps. Rectangles and parallelepipeds denote the output data and reports, respectively. Cylinders represent additional data to be provided in input

Sample collection and storage

Samples can be collected at virtually any body site, the most studied being the gut. Collection approaches, microbiota composition, and biomass (the overall microbiota quantity) differ markedly between sites (Costello et al. 2009). While metagenetics approaches allow inference of the taxonomic profile using small amounts of DNA, metagenomics studies require larger amounts in order to obtain a reasonable coverage of all the microbial genomes present. For instance, in the comparative study performed by Ranjan et al. (2016), the authors used 50 ng of microbial DNA for the 16S rRNA amplicon library preparation but 5 μg for the shotgun whole-metagenome one. Although several efforts have been made to improve sequencing using smaller quantities of DNA, particularly for samples with low biomass (e.g., the skin metagenome), a reduced DNA quantity can affect the inferred microbiome composition (Bowers et al. 2015).

The protocols used to select, store, prepare, and sequence the samples should be consistent throughout the project to avoid the introduction of unwanted technical variability that would be difficult to remove afterwards. For instance, Sinha et al. (2016) compared different faecal sample collection methods and concluded that all of them showed high reproducibility, although the sampling method affected the observed microbiota variability. Shaw et al. (2016) investigated the effect of sample storage and preparation, concluding that neither the duration of long-term freezing at −80 °C nor storage at room temperature for less than 2 days significantly affected the microbial community composition, thus suggesting that samples should be shipped on the day of collection and then processed or frozen at −80 °C within 2 days. However, Amir et al. (2017) showed that room temperature storage is associated with a bloom of certain bacteria, which alters the taxonomic profile. Also, Choo et al. (2015) showed that while refrigeration at 4 °C does not significantly alter faecal microbiota diversity or composition, other preservative buffers (namely RNAlater, OMNIgene.GUT, and Tris-EDTA) do.

Extensive clinical and demographic data should be collected along with the specimen sample (Methé et al. 2012). Indeed, several factors, including the geographical location where the subjects live, their body mass index, and their age, have been observed to play a role in the composition of the microbial community (Yatsunenko et al. 2012; Zhernakova et al. 2016). Stool consistency (measured by the Bristol Stool Chart (Lewis and Heaton 1997) and considered a proxy for intestinal transit time) should also be recorded, since it has been associated with species richness and community composition (Tigchelaar et al. 2016; Vandeputte et al. 2016). Koren et al. (2012) also observed dramatic changes in the third trimester of pregnancy, when the gut microbiome resembles that of subjects affected by metabolic syndromes. Diet is another important factor. While long-term dietary habits are firmly associated with the microbial composition (Wu et al. 2012), it has been shown that the short-term consumption of exclusively plant- or animal-based diets (David et al. 2014) and the dietary fluctuations between seasons (Davenport et al. 2014; Smits et al. 2017) can alter the microbial community structure. When collecting infant samples, the type of delivery and whether the baby has been breast- or bottle-fed should be taken into account (Bäckhed et al. 2015), while, when collecting samples from the female reproductive tract, the age of menarche, the number of pregnancies, the menopausal status, and the type and duration of hormonal drug intake should also be recorded (Markle et al. 2013). Short-term exposure to antibiotics alters both the bacterial physiology and the microbial community structure (Maurice et al. 2013), as do many other drugs, such as metformin (Forslund et al. 2015), analgesics (Pumbwe et al. 2007), and proton pump inhibitors (Jackson et al. 2016). Finally, in the design of a case-control study, cases and controls should be carefully matched for any variable that may affect the microbiome composition.

Several reviews have been written on general experimental design (e.g., Kreutz and Timmer 2009), how to choose study samples and controls (e.g., Goodrich et al. 2014), and how to select, preserve, and prepare samples before sequencing (e.g., Lauber et al. 2010; Dominianni et al. 2014; Choo et al. 2015; Voigt et al. 2015). Although the majority of them focus on metagenetics data, their guidelines are also useful for the design of metagenomics experiments.

DNA extraction and library preparation

Prior to sequencing, particular attention should be paid to the DNA extraction and library preparation. For instance, both the sequencing temperature and the DNA extraction procedure can make it difficult to sequence, within the same experiment, organisms that are characterised by different GC content and/or cell membrane composition (Bohlin et al. 2010; Peabody et al. 2015; Bag et al. 2016). Indeed, microbial cell membranes are highly heterogeneous, and different lysis methods can extract different amounts of DNA from different species, thus generating spurious differences in their abundances when assessed from sequencing data (Bag et al. 2016). This has been confirmed by a recent analysis comparing 21 DNA extraction protocols on the same faecal samples, which showed that the DNA extraction step has a large effect on the quantification of the microbial community, therefore jeopardising the comparability and replicability of research findings (Costea et al. 2017).

During the library preparation step, adapter sequences (i.e., short synthetic oligonucleotides, often platform-specific) are ligated to the 5' and 3' ends of each DNA fragment, which is then often amplified. Adapters include a PCR primer binding site for amplification, and, possibly, a barcode used when multiple samples are sequenced together on the same lane. Library preparation can also affect the abundance of some DNA sequences, and, therefore, the inferred microbiome community composition, mostly because of differential efficiency in their amplification (Aird et al. 2011). For instance, it has been observed that GC-rich DNA sequences are more difficult to amplify, and that the higher the GC content the lower the probability of an amplification bias (Jones et al. 2015). Thus, the adoption of clonal-free and PCR-optimised (Aird et al. 2011) or PCR-free (Jones et al. 2015) approaches has been recommended. Moreover, this effect has been shown to be stronger in low-biomass samples. Indeed, the amount of starting material influences the overall read quality, which improves as the quantity of input DNA increases (Bowers et al. 2015).

Sequencing technologies

Since the advent of the Sanger platform, sequencing technologies have constantly evolved, allowing a steady decrease in sequencing costs. Sanger sequencing (Sanger and Coulson 1975) generates long reads (> 700 bp) with a low sequencing error rate (less than 0.1%). Its high per-base cost (more than 6,000 USD per Gb) and the complex and long sample preparation make its use difficult in routine clinical settings, and, nowadays, Sanger sequencing is mainly used to validate findings from NGS studies. NGS technologies have a much lower per-base cost than Sanger (50–500 USD per Gb), but a higher sequencing error rate (approximately 0.1–1%; Ronaghi 2001; Morozova and Marra 2008). NGS technologies allow sequencing of one (single-) or both (paired-) ends of a DNA fragment, the latter being more precise but also slightly more expensive, and requiring the sequencing of only half the number of fragments for the same genomic coverage. Currently, 100 to 150 bp-long paired reads generated using the Illumina HiSeq 2500 and HiSeq 4000 are considered the standard for metagenomics studies. However, it has been observed that short-sequence libraries (< 200 bp) may alter the phylogenetic and functional characterisation of microbial communities (Wommack et al. 2008; Carr and Borenstein 2014). This potential alteration is due to the high sequence homology among certain microorganisms, which may lead to misclassification (Koonin and Galperin 2003; Janda and Abbott 2007). To overcome this issue, one could increase the sequencing coverage (i.e., the number of times each DNA fragment, and, thus, the genome, is sequenced), or the read length. However, it should be kept in mind that, while increasing the read depth increases the number of taxa detected, it may also augment spurious assignments and artefacts (Jovel et al. 2016). Single-molecule sequencing generates longer reads (1,000–10,000 bp), which can facilitate the assembly of new genomes and the identification of novel bacterial species and strains (Koren et al. 2013; Kuleshov et al. 2016). Nanopore-based single-molecule sequencing also offers cloud-based bioinformatic analyses from anywhere and in real time, allowing the detection of specific microbiota in less than 6 h from sample collection (Greninger et al. 2015). Therefore, data can be generated and analysed in the field, with potential for clinical diagnostics and public health—as seen in the West Africa Ebola (Quick et al. 2016) and Brazilian Zika (Quick et al. 2017) outbreaks. Single-molecule sequencing, however, has lower sequencing depth, higher costs (around 2,000 USD per Gb), and higher sequencing error rates (10–20%, Brown et al. 2017). Several authors have proposed combining reads from both NGS and single-molecule sequencing to overcome each other's limitations (Frank et al. 2016; Mende et al. 2016).

Controlling batch effects

Batch effects are technical sources of variation, common in large-scale high-throughput experiments, which are unrelated to the biological and scientific variables under study. Many sources of technical variability can be easily avoided, for example by adopting homogeneous sampling and storage methods, DNA extraction, or library preparation protocols. However, batch effects may be generated when samples are divided into different sequencing groups, which are often run at different dates (Leek et al. 2010). To avoid spurious associations, samples should be randomised across the batches for any of the variables under study and any potential confounder/covariate (Taub et al. 2010). For example, it is important to avoid enriching sequencing batches for female/male samples, or for older/younger subjects, and, if studying a disease, it is essential to have balanced proportions of cases and controls across batches. Finally, it is preferable to avoid sequencing in multiple small batches each including few samples, and, when possible, to sequence all the batches within a limited period of time. Reproducibility of sequencing data (and of the results) can be further improved by adding both negative and positive controls (e.g., synthetic communities) to each sequencing batch or sample. This will help in calibrating measurements and normalising data, allowing the comparison among samples by adjusting for individual technical variability (Leek et al. 2010; Jones et al. 2015).
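
To make the randomisation step concrete, the following Python sketch (purely illustrative; the sample labels, the number of batches, and the single case/control stratification variable are assumptions, and a real design should balance all relevant covariates) shuffles samples within each stratum and assigns them round-robin to sequencing batches, so that cases and controls end up evenly spread across batches.

```python
import random
from collections import defaultdict

def stratified_batches(samples, n_batches, seed=42):
    """Assign samples to sequencing batches so that each stratum
    (e.g., case/control status) is spread evenly across batches.

    `samples` maps a sample ID to its stratum label."""
    random.seed(seed)
    strata = defaultdict(list)
    for sample_id, stratum in samples.items():
        strata[stratum].append(sample_id)
    batches = defaultdict(list)
    for stratum_samples in strata.values():
        random.shuffle(stratum_samples)               # random order within the stratum
        for i, sample_id in enumerate(stratum_samples):
            batches[i % n_batches].append(sample_id)  # round-robin assignment
    return dict(batches)

# hypothetical study with 6 cases and 6 controls split into 2 batches
samples = {f"case_{i}": "case" for i in range(6)}
samples.update({f"ctrl_{i}": "control" for i in range(6)})
print(stratified_batches(samples, n_batches=2))  # each batch receives 3 cases and 3 controls
```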

Quality control

Quality control (QC) is crucial for generating high-quality data: it identifies and removes low-quality biological samples and/or reads and technical artefacts (Leek et al. 2010), and improves the read mapping to reference databases, the quality of de novo assemblies, and the accuracy of the microbial diversity and abundance estimation (Bokulich et al. 2013; Zhou et al. 2014).

Visualisation of QC metrics

Researchers can take advantage of several tools (even if not specifically designed for metagenomics data) to obtain a preliminary overview of read quality and to choose the parameters to be used in the following QC steps. It is also important to examine the QC metrics after each QC step, in order to evaluate their effectiveness in generating high-quality QC'ed reads.

Close attention should also be paid to k-mers, i.e., all the possible sub-sequences of length k contained in the reads, as they can highlight low-complexity or repeated sequences (Plaza Onate et al. 2015). Although no specific guideline has currently been defined, if the k-mer distribution is not uniform and a set of k-mers is found in more than 1% of all reads, an in-depth investigation should be performed to understand their origin.
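
As an illustration of this kind of check (not a replacement for the dedicated tools mentioned below; the toy reads, the value of k, and the 1% threshold are arbitrary assumptions), the Python sketch below counts in how many reads each k-mer occurs and reports the over-represented ones. Over-represented k-mers can then be inspected, e.g., compared against known adapter or low-complexity sequences, to understand their origin.

```python
from collections import Counter

def overrepresented_kmers(reads, k=16, frequency_threshold=0.01):
    """Count in how many reads each k-mer occurs (at most once per read)
    and return those present in more than `frequency_threshold` of reads."""
    reads_per_kmer = Counter()
    for read in reads:
        # a set ensures each k-mer is counted at most once per read
        reads_per_kmer.update({read[i:i + k] for i in range(len(read) - k + 1)})
    n_reads = len(reads)
    return {kmer: count / n_reads
            for kmer, count in reads_per_kmer.items()
            if count / n_reads > frequency_threshold}

# toy example with three short made-up reads and k = 8
reads = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTTTTTTTTT", "GGGGCCCCACGTACGTACGT"]
flagged = overrepresented_kmers(reads, k=8)
for kmer, freq in sorted(flagged.items(), key=lambda item: -item[1])[:5]:
    print(f"{kmer}\t{freq:.2f}")
```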

FastQC (Andrews 2010) is an excellent tool for assessing and visualising sequence quality.

De-duplication

Identical duplicated reads are usually considered technical artefacts (Xu et al. 2012), since they are often the result of sequencing multiple copies of the same DNA fragment amplified during the PCR step (artificial duplicates; Ebbert et al. 2016). Since in metagenomics the number of reads is used as an abundance measure, artificial duplicates may cause overestimation of the abundance of taxa, genes, and functions. On the other hand, natural duplicates (reads deriving from either the same region of different microbial clones, or from regions shared within/between multiple organisms, such as ortholog and paralog regions, and regions of DNA horizontal transfer) may also be present, and their removal may lead to an underestimation of abundances (Niu et al. 2010). As a consequence, de-duplication should not be performed when using a PCR-free library as, in that case, all the duplicates will be natural duplicates. Also, duplicates should be removed before quality trimming, as trimming modifies the read sequence, potentially masking true duplicates or generating false ones.
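
For illustration only, the sketch below removes exact duplicates from single-end reads by keeping the first occurrence of each sequence; the record format is an assumption, and the dedicated tools discussed below handle paired-end reads, near-duplicates, and memory usage far more efficiently.

```python
def deduplicate_exact(records):
    """Keep the first occurrence of each identical read sequence.

    `records` is an iterable of (header, sequence, quality) tuples;
    reads whose sequence has already been seen are discarded as
    putative PCR (artificial) duplicates."""
    seen = set()
    for header, sequence, quality in records:
        if sequence not in seen:
            seen.add(sequence)
            yield header, sequence, quality

# toy example
records = [
    ("@r1", "ACGTACGT", "IIIIIIII"),
    ("@r2", "ACGTACGT", "IIIIIIHH"),  # exact duplicate of r1, removed
    ("@r3", "TTGCACGT", "IIIIIIII"),
]
print(list(deduplicate_exact(records)))
```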

De-duplication has only recently been included as a QC step in metagenomics studies. The first shotgun metagenomic projects (e.g., Qin et al. 2010) did not perform de-duplication and the CD-HIT-DUP module in CD-HIT (Li and Godzik 2006), used by the Human Microbiome Project, was the first method developed to manage duplicated reads without mapping on reference genomes.

Several tools are now available for removing exact duplicates without mapping on reference genomes: FASTX-Toolkit (Gordon and Hannon 2010) and Fulcrum (Burriesci et al. 2012) remove duplicates only in single-end reads, whereas FastUniq (Xu et al. 2012), the CD-HIT-DUP module in CD-HIT (Li and Godzik 2006), and the clumpify tool in the BBTools suite (Bushnell 2015) can deal with paired-end reads as well. When using pyrosequencing reads, the Cdhit-454 software (Niu et al. 2010), along with finding exact and nearly exact duplicates, also estimates the number of natural duplicates based on the type/origin of the sample and its complexity.

Trimming

The quality of the reads is affected by the sample preparation and by the precision of the sequencing instrument, leading to sequencing errors or low-confidence calls that can, in turn, influence the estimation of microbial diversity and taxonomy (Bokulich et al. 2013). The trimming step identifies and removes bases that have been called with a low quality, as measured by the PhRED quality score. We suggest keeping all the bases with a PhRED quality score > 10, representing a base call accuracy of 90% (i.e., a probability of one in ten that the base was called incorrectly). Besides removing the low-quality bases, the trimming step also removes adapter sequences. Reads that become too short after trimming should be removed: in fact, short reads have a low sequence complexity (evaluated as 4^N, where N is the read length) and may map on multiple genomic regions or genomes. It is recommended to remove all the reads that are shorter than 60 bp (corresponding to a complexity of 4^60 or less, Wommack et al. 2008). When applied to paired-end reads, the trimming step can produce singleton reads, i.e., reads whose mate has been removed. It is recommended to keep these singleton reads in order to retain as much information as possible. However, some bioinformatics tools cannot deal with singleton reads.
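
The sketch below illustrates these two operations (3' quality trimming with a PhRED threshold of 10 and a 60 bp length filter) on a single read; it assumes Illumina 1.8+ quality encoding (PhRED+33) and is not meant to replace the dedicated tools listed below.

```python
def trim_read(sequence, quality, min_phred=10, min_length=60, phred_offset=33):
    """Trim low-quality bases from the 3' end and discard short reads.

    Bases are removed from the 3' end while their PhRED score is below
    `min_phred`; reads shorter than `min_length` after trimming are
    discarded (None is returned)."""
    phred = [ord(char) - phred_offset for char in quality]
    end = len(sequence)
    while end > 0 and phred[end - 1] < min_phred:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality[:end]

# toy 82 bp read whose last two bases have PhRED score 2 ('#') and are trimmed
trimmed = trim_read("ACGT" * 20 + "AA", "I" * 80 + "##")
print(len(trimmed[0]) if trimmed else "read discarded")  # 80
```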

A plethora of software is available to trim adapters and low-quality bases: FASTX-Toolkit (Gordon and Hannon 2010), PRINSEQ (Schmieder and Edwards 2011b), Trim Galore! (Krueger 2012) (a wrapper around cutadapt (Martin 2011)), Trimmomatic (Bolger et al. 2014), ngsShoRT (Chen et al. 2014), BBDuk (Bushnell 2015), and AfterQC (Chen et al. 2017).

Decontamination

The last QC step is the identification and removal of contamination, i.e., of reads that do not belong to the studied ecosystem and that can cause misassembly of sequence contigs and/or erroneous read mapping on reference databases, thus hampering the downstream analyses.

Contaminating reads often originate from the host's genome, but they can also derive from cross-contamination during sample preparation and sequencing (e.g., from other DNA samples sequenced at the same time or from DNA present in the reagents; Salter et al. 2014), and represent a particular problem for samples with a low biomass (e.g., the skin). It can be challenging to identify sources of contamination without a priori knowledge. It is always recommended to remove contaminating reads deriving from the human genome, even when a chemical removal agent was used beforehand (Qin et al. 2010; Methé et al. 2012). It should be kept in mind that many low-complexity sequences and certain features (e.g., ribosomes) are highly conserved among species and should eventually be removed from the custom contamination database in order to avoid false positive matches.

Blooming bacteria (i.e., bacteria that grow quickly at room temperature) may represent another source of contamination. However, while removing blooms will decrease the noise caused by inhomogeneous storage temperatures, it may also hinder the identification of actual associations (Amir et al. 2017).

Tools such as FastQ Screen (Wingett 2011) and Meta-QC-Chain (Zhou et al. 2014) can help detect potential contaminating genomes. The reference genomes of known or inferred contaminating organisms can then be used to create custom databases guiding the decontamination of metagenomics data. Once the contamination database is defined, tools such as Deconseq (Schmieder and Edwards 2011a), ProDeGe (Tennessen et al. 2015), KneadData (The Huttenhower Lab 2017), and multiple tools in the BBTools suite (BBMap, BBwrap, BBsplit, Bushnell 2015) can be used to remove contaminants.
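
As a conceptual sketch of how a custom contaminant database can be used (real decontamination should rely on full read alignment with the tools above; the k-mer length, the threshold, and the toy sequences are assumptions), the code below flags reads that share a large fraction of their k-mers with a k-mer set built from contaminant reference sequences, e.g., the human genome.

```python
def build_kmer_set(sequences, k=21):
    """Collect all k-mers from a set of contaminant reference sequences."""
    kmers = set()
    for seq in sequences:
        kmers.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return kmers

def is_contaminant(read, contaminant_kmers, k=21, min_shared_fraction=0.5):
    """Flag a read as contamination if at least `min_shared_fraction`
    of its k-mers are found in the contaminant k-mer set."""
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        return False
    shared = sum(kmer in contaminant_kmers for kmer in read_kmers)
    return shared / len(read_kmers) >= min_shared_fraction

# toy example: in a real run the contaminant sequences would be the human reference genome
contaminant_kmers = build_kmer_set(["ACGTACGTACGTACGTACGTACGTACGT"])
print(is_contaminant("ACGTACGTACGTACGTACGTACGTA", contaminant_kmers))  # True
print(is_contaminant("TTTTGGGGCCCCAAAATTTTGGGGC", contaminant_kmers))  # False
```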

Taxonomic binning and profiling

Short reads, conserved sequences, and the absence/incompleteness of many reference genomes are some of the factors that hamper the complete assembly of a metagenomic sample. To help assembly and other analyses, such as functional annotation, the reads are clustered according to their sequence similarity and/or composition, and assigned to specific taxa or operational taxonomic units (OTUs), a procedure known as taxonomic binning. The identified clusters can then be used to guide the assembly, and to characterise the microbiome composition and functional profile.

Taxonomic binning is performed using either similarity-based or composition-based approaches (Droge et al. 2012). Similarity-based approaches (e.g., BLAST (Altschul et al. 1990), MEGAN (Huson et al. 2007), IMG/M (Markowitz et al. 2013), MG-RAST (Wilke et al. 2016)) assess the local similarity of the sequences with those in reference databases and try to assign them to taxa. This information can be retrieved from general databases, such as NCBI RefSeq (O'Leary et al. 2015), the Kyoto Encyclopedia of Genes and Genomes (KEGG, Kanehisa et al. 2016), and eggNOG (Huerta-Cepas et al. 2016), or from specific microbial databases, such as the Gene Catalog from MetaHIT (Qin et al. 2010) and MEDUSA (Karlsson et al. 2014). Although similarity-based approaches generate binnings that are highly accurate and specific, they cannot bin sequences from organisms whose genomic sequence is unavailable or that are conserved among multiple closely related organisms. Consequently, many reads may remain unassigned. Composition-based approaches (e.g., CD-HIT (Li and Godzik 2006), mOTU (Sunagawa et al. 2013), Kraken (Wood and Salzberg 2014), MetaPhlAn2 (Truong et al. 2015), and CLARK (Ounit et al. 2015)) evaluate the similarity of the sequences to compositional signatures, such as clade-specific markers, codon usage (i.e., the frequency of occurrence of synonymous codons in the genome), GC content, and k-mer composition, using reference databases. Since composition-based approaches do not perform intensive read alignment, they are much faster than similarity-based approaches. However, they are affected by the non-uniform representation of the various taxonomic groups in existing reference databases, and by the fact that the compositional signatures are usually derived only from short, specific sequences, and not from whole bacterial genomes. A mixed approach is used by SPHINX (Mohammed et al. 2011), which, for each read, first employs a composition-based binning algorithm to identify a subset of sequences from the reference database, and then limits the similarity-based search to this subset.
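
To make the idea of a compositional signature concrete (this is purely illustrative and not the algorithm of any of the tools above), the sketch below computes the tetranucleotide (k = 4) frequency vector of a sequence; comparing or clustering such vectors against those of reference genomes is the essence of composition-based binning.

```python
from itertools import product

def tetranucleotide_signature(sequence, k=4):
    """Return the normalised k-mer (default tetranucleotide) frequency
    vector of a sequence, a common compositional signature."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(all_kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases (e.g., N)
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total for kmer in all_kmers] if total else None

signature = tetranucleotide_signature("ACGTACGTGGCCAATTACGTACGT")
print(len(signature), round(sum(signature), 6))  # 256 frequencies summing to 1
```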

The microbiome composition, or taxonomic profile, can be computed in absolute terms, i.e., reporting the number of reads mapping to each detected microorganism, or in relative terms, i.e., the abundance of each microorganism relative to the rest of the microbial community. Several tools have been developed to estimate microbial abundances, including GRAMMy (Xia et al. 2011), Kraken (Wood and Salzberg 2014), and ConStrains (Luo et al. 2015). An efficient approach for taxonomic profiling has been implemented in MetaPhlAn2 (Truong et al. 2015), which uses clade-specific markers both to detect the organisms present in a microbiome sample and to estimate their relative abundance, thus performing binning and profiling at the same time.
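
As a minimal example of the difference between the two representations (the taxa and read counts are invented), the sketch below converts absolute read counts per taxon into relative abundances.

```python
def relative_abundance(read_counts):
    """Convert absolute read counts per taxon into relative abundances
    (fractions of the total number of assigned reads)."""
    total = sum(read_counts.values())
    return {taxon: count / total for taxon, count in read_counts.items()}

# hypothetical read counts for a single sample
counts = {"Bacteroides": 12000, "Prevotella": 3000, "Faecalibacterium": 5000}
print(relative_abundance(counts))
# {'Bacteroides': 0.6, 'Prevotella': 0.15, 'Faecalibacterium': 0.25}
```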

Assembly

Sequence assembly is the process of aligning and merging reads with the aim of reconstructing the original genomic sequence, and it is a major step for determining the sequence of novel bacterial genomes. For instance, the Human Microbiome Project reconstructed the reference genome sequences of at least 900 bacteria belonging to the human microbiome (Human Microbiome Jumpstart Reference Strains Consortium 2010). Metagenomic datasets are composed of a mixture of reads belonging to multiple organisms, with different levels of taxonomic relatedness to each other, and most assemblers, designed to assemble single clonal genomes, are not able to handle these complex pan-genomic mixtures (Nielsen et al. 2014). For example, they may find it difficult to assign syntenic blocks (i.e., blocks of conserved sequence) to different organisms. Obtaining the complete assembly of a particular microbiota requires the complete coverage of its genome, which is often unfeasible. For instance, when Metsky et al. (2017) studied the Zika virus epidemic, the small number of detected reads (sequenced using the Illumina MiSeq) hindered the complete reconstruction of the Zika genome, and the authors had to apply two targeted enrichment approaches (multiplex PCR amplification and hybrid capture) to reconstruct the genome of the virus and identify its strains. Strain-specific variants represent another challenge for assembly. In fact, the rate of genetic variation between strains can be similar to the sequencing error rate, making their assembly difficult, especially at low genomic coverage.

Assembly can be either de novo or reference-based (Ghurye et al. 2016). De novo assembly combines reads into contiguous sequences without using a reference genome. Several reference-free families of methods have been proposed. Greedy approaches (e.g., TIGR, Sutton et al. 1995; phrap, de la Bastide and McCombie 2007) iteratively merge reads into contigs, greedily selecting those with maximum overlaps. Overlap-layout-consensus approaches (e.g., VICUNA, Yang et al. 2012; Omega, Haider et al. 2014) use the pairwise overlaps between reads to build a graph, which is then traversed to merge reads into contigs that are, finally, ordered and extended. Approaches based on de Bruijn graphs (e.g., MetaVelvet, Namiki et al. 2012; Afiahayati and Sakakibara 2015) split reads into overlapping k-mers, organising them in a de Bruijn graph structure (de Bruijn 1946; Compeau et al. 2011) based on their co-occurrence across reads. Analogously to overlap-layout-consensus methods, contigs are generated by traversing the resulting graph. The efficiency of de novo approaches decreases dramatically when the sequencing error rate increases. Greedy approaches are the fastest and most effective when there are no or few repeated elements (i.e., regions where the same genomic pattern occurs multiple times throughout the genome) and the coverage is low, while overlap-layout-consensus approaches should be preferred when a high sequencing error rate is observed. None of these methods is exempt from errors, and the resulting assembly is often extremely fragmented, also because of the incomplete coverage of the bacterial sequences. Reference-based assembly is guided by the known sequence of the organism itself; if this is not available, that of the phylogenetically closest organism may be carefully used instead (e.g., in MIRA, Chevreux et al. 1999; Newbler, Chaisson and Pevzner 2008; AMOS, Treangen 2011). Reference-based approaches are computationally faster than de novo ones and can potentially map both repeated regions and those with low read coverage. However, they cannot cope with new sequences, complex rearrangements (e.g., translocations and inversions), or large insertions and deletions. The de novo and reference-based approaches can be efficiently combined in order to improve each other's results (e.g., OSLay, Richter et al. 2007; E-RGA, Vezzi et al. 2011). Recently, Mende et al. (2016) proposed the use of single-cell NGS sequencing to improve the assembly of single microbial genomes from metagenomics data.
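
The sketch below illustrates, in a highly simplified form that ignores sequencing errors, reverse complements, and coverage information, the core idea behind de Bruijn graph assemblers: reads are decomposed into k-mers, and the prefix and suffix (k−1)-mers of each k-mer are connected; contigs then correspond to unambiguous paths through the resulting graph.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k=5):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers and an edge
    links the prefix and suffix (k-1)-mers of every k-mer in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# two overlapping toy reads drawn from the same (unknown) sequence
reads = ["ACGTTGCA", "GTTGCATT"]
for node, successors in sorted(de_bruijn_graph(reads).items()):
    print(node, "->", ", ".join(sorted(successors)))
```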

Functional annotation

The MetaHIT project estimated that each individual's intestinal tract hosts an average of 160 microbial species (Qin et al. 2010). However, there is no consensus yet, and the real number of species and strains harboured in the human gut may be several thousand. Following this result, Turnbaugh and Gordon (2009) explored whether a gut core microbiome, i.e., a group of microbes present in every human gut, existed. While they found it hard to define such a core microbiome, they identified a common core at the gene/functional level, thus highlighting the importance of studying the modification of the functional capabilities of the microbial community rather than the microbial diversity and abundance, the former being a better marker of human health.

The functional capabilities of the microbiome community can be assessed through homology-based mapping of metagenomics sequences to databases of orthologous genes or proteins with known function (Carr and Borenstein 2014). Mapping can be achieved using the Basic Local Alignment Search Tool (BLAST) suite, or faster BLAST-like tools (e.g., Bowtie2 (Langmead and Salzberg 2012) and DIAMOND (Buchfink et al. 2014) to query gene and protein databases, respectively). The functional annotation relies on a reference database, usually chosen among the universal protein reference (UniRef, Suzek et al. 2015), KEGG (Kanehisa et al. 2016), the Protein Family Annotations (PFAM, Finn et al. 2014), and the Gene Ontology (Ashburner et al. 2000) databases.

Among the several pipelines available for functional annotation, a useful approach has been implemented in the HUMAnN2 pipeline, which stratifies the community into known and unclassified organisms using MetaPhlAn2 and the ChocoPhlAn pan-genome database, and combines the results with those obtained through an organism-agnostic search on the UniRef proteomic database. HUMAnN2 can use both the UniRef90 and UniRef50 databases. While UniRef90 is, in general, the best option, as its clusters are more likely to be iso-functional and non-redundant, UniRef50 should be preferred when dealing with poorly characterised microbiomes. In fact, in the latter case, the less stringent clustering criteria might allow mapping a larger number of reads, although at a lower resolution. The number of reads mapping to genes and proteins is converted into coverage and abundance tables, and the MetaCyc database is used to assess pathway abundances.
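
As an illustration of this conversion (the gene lengths and read counts are invented, and this is a simplification rather than HUMAnN2's actual implementation), the sketch below normalises per-gene read counts by gene length into reads-per-kilobase (RPK) values, so that longer genes do not appear spuriously more abundant.

```python
def rpk_abundance(read_counts, gene_lengths_bp):
    """Normalise per-gene read counts by gene length (in kilobases),
    yielding reads-per-kilobase (RPK) abundances."""
    return {gene: count / (gene_lengths_bp[gene] / 1000.0)
            for gene, count in read_counts.items()}

# hypothetical counts and gene lengths for a single sample
counts = {"geneA": 300, "geneB": 300}
lengths = {"geneA": 1500, "geneB": 3000}  # bp
print(rpk_abundance(counts, lengths))  # geneA: 200.0, geneB: 100.0
```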

A different but also popular approach has been implemented in MOCAT2 (Kultima et al. 2016), which carries out a preliminary assembly of the QC'ed sequence reads into larger contigs, that are then used for gene prediction with either Prodigal (Hyatt et al. 2010) or MetaGeneMark (Zhu et al. 2010). The functional annotation is finally estimated using the eggNOG database and integrating information from 18 publicly available resources covering different functional properties (e.g., KEGG, SEED, and MetaCyc for metabolic pathways; ARDB, CARD, and Resfams for antibiotic resistance genes; etc.).

Methods based on read mapping (such as HUMAnN2) can achieve very high sensitivity, but cannot identify new genes. On the other hand, assembly-based methods (such as MOCAT2) can identify known genes and predict new ones, but might under-represent genes with low coverage, as multiple reads are needed to allow their assembly. Both methods are challenged by sequence homology, either through the spurious mapping of reads to multiple genes or through the generation of chimeric contigs during the assembly.

Gene prediction techniques (as implemented in Prodigal, MetaGeneMark, Orphelia (Hoff et al. 2009), MetaGeneAnnotator (Noguchi et al. 2008), and GeneMark (Besemer and Borodovsky 1999)) can be used to detect unknown genes from contigs.

Assessing diversity

The concept of ecological diversity was originally conceived by Whittaker and Whittaker (1972), who proposed three measures to compare the diversity between and within ecological environments: the α-, β-, and γ-diversities. The α-diversity measures the mean species diversity in a given ecosystem (e.g., in a metagenomics sample), while the γ-diversity measures the overall diversity across all the ecosystems under consideration (e.g., across all metagenomics samples). The β-diversity links α- and γ-diversity and measures the difference in species content between different ecosystems (e.g., between metagenomics samples). Whittaker defined β-diversity as "the extent of differentiation of communities along habitat gradients", i.e., the ratio between γ- and α-diversity (Whittaker 1960; Whittaker et al. 2001). Several metrics have been suggested for each of these diversity measures, and the most suitable α- and β-diversity metrics in metagenomics appear to be those that take into consideration the phylogenetic tree describing the evolutionary relationships among taxa, such as Faith's phylogenetic diversity metric for α-diversity (Faith 1992) and the weighted UniFrac for β-diversity (Chang et al. 2011; Lozupone et al. 2011). α- and β-diversities can be evaluated with QIIME (Caporaso et al. 2010).
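
To make these definitions concrete, the sketch below computes a simple, non-phylogenetic α-diversity measure (the Shannon index) for each sample, and Whittaker's original β-diversity as the ratio between the γ-diversity (total number of taxa observed across samples) and the mean per-sample richness; the abundance values are invented, and phylogeny-aware metrics such as Faith's phylogenetic diversity and UniFrac additionally require a phylogenetic tree.

```python
import math

def shannon_index(abundances):
    """Shannon α-diversity of one sample from its taxon abundances."""
    total = sum(abundances.values())
    proportions = [count / total for count in abundances.values() if count > 0]
    return -sum(p * math.log(p) for p in proportions)

def whittaker_beta(samples):
    """Whittaker's β-diversity: γ (total number of taxa observed across
    all samples) divided by the mean per-sample richness (α)."""
    gamma = len({taxon for sample in samples
                 for taxon, count in sample.items() if count > 0})
    mean_alpha = sum(len([c for c in sample.values() if c > 0])
                     for sample in samples) / len(samples)
    return gamma / mean_alpha

# two hypothetical samples (taxon -> read count)
s1 = {"Bacteroides": 50, "Prevotella": 30, "Roseburia": 20}
s2 = {"Bacteroides": 10, "Akkermansia": 60, "Roseburia": 30}
print(round(shannon_index(s1), 3), round(shannon_index(s2), 3))
print(whittaker_beta([s1, s2]))  # 4 taxa in total / mean richness of 3
```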

The MetaHIT project also showed that the classification of gut samples based on their microbial ecosystem (enterotypes) is not only driven by their microbial composition, but also by their functions (Arumugam et al. 2011). Therefore, gene counts and species richness should be taken into account to characterise samples according to their gene (and, thus, function) composition, which may better elucidate the microbial role in human health and disease.

Association studies

The abundances of taxa, gene/protein families, and pathways provide information about the bacterial community structure, diversity, and functions. These variables can be used to assess the association with diseases, quantitative phenotypes, or genomic features.

A rigorous quality control is particularly important to obtain reliable results from association studies and to avoid the identification of spurious associations. Low-quality samples and outliers should be identified and removed, and technical and biological variability controlled for.

In association studies, abundances can be either used as quantitative variables or recoded as presence/absence based on suitable thresholds. Moreover, abundances can be represented on a discrete scale by using the number of reads mapping to each feature, or on a truncated continuous scale by using relative abundances normalised on the total number of reads—thus being bounded between 0 and 1.

A peculiar feature of metagenomics data is the presence of many zeroes, as many organisms and functions can be either under-sampled or present only in a small subset of individuals. Consequently, their modelling requires suitable methods. In a recent review comparing several methods for the association analysis of read-count-based abundances, Jonsson et al. (2016) suggested the use of generalised linear models based on an over-dispersed Poisson or negative-binomial distribution, such as those used in the RNA-Seq software edgeR (Robinson et al. 2010) and DESeq2 (Love et al. 2014). However, other studies have suggested that these distributions do not adequately model zero-inflated data, because the presence of zeroes might not be due to low coverage but to the true absence of particular organisms from a large number of samples. Thus, zero-inflated negative-binomial or hurdle models (Mullahy 1986) are likely to better reflect the distributional properties of metagenomics data (Xu et al. 2015). The analysis of relative abundances also requires specific models, since these data are bounded between 0 and 1, which means that the effect of the explanatory variables might be non-linear and that the variance might decrease at the boundaries of the distribution. Some approaches rely on the arcsine square root transformation to stabilise the variance and normalise proportional data, such as the MaAsLiN multivariate statistical framework (Morgan et al. 2012). However, the use of this transformation has repeatedly been debated (e.g., Warton and Hui 2011). An alternative approach is to model the association by using zero-inflated Beta models (Ospina and Ferrari 2012).
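
As a minimal, purely illustrative sketch of the transformation-based approach (not MaAsLiN itself; the data are simulated, and a real analysis would include covariates and multiple-testing correction), the code below applies the arcsine square root transformation to the zero-inflated relative abundances of one taxon and tests their linear association with a continuous phenotype.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# simulated relative abundances of one taxon (bounded in [0, 1], zero-inflated)
n = 100
phenotype = rng.normal(size=n)                                       # e.g., a standardised trait
abundance = np.clip(0.05 + 0.02 * phenotype + rng.normal(0, 0.03, n), 0, 1)
abundance[rng.random(n) < 0.3] = 0.0                                 # excess zeroes

# arcsine square root transformation to stabilise the variance
transformed = np.arcsin(np.sqrt(abundance))

# simple linear association test between phenotype and transformed abundance
result = stats.linregress(phenotype, transformed)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.3g}")
```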

Studying the association of one organism at a time is a reductionist approach that ignores the interactions within the bacterial community. Machine learning methods can be applied to the bacterial community as a whole to reconstruct multi-category classifications that are then associated with the trait of interest (Statnikov et al. 2013), although they still need to be improved to consider the hierarchical nature of the data. Several multivariate testing methods taking into account the phylogeny among taxa (e.g., PERMANOVA, McArdle and Anderson 2001; MiRKAT, Zhao et al. 2015; aMiSPU, Wu et al. 2016) are becoming more popular as they have higher statistical power in aggregating multiple weak associations and can reduce the burden of multiple testing correction.

Finally, multi-level functional data, such as meta-transcriptomics, meta-proteomics, or meta-metabolomics, can be integrated into genome-scale metabolic models, to disentangle the metabolism of the microbial ecosystem and its interactions with the host (Orth and Thiele 2010; Shoaie and Nielsen 2014).

Computational resources

An accurate assessment of the required computational resources is essential to efficiently and successfully tackle metagenomics projects. Metagenomics data processing requires substantial CPU and memory resources, and a considerable amount of disk space. To help estimate computational requirements, we provide here figures derived from a simple metagenomics analysis pipeline (MAP), which has been developed to rapidly and efficiently analyse a large number of gut metagenomic samples from the TwinsUK cohort (http://twinsuk.ac.uk/). MAP, which is available at https://github.com/alesssia/MAP, implements the QC steps listed in this review and processes the raw sequences up to the generation of the microbiome abundances. Briefly, MAP performs the QC using multiple tools from the BBmap suite (Bushnell 2015) to remove exact duplicates, trim low-quality bases and adapters, and remove human contamination. Each QC step is followed by the visualisation of the data quality metrics, carried out using FastQC (Andrews 2010). MAP takes advantage of MetaPhlAn2 (Truong et al. 2015) for fast and efficient abundance profiling, while QIIME (Caporaso et al. 2010) is used to evaluate multiple diversity measures. Table 1 reports figures on RAM, disk occupation, and execution time obtained using MAP to process 842 compressed raw paired-end FASTQ files, obtained using the Illumina HiSeq 2500 platform, with an average of 26M reads per sample. Experiments were run on a High-Performance Computing (HPC) facility using four threads and limiting the available RAM to a maximum of 32 GB. First, it is worth noting that, while some of the implemented steps need less than 4 GB of RAM, others demand as much as 32 GB (Table 1). In fact, the more RAM is available, the more reads can be kept in working memory, and the less time is spent using the disk as virtual memory. At the same time, multiple CPUs allow for parallelisation, further speeding up the data processing. Further parallelisation can be achieved using an HPC facility, where several multi-core computers are aggregated. Attention should also be paid to disk occupation. For instance, each compressed file used for this experiment occupied 1–9 GB, while during the analysis this figure increased to 11–60 GB per sample (Table 1). At the end of the processing, deleting the temporary files released 6–35 GB of disk space per sample, and the compression of the QC'ed FASTQ files freed about 60% of extra space. The disk space requested by the files generated through this pipeline is roughly seven times the size of the original samples, and the problem is exacerbated when multiple samples are processed simultaneously, as in multi-core machines or HPC facilities.

Table 1 Resource usage. The reported figures were obtained by applying the proposed metagenomics pipeline to 842 raw paired-end FASTQ files with an average of 26M reads per sample. Experiments were run on an HPC facility using 4 threads and limiting the available RAM to a maximum of 32 GB

If the user does not have the necessary computational resources or the expertise to install and run software on their local machine or cluster, they can take advantage of several user-friendly pipelines (e.g., YAMP, Visconti et al. 2018; MOCAT2, Kultima et al. 2016), or of the multiple web resources that are being made available for metagenomics studies (Dudhagara et al. 2015). However, the latter are less customisable and may have constraints on their utilisation.

Data sharing

Journals and funding agencies have been increasingly requiring data sharing. Data are usually uploaded to public databases, such as NCBI, EBI, and DDBJ (http://www.insdc.org/, https://www.ncbi.nlm.nih.gov/genbank/metagenome/). An increasing number of tools (e.g., MG-RAST, EBI metagenomics) now include functionalities to upload data into these archives, which are then shared via the Human Pan-Microbe Communities (HPMC) database (http://www.hpmcd.org/, Forster et al. 2016).

While data sharing policies have increased the number of publicly available datasets, researchers should still be cautious in integrating and meta-analysing data from multiple studies, as integration might be hampered by technical and biological variability due to study-specific protocols for sample collection, processing, and data analysis (Lozupone et al. 2013). This highlights the necessity of standardised protocols, and a first effort in this direction is represented by the BioSharing portal, which defines standards for the meta-data of a wide range of omics datasets (http://biosharing.org).

Clinical translation

The increased feasibility of large-scale metagenomics studies is opening new avenues for answering common microbiology questions, including understanding what species inhabit a particular environment, what they do, and their involvement in human diseases. These findings can then be translated into the clinic. For example, probiotic administration has been shown to reduce the incidence of severe necrotising enterocolitis and mortality (AlFaleh et al. 2012) and sepsis (Panigrahi et al. 2017) in preterm babies, as well as lower respiratory tract infections in infants (Szajewska et al. 2017), while the addition of fructo-oligosaccharides (which act as prebiotics) to the supplement helps the bacterial colonisation (Underwood et al. 2009). Analogously, Osborn and Sinn (2013) suggested that a simple administration of prebiotics can prevent eczema in infants, while Hsiao et al. (2017) indicated that a combination of probiotics and peanut immunotherapy can prevent allergic immune responses. More research into beneficial microbe cocktails and into prebiotics supporting their growth is thus needed to increase the spectrum of diseases that could not only be treated but also prevented.

While FMT was pioneered in the 1950s and is now routinely used to treat recurrent C. difficile infections, it still presents challenges, for instance in the use of metagenomics screening to make FMT perfectly safe for transplant recipients.

As the gut microbiota contributes to the metabolism of drugs, modulating their efficacy and toxicity (Carmody and Turnbaugh 2014; Dubin et al. 2016; Wilson and Nicholson 2017; Zitvogel et al. 2018), efficient screening of the bacterial composition would also be useful to assess whether a patient should receive a specific course of treatment, in order to check for its effectiveness and to prevent serious side effects.

Infections are becoming harder to treat with the antibiotics currently available because of the presence of antibiotic-resistant bacteria, and antibiotic resistance has been named by the WHO as one of the greatest threats to public health (http://www.who.int/mediacentre/news/releases/2014/amr-report/en/). Metagenomics studies may be able to unveil the evolution of antibiotic resistance by detecting the antibiotic resistance profile of the microbial community (Forslund et al. 2014). This may also provide a way of identifying unknown, unculturable bacterial carriers of antibiotic resistance genes that may potentially be horizontally transferred to other microorganisms (Huddleston 2014).

Metagenomics approaches have also been successful in monitoring and tracking infection outbreaks (e.g., the Ebola (Quick et al. 2016) and Zika (Faria et al. 2017; Quick et al. 2017) viruses). However, while these works have proven very useful a posteriori, more should be done to forecast transmission routes and prevent disease spread. For instance, Faria et al. (2016) hypothesised that there was a single introduction of the Zika virus into the Americas in the second half of 2013, more than one year before its detection in Brazil. Therefore, early genetic screening could have detected and helped isolate the infection months in advance, possibly avoiding or limiting its spread. It is to be hoped that the decreasing costs of sequencing, coupled with a deeper understanding of the microbiome, will make metagenomics testing a routine screening tool for infectious disease surveillance.

Despite these promising results in specific areas of high impact, some caution is necessary. We still have an incomplete understanding of what is a healthy or dysbiotic environment, and we often cannot distinguish whether it is the disease changing the microbiome or the microbiome that is responsible for the disease. More research is therefore needed before metagenomics findings can be safely and widely translated to the clinic (Quigley 2017).

Conclusion

This review presents an overview of best practices in metagenomics, with a particular focus on the methodological and computational strategies and challenges.

While a benchmark for the assessment of computational metagenomics software has been made available as part of the Critical Assessment of Metagenomic Interpretation (CAMI) challenge (Sczyrba et al. 2017), standards for data generation, access, and retrieval are still needed to allow data sharing and the exploitation of the full potential of metagenomics data in microbiology and clinical research—as pursued in the genomics field with the creation of the Global Alliance for Genomics and Health (http://genomicsandhealth.org).