Analysis of soil microbial communities based on amplicon sequencing of marker genes
- 5.6k Downloads
The use of cultivation independent methods has revolutionized soil biology in the last decades. Most popular approaches are based on directly extracted DNA from soil and subsequent analysis of PCR-amplified marker genes by next-generation sequencing. While these high-throughput methods offer novel possibilities over cultivation-based approaches, several key points need to be considered to minimize potential biases during library preparation and downstream bioinformatic analysis. This opinion paper highlights crucial steps that should be considered for accurate analysis and data interpretation.
KeywordsNext-generation sequencing Amplicon sequencing 16S rRNA Sequencing depth Bioinformatics
Most advances achieved in microbial ecology over the last decades are based on methodological progresses enabling a fine description of complex microbial community composition and function in their particular environment.
Besides direct sequencing of the total extracted DNA (metagenome) or RNA via cDNA (metatranscriptome), targeted approaches are often used, which include amplification of single genes before sequencing. Due to its broad presence in all organisms, and its adequate level of conservation, the ribosomal operon is often used as “golden standard” for diversity inventories (Amann et al. 1995). Amplicon-based approaches targeting variable regions of specific markers (e.g., 16S, ITS, or 18S) are widely used to describe bacterial, archaeal, fungal (Lindahl et al. 2013), and micro-eukaryote community composition (Lentendu et al. 2014). This approach was transposed for functional studies, e.g., targeting enzyme-coding genes catalyzing C, N, and P cycles, for example, β-glucosidases (Pathan et al. 2015), protease genes (Baraniya et al. 2016), or alkaline phosphatases (Bergkemper et al. 2016).
In this opinion paper, we highlight major pitfalls in the generation, processing, and interpretation of amplicon sequencing data, keeping in mind key ecological concepts. For other issues including soil sampling strategies, DNA extraction, and metadata collection, we refer to Vestergaard et al. (2017).
When performing amplicon sequencing, several key points deserve consideration: Standard PCR conditions should be applied with minimized cycle numbers if possible (Schmidt and Rothhämel 2012) and DNA templates should be quantified, as 10–20 ng of template DNA is sufficient for amplification of ribosomal marker genes. Higher amounts might be required if the targeted genes are rare. Potential inhibition of the PCR may be assessed via preliminary tests using serial dilutions of template DNA. Moreover, to minimize PCR-introduced biases, it is recommended to perform technically replicated PCR reactions for each sample, which are subsequently pooled before sequencing.
Including negative controls is essential, as recent studies highlighted microbial DNA contaminations originating from DNA extraction kits that can highly bias the obtained sequencing output (Lusk 2014; Salter et al. 2014). Contamination issues become most problematic when working with low DNA amounts.
Primer systems frequently used for the partial amplification of the 16S rRNA gene from bacteria and archaea in studies based on Illumina MiSeq or HiSeq sequencing
(Walters et al. 2016)
(Klindworth et al. 2013)
(Klindworth et al. 2013)
Conversely, as a large fraction of soil microbial genomes is still uncharacterized, designing universal primers for functional genes with current database coverage is challenging (Fredriksson et al. 2013; Tremblay et al. 2015), often giving partial views on the real diversity only (Wei et al. 2015). Primer degeneration is offering more options for binding sites and diversity recovery, while non-PCR-based alternatives also do exist, e.g., DNA-DNA hybridization or screen of metagenomes (Jacquiod et al. 2014). Designing primers to bind to a region of the gene which codes for the catalytic domain of an enzyme, which is usually conserved between microbes, can increase the detection of a greater diversity of microbes carrying the gene of interest. If the catalytic domain of the enzyme is however highly conserved, compared to other regions of the enzyme, the phylogenetic resolution is greatly decreased. Despite the current lack of coverage in databases, evaluating designed primers using in silico PCR can give important insights about the phylogenetic resolution and specificity.
Amplicon length is crucial, as longer sequences will considerably increase annotation accuracy and phylogenetic resolution (Singer et al. 2016). If libraries will be sequenced using Illumina® Miseq (2 × 300 bp) or HiSeq (2 × 250 bp) paired-end sequencing technology, amplicon sizes up to 550 bp are possible.
Finally, it is important to keep in mind that DNA-based sequencing approaches do not provide information about the expression levels of the amplified gene. Thus, any speculations related to the “functionality of a gene” must be avoided or mitigated. The analysis of mRNA or 16S rRNA via cDNA, as well as meta-proteomics are efficient alternatives to address such questions.
The typical amplicon bioinformatic pipeline consists of several steps including: quality filtering, clustering, and comparison to a reference database. Key quality control steps include adapter trimming, quality and length filtering, chimeric sequence and contaminant removal (e.g., PhiX and host DNA), and are discussed in detail by Vestergaard et al. (2017). Bokulich et al. (2013) provide guidelines for Illumina® 16s rRNA amplicon sequencing analysis, showing that filtering for high-quality read length and using abundance cut-offs significantly improve diversity estimates.
Another important issue is the semi-quantitative aspect of amplicon sequencing data. Indeed, for 16S rRNA-based data, it has been shown that due to fluctuation of ribosomal operon copy numbers between microbial species, an absolute quantification of a given OTU is not feasible (Jacquiod et al. 2016). The same issue may hold true for enzyme-coding genes, but here, baseline data is still missing. The PICRUST tool, originally designed to predict metagenomic profiles from amplicon data, implements a normalization procedure correcting count data based on sequenced genome information available in databases for achieving better estimations of absolute abundances (Langille et al. 2013). Nevertheless, analysis between different treatments provides relative semi-quantitative information useful and relevant for comparative studies when performed appropriately (Nunes et al. 2016).
Guidelines for biostatistic analysis and data interpretation
Data analysis is typically based on a contingency table and an associated distance matrix (UNIFRAC or other metrics). One major issue is dealing with a varying number of reads between samples. These differences may introduce a large bias when estimating diversity indices (e.g., richness, Shannon, Chao-1), especially when an asymptotic trend of the curves (Fig. 1) is not fulfilled. QIIME and other pipelines account for this issue by even random resampling (Weiss et al. 2015), while others argue that more advanced models should be used (McMurdie and Holmes 2014).
To extract significantly responding taxa across experimental designs, unappropriated Gaussian-based models like ANOVA are often used. These models are based on the normality assumption and independency of each variable, which is often not the case for microbial communities which are mostly non-normally distributed, have many correlated variables and a very strong mean to variance relationship (e.g., many zeros and rare observations). Thus, Gaussian-based models result in biased conclusions due to attribution of a higher significance to taxa with a high variance, which persist even after data transformation (e.g., log10 or center-scaling). Instead, the correlations between variables can be assessed using permutations (e.g., Monte-Carlo simulations) while the mean to variance relationship can be predicted using generalized linear models and the non-normality of data modeled by logistic regressions (e.g., negative binomial distribution), followed by appropriated statistics like likelihood ratio tests.
All these approaches are available in the R packages mvabund (Wang et al. 2012), edgeR (Robinson et al. 2010), and phyloseq (McMurdie and Holmes 2013). Basic principles on these procedures are summarized in a YouTube tutorial by David Warton about data normalization in ecology using mvabund (https://www.youtube.com/watch?v=KnPkH6d89l4).
Network visualization is becoming more frequent to predict the complex interactions within microbial communities (Karimi et al. 2017). It is important to keep in mind that networks show data co-occurrences and co-exclusions, which is no evidence of biological interactions. Moreover, adequate appropriated correlation indices (e.g., Spearman, Pearson, SparCC) and permutation validation are required to avoid biases that might arise from the count data itself, like random correlations and compositional biases (Berry and Widder 2014; Friedman and Alm 2012). Besides coefficient choice and statistical validity, stringent high cut-offs can be applied on the correlation strengths (>|0.6|) to avoid integration of weak observations and false positives, at the cost of excluding many false negative interactions. Metadata addition (e.g., soil pH; carbon, and nitrogen content) to network visualizations can be an important factor to further assess biological linkage relevance. Here, patterns can be found, which reinforce the idea of specific niche observations, and also the detection of important topographic features such as keystone species, outliers/satellites, modules of highly co-occurring taxa, and even hubs of connected modules that might positively or negatively correlate (Jacquiod et al. 2016).
As the field of microbial ecology is growing, accumulation of knowledge and large amounts of data enable the testing of important ecological theories that were previously out of reach (Prosser et al. 2007). For instance, multivariate data generated from environmental microbial communities can be used to test macro-ecology concepts like Functional Response Groups (FRGs, group of organisms sharing similar response to environmental changes) and Functional Effect Groups (FEGs or guilds, groups of organisms contributing to a similar ecosystem function) (Lehsten et al. 2009). Data on 16S rRNA genes or transcripts can be used to identify FRG and allow the identification of bacterial taxa with a similar response pattern to an environmental clue (Nunes et al. 2016). In addition, functional genes and/or transcripts can be used to determine FEG contributing to a specific ecosystem function (e.g., chitin or cellulose degradation). Working with groups defined from multivariate data is very powerful for understanding the basis of microbial behavior such as trophic lifestyles (e.g., copiotrophs or oligotrophs), growing habits (e.g., fast growers or slow growers), or to reveal environmental bioindicators.
Improvements and developments of novel sequencing technologies add more detailed data, allowing new questions to be asked. Third-generation sequencing platforms can provide even greater detail to amplicon-based approaches. For instance, sequencing the entire 16S rRNA gene using PacBio® technology resulted in a higher phylogenetic resolution compared to shorter amplicons which are mostly generated from Illumina sequencing (Singer et al. 2016). However, new sequencing approaches often need new protocols and new bioinformatic tools. Here the construction of synthetic or “mock” communities is a valuable tool to validate and optimize sequencing workflows and subsequent data analyses (Jumpstart Consortium Human Microbiome Project Data Generation Working 2012).
Finally, the information that can be gained from sequence-based approaches heavily relies on the quality and completeness of reference databases. Therefore, sequencing approaches should be complemented with classical isolation and cultivation-based approaches as well as functional validation using enzyme assays to further improve the quality of databases for a better characterization of the dark microbial matter.
Gisle Vestergaard is supported by a Humboldt Research Fellowship for postdoctoral researchers.
- Baraniya D, Puglisi E, Ceccherini MT, Pietramellara G, Giagnoni L, Arenella M, Nannipieri P, Renella G (2016) Protease encoding microbial communities and protease activity of the rhizosphere and bulk soils of two maize lines with different N uptake efficiency. Soil Biol and Biochem 96:176–179 DOI: 10.1016/j.soilbio.2016.02.001Google Scholar
- Berry D, Widder S (2014) Deciphering microbial interactions and detecting keystone species with co-occurrence networks. Front Microbiol 5 doi:ARTN 219 10.3389/fmicb.2014.00219
- Karimi B, Maron PA, Chemidlin-Prevost Boure N, Bernard N, Gilbert D, Ranjard L (2017) Microbial diversity and ecological networks as indicators of environmental quality. Environ Chem Lett. doi: 10.1007/s10311-017-0614-6
- Nunes I et al. (2016) Coping with copper: legacy effect of copper on potential activity of soil bacteria following a century of exposure. FEMS Microbiol Ecol 92 doi: 10.1093/femsec/fiw175
- Schmidt H, Rothhämel S (2012) Polymerase-Kettenreaktion. Gentechnische Methoden:135–171Google Scholar
- Vestergaard G, Schulz S, Schöler A, Schloter M (2017) Making big data smart—how to use metagenomics to understand soil quality. Biol Fert Soils. doi: 10.1007/s00374-017-1191-3
- Walters W et al. (2016) Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys mSystems 1 doi: 10.1128/mSystems.00009-15
- Weiss SJ et al (2015) Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data. Peer J Preprint. doi: 10.7287/peerj.preprints.1157v1