Making big data smart—how to use metagenomics to understand soil quality
Next-generation sequencing (NGS) has revolutionized the field of biology over the last decade. The Genomes OnLine Database (GOLD), which monitors sequencing projects worldwide, has grown from just 1575 sequencing projects in 2005 to over 70,000 in 2015 (Reddy et al. 2015). This is partly caused by a rapid drop in the price of high-throughput sequencing (Hayden 2014), but also by an increase in free, user-friendly bioinformatic tools such as MG-RAST (Meyer et al. 2008) and MEGAN (Huson et al. 2016), and in user fora such as seqanswers.com and biostars.org.
This “brave new world” was introduced into soil sciences more than 10 years ago (Daniel 2005) and is becoming increasingly popular, as it is the only known approach that allows a direct assessment of microbial community composition and function on various trophic levels. Today, according to the Web of Science, more than 900 papers have been published on soil metagenomes. In early studies, sequencing depth was below 1 Gbase and often resulted in the identification of only major functional traits and housekeeping genes; in recent publications, up to 100 Gbases have been sequenced (Hultman et al. 2015), which allowed even a partial reconstruction of genomes of single microbes from the obtained reads. However, the interpretation of soil metagenomics data is still a challenge, given the often complex composition of the microbiomes, as well as their huge dynamics in time and space (Ebrahimi and Or 2016).
Checklist for analysis of metagenomic datasets
Sampling
-Determine soil type/texture, sampling date, pH, study specific parameters
-Include negative control samples
-Sample at least 3 replicates (each consisting of composite soil samples)
-Store samples cold, also during sampling
Sequencing library preparation
-Check for inhibitory effects
-Avoid multiple displacement amplification (MDA)
-Use your controls
-Shear DNA for shotgun sequencing
Bioinformatic data analysis
-Quality & length filter
-If possible, use mock community (defined mixture of microbial cells) to validate your workflow
-Upload raw sequencing data to public server
Bergkemper et al. 2016; Darzi et al. 2016; Del Fabbro et al. 2013; Menzel et al. 2016; Rodriguez-R and Konstantinidis 2014a; Sanchez-Flores et al. 2015; Schmieder and Edwards 2011; Schubert et al. 2016; Wood and Salzberg 2014
Soils are vertically and horizontally structured ecosystems, which are composed of a multitude of different microhabitats comprising diverse physical, chemical, and biological properties (Totsche et al. 2010). The degree of heterogeneity strongly depends on (i) the sampled compartment, e.g., the rhizosphere is less heterogeneous compared to bulk soil (Hinsinger et al. 2009), (ii) the soil texture, which strongly influences aggregate formation and also nucleic acid extraction efficiency, (iii) the aboveground diversity and plant coverage, (iv) season, and (v) specific site characteristics like slope, shadowing, and groundwater table (Petersen and Esbensen 2005). Taking this heterogeneity into account, the 500 mg to 10 g of soil typically used for DNA extraction often does not reflect a single microsite, but a mixture of different compartments with differing chemical, physical, and biological properties. This often makes data interpretation quite challenging and only allows a correlative analysis of microbial data with abiotic soil properties, without increasing our mechanistic understanding of how soil ecosystems work.
As abiotic soil parameters are a major driver of soil microbiomes, a minimum dataset is required besides the factors of interest, and it needs to be analyzed and implemented independently of the research questions. Besides exact GPS coordinates and climatic conditions at the period of sampling, such metadata should include the soil type, soil texture, soil pH, stable pools of soil organic matter like total organic C and N, and labile pools of C, N, P, and S. If agricultural sites are studied, management-related properties like fertilization regimes, tillage, cropping sequence, plant protection measures, and plant biomass should be given. For unmanaged sites, at least the aboveground diversity should be characterized.
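One way to keep such a minimum dataset consistent across samples is to record it in a fixed structure and check it for gaps before sequencing. The following Python sketch is only an illustration: the class name, the field names, and the example values are hypothetical, and units are deliberately left open because the text does not prescribe them.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SoilSampleMetadata:
    """Minimum metadata for one soil sample (all field names are hypothetical)."""
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    sampling_date: Optional[str] = None
    climatic_conditions: Optional[str] = None
    soil_type: Optional[str] = None
    soil_texture: Optional[str] = None
    soil_ph: Optional[float] = None
    total_organic_c: Optional[float] = None   # stable pool
    total_n: Optional[float] = None           # stable pool
    labile_c: Optional[float] = None
    labile_n: Optional[float] = None
    labile_p: Optional[float] = None
    labile_s: Optional[float] = None


def missing_fields(record: SoilSampleMetadata) -> List[str]:
    """Return the names of all metadata fields that have not been filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Example with made-up values: only coordinates and pH have been recorded so far.
sample = SoilSampleMetadata(latitude=48.22, longitude=11.59, soil_ph=6.4)
print(missing_fields(sample))
```

Management-related properties for agricultural sites could be added as further optional fields in the same way.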
Sample processing and downstream analysis
Soils should be stored after sampling at a suitable temperature: below 4 °C for short-term storage in the field and −20 °C for long-term storage (Lauber et al. 2010; Tatangelo et al. 2014). Compared to amplicon-based sequencing, direct DNA sequencing (metagenomics) requires larger amounts of high-quality DNA; the exact amount depends on the kit used for library preparation (500 pg–1 μg). Thus, the DNA extraction protocol often needs to be adapted to fulfill these requirements. Multiple displacement amplification should be avoided because of the significant bias it introduces (Yilmaz et al. 2010). Since DNA extraction protocols vary in efficiency depending on the nature of the samples, and in how well they remove various inhibitors, we recommend testing the workflow on a few non-essential samples first (Frostegård et al. 1999). Once a DNA extraction method has been selected, it should be used consistently throughout the whole project, given the inherent bias each method introduces. Finally, depending on the aim of the study, one might also consider employing methods that separate extracellular DNA from intracellular DNA (Pietramellara et al. 2009), which allow discrimination between living and dead microbes. As recommended by the Earth Microbiome Project (Gilbert et al. 2014), and because downstream procedures like DNA extraction or library preparation affect the detected microbial communities (Albertsen et al. 2015), it is essential to include negative controls, e.g., negative DNA extractions (Salter et al. 2014), particularly if low amounts of DNA (<5 ng) are used for sequencing.
Rapid advances in sequencing technologies, each of which has its specific challenges, make it impossible to provide universal guidelines. With 454 pyrosequencing being outdated and long-read technologies such as Oxford Nanopore Technologies and PacBio® not yet frequently used for metagenomics, we focus here on Illumina-based technologies, which are currently the de facto standard in metagenomics (Sanchez-Flores et al. 2015).
The read quality required depends strongly on the questions asked; nevertheless, quality filtering of the sequences is essential and should be adjusted specifically for the dataset at hand to optimize the trade-off between read loss and final quality of the dataset (Del Fabbro et al. 2013). Key quality controls should include the following steps: removal of sequencing adapters, quality and length filtering, and removal of possible contaminants such as PhiX and/or host DNA. A good combination is AdapterRemoval for the removal of adapters, quality/length trimming, and merging of paired sequences (Schubert et al. 2016), followed by DeconSeq for the removal of contaminants (Schmieder and Edwards 2011). Lack of proper contaminant removal is especially critical with Illumina sequencing, as is apparent from the large-scale contamination of microbial isolate genomes with Illumina PhiX control DNA (Mukherjee et al. 2015).
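To make the quality and length filtering step concrete, the following Python sketch trims low-quality bases from the 3' end of each read and then discards reads that have become too short. It is a simplified illustration, not the algorithm of any of the tools named above: the function names, the Phred+33 quality encoding, and the thresholds are assumptions, and adapter and contaminant removal are not covered.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 is assumed)."""
    return [ord(char) - offset for char in quality]


def trim_3prime(seq, quality, min_quality=20):
    """Remove bases from the 3' end until a base reaches min_quality."""
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], quality[:end]


def filter_reads(records, min_quality=20, min_length=50):
    """Trim each (name, sequence, quality) record and keep reads >= min_length."""
    kept = []
    for name, seq, quality in records:
        seq, quality = trim_3prime(seq, quality, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, quality))
    return kept


# Toy records: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
reads = [
    ("read_1", "ACGTACGT", "IIIIII##"),
    ("read_2", "ACGT", "####"),
]
print(filter_reads(reads, min_quality=20, min_length=5))
```

Raising `min_quality` or `min_length` removes more reads, which is the trade-off between read loss and final dataset quality described above.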
The sequencing depth required for a sound bioinformatic analysis strongly depends on the aims of the project. If binning is planned to assemble larger contigs from the obtained reads, a sequencing depth of up to 100 Gbases per sample is needed (Hultman et al. 2015); for a pure comparison of single reads, for example, to reconstruct major nutrient cycles in a given soil, a much lower sequencing depth (5–10 Gbases) is required (Bergkemper et al. 2016). While highly recommended, estimating the obtained sequencing depth or coverage of a metagenome is challenging compared to, e.g., 16S rRNA-based amplicon sequencing. Using 16S rRNA-based amplicon sequencing, we can assume that public databases allow us to identify the vast majority of reads, whereas comparing metagenomic datasets to public databases such as the NCBI non-redundant protein database or functional assignment databases such as KEGG (Kanehisa et al. 2016), SEED (Overbeek et al. 2005), or COG (Tatusov et al. 2000) would identify only a part of the reads and would be biased towards model and/or medically relevant organisms. Therefore, rarefaction analysis makes sense with 16S rRNA amplicons to assess species richness and sample coverage, while rarefaction of metagenomics datasets to assess metagenomic complexity and sample coverage would overestimate coverage, and the degree of overestimation is not even consistent across different samples. Thus, for more accurate coverage estimations of metagenomics data, database-independent approaches are needed. Nonpareil (Rodriguez-R and Konstantinidis 2014b), which examines the degree of overlap among individual sequences to assess whether sufficient coverage has been achieved, is a good alternative to overcome the above-mentioned problems (Rodriguez-R and Konstantinidis 2014a).
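For the amplicon case, where rarefaction is meaningful, the expected number of taxa in a random subsample can be computed directly from the per-taxon read counts. The sketch below uses the standard analytic rarefaction formula, E[S_n] = Σ_i (1 − C(N − N_i, n) / C(N, n)), where N is the total number of reads, N_i the count of taxon i, and n the subsample size. The function name and the example counts are hypothetical, and the sketch is not related to how Nonpareil works.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of taxa observed in a random subsample of n reads.

    counts: number of reads assigned to each taxon in the full sample.
    n: subsample size, between 0 and the total number of reads.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total read count")
    all_subsamples = comb(total, n)
    # For each taxon, 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, n) / all_subsamples for c in counts if c > 0)


# Made-up 16S read counts for five taxa in one sample.
taxon_counts = [500, 300, 150, 40, 10]
for depth in (10, 100, 1000):
    print(depth, round(rarefied_richness(taxon_counts, depth), 2))
```

Plotting the result against `n` gives a rarefaction curve; a curve that flattens indicates that additional reads add few new taxa.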
Assembling contigs from reads can significantly increase the quality of annotation, especially when working with the shorter reads provided by the HiSeq platforms. Assembly programs such as IDBA-UD (Peng et al. 2012) and MEGAHIT (Li et al. 2016) provide well-established pipelines which are also well accepted in the literature. While general functional annotation databases such as the aforementioned are useful for descriptive studies and to obtain a broad overview of the data, they are often based on eukaryotic or model organisms, leading to suboptimal functional assignments (Darzi et al. 2016). Thus, more targeted approaches might be very useful, such as the FOAM database (Prestat et al. 2014), which was developed specifically to screen environmental metagenomic data and is an improvement for any soil-related study. For studies of particular genes of interest, even more focused approaches and specialized databases are needed, depending on the research question. Depending on the availability of such specialized databases, one should either use existing databases or create custom ones to compare the metagenomics sequences to, and/or employ hidden Markov models to detect conserved domains in the metagenomics sequences. Combining an initial metagenomic screen with subsequent amplicon sequencing can in some cases further increase sensitivity, albeit often at the cost of limiting diversity (Bergkemper et al. 2016). For assembly-free taxonomic classification, several solutions can be recommended, such as Kraken and Kaiju (Wood and Salzberg 2014; Menzel et al. 2016). In any case, the bioinformatics pipeline used must be well described, as so far no “gold standard” for data analysis is available. The first data provided by the CAMI initiative (Critical Assessment of Metagenome Interpretation) have shown significant differences in the outcome of read analysis depending on the software used.
In this respect, sequences need to be deposited in public databases in their raw form, as even data trimming introduces biases depending on the method used.
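The general idea behind assembly-free taxonomic classification, mentioned above, can be illustrated with a toy k-mer lookup: reference sequences are split into short words of length k, and a read is assigned to the taxon whose unambiguous k-mers it shares most often. This is a deliberately simplified Python sketch and not the algorithm of Kraken or Kaiju; the function names, the taxon labels, the sequences, and the choice of k are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map every k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon with the most unambiguous k-mer hits."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer)
        if taxa and len(taxa) == 1:   # k-mers shared between taxa do not vote
            votes[next(iter(taxa))] += 1
    if not votes:
        return None                   # read stays unclassified
    return votes.most_common(1)[0][0]


# Made-up reference sequences; real databases hold complete genomes.
references = {"taxon_A": "ACGTACGTGGCC", "taxon_B": "TTGACCATGCAA"}
index = build_index(references, k=4)
print(classify("ACGTGGCC", index, k=4))
print(classify("GGGGGGGG", index, k=4))
```

Because the outcome depends on the reference set, on k, and on how ambiguous matches are handled, even this toy example shows why the software and parameters used must be reported.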
Despite the ever-growing sequence databases, most metagenomic reads cannot be assigned to a function, which limits both our ability to test hypotheses and the value of metagenomic datasets as a tool for novel discoveries. Besides developing targeted approaches for the isolation of microorganisms from soil, which allows a classical taxonomic assignment of genotypic and phenotypic traits, novel approaches integrating metagenomic datasets with other types of data such as metabolomics and abiotic factors are starting to yield much greater insight into the workings of the microbiome (Feng et al. 2016).
As the analysis of DNA reveals only the potential for the expression of certain genes, there has been great interest in applying a pipeline comparable to the one described above to the analysis of metatranscriptomes from soil (Baldrian et al. 2012). In principle, the same approach can also be used for the analysis of RNA extracted from soil after reverse transcription. Due to the high stability of rRNA compared to mRNA, depletion techniques are needed to reduce the amount of rRNA. Furthermore, the issue of spatial and temporal heterogeneity is more pronounced when analyzing RNA, as the stability of mRNA in cells is often in the order of minutes to hours; thus, one sampling may reflect only a snapshot depending on the actual environmental conditions.
Moreover, the development of long-read sequencing technologies opens a new field of application, which has the potential to provide additional information about operon structures from samples with low diversity or samples where a specific target was enriched beforehand. Such approaches will help us in the future to improve our understanding of the mechanisms by which gene expression is regulated, opening a new field in soil microbial ecology that addresses issues of “metaregulation.” Such studies could help us to improve our understanding of, for example, the molecular mechanisms of major ecosystem services provided by soils, like plant growth promotion or carbon sequestration.
Gisle Vestergaard is supported by a Humboldt Research Fellowship for postdoctoral researchers.
- Baldrian P, Kolarik M, Stursova M, Kopecky J, Valaskova V, Vetrovsky T, Zifcakova L, Snajdr J, Ridl J, Vlcek C, Voriskova J (2012) Active and total microbial communities in forest soil are largely different and highly stratified during decomposition. ISME J 6:248–258. doi: 10.1038/ismej.2011.95
- Ebrahimi A, Or D (2016) Microbial community dynamics in soil aggregates shape biogeochemical gas fluxes from soil profiles - upscaling an aggregate biophysical model. Glob Chang Biol 3141–3156. doi: 10.1111/gcb.13345
- Feng Q, Liu Z, Zhong S, Li R, Xia H, Jie Z, Wen B, Chen X, Yan W, Fan Y, Guo Z, Meng N, Chen J, Yu X, Zhang Z, Kristiansen K, Wang J, Xu X, He K, Li G (2016) Integrated metabolomics and metagenomics analysis of plasma and urine identified microbial metabolites associated with coronary heart disease. Sci Rep. doi: 10.1038/srep22525
- Hultman J, Waldrop MP, Mackelprang R, David MM, McFarland J, Blazewicz SJ, Harden J, Turetsky MR, McGuire AD, Shah MB, VerBerkmoes NC, Lee LH, Mavrommatis K, Jansson JK (2015) Multi-omics of permafrost, active layer and thermokarst bog soil microbiomes. Nature 521:208–212. doi: 10.1038/nature14238
- Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. doi: 10.1186/1471-2105-9-386
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang H-Y, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33:5691–5702. doi: 10.1093/nar/gki866
- Prestat E, David MM, Hultman J, Taş N, Lamendella R, Dvornik J, Mackelprang R, Myrold DD, Jumpponen A, Tringe SG, Holman E, Mavromatis K, Jansson JK (2014) FOAM (functional ontology assignments for metagenomes): a hidden Markov model (HMM) database with environmental focus. Nucleic Acids Res 42:e145. doi: 10.1093/nar/gku702
- Reddy TBK, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula J, Pagani I, Lobos EA, Kyrpides NC (2015) The genomes OnLine database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 43:D1099–D1106. doi: 10.1093/nar/gku950