Living Reference Work Entry

Encyclopedia of Metagenomics

pp 1-11

Date: Latest Version

Soil Metagenomics

  • Janet Jansson, Earth Sciences Division, Lawrence Berkeley National Laboratory


Metagenomics refers to the use of DNA sequencing to determine the phylogenetic and functional gene complement of a sample, such as microbial community DNA in soil. A shotgun metagenomic approach relies on sequencing of total DNA extracted from a given sample, without prior cloning into a vector.

Soil as a Microbial Habitat

Soil has several unique properties compared to other microbial habitats that are important to consider when discussing soil metagenomics. Few soil metagenome studies have been published to date, although there is considerable interest in the topic within the scientific community, as shown by attendance at workshops and conferences and by funding of the TerraGenome NSF-sponsored Research Coordination Network (www.terragenome.org). Although metagenomics is revealing new information about phylogenetic and functional genes in some soils, it is not possible to extrapolate the information available to date to all soils.

There are different classes of soils that vary according to texture and other geochemical characteristics. The traditional soil classification scheme divides soils into 12 major orders, each of which can be subdivided into suborders and classes. The 12 soil orders are the following (http://soils.cals.uidaho.edu/soilorders/orders.htm):

  1. Gelisols (soils with permafrost within 2 m of the surface)
  2. Histosols (organic soils)
  3. Spodosols (acid forest soils with a subsurface accumulation of metal humus complexes)
  4. Andisols (soils formed in volcanic ash)
  5. Oxisols (intensely weathered soils of tropical and subtropical environments)
  6. Vertisols (clayey soils with high shrink/swell capacity)
  7. Aridisols (CaCO3-containing soils of arid environments with subsurface horizon development)
  8. Ultisols (strongly leached soils with a subsurface zone of clay accumulation and <35 % base saturation)
  9. Mollisols (grassland soils with high base status)
  10. Alfisols (moderately leached soils with a subsurface zone of clay accumulation and >35 % base saturation)
  11. Inceptisols (soils with weakly developed subsurface horizons)
  12. Entisols (soils with little or no morphological development)


The soil properties that are recognized by soil taxonomists are also important for governing microbial community composition and activity. For example, soil pH is known to be a key driver for microbial communities (Rousk et al. 2010). Therefore, when considering “soil” as an ecosystem, it is important to take into account the type of soil that is being studied.

Another unique feature of the soil environment is its spatial heterogeneity at different scales (Fig. 1). For example, soil properties can vary considerably across a landscape and are dependent on several features including the plant type, slope of the land, soil moisture, and other aspects of the terrain. The soil profile also changes with depth. Soil is normally classified according to depth horizons with a surface organic “O” horizon that may or may not be present, followed by the “A” horizon and the subsurface “B” horizon that is influenced by plant roots and finally the underlying “C” horizon (Fig. 1). Usually, the microbial density and activity are highest at the surface and decrease with depth. Also, the types of microorganisms vary with depth. For example, there can be decreases in oxygen concentrations with depth, resulting in a corresponding increase in anaerobes with depth. Some recent soil metagenomic studies have shown a vertical variation in community structure (Delmont et al. 2011; Yergeau et al. 2010; Mackelprang et al. 2011). The latter two studies compared active layer and deeper permafrost layers that have very different properties.
Fig. 1

Microheterogeneity of soil

Different soils also have different mineral compositions and redox conditions and thus different potential electron acceptors. For example, in anaerobic zones, microbial activity depends on the availability of electron acceptors, which are used in order of decreasing redox potential as follows: O2 > NO3 > Mn > Fe > SO4 > CO2. Metagenome data can provide a clue to the prevailing redox conditions in the soil based on the prevalence of genes for processes that reduce particular electron acceptors, e.g., methanogenesis, denitrification, and sulfate reduction. For example, Mackelprang et al. (2011) found a high abundance of functional genes for methanogenesis and denitrification when screening permafrost metagenomes.
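As a rough illustration of how gene prevalence might hint at redox conditions, the following Python sketch tallies annotated gene hits by electron-accepting process. The marker-gene lists (for example mcrA for methanogenesis, nirK/nirS/nosZ for denitrification, dsrA/dsrB for sulfate reduction) and the input format are assumptions chosen for illustration; they are not taken from the studies cited above.

```python
from collections import Counter

# Illustrative marker genes per process (an assumption, not an exhaustive
# or authoritative list).
MARKERS = {
    "denitrification": {"narG", "nirK", "nirS", "norB", "nosZ"},
    "sulfate reduction": {"dsrA", "dsrB", "aprA"},
    "methanogenesis": {"mcrA"},
}


def process_fractions(gene_hits):
    """Return the fraction of all gene hits assigned to each process.

    gene_hits: iterable of gene names, one entry per annotated gene hit.
    """
    hits = list(gene_hits)
    counts = Counter()
    for gene in hits:
        for process, genes in MARKERS.items():
            if gene in genes:
                counts[process] += 1
    total = len(hits)
    return {p: (counts[p] / total if total else 0.0) for p in MARKERS}


if __name__ == "__main__":
    # Invented hit list, for demonstration only.
    example = ["mcrA", "nirK", "nosZ", "recA", "mcrA", "dsrB", "rpoB", "mcrA"]
    for process, frac in process_fractions(example).items():
        print(f"{process}: {frac:.3f}")
```

Such fractions only suggest genetic potential; as discussed later in this entry, the presence of a gene in soil DNA does not show that the process is active.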

The soil habitat is further complicated because of the partitioning of resources into different microscopic niches, for example, soil pores containing water or organic matter (Fig. 1). Microbial life in soil is thus often concentrated into discrete locations in soil aggregates. Due to the spatial heterogeneity at a microbial scale, there can be different microbial populations residing close to each other but physically separated by soil grains or air-filled pores. This becomes a complication when using a metagenomic approach to understand the microbial community composition and function. Ideally, one would sequence individual microscopic soil aggregates to determine which populations are present in individual microscopic habitats. However, this is beyond the current level of sequencing resolution, although advances in single-cell sequencing technologies may be applicable in this regard. Usually, soil metagenomes are obtained from at least 1 g of soil that has been homogenized, and thus the individual microscopic habitats are not resolved and the composite community is analyzed in the sequence data (Fig. 1).

Soil Microbial Activity

Due to the microheterogeneity of soil (Fig. 1), microorganisms may be more or less active depending on their access to nutrients and other conditions necessary for activity, including the factors mentioned above. The majority of soil microbes are normally in a dormant or quiescent state while awaiting favorable conditions. For example, actinobacteria are known to be persistent survivors in soil that are resistant to desiccation and can withstand long-term starvation conditions. Therefore, actinobacteria are often abundant in 16S surveys of soil samples (Fierer et al. 2007; Mackelprang et al. 2011). However, whether or not they are active is another matter. A metaproteomic survey of a California grassland soil found a high prevalence of proteins corresponding to Bacillus spore proteins in soil, thus emphasizing the importance of this survival strategy (Chourey et al. 2010).

Roots can be considered nutrient “hot spots” for microbes living in normally nutrient-poor soil conditions. The portion of soil that is directly influenced by roots is known as the rhizosphere. Rhizosphere microbial communities have been found to be different from communities residing in bulk soil (Fierer et al. 2007). The rhizosphere effect is therefore something that should be considered when sequencing metagenomes from soil with a cover of vegetation. Unless the roots can be completely separated from samples prior to DNA extraction, the sample probably contains microbes that are influenced by the rhizosphere. The same soil sample can contain regions that are not influenced by roots. Therefore, the resulting metagenome will be a composite of different microbial communities that may be more or less active. When mining the metagenome sequence data for functional genes, the relative amounts of genes involved in activities expected in the rhizosphere, including quorum sensing and nitrogen fixation, will depend on the relative amount of DNA from rhizosphere soil in the sample.

The variable status of microbial activity in soil is a complication when analyzing soil metagenome sequence data that can include DNA extracted from microbes in different physiological states, ranging from active and growing to dormant or even dead. One option could be to fractionate the soil microbial community according to their physiological status prior to metagenome sequencing. For example, specific members of the soil community could be enriched with nutrients to increase the fraction of the community that is specifically capable of growth and metabolism of the added nutrients prior to DNA extraction. If a specific 13C-labeled substrate is added, the DNA from microbes that incorporate the 13C label during metabolism of the substrate can be fractionated on density gradients, a technique commonly referred to as stable isotope probing (SIP) (Chen and Murrell 2010). This approach was used by Dumont et al. (2006) to enrich methanotrophs in a forest soil by incubation with 13C-labeled methane. The 13C-labeled DNA was cloned into a BAC library and screened for genes involved in methane oxidation.

Another option is to add bromodeoxyuridine (BrdU) as a thymidine analogue that is incorporated into the DNA of replicating cells. The DNA with BrdU incorporated can be selectively extracted using magnetic beads coated with antibodies targeted to BrdU (Artursson and Jansson 2003, 2005). This DNA should then correspond to the growing members of the community. Although not all soil microbes take up BrdU with equal efficiencies, those that do so can be identified as growing using this approach. Sorting of cells prior to DNA extraction has also been proposed as a way to select for cells in a specific physiological state. For example, fluorescence-activated cell sorting (FACS) can be used to distinguish cells that are viable or dead based on their incorporation of different fluorescent dyes that stain live or dead cells (Maraha et al. 2004). Potentially the individually sorted cell fractions could be sequenced separately. Currently this approach has been limited by the low yield of cells obtained after cell sorting from soil. However, newer platforms, including single-cell sequencing approaches, show promise for amplification of low DNA yields, and this could be a future direction for soil metagenomics.

Mining Soil Metagenomes

Soil represents a potential treasure trove for gene hunters because of the abundance of unknown genes that could potentially encode novel pharmaceuticals or other products of biotechnological interest (Van Elsas et al. 2008). Two approaches are normally used for screening soil metagenomes for potentially interesting genes. The first is to rely on homology searches against gene databases. Using this approach, Hjort et al. (2010) identified chitinase genes in a metagenomic library from a phytopathogen-suppressive soil. The other approach is to rely on screening of gene expression in clone libraries, a process that has been called “functional metagenomics” (Ekkers et al. 2012). Functional metagenomics relies on expression of unknown genes of unknown origin in a foreign host. Since most microbes in soil have never been isolated and the majority of genes are unknown, this type of approach is ideally suited for screening soil for novel genes of interest (Sjoling et al. 2007). However, this approach is still hampered by several bottlenecks that result in very few “hits” when performing functional screens, including lack of efficient screens and low expression in heterologous hosts.

With the advent of shotgun metagenomic sequencing, the focus shifted from cloning into BAC and fosmid vectors to sequencing of total DNA. Depending on when soil metagenome projects were initiated, they were sequenced on different platforms that vary widely in read length, sequencing error, and sequencing depth (Fig. 2, Table 1). The first shotgun soil metagenome, of a Minnesota farm soil (Tringe et al. 2005), was obtained using Sanger sequencing, which gave long reads of high quality but low depth. Therefore, only a fraction of the total community was sequenced, but this was still sufficient to distinguish key functional genes from those of other environments, such as whale fall and the Sargasso Sea. Similar broad functional comparisons between datasets were recently carried out on the Rothamsted park grass metagenomes, which have higher sequencing depth but shorter read lengths (using the 454 Titanium technology) (Delmont et al. 2012). Yergeau et al. (2010) used the 454 sequencing platform to compare an arctic soil active layer and permafrost, finding general differences in functional genes between the two soil types. Sequencing on the Illumina GAII platform resulted in the highest amount of metagenome sequence data for soil (Mackelprang et al. 2011, Fig. 2). The Illumina sequencing reads were 2 × 113 bp in length, and 40 gigabases of sequence data were generated for 12 metagenomes (two active layer and two permafrost samples, before and after 2 and 7 days of thaw).
Fig. 2

Progressively increasing sizes of published soil metagenome datasets (Figure courtesy of Emmanuel Prestat, Lawrence Berkeley National Laboratory)
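The reported Illumina yield can be checked with simple arithmetic. The sketch below assumes that the 176 million reads listed in Table 1 are read pairs (an assumption; the source does not say), each contributing 2 × 113 bases.

```python
def total_bases(read_pairs, read_length, reads_per_pair=2):
    """Total sequenced bases for paired-end data."""
    return read_pairs * reads_per_pair * read_length


pairs = 176_000_000   # from Table 1, assumed here to be read pairs
read_len = 113        # bp per read, as stated in the text
n_metagenomes = 12    # as stated in the text

bases = total_bases(pairs, read_len)
print(f"{bases / 1e9:.1f} Gb in total")                        # 39.8 Gb in total
print(f"{bases / n_metagenomes / 1e9:.1f} Gb per metagenome")  # 3.3 Gb per metagenome
```

Under that assumption the total comes to about 39.8 Gb, which agrees with the 40 Gb quoted for the study.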

Table 1

Examples of soil metagenomes obtained using a shotgun metagenome sequencing approach in the published literature

Study site: Permafrost and active layer samples from a single core, Canadian high arctic soils: 2 samples total
Sequencing platform and sequence data: Roche 454 GS FLX Titanium sequencing (454 Life Sciences, Branford, CT); DNA amplified by MDA prior to sequencing
General analyses: Assembled using Phrap software and annotation using MG-RAST server; combined assembled and unassembled reads for downstream analyses
Key findings: Actinobacteria were dominant in both samples; methanogens and genes involved in methanogenesis detected in both samples; detected genes involved in degradation of carbon compounds, including chitinase and sugars as well as nitrogen cycle genes

Study site: Permafrost and active layer samples from 2 replicate cores, 3 time points (before and after 2 and 7 days of thaw): 12 samples total
Sequencing platform and sequence data: Illumina GAII (40 Gb total); 176 million reads; DNA amplified by emPCR prior to sequencing
General analyses: Assembled using Velvet; 9.7 Mb assembly; 3,700 contigs >1 kb; longest contig 67 Kb; draft genome 1.9 Mb
Key findings: First draft genome from soil metagenome and corresponded to novel methanogen; after thaw, there were rapid shifts in microbial community structure and function; permafrost metagenomes were initially very different from replicate cores, but converged upon thaw

Study site: Waseca Farm soil: 1 sample
Sequencing platform and sequence data: Sanger sequencing; 100 Mbp; ABI Genome Analyzer (Applied Biosystems, Life Technologies Corporation, Carlsbad, CA)
General analyses: Assembled using Phrap software; combined assembled and unassembled reads
Key findings: Significant differences in sequences from soil compared to those from Sargasso Sea and deep-sea whale fall; more genes for plant degradation and transport of potassium in soil compared to other 2 environments

Study site: Rothamsted park grass; 10 samples. Different DNA extraction methods
Sequencing platform and sequence data: Roche 454 GS FLX Titanium (13 runs); 12 million reads
General analyses: Newbler assembly on 454 GS de novo assembler software (Newbler v2.0.00.22); sequences annotated on MG-RAST and reads distributed into metabolic subsystems
Key findings: DNA extraction bias; <1 % of annotated sequences correspond to sequenced genomes at 96 % similarity

Study site: FACE sites (biocrust and creosote bush root zones, ambient and elevated CO2), 4 metagenomes total: 1 sample per condition
Sequencing platform and sequence data: 180 Mb (480,000 reads)
General analyses: Trimmed unassembled reads analyzed on MG-RAST
Key findings: Different phyla predominated in biocrusts compared to shrub root zones with higher abundance of cyanobacteria in biocrust; significant differences in phyla distribution depending on sequencing technology

Study site: Contaminated arctic soils; 4 samples: before treatment, after 1-month treatment, 1-year treatment, and uncontaminated control (approx. 450 Mbp total)
Sequencing platform and sequence data: 450 Mb; Roche 454 GS FLX Titanium (approx. 1 million reads)
General analyses: Sequences annotated using MG-RAST. Hydrocarbon degradation genes identified by BLAST
Key findings: Contamination and treatment of soil resulted in a shift in abundance of several bacterial phyla and functional genes

Current Challenges and Bottlenecks

There are several steps in the processing and analysis pipeline that are critical for soil metagenomics. The first step is DNA extraction. This is a challenge that was recognized in the 1980s and persists to this day (Holben et al. 1988). The problem is that no single method has been shown to be free of bias. Also, since the true composition and diversity of any given soil have not been elucidated, there is no standard benchmark to determine which method is more accurate than any other. The cell lysis step in the extraction protocol is particularly problematic. Many soil bacteria, such as actinobacteria, are recalcitrant to enzymatic lysis procedures. Therefore, depending on the lysis steps used, different members of the community may be more or less represented in the sequence data. In recognition of these biases, the Earth Microbiome Project (www.earthmicrobiome.org) recommends that a standard lysis protocol be used for every sample contributed for metagenome sequencing analysis (Gilbert et al. 2010). Delmont et al. (2011) compared different commonly used soil DNA extraction protocols and found that the members of the soil community represented in the data differed significantly between protocols. Another option is to combine DNA extracts from several different extraction protocols to enable greater representation of the soil diversity (Delmont et al. 2012). However, without knowledge of the true diversity and community composition of any given soil, it is currently not possible to know how representative the sequence data from a given DNA extraction method are.

Due to the complexity of shotgun metagenome sequence data from soils, there have been attempts to fractionate DNA extracts to reduce the complexity prior to sequencing. For example, Holben (2010) fractionated DNA based on GC content on density gradients. The different regions on the resulting gradient represented different microbial community compositions. Another way to fractionate is to take different sized DNA fragments that are separated by polyacrylamide gel electrophoresis (PAGE). Delmont et al. (2011) found differences in the genera represented in DNA separated by PAGE, with higher amounts of some genera in the higher molecular weight DNA and different genera represented in the lower molecular weight DNA. Therefore, combining different DNA fractions and/or DNA extracts prepared using different procedures is a promising approach to increase the representation of different microorganisms in a resulting soil metagenome (Delmont et al. 2011, 2012).
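GC content, the property exploited in density-gradient fractionation, can also be computed directly from sequence data. The following minimal sketch groups sequences into GC bins after sequencing; the bin cutoffs (0.4 and 0.6) are hypothetical, and this is a computational analogue for illustration, not the laboratory procedure described above.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def bin_by_gc(seqs, edges=(0.4, 0.6)):
    """Group sequences into low/mid/high GC bins using hypothetical cutoffs."""
    bins = {"low": [], "mid": [], "high": []}
    for s in seqs:
        g = gc_fraction(s)
        if g < edges[0]:
            bins["low"].append(s)
        elif g < edges[1]:
            bins["mid"].append(s)
        else:
            bins["high"].append(s)
    return bins


if __name__ == "__main__":
    # Invented toy sequences, for demonstration only.
    print(bin_by_gc(["AATTAT", "ATGCAT", "GGCCGC"]))
```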

Some soil environments are particularly intractable and result in very low DNA yields (nanogram quantities). Examples include Antarctic dry valley soils and arctic permafrost soils. Current metagenome library protocols require microgram quantities of DNA, although with advances in sequencing technologies, the amount required for library preparation is decreasing substantially. Regardless, for samples with extremely low DNA yields, it is still sometimes necessary to amplify the DNA to have enough for library construction. Two methods have been used to amplify DNA. One is multiple displacement amplification (MDA), which has considerable bias. For example, MDA was used to amplify DNA extracted from active layer and permafrost layer soil samples prior to shotgun metagenome sequencing on the 454 pyrosequencing platform (Yergeau et al. 2010). The other is emulsion PCR (emPCR), which was recently used to amplify low amounts of DNA extracted from permafrost soil prior to metagenome sequencing (Mackelprang et al. 2011). This method relies on amplification of individual size-fractionated DNA segments in individual beads prepared in an oil emulsion. The DNA is diluted to the extent that no more than one DNA molecule should be present in an individual bead. Since each DNA molecule is independently amplified, there is no competition for primers from other DNA molecules.
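The dilution argument for emPCR can be made quantitative under the common assumption that template molecules are distributed among emulsion compartments according to a Poisson distribution. That assumption, and the mean occupancy values used below, are illustrative and are not taken from Mackelprang et al. (2011).

```python
import math


def multi_template_fraction(mean_per_compartment):
    """Among compartments holding at least one template, the fraction holding
    two or more, assuming Poisson-distributed occupancy with the given mean."""
    lam = mean_per_compartment
    p0 = math.exp(-lam)   # probability of an empty compartment
    p1 = lam * p0         # probability of exactly one template
    occupied = 1.0 - p0
    return (occupied - p1) / occupied if occupied > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical mean numbers of template molecules per compartment.
    for lam in (1.0, 0.1, 0.01):
        print(f"mean {lam}: {multi_template_fraction(lam):.3f}")
```

Under this model, at a mean of 0.1 molecules per compartment roughly 5 % of occupied compartments hold more than one template, which is why strong dilution is needed for molecules to be amplified independently.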

Currently one of the biggest bottlenecks for soil metagenomics is the assembly step, which is challenging because of the high microbial diversity in soil. The largest soil metagenome sequence datasets to date are those for native prairie soils in the USA that are being sequenced as a Grand Challenge Pilot Study by the US Department of Energy Joint Genome Institute (JGI). For example, approximately 600 Gb of sequence data have been obtained for Kansas native prairie soil. However, even that amount of sequence is still far short of that required for efficient assembly.

There have been various estimates of the amount of sequencing that would be required to sufficiently cover a soil metagenome to enable reasonable assembly. Interestingly these estimates have been steadily increasing. After publication of the first shotgun metagenome sequence from soil, Tringe et al. (2005) estimated 2–5 Gigabases (Gb) would be required to get eightfold coverage of the most dominant members of the soil community. Delmont et al. (2012) estimated that about 450 Titanium pyrosequencing runs would be required to create contigs from all of the soil pyrosequence reads generated, but they recognized that chimeras might also be generated due to the complexity of the communities. One estimate based on an Alaskan soil metagenome was that 6 Tb of sequence data (or approximately 950,000 genome equivalents) would be required for the coverage of every OTU (at 97 % identity thresholds) (Schloss and Handelsman 2006).
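The relationship behind such estimates can be sketched as follows: to reach a target coverage of one community member, the total sequencing effort scales with that member's genome size divided by its relative abundance in the DNA pool. The genome size and abundance values below are hypothetical and are not the inputs used by Tringe et al. (2005) or Schloss and Handelsman (2006).

```python
def required_bases(target_coverage, genome_size_bp, relative_abundance):
    """Total bases to sequence so that one taxon reaches the target mean coverage.

    relative_abundance: fraction of community DNA contributed by the taxon,
    in the interval (0, 1].
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    return target_coverage * genome_size_bp / relative_abundance


if __name__ == "__main__":
    # Hypothetical example: 8x coverage of a 5 Mb genome making up 1 % of the DNA.
    print(f"{required_bases(8, 5_000_000, 0.01) / 1e9:.1f} Gb")

    # Average genome size implied by 6 Tb per 950,000 genome equivalents.
    print(f"{6e12 / 950_000 / 1e6:.1f} Mb per genome equivalent")
```

With these illustrative inputs the first calculation gives 4.0 Gb, and the second shows that the 6 Tb figure corresponds to roughly 6.3 Mb per genome equivalent. Rare community members drive the requirement: halving the relative abundance doubles the sequencing needed.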

Depending on the soil type included in the metagenome sequencing, the assembly might be more or less problematic. For example, permafrost soils have a lower microbial diversity than most other soils, and it was possible to use Velvet to assemble a draft genome of a novel methanogen from permafrost (Mackelprang et al. 2011). The methanogen draft genome was represented in the majority of contigs greater than 1 kb (1,000 bp) in length. It also appeared to have low sequence heterogeneity at the population level, which aided assembly. This case suggests that strong selection may enhance the ability to assemble specific members of the community that are selected by specific conditions and that may also be present in higher relative amounts in the community.

Delmont et al. (2012) recently sequenced a grassland soil from the long-term experimental station in Rothamsted, UK, using 454 pyrosequencing to obtain nearly 5 Gb of sequence from 13 samples. Newbler assembly without optimization was used to assemble the reads into contigs. The largest contigs were about 23 Kb, but increasing the number of reads in the assembly did not improve the contig lengths. However, the investigators noted that the contigs separated into two clusters according to contig coverage. There was a trend for more sequences related to Firmicutes and Verrucomicrobia in the cluster of contigs with 30× coverage, whereas Proteobacteria were the majority of sequences in contigs with low coverage (4.5×).

Annotation is also a current stumbling block for soil metagenomics due to the paucity of sequenced and annotated soil microbes in databases. Delmont et al. (2012) used the MG-RAST server for annotation of the Rothamsted park grass soil metagenome. Out of 878 possible functional subsystems, 835 were found in the soil metagenome, thus illustrating the large potential functional diversity in the soil. The most abundant functional subsystems they found were related to cAMP signaling and Ton and Tol transport. These same subsystems were found in other existing soil metagenomes, including that of the Minnesota farm soil published earlier (Tringe et al. 2005). The authors suggest that the high abundance of these subsystems indicates that they have a key role in soil ecosystems (Delmont et al. 2012).

Recent Soil Metagenome Examples

The metagenome studies that have been published to date are revealing some common players in many soil environments. For example, some bacterial phyla are seen across several studies suggesting that they are prevalent soil microbes. These include members of the Actinobacteria, Chloroflexi, Fibrobacter, Acidobacteria, Planctomycetes, and Synergistetes that were found to be more prevalent in soil compared to the human gut or oceans (Delmont et al. 2012). Mackelprang et al. (2011) also found similar phyla in active layer and permafrost soils from Alaska, and in addition sequences representative of the candidate phyla OD1 and OP11.

Yergeau et al. (2012) recently screened metagenomes from oil-contaminated arctic soils for hydrocarbon-degrading genes. They used BLAST to screen for major hydrocarbon degradation genes in the dataset. They found that contamination resulted in an increase in the abundance of several bacterial phyla and classes, including Actinobacteria, Alphaproteobacteria, and Gammaproteobacteria, and a decrease in the abundance of others, including Acidobacteria, Bacteroidetes, Chlorobi, Chloroflexi, Cyanobacteria, Firmicutes, Planctomycetes, and Deltaproteobacteria. Several of the groups that were enriched in the contaminated soils include well-known hydrocarbon degraders such as Pseudomonas, Rhodococcus, and sphingomonads.
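A screen of this kind can be sketched as a filter over BLAST tabular output. The example assumes the standard 12-column tabular format (-outfmt 6); the identity and E-value cutoffs are hypothetical, and this is not the pipeline used by Yergeau et al. (2012).

```python
import csv
import io
from collections import Counter


def count_hits(tabular_text, min_identity=60.0, max_evalue=1e-5):
    """Count reads hitting each subject gene, keeping the best hit per read.

    Expects BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    best = {}  # read id -> (bitscore, subject id)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        read, subject = row[0], row[1]
        pident, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if pident < min_identity or evalue > max_evalue:
            continue  # hit fails the (hypothetical) quality cutoffs
        if read not in best or bitscore > best[read][0]:
            best[read] = (bitscore, subject)
    return Counter(subject for _, subject in best.values())
```

Counts from contaminated and control samples would still need to be normalized by sequencing depth before being compared.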

Steven et al. (2012) recently used shotgun metagenomics to compare soil microbial communities in biological soil crusts and creosote bush root zones that were collected at the Nevada FACE (free-air carbon dioxide enrichment) site from soils with either ambient or elevated CO2. They compared the phylogenetic representation from 16S rRNA gene pyrotag sequencing to the classification of 16S reads from shotgun metagenomes and found that the two approaches described the communities differently. Cyanobacteria were more abundant in the biocrusts than in the creosote root zones, but the proportion of cyanobacteria varied depending on the sequencing approach used. These authors cautioned against the use of different and incomplete databases for identification of taxonomic units, as this may result in false ecological interpretation of the data.
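One simple way to quantify how differently two approaches describe the same community is a dissimilarity index on relative abundances. The sketch below uses Bray-Curtis dissimilarity, which is a choice made here for illustration and not necessarily the measure used by Steven et al. (2012); the phylum counts in the example are invented.

```python
def relative_abundance(counts):
    """Convert raw counts per taxon to proportions that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two profiles after normalizing each
    to proportions. 0 means identical profiles; 1 means no shared taxa."""
    a = relative_abundance(counts_a)
    b = relative_abundance(counts_b)
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


if __name__ == "__main__":
    # Invented counts for the same sample profiled by two approaches.
    pyrotag = {"Cyanobacteria": 50, "Actinobacteria": 50}
    shotgun = {"Cyanobacteria": 25, "Actinobacteria": 75}
    print(f"{bray_curtis(pyrotag, shotgun):.2f}")
```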


In summary, there have recently been advances in sequencing technologies that are enabling much higher coverage of soil metagenomes than has been previously possible. Still, no single soil metagenome to date has been sufficient to represent the vast microbial diversity that resides in soil. Even with greater amounts of sequence data, there is a need for simultaneous developments in computing, assembly algorithms, and bioinformatic tools to analyze the data. Some recent developments are promising (e.g., Pell et al. [2012]), and in the next few years, there will be more published examples of the use of soil metagenomics to understand microbial community structure and function in a variety of soil systems.


AbundanceBin, Metagenomic Sequencing

Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads

Desert Soils, Metagenomics of

Environmental Genomics

Extraction Kits

Extraction Methods, DNA

Extraction Methods, Variability Encountered in

Genome Atlases, Potential Applications in Study of Metagenomes

Grassland Soil Metagenomics

Metagenome 1

Metagenome 2

Metagenomics Potential for Bioremediation

Metagenomics, Metadata and Meta-analysis

Microbial Diversity, Barcoding Approaches

Microbial Ecosystems, Protection of

Microbiome; Overview

Proteomics and Metaproteomics

Rhizosphere Metagenomics

Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview

Copyright information

© Springer Science+Business Media New York 2013