1 Background

Hot springs populated by extreme thermophiles (Topt: 65–79 °C) and hyperthermophiles (Topt > 80 °C) (DeCastro et al. 2016) are very diverse, and some of them show combinations of extreme chemo-physical conditions, such as high temperature, acidic or alkaline pH, high pressure, and high concentrations of salts and heavy metals (López-López et al. 2013). As with all studies of environmental microbiology, our understanding of the composition, functional and physiological dynamics, and evolution of extreme- and hyper-thermophilic microbial consortia has lagged substantially. However, recent advances in ‘omics’ technologies, particularly within a systems biology context, have allowed significant progress in this field. These include the prediction of the functionality of microbial consortia in situ and access to enzymes with important potential applications in biotechnology (Cowan et al. 2015). Metagenomics is particularly relevant in geothermal environments since most extremophilic microorganisms are recalcitrant to cultivation-based approaches (Amann et al. 1995; Lorenz et al. 2002). The rapid and substantial cost reduction in next-generation sequencing (NGS) (Fig. 1) has dramatically accelerated the development of sequence-based metagenomics, as witnessed by the explosion of metagenome shotgun sequence datasets in the past few years. Metagenomics provides access to the gene composition of microbial communities, offering a much deeper description than phylogenetic surveys, which are often based only on the diversity of the 16S rRNA gene. In addition, metagenomic datasets can provide novel, and even unexpected, insights into community dynamics.
For example, it was surprisingly found that both extreme thermophiles and hyperthermophiles carry a statistically significantly higher number of clustered regularly interspaced short palindromic repeats (CRISPR) sequences in their genomes than mesophiles, suggesting that viruses/phages may play an important role in shaping the composition and function of thermophilic communities as well as in driving their evolution (Cowan et al. 2015). In addition, functional metagenomic strategies, exploiting expression libraries in conventional microbes, are powerful alternatives to conventional genomic approaches for producing novel enzymes for industrial applications.

Fig. 1

Evolution of sequencing technology in the past 10 years. The main characteristics of the different techniques, including main pros and cons, are highlighted. NGS, which produces millions of short reads (25–650 bp), is generally less expensive than Sanger sequencing and allows faster library construction without bias against genes that are toxic to the cloning host

This review offers an overview of recent developments in metagenomics applied to terrestrial geothermal environments with temperatures ≥65 °C. We describe how recent progress in deep sequencing technology has led to the expansion of studies on the microbial and phage/viral communities populating these sites (Sect. 2) as well as to the use of CRISPR loci as a metagenomic tool to identify specific hosts for a viral assemblage (Sect. 2.2.1). Particular emphasis is then given to the most recent literature on the distribution of the microbiome and virome communities populating terrestrial geothermal sites worldwide (Sect. 4), and to the exploitation of functional metagenomics for the discovery and production of enzymes for biotechnology (Sect. 5).

2 Microbial and viral metagenomics of geothermal environments

2.1 Microbial metagenomics

The microbiological study of geothermal environments began in the early 1970s with the innovative work of Thomas Brock and was followed, in the 1980s, by early microbial studies based on 16S rRNA analysis, which allowed the identification of Archaea as a domain separate from Bacteria [Brock et al. 1971; Woese et al. 1990, which is reviewed in Brock (2001)]. Since then, this approach, which overcomes the limitations of isolating thermophilic microorganisms, has continuously revealed novel uncultivated microbial lineages, proving that isolates represent less than 20% of the phylogenetic diversity in Archaea and Bacteria (Reysenbach et al. 1994; Stahl et al. 1984; Wu et al. 2009). Several microbiological surveys of (hyper)thermophilic environments were performed in the last 10 years by using 16S rRNA gene profiling. This led to the first microbial characterization of different geothermal areas, allowing the identification of the dominant phyla/genera among the microbial communities populating these environments and of their correlations with geophysical and climatic parameters (Meyer-Dombard et al. 2005; Wang et al. 2014). Despite the ease of this approach, estimating abundance and microbial diversity from a single 16S rRNA gene is challenging for several reasons (see the section below: Approaches and tools for microbial and viral metagenomics) (Fig. 2).

Fig. 2

Schematic flow chart of the analysis approaches of a sample from geothermal sites. 16S rRNA profiling (purple), microbial metagenomics (light blue), viral metagenomics (orange), single-cell genomics (yellow) and functional metagenomics (green). Dashed arrows and box indicate optional steps. (Color figure online)

An alternative approach is sequence-based metagenomic (SBM) analysis, which has also been exploited successfully on microbiomes populating geothermal sites. This method provides access to the gene composition of a microbial community and to its encoded functions, giving a much broader and more detailed phylogenetic description than 16S rRNA profiling (Wu et al. 2014). Indeed, SBM, which is especially valuable for complex communities requiring deeper sequencing, also represents the best approach in geothermal hot springs in which, despite the low microbial complexity, the population is not well characterized because of the difficulties in isolating new strains through classical microbiology approaches. As reported by Bhaya and co-workers, SBM led for the first time, in 2007, to the identification of two different Synechococcus populations inhabiting the microbial mats of the Octopus Spring in the Yellowstone National Park (YNP). This study revealed extensive genome rearrangements and differences related to the assimilation and storage of several elements such as nitrogen, phosphorus and iron, suggesting that the two populations have adapted differentially to the fluxes and gradients of chemical elements (Bhaya et al. 2007). In addition, SBM showed correlations between function and phylogeny of unculturable microorganisms, allowing the study of evolutionary profiles and the identification of novel candidate phyla (i.e. Geoarchaeota, Lokiarchaeota and Aigarchaeota). These studies contributed to the understanding of archaeal evolution and of metabolic interactions that may not have been addressed with basic 16S rRNA gene profiling (Kozubal et al. 2013; Spang et al. 2015). At the onset, SBM was generally more expensive than 16S rRNA sequencing; however, the constant reduction of NGS costs (Fig. 1) has made this approach increasingly convenient, thereby considerably increasing the number of metagenomic projects available and making it a valid support, or even a direct alternative, to 16S rRNA profiling. In addition, the sequence data banks resulting from SBM studies of geothermal environments are an important repository of genes encoding novel enzymes of potential biotechnological interest. Therefore, in silico functional screening of metagenomic data banks allows the identification of genes that can be cloned and expressed in mesophilic hosts to produce recombinant enzymes. Alternatively, metagenomic expression libraries can be constructed to perform direct functional screenings for the enzymes of interest (see the section Enzyme discovery below).

2.2 Viral metagenomics

Phages are generally the predominant biological entity in every ecosystem and have the capability to greatly influence the structure, composition and function of their host population(s) (Snyder et al. 2015). This also holds true in geothermal environments, although their density is lower (typically 10–100-fold fewer viruses than host cells) compared with mesophilic aquatic systems (López-López et al. 2013). Despite their importance, knowledge about the diversity and biology of phages and their effects on the microbial communities in these ecosystems is still limited (Schoenfeld et al. 2008). Since no obvious common genetic markers exist, phages are still classified according to their host range and morphology, which makes the discovery of genetic variants and novel subtypes challenging. In addition, the rate of lateral gene transfer events within geothermal environments is exceptionally high, a fact that renders uncertain the resolution of the evolutionary histories of the known major viral/phage groups (Diemer and Stedman 2012).

Geothermal environments with temperatures >80 °C tend to be dominated by archaea over bacteria and eukaryotes (Bolduc et al. 2015) and, therefore, the majority of viruses isolated from these types of habitats are archaeal viruses (Snyder et al. 2015). At present, one order and 10 families (Fuselloviridae, Bicaudaviridae, Ampullaviridae, Clavaviridae, Guttaviridae, Lipothrixviridae, Rudiviridae, Globuloviridae, Myoviridae, Siphoviridae) of archaeal viruses have been documented (Fusco et al. 2015a, b; Prangishvili 2013; Snyder et al. 2015; Wang et al. 2015b). Until relatively recently, the only methodology available to study these viruses was the cultivation of their hosts (López-López et al. 2013). Through the systematic application of this approach, our knowledge of the viruses populating (hyper)thermal environments has expanded considerably over the last 30 years, thanks to the pioneering work of Wolfram Zillig and, subsequently, of several groups in Europe and the USA (Bize et al. 2008; Dellas et al. 2013, 2014; Diemer and Stedman 2012; Haring et al. 2005; Peng et al. 2012; Prangishvili et al. 2001, 2006; Prangishvili and Garrett 2004, 2005; Rice et al. 2001, 2004; Snyder et al. 2011; Snyder and Young 2013; Zillig et al. 1996). While enrichment cultures have been invaluable in the study of thermophilic viruses, contextual information, such as relative abundance, diversity, and distribution, remained largely unknown.

Direct SBM analysis of environmental samples, together with the development of ad hoc bioinformatics tools (Rampelli et al. 2016; Roux et al. 2014, 2015), has had a revolutionary impact on the virology of extremophiles, providing a better understanding of the specific role of viruses in these environmental niches. In addition, viral metagenomics and the genomics of cultured viruses have also revealed that a large proportion of predicted archaeal viral genes are ‘unknown’ or ‘hypothetical’ (Contursi et al. 2014a; Prato et al. 2008), expanding the body of genetic information referred to as biological ‘dark matter’ (Martinez-Garcia et al. 2014). It is expected that the annotation of additional sequenced genomes, as well as the exploitation of bioinformatics tools based on structural protein homologies, will help to disclose this unexplored repository of viral genes (Fig. 2). Interestingly, non-coding nucleic acid sequences also play a critical role in archaeal virus function, a topic that is largely underestimated in archaeal virology (Contursi et al. 2010).

Despite their promising scientific impact, only a few viral metagenomic studies on geothermal samples have been reported so far (see also the paragraph: Geographical distribution of microbiomes). Some studies were pursued by deep sequencing of environmental samples enriched for virus particles (Bolduc et al. 2012; Garrett et al. 2010; Schoenfeld et al. 2008), whereas others were performed by retrieving viral sequences from whole SBM datasets (Gudbergsdottir et al. 2016; Servín-Garcidueñas et al. 2013a, b).

In addition, the same approach has recently made it possible to focus on the study of CRISPR, which has become one of the most advanced fields in viral metagenomics and is reviewed below.

2.2.1 CRISPR

CRISPR is a mechanism of acquired immunity that plays a role in controlling the equilibrium between prokaryotic populations and their parasites. This system, which is found in 80% of archaea and 40% of bacteria, recognizes and memorizes short sequences from the genome of the viral or phage invader (Barrangou et al. 2007; Brouns et al. 2008; Fusco et al. 2015a; Marraffini and Sontheimer 2010; van der Oost et al. 2009). The peculiar structure of CRISPR loci (Fig. 3), with alternating spacer and repeat units, results in a computationally identifiable sequence signature. Several bioinformatics tools have been developed to identify CRISPR spacers in bacterial genomes (Biswas et al. 2013; Bland et al. 2007; Edgar and Myers 2005; Grissa et al. 2007; Skennerton et al. 2013), and spacer sequences have also been collected in publicly accessible databases (Grissa et al. 2007; Rousseau et al. 2009).
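The repeat-spacer signature mentioned above can be illustrated with a deliberately naive Python sketch. It is not the algorithm of any of the cited tools: it assumes that the repeat length is known in advance, that repeats are exact copies, and that spacers fall within a fixed length window. The function name and all default thresholds are hypothetical.

```python
from collections import defaultdict


def find_crispr_arrays(seq, repeat_len, min_spacer=20, max_spacer=60, min_repeats=3):
    """Naive search for exact direct repeats separated by spacer-sized gaps.

    Assumptions (illustrative only): repeats are identical copies of known
    length `repeat_len`; spacer length lies in [min_spacer, max_spacer].
    """
    # Index the start positions of every substring of length repeat_len.
    starts_by_kmer = defaultdict(list)
    for i in range(len(seq) - repeat_len + 1):
        starts_by_kmer[seq[i:i + repeat_len]].append(i)

    arrays = []
    for kmer, starts in starts_by_kmer.items():
        if len(starts) < min_repeats:
            continue
        chain = [starts[0]]
        # The trailing None flushes the last chain after the final occurrence.
        for pos in starts[1:] + [None]:
            if pos is not None:
                gap = pos - (chain[-1] + repeat_len)
                if min_spacer <= gap <= max_spacer:
                    chain.append(pos)
                    continue
            if len(chain) >= min_repeats:
                # Spacers are the stretches between consecutive repeat copies.
                spacers = [seq[a + repeat_len:b] for a, b in zip(chain, chain[1:])]
                arrays.append({"repeat": kmer, "starts": chain, "spacers": spacers})
            if pos is not None:
                chain = [pos]
    return arrays


# Toy sequence: four copies of a made-up repeat separated by three made-up spacers.
toy_repeat = "GTTGCAATTCCGAGTAAAACTGAC"
toy_spacers = [
    "ACGTACGTTAGCCGATAGGCTTAAGCTAGGA",
    "CGGATCCATTGACGTTAACCGGTATCAGTCC",
    "TTAGGCATCGATCCGGAATTCGCATGCAAGT",
]
toy_seq = "TTGACCGTAG" + toy_repeat + "".join(s + toy_repeat for s in toy_spacers) + "GGGCCCATAT"

for array in find_crispr_arrays(toy_seq, repeat_len=len(toy_repeat)):
    print(array["repeat"], len(array["spacers"]), "spacers")
```

Real CRISPR finders tolerate degenerate repeats and unknown repeat lengths; the exact-match requirement here is only meant to make the alternating repeat-spacer layout explicit.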

Fig. 3

Overview of the CRISPR-Cas system. Upon infection, the CRISPR immune system operates by recognizing foreign genetic elements, such as plasmids and bacteriophages (blue solid line), and cleaving them into short DNA fragments (protospacers). In adaptation, the latter are then inserted, as new spacers, into the sequence array of a CRISPR locus. Such a locus consists of several short palindromic repeats, each approximately 20–50 bp in length, interspersed with spacers (Grissa et al. 2007). This array is typically located adjacent to a leader sequence (black rectangle) and CRISPR-associated cas genes (coloured arrows) (Rath et al. 2015). A newly acquired spacer (blue box highlighted by a black arrow) is generally inserted directionally into the region of the CRISPR array closest to the leader sequence, thus preserving the history of viral infections. In transcription and processing, the repeat-spacer array is transcribed into a pre-crRNA, which is cleaved at each repeat to yield individual mature CRISPR RNAs (crRNAs). These guide a dedicated set of CRISPR-associated (Cas) proteins to their targets during cellular surveillance (Marraffini and Sontheimer 2010; van der Oost et al. 2009). Indeed, upon reinfection, when the DNA or, in some cases, the mRNA of remembered invaders is identified, the CRISPR-Cas system binds to the invading phage DNA, resulting in the degradation of the phage genome sequence (Stern and Sorek 2011). (Color figure online)

A challenge in the field of archaeal virology is the development of new approaches to move rapidly from the analysis of SBM data to the identification and isolation of the viral nucleic acids present in the environmental samples, as well as of their respective hosts. In this regard, analyses of CRISPR spacers across metagenomic data provide high-resolution genetic markers that not only recapitulate the history of infections in the host genomes, but also allow individual phage strains to be tracked by following their presence in the very same host genomes (Vale and Little 2010) (Fig. 2). This has been done either by extracting spacers from sequenced host genomes or by identifying CRISPR spacers from the same sample by PCR (Gudbergsdottir et al. 2016). An alternative approach consists in exploiting a microarray platform built with CRISPR spacer sequences from host metagenomic data to examine temporal changes in viral populations within this environment (Snyder et al. 2010).
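The idea of using spacers as markers that link hosts to viruses can be sketched as a simple lookup of spacer sequences in viral contigs. This is a minimal illustration under strong assumptions: matches are exact and full-length, both strands are checked, and all sequences and identifiers are invented. Published analyses typically rely on alignment tools that tolerate mismatches.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def link_hosts_to_viruses(host_spacers, viral_contigs):
    """Return sorted (host, virus) pairs where a host spacer occurs in a viral contig.

    host_spacers:  dict mapping a host identifier to a list of spacer sequences
    viral_contigs: dict mapping a virus identifier to a contig sequence
    Assumption: only exact matches on either strand are reported.
    """
    links = set()
    for host, spacers in host_spacers.items():
        for spacer in spacers:
            spacer_rc = reverse_complement(spacer)
            for virus, contig in viral_contigs.items():
                if spacer in contig or spacer_rc in contig:
                    links.add((host, virus))
    return sorted(links)


# Hypothetical example data: identifiers and sequences are invented.
example_spacers = {
    "host_A": ["ACGTTGCAAGGCTTAACCGG"],
    "host_B": ["TTTTTTTTTTGGGGGGGGGG"],
}
example_contigs = {
    "virus_1": "GGG" + "ACGTTGCAAGGCTTAACCGG" + "CCC",
    "virus_2": "AT" + reverse_complement("TTTTTTTTTTGGGGGGGGGG") + "GC",
}
for host, virus in link_hosts_to_viruses(example_spacers, example_contigs):
    print(host, "carries a spacer matching", virus)
```

Each reported pair suggests that the host lineage has previously encountered the corresponding virus; it does not by itself demonstrate a current infection.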

3 Approaches and tools for microbial and viral metagenomics

3.1 Preparation of high-molecular weight DNA and metagenomic libraries

The extraction of metagenomic DNA (mDNA) from samples collected in extreme environments plays a critical role in the whole metagenomic analysis workflow (Fig. 2). To ensure that the genetic information obtained is representative of the whole microbiome, mDNA preparation requires specific protocols that preserve, as much as possible, the quality and the amount of the nucleic acids, so as to guarantee the best metagenomic library and the subsequent sequencing approach. mDNA extraction from geothermal environments can be performed by following two general strategies, which share the critical step of removing humic acid, a major soil component made of phenolic moieties covalently bound to DNA (Lakay et al. 2007), which inhibits restriction enzymes and PCR amplification (Tebbe and Vahjen 1993). The first method is “direct mDNA extraction” and consists of cell lysis directly followed by the separation of the nucleic acids from the soil particles within the sample; it generally provides high DNA amounts quickly.

By contrast, in the second approach, named “indirect mDNA extraction”, the environmental samples need to be physically and mechanically treated before cell lysis. Although this method requires a large amount of starting sample and is time-consuming compared with “direct mDNA extraction”, it is suitable for in-depth sequencing and for the creation of fosmid libraries, because of the reduced proportion of eukaryotic sequences present in the sample and the increased length of the mDNA fragments (Delmont et al. 2011).

Library production for most sequencing technologies requires not only high amounts of mDNA but, in some cases, also the amplification of nucleic acids, generally performed by Multiple Displacement Amplification (MDA). This method, used in both metagenomics and single-cell genomics, can amplify femtograms of DNA up to microgram amounts (see below).

When only a specific part of the ecological community, such as the viral population, is the target of the analysis, additional steps can be applied (Fig. 2).

Indeed, the relative abundance of viral particles in a sample, compared with that of other organisms such as bacteria or host cells (or their genomes), is a critical factor for the discovery of viruses by metagenomic analysis. Enrichment methods applied to the detection of viruses in hot springs are methodologically challenging, mainly because extremophilic viruses and their microbial hosts are rarely cultivable (Edwards and Rohwer 2005). In addition, the majority of viral metagenomic reads (50–90%) show no significant similarity to sequences from known organisms. Current approaches for virus isolation and concentration include filtration and/or adsorption to, and subsequent elution from, positively or negatively charged membrane filters (Katayama et al. 2002) and pelleting of virus particles by ultracentrifugation (Bolduc et al. 2012; Short and Short 2008). The drawbacks include selective adsorption of viruses onto treated filters, limited volume capacity and low or variable recoveries of viruses. An efficient virus purification and concentration method, which combines tangential flow filtration (TFF) with centrifugal ultrafiltration technology, has been employed to obtain a high density of viral particles from high-turbidity seawater samples (Sun et al. 2014). This enrichment method has been applied to recover hot spring viral particles for metagenomic studies (Diemer and Stedman 2012; Schoenfeld et al. 2008). Finally, an efficient and reliable method to concentrate viruses from ecological samples has been developed by using FeCl3 as a low-cost and non-toxic agent that leads to nearly complete recovery (92–95%) (John et al. 2011).

3.2 Next generation sequencing technology

Sanger sequencing, developed almost 40 years ago, is still considered a good method for nucleic acid sequencing, and is characterized by the use of sequencing libraries with large insert sizes (>30 kb) and by long read lengths (up to 1000 bp) with a relatively low error rate. However, over the past 10 years shotgun sequencing, including metagenomic analysis, has gradually shifted from this technology to NGS, which offers faster library construction, is not associated with bias against genes that are toxic to the cloning host (Sorek et al. 2007) and is generally less expensive. NGS platforms (Fig. 1) usually produce millions of short sequence reads of up to 800 bp. To date, the most used sequencing techniques are Roche 454 and Illumina, which have now been extensively applied to metagenomes from geothermal environments (Inskeep et al. 2013b; Menzel et al. 2015) (Table 1).

Table 1 State-of-the-art of the microbial metagenomics in geothermal environments worldwide

The Roche 454 system produces an average read length between 600 and 800 bp, significantly reducing the number of reads that are too short to be annotated without assembly (Wommack et al. 2008). The main drawbacks of this method with respect to metagenomic applications are the production of artificial replicate sequences (up to 15% of the resulting sequences), which affect the estimation of both microbial and gene abundances (Gomez-Alvarez et al. 2009), and a high error rate in homopolymer regions (Margulies et al. 2005). Despite these disadvantages, Roche 454 is much cheaper (up to $16,000 per Gb) than Sanger sequencing. In addition, sample preparation has been optimized and requires only nanograms of mDNA for the sequencing of a single-end library (Adey et al. 2010), although paired-end sequencing might still require microgram quantities.

Compared with the Roche 454 technology, Illumina is considerably cheaper, with a cost of ~$200 per Gb, but usually produces reads of only up to 300 bp (Fig. 1). In addition, although more time consuming, Illumina has limited systematic errors, and its quality control allows bad reads to be detected and eliminated. Faster analysis can be obtained with the Illumina MiSeq instrument. However, although there is evidence that the MiSeq offers valuable information for shotgun sequencing and can be used to test-run sequencing libraries before analysis on HiSeq instruments, deeper sequencing is strictly required to detect the majority of species in a sample and to perform a high-quality assembly with a good estimation of the abundance of the microorganisms (Clooney et al. 2016).
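The need for deeper sequencing can be made concrete with the standard expectation for mean sequencing depth, i.e. (number of reads × read length × fraction of the sequenced DNA derived from the genome) / genome size. The short sketch below applies this relation; the function names and the example numbers are illustrative assumptions and are not taken from the cited studies.

```python
import math


def expected_coverage(n_reads, read_len_bp, genome_size_bp, dna_fraction=1.0):
    """Expected mean depth of a genome in a shotgun metagenome.

    dna_fraction is the share of the sequenced DNA that derives from the
    genome of interest (1.0 for an isolate, much less for a rare member).
    """
    return n_reads * read_len_bp * dna_fraction / genome_size_bp


def reads_needed(target_coverage, read_len_bp, genome_size_bp, dna_fraction=1.0):
    """Minimum number of reads expected to reach target_coverage on average."""
    return math.ceil(target_coverage * genome_size_bp / (read_len_bp * dna_fraction))


# Hypothetical scenario: a 3 Mb genome contributing 1% of the community DNA,
# sequenced with 300 bp reads.
depth = expected_coverage(n_reads=1_000_000, read_len_bp=300,
                          genome_size_bp=3_000_000, dna_fraction=0.01)
print(f"Mean depth with 1 million reads: {depth:.2f}x")
print("Reads for 20x mean depth:",
      reads_needed(20, read_len_bp=300, genome_size_bp=3_000_000, dna_fraction=0.01))
```

Because the required number of reads scales inversely with the DNA fraction, rare community members dominate the sequencing effort needed for a good assembly.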

A detailed comparison of the advantages and limitations of the Roche 454 and Illumina platforms was reported by Luo et al. (2012). The authors suggest that both NGS technologies are reliable for quantitatively assessing genetic diversity within environmental communities. Moreover, considering the longer and more accurate contigs obtained with Illumina by assembly (despite the substantially shorter read length) and a cost of about one fourth of that of Roche 454, the Illumina method may be a more favourable approach for metagenomic studies (Luo et al. 2012).

3.3 16S rRNA PCR amplification versus sequence-based metagenomics and single cell genomics

Generally, 16S rRNA gene profiling is considered a first approach in a metagenomic survey and has been applied to the analysis of different microbial populations since the mid-1990s, with a recent boost due to the advances in NGS platforms (Fig. 1). The comparison of 16S rRNA sequence profiles across different samples can, indeed, explain how microbial communities are related across different environmental conditions. Typically, this approach involves the amplification and NGS sequencing of short hypervariable regions (V1–V9) of the 16S rRNA gene, which show considerable and differential sequence diversity among microorganisms. Although a single hypervariable region is not sufficient to classify microorganisms phylogenetically, the hypervariable regions V2, V3 and V6 show the maximal heterogeneity among the different lineages, providing the best discriminating power for the analysis of microbial communities (Chakravorty et al. 2007). Today, NGS can produce large 16S rRNA datasets containing hundreds of thousands of 16S rRNA fragments, allowing the simultaneous survey of several microbial communities in different hyperthermophilic environments (Hou et al. 2013; Sahm et al. 2013). Despite the ease with which 16S rRNA profiling can be performed, this approach is known to be limited by the short read lengths obtained, sequencing errors (Quince et al. 2009), differences arising from the choice of hypervariable region (Youssef et al. 2009) and problems in the assignment of Operational Taxonomic Units (OTUs) (Huse et al. 2010).
Moreover, estimating abundance and microbial diversity from the 16S rRNA gene alone is challenging for several reasons: (1) it may fail to resolve a substantial fraction of the diversity in a community, given the various biases associated with PCR; (2) it can produce widely varying estimates of diversity, because different hypervariable regions have different power to resolve taxa; (3) it provides only a survey of the taxonomic composition of the microbial community, without information about the biological function of the taxa; and (4) it is limited to the analysis of known taxa, whereas novel or highly diverged microorganisms, or viruses, are difficult to study with this approach. Furthermore, given the prevalence of horizontal gene transfer, the inherent difficulties in defining microbial species, and the limited resolution of the 16S rRNA gene among closely related species, 16S rRNA profiling should be evaluated carefully, in particular when applied to temporal microbial surveys. In one such case, in which both 16S amplicon analysis and a metagenomic approach were used, 1.5- and ~10-times more OTUs were assigned to phyla and genera, respectively, with the metagenomic method than with the 16S rRNA analysis. The 16S rRNA analysis therefore seems to mask several levels of intra-genus differentiation and heterogeneity of the microbial population (Poretsky et al. 2014).
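How strongly a diversity estimate depends on the number of OTUs resolved can be illustrated with the Shannon index, H = −Σ p_i ln p_i, computed from OTU counts. The index is a commonly used diversity measure chosen here for illustration; it is an assumption of this sketch, not a method prescribed by the studies cited above, and the OTU tables are invented.

```python
import math


def relative_abundance(otu_counts):
    """Convert raw OTU counts (dict: OTU name -> count) into proportions."""
    total = sum(otu_counts.values())
    return {otu: n / total for otu, n in otu_counts.items() if n > 0}


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p * ln p) over the observed OTUs."""
    return -sum(p * math.log(p) for p in relative_abundance(otu_counts).values())


# Invented counts for one sample profiled at two resolutions: a coarse profile
# that lumps related lineages, and a finer one that splits them.
coarse_profile = {"genus_1": 700, "genus_2": 300}
fine_profile = {"genus_1a": 400, "genus_1b": 300, "genus_2a": 200, "genus_2b": 100}

print(f"Coarse profile: {len(coarse_profile)} OTUs, H = {shannon_index(coarse_profile):.3f}")
print(f"Fine profile:   {len(fine_profile)} OTUs, H = {shannon_index(fine_profile):.3f}")
```

The same community yields a higher index when lineages that a single marker gene lumps together are resolved separately, which is the kind of masking effect described in the paragraph above.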

SBM is an alternative approach to the study of microbial consortia that avoids the limitations of 16S rRNA analysis. Reads obtained as described above align to various genomic locations of the different genomes present in the sample, including those of viruses. By assembling short reads (e.g. Illumina 100 bp paired-end reads), longer genomic contigs are obtained. The contigs can then be clustered by “binning”, on the basis of their nucleotide composition, such as %GC and tetranucleotide frequency (Wu et al. 2014), allowing the taxonomic assignment of the resulting bins by homology searches. The analysis of the contigs provides access to the functional gene composition of microbial communities, giving a much broader and more detailed description than phylogenetic surveys based on 16S rRNA profiling. Moreover, by using suitable sequence databases (nucleotide and amino acid), SBM is useful to obtain genetic information on potentially novel biocatalysts, to reveal correlations between function and phylogeny for uncultured organisms, and to study the evolutionary profiles of microbial communities. However, the assembly process is generally affected by the fact that a region covered by a single read (low coverage) is known with lower confidence than a region covered by multiple reads (high coverage). This implies that in a complex microbial community sequenced at low coverage, few reads are likely to cover the same DNA fragment, which affects the outcome of the assembly. Nevertheless, without assembly, it is impossible to analyse longer and more complex genetic loci such as CRISPRs (Sun et al. 2016). Despite the clear benefits, metagenomic sequence data are not free of challenges, since they are generally complex and large, requiring specific hardware for storage and processing to avoid computational issues. In addition, environmental metagenomic samples may contain contaminating DNA, for example from animals and plants, which diverts useful reads from the microbial analysis.
Determining which reads were generated from the genome of a detected contaminant can be problematic, especially when the contaminant is abundant or has a large genome. SBM is generally more expensive than 16S rRNA sequencing, especially for complex communities requiring deeper sequencing. However, the cost reduction of NGS has dramatically increased the number of metagenomic projects, making these approaches a direct substitute for 16S rRNA profiling. This expansion is reflected in the high number of bioinformatics tools and data resources that have been developed in the last six years and are available for SBM analysis. Many of them work in a command-line environment or are web-based tools, which centralize metagenome data management and analysis, providing a ready-to-use interface but little customization of the analysis. For a complete and detailed overview of metagenomic tools and strategies, see the excellent review of Thomas et al. (2012).
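The composition features used for binning, %GC and tetranucleotide frequency, can be computed in a few lines. The following sketch is only an illustration of these two features and of a simple distance between contigs; it is not the procedure of any specific binning tool, and the Euclidean distance and the function names are assumptions. Actual binners usually also merge reverse-complement tetramers and combine composition with coverage information.

```python
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of G and C bases in an uppercase DNA string."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_freq(seq):
    """Vector of 256 tetranucleotide frequencies (windows with N etc. are skipped)."""
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        tetramer = seq[i:i + 4]
        if tetramer in counts:
            counts[tetramer] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def composition_distance(contig_a, contig_b):
    """Euclidean distance between the tetranucleotide profiles of two contigs."""
    fa, fb = tetranucleotide_freq(contig_a), tetranucleotide_freq(contig_b)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5


# Invented toy contigs: two AT-rich fragments and one GC-rich fragment.
contig_1 = "ATATTAATTTAAATATTATAATTTAAAT"
contig_2 = "TTAATATATTTAAATTATATAATATTTA"
contig_3 = "GCGGCCGCGGGCCCGCGCCGGGCGCGGC"
print("GC content:", [round(gc_content(c), 2) for c in (contig_1, contig_2, contig_3)])
print("distance 1-2:", round(composition_distance(contig_1, contig_2), 3))
print("distance 1-3:", round(composition_distance(contig_1, contig_3), 3))
```

Contigs whose profiles lie close together are candidates for the same bin, which can then be assigned taxonomically by homology searches as described above.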

In contrast to the metagenomic approach, single-cell genomics is aimed at the analysis of genomes one cell at a time (Blainey and Quake 2014) (Fig. 2). This approach requires the separation of individual cells from a complex environmental sample (e.g. sediments or microbial mat), cell lysis, and the amplification and sequencing of genomic DNA (gDNA). The first step is the isolation of individual cells from the primary samples to obtain a suspension of viable single cells. However, this step can be challenging when the primary sample requires mechanical or enzymatic dissociation (e.g. sediments from a mud pool), since the cells must be kept viable without biases for specific subpopulations. After lysis of the individual cells, the gDNA is amplified by MDA (Lasken 2012; Zong et al. 2012). Generally, the resulting single amplified genomes (SAGs) are screened by 16S rRNA profiling for a preliminary survey and the identification of candidate phyla or other taxa. SAGs of interest are then deep sequenced on NGS platforms, assembled, and analyzed. By analysing the number of single-copy conserved markers in the assembled sequences, it is possible to evaluate how well a given SAG covers the genome of the target microorganism (Rinke et al. 2013). Although single-cell genomics is a useful tool for the study of unknown and uncultivable microorganisms, in particular from extreme environments such as geothermal areas, it has several critical issues. For instance, the amplification protocols can introduce chimeric artefacts and a severe bias in genomic coverage. To overcome these problems during assembly, specific methods have been developed to analyse single-cell genomic datasets by combining data from closely related single cells clustered by nucleotide percentage identity (Rinke et al. 2013). The resulting assemblies can often represent nearly complete pangenomes for a given strain or species, allowing the detailed analysis of genes and pathways.
Today, metagenomics and single-cell genomics can be considered complementary approaches, because the former is affected neither by amplification issues nor by problems related to the separation of individual cells from a complex primary sample, while the latter is able to associate phylogeny and function directly and unambiguously (Walker 2014).
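The evaluation of SAG coverage from single-copy conserved markers can be sketched as a simple count: the fraction of an expected marker set that is detected at least once, together with the fraction detected more than once, which hints at contamination or chimeric assembly. The marker names below are merely illustrative and do not reproduce the marker set used in the cited work; the function name is hypothetical.

```python
from collections import Counter


def marker_completeness(found_markers, marker_set):
    """Estimate genome completeness and redundancy from single-copy markers.

    found_markers: list of marker names detected in the assembly (with repeats)
    marker_set:    set of single-copy markers expected once per genome
    Returns (completeness, redundancy), both as fractions of the marker set.
    """
    counts = Counter(m for m in found_markers if m in marker_set)
    completeness = len(counts) / len(marker_set)
    redundancy = sum(1 for c in counts.values() if c > 1) / len(marker_set)
    return completeness, redundancy


# Illustrative marker set and detections for a hypothetical SAG assembly.
expected = {"rpoB", "recA", "gyrB", "rpsC"}
detected = ["rpoB", "recA", "recA", "unrelated_gene"]
completeness, redundancy = marker_completeness(detected, expected)
print(f"completeness: {completeness:.0%}, redundancy: {redundancy:.0%}")
```

A low completeness is expected for many SAGs because of the amplification bias discussed above, which is one motivation for co-assembling closely related single cells.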

Indeed, a combination of these methods was recently used to identify two novel candidate phyla, Calescamantes and Candidatus Kryptonia, through the analysis of different SAGs and metagenomic databases collected in different high-temperature environments (Eloe-Fadrosh et al. 2016; Kim et al. 2015), proving that, although metagenomics and single-cell genomics are informative on their own, the results of the mixed approach can be greater than the sum of their parts.

4 Geographical distribution of microbiomes

Terrestrial surface hot springs (T > 65 °C), which are spread all over the world, offer a remarkable source of biodiversity. Hereinafter, we report on the state of the art of the metagenomic survey of several hot springs worldwide (Fig. 4). The microbial and viral metagenomic data are summarized in Tables 1 and 2, respectively.

Fig. 4

Map of the geothermal sites described in this review and reported in Tables 1 and 2

Table 2 State-of-the-art of the viral metagenomics in geothermal environments worldwide

4.1 Yellowstone National Park, USA

The Yellowstone geothermal complex includes more than 10,000 thermal sites, such as hot springs, vents, geysers, and mud pools, showing broad ranges of pH, temperature and geochemical properties. One of the first detailed environmental and microbiological surveys was reported in 2005 and included three different hot springs in YNP, i.e. the Obsidian Pool (ObP) (80 °C, pH 6.5), the Sylvan Spring (SSp) (81 °C, pH ~ 5.5), and the Bison Pool (BP) (83 °C, pH ~ 8.0) (Meyer-Dombard et al. 2005). Among archaea, the Obsidian and Bison Pools are inhabited mostly by different groups of uncultured crenarchaeota or by members of the family Desulfurococcaceae (in ObP and BP, respectively), while the bacteria belong to the genera Thermocrinis and Geothermobacterium and to the phylum Proteobacteria. On the other hand, Hydrogenothermus was the most abundant bacterial genus in SSp, together with a dominance of the families Desulfurococcaceae and Thermoproteaceae among archaea (Meyer-Dombard et al. 2005). More recently, Inskeep and co-workers, in an extensive metagenomic survey of the microbial species in the YNP, reported on the identification of the predominant microbial populations, metabolic features, and the relationship between geochemical conditions and gene expression in five geochemically dissimilar high-temperature environments, namely Crater Hills (CH; 75 °C, pH 2.5), Norris Geyser Basin (NGB; 65 °C, pH 3.0), Joseph’s Coat (JCHS; 80 °C, pH 6.1), Calcite (CS; 75 °C, pH 7.8), and Mammoth Hot Springs (MHS; 71 °C, pH 6.6) (Inskeep et al. 2010). Specifically, binning and fragment recruitment approaches revealed that archaea of the orders Sulfolobales (in CH and NGB) and Thermoproteales (in JCHS) mainly dwelt in high-temperature acidic springs. Moreover, the results suggested that the relative abundance of Thermoprotei was modulated by differences in pH and/or in the concentration of dissolved O2.
By contrast, bacteria, mainly belonging to the order Aquificales, outnumbered archaea at pH values above 6.0 (CS and MHS). In particular, most reads showed nucleotide identity with Sulfurihydrogenibium sp. Y03AOP1 in MHS and with Thermus aquaticus and Sulfurihydrogenibium yellowstonense in CS (Inskeep et al. 2010). To date, the widest investigation of microbial communities in hyperthermophilic environments (known as the YNP metagenome project) spans over 20 different geothermal sites in YNP, 13 of which show temperatures above 65 °C. These sites have been grouped into two ecosystem types based on features such as pH, temperature, and the presence of dissolved sulfide and elemental sulfur, which are the main determinants shaping the resident microbiome (Inskeep et al. 2013b). The first ecosystem, populated by Aquificales-rich “filamentous-streamer” communities, was identified in six sites: Dragon Spring (DS; 68–72 °C, pH 3.1), One Hundred Springs Plain (OSP_14; 72–74 °C, pH 3.5), Octopus Spring (OS; 74–76 °C, pH 7.9) and Bechler Spring (BCH; 80–82 °C, pH 7.8), together with MHS and CS described above (Inskeep et al. 2013b; Takacs-Vesbach et al. 2013). The second is represented by seven archaeal-dominated sediments: Nymph Lake (NL; 88 °C, pH 4), Monarch Geyser (MG; 78–80 °C, pH 4.0), Cistern Spring (CIS; 78–80 °C, pH 4.4), Washburn Spring (WS; 76 °C, pH 6.4), and One Hundred Springs Plain (OSP_8; 72 °C, pH 3.4), together with CH and JCHS described above (Inskeep et al. 2013a, b). The diversity detected among sites with similar characteristics suggested that additional geochemical and geophysical factors, such as the total dissolved organic carbon (DOC) and the amount of solid-phase carbon, could play a role in the consortia composition (Inskeep et al. 2013a, b; Takacs-Vesbach et al. 2013).
A similar observation, but mainly related to differences in the SO₄²⁻/Cl⁻ ratio, was recently reported in the analysis of three YNP thermal springs sharing a pH of about 4: Norris (NOR; 84 °C, pH 4.34), Mary Bay Area (MRY; 80 °C, pH 4.32) and Mud Kettles (MKL; 72 °C, pH 4.35). These springs were populated exclusively by bacteria and dominated by microorganisms belonging to Cyanobacteria (NOR and MRY) and Aquificales (MKL) (Jiang and Takacs-Vesbach 2017).

A combined approach of metagenomics and single-cell analysis conducted in CH and NL revealed the presence of Nanoarchaeota that represent Nanobsidianus stetteri cells, as judged by their high 16S rRNA similarity and their overall genome homology.

This is in agreement with previous findings highlighting the wide distribution of the genus Nanobsidianus in this kind of YNP geothermal environment (Clingenpeel et al. 2013). Furthermore, single-cell and catalyzed reporter deposition-fluorescence in situ hybridization analyses performed on these environmental YNP samples showed the occurrence of a symbiotic association with extreme thermoacidophilic Crenarchaeota hosts, such as Acidicryptum nanophilum, Acidilobus sp., Vulcanisaeta sp. (5%), and Sulfolobus spp.

Genome fragments of Nanobsidianus contain integrated viral sequences. In addition, matching viral DNA sequences were found in the viral fractions isolated from the same hot springs, suggesting that Nanobsidianus species can host viruses or support viral replication (Munson-McGee et al. 2015).

Also worthy of note is the abundance of archaeal DNA and RNA viruses in these environments (Table 2). The first viral metagenomics study (Schoenfeld et al. 2008) led to the identification of double-stranded DNA viruses in the Octopus (93 °C) and Bear Paw (74 °C) hot springs (Inskeep et al. 2013b). Operons and potentially complete genomes were assembled, thus providing insight into the possibly dominant viral populations within each hot spring (López-López et al. 2013; Schoenfeld et al. 2008). Viral metagenomes indicated the predominance of a lytic lifestyle, as suggested by the significant proportion of lys-like genes encoding proteins involved in host cell lysis (López-López et al. 2013; Schoenfeld et al. 2008). This evidence contrasts with the cultured thermophilic crenarchaeal viruses, most of which are non-lytic (Contursi et al. 2006; Prangishvili 2013; Snyder et al. 2015; Wang et al. 2015b). The evidence of the replacement of cellular genes by non-orthologous viral genes (i.e. helicases, DNA polymerases, ribonucleotide reductase, and thymidylate synthase) suggested that viruses might play a critical role in the evolution of DNA and its replication mechanisms (Schoenfeld et al. 2008).

In a further study, Inskeep and co-workers reported the assembly of viral genomes from SBM data collected in several YNP sites (Inskeep et al. 2013a, b). Based on phylogenetic analysis of known viruses, 10 scaffolds from the archaeal-dominated samples were classified as “viral”, although the similarity of the scaffolds to known viruses varied considerably. CRISPR regions, including both spacer regions and direct repeats, were predicted from these assemblies, and near-perfect alignments were found between CRISPR spacer regions and 8 of the 10 viral-like scaffolds (Inskeep et al. 2010, 2013a).

Novel positive-strand RNA viruses have also been discovered in the Nymph Lake hot springs (NL), characterised by high temperature (>80 °C) and low pH (<4) (Bolduc et al. 2012). Three sites (NL10, NL17 and NL18) were selected as putative niches for archaeal RNA viruses based on a viral-fraction-enrichment approach followed by deep sequencing, and two genomic fragments of putative archaeal RNA viruses were identified (Bolduc et al. 2012).

An attempt to link these RNA viral genomes to a specific host type was carried out through the analysis of the CRISPR direct repeat (DR) and spacer content present in cellular metagenomics data sets from the same sites (Bolduc et al. 2012). The majority of the spacer sequences matching the RNA metagenome were associated with DRs of Sulfolobus species (organisms commonly found in NL10), suggesting that this crenarchaeon might host not only DNA viruses (Contursi et al. 2014b; Lipps 2006) but also RNA viruses. Intriguingly, the identification of these spacers might indicate that not only DNA viruses but also archaeal RNA viruses elicit CRISPR-mediated immunity.
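The spacer-to-virus matching described above can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline used in the cited studies: the function names, the mismatch threshold and the toy sequences are all assumptions, and real analyses typically rely on dedicated aligners and CRISPR-detection tools rather than a sliding-window comparison.

```python
# Minimal sketch: link CRISPR spacers (from cellular metagenomes) to viral
# contigs by searching both strands for near-identical matches.
# All names and sequences here are hypothetical.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_spacer_hits(spacers, contigs, max_mismatches=1):
    """Return (spacer_id, contig_id, position, strand, mismatches) tuples.

    spacers and contigs are dicts mapping an identifier to a DNA string.
    A hit is any window of a contig that differs from the spacer (or its
    reverse complement) at no more than max_mismatches positions.
    """
    hits = []
    for sp_id, spacer in spacers.items():
        spacer = spacer.upper()
        queries = (("+", spacer), ("-", reverse_complement(spacer)))
        for strand, query in queries:
            for c_id, contig in contigs.items():
                contig = contig.upper()
                for i in range(len(contig) - len(query) + 1):
                    mm = hamming(query, contig[i:i + len(query)])
                    if mm <= max_mismatches:
                        hits.append((sp_id, c_id, i, strand, mm))
    return hits


# Toy data: one spacer matches the forward strand of a contig,
# the other matches the reverse strand of a second contig.
toy_spacers = {"sp1": "ACGTTGCAAGGCTTAC", "sp2": "GATTACAGATTACAGA"}
toy_contigs = {
    "viral_contig_1": "CCCC" + "ACGTTGCAAGGCTTAC" + "GGGG",
    "viral_contig_2": "AAAA" + reverse_complement("GATTACAGATTACAGA") + "TTTT",
}
toy_hits = find_spacer_hits(toy_spacers, toy_contigs, max_mismatches=0)
```

Because a spacer is acquired from an invading element, a spacer that sits next to the direct repeats of a known host and also matches a viral contig is indirect evidence that the host has encountered that virus. Allowing a small number of mismatches accounts for viral sequence drift, at the cost of more spurious hits.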

The genetic diversity of these newly identified putative archaeal RNA viruses was investigated by searching for similarity throughout global metagenomics datasets (Wang et al. 2015a). The authors were able to obtain nine partial or nearly complete genomes representing novel genogroups or genotypes of the putative RNA viruses previously identified by Bolduc et al. (2012).

Viral sequences were also retrieved from the metagenomic dataset obtained in a recent study on the NL10 site (Menzel et al. 2015). Among the viral families, Lipothrixviridae is the most abundant, and Rudiviridae and Ampullaviridae members were also identified, together with viral sequences assigned to Pyrobaculum spherical virus (PSV) and Thermoproteus tenax virus 1 (TTV1) (Haring et al. 2004; Neumann et al. 1989). The high representation of archaeal viral families is in agreement with the predominance of archaeal species (58.1% of reads assigned to Archaea) in the NL10 site (Gudbergsdottir et al. 2016).

Although the same archaeal species also predominate at the closely located CH1102 site (Sulfur Spring, Temp: 79 °C, pH 1.8), a different viral scenario has been detected there. Indeed, Sulfolobus monocaudavirus (SMV1) is the most abundant virus in CH1102, constituting more than 80% of the identified viral reads (Uldahl et al. 2016). To link the viral genomes to their potential hosts in the samples, CRISPR loci were identified from the cellular part of the CH1102 metagenome where, interestingly, the number of spacers matching the novel SMV genomes was generally very high.

By applying a network approach to a time series of viral metagenomics data collected from high-temperature acidic hot springs at Nymph Lake, Bolduc and co-workers provided proof of concept that the viral assemblage structure and its stability over a 5-year sampling period can be precisely defined (Bolduc et al. 2015). Furthermore, this analysis highlighted the high representation of completely novel archaeal viruses, thus demonstrating that the combination of metagenomic datasets with advanced bioinformatics tools is essential to expand our knowledge of the archaeal virosphere.

A metagenomic approach employed to obtain full genome sequences from a hot basic enrichment sample (85 °C and pH 6.0) collected from ObP (Garrett et al. 2010) led to the identification of two novel genomes, HAV1 (linear) and HAV2 (circular), neither of which showed any clear similarity to other known archaeal viruses (Garrett et al. 2010). Extensive genomic differences were detected among multiple variants of HAV1, possibly resulting from CRISPR-Cas-directed interference by unidentified hosts.

4.2 Iceland

Given its location on a divergent tectonic plate boundary (the mid-Atlantic Ridge), Iceland is studded with active volcanic systems. Among these, metagenomics data are available for two distantly located (about 45 km apart) sites, i.e. Krísuvík (Is3-13) and Grensdalur (Is2-5S). Is3-13 (90 °C, pH 3.5–4.0), belonging to a geothermal complex including solfataras, fumaroles, mud pots and hot springs, had very limited access to organic materials. Is2-5S (85 °C and pH 5.0), instead, receives inflow through streams from other hot springs and is located on a hill rich in organic materials, such as moss and lichens (Menzel et al. 2015). Genomic DNA, extracted from both sediment and water samples, was sequenced using the Illumina HiSeq and analysed by MEGAN (Huson and Weber 2013). Of the several million mapped reads, 19.7 and 33% were assigned to archaeal microorganisms in Is3-13 and Is2-5S, respectively. The analysis showed a predominance of Thermoproteales and Sulfolobales in Is3-13 and of Crenarchaeota, mainly of the Pyrobaculum genus, in Is2-5S. Bacteria, on the other hand, were represented mainly by Proteobacteria (in Is3-13), including Gamma- and Beta-proteobacteria, and by a large population of Aquificales (in Is2-5S), mostly belonging to the species Thermocrinis albus and Sulfurihydrogenibium azorense. By comparing their data with those available for other geothermal locations worldwide, the authors concluded that the community structure is strongly influenced by environmental parameters rather than geographic distance (Menzel et al. 2015).

The viral community composition and the relative abundance of viruses in the Is2-5S and Is3-13 sites are quite different. In both cases, crenarchaeal viral sequences are highly represented despite the predominance of bacterial species. This apparent discrepancy might be due to a compositional bias in the reference database, since most of the thermophilic viruses have been isolated from archaeal hosts (Gudbergsdottir et al. 2016).

The non-crenarchaeal viral order Caudovirales, composed of head-tail viruses infecting members of Bacteria and Euryarchaeota, is most abundant in Is2-5S (Krupovic et al. 2011). Conversely, the Is3-13 site is mostly populated by Ampullaviridae members, which constitute only a small percentage of all the viral sequences in the former sample. Furthermore, sequences referable to Bicaudaviridae, one of the most widely represented crenarchaeal families in hot springs (Wang et al. 2015b), are absent in the Is2-5S metagenome (Prangishvili 2013; Snyder et al. 2015). Interestingly, the longest contig in the Is3-13 metagenome was assigned to a nearly complete Acidianus bottle-shaped virus (ABV)-like genome (Haring et al. 2005).

Common to both sites are contigs assigned to the Rudiviridae family (Prangishvili 2013; Snyder et al. 2015; Wang et al. 2015b) and viral sequences assigned to Pyrobaculum spherical virus (PSV) (Haring et al. 2004) and Thermoproteus tenax virus 1 (TTV1) (Neumann et al. 1989). Accordingly, the metagenomes also contain sequences assigned to their archaeal hosts, Pyrobaculum and Thermoproteus tenax (Menzel et al. 2015).

Unique to the Is2-5S site is a 20 kb contig representing a novel incomplete viral genome that, as suggested by CRISPR spacer analysis, is likely to infect Hydrogenobaculum, a host for which no virus has been reported before (Gudbergsdottir et al. 2016; Romano et al. 2013). This is remarkable, as a small but significant percentage of cellular reads in the Is2-5S metagenome were assigned to Hydrogenobaculum, supporting the CRISPR analysis (Menzel et al. 2015).

4.3 Kamchatka Peninsula, Russia

In the Kamchatka peninsula, also known as the land of fire, an extended volcanic region of approximately 472,300 km², three different areas, Uzon (81 °C, pH 7.2–7.4), Kam37 (85 °C, pH 5.5) and Mutnovsky (70 °C, pH 3.5–4.0), were surveyed to study the diversity of their microbial communities (Chernyh et al. 2015; Eme et al. 2013; Merkel et al. 2017; Wemheuer et al. 2013). Besides showing that uncultivated members of the Aquificales, Euryarchaeota, Crenarchaeota, and a Miscellaneous Crenarchaeotic Group dominated in Kam37, this analysis also led to the discovery of two ancient (hyper)thermophilic archaeal lineages, namely Hot Thaumarchaeota-related Clade 1 and Hot Thaumarchaeota-related Clade 2. Thaumarchaeota, along with Proteobacteria and Thermotogae, thrive in the Uzon and Mutnovsky sites as well. These results further support the earlier assumption that comparable environmental conditions result in similar microbial communities, as in the case of ObP in YNP and the Uzon Caldera hot spring, which share geochemical features as well as microbial community structures (Meyer-Dombard et al. 2005; Simon et al. 2009). A recent microbial census by Merkel and co-workers covered several hot springs with temperatures >65 °C spread across Uzon and Mutnovsky (Sery: 80 °C, pH 6.1; Thermofilny: 67 °C, pH 6.1; Bourlyashchy: 82 °C, pH 7.0; Izvilist: 77 °C, pH 5.9; 3423: 72 °C, pH 5.0; 3462: 72 °C, pH 5.1; 3460: 68 °C, pH 6.1; 3401: 90 °C, pH 3.5; 3404: 70 °C, pH 6.0). It revealed that, as observed in other thermal habitats (e.g. YNP), bacteria belonging to the genus Sulfurihydrogenibium are the most abundant and widely distributed group of lithoautotrophic prokaryotes in these environments, as previously reported for the Bourlyashchy hot spring, the hottest thermal pool of Uzon (Chernyh et al. 2015).
These microorganisms are the only dominant representatives of Aquificae in the springs analysed, and occur together with other lithoautotrophic bacteria such as Caldimicrobium and Thermocrinis. This indicates that reduced sulfur compounds, such as dissolved hydrogen sulfide, are the primary energy source for lithoautotrophic carbon assimilation. Because of the simultaneous presence of both aerobic (e.g. Sulfurihydrogenibium) and anaerobic (e.g. Caldimicrobium) microorganisms in these hot springs, the authors suggest that aerobic sulfur oxidation, anaerobic hydrogen oxidation, and the reduction of sulfur compounds are the main energy-yielding processes at these sites (Merkel et al. 2017).

4.4 Furnas Valley, Azores

The Furnas Valley (Island of São Miguel) is the main geothermal area of the Azores archipelago. Unlike YNP and Iceland, here the largest spring is alkaline and located on high ground, whereas the smaller ones are in the valley and are more acidic (Brock and Brock 1967). Recently, nine sites showing a wide range of physico-chemical characteristics (51–92 °C; pH 2.5–8.0) were explored, i.e. AI–AIV near Caldeira Do Esgucho, BI–BIII near Caldeirão and CI and CII near Caldeira Asmodeu (Sahm et al. 2013). In this study several approaches were used to assess the prokaryotic diversity, i.e. fluorescence in situ hybridization (FISH), analysis of 16S rRNA and denaturing gradient gel electrophoresis (DGGE). The AI site (51 °C, pH 3.0) was found to be populated by few Euryarchaeota, mainly belonging to the genus Thermoplasma, whereas Proteobacteria dominated among bacteria (80%), especially genera known for their acidophilic (Acidicaldus) and chemolithoautotrophic (Acidithiobacillus) lifestyles. By contrast, in the AIV site (92 °C, pH 8.0) an even distribution of archaea (35%) and bacteria (40%) was detected. In particular, the phyla Thermotogae (genus Fervidobacterium), Firmicutes (genus Caldicellulosiruptor) and Dictyoglomi (genus Dictyoglomus) were abundant among bacteria, whereas the archaeal population was almost exclusively composed of Crenarchaeota belonging to the Desulfurococcaceae and Thermoproteaceae families (Sahm et al. 2013). A general observation was that, once again, pH was the predominant parameter influencing microbial complexity in the different areas surveyed, with the highest bacterial diversity detected at sites where temperatures and pH values ranged over 55–85 °C and 7.0–8.0, respectively. Intriguingly, unlike other hot-spring environments where Aquificales are dominant, here heterotrophic bacteria prevail. To explain this, the authors suggested that a 20- to 400-fold higher DOC in the Furnas springs could account for the abundance of heterotrophic bacteria (Sahm et al. 2013).

4.5 Sungai Klah, Malaysia

The Sungai Klah (SK) hot spring in Malaysia is surrounded by a wooded area and is therefore continuously fed by plant-derived material, which results in a higher total organic carbon (TOC) content than in the other 60 geothermal sites present in Malaysia. Moreover, three additional key factors were found to be characteristic of SK: (1) the temperature exceeds 100 °C in many spring pools along the main stream; (2) it fluctuates between 50 and 110 °C throughout the stream; and (3) the pH is not uniform and spans from 7.0 to 9.0 along the stream. In general, SK is a shallow and fast-flowing stream with temperatures of 75–85 °C and pH 8.0. Samples retrieved from this area were studied through 16S rRNA gene profiling (Chan et al. 2015). This approach led to the identification of 83 phyla, among which the predominant were Firmicutes, Proteobacteria, Chloroflexi, Bacteroidetes, Euryarchaeota and Crenarchaeota. Interestingly, by studying sequence affiliations, the authors could highlight a relationship between the population diversity and the geochemical parameters within the hot spring. Moreover, it was shown that microbial communities were able to survive by exploiting different symbiotic strategies to prosper under multiple environmental stresses (Chan et al. 2015).

4.6 Tengchong, China

Located on the northeastern edge of the Tibet–Yunnan geothermal zone between the Eurasian and Indian plates, Tengchong (China) is one of the most active geothermal areas in the world, with the Rehai (Hot Sea) and Ruidian geothermal fields characterized by the most intense geothermal activity. While mainly large pools with neutral pH (e.g. Gongxiaoshe and Jinze) are located in Ruidian (Wang et al. 2014), Rehai hosts several types of hot springs showing a wide range of physico-chemical conditions, such as temperatures from 58 to 97 °C and pH values between 1.8 and 9.3. Examples are: (1) small-source, high-discharge springs (Gumingquan and Jiemeiquan); (2) small, shallow acidic mud pools characterized by a decreasing temperature gradient (Diretiyanqu); (3) shallow acidic pools like Zhenzhuquan; and (4) shallow springs with multiple geothermal sources such as Shuirebaozha (Hou et al. 2013). Apart from a few metagenomic studies mainly focused on Crenarchaeota (Song et al. 2010) or ammonia-oxidizing archaea (Jiang et al. 2010), the study by Hou et al. (2013) is the first report of a wide-ranging survey of the microbial community in 16 different hot springs of Tengchong, aimed at shedding light on the relationship between the diversity of the thermophilic microbial communities and local geochemical conditions. In particular, the predicted number of OTUs as well as the Shannon and equitability indexes, based on 16S rRNA gene sequence data, were used to highlight correlations between microbial diversity and environmental geochemistry. Analysis of these indexes using the Mantel test revealed higher microbial richness, equitability, and diversity in Ruidian than in Rehai. The authors concluded that this is mainly due to differences in pH, temperature and TOC between the two fields (Hou et al. 2013). The effect of seasonal changes on the microbial diversity in Tengchong hot springs, which are located in a subtropical area with heavy seasonal monsoon rainfall, has been studied by Briggs et al. (2014) and Wang et al. (2014). Specifically, they compared the samples collected between June and August (rainy season) with those of Hou and colleagues, sampled in January during the dry season. By doing so, they revealed that Ruidian sediments contained more diverse microbial lineages than Rehai sediments thanks to the neutral pH and moderate temperatures, and that neutral springs contained similar microbial lineages in January and June, while in August a single dominant lineage of Thermus emerged. Once again, pH turned out to be the primary factor influencing the microbial community, followed by temperature and DOC (Wang et al. 2014). Overall, both studies indicated that temperature, pH, and other geochemical conditions play a key role in shaping the microbial community structure in Tengchong hot springs over the seasons.
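As an illustration of the diversity indexes mentioned above, the following sketch computes the Shannon index and an equitability (evenness) value from a vector of OTU counts. It assumes the common definitions H' = −Σ p_i ln p_i and J' = H' / ln S (Pielou's evenness), with S the number of observed OTUs; the exact estimators used in the cited studies may differ, and the counts in the usage note are invented purely for illustration.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one OTU must have a non-zero count")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def equitability(counts):
    """Pielou's evenness J' = H' / ln(S), where S is the observed richness.

    Returns 0.0 when fewer than two OTUs are observed, since evenness
    is not defined for a single-OTU community.
    """
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        return 0.0
    return shannon_index(counts) / math.log(richness)
```

For a hypothetical sample with four equally abundant OTUs, `shannon_index([10, 10, 10, 10])` equals ln 4 and `equitability` is 1, whereas a sample dominated by a single lineage, such as `[97, 1, 1, 1]`, yields a much lower value for both. This is the kind of contrast reported between neutral springs in different months.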

4.7 Taupo Volcanic Zone, New Zealand

In the central area of New Zealand’s North Island, a group of high-temperature geothermal systems is known as the Taupo Volcanic Zone (TVZ). Within this area, the Waiotapu region is characterized by a large number of springs exhibiting elevated arsenic concentrations (Mountain et al. 2003). With its 65-meter diameter, the Champagne Pool (CP) represents the largest geothermal feature of Waiotapu, with arsenic concentrations ranging from 2.9 to 4.2 mg/L (Hedenquist and Henley 1985). The inner rim of CP is characterized by a subaqueous amorphous As-S precipitate responsible for its characteristic orange color (Jones et al. 2001). Whereas convection stabilizes the water temperature around 75 °C, the surrounding silica terrace (Artist’s Palette) shows lower temperatures (~45 °C). In order to understand the evolution of arsenic resistance in sulfidic geothermal systems, Hug and co-workers studied the microbial contributions to coupled arsenic and sulfur cycling at CP. To this aim, the authors sampled four different CP areas with distinctive physical and chemical features: (1) the inner pool (CPp), (2) the inner rim (CPr), (3) the outflow channel (CPc), and (4) the outer silica terrace (AP). The concentration of total dissolved arsenic was found to be 3.0, 2.9, 3.6, and 4.2 mg/L at sites CPp, CPr, CPc and AP, respectively. Moreover, total dissolved sulfur concentrations were rather even across all the sites analysed, i.e. 91–105 mg/L (Hug et al. 2014). It was previously reported that the combination of dissolved toxic metal(loid)s and high temperature in hot springs represents a strong selective pressure on the inhabiting microbial communities (Hirner et al. 1998). Indeed, sulfide ions, commonly used as electron donors/acceptors by microorganisms under geothermal conditions, are highly reactive with arsenic (Macur et al. 2013).
Therefore, microbially mediated sulfur cycling could exert an indirect, but profound, influence on arsenic speciation by affecting the concentration of thioarsenate species (Stauder et al. 2005). To shed light on the potential microbial contributions to arsenic speciation in CP, and to characterize the microbial diversity, total genomic DNA was extracted from sediments and analyzed by deep sequencing (Hug et al. 2014). This analysis revealed that sequences assigned to the domain Archaea belonged mostly to the genera Sulfolobus, Thermofilum, Pyrobaculum, Desulfurococcus, Thermococcus, and Staphylothermus, and that the percentage of total Archaea was 28, 21, 12 and 2% at CPp, CPc, CPr and AP, respectively. On the other hand, the most abundant sequences belonging to Bacteria in all sites were closely related to the genus Sulfurihydrogenibium. According to these authors, the combination of sulfide dehydrogenase and sulfur oxygenase–reductase encoding genes, detected as the major sulfur oxidation genes at CPp, suggests a two-step sulfide oxidation process to sulfite and thiosulfate, also producing sulfide. The biogenic sulfide produced would then be available to transform arsenite to monothioarsenate. Interestingly, the whole metagenomic analysis made it possible to unravel the impact on sulfur speciation of genes underpinning sulfur redox transformations, thus highlighting a microbial role in the sulfur-dependent transformation of arsenite to thioarsenate (Hug et al. 2014).

4.8 Phlegraean Fields, Italy

An extended area to the west of Naples, South Italy, known as the Phlegraean Fields, comprises 24 craters and volcanic features, mostly lying underwater. In the middle of this area lies the Solfatara volcano, which is one of the youngest volcanoes formed within this active volcanic field (Isaia et al. 2009; Orsi et al. 1996; Petrosino et al. 2012; Rosi and Santacroce 1984). The site with the most intense geothermal activity is the Pisciarelli area which, despite its limited extension (about 800 m²), features over 20 physically and chemically different springs and mud holes. Besides sulfide, arsenic is one of the most prominent heavy metals detectable in this high-temperature environment (Huber et al. 2000), thus suggesting the presence of hyperthermophilic microorganisms able to use these compounds for their metabolism, similarly to what was shown by Hug et al. (2014) in the Champagne Pool (New Zealand). Recently, the Solfatara volcano (It6) and the Pisciarelli hot spring (It3) were analysed by Menzel and colleagues with the aim of defining the biodiversity, genome contents and inferred functions of bacterial and archaeal communities (Menzel et al. 2015). The It6 sample (76 °C, pH 3.0, water/sediment) was subjected to Illumina HiSeq sequencing, whereas the sample from the It3 site (86 °C, pH 5.5, water/sediment) was sequenced using the Roche/454 Titanium (Menzel et al. 2015). This study showed that in It6 78.6% of the mapped reads were assigned to Bacteria (phyla Proteobacteria and Thermoprotei), while only 17.6% were assigned to Archaea (phyla Crenarchaeota and Euryarchaeota). Conversely, It3 was mainly populated by Archaea (96.6%), including genera such as Acidianus, Sulfolobus and Pyrobaculum (Menzel et al. 2015). When these data were compared with those from previous studies (Inskeep et al. 2013a; Sahm et al. 2013; Urbieta et al. 2014; Wemheuer et al. 2013), the authors concluded that environmental chemico-physical parameters are the major determinants in shaping the structure and composition of the microbial community (Menzel et al. 2015).

The It3 and It6 metagenomes were screened for viral sequences as well (Table 2). A feature common to these two sites is the predominance of Lipothrixviridae (Prangishvili 2013; Snyder et al. 2015; Wang et al. 2015b) viral sequences, representing 43.5 and 81.7%, respectively (Gudbergsdottir et al. 2016). Despite the high abundance of contigs assigned to this family, the absence of a complete lipothrixviral genome in the metagenomes could indicate the presence of multiple related but not identical genomes of similar abundance.

Acidianus two-tailed virus (ATV)-like sequences are also abundant in these two Italian metagenomes, in agreement with the fact that ATV, the single member of Bicaudaviridae, was originally isolated around 10 years ago from the It6 site (Prangishvili 2013; Snyder et al. 2015; Wang et al. 2015b). However, no full genome of an ATV-like virus was recovered (Gudbergsdottir et al. 2016). Conversely, a long Acidianus bottle-shaped virus (ABV)-like contig was identified in the It6 metagenome; this linear genome, belonging to a novel representative of the Ampullaviridae family, was judged to be complete based on the presence of inverted terminal repeats. In addition, another long ARV-like contig identified in the same metagenome (Gudbergsdottir et al. 2016) was assigned to a new member of the Rudiviridae family (Prangishvili 2013; Snyder et al. 2015; Wang et al. 2015b).

The It3 metagenome also contains a ≈20 kbp contig assigned to HAV2 (see above) (Garrett et al. 2010). However, since only one ORF showed similarity (38% amino acid identity) to an HAV2 gene, the authors raised the possibility that this contig was part of a novel viral genome (Gudbergsdottir et al. 2016). The CRISPR analysis revealed that four spacers in the It3 metagenome, originating from the hyperthermophilic crenarchaeon Pyrobaculum, matched this novel genome, indicating that the contig is an extracellular sequence of either viral or plasmid origin (Gudbergsdottir et al. 2016).

4.9 Lassen Volcanic National Park, USA

Viral metagenomics was carried out on samples from the Boiling Springs Lake (BSL), an acidic, high-temperature lake (temperature ranging between 52 and 95 °C with a pH of approximately 2.5) located in Lassen Volcanic National Park, USA (Table 2; Diemer and Stedman 2012). The study revealed the presence of a unique circular, putatively single-stranded DNA virus, named RDHV (RNA-DNA hybrid virus; Diemer and Stedman 2012). Indeed, this viral genome harboured genes homologous to those of both ssRNA and ssDNA viruses, with the ORFs arranged in an uncommon orientation. Intriguingly, the hybrid nature of this virus was explained by the occurrence of an interviral RNA-DNA recombination event in which a DNA circovirus-like progenitor acquired a capsid protein gene from a ssRNA virus via reverse transcription and recombination (Faurez et al. 2009). Mining environmental sequence databases for similar genetic configurations allowed the identification of three candidate BSL RDHV-like genomes, thus indicating that BSL RDHV is not endemic to Boiling Springs Lake. Such recombination events, although infrequent, might constitute one of the driving forces for the evolution of novel viruses originating through genetic exchange between distinct virus lineages (Diemer and Stedman 2012).

4.10 Los Azufres National Park, Mexico

To date, the only microbial survey carried out in the Los Azufres National Park (Mexico) was reported by Brito and co-workers (Brito et al. 2014), who analyzed five different samples, among which AM1 (87 °C; pH 3.4) is the only one with a temperature >65 °C. AM1, collected from the main geyser present in the “Los Azufres spa”, showed high concentrations of metals such as Zn and Mn and of heavy metals (Hg, Pb and Fe), up to 1000-fold above the EPA and WHO drinking water standards. T-RFLP and 16S rRNA gene analysis of the sites revealed an overall low bacterial diversity and, in particular, that AM1 was dominated exclusively by a microorganism related to Lysobacter spp., previously identified in different extreme environments. This result suggests that the bacterial community of this site, compared with the others in the area at lower temperature, was mainly influenced by the concentration of Zn and Mn and by temperature (Brito et al. 2014).

A study performed by Servín-Garcidueñas et al. (2013a, b) identified the consensus sequence of a novel archaeal rudivirus (SMR1) as well as of a new member of the family Fuselloviridae (SMF1) by assembly of metagenomic reads. Despite the large geographical distance from the locations of other sequenced rudiviruses and fuselloviruses, SMR1 and SMF1 retained a core set of conserved genes specific to Rudiviridae and Fuselloviridae, respectively. These genes were inferred to be important for the viral life cycle, and their occurrence in the genomes of geographically separated viruses supported the hypothesis of exchange of genetic material over intercontinental distances (Servín-Garcidueñas et al. 2013a, b).

5 Enzyme discovery

One of the driving interests in the metagenomics of geothermal environments is the discovery and exploitation of a rich pool of uncharacterised metabolic pathways as well as of novel thermostable enzymes (thermozymes) with biochemical characteristics evolved to accommodate the unique environments that the microbes reside in (Bartolucci et al. 2013). Specifically, thermozymes exhibit an intrinsic stability to common chemical and physical protein denaturants and are therefore of great interest in biotechnological applications, representing a valuable alternative to the available enzymes from mesophiles (Cobucci-Ponzano et al. 2015; Sharma et al. 2012).

5.1 Sequence-based function prediction

SBM enables the construction of data banks of all the genes present in a geothermal sample. In silico screening for sequence similarity or for the presence of conserved motifs, followed by PCR amplification of the selected genes, can allow the production of enzymes of potential applicative interest. Frequently used databases for functional annotation include the SEED annotation system, the KEGG orthology (KO) database and the Pfam database (Finn et al. 2016; Kanehisa et al. 2014; Overbeek et al. 2014). The identified genes can then be cloned and expressed in conventional hosts from native or inducible promoters, and the recombinant enzymes can be purified and characterized in detail. This approach enormously increases the number of genes available for analysis compared with genomic screening, which requires the isolation of microbial strains, and it is especially convenient for extremophilic microorganisms, whose isolation is particularly challenging. On the other hand, this approach also has limitations. Firstly, the databases may be subject to phylogenetic biases, as some communities are more accurately annotated than others. Secondly, since the prediction of gene function is based on sequence similarity to the relatively few genes and pathways already characterized in the public databases, about 50% of the genes in genomes are currently defined as “hypothetical” or as proteins of unknown function. Thirdly, the presence of a gene in a metagenome does not mean that it is expressed. To increase the probability of finding active functional genes involved in the uptake and transformation of a substrate, some studies use activity-based screenings on expression libraries (functional metagenomics) or a substrate-induced enrichment of the community before the mDNA extraction (see below).
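The in silico motif screening step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the G-x-S-x-G pentapeptide, a consensus found in many serine hydrolases, is used as an example motif, and the sequences, identifiers and function names (`parse_fasta`, `screen_for_motif`) are hypothetical and not taken from any of the studies cited here. A real screening would typically rely on profile-based searches against curated databases rather than on a single regular expression.

```python
import re

# Example motif: G-x-S-x-G, a consensus found in many serine hydrolases
# (esterases/lipases). Chosen only for illustration.
MOTIF_PATTERN = "G.S.G"


def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the identifier.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def screen_for_motif(records, pattern=MOTIF_PATTERN):
    """Return {identifier: [0-based start positions]} for sequences with a hit."""
    # A lookahead is used so that overlapping occurrences are also reported.
    finder = re.compile(f"(?=({pattern}))")
    hits = {}
    for name, seq in records.items():
        positions = [m.start() for m in finder.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


# Toy predicted proteins (hypothetical sequences, not real metagenome data).
TOY_FASTA = """>orf_001 hypothetical
MKTLLAVGHSMGGALA
>orf_002 hypothetical
MSTNPKPQRKTKRNT
>orf_003 hypothetical
MAGDSAGGNLAGASAG
"""

candidates = screen_for_motif(parse_fasta(TOY_FASTA))
print(candidates)
```

With these toy sequences, `orf_001` and `orf_003` are retained as candidates for PCR amplification and cloning, while `orf_002`, which lacks the motif, is discarded.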

5.2 Functional metagenomics

Functional metagenomics is a powerful alternative, or complement, to SBM. It relies on the construction of metagenomic libraries by cloning environmental DNA into expression vectors and propagating them in appropriate hosts, followed by activity-based screening (Fig. 2). Depending on the size of the insert, functional metagenomics can be explored using vectors carrying short (<10 kb) or long (200 kb) inserts. Larger inserts have the potential to carry entire gene clusters together with their own promoter sequences, allowing the expression of more enzymes. By using an appropriate screening method, genes expressing a particular enzymatic activity can be identified and their products characterized. Lambda phage-based expression vectors offer the possibility of screening for particular enzymatic activities directly on phage plaques. Indeed, E. coli cells are lysed at the end of the infection cycle and the translated metagenomic proteins are released into the extracellular matrix. The main advantages of this approach are: (1) isolation of entire genes, (2) direct identification of enzymes fully active in recombinant form, and (3) identification of novel enzymatic activities whose functions would not be predicted from the available sequence databases (López-López et al. 2014).

The description of the thermozymes isolated through functional metagenomics goes beyond the aims of this manuscript. One of the main limitations of this approach is the expression of heterologous genes in E. coli (Gabor et al. 2004). Although the commonly used E. coli strains have relaxed requirements for promoter recognition and translation initiation, proteins may not fold correctly in this host, and many genes from extremophilic environmental samples are not translated efficiently, especially those belonging to archaea, which might show a codon usage bias (Prato et al. 2008). Alternative hosts and broad-host-range vectors may be required to overcome these limitations (Angelov et al. 2009; Cheng et al. 2014). Recently, Leis and co-workers identified novel esterases from hot springs in the Azores by complementing the growth of a custom, esterase-deficient strain of Thermus thermophilus. By using this method, they uncovered several enzymes from underrepresented species of archaeal origin, leading to the identification of new biocatalysts that do not share any known sequence signatures (Leis et al. 2015).
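The codon usage issue mentioned above can be illustrated with a short Python sketch that counts the in-frame codons of a coding sequence and reports the fraction that are rare in the expression host. The set of rare codons below (AGA, AGG, AUA, CUA, CCC and GGA, often cited as poorly used in E. coli) is an assumption made for illustration, as are the function names and the toy open reading frame; none of them comes from the studies cited in this section.

```python
from collections import Counter

# Codons often cited as rare in E. coli, written as DNA triplets.
# Illustrative assumption, not taken from the studies cited in the text.
RARE_IN_ECOLI = {"AGA", "AGG", "ATA", "CTA", "CCC", "GGA"}


def codon_counts(cds):
    """Count the in-frame codons of a coding sequence (5'->3', DNA or RNA)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rare_codon_fraction(cds, rare=RARE_IN_ECOLI):
    """Fraction of codons in `cds` that belong to the `rare` set."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for codon, n in counts.items() if codon in rare) / total


# Hypothetical 10-codon open reading frame, for illustration only:
# ATG AGA AGG ATA CTA AAA GGT CCC GAA TAA
toy_orf = "ATGAGAAGGATACTAAAAGGTCCCGAATAA"
fraction = rare_codon_fraction(toy_orf)
print(f"rare codon fraction: {fraction:.2f}")
```

In this toy example five of the ten codons fall in the rare set, giving a fraction of 0.50. A high value for a metagenomic gene would suggest that codon optimization, or a host supplemented with the corresponding tRNAs, might be worth considering before expression in E. coli.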

5.3 Extremophilic microbiome enrichments

Direct selection of enzymes by functional metagenomics is not always the best approach, and the adaptation of entire extremophilic microbiomes to specific substrates is an interesting alternative. Complex substrates that are recalcitrant to enzymatic conversion often require the combined action of many catalytic activities and accessory proteins that, in nature, are provided by a complex microbiome. This is particularly true for the conversion of plant lignocellulosic material, including cellulose and hemicelluloses (xylans, xyloglucans, pectins, etc.), which are the two most abundant polymers on Earth (global cellulose production estimates range between 9 × 10¹² and 1.5 × 10¹² tons/year; Ha et al. 2011; Pinkert et al. 2009) and are remarkably stable to spontaneous hydrolysis (the half-life of the glycosidic bond is 4.7 × 10⁶ years; Wolfenden et al. 1998). Thus, carbohydrate active enzymes (cazymes) from (hyper)thermophiles have interesting biotechnological potential (Cobucci-Ponzano et al. 2006, 2010a, b). In fact, their impressive stability (Ausili et al. 2004) under the conditions at which plant lignocellulose is pretreated in second-generation biorefineries (steam explosion at extremes of temperature and pH) makes them ideal catalysts for the hydrolysis of (hemi)cellulose into fermentable sugars for the production of bioethanol and plastic precursors (Castiglia et al. 2016; Cobucci-Ponzano et al. 2013, 2015; Iacono et al. 2016; Aulitto et al. 2017).

Functional metagenomics through selections at high temperatures (60–75 °C) allowed the identification of interesting thermophilic cazymes, including β-glucosidase, cellulase, β-xylosidases/α-arabinofuranosidase, endoxylanase, α-fucosidase, and acetyl xylan esterase, from samples collected in environmental niches such as the guts of animals and insects, compost, swine waste, and thermophilic methanogenic digesters (Allgaier et al. 2010; Dougherty et al. 2012; Liang et al. 2010; Wang 2009; Wang et al. 2015c). However, direct selection of microbiomes from extreme environments on plant biomass is a useful alternative. Gladden and collaborators adapted compost samples to grow on switchgrass by multiple passages at 60 °C. The enrichments reduced microbiome diversity; 16S rRNA analysis showed the presence of thermophilic Gram-positive Firmicutes, together with Bacteroidetes, Chloroflexi and, remarkably, an uncultivated lineage related to the phylum Gemmatimonadetes (Gladden et al. 2011). In addition, supernatants of the enriched cultures showed endoglucanase and xylanase activities more stable than those of the commercially available enzymes exploited in biorefineries. With a similar approach, a sample from a 94 °C geothermal pool in Nevada (USA) was enriched on Miscanthus and Avicel cellulose under strictly anaerobic conditions at 90 °C (Graham et al. 2011). Three 16S rRNA genes were identified, corresponding to the archaea Ignisphaera aggregans, Pyrobaculum islandicum, and Thermofilum pendens, with the Ignisphaera-like strain dominating the microbiome. In addition, CMC-active enzymes were detected in the enriched cultures, and annotation of the metagenomic data bank allowed the identification and characterization of a novel GH5 endoglucanase. This study reported the first archaeal microbiome able to deconstruct lignocellulose at 90 °C and a novel, uncommon cellulase with promising applications in biorefineries. Finally, a metagenomic analysis of an environmental sample dominated by Firmicutes and collected in hot springs (50–70 °C) in Xiamen, China, was recently reported. Upon enrichment on sugarcane bagasse, a novel cellulase, XM70-Cel9, was identified and biochemically characterized, displaying biotechnologically relevant properties, i.e. an optimal temperature of 70 °C and good pH tolerance (Zhao et al. 2017).

6 Conclusions

Interest in the diversity, ecology, physiology and biochemistry of extreme- and hyper-thermophilic microorganisms has increased enormously during the past few decades. Metagenomic studies conducted in different thermophilic biotopes have revealed an unanticipated phylogenetic and physiological diversity of thermophiles. Figure 5 shows the principal component analysis of the geothermal sites described herein, taking into account pH, temperature and dominant phyla. The analysis indicates that archaeal phyla (i.e. Crenarchaeota) predominate at sites with higher temperature (~75–90 °C) and low pH (2.6–5.5), whereas bacterial phyla (i.e. Aquificae) predominate under lower temperature (~65–80 °C) and higher pH (5.5–9.4) conditions. This observation is in agreement with the studies reported by Inskeep et al. (2013b) in the YNP metagenomic project.

Fig. 5 Principal component representation of the geothermal sites described in this review (see Table 1), considering temperature, pH and dominant phyla as clustering parameters. The two main clusters of geothermal sites, A and B, reflect the correlation between the dominant phyla and the environmental parameters (pH and T), confirming the presence of dominant archaeal and bacterial phyla in high temperature/low pH and low temperature/high pH geothermal environments, respectively. 1–20 Yellowstone National Park (red); 21–22 Iceland (light blue); 23–33 Kamchatka (orange); 34–35 Furnas Valley (purple); 36 Malaysia (grey); 37–50 Rehai and Ruidian (pink); 51–54 Taupo (green); 55–56 Phlegraean fields (green); 57 Los Azufres (black). (Color figure online)
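To make the principle behind such a representation concrete, the following Python sketch performs a principal component analysis on two standardized variables only, temperature and pH. It is a simplified illustration and not the analysis used for Fig. 5, which also considers the dominant phyla. The six sites are hypothetical values chosen within the ranges discussed in the text, and the function names are introduced here for illustration.

```python
import math


def standardize(values):
    """Return z-scores (mean 0, sample standard deviation 1)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def pca_two_variables(x, y):
    """PCA of two standardized variables.

    For the 2x2 correlation matrix [[1, r], [r, 1]] the eigenvalues are
    1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    Returns the fraction of variance explained by PC1 and the PC1 scores.
    """
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / (len(zx) - 1)  # Pearson r
    sign = 1.0 if r >= 0 else -1.0  # selects the eigenvector of the larger eigenvalue
    explained = (1 + abs(r)) / 2
    scores = [(a + sign * b) / math.sqrt(2) for a, b in zip(zx, zy)]
    return explained, scores


# Hypothetical sites (temperature in °C, pH); illustrative values only.
temperature = [88, 85, 80, 70, 66, 72]
ph = [3.0, 2.6, 4.5, 7.5, 9.0, 8.2]

explained, pc1 = pca_two_variables(temperature, ph)
for t, p, s in zip(temperature, ph, pc1):
    group = "high T / low pH" if s > 0 else "low T / high pH"
    print(f"T={t} °C, pH={p}: PC1={s:+.2f} ({group})")
```

Because temperature and pH are negatively correlated in this toy data set, the first component contrasts the two variables, and the sign of the PC1 score separates the hot, acidic sites from the cooler, more alkaline ones.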

The combination of DNA metagenomic studies with metatranscriptomic and metaproteomic approaches is expected to clarify the functional dynamics of microbial communities and to advance the prediction of the in situ activities and productivity of microbial consortia.

Besides the bacterial and archaeal communities, viral metagenomic studies have provided information on the biogeography, diversity and community structure of viruses in hot environments, and on new viral types, contributing to the discovery of potential archaeal RNA viruses. Viral metagenomic analysis has also revealed a high percentage of unknown sequences, demonstrating the vast novelty of the genetic information still to be obtained from viruses.

Microbial and viral metagenomics open the way to analyzing and screening the genetically and metabolically rich thermophilic microbial communities in their entirety. With the further development of high-throughput screening methods and of tests based on chromogenic substrates for the detection of thermozymes, it is possible to foresee a major advance in the discovery of novel biocatalysts to be exploited in sustainable processes for industrial applications.