Study site and sample collection
The East River (ER) watershed has been described elsewhere [3]. In brief, the ER watershed is a 300-km2 area in the Elk Mountains of west-central Colorado that is largely underlain by marine shales of the Cretaceous Mancos Formation. The ER is a headwaters catchment in the Upper Colorado River basin, with an average elevation of 3350 m. Over its roughly 62-km length, the ER traverses an elevational gradient that spans alpine, subalpine, and montane life zones as a function of stream reach. The average annual temperature is ~ 0 °C, with long cold winters and short cool summers, and the majority of precipitation falls as snow [27].
The sampling sites span the altitudinal gradient of the river (~ 2700–2900 m). The floodplain at the highest elevation is located ca. 6 km from the headwaters, near Gothic, Colorado, the site of the Rocky Mountain Biological Laboratory (RMBL, Fig. 1). Samples collected from this site were therefore named East River Meander-bound floodplain G (ERMG). The second site was located ca. 8 km downstream of Gothic, among a series of floodplains, one of which is adjacent to an intensive research site of the Watershed Function SFA [3]. This floodplain stands out because of its larger size, and its samples were named ERML (L for large). The third site was located ca. 18 km downstream of Gothic, just upstream of the confluence with Brush Creek; samples from this site were named ERMZ. The stream reach between ERML and ERMZ is characterized by a relatively low gradient and high sinuosity.
In September 2015, during base flow conditions, two sets of perpendicular transects were laid out at each site. Each set comprised four transects parallel to one another (Fig. 1): one set ran approximately north to south (T1–T4) and the other east to west (T5–T8). The starting point of each transect was designated “0 m”, and the locations of the other sampling points along the transect were recorded relative to this origin. A Trimble Geo 7X GPS was used to determine the exact location of each sampling point along the transects with an accuracy of 0.5 m. The distance (in meters) of each sample from the point of origin was included in the sample name, which comprised the initials of the study area (ER), the initials of each meander-bound floodplain (i.e., MG, ML, or MZ), the transect number (i.e., T1–T8), and the distance in meters from the first sample collected at the start point (e.g., 19 m). We sampled areas of ~ 4600 m2 in floodplain G, ~ 8000 m2 in floodplain L, and ~ 5400 m2 in floodplain Z.
Four soil samples from the 10–25 cm (± 1–2 cm) depth interval were collected over a span of 10 days along each of the eight transects, for a total of 32 samples per floodplain. Each site was cleared of grasses and other vegetation with clippers, and the first ~ 10 cm of soil was removed with a sterile shovel. Soil samples were collected using sterile tools, including a soil core sampler with 7.6 × 15.2 cm plastic core liners (AMS, Inc.), stainless-steel spatulas, and Whirl-Pak bags. Samples were immediately stored in coolers for transport to RMBL, where they were prepared for archival and shipment to the University of California, Berkeley. Soil cores were broken apart and manually homogenized inside the Whirl-Pak bags. Subsamples for chemical analyses, DNA extractions, and long-term archival were obtained inside a biosafety cabinet, kept at − 80 °C, transported on dry ice, and stored at − 80 °C at the University of California, Berkeley.
In September 2016, another round of sample collection was conducted at floodplain L for metagenomics, metatranscriptomics, and chemical analyses. A subset of 19 of the 32 sampling sites from the previous year was targeted, and a subset (15) of those was also selected for metatranscriptomics (Table S1, Additional file 1). Given that floodplain L had yielded the lowest total number of draft genomes in 2015, we added new sites close to the original sites with the intent of increasing this number by leveraging differential coverage across samples [28]. Four new sites located in between the original transects (designated ERMLIBT) and two sites adjacent to ERMLT660 (ERMLT660_1 and ERMLT660_2) were sampled. Additionally, samples were collected from above the water table (located approximately 40–50 cm below the surface) at a depth of 32–47 cm (± 4–6 cm) at three sites (ERMLT200, ERML231, and ERML293) along T2. Samples from the 11–25 cm (± 1 cm) soil layer were obtained following the same protocol as the previous year, except that subsamples for RNA sequencing were preserved in situ. Once the soil cores were transferred to Whirl-Pak bags, they were manually homogenized inside the bags. Eight grams (8 g) of soil were collected using sterile stainless-steel spatulas directly into 50-mL sterile Falcon tubes containing 20 mL of LifeGuard Soil Preservation Solution (Qiagen) for RNA preservation. The samples were mixed by hand until saturated with the LifeGuard solution, stored in a chilled cooler for transport to RMBL, and later stored at − 80 °C.
Soil chemistry
Total carbon (TC) and total inorganic carbon (TIC) were analyzed using a Shimadzu TOC-VCPH analyzer equipped with an SSM-5000A solid sample module (Shimadzu Corporation, Japan). Total organic carbon (TOC) was obtained as the difference between TC and TIC. For TC quantification, a subsample of the dried solids was weighed into a ceramic boat and combusted in the TC furnace at 900 °C under a stream of oxygen. To ensure complete conversion to CO2, the generated gases were passed over a mixed cobalt/platinum catalyst for catalytic post-combustion. The CO2 produced was then transferred to the NDIR detector in the main instrument unit. Quantification of inorganic carbon was carried out in the separate IC furnace of the module: phosphoric acid was added to the sample, and the resulting CO2 was purged at 200 °C and measured.
Total nitrogen (TN) was analyzed using a Shimadzu Total Nitrogen Module (TNM-1) coupled to the solid sample module (SSM-5000A) and TOC-VCSH analyzer (Shimadzu Corporation, Japan). The TNM-1 provides a nonspecific measurement of TN. All nitrogen species in the samples were combusted at 900 °C and converted to nitrogen monoxide and nitrogen dioxide, which were then reacted with ozone to form an excited state of nitrogen dioxide. The light energy emitted upon return to the ground state was measured with a chemiluminescence detector to quantify TN.
DNA extraction and sequencing
Genomic DNA was extracted from ~ 10 g of thawed soil using the PowerMax Soil DNA extraction kit (Qiagen) with the following minor modifications. Initial cell lysis by vigorous vortexing was replaced by incubating the tubes in a water bath at 65 °C for 30 min with mixing by inversion every 10 min, to reduce shearing of the genomic DNA. After adding the high-concentration salt solution that allows binding of DNA to the silica membrane column used for removal of chemical contaminants, a vacuum was applied instead of multiple centrifugation steps. Finally, DNA was eluted from the membrane using 10 mL of the elution buffer (10 mM Tris) instead of 5 mL to ensure full release of the DNA. DNA was precipitated out of solution using 10 mL of a 3-M sodium acetate (pH 5.2) and glycogen (20 mg/mL) solution and 20 mL of 100% sterile-filtered ethanol. The mix was incubated overnight at 4 °C and centrifuged at 15,000 × g for 30 min at room temperature; the resulting pellet was washed with 10 mL of chilled sterile-filtered 70% ethanol, centrifuged at 15,000 × g for 30 min, allowed to air dry in a biosafety cabinet for 15–20 min, and resuspended in 100 μL of the original elution buffer. Genomic DNA yields were between 0.1 and 1.0 μg/μL, except for two samples with 0.06 μg/μL. The PowerClean Pro DNA Clean-Up kit (Qiagen) was used to purify 10 μg of DNA following the manufacturer’s instructions, except that vortexing was replaced by flicking of the tubes to preserve the integrity of the high-molecular-weight DNA. DNA was resuspended in elution buffer (10 mM Tris, pH 8) at a final concentration of 10 ng/μL, for a total of 0.5 μg of genomic DNA. DNA was quantified using a Qubit dsDNA Broad Range Assay or, when necessary, the high-sensitivity assay (ThermoFisher Scientific). Additionally, the integrity of the genomic DNA was confirmed on agarose gels, and the purity of the extracts was verified by the absence of PCR inhibition. For samples collected the following year, DNA was co-extracted with RNA (see next section), in addition to extracting subsamples (10 g of soil) from the same core following the extraction protocol described above (Table S1, Additional file 1).
Clean DNA extracts and co-extracts were submitted for sequencing at the Joint Genome Institute (Walnut Creek, CA), where samples were subjected to a quality control (QC) check. Two of the 96 samples from 2015 failed QC and thus were not sequenced (ERMZT233 and ERMZT446), and four samples were sequenced ahead of the others (ERMLT700, ERMLT890, ERMZT100, and ERMZT299). Ten of the 15 DNA co-extracts from 2016 failed QC because of low DNA yields and were also not sequenced. Sequencing libraries for the first four samples were prepared in microcentrifuge tubes: 100 ng of genomic DNA was sheared into 600-bp fragments using a Covaris LE220 and size selected by SPRI using AMPure XP beads (Beckman Coulter). The fragments were end-repaired, A-tailed, and ligated to Illumina-compatible adapters (IDT, Inc.) using the KAPA Illumina Library Prep kit (KAPA Biosystems). Libraries for the remaining samples were prepared in 96-well plates. Plate-based DNA library preparation for Illumina sequencing was performed on a PerkinElmer Sciclone NGS robotic liquid handling system using the KAPA Biosystems library preparation kit: 200 ng of sample DNA was sheared to 600 bp using a Covaris LE220 focused-ultrasonicator, the sheared DNA fragments were size selected by double-SPRI, and the selected fragments were end-repaired, A-tailed, and ligated with Illumina-compatible sequencing adaptors from IDT containing a unique molecular index barcode for each sample library.
All libraries were quantified using the KAPA Biosystems next-generation sequencing library qPCR kit and a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then multiplexed with other libraries, and the pool of libraries was prepared for sequencing on the Illumina HiSeq platform using a TruSeq paired-end cluster kit, v4, and Illumina’s cBot instrument to generate a clustered flow cell. Sequencing of the flow cell was performed on an Illumina HiSeq 2500 sequencer using HiSeq TruSeq SBS sequencing kits, v4, following a 2 × 150 indexed run recipe.
RNA–DNA co-extraction and sequencing
Total RNA was extracted from a subset of 15 samples using the RNA PowerSoil Total RNA Isolation kit (Qiagen). Soil samples (8 g) preserved in LifeGuard solution (Qiagen) were thawed on ice and centrifuged at 2500 × g for 5 min to collect the soil at the bottom of the tubes. The LifeGuard supernatant was removed and aliquoted into three 15-mL conical tubes, which were used to hold three separate 2-g subsamples for later use. The remaining 2 g were split evenly between two of the kit’s bead tubes containing pre-aliquoted bead solution (to disperse the cells and soil particles). The lysis solution (SR1) and the non-DNA organic and inorganic precipitation solution (SR2) were not added to the bead tubes until all subsamples to be processed on a given day had been aliquoted. Subsamples were kept at − 20 °C before being transferred to a − 80 °C freezer for permanent storage. The remainder of the extraction was carried out following the manufacturer’s instructions. An RNA PowerSoil DNA Elution Accessory kit was used to co-extract DNA from the RNA capture columns, and this DNA was quantified as described above. A DNase treatment was performed on all RNA extracts with a TURBO DNA-free kit (Ambion) using 4 U of TURBO DNase at 37 °C for 30 min. The absence of DNA was verified by PCR with universal primers targeting the SSU rRNA gene, and the integrity of the RNA was checked using a Bioanalyzer RNA 6000 Nano kit following the manufacturer’s instructions. Total RNA was quantified before and after DNase treatment using a Qubit high-sensitivity RNA assay (ThermoFisher Scientific). One of the RNA extracts (ERMLT590) did not yield enough RNA for sequencing.
Total RNA and DNA co-extracts were submitted for sequencing at the Joint Genome Institute (Walnut Creek, CA), where samples were subjected to a quality control check. rRNA was removed from 1 μg of total RNA using the Ribo-Zero rRNA Removal Kit (Illumina). Stranded cDNA libraries were generated using the Illumina TruSeq Stranded mRNA Library Prep kit: the rRNA-depleted RNA was fragmented and reverse transcribed using random hexamers and SuperScript II (Invitrogen), followed by second-strand synthesis, and the fragmented cDNA was treated with end repair, A-tailing, adapter ligation, and 8 cycles of PCR. Low-input extracts were processed identically, except that rRNA was removed from 100 ng of total RNA and 10 cycles of PCR were used. The prepared libraries were quantified using the KAPA Biosystems next-generation sequencing library qPCR kit and a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then multiplexed with other libraries, and the pool of libraries was prepared for sequencing on the Illumina HiSeq platform using a TruSeq paired-end cluster kit, v4, and Illumina’s cBot instrument to generate a clustered flow cell. Sequencing of the flow cell was performed on an Illumina HiSeq 2500 sequencer using HiSeq TruSeq SBS sequencing kits, v4, following a 2 × 150 indexed run recipe.
Metagenomes assembly and annotation and ribosomal protein L6 analysis
Methods used for assembly and annotation of the 2015 and 2016 metagenomes are described elsewhere [29]. In brief, after quality filtering, reads from individual samples were assembled separately using IDBA-UD v1.1.1 [30] with a minimum k-mer size of 40, a maximum k-mer size of 140, and a step size of 20. Only contigs > 1 kb were kept for further analyses. Gene prediction was done with Prodigal v2.6.3 [31] in meta mode, annotations were obtained using USEARCH [32] against UniProt [33], UniRef90, and KEGG [34], and 16S rRNA genes and tRNAs were predicted as described in Diamond et al. [5]. Reads were mapped to the assemblies using Bowtie2 [35] with default settings to estimate coverage. To estimate the number of genomes potentially present across all 94 metagenomes, we used the ribosomal protein L6 as a marker gene and RPxSuite (https://github.com/alexcritschristoph/RPxSuite) as described in Olm et al. [6]. An L6 OTU cluster was considered “binned” if any L6-containing scaffold within the cluster, across all samples, was associated with a binned genome. L6 clusters were taxonomically classified using GraftM (https://doi.org/10.1093/nar/gky174) against all L6 sequences from the GTDB database (Release 05-RS95) with default parameters. Rank abundance curves were plotted using the ggplot2 [36] package in R [37].
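For illustration, a minimal R sketch of the rank abundance plotting step (not the exact code used) is shown below; the data frame `l6_clusters` and its columns `reads` and `binned` are hypothetical names for the per-cluster read counts and binning status.

```r
# Rank abundance curve of L6 OTU clusters, colored by whether the cluster was binned.
library(ggplot2)

l6_clusters <- l6_clusters[order(-l6_clusters$reads), ]          # sort by abundance
l6_clusters$rank <- seq_len(nrow(l6_clusters))                   # assign ranks
l6_clusters$rel_abund <- l6_clusters$reads / sum(l6_clusters$reads)

ggplot(l6_clusters, aes(x = rank, y = rel_abund, colour = binned)) +
  geom_point(size = 0.8) +
  scale_y_log10() +
  labs(x = "L6 OTU cluster rank", y = "Relative abundance") +
  theme_bw()
```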
Genome binning, curation, and dereplication
Annotated metagenomes from both years were uploaded to ggKbase (https://ggkbase.berkeley.edu), where binning tools based on GC content, coverage, and winning taxonomy [38] were used for genome binning. These bins, together with additional bins obtained with the automated binners ABAWACA1 (https://github.com/CK7/abawaca), ABAWACA2, MetaBAT [39], MaxBin2 [40], and CONCOCT [41], were pooled, and DAS Tool [42] was used to select the best set of bins from each sample as described by Diamond et al. [5]. Notably, no bins were recovered from sample ERMZT266 by any method.
Genomic bins were filtered based on completeness, requiring ≥ 70% of a set of 51 bacterial single-copy genes (BSCG) for bins affiliated with Bacteria or of a set of 38 archaeal single-copy genes (ASCG) for bins affiliated with Archaea, and a contamination level ≤ 10% based on the corresponding list of single-copy genes [42]. Additionally, bins that were 59–68% complete and whose highest taxonomic level was defined as Bacteria in ggKbase, or that were potential members of the Candidate Phyla Radiation (CPR), were kept for further scrutiny. To obtain a set of genomes for visual curation in ggKbase, genomes were dereplicated at 99% ANI across samples within a given floodplain using dRep [43] with the --ignoreGenomeQuality flag. Assembly errors in the dereplicated set were addressed using ra2.py [44], and contigs that fell below the 1-kb length minimum after this step were removed from the bins. At this point, the completeness of CPR genomes was confirmed based on a list of 43 BSCG [7]. Genomes that did not meet the completeness thresholds after assembly error correction and that were not affiliated with the CPR or novel bacteria were removed from the analysis. Because bins changed as a result of this process, genes were re-predicted using Prodigal [31] in single mode, reads were mapped to the bins using Bowtie2 [45], and bins were re-imported into ggKbase. Visual inspection of taxonomic profiles, GC content, and, to a lesser extent, coverage allowed us to further reduce contamination. The final set of 248 curated bins from 2015 was dereplicated at 98% ANI, this time across floodplains, including the --genomeInfo flag so that completeness and contamination were taken into account when selecting representative bins. Within this set, genomes ≥ 90% complete were deemed near-complete (Table S2, Additional file 2). Eight relatively low-coverage genomes fell just below the completeness requirement owing to fragmentation during curation to remove possible local assembly errors; these were retained because they represent important taxonomic diversity.
Similarly, genomes reconstructed from floodplain L samples collected in 2016 that passed the completeness (≥ 70%) and contamination (≤ 10%) thresholds were visually inspected and improved in ggKbase. Assembly errors were corrected with ra2.py [44]; contigs that fell below the 1-kb length minimum were removed, as were genomes that no longer passed the completeness threshold after assembly error correction. Genes were re-predicted using Prodigal [31] in single mode, and the final set of curated genomes was imported into ggKbase.
To determine whether the same species were present in both years, we pooled the genome set from 2015 with the curated 2016 set and dereplicated at 95% ANI using dRep [43], including the --genomeInfo flag so that completeness and contamination were taken into account when selecting representative bins. In this set, 13 genomes were reconstructed from the deeper depth interval (Table S3, Additional file 4). However, only 3 of these were unique; the other 10 clustered with genomes reconstructed from the ~ 10–25-cm depth, indicating overlap between the species found at the two depths. Therefore, we kept these genomes for further analyses.
Genome metabolic annotation
We selected a set of ecologically relevant proteins that catalyze geochemical transformations related to aerobic respiration, metabolism of C1 compounds, hydrogen metabolism, nitrogen cycling, and sulfur cycling (Table S4, Additional file 6). Hidden Markov models (HMMs) for the majority of these proteins were obtained from KOfam, the customized HMM database of KEGG Orthologs (KOs) [46]. Custom HMMs targeting nitrite oxidoreductase subunits A and B (NxrA and NxrB), the heme cd1-containing nitrite reductase (NirS), cytochrome c-dependent nitric oxide reductase (NorC; cNOR), hydrazine dehydrogenase (HzoA), hydrazine synthase (HzsA), dissimilatory sulfite reductase D (DsrD), sulfide:quinone reductase (Sqr), sulfur dioxygenase (Sdo), ribulose-bisphosphate carboxylase (RuBisCO) forms I and II, and alcohol dehydrogenases (Pqq-XoxF-MxaF) were obtained from Anantharaman et al. [7]. NiFe and FeFe hydrogenases were predicted using HMMs from Méheust et al. [47] and assigned to functional groups following Matheus Carnevali et al. [38] (see the Phylogenetic analyses subsection below for tree construction methods; Data S3, Additional file 12 and Data S4, Additional file 13; Table S9, Additional file 14 and Table S10, Additional file 15). No bona fide group 4 membrane-bound NiFe hydrogenases were identified among the East River representative genomes (data not shown). HMMER3 [48] was used to annotate the dereplicated sets of genomes following predefined score cutoffs [46]. A subset (10%) of the hits to each of these HMMs was visually checked to determine whether the cutoffs were appropriate for this dataset, as described in Lavy et al. [49] and Jaffe et al. [50]. Only in the case of formate dehydrogenase (FdhA (K05299 and K22516), FdoG/FdhF/FdwA (K00123)) was the cutoff lowered to include additional hits.
For a protein to be considered potentially encoded in a genome, the catalytic subunit and the majority of the accessory subunits had to be detected by the corresponding HMMs at the established cutoffs. A consequence of these definitions is that, in some cases, an enzyme could be deemed absent even though some of its subunits were detected, because a key subunit was missing (Table S4, Additional file 6). Similarly, pathways that require the activity of multiple enzymes were considered detectable only if all of the enzymes were present. Only in cases such as the Wood-Ljungdahl pathway did we require just the majority of the genes to be present, taking genome completeness into consideration. Furthermore, if multiple enzymes could catalyze a given reaction (e.g., use of O2 as a terminal electron acceptor), the presence of genes encoding any one of them was taken as evidence that the capacity was present in the genome. Likewise, if different pathways lead to the same biogeochemical transformation (e.g., CO2 fixation), the presence of genes encoding any one of those pathways (or its key enzymes) was considered sufficient to indicate its presence (Table S4, Additional file 6). In a limited number of cases, a given pathway also involves enzymes that are part of central metabolism or shared among multiple pathways; in these cases, we defined presence based on the key catalyst rather than the whole pathway (e.g., RuBisCO in the Calvin-Benson pathway).
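The subunit rule described above can be expressed as a simple decision function. The R sketch below is illustrative only (not the authors' code); the object `hits`, the function `enzyme_present`, and the coxA/coxB/coxC example are hypothetical.

```r
# Decide whether an enzyme is potentially encoded, given a named logical vector
# of HMM hits per subunit: the catalytic subunit must be detected, plus a
# majority of the accessory subunits.
enzyme_present <- function(hits, catalytic, accessory) {
  all(hits[catalytic]) &&
    (length(accessory) == 0 || mean(hits[accessory]) > 0.5)
}

# Hypothetical example: a terminal oxidase with catalytic subunit coxA and
# accessory subunits coxB and coxC; 2 of 3 subunits detected.
hits <- c(coxA = TRUE, coxB = TRUE, coxC = FALSE)
enzyme_present(hits, catalytic = "coxA", accessory = c("coxB", "coxC"))
#> TRUE
```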
Carbohydrate-active enzymes were predicted using the Carbohydrate-Active enZYmes Database (CAZy; http://www.cazy.org/) [10] (version 1.0) and dbCAN2 [51] (e-value cutoff of 1e−20).
Genome coverage and detection
Reads were mapped to the dereplicated set of bins using Bowtie2 [35], and read mappings with more than 2% dissimilarity to the reference were discarded. The script calculate_coverage.py (https://github.com/christophertbrown/bioscripts/tree/master/ctbBio) was used to estimate the average number of reads mapping to each genome and the proportion of the genome covered by reads (breadth). Genomes with a coverage of at least 0.01× were considered detected in a given sample. The Hellinger transformation was applied to account for differences in sequencing depth among samples and to determine final genome abundances. To illustrate genome detection across samples, we used the ggplot2 package [36]. Genomes were clustered by average linkage using the Hellinger-transformed abundances across samples (from read mapping), and samples were clustered by Euclidean distance in R [37].
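A minimal R sketch of the detection, normalization, and clustering steps is shown below, assuming `cov` is a (hypothetical) genomes × samples matrix of mean coverage values from read mapping.

```r
# Detection, Hellinger normalization, and clustering of genome coverage profiles.
library(vegan)

detected <- cov >= 0.01                               # presence/absence at 0.01x coverage
cov_hel  <- decostand(t(cov), method = "hellinger")   # Hellinger transform, samples as rows

# Genomes clustered by average linkage on Hellinger-transformed abundances;
# samples clustered by Euclidean distance.
genome_clust <- hclust(dist(t(cov_hel)), method = "average")
sample_clust <- hclust(dist(cov_hel, method = "euclidean"))
```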
Phylogenetic analyses
Two phylogenetic trees were constructed with a set of 14 ribosomal proteins (L2, L3, L4, L5, L6, L14, L15, L18, L22, L24, S3, S8, S17, and S19). One tree included the Betaproteobacteria genomes from this study at the subspecies level (98% ANI) and ~ 1540 reference Betaproteobacteria genomes from NCBI (Figure S3, Additional file 3 and Data S1, Additional file 5). The other tree included the set of 215 genomes dereplicated at 95% ANI and ~ 2228 reference genomes from the NCBI genome database (Data S2, Additional file 9). For each genome, the ribosomal proteins were collected from the scaffold with the highest number of ribosomal proteins. A maximum-likelihood tree was calculated based on the concatenation of the ribosomal proteins as follows: homologous protein sequences were aligned using MAFFT (version 7.390) (--auto option) [52], and alignments were trimmed to remove gapped regions using trimAl (version 1.4.22) (--gappyout option) [53]. Tree reconstruction was performed using IQ-TREE (version 1.6.12), as implemented on the CIPRES web server [54], using ModelFinder [55] to select the best model of evolution (LG + I + G4) and 1000 ultrafast bootstrap replicates [56]. Taxonomic affiliations were determined based on the closest reference sequences to the query sequences on the tree and extended to other members of the same ANI cluster. In many cases, the phylogeny was not clear upon first inspection of the tree, and additional reference genomes were added when publicly available. Phylogenetic trees for proteins of interest were reconstructed using the same methods, except with different sets of reference sequences. East River homologs in the dimethyl sulfoxide reductase (DMSOR) superfamily, such as the catalytic subunits of formate dehydrogenase (FdhA), nitrite oxidoreductase (NxrA), membrane-bound nitrate reductase (NarG; H+-translocating), and periplasmic nitrate reductase subunit A (NapA), were confirmed by phylogeny on a tree with reference sequences from Méheust et al. [47] (Table S11, Additional file 16 and Data S5, Additional file 17). To distinguish form I and form II CODHs and other subtypes among homologs of K03520, we used the dataset of Diamond et al. [5], which comprises reference sequences from Quiza et al. [18] (Table S12, Additional file 18 and Data S6, Additional file 19). Similarly, homologs identified with the Pqq-XoxF-MxaF alcohol dehydrogenase HMM were placed on a phylogenetic tree with reference sequences from the Diamond et al. [5] dataset, which comprises references from Keltjens et al. [57] and Taubert et al. [58]. In this tree, all East River homologs clustered with methanol dehydrogenases (Table S13, Additional file 20 and Data S7, Additional file 21) rather than with other types of alcohol dehydrogenases. To distinguish between the oxidative and reductive bacterial types of dissimilatory (bi)sulfite reductase, DsrA and DsrB homologs from individual genomes were concatenated, aligned, and added to a phylogenetic tree with reference sequences from Müller et al. [59] (Table S14, Additional file 22 and Data S8, Additional file 23).
Community diversity and composition
Diversity indices for each sample were calculated from the Hellinger-transformed abundance table for the genome set at the subspecies level (98% ANI) using the vegan package in R [60]. Species numbers and Shannon diversity per sample were quantified using the specnumber and diversity functions of vegan, respectively (Figure S4, Additional file 3). An analysis of variance, implemented in the aov function in R [37], was used to test for significant differences in mean species number and Shannon diversity in relation to the floodplain from which samples originated. No significant differences in group means were detected at a significance level of p < 0.05.
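A minimal R sketch of these calculations is shown below, assuming `abund_hel` is a samples × genomes Hellinger-transformed abundance matrix and `floodplain` is a factor giving each sample's floodplain of origin (both hypothetical names).

```r
# Per-sample richness and Shannon diversity, followed by one-way ANOVA by floodplain.
library(vegan)

richness <- specnumber(abund_hel)                      # number of genomes detected per sample
shannon  <- diversity(abund_hel, index = "shannon")    # Shannon diversity per sample

summary(aov(richness ~ floodplain))                    # test differences in mean richness
summary(aov(shannon ~ floodplain))                     # test differences in mean Shannon diversity
```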
To investigate community composition at the phylum/class level, as determined by phylogenetic analysis, the Hellinger-transformed abundance table for the genome set at the subspecies level (98% ANI) was converted to a presence/absence table. The number of samples in which each genome was detected was counted, and the number of genomes affiliated with a given taxon was summed by sample and plotted in R [37] with ggplot2 [36].
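For illustration, a minimal R sketch of this summary is given below; `abund_hel` (samples × genomes) and the per-genome assignment vector `taxon` are hypothetical names.

```r
# Count genomes detected per phylum/class per sample and plot as stacked bars.
library(ggplot2)

pa <- abund_hel > 0                                   # presence/absence per sample
counts <- t(rowsum(t(pa) * 1, group = taxon))         # samples x taxa matrix of genome counts

df <- data.frame(sample    = rep(rownames(counts), times = ncol(counts)),
                 taxon     = rep(colnames(counts), each  = nrow(counts)),
                 n_genomes = as.vector(counts))

ggplot(df, aes(x = sample, y = n_genomes, fill = taxon)) +
  geom_col() +
  labs(y = "Number of genomes detected") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 4))
```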
Identification of a core floodplain microbiome
To identify organisms that form a “core” or “shared” set across all sampled sites, we operationally defined the core set as (1) organisms that were not statistically associated with any specific floodplain by indicator species analysis (ISA) and (2) organisms that were detected (≥ 0.01× coverage) in at least 89 of the 94 samples (the 90th percentile of presence across all 248 genomes). Indicator species analysis was performed on log-transformed coverage values, filtered to include only values ≥ 0.01×, using the indicspecies package [61] in R version 3.5.2 [37] with 9999 permutations. All p values for associations of a genome with a floodplain or group of floodplains were then corrected using the false discovery rate, with FDR ≤ 0.05 considered a significant association. This yielded 42 genomes that were not statistically associated with any floodplain by ISA and were also detected in ≥ 89 samples (Table S5, Additional file 7). To visualize organism abundance profiles in relation to their membership in the core floodplain microbiome, their ISA clusters, and the coefficient of variation of their coverage, Hellinger-normalized coverage data were projected onto a two-dimensional space using Uniform Manifold Approximation and Projection (UMAP) as implemented in the uwot package in R [62] with the following parameters: umap(data = coverage_data, n_neighbors = 15, nn_method = “fnn”, spread = 5, min_dist = 0.01, n_components = 2, metric = “euclidean”, n_epochs = 1000).
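A minimal R sketch of the ISA and UMAP steps is shown below (not the exact code used), assuming `cov` is a samples × genomes coverage matrix, `cov_hel` its Hellinger-normalized counterpart, and `floodplain` a factor of floodplain membership; the use of multipatt with the IndVal.g statistic is an assumption.

```r
# Indicator species analysis on filtered, log-transformed coverage, then FDR correction,
# followed by a 2D UMAP projection of genome abundance profiles.
library(indicspecies)
library(uwot)

cov_filt <- ifelse(cov >= 0.01, log(cov), 0)                       # log coverage, detection-filtered
isa <- multipatt(as.data.frame(cov_filt), floodplain,
                 func = "IndVal.g", control = how(nperm = 9999))
p_adj <- p.adjust(isa$sign$p.value, method = "fdr")                # FDR <= 0.05 = significant

emb <- umap(t(cov_hel), n_neighbors = 15, nn_method = "fnn", spread = 5,
            min_dist = 0.01, n_components = 2, metric = "euclidean",
            n_epochs = 1000)                                       # genomes as points
```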
Identification of enriched metabolic functions in core floodplain microbiome
Overrepresentation of metabolic functions within the set of genomes comprising the core floodplain microbiome (n = 42) was assessed using hypergeometric tests. For each of 33 functions, the probability of observing the number of core-microbiome genomes carrying that function, given the total number of genomes with the function across the full genomic dataset (n = 248), was calculated using the phyper function in R [37]. Probabilities calculated across all metabolic functions were corrected for multiple testing using the false discovery rate with the p.adjust function in R [37], with FDR ≤ 0.05 considered significant enrichment of a function in the core microbiome.
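A minimal R sketch of this test is shown below; the vectors `core_counts` and `total_counts` (per-function genome counts in the core set and in the full dataset) are hypothetical names.

```r
# Hypergeometric enrichment test for one function: k of n core genomes carry the
# function, K of N genomes carry it overall.
enrich_p <- function(k, n, K, N) {
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)   # P(X >= k)
}

p_vals <- mapply(enrich_p, k = core_counts, K = total_counts,
                 MoreArgs = list(n = 42, N = 248))
p_adj  <- p.adjust(p_vals, method = "fdr")         # FDR <= 0.05 considered enriched
```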
Analysis of correlations among environmental variables
Correlations between numeric soil biogeochemical variables across samples were calculated using Spearman rank correlation as implemented in the rcorr function of the Hmisc package in R (https://github.com/harrelfe/Hmisc). Correlations were then plotted as a correlogram and ordered by hierarchical clustering with Ward’s method using the corrplot package in R [63].
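A minimal R sketch of this step is given below, assuming `soil_chem` is a data frame of numeric soil variables (one row per sample); the choice of the ward.D2 agglomeration option is an assumption.

```r
# Spearman rank correlations among soil variables, plotted as a correlogram
# ordered by hierarchical clustering (Ward's method).
library(Hmisc)
library(corrplot)

rc <- rcorr(as.matrix(soil_chem), type = "spearman")
corrplot(rc$r, order = "hclust", hclust.method = "ward.D2")
```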
Fourth-corner analysis
An RLQ and fourth-corner analysis was performed on genome abundances, environmental data, and genome metabolic annotations using the R package ade4 [64]. Specifically, the genome abundance table (prior to Hellinger transformation) was used for a correspondence analysis, the selected environmental variables (see Soil chemistry and GIS) were used for a Hill-Smith analysis, and the genome metabolic annotations were used for a PCA. A randomization test (as described by ter Braak et al. [65] and Dray et al. [8]) was used to test the global significance of the trait-environment relationships. The fourth-corner statistic was then calculated on the same inputs as the RLQ analysis with 50,000 permutations and p value adjustment using the global FDR method. The results of the RLQ and fourth-corner analysis were plotted using the ggplot2 package [36].
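A minimal R sketch of this workflow in ade4 is shown below; the objects `abund` (samples × genomes abundances), `env` (environmental variables per sample), and `traits` (metabolic annotations per genome) are hypothetical names, and the number of repetitions for the global randomization test is illustrative.

```r
# RLQ ordination followed by fourth-corner tests linking traits to environment.
library(ade4)

coa_abund <- dudi.coa(abund, scannf = FALSE, nf = 2)               # L table: correspondence analysis
hs_env    <- dudi.hillsmith(env, row.w = coa_abund$lw,             # R table: Hill-Smith analysis
                            scannf = FALSE, nf = 2)
pca_trait <- dudi.pca(traits, row.w = coa_abund$cw,                # Q table: PCA
                      scannf = FALSE, nf = 2)
rlq_res   <- rlq(hs_env, coa_abund, pca_trait, scannf = FALSE, nf = 2)

randtest(rlq_res, nrepet = 999, modeltype = 6)                     # global trait-environment test

fc <- fourthcorner(env, abund, traits, modeltype = 6,
                   nrepet = 50000, p.adjust.method.G = "fdr",
                   p.adjust.method.D = "fdr", p.adjust.D = "global")
```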
Metatranscriptomic analyses
To determine differentially transcribed genes, potential levels of activity by phylum or class, the most highly transcribed CAZymes, and the most highly transcribed genes among key geochemical transformations, metatranscriptomic reads were mapped using Bowtie2 [35] to the set of high-quality draft genomes dereplicated at 95% ANI (see above). Metagenomic reads from the subset of floodplain L sites sampled in both 2015 and 2016 (Table S1, Additional file 1) were also mapped, to confirm that high transcription levels were not simply due to higher gene abundance in 2016. Read pairs were then filtered to require a minimum identity of 95% to the reference and MAPQ ≥ 2, and the total number of mapped read pairs was counted for each gene. Counts for metabolic genes were analyzed with DESeq2 [9] to determine differential expression in response to soil organic carbon, and p values were adjusted to correct for multiple hypothesis testing (FDR < 0.05).
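A minimal R sketch of the DESeq2 step is shown below; `gene_counts` (a genes × samples matrix of integer read-pair counts) and the `soil_organic_carbon` column of `sample_info` are hypothetical names for the inputs described above.

```r
# Differential transcription of metabolic genes as a function of soil organic carbon.
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = gene_counts,
                              colData   = sample_info,
                              design    = ~ soil_organic_carbon)
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)   # Benjamini-Hochberg adjusted p values (FDR < 0.05)
```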
GIS
All GIS operations and cartographic visualizations were performed in QGIS v2.12.1 except where otherwise stated. The base remotely sensed imagery was obtained from the USDA NAIP (USDA-FSA Aerial Photography Field Office, publication date 20171220; 1 m ground pixel resolution). A digital terrain model (DTM) with a ground resolution of 0.5 m/pixel was derived from airborne LiDAR data acquired in 2015 by Quantum Spatial in collaboration with Eagle Mapping Ltd [66] (doi:10.21952/WTR/1412542). All maps were projected using EPSG:26913 NAD83/UTM zone 13N. Meander and adjacent river polygons were manually delineated in QGIS. The distance from each sample point to the manually delineated river polygons was calculated using the NNJoin tool. To calculate sample distances to the meander toe, lines were manually drawn between all samples and the meander toe perpendicular to river flow, and distances were calculated using NNJoin (Figure S6a, Additional file 3). Similarly, to calculate sample distances to the middle of the meander, a line perpendicular to the meander toe line was drawn across the middle of the meander (Figure S6, Additional file 3). Sample distances to this line were also calculated using NNJoin, and distances for samples on the downstream side of the line were converted to negative values to distinguish the upstream and downstream sides of the meander. The topographic position index (TPI) was computed from the DTM as the difference between the elevation of a center point and the average elevation of the neighboring area (3 × 3 m) [66]. To display genome abundances as used in the RLQ fourth-corner analysis, filtered abundance values were chi-square transformed in R using the decostand function in the vegan package and exported for display in QGIS. Spatial kriging of inorganic carbon was performed in R: the manually delineated meander polygons were converted to a SpatialPixelsDataFrame using the sp package, a simple variogram model was fit to the natural-log-transformed inorganic carbon values with a spatial cutoff of 60 m, and kriging was performed using the sample points, the meander SpatialPixelsDataFrame, and the fitted variogram model. The natural-log-transformed inorganic carbon values were then back-transformed, and the kriged map was exported for visualization in QGIS.
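A minimal R sketch of the kriging step is given below, assuming `samples_sp` is a SpatialPointsDataFrame of sample locations with a `TIC` column and `meander_pix` is the meander SpatialPixelsDataFrame; the use of the gstat package and a spherical variogram model are assumptions, since only the sp package is named above.

```r
# Variogram fitting and ordinary kriging of log-transformed inorganic carbon.
library(sp)
library(gstat)

samples_sp$logTIC <- log(samples_sp$TIC)                       # natural-log transform
vg  <- variogram(logTIC ~ 1, samples_sp, cutoff = 60)          # empirical variogram, 60-m cutoff
fit <- fit.variogram(vg, vgm("Sph"))                           # simple (spherical) variogram model
kr  <- krige(logTIC ~ 1, samples_sp, meander_pix, model = fit) # kriged surface over the meander

kr$TIC_pred <- exp(kr$var1.pred)                               # back-transform for mapping in QGIS
```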