Introduction

Marine sediments cover more than two-thirds of Earth’s surface. Intact cells (Parkes et al. 2000) and intact membrane lipids (Zink et al. 2003) provide evidence of prokaryotic populations in sediments as deep as 800 m below the seafloor and recently in sediments down to 1,626 mbsf (Roussel et al. 2008). The prokaryotes of subseafloor sediments have been estimated to constitute as much as one-third of Earth’s total living biomass (Whitman et al. 1998). Despite this, microbial community structure and metabolic activity in subseafloor environments have been the subject of fewer studies than their terrestrial counterparts (Llobet-Brossa et al. 1998).

The paradigm that more than 99% of terrestrial bacteria cannot be cultured by conventional means also holds true for marine environments (Kennedy et al. 2008). DNA-based molecular methods have been developed to overcome many of the difficulties and limitations associated with cultivation techniques. 16S rRNA gene analysis, for example, provides important information on the taxonomy of bacteria present in marine sediments (Edlund et al. 2008); however, the technique has limitations. PCR bias can be a problem, especially in environments where only low concentrations of extractable nucleic acids can be obtained because of low numbers of prokaryotic cells (Chandler et al. 1997; Webster et al. 2003). 16S rRNA gene analysis also provides little information about the total genome of microbes within the community and the functional role they play. By contrast, metagenomics (also referred to as environmental or community genomics) is a rapid and effective culture-independent approach to understanding and accessing the genetic information of the microbial community within a specific environment by direct extraction and cloning of bacterial community DNA (Riesenfeld et al. 2004). This new and rapidly developing field has great potential in the study of marine microorganisms and, in the past few years, has been applied worldwide in the study of bacterial assemblages in marine water and sediment (Beja et al. 2000; Hallam et al. 2004; Venter et al. 2004; DeLong et al. 2006; Martin-Cuadrado et al. 2007; Rusch et al. 2007; Biddle et al. 2008).

The South China Sea, one of the largest marginal seas in the western Pacific, was formed by oceanic spreading along a WSW-ENE axis during the Oligo-Miocene, and has over 4 km of organic-rich Cenozoic deposits in its sedimentary basins (Ludmann et al. 2001). Microbial abundances in the surface sediment of this area are approximately 10^7 cells/g sediment (Jiang et al. 2007), indicating that this vast habitat is populated by large numbers of microbes. Previous studies in this region have used isolation of pure cultures and 16S rRNA gene-based community diversity analysis (Xu et al. 2004; Guo et al. 2007; Jiang et al. 2007; Liu et al. 2008; Tao et al. 2008; Tian et al. 2009); however, little is known about the indigenous metagenome. In this study, a fosmid library was constructed from sediment of the Qiongdongnan Basin, a potential methane hydrate-bearing basin on the northwestern continental shelf of the South China Sea (Wu et al. 2003; Su et al. 2005). The community diversity and metabolic profile were preliminarily analyzed through the fosmid end sequences. To our knowledge, this is the first insight into the function of this microbial community using a metagenomic library approach.

Materials and Methods

Marine Sediment Sample Collection

The deep-sea sediment core was collected with a gravity piston corer in March 2006 at the BD7-2 station (110°28′47.231″ E and 17°34′11.603″ N according to the global positioning system) in the Qiongdongnan Basin, South China Sea. After collection, the sediment core was stored onboard at 4°C (approximate in situ temperature) for about 2 weeks. The core was then transported back to the laboratory and dissected into 5-cm sediment subsamples, which were stored at −80°C. The top 5 to 10 cm layer was used for the construction of the metagenomic fosmid library.

Microbial DNA Extraction

Prior to total microbial DNA extraction, a sediment washing step, modified from Fortin et al. (2004), was used to remove contaminants. Marine sediment (10 g wet weight) was washed three times with 100 ml washing buffer (50 mM Tris–HCl, pH 9.0, 100 mM Na2EDTA, 1.0% PVP, 100 mM NaCl, 0.05% Triton X-100) by vortexing for 1 min, incubating in a 55°C water bath for 3 min, and then centrifuging at 3,000×g for 5 min. Phosphate-buffered saline (PBS) was used as a control, following the same washing procedure. After sediment washing, a 5 g pellet was mixed by vortexing with 13.5 ml of extraction buffer (100 mM Tris–HCl, pH 8.0; 100 mM sodium EDTA, pH 8.0; 100 mM sodium phosphate, pH 8.0; 1.5 M NaCl; and 1% CTAB). Three cycles of freezing in liquid nitrogen and thawing in a 65°C water bath were then applied to the suspensions (Kauffmann et al. 2004). After the samples had cooled to 37°C, 50 μl of proteinase K (20 mg/ml) was added and the samples were incubated at 37°C with horizontal shaking at 225 rpm for 30 min. The remaining DNA extraction procedures were performed as described by Zhou et al. (1996).

Metagenomic Library Construction

The extracted DNA was fractionated in 1% agarose (pulse field-certified agarose, Bio-Rad) by pulsed-field gel electrophoresis (PFGE) using a CHEF-DRIII system (Bio-Rad). PFGE gels were run in 0.5× Tris–borate-EDTA buffer for 16 h at 5 V/cm at an angle of 120°. Ramping was carried out from an initial switch time of 0.1 s to a final switch time of 40 s at 14°C. DNA fragments of approximately 36 to 48 kb were cut from the gel and recovered by electroelution (Sambrook et al. 1989). The DNA concentration and purity were measured using a Nanodrop ND-1000 spectrophotometer (Nanodrop Technologies, Wilmington, DE, USA). 16S rRNA genes were amplified using bacterial primers 519F (5′-CAGCMGCCGCGGTAATAC-3′, positions 519 to 536 in the Escherichia coli 16S rRNA gene) and 1492R (5′-GGTTACCTTGTTACGACTT-3′, positions 1510 to 1492 in the E. coli gene). Pure E. coli DNA was used as the control template for 16S rRNA PCR amplification. PCR was performed as follows: one cycle of 94°C for 5 min, followed by 30 cycles of denaturation at 94°C for 1 min, annealing at 55°C for 1 min and elongation at 72°C for 2 min, with a final extension at 72°C for 10 min. The metagenomic library, named IMCAS-F003, was constructed using the CopyControl™ HTP fosmid library production kit (Epicentre) according to the manufacturer's instructions. The repaired DNA was ligated into the pCC2 FOS fosmid vector (Epicentre). Ligated DNA mixtures were then packaged using the supplied lambda packaging extracts and were transformed into an EPI300-T1R phage T1-resistant E. coli host. The infected bacterial cells were stored at −80°C in glycerol (final concentration 20%). To determine the inserted fragment size, 64 randomly selected fosmids were extracted using Plasmid Mini Kit I (Omega, Doraville, GA, USA) and digested with EcoRI.

Fosmid End Sequencing and Sequence Analysis

A total of 600 randomly selected fosmid clones were subjected to bi-directional end sequencing at the Open lab of Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, using the T7 promoter primer (5′-TAATACGACTCACTATAGGG-3′) and the pCC2 reverse sequencing primer (5′-CAGGAAACAGCCTAGGAA-3′). All fosmid end sequences were revised and trimmed using the Lasergene package, version 7.10 (DNASTAR, USA), and compared with sequences in the NCBI nr (non-redundant) database using the BLASTX program (Ye et al. 2006) with an e-value cut-off of 1e-5. Taxonomic binning analysis was based on MEtaGenome Analyzer (MEGAN, version 3.0.2) (Huson et al. 2007), which allows dissection of large datasets without the need for assembly or the targeting of specific phylogenetic markers. As a preprocessing step for this program, sequences were compared against the NCBI nr database using BLASTX under default parameters. MEGAN was then used to compute and interactively explore the taxonomic content of the dataset, employing the NCBI taxonomy to summarize and order the results (absolute cut-off, BLASTX bitscore 75; relative cut-off, 10% of the top hit). For cluster of orthologous groups (COG) assignments, sequences were compared with the COG database using rpsBLAST (-p F) with an e-value cut-off of 1e-5. Other analyses, including protein-based general taxonomy, SEED subsystem-based metabolic profiling, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway investigation, were performed with the Meta Genome Rapid Annotation using Subsystem Technology (MG-RAST) server (Meyer et al. 2008) using an e-value cut-off of 1e-5 and a minimum alignment length of 50 bp. MG-RAST is a fully automated service for annotating metagenomic data (http://metagenomics.theseed.org/metagenomics.cgi?).
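
The cut-offs quoted above (e-value 1e-5, minimum bit score 75, and retention of hits within 10% of the top hit) can be illustrated with a minimal Python sketch. The function name `filter_hits` and the tuple layout of a hit are hypothetical, and combining the e-value and bit-score thresholds in one function is a simplification for illustration; this is not the code of BLASTX or MEGAN.

```python
from typing import List, Tuple

# Hypothetical hit layout: (subject_id, e_value, bit_score)
Hit = Tuple[str, float, float]


def filter_hits(hits: List[Hit],
                max_evalue: float = 1e-5,
                min_bitscore: float = 75.0,
                top_fraction: float = 0.10) -> List[Hit]:
    """Apply absolute cut-offs, then keep hits scoring within
    `top_fraction` of the best remaining bit score."""
    # Absolute cut-offs: e-value ceiling and bit-score floor.
    passing = [h for h in hits if h[1] <= max_evalue and h[2] >= min_bitscore]
    if not passing:
        return []
    # Relative cut-off: within 10% of the top hit by default.
    best = max(h[2] for h in passing)
    threshold = best * (1.0 - top_fraction)
    return [h for h in passing if h[2] >= threshold]


example_hits = [
    ("a", 1e-30, 200.0),
    ("b", 1e-20, 185.0),
    ("c", 1e-10, 120.0),  # passes absolute cut-offs, fails relative one
    ("d", 1e-3, 80.0),    # fails e-value cut-off
    ("e", 1e-6, 60.0),    # fails bit-score cut-off
]
kept = filter_hits(example_hits)
```

In this toy example only hits `a` and `b` survive; the surviving hits of each read are the ones that would then be passed to taxonomic assignment.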

Nucleotide Sequence Accession Numbers

Nucleotide sequences of the 1,051 fosmid ends have been deposited into the GenBank GSS database with accession numbers of FI496080-FI497130.

Results

Fosmid Library Construction and End Sequencing

Prior to DNA extraction, a washing step was used to remove contaminants (see Materials and Methods). After the washing buffer treatment, the supernatant was brown in color, whereas the supernatant of the PBS treatment using the same procedure was clear. The crude DNA extract ranged in size from 6 to 100 kb as shown in Fig. 1 and its yield was approximately 1 µg/g (wet weight) marine sediment. The UV (ultraviolet) absorption spectra showed that, without the washing step, contaminants in the crude DNA extract absorbed strongly in the UV range, masking any absorption of nucleic acids at 260 nm (Fig. 2a). After the washing step, the absorption interference was greatly reduced and the absorption of DNA at 260 nm could be detected clearly (Fig. 2b), indicating that humic and fulvic substance contamination was removed to some degree; however, the low A260/A230 value (0.9) suggested that some contaminants remained. The subsequent purification step was combined with size selection using PFGE, and DNA fragments of 36–48 kb were cut from the gel and recovered by electroelution. After these steps, approximately 70% of the DNA was recovered and showed a very good absorption profile with an A260/A280 value of 1.83 and an A260/A230 value of 1.85 (Fig. 2c). This DNA was pure enough for digestion with the restriction endonucleases EcoRI and HindIII and for PCR amplification of 16S rRNA genes (data not shown).

Fig. 1

Pulsed-field gel electrophoresis of extracted DNA. Lane 1 PFGE ladder. Lane 2 environmental DNA extracted from the sediment washed with washing buffer (see Materials and Methods)

Fig. 2

UV absorption spectra of extracted and purified DNA. a Crude DNA extracted from the sediment washed with PBS buffer. b Crude DNA extracted from the sediment washed with washing buffer (see Materials and Methods). c DNA purified after PFGE and electroelution

The metagenomic library (IMCAS-F003) contained approximately 200,000 clones. EcoRI digestion of 64 randomly selected fosmid clones indicated that the library was highly diverse (data not shown). The average insert size was 36 kb and the total size of this fosmid library was estimated to be 7.2 Gb of metagenomic DNA. Assuming an average prokaryotic genome size of approximately 5 Mb, the library is theoretically equivalent to more than 1,400 prokaryotic genomes.
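
The library size estimate follows directly from the clone count, the average insert size, and the assumed genome size. The short Python sketch below reproduces the arithmetic; the function name `library_size` is hypothetical and the 5 Mb genome size is the assumption stated in the text.

```python
def library_size(n_clones: int, avg_insert_bp: int, genome_bp: int):
    """Return (total cloned DNA in bp, number of genome equivalents)."""
    total_bp = n_clones * avg_insert_bp
    return total_bp, total_bp / genome_bp


total_bp, equivalents = library_size(
    n_clones=200_000,      # approximate clone count of IMCAS-F003
    avg_insert_bp=36_000,  # average insert size, 36 kb
    genome_bp=5_000_000,   # assumed average prokaryotic genome, 5 Mb
)
print(f"{total_bp / 1e9:.1f} Gb, {equivalents:.0f} genome equivalents")
```

With these inputs the calculation gives 7.2 Gb and 1,440 genome equivalents, consistent with the figures reported above.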

Six hundred randomly chosen fosmid clones were subjected to bi-directional end sequencing. After vector trimming, low-quality and short sequences were removed, leaving a total of 1,051 fosmid end sequences with read lengths from 300 to 800 bp. These sequences were of high quality, with an average read length of 619 bp; sequences longer than 450 bp accounted for 93.0% of the total end sequences. The GC content of the sequences ranged from 24% to 73%, with an average of 52.9%. More than 65% of these sequences exhibited high GC content (>55%).
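
Summary statistics of this kind can be computed from the trimmed reads in a few lines. The Python sketch below is illustrative only: the function names are hypothetical, the average GC content is taken as the mean of per-read values (the text does not state whether a per-read or pooled average was used), and the toy reads are not data from this study.

```python
from typing import Dict, List


def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def summarize(reads: List[str], min_len: int = 450,
              high_gc: float = 55.0) -> Dict[str, float]:
    """Mean length, share of reads longer than `min_len`, mean GC,
    and share of reads with GC content above `high_gc`."""
    n = len(reads)
    if n == 0:
        raise ValueError("no reads supplied")
    gcs = [gc_percent(r) for r in reads]
    return {
        "mean_length": sum(len(r) for r in reads) / n,
        "pct_longer": 100.0 * sum(len(r) > min_len for r in reads) / n,
        "mean_gc": sum(gcs) / n,
        "pct_high_gc": 100.0 * sum(g > high_gc for g in gcs) / n,
    }


# Toy input, with a short length threshold so the example is readable.
stats = summarize(["GGCC", "ATAT"], min_len=3)
```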

Community Composition

MG-RAST analysis revealed 758 hits against the SEED protein non-redundant database and no hits against the ribosomal RNA database Greengenes. Most of the sequences (701; 92.48%) were of bacterial origin, 33 were archaeal, 15 were of eukaryotic origin, and nine were not assigned to any group.

To obtain detailed taxonomic information from all the fosmid ends, the sequences were analyzed using the MEGAN software, a program that assigns each read to the lowest common ancestor of all its BLAST hits. Using this program, 660 sequences were assigned to different taxonomic categories, 389 had no hits, and two sequences were not assigned to any taxon (Fig. 3). The results showed that the IMCAS-F003 metagenomic library was dominated by Bacteria, with a total of 583 sequences assigned to discrete taxa and 383 sequences assigned to different definite subcategories.
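
The lowest-common-ancestor idea can be sketched in Python as follows. This is a simplified illustration on a toy taxonomy, not MEGAN's implementation; the names `TOY_PARENT`, `lineage`, and `lowest_common_ancestor` are hypothetical.

```python
from typing import Dict, List, Optional

# Toy child -> parent map standing in for the NCBI taxonomy.
TOY_PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Archaea": "root",
    "Proteobacteria": "Bacteria",
    "Planctomycetes": "Bacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Gammaproteobacteria": "Proteobacteria",
}


def lineage(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Path from the root down to `taxon` (root first)."""
    path = [taxon]
    while parent.get(taxon) is not None:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa: List[str],
                           parent: Dict[str, Optional[str]]) -> Optional[str]:
    """Deepest taxon shared by the lineages of all hit taxa."""
    if not taxa:
        return None
    paths = [lineage(t, parent) for t in taxa]
    lca = None
    # Walk the lineages level by level until they diverge.
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca
```

A read whose hits fall in Deltaproteobacteria and Gammaproteobacteria would be placed at Proteobacteria; adding a Planctomycetes hit moves it up to Bacteria. This is why reads with taxonomically scattered hits end up at higher-level nodes rather than in definite subcategories.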

Fig. 3

Phylogenetic diversity of the IMCAS-F003 fosmid end sequences computed by MEGAN. Each circle represents a taxon in the NCBI taxonomy and is labeled with its name and two numbers. The first is the number of reads assigned directly to that taxon. The second is the number of reads summarized, that is, the number of reads assigned to that taxon or to any taxon in its subtree (no summarized number is shown for the last taxon). The size of each circle is scaled logarithmically to represent the number of reads summarized

The largest group of sequences was from the bacterial phylum Proteobacteria (164 sequences, accounting for 42.8% of the total sequences assigned to definite subcategories), and within this phylum the Deltaproteobacteria were most abundant, followed by the Alpha- and Gammaproteobacteria. Within the Deltaproteobacteria, the sulfate-reducing bacteria (SRB) Desulfobacterales, Desulfuromonadales, and some Syntrophobacter species accounted for the largest proportion. The second largest group was the Planctomycetes, with 96 sequences (25.1%) assigned to this phylum. Sequences related to Planctomyces maris, Blastopirellula marina, and Candidatus Kuenenia stuttgartiensis were relatively numerous within this group. Other taxa were also detected, namely Chloroflexi, Cyanobacteria, Acidobacteria, Verrucomicrobia, Actinobacteria, and Firmicutes. Furthermore, 12 sequences were assigned to Archaea, among which the methanogenic Euryarchaeota orders Methanomicrobiales, Methanosarcinales, and Methanopyrales, and two uncultured methanogenic archaea from environmental samples were detected.

Metabolic Potential

Based on the BLASTX search against the NCBI nr database, 579 of the 1,051 end sequences were assigned to functional genes; 203 were classified as hypothetical proteins; 20 sequences were genes with unknown function; and 249 sequences had no hits (for details see supplementary material Table 1). COG category analysis showed that 42.2% of the genes were related to metabolism and 25.2% to cellular processes and signaling, whereas only 12.6% corresponded to housekeeping genes involved in information-related processes. In addition, 20% of the genes fell into a poorly characterized group (Fig. 4). The most abundant metabolic type was amino acid transport and metabolism (11.3%), followed by energy production and conversion (9.6%) and then carbohydrate transport and metabolism (6.2%; for details see supplementary material Table 2).

Fig. 4

Distribution of fosmid ends in COG categories. Names of subcategories in the COG database are listed on the left, and the corresponding major categories are listed on the right. The number of reads assigned to each major category and their proportions are shown

Further analysis of functional groups was carried out using SEED subsystems. A subsystem is defined as a set of functional roles that together implement a specific biological process, such as the genes whose products are involved in a metabolic pathway, or the group of genes whose products make a cellular structure (Overbeek et al. 2005). A total of 483 sequences were classed into SEED subsystems using an e-value cut-off of 1e-5 (Fig. 5). The top three functional categories were clustering-based subsystems (12.01%), carbohydrates (11.8%), and amino acids and derivatives (10.35%). The clustering-based subsystems included a total of 20 subcategories, such as biosynthesis of galactoglycans and related lipopolysaccharides (seven sequences), cytochrome biogenesis (five sequences), and fatty acid metabolic cluster (three sequences; for details see supplementary material Table 3). The carbohydrate subsystem was dominated by central carbohydrate metabolism (nearly 28% of the total for the category). The one-carbon metabolism and monosaccharide metabolism subcategories were also very common, accounting for 17.54% and 14.04% of the identified carbohydrate subsystem, respectively. In addition, three hits to methanogenesis were detected in this subsystem, namely N5-methyltetrahydromethanopterin: coenzyme M methyltransferase subunit H (29% identity to Archaeoglobus fulgidus DSM 4304), formylmethanofuran-tetrahydromethanopterin N-formyltransferase (50% identity to Methanopyrus kandleri AV19), and dimethylamine methyltransferase corrinoid protein (46% identity to Methanosarcina acetivorans C2A; for details see supplementary material Table 3). Given the deep-sea origin of the sample, it is unsurprising that genes related to photosynthesis were not detected.

In addition, the distribution of the fosmid ends in KEGG metabolic pathways indicated that genes involved in the biodegradation of xenobiotics were very diverse, including the degradation of DL-dithiothreitol, dichloroethane, chloroacrylic acid, benzoate, biphenyl, caprolactam, ethylbenzene, fluorene, naphthalene, anthracene, styrene, tetrachloroethene, and gamma-hexachlorocyclohexane (for details see supplementary material Table 4). This result reinforces previous observations that microorganisms living in deep-sea environments are adapted to degrade recalcitrant pools of organic matter (Martin-Cuadrado et al. 2007).

Fig. 5

Distribution of fosmid ends in SEED subsystems

Discussion

Construction of large-insert environmental metagenomic libraries depends on efficient recovery and purification of high-quality microbial DNA. Many extraction procedures have been developed to isolate nucleic acids directly from environmental samples; however, the post-extraction purification procedures, such as the use of silica gel/membrane or ion exchange chromatography, are often time-consuming and laborious (Schneegurt et al. 2003). In the present study, a washing step prior to SDS-based DNA extraction was used. This step removed the contaminants to a great extent, as shown by the UV spectral profile. After this treatment, the extracted crude DNA was directly subjected to size selection by pulsed-field gel electrophoresis and recovered by electroelution. Both PFGE and electroelution are considered effective strategies for separating DNA from humic materials (Rajendhran and Gunasekaran 2008). Since size selection is a necessary step in fosmid library construction, combining purification and size selection in one step reduced the number of operations and saved time, thereby limiting DNA loss and degradation. It should be noted that some microbes may be lost during the washing step, which would bias the DNA extraction. Therefore, the recovery rate of microbial cells during the washing process, and the extent to which it influences the final DNA yield, deserve further investigation.

The phylogenetic ascription of randomly sequenced fosmid ends allowed us to compare the relative abundance of Bacteria and Archaea in this sediment. Analysis indicated that Bacteria were approximately 20- and 50-fold more abundant than Archaea according to protein-based taxonomy in SEED and MEGAN phylogenetic ascription, respectively. This suggests that Bacteria dominated the prokaryotes in this subsurface sediment, with Archaea contributing only a small proportion. Similar results have also been obtained using quantitative molecular techniques, indicating that Archaea are 10–1,000-fold less abundant than Bacteria in sediment environments (Schippers and Neretin 2006). Recent lipid-based evidence, however, has indicated that in marine subsurface sediments Archaea are more abundant than Bacteria (Lipp et al. 2008). It is possible, therefore, that archaeal biomass was underestimated in this study because of poor extraction efficiency of archaeal DNA.
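
The approximate fold differences quoted above can be recovered from the read counts reported in the Results (701 bacterial and 33 archaeal hits in the SEED-based taxonomy; 583 bacterial and 12 archaeal reads in MEGAN). The Python sketch below shows the arithmetic; the function name `fold_abundance` is hypothetical, and read counts are used here only as a rough proxy for relative abundance.

```python
def fold_abundance(bacteria_reads: int, archaea_reads: int) -> float:
    """Ratio of bacterial to archaeal read counts."""
    if archaea_reads <= 0:
        raise ValueError("archaeal read count must be positive")
    return bacteria_reads / archaea_reads


seed_fold = fold_abundance(701, 33)    # SEED protein-based taxonomy
megan_fold = fold_abundance(583, 12)   # MEGAN assignment
print(f"SEED: {seed_fold:.1f}-fold, MEGAN: {megan_fold:.1f}-fold")
```

The ratios are about 21 and 49, which round to the 20- and 50-fold figures given in the text.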

The Bacteria in this library were dominated by the Proteobacteria, and within this phylum the Deltaproteobacteria were most abundant (68 sequences, approximately 41% of the total sequences assigned to Proteobacteria). The Deltaproteobacteria were also the majority in Nansha and Xisha surface sediments of the South China Sea (38% and 64% of 16S rRNA genes belonging to Proteobacteria, respectively). The Deltaproteobacteria contain most of the known sulfate-reducing bacteria. The abundance of these microorganisms may well relate to the relatively high sulfate content (Jiang et al. 2007) in this area. SRB use sulfate as a terminal electron acceptor for the degradation of organic compounds, resulting in the production of sulfide. SRB have been found to play a dominant role in the bacterial communities of deep-sea sediments (Fry et al. 2008) and have been linked to anaerobic carbon cycling; indeed, it has been estimated that sulfate reduction can account for more than 50% of the organic carbon mineralization in marine sediments (Muyzer and Stams 2008).

The Planctomycetes were the next most abundant group in this library, and have also been found in high numbers in the Xisha Trough (20.3% of the bacterial 16S rRNA gene library) and Nansha sediments (10% of the bacterial 16S rRNA gene library) of the South China Sea (Xu et al. 2004; Tao et al. 2008). Planctomycetes have been identified as the “missing lithotroph” responsible for the anaerobic oxidation of ammonium (anammox process; Strous et al. 1999), and this has led to the discovery of other autotrophic Planctomycetes with similar activities (Schmid et al. 2000). Taken together with the relatively high abundance of these microbes observed in the library, these findings indicate a potentially important role for Planctomycetes in the flux of nutrients in this sediment environment.

Functional groups assigned by SEED subsystems indicated that the most dominant category in this library was clustering-based subsystems. A clustering-based subsystem is described as “one in which there is functional coupling evidence that genes belong together, but we do not yet know what they do” (http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/ClusteringBasedSubsystem). Hence we cannot comment on the majority of the genes assigned to these subsystems or speculate as to the roles that they play.

The sampling site is located on the slope of the Qiongdongnan Basin, one of the potential methane hydrate-bearing basins on the northwestern continental shelf. The presence of methane hydrates in this area has been confirmed by geological and geophysical evidence (Wu et al. 2003; Su et al. 2005). In this study, genes for one-carbon metabolism were found to be very common. This type of metabolism converts complex organic matter to simple one-carbon compounds, which play important roles in the process of methanogenesis and are dominant in the methanogenic Archaea (Ferry 1999). More importantly, three enzymes closely related to those involved in methanogenesis were detected in this library according to SEED subsystems. The first, N5-methyltetrahydromethanopterin: coenzyme M methyltransferase, is an important enzyme in the process of methanogenesis from H2 and CO2, as well as from acetate. The second, formylmethanofuran-tetrahydromethanopterin N-formyltransferase, is a key enzyme of the pathway for dismutation of methanol to methane. The third, dimethylamine (DMA) methyltransferase, methylates the DMA corrinoid protein during methylamine-initiated methanogenesis. In addition, seven sequences were assigned to Euryarchaeota, the archaeal kingdom to which most methanogens of marine origin belong. These methanogenic archaea share one common attribute: they all use a methane-generating pathway for growth (Sowers 2004). These data provide the initial microbiological evidence for methanogenesis in this area. Further efforts on screening and shotgun sequencing of fosmid clones containing key genes for methanogenesis are under way in order to identify complete methane-generating pathways.

Since the metagenomic fosmid library presented here was constructed from an extreme environment, novel biocatalysts with diverse biochemical and physiological characteristics can be expected. The search for novel enzymes, such as lipolytic enzymes, is now in progress.

In conclusion, high-quality microbial DNA was isolated from Qiongdongnan Basin sediment from the South China Sea, and a metagenomic fosmid library was constructed with the purified DNA. A total of 1,051 fosmid end sequences were obtained from randomly selected clones. Although there are limitations in the accuracy of the phylogenetic affiliations and the extent of the metabolic profiles obtained, these sequences represent a random sample of the library and hence provide a reasonable proxy for a community census. Analysis of these sequences offers the first insight into the microbial composition and metabolic potential of the community at this site.