A holistic genome dataset of bacteria, archaea and viruses of the Pearl River estuary

Xu, Bu; Li, Fuyan; Cai, Lanlan; Zhang, Rui; Fan, Lu; Zhang, Chuanlun

doi:10.1038/s41597-022-01153-4

Data Descriptor
Open access
Published: 14 February 2022

Volume 9, article number 49, (2022)
Scientific Data

Bu Xu^1,2,
Fuyan Li³,
Lanlan Cai^4,5,
Rui Zhang ORCID: orcid.org/0000-0003-1325-2613^5,6,
Lu Fan ORCID: orcid.org/0000-0002-4184-7211^2,7 &
…
Chuanlun Zhang^2,7

Abstract

Estuaries are among the most important coastal ecosystems. While microbiomes and viromes have been separately investigated in some estuaries, few studies have holistically deciphered the genomes and connections of viruses and their microbial hosts along an estuarine salinity gradient. Here we applied deep metagenomic sequencing to microbial and viral communities in surface waters of the Pearl River estuary, one of China’s largest estuaries with strong anthropogenic impacts. Overall, 1,205 non-redundant prokaryotic genomes with ≥50% completeness and ≤10% contamination, and 78,502 non-redundant viral-like genomes were generated from samples of three size fractions and five salinity levels. Phylogenomic analysis and taxonomic classification show that the majority of these estuarine prokaryotic and viral genomes are novel at the species level according to public databases. Potential connections between the microbial and viral populations were further investigated by host-virus matching. These combined microbial and viral genomes provide an important complement to global marine genome datasets and should greatly facilitate our understanding of microbe-virus interactions, evolution and their implications in estuarine ecosystems.

Measurement(s)	bacteria • Archaea • viruses
Technology Type(s)	Shotgun Sequencing
Sample Characteristic - Organism	estuary metagenome
Sample Characteristic - Environment	subtropical estuarine
Sample Characteristic - Location	Pearl River estuary

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.17139998

Background & Summary

Estuaries are transitional environments between ocean and river. Complex and dynamic estuarine ecosystems are distinguishable from oceanic environments by a significant variety of physical, chemical and geomorphologic conditions^1,2,3,4. These factors have structured a highly unique estuarine microbial and viral community^5,6,7. In addition, most estuarine ecosystems are impacted by strong anthropogenic stresses¹. Viruses play essential roles in marine ecosystems by causing host mortality^8,9 and reprogramming the metabolic processes of hosts¹⁰. There is great interest in investigating the genomic characteristics, evolutionary mechanisms, community composition and interactions of microorganisms and viruses in coastal environments^11,12. While the abundance, distribution and function of prokaryotes or viruses in estuaries have been reported by using meta-omics approaches^13,14,15,16,17,18, few studies have investigated bacteria, archaea and viruses simultaneously and none has delineated the potential connections between the microbiome and the virome. Therefore, a holistic estuarine genome dataset recovering both microbiome and virome will allow the analysis of microbe-virus interactions in this unique ecosystem.

The Pearl River is the second largest river in China with an average annual discharge flux of about 3.5 × 10¹¹ m³ fresh water and 8.87 × 10⁷ tons suspended sediment¹⁹. Located in the most densely industrialized and urbanized region in China, the Pearl River is heavily impacted by human activities including agricultural irrigation, industrial and domestic emissions and aquaculture^20,21. While some ecological and genomic studies on the bacterial or viral communities at the Pearl River estuary (PRE) have been performed^13,15,17, none of them has produced a combined dataset including both the microbial hosts and the viruses. Such a dataset is therefore urgently needed to unveil the dynamic and diverse biological processes coupled with physicochemical factors at this estuary.

Here, we generated 15 deeply sequenced metagenomes of surface water from three size fractions collected at five sampling sites along the salinity gradient of the PRE in August 2016 (Fig. 1a). Seawater was filtered sequentially through cellulose membranes. The 0.7–2.7 μm and 0.22–0.7 μm fractions were used to produce particle-attached and free-living prokaryotic metagenomes, respectively. To collect the viral fraction, surface water was prefiltered sequentially through filters of 2.7 μm and 0.22 μm pore size and then concentrated with 30 kilodalton (kDa) filters by tangential-flow filtration. Further concentration and purification were done via polyethylene glycol (PEG) precipitation and cesium chloride (CsCl) step-gradient ultracentrifugation (Fig. 1b). DNA was extracted from the cellular (0.7–2.7 μm and 0.22–0.7 μm) and viral (<0.22 μm) fractions for metagenomic sequencing.

Overall, 13,305,017 contigs were generated by assembling quality-checked sequencing reads (Table 1). A total of 1,205 non-redundant metagenome-assembled genomes (MAGs) with estimated completeness ≥50% and contamination ≤10% were reconstructed based on multi-strategy binning according to the MIMAG criteria²² (Supplementary Table 1). Phylogenomic analysis based on single-copy marker genes showed that these MAGs belonged to 32 bacterial and four archaeal phyla according to the Genome Taxonomy Database (GTDB) taxonomy²³ (Figs. 2, 3). We found that 24.8% and 86.8% of total MAGs did not have close relatives at the genus and species levels, respectively, based on 95% average nucleotide identity (ANI). A total of 78,502 non-redundant viral contigs were predicted from the cellular microbiomes (0.2–2.7 μm) and viromes (<0.2 μm). They were then clustered into 56,289 viral populations^24,25,26. Taxonomic classification of viral populations was performed based on closest relative affiliation²⁴ (Supplementary Table 2). Only 15.3% of populations could be assigned according to the RefSeqVirus database, leaving the majority unclassified. A total of 15 viral families were identified, including ssDNA, dsDNA and ssRNA viruses, and the primary group belongs to the order Caudovirales (Table 2). Virus-host pair prediction was performed based on clustered regularly interspaced short palindromic repeats (CRISPR) spacer matching, and 11 virus-host pairs were identified (Fig. 4). Among them, an Acinetobacter junii and a Rickettsiales bacterium were each predicted to be infected by more than one type of virus.

Table 1 Summary of reads, contigs, MAGs and viral contigs of PRE metagenomes.

Table 2 Nonredundant contigs of abundant viral populations in samples.

All of the primary contigs, non-redundant MAGs and viral-like contigs have been deposited in the National Center for Biotechnology Information (NCBI) BioProject database and the figshare website. The microbial and viral genomes provided here suggest great biological diversity in the PRE ecosystems. This combined dataset allows for systematic study of microbial-viral interactions, including the regulatory mechanisms of viruses in manipulating estuarine biogeochemistry under anthropogenic impacts.

Methods

Sampling, DNA extraction and sequencing

Bacterial, archaeal and viral sample collection and particle size-based fractionation were done by filtration²⁷. To obtain the cellular fractions, about 500 L surface water (0.5–1.0 m in depth) was collected at each sampling site in the PRE in August 2016 within three days (Fig. 1a, Table 3). The water samples were first filtered through 2.7 μm pore-size glass fiber filters (Shanghai Mosutech, Shanghai, China) to remove large particles and the filtrates were then successively filtered through 0.7 and 0.22 μm pore-size membrane filters (Pellicon cartridge, Millipore Corp., Billerica, MA, USA) to collect particle-associated and free-living microbial cells, respectively. The filters were stored temporarily in liquid nitrogen on board and then transferred to −80 °C freezers in the laboratory for long-term storage until further processing. To collect viral particles, 200 L prefiltered seawater was further filtered through 2.7 μm and 0.2 μm pore-size membrane filters. A 30 kDa tangential-flow filtration cartridge (0.5 m² Pellicon cartridge, Millipore Corp., Billerica, MA, USA) was applied to increase viral particle concentration to a final liquid volume of 2 L and the liquid was kept at 4 °C until further processing²⁸. Physicochemical measurements of water and the methods to generate these measurements have been published by He et al.²³. The measurements are also available in Table 3.

Table 3 Sampling locations and bulk properties of PRE surface water.

DNA was extracted from the 0.2 and 0.7 μm pore-size membrane filters by using the FastDNA® SPIN kit for soil (MP Biomedicals, Solon, OH, USA) following the manufacturer’s instructions. For virome samples, a series of enrichment operations were applied to increase the concentration of the viral suspension²⁸ (Fig. 1b). First, PEG8000 (10% w/v) was dissolved in DNase I (Sigma-Aldrich) treated viral concentrate and incubated at 4 °C overnight to precipitate viral particles. The PEG pellet was resuspended after centrifugation (10,000 × g for 1 h) and then purified by CsCl density gradient ultracentrifugation (1.7, 1.5, and 1.35 g/mL CsCl layers). After centrifugation, virus-like particles were concentrated in the 1.5–1.35 g/mL CsCl layers according to the physical properties of various virions. After collection and purification, phenol-chloroform extraction followed by ethanol precipitation was applied to extract viral genomic DNA^14,28.

The extracted prokaryotic and viral DNA was fragmented by sonication to a size of 350 bp. The DNA fragments were then end-polished, A-tailed, and ligated with the full-length adaptor to construct TruSeq metagenome libraries. Libraries were analyzed for size distribution using the Agilent 2100 Bioanalyzer (Agilent, USA) and quantified using real-time PCR. They were then sequenced on an Illumina HiSeq 2500 platform at Novogene Bioinformatics Technology Co., Ltd. (Beijing, China) to generate 150 bp paired-end reads. The FASTQ files containing raw reads are available on NCBI. The overall study workflow is shown in Fig. 1b.

Sequence quality check and assembly

The reads_qc module of MetaWRAP (v1.2.1)²⁹ was applied for adaptor trimming and contamination removal on the raw sequencing reads to generate high-quality clean reads by calling Cutadapt³⁰ and FastQC³¹ with the default parameters. Clean reads of the cellular fractions were assembled into contigs by using MetaWRAP employing MEGAHIT with a k-mer list of 21, 29, 39, 59, 79, 99, 119 and 141²⁹. The IDBA-UD software (v1.1.3) was applied to assemble the viral metagenomes with default parameters³². Contigs longer than 1 kb were used for further analysis as suggested by the MIMAG and the MIUViG standards^22,26 (Table 1).
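
The length cutoff applied to assembled contigs can be illustrated with a minimal Python sketch. The function names and the plain-text FASTA parsing below are assumptions made for illustration; they are not part of MetaWRAP, MEGAHIT or IDBA-UD, and the study's actual filtering commands are not given in the text.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_len: int = 1000) -> Dict[str, str]:
    """Keep contigs strictly longer than min_len bases (here, > 1 kb)."""
    return {name: seq for name, seq in records if len(seq) > min_len}
```

With a file handle opened on an assembly, `filter_by_length(parse_fasta(handle))` would return only the contigs retained for binning or viral prediction under this cutoff.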

MAG generation, refinement, quality check and taxonomic annotation

For each prokaryotic metagenome, MAGs were recovered by using the binning module and bin_refinement module of MetaWRAP²⁹. First, the binning module of MetaWRAP employing MetaBAT³³ and CONCOCT³⁴ was applied to recover the original MAG sets based on tetranucleotide frequencies and read coverage. These MAG sets were pooled and dRep (v2.6.2) was run to remove redundant MAGs³⁵. The bin_refinement module of MetaWRAP was used to refine the MAGs to produce final MAGs. The completeness and contamination of archaeal and bacterial MAGs were estimated by running CheckM (v1.0.11)³⁶ (Supplementary Table 1). Taxonomic classification of the final MAGs was conducted by using GTDB-Tk (v1.3.0, Release 95)³⁷ (Supplementary Table 1). A MAG was considered to belong to the same species as a reference genome if its ANI value to that genome was larger than 95%.
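
The quality and species thresholds used for MAGs can be expressed as two small predicates. This is a sketch only: the `MagQuality` record and its field names are hypothetical and do not reproduce the CheckM or GTDB-Tk output formats; the numeric thresholds (completeness ≥50%, contamination ≤10%, ANI >95%) are those stated in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MagQuality:
    name: str             # hypothetical bin identifier
    completeness: float   # percent, as estimated by a tool such as CheckM
    contamination: float  # percent


def passes_quality(mag: MagQuality,
                   min_completeness: float = 50.0,
                   max_contamination: float = 10.0) -> bool:
    """True if the MAG meets the completeness/contamination cutoffs."""
    return (mag.completeness >= min_completeness
            and mag.contamination <= max_contamination)


def same_species(ani_percent: float, threshold: float = 95.0) -> bool:
    """True if ANI to a reference genome is larger than the threshold."""
    return ani_percent > threshold


def select_mags(mags: List[MagQuality]) -> List[str]:
    """Return the names of MAGs that pass the quality cutoffs."""
    return [m.name for m in mags if passes_quality(m)]
```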

Phylogenomic analysis

We used 41 single-copy marker proteins to infer the maximum likelihood trees of archaeal and bacterial MAGs^38,39, respectively. Specifically, putative coding DNA sequences for each draft genome were predicted by using Prodigal (v2.6.3; -m -p meta)⁴⁰. Putative single-copy genes of each MAG were identified by using hmmsearch (HMMER v.3.1b2; -E 1E-5)⁴¹ based on Hidden Markov Models (HMMs) described by Sunagawa et al.³⁹. Amino acid sequences of these genes were aligned separately for each gene by using Clustal Omega (v1.2.4)⁴² and further automatically trimmed by using trimAl (v1.4.1; -automated1)⁴³. The alignments of proteins were concatenated by using SCaFoS (v1.2.5) and missing data were filled with gaps⁴⁴. The phylogenomic tree of the concatenated alignment was reconstructed by using IQ-TREE (v.2.0.3; -st AA -m LG + PMSF + G -B 1000 --bnni)⁴⁵ and visualized in the Interactive Tree of Life (iTOL, v.5.1.1)⁴⁶.
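
The concatenation step, in which a genome lacking a given marker receives a run of gaps, can be sketched as follows. This is an illustrative stand-in, not SCaFoS itself; the function name and the dictionary-based alignment representation are assumptions.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]],
                           taxa: List[str]) -> Dict[str, str]:
    """Concatenate per-marker alignments into one supermatrix.

    Each alignment maps taxon name -> aligned sequence. A taxon that is
    absent from a marker alignment is padded with '-' for that marker.
    """
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("every alignment needs sequences of equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

Every row of the resulting supermatrix has the same length, which is what tree-inference programs expect from a concatenated alignment.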

Viral contig identification, dereplication and taxonomic classification

Following assembly, putative viral contigs were identified from contigs of all three size fractions with length greater than 1.5 kb by using VirSorter (v1.0.6)⁴⁷ and VirFinder (v1.1)⁴⁸ as described by Gregory et al.²⁴. First, contigs identified as ‘lytic/prophage categories 1 and 2’ and ‘circular’ by VirSorter were assigned as viral contigs. The remaining contigs of length >5 kb were kept for further classification. Among them, those identified as ‘lytic/prophage categories 1 and 2’ by VirSorter, or as viruses by VirFinder with score >0.9 (p < 0.05), were assigned as viral contigs. Those identified as ‘lytic/prophage category 3’ by VirSorter and as viruses by VirFinder with score 0.7–0.9 (p < 0.05) were also assigned as viral contigs. Those identified as ‘lytic/prophage category 3’ by VirSorter but not as viruses by VirFinder with score >0.7 (p < 0.05), and those identified as viruses by VirFinder with score 0.7–0.9 (p < 0.05) but not as ‘lytic/prophage categories 1–3’ by VirSorter, were further analyzed through CAT⁴⁹ and only those having 40% of their genes classified as viral were kept. In total, 97,003 viral contigs were identified. Redundancy among these contig sequences was removed by using CD-HIT at 99% identity (v4.6.8, -c 0.99 -aS 0.99)⁵⁰. The resulting 78,502 non-redundant viral contigs were further grouped into 56,289 viral populations by using nucmer based on the criterion that viral contigs in the same population share 80% of their genes and have 95% average nucleotide identity as previously described^51,52 (Fig. 1b). CheckV (v0.8.1) was used to determine the completeness and quality of the identified viral populations⁵³ (Supplementary Table 3). We used VirSorter to identify prophages by the de novo predictions of categories 4 and 5⁴⁷.
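
The contig-level decision rules in the paragraph above can be read as a small decision function. The sketch below is one interpretation, not the authors' script: the function and argument names are hypothetical, the first rule is read as "category 1 or 2 and circular", score ranges are treated as inclusive at 0.7 and 0.9, and the CAT criterion is assumed to mean "at least 40% of genes classified as viral".

```python
from typing import Optional


def classify_contig(length_bp: int,
                    vs_category: Optional[int],
                    circular: bool,
                    vf_score: float,
                    vf_pvalue: float,
                    cat_viral_fraction: Optional[float] = None) -> str:
    """Return 'viral', 'not_viral' or 'needs_cat' for one contig.

    vs_category is the VirSorter category (None if VirSorter made no call);
    vf_score and vf_pvalue are the VirFinder score and p-value.
    """
    if length_bp <= 1500:              # only contigs > 1.5 kb are considered
        return "not_viral"
    significant = vf_pvalue < 0.05
    vf_high = significant and vf_score > 0.9
    vf_mid = significant and 0.7 <= vf_score <= 0.9
    vs_confident = vs_category in (1, 2)

    if vs_confident and circular:      # assumed reading of the first rule
        return "viral"
    if length_bp <= 5000:              # remaining rules apply to > 5 kb only
        return "not_viral"
    if vs_confident or vf_high:
        return "viral"
    if vs_category == 3 and vf_mid:
        return "viral"
    # Ambiguous cases: category 3 without VirFinder support, or a mid-range
    # VirFinder score without a VirSorter category 1-3 call.
    ambiguous = vs_category == 3 or (vf_mid and vs_category not in (1, 2, 3))
    if ambiguous:
        if cat_viral_fraction is None:
            return "needs_cat"
        # Assumption: 'having 40% genes' is taken as at least 40%.
        return "viral" if cat_viral_fraction >= 0.40 else "not_viral"
    return "not_viral"
```

Returning `needs_cat` when no CAT result is supplied keeps the ambiguous contigs visible, so that the gene-level classification can be run only on that subset.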

Taxonomic classification of viral populations was performed with a complementary approach by using vConTACT2⁵⁴ and blastp⁵⁵. First, the ORFs of each population were predicted by using Prodigal⁴⁰. Second, the protein sequences of population contigs >10 kb were analyzed by using vConTACT2 with Viral RefSeq release 201 based on genome gene-sharing profiles. Then, the family-level taxonomy of the remaining populations, including those that could not be assigned by vConTACT2, was defined by closest relative affiliation using blastp against the Viral RefSeq database with the following criteria: identity ≥30%, bit-score ≥50, and E value ≤0.001. A population was assigned to a viral family only if more than half of its proteins were assigned to that family (Supplementary Table 2).
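
The blastp-based family assignment amounts to a hit filter followed by a majority vote. The following sketch assumes one best hit per protein and uses the total number of predicted proteins in the population as the denominator; the `Hit` record and function names are hypothetical, while the thresholds are those given above.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    protein_id: str   # query protein from the viral population
    family: str       # viral family of the best-matching reference protein
    identity: float   # percent identity
    bitscore: float
    evalue: float


def passes_thresholds(hit: Hit) -> bool:
    """Identity >= 30%, bit-score >= 50 and E value <= 0.001."""
    return hit.identity >= 30.0 and hit.bitscore >= 50.0 and hit.evalue <= 1e-3


def assign_family(n_proteins: int, best_hits: Iterable[Hit]) -> Optional[str]:
    """Return a family if more than half of all proteins point to it."""
    counts = Counter(h.family for h in best_hits if passes_thresholds(h))
    if not counts:
        return None
    family, votes = counts.most_common(1)[0]
    return family if votes > n_proteins / 2 else None
```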

Host prediction of viral sequences

In order to link viral contigs to their putative microbial hosts, CRISPR spacers in MAGs were identified by using CRISPRDetect (v2.5)⁵⁶. Spacer sequences were then matched to viral contigs by using fuzznuc⁵⁷. Host and virus infection networks were reconstructed in Cytoscape (v3.8.0)⁵⁸.
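
Spacer-to-contig matching of this kind can be sketched in pure Python as a search for each spacer, in both orientations, within each viral contig. This is a simplified stand-in for a pattern-matching tool such as fuzznuc; the function names are hypothetical, and the number of mismatches allowed in the study is not stated here, so the default of zero is an assumption.

```python
from typing import Dict, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_matches(pattern: str, text: str, max_mismatches: int = 0) -> List[int]:
    """Return start positions where pattern occurs in text,
    allowing up to max_mismatches substitutions (no indels)."""
    if not pattern:
        return []
    hits = []
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # loop finished without exceeding the mismatch budget
            hits.append(start)
    return hits


def link_hosts(spacers: Dict[str, List[str]],
               contigs: Dict[str, str],
               max_mismatches: int = 0) -> Set[Tuple[str, str]]:
    """Return (MAG, viral contig) pairs supported by a spacer match."""
    pairs: Set[Tuple[str, str]] = set()
    for mag, spacer_list in spacers.items():
        for spacer in spacer_list:
            for oriented in (spacer.upper(), reverse_complement(spacer).upper()):
                for contig_id, seq in contigs.items():
                    if find_matches(oriented, seq.upper(), max_mismatches):
                        pairs.add((mag, contig_id))
    return pairs
```

The resulting set of pairs is an edge list, which is the form a network tool needs to draw a host-virus infection network.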

Data Records

Raw reads generated in this study have been deposited in the National Center for Biotechnology Information BioProject database with the project ID PRJNA763043⁵⁹. Contigs, MAGs, viral genomes and source data files including the genome trees and associated amino acid alignments have been deposited in the figshare website⁶⁰. A full copy of this dataset is also available in the National Omics Data Encyclopedia (https://www.biosino.org/node/) with the project ID OEP001662⁶¹.

Technical Validation

Additional technical validation should be applied by researchers to confirm the accuracy of draft MAGs and VAGs used for specific downstream purposes.

Code availability

All versions of third-party software and scripts used in this study are described and referenced accordingly in the Methods sub-sections for ease of access and reproducibility.

References

Best, J. Anthropogenic stresses on the world’s big rivers. Nat. Geosci. 12, 7–21 (2018).
Carvalho, T. M. & Fidelis, T. The relevance of governance models for estuary management plans. Land Use Policy 34, 134–145 (2013).
Zapata, C., Puente, A., Garcia, A., Garcia-Alba, J. & Espinoza, J. Assessment of ecosystem services of an urbanized tropical estuary with a focus on habitats and scenarios. PLoS One 13, e0203927 (2018).
Campbell, B. J. & Kirchman, D. L. Bacterial diversity, community structure and potential growth rates along an estuarine salinity gradient. ISME J. 7, 210–220 (2013).
Sun, M. et al. Uncultivated viral populations dominate estuarine viromes on the spatiotemporal scale. mSystems 6, e01020–20 (2021).
Liu, L., Cai, L. & Zhang, R. Co-existence of freshwater and marine T4-like myoviruses in a typical subtropical estuary. FEMS Microbiol. Ecol. 93 (2017).
Xie, W. et al. Localized high abundance of Marine Group II archaea in the subtropical Pearl River Estuary: implications for their niche adaptation. Environ. Microbiol. 20, 734–754 (2018).
Chen, X., Ma, R., Yang, Y., Jiao, N. & Zhang, R. Viral regulation on bacterial community impacted by lysis-lysogeny switch: a microcosm experiment in eutrophic coastal waters. Front. Microbiol. 10, 1763 (2019).
Manea, E. et al. Viral infections boost prokaryotic biomass production and organic C cycling in hadal trench sediments. Front. Microbiol. 10, 1952 (2019).
Warwick-Dugdale, J., Buchholz, H. H., Allen, M. J. & Temperton, B. Host-hijacking and planktonic piracy: how phages command the microbial high seas. Virol. J. 16, 15 (2019).
Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
Ni, Z., Huang, X. & Zhang, X. Picoplankton and virioplankton abundance and community structure in Pearl River Estuary and Daya Bay, South China. J. Environ. Sci. (China) 32, 146–154 (2015).
Cai, L., Zhang, R., He, Y., Feng, X. & Jiao, N. Metagenomic analysis of virioplankton of the subtropical Jiulong River estuary, China. Viruses 8, 35 (2016).
Zhang, C. et al. The communities and functional profiles of virioplankton along a salinity gradient in a subtropical estuary. Sci. Total Environ. 759, 143499 (2021).
Liu, Y. et al. Differences in metabolic potential between particle-associated and free-living bacteria along Pearl River Estuary. Sci. Total Environ. 728, 138856 (2020).
Fortunato, C. S. & Crump, B. C. Microbial gene abundance and expression patterns across a river to ocean salinity gradient. PLoS One 10, e0140578 (2015).
Natascha, S. & Meinhard, S. Composition and dynamics of particle-associated and free-living bacterial communities in the Weser estuary, Germany. Aquat. Microb. Ecol. 30, 221–237 (2003).
Zhang, J. et al. The subtropical Zhujiang (Pearl River) Estuary: Nutrient, trace species and their relationship to photosynthesis. Estuarine Coastal and Shelf Science 49, 385–400 (1999).
Li, Y. et al. Distribution, seasonality, and fluxes of dissolved organic matter in the Pearl River (Zhujiang) estuary, China. Biogeosciences 16, 2751–2770 (2019).
Peng, X. et al. Persistence, temporal and spatial profiles of ultraviolet absorbents and phenolic personal care products in riverine and estuarine sediment of the Pearl River catchment, China. J. Hazard Mater. 323, 139–146 (2017).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
He, C. et al. Molecular composition and spatial distribution of dissolved organic matter (DOM) in the Pearl River Estuary, China. Environ. Chem. 17, 240–251 (2020).
Gregory, A. C. et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell 177, 1109–1123 e1114 (2019).
Roux, S., Emerson, J. B., Eloe-Fadrosh, E. A. & Sullivan, M. B. Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 5, e3817 (2017).
Roux, S. et al. Minimum information about an uncultivated virus genome (MIUViG). Nat. Biotechnol. 37, 29–37 (2019).
Karsenti, E. et al. A holistic approach to marine eco-systems biology. PLoS Biol. 9, e1001177 (2011).
Thurber, R. V., Haynes, M., Breitbart, M., Wegley, L. & Rohwer, F. Laboratory procedures to generate viral metagenomes. Nat. Protoc. 4, 470–483 (2009).
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12 (2011).
Brown, J., Pirrung, M. & McCue, L. A. FQC Dashboard: integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool. Bioinformatics 33, 3137–3139 (2017).
Peng, Y., Leung, H. C., Yiu, S. M. & Chin, F. Y. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 7, e7359 (2019).
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics 36, 1925–1927 (2019).
Martinez-Gutierrez, C. A. & Aylward, F. O. Phylogenetic signal, congruence, and uncertainty across bacteria and archaea. Mol. Biol. Evol, msab254 (2021).
Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).
Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–37 (2011).
Sievers, F. & Higgins, D. G. in Multiple Sequence Alignment Methods (ed. Russell, D. J.). Ch. 6, 105–116 (Humana Press, 2014).
Capella-Gutierrez, S., Silla-Martinez, J. M. & Gabaldon, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Roure, B., Rodriguez-Ezpeleta, N. & Philippe, H. SCaFoS: a tool for selection, concatenation and fusion of sequences for phylogenomics. BMC Evol. Biol. 7(1), 1–12 (2007).
Minh, B. Q. et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: mining viral signal from microbial genomic data. PeerJ. 3, e985 (2015).
Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69 (2017).
von Meijenfeldt, F. A. B., Arkhipova, K., Cambuy, D. D., Coutinho, F. H. & Dutilh, B. E. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol. 20, 217 (2019).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Brum, J. R. et al. Ocean plankton. Patterns and ecological drivers of ocean viral communities. Science 348, 1261498 (2015).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Biswas, A., Staals, R. H., Morales, S. E., Fineran, P. C. & Brown, C. M. CRISPRDetect: A flexible algorithm to define CRISPR arrays. BMC Genomics. 17, 356 (2016).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the european molecular biology open software suite. Trends Genet. 16, 276–277 (2000).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP337539 (2021).
Xu, B. et al. Pearl river estuary microbiome and virome. figshare https://doi.org/10.6084/m9.figshare.16618255.v7 (2021).
National Omics Data Encyclopedia https://www.biosino.org/node/project/detail/OEP001662 (2021).

Acknowledgements

This study was supported by National Natural Science Foundation of China (Nos. 91951120, 91851210, 42141003), the State Key R&D Project of China Grant (No. 2018YFA0605800), the Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Southern University of Science and Technology (ZDSYS201802081843490), the Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) (No. K19313901), and the Project of Educational Commission of Guangdong Province of China (No. 2020KTSCX123). Computation in this study was supported by the Centre for Computational Science and Engineering at the Southern University of Science and Technology.

Author information

Authors and Affiliations

School of Environment, Harbin Institute of Technology, Harbin, China
Bu Xu
Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China
Bu Xu, Lu Fan & Chuanlun Zhang
Daniel K. Inouye Center for Microbial Oceanography: Research and Education (C-MORE), University of Hawaii, Honolulu, Hawaii, USA
Fuyan Li
Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong, China
Lanlan Cai
State Key Laboratory of Marine Environmental Science, Fujian Key Laboratory of Marine Carbon Sequestration, College of Ocean and Earth Sciences, Xiamen University, Xiamen, Fujian, China
Lanlan Cai & Rui Zhang
Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, China
Rui Zhang
Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou, China
Lu Fan & Chuanlun Zhang

Contributions

R.Z., L.F. and C.Z. conceived this study. B.X., F.L. and L.C. collected the samples and extracted DNA. B.X. analyzed the metagenome data, produced the genomes and conducted all other analyses. B.X., R.Z., L.F. and C.Z. interpreted the results and drafted the manuscript. All authors contributed to the final version of the manuscript.

Corresponding author

Correspondence to Lu Fan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Supplementary Table 2

Supplementary Table 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.


About this article

Cite this article

Xu, B., Li, F., Cai, L. et al. A holistic genome dataset of bacteria, archaea and viruses of the Pearl River estuary. Sci Data 9, 49 (2022). https://doi.org/10.1038/s41597-022-01153-4


Received: 01 October 2021
Accepted: 12 January 2022
Published: 14 February 2022
DOI: https://doi.org/10.1038/s41597-022-01153-4
Springer Nature Limited

This article is cited by

Ecophysiology and genomics of the brackish water adapted SAR11 subclade IIIa
- V Celeste Lanclos
- Anna N Rasmussen
- J Cameron Thrash
The ISME Journal (2023)
Mining of novel secondary metabolite biosynthetic gene clusters from acid mine drainage
- Ling Wang
- Wan Liu
- Guoqing Zhang
Scientific Data (2022)
