Background & Summary

The White-eared Night-Heron G. magnificus is a nocturnal wader bird that is mainly distributed in southern China and northern Vietnam1,2. It belongs to the Ardeidae family and inhabits dense forests with abundant-watered areas, marshes, and reservoirs3. This species is listed as an Endangered species on the IUCN Red List4 with an estimated population of 250–999 mature individuals1 and is poorly known due to its rarity and few localities in the wild4,5. The increasing demands of humans for timber and agricultural land, intensive use of agricultural chemicals, and hunting are the main threats to its survival. It is urgent and challenging to evaluate the genetic status of G. magnificus due to conservation status and scattered localities2. Mitochondrial phylogenetic relationships suggest G. magnificus is not closely related to other members of Gorsachius but might be more closely related to heron species6. A recent phylogeny robust of herons based on ultraconserved elements revealed this species is a sister species of the African-distributed White-backed Night Heron (Gorsachius leuconotus)7. However the purpose of assembling a chromosome-level genome of an endangered night heron species has three advantages: 1) to be a high-quality reference genome for other population genomic studies in the family Ardeidae; 2) to allow comparative genomic studies of nocturnal birds to reveal local adaptation; 3) to carry out conservation genomics of this endangered species. Therefore, the availability of this genome facilitates tackling some challenges in evolution, conservation, and ecological studies8.

In this research, we have successfully generated a high-quality reference genome of G. magnificus at the chromosomal level, employing a comprehensive approach that integrates PacBio long-read sequencing, chromosome conformation capture (Hi-C) technology, and Illumina platform paired-end short-read sequencing. The assembled genome spanned a total length of 1.176 Gb, organized into 539 contigs and 79 scaffolds. The contig N50 length reached 18.46 Mb, while the scaffold N50 length was 84.77 Mb. Subsequently, 29 pairs of chromosomes with a total of length 1.175 Gb were anchored utilizing Hi-C technology, which corresponds to 99.89% of the assembled sequences. Moreover, we have identified 18,082 protein-coding genes based on de novo and homolog-based strategies, and 95% of these genes (17,177) were functionally annotated in publicly available databases including Gene Ontology, KEGG, and Pfam. Additionally, a BUSCO analysis demonstrated the completeness of 97.2% of annotated genes. This high-quality genome not only offers a reference genome for conservation genomics of G. magnificus but also facilitates phylogenomic and comparative genomic studies on a relatively understudied avian family, Ardeidae.

Methods

Ethics statement

Sample collection for scientific research purposes was in accordance with the ethical conditions in the Chinese Animal Welfare Act (20090606) and was approved by Forestry Administration of Guangdong Province, China (DFGP Project of Fauna of Guangdong-202115).

Sampling, DNA/RNA extraction, library construction, and sequencing

In our study, we gathered samples from a dead male G. magnificus specimen, which had resided at a wildlife rescue center in Shandong, Guangdong, China. In the genomic assembly of bird species, sampling female individuals offers an opportunity to obtain the W chromosome. However, we did not achieve this ideal condition for such an endangered and cryptic species. Genomic DNA was extracted from muscle tissue and utilized for whole genome sequencing and subsequent de novo assembly. Additionally, we obtained a total of nine RNA samples from various tissues within the same male individual, including the brain, lung, testis, thigh muscle, liver, pectoralis muscles, wing muscle, cardiac muscle, and eyeball, for RNA sequencing (RNA-seq) analysis.

We extracted high-molecular-weight genomic DNA from muscle samples following the instructions of CTAB (Cetyl trimethyl ammonium bromide), specifically for the purpose of de novo genome sequencing. We assessed the integrity and quality of the genomic DNA through agarose gel electrophoresis and a Qubit Fluorometer. For genome survey and polishing, we sequenced a single shotgun library with a 350-bp insert size on the Illumina NovaSeq 6000, yielding approximately 35.79 Gb (equivalent to a 30.43X coverage) of 150-bp paired-end reads (see Table 1). To facilitate genome assembly, SMRTbell libraries were created with an average 20-kb insert size using the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences). Subsequently, we employed Blue Pippin (Labgene Scientific) to select fragment sizes, and we conducted library sequencing using the PacBio platform, utilizing single molecule real-time (SMRT) sequencing (PacBio RSII) technology, which generated a total of 128.52 Gb (equivalent to a 109.29X coverage) of data (see Table 1). For Hi-C sequencing, we fixed muscle tissue from the same male individual intended for de novo genome sequencing with 37% formaldehyde. Following a 10-minute incubation at room temperature, we halted the cross-linking reaction with 2.5% formaldehyde. We then collected the precipitated cells for Hi-C library preparation. A single Hi-C library was constructed and subjected to paired-end sequencing with 150 bp reads on the Illumina NovaSeq 6000 Sequencing System, resulting in a total of 110.6 Gb (94.05X coverage) of data (see Table 1). For RNA sequencing, we extracted total RNA from nine different tissues using the RiboPureTM RNA Purification Kit (Ambion®) and assessed its integrity with the RNA Nano 6000 Assay Kit on the Bioanalyzer 5400 system (Agilent Technologies, CA, USA). Following the manufacturer’s instructions, we constructed RNA libraries. All libraries were subjected to 150-bp paired-end sequencing on the Illumina NovaSeq 6000, and after adapter trimming and quality filtering using fastp (v 0.23.2)9 with default parameters, we obtained a total of 57.6 Gb of high-quality RNA sequencing data (see Table 2).

Table 1 Summary of sequencing strategy.
Table 2 Summary statistical of nine tissue’s transcriptome sequencing data.

Genome assembly

The chromosome-scale reference genome was assembled by combining PacBio long reads, Illumina short reads, and Hi-C sequencing data. Firstly, We obtained a genome size estimation of 1,181.27Mb and a heterozygosity rate of 0.397% by jellyfish (v 2.3.0)10 and GenomeScope (v 2.0.0)11 using Illumina short reads (Fig. 2). Then, we generated the initial draft contig assemblies based on high-coverage PacBio long reads data using wtdbg2 (v 2.5) and wtpoa-cns (v 1.1)12, the draft genome of the G. magnificus contained 539 contigs with a contig N50 of 18.46 Mb and a total length of 1.176 Gb (Table 3). Subsequently, PacBio long-read and additional Illumina paired-end short reads were applied to polish the draft genome following the wtdbg2 pipeline. To join contigs into scaffolds, 3D-DNA (v 190716)13 was used to produce the initial chromosomal results by aided by Hi-C data. These scaffolds were roughly reviewed and adjusted using Juicebox (v 1.11.08)14 and then further polishing the assembly using 3D-DNA. Finally, we obtained the chromosome-level assembly genome consisting of 79 scaffolds with an N50 of 84.77 Mb, and 99.89% genome was reordered and anchored onto 29 pairs of chromosomes (Table 3), with their lengths ranging from 60.39 kb to 218.43 Mb (Fig. 1, Table 4). The GC contents of the final assembled genome were 42.88% (Fig. 1, Table 3).

Table 3 Summary of Gorsachius magnificus genome assembly and annotation.
Fig. 1
figure 1

Characterization of assembled Gorsachius magnificus genome. From inner to outer layers for circle figure (right): Collinearity of different chromosomes, Distribution of SNPs, Genes abundance, Chromosomes, Density of GC content. (Left figure provided by Liao Zhikai).

Table 4 Summary of chromosome length of Gorsachius magnificus genome.

To assess the quality of genome assembly and annotation, we used BUSCO (v 5) (Benchmarking Universal Single-Copy Orthologs)15 with aves_odb10 contains 8,338 single-copy orthologs as a reference to evaluate the completeness of the genome, the BUSCO research for assessment of the genome completeness showed that 96.6% of the BUSCO genes were complete, 0.7% were fragmented, and 2.7% were missing (Table 3).

Repeat sequences and genome annotation

To annotate the genome of the G. magnificus, we identified 122.4 Mb of repetitive sequences, accounting for 10.41% of the genome by a combination of homology-based and de novo-based identification, manual curation, and classification16 (Table 3). Repeatmasker (v 4.1.2)17 was used to search homology sequence from the Repbase library (v 20181026) for Aves, and de novo prediction was performed using Repeatmodeler (v 2.0.8)18, we furthermore used an additional method, EDTA (v 2.0.1)19, to annotate LTRs. This method combines the raw predictions of LTRharvest (v 1.1)20, LTR_FINDER_parallel (v 1.2)21, and LTR_retriever (v 2.9)22.

Next, we utilized three methods to predict protein-coding genes: transcriptome-based prediction, homology-based prediction, and de novo prediction. First, we assembled the transcriptome data from different tissues using Trinity (v 2.13.2)23. The homologous gene sets were obtained from the protein and transcript sequences of 15 proximate bird genomes (Table S1). Then, we performed Maker (v 3.01.03)24 for de novo prediction, and GeMoMa (v 1.7.1)25 for homology-based prediction to identify the protein-coding genes. Finally, we generated non-redundant gene sets from the three data sets. We also used GeMoMa to predict the UTR regions of protein-coding genes based on transcriptome sequences.

The functional annotation of protein-coding genes was constructed by mapping gene sets to protein databases Gene Ontology (GO)26 and Kyoto Encyclopedia of Genes and Genomes (KEGG)27 using eggNOG-mapper28.

We also used the combination of de novo, homolog-based, and transcript-based methods, 18,062 protein-coding genes (Genes) were predicted, and 97% of predicted regions with UTR region. A total of 17,177 (95% of Genes) were successfully annotated with at least one function term by searching against functional databases (Gene Ontology, KEGG, Pfam). we also used BUSCO (v 5) to evaluate the completeness of the annotation, The analysis for assessment of annotation completeness revealed a complete recall of 97.2% (83.9% single-copy; 13.3% duplicated) of genes, 0.9% fragmented, and 1.2% missing (Table 3).

Data Records

The Raw data of PacBio are deposited into NCBI SRA with accession number SRR26858085, and the Illumina WGS, Hi-C, and RNA-seq sequencing data were stored under accession numbers SRX22552595- SRX2255260729. The genome assembly has been deposited in the GeneBank database under the accession number JAXBDB00000000030. The genome annotations are available from the Figshare repository31.

Technical Validation

The assembled genome of has a size of 1.176 Gb, and its scaffold N50 is 84.77 Mb. This is very close to the estimated size of 1.181 Gb from our kmer-analysis (Fig. 2). The Hi-C heatmap displays a well-organized interaction pattern within the chromosomal regions (Fig. 3). Notably, 99.89% of the genome bases have been anchored onto 29 pairs of chromosomes (Figs. 1, 3, Table 4). The genome assembly quality assessment shows 96.6% completeness using BUSCO, with the protein-coding sequences achieving 97.2% completeness (Table 3).

Fig. 2
figure 2

K-mer frequency and genome size evaluation of Gorsachius magnificus genome.

Fig. 3
figure 3

Hi-C interactive heatmap of genome-wide of Gorsachius magnificus. The coordinates in the figure indicate genome length. The deeper red means a stronger interaction between the genomics regions.

We used the sequence alignment method to evaluate possible contamination and the completeness of the genome assembly. The Burrows-Wheeler Aligner (BWA, v 0.7.6)32 was used to map Illumina short reads to the assembled genome with default parameters. Importantly, the relationship between sequencing depth and GC content distribution revealed no deviations from the expected levels, allaying concerns about contamination or sequencing biases (Fig. 4). Additionally, our mapping results indicate that 99.03% of reads were successfully mapped, and the coverage rate was approximately 99.77% (Table 5), confirming the alignment consistency between the reads and the assembled genome. Compared to other assembled avian genomes, the main measures like the scaffolds N50 and the number of genes are very close to G. gallus and a recently published chromosome-level genome A. baeri) (Table 6).

Fig. 4
figure 4

The average sequencing depth and the distribution of GC content of Gorsachius magnificus. The specific GC content and sequencing depth were calculated in each window with a 1000bp size, corresponding to a dot in the figure. The Y-axis and left histogram represent the average sequencing depth, the peak value of the left histogram is about 25. The X-axis and top histogram distribution represent the GC content, the peak value of the top histogram is about 0.4.

Table 5 Statistical results of Illumina short-reads alignment.
Table 6 Comparative analysis of other assembled avian genomes.