Background & Summary

Echiura, commonly referred to as spoon worms, are bilaterally symmetrical and coelomate marine invertebrates with a sausage-shaped body living in burrows in the sediments. They possess annelid-like morphological features, including the ladder-like nervous system, the ultrastructure of cuticle and chaetae, and the larval nervous system segments1, however, they have secondarily lost segmentation as adults, providing a particularly important model for understanding the mechanism underlying segment formation and secondary loss2. Furthermore, the evolutionary relationship between the Echiura and Annelida is in a long- standing controversy. Given the lack of segmentation as adults, Echiura has generally been regarded as a separate phyla closely related to Annelida. Recently, some researchers advised the inclusion of Echiura into Annelida based on the increasing amounts of morphological, molecular phylogenetic and phylogenomic evidence, including expressed sequence tags, transcriptome, and mitochondrial genome3,4,5,6,7,8. As solid evidence for a better understanding of their deep-level evolutionary relationships, phylogenomic analysis from high-quality chromosome-level genome data of the Echiura is still lacking.

Urechis unicinctus (also called the penis fish or the fat innkeeper worm), belonging to the Echiura, is a deposit-feeding burrowing organism that inhabits the intertidal zones along the Korean and Japanese coast and Bohai Gulf on the northeast coast of China. The intertidal zones are peculiar and dynamic areas which are vulnerable to a host of stressors, like steep gradients in temperature and oxygen concentration, threats from pathogen infections, pollution, and toxic substances9. The U. unicinctuss without adaptive immunity can survive in such harsh environments, providing an exciting resource for investigating environmental adaptative evolution. In addition, this endemic Echiuran species has essential ecological and socioeconomic significance and has become an important cultured aquaculture species due to its desirable flavour, nutrient-rich and high medicinal values in Asian countries, especially in China, Japan, and Korea10. The first draft genome assembly of U. unicinctuss based on Illumina short reads was published in 202111. However, due to the limitations of the sequencing technique and assembly algorithm, the genome assembly with a contig N50 length of 0.458 kb and scaffold N50 length of 0.517 kb remains highly fragmented (Table 1), which lags far behind the demand for further study of genetic and molecular in U. unicinctuss. Hence, a high-quality chromosome-scale genome assembly of U. unicinctuss are essential in elucidating its genome evolution and adaptive evolution, and providing theoretical support for the species’ culture.

Table 1 Comparative statistics of U. unicinctus genome assemblies in 2021 and 2023.

Here, we present a high-quality chromosome-scale genome assembly of U. unicinctus obtained by combining Illumina, PacBio, and high-throughput chromosome conformation capture (Hi-C) sequencing technology toolkits. The U. unicinctus genome, with a total size of ~1,138.6 Mb, was assembled into 1,394 scaffolds (N50 = 68.3 Mb). A total of 1,113.8 Mb assembled sequences (97.82%) were further anchored to 17 pseudochromosomes (Fig. 1). The quality of the genome assembly is significantly higher than that of the previously published version (NCBI accession No. PRJNA603659), with contig N50 being ~1,160 times higher and scaffold N50 being ~135,472 times higher (Table 1)11. The completeness, accuracy, and contiguity of the genome assembly were evaluated by Benchmarking Universal Single-Copy Ortholog (BUSCO) analysis, Core Eukaryotic Genes Mapping Approach (CEGMA), re-alignment between clean Illumina reads and the genome assembly, and SNP identification. Of the assembled genome, 482.1 Mb (42.34%) were repetitive sequences with a dominance of DNA elements. Additionally, a total of 21,524 protein-coding genes were annotated, of which 99.5% could be functionally annotated. This chromosome-level genome assembly builds the foundation for the understanding of genome evolution and evolutionary adaption and provides a valuable tool for further studies on the genetic and molecular biology of U. unicinctus.

Fig. 1
figure 1

Overview of U. unicinctus genome. (a) Circos plot of genomic features. From outer to inner ring: (A) gene density; (B) GC content; (C) transposable element abundance; (D) Tandem repeat density. Insert: image of adult U. unicinctus. (b) Pseudochromosomes interaction heatmap of U. unicinctus based on Hi-C assembly. Color block indicates the intensity of interaction from yellow (low) to red (high).

Methods

Samples collection and whole-genome sequencing

Adult U. unicinctus samples were obtained from the field of Xiyan, Yantai, Shandong Province, China (121°25’E, 37°56’N), and genomic DNA extracted from the muscle tissue was collected for whole-genome sequencing using a QIAGEN DNeasy Blood & TissueKit (QIAGEN, Shanghai, China). Paired-end Illumina sequence library with insert size 350 bp and 10× Genomics linked-read library were sequenced by Illumina HiSeq X Ten platform with 97.74 Gb of short- read sequencing data (Table 1). For long-read sequencing, a library with an insert of 20 kb was constructed using SMRTbell Template Prep Kits, followed by PacBio single-molecule real- time (SMRT) sequencing using Pacbio Sequel Platform (Pacific Biosciences, Menlo Park, USA), generating approximately 142.1 Gb of long-read raw data.

Transcriptome sequencing

For transcriptome sequencing, four tissues, including intestines, gonads, blood, and muscle, were sampled from the same individual and stored in liquid nitrogen. RNA was extracted from these tissues and used for transcriptome sequencing, respectively. The cDNA paired- end libraries were prepared and sequenced on an Illumina HiSeq X sequencer (Paired-end 350 bp reads). Approximately 26.51 Gb of clean data were yielded from the RNA-seq raw data after quality control using fastp v. 0.21.012 (Table 1).

Genome size estimation and assembly

Jellyfish v. 2.1.3 method13 with k-mer distribution was employed to calculate k-mer frequency (k = 17) based on the high-quality paired-end reads (with an insert size of 350 bp). The distribution of 17-mer depends on the characteristic of the genome and follows a Poisson’s distribution. The genome size was estimated to be 1,396.33 MB with K-mer depth of 58. The genome heterozygosity and repeat ratio are 1.25% and 53.86%, respectively (Table 2).

Table 2 The genome size estimation of U. unicinctus by k-mer distribution.

The WTDBG software v. 2.5, https://github.com/ruanjue/wtdbg) was used to assemble the contig of the U. unicinctus genome with parameters setting as ‘--node-drop 0.20 --node-len 2304 --node-max 150 -s 0.05 -e 3′. Then, Racon v. 1.3.114 with default parameters was used to correct errors of contigs assembly by PacBio data. The resulting contigs were connected to super-scaffolds by 10× Genomics linked-read data using the fragScaff software v. 140324 with parameters setting as ‘-maxCore 200 -m 3000 -q 30 -C 5’15. Lastly, pilon v. 1.22 with parameters setting as ‘-Xmx300G --diploid --threads 20’16 was used to perform the second round of error correction with short paired-end reads generated from Illumina Hiseq X Ten Platforms. The total length of the contig assembly was 1130.4 Mb with the contig N50 size of 528.1 Kb (Table 3). For the scaffolding step, SSPACE v. 3.017 was first used to construct scaffolds using HiSeq data from all the mate-pair libraries (2 kb, 5 kb, 10 kb and 20 kb). FragScaff v. 140324 was further applied to build superscaffolds using the barcoded sequencing reads, generating a genome with a scaffold N50 size of 1080.3 Kb. The total length of this version is 1146.5 Mb.

Table 3 The de novo assembly statistics of U. unicinctus genome.

Hi-C library construction, sequencing and pseudo-chromosome anchoring

The Hi-C library was constructed by a standard protocol described previously with certain modifications18. Briefly, the mussel tissue of U. unicinctus was fixed with 1% formaldehyde solution in MS buffer (10 mM potassium phosphate, pH 7.0; 50 mM NaCl; 0.1 M sucrose), and the nuclei were enriched from flow-through and subsequently digested with HindIII restriction enzyme (NEB). Biotin-labeled DNAs were ligated and purified, followed by fragmenting to a size of 300–500 bp. After a quality control process, the constructed Hi-C library was sequenced on an Illumina HiSeq X Ten sequencer with paired-end 350 bp. In total, 159.47 Gb of high-quality Hi-C data with 132.89 × coverage was acquired (Table 1).

The clean Hi-C paired-end reads were assembled using ALLHIC v. 0.9.819 containing five steps, namely pruning, partition, rescue, optimizing and building, with the following parameter settings: “allhic partition --pairsfile group.clean.pairs.txt --contigfile group.clean.countsGATC.txt -K 26 --minREs 50 --maxlinkdensity 3 --NonInformativeRabio 0”. Ultimately, the size of chromosome-level genome assembly is ~1113.8 Mb, of which 97.82% were anchored into 17 pseudo-chromosomes ranging from 41.8 Mb to 86.6 Mb in length (Fig. 1b and Table 4), containing 7,429 contigs with N50 of 531.5 kb and 1,394 scaffolds with N50 of 68.3 Mb (Table 5).

Table 4 The statistics of anchored rate and chromosome-level scaffold lengths.
Table 5 The assembly statistics for Hi-C.

Annotation of repeats and non-coding RNA (ncRNA)

Homologous comparison and de novo prediction were applied to annotate the repeated sequences in the assembled genome. For homologous comparison, the RepeatMasker v. 4.0.720 and the associated RepeatProteinMask v. 4.0521 were performed to align against Repbase database22. For ab initio prediction, LTR_FINDER v.1.0723, RepeatScout v. 1.0524 and RepeatModeler v. 1.0525 were first used for de novo candidate database constructing of repetitive elements. The repeated sequences were annotated using RepeatMasker v. 4.0.7 Tandem repeat sequences were de novo predicted using TRF v. 4.07b26. In total, 482.1 Mb repetitive sequences were annotated, accounting for 42.34% of the assembled U. unicinctus genome (Table 6). Among the repetitive sequences, DNA transposons (DNA), long interspersed elements (LINE), short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs) accounted for 20.72%, 4.26%, 0.25%, and 10.60% of the whole genome, respectively (Table 7).

Table 6 The annotation of repeated sequences in U. unicinctus genome.
Table 7 Summary statistics of repeat annotation in U. unicinctus assembly.

For ncRNA annotation, the tRNAs were predicted using tRNAscan-SE v. 1.3.1 software27, and the rRNAs fragments were identified by searching against the Human rRNA database using BLAST with an E-value of 1E-10. Other ncRNAs, including microRNAs (miRNA) and small nuclear RNAs (snRNAs) were predicted by INFERNAL v. 1.1rc428 using Rfam database29. Finally, a total of 5,908 ncRNAs were annotated, including 1,535 miRNAs, 3,431 tRNAs, 124 rRNAs, and 348 snRNAs in U. unicinctus genome(Table 8).

Table 8 Annotation of non-coding RNA genes in U. unicinctus assembly.

Protein-coding gene prediction and function annotation

The structure of protein-coding genes were predicted by homology-based prediction, de novo prediction and transcriptome-based methods. For homologous annotation, the protein sequences of Helobdella robusta (GCA000326865.1), Capitella teleta (GCA000328365.1), Lottia gigantea (GCA000327385.1), Crassostrea gigas (GCA000297895.2), Mizuhopecten yessoensis (GCA002113885.2), Octopus bimaculoides (GCA001194135.2), Drosophila melanogaster (GCA000001215.4), Anopheles gambiae (GCA000005575.1), Caenorhabditis elegans (GCA004526295.1), Mnemiopsis leidyi (GCA000226015.1), Nematostella vectensis (GCA_932526225.1), Trichoplax adhaerens (GCA000150275.1), Branchiostoma floridae (GCA015852565.1), Homo sapiens (GCA000001405.29) were downloaded from the NCBI’s Genbank database, and aligned against U. unicinctus genome using TBLASTN v. 2.2.2630. The matching proteins were conjoined by Solar software v. 0.9.631, and then aligned to homologous genome sequences for structural prediction by GeneWise v. 2.4.132 (referred to “Homolog” in Table 9). Clean data of RNA-sequencing (RNA-seq) derived from intestines, blood, gonad, and muscle were assembled with Trinity (v2.0)33, and were then aligned against U. unicinctus genome using Program to Assemble Spliced Alignment (PASA)34 (referred to “PASA” in Table 9). Simultaneously, Augustus v. 3.2.335, GeneID v. 1.436, GeneScan37, GlimmerHMM v. 3.0.338, and SNAP39 were employed for ab initio prediction, in which Augustus, SNAP, and GlimmerHMM were trained by homolog set gene models (referred to “De novo” in Table 9). Additionally, RNA-seq reads were directly mapped to U. unicinctus genome using Tophat v. 2.0.1340. The mapped reads were assembled into gene models (RNAseq-Cufflinks-set) by Cufflinks v. 2.1.141 (referred to “Cufflinks” in Table 9). Finally, the gene models were integrated by EvidenceModeler v. 1.1.142. We set the Weights for each type of evidence as follows: PASA-T-set > Homology-set > Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. In order to get the information of untranslated regions (UTRs) and alternatively spliced sites, PASA2 was used to update the final gene models (referred to “Pasa-update” in Table 9). In total, 21,524 protein-coding genes were predicted in the U. unicinctus genome with an average transcript and coding sequence (CDS) length of 5,391.7 bp and 1,291.94 bp, respectively (Table 9).

Table 9 The statistics of predicted protein-coding genes of U. unicinctus assembly.

For functional annotation, the predicted protein sequences were aligned against SwissProt43, NCBI’s non-redundant protein sequence databases (NR), InterPro44, Gene Ontology (GO)45, Kyoto Encyclopedia of Genes and Genomes (KEGG)46 and Pfam protein databases47 by BLASTP (E-value ≤ 1E-05) with the matched rates of 74.1%, 93.2%, 74.2%, 98.5%, 90.6%, and 67.8%, respectively (Table 10). InterproScan tool48 in coordination with InterPro database was applied to predict protein function based on the conserved protein domains and functional sites. In total, 21,408 genes were functionally annotated by at least one database, accounting for 99.5% of all predicted genes, among which 123,56 (68.33%) were supported by all six databases (Fig. 2).

Table 10 Functional annotation of the predicted protein-coding genes in U. unicinctus.
Fig. 2
figure 2

Venn diagram of functional annotation of the U. unicinctus protein-coding genes. The Venn diagram shows the shared and unique annotations among NR, KEGG, SwissProt, and InterPro.

Data Records

All raw genomic sequencing data (Illumina, PacBio, Hi-C, 10× genomics) were deposited in the NCBI Sequence Read Archive (SRA) database with accession numbers SRR2589312949 and SRP45820150. Four transcriptome data from intestine, blood, gonad, and muscle were submitted to the NCBI SRA database with accession numbers SRR25683611, SRR25683610, SRR25683609, and SRR25683608, respectively, under accession the BioProject number PRJNA100651451. The final chromosome assembly was deposited in the GenBank at the NCBI (JAXDRA000000000)52. The sequences of CDS, and protein and results of genome annotation, including repeat sequences, protein-coding regions, and ncRNA annotation, are available in figshare53.

Technical Validation

Quality validation of sequencing data

The quality control of Illumina, 10 × genomic and transcriptome sequencing data was assessed using FastQC quality control (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/). The Q20 of Illumina sequencing data was greater than 95.92%, and Q30 was greater than 89.67%. The Q20 of 10× genomic sequencing data was greater than 92.71%, and Q30 was greater than 85.77%. The Q20 of transcriptome sequencing data was larger than 97.67%, and Q30 was larger than 93.41%. The low-quality reads (<Q20) were filtered to ensure the reliability of the data used in subsequent analyses.

Assessment of the genome assembly

The quality of the genome assembly was assessed by four methods as follows: (i) The evaluation of the genome assembly by BUSCO v. 5.1.254 suggested a high level of completeness (93.5%). Of 954 metazoa BUSCO genes, 92.7% were complete and single-copy, 0.8% were complete and duplicated, 4.0% were fragmented, and 2.5% were missing (Table 11). Additionally, completeness of the gene model prediction was also evaluated by BUSCO v. 5.1.2, generating a score of 91.6% (4.7% fragmented and 3.7% missing BUSCOs) (Table 12); (ii) Employing CEGMA (RRID: SCR 015055)55, we detected 240 (96.77%) of 248 core eukaryotic genes were detected in the genome assembly, including 221 (89.11%) complete genes. (iii) The clean short reads generated by the Illumina platform were mapped to the assembled U. unicinctus genome using BWA with parameters setting as ‘-o 1 -i 15’56. The results showed a mapping rate of 97.83% and a coverage rate of 93.90%; (iv) To evaluate the accuracy of the assembly at a single base level, variant calling with SAMTOOLS v. 0.1.19 was performed57. A total of 8,064,289 SNPs, including 7,985,055 heterozygous SNPs and 79,234 homozygous SNPs, were identified with a homozygous rate of 0.0081% (Table 13). All these results suggested the high completeness and accuracy of the U. unicinctus genome assembly.

Table 11 The BUSCO and CEGMA evaluation result of the genome assembly.
Table 12 The BUSCO score of the gene models.
Table 13 The statistics of SNP in U. unicinctus assembly.