Background & Summary

Global warming has raised ocean temperatures, driving significant changes in marine ecosystems, particularly in the Arctic marginal seas1. Ophiura sarsii, a brittle star of the class Ophiuroidea (phylum Echinodermata), is a dominant benthic organism on the Arctic continental shelf and is widely distributed across Japan, the Chukchi Sea, Canada, the United States, the North Atlantic, and Norway1,2,3. It is a crucial component of marine ecosystems, contributing to the marine biogeochemical cycle and supporting a diverse array of ecosystem services4,5,6. The species plays an integral role in the food chain, strongly influencing energy flow and matter cycling in marine environments7,8. As a cold-water species, O. sarsii is sensitive to temperature fluctuations9, making it a useful indicator for ecological assessments and a valuable model for studying species differentiation and evolution amid global environmental shifts10.

Ophiura sarsii also contains a diverse array of bioactive substances with medicinal potential, including chlorin compounds, which have shown notable efficacy in photodynamic therapy against triple-negative breast cancer cells11. This finding highlights the significance of O. sarsii in biomedicine as a source of novel cancer treatment compounds11,12. In addition, O. sarsii has robust regenerative capabilities, making it an important model for scientific inquiry13. Understanding how these organisms efficiently regenerate lost appendages, including intricate structures and components of the nervous system, yields valuable insights into the genetic and molecular mechanisms of regeneration14,15. This knowledge may inform research on human tissue regeneration and wound healing, and could advance the treatment of injuries and diseases involving tissue damage13,15. Moreover, O. sarsii is a deuterostome invertebrate, representing a key evolutionary link between achordates and chordates16, and therefore holds a significant evolutionary position17. Nonetheless, despite the ecological significance of O. sarsii, its adaptive evolution remains underexplored.

In this study, we present the first chromosome-level genome assembly for O. sarsii, generated with a hybrid approach that combines Illumina short-read sequencing (105.11 Gb), PacBio Single Molecule Real-Time sequencing (40.08 Gb), and high-throughput chromatin conformation capture (Hi-C) sequencing (172.08 Gb) (Table 1). The resulting genome is 1.57 Gb in size (Table 2) and is organized into 19 chromosomes (Table 3) with an N50 of 78.03 Mb. These data, the first chromosome-level data for Ophiuroidea, provide a foundation for revealing the adaptive evolution mechanisms of a representative cold-water species and for elucidating the origin and evolution of marine organisms. They also provide a basis for understanding the long-term environmental adaptability of cold-water species in the context of global warming.

Table 1 Statistics of the genome sequencing data of O. sarsii.
Table 2 Summary statistics for the O. sarsii genome assembly.
Table 3 The length of each chromosome in the O. sarsii genome.

Methods

Sample collection and sequencing

Samples of O. sarsii were collected with box corers during the tenth Chinese Arctic expedition from August to September 201918. Tissues from the arm base were excised, rinsed in 1X phosphate-buffered saline (PBS), and immediately frozen in liquid nitrogen before being stored at −80 °C. High-quality DNA was extracted using the CTAB19 method for long-read and short-read whole-genome sequencing. Total RNA was isolated using a commercial Animal Tissue Total RNA Extraction Kit (Tiangen, Beijing, China) according to the provided protocol.

For Illumina sequencing, genomic DNA was fragmented into 300-500 bp pieces, and a paired-end genomic library was prepared following the manufacturer’s protocol. The library was then sequenced on an Illumina HiSeq X-Ten platform using a paired-end 150 bp layout. For PacBio sequencing, the genomic DNA was used to construct SMRTbell libraries following the manufacturer’s protocol. The libraries were then sequenced on the PacBio Sequel II platform using SMRT technology.

For Hi-C sequencing, fresh tissue was crosslinked with a 4% formaldehyde solution and digested with a four-cutter restriction enzyme (Mbo I)20. The ends of the restriction fragments were labeled with biotinylated nucleotides (biotin-14-dCTP), and the ligated DNA was then sheared into 350 bp fragments for Hi-C library construction. The resulting library was quantified with the qRT-PCR method and sequenced on the Illumina HiSeq 2500 (PE125) platform.

For transcriptome sequencing, total RNA was extracted from fresh O. sarsii tissue for cDNA library construction. The library was constructed with the NEBNext® Ultra™ RNA Library Prep Kit (NEB, USA) according to the manufacturer’s instructions and sequenced on the Illumina HiSeq X-Ten platform.

Genome size and heterozygosity estimation

To ensure the quality of downstream analyses, the Illumina sequencing data were strictly filtered using pk_qc.v2 and redup.v2 (proprietary software developed in-house by Novogene Co., LTD) with default parameters, yielding clean reads. The genome size of O. sarsii was estimated from the Illumina data by k-mer counting with Jellyfish v2.2.720 using the parameters “-G 2 -m 17 -C”. The heterozygosity rate was estimated from the count of k-mers at half the peak depth. In the survey analysis, the main peak was observed at a depth of approximately 47 (Fig. 1(a)). The genome size calculated with the formula k-mer number/depth was approximately 1.59 Gb, and the adjusted genome size was 1.58 Gb. The genome heterozygosity rate was 1.97%, and the proportion of repetitive sequences was 63.40%. An initial genome assembly of O. sarsii was then built from the Illumina data using SOAPdenovo2 r24221,22 with the parameters “-K 41 -R -d 1”, and the distribution of contigs was analyzed (Fig. 1(b)). With a k-mer value of 41, the assembly had a contig N50 of 697 bp and a total contig length of 1.66 Gb. The scaffold N50 reached 975 bp, with a cumulative scaffold length of 1.73 Gb (Table 4).
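The k-mer number/depth formula can be illustrated with a minimal sketch. It assumes a two-column histogram of (depth, number of distinct k-mers), such as the one produced by `jellyfish histo`. The function name, the low-depth error cutoff, and the toy histogram are illustrative assumptions and do not reproduce the authors' pipeline or data.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as total k-mer count divided by the main peak depth.

    histogram: iterable of (depth, n_distinct_kmers) pairs.
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff, not taken from the paper).
    """
    usable = [(d, n) for d, n in histogram if d >= min_depth]
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth at which the most distinct k-mers occur.
    peak_depth = max(usable, key=lambda pair: pair[1])[0]
    # Total k-mer number: each distinct k-mer counted as often as it occurs.
    total_kmers = sum(d * n for d, n in usable)
    return peak_depth, total_kmers / peak_depth


# Toy histogram with made-up numbers (not the O. sarsii data).
toy_histogram = [(1, 900), (2, 300), (10, 20), (20, 100), (21, 80), (40, 10)]
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

In practice the estimate is sensitive to how low-depth error k-mers are excluded and to heterozygous and repeat peaks, which is one reason an adjusted genome size is usually reported alongside the raw value.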

Fig. 1

(a) The k-mer distribution used to estimate the genome size of O. sarsii. The distribution was determined from the Jellyfish analysis using a k-mer size of 17. (b) Contig coverage depth and length profiles of O. sarsii.

Table 4 Assembly statistics for Illumina sequencing data.

De novo genome assembly

The PacBio HiFi reads of O. sarsii were assembled de novo using Hifiasm v0.16.123 with default parameters. A total of 40.08 Gb of HiFi reads with an N50 of 12,313 bp were obtained in Circular Consensus Sequencing (CCS) mode (Table 1). Purge Haplotigs24 was run with default parameters to remove redundancy from the genome after initial assembly correction. This step identified and eliminated redundant heterozygous contigs based on both read-depth distribution and sequence similarity. The draft genome had a total size of 1.59 Gb and contained 8,048 contigs with an N50 of 311,019 bp (Table 2).
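Because N50 values are reported throughout this section for reads, contigs, and chromosomes, a short sketch of the calculation may help. It uses the standard definition of N50; the function name and the toy lengths are illustrative and are not drawn from the assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (made-up values).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```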

To confirm that the assembly belonged to the target species, the fragmented sequences were aligned to the NCBI Nucleotide Database (NT database) using Blast v2.4.025 (Table 5). Illumina reads were aligned to the assembled genome using BWA v0.7.1726 with the parameters “bwa mem -k 19 -w 100”, and PacBio reads were aligned using Minimap2 v2.2427 with the parameter “-x map-hifi”. Alignment rates and coverage were then calculated to assess sequence consistency (Table 6). The completeness of the genome assembly was assessed using BUSCO v5.2.228 with the metazoan lineage dataset (metazoa_odb10 database) (Table 7).
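The two alignment steps can be sketched as command lines assembled in Python. Only the options quoted in the text (`-k 19 -w 100` and `-x map-hifi`) come from the paper. The file names, the thread count (`-t`), and the SAM output flag (`-a`) are assumptions added for illustration, and the sketch only builds the commands without running them.

```python
import shlex


def bwa_mem_command(reference, reads_1, reads_2, threads=8):
    """Argument list for aligning paired-end Illumina reads with BWA-MEM.

    -k 19 and -w 100 are the options quoted in the text;
    -t (threads) is an assumed addition.
    """
    return ["bwa", "mem", "-k", "19", "-w", "100", "-t", str(threads),
            reference, reads_1, reads_2]


def minimap2_command(reference, hifi_reads, threads=8):
    """Argument list for aligning PacBio HiFi reads with minimap2.

    -x map-hifi is the preset quoted in the text;
    -a (SAM output) and -t (threads) are assumed additions.
    """
    return ["minimap2", "-a", "-x", "map-hifi", "-t", str(threads),
            reference, hifi_reads]


# Hypothetical file names, for illustration only.
bwa_cmd = bwa_mem_command("assembly.fa", "reads_1.fq.gz", "reads_2.fq.gz")
minimap2_cmd = minimap2_command("assembly.fa", "hifi_reads.fq.gz")
print(shlex.join(bwa_cmd))
print(shlex.join(minimap2_cmd))
```

BWA-MEM expects the reference to have been indexed beforehand with `bwa index`, and both tools write alignments to standard output, so a real run would redirect or pipe the result.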

Table 5 Summary statistics of the fragmented sequences against the NCBI Nucleotide Database (NT database).
Table 6 De novo genome assembly data consistency assessment.
Table 7 BUSCO assessment statistics of de novo assembly.

Chromosome assembly using Hi-C data

Reads generated by Hi-C sequencing fall into several main types: valid di-tags, contiguous sequences, circularized fragments, dangling ends, internal fragments, PCR duplicates, and wrong-size fragments29,30. The Hi-C sequencing data were filtered using HiCUP v0.8.031 with default parameters. The clean Hi-C reads were then mapped to the contig assembly using Juicer v1.632 with default parameters. Because cis-interactions (within the same chromosome) are far more frequent than trans-interactions (between different chromosomes), and because cis-interaction strength increases as linear distance decreases, the 3D-DNA pipeline30 with default settings was used to split, anchor, order, orient, and merge contigs or scaffolds into a chromosome-level genome. The assembled genome was then visualized and corrected using JuiceBox v1.11.0833 to address potential errors in contig order, orientation, or internal assembly. As a result, the Hi-C data anchored the contigs onto 19 chromosomes (Fig. 2). Circos v0.69-934 was used to draw a circular diagram characterizing the O. sarsii genome (Fig. 3).
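The cis/trans principle behind Hi-C scaffolding can be illustrated with a small sketch that classifies read pairs by whether both ends map to the same sequence. The tuple layout, function name, and toy contacts are assumptions for illustration; this is not the 3D-DNA algorithm itself, which additionally uses contact frequency as a function of distance to order and orient contigs.

```python
from collections import Counter


def cis_trans_counts(contacts):
    """Count Hi-C read pairs whose two ends fall on the same sequence (cis)
    versus on different sequences (trans).

    contacts: iterable of (seq_a, pos_a, seq_b, pos_b) tuples.
    """
    counts = Counter()
    for seq_a, _pos_a, seq_b, _pos_b in contacts:
        counts["cis" if seq_a == seq_b else "trans"] += 1
    return counts["cis"], counts["trans"]


# Toy contacts (made-up sequence names and positions).
toy_contacts = [
    ("ctg1", 100, "ctg1", 5000),
    ("ctg1", 200, "ctg2", 300),
    ("ctg2", 10, "ctg2", 900),
    ("ctg3", 50, "ctg3", 75),
]
cis, trans = cis_trans_counts(toy_contacts)
print(cis, trans)
```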

Fig. 2

Genome-wide Hi-C heatmap of O. sarsii.

Fig. 3

Characterization of the O. sarsii genome. From the outer to the inner layers, the GC density (a), gene density (b), repeat density (c), LTR density (d), LINE density (e) and DNA-TE density (f) are sequentially displayed.

Genome annotation

Repetitive sequences in the O. sarsii genome were annotated by combining de novo and homology-based prediction methods. Tandem repeats in the genomic DNA were identified using TRF v4.0935. Transposable elements in the genomic sequence were annotated using RepeatMasker v4.1.236 against the RepBase database v20181026. For de novo prediction, the sequence file generated by RepeatModeler v2.037 (with the ‘-LTRStruct’ option) and LTR-FINDER38 was used as a library for RepeatMasker v4.1.236 to predict repetitive elements in the genomic sequence. In total, 914.83 Mb of repetitive elements (58.09% of the genome) were identified (Table 8).
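As a quick arithmetic check, the reported repeat length and percentage imply an assembly length that can be compared with the genome size reported in the Background section (1.57 Gb). The variable names are illustrative; the two input values are the ones stated in the text.

```python
# Values reported in the text.
repeat_length_mb = 914.83   # total length of repetitive elements, in Mb
repeat_fraction = 0.5809    # 58.09% of the genome

# Assembly length implied by the two reported values.
implied_genome_mb = repeat_length_mb / repeat_fraction
print(round(implied_genome_mb, 2))
```

The implied length of roughly 1,575 Mb agrees with the 1.57 Gb assembly size, so the two figures are mutually consistent.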

Table 8 Summary of repetitive sequences in the genome assembly of O. sarsii.

Coding gene structures were predicted with an integrated approach combining de novo, homology-based, and transcriptome-based prediction. De novo gene prediction was performed using Genscan39 and Augustus v3.4.040,41 with default parameters. The pre-trained Augustus model was pisaster. Homology-based prediction was performed using GeMoMa v1.9.042,43 based on the protein sequences of 7 echinoderm species (Table 9).

Table 9 The URLs for protein sequences of 7 species used for homology prediction.

For transcriptome-based prediction, RNA-seq reads were mapped to the genome using HISAT2 v2.2.144 with default parameters, and the transcriptome was assembled using StringTie v2.2.045. Open reading frames (ORFs) were predicted with TransDecoder v5.5.046. ISO-seq reads were analyzed using StringTie2 v1.3.647. Using MAKER v3.01.03 with default parameters, the result files from the above software were added to the MAKER configuration files (maker_exe.ctl, maker_bopts.ctl, maker_opts.ctl, maker_evm.ctl), and the gene sets predicted by the above methods were integrated into a non-redundant and more comprehensive gene set. In parallel, drawing on the integrated results and CEGMA v2.548, the HiCESAP49 workflow was used to derive the final reliable gene set. The final gene set annotation identified a total of 27,099 genes (Table 10). Finally, proteins in the gene set were functionally annotated against external protein databases: SwissProt (http://www.uniprot.org/), TrEMBL (http://www.uniprot.org/), KEGG (http://www.genome.jp/kegg/), InterPro (https://www.ebi.ac.uk/interpro/), and GO (http://geneontology.org/page/go-database) (Table 11). SwissProt and TrEMBL annotation used BLAST v2.4.0 with default parameters. KEGG, InterPro, and GO annotations were performed using the KEGG API, InterProScan, and Blast2GO, respectively, with default parameters.

Table 10 Summary of genome structure annotation in the genome assembly of O. sarsii.
Table 11 Summary of genome function annotation in the genome assembly of O. sarsii.

Non-coding RNAs, including tRNA, rRNA, miRNA and snRNA, were annotated. tRNAscan-SE v2.0.550 was used to identify tRNA sequences in the genome based on the structural characteristics of tRNA. Because rRNA is highly conserved, BLASTN alignment was used to search for rRNA in the genome. miRNA and snRNA sequences were predicted using the Rfam covariance models and the INFERNAL software provided by Rfam51 (Table 12). The annotated gene set was assessed using BUSCO v5.2.228 with the metazoa_odb10 database (Table 7).

Table 12 Summary of ncRNA annotation in the genome assembly of O. sarsii.

Data Records

All raw sequencing data used for genome assembly and annotation have been deposited in the National Center for Biotechnology Information (NCBI) under accession numbers SRR2734456052 for Illumina sequencing data, SRR2735325653 and SRR2735325754 for PacBio sequencing data, SRR2737712555, SRR2737712656 and SRR2737712757 for Hi-C sequencing data, SRR2737181058, SRR2737181159, SRR2737181260, SRR2737181361, SRR2737181462, SRR2737181563, SRR2737181664, SRR2737181765 and SRR2737181866 for RNA-seq data, and SRR2737208267 for ISO-seq data. The genome assembly has been deposited at GenBank under the accession JAYJML00000000068. The version described in this paper is version JAYJML010000000. In addition, the final genome assembly data and annotation file are available in figshare69.

Technical Validation

To validate that the assembly belonged to the target species, the fragmented sequences were aligned against the NCBI Nucleotide Database (NT database) using Blast v2.4.025 (Table 5). The completeness of the O. sarsii genome assembly was evaluated using BUSCO with the metazoa_odb10 database: the assembly was 93.1% complete (86.2% single-copy and 6.8% duplicated genes), with 1.6% fragmented and 5.3% missing genes (Table 7). The Hi-C heatmap revealed a well-structured interaction pattern in and around the chromosome inversion regions (Fig. 2). Together, these results support the completeness and accuracy of the O. sarsii genome assembly.
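The BUSCO percentages can be checked for internal consistency with a minimal sketch. The summary string below is reconstructed from the values reported in the text, written in the style of BUSCO's one-line short summary. The function names and the rounding tolerance are assumptions made for illustration.

```python
import re

# Pattern for the percentage part of a BUSCO-style one-line summary.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%"
)


def parse_busco(summary):
    """Extract complete (C), single-copy (S), duplicated (D),
    fragmented (F) and missing (M) percentages as floats."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError("not a BUSCO-style summary line")
    return {key: float(value) for key, value in match.groupdict().items()}


def is_consistent(scores, tolerance=0.15):
    """Check C = S + D and C + F + M = 100, allowing for rounding
    of each value to one decimal place (assumed tolerance)."""
    complete_ok = abs(scores["C"] - (scores["S"] + scores["D"])) <= tolerance
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


# Summary line rebuilt from the values reported in the text.
scores = parse_busco("C:93.1%[S:86.2%,D:6.8%],F:1.6%,M:5.3%")
print(scores, is_consistent(scores))
```

The single-copy and duplicated percentages sum to 93.0% rather than 93.1%, a difference that is compatible with each value having been rounded independently to one decimal place.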