Data description

Genomic DNA was extracted from the muscle tissue of a single female Chinese mitten crab (Eriocheir sinensis; NCBI Taxonomy ID: 95602) that was obtained, after three generations of inbreeding, from a local farm in Panjin, Liaoning Province, China. We used a whole-genome shotgun sequencing strategy and constructed short-insert libraries (170, 250, 500 and 800 bp) and long-insert libraries (2, 5 and 10 kb) following the standard protocol provided by Illumina (San Diego, USA). Paired-end sequencing was performed on the Illumina HiSeq 2000 system. In total, we generated 258.8 Gb of raw reads from all constructed libraries.

We used clean reads from the short-insert libraries (500 and 800 bp) to estimate the crab genome size by k-mer frequency distribution analysis [1]. A k-mer is a subsequence of k consecutive nucleotides, obtained by sliding a window one base at a time along a sequencing read. We set the k-mer length to 17 bp; thus, a clean read of length L bp contains (L - 17 + 1) k-mers. The frequency of each k-mer can be calculated from the genome sequence reads. Typically, when the number of k-mers is plotted against k-mer frequency (sequencing depth), the curve follows a Poisson distribution for any given dataset. The genome size (G) can be deduced from the formula:

$$ \mathrm{G} = \mathrm{N} \times \left(\mathrm{L} - 17 + 1\right)/\mathrm{K}_{\mathrm{depth}} $$

where N is the total number of reads, L is the read length, and K_depth is the k-mer frequency that occurs more often than any other (the peak of the distribution). In our calculations, N was 789,326,187 and K_depth was 40; therefore, the crab genome size was estimated to be 1.66 Gb.
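The estimate can be reproduced with a short Python sketch. It is illustrative only and is not the software used in this study. The function names are hypothetical, and the read length L = 100 bp is an assumption: it is not stated above, but it is consistent with the reported values of N, K_depth and G.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer found by sliding a window of length k along each read."""
    counts = Counter()
    for read in reads:
        # A read of length L yields (L - k + 1) k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def peak_depth(kmer_counts, min_depth=1):
    """Return the k-mer frequency shared by the largest number of distinct k-mers.

    On real data, min_depth is usually raised to skip the low-frequency
    peak caused by sequencing errors (an assumption, not from the text).
    """
    histogram = Counter(d for d in kmer_counts.values() if d >= min_depth)
    return max(histogram, key=histogram.get)


def estimate_genome_size(n_reads, read_length, k_depth, k=17):
    """Apply G = N * (L - k + 1) / K_depth."""
    return n_reads * (read_length - k + 1) / k_depth


# N and K_depth are the values reported in the text; read_length=100 is assumed.
genome_size = estimate_genome_size(n_reads=789_326_187, read_length=100, k_depth=40)
print(f"{genome_size / 1e9:.2f} Gb")  # prints "1.66 Gb"
```

For large datasets a dedicated k-mer counter would replace `count_kmers`, since an in-memory dictionary does not scale to hundreds of millions of reads.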

For whole-genome assembly, we employed Platanus [2] with optimized parameters (−k 27, −m 200) to construct contigs and initial scaffolds. All reads were then mapped onto the contigs, and the paired-end information was used to link contigs into scaffolds in a stepwise manner. Some intra-scaffold gaps were filled by local software using read pairs in which one end mapped uniquely to a contig and the other end was located within a gap. The final genome assembly of the Chinese mitten crab is 1.12 Gb in total length, which is about 67.5 % of the estimated genome size. The contig N50 size (i.e., 50 % of the genome is in fragments of this length or longer) is 6.02 kb, and the scaffold (>2 kb) N50 is 224 kb.
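The N50 statistic defined above can be computed as in the following minimal sketch. It assumes that the total assembled length is used as the reference for the 50 % threshold; the function name and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the length X such that sequences of length >= X
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total = 30, so the threshold of 15 is reached at 10 + 6.
print(n50([2, 3, 4, 5, 6, 10]))  # prints 6
```

A length cut-off such as the >2 kb threshold applied to scaffolds can be imposed by filtering the list before the call, for example `n50([x for x in scaffold_lengths if x > 2000])`, where `scaffold_lengths` is a hypothetical list of scaffold sizes in bp.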

We constructed a de novo repeat library using RepeatModeler (Version 1.04, default parameters) and LTR_FINDER [3]. To identify known and de novo transposable elements (TEs), we ran RepeatMasker (Version 3.2.9) [4] against the Repbase TE library [5] (Version 14.04) and the de novo repeat library. In addition, we used RepeatProteinMask (Version 3.2.2), implemented in RepeatMasker, to detect TE-relevant proteins. We also predicted tandem repeats using Tandem Repeats Finder [6, 7] (Version 4.04) with parameters set as “Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, and MaxPeriod = 2000”. Overall, repeat sequences occupy approximately 50.4 % of the crab genome. Among them, long interspersed elements, which occupy 19.0 % of the crab genome, are the most predominant type.
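As an illustration of how the tandem repeat parameters listed above map onto a Tandem Repeats Finder invocation, the sketch below assembles the command line in Python. The executable name `trf`, the input file name `genome.fa`, and the output options `-d` (write a data file) and `-h` (suppress HTML output) are assumptions; the options actually used in this study are not reported.

```python
def trf_command(fasta_path, trf_binary="trf"):
    """Build a Tandem Repeats Finder command line from the reported parameters.

    TRF takes its seven numeric parameters positionally, in this order.
    """
    params = {
        "Match": 2,
        "Mismatch": 7,
        "Delta": 7,
        "PM": 80,
        "PI": 10,
        "Minscore": 50,
        "MaxPeriod": 2000,
    }
    positional = [str(value) for value in params.values()]
    # "-d" and "-h" are assumed output options, not taken from the text.
    return [trf_binary, fasta_path, *positional, "-d", "-h"]


print(" ".join(trf_command("genome.fa")))
# prints "trf genome.fa 2 7 7 80 10 50 2000 -d -h"
```

The returned list can be passed to `subprocess.run` when the executable is available on the local system.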

Subsequently, we performed gene annotation in four major steps. (1) Homology-based gene prediction: We aligned Homo sapiens, Crassostrea gigas, Caenorhabditis elegans, Drosophila melanogaster and Daphnia pulex proteins (Ensembl release 75) to the crab genome using TblastN with an E-value ≤ 1E-5, and then used GeneWise2.2.0 [7] for precise spliced alignment and prediction of gene structures. Short genes (<150 bp) and genes with premature stop codons or frame shifts were removed. (2) Ab initio prediction: The crab genome sequences were repeat-masked, and 1500 full-length genes randomly selected from the homology-based gene set were used to train the model parameters for AUGUSTUS2.5 [8]. We then used AUGUSTUS2.5 and GENSCAN1.0 [9] for de novo prediction on the repeat-masked genome sequences. Short genes were discarded using the same filter threshold as in the homology-based prediction. (3) Gene structure identification using transcriptome reads: We mapped the mixed RNA reads (from hepatopancreas tissue taken at four molting stages) reported in Huang’s study [10] to the crab genome using TopHat1.2 [11]. We then sorted and merged the TopHat mapping results and applied Cufflinks [12] to identify gene structures to assist gene annotation. (4) Gene set integration: All of the above gene sets were merged into a comprehensive, non-redundant gene set using GLEAN [13]. We obtained a final gene set containing 7,549 genes (Table 1), which is more than the number of genes (5,775) identified for the horseshoe crab [14]. Meanwhile, the CEGMA [15] evaluation indicated an annotation completeness of 66.9 % (166 of 248 core eukaryotic genes were aligned).
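The gene filter applied in steps (1) and (2) can be sketched as follows. This is a simplified illustration, not the pipeline used in this study: it assumes the standard genetic code, operates on a coding sequence (CDS) string, and treats any CDS whose length is not a multiple of three as frame-shifted. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def passes_gene_filter(cds, min_length=150):
    """Return True if a coding sequence survives the short-gene,
    frame-shift and premature-stop filters."""
    cds = cds.upper()
    if len(cds) < min_length:
        return False  # short gene (<150 bp)
    if len(cds) % 3 != 0:
        return False  # simplified frame-shift check
    # Inspect every codon except the last, which may be the real stop codon.
    internal_codons = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    return not any(codon in STOP_CODONS for codon in internal_codons)


intact = "ATG" + "GCT" * 49 + "TAA"      # 153 bp, stop codon only at the end
premature = "ATG" + "TAA" + "GCT" * 49   # 153 bp, stop codon in second position
print(passes_gene_filter(intact), passes_gene_filter(premature))  # prints "True False"
```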

Table 1 Summary of genome annotations

In summary, we report the first genome sequencing, assembly, and annotation of the Chinese mitten crab. The draft genome will provide a valuable resource for studying essential developmental processes in the Chinese mitten crab, investigating crustacean evolution, and improving the molecular breeding of this economically important species.

Availability of supporting data

Supporting data are available in the GigaDB database [16], and the raw data were deposited at NCBI under BioProject accession PRJNA305216.