Background & Summary

Stag beetles (Coleoptera: Lucanidae) constitute a relatively small family within the superfamily Scarabaeoidea, comprising approximately 1,500 species1. Male stag beetles are distinguished by their oversized mandibles, which they use in aggressive contests to secure favored mating sites, a behavior reminiscent of the sparring seen among stags2. Their striking size and intricate mandibles have long attracted the interest of both professional taxonomists and amateur collectors3. Stag beetle larvae feed primarily on decaying wood, while adults of many species are nocturnal and feed on decaying fruit, logs, and plant fluids4,5. Lucanidae species are distributed across all major zoogeographical regions except Antarctica and are an important group for studying evolutionary processes6. Historically, Lucanidae has been regarded as one of the most primitive groups within Scarabaeoidea, with scarabaeoid classifications and evolutionary hypotheses typically placing it at the basal position within the superfamily7,8. Research on Lucanidae has predominantly focused on systematic and evolutionary studies using nuclear gene fragments, mitochondrial multi-gene fragments, and mitochondrial genomes9,10,11,12. However, these datasets have proven insufficient to resolve the evolutionary position of Lucanidae within Scarabaeoidea10. High-quality reference genomes have become crucial for inferring insect phylogeny, yet the limited availability of genomic data has impeded systematic and evolutionary research on Lucanidae.

To improve our understanding of Lucanidae evolution and ecology, we assembled a chromosome-level genome of Serrognathus titanus (Boisduval, 1835) by combining PacBio HiFi, Illumina, and Hi-C data, and annotated its repeats, non-coding RNAs, and protein-coding genes. This high-quality genome of S. titanus is a valuable resource for Lucanidae research and for studies of Scarabaeoidea evolution and ecology.

Methods

Sample collection and sequencing

A single male specimen of S. titanus was collected in Guiyang, China, on July 26, 2023, for DNA and RNA sequencing. Muscle tissue from the pronotum and posterior abdomen was dissected from the specimen and washed with phosphate-buffered saline (PBS) for five minutes to remove potential external contaminants. The sample was then flash-frozen in liquid nitrogen for 20 minutes and stored at −80 °C in the laboratory until sequencing.

Genomic DNA and RNA were extracted from the specimen using the FastPure® Blood/Cell/Tissue/Bacteria DNA Isolation Mini Kit (Vazyme Biotech Co., Ltd, Nanjing, China) and TRIzol reagent (YiFeiXue Tech, Nanjing, China), respectively. PCR-free short-read libraries for whole genome sequencing (WGS) were generated using the TruSeq DNA PCR-free Kit. The libraries consisted of 150 bp paired-end reads with a 350 bp insert size. The Hi-C experiment followed a previously published protocol13, which involved several steps: DNA cross-linking, chromatin digestion using the restriction enzyme MboI, end repair, and DNA purification. All short-read libraries were sequenced on the Illumina NovaSeq 6000 platform. A 20 kb insert size library was prepared using the SMRTbell™ Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel II platform in HiFi mode. Berry Genomics (Beijing, China) performed all library construction and sequencing. In total, we obtained 167.62 Gb of sequencing data, comprising 54.30 Gb (141.37×) of PacBio HiFi reads, 57.89 Gb (150.71×) of Illumina reads, 44.09 Gb (114.80×) of Hi-C data, and 11.34 Gb of transcriptome data (Table 1). The PacBio HiFi long reads had an N50 of 17.76 kb and an average length of 17.82 kb.
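The sequencing depths quoted above follow from dividing each data volume by the genome size. The short sketch below illustrates this arithmetic, assuming that depth was computed against the final assembly size of 384.09 Mb reported later in this study (the text does not state the denominator explicitly); the function and variable names are illustrative only.

```python
# Sequencing depth = total sequenced bases / genome size.
# Assumption: the denominator is the final assembly size (384.09 Mb).
GENOME_SIZE_MB = 384.09

# Data volumes in Gb, as reported in the text (Table 1).
data_gb = {
    "PacBio HiFi": 54.30,
    "Illumina": 57.89,
    "Hi-C": 44.09,
    "RNA-seq": 11.34,
}


def depth(gb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Fold coverage for `gb` gigabases over a genome of `genome_mb` megabases."""
    return gb * 1000.0 / genome_mb


# Depth is only meaningful for the genomic libraries, so RNA-seq is skipped.
depths = {name: depth(gb) for name, gb in data_gb.items() if name != "RNA-seq"}
total_gb = sum(data_gb.values())

for name, d in depths.items():
    print(f"{name}: {d:.2f}x")
print(f"Total data: {total_gb:.2f} Gb")
```

Under this assumption the computed depths agree with the reported values to within about 0.01×, and the four data volumes sum to the reported total of 167.62 Gb.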

Table 1 Statistics of the sequencing data used for genome assembly.

Genome assembly

Quality control of the raw Illumina data was performed with BBTools v38.8214. Duplicate reads were removed with the “clumpify.sh” script, and reads were trimmed and filtered with the “bbduk.sh” script. The latter step removed sequences with quality scores lower than 20, filtered out sequences containing more than 5 Ns, trimmed poly-A/G/C tails longer than 10 bp, and corrected overlapping paired reads.
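Two of the filtering criteria above can be illustrated with a few lines of Python. This is a simplified sketch of the logic only, not the BBTools implementation: it assumes uppercase sequences and that homopolymer tails are trimmed at the 3' end, and the function names are hypothetical.

```python
import re

MAX_N = 5      # reads with more than 5 Ns are discarded (threshold from the text)
MIN_TAIL = 10  # poly-A/G/C tails longer than 10 bp are trimmed (threshold from the text)

# A run of more than MIN_TAIL identical A, G, or C bases at the 3' end.
# Assumption: only the 3' end is examined.
_TAIL = re.compile(r"(A{%d,}|G{%d,}|C{%d,})$" % ((MIN_TAIL + 1,) * 3))


def too_many_ns(seq: str, max_n: int = MAX_N) -> bool:
    """Return True if the read contains more than `max_n` ambiguous bases."""
    return seq.upper().count("N") > max_n


def trim_poly_tail(seq: str) -> str:
    """Remove a 3' homopolymer tail of A, G, or C longer than MIN_TAIL bases."""
    return _TAIL.sub("", seq)


reads = ["ACGT" + "A" * 12, "ACGT" + "A" * 10, "ACNNNNNNGT"]
kept = [trim_poly_tail(r) for r in reads if not too_many_ns(r)]
print(kept)
```

In this example the first read loses its 12-base poly-A tail, the second is unchanged because its tail is not longer than 10 bp, and the third is discarded for containing six Ns.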

The primary assembly of PacBio HiFi long reads was generated using Hifiasm v0.19.815 with default parameters. To remove heterozygous regions, we employed Purge_Dups v1.2.516 with a haploid cutoff of 70 for identifying contigs as haplotigs (“-s 70”). After quality control, we aligned the Hi-C reads to the assembly using Juicer v1.6.217. Primary contigs were anchored into chromosomes using 3D-DNA v.18092218. The assembly was then reviewed, and any assembly errors were manually corrected using Juicebox v.1.11.0818. To identify potential contaminants, we used MMseqs2 v1119 to perform BLASTN-like searches against the NCBI nucleotide and UniVec databases. In addition, we employed blastn (BLAST+ v2.11.0)20 to identify vector contaminants against the UniVec database. Sequences with matches of over 90% in these databases were considered potential contaminants. Furthermore, sequences with over 80% hits were verified by online BLASTN analysis against the NCBI nucleotide database. Potential bacterial and human contaminants were removed from the assembled scaffolds. To identify the autosomes and sex chromosomes, the raw PacBio HiFi long reads were remapped to the final assembly using Minimap2 v2.1721. Subsequently, the coverage of each chromosome was calculated using SAMtools v1.922 by dividing the amount of read data mapped to the chromosome by its length. The X chromosome was identified as the chromosome with approximately half the coverage of the other chromosomes. The final chromosome-level assembly of the S. titanus genome was 384.09 Mb in size, consisting of 83 scaffolds and 99 contigs, with a GC content of 34.31% (Table 2). The scaffold and contig N50 sizes were 75.81 Mb and 37.83 Mb, respectively. Most contigs (97.45%, 374.30 Mb) were anchored into six chromosomes with lengths ranging from 10.11 to 84.98 Mb (Table 3; Figs. 1, 2).
The coverage of each chromosome is shown in Table 3. Chromosome 6 had a PacBio HiFi long-read sequencing coverage of 63.59×, approximately half of the coverage observed for the other chromosomes. Consequently, chromosome 6 was designated as the X chromosome (Table 3).
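The coverage-based criterion for identifying the X chromosome can be sketched as follows. Only the chromosome 6 coverage (63.59×) is taken from the text; the values for chromosomes 1 to 5 are hypothetical placeholders standing in for Table 3, and the 0.4 to 0.6 ratio window is an illustrative choice rather than a threshold stated by the authors.

```python
from statistics import median

# Per-chromosome PacBio HiFi coverage.
coverage = {
    "Chr1": 127.0,  # hypothetical placeholder
    "Chr2": 128.5,  # hypothetical placeholder
    "Chr3": 126.8,  # hypothetical placeholder
    "Chr4": 129.1,  # hypothetical placeholder
    "Chr5": 127.6,  # hypothetical placeholder
    "Chr6": 63.59,  # reported in the text
}


def half_coverage_chromosomes(cov, low=0.4, high=0.6):
    """Return chromosomes whose coverage is roughly half the median coverage.

    The window [low, high] is an assumed tolerance around a ratio of 0.5.
    """
    reference = median(cov.values())
    return [name for name, c in cov.items() if low <= c / reference <= high]


candidates = half_coverage_chromosomes(coverage)
print(candidates)
```

Because the sequenced specimen is male, a chromosome present in a single copy is expected to show about half the read depth of the autosomes, which is the pattern this check looks for.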

Table 2 Genome assembly statistics for Serrognathus titanus.
Table 3 The chromosome length and PacBio HiFi sequencing coverage of Serrognathus titanus.
Fig. 1

Genome-wide chromosomal heatmap of Serrognathus titanus, with each chromosome framed in blue and each contig framed in green.

Fig. 2

Genome characteristics of Serrognathus titanus. From the outer ring to the inner ring: chromosome length, GC content, gene density, transposable elements (DNA transposons, short interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), and long terminal repeats (LTR)), and simple repeats.

Genome annotation

A de novo repeat library of S. titanus was constructed using RepeatModeler v2.0.423 with the “-LTRStruct” LTR discovery pipeline. This species-specific library was merged with RepBase-2023090924 and Dfam 3.525 to produce a custom library. Repeat elements in the S. titanus genome were then identified and masked with RepeatMasker v.4.1.426 by aligning the genome against the custom library. RepeatMasker showed that the S. titanus genome comprises approximately 43.87% repetitive elements, including unclassified elements (24.29%), DNA transposons (10.58%), LINE transposons (5.09%), LTR transposons (2.84%), simple repeats (0.68%), and other elements (Table 4).
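As a quick consistency check on the repeat composition, the listed classes can be summed and compared with the reported total. The sketch below uses only the percentages given in the text; the estimate of repeat content in megabases assumes the percentages refer to the full 384.09 Mb assembly, which the text does not state explicitly.

```python
# Repeat classes and their genome fractions (%), as reported in the text.
repeat_pct = {
    "Unclassified": 24.29,
    "DNA transposons": 10.58,
    "LINE": 5.09,
    "LTR": 2.84,
    "Simple repeats": 0.68,
}
TOTAL_REPEAT_PCT = 43.87  # reported total repeat content
ASSEMBLY_MB = 384.09      # assumption: percentages are relative to the full assembly

listed_pct = round(sum(repeat_pct.values()), 2)
other_pct = round(TOTAL_REPEAT_PCT - listed_pct, 2)   # remainder: "other elements"
repeat_mb = ASSEMBLY_MB * TOTAL_REPEAT_PCT / 100.0    # derived estimate, not a reported value

print(f"Listed classes: {listed_pct}%")
print(f"Other elements: {other_pct}%")
print(f"Estimated repeat content: {repeat_mb:.1f} Mb")
```

The five named classes account for 43.48% of the genome, leaving about 0.39% for the remaining element types grouped under "other elements".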

Table 4 Genome assembly and annotation statistics of Serrognathus titanus.

Non-coding RNAs (ncRNAs) in S. titanus were detected using Infernal v1.1.427 against the Rfam v14.10 database28, while transfer RNAs (tRNAs) were identified using tRNAscan-SE v2.0.929. The analysis revealed a total of 1,186 ncRNAs in the S. titanus genome, including 4 long non-coding RNAs, 21 ribozymes, 51 small nuclear RNAs, 67 microRNAs, 359 other ncRNAs, 406 tRNAs, and 278 ribosomal RNAs (Table 4).

Protein-coding genes in S. titanus were annotated with MAKER v3.01.0330 by integrating evidence from transcriptome alignments, ab initio gene predictions, and homologous proteins. RNA-seq reads were aligned to the genome using HISAT2 v2.2.131, and the alignments were then used for genome-guided transcript assembly with StringTie v2.1.632. Ab initio gene predictions were obtained with BRAKER v2.1.633, which automatically trained GeneMark-ES/ET/EP 4.68_lic34 and Augustus v3.4.035 on the RNA-seq alignments and on reference proteins mined from the OrthoDB v11 database36. GeMoMa v1.937 was used to predict genes based on protein homology and intron position conservation, with the parameters “GeMoMa.c = 0.4” and “GeMoMa.p = 10” and protein sequences from five species: Drosophila melanogaster (GCF_000001215.4)38, Prosopocoilus inquinatus (GCA_036172665.1), Tribolium castaneum (GCF_000002335.3)39, Apis mellifera (GCF_003254395.2)40, and Coccinella septempunctata (GCF_907165205.1)41. The results from BRAKER and GeMoMa were combined and used as the ab initio input for MAKER. In the S. titanus genome, we predicted a total of 14,263 protein-coding genes, with an average length of 9,487.6 bp. Each gene had an average of 6.2 exons, 5.0 introns, and 6.0 coding sequences (CDS). The mean lengths of the exons, introns, and CDS were 314.4 bp, 1,599.3 bp, and 271.4 bp, respectively (Table 4). The completeness of the predicted protein sequences was assessed using BUSCO, which reported 98.0% complete BUSCOs (n = 1,367). The breakdown was 63.3% (865) single-copy, 34.7% (474) duplicated, 0.1% (2) fragmented, and 1.9% (26) missing BUSCOs, indicating high-quality predictions.
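The BUSCO percentages above can be reproduced from the reported category counts. The following sketch shows the arithmetic; the function name and output keys are illustrative and do not correspond to BUSCO's own output format.

```python
def busco_summary(single: int, duplicated: int, fragmented: int, missing: int) -> dict:
    """Convert BUSCO category counts into percentages rounded to one decimal."""
    total = single + duplicated + fragmented + missing

    def pct(n: int) -> float:
        return round(100.0 * n / total, 1)

    return {
        "n": total,
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Counts reported in the text for the predicted protein set.
summary = busco_summary(single=865, duplicated=474, fragmented=2, missing=26)
print(summary)
```

The counts sum to the 1,367 BUSCOs of the reference set, and the complete fraction is the sum of the single-copy and duplicated categories.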

We performed gene functional annotation by searching the UniProtKB database using Diamond v2.0.11.142 in sensitive mode with the parameters “--more-sensitive -e 1e-5”. Additionally, we used eggNOG-mapper v2.0.143 and InterProScan 5.53-87.044 to assign Gene Ontology (GO) terms and to identify pathways (KEGG and Reactome) and protein domains. The InterProScan analyses included five databases: Pfam45, SMART46, Superfamily47, Gene3D48, and CDD49. The results from these tools were integrated to derive the final gene function predictions. Based on the integrated InterProScan and eggNOG annotations, 12,126 protein-coding genes were assigned COG categories, 10,592 GO terms, 5,055 KEGG pathways, and 2,863 Enzyme Codes. Additionally, we visualized repeat content, gene density, and GC content on individual pseudochromosomes using TBtools50.

Data Records

The raw sequencing data and genome assembly of Serrognathus titanus have been deposited at the National Center for Biotechnology Information (NCBI). The Hi-C, transcriptome, Illumina, and PacBio HiFi data are available under the accession numbers SRR2899952551, SRR2899952652, SRR2899952753, and SRR2899952854, respectively. The assembled genome has been deposited in the NCBI Assembly database under the accession number GCA_039766575.155. The annotation results for repeat sequences, gene structure, and functional prediction have been deposited in figshare56.

Technical Validation

Two methods were used to evaluate the quality of the genome assembly. First, assembly completeness was assessed using BUSCO v5.0.457 with the reference Insecta gene set (n = 1,367). The final genome assembly had a BUSCO completeness of 97.6%, with 92.8% single-copy BUSCOs, 4.8% duplicated BUSCOs, 0.5% fragmented BUSCOs, and 1.9% missing BUSCOs. Second, as a measure of assembly accuracy, the mapping rate was calculated by aligning the PacBio, Illumina, and RNA sequencing reads to the final assembly using Minimap2 and SAMtools. The mapping rates for PacBio, Illumina, and RNA reads were 99.85%, 87.64%, and 87.70%, respectively. Together, these assessments support the high quality of the genome assembly generated in this study.
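A mapping rate is the number of mapped reads divided by the total number of reads. The sketch below computes it from text in the style of a `samtools flagstat` report. The text does not specify which SAMtools subcommand was used, so this is an assumption, and the read counts in the example are hypothetical values chosen to give a rate of 99.85%.

```python
import re


def mapping_rate(flagstat_text: str) -> float:
    """Return the percentage of mapped reads from flagstat-style text.

    Only QC-passed counts (the first number on each line) are used.
    """
    total = mapped = None
    for line in flagstat_text.splitlines():
        m = re.match(r"(\d+) \+ \d+ (in total|mapped) ", line)
        if not m:
            continue
        if m.group(2) == "in total":
            total = int(m.group(1))
        else:
            mapped = int(m.group(1))
    if not total or mapped is None:
        raise ValueError("could not find total and mapped read counts")
    return 100.0 * mapped / total


# Hypothetical flagstat-style report (counts are illustrative only).
example = """\
1000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
998500 + 0 mapped (99.85% : N/A)
"""

rate = mapping_rate(example)
print(f"Mapping rate: {rate:.2f}%")
```

Note that this simple ratio counts every alignment record in the report, so secondary and supplementary alignments, if present, would need to be excluded for a per-read rate.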