Background & Summary

Stag beetles (Coleoptera: Lucanidae) form a family in the superfamily Scarabaeoidea, comprising around 1,500 species worldwide1. Most stag beetle species exhibit pronounced sexual dimorphism, which can vary both within and between species; males usually bear greatly enlarged mandibles used to fight rivals and attract females in the wild. Thus, stag beetles have received much attention since Linnaeus first described Scarabaeus parallelipipedus from Europe (later transferred to the genus Dorcus)2. Many lucanid species have been used as models for behavioral and functional morphology studies, and their striking mandibles make them popular as pets and valuable in private collections3,4,5,6,7. In the wild, most stag beetles are closely associated with forest ecosystems, as their saproxylic larvae usually feed on decaying logs and other litter, such as leaves or fungi8,9,10.

The geographical distribution and species diversity of Lucanidae are centered on the Indomalayan and Palearctic regions; 33 genera and nearly 400 species are known from China11,12,13. Current research on stag beetles focuses primarily on taxonomy and phylogeny, including new species descriptions and mitochondrial genome studies7,11,12,13,14. Our understanding of stag beetle genomes, especially through high-quality genome assemblies, remains in its infancy. Only one genome, that of Dorcus hopei, has been reported15. Given the rapidly increasing number of genome assemblies for other beetles, more high-quality genome assemblies for stag beetles are needed.

To advance knowledge of the taxonomy, evolution, and ecology of Lucanidae, we present a chromosome-level genome assembly of a widespread species, Prosopocoilus inquinatus (Westwood, 1848), generated by combining PacBio HiFi, Illumina, and Hi-C data. Genome annotations, including repeats, non-coding RNAs (ncRNAs), and protein-coding genes (PCGs), are also reported. The high-quality genome of P. inquinatus provides a valuable genomic resource for the study of Lucanidae.

Methods

Sample collection and sequencing

A single male P. inquinatus specimen was collected for DNA and RNA sequencing on April 30, 2023, in Motuo County, Xizang, China. Muscle tissue, including that from the pronotum and posterior abdomen, was dissected from the specimen and washed in phosphate-buffered saline (PBS) for five minutes to remove possible external contaminants. The specimen was then transferred into liquid nitrogen, frozen for at least 20 minutes, and stored at −80 °C until sequencing.

The specimen’s genomic DNA (gDNA) was extracted using the FastPure® Blood/Cell/Tissue/Bacteria DNA Isolation Mini Kit (Vazyme Biotech Co., Ltd, Nanjing, China). High molecular weight (HMW) gDNA was sheared to approximately 15 kb with the Megaruptor™ device (Diagenode, Liege, Belgium) and enriched using AMPure PB beads. A PCR-free short-read library for whole-genome sequencing (WGS) was prepared using the TruSeq DNA PCR-Free Kit. A PacBio HiFi 15 kb library was prepared using the SMRTbell™ Express Template Prep Kit 2.0, and the resulting library was sequenced on the PacBio Sequel II platform. The Hi-C library was constructed by digesting DNA with the MboI restriction enzyme. RNA was extracted from the specimen using TRIzol™ Reagent (Invitrogen, Carlsbad, CA, USA). RNA-seq libraries were constructed using the VAHTS mRNA-seq v2 Library Prep Kit (Vazyme, Nanjing, China). All short-read libraries were sequenced on the Illumina NovaSeq 6000 platform, and long reads of the RNA library were sequenced on the Nanopore PromethION platform. Berry Genomics (Beijing, China) carried out all library construction and sequencing. In total, we obtained 272.73 Gb of sequencing data, comprising 109.10 Gb (152.68×) of Illumina reads, 42.50 Gb (65.41×) of PacBio HiFi reads, 101.03 Gb (155.40×) of Hi-C data, and 20.10 Gb of transcriptome data (9.72 Gb of short reads and 10.38 Gb of Nanopore long reads, RNA-ONT) (Table 1).
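The depth values above relate sequenced bases to genome size. The following is a minimal sketch of that arithmetic, assuming the final assembly size reported in this study (649.73 Mb) is used as the denominator; the function name is illustrative, and a different genome-size estimate may have been used for some libraries.

```python
def sequencing_depth(bases_gb: float, genome_size_mb: float) -> float:
    """Return fold-coverage: total sequenced bases divided by genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    # 1 Gb = 1000 Mb, so convert the yield to Mb before dividing.
    return (bases_gb * 1000.0) / genome_size_mb


# Assumption: the final assembly size from this study is the denominator.
ASSEMBLY_SIZE_MB = 649.73

hifi_depth = sequencing_depth(42.50, ASSEMBLY_SIZE_MB)
print(f"PacBio HiFi depth: {hifi_depth:.2f}x")
```

With the 42.50 Gb HiFi yield, this gives a depth of about 65.41×, in line with the reported value.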

Table 1 Statistics of the sequencing data generated for Prosopocoilus inquinatus.

De novo genome assembly

Raw Illumina genomic reads for the genome survey were quality-controlled using fastp v0.23.216 to remove adaptors, duplicates, and low-quality reads.

Raw PacBio HiFi reads were assembled into the primary assembly using Hifiasm v0.19.817. The raw HiFi reads were then mapped back to the primary assembly using Minimap2 v2.2418 to calculate the mapping rate. One round of polishing was applied to the primary assembly using NextPolish2 v0.2.019.

Raw Hi-C data were quality-controlled and deduplicated using Chromap v0.2.5-r47320. The clean Hi-C data were then aligned to the primary assembly for haplotype identification and separation. Contigs were anchored and oriented onto chromosomes using YaHS v1.221 and Juicer v1.6.222. The resulting scaffolds were reviewed, and assembly errors were corrected manually in Juicebox v1.11.0823. To distinguish the autosomes from the sex chromosome, the raw HiFi data were remapped to the final assembly using Minimap2, and the coverage of each chromosome was calculated using SAMtools v1.924 by dividing the mapped data by the chromosome length. In addition, the X chromosome was identified by chromosome synteny with the model beetle species, Tribolium castaneum, and the related species Trypoxylus dichotomus, given that the X chromosome is relatively conserved among insects25. Syntenic blocks were identified using MCScanX26 and TBtools27. In summary, the X chromosome was identified by its coverage, which was around half that of the other chromosomes (Table 3), and confirmed by its high synteny with the X chromosomes of other beetles (Fig. 2).
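The coverage criterion rests on the X chromosome being present in a single copy in a male specimen, so its read depth is expected to be roughly half that of the autosomes. A minimal sketch of such a screen follows; the function name, the 0.4 to 0.6 ratio window, and the example depths are illustrative assumptions and not the values or thresholds used in this study (the measured coverages are in Table 3).

```python
from statistics import median


def flag_half_coverage(coverage, low=0.4, high=0.6):
    """Return chromosomes whose depth is roughly half the median depth.

    `coverage` maps chromosome name to mean read depth. The ratio window
    (low, high) is an assumed tolerance around the expected value of 0.5.
    """
    reference = median(coverage.values())
    return [
        name
        for name, depth in coverage.items()
        if low <= depth / reference <= high
    ]


# Hypothetical depths for illustration only.
example = {"chr1": 70.0, "chr2": 72.0, "chr3": 69.0, "chr4": 71.0, "chr12": 37.0}
candidates = flag_half_coverage(example)
print(candidates)
```

Using the median as the reference keeps the autosomal baseline stable even though the candidate X chromosome is included in the input.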

To ensure a high-quality assembly, potential contaminants were screened for and removed using local searches and the NCBI contamination check. We focused on human, bacterial, viral, and plant sequences. Possible contaminants were detected using MMseqs2 v1128, which performs BLASTN-like searches, against the NCBI nucleotide and UniVec databases. Potential vector contaminants were also screened with blastn (BLAST+ v2.11.029) against the UniVec database. Sequences with over 90% hits in the above databases were considered contaminants, and sequences with over 80% hits were rechecked by online BLASTN searches against the NCBI nucleotide database. The final genome assembly was also uploaded to NCBI for contaminant detection and removal. The vector search found no prominent contaminants in our assembly, reflecting the quality of sample preparation and sequencing.
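The two-threshold rule above can be expressed as a small triage function. This is a sketch only: the function name and labels are hypothetical, and it assumes the "percent hits" value for each sequence has already been computed from the search output, which the text does not specify in detail.

```python
def triage_sequence(percent_hit: float) -> str:
    """Classify a sequence by the percentage of it hitting a contaminant database.

    Thresholds follow the text: over 90% is treated as a contaminant,
    over 80% is flagged for a manual online BLASTN recheck.
    """
    if percent_hit > 90.0:
        return "contaminant"
    if percent_hit > 80.0:
        return "recheck"
    return "keep"


# Hypothetical per-sequence values for illustration only.
for name, pct in [("scaffold_a", 95.2), ("scaffold_b", 84.0), ("scaffold_c", 3.1)]:
    print(name, triage_sequence(pct))
```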

The final P. inquinatus genome assembly reached the chromosome level with a total size of 649.73 Mb, consisting of 174 scaffolds and 195 contigs (Table 2). The scaffold and contig N50 lengths were 59.5 Mb and 26.36 Mb, respectively. The GC content of the P. inquinatus genome was 35.67%. Most contigs (612.12 Mb, 94.21%) were anchored and oriented onto 12 chromosomes. Coverage was computed for every chromosome (Table 3). Chromosome 12 had a coverage of 37.02 for long-read sequencing and 88.58 for short-read sequencing, around half that of the other chromosomes (Table 3), and was therefore considered the X chromosome of P. inquinatus. The assembly comprises 11 autosomes and the X chromosome, with individual lengths ranging from 17.22 to 75.68 Mb (Tables 2, 3; Fig. 1). Compared with the assembly of a related species, Trypoxylus dichotomus30 (Scarabaeidae) (636.37 Mb in genome size and 35.11% GC content), P. inquinatus has a slightly larger genome and a slightly higher GC content (Table 4).
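For readers less familiar with the contiguity statistics reported here, the sketch below shows the standard N50 definition and the anchoring-rate arithmetic. The function name and the toy length list are illustrative; the assembly statistics in this study were produced by the tools cited in the text, not by this code.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths for illustration only.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])

# Anchoring rate from the values reported in the text (Mb).
anchored_pct = 100.0 * 612.12 / 649.73
print(toy_n50, round(anchored_pct, 2))
```

The anchoring rate computed this way agrees with the 94.21% reported above.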

Table 2 Genome assembly statistics for Prosopocoilus inquinatus.
Table 3 Chromosome status of Prosopocoilus inquinatus.
Fig. 1

Genome-wide Hi-C chromosomal heatmap of Prosopocoilus inquinatus, with chromosomes and contigs framed in blue and green, respectively. “ChrX” denotes the sex chromosome.

Fig. 2

Chromosomal synteny between Tribolium castaneum, Prosopocoilus inquinatus, and Trypoxylus dichotomus. The X chromosome is labeled in red.

Table 4 Genome assembly and annotation statistics for Prosopocoilus inquinatus and its related species, Trypoxylus dichotomus (Scarabaeidae).

Genome annotation

A de novo species-specific repeat library for P. inquinatus was built with RepeatModeler v2.0.431. This library was combined with RepBase-2023090932 to form a custom library. Repeat elements in the P. inquinatus genome were identified and masked with RepeatMasker v4.1.433 by aligning against the custom library. The RepeatMasker results showed that the P. inquinatus genome contains approximately 62.19% repetitive elements, including unclassified elements (42.02%), LTR elements (8.36%), DNA transposons (7.33%), LINEs (1.77%), simple repeats (0.68%), and other elements (S Table). The density of each element type, including simple repeats and TEs, is shown for each chromosome (Fig. 3). Compared with the repetitive element composition of T. dichotomus, P. inquinatus showed larger proportions of unclassified (42.02% versus 16.67%) and LTR (8.36% versus 1.24%) elements but smaller proportions of DNA transposons, LINEs, and SINEs (Table 4).

Fig. 3

Genome characteristics of Prosopocoilus inquinatus. Circos plot showing the genomic features of P. inquinatus from outer to inner: chromosome length (Chr) (Mb), the density of GC content (GC), the density of protein-coding genes (GENE), the density of TEs (DNA, SINE, LINE, and LTR), and simple repeats (Simple). Densities were calculated in 10 kb sliding windows.

Non-coding RNAs (ncRNAs) and transfer RNAs (tRNAs) in P. inquinatus were identified with Infernal v1.1.434 and tRNAscan-SE v2.0.935, respectively. In total, 1,857 ncRNAs were identified in the P. inquinatus genome, including four long non-coding RNAs, six ribozymes, 55 small nuclear RNAs, 93 microRNAs, 344 other ncRNAs, 351 tRNAs, and 1,004 ribosomal RNAs (Table 4). By comparison, the number of ncRNAs in P. inquinatus was around 2.8 times that in T. dichotomus (Table 4).

Protein-coding genes (PCGs) in P. inquinatus were annotated with MAKER v3.01.0336 by integrating transcribed RNA, ab initio gene predictions, and homologous proteins. RNA-seq reads were aligned to the genome with HISAT2 v2.2.137, and the resulting alignments were used for genome-guided transcript assembly with StringTie v2.1.638. BRAKER v3.0.339 was used to obtain the ab initio gene predictions by employing GeneMark-ETP40 and Augustus v3.4.041, which were trained automatically on the RNA-seq alignments and reference proteins obtained from the OrthoDB v11 database42. GeMoMa v1.943 was used for homology-based prediction with proteins from five insect species: two coleopteran species, Tribolium castaneum (GCF_000002335.344) and Coccinella septempunctata (GCF_907165205.145), and three species from other insect orders, namely one dipteran, Drosophila melanogaster (GCA_000001215.446), one hymenopteran, Apis mellifera (GCA_003254395.247), and one neuropteran, Chrysoperla carnea (GCA_905475395.148) (Table 5). The results from BRAKER and GeMoMa were then combined and supplied to MAKER as the ab initio input. The final P. inquinatus gene set comprised 13,452 genes with an average length of 17,401.8 bp (Table 6).

Table 5 Species taxonomic information and accession code of all samples used in this study.
Table 6 Summary statistics of genome annotations in the Prosopocoilus inquinatus genome.

Gene functions were annotated by searching the UniProtKB (SwissProt and TrEMBL) 20190527 database using Diamond v2.0.11.149. Protein domains were identified with eggNOG-mapper v2.1.950 and InterProScan 5.60–92.051 for Gene Ontology (GO) and KEGG pathway annotation. Five databases, Pfam52, SMART53, Superfamily54, Gene3D55, and CDD56, were searched in InterProScan. Integrating the InterProScan and eggNOG results, functional annotation of P. inquinatus yielded 11,656 COG category, 7,087 GO term, 4,924 KEGG pathway, and 2,838 Enzyme Code annotations (Table 6).

Data Records

The raw sequencing data and genome assembly of Prosopocoilus inquinatus have been deposited at the National Center for Biotechnology Information (NCBI). The Illumina, PacBio, Hi-C, transcriptome short-read, and transcriptome long-read data can be found under accession numbers SRR2712782557, SRR2724360458, SRR2712782859, SRR2712782760, and SRR2712782661, respectively, under the BioProject accession number PRJNA1015594 and BioSample accession number SAMN37358649. The assembled genome has been deposited in GenBank at NCBI under accession number GCA_036172665.162. The annotation results for repeated sequences, gene structure, and functional prediction have been deposited in the Figshare database63.

Technical Validation

Berry Genomics (Beijing, China) carried out the DNA extraction. The extracted DNA was quantified by two methods, NanoDrop and Qubit (Table 7). The extraction yielded concentrations of 86 ng/μl by NanoDrop and 44.65 ng/μl by Qubit. The 260/280 and 260/230 absorbance ratios of our stag beetle sample were 1.78 and 1.85, respectively.

Table 7 DNA extraction of the Prosopocoilus inquinatus.

Two methods were used to evaluate the quality of the genome assembly. First, BUSCO v5.4.464 was used to assess assembly completeness against the Insecta reference gene set (n = 1,367) in euk_genome_met mode. The final genome assembly showed a BUSCO completeness of 99.6% (1,362 complete BUSCOs), including 1,347 (98.5%) single-copy BUSCOs and 15 (1.1%) duplicated BUSCOs, with 1 (0.1%) fragmented and 4 (0.3%) missing BUSCOs. Second, Merqury v1.365 was used to identify possible sequence errors in the de novo assembly based on k-mer set operations and QV score calculation. The k-mer completeness of the assembly was 94.2%, and the QV score was 46.60. The k-mer completeness and the QV score reflect high base-level accuracy and, together with the BUSCO results, indicate the high completeness and accuracy of our genome assembly. The annotation was also validated with BUSCO in protein mode against the Insecta reference gene set (n = 1,367). The annotated gene set showed a BUSCO completeness of 99.6%, including 1,079 (78.9%) single-copy BUSCOs and 283 (20.7%) duplicated BUSCOs, with 1 (0.1%) fragmented and 4 (0.3%) missing BUSCOs. Mapping rates were also measured to assess assembly accuracy: they were 99.6%, 96.51%, 96.93%, and 97.59% for PacBio, Illumina, RNA short reads, and RNA long reads, respectively. Together, these evaluations indicate a high-quality genome assembly.
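As an aid to interpreting these metrics, the sketch below recomputes the BUSCO category percentages from the protein-mode counts reported above and converts the QV score to a per-base error probability using the standard Phred relation, QV = −10 log10(error rate). The function names are illustrative and are not part of BUSCO or Merqury.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Return BUSCO category percentages (one decimal place) from raw counts."""
    total = single + duplicated + fragmented + missing
    counts = {
        "complete": single + duplicated,  # complete = single-copy + duplicated
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return {key: round(100.0 * value / total, 1) for key, value in counts.items()}


def qv_to_error_rate(qv):
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10 ** (-qv / 10.0)


# Protein-mode BUSCO counts reported in the text (n = 1,367).
annotation_busco = busco_percentages(1079, 283, 1, 4)
# QV score reported by Merqury in the text.
error_rate = qv_to_error_rate(46.60)
print(annotation_busco, f"{error_rate:.2e}")
```

Under this relation, a QV of 46.60 corresponds to roughly two errors per 100,000 bases.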