Background

Sociality among insects is highly diverse ranging from simple interactions to complex hierarchical societies [34]. However, the social behaviours and their genetical background have been investigated mainly in eusocial species with well-ordered colonies (e.g. [67, 81]). Therefore, the literature lacks studies on the molecular basis of social behaviour in simpler societies such as subsocial species [34]. Among these insects, diverse forms of parental care occur, including guarding for the eggs, food provision, protecting the freshly hatched offspring, and even biparental care with division of labour between the parents [43]. These behaviours appeared independently in 13 different insect orders, including 15 families of Coleoptera such as Scarabaeidae and Silphidae. Insect species with parental care are feasible subjects of sociogenomics aiming to understand the interaction between behaviour and genes regulating parental behaviour [63]. Therefore, knowing the mechanistic principles of parental care among subsocial insects could lead to a better understanding of social evolution [15]. Another pathway to inferring the origin of sociality is via comparative analysis of genomes of many organisms with analogous parental behaviours [13].

Beetles have evolved an extraordinary variety of life history strategies [50], including very diverse social and reproductive behaviours, from aggregation through creating nests to biparental care [9]. Despite of the fact that beetles represent the most diverse animal order, only a handful of coleopteran genomes have been published to date, including model species like the red flour beetle (Tribolium castaneum), a burying beetle, Nicrophorus vespilloides and important pests like the Colorado potato beetle (Leptinotarsa decemlineata) and the small hive beetle (Aethina tumida) [49]. However, of the sequenced beetle species, only N. vespilloides and Onthophagus taurus have parental care.

Lethrus apterus Laxmann 1770 (Coleoptera: Geotrupidae; Fig. 1) is one of the few insect species that has biparental care [12]. These beetles are only active during their breeding season, which lasts from early March to the beginning of June, and outside of this period the adults spend their time in a diapause in the soil [19]. At the beginning of their breeding season, adults choose mates with whom they excavate underground nests [64]. After this, a division of labour can be observed with females collecting leaves for each offspring in separate underground chambers while males guard the entrance of the nest from intruders, e.g. other L. apterus individuals or predators [65]. At the end of the reproductive period, adults dig themselves into the soil while the hatching larvae consume the stored leaves [33].

Fig. 1
figure 1

Lethrus apterus adult female (Susa, Hungary). Photo: Nikoletta A. Nagy

In this study, we report the draft de novo genome of L. apterus which is the first published genome in the family Geotrupidae. We performed functional annotation, and searched directly for potential parental behaviour regulator genes in the genome. Additionally, we investigated the single nucleotide polymorphism variants distribution based on samples from eight populations in Hungary.

Results and discussion

Assembly quality and completeness

The final assembly of Lethrus apterus comprised 66,933 scaffolds with an N50 value of 8902 bp (Table 1). The total length of the genome was estimated to be 286.93 Mbp, comparable with other beetle genomes published to date, and with the estimated size by GenomeScope (252.49 Mbp, Fig. 2). The GC content of the final assembly is 31.66%, similar to other beetle genomes. GenomeScope results showed a relatively low heterozygosity rate (0.148%) and low percentage of unique sequences (55.8%) which was probably caused due to the combined dataset (see Materials and methods) (Fig. 2). An additional peak was formed by the high frequency (0.864%) of duplicated k-mers which predicts high proportion of repeats in the genome [80]. High ratio of repeat regions together with short reads sequencing can lead to fragmented genome assemblies as repeats are often longer than the reads [54]. Therefore, low contiguity of our assembly, even after merging assemblies generated by diverse applications, is likely due to the high repeat sequence content to which the lack of a closely related reference genome contributed as well. Nevertheless, the L. apterus assembly has a high gene completeness, since 93.5% of BUSCOs from the Endopterygota database were detected (Table 2).

Table 1 Descriptive statistics of different assemblies produced during analyses
Fig. 2
figure 2

Genome and read characteristics produced by GenomeScope. Len: haploid genome length; uniq: overall length of unique (i.e. not repetitive) sequences; het: heterozygosity rate; kcov: mean k-mer coverage for heterozygous sequences; err: error rate of reads; dup: read duplication rate; k: k-mer length

Table 2 Completeness of the different assemblies assessed by BUSCO

The distribution of contigs analysed for their coverage and GC content state space resulted in scaffolds separated into two groups according to their read coverage (Fig. 3). Both groups contained genes identified by the BUSCO analysis. The quotient of the coverage of the two groups was 3/4 and the proportion of males and females in the combined sample (see Materials and methods) was 1:1, suggesting that the group of scaffolds with lower coverage could reflect sequences from the chromosomes. This is further supported by Fig. 4 showing that the lower coverage group of scaffolds are only present in males. This suggests an XY or X0 sex determination system in Lethrus apterus.

Fig. 3
figure 3

Read coverage distribution. a The distribution of scaffolds in the GC content – read coverage state space. Blue symbols mark scaffolds, size of the symbol is proportional to the length of the scaffold. Green symbols show BUSCO genes, size of the symbol is proportional to the length of the scaffold containing the gene. b Density plot of read coverage of scaffolds of size longer than 5 kbp. The numbers above the peaks show the corresponding coverage values

Fig. 4
figure 4

Density plots of read coverage of scaffolds of size longer than 5 kbp. Female samples are shown with black whereas males are shown with red lines. The coverage values are rescaled so that the coverage value with maximum density is one. Numbers at the top of the plot show the coverage values of the peak densities. The relative coverage was 0.504 ± 0.03 in the lower coverage group of males

Genome annotation

RepeatMasker was used to identify repetitive elements in the assembly of Lethrus apterus. Results showed that a high proportion (36.44% of bases) of the genome contains interspersed repeats, most of which (71.46%) could not be classified as known repeats (Table 3). The most abundant repeats were A-rich sequences with low complexity. Only three of the next 10 repeat classes showed significant matches with the NCBI nt database. One matched with an inverted repeat in the pannier region of the harlequin ladybeetle (Harmonia axyridis), the other two had hits with uncharacterised genome regions of the mountain pine beetle (Dendroctonus ponderosae) and the ringlet butterfly (Aphantopus hyperantus).

Table 3 Repetitive elements found by RepeatMasker

Ab initio gene prediction resulted in 25,385 sequences whereas homology based prediction found 22,551. After merging and filtering the two gene sets, 34,392 remained from which 20,734 were functionally annotated by InterProScan or Diamond using different databases. The annotated gene set had a 1425 bp mean CDS length, 475 amino acids mean protein length and contained 5.10 exons and 4.10 introns per gene on average.

Based on the functional annotation, the potential sex chromosome related genes coded mostly proteins necessary for cell maintenance, including housekeeping genes and mitochondrial proteins, however, some transposons and retrotransposons were found. In addition, proteins involved in the innate immune responses, circadian rhythm and memory were also identified.

Annotation of reproductive behaviour related genes

Based on a literature search, 23 candidate genes were found in 21 research articles (Table 4). Of these, 19 genes were found in coleopteran species and were stored in NCBI. All 19 candidate genes had significant hits with the predicted genes of Lethrus apterus, sequences of the hits can be found as Additional file 1. Compared to the other examined beetle species, L. apterus had lower number of hits of odorant-binding and pheromone-binding proteins. These molecules play a significant role in recognition of the signals of the environment, such as food resources or recognition of conspecifics [21]. The loss of these genes may be a great starting point of a research on the evolution of olfactory perception among dung beetles, however, we should note that the low number of hits could be caused by the fragmented genome hence further investigation would require the involvement of other data, such as RNA sequencing. In addition, two of the 19 genes, namely troponin C, and octopamine receptor were found among genes located on potentially sex chromosomes and thus may serve as targets for future research on gene regulation of reproductive behaviour.

Table 4 Candidate genes involved in reproductive behaviour among coleopterans

Fragmented genome assembly can lead to over-prediction of paralogous genes, especially in case of gene families with high number of similar members [44]. The results of the candidate gene search, however, showed that the number of hits in genes of Lethrus apterus and the related species were similar, suggesting that the low contiguity of the assembly did not influence the gene prediction.

Comparison with other coleopteran species

Fourteen coleopteran proteomes available on NCBI and predicted Lethrus apterus genes were used to perform comparison of orthologous genes by Orthofinder. Based on the results, 357,992 genes (95.9% of the total number of genes) were assigned to 23,528 orthogroups. All species were present in 4754 orthogroups of which 44 included only single-copy genes. 9618 orthogroups were species-specific from which 607 (consisting of 1664 genes) were specific to L. apterus. Of the predicted genes, 436 were not assigned to any orthogroups. Finally, 42.1% of the orthogroups contained L. apterus genes. Our phylogenetic results are in line with those relationships described in [88, 89]. Based on the species trees reconstructed with two independent methods, Nicrophorus vespilloides appeared to be the sister taxa of Lethrus apterus + Onthophagus taurus. Monophyly of these groups received a high statistical support in all of our analyses (Fig. 5). This branching not only marks the divergence of Staphylinoidea and Scarabeoidea, but also separates those only three species in our dataset that have biparental care. Further studies are now needed to more precisely decipher the origin of biparental care among beetles.

Fig. 5
figure 5

Phylogenetic relationships of 14 coleopteran species rooted with Drosophila melanogaster as outgroup. Tree was constructed based on 37 common single-copy protein sequences. Species with parental care are highlighted in bold and Lethrus apterus is additionally highlighted in red. a Phylogram generated with coalescent based estimation, support values are local posterior probabilities. b Concatenation-based phylogenetic tree, support values are ultrafast bootstrap/aLRT

Variant calling

The three approaches used (see Materials and methods) produced a large number of variant loci; samtools: 2768768 (9.65 SNPs/kbp), GATK: 2804771 (9.77 SNPs/kbp), and freebayes: 3384895 (11.79 SNPs/kbp), respectively. After the filtering steps, only 237,835 SNP loci identified by all three methods remained. From these loci 12.65% were missing, however, different samples had different portion of missing variants, some had as low as 1%, while other had as high as 31%. The average number of missing variants per locus was 4.0.

From the 237,835 variants, 22,593 (9.5% of the total) were found in exonic regions, 24,231 (9.9%) in intronic regions and 191,797 (80.6%) in intergenic regions. Seven hundred eighty-six variants were reported in both exonic and intronic regions which suggests e.g. nested or reverse coding genes. The mean variant density was 0.83 SNPs/kbp varying between coding (0.62 SNPs/kbp), intron (0.71 SNPs/kbp) and intergenic regions (0.85 SNPs/kbp).

Based on the principal component analysis, the samples forming our Susa dataset were grouping together which supports our theory that there should be low variation between these populations (Figure S1). One interesting finding was that one population, namely the Debrecen population was separated from all of the other populations. This clustering can serve as great basis of future studies thusthese variant loci can serve as a resource for future population genomics analysis in Lethrus apterus.

Conclusion

We have reported the genome of Lethrus apterus, the first genome in the family Geotrupidae. Although the assembly is highly fragmented, probably due to the repetitive nature of the genome, it has a high level of gene completeness. This beetle species is a good model for division of labour during parental care. All genes related to reproductive and parental behaviours published so far were located in the final genome assembly, therefore, its use for future investigation for the genetic architecture of parental care among insects should be pursued. Further, potential sex chromosome related genes were identified which can be useful for uncovering the genetic basis of the division of labour between the parents.

Materials and methods

Sampling and DNA extraction

In the spring of 2016, 32 individuals were collected from eight locations (2 males and 2 females from each location) across Hungary. Samples were taken from thorax muscles from each individual. Each tissue sample was ground in a clean mortar using pestle and liquid nitrogen and when grinding was finished 430 μl of extraction buffer was added. According to Gilbert et al. [23], the buffer contained 3 mM CaCl2, 2% sodium dodecyl sulphate (SDS), 40 mM dithiotreitol (DTT), 250 μg/ml proteinase K, 100 mM Tris buffer pH 8 and 100 mM NaCl (final concentrations). The solution was suspended into a clean 1.5 ml microcentrifuge tube and 50 μl RNase was added. Suspensions were vortexed and centrifuged for 30 s at 400 G. Samples were incubated overnight at 56 °C until no tissue clumps were visible. After the mixtures were cooled down at room temperature, 0.5 volume of 7.5 M ammonium-acetate was added to each sample and vortexed, and then put in a − 20 °C freezer for 10–15 min. After this, 1 volume of chloroform:isoamylalcohol 24:1 was added to each sample and the tubes were vortexed for 2 s. Samples were centrifuged for 5 min at 18000 G and the supernatant was carefully transferred to a clean tube. This step was repeated once more. Two volumes of ice cold 96% ethanol were added and samples were mixed by inverting, then incubated on − 20 °C overnight. Tubes were centrifuged at 18000 G for 10 min and the liquid phase was carefully removed from the pellets, then 500 μl 70% ethanol was added. After inverting, samples were centrifuged at 5000 G for 3 min, then the pellets were dried at room temperature. Finally, 50–100 μl of 10 mM Tris-EDTA (pH = 9.0) buffer was added to dissolve the pellets.

DNA library preparation and sequencing

The sequencing library was prepared using 300 ng of DNA fragmented with a Bioruptor sonicator (Diagenode) with high mode 10 cycles setting (30 s on/30 s off). For the library preparation the TruSeq® Nano DNA LT Sample Preparation Kit (Illumina) was used following the manufacturer’s protocol. The individually barcoded libraries were diluted to 10 nM and two pools of 16 samples each were prepared for sequencing. Paired-end sequencing of 125 bp reads was performed on a HiSeq2500 machine at the EMBL GeneCore Facility (Heidelberg).

Individual sequence coverage was on average low (~10x), therefore, the reads of 12 individuals collected in neighbouring regions (Susa, Uraj and Ózd) were combined to achieve a higher coverage that would facilitate genome reconstruction. The base of the sample choice was their close spatial proximity, hence less variation was expected between them. Hereafter, we refer to this dataset of combined samples as the Susa dataset.

Quality control and filtering

The read quality of the Susa dataset was checked with FastQC (v0.11.5, [3]). The proportion of bases with a base quality of 30 or higher was above 90% in all samples. Trimmomatic (v0.36, [8]) was used to eliminate any adapter sequences from the reads, with at least 97% of the reads passing the trimming process. Next, the Susa reads were combined into two FASTQ files, one for the forward and one for the reverse sequence, and because these reads derived from different individuals, Musket (v1.1, [41]) was used to reduce the variation introduced by combining samples. After this correction, the quality of the dataset was checked with SGA-PreQC (v0.10.15, [72]). Results showed 20% were PCR duplicates, high mean quality score (>Phred 35), low sequencing error rate (< 5 × 10− 5), a peak of 51-mer distribution at count 70, and also predicted high repeat branch frequency similar to the human and oyster genomes. Additionally, GenomeScope (v1.0 [80];), which works very well on short reads, was run on the 17-mer count histogram of the Susa dataset after Musket produced using Jellyfish (v2.2.6 [47];) to check heterozygosity and repeat region contant.

Genome assembly

MEGAHIT (v1.1.1, [38]) was used (minimum k-mer size 51, maximum k-mer size 111, k-mer step size 20) to generate the initial assembly of the Susa reads and contig length was increased through running three cycles of SSPACE (v3.0, [6, 7]) alternating with GapFiller (v1.10, [7, 54]). Another assembly of the Susa reads was run using SOAPdenovo2 (v2.04, [46]) with the default k-mer size. These two assemblies were then merged with gam-nsg (v1.1b, [79]) to improve contig sizes. The final assembly was obtained by removing contaminant and duplicated contigs as reported by GenBank Contamination Screen and the potentially poor quality contigs shorter than 501 bp as short contigs can bias downstream analyses, e.g. mislead the gene prediction methods and the variant calling.

To assess the completeness of our assemblies, BUSCO (v4.1.2, [71]) was run to identify the proportion of the Endopterygota gene set that was found in the Lethrus genome. For this, the Tribolium castaneum gene set (tribolium2012) was used for algorithm training. The distribution of coverage and GC along the contigs was also analysed in R statistical environment (v3.5.2 [61];).

Genome annotation

The distribution of repetitive sequences was analysed with RepeatModeler (version open-1.0.8, [73]) and RepeatMasker (version open-4.0.7, [74]). First, a de novo repeat library was built using RepeatModeler. RepeatMasker was used for repeat analysis based on RepeatModeler output and Repbase (version 20170127).

For gene prediction in the soft masked genome assembly, ab initio and homology-based prediction methods were combined with the BRAKER pipeline (v2.1.5 [26, 27];). Ab initio prediction was carried out using Augustus (v3.3.3 [75, 76];). For homology-based prediction, hints for GeneMark-EP (GeneMark-ES Suite v4.57_lic [45];) were generated with ProtHint (v2.4.0 [10];) based on the endopterygota_db10. Augustus was then trained with these sequence hints with default settings and used for prediction. This was followed with another round of training and prediction based on the first iteration results and GeneMark hints. Predicted coding sequences containing premature stop codons and shorter than 50 amino acids were filtered out.

Before annotation, the ab initio and homology-based predicted genes were merged using CD-HIT (v4.7 [42];) to cluster genes with the exact same amino acid sequences with the parameters ‘-c 1 -G 0 -aL 1.0 -aS 1.0’. The merged gene dataset was functionally annotated on the public server at usegalaxy.eu [1] using InterProScan (v5.36–75.0 [29];) with Pfam [18], PANTHER [52], ProDom [69], PROSITE [70], SMART [37], SUPERFAMILY [57], TIGRFAM [24] and PRINTS [4] databases. Additionally, Diamond (v2.0.6.144 [11];) against Uniprot (downloaded on 6th May 2020; UniProt Consortium 2019 [77]) database with an e-value of 10− 5 was used to further annotate the predicted genes.

Analysis of reproductive behaviour related genes

To identify genes potentially involved in parental behaviour, a literature search was carried out on 12th May 2020 on Web of Science using the following “Topic” keywords: (insect* AND (“reproductive behavior” OR “reproductive behaviour” OR “parental care”) AND (gene OR genes)). This resulted in 96 articles from which a list of candidate genes was extracted on the basis of the genes being identified as playing a role in insect reproductive behaviour. NCBI was searched and filtered for complete and reliable protein sequences (not partial, predicted, putative, hypothetical, unknown or uncharacterised) of coleopteran species. These sequences were used as query for searching homologous proteins among the predicted genes of Lethrus apterus using Diamond (in sensitive mode, with e-value set to 10− 5 and query coverage to 40%). Besides, the same search was run against the proteomes of the following species: Tribolium castaneum which has a well annotated genome; Nicrophorus vespilloides that is a model species for parental care among insects; Onthophagus taurus, the most closely related species with an available genome; Asbolus verrucosus which has a highly fragmented genome, therefore can be a scale for the accuracy of our result from the assembly of L. apterus.

Orthology and phylogenetic analysis

Orthofinder (v2.3.12 [20];) was used to find orthologs and species-specific genes and also create phylogenetic tree to reveal the phylogenetic relationship of Lethrus apterus to the other beetle species. For this purpose, proteomes of all available coleopteran species (Aethina tumida, Agrilus planipennis, Anoplophora glabripennis, Asbolus verrucosus, Callosobruchus maculatus, Dendroctonus ponderosae, Diabrotica virgifera, Ignelater luminosus, Leptinotarsa decemlineata, Nicrophorus vespilloides, Onthophagus taurus, Photinus piralis, Sitophilus oryzae and Tribolium castaneum) and Drosophila melanogaster as an outgroup were downloaded from NCBI. For inferring the species tree, orthogroups containing all species with only single-copy genes were used. Amino acid sequences by orthogroups were first aligned with MUSCLE (v3.8.1551 [17];) using default parameters. A concatenation-based phylogenetic tree was estimated using IQ-TREE (v1.6.9 [55];) by specifying the gene-based partitioning scheme (i.e. allowing a different substitution model for each gene). Branch support was tested using 1000 ultrafast bootstrap replications (−bb 1000) and by a likelihood-ratio test (−alrt 1000). To further strengthen our results a coalescent based estimation of the species phylogeny was also applied as implemented in ASTRAL-III (v5.7.3 [88, 89];). Maximum likelihood gene trees were inferred individually for each alignment also by using IQ-TREE (v1.6.9 [55];). Branch support values were assessed by 1000 ultrafast bootstrap replications (−bb 1000). Branches with a lower bootstrap support than 70 were collapsed into polytomies using nw_ed from Newick Utilities (v1.6 [30];). Using these 37 individual gene trees as input, coalescence-based species tree was estimated with ASTRAL-III (v5.7.3 [88, 89];). In all runs of IQ-TREE the best scoring substitution matrix was simultaneously estimated by ModelFinder Plus (−MFP [31];) and the best fitting substitution model was used to reconstruct the resulting phylogenetic tree. The cophyloplot function of the R package “ape” (v5.4 [58];) was used for visualisation.

Variant identification

Cleaned reads for each individual sample were aligned to the final assembled genome using BWA mem (v0.7.15-r1140, [39]). The resulting sam files were converted to sorted and indexed bam files with samtools (v1.2, [40]). Since base correction might cause bias in the number of variants, three different methods were used for variant calling which was followed by rigorous filtering steps. The first set of variants were called with the samtools mpileup – bcftools call pipeline (bcftools v1.3, [40]). Variants were also identified with GATK (v3.6.0.g89b7209, [48]) and freebayes (v1.1.0–3-h961e5f3, [22]). To run GATK, picard was used to mark duplicates, create a sequence directory of the reference genome, add to read group information and build a bam index. Then GATK Haplotypecaller was run for each sample separately and the results were combined with GenotypeGVCFs. For freebayes, all samples were called simultaneously.

Freebayes produced the highest number of polymorphisms (see Results), therefore, we filtered them to remove loci which had at least one variant with a missing genotype, a minimum alternate count of less than two, a minimum coverage of less than four or a maximum coverage higher than 25. After this filtering, loci which had variants with Phred based genotype likelihood less than 20 were marked as missing, i.e. they were removed from the data of the given sample. From these loci only those which were also called by GATK and samtools were retained. From the filtered loci found by all three methods, those positioned in low complexity (identified using dustmasker v1.0.0, [53]) or repetitive regions were removed. From the remaining variants, only SNPs were retained.

Variants were located in the structurally annotated genome using the R packages “GenomicFeatures” (v1.34.8 [36];), “rtracklayer” (v1.42.2 [35];) and “VariantAnnotation” (v1.28.13 [56];). After this, multiallelic SNPs were filtered out and the genetic variation between populations were investigated with principal component analysis using R package “adegenet” (v2.1.3 [28];).