Background

Genome sequencing efforts across all domains of life have broadened our understanding about how phenotypic novelty coincides with genomic innovations. This was facilitated by continuous improvement of sequencing technology during the last two decades, which allowed the generation of high-quality genome assemblies in large-scale phylogenomic contexts. In contrast, gene annotation protocols have evolved at a much slower rate [1]. One reason for the slow progress in the further developments of gene annotation protocols was likely the absence of universal benchmarking standards. Even though gene predictions were often evaluated against available expression data, such results were difficult to compare between organisms due to strong differences in transcriptomic resources across various genome projects. This changed when alternative approaches, such as benchmarking universal single copy orthologs (BUSCO) [2], employed comparative genomics to define a set of highly conserved orthologous genes that should be expected in a given genome assembly. While the BUSCO completeness level has become a widely used quality measure with similar importance as the N50 measure for assembly contiguity, its informative value is highly dependent on the quality and sampling of the underlying orthology data, which may differ vastly across taxonomic groups. Currently, multiple representative genomes of the highly diverse and rapidly evolving nematode phylum are still poorly annotated. To overcome this problem in the case of the nematode model organism Pristionchus pacificus, community-based curations have recently been initiated to improve the quality of gene annotations [3, 4]. P. pacificus was initially established for comparative studies with the classical nematode model organism Caenorhabditis elegans [5], but more recently it gained importance as an independent model system for elucidating the genetic basis of phenotypic plasticity [6,7,8] and the emergence of novel genes [9,10,11]. P. pacificus has a chromosome-scale genome assembly and computationally generated gene annotations based on transcriptomic data, protein homology data, and gene predictions were of relatively high quality [12]. However, further strand-specific RNA-seq and Iso-seq data pointed towards the presence of numerous artificial gene fusions in gene dense regions of the genome [3]. This motivated a screen for suspicious gene models based on comparative genomic approaches and to propose corrections after manual inspection by community annotators. Two rounds of community curations improved the BUSCO completeness level from 86 to 98% (nematode odb9 data set) [2,3,4].

Here, I make use of the community-curated annotations of the P. pacificus genome to improve the annotations of related Pristionchus and other genomes of the family Diplogastridae, which were recently sequenced to study the evolutionary dynamics of novel gene families [9]. This demonstrates that a single high quality reference data set is sufficient to improve gene annotations in related genomes.

Results

High quality gene annotations are rare outside the Caenorhabditis clade

In order to assess the current status of nematode genome quality, I analyzed gene annotations from 54 nematode species as obtained from WormBase ParaSite (version WBPS14) using the BUSCO approach, which tests for the presence of highly conserved single copy orthologs [2, 13]. Using an arbitrary cutoff of > 80% BUSCO completeness of single copy genes to define high quality gene annotations, this analysis shows that high quality gene annotations are rare outside the Caenorhabditis clade (Fig. 1). Few exceptions are the genomes of the free-living Oscheius tipulae and the parasitic Haemonchus contortus [18], Dirofilaria immitis [19], Loa loa [20], Brugia malayi [21], and Onchocerca volvulus [22]. Please note that the most recent updates of the P. pacificus gene annotations have not yet been integrated into WormBase, which explains why the BUSCO completeness is shown at around 80% (Fig. 1). As BUSCO genes are defined as genes that should be present as single copy in at least 90% of genomes, the low completeness values point towards substantial annotation problems in various genomes (Fig. 1). Alternatively, these discrepancies could reflect true cases of gene losses in specific lineages. Another explanation could be that divergent orthologs may not be detected, as the underlying nematode odb10 data set only contains seven nematode species (Fig. 1) and these do not represent the full range of genomic diversity of the nematode phylum [14, 23]. Despite these alternative explanations, the phylum-wide assessment of annotation quality based on BUSCO completeness strongly suggests substantial need for improvement, which is in accordance with complementary studies assessing the quality of multiple nematode genomes [24, 25].

Fig. 1
figure 1

Comparative analysis of gene annotation quality based on BUSCO completeness. Gene annotations from 54 nematodes were retrieved from WormBase ParaSite (WBPS14) and quality was assessed based on the BUSCO approach (nematode odb10). The bar charts indicate the percentage of proteins falling into a specific BUSCO class. The schematic species trees are redrawn from Smythe et al. and other phylogenomic studies [9, 14,15,16,17]

A single high quality reference set is sufficient to improve gene annotations in related species

Previously, comparative genomic screens were combined with community curation to improve the quality of gene annotations in P. pacificus [4]. Here, the objective is to demonstrate how community curation of a single genome can be used to automatically improve the quality of phylogenomic data from related species. To this end, I reannotated nine nematode genomes of the family Diplogastridae including seven other Pristionchus species, which were sequenced previously as part of a phylogenomic study to investigate the evolutionary dynamics of novel gene families [9, 26]. Specifically, predicted open reading frames (ORFs) in assembled RNA-seq transcripts [27] as well as protein sequences of the community-curated P. pacificus annotation (El Paco gene annotation, version 3) were mapped to the draft genomes with the help of the exonerate alignment tool [28]. Subsequently, a simple heuristic was applied to pick only one representative gene model per locus (see Methods, Additional file 1, Figure S1). The raw mappings of transcribed ORFs, the previous and the newly generated gene annotations were then evaluated based on the level of BUSCO completeness (nematode odb10 data set). To assess, to what extent the manually improved P. pacificus data set yielded better gene annotations of related nematodes, the same reannotation procedure was applied using the automatically generated P. pacificus gene annotations as homology data [12]. Similarly, the gain in gene annotation accuracy by having a more closely related reference data set was evaluated by comparison with the highly curated but evolutionary distant C. elegans annotation. For the nine diplogastrid genomes, reannotation with the community-curated P. pacificus genome yielded a median BUSCO completeness of 84% (Fig. 2a, Table 1), which corresponds to a median improvement of 9% over the previous version of gene annotations [9]. The genome of Parapristionchus giblindavisi is the only exception where the new P. pacificus annotations did not result in an improved gene annotation (Fig. 2a). Most likely, this is due to general problems with this assembly, as it has the lowest level of contiguity, a high fraction of ambiguous bases, and the highest ratio of incorrectly oriented read pairs in realignment analysis [9]. Such problems potentially arose from remaining heterozygosity [29] and might impair accurate gene annotation. However, all other genomes showed that using the community-curated P. pacificus data set as homology data yielded the most complete gene annotations even when compared to the highly curated but evolutionary distant C. elegans data set (Fig. 2a).

Fig. 2
figure 2

Improved gene annotations of nine diplogastrid genomes. a Reannotation of nine diplogastrid genomes based on available RNA-seq data and the community-curated P. pacificus gene annotations results in a median improvement of 9% BUSCO completeness (single copy + duplicated, nematode odb10) over the previous versions [9]. b The barplots show the BUSCO completeness of annotated diplogastrid genomes, when only homology data from either C. elegans or P. pacificus are used. The more closely-related P. pacificus gene annotations are a consistently better reference data set than the C. elegans data. However, the quality of annotations inferred from homology data seems to decrease considerably when divergence exceeds the genus level. c The stacked barplot shows the number of gene models derived from RNA-seq or homology data for the final annotations

Table 1 Assembly and gene annotation features of diplogastrid genomes

Divergence across genera hinders gene model inference from homology data

The fact that homology data from P. pacificus yields better results than C. elegans data indicates that the ability to transfer gene annotations across species drops with increasing sequence divergence. This would imply that taking the highly curated C. elegans data as a reference, is suboptimal for annotating divergent nematode genomes. To test if the drop in homology-based annotation accuracy occurs already within the same nematode family, I reevaluated the BUSCO completeness of the gene annotations only inferred from homology data (Fig. 2b). While gene models based on homology with C. elegans have a constantly low BUSCO completeness of around 36%, the completeness values for the gene models inferred from community-curated P. pacificus data range between 59 and 84% (Fig. 2b). Note that the two non-Pristionchus genomes have with 59 and 60% the lowest completeness values whereas the lowest value for the remaining Pristionchus genomes is 75% (Fig. 2b). Moreover, the evaluation of the contribution of homology-inferred vs. RNA-seq derived gene models to the final gene annotations also show a drop in the contribution of homology data for the two non-Pristionchus species (Fig. 2c). Taken together, these results suggest that the ability to transfer gene models based on protein conservation already drops beyond genus-level sequence divergence. This strongly limits the usefulness of model organism data such as from C. elegans to be helpful for annotating genomes in evolutionary distant nematode clades.

The quality of several nematode genomes might be underestimated

The new gene annotations of the nine diplogastrid genomes contain between 19 and 39 thousand gene models that are completely evidence-based as they are either supported by transcriptional evidence or by protein conservation with P. pacificus (Table 1). However, between 9 and 15% of BUSCO genes (nematode odb10 data set) seem to be missing in the Pristionchus genomes (Table 1). If these missing genes are due to misannotations in individual genomes, further rounds of manual corrections across this phylogenomic data set could be used to reannotate missing genes in selected genomes. In such a case, the abundance of gene absence/presence patterns should be more or less randomly distributed. However, if missing genes are due to massive gene losses across the diplogastrid lineage, the distribution of patterns should be dominated by a phylogenetic pattern that could be parsimoniously explained by a single evolutionary event (Fig. 3a). The two most abundant presence/absence patterns are 1490 BUSCO genes that are found in all nine diplogastrid genomes and 316 genes that seem to be missing in the Parapristionchus giblindavisi genome, which seems to be the most problematic data set (Fig. 3a). The third most abundant pattern arises from 204 genes that were not found in any of the diplogastrid genomes, which is suggestive of a lineage-specific gene loss. To exclude that these genes are missing due to gene annotation failure, the BUSCO pipeline was run in genome mode against the raw assembly (Additional file 1, Table S1). This confirmed that 185 (91%) of those genes could not be detected by running BUSCO on the raw genome assembly. As mentioned above, an alternative explanation would be that these genes are present but could not be detected by the BUSCO pipeline as the nematode odb10 data set does not represent the full diversity of the nematode phylum [14, 23]. Thus, running the BUSCO pipeline with a phylogenetically more broadly sampled set of taxa should capture these missing genes. Consistently, the older nematode odb9 data set, which included data from P. pacificus, yields higher completeness values (median value 91%, Table 1). To further test the possibility of undetected orthologs, I used a complementary approach to find one-to-one orthologs of the corresponding C. elegans genes in the diplogastrid genomes based on best reciprocal BLASTP searches. This revealed that 101 (50%) of these 204 genes have predicted one-to-one orthologs in all diplogastrid genomes, which points to a failure of detection of the BUSCO pipeline (Fig. 3b). Comparison of the bitscores from BLASTP searches between BUSCO genes in C. elegans and their putative orthologs in the diplogastrid genomes shows a pronounced difference between the diplogastrid sequences that were detected as orthologs by BUSCO and the sequences that were only identified as best-reciprocal hits (Fig. 4a). Further analysis of alignment length and percentage identity indicates that this difference is due to stronger sequence divergence in the diplogastrid sequences that were not identified as orthologs by BUSCO (Fig. 4b, c). Similarly, the analysis of length-normalized bitscores shows strong differences between BUSCO orthologs and best-reciprocal hits, whereas the overall aligned proportion is more comparable between both groups (Additional file 1, Figure S2). To further support that at least some of these genes are truly orthologs, I reconstructed gene trees for 12 randomly chosen candidate gene families (Fig. 5a-l). Most of these phylogenies largely resemble the species phylogeny (Fig. 1) and therefore support the one-to-one orthology relationship. Thus, I conclude that the insufficient sample size and phylogenetic breadth of the nematode odb10 data set may cause failures in the ortholog detection by the BUSCO pipeline and that the quality of divergent nematode genomes might therefore be underestimated.

Fig. 3
figure 3

Missing BUSCO orthologs in diplogastrid genomes. a Classifications from the BUSCO results (nematode odb10) were summarized across all diplogastrid genomes and the twenty most abundant patterns are displayed. While misannotations should result in random patterns with low frequency, highly abundant patterns can point towards problematic data sets (e.g. Parapristionchus), true biological events (e.g. gene losses) or systematic biases. b Complementary ortholog predictions based on best-reciprocal BLASTP hits show that 101 (49.5%) of BUSCO genes, which were predicted to be missing in all diplogastrid genomes, have putative one-to-one orthologs

Fig. 4
figure 4

Sequence divergence causes orthology detection failures. a The left heatmap shows the bitscores for 101 randomly subsampled BUSCO orthologs derived from a BLASTP search of the C. elegans proteins against annotated protein sets of the ten diplogastrid genomes. The right heatmap shows the bitscores for 101 putative orthologs that were only predicted based on best-reciprocal BLASTP hits. b The heatmaps show the BLASTP alignment lengths for BUSCO genes from C. elegans against diplogastrid proteins for 101 randomly subsampled genes with BUSCO orthologs (left) and for 101 best-reciprocal hits (right). c The heatmaps show the percentage identity between BUSCO genes from C. elegans and their putative diplogastrid orthologs

Fig. 5
figure 5

Phylogenetic analysis supports orthology relationships. Twelve candidate gene families (a-l), where BUSCO failed to detect orthologs in diplogastrid genomes, were selected as candidates for phylogenetic analysis. Maximum-likelihood trees were reconstructed from BUSCO orthologs of other nematode species as well as diplogastrid sequences, which were identified as best-reciprocal hits against C. elegans proteins. Stars denote branches with a bootstrap support ≥99/100. In most cases, the phylogenies are largely consistent with the species tree and the diplogastrid sequences form a monophyletic clade that correctly groups as sister to the rhabditids (C. elegans and H. contortus)

Discussion

In this study, I have demonstrated the benefit of the community-curated P. pacificus gene annotations for phylogenomic data of related species. The automated improvements of gene annotations in other diplogastrid nematodes suggest that not every genome has to be manually curated, but that a single high-quality genome per genus or family is sufficient to allow effective gene model inference based on homology. It has to be noted that community curation is not the only way to obtain high quality gene annotations as previous studies showed that high levels of BUSCO completeness could be obtained by extensively optimized application of gene prediction tools such as BRAKER and AUGUSTUS [25, 30, 31]. Specifically, the training of gene models is the most important step for such ab initio gene prediction methods. However, the question how good gene annotations have to be will depend on the research topics that are going to be studied. Previous studies of novel gene origin in P. pacificus pointed towards an inflated number of species-specific orphan genes that are most likely gene prediction artifacts recognizing spurious coding signals on the antisense strand of truly protein-coding genes [10]. Thus, to facilitate more conclusive studies of gene birth processes using purely evidence-based set of gene annotations, species-specific genes without RNA-seq support were removed from current annotations [4]. Furthermore, other types of gene prediction artifacts such as artificially fused gene models do not necessarily impact BUSCO completeness levels but impair the detection of orthologs [3]. In this context, community annotation seems to be one of the most effective methods to increase gene annotation quality beyond what can be achieved using automated pipelines. For example, in the case of the nine diplogastrid genomes, BUSCO genes that are found in the genome or transcriptome (Additional file 1, Tables S1, S2), but not in the final gene annotations can be taken as candidates to further improve gene annotation quality by manual curation. Nevertheless, it has to be noted that the effectiveness of community annotation is highly dependent on the availability of abundant transcriptomic resources. Thus, requiring support by either transcriptome data or protein conservation for every gene is not generally feasible for genome projects of phylogenetically isolated species with limited expression data.

The drop in the ability to infer gene models based on homology beyond the genus level emphasizes the rapid evolution and extreme diversity of nematode lineages [32, 33]. This might also at least partially contribute to the failure of detection of 101 putative orthologs by the BUSCO pipeline, as the underlying OrthoDB data set (nematode odb10) appears to be too small and unevenly sampled to capture the full diversity of the nematode phylum [14, 23]. As a consequence, the quality of many nematode genomes might be underestimated. In the case of the diplogastrid genomes, the completeness level is likely underestimated by up to 3% (101 out of 3131 BUSCO genes that were not detected but are present in all 10 genomes). The fact that this problem seems to be less pronounced in the older odb9 data set (Table 1) suggests that the more recent odb10 data set is not well suited for assessing the genome quality of divergent nematode species.

Thus, my study not only highlights the need to optimize gene annotation pipelines, but also points at potential problems in the benchmarking processes. Further work will be needed to establish more comprehensive benchmarking data sets as well as to optimize annotation protocols. For example, apart from the alignment program exonerate [28], there are multiple alternative approaches for transferring gene models such as GMAP or liftover, and Liftoff [34, 35]. In addition, gene models from transcriptome and homology data were integrated into a non-redundant gene set using a heuristic approach that selects the gene model with longest ORF per 100 bp window. Hereby, genes with less than three exons or a predicted protein length of less than 60 amino acids were discarded. Such arbitrary thresholds may remove unusually short genes with important biological functions and might not be directly transferable to other nematode genomes with a smaller number of exons per gene than P. pacificus [36]. Apart from the mentioned caveats, the presented methods should be directly applicable to genomes of other taxonomic groups, even outside of nematodes.

In future, gene model inference based on the highly curated P. pacificus data set will help other genomic studies in this clade of nematodes. Currently, the genus Pristionchus has around 50 described species and genus-wide phylogenetic studies revealed interesting trends such as the parallel emergence of hermaphroditism and the convergent evolution of specific pheromones [27, 37]. The generation of high quality genomes of further members of the Pristionchus genus may therefore help to characterize and compare the genomic basis of these convergent patterns. Simultaneously, the pool of more than thousand P. pacificus strains was successfully exploited to dissect the genetic basis of phenotypic variation at a population level [38, 39]. Such unbiased genetic screens revealed that genomic changes including gene duplication and loss cause natural variation in pheromone production and response [40, 41]. Thus, properly annotated de novo assemblies of different P. pacificus strains will greatly aid the interpretation of associations between genotypes and phenotypes and thus complement future genetic screens in P. pacificus.

Conclusions

While genome sequencing technologies have undergone tremendous development over the past 20 years, gene annotation protocols evolved at a much slower rate. In the case of the nematode model organism Pristionchus pacificus, community-based gene curations have previously been presented as an effective means to lift annotation quality above the level of what could be obtained by automated pipelines. Here, I make use of these community-curated annotations to automatically improve phylogenomic data of nine related nematodes. This work has three major conclusions. First, the community-curated P. pacificus genome improves the completeness of related nematode genomes by a median of 9% over previous annotations. With BUSCO completeness levels between 83 and 86%, the reannotated Pristionchus genomes are more complete than most other members of the nematode phylum. Second, the ability to infer gene models based on homology already drops beyond the genus level, which implies that the highly curated C. elegans data is not well suited for annotation of divergent nematode genomes. Third, the insufficient sample size and phylogenetic breadth of the BUSCO and OrthoDB data sets may prohibit the detection of orthologs and thus cause an underestimation of nematode genome quality.

Methods

Generation of evidence-based gene annotations

The data set of transcribed ORFs was taken from a previous study where transcriptomes of different Diplogastrid species were assembled from mixed-stage RNA-seq data (Additional file 1, Table S2) and partial and complete ORFs with a minimal length of 40 amino acids were extracted [27]. These transcribed ORFs were aligned against the respective reference assembly with the help of the exonerate protein2genome program with the following parameter settings: --bestn 2, −-dnawordlen 20, and --maxintron 20,000 (version 2.2.0) [28]. The homology model data set was generated by aligning P. pacificus proteins (El Paco gene annotation, version 3) against the reference assemblies using exonerate with the same parameter settings [4]. The -bestn option was set to two in order to annotate potential duplicates. Homology models and transcribed ORFs were merged into a joint annotation which included possibly multiple gene models (different isoforms in the assembled transcripts, different evidence types) for a given gene. The complexity of the joint annotation was reduced by a simple heuristic to generate a set of non-redundant annotations. First, gene models from each DNA strand were separated and every start and end coordinate of an exon was assigned to a 100 bp window. Second, For each 100 bp window the coverage was computed as the number of features that were assigned to this window. Third, starting with the most highly covered 100 bp window, the intersecting gene model with the longest ORF was chosen whereas all other intersecting gene models were discarded. At this step only gene models with ORFs of at least 60 amino acids and at least three exons were considered. This last step was successively executed for all other windows. As gene models in P. pacificus tend to have more exons per gene than most other nematodes [36], the threshold of at least three exons per genes was implemented to prevent an inflation of gene counts by partially assembled transcript fragments and transcriptional noise. The source code for generating these gene annotations is written in perl. All scripts have been compiled in a software package called PPCAC and are available at https://github.com/roedelsberg/ppcac/.

Comparative genomic analysis

Protein sequences from 54 nematodes were retrieved from WormBase ParaSite (WBPS14). In case of multiple isoforms, the longest isoform was chosen as the representative sequence [20, 33, 42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58]. To assess the completeness level of gene annotations, the BUSCO pipeline (version 4.1.1) was run in protein mode with the nematode odb10 data set (Creation date: 2019-11-20, number of species: 7, number of BUSCOs: 3131). This combination of BUSCO and odb10 was used for most analyses (Figs. 1, 2 and 3). However, for comparisons with a set of orthologous genes, which include data from P. pacificus, the BUSCO pipeline (version 3.0.1) was run with the nematode odb9 data set (Creation date: 2016-02-13, number of species: 8, number of BUSCOs: 982). From the result files of the BUSCO pipeline, genes that were classified to be missing were extracted and compared with predicted one-to-one orthologs for C. elegans genes that were obtained by best-reciprocal BLASTP searches (version 2.6.0, e-value < 0.0001, bitscores were used to define the best BLASTP hit). From the BLASTP results, bitscores, percentage identity, and alignment lengths were extracted and used for comparison between the BUSCO orthologs and best-reciprocal hits. For visualization and comparison of BLASTP features, the set of 1490 BUSCO genes with orthologs in all diplogastrid genomes (Fig. 3a) was repeatedly downsampled to match the 101 putative orthologs that were identified by best-reciprocal BLASTP searches (Fig. 3b). Phylogenies of selected orthologous groups were generated by extracting the corresponding BUSCO orthologs in the high quality gene annotations of C. elegans, Haemonchus contortus, Brugia malayi, Onchocerca volvulus, and Trichinella spiralis and combining these with the best-reciprocal hits in the diplogastrid genomes. Subsequently, protein sequences were aligned by the MUSCLE program (version 3.8.31, default options) and maximum likelihood trees were calculated using the phangorn package in R (version 3.4.4, LG substitution model with optimization of base frequencies and invariant sites, 100 bootstrap pseudoreplicates) [59, 60]. The orthologous T. spiralis sequences, as identified by BUSCO pipeline, were used to root the trees.