Background

The Earth Biogenome Project (EBP) aims at sequencing and annotating all eukaryotic life on Earth within ten years [1]. It has brought about an explosion of genomic data: for instance, the Wellcome Sanger Institute alone currently aims at sequencing and assembling 60 genomes per day. This provides an unprecedented opportunity to study the diversity of life on Earth. Generating genome assemblies is now easier than ever thanks to cheaper sequencing, e.g. with Nanopore technology (for review of technology see [2]). However, while the number of available genomes continues to rapidly increase, the annotation of protein-coding genes remains a bottleneck in the analysis of these data [3]. This is, for instance, obvious from screening through Data Note Genome Announcements at Wellcome Open Research [4], or from counting genomes and their annotations at NCBI Genomes, where on April 3rd 2023, only 23% of 28,754 species are listed with the annotation of at least one annotated Coding Sequence (CDS) [5].

Genome annotation remains a bottleneck because it is currently not a straightforward approach. Large centers, such as Ensembl at EBI or the NCBI, are facing computational and human resources bottlenecks to apply their in-house annotation pipelines to all incoming genomes, while small and less experienced teams simply might not know where to start because not all annotation pipelines work equally well in all genomes.

BRAKER3 [6], a pipeline that combines the gene prediction tools GeneMark-ETP [7] and AUGUSTUS [8, 9] for fully automated structural genome annotation with short read transcriptome data (RNA-Seq) and a large database of proteins (such as an OrthoDB clade partition [10]) was recently demonstrated to have high accuracy for the particular input scenario of genome file, RNA-Seq short read data, and a protein database. However, despite the EBP encouraging the sequencing of transcriptomes alongside genomes [3], it can be difficult to obtain RNA-Seq data for some organisms for logistical or financial reasons, or an initial genome annotation can be desired before a transcriptome is sequenced. Also, some genes may not be expressed in tissues being sequenced and thus do not have RNA-Seq support. Conservation species often need to be annotated for gene-level genetic load estimation, frequently lacking RNA-Seq data. In invasomics, annotation of protein coding genes is of particular importance for exploratory gene drive studies, and generating probes for expression and localization studies. For both, high-quality rapid annotation is essential to move towards downstream analyses.

In the lack of transcriptome evidence, it is a common procedure to annotate novel genomes by leveraging spliced alignment information of proteins from related species to the target genome. Since the resulting alignments usually only cover a fraction of all existing genes in a genome and do not cover untranslated regions (UTRs), protein alignments are commonly combined with gene prediction tools that employ statistical models (e.g. AUGUSTUS, SNAP [11], and variants of GeneMark [12,13,14]) to identify the other fraction of genes as good as possible. MAKER [15,16,17] was an early pipeline that automated this for the gene prediction step (though it lacks automated training of gene predictors). FunAnnotate [18] was originally designed to train gene finders using RNA-Seq data but also provides a workaround for protein input on fungi. It has since also been applied to other eukaryotic genomes (a random example: [19]). In contrast to these algorithms, which usually use evidence from one or a low number of donor proteomes, BRAKER2 [20] is a pipeline that leverages a large database of proteins with GeneMark-EP [13] and AUGUSTUS to predict protein-coding genes. BRAKER2 fully automates the training of GeneMark-EP and AUGUSTUS in novel genomes. BRAKER2 was previously demonstrated to have higher accuracy than MAKER [20].

In order to allow for the alignment of a large number of protein sequences in a reasonable time, GeneMark-EP first runs self-training GeneMark-ES [12, 14] to generate genomic seeds. Subsequently, DIAMOND [21] quickly returns hits of proteins against those initial candidate protein-coding sequences found in the genome, and Spaln [22, 23] is applied to run accurate spliced-alignment of the best matching protein sequences against the genomic seeds. BRAKER2 executes one iteration of this process to expand the genomic seed space by AUGUSTUS predictions. This complex sub-pipeline is called ProtHint and was introduced to make the alignment of a large database of proteins against the genome for evidence generation computationally feasible on desktop machines. BRAKER2 generally achieves high accuracy in small and medium-sized genomes. In large genomes (e.g., the genome of a chicken or mouse), self-training GeneMark-ES performs poorly during seed generation, leading to lower prediction accuracy of BRAKER2.

With the appearance of miniprot [24], a very fast and accurate tool for spliced-aligning proteins to genome sequences, the question arose whether it is necessary to run a complicated pipeline such as ProtHint in order to generate evidence and training genes to annotate novel genomes with protein evidence with high accuracy. Moreover, miniprot has no problems processing average vertebrate-sized genomes and therefore promises to overcome the main shortcoming of BRAKER2 in terms of accuracy in large genomes.

With regard to the EBP, we expect the appearance of a large number of genomes for which suitable reference proteomes for running BRAKER2 will not be fully available. BRAKER2 requires a large protein database input; it usually fails to run with reference proteins of only one species because its components, ProtHint and GeneMark-EP, rely heavily on evidence derived from multiple alignments (requiring \(>= 4\) supporting alignments to classify a hint as high-confidence). This hinders BRAKER2’s ability to annotate genomes of poorly sequenced clades where only one reference relative is often available.

In order to address these open questions and challenges, we designed GALBA. GALBA is a fully automated pipeline that takes protein sequences of one or many species and a genome sequence as input, aligns the proteins to the genome with miniprot, trains AUGUSTUS, and then predicts genes with AUGUSTUS using the protein evidence. In this manuscript, we describe the GALBA pipeline and evaluate its accuracy in 14 genomes with existing reference annotation. Further, we present three use cases of de novo genome annotation in insects, vertebrates, and one land plant. We also evaluate the effect of merging GALBA and BRAKER2 gene sets with TSEBRA [25], the transcript selector for BRAKER.

Our pipeline is fully open source, containerized, and addresses the critical need for accurate gene annotation in large newly sequenced genomes. We believe that GALBA will greatly facilitate genome annotation for diverse organisms and is thus a valuable resource for the scientific community.

Results

We first briefly describe the GALBA pipeline and the effect of several features on gene prediction accuracy. Subsequently, we present accuracy results of the final software in 14 species. Further, we present three different use cases for GALBA.

GALBA pipeline

GALBA is a pipeline that connects three main components to predict protein coding genes: Firstly, we employ miniprot [24] to splice-align input protein sequences to the genome, and then use miniprothint [26] to score the resulting alignments and categorize the evidence into low- and high-confidence classes. We utilize the high-confidence alignment-derived genes with the highest alignment score per locus to train the gene prediction tool AUGUSTUS [8, 9]. Subsequently, we run AUGUSTUS with the Python package Pygustus to predict genes using the protein evidence in multithreading mode. After the first round of prediction, we select genes with 100% evidence support according to AUGUSTUS for a second round of training, while all other predicted genes are used to delineate flanking intergenic regions for the training of parameters for non-coding sequences. Then, we obtain the final set of predicted genes by AUGUSTUS (see Fig. 1). The idea of GALBA is that training AUGUSTUS on the basis of miniprot alignments will enable AUGUSTUS (with hints) to obtain a gene set that is more accurate and more complete than the miniprot alignments on their own. We show that GALBA works as expected in terms of accuracy with respect to reference annotations on the example of 14 species in Additional file 1: Table S10. This is also reflected by the drastically increasing complete BUSCOs when moving from training gene set to AUGUSTUS gene set within GALBA (see Additional file 1: Table S12).

Fig. 1
figure 1

The GALBA pipeline. Miniprot performs rapid spliced alignment of proteins against the genome. Subsequently, miniprothint (2) scores and classifies these alignments. Training genes for AUGUSTUS are generated from the best high quality miniprot alignment per locus (1). After training, AUGUSTUS predicts genes using the alignment evidence generated by miniprothint. AUGUSTUS parameters are refined by one iteration of training (3). The numbering of steps in the figure caption corresponds to the order in which steps were introduced into GALBA during development, see Additional file 1: Results section S4.1

GALBA was implemented in Perl, building on the existing codebase of BRAKER [27].

Effect of mutation rate from reference to target

GALBA is designed to be used with reference proteomes of (possibly several) closely related species. It is predictable that spliced protein to genome alignment with miniprot works better the lower the mutation rate from donor to target is. We provide results of GALBA runs with single-species reference protein inputs in D. melanogaster next to a phylogenetic tree that indicates mutation rates to provide users a reference for how similar a donor species should be to achieve good results with GALBA (see Fig. 2).

Fig. 2
figure 2

Gene prediction of GALBA provided with either a proteome of a single reference species (corresponding to phylogenetic tree from [57]), or executed with a combination of the species listed on the right. BRAKER2 can only be executed with a certain level of redundancy in the protein reference set, and results are therefore only provided for the combined protein input set

When executed using all annotated proteins of the target species itself, GALBA achieves a gene F1 of 79.5% (F1-scores are in this manuscript defined as \(\frac{2 \cdot \text {Sensitivity} \cdot \text {Specificity}}{ \text {Sensitivity} + \text {Specificity}}\)). When moving to D. ananassae, the accuracy drops by \(\sim\)7.5% points. Gene F1 does not drop below 63.6% when moving away to D. grimshawi, and even with Musca domestica input, GALBA maintains an accuracy of 57%. Interestingly, accuracy is restored to 71% when using a combined input of five protein donors. This last experiment can in fact also be performed with BRAKER2, which scores 3% points higher accuracy compared to GALBA.

Accuracy in genomes with reference annotation

We provide accuracy results measured in genomes of 14 species by comparison to existing annotations (see Figs. 3 and 4 for sensitivity and specificity on gene level, and Table 1 for F1-scores for gene, transcript, and exon levels). The annotations of the small model organisms Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster have undergone extensive curation [28], and thus we believe that benchmarking on these data sets gives a realistic estimate of the true accuracy of gene prediction pipelines. Annotations of the other species are much less reliable. Therefore, we report gene prediction sensitivity measured on two more reliable subsets created by selecting transcripts that (1) are complete and have all introns supported by RNA-Seq mapping (Additional file 1: Table S3); (2) have identical gene structures in two distinct reference annotations (Additional file 1: Table S4).

Fig. 3
figure 3

Sensitivity and Specificity on gene level in 7 genomes smaller than 500 Mb. We show accuracy of miniprot raw alignments, AUGUSTUS ab initio trained on filtered miniprot alignments, GALBA (AUGUSTUS with hints by miniprot), BRAKER2, GeneMark-EP, GeneMark-ES, and a combination of GALBA and TSEBRA (labelled as TSEBRA G+B)

Fig. 4
figure 4

Sensitivity and Specificity on gene level in 7 genomes larger than 500 Mb. We show accuracy of miniprot raw alignments, AUGUSTUS ab initio trained on filtered miniprot alignments, GALBA (AUGUSTUS with hints by miniprot), BRAKER2, GeneMark-EP, GeneMark-ES, and a combination of GALBA and TSEBRA (labelled as TSEBRA G+B)

Table 1 F1-scores of gene predictions for the genomes of 14 different species

We decided to show GALBA and BRAKER2 results with identical multi-species protein input side-by-side. Since users of BRAKER2 may be familiar with the Transcript Selector for BRAKER (TSEBRA) for combining several gene sets, we also provide TSEBRA results for which the GALBA and BRAKER2 outputs including their evidence were combined, enforcing the predictions by GALBA to avoid a drop of all transcripts without support by evidence. In large vertebrate genomes, GALBA shows a large improvement in accuracy compared to BRAKER2 (between 10 and 30% points in the gene F1-score). In small and medium-sized genomes, BRAKER2 is usually superior to GALBA. In A. thaliana, D. melanogaster, M. truncatula, P. tepidarorium, R. prolixus, and T. nigroviridis, BRAKER2 is \(\ge\)5% more accurate on the gene level than GALBA. GALBA shows particularly poor accuracy in C. elegans (17% points less than BRAKER2) and P. trichocarpa (7% points less than BRAKER2). In B. terrestris and S. lycopersicum, GALBA perfoms marginally better than BRAKER2.

This general impression also holds when looking at the subset of multi-exon genes that are supported by RNA-Seq from VARUS sampling (see Additional file 1: Table S3), and when inspecting Sensitivity in the subset of genes that are supported by more than one annotation provider (see Additional file 1: Table S4). In large vertebrate genomes, GALBA here achieves astonishing exon F1-scores of \(>90\%\), and gene F1-scores \(>70\)%, outperforming BRAKER2 by up to 42% points on the gene level.

Since BRAKER2 was originally designed to run with a large database of proteins instead of a hand-picked proteome of few closely related species, we show BRAKER2 results with OrthoDB v11 partitions for different taxonomic phyla (Arthropoda, Metazoa, Vertebrates, Viridiplantae), excluding proteins of the target species, and adding the hand-picked proteomes of close relatives by concatentation. This input does not change accuracy results much (see Additional file 1: Table S7). To the best of our knowledge, BRAKER2 is the most suitable pipeline for annotation scenarios where closer relatives have not been sequenced and annotated, yet. Therefore, we also provide BRAKER2 results with OrthoDB partitions, excluding proteins of species that are in the same taxomomic order as the target species.Footnote 1 In M. truncatula, P. tepidariorum, P. trichocarpa, and T. nigroviridis, BRAKER2 is even more accurate than GALBA using the remotely related protein set (see Additional file 1: Table S7).

It is an interesting question whether combining the GALBA and BRAKER2 gene sets (with the same protein input) with TSEBRA provides increased or restored accuracy. In general, TSEBRA tends to increase the ratio of mono-exonic to multi-exonic genes (see Table 2 and Additional file 1: Figure S5). In species where both GALBA and BRAKER2 shows initial comparable accuracy, TSEBRA application usually increases the accuracy by a few percentage points. However, if the GALBA gene prediction accuracy is particularly poor (e.g., in the case of C. elegans), then TSEBRA does not fully restore accuracy to the better gene finder (here BRAKER2). For large vertebrate genomes, the TSEBRA approach consistently yields very good results (despite increasing the amount of single-exon genes), although the effect varies between about 1% point on gene level in D. rerio and 13% points in M. musculus.

Table 2 Ratios of mono-exonic to multi-exonic genes per species

Since GALBA may also be executed with a single reference proteome, we provide results of such experiments, using the closest relative from our selection of protein donor species. Using a single protein donor instead of a set of several with GALBA usually leads to a decrease in accuracy (on average 4% points gene F1). This effect can be less strongly observed in species where GALBA performs comparably poorly (e.g., R. polixus or P. tepidariorum).

We also report results of FunAnnotate (see Additional file 1: Table S7) with the same protein and genome input as GALBA and BRAKER2, but these results are not directly comparable since this pipeline requires specification of a seed species for training AUGUSTUS, and of a BUSCO [30] lineage, and accuracy results may heavily depend on the selection of these (here used seed species and BUSCO lineages are listed in Additional file 1: Table S6). FunAnnotate was competetive with GALBA (and BRAKER2) only in the case of predicting genes in A. thaliana.

Use case examples

GALBA is widely applicable to eukaryotic genomes of different sizes and assembly quality. In the following, we present three use cases.

Insect genomes

We compare annotation results for four Hymenoptera species across three pipelines: GALBA, BRAKER2, and FunAnnotate. For this, we select three high-quality wasp genomes from [31], Vespula vulgaris, V. germanica, and V. pensylvanica, previously annotated using FunAnnotate with multiple rounds of annotation polishing, and one additional wasp generated with short-read assembly, [32] Polistes dominula (see Table 6). Input proteome to all three consisted of UniProt Swiss-Prot [33] release 2023_01, combined with published proteomes from RefSeq [34] release 104 of Apis mellifera HA v3.1 [35] and Polistes canadensis [36].

Compared to the other pipelines, GALBA consistently predicts the most genes. BUSCO scores are comparable with BRAKER2 and higher than FunAnnotate (see Table 3). GeneValidator [37], which scores individual proteins, serves as a larger metric for analyzing genome annotation results and scores individual protein predictions. GALBA predicts more higher-quality proteins, however the lower quartile for GALBA is always 0, while for BRAKER2 the average lower quartile is 39.3. Taken together, this shows GALBA predicts a larger number of both high-quality and low-quality proteins. Both pipelines outperform FunAnnotate in every metric. However, FunAnnotate was designed for use with RNA-Seq data (on fungi), so this is likely to be expected.

Table 3 Summary across four Hymenopteran insect genomes and de novo annotation pipelines

Vertebrate genomes

Three years ago, the Zoonomia consortium presented a large whole-genome alignment of various vertebrates [38]. Many of the genomes in this alignment have not been annotated for protein-coding genes until today. Most of the unannotated assemblies in the alignment were produced by short-read genome sequencing and are thus fragmented and incomplete, and for many species (reflected by a low N50, a very large number of scaffolds, and BUSCO completeness far below 100%), there is no transcriptome data available in the Sequencing Read Archive [39]. We de novo annotated all whale and dolphin assemblies from that alignment that lack RNA-Seq evidence (see Table 6). The selected reference protein sets are listed in Additional file 1: Table S1.

We were able to apply multi-threaded GALBA to these genomes without any problems. GALBA predicted between 53k and 78k genes in these assemblies. The ratio of mono- to multi-exonic genes suggests an overprediction of single-exon genes. It should be noted that AUGUSTUS is capable of predicting incomplete genes that span sequence borders, and that the high single-exon count is not caused by genome fragmentation alone. Removing all incomplete genes from the prediction does not substantially decrease the mono:mult ratio (data not shown). BUSCO-completeness of predicted genes is comparable to the BUSCO-completeness of the corresponding genomic assemblies (see Table 4 and Additional file 1: Figures S3 and S2). OMArk [40], a tool that provides an estimate on annotation quality for a much larger set of conserved genes than BUSCO, also indicates a high level of completeness in these genomes (see Additional file 1: Table S8). However, the number of unexpected duplicate HOGs is large for these annotations. The consistency report of OMArk shows that the predicted genes are to a large extent possibly incomplete/fragmented (which is here likely caused by the genome assembly quality).

Table 4 Summary of protein-coding gene structures predicted in the previously unannotated whale and dolphin genomes of Zoonomia [38], and in Coix aquatica

Plant genome

We chose the genome of the plant Coix aquatica [41] (see Table 6) to demonstrate the ability of GALBA to de novo annotate large chromosome-scaffolded genomes (see Table 6). This species is one of many that currently lack an annotation of protein-coding genes at NCBI Genomes (even though the publication [41] describes an annotation approach and statistics on predicted protein coding genes), and there is no RNA-Seq data of this species available at the Sequence Read Archive (even though [41] report having used RNA-Seq data for annotation). In practice, a Coix aquatica focused scientist would request the gene set from the authors of [41], but here, we took it as a de novo annotation example. Four reference proteomes used with GALBA are listed in Additional file 1: Table S1.

GALBA predicted 93k genes with a mono- to multi-exonic gene ratio of 1.07 in Coix aquatica. This is an overprediction compared to the number of 39,629 genes reported by [41]. However, the BUSCO sensitivity in the GALBA gene set is with \(\sim\)98% very high and comparable to BUSCO completeness of the assembly (see Additional file 1: Figure S4). OMArk also attests to a high degree of HOG completeness. Compared to the whale and dolphin gene predictions, the predictions in this plant genome show a much lower degree of fragmentation (see Additional file 1: Table S8). About half of the predicted proteins are placed as inconsistent, and most of these are identified by fragmented hits. GALBA here provided a quick and simple means to obtain a gene set.

Runtime

We report wallclock time passed when running GALBA on D. melanogaster using proteins of D. ananassae, D. pseudoobscura, D.willistoni, D. virilis, and D. grimshawi on an HPC node with Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz using 48 threads. A complete GALBA run took 3:24 h. A full BRAKER2 run on the same node took 3:03 h. The most time-consuming step of GALBA (and BRAKER2) is often the metaparameter optimization for AUGUSTUS. This step can optionally be disabled (--skipOptimize), leading to slightly lower prediction accuracy in most cases. Without this optimization step, a GALBA run with the same input data took 0:44 h.

As a second example, we report wallclock time of 8:52 h for de novo annotation of the Coix aquatica genome on an HPC node with Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz using 72 threads (including metaparameter optimization). On the same data set and architecture, BRAKER2 required 11:11 h.

Discussion

Obtained accuracy results of GALBA are far from perfect when compared to reference annotations. However, GALBA provides substantially higher accuracy than BRAKER2 in the genomes of large vertebrates because GeneMark-ES within BRAKER2 performs poorly in such genomes when generating seed regions for spliced-alignment of proteins to the genome. We estimate, that to date, \(\sim\)1k unannotated genomes without transcriptome data could benefit from structural annotation with GALBA (see Additional file 1: Methods S3.7).

In smaller genomes, BRAKER2 remains superior because with the GeneMark-ES seeding process, it is able to produce protein to genome alignments with a higher specificity than miniprot (compare Fig. 5 and Additional file 1: Table S11).

Fig. 5
figure 5

Network plot of gene F1 accuracy for (clockwise starting from the top, increasing genome sizes) insects, metazoa, plants, and vertebrates. We show accuracy of GALBA and its intermediate product miniprot, and of BRAKER2 and its intermediate GeneMark-ES and GeneMark-EP gene sets. Accuracy of the combiner TSEBRA combining the final gene sets of both GALBA and BRAKER2 is also shown as TSEBRA G+B

Further, we demonstrate that GALBA can process highly fragmented as well as large genomes in multi-threading mode, mainly attributed to the usage of Pygustus. We expect the Pygustus approach to be adopted in BRAKER to improve stability.

Implementing pipelines that leverage protein-to-genome alignment for training and running gene finders is not straightforward. In this work, we once more demonstrate that alignment scoring is crucial for achieving high gene prediction accuracy when protein evidence is used as the sole extrinsic evidence source.

While neither GALBA nor BRAKER2 can compete with pipelines that integrate RNA-Seq as an additional source of evidence, such as BRAKER3, GALBA is a valuable addition to closing the annotation gap for already deposited genomes and for future genomes generated within the EBP for which RNA-Seq data is not available.

Combining multiple gene sets commonly yields higher accuracy than using a single gene set of a single gene predictor. However, the authors caution users that combining gene sets from different sources may not always lead to improved accuracy, and users of genome annotation pipelines should proceed with caution. Recommended estimates for gene set quality are BUSCO Sensitivity, the number of predicted genes, and the mono-to-multi-exon gene ratio.

Both GALBA and BRAKER2 tend to heavily overpredict single-exon genes, most likely a result of incorrectly splitting genes. For plants, a desired mono- to multi-exonic gene ratio of 0.2 was recently postulated by [42]. This particular ratio certainly does not hold for non-plant species, and also the reference annotations of plants used in this manuscript often deviated from that recommendation. Nevertheless, GALBA, BRAKER2, and TSEBRA output may benefit from downstream mono-exonic gene filtering. The EBP would benefit from future developments to address the split gene problem in pipelines for fully automated annotation of protein-coding genes.

GeMoMa is a different approach towards an accurate mapping of annotated protein-coding genes from one species to the genome of another [43,44,45]. GeMoMa does not work with protein sequence input in FASTA format but requires a gff3 or gtf file with the annotation of a related species. It was previously shown that GeMoMa has higher base Sensitivity in the human genome using the zebrafish annotation as the donor, while miniprot has higher base Sensitivity in the fruit fly when using the mosquito annotation as input. It is to be expected that a pipeline such as GALBA will yield more accurate results using GeMoMa instead of miniprot if GeMoMa achieves higher accuracy with a given input scenario. We have previously demonstrated that combining GeMoMa with BRAKER [46] and TSEBRA can be beneficial for annotating plant and insect genomes [47,48,49]. Particularly for larger genomes, it is worth replacing BRAKER2 with GALBA in such workflows in the future.

Recently, Helixer demonstrated the potential of modern machine learning for genome annotation [50]. Accuracy is not competitive, yet, possibly because these methods do not currently allow for the integration of extrinsic evidence. However, we believe that once an improved and more accurate gene finder on the basis of modern machine learning technology has been trained, it will be of great advantage not only in terms of accuracy, but also in terms of reduced runtime compared to methods such as GALBA.

We intend to expand GALBA in the future. For example, we might incorporate Helixer for faster trimming of the flanking regions of training genes for AUGUSTUS. Also, there is room for improvement in the hints generation given that the protein donors for GALBA might not always be closely related (see Additional file 1: Table S2).

There is a substantial gap in data processing between producing a GALBA (or BRAKER2) output and submission of the annotation to e.g. NCBI Genomes. This gap is already addressed in FunAnnotate, and also to some extent in MOSGA, a web service that executes BRAKER [51]. We expect the definition of a new standard for third-party genome annotation tagging in the foreseeable future. We will then adapt GALBA to produce an annotation that matches this novel standard in order to facilitate genome annotation tagging.

Conclusions

GALBA is an easy-to-use pipeline for the annotation of protein coding genes. It has competitive accuracy, in particular, it is superior to the BRAKER2 pipeline in the annotation of large vertebrate genomes.

Methods

Sequences for accuracy estimation

For estimating prediction accuracy of gene prediction tools, genomes with an already existing annotation are required. Here, we resort to using the genomes and annotations of 14 species (see Table 5), collected from two previous publications. Data of Arabidopsis thaliana, Bombus terrestris, Caenorhabditis elegans, Drosophila melanogaster, Rhodnius prolixus, Parasteatoda tepidariorum, Populus trichocarpa, Medicago truncatula, Solanum lycopersicum, and Xenopus tropicalis prepared as described in [20],Footnote 2 annotation supporting RNA-Seq evidence described at [53]. In addition, we used the following genomes and annotations from [7]Footnote 3: Danio rerio, Gallus gallus, and Mus musculus. For each species, reliable transcripts were identified, either by definition if at least two annotation providers report a transcript identically, or if all introns of a transcript have support by a spliced alignment from RNA-Seq evidence sampled with VARUS [55]

Table 5 Summary of genomes and annotations used for accuracy evaluation

As protein input, we manually selected the reference protein sets listed in Additional file 1: Table S1 from NCBI Genomes. These include close relatives of the target species. In short, we used NCBI Taxonomy [56] to identify species that are closely related to the target species and that have a protein sequence set originating from nuclear genome annotation. In order to enable a direct comparison with BRAKER2 (which cannot be executed with a protein set from only one reference species), we ensured to pick a minimum of three protein sets for annotating each species.

Since GALBA is a pipeline that may also be executed with only one reference proteome, we also present accuracy with such single-species protein sets. In general, we selected the closest relative, with the exception of experiments in Drosophila melanogaster, where we excluded D. simulans and D. erecta from the combined protein set, and from selection as single species reference because they have less than 0.2 expected mutations per genomic site and are thus extremely similar to the target species (see Fig. 2).

Successful generation of high-quality protein to genome alignments depends on the phylogenetic distance between donor and target species. We demonstrate this by evaluating GALBA in single-reference-mode on D. melanogaster, using protein donor species arranged on a phylogenetic tree from [57].

Software

All software versions used to generate results in this manuscript are listed in Additional file 1: Table S5.

Miniprot extensions

Miniprot was modified to output detailed residue alignment in a compact custom format to facilitate alignment parsing for scoring with miniprothint. An example of this format is shown in Additional file 1: Figure S1. Further, a new option -I was introduced that automatically sets the maximal size of introns to \(3.6\cdot \sqrt{\text {genomeSize}}\). On the Drosophila-Anopheles benchmark dataset used in the miniprot paper [24], the new feature doubles the alignment speed and reduces the number of spurious introns by 16.3% at the cost of missing 0.5% of introns that are longer than the threshold.

Miniprothint

During early development of GALBA, it became clear that miniprot (like any spliced aligner) may produce spurious alignments if the reference proteins originate from distantly related species (compare Additional file 1: Table S2). Furthermore, conflicting alignments of homologous proteins from multiple donor species negatively impacted the quality of the AUGUSTUS training gene set. To solve these problems, we wrote an alignment scorer—here called miniprothint—that scores all predicted introns by computing the intron border alignment (IBA) and the intron mapping coverage (IMC) scores. Briefly, the IBA score characterizes the conservation of exons adjacent to the scored intron, with larger weights given to parts close to the donor and acceptor splice sites. The IMC score counts how many times a given intron was exactly mapped by spliced alignments of distinct target proteins. See [58], pages 20 and 21, for a precise definition of both scores.

Based on these scores, miniprothint discards the least reliable evidence and separates the remaining evidence into two classes: high- and low-confidence (see Additional file 1: Figure S6 for more details). High-confidence evidence is used to select training gene candidates for AUGUSTUS and is enforced during gene prediction with AUGUSTUS. Low-confidence evidence is supplied to AUGUSTUS in the form of prediction hints. In comparison to the scoring defined in [58], miniprothint adds penalties for in-frame stop codons and frameshifts (common in the alignments of remote homologs) and significantly improves the computational speed of alignment scoring. The speed improvements are, in part, achieved by taking advantage of miniprot’s compact alignment format (see Additional file 1: Figure S1).

Iterative training

When generating putative training genes for AUGUSTUS from any kind of extrinsic evidence, typically, only some of the actually existing gene structures will be identified in the genome. Otherwise, one would not need to train a gene finder to find the others. In the case of AUGUSTUS, training genes are excised from the genome with flanking and hopefully truly intergenic regions. There is a certain risk that a flanking region will, in fact, carry parts of neighboring genes. Using such “contaminated” intergenic regions can lead to sub-optimal training results. Therefore, we implemented the training of AUGUSTUS in GALBA as follows (e.g., suggested in [9]):

  1. 1

    etraining on the original training genes derived from evidence with possibly contaminated flanking regions

  2. 2

    prediction of genes with the evidence by AUGUSTUS after initial training

  3. 3

    selection of predicted genes with 100% evidence support, other genes are only eliminated from flanking regions

  4. 4

    etraining with training genes with filtered flanking regions that are free of predicted genes

  5. 5

    optimize_augustus.pl for metaparameter optimization

Multithreading AUGUSTUS

AUGUSTUS is not multithreaded and the gene prediction and metaparameter optimization steps can have a relatively long running time. To address this issue, the BRAKER pipelines split the genome into individual sequence files and execute AUGUSTUS using the Perl module ParallelForkManager. However, this approach can strain the file system when dealing with highly fragmented genomes, as a large number of files need to be generated.

To overcome this limitation, we developed Pygustus, a Python wrapper for AUGUSTUS that supports parallel execution. This allows for multithreading of AUGUSTUS prediction on genomes of any size and fragmentation level. Large chromosomes are split into overlapping chunks that are not too large for fast parallel execution. The overlaps are introduced to prevent the truncation of genes. Conversely, many short sequences are joined into temporary FASTA files of which there are not too many to strain the file system. Pygustus automatically and invisible to the user decides what sequences to split or join, and assemblies are allowed to have simultaneously very many (small) sequences and (few) very large sequences. The annotation is then done in parallel and the redundancies in annotations from overlapping runs are removed.

In GALBA, we use Pygustus to multithread AUGUSTUS predictions, thereby enabling efficient genome annotation without compromising the file system. This approach can be particularly useful for researchers dealing with large and complex genomes, where computational efficiency is critical.

Repeat masking

The genomes of 14 species used for accuracy assessment were previously masked for repeats in [13] and [7]. In short, species-specific repeat libraries were generated with RepeatModeler2 [59]. Subsequently, the genomes were masked with RepeatMasker [60] using those libraries. For vertebrate genomes, an additional step of masking with TandemRepeatsFinder [61] was performed.Footnote 4

The same approach was adopted for each whale and dolphin genome (including the TandemRepeatsFinder step). The additional TandemRepeatsFinder step was not applied to the insects and the plant in Table 6. For Polistes dominula, we used repeat masking as provided by NCBI Genomes. Genomes of Vespula species were masked with RepeatModeler and RepeatMasker as described in [31].

Table 6 Genomes de novo annotated with GALBA using reference protein sets listed in Additional file 1: Table S1 as use cases that demonstrate the applicability of GALBA

Accuracy evaluation

For selected genomes, we used the existing reference annotation to assess SensitivityFootnote 5 and SpecificityFootnote 6 of predictions by GALBA, BRAKER2, FunAnnotate, and TSEBRA on gene, transcript and exon level. For this purpose, we used the script compute_accuracies.sh that is a part of the BRAKER code. To summarize Sensitivity and Specificity, we computed the F1-score as

$$\begin{aligned} \frac{2\cdot \text {Sensivitity} \cdot \text {Specificty}}{\text {Sensitivity} + \text {Specificity}}. \end{aligned}$$

Prediction quality estimation

For estimating the quality of gene prediction in previously unannotated genomes, we provide BUSCO Sensitivity of both genomes and predicted proteomes [30], and OMArk results [40]. For BUSCO assessment of use case insect assembly and proteome completeness, we used hymenoptera_odb10. In dolphins and whales, we used the vertebrate_odb10 lineage. For Coix aquatica, we used the poales_odb10. Further, we report basic metrics such as the number of predicted genes, the number of transcripts, the recently suggested mono-exonic to multi-exonic gene ratio [42], and the maximum number of exons per gene across all predicted genes.

To provide a more fine-grained view on the insect annotation use case, we use GeneValidator [37], which scores the predicted proteins to a reference set by length, coverage, conserved regions, and identifies putative merges. Each predicted protein receives an individual score, with 90 being considered a good prediction, and a score of 0 indicating a very poor prediction, or a lack of BLAST hits to the reference proteome to estimate potential lengths and conserved regions. In this instance, we use our input proteome for the prediction tools (Swiss-Prot and RefSeq of A. mellifera and P. canadensis) consisting of 611,968 proteins.

Assembly statistics

We used seqstats and BUSCO to report basic assembly metrics (see Additional file 1: Methods).