Background

Currently, the National Center for Biotechnology Information’s (NCBI) GenBank [1] hosts 7978 eukaryotic genomes, with 3,208 of these genomes lacking an annotation of protein-coding genes. Notably, 746 genome annotations out of existing 4770 ones were generated by NCBI [2, 3]. The original authors frequently omit an annotation step and many publicly available genomes remain not annotated. Furthermore, re-annotation may be in order for many of the annotated genomes as more related sequence data has become available, or annotation methods have been improved since their initial application. Thus, there is a need for accurate automated annotation methods that use all available data and are easily accessible to bioinformatics teams.

The most useful data to support accurate genome annotation are transcriptomic sequence data, e.g. RNA-sequencing (RNA-seq) data, from the same species and protein sequences from species that are sufficiently closely related to the target species in the tree of life. RNA-seq reads spliced aligned to a genomic region are used to infer likely intron intervals [4] in protein-coding genes. In a similar way, likely exon and intron boundaries can be inferred using homologous proteins because segments of gene structures are often highly conserved [5]. Protein evidence has the advantage that it maps only to protein-coding genes, but with the downside that it depends on the degree of sequence conservation, which may differ between genes and available species. In contrast, RNA-seq is usually obtained from the same species and then, free of this dependency, covers only genes and spliced isoforms that are expressed in a sample. However, RNA-seq could be generated from non-coding genes; sequencing errors may render accurate alignments difficult. Ever-increasing throughput has resulted in large databases of RNA-seq. For example, the NCBI Sequence Read Archive (SRA) [6] hosts more than 36 petabytes of data, while the protein database OrthoDB [7] contains more than 37 million sequences.

Genome annotation methods that use statistical models of gene structures such as splice site patterns in addition to the evidence from RNA-seq and homology, are arguably best suited for whole-genome annotation [8]. BRAKER, a popular pipeline of competitive accuracy [9], has two modes of a genome annotation process supported by extrinsic evidence. BRAKER1 uses GeneMark-ET [10,11,12] together with AUGUSTUS [13,14,15,16,17] and relies on RNA-seq data to support gene finder training and accurate prediction of gene structures. BRAKER2 [18] exploits spliced alignments of homologous proteins as a source of extrinsic evidence for genome annotation with GeneMark-EP+ [19] and AUGUSTUS.

When heterogeneous extrinsic evidence sources are available, some genome annotation tools like MAKER2 [20] and GeMoMa [21] integrate these different sources directly into the annotation protocol. Some, like the recent FINDER [22], perform protein-spliced alignments only with proteins that are mapped to genes missed by RNA-seq-based methods. On the other hand, FINDER does not use RNA-seq evidence to assess or compare homology-based gene models. A different approach is to first generate multiple whole-genome annotations and then to use a combiner tool that takes various gene predictions as input with diverse sources of extrinsic evidence and constructs a genome annotation that is on average more accurate than any input genome annotation. Some previously developed combiner tools built their own gene structure model in the form of a graph and report a gene structure either based on the consensus of all available data, e.g. IPred [23], or as the result of a machine learning procedure such as most likely parse of an HMM, e.g. Combiner [24], JIGSAW [25], Evigan [26], ExonHunter [27]. A prominent combiner tool is the openly accessible EVidenceModeler (EVM) [28]. It uses a weighted consensus from all available evidence sources to predict a gene structure. EVM was successfully used to produce several high-quality annotations of novel genomes [29, 30].

In our approach, we first generate several sets of whole-genome gene predictions based on a single type of extrinsic evidence (i.e. by BRAKER1 and BRAKER2). We use a new combiner tool that scores and ranks these predictions (transcripts) based on heterogeneous evidence. Then, we select those with higher rank into a newly constructed genome annotation which is on average more accurate than any whole-genome annotation provided in the input. Up to now, the BRAKER suite has so far not been able to achieve a prediction accuracy that is reliably superior to either single-source evidence mode when using RNA-seq and proteins simultaneously [31, 32]. Nevertheless, BRAKER users often have both types of extrinsic data available for a target genome. Incidentally, most of the previously mentioned combiner tools are either not publicly available anymore, lack support, or are very difficult to use for combining BRAKER1 and BRAKER2 predictions. Therefore, we present Transcript Selector for BRAKER (TSEBRA), a fast software tool for selecting gene predictions from the output of two branches of the BRAKER eukaryotic gene prediction suite based on all the heterogeneous extrinsic evidence. TSEBRA achieves high accuracy and is easy to use. We show that it delivers a significant increase in accuracy with respect to the input annotations generated by BRAKER1 and BRAKER2.

Implementation

TSEBRA uses a set of arbitrarily many gene prediction files in GTF format together with a set of files of heterogeneous extrinsic evidence to produce a combined output. From the whole set of transcripts contained in the gene predictions, TSEBRA must select those that are more reliably supported by a full complement of extrinsic evidence; these transcripts constitute the output. Less reliably supported transcripts are filtered out. The rational of TSEBRA’s approach is as follows. Taking a union of gene predictions generated by two or more gene finding tools makes a set of predictions with improved sensitivity but with lower specificity. A non-trivial task is to remove some predictions and increase specificity with little decrease of sensitivity. This task is tantamount to identification of likely false positives and filtering them out. TSEBRA solves exactly this problem.

TSEBRA uses extrinsic evidence in the form of intron regions or start/stop codon positions to evaluate and filter transcripts from gene predictions. These must be provided in a GFF file that includes two attributes in the last column ’mult=’, a number specifying its multiplicity – the number of alignments that support it, and ’src=’ determining its source, e.g., ’src=P’ for evidence from a protein alignment. The mult attribute is used to specify multiplicities larger than one.

TSEBRA takes three sets of different hyperparameters from a configuration file. More precisely, it takes a weight for any evidence source, four transcript score thresholds and two low evidence support thresholds. The weights are used to compute transcript scores and the transcript score thresholds are used for comparing transcripts. The low evidence support thresholds consist of minimum fractions of intron or start/stop codon support. We recommend the application of the default hyperparameters provided in the TSEBRA configuration file to be used in a standard use case.

The workflow of TSEBRA is as follows:

  1. 1.

    Take a union of transcripts predicted by BRAKER1 and BRAKER2 while merging identical transcripts.

  2. 2.

    Compute vectors of support scores for all transcripts.

  3. 3.

    Identify all pairs of transcripts with overlapping coding regions.

  4. 4.

    Compare all pairs of overlapping transcripts by a transcript comparison rule using the extrinsic evidence and mark some of them for exclusion.

  5. 5.

    Remove all transcripts marked for exclusion by the transcript comparison rule.

  6. 6.

    Remove all transcripts with low evidence support.

  7. 7.

    Combine the remaining transcripts into a final set of predictions with groups of overlapping transcripts making sets of alternative isoforms.

The output of TSEBRA is the set of genes (with alternative isoforms) in GTF format.

In step 6, a transcript is removed if the fractions of introns and start/stop codons supported by extrinsic evidence are lower than the low evidence support thresholds. In step 7, genes are the single-linkage clusters of transcripts where two transcripts are in the same gene if they overlap (and could be alternative splice forms). Two transcripts are considered to overlap if they share at least three adjacent protein-coding nucleotides on the same strand and in the same reading frame. Note that a transcript ’marked for exclusion’ in step 4 is still compared to all overlapping transcripts and may cause removal of another transcript. This filtering step is different from a simplistic approach that would first score transcripts and then apply a fixed threshold to their score. In our approach, the transcripts with strongest local support are kept, and those that are discarded can still have strong support in absolute terms if transcripts with even stronger support overlap.

As a special case, TSEBRA may be used with a single gene prediction file to filter for the ones with the strongest evidence support. This may be useful for a genome annotation with many transcript isoforms per gene.

Transcript scores

Four transcript scores \(s_1,\ldots , s_4\) characterize the support of features of a transcript, here introns (i) or start/stop-codons (s), by all extrinsic evidence E represented by hints. A hint h is either an intron region or start/stop codon position together with an identifier of its original source \(src (h) \in O\) and its multiplicity \(mult (h) \in \{1, 2 \ldots \}\). O is a set of original sources, e.g. \(O=\{\text {P},\text {R}\}\), if protein data and RNA-seq were used, but could also contain further elements, e.g. when variants of RNA-seq sequencing technologies shall be distinguished [32]. Multiplicity \(mult (h)\) is the number of alignments from the same source that supports hint h. A hint supports a transcript feature if all identifying characteristics match, i.e. sequence name, start/stop position, feature type, and strand.

Consider a particular transcript and let F be the set of all of its features. Define \(F_{f} \subset F\) as all features in F of type \(f \in \{\text {i}, \text {s}\}\). The relative support of a transcript feature is

$$\begin{aligned} r_f := \frac{|F_f \cap E|}{|F_f|}. \end{aligned}$$

Score \(s_1 := r_\text {i}\) is the relative support of the transcript’s introns (\(f=\text {i}\)) by the evidence E and \(s_2 := r_\text {s}\) is the fraction of start/stop codons supported by E.

A weight \(w_o\) with \(o \in O\) is assigned to each evidence source. The absolute quantity of supporting hints for a transcript feature f is the weighted sum of all supporting hints:

$$\begin{aligned} a_f := \sum _{h \in E \cap F_f} w_{ src (h)} \cdot mult (h) \qquad (f \in \{\text {i}, \text {s}\}). \end{aligned}$$

The scores \(s_3:=a_i\) and \(s_4:=a_s\) measure the abundance of extrinsic evidence that support the introns or the start/stop codons of a transcript, see Fig. 1 for an example.

Fig. 1
figure 1

Example of how extrinsic evidence in form of spliced alignments from homologous proteins (blue) or RNA-seq reads (red) is used to determine scores for the support of a transcript (green). Likely exon-intron borders are inferred from the alignments to create intron hints. The start and stop codons of the protein alignments are used to create start and stop codon hints, respectively. The transcript scores utilize them to quantify the support of the transcript structure

Pairwise transcript comparison rule

The pairwise transcript comparison rule compares two transcripts with respect to their support of extrinsic evidence using the transcript scores. One or no transcript is marked for exclusion when comparing two overlapping transcripts, see Fig. 2.

Fig. 2
figure 2

Comparison rule for two transcripts using extrinsic evidence, either one or none of the transcripts is marked for removal; \(s_j\) are transcript scores and \(\epsilon _j\) are score specific thresholds with \(j \in \{1,2, \ldots , 4\}\)

The differences of all transcript scores (of the same type) are each compared to one of the score specific thresholds \(\epsilon _1, \ldots , \epsilon _4\) (one for every score type), in order from \(s_1\) to \(s_4\). When the threshold is exceeded for the first time, the comparison rule terminates and the transcript with the smaller value for the current score is marked as the transcript that will be excluded from the combined gene set. Neither transcript is marked for removal if all differences are less than or equal to the associated thresholds.

Default hyperparameters

The TSEBRA suite includes a set of default hyperparameters, which are recommended for usage in a standard use case – to combine BRAKER1 and BRAKER2 – so that users are not required to set the hyperparameters themselves. Evidence sources in a standard BRAKER1 and BRAKER2 output are: protein database (P), EST database (E), combined EST/protein database (C), and manual anchored (M). The default weights for these are \(w_P=0.1,w_E=10,w_C=5\) and \(w_M=1\). A transcript has low evidence support in this default setting if the fraction of supported introns is less than 0.75 and the supported start/stop-codon fraction is less than 1.0. The score specific thresholds are \(\epsilon _1=0,\epsilon _2=0.5,\epsilon _3=25,\epsilon _4=10\). We have shown that TSEBRA using default parameters performs with high accuracy across several species, see Results and discussion. Our tests showed that a single parameter set is sufficient for TSEBRA working with BRAKER1 and BRAKER2 across all the tested genomes, therefore, a change (training) of the set of parameters for each new genome may not be needed.

Results and discussion

We compared the accuracy of TSEBRA in two experiments. First, we compared TSEBRA to BRAKER1 and BRAKER2 in their standard use modes, and second, we compared TSEBRA with EVM.

Accuracy assessment metrics

Specificity (Sp), sensitivity (Sn), and their harmonic mean - the F1-score - were the measures of gene prediction accuracy. Accuracy values were examined at the gene, transcript, and exon levels. A predicted gene is considered correct, if it is identical to at least one annotated alternative splicing isoform. A reference transcript t is considered as correctly predicted by transcript \(t'\), if t and \(t'\) completely agree on their sets of CDS (exons). Two CDS are considered to agree if they are located in the same strand and both pairs of sequence coordinates are identical.

Comparison with BRAKER1 and BRAKER2

Complete genome annotations generated independently by BRAKER1 and BRAKER2 (both BRAKER v.2.1.5) for 11 eukaryotic species (Additional file 1: Table S1) were processed by TSEBRA with default hyperparameters. For each genome, we used its ‘standard’ annotation to compute the accuracies of the sets of gene predictions made by BRAKER1, BRAKER2, and TSEBRA. BRAKER1 was supported by extrinsic evidence in form of RNA-seq reads aligned to the genome of interest. RNA-seq hints were sampled with VARUS [33] from SRA for each genome with HISAT2 [34] as an alignment tool. BRAKER2 was supported by protein data sets selected earlier while testing BRAKER2 [18]. For each of the three model species with genome annotations curated multiple times, A. thaliana, C. elegans, and D. melanogaster, we used proteins from three sets of species varied with respect to minimal evolutionary distance to the query species. Each protein set included proteins from a large clade, e.g. Plantae, Metazoa, and Arthropoda, of the query species. The three sets per species excluded either (i) proteins from the query species itself, (ii) all species of the same family or (iii) all species of the same order. The corresponding sets of proteins could provide more or less precise evidence for gene prediction depending on the degree of saturation by closely related species. The level (i) offers the largest number of close relatives while the level (iii) provides the least number of them and the least precise evidence for a query species. We used proteins from the corresponding sets of species selected at the level (iii) for the other eight species.

TSEBRA (with default hyperparameters) had a higher accuracy than either BRAKER1 or BRAKER2 across all 11 species and nearly all test settings, see Table 1. The F1-score of TSEBRA was on average 7.78 percent points higher on gene level, 4.53 percent points higher on transcript level, and 2.06 percent points higher on CDS level than the maximum F1-score of BRAKER1 and BRAKER2. Note that for some species, the BRAKER1 F1-score was higher than the one for BRAKER2 and vice versa for other species. The directionality was strongly correlated between the CDS, transcript, and gene levels. For a user, it is difficult to figure out which mode of BRAKER would perform better for a genome of interest. Using TSEBRA is supposed to resolve this uncertainty. TSEBRA generates a higher increase in specificity than in sensitivity: on average Sn increased by 0.52 percent points for all evaluation levels while Sp increased by 8.78 percent points. This was likely caused by the setting of parameters filtering out a majority of transcripts with low support from extrinsic evidence. We found that nearly all differences in Sn and Sp are highly significant between TSEBRA and the other methods (see Additional file 1: Table S5), using McNemar’s [35] test and the chi-square test of homogeneity [36]

Table 1 F1-score on CDS, transcript, and gene level for BRAKER1 (RNA-seq hints), BRAKER2 (protein hints of type (iii)), TSEBRA_EVM, EVM using comparable evidence, and TSEBRA (default hyperparameter) with hints generated by the BRAKER runs

The number of transcripts per gene selected by TSEBRA was on average 1.07 which is at the same level as BRAKER2, and lower than the average of 1.20 observed for BRAKER1.

Comparison with EVidenceModeler

We also compared the accuracy of TSEBRA with the accuracy of EVidenceModeler (EVM, commit 68e724e from GitHub [37]) working to combine BRAKER1 and BRAKER2 predictions with heterogeneous extrinsic evidence. Comparison of TSEBRA and EVM was performed for the genomes of the model species A. thaliana, C. elegans, and D. melanogaster.

EVM takes extrinsic evidence in form of spliced alignments from assembled transcripts. This type of evidence is not produced by BRAKER1 utilizing mappings of unassembled reads. To make a comparison between EVM and TSEBRA on the same data, we produced new and comparable extrinsic evidence for EVM (i.e. spliced alignments) and TSEBRA (i.e. intron or start/stop codon hints). We used the protein alignments generated by ProtHint during the BRAKER2 run and assembled spliced alignments from the RNA-seq reads sampled by VARUS. For each locus, we selected the protein alignment produced by ProtHint with the highest DIAMOND [38] score creating a genome-wide set of protein alignments. To produce RNA-seq based hints, we reconstructed transcripts from the RNA-seq reads with Trinity (v2.12.0) [39] and applied PASA (v2.4.1) [40] to assemble and align them. The genomes were partitioned into 400,000 bp long segments with an overlap of 50,000 bp between neighboring segments employing the tools provided by EVM. For each partition, EVM was run with transcript and protein evidence from PASA and protein alignments, respectively. TSEBRA was run with the introns from both sets of alignments and the start/stop codons from the protein alignments. We refer to this particular TSEBRA run as TSEBRA_EVM.

EVM requires that a weight is assigned to each of the four input sources. For each test setting, we used 10% of the total number of partitions to decide on weights for EVM and to set hyperparameters of TSEBRA. A grid search with two iterations was employed to find good sets of parameters for each test setting. The grid had five values for each parameter of EVM and three values for each parameter of TSEBRA. We evaluated the accuracy of TSEBRA_EVM on the remaining 90% of the partitions for each test setting that were not used to set the hyperparameters. The values of the hyperparameters are available in Additional file 1: Table S3 and Additional file 1: Table S4.

We compared the accuracy of TSEBRA with EVM, one of the most cited combiner tools to date. EVM was previously used to combine BRAKER with other predictions [41], to combine BRAKER2 predictions with RNA-seq evidence [42, 43] or other RNA-seq based predictions [44], and even for combining multiple BRAKER predictions [45]. Still, it is not the most suitable task for EVM to create a BRAKER-only combination. The authors of EVM recommend the use of a set of gene predictions, usually more than two, along with extrinsic evidence, because the strength of EVM is in finding consensus among diverse sources. This is in conflict with the fact that there is no direct way for EVM to use the hints generated by BRAKER and that we were looking for a way to combine only two gene predictions. In addition, EVM reports only one transcript per gene, which limits the completeness of its annotation output in a setting with much evidence for alternative splicing.

Fig. 3
figure 3

Average gene, transcript, and CDS levels Sn and Sp for all nine combinations of three model species (A. thal, C. ele., D. mel.) and all test settings ((i), (ii), (iii)) as described for these model species in Table 1

We compared TSEBRA and EVM to address a question: which is the better method for combining BRAKER1 and BRAKER2 predictions? In a test setting with comparable extrinsic evidence, we evaluated them across three species with three different protein sets each. Both methods have successfully combined the BRAKER1 and BRAKER2 predictions into one set with increased F1-score, see Fig. 3. Still, TSEBRA_EVM had, compared to EVM, a higher accuracy on average with an average increase of the F1-score at the gene, transcript, and CDS levels of 6.12, 3.38, and 0.79 percent points, respectively. These improvements came with an overall increased Sn and Sp for TSEBRA_EVM on the transcript and gene levels. Only at the CDS level, both methods had a similar accuracy; EVM had a slightly higher Sn and TSEBRA_EVM had a higher Sp. Analogous to the first experiment, we tested the significance of differences in Sn and Sp of TSEBRA_EVM to all other methods and found that nearly all are highly significant (see Additional file 1: Table S6).

We had to carefully select the test setting since the choice of partition size made a difference to EVM when it makes models of intergenic regions. An additional difficulty was that the tools used for creating the partitions enforced hard borders, even if they split a transcript. To neutralize this issue, an overlap between partitions was used.

TSEBRA executes much faster than EVM, which constructs a complete gene model and evaluates all possible gene structures. TSEBRA, on the other hand, only evaluates the set of explicitly given transcripts as input. In our tests, we ran both methods separately in parallel on a 28 multi-core processor, the average runtime of EVM was 35.28 min and of TSEBRA_EVM 0.37 min.

Conclusions

We presented TSEBRA, a tool that selects more reliable gene predictions (transcripts) from the sets of transcripts generated independently by BRAKER1 and BRAKER2. A novel approach to transcript selection was successfully implemented. In computational experiments made with genomes of 11 diverse eukaryotic species we have shown that the set of transcripts selected by TSEBRA matched annotated genes (believed to be the true ones) with higher accuracy than both BRAKER1 and BRAKER2. Note that the combined extrinsic evidence is not used at the step of generation of gene predictions. BRAKER1 and BRAKER2 use disjoint evidence sources also for training statistical gene-finding models. A relative complementarity of the gene sets can be an advantage when they are combined subsequently. The ranking and selection of the final set of transcripts, however, does use both protein and RNA-seq evidence. This approach makes an effective use of both sources of extrinsic evidence for selection of most likely true positive transcripts from the set of candidates, the transcripts generated by BRAKER1 and BRAKER2 running in parallel.

Thus, TSEBRA makes a useful tool that with help of heterogeneous extrinsic evidence transforms the union of predictions of BRAKER1 and BRAKER2 into a set of gene predictions whose accuracy exceeds the accuracy of both BRAKER1 and BRAKER2 running separately.