Background

Recent advances in sequencing technology enable the construction of chromosomal-level assemblies for even non-model organisms. As of December 2020, genomes of 16,108 eukaryotes, 295,784 prokaryotes, 41,936 viruses, 26,079 plasmids and 17,820 organelles are sequenced and available through GenBank [1], a considerable increase over the 1,500 sequences reported two decades ago (see Additional file 1: Fig. S1). Therefore, to annotate the ever-rising number of genome sequences, annotation software applications need to be fast, accurate, and designed to handle large amounts of expression data to facilitate discovery of novel genes across different conditions [2,3,4,5]. Extensive analysis of this available data is the key to achieving exhaustive gene discovery by analyzing samples from multiple tissues and conditions, obviating the need for additional sequencing.

Genome annotation is the process of identifying transcriptionally active regions of the genome and defining gene structures. Decoding the correct structures of genes is essential since several downstream applications rely on accurate annotations: detecting interactions between proteins [6,7,8,9,10,11,12,13,14], identifying post-translational modifications [15,16,17,18,19,20,21,22,23], mining effectors [24,25,26,27,28], and determining protein structure [29,30,31,32]. Although we have seen a significant improvement in genome sequencing technology, annotation methods continue to underperform [33, 34]. Obtaining accurate gene annotations is challenging, especially in recently sequenced non-model organisms. The presence of sequences exchanged through horizontal gene transfer in such genomes and the existence of fragmented assemblies make it difficult to predict gene structures [35]. Multiple groups working on the same species have different and oftentimes conflicting annotations that are difficult to merge into a common consensus.

The early 2000s saw initial genome annotation attempts with the introduction of PASA [36], which was developed to map full-length transcripts and Expressed Sequence Tags (ESTs) in order to annotate genomes. In parallel, FGENESH [37, 38], GeneGenerator [39], mGene [40] and GeneSeqer [41] were introduced which predicted gene structures directly from genome sequence. Tools such as MAKER [42,43,44,45] and PASA [36] closely depend on pre-assembled full-length transcripts to generate annotations. ESTs and/or de novo assembled transcriptomes have been often provided as inputs to these tools to generate annotations [46,47,48,49,50,51,52]. Transcripts constructed via de novo [53,54,55,56,57] or genome-guided [58,59,60,61,62,63] approaches are sensitive to the nature of the assembler and its parameter settings. Such assemblers report sequences that are highly similar to one another, making the process of sifting the correct assemblies from artefacts difficult. This issue is moderately mitigated by BRAKER2 [64, 65], which uses read splice information instead of full-length assemblies to predict gene structures and has been shown to perform better than de novo approaches [66]. BRAKER2 entails a round of unsupervised gene predictions using GeneMark-ET [67] generating ab-initio gene predictions followed by a second round of training by AUGUSTUS [68] using a subset of the gene models created by GeneMark-ET [64]. All variations of MAKER (MAKER, MAKER2 and MAKER-P) use a combination of AUGUSTUS [68] and SNAP [69] to generate gene predictions. Unlike BRAKER2 or PASA, users need to run MAKER for multiple rounds to improve annotation. With no standard technique to optimize the number of rounds, users often undertake a trial-and-error approach to decide what data is supplied to MAKER in each execution round. These unguided choices can create different annotations based on the same data sets. Thus, current approaches report either incomplete genes and/or derive annotations that are missing alternatively spliced transcripts. In addition to MAKER, BRAKER, PASA there is another gene annotator -GeMoMa [70] that use protein coding genes from a reference genome and transfers those to the target genome. Since it does not optimally use information from expression data, it has not been compared with the other gene annotators discussed in this manuscript.

To overcome the drawbacks described above, we developed FINDER, a new, automated annotation pipeline that downloads RNA-Seq data from NCBI SRA [71], conducts genome-guided assembly of short reads, predicts gene structure, and annotates genes. FINDER annotates both untranslated and coding regions of genes, categorizes transcripts based on the tissue/conditions where they are expressed, and outputs a complete set of alternatively spliced transcripts. FINDER analyzes the spatial expression profile of each transcript to redefine its boundaries and/or even create newer transcripts and employs an optimized strategy to locate transcripts housing micro-exons. Finally, gene models predicted by BRAKER2 are incorporated into the annotation along with assemblies generated by PsiCLASS [63]. We show that FINDER outperforms state-of-the-art annotation tools in constructing accurate gene structures, when executed with the same expression data.

Implementation

The detailed workflow of FINDER is outlined in Fig. 1. The pipeline accepts metadata via a comma-separated values (csv) file (see Additional file 2: Table S1). Users can verify the input data using the `verifyInputsToFINDER` utility (Please check Sect. 1.5.1 of Additional file 9). Both single-end and paired-end data are accepted. The pipeline automatically downloads RNA-Seq data from NCBI SRA or the samples can be accessed locally. Multiple rounds of alignment are conducted using STAR [72, 73] with short reads, thus ensuring the capture of tissue-specific splice junctions and ultimately generates the most comprehensive set of alternatively spliced transcripts. FINDER uses PsiCLASS [63] to generate transcripts both at the tissue level and consolidates them to produce a consensus annotation. It employs change-point detection (CPD) using coverage data to polish intron/exon boundaries if needed. Polished transcripts are then supplied to GeneMarkS-T [74] to predict protein coding regions. In addition to constructing genes from expression data, FINDER uses BRAKER2 [65] to predict genes de novo. Finally, gene models are assigned scores that reflect the confidence of prediction and evidence across different data sets. Throughout the pipeline run, intermediate temporary data is removed to optimize space usage. Proper logging of executions is implemented through ruffus [75].

Fig. 1
figure 1

FINDER workflow. FINDER assembles short reads from RNA-Seq expression data, collected from multiple tissues and conditions, to generate full-length transcripts using PsiCLASS. Short read coverage profile is used to polish the structure of the transcripts to enhance the quality of annotation. GeneMarkS-T is used to predict coding regions of the transcripts. Gene models predicted by BRAKER2 and models obtained by mapping proteins are added to the gene models constructed from RNA-Seq data. Additionally, FINDER outputs the tissues where each transcript is expressed allowing users to work with tissue-specific transcripts. FINDER categorizes transcripts into two confidence levels depending on the available supporting evidence and depth of coverage. (Generated using Microsoft PowerPoint v16.47)

Read alignments to the genome

Reads from each sample are aligned to the genome using STAR [73]. FINDER accepts the location of the genomic STAR indices. If indices are not provided, then FINDER will generate them locally. FINDER implements multiple strategies to detect as many correct splice-junctions as possible. Several studies use a multi-step approach where splice junctions are detected in the first pass and then those junctions are used to guide the alignments in future passes [76, 77]. FINDER employs a similar strategy to align reads and obtain the most confident splice junctions in each tissue type and/or condition by conducting mapping in four passes (Please check Sect. 1.3 of Additional file 9 for more details).

Annotating transcripts with micro-exons

Certain genes in eukaryotes have micro-exons (i.e., exons with fewer than 50 nucleotides) [78,79,80,81] which impart important biological properties both in plants [82,83,84,85,86] and animals [87,88,89,90,91]. FINDER uses OLego [92] to map the reads which were reported unmapped by STAR, because OLego optimizes micro-exon sensitivity by checking intron signatures when no hits of seed sequences (~ 14 nt) are found. It is configured to align reads to exons of minimum length 2, with a minimum and maximum intron size of 20 and 10 K respectively.

Generating exon-exon transcript structure annotation with PsiCLASS

Alignments reported by STAR and OLego are combined and provided as input to PsiCLASS [63]. Unlike traditional assemblers, PsiCLASS accepts alignments from multiple samples at the same time. It generates annotations for each sample and one consolidated gene annotation for all the samples. FINDER runs PsiCLASS with the—bamGroup option enabled which instructs PsiCLASS to preserve tissue/condition specific features. It is a fast meta-assembler generating 350 samples of output in less than three hours while running on 30 cores and consumes less than 50 GB of memory.

Polishing gene structures to optimize gene discovery

Gene structure annotations reported by PsiCLASS were polished to generate the best assemblies. Annotations generated by assemblers often have three kinds of errors that impact accuracy: (1) presence of redundant transcripts that are proper subsets of other transcripts, (2) multiple transcripts on the same strand merged into one, and (3) transcripts with ill-defined exon boundaries. Most assemblers ignore such cases to boost the speed of operation. Developing solutions to deal with these kinds of errors increases the number of correct structural annotations thereby improving downstream analysis.

FINDER uses different algorithmic and statistical approaches to deal with the above cases. To eliminate redundant transcripts, exon–intron structure of all transcripts is compared with each other to retain only unique transcripts. Even though eukaryotes possess large genomes, certain genes/transcripts are closely packed and are overlapping (Fig. 2). Reads originating from one of those genes often map to nearby overlapping genes making the task of distinctly recognizing the transcripts very challenging.

Fig. 2
figure 2

FINDER implements changepoint analysis of read coverages to modify existing gene models and/or generate new ones. Changepoint analysis is a statistical technique to assess alterations in trends over time. The same approach has been used to analyze read coverage patterns of a genome, where the data is distributed spatially. a Two Arabidopsis thaliana genes AT1G42960.1 and AT1G42970.1 are present within 50 base pairs of each other on the positive strand. Reads originating from the end exons of either genes bleed into each other resulting in PsiCLASS to merge the two gene models. Changepoint analysis recognizes the fall the read coverage and reports a position within the exon where the trough exists. This information is used to split up the gene models. b A similar issue exists with closely spaced genes residing on opposite strands. The end exons (highlighted with a red box) for a transcript extend up to the nearest intron of the adjacent transcript. Changepoint analysis is used to determine the actual end/start of transcript based on the read coverage. (Screenshot obtained from Integrative Genomics Viewer and figure generated using Microsoft PowerPoint v16.47)

FINDER is configured to use changepoint detection (CPD) analysis to detect the descent in read coverage at the junction of two overlapping transcripts. Statistical CPD is a procedure to detect changes in the probability distribution of a stochastic process. Typically, CPD is widely used to detect changes in time series [93,94,95,96,97], but can be extended to other applications as well [98, 99]. We have found that even though CPD was developed under the assumption of normality, it can also be used where normality is violated.

In the first step in FINDER’s CPD, short read alignments to the genome are converted into number of read counts per nucleotide using bedtools [100]. A custom python script is used to transfer the per nucleotide coverage data from the genome to the transcriptome reported by PsiCLASS. Each internal exon is considered as a potential site for the presence of changepoints if there exist premature stop codons in all the three frame translations. CPD only considers exons that have a high chance of housing a changepoint, thereby reducing duration of operation. The coverage pattern of each exon is probed to detect changepoints. The data has been modeled using an exponential distribution, and binary segmentation has been used to determines the changepoints in the exonic coverage using the ‘changepoints’ package [101]. Read coverage of exons mimics a time series where each nucleotide position of an exon can be assumed to be a single unit of time. Coverage patterns of exons, suspected to be merged, contain a characteristic depression in the signal to split the gene models (Fig. 2a). Overlapping transcripts on opposite strands sometimes share a common exon (Fig. 2b). This negatively impacts precision since the boundaries of the predicted transcript exceed the boundaries of the transcript in the reference annotation. FINDER trims the transcript boundaries, using the changepoints, to better model the RNA-Seq coverage (Fig. 2b). These strategies improve the annotation by increasing the transcript F1 scores (Table 1).

Table 1 Sensitivity, Specificity and F1 scores of transcripts generated by multiple gene annotation pipelines for three model organisms—Arabidopsis thaliana, Oryza sativa and Zea mays

De novo gene prediction from expression data and proteins from closely related species

Certain genes are expressed only under specific tissues and conditions [102]. However, constructing an exhaustive set of genes expressed across all possible tissues and conditions is a daunting task due to the mammoth volume of potential expression data. Hence, approaches that can predict structures of unknown genes using information obtained from known genes are needed. Within the FINDER framework, we used BRAKER2 [103] to predict the structure of protein coding genes. The pipeline is provided with alignment files generated by STAR and an optional, user-provided protein data file. If the previous execution fails, a second execution of BRAKER2 is launched without protein information. Genes predicted by BRAKER2 are compared to the genes obtained from expression data. To prevent too many false positives, predictions made by BRAKER2 are considered high confidence, only if those are supported by expression level or protein level evidence.

In addition to RNA-Seq data, FINDER also uses protein data (when provided), in two ways (1) to assess the veracity of the transcript models generated by BRAKER2, and (2) to align those proteins not recognized by BRAKER2 or PsiCLASS. Protein coding genes obtained from expression data and predicted by BRAKER2 are BLASTed [104] to the protein set provided by the user. Proteins not encountering any hits are aligned to the genome using exonerate [105] with a minimum threshold of 90% similarity. These alignments are augmented to the final set of gene predictions. Since these transcripts are obtained solely from proteins, they lack UTR sequences.

Prediction of coding regions

We leveraged GeneMarkS-T [74] to predict protein-coding regions of genes constructed from expression data. GTF files are first converted to FASTA files using the provided genome. Those FASTA files are supplied to GeneMarkS-T as inputs. GeneMarkS-T outputs coding sequence for the transcripts. CDS annotations are incorporated into the final GTF file by converting the transcriptomic coordinates to genomic coordinates.

Tissue/condition specific transcripts/gene models

Most eukaryotic genes have multiple isoforms which are derived from alternative transcripts. Expression of different transcripts can occur under different conditions in different tissues at different time points. FINDER compares assembled transcripts from each condition and prints out an association between each transcript and the provided tissue/condition (Additional file 9: Sect. 1.5).

Scoring gene models

FINDER groups genes into multiple categories based on supporting evidence. Genes that are expressed in RNA-Seq datasets, predicted by BRAKER2, and have protein evidence, are put into the high-confidence gene set. BRAKER2-predicted genes with no evidence of expression and/or proteins are treated as low confidence genes. FINDER expects a soft masked genome since it is a BRAKER2 requirement. Genes which are located in the repeat regions are marked as such and moved to the set of low-confidence genes.

Results and discussion

Choice of species for comparison

We tested the performance of FINDER primarily on three well-annotated plant organisms—Arabidopsis thaliana [106], Oryza sativa [107,108,109] and Zea mays [110, 111]. The genomes assemblies of these model organisms have been frequently updated and are almost complete with telomere-to-telomere sequences with fewer gaps and unknown nucleotides. In addition, their gene annotations have undergone regular improvement by mining the large number of RNA-Seq datasets available in the literature. Also, The Arabidopsis Information Resource (TAIR) provides a five-star rating system based on available evidence for each gene. Such a system offers a platform to test the quality of gene annotation software. For further evaluation, and to ensure that FINDER is able to annotate a wider range of genome types, we selected the following additional species to test: Caenorhabditis elegans [112], Drosophila melanogaster [113, 114], Homo sapiens [115, 116], Hordeum vulgare [117], and Triticum aestivum [117,118,119,120]). The genomes of these species range from small (C. elegans, D. melanogaster, A. thaliana), medium (O. sativa), to large (H. sapiens, Z. mays, H. vulgare, and T. aestivum). Finally, we evaluated FINDER on three different versions of Z. mays annotations—RefSeq [121], AGPv3 [111, 122] and AGPv4 [110, 123].

Metrics to assess quality of annotation

We used four metrics to compare the quality of annotations generated by each pipeline: (1) Annotation Edit Distance (AED) [42, 43, 124], (2) sensitivity, (3) specificity, and (4) F1 score. Although these metrics could be computed both at the nucleotide- and exon-level we chose to make comparisons at the transcript level since it encompasses bases, exons, and introns. An AED score of 0 indicates complete agreement of the predicted annotation with the reference, and a score of 1 denotes that the reference has not been identified in the annotation. A transcript is considered to be “recognized” only when all its intron definitions agree with at least one transcript from the predicted set. We used the Mikado “compare” utility to compare the predictions with the reference annotations [125]. A highly sensitive annotation is one that can correctly recognize more reference transcripts. A set of annotations has high specificity when it reports minimal incorrect transcripts. For an annotation to be of good quality, both sensitivity and specificity should be high. A balanced metric is the F1 score which is the harmonic mean of sensitivity and specificity. While AED provides a good numeric assessment of how well the ground truth evidence is represented in an annotation, when individually used, it fails to capture the extent to which false positives are reported. Hence, F1 score complements AED since it incorporates both specificity and sensitivity. For evaluation purposes, we assume that the annotations achieved through community efforts are the ground truth and contain no errors.

FINDER generates more accurate gene models than BRAKER2, MAKER2 and PASA

FINDER leverages expression data to construct transcript models and employs statistical changepoint detection to enhance their structures (see “Implementation” section). Both MAKER2 and PASA were run with transcript sequences reported by PsiCLASS.

To assess FINDER’s performance, we compared the AED scores of transcript models generated by FINDER with those generated by other commonly used annotation methods. As shown in Fig. 3a, d, g, the violin plots for FINDER are broader at the base, indicating a greater number of transcripts with lower AED scores as compared to BRAKER2, MAKER, and PASA. We compared the FINDER AED scores with the AED scores reported by other pipelines using Wilcoxon’s signed rank test (More details in Additional file 9: Sect. 2.5). For all organisms (Fig. 3, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2), the AED scores reported by FINDER were significantly lower (p_value < 0.01) than that of any other pipeline. Figure 3c, f, i, shows a stacked bar plot to represent the fraction of transcripts in each category of AED values. In all the cases, a higher percentage of transcripts reported by FINDER have lower AED scores (Additional file 1: Figs. S2–S5). This indicates that FINDER is capable of constructing gene structures that better comply with the reference annotations.

Fig. 3
figure 3

Comparison of performance of predicted annotations in three model species—ac A. thaliana, df O. sativa and gi Z. mays. Annotation Edit Distance (AED) is an assessment of how well predicted annotations agree with the evidence and was used as a quality control metric. A value of 0 denotes complete agreement of two annotations while a value of 1 denotes that the ‘gold standard’ reference annotation was not detected. Transcripts from ‘gold standard’ reference annotations that are not detected in any of the predicted annotations are removed from analysis. a, d, g Distribution of AED scores. Violin plots wider at the base indicate high density of annotations with lower AED. FINDER was able to create gene models having lowest AED resulting in a wide base. Gene models generated by FINDER were enhanced by adding predictions made by BRAKER and including protein evidence. Wilcoxon’s signed rank test was used to compare the AED scores between FINDER and other annotating pipelines. The “***” symbol implies that the AED scores of FINDER gene models were significantly lesser (p_value < 0.01) than the AED scores of the gene models reported by other pipelines. b, e, h Bar plot of F1 score of multiple approaches of annotation. Having a high nucleotide F1 (Base F1) or a high exon F1 score is not sufficient to conclude a good annotation. High value of transcript F1 score is indicative of good gene models with high sensitivity and high specificity. c, f, i Stacked bar plot showing percentage of transcripts in each of the four groups of AEDs. Higher number of transcripts to low AED denotes better annotation. In each of the three species, FINDER was able to generate a higher percentage of transcripts with low AED compared to other techniques of annotation. (Generated using ggplot2 v3.3.3)

High-quality exhaustive annotations predict the fewest false positives thereby boosting the transcript F1 score. The transcript F1 scores of the gene models that were reported by FINDER for A. thaliana, O. sativa and Z. mays were higher than the models generated by BRAKER2, MAKER, and PASA (Fig. 3b, e, h). This same trend is observed for other tested organisms where FINDER was successful in detecting nucleotides, exons, introns, transcripts and genes (Table 1, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2). MAKER2 and BRAKER2 registered a high specificity for most of the organisms because fewer transcripts were reported than FINDER. MAKER2 and BRAKER2 also had lower F1 scores, indicating less sensitivity than FINDER. Additionally, we compared the CDS regions of genes reported by FINDER with those of BRAKER2. For most of the organisms, FINDER generated transcript models with a higher F1 score (Additional file 4: Table S3). These results show that the better performance of FINDER is ensured not only due to the presence of UTRs but also due to enhanced CDS structure of gene models.

Finally, including BRAKER2 predictions and protein sequences to FINDER enhanced the gene model predictions. About 15% of the gene models reported by BRAKER2, those having high sequence similarity with the provided protein sequences were included in the final annotations (Table 2). As shown in Table 1 and Additional file 5: Table S4, including evidence at the protein level led to the identification of more genes.

Table 2 Improvement in overall gene recognition by adding gene models predicted by BRAKER2 and aligning protein sequences

Unlike BRAKER2, FINDER does not assume a homogeneous nucleotide composition of the genome [103]. FINDER outperforms BRAKER2 while constructing gene models in complex organisms like H. sapiens, H. vulgare, and Z. mays since assemblers generating transcriptomes from alignments do not require a genome to possess homogeneous nucleotide composition.

FINDER in itself is restricted to annotate genes only in regions of the genome that are transcriptionally active. Recognizing that BRAKER2, being a gene predictor, can construct gene models in transcriptionally silent regions of the genome, FINDER is designed to incorporate the gene models predicted by BRAKER2 into the final annotations.

Distinct gene groups are accurately annotated with FINDER

Although eukaryotic genes differ from one another in terms of location, structure and the isoforms they encode, most annotation pipelines annotate and evaluate gene predictions with a global and uniform approach. The problem arises when these variances prompt each pipeline to perform differently on dissimilar groups of genes. To avoid this pitfall, we created groups of genes and transcripts based on various criteria (Table 3) and compared the performance of FINDER with BRAKER2, MAKER, and PASA for each of these sets.

Table 3 Classification of gene models into different groups based on their relative location to other genes, number of isoforms and other criteria

On the set of UTR-containing transcripts, FINDER reported the best transcript F1 scores (Fig. 4, Additional file 1: Figs. S6, S7). Unlike BRAKER2, FINDER uses GeneMark S/T to predict CDS from the transcript sequences assembled by PsiCLASS and can hence annotate UTR regions. For most of the organisms, BRAKER2 and MAKER2 gene models register a low transcript F1 score in this category of genes. Next, we tested the performance of the annotation pipelines on transcripts that are closely located in the genome. On this set of transcripts, FINDER reported the best F1 transcript score for A. thaliana, O. sativa, and Z. mays (Fig. 4), and comparable scores for D. melanogaster (Additional file 1: Fig. S6), H. vulgare (Additional file 1: Fig. S8), and C. elegans (Additional file 1: Fig. S7) with BRAKER2. Most eukaryotic genes have multiple isoforms which differ from one another by their exon–intron definition. Splice sites and coverage information provides clues to construct such alternatively spliced transcripts. We selected genes with more than one transcript to check how well each annotation pipeline was able to detect transcript isoforms. For this case, FINDER was able to generate the best transcript structures with the highest transcript F1 score among all the pipelines gene annotation software applications (Fig. 4 and Additional file 1: Figs. S6–S9). Surprisingly, BRAKER2 fared poorly in this category despite training with all the detected splice sites from RNA-Seq data. This demonstrates that FINDER is capable of leveraging both intron splice sites and read coverages to report best transcript structures. For H. sapiens, PASA was able to generate the best transcript structures across all categories of transcripts. Adding transcripts from BRAKER2 and protein evidence improved the transcript F1 score for all the organisms, signifying the importance of incorporating de novo gene models and protein evidence.

Fig. 4
figure 4

FINDER versus other pipelines on different groups of genes in three model species—a A. thaliana, b O. sativa, c Z. mays. F1 score is the harmonic mean between sensitivity and specificity. Higher F1 score indicates better agreement with the reference transcript models. We created groups of transcripts that have similar characteristics as shown in the y-axis legend. A pool of transcripts was created containing multi-exonic transcript predictions, from each pipeline, that has a complete intron chain match with at least one reference annotation. Mono exonic transcripts were considered if at least 80% of the nucleotides overlap with one reference annotation. Transcript F1 scores, for each of the annotation pipelines, have been plotted as a bar graph. Even though all annotation pipelines are designed to serve the same purpose of annotating genomes, each pipeline adopts a different strategy. Each strategy has its own merits and demerits that lead to better annotation of a certain category of genes. This plot helps understand the performance of each annotation pipeline on different categories. The symbol “#” denotes the best annotator in each gene group. (Generated using ggplot2 v3.3.3)

BRAKER2 generated the best transcript annotation for the set of transcripts with a single exon (Fig. 4a, b and Additional file 1: Figs. S6–S9). Such transcripts, devoid of any introns, are difficult to construct from RNA-Seq alone. Also, the direction of the splice sites infers the direction of a transcript. Without any introns, such a single-exon transcript has to be probed for a CDS sequences' presence to infer directionality. BRAKER2 was configured to optimally predict only CDS regions of genes, hence, it performs well with the set of transcripts that have missing UTRs for organisms with small and moderate sized genomes (Fig. 4a, b and Additional file 1: Fig. S6–S9). The average number of transcripts per gene reported by BRAKER2 is lower than FINDER. While this boosts specificity, it compromises recall since BRAKER2 is not sensitive to detecting alternatively spliced transcripts. Hence, BRAKER2 accomplishes the best F1 score when tested on a set of single-transcript genes but performs poorly on a set of multi-transcript genes (Fig. 4a, b and Additional file 1: Figs. S6–S9).

Performance comparison on TAIR’s 5-star System

In order to assess the performance of the annotation pipelines on groups of genes constructed from varying levels of evidence, we used the TAIR10 5-star system. TAIR associates a quality score to each A. thaliana transcript based on the evidence used to construct the models, with five stars designating the best evidence and zero stars the least [126]. The three categories with limited evidence (< 3 stars) have fewer than 3,000 transcripts each. BRAKER2′s performance, on the genes in these three categories, was slightly better than the rest of the annotation pipelines (Fig. 5). The other two categories (five star and four star) have 9,067 and 18,374 transcripts respectively. In both of these categories, FINDER was able to detect more transcripts than any other annotation pipeline. 51.5% and 86.4% of genes in the 5-star and 4-star category respectively were multi-exonic. In both these categories, FINDER correctly constructed more gene models compared to any other annotation pipeline (Fig. 5). FINDER reported 80% of the gene models belonging to the 4-star category—18% more than BRAKER2 (Fig. 5). Hence, it is evident from this analysis that FINDER can reconstruct the structures of most of the genes that are well-supported by underlying evidence.

Fig. 5
figure 5

Performance of annotation pipelines on gene groups of Arabidopsis thaliana generated by TAIR10. The Arabidopsis Information Resource (TAIR) group has created a quality ranking system to indicate the level of confidence in an annotated gene/transcript. The ranking system has five levels (denoted by stars). Higher number of stars denote the availability of more information to generate the gene structure. Here we display the percentage of transcripts in each category that was identified by a particular annotation pipeline. A high percentage of identified transcripts indicate higher sensitivity and hence a better annotation. The number below each legend in the x-axis denote the number of genes in that respective group. The “#” denotes the predictor which detected the maximum number of transcripts within each group. (Generated using ggplot2 v3.3.3)

Improving transcript annotations using changepoint analysis

The co-location of multiple overlapping genes on the genome strands makes it difficult to correctly annotate their structures (see “Polishing gene structures to optimize gene discovery” section). FINDER employs changepoint detection (CPD) [101] to split the merged transcripts reported by PsiCLASS (Fig. 2). To gauge the magnitude of improvement in transcript structures brought about by the application of CPD, we compared the accuracy of the predicted transcriptome before and after implementing CPD based on read coverage. As shown in Table 4 and Additional file 6: Table S5, implementing the CPD improved both specificity and sensitivity in organisms with small or medium-sized genomes. In A. thaliana, the transcript F1 scores increased from 40.78 to 45.95 (Table 4 and Additional file 6: Table S5) and in C. elegans it increased from 40 to 50. In large genomes, the improvement was not as significant, mainly because there are only a few genes that overlap with one another.

Table 4 Comparison of specificity, sensitivity and F1 scores of transcripts assemblies generated by Strawberry, Scallop, Stringtie, PsiCLASS and FINDER for three model organisms—A. thaliana, O. sativa and Z. mays

PsiCLASS meta-assembly works better than other approaches

We explored three popularly used software applications for merging transcriptome assemblies—StringTie-merge [77, 127,128,129,130,131,132,133], TACO [134,135,136,137,138,139] and Cuffmerge [140,141,142,143,144,145] to combine 116 A. thaliana assemblies constructed by StringTie [59], Scallop [61] and Strawberry [60] (Please check Sect. 3 of Additional file 9 for more details). The best assembly was reported by StringTie-merge and was hence used for all other organisms. We compared the accuracy of the consensus transcript models generated by StringTie-merge with the transcript models reported by PsiCLASS [63]. As depicted in Table 4 and Additional file 6: Table S5, PsiCLASS generated the best transcript models for all organisms registering the highest transcript F1 score improving upon the StringTie models by up to 15%. Hence, FINDER uses only PsiCLASS to generate assemblies from short-read data.

Impact of missing untranslated region on annotation of transcripts

Gene transcription is triggered by adherence of a transcription factor in the promoter region of a gene. Promoters are typically located within 1,000 bp upstream of a gene’s transcription start site (TSS) [146,147,148]. Determining the TSS from sequencing data is best facilitated by RAMPAGE [149, 150] or CAGE-Seq [151], but this data is usually unavailable due to constraints imposed by cost and time. Nevertheless, a good estimate can be obtained from RNA-Seq data by assuming the start coordinates of the assembled genes as the TSS. Thus, researchers often localize their investigation to a section 500–1000 bp upstream of the assumed TSS [152, 153]. Without 5′ UTR annotation it is impossible to deduce a good approximation of the TSS. This leads to conducting promoter mining in a completely incorrect genome location. To assess the quality of 5′ UTR annotation, we plotted the difference of TSS between the reference genes and the genes reported by BRAKER2 and FINDER using a violin plot (Fig. 6). Further, we applied Wilcoxon’s rank-sum test and found that the TSS distances reported by FINDER were significantly less than that of BRAKER2 for A. thaliana and Z. mays. Interestingly, for O. sativa, BRAKER2 generated better gene models for more transcripts. Over 25% of reference gene models in O. sativa have no UTRs annotated which is higher compared to 15% UTR-less gene models in A. thaliana and Z. mays. This result illustrates that more FINDER transcripts have a TSS closer to the evidence as compared to the TSS of the transcripts reported by BRAKER2. This is an expected result since BRAKER2 was configured to annotate only CDS regions of transcripts. Table 5 highlights the number of transcripts that have better agreement with the reference TSS for FINDER and BRAKER2.

Fig. 6
figure 6

Comparison of distance between transcription start sites of gene models predicted by BRAKER2 and FINDER. Violin plots of the distribution of the distance between the actual transcription start site (TSS) and the predicted transcription start site. In a set of well annotation complete gene structures, a higher fraction of genes is expected to have low deviation from the actual TSS. We considered genes that were reported in either BRAKER or FINDER for this analysis. Wilcoxon’s rank sum test was used to compare the TSS distances between FINDER and BRAKER2. The “***” symbol implies that TSS distance for FINDER gene models was significantly less than BRAKER2 gene models. (Generated using ggplot2 v3.3.3)

Table 5 Use of RNA-Seq evidence to improve annotation of untranslated regions to aid in promoter mining and epigenetic studies

Enhancing ground truth annotations by extending untranslated regions

Official annotations of several model organisms, used as ground truth for this study, contain transcripts with missing UTR sequences. Even though UTRs do not code for proteins, they are relevant segments of a transcript involved in several important biological processes like mRNA translation [154,155,156], regulation of expression [157,158,159,160,161]] and a number of diseases [162,163,164,165,166]. In the A. thaliana TAIR10 annotations, there are 7,888 transcripts missing either UTR; 50% of these had a rating below 2 stars.

PacBio (Menlo Park, CA) offers long-read sequencing that contain both CDS and UTRs. Therefore, we used the PacBio annotations instead of the incomplete TAIR10 transcripts to assess FINDER’s performance on transcripts that were missing UTRs (Please refer to Sect. 2.6 in Additional file 9 for more details). Out of the 7,888 TAIR10 transcripts with missing UTRs, 113 transcripts were found both in the PacBio data and the 116 short-read RNA-Seq samples. We compared the FINDER annotations against these 113 transcripts. FINDER annotations were able to recall 91.55% of the nucleotides in 113 transcripts of TAIR10 and 97.86% of PacBio transcripts. The specificity of the FINDER annotations is markedly higher with PacBio transcripts (79.67%) compared to TAIR10 transcripts (72.14%). This demonstrates that FINDER enhances and improves upon the existing annotation.

The TRITEX H. vulgare annotation (Morex version r2) [117], released by the International Barley Sequencing Consortium (IBSC), is devoid of UTRs. We used FINDER to update and enrich the existing annotations by flanking the CDS region with UTRs on both sides. To verify the accuracy of the gene models reported by FINDER, we used PacBio full-length mRNA sequences derived from a time course of powdery mildew infected barley leaf tissue [167, 168]. A total of 7,352 gene models from IBSC, FINDER, and PacBio had a complete intron-chain match with each other. The gene structures for more than 93% (6,886 out of 7,352) of the FINDER models were improved when compared to PacBio full-length sequences (Additional file 7: Table S6). The highest F1 score achieved was 87.16. This shows that FINDER is capable of constructing accurate gene structures constituting both CDS and UTRs.

Evaluating performance with different annotations of Zea mays

Z. mays is an important model organism for crops and has been one of the most studied plants for genetics by researchers in several different fields [169,170,171,172]. Genes have been annotated in multiple ways using different kinds of data, resulting in substantial differences in gene structures [122]. Here we compare three alternative annotation sets of Z. mays—RefSeq, AGPv3, and AGPv4 and the performance of FINDER surpassed all three approaches. The transcript F1 score for FINDER gene models compared against the NCBI gene models were 43.48, whereas the F1 scores for AGPv3 and AGPv4 were 26.69 and 22.51 respectively. We observed the same trend for other annotation pipelines and reported a higher transcript F1 score for NCBI than the AGP annotations (Table 1 and Additional file 3: Table S2). Hence, FINDER generated high-quality gene structures with high transcript F1 scores for different Z. mays annotations.

Evaluating FINDER on different clades reported by Phylostratr

Genes in each organism can be categorized by their evolutionary history [173, 174]. We used Phylostratr [175] to classify genes into evolutionary strata. Here we present our results on the three model organisms—A. thaliana, O. sativa, and Z. mays. For all three, FINDER was able to accurately detect more genes in highly populated strata (Fig. 7). The performance of FINDER and PASA was comparable in strata with few genes. It was surprising to note that BRAKER2 was unable to identify highly conserved genes (those from the “cellular organisms” strata) since those would be easier to predict than organism specific genes. This demonstrates that FINDER is capable of effectively constructing genes from different evolutionary backgrounds.

Fig. 7
figure 7

Assessment of annotation pipelines on genes from each phylostrata—Genes from three model species—a Arabidopsis thaliana, b Oryza sativa and c Zea mays, were allocated into evolutionary classes using Phylostratr. The number of genes correctly constructed by each pipeline was computed and plotted as a bar graph. Numbers below each stratum indicate the number of genes allocated to that strata. Strata having genes fewer than 500 are not shown in the graph. (Generated using ggplot2 v3.3.3)

FINDER constructs gene models for polyploid genomes

Being a general-purpose genome annotator, in addition to diploid organisms, FINDER can annotate the genomes of polyploid organisms. We generated gene structures of Triticum aestivum, a hexaploid with 120,744 annotated genes and 146,597 transcripts [117]. FINDER was able to detect 48,129 transcripts (39.9%). Out of the 130,582 transcripts predicted by FINDER, 48,104 (36.83%) matched perfectly with at least one reference annotation.

Conclusion

Identifying genes on chromosomes and deducing their structures from a plethora of evidence has been undertaken in multiple ways, with each method having advantages and disadvantages. Herein, we propose FINDER—an entirely automated, general-purpose pipeline to annotate genes in eukaryotic genomes. FINDER (1) implements an optimized mapping strategy that reduces the number of spurious mappings, (2) produces complete full-length transcripts comprising UTRs while identifying transcripts with micro-exons, (3) employs statistical CPD to modify gene boundaries and construct new genes, (4) reports more alternatively spliced transcripts as compared to other state-of-the-art annotation pipelines, and (5) assigns confidence classes to each transcript based on the evidence(s) that were used to construct those.

While FINDER’s performance has been superior to other gene annotation softwares, all the gene models reported by FINDER are predicted. Hence, a validation is necessary to ensure false positives are detected and removed. Also, future versions of FINDER will offer functionalities to leverage data from CAGE-Seq and Ribo-Seq to better annotate transcription start site and translation start sites respectively.

With a wide variety of available data for annotation, researchers often struggle to manage and optimize their usage. Several gene annotation software also offer users complicated configurations without providing substantial guidance. FINDER makes the job of gene annotation easy for bench scientists by automating the entire process from RNA-Seq data processing to gene prediction. Since FINDER does not assume the ploidy or the nucleotide composition of a genome, it could be applied to derive gene structures for a wide range of species, including non-model organisms. FINDER constructs gene models primarily from RNA-Seq data and is therefore capable of constructing tissue- and/or condition- specific isoforms which would have been impossible to obtain from ESTs only. FINDER supersedes the performance of existing software applications by utilizing read coverage information to fine-tune gene model boundaries. Instead of removing low-quality transcripts, FINDER flags them as low confidence—giving users the choice of using them as they seem fit. As a proof of concept, we provided evidence that using read coverage signal indeed enhances gene structures in a diverse set of organisms. Thus, we are confident that FINDER will pave the way for improved gene structure annotation in the future.

Availability and requirements

Project name: FINDER.

Project home page: https://github.com/sagnikbanerjee15/Finder.

Operating system(s): Linux, MacOS.

Programming language: Python, C, C++, Perl, Shell.

License: MIT.

Other software requirements: All software requirements are listed in https://github.com/sagnikbanerjee15/Finder/blob/master/environment.yml.

Any restrictions to use by non-academics: MIT licensing restrictions apply.