Comparison of gene models using the UCSCtrackCompare program
To compare the performance of all programs at reproducing the rich Gencode annotation, we wrote a standalone program, UCSCtrackCompare, available in source form in Additional data file 1. The program compares, in chosen genomic regions, a UCSC track or a collection of tracks taken as reference (usually Gencode) against any number of selected tracks. It produces, in about 10 minutes, the data supporting the analysis presented here; for a more detailed comparison of transcripts see Additional data file 2 and for coding regions see Additional data file 3. The direct outputs of UCSCtrackCompare were pasted into an Excel document (Additional data file 4). For the analyses of coding regions, our results lie within 0.2% of those obtained by Guigo et al. (their Tables 4 and 5). However, our mRNA comparisons show uneven discrepancies with theirs (their Table 6; see Additional data file 2.1) that we cannot rationalize and that reach up to 8% in sensitivity and 13% in specificity. Methods that show an advantage in their evaluation include Ensembl, Exogean and Pairagon, and methods that show a disadvantage include AceView, ECgene, SGP2 and eight others. Yet, the general ordering of the methods is consistent across the two evaluations.
On 14 December 2005, we downloaded from UCSC all tracks with gene models in the 31 ENCODE test regions (see Additional data file 2.1). We selected for comparison to Gencode (October 2005 freeze) 14 non-redundant EGASP tracks released before the solutions, and 10 genome-wide tracks, distinguished on all diagrams by addition of an asterisk in front of the track name. Nine tracks, labeled P in front of their names, predict protein-only models by using ab initio methods, often integrating evolutionary sequence conservation; all others also use, or only use, mRNA and/or EST evidence. Statistics of the 25 tracks and hints on their inputs are summarized in Table 1.
Since the UCSC files give directly the exons of all tracks in a uniform chromosome-based coordinate system, the comparison of coordinates is straightforward and easy to duplicate (Additional data file 1). The choices offered in our program are either to decompose the models into their elements, exons, introns and nucleotides, or to consider them in their entirety; and then either to count each element with given genomic coordinates only once, thereby providing a rationalized projected measure of the unique elements (as was done in ), or to count each element as many times as it occurs in the alternative variants, providing a quantitative appraisal of the biological complexity of the gene. The two measures are complementary. Another option is to compare only the part of the models annotated as protein-coding. When applied to identical transcripts, this option allows the preferred hypotheses on choice of CDS and Start codon to be reverse engineered (Additional data file 2.5).
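The two counting modes can be illustrated with a minimal sketch. It is not the UCSCtrackCompare source (which is in Additional data file 1); the function names, the representation of a model as a list of (start, end) exons, and the use of a multiset intersection in quantitative mode are assumptions made for illustration. Specificity is taken in the gene-prediction sense, that is, the fraction of the track's elements that are found in the reference.

```python
from collections import Counter

def introns(exons):
    """Derive the introns of one model from its exons [(start, end), ...]."""
    exons = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]

def count_elements(models, projected):
    """Count introns once per coordinate pair (projected mode)
    or once per occurrence in the variants (quantitative mode)."""
    counts = Counter(i for exons in models for i in introns(exons))
    if projected:
        return Counter(dict.fromkeys(counts, 1))
    return counts

def sensitivity_specificity(reference, track, projected=True):
    """Return (sensitivity, specificity) of a track against a reference.
    Assumption: in quantitative mode an intron is matched min(ref, track) times."""
    ref = count_elements(reference, projected)
    trk = count_elements(track, projected)
    shared = sum((ref & trk).values())  # multiset intersection
    return shared / sum(ref.values()), shared / sum(trk.values())

# Hypothetical toy data: two reference variants, three track variants.
reference = [[(100, 200), (300, 400), (500, 600)], [(100, 200), (300, 400)]]
track = [[(100, 200), (300, 400)], [(150, 200), (300, 400)], [(100, 200), (350, 400)]]
print(sensitivity_specificity(reference, track, projected=True))
print(sensitivity_specificity(reference, track, projected=False))
```

Under this assumption, a track that reuses the same reference introns in many more variants than the reference loses specificity in quantitative mode while keeping it in projected mode.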
Comparison of introns, exons and nucleotides in whole models
Consistently, when we compare whole mRNA models (rather than CDSs) to the Gencode reference, AceView fares remarkably well, better than any other track. This is true in the projected mode, and even more striking in the quantitative mode, where alternative variants are counted separately (see details in Additional data files 2 and 4).
Nucleotides provide a global appraisal of the transcribed regions: AceView and Gencode spliced transcripts cover almost exactly the same nucleotides in the genome (92% sensitivity, 88% specificity; Additional data file 2.2). The structural precision of the models is best defined by the exact position of intron-exon boundaries. As shown in Figure 1a, most of the unique Gencode introns are used in AceView and few are added (sensitivity 90%, specificity 86%). Only AceView and ECgene (86.5%) detect more than 70% of the introns: the other EGASP tracks probably do not succeed in aligning with sufficient precision all the available ESTs and mRNAs. For reference, on 14 December 2005, RefSeq sees 56% of the Gencode validated introns and the Mammalian Gene Collection (MGC) 38%. As expected, the tracks relying mostly on ab initio predictions detect fewer Gencode introns, but more novel intron candidates, usually unique to each program (Additional data file 2.4); ExonHunter and Genscan are the most creative.
If all introns from alternative variants are counted separately (Figure 1b), Gencode uses close to three times the number of unique introns. AceView does too, but remarkably in this expansion, most of the introns remain exactly the same as in Gencode (sensitivity 85%, specificity 84%). In contrast, the specificity of ECgene drops to 28%, because of its drastic combinatorial use of the Gencode introns. This over-use feature is apparent to a lesser extent in the ExonWalk track, possibly because of excessive use of partial cDNAs in its step 3 (for documentation, see ).
The comparison of exons (Additional data file 2.3) can also assess the intron-exon boundaries of the models, but it is dominated by the effects of terminal exons, which represent 20% to 42% of all exons (depending on the method), and are often partial in transcript models. Furthermore, in contrast to intron boundaries, which are accurate and can in principle be verified by PCR or microarray experiments, the boundaries of terminal exons cannot be defined precisely even in a truly complete transcript because, biologically, the first (capped) base and the polyA addition site fluctuate in vivo  (D Thierry-Mieg and J Thierry-Mieg unpublished observations). Indeed, when we compare exons between Gencode and any other track, both sensitivity and specificity drop because of the terminal exons (Additional data file 2.3).
Another advantage of comparing spliced models through introns is that methods that predict only coding regions (labeled P(name) in all figures and tables) are less disadvantaged in intron than in exon mode, because most introns are located in the coding regions. This is true for example for 92% (2,075/2,264) of the unique introns in the UCSC 'Known Gene' track. We therefore chose to perform intron-based comparisons of whole spliced transcripts. Single exon genes and transcripts will be discussed separately below.
N to N comparison of entire mRNAs across methods proposing alternative transcript models
On average, Gencode genes with introns have 4.5 transcripts per gene; each transcript has 5.7 introns, but in projection only contributes 2.1 unique introns. If we limit this to coding transcripts, coding genes have on average 2.6 annotated CDSs, and each CDS has 8 introns, but in projection only contributes 3.7 unique introns (Additional data file 4, SummaryStats). Just a few of the possible combinations of introns correspond to supported models: this is called the combinatorial splicing problem. To properly compare performance across the tracks, we need to evaluate how the introns and exons are chained in the models. Let us consider each model as a non-separable chain of introns: the set of coordinates of its intron boundaries on the genome provides a precise signature. As illustrated in Figure 2a, we define the one-to-one best matching Gencode-to-track-X model pair by comparing all pairs and scoring intron boundaries: each boundary counts +1 if it is shared by the two models or -1 if it is unique to either of the two. In this way, identical models always score best. All pairs of models with at least one intron boundary in common are listed and sorted by score, and the two models from the best rating pair are flagged as 'best match', irrespective of their score. We then look recursively for the next best rating pair in which neither of the two models is flagged, and flag them. The remaining unflagged models in the list become 'additional variants in Gencode genes'. Finally, the models that do not occur in the list are orphans and belong to new genes, specific to method X, or to missed genes, present only in the Gencode reference. Some of the 'best matches' are actually 'identical to Gencode' in terms of intron-exon structure.
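The pairing procedure described above can be sketched as follows. This is an illustrative reading, not the UCSCtrackCompare code: the function names, the dictionary-of-introns input format and the tie-breaking rule (equal scores keep their input order) are assumptions, and strand and chromosome are ignored for brevity.

```python
from itertools import product

def boundaries(model):
    """Signature of a model: the set of its intron boundary coordinates."""
    return frozenset(coord for intron in model for coord in intron)

def pair_score(a, b):
    """+1 per shared boundary, -1 per boundary unique to either model."""
    return len(a & b) - len(a ^ b)

def best_matches(reference, track):
    """reference, track: dict mapping model name -> list of introns (start, end)."""
    ref_sig = {name: boundaries(m) for name, m in reference.items()}
    trk_sig = {name: boundaries(m) for name, m in track.items()}
    # List every pair sharing at least one intron boundary, best score first.
    pairs = [(pair_score(r, t), rn, tn)
             for (rn, r), (tn, t) in product(ref_sig.items(), trk_sig.items())
             if r & t]
    pairs.sort(key=lambda p: -p[0])  # stable sort: ties keep input order (assumption)
    flagged_ref, flagged_trk, matches = set(), set(), []
    for _, rn, tn in pairs:
        if rn not in flagged_ref and tn not in flagged_trk:
            flagged_ref.add(rn)
            flagged_trk.add(tn)
            matches.append((rn, tn, ref_sig[rn] == trk_sig[tn]))  # True if identical
    listed_ref = {rn for _, rn, _ in pairs}
    listed_trk = {tn for _, _, tn in pairs}
    return {
        "matches": matches,
        "additional_ref": sorted(listed_ref - flagged_ref),  # reference variants not matched
        "additional_trk": sorted(listed_trk - flagged_trk),  # extra variants in known genes
        "missed": sorted(set(reference) - listed_ref),       # orphans of the reference
        "new": sorted(set(track) - listed_trk),              # orphans of track X
    }
```

The greedy pass visits each listed pair once, so the cost is dominated by building and sorting the list of pairs that share a boundary.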
Figure 2b shows the results of this analysis (see also Additional data file 4, complete models). More than 70% of the Gencode transcripts have an exact structural equivalent in AceView, from the first to the last intron (1,191/1,691 = 70% sensitivity, 67% specificity); an additional 12% (206) have a best match. AceView misses 210 of the Gencode variants but sees 225 new variants in other Gencode genes. Finally, AceView misses some Gencode genes containing a total of 88 transcripts with introns, but sees some genes missed by Gencode, containing 170 transcripts with introns.
The second most sensitive track is ECgene (60% sensitivity; 26% specificity), which has twice as many spliced models as Gencode to choose from, but has fewer identical transcripts than AceView. The next best, ExonWalk and 'Known Genes', drop sharply to 23% exact matches. RefSeq sees only 17% of the Gencode transcripts and covers 53% of the unique nucleotides and 56% of the unique introns: despite its well recognized quality, RefSeq does not provide a comprehensive representation of the transcriptome.
By definition, the above analysis only evaluated transcripts with introns. Transcripts and genes without introns are in fact rare in the Gencode annotation (2.3% of the models, plus 1.7% due to clipping at the boundaries of the ENCODE regions). To better mimic Gencode, most were filtered by the EGASP participants, including AceView, which normally annotates a large number of single exon genes (Table 1; Additional data file 4).
Although rare on the entire UCSC browser, unspliced transcripts appear to be an important part of the human transcriptome, and they are an order of magnitude more frequent in mammals than in simpler Metazoa. Indeed, we compared the high quality full-length cDNA libraries made by Sugano using the oligo-capping method in either the nematode Caenorhabditis elegans or human (all sequences are in GenBank [7, 9]). We removed 1.1% of clones that may be genomic contaminants (331/29,562 aligned in an intron-less gene ending on an A-rich region in the genome) and found that the percentage of fully sequenced intronless clones is 10 times greater in human than in worm (36% in human (10,578/29,562 FLJ clones) versus 3.5% in worm (2,010/56,671 worm yk capped clones)). These clones also map in 11 times more unspliced genes (with no spliced variants) in human than in worm (25% of these FLJ-containing genes in human (4,261/17,214 genes) versus 2% in worm (155/7,223 genes)). However, the level of possibly immature transcripts (unspliced, but from a gene with introns) is similar in the libraries from both species (57% (6,043/10,578) in human, 59% (1,187/2,010) in the worm).
According to our analysis, the huge increase in intronless genes is a major difference between the worm and human transcriptomes. It may have co-evolved with the increased usage of alternative splicing, increased intron length or other transcriptional features. An intronless transcript is not associated with an exon junction complex, so it is expected to be translated and degraded less efficiently in human [10–12], and it might be subjected to less evolutionary pressure: unspliced genes could be where new functions arise. Indeed, some of the unspliced genes potentially encode small proteins (18,385 intronless genes have hypothetical CDS of more than 100 amino acids; note that most are human or mammal specific), some may be regulatory non-coding RNAs, and some may just be transcription by-products. We do annotate these genes in the public AceView and expect that in the future the role of intronless genes will be better understood.
Are Gencode mRNAs fully validated and complete?
We compared the introns of Gencode to those of other tracks (Additional data file 2.4); 189 unique projected introns (5% of their 3,618) are seen exclusively in the Gencode transcripts and in no other track. We expect those to be supported by Gencode experimental validation, although the evidence was not submitted to GenBank as of August 2005. Another possibility is that some might be supported only by homology to non-human transcripts, as described in the standard HAVANA procedure, in which case we hope they are not exported to UniProt. Conversely, 681 'consensual' introns are seen by three tracks or more, or 340 are seen by the strictly cDNA-supported AceView that are not seen by Gencode. Actually, Gencode sees 8% of the consensual introns missed by AceView, but AceView sees 72% of the consensual introns missed by Gencode.
Overall, we find that the Gencode transcript annotation is quite comprehensive except for the quasi-absence of intronless genes. We confirm its high quality: it missed or annotated as pseudogene only a few expressed genes seen by AceView, and it did not exploit on the order of 15% of the introns represented in cDNAs from the public databases.
In general, we especially appreciate the fact that they report all observed transcripts with good alignments and standard introns, without filtering those that are structurally candidates for nonsense mediated RNA decay (NMD). According to our estimates, putative nonsense transcripts represent about 13% of the fully supported transcripts with predicted proteins of more than 100 amino acids in human (12,855 of 101,877 have an intron larger than 60 base-pairs (bp), with standard boundaries, lying at least 55 bp downstream of the Stop codon), in contrast to 4% in worm (671 of 15,119 using the same criteria as above, except that minimal intron size is 30 bp). This increase may parallel the evolution of NMD into an essential multifunctional mechanism in mammals [10–12], or it may indicate that our cells have a more tolerant life style than the worm. Over the past few years, evidence that mRNA may be functionally active beyond its protein coding ability has accumulated; human transcripts have a complex life, with mechanisms present to protect the mRNA, modulate its accessibility to the ribosome or to specific modification or processing enzymes, monitor its aging, its position in the cell, or its stability. A comprehensive uninterpreted catalog of observed transcripts is needed to help understand all this complexity, and both Gencode and AceView aim at this goal.
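The structural criterion used above to flag putative NMD candidates can be written as a short predicate. This is a sketch under stated assumptions: the `Junction` record and its field names are hypothetical, positions are measured on the spliced transcript with `stop_pos` taken as the last base of the Stop codon, and 'standard boundaries' is read as GT/AG or GC/AG.

```python
from dataclasses import dataclass

@dataclass
class Junction:
    mrna_pos: int    # position of the exon-exon junction on the spliced transcript
    intron_len: int  # genomic length of the removed intron, in bp
    donor: str       # first two bases of the intron, e.g. "GT"
    acceptor: str    # last two bases of the intron, e.g. "AG"

STANDARD_BOUNDARIES = {("GT", "AG"), ("GC", "AG")}

def is_nmd_candidate(stop_pos, junctions, min_intron=60, min_distance=55):
    """True if some standard intron longer than min_intron bp lies at least
    min_distance bp downstream of the Stop codon (stop_pos = its last base)."""
    return any(
        j.intron_len > min_intron
        and (j.donor.upper(), j.acceptor.upper()) in STANDARD_BOUNDARIES
        and j.mrna_pos - stop_pos >= min_distance
        for j in junctions
    )
```

For the worm figures quoted above, the same predicate would be called with `min_intron=30`.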
Selecting protein coding transcripts and regions can only be an educated guess
A glimpse at the ENCODE regions of the UCSC genome browser  shows that Gencode and AceView transcripts look quite different from the other tracks, but so similar to one another that it would be hard to guess which is which if the names of the tracks were masked. However, Guigo and Reese  state that there is no clear winner at finding the Gencode coding regions. These two observations are in fact not contradictory: annotation of transcripts is based on large amounts of experimental cDNA evidence, so Gencode and AceView can agree almost perfectly. In contrast, we diverge on protein annotation, because there is almost no experimental protein sequence data available today. For this reason, most protein annotation remains hypothetical. Even proteins from UniProt/SwissProt are now contaminated by CDS predictions derived from transcriptome annotation, which makes new predictions by homology more and more circular. For instance, UniProt currently harbors 23,298 C. elegans proteins, but one should be aware that those are predictions: at most, 9,487 have complete cDNA support, and almost none has been sequenced. Until a substantial amount of direct protein sequences is generated, it is not surprising that different points of view coexist. For instance, Gencode annotates a CDS in only 41% of their transcripts, whereas all other tracks, including AceView, conventionally annotate a CDS in almost all their transcripts. Precisely because Gencode and AceView transcripts are so similar, the apparent specificity of AceView automatically drops by 59% when we compare CDSs, and the resemblance between the two methods is no longer striking (Additional data file 3, in agreement with  for the projected view).
In this context, AceView considers the CDS problem fully open, and offers no guarantee on which actual proteins are made . In practice, we identify all possible CDSs, usually more than one per transcript, and annotate those larger than 50 amino acids using BlastP, PFAM, and Psort2. All hypothetical CDSs are available from our download page to help identify mass spectra. But to simplify the display, we pick a single 'best product' per transcript, knowing it may not reflect the situation in vivo, since it does not follow closely the rules indicated by Kozak . For example, we do not necessarily choose the first CDS, which is quite often a short upstream open reading frame (uORF) [15, 16], and we do not reinitiate and display multiple products per transcript. The 'best' protein is defined by considering, in a graded fashion, the presence of a Pfam protein domain, BlastP homologies, TaxBlast conservation, specific Psort annotations, maximization of introns within the CDS, position along the transcript and size of the CDS. If the mRNA is not known to be complete (if its 5' end is not defined by a capped clone) and the frame is open on the 5' side, the AceView CDS starts at the first in frame codon. But if the mRNA appears to be complete, the CDS starts at the first AUG codon, unless there is, in the correct environment, an in frame NUG or ANG codon [14, 17–20] at least 180 bp upstream of the first AUG. In such a case, we annotate a predicted CDS starting at the non-AUG codon (the limit was set at 60 bp in the August 2005 release, leading to an excessive 24% complete CDS starting on an NUG start codon). Escherichia coli is reported to use about 17% non-AUG start, and to our surprise 7% of the human best complete products in AceView (20,616 of 293,158) actually have an alternative Start codon in the correct environment 60 amino acids upstream of the first AUG. It will be interesting to see how many are occasionally used as Start in vivo.
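The Start codon rule described above can be sketched as follows. This is an illustrative reading only, not AceView code. The input `orf` is assumed to be the in-frame RNA sequence beginning just after the last upstream in-frame stop (or at the 5' end of the transcript if the frame is open). The 'correct environment' test is left as a hypothetical user-supplied predicate `context_ok`. Choosing the most upstream qualifying non-AUG codon, and falling back to the first AUG for an incomplete mRNA whose frame is closed, are both assumptions.

```python
def near_cognate(codon):
    """True for an NUG or ANG codon other than AUG (RNA alphabet)."""
    return codon != "AUG" and (codon[1:] == "UG" or (codon[0] == "A" and codon[2] == "G"))

def choose_start(orf, complete, open_5prime,
                 context_ok=lambda orf, i: True, min_upstream=180):
    """Return the offset in `orf` of the chosen start codon, or None if there is none.
    context_ok(orf, i) is a hypothetical check of the sequence environment at offset i."""
    if not complete and open_5prime:
        return 0  # incomplete mRNA, open frame: start at the first in-frame codon
    codons = [(i, orf[i:i + 3]) for i in range(0, len(orf) - 2, 3)]
    first_aug = next((i for i, c in codons if c == "AUG"), None)
    if first_aug is None:
        return None
    for i, c in codons:
        if i > first_aug - min_upstream:
            break  # closer than min_upstream bp to the first AUG: ignore
        if near_cognate(c) and context_ok(orf, i):
            return i  # most upstream qualifying non-AUG start (assumption)
    return first_aug
```

Setting `min_upstream=60` would mimic the threshold that the text attributes to the August 2005 release.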
On the other hand, Gencode departs from all other programs in that it does not call a CDS unless it is conserved or already annotated in SwissProt, and it has a 'sensible' gene structure that is not a candidate for NMD. The product they annotate is almost always the same as AceView, except that the Start codon may differ. In reality, they probably do not really mean that 59% of the transcripts from protein coding genes are non-coding, but they just have to be very careful, because their proteins are poured directly into the UniProt reference database, so any annotation error will spread. We respect their attitude, yet some of their choices can be questioned.
NMD transcripts for instance are expected to produce, briefly but efficiently, truncated proteins, some of which could be functional. Indeed, activation of mRNA degradation by this pathway requires a pioneering round of translation that, due to the dual role of NMD proteins in activating translation, should be very efficient on transcripts still decorated with exon junction complexes (reviewed in [10, 11]). Furthermore, NMD shortens transcript lifetime only 1.5- to 11-fold, and it remains possible that it does not act in all cells and tissues at all times. The leaky behavior of the surveillance machinery is well known to geneticists: if no protein were produced from NMD candidates, the great majority of nonsense mutations would behave as complete loss of function (true nulls), but there are a number of counter-examples where a well positioned stop mutation leads to a gain of function phenotype (for example, lin-1 allele n1790).
With respect to the choice of the initiation codon, Gencode and other groups give much weight to interspecies conservation; they do not annotate upstream ORFs, and they consider only AUG codons. But it is difficult to conceive how the ribosomes would be aware of these rules, instead of following the scanning mechanism experimentally established by Kozak. If the transcript is accessible, upon scanning, the ribosome subunits assemble at the first AUG (or more rarely at an alternative start), irrespective of the length and inter-species conservation of the protein. If there is a stop codon soon after the AUG, the ribosome will keep scanning rather than drop off, and may reinitiate synthesis of a second product.
In summary, protein annotation is not supported by enough direct protein sequence evidence, and large scale mass spectrometry data on proteins are badly needed to clarify what happens in vivo.
Validating the transcriptome through democratic consensus
Annotating the transcriptome is a difficult and dynamic task: the data and the rules evolve, and even the most careful manual annotation cannot be expected to provide by itself an incontestable and final truth. There are strengths and weaknesses in each annotation, but good mRNA models supported by strong cDNA data should be found by more than one method. Therefore, an alternative way to select a benchmark may be to take a democratic approach: instead of considering a single track as 'reference', we propose to pool independent cDNA-supported annotations and search for consensual models.
A caveat is that some annotations, such as RefSeq, CCDS or UniProt/SwissProt, are so renowned that most methods use them as a data source in addition to the primary cDNA or protein data, so these models are sticky and will end up validated, whether or not they are correct. Manual annotation is invariably the source of second-hand annotation problems. In fact, in AceView, we do not use SwissProt for this reason, and we now explicitly label all transcripts whose structure is supported only by a RefSeq model as possibly suspect. In the same vein, we use only human cDNAs, to the exclusion of any other species, even mammals or primates.
We implemented the democratic idea in two flavors in UCSCtrackCompare: we either rotate the reference from Gencode to any other track and perform a closest neighbor consensus analysis (Figure 3b); or, alternatively, we pick as reference a selected pool of tracks and the program extracts their consensual models (for instance those whose intron structure is identical in at least two independent annotation tracks) and measures, for each of the 25 tracks, the number of models exactly matched, hence their sensitivity and specificity.
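The pooled flavor of the consensus can be sketched in a few lines. The sketch is illustrative rather than the UCSCtrackCompare implementation: the function names and input format are assumptions, a model is identified by its ordered chain of introns, and unspliced models are skipped because the analysis concerns spliced transcripts.

```python
from collections import Counter

def signature(model):
    """Identify a spliced model by its complete chain of introns (start, end)."""
    return tuple(sorted(model))

def consensus_models(pool, min_tracks=2):
    """pool: dict mapping track name -> list of models (each a list of introns).
    Returns the intron chains present in at least min_tracks tracks."""
    seen = Counter()
    for models in pool.values():
        seen.update({signature(m) for m in models if m})  # one vote per track
    return {sig for sig, votes in seen.items() if votes >= min_tracks}

def evaluate(track_models, consensus):
    """Return (sensitivity, specificity) of one track against the consensus set.
    Assumes both the track and the consensus contain at least one spliced model."""
    sigs = {signature(m) for m in track_models if m}
    hits = len(sigs & consensus)
    return hits / len(consensus), hits / len(sigs)
```

Here sensitivity is the fraction of consensual models reproduced exactly by the track, and specificity the fraction of the track's distinct spliced models that are consensual.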
Table 2 and Figure 3a show the results of the pooling analysis, where the consensual set is the 1,556 spliced transcripts seen by at least 2 of the 7 evidence-based independent methods: Gencode, UCSC Known Genes, RefSeq, Ensembl, AceView, ECgene and ExonWalk. AceView and Gencode fare best, with 84% and 81% sensitivity and 73% and 74% specificity, respectively. The next most sensitive method is ECgene, with 77% confirmed models, but its specificity is only 31%. ExonWalk, UCSC Known Genes, and Exogean provide considerably fewer models (33% to 26% sensitivity). ExonWalk and its closest neighbor ECgene suffer from low specificity, unless both are included in the reference set, probably because they allow combinatorial arrangements of the introns. It would be interesting to know how frequently these entire models are validated in RT-PCR experiments. Then come, in order, RefSeq, Pairagon, Ensembl, MGC, Fgenesh, and CCDS. Finally, the 12 remaining methods are less sensitive than CCDS, as shown in Figure 3a. This analysis is robust against variations in the composition of the reference set, as long as both Gencode and AceView or ECgene are included. It is even stable if we include the 23 unrelated tracks in the reference pool, yielding 1,957 consensual models (Additional data file 2.6). But if AceView and ECgene are excluded, the consensus falls to only 478 transcripts across NCBI RefSeq, UCSC Known Gene, EBI Ensembl and Sanger Institute Gencode. In this context, Gencode has an appalling specificity of only 26%. Fortunately, its agreement with AceView cross-validates both methods.
Another view is shown in Figure 3b, which displays the closest neighbor consensus analysis (see also Additional data files 2.6 and 3.1 and 3.6). The sum over seven references of the number of exactly matching models was used to order the tracks. Interestingly, all tracks spontaneously appear to be ordered in overall sensitivity, almost independently of the chosen reference. Gencode and AceView are nearly indistinguishable. With 1,191 models in common, they are by far the most sensitive, but retain a very good specificity. In a robust way, they detect the largest number of models from all other tracks; they are the most inclusive, and three times more thorough than any other reference track (see Additional data file 2.6).
To our surprise, Ensembl, which is often used as a reference catalog, in particular to count the human genes , is not consensual, and far below Gencode and AceView in both sensitivity and specificity. While it offers more models than RefSeq (427 versus 342), fewer are confirmed by at least one other method (270 versus 304 in the closest neighbor analysis, 295 versus 332 in the democratic consensus), but the caveat about the artificial increase of specificity and sensitivity of RefSeq certainly applies here. However, the quality of the RefSeq is truly higher than suggested by ExonWalk, Ensembl or Exogean, which validate only 220 to 224 of the 342 RefSeqs: in the EGASP AceView version, we purposely did not use the RefSeqs as a source, yet we confirmed the intron-exon structure of 82% of them (279). However, RefSeq is far from comprehensive. The even smaller CCDS collection is equally well matched by many tracks, but at the protein level (Additional data file 3.6), they are perfectly matched only by Ensembl (201), and not quite by the other members of the CCDS collaboration: RefSeq (197), Gencode (182) and UCSC (189). It is definitely difficult to agree on any standard for protein annotation.
AceView summarizes GenBank and dbEST into a comprehensive evidence-based gene annotation by performing hand-supervised automatic annotation
The fact that the manually curated Gencode and automatic AceView transcripts are so similar shows that the critical information for the intron-exon structure of a Gencode-like validated annotation is almost entirely contained in the combination of human ESTs, mRNAs and the genome. It appears that AceView is now able to automatically extract this information, with little more noise than a team of careful human experts. The resemblance also indicates that we have the same, possibly biased, way of looking at the data and that we apply similar filters when annotating transcripts.
To reconstruct the genes, AceView considers all cDNA sequences submitted to the public databases, and stringently co-aligns them at their single best position on the genome. Its cDNA to genome alignment algorithms are finely tuned to clip vectors and poly-A and to filter away 3% of the cDNAs because of insufficient quality of their best alignment (especially if they map in multiple genomic locations), and 2.2% because of suspected structural defects; 3.7% of the cDNAs are strand-inverted. The alignments are seeded on exact matches of 15 bp and extended using a finite automaton able to switch from normal to insertion or deletion mode when the EST fasta file starts calling bases at the wrong frequency. Missing exons are searched for aggressively, seeding on 6 bp words. Short hits are counted not in base-pairs but in entropy, assuming that each base is statistically independent from its neighbors. This is an over-simplification, but the advantage is that, for instance, an AT rich region is penalized and at the extreme a pure poly-A hit counts zero. The intron-exon boundaries are then refined by co-alignment. Finally, we reject the very long introns unless they are bounded by strong exon support. Aligned cDNAs are then clustered into the minimal set of transcripts (that is, a gene) consistent with their complete intron-exon structure. Most of the gene models with multiple cDNAs have alternative variants, but since September 2004, in order to limit combinatorial expansion of variants, we minimize concatenation by using each cDNA in one and only one transcript, favoring a silent merge in a known compatible transcript, so that only cDNAs containing a specific alternative feature are singled out. As a result, some variants are partial, but 70% of all AceView transcripts have their predicted CDS entirely supported by a single identified cDNA. The remaining models require concatenation of rarer forms, and will possibly be split into multiple alternative variants when additional data become available.
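The idea of counting short hits in entropy rather than in base-pairs can be illustrated with a small sketch. The exact formula used by AceView is not given in the text; the version below is one possible reading. It takes the Shannon entropy of the hit's base composition (treating bases as independent) and rescales it so that a hit with all four bases equally represented counts its full length. The function name and the normalization by 2 bits are assumptions.

```python
import math
from collections import Counter

def entropy_length(hit):
    """Effective length of a hit: length x (composition entropy in bits) / 2.
    A uniform ACGT composition carries 2 bits per base and keeps its full length."""
    hit = hit.upper()
    n = len(hit)
    if n == 0:
        return 0.0
    bits = -sum(c / n * math.log2(c / n) for c in Counter(hit).values())
    return n * bits / 2.0

print(entropy_length("ACGTACGT"))  # balanced composition
print(entropy_length("ATATATAT"))  # AT-rich hit is penalized
print(entropy_length("AAAAAAAA"))  # pure poly-A hit
```

Under this reading, an 8 bp hit made only of A and T is worth 4 effective bases and a pure poly-A hit is worth none, which reproduces the qualitative behavior described in the paragraph above.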
We then name the gene by physical contact to an NCBI Entrez gene model, else by alignment of a RefSeq or GenBank mRNA assigned to an Entrez gene, else by a Pfam-containing name, else by a nickname. The nickname is a number encoded in decodable pseudo English or pseudo Japanese by using a set of phonemes as basic digits. All names and previous aliases are tracked from release to release, and de facto AceView closely follows the official HUGO and Entrez gene nomenclature.
However, genome annotation cannot be fully automatic. We must often look at the genes, and take significant decisions to resolve the irregularities. The difficulty is to maintain this hand annotation over the years, as new data become available. For example, the hand annotations of the first Drosophila jamboree and of the initial version of the Celera human genome were nearly entirely lost. In AceView, we have limited manpower, just the two of us, so we had to devise an efficient cumulative methodology. Rather than hand annotating the final report of a gene, we only provide hints that are incorporated automatically in the context of the most recent data. For example, if the program seems to merge two genes in an unreasonable way, we do not create a permanent wall between them; instead, we hand annotate a few cDNA clones as having a 'real 3' end' or 'real 5' end', and then port these annotations from build to build. These hints will probably induce the program to split the gene, but if tomorrow a new mRNA sequence strongly bridges the two genes, they will automatically be reunified. When a significant number of genes need the same kind of manual hints, we add a piece of code that performs the same task and then drop the manual annotations once they are automatically reproduced, and even often enhanced. For example, at the EGASP meeting, we learnt that non-standard introns (neither GT/AG, nor GC/AG) are not usually validated by RT-PCR: we now discard any cDNA variant with a nonstandard intron, unless it also brings a novel alternative intron with standard boundaries. As a result, we reduce the noise, but if many clones in a gene use a particular nonstandard intron boundary, for instance because of an error in the genome sequence or because the intron is truly nonstandard, this intron will naturally sift through our tolerant rule and be kept in AceView.
AceView is a service to the community: it does not provide a final answer, but rather some rated proposals aimed at stimulating confirmatory experiments. By using the genome as a guide, it automatically rectifies the sequencing errors in the cDNAs and brings these sequences in line with the excellent quality of the genome itself. But it only provides a partial view of the entire transcriptome, because we are still far from saturation in cDNAs. From release to release, we improve the models by incorporating the latest cDNA data, but also by refining the rules. For example, we recently redefined the gene as a set of transcripts sharing at least one intron boundary, instead of a simple sequence contact. This disentangled the numerous contiguous genes with 3'-5' untranslated region (UTR) overlap and separated the unspliced variants, improving the gene annotation in directions wished for by the users.
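The redefinition of the gene as a set of transcripts sharing at least one intron boundary amounts to finding connected components, which can be sketched with a union-find structure. This is an illustration, not AceView code: the function name and input format are hypothetical, sharing is assumed to be transitive, and strand and chromosome are ignored for brevity.

```python
def cluster_genes(transcripts):
    """transcripts: dict mapping name -> list of introns (start, end).
    Groups transcripts that share at least one intron boundary, transitively."""
    parent = {name: name for name in transcripts}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}  # boundary coordinate -> first transcript seen using it
    for name, introns in transcripts.items():
        for coord in {c for intron in introns for c in intron}:
            if coord in owner:
                parent[find(name)] = find(owner[coord])  # merge the two groups
            else:
                owner[coord] = name

    genes = {}
    for name in transcripts:
        genes.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in genes.values())
```

With this definition, an unspliced transcript has no boundary to share and always forms its own group, and two neighboring genes whose UTRs merely overlap stay separate.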
Because Gencode annotation is manual, it may prove more difficult for them to include new data or to implement a change in strategy. For example, one can read on the HAVANA guideline site: "Occasionally a short two exon product is supported by Fgenesh and Genscan, in which case the object can be translated. It is then annotated as "believable CDS"." This rule was recently abandoned, but it will be labor-intensive to hand revise all previous models accordingly.
The puzzle of gene counts
Gencode annotates 3,618 distinct introns. But this is possibly only the tip of the iceberg, since 10,241 other introns in coding regions are predicted in EGASP, mostly by ab initio methods. If a proportion of those were correct, we might have mRNA or EST support for maybe only half of the introns and, by extension, we might be missing an appreciable fraction of the genes.
The parallel with the nematode C. elegans is interesting. In "So many genes, such a little worm" , Hillier et al. count 19,735 coding genes in WormBase. But when we analyze all available cDNA sequences in AceView WormGenes , we find only 16,094 worm genes with direct experimental evidence, of which about 700 are not annotated in the current WormBase (WS150). There are 12,083 genes supported by cDNAs, mainly from the large scale libraries from Kohara and collaborators (all sequences are in GenBank). An additional 4,011 genes or gene fragments are supported only by the systematic RT-PCR amplification of predicted ORFs from the Vidal ORFeome project . So we conclude that the authors of  are confident that, in addition to the cDNA supported genes and the 4,011 gene fragments supported by RT-PCR amplification, close to 4,400 genes that remain pure ab initio predictions really exist. Indeed, they exported them to SwissProt/UniProt.
On the other hand, these authors and their collaborators  claim that the human genome contains a maximum of 25,000 protein coding genes. They consider that nearly all of them are already known, that the numerous cDNAs that map outside of their official gene list possibly 'reflect reproducible transcriptional noise', and they do not expect any reliable gene to come from ab initio predictions. However, AceView unambiguously reconstructs from the readily available human cDNAs about 40,000 genes potentially encoding more than 100 amino acids (22,280 spliced and 18,385 intronless in the August 2005 version), in addition to 13,133 spliced genes encoding shorter proteins or non-coding. Moreover, as we see in EGASP, many ab initio predictions can be proposed in between cDNA supported genes and, unlike in the worm, no intense RT-PCR experiments have yet been launched in human. Preliminary EGASP results  only provide a lower bound on their existence, because only a fraction of the cDNA supported introns of Gencode and AceView have been validated and ab initio predictions are expected to be less expressed. If we apply a uniform method to count genes, we are forced to conclude that human has at least 3 times as many coding genes as the worm, and at least 10 times as many protein isoforms.