A recent paper reported that 274 protein-encoding genes were missing from sequencing data from 60 bird species [1]. Most of them were organized in conserved syntenic clusters in non-avian vertebrates, suggesting that their loss in the avian lineage had occurred through genomic deletions of gene blocks. This hypothesis was supported by another study reporting that 640 protein-encoding genes were missing from 48 bird genomes [2]; its authors similarly suggested that large genomic segments had been deleted during microchromosome evolution in birds. However, our recent discovery of leptin genes with ~70% GC content in chicken and duck [3], and the new identification of 89 GC-rich genes [4], suggested an alternative explanation for the ‘missing genes’: a technical barrier to their detection. To explore this further, we used RNA-Seq data from visceral fat, hypothalamus, and pituitary tissues from two types of chickens, broilers and layers (Additional file 1: Table S1), for de novo transcriptome assembly and identification of novel genes.

The initial set of 588,683 transcripts obtained using Trinity [5] was reduced to 257,700 after removing transcripts that were expressed at low levels. We mapped the transcripts to the chicken reference genome build consistent with the previous studies [1, 2] using Blat and Blast, and retained 8395 sequences without alignments. These transcripts were then characterized on the basis of sequence similarity to known genes in other vertebrates using the Trinotate pipeline (https://trinotate.github.io), which searches for sequences encoding known protein domains, transmembrane domains, and signal peptides (Additional file 1: Tables S2 and S3a). Genes that were already known in chicken were removed by comparing their gene symbols with those in Ensembl (release 80), RefSeq, and Entrez Gene, resulting in 1878 novel gene-candidate transcripts representing 1063 genes (Additional file 1: Tables S3b and S4).
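The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used: the `Transcript` record, the `select_novel_candidates` function, the expression cutoff of 1.0, the placeholder gene symbols, and the decision to drop transcripts without any assigned symbol are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record summarizing one assembled transcript."""
    transcript_id: str
    expression: float            # expression estimate; unit and scale are assumptions
    aligned_to_genome: bool      # True if Blat or Blast reported a genome alignment
    gene_symbol: Optional[str]   # symbol assigned during annotation, None if absent


def select_novel_candidates(
    transcripts: Iterable[Transcript],
    known_symbols: Iterable[str],
    min_expression: float = 1.0,  # assumed cutoff, not taken from the study
) -> List[Transcript]:
    """Keep expressed, unaligned transcripts whose symbol is not already known."""
    known = {symbol.upper() for symbol in known_symbols}
    candidates = []
    for t in transcripts:
        if t.expression < min_expression:
            continue  # step 1: drop transcripts expressed at low levels
        if t.aligned_to_genome:
            continue  # step 2: drop transcripts that align to the reference genome
        if t.gene_symbol is None:
            continue  # step 3 (assumption): drop transcripts without an assigned symbol
        if t.gene_symbol.upper() in known:
            continue  # step 4: drop genes already annotated in chicken
        candidates.append(t)
    return candidates


# Toy input with placeholder identifiers and symbols, not data from this study.
toy_transcripts = [
    Transcript("tx1", 0.2, False, "GENE_A"),
    Transcript("tx2", 5.0, True, "GENE_B"),
    Transcript("tx3", 8.0, False, "GENE_C"),
    Transcript("tx4", 3.0, False, "gene_d"),
    Transcript("tx5", 4.0, False, None),
]
candidates = select_novel_candidates(toy_transcripts, known_symbols=["GENE_D"])
print([t.transcript_id for t in candidates])
```

Comparing symbols case-insensitively is a design choice made for this sketch; in practice, symbol synonyms across Ensembl, RefSeq, and Entrez Gene would also need to be reconciled.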

To increase specificity and to remove multiple transcript isoforms, we tested each transcript by reciprocal Blastn against the full transcriptome assembly (588,683 transcripts), and Blastx against the set of coding sequences predicted by TransDecoder (https://transdecoder.github.io), consisting of 111,457 sequences. The remaining set yielded 194 transcripts encompassing 191 distinct high-confidence genes (Additional file 1: Table S5). Through Blastn, we found that 55 loci had already been recovered as annotated genes in an updated genome build (Galgal5) released after the previous studies. In addition, 47 genes mapped to the genome but lacked annotations, while another 51 genes were annotated as uncharacterized or putative proteins (Additional file 1: Table S6). One discrepancy in annotation between our genes and Galgal5 was observed for the RSAD1 transcript, which was annotated as MYCBPAP in Galgal5. Closer inspection revealed that these two genes, which are close neighbors in the human genome, had been mistakenly merged into MYCBPAP in Galgal5. Therefore, we considered RSAD1 as a novel annotation (Additional file 1: Table S6).

Among the remaining 38 genes (Additional file 1: Table S6) with no sequence similarity to any genome build are tumor necrosis factor (TNF) and nephrin (NPHS1), which have been reported as missing from birds in several studies (Table 1) but which are critically important in vertebrate biology and have been studied extensively in non-avian vertebrates (there are more than 130,000 publications in PubMed on TNF and 1300 on NPHS1). These genes were subjected to full-cDNA-sequence determination, exon characterization, RT-PCR validation, and expression profiling using RNA-Seq data from red junglefowl (Additional file 2: Figures S1 and S2; Additional file 2: Tables S9 to S12). The similarity in sequences, exon–intron junctions, and characteristic expression profiles confirmed the identification of chicken NPHS1 and TNF, thus resolving the long-standing discussion as to why these genes have been missing from the genome assembly despite their established essential biological function in other species (for example, see [6,7,8,9,10,11,12]).

Table 1 Characterization of the novel genes reported missing in previous studies

Mass spectrometry analysis of fat tissue from the same chickens confirmed the identification of MEPCE, NPC1L1, PHF1, MRPS18, and SF3B2 at P < 0.01, and the expression of AMIGO1, CYAB, FKBP11, MGAT1, MOGS, MRI1, MTX1, POLR3D, PEA15, and TXNIP at P < 0.05 (Additional file 1: Tables S4, S5, and S8). To further validate the novel genes in the context of species phylogeny, we selected 11 genes that had complete coding sequences predicted by TransDecoder (Additional file 3: Table S13) and at least four reported orthologous protein sequences in the NCBI protein database, and used Blastp to measure the identity between each ortholog and the predicted chicken amino acid sequence. As expected, the relative degrees of sequence identity were inversely correlated with evolutionary distance for most transcripts (r = –1 to –0.7), with three exceptions resulting from high conservation.
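The relationship between sequence identity and evolutionary distance can be illustrated with a Pearson correlation coefficient. The sketch below assumes that r in the text refers to Pearson's coefficient, which the text does not state explicitly; the function name `pearson_r` and the toy divergence times and identity values are hypothetical and are not taken from this study.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Hypothetical values for one gene: divergence from chicken (arbitrary time units)
# and percent amino acid identity of each ortholog to the chicken protein.
toy_divergence = [100.0, 300.0, 350.0, 430.0]
toy_identity = [90.0, 75.0, 70.0, 60.0]

r = pearson_r(toy_divergence, toy_identity)
print(f"r = {r:.2f}")
```

A strongly negative r, as in this toy series, means that identity to the chicken protein falls as the compared species becomes more distantly related. A highly conserved gene would show nearly constant identity and therefore a weak or undefined correlation.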

Comparing these genes to the genes previously reported as missing [1, 2, 6] recovered 74 overlapping gene symbols (Table 1). A higher proportion of the genes reported missing only in chickens was identified compared to those reported missing in all avian species (15% and 3–4.5%, respectively). The recovered transcripts had very high GC content (68%; Additional file 3: Figure S3b), further supporting the hypothesis that many of the genes that are currently missing from the draft genome eluded previous identification because of their high GC content [3, 4].
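GC content, the property central to this argument, is simply the fraction of G and C bases in a sequence. The sketch below shows one way to compute it; the function name `gc_content`, the choice to ignore ambiguous bases such as N, and the toy sequences are assumptions made for illustration and are not taken from this study.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored (an assumption)
    return gc / total if total else 0.0


# Hypothetical toy sequences, not transcripts from this study.
toy_sequences = {
    "gc_rich_example": "GCCGCGGCGCCGGGCATGCC",
    "at_rich_example": "ATTATAATGCATTTAAATAT",
}
gc_by_sequence = {name: gc_content(s) for name, s in toy_sequences.items()}

for name, value in gc_by_sequence.items():
    print(f"{name}: {value:.0%} GC")
```

Averaging this value over a set of transcripts gives a summary figure comparable in kind to the 68% reported above, although the exact averaging scheme used in the study is not specified here.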

When exploring the location of novel genes recovered by the updated genome build, we observed that most genes (76%) were located on unplaced scaffolds, probably representing uncharacterized microchromosomes. Among those that mapped to known chromosomes, the majority (80%) were localized to microchromosomes, which are estimated to contain 50% of protein-coding genes in chickens [13]. Surprisingly, many of the mapped genes appeared in clusters. Mapping positions of the human orthologs demonstrated that the organization of 80% of the mapped novel genes was in syntenic clusters (Table 2). The strong tendency of these novel genes to cluster indicated their location in recalcitrant chromosomal regions with high GC content, primarily on microchromosomes. The methods used in this study are detailed in Additional file 4: Detailed materials and methods.

Table 2 Overview of novel genes missing from the Galgal4 assembly but present in Galgal5

Conclusions

Our RNA-Seq study, combined with extensive bioinformatics analysis, recovered 191 novel genes that were missing from previous chicken assemblies, 38 of which are still not present in the most recent genome build (Galgal5), as well as an additional 47 that are at least partially present in Galgal5 but lacking proper annotation. The high GC content (68% on average), the microchromosomal location of the majority of the novel genes (80%) covered by Galgal5, and their high tendency to cluster into syntenic blocks (80%) suggest that the novel genes were not found in earlier analyses because of their position in GC-rich gene clusters, rather than because of chromosomal fragmentation and loss. In addition, the identification and characterization of NPHS1 and TNF, which are expected to be essential for avian physiology, and which are still missing from the latest genome build, emphasize the importance of striving towards a repertoire of known and characterized genes that is as complete as possible.