Main text

A recent paper reported 274 genes as missing in birds but present in the genomes of most other vertebrate lineages [1]. Here, we describe several genes from this list that are, in fact, present in the chicken genome. More importantly, we draw attention to a subset of avian genes characterized by high GC content and multiple long GC-rich stretches. We suggest that these sequence characteristics explain why genes of this category are frequently absent from genomic assemblies and other sequence databases. In many cases, however, these genes can be reconstructed from the large amounts of “raw” next-generation sequencing (NGS) data available from the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI).

Pursuing our long-term interest in chicken hematopoiesis, we noticed that the gene cluster shown next to the erythropoietin receptor (EPOR) gene in Figure 2 of Lovell et al. [1] lists LPPR2 as missing in birds. However, we already knew that both EPOR and LPPR2 exist in the chicken. The sequences of both genes show the GC-rich characteristics mentioned above. We therefore examined, though not exhaustively, the list of 274 genes reported as missing in birds [1]. Using mammalian and other vertebrate orthologs of these genes, we analyzed NCBI’s SRA datasets from the chicken and other birds. In this way, we were able to reconstruct two other chicken genes, MMP14 and MRPL52. The sequences of the chicken LPPR2, MMP14 and MRPL52 genes (Additional file 1) were assembled from multiple pooled RNA-seq datasets from the SRA. Several lines of evidence indicate that these genes are, in fact, the orthologs of the corresponding genes in non-avian vertebrates. First, their sequences are absent from the current chicken assembly, or are present only as small fragments in unidentified genomic contigs. Second, phylogenetic analysis (Additional file 2) confirms that they group with their orthologs and not with their closest paralogs, MMP15 and LPPR5. Finally, for LPPR2 there is at least partial evidence of conserved synteny in birds. We have assembled LPPR2 from the Tibetan ground tit (Pseudopodoces humilis), where it lies on the same 46-kb genomic scaffold [GenBank: NW_005087926] as EPOR and SWSAP1. This is in keeping with the gene arrangement in mammals.

The newly identified chicken MMP14 and MRPL52 genes also show the GC-rich sequence characteristics. To show that this sequence pattern causes persistent problems for correct gene assembly, we analyzed the 89 genes (Supplemental Table 6A in [1]) reported as missing in chicken but present in some other bird species. Using these bird genes as probes, we were able to assemble several genes from this list (ALKBH7, BLVRB, INO80E, NDUFB7, OPLAH, PCP2, PET100, and SWSAP1) from the chicken SRA data (Additional file 1). Indeed, as shown in Fig. 1, most of the 89 genes are clear outliers in terms of their GC% and G/C-rich stretches. The majority of the 89 genes are from the P. humilis genome [2], whose assembly is, in our view, the most complete in terms of coverage of GC-rich avian genes. The distributions of GC% and G/C stretches in P. humilis genes do not differ from those in the genes of other bird species (Additional file 3). Therefore, there is no systematic bias in the sequence composition of the majority of P. humilis genes.

Fig. 1

Patterns of GC content and G/C stretches in avian and other vertebrate genes. a Dot plot of avian genes, displaying the GC content and the average length of stretches containing G or C nucleotides. A G/C stretch was defined as an uninterrupted sequence of at least three consecutive G or C nucleotides. The complete set of approximately six thousand chicken RefSeq genes from the UCSC genome browser database [17] is depicted by blue circles. Only coding sequences longer than 299 nucleotides were analyzed. The set of 86 avian genes reported to be missing in the chicken genome [1] is depicted by open circles, and 23 avian genes newly assembled in this study are shown as red circles. Additionally, a histogram showing the distribution of average G/C-stretch length in the chicken RefSeq gene category is depicted by a blue line. b Dot plots of selected avian genes, compared with their vertebrate orthologs. The GC content and average length of G/C stretches are shown for the coding sequences of chicken MMP14 and LPPR2 (reported as missing in birds [1]) and for genes from the EPO and EPOR loci. Where available, orthologous genes from other birds, turtles, mammals, lizards and crocodilians are included in the plots. The blue dots show the distribution of chicken RefSeq genes. Sequences of the newly assembled avian genes represented in this figure and GenBank accession numbers of the sequences plotted in panel b are listed in Additional file 1 and Additional file 4, respectively.
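
The two quantities plotted in Fig. 1 can be computed directly from a coding sequence. The following Python sketch is illustrative only and is not the code used in this study: the function names are hypothetical, and it assumes that a G/C stretch may mix G and C nucleotides (for example, GCG counts as a stretch of length three), which is one reading of the definition given in the legend.

```python
import re
from typing import List


def gc_content(seq: str) -> float:
    """Return the fraction of G or C nucleotides in seq (0.0 if seq is empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_stretch_lengths(seq: str, min_len: int = 3) -> List[int]:
    """Return the lengths of uninterrupted runs of G/C that are >= min_len long.

    Assumption: a run may contain both G and C nucleotides.
    """
    pattern = re.compile(r"[GC]{%d,}" % min_len)
    return [len(m.group()) for m in pattern.finditer(seq.upper())]


def mean_gc_stretch_length(seq: str, min_len: int = 3) -> float:
    """Return the average G/C-stretch length, or 0.0 if no stretch qualifies."""
    lengths = gc_stretch_lengths(seq, min_len)
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    example = "GGGATCCCCC"  # toy sequence, not a real gene
    print(gc_content(example), mean_gc_stretch_length(example))
```

Plotting `gc_content` against `mean_gc_stretch_length` for each coding sequence longer than 299 nucleotides would give a dot plot of the kind shown in panel a.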

Furthermore, we report here for the first time the sequences of the chicken erythropoietin (EPO) and EPOR genes (Additional file 1), which also share the GC-rich sequence characteristics. These genes were absent from nucleotide databases, and it was assumed that avian hematopoiesis did not require EPO signaling, since primary chicken erythroid progenitors were not EPO-dependent [3–6]. The identification of the chicken EPO and EPOR genes therefore allows us to test whether avian EPO retains the biological activity it has in other vertebrates.

All of these newly assembled avian genes, previously considered missing in all birds or in the chicken, share similar GC-rich sequence characteristics. GC-rich genes are extremely hard to amplify by PCR, a key step in NGS library preparation [7, 8]. These technical hurdles are presumably behind the absence of this gene subset from genomic databases. In particular, regions of long and concatenated GC-rich stretches cause an extreme decrease in coverage by NGS reads. The assembly of genes in this subset therefore requires multiple large SRA datasets (examples are provided in Additional file 1). We also note that many of the GC-rich stretches are predicted to form DNA quadruplex structures [9]. We can only speculate about the biological determinants behind the presence of the GC-rich sequence patterns. In the genes we have analyzed here, these sequence patterns appear to be conserved in birds but not in other vertebrates. The best example is EPO, for which we were able to assemble orthologs in several bird species from a wide variety of avian taxa. All avian EPO sequences cluster together, while the mammalian and other non-avian EPO orthologs have lower GC content (Fig. 1b and Additional file 4). Therefore, the events leading up to this change in EPO sequence composition must have occurred in a common ancestor of birds, or there must have been some driving force maintaining this pattern throughout avian evolution. A similar evolutionary trend can be observed in POP7 (which lies next to EPO in vertebrate genomes), EPOR, its genomic neighbor SWSAP1, and other GC-rich genes reported here (e.g. MMP14 and LPPR2, as shown in Fig. 1b). For these genes, we had only a very limited amount of sequence from outside their coding regions, so their position on avian chromosomes could not be determined. An intriguing possibility is that at least some of these genes reside on avian microchromosomes.
The six smallest chicken microchromosomes (chromosomes 33-38) do not have any sequence representation in the chicken genome assembly [10]. Sequence information for the larger chicken microchromosomes is also fragmentary; they have, however, been reported to have higher GC content than macrochromosomes [11, 12]. In addition, avian microchromosomes contain various types of short microsatellite repeats [13–16]. The extensive presence of these repeats is a typical feature that we observe in introns in the GC-rich gene subset.
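
To illustrate how candidate quadruplex-forming stretches of the kind mentioned above can be flagged, the sketch below scans a sequence for the widely used canonical motif of four runs of at least three guanines separated by loops of one to seven nucleotides, and for its C-rich counterpart on the opposite strand. This is a simplified, hypothetical example; the prediction cited in [9] may rely on different rules, and the function name and loop limits are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Assumed canonical motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
_LOOP = r"[ACGT]{1,7}"
G4_PLUS = re.compile(r"(?:G{3,}" + _LOOP + r"){3}G{3,}")
# The same motif on the complementary strand appears as runs of C.
G4_MINUS = re.compile(r"(?:C{3,}" + _LOOP + r"){3}C{3,}")


def find_g4_motifs(seq: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, strand) for non-overlapping candidate motifs.

    Coordinates are 0-based, end-exclusive, on the given sequence.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", G4_PLUS), ("-", G4_MINUS)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


if __name__ == "__main__":
    toy = "TTGGGAGGGTGGGAGGGTT"  # toy sequence, not a real gene
    print(find_g4_motifs(toy))
```

A pattern match of this kind only indicates sequence potential; it does not show that a quadruplex actually forms.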

Conclusion

We report the existence of avian genes with strongly biased GC patterns. These genes have been underrepresented in genomic databases, probably due to technical obstacles in genomic library preparation. In addition to identifying the chicken EPO and EPOR loci, we analyzed the gene set reported as missing in birds [1] and found additional examples of such genes. Our examination of the genes listed in Lovell et al. [1] was not exhaustive, so several more of the avian genes absent from current databases can be expected to be assembled from SRA data. Nevertheless, the vast majority of the genes reported in Lovell et al. [1] are probably genuinely missing in birds, and their article includes a detailed discussion of the evolutionary aspects of this phenomenon. The existence of an underrepresented GC-rich gene subset was originally suggested in the 2004 report on the chicken genome sequence [12]. Here, we present detailed examples of such genes, which pose an analytical challenge from both technical and evolutionary perspectives.