We have developed a genomic sequence analysis pipeline utilizing BLAST searches [8] followed by HMMER domain analysis [9] to identify NR sequences within the human genome. Domain analysis was facilitated by the knowledge that the NR superfamily is unified by a common modular structure [9]. One hallmark of the family is a DNA-binding domain (DBD) containing two C4-type zinc fingers, located in the amino-terminal half of the proteins. A second characteristic feature, the ligand-binding domain (LBD), is found at the distal carboxyl terminus and contains a highly conserved transcriptional transactivation function (AF2) [10]. The complete known complement of human NRs was used as a query set to identify candidate novel NR sequences from public human genome databases. Candidate sequences were followed up with more detailed bioinformatic and, when warranted, molecular biology analysis. Using this approach, we identified two novel NR sequences, whose closest homologs were FXR (NR1H4) and HNF4γ (NR2A2).
The FXR-related gene sequence (FXR-r) was mapped in silico to chromosomal position 1p13.1-1p13.3, distinct from the chromosomal location of FXR (12q23.1-20). The predicted coding sequence of FXR-r was not contiguous within the genome; seven intronic gaps separated the regions of coding similarity. Interestingly, the introns within FXR-r were at the same relative positions within the coding sequence as in FXR, suggesting a close evolutionary relationship between these two sequences. The predicted coding sequence of FXR-r displayed similarity to FXR across nearly the entire length (48% sequence identity at the amino acid level) but contained multiple stop codons (Figure 1a). The stop codons were confirmed by PCR amplification and subsequent sequencing of FXR-r genomic DNA fragments (see Materials and methods). Sequence analysis thus indicated that this gene does not code for a functional NR and is likely to be a pseudogene. Surprisingly, real-time quantitative PCR (RTQ-PCR) detected relatively high levels of expression of FXR-r mRNA in testis (data not shown), indicating that this gene is a transcribed pseudogene.
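The stop-codon check described above can be illustrated with a minimal sketch. It is not the pipeline used in this study; the function name `find_premature_stops` and the example sequences are hypothetical, and the sketch assumes a spliced coding sequence that is already in the correct reading frame.

```python
# Illustrative sketch only: scan a predicted coding sequence for in-frame
# stop codons that occur before the final codon. Names are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_premature_stops(cds, frame=0):
    """Return 0-based nucleotide offsets of in-frame stop codons.

    The last complete codon is ignored, because a stop codon is expected
    at the end of an intact coding sequence.
    """
    seq = cds.upper()
    codon_starts = list(range(frame, len(seq) - 2, 3))
    if not codon_starts:
        return []
    last_start = codon_starts[-1]
    return [
        start
        for start in codon_starts
        if seq[start:start + 3] in STOP_CODONS and start != last_start
    ]


if __name__ == "__main__":
    # Toy sequences, not taken from FXR-r.
    print(find_premature_stops("ATGAAATAGCCCTGA"))  # interrupted frame
    print(find_premature_stops("ATGAAACCCTGA"))     # intact frame
```

A non-empty result flags a reading frame that cannot encode a full-length protein, which is the criterion applied to FXR-r in the text.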
The second novel NR gene (HNF4γ-r) was mapped in silico to chromosome position 13q14.11 - 13q14.3, unlinked to the known HNF4γ gene at position 12q12. The HNF4γ-r sequence showed sequence similarity across nearly the entire length of the coding region of HNF4γ (71.4% sequence identity at the amino acid level). Like FXR-r, HNF4γ-r coding sequence contained multiple stop codons (Figure 1b) and thus also appears to represent a pseudogene. Nine frame-shifts were necessary to maintain the amino acid reading frame relative to HNF4γ. The predicted HNF4γ-r sequence was confirmed by sequence analysis of human genomic DNA (see Materials and methods). The predicted coding sequence of HNF4γ-r was contiguous within the genome, consistent with possible retrotransposition into the genome [11]. Unlike FXR-r, no expression of HNF4γ-r mRNA was detected in any of the tissues examined (data not shown).
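The identity figures quoted for FXR-r and HNF4γ-r are percentages over aligned amino acid positions. The text does not state the exact convention used, so the following is only a sketch under one common assumption: identity is counted over columns in which neither sequence has a gap. The function name `percent_identity` and the toy alignment are hypothetical.

```python
# Illustrative sketch only: percent identity between two rows of a pairwise
# alignment, counting only columns where both sequences have a residue.
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return percent identity over ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment, not real NR sequence.
    print(percent_identity("MKT-LV", "MRTALV"))
```

Other conventions (for example, dividing by the full alignment length or by the length of the shorter sequence) give different values for the same alignment, so the denominator should always be reported alongside the figure.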
Only one other NR pseudogene has been reported to date, a pseudogene related to the ERRα receptor [12]. The identification of FXR-r and HNF4γ-r brings the total human NR pseudogene number to three. Further evidence that these genes are pseudogenes comes from the observation that no homologs of the HNF4γ-r and FXR-r genes could be found in available mouse, rat, Fugu, or Drosophila genome sequences. Pseudogene sequences would not be expected to be conserved between genomes even as closely related as human and mouse. In addition, and in contrast to their closest functional gene homologs HNF4 and FXR, HNF4γ-r and FXR-r did not display conservation of their amino acid sequences relative to their nucleic acid sequences. This result is also consistent with the pseudogene characterization of these sequences.
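The comparison of amino acid conservation with nucleic acid conservation can be made concrete by classifying codon differences as synonymous (same amino acid) or nonsynonymous (different amino acid). The sketch below is a crude count under stated assumptions, not the analysis performed here: both sequences are assumed to be aligned in frame and of equal length, the standard genetic code is used, and the function name `classify_codon_changes` is hypothetical. A proper estimate of selective constraint would also normalise by the number of synonymous and nonsynonymous sites.

```python
# Illustrative sketch only: count synonymous versus nonsynonymous codon
# differences between two in-frame, equal-length coding sequences.
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(cds_a, cds_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must be aligned to the same length")
    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(cds_a) - 2, 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        # Skip identical codons and codons with gaps or ambiguous bases.
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


if __name__ == "__main__":
    # Toy sequences, not real NR coding sequence.
    print(classify_codon_changes("ATGCTTAAAGGT", "ATGCTCAGAGGA"))
```

For a gene under functional constraint, synonymous differences would be expected to predominate; a sequence that accumulates both kinds of change at similar rates is behaving as expected for a pseudogene.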
FXR-r and HNF4γ-r were found using searches that utilized other known human NRs as query sequences. Certain C. elegans receptors contain LBD sequences that differ significantly from mammalian LBDs [5]. It remained possible, then, that NR LBD sequences existed in the human genome that represented homologs of C. elegans NRs but were not identified using mammalian NRs as a query set. To address this question, we scanned the human genome with the complete sets of C. elegans and Drosophila NRs. Extensive analysis using these receptors as query sequences did not reveal a single novel mammalian homolog of these sequences (no BLAST hits with p value < 10). From our analysis, we conclude that the human NR set will not be expanded by orthologs resembling the large number of NRs found in C. elegans.
Phylogenetic analysis of vertebrate NRs has defined six ancestral NR subfamilies [13,14,15]. Five of the subfamilies are also represented among both the C. elegans and Drosophila NRs (Figure 2), consistent with the proposed ancient metazoan origin of these subfamilies [14]. The earlier analysis identified no arthropod or nematode members of the NR3 subfamily, suggesting that this subfamily might be more recently derived and specific to the deuterostome lineage [14]. More recently, however, the Drosophila genome sequence [1] has revealed a previously unknown NR sequence (CG7407) that falls into the NR3 subfamily (Figures 2,3). The observation that NR3 is represented in both protostomes and deuterostomes indicates that NR3, like the other major NR subfamilies, is of an ancient metazoan origin. Notably, the C. elegans genome does not encode a member of the NR3 subfamily. It will be of interest to learn if NR3 is represented in any nematode species, as absence of the NR3 subfamily from nematodes in general would suggest that arthropods and vertebrates may share an evolutionary history that occurred after separation of the nematode lineage. Such an early divergence of the nematode evolutionary lineage would be in disagreement with a recent hypothesis placing nematodes and arthropods in a common evolutionary clade of molting invertebrates [16] and would be more consistent with the traditional placement of nematodes in a lineage that diverged from other metazoans before separation of the major protostome and deuterostome lineages [17].
Clearly, dramatically divergent evolutionary pathways have shaped the NR sets in separate phylogenetic lineages. Within the six major NR subfamilies, four groups of NRs are currently known only in vertebrates (thyroid hormone receptors (TR), peroxisome proliferator-activated receptors (PPAR), retinoic acid receptors (RAR) and the steroid receptor group containing glucocorticoid receptor (GR), mineralocorticoid receptor (MR), progesterone receptor (PR) and androgen receptor (AR)). In addition, both invertebrate genomes encode NRs that are not clearly placed in one of the six defined NR subfamilies (Figure 2). As previously noted [13], the three Drosophila members of the Knirps group define an unusual class of NRs that lack similarity to the LBDs of the vertebrate NRs. Most of the C. elegans NRs (255 of 270) are diverged from those found in humans and flies (Figure 2). It is unclear whether these divergent nematode NRs represent new subfamilies [18] or are highly diverged members of one or more of the six recognized subfamilies. In contrast to the situation with the insect Knirps group, analyses of potential structures indicate that the majority of the divergent C. elegans receptors are predicted to contain the canonical antiparallel α-helix sandwich structure characteristic of ligand-regulated NRs (A. Bogan, C. Maina, J.-M. Chandonia, F. Cohen, K. Yamamoto, and A. Sluder, unpublished data). Thus, despite the extensive sequence diversity of many C. elegans LBDs, they are unified by a common structural fold, possibly reflecting the requirement for interaction with a core set of NR cofactors [19]. Since the known structures of NR LBDs contain a hydrophobic cleft in which their endogenous hormone ligands are bound, it is likely that most, if not all, orphan receptors will be amenable to modulation by small molecules.
In sum, we have found a striking difference between humans, Drosophila and C. elegans with respect to their NR sets. There is a finite possibility that the last 5% of the human genome sequence could harbor an additional novel NR sequence, but this is unlikely given that this 5% is enriched in repetitive heterochromatic sequence. Such a finding would not change the general conclusion that there are striking differences between the three genomes. Knowledge of all the members of each NR set defines the unique landscape for NR modulation and provides a basis for more detailed phylogenetic studies. Furthermore, such a whole-genome comparison of the types and numbers of genes only reflects one level of NR complexity in an organism. The impact of transcriptional and post-transcriptional processing events on total NR functional diversity in each proteome will be a subject for future studies.