Genome-wide characterization, evolution, structure, and expression analysis of the F-box genes in Caenorhabditis

Wang, Ailan; Chen, Wei; Tao, Shiheng

doi:10.1186/s12864-021-08189-7

Research article
Open access
Published: 11 December 2021

Volume 22, article number 889, (2021)

BMC Genomics

Ailan Wang^1,2,3,
Wei Chen^1,2 &
Shiheng Tao ORCID: orcid.org/0000-0002-6076-6038^1,2

Abstract

Background

F-box proteins represent a diverse class of adaptor proteins of the ubiquitin-proteasome system (UPS) that play critical roles in the cell cycle, signal transduction, and immune response by removing or modifying cellular regulators. Among closely related organisms of the Caenorhabditis genus, remarkable divergence in F-box gene copy numbers was caused by sizeable species-specific expansion and contraction. Although F-box gene number expansion plays a vital role in shaping genomic diversity, little is known about molecular evolutionary mechanisms responsible for substantial differences in gene number of F-box genes and their functional diversification in Caenorhabditis. Here, we performed a comprehensive evolution and underlying mechanism analysis of F-box genes in five species of Caenorhabditis genus, including C. brenneri, C. briggsae, C. elegans, C. japonica, and C. remanei.

Results

Herein, we identified and characterized 594, 192, 377, 39, 1426 F-box homologs encoding putative F-box proteins in the genome of C. brenneri, C. briggsae, C. elegans, C. japonica, and C. remanei, respectively. Our work suggested that extensive species-specific tandem duplication followed by a small amount of gene loss was the primary mechanism responsible for F-box gene number divergence in Caenorhabditis genus. After F-box gene duplication events occurred, multiple mechanisms have contributed to gene structure divergence, including exon/intron gain/loss, exonization/pseudoexonization, exon/intron boundaries alteration, exon splits, and intron elongation by tandem repeats. Based on high-throughput RNA sequencing data analysis, we proposed that F-box gene functions have diversified by sub-functionalization through highly divergent stage-specific expression patterns in Caenorhabditis species.

Conclusions

Massive species-specific tandem duplications and occasional gene loss drove the rapid evolution of the F-box gene family in Caenorhabditis, leading to complex gene structural variation and diversified functions affecting growth and development within and among Caenorhabditis species. In summary, our findings outline the evolution of F-box genes in the Caenorhabditis genome and lay the foundation for future functional studies.

Background

The formation of novel genes plays an essential role in biological evolution, such as morphological innovations and adaptation to environmental changes. Organisms can acquire novel genes through various molecular processes. For instance, genomic rearrangements, retroposition, horizontal gene transfer, and duplication-divergence of existing genes are responsible for novel gene birth [1]. The novel genes derived from different evolution mechanisms have distinct molecular signatures and are not equally active in all genomes. Among all of these evolutionary mechanisms for generating novel genes, gene duplication is a significant contributor that facilitates organisms to adapt to dynamically changing environments [2, 3].

Multiple possible evolutionary fates have been proposed for duplicated genes [4, 5]. The most likely fate of a duplicated gene is pseudogenization (i.e., the copy becomes unexpressed or functionless). If increased gene dosage is beneficial, both gene copies preserve the original gene function [6], an evolutionary process also referred to as concerted evolution [7]. Another evolutionary fate is sub-functionalization, in which each daughter gene retains part of the original functions of the parental gene [2]. One of the most critical outcomes of gene duplication is neofunctionalization, in which one copy undergoes adaptive changes while the other maintains the ancestral function [3, 8, 9]. Each of these processes can retain duplicate genes under different conditions [10,11,12,13].

In Caenorhabditis species, gene duplication has been a vital evolutionary force for generating genetic diversity among F-box proteins [14]. F-box proteins are a class of substrate adaptor proteins that function in the SKP1–CUL1–F-box protein (SCF)-mediated ubiquitination and protein degradation pathway [15]. The number of F-box genes varies dramatically among closely related species or subspecies [15]. F-box genes are the largest and fastest evolving gene family in plants [16,17,18,19]. For instance, the number of F-box Kelch genes (FBKs) varies tremendously among Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, and Vitis vinifera [17]. Species-specific metabolism in plants might require a large number of F-box proteins, for example in responses to various hormones [20], the circadian clock and photomorphogenesis [21, 22], flower development [23], and defense responses [24]. However, the F-box gene family is small in most investigated animals and relatively conserved among closely related species. For instance, the F-box gene number varies from 66 to 81 in Euarchontoglires [25] and is only 42–47 in 12 extant Drosophila species [26].

In contrast, F-box genes have expanded massively in the Caenorhabditis genus, where the number of F-box genes in a single species can exceed one thousand [14]. However, few studies have considered how these numerous F-box genes were generated in the genomes of Caenorhabditis species and why they were preserved after duplication. To address these open questions, we investigated the mechanisms responsible for F-box gene number divergence in Caenorhabditis and the structural and functional diversification of these genes.

Results

Prediction of F-box genes and their protein domain architectures

The F-box domain is ∼40 amino acids long near the N terminus of E3 ubiquitin ligase, and a well-known function is acting as a Cullin1 adapter for ubiquitin-mediated proteolysis [27, 28]. Comprehensive genomic characterization defined F-box domain-encoding genes in five Caenorhabditis species (C. brenneri, C. briggsae, C. elegans, C. japonica, and C. remanei) and one outgroup (P. pacificus), based on an approach combining software HMMER, ScanProsite, PSI-BLAST, and InterProScan. The pairwise comparison of F-box HMM logos showed the high similarity between the F-box proteins identified in each species and the known F-box proteins confirming our approach (Fig. S2). The identified F-box domain-containing protein sequences in FASTA format are found in supplemental datasets S1, S2, S3, S4, S5, S6, which are C. brenneri, C. briggsae, C. elegans, C. japonica, C. remanei and P. pacificus F-box proteins, respectively. The identified F-box genes varied considerably among the five species, from 39 members in C. japonica to 1426 members in C. remanei (Table 1). Hence, C. japonica has the minimum number of F-box genes, and even P. pacificus has more F-box genes, reaching 97. For five Caenorhabditis species, the size of the F-box gene number is not proportional to the total number of genes in corresponding genomes, suggesting changes in the importance of the F-box gene family in Caenorhabditis. Intriguingly, the 1426 F-box genes in the C. remanei genome account for ~ 4.5% of total coding potential. In contrast, the proportion of F-box genes identified in the C. japonica genome has dropped to 0.13%.

Table 1 The number of F-box protein-coding genes identified in six species of nematodes

In the five Caenorhabditis species, most of the F-box genes with an identified C-terminal functional domain fall into two broad subfamilies: ~1052 contain an FBA2 domain, and ~720 contain an FTH domain (Fig. 1). In members of these two subfamilies, the N-terminal F-box domain is followed by a more divergent region of approximately 300 amino acids, called FBA2 or FTH. Although these two types of F-box genes were identified in each of the five Caenorhabditis species, they have no orthologs in P. pacificus. This striking contrast suggests that F-box-FBA2 and F-box-FTH genes are Caenorhabditis-specific and might have arisen after the divergence between P. pacificus and the Caenorhabditis genus. Among the remaining 856 F-box genes identified in the five Caenorhabditis species, 111 members include known C-terminal domains such as WD-40, LRR-6, PbH1, and others (data not shown), each of which is present in only one or a few F-box proteins. The remaining 745 F-box genes have no characterized C-terminal domain.

Identification of paralogs and orthologs of F-box genes

In the ENSEMBL database, the homology relationships of genes have been inferred and annotated based on sequence similarity, phylogenetic tree, and chromosomal locations. Paralogs and orthologs of F-box genes in five Caenorhabditis species were downloaded from the ENSEMBL database using BioMart, respectively. F-box genes are unequally divided into paralogous groups of different sizes (Fig. S3). As shown in Fig. S3, F-box domains were lost from many members of each F-box gene paralogous group. Furthermore, the F-box domains were lost from F-box genes unequally among paralogous groups, accounting for ~ 5% to ~ 98% of the total number of genes in paragroups. Of 2725 identified F-box genes, 1519 members were unequally divided into 270 orthologous groups with different sizes. A total of 1053 genes without F-box domain-encoding regions were considered as orthologs of these F-box genes (Fig. S4). Many orthologs of F-box genes from one species are missing in another Caenorhabditis species, presumably by deletion.

Protein sequences from each paragroup and orthogroup were aligned separately. We found several mechanisms potentially responsible for the loss of F-box domains that are present in their homologs: (1) multiple point mutations occurred in the F-box domain-encoding region; (2) long DNA fragments were inserted into the F-box domain region; (3) the whole F-box domain-encoding DNA fragment was deleted from the extant gene (data not shown).

A maximum-likelihood phylogenetic tree was constructed based on F-box domain sequences from 2725 proteins (Fig. 2). According to evolutionary stability, the F-box gene family includes two types of genes: one type with clear, conserved orthologs with bootstrap support in the five Caenorhabditis species, and a second type without a clear one-to-one orthologous group, undergoing rapid birth-death evolution. For simplicity, we refer to the former as “stable” genes and the latter as “unstable” genes, based on their degree of evolutionary conservation across the five Caenorhabditis species. The constructed phylogenetic tree has large species-specific clades and only seven sets of stable orthologous groups, each of which has a single member in each species. The complete phylogeny with statistical support is shown in Fig. S5. We speculate that one-to-one orthologous F-box genes may target endogenous proteins for ubiquitin-mediated degradation as part of conserved normal development or physiology, which changes little with time. In striking contrast to stable genes, the unstable F-box genes have continued to evolve rapidly through species-specific birth-death evolution. It seems reasonable to propose that rapidly evolving F-box genes may recognize foreign proteins as part of the nematode innate immune system. Exogenous viral and bacterial proteins are plausible targets, which would drive an arms race between pathogens and the nematode innate immune system. The phylogenetic tree of the paralogous genes from each species was also reconstructed (Fig. S6).

F-box gene number divergence in Caenorhabditis and underlying mechanisms

The evolutionary dynamics of F-box genes during the long-term evolution of Caenorhabditis were investigated by reconciling the gene tree with the species tree using the maximum parsimony method. A total of 2473 gain events and 144 loss events were inferred to have occurred in the F-box gene family in the Caenorhabditis lineage (Fig. 3). Lineage-specific gene gains and losses revealed highly dynamic evolution of F-box genes in the Caenorhabditis genus, with only 23 putative F-box genes inferred in the ancestral species and as many as 1426 members in C. remanei. The number of F-box genes diverged strikingly after the Caenorhabditis species split, from only 39 members in C. japonica to the largest F-box gene family, with 1426 members, in C. remanei. Unveiling the mechanisms responsible for the origin of these gene duplicates and their functional divergence would illuminate the biological functions of these F-box genes.

The putative F-box genes identified from each Caenorhabditis species were mapped to the corresponding chromosomes, or to contigs when a high-quality chromosome-level genome assembly was not available. The genomic locations of F-box genes are not random: most members reside on particular chromosomes or chromosome arms. For instance, the vast majority of the F-box genes from C. elegans are clustered on Chromosomes II, III, and V (Fig. S7a). Similarly, most F-box genes from C. briggsae are concentrated on Chromosomes III and V (Fig. S7b). The number of F-box genes arising by tandem duplication was estimated according to the intergenic distance, measured as the number of genes residing between them. At least 53%, 43%, 52%, and 74% of F-box genes were inferred to have arisen from tandem duplications in the C. brenneri, C. briggsae, C. elegans, and C. remanei genomes, respectively (Table 2).

Table 2 The number of F-box protein-coding genes arisen through tandem duplications

Gene structural divergence between F-box gene sibling pairs

We compared the gene structures of closely related F-box gene siblings from C. elegans and C. briggsae because of their high-quality genome sequences. Gene structure and sequence identity comparisons across 136 sibling pairs from C. elegans and C. briggsae revealed five distinct mechanisms involved in the divergence of these F-box gene paralogs: 1) exon/intron gains/losses; 2) sequence exonization/pseudoexonization; 3) alteration of exon/intron boundaries; 4) splitting of one exon into two; 5) elongation or shortening of introns by more than a factor of two relative to the ancestral ones. These five mechanisms, as they occurred in five representative sibling pairs, are shown in Fig. 4. Comparisons across 99 sibling pairs from C. elegans are shown schematically in Fig. S8, with their divergence mechanisms summarized in more detail in Supplementary file 1. Among the 99 C. elegans sibling pairs, 41 pairs have diverged through intron elongation to more than twice the length of the homologous intron. Furthermore, substantial intron elongation has occurred more than once in some sibling pairs. The second most frequent mechanism associated with F-box gene divergence in C. elegans is exon/intron gain and loss. Divergence events caused by the other three mechanisms have occurred in 21, 10, and 11 of the 99 C. elegans F-box gene sibling pairs, respectively. Similarly, comparisons across 37 F-box gene sibling pairs from C. briggsae are shown schematically in Fig. S9, with their divergence mechanisms summarized in more detail in Supplementary file 1. The divergence patterns of F-box gene paralogs in C. briggsae are similar to those seen in C. elegans. The most frequent pattern of F-box paralog divergence in C. briggsae is exon/intron gain/loss, which occurred in half of the compared sibling pairs, followed by intron elongation.

The predominance of substantial intron elongation in F-box genes from C. elegans and C. briggsae is intriguing. However, studies of the evolution of intron elongation are scarce. Therefore, we investigated the underlying mechanism of intron sequence elongation in F-box genes from C. elegans and C. briggsae. DotPlot was used to align each sequence with itself for 51 C. elegans and C. briggsae F-box genes in which the intron sequence was substantially elongated. Short-sequence DNA repeats were found in all 51 genes, particularly in the elongated intron regions. The results of dot matrix analysis of two representative genes, F40G9.18 and CBG13796, are illustrated in Fig. 5. The closely related paralogs F40G9.18 and F40G9.9 have the same number of exons and introns and > 80% identical coding regions. Nevertheless, they differ substantially in the length of the second intron, which is 900 bp long in the former versus only 64 bp in the latter (Fig. S8). Remarkably, direct repeats of a ~50 bp DNA sequence are concentrated in the 500–1400 bp region of F40G9.18, where intron 2 resides (Fig. 5a). CBG13796 and CBG13789 are paralogs from C. briggsae with clearly divergent gene structures (Fig. S9). CBG13796 has an intron 2 of 1743 bp with multiple ~50 bp repeats (Fig. 5b), which are absent from its paralog CBG13789. We speculate that the strikingly divergent gene structures of the paralogs CBG13796 and CBG13789 might be caused by the elongated intron. It is therefore reasonable to propose that short-sequence DNA repeats cause substantial intron elongation and thereby provide raw material for establishing divergent exon-intron structures, from which novel functional genes may originate.
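The self-comparison idea behind the dot matrix analysis can be illustrated with a short Python sketch. This is not the DotPlot tool used in the study; the function names, the exact-match word comparison, and the toy sequence are assumptions made only for illustration. A tandem repeat with unit length u appears as dots on diagonals at offsets that are multiples of u.

```python
from collections import defaultdict


def self_dotplot(seq, word=10):
    """Return sorted (i, j) pairs with i < j where the words of length
    `word` starting at positions i and j of `seq` are identical."""
    index = defaultdict(list)
    for i in range(len(seq) - word + 1):
        index[seq[i:i + word]].append(i)
    dots = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                dots.append((positions[a], positions[b]))
    return sorted(dots)


def repeat_offsets(dots):
    """Count dots per diagonal offset (j - i)."""
    counts = defaultdict(int)
    for i, j in dots:
        counts[j - i] += 1
    return dict(counts)


# Toy sequence (hypothetical): three tandem copies of an 8 bp unit
# followed by a non-repetitive tail.
toy_seq = "ACGTTGCA" * 3 + "GGGCCCAAATTT"
toy_offsets = repeat_offsets(self_dotplot(toy_seq, word=8))
```

In the toy sequence, all dots fall on the diagonals at offsets 8 and 16, which is the signature of a direct repeat with an 8 bp unit. Real intron sequences would need a tolerance for mismatches, which this exact-match sketch does not provide.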

Functional divergence of F-box gene duplicates

One possible mechanism for the functional divergence of duplicated genes is via differential temporal- or spatial-specific expression patterns during evolution [4, 5]. For C. elegans, we retrieved transcript data for F-box genes expressed at seven developmental stages: Embryo (EE), early L1 Larvae (LE), L1 Larvae (L1), L2 Larvae (L2), L3 Larvae (L3), L4 Larvae (L4), and Young adult (YA). The expression patterns of F-box genes were compared among the seven developmental stages in C. elegans. Most F-box genes show a stage-specific expression pattern. Some members have exceptionally high expression at the embryonic stage, while others have particularly high expression at a larval stage (Fig. 6a). Using K-means clustering, all members of 42 paralogous groups were clustered into eight groups, and in no paragroup did all members fall into the same cluster. We observed divergent expression patterns for members of the same paralogous group (Fig. 7). Three paralogs at the bottom of Fig. 7 show consistently low expression at each developmental stage, whereas three paralogs at the top are highly expressed at the L4, LE, and EE developmental stages. The remaining seven paralogs, in contrast, have remarkably different stage-specific expression patterns. Gene expression patterns of 99 closely related sibling pairs were further investigated, and 48 have diverged, with differential stage-specific expression patterns.

Similarly, of the 192 putative F-box genes identified in C. briggsae, transcript data were obtained for 177 members expressed at one or more of four developmental stages: Embryo (EE), L2 Larvae (L2), L4 Larvae (L4), and Young adult (YA). The expression patterns of these 177 F-box genes were compared among the four developmental stages of C. briggsae. Divergent expression patterns of F-box genes in C. briggsae are comparable with those seen in C. elegans. The majority of F-box genes show a stage-specific expression pattern (Fig. 6b). As in C. elegans, all 29 F-box gene paragroups from C. briggsae have also diverged in expression patterns. Furthermore, 22 of the 37 closely related sibling pairs have differential stage-specific expression patterns. Therefore, it seems reasonable to conclude that these F-box genes have been sub-functionalized via stage-specific gene expression in C. elegans and C. briggsae.
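The clustering of stage-wise expression profiles described above can be sketched in plain Python. The text does not state how profiles were normalized or how K-means was initialized, so the per-gene z-scoring, the fixed starting centroids, the function names, and the toy expression values below are all assumptions for illustration only.

```python
STAGES = ["EE", "LE", "L1", "L2", "L3", "L4", "YA"]  # stage codes from the text


def zscore(profile):
    """Scale one gene's profile to mean 0 and unit variance (assumed step)."""
    n = len(profile)
    mean = sum(profile) / n
    sd = (sum((x - mean) ** 2 for x in profile) / n) ** 0.5
    return [0.0] * n if sd == 0 else [(x - mean) / sd for x in profile]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression values across the seven stages.
toy = {
    "geneA": [90, 5, 4, 3, 2, 1, 1],   # embryo-biased
    "geneB": [80, 6, 5, 2, 2, 2, 1],   # embryo-biased
    "geneC": [1, 2, 3, 4, 5, 70, 8],   # L4-biased
    "geneD": [2, 1, 2, 5, 6, 95, 9],   # L4-biased
}
names = sorted(toy)
points = [zscore(toy[n]) for n in names]
labels, _ = kmeans(points, init_centroids=[points[0], points[2]])
```

If geneA and geneC belonged to the same paragroup, their different cluster labels would mark that paragroup as diverged in stage-specific expression, which is the kind of comparison reported for the 42 C. elegans paragroups.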

Selection pressure on F-box genes

To investigate the potential contribution of evolutionary constraints to sequence differences between paralogs, we estimated the mutation rates and selection patterns of 95 and 37 pairs of F-box paralogs from C. elegans and C. briggsae, respectively. The pairwise comparisons showed that all dN/dS ratios except one (Y56A3A.10 vs. Y56A3A.14) were smaller than 1 (Supplementary File 1), suggesting that the F-box gene paralogs were subject to purifying selection. Since overall strong purifying selection may obscure the detection of positive selection in some regions, we performed a sliding-window analysis of the dN/dS ratio in pairwise sequence comparisons. The window size was set to 45 codons, with an offset of nine codons between successive windows. The window size roughly corresponds to the size of some structural domains of the F-box proteins. To address the multiple-testing problem inherent in sliding-window analysis, we adopted a trial-and-error approach to guard against a high false-positive rate [29, 30]. Positive selection was supported only if the dN/dS ratio was > 1.5, and purifying selection was indicated by a dN/dS ratio < 0.67, in sliding windows of 45 codons [29]. The sliding-window analysis of dN/dS revealed diverse selection signatures across the coding regions of F-box genes (Fig. S10 and S11). Although most coding regions had dN/dS ratios below 0.67, some sliding windows showed dN/dS ratios greater than 1.5 (Supplementary File 1). Most of the dN/dS ratios in the N-terminal region encoding the first 50 amino acids were less than 1.0 (Fig. S10 and S11), suggesting that the N-terminal F-box domains were under strong purifying selection. Although the gene as a whole, especially the N-terminal region, was subject to purifying selection, positive selection has occurred in some regions. In 73 of the 95 analyzed paralog pairs from C. elegans and 25 of the 37 paralog pairs from C. briggsae, large peaks with dN/dS values greater than 1.5 were found at the C-terminus, indicating that substrate-targeting domains have undergone positive selection (Fig. S10 and S11). In summary, in contrast to the N-terminal F-box domain-encoding regions, the C-terminal regions appear to be under less selective constraint.
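The window layout and thresholds of the sliding-window analysis can be sketched as follows. The window size (45 codons), step (nine codons), and cut-offs (1.5 and 0.67) come from the text; the function names, the handling of alignments shorter than one window, and the label for intermediate values are assumptions. The dN/dS estimation itself is not implemented here and would come from a dedicated tool.

```python
def codon_windows(n_codons, size=45, step=9):
    """Return (start, end) codon indices (0-based, end exclusive) for each
    window. Alignments shorter than one window yield a single window, and
    trailing codons that do not fill a full step are not covered (both
    choices are assumptions)."""
    if n_codons <= 0:
        return []
    if n_codons < size:
        return [(0, n_codons)]
    return [(s, s + size) for s in range(0, n_codons - size + 1, step)]


def classify_window(omega, upper=1.5, lower=0.67):
    """Label a window from its dN/dS ratio using the thresholds in the text."""
    if omega > upper:
        return "positive"
    if omega < lower:
        return "purifying"
    return "unclassified"


def summarize(omegas):
    """Count window labels for one paralog pair, given per-window dN/dS."""
    summary = {"positive": 0, "purifying": 0, "unclassified": 0}
    for omega in omegas:
        summary[classify_window(omega)] += 1
    return summary
```

For a pairwise alignment of 63 codons, `codon_windows(63)` gives three windows starting at codons 0, 9, and 18. Values between 0.67 and 1.5 are deliberately left unclassified, which mirrors the conservative intent of the thresholds.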

Discussion

F-box gene identifying approach in Caenorhabditis

In the present study, the Hidden Markov model, regular expression, and in combination with InterProScan were used to predict F-box protein-encoding genes in Caenorhabditis genus genomes comprehensively. Those highly diverged proteins in the F-box domain region could not be predicted as F-box proteins, although they might still retain F-box protein function. However, more likely, those F-box paralogs that lost F-box domains have evolved into novel functional genes. Although the prediction approach was challenging to avoid false-negative prediction, it was widely applied in numerous studies [31,32,33]. In addition, the identification of F-box genes in humans using our approach is highly reliable [25]. Notably, the duplicates of identified F-box genes have diverged substantially at corresponding F-box domain regions, contributing to their functional divergence. However, this conjecture should be further confirmed by experimental evidence in the future.

A significant expansion of F-box genes within Caenorhabditis genomes

Although the differentiation time of Caenorhabditis species is far longer than that of Euarchontoglires species, the variability of the F-box gene in Caenorhabditis species is more significant than that of Euarchontoglires species [25, 34,35,36]. Many F-box gene duplicates rapidly diverged at the F-box domain region, such as long sequence fragment insertion/deletion and numerous short sequence repeats in intron regions. Once duplicates emerge, redundant copies may undergo relaxed selection pressure, and mutations in sequences provide raw materials for the evolution of novel function elements [4]. Some members of the F-box gene family were not conserved among Caenorhabditis, as each F-box gene in a species does not always have an ortholog in another species. The corresponding ancestral F-box gene may have diverged at the F-box domain region, contributing to the evolution of new traits. The number of F-box genes in Caenorhabditis species was substantially more than that of other animals [15] and even more than plants [17]. Thus, based on lineage-specific F-box genes’ colossal expansion and contraction, we proposed that F-box genes in Caenorhabditis species show remarkably plastic evolution at the level of gene gains and gene loss.

G protein-coupled receptors (GPCRs) form the largest superfamily of cell surface receptors in Caenorhabditis species [37]; C. elegans genome encodes approximately 1300 GPCRs genes, most of which were identified in related species C. briggsae and C. remanei [37]. GPCRs in Caenorhabditis species include 19 prominent families, some of which were species-specific expanded primarily in C. elegans and C. remanei [37]. Many studies in C. elegans demonstrated the crucial role of GPCRs in innate immunity via their signaling in many physiological processes and for detecting a variety of environmental signals, including bacterial secondary metabolites [38, 39]. Therefore, we conclude that the evolutionary dynamics of GPCRs are comparable with that of the F-box gene family in Caenorhabditis species. Furthermore, we speculate that the F-box gene family and GPCRs may function as regulators of innate immunity and are involved in the same physiological process.

Rapid sequence and expression divergence of F-box genes in Caenorhabditis

The present study investigated the mechanisms of gene structure and expression divergence among closely related F-box gene paralogs in C. briggsae and C. elegans. In the relatively short evolutionary history of about twenty million years since the speciation of the Caenorhabditis genus [40], F-box genes have been gained and lost on a massive scale in certain Caenorhabditis species. For instance, C. elegans requires the F-box protein fog-2 [41], which regulates the translation of tra-2 mRNAs during hermaphrodite development [42]. However, C. briggsae lacks fog-2 [43] and instead uses a novel F-box protein, she-1, which was created by a recent gene duplication and acts upstream of tra-2, as fog-2 does in C. elegans [44]. Thus, both species recruited F-box genes produced by recent duplication events into the sex-determination pathway to control hermaphrodite development, but they use distinct paralogs. This implies not only that F-box genes have been massively gained and lost in particular Caenorhabditis species, but also that F-box gene duplicates have diverged rapidly in expression. In addition, stage-specific expression patterns of closely related F-box paralogs were widely observed during the development of C. briggsae and C. elegans, indicating that F-box paralogs may have been sub-functionalized. We speculate that the rapid evolution of F-box genes in Caenorhabditis species was driven by the need to adapt to changing living environments.

F-box genes displayed significant variation in gene number and divergence in structure, function, and expression pattern, implying that these genes play an essential role in environmental adaptation and reproduction [14]. A study showed that the SCF complex responds to microsporidiosis and virus-mediated ubiquitin [45]. The target of the immune proteasome was ubiquitinated by an E3 ubiquitin ligase, although no evidence shows which cullin and adaptor protein were involved in this process. Thomas conjectured that the ancestral system of cullin-mediated degradation of exogenous proteins is also the ancestor of MHC I [14]. If this conjecture is correct, the exogenous and endogenous cullin adaptor proteins might be identified via evolutionary studies.

Conclusions

This study analyzed the underlying mechanisms of F-box gene sequences divergence, gene expression pattern, and gene number gains/losses in five Caenorhabditis species. We identified 594, 192, 377, 39, 1426 F-box homologs in the genome of C. brenneri, C. briggsae, C. elegans, C. japonica, and C. remanei, respectively. In particular, we found that tandem duplications have played an essential role in the enormous expansion of the F-box gene family. There are many mechanisms identified for F-box gene structural divergence. Moreover, analyses of their expression profiles provide functional information for members of the F-box gene family in C. elegans and C. briggsae at different development stages. Importantly, our results shed light on the evolution pattern of F-box genes in Caenorhabditis species, which will provide a valuable resource for understanding the biological roles of individual F-box genes.

Methods

Data retrieval

The proteomic sequences of five Caenorhabditis species (C. brenneri, C. briggsae, C. elegans, C. japonica, C. remanei) and one outgroup species (P. pacificus) were downloaded from the ENSEMBL Genome Browser. The Hidden Markov model and Prosite file of F-box domain were downloaded from PFAM (http://pfam.xfam.org/family/f-box#tabview=tab6) [46] and PROSITE respectively (ftp://ftp.expasy.org/databases/prosite/) [47]. Transcriptome sequencing data of different developmental phases of C. elegans and C. briggsae was downloaded from modENCODE (http://www.modencode.org/) [48].

Genome-wide prediction of F-box genes in five species of Caenorhabditis

The hmmsearch program implemented in the HMMER software [49] was used to search for F-box domain-containing proteins in the proteome sequences of the five Caenorhabditis species and P. pacificus. We also used the regular expression approach implemented in a local script, multi-thread ps_scan.pl, a parallel Perl program modified from ps_scan.pl downloaded from PROSITE [47], to predict F-box proteins. Finally, to comprehensively predict F-box proteins that have diverged substantially in the F-box domain, the F-box proteins identified above were used as PSI-BLAST queries (e-value = 1e-30) against the proteome sequences. All of the identified putative F-box proteins were then scanned for F-box domains using a modified multi-thread iprscan_lwd.pl downloaded from InterProScan [50]. The longest protein isoform per gene was retained in the final F-box protein dataset. F-box domain sequences were aligned separately for each species. The HMM profiles were built using ‘hmmbuild’ from the HMMER package, followed by pairwise alignment of the built profile HMMs with the F-box profile from the PFAM database using LogoMat-P [51]. A schematic overview of the whole pipeline is shown in Fig. S1.
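The isoform-filtering step of the pipeline above can be sketched in Python. This is a minimal illustration rather than the authors' script: the function name, the record layout, the tie-breaking rule (the first isoform encountered is kept), and the toy identifiers are all assumptions.

```python
def longest_isoform_per_gene(records):
    """Keep the longest protein isoform for each gene.

    records: iterable of (gene_id, protein_id, sequence) tuples.
    Returns {gene_id: (protein_id, sequence)}. When two isoforms have the
    same length, the first one encountered is kept (assumed rule).
    """
    best = {}
    for gene_id, protein_id, seq in records:
        kept = best.get(gene_id)
        if kept is None or len(seq) > len(kept[1]):
            best[gene_id] = (protein_id, seq)
    return best


# Hypothetical identifiers and sequences, for illustration only.
toy_records = [
    ("geneX", "geneX.a", "MKLV"),
    ("geneX", "geneX.b", "MKLVTTPE"),
    ("geneY", "geneY.a", "MSD"),
]
toy_final = longest_isoform_per_gene(toy_records)
```

Applying the filter after the HMMER, PROSITE, PSI-BLAST, and InterProScan steps means that each gene contributes exactly one protein to the final dataset, so gene counts are not inflated by alternative isoforms.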

Identification of homology relationship between F-box genes

The paralogs of each F-box gene were downloaded from ENSEMBL using BioMart. Genes that were paralogous to each other were considered a paralogous group (paragroup). The F-box gene orthologs in the five Caenorhabditis species were downloaded from ENSEMBL using BioMart. F-box genes that were orthologous to each other were considered an orthologous group (orthogroup). The pipeline for predicting gene orthology/paralogy relationships in ENSEMBL includes the following basic steps: 1) load a representative translation of each gene from all species used in Ensembl; 2) run an HMM search against the TreeFam HMM library to classify the sequences into families; 3) cluster the genes without any match into additional families by running NCBI BLAST+ (refined with Smith-Waterman) on every orphaned gene against every other (both self and non-self species); 4) break down large families that would be too complex to analyze with QuickTree, limiting them to 1500 genes; 5) for each cluster (family), build a multiple sequence alignment based on the protein sequences using a combination of multiple aligners; 6) for each aligned cluster, build a phylogenetic tree with TreeBeST using the CDS back-translation of the protein multiple alignment from the original DNA sequences; 7) infer pairwise gene relations of orthology and paralogy types from each gene tree.
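Grouping genes into paragroups or orthogroups from pairwise homology relations can be sketched with a union-find structure. The text only states that genes homologous to each other form a group; treating the groups as connected components of the pairwise relation is an assumption, as are the function name and the toy gene identifiers.

```python
def build_groups(pairs):
    """Group genes linked by pairwise homology into connected components.

    pairs: iterable of (gene_a, gene_b) tuples, e.g. paralog pairs exported
    from a homology table. Returns a list of sorted gene lists, largest
    group first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical paralog pairs, for illustration only.
toy_pairs = [("a", "b"), ("b", "c"), ("d", "e")]
toy_groups = build_groups(toy_pairs)
```

With this definition, genes a and c end up in the same paragroup even though they are linked only through b. A stricter definition requiring every pair within a group to be annotated as homologous would produce smaller groups.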

Some genes identified by the above homology search were not predicted to encode an F-box domain. We used sequence alignment to study the possible mechanisms responsible for the absence of F-box domains from those F-box homologs.

Reconstructing the phylogenetic tree of the F-box gene family

The phylogenetic history of the F-box gene family across and within the five Caenorhabditis species and P. pacificus was reconstructed using a maximum likelihood approach. First, the amino acid sequences of the F-box domain regions were extracted from the protein sequences. Second, multiple sequence alignments (MSAs) of the extracted amino acid sequences, across or within each of the five Caenorhabditis species, were generated using MUSCLE 3.52 [52], and gap-rich columns were removed using the gappyout method implemented in trimAl [53]. Third, the resulting alignments were used for gene tree inference in RAxML [54] under the PROTGAMMAVT model of evolution. Statistical support was obtained from 100 bootstrap replicates in RAxML. The online tool iTOL [55] was used to display, manipulate, and annotate the phylogenetic trees.

F-box gene number variation and underlying mechanisms

In Caenorhabditis, a large number of F-box genes are conserved only in the F-box domain region. Whole-gene sequences are therefore unsuitable for constructing the gene trees used to infer gene number variation. In the present study, a gene tree was constructed from the F-box domain sequences of each orthologous group. Next, we reconciled each gene tree with the species tree [56, 57] to infer gene number variation using NOTUNG [58]. Finally, we inferred the total variation for all F-box genes from these per-orthogroup inferences.

The genome sequences of C. elegans and C. briggsae have been assembled into whole chromosomes. The R package RIdeogram [59] was used to show the density distribution of F-box genes on the chromosomes of C. elegans and C. briggsae. Two genes were considered tandem duplicates if no more than twenty genes lay between them [33]. For species without assembled chromosomes, we treated each contig as a chromosome, which results in underestimates of the number of tandem duplicates.
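The tandem-duplication criterion can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. The input layout (gene ID, chromosome or contig, start coordinate), the function name, and the chaining of neighboring F-box genes into clusters are all assumptions, and the sketch checks only gene order, not homology between the genes.

```python
from collections import defaultdict

MAX_INTERVENING = 20  # "no more than twenty genes between them"


def tandem_clusters(genes, fbox_ids, max_intervening=MAX_INTERVENING):
    """Return clusters (size >= 2) of F-box genes lying close in gene order.

    `genes` yields (gene_id, seqid, start) for ALL annotated genes, so that
    intervening non-F-box genes are counted. The layout is an assumption.
    """
    by_seq = defaultdict(list)
    for gene_id, seqid, start in genes:
        by_seq[seqid].append((start, gene_id))

    clusters = []
    for seqid in sorted(by_seq):
        ordered = sorted(by_seq[seqid])  # gene order along the sequence
        ranks = [(rank, gid) for rank, (_, gid) in enumerate(ordered)
                 if gid in fbox_ids]
        current, prev_rank = [], None
        for rank, gid in ranks:
            # rank - prev_rank - 1 is the number of genes in between.
            if prev_rank is not None and rank - prev_rank - 1 <= max_intervening:
                current.append(gid)
            else:
                if len(current) >= 2:
                    clusters.append(current)
                current = [gid]
            prev_rank = rank
        if len(current) >= 2:
            clusters.append(current)
    return clusters


# Toy chromosome "I" with 30 evenly spaced genes (hypothetical IDs).
toy_genes = [(f"I_{i}", "I", i * 1000) for i in range(30)]
clusters = tandem_clusters(toy_genes, {"I_0", "I_5", "I_27"})
print(clusters)
```

Because genes on different contigs are never compared, treating each contig as a chromosome can only miss tandem pairs, which is consistent with the underestimate noted above.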

Divergence of the gene structure of F-box paralogs

A phylogenetic tree was constructed for each identified F-box paralogous group, and the two closest paralogs (siblings) were compared for differences in gene structure. Because transcriptome sequencing data were available for C. elegans and C. briggsae, the divergence mechanisms of F-box paralogs were studied in these two species. Each exon sequence was aligned with the sequence of its sibling using the lfasta program [60], and the similarity between the two compared sequences was then plotted; the whole process was automated with custom Perl scripts. A total of 99 and 37 sibling pairs aligned well in C. elegans and C. briggsae, respectively. The mechanisms of gene structural divergence in these paralogs were then investigated.

Each F-box gene sequence was compared against itself by DNA dot-matrix analysis using DNAMAN version 6.0 (Lynnon Corporation, Pointe-Claire, Quebec, Canada) to find short tandem repeats. Dot plots were generated with the following options: window size 30, minimum identity 60%.
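The idea behind a windowed self dot plot can be sketched in Python. DNAMAN's internal algorithm is not described in the text, so this is a generic sketch that only borrows the two stated parameters (window size 30, minimum identity 60%). The function name and test sequence are hypothetical.

```python
def self_dotplot(seq, window=30, min_identity=0.60):
    """Return (i, j) window starts, i < j, whose windows match at >= min_identity.

    Generic sliding-window dot plot of a sequence against itself; the main
    diagonal (i == j) is skipped because it is trivially identical.
    """
    seq = seq.upper()
    n = len(seq)
    needed = window * min_identity
    hits = []
    for offset in range(1, n - window + 1):
        # Position k on this diagonal compares seq[k] with seq[k + offset].
        diag = [seq[k] == seq[k + offset] for k in range(n - offset)]
        count = sum(diag[:window])
        if count >= needed:
            hits.append((0, offset))
        for i in range(1, len(diag) - window + 1):
            # Slide the window by one: add the new position, drop the old one.
            count += diag[i + window - 1] - diag[i - 1]
            if count >= needed:
                hits.append((i, i + offset))
    return hits


# A 30-nt unit repeated in tandem (hypothetical sequence).
unit = "ACGTACGGTCATTGCAAGCTTAGGCCATGA"
hits = self_dotplot(unit + unit)
print((0, 30) in hits)
```

Runs of hits parallel to the main diagonal at a small, constant offset are the signature of short tandem repeats in such a plot.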

Functional divergence of duplicated F-box genes

F-box genes are a vastly expanded gene family, implying that these duplicates may have diverged in function. Thus, we studied the mechanism responsible for functional divergence of these identified F-box gene paralogs. Transcriptome profiles based on RNA-seq technology for C. elegans and C. briggsae were downloaded from modENCODE. Genome sequences and GTF files for C. elegans and C. briggsae were downloaded from the ENSEMBL database for RNA-seq data analysis.

Index files for the two genomes were generated using Bowtie2 [61]. RNA-seq reads were aligned to the respective genomes using TopHat [62], followed by transcript assembly with Cufflinks [63]. Finally, differential expression analyses were performed using Cuffdiff [1]. The workflow followed a published protocol [64]. Heatmaps of gene expression differences were drawn with the R package gplot from Bioconductor [65]. Developmental stage-specific expression of each F-box paralogous group was calculated using the mean deviation approach in R.

Detecting the selection pressure on F-box genes

To detect the selection pressure on the F-box genes, the codons corresponding to each amino acid aligned between closely related paralogs were extracted using the protein alignment as a guide, and regions containing gaps were excluded using trimAl [53]. The synonymous substitution rate (dS), the nonsynonymous substitution rate (dN), and their ratio (dN/dS) were calculated for the duplicated genes, including in sliding windows, using DnaSP v5.0 [66], following the Nei and Gojobori method with the Jukes and Cantor correction [67, 68]. Sliding window options: window length = 45 bp; step size = 9 bp.
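Two pieces of this calculation can be sketched in Python: the Jukes and Cantor correction, d = -(3/4) ln(1 - 4p/3), applied to a proportion p of observed differences, and the window coordinates for a 45-unit window moved in steps of 9. This is not DnaSP's implementation. The function names are hypothetical, and the counting of synonymous and nonsynonymous sites and differences (the core of the Nei and Gojobori method) is assumed to have been done beforehand.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; None when p >= 0.75 (undefined)."""
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(p_n, p_s):
    """dN/dS from proportions of nonsynonymous (p_n) and synonymous (p_s)
    differences; both are assumed to be precomputed per site category."""
    d_n, d_s = jukes_cantor(p_n), jukes_cantor(p_s)
    if d_n is None or d_s is None or d_s == 0:
        return None  # ratio is undefined
    return d_n / d_s


def sliding_windows(length, window=45, step=9):
    """Half-open (start, end) coordinates of all full windows."""
    return [(s, s + window) for s in range(0, length - window + 1, step)]


print(jukes_cantor(0.1))
print(dn_ds(0.05, 0.10))
print(sliding_windows(63))
```

Windows with no synonymous differences give an undefined ratio, which the sketch reports as None rather than as infinity.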

Availability of data and materials

The proteome and cDNA sequences for six species (C. brenneri, C. briggsae, C. elegans, C. japonica, C. remanei, and P. pacificus) and the GTF files for C. brenneri and C. briggsae are available in the Ensembl Metazoa database, release 51 (https://metazoa.ensembl.org/info/data/ftp/index.html). Expression data for C. elegans were downloaded from the modENCODE database (http://data.modencode.org/cgi-bin/findFiles.cgi?download=6532,3974,3975,4493,4547,4548,3977,3978,4544,4579,4580,3882,4006,4007,4527,4529,4574,4038,4039,4530,4575,4041,4532,3879,4044,4045,4055,4173,4534,4535,4577,4578,4594). Expression data for C. briggsae were downloaded from the modENCODE database (http://data.modencode.org/cgi-bin/findFiles.cgi?download=6528,6529,6530,6532). All other data generated during this study are included in this article and its Additional files. The hidden Markov model of the F-box domain was downloaded from the PFAM database (http://pfam.xfam.org/family/f-box#tabview=tab6). The PROSITE file of the F-box domain was downloaded from the PROSITE database (ftp://ftp.expasy.org/databases/prosite/).

Abbreviations

UPS: Ubiquitin-proteasome system
C. brenneri: Caenorhabditis brenneri
C. briggsae: Caenorhabditis briggsae
C. elegans: Caenorhabditis elegans
C. japonica: Caenorhabditis japonica
C. remanei: Caenorhabditis remanei
SCF: SKP1–CUL1–F-box protein
FBKs: F-box Kelch genes
P. pacificus: Pristionchus pacificus
paragroup: Paralogous group
orthogroup: Orthologous group

References

Cardoso-Moreira M, Long M. The origin and evolution of new genes. In: Anisimova M, editor. Evolutionary genomics. Methods in molecular biology (Methods and Protocols), vol 856. Totowa: Humana Press; 2012. https://doi.org/10.1007/978-1-61779-585-5_7.
Hughes AL. The evolution of functionally novel proteins after gene duplication. Proc R Soc Lond Ser B Biol Sci. 1994;256(1346):119–24.
Bergthorsson U, Andersson DI, Roth JR. Ohno's dilemma: evolution of new genes under continuous selection. Proc Natl Acad Sci. 2007;104(43):17004–9.
Innan H, Kondrashov F. The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010;11(2):97–108.
Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol. 2003;18(6):292–8.
Ohno S. Evolution by gene duplication, vol. 160. Berlin Heidelberg: Springer-Verlag; 1970.
Nei M, Rooney AP. Concerted and birth-and-death evolution of multigene families. Annu Rev Genet. 2005;39:121.
Dittmar K, Liberles D. Evolution after and before gene duplication. In: Dittmar K, Liberles D, editors. Evolution after Gene Duplication. Hoboken: Wiley-Blackwell; 2010. pp. 105–132.
Näsvall J, Sun L, Roth JR, Andersson DI. Real-time evolution of new genes by innovation, amplification, and divergence. Science. 2012;338(6105):384–7.
Walsh JB. How often do duplicated genes evolve new functions? Genetics. 1995;139(1):421–8.
Lynch M, Force A. The probability of duplicate gene preservation by subfunctionalization. Genetics. 2000;154(1):459–73.
Walsh B. Population-genetic models of the fates of duplicate genes. Genetica. 2003;118(2–3):279–94.
Kondrashov FA, Koonin EV. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet. 2004;20(7):287–90.
Thomas JH. Adaptive evolution in two large families of ubiquitin-ligase adapters in nematodes and plants. Genome Res. 2006;16(8):1017–30.
Kipreos ET, Pagano M. The F-box protein family. Genome Biol. 2000;1(5):3002.
Yang X, Kalluri UC, Jawdy S, Gunter LE, Yin T, Tschaplinski TJ, et al. The F-box gene family is expanded in herbaceous annual plants relative to woody perennial plants. Plant Physiol. 2008;148(3):1189–200.
Schumann N, Navarro-Quezada A, Ullrich K, Kuhl C, Quint M. Molecular evolution and selection patterns of plant F-box proteins with C-terminal kelch repeats. Plant Physiol. 2011;155(2):835–50.
Hua Z, Zou C, Shiu SH, Vierstra RD. Phylogenetic comparison of F-box (FBX) gene superfamily within the plant kingdom reveals divergent evolutionary histories indicative of genomic drift. PLoS One. 2011;6(1):e16219.
Navarro-Quezada A, Schumann N, Quint M. Plant F-box protein evolution is determined by lineage-specific timing of major gene family expansion waves. PLoS One. 2013;8(7):e68672.
Binder BM, Walker JM, Gagne JM, Emborg TJ, Hemmann G, Bleecker AB, et al. The Arabidopsis EIN3 binding F-box proteins EBF1 and EBF2 have distinct but overlapping roles in ethylene signaling. Plant Cell. 2007;19(2):509–23.
Han L, Mason M, Risseeuw EP, Crosby WL, Somers DE. Formation of an SCFZTL complex is required for proper regulation of circadian timing. Plant J. 2004;40(2):291–301.
Kim W-Y, Fujiwara S, Suh S-S, Kim J, Kim Y, Han L, et al. ZEITLUPE is a circadian photoreceptor stabilized by GIGANTEA in blue light. Nature. 2007;449(7160):356–60.
Chae E, Tan QK-G, Hill TA, Irish VF. An Arabidopsis F-box protein acts as a transcriptional co-factor to regulate floral development. Development. 2008;135(7):1235–45.
Kim HS, Delaney TP. Arabidopsis SON1 is an F-box protein that regulates a novel induced defense response independent of both salicylic acid and systemic acquired resistance. Plant Cell. 2002;14(7):1469–82.
Wang A, Fu M, Jiang X, Mao Y, Li X, Tao S. Evolution of the F-box gene family in Euarchontoglires: gene number variation and selection patterns. PLoS One. 2014;9(4):e94899.
Li A, Xu G, Kong H. Mechanisms underlying copy number variation in F-box genes: evidence from comparison of 12 Drosophila species. Biodivers Sci. 2011;19(01):3–16.
Bai C, Sen P, Hofmann K, Ma L, Goebl M, Harper JW, et al. SKP1 connects cell cycle regulators to the ubiquitin proteolysis machinery through a novel motif, the F-box. Cell. 1996;86(2):263–74.
Schulman BA, Carrano AC, Jeffrey PD, Bowen Z, Kinnucan ER, Finnin MS, et al. Insights into SCF ubiquitin ligases from the structure of the Skp1–Skp2 complex. Nature. 2000;408(6810):381–6.
Talbert PB, Bryson TD, Henikoff S. Adaptive evolution of centromere proteins in plants and animals. J Biol. 2004;3(4):1–17.
Schmid K, Yang Z. The trouble with sliding windows and the selective pressure in BRCA1. PLoS One. 2008;3(11):e3746.
Gupta S, Garg V, Kant C, Bhatia S. Genome-wide survey and expression analysis of F-box genes in chickpea. BMC Genomics. 2015;16:67.
Jain M, Nijhawan A, Arora R, Agarwal P, Ray S, Sharma P, et al. F-box proteins in rice. Genome-wide analysis, classification, temporal and spatial gene expression during panicle and seed development, and regulation by light and abiotic stress. Plant Physiol. 2007;143(4):1467–83.
Xu G, Hong M, Nei M, Kong H. Evolution of F-box genes in plants: different modes of sequence divergence and their relationships with functional diversification. Proc Natl Acad Sci U S A. 2009;106(3):835–40.
Betancur RR, Orti G, Pyron RA. Fossil-based comparative analyses reveal ancient marine ancestry erased by extinction in ray-finned fishes. Ecol Lett. 2015;18(5):441–50.
Kriegs JO, Churakov G, Kiefmann M, Jordan U, Brosius J, Schmitz J. Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol. 2006;4(4):e91.
Soria-Carrasco V, Castresana J. Diversification rates and the latitudinal gradient of diversity in mammals. Proc R Soc B Biol Sci. 2012;279(1745):4148–55.
Thomas JH, Robertson HM. The Caenorhabditis chemoreceptor gene families. BMC Biol. 2008;6(1):1–17.
Venkatesh SR, Singh V. G protein-coupled receptors: the choreographers of innate immunity in Caenorhabditis elegans. PLoS Pathog. 2021;17(1):e1009151.
Premont RT, Gainetdinov RR. Physiological roles of G protein–coupled receptor kinases and arrestins. Annu Rev Physiol. 2007;69:511–34.
Memar N, Schiemann S, Hennig C, Findeis D, Conradt B, Schnabel R. Twenty million years of evolution: the embryogenesis of four Caenorhabditis species are indistinguishable despite extensive genome divergence. Dev Biol. 2019;447(2):182–99.
Schedl T, Kimble J. fog-2, a germ-line-specific sex determination gene required for hermaphrodite spermatogenesis in Caenorhabditis elegans. Genetics. 1988;119:43–61. https://doi.org/10.1093/genetics/119.1.43.
Clifford R, Lee M-H, Nayak S, Ohmachi M, Giorgini F, Schedl T. FOG-2, a novel F-box containing protein, associates with the GLD-1 RNA binding protein and directs male sex determination in the C. elegans hermaphrodite germline. Development. 2000;127:5265–76. https://doi.org/10.1242/dev.127.24.5265.
Nayak S, Goree J, Schedl T. Fog-2 and the evolution of self-fertile hermaphroditism in Caenorhabditis. PLoS Biol. 2005;3(1):e6.
Guo Y, Lang S, Ellis RE. Independent recruitment of F box genes to regulate hermaphrodite development during nematode evolution. Curr Biol. 2009;19(21):1853–60.
Bakowski MA, Desjardins CA, Smelkinson MG, Dunbar TA, Lopez-Moyado IF, Rifkin SA, et al. Ubiquitin-mediated response to microsporidia and virus infection in C. elegans. PLoS Path. 2014;10(6):e1004200.
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40(D1):D290–301.
De Castro E, Sigrist CJ, Gattiker A, Bulliard V, Langendijk-Genevaux PS, Gasteiger E, et al. ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res. 2006;34(suppl 2):W362–5.
Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010;330(6012):1775–87.
Eddy S. HMMER3: a new generation of sequence homology search software. 2010. URL: http://hmmer.janelia.org.
Mulder N, Apweiler R. InterPro and InterProScan. In: Bergman NH, editor. Comparative genomics. Methods In Molecular Biology™, vol 396. Hoboken: Humana Press; 2007. https://doi.org/10.1007/978-1-59745-515-2_5.
Schuster-Böckler B, Bateman A. Visualizing profile–profile alignment: pairwise HMM logos. Bioinformatics. 2005;21(12):2912–3.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7.
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22(21):2688–90.
Letunic I, Bork P. Interactive tree of life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 2019;47(W1):W256–9.
Kiontke K, Gavin NP, Raynes Y, Roehrig C, Piano F, Fitch DH. Caenorhabditis phylogeny predicts convergence of hermaphroditism and extensive intron loss. Proc Natl Acad Sci U S A. 2004;101(24):9003–8.
Kiontke KC, Félix M-A, Ailion M, Rockman MV, Braendle C, Pénigault J-B, et al. A phylogeny and molecular barcodes for Caenorhabditis, with numerous new species from rotting fruits. BMC Evol Biol. 2011;11(1):339.
Chen K, Durand D, Farach-Colton M. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J Comput Biol. 2000;7(3–4):429–47.
Hao Z, Lv D, Ge Y, Shi J, Weijers D, Yu G, et al. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput Sci. 2020;6:e251.
Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988;85(8):2444–8.
Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012;9(4):357–9.
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11.
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27(17):2325–9.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and cufflinks. Nat Protoc. 2012;7(3):562–78.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10):R80.
Librado P, Rozas J. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics. 2009;25(11):1451–2.
Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3(5):418–26.
Nei M. Molecular evolutionary genetics, New York: Chichester, West Sussex: Columbia University Press; 1987. https://doi.org/10.7312/nei-92038.

Acknowledgments

The authors thank the lab members of the Bioinformatics Center of Northwest A&F University for their valuable advice on research design and paper discussion. We also thank the two anonymous referees and the journal editors for their insightful comments, which significantly improved the paper.

Funding

This work was supported by the National Natural Science Foundation of China (Grant 31771474). The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations

State Key Laboratory of Crop Stress Biology in Arid Areas and College of Life Sciences, Northwest A & F University, Yangling, 712100, Shaanxi, China
Ailan Wang, Wei Chen & Shiheng Tao
Bioinformatics Center, Northwest A&F University, Yangling, Shaanxi, China
Ailan Wang, Wei Chen & Shiheng Tao
Geneis (Beijing) Co., Beijing, China
Ailan Wang

Contributions

ALW and SHT designed the study. ALW and CW performed data curation and analysis. ALW wrote the manuscript, and SHT reviewed and edited the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Shiheng Tao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1.

A workflow to identify F-box genes for the six nematode species. First, the F-box profile HMM (PF00646) and the PROSITE motif (PS50181) were used as patterns to search for F-box proteins in the proteome sequences with the hmmsearch and ps_scan.pl programs, respectively. Second, the putative F-box proteins were used as PSI-BLAST (e-value = 1e-30) search queries against the proteome sequences. Third, all of the putative F-box proteins were scanned for the F-box domain using the modified multi-threaded iprscan_lwd.pl program. Finally, pairwise alignments of the profile HMM built for each species with the known F-box profile HMM were visualized manually.

Additional file 2: Figure S2.

Pairwise alignments of HMM Logos of F-box proteins. The overall height of the letter stacks represents the relative entropy of the distribution of the emission probabilities within some state relative to the background distribution given for the complete profile. The relative height of a letter corresponds to its emission probability from a state’s distribution. The column width denotes the relative contribution of the position to the overall protein family. Insert states are drawn in red. The aligned states in each HMM are framed and connected by a line. The numbers above and below each Logo show state positions in the HMM. a. Alignments of C. elegans-specific F-box HMM with the HMM of PF00646 from the PFAM database. b. Alignments of C. briggsae-specific F-box HMM with the HMM of PF00646 from the PFAM database. c. Alignments of C. brenneri-specific F-box HMM with the HMM of PF00646 from the PFAM database. d. Alignments of C. remanei-specific F-box HMM with the HMM of PF00646 from the PFAM database. e. Alignments of C. japonica-specific F-box HMM with the HMM of PF00646 from the PFAM database. f. Alignments of P. pacificus-specific F-box HMM with the HMM of PF00646 from the PFAM database.

Additional file 3: Figure S3.

The number of genes with and without F-box domains in each F-box paragroup in the six nematode species. The Y-axis is the number of paralogous genes, while the X-axis represents the sequence number. The blue and gray boxes indicate the FBOX and Non-FBOX genes, respectively. Non-FBOX genes lack the F-box domain but were identified as putative paralogs of F-box genes in each paragroup by the ENSEMBL database.

Additional file 4: Figure S4.

The number of genes with and without F-box domains in each F-box orthogroup. The Y-axis is the number of orthologous genes in each orthogroup, while the X-axis represents the sequence number. The blue and gray boxes indicate the FBOX and Non-FBOX genes, respectively. Non-FBOX genes lack the F-box domain but were identified as putative orthologous counterparts in the six nematode species by the ENSEMBL database.

Additional file 5: Figure S5.

The complete phylogenetic relationships of F-box proteins from C. brenneri, C. briggsae, C. elegans, C. japonica, C. remanei, and P. pacificus, color-coded light sea green, chocolate, purple, pink, lime green, and orange, respectively. F-box domain sequences were aligned using MUSCLE. The topology was generated by maximum likelihood analysis using RAxML. Statistical support was obtained from 100 bootstrap replicates in RAxML, and schematic triangles denote bootstrap values greater than 50.

Additional file 6: Figure S6.

Phylogenetic trees of F-box domain sequences from each of the six species. The topology was generated by maximum likelihood analysis using RAxML. Statistical support was obtained from 100 bootstrap replicates in RAxML, and schematic triangles indicate bootstrap values greater than 50. Different colored boxes indicate F-box genes from different clades.

Additional file 7: Figure S7.

Chromosomal distribution of genome-wide identified F-box genes. Chromosome numbers are denoted at the bottom of each bar, and its relative length indicates chromosome size. Gene density is indicated according to the heatmap in the legend at a 1-Mb window scale. a. The gene density of identified F-box genes across the entire genome of C. elegans. Of 377 identified F-box genes, 269 were distributed in 21 gene clusters with five or more F-box genes. Notably, one gene cluster located on Chromosome II consists of 42 F-box genes, and the other one is located on Chromosome III, which includes 32 F-box genes. b. The gene density of identified F-box genes across the entire genome of C. briggsae. ‘un’ indicates an unknown chromosome. Of 192 identified F-box genes, 76 were distributed in 7 gene clusters with five or more F-box genes. Of note, two neighboring gene clusters on Chromosome V consist of 21 and 18 F-box genes.

Additional file 8: Figure S8.

Evolutionarily diverged exon-intron structures of representative sibling paralogs in C. elegans. Ninety-nine sibling paralogs were compared for their exon-intron structure divergence. The color scale shown at the top of the schematic diagram represents the sequence similarity of the aligned homologous region. The numbers above and below each exon/intron denote the nucleotide length of alignments.

Additional file 9: Figure S9.

Evolutionarily diverged exon-intron structures of representative sibling paralogs in C. briggsae. Thirty-seven sibling paralogs were compared for their exon-intron structure divergence. The color scale shown at the top of the schematic diagram represents the sequence similarity of the aligned homologous region. The numbers above and below each exon/intron show the nucleotide length of alignments.

Additional file 10: Figure S10.

Sliding-window plots of dS, dN, and dN/dS in pairwise comparisons of 95 closely related paralogs of F-box genes from C. elegans. The window size is 45 codons, and the offset between windows is nine codons. The solid line represents plots of dN/dS, the short-dotted line indicates plots of dS, and the long-dotted line indicates plots of dN.

Additional file 11: Figure S11.

Sliding-window plots of dS, dN, and dN/dS in pairwise comparisons of 37 closely related paralogs of F-box genes from C. briggsae. The window size is 45 codons, and the offset between windows is nine codons. The solid line represents plots of dN/dS, the short-dotted line indicates plots of dS, and the long-dotted line indicates plots of dN.

Additional file 12

Supplementary Datasets (S1-S6) provide the identified F-box protein sequences for C. brenneri, C. briggsae, C. elegans, C. japonica, C. remanei, and P. pacificus, respectively.

Additional file 13: Supplementary File 1.

Shows the results of the analyses of gene structural diversification, functional diversification, and selection pressure for closely related paralogs from C. elegans and C. briggsae.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Wang, A., Chen, W. & Tao, S. Genome-wide characterization, evolution, structure, and expression analysis of the F-box genes in Caenorhabditis. BMC Genomics 22, 889 (2021). https://doi.org/10.1186/s12864-021-08189-7

Received: 15 November 2020
Accepted: 19 November 2021
Published: 11 December 2021
DOI: https://doi.org/10.1186/s12864-021-08189-7
