Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

Kumar, Charu G; Everts, Robin E; Loor, Juan J; Lewin, Harris A

doi:10.1186/1471-2164-11-161

Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

Research article
Open access
Published: 09 March 2010

Volume 11, article number 161, (2010)
Cite this article

Download PDF

You have full access to this open access article

BMC Genomics Aims and scope Submit manuscript

Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

Download PDF

Charu G Kumar¹,
Robin E Everts¹^nAff3,
Juan J Loor¹ &
…
Harris A Lewin^1,2

8507 Accesses
9 Citations
Explore all metrics

Abstract

Background

The diversity of placental architectures within and among mammalian orders is believed to be the result of adaptive evolution. Although, the genetic basis for these differences is unknown, some may arise from rapidly diverging and lineage-specific genes. Previously, we identified 91 novel lineage-specific transcripts (LSTs) from a cow term-placenta cDNA library, which are excellent candidates for adaptive placental functions acquired by the ruminant lineage. The aim of the present study was to infer functions of previously uncharacterized lineage-specific genes (LSGs) using co-expression, promoter, pathway and network analysis.

Results

Clusters of co-expressed genes preferentially expressed in liver, placenta and thymus were found using 49 previously uncharacterized LSTs as seeds. Over-represented composite transcription factor binding sites (TFBS) in promoters of clustered LSGs and known genes were then identified computationally. Functions were inferred for nine previously uncharacterized LSGs using co-expression analysis and pathway analysis tools. Our results predict that these LSGs may function in cell signaling, glycerophospholipid/fatty acid metabolism, protein trafficking, regulatory processes in the nucleus, and processes that initiate parturition and immune system development.

Conclusions

The placenta is a rich source of lineage-specific genes that function in the adaptive evolution of placental architecture and functions. We have shown that co-expression, promoter, and gene network analyses are useful methods to infer functions of LSGs with heretofore unknown functions. Our results indicate that many LSGs are involved in cellular recognition and developmental processes. Furthermore, they provide guidance for experimental approaches to validate the functions of LSGs and to study their evolution.

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Article Open access 05 December 2014

A survey of best practices for RNA-seq data analysis

Article Open access 26 January 2016

RefFinder: a web-based tool for comprehensively analyzing and identifying reference genes

Article 15 April 2023

Background

Placentae exhibit remarkable variation in tissue structure and morphology within and between mammalian clades, and even within a single mammalian order [1]. The diversity of placental architectures is thought to be the result of adaptive evolution arising from rapidly diverging and novel genes [2–4]. A greater understanding of the functional roles that these genes play would provide insights into the molecular basis for the unique phenotypic and metabolic adaptations among closely related mammalian species. Toward that end, we previously identified and bioinformatically characterized novel transcripts in cattle using placenta as a source tissue [2]. These transcripts are lineage-specific (LSTs), and the genes that encode them have no detectable homology to genes outside of that lineage (LSGs). Functional elucidation of LSGs remains a daunting task and only a few have been characterized beyond their expression patterns [5–10]. A complementary approach that would direct the genetic and biochemical characterization of LSGs and their products is functional inference using co-expression [11] and promoter analysis [12].

Gene expression is regulated by a complex interaction of transcription factors (TFs) and their binding sites (TFBS) on the gene promoter. Co-expression analysis is based upon the assumption that a high degree of similarity in gene expression profiles correlates with relatedness of their functions [11]. Genes that are highly co-expressed are often regulated by common transcription factor(s), forming sub-networks of genes with a common function [12]. As a general rule, co-regulated genes share a specific arrangement of TFBSs on their promoters. The TFBSs are often located in a specific order relative to the transcription start site (TSS) as well as in a particular orientation with respect to the promoter [13]. For example, Kindy et al. [14] showed that both strands of the c-myc gene are transcribed in an overlapping fashion and that transcription of the coding and non-coding strands is regulated independently. Yu and coauthors [15] showed a strong correlation between inter-TFBS distances and their orientation with respect to each other, demonstrating that a combination of TFs rather than an individual TF is the functional unit in tissue-specific gene regulation. Others have shown that the inter-TFBS distance between functionally over-represented TFBS pairs can vary significantly from 10 to 200 bp, although it may be greater than 200 bp in some cases [16–18]. These findings provide insights into factors governing the interactions between specific TFs and document TF pairs that are predicted to act synergistically in a tissue-specific manner [19] or at specific stages of development [20].

In a previous work we identified 91 cattle- and cetartiodactyl-specific novel transcripts that included coding sequences and noncoding RNAs (ncRNAs) [2]. In the present work, we have inferred functions of a subset of these LSTs using co-expression analysis. In addition, we identified over-represented TFBSs and their composites in the promoters of co-expressed genes and searched existing databases and the literature for pathways and functions in which these TFs may play a synergistic role in a specific tissue or developmental stage. Using these functional inferences, we predicted sub-networks of genes that may be co-regulated with the LSGs. Our results predict that subsets of these LSGs function in glycerophospholipid/fatty acid metabolism and protein trafficking in liver and near-term placenta, and in processes involving the initiation of parturition and immune system development.

Results

Identification of tissue-specific and time-series clusters

A strategy for inferring functions of LSTs (Figure 1) was applied to 63 previously identified LSTs [2] (see Methods). Two microarray expression datasets, consisting of profiles of ~7,000 cattle genes [21] and including these LSTs, were used for generating co-expression clusters. From the dataset consisting of profiles of total RNA from 18 cattle tissues, 49 LSTs and 6,178 known genes were selected for further analysis after applying filtering conditions (see Methods). Using the LSTs as seeds, two clusters were identified that showed preferential expression in a specific tissue with at least two-fold higher expression compared to any other tissue (Figures 2 and 3).

From the liver time-series dataset [22], 28 of the 49 LSTs that had tissue profiles and 4,711 known expressed genes were selected for clustering after data filtering (see Methods). Two large clusters were identified with average pairwise Pearson correlation r ≥ 0.75, and r ≥ 0.90 between any LST and transcripts encoded by known genes. The identity of the genes in these clusters overlapped, and instead of merging the clusters by lowering the correlation threshold, we selected the largest cluster containing four LSTs and 208 known transcripts (LIVR) for further analysis. These transcripts were co-expressed at seven time-points and two diets (Figure 2). Apart from liver, the genes in this cluster were expressed at higher levels in adrenal gland, cerebrum, and placentome (Additional file 1).

In order to identify plane of nutrition (diet) and time-dependent relationships from the liver time-series data [22], the 28 LSTs that passed filtering of the liver expression data were hierarchically clustered (Figure 4). The four LSTs in the LIVR cluster (5BP, 39NG, 237NG, 266NG) were down-regulated at -14, +1, and +14 days relative to parturition in animals fed a restricted-energy diet pre-partum (Figure 4). Analysis of the entire LIVR cluster for diet and time-dependent relationships indicated that the genes were under-expressed by 1.8-fold at +1 and +14 days after parturition in liver of animals on a lower (restricted) as compared to higher (ad libitum) plane of nutrition pre-partum (P < 0.0001). This suggests that the expression of the LIVR cluster during the peri-partum period is directly influenced by plane of nutrition pre-partum (Figure 2).

LIVR cluster and functional inference for LSGs

The LIVR cluster contains 208 known genes and four LSGs, 237NG, 266NG, 39NG, and 5BP (Table 1; Figure 2). To determine the subsets of genes in the LIVR cluster that may be co-regulated, we predicted and analyzed TFBSs and their composites in upstream promoter regions [-100, 1000] relative to the TSS using cattle genome (BCM_HGSC Btau_3.1) [23] sequence. p53 and Oct-1 TFBSs were found to be significantly over-represented in the LIVR cluster. The Oct-1 site had a high frequency in the cluster and was predicted in 40 genes including the LSG 39NG. Of the over-represented paired composite TFBSs, Srebp-1*Pax-8 was predicted in the upstream region of two LSGs, 237NG and 266NG, and known genes CCDC12 (coiled-coil domain protein 12), MX1 (interferon-inducible protein p78), NGLY1 (N-glycanase), PLCE1 (phospholipase C, epsilon 1), TRIP10 (CDC42-interacting protein 4), and ZDHHC18 (zinc finger, DHHC domain containing 18). A search of the composite regulatory signature discovery (CRSD) database [24] for TFs Srebp-1 together with Pax-8 identified glycerophospholipid metabolism, among others, as over-represented pathways (Table 2). The cluster was enriched for another paired composite, Sp1*Pax-8, which was predicted in the LSG 237NG and five other known genes. This suggests that 237NG is regulated by all three TFs, Sp1, Srebp-1 and Pax-8.

Table 1 Summary information for nine LSTs co-expressed with known genes.

Full size table

Table 2 Over-represented ordered TFBS pairs and unordered TFBS triplets in LIVR, PLAC and THYM co-expression clusters.

Full size table

An unordered triplet composite of TFBSs (AP-2, ZF5, c-Ets1) that was over-represented in the cluster, was predicted in LSG 237NG and five other known genes (ANKRD16, ARF5, TMEM14C, ARL4A, NSMCE4A). The three TFs that bind to those sites were found to be active in the adipocytokine signaling pathway on the basis of a CRSD search [25]. Comparison of the motifs predicted ab initio using ANN-Spec[26] with known TFBSs identified Elf1- and Sp1-like sites as matches, indicating that Elf1 and Sp1 TFs regulate genes in the LIVR cluster (Additional file 2). As corroborating evidence, an Elf1 binding site was predicted in 13% of the genes in the LIVR cluster, including 266NG and 5BP, and an Sp1 site was predicted in 27% of the genes, which included the LSG 237NG. Analysis of the LIVR cluster using Ingenuity Pathway Analysis (IPA) [27] showed it to be enriched for genes in the glycerophospholipid metabolism pathway, DNA repair, cell death, organ development of epidermis and immune response (Table 3).

Table 3 Ingenuity Pathway Analysis of gene clusters.

Full size table

PLAC cluster and functional inference for LSGs

The PLAC cluster was expressed preferentially in placentome and consisted of 116 genes, including three that are LSGs, 34FL, 22JE, and 104JE (Table 1; Figure 3A). On the basis of PSI-BLAST search [28] and multiple sequence alignments we have annotated one of the LSTs, 34FL [GenBank: NM_001105478], as an SSLP-1 (secreted seminal vesicle protein) homolog, which belongs to a class of secreted Ly6 domain containing proteins. The predicted protein product of 34FL, like the SSLP-1 glycoprotein in mouse [29], has 10 cysteines and contained the conserved C-terminal CCXXXXXCN motif, indicating that it is a member of the SSLP-1 secreted Ly-6 glycoprotein subfamily (Additional file 3). In addition, 34FL was predicted by PSORTII [30] to contain a signal peptide, and was localized to the extracellular region providing evidence that it is a secreted protein. Furthermore, the 34FL gene was located on BTA29 in an orthologous region that is syntenic with mouse SSLP-1 on MMU9.

The PLAC cluster was not found to be enriched for any single TFBS or TFBS triplets. However, we identified four TFBS composite pairs in the cluster (Table 2). The pair, STAT*Pax-2 (signal transducer and activator of transcription; paired homeobox 2), was predicted in the LSG 34FL, PAG2 (Pregnancy associated glycoprotein), and PTGS2 (COX2, prostaglandin-endoperoxide synthase 2). The motifs predicted ab initio by ANN-Spec in the cluster had significant matches to NF-κB (nuclear factor kappa B), MAZ (Myc-associated zinc finger), and Sp1 TFBSs. All three sites were predicted in the cluster at varying frequency, although none were predicted in an LSG (Additional file 2). The cluster was found to be enriched for Wnt/β-catenin signaling and acute phase response (APR) signaling pathways. Other enriched IPA functions in the PLAC cluster were transport of amino acids and synthesis of prostaglandins, adhesion, development of trophoblast cells, and lipid metabolism.

THYM cluster and functional inference for LSGs

A thymus-specific cluster (THYM) was identified, consisting of 32 genes, including two LSGs 383NG and 21PW. Both of these are single-exon transcripts (Figure 3B) and have multiple ESTs from different libraries as evidence of transcription. 383NG is a paralog that has been duplicated in two other locations on the same chromosome [2]. The THYM cluster was found to be enriched for v-Myb (myeloblastosis viral oncogene homolog) and KROX (also EGR, early growth response gene) TFBSs (Table 2). Three TFBS composite pairs were over-represented in the THYM cluster, of which one, Nkx2-5*CdxA, was predicted in the LSG 21PW and ASXL1 (Additional sex comb-like 1). An ab initio predicted motif matched the IRF (Interferon regulatory factor) TFBS. IRF-1 was identified in 21% of the genes in the cluster, including the LSGs 383NG and 21PW (Additional file 2). An analysis of the THYM cluster using IPA showed enrichment for genes involved in apoptosis, immune and lymphatic system development, transcription and trans-activation, and cell proliferation (Table 3).

Gene interaction network for the LIVR cluster

We then used weighted gene co-expression network analysis (WGCNA) [31] to identify sub-networks in the LIVR cluster. Only one module consisting of all the 212 genes was identified indicating the integrity of the cluster. On the basis of gene connectivity measurements, the top five hub genes (CNGB3, GGA1, FABP5, IL22RA1, 237NG) with the highest connectivity were identified in the LIVR cluster, and included the LSG 237NG. The hub genes CNGB3 (cyclic nucleotide gated channel beta 3) and GGA1 (golgi associated, gamma adaptin ear containing, ARF binding protein 1) are known to play roles in ion and intracellular protein transport. The hub genes IL22RA1 (interleukin 22 receptor, alpha 1) functions in cell signaling and FABP5 (fatty acid binding protein 5, epidermal) in FA metabolism and signaling, suggesting that in addition to protein transport, these are dominant processes represented in the LIVR cluster. Using gene interactions from GeneGO MetaCore[32], which is modeled on known pathways in humans, a network was inferred for a subset of genes that are co-expressed with 237NG and 266NG. In addition, hub genes were added to build the network (Figure 5).

Discussion

Functional elucidation of a novel gene is a challenging task. We have used an informatics-based strategy (Figure 1) to infer functions of a set of LSGs first found expressed in a cattle term-placenta cDNA library [2]. This was accomplished by generating co-expression clusters (Figures 2 and 3) using LSTs as seeds to cluster other genes from two microarray datasets consisting of transcript profiles from 18 cattle-tissues and liver of animals fed two different diets at several peripartal time-points. We then identified over-represented TFBSs and their composites in the promoters of co-expressed genes, and searched existing databases and the literature, for pathways and functions in which these TFs may play a role in a specific tissue or developmental stage. Yu and co-authors found that genes targeted by the same TF tend to be co-expressed, with the degree of co-expression increasing if genes share more than one TF [33]. This provides significant validation of our approach, and gives us confidence in the sub-networks of co-regulated genes that were identified. We present below a synthesis of our results with the aim of supporting the inferred functions of LSGs in each cluster.

Evidence supporting inferred functions for LSGs in the LIVR cluster

The LIVR cluster was found to be enriched for genes in the glycerophospholipid metabolism pathway, DNA repair, transport, cell death, organ development of epidermis, and immune response functions (Table 3). These pathways and functions are also characteristic of term placenta [34], which was the source tissue used to create the cDNA library from which the LSTs were identified. In support of the correlated pathways and functions of genes in liver and placenta we also found that the LIVR cluster genes are expressed in placentome (Additional file 1). Glycerophospholipid metabolism plays a significant role in the onset of labor in humans [35], and apoptosis and immunological processes are known to represent important cellular functions in term-placenta [34]. The overlapping functions likely represent common subpopulations of cells in liver and placenta, such as macrophages and lymphocytes.

Genes in the LIVR cluster were enriched for p53 and Oct-1 TFBSs. p53 exerts a variety of regulatory effects following DNA damage [36]. An Oct-1 TFBS has been predicted in the 39NG promoter along with a PPARγ site. PPARγ works in concert with Oct-1 to mediate transcriptional activation of GADD45 (growth arrest and DNA damage-inducible gene 45) [37]. The presence of both Oct-1 and PPARγ sites on the 39NG promoter suggests a role for the encoded protein in DNA repair processes in response to DNA damage. In addition, the protein is predicted by PSORTII[30] to be a nuclear protein, which supports such a role. A paired TFBS composite, Srebp-1*Pax-8, was significantly over-represented in the LIVR cluster, and was predicted in two LSGs, 237NG and 266NG. It was also predicted in PLCE1, NGLY1, and TRIP10, which have known roles in fatty acid (FA) metabolism, turnover of glycoproteins, and lipid binding, respectively (see Additional file 4 for protein functions). Srebp-1 is known to regulate genes involved in the biosynthesis of fatty acids, triglycerides and phospholipids in liver and adipocytes [38], and has been shown to play a role in glycerophospholipid metabolism [39] suggesting that 237NG and 266NG are also involved in these processes. Smith and coauthors reviewed evidence that show Pax-8 works together with Srebp-1 to target PPARγ (peroxisome proliferator-activated receptor gamma) in adipocytes and liver [40]. Some of these LIVR genes were shown to form sub-networks that participate in glycerophospholipid metabolism, protein transport and signaling pathways in liver (Figure 5). The LSG, 237NG, is inferred to play a role in glycerophospholipid metabolism and cytokine signaling, and is one of the hub genes.

The LIVR cluster was enriched for an unordered triplet TFBS composite, AP-2, ZF5, c-Ets1 (p54), which included the LSG 237NG, ANKRD16, ARF5, TMEM14C, ARL4A, and NSMCE4A. The transcription factor AP-2 (TFAP2A) is correlated with expression of cytokine-induced serum amyloid A1 gene (SAA1) in cattle liver [41] and is known to be a repressor for SAA1 [42]. SAA1 plays a role in the immune system, supporting a role for 237NG in cytokine-related immune processes. AP-2 and ZF5 binding sites have been predicted together in liver by Smith et al [43]. A sub-network of genes including ARF5, ARFGAP1 and the hub gene GGA1, which play roles in protein trafficking and localization within the golgi apparatus, was shown previously to be linked with processes in signaling and glycerophospholipid metabolism pathways in liver (Figure 5). A motif predicted ab initio by ANN-Spec shows a significant similarity to the TFBS for Elf1, which is predicted in 13% of the LIVR gene promoters, including the promoter regions of 266NG and 5BP (Additional file 2). The Elf1 TF plays a role in early liver development of mouse embryos [44], thus suggesting a similar role for these two LSGs (Table 4).

Table 4 Inferred biological functions of LSTs.

Full size table

Expression of the LIVR genes was found to be affected by pre-partum diet. They were down-regulated by restricted feeding at +1 and +14 days postpartum suggesting that the predicted functions (e.g., apoptosis, glycerophospholipid metabolism, DNA repair mechanisms, and cell signaling) are down-regulated during the early post-partum period when the animals are fed restricted diets that do not meet 100% of the estimated energy requirements during the non-lactating period. This management strategy is more successful in preparing the animal to the onset of parturition and lactation, and leads to lower incidence of metabolic disease [22]. Therefore, animals on a higher plane of nutrition (i.e. consuming diets to meet or exceed energy requirements) show increased inflammatory responses, apoptosis, and DNA repair; a conclusion shared by Loor and coauthors [22]. Above, we suggested that glycerophospholipid metabolism is a common function in liver and near-term placenta in animals approaching labor and delivery. Metabolic processes in both tissues have been shown to be affected by diet in non-ruminants. For example, in pregnant mice the FA composition in the mother's diet influences the maternal liver and fetal placenta FA composition [45, 46]. These findings suggest that the LIVR genes, many of which are involved in FA-linked functions, protein transport and cell signaling, play similar diet-responsive roles in both liver and placenta of pregnant animals (Figure 5), given that nearly all (99%) of the LIVR genes, including the LSGs, are also expressed in the placenta (Additional file 1).

Evidence supporting inferred functions of LSGs in the PLAC cluster

The PLAC cluster genes were found to be preferentially up-regulated in placentome and enriched for specific processes in the placenta; e.g. transport of amino acids and synthesis of prostaglandins, trophoblast cell adhesion, lipid metabolism, transcription, and cell proliferation (Table 3). The cluster is also enriched for acute phase response (APR) genes, which function to restore homeostasis. These APR gene products are a variety of serum proteins synthesized in increased amounts in response to trauma and infection. Given that labor and delivery result in oxidative and immunological stresses, with APR and apoptotic responses in placental tissue [47], APR enrichment provides a snap-shot of these processes in near-term placenta. The cluster is also enriched for Wnt/β-catenin signaling, which has been shown to play a central role in coordinating uterus-embryo interactions required for implantation in mouse [48].

The composite TFBS pair, STAT*Pax-2, was over-represented in three co-expressed genes; 34FL, PAG2, and PTGS2. The PWM for the predicted STAT binding site is common to a range of STAT proteins that are involved in the development and function of the immune system and play a role in maintaining immune tolerance and tumor surveillance. PTGS2 is a biosynthetic isoenzyme that was shown in pregnant cows and guinea-pigs to be involved in intrauterine prostaglandin (PG) synthesis, which is crucial for the initiation of parturition [49, 50]. PTGS2 was found to be 20-fold greater in cattle term placentomes (delivery at 260 days or later) compared with preterm placentomes (delivery between day 174 and day 260 of gestation) further supporting its role in parturition [34]. Given that our data show that 34FL (a predicted SSLP-1 glycoprotein), PAG2, and PTGS2 are highly co-expressed and predicted to be regulated by STAT TFs, we suggest that 34FL also plays a role in pregnancy and/or parturition.

The ANN-Spec motifs predicted ab initio in the PLAC cluster have significant matches to TFBSs for NF-κB (nuclear factor kappa B), MAZ (Myc-associated zinc finger), and Sp1 (Additional file 2). NF-κB is known to initiate transcription for a variety of genes that are involved in immune response, acute phase and inflammatory processes [51]. It has been located in human fetal membranes and decidua at term and pre-term delivery [52]. The physiological expression of COX-2 (PTGS2) in rat trophoblast involves a sustained activation of NF-κB, and its inhibition abrogates the inducibility of PTGS2[48]. This result functionally links NF-κB and PTGS2 with the other co-expressed genes in the PLAC cluster, suggesting a complex role for glycoproteins including 34FL in initiating and orchestrating the cell biology at the feto-maternal interface before parturition (Table 4).

Evidence supporting functional inference for LSGs in the THYM cluster

The thymus is an immune system organ that is of central importance to the maturation of T lymphocytes. Genes in the THYM cluster are enriched for the related functions immune system and lymphatic system development, cell death, and cellular growth and proliferation (Table 3). The v-Myb TFBS was over-represented in the THYM cluster and predicted in LSG 383NG. The v-Myb oncogene product causes late onset T cell lymphomas when expressed in the T cell lineage of transgenic mice [53], thus suggesting a role for this LSG 383NG in cell-proliferation. The TFBS composite pair Nkx2-5*CdxA was over-represented in promoters of 21PW and ASXL1 (additional sex comb-like 1). CdxA and Nkx2 have been shown to be markers for endoderm germ layer patterning during gastrulation, a process necessary for formation of the thymus [54]. The AsxL1 gene in Drosophila is required to maintain homeotic gene activation and silencing, and its homologs have been identified in mouse and found to be expressed in thymus [55]. The roles played by the TFs CdxA and Nkx2 in endoderm germ layer patterning, and that of ASXL1 in homeotic gene activation and silencing support a role for the LSG 21PW in thymus development. Furthermore, the IRF-1 TFBS, which regulates IL-15 gene expression and influences the development of T-cells and natural killer cells in the thymus [56], is predicted in LSGs 383NG and 21PW (Additional file 2). Taken together, these findings implicate 383NG and 21PW in immune system development and cell-proliferation (Table 4).

Conclusions

We selected the placenta as a model system to identify and functionally characterize novel LSGs because of its unique characteristics as a rapidly evolving physiological system in mammals. As we and others have shown, the placenta is a rich source of expressed LSGs and rapidly diverging genes [2, 3, 57–60]. Such genes are candidates for adaptive placental functions acquired by the ruminant lineage. We used a combination of cluster analysis, promoter analysis, WCGNA, and gene annotation to predict the functions of nine previously uncharacterized LSGs (Table 4) from a starting set of 49 (18%). The stringent analysis criteria produced unique and highly correlated gene expression clusters among 18 different tissues and across seven time-points and two diets in liver (Figures 2 and 3). The three clusters analyzed contained nine LSTs, seven of which are encoded by presumptive novel protein encoding LSGs and two are presumptive ncRNAs [2]. Our results represent a major advance in characterizing the novel LSTs expressed in bovine placenta and have yielded predictions of functions that are consistent with their putative role in ruminant reproductive and immune physiology.

As additional animal genomes are sequenced and the numbers of novel genes with unknown functions increases, our approach establishes a valuable precedent for future studies. We show that it is possible to identify and characterize a significant fraction of lineage-specific genes bioinformatically, which may guide hypothesis-driven experiments to determine their biochemical and cellular functions. These may in turn yield new insights into the role of LSGs in speciation and adaptive evolution.

Methods

Source datasets

In a previous study, 91 novel transcripts were identified in a cattle placenta cDNA library. These LSTs were characterized on the basis of their genomic distribution and annotation in Btau_2.0 and expression patterns in 18 cattle tissues [2]. For the present work, the annotation was updated to Btau_3.1 (December 2007) [23]. Of the original 91 LSTs, 63 currently have no matches to non-Cetartiodactyl sequences in public databases [61]. The remaining 28 transcripts were not considered in this study as they were re-annotated as representing divergent homologs.

Two cDNA microarray expression datasets profiling ~7,000 cattle genes were used. The cDNAs used for the array were selected from a near-term cattle placenta cDNA library [21]. The first dataset (GEO GSE3029) was obtained by profiling total RNA from 18 cattle tissues [21]. For this dataset, transcripts were included in the analysis if the intensity was above the median signal intensity of negative control spots present on the array, and in addition, the minimum intensity was 250 units in at least one sample-point. The second dataset [GEO: GSE3331] was generated by temporal gene expression profiling of liver RNA during the peripartal period in Holstein cows fed with a moderate energy ad-libitum, or restricted diet in which the animals were fed to consume ca. 80% of their calculated energy requirements from -65 days until parturition [22]. The temporal data spanned -65 to +49 days relative to parturition for animals receiving each diet. Expression levels of the transcripts were analyzed further if the intensity was above the median signal intensity of negative control spots present on the array, the minimum raw intensity was 150 in at least one sample-point, and the relative expression compared to the control was statistically significant in at least one sample-point with a raw P-value (P) < 0.05. For both datasets, only those intensity spots that were flagged as 'present' were included in the analysis.

Tissue expression profile clustering

Among the 63 LSTs, 49 were present in at least one of the 18 tissues with a raw intensity of 250 (Additional file 5). In addition to these 49 LSTs, expression levels of 6,178 transcripts passed this filter. The LSTs were clustered using Pearson correlation (r) threshold of 0.90. A representative was selected from each cluster and the un-clustered LSTs were self-represented. Genes on the array that co-expressed with each of these LSTs at r ≥ 0.90 were grouped into clusters that included all co-expressed LSTs. The cluster was adjusted to bring the average cluster r ≥ 0.75.

Clustering of temporal liver gene expression profile

Of the 49 LSTs, 28 were present in at least one liver sample with a raw intensity of 150 and significantly expressed at P < 0.05 compared to a control mixture of tissues excluding liver. Expression of 4,711 unique genes passed these filter conditions. The temporal profiles of the LSTs were clustered hierarchically using gene-condition clustering as implemented in GeneSpring[62]. The liver gene expression profiles of representative LSTs were used as seeds to identify co-expressed genes from the 4,711 genes on the array using Pearson correlation (r) at a threshold r ≥ 0.90. As before, the clusters were adjusted to bring the average cluster r ≥ 0.75.

A mixed effects model using the SAS procedure Proc MIXED (SAS Institute, Cary, North Carolina, USA) [63] was used on the 212 unique genes in the liver cluster to determine expression differences between groups of animals on two diets (moderate energy ad-libitum and restricted) at different time points (-65, -30, -14, +1, +14, +28, +49 days). The LOG2-transformed ratios were analyzed for each gene using a mixed model that included the fixed effect of diet within time point. Statistically significant P-values for the models were adjusted for multiple comparisons using the Benjamini-Hochberg false discovery rate (FDR) correction [64].

Functional annotation and assignment of genome coordinates of genes in clusters

Functions, gene symbols and genome coordinates were assigned to each clone accession on the array using RefSeq (Btau_3.1) and human protein annotations in UCSC genome browser tables [65]. Manual curation of the clusters involved removing identical genes and using the UCSC browser to check if each gene was annotated with the correct gene symbol and genome coordinates. This manual inspection was crucial for ensuring the transcription start site of genes and their promoter regions.

Promoter extraction

Mammalian regulatory elements are concentrated near transcription start sites (TSS). For this reason, promoter analysis was concentrated on the proximal promoter region, -1000 to +100 bp relative to the TSS. Both unmasked and repeat-masked promoter sequences [-1000, +100] were extracted for gene clusters from Btau_3.1 using the UCSC Genome Table browser. To identify TFBSs that are over-represented in the gene clusters, we used promoters of unique cattle RefSeq genes as the background set. The coordinates for a non-redundant set of Btau_3.1 RefSeq genes were downloaded from UCSC Genome Table Browser, and the proximal promoters [-1000, +100] were extracted as described for the clusters.

Identification of transcription factor binding site (TFBS)

Vertebrate-specific TFBSs were predicted by scanning the repeat-masked promoters using the Match program [66] with a core similarity threshold of 0.9 and a matrix similarity threshold of 0.85. The promoters in each cluster were searched against a predefined matrix profile in the TRANSFAC Professional 11.4 database [67]. This database contained a set of 214 high-quality, vertebrate-specific, non-redundant position weight matrices (PWMs) with minimized false positives ("vertebrate_non_redundant_minFP" with high-quality matrices selected). Only a single occurrence of a TFBS was counted in each promoter, and the predicted TFBS counts in each cluster were compared to those in the cattle RefSeq promoter set using Fisher's exact test (FET), which is based on a hypergeometric distribution. The computed P-values were adjusted for multiple comparisons using the Benjamini-Hochberg FDR correction [64].

Identification of over-represented co-occurring TFBS combinations

Two TFBSs were defined as co-occurring if they were distinct, non-overlapping, and their PWMs had a core similarity threshold of 0.9 and matrix similarity threshold of 0.85 in the output from the Match program. The Match output lists the PWM matches in their positional order on the promoter. Both ordered (A-B ≠ B-A relative to the TSS) and unordered (A-B = B-A) TFBS pairs and triplets were predicted separately, and the orientation of the TFBSs was ignored. To prevent double-counting, only a single occurrence of a combination was counted per gene. For the unordered combinations, non-redundancy was ensured by collapsing each identified combination in its sorted order (A-B-A or A-C-B or B-A-C = A*B*C), and then counting only a unique occurrence of the unordered combination within a gene promoter. Unordered TFBS composites were denoted with a comma separating the sites. Composite ordered TFBSs were denoted with an asterisk between the sites indicating that they were predicted to be co-occurring in that order in the promoter relative to the TSS. TFBS pairs and triplets were predicted for three different minimum threshold distances of 20 bp, 50 bp, and 100 bp between the TFBSs to identify all adjacent non-overlapping TFBS combinations [16]. The maximum allowed inter-TFBS distance was set to 250 bp.

The counts of the ordered pair and triplet TFBSs were computed for each cluster of genes and compared to the counts of the respective pairs and triplets in the background RefSeq promoters using Fisher's exact test. A minimum cell count of five was necessary for comparisons. The computed P-values were adjusted for multiple comparisons using the FDR correction as before, and comparisons were deemed significant if the adjusted P-value was ≤ 0.1. The entire analysis was carried out for both repeat-masked and unmasked promoters and significant predictions in the repeat-masked promoters had to be predicted in the unmasked cluster promoters to be selected. This precautionary measure ensured that no predictions were within repeats. TFBS prediction results were manually checked for the presence of over-represented composites.

Ab initio motif prediction and comparison to known TFBSs

ANN-Spec[26] was used for ab initio prediction of motifs that were common to an entire cluster. For each cluster of genes, the motif predictions were made on unmasked promoters, using the unmasked Btau_3.1 RefSeq promoters from which the cluster genes were subtracted as background. ANN-Spec was run iteratively by varying the predicted motif length from 6 to 16 bp and setting the run cycle (parameter m) to 100. The PWMs of predicted motifs were parsed from the ANN-Spec output. Tomtom[68] was used to compare the predicted PWMs with TRANSFAC v11.4 PWMs and comparisons with P < 0.01 were deemed significant. Logos depicting the frequency of each nucleotide at each position of a motif were generated for the ANN-Spec-predicted and corresponding matching Transfac PWMs using the EnoLOGO web server [69].

Functional classification of clusters

Ingenuity Pathway Analysis (version 5.5) [27] was used to identify functional enrichment in the clusters, using the respective source gene sets (6,149 genes in tissue experiments, 4,711 in liver time-series experiments) as reference. The Ingenuity Pathway Knowledge Base (IPKB) was used as the source database for biological function and pathway assignment to genes. The significance threshold for function and pathway enrichment was P ≤ 0.05.

To identify known pathways in which the TFs were involved we queried the CRSD, which consists of miRNA, TF and gene expression regulatory signatures assigned to specific BioCarta and KEGG pathways using genome-wide enrichment analysis [24]. A Perl script was written that accepted a TFBS composite and parsed the dataset for the co-occurrence of the TFs in the composite in a common pathway at a P < 10^-03. In addition, we used the Predicted Regulatory Module (PREMOD) database [70] to identify any known modules within our set of TFBS composites.

Identification of genes with the highest connectivity using WGCNA

To identify sub-networks of co-expressed genes in the LIVR cluster and "hub genes" we used WGCNA (Weighted Gene Co-Expression Network) [71]. Expression ratios and log10 transformed P-values were used as input for the 212 genes in the LIVR cluster. WGCNA uses Pearson correlation to calculate an adjacency matrix using the power adjacency function defined as follows [71]:

where a_ij is the adjacency between two genes i and j, x is the expression of a gene, and β is the power factor for a scale-free network. For the LIVR cluster, this power was 8 as determined by the scale-free network criterion provided by the authors [71]. Default parameters were used for module generation. Gene connectivity was determined, and the top five genes with the highest connectivity (hub genes) were identified using 1.2 as a cutoff for gene significance and intramodular connectivity (K/Kmax) cutoff of 0.95.

Network inference using GeneGO

GeneGO MetaCore[32] was used to identify known interactions in the LIVR cluster of genes, modeled on the human interaction database included in GeneGO.

Abbreviations

LST:: Lineage-specific transcript
LSG:: Lineage-specific gene
ncRNA:: noncoding RNA
TFBS:: Transcription factor binding site
TF:: Transcription factor
PWM:: position weight matrix
LIVR:: cluster of genes expressed in liver and showing effect of diet
PLAC:: cluster of genes preferentially expressed in cattle placenta
THYM:: cluster of genes preferentially expressed in cattle thymus
SKIN:: cluster of genes preferentially expressed in cattle skin
ADRBRN:: cluster of genes preferentially expressed in cattle adrenal gland, thalamus, and cerebellum
FET:: Fisher's exact test
FDR:: Benjamini-Hochberg false discovery rate.

References

Mossman HW: Vertebrate Fetal Membranes. 1987, Houndmills: MacMillan, 288-290.
Book Google Scholar
Kumar CG, Larson JH, Band MR, Lewin HA: Discovery and characterization of 91 novel transcripts expressed in cattle placenta. BMC Genomics. 2007, 8: 113-128. 10.1186/1471-2164-8-113.
Article PubMed Central PubMed Google Scholar
Larson JH, Kumar CG, Everts RE, Green C, Wind Everts-van der A, Band MR, Lewin HA: Discovery of eight novel divergent homologs expressed in cattle placenta. Physiol Genomics. 2006, 25: 405-413. 10.1152/physiolgenomics.00307.2005.
Article CAS PubMed Google Scholar
Lespinet O, Wolf YI, Koonin EV, Aravind L: The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 2002, 12: 1048-59. 10.1101/gr.174302.
Article CAS PubMed Central PubMed Google Scholar
Yoon H, He H, Nagy R, Davuluri R, Suster S, Schoenberg D, Pellegata N, Chapelle Ade L: Identification of a novel noncoding RNA gene, NAMA, that is downregulated in papillary thyroid carcinoma with BRAF mutation and associated with growth arrest. Int J Cancer. 2007, 121: 767-75. 10.1002/ijc.22701.
Article CAS PubMed Google Scholar
Martens JA, Laprade L, Winston F: Intergenic transcription is required to repress the Saccharomyces cerevisiae SER3 gene. Nature. 2004, 429: 571-4. 10.1038/nature02538.
Article CAS PubMed Google Scholar
Westerman BA, Poutsma A, Steegers EA, Oudejans CB: C2360, a nuclear protein expressed in human proliferative cytotrophoblasts, is a representative member of a novel protein family with a conserved coiled coil-helix-coiled coil-helix domain. Genomics. 2004, 83: 1094-104. 10.1016/j.ygeno.2003.12.006.
Article CAS PubMed Google Scholar
Krause A, Sillard R, Kleemeier B, Klüver E, Maronde E, Conejo-García JR, Forssmann WG, Schulz-Knappe P, Nehls MC, Wattler F, Wattler S, Adermann K: Isolation and biochemical characterization of LEAP-2, a novel blood peptide expressed in the liver. Protein Sci. 2003, 12: 143-52. 10.1110/ps.0213603.
Article CAS PubMed Central PubMed Google Scholar
Onyango P, Lubyova B, Gardellin P, Kurzbauer R, Weith A: Molecular cloning and expression analysis of five novel genes in chromosome 1p36. Genomics. 1998, 50: 187-98. 10.1006/geno.1997.5186.
Article CAS PubMed Google Scholar
Jordan KL, Evans DL, Steelman S, Hall DJ: Isolation of two novel cDNAs whose products associate with the amino terminus of the E2F1 transcription factor. Biochemistry. 1996, 35: 12320-8. 10.1021/bi9611927.
Article CAS PubMed Google Scholar
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-97.
Article CAS PubMed Central PubMed Google Scholar
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22: 281-5. 10.1038/10343.
Article CAS PubMed Google Scholar
Werner T: Proteomics and regulomics: the yin and yang of functional genomics. Mass Spectrom Rev. 2004, 23: 25-33. 10.1002/mas.10067.
Article CAS PubMed Google Scholar
Kindy MS, McCormack JE, Buckler AJ, Levine RA, Sonenshein GE: Independent regulation of transcription of the two strands of the c-myc gene. Mol Cell Biol. 1987, 7: 2857-62.
Article CAS PubMed Central PubMed Google Scholar
Yu X, Lin J, Masuda T, Esumi N, Zack DJ, Qian J: Genome-wide prediction and characterization of interactions between transcription factors in Saccharomyces cerevisiae. Nucleic Acids Res. 2006, 34: 917-27. 10.1093/nar/gkj487.
Article CAS PubMed Central PubMed Google Scholar
Hu Z, Hu B, Collins JF: Prediction of synergistic transcription factors by function conservation. Genome Biol. 2007, 8: R257-76. 10.1186/gb-2007-8-12-r257.
Article PubMed Central PubMed Google Scholar
Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res. 2005, 33: 3154-64. 10.1093/nar/gki624.
Article PubMed Central PubMed Google Scholar
Hannenhalli S, Levy S: Predicting transcription factor synergism. Nucleic Acids Res. 2002, 30: 4278-84. 10.1093/nar/gkf535.
Article CAS PubMed Central PubMed Google Scholar
Yu X, Lin J, Zack DJ, Qian J: Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res. 2006, 34: 4925-36. 10.1093/nar/gkl595.
Article CAS PubMed Central PubMed Google Scholar
Keller MA, Addya S, Vadigepalli R, Banini B, Delgrosso K, Huang H, Surrey S: Transcriptional regulatory network analysis of developing human erythroid progenitors reveals patterns of coregulation and potential transcriptional regulators. Physiol Genomics. 2006, 28: 114-28. 10.1152/physiolgenomics.00055.2006.
Article CAS PubMed Google Scholar
Everts RE, Band MR, Liu ZL, Kumar CG, Liu L, Loor JJ, Oliveira R, Lewin HA: A 7872 cDNA microarray and its use in bovine functional genomics. Vet Immunol Immunopathol. 2005, 105: 235-45. 10.1016/j.vetimm.2005.02.003.
Article PubMed Google Scholar
Loor JJ, Dann HM, Guretzky NA, Everts RE, Oliveira R, Green CA, Litherland NB, Rodriguez-Zas SL, Lewin HA, Drackley JK: Plane of nutrition prepartum alters hepatic gene expression and function in dairy cows as assessed by longitudinal transcript and metabolic profiling. Physiol Genomics. 2006, 27: 29-41. 10.1152/physiolgenomics.00036.2006.
Article CAS PubMed Google Scholar
BCM_HGSC, Baylor College of Medicine Human Genome Sequencing Centre. [http://www.hgsc.bcm.tmc.edu/projects/bovine]
Liu CC, Lin CC, Chen WS, Chen HY, Chang PC, Chen JJ, Yang PC: CRSD: a comprehensive web server for composite regulatory signature discovery. Nucleic Acids Res. 2006, 34: W571-7. 10.1093/nar/gkl279.
Article CAS PubMed Central PubMed Google Scholar
CRSD. [http://cgap.nci.nih.gov/Pathways/BioCarta/h_CCR3Pathway]
Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000, 467-78.
Google Scholar
IPA, Ingenuity Pathway Analysis. [http://www.ingenuity.com]
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-402. 10.1093/nar/25.17.3389.
Article CAS PubMed Central PubMed Google Scholar
Li SH, Lee RK, Lin MH, Hwu YM, Lu CH, Chen YJ, Chen HC, Chang WH, Chang WC: SSLP-1, a secreted Ly-6 protein purified from mouse seminal vesicle fluid. Reproduction. 2006, 132: 493-500. 10.1530/rep.1.01183.
Article CAS PubMed Google Scholar
PSORTII. [http://psort.ims.u-tokyo.ac.jp/]
Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007, 18: 463-72. 10.1007/s00335-007-9043-3.
Article PubMed Central PubMed Google Scholar
GeneGo MetaCore. [http://www.genego.com]
Yu H, Luscombe NM, Qian J, Gerstein M: Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 2003, 19: 422-7. 10.1016/S0168-9525(03)00175-6.
Article CAS PubMed Google Scholar
Everts RE, Chavatte-Palmer P, Razzak A, Hue I, Green CA, Oliveira R, Vignon X, Rodriguez-Zas SL, Tian XC, Yang X, Renard JP, Lewin HA: Aberrant gene expression patterns in placentomes are associated with phenotypically normal and abnormal cattle cloned by somatic cell nuclear transfer. Physiol Genomics. 2008, 33: 65-77. 10.1152/physiolgenomics.00223.2007.
Article CAS PubMed Google Scholar
Rice GE: Glycerophospholipid metabolism and human labour. Reprod Fertil Dev. 1995, 7: 613-22. 10.1071/RD9950613.
Article CAS PubMed Google Scholar
Gillet R, Grimber G, Bennoun M, Caron de Fromentel C, Briand P, Joulin V: The consequence of p53 overexpression for liver tumor development and the response of transformed murine hepatocytes to genotoxic agents. Oncogene. 2000, 19: 3498-507. 10.1038/sj.onc.1203671.
Article CAS PubMed Google Scholar
Bruemmer D, Yin F, Liu J, Berger JP, Sakai T, Blaschke F, Fleck E, Van Herle AJ, Forman BM, Law RE: Regulation of the growth arrest and DNA damage-inducible gene 45 (GADD45) by peroxisome proliferator-activated receptor γ in vascular smooth muscle cells. Circ Res. 2003, 93: e38-47. 10.1161/01.RES.0000088344.15288.E6.
Article CAS PubMed Google Scholar
Horton JD: Sterol regulatory element-binding proteins: transcriptional activators of lipid synthesis. Biochem Soc Trans. 2002, 30: 1091-5. 10.1042/BST0301091.
Article CAS PubMed Google Scholar
Seashols SJ, del Castillo Olivares A, Gil G, Barbour SE: Regulation of group VIA phospholipase A2 expression by sterol availability. Biochim Biophys Acta. 2004, 1684: 29-37.
Article CAS PubMed Google Scholar
Smith MR, Kantoff PW: Peroxisome proliferator-activated receptor gamma (PPARγ) as a novel target for prostate cancer. Invest New Drugs. 2002, 20: 195-200. 10.1023/A:1015670126203.
Article CAS PubMed Google Scholar
Loor JJ, Dann HM, Everts RE, Oliveira R, Green CA, Guretzky NA, Rodriguez-Zas SL, Lewin HA, Drackley JK: Temporal gene expression profiling of liver from periparturient dairy cows reveals complex adaptive mechanisms in hepatic function. Physiol Genomics. 2005, 23: 217-26. 10.1152/physiolgenomics.00132.2005.
Article CAS PubMed Google Scholar
Ren Y, Liao WS: Transcription factor AP-2 functions as a repressor that contributes to the liver-specific expression of serum amyloid A1 gene. J Biol Chem. 2001, 276: 17770-8. 10.1074/jbc.M010307200.
Article CAS PubMed Google Scholar
Smith AD, Sumazin P, Das D, Zhang MQ: Mining ChIP-chip data for transcription factor and cofactor binding sites. Bioinformatics. 2005, 21 (Suppl 1): i403-12. 10.1093/bioinformatics/bti1043.
Article CAS PubMed Google Scholar
Ramamurthy L, Barbour V, Tuckfield A, Clouston DR, Topham D, Cunningham JM, Jane SM: Targeted disruption of the CP2 gene, a member of the NTF family of transcription factors. J Biol Chem. 2001, 276: 7836-42. 10.1074/jbc.M004351200.
Article CAS PubMed Google Scholar
Amusquivar E, Herrera E: Influence of changes in dietary fatty acids during pregnancy on placental and fetal fatty acid profile in the rat. Biol Neonate. 2003, 83: 136-45. 10.1159/000067963.
Article CAS PubMed Google Scholar
Herrera E, Amusquivar E, López-Soldado I, Ortega H: Maternal lipid metabolism and placental lipid transfer. Horm Res. 2006, 65 (Suppl 3): 59-64. 10.1159/000091507. Review.
Article CAS PubMed Google Scholar
Reddy A, Zhong XY, Rusterholz C, Hahn S, Holzgreve W, Redman CW, Sargent IL: The effect of labour and placental separation on the shedding of syncytiotrophoblast microparticles, cell-free DNA and mRNA in normal pregnancy and pre-eclampsia. Placenta. 2008, 29: 942-9. 10.1016/j.placenta.2008.08.018.
Article CAS PubMed Google Scholar
Mohamed OA, Jonnaert M, Labelle-Dumais C, Kuroda K, Clarke HJ, Dufort D: Uterine Wnt/β-catenin signaling is required for implantation. Proc Natl Acad Sci USA. 2005, 102: 8579-84. 10.1073/pnas.0500612102.
Article CAS PubMed Central PubMed Google Scholar
Fuchs AR, Rust W, Fields MJ: Accumulation of cyclooxygenase-2 gene transcripts in uterine tissues of pregnant and parturient cows: stimulation by oxytocin. Biol Reprod. 1999, 60: 341-8. 10.1095/biolreprod60.2.341.
Article CAS PubMed Google Scholar
Welsh T, Mitchell CM, Walters WA, Mesiano S, Zakar T: Prostaglandin H2 synthase-1 and -2 expression in guinea pig gestational tissues during late pregnancy and parturition. J Physiol. 2005, 569: 903-12. 10.1113/jphysiol.2005.098129.
Article CAS PubMed Central PubMed Google Scholar
Zabel U, Schreck R, Baeuerle PA: DNA binding of purified transcription factor NF-κ B. Affinity, specificity, Zn2+ dependence, and differential half-site recognition. J Biol Chem. 1991, 266: 252-60.
CAS PubMed Google Scholar
Callejas NA, Casado M, Boscá L, Martín-Sanz P: Requirement of nuclear factor κ B for the constitutive expression of nitric oxide synthase-2 and cyclooxygenase-2 in rat trophoblasts. J Cell Sci. 1999, 112: 3147-55.
CAS PubMed Google Scholar
Davies J, Badiani P, Weston K: Cooperation of Myb and Myc proteins in T cell lymphomagenesis. Oncogene. 1999, 18: 3643-7. 10.1038/sj.onc.1202956.
Article CAS PubMed Google Scholar
Dessimoz J, Opoka R, Kordich JJ, Grapin-Botton A, Wells JM: FGF signaling is necessary for establishing gut tube domains along the anterior-posterior axis in vivo. Mech Dev. 2006, 123: 42-55. 10.1016/j.mod.2005.10.001.
Article CAS PubMed Google Scholar
Fisher CL, Randazzo F, Humphries RK, Brock HW: Characterization of Asxl1, a murine homolog of Additional sex combs, and analysis of the Asx-like gene family. Gene. 2006, 369: 109-18. 10.1016/j.gene.2005.10.033.
Article CAS PubMed Google Scholar
Ohteki T, Yoshida H, Matsuyama T, Duncan GS, Mak TW, Ohashi PS: The transcription factor interferon regulatory factor 1 (IRF-1) is important during the maturation of natural killer 1.1+ T cell receptor-α/β+ (NK1+ T) cells, natural killer cells, and intestinal intraepithelial T cells. J Exp Med. 1998, 187: 967-72. 10.1084/jem.187.6.967.
Article CAS PubMed Central PubMed Google Scholar
Hughes AL, Green JA, Garbayo JM, Roberts RM: Adaptive diversification within a large family of recently duplicated, placentally expressed genes. Proc Natl Acad Sci USA. 2000, 97: 3319-23. 10.1073/pnas.050002797.
Article CAS PubMed Central PubMed Google Scholar
Xie S, Green J, Bixby JB, Szafranska B, DeMartini JC, Hecht S, Roberts RM: The diversity and evolutionary relationships of the pregnancy-associated glycoproteins, an aspartic proteinase subfamily consisting of many trophoblast-expressed genes. Proc Natl Acad Sci USA. 1997, 94: 12809-16. 10.1073/pnas.94.24.12809.
Article CAS PubMed Central PubMed Google Scholar
Roberts RM: Interferon-tau, a Type 1 interferon involved in maternal recognition of pregnancy. Cytokine Growth Factor Rev. 2007, 18: 403-8. 10.1016/j.cytogfr.2007.06.010.
Article CAS PubMed Google Scholar
Chakrabarty A, MacLean JA, Hughes AL, Roberts RM, Green JA: Rapid evolution of the trophoblast kunitz domain proteins (TKDPs)-a multigene family in ruminant ungulates. J Mol Evol. 2006, 63: 274-82. 10.1007/s00239-005-0264-3.
Article CAS PubMed Google Scholar
Laboratory of Mammalian Genome Biology. [http://lewinlab.igb.uiuc.edu/Research/NovelGenes.html]
GeneSpring. [http://www.chem.agilent.com/scripts/pds.asp?lpage=27881]
SAS. [http://www.sas.com]
Benjamini Y, Hochberg Y: On the adaptive control of the false discovery fate in multiple testing with independent statistics. J Educ Behav Stats. 2000, 25: 60-83.
Article Google Scholar
UCSC Browser. [http://genome.ucsc.edu]
Kel AE, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003, 31: 3576-9. 10.1093/nar/gkg585.
Article CAS PubMed Central PubMed Google Scholar
Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Münch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31: 374-8. 10.1093/nar/gkg108.
Article CAS PubMed Central PubMed Google Scholar
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS: Quantifying similarity between motifs. Genome Biol. 2007, 8: R24-32. 10.1186/gb-2007-8-2-r24.
Article PubMed Central PubMed Google Scholar
Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV: enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res. 2005, 33: W389-92. 10.1093/nar/gki439.
Article CAS PubMed Central PubMed Google Scholar
Blanchette M, Bataille AR, Chen X, Poitras C, Laganière J, Lefèbvre C, Deblois G, Giguère V, Ferretti V, Bergeron D, Coulombe B, Robert F: Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res. 2006, 16: 656-68. 10.1101/gr.4866006.
Article CAS PubMed Central PubMed Google Scholar
Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005, 4: Article17-
PubMed Google Scholar
Discenza MT, Dehbi M, Pelletier J: Overlapping DNA recognition motifs between Sp1 and a novel trans -acting factor within the wt1 tumour suppressor gene promoter. Nucleic Acids Res. 1997, 25: 4314-22. 10.1093/nar/25.21.4314.
Article CAS PubMed Central PubMed Google Scholar
Numoto M, Yokoro K, Yasuda S, Yanagihara K, Niwa O: Detection of mouse skeletal muscle-specific product, which includes ZF5 zinc fingers and a VP16 acidic domain, by reverse transcriptase PCR. Biochem Biophys Res Commun. 1997, 236: 20-5. 10.1006/bbrc.1997.6769.
Article CAS PubMed Google Scholar
Lantinga-van Leeuwen IS, Leonhard WN, Dauwerse H, Baelde HJ, van Oost BA, Breuning MH, Peters DJ: Common regulatory elements in the polycystic kidney disease 1 and 2 promoter regions. Eur J Hum Genet. 2005, 13: 649-59. 10.1038/sj.ejhg.5201392.
Article CAS PubMed Google Scholar
Lalancette C, Platts AE, Lu Y, Lu S, Krawetz SA: Computational identification of transcription frameworks of early committed spermatogenic cells. Mol Genet Genomics. 2008, 280: 263-74. 10.1007/s00438-008-0361-2.
Article CAS PubMed Google Scholar
Garcia MG, Tirado-Gonzalez I, Handjiski B, Tometten M, Orsal AS, Hajos SE, Fernández N, Arck PC, Blois SM: High expression of survivin and down-regulation of Stat-3 characterize the feto-maternal interface in failing murine pregnancies during the implantation period. Placenta. 2007, 28: 650-7. 10.1016/j.placenta.2006.09.010.
Article CAS PubMed Google Scholar
Cheng CK, Yeung CM, Hoo RL, Chow BK, Leung PC: Oct-1 is involved in the transcriptional repression of the gonadotropin-releasing hormone receptor gene. Endocrinology. 2002, 143: 4693-701. 10.1210/en.2002-220576.
Article CAS PubMed Google Scholar
Ishiguro N, Matsui T, Shinagawa M: Specific expression of cellular oncogenes c-my c and c-myb in T-cell lines established from three types of bovine lymphosarcomas. Am J Vet Res. 1993, 54: 2010-4.
CAS PubMed Google Scholar
Savage AK, Constantinides MG, Han J, Picard D, Martin E, Li B, Lantz O, Bendelac A: The Transcription Factor PLZF Directs the Effector Program of the NKT Cell Lineage. Immunity. 2008, 29: 1-13. 10.1016/j.immuni.2008.07.011.
Article Google Scholar

Download references

Acknowledgements

We would like to thank Prof. Sheng Zhong, Department of Bioengineering, University of Illinois at Urbana-Champaign, for providing critical advice on gene clustering using the seeding approach and Dr. Denis Larkin for his helpful comments on the manuscript. We would also like to thank the UCSC genome bioinformatics staff for their help at various times with data download using the genome browser tables.

Author information

Robin E Everts
Present address: SEQUENOM, Inc, 3595 John Hopkins Court, San Diego, CA, 92121, USA

Authors and Affiliations

Department of Animal Sciences, University of Illinois at Urbana-Champaign, 210 Edward R Madigan Laboratory, 1201 W Gregory Dr, Urbana, IL, 61801, USA
Charu G Kumar, Robin E Everts, Juan J Loor & Harris A Lewin
Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Room 1608, Urbana, IL, 61801, USA
Harris A Lewin

Authors

Charu G Kumar
View author publications
You can also search for this author in PubMed Google Scholar
Robin E Everts
View author publications
You can also search for this author in PubMed Google Scholar
Juan J Loor
View author publications
You can also search for this author in PubMed Google Scholar
Harris A Lewin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Harris A Lewin.

Additional information

Authors' contributions

CGK and HAL conducted the research, analyzed the results, and wrote the manuscript, REE contributed to data analysis, and JJL provided the liver time-series data and participated in the analysis. All authors have read and approved the manuscript.

Electronic supplementary material

12864_2009_2755_MOESM1_ESM.PDF

Additional file 1: Tissue expression profile of the LIVR cluster of genes. Tissue expression profile of the LIVR cluster of genes includes two of the LSTs (237NG, 5BP) and 104/208 other genes in the cluster. L_Intestine, large intestine; M_L_Node, mesenteric lymph node; S_Intestine, small intestine. (PDF 531 KB)

12864_2009_2755_MOESM2_ESM.PDF

Additional file 2: Comparison of ANN-Spec. ab initio predicted motif PWMs with known Transfac binding site PWMs. This table displays the frequency logos of PWMs that were predicted using ANN-Spec and for known Transfac binding sites with which the predicted PWMs were significantly matched. (PDF 498 KB)

12864_2009_2755_MOESM3_ESM.PDF

Additional file 3: Multiple alignment of 34FL with mouse SSLP-1. Multiple alignment of 34FL with mouse SSLP-1 and other secreted Ly6 domain containing proteins having 10 conserved cysteine residues. Dashes in the alignment indicate gaps, and gray shaded areas indicate conserved cysteines. (PDF 24 KB)

Additional file 4: Functions of gene products. Text file containing glossary of gene functions. (TXT 3 KB)

12864_2009_2755_MOESM5_ESM.PDF

Additional file 5: LSTs used for clustering with tissue expression data. Initial set of 49 LSTs used as seeds to cluster the tissue expression data. (PDF 65 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Kumar, C.G., Everts, R.E., Loor, J.J. et al. Functional annotation of novel lineage-specific genes using co-expression and promoter analysis. BMC Genomics 11, 161 (2010). https://doi.org/10.1186/1471-2164-11-161

Download citation

Received: 23 November 2009
Accepted: 09 March 2010
Published: 09 March 2010
DOI: https://doi.org/10.1186/1471-2164-11-161

Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

Abstract

Background

Results

Conclusions

Similar content being viewed by others

Background

Results

Identification of tissue-specific and time-series clusters

LIVR cluster and functional inference for LSGs

PLAC cluster and functional inference for LSGs

THYM cluster and functional inference for LSGs

Gene interaction network for the LIVR cluster

Discussion

Evidence supporting inferred functions for LSGs in the LIVR cluster

Evidence supporting inferred functions of LSGs in the PLAC cluster

Evidence supporting functional inference for LSGs in the THYM cluster

Conclusions

Methods

Source datasets

Tissue expression profile clustering

Clustering of temporal liver gene expression profile

Functional annotation and assignment of genome coordinates of genes in clusters

Promoter extraction

Identification of transcription factor binding site (TFBS)

Identification of over-represented co-occurring TFBS combinations

Ab initio motif prediction and comparison to known TFBSs

Functional classification of clusters

Identification of genes with the highest connectivity using WGCNA

Network inference using GeneGO

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation