Introduction

Mutations can influence phenotype at two levels: the coding sequence and the level of gene expression. It has long been postulated that between species with highly similar gene sequences, such as human and chimpanzee, differences in gene expression level may account for most of the phenotypic differences.15 In recent years, advances in microarray technology have made genome-wide expression data available for many species, which greatly facilitates the analysis of gene expression divergence in evolution.

Many studies have investigated expression divergence between species; they are summarized in several recent reviews.4,10,23 Briefly, the initial study found a large number of quantitative changes in gene expression among primates,2 but follow-up studies produced somewhat conflicting results. Some studies support a neutral model of expression divergence,11,12,14,25 while others5,9,17 showed the general existence of stabilizing constraints. Some studies have also found positive selection in primate or mammalian expression divergence.5,13 There are also many studies on expression divergence in non-mammalian species;3,16,18–20,24 most support a stabilizing selection model but do not exclude the existence of neutral or positive factors.

Although the details of the evolutionary models are still debated, we recognize that most current results are consistent with the stabilizing selection model (allowing for a neutral component). The stabilizing selection model is also a valid null hypothesis for testing positive selection. In a recent study by Gu and Su,7 we proposed a statistical model of gene expression divergence driven by stabilizing selection. That study also proposed a “tissue-driven” hypothesis, which states that the selective pressures on the evolution of sequence and expression are correlated and can be partitioned among tissues, and that tissues differ in their tolerance to gene expression variance.

We also recognize that individual genes may be under different selective pressures, owing to their particular properties. The Gene Ontology (GO) categorization8 is arguably the most comprehensive summary of the functional and positional properties of genes. In this report, we further tested the “tissue-driven” hypothesis by sub-grouping genes into GO categories. We examined the distribution of the tissue expression distance of the major GO categories across tissues. We also examined the correlation between tissue expression distance and tissue sequence distance or tissue duplicate distance in the major GO categories. We found that the tissue-specific constraints on expression divergence are not dominated by any particular GO category. On the other hand, sub-grouping genes into GO categories improved the resolution and sensitivity of expression divergence analysis.

Materials and methods

Datasets

Homology information for human and mouse genes was obtained from NCBI HomoloGene (http://www.ncbi.nlm.nih.gov/HomoloGene). Only reviewed RefSeq genes were used, and all orthologous pairs are reciprocally unique. A total of 17,462 high-quality human–mouse orthologous pairs were identified. The expression information was retrieved from the GNF SymAtlas (http://symatlas.gnf.org).21,22 The data were from the human HG-U133A and GNF1H chips, and the mouse GNF1M chip. Our study focused on 29 orthologous (adult) tissues or cell types (referred to as ‘tissues’ for simplicity) shared by the mouse and human expression datasets, which are listed as follows: adipose tissue (at), adrenal gland (ag), amygdala (ad), bone marrow (bm), cerebellum (cb), CD4+ T cells (T4), CD8+ T cells (T8), dorsal root ganglion (dr), heart (ht), hypothalamus (hp), kidney (kn), liver (li), lung (lu), lymph node (ln), olfactory bulb (oc), ovary (ov), pancreas (pc), pituitary (pi), placenta (pl), prostate (pt), salivary gland (sg), skeletal muscle (sm), testis (ts), thymus (tm), thyroid (tr), tongue (to), trachea (tc), trigeminal (tg), and uterus (ur); the abbreviations in parentheses are used in the text and figures. As suggested by the original authors,22 we used the normalized (log2-based) ratio value (AffyRatio) of the median expression value among biological replicates. Using the annotation tables available at http://symatlas.gnf.org, we mapped the human–mouse orthologous genes onto the human and mouse Affymetrix tags. The final dataset included 8936 human–mouse ortholog pairs with expression information.

Tissue Expression Distance (\( E_{ti} \))

Consider a set of n orthologous gene pairs between species 1 (human) and species 2 (mouse). Let \( x_{1,g,ti} \) and \( x_{2,g,ti} \) denote the (log2-transformed) expression levels of the genes in the g-th pair in tissue ti. The mean tissue expression distance \( \bar E_{ti} \) is calculated as

$$ \bar E_{ti} = \sum\limits_{g = 1}^n {(x_{1,g,ti} - x_{2,g,ti} )^2 } /n $$
(1)

For genes in GO category cat, the mean expression distance in tissue ti is

$$ \bar E_{ti,cat} = \sum\limits_{g = 1}^{n_{cat} } {(x_{1,g,ti} - x_{2,g,ti} )^2 } /n_{cat} $$
(2)

where \( n_{cat} \) is the number of orthologous gene pairs in the cat category.
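As an illustration of Eqs. 1 and 2, the following minimal Python sketch computes the mean squared difference of log2 expression values for one tissue, optionally restricted to the genes of one GO category. The function name, the dictionary layout, and the toy values are assumptions made for this example and are not taken from the original analysis.

```python
from typing import Dict, Optional, Set, Tuple

# (human_expr, mouse_expr): log2-scale expression of one ortholog pair in one tissue
Pair = Tuple[float, float]


def mean_expression_distance(
    pairs: Dict[str, Pair],
    category_genes: Optional[Set[str]] = None,
) -> float:
    """Mean squared difference of log2 expression over ortholog pairs (Eq. 1).

    If category_genes is given, only pairs whose ID is in that set are
    averaged, which corresponds to Eq. 2.
    """
    if category_genes is None:
        selected = list(pairs.values())
    else:
        selected = [v for g, v in pairs.items() if g in category_genes]
    if not selected:
        raise ValueError("no ortholog pairs to average")
    return sum((x1 - x2) ** 2 for x1, x2 in selected) / len(selected)


# Toy values for one tissue (invented for illustration, not from the dataset)
liver = {"g1": (1.0, 0.0), "g2": (2.0, 2.0), "g3": (0.5, -1.5)}
print(mean_expression_distance(liver))                # averages 1, 0 and 4
print(mean_expression_distance(liver, {"g1", "g3"}))  # averages 1 and 4
```

Repeating the call for each of the 29 tissues would give one \( \bar E_{ti} \) (or \( \bar E_{ti,cat} \)) value per tissue.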

Tissue Sequence Distance (\( D_{ti} \))

For the human genes that are expressed in tissue ti, we calculated the mean tissue sequence distance \( \bar D_{ti} \) as the mean evolutionary distance between these genes and their mouse orthologs:

$$ \bar D_{ti} = \sum\limits_{g = 1}^n { - \ln (I_g /100)/n} $$
(3)

where n is the number of human genes expressed in tissue ti, and \( I_g \) is the sequence identity percentage (ranging from 0 to 100) of orthologous gene pair g between human and mouse. For the genes in GO category cat, the mean tissue sequence distance is

$$ \bar D_{ti,cat} = \sum\limits_{g = 1}^{n_{cat} } { - \ln (I_g /100)/n_{cat} } $$
(4)

where \( n_{cat} \) is the number of human genes that are expressed in tissue ti and belong to GO category cat.
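Eqs. 3 and 4 can be sketched in Python as follows. The sketch assumes that the identity percentages of the genes expressed in a tissue (and, for Eq. 4, in a GO category) have already been collected into a list; the function names are hypothetical.

```python
import math
from typing import Iterable


def poisson_distance(identity_percent: float) -> float:
    """Distance -ln(I/100) for a sequence identity percentage I."""
    if not 0.0 < identity_percent <= 100.0:
        raise ValueError("identity must be in (0, 100]")
    return -math.log(identity_percent / 100.0)


def mean_sequence_distance(identities: Iterable[float]) -> float:
    """Mean of -ln(I_g/100) over the supplied ortholog pairs (Eq. 3 or Eq. 4)."""
    values = [poisson_distance(i) for i in identities]
    if not values:
        raise ValueError("no genes expressed in this tissue/category")
    return sum(values) / len(values)


# Toy identities (invented): two identical pairs and one pair at 50% identity
print(mean_sequence_distance([100.0, 100.0, 50.0]))
```

An identity of 0% is rejected here because the logarithm is undefined at that point; how such pairs were handled in the original analysis is not stated.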

For each orthologous pair, the evolutionary distance was estimated with the Poisson correction. For a human gene, the tissues with “High expression” are defined as those in which the expression level of the gene is above its median expression level among all 79 human tissues in the dataset; the tissues with “Normal expression” are defined as the minimum set of tissues whose cumulative expression reading exceeds 97.5% of the sum of expression readings in all 79 human tissues. Our definition of “Normal expression”, in effect, excludes tissues with trace expression (though still above background) from being counted as expressing tissues. By these definitions, the expression level of a gene in a tissue is compared to its expression in other tissues, rather than to that of other genes in the same tissue.
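The two expression definitions can be sketched as follows. This is a minimal illustration that assumes each gene's profile is a mapping from tissue name to a non-negative expression reading; under that assumption, adding tissues from the largest reading downward yields the minimum set that exceeds the threshold. The function names and toy profile are hypothetical.

```python
from statistics import median
from typing import Dict, Set


def high_expression_tissues(profile: Dict[str, float]) -> Set[str]:
    """Tissues where the gene's reading is above its own median across tissues."""
    cutoff = median(profile.values())
    return {tissue for tissue, value in profile.items() if value > cutoff}


def normal_expression_tissues(
    profile: Dict[str, float], fraction: float = 0.975
) -> Set[str]:
    """Smallest tissue set whose summed readings exceed `fraction` of the total.

    Assumes non-negative readings, so taking tissues in descending order of
    reading reaches the threshold with the fewest tissues.
    """
    target = fraction * sum(profile.values())
    selected: Set[str] = set()
    running = 0.0
    ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
    for tissue, value in ranked:
        if running > target:
            break
        selected.add(tissue)
        running += value
    return selected


# Toy profile with four tissues (invented; the dataset has 79 human tissues)
profile = {"liver": 900.0, "kidney": 60.0, "heart": 30.0, "brain": 10.0}
print(sorted(high_expression_tissues(profile)))
print(sorted(normal_expression_tissues(profile)))
```

In the toy profile, brain contributes only trace signal and is therefore excluded from the “Normal expression” set, while the “High expression” set is narrower still.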

Tissue Duplicate Distance (\( T_{dup} \))

Duplicated genes were identified based on the methods in Gu et al.6 A total of 2698 human duplicate pairs were identified. To analyze the correlation between tissue expression distance (\( E_{ti} \)) and tissue duplicate distance (\( T_{dup} \)), we further selected the pairs (1312 pairs) that duplicated before the human–mouse split. A human duplicate pair is deemed to have duplicated before the human–mouse split if each gene in the pair has a reciprocally unique ortholog in mouse and the distance between the two human duplicates is larger than the distances between the genes and their corresponding mouse orthologs. We excluded pairs that duplicated after the human–mouse split, because the tissue expression distance (\( E_{ti} \)) is calculated from human–mouse orthologous genes, and including recently duplicated gene pairs would bias the analysis.
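The selection rule for pre-split duplicates can be sketched as below. The sketch assumes that “longer than the corresponding human–mouse orthologs” means the paralog distance exceeds both human–mouse ortholog distances; that reading, the data layout, and all names are assumptions of this example.

```python
from typing import Dict, List, NamedTuple


class DuplicatePair(NamedTuple):
    gene_a: str
    gene_b: str
    paralog_distance: float  # distance between the two human duplicates


def duplicated_before_split(
    pair: DuplicatePair, ortholog_distance: Dict[str, float]
) -> bool:
    """Return True if the pair is judged to predate the human-mouse split.

    ortholog_distance maps a human gene to the distance to its reciprocally
    unique mouse ortholog; genes without such an ortholog are absent.
    """
    d_a = ortholog_distance.get(pair.gene_a)
    d_b = ortholog_distance.get(pair.gene_b)
    if d_a is None or d_b is None:
        return False
    # Assumed reading: the paralogs must be more diverged than both orthologs
    return pair.paralog_distance > max(d_a, d_b)


# Toy data (invented for illustration)
orthologs = {"h1": 0.10, "h2": 0.15, "h3": 0.20}
candidates: List[DuplicatePair] = [
    DuplicatePair("h1", "h2", 0.40),  # kept: 0.40 exceeds 0.10 and 0.15
    DuplicatePair("h1", "h3", 0.12),  # dropped: 0.12 does not exceed 0.20
    DuplicatePair("h2", "h4", 0.90),  # dropped: h4 has no mouse ortholog
]
kept = [p for p in candidates if duplicated_before_split(p, orthologs)]
print(kept)
```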

Consider a set of m duplicated gene pairs. For the j-th pair, the expression levels of the two duplicate genes in a given tissue (ti) are denoted as \( x_j \) and \( y_j \), respectively. Thus, the mean tissue duplicate distance in a given tissue can be calculated as

$$ \bar T_{dup} = \sum\limits_{j = 1}^m {(x_j - y_j )^2 } /m $$
(5)

For the genes in GO category cat, the mean tissue duplicate distance is

$$ \bar T_{dup,cat} = \sum\limits_{j = 1}^{m_{cat} } {(x_j - y_j )^2 } /m_{cat} $$
(6)

where \( m_{cat} \) is the number of duplicate gene pairs in GO category cat.

Gene Ontology (GO) Association of Genes

The Gene Ontology was downloaded from the GO consortium (http://www.geneontology.org/).1 Gene Ontology association was based on NCBI LocusLink (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene). We used the major categories defined by the “generic GO slim”,8 which offers a high-level view of the ontologies. Custom Perl scripts were written to map the genes onto the categories, taking into account the DAG (directed acyclic graph) structure of the GO terms.
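The mapping step can be illustrated with a short sketch. It is written in Python rather than Perl and is not the authors' script; it only shows one common way to respect the DAG structure, namely collecting every ancestor of each annotated term (a term may have several parents) and keeping those that belong to the slim set. All term names below are invented placeholders, not real GO identifiers or relationships.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus every term reachable through parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a DAG node can be reached along several paths
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def map_to_slim(
    annotations: Dict[str, Set[str]],
    parents: Dict[str, Set[str]],
    slim_terms: Set[str],
) -> Dict[str, Set[str]]:
    """Map each gene to the slim categories found among its terms' ancestors."""
    result: Dict[str, Set[str]] = {}
    for gene, terms in annotations.items():
        reached: Set[str] = set()
        for term in terms:
            reached |= ancestors(term, parents)
        result[gene] = reached & slim_terms
    return result


# Placeholder ontology: leaf_1 has two parents, so it reaches both slim terms
parents = {
    "leaf_1": {"mid", "slim_Y"},
    "leaf_2": {"mid"},
    "mid": {"slim_X"},
}
slim_terms = {"slim_X", "slim_Y"}
annotations = {"geneA": {"leaf_1"}, "geneB": {"leaf_2"}, "geneC": set()}
print(map_to_slim(annotations, parents, slim_terms))
```

Because a gene is assigned to every slim category it reaches, one gene can be counted in several categories.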

Results

Tissue Expression Distance between Human and Mouse in Different GO Categories

We calculated the mean tissue expression distance \( \bar E_{ti,cat} \) between human and mouse in the major GO categories. Fig. 1 shows histograms of the \( \bar E_{ti,cat} \) values across tissues for several major GO “molecular function” categories (see Supplement for other GO categories). The tissues are arranged in the same order as in Fig. 3 of Gu and Su7 (the general distribution), where the mean tissue expression distances \( \bar E_{ti} \) are in descending order. Comparing the histograms of \( \bar E_{ti,cat} \) and \( \bar E_{ti} \) reveals how the distribution in each individual category deviates from the general distribution. As shown in Fig. 1, the distributions in individual GO categories are broadly consistent with the general distribution. This is particularly obvious in categories with more general functions, such as binding and transcription regulator. However, some GO categories show clear deviations, indicating that the constraint on expression divergence varies among GO categories. For example, in the GO categories “Catalytic activity” and “Enzyme regulator activity”, the expression divergence between human and mouse is significantly increased in liver. Another interesting deviation occurs in the “Structural molecule activity” category, where the expression divergence between human and mouse is most pronounced in tongue. These deviations are good indicators of turbulence in the evolution of the expression levels of genes in these categories, although more refined GO annotation and molecular biology experiments would be needed for further analysis. These results show that adding GO information greatly improves the resolution and sensitivity of expression divergence analysis.

Figure 1. The distribution of the mean tissue expression distances (\( \bar E_{ti,cat} \)) in different GO “molecular function” categories. (a) Transcription regulator; (b) Binding; (c) Catalytic activity; (d) Enzyme regulator; (e) Structural molecule; and (f) Signal transducer.

Correlation between Tissue Expression Distance and Tissue Sequence Distance in Major GO Categories

In each tissue ti, we calculated the mean tissue sequence distance for the major GO categories (\( \bar D_{ti,cat} \)), for “High expression” and “Normal expression” genes separately. The correlation between \( \bar E_{ti,cat} \) and \( \bar D_{ti,cat} \) was calculated for each major GO category across the 29 tissues in the dataset. The results are summarized in Table 1. Except for the categories “biological process unknown”, “molecular function unknown”, “extracellular matrix” and “extracellular region”, the correlation between \( \bar E_{ti,cat} \) and \( \bar D_{ti,cat} \) is generally significant. Moreover, except for the “structural molecule activity” category, the significance of the correlation does not differ much between the “High expression” genes and the “Normal expression” genes, although \( \bar D_{ti,cat} \) usually has a larger variance in the “High expression” genes than in the “Normal expression” genes. Fig. 2 shows plots of \( \bar D_{ti,cat} \) vs. \( \bar E_{ti,cat} \) for the “Regulation of biological process”, “Transcriptional factor” and “Protein complex” GO categories as examples.
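The text does not state which correlation coefficient or significance test was used. As a sketch only, and assuming a Pearson correlation across tissues, the per-category calculation could look like the following, where each list holds one value per tissue for a single GO category. The t statistic shown is the standard one for testing a Pearson coefficient against zero with n - 2 degrees of freedom; all names and values are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1.0:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Toy per-tissue values for one GO category (invented; the study has 29 tissues)
e_cat = [0.8, 1.1, 1.5, 2.0, 2.4]
d_cat = [0.10, 0.13, 0.12, 0.18, 0.21]
r = pearson_r(e_cat, d_cat)
print(r, t_statistic(r, len(e_cat)))
```

The same two functions apply unchanged when \( \bar D_{ti,cat} \) is replaced by \( \bar T_{dup,cat} \).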

Table 1. Summary of gene ontology analysis.
Figure 2. The correlation between \( \bar E_{ti,cat} \) and \( \bar D_{ti,cat} \) for both genes of “High expression” and “Normal expression” in several selected GO categories.

Although a strong correlation between expression divergence and sequence divergence is not a direct indicator of either the negative or the positive selection model, it does show that there is a common evolutionary constraint on expression divergence and sequence divergence. In our previous study,7 we showed that this constraint exists at the tissue level. Here, we further show that the correlation persists when the genes expressed in a tissue are sub-grouped into GO categories. This indicates that the stabilizing effect of the tissues is not dominated by genes in particular GO categories.

Correlation between Tissue Expression Distance and Tissue Duplicate Distance in View of GO Categories

In each tissue ti, the mean tissue duplicate distance was calculated for the major GO categories (\( \bar T_{dup,cat} \)), and the correlation between \( \bar E_{ti,cat} \) and \( \bar T_{dup,cat} \) was calculated. The results are summarized in Table 1. In most of the GO categories, \( \bar E_{ti,cat} \) shows a significant correlation with \( \bar T_{dup,cat} \). Fig. 3 shows plots of \( \bar T_{dup,cat} \) vs. \( \bar E_{ti,cat} \) for the “Regulation of biological process”, “Transcriptional factor” and “Protein complex” GO categories as examples. The strong correlation shows that tissues with a weaker stabilizing constraint on expression divergence between species also tolerate greater expression divergence between duplicated genes, and vice versa. In the previous study,7 we showed this at the tissue level. Here we show that sub-grouping the genes by GO category does not break this correlation, which again indicates that the effect is not dominated by genes in particular GO categories.

Figure 3. The correlation between \( \bar E_{ti,cat} \) and \( \bar T_{dup,cat} \) in several selected GO categories.

Discussion

In this study, we sub-grouped genes into the major GO categories and further tested the “tissue-driven” hypothesis. Our results show that, in most cases, the tissue-specific constraints are also reflected in the individual GO categories. This is particularly obvious in the neural tissues and the hormone-related tissues. The mean tissue expression distances in the neural tissues, such as pituitary (pi), amygdala (ad), hypothalamus (hp), and cerebellum (cb), are consistently among the lowest in most of the GO categories examined. Meanwhile, the mean tissue expression distances in the hormone-related tissues, such as pancreas (pc) and thyroid (tr), are consistently among the highest, again in most of the GO categories examined. This indicates that the tissue-specific stabilizing constraint is not dominated by genes in particular GO categories. The constraint is more likely to be governed at a more general level, such as the interactions between tissues (adult or embryonic) or the physiological structure of the tissues.

On the other hand, in some GO categories the mean tissue expression distance between human and mouse shows interesting deviations from the general distribution, for example the increased mean tissue expression distances in the “Structural molecule” category in tongue and in the “Catalytic activity” category in liver. This suggests that, in particular tissues and particular GO categories, potential indications of positive selection do exist between species. However, we did not find such deviations in neural tissues in any of the GO categories, which suggests that the constraint on gene expression variance in neural tissues is particularly strong. This appears to contradict the vast phenotypic differences in these tissues between human and mouse. However, since all the expression data are from adult tissues, it is possible that expression divergence in these tissues is more pronounced at other developmental stages, such as embryogenesis. The tissue-specific stabilizing constraints shown here and in the previous report7 are based on adult tissues, although the whole set of methods can be readily applied to tissues at other developmental stages once the data are available.