Weighted gene coexpression network analysis strategies applied to mouse weight
Systems-oriented genetic approaches that incorporate gene expression and genotype data are valuable in the quest for genetic regulatory loci underlying complex traits. Gene coexpression network analysis lends itself to identification of entire groups of differentially regulated genes—a highly relevant endeavor in finding the underpinnings of complex traits that are, by definition, polygenic in nature. Here we describe one such approach based on liver gene expression and genotype data from an F2 mouse intercross utilizing weighted gene coexpression network analysis (WGCNA) of gene expression data to identify physiologically relevant modules. We describe two strategies: single-network analysis and differential network analysis. Single-network analysis reveals the presence of a physiologically interesting module that can be found in two distinct mouse crosses. Module quantitative trait loci (mQTLs) that perturb this module were discovered. In addition, we report a list of genetic drivers for this module. Differential network analysis reveals differences in connectivity and module structure between two networks based on the liver expression data of lean and obese mice. Functional annotation of these genes suggests a biological pathway involving epidermal growth factor (EGF). Our results demonstrate the utility of WGCNA in identifying genetic drivers and in finding genetic pathways represented by gene modules. These examples provide evidence that integration of network properties may well help chart the path across the gene–trait chasm.
While traditional meiotic mapping methods such as linkage analysis and allelic association studies have been fruitful in identifying genetic targets responsible for Mendelian traits, these methods have been less successful in the identification of pathways and genes underlying complex traits. Integration of gene expression, genetic marker, and phenotype data via genetical genomics strategies is increasingly used in complex disease research (Bystrykh et al. 2005; Chen et al. 2004; Chesler et al. 2005; Hubner et al. 2005; Mahr et al. 2006; Nishimura et al. 2005; Schadt et al. 2003).
Closely related to “genetical genomics” are “systems genetics” approaches that emphasize network methods to describe the relationship between the transcriptome, physiologic traits, and genetic markers (Drake et al. 2006; Kadarmideen et al. 2006; Schadt and Lum 2006). Here we describe a particular incarnation of a systems genetics approach: integrated weighted gene coexpression network analysis (WGCNA) (Zhang and Horvath 2005; Horvath et al. 2006). By focusing on modules rather than on individual gene expressions, WGCNA greatly alleviates the multiple-testing problem inherent in microarray data analysis. Instead of relating thousands of genes to the physiologic trait, it focuses on the relationship between a few (here 12) modules and the trait. Because modules may correspond to biological pathways, focusing the analysis on module eigengenes (and equivalently intramodular hub genes) amounts to a biologically motivated data reduction scheme. WGCNA starts from the level of thousands of genes, identifies clinically interesting gene modules, and finally screens for suitable targets by requiring module membership (high intramodular connectivity) and other application-dependent criteria such as gene ontology or associations with clinical trait-related quantitative trait loci. Genetic marker data allow one to identify the chromosomal locations (referred to as module quantitative trait loci, mQTLs) that influence the module expression profiles. Genetic marker data also allow one to prioritize genes inside trait-related modules. In particular, if a genetic marker is known to be associated with the module expressions, using it to screen for gene expressions that correlate with the SNP allows one to identify upstream drivers of the module expressions. The underlying assumption in such an analysis is that functionally related genes and/or genetic pathways are regulated by common genetic drivers. 
We have applied this approach to identify mQTLs that control the expression profiles of a body weight–related module in an F2 population of mice (Ghazalpour et al. 2006). Here we extend these findings to another mouse cross. We also demonstrate the utility of WGCNA in relating distinct subgroups of a population via differential network analysis.
Materials and methods
Short glossary of network concepts
We define coexpression networks as undirected, weighted gene networks. The nodes of such a network correspond to gene expressions, and edges between genes are determined by the pairwise Pearson correlations between gene expressions. By raising the absolute value of the Pearson correlation to a power β ≥ 1 (soft thresholding), the weighted gene coexpression network construction emphasizes large correlations at the expense of low correlations. Specifically, the adjacency is a_ij = |cor(x_i, x_j)|^β.
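The soft-thresholding step can be illustrated with a minimal sketch in plain Python. The function names, the toy expression profiles, and the choice β = 6 are assumptions made for illustration; they are not taken from the study. The diagonal is left at zero here so that row sums can later be read as connectivities.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(expr, beta=6):
    """Soft-thresholded adjacency a_ij = |cor(x_i, x_j)| ** beta.

    expr holds one list of sample values per gene.
    beta=6 is an illustrative default, not a value from the text.
    """
    n = len(expr)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return a

# Toy data: three hypothetical genes measured in five samples.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],  # strongly correlated with gene 0
    [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly (negatively) correlated with gene 0
]
adj = adjacency(expr, beta=6)
```

In this toy example the correlation of about 0.999 between genes 0 and 1 stays close to 1 after soft thresholding, whereas the correlation of -0.3 between genes 0 and 2 shrinks to 0.3^6, which is below 0.001.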
Modules are clusters of highly interconnected genes. In coexpression networks, modules correspond to clusters of highly correlated gene expressions.
For each gene, the connectivity (also known as degree) is defined as the sum of connection strengths with the other network genes: k_i = Σ_{u≠i} a_iu. In coexpression networks, the connectivity measures how correlated a gene is with all other network genes.
Intramodular connectivity (kIN)
Intramodular connectivity measures how connected, or coexpressed, a given gene is with respect to the genes of a particular module. The intramodular connectivity may be interpreted as a measure of module membership.
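As a rough illustration of the two connectivity measures, the sketch below sums adjacencies over all other genes (whole-network connectivity) or only over the genes of one module (intramodular connectivity). The adjacency values and the module assignment are hypothetical, and restricting the sum to module members is the assumed reading of kIN.

```python
def connectivity(adj):
    """Whole-network connectivity k_i = sum over u != i of a_iu."""
    n = len(adj)
    return [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]

def intramodular_connectivity(adj, module):
    """Connectivity of each module gene, summed over the other module genes only."""
    return {i: sum(adj[i][u] for u in module if u != i) for i in module}

# Hypothetical symmetric adjacency matrix for four genes.
adj = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.6, 0.0],
    [0.7, 0.6, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
k = connectivity(adj)
k_in = intramodular_connectivity(adj, module=[0, 1, 2])
```

Gene 0 has the largest intramodular connectivity in this toy module, so it would be the natural hub gene candidate.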
The module eigengene corresponds to the first principal component of a given module. It can be considered the most representative gene expression in a module.
Module eigengene-based connectivity (kME)
The module eigengene-based intramodular connectivity measure kME roughly approximates the standard intramodular connectivity kIN. This measure is determined by correlating the expression profile of a gene i with the module eigengene of its resident module: kME_i = |cor(x_i, ME)|.
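The following sketch shows one way to obtain a module eigengene and kME values without external libraries. It standardizes each gene, finds the first principal component across samples by power iteration (a stand-in chosen here for a full singular value decomposition), and then correlates each gene with that component. All names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(x):
    """Center a profile and scale it to unit sample variance."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return [(v - m) / sd for v in x]

def module_eigengene(expr, iters=200):
    """First principal component (one value per sample) of the module genes."""
    z = [standardize(g) for g in expr]
    m = len(z[0])
    # Sample-by-sample cross-product matrix Z^T Z.
    c = [[sum(g[s] * g[t] for g in z) for t in range(m)] for s in range(m)]
    v = z[0][:]  # starting vector for the power iteration
    for _ in range(iters):
        w = [sum(c[s][t] * v[t] for t in range(m)) for s in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy module: three tightly coexpressed genes plus one loosely related gene.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],
    [1.2, 1.9, 3.3, 3.8, 5.1],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
me = module_eigengene(expr)
kme = [abs(pearson(g, me)) for g in expr]
```

The sign of a principal component is arbitrary, which is one reason the absolute value in the kME definition is convenient.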
Hub gene
This loosely defined term is used as an abbreviation of “highly connected gene.” By definition, genes inside coexpression modules tend to have high network connectivity.
Gene significance (GS)
Abstractly speaking, the higher this value, the more significant a gene is. In our application, the gene significance measures how correlated a gene expression is with a clinical trait. Mouse body weight can be used to define a physiologic trait–based gene significance measure. Similarly, SNPs can be used to define SNP-based gene significance measures.
Module significance is determined as the average of gene significance measures for all genes in a given module. This measure is highly related to the correlation between module eigengene and the trait.
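A small sketch may make the two significance measures concrete. It assumes that gene significance is the absolute Pearson correlation between a gene's expression profile and the trait, and it averages these values over a module. The weights and expression values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gene_significance(profile, trait):
    """Assumed definition: GS_i = |cor(x_i, trait)|."""
    return abs(pearson(profile, trait))

def module_significance(module_expr, trait):
    """Average gene significance over all genes of a module."""
    gs = [gene_significance(g, trait) for g in module_expr]
    return sum(gs) / len(gs)

# Hypothetical body weights (g) for five mice and a two-gene module.
weight = [20.0, 22.0, 24.0, 26.0, 28.0]
module_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # perfectly correlated with weight
    [5.0, 1.0, 4.0, 2.0, 3.0],  # correlation of -0.3 with weight
]
gs_values = [gene_significance(g, weight) for g in module_expr]
ms = module_significance(module_expr, weight)
```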
Module quantitative trait loci are chromosomal locations (e.g., SNP markers) that correlate with the module expression profiles. mQTLs can be defined as hotspots of the expression quantitative trait loci that are associated with a particular module.
We illustrate our methods using data from previously studied F2 mouse crosses. The first F2 data set (B × H cross) was obtained from liver tissue of 135 female mice derived from the F2 intercross between inbred strains C3H/HeJ and C57BL/6J (Ghazalpour et al. 2006; Wang et al. 2006). The second F2 (B × D) intercross data included liver tissue of 113 F2 mice derived from a cross of two standard inbred strains, C57BL/6J and DBA/2J (Ghazalpour et al. 2006; Schadt et al. 2003). Body weight and related physiologic (“clinical”) traits were measured in both sets of mice. We note that B × H and B × D mice differ in some respects. B × H mice are ApoE null (ApoE −/−) and thus hyperlipidemic, whereas B × D mice are wild type (ApoE +/+). B × H mice were fed a high-fat diet and B × D mice were fed a high-fat, high-cholesterol atherogenic diet. Also, B × H mice were sacrificed at an earlier age (24 weeks) than were the B × D mice (16 months).
Coexpression network analysis strategies
In the following, we present two distinct network analysis approaches: single-network analysis and differential network analysis. The two approaches answer different questions. The single-network analysis defines modules that can then be tested for validity with other data sets. Single-network analysis aims at identifying (a) pathways (modules) and (b) their key drivers (e.g., hub genes) that are present in a given data set. For example, we use all mice of a given F2 intercross to identify trait-related modules and mQTLs.
The second strategy, differential network analysis, aims to uncover differences in the modules and connectivity between different data sets (e.g., males versus females). Here we use body weight to arrive at two distinct data sets: lean and obese mice. Each data set is then used to construct a network. Next, the networks are contrasted to find (1) nonpreserved modules, (2) differentially expressed genes, and (3) differentially connected genes. Traditionally, a main goal of studying gene expression data is to relate differences in gene expression profiles to phenotypic differences across different conditions (e.g., different groups of mice). Viewing individual genes in isolation and analysis of differential expression is a well-established technique that has already yielded many important insights. On the other hand, differential analysis of network quantities (i.e., quantities describing the relationships between the genes such as intramodular connectivity) is neither as developed nor as widely used, although it has already led to some interesting results. For example, differential analysis of intramodular connectivity was used to identify key differences in expression networks of human and chimpanzee brains (Oldham et al. 2006).
Single weighted gene coexpression network analysis
In the case of single-network analysis, one uses a single network for modeling the relationship between transcriptome, clinical traits, and genetic marker data. In the following, we describe a typical single-network analysis for finding body weight–related modules and genes. While a single network is the focus, it does not imply that only a single data set is used. Instead, appropriately similar multiple data sets can be used to validate the robustness of module definition and connectivity.
Differential weighted gene coexpression network analysis
Next we define a measure of differential connectivity as DiffK(i) = K1(i) – K2(i), but other measures of differential connectivity could also be considered.
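The differential connectivity measure can be sketched as follows. The sketch assumes that K1 and K2 are whole-network connectivities scaled by the maximum connectivity in their own network, so that both lie between 0 and 1. This scaling is an assumption of the example, as are the two toy adjacency matrices.

```python
def scaled_connectivity(adj):
    """Connectivity of each gene divided by the network maximum (assumed scaling)."""
    n = len(adj)
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    k_max = max(k)
    return [v / k_max for v in k]

def diff_k(adj1, adj2):
    """DiffK(i) = K1(i) - K2(i) for two networks over the same genes."""
    k1 = scaled_connectivity(adj1)
    k2 = scaled_connectivity(adj2)
    return [a - b for a, b in zip(k1, k2)]

# Hypothetical adjacency matrices for the same three genes in two networks.
adj_net1 = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
adj_net2 = [
    [0.0, 0.1, 0.1],
    [0.1, 0.0, 0.9],
    [0.1, 0.9, 0.0],
]
d = diff_k(adj_net1, adj_net2)
```

Gene 0 is the most connected gene in the first toy network but poorly connected in the second, so its DiffK is strongly positive (0.8), while the other two genes have negative values.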
R software tutorials and the data for WGCNA can be found at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis/.
Single network analysis results
A single weighted gene coexpression network was constructed using expression data from livers of 135 female mice of the B × H cross, utilizing the 3421 most connected and varying transcripts from the approximately 23,000 transcripts present on the arrays (Ghazalpour et al. 2006). Using hierarchical clustering, we obtained 12 modules (each designated by a color). Gray denotes genes outside of modules. In this network, the Blue module had the highest module significance score for the physiologic trait of mouse weight (g) (module significance = 0.395, p = 7.7 × 10^−5), and was also highly significant for abdominal fat pad mass (g) (module significance = 0.323, p = 0.009). These p values remain significant after Bonferroni correction adjusting for 12 modules. We mention that total mass (g) of other fat depots is also significant (module significance = 0.309, p = 0.02), but does not remain significant after Bonferroni correction.
Figure 3a shows that intramodular connectivity (kME) with regard to the Blue module is preserved between the B × H and the B × D crosses (correlation r = 0.45, p ≤ 10^−20). GSweight was conserved with a Spearman correlation of 0.19 (p = 1.0 × 10^−17, see Fig. 3b). Network-based gene screening uses both GSweight and kME to find weight-related genes. Note that kME is better preserved than GSweight, which suggests that kME may be a more robust gene-screening variable (see Fig. 3).
A module QTL on chromosome 19
Table: Studying the preservation of correlations between the B × H and the B × D mouse cross data (columns: B × H, B × D). Reported p values: 1.3 × 10^−15, 2.1 × 10^−4, 2.1 × 10^−4, < 2.2 × 10^−16, < 2.2 × 10^−16.
Using a body weight–related mQTL to prioritize genes inside the Blue module
A SNP marker allows one to define a gene significance measure, GS.SNP, which can be used to prioritize genes within a module.
Additive marker coding reflects the dosage of a given allele; alternatively, one could use dominant or recessive marker coding (see Supplementary Material, Supplementary Table 2).
Observed GS.SNP values are reported in Supplementary Fig. 2a for our simulated module example. We explore the relationship between the GS.SNP values obtained by different marker coding methods in Supplementary Material, Appendix B, and depict the strong relationship between GS.SNP and the traditional LOD score in Supplementary Fig. 3. In short, this figure demonstrates that regardless of whether additive, dominant, or recessive marker coding is used, GS.SNP is highly related to the LOD score values.
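The three marker codings and a SNP-based gene significance can be sketched as below. The example assumes genotypes are stored as the number of copies (0, 1, or 2) of one chosen allele, and that GS.SNP is the absolute Pearson correlation between a gene's expression and the coded marker. Both points are assumptions of this illustration, and all data are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def code_marker(allele_counts, scheme):
    """Recode allele counts (0, 1, 2) under an additive, dominant, or recessive model."""
    if scheme == "additive":
        return [float(c) for c in allele_counts]               # allele dosage
    if scheme == "dominant":
        return [1.0 if c >= 1 else 0.0 for c in allele_counts]  # at least one copy
    if scheme == "recessive":
        return [1.0 if c == 2 else 0.0 for c in allele_counts]  # two copies
    raise ValueError(f"unknown coding scheme: {scheme}")

def gs_snp(profile, allele_counts, scheme="additive"):
    """Assumed definition: GS.SNP_i = |cor(x_i, coded SNP)|."""
    return abs(pearson(profile, code_marker(allele_counts, scheme)))

# Hypothetical genotypes and expression values for six F2 mice.
genotypes = [0, 1, 2, 1, 0, 2]
expression = [1.0, 2.1, 2.9, 1.9, 1.2, 3.1]
gs = {s: gs_snp(expression, genotypes, s)
      for s in ("additive", "dominant", "recessive")}
```

Because the toy expression values increase roughly linearly with allele dosage, the additive coding gives the highest GS.SNP here; with real data the best-fitting coding depends on the mode of inheritance at the locus.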
Systems genetics gene-screening criteria
Table: Gene-screening results of the single-network analysis. Reported p values: 5.5 × 10^−4, 2.0 × 10^−4.
Sector plots for identifying differentially expressed and differentially connected genes
Differential network analysis is concerned with identifying both differentially connected and differentially expressed genes. To measure differential gene expression between the lean and the obese mice, we use the absolute value of the Student t-test statistic. Plotting DiffK, the difference in connectivity between lean and obese mice, versus the t-test statistic value for each gene gives a visual demonstration of how difference in connectivity relates to a more traditional t-statistic describing difference in expression between the two networks.
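A sketch of the per-gene quantities behind such a plot is given below. It computes the pooled-variance Student t statistic between two groups and flags a gene as differentially expressed or differentially connected when it exceeds a cutoff. The cutoffs, gene data, and DiffK values are hypothetical and serve only to show the mechanics.

```python
import math

def student_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical cutoffs, chosen for illustration only.
DIFFK_CUTOFF = 0.4
T_CUTOFF = 2.0

def flag_gene(lean, obese, diff_k):
    """Summarize one gene by |t| and DiffK and apply both cutoffs."""
    abs_t = abs(student_t(lean, obese))
    return {
        "abs_t": abs_t,
        "diff_k": diff_k,
        "differentially_expressed": abs_t > T_CUTOFF,
        "differentially_connected": abs(diff_k) > DIFFK_CUTOFF,
    }

# Two hypothetical genes: expression in lean mice, in obese mice, and DiffK.
gene_a = flag_gene([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], diff_k=0.55)
gene_b = flag_gene([2.0, 3.0, 4.0], [2.5, 3.0, 3.5], diff_k=-0.05)
```

Plotting `diff_k` against `abs_t` for every gene and drawing the two cutoffs as lines divides the plane into regions; genes such as `gene_a` fall in a region where both criteria are met.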
Functional enrichment analysis of sector 3 genes
We analyzed 61 sector 3 genes that were both highly connected in network 1 and lowly connected in network 2 for functional enrichment using the DAVID database (Dennis et al. 2003). This software, which is free and available for download at http://www.d.abcc.ncifcrf.gov/home.jsp, calculates the p value for the extent of enrichment of a given biological pathway/set by performing Fisher’s exact test. We focused on sector 3 for two reasons. First, sector 3 members had extreme values of DiffK as well as high t-statistic values. Second, as one can readily see from Fig. 4a, a high proportion of Yellow module genes were found in this sector, based on network 1 module definitions. These Yellow module genes were lowly connected in network 2, and therefore were annotated as Gray module (background) members in a module assignment scheme based on network 2. This result suggests that in a pathophysiologic state (mouse obesity), the Yellow module can no longer be found.
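The enrichment p value can be illustrated with a plain one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution. This is a generic sketch and may differ in detail from the procedure DAVID applies. The background size, category size, and overlap below are hypothetical numbers.

```python
from math import comb

def enrichment_p(total, in_category, list_size, overlap):
    """One-sided Fisher's exact test for over-representation.

    Returns the probability of seeing at least `overlap` category genes in a
    random gene list of size `list_size`, drawn from `total` background genes
    of which `in_category` belong to the category.
    """
    denom = comb(total, list_size)
    upper = min(in_category, list_size)
    tail = sum(
        comb(in_category, k) * comb(total - in_category, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom

# Hypothetical example: 2000 background genes, 40 in the category,
# a list of 60 genes of which 8 fall in the category.
p = enrichment_p(total=2000, in_category=40, list_size=60, overlap=8)
```

With these numbers only about 1.2 category genes would be expected by chance, so an overlap of 8 gives a small p value. When many categories are tested, each p value still needs a multiple-testing adjustment such as the Bonferroni correction.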
Table: Functional enrichment analysis of the results of the differential network analysis. Reported p values: 1.8 × 10^−4, 5.4 × 10^−4, 7.7 × 10^−4, 8.7 × 10^−4. Listed categories include Glycosylation site:N-linked (GlcNAc...) and IPR000742:EGF-like, type 3.
In summary, we find a group of rewired genes identified by differential connectivity in lean and obese mice. These genes are highly enriched for extracellular and cell–cell interactions and notably include 12 epidermal growth factor (EGF) or EGF-related factors. An indirect validation of the differential network results is provided by a published article that reports that EGF plays a causal role in inducing obesity in ovariectomized mice (Kurachi et al. 1993).
Functional enrichment analysis of sector 5 genes
Sector 5 is analogous to sector 3 in that it contains genes with both extreme differences in connectivity and extreme t-statistic values. After Bonferroni correction, these genes are enriched for enzyme inhibitor activity (p = 2.93 × 10^−3), protease inhibitor activity (p = 6.00 × 10^−3), endopeptidase activity (p = 6.00 × 10^−3), dephosphorylation (p = 0.0122), protein amino acid dephosphorylation (p = 0.0122), and serine-type endopeptidase inhibitor activity (p = 0.0417) (Supplementary Table 6). Two genes belonged to all significant categories: Itih1 and Itih3. These two genes are located near a QTL marker for hyperinsulinemia (D14Mit52) identified in C57Bl/6, 129S6/SvEvTac, and (B6 × 129) F2 intercross mice (Almind and Kahn 2004). Itih3 was independently determined to be a gene candidate for obesity-related traits based on differential expression in murine hypothalamus (Bischof and Wevrick 2005). Two serine protease inhibitors, Serpina3n and Serpina10, belonged to the categories of enzyme inhibitor, protease inhibitor, and endopeptidase inhibitor. In humans, Serpina10 is also known as Protein Z-dependent protease inhibitor (ZPI). This serpin inhibits activated coagulation factors X and XI; ZPI deficiencies have been found to be associated with venous thrombosis (Water et al. 2004). We note that obesity is a strong independent risk factor for venous thrombosis (Abdollahi et al. 2003; Goldhaber et al. 1997) and that accordingly ZPI may be a link between obesity and increased risk of venous thrombotic events.
Results from functional enrichment analysis for all other sectors are described in Supplementary Material, Appendix C and Supplementary Tables 3, 4, 5, 7, and 8 (Supplementary Table 3: enrichment of biological pathways/sets for Blue module genes intersecting B × H and B × D data sets; Supplementary Table 4: enrichment of biological pathways/sets for sector 2 genes; Supplementary Table 5: enrichment of biological pathways/sets for all sector 3 genes; Supplementary Table 7: enrichment of biological pathways/sets for sector 6 genes; Supplementary Table 8: enrichment of biological pathways/sets for sector 8 genes).
Integrating weighted gene coexpression network analysis with genotype data holds great promise for elucidating the molecular and genetic basis of complex diseases. Since WGCNA focuses on coexpression modules (as opposed to individual gene expressions), it will be useful only if trait-related modules can be detected in the gene expression data. In our mouse genetics application, we provide evidence for a body weight-related module that can be found in two F2 mouse crosses.
We show that several modules identified in the F2 B × H mouse intercross are roughly preserved in an independent B × D mouse cross. In particular, the weight-related module found in the F2 mouse intercross is recovered in the second mouse cross. Highly connected hub genes within this module are found to have high correlation with weight (GSweight). We also find that module-based measures tend to be stable and robust across independent data sets. This is even more striking given the difference between the B × H and B × D mouse populations. Hub gene status is also roughly preserved, validating the importance and robustness of intramodular connectivity. These validation successes provide evidence for the utility and robustness of network-based methods.
Central to WGCNA is the concept of intramodular connectivity, which can be considered a measure of module membership. In coexpression networks, intramodular hub genes can be considered the most central genes inside the module. Because the expression profiles of intramodular hub genes inside an interesting module are highly correlated, they are statistically equivalent. This does not imply that such genes have the same functional significance. Gene ontology may reveal that they differ in terms of biological plausibility or clinical utility. In many applications, the list of module hub genes may be further prioritized based on (1) biological plausibility based on external gene (ontology) information, (2) availability of protein biomarkers for further validation, (3) availability of suitable mouse models for further validation, and/or (4) druggability, i.e., the opportunity for therapeutic intervention.
We demonstrate that both single-network and differential network analyses may be useful for finding body weight-related genes. Single-network analysis describes the module structure and topological properties of a single data set. In single-network analysis, all samples, irrespective of their clinical trait, are used for network and module construction. In contrast, differential network analysis compares two different networks. Differential network analysis aims to identify genes that are both differentially expressed and differentially connected. Since module genes tend to be highly connected in coexpression networks, screening for differentially connected genes is related to studying the preservation of modules between the two networks. We have shown that genes that are differentially connected may or may not be differentially expressed. Changes in connectivity may correspond to large-scale “rewiring” in response to environmental changes and physiologic perturbations (Luscombe et al. 2004).
The availability of genetic markers greatly enhances the kind of questions that can be addressed by WGCNA. Genetic marker data provide valuable information for prioritizing gene expressions inside a module. The resulting systems genetics gene-screening strategy goes beyond drafting lists of differentially expressed genes or finding chromosomal locations that seem to cosegregate with a trait.
The authors acknowledge the support from Program Project Grant 1U19AI063603−01 and from NIH/NIDDKD-DK072206. They are grateful for discussions with collaborators Jun Dong, Dan Geschwind, Peter Langfelder, Ai Li, Wen Lin, Paul Mischel, Stan Nelson, Roel Ophoff, Anja Presson, Christiaan Saris, Lin Wang, and Wei Zhao.
- Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4(1): Article 17