'How-to' books tell us that networking is critical to get ahead in business and in life. Networks are also becoming increasingly important in biology, as we grapple with whole genome sequences. The traditional approach - to study one gene at a time and place it in a linear pathway with a defined biological role - falters when faced with thousands of genes, and genes with roles in two or more processes. In a recent paper in Nature Genetics, Lee et al. [1] confront these challenges by constructing a probabilistic network for Caenorhabditis elegans. This network differs from those of previous studies in that it captures most of the protein-coding genes in the C. elegans genome (82%), and it can use groups of genes to search for interacting loci.

Since its genome was completed a decade ago, C. elegans has emerged as a powerhouse for genome-wide analyses [2]. Large-scale surveys for RNA-interference (RNAi)-induced phenotypes [3, 4], RNA expression [5], protein-protein interactions (Interactome [6]), and protein-DNA binding [7] have generated a wealth of information about approximately 20,000 C. elegans genes. Our current challenge is to integrate these data into a coherent picture. In an early study, Kim et al. [5] combined data from multiple C. elegans microarray experiments, as well as those from Drosophila, yeast and humans, to find genes that were co-regulated across species. The authors made use of the 'guilt-by-association' concept to ask if genes that were coexpressed over many different conditions had similar functions. Gunsalus and co-workers [8] combined coexpression data, Interactome data and phenotypic analyses to predict the molecular machines that drive early embryogenesis. The Sternberg lab took this idea one step further, expanding predictions for all stages of life. Zhong and Sternberg [9] combined coexpression data, interactome predictions and genetic or protein interactions from worms or their orthologous genes and proteins in flies and yeast. The data were weighted according to their dependability and integrated into a Bayesian network with over 18,000 interactions for 2,254 genes, or around 11% of the predicted worm proteome. More recently, the Vidal lab developed an automated method to classify post-embryonic expression patterns of promoter-green fluorescent protein (GFP) reporters [10]. They combined anatomical data with the interactome dataset to weed out potential false positives for genes that were not expressed in the same tissue. These studies laid the foundation for integrated networks, but failed to capture the majority of protein-coding genes in their searches.

Wormnet is a probabilistic network for C. elegans

Now, Lee and colleagues [1] have assembled diverse data from C. elegans large-scale analyses to build a probabilistic network, dubbed Wormnet. Much of the information used by Zhong and Sternberg was also used by Lee et al. Additional information in Wormnet was derived from the following sources: gene interactions inferred from co-citation analysis, with the assumption that gene pairs that are co-cited in abstracts more than the random expectation are likely to be functionally linked; 'associalogs', which represent physical or genetic interaction data from other species mapped onto their C. elegans orthologs, as determined by INPARANOID [11, 12]; and phylogenetic and gene neighbor analysis using 117 bacterial genomes (Figure 1). The current study excluded Gene Ontology (GO) terms and RNAi phenotypic data, opting to use this information for data weighting and validation, respectively.
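The co-citation criterion ("co-cited in abstracts more than the random expectation") can be made concrete with a small sketch. The text does not say which statistic Lee et al. used, so the hypergeometric tail below is an assumption chosen for illustration, and the abstract counts are invented.

```python
from math import comb

def expected_cocitations(n_abstracts, n_a, n_b):
    """Co-citations expected by chance if genes A and B were cited independently."""
    return n_a * n_b / n_abstracts

def cocitation_pvalue(n_abstracts, n_a, n_b, n_both):
    """Hypergeometric tail P(X >= n_both).

    Null model (an assumption): the n_b abstracts citing gene B are a random
    draw from n_abstracts, of which n_a cite gene A.
    """
    upper = min(n_a, n_b)
    total = comb(n_abstracts, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_abstracts - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total

# Invented counts: 1,000 abstracts, gene A in 20, gene B in 30, both in 5.
expected = expected_cocitations(1000, 20, 30)
p_value = cocitation_pvalue(1000, 20, 30, 5)
```

A pair whose observed co-citation count sits far above `expected`, with a correspondingly small `p_value`, would be a candidate functional link under this simplified model.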

Figure 1

The conceptual framework of Wormnet. Wormnet was generated from large-scale studies of C. elegans biology (boxes). Data for each gene were weighted according to their accuracy, which is diagrammed here as differently sized boxes. Wormnet can be used to make predictions about gene function, which can be rapidly tested in vivo. 'Associalogs' refers to physical or genetic interaction data from yeast, flies, and humans mapped to their nematode orthologs.

The assembled data were weighted and integrated into a comprehensive network using a methodology that the Marcotte lab had previously optimized for yeast [13]. Briefly, Lee et al. [1] determined how well each dataset (that is, the coexpression dataset, the Interactome dataset, and so on) predicted a meaningful linkage between genes known to share biological functions, based on GO annotations. The weighting was performed for each individual link to provide more sensitive scoring. From this analysis, a log likelihood score (LLS) was calculated, which estimated how likely it was that two genes were linked in a meaningful way. Each dataset was given a different weight depending on how well it predicted functional linkages, so that higher-quality data 'counted' more than lower-quality data when they were incorporated into Wormnet. The power of this approach is that LLS values are additive, allowing easy integration of different data points using Bayesian statistics. This flexibility also permits the addition of future data as they become available. Thus, assembling and weighting diverse groups of data, even poor-quality data, can accumulate evidence for a functional interaction between genes.
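The scoring idea can be sketched in a few lines. The formula below is a log odds ratio, comparing how often a dataset's gene pairs share an annotation with how often annotated pairs do so overall. It is assumed here as one plausible reading of the LLS, and the plain summation assumes the datasets are independent; the published pipeline may weight or discount correlated evidence differently. Function names, counts and gene names are hypothetical.

```python
from math import log

def log_likelihood_score(supported_shared, supported_unshared,
                         background_shared, background_unshared):
    """ln of (odds of a shared annotation among pairs supported by a dataset)
    divided by (the same odds among all annotated pairs)."""
    dataset_odds = supported_shared / supported_unshared
    prior_odds = background_shared / background_unshared
    return log(dataset_odds / prior_odds)

def integrate_evidence(evidence):
    """Sum per-dataset scores for each gene pair (naive additive integration).

    evidence maps dataset name -> {(gene1, gene2): LLS}.
    """
    combined = {}
    for scores in evidence.values():
        for pair, lls in scores.items():
            key = tuple(sorted(pair))  # treat (A, B) and (B, A) as one link
            combined[key] = combined.get(key, 0.0) + lls
    return combined

# A dataset whose pairs share annotations 9:1, against a 1:99 background.
strong = log_likelihood_score(90, 10, 1, 99)
# A dataset no better than the background scores zero.
uninformative = log_likelihood_score(1, 99, 1, 99)

evidence = {
    "coexpression": {("geneA", "geneB"): 1.2, ("geneB", "geneC"): 0.4},
    "interactome": {("geneB", "geneA"): 2.0},
}
combined = integrate_evidence(evidence)
```

Because the scores add, a pair supported weakly by several datasets can end up with as much total evidence as a pair supported strongly by one.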

Using these criteria, Wormnet v1 established 384,700 interactions among 16,113 genes (approximately 80% of the proteome), a four- to eightfold increase in coverage over previous studies [14]. The authors trimmed this network to produce a higher-confidence dataset using an empirically defined LLS cutoff, based on their previous work in yeast [13]. This analysis generated the core Wormnet group, consisting of 113,928 linkages between 12,357 genes (approximately 63% of the proteome). Even this trimmed version of Wormnet constitutes a threefold increase in proteome coverage over previous studies with worms. Intriguingly, 83,946 linkages in the core database had never been noted elsewhere (for example, in the GO database or in the literature).
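Trimming a scored network to a core set is a simple filter, sketched below on a toy network. The cutoff value, gene names and total gene count are invented, since the text does not give the threshold Lee et al. used.

```python
def trim_network(links, cutoff):
    """Keep only links whose integrated score meets the cutoff."""
    return {pair: score for pair, score in links.items() if score >= cutoff}

def gene_coverage(links, n_genes_total):
    """Fraction of all genes that appear in at least one link."""
    genes = {gene for pair in links for gene in pair}
    return len(genes) / n_genes_total

# Toy network of five genes out of a hypothetical six-gene genome.
links = {
    ("a", "b"): 3.0,
    ("b", "c"): 0.5,
    ("c", "d"): 1.5,
    ("d", "e"): 0.2,
}
core = trim_network(links, cutoff=1.0)
full_coverage = gene_coverage(links, n_genes_total=6)
core_coverage = gene_coverage(core, n_genes_total=6)
```

As in the trade-off described above, raising the cutoff removes low-confidence links at the cost of dropping genes that have no high-scoring partner.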

Testing Wormnet

To validate Wormnet, the authors queried the core database in four ways. First, they determined if Wormnet could predict essential genes. Interactome data from one- and two-hybrid screens have revealed that proteins with many interacting partners are likely to be essential [6, 7, 15]. This is called the lethality-centrality rule, and it also holds for Wormnet. Lee and colleagues [1] observed a good correlation between genes with many Wormnet linkages and the likelihood those genes would be essential, based on data derived from a genome-wide RNAi screen [3]. The RNAi dataset was not used to build Wormnet and therefore served as an independent test group. The authors extended their analysis to focus on the subset of C. elegans genes with mouse orthologs. They discovered that Wormnet could accurately predict genes with lethal phenotypes for mice as well as worms.
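The lethality-centrality comparison can be illustrated by counting each gene's links and asking whether the best-connected genes are more often essential. This is only a sketch of the idea: the binning scheme, gene names and essential-gene set are hypothetical, and it is not the authors' analysis.

```python
from collections import Counter

def link_counts(links):
    """Number of links (degree) for each gene in an iterable of gene pairs."""
    degree = Counter()
    for gene1, gene2 in links:
        degree[gene1] += 1
        degree[gene2] += 1
    return degree

def essential_fraction_by_degree(links, essential, n_bins=2):
    """Rank genes from fewest to most links, split them into bins, and return
    the fraction of essential genes in each bin."""
    degree = link_counts(links)
    ranked = sorted(degree, key=lambda gene: (degree[gene], gene))
    size = -(-len(ranked) // n_bins)  # ceiling division
    bins = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [sum(gene in essential for gene in b) / len(b) for b in bins]

# Toy network: hub gene "h" has four links; "c" and "d" have one each.
links = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "b")]
essential = {"h", "b"}  # stand-in for hits from an independent RNAi screen
fractions = essential_fraction_by_degree(links, essential)
```

If the rule holds, the fractions rise from the first (poorly connected) bin to the last (highly connected) bin.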

Second, the authors determined if genes connected to each other by Wormnet were associated with similar phenotypes. They examined 43 genome-wide RNAi screens that were focused on a particular phenotype such as 'increased lifespan' or 'growth defective.' Lee et al. found a strong correlation between linked genes in Wormnet and related phenotypes for 29 of the 43 RNAi screens, with another 10 screens having reasonable linkages. Thus, genes connected by Wormnet were likely to have similar phenotypes and, by extension, roles in similar cellular or developmental processes. This relationship, however, does not predict similar biochemical functions. For example, a pair of linked genes might reflect one activator and one repressor, both acting in a common pathway.

Third, Lee et al. examined whether Wormnet could predict specific functions for unstudied genes, based on their linkages to known genes. They chose two pathways implicated in human disease. First, they surveyed Wormnet for genes that might function in the retinoblastoma (Rb) tumor suppressor pathway. In C. elegans, the Rb pathway is best understood for its role in the developing vulva, which is the egg-laying apparatus for the worm. Previous studies had identified six genes that could suppress Rb-associated vulval phenotypes [16, 17]. The authors used these six genes as a seed to search for interacting loci, and identified 62 genes from the core Wormnet dataset. Using RNAi, they tested 50 of these genes and found 10 that produced scoreable suppression for vulval development, a hit rate of 20%. This frequency was much higher than that of a recent genome-wide screen, which identified suppressors at a rate of around 0.4% [18]. Thus, Wormnet could pinpoint a set of candidates to test, and it improved the likelihood of success roughly 50-fold over an unbiased screen. However, neither the genome-wide nor the Wormnet screen was perfect: more than 70% of the suppressors discovered by Cui and coworkers [18] were missed by Wormnet, and conversely 38% of the Wormnet suppressors were not found by Cui et al. Some genes missed by Wormnet reflect pathways not represented by the six seed genes. Nevertheless, Wormnet successfully identified components for each of the chromatin regulatory complexes that were also discovered by Cui et al.
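A seed-based search of this kind can be sketched as follows: each non-seed gene is ranked by the total score of its links to the seed set. The ranking rule is an assumption made for illustration, and the gene names and scores are invented. The last lines reproduce the hit-rate arithmetic from the figures quoted above.

```python
def rank_candidates(links, seeds):
    """Rank non-seed genes by the summed score of their links to seed genes.

    links maps (gene1, gene2) -> integrated score.
    """
    seeds = set(seeds)
    scores = {}
    for (gene1, gene2), score in links.items():
        if gene1 in seeds and gene2 not in seeds:
            scores[gene2] = scores.get(gene2, 0.0) + score
        elif gene2 in seeds and gene1 not in seeds:
            scores[gene1] = scores.get(gene1, 0.0) + score
    # Highest total first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

links = {
    ("seed1", "x"): 2.0,
    ("seed2", "x"): 1.5,
    ("seed1", "y"): 3.0,
    ("y", "z"): 4.0,          # ignored: neither gene is a seed
    ("seed1", "seed2"): 5.0,  # ignored: both genes are seeds
}
candidates = rank_candidates(links, seeds=["seed1", "seed2"])

# Hit rates from the text: 10 of 50 tested candidates versus about 0.4%.
wormnet_hit_rate = 10 / 50
fold_enrichment = wormnet_hit_rate / 0.004
```

Here gene `x` outranks `y` because it is linked to both seeds, even though its single strongest link is weaker.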

For the fourth test, Lee et al. examined an interaction predicted by Wormnet between the dystrophin-associated protein complex (DAPC) and the epidermal growth factor (EGF)-Ras-MAP kinase (MAPK) signaling pathway. DAPC components are primarily expressed in muscle cells, and mutations in several DAPC genes are linked to muscular dystrophies [19]. The EGF pathway is perturbed in many human cancers, but in C. elegans it is critical for cell-fate specification. RNAi of three DAPC genes strongly suppressed the cell-fate phenotypes associated with activated Ras, suggesting that DAPC augments EGF-Ras-MAPK signaling. As the authors point out, this relationship may be conserved in vertebrates [20], suggesting novel therapeutic targets for muscular dystrophies.

Where do we go from here?

What does the future hold for Wormnet? Adding new data will extend and refine the Wormnet database. The Model Organism Encyclopedia of DNA Elements (modENCODE) [21] is an effort to uncover functional elements in the fly and worm genomes, including additional protein coding sequences, noncoding RNAs and cis-regulatory regions. These important elements will aid the prediction machinery, for example, by increasing the proteome coverage of Wormnet from its current level of 80%. Inclusion of noncoding RNAs such as microRNAs [22] could add a whole new twist to understanding regulatory pathways. In addition, the current version of Wormnet does not rely on explicit spatial or temporal expression data. With the advances in tissue-specific profiling [23-27], future versions of Wormnet could allow researchers to restrict their database searches to the subset of genes active in a tissue of interest (A Fraser, personal communication). This approach may reduce the number of false positives identified in a search. With an almost exponential increase in genome-wide datasets expected in the coming years, it is conceivable that Wormnet will soon cover the entire worm proteome and greatly aid in the discovery of gene function.

In summary, Lee and colleagues [1] have built an integrated database that can uncover functional linkages between genes for C. elegans and probably also for mammals. One big payoff of this study is the enhanced predictive power. Wormnet can detect interactions not only between components of stable complexes (for example, the proteasome), but also between factors associated with dynamic processes, such as cell signaling. Put another way, Wormnet describes the possible linkages associated with a gene, only some of which will be active at any particular time or place. This may enable Wormnet to uncover links for proteins with diverse functions. Many proteins participate in more than one process: consider the roles of β-catenin in transcription versus cell adhesion [28], or of the GTPase Ran in nuclear trafficking versus mitotic spindle assembly [29]. Probabilistic networks are capable of building gene linkages that represent multiple biological roles, rather than placing genes in traditional linear pathways. Wormnet provides an excellent resource for the field of C. elegans biology, and the principles set forth by these studies can also be applied to more complex organisms.