Introduction

Owing to their sessile mode of life, plants are subject to drastic variations in their environment that lead to rapid adaptation of their gene expression states resulting from their complex gene-regulatory networks. The ultimate goal in plant systems biology is to infer, for both scientific and practical gain, how such regulatory networks will respond under untested conditions. In prokaryotes, models to infer gene-regulatory networks (GRNs) have successfully predicted genome-wide variations in untested environmental conditions, as well as the causal relationships between genes [14]. However, there has been less success in generating predictive network models for multicellular organisms, including plants. With the increasing availability of high-throughput '-omic' techniques and data, we think it useful to summarize both experimental and informatic approaches for inferring causal relationships in GRNs. Here, we use the term GRN to refer to the set of transcriptional interactions between transcription factors (TFs) and their targets, as opposed to a multimodal set of gene-to-gene or gene-to-metabolite interactions.

Here, we have three aims: first, to summarize efforts to use time-series and other -omic data to infer causal regulatory edges, showing the kinds of biological insights that can be obtained; next to provide a description and a categorization of the informatic methods that are being used to infer causal networks; and finally to discuss recent high-throughput experimental techniques to validate inferred GRNs in plants.

Successful case studies of learning gene-regulatory networks in plants

Different kinds of systems approaches are used to model GRNs in plants. One way of characterizing these systems approaches is dependent on whether or not they start with a significant amount of prior experimental knowledge of the connectivity of the modeled GRN. Thus, in this article, we call them 'Strong Prior' and 'Weak Prior' approaches, respectively.

Strong Prior approaches

In our terminology, Strong Prior approaches are grounded in extensive previous knowledge about the components involved in the GRNs [5] of well-studied functions - for example, auxin signaling [68], the circadian clock [911] or flower development [1214]. This previous knowledge is paradigmatically derived using differential-equation systems and Boolean models (described below). Outputs of the models are then compared with experimental data to determine their predictive power. When the predictions hold, the models can be used to explore GRN behavior in untested conditions in silico and to determine the overall system properties and architecture. These kinds of investigations have led to some striking results, as discussed below.

For auxin signaling, Vernoux and colleagues [6] built a model based on previous knowledge of the auxin/indole-3-acetic acid and auxin response factor (AUX/IAA-ARF) transcription factor network and yeast two-hybrid experiments (taking into account the possibility of interactions between the protein partners). This ordinary differential equation model demonstrated that the resulting GRN shows a strong buffering capacity as the transcriptional induction of auxin-induced genes is stabilized even when auxin inputs display strong variations. This property was experimentally revealed in planta, in the shoot apical meristem, by using the fluorescent sensor DII-VENUS as a reporter of the input of the signaling pathway and the DR5 reporter gene as the output.

The circadian clock is also a well-studied gene-regulatory system (for a comprehensive review, see Bujdoso and Davis [15]) that consists of interlocked transcription factor feedback loops [1618]. GRN modeling of the circadian system has been successful in determining its evolution in time and the crucial components involved in some key features of the oscillations. For instance, in studies by Pokhilko and colleagues [17], the GRN model was central to the discovery of the role of PRR5 as a night inhibitor of the expression of LHY/CCA1, including the role for PRR5 in the control of the phase of morning gene expression. In the same work, this GRN-generated hypothesis was validated by matching the behavior of prr5 mutants to gene expression predicted by the model [17]. In an alternative approach, Akman et al. [10] used Boolean logic to describe circadian circuits in a quantitative model. The simplified model with decreased parameterization was able to simulate observed circadian oscillations accurately and identify regulatory structures consistent with experimental data.

Flower development (described by the ABC model) is a textbook example of a conserved GRN that controls the fate of cells becoming sepals, petals, stamens and carpels [19]. A successful approach using a discrete-network model (gene expression is coded into discrete values) has been to simulate the cell-fate determination during formation of floral organ primordia in Arabidopsis [12]. This particular GRN dynamically converges towards different steady-states in gene expression, each of which defines the different cell fates in flower organs. Plants arrive at these cell-fate-associated steady-states (or 'basins of attraction') independently of the initial gene expression values. This shows that this GRN has feedback/buffering capacities that direct gene expression behavior towards a dedicated state (for example, making a particular organ) [12]. More recent studies have taken advantage of the wealth of interaction and expression data available in public databases to construct extensive [13] and condensed [14] models of GRNs involved in floral development, resulting in time-evolving molecular regulatory networks for the development of sepal primordia [13] as well as for floral transition [14].

These few examples of successful Strong Prior approaches demonstrate that GRNs confer robust emergent properties supporting developmental or environmental adaptations.

Weak Prior approaches

The Strong Prior approaches described above begin with some physical connection data and then use time-series and other experiments to model behavior [5]. However, for many systems - in plants, animals and microbes - this initial knowledge has yet to be unearthed.

Weak Prior approaches infer potential connections in GRNs from -omic datasets. Many techniques are used to infer unknown networks in the field of systems biology (for reviews, see [1, 20, 21]). These techniques have enjoyed great success in simpler systems, such as for bacteria. For instance, a striking success story is the model of gene-regulatory programs built from a multi-level dataset (including transcriptomic data and cis-regulatory element (CRE) inference) to describe the response of Halobacterium salinarum to environmental cues [2]. The model was built de novo by a machine-learning procedure based on 72 transcription factors responding to 9 environmental factors. The same model was able to predict the correct gene response (80% of the genome) in 147 untested conditions [2]. This study clearly demonstrates the feasibility of Weak Prior approaches in prokaryotic systems. In plant science, as this eukaryotic system is far more complex than that of yeast or bacteria, the field of GRN de novo learning is far less advanced [22]. However, Weak Prior approaches have been developed with some success, as described below.

In the plant field of GRN modeling, the three most popular top-down approaches are: (i) classical correlations networks (in combination with other information to establish causality), (ii) graphical Gaussian models (based on partial correlation) and (iii) machine-learning modeling, or combinations of the above.

Correlation networks have been used extensively to study GRNs in plants even if, by themselves, they do not directly determine causality in networks [23]. When combined with other experimental information, correlation networks help to identify key features of plant regulatory networks. For example, an Arabidopsis multi-network was constructed from all available information about putative TF-to-CRE interactions, protein-protein interactions and microRNA-mRNA interactions [24]. Correlation data integrated with the Arabidopsis multi-network have uncovered biomodules involved in carbon/nitrogen signal integration [25] and have also revealed a central role for CCA1, the central component of the circadian clock in nutrient control [26]. Additionally, correlation network approaches were strikingly successful in identifying two genes (encoding a myo-inositol-1-phosphate synthase and a Kelch-domain protein) correlating with biomass accumulation in plants [27]. The individual role of these two genes was further supported by an association-mapping study that demonstrated coherent allelic diversity at their loci [27].

Graphical Gaussian models can be viewed as an approximate method to find partial correlation networks. Partial correlation is a measure of correlation between pairs while controlling for other factors. Mathematically, if A, B and C correlate together, partial correlation correlates A and B by 'subtracting' the correlation due to A and C. Practically, partial correlation is the correlation between the residuals resulting from the linear regression of A with C, and of B with C. Graphical Gaussian models have been successfully developed [28] and applied to plant GRNs [29, 30]. Ingkasuwan and colleagues analyzed a time-series to identify genes regulated across the diurnal cycle [29]. Then a sub-network of starch-metabolism genes together with the diurnally regulated TFs were modeled using graphical Gaussian models. This model was tested and validated by studying regulator mutants that displayed starch granule defects in plastids [29].

Machine-learning methods have also been employed to learn GRNs from time-series and other data. State-space modeling is a modern machine-learning technique devoted to detecting causality in networks by inferring ordinary differential equations specifying the relationships among genes in those networks while avoiding over-fitting. In plants, this technique has been applied to probe GRNs involved in leaf senescence [31] and GRNs involved in regulating early, time-dependent transcriptional responses to NO3- [32]. Breeze and colleagues [31] provided a high-resolution temporal picture of the transcriptome of the aging leaf. Machine learning revealed modules that play various roles at different times, where each module involves particular TF families and CREs. This approach resulted in a GRN model that correctly predicted the influence of the TF ANAC092 and proposed several new regulatory edges between genes (representing causal relationships) that remain to be validated [31]. In another study [32], state-space modeling and machine learning were applied to an Arabidopsis high-resolution time-course of genome-wide transcriptional response to treatments with NO3-. A subset of TFs and nitrogen transport and assimilation genes has been modeled in order to propose a GRN that explains NO3- signal propagation. The model has been tested in silico as well as experimentally. In silico validation demonstrated that the model trained on the early time points of the time-series experiment is able to predict modulation of gene expression at later time-points (not used to train the model). Experimental validation consisted of studying the effect of overexpressing a predicted hub (SPL9 TF) on the NO3- response of other NO3--regulated genes. Indeed, SPL9 overexpression modified the regulation of the nitrate assimilation gene NIA2 but also of many genes encoding NO3--regulated TFs [32].

Analytical approaches used to infer causality in the gene-regulatory network (a mathematical point of view)

Inferring a causal edge between objects is useful in many applications in plant biology, from genomics to ecology. If some population of objects A can cause an increase in the population of object B (where A could be a gene in our context, a hormone or a species in ecology), then lowering the population of B can be achieved by: (i) removing some members of B, (ii) removing some members of A or (iii) interfering with the edge from A to B. Conversely, making B achieve a higher population can be achieved by: (i) adding more members of B, (ii) adding more members of A or (iii) enhancing the efficiency of the edge from A to B. Commonly, causal relationships in biology can involve several elements, for example A1 to A5, influencing some B, sometimes positively and sometimes negatively. The influences can be 'linear', in which case each element has either a positive or negative weight (or coefficient), or 'non-linear', in which case the elements work synergistically. An example of synergy would be a dependency of B on the product of the concentrations of some genes X and Y.

Generally, simpler models scale to larger numbers of genes, but are less informative, as summarized by the classes of network-inference methods listed in Table 1. Virtually all approaches deteriorate as the size of networks becomes larger, some more than others. Fortunately, biology tends to be modular, so large analyses can be broken down into smaller ones and then recombined [5].

Table 1 Methods for network inference

The approaches to network inference fall into the following categories, which can be classified based on level of information richness (low, medium and high) and scalability of the derived network (large, medium and small networks), as shown in Table 1. High information richness would, for example, allow the inference of the dynamic behavior of a network [21], whereas low information richness would give some approximation to the connectivity of a causal GRN.

Correlation techniques are techniques that try to find single source-target relationships. To try to isolate the possibly mutual influence of one gene on another, many researchers make use of partial correlations. Schaefer and Strimmer [33] and Ingkasuwan et al. [29] have presented an analysis of graphical Gaussian models. These models assume a Gaussian noise distribution and try to infer partial correlations (gene X influences gene Y, while holding the effects of other genes constant). Partial correlations can be computed indirectly by calculating regressions and correlations among the residuals. Such analyses require heuristic approximations for large networks because the number of experiments (for example, microarrays) is always far fewer than the number of genes. Thus, partial-correlation approaches can result in medium-sized networks (up to 100 genes) (Table 1).

Like correlation, 'mutual information' [28] seeks pairwise relationships among variables without assumptions of linear or rank dependencies. Also, like correlation, mutual information can be used for large-scale networks and does not try to compute the weight of influence of one gene on another in predicting the expression value of the target.

Use of differential equations, often based on mass action, yields equations of the form: Rate of change in gene A concentration = Synthesis rate - Decay rate. Such approaches work especially well for small, information-rich networks such as the auxin networks mentioned above [5, 34]. An issue with the mass-action approach is that it assumes that different inputs interact in a multiplicative manner (product of concentration of each component), whereas the interaction is likely to be more complex in biological, as opposed to chemical, settings.

An alternative approach to network inference is to use a Boolean approach, which allows other logical relationships among regulators and their targets [5, 10, 12]. Logic gates are based on thresholds - for example, an 'AND gate' will have an effect on target if the minimum input reaches a certain threshold, thus permitting non-linear relationships. These tend to work better on smaller networks than linear equations and better than multiplicative relationships in modeling regulation (Table 1).

Closely related to Boolean approaches are decision/regression tree approaches that embody paths of threshold tests (where each path represents a Boolean conjunction of conditions) leading to a prediction (for example, of expression values). 'Gene network inference with ensemble of trees 3' (GENIE3) is a regression tree algorithm that can be applied to steady-state, time-series and/or mutational transcriptome data [35]. This approach has worked particularly well in 'dialogue for reverse engineering assessments and methods 3' (DREAM3) competitions that use in silico data as benchmarks for validating the predictive power of inferred networks [36].

'Integrative genomic' techniques analyze how changes can cause divergent behavior over time [37]. The idea is that genes are in some steady-state before some perturbation occurs, and the technique follows the genes that change first, that change second and so on to try to guess causality. This is the qualitative idea behind the differential-equation approaches.

Pipeline approaches typically combine different algorithms on different data types. For example, the Inferelator is a network-inference approach that uses differential-equation techniques and mutual information to integrate many different data types, including steady-state, time-series and mutation/perturbation data [38, 39]. These algorithms treat knowledge in a pipelined fashion. Thus, if physical experiments show that a target gene Z has potential connections from X and Y but not from W, then only X and Y will be considered in the subsequent analysis. The time-series-based inference algorithm then might use these potential edges to derive an ordinary differential-equation model that can combine linear and non-linear terms. The result of such a pipeline is a set of equations that estimate the change in transcription level of a target gene based on transcriptional levels of other genes using time-series data. Figure 1 illustrates the concept of such pipeline approaches, which refine large, information-poor networks into smaller, information-rich networks with predictive power.

Figure 1
figure 1

An experimental/computational systems-biology cycle using different data types and feedback. Starting from many possible edges, different data types and their analyses successively reduce the size of the network, while increasing confidence in edges. (1) Correlation leads to pairwise associations of genes. (2) Transgenic manipulation permits the determination of the effect of mutations and overexpression of single genes. (3) Binding experiments (for example, Chip-Seq) reveals physical connectivity of a source gene to a target. (4) Time-series experiments along with machine-learning techniques lead to a weighted network where the weight on the edge from A to B determines the extent of influence of A on B. (5) Subsequent predictions followed by validations can then suggest the need for new experimentation, thus refueling the systems-biology cycle.

Finally, other work importantly suggests trying many network-inference methods in combination [20], showing empirically that a combination of strategies often leads to the best network resolution and supporting the widespread popular use of the 'wisdom of crowds' concept.

Validations of inferred GRNs (an experimentalist's point of view)

GRN modeling described in the above sections complements genetic studies and generates hypotheses for TF-target interactions to be tested, thus inspiring a new round of the systems-biology cycle of high-throughput experimentation for model validation and refinement (Figure 1). A variety of methods have been used to uncover the global structure of gene networks by inferring regulatory relationships between TFs and their target genes from genomic data [6, 4043], in particular transcriptional analysis and chromatin immuno-precipitation.

The most common approach has been TF perturbation in stable overexpression or knockout/knockdown lines, followed by transcriptional analysis [4447]. However, it remains unclear in such analyses whether changes in transcript levels are a direct consequence of TF manipulation or whether these changes are caused by indirect or possibly pleiotropic effects. To overcome the limitation of this approach, several other techniques have been used to supplement transcriptional data, including yeast one-hybrid assays [40] and electrophoretic mobility-shift assays [4850]. However, while these methods can result in a significant enrichment of direct targets, they are often time-consuming and not easily applicable to high-throughput analyses.

The introduction of ChIP-X, chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-seq) or tiling array (ChIP-chip) has greatly improved the genome-wide identification of TF binding sites and has uncovered many potential direct targets [5153]. Importantly, although ChIP-X reveals the binding of a TF onto a promoter, it does not indicate whether this results in activation/repression of gene expression [54]. Therefore, ChIP-X has often been combined with genome-wide transcriptional analysis to characterize the primary targets of a TF [5557].

Recently, novel combinations of these technologies have yielded vastly improved knowledge about interactions between TFs and their targets. For example, whole-plant studies using dexamethasone (DEX)-inducible TF translocation into the nucleus followed by separate ChIP-X experiments identified target genes both bound and regulated by a TF of interest [5860]. Another new technology was recently described by Bargmann and colleagues [61] in which a protoplast system combined with fluorescence-activated cell sorting (FACS) has been employed to scale-up validation of GRNs in vivo. Briefly, plant protoplasts are transformed with plasmid harboring a fluorescent selection marker together with the over-expression of a TF of interest fused to a glucocorticoid receptor from rat. Co-treatment of protoplasts with DEX and the protein synthesis inhibitor cycloheximide, which blocks secondary-target responses, results in the identification of only primary TF targets. This rapid technique makes it possible to perform high-throughput investigations/validations of TFs and the GRNs they regulate in plants [61]. Data from such high-throughput TF-target validations can then be fed-back into network-inference pipelines to refine predicted edges in the derived GRNs, in a true systems-biology cycle (Figure 1).

Perspectives

Plant systems biology is at the beginning of a new era, in which machine-learning techniques and experimental investigations mutually and iteratively reinforce one another. We believe that this experimental-analytical symbiosis will lead plant biologists to better and deeper insights into biological phenomena and will encourage computer scientists to develop new algorithms. Together, this symbiotic collaboration should accelerate the understanding of plants as systems.