A Bayesian active learning strategy for sequential experimental design in systems biology
Dynamical models used in systems biology involve unknown kinetic parameters. Setting these parameters is a bottleneck in many modeling projects. This motivates the estimation of these parameters from empirical data. However, this estimation problem has its own difficulties, the most important one being strong ill-conditionedness. In this context, optimizing experiments to be conducted in order to better estimate a system’s parameters provides a promising direction to alleviate the difficulty of the task.
Borrowing ideas from Bayesian experimental design and active learning, we propose a new strategy for optimal experimental design in the context of kinetic parameter estimation in systems biology. We describe algorithmic choices that make the method computationally tractable and fully automatic. Based on simulation, we show that it outperforms alternative baseline strategies, and demonstrate the benefit of considering multiple posterior modes of the likelihood landscape, as opposed to traditional schemes based on local and Gaussian approximations.
This analysis demonstrates that our new, fully automatic Bayesian optimal experimental design strategy has the potential to support the design of experiments for kinetic parameter estimation in systems biology.
Keywords: Systems biology; Kinetic parameter estimation; Active learning; Bayesian experimental design
Systems biology emerged a decade ago as the study of biological systems where interactions between relatively simple biological species generate overall complex phenomena. Quantitative mathematical models, coupled with experimental work, now play a central role in analyzing, simulating and predicting the behavior of biological systems. For example, ordinary differential equation (ODE) based models, which are the focus of this work, have proved very useful to model numerous regulatory, signaling and metabolic pathways, including for example the cell cycle in budding yeast, the regulatory module of the nuclear factor κB (NF-κB) signaling pathway, the MAP kinase signaling pathways, or the caspase function in apoptosis.
Such dynamical models involve unknown parameters, such as kinetic parameters, that one must guess from prior knowledge or estimate from experimental data in order to analyze and simulate the model. Setting these parameters is often challenging, and constitutes a bottleneck in many modeling projects. On the one hand, fixing parameters from estimates obtained in vitro with purified proteins may not adequately reflect the true activity in the cell, and is usually only feasible for a handful of parameters. On the other hand, optimizing parameters to reflect experimental data on how some observables behave under various experimental conditions is also challenging: some parameters may not be identifiable, or may only be estimated with large errors, because systematic quantitative measurements covering all variables involved in the system are frequently lacking. Many authors have found, for example, that finding parameters to fit experimental observations in nonlinear models is a very ill-conditioned and multimodal problem, a phenomenon sometimes referred to as sloppiness, a concept closely related to that of identifiability in system identification theory. When the system has more than a few unknown parameters, computational issues also arise in efficiently sampling the space of parameters, which has been found to be very rugged and sometimes misleading, in the sense that many sets of parameters that fit the experimental data well are meaningless from a biological point of view.
Optimizing the experiments to be conducted in order to alleviate non-identifiabilities and better estimate a system’s parameters therefore provides a promising direction to reduce the difficulty of the task, and has already been the subject of much research in systems biology. Some authors have proposed strategies involving random sampling of parameters near the optimal one, or at least coherent with available experimental observations, and systematic simulations of the model with these parameters in order to identify experiments that would best reduce the uncertainty about the parameters. A popular way to formalize and implement this idea is to follow the theory of Bayesian optimal experimental design (OED). In this framework, approximating the model by a linear model (and the posterior distribution by a normal distribution) leads to the well-known A-optimal or D-optimal experimental designs, which optimize a property of the Fisher information matrix (FIM) at the maximum likelihood estimator. FIM-based methods have the advantage of being simple and computationally efficient, but their drawback is that the assumption that the posterior probability is well approximated by a unimodal, normal distribution is usually too strong. To overcome this difficulty at the expense of computational burden, other methods that sample the posterior distribution by Markov chain Monte Carlo (MCMC) techniques have also been proposed. When the goal of the modeling approach is not to estimate the parameters per se, but to understand and simulate the system, other authors have also considered the problem of experimental design to improve the predictions made by the model, or to discriminate between different candidate models.
In this work we propose a new general strategy for Bayesian OED, and study its relevance for kinetic parameter estimation in the context of systems biology. As opposed to classical Bayesian OED strategies, which select the experiment that most reduces the uncertainty in parameter estimation, itself quantified by the variance or the entropy of the posterior parameter distribution, we formulate the problem in a decision-theoretic framework where we wish to minimize an error function quantifying how far the estimated parameters are from the true ones. For example, if we focus on the squared error between the estimated and true parameters, our method attempts to minimize not only the variance of the estimates, as in standard A-optimal designs, but also a term related to the bias of the estimate. This idea is similar to an approach that was proposed for active learning, where instead of just reducing the size of the version space (i.e., the number of models consistent with the observed data) the authors propose to directly optimize a loss function relevant for the task at hand. Since the true parameter needed to define the error function is unknown, we follow a similar approach and average the error function according to the current prior on the parameters. This results in a unique, well-defined criterion that can be evaluated and used to select an optimal experiment.
In the rest of this paper, we provide a rigorous derivation of this criterion, and discuss different computational strategies to evaluate it efficiently. The criterion involves an average over the parameter space according to a prior distribution, for which we designed an exploration strategy that proved to be efficient in our experiments. We implemented the criterion in the context of an iterative experimental design problem, where a succession of experiments with different costs is allowed and the goal is to reach the best final parameter estimation given a budget to be spent, a problem that was made popular by the DREAM 6 and DREAM 7 Network Topology and Parameter Inference Challenges. We demonstrate the relevance of our new OED strategy on a small simulated network in this context, and illustrate its behavior on the DREAM7 challenge. The method is fully automated, and we provide an R package to reproduce all simulations.
A new criterion for Bayesian OED
In this section we propose a new, general criterion for Bayesian OED. We consider a system whose behavior and observables are controlled by an unknown parameter θ* that we wish to estimate. For that purpose, we can design an experiment e, which in our application specifies which observables we observe, when, and under which experimental conditions. The completion of the experiment leads to an observation o, which we model as a random variable generated according to the distribution o ~ P(o|θ*;e). Note that although θ* is unknown, the distribution P(o|θ;e) is assumed to be known for any θ and e, and amenable to simulation; in our case, P(o|θ;e) typically involves the dynamical equations of the system when the parameters are known, and the noise model of the observations.
The expected risk R(e;π) of a candidate experiment e given our current estimate of the parameter distribution π is the criterion we propose in order to assess the relevance of performing e. In other words, given a current estimate π, we propose to select the best experiment to perform as the one that minimizes R(e;π). We describe in the next section more precisely how to use this criterion in the context of sequential experimental design where each experiment has a cost.
Sequential experimental design
In sequential experimental design, we sequentially choose an experiment to perform, and observe the resulting outcome. Given the past experiments e1,. . .,ek and corresponding observations o1,. . .,ok, we therefore need to choose what is the best next experiment ek+1 to perform, assuming in addition that each possible experiment e has an associated cost Ce and we have a limited total budget to spend.
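The sequential setting can be illustrated with a minimal Python sketch. It assumes the simplest possible selection rule: among the experiments that are still affordable and not yet performed, pick the one with the lowest estimated risk. How the cost Ce enters the selection rule is not specified in the text above, so this rule, as well as the names `sequential_design`, `estimate_risk` and `perform_and_update`, are hypothetical placeholders and not the authors' implementation.

```python
def sequential_design(costs, budget, estimate_risk, perform_and_update, state):
    """Greedy sequential design under a budget (illustrative sketch only).

    costs: dict mapping each candidate experiment to its cost C_e.
    estimate_risk(e, state): estimated risk R(e; pi) for the current state.
    perform_and_update(e, state): runs e and returns the updated state.
    The selection rule (lowest risk among affordable experiments) is an
    assumption made for this sketch.
    """
    remaining = budget
    performed = []
    while True:
        # Experiments that fit in the remaining budget and were not done yet.
        candidates = [e for e, c in costs.items()
                      if c <= remaining and e not in performed]
        if not candidates:
            break
        best = min(candidates, key=lambda e: estimate_risk(e, state))
        state = perform_and_update(best, state)
        remaining -= costs[best]
        performed.append(best)
    return performed, remaining, state


# Toy usage with made-up costs and risks.
toy_costs = {"a": 3.0, "b": 2.0, "c": 4.0}
toy_risk = {"a": 0.5, "b": 0.9, "c": 0.1}
chosen, left, _ = sequential_design(
    toy_costs, 5.0,
    lambda e, s: toy_risk[e],   # stand-in for the expected-risk estimate
    lambda e, s: s + [e],       # stand-in for the posterior update
    [],
)
```

In a real application the state would hold the current posterior sample, and the risk estimate would be recomputed after every purchased experiment.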
Evaluating the risk
The expected risk of an experiment R(e;π) (2) involves a double integral over the parameter space and an integral over the possible observations, a challenging setting for practical evaluation. Since no analytical formula can usually be derived to compute it exactly, we now present a numerical scheme that we found efficient in practice. Since the distribution πk over the parameter space after the k-th experiment cannot be manipulated analytically, we resort to sampling to approximate it and estimate the integrals by Monte Carlo simulations.
We see that the quantity wij(e) measures how similar the observation profiles are under the two alternatives θi and θj. A good experiment produces dissimilar profiles and thus low values of wij(e) when θi and θj are far apart. The resulting risk is thus reduced accordingly.
which can be interpreted as a weighted likelihood of the alternative when the observation is generated according to θi.
In most settings, generating a sample involves running a deterministic model, to be performed once for each θi, and degrading the output according to a noise model independently for each u. In our case, we used the solver proposed in  provided in the package  to simulate the ODE systems. Thus, a large number M can be used if necessary at minimal cost. Based on these samples, the approximated weights can be computed from (5), from which the expected risk of experiment e can be derived from (4).
Note that an appealing property of this scheme is that the same sample θi can be used to evaluate all experiments. We now need to discuss how to obtain this sample.
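A rough Python sketch of such a Monte Carlo estimate is given here. The exact formulas (4) and (5) are not reproduced in this text, so the sketch assumes one form consistent with the description: the weight wij(e) is the average, over M noisy observations simulated under θi, of the normalized likelihood of θj, and the risk is the loss-weighted average of these weights. Independent Gaussian noise is also assumed, and the names `expected_risk`, `simulate`, `noise_sd` and `loss` are hypothetical.

```python
import numpy as np


def expected_risk(theta, simulate, noise_sd, loss, M=200, rng=None):
    """Monte Carlo estimate of the expected risk of one experiment (sketch).

    theta: array (N, p), a sample from the current distribution pi.
    simulate(theta_i): noiseless observation vector for this experiment.
    noise_sd(y): per-component Gaussian noise standard deviation (assumed).
    loss(theta_i, theta_j): error between two parameter vectors.
    Returns (risk, W), where W[i, j] plays the role of w_ij(e).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(theta)
    # One deterministic model run per sampled parameter.
    means = np.array([simulate(t) for t in theta], dtype=float)   # (N, d)
    sds = np.array([noise_sd(m) for m in means], dtype=float)     # (N, d)
    L = np.array([[loss(ti, tj) for tj in theta] for ti in theta])
    W = np.zeros((N, N))
    for i in range(N):
        # M noisy observations generated under theta_i.
        obs = means[i] + sds[i] * rng.standard_normal((M, means.shape[1]))
        # Gaussian log-likelihood of every observation under every theta_j.
        z = (obs[:, None, :] - means[None, :, :]) / sds[None, :, :]
        loglik = -0.5 * np.sum(z ** 2, axis=2) - np.sum(np.log(sds), axis=1)[None, :]
        loglik -= loglik.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(loglik)
        post /= post.sum(axis=1, keepdims=True)       # normalize over j
        W[i] = post.mean(axis=0)                      # average over the M draws
    return float(np.sum(L * W) / N), W
```

Because the noiseless runs in `means` depend only on the sample and the experiment, increasing M only costs additional noise draws, which matches the remark that a large M is cheap.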
Sampling the parameter space
The method described in Algorithm 1 is independent of the sampling scheme used. However, convergence of posterior samples is essential to ensure good behaviour of the method. First, it is known that improper (or "flat") priors may lead to improper posterior distributions when the model contains non-identifiabilities. MCMC-based sampling schemes are known not to converge in these cases, so proper prior distributions are essential in this context. The second important element is numerical convergence of the sampling scheme, which is usually guaranteed only asymptotically. Fine-tuning the parameters that drive the scheme is necessary to ensure that one is close to convergence in a reasonable amount of time. To check appropriate sampling behaviour, we use a graphical heuristic. We draw ten different samples from the same posterior distribution, using different initialization seeds. For each model parameter, we compare the dispersion within each sample to the total dispersion obtained by concatenating the ten samples. The ratio of the two should be close to one. Such a heuristic can be used to tune parameters of the sampler, such as sample size or proposal distribution. More details and numerical results are given in Additional file 1: Annex B.
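The dispersion comparison can be sketched in a few lines of Python. The text does not say which dispersion measure is used, so the standard deviation is assumed here, and the function name `dispersion_ratio` is hypothetical.

```python
import numpy as np


def dispersion_ratio(chains):
    """Ratio of within-sample to pooled dispersion, per parameter (sketch).

    chains: array (n_chains, n_draws, n_params), e.g. ten samples drawn
    from the same posterior with different seeds.
    The standard deviation is an assumed choice of dispersion measure.
    """
    chains = np.asarray(chains, dtype=float)
    # Average of the standard deviations computed inside each sample.
    within = chains.std(axis=1, ddof=1).mean(axis=0)
    # Standard deviation after concatenating all samples.
    pooled = chains.reshape(-1, chains.shape[2]).std(axis=0, ddof=1)
    return within / pooled


rng = np.random.default_rng(0)
# Ten well-mixed samples: the ratio is expected to be close to one.
mixed = rng.standard_normal((10, 2000, 3))
# Ten samples stuck around different locations: the ratio drops well below one.
stuck = mixed + 5.0 * np.arange(10)[:, None, None]
ratio_mixed = dispersion_ratio(mixed)
ratio_stuck = dispersion_ratio(stuck)
```

A ratio well below one for some parameter indicates that individual runs explore only part of the region covered by the pooled sample, which suggests a larger sample size or a different proposal distribution.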
Enforcing regularity through the prior distribution
The advantage of this is twofold. First, it is reasonable to assume that variables we do not observe in a specific design vary smoothly with time. Second, this penalization avoids regions of the parameter space corresponding to very stiff systems, which are poor numerical models of reality and whose simulation is computationally demanding or simply makes the solver fail. This penalty term is only used in the local optimization phase, not during the Monte Carlo exploration of the posterior. The main reason for adopting such a scheme is numerical stability.
The choice of prior parameters directly affects the posterior distribution, especially when little data is available. In our experiments, the prior is chosen to be log-normal with large variance. This covers a wide range of potential physical values for each parameter (from 10^-9 to 10^9). The weight of the regularity-enforcing term also has to be determined. Since the purpose is to avoid regions corresponding to numerically unstable systems, we chose this weight to be relatively small compared to the likelihood term. In practical applications, these hyperparameters have to be chosen by considering the physical scale of the quantities to be estimated. Indeed, a wrong choice of hyperparameters leads to very biased estimates at the early stages of the design.
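As an illustration of a wide log-normal prior, the following Python sketch uses a prior centered at 1 with a standard deviation of 4.5 decades, so that roughly 95% of the prior mass falls between 10^-9 and 10^9. These hyperparameter values and the function names `log_prior` and `sample_prior` are assumptions made for the example; the values used in the study are not given in this text.

```python
import math
import numpy as np

LN10 = math.log(10.0)


def log_prior(theta, log10_median=0.0, log10_sd=4.5):
    """Log density of independent log-normal priors, summed over parameters.

    log10_median and log10_sd are hypothetical hyperparameters expressed
    in decades (powers of ten).
    """
    theta = np.asarray(theta, dtype=float)
    mu = log10_median * LN10        # mean of ln(theta)
    sigma = log10_sd * LN10         # standard deviation of ln(theta)
    log_theta = np.log(theta)
    return float(np.sum(
        -log_theta - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((log_theta - mu) / sigma) ** 2
    ))


def sample_prior(n_samples, n_params, log10_median=0.0, log10_sd=4.5, rng=None):
    """Draw parameter vectors from the same log-normal prior."""
    rng = np.random.default_rng() if rng is None else rng
    log10_theta = rng.normal(log10_median, log10_sd, size=(n_samples, n_params))
    return 10.0 ** log10_theta
```

Working in decades makes the link with the physical scale of each parameter explicit: shifting `log10_median` moves the center of the prior by orders of magnitude.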
Results and discussion
In silico network description
Various experiments can be performed on the network, producing new time course trajectories in unseen experimental conditions. An experiment consists of choosing an action to perform on the system and deciding which quantity to observe. The possible actions are:
do nothing (wild type);
delete a gene (remove the corresponding species);
knock down a gene (increase the messenger RNA degradation rate tenfold);
decrease gene ribosomal activity (decrease the parameter value tenfold).
These actions are coupled with 38 possible observable quantities:
messenger RNA concentration for all genes, at two possible time resolutions (2 possible choices);
protein concentration for a single pair of proteins, at a single resolution (resulting in 9×8/2 = 36 possible choices).
Purchasing data consists in selecting an action and an observable quantity. In addition, it is possible to estimate the constants (binding affinity and Hill coefficient) of one of the 13 reactions in the system. Different experiments and observable quantities have different costs, the objective being to estimate unknown parameters as accurately as possible, given a fixed initial credit budget. The costs of the possible experiments are described in Table S1 in Additional file 2: Annex A.
For simulation purposes, we fix an unknown parameter value θ* to control the dynamics of the system, and the risk of an estimator is defined in terms of the loss function.
The noise model used for data corruption is heteroscedastic Gaussian: given the true signal y, the corrupted signal has the form y+z1+z2, where z1 and z2 are centered normal variables with standard deviation 0.1 and (0.2×y), respectively.
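This noise model is straightforward to simulate. The Python sketch below corrupts a noiseless signal y as y + z1 + z2 with the standard deviations stated above. The function names are hypothetical, and the combined standard deviation returned by `noise_sd` additionally assumes that z1 and z2 are independent, which the text does not state explicitly.

```python
import numpy as np


def corrupt(y, rng=None, sd_add=0.1, sd_mult=0.2):
    """Return y + z1 + z2 with z1 ~ N(0, sd_add^2) and z2 ~ N(0, (sd_mult*y)^2)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    z1 = rng.normal(0.0, sd_add, size=y.shape)
    # Concentrations are non-negative; abs() only guards the scale argument.
    z2 = rng.normal(0.0, sd_mult * np.abs(y))
    return y + z1 + z2


def noise_sd(y, sd_add=0.1, sd_mult=0.2):
    """Total noise standard deviation, assuming z1 and z2 are independent."""
    y = np.asarray(y, dtype=float)
    return np.sqrt(sd_add ** 2 + (sd_mult * y) ** 2)
```

The total standard deviation grows with the signal, so higher concentrations are observed with larger absolute noise.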
Performance on a 3-gene subnetwork
In order to assess the performance of our sequential OED strategy in an easily reproducible setting, we first compare it to other strategies on a small network made of 3 genes. We take the same architecture as in Figure 2, only considering proteins 6, 7 and 8. The resulting model has 6 variables (the mRNA and protein concentrations of the three genes) whose behavior is governed by 9 parameters. There are 50 possible experiments to choose from for this subnetwork: 10 perturbations (wild type and 3 perturbations for each gene) and 5 observables (mRNA concentrations at two different time resolutions and each protein concentration at a single resolution). We compare three ways to sequentially choose experiments in order to estimate the 9 unknown parameters: (i) our new Bayesian OED strategy, including the multimodal sampling of parameter space, (ii) the criterion proposed in equation (13) in  together with our posterior exploration strategy, and (iii) a random experimental design, where each experiment not done yet is chosen with equal probability. The comparison of (i) and (ii) is meant to compare our strategy with a criterion that proved to be efficient in a similar setting. The comparison to (iii) is meant to assess the benefit, if any, of OED for parameter estimation in systems biology. Since all methods involve randomness, we repeat each experiment 10 times with different pseudo-random number generator seeds.
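The size of this design space and the random baseline (iii) can be sketched in Python as follows. The string labels for actions and observables are hypothetical and chosen only for readability; the baseline is implemented as sampling without replacement, which is one way to choose each experiment not yet done with equal probability.

```python
import itertools
import random

genes = [6, 7, 8]
perturbation_kinds = ("delete", "knock down", "decrease RBS activity of")

# Wild type plus three perturbations per gene: 1 + 3 * 3 = 10 actions.
actions = ["wild type"] + [
    f"{kind} gene {g}" for g in genes for kind in perturbation_kinds
]
# mRNA at two time resolutions plus one observable per protein: 2 + 3 = 5.
observables = ["mRNA (resolution 1)", "mRNA (resolution 2)"] + [
    f"protein {g}" for g in genes
]
# Every (action, observable) pair is a candidate experiment: 10 * 5 = 50.
experiments = list(itertools.product(actions, observables))


def random_design(candidates, n_experiments, seed):
    """Random baseline: draw experiments uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(candidates, n_experiments)


# Ten repetitions with different seeds, as in the comparison described above.
repetitions = [random_design(experiments, 5, seed) for seed in range(10)]
```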
Results on the full DREAM7 network
Table: Estimation of the expected risk. Actions listed: Delete gene 7; Decrease gene 7 RBS activity; Knock down gene 7; Decrease gene 7 RBS activity; Decrease gene 7 RBS activity; Delete gene 9; Knock down gene 7; Delete gene 7; Decrease gene 7 RBS activity; Decrease gene 7 RBS activity.
Moreover, our criterion determines that it is better to observe protein 3 than protein 5, which makes sense since the only protein that affects the evolution of protein 5 is protein 8 (see Figure 2). Uncertainty about the protein 5 time course is therefore tightly linked to the protein 8 time course, and observing protein 3 brings more information than observing protein 5. This might not be obvious when looking at the graph in Figure 4 and could not have been foreseen by a method that considers uncertainty about each protein independently. At this point, we purchase the protein 3 and 8 time courses for the gene 7 deletion experiment and highlight in red in Figure 4 the profiles of proteins 3 and 8 obtained from the system.
Some parameters, like p_degradation_rate or pro3_strength, clearly concentrate around a single value while others, like pro1_strength or pro2_strength, have very wide ranges with multiple accumulation points. Despite this variability in parameter values, the protein time course trajectories are very similar. It appears that the protein 5 time course is less concentrated than the two others. This is due to the heteroscedasticity of the noise model, which is reflected in the likelihood. Indeed, the noise model is Gaussian with standard deviation increasing with the value of the corresponding concentration. Higher concentrations are harder to estimate due to larger noise standard deviation.
Computational systems biology increasingly relies on the heavy use of computational resources to improve the understanding of the complexity underlying cell biology. A widespread approach in computational systems biology is to specify a dynamical model of the biological process under investigation based on biochemical knowledge, and consider that the real system follows the same dynamics for some kinetic parameter values. Recent reports suggest that this has benefits in practical applications. Systematic implementation of the approach requires dealing with the fact that most kinetic parameters are often unknown, raising the issue of estimating these parameters from experimental data as efficiently as possible. An obvious sanity check is to recover kinetic parameters from synthetic data where the dynamics and the noise model are well specified, which is already quite a challenge.
In this paper we proposed a new general Bayesian OED strategy, and illustrated its relevance on an in silico biological network. The method takes advantage of the Bayesian framework to sequentially choose experiments to be performed, in order to estimate these parameters subject to cost constraints. The method relies on a single numerical criterion and does not depend on a specific instance of this problem. This is in our opinion a key point for dealing reproducibly with large-scale networks, of size comparable to that of a cell for example. Experimental results suggest that the strategy has the potential to support experimental design in systems biology.
As noted by others, an approach focused on kinetic parameter estimation is questionable. We also give empirical evidence that very different parameter values can produce very similar dynamical behaviors, potentially leading to non-identifiability issues. Moreover, focusing on parameter estimation supposes that the dynamical model represents the true underlying chemical process. In some cases, this might simply be false. For example, hypotheses underlying the law of mass action are not satisfied in the gene transcription process. However, simplified models might still be good proxies to characterize the dynamical behaviors we are interested in. The real problem of interest is often to reproduce the dynamics of a system in terms of observable quantities, and to predict the system behavior for unseen manipulations. Parameters can be treated as latent variables which impact the dynamics of the system but cannot be observed. In this framework, the Bayesian formalism described here is well suited to tackle the problem of experimental design.
The natural continuation of this work is to adapt the method to larger problems. This raises computational issues and requires developing numerical methods that scale well with the size of the problem. Sampling strategies that adapt to the local geometry and to multimodal aspects of the posterior are interesting directions to investigate in this context. The main bottlenecks are the cost of simulating large dynamical systems, and the need for large sample sizes in higher dimensions for accurate posterior estimation. Posterior estimation in high dimensions is known to be hard and is an active subject of research. Although our Bayesian OED criterion is independent of the model investigated, a good sampling strategy is likely to benefit from specific tuning in order to perform well on specific problem instances. As for reducing the computational burden of simulating large dynamical systems, promising research directions are parameter estimation methods that do not involve dynamical system simulation, or differential equation simulation methods that take into account both parameter uncertainty and numerical uncertainty, such as probabilistic integrators.
Availability of supporting data
An R package that allows our results and simulations to be reproduced is available at the following URL: https://doi.org/cran.r-project.org/package=pauwels2014.
The authors would like to thank Gautier Stoll for insightful discussions. This work was supported by the European Research Council (SMAC-ERC-280032). Most of this work was carried out during EP’s PhD at Mines ParisTech.
- 12. Brown KS, Sethna JP: Statistical mechanical approaches to models with many poorly known parameters. Phys Rev E 2003, 68:021904.
- 46. Roy N, McCallum A: Toward optimal active learning through sampling estimation of error reduction. Proceedings of the Eighteenth International Conference on Machine Learning. 2001, Morgan Kaufmann Publishers Inc, San Francisco, 441–448.
- 47. Dialogue for Reverse Engineering Assessments and Methods (DREAM) website. Accessed 2013-12-28. [https://doi.org/www.the-dream-project.org]
- 48. DREAM6 Estimation of Model Parameters Challenge website. Accessed 2013-12-28. [https://doi.org/www.the-dream-project.org/challenges/dream6-estimation-model-parameters-challenge]
- 49. DREAM7 Estimation of Model Parameters Challenge website. Accessed 2013-12-28. [https://doi.org/www.the-dream-project.org/challenges/network-topology-and-parameter-inference-challenge]
- 53. Nocedal J, Wright S: Numerical Optimization. 2006, Springer, New York.
- 57. Calderhead B, Girolami M, Lawrence ND: Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. In Adv. Neural. Inform. Process Syst: MIT Press; 2008:217–224.
- 58. Chkrebtii O, Campbell DA, Girolami MA, Calderhead B: Bayesian uncertainty quantification for differential equations. 2013. Technical Report 1306.2365, arXiv.
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.