Graphical models for inferring single molecule dynamics
The recent explosion of experimental techniques in single molecule biophysics has generated a variety of novel time series data requiring equally novel computational tools for analysis and inference. This article describes in general terms how graphical modeling may be used to learn from biophysical time series data using the variational Bayesian expectation maximization algorithm (VBEM). The discussion is illustrated by the example of single-molecule fluorescence resonance energy transfer (smFRET) versus time data, where the smFRET time series is modeled as a hidden Markov model (HMM) with Gaussian observables. A detailed description of smFRET is provided as well.
The VBEM algorithm returns the model’s evidence and an approximating posterior parameter distribution given the data. The former provides a metric for model selection via maximum evidence (ME), and the latter a description of the model’s parameters learned from the data. ME/VBEM provide several advantages over the more commonly used approach of maximum likelihood (ML) optimized by the expectation maximization (EM) algorithm, the most important being a natural form of model selection and a well-posed (non-divergent) optimization problem.
The results demonstrate the utility of graphical modeling for inference of dynamic processes in single molecule biophysics.
Keywords: Model Selection, Hidden Markov Model, Transition Rate, Graphical Modeling, Fluorescence Resonance Energy Transfer
For example, some of these models will involve conversion of chemical to mechanical energy, or motion associated with diffusion, or motion associated with transitions between distinct configurational states. Modeling the data, then, typically involves introducing several variables (some observed, others latent or "hidden"; some real-valued coordinates, others discrete states) and specifying algebraically how they are related. Such algebraic relations among a few variables are typical in physical modeling (e.g., the stochastic motion of a random walker, or the assumption of additive, independent, normally distributed errors typical in regression); models involving multiple conditionally dependent observations or hidden variables with more structured noise behavior are less common. Implicitly, each equation of motion or of constraint specifies which variables are conditionally dependent and which are conditionally independent. Graphical modeling begins by charting these dependencies among a set of nodes, with edges corresponding to the conditional probabilities which must be algebraically specified (i.e., the typical elements of a physical model); it thereby organizes this process and facilitates basing inference on such models [13, 14, 15].
Here we explore the application of a specific subset of GMs to biophysical time series data using a specific algorithmic approach for inference: the directed GM and the variational Bayesian expectation maximization algorithm (VBEM). After discussing the theoretical basis and practical advantages of this general approach to modeling biophysical time series data, we apply the method to the problem of inference given single molecule fluorescence resonance energy transfer (smFRET) time series data. We emphasize the process and caveats of modeling smFRET data with a GM and demonstrate the most helpful features of VBEM for this type of time series inference.
GMs are a flexible inference framework based on factorizing a (high-dimensional) multivariate joint distribution into (lower-dimensional) conditionals and marginals [13, 14, 15]. In a GM, the nodes of the graph represent observable variables (data, denoted by filled circles), latent variables (hidden states, denoted by open circles), or fixed parameters (denoted by dots). Directed edges between nodes represent conditional probabilities. For example, the three-node graphical model X → Y → Z implies that the joint distribution p(Z, Y, X) ≡ p(Z|Y, X)p(Y|X)p(X) can be further factorized as p(Z|Y)p(Y|X)p(X). Data with a temporal component are modeled by connecting arrows from variables at earlier time steps to variables at later time steps. In many graphical models, the number of observed and latent variables grows with the size of the data set under consideration. To avoid clutter, these variables are written once and placed in a box, often called a "plate", labeled with the number of times the variables are repeated. This manuscript will denote hidden variables by z and observed data by d. Parameters which are vectors will be denoted as such by overhead arrows.
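The chain factorization above can be checked numerically. The following minimal Python sketch (the probability tables are invented for illustration) assembles the joint p(Z, Y, X) = p(Z|Y)p(Y|X)p(X) from its factors and verifies that it normalizes:

```python
# Conditional probability tables for the chain X -> Y -> Z (invented numbers).
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_y_given_x[x][y]
p_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}   # p_z_given_y[y][z]

def joint(x, y, z):
    """p(Z, Y, X) assembled from the factorization p(Z|Y) p(Y|X) p(X)."""
    return p_z_given_y[y][z] * p_y_given_x[x][y] * p_x[x]

# The factorized joint normalizes to 1 over all eight configurations.
total = sum(joint(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1))
```

Marginals follow from the same factors; e.g., summing joint(x, 1, z) over x and z recovers p(Y = 1).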
In such a simple case it is straightforward to arrive at the expression in Eq. 1 without the use of a GM, but such a chart makes this factorization far more obvious and interpretable.
Inference of GMs
In some contexts, one wishes to learn the probability of the hidden states given the observed data, p(z⃗|d⃗, θ⃗, K), where θ⃗ denotes the parameters of the model and K denotes the number of allowed values of the latent variables (i.e., the number of hidden states). If θ⃗ is known, then efficient inference of p(z⃗|d⃗, θ⃗, K) can be performed on any loop-free graph with discrete latent states using the sum-product algorithm, or, if only the most probable values of z⃗ are needed, using the closely related max-sum algorithm. A loop occurs in a graph when multiple pathways connect two variables; loops are unlikely in a graph modeling time series data. Inference for models with continuous latent variables is discussed in [18, 19]. For most time series inference problems in biophysics, both z⃗ and θ⃗ are unknown. In these cases, a criterion for choosing a best estimate of θ⃗ and an optimization algorithm to find this estimate are needed.
Inference via maximum likelihood
The probability p(d⃗|θ⃗), viewed as a function of θ⃗, is known as the likelihood. The expectation maximization (EM) algorithm can be used to estimate θ⃗. In EM, an initial guess for θ⃗ is used to calculate p(z⃗|d⃗, θ⃗). The p(z⃗|d⃗, θ⃗) learned is then used to calculate a new guess for θ⃗. The algorithm iterates until convergence and is guaranteed to converge to a local optimum of the likelihood. The EM algorithm should be run with multiple initializations of θ⃗, often called "random restarts", to increase the probability of finding the globally optimal θ⃗.
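The EM loop and the role of random restarts can be sketched on a model simpler than the HMM treated here, a 1-D Gaussian mixture with invented data (for the HMM, the E-step would instead use the forward-backward algorithm):

```python
import math
import random

def log_likelihood(data, mu, var, pi):
    """Log likelihood of 1-D data under a Gaussian mixture."""
    return sum(math.log(sum(pi[k] / math.sqrt(2 * math.pi * var[k])
                            * math.exp(-(d - mu[k]) ** 2 / (2 * var[k]))
                            for k in range(len(mu))))
               for d in data)

def em_gmm_1d(data, K=2, n_iter=50, seed=0):
    """One EM run: the E-step computes responsibilities p(z|d, theta);
    the M-step re-estimates theta from them."""
    rng = random.Random(seed)
    mu = [rng.choice(data) for _ in range(K)]        # initial guess
    var, pi = [1.0] * K, [1.0 / K] * K
    for _ in range(n_iter):
        # E-step: r[n][k] = p(z_n = k | d_n, theta)
        r = []
        for d in data:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(d - mu[k]) ** 2 / (2 * var[k])) for k in range(K)]
            r.append([wk / sum(w) for wk in w])
        # M-step: re-estimate theta from the responsibilities
        for k in range(K):
            nk = sum(rn[k] for rn in r)
            mu[k] = sum(r[n][k] * data[n] for n in range(len(data))) / nk
            var[k] = max(sum(r[n][k] * (data[n] - mu[k]) ** 2
                             for n in range(len(data))) / nk, 1e-6)
            pi[k] = nk / len(data)
    return mu, var, pi

# Random restarts: run EM from several initializations and keep the run
# with the highest likelihood.
data = [0.1, 0.05, 0.12, 0.9, 0.95, 0.88, 0.08, 0.92]
best = max((em_gmm_1d(data, seed=s) for s in range(10)),
           key=lambda theta: log_likelihood(data, *theta))
```

Note the variance floor of 1e-6 in the M-step; without some such safeguard, ML estimation of this model is ill-posed, as discussed below.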
Model selection: The first limitation of ML is that it has no form of model selection: the likelihood monotonically increases with the addition of more model parameters. This problem of fitting too many states to the data (overfitting) is highly undesirable for biophysical time series data, where learning the correct K for the data is often an experimental objective.
Ill-posedness: The second problem with ML occurs only in the case of a model with multiple hidden states and a continuous observable (a case which includes the majority of biophysical time series data, including the smFRET data we will consider here). If the mean of one hidden state approaches the position of a data point and the variance of that state approaches zero, the contribution of that datum to the likelihood will diverge. When this happens, the likelihood will be infinite regardless of how poorly the rest of the data are modeled, provided the other states in the model have non-zero probabilities for the rest of the data. For such models, the ML method is ill-posed: poor parameters can still result in infinite likelihood.
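This divergence is easy to reproduce numerically. In the hypothetical two-state mixture below (all numbers invented), one state's mean is pinned to the first datum; shrinking that state's variance increases the likelihood without bound even though the state describes the rest of the data poorly:

```python
import math

def gaussian_logpdf(d, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (d - mu) ** 2 / (2 * var)

data = [0.1, 0.5, 0.52, 0.48, 0.9]

def mixture_log_likelihood(var0):
    """Equal-weight two-state mixture; state 0 is centered exactly on the
    first datum with variance var0, state 1 is fixed at N(0.5, 0.01)."""
    ll = 0.0
    for d in data:
        p = (0.5 * math.exp(gaussian_logpdf(d, data[0], var0))
             + 0.5 * math.exp(gaussian_logpdf(d, 0.5, 0.01)))
        ll += math.log(p)
    return ll

# Shrinking var0 keeps increasing the likelihood without bound, even though
# state 0 fits only a single data point.
lls = [mixture_log_likelihood(v) for v in (1e-2, 1e-4, 1e-6, 1e-8)]
```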
In practical contexts, the second problem (divergent likelihood) can be avoided either by performing maximum a posteriori (MAP) estimation (maximizing the likelihood times a prior which penalizes small variances) or by discarding solutions for which the likelihood diverges; that is, one does not actually maximize the likelihood. Model selection can then be performed via cross-validation or by penalizing the likelihood with a term which monotonically increases with model complexity [15, 21, 22]. We consider, instead, an alternative optimization criterion which naturally avoids both problems.
Inference via maximum evidence
The quantity p(d⃗|K) is called the evidence. It is sometimes also referred to as the marginal likelihood, since the unknown parameters are assigned probability distributions and are marginalized (summed or integrated) over all possible settings. The evidence penalizes both models which underfit and models which overfit. The second expression in Eq. 3 follows readily from the sum rule of probability, provided we are willing to model the parameters themselves as random variables; that is, we are willing to specify a distribution over parameters, p(θ⃗|K). This distribution is called the "prior", since it can be thought of as the probability of the parameters prior to seeing any data. The parameters of the prior distribution p(θ⃗|K) are called hyperparameters. In addition to providing a method for model selection, ME avoids the ill-posedness problem associated with ML by integrating over parameters to calculate the evidence rather than using a "best" point estimate of the parameters.
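The complexity penalty built into the evidence can be seen in a toy example far simpler than smFRET (the coin example is ours, not the article's): a fixed fair-coin model is compared against a model whose unknown bias is marginalized under a uniform prior.

```python
import math

# Model 0: p(heads) fixed at 1/2 (no free parameters).
def evidence_fixed(heads, flips):
    return math.comb(flips, heads) * 0.5 ** flips

# Model 1: unknown bias q with a uniform Beta(1,1) prior, integrated out:
# integral_0^1 C(N,n) q^n (1-q)^(N-n) dq = 1 / (N + 1).
def evidence_marginal(heads, flips):
    return 1.0 / (flips + 1)

fair_data = (10, 20)     # data consistent with a fair coin
biased_data = (19, 20)   # strongly biased data
```

For the fair data the simpler model has higher evidence; for the biased data the marginalized model wins despite its complexity penalty.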
Although ME provides an approach to model selection, calculation of the evidence does not, on its own, provide an estimate for θ⃗. The VBEM approach to estimating the evidence does, however, provide a mechanism to estimate θ⃗. VBEM can be thought of as an extension of EM suited to ME. Both the VBEM algorithm and considerations for choosing priors are discussed in Methods.
The photophysics of FRET have been studied for over half a century, but the first smFRET experiments were carried out only about fifteen years ago. The field has been growing exponentially since then, and hundreds of smFRET papers are published annually. Diverse topics such as protein folding, RNA structural dynamics, and DNA-protein interactions have been investigated via smFRET. The size and complexity of smFRET experiments have grown substantially since the original smFRET publication; a modern smFRET experiment can generate thousands of time series to be analyzed.
Results and discussion
smFRET as a graphical model
For a time series of length T where each latent variable can take on K states, a brute-force summation over all possible states requires O(K^T) calculations. By exploiting efficiencies in the GM and using the sum-product algorithm, this summation can be performed using O(K²T) calculations (which can be seen by noting that the latent state probabilities in Eq. 6 factorize into p(z_t|z_{t−1}, A, K), where each of the T transition factors involves K² possible combinations of adjacent states). The sum-product algorithm applied to the HMM is called the forward-backward algorithm or the Baum-Welch algorithm, and the most probable trajectory is called the Viterbi path.
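A minimal sketch of this speed-up, for a discrete-output HMM with invented parameters, compares the O(K²T) forward recursion with brute-force O(K^T) enumeration of hidden paths; the two agree to machine precision:

```python
import itertools

# Invented HMM parameters with discrete observations.
A = [[0.9, 0.1], [0.2, 0.8]]      # A[i][j] = p(z_t = j | z_{t-1} = i)
B = [[0.8, 0.2], [0.3, 0.7]]      # B[k][d] = p(d_t = d | z_t = k)
pi = [0.6, 0.4]
obs = [0, 1, 1, 0, 1]
K, T = 2, len(obs)

def forward(obs):
    """p(d_1..d_T) via the sum-product (forward) recursion: O(K^2 T)."""
    alpha = [pi[k] * B[k][obs[0]] for k in range(K)]
    for t in range(1, T):
        alpha = [B[k][obs[t]] * sum(alpha[j] * A[j][k] for j in range(K))
                 for k in range(K)]
    return sum(alpha)

def brute_force(obs):
    """The same quantity by explicit summation over all K^T hidden paths."""
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total
```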
There are several assumptions of this model which should be considered. First, although it is common to assume that the noise of smFRET states is Gaussian, the assumption does not have a theoretical justification (and since FRET efficiencies can only lie in the interval (0, 1), while the Gaussian distribution has support (−∞, ∞), the data cannot be truly Gaussian). Despite this caveat, several groups have successfully modeled smFRET data as having Gaussian states [25, 34, 35]. We note that other distributions have been considered as well.
Second, the HMM assumes that the molecule instantly switches between hidden states. If the time it takes the molecule to transition between conformations is on the same (or similar) order of magnitude as the time it spends within a conformation, the HMM is not an appropriate model for the process and a different GM will be needed. For many molecular processes, such as protein domain rearrangements, the molecule transitions between conformations orders of magnitude faster than it remains in a conformation and the HMM can model the process well .
Third, the HMM is "memoryless" in the sense that, given its current state, the transition probabilities are independent of the past. It is still possible to model a molecule which sometimes transitions between states quickly and sometimes transitions between states slowly (if, for example, binding of another small molecule to the molecule being studied changes its transition rates). This situation can be modeled using two latent states for each smFRET state: the two latent states share the same emission model parameters but have different transition rates.
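A sketch of this construction (all probabilities invented for illustration): each FRET state is duplicated into a "fast" and a "slow" latent copy that share emission parameters but exit at different per-frame probabilities.

```python
# All probabilities below are invented for illustration.
fret_means = [0.2, 0.7]            # emission means shared by both copies of a state
exit_fast, exit_slow = 0.3, 0.01   # per-frame probability of leaving a FRET state
switch = 0.05                      # fast <-> slow interconversion probability

def expanded_transition_matrix():
    """4x4 transition matrix over (FRET state, speed) latent pairs."""
    states = [(k, s) for k in range(2) for s in ("fast", "slow")]
    A = []
    for (k, s) in states:
        exit_p = exit_fast if s == "fast" else exit_slow
        row = []
        for (k2, s2) in states:
            if k2 == k:
                # remain in the same FRET state, possibly swapping speed label
                row.append(switch if s2 != s else 1.0 - exit_p - switch)
            else:
                # leave for the other FRET state, split evenly across its speeds
                row.append(exit_p / 2.0)
        A.append(row)
    return A, states

A, states = expanded_transition_matrix()
```

Because the two copies of a FRET state emit identically (both use fret_means[k]), only the dwell-time statistics, not the observed FRET levels, distinguish them.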
Illustration of the inference
Model selection: For each trace, L(q), the lower bound of the log(evidence), was calculated for 1 ≤ K ≤ 7. The results are shown in Fig. 5A, with the largest value of L(q) for each trace shown in red. For traces 1 and 2, L(q) peaks at K* = 3, correctly inferring the complexity of the model. For trace 3, the noise of the system is too large, given the length of the trace, to infer three clearly resolved states; for this trace L(q) peaks at K* = 2. This result illustrates an important consideration of evidence-based model selection: states which are distinct in a generative model (or in an experiment generating data) may not give rise to statistically significant states in the data generated. For example, two states which have identical means, variances, and transition rates would be statistically indistinguishable from a single state with those parameters. When states are resolvable, however, ME-based model selection is generally effective, as demonstrated by traces 1 and 2.
Posterior distributions: The ability to learn a complete posterior distribution for θ⃗ provides more information than simply learning a point estimate of θ⃗, and is a feature unique to Bayesian statistics. The maximum of the distribution, denoted θ⃗_MAP, can be used as an estimate of θ⃗ (e.g., if idealized trajectories are needed). The subscript here differentiates it from the estimate in the absence of the prior, θ⃗_ML. The curvature of the distribution describes the certainty of the θ⃗_MAP estimate. As a demonstration, the posterior for the mean of the lowest smFRET state of each trace is shown in Fig. 5B. The X and Y axes are the same in all three plots, so the distributions can be compared. As expected, the lower the noise in the trace, the narrower the posterior distribution and the higher the confidence of the estimate for µ. The estimate of µ for trace 3 is larger than in the other traces because K* = 2; some of the middle smFRET state data are misclassified as belonging to the low smFRET state.
Idealized trajectories: Idealized smFRET trajectories can be a useful visual aid to report on inference. They are also a necessity for some forms of post-processing commonly used at present, such as dwell-time analysis. Idealized trajectories can be generated from the posterior learned by VBEM by using θ⃗_MAP to calculate the most probable hidden state trajectory (the Viterbi path). The idealized trajectories for each trace are shown in Fig. 5C. For traces 1 and 2, where K* is correctly identified, the idealized trajectory captures the true hidden state trajectory perfectly. Because of the model selection and well-posedness of ME/VBEM, the idealized trajectories learned with this method can be substantially more accurate than those learned by ML for some data sets.
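A compact sketch of the Viterbi recursion for a discrete-output HMM (parameters invented; in the smFRET setting the emissions would be Gaussian and the parameters would come from the learned posterior):

```python
import math

# Invented HMM parameters: sticky transitions, informative discrete emissions.
A = [[0.95, 0.05], [0.10, 0.90]]      # A[i][j] = p(z_t = j | z_{t-1} = i)
B = [[0.9, 0.1], [0.2, 0.8]]          # B[k][d] = p(d_t = d | z_t = k)
pi = [0.5, 0.5]

def viterbi(obs):
    """Most probable hidden-state trajectory, computed in log space."""
    K = len(pi)
    logv = [math.log(pi[k] * B[k][obs[0]]) for k in range(K)]
    back = []
    for d in obs[1:]:
        ptrs, new_logv = [], []
        for k in range(K):
            j = max(range(K), key=lambda j: logv[j] + math.log(A[j][k]))
            ptrs.append(j)
            new_logv.append(logv[j] + math.log(A[j][k] * B[k][d]))
        back.append(ptrs)
        logv = new_logv
    # Trace the best final state back to the start.
    path = [max(range(K), key=lambda k: logv[k])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

idealized = viterbi([0, 0, 0, 1, 1, 1, 0, 0])
```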
This manuscript demonstrates how graphical modeling, in conjunction with a detailed description of a biophysical process, can be used to model biophysical time series data effectively. The GM designed here is able to model smFRET data and learn both the number of states in the data and the posterior parameter distributions for those states. The ME/VBEM methodology used here offers several advantages over the more commonly used ML/EM inference approach, including intrinsic model selection and a well-posed optimization. All modeling assumptions are readily apparent from the GM. The GM framework with inference using ME/VBEM is a highly flexible modeling approach which we anticipate will be applicable to a wide array of problems in biophysics.
All code used in this manuscript is available open source at http://vbfret.sourceforge.net/.
Variational Bayesian expectation maximization
Unfortunately, calculation of Eq. 3 requires a sum over all K settings of each of the T latent variables z⃗ (where T is the length of the time series), i.e., K^T terms in total. Such a calculation is numerically intractable even for reasonably small systems (e.g., K = 2, T = 100), so an approximation to the evidence must be used. Several approximation methods, such as Monte Carlo techniques, exist for numerically approximating such sums. The method we will consider here is VBEM.
Summations over the discrete components of X should be included in these equations, but are omitted for notational simplicity. The equality in Eq. 7 results from the requirement that q(X) be a normalized probability; Eq. 8 rewrites p(X, d⃗) in terms of a conditional probability; and Eq. 11 reinserts (z⃗, θ⃗) for X and renames the two terms in Eq. 10 as L(q), the lower bound of the log(evidence), and the Kullback-Leibler divergence, respectively.
Using Jensen’s inequality, it can be shown that
D_KL(q||p) ≥ 0,   (12)
i.e., exp(L(q)) is a lower bound on the model's evidence. Eq. 12 implies that L(q) is maximized when q(X) is equal to p(X|d⃗). It follows that, at the maximum, q(θ⃗) approximates p(θ⃗|d⃗), the posterior distribution of the parameters. Therefore, the optimization simultaneously performs model selection (by finding a K which maximizes L(q)) and inference (by approximating p(θ⃗|d⃗)).
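The decomposition log(evidence) = L(q) + D_KL(q||p) can be verified directly on a toy discrete model with a single hidden variable (all numbers invented):

```python
import math

# Single hidden variable x with three values; all numbers invented.
p_x = [0.5, 0.3, 0.2]                              # prior p(x)
p_d_given_x = [0.9, 0.1, 0.4]                      # likelihood of the observed d
p_joint = [p_x[i] * p_d_given_x[i] for i in range(3)]
p_d = sum(p_joint)                                 # the evidence p(d)
p_post = [pj / p_d for pj in p_joint]              # exact posterior p(x|d)

q = [0.6, 0.3, 0.1]                                # an arbitrary normalized q(x)
L_q = sum(q[i] * math.log(p_joint[i] / q[i]) for i in range(3))
kl = sum(q[i] * math.log(q[i] / p_post[i]) for i in range(3))
```

For any normalized q, L_q + kl equals log(p_d) exactly, and since kl is non-negative, L_q is a lower bound that is tight only when q matches the exact posterior.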
Here ⟨·⟩ denotes the expected value with respect to the subscripted distribution, and the remaining constant in Eqs. 14 and 15 is a normalization constant. Whereas the log(evidence) is the log of a sum/integral, the right-hand sides of Eqs. 14 and 15 both involve the sum/integral of a log. This difference renders the log(evidence) intractable, yet Eqs. 14 and 15 tractable.
An interesting and potentially useful feature of the posteriors learned from VBEM is that when K is chosen to be larger than the number of states supported by the data, the optimization leaves the extra states unpopulated. This propensity to leave unnecessary states unpopulated in the posterior, sometimes called "extinguishing", is a second form of model selection intrinsic to VBEM, in addition to the model selection described by Eq. 3. An explanation for this behavior can be found in Chapter 3 of [40].
Several considerations should go into choosing a prior. Choosing distributions which are conjugate to the parameters of the likelihood can greatly simplify inference. Priors can be chosen to minimize their influence on the inference; such priors are called "weak" or uninformative. Alternatively, priors can be chosen to respect previously obtained experimental observations. It is important to check that inference results do not depend heavily on the prior (e.g., doubling or halving hyperparameter values should not affect inference results).
The conjugate prior of a multinomial distribution is a Dirichlet distribution: p(π⃗|u⃗) = Dir(π⃗|u⃗) ∝ ∏_k π_k^(u_k − 1). Expressed in terms of the precision λ rather than the variance σ² (where λ = 1/σ²), the conjugate prior for the mean and precision of a Gaussian is a Gaussian-Gamma distribution: p(µ, λ|m₀, β₀, a₀, b₀) = N(µ|m₀, (β₀λ)⁻¹) Gam(λ|a₀, b₀).
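For the Gaussian-Gamma case, conjugacy means the posterior hyperparameters have a closed form. The sketch below implements the standard update equations for 1-D Gaussian data (the default hyperparameter values are invented):

```python
def gaussian_gamma_update(data, m0=0.5, beta0=1.0, a0=1.0, b0=0.01):
    """Posterior hyperparameters (m_n, beta_n, a_n, b_n) of the conjugate
    Gaussian-Gamma prior N(mu|m0,(beta0*lam)^-1) Gam(lam|a0,b0) after
    observing 1-D Gaussian data.  Default hyperparameters are invented."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)          # sum of squared deviations
    beta_n = beta0 + n
    m_n = (beta0 * m0 + n * xbar) / beta_n           # precision-weighted mean
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * ss + beta0 * n * (xbar - m0) ** 2 / (2.0 * beta_n)
    return m_n, beta_n, a_n, b_n

m_n, beta_n, a_n, b_n = gaussian_gamma_update([0.1, 0.2, 0.3])
```

Because the posterior has the same functional form as the prior, these updates can be applied repeatedly as more data arrive.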
Synthetic traces were generated in MATLAB using 1-D Gaussian noise for each hidden state and a manually determined hidden state trajectory. All traces were analyzed by vbFRET, using its default parameter settings, for 1 ≤ K ≤ 7, with 25 random restarts for each value of K. The restart with the highest evidence was used to generate the data in Fig. 5. The posterior probability of µ_k is given by p(µ_k|λ_k, d⃗) = N(µ_k|m_k, (β_k λ_k)⁻¹), where m_k and β_k are the hyperparameters of the posterior. The data in Fig. 5B were generated using this equation with λ_k fixed at its most probable posterior value.
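A Python analogue of the synthetic-trace construction just described (the original analysis used MATLAB; the state means and noise level below are invented):

```python
import random

def synthetic_trace(state_path, means, sigma, seed=0):
    """Gaussian noise of width sigma added to a manually chosen hidden
    state trajectory."""
    rng = random.Random(seed)
    return [rng.gauss(means[z], sigma) for z in state_path]

# Three states with invented means and a manually determined trajectory.
path = [0] * 50 + [1] * 50 + [2] * 50
trace = synthetic_trace(path, means=[0.2, 0.5, 0.8], sigma=0.05)
```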
JEB contributed to the graphical modeling and smFRET inference. JMH contributed to the graphical modeling. JF and RLG contributed to the smFRET inference. CHW contributed to the graphical modeling and smFRET inference. JEB and CHW wrote the manuscript.
This work was supported by a grant to CHW from the NIH (5PN2EY016586-03); grants to RLG from the Burroughs Wellcome Fund (CABS 1004856), the NSF (MCB 0644262), and the NIH-NIGMS (1RO1GM084288-01); and a grant to JEB from the NSF (GRFP).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
- 7. Fei J, Bronson JE, Hofman JM, Srinivas RL, Wiggins CH, Gonzalez RL: Allosteric collaboration between elongation factor G and the ribosomal L1 stalk directs tRNA movements during translation. Proc. Natl. Acad. Sci. U.S.A. 2009, 106:15702–15707. doi:10.1073/pnas.0908077106
- 14. MacKay DJ: Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press; 2003.
- 15. Bishop C: Pattern Recognition and Machine Learning. New York: Springer; 2006.
- 18. Ghahramani Z, Beal M: Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems. Volume 13. Cambridge, MA: MIT Press; 2001.
- 19. Bishop C, Spiegelhalter D, Winn J: VIBES: a variational inference engine for Bayesian networks. In Advances in Neural Information Processing Systems. Volume 15. Cambridge, MA: MIT Press; 2003.
- 20. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 1977, 39(1):1–38.
- 27. Ha T, Enderle T, Ogletree DF, Chemla DS, Selvin PR, Weiss S: Probing the interaction between two single molecules: fluorescence resonance energy transfer between a single donor and a single acceptor. Proc. Natl. Acad. Sci. U.S.A. 1996, 93:6264–6268. doi:10.1073/pnas.93.13.6264
- 28. Deniz AA, Laurence TA, Beligere GS, Dahan M, Martin AB, Chemla DS, Dawson PE, Schultz PG, Weiss S: Single-molecule protein folding: diffusion fluorescence resonance energy transfer studies of the denaturation of chymotrypsin inhibitor 2. Proc. Natl. Acad. Sci. U.S.A. 2000, 97:5179–5184. doi:10.1073/pnas.090104997
- 37. Creighton TE: Proteins: Structures and Molecular Properties. New York: W. H. Freeman; 1992.
- 38. Neal R: Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto; 1993.
- 40. Beal M: Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of Cambridge, UK; 2003. [http://www.cse.buffalo.edu/faculty/mbeal/papers.html]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.