Introduction

Molecular epidemiology has advanced from testing associations of disease with single polymorphisms, to exhaustive examination of all polymorphisms in a candidate gene using haplotype tagging single nucleotide polymorphisms (SNPs), to studying increasing numbers of candidate genes simultaneously. Often, gene-environment and gene-gene interactions are considered at the same time. As the number of main effects and interactions proliferates, there is a growing need for a more systematic approach to model development [1].

In recognition of this need, the American Association for Cancer Research held a special conference [2] in May 2007, bringing together experts in epidemiology, genetics, statistics, computational biology, systems biology, toxicology, bioinformatics and other fields to discuss various multidisciplinary approaches to this problem.

A broad range of exploratory methods has been developed recently for identifying interactions, such as neural nets, classification and regression trees, multifactor dimensionality reduction, random forests and hierarchical clustering [3-7]. Our focus here, however, is on hypothesis-driven methods based on prior understanding of the structure of biological pathways postulated to be relevant to a particular disease. Our primary purpose is to contrast mechanistic and empirical methods and to explore ways of combining the two.

The folate pathway as an example

Folate metabolism provides a rich example to illustrate these challenges. Folate has been implicated in colorectal cancer, [8] coronary heart disease [9] and neural tube defects, [10, 11] among other conditions. Several steps in the metabolism of folate could be involved in these various diseases (Figure 1) and could have quite different effects. The pathway is complex, involving 19 enzymes or carrier proteins, with various feedback loops and two main cycles, the folate and the methionine cycles. The former is involved in pyrimidine synthesis through the action of thymidylate synthase (TS), potentially leading to uracil misincorporation into DNA and subsequent DNA damage and repair or misrepair. The latter is involved in DNA methylation through the conversion of S-adenosyl methionine (SAM) to S-adenosyl homocysteine (SAH) by DNA-methyltransferase (DNMT). These two mechanisms in particular have been suggested as important links between folate and carcinogenesis, although other possibilities include purine synthesis (via the aminoimidazole-carboxamide ribonucleotide transferase [AICART] reaction) and homocysteine itself. Because polymorphisms that tend to increase one of these effects may decrease others, their effects on disease endpoints can be quite different, depending on which part of the pathway is more important. A detailed mathematical model for this system has been developed by Nijhout et al. [12, 13] and Reed et al., [14, 15] based on the equilibrium solution to a set of linked ordinary differential equations for Michaelis-Menten kinetics and implemented in software available at http://metabolism.math.duke.edu/.

Figure 1

Biochemical diagram of folate metabolism (reproduced with permission from Reed et al.[14]). AICART, aminoimidazolecarboxamide ribonucleotide transferase; BHMT, betaine-homocysteine methyltransferase; CBS, cystathionine β-synthase; DHFR, dihydrofolate reductase; DNMT, DNA-methyltransferase; dTMP, thymidine monophosphate; FTD, 10-formyltetrahydrofolate dehydrogenase; FTS, 10-formyltetrahydrofolate synthase; GAR, glycinamide ribotide; G-NMT, glycine N-methyltransferase; HCOOH, formic acid; H2C=O, formaldehyde; HCY, homocysteine; MAT, methionine adenosyl transferase; MS, methionine synthase; MTCH, 5,10-methylenetetrahydrofolate cyclohydrolase; MTD, 5,10-methylenetetrahydrofolate dehydrogenase; MTHFR, 5,10-methylenetetrahydrofolate reductase; NE, non-enzymatic; PGT, phosphoribosyl glycinamide transferase; SAH, S-adenosylhomocysteine; SAHH, SAH hydrolase; SAM, S-adenosylmethionine; SHMT, serine hydroxymethyltransferase; THF, tetrahydrofolate; 5m-THF, 5-methylTHF; 5,10-CH2-THF, 5,10-methyleneTHF; 10f-THF, 10-formylTHF; TS, thymidylate synthase.

To illustrate the various approaches, we simulated some typical data in the form that might be available from a molecular epidemiology study -- specifically data on genetic variants, various environmental exposures, a disease outcome or clinical trait, and, possibly, biomarker measurements on some or all subjects. We began with a population of 10,000 individuals with randomly generated values of intracellular folate E_1 (the total tetrahydrofolate [THF] concentration in the six compartments forming the closed loop shown on the left-hand side of Figure 1), methionine intake E_2 (METin, log-normally distributed) and 14 of the key genes G shown in Figure 1. For each gene, a person-specific value of the corresponding V_max was sampled from log-normal distributions with genotype-specific geometric means (GMs = 0.6, 0.8, or 1.1 times the overall GM) and common geometric standard deviations (GSD = 1.1) and K_m appropriate for that enzyme (see Table 1 in Reed et al. [14] for reference values for V_max and K_m for each gene). The differential equations were then evaluated to determine the steady-state solutions for ten intermediate metabolite concentrations and 14 reaction rates for each individual, based on their specific environmental variables and enzyme activity rates. The probability of disease was calculated under a logistic model for each of four scenarios for the causal biological mechanism -- homocysteine concentration, the rate of DNA methylation reactions and the rates of purine and pyrimidine synthesis -- and a binary disease indicator Y was sampled with the corresponding probability. Only the data on (Y, E, G) were retained from the first 500 cases and 500 controls for the first level of the epidemiological analysis. For some analyses, we also simulated biomarkers [16] on stratified subsamples of these subjects, as will be described later. Various summaries of the correlations among the (X, E, G) values for the remaining 9,000 subjects were deposited into what we shall call the 'external database' for use in constructing priors, as described below (no individual Y data were used for this purpose).
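The sampling scheme for person-specific enzyme activities can be sketched in a few lines of Python. The genotype frequencies, reference V_max and methionine-intake parameters below are invented placeholders; only the GM multipliers (0.6, 0.8, 1.1) and the GSD (1.1) come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical genotype codes for one gene: 0, 1 or 2 copies of the variant allele;
# the frequencies are illustrative, not those used in the simulation.
genotype = rng.choice([0, 1, 2], size=n, p=[0.49, 0.42, 0.09])

gm_overall = 200.0                    # placeholder reference Vmax (cf. Reed et al. [14])
gm_ratio = np.array([0.6, 0.8, 1.1])  # genotype-specific GM multipliers (from the text)
gsd = 1.1                             # common geometric standard deviation

# Person-specific Vmax: log-normal with genotype-specific geometric mean.
vmax = np.exp(rng.normal(np.log(gm_overall * gm_ratio[genotype]), np.log(gsd)))

# Log-normally distributed methionine intake E2; parameters are illustrative.
met_in = np.exp(rng.normal(np.log(100.0), 0.3, size=n))
```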

Table 1 Marginal odds ratios (ORs) for the association of each gene with disease under various choices of reaction rates or intermediate metabolite concentrations as the causal risk factor (ORs are expressed relative to the low enzyme activity rate genotype)

Table 1 shows the univariate associations of each gene with disease under each assumption about the causal risk factor. In these simulations, only one of these was taken as causal at a time, each scaled with the relative risk coefficient β = 2.0 per standard deviation of the respective risk factor. When homocysteine concentration was taken as the causal factor, the strongest association was with genetic variation in the cystathionine β-synthase (CBS) and S-adenosylhomocysteine hydrolase (SAHH) genes. The remaining three columns relate to various reaction rates as causal mechanisms. For pyrimidine synthesis (characterised here by the TS reaction rate), the strongest influence was seen for genetic variation in TS and the 5,10-methyleneTHF dehydrogenase (MTD) gene. For purine synthesis (reflected in the AICART reaction rate), the strongest associations were with genetic variation in the phosphoribosyl glycinamide transferase (PGT) gene, and somewhat weaker for the MTD, 5,10-methyleneTHF cyclohydrolase (MTCH) and serine hydroxymethyltransferase (SHMT) genes; interestingly, the disease risk is not particularly related to the AICART genotype itself. When DNA methylation (reflected by the DNMT reaction rate) was assumed to be causal, none of the genetic associations was as strong as for the other three causal mechanisms, the strongest being with the 5,10-methyleneTHF reductase (MTHFR) gene, SAHH and MTD. Genetic variation in DNMT was not explicitly simulated, but the reaction rates for this enzyme were identical to those for methionine adenosyl transferase (MAT-II) and SAHH, reflecting a rate-limiting step. Thus, genetic variation in MAT-II had no effect on risk, the reaction rate being driven entirely by SAHH. Other rate-limited combinations included dihydrofolate reductase (DHFR) with TS, MTD with MTCH, and PGT with AICART. Methionine intake was the strongest environmental exposure factor for the simulation with homocysteine as the causal mechanism, whereas intracellular folate had a stronger effect under the other three mechanisms.

Mechanistic vs empirical models

For the four highlighted simulations, we also conducted multiple logistic regressions in a stepwise manner, offering methionine, folate, the 14 genotypes and all 91 pairwise G × G and 28 G × E interactions (Table 2). These are difficult to interpret, however, owing to the large numbers of comparisons and unstable regression coefficients, particularly in the models that include interaction terms. In an attempt to gain greater insight into mechanisms, attention will now be turned to more pathway-driven modelling approaches, based on hierarchical or mechanistic models. The former extend the standard logistic models summarised in Table 2 by the addition of 'prior covariates' incorporating knowledge about the relative risk coefficients predicted by the pathway. The latter attempt to model the pathways explicitly, using simplified versions of physiologically based pharmacokinetic (PBPK) models, thereby requiring stronger assumptions about reaction dynamics and population distributions of rate parameters.

Table 2 Multiple stepwise logistic regression models, including only main effects or main effects and G × G/G × E interaction terms for four different choices of the causal variable (gene names are given in Table 1; E_1 = intracellular folate concentration; E_2 = methionine intake)

Hierarchical models for disease-pathway associations

In the first level, the epidemiological data are fitted using a conventional 'empirical' model for the main effects and interactions among the various input genotypes and exposures, here denoted generically as X = (X_ip), p = 1, ..., P, comprising (E, G, G × G, G × E, G × G × E, ...); for example, a logistic regression model of the form

$$\operatorname{logit}\Pr(Y_i = 1 \mid X_i) = \beta_0 + \sum_{p=1}^{P} \beta_p X_{ip} \qquad (1)$$

the sum being taken over the range of terms included in the X vector. Note that all possible effects of some predetermined complexity (eg all main effects and two-way, or perhaps higher order, interactions possibly limited to subsets relevant to the hypothesised pathway structure) are included, rather than using some form of model selection, as was done in the stepwise analyses summarised in Table 2.

In the second-level model, each of the regression coefficients from Eq. (1) is in turn regressed on a vector Z_p = (Z_pv), v = 1, ..., V, of 'prior covariates' describing characteristics of the corresponding terms in X; for example,

$$\beta_p \sim N\!\left(\pi_0 + \sum_{v=1}^{V} \pi_v Z_{pv},\ \sigma^2\right) \qquad (2)$$

There are many possibilities for what could be included in the set of prior covariates, ranging from indicator variables for which of several pathways each gene might act in, [17] to in silico predictions of the functional significance of polymorphisms in each gene, [18, 19] to genomic annotation from formal ontologies [20]. Summaries of the effects of genes on expression levels ('genetical genomics') or of associations of genes with relevant biomarkers might also be used as prior covariates. Rebbeck et al. [21] provide a good review of available tools that could be used for constructing prior covariates.

Alternatively, one could model the variances, for example:

$$\beta_p \sim N\!\left(\pi' Z_p,\ \sigma^2 \exp(\varphi' Z_p)\right) \qquad (3)$$

For example, suppose the X vector comprised effects for different polymorphisms within each gene and one had some prior predictors of the effects of each polymorphism (eg in silico predictions of functional effects or evolutionary conservation) and other predictors of the general effects of genes (eg their roles in different pathways or the number of other genes that they are connected to in a pathway). Then, it might be appropriate to include the former in the π ' Z part of the model for the means, and the latter in the φ ' Z part of the model for the variances.
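A small numerical sketch of this prior may make the two roles concrete: a functional score enters the mean through π'Z, while connectivity inflates the prior variance through φ'Z. All covariate values and second-level coefficients below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative prior covariates for P = 5 coefficients: column 0 is an
# in-silico functional prediction (mean model); column 1 is the number of
# network connections (variance model). All values are invented.
Z = np.array([[0.9, 3], [0.1, 1], [0.5, 5], [0.0, 2], [0.7, 4]], dtype=float)

pi = np.array([1.2])   # second-level coefficient for the mean covariate
phi = np.array([0.4])  # second-level coefficient for the variance covariate
sigma2 = 0.25          # baseline prior variance

# Eq. (3): beta_p ~ N(pi'Z_p, sigma^2 * exp(phi'Z_p)).
mean = Z[:, :1] @ pi
var = sigma2 * np.exp(Z[:, 1:] @ phi)
beta = rng.normal(mean, np.sqrt(var))
print(beta)
```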

So far, the second-level models have assumed an independence prior for each of the regression coefficients; but now suppose we have some prior information about the relationships among the genes, such as might come from networks inferred from gene co-expression data. Let A = (A_pq), p, q = 1, ..., P, denote a matrix of prior connectivities between pairs of genes -- for example, taking the value 1 if the two are adjacent (connected) in some network or 0 otherwise. Then, one might replace the independence prior of Eq. (2) by a multivariate prior of the form:

$$\beta \sim N_P\!\left(\pi' Z,\ \sigma^2 (I - \rho A)^{-1}\right)$$

This is known as the conditional autoregressive model, and is widely used in spatial statistics [22]. Sample WinBUGS code to implement these and other models described below are available in an online supplement.
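The CAR covariance σ²(I − ρA)⁻¹ is straightforward to construct and check numerically. The sketch below is a numpy stand-in for the supplementary WinBUGS code, with an invented four-gene adjacency matrix; it verifies positive definiteness before sampling from the joint prior.

```python
import numpy as np

# Adjacency matrix A for P = 4 genes (1 = share a metabolite); illustrative only.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

sigma2, rho = 1.0, 0.3

# CAR-style precision I - rho*A must be positive definite for the prior to be
# proper; Cholesky raises LinAlgError if rho lies outside the valid range.
prec = np.eye(4) - rho * A
np.linalg.cholesky(prec)
cov = sigma2 * np.linalg.inv(prec)

rng = np.random.default_rng(1)
beta = rng.multivariate_normal(np.zeros(4), cov)  # one draw from the joint prior
```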

In applications to the folate simulation, we tried two variants of this model. First, we considered three prior covariates in Z: an indicator for whether a gene is involved in the methionine cycle; whether it is involved in the folate cycle; and the number of other genes it is connected to in the entire network (a measure of the extent to which it might have a critical role as a 'hub' gene). The A matrix was specified in terms of whether a pair of genes had a metabolite in common, either as substrate or product.

Table 3 summarises the results of several models, including these three prior covariates in the means or variance model, as well as the connectivities in the covariance model. As would be expected, in the zero mean model, all the significant parameter estimates were shrunk towards zero because of the large number of genes with no true effect in the model. In general, none of the prior covariates significantly predicted the means. The estimates of the βs in all these models were much closer to the simple maximum likelihood estimates (the first column), however, and their standard errors were generally somewhat smaller, indicating the 'borrowing of strength' from each other. In the model with covariates in the prior variances, however, the number of connections for each gene was significantly associated with the variance. In the final model, with correlations between genes being given by indicators for whether they were connected in the graph, the posterior distribution for the parameter ρ was constrained by the requirement that the covariance matrix be positive definite, but showed strong evidence of gene-gene correlations following the pattern given by the connectivities in Figure 1. The generally weak effects of prior covariates in these models may simply reflect the crudeness of these classifications. Below, we will revisit these models with more informative covariates based on the quantitative predictions of the differential equations model.

Table 3 Summary of hierarchical modelling fits (parameter estimates [SEs]) for selected genetic effects (β_G), prior covariates (Z'π) and prior correlations (σ²A) for simulation with homocysteine concentration as the causal variable

Mechanistic models

Whereas hierarchical models are generally applicable whenever one has external information about the genes and exposures available, in some circumstances the dynamics of the underlying biological process may be well enough understood to support mechanistic modelling. These are typically based on systems of ordinary differential equations (ODEs) describing each of the intermediate nodes in a graphical model as deterministic quantities given by their various inputs (exposures or previous substrates), with reaction rates determined by genotypes (Figure 2). For example, in a sequence j = 1, ..., J of linear kinetic steps, with conversion from metabolite M_j to M_{j+1} at rate λ_j and removal at rate μ_j, the instantaneous concentration is given by the differential equation:

$$\frac{dM_j}{dt} = \lambda_{j-1} M_{j-1} - (\lambda_j + \mu_j) M_j \qquad (4)$$
Figure 2

Schematic representation of simplified one-compartment model.

leading to the equilibrium solution for the final metabolite M_J as:

$$M_J(E, G) = E \times \prod_{j=1}^{J} \frac{\lambda_{j-1}(G_{j-1}^{(\lambda)})}{\lambda_j(G_j^{(\lambda)}) + \mu_j(G_j^{(\mu)})}$$

where M_0 = E denotes the concentration of the exposure. This predicted equilibrium concentration of the final metabolite in the graph is then treated as a covariate in a logistic model for the risk of disease:

$$\operatorname{logit}[\Pr(Y_i = 1)] = \beta_0 + \beta_1 M_J(E_i, G_i)$$
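The equilibrium recursion and its use as a covariate in the logistic disease model can be written compactly, as in the sketch below; the rates and logistic coefficients are invented for illustration, not values from the folate model.

```python
import numpy as np

def equilibrium_mj(E, lam, mu):
    """Steady state of Eq. (4): M_j = M_{j-1} * lam_{j-1} / (lam_j + mu_j),
    starting from M_0 = E. `lam` has length J+1 (lam[0] is the input step);
    `mu` has length J."""
    M = E
    for j in range(len(mu)):
        M *= lam[j] / (lam[j + 1] + mu[j])
    return M

def risk(E, lam, mu, b0=-3.0, b1=2.0):
    # Logistic disease model on the equilibrium final-metabolite concentration.
    MJ = equilibrium_mj(E, lam, mu)
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * MJ)))

# Illustrative genotype-specific rates for a two-step chain (values made up).
print(risk(E=1.5, lam=[1.0, 0.8, 0.6], mu=[0.3, 0.5]))
```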

If sufficient external knowledge about the genotype-specific reaction rates is available, these could be treated as fixed constants, but more likely they would need to be estimated jointly with the βs in the disease model using a non-linear fitting program. More sophisticated non-linear models are possible -- for example, incorporating Michaelis-Menten kinetics by replacing each of the λM terms in Eq. (4) by expressions of the form:

$$\frac{V_{\max}^{\lambda_j}(G_j)\, M_j}{M_j + K_m^{\lambda_j}(G_j)}$$

and similarly for the μM terms. The resulting equilibrium solutions for M_J(E, G) are now more complex solutions to a polynomial equation. For example, with only a single intermediate metabolite with one activation rate λ and one detoxification rate μ, the solution becomes:

$$M = \frac{E\,\lambda/\mu}{1 + \left(\dfrac{1}{K_m^{\lambda}} - \dfrac{\lambda/\mu}{K_m^{\mu}}\right) E}$$

where $\lambda = V_{\max}^{\lambda}(G_1)/K_m^{\lambda}$ and $\mu = V_{\max}^{\mu}(G_2)/K_m^{\mu}$ denote the low-dose slopes of the two reactions. These solutions can be either upwardly or downwardly curvilinear in E, depending on whether the term in parentheses is positive or negative (basically, whether the creation of the intermediate exceeds the rate at which it can be removed). For the fitted values in the application below (third block of Table 4), the dose-response relationship for M|E is upwardly curved for all genotype combinations (not shown).
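One way to check such a closed-form solution is to integrate the differential equation numerically and compare. The sketch below does this for illustrative parameter values (not the fitted values in Table 4), assuming Michaelis-Menten creation from E and removal of M.

```python
from scipy.integrate import solve_ivp

# Illustrative Michaelis-Menten parameters (invented, not from Table 4).
Vl, Kl = 2.0, 1.0   # creation: Vmax_lambda, Km_lambda
Vm, Km = 3.0, 0.8   # removal:  Vmax_mu,     Km_mu
E = 1.5             # fixed exposure concentration

# dM/dt = Michaelis-Menten creation from E minus Michaelis-Menten removal of M.
rhs = lambda t, M: [Vl * E / (Kl + E) - Vm * M[0] / (Km + M[0])]
sol = solve_ivp(rhs, [0, 100], [0.0], rtol=1e-10, atol=1e-12)

# Closed-form equilibrium from the text, with lam and mu the low-dose slopes.
lam, mu = Vl / Kl, Vm / Km
M_closed = (E * lam / mu) / (1 + (1 / Kl - (lam / mu) / Km) * E)

print(sol.y[0, -1], M_closed)   # both approximately 0.5333
```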

A more realistic and more flexible model would allow for stochastic variation in the reaction rates λ_ij and μ_ij for each individual i, conditional on their genotypes G_ij; for example, $\lambda_{ij} \sim LN(\bar{\lambda}_j(G_{ij}),\ \sigma_j^2)$, and likewise for the μ_ij, [23] or similarly for their corresponding V_max and K_m [24]. The population genotype-specific rates are, in turn, assumed to have log-normal prior distributions $\bar{\lambda}_j(g) \sim LN(\bar{\bar{\lambda}}_j,\ \omega_j^2)$ (and similarly for the μs), with vague priors on the population means $\bar{\bar{\lambda}}_j$, inter-individual variances $\sigma_j^2$ and between-genotype variances $\omega_j^2$. The individual data might be further supplemented by available biomarker measurements B_ij of either the enzyme activity levels or the intermediate metabolite concentrations, modelled as $B_{ij} \sim LN(\lambda_{ij},\ \tau^2)$ and $B_{ij} \sim LN(M_{ij},\ \tau^2)$, respectively.

The WinBUGS software [25] has an add-in called PKBUGS, [26] which implements a Bayesian analysis of population pharmacokinetic parameters [27-31]. More complex models can, in principle, be fitted using the add-in WBDIFF (http://www.winbugs-development.org.uk/wbdiff.html), which allows user-specified differential equations as nodes in a Bayesian graphical model.

To illustrate the approach, we consider a highly simplified model with only a single intermediate metabolite, M (homocysteine). We assume that this is created at a linear rate λ determined by SAHH and removed at a linear rate μ determined by CBS. The ratios of λ and μ between genotypes are estimated jointly with β. The first two lines of Table 4 provide the results of fitting the linear kinetics model, with and without inter-individual variability in the two rate parameters. Although many other genes are, of course, involved in the simulated model, the estimated homocysteine concentrations M are strongly predictive of disease, and both genes have highly significant effects on their respective rates. Allowing additional random variability in these rates slightly increased the population-average genetic effects. For the Michaelis-Menten models, we allowed the V_max values to depend on genotype, while keeping the K_m values fixed. Not all the parameters can be independently estimated: only the ratios $\mu_0/\lambda_0$ and $K_m^{\mu}/K_m^{\lambda}$, along with the genetic rate ratios $\lambda_1/\lambda_0$ and $\mu_1/\mu_0$, are identifiable. Allowing the V_max and K_m values to vary between subjects led to some instability, but did not substantially alter the population mean parameter estimates. Adding biomarker measurements B_i as surrogates for M_i, for even a subset of subjects, as described below, substantially improved the precision of estimation of all the model parameters (results not shown).

Table 4 Results of Markov chain Monte Carlo fitting of single-compartment models with homocysteine as an unobserved intermediate metabolite, created at a rate depending on SAHH (λ) and removed at a rate depending on CBS (μ), applied to the simulation taking homocysteine concentration as the causal variable

Combining mechanistic and statistical models

Such an approach is likely to be impractical for complex looped pathways like folate, however. In this case, one might use the results of a preliminary exploratory or hierarchical model to simplify the pathway to a few key rate-limiting steps, so as to yield a simpler unidirectional model for which the differential equation steady-state solutions can be obtained in closed form.

Rather than taking M(E, G) as a deterministic node in the mechanistic modelling approach described above, a fully Bayesian treatment would use stochastic differential equations to derive Pr(M|E, G). For example, suppose one postulated that the rate of change dM/dt depends on the rate at which the metabolite is created, at a constant rate λ(G_1)E, and the rate at which it is removed, at rate μ(G_2)M. (Of course, the exposures E could be time dependent, in which case one would be interested in the long-term average of M rather than its steady state; but in most epidemiological studies there is little information available on short-term variation in exposures, so the following discussion is limited to the case of time-constant exposures.) Consider a discrete number of molecules and let p_m(t) = Pr(M = m|T = t). Then, the resulting stochastic differential equation becomes:

$$\frac{dp_m}{dt} = -(\lambda E + \mu m)\, p_m + \lambda E\, p_{m-1} + \mu (m+1)\, p_{m+1}$$

The solution turns out to be simply a Poisson distribution for m with mean E(m) = λE/μ. This suggests as a distribution for continuous metabolite concentrations M in some volume of size N:

$$p(M) = \frac{N\, e^{-\lambda E N/\mu}\, (\lambda E N/\mu)^{NM}}{\Gamma(NM + 1)}$$

where N now controls the dispersion of the distribution. More complex solutions for Michaelis-Menten kinetics with a finite number of binding sites have been provided by Kou et al., [32] who showed that the classical solutions still held in expectation, but other properties -- like the distribution of waiting times in various binding states -- were different, appearing to demonstrate a non-Markov memory phenomenon, particularly at high substrate concentrations. Further stochastic variability arises from fluctuations in binding affinity due to continual changes in enzyme conformation [33].
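A direct way to see the Poisson result is to simulate the birth-death process itself. This Gillespie-style sketch, with arbitrary rates, recovers the stationary mean λE/μ via a time-weighted average of the molecule count.

```python
import numpy as np

rng = np.random.default_rng(7)
lam_E, mu = 5.0, 0.5       # birth rate lambda*E and per-molecule removal rate mu

m, t = 0, 0.0
t_end, burn_in = 10_000.0, 100.0
weighted_sum, total_time = 0.0, 0.0

while t < t_end:
    rate = lam_E + mu * m              # total event rate in state m
    dt = rng.exponential(1.0 / rate)   # exponential holding time
    if t > burn_in:                    # time-weighted average of m
        weighted_sum += m * dt
        total_time += dt
    t += dt
    # Birth with probability lam_E/rate, otherwise a death.
    m += 1 if rng.random() < lam_E / rate else -1

print(weighted_sum / total_time, lam_E / mu)   # stationary mean ~ lambda*E/mu = 10
```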

To illustrate the general idea, we fitted this simplified version of the model, treating λ and μ as fixed genotype-specific population values, yielding the estimates shown in the last line of Table 4. The dispersion parameter N cannot be estimated, but the results for the other parameters are relatively insensitive to this choice; the results in Table 4 are based on either a fixed value N = 10 or an informative Γ(100, 1) prior. As N becomes very large, the estimates converge to those in the first line, for linear kinetics with fixed genotype-specific λ and μ.

For more complex models, for which analytic solution of the differential equations may be intractable, the technique of approximate Bayesian computation [34] may be helpful. The basic idea is, at each Markov chain Monte Carlo cycle, to simulate data from the differential equations model using the current and proposed estimates of model parameters and evaluate the 'closeness' of the simulated data to the observed data in terms of some simple statistics. This is then used to decide whether to accept or reject the proposed new estimates, rather than having to compute the likelihood itself.
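The sketch below illustrates the simpler rejection-sampling variant of ABC, rather than the ABC-MCMC scheme just described; the toy 'model', summary statistics, prior and tolerance are all invented for illustration, with a trivial simulator standing in for the differential equations model.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_summary(theta, n=200):
    """Stand-in for solving the ODE model and summarising the simulated data;
    here the 'model' is just an exponential steady-state level (illustrative)."""
    y = rng.exponential(theta, size=n)
    return np.array([y.mean(), y.std()])

obs_summary = np.array([2.0, 2.1])   # summaries of the observed data (made up)
tol = 0.2

# ABC rejection: keep prior draws whose simulated summaries land close to the
# observed ones; the accepted draws approximate the posterior without ever
# evaluating the likelihood.
accepted = []
for _ in range(20_000):
    theta = rng.uniform(0.1, 10.0)   # draw from the prior
    if np.linalg.norm(simulate_summary(theta) - obs_summary) < tol:
        accepted.append(theta)

print(np.mean(accepted), np.percentile(accepted, [2.5, 97.5]))
```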

A simpler approach uses the output of a PBPK simulation model as prior covariates in a hierarchical model. Let Z_ge = E[M(G_g, E_e)] denote the predicted steady-state concentration of the final metabolite from a differential equations model for a particular combination of genes and/or exposures (thus, Z_gg' might represent the predicted effect of a G × G interaction between genes g and g'). As discussed above, other Zs could comprise variances of predicted Ms across a range of input values, as a measure of the sensitivity of the output to variation in that particular combination of inputs. Z_ge could also be a vector of several different predicted metabolite concentrations if there were multiple hypotheses about which was the most aetiologically relevant.

For the folate application, the Z matrix was obtained by correlating the simulated intermediate phenotypes v (reaction rates or metabolite concentrations) with the 14 genotypes, 91 G × G and 28 G × E interaction terms. The resulting correlation coefficients for the four simulated causal variables were then used as a vector of in silico prior covariates Z_p = (Z_pv), v = 1, ..., 4, for the relative risk coefficients β_p. The full set of correlations Z_pv across all ten metabolites and nine non-redundant velocities was also used to compute an adjacency matrix as A_pq = corr_v(Z_pv, Z_qv), representing the extent to which a pair of genes had similar effects across the whole range of intermediate phenotypes. The effects of these in silico covariates (Table 5) were substantially stronger than for the simple indicator variables illustrated earlier. In each simulation, the prior covariate corresponding to the causal variable was the strongest predictor of the genetic main effects.
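Constructing Z and A from the output of a simulation model amounts to a couple of matrix operations, as sketched below; here random arrays stand in for the external database and the ODE model output, and the dimensions follow the text (14 genes + 91 G × G + 28 G × E = 133 terms; ten metabolites plus nine velocities).

```python
import numpy as np

rng = np.random.default_rng(5)
n, P, V = 9000, 133, 19           # reserved subjects, model terms, phenotypes
X = rng.normal(size=(n, P))       # stand-in for genotype/interaction terms
phen = rng.normal(size=(n, V))    # stand-in for the ODE model's intermediates

# Z[p, v] = correlation of term p with intermediate phenotype v.
Xc = (X - X.mean(0)) / X.std(0)
Pc = (phen - phen.mean(0)) / phen.std(0)
Z = Xc.T @ Pc / n

# A[p, q] = correlation between rows of Z: terms with similar effect profiles
# across the intermediate phenotypes get a high prior connectivity.
A = np.corrcoef(Z)
```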

Table 5 Summary of hierarchical modelling fits for selected genetic effects (β_G), prior covariates (Z'π) and prior standard deviations (σ_β and σ_π) for simulation with different intermediates as the causal variable, using the Z matrix derived from independent data from the same simulation model (see text).
Table 6 Estimated log relative risk per unit change of true long-term homocysteine concentrations, treated as a latent variable in a single compartment linear-kinetics model; data simulated assuming homocysteine is the causal variable.

Designs incorporating biomarkers

Ultimately, it may be helpful to incorporate various markers of the internal workings of a postulated pathway, perhaps in the form of biomarker measurements of intermediate metabolites, external bioinformatic knowledge about the structure and parameters of the network, or toxicological assays of the biological effects of the agents under study. For example, in a multi-city study of air pollution, we are applying stored particulate samples from each city to cell cultures with a range of genes experimentally knocked down to assess their genotype-specific biological activities. We will then incorporate these measurements directly into the analysis of G × E interactions in epidemiological data [35]. See Thomas, [1] Thomas et al., [2] Conti et al. [20] and Parl et al. [36] for further discussion about approaches to incorporating biomarkers and other forms of biological knowledge into pathway-driven analyses.

Typically, biomarker measurements are difficult to obtain and are only feasible to collect on a subset of a large epidemiological study. While one might consider using a simple random sample for this purpose, greater efficiency can often be obtained by stratified sampling. Suppose the parent study is a case-control study with exposure information and DNA already obtained. One might then consider sampling on the basis of some combination of disease status, exposure and the genotypes of one or more genes thought to be particularly important for the intermediate phenotype(s) for which biomarkers are to be obtained. The optimal design would require knowledge of the true model (which is, of course, unknown), but a balanced design, selecting the subsample so as to obtain equal numbers in the various strata defined by disease and predictors, is often nearly optimal [37-39]; a sketch of such a selection is given below. The analysis can then be conducted by full maximum likelihood, integrating the biomarkers for unmeasured subjects over their distribution (given the available genotype, exposure and disease data), or by some form of multiple imputation, quasi-likelihood [40] or MCMC methods. Here, the interest is not in the association of disease with the biomarker B itself, but rather with the unobserved intermediate phenotype M for which it is a surrogate. The disease model is thus of the form Pr(Y|M), with a latent process model for Pr(M|G, E) and a measurement model for Pr(B|M).
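A balanced subsample of this kind can be drawn with a few lines of pandas; the stratification columns, stratum size and the hypothetical `study` data frame (one row per subject) below are illustrative assumptions.

```python
import pandas as pd

def balanced_subsample(df, strata_cols, per_stratum, seed=0):
    """Select up to `per_stratum` subjects from every stratum defined by
    disease status, exposure and key genotype(s): a balanced design."""
    return (df.groupby(strata_cols, group_keys=False)
              .apply(lambda g: g.sample(min(len(g), per_stratum),
                                        random_state=seed)))

# e.g. 8 strata: case/control x MTHFR variant carrier x high/low folate.
# sub = balanced_subsample(study, ["case", "mthfr", "high_folate"], per_stratum=25)
```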

Again using the folate simulation as the example, we simulated biomarkers for samples of ten or 25 individuals selected at random from each of the eight cells defined by disease status, MTHFR genotype and high or low folate intake. A measurement B of either the homocysteine concentration or the TS enzyme activity level was assumed to be normally distributed around the simulated equilibrium value, with a standard deviation equal to 10 per cent of the true long-term average concentration.

These data were analysed within a conventional measurement error framework [41, 42] by treating the true long-term average value of homocysteine or TS activity as a latent variable M in a model given by the following equations:

$$\operatorname{logit}\Pr(Y = 1) = \beta_0 + \beta_1 M$$
$$M \sim N(X'\alpha,\ \sigma^2), \quad \text{where } X = (G, E, \ldots)$$
$$B \sim N(M,\ \tau^2)$$

For joint analyses of homocysteine and TS activity measurements, M and B were assumed to be bivariate normally distributed, with M ~ N_2(X'A, Σ) and B ~ N_2(M, T), and Y as having a multiple logistic dependence on M. Only the main effects of the 14 genes and two environmental factors were included in X for this analysis. While the model can be fitted by maximum likelihood, it is convenient to use MCMC methods, which more readily deal with arbitrary patterns of missing B data. Thus, it is not essential for the different biomarkers to be measured on the same subset of subjects, but some overlap is needed to estimate the covariances Σ_12 and T_12. More complex mechanistic models could, of course, be used in place of the regression model M|X. For this model to be identifiable, however, it is essential that distinct biomarkers be available for each of the intermediate phenotypes included in the disease model.
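The structure of this latent-variable model is perhaps clearest in generative form. The sketch below simulates from the univariate version with invented parameter values; the final comment notes why a naive regression of Y on B understates β₁.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000

# Generative version of the measurement-error model in the text:
# M is the unobserved long-term phenotype, B its error-prone biomarker.
X = rng.normal(size=(n, 3))                 # genotypes/exposures (illustrative)
alpha = np.array([0.5, -0.2, 0.3])
sigma, tau = 1.0, 0.3

M = X @ alpha + sigma * rng.normal(size=n)  # M ~ N(X'alpha, sigma^2)
B = M + tau * rng.normal(size=n)            # B ~ N(M, tau^2)
p = 1 / (1 + np.exp(-(-2.0 + 1.5 * M)))     # logit Pr(Y=1) = b0 + b1*M
Y = rng.binomial(1, p)

# Naively regressing Y on B (rather than M) attenuates b1; a latent-variable
# fit (maximum likelihood or MCMC) integrates over M to correct this.
```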

Estimates of the effects of both homocysteine and TS enzyme activity were highly significant in univariate analyses, even though the simulated causal variable was homocysteine. In bivariate analyses, however, the TS effect became non-significant, owing to the strong positive correlation (r_Σ = 0.45; 95 per cent confidence interval [CI] 0.21, 0.71) between the residuals of the two Ms, while the correlation between the residuals of the measurement errors was not significant (r_T = 0.34; 95 per cent CI -0.12, +0.63). Although the standard errors varied strongly with subsample size, stratified sampling did not seem to improve the precision of the estimates. The reason for this appears to be that the biased sampling is not properly allowed for in the Bayesian analysis. Further work is needed to explore whether incorporating the sampling fractions into a conditional likelihood would yield more efficient estimators in the stratified designs.

Dealing with reverse causation: Mendelian randomisation

The foregoing development assumes that the biomarker measurement B, or the underlying phenotype M of which it is a measurement, is not affected by the disease process. While this may be a reasonable assumption in a cohort or nested case-control study, where biomarker measurements are made on stored specimens obtained at entry to the cohort rather than after the disease has already occurred, it is a well-known problem ('reverse causation') in case-control studies. In this situation, one might want to restrict biomarker measurements to controls and use marginal likelihood or imputation to deal with the unmeasured biomarkers for cases. Alternatively, one might consider using case measurements in a model that includes terms for differential error in the measurement model, Pr(B|M, Y).

These ideas have been formalised in literature known as 'Mendelian randomisation' (MR), [43-47] sometimes referred to as 'Mendelian deconfounding' [48]. Here, the focus of attention is not the genes themselves, but intermediate phenotypes (M) as risk factors for disease. The genes that influence M are treated as 'instrumental variables' (IVs) [49-54] in an analysis that indirectly infers the M-Y relationship from separate analyses of the G-M and G-Y relationships. The appeal of the approach is that uncontrolled confounding and reverse causation are less likely to distort these relationships than they are to distort the M-Y relationship if studied directly. In essence, the idea of imputing M values using G as an IV in a regression of Y on E(M|G) is a form of MR argument. Nevertheless, the approach is not without its pitfalls, [55-58] both as a means of testing the null hypothesis of no causal connection between M and Y and as a means of estimating the magnitude of its effect. Particularly key is the assumption that the effect of G on Y is mediated solely through M. For complex pathways, the simple MR approach is unlikely to be of much help, but the idea of using samples free of reverse causation to learn about parts of the model from biomarker measurements and incorporating these into the analysis of a latent variable model is promising.

To illustrate these methods, consider the scenario where homocysteine is the causal variable for disease. The logistic regression of disease directly on homocysteine yields a log RR coefficient β of 2.57 (SE 0.22) per SD change in homocysteine (Table 7). This estimate is, however, potentially subject to confounding and reverse causation; indeed, in this simulation we generated an upward bias in B|M of 50 per cent of the SD of M, which produced a substantial overestimate of the simulated β = 2. An MR estimate could in principle be obtained by using any of the genes in the pathway as an IV, MTHFR being the most widely studied. The regression of homocysteine on MTHFR yields a regression coefficient of α = 0.216 (0.079), and a logistic regression of disease on MTHFR yields a regression coefficient of γ = 0.112 (0.142), producing an MR estimate of β = γ/α = 0.52 (0.68). Since MTHFR is only a relatively weak predictor of homocysteine concentrations in this simulation, however, it is a poor instrumental variable, as reflected in the large SE of the ratio estimate. Several other genes, exposures and interactions have much stronger effects on both homocysteine and disease risk -- notably SAHH and CBS, which yield significant MR estimates of 1.27 (0.33) and 1.09 (0.20), respectively. These differences between estimates using different IVs, and their underestimation of the simulated β, suggest that simple Mendelian randomisation is inadequate to deal with complex pathways.

Table 7 Mendelian randomisation estimates of the effect of homocysteine on disease risk
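The Wald-type ratio estimate and its standard error quoted above can be reproduced directly; the first-order delta-method formula, assuming independent estimates of α and γ, is a standard choice for this estimator rather than a detail given in the text.

```python
import numpy as np

def mr_ratio(alpha, se_alpha, gamma, se_gamma):
    """Wald-type Mendelian randomisation estimate beta = gamma / alpha, with a
    first-order delta-method SE (assumes the two estimates are independent)."""
    beta = gamma / alpha
    se = abs(beta) * np.sqrt((se_gamma / gamma) ** 2 + (se_alpha / alpha) ** 2)
    return beta, se

# Figures quoted in the text for MTHFR as the instrument:
print(mr_ratio(alpha=0.216, se_alpha=0.079, gamma=0.112, se_gamma=0.142))
# -> roughly (0.52, 0.68), matching the reported MR estimate
```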

A stepwise multiple regression model for $\hat{M} = E(B|G)$ included 13 main effects and G × G interactions and attained an R² of 0.43. Treating these predicted homocysteine concentrations as the covariate yielded a single-imputation estimate of the log RR for disease of 1.32 (0.16), only slightly less precise than that from the logistic regression of disease directly on the measured values. While robust to uncontrolled confounding, this approach is not robust to reverse causation or to misspecification of the prediction model; for example, it fails to include any exposure effects, which we excluded to avoid distortion by reverse causation. More importantly, it also assumes that the entire effect of the predictors is mediated through homocysteine; this is true in this simulation, but is unlikely to be so in practice. While not quite as downwardly biased as the Mendelian randomisation estimates (owing to the improved prediction of B|G), the incompleteness of the model has still produced some underestimation.

Since we have simulated the case where the biomarker measurements are distorted by disease status, one might consider one of two alternative single-imputation analyses. If both cases and controls have biomarker measurements available, one might include disease status in a model for $\hat{M} = E(B|G, Y) = \alpha'G + \delta Y$, and then set Y = 0 in the fitted regression in order to estimate the pre-disease values for the cases. Alternatively, one could fit the model for $\hat{M} = E(B|G)$ using data only from controls and then apply the fitted model to all subjects, cases and controls. In either case, one would use only the predicted values for all subjects, not the actual biomarker measurements for those having them. In these simulated data, these approaches yield log RR estimates of 1.28 (0.20) and 1.31 (0.20), respectively. Either approach avoids the circularity of using disease status to predict B|G, Y and then using it again in the regression of Y on $\hat{M} = E(B|G, Y)$. While the first approach uses more of the data, it requires the stronger assumption that the effect of Y on B is correctly specified, including possible interactions with G. In this simulation, the estimate of δ is 1.33 (0.06), substantially biased away from the simulated value of 0.50 because it includes some of the causal effect of M on Y. A fully Bayesian analysis jointly estimates the bias term δY in the full model $p_\alpha(M|E, G)\, p_\beta(Y|M)\, p_{\gamma,\delta}(B|M, Y)$. In this simulation, the fully Bayesian analysis yielded estimates of β = 2.95 (0.22) and δ = -0.02 (1.02). Obviously, δ is so poorly estimated and β so overestimated that this approach appears to suffer from problems of identifiability that require further investigation.

In the Colon Cancer Family Registries, [59] we have pre-disease biospecimens on several hundred relatives of probands who were initially unaffected and subsequently became cases themselves. In a currently ongoing substudy of biomarkers for the folate pathway, it will be possible to use these samples to estimate the effect of reverse causation directly. Of course, it would have been even more informative to have both pre- and post-diagnostic biomarker measurements on incident cases to model reverse causation more accurately.

Incorporating external information: Ontologies

There are now numerous databases available that catalogue various types of genomic information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is perhaps the most familiar of these for knowledge about the structure of pathways and the parameters of each step therein. Others include the Gene Ontology, Biomolecular Interaction Network Database, Reactome, PANTHER, Ingenuity Pathway Analysis, BioCARTA, GATHER, DAVID and the Human Protein Reference Database (see, for example, Meier and Gehring, [60] Thomas et al. [61] and Werner [62] for reviews). Literature mining is emerging as another tool for this purpose, [63] although it is potentially biased by the vagaries of research and publication trends. Such repositories form part of a system for organising knowledge known as an 'ontology' [64]. Representation of our knowledge via an ontology may provide a more useful and broadly informative platform for generating system-wide hypotheses about how variation in human genes ultimately impacts on organism-level phenotypes via the underlying pathway or complex system. Since the biological and environmental knowledge relevant to most diseases spans many research fields, each with specific theories guiding ongoing research, the expertise of any one scientist across the entire system is limited. While the information contributing to each knowledge domain may contain uncertainties and sources of error -- stemming from the underlying experiments and studies, from biases in the selection of the genes and pathways included, and from a lack of comparability across terms and databases -- an ontology as a whole can generate hypotheses and links that may only arise when information is integrated from several disciplines across the entire span of suspected disease aetiology. An ontology should not be taken as the truth, but rather as the current representation of knowledge that can, and should, be updated as new findings arise and hypotheses are tested. Evaluation of the accuracy of ontologies is an active research area.

In our folate simulation, we considered three prior covariates for Z in Table 3. The creation of these priors followed directly from the network representation given in Figure 1, obtained from a previously published article representing one research group's interpretation of the folate pathway [14]. An ontology such as the Gene Ontology (GO) has the potential advantage of allowing the construction of prior covariates across a range of biological mechanisms. For example, a very refined biological process captured by the GO term 'folic acid and derivative biosynthetic process' indicates two genes (MTCH and MS) from our example set. A more general term, 'methionine biosynthetic process', identifies three genes (MTCH, MTHFR and MS). Finally, a broad process, such as 'one-carbon compound metabolic process', identifies five genes (SAHH, DHFR, MAT-II, MTCH and SHMT). Since an ontology has a hierarchical structure in an easily computable format, one may also consider more quantitative approaches to generating prior covariates, such as the distance between two genes in the ontology. Across the full range of 184 GO terms involving one or more of these 14 genes, positively correlated sets include (MTHFR, MTCH, MS), (MTD, CBS), (FTD, DHFR) and (AICART, TS), while PGT and MTCH are negatively correlated. Figure 3 represents these correlations using complete agglomerative clustering.

Figure 3

Hierarchical clustering of folate genes based on 184 GO terms.

Although both approaches to building prior covariates -- via either the visual interpretation of a network or the use of the Gene Ontology -- use knowledge of biological mechanisms, they lack a formal link between these mechanisms and disease risk or organism-level phenotypes. Such links may be critical when generating hypotheses or informing statistical analyses using biological mechanisms. Many publicly available ontologies provide a vast amount of structural information on various biological processes, but interpretation or weighting of the importance of those processes in relation to specific phenotypes will only come when ontologies from biological domains are linked to ontologies characterising phenotypes. As one example, Thomas et al. [61] created a novel ontological representation linking smoking-related phenotypes and response to smoking cessation treatments with the underlying biological mechanisms, mainly nicotine metabolism. Most of the ontological concepts created for this specific ontology were built from concept definitions in existing ontologies, such as SOPHARM and the Gene Ontology. This ontology was used by Conti et al. [20] to demonstrate its use in pathway analysis as a systematic way of eliciting priors for a hierarchical model. Specifically, the ontology was used to generate quantitative priors, to reduce the space of potential models and to inform subsequent analysis via a Bayesian model selection approach.

Dealing with uncertainty in pathway structure

A more general question is how to deal with model uncertainty in any of these modelling strategies. The general hierarchical modelling strategy was first extended by Conti et al. [65] to deal with uncertainty about the set of main effects and interactions to be included in X, using stochastic search variable selection [66]. Specifically, they replaced the second-level model by a pair of models: a logistic regression for the probability that β_p = 0 and a linear regression of the form of Eq. (2) for the expected value of the coefficient, given that it is not zero. In turn, the pair of second-level models informs the probability that any given term will be included in the model at the current iteration of the stochastic search. Thus, over the course of the MCMC iterations, variables are entered and removed, and one can then estimate (1) the posterior probability or Bayes factor for each possible model; (2) the posterior probability that each factor has a non-zero β, averaging over the set of other variables in the model; or (3) the posterior mean of each β, given that it is non-zero. Other alternatives include the Lasso prior, [67] which requires only a single hyperparameter to accomplish both shrinkage and variable selection in a natural way, and the elastic net, [68] which combines the Lasso and normal priors and can be implemented in a hierarchical fashion, combining variable selection at lower levels (eg among SNPs within a gene) and shrinkage at higher levels (eg between genes within a pathway or between pathways) (Chen et al. Presented at the Eastern North American Region Meeting of the Biometric Society; San Antonio, TX: February 2009).

In an analysis of the simulated data using the methods described by Conti et al., [20] with homocysteine as the causal variable (Table 5, first column) and incorporating an exchangeable prior structure in which all genes are treated equally (ie intercept only in the prior covariate matrix Z), the posterior probabilities of including the two modestly significant genes, TS and FTD, are 0.57 and 0.48, respectively. By contrast, when the prior covariate matrix is derived from the 'external database' from the simulation model, and is thus more informative about the underlying mechanism, these posterior probabilities change to 0.84 and 0.14, respectively. These changes in the posterior probabilities of inclusion reflect the covariate values for these genes in relation to homocysteine concentration and the AICART reaction velocity (the two prior covariates with the largest estimated second-level effects). For TS, the values of these covariates are large, resulting in an increase in the posterior probability of inclusion; for FTD, they are much smaller, and there is a corresponding decrease.

For mechanistic models, the 'topology' of the model Λ and the corresponding vector of model parameters θ_Λ are treated as unknown quantities, about which we might have some general prior knowledge in the form of the 'ontology' Z. In the microarray analysis world, Bayesian network analysis has emerged as a powerful technique for inferring the structure of a complex network of genes [69]. Might such a technique prove helpful for epidemiological analysis?

One promising approach is 'logic regression', which considers a set of tree-structured models relating measurable inputs (genes and exposures) to a disease trait through a network of unobserved intermediate nodes representing logical operators (AND, OR, XOR, etc) [70]. To allow for uncertainty about model form, an MCMC method is used to update the structure of the graphical model by adding, deleting, moving or changing the types of the intermediate nodes [71]. Although appealing as a way of representing biochemical pathways, logic regression does not exploit any external information about the form of the network. It also treats all intermediate nodes as binary, so it is more suitable for modelling regulatory than metabolic pathways, where the intermediate nodes would represent continuous metabolite concentrations.

To overcome some of these difficulties, we relaxed the restriction to binary nodes, parameterising the model as:

$$M_j = \theta_{j1} M_{p_{j1}} + \theta_{j2} M_{p_{j2}} + (1 - \theta_{j1} - \theta_{j2})\, M_{p_{j1}} M_{p_{j2}} \qquad (5)$$

When both input nodes (the 'parents' $p_j = (p_{j1}, p_{j2})$) are binary, various combinations of θs will yield the full range of possible logical operators (eg AND = [0, 0], OR = [1, 1]), but this framework allows great flexibility in modelling interactions between continuous nodes, while remaining identifiable; a small sketch of this parameterisation follows below. The Ms are treated as deterministic nodes, so the final metabolite concentration M_J(E, G; Λ, θ) can be calculated via a simple recursion. The disease risk is assumed to have a logistic dependence on M_J. Prior knowledge about the topology can be incorporated by use of a measure of similarity of each fitted network to the postulated true network (eg the proportion of connections in the true graph that are represented in the fitted one, minus the number of connections in the fitted graph that are not represented in the true one). In the spirit of Monte Carlo logic regression, the topology of the graph is modified by proposing to add or delete nodes, or to move a connection between them, using the Metropolis-Hastings algorithm [72]. Finally, the model parameters are updated, conditional on the current model form. By post-processing the resulting set of graphs, various kinds of inference can be drawn, such as the posterior probability that a given input appears in the fitted graphs, that a pair of inputs is represented by a node in the graph, or the marginal effect of any input or combination of inputs on the disease risk. In small simulations, we demonstrated that the model could correctly identify the true network structure (or logically equivalent ones) and estimate the parameters well, while not identifying any incorrect models. In an application to data on ten candidate genes from the Children's Health Study, we were able to replicate the interactions found by a purely exploratory technique [73] and identified several alternative networks with comparable Bayes factors.
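The parameterisation in Eq. (5) is easy to verify: with binary parents, particular θ values reproduce the logical operators, while continuous inputs are interpolated smoothly. The numeric example at the end is illustrative.

```python
def node_value(theta1, theta2, m1, m2):
    """Eq. (5): output of an intermediate node given its two parent values."""
    return theta1 * m1 + theta2 * m2 + (1 - theta1 - theta2) * m1 * m2

# With binary parents, specific theta choices recover logical operators:
assert node_value(0, 0, 1, 1) == 1 and node_value(0, 0, 1, 0) == 0   # AND = [0, 0]
assert node_value(1, 1, 1, 1) == 1 and node_value(1, 1, 1, 0) == 1   # OR  = [1, 1]

# The same form interpolates smoothly for continuous metabolite inputs:
print(node_value(0.3, 0.2, 1.4, 0.8))
```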

The folate pathway poses difficulties for mechanistic modelling because it is not a directed acyclic graph (DAG); although each arrow in Figure 1 is directed, the graph contains numerous cycles (feedback loops), making direct computation of probabilities difficult. In some instances, such cycles can be treated as single composite nodes with complex deterministic or stochastic laws, thereby rendering the remainder of the graph acyclic, but when there are many interconnected cycles, as in the folate pathway, such decomposition may be difficult or impossible to identify. Might it be possible, however, to identify a simpler DAG that captures the key behaviour of the network? Since any DAG would be an oversimplification and there could be many such DAGs that provide a reasonable approximation, the problem of model uncertainty is important.

A further extension of the Baurley et al. approach to the folate simulation will now be summarised. As in their approach, we assume that each node has exactly two inputs, but now distinguish three basic types of nodes: G × G, G × M (or G × E) and M × M. G × G nodes are treated as logical operators, yielding a binary output (high or low risk). G × M and G × E nodes represent intermediate metabolite concentrations, treated as continuous variables with deterministic values given by Michaelis-Menten kinetics with rate parameters V_max(G) and K_m. M × M nodes are regression expressions yielding a continuous output variable, with the mean parameterised as in Eq. (5). Disease risk is assumed to have a logistic dependence on one or more of the Ms. Finally, each measured biomarker B is assumed to be log-normally distributed around one of the Ms, with some measurement error variance. Rather than treating the intermediate nodes as deterministic, the likelihood of the entire graph is now calculated by peeling over the possible states of all the intermediate nodes.

Figure 4 shows the topologies discovered by the MCMC search. The largest Bayes factors are obtained when using no prior topologies. With a prior topology, essentially the same networks are found, with somewhat different Bayes factors.

Figure 4

Top-ranking topologies without incorporating priors: left, gene only; right, genes and exposures. With no priors, the two topologies have posterior probabilities of 3.9 per cent and 2.3 per cent, respectively. Using a topology derived by hierarchical clustering of the A matrix from simulated data, the top-ranked gene-only topology was identical to that shown on the left, with a posterior probability of 9.5 per cent. Using the GO topology shown in Figure 3, the same genes were included, but reordered as (((MTHFR, SAHH), MTD), SHMT), with a posterior probability of 6.4 per cent.

Pathways in a genome-wide context

Genome-wide association studies (GWAS) are generally seen as 'agnostic' -- the antithesis of hypothesis-driven pathway-based studies. Aside from the daunting computational challenge, their primary goal is, after all, the discovery of novel genetic associations, possibly in genes with unknown function or even with genomic variation in 'gene desert' regions not known to harbour genes. How, then, could one hope to incorporate prior knowledge in a GWAS? The response has generally been to wait until the GWAS has been completed (after a multi-stage scan and independent replication) and then conduct various in vitro functional studies of the novel associations before attempting any pathway modelling.

The idea of incorporating prior knowledge from genomic annotation databases or other sources as a way of improving the power of a genome-wide scan for discovery has, however, been suggested by several authors. Roeder et al., [74] Saccone et al., [75] Wakefield [76-78] and Whittemore [79] introduced variants of a weighted false discovery rate (FDR), while Lewinger et al. [80] and Chen and Witte [81] described hierarchical modelling approaches for this purpose. These could be applied at any stage of a GWAS to improve the prioritisation of variants to be taken forward to the next stage. For example, Sebastiani et al. [82] used a Bayesian test to incorporate external information for prioritising SNP associations from the first stage of a GWAS using pooled DNA, to be subsequently tested using individual genotyping. Roeder et al. [74] originally suggested the idea of exploiting external information in the context of using a prior linkage scan to focus attention on regions of the genome more likely to harbour causal variants, but subsequent authors have noted that various other types of information, such as linkage disequilibrium, functional characterisation or evolutionary conservation, could be included as predictors. An advantage of hierarchical modelling is that multiple sources can be readily incorporated in a flexible regression framework, whereas the weighted FDR requires an a priori choice of a specific weighting scheme.
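As one concrete version of the weighted FDR idea, the sketch below applies a weighted Benjamini-Hochberg procedure: each p-value is divided by a prior weight (normalised to average 1) before the usual step-up rule. The weighting scheme is illustrative, not any specific cited author's procedure.

```python
import numpy as np

def weighted_bh(p, w, q=0.05):
    """Weighted Benjamini-Hochberg: divide each p-value by its prior weight
    (weights normalised to mean 1), then apply the BH step-up rule at level q.
    Returns a boolean array marking the rejected (prioritised) hypotheses."""
    w = np.asarray(w, dtype=float)
    w = w / w.mean()                        # normalise weights to average 1
    p_adj = np.asarray(p, dtype=float) / w  # weighted p-values
    order = np.argsort(p_adj)
    m = len(p_adj)
    passed = p_adj[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                # reject the k smallest weighted p-values
    return reject

# e.g. SNPs in or near candidate pathway genes might get weight 2, others 0.5.
```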

A recent trend has been the incorporation of pathway inference in genome-wide association scans, [75, 83-89] borrowing ideas from the extensive literature on network analysis of gene expression array data [90, 91]. Currently, the most widely used tool for this purpose is gene set enrichment analysis, [92] which in GWAS applications aims to test whether groups of genes in a common pathway tend to rank higher in significance. Several published applications have yielded novel insights using this approach, [93-96] although others have found that no specific pathway outranks the most significant single markers, [89, 97, 98] suggesting that the approach may not be ideal for all complex diseases. Many other empirical approaches have been used in the gene-expression field, including Bayesian network analysis, [69, 99, 100] neural networks, [101] support vector machines [102] and a variety of other techniques from the fields of bioinformatics, computational or systems biology and machine learning [103-111]. Most of these are empirical in the sense of trying to reconstruct an unknown network structure from observational data, rather than using a known network to analyse the observational data. It is less obvious how such methods could be applied to mining single-marker associations from a GWAS, but they could be helpful in mining G × G interactions. Even simple analyses of GWAS data can be computationally demanding, particularly if all possible G × G interactions are to be included, and analyses incorporating pathway information are likely to be even more daunting. Recent developments in computational algorithms for searching high-dimensional spaces, and in parallel cluster computing implementations, may, however, make this feasible.

Recently, several authors [112-116] have undertaken analyses of the association of genome-wide expression data with genome-wide SNP genotypes in search of patterns of genetic control that would identify cis- and trans-activating factors and master regulatory regions. Ultimately, one could foresee using networks inferred from gene expression directly as priors in a hierarchical modelling analysis for GWAS data, or a joint analysis of the two phenotypes, but this has yet to be attempted. Other novel technologies, such as whole-genome sequencing, metabolomics, proteomics and so on, may provide other types of data that will inform pathway-based analysis on a genome-wide scale.

Conclusions

As in any other form of statistical modelling, the analyst should be cautious in interpretation. As pointed out by Jansen: [117]

'So, the modeling of the interplay of many genes -- which is the aim of complex systems biology -- is not without danger. Any model can be wrong (almost by definition), but particularly complex (overparameterized) models have much flexibility to hide their lack of biological relevance' [emphasis added].

A good fit to a particular model does not, of course, establish the truth of the model. Instead, the value of models, whether descriptive or mechanistic, lies in their ability to organise a range of hypotheses into a systematic framework in which simpler models can be tested against more complex alternatives. The usefulness of the Armitage-Doll [118] multistage model of carcinogenesis, for example, lies not in our belief that it is a completely accurate description of the process, but rather in its ability to distinguish whether a carcinogen appears to act early or late in the process or at more than one stage. Similarly, the importance of the Moolgavkar-Knudson two-stage clonal-expansion model [119] lies in its ability to test whether a carcinogen acts as an 'initiator' (ie on the mutation rates) or a 'promoter' (ie on proliferation rates). Such inferences can be valuable, even if the model itself is an incomplete description of the process, as must always be the case.

Although mechanistic models do make some testable predictions about such things as the shape of the dose-response relationship and the modifying effects of time-related variables, testing such patterns against epidemiological data tends to provide only weak evidence in support of the alternative models, and only within the context of all the other assumptions involved. Generally, comparisons of alternative models (or specific sub-models) can only be accomplished by direct fitting. Visualisation of the fit to complex epidemiological datasets can be challenging. Any mechanistic interpretations of model fits should therefore consider carefully the robustness of these conclusions to possible misspecification of other parts of the model.