Background

The companion Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study fenofibrate clinical trial data set [1,2,3] was the foundation of our Genetic Analysis Workshop 20 (GAW20) simulation. The general simulation strategy was to first simulate visit 4 methylation array data for each subject (which measures the individual epigenetic responses to genomethate treatment), and then use this plus the genome-wide association study (GWAS) genotypes to produce the simulated triglycerides for visits 3 and 4 post-treatment values. The main simulated effect of genomethate is on the phenotype of the individual subject’s triglyceride (TG) values measured as slope in response to treatment (change in mg/dL per unit time of treatment).

Methods, results and discussion

Figure 1 illustrates the graphical design of the simulations.

Fig. 1
figure 1

A 3D indexing order of the GAW20 simulation. The j index in the figure represents the subjects, the i index is noting the causal SNPs, where i = 1–5 also indexes the 5 main effects of the corresponding nearby CpG sites, while the sites 6–105 are 100 SNPs with background genetic effects. The k index indicates replications (k = 1, 2, …, R = 200)

The j index in the figure represents the subject (j = 1, 2, …, N = 717). The i index is noting the single-nucleotide polymorphisms (SNPs) chosen to be causal in the simulating model (i = 1, 2, …, G = 105), where i = 1, 2, 3, 4, 5 also indexes the 5 main effects of the corresponding nearby cytosine-phosphate-guanine (CpG) sites, while beyond main effects, the sites from 6 to 105 are 100 SNPs with background genetic effects. The k index indicates replications (k = 1, 2, …, R = 200).

The first 5 causal SNPs are “major” effects (summarized in Table 1), and the last 100 SNPs are polygenic background effects (Table 2). Note that only the first 5 CpG sites are relevant to the model, the polygenic background effects do not depend upon CpG states.

Table 1 Five major effect causal SNPs and corresponding nearby CpG markers affecting triglycerides at visits 3 and 4
Table 2 Background polygenic SNPs. All markers are simulated with the same heritability (hg2 = 0.001) affecting triglycerides at visits 3 and 4

We first defined a series of subjects’ triglyceride values from the original (real) Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) data [1], which was used to generate the simulations. Because triglycerides were approximately log-normally distributed, we worked with log-transformed triglyceride values in all calculations, only transforming back to the measured triglyceride scale at the end of the simulations. In particular, for the jth subject, the average log triglycerides pre-treatment (average of visits 1 and 2, which are 1 day apart) and post-treatment (average of visits 3 and 4, which are also 1 day apart) in the original (real) GOLDN data are:

$$ \boldsymbol{O}\_\boldsymbol{preRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}=\boldsymbol{mean}\left(\mathbf{\log}\left(\boldsymbol{TG}{\mathbf{1}}_{\boldsymbol{j}}\right),\mathbf{\log}\left(\boldsymbol{TG}{\mathbf{2}}_{\boldsymbol{j}}\right)\right) $$
$$ \boldsymbol{O}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}=\boldsymbol{mean}\left(\mathbf{\log}\left(\boldsymbol{TG}{\mathbf{3}}_{\boldsymbol{j}}\right),\mathbf{\log}\left(\boldsymbol{TG}{\mathbf{4}}_{\boldsymbol{j}}\right)\right) $$

where O –stands for “Observed / Original”, preRx stands for “pre-medication treatment,” postRx stands for “after medication treatment,” and TG labels “triglycerides” which were log transformed to ensure a normal distribution of the trait. The TG of person j is measured in visits 1, 2, 3 and 4 and averaged as above for each individual as preRx and postRx. The corresponding change in log triglycerides pre-treatment to post-treatment for subject j is given by:

$$ \boldsymbol{O}\_\boldsymbol{delta}\_{\boldsymbol{TG}}_{\boldsymbol{j}}=\left[\boldsymbol{O}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}-\boldsymbol{O}\_\boldsymbol{preRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}\right] $$

where delta is the “change”. The individual time on treatment (less than 30 days) for each subject (in days), is given by the following formula:

$$ \boldsymbol{O}\_{\boldsymbol{daysRx}}_{\boldsymbol{j}}=\boldsymbol{mean}\left(\boldsymbol{draw}\_\boldsymbol{date}\_\boldsymbol{v}{\mathbf{3}}_{\boldsymbol{j}},\boldsymbol{draw}\_\boldsymbol{date}\_\boldsymbol{v}{\mathbf{4}}_{\boldsymbol{j}}\right)-\boldsymbol{draw}\_\boldsymbol{date}\_\boldsymbol{v}{\mathbf{2}}_{\boldsymbol{j}} $$

where daysRx is “days after medication treatment,” draw_date is “blood draw date” at a particular v- “visit.” Thus, the observed slope (change in log triglycerides over the treatment period) is:

$$ \boldsymbol{O}\_\boldsymbol{slope}\_{\boldsymbol{TG}}_{\boldsymbol{j}}=\boldsymbol{O}\_\boldsymbol{delta}\_{\boldsymbol{TG}}_{\boldsymbol{j}}/\boldsymbol{O}\_{\boldsymbol{daysRx}}_{\boldsymbol{j}} $$

If mean_O_PreRx_TG and sd_O_preRx_TG are the mean and standard deviations, respectively, of all the O_preRx_TGj across the j = 1, …, N individuals, then the standardized original preRx of TGj are given by:

$$ \boldsymbol{O}\_{\boldsymbol{preZ}}_{\boldsymbol{j}}=\left(\boldsymbol{O}\_\boldsymbol{preRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}-\boldsymbol{mean}\_\boldsymbol{O}\_\boldsymbol{PreRx}\_\boldsymbol{TG}\right)/\boldsymbol{sd}\_\boldsymbol{O}\_\boldsymbol{preRx}\_\boldsymbol{TG} $$

where O_preZ -is a standardized normally distributed variable with N(0,1).

Tables 1 and 2 summarize the epigenetic model in our simulation. We chose 5 “major gene” causal variants (ranging from modest to small effect sizes corresponding to expected “heritabilities” of 0.125, 0.10, 0.075, 0.05, and 0.025), which, in the absence of any epigenetic effects, should govern individual genomethate treatment response along with 100 polygene variants (each of tiny effect size corresponding to “heritabilities” of 0.001 each). These were chosen randomly from chromosomes 1–20 of the GWAS Affymetrix Genome-wide Human SNP Array 6.0, which had 718,544 autosomal SNPs.

For the epigenetic component, we choose 5 CpG sites on the Illumina Infinium HumanMethylation450 BeadChip array (which had 463,995 CpG sites) that are physically closest to the 5 “major gene” causal SNPs, while the methylation sites near the 100 polygenes have no effect. The genomethate response model is based upon the idea that these CpG sites need to be sufficiently unmethylated for the corresponding causal SNPs to express their influence on each individual’s phenotype. If the nearby CpG site is totally methylated (=1), then the corresponding causal SNP actually has no effect on the phenotype. If the CpG site is totally unmethylated (=0), then the corresponding causal SNP carries its full effect size impact on the phenotype. If the CpG site is partially methylated (between 0 and 1), then the effect size of the causal SNP is proportionally attenuated.

Specifically, for the kth simulation, we first generated the simulated visit 4 methylation array results for all subjects, based upon their corresponding visit 2 and/ or visit 4 methylation values. For each subject j = 1, …, 717, and each CpG methylation site i = 1, 2, 3, 4, 5 (corresponding to 5 major effect CpGs)

$$ \boldsymbol{sim}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{4}}_{\boldsymbol{ji}\boldsymbol{k}}=\boldsymbol{real}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{2}}_{\boldsymbol{ji}}+{\boldsymbol{sd}}_{\boldsymbol{i}}\ast \boldsymbol{Z}{\mathbf{1}}_{\boldsymbol{ji}\boldsymbol{k}} $$

where sim_meth stands for “simulated methylation” at visit 4, real_meth is the jth subject’s “real methylation” array data at visit 2 for the ith CpG site, sdi = 0.4 represents the standard deviation of individual subject methylation responses to treatment, and Z1jik ~ N(0, 1) is a pseudo-random standard normal variable drawn independently for each jik.

For the remaining, non-causal CpG sites, if the subject j had real visit 4 methylation array data then

$$ \boldsymbol{sim}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{4}}_{\boldsymbol{ji}\boldsymbol{k}}=\boldsymbol{real}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{4}}_{\boldsymbol{ji}}+{\boldsymbol{sd}}_{\boldsymbol{i}}\ast \boldsymbol{Z}{\mathbf{1}}_{\boldsymbol{ji}\boldsymbol{k}} $$

Otherwise, if the subject j only had visit 2 methylation array data, then

$$ \boldsymbol{sim}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{4}}_{\boldsymbol{ji}\boldsymbol{k}}=\boldsymbol{real}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{2}}_{\boldsymbol{ji}}+{\boldsymbol{sd}}_{\boldsymbol{i}}\ast \boldsymbol{Z}{\mathbf{1}}_{\boldsymbol{ji}\boldsymbol{k}} $$

where real_meth_v2ji and real_meth_v4ji are the real visit 2 and visit 4 methylation array data, respectively, for subject j and CpG site i, sdi represents the standard deviation of individual subject methylation responses to treatment for the ith CpG site, and again, Z1jik ~ N(0,1) is a pseudo-random variable drawn independently for each jik.

We selected five random non-causal (red-herrings) CpG sites also (shown in Table 3). We set for them the sdi = 0.4, to be similar to the simulated causal CpG sites. For the remaining non-causal CpG sites, we set the corresponding sdi = 0.03, which is closer to that seen in the real visit 4 methylation data CpG sites, essentially at the measurement error level.

Table 3 Five non-causal (red-herrings) CpG markers chosen to have N(0,0.4) random variability, imitating the distribution of the 5 real causative CpG markers

In all cases, all simulated visit 4 methylation values were then truncated to be strictly in the [0,1] interval, that is,

if (sim_meth_v4jik>1) then sim_meth_v4jik=1

if (sim_meth_v4jik<0) then sim_meth_v4jik=0

for all subjects j, CpG sites i, and simulation replications k.

Note that the model is such that, on average, the genomethate treatment has no effect on the amount of methylation increase/decrease from visit 2 to visit 4, however, there is variability across subjects. To reiterate, the variability is quite high (sdi = 0.4) for the five CpG regions controlling the expression of the major causal variants and 5 other non-causal CpG (red-herrings) sites. The variability is low (sdi = 0.03) for all other CpGs, at the level of measurement error.

Using these simulated visit 4 methylation data, we then generated the simulated slope change in triglyceride response for each individual j in each replication k as follows:

$$ {\boldsymbol{slope}}_{\boldsymbol{jk}}={\sum}_{\boldsymbol{i}=\mathbf{1}}^{\mathbf{5}}\left(\mathbf{1}-\boldsymbol{sim}\_\boldsymbol{meth}\_\boldsymbol{v}{\mathbf{4}}_{\boldsymbol{ji}\boldsymbol{k}}\right)\ast \boldsymbol{sqrt}\left(\boldsymbol{hg}{\mathbf{2}}_{\boldsymbol{i}}\right)\ast {\boldsymbol{SSNP}}_{\boldsymbol{ji}}+{\sum}_{\boldsymbol{i}=\mathbf{6}}^{\mathbf{105}}\boldsymbol{sqrt}\left(\boldsymbol{hg}{\mathbf{2}}_{\boldsymbol{i}}\right)\ast {\boldsymbol{SSNP}}_{\boldsymbol{ji}}+{\boldsymbol{zenv}}_{\boldsymbol{jk}}\ast \boldsymbol{sqrt}\left(\mathbf{1}-\sum \limits_{\boldsymbol{i}=\mathbf{1}}^{\mathbf{105}}\boldsymbol{hg}{\mathbf{2}}_{\boldsymbol{i}}\right) $$
(1)

In the above formula, zenvjk is an independently drawn pseudo-random normal deviate distributed N(0,1) for each subject j and each replication k, and it represents unexplained residual variation in the phenotype. SSNPji is the standardized ith SNP additive genotype-dosage (i.e., coded such that mean = 0 and sdi = 1 in the sample), and the i = 1, 2, …, 105 regression coefficients in this linear model are given in terms of constants sqrt(hg2i), in Tables 1 and 3. Note that if the five causal CpG sites were completely unmethylated for all subjects (i.e., no epigenetic effects), then (1 – sim_meth_v4 jik) would be = 1 for all j = 1,…, N and i = 1,…, 5, and k = 1,…, 200, so that the regression coefficients would be interpreted as the square root of the locus specific heritability of the associated SNPs. Conversely, when the causal CpG site is totally methylated for that subject, (1 - sim_meth_v4 jik) = 0, so that the corresponding major effect SNPi will not express its effect on the phenotype. Similarly, if the CpG site is partially methylated (between 0 and 1), the effect size of the causal SNP is proportionally attenuated.

To carry forward these simulated relationships in eq. (1), we must address the fact that the observed slope responses for each subject are correlated to their baseline values of triglyceride (i.e., lower baseline values should produce less dramatic declines with treatment, whereas higher baseline values can experience greater slope change with treatment). In the real GOLDN data, the correlation between slope change in response to fenofibrate treatment and baseline log triglycerides is − 0.41881, and we used this constant value in our genomethate simulation to introduce a correlation between slope change and baseline values:

$$ {\boldsymbol{corrz}}_{\boldsymbol{j}\boldsymbol{k}}=\left(-\mathbf{0.41881}\right)\ast \boldsymbol{O}\_{\boldsymbol{preZ}}_{\boldsymbol{j}}+\boldsymbol{sqrt}\left(\mathbf{1}-{\left(\mathbf{0.41881}\right)}^{\mathbf{2}}\right)\ast {\boldsymbol{slope}}_{\boldsymbol{j}\boldsymbol{k}} $$

Because the simulated individual slopes are generated on the standardized scale, we needed to rescale to that of the original scale of triglyceride changes per day of treatment, by working backwards. The mean and standard deviation of O_slope_TGj over all subjects j, are denoted by mean_O_slope_TG and sd_O_slope_TG, respectively. We used the above observed mean and standard deviation of slopes seen in the original GOLDN data, to rescale as follows:

$$ \boldsymbol{sim}\_{\boldsymbol{slope}}_{\boldsymbol{jk}}={\boldsymbol{corrz}}_{\boldsymbol{jk}}\ast \boldsymbol{sd}\_\boldsymbol{O}\_\boldsymbol{slope}\_\boldsymbol{TG}+\boldsymbol{mean}\_\boldsymbol{O}\_\boldsymbol{slope}\_\boldsymbol{TG} $$

Then the expected response to genomethate treatment of the jth subject, after O_DaysRxj original days of treatment, is given by:

$$ \boldsymbol{sim}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}\boldsymbol{k}}=\left(\boldsymbol{sim}\_{\boldsymbol{slope}}_{\boldsymbol{j}\boldsymbol{k}}\ast \boldsymbol{O}\_{\boldsymbol{DaysRx}}_{\boldsymbol{j}}\right)+\boldsymbol{O}\_\boldsymbol{preRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}} $$

Finally, we used the simulated individual responses to produce the simulated values of triglyceride at visits 3 and 4, based upon the variability we see between those visits in the real GOLDN fenofibrate data:

$$ \boldsymbol{sim}\_\boldsymbol{TG}{\mathbf{3}}_{\boldsymbol{j}\boldsymbol{k}}=\boldsymbol{\exp}\left[\boldsymbol{sim}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}\boldsymbol{k}}+\left(\mathbf{\log}\Big(\boldsymbol{TG}{\mathbf{3}}_{\boldsymbol{j}}\right)-\boldsymbol{O}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}\Big)\right] $$
(2)
$$ \boldsymbol{sim}\_\boldsymbol{TG}{\mathbf{4}}_{\boldsymbol{j}\boldsymbol{k}}=\boldsymbol{\exp}\left[\boldsymbol{sim}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}\boldsymbol{k}}+\left(\mathbf{\log}\Big(\boldsymbol{TG}{\mathbf{4}}_{\boldsymbol{j}}\right)-\boldsymbol{O}\_\boldsymbol{postRx}\_{\boldsymbol{TG}}_{\boldsymbol{j}}\Big)\right] $$
(3)

If only 1 replicate of the GAW20 simulated data was to be analyzed, we recommend the 84th replication, which was provided in a separate directory, as a “representative” of the 200 replicated simulations. Chromosomes 21 and 22 datasets were not used in the simulation, so an analyst can use the corresponding data for building a NULL hypothesis. The simulated GAW20 data are accompanied by README and Data Dictionary files.