Simulating variance heterogeneity in quantitative genome wide association studies
Analyzing Variance heterogeneity in genome wide association studies (vGWAS) is an emerging approach for detecting genetic loci involved in gene-gene and gene-environment interactions. vGWAS analysis detects variability in phenotype values across genotypes, as opposed to typical GWAS analysis, which detects variations in the mean phenotype value.
A handful of vGWAS analysis methods have been recently introduced in the literature. However, very little work has been done for evaluating these methods. To enable the development of better vGWAS analysis methods, this work presents the first quantitative vGWAS simulation procedure. To that end, we describe the mathematical framework and algorithm for generating quantitative vGWAS phenotype data from genotype profiles. Our simulation model accounts for both haploid and diploid genotypes under different modes of dominance. Our model is also able to simulate any number of genetic loci causing mean and variance heterogeneity.
We demonstrate the utility of our simulation procedure through generating a variety of genetic loci types to evaluate common GWAS and vGWAS analysis methods. The results of this evaluation highlight the challenges current tools face in detecting GWAS and vGWAS loci.
KeywordsVariance heterogeneity Genome wide association studies GWAS simulation
Genome-wide association studies (GWAS) have been widely utilized over the past decade for studying the genetic origins of phenotypic traits, including a wide range of diseases. GWAS analysis compares the genome-wide genotype profiles of a large number of subjects with the corresponding phenotypic values. These studies aim to locate genetic loci where the genotype variation is highly correlated with the variation of the phenotype value [1, 2, 3]. These studies, however, have been limited to studying variation in the mean phenotype value. That is, the mean phenotype value observed when a certain genotype is present will be significantly different from the mean phenotype value observed when an alternative genotype is present. Although this type of analysis has been successful in uncovering a wide range of genetic associations, detecting mean changes only explains a small part of the total variance . The remaining variance has been thought to be caused by gene-gene (GxG) interactions, such as regulatory factors and epistasis, and gene-environment (GxE) interactions. These types of interactions cannot be easily detected using typical GWAS methods.
To address the limitations of typical GWAS analysis methods, vGWAS has emerged recently as an approach for detecting GxG and GxE loci [5, 6, 7]. There has been increasing evidence that when GxG or GxE interactions occur, variance heterogeneity is introduced into the genetic locus at which the interactions occur . Therefore, detecting vGWAS loci enables researchers to identify and study genetic loci involved in these interactions. vGWAS analysis aim to detect genetic loci that cause a significant change in the variance of the phenotype value. That is, the variance of the phenotype value observed when a certain genotype is present will be significantly different from the variance value observed when an alternative genotype is present.
The number of vGWAS analysis methods have been steadily increasing in the literature in the past few years. These methods utilize statistical tests and regression models to locate loci that have statistically significant effects on the phenotype. For example, Levene’s (Brown-Forsythe) test is a single-pass statistic widely used in vGWAS analysis. The test calculates the phenotype’s variance for both values of the genotype and then assesses the significance of the difference between the two variances using a chi-distribution. Generalized linear models have also been proposed for vGWAS analysis. These methods aim to fit a linear regression model to uncover relations between the genotype profiles and the observed phenotype values. In these models, the genotypes represent the variables of the model, whereas their coefficients represent their contribution to the mean and variance changes . As opposed to typical GWAS analysis linear models, vGWAS models assume that the dispersion of the model is a function of the genotype values. The authors of  used a double-generalized linear model (DGLM) to account for population specific effects in addition to mean and variance heterogeneity.
These methods have been utilized to uncover GxG and GxE interacting loci in a number of studies. For example in a study about body mass, vGWAS identified a locus that is believed to interact with physical activity and the lifestyle of the individual . In another study of the human genome, most of the vGWAS loci identified were located in CNV-containing regions, suggesting a link between variance heterogeneity and the effects of CNVs which include having a higher susceptibility to various complex neurological disorders such as schizophrenia and Parkinson disease . Finally, the vGWAS analysis of lymphoblastoid cell lines identified two genetic loci which have been confirmed to be involved in GxE and GxG interactions .
The proposed methods have been able to uncover biologically relevant information through detecting variance heterogeneity in different GWAS datasets. However, the development of vGWAS analysis models could still benefit from significant enhancements. A first step towards achieving that goal could be establishing reliable and consistent grounds for comparison between these tools. This could be achieved through simulating vGWAS loci. Simulation provides the researcher with three important features: 1) Control over the input genotypes, 2) Predictability of the output values, and 3) Consistency across different runs. Controlling the values of the input genotypes allows the researcher to simulate different genetic and evolutionary scenarios. The predictability of the output is particularly valuable for evaluating vGWAS methods as it provides a ground truth for comparison. And finally, consistency is achieved through eliminating the uncontrollable factors of randomness that are usually present in lab experiments. Therefore, simulation presents a reliable and inexpensive method for testing and evaluating vGWAS methods under different conditions.
Several GWAS simulators have been developed in the past decade, however, none of these simulators account for variance heterogeneity effects. In this work we present a systematic procedure for simulating vGWAS data, as well as introducing the first simulation algorithm for quantitative vGWAS data. The algorithm could be applied to either haploid or diploid genotypes with different modes of dominance while incorporating the effects of multiple vGWAS loci using an additive model. We hope that the presented work will motivate the evaluation of current vGWAS methods and drive the development of new methods.
This paper is organized as follows: we describe the genotype generation process and the subsequent mathematical framework for producing vGWAS phenotype values in the “Methods” section. We then describe the implementation of our vGWAS simulation algorithm and demonstrate it’s utility in the “Results and discussion” section before presenting our conclusions in the last section. We present a detailed mathematical discussion in the Supplementary Materials section.
The simulation of GWAS data enables researchers to investigate important aspects of genetic control . Simulation grants researchers control over the genetic properties of the input population and allows them to observe the impact of using different genetic models on the output. The main contribution of GWAS simulation is that it provides a tool for evaluating the statistical and algorithmic methods developed for detecting genetic control loci. Simulation allows the researcher to set a ground truth for these methods to be compared against, as well as enabling the evaluation of these methods under different scenarios.
The simulation process is divided into two steps. The first step of the simulation process focuses on producing a realistic population of genotypes while allowing control over several genetic and evolutionary parameters. The second step of the simulation process assigns or calculates phenotypic values for the simulated population. In this section, we describe both steps of the simulation process.
The population generation process focuses on producing a realistic population of genotypes. Population generation is used to simulate different genetic properties and population structures, such as demographies, mutations, and recombination events. This level of control allows the researcher to engineer the population according to the genetic and evolutionary scenarios investigated. Several simulation methods have been developed to address this problem. These methods could be categories into three types, depending on the simulation strategy they use: resampling-based methods [13, 14, 15], backward- time-based (coalescent) methods [16, 17, 18], and forward-time-based methods [12, 19, 20, 21, 22].Each of these strategies provides the researcher with varying levels of control over the different aspects of the simulation. However, the vGWAS simulation method we describe in this paper could be applied to genotypes from any type of population simulator.
Phenotype assignment maps the genotype state of each simulated individual to a phenotype value. Phenotype values could be either quantitative or qualitative. Qualitative phenotypes are usually assigned in case/control studies where the phenotype could be one of only two values. Quantitative phenotypes on the other hand take a numerical value which could represent traits such as height, weight, or expression level of a certain gene.
The individual has a haploid genome.
The phenotype value is affected by the allelic value of a single genetic locus.
Phenotype calculation framework
The phenotype calculation step uses the generated genotype profiles to calculate a phenotype value for each subject in the population. The genotype profiles are used to calculate the allele frequencies of each loci. The researcher then specifies the phenotype controlling loci and the effect size of each locus on the total phenotypic variation. In typical settings, the researcher would choose specific loci from the list of generated loci. Alternatively, the researcher could specify the number of phenotype controlling loci and an allele frequency range such that the simulator could pick appropriate loci from the list. The phenotype calculation process is often provided with parameters such as a baseline mean and the value of the total phenotype variation in order to control the range of possible phenotype values. This allows the generated phenotype values to resemble the values encountered in the researcher’s investigation. In this section, we show how these inputs could be used to calculate the parameters of Eq. 1 and therefore calculate the corresponding phenotype value.
Relaxing assumption 1: accounting for diploid genomes
Simulating diploid GWAS data is important for studying complex genetic traits especially in diploid organisms such as humans. We modify the formulation described in the previous subsection in order to calculate phenotype values that are based on a genotype that could take one of three values: major allele homozygous, minor allele homozygous, and heterozygous.
In many genetic applications, the population would be engineered to only consist of homozygous individuals. However, we choose to account for heterozygous individuals to facilitate the investigation of a wide number of scenarios and genetic traits. We lay out the mathematical procedures needed to calculate the phenotype value for heterozygous individuals using two models of inheritance: 1) the co-dominance model and 2) the complete dominance model. In this section we start by describing the formulation of the co-dominance model before describing the extensions needed for using the complete dominance model. Co-dominance occurs when both alleles in the diploid are fully expressed. That is, the effects of both allele in the diploid are added together to produce the total effect of the genotype of the phenotype’s value.
Substituting the values of these parameters in Eq. 11 yields the phenotype value for each individual in the population.
Complete dominance In diploid genomes
An allele shows dominance when it suppresses the effects of the recessive allele. Dominance could exist in three forms: complete dominance, incomplete dominance, and co-dominance. In this section we modify the formulation used in the co-dominance model to account for complete dominance. This type of dominance occurs when the dominant allele completely masks the effects of the recessive allele.
A 2 A 2
Using this formulation, the phenotype could be calculated using the same procedure as the one used for haploid individuals. α, ϕ, and σ can all be calculated using Eqs. (8) to (10). Consequently, the phenotype value could be caluclated using Eq. 1.
Relaxing assumption 2: accounting for multiple additive loci
In a more realistic scenario, the phenotype value is determined by the genotype state of multiple loci. We modify the phenotype calculation procedure to account for multi-loci effects, assuming an additive model. Linear interaction is only one form of multi-loci interaction. Our phenotype calculation process does not account for nonlinear effects, such as epistasis. However, extending the proposed process to account for these effects is a topic for future investigation.
vGWAS simulation algorithm
To facilitate the implementation of vGWAS simulators, we translated the mathematical framework into an algorithm, presented in Algorithm 1. The algorithm uses genotype profiles as its input. These genotype profiles could be generated using population simulators or by using real data from genotyping experiments such as the 1000 Genome Project. The algorithm also requires some information about the loci to be simulated, including their number, minor allele frequency, and mean and variance effect size. The algorithm searches the inputted genotypes for loci that meet the specified criteria and then exploits the described mathematical framework to calculate the phenotype value. This algorithm is able to induce both mean and variance discrepancies based on one or more loci. The algorithm accounts for both haploid and diploid genotypes under co-dominance or complete dominance. The algorithm uses Eqs. 15 and 19 to calculate the phenotype value of the haploid genotypes and co-dominant diploid genotypes respectively. The algorithm also uses Eq. 15 to calculate the phenotype value in the case of complete dominance for diploid genotypes.
Results and discussion
In this section we describe the evaluation of our simulator through the generation of different genetic control scenarios. We conducted two sets of experiments; in the first set, we generated and visualized all the modes of variation supported by our tool to demonstrate its ability to mimic realistic patterns. In the second set of experiments, we used common GWAS and vGWAS approaches to recover the loci generated using our algorithm. Finally, we highlighted several applications in which our simulation process could be useful.
The simulation algorithm was implemented as Python scripts and has been tested on a Unix platform. The simulation scripts were used to calculate quantitative phenotypes from genotype profiles generated using ms  and GENOME  which are two popular coalescent genotype simulators. The simulation scripts use Eq. 15 and Eq. 19 to calculate the phenotype value of the haploid genotypes and complete-dominance diploid genotypes respectively. The simulator also uses Eq. 15 to calculate the phenotype value in the case of co-dominance for diploid genotypes. In our implementation, we utilized scripts from  for performing basic tasks such parsing genotypes and writing the output files.
Our first experiment aims to verify that the values of the generated phenotype are in accordance with the desired distribution. We generated a segment containing 10,000 genotypes for 2,000 samples using Hudson et al.’s ms  simulator. From the generated genotypes, we generated six sets of results corresponding to two modes of ploidy, each containing three types of effects. We used our simulation algorithm to calculate phenotype values for haploid genotypes, diploid genotypes under co-dominance. For each mode of ploidy, we generated created three sets of phenotype values corresponding to three types of loci. We created one locus in each of these sets. In the first set, the locus had a mean effect size of 5% and no variance effects. The second locus had no mean effects, and a variance effect size of 5%. The third set, the single locus had both mean and variance effects, each with a 5% effect size.
Association signal recovery
To demonstrate the utility of our algorithm, we performed a second experiment which involved simulating genotypes based on realistic settings and then using common GWAS and vGWAS analysis tests to recover the simulated loci. The purpose of this experiment is to show that the GWAS and vGWAS signals simulated using our algorithm could be retrieved using common GWAS/vGWAS detection algorithms. To that purpose we used the GENOME  simulator to generate genotypes with similar characteristics as those observed in the arabidopsis thaliana 120Mbp genome [24, 25, 26]. We generated three diploid chromosomes with 50,000 SNPs each for a population size of 10,000 with a population recombination parameter of r = 8X10−3. We then used our simulation algorithm to generate three loci, one on each chromosome. The first locus has a mean effect size of 5% and no variance effects. Similarly, the second locus has a variance effect size of 5% and no mean effects. The third locus has a total effect size of 6%, divided equally between mean and variance effects.
We used Plink’s ‘assoc’ function which is a GWAS detection method. This method attempts to fit a linear regression at each genotype to determine if there is a relation between the varying genotype values and the corresponding phenotypes. Eq 1. This approach mainly attempts to estimate the mean shift sizes occurring at the different genotype states in the presence of a normally distributed residual having a constant variance term.
The Brown-Forsythe test which is a vGWAS detection method. This method single-pass test which measures the significance of the variance separation observed at one tested genotype at a time and ultimately produces a p-value statistic for that genotype. This test has been commonly used to check for variance heterogeneity in vGWAS experiments. We implemented this test using R’s Levene Test package .
We implemented the DGLM association method described in  which is both a GWAS and vGWAS method. This method attempts to recover both mean and variance shift-causing loci from the data simultaneously.
In this work we presented a mathematical framework for simulating quantitative vGWAS data. We translated the mathematical framework into a simulation algorithm, which, to the best of our knowledge, is the first vGWAS simulation algorithm in the literature. The presented algorithm generates quantitative vGWAS phenotype values using genotype profiles from common population simulators. The algorithm could be applied to either haploid or diploid genotypes with different modes of dominance while incorporating the effects of multiple GWAS and vGWAS loci using an additive model. To demonstrate the utility of our algorithm, we used common GWAS and vGWAS analysis methods to detect the loci generated using our algorithm. We hope that the presented work will motivate the further evaluation of current vGWAS methods and drive the development of new methods to address the limitations of current methods.
This project has been funded in part with federal funds from the National Institute of 601 Health under award R21AI126219 (JJC). The publication costs of this article was funded through the support of a grant offered by Texas A&M University at Qatar (ES).
Availability of data and materials
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 3, 2018: Selected original research articles from the Fourth International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNB-MAC 2017): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-3.
AAK formulated the simulation math, performed the experiments, and wrote the first draft of the manuscript. MA verified the simulation math and majorly contributed to the second draft of the manuscript. JJC conceived the idea for vGWAS Simulation and supervised the simulation process. ES and AD contributed in an advisory capacity. JJC, ES, and AD made significant contributions to the final draft of the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 27.Hui W, Gel Y, Gastwirth J. lawstat: an R package for law, public policy and biostatistics.J Stat Softw Articles. 2008; 28(3):1–26.Google Scholar
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.