Introduction

Human genetic studies generally focus on identifying genes associated with diseases and subsequently using the significant genetic information for phenotypic predictions. Statistical methods vary with each of these goals, with methods for gene identification becoming increasingly effective with genome-wide association studies (GWAS). However, modeling principles for phenotypic predictions of complex human diseases remain largely unexplored even in this era of GWAS [1, 2]. Phenotypic predictions in human data sets have been modeled using regression models with a few single nucleotide polymorphisms (SNPs) having strong association identified by GWAS, but this practice of using only a few genetic markers has been disappointing due to its low predictive power [3]. A similar problem known as “the problem of missing heritability” is encountered when only highly associated markers are used to estimate heritability of complex traits [4, 5]. However, as demonstrated recently with human height and complex diseases, the missing heritability problem can be alleviated when linear regression models include information from all markers in the GWAS [68]. Such models can be seen as finite locus approximations for polygenic models, being representative for linkage disequilibrium (LD) structure in population, and they should therefore provide useful predictions even in the presence of a large number of rare variants.

Bayesian variable selection [9, 10] and the frequentist regularization methods [11, 12] have gained attention to perform parameter estimation and variable selection simultaneously in phenotype-marker association analysis. These methods generally perform well in selecting trait-associated loci or loci in LD to estimate genomic breeding values in animals and plants [13, 14]. Lee et al. [15] considered that methods for predictions of phenotypes and genomic breeding values may employ similar tasks and can be successfully substituted for one another. Generally, it is known that the marker density (the length of the genome), population LD, and number of individuals in the learning sample will all have a strong influence on the prediction accuracy [1]. Moreover, simplicity of the genetic architecture and heritability of the trait are also the key factors for the prediction success. Animal and plant breeding literature includes numerous simulation studies on how these factors influence prediction accuracy, see, e.g., references in [1]. Generally, from studies of different prediction methods applied to plant and animal data sets, one can conclude that variable selection approaches seem to work most efficiently for oligogenic traits with sparse genetic architectures in the presence of moderate LD among markers. On the other hand, approaches based on mixed models and marker-estimated covariance structures [1, 16] seem to work well for polygenic traits or oligogenic traits in the presence of strong LD among markers (which will make oligogenic data sets to have large numbers of trait-associated markers like polygenic traits). Certain modern classification methods [17] and Bayesian subset selection methods [18, 19] have been applied to predict chronic disease or to find important subsets of SNPs contributing to chronic diseases with small data sets. In general, combining multiple sources of information may lead to more accurate phenotype prediction [12, 20, 21]. This approach has been confirmed with simulations but its benefit with real data is still questionable, especially with complex phenotypes [19]. We consider here information from a carefully selected set of pathway-focused SNPs and applied Bayesian methods in the context of association mapping to model phenotype status. In association studies with SNPs, it is difficult to distinguish the most associated SNPs in regions of high LD. In such situations, it may be informative to inspect for stable association signals after first smoothing the signals in regions of high LD [22]. On this basis, we hypothesize that more stable and accurate phenotypic prediction may be achieved after smoothing pathway-focused SNP effects in regions of high LD. Our goals here are to provide genomic prediction of an individual’s disease status as a sum of SNP effects, which depend on individual’s genotype pattern at SNPs. This is done by means of model-averaged estimation of genetic effects used to weight individual SNPs. For this, we use Bayesian methods in genetic association studies and model-averaged estimate of genetic effects that we call as weighted genetic variation (WGV) [23]. Our Bayesian methods can perform variable selection and parameter estimation simultaneously and can induce the sparseness and smoothness of the estimated SNP effects [19, 23]. Smoothness of the SNP effects is obtained by the explicit modeling of the covariance structure of the SNP effects. We evaluated our Bayesian modeling principle and methods for modeling of chronic fatigue syndrome (CFS), as an example of a complex disease where its diagnosis remains elusive with the need for improved analytical approaches for gene identification and accurate phenotypic prediction [24].

Methods

Subjects and illness classification

This study adhered to human experimental guidelines of US Department of Health and Human Services and the Helsinki Declaration. The Centers for Disease Control Human Subjects committee approved the study protocol, and all subjects gave written informed consent. Subject recruitment, clinical evaluation, laboratory tests, and their classification were described previously [25]. Briefly, 227 subjects were recruited from Wichita, KS, USA, as part of a 2-day in-hospital evaluation of unexplained fatigue. These subjects were identified from a surveillance cohort of 7162 fatigued and non-fatigued subjects who were originally screened from 56,146 adult residents, 18 to 69 years of age. During the 2-day hospital stay, symptoms and exclusionary medical and psychiatric conditions were reevaluated for all 227 subjects. Following the 2-day hospital study, all subjects were classified based on all aspects specified in 1994 CFS case definition [26] and Medical Outcomes Short-Form, Multidimensional Fatigue Inventory and Symptom Inventory cutoff scores to include measures on the functional impairment, fatigue, and accompanying symptom complex that characterize CFS. Following this classification, 124 subjects were excluded because of medical or psychiatric exclusionary conditions or insufficient criteria to classify as CFS. Of the 103 remaining subjects, 101 subjects with genotype data were classified as CFS (43 subjects) and non-fatigued (NF; 58 subjects) healthy controls. The demographic characteristics along with the type of disease onset (gradual vs. sudden) of subjects in this study are given in Table 1.

Table 1 Demographic and other characteristics of the subjects selected for analysis

SNP selection, genotyping, and annotation

Because of the reported associations of CFS with perturbations in hypothalamic-pituitary-adrenal (HPA) axis and immune functions, we selected a total of 39 candidate genes implicated with the central nervous system (CNS) (30 genes) or immune and inflammation functions (nine genes) to determine the accuracy CFS prediction based on combinations of SNPs (Additional file 1: Table S1). There were a total of 167 SNPs in all candidate genes (137 SNPs in genes implicated with CNS and 30 SNPs in genes implicated with immune and inflammation). There were a total of 23 SNPs in X-chromosomes, and these were in two genes of serotonergic neurotransmission (HTR2C and MAOA). SNPs were selected from the SNP database (dbSNP) of National Center for Biotechnology Information database, Applied Biosystem’s SNPBrowser™ or from the literature. All SNP markers had a minor allele frequency ≥10 % with the exception of HTR2A rs6314 (6.2 %), a non-synonymous SNP. DNA extraction and genotyping were done as described earlier [27]. Most SNPs (158 out of 167) were genotyped using validated TaqMan genotyping assay kits (Applied Biosystems, CA, USA) and the 7900 Sequence detection system (Applied Biosystems). Eight SNPs were genotyped by pyrosequencing. Genotyping for one polymorphism in SLC6A4 designated 5-HTTLPR was conducted using gel-based assays [28]. SNP annotation was done using SPOT algorithm as implemented in the web-based tool accessible at https://spot.cgsmd.isi.edu/submit.php [29]. It may be noted that SNP selection was done in the design stage of this study before genotyping while in some other studies, statistical prescreening procedures such as sure independent screening were applied after genotyping to reduce the dimensionality [30, 31].

Model

We applied Bayesian logistic association models [18, 19] using subset of SNPs to predict disease status (y i ) of the individual i. Our hierarchical model structure underlying our predictive model is almost identical to that presented earlier by us [23] for LD mapping.

The logistic association model for SNP data (of individual i) can be written as:

$$ \mathrm{Logit}\left({p}_i\right)=\alpha +{\displaystyle \sum_{l\in {M}_A}\left({\beta}_{l,1}\left(2-{m}_{i,l}\right)+{\beta}_{l,2}{m}_{i,l}\right){I}_l}+{\displaystyle \sum_{l\in {M}_S}\frac{1}{2}\;\left({\beta}_{l,1}\left(2-{m}_{i,l}\right)+{\beta}_{l,2}{m}_{i,l}\right){I}_l}. $$

Here, the logistic link function, \( \mathrm{Logit}\left({p}_i\right)= \ln \left[\frac{p_i}{1-{p}_i}\right]= \ln \left[\frac{P\left({y}_i=1\Big|m\right)}{1-P\left({y}_i=1\Big|m\right)}\right] \), α is an intercept, (β l,1 and β l,2) are genetic effects of SNP l. I l ’s are the indicator variables taking care of variable selection to select subset of SNPs to the model (see below for more details). The SNPs have been reorganized into two groups according to whether they are on sex chromosomes (i.e., lM S ) or autosomal chromosomes (i.e., lM A ). The genotype value m i,l of individual i at SNP l is represented numerically by 0 for homozygote AA, 1 for heterozygote AB, and 2 for homozygote BB. Our model for the genetic effects assumes an additive model, where an effect of the heterozygote (β l,1 + β l,2) is in the middle of the effects of two homozygotes (β l,1 + β l,1 and β l,2 + β l,2). It is over-parameterized to improve the mixing and convergence properties of the Markov chain Monte Carlo (MCMC) sampling algorithm to yield better parameter estimates. The usual practice would be to set the first genetic effect to zero and have one parameter only for the other genetic effect. The absolute difference of the two effects here reflects the effect size of the SNP. Note that the factor of one half is introduced for the SNPs on the sex chromosome due to female X-chromosome mosaicism and the adjustment by half can be used for male only after assuming all SNPs on their X-chromosome are homozygous (cf. [32]).

We assume a priori that there is only a small subset of important SNPs that are useful to predict disease status. In these predictive models, subset selection of important SNP effects to the predictive model is based on the use of indicator variables (I l , l = 1,…, M), all of which either equals one (inclusion) or zero (exclusion) depending on the importance of particular SNP (see, e.g., [9]). Here, M is a number of SNPs. However, in reality for closely situated SNPs, it is difficult to distinguish the effects of individual SNPs due to LD. Thus, LD pattern among SNPs in the linked region is explicitly modeled as dependency prior for variable selection indicators in our model as described below and reported earlier in [23].

Model for missing SNP data

Missing SNP data are handled in a similar manner as any other model parameter. Thus, prior distributions are assigned to all missing SNP data. We assume a priori that the missing SNP genotypes occur at random and independently within and across SNPs (in the sense that the probability that the genotype is missing is not dependent on the true genotype pattern at the locus or at any of its neighboring loci). The prior distribution p(m i,l ) of a missing genotype m i,l under the Hardy-Weinberg equilibrium is a binomial distribution, where both alleles have equal occurrence probabilities within the locus (i.e., p(m i,l ) = 0.25, 0.5, and 0.25 for three genotypes AA, AB, and BB, respectively). Additionally, an independence over loci (linkage equilibrium) is assumed for missing SNP genotypes as \( p(m)={\displaystyle \prod_{i=1}^Np\left({m}_{i,1},\dots, {m}_{i,M}\right)={\displaystyle \prod_{i=1}^N{\displaystyle \prod_{l=1}^Mp\left({m}_{i,l}\right)}}} \). During MCMC sampling, our phenotypic model (likelihood) gives information on which genotype value is most likely in the light of the data. Thus, we let the data to speak in our missing data model. For meta-analysis purposes, where SNP genotypes at some data sets may be missing for all the individuals, one may want to consider to predict the SNP genotypes with reasonable accuracy using genotype imputation methods (utilizing LD correlation structure between neighboring SNPs) based on HapMap and 1000 genomes project reference panels [33, 34].

Priors for α, β, I

Prior specification is intrinsically subjective, and specifying prior that will satisfy everyone and/or every aspect might be unachievable. We adopt the method where priors reflect our intuitive knowledge but are also useful in avoiding some potential pitfalls and help reduce the computational burden. The parameter α relates to the intercept term of the regression which in this case with the Logit function would be close to zero (with 43 CFS cases and 58 NF controls). In the case of regression modeling a quantitative phenotype, the variance of the intercept parameter can be related to the scale of the phenotype. Here, we have binary phenotype; thus, without further information on the variability, we assume a standard normal distribution for the intercept parameter α.

The parameters β and I together determine which SNPs potentially have effect on the phenotype and the extent of this effect.

Following [23], we denote the vector of (genetic or physical) distances between the SNPs with d = (d 2 ,, d M ). We also use a smoothing parameter λ for neighboring SNPs which allows us to model dependence of two adjacent SNPs. Our prior probability for each SNP to be involved in the model is \( \left.P\Big({I}_l=1\right|s\Big)=s=\frac{1}{M} \), which corresponds to assuming only a single marker to be important predictor in the model (see [23] for details). However, SNPs exhibiting strong LD in a single genomic region would change these probabilities, and we model this by a Markov model where the extent of LD is decaying according to the distance information [23]. This is, \( P\left({I}_1,\dots, {I}_N\left|s\right.,\lambda, d\right)=P\left({I}_1\left|s\right.\right){\displaystyle \prod_{l=2}^MP\left({I}_l\right|}{I}_{l-1},s,\lambda, {d}_l\Big) \). Given the state of inclusion indicators at locus (l − 1), the transition matrix for the inclusion indicators for locus l is given by

$$ \left(\begin{array}{cc}\hfill {e}^{-\lambda {d}_l}+\left(1-{e}^{-\lambda {d}_l}\right)\left(1-s\right)\hfill & \hfill \left(1-{e}^{-\lambda {d}_l}\right)s\hfill \\ {}\hfill \left(1-{e}^{-\lambda {d}_l}\right)\left(1-s\right)\hfill & \hfill {e}^{-\lambda {d}_l}+\left(1-{e}^{-\lambda {d}_l}\right)s\hfill \end{array}\right). $$

Genetic effects (β l,1 and β l,2) of each locus are assigned marginally a priori using a scale-mixture representation of Student’s t distribution (e.g., [35]). This means that genetic effects at each locus are first assumed to be normally distributed with common variance σ l 2 which again is assumed to be a priori inverse gamma distributed. In this model, a vector of locus-specific genetic variance components σ 2 = (σ 1 2,…, σ M 2), over the M loci, controls the corresponding genetic effect parameters under a normal model. We further make the following conditional independence assumption that given σ 2 and s, the locus indicators I and genetic effects β are independent. Thus, prior distribution P(β l,k |σ l 2) for genetic effects β l,k (k = 1,2) was assumed to be normal N(0, σ l 2) with locus-specific variance component σ l 2.

Priors for hyper-parameters σl 2, λ

The prior for genetic variance P(σ l 2) at locus l was given an inverse gamma (1, 1), and consequently, P(σ 2) = Π l M = 1 P(σ l 2). The smoothing parameter λ is given a wide prior of gamma (1, 0.01) which has both the mean and standard deviation as 100. This parameter helps to eliminate spurious associations but strengthens the real association signals. However, since we utilize a common smoothing parameter over multiple locations of the genome, it lacks intuitive explanation compared to situation where it is used for a single densely mapped genomic region. There, this parameter can be thought to roughly represent the time since the relevant mutation affecting the phenotype (see [23]).

Complete model

The relevant joint density to derive the posterior of the parameters of interest is then obtained using the following expression which is based on the above and utilizes appropriate conditional independence assumptions:

$$ \begin{array}{l}p\left(y,I,\alpha, \beta, {\sigma}^2,\lambda, m\Big|s,d\right)\kern0.5em =\kern0.5em p\left(y\Big|m,I,\alpha, \beta \right)\hfill \\ {}\kern10.3em \times \kern0.5em p\left(I\Big|\lambda, s,d\right)p\left(\lambda \right)\hfill \\ {}\times \kern0.5em p\left(\alpha \right)p\left(\beta \Big|{\sigma}^2\right)p\left({\sigma}^2\right)\hfill \\ {}\times \kern0.5em p(m)\hfill \end{array} $$

Weighted genetic variation and heuristic model reduction

The weighted genetic variation (WGVl = |β l,1− β l,2|I l ) at locus l, (l = 1,…, M), is computed as a product of absolute difference of the genetic effects |β l,1− β l,2| and the inclusion indicator (I l ).

A heuristic model reduction method was used to speed up the estimation by determining the number of SNP predictors that can be reduced without loss of predictive ability. Using the percentiles of WGV (Additional file 1: Table S2) as cutoff/threshold, a selected set of SNPs was retained in the model with two critical components derived from the overall model as follows. For the indicators, the joint posterior distribution of the indicators from the full model was used where the outcome of the spike-n-slab technique over the MCMC simulation for full model was stored and reused for model reduction purposes. For the remaining SNPs (with WGV lower than the threshold value), the individual-level genotype information were not used, and thus, these SNPs would effectively cease to act as covariates in the model (see Additional file 1 for detailed information on WGV and heuristic model reduction).

K-fold cross-validation

Cross-validation methods [36] give a better assessment of model predictive performance for new data, i.e., phenotypic predictions of individuals whose phenotypes and genotypes have not been involved in the learning sample. The hold-out or split-sample method, in which the data is split into training and testing sets, is the simplest kind of cross-validation. While this method assesses model performance on real prediction situation with new data, it is subjective to the choice of the partition of the data into training and testing sets. K-fold cross-validation is one way to improve over the split-sample method. The data set is divided into K (approximately) equal subsets, and the hold-out method is repeated K times, by which every data point gets to be in a test set exactly once and gets to be in a training set K − 1 times. We have used K = 10 which is also one of the most popular choices of K [37].

One major disadvantage of K-fold cross-validation is that the time taken would also be typically K times that required for estimation based on whole data. Using data on all 167 SNPs, a tenfold cross-validation would require approximately 150 min per MCMC iteration. We retained information from all SNPs since CFS phenotype is complex and there could be loss of predictive ability with reduced number of SNPs. Therefore, instead of reducing the number of SNPs in the model to reduce the time, we made a few modifications to the full model (“Model” section) as noted in the supplementary file.

Model implementation

The models were implemented in WinBUGS software [38], a special software to carry out MCMC simulation from posterior of complex models. The simulations were started with random initial values and were run for several thousands of iterations. Convergence of MCMC chains was monitored by visual inspection of MCMC trace plots with respect to several different model parameters and also by assuring that low values of the MCMC error have been reached for all critical model parameters. Software codes are available in the supplementary file and at http://www.rni.helsinki.fi/~mjs/.

Comparison with competing prediction methods

To compare our method, certain predictions were made for the same data set with generalized linear model versions of LASSO and ridge regression [11, 12] using R-package “penalized” [39]. LASSO model assumes independence among predictors while ridge regression can handle collinearity among them. We also compared our Bayesian model assuming independence between predictors (i.e., prior independence among indicator variables). This can provide information on the importance of the dependence structure to the model. LASSO and ridge regression analyses were performed using the same over-parameterized model as in the Bayesian analysis which had own coefficient for each allele. Before LASSO and ridge regression analyses, we imputed missing data (once) using the Bayesian missing data model. However, this resulted in rank-deficient data matrix in which we eliminated 15 markers to make matrix acceptable for the R-package. To find the best tuning parameter value in LASSO (and ridge regression), we tried 55 (and 94) different values in range [0.1, 1000] (and [0.01, 25000]), respectively. In the Bayesian analysis, the priors and other settings were kept the same as earlier.

Results

CFS in-data prediction using full model

In-data prediction basically uses the training data itself to model the phenotype outcome. The data described earlier in the “Subjects and illness classification” and “SNP selection, genotyping, and annotation” sections and the model developed by us (presented in the “Model” section) were used to evaluate the goodness of fit of the model to predict CFS for 101 subjects (with 43 CFS and 58 NF individuals) using all the SNP data. This in-data prediction model, among other things, provided estimates of WGV of the SNPs, which are combined measures of their selection probability in the model and degree of effect on the phenotype (Additional file 1: Table S2). This is computed as a random variable based on the product of the indicator and absolute difference of the genetic effects for a locus, i.e.

WGVl= |β l,1− β l,2 |I l for the l-th locus, l = 1,…, M. The MCMC sampling from the posterior also enables us to estimate the WGVs. SNPs rs2288831 (IL12B), rs2071376 (IL1A), rs2069718 (IFNG), rs846906 (HSD11B1), rs1923884 (HTR2A), rs1799836 (MAOA), and rs1396862 (CRHR1) were among the top 10 SNPs with the highest genetic effect on the phenotype as measured by WGV (Table 2). For this particular CFS data using all 167 SNPs, this full model showed 100 % goodness of fit, probably due to over-fitting in a small data set. This necessitates extensive cross-validation of the model with unseen test data.

Table 2 Top 10 genetic markers associated with CFS based on weighted genetic variation (WGV) estimated by the Bayesian model

Impact of CFS prediction using variable number of SNPs

We examined if the number of SNP predictors can be reduced without loss of predictive ability by implementing a heuristic model reduction. Using percentiles from the estimated WGV, an increasing number of SNPs were included in the model with appropriate parameters, while the effect of the rest was adjusted as described above. As expected, with the increased number of SNPs, there was a corresponding increase in the model prediction accuracy (see Fig. 1 and Table 3). Although prediction accuracy reached near perfection (accuracy 97 %) with nearly all SNPs (159 out of 167), accuracy remained still close to perfection (accuracy 95 %) using a combination of close to 100 SNPs (35th percentile of WGV). Accuracy remained high (90 %) even with the top 70 SNPs in WGV. Accuracy decreased with fewer SNPs in the model, although we also obtained 79 % accuracy with combinations of 26 SNPs ranked top in WGV.

Fig. 1
figure 1

Impact of varying the number of SNPs on prediction performance, as measured by sensitivity, specificity, and accuracy

Table 3 Increase in accuracy with increasing number of SNPs in the predictive model with individual-level allelic information

The above utilized the cutoff probability of 0.5 to declare a case as CFS or otherwise. However, this threshold could also be varied. Figure 2 presents the predictive performance as measured by accuracy of prediction while varying the threshold probability and number of (unadjusted) SNPs in the model. As stated earlier, SNPs were considered in the increasing order of their WGV for inclusion/exclusion in the model. Thus, least effective SNPs are removed from the model at first. This illustrates that model accuracy is not highly susceptible to the choice of cutoff and is capable of staying at high level in an interval around the chosen 0.5 used for this analysis.

Fig. 2
figure 2

Impact of varying the threshold and the number of SNPs on prediction performance as assessed by the in-data prediction accuracy

Tenfold cross-validation of CFS prediction

The cross-validation results on prediction are summarized using the standard sensitivity, specificity, false detection rate (FDR), and overall accuracy (Table 4). Computation time per iteration is approximately 50 min. As can be seen, the overall accuracy is still high (79 %) for the tenfold cross-validation. The overall sensitivity and specificity of the tenfold cross-validation were 74.4 and 82.8 %, respectively. Accuracy for individual sets varied between 70 and 100 %. At an individual level, the prediction probabilities were compared for the K-fold model and in-data model, and the results are presented as Fig. 3. We observe that those predicted with high probability/accuracy when the data is seen can still be recovered when the respective data is withheld. However, as this accuracy comes down, probability of correct prediction also comes down when the corresponding data is unseen. It should also be noted that the grouping of subjects based on overall cross-validation probability (>0.5) still predicts 14-fold higher risk for a subject to be classified as CFS (OR = 13.963; 95 % confidence interval 5.313–36.69; p value = 6.64 × 10−9).

Table 4 CFS prediction from in-data and tenfold cross-validations
Fig. 3
figure 3

Probability of CFS at individual level as estimated by the K-fold cross-validation model and in-data prediction model

For comparison, the same cross-validation analysis was done for LASSO, ridge regression, and Bayesian model without dependence (Table 5). The Bayesian approach (with and without dependence) clearly outperformed LASSO and ridge regression in accuracy and in sensitivity, but specificity was slightly better in penalized regression approaches. Including between SNP dependence to the Bayesian model improved sensitivity from 65.1 to 74.4. Otherwise, accuracy and specificity remained almost at the same levels for the two Bayes models.

Table 5 CFS prediction from tenfold cross-validation for competing methods

There were a total of 21 false predictions out of 101 predictions in this study. We investigated the source of these false predictions in terms of the sudden vs. gradual onset of chronic fatigue as reported by the subjects in the study. It appears that tenfold cross-validation is accurate for all individuals (six out of six) with sudden onset of CFS and one CFS subject with no onset information. All the erroneous predictions for CFS cases occurred only for those CFS cases with gradual onset (11 out of 36 CFS subjects with gradual onset). Among NF controls, 15 subjects reported previously being fatigued (14 gradual and 1 sudden onset) but did not meet the criteria for CFS, and three of these NF controls were predicted incorrectly. The remaining seven false predictions belonged to the 43 NF subjects who reported that they were never chronically fatigued before. These erroneous predictions potentially reflect heterogeneity in both cases and control subjects in the study.

Discussion

This is a proof-of-principle study that presents a powerful genetic approach that can simultaneously rank SNPs based on their genetic effect and for prediction of complex phenotype based on a Bayesian logistic mixture modeling principle combined with biologically meaningful pathway-focused genetic markers and rigorous case ascertainment. We used CFS SNP data as an example to apply our analytical approach and to compare our model performance with LASSO and ridge regression as well as with results in the literature using other approaches. Our Bayesian prediction model is suitable for any data set that has genotype calls and subjects ascertained into binary phenotype, regardless of whether the illness is specialized or not.

Our modeling approach yielded 80 % accuracy after tenfold cross-validation with CFS data. This prediction accuracy is one of the best so far for this complex disease compared to other prediction models using the same data set. In comparison to other analytical methods, several modeling principles we used may have contributed to this high prediction accuracy. For example, we incorporated covariance due to LD between SNPs so that SNP effects of the predictive model depended on each other and dependence vanishes according to exponential decay with the genetic or physical distance [40]. Originally, closely after the single locus model of Conti and Witte [40], we presented our multi-locus model for genetic association mapping in small chromosomal segments [23]. After that, Malo et al. [41] and Tsai et al. [42] have expressed their own approaches to model covariance between SNP effects in genetic association analyses. Very recently, Fridley and Jenkins [43] also introduced an approach, which is closely related to our earlier model [23] for genetic association studies, see also [35, 44, 45] for the other alternatives.

A second methodological enhancement in our model is that it incorporates both sparse selection of trait-associated SNPs and smoothing of estimated SNP effects, and that this enhancement potentially provided competitive and accurate phenotype predictions in comparison to previous methods for CFS prediction. There are two previous reports [17, 46] on testing combinations of SNPs for prediction of CFS using the same subjects but a smaller set of (42 SNPs out of 167) SNPs used in this study. Because of the smaller set of SNPs, Goertzel et al. [46] used an enumerative approach with cross-validation by permutation and reported 76.3 % accuracy in predicting CFS. Huang et al. [17] compared three sets of classification methods (naïve Bayes model, support vector machine, and C4.5 decision tree algorithm) with 42 SNPs and found the naïve Bayes model with the wrapper-based feature selection to give the best overall sensitivity (64 %) and specificity (52 %). This comparison of our model having dependence structure with Huang et al. that does not have dependence structure [17] shows that dependence model like ours provides clear predictive advantage. Our study implies that models with dependence structure should be utilized more in methods addressing the prediction problem. While there are methods that take into account of dependence between SNPs for association mapping [46, 47], to the best of our knowledge, our Bayesian model is the only one that probabilistically models dependence between SNPs for prediction of binary phenotype. Moreover, since there are explicit differences in the models for association mapping and phenotypic prediction, methods developed for association mapping [46, 47] are not necessarily suitable or preferable for phenotypic prediction and vice versa.

Our heuristic model reduction method provided insight into the strength of Bayesian multi-locus association involving 100’s of SNPs in comparison to current regression models using a few genetic variants. In this analysis, prediction accuracy reached near perfection (accuracy 97 %) with nearly all SNPs (159 out of 167) and accuracy remained still close to perfection (accuracy 95 %) using a combination of close to 100 SNPs (35th percentile of WGV). Accuracy remained high (90 %) even with the top 70 SNPs in WGV but it decreased with fewer SNPs in the model, although we also obtained 79 % accuracy with 27 SNPs. These results clearly show that the higher the genetic information in the SNP profile, the greater will be the accuracy in prediction of complex diseases. Since the current computational approaches limit multi-locus analysis to only a few genetic variants, each with weak associations, current attempts to predict the individual genetic risk of complex disease traits have not been highly accurate, and their value/relevance was also questioned [48]. However, our results suggest that one can actually achieve reasonable predictive accuracy by using Bayesian predictions based on multi-locus association models involving 100’s of SNPs in a small set of genes. Prediction can be further improved with greater accuracy, sensitivity, and specificity, if the model includes genetic variation in hundreds of genes representing expanded set of multiple pathways implicated in immune, inflammation [49], and CNS functions [50]. With additional improvements in genetic variation data, modeling, and sample size, one can also conceive the potential of our approach to generate genetic signatures to delineate heterogeneity/subtypes in complex disease.

Other statistical considerations that favored high accuracy include model selection and estimation (including phenotype predictions) being done simultaneously. By this, we obtained estimates for posterior probability (uncertainty) of disease status and the degree of belief estimate for predicted disease phenotypes. These predictions, which are based on the Bayesian model averaging approach providing robust predictions [51], are however not based on a single best model but rather several different models where prediction of each model is weighted according to the corresponding posterior model probability. Model reduction aspects were investigated in a computationally feasible manner for an otherwise analytically intractable problem. We also showed accurate predictions from partial genotype data. That is, while the learning set contains information on complete set of SNPs, accuracy does not immediately drop if phenotype predictions are made for individuals which have genotypes measured from only a partial set of SNPs. We used indicator variables in the predictive models because they will make it possible to shrink coefficients at some unimportant positions exactly to zero. The same numerical property is difficult to obtain by tuning the value of the shrinkage parameter (which controls amount of shrinkage) in regularization models like the Bayesian LASSO [13], which do not contain indicators in the model. In those models, which have a single shrinkage parameter only, there is always trade-off between obtaining enough shrinkage for unimportant contributors and still maintaining good estimation ability for important positions (without having too much shrinkage).

It was reported recently that in order to achieve accurate prediction of human disease for unrelated individuals, it would take approximately a sample size of 350,000 [52]. This is because accuracy is dependent on the average relatedness between individuals in the population (i.e., effective population size structuring the LD) and genetic architecture of the trait. It was further noted that it can be accurate on small samples only for diseases that are determined by a limited number of genes which is unlikely to be true for a complex disease. It is thus possible that population ascertainment process that involved rigorous clinical evaluation of both cases and control subjects in this study might be an important factor, along with the strength of modeling and SNP selection, for prediction accuracy in this particular case. We are also aware of the rule of thumb that the number of independent predictors should not be greater than the one tenth of the number of samples in the smaller of the two outcome categories in logistic regression modeling. This limitation may be overcome by using the conventional two-stage procedure wherein significant markers are first evaluated from independent analysis, and then, the short-listed markers tested for prediction accuracy. Our analytical approach is to move away from this rule of thumb. Indeed, one of the main focuses of our study is that our Bayesian model estimates good prediction accuracy based on multi-locus/simultaneous association of all SNPs using small data set. This approach in predicting human disease phenotype using the Bayesian multi-locus association model is supported by closely related prediction models in plant breeding research. For example, using 80 markers and 126 soybean lines, Hu et al. demonstrated an increase from 33 to 78 % in explaining the phenotypic variance when markers were collectively accounted for epistatic effects [53]. The findings of Heffner et al. based on a wheat breeding program using a small set of samples [54] and markers also agree with our findings from human studies that jointly estimating all marker effects is able to capture more of the genetic variance than the two-stage conventional approach of first selecting significant markers from independent analysis and then estimating their effects.

Besides, phenotypic prediction using 100’s of SNPs, our Bayesian model provided the genetic effect estimated as WGV for each of the SNPs incorporated in the model and thus allowing to rank individual SNPs for exploring their functional role in the pathophysiology of the disease. For example, in our in-data prediction analysis, the top 110 SNPs with over 90 % accuracy were spread on 32 of 39 genes (25 CNS and 7 immune function candidate genes). It is interesting to note that SNPs from 7 out of 9 immune function genes were represented in this group that showed greater prediction, and SNPs in three of them (IL12B rs2288831, IL1A rs2071376, and IFNG rs2069718) showed the highest association with CFS in terms of WGV values. All three SNPs showed replicated or moderate association with other diseases as well. SNP rs2288831 is in complete LD with rs3212227 located in the 3′-untranslated region (3′UTR) of IL12B (Table 2), and this proxy SNP was reported to be associated with psoriasis in a large-scale association study, confirming the results of a previous study [55]. rs2071376 located in intron 6 of IL1A showed significant association in patients with keratoconus [56]. SNP rs2069718, located in the intron of IFNG, was associated with susceptibility to systemic lupus erythematosus in a recessive genetic model [57]. Among the CNS-related genes, the highest measures of genetic effects on CFS prediction were provided by SNPs in HSD11B1 (rs846906), HTR2A (rs1923884), MAOB (rs1799836), CRHR1 (rs1396862), SLC18A2 (rs363236), NOS3 (rs891512), and DRD2 (rs1124492). Except rs1396862 and rs363236, these SNPs in CNS-related genes were located in introns with no potential function through transcription factor binding or splicing regulation or no proxy SNPs with functional significance. SNP rs891512, although located in intron of NOS3, showed protective effect against suicidal behavior [58]. While rs363236 is located in the 3′UTR of SLC18A2 and is in complete LD with a marker (rs3814230) resulting in synonymous codon change in PDZD8, no functional role or association with disease could be identified with this marker. SNP rs1396862, on the other hand, is in complete LD with rs12185233 that results in a missense codon (R460P) change in intramembrane protease 5 (IMP5), a gene associated with Parkinson’s disease [59]. Another SNP in CRHR1 (rs173365) with WGV of 0.58 also showed high LD with another missense codon change (rs242944, H302R) in IMP5, suggesting further support of association of this gene region with CFS. Genes with >5 SNPs that contributed to >90 % accuracy included ACE, DRD2, HTR2A, HTR2C, HTR4, IL1A, MAOA, NR3C1, and TPH2. Thus, at least 16 unique genes (ACE, CRHR1, DRD2, HSD11B1, HTR2A, HTR2C, HTR4, IL12B, IL1A, IFNG, MAOA, MAOB, NOS3, NR3C1, SLC18A2, and TPH2) appear to be major contributors to CFS prediction by greater genetic effects either through individual or multiple (>5) SNPs. Among these 16 genes, only two (TPH2 and NR3C1) were common with the smaller set of 10 genes in the previous reports [17, 46] illustrating that higher predictive accuracy by multi-locus association is determined by variants with greater genetic effects. Sequence variations in some of these genes were reported to be associated with CFS (NR3C1: Rajeevan et al. [60] and HTR2A: Smith et al. [27]) or associated with some of the CFS subtypes identified by latent class analyses (MAOA, MAOB, TPH2, and NR3C1: Smith et al. [61]) or association with allostatic load, a construct that describes cumulative physiological effects of adaptation in response to stress (ACE: Smith et al. [62]). Besides estimating prediction accuracy of complex phenotype using multi-locus association, these findings on the association of individual markers/genes with CFS support the hypothesized general applicability of Bayesian model-based WGV estimates to identify specific genetic variations that may play biological roles in the pathophysiology of various diseases.

We concentrated more on the modeling principle rather than implementing the tool for routine phenotype prediction in this study. Thus, we used a general purpose software tool, WinBUGS, for MCMC estimation, which is often slower than tailor-made programs. Moreover, for routine phenotype prediction, maximum a posteriori estimation may be more practical than MCMC estimation (cf. [35, 63]). As is well known, population structure and cryptic relatedness are confounding factors in genetic association studies [64]. We have not corrected for these factors in our models since subjects were collected from a homogeneous population (Caucasians >93 % in both CFS and NF subjects) with no close relatives or complex links between individuals. However, even though such confounding factors may exist in the CFS data, the multi-locus association models have been found to be surprisingly robust to these confounders [31, 65]. Non-genetic risk factors can be included into the model as environmental covariates even if we have not done so here, and their inclusion may improve the accuracy even further. Presently, there is limited clinical/diagnostic utility for this genetic prediction model, since its reproducible performance remains to be evaluated in multiple populations. Further, in order to be clinically useful, it may be advantageous to evaluate prediction models by decision analytic techniques to determine whether models would change medical decisions. This Bayesian model provides an approach to improve phenotypic prediction by exploiting all available genetic information in 100’s of loci jointly, an approach that can be extended to the emerging computational field of whole-genome markers-enabled prediction of genetic predisposition in humans.

In conclusion, our results demonstrate, as a proof of principle, the power of using a combination of a Bayesian logistic mixture modeling principle, pathway-focused SNPs, and rigorous subject ascertainment for highly accurate prediction of complex phenotypes. WGV estimates provided by the model can also be useful to identify individual genetic markers/genes with potential biological functions. Future studies are warranted to expand this approach using multiple biological pathways and multiple populations [66] to derive a reproducible genetic profile with greater predictive power than non-genetic risk factors to identify chronic diseases like CFS and its subtypes that have no laboratory-based diagnosis or intermediate markers.