Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases

Zhang, Min; Zhang, Dabao; Wells, Martin T

doi:10.1186/1471-2105-9-251

Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases

Methodology article
Open access
Published: 29 May 2008

Volume 9, article number 251, (2008)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases

Download PDF

Min Zhang¹,
Dabao Zhang¹ &
Martin T Wells²

9393 Accesses
26 Citations
Explore all metrics

Abstract

Background

Identifying quantitative trait loci (QTL) for both additive and epistatic effects raises the statistical issue of selecting variables from a large number of candidates using a small number of observations. Missing trait and/or marker values prevent one from directly applying the classical model selection criteria such as Akaike's information criterion (AIC) and Bayesian information criterion (BIC).

Results

We propose a two-step Bayesian variable selection method which deals with the sparse parameter space and the small sample size issues. The regression coefficient priors are flexible enough to incorporate the characteristic of "large p small n" data. Specifically, sparseness and possible asymmetry of the significant coefficients are dealt with by developing a Gibbs sampling algorithm to stochastically search through low-dimensional subspaces for significant variables. The superior performance of the approach is demonstrated via simulation study. We also applied it to real QTL mapping datasets.

Conclusion

The two-step procedure coupled with Bayesian classification offers flexibility in modeling "large p small n" data, especially for the sparse and asymmetric parameter space. This approach can be extended to other settings characterized by high dimension and low sample size.

Novel Algorithm for Multiple Quantitative Trait Loci Mapping by Using Bayesian Variable Selection Regression

Bayesian high-dimensional covariate selection in non-linear mixed-effects models using the SAEM algorithm

Article 18 December 2023

Assessing a Spatial Boost Model for Quantitative Trait GWAS

Background

With the advent of high-throughput biotechnologies to genotype dense molecular markers throughout the genome, statistical methodologies are crucial in understanding the genetic architecture of complex traits, and in locating genes underlying important traits. Since the pioneering statistical work by Lander and Botstein [1], much effort has been devoted to improving the efficiency and accuracy of QTL mapping. Traditional approaches to QTL mapping test each of dense grid loci on chromosomes via the likelihood ratios of linear regression models (see the reviews by Doerge et al. [2] and Broman and Speed [3]), and Wang et al. [4] also proposed a Bayesian shrinkage estimation of QTL parameters allowing varying shrinkage factors across different effects.

Epistases (that is, interactions between genes) are ubiquitous in biological systems [5] and may even play a more important role than additive effects, as have been shown in human population [6, 7] and other organisms [8–12]. However, even a moderate number of markers implies a large number of pairwise combinations, thus creating statistical issues in QTL mapping. Due to the small sample sizes and the lack of efficient statistical tools, the number of identified genes is limited although the existence of epistasis has been recognized for nearly a hundred years [13]. To detect epistatic effects, Kao and Zeng [14] proposed modeling epistasis via orthogonal contrast scales using Cockerham's model; Yi and Xu [15] developed a Bayesian method to detect epistasis using reversible jump Markov chain Monte Carlo (MCMC) algorithm; Yi et al. [16–18] then proposed a Bayesian model selection approach to detect genome-wide epistasis (with the software described in [19]); Bogdan et al. [20] modified Bayesian information criterion (mBIC) to permit the identification of additive effects as well as pairwise interactions; and Cui and Wu [21] also proposed a statistical framework to detect genetic interactions derived from different genomes in self-pollinated plants. Recently, Żak et al. [22] developed a rank-based model selection and Shi et al. [23] developed a LASSO-type penalized likelihood method to locate interacting QTL while Bogdan et. al [24] extended mBIC for strongly correlated markers and multiple interval mapping.

Consider Y_ias the trait value of strain i = 1, ⋯, n, and let X_ijbe the genotypic value of marker j = 1, ⋯, p_βwithin the i-th strain. Here we focus on the populations with binary markers X_ij(coded as -0.5 and 0.5), such as doubled-haploid, backcross or recombinant inbred lines. With available markers (either observed or imputed) densely located on chromosomes, we assume the putative QTL co-transmit with some of the markers. Let $I$ {X} denote the set including all pairwise epistases of interest, and define Z_ij= X_ikX_ilfor the j-th candidate epistasis (k, l) ∈ $I$ {X}, j = 1, ⋯, p_γ. We investigate the additive effects of putative QTL and the epistatic interactions between them through the following multiple regression model,

\begin{matrix} Y_{i} = μ + \sum_{j = 1}^{p_{β}} β_{j} X_{i j} + \sum_{j = 1}^{p_{γ}} γ_{j} Z_{i j} + ε_{i}, & ε_{i} ~_{i i d}^{} N (0, σ_{ε}^{2}), \end{matrix}

(1)

where μ is the overall mean, β_jis the additive effect of marker j, γ_jrepresents the j-th epistatic effect, and ε_iis the random error.

QTL mapping with this multiple regression model can be viewed as a model selection procedure [3, 25–27]. However, several characteristics of the data complicate the application of classical statistical methodologies. First, a large amount of missing molecular markers, due to failure in genotyping or selective genotyping, is common in practice. When markers are sparse, the missing genotype information between markers must be inferred. Second, the molecular markers in the same linkage group may be highly correlated. Third, the total number of molecular markers and putative epistases, i.e., p = p_β+ p_γ, is usually much larger than the sample size n. Because of these issues, the efficiency and accuracy are usually compromised for easy development of statistical approaches. Characteristics of the "large p small n" data with missing values require further attention via extensions of traditional model selection approaches. We extend the Bayesian classification approach in Zhang et al. [28] to map QTL with epistases. Spike and slab priors have been used by, for example, Mitchell and Beauchamp [29], George and McCulloch [30], and Ishwaran and Rao [31] to develop Bayesian variable selection approaches. The spike and slab priors consist of two components, with one modeling zero coefficients and the other modeling non-zero ones.

Furthermore, the mixing weight plays a crucial role in condensing the searchable parameter space and enforcing a stochastic search within low-dimensional spaces. When only a limited number of covariates are being investigated, a uniform distribution on [0, 1] or even a fixed value (e.g., 0.5) is usually chosen for the mixing weight. However, when n ≪ p, it is unrealistic to expect half of the variables to be selected because the final model may still be unidentifiable. Instead, we expect that, for a successful variable selection, the prior distributions of the mixing weights depend on both n and p.

We investigate the predictability of a model developed for a dataset of sample size n, and tackle the aforementioned issues. We then construct a two-step Bayesian variable selection approach for model (1) in the case that n ≪ (p_β+ p_γ). In the first step, we employ a restrictive prior for each of the coefficients in model (1) in order to enforce stochastic filtering of the large number of candidate variables. This prior also allows flexibility for the possible different numbers and/or scales of positive and negative coefficients (see [32] for more details on its advantage over symmetric priors). A Gibbs sampling algorithm is developed to compute the posterior distributions and to implement the stochastic search. Only a limited number of variables are filtered to go through the second step, which repeats the first step but with much fewer candidate variables. The second step is necessary to model (1) when n ≪ (p_β+ p_γ), as the priors in the first step could potentially be too restrictive. The performance of our approach is evaluated via a simulation study and application to real datasets.

Results and Discussion

Simulation

Simulation studies were performed to evaluate the performance of our method in the case of p ≫ n. We simulated 56 markers across 3 chromosomes, with each having 10, 20, and 26 markers, and being 56.7 cM, 133.5 cM and 171.6 cM long respectively. We specify $σ_{ε}^{2}$ = 0.5415, and the locations of 28 markers are chosen based on the Drosophila data [28], which include 221 inbred introgression lines between two closely related species. The other 28 markers are chosen such that the neighboring markers are at least 5 cM away. Table 1 shows the detailed information of the non-zero effects specified in the simulation, including two additive effects and three epistatic effects. To assess whether our method is able to identify different types of epistatic effects, we include all three possible interactions in the simulation: (1) neither of the two markers has additive effects (that is, 2–133.8 and 3–56.7); (2) one of them has additive effects (that is, 1–24.7 and 2–47.8); (3) both have additive effects (that is, 2–47.8 and 3–141.5). All epistatic effects were set at the same size to avoid its effects on detectability. Due to the intensive computation involved in Gibbs sampling, a total of 100 complete data sets were simulated. Each of the 100 data sets was analyzed using two models, one model with both additive and epistatic effects while the other with additive effects only. When mapping QTL with epistases, we have a total number of 1596 variables (56 additive-effect loci and 1540 epistases) versus 221 observations in the model.

Table 1 Design of the simulation studies.

Full size table

For the model without epistases, both markers can be detected in most of the 100 simulated datasets even when the false discovery rate (FDR) is controlled as low as 0 (via setting the Bayes factor higher than 3.2), see Table 2. When modeling the epistases, all (additive and interaction) effects are still detected in more than 90% of the data sets for all levels of Bayes factor (BF) though the FDRs are higher. For those data sets with any effect not identified, the immediate neighbors of the corresponding marker locus are mostly detected instead. As expected, it is more difficult to detect epistases than to detect additive effects. The epistasis of markers both having additive effects is the easiest to be detected among all epistases. The true parameter values are included in their 95% credible intervals with the associated posterior probabilities being very close to one (results not shown).

Table 2 Simulation results on the basis of model (1).

Full size table

Application

We apply the developed method to the simulans backcross II (BS2) data and the mauritiana backcross II (BM2) data [33, 34]. An F₁ population was first produced by females from an inbred line of D. simulans and males from an inbred line of D. mauritiana. Then the F₁ females were backcrossed to the parental line of D. simulans, which was fixed for different alleles at 45 marker loci, to produce a simulans backcross (BS) population. A mauritiana backcross (BM) population was also produced by backcrossing the F₁ females to the other parental line. Based on the two different times of crossing, a total of four data sets were obtained, namely, BS1 (n = 186), BS2 (n = 288), BM1 (n = 192), and BM2 (n = 299). The phenotypic value of an individual is a morphometric descriptor of the posterior lobe, obtained by averaging both sides of the first principal component (PC1) of the Fourier coefficients of the posterior lobe. The genotypes of males were determined at each marker locus, and genetic map positions were estimated from gametes produced by the F₁ females in this study. Further information about the data is referred to Liu et al. [33] and Zeng et al. [34].

Employing multiple interval mapping (MIM) [25, 35] to the BS2 data, Zeng et al. [34] detected a total of 16 additive effects and no epistatic effect. Pooling all four data sets, Zeng et al. [34] detected three extra additive effects and six epistatic effects. These epistatic effects appeared to be relatively unimportant for PC1 in the interspecific backcross populations, which carried an observation difficult to interpret biologically. Of the 19 additive effects, 18 additive effect estimates have the same sign [34]. Zeng et al. [34] explained this interesting phenomena as an unusually strong directional selection, although Tanksley [36] suggested that transgressive segregation usually followed a mixture of plus and minus alleles in each species as demonstrated by most previous analyses of quantitative traits.

We focused our analysis on the BS2 and BM2 data with the standardized phenotypic values. Of the 19 putative QTL reported by Zeng et al. [34], only nine are at least 1 cM away from the 45 marker loci. Therefore, we analyzed both datasets with these 54 additive effects (nine putative QTL and 45 markers) and all possible pairwise interactions (that is, 1431 putative epistases). When controlling BF ≥ 1, the analysis of the BS2 data reported a total of 25 additive effects (see Table 3), including all nine putative QTL, but no epistatic effect. The analysis of the BM2 data instead reported a total of 20 additive effects (see Table 4), including three of the nine putative QTL, and 18 epistatic effects (see Table 5). On the basis of the simulation study, we may expect less than 0.67% FDR for those 17 and 16 additive effects reported with BF ≥ 100 in analyzing the BS2 and BM2 data respectively. Similarly, three epistatic effects reported in analyzing the BM2 data have BF ≥ 100, less than 12% of which may be false discoveries.

Table 3 Additive effects with BF ≥ 1 in analyzing the BS2 data.

Full size table

Table 4 Additive effects with BF ≥ 1 in analyzing the BM2 data.

Full size table

Table 5 Epistatic effects with BF ≥ 1 in analyzing the BM2 data.

Full size table

Interestingly, the 25 additive effects detected from the BS2 data include all those detected by Zeng et al. [34] except the 2–135, 3–5 and 3–83 (we consider the markers within 1 cM to be same), but the 20 additive effects detected from the BM2 data only include nine of those detected by Zeng et al. [34]. On the other hand, nine additive effects (i.e., 2–28.53, 2–145.85, 3-0, 3–43.2, 3–49.99, 3–101.29, 3–126.62, 3–134.6, 3–147.69) from the BS2 data are not reported by Zeng et al. [34], and eleven additive effects from the BM2 data (i.e., 1-0, 2–6.98, 2–67.96, 2–145.85, 3–14.33, 3–28.74, 3–43.2, 3–49.99, 3–126.62, 3–147.69, 3–161.43) are not reported by Zeng et al. [34]. Note that almost each additive effect uniquely detected by Zeng et al. [34] has a neighboring one (within 10 cM) in our lists except 2–135 and 3–94 for the BM2 dataset, and almost each additive effect unique in our lists has a neighboring one (within 10 cM) detected by Zeng et al. [34]. Per the discussion on the precision of QTL location by Bogdan and Doerge [37] and Bogdan et al. [24], these effects of close neighbors may be due to identical QTL. Our analysis reported R² = 0.934 and R² = 0.902 for the BS2 and BM2 data respectively.

Conclusion

This article extends the Bayesian framework in Zhang et al. [28] to identify both additive and epistatic effects of QTL based on model (1). The advantage of this approach mainly lies in the flexible priors for the regression coefficients by accounting for some characteristics of "large p small n" data, the predictability of a model constructed with size n data, and the two step strategy for dimension reduction. A Gibbs sampler is developed to draw Markov chain samples from the posterior distributions, which can be considered as a stochastic search for an optimal model. Unlike information criteria based model selections which require calculation of the effective sample size for incomplete data, missing values can be naturally imputed within the Gibbs sampling scheme. The corresponding algorithm has been implemented in Matlab and is available as QTLBayes http://www.stat.purdue.edu/~zhangdb/QTLBayes/.

Bayesian variable selections can be viewed as penalized likelihood approaches, which have been studied recently [38, 39]. With "large p small n" data, it is not clear how to set up the penalty properly such that it will neither overpenalize nor underpenalize the likelihood. An overpenalized likelihood will lose some significant variables of particular interest, while an underpenalized likelihood will introduce false positives. The predictability of size n data sheds light on the choice of this penalty. Since a size n data set will allow us to understand the variation of the trait explained by only p_n= O( $\sqrt{n}$ ) QTL with accuracy O(n^-1/2), selecting too many variables into the model will ruin this practice of QTL mapping. As shown by Bogdan and Doerge [37], severely biased estimates can be resulted from large genome and/or marker number in QTL mapping. We propose a Bayesian framework to resolve the bias problem. We have illustrated our approach by application to the BS2 and BM2 data [33, 34], both of which have 45 markers observed across three chromosomes. The disadvantage of this approach is the heavy computation involved as the computation-intensive Markov chain Monte Carlo algorithm is utilized. For example, the analysis of a dataset with more than 200 markers from 1,000 subjects take almost 24 hours using one Intel^® Xeon™ CPU at 2.80 GHz.

Coding binary markers with -0.5 and 0.5 has been commonly utilized in QTL mapping as it does not introduce correlation between additive effects and interactive effects, and such uncorrelation benefits the identification of additive effects. On the other hand, coding binary markers with 0 and 1 introduces correlation and thus is not preferred for QTL mapping with epistases [40, 41]. Although developed for QTL mapping, this approach is completely general and can be applied to other settings with "large p small n" data, such as associating genomic features to clinical outcomes or phenotypes of biological interest. Unlike QTL mapping data with known missing structure from the linkage information, genomic data with imaging and microarray may require more information to impute missing values because of the unknown missing mechanism. Even though the missing values are usually imputed with a nearest-neighbor approach [42], Gibbs samplers allow natural multiple imputation under the assumption of missing at random (MAR, see Little and Rubin, [43]).

Methods

Predictability and Sample Size

Suppose, for a sample of size n, we select up to p_n(assuming p_n<n) significant variables into the following regression model,

\begin{matrix} Y_{n} = X_{n} β + ε_{n}, & ε_{n} ~ N (0, σ_{ε}^{2} I_{n}), \end{matrix}

where Y_nis an n-dimensional column vector; X_nis an n × p_ndesign matrix such that $X_{n}^{T} X_{n} = n \times I_{p_{n}}$ . The best linear unbiased estimator (BLUE) of β is

{\hat{β}}_{n} = β + \frac{1}{n} X_{n}^{T} ε_{n} .

Let $\tilde{x} = ({\tilde{x}}_{1}, {\tilde{x}}_{2}, \dots, {\tilde{x}}_{p_{n}})$ include p_npredictors for $\tilde{y}$ such that $\max_{1 \leq j \leq p_{n}} | {\tilde{x}}_{j} | = O (1)$ . Since trace ${V a r ({\hat{β}}_{n})} = \frac{p_{n}}{n} σ_{ε}^{2}, \tilde{x} β$ can be consistently estimated by $\tilde{x} {\hat{β}}_{n}$ . When using $\tilde{x} {\hat{β}}_{n}$ to predict $\tilde{y}$ , the mean squared prediction error is

E [{(\tilde{y} - \tilde{x} {\hat{β}}_{n})}^{2}] = σ_{ε}^{2} + \frac{p_{n}}{n} \frac{\tilde{x} {\tilde{x}}^{T}}{p_{n}} σ_{ε}^{2} .

If p_n= o(n), the mean squared prediction error asymptotically achieves the minimum variance, and thus the prediction is asymptotically efficient.

This illustration implies that, with a sample of size n and p_n= O( $\sqrt{n}$ ) predictors, the mean squared prediction error can reach the minimum prediction error at rate O(n^-1/2). Suppose that all p_nsignificant variables could be perfectly selected out of p candidates, we still need p_n= o(n) in order to have a chance to correctly understand the variation of the dependent variable explained by the selected predictors. Therefore, we always assume that there are at most p_n= O( $\sqrt{n}$ ) significant variables among a total of p candidates in the case of p ≫ n. Indeed, the study of consistency in a triangular array setting for regression problems was conducted by Huber [44–46]. In examining the underlying theory of 'model-selection' and 'variable-selection' procedures that choose p_nexplanatory variables from an initial set of variables, Greenshtein and Ritov [46] proved that one may expect consistency for the choice of p_nwith an order between o( $\sqrt{n / \log (n)}$ ) and o(n/ log(n)). Our choice of p_n= O( $\sqrt{n}$ ) satisfies the Greenshtein and Ritov [46] conditions for consistency.

Bayesian Variable Selection

Here we propose a two-step Bayesian variable selection approach to map QTL with epistases through model (1). With the following Bayesian framework, we first select c $\sqrt{n}$ out of p_βadditive effects and c $\sqrt{n}$ out of p_γepistatic effects (e.g., we use c = 2), respectively, using a restrictive prior for each coefficient. We then apply the same Bayesian framework to stochastically select the filtered variables, using a non-restrictive prior for each coefficient. Gibbs sampling algorithms are developed to stochastically search low-dimensional subspaces, as implied by the predictability of a size n data set.

Prior Specification

For a two-state marker system, both additive effects β_j, j = 1, ⋯, p_β, and epistatic effects γ_j, j = 1, ⋯, p_γ, are the primary focus of QTL mapping. As is often the case p = (p_β+p_γ) ≫ n, many of these coefficients are zero, either because the variation of the trait can be explained by only a few QTL or because the limited sample size precludes selecting too many variables (otherwise the constructed model is not reliable as shown in the previous section). It is also possible that the number and/or scale of the positive coefficients may be different from those of the negative ones. To account for these properties, a three-component mixture prior is specified for each coefficient β_jor γ_j. More specifically,

{\begin{array}{l} β_{j} ~_{i i d}^{} (1 - w_{β +} - w_{β -}) δ (\cdot) + w_{β +} N_{+} (μ_{β +}, σ_{β +}^{2}) + w_{β -} N_{-} (μ_{β -}, σ_{β -}^{2}), \\ γ_{j} ~_{i i d}^{} (1 - w_{γ +} - w_{γ -}) δ (\cdot) + w_{γ +} N_{+} (μ_{γ +}, σ_{γ +}^{2}) + w_{γ -} N_{-} (μ_{γ -}, σ_{γ -}^{2}), \end{array}

(2)

where δ (·) is a Dirac function with mass one at zero; N₊(μ, σ²) and N_-(μ, σ²) positively and negatively truncate the normal distribution, i.e., N(μ, σ²), respectively. Therefore, w_β+(or w_β-) is the probability for any single marker, and w_γ+(or w_γ-) is the probability for any pair of markers in $I$ {X}, to have positive (or negative) interactive effect on the trait.

The hyperparameters, $σ_{β +}^{2}, σ_{β -}^{2}, σ_{γ +}^{2}$ and $σ_{γ -}^{2}$ , are assumed to have priors as inverse gamma distributions, that is, IG(θ_β+, φ_β+), IG(θ_β-, φ_β-), IG(θ_γ+, φ_γ+), and IG(θ_γ-, φ_γ-), respectively (e.g., setting θ_β+= θ_β-= θ_γ+= θ_γ-= 0.1 and φ_β+= φ_β-= φ_γ+= φ_γ-= 10). As a result, the prior on β (and γ) is essentially a mixture of a point mass at zero and some truncated t-distributions, which shrinks the smaller effects towards zero and allows sufficient flexibility for non-zero effects. Furthermore, t-type prior distributions yield Bayes rules with desirable decision-theoretic frequentist properties [47]. The hyperparameters, μ_β+, μ_γ+, μ_β-and μ_γ-, are assumed to have diffuse priors, and the prior distribution for $σ_{ε}^{2}$ is proportional to 1/ $σ_{ε}^{2}$ .

As suggested by the predictability of a size n data set, we expect to select at most p_n= O( $\sqrt{n}$ ) out of the p variables for the final model. Therefore, we specify the priors for (w_β+, w_β-) and (w_γ+,w_γ-) as

\begin{matrix} w_{β +} + w_{β -} ~ U (0, c \sqrt{n} / p_{β}), & w_{γ +} + w_{γ -} ~ U (0, c \sqrt{n} / p_{γ}) \end{matrix},

(3)

that is, expecting at most c $\sqrt{n}$ significant additive effects and epistatic effects, respectively. Gaffney [48] and Yi et al. [17], among others, employed similar ideas to rescale the priors based on the number of possible effects. Apparently, when n ≪ (p_β+ p_γ), either c $\sqrt{n}$ /p_βor c $\sqrt{n}$ /p_γis very small, which implies a restrictive prior on each corresponding coefficient. Therefore, we usually select c $\sqrt{n}$ additive effects and c $\sqrt{n}$ epistatic effects during the first run of Bayesian analysis. We then apply the same Bayesian analysis to these pre-selected variables. The second run of Bayesian analysis has both w_β++ w_β-and w_γ++ w_γ-, a priori, uniformly distributed on [0, 1].

Likelihood

Let Y_nbe the column vector including the trait values of all strains under investigation, let X_ibe the vector of all marker values of the i-th strain and $X_{n} = {(X_{1}^{T}, \dots, X_{n}^{T})}^{T}$ , and let Z_ibe the vector of all epistatic candidate values of the i-th strain. Denote the marginal distribution of A as [A], and the conditional distribution of A given B as [A|B]. With data (Y_n, X_n) and the prior specification in Section 3.1, we have the likelihood function, that is, the joint distribution function of the data (Y_n, X_n), the parameters (μ, β, γ), $σ_{ε}^{2}$ , and all hyperparameters

\begin{array}{l} (w_{β +}, w_{β -}, w_{γ +}, w_{γ -}, μ_{β +}, μ_{γ +}, σ_{β +}^{2}, σ_{γ +}^{2}, μ_{γ +}, μ_{γ -}, σ_{β -}^{2}, σ_{γ -}^{2}), \\ \begin{array}{l} L & \propto & [Y_{n} | X_{n}, μ, β, γ, σ_{ε}^{2}] \times [μ] \times [μ_{β +}] \times [σ_{β +}^{2}] \times [μ_{β -}] \times [σ_{β -}^{2}] \times [w_{β +}, w_{β -}] \\ \times [β | w_{β +}, w_{β -}, μ_{β +}, σ_{β +}^{2}, μ_{β -}, σ_{β -}^{2}] \times [μ_{γ +}] \times [σ_{γ +}^{2}] \times [μ_{γ -}] \times [σ_{γ -}^{2}] \\ \times [w_{γ +}, w_{γ -}] \times [γ | w_{γ +}, w_{γ -}, μ_{γ +}, σ_{γ +}^{2}, μ_{γ -}, σ_{γ -}^{2}] \times [σ_{ε}^{2}] \times [X_{n}] \\ \propto & σ_{ε}^{- n - 2} \exp {- \sum_{i = 1}^{n} \frac{{(Y_{i} - μ - X_{i} β - Z_{i} γ)}^{2}}{2 σ_{ε}^{2}}} \times \exp (- \frac{σ_{β +}^{2}}{φ_{β +}} - \frac{σ_{β -}^{2}}{φ_{β -}}) \\ \times {(σ_{β +}^{2})}^{- θ_{β +} - 1} \times {(σ_{β -}^{2})}^{- θ_{β -} - 1} \times \exp (- \frac{σ_{γ +}^{2}}{φ_{γ +}} - \frac{σ_{γ -}^{2}}{φ_{γ -}}) \times {(σ_{γ +}^{2})}^{- θ_{γ +} - 1} \\ \times {(σ_{γ -}^{2})}^{- θ_{γ -} - 1} \times [β | w_{β +}, w_{β -}, μ_{β +}, σ_{β +}^{2}, μ_{β -}, σ_{β -}^{2}] \times I [w_{β +} + w_{β -} \leq c \frac{\sqrt{n}}{p_{β}}] \\ \times [γ | w_{γ +}, w_{γ -}, μ_{γ +}, σ_{γ +}^{2}, μ_{γ -}, σ_{γ -}^{2}] \times I [w_{γ +} + w_{γ -} \leq c \frac{\sqrt{n}}{p_{γ}}] \times [X_{n}] . \end{array} \end{array}

(4)

The distribution of X_ncan be specified based on the available linkage map information [2]. The conditional distribution of $[β | w_{β +}, w_{β -}, μ_{β +}, σ_{β +}^{2}, μ_{β -}, σ_{β -}^{2}]$ is a product of the prior distribution for each β_j. Similarly, the conditional distribution of $[γ | w_{γ +}, w_{γ -}, μ_{γ +}, σ_{γ +}^{2}, μ_{γ -}, σ_{γ -}^{2}]$ is a product of the prior distribution for each γ_j. The priors of the hyperparameters, θ_β+, θ_γ+, φ_β+, φ_γ+, θ_β-, θ_γ-, φ_β-and φ_γ-, are specified to be as noninformative as possible.

Gibbs Sampling

Since the specified priors are conditionally conjugate, Bayesian variable selection can be implemented with a Gibbs sampling algorithm. We initialize the algorithm by imputing missing genotypic values based on the observed genotypes and linkage information. The initial value of μ is set as the mean of the observed trait values. Then, with individuals having fully observed trait values, each component of β and γ is initially estimated using recursive univariate regression. Other parameters, $w_{β +}, w_{β -}, μ_{β +}, σ_{β +}^{2}, μ_{β -}$ and $σ_{β -}^{2}$ , are simply initialized based on the initial value of β, and similarly, the initial values for $w_{γ +}, w_{γ -}, μ_{γ +}, σ_{γ +}^{2}, μ_{γ -}$ , and $σ_{γ -}^{2}$ can be specified using the information from γ. For example, we can initialize $σ_{β +}^{2} = σ_{β -}^{2}$ with an estimate from the initial value β, and then use max{#{j : β_j> 2σ_β+}, 1}/p_βto initialize w_β+.

Let X_i,-jbe X_iexcluding the j-th component, and define β_-jand γ_-jsimilarly. Based on the likelihood function in (4), the Gibbs sampler can be developed by recursively drawing the missing genotypic values, the missing trait values, and the model parameters from their full conditional posterior distributions as follows.

Sample missing values: Sample each missing genotypic value X_ijfrom its full conditional posterior distribution,

[X_{i j} | Y_{i}, X_{i, - j}, μ, β, γ, σ_{ε}^{2}] \propto [Y_{i} | X_{i, - j}, X_{i j}, μ, β, γ, σ_{ε}^{2}] \times [X_{i j} | X_{i, j - 1}, X_{i, j + 1}],

and then sample each missing trait value Y_ifrom its full conditional posterior distribution [Y_i|X_i, μ, β, γ, $σ_{ε}^{2}$ ].

Sample μ: Sample μ from its full conditional posterior distribution,

μ | Y_{n}, X_{n}, β, γ, σ_{ε}^{2} ~ N (\frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - X_{i} β - Z_{i} γ), \frac{σ_{ε}^{2}}{n}) .

Sample β and γ: Sample each β_jand γ_jfrom their full conditional posterior distributions,

\begin{matrix} [β_{j} | Y_{n}, X_{n}, μ, β_{- j}, γ, w_{β +}, w_{β -}, σ_{ε}^{2}, σ_{β +}^{2}, σ_{β -}^{2}] \\ ~ (1 - {\tilde{w}}_{β j +} - {\tilde{w}}_{β j -}) δ (β_{j}) + {\tilde{w}}_{β j +} N_{+} ({\tilde{μ}}_{β j +}, {\tilde{σ}}_{β j +}^{2}) + {\tilde{w}}_{β j -} N_{-} ({\tilde{μ}}_{β j -}, {\tilde{σ}}_{β j -}^{2}), \\ [γ_{j} | Y_{n}, X_{n}, μ, β, γ_{- j}, w_{γ +}, w_{γ -}, σ_{ε}^{2}, σ_{γ +}^{2}, σ_{γ -}^{2}] \\ ~ (1 - {\tilde{w}}_{γ j +} - {\tilde{w}}_{γ j -}) δ (γ_{j}) + {\tilde{w}}_{γ j +} N_{+} ({\tilde{μ}}_{γ j +}, {\tilde{σ}}_{γ j +}^{2}) + {\tilde{w}}_{γ j -} N_{-} ({\tilde{μ}}_{γ j -}, {\tilde{σ}}_{γ j -}^{2}), \end{matrix}

where ${\tilde{w}}_{β j +}, {\tilde{w}}_{β j -}, {\tilde{μ}}_{β j +}, {\tilde{σ}}_{β j +}^{2}, {\tilde{μ}}_{β j -}$ , and ${\tilde{σ}}_{β j -}^{2}$ are specified in the APPENDIX. In addition, ${\tilde{w}}_{γ j +}, {\tilde{w}}_{γ j -}, {\tilde{μ}}_{γ j +}, {\tilde{σ}}_{γ j +}^{2}, {\tilde{μ}}_{γ j -}$ , and ${\tilde{σ}}_{γ j -}^{2}$ can be obtained similarly.

Sample w_β+, w_γ+, w_β-, and w_γ-: These parameters can be sampled from the conditional posterior distributions,

\begin{array}{l} (w_{β +}, w_{β -}, 1 - w_{β +} - w_{β -}) | β \\ ~ D i r i c h l e t ({\tilde{p}}_{β +} + 1, {\tilde{p}}_{β -} + 1, p_{β} - {\tilde{p}}_{β +} - {\tilde{p}}_{β -} + 1), w_{β +} + w_{β -} \leq \frac{c \sqrt{n}}{p_{β}}, \\ (w_{γ +}, w_{γ -}, 1 - w_{γ +} - w_{γ -}) | γ \\ ~ D i r i c h l e t ({\tilde{p}}_{γ +} + 1, {\tilde{p}}_{γ -} + 1, p_{γ} - {\tilde{p}}_{γ +} - {\tilde{p}}_{γ -} + 1), w_{γ +} + w_{γ -} \leq \frac{c \sqrt{n}}{p_{γ}}, \end{array}

where ${\tilde{p}}_{β +}$ = #{β_j> 0 : 1 ≤ j ≤ p_β} and ${\tilde{p}}_{β -}$ = #{β_j< 0 : 1 ≤ j ≤ p_β}; ${\tilde{p}}_{γ +}$ = #{γ_j> 0 : 1 ≤ j ≤ p_γ} and ${\tilde{p}}_{γ -}$ = #{γ_j< 0 : 1 ≤ j ≤ p_γ}.

Sample $σ_{ε}^{2}, σ_{β +}^{2}, σ_{γ +}^{2}, σ_{β -}^{2}$ , and $σ_{γ -}^{2}$ : With conditionally conjugate priors, the posterior for all variance parameters are still inverse gamma distributions. Specifically,

\begin{array}{l} σ_{ε}^{2} | Y_{n}, X_{n}, μ, β, γ ~ I G (\frac{n}{2}, \frac{2}{\sum_{i = 1}^{n} {(Y_{i} - μ - X_{i} β - Z_{i} γ)}^{2}}), \\ σ_{β +}^{2} | β ~ I G (θ_{β +} + \frac{{\tilde{p}}_{β +}}{2}, \frac{2}{\frac{2}{φ_{β +}} + \sum_{j = 1}^{p_{β}} β_{j}^{2} I [β_{j} > 0]}), \\ σ_{γ +}^{2} | γ ~ I G (θ_{γ +} + \frac{{\tilde{p}}_{γ +}}{2}, \frac{2}{\frac{2}{φ_{γ +}} + \sum_{j = 1}^{p_{γ}} γ_{j}^{2} I [γ_{j} > 0]}), \\ σ_{β -}^{2} | β ~ I G (θ_{β -} + \frac{{\tilde{p}}_{β -}}{2}, \frac{2}{\frac{2}{φ_{β -}} + \sum_{j = 1}^{p_{β}} β_{j}^{2} I [β_{j} < 0]}), \\ σ_{γ -}^{2} | γ ~ I G (θ_{γ -} + \frac{{\tilde{p}}_{γ -}}{2}, \frac{2}{\frac{2}{φ_{γ -}} + \sum_{j = 1}^{p_{γ}} γ_{j}^{2} I [γ_{j} < 0]}) . \end{array}

Bayesian Inference

For each variable in model (1), one pair of parameters is used to select the corresponding variable. They are, for the j-th additive effect, the posterior probabilities w_{β j+}= P (β_j> 0|Y_n, X_n) and w_{β j-}= P (β_j< 0|Y_n, X_n). With the full conditional posterior distribution of β_jand all the notations in the APPENDIX, we have

\begin{matrix} w_{β j +} = E [{\tilde{w}}_{β j +} | Y_{n}, X_{n}], & w_{β j -} = E [{\tilde{w}}_{β j -} | Y_{n}, X_{n}] . \end{matrix}

Therefore, the two parameters w_{β j+}and w_{β j-}can be estimated with the Markov chains of ${\tilde{w}}_{β j +}$ and ${\tilde{w}}_{β j -}$ drawn from the above Gibbs sampler. If and only if both w_{β j+}and w_{β j-}are less than 0.5, the median of the posterior distribution of β_jis zero. Similarly, the posterior probabilities w_{γ j+}= P (γ_j> 0|Y_n, X_n) and w_{γ j-}= P (γ_j< 0|Y_n, X_n) can be estimated with the Markov chains of ${\tilde{w}}_{γ j +}$ and ${\tilde{w}}_{γ j -}$ drawn from the above Gibbs sampler.

We propose to select variables twice under the above Bayesian framework for model (1). At the first step, we use a restrictive prior for each coefficient to ensure an identifiable Bayesian model and enforce to stochastically search for an optimal low-dimensional parameter subspace. We then rank the j-th additive effect based on max{w_{β j+}, w_{β j-}}, and rank the j-th epistatic effect based on max{w_{γ j+}, w_{γ j-}}. The top c $\sqrt{n}$ out of p_βadditive effects, and the top c $\sqrt{n}$ out of p_γepistatic effects are selected, respectively. At the second step, we select variables out of those selected c $\sqrt{n}$ additive effects and c $\sqrt{n}$ epistatic effects, under the above Bayesian framework for model (1). Obviously, we have a non-restrictive prior for each coefficient at the second step, and therefore avoid possible over-penalization due to restrictive priors.

Following Jeffreys [49, 50], we test the hypothesis H₀ : β_j= 0 vs. H₁: β_j≠ 0 on the basis of the Bayes factor, which was defined as

B_{10} (β_{j}) = \frac{P (D a t a | β_{j} \neq 0)}{P (D a t a | β_{j} = 0)} = \frac{P (β_{j} \neq 0 | D a t a)}{P (β_{j} = 0 | D a t a)} \times \frac{π (β_{j} = 0)}{π (β_{j} \neq 0)} = \frac{w_{β j +} + w_{β j -}}{1 - w_{β j +} - w_{β j -}},

where π (β_j= 0) and π (β_j≠ 0) are the a priori probabilities, and the last equality follows the fact that π (β_j= 0) = π (β_j≠ 0) at the second step of our Bayesian Classification. As suggested by Jeffreys [50], a B₁₀ (β_j) with value between 1 and $\sqrt{10}$ ≈ 3.2 provides "not worth more than a bare mention" evidence against H₀; a B₁₀ (β_j) with value from $\sqrt{10}$ to 10 provides "substantial" evidence against H₀; a B₁₀ (β_j) with value from 10 to 100 provides "strong" evidence against H₀; and a B₁₀ (β_j) with value larger than 100 provides "decisive" evidence against H₀. Similarly, we can test the hypothesis H₀: γ_j= 0 vs. H₁: j ≠ 0 using the following Bayes factor

B_{10} (γ_{j}) = \frac{w_{γ j +} + w_{γ j -}}{1 - w_{γ j +} - w_{γ j -}} .

(5)

Appendix

Fully Conditional Posterior Distribution of β _j

For each j = 1, ⋯, p_β, the fully conditional posterior distribution of β_jis

\begin{array}{l} β_{j} | Y_{n}, X_{n}, μ, β_{- j}, γ, w_{β +}, w_{β -}, σ_{ε}^{2}, σ_{β +}^{2}, σ_{β -}^{2} \\ ~ (1 - {\tilde{w}}_{β j +} - {\tilde{w}}_{β j -}) δ_{(0)} + {\tilde{w}}_{β j +} N_{+} ({\tilde{μ}}_{β j +}, {\tilde{σ}}_{β j +}^{2}) + {\tilde{w}}_{β j -} N_{-} ({\tilde{μ}}_{β j -}, {\tilde{σ}}_{β j -}^{2}), \end{array}

where the updated parameter values are

\begin{array}{l} {\tilde{μ}}_{β j +} & = & σ_{β +}^{2} \sum_{i = 1}^{n} X_{i j} (Y_{i} - μ - X_{i, - j} β_{- j} - Z_{i} γ) / (σ_{ε}^{2} + σ_{β +}^{2} \sum_{i = 1}^{n} X_{i j}^{2}), \\ {\tilde{σ}}_{β j +}^{2} & = & σ_{β +}^{2} σ_{ε}^{2} / (σ_{ε}^{2} + σ_{β +}^{2} \sum_{i = 1}^{n} X_{i j}^{2}), \\ {\tilde{μ}}_{β j -} & = & σ_{β -}^{2} \sum_{i = 1}^{n} X_{i j} (Y_{i} - μ - X_{i, - j} β_{- j} - Z_{i} γ) / (σ_{ε}^{2} + σ_{β -}^{2} \sum_{i = 1}^{n} X_{i j}^{2}), \\ {\tilde{σ}}_{β j -}^{2} & = & σ_{β -}^{2} σ_{ε}^{2} / (σ_{ε}^{2} + σ_{β -}^{2} \sum_{i = 1}^{n} X_{i j}^{2}), \\ {\tilde{w}}_{β j +} & = & \frac{{\tilde{σ}}_{β j +}}{σ_{β +}} Φ (\frac{{\tilde{μ}}_{β j +}}{{\tilde{σ}}_{β j +}}) / [\frac{1 - w_{β +} - w_{β -}}{2 w_{β +}} \exp (- \frac{{\tilde{μ}}_{β j +}^{2}}{2 {\tilde{σ}}_{β j +}^{2}}) + \frac{{\tilde{σ}}_{β j +}}{σ_{β +}} \\ \times Φ (\frac{{\tilde{μ}}_{β j +}}{{\tilde{σ}}_{β j +}}) + \frac{w_{β -} {\tilde{σ}}_{β j -}}{w_{β +} σ_{β -}} Φ (- \frac{{\tilde{μ}}_{β j -}}{{\tilde{σ}}_{β j -}}) \exp (\frac{{\tilde{μ}}_{β j -}^{2}}{2 {\tilde{σ}}_{β j -}^{2}} - \frac{{\tilde{μ}}_{β j +}^{2}}{2 {\tilde{σ}}_{β j +}^{2}})], \\ {\tilde{w}}_{β j -} & = & \frac{{\tilde{σ}}_{β j -}}{σ_{β -}} Φ (- \frac{{\tilde{μ}}_{β j -}}{{\tilde{σ}}_{β j -}}) / [\frac{1 - w_{β +} - w_{β -}}{2 w_{β -}} \exp (- \frac{{\tilde{μ}}_{β j -}^{2}}{2 {\tilde{σ}}_{β j -}^{2}}) + \frac{{\tilde{σ}}_{β j -}}{σ_{β j -}} \\ \times Φ (- \frac{{\tilde{μ}}_{β j -}}{{\tilde{σ}}_{β j -}}) + \frac{w_{β +} {\tilde{σ}}_{β j +}}{w_{β -} σ_{β +}} Φ (\frac{{\tilde{μ}}_{β j +}}{{\tilde{σ}}_{β j +}}) \exp (\frac{{\tilde{μ}}_{j +}^{2}}{2 {\tilde{σ}}_{β j +}^{2}} - \frac{{\tilde{μ}}_{β j -}^{2}}{2 {\tilde{σ}}_{β j -}^{2}})] . \end{array}

References

Lander ES, Botstein D: Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 1989, 121: 185–199.
PubMed Central CAS PubMed Google Scholar
Doerge RW, Zeng ZB, Weir BS: Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science 1997, 12: 195–219. 10.1214/ss/1030037909
Article Google Scholar
Broman KW, Speed TP: A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society Series B 2002, 64: 641–656. 10.1111/1467-9868.00354
Article Google Scholar
Wang H, Zhang YM, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S: Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics 2005, 170: 465–480. 10.1534/genetics.104.039354
Article PubMed Central CAS PubMed Google Scholar
Carlborg d, Haley CS: Epistasis: too often neglected in complex trait studies? Natuer Review Genetics 2004, 5: 618–625. 10.1038/nrg1407
Article CAS Google Scholar
Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human disease. Human Heredity 2003, 56: 73–82. 10.1159/000073735
Article PubMed Google Scholar
Williams SM, Addy JH, Phillips JAI, Dai M, Kpodonu J, Afful J, Jackson H, Joseph K, Eason F, Murray MM, Epperson P, Aduonum A, Wong LJ, Jose PA, Felder RA: Combinations of variation in multiple genes are associated with hypertension. Hypertension 2000, 36: 2–6.
Article CAS PubMed Google Scholar
Leamy LJ, Routman EJ, Cheverud JM: An epistatic genetic basis for fluctuating asymmetry of mandible size in mice. Evolution 2002, 56: 642–653.
Article PubMed Google Scholar
Wagner A: Robustness against mutations in genetic networks of yeast. Nature Genetics 2000, 24: 355–361. 10.1038/74174
Article CAS PubMed Google Scholar
Sanjuán R, Cuevas JM, Moya A, Elena SF: Epistasis and the adaptability of an RNA virus. Genetics 2005, 170: 1001–1008. 10.1534/genetics.105.040741
Article PubMed Central PubMed Google Scholar
Eshed Y, Zamir D: Less-than-additive epistatic interactions of quantitative trait loci in tomato. Genetics 1996, 143: 1807–1817.
PubMed Central CAS PubMed Google Scholar
Xu S, Jia Z: Genomewide analysis of epistatic effects for quantative traits in Barley. Genetics 2007, 175: 1955–1963. 10.1534/genetics.106.066571
Article PubMed Central CAS PubMed Google Scholar
Bateson W: Mendel's Principles of Heredity. Cambridge: Cambridge University Press; 1909.
Book Google Scholar
Kao CH, Zeng ZB: Modeling epistasis of quantitative trait loci using Cockerham's model. Genetics 2002, 160: 1243–1261.
PubMed Central PubMed Google Scholar
Yi N, Xu S: Mapping quantitative trait loci with epistatic effects. Genetical Research 2002, 79: 185–198. 10.1017/S0016672301005511
Article PubMed Google Scholar
Yi N, Yandell BS, Churchill GA, Allison DB, Eisen EJ, Pomp D: Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics 2005, 170: s1333–1344. 10.1534/genetics.104.040386
Article Google Scholar
Yi N, Banerjee S, Pomp D, Yandell BS: Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits. Genetics 2007, 176: 1855–1864. 10.1534/genetics.107.071142
Article PubMed Central PubMed Google Scholar
Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS: An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics 2007, 176: 1865–1877. 10.1534/genetics.107.071365
Article PubMed Central PubMed Google Scholar
Yandell BS, Mehta T, Banerjee S, Shriner D, Venkataraman R, Moon JY, Neely WW, Wu H, von Smith R, Yi N: R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics 2007, 23: 641–643. 10.1093/bioinformatics/btm011
Article CAS PubMed Google Scholar
Bogdan M, Ghosh JK, Doerge RW: Modifying the Schwartz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 2004, 167: 989–999. 10.1534/genetics.103.021683
Article PubMed Central CAS PubMed Google Scholar
Cui YH, Wu R: Mapping genome-genome epistasis: a high-dimensional model. Bioinformatics 2005, 21: 2447–2455. 10.1093/bioinformatics/bti342
Article CAS PubMed Google Scholar
Żak M, Baierl A, Bogdan M, Futschik A: Locating multiple interacting quantitative trait loci using rank-based model selection. Genetics 2007, 176: 1845–1854. 10.1534/genetics.106.068031
Article PubMed Central PubMed Google Scholar
Shi W, Lee KE, Wahba G: Detecing disease-causing genes by LASSO-Patternsearch algorithm. BMC Proceedings 2007, 1(Suppl 1):S60.
Article PubMed Central PubMed Google Scholar
Bogdan M, Frommlet F, Biecek P, Cheng R, Ghosh JK, Doerge RW: Extending the modified Bayesian information criterion (mBIC) to dense markers and multiple interval mapping. Biometrics, in press.
Kao CH, Zeng ZB, Teasdale RD: Multiple interval mapping for quantitative trait loci. Genetics 1999, 152: 1203–1216.
PubMed Central CAS PubMed Google Scholar
Zeng ZB, Kao CH, Basten CJ: Estimating the genetic architecture of quantitative traits. Genetical Research 1999, 74: 279–289. 10.1017/S0016672399004255
Article CAS PubMed Google Scholar
Ball RD: Bayesian methods for quantitative trait loci mapping based on model selection: approximate analysis using the Bayesian information criterion. Genetics 2001, 159: 1351–1364.
PubMed Central CAS PubMed Google Scholar
Zhang M, Montooth KL, Wells MT, Clark AG, Zhang D: Mapping multiple quantitative trait loci by Bayesian classification. Genetics 2005, 169: 2305–2318. 10.1534/genetics.104.034181
Article PubMed Central CAS PubMed Google Scholar
Mitchell TJ, Beauchamp JJ: Bayesian variable selection in linear regression (with discussion). Journal of the American Statistical Association 1988, 83: 1023–1036. 10.2307/2290129
Article Google Scholar
George EI, McCulloch RE: Variable selection via Gibbs sampling. Journal of the American Statistical Association 1993, 88: 881–889. 10.2307/2290777
Article Google Scholar
Ishwaran H, Rao JS: Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 2005, 33: 730–773. 10.1214/009053604000001147
Article Google Scholar
Zhang M, Zhang D, Wells MT: Generalized Shrinkage Estimators Adpative to Sparsity and Asymmetry of High Dimensional Parameter Spaces. Technical Reports, Department of Statistics, Purdue University 2008, 08–01.
Google Scholar
Liu J, Mercer JM, Stam LF, Gibson G, Zeng ZB, Laurie CC: Genetic analysis of a morphological shape difference in the male genitalia of Drosophila simulans and D. mauritiana . Genetics 1996, 142: 1129–1145.
PubMed Central CAS PubMed Google Scholar
Zeng ZB, Liu J, Stam LF, Kao CH, Mercer JM, Laurie CC: Genetic architecture of a morphological shape difference between two drosophila species. Genetics 2000, 154: 299–310.
PubMed Central CAS PubMed Google Scholar
Kao CH, Zeng ZB: General formula for obtaining the MLEs and the asymptotic variance-covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics 1997, 53: 653–665. 10.2307/2533965
Article CAS PubMed Google Scholar
Tanksley SD: Mapping polygenes. Annual Review Genetics 1993, 27: 205–233. 10.1146/annurev.ge.27.120193.001225
Article CAS Google Scholar
Bogdan M, Doerge RW: Biased estimators of quantitative trait locus heritability and location in interval mapping. Heredity 2005, 95: 476–484. 10.1038/sj.hdy.6800747
Article CAS PubMed Google Scholar
Tibshirani RJ: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 1996, 58: 267–288.
Google Scholar
Fan J, Peng H: Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 2004, 32: 928–961. 10.1214/009053604000000256
Article Google Scholar
Álvarez-Castro JM, Carlborg O: A unified model for functional and statistical epistasis and its application in quantitative trait loci analysis. Genetics 2007, 176: 1151–1167. 10.1534/genetics.106.067348
Article PubMed Central PubMed Google Scholar
Zeng ZB, Wang T, Zou W: Modeling quantitative trait loci and interpretation of models. Genetics 2005, 169: 1711–1725. 10.1534/genetics.104.035857
Article PubMed Central CAS PubMed Google Scholar
Hastie H, Tibshirani R, Sherlock G, Eisen M, Brown P, Botstein D: Imputing missing data for gene expression arrays. PhD thesis. Stanford University, Statistics Department; 1999.
Google Scholar
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: John Wiley; 2002.
Book Google Scholar
Huber P: Robust regression: asymptotics, conjectures, and Monte Carlo. The Annals of Statistics 1973, 1: 799–821. 10.1214/aos/1176342503
Article Google Scholar
Portnoy S: Asymptotic behavior of M-estimators of p regression parameters when p²/n is large, I. Consistency. Annals of Statistics 1984, 12: 1298–1309. 10.1214/aos/1176346793
Article Google Scholar
Greenshtein E, Ritov Y: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 2004, 10: 971–988. 10.3150/bj/1106314846
Article Google Scholar
Fourdrinier D, Strawderman WE, Wells MT: On the construction of Bayes minimax estimators. The Annals of Statistics 1998, 26: 660–671. 10.1214/aos/1028144853
Article Google Scholar
Gaffney PJ: An efficient reversible jump Markov chain Monte Carlo approach to detect multiple loci and their effects in inbred crosses. PhD thesis. Department of Statistics, University of Wisconsin, Madison, WI; 2001.
Google Scholar
Jeffreys H: Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophy Society 1935, 31: 201–222.
Google Scholar
Jeffreys H: Theory of Probability. Oxford: Clarendon Press; 1961.
Google Scholar

Download references

Acknowledgements

We thank Rebecca Doerge and three anonymous referees for suggestions and comments that significantly improved this article. This research was supported by NSF grant 0612031 and NIH-NIGMS 1R01GM083606-01 to MTW, Purdue Research Foundation grants to MZ and DZ, and Purdue Alumni Association Faculty Incentive Grant to MZ.

Author information

Authors and Affiliations

Department of Statistics, Purdue University, West Lafayette, IN, 47907, USA
Min Zhang & Dabao Zhang
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, 14853, USA
Martin T Wells

Authors

Min Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Dabao Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Martin T Wells
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dabao Zhang.

Additional information

Authors' contributions

MZ and DZ both conceived the study and developed the method. DZ wrote the MATLAB^® code and did the simulation study. MZ analyzed the real data. MTW participated in conceptual development, writing, reviewing and editing the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Zhang, M., Zhang, D. & Wells, M.T. Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases. BMC Bioinformatics 9, 251 (2008). https://doi.org/10.1186/1471-2105-9-251

Download citation

Received: 26 November 2007
Accepted: 29 May 2008
Published: 29 May 2008
DOI: https://doi.org/10.1186/1471-2105-9-251

Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases