Introduction

Interval mapping (Lander and Botstein 1989) is the most commonly used method for mapping quantitative trait loci (QTL). The method usually applies to quantitative traits, i.e., traits that have a continuous distribution. In agricultural crops, the phenotypes of some traits are measured as discrete variables. Ordinal traits, e.g., disease resistance in plants and animals, are typical examples of such discrete traits. These traits are usually modeled by a multinomial distribution. Traits measured as counts, such as litter size in pigs and tiller number in rice, are usually modeled by the Poisson distribution. Binomial traits are also common in agricultural experiments, such as the ratio of germinated seeds to the total number of seeds planted. These traits are not normally distributed. Although some transformations, such as the log transformation or the more general Box–Cox transformation (Box and Cox 1964), can be used to improve the normality of the traits, not all traits can be transformed. For example, a binary trait has no appropriate transformation to make it normal.

The generalized linear model approach (McCullagh and Nelder 1989; Wedderburn 1974) is the most appropriate method for analyzing traits with non-normal distributions and it has been widely used in statistics for parameter estimation. The generalized linear model takes advantage of all the theory and methods developed in the usual linear model methodology (Searle 1997). It has been applied to QTL mapping for some special traits, e.g., binary traits (Deng et al. 2006; Xu and Atchley 1996; Yi and Xu 1999a, b, 2000), ordinal traits (Hackett and Weller 1995; Rao and Xu 1998) and Poisson traits (Cui et al. 2006; Cui and Yang 2009). Depending on the special characteristics of the traits, distribution-specific generalized linear models have been developed for these traits. These methods are not sufficiently general to extend to all traits that can be modeled by the generalized linear model. For example, the EM algorithm developed by Xu et al. (2003, 2005) only applies to binary and ordinal traits. They treated both the marker genotypes and the latent variable as missing values. Although parameter estimation under the EM algorithm is simple, the information matrix of the estimated parameters is difficult to calculate. A more comprehensive analysis of the generalized linear model applied to QTL mapping is given in the seminal paper by Lange and Whittaker (2001). They adopted the generalized estimating equations (GEE) approach to analyze multiple traits with arbitrary combinations of continuous and discrete trait components. The method replaces the unobserved QTL genotypes by the conditional expectations of the genotype indicator variable given flanking marker information. The uncertainties of the genotype indicator variables are ignored. In addition, detailed formulas for the partial derivatives of the expectation of the data with respect to the parameters are not given.

When there are no missing values, commercial software packages are available to estimate parameters under a wide range of trait distributions, e.g., SAS (SAS Institute 1999) and GENESTAT (PHOEBE Biostatistics Group 2007, http://www.genestat.org). Although these programs may handle missing values using the imputation algorithm or the EM algorithm (Dempster et al. 1977), the missing patterns handled by these commercial programs are usually different from those of QTL mapping. In interval mapping, genotypes are missing for every individual at a putative QTL position unless the QTL overlaps with a fully informative marker. Special mixture models are required in QTL mapping (Lander and Botstein 1989). Hackett and Weller (1995) and Xu and Atchley (1996) were among the first to use the EM algorithm to estimate QTL parameters for ordinal traits, but they did not investigate the variance–covariance matrix of the estimated QTL effects. Hackett and Weller (1995) took advantage of an existing commercial software package (GENESTAT) for generalized linear model analysis by iteratively calling the subroutine for the generalized linear model with non-missing genotypes and calculating the weight (posterior probability of QTL genotype). The attractive property of that method is that users do not have to write their own code for the maximization step, which is conducted by the commercial software. They only incorporated the expectation step into the existing program for parameter updating. As a result, no variance–covariance matrix of the estimated parameters was provided. Xu et al. (2003) developed an EM algorithm for binary data and used the Monte Carlo simulation approach to obtain the Louis (1982) information matrix. The variance–covariance matrix of the estimated parameters can be approximated by the inverse of the information matrix. This method is computationally intensive due to the use of Monte Carlo simulation for approximate integration.

In the statistics literature, generalized linear models with missing covariates are often handled with the EM algorithm (Horton and Laird 1998). However, other methods are also available, as summarized by Ibrahim et al. (2005), who reviewed four general approaches: the maximum likelihood method implemented via the EM algorithm by the method of weights (Horton and Laird 1998), multiple imputation (Rubin 1987), the fully Bayesian method (Ibrahim et al. 2002) and weighted estimating equations (Robins and Ritov 2001). Ibrahim et al. (2005) concluded that the most accurate method is the fully Bayesian method, although the method is associated with a high cost in terms of computing time. The second best method is the EM algorithm via the method of weights. Application of this method to interval mapping has not been analyzed.

The mixture model-based maximum likelihood estimation of parameters converges slowly, and the computational cost is high within each round of the iteration. Therefore, an approximate method that improves the computational efficiency with little loss in power may be desirable. Xu (1998) developed a weighted least squares approximation for QTL mapping of normally distributed traits. Recently, Han and Xu (2008) further improved the weighted least squares estimation. Their idea can be applied to the generalized linear model. Therefore, we will present an approximate method to improve the computational efficiency over the mixture model maximum likelihood method.

Model and methods

Generalized linear model

We will use ordinal trait QTL mapping as an example to introduce the generalized linear model. Extension to other traits will be given later. Suppose that a disease phenotype of individual \( j\,\left( {j = 1,\ldots,n} \right) \) is measured by an ordinal variable denoted by \( T_{j} = 1,\ldots,p + 1 , \) where p + 1 is the total number of disease classes and n is the sample size. Let \( Y_{j} = \left\{ {Y_{jk} } \right\},\,\,\forall \;k = 1,\ldots,p + 1\), be a \( (p + 1) \times 1 \) vector to indicate the disease status of individual j. The kth element of \( Y_{j} \) is defined as

$$ Y_{jk} = \left\{ {\begin{array}{*{20}l} 1 & {{\text{if}} \quad T_{j} = k} \\ 0 & {{\text{if}} \quad T_{j} \ne k} \\ \end{array} } \right. $$
(1)

Note that the phenotype of the ordinal trait has been formulated as a multivariate Bernoulli variable (or multinomial variable with sample size one). Different link functions can be used to describe the relationship between the observed ordinal phenotype and the genetic effects of QTL. The most commonly used link functions are the logit and probit link functions (McCullagh and Nelder 1989). Here, we describe the probit link function and leave the logit link function to Appendix D for interested readers. Under the probit link function, the expectation of \( Y_{jk} \) is

$$ \mu_{jk} = E(Y_{jk} ) = \Upphi \left( {\alpha_{k} + X_{j} \beta + Z_{j} \gamma } \right) - \Upphi \left( {\alpha_{k - 1} + X_{j} \beta + Z_{j} \gamma } \right) $$
(2)

where \( \alpha_{k} \left( {\alpha_{0} = - \infty \;{\text{and}}\;\alpha_{p + 1} = + \infty } \right) \) is the intercept, β is a q × 1 vector for some systematic effects (not related to the effects of QTL), and γ is an r × 1 vector for the effects of a quantitative trait locus. The symbol Φ(·) is the standardized cumulative normal function. The design matrix \( X_{j} \) is assumed to be known, but \( Z_{j} \) may not be fully observed because it is determined by the genotype of individual j for the locus of interest. We will defer the definition of \( Z_{j} \) to the next section. Because the link function is probit, this type of analysis is called probit analysis. Let \( \mu_{j} = \{ \mu_{jk} \} \) be a \( (p + 1) \times 1 \) vector. The expectation for vector \( Y_{j} \) is \( E(Y_{j} ) = \mu_{j} \) and the variance–covariance matrix of \( Y_{j} \) is

$$ V_{j} = \text{var} \left( {Y_{j} } \right) = {\text{diag}}\left( {\mu_{j} } \right) - \mu_{j} \mu_{j}^{T} $$
(3)

The method to be developed requires the inverse of matrix \( V_{j} \). However, \( V_{j} \) is not of full rank (see Appendix C for proof). We can use a generalized inverse of \( V_{j} \), such as \( V_{j}^{ - } = {\text{diag}}^{ - 1} (\mu_{j} ), \) in place of \( V_{j}^{ - 1} \) (see Appendix C for the generalized inverse). The weight matrix is

$$ W_{j} = {\text{diag}}^{ - 1} (\mu_{j} ) $$
(4)

The parameter vector is denoted by θ = {α, β, γ} with a dimensionality of \( (p + q + r) \times 1. \) Once \( \mu_{j} \,{\text{and}}\,\,W_{j} \) are defined, the reweighted least squares method of Wedderburn (1974) can be used to estimate the parameters.
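As a small numerical illustration of Eqs. (2)–(4), the sketch below computes the class probabilities for one individual, builds \( V_{j} \), and checks that \( {\text{diag}}^{ - 1} (\mu_{j} ) \) behaves as a generalized inverse, i.e., \( V_{j} W_{j} V_{j} = V_{j} \). The threshold values, the linear predictor and all function names are hypothetical choices made only for this example.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF, the Phi of Eq. (2)

def category_probs(alpha, eta):
    """Return mu_j of Eq. (2): P(T_j = k) for k = 1..p+1.

    alpha: the p finite thresholds alpha_1 < ... < alpha_p
    eta:   the linear predictor X_j beta + Z_j gamma (a scalar)
    """
    # alpha_0 = -inf and alpha_{p+1} = +inf give cumulative values 0 and 1
    cum = [0.0] + [PHI(a + eta) for a in alpha] + [1.0]
    return [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def covariance(mu):
    """V_j = diag(mu_j) - mu_j mu_j^T, Eq. (3)."""
    n = len(mu)
    return [[(mu[a] if a == b else 0.0) - mu[a] * mu[b] for b in range(n)]
            for a in range(n)]

def matmul(A, B):
    """Plain-Python matrix product for small square matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hypothetical values: three ordered classes (p = 2)
mu = category_probs(alpha=[-0.5, 0.8], eta=0.3)
V = covariance(mu)
W = [[(1.0 / mu[a] if a == b else 0.0) for b in range(3)] for a in range(3)]  # Eq. (4)
VWV = matmul(matmul(V, W), V)
```

Each row of `V` sums to zero because the class probabilities sum to one, which is why the matrix cannot be inverted in the ordinary sense.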

Mixture model maximum likelihood estimation

When the design matrix \( Z_{j} \) is fully observed, the maximum likelihood solution of parameters can be obtained in a straightforward manner using the reweighted least squares approach (Wedderburn 1974). For the paper to be self-contained, the iteration equation and the information matrix for the situation of no missing values are briefly described. The Newton–Raphson iteration is

$$ \theta^{(t + 1)} = \theta^{(t)} + \Updelta \theta $$
(5)

where

$$ \Updelta \theta = \left[ {\sum\limits_{j = 1}^{n} {D_{j}^{T} W_{j} D_{j} } } \right]^{ - 1} \left[ {\sum\limits_{j = 1}^{n} {D_{j}^{T} W_{j} \left( {Y_{j} - \mu_{j} } \right)} } \right] $$
(6)

is the iteratively reweighted least squares formula for parameter updating (increment of the parameter from iteration t to t + 1). Matrix \( D_{j} \) is the partial derivative matrix of \( \mu_{j} \) with respect to θ,

$$ D_{j} = \left[ {\begin{array}{*{20}l} {{\frac{{\partial \mu_{j} }}{{\partial \alpha^{T} }}}} & {{\frac{{\partial \mu_{j} }}{{\partial \beta^{T} }}}} & {{\frac{{\partial \mu_{j} }}{{\partial \gamma^{T} }}}} \\ \end{array} } \right] $$
(7)

with a dimensionality of \( (p + 1) \times (p + q + r). \) Once the iteration process converges, the information matrix is automatically given, as it is the coefficient matrix in the last step of the iteration (Wedderburn 1974),

$$ I(\theta ) = \sum\limits_{j = 1}^{n} {D_{j}^{T} W_{j} D_{j} } $$
(8)

From this information matrix, the variance–covariance matrix of estimated parameters is calculated because the inverse of the information matrix approximately equals the variance–covariance matrix of the estimated parameters.
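The iteration in Eqs. (5)–(8) can be sketched in a few lines for the simplest special case. The example below is a simplification, not the ordinal model itself: it uses a univariate binary response with one intercept and one fully observed covariate, so that \( D_{j} \) is a 1 × 2 row and the 2 × 2 system can be solved in closed form. The data and the function name are made up for illustration.

```python
from statistics import NormalDist

N01 = NormalDist()

def irls_probit(x, y, n_iter=100, tol=1e-10):
    """Fisher scoring (Eqs. 5-8) for mu_j = Phi(b0 + b1 * x_j) with binary y_j."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # accumulate the 2 x 2 information matrix and the 2 x 1 score vector
        i00 = i01 = i11 = s0 = s1 = 0.0
        for xj, yj in zip(x, y):
            eta = b0 + b1 * xj
            mu = N01.cdf(eta)
            w = 1.0 / (mu * (1.0 - mu))                  # W_j, inverse variance
            d0, d1 = N01.pdf(eta), N01.pdf(eta) * xj     # D_j = d mu_j / d theta
            i00 += d0 * w * d0
            i01 += d0 * w * d1
            i11 += d1 * w * d1
            s0 += d0 * w * (yj - mu)
            s1 += d1 * w * (yj - mu)
        det = i00 * i11 - i01 * i01
        # increment of Eq. (6), using the closed-form inverse of a 2 x 2 matrix
        db0 = (i11 * s0 - i01 * s1) / det
        db1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + db0, b1 + db1
        if abs(db0) + abs(db1) < tol:
            break
    # inverse of the coefficient matrix from the last iteration, Eq. (8)
    var = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
    return (b0, b1), var

# Hypothetical data: covariate values and binary phenotypes
x = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
(b0, b1), var = irls_probit(x, y)
```

The returned `var` is the approximate variance–covariance matrix of the two estimates; no extra computation is needed once the iteration has converged.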

In QTL mapping for ordinal traits, the generalized linear model in its original form can be applied if one is interested in individual marker analysis because, at markers, the genotypes are observed and thus matrix \( Z_{j} \) is known. In interval mapping, however, effects of QTL that are located between markers should also be estimated. In this case, the genotypes of QTL are not observed and must be inferred using flanking marker genotypes. This is a typical missing value problem. We now use an F2 population as an example to show how to handle the missing value of \( Z_{j} \). Let

$$ p_{j} (g) = \Pr (Z_{j} = H_{g} |{\text{marker}}),\quad\forall \,g = 1,\;2,\;3 $$
(9)

be the conditional probability of QTL genotype given marker information, where the marker information can be either drawn from two flanking markers (interval mapping, Lander and Botstein 1989) or multiple markers (multipoint analysis, Jiang and Zeng 1997). For an F2 population, matrix \( H_{g} \) is defined as the gth row of matrix H, where

$$ H = \left[ {\begin{array}{*{20}l} { + 1} & 0 \\ { \, 0} & 1 \\ { - 1} & 0 \\ \end{array} } \right] $$
(10)
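The conditional probabilities in Eq. (9) are not written out in the text. The sketch below shows one common way to obtain them from two flanking markers in an F2 population, under assumptions that are ours rather than the paper's: no crossover interference, the Haldane map function, fully informative codominant markers, and genotypes ordered as in the rows of H. Function names are hypothetical.

```python
import math

def haldane(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def f2_transition(r):
    """P(genotype at locus 2 | genotype at locus 1) in an F2.

    Rows and columns follow the order of H: +1 homozygote, heterozygote, -1 homozygote.
    """
    s = 1.0 - r
    return [[s * s, 2 * r * s, r * r],
            [r * s, s * s + r * r, r * s],
            [r * r, 2 * r * s, s * s]]

def qtl_prior(left, right, d_left, d_right):
    """p_j(g) of Eq. (9) from two flanking marker genotypes coded 0, 1, 2.

    d_left:  distance (cM) from the left marker to the putative QTL
    d_right: distance (cM) from the putative QTL to the right marker
    """
    t1 = f2_transition(haldane(d_left))    # left marker -> QTL
    t2 = f2_transition(haldane(d_right))   # QTL -> right marker
    joint = [t1[left][g] * t2[g][right] for g in range(3)]
    total = sum(joint)
    return [v / total for v in joint]

# QTL 5 cM from a left marker of genotype 0 and 15 cM from a right marker of genotype 1
p_j = qtl_prior(left=0, right=1, d_left=5.0, d_right=15.0)
```

When the putative position coincides with a marker (`d_left = 0`), the probabilities collapse onto the observed marker genotype, which corresponds to the fully observed case.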

Corresponding to the definition of H, the QTL effect vector γ is defined as \( \gamma = [\begin{array}{*{20}l} a & d \\ \end{array} ]^{T} \) where a and d represent the additive and dominance effects, respectively. When \( Z_{j} \) is missing, the generalized linear model becomes a generalized linear mixture model. Under the mixture model approach, we need to define the genotype-specific expectation, the genotype-specific variance matrix and the genotype-specific derivatives for each individual. Let

$$ \mu_{jk} (g) = E\left( {Y_{jk} |g} \right) = \Upphi \left( {\alpha_{k} + X_{j} \beta + H_{g} \gamma } \right) - \Upphi \left( {\alpha_{k - 1} + X_{j} \beta + H_{g} \gamma } \right) $$
(11)

be the genotype-specific expectation of \( Y_{jk} \) when individual j takes the gth genotype for g = 1, 2, 3. The corresponding genotype-specific weight matrix is

$$ W_{j} (g) = {\text{diag}}^{ - 1} \left[ {\mu_{j} (g)} \right] $$
(12)

Let \( D_{j} (g) \) be the genotype-specific partial derivatives of the expectation with respect to the parameters,

$$ D_{j} (g) = \left[ {\begin{array}{*{20}l} {{\frac{{\partial \mu_{j} (g)}}{{\partial \alpha^{T} }}}} & {{\frac{{\partial \mu_{j} (g)}}{{\partial \beta^{T} }}}} & {{\frac{{\partial \mu_{j} (g)}}{{\partial \gamma^{T} }}}} \\ \end{array} } \right] $$
(13)

The closed form of matrix \( D_{j} (g) \) is given in Appendix A. The increment of parameters in the iteration is

$$ \begin{gathered} \Updelta \theta = \left[ {\sum\limits_{j = 1}^{n} {\sum\limits_{g = 1}^{3} {p_{j}^{*} (g)D_{j}^{T} (g)W_{j} (g)D_{j} (g)} } } \right]^{ - 1} \hfill \\ \times \left[ {\sum\limits_{j = 1}^{n} {\sum\limits_{g = 1}^{3} {p_{j}^{*} (g)D_{j}^{T} (g)W_{j} (g)\left( {Y_{j} - \mu_{j} (g)} \right)} } } \right] \hfill \\ \end{gathered} $$
(14)

where \( p_{j}^{*} (g) \) is the posterior probability of QTL genotype after the phenotype information is incorporated and is given by

$$ p_{j}^{*} (g) = {\frac{{p_{j} (g)p(Y_{j} |g)}}{{\sum\nolimits_{g^{\prime} = 1}^{3} {p_{j} (g^{\prime})p(Y_{j} |g^{\prime})}}}} $$
(15)

where

$$ p(Y_{j} |g) = \prod\limits_{k = 1}^{p + 1} {\mu_{jk}^{{Y_{jk} }} (g)} = Y_{j}^{T} \mu_{j} (g) $$
(16)

is the multinomial probability. Derivation of the EM algorithm (Eq. 14) is given in Appendix B.
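The expectation step in Eqs. (15)–(16) can be sketched as follows for a single individual. Because \( Y_{j} \) is an indicator vector, the multinomial probability reduces to the probability of the observed class. The threshold values, effects and function names below are hypothetical and serve only to illustrate the calculation.

```python
from statistics import NormalDist

PHI = NormalDist().cdf
H = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]   # rows of H in Eq. (10)

def category_prob(k, alpha, eta):
    """mu_jk(g) of Eq. (11) for class k in 1..p+1 (alpha holds alpha_1..alpha_p)."""
    upper = 1.0 if k == len(alpha) + 1 else PHI(alpha[k - 1] + eta)
    lower = 0.0 if k == 1 else PHI(alpha[k - 2] + eta)
    return upper - lower

def posterior(prior, t_j, alpha, xb, gamma):
    """p*_j(g) of Eq. (15) for an individual observed in class t_j.

    prior: p_j(g) from the markers; xb: X_j beta; gamma: [a, d]
    """
    # p(Y_j | g) is the probability of the observed class under genotype g, Eq. (16)
    lik = [category_prob(t_j, alpha, xb + h[0] * gamma[0] + h[1] * gamma[1])
           for h in H]
    joint = [p * l for p, l in zip(prior, lik)]
    total = sum(joint)
    return [v / total for v in joint]

prior = [0.25, 0.5, 0.25]
post = posterior(prior, t_j=1, alpha=[-0.5, 0.8], xb=0.0, gamma=[0.6, 0.0])
```

With no QTL effect (`gamma = [0, 0]`) the phenotype carries no information about the genotype and the posterior equals the prior.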

Unfortunately, the information matrix under the EM algorithm is not identical to the coefficient matrix of the reweighted least squares equation; rather, it has to be adjusted for the information loss due to the uncertainty of QTL genotypes. The Louis’ (1982) adjustment of the information matrix is

$$ I(\theta ) = \sum\limits_{j = 1}^{n} {E\left[ {B_{j} \left( {\theta |Z_{j} } \right)} \right]} - \sum\limits_{j = 1}^{n} {\text{var} \left[ {S_{j} \left( {\theta |Z_{j} } \right)} \right]} $$
(17)

The first term in the above expression (Eq. 17) is

$$ E\left[ {B_{j} (\theta |Z_{j} )} \right] = \sum\limits_{g = 1}^{3} {p_{j}^{*} (g)D_{j}^{T} (g)W_{j} (g)D_{j} (g)} $$
(18)

which is the expected value of the negative Hessian matrix. The second term of Eq. (17) is

$$ \text{var} \left[ {S_{j} (\theta |Z_{j} )} \right] = \sum\limits_{g = 1}^{3} {p_{j}^{*} (g)\left[ {S_{j} (\theta |g) - \bar{S}_{j} (\theta )} \right]\left[ {S_{j} (\theta |g) - \bar{S}_{j} (\theta )} \right]^{T} } $$
(19)

which is the variance matrix of the score vector, where

$$ S_{j} (\theta |g) = D_{j}^{T} (g)W_{j} (g)\left( {Y_{j} - \mu_{j} (g)} \right) $$
(20)

and

$$ \bar{S}_{j} (\theta ) = \sum\limits_{g = 1}^{3} {p_{j}^{*} (g)S_{j} (\theta |g)} $$
(21)
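Once the genotype-specific quantities are available, the per-individual contribution to Eq. (17) is a short bookkeeping exercise. The sketch below assumes that the three matrices \( D_{j}^{T} (g)W_{j} (g)D_{j} (g) \) and the three score vectors of Eq. (20) have already been computed elsewhere; the function name and the toy numbers are hypothetical.

```python
def louis_information_j(post, B, S):
    """Contribution of one individual to the information matrix of Eq. (17).

    post: posterior probabilities p*_j(g), g = 1..3
    B:    three matrices D_j(g)^T W_j(g) D_j(g), one per genotype
    S:    three score vectors S_j(theta | g), Eq. (20)
    """
    m = len(S[0])
    # Eq. (21): posterior mean of the score vector
    s_bar = [sum(post[g] * S[g][a] for g in range(3)) for a in range(m)]
    info = [[0.0] * m for _ in range(m)]
    for g in range(3):
        dev = [S[g][a] - s_bar[a] for a in range(m)]
        for a in range(m):
            for b in range(m):
                # Eq. (18) minus Eq. (19), accumulated genotype by genotype
                info[a][b] += post[g] * (B[g][a][b] - dev[a] * dev[b])
    return info

# Toy one-parameter example with made-up numbers
info = louis_information_j(
    post=[0.5, 0.5, 0.0],
    B=[[[2.0]], [[4.0]], [[6.0]]],
    S=[[1.0], [-1.0], [0.0]],
)
```

In the toy example the expected complete-data information is 3 and the score variance is 1, so the adjusted information is 2. Summing these contributions over all individuals gives \( I(\theta ) \); if the genotype is known with certainty, the variance term vanishes and Eq. (8) is recovered.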

Approximation under the heterogeneous variance model

The EM implemented mixture model approach described above is computationally intensive due to (1) the mixture model itself and (2) the extra step in calculating the information matrix of parameters. Here, we introduce an approximate method that replaces the mixture model by a heterogeneous residual variance model. Let

$$ U_{j} = E(Z_{j} ) = \sum\limits_{g = 1}^{3} {p_{j} (g)H_{g} } $$
(22)

be the conditional expectation of \( Z_{j} \) given marker information and

$$ \Upsigma_{j} = {\text{var}}\left( {Z_{j} } \right) = \sum\limits_{g = 1}^{3} {p_{j} \left( g \right)\left( {H_{g} - U_{j} } \right)^{T} \left( {H_{g} - U_{j} } \right)} $$
(23)

be the corresponding conditional variance–covariance matrix of \( Z_{j} \). If \( Z_{j} \) were observed, we would have

$$ \mu_{jk} = E\left( {Y_{jk} } \right) = \Upphi \left( {\alpha_{k} + X_{j} \beta + Z_{j} \gamma } \right) - \Upphi \left( {\alpha_{k - 1} + X_{j} \beta + Z_{j} \gamma } \right) $$
(24)

When \( Z_{j} \) is missing, we can replace \( Z_{j} \) by \( U_{j} \) and adjust for the overdispersion caused by the substitution,

$$ \mu_{jk} = E\left( {Y_{jk} } \right) \approx \Upphi \left[ {{\tfrac{1}{{\sigma_{j} }}}\left( {\alpha_{k} + X_{j} \beta + U_{j} \gamma } \right)} \right] - \Upphi \left[ {{\tfrac{1}{{\sigma_{j} }}}\left( {\alpha_{k - 1} + X_{j} \beta + U_{j} \gamma } \right)} \right] $$
(25)

where

$$ \sigma_{j}^{2} = \gamma^{T} \Upsigma_{j} \gamma + 1 $$
(26)

is the heterogeneous dispersion (see Xu 1998).
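The quantities in Eqs. (22), (23) and (26) depend only on the marker-based genotype probabilities and on γ, so they are cheap to compute. A minimal sketch follows, with hypothetical function names and effect values:

```python
H = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]   # rows of H, Eq. (10)

def expected_z(prior):
    """U_j = sum_g p_j(g) H_g, Eq. (22); a row vector of length 2."""
    return [sum(p * h[c] for p, h in zip(prior, H)) for c in range(2)]

def var_z(prior):
    """Sigma_j of Eq. (23): the 2 x 2 conditional covariance matrix of Z_j."""
    u = expected_z(prior)
    return [[sum(p * (h[a] - u[a]) * (h[b] - u[b]) for p, h in zip(prior, H))
             for b in range(2)] for a in range(2)]

def dispersion(prior, gamma):
    """sigma_j^2 = gamma^T Sigma_j gamma + 1, Eq. (26)."""
    s = var_z(prior)
    quad = sum(gamma[a] * s[a][b] * gamma[b] for a in range(2) for b in range(2))
    return quad + 1.0

prior = [0.25, 0.5, 0.25]           # no marker information at all
U = expected_z(prior)               # [0.0, 0.5]
Sigma = var_z(prior)                # [[0.5, 0.0], [0.0, 0.25]]
sigma2 = dispersion(prior, gamma=[0.3, 0.1])
```

At a fully informative marker the prior is degenerate, \( \Upsigma_{j} \) is a zero matrix and the dispersion equals one, so the approximation coincides with the model for observed genotypes.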

To explain the overdispersion, it is helpful to first introduce the liability model. Let

$$ l_{j} = X_{j} \beta + Z_{j} \gamma + \varepsilon_{j} $$
(27)

be the liability for individual j, where \( \varepsilon_{j} \sim N\left( {0,\sigma^{2} } \right) \) is the residual error of the liability. The liability is a latent variable that controls the observed ordinal phenotype through a series of thresholds \( \alpha = \left[ {\alpha_{1} ,\ldots,\alpha_{p} } \right]^{T} \) with \( T_{j} = k \) if \( \alpha_{k - 1} \le l_{j} < \alpha_{k} . \) The residual error variance is not estimable and thus we set \( \sigma^{2} = 1 \). When \( Z_{j} \) is replaced by \( U_{j} \), the variance of the liability becomes

$$ \sigma_{j}^{2} = \text{var} \left( {l_{j} } \right) = \gamma^{T} \text{var} \left( {Z_{j} } \right)\gamma + \text{var} \left( {\varepsilon_{j} } \right) = \gamma^{T} \Upsigma_{j} \gamma + 1 $$
(28)

Because \( \sigma_{j}^{2} \ge 1 \), the model is said to be overdispersed. Since \( \sigma_{j}^{2} \) varies from one individual to another, the model is also called the heterogeneous residual variance model. To adjust for the heterogeneous overdispersion, we replace \( \alpha_{k} + X_{j} \beta + U_{j} \gamma \) by \( {\tfrac{1}{{\sigma_{j} }}}\left( {\alpha_{k} + X_{j} \beta + U_{j} \gamma } \right). \) This adjustment serves as a way to standardize the liability so that the adjusted liability has a mean \( {\tfrac{1}{{\sigma_{j} }}}\left( {\alpha_{k} + X_{j} \beta + U_{j} \gamma } \right) \) and a unit variance.
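To get a feel for how close the rescaled probit is to the exact mixture, the sketch below compares the two cumulative probabilities for one individual. The exact value averages Φ over the three genotypes, whereas the approximation uses a single Φ with the argument divided by \( \sigma_{j} \). The effect sizes and function names are hypothetical, and the comparison is only an illustration, not a general error bound.

```python
from statistics import NormalDist

PHI = NormalDist().cdf
H = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]   # rows of H, Eq. (10)

def cumulative_mixture(alpha_k, xb, gamma, prior):
    """Exact marginal cumulative probability: average of Phi over the genotypes."""
    return sum(p * PHI(alpha_k + xb + h[0] * gamma[0] + h[1] * gamma[1])
               for p, h in zip(prior, H))

def cumulative_heterogeneous(alpha_k, xb, gamma, prior):
    """Approximation used in Eq. (25): one Phi with a rescaled argument."""
    zg = [h[0] * gamma[0] + h[1] * gamma[1] for h in H]        # H_g gamma
    mean = sum(p * v for p, v in zip(prior, zg))               # U_j gamma
    var = sum(p * (v - mean) ** 2 for p, v in zip(prior, zg))  # gamma^T Sigma_j gamma
    sigma = (var + 1.0) ** 0.5                                 # Eq. (26)
    return PHI((alpha_k + xb + mean) / sigma)

prior = [0.25, 0.5, 0.25]
exact = cumulative_mixture(0.0, 0.0, [0.3, 0.1], prior)
approx = cumulative_heterogeneous(0.0, 0.0, [0.3, 0.1], prior)
gap = abs(exact - approx)
```

For these modest effects the two values agree to about three decimal places; the agreement is exact whenever the genotype is known with certainty.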

Unlike the mixture model, the expectation \( \mu_{j} \) is no longer a function of g. Similarly, the weight matrix is

$$ W_{j} = {\text{diag}}^{ - 1} \left( {\mu_{j} } \right) $$
(29)

This modification leads to a change in matrix \( D_{j} \), the partial derivatives of \( \mu_{j} \) with respect to the parameters, which is given in Appendix A. The same iteration equation given in Eq. (6) is used here for the heterogeneous residual variance model. The attractive property of this approximation is that the information matrix is given directly by Eq. (8); no adjustment is required because the EM algorithm is not used.

Extension to other traits

Ordinal traits are the most commonly observed discrete traits in QTL mapping experiments. Other discrete traits also commonly seen in QTL mapping experiments are binary traits, binomial traits and Poisson traits. This section is dedicated to these commonly observed discrete traits. The mixture model algorithm and the heterogeneous variance approximation apply to all traits as long as the traits can be analyzed under the generalized linear model. To apply the algorithms to any specific trait, we only need to find: (1) the distribution of the trait (probability density of the data point), (2) the expectation of the data point, (3) the weight (inverse of the variance) of the data point and (4) the partial derivative of the expectation with respect to the parameters. We now introduce these discrete traits and leave the details of the formulas in Appendix A for interested readers.

Binary traits

Binary traits can be treated as a special case of ordinal traits with p = 1. Without any modification, the method developed for ordinal traits can be applied to binary traits with \( Y_{j} = \left[ {Y_{j1} \,\,Y_{j2} } \right]^{T} \) defined as a 2 × 1 vector. Each of the two components is defined as a binary variable and the two components are perfectly correlated. Here, we simplify the problem by defining \( Y_{j} \) as a univariate binary trait. This univariate treatment not only saves computing time but also simplifies the notation. We now use the univariate definition to define the binary phenotype,

$$ Y_{j} = \left\{ {\begin{array}{*{20}l} 1 \\ 0 \\ \end{array} } \right.\begin{array}{*{20}l} {} \\ {} \\ \end{array} \begin{array}{*{20}l} {\text{for presence of the trait}} \\ {\text{for absence of the trait}} \\ \end{array} $$
(30)

The expectation and variance of the phenotype given the parameter value are

$$ E(Y_{j} ) = \mu_{j} = \Upphi \left( {X_{j} \beta + Z_{j} \gamma } \right) $$
(31)

and

$$ \text{var} (Y_{j} ) = V_{j} = \mu_{j} (1 - \mu_{j} ) $$
(32)

respectively. The probability distribution is

$$ p\left( {Y_{j} } \right) = \mu_{j}^{{Y_{j} }} \left( {1 - \mu_{j} } \right)^{{1 - Y_{j} }} $$
(33)

Details for the mixture model and the heterogeneous variance model are given in Appendix A.

Binomial traits

Let \( n_{j} \) be the number of trials observed from individual j and \( m_{j} \) be the number of events observed for individual j. The binomial phenotype for individual j is defined as \( Y_{j} = m_{j} /n_{j} \) (expressed as a fraction so that \( 0 \le Y_{j} \le 1 \)). Under the probit model, the expectation and the variance of the phenotype are

$$ E(Y_{j} ) = \mu_{j} = \Upphi (X_{j} \beta + Z_{j} \gamma ) $$
(34)

and

$$ \text{var} (Y_{j} ) = V_{j} = {\frac{1}{{n_{j} }}}\mu_{j} (1 - \mu_{j} ) $$
(35)

respectively. The weight is

$$ W_{j} = V_{j}^{ - 1} = {\frac{{n_{j} }}{{\mu_{j} \left( {1 - \mu_{j} } \right)}}} $$
(36)

The probability distribution is

$$ p(Y_{j} ) = {\frac{{(n_{j} )!}}{{(n_{j} Y_{j} )!(n_{j} - n_{j} Y_{j} )!}}}\mu_{j}^{{n_{j} Y_{j} }} (1 - \mu_{j} )^{{n_{j} (1 - Y_{j} )}} $$
(37)

Details regarding the partial derivatives of the expectation with respect to the parameters both in the mixture model and in the heterogeneous variance model are given in Appendix A.
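For a binomial observation, the ingredients required by the algorithms (expectation, weight, derivative with respect to the linear predictor, and probability) can be collected in one small function. This is only a sketch: the function name is hypothetical, and the full derivative matrix additionally multiplies the returned factor by the design row, as detailed in Appendix A.

```python
from math import comb
from statistics import NormalDist

N01 = NormalDist()

def binomial_pieces(m_j, n_j, eta):
    """Return (mu, weight, d mu / d eta, probability) for m_j events in n_j trials.

    eta is the linear predictor X_j beta + Z_j gamma under the probit link.
    """
    mu = N01.cdf(eta)                                   # Eq. (34)
    weight = n_j / (mu * (1.0 - mu))                    # Eq. (36)
    dmu = N01.pdf(eta)                                  # chain-rule factor in D_j
    # binomial probability of m_j = n_j * Y_j events, Eq. (37)
    prob = comb(n_j, m_j) * mu ** m_j * (1.0 - mu) ** (n_j - m_j)
    return mu, weight, dmu, prob

mu, weight, dmu, prob = binomial_pieces(m_j=2, n_j=4, eta=0.0)
```

Note that the weight grows linearly with the number of trials, so individuals with more trials carry more information in the reweighted least squares equations.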

Poisson traits

Let \( Y_{j} = 0,\; 1,\ldots,\infty \) be the phenotype of a Poisson trait. The expectation and the variance of the phenotype are equal, \( E(Y_{j} ) = \text{var} (Y_{j} ) = \mu_{j} , \) where

$$ \mu_{j} = V_{j} = \exp \left( {X_{j} \beta + Z_{j} \gamma } \right) $$
(38)

The weight is

$$ W_{j} = V_{j}^{ - 1} = {\frac{1}{{\exp \left( {X_{j} \beta + Z_{j} \gamma } \right)}}} $$
(39)

The probability density is

$$ p(Y_{j} ) = {\frac{{\mu_{j}^{{Y_{j} }} }}{{(Y_{j} )!}}}\exp ( - \mu_{j} ) $$
(40)

Details regarding the partial derivatives of the expectation with respect to the parameters both in the mixture model and in the heterogeneous variance model are given in Appendix A.
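As a concrete check of the Poisson ingredients, the sketch below applies the iteration of Eq. (6) to the simplest possible case, an intercept-only model without a QTL term, for which the maximum likelihood estimate is known to be the log of the sample mean. The data and function names are made up for illustration.

```python
import math

def poisson_prob(y, mu):
    """Poisson probability of Eq. (40)."""
    return mu ** y / math.factorial(y) * math.exp(-mu)

def poisson_intercept(y, n_iter=50):
    """Fisher scoring (Eq. 6) for mu_j = exp(beta), an intercept-only model."""
    beta = 0.0
    for _ in range(n_iter):
        mu = math.exp(beta)                  # Eq. (38) with X_j = 1 and no QTL term
        d = mu                               # D_j = d mu_j / d beta
        w = 1.0 / mu                         # Eq. (39)
        info = sum(d * w * d for _ in y)     # sum_j D_j^T W_j D_j
        score = sum(d * w * (yj - mu) for yj in y)
        beta += score / info
    # inverse information approximates the variance of the estimate, Eq. (8)
    return beta, 1.0 / info

counts = [2, 3, 4, 3]                        # hypothetical tiller counts
beta_hat, var_beta = poisson_intercept(counts)
```

Here `beta_hat` converges to log(3), the log of the mean count, and `var_beta` to 1/12, the reciprocal of the total count.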

Hypothesis tests

Two different hypothesis tests are provided, the likelihood ratio test and the Wald test (1949). The likelihood ratio test requires evaluation of different likelihood functions (full model and reduced model). The Wald test is much simpler than the likelihood ratio test because it requires only \( \text{var} (\hat{\theta }) \approx I^{ - 1} (\hat{\theta }) \) and the estimated parameters to form the test statistic. Let \( V_{\gamma } = \text{var} (\hat{\gamma }) \) be the submatrix of \( \text{var} (\hat{\theta }) \) that corresponds to γ. The Wald test statistic for the null hypothesis, \( H_{0} :\gamma = 0, \) is

$$ {\text{Wald}} = \hat{\gamma }^{T} V_{\gamma }^{ - 1} \hat{\gamma } $$
(41)

Recall that \( \gamma = \left[ {a\,\,\,d} \right]^{T} \) represents both the additive and dominance effects. Under the null hypothesis, the Wald test statistic approximately follows the Chi-square distribution with two degrees of freedom. Therefore, the Wald test is comparable to the likelihood ratio test statistic (McCullagh and Nelder 1989). When testing a single parameter (either a or d), the Wald test statistic is equivalent to the Chi-square test with one degree of freedom. Therefore, some authors have used the Wald test and the Chi-square test interchangeably (e.g., Han and Xu 2008).
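Equation (41) is straightforward to evaluate for an F2 population, where γ has two elements. The sketch below uses the closed-form inverse of a 2 × 2 matrix and the fact that the Chi-square distribution with two degrees of freedom has the survival function exp(−x/2). The estimates and covariance values are invented for the example.

```python
import math

def wald_test(gamma_hat, v_gamma):
    """Wald statistic of Eq. (41) for gamma = [a, d] and its 2 x 2 covariance matrix."""
    (v11, v12), (v21, v22) = v_gamma
    det = v11 * v22 - v12 * v21
    # closed-form inverse of the 2 x 2 covariance matrix
    inv = [[v22 / det, -v12 / det], [-v21 / det, v11 / det]]
    a, d = gamma_hat
    stat = (a * (inv[0][0] * a + inv[0][1] * d)
            + d * (inv[1][0] * a + inv[1][1] * d))
    # chi-square with 2 df: P(X > x) = exp(-x / 2)
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

stat, p_value = wald_test([1.0, 0.0], [[0.25, 0.0], [0.0, 0.25]])
```

With these made-up numbers the statistic is 4 and the p value is exp(−2), roughly 0.135.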

Applications

Binary traits

This example demonstrates the application of the generalized linear model to QTL mapping for binary traits in wheat. The experiment was conducted by Dou et al. (2009) who made the data available to us for this analysis. A female sterile line XND126 and an elite cultivar Gaocheng 8901 with normal fertility were crossed for genetic analysis of female sterility measured as a binary trait. The parents and their F1 and F2 progeny were planted at Huaian experimental station in China for the 2006–2007 growing season under the normal autumn sowing condition. The mapping population was an F2 family consisting of 234 individual plants. The binary trait was the presence of seed setting of the female plants. About five-sixths of the F2 progeny had seeded spikelets (phenotype 1) and the remaining one-sixth of the plants did not have seeded spikelets (phenotype 0). This is a typical binary trait regarding the presence of seeds. Among the plants that had seeded spikelets, the number of seeded spikelets varied and can be treated as a binomial trait for further QTL analysis (see Sect. “Binomial traits” for binomial trait QTL mapping). A total of 28 SSR markers were used in this experiment. These markers covered 5 chromosomes of the wheat genome with an average marker density of 15.5 cM per marker interval. The 5 chromosomes are only part of the wheat genome.

These chromosomes were scanned for QTL of the binary trait using both the mixture model and the heterogeneous variance model. The LOD profiles of the two methods are shown in Fig. 1. Two QTL on chromosome 2 and one QTL on chromosome 5 have been detected with high LOD score. The chromosome-wise empirical threshold values are lower than LOD 3. With the chromosome-wise threshold values, we detected one more QTL on chromosome 1. The estimated QTL parameters are listed in Table 1. The two models (mixture model and heterogeneous variance model) produced very similar results.

Fig. 1

The LOD profiles for the female sterility (binary) trait of wheat. a The upper panel shows the LOD profile for the heterogeneous variance model; b the lower panel shows the LOD profile for the mixture model. The chromosomes are separated by the vertical gray reference lines

Table 1 The estimated QTL parameters of the wheat female sterility (binary) trait

Binomial traits

The same experiment conducted by Dou et al. (2009) also recorded the number of seeded spikelets and the total number of spikelets for each plant. The ratio of the two records is a binomial trait. The same mapping population and the same linkage map were also used for the binomial trait QTL mapping. Again, both the mixture model and the heterogeneous variance model were used for the binomial trait analysis. Unfortunately, the mixture model failed to generate meaningful results, and therefore the results of the mixture model analysis are not reported. In chromosome regions where there were no QTL, the mixture model generated results similar to those of the heterogeneous variance model. For regions with large QTL, the mixture model approach failed to converge. The possible reason for the failure will be presented in Sect. “Discussion”. We now focus on the result of the heterogeneous variance model. The LOD profile is shown in Fig. 2. First, the pattern of the LOD profile is similar to that of the binary trait analysis. There is strong evidence for one QTL on chromosome 1 and two QTL on chromosome 2. The LOD score here for the binomial trait is not on the same scale as that of the binary trait. The highest LOD is about 1,000, almost a hundred times that of the binary trait. This inflation of the LOD reflects the increased power of the binomial data analysis relative to the binary data analysis. Regions of other chromosomes also show LOD scores higher than the empirical chromosome-wise threshold values. These include chromosomes four and five. The estimated QTL parameters are listed in Table 2.

Fig. 2

The LOD score profile of the female sterility (binomial) trait in wheat under the heterogeneous variance model. The mixture model failed to converge to the correct solution and thus its results are not presented. The chromosomes are separated by the vertical gray reference lines

Table 2 Estimated QTL parameters for the wheat female sterility (binomial) trait from the heterogeneous variance model analysis

Poisson traits

This example demonstrates the application of the generalized linear model to QTL mapping for traits with a Poisson distribution. The data were provided by Dr. Gangqiang Cao at Zengzhou University, China. The result has not been published in any form. The mapping population was a doubled haploid family of rice derived from the cross between IR64 and Azucena. The trait analyzed was the tiller number with an assumed Poisson distribution. The sample size was n = 110 and the number of markers was 175. These markers covered 12 chromosomes (2,031 cM in total length) of the rice genome with an average marker interval of 11.6 cM. This dataset was different from that used by Cui et al. (2006). Both experiments were initialized from the same line cross with the same linkage map, but the experiments were conducted at different times and locations by different investigators. Interval mapping under both the mixture model and the heterogeneous variance model was applied to the data. The LOD score profiles obtained from the two different methods are depicted in Fig. 3. The LOD score from the heterogeneous variance model is slightly higher than that of the mixture model, but the difference is almost negligible. If LOD = 3 were used as the criterion for significance, two QTL would have been detected, on chromosomes 1 and 4. We used the quick method of Piepho (2001) to calculate the empirical threshold for each chromosome and used these thresholds to declare significance of QTL. The LOD thresholds are substantially less than LOD 3. Using the chromosome-specific thresholds, we detected one more QTL at the end of chromosome 12. The estimated QTL parameters are listed in Table 3. The supporting interval for each estimated QTL position was determined by the one-LOD drop approach (Ooijen 1992). The two methods differ slightly in the estimated QTL positions and the supporting intervals. The supporting intervals of QTL positions for the heterogeneous variance model were consistently shorter than those of the mixture model. Overall, the difference between the two methods is very small and can be safely ignored for this data analysis.

Fig. 3

LOD profiles for the tiller number of rice (Poisson trait) QTL mapping. a The upper panel shows the LOD profile for the heterogeneous variance model; b the lower panel shows LOD profile for the mixture model. The chromosomes are separated by the vertical gray reference lines

Table 3 Estimated QTL parameters using the mixture model and the heterogeneous variance model for the tiller number (Poisson) trait QTL mapping experiment

Simulation studies

Binomial traits

This simulation experiment demonstrates the difference between the mixture model and the heterogeneous variance model for binomial trait QTL mapping. For the female sterility trait of wheat, the mixture model approach failed to converge for the binomial data analysis but succeeded for the binary data. Binomial data should be more informative than binary data, yet the additional information did the mixture model more harm than good, which is a typical symptom of inconsistency. The problem arose in the calculation of the posterior probability of the QTL genotype. The binomial density is equivalent to the product of multiple independent Bernoulli trials. This density is extremely sensitive to a change of genotype from one form to another, especially when the number of trials is large. This extreme sensitivity of the binomial density to the QTL genotype led to degeneracy of the posterior probability of the QTL genotype.

We conducted a small-scale simulation experiment to demonstrate this problem of the mixture model. We simulated a QTL at position 55 cM of a chromosome 100 cM in length. Six markers were placed evenly on the chromosome with 20 cM per marker interval. The sample size was 500 for an F2 population. The binomial trait phenotype was simulated for each plant with a constant number of trials for all plants. We simulated the following numbers of trials in four experiments: 20, 40, 80 and 160. The LOD score profiles obtained from the mixture model and the heterogeneous variance model are shown in Fig. 4 for all four experiments. When the number of trials was 20 or 40, the LOD profiles of the two methods were very similar. As the number of trials increased to 80 and 160, the differences between the two models became dramatic: the LOD profile of the mixture model deviated drastically from that of the heterogeneous variance model wherever the putative QTL position was off a marker position. The strange leaf-like pattern of the mixture model LOD profile reflects the instability and failure of the mixture method.
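The sharpening of the posterior probability with the number of trials can be illustrated with a minimal sketch. It is not the analysis used in this study: the prior (the F2 ratio 1:2:1 with no marker information), the three genotype-specific event probabilities and the observed proportion are hypothetical values chosen only for illustration.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k events in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def genotype_posterior(k, n, prior, probs):
    """Posterior probability of each QTL genotype given k events in n trials."""
    log_w = [math.log(pr) + log_binom_pmf(k, n, p) for pr, p in zip(prior, probs)]
    m = max(log_w)                       # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical values, not taken from the study
prior = [0.25, 0.50, 0.25]   # F2 segregation ratio, no marker information
probs = [0.40, 0.50, 0.60]   # assumed event probabilities of the three genotypes

for n in (20, 40, 80, 160):
    k = round(0.6 * n)       # same observed proportion (0.6) at every n
    post = genotype_posterior(k, n, prior, probs)
    print(n, [round(v, 4) for v in post])
```

With the observed proportion held fixed, the posterior mass moves steadily toward a single genotype as the number of trials grows, which is the mechanism behind the degeneracy described above.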

Fig. 4

Comparison of the LOD profiles for the mixture model (dotted line) and the heterogeneous variance model (solid line) for a binomial trait in a simulation experiment. The sample size was 500 and the chromosome was 100 cM long, covered by 6 evenly distributed markers (20 cM per marker interval). The binomial trait is measured as the ratio of the number of events to the number of trials. a The number of trials per plant was 20; b the number of trials per plant was 40; c the number of trials per plant was 80; d the number of trials per plant was 160

Poisson traits

This simulation study evaluated the differences between the EM-implemented mixture model and the heterogeneous residual variance model of interval mapping for Poisson traits. The simulation experiments were much more comprehensive than the binomial experiment above. The evaluation criteria were the statistical powers, the test statistic profiles, the estimation errors of the QTL parameters, the biases of the estimates and the computing times. The factors considered were the marker density (A), the size of the QTL (B), the mean of the trait (C), the sample size (D) and the QTL position (E).

A single chromosome 100 cM in length was simulated for an F2 population derived from the cross of two inbred lines. The chromosome was covered evenly by the following numbers of markers: 101, 51, 21, 11 and 6. These correspond to 1, 2, 5, 10 and 20 cM per marker interval. The size of the QTL (additive effect a) was investigated at the following five levels: 0.142, 0.324, 0.471, 0.707 and 0.926. These five levels of the additive effect correspond to the following levels of the heritability (\( h^{2} \)): 0.01, 0.05, 0.10, 0.20 and 0.30. For simplicity, the dominance effect was neither simulated nor included in the model for data analysis. The mean of the Poisson trait was determined by the non-QTL effect β (the intercept, a scalar because no other non-QTL effects were simulated). The five levels of the intercept (β) were −1.0, −0.5, 0.0, 0.5 and 1.0. The sample size was investigated at the following five levels: 100, 200, 300, 500 and 600. The simulated QTL was located at the following positions: 10, 25, 50, 75 and 90 cM. The true QTL parameter values and the experimental parameters for this comprehensive simulation experiment are summarized in Table 4.
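The text does not state how the additive effects were derived from the heritabilities. The listed values are numerically consistent with an F2 genetic variance of \( a^{2}/2 \) (no dominance) and a residual variance of 1, so that \( h^{2} = (a^{2}/2)/(a^{2}/2 + 1) \). This relationship is an assumption made here for illustration; the function name below is hypothetical.

```python
import math

def additive_effect(h2, residual_var=1.0):
    """Additive effect a for heritability h2, assuming an F2 genetic
    variance of a^2 / 2 and the given residual variance (assumed 1.0)."""
    return math.sqrt(2.0 * h2 * residual_var / (1.0 - h2))

for h2 in (0.01, 0.05, 0.10, 0.20, 0.30):
    print(h2, round(additive_effect(h2), 3))
```

Rounded to three decimals, this mapping gives 0.142, 0.324, 0.471, 0.707 and 0.926, matching the five levels of the additive effect.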

Table 4 Annotation of the simulation experiments

The total number of combinations for all five factors is \( 5^{5} \) = 3,125. If each combination were simulated 1,000 times, the workload would be very heavy. Therefore, we decided to evaluate a small subset of the treatment combinations to draw conclusions. We chose sample size n = 300, marker interval 5 cM, QTL size \( a = 0.471 \) (\( h^{2} = 0.10 \)), non-QTL effect β = 0.0 and QTL position at 50 cM as the basic experimental setup (the central level for each of the five factors). Under the basic setup, we then evaluated one factor at a time by expanding to all five levels of the factor of interest with the remaining factors held at their basic levels. This reduced the total number of experiments from 3,125 to 5 × 4 + 1 = 21. Each of the 21 treatment combinations was replicated 1,000 times to examine the empirical statistical powers and the estimation errors of the parameters for comparison of the two models. The critical value for QTL detection was determined by simulating an additional 1,000 samples under the null QTL model (a = 0.0) for each of the treatment combinations. The empirical power for each experiment was the proportion of the 1,000 replicates in which the QTL was detected.
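The reduction from the full factorial to 21 treatment combinations can be checked by enumerating the one-factor-at-a-time design. The sketch below uses the factor levels given in the text; the dictionary keys and the function name are hypothetical labels.

```python
# Factor levels as listed in the text; key names are illustrative labels
levels = {
    "marker_interval_cM": [1, 2, 5, 10, 20],
    "additive_effect": [0.142, 0.324, 0.471, 0.707, 0.926],
    "intercept": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "sample_size": [100, 200, 300, 500, 600],
    "qtl_position_cM": [10, 25, 50, 75, 90],
}

# Basic setup: the central (third) level of every factor
baseline = {name: vals[2] for name, vals in levels.items()}

def one_factor_at_a_time(levels, baseline):
    """Baseline design plus every design that changes exactly one factor."""
    designs = [dict(baseline)]
    for name, vals in levels.items():
        for v in vals:
            if v != baseline[name]:
                designs.append({**baseline, name: v})
    return designs

designs = one_factor_at_a_time(levels, baseline)
print(len(designs))  # 5 factors x 4 non-central levels + 1 baseline
```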

Before we discuss the biases and errors of the estimated QTL parameters, let us look at Fig. 5, which shows the average LOD test statistic profiles from the 1,000 replicates for all the experiments. The solid and dotted curves represent the LOD score profiles for the heterogeneous variance model and the mixture model, respectively. The straight horizontal lines are the critical values used to declare QTL significance for the power studies. These critical values were drawn from 1,000 additional simulations under the null model (a = 0). Figure 5 provides a rich source of information regarding the behavior of the two models. We highlight only a few important points here. First, at marker positions, the LOD scores for the heterogeneous variance model and the mixture model are identical, as expected, but off the markers the mixture model consistently produced higher LOD scores than the heterogeneous variance model. The higher LOD scores of the mixture model did not translate into higher powers because the critical values for the mixture model were also higher than those for the heterogeneous variance model. Second, the leaf-like patterns of the LOD score profiles of the mixture model became more severe as the marker interval increased. This reflects an intrinsic flaw of the mixture model. In contrast, the heterogeneous variance model produced much smoother LOD profiles.
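The empirical critical values and powers can be computed along the following lines. This is a sketch under stated assumptions, not the procedure coded for this study: each replicate is assumed to be summarized by a single test statistic (for example, the maximum LOD score along the chromosome), the critical value is taken so that at most a fraction alpha of the null replicates exceed it, and both function names are hypothetical.

```python
def empirical_critical_value(null_stats, alpha=0.05):
    """Smallest observed null statistic exceeded by at most a fraction
    alpha of the null replicates (assumed definition of the threshold)."""
    ordered = sorted(null_stats)
    n = len(ordered)
    n_exceed = int(alpha * n + 1e-9)   # small tolerance guards against float rounding
    return ordered[max(0, n - n_exceed - 1)]

def empirical_power(alt_stats, critical_value):
    """Proportion of replicates whose statistic exceeds the critical value."""
    return sum(s > critical_value for s in alt_stats) / len(alt_stats)

# Toy numbers for illustration only, not simulation results
null_stats = list(range(100))
alt_stats = [90, 95, 100, 110]
cv = empirical_critical_value(null_stats, alpha=0.05)
print(cv, empirical_power(alt_stats, cv))
```

Because each model is judged against its own null distribution, a model with uniformly higher LOD scores does not automatically gain power, as noted for the mixture model above.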

Fig. 5

Average LOD score profiles of the simulation experiments of QTL mapping for a Poisson trait. The solid curve represents the average LOD score profile of the heterogeneous variance model while the dotted curve represents the average LOD score profile of the mixture model. The solid and dotted horizontal lines denote the empirical critical values under a Type I error of 0.05 used to declare the significance of the QTL for the heterogeneous variance model and the mixture model, respectively. Labels A, B, C, D and E represent five different factors (treatments) that affect the results of QTL mapping. These factors are marker density (A), QTL size measured in heritability (B), intercept (C), sample size (D) and QTL position (E). Numbers 1, 2, 3, 4 and 5 stand for the five levels within each factor. Detailed information about the factors and levels within each factor can be found in Table 4

We now compare the empirical statistical powers for the two models (see the top panels in Fig. 6). The solid and open symbols represent the powers for the heterogeneous variance model and the mixture model, respectively. In all cases, the heterogeneous variance model had a slightly higher power than the mixture model, although the difference was almost negligible. Some factors had large effects on the powers and others had small effects. In general, the power declined as the marker interval increased. Sample size and the size of the QTL were the most influential factors on the powers. The intercept and the position of the true QTL had little influence on the powers.

Fig. 6

Estimated QTL parameters and empirical powers of the simulation experiments of QTL mapping for a Poisson trait. Labels A, B, C, D and E represent five treatments (factors). Within each treatment, there are five different levels represented by levels 1 (up triangle), 2 (upside down triangle), 3 (circle), 4 (square) and 5 (diamond). The solid (filled) symbol represents result of the heterogeneous variance model while the open symbol represents the result of the mixture model. The vertical bars represent the standard deviations calculated from the 1,000 replicated experiments. The horizontal lines represent the true values of the QTL parameters. See Table 4 for the detailed information about the factors and levels within factors

Let us turn to the panels in the second row of Fig. 6 to examine the factors that affect the estimated QTL positions. The solid and open symbols indicate the average estimated QTL positions for the two models obtained from the 1,000 replications. The vertical bars represent the standard deviations of the estimated QTL position calculated from the 1,000 replicated experiments. The red horizontal lines indicate the true positions of the simulated QTL. Both models are unbiased and the standard deviations of the estimated positions are very similar for the two models. The size of the QTL and the sample size had the largest influence on the accuracy of the estimated QTL position.

The panels in the third row of Fig. 6 show the average estimated intercepts and the standard deviations obtained from the 1,000 replicated simulations. The two models had very similar estimated intercepts, but both models were biased upward.

The panels at the bottom of Fig. 6 show the results of the estimated QTL effect (a) for the two models. The heterogeneous variance model was unbiased but the mixture model was consistently biased upward. The bias of the mixture model became more serious as the marker interval increased.

Overall, the heterogeneous variance model performed consistently better than the mixture model. This observation was not expected. Our original purpose in proposing the heterogeneous variance model was to improve the computational efficiency. We did not anticipate that the mixture model would be inconsistent and perform poorly. The simulation experiments did show that the heterogeneous variance model took, on average, about one-third of the computing time of the mixture model. To our surprise, the heterogeneous variance model outperformed the mixture model in all cases considered in the experiments.

Discussion

The most commonly used algorithm for missing values in the generalized linear model is the EM algorithm (Horton and Laird 1998). We extended the EM algorithm to interval mapping, where the independent variable (genotype indicator variable) is missing for all individuals. We also developed a heterogeneous variance model to approximate the mixture model. Both the binary data and the Poisson data analyses showed that the two methods generated similar results, but the heterogeneous variance model is computationally faster than the mixture model-based EM algorithm. To our surprise, the mixture model approach does not always estimate the QTL parameters better than the heterogeneous variance model. The binomial data analysis showed that the mixture model approach failed to converge to the correct values for some marker intervals while the heterogeneous variance model worked very well. There are several possible reasons for the failure: (1) low information content for large marker intervals, (2) large QTL effects and (3) inconsistency of the EM algorithm. For the first reason, a large marker interval means that the markers provide little information about the genotype of a putative QTL in the middle of the interval. The posterior probability of the QTL genotype is then largely determined by the data and the current parameter values. In the end, the posterior probabilities become degenerate (probability equal to unity for one genotype and zero for all other genotypes) for all individuals. This phenomenon was observed only for large intervals. The second reason for the failure is large QTL effects. We noticed that the failure of the mixture model approach did not happen in intervals that do not contain a QTL, even though some of those intervals are very large. It is the combination of a large interval with a large QTL effect that caused the failure. The third reason for the failure is the inconsistency (intrinsic flaw) of the mixture model. More extensive simulation experiments using the Poisson data further justified the heterogeneous variance model for interval mapping.

Most QTL mapping experiments in the future will be done with a high marker density. In that case, the mixture model and the heterogeneous variance model will differ negligibly in efficiency, with the latter slightly preferable because of its lighter computational load. Theory and methods of interval mapping under the mixture model are well developed both for normally distributed traits and for discrete traits. The variance–covariance matrix of the estimated parameters is also available, but only for interval mapping of normally distributed traits (Kao and Zeng 1997). This research is the first attempt to develop the covariance matrix of estimated QTL parameters under the generalized linear mixture model. With the variance–covariance matrix of the estimated QTL effects available, we can choose to perform either the Wald test or the likelihood ratio test. The two test statistics are asymptotically the same, with the latter slightly preferable when the sample size is small (Frank 2001).

The approximate heterogeneous variance model was originally developed by one of us for normal trait QTL mapping (Xu 1998). We demonstrated here that it works equally well for the generalized linear model. Simulation experiments and real data analysis all demonstrated that the approximation is very close to, and sometimes better than, the mixture model. The advantage of the approximate method is that it avoids the mixture model and thus the EM algorithm. As a result, computing the variance–covariance matrix of the estimated parameters becomes straightforward, as a by-product of the iteration process. In addition, the heterogeneous variance model appears to be more robust and stable than the mixture model-based EM algorithm according to our binomial data analysis and the simulation study.

Both the mixture model maximum likelihood and the heterogeneous variance approximation have been coded in an existing QTL mapping program called PROC QTL (Hu and Xu 2009). This program is a user-defined SAS procedure. Users need to specify the distribution of the data. The default distribution is normal. The current version of PROC QTL can handle binary, binomial, ordinal and Poisson distributions. More distributions will be added in the future, depending on the availability of the data. The program is downloadable from our Web site (http://www.statgen.ucr.edu).