# Mixed model approaches for the identification of QTLs within a maize hybrid breeding program

## Abstract

Two outlines for mixed model based approaches to quantitative trait locus (QTL) mapping in existing maize hybrid selection programs are presented: a restricted maximum likelihood (REML) and a Bayesian Markov Chain Monte Carlo (MCMC) approach. The methods use the in-silico-mapping procedure developed by Parisseaux and Bernardo (2004) as a starting point. The original single-point approach is extended to a multi-point approach that facilitates interval mapping procedures. For computational and conceptual reasons, we partition the full set of relationships from founders to parents of hybrids into two types of relations by defining so-called intermediate founders. QTL effects are defined in terms of those intermediate founders. Marker based identity by descent relationships between intermediate founders define structuring matrices for the QTL effects that change along the genome. The dimension of the vector of QTL effects is reduced by the fact that there are fewer intermediate founders than parents. Furthermore, additional reduction in the number of QTL effects follows from the identification of founder groups by various algorithms. As a result, we obtain a powerful mixed model based statistical framework to identify QTLs in genetic backgrounds relevant to the elite germplasm of a commercial breeding program. The identification of such QTLs will provide the foundation for effective marker assisted and genome wide selection strategies. Analyses of an example data set show that QTLs are primarily identified in different heterotic groups and point to complementation of additive QTL effects as an important factor in hybrid performance.

### Abbreviations

- BLUE: Best linear unbiased estimator
- GCA: General combining ability
- GS: Genome wide selection
- MAS: Marker assisted selection
- MCMC: Markov chain Monte Carlo
- QTL: Quantitative trait locus
- REML: Restricted maximum likelihood
- SCA: Specific combining ability
- HP: Hybrid performance

## Introduction

The transition from open-pollinated populations to double-cross and then single-cross hybrids in maize breeding was a major component of the long-term genetic gain for yield of maize in the US Corn Belt. It is tempting to think that increasing heterosis, i.e., an increasing difference between hybrid performance and average parental performance, was an important driving force behind this gain. In a study of a time series of maize hybrids released between the early 1930s and 2001, Duvick et al. (2004) concluded that heterosis played a minor role in the improved performance of present-day hybrids; the role of additive gene action was found to be far more important. Nevertheless, the phenomenon of heterosis remains of interest and papers on the subject keep appearing. Remarkably, hardly any consensus exists about the genetic mechanism(s) that may underlie the phenomenon: dominance, overdominance, pseudo-overdominance, epistasis, or combinations of these components. For example, Frascaroli et al. (2007) emphasized the role of dominance and overdominance, whereas they found little evidence for epistasis. In contrast, Bernardo (1996a, b), in line with Duvick et al. (2004), proposes to omit specific combining ability (SCA) from the analysis of hybrid performance, implying a negligible role for dominance and overdominance. Similar conclusions on the low impact of SCA were recently reached by Fischer et al. (2008) and Schrag et al. (2009). Deviating from this trend, Melchinger et al. (2007) present an argument for the importance of epistasis in heterosis.

Although the importance of heterosis and its underlying mechanism still elicit discussion, the heterosis question constitutes only a part of the more relevant and encompassing question concerning the prediction of hybrid performance (HP). Nowadays, the key question for HP prediction is whether marker information on the parents, or on related inbred lines, suffices for HP prediction, thereby obviating field evaluations of the particular hybrids themselves. To enable HP prediction for a range of quantitative traits, we are interested in making use of pedigree relations (coancestry) between the lines in the parental generations, of phenotypic records on hybrid performance for hybrids other than the ones to be predicted, and of marker information from the parent generation.

Two types of approaches can be distinguished with respect to HP prediction. First, there is a class of approaches that can be characterized as distance methods: either HP or SCA, as part of HP, is regressed on marker information, whether in the form of a single predictor or a small set of predictors derived from operations on a matrix of similarity coefficients or coancestry coefficients (Charcosset et al. 1998) or in the form of a predictor set that consists of a subset of coded markers (Vuylsteke et al. 2000; Schrag et al. 2006, 2009). A second approach consists in a more classic elaboration of the quantitative genetic theory within the mixed model framework. This approach is advocated by Bernardo (1994, 1996a, b, 1999) and finds its culmination in Parisseaux and Bernardo (2004) and Yu et al. (2005). In the last two papers, the so-called method of in-silico-mapping is presented: the use of accumulated phenotypic data in public and private plant breeding programs for quantitative trait locus (QTL) mapping. Four advantages are mentioned for in-silico-mapping over classical QTL mapping using designed crosses: (1) larger mapping populations are available; (2) evaluation takes place in multiple environments, so that results will be applicable across a wide range of future growing environments; (3) a wide sample of germplasm and genetic backgrounds is tested, so fewer problems will occur with respect to the validity of predictions for other genetic backgrounds; (4) field data are already available, so no extra costs need to be made to obtain the phenotypic data. In addition, new inbred lines are routinely genotyped for multiple purposes within commercial breeding programs.

The two classical stages of hybrid breeding programs are the development of promising inbred lines followed by the identification and selection of superior hybrids created from crosses between the inbred lines. Reliable prediction of HP on the basis of information produced in the second stage (hybrid selection) of on-going breeding programs would be extremely useful. A prerequisite for such an HP prediction strategy is the availability of advanced QTL mapping methodology, i.e., methodology that is able to accommodate the specifics of phenotypic, genotypic, and pedigree data represented in hybrid selection programs. In this paper, we propose a mixed model based statistical framework to map QTLs in hybrid selection programs. We use and extend the in-silico-mapping QTL model first described in Parisseaux and Bernardo (2004) in our development. The latter approach was a single point analysis restricted to evaluations at marker positions. Interval mapping, a multi-point analysis, which was indicated as computationally prohibitive by Parisseaux and Bernardo in their 2004 paper, is possible using our approach. A further difference is that we model QTL effects as random, whereas these effects were modeled as fixed in Parisseaux and Bernardo (2004). The assumption of QTL effects being random allows us to impose structure on their variances and covariances. As a consequence, we are able to arrive at a QTL analysis that combines elements of linkage and linkage disequilibrium mapping, closely akin to the way in which Meuwissen et al. (2001) defined such a combined mapping strategy. This latter element is essential for any QTL mapping methodology adapted to the details of data generated from hybrid selection programs. The approach by Parisseaux and Bernardo (2004) focused on additive QTL effects. Our modeling framework can deal with both additive and dominance effects, although, in accordance with the remark by Bernardo (1999) about the negligibility of dominance effects in QTL mapping, we found a minor role for dominance in our illustration data and thus we will not further report on dominance aspects in this paper. Extensions to include epistasis in the models are currently under study.

Below we will first describe a restricted maximum likelihood (REML) based mixed model approach to QTL mapping in hybrid selection data, followed by a Bayesian Markov Chain Monte Carlo (MCMC) mixed model approach. The REML and Bayesian MCMC approaches share the same linear model structure and model terms, and they use equivalent definitions for the genetic relationships between the individuals in the pedigree. In the REML approach, the number of alleles per QTL locus is typically larger than two, whereas in the Bayesian approach this number is fixed at two. In the REML approach, a single QTL model is fitted at a grid of evaluation points across the genome. In the Bayesian approach, multi-QTL models with random numbers of causative loci located at random positions on the genome are fitted as competing models during the MCMC process. Following a description of the major theoretical aspects of the two mixed model approaches introduced above, we briefly illustrate their performance using data on ear height, a trait known to show intermediate heritability. The data stem from a maize hybrid selection program at Pioneer Hi-Bred International. The intention of this paper is to investigate the suitability of current mixed model methodology as run on standard PCs for mapping QTLs in hybrid selection programs; a feasibility study and not an exhaustive comparison of REML and Bayesian mixed models in a QTL mapping context (like, for example, Bauer et al. 2009).

## Data

The maize hybrid selection data used for illustration of our models can be considered to represent a generic example data set. Our calculations are based on proprietary data of Pioneer Hi-Bred International. The phenotypic trait analyzed was ear height, originally measured in inches from the ground to the node to which the ear was attached, but presented below in centimeters. The estimated heritability of ear height was 0.36. Ear height was available for 1,700 hybrids produced from crosses between inbred parents that belong to two heterotic groups: 1 and 2. Hybrids were evaluated on average at 15 locations during two growing seasons, 2004 and 2005, in the US Corn Belt.

At the hybrid level, the phenotypic data points used in the QTL analyses came from a two-stage analysis (Smith et al. 2001). In the first stage, trials were analyzed by location to compute hybrid means (Best Linear Unbiased Estimates, or BLUEs), which were stored with their relative weights. In the second stage, these hybrid-by-location BLUEs were analyzed in a multi-location analysis using an additive mixed model with locations and hybrids as fixed effects. The resulting across-location hybrid BLUEs obtained in the second stage were used as phenotypic data in the QTL analyses. We were interested in the average hybrid performance across locations and for that reason did not pursue analyses of genotype by environment interaction or QTL by environment interaction.

Our stage wise approach for calculating the hybrid means for use in the QTL analysis was dictated by the large computer requirements in our genetic (QTL) analysis; a one hit approach that would fit a genetic model starting from plot data was infeasible.

Heterotic group 1 consisted of 222 inbred parents, versus 213 inbred parents in group 2, making 435 parents in total. Pedigree data were available for the parents, where the pedigree was complete for three ancestral generations. Going back three generations in the pedigree, 62 inbred founder lines were defined for group 1, versus 55 for group 2, making 117 founders in total. For all 435 parents and most ancestors, 768 SNP markers were scored. Furthermore, using a proprietary estimation method, pedigree relationships, and denser marker coverage than the 768 SNP markers mentioned above, genetic relationship matrices between the 117 founders were calculated at 1 centi-Morgan intervals along the full length of the genome.

## REML approach

### Structure of pedigree and nomenclature

Following Fig. 1, we now define the different types of individuals that will appear in our analyses. At the highest level in the (considered) known pedigree we find the (intermediate) founders (F), at the lowest level the hybrids (H) with field trial data. As remarked above, additional quantitative relationship information was available at a grid along the genome beyond the level of the (intermediate) founders. Just above the hybrids, we see their parents (P), while in between parents and founders we have what we will call intermediate inbred lines (I). The numbers of founders, hybrids, parental lines and intermediate lines are given by, respectively, *n*_{F}, *n*_{H}, *n*_{P}, and *n*_{I}. To indicate the numbers of parents in heterotic group 1 and 2, we write *n*_{P1} and *n*_{P2}, and similarly *n*_{F1} and *n*_{F2} for the founders.

### A reference mixed model for QTL mapping

The reference model of Parisseaux and Bernardo (2004) can be written as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{M}_1\boldsymbol{\alpha}_1 + \mathbf{M}_2\boldsymbol{\alpha}_2 + \mathbf{Z}_1\mathbf{g}_1 + \mathbf{Z}_2\mathbf{g}_2 + \mathbf{e} \tag{1}$$

with **y** the vector of phenotypic observations, **X** the design matrix for adjustment of non-genetic effects [in the Parisseaux and Bernardo (2004) paper this included corrections for multi-location trials], and **β** the corresponding vector of fixed effects. The vectors **α**_{1} and **α**_{2} represented general combining abilities (GCA) associated with markers in group 1 and 2, and **g**_{1} and **g**_{2} represented GCA effects not related to markers in group 1 and 2. The dimensions of these vectors corresponded to the numbers of marker alleles in group 1 and 2 for one particular QTL, and the numbers of parent lines in groups 1 and 2 (*n*_{P1} and *n*_{P2}). Model (1) thus describes a model for a single QTL. The incidence matrices **M**_{1}, **M**_{2}, **Z**_{1}, and **Z**_{2} allocated the corresponding effects to the hybrids. The marker effects **α**_{1} and **α**_{2} were treated as fixed, while **g**_{1} and **g**_{2} were treated as random with a normal distribution and variance–covariance matrices **G**_{1}*V*_{G1} and **G**_{2}*V*_{G2}, where **G**_{1} and **G**_{2} were the *n*_{P1} × *n*_{P1} and *n*_{P2} × *n*_{P2} coancestry matrices within groups 1 and 2, and *V*_{G1} and *V*_{G2} group specific variance components. The error term **e** had a diagonal variance–covariance matrix with the entries on the diagonal depending on the reciprocal of the number of replicates.

### A mixed model for hybrid performance without marker information

As a baseline without marker information, we used the GCA model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{g}_1 + \mathbf{Z}_2\mathbf{g}_2 + \mathbf{e} \tag{2}$$

In this model, the only element of **β** was the general intercept, μ, while **X** consisted of the unity vector of length *n*_{H}. For each heterotic group, we calculated the coefficients of coancestry among inbred parents of hybrids, in terms of IBD probabilities, and formed the GCA variance–covariance structuring matrices **G**_{1} and **G**_{2} mentioned above using a tabular method (Bernardo 2002). The matrices **G**_{1} and **G**_{2} were used to structure GCA effects in both the REML and Bayesian approach. For the variance of **e**, we simply took **I** *V*_{e}, i.e., independent errors with constant variance, *V*_{e}.
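The covariance structure implied by this GCA model can be illustrated numerically. The sketch below assumes that the marginal covariance of the hybrid phenotypes is **Z**_{1}**G**_{1}**Z**_{1}^{T}*V*_{G1} + **Z**_{2}**G**_{2}**Z**_{2}^{T}*V*_{G2} + **I** *V*_{e}, which follows from the definitions above. The helper names, the toy parent indices, and all numbers are hypothetical.

```python
import numpy as np

def incidence(parent_index, n_parents):
    """Return an n_H x n_parents 0/1 matrix with a single 1 per row (the hybrid's parent)."""
    Z = np.zeros((len(parent_index), n_parents))
    Z[np.arange(len(parent_index)), parent_index] = 1.0
    return Z

def gca_covariance(Z1, G1, V_G1, Z2, G2, V_G2, V_e):
    """Marginal covariance of y: Z1 G1 Z1' V_G1 + Z2 G2 Z2' V_G2 + I V_e."""
    n_H = Z1.shape[0]
    return V_G1 * Z1 @ G1 @ Z1.T + V_G2 * Z2 @ G2 @ Z2.T + V_e * np.eye(n_H)

# Toy example (hypothetical numbers): 3 hybrids, 2 parents per heterotic group.
p1 = np.array([0, 0, 1])                  # group-1 parent of each hybrid
p2 = np.array([0, 1, 1])                  # group-2 parent of each hybrid
G1 = np.array([[1.0, 0.5], [0.5, 1.0]])   # coancestry among group-1 parents
G2 = np.eye(2)                            # unrelated group-2 parents
Z1, Z2 = incidence(p1, 2), incidence(p2, 2)
V = gca_covariance(Z1, G1, 2.0, Z2, G2, 1.0, 0.5)
```

Hybrids that share a parent, or whose parents are related through **G**_{1} or **G**_{2}, receive a positive covariance in `V`; this is how phenotypic records on related hybrids inform each other.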

### Extending the reference mixed model for QTL mapping

We diverge from Parisseaux and Bernardo (2004) with respect to the exact form in which marker and QTL information is introduced into the model. As alluded to above, instead of defining the QTLs at the level of the parents (P), we chose to define QTLs at the level of the (intermediate) founders (F). For this to be possible, the transition (descent) probabilities for alleles from founders to parents are needed. These probabilities were calculated at a 1 centi-Morgan grid along the genome using pedigree and marker data by a recursive or tabular method (see e.g., George et al. 2000; Bernardo 2002). The method is a top–down approach that starts with the (intermediate) founders and incorporates a Hidden Markov Model (HMM) approach (Lander and Green 1987).

For heterotic group 1, the *n*_{P1} × *n*_{F1} matrix of transition probabilities at a given genomic position is called **T**_{1}. When we want to test for a QTL being present in the first heterotic group at the position corresponding to this transition matrix, we can compare the following model with the GCA model (2) by a deviance test for a single variance component (Verbeke and Molenberghs 2000):

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{T}_1\mathbf{a}_1 + \mathbf{Z}_1\mathbf{g}_1 + \mathbf{Z}_2\mathbf{g}_2 + \mathbf{e} \tag{3}$$

where **a**_{1} is a vector of random QTL allele effects corresponding to a QTL at the position of the **T**_{1} matrix of founder–parent transition probabilities in group 1. As mentioned, we can structure the variance–covariance matrix (VCOV) of the random QTL allele effects in **a**_{1} using pedigree and dense marker information on inbred lines preceding the (intermediate) founders. The structuring matrix for the VCOV of the intermediate founder alleles can be calculated at the same genomic grid as the matrix of transition probabilities, **T**_{1}. For an overview of genetic relationships and their matrix representations in the mixed model, see Fig. 1.

### The deviance test for variance components

Critical values for the deviance in the test for single variance components can be found from a two component, 0.5–0.5, mixture distribution consisting of a Chi-square distribution on zero degrees of freedom combined with a Chi-square distribution on one degree of freedom. However, as pointed out by a reviewer, this approximation is only valid under the assumption of independence between the hybrids. A more general test for single variance components, including the dependence configuration pertinent to our hybrid data, is presented by Greven et al. (2008), who show that the commonly used mixture of Chi-squares provides a conservative test. Because alternatives to standard Chi-square mixture approximations for deviance differences are still cumbersome, we will stick to these standard approximations, acknowledging that they produce conservative tests.
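As an illustration of the standard approximation, the p-value for an observed deviance difference under the 0.5–0.5 mixture can be computed directly. The sketch below uses only the Python standard library and the identity that the upper tail of a Chi-square distribution on one degree of freedom equals erfc(sqrt(d/2)). The function name is hypothetical.

```python
import math

def mixture_pvalue(deviance):
    """P-value under the 0.5*chi2(0) + 0.5*chi2(1) mixture for a deviance difference."""
    if deviance <= 0.0:
        return 1.0  # the chi2(0) component puts all its mass at zero
    # Upper tail of chi2 with 1 df: P(X > d) = erfc(sqrt(d / 2)).
    return 0.5 * math.erfc(math.sqrt(deviance / 2.0))

# A deviance of about 2.71 corresponds to a test at the 5% level under the mixture,
# compared with 3.84 for a plain chi2(1) reference distribution.
p = mixture_pvalue(2.7055)
```

Halving the Chi-square tail probability reflects that the variance component is tested on the boundary of its parameter space.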

### Linkage and linkage disequilibrium

We can interpret the structure of the VCOV of the founder marker allele effects as the addition of linkage disequilibrium information to our linkage analysis, in the spirit of Meuwissen et al. (2001). Effectively, the total of the pedigree and marker information on the inbred lines preceding the parents is split into two parts, where the split is determined by the choice of the (intermediate) founder individuals, in this case three generations before the parent lines. In the founder–parent part of the relationships, classical genetic relationships are assessed from pedigree and marker information using Hidden Markov Models. This information enters the model at the linkage level. For the relationships between inbred lines above the founder level, often characterized at high marker density, a summary is constructed in the form of a symmetric matrix of IBD probabilities. The latter information accounts for the covariance between the ‘intermediate’ founder level random terms and incorporates ancestral linkage disequilibrium information.

### Decomposing founder effect structuring matrices

Define the structuring matrix for the variance of **a**_{1} in model (3) by **Q**_{1}, then VCOV(**a**_{1}) = **Q**_{1}*V*_{Q1}, with *V*_{Q1} the corresponding variance component. The matrix **Q**_{1} may turn out to be non-positive definite, which will cause problems when fitting model (3). A solution is to use a spectral decomposition **Q**_{1} = **U**_{1}**U**_{1}^{T} and to approximate **Q**_{1} by discarding the eigenvectors with negative eigenvalues (as long as these are small in absolute value): **Q**_{1} ≈ **U**_{1}^{\*}**U**_{1}^{\*T}, where **U**_{1}^{\*} represents the part of **U**_{1} with positive eigenvalues (Calinski et al. 2005; Piepho et al. 2008).
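The repair step can be sketched with a standard symmetric eigendecomposition. This is a minimal illustration, assuming that the factor is formed from eigenvectors scaled by the square roots of their eigenvalues; the function name, the tolerance, and the toy matrix are hypothetical.

```python
import numpy as np

def psd_factor(Q, tol=1e-10):
    """Return U_star with Q ~= U_star @ U_star.T, dropping non-positive eigenvalues."""
    Q = 0.5 * (Q + Q.T)                  # enforce exact symmetry
    eigval, eigvec = np.linalg.eigh(Q)   # eigenvalues in ascending order
    keep = eigval > tol
    # Scale the retained eigenvectors by the square roots of their eigenvalues.
    return eigvec[:, keep] * np.sqrt(eigval[keep])

# Hypothetical structuring matrix with one negative eigenvalue.
Q = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
U_star = psd_factor(Q)
Q_approx = U_star @ U_star.T             # positive semi-definite approximation
```

In this toy case one of the three eigenvalues is negative, so `U_star` has two columns and `Q_approx` can be used as a valid covariance structure.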

The number of QTL effects can be reduced further by replacing **Q**_{1} by a low rank approximation that centers on founder groups instead of founders (both preceding the currently used ‘intermediate’ founders). The matrix **Q**_{1} is again decomposed as before, **Q**_{1} ≈ **U**_{1}^{\*}**U**_{1}^{\*T}, but the number of columns in **U**_{1}^{\*} is now limited to a small set of eigenvectors, in the order of 2–6, namely those with the largest eigenvalues. When we post-multiply **Z**_{1}**T**_{1} by **U**_{1}^{\*} and call this matrix **M**_{1}, i.e., **M**_{1} = **Z**_{1}**T**_{1}**U**_{1}^{\*}, we can write (3) as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{M}_1\mathbf{b}_1 + \mathbf{Z}_1\mathbf{g}_1 + \mathbf{Z}_2\mathbf{g}_2 + \mathbf{e} \tag{4}$$

Because **U**_{1}^{\*} consists of the first few eigenvectors, model (4) estimates QTL allele effects that are linear combinations of the (intermediate) founder QTL allele effects. The transformation of the initial founder effects **a**_{1} to the reduced set of effects **b**_{1} can be interpreted as the creation of approximating founder groups, going back to key individuals somewhere in the pedigree. Because of the linear transformation defined by **U**_{1}^{\*}, the effects in **b**_{1} are independent. The founder effects **a**_{1} can be found from those for **b**_{1} by **a**_{1} = **U**_{1}^{\*}**b**_{1}.
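The low rank construction can be sketched as follows. The example assumes that the factor consists of the leading eigenvectors scaled by the square roots of their eigenvalues, that the rows of the transition matrix sum to one, and that each hybrid has exactly one group-1 parent. All dimensions and inputs are hypothetical toy values, not data from the breeding program.

```python
import numpy as np

def low_rank_factor(Q, rank):
    """Scaled eigenvectors for the `rank` largest eigenvalues: Q ~= U_star @ U_star.T."""
    eigval, eigvec = np.linalg.eigh(0.5 * (Q + Q.T))
    top = np.argsort(eigval)[::-1][:rank]
    if np.any(eigval[top] <= 0):
        raise ValueError("rank exceeds the number of positive eigenvalues")
    return eigvec[:, top] * np.sqrt(eigval[top])

rng = np.random.default_rng(0)
n_H, n_P1, n_F1, rank = 6, 4, 5, 2       # toy dimensions (hypothetical)

# Hypothetical founder IBD matrix with two clear founder groups.
Q1 = np.array([[1.0, 0.9, 0.9, 0.0, 0.0],
               [0.9, 1.0, 0.9, 0.0, 0.0],
               [0.9, 0.9, 1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 1.0, 0.9],
               [0.0, 0.0, 0.0, 0.9, 1.0]])
T1 = rng.dirichlet(np.ones(n_F1), size=n_P1)         # founder-to-parent probabilities
Z1 = np.eye(n_P1)[rng.integers(0, n_P1, size=n_H)]   # hybrid-to-parent incidence

U_star = low_rank_factor(Q1, rank)   # n_F1 x rank
M1 = Z1 @ T1 @ U_star                # n_H x rank QTL design matrix

b1 = np.array([1.5, -0.5])           # hypothetical reduced QTL effects
a1 = U_star @ b1                     # implied (intermediate) founder effects
```

With two retained eigenvectors the five founder effects are replaced by two independent effects, one per founder group, while `a1` recovers the founder-level effects for interpretation.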

An alternative to the spectral decomposition of **Q**_{1} is the factorization based on a latent class model as proposed by ter Braak et al. (2009): **Q**_{1} = **P**_{1}**P**_{1}^{T}, with the elements in **P**_{1} representing probabilities for individual founders to belong to a particular founder class. This latter factorization was especially useful in the Bayesian implementation of our approach.

### Modeling two heterotic groups

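Model (4) extends directly to QTL effects in both heterotic groups:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{M}_1\mathbf{b}_1 + \mathbf{M}_2\mathbf{b}_2 + \mathbf{Z}_1\mathbf{g}_1 + \mathbf{Z}_2\mathbf{g}_2 + \mathbf{e} \tag{5}$$

where **M**_{2} and **b**_{2} are defined for heterotic group 2 in the same way as **M**_{1} and **b**_{1} for group 1. As discussed above, the dimension of the QTL effect vectors in model (5) is relatively small. Therefore, it is not difficult to add dominance QTL effects, or even various forms of epistatic interactions, to the model. A dominance design matrix can be created from a multiplication of the columns of **M**_{1} by those of **M**_{2}.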
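One way to read the column multiplication is as all pairwise element-wise products of a column of **M**_{1} with a column of **M**_{2}. The sketch below implements that reading, which is an assumption; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def dominance_design(M1, M2):
    """All pairwise element-wise products of columns of M1 and M2 (n_H x (r1 * r2))."""
    n_H, r1 = M1.shape
    r2 = M2.shape[1]
    # Column j of M1 times column k of M2 lands in column j * r2 + k.
    return (M1[:, :, None] * M2[:, None, :]).reshape(n_H, r1 * r2)

# Hypothetical additive QTL design matrices for three hybrids.
M1 = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # group 1, two effects
M2 = np.array([[2.0], [1.0], [3.0]])                  # group 2, one effect
D = dominance_design(M1, M2)
```

Because the additive design matrices have only a few columns each, the dominance design matrix `D` stays small as well, which is what keeps such extensions tractable.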

### Single and multiple QTL models

The models described in this REML section are all single QTL models. To arrive at multi-QTL models, one can follow the standard practice of first performing a genome wide simple interval mapping scan using, for example, a comparison of model (5) with model (2) along the genome, and then retaining a set of significant or otherwise interesting QTLs. The set of QTLs identified in the simple interval mapping scan can be used to perform one or more rounds of composite interval mapping, or can immediately be used to carry out a backward selection procedure. Similar backward selection procedures can be started from the results of composite interval mapping scans. Forward selection procedures offer, under certain conditions, another possible strategy for arriving at multi-QTL models (Bauer et al. 2009; Broman and Speed 2002). Alternatively, a one step whole genome selection approach of the type advocated by Meuwissen et al. (2001) can be considered.

### Practicalities

We performed all REML computations using Genstat 12 (www.genstat.co.uk) and an Intel Core 2 Quad Q9550 processor with a clock rate of 2.83 GHz. The time required for a genome wide scan testing for additive QTLs in one or both heterotic groups was 4 h.

## Bayesian approach

### General description

The general mixed model formulation for the Bayesian MCMC approach is similar to the REML approach. Our Bayesian approach differs in two main respects: we use (1) a multi-QTL model in which the number of QTLs is a random variable, and (2) the assumption that the QTL effects are bi-allelic. One minor difference is that the GCA effects in both heterotic groups are assumed to have a common variance, implying a common GCA design matrix **Z** of dimensions *n*_{H} × *n*_{P}.

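In the Bayesian model without QTLs, the phenotypic observations are assumed to follow a normal distribution,

$$\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{g}, V_e \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{g},\ \mathbf{I}V_e) \tag{6}$$

i.e., with mean **Xβ** + **Zg** and variance *V*_{e}. The fixed effects have a uniform prior and the GCA effects have a normal prior, *P*(**g**) = *N*(0, **G** *V*_{G}), with the coancestry matrix **G**, as in the REML approach, being block diagonal as a consequence of the disconnectedness of the two heterotic groups.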

Model (6) is extended with QTL effects to

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \sum_{k=1}^{n_{QTL}} \mathbf{W}_k\mathbf{v}_k + \mathbf{Z}\mathbf{g} + \mathbf{e} \tag{7}$$

As in the REML approach, the QTL effects, **v**_{k} (*k* = 1,…, *n*_{QTL}), are defined at the founder level. Unlike the REML mixed model (5), however, in the Bayesian model (7) the number of QTLs, *n*_{QTL}, is itself a random variable, following ideas in Heath (1997), Sillanpää and Arjas (1998), and Bink et al. (2002). The QTL effects can include both additive and dominance effects. With respect to the QTL effects, the assumption is made that these effects will be bi-allelic. When the QTL effects are chosen to be exclusively additive, the design matrices **W**_{k} will have dimension *n*_{H} × 1, whereas for inclusion of dominance effects they will have two columns, one for additive QTL effects and a second one for the dominance effects. The prior for the additive QTL effects is normal and similar to that in Bink et al. (2008). To decide upon the number of QTLs governing a particular trait, model selection procedures implementing the standard theory of Bayes factors (Kass and Raftery 1995) can be used. For illustrations of the use of Bayes factors in QTL mapping, see Bink et al. (2002, 2008).
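A Bayes factor for the number of QTLs can be estimated from MCMC output as the ratio of posterior odds to prior odds. The sketch below assumes that the stored samples of the number of QTLs are available as a plain list and that the prior probabilities are known; the function name, the prior, and the sample counts are hypothetical and do not describe the FlexQTL™ implementation.

```python
import math
from collections import Counter

def bayes_factor(n_qtl_samples, prior, k1, k0):
    """Bayes factor for n_QTL = k1 against n_QTL = k0: posterior odds over prior odds."""
    counts = Counter(n_qtl_samples)
    if counts[k0] == 0 or counts[k1] == 0:
        raise ValueError("both models must be visited by the chain")
    posterior_odds = counts[k1] / counts[k0]
    prior_odds = prior[k1] / prior[k0]
    return posterior_odds / prior_odds

# Hypothetical stored samples of n_QTL and a hypothetical prior on n_QTL.
samples = [0] * 50 + [1] * 600 + [2] * 350
prior = {0: 0.5, 1: 0.3, 2: 0.2}
bf_10 = bayes_factor(samples, prior, k1=1, k0=0)
two_ln_bf = 2.0 * math.log(bf_10)   # the scale used by Kass and Raftery (1995)
```

Dividing by the prior odds removes the influence of the prior on the number of QTLs, so that the result reflects the evidence in the data alone.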

### Structuring VCOV of founder effects

The VCOV of the additive founder QTL effects is structured in the same way as in the REML mixed model by a combination of pedigree and marker information, where this structuring changes for each centi-Morgan. In the Bayesian approach, the VCOV of the additive effects was approximated by various methods of clustering the founders. For example, ancestral classes were constructed allocating individuals beyond a certain threshold value for coancestry, 0.8 in our case, to the same founder class. The ancestral classes form an addition to the Bayesian framework described originally in Bink et al. (2002). An attractive alternative to the threshold algorithm is the use of latent class models as proposed by ter Braak et al. (2009). We are presently incorporating this latent class algorithm in our Bayesian QTL models.
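The threshold allocation can be sketched as a grouping of founders in which any pair whose coancestry reaches the threshold ends up in the same class. The sketch assumes that membership is transitive (a single linkage reading) and that "beyond the threshold" means greater than or equal to it; neither detail is stated above. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def founder_classes(coancestry, threshold=0.8):
    """Label founders so that any pair with coancestry >= threshold shares a class."""
    n = coancestry.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if coancestry[i, j] >= threshold:
                parent[find(i)] = find(j)   # merge the two classes

    roots = [find(i) for i in range(n)]
    relabel = {}
    return np.array([relabel.setdefault(r, len(relabel)) for r in roots])

# Hypothetical IBD probabilities between four founders at one genomic position.
Q = np.array([[1.00, 0.90, 0.20, 0.10],
              [0.90, 1.00, 0.85, 0.10],
              [0.20, 0.85, 1.00, 0.10],
              [0.10, 0.10, 0.10, 1.00]])
labels = founder_classes(Q)
P = np.eye(labels.max() + 1)[labels]        # 0/1 class membership matrix
```

In this toy case the first three founders are chained into one class even though founders 1 and 3 have a coancestry of only 0.2, which shows how the transitive reading can merge founders that are not directly similar.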

### Practicalities

Bayesian models (6) and (7) are analytically intractable and MCMC simulation is used to sample from the joint distribution of data and model parameters. For the Bayesian analyses, we used the FlexQTL™ software (http://www.flexqtl.nl), on a 64-bit Dual-core Opteron system with a clock rate of 2.2 GHz. The simulations were performed with chains of 500,000 iterations, while storing samples every 200th iteration to reduce auto-correlation among samples and to save disk memory. Visual inspection of trace plots of important parameters indicated an absence of burn-in periods and that these numbers of Markov iterations were sufficient for reliable inference. The required computation time for an analysis was up to 5 days, while the analysis of the single chromosome 10 required 14 h of CPU time. These numbers may be reduced by shorter Markov chains and optimization of computer coding or running parallel simulation chains. The stored samples were used for posterior inference on the parameters of interest, i.e., the posterior mean estimates were taken to summarize the accumulated knowledge.
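The thinning arithmetic is simple to verify: storing every 200th draw of a chain of 500,000 iterations leaves 2,500 samples for posterior inference. The helper names below are hypothetical, and a `range` stands in for a real chain.

```python
def thin(chain, step):
    """Keep every `step`-th draw of a chain (draws step, 2*step, ... in 1-based counting)."""
    return chain[step - 1::step]

def posterior_mean(samples):
    """Posterior mean estimate from the stored samples."""
    return sum(samples) / len(samples)

n_iterations, step = 500_000, 200
stored = thin(range(n_iterations), step)   # stand-in for a real MCMC chain
n_stored = len(stored)                     # number of samples kept for inference
```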

## Illustration of modeling approaches on maize ear height data

The QTL on chromosome 10 had the most posterior evidence and explained 37 percent of the genetic variation by itself. At the position of maximum intensity on chromosome 10, we calculated the products of posterior founder QTL allele probabilities and QTL allele effects to arrive at quantities that closely resemble the predicted REML founder allele effects. The middle and right panels of Fig. 5 show that the Bayesian and REML models produce comparable estimated QTL allele effects.

For comparison, the bottom part of Fig. 6 repeats the REML profiles for a test in either or both of the heterotic groups. The REML and Bayesian analysis coincide with respect to the major QTLs, to be found on chromosomes 6 and 10. Further coincidences can be observed between the Bayesian intensity profile and the REML deviance profile, where it should be taken into account that the REML analysis is based on a series of single QTL models, whereas the Bayesian analysis is a multi-QTL analysis.

## Discussion

### Comparison of the REML and Bayesian MCMC mixed model based approaches

In this paper, we described a mixed model framework for the detection of QTLs in elite hybrid backgrounds. We used both REML and Bayesian implementations of this framework that are similar in the linear model structure for the response and the VCOV matrices for QTL and GCA effects. One difference was that the Bayesian implementation required a bi-allelic QTL representation. Another difference consisted in the fact that the REML implementation was based on the fitting of single QTL models, whereas the Bayesian implementation contained the number of QTLs as a parameter inside the model.

Both approaches are computer-intensive, but do allow multi-point/interval mapping strategies to QTL mapping in a multi-population setting while accounting for pedigree relations between the parents of the offspring populations, something that was not possible so far for realistic hybrid selection data. The REML approach presented in this paper is an elaboration of mixed model QTL work in a REML context as presented earlier in Malosetti et al. (2004, 2007, 2008), Boer et al. (2007), and Paulo et al. (2008). An attractive point of this REML approach is that it allows QTL modeling that combines features of the phenotypic modeling of individual field trials (Boer et al. 2007), modeling of genotype by environment interaction and QTL by environment interactions across trials (Malosetti et al. 2004; Boer et al. 2007), and modeling of multiple traits (Malosetti et al. 2008). These features can be integrated with the approach to QTL mapping in a commercial maize breeding program that was developed here. For the current paper, we did use features on the modeling of multiple populations as hinted at by Paulo et al. (2008), as well as elaborations of the modeling of genetic relations via structuring of the VCOV matrix for the genetic effects in an association mapping context, as was described by Malosetti et al. (2007). Our REML mixed model approach to QTL mapping is thus flexible with respect to the spectrum of genetic phenomena it can handle.

A few choices keep the computing time and requirements acceptable. First, the splitting up of the pedigree information, i.e., the definition of a layer of intermediate founders in combination with a low rank approximation to the VCOV matrix of the random (intermediate) founder effects provides a powerful tool to efficiently manage computer requirements for large and complex data sets. Second, the essentially single QTL models within the REML approach reduce calculation requirements considerably. To arrive at multi-QTL models in the REML setting, one can use, for example, composite interval mapping or backward selection strategies in combination with the simple interval mapping strategy that was described.

The Bayesian mixed model approach has a more elegant solution to the questions on the number of QTLs and their location: it has the number of QTLs as a parameter with a prior to be specified and fits sequences of multi-QTL models with varying numbers of QTLs across the whole of the genome while running through its MCMC chains. The posterior distributions for the number of QTLs and their respective locations are an integration of the realizations of the QTL models as occurring during the MCMC chain. The price for this elegance is that the computing requirements and times strongly increase, from 4 h for a REML analysis to 5 days for a Bayesian analysis, possibly up to a level that hinders extensions of the present Bayesian mixed models to multi-environment and multi-trait models. In contrast, when compared to the REML mixed model, the simplifying assumption of bi-allelic QTLs in the Bayesian implementation can provide greater versatility for defining epistatic interactions.

The Bayesian method can be extended to include the number of alleles at a QTL as a parameter in the model (Jannink and Wu 2003). However, Jannink and Wu (2003) concluded that for interconnected families there is insufficient information in DNA-marker and phenotypic data to determine the QTL allelic number with high probability unless each family contains many individuals (more than 100). Our own experience (unpublished results) is that, in simulations of tri-allelic QTLs with substantial allelic differences, the bi-allelic QTL approach fits two closely linked QTLs. In cases where the allelic effects did not differ substantially, the bi-allelic Bayesian method clustered the alleles into two allelic groups.

The REML and Bayesian approaches could be used in a complementary way to combine the advantages of both, i.e., the handling of a wide spectrum of genetic phenomena with an elegant approach to the number and the location of QTLs. One could start with a genome wide scan using a REML implementation of our mixed model framework to get a first rough estimate of the number of QTLs, their locations, and their effects. Next, one could focus on specific genomic regions and run a more refined Bayesian analysis to obtain estimates of, in particular, QTL location and, to a lesser extent, QTL number.

### Approaches to hybrid prediction

When looking at the literature, various classes of approaches to predicting HP can be distinguished. At one extreme, we find the mixed model based approaches, of which Parisseaux and Bernardo (2004) is the classical example. The modeling approaches developed in this paper are part of that tradition. This class of approaches is characterized by the fact that GCA and SCA effects are seen as random parameters whose VCOV matrix should be structured on the basis of pedigree and marker related information. QTLs then absorb part of the original GCA and SCA signal. At the other extreme, we find regression based approaches in which HP and parts thereof, like GCA, SCA, mid parent performance, and mid parent heterosis, are regressed on marker information or on distance related information derived from markers (Vuylsteke et al. 2000; Schrag et al. 2006, 2009). Although QTLs are identified, their relationship with GCA and SCA is not direct, as it is in the above-mentioned mixed models. A risky property of the regression based approaches is that they typically estimate the constituent effects of HP (GCA, SCA, etc.) in a first linear model and then select markers for association with those effects in one or more further regression procedures. Finally, the pieces are put together again to arrive at HP predictions that are composed of the estimates for the corresponding parameters (GCA, SCA, etc.), where these estimates come from different model fits and estimation procedures. For unbalanced data, as hybrid prediction data invariably are, interaction parameters such as SCA (and GCA to a lesser extent) are not estimable within fixed linear models, which means that these parameters cannot be calculated as linear combinations of observations without imposing rather artificial constraints (Nelder 1994). When GCA and SCA are taken as random, estimability of the parameters is less of a problem, but interpretation remains a problem. Parameters like SCA mainly indicate differences between model fits of increasing complexity, so these parameters are better understood as residuals indicating lack of fit (from a main effects only model) than as genetic entities contributing to HP. Modeling of HP within one and the same modeling framework avoids all these interpretational problems and has our preference.
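The reading of SCA as lack of fit from a main effects only model can be made concrete with a deliberately small and balanced example. The 2 × 2 table of hybrid performances, the function name, and the sum-to-zero parameterization are assumptions for illustration; with unbalanced data, as discussed above, the simple means used here no longer give the estimates.

```python
def main_effects_fit(table):
    """Fit mean + row effect + column effect to a complete, balanced
    two-way table and return the effects plus the residuals."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    # GCA-like main effects as deviations of marginal means from the grand mean
    gca_rows = [sum(row) / n_cols - grand for row in table]
    gca_cols = [sum(table[i][j] for i in range(n_rows)) / n_rows - grand
                for j in range(n_cols)]
    # What the main effects only model cannot explain plays the role of SCA
    residuals = [[table[i][j] - (grand + gca_rows[i] + gca_cols[j])
                  for j in range(n_cols)] for i in range(n_rows)]
    return grand, gca_rows, gca_cols, residuals

# Hypothetical hybrid performances: rows are parents from one heterotic
# group, columns are parents from the other.
hp = [[10.0, 12.0],
      [11.0, 17.0]]

grand, gca_rows, gca_cols, sca = main_effects_fit(hp)
```

Here the grand mean is 12.5, the row effects are -1.5 and 1.5, the column effects are -2 and 2, and the residuals of ±1 are what would be labeled SCA. They are defined only relative to the simpler model, which is the interpretational point made above.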

The problem of difficult-to-interpret genetic interaction parameters, like SCA and epistasis, in unbalanced data also affects a recent proposal of Melchinger et al. (2007) to construct NCIII designs to estimate dominance effects that are not contaminated by additive × additive epistatic interactions. Melchinger et al. (2007) see the unbiased estimation of dominance and additive × additive epistatic interactions as a condition for reliable HP prediction, because QTL effect estimates would otherwise not carry over to other genetic backgrounds. Still, we feel that a mixed model QTL approach for use in an existing breeding population, as outlined in this paper, should always be a strong competitor to an approach requiring specifically designed breeding populations producing balanced data sets.

### Hybrid performance and prediction

The identification of QTLs in biparental mating designs does not necessarily lead to a successful marker assisted selection (MAS) strategy, because the sampled allelic diversity is (too) small and QTL × genetic background interactions are not taken into account. Furthermore, the statistical methods for QTL detection typically model polygenic traits with many small effects inadequately (Heffner et al. 2009). Recently, genome wide selection (GS) has been presented as an alternative to MAS in plant breeding (Bernardo and Yu 2007; Nordborg and Weigel 2008; Heffner et al. 2009; Piepho 2009). In MAS, a predictive model for genotypic performance is created on the basis of earlier identified QTLs and estimates for their allelic effects, so only the markers that are found to be significantly associated with traits are used. In GS, no prior testing of markers for significant association takes place, but all markers enter the predictive model. However, GS requires estimation methods that appropriately shrink QTL effects at markers close to minor QTLs, while leaving QTL effects at stronger QTLs relatively untouched. The idea is that in GS both oligogenic QTLs with strong effects and polygenic QTLs with small effects contribute to a genome wide breeding value estimate that can serve for genotypic prediction and selection.
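The kind of differential shrinkage described above can be illustrated with a very simple rule. The sketch below uses soft-thresholding, which coincides with the lasso estimate only under an orthonormal marker design; it is a stand-in chosen for brevity and is not the estimator of any of the GS papers cited here. Marker names, raw effects, the penalty, and the -1/+1 genotype coding are hypothetical.

```python
def soft_threshold(effect, penalty):
    """Move an effect toward zero by a fixed amount; effects smaller in
    absolute value than the penalty are set to zero."""
    if effect > penalty:
        return effect - penalty
    if effect < -penalty:
        return effect + penalty
    return 0.0

def genome_wide_value(genotype, effects):
    """Sum of marker genotype codes times (shrunken) marker effects."""
    return sum(genotype[m] * effects[m] for m in effects)

# Hypothetical unshrunken marker effects: two strong, two weak.
raw_effects = {"m1": 2.40, "m2": 0.15, "m3": -0.08, "m4": -1.10}
penalty = 0.2

shrunk = {m: soft_threshold(b, penalty) for m, b in raw_effects.items()}

# Hypothetical genotype, coded -1 / +1 for the two marker alleles.
genotype = {"m1": 1, "m2": -1, "m3": 1, "m4": -1}
value = genome_wide_value(genotype, shrunk)
```

The strong effects at `m1` and `m4` keep most of their size (2.2 and -0.9), while the weak effects at `m2` and `m3` are shrunk to zero, so the genome wide value of 3.1 for this genotype is driven by the stronger markers. Estimators with marker-specific shrinkage behave similarly in spirit but shrink small effects smoothly rather than truncating them.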

Bernardo and Yu (2007) speculate that breeding value approaches such as those underlying GS will be useful for evaluating the combining ability of maize inbreds, but will be less useful for finding pairs of inbreds that perform well as single-cross hybrids. With regard to the latter aspect, we are less pessimistic. Our mixed model QTL approach applied within a breeding program should identify relevant QTLs and realistic estimates for their effects. We expect to have included the full allelic diversity. As mapping and prediction are performed with respect to the same breeding population, no QTL × genetic background interactions should show up. We expect to be able to benefit from the complementary QTLs that we identified in our two heterotic groups by bringing them together in new hybrids using optimal combinations of inbred lines. Furthermore, using the Bayesian implementation of our mixed model framework, we can naturally combine the posterior intensities for QTLs with the QTL effects to arrive at a powerful index for genome wide selection. As a preliminary test of the mixed model methodology in this respect, we produced genome wide predictions from a REML and a Bayesian analysis, using the 2004/2005 data. The validation data consisted of a set of 288 new hybrid combinations that were evaluated in the field in 2007/2008. Correlations between observations and predictions were 0.69 for the Bayesian analysis and 0.68 for the REML analysis. The correlation between the REML and Bayesian predictions was 0.89.

We have described two mixed model approaches to QTL mapping in existing hybrid selection programs. Both approaches produced encouraging results. We see this as a first step in the construction of an HP prediction protocol.

## Notes

### Acknowledgments

We thank the organizers of the Heterosis conference for the invitation to participate in their outstanding meeting and two reviewers for constructive comments that have improved the manuscript.

### Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

### References

- Bauer AM, Hoti F, von Korff M, Pillen K, Léon J, Sillanpää MJ (2009) Advanced backcross-QTL analysis in spring barley (*H. vulgare* ssp. spontaneum) comparing a REML versus a Bayesian model in multi-environmental field trials. Theor Appl Genet 119:105–123
- Bernardo R (1994) Prediction of maize single cross performance using RFLPs and information from related hybrids. Crop Sci 34:20–25
- Bernardo R (1996a) Best linear unbiased prediction of maize single-cross performance. Crop Sci 36:50–56
- Bernardo R (1996b) Best linear unbiased prediction of the performance of crosses between untested maize inbreds. Crop Sci 36:872–876
- Bernardo R (1999) Best linear unbiased prediction of single-cross performance. Crop Sci 39:1277–1282
- Bernardo R (2002) Breeding for quantitative traits in plants. Stemma Press, Woodbury
- Bernardo R, Yu JM (2007) Prospects for genomewide selection for quantitative traits in maize. Crop Sci 47:1082–1090
- Bink M, Uimari P, Sillanpää MJ, Janss LLG, Jansen RC (2002) Multiple QTL mapping in related plant populations via a pedigree-analysis approach. Theor Appl Genet 104:751–762
- Bink M, Boer MP, ter Braak CJF, Jansen J, Voorrips RE, de Weg WEV (2008) Bayesian analysis of complex traits in pedigreed plant populations. Euphytica 161:85–96
- Boer MP, Wright D, Feng LZ, Podlich DW, Luo L, Cooper M, van Eeuwijk FA (2007) A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial data using environmental covariables for QTL-by-environment interactions, with an example in maize. Genetics 177:1801–1813
- Broman KW, Speed TP (2002) A model selection approach for the identification of quantitative trait loci in experimental crosses. J R Stat Soc B 64:641–656
- Calinski T, Czajka S, Kaczmarek Z, Krajewski P, Pilarczyk W (2005) Analyzing multi-environment variety trials using randomization-derived mixed models. Biometrics 61:448–455
- Charcosset A, Bonnisseau B, Touchebeuf O, Burstin J, Dubreuil P, Barriere Y, Gallais A, Denis JB (1998) Prediction of maize hybrid silage performance using marker data: comparison of several models for specific combining ability. Crop Sci 38:38–44
- Duvick DN, Smith JSC, Cooper M (2004) Long-term selection in a commercial hybrid breeding program. Plant Breed Rev 24:109–151
- Fischer S, Mohring J, Schon CC, Piepho HP, Klein D, Schipprack W, Utz HF, Melchinger AE, Reif JC (2008) Trends in genetic variance components during 30 years of hybrid maize breeding at the University of Hohenheim. Plant Breed 127:446–451
- Frascaroli E, Cane MA, Landi P, Pea G, Gianfranceschi L, Villa M, Morgante M, Pe ME (2007) Classical genetic and quantitative trait loci analyses of heterosis in a maize hybrid between two elite inbred lines. Genetics 176:625–644
- George AW, Visscher PM, Haley CS (2000) Mapping quantitative trait loci in complex pedigrees: a two-step variance component approach. Genetics 156:2081–2092
- Greven S, Crainiceanu CM, Küchenhoff H, Peters A (2008) Restricted likelihood ratio testing for zero variance components in linear mixed models. J Comput Graph Stat 17:870–891
- Heath SC (1997) Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am J Hum Genet 61:748–760
- Heffner EL, Sorrells ME, Jannink JL (2009) Genomic selection for crop improvement. Crop Sci 49:1–12
- Jannink JL, Wu XL (2003) Estimating allelic number and identity in state of QTLs in interconnected families. Genet Res 81:133–144
- Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90:773–795
- Lander ES, Green P (1987) Construction of multilocus genetic-linkage maps in humans. Proc Natl Acad Sci USA 84:2363–2367
- Malosetti M, Voltas J, Romagosa I, Ullrich SE, van Eeuwijk FA (2004) Mixed models including environmental covariables for studying QTL by environment interaction. Euphytica 137:139–145
- Malosetti M, van der Linden CG, Vosman B, van Eeuwijk FA (2007) A mixed-model approach to association mapping using pedigree information with an illustration of resistance to *Phytophthora infestans* in potato. Genetics 175:879–889
- Malosetti M, Ribaut JM, Vargas M, Crossa J, van Eeuwijk FA (2008) A multi-trait multi-environment QTL mixed model with an application to drought and nitrogen stress trials in maize (*Zea mays* L.). Euphytica 161:241–257
- Melchinger AE, Utz HF, Piepho HP, Zeng ZB, Schon CC (2007) The role of epistasis in the manifestation of heterosis: a systems-oriented approach. Genetics 177:1815–1825
- Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using dense marker maps. Genetics 157:1819–1829
- Nelder J (1994) The statistics of linear models: back to basics. Stat Comput 4:221–234
- Nordborg M, Weigel D (2008) Next-generation genetics in plants. Nature 456:720–723
- Parisseaux B, Bernardo R (2004) In silico mapping of quantitative trait loci in maize. Theor Appl Genet 109:508–514
- Paulo MJ, Boer M, Huang XQ, Koornneef M, van Eeuwijk F (2008) A mixed model QTL analysis for a complex cross population consisting of a half diallel of two-way hybrids in *Arabidopsis thaliana*: analysis of simulated data. Euphytica 161:107–114
- Piepho HP (2009) Ridge regression and extensions for genomewide selection in maize. Crop Sci 49:1165–1176
- Piepho HP, Mohring J, Melchinger AE, Buchse A (2008) BLUP for phenotypic selection in plant breeding and variety testing. Euphytica 161:209–228
- Schrag TA, Melchinger AE, Sorensen AP, Frisch M (2006) Prediction of single-cross hybrid performance for grain yield and grain dry matter content in maize using AFLP markers associated with QTL. Theor Appl Genet 113:1037–1047
- Schrag TA, Mohring J, Maurer HP, Dhillon BS, Melchinger AE, Piepho HP, Sorensen AP, Frisch M (2009) Molecular marker-based prediction of hybrid performance in maize using unbalanced data from multiple experiments with factorial crosses. Theor Appl Genet 118:741–751
- Sillanpää MJ, Arjas E (1998) Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148:1373–1388
- Smith A, Cullis B, Gilmour A (2001) The analysis of crop variety evaluation data in Australia. Aust N Z J Stat 43:129–145
- ter Braak CJF, Kourmpetis Y, Kiers HAL, Bink MCAM (2009) Approximating a similarity matrix by a latent class model: a reappraisal of additive fuzzy clustering. Comput Stat Data Anal 53:3183–3193
- Verbeke G, Molenberghs G (2000) Linear mixed models for longitudinal data. Springer, New York, 568 p
- Vuylsteke M, Kuiper M, Stam P (2000) Chromosomal regions involved in hybrid performance and heterosis: their AFLP®-based identification and practical use in prediction models. Heredity 85:208–218
- Yu J, Arbelbide M, Bernardo R (2005) Power of in silico QTL mapping from phenotypic, pedigree, and marker data in a hybrid breeding program. Theor Appl Genet 110:1061–1067