
1 Introduction

Because genotype by environment (G×E) interactions are one of the key issues in plant breeding, genomic prediction (GP) to predict G×E interactions is an active area of investigation. One approach to this problem is to use crop growth models (CGMs) to take environmental conditions into account. CGMs simulate plant development based on various environmental (e.g., air temperature and soil water) and management (e.g., sowing dates and fertilizer application) inputs. CGMs are calibrated using data from controlled environments and field trials. Large CGMs often use parameters calibrated for each module/process independently using a variety of datasets. Once calibrated, CGMs can predict phenotypes for new (untested) environments if conditions at these environments are given. When CGMs are calibrated for each genotype, estimates of CGM parameters can differ between genotypes. Differences in parameters are considered as representing the differences in response to environmental stimuli among genotypes. Thus, G×E interactions may be modeled using CGMs calibrated for each genotype.

CGMs are connected with genes via various parameters and state variables (see next section). Genetic analyses, such as quantitative trait loci (QTL) mapping, are integrated into CGM parameters/state variables by treating them as “trait phenotypes.” This idea is transferred to studies integrating GP with CGMs [1, 2]. These studies predict phenotypes of new genotypes for untested environments. CGM parameters for new genotypes are first predicted with GP, then phenotypes for untested environments are predicted by running the CGM with predicted parameters and environmental and management inputs. This integration of CGMs with GP is hereafter referred to as GP-assisted CGM (Fig. 1). Alternatively, CGMs can support GP by predicting secondary (indicator) traits or via inferring growth stages. Such integration is termed CGM-assisted GP (Fig. 1).

Fig. 1

Genomic prediction-assisted crop growth models (GP-assisted CGMs) and the crop growth model-assisted genomic prediction (CGM-assisted GP). In GP-assisted CGMs, GP is used to predict parameters of CGM for new genotypes. Phenotypes of the genotypes are then predicted with the CGM. In CGM-assisted GP, phenotypes of new genotypes are predicted with GP. CGMs are used to support development of GP models

The terminology used in this chapter includes “calibration,” which is synonymous with “training,” the term commonly used in quantitative genetics. Both terms refer to estimating model parameters from available data (i.e., training data). These terms are strictly distinguished from prediction, which refers to forecasting plant performance. The term “estimate” is used to indicate estimation of CGM or GP model parameters by applying the models to available datasets. The term “simulate” indicates running a CGM using given environmental inputs, regardless of whether the inputs are for current or future conditions.

Since the first GP-assisted CGM was reported in 2015 [1], efforts to combine CGMs, a major product of crop science, with GP have continued. In this chapter, CGMs (concept, history, and classes), genetic analyses of CGM parameters, integration of CGMs with GP, and R examples for GP-assisted CGMs are briefly introduced to encourage additional empirical studies.

2 Crop Growth Models

The origin of CGMs is a systematic approach introduced by B. Jensen in the early twentieth century for understanding plant biomass production. Jensen is also famous for recognizing the hormone responsible for phototropism, auxin. His approach was advanced by subsequent studies, including the work of Monsi and Saeki [3], who developed mathematical models to simulate canopy production based on photosynthesis efficiency by introducing the concept of light interception. One of the first CGMs, ELCROS, was developed in 1970 [4]. ELCROS is a dynamic model, able to simulate biomass production based on canopy photosynthesis. Subsequently, various CGMs have been developed and used for research and practical purposes. Bouman et al. [5] provide a historical view of CGMs and the school of de Wit, and Jones et al. [6] also provide historical perspectives. Muller & Martre [7] provide a brief and accessible introduction to CGMs and a summary of their genealogy. CGMs currently in wide use include DSSAT [8] and APSIM [9]. Both CGMs use a graphical user interface (GUI) and cover a wide range of crops. DSSAT is also available as an R package [10]. CGMs suitable for experimental purposes may be found in the “Quantitative Plant” database [11]. This database is a good starting point for finding an appropriate CGM.

Components of comprehensive CGMs are state variables (X) representing current plant status (e.g., leaf area index, biomass, and developmental stages), rate variables (R) representing rates of change in the state variables, environmental variables (E) representing environmental inputs (e.g., air temperature and radiation), and parameters characterizing functional relationships among the variables, X, R, and E [12]. State variables can be represented as:

$$ X=\int R\mathrm{d}t=\int f\left(X,E\right)\mathrm{d}t $$

reflecting the dynamic nature of CGMs. State variables X then relate to each other, and feedback loops exist among variables. Consequently, comprehensive CGMs comprise many parameters to be determined. These parameters are often determined from experiments under controlled environments.
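The integral form above can be made concrete with a forward-Euler update using a daily time step. The following R sketch is illustrative only: the function names (`simulate_state`, `rate_fn`), the logistic form of the rate, and all numeric values are assumptions chosen for the example and are not taken from any published CGM.

```r
# Forward-Euler integration of dX/dt = f(X, E) with a daily time step.
# simulate_state(), rate_fn(), and all numeric values are hypothetical.
simulate_state <- function(x0, env, rate_fn, dt = 1) {
  n <- length(env)
  x <- numeric(n + 1)
  x[1] <- x0
  for (d in seq_len(n)) {
    # The rate depends on the current state and on that day's environment
    x[d + 1] <- x[d] + rate_fn(x[d], env[d]) * dt
  }
  x
}

# Hypothetical rate variable: logistic growth whose speed scales with
# temperature above a base of 8 degrees C
rate_fn <- function(x, temp, r = 0.01, x_max = 10, t_base = 8) {
  r * max(temp - t_base, 0) * x * (1 - x / x_max)
}

set.seed(1)
temp <- 20 + rnorm(100, sd = 3)   # simulated daily mean temperature
x <- simulate_state(x0 = 0.1, env = temp, rate_fn = rate_fn)
range(x)                          # trajectory of the state variable X
```

Because the state on each day feeds back into the rate on the next day, parameters such as `r` and `x_max` cannot be read directly from a single observation. This is one reason why calibration of dynamic models needs dedicated estimation procedures.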

A more descriptive approach is also used to evaluate dry matter production based on radiation use efficiency [13]. In such simplified models, biomass accumulation, which is modeled as the relationship between photosynthesis and respiration in comprehensive CGMs, is assumed to be proportional to the amount of solar radiation that plants absorbed. Accumulation of dry matter, w, is typically calculated as:

$$ \frac{\mathrm{d}w}{\mathrm{d}t}= SRE\times SR\left[1-r-\left(1-{r}_a\right)\exp \left(- kLAI\right)\right] $$

where SRE is solar radiation use efficiency; SR is solar radiation; r and r_a are the reflectivities of the canopy and soil, respectively; k is the radiation extinction coefficient of the canopy; and LAI is the leaf area index [13]. Grain yield is then calculated by partitioning dry matter according to the harvest index (HI). Such models are hereafter referred to as “simplified CGMs.” Because fewer parameters are needed for model application, simplified CGMs are often used for integration with GP.
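As a rough illustration of how such a simplified CGM is evaluated, the R sketch below applies the dry matter equation day by day and partitions the final biomass with a harvest index. The function name `dry_matter_rate`, the LAI trajectory, and every numeric value (reflectivities, extinction coefficient, SRE, HI) are placeholders assumed for the example, not calibrated values.

```r
# Daily dry matter increment from the radiation-use-efficiency equation:
#   dw/dt = SRE * SR * [1 - r - (1 - r_a) * exp(-k * LAI)]
# All default values are hypothetical placeholders.
dry_matter_rate <- function(SR, LAI, SRE, r = 0.06, r_a = 0.1, k = 0.5) {
  SRE * SR * (1 - r - (1 - r_a) * exp(-k * LAI))
}

set.seed(1)
n_days <- 120
SR  <- runif(n_days, 10, 25)                    # solar radiation (MJ m-2 per day)
LAI <- 5 * sin(pi * seq_len(n_days) / n_days)   # hypothetical LAI trajectory

dw      <- dry_matter_rate(SR, LAI, SRE = 1.5)  # daily increment
biomass <- cumsum(dw)                           # accumulated dry matter
HI      <- 0.45                                 # hypothetical harvest index
grain_yield <- HI * biomass[n_days]
grain_yield
```

In a GP-assisted setting, quantities such as `SRE` would be the genotype-specific parameters, while `SR` would be supplied as an environmental input.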

Other CGMs are functional-structural plant models (FSPMs) that simulate the three-dimensional (3D) architecture of plants by integrating physiological processes with plant 3D architecture [14]. CGMs discussed above assess canopy-level processes. In contrast, FSPMs treat individual plant processes [7]. FSPMs will become more important for crop science as phenotyping technologies develop and more precise information on phenotypes becomes available. To date, only a single example of the use of FSPM for GP is reported [15]. This study uses MAppleT model, which simulates the above-ground development of apple trees.

Conventional growth curves, such as logistic [16] or Gompertz [17], are often used to model growth trajectories for both plants and animals. Growth curves usually do not account for environmental conditions, and thus, have no advantage in GP aimed at prediction of G×E interactions. However, because both growth curves and CGMs are mathematical models that treat dynamic (time-dependent) processes, studies using growth curves are also introduced together with CGMs in this chapter. In some cases, lessons on statistical inference on model parameters can be shared, as described later. Further, the extension of growth curves to include environmental conditions is possible, as illustrated by Campbell et al. [18]. This study demonstrates high similarity between growth curves and CGMs. Note that, although QTL mapping using growth curves is categorized as a class of functional mapping, reviewing all classes of functional mapping is beyond the scope of this chapter, and only a class based on growth curves (i.e., parametric models) is mentioned.

3 Gene-Based Models

CGMs consist of multiple parameters, some of which are regarded as genotype-specific and may explain differences in response to environments between genotypes. A popular application of CGMs is, therefore, to define or search for ideotypes [19, 20]. That is, plants with ideal phenotypes for a given environment are searched for in silico by modifying model parameters. The parameter values resulting in such ideal phenotypes are then regarded as breeding targets. Genotype-specific parameters of CGMs have various aliases, including genetic coefficients [21], input traits [22], physiological traits [1], and genotypic parameters [23].

A first attempt to connect CGM parameters with genes is reported in a common bean study [24]. Parameters were linked to known genes responsible for phenology, growth habit, and seed size using linear regression. Subsequently, the amount of phenotypic variation explained by CGMs was assessed with fitted parameter values. The model developed by the authors, GeneGro, was updated [25] by adding several genes responsible for photoperiod sensitivity. Modeling that links parameters of CGMs with genotypes of known genes is often called gene-based modeling. Chapman et al. [26] used such a model for sorghum to simulate phenotypes for traits, such as transpiration efficiency and flowering time, in future breeding environments. Stewart et al. [27] developed a gene-based model for soybean to predict flowering time by incorporating flowering genes (E loci). Similar approaches were implemented to simulate yield responses in soybean [28] and to analyze the effects of Vrn and Ppd loci that affect vernalization and photoperiodism, respectively, on flowering in wheat [29]. A comprehensive summary on the integration of genes/QTLs with CGMs is provided by Wang et al. [30].

In the gene-based models, model parameters selected as genotype-specific are expressed as linear combinations of gene effects. Gene effects are estimated after model parameter estimation using training data, a so-called “independent” or “two-step” approach. The term “independent” reflects model calibration/training that is initially performed independently for each genotype. Subsequent genetic analysis on model parameters is also conducted independently from this parameter estimation (Fig. 2). The opposite “joint” or “single-step” approach uses a process where parameter estimation for all genotypes and genetic analyses on these parameters are performed simultaneously (Fig. 2). This issue is discussed further later in this chapter.

Fig. 2

Independent and joint approaches for integration of crop growth models (CGMs) and genetic analyses. In the independent approach, CGMs are independently fit to data (I, inputs; Y, phenotypes) for each genotype. N denotes the number of genotypes. Estimated model parameters (Φ) are then analyzed genetically by connecting with genotypes at gene, QTL, or whole-genome markers. In the joint approach, model fitting for all genotypes is performed jointly with genetic analyses. Genetic information yields dependencies among CGM outputs for each genotype

4 CGMs and QTL Mapping

CGMs are also used to discover novel genes. QTL mapping and association analyses on parameters or state variables of CGMs are used for such exploration. Since the first attempt for yield of barley [31], many such studies have been reported (Table 1). Typically, models used are simplified CGMs or models for limited biological processes that have fewer than 10 genotype-specific parameters. QTL mapping on CGM parameters is usually conducted on parameters estimated for each genotype (independent approach). QTL mapping is also often conducted for growth curves (Table 2). This approach can be regarded as a type of functional mapping. Growth curves do not take environmental information into account, but QTL mapping on growth curve parameters has added benefits for avoiding multiple testing problems that appear when mapping at individual time points.

Table 1 QTL/GWAS analyses on crop growth model parameters
Table 2 Genetic analyses on growth curve parameters

Because parameters or state variables of CGMs have interpretable roles in physiological processes, QTL mapping on these parameters/variables is expected to provide more detailed information on QTL functions than mapping on final traits. Typical examples can be found in two studies on phenological traits in rice and wheat. In rice [32], a phenology model, developmental rate (DVR) model, was used to map flowering-time QTLs in backcross inbred lines. The DVR model has multiple genotype-specific parameters, including parameters that represent photoperiod and temperature sensitivity. QTL mapping on these two parameters showed partially overlapping QTLs. Known QTLs, including Hd1 and Hd2, were detected for both parameters, and two additional major QTLs, Hd8 and Hd9, were detected only for parameters related to photoperiod and temperature sensitivity, respectively. Interestingly, Hd9 was not detected when using heading dates (i.e., final trait) as the response variable. Hd8 and Hd9 are still not well characterized, though Hd9 is possibly involved in a photoperiod-independent pathway of heading regulation.

For wheat, a phenology model with two genotype-specific parameters was used to map flowering-time QTLs in an association panel. These parameters represented photoperiod sensitivity and vernalization requirement. The latter is related to temperature. Association analyses revealed 12 and 11 QTLs for these parameters, respectively. No photoperiod and vernalization QTLs colocalized, suggesting that the processes governed by these parameters are underpinned by different genetic architecture.

These examples illustrate that QTL mapping on CGM parameters can increase the interpretability of QTL mapping. However, caution is needed. When parameters are correlated, QTLs found for one parameter are not necessarily associated with the process that the parameter is assumed to be involved in. The QTLs may actually be associated with other correlated parameters.

5 CGMs and GP

5.1 Overview

Integrating CGMs with GP is achieved in two ways, GP-assisted CGMs and CGM-assisted GP (Figs. 1 and 2, and Tables 3 and 4). In the former, genotype-specific parameters of CGMs are predicted using GP, then phenotypes of target traits (e.g., yield or flowering time) are predicted using CGMs taking environmental information as inputs. Thus, phenotypes are predicted by the CGM, and GP aids these predictions. This approach is an extension of gene-based models and QTL mapping on CGM parameters. If the definition of GP is simultaneous fitting of genome-wide markers irrespective of effect size, the first attempt of GP-assisted CGMs was reported in 2015 [1]. In this conceptual study based on simulations, four parameters in a maize CGM that takes solar radiation and temperature as environmental inputs are regressed on whole-genome markers. Yields for new plants under new environments were then predicted using the same CGM.

Table 3 Genomic prediction-assisted crop growth models
Table 4 Crop growth model-assisted genomic prediction

Early applications of GP-assisted CGMs to real data were reported by Cooper et al. [33] and Onogi et al. [2]. In the former study, a maize CGM [1] was modified to integrate drought stress and was applied to real maize data. In the latter study, the DVR model was combined with GP to predict rice heading dates. A common feature of these studies is, interestingly, adoption of a “joint” approach; CGM parameters of all genotypes were jointly estimated and marker effects, or additive genotypic effects, on CGM parameters were also estimated jointly (Fig. 2). The above studies [1, 2, 33] realized this approach by developing hierarchical models that incorporate CGMs with GP and applying parameter estimation procedures in Bayesian frameworks. In the maize case [1, 33], approximate Bayesian computation [34] was adopted; in the rice case [2], a hybrid approach of Markov chain Monte Carlo (MCMC) and variational Bayesian inference was developed.

A joint approach, however, is not essential for GP-assisted CGMs. All except one study [35] on GP-assisted CGMs subsequently published adopts an independent approach (Table 3). CGM is applied to each genotype independently, and then estimated CGM parameters are subject to GP independently from CGM fitting (Fig. 2). A major advantage of an independent approach is its ease of implementation. When published CGMs, such as APSIM or DSSAT, are used, CGM implementation is unnecessary. Moreover, an independent approach is applicable to complex CGMs with many parameters. Conversely, a joint approach requires the development of hierarchical models and estimation procedures and is likely to be more difficult to apply as CGMs become more complex.
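The independent (two-step) workflow described above can be sketched in a few lines of R. Everything in this example is an assumption made for illustration: the data are simulated, ridge regression stands in for whichever GP method is used, and `toy_cgm` is a hypothetical one-parameter thermal-time model rather than a published CGM.

```r
set.seed(1)
n <- 300; p <- 200
M <- matrix(sample(c(-1, 0, 1), n * p, replace = TRUE), n, p)  # marker genotypes
b <- rnorm(p, sd = 0.1)                                        # simulated marker effects
phi_true <- 10 + as.vector(M %*% b)       # true genotype-specific parameter
# Step 1 (assumed already done): per-genotype CGM calibration, which
# yields parameter estimates carrying estimation error
phi_hat <- phi_true + rnorm(n, sd = 0.3)

train <- 1:250; test <- 251:300

# Step 2: GP on the parameter estimates (ridge regression, dual form)
ridge_fit <- function(M, y, lambda) {
  mu <- mean(y)
  center <- colMeans(M)
  Mc <- sweep(M, 2, center)
  alpha <- solve(tcrossprod(Mc) + lambda * diag(nrow(Mc)), y - mu)
  list(mu = mu, center = center, beta = as.vector(crossprod(Mc, alpha)))
}
ridge_predict <- function(fit, Mnew) {
  fit$mu + as.vector(sweep(Mnew, 2, fit$center) %*% fit$beta)
}
fit <- ridge_fit(M[train, ], phi_hat[train], lambda = 10)
phi_pred <- ridge_predict(fit, M[test, ])
accuracy <- cor(phi_pred, phi_true[test])

# Step 3: run the CGM with predicted parameters in an untested environment.
# toy_cgm() returns the first day on which thermal units reach 100 * phi.
toy_cgm <- function(phi, temp, t_base = 8) {
  tu <- cumsum(pmax(temp - t_base, 0))
  which(tu >= 100 * phi)[1]
}
temp_new <- rep(22, 200)                  # hypothetical new environment
dth_pred <- sapply(phi_pred, toy_cgm, temp = temp_new)
summary(dth_pred)
```

The three steps are decoupled, so the calibration in step 1 could equally be performed with an external tool, and the GP method in step 2 could be swapped without touching the CGM.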

CGM-assisted GP uses CGMs to assist GP in multiple ways (Table 4). Broadly, three approaches have been proposed. The first is the use of CGMs to infer plant growth stages. Predicted growth stages are then used to infer environmental covariates that affect G×E interactions [36, 37]. Such environmental covariates are in turn used to create kernels between environments to assess reaction norms in mixed models [38]. The second approach is the use of CGMs to characterize environments [39]. A comprehensive CGM, APSIM, was first calibrated for a typical variety at reference environments and run for target environments. Nitrogen nutrition indices output by the CGM to indicate crop nutritional balance were then used to characterize these environments. GP was conducted by considering the genotype-by-index interactions. The third approach is the use of CGMs to predict secondary or indicator traits. For example, heading dates for new environments are predicted using a CGM, then predicted heading dates are included as covariates in mixed models of GP to predict yield [40]. Using real data of winter wheat, it was shown that including predicted heading dates increases the prediction accuracy of GP for between-environment prediction. Heading date or flowering time can affect various traits [41] and is generally easier to predict than yield with CGMs. Thus, this approach may be a good alternative to GP-assisted CGMs.

Note that GP-assisted CGMs and CGM-assisted GP will have qualitatively different roles in plant breeding. Although both methods predict phenotypes, to be exact, CGM-assisted GP predicts genetic merits or breeding values. Thus, CGM-assisted GP will be suitable for selecting candidates to increase genetic gain, whereas GP-assisted CGM will be suitable for designing ideal phenotypes under given environmental conditions.

5.2 Predictive Ability of GP-Assisted CGMs

Applications of integrated CGMs and GP have gradually increased, but comparative studies are still insufficient. In particular, GP-assisted CGMs have not been compared with other methods that consider G×E interactions (e.g., mixed models with reaction norms). Thus, the usefulness of GP-assisted CGMs in phenotype prediction is not fully understood. GP-assisted CGMs were compared with ordinary GP methods, such as genomic BLUP (GBLUP) [1, 33, 35] and BayesA [42], which cannot account for G×E interactions, for between-environment predictions. GP-assisted CGMs showed better accuracy than GBLUP/BayesA in simulations as expected [1, 35], but showed accuracy similar to GBLUP with real data [33]. These results probably reflect the small number of environments used (two), too few for calibrating CGMs. In several other studies [2, 15, 43–45], proposed methods were not compared with other methods in between-environment predictions, mainly because the authors had other objectives. For example, the primary focus of Onogi et al. [2] was a comparison of joint and independent approaches. Alimi [43] compared single- and multi-trait GP models to predict CGM parameters with an independent approach and showed the superiority of multi-trait models. Rosen et al. [45] reported that CGM performance was better when CGM parameters were predicted through GP than when parameters were predicted using major QTLs.

An interesting comparison study is a biomass prediction for rice using longitudinal data [46]. An independent approach based on a simplified CGM was compared with an ordinary GP method (Lasso) and an approach that replaces the CGM with machine learning (linear regression or random forests) in an independent framework. The machine learning methods use state variables predicted with GP as inputs, then output phenotypes of the target trait (biomass). The two methods (the independent approach based on a CGM and the approach that replaces the CGM with machine learning) showed similar accuracy to each other and much better accuracy than Lasso in between-environment prediction. Chen et al. [47] compared an independent approach based on the DVR model with XGBoost [48], which takes both environmental covariates and marker genotypes as inputs, for prediction of heading dates of rice. Results were less promising for CGMs because XGBoost generally provided lower prediction errors.

In summary, use of GP-assisted CGMs remains a conceptual approach that needs further development to realize its potential for phenotype prediction. Issues to be considered to improve predictive ability are: (a) the choice of CGMs; (b) the choice of genotype-specific parameters of CGMs, and (c) the accuracy of CGM parameter estimation.

The predictive ability of CGMs differs depending on the model used and the target traits. Thus, the appropriate choice of CGMs is critical for high prediction accuracy. Predictive ability is also affected by relationships (or similarities) between target and training environments. The remaining two points are reviewed in the following sections.

Importantly, a fundamental assumption for use of CGMs for G×E analyses is that genotype-specific parameters are constant among environments and differences of phenotypes of a genotype (i.e., reaction norms) are brought about by environmental conditions via CGMs. However, this assumption often does not hold [49]. In such cases, mega-environments are needed where CGM parameters are consistent.

5.3 Genotype-Specific Parameters in CGMs

Despite the long history of CGMs, no good means exists for choosing genotype-specific parameters. This issue is, however, critical and can directly affect the predictive performance of GP with CGMs. Model ability to describe phenotypic variations depends on parameters chosen as genotype-specific [50]. Uncertainties in prediction are also affected by these designations [51]. Indeed, as described in Section 6.2, the predictive ability of the DVR model increased when more parameters were chosen as genotype-specific. A popular practice to determine genotype-specific parameters is to adopt knowledge garnered from the published literature. Usually, CGMs are calibrated for multiple major varieties. Thus, model parameters that show variation among varieties can be empirically deduced. Another method is to identify parameters that cause large variations in phenotype via sensitivity analyses [52], even though genetic aspects are lacking in this procedure. Considering that the aim is high prediction accuracy, a practical approach is to select a set of genotype-specific parameters based on predictive ability examined with cross-validation. This technique is time-consuming, but feasible when CGMs are small.

5.4 Parameter Estimation

Parameters for joint GP-assisted CGMs [1, 2, 33, 35] were estimated in Bayesian frameworks, which is reasonable considering the hierarchical structure of joint models. Bayesian methods, such as MCMC or an extension of MCMC, differential evolution adaptive metropolis (DREAM, [53]), and generalized likelihood uncertainty estimation [54], are often used for optimization of CGMs [47, 51, 55–57]. Because prior distributions in Bayesian methods are statistically equivalent to penalties, parameter estimation by Bayesian methods could increase the accuracy of predictions. Non-Bayesian and general optimization methods such as Nelder–Mead optimization [58] or particle swarm optimization (PSO, [59]) can also be used for optimization of CGMs and growth curves. To prevent overfitting, adding penalties to objective functions may be useful. These optimizers (Nelder–Mead and PSO) are implemented with user-friendly R scripts, and are good candidates as a first step, particularly when small CGMs are used.
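The idea of adding a penalty to the objective function can be sketched with `optim` from the stats package. In this example a logistic growth curve stands in for a CGM, and the penalty is a quadratic term that corresponds to independent Gaussian priors on the parameters. The function names (`logistic`, `penalized_loss`), the simulated data, and the prior means and standard deviations are all assumptions for illustration.

```r
# Logistic growth curve used as a small stand-in for a CGM
logistic <- function(days, K, r, t0) K / (1 + exp(-r * (days - t0)))

set.seed(1)
days <- seq(0, 60, by = 10)
obs  <- logistic(days, K = 50, r = 0.15, t0 = 30) +
  rnorm(length(days), sd = 1)              # simulated observations

# Objective: scaled sum of squared errors plus a quadratic penalty that
# pulls the parameters toward prior_mean (a Gaussian prior on log scale
# of the posterior, up to a constant)
penalized_loss <- function(par, days, obs, prior_mean, prior_sd, sigma = 1) {
  pred     <- logistic(days, par[1], par[2], par[3])
  fit_term <- sum((obs - pred)^2) / (2 * sigma^2)
  penalty  <- sum((par - prior_mean)^2 / (2 * prior_sd^2))
  fit_term + penalty
}

init       <- c(40, 0.1, 25)               # initial values for K, r, t0
prior_mean <- c(45, 0.1, 28)               # hypothetical prior means
prior_sd   <- c(20, 0.1, 10)               # hypothetical prior SDs
fit <- optim(init, penalized_loss, days = days, obs = obs,
             prior_mean = prior_mean, prior_sd = prior_sd,
             method = "Nelder-Mead")
fit$par
```

Smaller values of `prior_sd` strengthen the penalty, so the estimates move toward `prior_mean` and depend less on the data.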

Prior distributions might be useful for avoiding identifiability problems in parameter estimation [56]. CGMs are an accumulation of physiological processes represented as mathematical equations, and parameters are often redundant and interrelated. Thus, parameters can compensate for each other and cause identifiability problems (i.e., different parameter values can result in the same output [49]). Correlated parameters may not be identifiable from data only, and different prior distributions can augment identifiability. However, estimates obtained with strong prior distributions are not the result of statistical learning. Rather, they reflect prior assumptions. Likewise, compensation among parameters raises a concern for the interpretation of QTLs. QTLs may be detected for multiple parameters that are interrelated, and compensation in physiological processes may result in QTLs detected for parameters controlling different processes.

For GP-assisted CGMs, two approaches for parameter estimation are used, joint and independent. The independent approach is easier to implement, but the joint approach has two major advantages: (a) uncertainty in CGM parameter estimation can be considered in GP, whereas in the independent approach, the uncertainty adds noise and leads to lower heritability of the parameters; (b) information from the phenotype of a genotype can be used simultaneously for CGM parameter inference for other individuals, through genome-wide marker effects. These advantages were first discussed in a simulation study using growth curves [60]. Onogi [61] showed that the joint approach could estimate model parameters more accurately than the independent approach in various mathematical models. The difference between the two approaches becomes more prominent as the number of phenotypic values used for estimation decreases, e.g., with a large proportion of missing phenotypes or large sampling intervals in longitudinal data. The joint approach is often applied to growth curve analyses [60, 62–64], probably because model structures of growth curves are simple and easy to extend.

An interesting approach to increase the accuracy of parameter estimation of CGMs is to optimize the members (environments) of multi-environment trials (METs) used for calibrating the CGMs [65]. Using a set of pre-determined CGM parameter values, the method identifies a set of environments that provides diverse outputs (phenotypes). Using these environments for calibration, the accuracy of CGM parameter estimates can be increased. If this idea is modified such that the set of parameters is optimized whereas the members of METs are fixed, genotype-specific parameters may be more effectively chosen.

6 Examples of CGMs Applications

6.1 Overview of Examples

Popular comprehensive CGMs, such as APSIM and DSSAT, are available upon registration at their respective sites, which also supply rich supporting materials [66, 67]. Both programs are equipped with a GUI and cover a wide range of crops. Thus, connecting these CGMs with GP using an independent approach is not a hard task. On the other hand, applications of GP and QTL mapping are still limited to small CGMs or growth curves, and thus, it is possible to implement in-house CGMs with an arbitrary language and to integrate them with GP, QTL mapping, and genome-wide association studies. An advantage of in-house development is that fitting of CGMs, and subsequent genetic analyses, can be accomplished in a uniform digital environment that researchers/breeders know. Moreover, models can be diagnosed easily and extensions of the models are also feasible. Examples of CGMs implemented with R and Rcpp are presented below. The models are the DVR used in [2], and the maize growth model used in [1].

6.2 DVR Model

The DVR model predicts heading (flowering) time of rice from daily mean temperature (T, °C) and photoperiod (P, h) [32, 68]. The model is:

$$ {DVS}_D=\sum \limits_{d=1}^D{DVR}_d $$
$$ {DVR}_d=\left\{\begin{array}{c}f\left({T}_d\right), if\ {DVS}_d<{DVS}_1\ \mathrm{or}\ {DVS}_d>{DVS}_2\\ {}f\left({T}_d\right)g\left({P}_d\right), if\ {DVS}_1\le {DVS}_d\le {DVS}_2\end{array}\right. $$
$$ f\left({T}_d\right)=\left\{\begin{array}{c}{\left[\left(\frac{T_d-{T}_b}{T_o-{T}_b}\right){\left(\frac{T_c-{T}_d}{T_c-{T}_o}\right)}^{\left(\frac{T_c-{T}_o}{T_o-{T}_b}\right)}\right]}^{\alpha }, if\ {T}_b\le {T}_d\le {T}_c\\ {}0, if\ {T}_d<{T}_b\ \mathrm{or}\ {T}_d>{T}_c\end{array}\right. $$
$$ g\left({P}_d\right)=\left\{\begin{array}{c}{\left[\left(\frac{P_d-{P}_b}{P_o-{P}_b}\right){\left(\frac{P_c-{P}_d}{P_c-{P}_o}\right)}^{\left(\frac{P_c-{P}_o}{P_o-{P}_b}\right)}\right]}^{\beta }, if\ {P}_b\le {P}_d\le {P}_c\\ {}0, if\ {P}_d<{P}_b\ \mathrm{or}\ {P}_d>{P}_c\end{array}\right. $$
$$ {DVS}_1=0.145G+0.005{G}^2 $$
$$ {DVS}_2=0.345G+0.005{G}^2 $$

where DVS_d and DVR_d denote the developmental stage and developmental rate at day d; DVS_1 and DVS_2 define the period when the plant is photo-sensitive; α and β are sensitivity coefficients of temperature and photoperiod; and G represents the earliness of flowering under optimum temperature and photoperiod conditions. The index d indicates days after emergence. T and P denote temperature and photoperiod, and the subscripts b, o, and c indicate the base (lower limit), optimum, and ceiling (upper limit) values, respectively. The model outputs D, the first day on which DVS_D exceeds G, as days to heading. The model has multiple parameters that can be genotype-specific, such as α, β, G, Tb, To, Tc, Pb, Po, and Pc. In some studies [2, 68], α, β, and G were assumed to be genotype-specific, and the remaining parameters were determined based on prior knowledge. In another study [61], Po and To were also assumed to be genotype-specific and this assumption resulted in better prediction accuracy. Thus, this assumption is followed here. Boxes 1 and 2 show R and Rcpp script examples of the DVR model that return days to heading under given environmental conditions (daily mean temperature and photoperiod) and parameters (α, β, G, Po, and To). The Rcpp script is faster than the R script and is thus preferred.

Box 1 An R Script for the DVR Model (DVRmodel.R)

Temp and Photo are matrices that store daily mean temperature and photoperiod (day length), respectively. Rows of each matrix indicate days, and columns indicate environments. The first day (row) is the emergence day. The model is run for each environment successively and returns days to heading for each environment as a vector, DTH. Arguments G, Alpha, Beta, Po, and To are scalars and indicate G, α, β, Po, and To, respectively. The function returns Md + 1 if heading does not occur (Md is the number of days included in Temp and Photo).
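For readers without access to the box contents, the following is a minimal R sketch written directly from the equations in Section 6.2 and the argument description above. It is not the chapter's DVRmodel.R script: the function name `DVRmodel_sketch` is hypothetical, the defaults for Tb, Tc, Pb, and Pc follow the values stated in Section 6.4 (8, 42, 0, and 24), and the choice to test the photo-sensitive period against the stage accumulated before the current day is an assumption.

```r
# A sketch of the DVR model following the equations in Section 6.2.
# Not the chapter's Box 1 script; DVRmodel_sketch is a hypothetical name.
DVRmodel_sketch <- function(Temp, Photo, G, Alpha, Beta, Po, To,
                            Tb = 8, Tc = 42, Pb = 0, Pc = 24) {
  # Shared form of f(T) and g(P): 1 at the optimum, 0 outside [lo, hi]
  response <- function(x, lo, opt, hi, power) {
    if (x < lo || x > hi) return(0)
    shape <- (hi - opt) / (opt - lo)
    (((x - lo) / (opt - lo)) * ((hi - x) / (hi - opt))^shape)^power
  }
  DVS1 <- 0.145 * G + 0.005 * G^2   # start of the photo-sensitive period
  DVS2 <- 0.345 * G + 0.005 * G^2   # end of the photo-sensitive period
  Md  <- nrow(Temp)
  DTH <- rep(Md + 1, ncol(Temp))    # Md + 1 means heading did not occur
  for (e in seq_len(ncol(Temp))) {
    DVS <- 0
    for (d in seq_len(Md)) {
      DVR <- response(Temp[d, e], Tb, To, Tc, Alpha)
      # Assumption: the stage reached before day d decides sensitivity
      if (DVS >= DVS1 && DVS <= DVS2) {
        DVR <- DVR * response(Photo[d, e], Pb, Po, Pc, Beta)
      }
      DVS <- DVS + DVR
      if (DVS > G) {
        DTH[e] <- d
        break
      }
    }
  }
  DTH
}

# Two hypothetical environments: one at the optimum, one cooler with longer days
Temp  <- matrix(c(rep(30, 200), rep(25, 200)), nrow = 200)
Photo <- matrix(c(rep(12, 200), rep(14, 200)), nrow = 200)
DTH <- DVRmodel_sketch(Temp, Photo, G = 50, Alpha = 1, Beta = 1,
                       Po = 12, To = 30)
DTH
```

Under constant optimum conditions the developmental rate is 1 per day, so days to heading is governed by G alone. Departures from To or Po slow development and delay heading.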

Box 2 An Rcpp Script for the DVR Model (DVRmodel.cpp)

Temp and Photo are matrices that store daily mean temperature and photoperiod (day length), respectively. Rows of each matrix indicate days and columns indicate environments. The first day (row) is the emergence day. The model is run for each environment successively and returns days to heading as a vector, DTH. Arguments G, Alpha, Beta, Po, and To indicate G, α, β, Po, and To, respectively. The function returns Md + 1 if heading does not occur (Md is the number of days included in Temp and Photo).

6.3 Maize Growth Model

The maize growth model [1, 69] uses daily mean temperature (T, °C), solar radiation (SR, MJ/m2), and plant population (Pop, plants/m2) as inputs (Fig. 3). Plant growth is simulated based on thermal units (TU), the cumulative daily mean temperature above a base temperature (8 °C). Emergence occurs at 87 TU. Leaf number (LN) increases exponentially depending on TU until it reaches total (maximum) leaf number (TLN). Plant leaf area (PLA) is calculated by integrating the area of each leaf (AR) over LN. TU also promotes senescence, represented as the fraction of senescent leaf area (FAS), and LAI is calculated from PLA, FAS, and Pop. Then, SR, solar radiation use efficiency (SRE), and LAI are used to simulate the increase of biomass each day. Female flowering (silking) occurs when TU exceeds 67 after the end of leaf growth. Grains grow based on a thermal unit for grain (TUgrain), calculated using a base temperature of 0 °C, until TUgrain reaches the thermal units required for physiological maturity (MTU). HI increases linearly from 3 days after silking. TLN, area of the largest leaf (AM), SRE, and MTU are assumed to be genotype-specific. Muchow et al. [69] set SRE to 1.6 g MJ−1 and reduced it to 1.2 g MJ−1 once TUgrain exceeded 500. Examples of R and Rcpp scripts for this maize growth model are provided in Boxes 3 and 4, respectively.
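The thermal unit calculation that drives this model can be sketched as follows. Only the base temperature (8 °C) and the emergence threshold (87 TU) come from the description above. The function name `thermal_units`, the truncation of negative daily contributions at zero, and the constant temperature series are assumptions made for illustration.

```r
# Cumulative thermal units above a base temperature.
# Assumption: days cooler than the base contribute zero rather than
# a negative amount.
thermal_units <- function(temp, t_base = 8) {
  cumsum(pmax(temp - t_base, 0))
}

temp <- rep(20, 30)                   # hypothetical daily mean temperature
tu   <- thermal_units(temp)           # 12 thermal units accumulate per day
emergence_day <- which(tu >= 87)[1]   # first day with at least 87 TU
emergence_day
```

The same function with `t_base = 0` would give the thermal unit for grain (TUgrain) when applied to the days after silking.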

Fig. 3

Flow diagram of the maize growth model adapted from [69]. Abbreviations not explained in the diagram are TLN total leaf number; AM area of the largest leaf; Pop plant population (plants/m2); SRE solar radiation use efficiency; TUleaf TU at the end of leaf growth; DFS days from silking; MTU thermal units to physiological maturity; GW grain weight
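The thermal-unit bookkeeping described for the maize growth model can be sketched in plain C++ as follows. This is a minimal illustration and not the code of Boxes 3 and 4: the function names are hypothetical, clamping daily thermal units at zero on days colder than the base temperature is an assumption, treating "occurs at 87 TU" as "cumulative TU reaches or exceeds 87" is an assumption, and the convention of returning the number of days plus one when the threshold is never reached is borrowed from the DVR model boxes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Daily thermal unit: daily mean temperature minus a base temperature.
// Clamping at zero is an assumption of this sketch, not taken from the text.
double dailyThermalUnit(double meanTemp, double baseTemp) {
    return std::max(meanTemp - baseTemp, 0.0);
}

// Returns the 1-based day (counted from the first element, e.g., sowing) on
// which cumulative TU first reaches `threshold`. If the threshold is never
// reached, returns the number of days + 1 (hypothetical convention).
std::size_t dayThresholdReached(const std::vector<double>& dailyMeanTemp,
                                double baseTemp, double threshold) {
    assert(threshold > 0.0);
    double tu = 0.0;
    for (std::size_t d = 0; d < dailyMeanTemp.size(); ++d) {
        tu += dailyThermalUnit(dailyMeanTemp[d], baseTemp);
        if (tu >= threshold) return d + 1;
    }
    return dailyMeanTemp.size() + 1;
}

// SRE schedule reported for Muchow et al. [69]: 1.6 g/MJ, reduced to
// 1.2 g/MJ once TUgrain exceeds 500.
double radiationUseEfficiency(double tuGrain) {
    return (tuGrain > 500.0) ? 1.2 : 1.6;
}
```

For example, with a constant daily mean temperature of 20 °C, each day contributes 12 TU, so a call such as dayThresholdReached(temps, 8.0, 87.0) would place emergence on day 8 under these assumptions.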

Box 3 An R Script for the Maize Growth Model (MaizeGrowthModel.R)

Temp and SR are matrices that store daily mean temperature and solar radiation, respectively. Rows of each matrix indicate days and columns indicate environments. The first day (row) is the sowing day. Population is the vector including the plant populations (plants/m2) for each environment. The model is fitted for each environment successively and returns grain weight at maturity as a vector, GW.maturity. Arguments TLN, AM, SRE, and MTU are scalar values. The integral of plant leaf area is solved numerically with an arbitrary width (0.5).

Box 4 An Rcpp Script for the Maize Growth Model (MaizeGrowthModel.cpp)

Temp and SR of the function MGM are matrices that store daily mean temperature and solar radiation, respectively. Rows of each matrix indicate days and columns indicate environments. The first day (row) is the sowing day. Population is the vector including the plant populations (plants/m2) for each environment. The model is fitted for each environment successively and returns grain weight at maturity as a vector, GW.maturity. Arguments TLN, AM, SRE, and MTU are scalar values. The integral of plant leaf area is solved numerically with an arbitrary width (0.5).

6.4 Examples of CGM Fitting

The DVR and maize growth models in Boxes 1–4 can be fitted to observed data (days to heading for the DVR model and grain weight for the maize growth model) by wrapping them with R functions (Boxes 5 and 6). The wrapper functions (FitDVR and FitMGM) evaluate model performance using the mean squared error. Vectors of observed data (DTH and GW) are given as the last arguments. The length of these vectors is the number of environments, and the order of environments should be the same as in Temp, Photo, and SR. The model functions (DVRmodel and MGM) simulate phenotypes for each environment, so calculation time increases linearly with the number of environments; the accuracy of parameter estimation is also expected to improve as environments are added. The wrapper functions take a vector of parameters as the first argument, which is a requirement of optimizers such as optim in the stats package (Nelder–Mead optimization) and psoptim in the pso package (PSO). Here psoptim is used. Ranges of parameters α, β, and G in the DVR model were determined arbitrarily, and ranges of Po and To are defined by Pb, Pc, Tb, and Tc, set to 0, 24, 8, and 42, respectively. Ranges for maize growth model parameters were determined following [1]. Optimization depends on initial parameter values, so 10 sets of randomly assigned initial values are tested and the best result is retained. The number of initial value sets can be modified. After repeating the optimization for all individuals (genotypes) in the training set, GP models can be trained using the parameter estimates as phenotypes.
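The idea behind such wrapper functions can be sketched in plain C++ as follows. This is an illustration only and not the code of Boxes 5 and 6: the names CropModel, meanSquaredError, and fitObjective are hypothetical, and the model is represented as any function that maps a parameter vector to one simulated phenotype per environment. An optimizer would then minimize fitObjective over the parameter vector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A crop model seen by the optimizer: parameter vector in, one simulated
// phenotype per environment out (hypothetical interface for this sketch).
using CropModel =
    std::function<std::vector<double>(const std::vector<double>&)>;

// Mean squared error between simulated and observed phenotypes,
// averaged over environments.
double meanSquaredError(const std::vector<double>& simulated,
                        const std::vector<double>& observed) {
    assert(!observed.empty());
    assert(simulated.size() == observed.size());
    double sum = 0.0;
    for (std::size_t e = 0; e < observed.size(); ++e) {
        const double diff = simulated[e] - observed[e];
        sum += diff * diff;
    }
    return sum / static_cast<double>(observed.size());
}

// Objective in the form optimizers expect: the parameter vector comes first,
// and the returned scalar is the quantity to be minimized.
double fitObjective(const std::vector<double>& parameters,
                    const CropModel& model,
                    const std::vector<double>& observed) {
    return meanSquaredError(model(parameters), observed);
}
```

Keeping the parameter vector as the first argument mirrors the calling convention required by optim and psoptim, so the same objective can be reused across optimizers and across repeated starts from different initial values.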

Box 5 Fitting and Optimization of the DVR Model (RunDVRmodel.R)

DVRmodel.R and DVRmodel.cpp are the script files illustrated in Boxes 1 and 2, respectively. Rcpp and pso must be installed from CRAN (https://cran.r-project.org/) before running these scripts. The first argument of psoptim, rep(NA, 5), only provides the number of parameters and is included primarily for compatibility with optim (see the manual of the function).

Box 6 Fitting and Optimization of the Maize Growth Model (RunMaizeGrowthModel.R)

MaizeGrowthModel.R and MaizeGrowthModel.cpp are the script files illustrated in Boxes 3 and 4, respectively. Rcpp and pso must be installed from CRAN (https://cran.r-project.org/) before running these scripts. The first argument of psoptim, rep(NA, 4), only provides the number of parameters and is included primarily for compatibility with optim (see the manual of the function).

6.5 Examples of the Joint Approach

A joint approach has important advantages, but implementation is not an easy task. Probabilistic programming languages such as Stan [70] and Edward2 [71], which offer automatic parameter estimation for arbitrary user-specified models, can facilitate implementation. If R is the preferred environment, the R package GenomeBasedModel [61], designed for the implementation of a joint approach, can be used. This package allows automatic parameter estimation based on a variational Bayesian (VB) framework. Posterior distributions of marker effects on CGM parameters are approximated with VB methods, which are sufficiently fast to estimate genome-wide marker effects. In contrast, posterior distributions of the CGM parameters usually do not have closed forms, so rapid approximation is often infeasible. Such parameters are inferred with MCMC methods using genome-wide markers in the prior distributions. The two estimation steps (estimation of CGM parameters with MCMC and of marker effects on parameters with VB) are repeated until convergence (see [61] for detailed algorithms).

To use GenomeBasedModel, functions for CGMs should meet the requirements of the package. For either type of script (R or Rcpp), a function should take three arguments, Input, Freevec, and Parameter, and output a vector of phenotypic values. Input includes environmental and management conditions, Freevec can be used for any purpose within the function, and Parameter includes all parameter values. By default, the package fits the CGM function to each variety successively. This approach keeps the function simple but results in repeated function calls to fit the model for all varieties, which increases calculation time. The package therefore provides an alternative treatment that allows model fitting for all varieties in a single function call. In this case, the CGM function takes inputs and parameters for all varieties as matrices, repeats model fitting for all varieties, and returns phenotypic values for each environment as a matrix. Example functions of the DVR model and the maize growth model for this process are provided in Boxes 7 and 8, respectively. Only Rcpp scripts are supplied because they are more practical than R scripts for analyses that include many varieties, but both R and Rcpp functions can be used. GenomeBasedModel can handle log- and logit-transformed model parameters. In the DVR model, parameters α, β, and G are assumed to be log-transformed for genome-wide marker fitting. Thus, they are scaled back using an exponential function (Box 7).
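Scaling transformed parameters back inside the model function can be sketched in plain C++ as follows. This is a minimal illustration, not the package's code: the function name backTransform is hypothetical, the codes "log", "logit", and "nt" follow the Transformation argument described in this section, and the use of the standard inverse logit (mapping to the interval from 0 to 1 without further rescaling) is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <string>

// Maps a parameter from the scale used for marker fitting back to the
// scale used by the crop growth model.
//   "log"   : value was log-transformed, so exponentiate
//   "logit" : value was logit-transformed, so apply the inverse logit
//             (assumed here to map to (0, 1) with no further rescaling)
//   "nt"    : no transformation
double backTransform(double value, const std::string& code) {
    if (code == "log") return std::exp(value);
    if (code == "logit") return 1.0 / (1.0 + std::exp(-value));
    if (code == "nt") return value;
    throw std::invalid_argument("unknown transformation code: " + code);
}
```

A log transformation keeps a parameter strictly positive on the model scale while allowing marker effects to be fitted on an unbounded scale, which is one reason such a step might be placed at the top of a model function.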

Box 7 An Rcpp Script for the DVR Model Designed for GenomeBasedModel (DVRmodel_GBM.cpp)

The first argument Input is a (2 × Ne × Md) × Nl matrix where Ne is the number of environments, Md is the maximum day, and Nl is the number of varieties. The upper half from the first to the Ne × Mdth row of Input is temperature and the lower half is photoperiod. Both temperature and photoperiod for each environment from emergence to Md are vertically stacked. Each column of Input includes measurements for each variety. Freevec is a vector including Nl, Ne, and Md. Parameter is an Np × Nl matrix of parameters for all varieties. Np is the number of parameters (5 in this example).
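The row layout of Input can be made concrete with a small plain C++ sketch. This is an illustration under stated assumptions and not the code of Box 7: the type StackedInput and its member names are hypothetical, the matrix is assumed to be stored column-major (as R matrices are), and the environments are assumed to be stacked one after another, each occupying Md consecutive rows, with the second block of Ne × Md rows holding the second environmental input (photoperiod for the DVR model).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A (2 * Ne * Md) x Nl matrix stored column-major (one column per variety).
struct StackedInput {
    std::size_t Ne;              // number of environments
    std::size_t Md;              // maximum number of days per environment
    std::size_t Nl;              // number of varieties (columns)
    std::vector<double> values;  // column-major storage, rows() * Nl entries

    std::size_t rows() const { return 2 * Ne * Md; }

    // Element at a 0-based row and variety (column).
    double at(std::size_t row, std::size_t variety) const {
        assert(row < rows() && variety < Nl);
        assert(values.size() == rows() * Nl);
        return values[variety * rows() + row];
    }

    // Upper half: temperature for environment `env` on day `day` (0-based).
    double temperature(std::size_t env, std::size_t day,
                       std::size_t variety) const {
        assert(env < Ne && day < Md);
        return at(env * Md + day, variety);
    }

    // Lower half: the second input (e.g., photoperiod), offset by Ne * Md.
    double secondDriver(std::size_t env, std::size_t day,
                        std::size_t variety) const {
        assert(env < Ne && day < Md);
        return at(Ne * Md + env * Md + day, variety);
    }
};
```

In this sketch, the values Nl, Ne, and Md that the caption says are carried in Freevec are exactly what the accessors need to locate a given day of a given environment.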

Box 8 An Rcpp Script for the Maize Growth Model Designed for GenomeBasedModel (MaizeGrowthModel_GBM.cpp)

The first argument Input is a (2 × Ne × Md) × Nl matrix where Ne is the number of environments, Md is the maximum day, and Nl is the number of varieties. The upper half from the first to the Ne × Mdth row of Input is temperature, and the lower half is solar radiation. Both temperature and solar radiation for each environment from emergence to Md are vertically stacked. Each column of Input includes measurements for each variety. Freevec is a vector including Nl, Ne, Md, and populations at each environment. Parameter is an Np × Nl matrix of parameters for all varieties. Np is the number of parameters (4 in this example).

Several additional arguments required to run the GenomeBasedModel are listed below. See the manual of the package for details. Boxes 9 and 10 illustrate scripts to run GenomeBasedModel.

1. Input: a matrix with Nl columns, including environmental and management information, where Nl is the number of genotypes. Each column is used to fit the CGM for each genotype. For the DVR model, Input includes mean temperature and photoperiod; for the maize growth model, it includes mean temperature and solar radiation. See also the captions for Boxes 7 and 8.

2. Freevec: a vector used for any purpose in the model function. In the DVR and maize growth models, Freevec is used to define partitions in Input (e.g., which elements in Input are temperature and photoperiod). See also the captions for Boxes 7 and 8.

3. Y: a (Ne + 1) × Nl matrix, including phenotypic values (e.g., days to heading for the DVR model), where Ne is the number of environments. The first row is the IDs of genotypes (must be numeric).

4. Missing: a scalar value indicating missing records in Y. NA is not allowed.

5. Np: an integer indicating the number of parameters (five for the DVR model and four for the maize growth model).

6. Geno: a (Nm + 1) × Nl matrix, including marker genotypes, where Nm is the number of markers. The first row is the IDs of varieties (must be numeric). Missing values in Geno are not allowed.

7. Methodcode: a vector of length Np, specifying methods for regression of markers on model parameters. GenomeBasedModel offers multiple choices for regression methods, including Bayesian lasso [72], extended Bayesian lasso [73], Bayesian alphabets from A to C [42, 74], and GBLUP. Codes are assigned to regression methods, and users specify a code for each parameter. Whole-genome regression methods, such as BayesB, BayesC, and extended Bayesian lasso, can be used to detect QTLs [61]. Normal prior distributions that do not depend on genome-wide markers can also be specified.

8. Referencevalues: a vector of length Np, including reference (typical) values of model parameters. These values are used to specify prior distributions and to check the model function.

The argument, PassMatrix, indicates that the model is fitted for all varieties within the model function. In Box 9, an additional argument, Transformation, specifies whether model parameters are transformed in marker fitting. “log” and “logit” indicate log and logit transformation, respectively, and “nt” indicates no transformation. The first three parameters (α, β, and G) are log-transformed and thus, “log” is assigned (Box 7).

Box 9 An R Script for GenomeBasedModel for the DVR Model (RunDVRmodel_GBM.R)

See the text for arguments. MCMC is used for model parameter estimation, and GenomeBasedModel returns MCMC samples for each parameter. Posterior means of parameters can be estimated from MCMC samples. Different regression methods are specified for each model parameter with Methodcode.
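Estimating posterior means from MCMC samples amounts to averaging the retained samples of each parameter, as the following plain C++ sketch shows. It is a minimal illustration: the function name posteriorMeans and the layout of the samples (one inner vector per MCMC sample, one entry per parameter) are hypothetical and do not describe the object returned by GenomeBasedModel, and the optional discarding of initial samples as burn-in is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages MCMC samples per parameter.
// samples[s][p] is the value of parameter p in sample s (hypothetical layout).
// The first `burnIn` samples are discarded; pass 0 to keep all samples.
std::vector<double> posteriorMeans(
    const std::vector<std::vector<double>>& samples, std::size_t burnIn) {
    assert(burnIn < samples.size());
    const std::size_t nParams = samples.front().size();
    std::vector<double> means(nParams, 0.0);
    for (std::size_t s = burnIn; s < samples.size(); ++s) {
        assert(samples[s].size() == nParams);
        for (std::size_t p = 0; p < nParams; ++p) means[p] += samples[s][p];
    }
    const double kept = static_cast<double>(samples.size() - burnIn);
    for (double& m : means) m /= kept;
    return means;
}
```

If parameters were sampled on a transformed scale, whether to average before or after back-transformation is a modeling choice that this sketch does not address.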


Box 10 An R Script to Run GenomeBasedModel for the Maize Growth Model (RunMaizeGrowthModel_GBM.R)

See the text for arguments. MCMC is used for model parameter estimation, and GenomeBasedModel returns MCMC samples for each parameter. Posterior means of parameters can be estimated from MCMC samples. Different regression methods are specified for each model parameter with Methodcode. Returned objects differ depending on the regression method.


7 Concluding Remarks

CGMs are attractive tools for modeling G×E interactions. Many studies have connected CGMs with genetics, but integration of CGMs with GP, which is a purely prediction-oriented approach, has a relatively short history. Thus, ideas on how to integrate these two methods to predict crop performance have not yet been exhausted. As reviewed here, there are two ways of integration: GP-assisted CGMs and CGM-assisted GP. The former can be conducted with either independent or joint approaches, and various options exist for parameter estimation (e.g., Bayesian or non-Bayesian). For CGM-assisted GP, approaches for integration other than those considered to date might be developed, and CGM-assisted GP is likely to accommodate more variations than GP-assisted CGMs. Given global environmental changes, considering environmental information is essential for plant breeding. In such an era, integration of CGMs and GP can make a substantial contribution to this effort. To this end, exploration of new approaches, along with fair and comprehensive comparisons to conventional approaches, is essential. A key issue is collaboration between CGM modelers and quantitative geneticists.

8 Script Availability

All scripts illustrated in the Boxes are available at: https://github.com/Onogi/IntegratingCGMwithGP.