Introduction

Groundnut (Arachis hypogaea L.) is a self-pollinated crop cultivated in > 100 countries worldwide. In 2018, it occupied a global area of 28.5 million ha and produced 45.95 million tons, with a productivity of 1.61 tons/ha (http://www.fao.org/faostat/en/#data/QC). In Africa and Asia, groundnut is cultivated mostly by smallholder farmers under rainfed conditions with limited resources and inputs. Given the strength of genomics-based approaches for robust and precise selection of breeding progenies (Pandey et al. 2012a; Varshney et al. 2013), the selection of parents and of individuals in segregating breeding populations can be made more precise and efficient.

The last decade witnessed rapid development of genomic resources such as large-scale molecular markers (Wang et al. 2012), genetic maps (Gautami et al. 2012) and genome sequences (Bertioli et al. 2019; Chen et al. 2019; Zhuang et al. 2019), and their deployment in genomics-assisted breeding (GAB) in groundnut (see Pandey et al. 2016, 2020; Varshney 2016; Varshney et al. 2019). There are three GAB approaches, namely marker-assisted backcrossing (MABC), marker-assisted recurrent selection (MARS) and genomic selection (GS). MABC and MARS require trait association, while GS does not need such analysis. Because MABC and MARS are limited in their ability to capture small-effect genetic factors, GS has emerged as the most promising, efficient and cost-effective breeding approach, as it captures both small- and large-effect genetic factors. GS promises to achieve higher genetic gains for complex traits (Meuwissen et al. 2001; Heffner et al. 2009; Bernardo 2010; Shikha et al. 2017; Wang et al. 2019), including in legumes (Li et al. 2018). GS uses genetic markers distributed uniformly across the genome to predict genomic estimated breeding values (GEBVs) using multiple methods with varying degrees of complexity, computational efficiency and predictive accuracy (see Jannink et al. 2010; Desta and Ortiz 2014; Wang et al. 2018). In addition, GS is the only modern genomics-based approach with the potential to accumulate thousands of favorable alleles to develop resilient crop lines with high yield potential. This approach has been utilized extensively in livestock breeding (Hays and Goddard 2010; van der Werf 2013; Hays et al. 2013; Meuwissen et al. 2016) and is still evolving in plant breeding. If integrated with rapid generation advancement technology such as speed breeding, GS can have a remarkable positive impact on breeding programs (Watson et al. 2019), including in groundnut (Pandey et al. 2020).

The lessons from genomic prediction strategies in successful animal breeding programs can readily be translated to the deployment of genomic prediction-based breeding in crops (Hickey et al. 2017; Xu et al. 2020). To evaluate the factors that influence prediction, many studies have been conducted to choose appropriate GS models and criteria (Burgueño et al. 2012; Heslot et al. 2012; Jarquín et al. 2014). Such efforts have been seen in the last few years in several crop plants such as maize (Sun et al. 2019; Millet et al. 2019), wheat (Song et al. 2017; Norman et al. 2018), rice (Cerrudo et al. 2018; Bhandari et al. 2019), barley (Nielsen et al. 2016), oats (Asoro et al. 2011, 2013), oil palm (Wong and Bernardo 2008) and chickpea (Roorkiwal et al. 2018). To enhance the precision of predicting GEBVs in the breeding population, it is important to achieve a higher correlation between the GEBVs estimated on the training population (TP) and in the validation sets during cross-validation.

The major problem for the improvement of quantitative traits in crop breeding has been the presence of large genotype × environment (G × E) interaction effects, which often complicate trait expression by adversely affecting heritability and response to selection, resulting in low genetic gain. G × E effects pose a serious challenge to the prediction of GEBVs in GS breeding. Significant variation among environments is expected because of varied climatic conditions, and it becomes very difficult to optimize GS models for such environments when complete information across germplasm sets and target environments is not available for use in modeling. In such scenarios, robust genomic prediction models that account for G × E interactions are required to facilitate implementation of GS breeding across germplasm sets and environments. A few GS models have been developed that incorporate a G × E interaction component, either by using structured covariances to model relationships among environments (Burgueño et al. 2012) or by including environmental information to model relationships via covariance structures (Jarquín et al. 2014). Therefore, in order to initiate GS breeding in groundnut, it is of utmost importance to assess the potential and comparative performance of such promising models using multi-season phenotyping and high-density genotyping data on a sizeable training population. In this context, a training set of 340 diverse and elite groundnut genotypes was extensively phenotyped for important breeding traits and genotyped with the high-density ‘Axiom_Arachis’ array containing > 58 K highly informative genome-wide single nucleotide polymorphism (SNP) markers. Four different GS models were tested on this training set with three cross-validation (CV) scenarios mimicking prediction problems such as prediction of tested genotypes in tested environments, untested genotypes in tested environments and tested genotypes in untested environments. The best performing GS models can be used for initiating GS breeding for improving complex traits to achieve higher genetic gains in groundnut.

Materials and methods

Constitution of training set and phenotyping

A genomic selection training population (GSTP) was constituted with 340 groundnut genotypes, including elite breeding lines from the groundnut breeding programs of the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad; the University of Agricultural Sciences (UAS), Dharwad; and the Indian Council of Agricultural Research-Directorate of Groundnut Research (ICAR-DGR), Junagadh, along with some accessions from the ICRISAT gene bank (that are used in breeding programs) and popular cultivars from India (Supplementary Table 1). This training population includes 227 lines from subspecies fastigiata and 113 lines from subspecies hypogaea and has variation for the key agronomic traits targeted by the Indian groundnut breeding programs. From the perspective of botanical varieties, 212 lines belong to vulgaris (Spanish bunch), 111 to hypogaea, 10 to fastigiata (Valencia), four to peruviana and a single representative line each to aequatoriana, hirsuta and an unknown botanical type (Chaudhari et al. 2019). These lines were phenotyped for 11 agronomic, 7 quality and 6 foliar fungal disease resistance traits at the Patancheru, Aliyarnagar and Jalgaon locations in India during two seasons (Rainy 2015 and Post-Rainy 2015–2016). The experimental trials were conducted in an alpha lattice design with two replications. The detailed procedure for conducting the trials, along with the phenotyping of disease resistance at three different time intervals each for rust (rust@75 days, rust@90 days and rust@105 days) and late leaf spot (late leaf spot@75 days, late leaf spot@90 days and late leaf spot@105 days), can be found in Chaudhari et al. (2019). The data on agronomic traits included days to 50% flowering, days to maturity, primary branches/plant, pods/plant, plant height (cm), pod yield/plant (g), shelling %, hundred seed weight (g), seed yield/plant (g), total yield/plot (g) and pod yield/ha (kg), recorded from both replications across environments. The oil quality traits, including oleic acid, linoleic acid, oleic/linoleic acid ratio, palmitic acid, stearic acid, oil content and protein content, were estimated using near-infrared reflectance spectroscopy (NIRS).

Genotyping with Axiom_Arachis SNP array and SNP allele calling

High-quality genomic DNA was isolated from leaves collected from 15-day-old seedlings using a high-throughput mini-DNA extraction method (Pandey et al. 2012b). The quality and quantity of DNA were assessed using a spectrophotometer (Shimadzu UV160A, Japan). High-density genotyping data were generated for 318 lines using high-quality DNA samples with the Axiom_Arachis SNP array (Pandey et al. 2017; Clevenger et al. 2017) containing 58 K highly informative genome-wide SNPs (Supplementary Table 2). SNP genotyping on the Affymetrix GeneTitan® platform and SNP calling were performed following the methods explained in Pandey et al. (2017). In brief, the target probes were prepared for all the 318 lines, followed by amplification, fragmentation, hybridization on the chip, extension through DNA ligation and signal amplification. Staining and scanning of the samples were performed on the GeneTitan® Multi-Channel Instrument. The software Axiom™ Analysis Suite version 1.0 was used for allele calling for all the 318 lines of the GSTP. The quality control (QC) analysis of samples was performed using the ‘Best Practices’ workflow to select samples that passed the QC test. The genotype calls were produced using the ‘Sample QC’ workflow, followed by the ‘Genotyping’ workflow to perform genotyping on the imported CEL files. Finally, the ‘Summary Only’ workflow was used to produce a summary and to retrieve SNP data for further analysis at DQC > 0.75 and call rates > 90. These criteria helped in removing the SNPs with low call rates, thus keeping only the high-quality SNPs for further analysis.

Statistical genomic-enabled prediction models

A total of four genomic selection models were tested using the genotyping and phenotyping data on the training set, as explained in Jarquín et al. (2014) and Roorkiwal et al. (2018). Of these four models, two are main-effect models and two include genomic × environment interactions. These models are: (1) model 1 (M1 = E + L), which includes the main effects of environments (E) and lines (L); (2) model 2 (M2 = E + L + G), which includes the main effects of markers (G) in addition to environments (E) and lines (L); (3) model 3 (M3 = E + L + G + GE), an informed interaction model; and (4) model 4 (M4 = E + L + G + LE + GE), a naïve and informed interaction model.

The Bayesian Generalized Linear Regression (BGLR) R-package (de los Campos et al. 2013; Pérez-Rodríguez et al. 2015) was used for performing the entire analysis with these four GS models. The scripts for these four GS models have already been made publicly available by Pérez-Rodríguez et al. (2015), and technical details for these GS models are provided in Roorkiwal et al. (2018). A brief statistical description of the four models (M1–M4) is given below, in addition to the conventional baseline model. In the baseline model, the response of the jth (j = 1,…,J) genotype evaluated in the ith (i = 1,…,I) environment, \( y_{ij} \), is the sum of an overall mean μ and random deviations around zero due to the environment, the line, their interaction and a residual. The environmental effects are assumed independent and identically distributed (iid) as \( E_{i} \sim N(0, \sigma_{E}^{2} ) \), where \( \sigma_{E}^{2} \) is the environmental variance; the line effects are assumed iid \( L_{j} \sim N(0, \sigma_{L}^{2} ) \), where \( \sigma_{L}^{2} \) is the variance of the lines; the interaction between the jth genotype and the ith environment is also iid \( EL_{ij} \sim N(0, \sigma_{EL}^{2} ) \), where \( \sigma_{EL}^{2} \) is the interaction variance; and the random error term is assumed iid \( e_{ij} \sim N(0, \sigma_{e}^{2} ) \):

$$ y_{ij} = \mu + E_{i} + L_{j} + EL_{ij} + e_{ij} $$

Evidently, this model does not allow borrowing of information among lines because they are treated as independent outcomes. The following models were derived from the baseline model by removing or adding terms, or by modifying the underlying assumptions.

Model 1 (M1): environment + line main effects (E + L)

This model is obtained from the baseline model by retaining the first three components, while their underlying assumptions remain unchanged.

$$ y_{ij} = \mu + E_{i} + L_{j} + e_{ij} \tag{1} $$

Model 2 (M2): environment + line + genomic main effects (E + L + G)

Genomic information can be introduced into model M1 by adding a linear combination of the markers and their corresponding marker effects, \( g_{j} = \sum\nolimits_{m = 1}^{p} {x_{jm} b_{m} } \), which gives the following linear predictor:

$$ y_{ij} = \mu + E_{i} + L_{j} + g_{j} + e_{ij} \tag{2} $$

where \( b_{m} \mathop \sim\limits^{iid} N(0,\sigma_{b}^{2} ) \) represents the random effect of the mth (m = 1,…,p) marker and \( \sigma_{b}^{2} \) its corresponding variance component. Using results from the multivariate normal distribution, the vector of genetic effects \( {\mathbf{g}} = ( g_{1} , \ldots ,g_{J} )^{\prime} \) follows a normal density with zero mean vector and covariance matrix \( {\text{Cov}}({\mathbf{g}}) = {\mathbf{G}}\sigma_{g}^{2} \), where \( {\mathbf{G}} = \frac{{\mathbf{X}}{\mathbf{X}}^{\prime}}{p} \) is the genomic relationship matrix, which describes genetic similarities among pairs of individuals. Here, X represents the matrix of marker genotypes, centered and standardized by columns, and \( \sigma_{g}^{2} = p \times \sigma_{b}^{2} \) is the corresponding variance component, such that \( {\mathbf{g}} = \{ g_{j} \} \sim N({\mathbf{0}},{\mathbf{G}}\sigma_{g}^{2} ) \). The line effect \( L_{j} \) is retained in this model to account for imperfect information and model mis-specification due to imperfect linkage disequilibrium.
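As an illustration of how the genomic relationship matrix G = XX′/p can be built from a marker matrix, the following is a minimal pure-Python sketch. The toy genotype matrix, the 0/1/2 allele-dosage coding and the function names are assumptions made for this example; they are not taken from the study, which used the BGLR R-package scripts.

```python
import math


def standardize_columns(markers):
    """Center each marker column to mean 0 and scale it to unit variance."""
    n_lines = len(markers)
    n_markers = len(markers[0])
    columns = []
    for m in range(n_markers):
        col = [markers[j][m] for j in range(n_lines)]
        mean = sum(col) / n_lines
        # Population standard deviation (divide by n); an assumption of this sketch.
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n_lines)
        if sd == 0:
            raise ValueError(f"marker {m} is monomorphic and cannot be standardized")
        columns.append([(v - mean) / sd for v in col])
    # Transpose back to a lines x markers layout.
    return [[columns[m][j] for m in range(n_markers)] for j in range(n_lines)]


def genomic_relationship(markers):
    """Return G = X X' / p for a lines x markers genotype matrix."""
    X = standardize_columns(markers)
    n_lines, p = len(X), len(X[0])
    return [
        [sum(X[a][m] * X[b][m] for m in range(p)) / p for b in range(n_lines)]
        for a in range(n_lines)
    ]


# Hypothetical toy data: 4 lines x 3 SNPs coded as allele dosage (0, 1, 2).
toy_markers = [
    [0, 2, 1],
    [2, 0, 1],
    [0, 2, 0],
    [2, 0, 2],
]
G = genomic_relationship(toy_markers)
for row in G:
    print([round(v, 3) for v in row])
```

With this standardization the diagonal of G averages one, so the variance component attached to G is on the scale of a single line's genomic value.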

Model 3 (M3): environment + line + genomic + genomic × environment interaction [E + L + G + (G × E)]

This model extends model M2 by adding the genomic × environment interaction as follows:

$$ y_{ij} = \mu + E_{i} + L_{j} + g_{j} + Eg_{ij} + e_{ij} \tag{3} $$

The main disadvantage of the previous models is that they consider only the main effect of the lines/genotypes across environments and ignore the specific response of each genotype in each environment. To overcome this issue, the G × E interaction is introduced via covariance structures, as shown by Jarquín et al. (2014). Here, the interaction component \( EL_{ij} \) of the baseline model is replaced by \( Eg_{ij} \), where \( \varvec{Eg} = \{ Eg_{ij} \} \sim N(0,(\varvec{Z}_{\varvec{g}} \varvec{GZ}_{\varvec{g}}^{{\prime }} ) \circ (\varvec{Z}_{\varvec{E}} \varvec{Z}_{\varvec{E}}^{{\prime }} )\sigma_{{\varvec{Eg}}}^{2} ) \), \( \varvec{Z}_{\varvec{g}} \) and \( \varvec{Z}_{\varvec{E}} \) are the corresponding incidence matrices for genotypes and environments, \( \sigma_{Eg}^{2} \) is the associated variance component for this interaction, and ‘\( \circ \)’ represents the Hadamard or Schur product (element-by-element product) between two matrices.
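The covariance structure of the interaction term can be made concrete with a small sketch. The code below builds the incidence matrices for a toy set of records and forms the Hadamard product of \( Z_g G Z_g' \) and \( Z_E Z_E' \). The two-line, two-environment layout, the values in G and all function names are hypothetical and serve only to illustrate the construction.

```python
def incidence(labels, n_levels):
    """One row per record, with a 1 in the column of that record's level."""
    return [[1.0 if lab == k else 0.0 for k in range(n_levels)] for lab in labels]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def hadamard(A, B):
    """Element-by-element (Schur) product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def gxe_covariance(G, line_of_record, env_of_record, n_envs):
    """Return (Zg G Zg') o (ZE ZE') for the given records."""
    Zg = incidence(line_of_record, len(G))
    ZE = incidence(env_of_record, n_envs)
    genomic_part = matmul(matmul(Zg, G), transpose(Zg))
    env_part = matmul(ZE, transpose(ZE))
    return hadamard(genomic_part, env_part)


# Hypothetical example: 2 lines observed in 2 environments (4 records).
G_toy = [[1.0, 0.5],
         [0.5, 1.0]]
lines = [0, 1, 0, 1]   # line index of each record
envs = [0, 0, 1, 1]    # environment index of each record
K = gxe_covariance(G_toy, lines, envs, n_envs=2)
for row in K:
    print(row)
```

In the resulting matrix, two records share interaction covariance only when they come from the same environment, and the size of that covariance is the genomic relationship between the two lines.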

Model 4 (M4): environment + line + genomic + genomic × environment + line × environment interaction [E + L + G + (G × E) + (L × E)]

This model extends model M3 by adding the line × environment interaction as follows:

$$ y_{ij} = \mu + E_{i} + L_{j} + g_{j} + Eg_{ij} + EL_{ij} + e_{ij} $$

where all the terms have been previously defined.
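The nesting of the four models can be summarized in a short sketch that assembles each linear predictor from the same set of component effects. The term labels follow the model names used in the text (E, L, G, GE, LE); the numerical effect values and the function name are made up for illustration only.

```python
# Terms included in each model, using the labels from the text.
MODEL_TERMS = {
    "M1": ("E", "L"),
    "M2": ("E", "L", "G"),
    "M3": ("E", "L", "G", "GE"),
    "M4": ("E", "L", "G", "GE", "LE"),
}


def linear_predictor(model, mu, effects):
    """Sum the overall mean and the effects that the chosen model includes."""
    return mu + sum(effects[term] for term in MODEL_TERMS[model])


# Hypothetical effects for one line j in one environment i.
example_effects = {"E": 1.0, "L": 0.5, "G": 0.25, "GE": -0.1, "LE": 0.05}
for name in MODEL_TERMS:
    print(name, linear_predictor(name, mu=10.0, effects=example_effects))
```

Each model contains every term of the one before it, which is why M2, M3 and M4 can be read as successive extensions of M1.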

Assessing different prediction problems using various cross-validation strategies

The four GS models described above (E + L, E + L + G, E + L + G + GE and E + L + G + LE + GE) were evaluated in the training set using three different cross-validation (CV) schemes, namely CV0, CV1 and CV2. Random CV2 represents incomplete field trials in which some lines are observed in some environments but not in others; the goal is to predict the performance of these lines in environments where they have not yet been phenotyped. Random CV1 represents newly developed lines that have not yet been phenotyped in any environment; here, prediction relies on the genetic similarity between observed and unobserved genotypes as the main source of information. CV0 represents the prediction of already observed lines in unobserved environments, where the main interest is the performance of lines in potentially new environments.

For random cross-validation CV1 and CV2, the prediction accuracies of the four models were computed by performing random fivefold cross-validation where the performance of 20% of the lines (testing set) was predicted considering the remaining 80% observed lines as training set. For CV1, none of the 20% of the lines in the testing set were observed in any of the environments (combination), whereas for CV2, the 20% of the lines in the testing set were observed in some environments but not in the others. The prediction accuracy is obtained as the average Pearson’s correlations between the observed breeding values and predicted GEBVs.
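The three schemes differ only in what is held out, which the following sketch tries to make explicit. It is a simplified illustration under stated assumptions: lines and environments are plain labels, folds are assigned by shuffling with Python's `random` module, and the per-fold Pearson correlations are averaged. The function names are hypothetical, and the study's actual partitioning in the BGLR scripts may differ in detail (for example, in how CV2 records are balanced across folds).

```python
import math
import random


def cv0_splits(envs):
    """CV0: hold out one whole environment at a time and train on the rest."""
    return [([e for e in envs if e != held_out], held_out) for held_out in envs]


def cv1_folds(lines, n_folds=5, seed=0):
    """CV1: all records of a line share one fold, so test lines are never observed."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return {line: k % n_folds for k, line in enumerate(shuffled)}


def cv2_folds(lines, envs, n_folds=5, seed=0):
    """CV2: single (line, environment) records are spread over the folds."""
    rng = random.Random(seed)
    records = [(line, env) for line in lines for env in envs]
    rng.shuffle(records)
    return {rec: k % n_folds for k, rec in enumerate(records)}


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_fold_accuracy(observed, predicted, fold_of):
    """Average the Pearson correlation computed separately within each fold."""
    correlations = []
    for fold in sorted(set(fold_of.values())):
        keys = [k for k in fold_of if fold_of[k] == fold]
        correlations.append(
            pearson([observed[k] for k in keys], [predicted[k] for k in keys])
        )
    return sum(correlations) / len(correlations)


# Hypothetical labels: 10 lines and 4 environments.
toy_lines = [f"line{j}" for j in range(10)]
toy_envs = ["Env1", "Env2", "Env3", "Env4"]
print(cv0_splits(toy_envs)[0])
print(cv1_folds(toy_lines))
print(len(cv2_folds(toy_lines, toy_envs)))
```

With five folds, each fold holds 20% of the lines (CV1) or of the line-by-environment records (CV2), matching the 80/20 split described above.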

Results

Identification of genetic polymorphism and phenotypic variation in training population

Genotyping data from the SNP array were generated for 318 lines, while phenotyping data were generated for 340 lines. Genotyping of the 318 lines with the Axiom_Arachis SNP array identified 13,355 polymorphic SNPs. The phenotypic data generated on the 340 lines showed wide genetic variation for different agronomic, quality and foliar disease resistance traits. All the 11 agronomic traits showed high (75–90%) to very high (> 90%) heritability, namely days to maturity (96.6%), hundred seed weight (93.4%), plant height (92.3%), yield/ha (89.7%), total yield/plant (89.3%), pod yield/plant (85.8%), pods/plant (85.0%), days to 50% flowering (84.8%), seed yield/plant (84.6%), shelling % (82.9%) and primary branches/plant (78.7%) (Supplementary Table 3). Among the 7 quality traits, the highest heritability was observed for oleic/linoleic acid ratio (96.7%), followed by palmitic acid (84.0%), oleic acid (82.1%), linoleic acid (81.7%), oil content (78.6%) and stearic acid (77.5%), while protein content (57.4%) recorded medium heritability. The foliar disease resistance traits recorded high heritability at different days after sowing (80.4% for rust@75 days, 84.2% for rust@90 days, 82.7% for rust@105 days, 83.9% for LLS@90 days, 79.7% for LLS@105 days and 74.5% for LLS@75 days).

Comparative performance of four GS models under three cross-validation schemes

The prediction accuracies estimated by the four models indicated a clear advantage of including marker information, reflected in the better prediction accuracy achieved by the E + L + G, informed interaction (E + L + G + GE) and naïve and informed interaction (E + L + G + LE + GE) models as compared to the E + L model. The detailed results for schemes CV0 (Table 1; Fig. 1a), CV1 (Table 2; Fig. 1b) and CV2 (Table 3; Fig. 1c) are summarized in Table 4 and Fig. 2.

Table 1 Mean correlations from tenfold cross-validation between the predicted and the observed values for four models (M1–M4) for unobserved environment (CV0) in different agronomic, quality and disease resistance traits of groundnut
Fig. 1 Cross-validation correlations between the predicted and the observed values for (a) unobserved environment (CV0), (b) untested genotypes (CV1) and (c) unevaluated environment (CV2) for different agronomic, quality and disease resistance traits of groundnut

Table 2 Mean correlations from tenfold cross-validation between the predicted and the observed values for four models (M1–M4) for some untested lines (CV1) in different agronomic, quality and disease resistance traits of groundnut
Table 3 Mean correlations from tenfold cross-validation between the predicted and the observed values for four models (M1–M4) for some lines evaluated in some environments (CV2) in different agronomic, quality and disease resistance traits of groundnut
Table 4 Comparative prediction accuracy by four models (M1 = E + L, M2 = E + L + G, M3 = E + L + G + GE and M4 = E + L + G + GE + LE) and three cross-validation schemes (CV0, CV1 and CV2) in groundnut
Fig. 2 Comparative performance of four genomic prediction models in three different cross-validation scenarios in groundnut training population

Performance of four GS models for unobserved environment (CV0 scheme)

In general, the prediction values across four environments with four GS models were found consistent for CV0 scheme (Table 1). The exceptions in consistent prediction with all the four models were observed for days to 50% flowering for Env2 (Jalgaon, Rainy 2015), and days to maturity, hundred seed weight, total yield/plant, yield/ha, oil content, protein content for Env4 (Patancheru, Post-Rainy 2015–2016) (Table 1).

High prediction accuracy (> 0.600) across the four models was observed for days to 50% flowering (0.659–0.673), days to maturity (0.709–0.732), primary branches/plant (0.679–0.690), plant height (0.643–0.647), hundred seed weight (0.673–0.678), oleic acid (0.788–0.792), linoleic acid (0.764–0.769), oleic/linoleic acid ratio (OLR) (0.759–0.763), palmitic acid (0.821–0.823), stearic acid (0.717–0.720), oil content (0.672–0.677), rust@90 days (0.730–0.752), rust@105 days (0.721–0.739) and late leaf spot@90 days (0.708–0.728) (Tables 1, 4). The traits pods/plant (0.442–0.484), shelling % (0.475–0.485), total yield/plant (0.507–0.534), yield/ha (0.507–0.534), protein content (0.415–0.423), rust@75 days (0.459–0.538), late leaf spot@75 days (0.499–0.538) and late leaf spot@105 days (0.507–0.534) obtained medium (0.400–0.600) prediction accuracy. Two important traits in breeding programs, pod yield/plant (0.334–0.381) and seed yield/plant (0.348–0.389), obtained low (< 0.400) prediction accuracy (Tables 1, 4). In the current study, all the traits showed high heritability (> 75%) except protein content (57.4%). Notably, despite their high heritability, pods/plant, shelling %, total yield/plant, yield/ha, rust@75 days, late leaf spot@75 days, late leaf spot@105 days, pod yield/plant and seed yield/plant achieved only medium or low prediction accuracy (Table 4).

Performance of different GS models for untested genotypes (CV1 scheme)

In CV1, the model E + L yielded negative prediction values for all the traits studied. Among the other three GS models, the prediction values across the four environments were less consistent for the CV1 scheme (Table 2) than for CV0 (Table 1). Exceptions to consistent prediction across the models were observed for pods/plant, pod yield/plant, shelling % and hundred seed weight for Env1; days to 50% flowering, plant height, rust@90 days and late leaf spot@75 days for Env2 (Jalgaon, Rainy 2015); pods/plant and palmitic acid for Env3; and days to maturity, plant height, pod yield/plant, hundred seed weight, seed yield/plant, total yield/plant, yield/ha, oil content and protein content for Env4 (Patancheru, Post-Rainy 2015–2016) (Table 2).

High prediction accuracy (> 0.600) across the three models was observed only for disease scores, i.e., rust@90 days (0.623–0.624), rust@105 days (0.638–0.646) and late leaf spot@90 days (0.624–0.629) (Tables 2, 4). A majority of the traits, namely days to 50% flowering (0.501–0.503), days to maturity (0.466–0.489), primary branches/plant (0.531–0.540), pods/plant (0.453–0.471), pod yield/plant (0.374–0.423), hundred seed weight (0.430–0.469), seed yield/plant (0.375–0.422), total yield/plant (0.486–0.533), yield/ha (0.486–0.533), oleic acid (0.493–0.496), linoleic acid (0.466–0.468), OLR (0.469–0.473), palmitic acid (0.465–0.468), rust@75 days (0.445–0.488), late leaf spot@75 days (0.465–0.495) and late leaf spot@105 days (0.558–0.579), obtained medium (0.400–0.600) prediction accuracy. Low (< 0.400) prediction accuracy was observed for plant height (0.367–0.373), shelling % (0.326–0.335), stearic acid (0.254–0.273), oil content (0.344–0.362) and protein content (0.146–0.222) (Tables 2, 4). Among the highly heritable traits (> 75%), only rust@90 days, rust@105 days and late leaf spot@90 days achieved high prediction accuracy (Table 4).

Performance of different GS models for unevaluated environment (CV2 scheme)

In general, the prediction values across four environments with four GS models were found consistent for CV2 scheme (Table 3). The exceptions to consistent prediction with all the four models were observed for pod yield/plant, and seed yield/plant in Env1; days to 50% flowering, plant height, hundred seed weight, rust@75 days, rust@90 days, late leaf spot@75 days and late leaf spot@90 days for Env2 (Jalgaon, Rainy 2015); and days to maturity, plant height, shelling  %, hundred seed weight, seed yield/plant, total yield/plant, yield/ha, stearic acid and oil content for Env4 (Patancheru, Post-Rainy 2015–2016) (Table 3).

High prediction accuracy (> 0.600) across the four models was observed for days to 50% flowering (0.657–0.672), days to maturity (0.731–0.769), primary branches/plant (0.675–0.695), plant height (0.640–0.659), hundred seed weight (0.670–0.721), oleic acid (0.787–0.791), linoleic acid (0.762–0.769), OLR (0.757–0.765), palmitic acid (0.820–0.826), stearic acid (0.717–0.738), oil content (0.672–0.699), rust@90 days (0.744–0.756), rust@105 days (0.718–0.751) and late leaf spot@90 days (0.710–0.735) (Tables 3, 4). The traits pods/plant (0.434–0.511), shelling % (0.468–0.505), total yield/plant (0.499–0.603), yield/ha (0.499–0.603), protein content (0.411–0.461), rust@75 days (0.489–0.541), late leaf spot@75 days (0.499–0.538) and late leaf spot@105 days (0.572–0.654) obtained largely medium (0.400–0.600) prediction accuracy. The lowest prediction accuracy was observed for pod yield/plant (0.321–0.454) and seed yield/plant (0.336–0.462) (Tables 3, 4). Among the highly heritable traits (> 75%), pod yield/plant and seed yield/plant showed low prediction accuracy (Table 4).

Comparative prediction accuracy across models and cross-validation schemes

Among the four GS models tested for 24 traits, the model (E + L) (0.613) performed marginally better in general across traits than the models (E + L + G) (0.571), (E + L + G + GE) (0.577) and (E + L + G + LE + GE) (0.581) (Table 5). However, the model (E + L) failed completely in cross-validation scheme CV1, where it yielded negative predictions. In general, the predictions were consistent across the different models and cross-validation schemes (except model M1 for CV1) for the different traits. However, there were large variations in the predictions obtained for different traits. For example, palmitic acid (0.704), rust@90 days (0.705) and rust@105 days (0.708), followed by days to 50% flowering (0.614), days to maturity (0.653), primary branches/plant (0.639), hundred seed weight (0.613), oleic acid (0.692), linoleic acid (0.668), OLR (0.666), late leaf spot@90 days (0.694) and late leaf spot@105 days (0.602), showed high (> 0.600) genomic prediction accuracy (Table 5). The traits pod yield/plant (0.402), seed yield/plant (0.408) and protein content (0.354) showed the lowest predictions among the studied traits. The remaining traits showed medium prediction accuracies. The results also indicated the absence of a relationship between trait heritability and prediction accuracy.
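The accuracy classes used throughout the Results (high > 0.600, medium 0.400–0.600, low < 0.400) can be expressed as a small helper. This is only an illustrative sketch: the function name is hypothetical, and the three example values are the overall accuracies quoted in the paragraph above.

```python
def accuracy_class(r, low=0.400, high=0.600):
    """Classify a prediction accuracy using the thresholds stated in the text."""
    if r > high:
        return "high"
    if r < low:
        return "low"
    return "medium"


# Overall accuracies quoted above for three traits.
for trait, r in [("palmitic acid", 0.704),
                 ("pod yield/plant", 0.402),
                 ("protein content", 0.354)]:
    print(trait, accuracy_class(r))
```

Under these thresholds, pod yield/plant falls just inside the medium class even though it is among the lowest-ranked traits overall.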

Table 5 Comparative prediction accuracy for different traits by four models under three cross-validation schemes in groundnut

Discussion

Breeding methodologies have been evolving over time to develop superior crop varieties for achieving higher productivity to feed the global population. The majority of breeding programs, including those for groundnut, have relied on phenotype-based selection approaches, with some efforts dedicated toward using marker-assisted selection (MAS) or marker-assisted backcrossing (MABC) (Pandey et al. 2016; Varshney 2016; Varshney et al. 2019). MAS and MABC are now routine in a few groundnut breeding programs; however, these breeding methods are mostly successful for simple traits for which diagnostic markers are being developed through trait mapping approaches (Pandey et al. 2020). The major problem lies with complex traits, for which generating precise and repeatable phenotyping data is challenging as a consequence of high G × E interaction. Under such a scenario, a new breeding approach called genomic selection is gaining momentum across crops; it promises to improve complex traits as well as multiple traits simultaneously (Meuwissen et al. 2001; Jannink et al. 2010; Crossa et al. 2017). This approach uses genome-wide marker data and multi-environment phenotyping data for target complex traits on a training population possessing diversity for the target traits and close resemblance to the candidates under selection.

The availability of cost-effective high- to mid-density genotyping assays is very important for deploying genomic selection in any crop species. Groundnut, one of the most important food and oilseed crops of the world, has recently attained optimum genomic resources such as the reference genomes for the diploid progenitors (Bertioli et al. 2016; Chen et al. 2016) and for both subspecies of the cultivated tetraploid (Bertioli et al. 2019; Chen et al. 2019; Zhuang et al. 2019), in addition to a high-density genotyping assay (Axiom_Arachis array with 58 K SNPs) (Pandey et al. 2017; Clevenger et al. 2017). These genomic resources have accelerated the pace and precision of several genomics and breeding applications, including the initiation of genomic selection in groundnut. In this context, a training population in groundnut was constituted successfully with 340 elite lines containing several desired agronomic features required for Indian and other global breeding programs. The results clearly showed high variability for the traits targeted in this effort, and the high-density genotyping assay played an important role in performing genomic prediction for these target traits. Therefore, this panel has the potential to serve as an ideal training population for different Indian groundnut breeding programs.

Conventional breeding relies on phenotype-based selection for complex traits through replicated yield trials in advanced (F6 onward) generations, which requires huge resources to grow a large number of plants in each generation and to conduct the trials. GS provides an advantage by facilitating selection of promising individuals in very early generations (F2), thereby reducing the number of lines to be advanced and phenotyped in replicated yield trials. If rapid generation advancement technology is integrated with this approach, GS will also save time by shortening the breeding cycle, in addition to offering more precise selection and reduced use of resources in the breeding process (Heffner et al. 2009, 2011; Isidro et al. 2015). Several studies on this approach have clearly indicated that GS is affected by several factors such as marker types and density (Chen and Sullivan 2003; Poland and Rife 2012; Zhang et al. 2017; Norman et al. 2018; Roorkiwal et al. 2018), population size (Daetwyler et al. 2010; Zhang et al. 2017; Norman et al. 2018) and statistical models (Heslot et al. 2012; Roorkiwal et al. 2018). Besides these important considerations, the main lingering question has been whether GS breeding can be made more effective in tackling G × E interactions while performing genomic predictions for complex traits. In this context, this study reports the constitution of a training population in groundnut, its genotyping with a high-density SNP array and the testing of four GS models under three different cross-validation schemes. This study provides information on the prediction accuracy of four important GS models, including models that account for G × E interactions, for performing more precise selection in GS breeding in groundnut. The best prediction models identified in this study are now ready for deployment in routine GS breeding, as the impact of G × E interactions on the precision of selecting the best performing plants has been accounted for in the models.

It is very difficult for any breeding program to generate phenotyping data on a training population at all possible evaluation sites. Under such circumstances, the breeder may face several situations in the training population datasets: (a) lines have never been evaluated/phenotyped in any of the target environments, (b) lines of the training population have been phenotyped in some environments but not in all of them, and (c) no phenotyping data have been generated for some environments. To address situation (a), we used a cross-validation scheme (CV1) in which a set of lines has never been evaluated/phenotyped in any of the target environments, to see whether these GS models can give high prediction accuracy for unevaluated genotypes in different environments using genotyping data alone. The results clearly showed total reliance on genomic information for achieving high prediction accuracy in this situation: one of the models (M1) gave very poor prediction accuracy because it does not use genomic information, while model 2 (M2) may not be suitable for achieving higher prediction accuracy at locations with high G × E. The remaining two GS models were competitive in achieving high prediction accuracies, indicating their potential for deployment in GS breeding under such situations with high G × E.

To address situation (b), the cross-validation scheme CV2 was used to assess the prediction accuracy when some lines of a larger set have been evaluated in only a few environments (i.e., not in all the target environments). The idea was to assess how well these GS models predict untested line-environment combinations using the information from lines evaluated in other environments. The results clearly showed comparable performance of all four candidate GS models, indicating that such a scenario can be handled with any of these prediction models. It also indicates that the breeder can introduce new germplasm with partial datasets into the extended training population without adverse impact on prediction accuracies, so selection efficiency will not be affected. Although the models showed good accuracies in predicting the performance of genotypes in untested environments, this will not completely eliminate the need for testing, especially in advanced generations; real-time testing of promising lines would still be needed prior to product advancement. In such a scenario, however, GS would be useful in reducing the resources spent on real-time testing of low performing genotypes in the respective target environments and would help identify the most suitable genotypes for testing in different target production environments. Similarly, to address situation (c), the cross-validation scheme CV0 was used to assess the prediction accuracy for an unobserved environment using the phenotyping information on the training set from related or remaining environments. In this case, prediction was made for each environment using the information from the remaining environments.
As in the CV2 scenario, the results for CV0 also demonstrated comparable performance of all four candidate GS models, indicating that the breeder can introduce a new environment into an ongoing breeding program without adverse impact on prediction accuracies and selection efficiency. Similar results have been obtained for these three scenarios in other studies in different crops (de los Campos et al. 2009; Hays and Goddard 2010; Heffner et al. 2009; Gorjanc et al. 2016) including chickpea (Roorkiwal et al. 2018), and the results obtained in this study therefore provide more confidence in deploying this scheme in groundnut.
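
The three cross-validation schemes described above can be sketched as rules for deciding which phenotype records are hidden from the model. The following is a minimal illustration only, not the procedure used in this study: the function name cv_masks, the five-fold split and the random seed are assumptions introduced here, and each phenotype record is assumed to be labelled with its line and its environment.

```python
import numpy as np

def cv_masks(lines, envs, scheme, n_folds=5, seed=0):
    """Return a list of boolean test masks over the phenotype records.

    lines, envs: 1-D arrays giving the line and environment of each record.
    scheme: "CV1" (lines never phenotyped), "CV2" (lines phenotyped in
            some environments only) or "CV0" (unobserved environment).
    n_folds and seed are illustrative choices, not values from the study.
    """
    lines = np.asarray(lines)
    envs = np.asarray(envs)
    rng = np.random.default_rng(seed)

    if scheme == "CV1":
        # Whole lines are held out: all records of a line share one fold,
        # so a test line has no phenotype in any environment.
        uniq = np.unique(lines)
        fold_of_line = dict(zip(uniq, rng.permutation(len(uniq)) % n_folds))
        folds = np.array([fold_of_line[l] for l in lines])
        return [folds == k for k in range(n_folds)]

    if scheme == "CV2":
        # Single line-by-environment records are held out at random, so a
        # test line usually remains observed in other environments.
        folds = rng.permutation(len(lines)) % n_folds
        return [folds == k for k in range(n_folds)]

    if scheme == "CV0":
        # Leave one environment out: predict it from the remaining ones.
        return [envs == e for e in np.unique(envs)]

    raise ValueError(f"unknown scheme: {scheme}")
```

For example, with 10 lines each phenotyped in 3 environments, cv_masks(lines, envs, "CV0") returns three masks, one per environment, whereas "CV1" returns five masks that each hide every record of two lines.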

Among the agronomic traits, days to maturity, pods/plant, shelling %, hundred seed weight and yield/ha, along with nutritional quality traits such as oil content and protein content, are the key priority traits in groundnut; they are governed by polygenes and are complex in nature. Resistance to LLS and rust in groundnut, in contrast, is governed by major quantitative trait loci (Sujay et al. 2012; Kolekar et al. 2016; Shirasawa et al. 2018), which have been used for introgression of LLS and rust resistance into elite varieties (Varshney et al. 2014; Janila et al. 2016; Shasidhar et al. 2020). Quantitative inheritance with additive effects of minor genes has also been reported for LLS and rust resistance in groundnut (Janila et al. 2013). Furthermore, high G × E interactions and environmental effects make these traits more complex. Hence, to achieve higher genetic gains for resistance to LLS and rust, both major and minor QTL/gene effects need to be captured, which GS is well suited to do. Models that consider G × E interactions in the prediction of GEBVs would be of great use in developing products with wider adaptability.

Identifying the best performing GS prediction model is a critical step before initiating GS breeding. The current study tested four GS models, i.e., E + L, E + L + G, E + L + G + GE (naïve interaction model), and E + L + G + LE + GE (naïve and informed interaction model) (de los Campos et al. 2013; Pérez-Rodríguez et al. 2015). The results showed that high prediction accuracies can be achieved for the CV0 and CV2 scenarios, with the best performance from the naïve and informed interaction model, followed by the naïve interaction model and the main-effect model E + L + G. Under CV1, the main-effect model E + L, which does not use genotyping information, failed completely, while the remaining three GS models, although much better than E + L, still gave poor predictions for untested genotypes. Achieving high prediction accuracy for this scenario therefore remains a distant goal, and more suitable models need to be developed and tested to predict the performance of genotypes that have not been tested in any environment. Besides selection of parents, the prediction of GEBVs of newly developed lines that have not been tested in any environment is one of the major applications of GS in breeding programs. The low prediction accuracies for CV1 could be attributed to low resemblance between the training set and the candidate population. The prediction accuracies can be substantially increased by adding to the training population more lines that show genetic resemblance to the candidate population. These models have shown very good performance for the simple and complex traits tested in this research and can therefore also be extended to other complex traits in groundnut such as heat tolerance and aflatoxin contamination (Pandey et al. 2019).
It is worth mentioning that models which consider G × E effects hold high potential for further improving prediction accuracies (Jonas and de Koning 2013; Oakey et al. 2016; Roorkiwal et al. 2018); such models may therefore be more appropriate to deploy in GS breeding.
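
The structure of the four models compared above can be illustrated with covariance kernels, one per model term. This is a rough sketch under several assumptions that are not stated in this section: the interaction terms (LE, GE) are taken as element-wise (Hadamard) products of the corresponding main-effect kernels, as is common in reaction-norm formulations; the genomic relationship matrix is computed from centred marker scores; and a kernel ridge predictor with a fixed penalty stands in for whatever variance-component estimation was actually used. All function and variable names are hypothetical.

```python
import numpy as np

def incidence(labels):
    """One-hot incidence matrix (records x levels) for a label vector."""
    levels, idx = np.unique(labels, return_inverse=True)
    Z = np.zeros((len(idx), len(levels)))
    Z[np.arange(len(idx)), idx] = 1.0
    return Z

def build_kernels(lines, envs, markers):
    """Record-level kernels for the E, L, G, LE and GE terms.

    markers: (n_lines x n_snps) matrix, rows ordered as np.unique(lines).
    """
    Z_E, Z_L = incidence(envs), incidence(lines)
    X = markers - markers.mean(axis=0)       # centre each SNP column
    G = X @ X.T / X.shape[1]                 # genomic relationship matrix
    K = {"E": Z_E @ Z_E.T, "L": Z_L @ Z_L.T, "G": Z_L @ G @ Z_L.T}
    K["LE"] = K["L"] * K["E"]                # assumed Hadamard interaction
    K["GE"] = K["G"] * K["E"]                # assumed Hadamard interaction
    return K

# Terms included in each of the four models named in the text.
MODELS = {
    "E+L": ["E", "L"],
    "E+L+G": ["E", "L", "G"],
    "E+L+G+GE": ["E", "L", "G", "GE"],
    "E+L+G+LE+GE": ["E", "L", "G", "LE", "GE"],
}

def predict(K, terms, y, test, ridge=1.0):
    """Kernel ridge prediction of the masked records from the others.

    ridge is an arbitrary fixed penalty used only for illustration.
    """
    Ksum = sum(K[t] for t in terms)
    train = ~test
    mu = y[train].mean()
    A = Ksum[np.ix_(train, train)] + ridge * np.eye(int(train.sum()))
    alpha = np.linalg.solve(A, y[train] - mu)
    return mu + Ksum[np.ix_(test, train)] @ alpha
```

Under these assumptions the sketch also shows why a model without the G term cannot rank lines that were never phenotyped: when whole lines are masked, the L kernel between test and training records is zero, so E + L returns the same value for every masked line within an environment, whereas the G kernel still links masked lines to their genotyped relatives.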

In summary, this study reports the development and testing of four GS models and compares their performance under three important cross-validation scenarios that breeders frequently face for reasons such as lack of resources, time or facilities, or the inclusion of new potential parents/traits/environments in the breeding program. The four GS models tested were E + L, E + L + G, E + L + G + GE (naïve interaction model), and E + L + G + LE + GE (naïve and informed interaction model); the study suggests using the latter two models to achieve higher prediction accuracies, even for traits with large G × E effects in groundnut. The identified GS models could be deployed in breeding programs upon validation of prediction accuracies on a candidate population.