Background

Genomic selection (GS) is a form of marker assisted selection (MAS) using markers distributed across the entire genome. Unlike conventional MAS, GS does not require a priori association between marker and trait generated from linkage mapping and genome-wide association study (GWAS). Genomic selection was first implemented successfully in cattle, doubling the rate of genetic gain in breeding programs [1, 2]. This method has recently been adapted for crops, including maize, wheat and rice [3, 4]. In commercial perennial crops, the application of GS has been proposed for the selection of complex quantitative traits, including oil yield related traits [5].

As the most efficient oilseed crop in the world, oil palm is able to produce up to 10 times more than other leading oil crops, and has surpassed soy oil since 2008 [6] as the world’s most traded oil. Oil produced in the kernel and mesocarp is suitable for both human consumption and for oleo-chemical industries. The commercial oil palm planting material in Southeast Asia is mainly derived from Deli dura x AVROS pisifera crosses, selected for high oil yield. The potential reduction of genetic variation in the Deli dura due to founder effects is a concern for oil palm breeders. In addition, self-pollination (“selfing”) and sib-mating, which are commonly practiced to concentrate the desired agronomical traits in Deli dura populations, has further reduced the genetic variation. This has resulted in various symptoms of inbreeding depression, such as abortive bunch formation, poor fruit set and oil yield depression in Deli dura trials [7]. To address this problem, Sime Darby Plantation R&D adopted introgression using dura populations acquired from Nigeria to widen the genetic base of the commercial Deli dura materials. Deli x Nigerian progenies have improved fresh fruit bunch (FFB) and oil-to-bunch ratio (O/B) values, with useful trait variation observed, resulting in a new series of mother palms for future commercial material. Still, the emphasis placed on the introgression program was low because of the slow nature of breeding progress in oil palm, which is typically 10–12 years per selection cycle [8]. Hence, the ultimate goal of GS is to expedite the breeding progress by maximizing the genetic gains per generation.

In oil palm, GS was first evaluated using simulated data and a limited number of QTLs [9]. Another study using SSRs supports the potential of this method in an oil palm population with narrow genetic base [10, 11]. Although SSRs are informative, the emergence of high-throughput sequencing, which can detect genetic variations across the whole genome, has shifted the preference of marker system towards SNPs. This is because SNPs are generally abundant in the genome and require less time and cost to genotype. As an example, the Malaysian Palm Oil Board (MPOB) has reported of several SNP loci capable of distinguishing different fruit forms in oil palm, which is a monogenic trait [12, 13]. For polygenic traits, including O/B and oil content in mesocarp, genome-wide markers are required to locate the multiple genetic components of these quantitative traits. An example is the mesocarp oil content trait, with multiple QTLs, mainly found in Chromosome 5 [14].

A whole-genome OP200K SNP genotyping array [15] capable of representing genomic information of main breeding stock was developed by referring to the published 1.8-Gb diploid genome of oil palm [16]. This, together with the reduction in genotyping cost, has made GS feasible in oil palm. Even though the use of these genome-wide SNPs have also been proven to work in trait prediction of tenera progenies [17], the selection of mother dura palms is of more importance for oil palm breeding programs. Before implementation of GS in oil palm dura breeding programs, it is crucial to evaluate different marker systems and GS modeling methods. For this purpose, a full-sib family, consisting of 112 individuals, derived from Deli x Nigerian origin was selected for development and assessment of GS models. In total, eight methods, including two machine learning approaches, and two marker systems, SSRs and SNPs were evaluated. Prediction models built for the important yield-related traits were assessed. In general, individual populations available for introgression or further crossing are often small in oil palm. The development of accurate models to guide future crossing and introgression is essential for efficiency development of traits into commercially-relevant parental stock.

Methods

Plant materials and phenotyping

Unopened spear leaves from 112 twenty-year-old palms derived from a Deli x Nigerian family, planted in a randomized complete block design trial and maintained at Sime Darby Plantation R&D, Malaysia were sampled [18]. The four years (4th to 8th year) yield, bunch analysis and vegetative measurements were recorded based on the standard industry protocols [19] with modification [20]. From these records, six important oil yield-related traits (fruit-to-bunch (F/B), shell-to-fruit weight (S/F), kernel-to-fruit weight (K/F), mesocarp-to-fruit weight (M/F), total oil yield per palm (O/P) and oil weight-to-dry mesocarp weight (O/DM)) were selected for study. All assayed traits for this study were measured in percentage (%) with the exception of O/P, which was in kg/palm/year. Other traits were measured directly besides O/P. The product of FFB and O/B was used to calculate O/P (FFB * O/B). O/B itself was calculated based on O/DM * DM/WM * M/F * F/B, where DM/WM is the dry matter content of the mesocarp [21]. The phenotype data were averaged across 4 years. Phenotypic analysis was done using a modified “chart” function from the library “PerformanceAnalytics” [22] under R version 3.0.0.

SSR and SNP genotyping

Total genomic DNA was isolated from 100 mg of each leaf sample using the DNAeasy Plant Mini Kit (Qiagen, Germany). For SSR genotyping, 135 markers from our previous study were selected [18]. In the same study, we have shown that these SSRs, from an average of 10 cM mapping interval provided sufficient power for QTL detection. The SSR amplification was carried out using a M13-tailed forward primer with a four-color fluorescent detection technique [23, 24]. The SSRs with more than two alleles were interpreted as two or more markers in order to convert all genotype values to −1, 0 or 1, an approach similar as published [10], without loss of information. Therefore, the informative SSRs used in this study were represented as 221 markers.

SNP genotyping on the same family was carried out using the OP200K array (170,860 SNPs) [15]. The process was done on the Infinium iScan platform (Illumina Inc., San Diego, CA) according to the manufacturer’s recommendations. The raw intensity SNP data was analyzed and auto-clustered using GenomeStudio version 20,011.1 (Illumina Inc., San Diego, CA) with genotyping module version 1.8.4. A total of 46,933 SNPs were identified to be polymorphic. These genotypes were coded into −1 (AA), 0 (AB) and 1 (BB) format. Missing genotype data were imputed using the na.roughfix function of the randomForest package [25] in R.

Genomic heritabilities and kinship coefficient estimates

Genomic heritabilities for the traits were calculated using an in-house R script, based on a linear model using all informative SNP markers [17, 26, 27]. The kinship coefficients among the individuals was estimated with the R package related [28] using 5000 SNPs selected randomly from the full dataset, with estimation method based on Li [29].

GS methods

This study was performed using an IBM System ×3850 X5 HPC, with 40 Intel (R) Xeon (R) CPU E7–8850 @2.00 GHz processors and 1 Tb of RAM. The methods selected to perform GS in oil palm were RR-BLUP [30], Bayes A [31], B [32],Cπ [32], LASSO [33], Ridge Regression [34]) and two machine learning methods (Support Vector Machine (SVM) and random forest (RF)) [25]. All the methods mentioned were carried out with R version 3.0.0.

RR-BLUP assumes that the marker effects are normally distributed and that they explain the same amount of variance [31]. Bayes A and Bayes B allow for different variance across different markers. Bayes A models the variance using a scaled inverted chi-square distribution [31]. The additional feature of Bayes B compared to A is that it allows for the variance of markers to be zero with a probability of π [32]. The later Bayesian methods are Cπ, Lasso and Ridge Regression: Cπ assumes that π follows a uniform distribution [32]; Bayesian Ridge Regression, on the other hand, uses a Gaussian prior and Lasso [34] uses a Double-Exponential prior [33]. For both SVM and Random Forest, no assumptions were made regarding the markers.

For all methods, a 5-fold cross validation was carried out across all traits, where a single subsample was used as the validation set, and the remaining four samples were used as the training set. This step was carried out until all subsamples had been used for both training and validation steps. RR-BLUP was implemented using the R package rrBLUP [30], and Bayesian methods using the BGLR package [35]. For Bayesian methods, the number of inner iterations was optimized based an increasing iteration approach using Bayes A. For all traits, the optimization graph became plateau before reaching 10000, indicating that this number of inner iterations was more than sufficient for the estimation of marker effects. The optimization graphs of S/F and O/P were selected for illustration purpose (Additional file 1: Figure S1). In addition, the first 2000 iterations were used as “burn-in”. The same conditions were extended for all Bayesian methods. For Bayesian LASSO, the lambda parameter was set to 25, type set as gamma, rate parameter as 1e−4 and shape parameter as 0.55.

SVM was carried out using R package e1071. Optimization of the parameters was carried out for each iteration during cross validation using the tune.svm function. Within the training set, another 10-fold cross-validation was used to estimate the optimal parameters. The gamma and cost factors for the SSR dataset were optimized using an exponential range of 2−15 to 215 [36], epsilon from 0 to 0.2 under the kernels of “linear”, “radial”, “polynomial” and “sigmoid”. For SNP-based prediction, the gamma and cost factors were set from 2−30 to 215. Random forest was carried out using the randomForest package [25]. Optimization of the parameters used the mlr library for each cross validation step. Like SVM, 10-fold cross validation was used on the training set. The range of mtry optimization range was set from 1 to 221, nodesize from 1 to 50 and ntree was from 500 to 1500 for SSRs. For SNPs, the mtry optimization range was set at 1 to 500, ntree was set from 500 to 3000.

For each iteration, accuracy was defined as the predictive ability, which was calculated as the Pearson correlation coefficient of predicted trait value of the validation set, versus its observed trait value. The overall accuracy was calculated as the mean of the predictive abilities from all iterations. Genomic selection was performed using SSRs, followed by SNPs. Representative regression boxplots were generated for each trait using SNPs for the best method. In order to test for linear assumption made under RR-BLUP and Bayesian models, residuals were calculated from the difference between the predicted trait values and the observed trait values. The residuals were then standardized using the “rstandard” function in R. The residual and Quantile-quantile (QQ) plot were generated using the “plot” and the “qqnorm” functions.

Results

Phenotypes, genomic heritabilities and kinship coefficient estimates between individuals

The 4-year recordings for each trait were averaged, with the mean for F/B being 67.37 (±3.74), S/F 33.89 (±2.61), K/F 9.89 (±1.55), M/F 56.23 (±3.37), O/P 28.84 (±7.00) and O/DM 75.44 (±2.54). Since these traits were all yield related, many of them correlated with each other (Fig. 1). In particular, M/F was negatively correlated with S/F and K/F. O/P and O/DM were positively correlated with M/F. The genomic heritabilities of all these traits are shown in Fig. 2. S/F (0.35) and O/DM (0.29) have the highest heritability, followed by M/F (0.27), K/F (0.15), F/B (0.11) and O/P (0.07). The kinship coefficient between individuals in the Deli x Nigerian family ranged from 0.35 to 0.62 with both mean and median at 0.48 (Additional file 1: Figure S2).

Fig. 1
figure 1

Plot representing phenotypic distribution and correlation for the traits of F/B (%), S/F (%), K/F (%), M/F (%), O/P (kg/palm/year) and O/DM (%) in the Deli x Nigerian family. Diagonal of the plot shows the histograms and the distribution of the observed phenotypes values. The lower off-diagonal is the scatterplot between traits, whereas the upper off-diagonal represents the correlation value between traits. Significant correlations are tagged with the asterisk (*) symbol

Fig. 2
figure 2

Bar plot showing the genomic heritabilities for the traits of F/B, S/F, K/F, M/F, O/P and O/DM in the Deli x Nigerian family. S/F and O/DM had the highest heritability, followed by M/F, K/F, F/B and O/P

GS accuracy for SSR-based models

On average, the trait with the highest accuracy was F/B (0.28), followed by S/F (0.25), K/F (0.19) and O/DM (0.19), O/P (0.18) and M/F (0.18). Averaging across all traits, machine learning methods out-performed all other methods, albeit only slightly. SVM having the highest accuracy for F/B (0.30), S/F (0.30) and O/P (0.28), and RF having the highest accuracy for K/F (0.24), M/F (0.34) and O/DM (0.28). However, SVM also has the lowest accuracy for K/F (0.14), RF has the lowest accuracy for F/B (0.19) and O/P (0.14) (Table 1).

Table 1 Mean accuracy of traits based on different SSR-based GS methods

GS accuracy for SNP-based models

Accuracy of the GS model was improved drastically for all traits when SNPs were used (Table 2 & Fig. 3). The improvement was the largest for S/F, M/F and O/DM. Averaging across methods, the trait with the highest accuracy was O/DM (0.43), S/F (0.39), followed by F/B (0.31) and by M/F (0.30). Averaging across traits, similar to SSRs, the performance of all methods were almost equal, with machine learning methods having a marginal advantage (0.32) over other methods (0.31). For individual traits, machine learning methods have the highest accuracy for all traits besides F/B, with SVM having the highest accuracy for S/F (0.47) and K/F (0.28), and RF having the highest accuracy for M/F (0.37), O/P (0.30) and O/DM (0.47). However, SVM also has the lowest accuracy M/F (0.26), RF has the lowest accuracy for F/B (0.24) and S/F (0.23). Residual analysis carried out for RR-BLUP and Bayesian methods showed that the points in a residual plot were randomly dispersed around the horizontal axis, and the points in QQ plot were almost in a straight line for all traits (Additional file 1: Figure S3 & Figure S4). Therefore, the underlying mixed linear model used was suitable.

Table 2 Mean accuracy of traits based on different SNP-based GS methods
Fig. 3
figure 3

Regression boxplot illustrating predicted trait values vs. observed trait values for F/B, S/F, K/F, M/F, O/P and O/DM, selected by best GS method for each trait. The observed trait values were split into three classes. The prediction accuracy was written on the top left corner for each plot

Discussion

Perennial crops, including oil palm, have long life and crop cycles, require lengthy periods to reach maturity. Thus, important agricultural traits, such as oil yield can only be evaluated after years of field planting. With the planted materials remaining in production over long time frames, small differences in yield have a large impact. The introduction of genomic selection will enable early selection of the best performing parents without field evaluation, eventually expediting genetic gains in perennial crops. For oil palm, early trait prediction of the best performing Deli x Nigerian dura allows breeders to prioritize the best dura candidates to be progeny tested with the commercial AVROS pollen donors. The traits evaluated ranged from simple quantitative traits, which are controlled by fewer genes, including S/F [15] and O/DM [37], to the more complex quantitative trait controlled by more genes, O/P. From the statistical analysis of the traits in this study, M/F was inversely correlated with S/F and K/F, which was expected because the thicker the shell and kernel, the thinner the mesocarp, and vice versa. With crude palm oil accumulating in the mesocarp, fruits with thicker mesocarp are often preferred. As for the complex trait O/P, it was calculated from a few simple quantitative traits. Therefore positive correlations between O/P with M/F and O/DM were observed. Unlike annual crops, traits measured for a single year are not useful because oil palm yields vary greatly from year to year [38]. In comparison, The four-year mean (4th – 8th years of planting) for oil palm traits, such as FFB, provides higher heritability [39] and also highly correlates to later yields (11th – 16th years of planting) [40]. Consequently, the four-year mean was used in this study for all traits.

The conventional MAS only requires a few major QTLs for breeding prediction, but the genetic effects of individual QTL can be small for complex traits, with a limited amount of the trait variation explained [41]. In such conditions, the genetic gains will remain minimal or insignificant in the breeding program. The variations of complex traits can be further improved by consolidating the effects of all genes or chromosomal/genomic segments to estimate total breeding value, which is essentially the GS approach [31]. Conventionally, the thick-shelled dura as maternal parents are field planted in small number. Yet even with a small population, Cros [10] demonstrated that SSRs can be used in the GS modeling for dura palms, thereby enabling genomic predictions to be evaluated as a method to improve desired traits.

Various methods developed for GS make different assumptions regarding the markers. As compared to these methods, the absence of such assumptions might allow for machine learning to model the traits better. Among the methods assessed, even though the difference in performance was small, machine learning models did have the highest accuracy, followed by Bayesian methods and RR-BLUP. This slight increase in accuracy when using machine learning methods, was however, at the cost of longer computing times compared to Bayesian methods, which in turn took longer than RR-BLUP. In addition, machine learning methods gave inconsistent results among traits in this study. Different parameters are required for different traits using machine learning methods to maximize prediction accuracies. Determination of these parameters is again a computationally intensive and time consuming process. When the number of variables is large, such as in the case of high density SNPs in predictive modeling, parameters optimization is a challenging process. Nonetheless, given sufficient optimization, machine learning has the potential to outperform other methods.

Overall, the accuracy using SSRs alone was low for most of the traits, which is not in agreement to what was reported by Cros [10]. This is probably due to the lower kinship coefficient within the Deli x Nigerian family of 0.48 (current study), compared to 0.58 in the population used by Cros [10]. In addition, observed trait data was used in this study, instead of deregressed estimated breeding value derived from progeny testing. Hence, the observed trait data probably increased the residual effects due to environmental effect, thus reducing the accuracy. To be accurate, GS requires every QTL in the genome to be represented by at least one marker [42]. Even though it is possible to have genome-wide representation of SSRs, this is usually not done in practice due to the large amount of workload and the high cost required for genotyping. Therefore, usually it is insufficient for SSRs to cover all possible loci affecting a complex trait, and thus for use in GS of traits at underrepresented loci. In soybean, the accuracy of GS using low density markers has been reported to reach 0.69 [43]. However, the high accuracy might due to the use of trait-associated markers. In this case, the assayed SSRs in this study were randomly mined from the genome, thus no a priori association was known. As compared to SSRs, the introduction of high-throughput SNP genotyping has made whole-genomic genotyping possible at a low cost.

When genome-wide SNPs were introduced into the GS models, accuracy for all traits improved, with the greatest improvement for S/F, M/F and O/DM. This can be attributed to the fact that the OP200K array provides high genomic resolution [15]. With most of the QTL being represented by at least one SNP, most of the important genetic variations relative to the traits in study would have been captured. Representing SNP as marker effect, the model built from the genome-wide SNPs therefore better explained the observed trait variations. In addition to marker density, another important factor that affects accuracy is trait heritability. The genomic heritabilities calculated for the traits in this study were in the descending order of S/F, O/DM, M/F, K/F, F/B and lastly O/P. The fruit traits, including S/F, M/F, K/F and O/DM were reported to be highly heritable in an introgressed Deli dura x Nigerian dura population, while F/B and O/P were reported to have comparatively lower heritability [44]. Therefore, the calculated genomic heritabilities agreed with the literature rather well. Among the highly heritable traits, K/F had the lowest genomic heritability, which coincided with it having the lowest accuracy among the highly heritable traits. The low heritability and accuracy were probably due to the low phenotypic variation of K/F in the assayed family. Another important factor is the training population size. Even though further improvement in accuracy can easily be achieved through the increment in the training population size [45], the dura maternal population size is usually small in breeding programs, as they are often accommodated in commercial estates. In these estates, tenera palms (the hybrids used for oil production) are predominantly planted, together with a small number of mother (dura) palms, which are used solely for seed production for new planting materials. The small training population size in this study therefore reflects the reality of dura mother palm breeding. Even with this condition, we have found the application of GS in dura breeding programs to be promising.

Conclusion

For outcrossing perennial crops such as oil palm, the conventional breeding selection of mother palms for progeny testing is partially random and is phenotype-dependent. The ability to select for the best mother palms will expedite the production of high-yielding progenies as future commercial planting materials. Despite having a small training population, which is a usual case for mother palms, GS using SSRs has reasonable accuracy, which can be much improved by using dense SNPs that afford better coverage of the genome and hence the desirable QTL. From our results, the differences in accuracies between the methods evaluated for SSRs and SNPs were small. In general machine learning methods slightly outperformed the other methods, and with sufficient parameter optimization machine learning could become a powerful tool to support selection of optimal materials for breeding programs of oil palm and other perennial crops.