Background

In the last two decades, new advances and technologies in the field of sequencing, commonly referred to as next-generation sequencing, have dramatically decreased the cost and time necessary for genotyping an individual [1, 2]. The resulting abundance of genomic data was the prerequisite for genomic prediction to become an important tool in animal and plant breeding [3,4,5,6]. Genomic prediction is the process of predicting phenotypic values based on genomic data in the form of genetic markers. Nowadays, these markers are typically biallelic single nucleotide polymorphisms (SNPs). The predicted phenotypes can then be used as an alternative to the actual phenotyping of individuals, which can be expensive and time-consuming, thereby speeding up breeding programs and increasing their efficiency [4].

Since Meuwissen et al. [7] popularized the concept of genomic prediction, many different algorithms have been developed. These include, among others, linear regression-based approaches such as best linear unbiased prediction (BLUP) and different Bayesian algorithms (e.g. BayesA, BayesB, BayesC, BayesCπ, or Bayesian ridge regression), as well as machine learning models such as support vector machines, random forests, or neural networks [8, 9]. To date, no single method has proven to be universally best, and performance depends strongly on the dataset being studied [9, 10]. However, Howard et al. [8] suggested the use of non-parametric models if the genetic architecture of the trait is not known to be strictly additive and includes non-additive effects such as interactions between loci. Regardless of the model chosen, accurate phenotype prediction generally requires a large number of genetic markers that ideally cover all loci involved in the trait.

However, with the increasing availability of SNPs, the number of genotyped and phenotyped individuals has become the limiting factor for the development of genomic prediction models [10]. This leads to the so-called \(p \gg n\) problem, where the number of features (p), in this case SNPs, is much larger than the number of individuals (n), which can result in overfitted models with poor performance [11]. One way to address this problem is to use a feature selection approach to reduce the number of SNPs in the dataset, which in addition saves computation time and resources [12,13,14,15]. Different feature selection approaches have been applied for genomic prediction. These range from simple filters that remove uninformative or redundant SNPs (e.g., using a variance threshold or linkage disequilibrium pruning) [16] to more complex algorithms, such as Elastic Net, BayesA, the least absolute shrinkage and selection operator (LASSO), gradient boosting machines (GBM), and others, that are used to identify and select relevant SNPs [9, 12, 13, 17]. In their study on the prediction of human traits, Bermingham et al. [14] applied a feature selection approach that ranks the SNPs based on the strength of their association with the phenotype as determined by a genome-wide association study (GWAS). A similar approach was used by Jeong et al. [15] in their feature selection program GMStool. Both studies reported that a GWAS-based feature selection approach may improve the accuracy of genomic prediction. However, it is still unclear under which circumstances feature selection results in an actual improvement compared to using all available SNPs. To address this, we describe an incremental feature selection approach based on GWAS, similar to [14, 15], combined with random forest as a prediction model, and demonstrate its effectiveness by applying it to several animal and plant datasets. Our results suggest that incremental feature selection may lead to a considerable improvement in prediction accuracy, but this depends strongly on the genotype data. Therefore, we have implemented our approach as an easy-to-use R script to enable researchers to decide for themselves whether their dataset would benefit from incremental feature selection.

Methods

Data

We analyzed several publicly available genotype-phenotype datasets for animals and plants, some of which have already been used for genomic prediction [9, 18, 19]. In their benchmarking study on genomic prediction, Azodi et al. [9] analyzed genotypic data from six plant species, namely maize, rice, sorghum, soy, spruce, and switchgrass. For each species, they provided three phenotypes. The datasets themselves were taken from different publications and contain SNP genotypes obtained through genotyping-by-sequencing as well as SNP chips. Since population size and number of SNPs varied considerably between these species, we were able to consider the effect of feature selection under different scenarios. However, we used only five of these six datasets: the rice dataset was excluded because its missing genotypes had been imputed with the numeric genotype mean values, which made it unsuitable for our analysis (only 968 out of 57,542 SNPs contain no imputed genotypes) [20]. Cleveland et al. [18] published a dataset containing genotypes and phenotypes for 3534 pigs, which was made available by the Genus company PIC. The genotypes were obtained using the Illumina PorcineSNP60 chip [21], and phenotypes were measured for five traits with heritability estimates ranging from 0.07 to 0.62. However, the SNPs and phenotypes were anonymized, so the real traits are not known. Another dataset used in this study was published by Liu et al. [19] and relates to the egg weight of 1063 Rhode Island Red chickens. The animals were genotyped using the Affymetrix Axiom® 600K Chicken Genotyping Array, and egg weight was measured at seven time-points up to an age of 80 weeks. Due to missing phenotype and genotype data, the numbers of individuals and SNPs in the pig and chicken datasets varied depending on the trait. Individuals and SNPs with missing values were removed. No further quality control was performed since the datasets had already been filtered in their original publications. Table 1 gives a short overview of the datasets and their respective sizes.

Table 1 Overview of the datasets used in our study

Methodology

The incremental feature selection (IFS) approach that we used is analogous to the methods presented by Bermingham et al. [14] and Jeong et al. [15]. In order to select an optimal number of SNPs, we first performed a GWAS on the training data (80% of the individuals) by using the PLINK software (version 1.90) [22]. Based on the reported p-values, we ranked the SNPs according to the strength of their association with the phenotype. With the ranking in place, we began by training a model using only the top-ranked SNP as input. Subsequently, we added new markers in a stepwise manner to the input dataset and trained models with those. This stepwise incrementation was continued until a trained model that used all SNPs was obtained. As the number of features increases, the training becomes more time-consuming. To speed up the process, we increased the step size after certain intervals (e.g. step size of 1 until 100, 5 until 500, 10 until 1000, …).
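To make the stepwise schedule concrete, the following minimal R sketch builds the sequence of SNP-set sizes at which models are trained. The function name, breakpoints, and step sizes are illustrative assumptions and not necessarily those used in the published script.

```r
# Sketch: sequence of SNP-set sizes for incremental feature selection.
# Breakpoints and step sizes are illustrative; the published script may differ.
ifs_schedule <- function(n_snps,
                         breaks = c(100, 500, 1000, n_snps),
                         steps  = c(1, 5, 10, 50)) {
  sizes <- integer(0)
  start <- 1
  for (i in seq_along(breaks)) {
    upper <- min(breaks[i], n_snps)
    if (start > upper) break
    sizes <- c(sizes, seq(start, upper, by = steps[i]))
    start <- max(sizes) + steps[i]
  }
  unique(c(sizes, n_snps))  # always end with the full SNP set
}

# SNPs are ranked by the GWAS p-values (smallest first); at each step,
# the top k ranked SNPs are used as model input.
# ranked_snps <- gwas$SNP[order(gwas$P)]
# for (k in ifs_schedule(length(ranked_snps))) { ... }
```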

To assess the impact of the IFS approach, we used a random forest (RF) algorithm for genomic prediction, although other methods for genomic prediction could also be applied. RF is a non-parametric algorithm, which works by averaging the prediction results of multiple independently trained regression trees [23]. Compared to other additive or linear approaches, RF has the advantage of being able to capture interaction effects between SNPs as well as dominance effects [15, 24]. Furthermore, RF has proven to be a robust and highly predictive model for genomic prediction which can reach performances comparable to other classical approaches such as Bayesian methods [9, 10, 15, 16, 25, 26]. We used the implementation provided in the R package ranger with default settings (500 trees, mtry = \(\sqrt{p}\) and a minimal node size of 5) [27]. Compared to other libraries, this is a highly optimized implementation that can be parallelized to increase speed.
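As an illustration, model training with ranger could look like the following sketch; the data objects (a numeric genotype matrix and a phenotype vector) and the helper name are assumptions, while the hyperparameters correspond to the defaults stated above.

```r
library(ranger)

# Sketch: train a regression random forest on the top-k ranked SNPs.
# 'geno' (individuals x SNPs, numeric) and 'pheno' are assumed placeholders.
train_rf <- function(geno, pheno, top_snps, threads = 1) {
  dat <- data.frame(y = pheno, geno[, top_snps, drop = FALSE])
  ranger(
    dependent.variable.name = "y",
    data          = dat,
    num.trees     = 500,                           # ranger default
    mtry          = floor(sqrt(length(top_snps))), # ranger default: sqrt(p)
    min.node.size = 5,                             # ranger default for regression
    num.threads   = threads
  )
}
```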

Cross-validation was applied to measure the ability of the trained models to predict unobserved phenotypes. Following Azodi et al. [9], we split the individuals into five separate folds, so that the training data consisted of 80% of the population and the model was tested on the remaining 20%. This fivefold cross-validation was repeated ten times with randomly reordered individuals. Previous studies have shown that only the training data should be used for feature selection, since including the test data inflates the prediction accuracy [14, 28]. Therefore, the GWAS was repeated separately for each training dataset.
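A minimal sketch of this scheme is shown below; run_gwas() and train_rf() are placeholders for the PLINK-based association analysis and the model training, not functions from the published script.

```r
# Sketch: 5-fold cross-validation repeated 10 times, with the GWAS-based
# SNP ranking recomputed on the training folds only (no test-data leakage).
repeated_cv <- function(geno, pheno, k, n_reps = 10, n_folds = 5) {
  r2_per_rep <- numeric(n_reps)
  for (r in seq_len(n_reps)) {
    fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(geno)))
    preds   <- numeric(nrow(geno))
    for (f in seq_len(n_folds)) {
      train_idx <- which(fold_id != f)
      test_idx  <- which(fold_id == f)
      # ranking <- run_gwas(geno[train_idx, ], pheno[train_idx])   # placeholder
      # model   <- train_rf(geno[train_idx, ], pheno[train_idx], ranking[1:k])
      # preds[test_idx] <- predict(model,
      #                            data = as.data.frame(geno[test_idx, ]))$predictions
    }
    # r2_per_rep[r] <- r_squared(pheno, preds)  # R2 over all folds of this repetition
  }
  r2_per_rep
}
```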

Following previous studies [10, 17, 29,30,31,32], we measured the accuracy of the predictions using the coefficient of determination (R2), which can be interpreted as the proportion of the variance in the dependent variable that can be explained by the genotypes [33]. For each repetition of the cross-validation, we calculated the R2 value between the predicted and the observed phenotypes across all folds. Based on these values, we report the average R2 value as well as its standard error for each model. Although Pearson’s correlation coefficient (r) is the most widely used measure in genomic prediction, a high value of r does not necessarily reflect an accurate prediction of the true phenotype values. Furthermore, repeating our analysis with r as the accuracy measure demonstrated a correlation of nearly 1 between the two measures on the test results. Please note that R2 is equal to the square of r only in unconstrained linear regression [31].
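Assuming the standard definition based on the residual and total sums of squares, R2 can be computed as in this short sketch:

```r
# Sketch: coefficient of determination between observed and predicted phenotypes,
# using R2 = 1 - SS_res / SS_tot (not the squared Pearson correlation).
r_squared <- function(observed, predicted) {
  ss_res <- sum((observed - predicted)^2)
  ss_tot <- sum((observed - mean(observed))^2)
  1 - ss_res / ss_tot
}
```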

To obtain a robust trend curve for the performance of the prediction model depending on the number of SNPs, we applied Friedman’s super smoother [34] to the R2 values. Based on these smoothed values, we determined the maximum R2 value to find the optimal number of SNPs.
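In R, this step could look like the following sketch using stats::supsmu; the input vectors are the tested SNP-set sizes and the corresponding mean R2 values, and the function name is illustrative.

```r
# Sketch: smooth the R2-vs-SNP-count curve with Friedman's super smoother
# and return the SNP count at which the smoothed curve is maximal.
find_optimal_snp_count <- function(snp_counts, mean_r2) {
  sm <- stats::supsmu(snp_counts, mean_r2)
  sm$x[which.max(sm$y)]
}
```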

In order to evaluate the performance of our IFS approach on data not used for selecting the SNPs or training the model, we first randomly split each dataset into two parts with an 80%/20% ratio. Let Φ be the dataset containing 80% of the data and Ψ the remaining dataset. Subsequently, we applied the IFS approach to Φ to determine the optimal number of SNPs; during IFS, this dataset was further subdivided using the aforementioned cross-validation approach. Then, a random forest model was trained on the Φ dataset using this optimal number of SNPs. Finally, we used the trained model to predict the phenotypes of the Ψ dataset and calculated the R2 value between predicted and true phenotypes.
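The outer split could be sketched as follows; the 80/20 proportions match the description above, while the function and variable names are illustrative.

```r
# Sketch: split individuals into Phi (80%, used for IFS and final training)
# and Psi (20%, held out for the final evaluation).
split_phi_psi <- function(n_individuals, train_fraction = 0.8) {
  phi <- sample(n_individuals, size = round(train_fraction * n_individuals))
  psi <- setdiff(seq_len(n_individuals), phi)
  list(phi = phi, psi = psi)
}
# The IFS/cross-validation procedure above is then run on the Phi individuals
# to choose the SNP count, a final random forest is trained on all of Phi,
# and r_squared() is evaluated on the Psi predictions.
```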

We implemented our IFS approach for genomic prediction as an easy-to-use R script that is available from https://github.com/FelixHeinrich/GP_with_IFS. As mentioned in the previous sections, it requires the PLINK software to be installed, as well as the R libraries ranger, data.table [35] and ggplot2 [36]. Genotype and phenotype data need to be provided in PLINK’s binary format. Additional parameters, such as the number of threads, the step sizes for feature selection, or the numbers of folds and repetitions for cross-validation, can be easily modified by the user in the script.

Results

Both the increase in R2 obtained using the IFS approach and the baseline performance of the model trained using all available SNPs varied considerably between the different species and traits. Table 2 lists, for each dataset, the number of selected SNPs and the R2 value achieved with the Ψ data using the models trained only with the selected top SNPs as well as the models trained using all SNPs. Figures 1, 2, 3 and Additional file 1: Fig. S1, Additional file 2: Fig. S2, Additional file 3: Fig. S3, and Additional file 4: Fig. S4 show, for each species, the prediction accuracies of the phenotypes according to the number of SNPs used in the Φ data.

Maize exhibited the strongest increase in R2 due to IFS (Fig. 1): for the prediction of flowering time (FT), with only the top 0.18% of all SNPs (446 out of 244,781) ranked by association strength, the model reached an R2 value of 0.486, while with all available SNPs it achieved a value of 0.364. In the following, we refer to this increase in R2 value as the advantage of the IFS approach. For the height (HT) and yield (YLD) traits, the advantage was 0.034 and 0.068, using the top 1.64% and 0.65% of SNPs, respectively. The application of our method to the soy dataset also produced higher prediction accuracies, although the gain was not as large as for maize: the advantage of IFS was between 0.003 and 0.012 when approximately the top 10 to 23% of SNPs were used as input (Fig. 2). In contrast, for switchgrass, only the anthesis date (AN) and height (HT) traits showed an advantage with IFS, with an advantage of 0.013 observed using the top 35% of SNPs for AN, and an advantage of 0.006 using the top 60% of SNPs for HT (Fig. 3). For the standability (ST) trait, the model with all available SNPs reached the best performance on the Φ data; however, a similar level of performance could be achieved by using only ~ 10,000 SNPs instead of all 217,150 SNPs.

In the case of sorghum and spruce, the IFS approach suggested using all SNPs for one trait in sorghum and for two traits in spruce (see Additional file 1: Fig. S1 and Additional file 2: Fig. S2). For the other traits, the models trained on the selected top SNPs performed similarly or slightly worse on the Ψ data than the model trained with all SNPs. In the chicken dataset, all egg weight traits showed a similar behavior with respect to IFS (see Additional file 3: Fig. S3). By using less than the top 20% of SNPs, the prediction accuracy could be increased by up to 0.027. Of particular note in this dataset is that the prediction accuracy for all traits decreased strongly when the model was trained using up to the ~ 200 top SNPs; only after that point did the R2 value start to increase until it reached its maximum. Among the traits included in the pig dataset, the IFS approach yielded an advantage of 0.009 for the T5 trait, while it achieved a comparable or slightly worse performance than the model based on all SNPs for the other traits. This level of performance was achieved only when between 9% and 28% of the SNPs were used (see Additional file 4: Fig. S4). However, the T1 trait, which has a reported heritability of only 0.07 [18], could not be predicted at all, and our approach wrongly suggested a model using only a single SNP.

Table 2 Prediction accuracy of phenotypes using selected SNPs compared to all SNPs on Ψ data
Fig. 1

Prediction accuracy of maize phenotypes. Prediction accuracy (measured as mean R2) of maize phenotypes as a function of the number of SNPs used for the model (presented as logarithmic values) on the Φ data. The trend estimate, represented by the solid black curve, is obtained through smoothing. The maximum accuracy is indicated by the vertical green line. Mean performance of the model when trained on all SNPs is represented by the horizontal black line, with the shaded interval around it indicating the standard error of the mean of 10 cross-validation repetitions. The prediction accuracy of all three traits could be strongly increased using IFS with improvements ranging from 0.035 to 0.147

Fig. 2

Prediction accuracy of soy phenotypes. Prediction accuracy (measured as mean R2) of soy phenotypes as a function of the number of SNPs used for the model (presented as logarithmic values) on the Φ data. The trend estimate, represented by the solid black curve, is obtained through smoothing. The maximum accuracy is indicated by the vertical green line. Mean performance of the model when trained on all SNPs is represented by the horizontal black line, with the shaded interval around it indicating the standard error of the mean of 10 cross-validation repetitions. The prediction accuracy of all three traits could be increased by up to 0.02 using IFS

Fig. 3

Prediction accuracy of switchgrass phenotypes. Prediction accuracy (measured as mean R2) of switchgrass phenotypes as a function of the number of SNPs used for the model (presented as logarithmic values) on the Φ data. The trend estimate, represented by the solid black curve, is obtained through smoothing. The maximum accuracy is indicated by the vertical green line. Mean performance of the model when trained on all SNPs is represented by the horizontal black line, with the shaded interval around it indicating the standard error of the mean of 10 cross-validation repetitions. Only for the trait of Anthesis date (AN) could the prediction accuracy be increased by 0.008

Discussion

Previous studies have reported that feature selection for genomic prediction can reduce the number of SNPs required to achieve the same performance as models trained with all available SNPs, or even result in a greatly increased performance [13,14,15, 30, 37]. The latter is assumed to be the case when the number of individuals is much smaller than the number of features used by the model. In this case, reducing the number of features may reduce overfitting and increase the models’ ability to generalize [38]. However, even if only a similar performance is reached with fewer SNPs, this can still lead to the development of new low-density SNP chips specific to certain traits, which can reduce the cost of genotyping and make genomic prediction more economically feasible in breeding programs [39]. In this paper, we present a feature selection approach based on incrementally adding SNPs that are sorted by the strength of their association with the phenotype as reported by a GWAS. This approach is an implementation of the methodology outlined by Bermingham et al. in their study on the prediction of human traits [14]. While the authors proposed further alternative feature selection methods that incorporate linkage disequilibrium (LD) information to remove redundant markers, their results indicated that these modifications did not improve feature selection [14]. Furthermore, it has been shown that removing SNPs in LD can negatively impact prediction accuracy [40]. Therefore, we did not include LD pruning in our method. This IFS approach was combined with random forest as a prediction model, which has previously been shown to be a robust and highly predictive model for genomic prediction [10, 16, 26]. Furthermore, while random forest is able to consider interactions between SNPs [41], such interactions can be missed by the model in high-dimensional data [42, 43]. Therefore, reducing the number of SNPs through IFS can help in this regard.

A further advantage of the IFS approach is that it can be combined with most genomic prediction models. Several studies have shown that it is the choice of prediction model that has the greatest impact on the accuracy of the predictions [16] and that the best performing method can vary with the dataset [9, 10]. For simplicity and because of the advantages already mentioned, we applied random forest in our study. However, in the supplied R script it can be easily replaced by a different model.

We applied our approach to datasets for seven animal and plant species, each with multiple traits, and observed a wide range of results (Table 2), from a considerable improvement in prediction accuracy (e.g., in the maize dataset) to no improvement for several traits, i.e. the IFS approach did not lead to a model that was better than the one trained with all SNPs. In a few cases, the performance of the full model could also be achieved using only a fraction of the SNPs. To find the optimal number of SNPs resulting in the best performance, we smoothed the R2 values and determined the maximum value among them. This approach works well if there is a peak in the trend curve (for example, Fig. 1 or 2). However, if there is no such peak and the trend curve instead approaches the asymptote that represents the performance of the full model (for example, the ST trait in Fig. 3), the identified maximum tends to be more conservative and suggests using more SNPs than are actually necessary. A manual selection might further reduce the number of SNPs required to reach the best performance in such situations.

Of particular interest are the results for the chicken dataset. For all the traits in this dataset, the prediction accuracy started to decrease after about the first 30 SNPs and began to increase again only after about 200 SNPs (see Additional file 3: Fig. S3). Discarding these SNPs, which initially led to lower prediction accuracy, could in principle improve the final model. As an example, for egg weight at 36 weeks of age (EW36), for which the minimum prediction accuracy was reached with the top 236 SNPs, we examined how removing these SNPs influenced the prediction accuracy (see Additional file 5: Fig. S5). We found that the new curve followed a similar trend, but the model that did not discard these SNPs reached a higher prediction accuracy. Thus, although these SNPs have a negative effect on the prediction accuracy when they are first added to the model, they still make a positive contribution to the final model. Therefore, we decided not to discard any SNPs in our approach.

The number of individuals and markers in the datasets under study differed strongly between the species, which allowed us to consider different scenarios. The aforementioned \(p \gg n\) problem suggests that feature selection should be more effective if the markers largely outnumber the individuals in the dataset [37]. However, this expectation is not reflected in our results. For example, on the one hand, with the soy dataset, which included a comparable number of individuals and markers (5014 and 4234, respectively), we could improve the prediction accuracy by up to 0.012 using about the top 20% of SNPs. On the other hand, with the switchgrass dataset, which included only 514 individuals but 217,150 markers, we could only improve the prediction accuracy for two of the three traits, by 0.013 and 0.006, respectively. The prediction accuracy of the third trait, standability, could not be improved, although only about 10,000 SNPs seemed to be necessary to achieve a similar level of performance as the model trained with all available SNPs. There was no detectable relationship between the size of the dataset and the effectiveness of the IFS approach. Therefore, our recommendation is to always test the effectiveness of feature selection, even if the number of SNPs in the dataset is small. Furthermore, IFS seems to be influenced more by the genotype data than by the trait under study: the prediction accuracy curves for a given species follow a similar trend across the different traits.

In their feature selection program GMStool, Jeong et al. [15] applied a similar GWAS-based approach. However, instead of selecting a certain number of SNPs to be used, they selected specific SNPs that contributed positively to the model performance when they were first added. SNPs without a positive contribution were not included, and SNPs were added one by one until certain stop criteria (a number of consecutive rejections or reaching a target correlation) were met. To assess the effectiveness of our method, we compared it with GMStool. However, due to the greater computation time required by GMStool, we applied it only to the small datasets and to one of the larger datasets. Both methods were run on a dual Intel® Xeon® Gold 6138 processor system using 60 threads. While GMStool offers multiple genomic prediction models, we selected random forest, which allows a better comparison with our own approach; the results are given in Additional file 6: Table S1. On average, the runtime of GMStool was four times longer than that of our script. The higher computational speed of our approach is most likely due to GMStool training more models, as it uses a constant step size of 1. Jeong et al. [15] further used a different parallelization strategy (parallelizing the cross-validation instead of the random forest training) combined with a less efficient R package for random forest (randomForest [44]). For all the datasets examined, GMStool selected a smaller number of SNPs for the best model than our IFS approach. However, the performance of the resulting final models on the Ψ data was comparable between the two methods. In four of the 13 datasets analyzed, the models selected by GMStool achieved higher R2 values, while in eight datasets our method resulted in models with higher R2 values than GMStool. For the last dataset, both approaches achieved the same performance. The better performance of our IFS approach may be due to GMStool discarding SNPs that do not improve the performance when they are first added to the model. This can prevent the inclusion of SNPs that are, by themselves, not strongly associated with the phenotype but may be associated through interactions with other SNPs. With our method, we found that removing SNPs that initially make a negative contribution may lead to less accurate predictions (see Additional file 5: Fig. S5). Finally, the stop criteria may also cause GMStool to end its selection before all possible SNPs are tested. Combined with its faster computation time, the improved results of our IFS method demonstrate its state-of-the-art performance.

Conclusions

The objective of this study was to investigate the potential of incremental feature selection to improve the accuracy of genomic prediction. To this end, we developed a framework for incremental feature selection based on ranking the SNPs according to the strength of their association with the phenotype as determined by a GWAS. In combination with random forest as the prediction model, we applied this method to various plant and animal datasets. The results obtained ranged from substantial or minor increases in prediction accuracy to a reduction in the number of SNPs required to achieve the performance of the full model. A substantial improvement could only be obtained in a few cases. Furthermore, we show that even with a small number of SNPs, feature selection can still improve the performance of the model. Therefore, we propose that genomic prediction programs should always test whether incremental feature selection can improve the model. Our approach has been implemented as an R script and is freely available at https://github.com/FelixHeinrich/GP_with_IFS.