Introduction

Most human diseases and disorders result from a complex interplay between multiple genetic and environmental factors (Lander and Schork 1994). These conditions are commonly called complex diseases or disorders. Particularly for severe complex conditions, preventive medicine and the development of long-term curative strategies demand effective and reliable disease prediction. This, however, remains challenging.

This limitation may be at least partially due to the fact that the vast majority of standard disease prediction models omit genetic information. Instead, they solely rely on typical risk factors (hereinafter termed ‘traditional risk factors’) such as environmental exposures and intermediate phenotypes. The latter are defined as disease-related clinical or molecular measures that are related to the pathomechanism(s) underlying the disease of interest. Well-known examples for such traditional risk factors include a high body mass index (BMI) and high blood cholesterol in cardiovascular diseases (Yusuf et al. 2004).

In contrast, recent advances in the field of complex disease genetics have paved the way for including genetic data in disease prediction models. Moreover, genotyping of disease-specific genetic variants can be conducted independently of the tested individual’s age and is increasingly being considered an affordable routine diagnostic procedure. Although identified genetic risk variants explain only a minor proportion of heritability so far, this proportion is continually growing due to ongoing advances provided by genome-wide association studies (GWAS) and next generation sequencing analyses (Stranger et al. 2011). In particular, GWAS have identified a growing number of common single nucleotide polymorphisms (SNPs), and several studies have started to consider such genetic information in the framework of common complex disease prediction with notable, but highly varying success. Here, we provide a systematic analysis of these studies by discussing the applied methodology, the reliability of the obtained results and their clinical relevance. We extend previous work (Cook and Paynter 2010; Thanassoulis and Vasan 2010; Wang 2011; Vassy and Meigs 2012) by including current original publications as well as by comparing results across the prediction of different phenotypes. This allows us to search on a broader basis for key drivers that may be related to improved prediction performance. Investigated parameters include the performance of the baseline model without genetics, the number of SNPs, the validation level of the SNPs and of the model, family history, and whether the chosen SNPs were associated with the predicted phenotype itself or with intermediate phenotypes. Based on this analysis, we further suggest potential directions for future research.

Search strategy and study identification

For direct comparison, we included studies that predicted susceptibility to frequent complex diseases and disorders by models incorporating (1) traditional (non-genetic) risk factors and (2) traditional risk factors and common genetic variants. Studies predicting the course of a disease were not considered. Moreover, selected studies had to test the benefit of genetic marker inclusion by quantitatively comparing the combined prediction model against a model omitting genetic markers. If any of these criteria was not met, the study was not included.

The included studies were selected according to the following search strategy. Initially, PubMed was filtered by the search term “Risk”[MeSH] AND “Genetic Predisposition to Disease”[MeSH] AND “Polymorphism, Single Nucleotide”[MeSH] AND improve*. Articles published between 2006/01/01 and 2015/10/01 were included, and results were restricted to the species humans. Articles of the categories ‘review’ and ‘clinical study’ were excluded. This resulted in 250 articles. When filtering these articles according to our inclusion criteria, 25 eligible studies remained. Subsequently, we iteratively analysed the co-citation network based on the reference lists of the included studies. This identified 17 additional articles. Finally, another 13 articles were identified using the search engines https://www.google.de/ and http://scholar.google.de/ by combining the terms ‘SNPs’, ‘family information’ and ‘risk prediction’ (Fig. 1). With this strategy, we benefit on the one hand from the controlled MeSH vocabulary of MEDLINE, which decreases the loss of potentially relevant articles due to differences in vocabulary. On the other hand, analysis of the co-citation network extends this initial MEDLINE search in an expert-guided manner.

Fig. 1

Search strategy for the inclusion of studies in the analysis. This figure provides an overview of the search strategy and the numbers of eligible studies meeting the inclusion criteria

In total, we identified 55 studies meeting all pre-set inclusion criteria. Several studies analysed more than one cohort or used different analytic models. When multiple nested non-genetic models were reported for the same cohort, the best-performing model was preferred, which typically included the highest number of predictors. This resulted in 100 included distinct analyses (Table 1; Supplemental Table 1). Predicted phenotypes comprised cardiovascular pathologies (n = 18 studies) (Humphries et al. 2007; Morrison et al. 2007; Kathiresan et al. 2008; Paynter et al. 2009, 2010; Davies et al. 2010; Ripatti et al. 2010; Hughes et al. 2012; Lluis-Ganella et al. 2012; Hernesniemi et al. 2012; Brautbar et al. 2012; Isaacs et al. 2013; Bolton et al. 2013; Ganna et al. 2013; Tikkanen et al. 2013; Ibrahim-Verbaas et al. 2014; Beaney et al. 2015; de Vries et al. 2015), breast cancer (n = 5) (Wacholder et al. 2010; Mealiffe et al. 2010; Darabi et al. 2012; Dite et al. 2013; Vachon et al. 2015), prostate cancer (n = 10) (Zheng et al. 2008; Nam et al. 2009; Salinas et al. 2009; Aly et al. 2011; Johansson et al. 2012; Kader et al. 2012; Klein et al. 2012; Lindström et al. 2012; Helfand et al. 2013; Butoescu et al. 2014), type 2 diabetes (n = 15) (Balkau et al. 2008; van Hoek et al. 2008; Lyssenko et al. 2008; Meigs et al. 2008; Lin et al. 2009; Schulze et al. 2009; Talmud et al. 2010; Wang et al. 2010; de Miguel-Yanes et al. 2011; Vassy et al. 2012a, b; Tam et al. 2013; Mühlenbruch et al. 2013; Vassy et al. 2014; Walford et al. 2014), atrial fibrillation (n = 2) (Everett et al. 2013; Tada et al. 2014), venous thrombosis (n = 2) (de Haan et al. 2012; Bruzelius et al. 2014), esophageal squamous cell carcinoma (ESCC) (Chang et al. 2013), melanoma (Fang et al. 2013) and Parkinson’s disease (Hall et al. 2013) (n = 1 each). If not reported in the original publications, p values were calculated from reported confidence intervals.
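The conversion of a reported confidence interval into a p value can be sketched as follows. The exact procedure used for this review is not stated above; the sketch assumes a 95 % interval, an approximately normal estimator, and log-scale handling for ratio measures such as odds ratios. The function name and its parameters are illustrative only.

```python
import math

def p_value_from_ci(estimate, lower, upper, ratio=True, z_crit=1.96):
    """Approximate two-sided p value from a point estimate and its 95 % CI.

    Assumption: the estimator is approximately normal. Ratio measures
    (e.g. odds ratios) are handled on the log scale, where the null is 0.
    """
    if ratio:
        estimate, lower, upper = (math.log(v) for v in (estimate, lower, upper))
    standard_error = (upper - lower) / (2 * z_crit)  # CI width -> standard error
    z = abs(estimate) / standard_error               # distance from the null value
    return math.erfc(z / math.sqrt(2))               # two-sided normal tail area
```

An interval whose lower bound just touches the null value, for example an odds ratio of 1.5 with limits 1.0 and 2.25, yields a p value of approximately 0.05 under these assumptions.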

Table 1 Overview of the number and type of genetic data used for prediction

Which genetic markers were selected to improve prediction?

We did not identify studies that conducted de novo SNP selection. Instead, all studies referred to previously published GWAS or candidate gene association studies that had identified genetic variants associated with the disease of interest. Hence, SNPs used in the prediction studies could be considered pre-validated at least at a basic level. However, different SNP selection strategies were reported (Table 1). Forty-eight studies (87 %) included SNPs resulting from previous GWAS, while 7 studies (13 %) considered SNPs identified in previous candidate association studies. The method of choice strongly correlated with the year of publication, probably indicating an increased availability of GWAS data for the disease of interest: only 2 out of 42 studies (5 %) published after 2009 used SNPs from candidate association studies. In contrast, 5 out of 13 studies (39 %) published before 2010 relied solely on candidate association studies for SNP selection.

Notably, seven studies (Kathiresan et al. 2008; Paynter et al. 2010; Lluis-Ganella et al. 2012; Hernesniemi et al. 2012; Brautbar et al. 2012; Isaacs et al. 2013; Ibrahim-Verbaas et al. 2014) explicitly distinguished between SNPs that are directly associated with the predicted endpoint, and SNPs that are associated with intermediate phenotypes and therefore may contribute indirectly. A rationale behind this approach is that it may be of relevance for the predictive value of genetics when SNPs associated with intermediate phenotype are excluded as these intermediate phenotypes may be already included as predictors in the baseline model. A different rationale is that the model may benefit from shared genetics of the intermediate and predicted phenotype.

How were SNPs included into the prediction model?

In general, two main strategies for including genetic data in the predictive model were identified (see also Fig. 2). In the first, each genetic risk variant was considered as an individual covariate in addition to the traditional risk factors in the framework of a regression model. This strategy was adopted by 6/55 studies (11 %) (Paynter et al. 2009; Lindström et al. 2012; Chang et al. 2013; Bolton et al. 2013; Helfand et al. 2013; Bruzelius et al. 2014). A disadvantage of this approach is that the model size increases considerably with the number of independent genetic risk factors added. The second strategy aimed to solve this problem by proposing an additive genetic risk score (Horne et al. 2005), which was adopted by 49/55 studies (89 %). The simplest way to construct a genetic risk score is to count the number of risk alleles among all SNPs, which was performed in the majority of the studies (26/49 studies, 53 %). However, this assumes that each risk allele of each SNP has the same predictive value, which is unlikely to hold in most cases. To account for potential differences between SNPs, SNPs can alternatively be weighted according to their effect size. This was done in 29/49 studies (59 %). Please note that studies utilising both strategies were assigned to both categories. Using an additive genetic risk score composed of weighted risk variants resulted in risk estimates with effect sizes similar to those from a regression model including each genetic variant as an individual covariate. In the former approach, however, fewer degrees of freedom are utilised. The analysed studies adopted two methods for SNP weighting. In the first method, weights were based on effect sizes from an independent cohort, e.g. from the literature (25/29, 86 %). In the second, weights were calculated from the same population used for prediction analysis (4/29, 14 %), which might result in biased estimates of the model’s prediction accuracy.
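The two score types can be illustrated with a minimal sketch. It assumes genotypes coded as risk-allele counts (0, 1 or 2) and uses the natural logarithm of the odds ratio as the weight, which is a common convention but not necessarily the one used by every reviewed study. The function names are hypothetical; the two odds ratios are the literature values for rs6025 and rs8176719 quoted later in this review, and the genotype is invented for illustration.

```python
import math

def unweighted_grs(genotypes):
    """Unweighted score: total number of risk alleles across all SNPs.

    genotypes maps a SNP identifier to its risk-allele count (0, 1 or 2).
    """
    return sum(genotypes.values())

def weighted_grs(genotypes, odds_ratios):
    """Weighted score: risk-allele counts weighted by the log odds ratio.

    Assumption: weights are ln(OR), ideally taken from an independent cohort.
    """
    return sum(count * math.log(odds_ratios[snp])
               for snp, count in genotypes.items())

# Illustrative individual: heterozygous for rs6025, homozygous for rs8176719.
example_genotypes = {"rs6025": 1, "rs8176719": 2}
example_odds_ratios = {"rs6025": 3.8, "rs8176719": 1.85}

print(unweighted_grs(example_genotypes))
print(weighted_grs(example_genotypes, example_odds_ratios))
```

In the unweighted score both SNPs contribute equally per allele, whereas in the weighted score a single rs6025 risk allele contributes more than two rs8176719 risk alleles combined.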

Fig. 2

Overview of methods as to how genetic data were included in the prediction model. “Sum score”: from all SNPs a single predictor reflecting the genetic burden was created and used as single parameter in the prediction model, “individual SNPs”: SNPs were included as individual covariates in the model used for prediction, “weighted”: risk alleles of SNPs were weighted according to the respective odds ratio, and “unweighted”: risk alleles of SNPs were counted without weighting. Note that Brautbar et al. (2012), Everett et al. (2013) and Talmud et al. (2010) used weighted as well as unweighted sum scores in their analyses and thus appear in both categories in the figure

In addition to these two commonly applied strategies for SNP inclusion in the prediction model, two further approaches were identified. Humphries et al. (2007) focused on the independent contribution of each SNP, given all traditional risk factors and the effects of all other SNPs. By doing so, the authors adjusted the effect of each SNP for all other SNPs and known risk factors, and extended the traditional risk score using these adjusted values. Helfand et al. (2013) did not include genetic variants in their model, but instead aimed to use genetic data to improve the value of prostate specific antigen (PSA) levels as an indicator for biopsy-based screening of prostate cancer. The authors divided the PSA level by a genetic risk score and used the resulting modified PSA levels to decide on a recommendation for biopsy. This is a rather unconventional approach that has only rarely been reported so far. Hence, additional in-depth assessment of its impact and predictive value is recommended.

Regardless of the outlined differences, many studies accounted for the correlation of individual SNPs, which results from linkage disequilibrium. The common strategy was pruning SNPs, i.e. removing one SNP of each pair whose correlation exceeds a certain cut-off. Another approach to account for the correlation between SNPs would be to explicitly model their correlation structure. However, this was not done in any of the investigated studies. We note that methodology for feature selection and predictor weighting is continually improving (Kooperberg et al. 2010; Kruppa et al. 2012), and future work will show whether this may lead to further improved prediction.
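Pruning can be sketched as a simple greedy procedure. The sketch assumes that pairwise r² values are already available and that SNPs are supplied in order of priority (for example by association strength). The cut-off of 0.8, the function name and the SNP identifiers are illustrative assumptions, since the reviewed studies used differing thresholds.

```python
def prune_snps(snps, r2, threshold=0.8):
    """Greedy pruning of correlated SNPs.

    snps: SNP identifiers in order of priority (highest first).
    r2: dict mapping frozenset({snp_a, snp_b}) to the pairwise r^2;
        missing pairs are treated as uncorrelated (assumption).
    A SNP is kept only if its r^2 with every SNP kept so far does not
    exceed the threshold, so one member of each correlated pair is dropped.
    """
    kept = []
    for snp in snps:
        if all(r2.get(frozenset((snp, other)), 0.0) <= threshold for other in kept):
            kept.append(snp)
    return kept

# Hypothetical example: rsA and rsB are in strong linkage disequilibrium.
example_r2 = {
    frozenset(("rsA", "rsB")): 0.95,
    frozenset(("rsB", "rsC")): 0.10,
}
print(prune_snps(["rsA", "rsB", "rsC"], example_r2))
```

Because the procedure is order dependent, the ranking of the input list determines which SNP of a correlated pair is retained.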

How were traditional risk factors included into the prediction model?

Traditional risk factors were incorporated in the prediction models in different ways. On the one hand, studies used well-established risk models like the Framingham risk score for cardiovascular disease (e.g. Isaacs et al. 2013; Bolton et al. 2013) or the Gail model for breast cancer (e.g. Wacholder et al. 2010; Darabi et al. 2012). On the other hand, some studies selected non-genetic risk factors themselves for prediction. Ten of eighteen (56 %) studies predicting cardiovascular disease used established risk models as well as all (5/5) studies predicting breast cancer. When selecting non-genetic risk factors themselves, overfitting might occur if no appropriate validation strategy is applied.

Note that the heterogeneity in prediction performance of the baseline model can be high between different cohorts and studies, even when the same model of traditional risk factors is applied. For example, the AUC for predicting prostate cancer with the non-genetic PCPTRC risk model ranged from 0.56 to 0.72 when it was applied to different cohorts in the same study (Ankerst et al. 2012). Therefore, improvements of prediction due to genetics should always be interpreted under consideration of this heterogeneity (Ankerst and Thompson 2012).

How can the benefit of genetic data inclusion for disease prediction be measured?

Several methods have been reported to compare models including genetic data with those omitting such information. Most of these methods can be categorised into measures of discrimination ability, reclassification ability, and model calibration (Steyerberg et al. 2010; Pencina et al. 2010; Wang 2011; Siontis et al. 2012).

Discrimination describes the ability of a model to distinguish individuals from risk and non-risk groups. The most prominent measure is the area under the receiver operating characteristic (ROC) curve (AUC). Within a classical ROC curve, the sensitivity of the model is plotted against 1 − specificity for various thresholds of the predictive score (Bamber 1975). The AUC can range from 0.5 (the model is completely uninformative) to 1 (perfect discrimination between affected and unaffected individuals). Technically speaking, the AUC can be interpreted as the probability that a randomly chosen affected individual has a higher predicted risk score than a randomly chosen individual from the control group (Hanley and McNeil 1982). In order to analyse the benefit of including genetic data in a disease prediction model, the AUC of the model additionally including genetic data is compared with the AUC of the baseline model. If the former is significantly larger, a benefit can be claimed. However, the AUC has certain limitations. First, it summarises multiple realisations of a predictive model, as it evaluates the performance of the model for all possible thresholds of the predictive score. This is done regardless of whether or not these thresholds are clinically meaningful. Therefore, an improved AUC does not necessarily reflect an improved performance with respect to clinical relevance. Second, the AUC has been criticised for being relatively insensitive (Cook 2007). This is of particular relevance when the AUC of the baseline model is already good. Here, the power to detect a statistically significant improvement of the AUC by including a certain genetic marker is much lower than for a model with a lower initial AUC (Tzoulaki et al. 2009; Pencina et al. 2010). A prominent example is the study of Pencina et al. (2010), who reported that an additional genetic marker with an effect size of 0.41 (corresponding to an SNP with an odds ratio of 1.5) can improve the AUC of baseline models with values of 0.55, 0.6, and 0.75 to an AUC of 0.63, 0.65, and 0.77, respectively. Such calculations typically assume no correlation between the predictors of the baseline model and the added genetic predictors. As one may expect, however, adding a genetic component to a baseline model will often result in a relevant correlation between the predictors of both models. Hence, the power of the AUC method may be even lower.
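The probabilistic interpretation of the AUC can be made concrete with a small sketch that compares every case with every control and counts ties as one half. This pairwise form is quadratic in the sample size and is meant only as an illustration; the function name and the example scores are invented.

```python
def auc(case_scores, control_scores):
    """AUC as the probability that a case scores higher than a control.

    Every case is compared with every control; ties count as 0.5.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Invented predicted risks for three cases and three controls.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

In this example eight of the nine case-control pairs are ordered correctly, so the AUC equals 8/9. Comparing a baseline and an extended model then amounts to computing this quantity for both sets of predicted risks in the same individuals.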

Reclassification improvement describes improvement in the classification of cases and controls when comparing an updated model against a baseline model. For this purpose, Pencina et al. (2010) proposed the net reclassification improvement (NRI). The NRI examines whether the model including genetic data shifts cases to higher risk categories more often than to lower ones and, vice versa, controls to lower risk categories more often than to higher ones. Improvement can be claimed if the sum of these net movements is greater than 0 %; a value of 0 % is found if there is equal movement in the correct and incorrect direction (Paynter et al. 2010). However, changes of risk category do not necessarily involve clinically important risk categories. Therefore, some authors report the ‘clinical NRI’ related to changes within the most clinically relevant risk categories (Cook and Paynter 2011). Although the NRI is a well-suited measure for reclassification analysis, a number of limitations need to be considered (see Cook and Paynter 2011 for details). Therefore, the results obtained should be interpreted carefully and reported in sufficient detail (Pepe 2011).
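A categorical NRI can be sketched as follows, assuming that risk categories are coded as ordered integers (for example 0 = low, 1 = intermediate, 2 = high). The function name and the example data are hypothetical.

```python
def net_reclassification_improvement(old_categories, new_categories, is_case):
    """Categorical NRI of an updated model relative to a baseline model.

    Categories are ordered integers; is_case flags affected individuals.
    NRI = (share of cases moving up - share of cases moving down)
        + (share of controls moving down - share of controls moving up).
    """
    n_cases = n_controls = 0
    net_cases = net_controls = 0
    for old, new, case in zip(old_categories, new_categories, is_case):
        if case:
            n_cases += 1
            net_cases += (new > old) - (new < old)     # upward moves are correct
        else:
            n_controls += 1
            net_controls += (new < old) - (new > old)  # downward moves are correct
    return net_cases / n_cases + net_controls / n_controls

# Invented example with four cases followed by four controls.
example_old = [0, 1, 1, 2, 0, 1, 1, 2]
example_new = [1, 1, 2, 1, 0, 0, 1, 2]
example_case = [True] * 4 + [False] * 4
print(net_reclassification_improvement(example_old, example_new, example_case))
```

In the example two cases move up and one moves down (net 1/4), while one control moves down and none moves up (net 1/4), giving an NRI of 0.5.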

The integrated discrimination improvement (IDI) represents an alternative and relevant reclassification measure. The IDI is the difference between the discrimination slopes of the updated and the baseline model, where the discrimination slope of a model is the difference between the mean predicted risk of cases and that of controls (Pencina et al. 2008). As clinical risk categories are not required for IDI calculation, it is of particular value when such categories do not (yet) exist.
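A minimal sketch of the IDI, assuming that both models provide a predicted risk for the same individuals, is given below. Function names and example values are invented for illustration.

```python
def discrimination_slope(risks, is_case):
    """Mean predicted risk of cases minus mean predicted risk of controls."""
    case_risks = [r for r, case in zip(risks, is_case) if case]
    control_risks = [r for r, case in zip(risks, is_case) if not case]
    return sum(case_risks) / len(case_risks) - sum(control_risks) / len(control_risks)

def integrated_discrimination_improvement(old_risks, new_risks, is_case):
    """IDI: discrimination slope of the updated model minus that of the baseline."""
    return discrimination_slope(new_risks, is_case) - discrimination_slope(old_risks, is_case)

# Invented predicted risks for two cases followed by two controls.
example_status = [True, True, False, False]
example_baseline = [0.6, 0.4, 0.3, 0.1]
example_updated = [0.7, 0.5, 0.2, 0.1]
print(integrated_discrimination_improvement(example_baseline, example_updated, example_status))
```

Here the slope rises from 0.30 to 0.45, so the IDI is approximately 0.15. No risk categories enter the calculation.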

Calibration assesses the agreement of predicted and observed risks across subgroups with varying baseline risk. In general, only predicted risks that are well-calibrated are useful for clinical management, because treatment decisions often depend on estimates of the predicted risk. The most common measure for calibration is the Hosmer–Lemeshow test, which compares predicted and observed outcomes over percentiles of risk (Lemeshow and Hosmer 1982). Superiority of one model to another is typically demonstrated by increased p values resulting from this test.
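The Hosmer–Lemeshow statistic can be sketched as below. The sketch assumes equally sized groups formed after sorting by predicted risk (ten groups by default) and returns only the test statistic; the p value would be obtained by comparing it with a chi-square distribution, conventionally with the number of groups minus two degrees of freedom. The function name and the example data are illustrative.

```python
def hosmer_lemeshow_statistic(risks, outcomes, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic.

    risks: predicted probabilities; outcomes: observed events (0 or 1).
    Individuals are sorted by predicted risk and split into n_groups
    groups of (nearly) equal size. For each group the observed number of
    events O is compared with the expected number E (sum of risks):
        (O - E)^2 / (E * (1 - E / n_group))
    """
    pairs = sorted(zip(risks, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        expected = sum(risk for risk, _ in group)
        observed = sum(outcome for _, outcome in group)
        variance = expected * (1 - expected / len(group))
        if variance > 0:  # skip degenerate groups with all risks 0 or 1
            statistic += (observed - expected) ** 2 / variance
    return statistic

# Invented example with two risk groups of four individuals each.
example_risks = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5]
example_outcomes = [0, 0, 0, 1, 1, 0, 1, 0]
print(hosmer_lemeshow_statistic(example_risks, example_outcomes, n_groups=2))
```

A small statistic (and hence a large p value) indicates no evidence of disagreement between predicted and observed risks. Two groups are used in the example only to keep the arithmetic easy to follow.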

All measures described above reflect different aspects of model quality and should be considered in close relation to each other whenever possible. For example, a study in which the AUC increase is considered small may still provide substantial improvement of the reclassification measure NRI and/or increase in the IDI (Pencina et al. 2008). Therefore, an AUC increase of even 0.01 might still be suggestive of a meaningful improvement in some cases (Pencina et al. 2008), as reclassification of clinically important patient subgroups might have been improved. However, reclassification and discrimination are only of clinical value when the predicted risks are in strong correlation with the actual risk. The estimation of calibration is therefore necessary, and, in the case of a poorly calibrated model, a recalibration to the population of interest is strongly recommended (Pepe and Janes 2013).

In order to avoid biased estimates, the quality measures discussed should be computed on test sets independent from the initial (‘training’) set used for fitting the model. All prediction studies considered in this review only included SNPs that were pre-validated in previous, independent association studies.

Did the inclusion of genetic information improve disease prediction?

We identified both studies reporting and studies not reporting improved prediction when including genetic data (Supplemental Table 1; Fig. 3). Thirty studies (55 %) saw significant improvements in AUC when including genetic data. Fourteen further studies (26 %) did not identify a significant AUC improvement, but did identify a significant improvement in reclassification. Effect sizes of traditional risk factors were generally larger than those of genetic risk factors, regardless of whether or not they were determined in an independent data set. Nevertheless, a considerable variation in the effect sizes of traditional risk factors was observed. For example, a strong risk factor for venous thrombosis is the presence of minor leg injuries. This risk factor has an odds ratio (OR) >5 (Previtali et al. 2011), whereas obesity is a moderate risk factor reported to confer risk for cardiovascular diseases with an OR of 1.62 (Yusuf et al. 2004). As the power to improve the AUC in prediction depends on the predictive strength of the baseline model, poorly performing baseline models showed the greatest improvement when adding genetic predictors. In consequence, we found a clear relationship between the improvement of the AUC due to including genetic data and the phenotypes predicted (Fig. 3). For example, Zheng et al. (2008) (predicting prostate cancer) reported an AUC of 0.608 for a model accounting for age, geographic region and family history. After the addition of the genetic risk score, the AUC of the model increased with statistical significance to 0.633 (p = 6.1 × 10^-6). In well-performing baseline models, a significant increase of the AUC is less frequently reported. Still, prediction improvement by considering genetic data is often shown by applying measures of reclassification. As an example, Walford et al. (2014) reported no significant improvement of the baseline model AUC (AUC = 0.861), but significant improvement in the reclassification measure (NRI = 0.247, p = 0.0009).

Fig. 3

Overview of the discrimination improvement due to inclusion of genetic data across all included 100 analyses. An AUC of 1.0 indicates perfect discrimination between cases and controls, 0.5 is equivalent to random guessing. Studies are stratified according to their predicted phenotype. Each reported analysis is depicted in form of an arrow with the arrow start indicating the AUC when using traditional risk factors only and the arrowhead indicating the AUC of the model including genetic data. The colour of the arrow illustrates significance of reclassification measures with blue statistically significant, orange not statistically significant, and grey not tested. Solid lines indicate GWAS-derived SNPs and dashed lines all other SNPs. The figure clearly illustrates it is generally harder to improve discrimination of a prediction model by including the genetic data in cases where the baseline model already performs well. Nevertheless, in some cases significant reclassification can be observed even for high baseline AUC values. For numbers and additional details on studies, please also refer to Supplemental Table 1

Of note, study comparability was limited by divergent study aims and designs as well as by the different analytic strategies applied. For example, four studies did not analyse whether the observed changes in discrimination were statistically significant. Only 15/55 studies (27 %) jointly reported results of discrimination, reclassification, and calibration of the model with and without genetic data (Supplemental Table 1). Furthermore, relevant aspects of statistical procedures were frequently not reported in detail. For example, studies applying bootstrap-based procedures rarely stated whether weighting of SNPs or assessment of predictive accuracy was done using the out-of-bag or the in-bag data. As another example, studies applying cross-validation rarely stated whether weighting of SNPs or traditional risk factors was done only once before the cross-validation procedure started or repeatedly within each cross-validation iteration. Such details are very helpful when evaluating reported classification performance and are important for a valid comparison of results from different studies. Guidelines already exist, but are only infrequently followed in current publications (Janssens et al. 2011).

What is characteristic for studies showing the strongest improvement of prediction by the inclusion of genetic data?

Several studies report enhanced prediction although the performance of the baseline model was similar to that of other studies not reporting such improvement (Fig. 3). The first outstanding example is the study of Bolton et al. (2013) predicting coronary heart disease. Here, the AUC increased in two analyses from 0.671 to 0.741 and from 0.717 to 0.753 when including genetic data. Although this study has several strengths (a large number of recently reported SNPs, a well-defined phenotype, and a prospective, population-based design), it also comes with some limitations. First, the sample size is rather moderate compared to other studies on the same phenotype. Second, each SNP was included in the model without applying weights from the literature; instead, weights were estimated from the same cohort that was used for prediction analysis. This may result in biased estimates. Third, the baseline model was one of the weakest that existed for this particular phenotype, which clearly favoured significant prediction improvement by adding a genetic component to the model. Other interesting examples are two studies predicting the occurrence of venous thrombosis (de Haan et al. 2012; Bruzelius et al. 2014). De Haan et al. (2012) reported AUC increases from 0.77 to 0.82 and from 0.71 to 0.77 in the discovery and validation cohort, respectively, and Bruzelius et al. (2014) reported an AUC increase from 0.80 to 0.84. These studies, in contrast to all others investigated, included common SNPs with very strong effect sizes: variants rs6025 and rs8176719 have literature-reported ORs of 3.8 and 1.85 with a frequency in cases of 10 and 47 %, respectively. Such impressive effect sizes together with high minor allele frequencies are of course the exception for genetic factors of common diseases. If present, however, they allow for a tremendous improvement of prediction when added.

Are there SNP-specific differences?

In addition to the effect sizes, we explored whether additional SNP characteristics may lead to better disease prediction. First, we compared the performance of GWAS SNPs versus SNPs from candidate studies. Of the studies that included SNPs from GWAS, 32/48 (67 %) reported significant improvement in classification, and 18 of these 32 (56 %) also included some strategy of validation. Of the studies that included SNPs from candidate association studies, 4/7 (57 %) reported significant improvement in classification, with two studies including a validation strategy. A reason for the non-superior performance of GWAS SNPs might be that candidate SNPs were well chosen, with a focus on well-validated SNPs with at least moderate effect sizes.

Analyses applying weighted genetic risk scores more frequently reported a significant improvement in prediction due to genetic data (40/58, 69 %) as compared to analyses applying non-weighted genetic risk scores (17/37, 46 %). An improved performance compared with studies applying non-weighted genetic risk scores was still observed when we filtered for analyses applying SNP weights determined in independent cohorts (29/43, 67 %). This was also true when further filtering for studies completely evaluating in a second cohort (4/4, 100 %). Therefore, it can be assumed that better performance of weighted genetic risk scores is unlikely to solely result from model overfitting. It also underpins the fact that weights from independent cohorts should be used to maximise the reproducibility of results whenever possible.

Inclusion of familial risk did not have a major effect on the predictive power of candidate SNPs. 30/50 (60 %) analyses with familial risk had improvement of prediction when including genetic data versus 32/50 (64 %) of analyses that did not include familial risk. This finding is in accordance with previous reports (Ripatti et al. 2010; Vassy and Meigs 2012). As noted by Ripatti et al. (2010), reasons include measurement error for family history and that currently known genetic variants only account for a small proportion of familial risk.

Finally, we did not observe a major effect of number of SNPs on prediction improvement (Supplemental Fig. 1). This may reflect the limited knowledge regarding the genetic component of complex diseases. Simulations demonstrate that the prediction improvement might be considerably higher when all relevant SNPs were included (Aly et al. 2011; Dudbridge 2013), but considerably larger studies are needed to identify more relevant SNPs and to uncover a larger fraction of the heritability.

Is there a benefit of including or excluding SNPs associated with intermediate phenotypes?

Brautbar et al. (2012), Paynter et al. (2010) and Lluis-Ganella et al. (2012) investigated the strategy of including SNPs significantly associated with coronary heart disease, but excluding SNPs associated with an intermediate phenotype related to this disease. In contrast, Kathiresan et al. (2008) and Isaacs et al. (2013) restricted selected SNPs to those that were associated with an intermediate phenotype. Ibrahim-Verbaas et al. (2014) included 322 SNPs associated with nine intermediate phenotypes of stroke and only two SNPs directly associated with stroke. Only Paynter et al. (2010) directly compared both strategies in the same dataset and found a similar (weak) performance of SNPs from both SNP selection strategies. Brautbar et al. (2012), Kathiresan et al. (2008), Lluis-Ganella et al. (2012) and Ibrahim-Verbaas et al. (2014) observed an improved predictive value when including genetic information. Given these examples of successful inclusion as well as exclusion of SNPs associated with intermediate phenotypes, no final conclusion can be drawn at this point due to the limited number of studies directly comparing different strategies in the same cohort. However, investigations of endophenotypes appear to be gaining relevance while the applied methodology continually improves. Significant benefit is expected especially for complex diseases in which endophenotypes have the potential to bridge complex genetic backgrounds and known disease heterogeneity (Insel and Cuthbert 2009). Interestingly, a complementary strategy applied by Hernesniemi et al. (2012) proved to be ineffective. Selecting GWAS-derived SNPs associated with cardiovascular diseases (CVD) to predict intermediate phenotypes of CVD, i.e. intima–media thickness and artery elasticity, did not result in enhanced prediction, and only a limited association with the intermediate phenotypes was found. A plausible explanation is that the selected SNPs may act via different intermediate phenotypes.

What are potential implications for clinical research and practice?

The inclusion of genetic markers such as SNPs in diagnostic procedures was originally expected to rapidly increase diagnostic accuracy and to significantly reduce the number of false-negative diagnoses (Ginsburg and McCarthy 2001; Diamandis et al. 2010). In theory, this would ultimately translate into both faster and improved therapeutic intervention and preventive treatment. After results from the first studies had been analysed, much of the initial enthusiasm cooled off. Ripatti et al. (2010) and de Haan et al. (2012) were sceptical about the potential clinical use of common SNPs for disease prediction in cardiovascular diseases. Wacholder et al. (2010) found a certain benefit in prediction when including genetic markers for high- and low-risk groups, but not in groups at intermediate risk for breast cancer. Still, the benefit was not sufficiently large to meaningfully improve the identification of patients who might profit from prophylactic treatment. Nevertheless, a clear diagnostic advantage was repeatedly seen for a number of diseases when genetic markers were included (Fig. 3), suggesting that particular scenarios can be defined under which this strategy provides a measurable benefit.

From a socio-economic perspective, cost-effectiveness is relevant when deciding which patients to test. For example, Mealiffe et al. (2010) proposed improving cost-effectiveness by testing only patients whose risk status is likely to change. Testing only individuals close to classification cut-offs would reduce the effort, because the majority of patients would not require screening. However, with the steady progress in the field, the costs of thorough genetic profiling are continuously declining (Gershon et al. 2011). This in turn may increase the value of genetic diagnostics even for conditions where current approaches provide only a small improvement.
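The idea of testing only individuals close to a classification cut-off can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the procedure of Mealiffe et al. (2010): the function name `select_for_genetic_testing`, the cut-off of 0.10, the margin of 0.03 and the risk values are all hypothetical and chosen for illustration only.

```python
def select_for_genetic_testing(baseline_risks, cutoff, margin):
    """Return indices of individuals whose baseline risk lies within
    `margin` of the classification `cutoff` (hypothetical selection rule)."""
    return [
        i for i, risk in enumerate(baseline_risks)
        if abs(risk - cutoff) <= margin
    ]


# Hypothetical risk estimates from a baseline model without genetic data.
risks = [0.02, 0.09, 0.11, 0.25, 0.40]

# Only individuals near the assumed cut-off are referred for genotyping.
selected = select_for_genetic_testing(risks, cutoff=0.10, margin=0.03)
print(selected)  # [1, 2]
```

In this toy example, two of five individuals would be genotyped; how wide the margin should be in practice depends on how strongly the genetic markers can shift an individual's estimated risk.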

More recent studies, including those by Ganna et al. (2013) and Tikkanen et al. (2013), are more optimistic about the benefit of adding genetic markers to disease prediction models. Ganna et al. (2013) estimated that one additional event resulting from coronary heart disease could be prevented per 318 patients by including genetic markers in their model. This measure is the gain in ‘number needed to screen’ (NNS) when comparing the baseline model with the model including genetic data. NNS is defined as the number of patients who need to be screened to detect one affected individual (Rembold 1998). Tikkanen et al. (2013) reported a gain in NNS of 15.9 (prevention of 135 events out of 2144 cases) by additional genetic testing in a subgroup of patients. This subgroup had previously been categorised as being at intermediate risk for coronary heart disease using non-genetic information. While these findings are generally encouraging, the reported gains in NNS are still relatively high. Several strategies may be suggested to compensate for this problem and to ensure a measurable benefit.
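The arithmetic behind the figure quoted for Tikkanen et al. (2013) can be sketched as follows. This is a minimal illustration, assuming that the reported value is simply the number of individuals tested divided by the number of events prevented; the function name `number_needed_to_screen` is hypothetical, and the cited studies may have used a more elaborate calculation.

```python
def number_needed_to_screen(n_tested: int, n_events_prevented: int) -> float:
    """Individuals tested per event prevented (assumed to be a simple ratio)."""
    if n_events_prevented <= 0:
        raise ValueError("n_events_prevented must be positive")
    return n_tested / n_events_prevented


# Numbers quoted in the text: 135 events prevented out of 2144 cases.
nns_gain = number_needed_to_screen(2144, 135)
print(round(nns_gain, 1))  # 15.9
```

Under this reading, a lower value indicates a more efficient test, since fewer individuals have to be genotyped for each prevented event.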

First, reporting the NNS is generally recommended when investigating diagnostic procedures supplemented with genetic test components. The NNS is rarely reported in the literature, although it directly indicates clinical relevance and assesses the value of an intervention. Consequently, the NNS can be used to identify the best-performing model when comparing different approaches.

Second, future research may also focus on assessments in which the investigation of traditional risk factors is more costly than the investigation of genetic risk factors. By applying a second, confirmatory diagnosis, e.g. by conventional diagnostics using non-genetic components, the number of patients treated with optimal benefit may be maximised (Peterson et al. 2013; Hagemann et al. 2013).

Third, future work may focus on conditions for which a highly effective or even causative treatment is available and/or early therapeutic intervention provides a clear benefit with respect to prognosis. This may further include cases in which a failure to treat places a relevant additional burden on the patient. In these scenarios, prediction is of high clinical relevance, and even a small benefit, such as that gained by including genetic data, may be of value. However, such approaches always require careful balancing of the socio-economic costs caused by additional treatments against the benefits gained, as well as thorough ethical consideration.

Such improvement of diagnostic genetic markers might also be relevant for other fields of clinical research, e.g. adaptive clinical trials or improved handling of patient heterogeneity by molecular reclassification. Up to now, genetic testing for drug efficacy has rarely been used in clinical practice despite its potential (Antman et al. 2012). This is also true for the subtyping of diseases using common SNPs. Here, promising candidates exist for certain diseases such as breast cancer, but these are not yet routinely available (Garcia-Closas et al. 2013).

Conclusion

We identified a considerable number of reports indicating that genetic data could contribute to the improvement of prediction models. The added value of considering genetic information was statistically significant in many analyses, though limited with respect to the absolute effect in most cases. Hence, considerable progress is still required before routine clinical practice will benefit from including genetic data in the prediction of risk for certain diseases. Although the heterogeneity of the phenotypes included in the reviewed studies requires careful and case-specific interpretation, we derived general conclusions for future work, which we summarise in Fig. 4.

Fig. 4

Take-home messages for predicting complex diseases with common genetic markers

Given encouraging examples of improved prediction with noticeable clinical relevance, and in light of the ongoing progress in the field of genetics, we are optimistic that the genetic component will increasingly be included in prediction models of complex diseases and disorders and will provide a measurable additional benefit in the future.