Introduction

Osteoporotic fractures are a global health burden, with 9 million incident cases identified in 2000 [1]. It is also estimated that approximately 1.6 million hip fractures occur each year, with marked variations in incidence around the world [2]. The costs of osteoporosis are considerable, with Europe spending over €36 billion and the United States spending $27 billion annually on treating and managing osteoporosis [3, 4]. The number of fractures in the elderly, and the associated economic burden, will only continue to grow as the world's population ages, placing greater demands on health care planning. Identifying individuals at increased risk of hip and other osteoporotic fractures who might benefit from therapeutic or preventive intervention is therefore an important challenge to be resolved.

Risk Prediction Models

Multivariable risk prediction models hold enormous potential to augment and support the physician's clinical reasoning with objective probability estimates [5]. The key term is multivariable, which explicitly denotes the simultaneous contribution of multiple risk factors to the outcome (ie, fracture), captured by a mathematical function.

There is an abundance of prediction models, even for the same condition, and an ever-increasing number of models are developed each year; yet, disappointingly, very few are routinely used in clinical practice. The reasons why certain prediction models get used while others do not are largely unknown. Existing reviews of clinical prediction models in cancer [6-8], diabetes [9•], and traumatic brain injury [10] have all consistently highlighted design problems, methodological weaknesses, and deficiencies in reporting, all of which could contribute to the lack of uptake.

A recent systematic review critically evaluated 35 prediction models for fracture risk assessment [11]. Consistent with existing reviews of prediction models for other clinical outcomes, the authors highlighted poor reporting, with key details on the development and performance of the models frequently omitted. Worthy of particular note, the well-known FRAX model [12] was excluded from the review because the FRAX manuscript did not provide the minimum information needed to objectively review the model (discussed later).

Regardless of how a clinical prediction model has been developed, the crucial, and arguably the only important, step is to assess its performance (validation), particularly when the model is tested in new individuals (external validation).

Overview of the Steps in Validating a Clinical Prediction Model

The general framework for developing and evaluating a prediction model has been well described [13••, 14•, 15-19, 20•, 21, 22]. Once a model has been developed, the sequence of steps providing increasing evidence to support the usefulness and transportability of a prediction model is apparent validation [18], internal validation [23], temporal validation [24], external validation [13••, 24-26], and ultimately an impact study [14•, 21, 27]. Apparent validation provides the least amount of useful information and is defined as the performance of the model on the same dataset used to develop it. Such performance estimates will be optimistic and provide limited evidence to support the prediction model.

Internal validation comprises split-sample, cross-validation, and bootstrapping approaches (in increasing order of usefulness). When performing a split-sample internal validation, the dataset is randomly split into a development and a validation dataset. However, this archaic approach is suboptimal, as the random split merely creates two similar datasets and thus does not provide a real test of the prediction model [13••, 24]. Furthermore, the model is built on a smaller subset of the original data, leading to unstable models, and the validation dataset is also usually small, leading to unreliable and potentially biased performance estimates. Bootstrapping is a stronger method that uses all the data to develop the model. Bootstrapping involves taking repeated samples (with replacement) from the original dataset, mimicking the drawing of samples from an underlying population. The main advantage of bootstrapping is the ability to calculate optimism-corrected performance estimates.
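
To make the procedure concrete, the minimal sketch below (in Python, assuming a predictor matrix X and a 0/1 outcome vector y, with a logistic regression and the c-index standing in for whatever model and performance measure are of interest) implements the bootstrap optimism correction: the optimism is the average amount by which a bootstrap model's apparent performance exceeds its performance on the original data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    def fit_and_auc(X_fit, y_fit, X_eval, y_eval):
        # Fit on one sample, report the c-index (AUROC) on another
        model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
        return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

    def optimism_corrected_auc(X, y, n_boot=200):
        apparent = fit_and_auc(X, y, X, y)          # apparent performance
        optimism = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y), len(y))   # sample with replacement
            Xb, yb = X[idx], y[idx]
            if yb.min() == yb.max():                # need both classes present
                continue
            auc_boot = fit_and_auc(Xb, yb, Xb, yb)  # apparent AUC in bootstrap sample
            auc_orig = fit_and_auc(Xb, yb, X, y)    # bootstrap model on original data
            optimism.append(auc_boot - auc_orig)
        return apparent - np.mean(optimism)         # optimism-corrected estimate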

External validation is a key step in evaluating the transportability of a prediction model. External validation may assess temporal, geographical, or spectrum transportability (eg, primary vs secondary care), or a combination of these. Ideally, external validation should be conducted by independent investigators not involved in the development of the prediction model [24]. An important aspect of external validation is to evaluate a prediction model using cohorts of patients with a different case mix.

Impact studies attempt to quantify the effect of a prediction model on clinician behavior (ie, changing therapeutic decisions, acceptability), cost-effectiveness, or patient outcomes, compared with a system without the prediction model [14•, 21, 22, 27]. While impact studies are the ultimate test and evaluation of a prediction model, very few impact analyses have been carried out [14•, 17].

Quantifying Performance

Irrespective of how a prediction model has been developed, the fundamental step is to quantify and evaluate the performance of the model. There are two important characteristics of a prediction model: discrimination and calibration [28••]. Discrimination is the ability of the prediction model to differentiate between individuals who do and do not experience the event (ie, fracture). The crucial aspect here is that individuals who experience the event should have higher predicted risks than those who do not. Discrimination is typically assessed by the area under the receiver operating characteristic curve (equivalent to the c-index), which ranges from 0 to 1 (the higher the better), where a value of 0.5 reflects a model no better than flipping a coin. It is also worth noting that the c-index estimated from time-to-event data is influenced heavily by the length of follow-up [29]. Calibration is the agreement between observed outcomes and predictions and can be assessed visually by a calibration plot, plotting observed outcome proportions against predicted risks, overall or within groups defined by key prognostic predictors (eg, age) [30]. Calibration is also assessed numerically by the calibration slope or calibration-in-the-large [28••]. Discrimination and calibration are joint properties of a prediction model and both aspects should be evaluated; yet existing systematic reviews all conclude that calibration is frequently not assessed [9•, 31]. In the absence of a calibration assessment, it is unknown how accurate the model's predictions are.
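
For a binary outcome, all three numerical measures are simple to compute. The sketch below is an illustration only (assuming observed 0/1 outcomes y and predicted risks p, strictly between 0 and 1, from an existing model), using scikit-learn for the AUROC and statsmodels for the calibration slope and calibration-in-the-large; a well-calibrated model has a slope near 1 and a calibration-in-the-large near 0.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.metrics import roc_auc_score

    def performance(y, p):
        # y: observed 0/1 outcomes; p: predicted risks from the model under evaluation
        lp = np.log(p / (1 - p))                 # logit of the predicted risks

        auc = roc_auc_score(y, p)                # discrimination (c-index)

        # Calibration slope: coefficient from regressing outcomes on the logit
        slope = sm.GLM(y, sm.add_constant(lp),
                       family=sm.families.Binomial()).fit().params[1]

        # Calibration-in-the-large: intercept with the logit fixed as an offset
        citl = sm.GLM(y, np.ones_like(lp), family=sm.families.Binomial(),
                      offset=lp).fit().params[0]
        return auc, slope, citl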

Both discrimination and calibration are statistical properties of a prediction model; neither captures clinical usefulness. Decision-analytic methods have recently been suggested to address clinical usefulness [32]. The approach evaluates the model over the whole range of clinically relevant risk thresholds for designating an individual as high risk (and hence a candidate for treatment), accounting for the relative harms of false-positive and false-negative results. Competing (multiple) prediction models can easily be compared, and also compared against a strategy of treating no one (ie, is the model doing any harm?) and of treating everyone (ie, assuming everyone is at high risk).
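
The central quantity in this decision-analytic approach is the net benefit at a given risk threshold, which weighs true positives against false positives using the odds of the threshold itself. The minimal sketch below computes it for a model and for the treat-all strategy (treat-none has net benefit 0 by definition); the threshold range shown is purely illustrative, not a clinical recommendation.

    import numpy as np

    def net_benefit(y, p, thresholds):
        # Net benefit = TP/n - FP/n * t/(1-t) at each risk threshold t
        y, p = np.asarray(y), np.asarray(p)
        n = len(y)
        nb = []
        for t in thresholds:
            treat = p >= t                       # designated high risk at threshold t
            tp = np.sum(treat & (y == 1))
            fp = np.sum(treat & (y == 0))
            nb.append(tp / n - fp / n * t / (1 - t))
        return np.array(nb)

    thresholds = np.linspace(0.05, 0.30, 26)     # illustrative threshold range
    # nb_model = net_benefit(y, p, thresholds)
    # nb_all   = net_benefit(y, np.ones_like(p), thresholds)  # "treat all"
    # "treat none" is 0 everywhere; the model is useful where it beats both.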

The specific impact of each variable included in a multivariable model can be misinterpreted, since risk estimates for dichotomous variables cannot readily be compared with estimates derived from continuous variables, which are often expressed as a relative risk per 1 standard deviation increase. The relevant comparison should instead be the risk change per 2 standard deviation increase of the exposure [33]. In addition, threshold effects and non-linear associations are rarely considered when developing multivariable risk prediction models.
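
The rescaling is simple arithmetic on the log scale, as the illustrative calculation below shows (both relative risks are hypothetical numbers chosen for illustration):

    import numpy as np

    rr_per_sd = 1.5                  # hypothetical RR per 1 SD of a continuous predictor
    beta = np.log(rr_per_sd)         # log relative risk per 1 SD
    rr_per_2sd = np.exp(2 * beta)    # = 1.5**2 = 2.25
    rr_binary = 2.1                  # hypothetical RR for a yes/no risk factor
    # Comparing 2.1 with 2.25 (per 2 SD), rather than with 1.5 (per 1 SD),
    # puts the dichotomous and continuous predictors on a comparable footing.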

Study Design

One of the most important aspects of model development and, more importantly, of validation studies is study design and data quality. While study design in terms of minimizing the potential for overfitting is well understood in model development [34-37], less is known about the design requirements for validation studies. Although no firm guidance exists on sample size requirements for validation studies, empirical simulations have found that a minimum of 100 events (ie, fractures) is required to validate a prediction model based on logistic regression [38]. There is no evidence to suggest that the sample size requirements for externally validating models based on survival outcomes would be any different. It is worth stressing that the effective sample size in prediction model development and validation studies is driven not by the number of patients in the cohort, but by the number of events (ie, fractures).

Missing data are omnipresent in medical research, and systematic reviews of prediction models have frequently identified the handling and treatment of missing data as a major concern [9•, 39]. Omitting patients with missing values, thereby conducting a so-called complete-case analysis, can lead to inaccurate and biased estimates of performance [40-47]. Multiple imputation is regarded and advocated as a viable and methodologically superior approach to handling missing data, as it makes weaker assumptions about the missing-data mechanism than a complete-case analysis [42, 46, 48].
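
As a deliberately minimal sketch of the imputation step only (fitting the model to each completed dataset and pooling the m sets of estimates with Rubin's rules are the remaining, equally important, steps), chained-equations-style imputation is available in standard software; the example below uses scikit-learn's experimental IterativeImputer on a predictor matrix X containing missing values.

    # IterativeImputer is experimental and must be enabled explicitly
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def impute_m_times(X, m=10, seed=0):
        # Return m completed copies of X; sample_posterior=True draws each
        # imputation from the posterior predictive, as multiple imputation requires
        return [IterativeImputer(random_state=seed + i, sample_posterior=True)
                    .fit_transform(X)
                for i in range(m)]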

FRAX

Developed under the umbrella of the World Health Organization, the risk assessment tool known as FRAX (http://www.shef.ac.uk/FRAX) has revolutionized the way in which clinicians identify individuals at increased risk of major osteoporotic or hip fracture over the next 10 years [12, 49••]. As of November 2011, 45 FRAX calculators were available for use in 40 countries [2]. The popularity and uptake of FRAX has been astonishing, with over 300 publications listed on PubMed (accessed March 27, 2012) since its introduction in 2007. Furthermore, FRAX is now embedded in numerous clinical guidelines, including those of the National Osteoporosis Foundation [50], the National Osteoporosis Guideline Group [51], and Osteoporosis Canada [52], and is currently (as of March 2012) under consideration in the United Kingdom by the National Institute for Health and Clinical Excellence (http://guidance.nice.org.uk/CG/Wave25/2). See Table 1 for a list of the risk factors included in FRAX.

Table 1 Risk factors in FRAX, QFracture, and Garvan risk calculators

However, in an era of reproducible research [53•, 54] and transparent reporting [55-57], it is disappointing that FRAX has to date failed to deliver on either [58••]. Furthermore, methodological weaknesses and complexities in both the original derivation and validation of FRAX have largely been ignored, and its scientific validity has not come under rigorous methodological and statistical scrutiny. Many subsequent studies evaluating FRAX suffer similar shortcomings, primarily owing to poor design or to replicating the methodology of the original development and validation of FRAX. Many of these studies have also included the FRAX developers, so very few independent and methodologically robust evaluations of the predictive performance of FRAX exist. It is also disappointing that the FRAX developers have often responded negatively to criticisms of FRAX, often with flawed arguments [59].

FRAX may have great potential to assist clinical decision making and ultimately to improve patient outcomes; however, it is debatable whether the scientific community has been provided with sufficient evidence to support the widespread use of FRAX. This is compounded by the problem that FRAX is largely a black box: the underlying equations have never been published or placed in the public domain, and it seems unlikely they will be published in the near future. It is unclear whether commercial and licensing issues lie behind the decision not to place the equations in the public domain, but such secrecy makes it impossible for independent investigators to appropriately evaluate and critique the prediction model [60]. In addition, the methodology behind FRAX has been reported disjointedly and opaquely, making it difficult for researchers to objectively evaluate and critique. As previously mentioned, a recent systematic review of prediction models for fracture risk was also forced to omit FRAX because insufficient information has been reported to enable an objective evaluation [58••].

FRAX is frequently cited as having been validated in 11 independent cohorts [12, 49••]; however, there are non-ignorable design and methodological issues that are worth elucidating to refute or dampen such claims. Of the 11 cohorts, only one included men (the Miyama cohort, 180 men); the other 10 cohorts, all substantially larger (ranging from 1173 to 135,695 participants) than the Miyama cohort, included only women. As discussed earlier, unreliable and misleading performance data are frequently observed in cohorts with fewer than 100 events; for hip fracture, 7 of the 11 cohorts had fewer than 100 events. Inflated and optimistic performance data are frequently observed in cohorts comprising small numbers of events [61]. Equally disconcerting, however, is the handling of missing data. Missing data were present in the cohorts used to derive FRAX, with many cohorts not having collected specific risk factors, and how this was handled is unclear. In the validation of FRAX, six cohorts did not record one or more of the risk factors required to calculate FRAX, and it is unclear how much body mass index data were missing across the 11 cohorts. In particular, one cohort (the PERF cohort) did not record information on five of the risk factors needed to calculate FRAX: family history (either parent with a history of hip fracture), current use of oral glucocorticoids, smoking status, alcohol consumption, and a diagnosis of rheumatoid arthritis. Only six cohorts collected information on alcohol consumption. Missing risk factors were handled by setting the corresponding beta-coefficients in the FRAX model to zero (ie, omitting the predictor from the model) and calculating the score from the available risk factors. Such practice is not only highly flawed and misleading, but the results from such analyses do not constitute valid performance data for FRAX and certainly should not be used to claim that FRAX has been validated.

Despite these methodological shortcomings, the discrimination in each of the 11 validation cohorts was disappointingly low. For example, for the FRAX model (without bone mineral density [BMD]) predicting major osteoporotic fractures, the area under the receiver operating characteristic curve (AUROC) ranged from 0.54 to 0.81 (the 0.81 clearly an outlier from the small Miyama cohort), with an AUROC ≤ 0.6 in 9 of the 11 cohorts. Similarly, for the FRAX model with BMD, the AUROC ranged from 0.55 to 0.77 (the 0.77 again from the small Miyama cohort), with an AUROC ≤ 0.6 in 5 of the 9 cohorts that had BMD data. Whether such unimpressive performance data support the validation of FRAX is highly debatable. For the FRAX models predicting hip fracture, the performance data were marginally higher, notably for the model that includes BMD; but for the model without BMD, the data are hardly stunning enough to support such rapid and unequivocal acceptance. The reasons for the poor performance are unclear, but contributing factors, acknowledged by the FRAX developers, will undoubtedly include the omission of previous falls (number, severity, or type), frailty, and a history of cardiac disease or stroke as predictors in the model [62-64]. Failure to take into account dose–response relationships (ie, the dose and duration of glucocorticoid use, alcohol consumption, and smoking) will also likely weaken the predictive accuracy of FRAX [62, 63].

As described earlier, the characteristics of a model's predictive accuracy include both discrimination and calibration. While the discrimination of FRAX in the 11 cohorts was evaluated by calculating the AUROC, the calibration of FRAX was not evaluated, yet calibration, which describes the agreement between observed and predicted risks, is a crucial component. In the absence of an assessment of calibration, the accuracy of the predictions from FRAX is unknown. We also note that very few subsequent publications on FRAX have evaluated calibration [65].

QFracture

QFracture is a prediction tool for the 10-year risk of osteoporotic fracture and the 10-year risk of hip fracture [66•]. It was developed and validated in the United Kingdom on a large cohort of general practice patients (3.6 million), contributing 50,755 osteoporotic fractures (hip, vertebral, and distal radius fractures) and 19,531 hip fractures over 25 million person-years of observation. See Table 1 for a list of the risk factors included in the model. Subsequent external validation by independent investigators on a separate but large (2.2 million patients) cohort, contributing 25,208 osteoporotic fractures and 12,188 hip fractures, demonstrated very good performance of QFracture [60], despite the fact that the model does not include BMD. AUROC values were 0.89 (women) and 0.86 (men) for the model predicting the risk of hip fracture, and 0.82 (women) and 0.74 (men) for the model predicting the risk of osteoporotic fracture. Independent evaluation also showed QFracture to be well calibrated, which, taken together with the high discrimination values, suggests good predictive accuracy.

However, unlike FRAX, very few other studies have examined the usefulness of QFracture [67]. Nevertheless, the development, internal validation [66•], and external validation [60] have to date used 5.9 million patients, contributing 38 million person-years of observation, 75,963 new cases of osteoporotic fracture, and 31,719 new hip fractures, with good predictive accuracy demonstrated. While the actual prediction models have not been published, open source software has been made freely available by the developers (http://www.qfracture.org). Since the writing of this paper, QFracture has been updated to include a number of additional predictors: ethnicity, previous fracture, use of antidepressants, chronic obstructive pulmonary disease, epilepsy, use of anticonvulsants, dementia, Parkinson's disease, cancer, systemic lupus erythematosus, chronic renal disease, type 1 diabetes, and care or nursing home residence [68].

Garvan Calculator

The Garvan fracture risk prediction model was developed in Australia using the Dubbo cohort, comprising 1208 women and 740 men aged 60 years or older [69•, 70]. The risk algorithm includes only five variables: age, sex, number of previous fractures after 50 years of age, number of falls during the last year, and BMD. External validation of the Garvan calculator has displayed moderate to good discriminative ability [71], and there are indications of equal or better performance with the Garvan calculator compared with FRAX [71-74]. Unlike FRAX, the equations of the Garvan model are freely available [71].

Conclusions

In an era of evidence-based medicine, reproducibility, and transparency, it is disappointing that FRAX has penetrated the medical community to the level it has in the absence of methodologically sound and transparent evidence. Arguably, the results of the flawed validation that accompanied the development of FRAX, branded as an international validation, have misled potential users by falsely claiming the model was well validated. Subsequent studies using FRAX have mis-cited this validation [62] as providing satisfactory or good evidence to support FRAX, elevating the model to the status of a gold-standard fracture risk assessment tool [75]. Results from the SCOOP (Screening of Older Women for Prevention of Fracture) randomized controlled trial (http://www.scoopstudy.ac.uk/) will provide excellent data on the effectiveness and cost-effectiveness of FRAX-based screening of 11,580 older women in the United Kingdom [76]. Such impact studies of prediction models are rare, yet they are a vital component in determining clinical usefulness. Interestingly, bisphosphonates [77-79] and denosumab [80] have only been proven efficacious for clinical fracture prevention in those with osteoporosis, and not in those with higher BMD values. Thus, the off-label use presently recommended by current guidelines partially based on FRAX [81-83] might not benefit the patient and might theoretically even cause net harm [84-86]. Importantly, in large US cohort studies, more than 80 % of women who suffer a fracture after the age of 65 years do not have osteoporosis [87, 88].

Studies evaluating prediction models must, as a minimum, assess and report both discrimination and calibration; merely stating that FRAX has been calibrated to a particular country's incidence and mortality rates does not imply the model is calibrated. The process of adjusting FRAX to a country's incidence rates is unclear, as it has to date not been adequately reported, but regardless of the method, the approach can be viewed as a form of model updating or recalibration [18, 27, 89-92]. Such updating is unfortunately rarely done for prediction models in other medical areas. Models that are not calibrated in certain populations often get discarded, and needless new prediction models are derived when updating or recalibration would suffice. However, it is prudent, and recommended, that any model that is recalibrated to another setting or country undergo extensive, and preferably independent, validation to assess its predictive accuracy. In addition, official country-, sex-, and age-specific fracture incidence rates based on register data are used to provide 10-year fracture probabilities for each country. The validity of the underlying register-based fracture incidences is uncertain: in Sweden, a country well known for complete national registers [93], fracture incidence rates are overestimated when standard official data are used [94] (ie, prevalent and incident cases are admixed).
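
As an illustration of what such updating involves (FRAX itself is a time-to-event model, so the binary-outcome sketch below is an analogue rather than the actual FRAX procedure), recalibration re-estimates the intercept and slope of an existing model's linear predictor in the new population, leaving the relative weights of the predictors untouched:

    import numpy as np
    import statsmodels.api as sm

    def recalibrate(y_local, p_original):
        # y_local: observed 0/1 outcomes in the new setting
        # p_original: risks predicted by the existing (unchanged) model
        lp = np.log(p_original / (1 - p_original))   # original linear predictor
        fit = sm.GLM(y_local, sm.add_constant(lp),
                     family=sm.families.Binomial()).fit()
        a, b = fit.params                            # updated intercept and slope
        return lambda p: 1 / (1 + np.exp(-(a + b * np.log(p / (1 - p)))))

    # Calibration-in-the-large, the simplest form of updating, would instead
    # fix b = 1 and re-estimate only the intercept a.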

A number of guidelines exist in the medical literature for the reporting of randomized controlled trials [95], diagnostic accuracy studies [96], systematic reviews and meta-analyses [97], cohort and case–control studies [98], and tumor marker prognostic studies [99]. Yet there are currently no consensus-based guidelines for the conduct or reporting of studies developing and evaluating multivariable risk prediction models, though a recent collaborative initiative has started to develop such reporting guidelines [100]. It is envisaged that such reporting guidelines will improve the quality and clarity of studies developing or validating prediction models. In the meantime, the developers of FRAX are urged to provide a single concise document describing the full steps of the derivation of FRAX and to place the underlying equations in the public domain, to enable independent investigators to critically evaluate the tool using prospectively collected data and appropriate statistical methods.