Introduction

The most important issue in the management of osteoporotic subjects is fracture prediction. Osteoporosis commonly has no clear clinical symptoms and is therefore called a “silent bone thief.” The large number of affected patients indicates that osteoporosis is an epidemic in modern societies. Osteoporotic fractures are the essential symptom of bone loss, and they are usually the consequence of falls from a standing height. Fractures may occur at several skeletal sites; the most common are fractures of the hip, spine, forearm, and arm, which are called “major osteoporotic fractures.” A prior fragility fracture is an important risk factor for subsequent fracture [1]. Therefore, the primary aim in patient management is to avoid the first fracture. Osteoporosis is usually not recognized at early stages of the disease, so the ability to establish fracture risk and predict future fractures is crucial. In recent years, several methods designed to assess fracture risk have been developed, among them FRAX [2], Garvan [3, 4], and POL-RISK [5, 6]. Garvan and POL-RISK estimate fracture risk, whereas FRAX establishes fracture probability limited by life expectancy. For practitioners, measurements of fracture risk help identify subjects requiring pharmacologic therapy. Some studies have presented the results of fracture prediction in various populations [7,8,9]. Early assessment of fracture risk is important for adequate daily patient management aimed at avoiding the first fracture. Several authors have discussed the problem of fracture risk assessment and prediction [10,11,12]. In other studies, comparisons between the FRAX and Garvan methods were presented [13,14,15,16,17,18,19,20,21,22,23]. Such data on fracture prediction are helpful for practitioners in daily work with patients.

The aim of the current study was to compare fracture prediction established by FRAX, Garvan, and POL-RISK in the previously described cohort of the GO Study [24, 25]. The secondary goal was to establish an optimal threshold of fracture risk for initiation of pharmacologic therapy.

Material

The study group comprised a cohort of postmenopausal women from the GO Study [24, 25]. Briefly, the study sample consisted of postmenopausal women recruited in the Outpatient Osteoporotic Clinic in southern Poland (Gliwice, Upper Silesia). Data on clinical risk factors for osteoporosis and fractures were collected in all subjects. Skeletal status was assessed at the femoral neck using the Prodigy device (GE, USA). The clinical characteristics of the studied subjects are presented in Table 1. Information on fracture incidence over a period of 10 years was gathered. Either a review of patients’ charts or phone calls allowed us to identify individuals who had experienced a fracture during the previous 10 years. All data were collected by one investigator (WP).

Table 1 Clinical characteristics of the whole study group and subgroups with and without fracture in follow-up and the results of baseline fracture risk assessment

During the 10-year period of observation, 72 osteoporotic fractures occurred in 63 subjects. These fractures were located at the following skeletal sites: forearm (20), spine (17), hip (13), ankle (9), arm (7), and other sites (6). Single fractures were recorded in 56 women, whereas 7 women reported multiple (2 or 3) fractures.

Methods

Baseline fracture risk was established using the FRAX, Garvan, and POL-RISK algorithms. For FRAX, fracture risk was expressed as fracture probability limited by life expectancy.

Because POL-RISK does not estimate the risk of hip fracture, only predictions of any fracture (Garvan, POL-RISK) and of major fractures (FRAX) are presented.

For a preliminary comparison of the predictive value of the analyzed diagnostic tools, the fracture risk threshold of 10% was used, according to local recommendations [26].

Proximal femur bone densitometry was performed using the Prodigy device (GE Lunar). Calculations of fracture risk were based on femoral neck (FN) BMD. Based on repeated measurements, the precision (CV%) of DXA measurements at the FN was established at 1.6%.

All measurements were done by one experienced DXA technician.

Statistics

All statistical calculations were performed using Statistica 13.3 software (StatSoft, Tulsa, OK, USA) and PQStat v.1.8.2.238 (PQStat Software, Plewiska, Poland; https://pqstat.pl). Means and standard deviations were used as descriptive statistics for continuous variables. The prevalence of qualitative features was presented as the number of subjects with percentage values. Sensitivity, specificity, and balanced accuracy (BAcc) were used to compare the predictive power of the fracture risk algorithms. BAcc was chosen instead of accuracy (Acc) because the analyzed dataset was highly imbalanced. In medical decision-making, data are usually highly imbalanced, i.e., high-risk patients are in the minority, whereas correct prediction of their disease is critical. This means that failing to recognize subjects of the minority class has much more serious consequences for patients and clinicians than a so-called “false alarm,” in which low-risk subjects of the majority class are assigned to the minority class [27]. Therefore, conventional accuracy, calculated as the percentage of cases correctly predicted by the method, is not only skewed by the imbalance but also inappropriate. Hence, balanced accuracy (BAcc), which is the average of sensitivity and specificity, was applied to assess the quality of the analyzed tools. To verify the prediction accuracy of the analyzed diagnostic tools, receiver operating characteristic (ROC) curves were studied, and the area under the curve (AUC) was calculated using the DeLong method. Alternative cut-offs separating high from low fracture risk were established from the ROC curves with the “distance from the top left corner” method. Based on logistic regression analysis, calibration plots were prepared for all three calculators at both compared cut-off thresholds: the standard one and the one determined by ROC analysis.
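For reference, the measures named above can be written in terms of the numbers of true positives (TP), false negatives (FN, not to be confused with the femoral neck abbreviation used elsewhere), false positives (FP), and true negatives (TN). These are the standard textbook definitions. The expression for the distance d(c) at a cut-off c is the usual form of the “distance from the top left corner” criterion and is an assumption here, because the exact formula used by the statistical software is not stated.

```latex
\begin{align}
\mathrm{Se} &= \frac{TP}{TP + FN}, &
\mathrm{Sp} &= \frac{TN}{TN + FP}, \\
\mathrm{Acc} &= \frac{TP + TN}{TP + FN + FP + TN}, &
\mathrm{BAcc} &= \frac{\mathrm{Se} + \mathrm{Sp}}{2}, \\
d(c) &= \sqrt{\bigl(1 - \mathrm{Se}(c)\bigr)^{2} + \bigl(1 - \mathrm{Sp}(c)\bigr)^{2}}, &
c^{*} &= \arg\min_{c} d(c).
\end{align}
```

As an illustration of the imbalance problem, a rule that labels every subject as low-risk in a cohort of 63 fractured and 394 non-fractured women would reach Acc = 394/457 ≈ 0.86 while BAcc would be only 0.5.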

A p-value at a level of 0.05 was regarded as statistically significant.

The statistical power of the test, determined post hoc for an assumed significance level (α) of 0.05, a sample size of 457 subjects (divided into two subgroups of 63 and 394), and an effect size between subgroups of 0.8 (calculated using Cohen’s method), was 0.99.
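The reported power can be checked approximately. The sketch below assumes a two-sided comparison of two group means and uses the normal approximation to the power function. The source does not state which test the power calculation refers to, so the choice of test and the function name `two_sample_power` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison of means.

    Normal approximation (assumption): the test statistic is treated as
    normal with mean effect_size * sqrt(n1 * n2 / (n1 + n2)) and unit variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect_size * sqrt(n1 * n2 / (n1 + n2))
    # Probability of exceeding the critical value in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Values taken from the text: Cohen's d = 0.8, subgroups of 63 and 394, alpha = 0.05.
power = two_sample_power(0.8, 63, 394)
print(f"approximate power: {power:.4f}")
```

Under these assumptions the approximate power exceeds 0.99, which is consistent with the value reported above.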

Results

Table 1 presents clinical characteristics, FN T-score values, and fracture risk calculated for major (FRAX) or any (Garvan and POL-RISK) fractures in all patients and subgroups.

As one might expect, age was significantly higher and the T-score was significantly lower in women in each fractured subgroup (p < 0.05). To compare the predictive value of the analyzed calculators in relation to fractures observed during the 10-year observation, all subjects were first categorized as having low or high fracture probability/risk of major/any fracture according to the threshold of 10%. Table 2 presents the number of subjects classified in the high-risk category and the number of fractures observed during the follow-up period.

Table 2 The number of subjects classified in the high-risk category (fracture risk/probability > 10%) and the number of fractures observed during the follow-up period

For FRAX, a fracture probability exceeding 10% was observed in only 11 of the subjects who sustained fractures; thus, fracture was correctly predicted in only 22.9% of women with major fractures (11 of 48 patients). For Garvan, the respective value was 90.5% (57 of 63 patients with any fracture), and for POL-RISK it was 98.4% (62 of 63 patients with any fracture). Therefore, FRAX showed a significantly lower proportion of true positive (TP) results than Garvan and POL-RISK. This means that, based on the FRAX method, only 22.9% of patients who should receive therapy would be correctly identified to start treatment. Based on the Garvan algorithm, only 9.5% (6 of 63 fractured patients) would not be recommended to start treatment, and in the case of POL-RISK, only one patient (1.6% of fractured patients) would miss treatment.

On the other hand, the Garvan and POL-RISK calculators predicted many more fractures than were actually observed, which means that those algorithms yield a high proportion of false positive (FP) results. According to Garvan, 315 patients, and according to POL-RISK, as many as 368 patients were identified as high-risk subjects but did not experience a fracture during the follow-up period. One can say that those patients would receive unnecessary treatment. For FRAX, the number of such patients was clearly lower: only 29. Table 3 presents the sensitivity, specificity, and balanced accuracy (BAcc) of the three methods.
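The counts quoted in the text are enough to recompute these measures by hand. The sketch below does so under stated assumptions: the cohort size of 457 (63 fractured, 394 non-fractured) is taken from the Statistics section, and for FRAX it is assumed that all 457 women were assessed, so that 409 had no major fracture. The class name `Confusion` is illustrative, and the results are derived values that may differ slightly from those in Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # classified high-risk, fractured during follow-up
    fn: int  # classified low-risk, fractured during follow-up
    fp: int  # classified high-risk, no fracture
    tn: int  # classified low-risk, no fracture

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def balanced_accuracy(self) -> float:
        return (self.sensitivity + self.specificity) / 2


# TP, FN, and FP come from the text; TN is derived by subtraction.
garvan = Confusion(tp=57, fn=6, fp=315, tn=394 - 315)
pol_risk = Confusion(tp=62, fn=1, fp=368, tn=394 - 368)
# Assumption: 457 women assessed with FRAX, 48 of them with major fractures.
frax = Confusion(tp=11, fn=48 - 11, fp=29, tn=457 - 48 - 29)

for name, c in [("FRAX", frax), ("Garvan", garvan), ("POL-RISK", pol_risk)]:
    print(f"{name:9s} Se={c.sensitivity:.3f} Sp={c.specificity:.3f} "
          f"BAcc={c.balanced_accuracy:.3f}")
```

At the 10% cut-off, the two any-fracture calculators trade very high sensitivity for very low specificity, whereas FRAX does the opposite, so all three balanced accuracies stay only slightly above 0.5.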

Table 3 Sensitivity, specificity (with 95% CI), and balanced accuracy (BAcc) for FRAX, Garvan, and POL-RISK calculated for “standard” cut-off of 10%

Such a marked discrepancy between sensitivity and specificity clearly shows that the “standard” cut-off of 10% does not achieve the optimal predictive value of the compared tools. According to [19], an acceptable model should have both specificity and sensitivity of 50–79%. ROC analyses were therefore performed to establish a separate cut-off for each calculator that would improve the accuracy of fracture prediction. The resulting ROC curves for each diagnostic tool are presented in Fig. 1.

Fig. 1 Analysis of receiver operating characteristic (ROC) curves for FRAX, Garvan, and POL-RISK; null hypothesis: true area = 0.5; the numbers in parentheses indicate specificity and sensitivity values for the proposed cut-off

The cut-off thresholds were determined by measuring the distance from the top left corner of the ROC plot. The following fracture risk thresholds, which correspond to balanced values of sensitivity and specificity and give optimal diagnostic accuracy, were obtained: 6.3% for FRAX (major fracture), 20.0% for Garvan (any fracture), and 18.0% for POL-RISK (any fracture). Table 4 shows the sensitivity, specificity, and balanced accuracy (BAcc) values after applying these cut-offs. For all tested diagnostic tools, the BAcc value increased compared with the values obtained at the previously tested threshold (Table 3).
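The cut-off selection can be sketched in a few lines. The example below is a minimal illustration and not the procedure implemented in the statistical packages used in the study: it assumes that a subject is classified as high-risk when the estimated risk is at or above the candidate cut-off, that every observed risk value is a candidate, and that the criterion is the Euclidean distance to the point (specificity = 1, sensitivity = 1). The function name, the `CutOff` type, and the data are hypothetical.

```python
from math import hypot
from typing import NamedTuple, Sequence


class CutOff(NamedTuple):
    threshold: float
    sensitivity: float
    specificity: float
    distance: float  # distance from the top left corner of the ROC plot


def closest_to_top_left(scores: Sequence[float], fractured: Sequence[int]) -> CutOff:
    """Return the candidate cut-off whose ROC point lies nearest to (0, 1)."""
    positives = sum(fractured)
    negatives = len(fractured) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes must be present")
    candidates = []
    for t in sorted(set(scores)):
        # Subjects with a score at or above t are classified as high-risk.
        tp = sum(1 for s, y in zip(scores, fractured) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, fractured) if y == 0 and s >= t)
        se = tp / positives
        sp = 1 - fp / negatives
        candidates.append(CutOff(t, se, sp, hypot(1 - se, 1 - sp)))
    # On ties, min() keeps the lowest threshold because candidates are sorted.
    return min(candidates, key=lambda c: c.distance)


# Hypothetical 10-year risk estimates (%) and outcomes (1 = fracture); not study data.
risk = [2, 4, 5, 7, 9, 10, 12, 15, 20]
outcome = [0, 0, 0, 1, 0, 0, 1, 1, 1]
best = closest_to_top_left(risk, outcome)
print(best)
```

Because the criterion weights sensitivity and specificity equally, it tends to select cut-offs at which the two are roughly balanced, which matches the description of the thresholds above.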

Table 4 Sensitivity, specificity (with 95% CI), and balanced accuracy (BAcc) for FRAX, Garvan, and POL-RISK calculated for the newly determined cut-offs

Hence, the presented analyses confirm the validity of using a separate decision threshold for each of the compared calculators.

The relationship between observed fractures and the estimated number of subjects predicted to have fractures based on the “new” cut-offs is presented in Table 5.

Table 5 The number of subjects classified in the high-risk category with respect to the newly determined cut-off values and the number of observed fractures during the follow-up period

In comparison with the data in Table 2, FRAX achieves a clear increase in correctly predicted fractures. For the Garvan method and the POL-RISK algorithm, there is a significant reduction in the number of high-risk subjects, with an acceptable decrease in the proportion of correctly predicted fractures (similar to the prediction achieved by the FRAX algorithm).

Finally, to support the comprehensive presentation of differences in fracture risk assessment provided by standard and newly determined cut-off thresholds, calibration plots based on logistic regression analysis for all three analyzed calculators were prepared. They are presented in Fig. 2. It can be noticed that for all diagnostic tools, the prediction curves are much closer to the “ideal line” in the case of thresholds calculated in the current study by ROC analyses in comparison to “standard” cut-off values. This additionally supports the validity of determining cut-off thresholds in the local population rather than the universal use of standard thresholds.

Fig. 2 Calibration plots (based on logistic regression analysis) for fracture prediction in analyzed calculators at both compared cut-off thresholds: the standard one and the one determined in ROC analysis

Discussion

The problem of accurate fracture prediction is essential in osteoporotic patients. Usually, osteoporosis has no clear clinical signs, and the first symptom is a fracture of typical low-trauma origin. The first fracture significantly increases the risk of subsequent fracture; therefore, therapy should be initiated before it occurs. Because of the enormous number of individuals with osteoporosis, objective medical as well as economic considerations do not allow therapy to be recommended for all patients. For practitioners, the most important point is to identify subjects at high risk. Methods designed to establish fracture risk (or probability) should potentially be helpful. In the current study, we presented the results of fracture prediction established by three different methods. Significant differences were noted in the identification of subjects at high fracture risk, i.e., those who should be treated because of incident fracture. At the threshold of 10%, FRAX classified only 23% of patients with incident fractures as requiring therapy, whereas with POL-RISK and Garvan the majority of such patients would be treated (98% and 90%, respectively). This observation indicates that with FRAX, the majority of high-risk patients would not be identified, and therapy would not be started. However, the good TP values for POL-RISK and Garvan were accompanied by a great number of subjects with FP classification. Therefore, many patients would be treated unnecessarily. The presented results suggest that a new threshold of fracture risk should be proposed in order to recommend treatment in the majority of high-risk patients. The second important point is to avoid treatment in patients at low risk. We consider that the threshold of 18% established in the current study for the POL-RISK calculator might be used in daily practice.

This threshold is the most important finding of the study. With acceptable sensitivity, specificity, and accuracy, therapy may be started in high-risk patients, and the number of patients treated unnecessarily is much lower.

One should also take into consideration that each recommendation should be based not only on purely medical grounds but also on economic aspects. The overall cost to the health system, including all components (medication reimbursement, surgery, rehabilitation, and many others), should inform the threshold of high risk used in daily practice. Such considerations are pointed out by the authors of the Garvan algorithm (www.fractureriskcalculator.com). For example, according to these recommendations, a risk of any fracture exceeding 26% permits the use of reimbursed medications, whereas a risk below 14% does not. Patients in the range between 14 and 26% should be assessed individually. One may consider that the purely medical threshold is 14% and the economic threshold is 26%. Our threshold of 18% for POL-RISK should be treated as a medical one, and further analysis to identify a threshold fitted to economic aspects is necessary. In the near future, we plan to perform a study to establish the optimal fracture risk threshold for use in daily practice.

Irrespective of the threshold used (either 10 or 18%), some patients would not be adequately identified. Some would not be treated, and others would be treated unnecessarily. Such observations suggest that it is not possible to create a tool that would replace a physician’s judgment, and individualization of the decision to initiate treatment will always be necessary.

A similar need to differentiate the threshold for low and high risk of future fractures was also demonstrated in a recently published study based on analyses in an epidemiological sample of postmenopausal women from the RAC-OST-POL Study [28]. In that study, the optimal threshold for prediction of major fractures using FRAX was 6.0%, which is very close to the value obtained in the current analyses (6.3%). The optimal threshold for any fracture in the Garvan algorithm established in the RAC-OST-POL cohort was 14.4%, which is lower than the value calculated in the current study. The difference may be explained by the specific character of the two study cohorts. The RAC-OST-POL sample is population-representative and, therefore, also includes healthy people. The currently presented group was recruited in an osteoporosis outpatient clinic, which may result in a smaller proportion of healthy people or of those with risk factors present only to a minimal extent. The POL-RISK algorithm was not taken into consideration in the cited study.

The results presented by other authors should also be discussed. In the study by Donaldson et al. [7], the authors compared various methods of classifying patients as candidates for treatment. Some patients at high risk were always missed, and others would be offered unnecessary treatment. An interesting comparison of the utility of absolute risk prediction using the FRAX and Garvan methods was presented in another study [16]. Only 8.9% of women who sustained a fracture had an estimated fracture risk ≥ 20% using FRAX, compared with 53.3% using Garvan. The authors compared AUC, sensitivity, specificity, false positives, false negatives, and accuracy for thresholds of 10, 20, and 30%. The accuracy was always highest for the threshold of 30%, but the exact threshold recommended for daily use was not given. Other authors compared FRAX and Garvan in a large cohort of women from the Women’s Health Initiative Study [18]. The final conclusion was that there is no useful threshold for either method. In the Belgian study, with the use of the threshold of 20%, only a small proportion of high-risk patients would be treated according to the results given by FRAX and Garvan, and a slightly better classification was achieved by Garvan [20]. Overall, the data given by other authors suggest that there is no simple tool that can accurately identify a high-risk person and at the same time recommend treatment. In some patients, the fracture risk is always either over- or underestimated. In our opinion, the numerous studies on the use of fracture risk assessment tools should be completed with clear recommendations for practitioners. Measures that describe the performance of the available methods, such as AUC, accuracy, sensitivity, or specificity, are not sufficiently helpful in daily practice with patients. We believe that a threshold of fracture risk for treatment initiation is an essential condition for adequate patient management. Indicating a certain threshold can be very helpful, but one should not forget about an individual approach to the patient. The established threshold should be treated as a guideline rather than an immutable value.

The study has some limitations. We observed only women, and the sample size and number of fractures could have been greater. POL-RISK does not establish a separate risk for hip fractures, so only the risk of any fracture was established. One should also remember that the group studied was not an epidemiological sample.

The current study showed that different fracture risk assessment tools, although having similar clinical purposes, require different cut-off thresholds for making therapeutic decisions. Adjusting such thresholds separately for each calculator improves their diagnostic sensitivity and specificity. Better identification of patients requiring therapy based on such an approach may help reduce the number of new fractures.