Background

Rationale for the systematic reviews

There is no international consensus on the recommended approach to screening and subsequent treatment to prevent fragility fractures [1]. Screening has traditionally focused on measuring bone mineral density (BMD) with intervention in those with low bone mass, often referred to as osteoporosis [2]. More recent evidence suggests that fracture risk prediction may be improved by instead considering an array of clinical risk factors, alone or in addition to BMD, which may be incorporated into risk prediction tools to estimate the absolute short- to mid-term risk of fracture [2].

The 2010 Osteoporosis Canada screening strategy (presence of any of various clinical risk factors) has low sensitivity in identifying females aged 50 to 64 years for BMD testing who later experience a major osteoporotic fracture [3]. In addition, the screening strategy has not been evaluated in a randomized controlled trial (RCT), indicating that updated screening and treatment algorithms that incorporate the most recent evidence are needed. Since 2018, three RCTs have been published that integrate a 2-step approach to screening to prevent fragility fractures (i.e., risk assessment followed by BMD measurement in those exceeding a certain risk threshold, but without shared decision-making) [4,5,6]. A systematic review published in 2020 [7], after we began this review, reported on the effects of screening from these three trials on fractures and all-cause mortality. The review had slightly different eligibility criteria than ours (thus two studies included in our review are not included), did not address overdiagnosis (defined later), and did not review additional aspects such as alternative screening strategies or patient perspectives related to recommendations about screening in primary care.

Because randomized trials on screening were not anticipated to evaluate all possible screening tools and outcomes (e.g., harms from the treatment provided to those at high risk), we have included reviews on these topics to determine whether certain screening tools may be interchangeable, and whether treatment harms may impact the main screening recommendation.

Description and burden of the condition

Fragility fractures are those that occur spontaneously during normal daily activities or secondary to minor incidents that would not normally result in a fracture in healthy adults [8]. Major independent risk factors for fragility fracture include low bone density, chronic use of certain medications (e.g., glucocorticoids), older age, female sex, low body weight, a personal or family history of fracture, a history of falls, smoking, higher levels of alcohol use, and living with type 2 diabetes and/or rheumatoid arthritis [9,10,11,12,13,14]. Advancing age, especially among postmenopausal females and older males [15], and menopausal status [16, 17] are strong predictors of fragility fracture, as is low bone density [18]. A reduction in bone mass and quality is a common consequence of the aging process.

Fragility fractures impose a substantial burden on societies worldwide [19]. By the year 2040, it has been projected that more than 319 million people globally will be considered to be at high risk of fragility fracture (based on the Fracture Risk Assessment Tool without incorporating BMD results [clinical FRAX]) [20]. In Canada in 2015/16, the incidence of hip fractures among people aged 65 to 69 years was 87 per 100,000 and increased steeply with advancing age to a rate of 1156 per 100,000 in 85 to 89-year-olds [21]. Fragility fractures, particularly hip and clinical vertebral fractures, can result in significant morbidity (e.g., decreased mobility, pain, reduced quality of life) and increase the risk of mortality in the 5 years post-fracture [22,23,24]. Fragility fractures have been noted to result in more hospitalized days than either stroke or myocardial infarction [25].

Screening for the primary prevention of fragility fractures

Screening in primary care aims to decrease the risk of future fragility fractures among those without a prior fracture, and to reduce fracture-related morbidity, mortality, and costs. Harms may be related to the screening test itself (e.g., minimal radiation exposure from dual-energy X-ray absorptiometry [DXA]) [26] or to the psychosocial consequences of being labelled “at risk” and the physical consequences of any harm from treatment [27, 28]. Such labelling may result from an inaccurate estimation of fracture risk (i.e., due to a risk prediction tool that is poorly calibrated), and/or from detection of excess risk among people who, had they not been screened, would never have known their risk nor experienced a fracture. Though considered by the Task Force to be the ideal approach, shared decision-making for screening and subsequent treatment may not be the standard of care across Canada; many primary care providers may instead screen all people without a prior fracture for risk (e.g., using available risk prediction tools and/or offer of BMD assessment) and consider patients eligible for treatment when screening places them within pre-specified thresholds of BMD or fracture risk. Shared decision-making during the clinical encounter would allow patients to make informed decisions about screening and treatment after weighing the possible benefits against the potential harms. Information from screening can then be used, along with patient preferences, to consider preventive treatment among those who consider themselves to be at a high fracture risk.

There is large variation in the screening approaches suggested by international guidelines, which often consider the population burden of fragility fractures and mortality, competing societal priorities, and resource availability [1, 29]. A variety of approaches may be used within a single screening program, with recommendations often differing by population group based on age, sex, or menopausal status [1, 29]. Common approaches include (a) a one-step direct to BMD approach (e.g., in females >65 years old in Canada [30] and the USA [31]); and (b) a 2-step approach incorporating the assessment of absolute fracture risk followed by BMD assessment in individuals exceeding a pre-defined threshold [29]. The findings of BMD assessment may then be used independently or incorporated into revised clinical risk scores. Clinical risk factors alone may be used to estimate risk in circumstances where BMD is unavailable, but this is not recommended by current North American guidelines [30,31,32,33,34,35]. There are at least 12 published fracture risk prediction tools available [36, 37]; however, not all tools are easily accessible to clinicians nor have all tools been calibrated for Canada or validated in populations outside of their derivation cohort, limiting their use [38].

Treatment thresholds vary considerably across countries [1, 29, 39]. A common threshold for treatment used in Canada [30, 40], the USA [41], and several other countries is a fixed 10-year major osteoporosis-related fracture probability ≥20% [39]. In some countries (not Canada), a 10-year hip fracture probability ≥3% may also be used [39]. Other approaches include the use of variable thresholds based on age [39], and hybrid models that incorporate both age-based and fixed thresholds [42,43,44]. Few existing guidelines incorporate shared decision-making [45, 46], but ideally this could be applied to determine the point at which an individual patient, informed about the benefits and risks, would want to contemplate treatment. Bisphosphonates (i.e., alendronate, risedronate, or zoledronic acid) are the most commonly used first-line treatments for the prevention of fragility fractures [47, 48]. Denosumab may also sometimes be considered [47, 48], but this is less common due to its higher cost compared to bisphosphonates. Changing lifestyle factors (e.g., diet, exercise) and fall prevention are other approaches to preventing fragility fractures [30] but were not in the scope of these systematic reviews.

According to a systematic review commissioned by the United States Preventive Services Task Force (USPSTF) with a comprehensive search in 2016, compared to placebo, treatment with bisphosphonates probably reduces the risk of nonvertebral and vertebral fractures (moderate certainty), but may make little-to-no difference in the risk of hip fractures (low certainty) in females [37]. There was low certainty evidence for reduction in all fracture types with denosumab in females [37]. Evidence for males was limited across all pharmacologic treatments of interest [37]. The review authors did not rate the certainty for all clinical fractures, as is of interest for the current review, and updating the evidence may change findings for some outcomes. Several harms may be associated with treatment to varying degrees. Some, such as mild upper gastrointestinal distress, are fairly benign; others, such as serious infections or cardiac events, osteonecrosis of the jaw, and atypical femoral fractures, are potentially highly concerning [49].

The effectiveness of treatment relies on high uptake and adherence [50]. However, uptake of pharmacologic treatment is often low, and adherence tends to diminish over time [51]. Low uptake and adherence may be related to a variable assessment of the balance of benefits and harms by individual patients. Though shared decision-making is incorporated into few existing screening guidelines [45, 46], a large variation in treatment preferences across patients could support a shared decision-making approach in the place of recommended treatment thresholds based solely on fracture risk [52, 53].

Objectives of systematic reviews

In these reviews, we have synthesized evidence relevant to screening for the primary prevention of fragility fractures and related mortality and morbidity among adults 40 years and older in primary care. The findings are among several considerations (including consultations with patients on outcome prioritization, information on issues of feasibility, acceptability, costs/resources, and equity) that will be used by the Canadian Task Force on Preventive Health Care (“Task Force”) to inform recommendations on screening for the prevention of fragility fractures among adults 40 years and older in Canada. Our key questions (KQs) were as follows:

KQ1a: What are the benefits and harms of screening compared with no screening to prevent fragility fractures and related morbidity and mortality in primary care for adults ≥40 years?

KQ1b: Does the effectiveness of screening to prevent fragility fractures vary by screening program type (i.e., 1-step vs 2-step) or risk assessment tool?

KQ2: How accurate are screening tests at predicting fractures among adults ≥40 years?

KQ3a: What are the benefits of pharmacologic treatments to prevent fragility fractures among adults ≥40 years?

KQ3b: What are the harms of pharmacologic treatments to prevent fragility fractures among adults ≥40 years?

KQ4: For patients ≥40 years, what is the acceptability (i.e., positive attitudes, intentions, willingness, uptake) of screening and/or initiating treatment to prevent fragility fractures when considering the possible benefits and harms from screening and/or treatment?

Screening and treatment for risk factors related to fractures, such as fall risk, were not considered, though the Task Force is currently developing separate recommendations about falls prevention interventions [54].

Methods

Terminology

Throughout this report, we refer to “females” and “males”; these terms refer to biological sex (i.e., biological attributes, particularly the reproductive or sexual anatomy at birth) unless otherwise indicated.

Review conduct

We followed a peer-reviewed protocol [55] for this review which was based on accepted systematic review methodology [56]. The review was registered prospectively in the International Prospective Register of Systematic Reviews (PROSPERO): CRD42019123767. The methods for the systematic review are reported in detail within the protocol [55]; we report on the methods here briefly, focusing on deviations from the original plans. We report the systematic review according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 statement [57].

At the protocol stage, members of the Task Force rated outcomes on their importance for clinical decision-making using a 9-point scale according to the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach [58]. In addition, the findings of surveys and focus groups with patients that were conducted by the Knowledge Translation team at St. Michael’s, Unity Health Toronto, were incorporated into the final outcome ratings. Outcomes rated as critical (7–9/9) were hip fracture, clinical fragility fractures, fracture-related mortality, quality of life or wellbeing, functionality and disability, serious adverse events, and prediction model calibration (KQ2 only). Outcomes rated as important (4–6/9) were all-cause mortality, non-serious adverse events, discontinuation due to adverse events, and overdiagnosis. The outcomes are defined in detail within our protocol [55]. As screening for risk of fracture does not result in a “diagnosis” but rather a risk for a future event, overdiagnosis has not been previously defined in the context of fracture risk assessment. However, as with conditions such as osteoporosis, overdiagnosis generally refers to identifying and labelling people with “problems,” or in this case “risks,” that would never have caused harm [59]. Thus, for the purpose of this review, we defined overdiagnosis as the identification of high risk in individuals who, if not screened, would never have known that they were at risk and would never have experienced a fragility fracture [59]. The systematic review protocol and this report were revised following review by external stakeholders (n=7 and n=4, respectively). The Task Force and their external clinical experts were involved with developing the scope of the review and the eligibility criteria (n=4; see “Acknowledgments”), as well as with interpreting the findings (n=2), but were not involved in the selection and risk of bias assessments of studies, data extraction, or analysis.

We reviewed the evidence following a staged approach, beginning by identifying direct evidence from trials (including all controlled trials but prioritizing evidence from RCTs) of primary screening versus no screening (KQ1a). Based on positive evidence from KQ1a, we proceeded to KQ1b, examining the comparative effectiveness of different screening approaches. We reviewed evidence related to the acceptability of screening and/or treatment (KQ4), as well as indirect evidence on the accuracy of screening tests (KQ2), concurrently with KQ1. The accuracy of screening tests was reviewed to better understand whether other well calibrated tools existed outside of those used in the screening trials, which could influence the tool ultimately recommended for screening. Because the Task Force believed that further information on the benefits and harms of pharmacologic treatment could be relevant to their recommendations, we proceeded with KQs 3a (benefits) and 3b (harms). After completing KQ3a on the benefits of treatment, discussions with the Task Force indicated that a rapid overview of reviews approach for KQ3b (harms of treatment) would be adequate to inform decision-making, while reducing the time and resources needed to review the evidence. We therefore amended our planned approach to KQ3b, as described herein.

Eligibility criteria

Detailed PICOTs for each KQ are shown in Table 1. Here, we report changes from our original plans that occurred during the selection phase. For KQ1 (benefits and harms of screening), we had intended to exclude studies of patients already being treated with anti-fracture drugs and/or with prior fractures at baseline, but some relevant trials included unknown proportions of previously treated and/or fractured patients. The comparator of interest was no screening, but in reality the available trials included some degree of ad hoc screening in the comparison group. We considered these factors within the GRADE indirectness domain.

Table 1 Eligibility criteria for each key question

For KQ2 (predictive accuracy of screening tests), based on clinical expert input, we decided to exclude tools that (a) are not freely available for use by clinicians or (b) do not provide an absolute risk prediction (e.g., provide only a risk categorization; Canadian Association of Radiologists and Osteoporosis Canada Risk Assessment [CAROC] tool retained due to relevance to Canada). We also considered external validations of FRAX-Canada to be most relevant, in comparison to FRAX tools calibrated for other countries. Though our original eligibility criteria included studies from multiple countries, our search update in 2021 was limited to new Canadian studies, because Canadian studies are the most applicable (when tools are calibrated to this population) and those identified in our original search (in 2019) were among the highest quality. Though not a deviation from our protocol, it is important to note that the discriminative ability of risk prediction tools was not rated as a critical or important outcome by the Task Force. For this reason, we did not review this information systematically within KQ2, but included data reported in a 2018 USPSTF review [60] within our GRADE Summary of Findings Tables for information purposes.

For KQ3a (benefits of treatment), we had planned to exclude the 5 mg/day dosage of alendronate but later included it as well as mixed doses (e.g., 5 mg followed by 10 mg) based on clinical expert input. This decision was supported by the apparent uncertainty about the superiority of the 10 mg/day dose and the likelihood of some variability in the doses used in practice. For KQ3b (harms of treatment), we relied on systematic reviews published since 2015 rather than primary studies, as originally intended (see Review Conduct). We included the one most appropriate systematic review per outcome comparison by considering comprehensiveness (likelihood that the search captured all relevant studies, informed by domain 2 in the Risk Of Bias In Systematic reviews [ROBIS] tool [61]); recency (date of last search); and other relevant features (e.g., availability of subgroup and/or adjusted analyses; availability of absolute event rates for the pooled effect). We included systematic reviews of bisphosphonates as a class only for serious adverse events where findings were very uncertain for individual drugs (i.e., additional data may be useful). For rebound fractures (i.e., fractures resulting from increased bone turnover and reductions in BMD after stopping treatment) from denosumab, we compared discontinuation of denosumab to persistence of denosumab or discontinuation of placebo, based on Task Force input about this being the most relevant available comparison. We also added “multiple vertebral fractures” as the most valid potential outcome to capture the effects of rebound fractures. Further, because the reviews were limited on reporting rebound fractures, we added a search for recent (2020 onwards) primary studies for this outcome. For non-serious adverse events, we included: non-serious gastrointestinal adverse events, musculoskeletal pain, dermatologic adverse events, and infections.

There were no changes to the original eligibility criteria for KQ4 (acceptability of screening/treatment).

Literature search and selection of studies

The approach and dates used to search for and select studies for inclusion in the systematic reviews for each KQ are shown in Table 2. Briefly, for KQs 1 (benefits and harms of screening), 2 (predictive accuracy of screening tests), and 3a (benefits of treatment), we integrated eligible studies published up to 2016 from an existing systematic review by the USPSTF [60]. Due to differences in eligibility criteria, we also checked the USPSTF’s excluded studies list and the reference lists of other systematic reviews and major guidelines to identify studies published before 2016 that would have been excluded from the USPSTF review but met our inclusion criteria (e.g., studies that the USPSTF judged to have serious risk of bias concerns, and those examining the comparative effectiveness of screening approaches). We did not integrate studies from existing reviews for KQs 3b (harms of treatment) or 4 (acceptability) and instead relied solely on our search strategies.

Table 2 Approach to search and selection of studies for each key question

A research librarian developed and implemented comprehensive peer-reviewed [62] electronic search strategies for each KQ (see protocol [55]; Additional file 2 for KQ3b on harms of treatment). We also searched clinical trials registries and scanned the reference lists of relevant systematic reviews and the included studies. We exported the database results to EndNote (version X7 or X9, Clarivate Analytics, Philadelphia, PA) and removed duplicates before screening the records in DistillerSR (Evidence Partners Inc., Ottawa, Canada).

Data extraction and risk of bias assessment

We had initially planned to rely (with verification) on data from the USPSTF systematic review [60] for older studies. However, during review conduct, differences in outcome definitions, subgroups of interest, and methodology (e.g., an updated version of the PROBAST tool became available) became apparent. Therefore, following a pilot round (with two reviewers), one reviewer independently extracted data from all included studies into a standardized form in Excel (Microsoft Corporation, Redmond, WA). Study characteristics were then verified by a second reviewer and outcome data were extracted in duplicate, with final data based on consensus. The full list of data extraction items is available in our protocol [55]. Since we altered our approach to rely on systematic reviews for KQ3b (harms of treatment), we additionally collected the following: databases searched and date of last search, scope of systematic review and selection criteria for the included studies, number and design of primary studies included, number of participants and summary characteristics, summary of interventions and comparators included, risk of bias/quality appraisal tool used to appraise included studies, analysis methods, and summary statistics for outcomes of interest.

Outcome-level risk of bias was appraised for each included study in duplicate (one reviewer with verification for KQ3b [harms of treatment]) using published design-specific tools as applicable (Cochrane risk of bias tool version 2011 for KQs 1 and 3a [63], PROBAST for KQ2 [64], AMSTAR 2 for KQ3b [65]), with final ratings determined by consensus. For KQ3b (harms of treatment), we also extracted information on the risk of bias of the systematic reviews’ included studies, but if these were missing, we did not perform these appraisals anew. Since there is no commonly used or accepted tool to assess risk of bias in studies of acceptability, we assessed risk of bias in the studies included for KQ4 (acceptability of screening/treatment) by considering the risk of bias subdomains within the GRADE guidance for assessing the certainty of evidence in studies of the importance of outcomes or patient values and preferences [66] (adapted to be suitable to acceptability). Assessments of risk of bias informed the study limitations domain of our assessments of the certainty of the body of evidence.

Synthesis

We performed meta-analyses when appropriate based on clinical and methodological similarity across studies. For KQ1 (benefits and harms of screening), we pooled data for each outcome via pairwise meta-analysis using the DerSimonian and Laird random effects model [67] in Review Manager (version 5.3, The Cochrane Collaboration, Copenhagen, Denmark). Due to differences in the populations analyzed across studies, we pooled data from different population perspectives separately. The perspectives analyzed were (a) offer-to-screen, which included all those randomized and offered screening (by mail), whether or not they completed any screening, analyzed in the group to which they were originally assigned; (b) offer-to-screen in selected populations, which included those who independently completed a mailed clinical FRAX questionnaire, analyzed in the group to which they were originally assigned (randomized before or after completion, depending on the trial); and (c) acceptors, which included those randomized who ultimately completed the entire screening process (i.e., clinical FRAX and BMD if meeting the risk threshold). In one study [68], hip fractures were presented only as counts (rather than number of participants with ≥1 fracture); we included this study among the others in meta-analysis based on clinical and statistical expert input indicating that the outcome was sufficiently rare that count and rate data would be similar. As described previously, we defined overdiagnosis as the identification of high risk in individuals who, if not screened, would never have known that they were at risk and would never have experienced a fragility fracture [59]. As this was not reported directly in any trial, we estimated it using available data from two trials, considering the proportion of participants exceeding the risk threshold in the study and the mean risk in these patients (see Additional file 3).

For KQ3a (benefits of treatment), we pooled data by outcome as in KQ1; in several studies, there were zero events reported in one or both groups, and in these cases, we performed random effects meta-analysis using the reciprocal of the opposite treatment arm size correction for the pooled odds ratio [69] in Stata (StataCorp, College Station, TX). We pooled data by sex and for each drug separately, but also performed an “all bisphosphonates” analysis including data from studies reporting on any of alendronate, risedronate, or zoledronic acid. For KQ3b, we report pooled effects directly as they were presented within the included systematic reviews and did not perform any re-analyses of data from primary studies.
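
As an illustration of the pooling described above, the following minimal Python sketch computes study-level log odds ratios and pools them with the DerSimonian and Laird estimator. It is not the Review Manager or Stata implementation used in the review. The function names are hypothetical, and the zero-event correction shown (values proportional to the reciprocal of the opposite arm size, scaled to sum to 1 across the two arms) is an assumption about how that correction is commonly parameterized.

```python
import math

def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Return (log odds ratio, variance) for one two-arm study.

    Assumption: when any cell of the 2x2 table is zero, each arm receives a
    correction proportional to the reciprocal of the OPPOSITE arm's size,
    scaled so the two corrections sum to 1.
    """
    a, b = events_t, n_t - events_t   # treatment arm: events, non-events
    c, d = events_c, n_c - events_c   # control arm: events, non-events
    if 0 in (a, b, c, d):
        k_t = n_t / (n_t + n_c)       # proportional to 1 / n_c
        k_c = n_c / (n_t + n_c)       # proportional to 1 / n_t
        a, b = a + k_t, b + k_t
        c, d = c + k_c, d + k_c
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance

def dersimonian_laird(effects, variances):
    """Pool effects with the DerSimonian-Laird random effects model.

    Returns (pooled effect, standard error, tau squared).
    """
    w = [1 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0     # between-study variance
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, se, tau2
```

With equal arm sizes, the assumed correction reduces to adding 0.5 to every cell, which is the conventional constant correction.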

We calculated absolute effects for each outcome comparison by applying the relative risk or odds ratio from the meta-analysis to the median control group event rates from the included studies [70]. For KQ1, we also incorporated a sensitivity analysis by calculating absolute effects using an assumed risk based on the general population in Canada (45 to 54 years and ≥65 years) [15, 71]. If statistically significant, we calculated the number needed to screen for an additional beneficial outcome (NNS), number needed to treat for an additional beneficial outcome (NNT), or number needed to treat for an additional harmful outcome (NNH) [72].
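
The conversion from relative to absolute effects described above can be sketched as follows. This is a minimal illustration, not the review's own calculation; the function names are hypothetical, and the control risk and relative risk in the usage lines are invented for demonstration only.

```python
import math

def risk_difference_from_rr(control_risk, relative_risk):
    """Absolute risk difference implied by a relative risk at a given control risk."""
    return control_risk * relative_risk - control_risk

def risk_difference_from_or(control_risk, odds_ratio):
    """Absolute risk difference implied by an odds ratio at a given control risk."""
    intervention_risk = (odds_ratio * control_risk) / (
        1 - control_risk + odds_ratio * control_risk
    )
    return intervention_risk - control_risk

def number_needed(risk_difference):
    """NNS, NNT, or NNH: reciprocal of the absolute risk difference."""
    if risk_difference == 0:
        return math.inf
    return 1 / abs(risk_difference)

# Hypothetical inputs: 5% control group risk and a pooled relative risk of 0.8
rd = risk_difference_from_rr(0.05, 0.8)
print(f"{rd * 1000:.1f} per 1000; number needed: {number_needed(rd):.0f}")
```

The same pooled relative effect yields a different absolute effect for each assumed control risk, which is why a sensitivity analysis with a general population risk can change the absolute estimates without any change to the meta-analysis itself.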

For KQ2 (predictive accuracy of screening tests), we chose not to pool the overall findings on calibration for most tools due to high levels of heterogeneity that could not be explained by a priori subgroups (age, sex, baseline risk within and across studies). We present the calibration findings by tool for both the population overall (average) and a summary of calibration within categories (e.g., quintiles, deciles) of baseline risk. We did pool data for the studies without high risk of bias reporting on the FRAX-Canada tool; we considered data from this subgroup to be most reliable and most directly applicable to Canada. These studies presented no major risk of bias concerns that would reduce our certainty in the findings, whereas all others generally had multiple major reasons to be seriously concerned about risk of bias. In all cases, we used the restricted maximum likelihood estimation approach and the Hartung-Knapp-Sidik-Jonkman correction to derive 95% CIs [73, 74]. We rescaled total observed versus expected fracture event ratios (O:E) and their variance (standard error) on the natural log scale prior to entering these into meta-analysis (or displaying on forest plots) to achieve approximate normality [75,76,77].
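
The log-scale rescaling of O:E ratios described above can be sketched as below. This is a minimal illustration under stated assumptions, not the review's code: the function names are hypothetical, and the standard error formula is a common delta-method approximation that treats the expected count as fixed and the observed count as binomial. The review's exact variance formula is not given in this section.

```python
import math

def log_oe(observed, expected, n_participants):
    """Return (ln(O:E), approximate standard error of ln(O:E)).

    Assumption: SE(ln O:E) ~ sqrt((1 - O/N) / O), a delta-method
    approximation with the expected count treated as fixed.
    """
    log_ratio = math.log(observed / expected)
    se = math.sqrt((1 - observed / n_participants) / observed)
    return log_ratio, se

def oe_interval(observed, expected, n_participants, z=1.96):
    """Back-transform a Wald interval from the log scale to the O:E scale."""
    log_ratio, se = log_oe(observed, expected, n_participants)
    return math.exp(log_ratio - z * se), math.exp(log_ratio + z * se)

# Hypothetical counts: 120 observed vs 100 expected fractures among 2000 people
low, high = oe_interval(120, 100, 2000)
print(f"O:E = {120 / 100:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Working on the log scale keeps the interval asymmetric around the ratio and prevents a lower bound below zero after back-transformation.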

For KQ4 (acceptability of screening/treatment), we performed a narrative synthesis following the guidance of Popay et al. [78], recognizing that our question of acceptability differs to some extent from questions about interventions or implementation factors.

Across KQs, we considered several potential population and intervention/exposure subgroups of interest. For example, in KQ1, analyses were stratified by age, while in KQ3a, we analyzed data for postmenopausal females separately from males. In several cases, data on characteristics of interest were unavailable in the included study reports (e.g., baseline fracture risk). We also considered within-study subgroup analyses when these were available. We performed sensitivity analyses by risk of bias, applicability concerns (e.g., high-risk population in KQ1), and outcome ascertainment methods (e.g., clinical fragility fractures in KQ3a). When analyses for interventions contained at least eight trials of varying size, we assessed for small study bias using funnel plots and Harbord’s test (KQ3a) [79].

Rating certainty of evidence and drawing conclusions

Two reviewers rated the certainty in the evidence for each outcome comparison of interest and agreed on the final rating and conclusion statements. Our certainty of evidence appraisals for effects of interventions were based on the absolute effects and considered only the direction of effect and not its magnitude. For KQ1 (benefits and harms of screening), KQ3 (benefits and harms of treatment), and KQ4 (acceptability of screening/treatment), we assessed the certainty of the evidence following the GRADE approach [80,81,82,83,84,85,86]. In the absence of published guidance on GRADE for reviews of risk prediction models, for KQ2 (predictive accuracy of screening tests) calibration outcomes, we considered input from an expert in GRADE to modify existing guidance [87] and assist in rating the evidence and developing conclusions. We decided a priori to consider tools to be well calibrated when the O:E ratio across the study populations consistently fell between 0.8 and 1.2 (20% over- or underestimation, respectively) [88]. We then rated certainty for one of four possible conclusions: well calibrated (O:E ratio consistently between 0.8 and 1.2), underestimation (O:E ratio >1.2 and adequately precise to draw clinically meaningful conclusions), overestimation (O:E ratio <0.8 and adequately precise to draw clinically meaningful conclusions), or poorly calibrated (wide variation across studies including over- and underestimation; unable to draw a clinically meaningful conclusion) (Additional file 4). For KQ3b, we relied preferentially on the certainty of evidence ratings presented by the included systematic reviews, with modifications if needed to align with our other appraisals. When these were not reported by the included systematic reviews, we performed our own GRADE appraisals, relying on the data available in the systematic reviews. When the data required to perform full evidence appraisals were missing from the included systematic reviews, we collected data from the included primary studies (if ≤5 studies) and/or made assumptions, as described in Additional file 4.
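
The four calibration conclusions defined above can be illustrated with a deliberately simplified Python sketch. It uses only study-level O:E point estimates and the a priori bounds of 0.8 and 1.2. The actual ratings also weighed precision and reviewer judgment, which this sketch does not model, and any mix that is not uniformly inside, above, or below the bounds is labelled poorly calibrated here as a simplifying assumption. The function name is hypothetical.

```python
def calibration_conclusion(oe_ratios, lower=0.8, upper=1.2):
    """Map a set of study-level O:E ratios to one of four conclusions.

    Simplification: precision (confidence intervals) and reviewer judgment
    are ignored; only point estimates are compared with the bounds.
    """
    if not oe_ratios:
        raise ValueError("at least one O:E ratio is required")
    if all(lower <= r <= upper for r in oe_ratios):
        return "well calibrated"
    if all(r > upper for r in oe_ratios):
        return "underestimation"   # observed exceeds expected
    if all(r < lower for r in oe_ratios):
        return "overestimation"    # expected exceeds observed
    return "poorly calibrated"

# Hypothetical O:E ratios from three validation cohorts
print(calibration_conclusion([0.95, 1.05, 1.10]))
```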

We developed informative statements based on our certainty in the evidence for each outcome comparison [89]. We adopted standard wording to describe our findings, using the word “may” together with the direction of effect to describe findings of low certainty and “probably” for those of moderate certainty. When our certainty in the evidence was very low, we describe the evidence only as “very uncertain” [89].

Results

KQ1a: What are the benefits and harms of screening compared with no screening to prevent fragility fractures and related morbidity and mortality in primary care for adults ≥40 years?

Of 7151 unique records retrieved by the searches for KQ1a and b, we assessed 163 for eligibility by full text and included five trials (4 randomized controlled trials [RCT] [4,5,6, 90], 1 controlled clinical trial [CCT] [68]) and two associated publications [91, 92] for KQ1a, and one RCT for KQ1b [93] (Fig. 1). Studies excluded after full text appraisal are listed with reasons in Additional file 5.

Fig. 1

Flow of records through the selection process. Legend: not applicable

Study characteristics

Table 3 shows the characteristics of the included trials for KQ1a. The trials were conducted in countries with a moderate-to-high baseline fracture risk [94]: Denmark (ROSE [5]), the Netherlands (SALT [4]), the UK (SCOOP [6] and APOSS [90]), and the USA (Kern CCT [68]). Aside from the Kern CCT, which included a relatively equal proportion of males and females ≥65 years old [68], the trials included populations of exclusively peri-menopausal (aged 45 to 54 years) [90] or postmenopausal (mean ages 70 to 75.5 years; range 65 to 90 years) [4,5,6] females. When reported, between 10 and 44% of the study population had a prior fracture [4,5,6]. The proportion of participants with a prior fracture was highest in the SALT trial (44%), which enrolled females who reported at least one clinical risk factor on the clinical FRAX tool [4]. Participants were not necessarily treatment-naïve; in particular, the APOSS trial allowed enrollment of females with past use of hormone replacement therapy [90], and 11% of participants in the ROSE trial were taking anti-osteoporosis medications at baseline [5].

Table 3 Characteristics of studies included for key questions 1a&b on the benefits and harms of screening versus no screening, and the comparative benefits and harms of different screening approaches

The three more recent trials (published 2018–2019) [4,5,6] employed a 2-step approach to screening, whereby all participants completed a mailed questionnaire including data to assess risk with the clinical FRAX tool, and only those surpassing certain risk thresholds were offered BMD assessment. The threshold for BMD assessment varied across trials; in the SALT trial, the entire population had ≥1 risk factor and were offered BMD and vertebral fracture assessment [4], whereas ROSE offered BMD for those with a clinical FRAX-based 10-year major osteoporotic fracture risk ≥15% [5], and SCOOP used age-based thresholds of 10-year hip fracture risk [6]. The two older trials (APOSS [2010] and Kern CCT [2005]) used a 1-step, direct-to-BMD screening approach [68, 90]. No trials included a true “no screening” comparator; in all cases, the comparator was usual care, with evidence of varying levels of ad hoc screening and treatment (median 17% treatment rate when this was reported, range 5 to 59% [4,5,6, 90]) within the comparison groups.

Thresholds for treatment were also variable across the trials. In both the SALT and SCOOP trials, BMD assessment was used to recalculate the 10-year FRAX fracture risk with inclusion of BMD, and treatment was offered when participants exceeded age-specific thresholds [4, 6]; the SALT trial also allowed for several other treatment indications according to Dutch guidelines (e.g., vertebral fracture) [4]. Of note, in the SCOOP trial, only 898 females exceeded a treatment threshold despite 3064 being considered at elevated risk based on fairly similar thresholds but without incorporation of the BMD results into the risk prediction by clinical FRAX. In the ROSE trial, treatment was offered when the BMD T-score at any measured site was ≤−2.5, and/or a fracture was detected on vertebral fracture assessment [5]. In the two 1-step screening trials, treatment was offered to those in the lowest quartile of BMD, based on the first 1000 participants screened (APOSS) [90], and to those below the age-matched mean of the reference group according to the densitometer’s manufacturer (Kern CCT) [68]. Across the trials, between 7 and 25% of those assigned to screening had indications for treatment; the proportion was highest in the SALT trial, where higher-risk patients were enrolled [4]. The rate of treatment was lowest in the Kern CCT (31% of those with a treatment indication) [68]; among the remaining trials, more than two-thirds (69 to 80%) of those with a treatment indication reported using some form of anti-osteoporosis drugs during follow-up (variable treatments across studies, and sometimes including those such as calcitonin and hormone replacement therapy, which are no longer recommended; see Table 3).
It was apparent that most of the treatment provided in the recent RCTs was pharmacologic, though at least one protocol (SALT) mentioned calcium and vitamin D supplementation, as well as notification of a high fall risk, that may have been acted upon by the primary care practitioner.
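
As one concrete reading of the 2-step pathway, the sketch below encodes the ROSE rules as summarized above: BMD is offered when the clinical FRAX 10-year major osteoporotic fracture risk is at least 15%, and treatment is offered when any measured T-score is at or below −2.5 (the conventional osteoporosis cut-off) or a vertebral fracture is detected. All function and field names are hypothetical. The sketch deliberately ignores participation, the age-specific thresholds used in SALT and SCOOP, and clinician judgement.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

FRAX_MOF_THRESHOLD = 15.0  # % 10-year major osteoporotic fracture risk (ROSE)
T_SCORE_THRESHOLD = -2.5   # conventional osteoporosis cut-off


@dataclass
class ScreeningResult:
    offered_bmd: bool
    offered_treatment: bool


def two_step_screen(frax_mof_risk_pct: float,
                    t_scores: Optional[Sequence[float]] = None,
                    vertebral_fracture: bool = False) -> ScreeningResult:
    """Step 1: questionnaire-based risk; step 2: BMD only above the threshold."""
    if frax_mof_risk_pct < FRAX_MOF_THRESHOLD:
        # Below the risk threshold: no BMD offered, so no treatment indication.
        return ScreeningResult(offered_bmd=False, offered_treatment=False)
    # t_scores is None (or empty) when no BMD measurement is available.
    low_bmd = (t_scores is not None and len(t_scores) > 0
               and min(t_scores) <= T_SCORE_THRESHOLD)
    return ScreeningResult(offered_bmd=True,
                           offered_treatment=low_bmd or vertebral_fracture)
```

A participant with a 20% predicted risk and T-scores of −1.0 and −2.7 would be offered both BMD and treatment under these rules; one with a 10% predicted risk would be offered neither.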

The trials provided data for hip fractures [4,5,6, 68, 90], clinical fragility fractures (described as major osteoporotic [4, 5, 90] or osteoporosis-related fractures [6]), serious adverse events [6], all-cause mortality [4, 6, 68, 90], and quality of life or wellbeing [6, 90, 92]; no trials reported on fracture-related mortality, functionality and disability, discontinuation due to adverse events, or non-serious adverse events. Though not directly reported, data were available in two trials to estimate the potential extent of overdiagnosis (see Additional file 3 for calculations) [4, 6]. Because of differences in design and reporting across the trials, we considered three possible population perspectives in our analyses. Two trials (APOSS and ROSE) provided data for an offer-to-screen population, whereby all eligible people invited for screening by mail, regardless of actual participation in any screening, were analyzed [5, 90]. The APOSS study also provided data for acceptors of screening, where the analysis included only those who attended for BMD measurement and thus completed screening. The SALT, ROSE, and SCOOP trials provided data for what we considered an offer-to-screen in selected population approach, because the analyses only included people who independently completed a mailed clinical FRAX questionnaire as part of 2-step screening [4,5,6]. The Kern CCT [68] also contributed data for this approach, as the sample population for screening was those already enrolled in the Cardiovascular Health Study (i.e., not the general population) [95]. We considered the “selected population” approach to be the most applicable to primary care (where healthcare providers would complete risk assessment tools during the patient visit and then discuss the findings), although the participants in these trials are likely to be more accepting of and compliant with screening, and possibly with treatment, than the general population presenting to primary care.

The risk of bias ratings for the included trials for KQ1a are in Table 4. The main risk of bias concerns were related to participant awareness of group assignments and contamination of the control groups in all trials (aforementioned ad hoc screening and treatment, likely to bias the findings toward the null) [4,5,6, 68, 90], and a high risk of attrition bias in the APOSS trial (42% lost to follow-up) in the offer-to-screen population [90]. The Kern CCT was not randomized; however, patients were invited based on age- and sex-stratified random sampling, and analyses were adjusted for baseline differences between groups [68]. We rated this trial, as well as the “acceptors” population for the APOSS trial and the “selected population” in the ROSE trial, to be at unclear risk of selection bias [5, 90], because in these analyses, the participants no longer represented the initially randomized population.

Table 4 Risk of bias assessments for trials included for KQ1a on the benefits and harms of screening vs. no screening, and KQ1b on the comparative benefits and harms of different screening approaches

Findings

Table 5 summarizes the main findings for KQ1a; Additional file 3 contains the full GRADE Evidence Profiles and Summary of Findings Tables, with explanations for each rating as well as the forest plots, which include the results of statistical tests for subgroup differences. Among females aged 68–80 years, data from one trial showed that a mailed offer of screening in the general population may not reduce the risk of hip fractures, clinical fragility fractures, or all-cause mortality during 5 years of follow-up [5]. The evidence is very uncertain for all outcomes from a mailed offer of screening with BMD among females aged 45–54 years during 9 years of follow-up (1 trial) [90].

Table 5 Summary of findings for KQ1a on the benefits and harms of screening compared with no screening

Among a selected population of females aged ≥65 years who are willing to independently complete a mailed fracture risk questionnaire, 2-step screening with risk assessment (clinical FRAX or FRAX-like tool) and BMD probably reduces the risk of hip fractures (3 RCTs + 1 CCT; n=43,736; 6.2 fewer in 1000, 95% confidence interval [CI] 9.0 fewer to 2.8 fewer; NNS=161) [4,5,6, 68] and clinical fragility fractures (3 RCTs; n=42,009; 5.9 fewer in 1000, 95% CI 10.9 fewer to 0.8 fewer; NNS=169) [4,5,6]. However, screening in this selected population probably does not reduce the risk of all-cause mortality (note: 1379 males were included in this analysis from the Kern CCT, representing 5.4% of the total sample) [4, 6, 68]. Our sensitivity analyses using assumed/baseline risks from a general Canadian population (age roughly corresponding to that of the trials) suggest that the effects for clinical fragility fracture may be larger than found in the trial populations, but these analyses are considered exploratory (Table 5). Post hoc subgroup analyses from the SCOOP study showed that the effectiveness of screening on hip fracture risk was greater in females with higher baseline clinical FRAX 10-year hip fracture risk (HR [95% CI] 0.67 [0.53–0.84] in the 90th percentile of risk vs. 0.93 [0.71–1.23] in the 10th percentile, p=0.021) and in females with prior fracture (HR [95% CI] 0.55 [0.38–0.79] with prior fracture vs. 0.87 [0.68–1.12] without prior fracture, p=0.040) [91]. The evidence for the effect of an offer of screening in a selected population of males is very uncertain [68]. In females aged 70–85 years, screening may make little-to-no difference in health-related quality of life [6].
Between 11.8% [6] and 19.3% [4] of females in a selected population offered 2-step screening may be overdiagnosed, but the magnitude of these estimates is of low certainty due to serious concerns of indirectness from lack of data provided as required for the proposed equation (e.g., mean risk in the high-risk population in SCOOP was limited to results of clinical FRAX without incorporation of BMD as used for treatment decisions) and from use of data from the SALT trial where participants were all at increased risk. Among females aged 70–85 years who are considered at high-risk by FRAX 10-year hip fracture risk alone and are referred to BMD assessment, data from one trial indicate that 24.1% may be overdiagnosed [6], but there is low certainty about this due to serious concerns about indirectness.
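
The NNS values reported above follow directly from the absolute effects, since the number needed to screen is the reciprocal of the absolute risk reduction. A minimal sketch, assuming rounding to the nearest integer (the rounding rule is not stated in the text, and the function name is illustrative):

```python
def number_needed(fewer_per_1000: float) -> int:
    """Number needed to screen (or treat) from an absolute reduction per 1000."""
    if fewer_per_1000 <= 0:
        raise ValueError("requires a positive absolute risk reduction")
    return round(1000 / fewer_per_1000)


print(number_needed(6.2))  # hip fractures: 161
print(number_needed(5.9))  # clinical fragility fractures: 169
```

The same calculation applies to a number needed to treat when the absolute effect comes from a treatment comparison.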

The evidence for hip and clinical fragility fractures among females aged 45–54 who accept 1-step screening with BMD measurement is very uncertain.

KQ1b: Does the effectiveness of screening to prevent fragility fractures vary by screening program type (i.e., 1-step vs. 2-step) or risk assessment tool?

Study characteristics

As indicated in the findings for KQ1a, one RCT (OPRA) [93] was included for the comparative effectiveness of different screening approaches. Characteristics of the OPRA trial are in Table 3. The trial included a mailed offer-to-screen population (data for acceptors of screening were also available but are less relevant to the primary care population). Eligible (n=9268; 34% participated) postmenopausal females were randomized to one of three screening approaches: 1-step screening using BMD via DXA; 2-step screening using the Simple Calculated Osteoporosis Risk Estimation (SCORE)-based tool, with BMD assessment offered when the score was ≥7 (74% eligible); and 2-step screening using the Study of Osteoporotic Fractures (SOF)-based tool, with BMD assessment offered to those with ≥5 clinical risk factors (7% eligible) [93]. Patients were eligible for potential treatment if they had ≥5 risk factors and/or BMD T-score below age-specific thresholds, or if they had a prior fracture after age 50 years (SOF-based group only) [93]. The proportion of patients dispensed a prescription (including alendronate, hormone replacement therapy, calcitonin, raloxifene) was similar across groups (13 to 14% of those offered screening) [93]. The two outcomes reported by the trial were the total number of hip fractures, and clinical fragility fractures (reported as non-pathologic [osteoporotic] fractures) [93].

The risk of bias assessment for the OPRA trial is in Table 4. The trial was rated at unclear risk of bias due to the potential for selection bias (randomization and allocation concealment not clearly defined) and patient awareness of group assignment (those in the SCORE- and SOF-based groups not assigned to BMD testing would have increased awareness of risk and could seek further care) [93]. The trial was not powered to detect a difference in fracture outcomes across groups.

Findings

Additional file 3 contains the full analysis details for KQ1b, including the GRADE Summary of Findings Tables, with explanations for each rating and forest plots. Among females aged 60–80 years, the evidence from a single RCT comparing 1-step (BMD) versus 2-step screening (risk assessment + BMD), and comparing different 2-step screening strategies (i.e., SCORE-based vs. SOF-based for the risk assessment), for risk of hip and clinical fragility fractures is very uncertain [93].

KQ2: How accurate are screening tests at predicting fractures among adults ≥40 years?

Of 6081 unique records retrieved by the searches for KQ2, we assessed 413 for eligibility by full text, and 59 external validation cohort studies [96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154], taking place in very high human development index countries with moderate fracture risk, met eligibility criteria for inclusion in the review (Fig. 1). For our search update in June 2021, we changed our eligibility criteria to Canadian reports of unique cohorts or reports that added data to those previously included; from this update, we included one study [154] and excluded 18 other reports [148, 155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171]. Studies excluded after full text appraisal are listed with reasons in Additional file 5. Among the initial set of included studies from our search in July 2019, there were several that analyzed cohorts with substantial overlap in participants. To prevent double-counting in the analysis, when cohorts were overlapping for a given tool-outcome comparison, we selected a single primary cohort study for analysis (n=32) [98,99,100, 104, 106,107,108,109,110,111,112,113, 116, 117, 119, 128, 129, 134, 136, 138, 140, 142,143,144,145,146, 148,149,150,151, 153, 154]. We primarily considered recency in our choice of cohorts, but also considered the size of cohorts, quality of the methods (primarily more available data on predictors), and available outcomes. The remaining publications were then used for any reported supplementary data (e.g., calibration plots, subgroups of interest).

Study characteristics

Additional file 6 shows the characteristics of the included studies and their associated publications. Half (16/32, 50%) of the included studies were composed of participants from the USA (n=9) [104, 109, 110, 113, 136, 140, 143, 144, 153] and Canada (n=7) [106, 111, 119, 128, 129, 134, 154]; the remaining studies took place in Spain (n=4) [98, 99, 145, 150], Japan (n=3) [117, 148, 149], France (n=2) [146, 151], Israel (n=2) [108, 112], Poland (n=2) [107, 142], Australia (n=1) [116], New Zealand (n=1) [100], and Portugal (n=1) [138]. The studies analyzed data from a total of 1,491,968 participants (median 3305, range 91 to 1,054,815), with mean age ranging from 51 to 74.2 years. In more than half of the studies, only females were included (17/32, 53%) [98,99,100, 104, 106, 107, 112, 134, 136, 142,143,144, 146, 148, 150, 151]; the remaining were roughly equally split between including only males (n=7, 22%; one cohort [129] included females but only the male population was used for analysis) [109, 110, 113, 116, 117, 153], and a mix of males and females (n=8, 25%) [108, 111, 119, 128, 138, 140, 145, 154]. Participants were often recruited from patient, insurance, or resident (e.g., electoral rolls) registries (n=16/32, 50%) [98, 108,109,110,111,112,113, 116, 119, 138, 140, 142, 143, 146, 148, 149]; ten (31%) studies enrolled all those presenting for BMD assessment (potentially at higher risk depending on local practices) [99, 106, 107, 128, 129, 136, 144, 145, 150, 151], five (16%) included patients already enrolled in other studies [100, 104, 117, 134, 154], and one (3%) enrolled only veterans [153].
Studies most commonly provided findings for the calibration of clinical FRAX (i.e., without incorporation of BMD) or with incorporation of BMD results (i.e., FRAX + BMD; n=26/32, 81%) [98,99,100, 104, 106,107,108,109, 111, 112, 116, 117, 129, 134, 138, 140, 142,143,144, 146, 148,149,150,151, 153, 154] and Garvan with or without BMD (n=8, 25%) [100, 104, 108, 113, 119, 142, 145, 154]; there were few external validation studies reporting on QFracture (n=3) [108, 113, 154], the Fracture Risk Calculator (FRC; n=2) [110, 136], CAROC (n=1) [128], and the Fracture and Immobilization Score (FRISC; n=1) [149].

The risk of bias ratings for the included studies for KQ2 are in Additional file 7. Almost all of the studies were at high overall risk of bias; only four [106, 111, 128, 129] were lacking serious risk of bias concerns (rated at unclear risk of bias because proxy variables were used for some predictors, e.g., chronic obstructive pulmonary disease instead of smoking status). The primary risk of bias concerns across the included studies were related to predictor ascertainment (missing predictor data, predictors not handled as intended by the tool), outcome ascertainment (self-reported or including high trauma fractures), and the analysis (large losses to follow-up and/or competing risk of mortality not accounted for, inadequate number [<100] of fracture outcomes, follow-up duration not matching the prediction period [e.g., substantially shorter or longer than 10 years without adjustment]). Many studies did not account for the effect of treatment prior to risk assessment or during follow-up.

Findings

Additional file 4 contains the full analysis details for KQ2, including GRADE Summary of Findings Tables, with explanations for each rating, and forest plots. Within the Summary of Findings Tables, discrimination findings from the USPSTF’s review are shown. Due to a high degree of heterogeneity that could not be well explained by a priori subgroup analyses, we generally did not pool data on calibration, and instead present the findings descriptively. The exception was FRAX-Canada, where we pooled (and relied on primarily) data from the three Canadian studies without serious risk of bias concerns. This decision was based on recognition that FRAX is considered as a suite of tools (algorithm calibrated to various countries) rather than a single tool; therefore, these Canadian studies without serious risk of bias would provide the most directly applicable evidence.

Forest plots for the calibration of clinical FRAX and FRAX + BMD across studies with and without serious risk of bias concerns are in Figs. 2 and 3, respectively. For both the 10-year prediction of hip and clinical fragility fractures, there was a high degree of heterogeneity in O:E estimates across studies that was not well explained by subgroup analyses by age, sex, and baseline risk (Additional file 4). Most studies were at high risk of bias and did not use FRAX-Canada. We judged the performance of FRAX (with and without BMD) to be poor in these studies, but the evidence was rated as very uncertain due to concerns across all GRADE domains. Pooled data from three Canadian studies (n = 67,611) [106, 111, 129] without serious risk of bias indicate that clinical FRAX-Canada may be well calibrated for the 10-year prediction of hip fractures (O:E = 1.13, 95% CI 0.74–1.72, I² = 89.2%) and is probably well calibrated for the 10-year prediction of clinical fragility fractures (O:E = 1.10, 95% CI 1.01–1.20, I² = 50.4%), both with some underestimation of the observed risk. Data from these same studies (n = 61,156) [106, 111, 129] showed that FRAX-Canada with BMD may perform poorly to estimate 10-year hip fracture risk (O:E = 1.31, 95% CI 0.91–2.13, I² = 92.7%), but is probably well calibrated for the 10-year prediction of clinical fragility fractures, with some underestimation of the observed risk (O:E = 1.16, 95% CI 1.12–1.20, I² = 0%).
Within-study data from calibration plots (e.g., using deciles of baseline risk) were heterogeneous (7 studies for 10-year prediction of hip fractures [99, 100, 104, 109, 112, 143, 148] and 8 for clinical fragility fractures [99, 100, 104, 109, 112, 134, 143, 148] with clinical FRAX; 8 studies for the 10-year prediction of hip fractures [99, 100, 106, 109, 111, 125, 143, 148] and 10 for clinical fractures [99, 100, 106, 109, 111, 117, 140, 143, 148, 150] with FRAX + BMD), but two Canadian studies without serious concerns for risk of bias showed acceptable calibration of clinical FRAX-Canada in females at a baseline predicted risk above 5% [106], and FRAX-Canada with BMD in females at a baseline predicted risk above 6 or 12%, depending on the study [106, 111].

Fig. 2

Calibration of clinical FRAX for the 10-year prediction of hip and clinical fragility fractures. Legend: Forest plots show the calibration ratios reported across the included studies; these were not pooled for the high risk of bias studies, and pooled for the studies without high risk of bias (reporting on FRAX-Canada)

Fig. 3

Calibration of FRAX with the incorporation of bone mineral density for the 10-year prediction of hip and clinical fragility fractures. Legend: Forest plots show the calibration ratios reported across the included studies; these were not pooled for the high risk of bias studies, and pooled for the studies without high risk of bias (reporting on FRAX-Canada)

There is evidence to suggest acceptable calibration of FRAX to predict the 5-year risk of hip (FRAX + BMD only) and clinical fragility fractures (clinical FRAX and FRAX + BMD) (low certainty; most applicable to females) [129], but the prediction of 5-year risk is not a well-accepted or intended purpose of the tool. Findings on discrimination from Viswanathan 2018 [60] show an area under the curve (AUC) for the 10-year prediction of hip fractures in females of 0.76 (95% CI 0.72–0.81) for clinical FRAX and 0.79 (95% CI 0.76–0.81) for FRAX + BMD. The AUC for clinical fragility fractures in females was 0.67 (95% CI 0.65–0.68) for clinical FRAX and 0.70 (95% CI 0.68–0.71) for FRAX + BMD [60]. Reported findings for males are in Additional file 4.

We are very uncertain about the ability of clinical Garvan (2 cohorts; n=67,923) [104, 113] and Garvan + BMD (5 cohorts; n=11,869) [100, 113, 119, 142, 145] to predict the 10-year risk of hip fractures and the 10-year risk of clinical fragility fractures [100, 113, 119, 142, 145]. Clinical Garvan (1 cohort; n=1,054,815) [108] may underestimate the 5-year risk of hip fractures (O:E 2.17, 95% CI 2.16 to 2.17; low certainty); evidence for calibration for 5-year risk of clinical fragility fractures is very uncertain [154]. The AUC for 10-year prediction of hip fractures reported by the USPSTF was 0.68 (95% CI not reported) for clinical Garvan and 0.73 for Garvan + BMD [60]. For clinical fragility fractures in females, the AUC was 0.66 (95% CI 0.61–0.72) for clinical Garvan and 0.68 for Garvan + BMD [60]. Data for males are in Additional file 4. There is evidence from one study (n=34,060) to suggest that CAROC [128] (includes BMD) may be adequately calibrated to predict a category of 10-year risk of clinical fragility fracture; observed fracture risk (95% CI) was 6.4 (6.0–6.8)% in the low-risk group (<10%), 13.8 (13.1–14.5)% in the moderate-risk group (10–20%), and 23.8 (22.5–25.0)% in the high-risk group (>20%). The discrimination of this tool was not reported by the USPSTF [60]. There was very limited evidence for the remaining tools (QFracture [108, 113, 154], FRISC [149], FRC [110, 136] with and without BMD).
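
To make the CAROC comparison concrete, the sketch below checks whether the observed 10-year fracture risk in each category falls inside that category's predicted band, using the figures quoted above. Treating "adequately calibrated" as the observed point estimate lying within the band is an assumption made for illustration, as are the outer band edges of 0% and 100% and all names in the code.

```python
# (predicted band lower %, predicted band upper %, observed 10-year risk %)
# Observed values are those quoted in the text for the single CAROC study.
CAROC_GROUPS = {
    "low": (0.0, 10.0, 6.4),
    "moderate": (10.0, 20.0, 13.8),
    "high": (20.0, 100.0, 23.8),
}


def observed_within_band(groups):
    """Return, per category, whether the observed risk lies in the predicted band."""
    return {name: lo <= obs <= hi for name, (lo, hi, obs) in groups.items()}


print(observed_within_band(CAROC_GROUPS))
# {'low': True, 'moderate': True, 'high': True}
```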

KQ3a: What are the benefits of pharmacologic treatments to prevent fragility fractures among adults ≥40 years?

Of 11,693 unique records retrieved by the searches for KQ3a, we assessed 211 for eligibility by full text and included 27 RCTs [172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198] (one trial of alendronate was open-label [185]) and 11 associated publications [199,200,201,202,203,204,205,206,207,208,209] (Fig. 1). Studies excluded after full text appraisal are listed with reasons in Additional file 5.

Study characteristics

Detailed study characteristics are in Additional file 6. In total, there were 10 trials of alendronate (5 or 10 mg/day, or mixed doses, or 70 mg/week for 12 to 48 months) [172, 173, 176, 177, 183,184,185, 187, 193, 197], 7 trials of risedronate (2.5 or 5 mg/day for 12 to 36 months) [179, 182, 183, 186, 189, 190, 196], 6 trials of zoledronic acid (1 to 5 mg/year [5 mg/year most commonly] for 12 to 72 months) [175, 180, 181, 188, 194, 195], and 6 trials of denosumab (60 mg/6 months, or mixed doses for 12 to 36 months) [174, 178, 185, 191, 192, 198]. About half (14/27, 52%) of the trials were multi-country [172, 175, 178, 179, 183, 184, 187,188,189,190,191, 193, 194, 196], with the remaining taking place in the USA (n=4) [173, 176, 177, 185], New Zealand (n=3) [180, 181, 195], China (n=3) [186, 197, 198], Australia (n=1) [182], India (n=1) [192], or the USA and Canada (n=1) [174].

The trials included a total of 34,317 participants (median 398, range 50 to 9931), primarily postmenopausal females with low BMD (definition variable across trials). The prevalence of prior fracture was median 16.9% (range 0 to 48%) when specified in the trials. There were only two trials of males with low BMD, one for zoledronic acid [175] and one for denosumab [191]. Most of the trials were small and probably underpowered to detect differences in fracture incidence, especially for hip fractures; analyses generally relied on one large trial per drug. Most (23/27, 85%) trials included adjunct calcium and/or vitamin D supplements in both groups (treatment and placebo). Length of follow-up for outcomes ranged from 0.5 to 6 years, which in almost all cases corresponded with the duration of treatment; rarely, the follow-up period extended 1 year beyond the end of treatment. The trials provided data for hip fractures [172, 175,176,177,178, 180, 181, 184, 186, 187, 189,190,191,192,193, 195,196,197,198], clinical fragility fractures [172,173,174,175, 177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196, 198], clinical vertebral fractures [172, 174, 176,177,178, 180, 181, 184, 186, 191, 192, 194,195,196,197], all-cause mortality [174,175,176,177,178, 180, 185, 188, 191, 192, 195,196,197,198], and health-related quality of life [178]; no trials reported on fracture-related mortality or functionality and disability. Discontinuation due to adverse events and serious and non-serious adverse events are addressed in KQ3b.

The risk of bias ratings for the trials included for KQ3a are in Additional file 7. One of the main risk of bias concerns was selective reporting, as many trials lacked protocols and did not pre-specify fractures as an outcome of interest (either in a protocol or in the “Methods” section); instead, these were often collected as potential harms. In these cases, it was often unclear whether the fracture outcomes were collected prospectively or systematically [172, 173, 176, 180, 181, 185, 190,191,192, 197, 198]. Several trials were at high risk of attrition bias, due to large or imbalanced losses to follow-up for various outcomes [172, 173, 179, 180, 186, 189,190,191]. One trial of alendronate was open-label [185] and therefore was at high risk of performance and detection biases. When applicable (“all bisphosphonates” analyses), we assessed for small study bias and this was not detected.

Findings

Additional file 4 contains the full analysis details for KQ3a, including GRADE Evidence Profiles and Summary of Findings Tables, with explanations for each rating and forest plots.

Bisphosphonates

In postmenopausal females at risk of fragility fractures, the risk of hip fractures may be reduced by median 2 (range 1 to 6) years of treatment with bisphosphonates as a class (alendronate, risedronate, or zoledronic acid; 14 RCTs; n=21,038; 2.9 fewer in 1000, 95% CI 4.6 fewer to 0.9 fewer; NNT=345; low certainty) compared to placebo [48, 172, 176, 177, 180, 181, 184, 186, 187, 189, 190, 193, 195,196,197, 201, 209]. Data for individual bisphosphonates showed that median 3 (range 1 to 3) years of treatment with risedronate may reduce the risk of hip fractures (4 RCTs; n=9,672; 7.9 fewer in 1000, 95% CI 13.0 fewer to 1.5 fewer; NNT=127; low certainty), but median 2 (range 1 to 4) years of treatment with alendronate and median 2 (range 2 to 6) years of treatment with zoledronic acid may not reduce the risk of hip fractures (low certainty). Within-study subgroup analyses were available for alendronate [177] and risedronate [189] (1 trial each) by age and baseline risk (BMD, prevalent fractures). These were not considered to be credible as they were available only in single trials (no evidence of consistency), may not have been adequately powered, and were not necessarily pre-specified (Additional file 4). One trial in males (n = 1199) showed that 2 years of treatment with zoledronic acid may not reduce the risk of hip fractures [175].

The risk of clinical fragility fractures in postmenopausal females is probably reduced by median 2 (range 1 to 6) years of treatment with bisphosphonates as a class (19 RCTs; n=22,482; 11.1 fewer in 1000, 95% CI 15.0 fewer to 6.6 fewer; NNT=90; moderate certainty) [172, 173, 177, 179,180,181,182,183,184, 186,187,188,189,190, 193,194,195,196, 200, 201, 203, 206, 209], median 2 (range 1 to 4) years of treatment with alendronate (8 RCTs; n=8854; 14.7 fewer in 1000, 95% CI 24.5 fewer to 2.6 fewer; NNT=68; moderate certainty) [172, 173, 177, 183, 184, 187, 193, 200, 203, 206, 209] and median 2 (range 1 to 6) years of treatment with zoledronic acid (5 RCTs; n=3,218; 20.1 fewer in 1000, 95% CI 27.6 fewer to 9.9 fewer; NNT=50; moderate certainty) compared to placebo [180, 181, 188, 194, 195, 201]. Median 2 (range 1 to 3) years of treatment with risedronate may reduce the risk of clinical fragility fractures (7 RCTs; n=10,572; 7.8 fewer in 1000, 95% CI 12.5 fewer to 2.3 fewer; NNT=128; low certainty) [179, 182, 183, 186, 189, 190, 196]. The analyses were robust to sensitivity analysis using only “nonvertebral fractures” in one trial of zoledronic acid where nonvertebral and vertebral fractures had been summed to determine the total number of people with fractures (could overestimate) [195]. One trial in males (n = 1199) showed that 2 years of treatment with zoledronic acid may not reduce the risk of clinical fragility fractures [175].

The risk of clinical vertebral fractures among postmenopausal females may be reduced by median 2 (range 1 to 6) years of treatment with bisphosphonates as a class (11 RCTs; n=8921; 10.0 fewer in 1000, 95% CI 14.0 fewer to 3.9 fewer; NNT=100; low certainty) [172, 176, 177, 179,180,181, 184, 194,195,196,197, 201, 203] and median 2 (range 1 to 6) years of treatment with zoledronic acid (4 RCTs; n=2367; 18.7 fewer in 1000, 95% CI 25.6 fewer to 6.6 fewer; NNT=53; low certainty) [180, 181, 194, 195]. The evidence for alendronate [172, 176, 177, 184, 197, 203] and risedronate [179, 196] is very uncertain. There were no studies in males that reported on clinical vertebral fractures.

Bisphosphonates as a class may not reduce the risk of all-cause mortality in postmenopausal females compared to placebo over 1 to 6 years of follow-up [176, 177, 180, 185, 188, 195,196,197, 202, 206]. Evidence for individual bisphosphonates is very uncertain (including for zoledronic acid in males).

Denosumab

In postmenopausal females, the risk of hip fractures may not be reduced by median 1 (range 0.5 to 3) years of treatment with denosumab compared to placebo [178, 192, 198, 199, 207]. Within-study subgroup analyses were available by age, baseline BMD and FRAX score from one trial [178], but were not considered credible because the subgroup effects have not been replicated in other trials and there is therefore no evidence that they are consistent (Additional file 4). The risk of clinical fragility fractures is probably reduced by median 1.5 (range 0.5 to 3) years of treatment with denosumab (6 RCTs; n=9473; 9.1 fewer in 1000, 95% CI 12.1 fewer to 5.6 fewer; NNT=110; moderate certainty) [174, 178, 185, 192, 198, 206, 207]. This analysis was robust to sensitivity analysis using only “nonvertebral” fractures for one trial [178] where vertebral and nonvertebral were summed to determine the total number of people with fractures. The risk of clinical vertebral fractures is probably reduced by median 1.5 (range 0.5 to 3) years of treatment with denosumab (4 RCTs; n=8639; 16.0 fewer in 1000, 95% CI 18.6 fewer to 12.1 fewer; NNT=62; moderate certainty) [174, 178, 192, 204, 205]. Denosumab probably does not reduce the risk of all-cause mortality over 0.5 to 3 years of follow-up [174, 178, 185, 192, 198, 205,206,207], and probably makes little-to-no difference in health-related quality of life over 3 years of follow-up [208]. The evidence for the effect of denosumab on the incidence of fractures (hip, clinical fragility, clinical vertebral) and all-cause mortality from one trial in males (n=242) [191] is very uncertain.

KQ3b: What are the harms of pharmacologic treatments to prevent fragility fractures among adults ≥40 years?

Of 721 unique records retrieved by the searches for KQ3b, we assessed 85 for eligibility by full text with 31 systematic reviews and one primary study meeting our eligibility criteria (Fig. 1). After reviewing these for key characteristics, we included 10 systematic reviews [60, 210,211,212,213,214,215,216,217,218], 3 associated publications [37, 48, 49], and one primary study on rebound fractures after discontinuation of denosumab [219]. Reviews excluded after full text appraisal, as well as systematic reviews that met inclusion criteria but were not selected for the overview, are listed with reasons in Additional file 5.

Study characteristics

Characteristics of the systematic reviews and primary study are in Additional file 6. The systematic reviews were published between 2014 and 2020 and included either only RCTs [212, 216, 217] or a mix of RCTs and observational studies [60, 211, 213,214,215, 218]; occasionally, only observational studies were included when no RCTs existed for rare harms [210]. The systematic reviews were generally focused on patients (males or females) with low BMD (often referred to as osteoporosis) or who had risk factors for fracture, though some included wider populations (e.g., patients with chronic use of glucocorticoids); in many cases, patients with other disorders of bone metabolism were excluded. Across the systematic reviews, risk of bias was usually not assessed specifically for harm outcomes (assessed in 3 reviews [210, 216, 217]), and certainty of evidence was assessed for selected outcomes in only three of the systematic reviews [60, 211, 215]. Notably, no evidence (either no systematic reviews, or the systematic reviews located no primary studies) was located for the following outcome comparisons: serious stroke and thromboembolic events, atypical femoral fractures, osteonecrosis of the jaw, or myalgia, cramps, and limb pain with risedronate; serious gastrointestinal adverse events, gastrointestinal cancer, pulmonary embolism, and thromboembolic events with zoledronic acid; osteonecrosis of the jaw with long-term bisphosphonates as a class; serious gastrointestinal adverse events, gastrointestinal cancer, thromboembolic events, cardiac death, and rebound hip fractures with denosumab. The primary study on rebound fractures (multiple vertebral fractures) after discontinuation versus persistence of denosumab was a retrospective cohort study of 3110 individuals (91% females; mean age 72 years; 42% with prior fracture; denosumab as first-line therapy for 5.4%) conducted in Israel.

The appraisals of the quality of the systematic reviews and primary study included for KQ3b are shown in Additional file 7. Common methodological concerns across the reviews were potential errors in data extraction (because data were not collected in duplicate), limited description of the characteristics of the included studies, and lack of risk of bias appraisal (or risk of bias was assessed for benefits but not for harms). The primary study did not adjust findings for potential confounders, though the groups were shown to be comparable across multiple characteristics.

Findings

Additional file 4 contains the full analysis details for KQ3b, including GRADE Evidence Profiles and Summary of Findings Tables, with explanations for each rating.

Bisphosphonates

The evidence was very uncertain for many adverse events, for example gastrointestinal cancers and several of the serious cardiovascular events. Compared to no treatment, alendronate may increase the risk of atypical subtrochanteric (0.08 more in 1000, 95% CI 0.05 more to 0.14 more; systematic review of 1 cohort; n=220,360; NNH=12,500; low certainty) [215] and femoral shaft fractures (0.06 more in 1000, 95% CI 0.03 more to 0.10 more; systematic review of 1 cohort; n=220,360; NNH=16,667; low certainty) [215], and osteonecrosis of the jaw (systematic review of 1 cohort; n=220,360; 0.22 more in 1000, 95% CI 0.04 more to 0.59 more; NNH=4545; low certainty) [215]. The evidence for bisphosphonates as a class showed similar findings [48, 49, 211, 215]. The risk of “any serious adverse event” (composite outcome) is probably not increased with risedronate [37, 60] and zoledronic acid [37, 60] and may not be increased with alendronate [37, 60]. The risk of certain serious gastrointestinal adverse events (perforations, ulcers, and bleeds; serious esophageal) may not be increased with alendronate [48, 49, 211]. The risk of stroke and myocardial infarction probably does not increase with bisphosphonates as a class [216]; certainty was low for little-to-no difference in other serious cardiovascular events from individual drugs and from the drug class [48, 49, 211, 216].

The risk of non-serious gastrointestinal adverse events is probably increased by treatment with alendronate (systematic review of 50 RCTs; n=22,549; 16.3 more in 1000, 95% CI 2.4 more to 31.3 more; NNH=61; moderate certainty) [48, 49, 211], but probably not by treatment with risedronate [48, 49, 211]. Non-serious adverse events (composite outcome) are probably increased by treatment with zoledronic acid (systematic review of 6 RCTs; n=9575; 51.8 more in 1000, 95% CI no difference to 112.2 more; NNH=19; moderate certainty) [212], related to the potential increased risk of multiple influenza-like symptoms [48, 49, 211] including pyrexia [212], headache [212], chills [48, 49, 211], arthritis and arthralgia [48, 49, 211], and myalgia [48, 49, 211] (low-to-moderate certainty). With the exception of zoledronic acid, the risk of “any non-serious adverse event” (composite outcome) [212] and discontinuation due to adverse events [37, 60] do not appear to be increased by treatment with bisphosphonates (low-to-moderate certainty).
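The NNH point estimates above can be read alongside their confidence intervals. The sketch below assumes NNH = 1000 / (additional events per 1000) and converts each CI bound the same way; the function names and the treatment of a "no difference" bound as an unbounded NNH are illustrative assumptions, not calculations reported in this review.

```python
import math


def nnh_from_ari_per_1000(more_per_1000: float) -> float:
    """Number needed to harm from an absolute risk increase per 1000 people.

    Hypothetical helper: returns infinity when the increase is zero,
    i.e., no number of treated people is expected to yield one extra event.
    """
    if more_per_1000 < 0:
        raise ValueError("expected a non-negative risk increase")
    if more_per_1000 == 0:
        return math.inf
    return 1000 / more_per_1000


def nnh_with_range(point: float, ci_low: float, ci_high: float):
    """Return (rounded NNH, lowest plausible NNH, highest plausible NNH)."""
    # A larger risk increase means a smaller NNH, so the CI bounds swap order.
    return (
        round(nnh_from_ari_per_1000(point)),
        nnh_from_ari_per_1000(ci_high),
        nnh_from_ari_per_1000(ci_low),
    )


# Values reported in the paragraph above (more events per 1000).
alendronate_gi = nnh_with_range(16.3, 2.4, 31.3)
zoledronic_any = nnh_with_range(51.8, 0.0, 112.2)  # CI lower bound: no difference

for label, (nnh, best, worst) in {
    "alendronate, non-serious GI events": alendronate_gi,
    "zoledronic acid, any non-serious event": zoledronic_any,
}.items():
    print(f"{label}: NNH = {nnh} (range {best:.0f} to {worst:.0f})")
```

Under these assumptions, a CI that reaches "no difference" corresponds to an NNH range with no upper limit, which is one way to see why the zoledronic acid estimate is less precise than its point NNH of 19 suggests.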

Denosumab

The evidence was very uncertain for many adverse events, including serious infections [37, 60], venous thromboembolism [213], and rebound fractures after denosumab discontinuation [219]. Treatment with denosumab may not increase the risk of “any serious adverse event” (composite outcome) [37, 60] and does not appear to increase the risk of serious cardiovascular outcomes (stroke and various composite outcomes) [48, 49, 211, 213, 217] (low certainty).

The risks of non-serious gastrointestinal adverse events (systematic review of 3 RCTs; n=8454; 64.5 more in 1000, 95% CI 26.4 more to 113.3 more; NNH=16; moderate certainty) [48, 49, 211], rash or eczema (systematic review of 3 RCTs; n=8454; 15.8 more in 1000, 95% CI 7.6 more to 27.0 more; NNH=63; moderate certainty) [37, 60], and infections (any serious or non-serious; systematic review of 4 RCTs; n=8691; 1.8 more per 1000, 95% CI 0.1 more to 4.0 more; NNH=556; moderate certainty) [48, 49, 211] are probably increased by treatment with denosumab. Risks of any non-serious adverse event (composite outcome) [212, 213] and discontinuation due to adverse events [37, 60] do not appear to be increased by treatment with denosumab (moderate and low certainty, respectively).

KQ4: For adults ≥40 years, what is the acceptability of screening and/or initiating treatment to prevent fragility fractures when considering the possible benefits and harms from screening and/or treatment?

Of 8794 unique records retrieved by the searches for KQ4, we assessed 146 for eligibility by full text and included 12 studies (5 cross-sectional [220,221,222,223,224], 4 cohort [225,226,227,228], 3 RCTs [229,230,231]) and one associated publication of another study [53] (Fig. 1). Studies excluded after full text appraisal are listed with reasons in Additional file 5.

Study characteristics

Detailed study characteristics are in Additional file 6. Half of the 12 studies were conducted in the USA (6/12, 50%) [221, 223, 225,226,227, 231]; the remaining were conducted in New Zealand (n=3) [222, 229, 230], Canada (n=1), the Netherlands (n=1) [220], and China (n=1) [224]. Across all studies, a total of 2188 participants (median 204, range 30 to 393) were included, primarily postmenopausal females. In three studies [222, 224, 230], both males and females were included. One study reported on the acceptability of screening among females who would be considered to be at low risk based on age (mean 57 years, range 50 to 65 years) [231]. The remaining 11 studies elicited patients’ views on the acceptability of initiating pharmacologic treatments. In four (36%) of these studies, patients who were at risk for fracture based on BMD (T-score in osteoporosis or osteopenia range, definitions varied across studies) and were aware of their 10-year major osteoporotic and/or hip fracture probability were provided decision aids and were in the position to make real-life decisions about starting treatment. In the remaining studies, the decisions about starting treatment were based on hypothetical scenarios; patients in these studies were not always made aware of their fracture risk and would not necessarily have been eligible for treatment [220,221,222,223,224, 229,230,231].

The risk of bias assessments for studies included in KQ4 are in Additional file 7. Four studies were at high risk of bias due to low participation rates (<40% of those eligible) [222, 223, 229, 231]. Three studies were at high risk of bias because they provided participants no or inaccurate (based on our comparison to currently available evidence) information on the potential benefits or harms of treatment—we required information on at least one of benefits or harms for inclusion [222, 227, 230]. Two studies were considered to be at high risk of bias because they did not present findings for important subgroups of interest (e.g., baseline fracture risk) for whom results may be expected to differ [225, 227]. Other risk of bias concerns were infrequent.

Findings

Additional file 4 shows the full analysis details for KQ4, including GRADE Summary of Findings Tables, with explanations for each rating. One RCT (n=258) [231] that included females aged 50–65 years (low risk based on age) revealed that this population had a strong intention to be screened over the next 5 years (mean [standard deviation] intention score 3.74 [0.96]/5). Participants were then provided a 1-page decision support sheet containing information on benefits in one of four formats (words, numbers, narrative, or framed narrative in terms of benefits of not screening). The sheet indicated that screening and treatment would be associated with a reduction in the risk of hip fractures by 2 per 1000 or “very few” females, and a reduction in other fractures in “few” females over 10 years. Risks were described as the potential for worry, minor stomach upset, and muscle or joint pain. Serious harms were described as rare: osteonecrosis of the jaw in 1 to 10 per 1000 or “very few” females and atypical fractures in 5 per 1000 or “very few” females over 10 years. Overdiagnosis was presented by showing that the incidence of low bone density (labelled as osteoporosis) exceeded important fracture outcomes. After reviewing the decision support sheet, participants’ intention to screen did not change substantially and also did not differ based on the format of information provided (1 study, n=258; low certainty) [231].

Seven observational studies and two RCTs (n=1930; sample size uncertain in one study) [220, 221, 224,225,226,227,228,229,230] reported on the acceptability of treatment. In five studies (n=1010) [220, 221, 224, 229, 230], adults (primarily females) ≥50 years old were provided information on the benefits and harms of treatment in various formats; not all participants in these studies were considered to be at high fracture risk or eligible for treatment. In these studies, patients were asked to make hypothetical treatment decisions, with results of three studies showing that patients’ preference for treatment versus no treatment may be highly variable (3 studies, n=317; low certainty) [220, 221, 224]. Two other studies showed that after receiving information on their personal fracture risk (median [IQR] 10-year hip fracture risk 2.2 [0.5–2.7%] in one study, 5-year hip fracture risk 1.4 [0.8–3.0%] in the other), relatively few (19 to 39%) patients may be willing to accept treatment (2 studies, n=593; low certainty) [229, 230]. In the four remaining studies (n=324; sample size uncertain in one study), postmenopausal females with low bone density (labelled as osteoporosis or osteopenia) who were in the position to make real-life decisions about treatment were provided decision aids outlining the potential benefits and harms of treatment. These studies showed that few (5–20%) eligible patients who read decision aids and are aware of their fracture risk are willing to initiate treatment (2 studies, n=240; sample size uncertain in one study) [227, 228], but that somewhat more may be willing to start treatment when the decision aid is used during a clinical encounter (4–44% acceptance; 2 studies, n=84) [225, 226] or when they have had a previous fracture or are at higher fracture risk (32–45%; 1 study, n=208) [53, 228]. Overall, a minority of postmenopausal females at increased risk for fracture may accept treatment (moderate certainty).

Three observational studies (n=741) [220, 222, 224] reported on the minimum acceptable benefit of treatment among adults ≥50 years (mean 60 to 72 years) provided hypothetical scenarios about the benefits and harms of anti-osteoporosis treatment. These studies indicated that about two-thirds (64%) of adults ≥50 years may have overly optimistic views of the benefits of treatment (1 study, n=354) [222] and that these views may be highly variable (3 studies, n=741; low certainty) [220, 222, 224]. For example, one study reported that patients may require a reduction of 20 to 200 fractures per 1000 to consider 10 years of bisphosphonate treatment with no major side effects to be acceptable (1 study, n=354; low certainty) [222].

Six observational studies (n=1091) [53, 220, 223, 226, 229, 230] reported on the level of risk at which treatment would be considered acceptable among adults (97% female) ≥45 years old who were aware of their personal fracture risk but not necessarily at high risk or making real-life treatment decisions. These studies reported that there is large heterogeneity in the level of risk at which treatment would be considered to be acceptable (6 studies, n=1091; low certainty) [53, 220, 223, 226, 229, 230]. Many patients (19 to 51%) are willing to accept treatment even at low levels of fracture risk (5 to 20%); meanwhile, a large proportion (44 to 68%) of high-risk females (≥3% hip or ≥20% osteoporotic fracture risk; ≥30% in one study) would choose not to be treated (3 studies, n=378; low certainty) [53, 226, 229].

Discussion

Summary of principal findings for screening

In this review, we found that among a selected population of females aged 65 years and older who are willing to independently complete a mailed questionnaire about personal risk factors, an offer of 2-step screening using a fracture risk assessment tool (clinical FRAX) followed by assessment of BMD in those at increased risk (and treatment initiated based on various criteria) probably reduces the risk of hip (6.2 fewer in 1000, NNS=161) [4,5,6, 68, 91] and clinical fragility fractures (5.9 fewer in 1000, NNS=169) [4,5,6, 91] over 3 to 5 years of follow-up. The evidence is very uncertain for younger females [90] and for males [68]. A mailed offer of screening to females aged 68 to 80 years, where 54% returned a completed questionnaire and were eligible, may not reduce the risk for hip or clinical fragility fractures over 5 years of follow-up [5]. Screening does not appear to make any difference in the risk of all-cause mortality or in wellbeing (very uncertain for younger females). The findings for the selected population (willing to independently complete clinical FRAX) are similar to those of a 2020 systematic review that pooled data only from the three most recent trials [7]. Minimal evidence related to the potential harms of screening is available; in one trial [6] no serious adverse events were reported but these did not appear to be collected systematically. Among selected females offered screening, 12% of those meeting age-specific treatment thresholds based on FRAX 10-year hip fracture risk, and 19% of those meeting thresholds based on FRAX 10-year major osteoporotic fracture risk, may be overdiagnosed according to our definition [4, 6, 59]. We did not locate convincing evidence to recommend one method of screening over another, although the evidence from the trials supports the use of clinical FRAX followed by BMD assessment in those at increased risk.

Clinical considerations and implications

There appeared to be a considerable amount of ad hoc screening (and subsequent treatment; median 17%) in the control groups of the included trials; it is possible that the magnitude of effect would have been larger with a true “no screening” comparator. In all of the trials, the rate of completion of mailed risk assessment tools was low (generally less than two-thirds of those who were sent the tool), and 8 to 29% of those eligible for BMD did not attend [4,5,6]. There appeared to be a healthy selection bias in several of the trials. For example, in the SALT trial 25% of those who were offered DXA did not participate, and non-participants were among those at the highest fracture risk on clinical FRAX [4]. In the ROSE trial, the majority of fractures occurred in those who did not return the initial mailed risk assessment questionnaire [5]. In our review of the acceptability of screening, we similarly found that low risk (based on age) females have a high intention to be screened [231], but unfortunately we found no studies reporting on the intentions of higher-risk females. An analysis of non-participants in the ROSE trial showed that those who declined DXA scans were older, more likely to have comorbid conditions, had lower socioeconomic status, and were more likely to smoke and have high alcohol consumption [232]. Many of these factors may also place a person at increased risk for fracture. There are multiple reasons for which a person may choose not to be screened. For example, lack of interest may be related to a low perception (and perhaps underestimation) of personal fracture risk [232], the belief that low bone density is not a serious health issue [233], and fears of the potential serious harms of treatment despite their rare occurrence [234]. If screening for fracture risk is believed to be important, there may be a need to improve its accessibility for those at highest risk, and to attempt shared decision-making on the benefits and harms.

The mechanism by which the small reductions in fracture risk were achieved by screening is uncertain in light of other findings of this review. For example, among postmenopausal females, we found that treatment with bisphosphonates as a class may result in small reductions in the risk of hip (2.9 fewer in 1000; NNT=345) and clinical fragility fractures (11.1 fewer in 1000; NNT=90), of a magnitude similar to that seen in the screening trials, where only a small proportion of females were eligible for treatment and treated for a clinically meaningful length of time. In the screening context, we also observed an absolute risk reduction for hip fractures (6.2 per 1000) that was of similar magnitude to the reduction in clinical fragility fractures (5.9 per 1000) among females who independently completed the FRAX tool. The plausibility of this finding is difficult to ascertain. Notably, the one trial finding a statistically significant reduction in hip fracture risk with screening (SCOOP) did not find a similar reduction in the risk of clinical fractures [6], an outcome that occurs more frequently than hip fractures. It is possible that participants in this trial were better selected to benefit in terms of hip fracture reduction, because FRAX 10-year hip fracture risk was used in treatment thresholds, as opposed to 10-year major osteoporotic fracture risk used in the other trials. It is also possible that the treatments used in the trials were more effective at reducing hip rather than other clinical fractures, or simply that hip fractures were more reliably reported and ascertained than other fractures. Uncertainty remains because the trials do not provide information on which particular participants sustained fractures (i.e., those at increased risk or otherwise). Females in the screening trials may have been at higher risk overall than in the treatment trials due to older age (e.g., in SCOOP all were ≥70 years), though this is difficult to ascertain.

The effectiveness of screening may depend on uptake and persistence with anti-fracture treatments among those at high risk [50], but this tends to be suboptimal and declines with longer durations of treatment [51]. In the three more recent screening trials, uptake of anti-fracture drugs ranged from 69 to 80% of those with a treatment indication [4,5,6]; however, these values could be overestimates as they were based on self-reports and prescription records. Longer-term follow-up from the SALT trial showed that by 36 months less than half (43%) of those at high risk reported using anti-osteoporosis drugs [4]. In the larger treatment trials, full compliance with treatment was somewhat higher, ranging from about 50 to 80% [177, 178, 189, 195]. One hypothesis is that the benefits seen from screening might be the result of unmeasured variables. For example, participation in screening may have provoked alterations in health behavior that helped participants to avoid fractures [235], like increasing weight-bearing exercise, stopping smoking, or taking preventive action to reduce the risk of falls. Post hoc analyses from the SCOOP trial showed, however, that screening had no significant impact on the risk of falls [236], and that the intervention was most beneficial in those at highest baseline hip fracture risk and those with prior fracture [91]. These findings suggest that the reduction in fracture risk seen with screening may be more related to treatment uptake and adherence (even if suboptimal) than other risk-reducing behaviors. It remains unclear from the trials whether the patients who sustained fractures were those who undertook treatment. It should be noted that decreased fracture risk in our review was only seen among highly motivated participants (those completing the clinical FRAX independently or accepting screening with BMD) who are probably more likely to adhere to treatment than the general population. 
The recent screening RCTs focused on treatment using first-line pharmacologic treatment and it is unclear what the impact may have been, if any, if they replaced this with or added therapies including vitamin D and calcium and/or interventions designed to prevent falls (e.g., exercise) or fractures from falls (e.g., hip protectors).

Predictive value of screening strategies

If screening, overall, is believed to offer net benefit, there is limited certainty about which strategy to use. Two-step with risk assessment followed by BMD in those meeting a pre-determined risk threshold appears effective for reducing fractures, and the variable screening methods and treatment criteria in the trials suggest that some variation between strategies may be acceptable. The evidence from one comparative effectiveness trial suggests that BMD alone may be more effective than 2-step screening but we rated this evidence to be of very low certainty. The trials are most applicable to use of clinical FRAX for risk assessment and FRAX with BMD for treatment thresholds, and the evidence from KQ2 indicates that FRAX-Canada (with or without BMD) is probably well calibrated, with some underestimation, for the 10-year prediction of clinical fragility fractures [106, 111, 129]. Clinical FRAX-Canada may also be well calibrated, with some underestimation, to predict the 10-year risk of hip fracture, but the calibration of FRAX + BMD for this outcome may be poor [106, 111, 129]. One potential reason for the underestimation is lack of ability to incorporate a history of previous falls in FRAX; clinicians should be aware that those with previous falls may be at higher risk than estimated with FRAX [237]. The CAROC tool seemed to be adequately calibrated to predict a category of risk; however, it was not used in any of the included trials and requires the inclusion of BMD results. It was beyond the scope of this review to compare screening tools directly (e.g., with vs. without BMD); however, the evidence from this review showed clinical FRAX-Canada to be adequately calibrated without the addition of BMD. A review by Kanis et al. showed high concordance between risk categorization using either FRAX scores or BMD alone; people with higher scores are also generally those with a low BMD [238]. 
Also of interest is that in one of the trials (SCOOP) [6], only about one-third of those considered at high risk for 10-year hip fracture with clinical FRAX (using criteria suggested for treatment initiation in some cases [239]) were eligible for treatment (using only slightly different criteria) after their BMD results were incorporated into the predictions. Though not a focus of the current review, it is important to consider that the calibration of FRAX may vary by ethnicity. In a study using data from the Manitoba Bone Mineral Density Program registry, FRAX-Canada substantially overestimated 10-year risk of fracture in females who identified as Black or Asian as compared to White [240].
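Statements such as "well calibrated, with some underestimation" can be made concrete with an observed-to-expected (O:E) ratio, one common summary of calibration. The sketch below is illustrative only: the cohort values are hypothetical, the function names are invented for this example, and the review's own calibration analyses may have used different summaries.

```python
def observed_to_expected_ratio(predicted_risks, observed_events):
    """Ratio of observed fractures to the number expected from predicted risks.

    predicted_risks: per-person predicted probabilities (e.g., 10-year risk).
    observed_events: per-person outcomes, 1 if a fracture occurred, else 0.
    """
    if len(predicted_risks) != len(observed_events):
        raise ValueError("inputs must have the same length")
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("expected number of events must be positive")
    return sum(observed_events) / expected


def describe_calibration(ratio: float) -> str:
    """Interpret the ratio: above 1 means the tool predicted too few events."""
    if ratio > 1:
        return "underestimates risk"
    if ratio < 1:
        return "overestimates risk"
    return "matches observed risk"


# Hypothetical cohort of 10 people (NOT data from this review).
predicted = [0.05, 0.05, 0.10, 0.10, 0.10, 0.15, 0.15, 0.20, 0.20, 0.30]
observed = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

ratio = observed_to_expected_ratio(predicted, observed)
print(f"O:E ratio = {ratio:.2f} ({describe_calibration(ratio)})")
```

In this hypothetical cohort the tool predicts 1.4 fractures while 2 are observed, so the ratio exceeds 1 and the tool would be described as underestimating risk; a ratio below 1, as reported for some ethnic groups in the Manitoba registry study, would indicate overestimation.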

Treatment effects

We found that treatment of postmenopausal females in a primary prevention population (<50% with prior fracture, but who are at risk of fragility fracture) with bisphosphonates as a class probably reduces the risk of clinical fragility fractures. Notably, our conclusion for the effect of bisphosphonates on the risk of hip fractures differs from that of the USPSTF, who in 2018 reported low certainty evidence of no benefit [37]. We included additional trials in our analysis (including one large trial of zoledronic acid published after the USPSTF’s review was completed) and found a similar estimate of effect as the USPSTF but with improved precision, allowing us to conclude that bisphosphonates may reduce the risk of hip fracture. Denosumab probably reduces the risk of clinical fragility fractures and clinical vertebral fractures, but may not reduce the risk of hip fractures. The limited evidence showed that zoledronic acid may not reduce the risk of hip or clinical fragility fractures in males with low BMD, and evidence for the use of denosumab in males was very uncertain. As reported in a recent review of risedronate for primary and secondary prevention of fractures [241], the trials for individual drugs are hampered by lack of power, as most studies focused on the impact of treatment on BMD as their main outcome of interest, rather than fractures, which are then reported only as adverse events. Selection into treatment studies was often based on BMD, and no study used clinical risk scores to select patients. Similar to the screening trials, participants with prior fracture were often included, which differs somewhat from primary prevention where screening would be aimed at those without prior fracture. This review focused on estimating the effects of the treatments used as first-line therapy in the screening RCTs, which largely employed pharmacologic treatment (mostly in anticipation of poor reporting on harms in those trials).
Nevertheless, considering that most hip fractures occur as a direct result of a fall [242], preventing falls may be of value for people at high risk for fracture. The Task Force is currently developing recommendations about interventions for preventing falls [54].

Patient perspectives

Though pharmacologic treatments appear to be beneficial, the magnitude of benefit may not be felt to be important enough to make treatment acceptable to patients. The most important findings of our acceptability review were that despite a high willingness to be screened among younger females, a minority of eligible older females may be willing to undergo treatment. Additionally, there was a large degree of variability in the level of risk at which individual patients would be willing to accept treatment (given information on benefits and/or harms). Many older adults have unrealistic views about the effectiveness of treatment and may require a reduction of 20 to 200 fractures per 1000 to consider 10 years of treatment with a bisphosphonate with no major side effects; this is at least double the magnitude of reduction in risk that was observed in our meta-analyses. Overall, though it was outside the scope of our review to determine the optimal length of treatment, a recent systematic review by Fink et al. found evidence of moderate certainty for no difference in the risk of clinical fragility fracture with 5 versus 10 years of treatment with alendronate and 3 versus 6 years of zoledronic acid [215]. There appeared to be some benefit of longer (10 vs. 5 years) treatment with alendronate on the risk of clinical vertebral fractures [215].

Consideration of treatment harms and shared decision-making

Patients considering treatment should be able to weigh the proposed benefits with potential harms. We found increased risk for some non-serious adverse events; namely non-serious gastrointestinal events with alendronate; influenza-like symptoms with zoledronic acid; and non-serious gastrointestinal adverse events, dermatologic adverse events, and infections with denosumab. There was also low certainty evidence for an increased risk for the rare occurrence of atypical femoral fractures and osteonecrosis of the jaw with bisphosphonates (most evidence for alendronate). A concern about the risk of rebound fractures, and in particular multiple vertebral fractures, after cessation of treatment with denosumab has been raised by clinical experts [218, 243]. This requires more research focus as to date there is only minimal empiric evidence of very low certainty addressing these concerns; this finding was based on one available trial that compared discontinuation of denosumab with discontinuation of placebo (FREEDOM and its extension) [178, 244]. In this study, findings from patients initially randomized to denosumab or placebo who participated in the extension were analyzed for the occurrence of fractures after voluntary discontinuation (i.e., non-random sample). Ideally, trials would follow randomized participants from treatment initiation through an adequate time period after discontinuation to fully understand the net impact of denosumab treatment and subsequent discontinuation on the risk of fractures. The findings of our review also substantiate the large heterogeneity in the level of risk at which patients may accept treatment [52]. The finding that patients’ decisions about treatment may not correspond with guideline-recommended treatment thresholds [53, 225,226,227], and awareness of the complexity of decisions about treatment [245], supports the importance of shared decision-making about screening and subsequent treatment. 
A recent study of decision-making for osteoporosis treatment showed that allowing patients to make autonomous decisions after being provided information on the benefits and harms of treatment can result in better persistence with medication [246]. Most (91%) of the females in this study who started pharmacotherapy continued to be treated after 1 year of follow-up [246].

Strengths and limitations

We comprehensively reviewed evidence related to the benefits and harms of screening for the primary prevention of fragility fractures by first considering direct evidence from screening trials, and supplementing this with reviews on the accuracy of risk assessment, benefits and harms of treatment, and patient perceptions of the acceptability of screening and treatment. To our knowledge, this is the first systematic review to synthesize evidence on the calibration of fracture risk assessment tools. We implemented rigorous searches to locate all potentially relevant studies; though our searches were limited to English and French language studies, this has been shown not to bias the effect estimates from meta-analyses [247]. We limited our update search for the accuracy of risk assessment tools to Canadian studies because these were thought to be the most relevant; the studies included for other tools were all affected by serious risk of bias, such that conclusions were unlikely to be impacted by this limit. We did not update the evidence for KQ3a on the benefits of treatment because these data did not weigh heavily in the Task Force’s decision-making for their guideline on screening, for which there were several RCTs. Since we took a rapid approach to KQ3b (harms of treatment), there is a small possibility that relevant systematic reviews were missed or that minor errors were overlooked; by using an experienced reviewer, we reduced the likelihood of major omissions that would impact the findings [248]. It is also possible that the evidence for this KQ was less up to date (versus using primary studies) or did not examine all outcomes of interest that could be available in primary studies; moderate certainty of evidence would suggest stable findings for several outcomes.
For KQ2 (accuracy of risk prediction tools), we did not review discrimination as it was not rated as critical or important by the Task Force; reported findings from the USPSTF review [60] are therefore less up to date.

There was some indirectness in our findings due to populations, interventions, comparators, and outcomes differing from those of primary interest. Our findings focus mainly on a selected population of patients who completed a mailed clinical FRAX tool independently and who are likely to be more compliant with screening and potentially treatment than the general population. This differs to some extent from clinical practice, where ideally decisions about screening would be made through shared decision-making between patients and providers, after which patients would have the opportunity to consider their level of risk, along with their perceived benefits and harms of treatment. In addition, some participants in the screening trials had previously used anti-osteoporosis drugs, and the comparator included ad hoc treatment. Across all KQs, the ascertainment of clinical fragility fractures was problematic; definitions differed across studies and in some cases could have included non-clinical vertebral fractures, or other fractures that were not related to fragility (e.g., due to trauma). Our findings were robust to sensitivity analyses removing studies with unclear ascertainment of outcomes, or including only a single type of fracture (e.g., if multiple were added to determine a total number, rather than number of patients with ≥1 fracture). There was concern for selective reporting across some outcomes. Minimal discussion of potential harms was included across the screening trials; in the treatment trials, it was often unclear whether fracture data were collected systematically, and many did not report on clinical vertebral fractures (though this information should have been available).

The evidence in this review is most applicable to postmenopausal females aged 65 and over. We located very limited evidence for males and younger females, and there were no screening trial data specific to females aged 55 to 65 years. In addition, though one trial provided evidence of increased effectiveness of screening among those at higher baseline risk, there is a need for analyses from other trials to substantiate these findings. There is a need for robust comparative effectiveness trials to inform the most effective screening strategy. Examining whether different treatment approaches may positively impact effects for those at high risk based on screening for fracture risk, especially for those individuals nonadherent or uninterested in anti-osteoporosis medications, may also be of value.

Conclusion

Screening in primary care using clinical FRAX, followed by BMD assessment in those at increased risk, among selected females aged 65 years and older who are likely to be more compliant with screening (as ascertained by their willingness to independently complete a risk assessment questionnaire) probably results in a small reduction in the risk of clinical fragility fracture and hip fracture compared to no screening. This may differ to some extent from clinical practice, where healthcare providers would ideally engage in shared decision-making about screening and discuss the results of fracture risk estimation, as well as the risks and benefits of treatment, during the patient consultation. A mailed offer of screening in the general population, where uptake was relatively low, did not improve any patient-important outcomes. Minimal information on harms is available, although our calculated estimates of overdiagnosis were 12% and 19% for hip and major osteoporotic fractures, respectively. The mechanism of the reduction in risk with screening is not fully clear, though there is some evidence to suggest it may be attributed to pharmacologic treatment rather than a reduction in falls or other risk behaviors. It is not clear which screening strategy would be most beneficial. The screening trials used diverse criteria when deciding to whom to offer treatment. There is some evidence for clinical FRAX and FRAX + BMD being adequately calibrated (particularly for clinical fragility fractures), with some underestimation, among Canadian studies; CAROC seems adequately calibrated to predict a category of risk and requires a BMD measurement. Treatment with bisphosphonates in primary prevention populations (at risk, but without prior fracture) probably reduces the risk of clinical fragility fractures and may reduce the risk of hip fractures and clinical vertebral fractures among postmenopausal females, to a similar magnitude as seen in the screening trials.
Denosumab probably reduces the risk of clinical fragility fractures and clinical vertebral fractures but may not reduce the risk of hip fractures in postmenopausal females; evidence for males is very uncertain. Females at low risk seem to have a high willingness to be screened but there is large heterogeneity in the level of risk at which higher-risk patients would accept treatment, supporting a shared decision-making approach. The findings of this review will be used, among several other considerations (e.g., information on issues of feasibility, acceptability, costs/resources, and equity) by the Canadian Task Force on Preventive Health Care to inform recommendations on screening for the prevention of fragility fractures among adults 40 years and older in primary care in Canada.