Background

Among other issues, ageing within the population poses a major burden on healthcare due to the increasing prevalence of frailty among older people [1]. Frailty is defined as a state of increased vulnerability due to somatic, environmental or psychosocial factors [2]. To better accommodate the complex care needs of frail, older people, a transition towards proactive, population-based care is required, which will improve clinical outcomes and cost-effectiveness [3, 4]. To facilitate this care transition, general practitioners (GPs) must be capable of identifying frail older patients within their daily clinical practice.

The Frailty Index (FI) is one of the screening tools for frailty [5]. An FI comprises a list of health deficits (e.g. symptoms, signs, impairments, and diseases) that are indicative of frailty. The proportion of deficits present forms the patient’s FI score, which can range from zero to one [6]. When an FI consists of at least 30 deficits, different numbers and types of deficits may be used without major influence on the properties of the FI, which enables application in and comparison between different datasets [7].
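The scoring rule described above can be illustrated with a minimal sketch. The function name `frailty_index` and the 40-item example list are illustrative assumptions and are not taken from any of the included studies; the only rule borrowed from the text is that the FI score is the proportion of deficits present.

```python
def frailty_index(deficits):
    """Return the proportion of deficits present, a value from 0.0 to 1.0.

    `deficits` maps a deficit name to True (present) or False (absent).
    """
    if not deficits:
        raise ValueError("at least one deficit is required")
    present = sum(1 for is_present in deficits.values() if is_present)
    return present / len(deficits)


# Hypothetical 40-item deficit list in which the first 8 deficits are present.
example = {f"deficit_{i}": i < 8 for i in range(40)}
score = frailty_index(example)  # 8 of 40 deficits present
```

Because the score is a proportion, lists of different lengths yield values on the same zero-to-one scale, which is what allows comparison between datasets.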

There is considerable debate over whether the FI can be used for frailty screening in daily primary care. Some authors have stated that the FI has not been validated in this setting, that the instrument is of limited value due to its perceived complexity, that the FI has only moderate discriminative ability, and that other frailty instruments, such as the Tilburg Frailty Indicator, are more promising [8–11]. Others have argued that the FI is a significant predictor of adverse health outcomes, that it covers all important frailty factors, that it can be easily derived from routine administrative healthcare data, and they have called for further exploration of the FI’s merits in primary care [12–14].

To further assess the potential of the FI as a screening and monitoring instrument for frailty in primary care, knowledge of its characteristics is essential. Therefore, we performed a systematic review of the literature and assessed the psychometric properties of the FI in identifying frailty among community-dwelling older people.

Methods

Search strategy, selection criteria and data extraction

We searched the Cochrane, PubMed, and Embase databases using the terms ‘frailty AND (index OR deficit OR deficits OR cumulative OR accumulation)’. We searched for studies published from August 8th, 2001 onwards, which is the publication date of the landmark study presenting the FI concept [6]. The search was limited to studies in English, and databases were searched until October 30th, 2012. The first and third author (ID and GK) screened titles and abstracts independently and selected studies for full-text assessment. These full-text studies were assessed by the first author for inclusion, and in cases where doubt existed, an independent assessment by the last author (MS) followed. Citations from the included articles were also searched for additional relevant publications by the first author. Eligibility disagreements were resolved by consensus.

Studies were included that met the following criteria: first, the studies focused on an FI. The FI was defined as a list of health deficits for which patients were screened and that provided an FI score that reflected the proportion of deficits present on the predefined list [6]; second, only original research was included that assessed one of the following psychometric properties of the FI: criterion validity, construct validity or responsiveness; third, the studies focused primarily on community-dwelling older people. Community-dwelling older people were defined as older people who lived independently at home; older people who lived at home while receiving home care; and older people living in assisted living facilities. In the Netherlands, GPs provide care to older people in all these different living situations, and virtually all older people in these living situations are registered with a general practice. Studies were excluded when the FI was based on a comprehensive geriatric assessment (CGA), because it is not feasible to perform a CGA for all older patients in general practice. Also, studies were excluded when the entire study population was living in a nursing home, was hospitalized or was selected because of one specific disease in common. Secondary reports of FI datasets that did not report additional psychometric properties were excluded (see Additional file 1 for full details of inclusion and exclusion criteria). Based on these predefined criteria, the first author extracted data on general study characteristics, frailty index characteristics and assessed psychometric properties.

Psychometric properties – definitions

Currently, there is no consensus about a frailty reference standard against which the criterion validity of the FI could be assessed. However, since there is general agreement that the concept of frailty reflects a state of increased vulnerability to adverse health outcomes, criterion validity is defined as the ability of an FI to predict adverse health outcomes [15]. An Area Under the Curve (AUC) of < 0.70 was considered poor; an AUC of 0.70-0.89 was considered adequate; and an AUC of ≥ 0.90 was considered excellent [16]. Construct validity refers to the coherence of the FI with other frailty measures or related conditions and constructs, including comorbidity, disability, self-rated health, age, and gender [15]. Responsiveness reflects the ability of the FI to detect clinically important changes over time in the frailty construct (see Additional file 1 for a detailed description of the various psychometric properties) [17].
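The AUC categories defined above can be written as a small helper. This is only a sketch: the function name `classify_auc` is hypothetical, and treating values between 0.89 and 0.90 as adequate is an assumption, since the stated ranges leave that interval unspecified.

```python
def classify_auc(auc):
    """Map an AUC value to the category used in this review.

    < 0.70 is poor, 0.70 up to (but not including) 0.90 is adequate,
    and >= 0.90 is excellent. Assigning 0.89-0.90 to 'adequate' is an
    assumption made for this sketch.
    """
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc < 0.70:
        return "poor"
    if auc < 0.90:
        return "adequate"
    return "excellent"


# Illustrative values only.
labels = [classify_auc(value) for value in (0.65, 0.75, 0.92)]
```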

Quality assessment

Study quality was evaluated using the Quality in Prognosis Studies (QUIPS) tool, which considers six potential domains of bias: inclusion, attrition, prognostic factor measurement, confounders, outcome measurement, and analysis and reporting [18]. Each domain comprises a number of prompting items, which enable assessment of the domain as having a high, moderate or low risk of bias.

The QUIPS tool was considered the most appropriate quality appraisal tool because, conceptually, the frailty index is a prognostic instrument. We modified three domains of the QUIPS tool. First, in our review, we were interested only in the descriptive, rather than explanatory, relationships of the FI to adverse health outcomes and other measures; thus, we considered the domain ‘confounders’ irrelevant. Second, the domain ‘outcome measurement’ only accommodated studies in which the FI correlated with adverse outcomes, i.e., criterion validity studies. We modified this domain such that the QUIPS tool also applied to studies in which the FI was correlated cross-sectionally or longitudinally with other frailty measures or related constructs, i.e., construct validity or responsiveness studies. Third, in the domain ‘prognostic factor measurement’, we redefined the prompting item ‘Valid and Reliable Measurement of Prognostic Factor’ as ‘Valid and Reliable Construction of Prognostic Factor’ because the FI deficit list must be constructed based on specific criteria [2, 19]: first, deficits should be acquired and related to health status; thus, ‘blue eyes’ is not an appropriate deficit whereas ‘heart failure’ is appropriate; second, deficit prevalence should increase with age; third, deficits should not ‘saturate’ too early, for example, presbyopia is present in almost all older people, thus, it is not appropriate as a deficit; fourth, the combination of deficits in an FI should cover a range of systems; fifth, the same FI should be used in follow-up measures; and finally, the FI should comprise at least 30 deficits and deficit prevalence should be at least 1% [2] (see Additional file 2 for the modified QUIPS form that was used for the quality appraisal of the studies included).
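The two quantitative construction criteria listed above (at least 30 deficits, and a prevalence of at least 1% per deficit) lend themselves to an automated check. The sketch below is an illustration under assumptions: the function name `check_deficit_list` and the input format are hypothetical, and the qualitative criteria (health relatedness, increase with age, saturation, coverage of systems) are deliberately not modelled because the text gives no numeric thresholds for them.

```python
MIN_DEFICITS = 30       # minimum number of deficits in an FI
MIN_PREVALENCE = 0.01   # minimum prevalence of each deficit (1%)


def check_deficit_list(prevalences):
    """Return a list of problems found in a candidate deficit list.

    `prevalences` maps a deficit name to its prevalence in the study
    population, expressed as a fraction between 0 and 1.
    An empty list means both quantitative criteria are met.
    """
    problems = []
    if len(prevalences) < MIN_DEFICITS:
        problems.append(
            f"only {len(prevalences)} deficits; at least {MIN_DEFICITS} required"
        )
    for name, prevalence in sorted(prevalences.items()):
        if prevalence < MIN_PREVALENCE:
            problems.append(
                f"{name}: prevalence {prevalence:.1%} is below {MIN_PREVALENCE:.0%}"
            )
    return problems


# Hypothetical list: 30 deficits, one of which is too rare.
candidate = {f"deficit_{i:02d}": 0.05 for i in range(30)}
candidate["deficit_07"] = 0.005
issues = check_deficit_list(candidate)
```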

Registration

This systematic review was registered prospectively in the PROSPERO international prospective register of systematic reviews (CRD42013003737).

Funding

This research was supported by a grant from ZON-MW, The Netherlands Organization for Health Research and Development (reference 311040201). The sponsor had no influence on the research design, data collection, data interpretation, the writing of this report or the decision to publish.

Results

Search results

After removing duplicates, our search resulted in 867 studies (Figure 1). We excluded 809 studies after screening the titles/abstracts and 38 studies after full-text assessment. We have listed the full bibliographic details and the reason for exclusion of each of these studies (available upon request). No additional studies were found in manual reference searching; thus, we used twenty studies for our final review.

Figure 1 Flowchart of search results.

Description of study characteristics

One study was a cross-sectional study [20], and nineteen studies were cohort studies with a follow-up ranging from one to twelve years (Table 1). Eighteen studies used survey datasets; one study used an administrative dataset of home-care clients [21], and one study was based on the analysis of routine administrative primary care data [22].

Table 1 General characteristics of the studies included

In ten studies, the study population was population-based, consisting of a representative mixture of independently living and institutionalized older people, with the majority of people living independently [6, 23–31]. Eight studies included only independently living older people [19, 20, 22, 32–36]; and two studies focused specifically on older people receiving home care or older people in assisted living facilities [21, 37]. The number of participants ranged from 754 to 36,424 older people with a mean age varying from 70.1 to 84.9 years, and the percentage of women varied from 50.0 to 76.7%.

The FIs used in the studies were based on 13 to 92 health deficits. Most studies scored deficits dichotomously [6, 21–26, 29–31]. Eight studies applied multilevel scoring [19, 28, 32–37] and used, for example, a Likert scale [33]. Two studies did not report how the deficits were scored [20, 27]. Two studies assigned extra weight to predefined deficits [23, 31], for example, to ‘polypharmacy’ [31]. The mean FI scores varied from 0.13 to 0.26, and except for two studies that reported a lower maximum FI score [22, 31], the maximum reported FI score varied from 0.60 to 0.70.

Quality assessment

Four studies showed a low risk of bias for each of the five domains of the QUIPS tool considered, namely inclusion, attrition, prognostic factor measurement, outcome measurement, and analysis and reporting. Fourteen studies showed a moderate-to-high risk of bias in one or two domains; and two studies showed a moderate-to-high risk of bias in three or four domains (Table 2). Risks of bias were highest in the domain of study attrition, which was due to very low response rates [31] or an unclear response rate [19, 25, 34]. In one cohort study, attrition was not assessed because only the cross-sectional study component was considered [27]. For the remaining fourteen cohort studies, losses to follow-up were < 16%.

Table 2 Assessment of risk of bias using the ‘Quality Assessment in Prognostic Studies’ (QUIPS) tool

In the domain of prognostic factor measurement, eleven studies were judged as having a moderate risk of bias [19, 20, 22, 24, 27, 28, 30–32, 34, 36]. Of these eleven studies, four studies did not report their entire FI deficit list [20, 26, 27, 32], three used data-driven cut-off points for the FI [24, 26, 30], and nine did not report the percentage of missing FI data or how missing FI data were managed [19, 20, 22, 24, 30–32, 34, 36]. In the remaining nine studies showing a low risk of bias in the prognostic factor measurement, eight reported a percentage of missing data of <5% [21, 23, 25, 28, 29, 33, 35, 37], and one study did not report the percentage of missing data [6]. Six studies managed missing data by excluding the missing deficits from the denominator when calculating the FI [6, 25, 28, 32, 35, 37]. Two studies imputed the missing FI data [23, 29]. All twenty studies complied with the criteria for adequate FI construction as described in the ‘Methods’ section.
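The denominator-exclusion approach to missing data mentioned above can be sketched as follows. The function name `frailty_index_with_missing` and the use of `None` to mark a missing deficit are assumptions made for illustration; the individual studies may have implemented this differently.

```python
def frailty_index_with_missing(deficits):
    """Compute an FI score while ignoring missing deficits.

    `deficits` maps a deficit name to True (present), False (absent)
    or None (missing). Missing deficits are dropped from both the
    numerator and the denominator.
    """
    observed = [value for value in deficits.values() if value is not None]
    if not observed:
        raise ValueError("no observed deficits to score")
    return sum(observed) / len(observed)


# Hypothetical patient: 40 deficits, 4 missing, 9 of the remaining 36 present.
patient = {f"deficit_{i}": i < 9 for i in range(36)}
patient.update({f"missing_{i}": None for i in range(4)})
score_with_missing = frailty_index_with_missing(patient)  # 9 / 36
```

Counting the four missing items as absent would instead give 9/40, so the choice of denominator shifts the score whenever data are incomplete.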

In total, in the 20 included studies, 5.1% of domains, i.e., inclusion, attrition, prognostic factor measurement, outcome measurement, and analysis and reporting as assessed with the QUIPS tool showed a high risk of bias, 25.5% of domains showed a moderate risk of bias, and 69.4% of domains showed a low risk of bias (full QUIPS appraisal forms for each study are available upon request).

Psychometric properties of the FI

Criterion validity

Fifteen studies assessed the criterion validity of the FI by evaluating the predictive ability of the FI for mortality, institutionalization, hospitalization, number of days in hospital, morbidity, Emergency Department (ED) visits, out-of-hours GP consultations, falls, fractures, change in ADL score, and change in mental score (Table 3). In each study, the FI was incorporated into a multivariable regression model that was corrected for age, gender and a variety of other co-variables. In each model, the FI was a significant predictor of the assessed outcome.

Table 3 Criterion validity results; the predictive ability of the frailty index for adverse health outcomes

Twelve studies focused on the prediction of mortality, for which hazard ratios of 1.01 (SE ± 0.003; per deficit increase in the frailty index) to 6.45 (95% CI 4.10-10.14; most-frail group (FI score 0.35-0.65) versus the least-frail group (FI score < 0.07)) were reported [34, 33]. A multivariable model with age, gender, co-morbidity and an FI resulted in an Area Under the Curve (AUC) of 0.691 (95% CI 0.648-0.733) for one-year mortality [37]. Used as a single independent variable, the FI predicted two-year mortality with an AUC of 0.780 (± 0.020 SE) and ten-year mortality with an AUC of 0.720 (± 0.020 SE) [29].

For other outcome measures, comparable AUCs were as follows: 0.610 (95% CI 0.576-0.644) for one-year hospitalization risk and 0.667 (95% CI 0.625-0.707) for a one-year risk of moving to long-term care [37]. For the prediction of time to the combined outcome of ED/out-of-hours GP surgery visits, nursing home admission and mortality, the c-statistic of the FI used as a single independent variable was 0.686 (95% CI 0.664-0.708). When the FI was combined in a model with age, gender, and consultation gap, the c-statistic improved to 0.702 (95% CI 0.680-0.724) [22].

One study tested the added value of the FI in a multivariable model for predicting adverse health outcomes. For mortality and transition to long-term care, the AUCs of the models including an FI were significantly higher than the AUCs of a model comprising only age, gender and co-morbidity (p < 0.03). For hospitalization, the AUC of the full model with age, gender, co-morbidity and an FI was significantly higher than the AUC of a model comprising only age and gender (p < 0.001) [37].

Construct validity

Eleven studies evaluated the construct validity of the FI [6, 20, 21, 24–28, 34, 36, 37]. The FI showed a strong positive correlation with the Functional Reach test (r = 0.73) [29], Conselice Study of Brain Ageing (CSBA) score (r = 0.72) [26], Frailty Phenotype (r = 0.65) [28], and Edmonton Frail Scale (EFS; r = 0.61) [21], a strong negative correlation with the Mini Mental State Examination score (r = −0.58) [28], and a moderate correlation with the Changes in Health, End-Stage Disease and Signs and Symptoms (CHESS) Scale (r = 0.35) [21]. When the dichotomized FI was compared with the Frailty Phenotype where the latter was used as a reference standard, the FI showed a sensitivity of 45.9 to 60.7% and a specificity of 83.5 to 90.0% [20, 24]. When compared with the Functional Domains model, the sensitivity of the FI was 38%, and its specificity was 91.5% [20]. When using a three-level risk categorization, the weighted kappa of the FI compared with the Frailty Phenotype was 0.17 (95% CI 0.13-0.20), and the weighted kappa of the FI compared with the CHESS scale was 0.36 (95% CI 0.31-0.40).
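For readers less familiar with the sensitivity and specificity figures quoted above, the following sketch shows how they follow from a two-by-two comparison of a dichotomized FI against a reference standard. The function name and all counts are hypothetical and chosen only for illustration; they are not data from any included study.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from two-by-two table counts.

    Sensitivity is the share of reference-standard frail people whom the
    FI also flags; specificity is the share of reference-standard
    non-frail people whom the FI also classifies as non-frail.
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical counts: 100 frail and 200 non-frail people by the reference standard.
sens, spec = sensitivity_specificity(true_pos=45, false_neg=55,
                                     true_neg=170, false_pos=30)
```

With these made-up counts the FI would detect 45% of frail people while correctly classifying 85% of non-frail people, a pattern of low sensitivity and high specificity similar in kind to the one reported above.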

The FI displayed moderate correlation with the concept of self-rated health (r = 0.49), which was expressed as an index of self-rated health deficits [27]. When the crude correlation of the FI was assessed with age, a weak to moderate correlation of 0.193, 0.241 and 0.320, respectively, was reported [6, 25, 26]. One study compared the age trajectories of the FI score within community-dwelling and institutional/clinical cohorts [34], with higher levels of comorbidity and disability in the latter. The FI score increased gradually with age in community-dwelling cohorts, whereas the FI score was high at all ages in the institutional/clinical cohorts.

One study specifically examined an FI with only symptoms and signs as deficits and demonstrated that older people with higher FI scores showed more functional impairments in (I)ADL and more co-morbidity than patients with lower FI scores [36].

Without formally assessing correlations within a construct validity context, sixteen studies reported that older people and women show higher FI scores [6, 19, 20, 22, 23, 25–37], and only one study reported a lower percentage of women in the most-frail group [21].

Six studies quantified the increase in FI score with chronological age, all reporting a similar increase in FI score with age ranging from +0.02 to 0.05/year [6, 19, 22, 26, 34, 35].

No studies reported on the responsiveness of the FI in daily clinical practice.

Discussion

In this systematic review, we demonstrate that the FI adequately predicts a wide range of adverse health outcomes and that its discriminative capability is poor to adequate. The FI correlates strongly with other frailty measures, except for the CHESS scale. However, this scale is not a frailty measure per se but was designed to measure ‘health instability’ and to specifically predict mortality in institutionalized older people [38]. The FI score increases steadily with age, and the maximum FI score reported was 0.70, indicating that no ceiling effect exists.

Our review has a number of strengths. First, we used a broad, sensitive search strategy with a low risk of missing relevant studies. Thus, we identified a large number of studies with consistent results across a variety of FIs in different populations. Second, we only considered relevant psychometric properties. We omitted reliability because the FI is an automated screening procedure and therefore not susceptible to intra- or interrater variability. Internal consistency was not examined because the FI is a formative model, i.e., the items form the construct together and therefore do not need to be correlated [39]. Third, the definitions used were tailored specifically to those aspects considered essential for frailty measures and based on a standardized taxonomy [15, 17]. Fourth, we tailored our detailed inclusion and exclusion criteria to support our aim, which was to select those FI studies relevant for primary care. For example, we excluded studies with an FI based on a comprehensive geriatric assessment because it is not feasible to perform such an assessment for each older patient in primary care. Fifth, we appraised included studies critically using the QUIPS tool, which provided comprehensive quality assessment that demonstrated overall good quality of the methodology used in the included studies. The majority of studies reported sufficient details on their study sample, used appropriate criteria for FI construction, and reported few missing data. Moreover, the reported loss to follow-up was typically well below 20%; thus, biased results were unlikely [40].

Our review also has several limitations. First, there is a risk of publication bias because studies with negative results are less likely to be published [41]. Because no register exists for validation studies, publication bias could not be formally assessed. Second, due to the withdrawal of one of the authors (GK), the first author (ID) performed the full-text assessment and quality appraisal partially alone, which may have caused potential selection bias. However, strict predefined selection and quality appraisal criteria were applied (see Additional files 1 and 2), and in cases where doubt existed, full-texts were assessed independently by the last author (MS). Third, most of the included studies on construct validity lacked prespecified hypotheses, which increases the risk of bias because, retrospectively, alternative explanations for low correlations may be sought [39]. Because the majority of correlations were robust, this risk appears limited. Finally, an individual patient data meta-analysis would have been preferable when summarizing research on the criterion validity of the FI. However, because the nature and number of deficits differed between the studies, it was not feasible to merge these data. Moreover, due to study heterogeneity, a meta-analysis on the outcome measures was not possible [41].

Apart from the FI, another frailty screening instrument that has shown good criterion and construct validity is the Frailty Phenotype [42]. One may question whether this performance-based measure would be preferable to implement in general practice, since it also has good face validity, consisting of five easily interpretable parameters (unintentional weight loss, self-reported exhaustion, weakness, slow walking speed, and low physical activity). However, compared to the FI, the Frailty Phenotype would require extra time and resources to enable execution in daily clinical care, and in direct comparison, the FI has been shown to better predict mortality risk among older people [24].

Our results are consistent with previous FI reviews that also reported on criterion validity and construct validity of the FI [7, 13, 43]. Our review updates these findings, and whereas these previous reviews were narrative in nature, our review is the first to systematically review the FI’s psychometric properties that are relevant to primary care.

In the majority of the included studies on the FI’s criterion validity, its predictive ability for mortality is examined. This does not mean that the FI is meant to be a ‘mortality prediction’ instrument; rather, by including the FI in a multivariable model including age, the FI score aims to explain the variable vulnerability to adverse health outcomes in people of the same age. This heterogeneity in frailty levels is also reflected by the relatively low correlation coefficients that we found between FI and age; whereas, in general, the correlation coefficient for the mean FI scores versus age was high (e.g. r = 0.985, [34]), the correlation coefficient for the individual FI scores versus age was at maximum 0.320 [26].

To assess the construct validity of the FI, we focused on its correlation with other frailty measures, age, gender, disability, comorbidity, and self-rated health [15]. However, the concordance of the FI with a broad array of other measures has also been investigated, and a high FI score has been demonstrated to correlate with a high and low BMI [44], smoking [45, 46], impaired psychological well-being [47], psychiatric illness [48], impaired mobility [49], impaired cognition and Alzheimer’s disease [50, 51], pain [52], high levels of gonadotropins [53], neighborhood deprivation and low individual socio-economic status [54], rural residence [55, 56], and low education and little social support or participation [57]. The FI may also serve as a basis to calculate ‘biological age’. Individuals with an FI score that is relatively high for their age and gender show a biological age that is higher than their chronological age, and this biological age is also a significant predictor of mortality [58].

There is no evidence supporting responsiveness or utility. However, some studies reflected upon the potential utility of the FI and noted two major advantages: first, the FI can be constructed from available data, whether from administrative routine primary care data [22], specific measurements, such as the interRAI-AL instrument [37], or comprehensive geriatric assessment data [26, 29]. Second, the FI score can be calculated using software, thereby facilitating its clinical application [24, 37]. However, the FI was actually studied in routine clinical data in only one study, so these potential advantages need to be further explored.

One may argue that studies relating FI score change to baseline factors, such as mobility and baseline frailty state, and studies modeling FI score change [49, 59] do describe responsiveness. These studies demonstrate that FI score development over time can be adequately described using a time-dependent Poisson distribution, and that the probability of improvement, stability and worsening of the FI score is directly related to the baseline number of deficits, age, and mobility status. However, we did not consider these studies as responsiveness studies, since they did not study pre-specified hypotheses regarding the expected correlations between changes in the score on the FI instrument, and changes in other variables, such as scores on other instruments, or demographic or clinical variables [17].

An important finding of our systematic review is that eighteen out of twenty studies explored the FI’s psychometric properties in datasets gathered specifically for research purposes. These studies consistently showed a higher maximum and mean FI score compared with the study that investigated the FI using routine primary care data [22]. However, because only one study with an FI using routine primary care data was included, there is not enough evidence to support conclusions about any structural differences in mathematical properties of the FI. More FIs applied in routine primary care data sets should be studied to further explore these potentially different mathematical properties. The narrower FI score range in the study using routine primary care data reflects unexpectedly low deficit prevalences, which may have several causes: first, patients may experience symptoms or problems with which they do not present themselves to the GP; second, there may be suboptimal data registration in the EMR [60, 61]; and third, the FI may need to include more items on level of functioning, mobility or health attitude instead of merely relying on morbidity deficits. Also, except for the polypharmacy deficit, this FI was based on one single data source out of the Electronic Medical Records (EMRs), namely symptoms and diagnoses encoded according to the International Classification of Primary Care (ICPC, [62]). Care should be taken to construct an FI that captures all information available in the EMR by using, for example, not only ICPC-encoded data but also diagnostic measurement data, such as body mass index or laboratory tests, and elaborate medication data, encoded according to the Anatomical Therapeutic Chemical (ATC) classification [63].

Conclusions

In this systematic review, the FI demonstrates good criterion and construct validity, but its discriminatory ability is poor to adequate. In general, the FI appears to be an easily interpretable instrument that is practical to manage; however, studies that focus on its responsiveness, interpretability or utility are lacking. These results support the potential of the FI as a screening instrument for frailty in primary care and also demonstrate that further research into its psychometric properties is required. FIs based on research data show higher FI scores than those based on routine primary care data. Given its implementation in clinical practice, future validation studies of the FI should focus primarily on its application in routine primary care data.