Falls are a leading cause of injury and activity limitation in older adults, and the adverse effects associated with falling result in significant personal, social and economic burden. Approximately 30% of community dwelling people aged 65 years and over will fall each year [1]. Falls account for 40% of all injury deaths and lead to 20-30% of mild to severe injuries ranging from soft tissue injuries to fractures in the elderly [2]. The causes of falling are multi-factorial and include extrinsic (environment-related), intrinsic (person-related) and behavioural (activity-related) factors. Gait instability has been identified as a relatively consistent risk factor for falls and the majority of screening programmes to identify those at risk of falls comprise an assessment of gait and balance [3, 4]. There are a number of performance-oriented mobility assessment tools that assess aspects of balance and gait involved in normal daily activities. These tools serve to identify patients at risk of falling; however, the sensitivity and specificity of existing tools are low [5]. One such example is the STRATIFY clinical prediction rule (St. Thomas Risk Assessment Tool in Falling elderly inpatients), which consists of five items that address risk factors for falling: past history of falling, patient agitation, visual impairment affecting everyday function, need for frequent toileting, and transfer ability and mobility. The STRATIFY rule yields a possible score between 0 and 5 (each item scoring 1 if present or 0 if absent). A recent systematic review examined the predictive value of the rule in elderly inpatients at risk of falls and found that at a score ≥2 points, the STRATIFY rule had only limited predictive ability with moderate summary estimates of sensitivity (0.67, 95% CI 0.52 – 0.80) and specificity (0.57, 95% CI 0.45 – 0.69) [6].

The TUG test is another commonly used screening tool for falls risk in the inpatient and the community setting. The TUG (Timed Up and Go) test was developed in 1991 as a modified timed version of the Get up and Go test [7, 8]. To perform the TUG test as described in the original derivation study, the patient is timed while they rise from an arm chair (approximate seat height 46 cm), walk at a comfortable and safe pace to a line on the floor three metres away, turn and walk back to the chair and sit down again. The subject walks through the test once before being timed to become familiar with the test. The subject wears his regular footwear and uses his customary walking aid (cane or walker) if necessary [8]. A faster time indicates a better functional performance and a score of ≥13.5 seconds is used as a cut-point to identify those at increased risk of falls in the community setting [9]. However, reported threshold values vary from 10 to 33 seconds in the literature [10, 11].

The TUG is recommended as a routine screening test for falls in guidelines published by the American Geriatric Society and the British Geriatric Society [12]. The National Institute of Clinical Evidence (NICE) guidelines also advocate the use of the TUG for assessment of gait and balance in the prevention of falls in older people [13]. To date, three systematic reviews have examined the clinical utility of the TUG to discriminate between those at low and high risk of falling [14–16]. The most recent systematic review reported that the pooled mean difference in time taken to complete the TUG between fallers and non-fallers depended on the baseline functional status of the cohort of patients under investigation. In essence, there was a mean difference of 0.63 seconds (95% CI 0.14–1.12 seconds) in the performance of the TUG for high-functioning individuals versus a difference of 3.59 seconds (95% CI 2.18–4.99 seconds) for those in institutional settings [16]. The aim of this systematic review with meta-analysis is to examine the predictive value of the test to identify individuals at risk of falling in the community using the frequently cited cut-off of ≥13.5 seconds. A secondary aim of the study is to examine the summary estimates of sensitivity and specificity of alternative cut-off scores to optimally discriminate between fallers and non-fallers.


Search strategy

This systematic review and meta-analysis was performed according to the principles outlined by the Cochrane diagnostic test accuracy working group [17, 18]. We aimed to identify all studies that validated the TUG test in community dwelling older adults. A systematic literature search was conducted in June 2012 (updated in March 2013) and included the following databases and search engines: PubMed, EMBASE, Cochrane Library, EBSCO, CINAHL and SCOPUS. A combination of the following keywords and MeSH terms was used: ‘Timed Up and Go test’, ‘Get Up and Go test’, ‘TUG’, ‘GUG’, ‘TGUG’, ‘TGUGT’, ‘ETGUG’, ‘ETGUGT’, ‘TUGT’, ‘modified TUG’ and ‘accidental falls’, ‘fall’, ‘falling’, ‘faller’. No language restrictions were applied to the search. The search was supplemented by hand searching reference lists of retrieved articles and searching Google Scholar. The original version of the Get Up and Go test was created in 1986 [7] and the timed version was derived in 1991 [8]; therefore, only studies published from 1991–2013 were included in our literature review.

Study selection and data extraction

Studies were included if they met the following inclusion criteria: 1) Prospective or retrospective cohort studies or randomised controlled trials, 2) Studies that included community dwelling older adults as the population of interest, 3) Studies that validated the original version of the TUG test, 4) Studies that recorded a subsequent fall. Studies were excluded if their population of interest was limited to patients with a specific neurological or orthopaedic condition e.g. Parkinson’s disease, stroke, hip fracture or amputation of a lower limb. Studies were also excluded if they were limited to a population with a particular medical condition e.g. patients with chronic obstructive pulmonary disease. For the purposes of this review, we included studies where ≥80% of subjects were community dwelling and/or were described as self caring or independent. Studies where >20% of the subject population were described as institutionalised, living in nursing homes, residential care homes or geriatric inpatients were excluded. The definition of a subsequent fall was considered in the context of each individual study. We considered the following definition of a fall as the reference standard: ‘an unexpected event in which the patient comes to rest on the ground, floor or lower level’ [1]. Variations of this definition were recorded in Table 1, which contains details of the included studies.

Table 1 Descriptive characteristics of the studies included in the review

Two reviewers (EB, RG) read the titles and/or abstracts of the identified references and eliminated irrelevant studies. Studies that were considered eligible for inclusion were read fully in duplicate and their suitability for inclusion was independently determined by both RG and EB. Disagreement was managed by consensus. Data were extracted on study type and setting, patient demographics (age, gender) and clinical characteristics including relevant inclusion and exclusion criteria, person who administered the TUG, person who recorded the subsequent fall, the definition of a fall used. For the purposes of this paper, the unit of analysis was the patient or “faller” rather than each “fall” to avoid duplication bias. Authors were contacted by email to provide further information on patient cohorts where there was insufficient data provided. Studies that included data on the same patient cohort for more than one publication were only included once in the meta-analysis.

Quality assessment

The methodological quality of the selected studies was evaluated independently by two reviewers (EB and FH) using the QUADAS-2 tool, a validated tool for the quality assessment of diagnostic accuracy studies [17, 18]. This checklist consists of four key domains: patient selection, index test, reference standard and flow and timing. Within each study, the domains are assessed in terms of risk of bias and the first three of these domains are assessed in terms of concerns about applicability. Signalling questions as specified in the QUADAS-2 tool enable the reviewer to give each domain a rating of high, low or unclear. Disagreements were resolved by a third reviewer (RG).

Statistical methods

We used Stata version 12 (StataCorp College Station, Texas, USA), particularly the metandi command that fits the bivariate model, for all statistical analyses. We have applied this methodology in similar studies [19]. Significance was set at p < 0.05 for all analyses. A 2 × 2 table was constructed to extract the number of true positives, false positives, true negatives, false negatives, for the TUG test from each validation study using the pre-defined cut-point of ≥13.5 seconds to identify those at increased risk of falling. We applied the bivariate random effects model to estimate summary estimates of sensitivity and specificity and their corresponding 95% confidence intervals. This approach preserves the two-dimensional nature of the original data and takes into account both study size and the heterogeneity beyond chance between studies [20]. Sensitivity refers to the proportion of fallers correctly classified as high risk. Specificity is the proportion of non-fallers correctly classified as low risk.
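The construction of a 2 × 2 table at the ≥13.5 second cut-point can be sketched as follows. This is a minimal illustration in Python rather than the Stata workflow used in the review; the class name `TwoByTwo`, the function `build_table` and the patient records are hypothetical and are not data from any included study. It covers only the per-study table, not the bivariate random effects pooling.

```python
from dataclasses import dataclass

CUT_POINT_SECONDS = 13.5  # cut-point examined in this review


@dataclass
class TwoByTwo:
    tp: int  # fallers classified as high risk (TUG >= cut-point)
    fp: int  # non-fallers classified as high risk
    fn: int  # fallers classified as low risk (TUG < cut-point)
    tn: int  # non-fallers classified as low risk

    @property
    def sensitivity(self) -> float:
        # Proportion of fallers correctly classified as high risk
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of non-fallers correctly classified as low risk
        return self.tn / (self.tn + self.fp)


def build_table(records, cut_point=CUT_POINT_SECONDS):
    """Build a 2 x 2 table from (tug_seconds, fell) pairs for one study."""
    tp = fp = fn = tn = 0
    for tug_seconds, fell in records:
        high_risk = tug_seconds >= cut_point
        if high_risk and fell:
            tp += 1
        elif high_risk and not fell:
            fp += 1
        elif fell:
            fn += 1
        else:
            tn += 1
    return TwoByTwo(tp, fp, fn, tn)


# Hypothetical records: (TUG time in seconds, fell during follow-up)
records = [
    (15.2, True), (9.8, False), (13.5, True), (11.0, True),
    (14.1, False), (8.7, False), (10.4, False), (12.9, True),
]
table = build_table(records)
print(table, table.sensitivity, table.specificity)
```

The unit of analysis in this sketch is the patient (faller or non-faller), not the individual fall.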

The sensitivity and specificity for the TUG test were plotted in a hierarchical summary receiver operating characteristic (HSROC) graph, plotting sensitivity (true positive rate) on the y axis against 1-specificity (false positive rate) on the x axis. The 95% confidence region and the 95% prediction region were plotted around pooled estimates to illustrate the precision with which the pooled values were estimated (confidence ellipse around the mean value) and to illustrate the amount of between study variation (prediction ellipse).

Heterogeneity was evaluated visually using the summary ROC plots and statistically by using the variance of logit transformed sensitivity and specificity, with smaller values indicating less heterogeneity between studies. Bayes’ theorem was used to estimate the post-test probability of a fall by multiplying the pre-test odds by the likelihood ratio to obtain the post-test odds, where pre-test odds are calculated by dividing the pre-test probability by (1 − pre-test probability) and the post-test probability equals post-test odds divided by (1 + post-test odds). We completed sensitivity analyses to explore the effect of methodological features, as determined by the QUADAS-2 tool, on the predictive value of the TUG test.
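The Bayesian step can be illustrated with a short Python sketch. It uses the standard relationships odds = p / (1 − p), post-test odds = pre-test odds × likelihood ratio, and probability = odds / (1 + odds). The function names and the input values are illustrative assumptions, not figures taken from the review.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a dichotomised test."""
    positive_lr = sensitivity / (1 - specificity)
    negative_lr = (1 - sensitivity) / specificity
    return positive_lr, negative_lr


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Illustrative inputs only: a 30% pre-test probability and a likelihood ratio of 2
example = post_test_probability(0.30, 2.0)
print(round(example, 3))
```

A likelihood ratio of 1 leaves the probability unchanged, which is why a test with likelihood ratios close to 1 has little impact on clinical decision making.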

The c statistic, or area under the curve, with 95% CI was also estimated to describe model discrimination. The c statistic ranges from 0.5 (no discrimination) to a theoretical maximum of 1; values between 0.7 and 0.9 represent moderate accuracy and values greater than 0.9 represent high accuracy. A c statistic of 1 represents perfect discrimination, whereby scores for all cases (fallers) are higher than those for all the non-cases (non-fallers) with no overlap [21]. Finally, the association between the TUG score and falls was assessed using logistic regression and is presented as odds ratios with 95% confidence intervals.
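One way to read the c statistic is as the proportion of all faller/non-faller pairs in which the faller has the higher (slower) TUG score, with ties counted as half. The Python sketch below implements that pairwise definition; it is not the routine used in the review, and the function name and scores are hypothetical.

```python
def c_statistic(faller_scores, non_faller_scores):
    """Proportion of faller/non-faller pairs ranked correctly (ties count 0.5)."""
    concordant = 0.0
    for faller in faller_scores:
        for non_faller in non_faller_scores:
            if faller > non_faller:
                concordant += 1.0
            elif faller == non_faller:
                concordant += 0.5
    return concordant / (len(faller_scores) * len(non_faller_scores))


# Hypothetical TUG times in seconds
fallers = [15.2, 13.5, 11.0, 12.9]
non_fallers = [9.8, 14.1, 8.7, 10.4]
auc = c_statistic(fallers, non_fallers)
print(auc)
```

With complete separation of the two groups the function returns 1, and with identical score distributions it tends towards 0.5.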


Study identification

A flow diagram of the search strategy is presented in Figure 1. Two researchers (EB, RG) screened all potential papers. The search strategy yielded 1,134 articles and an additional 20 articles were found by hand searching, resulting in 1,154 articles. Six hundred and fifty five articles remained after duplicates were removed. Five hundred and fifty were then excluded based on title or abstract. Of the remaining 105 articles, 80 were excluded after reading the full text, leaving 25 articles. Within this group, there were four publications based on two unique cohorts of patients [22–25].

Figure 1 Flow diagram.

Study characteristics

The characteristics of the 25 prospective cohort studies are contained in Table 1. The descriptive characteristics were combined where two studies were based on the same population of patients [22–25]. Of the 25 studies, seven were based in the USA [26–32], five in Japan [33–37], three in Israel [22, 23, 38], four in France [24, 25, 39, 40] and one in each of Taiwan [41], Australia [42], the UK (unpublished) [43], Brazil [44], Ireland [45] and Norway [46]. The size of the patient cohort in the included studies ranged from 13 [29] to 1618 patients [40]. In total, 2,314 patients were included in the meta-analysis from 10 different datasets. The duration of follow up varied from six months [27, 28, 31, 32, 34] to five years [33].

The application and the conditions of testing varied in many of the validation studies. Variations included the instruction to walk as quickly as possible during the task [26, 46], the sole use or non-use of an assistive device [36, 45], standing from an armless chair [27], seat heights ranging from 40 cm [35] to 50 cm [41], and walking with arms crossed [26]. Testing conditions also varied in that some studies allowed a practice attempt and/or recorded the average time of two or three attempts [28].

Study quality

The summary diagram of the quality assessment is shown in Figure 2. All twenty five articles were quality assessed. The overall quality of the studies included was moderate, with six studies [23, 27, 30, 35, 36, 39] rated as low in all domains in both risk of bias and concerns about applicability. Ten studies [22, 24–26, 31, 33, 37, 38, 41, 43] were rated as having an unclear risk of bias and nine studies [28, 29, 32, 34, 40, 42, 44–46] were rated as having a high risk of bias. This was primarily attributed to a lack of information provided with respect to methods of patient recruitment (selection bias) and criteria used to ascertain a subsequent fall (reference standard).

Figure 2 Methodological quality of the studies included in the review.

In relation to concerns about applicability of each individual study to the proposed research question, ten studies [23, 26, 27, 30, 35, 36, 39, 40, 42, 45] were rated as low, ten studies [22, 24, 25, 29, 33, 37, 38, 41, 43, 44] were rated as unclear and five studies [28, 31, 32, 34, 46] were considered as having a high level of concern. A high or unclear risk of bias was noted in studies that inadequately described loss to follow-up in the cohort or the methods used to record the incidence of a fall over the period of study. The index test was adequately described in the majority of studies but many studies failed to record whether the index test (TUG score) was interpreted with or without knowledge of the reference standard (subsequent fall).

Predictive accuracy of all included studies

All authors were contacted to request primary data and ten authors responded with the relevant data [26, 30, 31, 34, 38, 42–46]. In two of the ten studies where data were provided [26, 46], the TUG was performed as quickly as possible and in the remaining eight studies it was performed at a comfortable pace. The duration of follow-up in these studies varied from six months [31, 34] to two years [45]. The remaining seven studies followed patients for one year after administration of the TUG [26, 30, 38, 42–44, 46]. The pooled sensitivity, specificity and the respective variance of the logit transformed sensitivity and specificity for the ten studies included in the meta-analysis are displayed in Table 2. These findings indicate that the TUG test is more useful at ruling in rather than ruling out falls in individuals classified as high risk (≥13.5 seconds), with a higher pooled specificity (0.73, 95% CI 0.51-0.88) than sensitivity (0.32, 95% CI 0.14-0.57). Individual and summary estimates of sensitivity and specificity for all studies, the 95% confidence region and 95% prediction region are presented in the summary ROC graph (Figure 3). The 95% confidence region is broad, indicating limited precision of the pooled estimate. The 95% prediction region (amount of variation between studies) is also wide, suggesting heterogeneity between studies.

Table 2 Summary estimates of sensitivity, specificity, and positive and negative likelihood ratios for all included studies and for sensitivity analyses at a cut point of ≥13.5 seconds

Figure 3 Hierarchical summary receiver operating characteristic plot of sensitivity and specificity for the TUG predicting falls at a cut point ≥13.5 seconds.

The logistic regression analysis also indicates that the TUG score is not a significant predictor of falls (OR = 1.01, 95% CI 1.00-1.02, p = 0.05). The limited discriminative performance of the TUG is confirmed by the ROC curve analysis (Figure 4), which gave an area under the curve of 0.57 (95% CI 0.54-0.59), only slightly better than chance.

Figure 4 Performance of TUG to distinguish fallers from non-fallers.

Sensitivity analysis

A sensitivity analysis was completed excluding the two studies where the TUG was performed as fast as possible [26, 46]. The summary estimates of sensitivity (0.44, 95% CI 0.20-0.71) and specificity (0.71, 95% CI 0.49-0.86) were broadly unchanged. The three studies where the duration of follow up was less than or greater than one year were then removed [31, 34, 45]. Similarly, the summary estimates of sensitivity (0.33, 95% CI 0.11-0.68) and specificity (0.70, 95% CI 0.37-0.90) were unchanged. We also excluded four studies where there was evidence of selection bias [34, 42, 44, 46]. Removal of these studies from the meta-analysis reduced the precision of the estimates of sensitivity (0.29, 95% CI 0.10-0.60) and specificity (0.64, 95% CI 0.20-0.93). Four studies [31, 38, 43, 46] that did not adequately describe the method of administration of the TUG test were removed in a further sensitivity analysis, and the summary estimates of sensitivity (0.33, 95% CI 0.17-0.54) and specificity (0.71, 95% CI 0.58-0.81) of the TUG were broadly similar to the overall analysis. Finally, we excluded four studies [26, 31, 43, 45] where no clear definition of a fall was reported. While the sensitivity remained stable (0.28, 95% CI 0.11-0.54), the specificity, that is the ability of the TUG to rule in individuals at high risk of falling, increased to 0.81 (95% CI 0.64-0.91). The pooled sensitivity, specificity and the respective variance of the logit transformed sensitivity and specificity for the studies included in the sensitivity analysis are displayed in Table 2.

Bayesian analysis

Using Bayes’ theorem, the post-test probability of a fall across the different subgroups is presented in Table 3. The pre-test probability (prevalence) was calculated as 51% across all studies. The cut-point of ≥13.5 seconds has little impact on identifying those at high risk of falls when all studies are combined and across all of the different subgroups. Of note, when studies that provided no/unclear definitions of falls were excluded, the positive likelihood ratio increased to 1.50 (95% CI 1.15-1.94) and the post-test probability of a fall in patients classified as high risk increased from 54% to 64%.

Table 3 Post-test probability of a fall in patients classified as high risk (≥13.5 seconds) and low risk (<13.5 seconds) using the TUG score


Statement of principal findings

This systematic review demonstrates that the diagnostic accuracy of the Timed Up and Go test is limited at the widely used cut point of ≥13.5 seconds and that the test should not be used for identifying community dwelling adults at high risk of falls in clinical practice. The sensitivity analyses, which examined the performance of the test in different subgroups, showed broadly comparable results, indicating that the TUG performed in a similar manner regardless of the methodological quality of the studies.

Results in the context of the current literature

The TUG is commonly used in the research and clinical setting to screen individuals at increased risk of falling. The commonly cited cut-off score of ≥13.5 seconds used to identify individuals at high risk of falls was first described by Shumway-Cook and colleagues in 2000 [47]. However, the study design (case–control) used to derive this cut-off was not optimal and was subject to bias in terms of choosing appropriate controls and determining exposure. In addition, the study comprised small numbers of patients, with 15 fallers and 15 non-fallers included in the analysis. The definition of a fall was broad: “any unplanned unexpected contact with a supporting surface, excluding unavoidable environmental hazards”, and the study excluded those who had had one or fewer falls in the previous six months. The authors reported a sensitivity of 80% and specificity of 100%, suggesting that the TUG is more useful at ‘ruling-in’ falls in those classified as high risk. However, these findings need to be interpreted in the context of the methodological limitations of the study.

This systematic review only included cohort studies and randomised controlled trials where the index test (TUG) preceded the outcome of interest (fall) and the findings are in keeping with those reported in a previous systematic review that included both case–control and cohort studies [16]. The authors reported that the predictive accuracy of the TUG in identifying fallers across the included studies was poor to moderate and sensitivity and specificity were often close to chance [16]. Furthermore, cut-off points for identifying patients at increased risk of falls in independent-living patients varied between 8.1-16 seconds for performing TUG at a comfortable speed and between 11–13.5 seconds at a fast walking speed.

The limited predictive value of the TUG may be explained by the fact that the TUG is a single test which reflects strength, balance and mobility, whereas the risk of falling has been shown to depend on multiple intrinsic and extrinsic factors [48, 49]. The TUG does not appear to adequately encompass these risk factors. Recent literature has focused on the addition of a second manual [50] or cognitive task [51]. Nevertheless, the predictive ability of the tool remains limited with the inclusion of these tasks. Further study of the constituent parts of the TUG, quantifying body movements through the use of body worn sensors, has increased the predictive accuracy of the TUG to almost 80% in one study [45].

Strengths and weaknesses of the study

This study pooled data from a broad range of studies, enhancing the generalisability of its findings. We examined the methodological quality of the studies using the validated QUADAS-2 tool for assessing the quality of such studies. In addition, sensitivity analyses examined the effect of important methodological variables including studies with selection bias, index test bias and reference standard bias. We also used individual patient data rather than aggregate data to calculate summary estimates of sensitivity and specificity and their corresponding 95% confidence intervals. This allowed more accurate data analysis by accounting for heterogeneity between studies and influences of sample size. However, the findings from the systematic review need to be interpreted in the context of the study limitations. Significant heterogeneity exists between the validation studies with respect to variation in the application of the TUG and a lack of information relating to the conditions of performing the TUG e.g. shoes worn, floor surface, chair seat and arm height, walk to a line or an X on floor. Studies have shown that these factors can affect TUG performance; for example, time to complete the test was found to be significantly longer when a chair without armrests was used [52]. Studies varied in the number of practice trials given and in whether an average result was recorded. In other studies up to three attempts were given before a timed trial was done. Furthermore, some studies did not allow the use of an assistive walking device and others specified the flooring type, which has also been shown to affect TUG score.

Clinical implications

Falls risk screening tools are an important element of falls prevention in the community. It is necessary to identify patients at high risk for falls and to facilitate the effective delivery of appropriate interventions to such patients. Inaccuracy of falls screening tools leads to inappropriate distribution of resources, contributing to varying degrees of success and failure of falls prevention strategies. It is essential to establish the accuracy of such tools and identify alternative tools that may be able to identify patients at risk of falling more accurately. Despite a growing body of evidence indicating its limited ability to predict falls, the TUG continues to be mentioned in clinical guidelines as a potential tool to identify fallers [12, 13]. This is most likely because it is easy and quick to perform and does not require specialist equipment. Nonetheless, the totality of evidence to date is that it has limited predictive ability and should not be used in isolation to identify community-dwelling older people at increased risk of falls. Clinicians who assess the elderly for risk of falling should ideally do so in a comprehensive manner, taking into account the multi-factorial nature of falls rather than relying on a single test of mobility.

Areas of future research

This study demonstrates that the TUG should no longer be used as a falls risk assessment in community dwelling elderly people. Gait, balance and, to a lesser degree, vision and cognition are inherently assessed in the TUG; however, it does not include other accepted risk factors for falls, e.g. medication use and morbidity. Further research is needed to determine its usefulness in lower functioning groups and those who have specific deficits in the areas of balance and mobility.

Advancing age is a primary risk factor for falling and recent studies have shown that the rate of falling in older people has remained at approximately 30% since 1988 [2]. The reasons why the elderly fall continue to be explored in the literature, and further research is required to develop a comprehensive falls risk tool that can accurately identify the common risk factors that predict falls. In order to prevent falls and reduce the overall rate of falling in the elderly, falls prevention programmes should then be tailored to the individual needs of the patient.


It is well recognised that falls assessment and prevention programmes are multi-factorial. Evidence from this systematic review of diagnostic accuracy suggests that a single assessment tool like the TUG should not be used to identify community dwelling older adults at increased risk of falls.