Background

Physical activity has gained importance as a therapeutic strategy for individuals with dementia (IWD), and accordingly, the number of trials investigating its effectiveness on motor and cognitive performance in IWD has increased [1]. However, methodological limitations, such as inappropriate or inconclusive motor assessments, affect the derivation of evidence. Thus, further high-quality investigations are required [2,3,4].

Considering motor assessments, high quality is reflected by appropriateness for the intended population, sensitivity to change, sound psychometric properties, and standardisation [4,5,6]. In many cases, motor assessments used in previous trials failed to meet these criteria. The majority of applied assessments has predominately been developed for healthy older adults and does not consider specific characteristics of IWD [7]. However, IWD and unimpaired individuals differ in their cognitive and motor performance [8,9,10,11,12]. Thus, tailoring motor assessments to IWD is essential to ensure appropriateness. Furthermore, insufficient or inconsistent research regarding sensitivity to change and psychometric properties in IWD [13] restricts the derivation of meaningful conclusions from applied motor assessments [14, 15]. Referring to this, literature indicates that dementia affects reliability [6, 16,17,18], which was scarcely considered in previous trials. With regard to standardisation, previous research utilised a variety of motor assessments and modifications, affecting comparability [4, 13]. Therefore, inappropriateness, insensitivity, inconclusiveness, and non-standardisation limit the derivation of evidence.

Considering heterogeneous cognitive and motor impairments [10, 19], motor assessments may not be equally suitable for all IWD. Severity and aetiology of dementia, which are important determinants contributing to this heterogeneity [19, 20], potentially influence psychometric properties of motor assessments. Particularly, test-retest reliability may decrease with increasing severity of dementia, due to growing intra-individual variability or progressive difficulties to participate in motor assessments [6, 16,17,18]. Similarly, aetiology of dementia can influence test-retest reliability as cognitive and motor impairments vary in time of occurrence and severity in different aetiologies [14, 19]. Moreover, the influence of external cues on test-retest reliability, which are used to compensate for cognitive and motor impairments, has been discussed [16, 21].

Literature comprehensively addressing motor assessments for IWD is limited. The importance of research in this area is highlighted in a qualitative approach [22] analysing the appropriateness of motor assessments for IWD. In addition to elaborating recommendations, this article emphasises the need for tailoring and standardising motor assessments for IWD [22]. Moreover, three systematic reviews [7, 13, 23] and one scoping review [24] examined frequency of use, sensitivity to change, and psychometric properties. Bossers et al. [13] and McGough et al. [24] identified eight frequently applied, sensitive assessments, showing good to excellent relative test-retest reliability. Fox et al. [7] found appropriate relative test-retest reliability, but insufficient absolute test-retest reliability and limited information on validity for several motor assessments. While Lee et al. [23] determined similar intraclass correlation coefficients (ICC), they applied a more stringent rating, suggesting acceptable relative test-retest reliability only for the Berg Balance Scale (BBS). Additionally, they considered the influence of different aetiologies of dementia on relative test-retest reliability, but were not able to draw conclusions due to insufficient research. In summary, these reviews provide an important basis, but do not actually allow a comprehensive quantitative evaluation of motor assessments for IWD. Previous reviews focused on frequency of use and sensitivity to change [13, 24] or just considered relative reliability and neglected other psychometric properties such as absolute reliability or validity [13, 23, 24]. They only investigated psychometric properties of the most common motor assessments without taking into account the influences of the heterogeneity of IWD [7, 13, 24] or considering further outcomes such as frequency of use or sensitivity to change [7, 23]. Moreover, information on how psychometric properties were graded was rare [13, 23, 24], no specific recommendations were suggested [7, 23], and the results of different outcomes were not combined when drawing conclusions [7]. Finally, previous randomised controlled trials (RCT) with IWD applied additional motor assessments which were not considered in previous reviews [7, 13, 23, 24].

With respect to these limitations, we identified the following main research gaps: (a) comprehensive quantitative approaches combining outcomes of identified reviews including psychometric properties, frequency of use, and effect sizes of motor assessments applied in previous RCT with IWD and (b) research on the influence of severity and aetiology of dementia and cueing on test-retest reliability. Therefore, the objectives of this systematic review are: (1) to quantitatively examine motor assessments for IWD used in previous RCT by comprehensively analysing psychometric properties (primary outcome), frequency of use, and effect sizes of those assessments (secondary outcomes) and (2) to assess the influence of severity and aetiology of dementia and cueing on test-retest reliability. Based on primary and secondary outcomes, this review derives recommendations, which contribute to creating consensus and decreasing heterogeneity of motor assessments for future research. It needs to be considered that there are several purposes and reasons for applying motor assessments. Motor assessments are essential for diagnostic purposes and to assess changes over time, e.g. in RCT. Regarding specific reasons, they are utilised to determine actual motor performance, but also to evaluate related outcomes, such as frailty [25] and risk of falls [26], or to draw conclusions on underlying cognitive performance [27]. This review focuses on motor assessments to assess changes over time, but does not further differentiate between various reasons for the use of motor assessments. Instead, it aims to provide a general overview.

Methods

For this systematic review, we considered the guidelines and recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Statement [28, 29]. Furthermore, we registered the systematic review in PROSPERO (CRD42018105399).

We performed a two-stage literature search to address the objectives of this systematic review. A first search focused on the identification of motor assessments applied in RCT in IWD. Based on these findings, a second search (main search) aimed to determine publications examining psychometric properties of the identified motor assessments. This approach ensures a focus on those motor assessments commonly applied in IWD and allows the determination of various outcomes required for a comprehensive quantitative evaluation of motor assessments for IWD. The taxonomy of the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative [30] provided the terminology and definitions of psychometric properties. In line with literature, we applied the terms relative and absolute reliability for reliability and measurement error, respectively [31]. Relative reliability, quantified by correlation coefficients, refers to the degree to which individual measurements maintain their position within a sample over repeated assessments, while absolute reliability, quantified by the standard error of measurement or minimal detectable change, is the degree to which individual measurements vary over repeated assessments [6, 31, 32].

First search

For the first search, we examined the electronic databases PubMed, Web of Science, Cochrane Library, and ALOIS between December 2016 and July 2018 without date restrictions. We applied terms related to dementia, physical activity, and motor performance to identify eligible trials (see Additional file 1 for complete search term), supplemented by manually checking references of indicative articles and reviews. Two reviewers independently screened titles and abstracts (ST and BB) and checked inclusion criteria during full-text analysis (ST and AH). Trials were eligible if they met the following criteria: (a) designed as (cluster) RCT, (b) included individuals with primary dementia (Alzheimer’s disease (AD), vascular dementia, frontotemporal dementia, and Lewy body disease) older than 65 years, (c) applied physical activity interventions (see Footnote 1), (d) used motor assessments independent of intended reasons, and (e) were published and written in English or German. We excluded comments, conference abstracts, protocols, and trial registrations. If there were disagreements, the two reviewers consulted a third reviewer (AW) to reach a consensual decision.

One reviewer (ST) extracted the following data from included RCT using a standardised extraction form: sample size, sample characteristics, motor assessments, means and standard deviations of baseline and post motor assessments, corresponding F/t statistics, and effect sizes. A second reviewer (AH) checked the outcomes. The two reviewers discussed ambiguities and disagreements in consensus meetings and consulted a third reviewer (BB) if they reached no agreement.

In addition to analysing frequency of use of identified motor assessments, we calculated time*group interaction effect sizes to represent their sensitivity to change. We determined Cohen’s d if F (time*group interaction) or t (between-group baseline-post differences) statistics, or baseline-post differences including standard deviations, were provided (formulas [34]: see Additional file 2). A Cohen’s d of 0.2, 0.5, and 0.8 represents a small, medium, and large effect size, respectively [35]. Furthermore, we considered time*group interaction effect sizes provided in RCT.
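The conversion from reported test statistics to Cohen’s d can be illustrated with a minimal sketch. The formulas actually used are those of [34] in Additional file 2, which are not reproduced here; the code below assumes the common textbook conversion for two independent groups, and the function names as well as the label "negligible" for values below 0.2 are hypothetical.

```python
import math


def cohens_d_from_t(t: float, n1: int, n2: int) -> float:
    """Cohen's d from an independent-samples t statistic.

    Assumption: standard conversion d = t * sqrt(1/n1 + 1/n2); the review's
    own formulas (Additional file 2) may differ in detail.
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


def cohens_d_from_f(f: float, n1: int, n2: int) -> float:
    """Cohen's d from an F statistic with one numerator degree of freedom.

    With two groups, F equals t squared, so the direction of the effect
    is lost and the returned value is non-negative.
    """
    return cohens_d_from_t(math.sqrt(f), n1, n2)


def label_effect_size(d: float) -> str:
    """Label |d| using the 0.2 / 0.5 / 0.8 benchmarks cited in the text."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # hypothetical label, not used in the review


if __name__ == "__main__":
    d = cohens_d_from_f(4.0, 20, 20)
    print(round(d, 3), label_effect_size(d))
```

Treating the benchmarks as lower bounds of each band is itself an interpretation; the text only states which values represent small, medium, and large effects.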

This first search primarily aimed to identify motor assessments used in previous RCT with IWD and served as basis for the main search. Hence, we did not assess risk of bias.

Main search

For the main search, we examined the electronic databases PubMed, Web of Science, Cochrane Library, and Scopus (no date restrictions) between August and September 2018 for terms related to dementia, psychometric properties, and motor assessments identified in the first search (see Additional file 3 for complete search term). Additionally, we manually checked reference lists of indicative articles. Two reviewers (ST and PM) independently screened titles and abstracts and checked inclusion criteria during full-text analysis. Trials were eligible if they fulfilled the following criteria: (a) examined psychometric properties (content validity, construct validity, criterion validity, internal consistency, intra-rater reliability, inter-rater reliability, test-retest reliability, relative and absolute reliability) of (b) motor assessments in (c) individuals with primary dementia (AD, vascular dementia, frontotemporal dementia, and Lewy body disease) aged above 65 years, (d) applied Mini-Mental State Examination (MMSE) [36], and (e) were written and published in English or German. We excluded comments and conference abstracts. The two reviewers discussed disagreements and consulted a third reviewer (BB) to resolve remaining discrepancies.

Two reviewers (ST and PM) independently extracted the following information from eligible investigations utilising a standardised data extraction form: sample size, sample characteristics, motor assessments, methodologies, and statistics of psychometric properties. Moreover, they independently assessed risk of bias of individual investigations with the COSMIN checklist [37, 38]. The two reviewers resolved disagreements through discussion and consulted a third reviewer (BB) if necessary.

Afterwards, we analysed findings of eligible investigations in a systematic narrative synthesis and summarised extracted information. In order to allow comparability of minimal detectable change values, we calculated percentage minimal detectable changes at 95% confidence interval (MDC95%) if any standard error of measurement or minimal detectable change was reported (formulas [39, 40]: see Additional file 4).
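How a percentage minimal detectable change can be derived from a standard error of measurement is sketched below. The exact formulas applied in this review are given in Additional file 4 and are not reproduced here; the sketch assumes the widely used expressions SEM = SD × √(1 − ICC), MDC95 = 1.96 × √2 × SEM, and MDC95% = MDC95 / mean × 100. All function names are hypothetical.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from a sample SD and an ICC.

    Assumption: SEM = SD * sqrt(1 - ICC), a common definition.
    """
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem: float) -> float:
    """Minimal detectable change at the 95% confidence level.

    Assumption: MDC95 = 1.96 * sqrt(2) * SEM.
    """
    return 1.96 * math.sqrt(2.0) * sem


def mdc95_percent(sem: float, mean: float) -> float:
    """MDC95 expressed as a percentage of the sample mean."""
    return mdc95(sem) / mean * 100.0


if __name__ == "__main__":
    # Illustrative numbers only, not data from the review.
    sem = sem_from_icc(sd=2.0, icc=0.75)
    print(round(mdc95_percent(sem, mean=10.0), 1))
```

Expressing the value relative to the mean is what makes assessments with different units (seconds, repetitions, points) comparable.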

Moreover, we rated the results of each study against the COSMIN criteria for good measurement properties [41]. Since information on minimal important change of considered motor assessments in IWD is rare [17], and no other firm criteria for acceptable values [42] are available, we considered a MDC95% higher than 30% as unacceptable [43, 44]. Based on COSMIN reliability criteria for good measurement properties [41] and indications for unacceptable values [43, 44], we rated relative and absolute reliability as follows:

  • sufficient relative/absolute reliability (+): ICC ≥ 0.70/minimal detectable change at 95% confidence interval < minimal important change

  • indeterminate relative/absolute reliability (?): ICC not reported/minimal important change not defined

  • insufficient relative/absolute reliability (−): ICC < 0.70/minimal detectable change at 95% confidence interval > minimal important change

  • unacceptable absolute reliability (↓): MDC95% > 30%
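The rating scheme listed above can be written down as a small set of rules. The following is a minimal sketch; the function names are hypothetical, and the handling of a minimal detectable change exactly equal to the minimal important change is an assumption, as the criteria do not cover this case.

```python
from typing import Optional


def rate_relative_reliability(icc: Optional[float]) -> str:
    """Rate relative reliability from an ICC (None means not reported)."""
    if icc is None:
        return "indeterminate"
    return "sufficient" if icc >= 0.70 else "insufficient"


def rate_absolute_reliability(mdc95: Optional[float],
                              mic: Optional[float]) -> str:
    """Rate absolute reliability by comparing MDC95 with the minimal
    important change (MIC), both in the units of the assessment."""
    if mdc95 is None or mic is None:
        return "indeterminate"
    if mdc95 < mic:
        return "sufficient"
    # Assumption: equality is treated conservatively as insufficient.
    return "insufficient"


def is_unacceptable(mdc95_percent: Optional[float]) -> bool:
    """Flag an MDC95% above the 30% limit as unacceptable."""
    return mdc95_percent is not None and mdc95_percent > 30.0


if __name__ == "__main__":
    print(rate_relative_reliability(0.82))       # sufficient
    print(rate_absolute_reliability(2.5, None))  # indeterminate
    print(is_unacceptable(45.0))                 # True
```

Keeping the 30% flag separate from the MIC comparison mirrors the list, where an assessment can be rated against the minimal important change and additionally be flagged as unacceptable.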

Subsequently, we summarised overall evidence and graded quality of evidence using the Grading of Recommendations Assessment, Development, and Evaluation approach, which considers risk of bias, inconsistency, imprecision, and indirectness of included investigations [41, 45]. Additionally, we analysed the influence of severity and aetiology of dementia and cueing on test-retest reliability. To this end, we determined severity of dementia according to reported MMSE values (mild: MMSE = 26–17, moderate: MMSE = 17–10, severe: MMSE < 10 [46,47,48]) and/or classification of publications if range of MMSE was not reported. Due to insufficient information on aetiology, we were only able to compare between AD and various or not reported types. In accordance with Muir-Hunter et al. [14], we defined cueing as “providing any additional verbal, visual, or tactile direction necessary to ensure correct performance of the task after the initial set of standardized instructions was given”. To investigate its influence on test-retest reliability, we classified cueing in five categories, considering information in identified psychometric property studies: (a) not reported, (b) no cueing, (c) verbal cueing, (d) verbal and visual/tactile cueing, and (e) more extensive cueing than (c) and (d) including physical assistance.
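The subgroup coding described in this paragraph can be sketched as follows. The reported MMSE bands share the boundary value 17, so the sketch assumes that a score of exactly 17 counts as mild; scores above 26 lie outside the reported bands and are left unclassified. The names `classify_severity` and `Cueing` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Cueing(Enum):
    """The five cueing categories (a) to (e) used for the subgroup analysis."""
    NOT_REPORTED = "a"
    NONE = "b"
    VERBAL = "c"
    VERBAL_AND_VISUAL_OR_TACTILE = "d"
    EXTENSIVE_WITH_PHYSICAL_ASSISTANCE = "e"


def classify_severity(mmse: float) -> Optional[str]:
    """Map an MMSE score (0-30) to a severity band.

    Assumptions: a score of exactly 17 is coded as mild, because the
    reported bands overlap at that value; scores above 26 return None,
    because they fall outside the bands given in the text.
    """
    if not 0 <= mmse <= 30:
        raise ValueError("MMSE scores range from 0 to 30")
    if mmse > 26:
        return None
    if mmse >= 17:
        return "mild"
    if mmse >= 10:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    for score in (24, 17, 12, 6):
        print(score, classify_severity(score))
```

When a study reports only a severity label and no MMSE range, the label would be used directly instead of this mapping, as stated in the text.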

Results

Systematic searches (first and main search)

The first search revealed 5007 publications. After removing duplicates and initial screening on titles and abstracts, we screened the full texts of 309 publications and included 46 RCT for further analysis. For the main search, we obtained 902 publications. Removing duplicates and initial screening on titles and abstracts yielded 68 publications, of which we scanned full texts. Eventually, we included 21 eligible investigations in the narrative data synthesis (see Fig. 1, further information on study characteristics and data extractions are provided in Additional files 5, 6, 7 and 8).

Fig. 1 Flow of information (IWD: individuals with dementia, MMSE: Mini-Mental State Examination, n: number, RCT: randomised controlled trial)

Motor assessments applied in previous randomised controlled trials

Previous RCT with IWD utilised 57 different motor assessments to determine balance, mobility and gait, strength, endurance, flexibility, and functional performance. Psychometric properties of 28 of these assessments were investigated in IWD. Table 1 contains a short description of all identified motor assessments with available psychometric property studies (see Additional file 9 for motor assessments identified during first search without available information on psychometric properties).

Table 1 Description, frequency of use, and effect sizes of motor assessments applied in previous randomised controlled trials

Psychometric properties

Seventeen of twenty-one studies examining psychometric properties focused on inter-rater and/or test-retest reliability. Herein, they determined consistency among different evaluators simultaneously rating the same participant, and between repeated measurements, respectively [32]. Investigations assessing content, construct, and criterion validity, internal consistency, and intra-rater reliability were rare. Thus, we only summarised results and did not derive conclusions.

Summary for content, construct, and criterion validity, internal consistency, and intra-rater reliability (see Footnote 2)

The systematic search did not identify any investigation examining content validity. Based on hypotheses testing or revealing known group differences, construct validity was suggested for Physiomat assessments, the Erlangen Test of Activities of Daily Living (E-ADL Test), and knee extensor strength assessed with dynamometers [53, 110, 111, 114]. Seven investigations include information on criterion validity (concurrent and predictive validity), correlation with, or prediction of external criteria. For the E-ADL Test, criterion related validity was determined based on the relation between achieved scores and level of care [111]. Concurrent validity with spatiotemporal gait parameters or 2D-video motion analysis was established for a modified BBS, Short Physical Performance Battery (SPPB), and Assessment of Compensatory Sit-to-Stand Maneuvers in People With Dementia (ACSID) [26, 99]. Moreover, both the SPPB and 6-min walk test (6 min WT) significantly correlated with peak oxygen consumption (assessed with a cycle ergometer test), suggesting that these assessments are useful in identifying individuals with low aerobic capacity [115]. Furthermore, knee extensor strength was found to be a significant predictor for several activities of daily living, gait, and sit-to-stand (STS) performance [114, 116]. No predictive validity concerning future falls could be observed for Timed Up & Go Test (TUG), Performance Oriented Mobility Assessment (POMA), and Five Times Sit-to-Stand Test (5x STS) [117].

Considering internal consistency, three studies observed Cronbach’s α between 0.37 and 0.77 for E-ADL Test [110, 111] and 0.95 for BBS [15]. Furthermore, one study examining ACSID total score determined intra-rater reliability based on ICC ranging between 0.72 and 0.90 [99].

Inter-rater reliability (relative and absolute reliability)

Five studies assessed inter-rater reliability of nine assessments. ICC ranged from 0.72 to 1.00 and MDC95% included values between 0.0 and 98.0% [14, 15, 43, 99, 118]. Accordingly, all assessments reached sufficient relative inter-rater reliability. Quality of evidence for relative inter-rater reliability was high for BBS, moderate for TUG, and low or very low for all other assessments. Grading MDC95%, TUG and 6-m walk test (6 m WT) showed sufficient absolute inter-rater reliability, while it was insufficient/unacceptable for 4-m walk test (4 m WT), and indeterminate for all other assessments. Quality of evidence for absolute inter-rater reliability was low for 6 m WT and 30-s chair stand test (30s CST), and moderate for all remaining assessments (see Table 2).

Table 2 Relative and absolute inter-rater reliability

Regarding balance assessments, ICC were higher for Groningen Meander Walking Test (GMWT) and BBS than for Functional Reach Test (FR). Furthermore, MDC95% were lower for BBS compared to GMWT. Focusing on GMWT, time measurement showed lower MDC95% than number of oversteps. For mobility and gait, ICC increased and MDC95% decreased from 4 m WT, through 6 m WT, to TUG. Considering strength assessments, ICC were higher for 30s CST counting repetitions than for ACSID rating STS performance, while MDC95% was only determined for 30s CST. Since ICC was only assessed for 6 min WT, a comparison of inter-rater reliability of endurance assessments was not possible (see Table 2).

Test-retest reliability (relative and absolute reliability)

Fifteen studies investigated test-retest reliability considering 24 assessments. ICC ranged between 0.02 and 0.99 and MDC95% varied from 6.8 to 225.7% [5, 6, 14, 17, 26, 43, 51, 53, 63, 102, 110, 114, 118, 120, 121] (see Table 3).

Table 3 Relative and absolute test-retest reliability

Most studies focused on between-day test-retest reliability, while some studies examined within-day and within-session test-retest reliability. Comparing these studies, ICC increased and MDC95% decreased, respectively, from between-day (ICC = 0.02–0.99, MDC95% = 6.8–225.7% [5, 14, 17, 43, 51, 53, 63, 102, 118, 120, 121]), through within-day (ICC = 0.79–0.99, MDC95% = 21.1–30.0% [6, 26, 118]), to within-session test-retest reliability (ICC = 0.95–0.98 [114]).

Balance

Six investigations assessing test-retest reliability of eleven balance assessments determined ICC and MDC95% ranging between 0.32–0.99 and 10.2–225.7%, respectively [14, 17, 43, 51, 53, 63]. Relative test-retest reliability was sufficient for all balance assessments except for Limits of Stability, Step Quick Turn Test, and simple condition of Physiomat-Trail-Making Task. However, quality of evidence for relative test-retest reliability was low or very low for most assessments. Only GMWT (time) and BBS reached moderate quality of evidence. Absolute test-retest reliability for balance assessments was indeterminate or unacceptable with moderate to very low quality of evidence (see Table 3).

GMWT (time) and BBS showed the highest ICC, while we could not observe a clear tendency for MDC95%. Comparing different outcomes of GMWT, ICC were higher and MDC95% were lower for time than for number of oversteps (see Table 3).

Mobility and gait

Nine studies investigated test-retest reliability of six mobility and gait assessments. They reported ICC between 0.50 and 0.99 and MDC95% from 6.8 to 84.3% [5, 6, 14, 17, 26, 43, 51, 102, 121]. Relative test-retest reliability was sufficient for TUG, manual TUG, 6 m WT, 4 m WT, and instrumented gait analysis (except for cadence variability, walking speed variability, and walking speed assessed with NeuroCom Balance Master), while it was insufficient for cognitive TUG. Quality of evidence for relative test-retest reliability was high for TUG, moderate to very low for instrumented gait analysis, and low or very low for all other assessments. Absolute test-retest reliability was indeterminate for spatiotemporal gait parameters, insufficient/unacceptable for variability gait parameters, 4 m WT, and 6 m WT, and sufficient for manual TUG. For TUG, cognitive TUG, and walking speed assessed with instrumented gait analysis, absolute test-retest reliability was sufficient according to COSMIN criteria but unacceptable when applying MDC95% limit of 30%. Except for TUG and walking speed assessed with instrumented gait analysis (high/moderate quality of evidence), quality of evidence for absolute test-retest reliability was low or very low (see Table 3).

Considering up and go tasks, ICC were higher for single than for dual task conditions. Focusing on short distance walk tests (WT), MDC95% were lower for 6 m WT than for 4 m WT. Furthermore, the comparison of different gait parameters assessed with instrumented gait analysis showed lower ICC and higher MDC95% for variability measures than for spatiotemporal gait parameters. Comparing different assessments to determine short distance walking speed showed higher ICC and lower MDC95% for instrumented gait analysis (except for NeuroCom Balance Master) than for simple short distance WT (see Table 3).

Strength

Five studies focusing on test-retest reliability of strength assessments reported ICC and MDC95% ranging between 0.02–0.98 and 21.8–80.2%, respectively [17, 51, 102, 114, 120]. Relative test-retest reliability was sufficient for modified 30s CST, 5x STS, handgrip dynamometers (except for severe dementia and one-time measuring), and maximum isometric strength assessed with dynamometers (except for dorsiflexor and iliopsoas muscle strength), while it was insufficient for STS on NeuroCom Balance Master (except for Rising Index). Quality of evidence for relative test-retest reliability was high for handgrip dynamometers and low or very low for all other strength assessments. Absolute test-retest reliability was indeterminate for 5x STS and Rising Index of STS on NeuroCom Balance Master, and unacceptable for modified 30s CST, centre of gravity sway velocity of STS on NeuroCom Balance Master, and handgrip dynamometers. Quality of evidence for absolute test-retest reliability was low or very low for all assessments (see Table 3).

Comparing different STS assessments, ICC for assessments performing only one STS repetition were lower (except for Rising Index) than STS assessments with more repetitions. Moreover, MDC95% increased from 5x STS, through modified 30s CST, to STS on NeuroCom Balance Master (except for Rising Index) (see Table 3).

Endurance

Considering endurance, test-retest reliability was only determined for 6 min WT. Two studies observed ICC between 0.75 and 0.98, while MDC95% ranged from 21.2 to 28.9% [6, 118]. Accordingly, relative test-retest reliability was sufficient with moderate to very low quality of evidence. Absolute test-retest reliability was indeterminate with low quality of evidence (see Table 3).

Functional performance

Functional performance was rarely assessed. One study focusing on the E-ADL Test did not determine ICC and MDC95%, but found significant correlations for the whole test (r = 0.73) and separate items (r = 0.35–0.63) [110]. Quality of evidence was very low.

Influence of severity and aetiology of dementia and cueing on test-retest reliability

With respect to severity of dementia, the Frailty and Injuries: Cooperative Studies of Intervention Techniques - subtest 4 (FICSIT-4) and GMWT tend to yield higher ICC and/or lower MDC95% with less cognitive impairment. In contrast, ICC were slightly higher and/or MDC95% lower with stronger cognitive impairment for BBS, 6 m WT, modified 30s CST, and 5x STS (see Table 4).

Table 4 Subgroup analysis of test-retest reliability considering severity of dementia

Regarding aetiology of dementia, maximum isometric strength assessed with dynamometers and short distance walking speed (except for instrumented gait analysis with NeuroCom Balance Master) resulted in somewhat higher ICC and/or lower MDC95% for AD vs. various or not reported types. In contrast, ICC were slightly higher and/or MDC95% were lower for various or not reported types vs. AD for BBS, TUG (between-day reliability), up and go tasks in general (between-day reliability), 5x STS, and STS tasks in general (except for Rising Index) (see Table 5).

Table 5 Subgroup analysis of test-retest reliability considering aetiology of dementia

Considering cueing, GMWT and TUG showed somewhat higher ICC and/or lower MDC95% when cueing was allowed or more extensive. In contrast, ICC were slightly higher and/or MDC95% were lower for no cueing or less extensive cueing in FR, short distance WT, and short distance walking speed (see Table 6).

Table 6 Subgroup analysis of test-retest reliability considering cueing

Frequency of use and effect sizes of motor assessments applied in previous randomised controlled trials

TUG, BBS, 5x STS, POMA, 30s CST, and instrumented gait analysis were the most frequently applied assessments, utilised in six to 16 RCT. We were only able to calculate effect sizes for 12 studies, as F/t statistics and/or standard deviations of baseline-post differences were infrequently reported. Effect sizes were large for FR, BBS, POMA, TUG, instrumented gait analysis, 5x STS, ACSID, and 30s CST (see Table 1/Additional file 9 for motor assessments identified during first search without available information on psychometric properties).

Summary and derivation of recommendations

Aiming to derive comprehensive recommendations on motor assessments for IWD, we combined the results of primary and secondary outcomes for each physical domain as summarised in Table 7.

Table 7 Summary of outcomes to derive recommendations for motor assessments for individuals with dementia

Considering all information on primary and secondary outcomes, the derived recommendations include the following motor assessments:

  • Balance: FR, GMWT (time), BBS, and POMA

  • Mobility and gait: TUG and instrumented gait analysis to assess spatiotemporal gait parameters

  • Strength: STS assessments with more than one repetition

  • Endurance: 6 min WT

  • Functional Performance: No recommendation possible, due to insufficient research on psychometric properties

These recommendations are based on several outcomes rated in the highest category or one outcome rated in the highest and at least two in the second category (see Table 7).
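The decision rule stated above can be expressed compactly. The following sketch assumes that each outcome of an assessment in Table 7 has been coded as a rank (1 for the highest category, 2 for the second, and so on) and that "several" means at least two; both the coding and the function name are hypothetical.

```python
from typing import Sequence


def is_recommended(outcome_ranks: Sequence[int]) -> bool:
    """Apply the recommendation rule to the outcome ranks of one assessment.

    Assumptions: rank 1 is the highest category, rank 2 the second, and
    "several outcomes" is read as two or more.
    """
    highest = sum(1 for rank in outcome_ranks if rank == 1)
    second = sum(1 for rank in outcome_ranks if rank == 2)
    return highest >= 2 or (highest == 1 and second >= 2)


if __name__ == "__main__":
    # Illustrative codings only, not the ratings from Table 7.
    print(is_recommended([1, 1, 3]))  # True
    print(is_recommended([1, 2, 2]))  # True
    print(is_recommended([1, 2, 3]))  # False
```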

Discussion

We addressed the purpose of this systematic review to quantitatively examine motor assessments for IWD by comprehensively analysing psychometric properties (primary outcome), frequency of use, and effect sizes (secondary outcomes) in a two-stage literature search. Recommendations on motor assessments are based on primary and secondary outcomes. Additionally, we analysed the influence of severity and aetiology of dementia and cueing on test-retest reliability.

Findings on primary and secondary outcomes

The systematic search identified only a few investigations examining validity, internal consistency, and intra-rater reliability of motor assessments in IWD. Thus, we were not able to draw further conclusions or consider these outcomes for deriving recommendations. Summarising findings for inter-rater reliability shows sufficient relative inter-rater reliability and relatively low MDC95% of considered motor assessments. Hence, they are objective measures to determine motor performance in IWD. Motor assessments analysing time in tasks of short duration, such as 4 m WT, should, however, be treated with caution, as small measurement errors may significantly influence absolute inter-rater reliability. With respect to test-retest reliability, the majority of identified investigations observed sufficient relative test-retest reliability, while absolute test-retest reliability was mainly indeterminate or unacceptable. This supports their usage to investigate changes on a group level, but does not allow assessing intra-individual changes [7, 17, 31]. Moreover, increasing test-retest reliability from between-day, through within-day, to within-session investigations may be related to fluctuating daily forms in IWD. We expect that characteristics of daily form, such as mood or motivational aspects, remain relatively constant within short intervals, while they potentially alter with increasing time. More research is necessary to develop criteria to determine daily form, aiming to ensure comparable conditions in longitudinal investigations. Besides, fluctuating daily forms in IWD may have contributed to observed unacceptable absolute test-retest reliability. Other explanations refer to high intra-individual variability in IWD and related inappropriate or naive selection of metrics, which do not account for this variability.

Regarding frequency of use, previous trials predominately applied clinical motor assessments established in healthy older adults or various clinical populations, while those considering specific characteristics of IWD such as GMWT, Physiomat, or ACSID, were less frequently applied. This may be related to their first introduction between 2014 and 2018. Due to insufficient information in previous RCT, we were only able to determine time*group interaction effect sizes for 38% of analysed motor assessments. Based on large effect sizes reported in at least one RCT, we assumed sensitivity to change for most of these assessments.

Findings on influence of severity and aetiology of dementia and cueing on test-retest reliability

Considering severity of dementia, we expected decreasing test-retest reliability with increasing cognitive impairment. This assumption was true for FICSIT-4 and GMWT but not for all assessments. Severity of dementia may only influence specific assessments, for example those with complex instructions or assessing outcomes frequently impaired in IWD, such as balance [10]. Unexpectedly, we observed increasing test-retest reliability with increasing severity of dementia for BBS, 6 m WT, modified 30s CST, and 5x STS. However, these observations were only based on single studies, which partly differed in characteristics, such as aetiology of dementia.

Regarding the aetiology of dementia, test-retest reliability of BBS and up and go tasks was lower for AD than for various or unreported types of dementia. Both assessments consist of several short tasks and include multi-step instructions. Compared to other aetiologies, individuals with AD may have more difficulties in understanding and/or remembering such instructions, which potentially influences test-retest reliability [14, 23, 122]. In contrast, test-retest reliability of walking speed was higher in AD, which could be related to later occurring gait impairments in AD [20]. Additional research on aetiologies, however, is required to understand lower test-retest reliability of STS tasks and higher test-retest reliability of maximum isometric strength assessed with dynamometers in AD.

Analysing the influence of cueing on test-retest reliability revealed higher test-retest reliability when cueing was allowed or more extensive for GMWT and TUG, which are assessments consisting of unfamiliar or several short tasks. Cueing possibly stabilises motor performance by supporting impaired cognitive performance and thus improves test-retest reliability. In contrast, short distance WT, for which test-retest reliability was higher when cueing was not allowed or less extensive, are close to everyday life, include single-stage tasks, and rely on well automated movement processes that do not require additional cognitive support. Accordingly, cueing may instead distract IWD, destabilising performance and thereby decreasing test-retest reliability. No explanation is available for the same association in FR.

Based on these observed influences, we derived the following suggestions:

  • Put emphasis on simple instructions, especially for IWD with advanced stages or AD.

  • Consider individual cognitive and motor deficits when selecting motor assessments.

  • Only use cueing for motor assessments where it is inevitable.

Recommendations and need for future research

Recommendations for balance assessments include FR, GMWT (time), BBS, and POMA. Due to infrequent use and insufficient research on psychometric properties, feasibility and sensitivity to change of GMWT and psychometric properties of POMA require further investigation. Focusing on mobility and gait, we suggest applying TUG and spatiotemporal gait parameters assessed with instrumented gait analysis. Comparing different gait analysis systems, NeuroCom Balance Master, however, seems to be less suitable. Although results so far are insufficient or equivocal, future research should investigate short distance WT of different distances, as instrumented gait analysis systems may not be available for all studies. Considering strength, we suggest applying STS assessments comprising more than one repetition, which, however, predominantly determine functional performance of lower limbs. Thus, further evaluation of strength assessments, including upper limb strength and measures allowing conclusions on actual strength performance, is required. Moreover, we suggest using the 6 min WT as an endurance assessment for IWD. Future research on endurance assessment, however, is crucial since this was the only identified assessment. As information on psychometric properties is insufficient, we are not able to recommend any functional performance assessment. Based on secondary outcomes, some indications are available for SPPB. However, psychometric properties of SPPB and other functional performance assessments need to be investigated in future studies.

Comparison with state of research

Recommendations of motor assessments in this review are largely in line with those of previous reviews [13, 24]. Small discrepancies may be related to differences in identified assessments and studies, different prioritisation of considered outcomes, and divergent criteria for good measurement properties. Additionally, this review, consistent with Fox et al. [7], determined sufficient relative test-retest reliability for the majority of motor assessments in IWD, but noted high MDC95% reflecting unacceptable absolute test-retest reliability.

Similarly, motor assessments recommended in this review are mainly in line with those elaborated in a qualitative approach [22]. However, FICSIT, 6 m WT, SPPB, and Physical Performance Test were rated appropriate in the qualitative approach, but could not be recommended based on quantitative outcomes as they were infrequently used or insufficiently investigated. Further discrepancies regarding FR, which was rated inappropriate but can be recommended based on quantitative outcomes, require additional examination. Moreover, some general indications related to the consideration of specific characteristics and cueing are consistently suggested. Accordingly, this review largely supports the recommendations elaborated in the qualitative approach.

General considerations on primary and secondary outcomes

The interpretation of findings regarding psychometric properties is challenging, as there are no firm criteria for acceptable reliability in the literature [31]. Regardless of concrete criteria, ICC not only reflect relative reliability but can also be related to sample size or variability in the sample [123]. Accordingly, trial-to-trial consistency can be poor despite a high ICC. Thus, it is advised not to focus on single estimates of reliability and to additionally consider absolute reliability [17, 31]. Due to the lack of information on minimal important change of motor assessments in IWD, we could scarcely apply COSMIN criteria for absolute reliability. Besides, Smidt et al. [42] arbitrarily defined that a difference of 10% in minimal detectable change would be acceptable. Other research groups referred to them and introduced another cut-off of 30% without any justification [43, 44]. In the absence of other criteria, we adopted this cut-off of 30% to identify unacceptable MDC95%, but not to conclude on sufficient absolute reliability.
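The relation between ICC, sample variability, and MDC95% can be made concrete with a short calculation. The sketch below assumes the commonly used formulas SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM; the included studies may have derived these quantities differently. The function names and all numerical values (a timed task with a mean of 20 s and an SEM of 2.5 s in two hypothetical samples) are illustrative assumptions, not data from this review.

```python
import math


def icc_from_sem(sem: float, sd: float) -> float:
    """ICC implied by SEM = SD * sqrt(1 - ICC), rearranged for ICC."""
    return 1.0 - (sem / sd) ** 2


def mdc95(sem: float) -> float:
    """Minimal detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2.0) * sem


def mdc95_percent(sem: float, mean: float) -> float:
    """MDC95 expressed as a percentage of the sample mean."""
    return 100.0 * mdc95(sem) / mean


def rate_absolute_reliability(mdc_pct: float, cutoff: float = 30.0) -> str:
    """Apply the 30% cut-off only to flag unacceptable values.

    Values at or below the cut-off are labelled indeterminate, because the
    cut-off is not used to conclude on sufficient absolute reliability.
    """
    return "unacceptable" if mdc_pct > cutoff else "indeterminate"


# Hypothetical samples with identical measurement error but different spread.
SEM_S = 2.5
MEAN_S = 20.0
icc_heterogeneous = icc_from_sem(SEM_S, sd=8.0)  # wide between-subject spread
icc_homogeneous = icc_from_sem(SEM_S, sd=4.0)    # narrow between-subject spread
mdc_pct = mdc95_percent(SEM_S, MEAN_S)

print(f"ICC (SD = 8 s): {icc_heterogeneous:.2f}")
print(f"ICC (SD = 4 s): {icc_homogeneous:.2f}")
print(f"MDC95%: {mdc_pct:.1f}% -> {rate_absolute_reliability(mdc_pct)}")
```

With these assumed values, the heterogeneous sample yields an ICC of about 0.90 and the homogeneous sample about 0.61, although the measurement error is identical and MDC95% is about 34.6% in both cases. This illustrates why a high ICC alone does not guarantee acceptable trial-to-trial consistency.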

Frequency of use and effect sizes do not necessarily allow conclusions to be drawn on the quality of motor assessments and should not be overestimated. Regardless of appropriateness and meaningfulness, researchers may decide to apply motor assessments because they are commonly used or easy to utilise. Nonetheless, frequency of use can provide indications about the feasibility of motor assessments, based on the assumption that unfeasible motor assessments do not disseminate as well as feasible ones. Similarly, effect sizes can provide information on sensitivity to change, but also depend on the effectiveness of interventions.

Strengths and limitations

To our knowledge, this is the first systematic review utilising a comprehensive approach that combines different outcomes of previous reviews by performing an extensive two-stage literature search. However, we need to acknowledge a potential risk of bias regarding the selection of considered motor assessments. Because the analysis was restricted to motor assessments applied in RCT, some assessments may be missing. Furthermore, large heterogeneity of included psychometric property studies limits the meaningfulness of derived recommendations. As psychometric properties are potentially influenced by various determinants, such as sample size, sample characteristics including severity and aetiology of dementia, cueing, test-retest interval, or considered outcomes, we cannot ensure that the deductions on psychometric properties are true and not randomly caused by differing determinants. Therefore, false assumptions, undetected influences or relations, and random observations may have occurred. Similarly, the consideration of several influences on test-retest reliability only allows rough estimations, which could also be affected by heterogeneity of analysed studies. Moreover, insufficient information on execution of motor assessments, severity and aetiology of dementia, and cueing in available investigations impeded detailed analyses and limited the meaningfulness of observations. Accordingly, the elaborated recommendations should be used with care, and further research investigating psychometric properties and dementia-specific influences on test-retest reliability is required.

Conclusion

Despite the necessity for further research in various areas, this review establishes an important foundation for future investigations. Additionally, direct implications for studies determining the effectiveness of physical activity on motor performance in IWD can be derived. However, the elaborated recommendations cannot be considered final conclusions, since the analysis of primary and secondary outcomes reveals several challenges and areas of insufficient research, and since the recommendations focus only on quantitative aspects. Furthermore, new assessments developed specifically for IWD are required. Such assessments can be based on prior tasks but should consider specific characteristics of IWD. Additionally, it is of high importance to standardise motor assessments and cueing to ensure comparability between studies. Herein, standardisation refers to selection and performance procedures of motor assessments and external cues. Currently, a wide range of motor assessments (e.g. previous RCT applied 19 different balance assessments) with different performance procedures (e.g. different ratings or modifications) as well as various external cues (e.g. clearly defined verbal cues vs. as much assistance as needed) are frequently applied to determine the same motor functions or quantities. Accordingly, recommendations on specific motor assessments as well as indications on assessment procedures elaborated in quantitative and qualitative (see [22]) approaches are important to improve standardisation. Evidence on the effectiveness of physical activity can help IWD gain access to physical activity interventions and thereby positively influence their quality of life. Establishing such evidence, however, is not possible without appropriate, sensitive, valid, reliable, and standardised motor assessments that consider the characteristics of each individual.