Measurement strategy and statistical power in studies assessing gait stability and variability in older adults

Background Gait variability and stability measures might be useful to assess gait quality changes after fall prevention programs. However, reliability of these measures appears limited. Aims The objective of the present study was to assess the effects of measurement strategy in terms of numbers of subjects, measurement days and measurements per day on the power to detect relevant changes in gait variability and stability between conditions among healthy older adults. Methods Sixteen healthy older participants [65.6 (SD 5.9) years] performed two walking trials on each of 2 days. Required numbers of subjects to obtain sufficient statistical power for comparisons between conditions within subjects (paired, repeated-measures designs) were calculated (with confidence intervals) for several gait measures and for different numbers of trials per day and for different numbers of measurement days. Results The numbers of subjects required to obtain sufficient statistical power in studies collecting data from one trial on 1 day in each of the two compared conditions ranged from 7 to 13 for large differences and highly correlated data between conditions, up to 78–192 for data with a small effect and low correlation. Discussion Low correlations between gait parameters in different conditions can be assumed and relatively small effects appear clinically meaningful. This implies that large numbers of subjects are generally needed. Conclusion This study provides the analysis tools and underlying data for power analyses in studies using gait parameters as an outcome of interventions aiming to reduce fall risk.


Introduction
A large proportion of falls in older adults occurs during locomotion [1][2][3]. These falls are often attributed to a decreased quality of gait, due to age-related, peripheral [4] and central [5] impairments. Gait variability and local dynamic stability have received much attention as fall-related measures of gait quality [6,7] and several studies have confirmed that these parameters are, indeed, related to fall risk [8][9][10][11][12][13]. Although ultimately the ability to predict actual fall risk would remain to be shown, the use of gait quality measures as outcome variables in intervention studies might allow faster iterative development of fall prevention programs, as assessing actual fall risk by gathering fall incidence data requires a long follow-up period. While reliability of gait variability and stability estimates can to some extent be improved by treadmill walking to collect data from a large number of strides [14][15][16][17], a recent study indicated that reliability between sessions is still only moderate [18]. The statistical consequences of limited test-retest reliability can be overcome by adjusting the measurement strategy, but previous reports do not allow inferences on optimal measurement strategies. In studies investigating differences in gait quality between conditions in a population, the optimal measurement strategy, in terms of the number of subjects and the number of measurements per subject, depends on the variance of the gait parameters between and within subjects.
The first and main aim of this study was to estimate between- and within-subject variance components of gait variability and stability measures in treadmill walking, to allow estimation of the number of subjects necessary to obtain sufficient statistical power in studies that are aimed at detecting relevant differences between conditions in a repeated-measures design using subjects as their own controls. The second aim was to determine how the number of measurement days or measurements per day (i.e., the within-subject data collection strategy) influences the required numbers of subjects to detect differences between conditions with sufficient statistical power.

Subjects
Sixteen older subjects [n female = 9, n male = 7, mean age 65.6 (SD 5.9) years, mean weight 77.5 (SD 15.3) kg, mean height 1.74 (SD 0.09) m], without physical impairments interfering with their walking ability, participated in this study. All subjects gave informed written consent. The ethics committee of the Faculty of Human Movement Sciences, VU University Amsterdam approved the experimental protocol in accordance with the Declaration of Helsinki.

Study design
Time series of 5 min of treadmill walking at 3.0 km h⁻¹ were collected during four trials (two trials on each of 2 days). In between the walking trials, subjects performed a 15-min trial of perturbed walking at 3.0 km h⁻¹ for another study. Subjects were allowed to rest as long as needed in between walking trials. The median number of days in between the two measurement days was 5 (range . Subjects were asked to perform their normal activities on the day before each measurement day.

Procedure
Upon arrival at the laboratory, each subject was first informed about the measurement procedure and then familiarized with treadmill walking. Subjects were allowed to practice treadmill walking for any amount of time. In general, subjects were comfortable with treadmill walking within 5 min. Subjects were instrumented with clusters of three LEDs on the trunk, at the level of T6, and on both feet. An optoelectronic system (Optotrak Northern Digital Inc., Waterloo, Ontario) measured the LED positions at 50 samples s⁻¹.

Gait measures
The extracted gait variability measures were variability of medio-lateral trunk center of mass velocity (VAR ml ), stride-time-variability (VAR ST ) and step-width-variability (VAR SW ) of the final 150 strides of each trial (approximately the final 2-3 min). VAR ml was calculated as the mean of the standard deviations of medio-lateral trunk velocities at each increment of normalized time (0-100 %) of the measured strides. Trunk center of mass position was estimated based on the position of the LED-cluster attached to the trunk, trunk circumference and the position of several bony landmarks relative to the cluster [19]. The data were low-pass filtered (20 Hz, second-order lowpass Butterworth), for gait variability measures only, before 3-point differentiation to obtain trunk velocities. VAR ST was calculated as the standard deviation of the final 150 stride times. Stride time was calculated as the time between consecutive foot contacts of the same foot, which were determined as the local minima of the vertical position of the feet cluster markers.
Step width was calculated as the maximal perpendicular distance relative to the walking direction between the lateral malleoli for each step. VAR SW was calculated as the standard deviation of the step widths of the final 300 steps.
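The variability measures described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the linear interpolation used for time normalization, and the use of the sample standard deviation (`ddof=1`) are assumptions, since the text only speaks of "the standard deviation".

```python
import numpy as np

def time_normalize(signal, n_points=101):
    """Resample one stride onto 0-100 % of the stride cycle (linear interpolation, assumed)."""
    signal = np.asarray(signal, dtype=float)
    old = np.linspace(0.0, 100.0, len(signal))
    new = np.linspace(0.0, 100.0, n_points)
    return np.interp(new, old, signal)

def var_ml(stride_velocities):
    """Mean over normalized time of the between-stride SD of medio-lateral trunk velocity.

    stride_velocities: list of 1-D arrays, one per stride (strides may differ in length).
    """
    normalized = np.vstack([time_normalize(s) for s in stride_velocities])
    # SD across strides at each time increment, then averaged over the stride cycle
    return float(np.mean(np.std(normalized, axis=0, ddof=1)))

def var_st(foot_contact_times, n_strides=150):
    """SD of the final n_strides stride times, from consecutive contacts of one foot."""
    stride_times = np.diff(np.asarray(foot_contact_times, dtype=float))
    return float(np.std(stride_times[-n_strides:], ddof=1))
```

The same pattern as `var_st` would apply to step width, with the final 300 step widths in place of the stride times.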
Gait stability was quantified using local divergence exponents (LDE) [20]. LDEs describe how small initial differences in kinematics progress over the course of a step. The method for calculating the LDE has been described previously in more detail [16,20]. In the present study, we used a state space reconstructed from a single time series of medio-lateral trunk velocity and a state space reconstructed from trunk kinematics in six degrees of freedom, to obtain LDE ml and LDE trunk , respectively. Parameters for state space reconstruction were based on data-driven estimates of the appropriate time delay using the average mutual information procedure and the required number of embedding dimensions using the global false nearest neighbor analysis. LDE ml was determined from a 5-dimensional state space built from the embedded medio-lateral trunk velocity time series, with a delay of 10 samples. LDE trunk was based on a 12-dimensional state space reconstructed by combining the 3-dimensional linear and angular velocities of the trunk and their time-delayed copies. The embedding delay for this 12-dimensional state space was 25 samples. Rosenstein's algorithm was used to calculate the LDE [21] from the state space reconstructions. In short, for each time point in state space, a nearest neighbor was found and the Euclidean distance between these points in state space was tracked, resulting in a number of time-distance curves equal to the number of time points in state space. The divergence curve was then calculated as the mean of the natural log of the time-distance curves. Finally, the LDE was determined as the slope of the linear fit through the first 50 samples (time needed for one step on average) of the divergence curve, corresponding to the initial period of rapid exponential divergence. Thus, the LDE indicates the rate of logarithmic divergence as a result of differences in initial conditions over the time needed for one step. A positive LDE indicates local instability.
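The steps of the divergence calculation can be illustrated with a short sketch. It follows the description above (delay embedding, nearest-neighbor search, mean log divergence, linear fit over the first 50 samples), but it is not the authors' implementation: the function names are hypothetical, the temporal exclusion window `min_sep` for the neighbor search is an assumed parameter that the text does not report, and the slope is returned per sample without any rescaling of the time axis.

```python
import numpy as np

def delay_embed(x, dim, delay):
    """Time-delay embedding: row i is [x[i], x[i+delay], ..., x[i+(dim-1)*delay]]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * delay
    return np.column_stack([x[k * delay : k * delay + n] for k in range(dim)])

def local_divergence_exponent(states, n_fit=50, min_sep=50):
    """Slope of the mean log divergence curve over the first n_fit samples.

    min_sep is an assumption: neighbors closer in time than min_sep samples
    are excluded so that a point is not paired with its own trajectory.
    """
    states = np.asarray(states, dtype=float)
    n_ref = len(states) - n_fit              # points that can be tracked n_fit samples ahead
    idx = np.arange(n_ref)
    neighbor = np.empty(n_ref, dtype=int)
    for i in range(n_ref):
        dist = np.linalg.norm(states[:n_ref] - states[i], axis=1)
        dist[max(0, i - min_sep) : i + min_sep + 1] = np.inf   # temporal exclusion
        neighbor[i] = np.argmin(dist)
    curve = np.empty(n_fit)
    for k in range(n_fit):
        d = np.linalg.norm(states[idx + k] - states[neighbor + k], axis=1)
        curve[k] = np.mean(np.log(d[d > 0]))  # zero distances are skipped
    slope, _intercept = np.polyfit(np.arange(n_fit), curve, 1)
    return float(slope)                       # units: 1/sample
```

Under these assumptions, LDE ml would correspond to `local_divergence_exponent(delay_embed(v_ml, dim=5, delay=10))`, where `v_ml` is a placeholder for the medio-lateral trunk velocity series. The neighbor search is a brute-force loop chosen for readability; a tree-based search would be faster on long recordings.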

Statistical analysis
As pointed out in the introduction, power calculations in gait studies require information about between-subjects and within-subjects variance components of the gait measures of interest, the latter including variances between measurement days and between trials within a day. All gait measures were obtained, as described above, in two separate trials on each of two different days for each subject. The parent data set, thus, consisted of 64 values for each gait measure (16 subjects × 2 days × 2 trials). These 64 values provided the basis for the analyses of variance and power, performed for each separate gait measure. A nested random model was used to estimate variance components [22], by solving expected mean squares of the two-way (subject, day) ANOVA corresponding to this model. This assumes that no systematic sources of variance (fixed effects) are present in the data. To check the validity of this assumption, a repeated-measures ANOVA was performed to test for effects of day (first vs second) and trial (first vs second, within day) on each of the gait measures. Neither day, trial nor their interaction had any systematic effect (p > 0.05, absolute differences < 5 %).
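As an illustration of how variance components follow from the expected mean squares of a balanced nested design, a minimal sketch is given below. It assumes a complete array of shape (subjects, days, trials) and uses the standard method-of-moments solution for a balanced nested random model; the function name is hypothetical and negative estimates are left as they are, since the text does not say how such cases were treated.

```python
import numpy as np

def nested_variance_components(data):
    """Estimate mean and variance components for trials nested in days nested in subjects.

    data: array of shape (n_subjects, n_days, n_trials), balanced and complete.
    Returns (m, s2_bs, s2_bd, s2_wd).
    """
    data = np.asarray(data, dtype=float)
    n_s, n_d, n_t = data.shape
    grand = data.mean()
    subj = data.mean(axis=(1, 2))            # one mean per subject
    day = data.mean(axis=2)                  # one mean per subject and day

    ms_subj = n_d * n_t * np.sum((subj - grand) ** 2) / (n_s - 1)
    ms_day = n_t * np.sum((day - subj[:, None]) ** 2) / (n_s * (n_d - 1))
    ms_trial = np.sum((data - day[:, :, None]) ** 2) / (n_s * n_d * (n_t - 1))

    # Expected mean squares:
    #   E[ms_trial] = var_WD
    #   E[ms_day]   = var_WD + n_t * var_BD
    #   E[ms_subj]  = var_WD + n_t * var_BD + n_d * n_t * var_BS
    s2_wd = ms_trial
    s2_bd = (ms_day - ms_trial) / n_t
    s2_bs = (ms_subj - ms_day) / (n_d * n_t)
    return float(grand), float(s2_bs), float(s2_bd), float(s2_wd)
```

For the design used here, `data` would have shape (16, 2, 2) for each gait measure.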
The estimates obtained from the parent data were the overall mean ($m$) and three variance components: variance between subjects ($s^2_{BS}$), variance between days within subjects ($s^2_{BD}$), and variance between trials within days within subjects ($s^2_{WD}$). These parameters can be used to estimate the number of subjects required to obtain sufficient power for different measurement strategies as outlined in the "Appendix". For all analyses, the desired level of significance was set to 0.05 and power was set to 0.80. Additional assumptions needed regard the correlation ($\rho$) between measurements in the two compared conditions (e.g., before and after an intervention) at the level of individuals, i.e., the predictability of the result in one condition from that in the other for any particular subject. As far as we know, such values have not been reported for gait measures in the literature. Therefore, we explored a range of values of $\rho$ (0.3, 0.6, and 0.9) as possible scenarios.
Based on these settings, we estimated the required number of subjects, $n_s$, to detect effects of 10 and 30 % of the mean of the reference condition for repeated-measures (paired) designs, under the scenario that only one trial was performed by each subject in each condition. The detectable effect sizes were arbitrarily chosen, but are in the order of magnitude reported in the literature for comparisons between fallers and non-fallers [8][9][10][23][24][25].
To answer the second research question, we evaluated how a change in the number of measurement days or trials per day would influence the required number of subjects at a maintained statistical power. One or 2 measurement days and 1-3 trials per day were selected as realistic measurement strategies in clinical gait studies.
To estimate the prediction intervals of the calculated distribution parameters in the parent data set ($m$, $s^2_{BS}$, $s^2_{BD}$, $s^2_{WD}$), and of the required numbers of subjects, we used a bootstrap technique [26,27]. In short, sixteen subjects were randomly drawn with replacement from the original 16 subjects, keeping the results from the four trials of each of the 16 selected subjects. Thus, one resampled bootstrap data set contained the same number of subjects and trials as the parent data set. For the resampled data set, the mean and variance components ($m$, $s^2_{BS}$, $s^2_{BD}$, $s^2_{WD}$) as well as $n_s$ were estimated for all combinations of number of days and number of trials. This procedure was repeated for 5000 bootstrap data sets, and bias-corrected 95 % prediction intervals for each of the estimated parameters were obtained from the distribution of the 5000 determinations as a measure of estimation uncertainty [28]. All statistical analyses were done in R 2.13 [29].
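The resampling scheme can be sketched as follows. Note that this sketch is a simplification: it resamples whole subjects as described above, but it reports plain percentile intervals rather than the bias-corrected intervals used in the study, and it is written in Python rather than R. The function name and the `statistic` callback are hypothetical.

```python
import numpy as np

def bootstrap_interval(data, statistic, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling subjects with replacement.

    data: array of shape (n_subjects, n_days, n_trials); all trials of a
    selected subject are kept together.
    statistic: function mapping such an array to a single number.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n_subjects = data.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, n_subjects, size=n_subjects)  # subject indices, with replacement
        estimates[b] = statistic(data[picked])
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(estimates, [tail, 100.0 - tail])
    return float(lower), float(upper)
```

Any scalar estimate, such as a variance component or a required number of subjects, can be passed as `statistic`.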

Results
All three variance components, key factors for estimating the required numbers of subjects in any particular data collection strategy, were substantial (see Table 1). For the gait variability measures VAR ST , VAR SW , and VAR ml , between-subject variance was larger than within-subject variance. For LDE measures, the sum of the two within-subject variance components was similar to the between-subjects variance, and between-days variance was two to three times larger than within-day variance. All variance components had wide 95 % prediction intervals. The numbers of subjects required to obtain sufficient statistical power in studies collecting data from one trial on 1 day in each of the two compared conditions ranged from 7 to 13 for highly correlated ($\rho$ = 0.9) data with a large effect (30 %), up to 78-192 for data with a low correlation ($\rho$ = 0.3) and with a small effect (10 %; Table 2).
The effect of changing the measurement strategy on the required number of subjects is illustrated for VAR ST in Fig. 1. Similar effects of changing the measurement strategy were obtained for the other gait measures. The largest decrease in the required numbers of subjects occurred when an additional measurement day was added. Conducting more trials on the same day did result in fewer required subjects, but it was generally less effective than increasing the number of measurement days, in particular when increasing the number of trials from two to three.

Discussion
The main objective of this paper was to assess the numbers of subjects required to obtain sufficient statistical power (80 %) for detecting specified differences in gait measures between two conditions using subjects as their own controls, i.e., a repeated-measures design. In this study, we set the differences to 10 and 30 % of the mean value in the reference condition based on results reported in literature. These differences are in line with suggested meaningful changes reported by Brach et al. [30], i.e., 0.01 s for stance time and swing time variability and 0.25 cm for step length variability. These changes correspond to approximately 10 and 30 %, respectively, of the baseline mean value of these gait measures. However, more research on clinically relevant change in gait variability is warranted. To the best of our knowledge, there is no literature on meaningful or relevant changes of LDE. While we have exemplified calculation procedures and effects on study sizes using the 10 and 30 % differences, any other expected effects can be addressed using the data and equations presented in the paper and "Appendix".
Table 1 Mean value and variance components between subjects ($s^2_{BS}$), within subjects between days ($s^2_{BD}$), and within subjects within days ($s^2_{WD}$) for stride time variability, step width variability, variability of medio-lateral trunk velocity, and medio-lateral and trunk local divergence exponents. In brackets: 95 % prediction intervals, as derived from the bootstrap simulations. VAR ST stride time variability, VAR SW step width variability, VAR ml variability of medio-lateral trunk velocity, LDE ml the local divergence exponent of medio-lateral trunk velocity, LDE trunk the local divergence exponent of trunk kinematics.
Table 2 Results with 95 % prediction intervals in brackets, as obtained by bootstrap simulation, are shown for stride time variability, step width variability, variability of medio-lateral trunk velocity, and medio-lateral and trunk local divergence exponents. All results refer to a data collection strategy of one trial on 1 day per subject and measurement condition. VAR ST stride time variability, VAR SW step width variability, VAR ml variability of medio-lateral trunk velocity, LDE ml the local divergence exponent of medio-lateral trunk velocity, LDE trunk the local divergence exponent of trunk kinematics. (a) Required numbers of subjects, each of which is measured in both compared conditions (e.g., before and after an intervention). (b) Difference between conditions, expressed in percentage of the group mean value in the control condition, cf. Eq. (2) in "Appendix".
Regarding effects of physical training on gait variability, one small study [31] reported a large effect (35 %) and one large study a small (4 %) and non-significant effect [31]. To the best of our knowledge, no reports are available on effects of physical training on gait LDE. A meta-analysis on training effects on standing balance reported a small effect size, i.e., 11 % [32]. The results of the present study demonstrate that when expected differences are small, as illustrated by a 10 % change of the group mean, the required numbers of subjects are large (Table 2). Since a 10 % change, or even less, in gait measures between conditions might be clinically relevant [30], it is advisable to measure a large number of subjects and to report both significant and non-significant results of several gait measures to allow future meta-analyses.
The dominant cause of the need for large study sizes is the large gross between-subjects variance of gait measures, which in turn depends on the between-subjects variance and the variance associated with estimating a mean value of a gait measure in each subject. The latter affects the uncertainty associated with gait studies in its own right and also decreases the effective correlation between pairs of measurements (cf. "Appendix"). Like the clinically relevant effect sizes, the correlations between pairs of measurements before and after intervention, which quantify the predictability of the intervention result for any subject, are largely unknown. Van Schooten et al. [33] found correlations between conditions ranging from 0.55 to 0.97 for gait variability measures and LDE (personal communication). Hak et al. [34] found that the predictability of gait variability and stability measures varied with the effect size, small effects showing correlations from 0.33 to 0.79 and large effects showing correlations between -0.28 and 0.56 (personal communication). A conservative estimate of the correlation may therefore be justified. We tested different sizes of the "true", error-free correlation between measurements in the pre- and post-intervention conditions in our analyses. From Fig. 1, it is clear that the correlation had a large influence on the required numbers of subjects. The error-free correlation is effectively reduced by the substantial within-subjects error associated with determining gait measures (see "Appendix").
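Why measurement error lowers the effective correlation can be made explicit with a short derivation. It is a sketch under assumptions that the text does not state in this form: the observed subject mean in condition $c$ is written as a true value plus an independent error term, both conditions share the same variances, and the errors are uncorrelated between conditions. The symbols $X_c$, $T_c$ and $E_c$ are introduced here for illustration only.

```latex
% Observed subject mean in condition c = 1, 2: true value plus measurement error
X_c = T_c + E_c, \qquad
\operatorname{Var}(T_c) = s^2_{BS}, \qquad
\operatorname{Var}(E_c) = \frac{s^2_{BD}}{n_d} + \frac{s^2_{WD}}{n_d n_t}

% Gross between-subjects variance of the observed means
\operatorname{Var}(X_c) = s^2_S = s^2_{BS} + \frac{s^2_{BD}}{n_d} + \frac{s^2_{WD}}{n_d n_t}

% Only the true values covary between conditions
\operatorname{Cov}(X_1, X_2) = \operatorname{Cov}(T_1, T_2) = \rho \, s^2_{BS}

% Effective (adjusted) correlation
\rho' = \frac{\operatorname{Cov}(X_1, X_2)}{s^2_S} = \rho \, \frac{s^2_{BS}}{s^2_S} \le \rho
\quad \text{for } \rho \ge 0
```

The ratio $s^2_{BS}/s^2_S$ is at most 1, so any within-subject error pulls the effective correlation below the error-free value, and more days or trials per subject move it back toward $\rho$.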
In the present study, we used treadmill walking at a fixed gait speed. Treadmill walking was used to allow collecting data from a large number of strides, to improve precision of estimates of gait variability [14,15] and stability [16,17]. In clinical practice, gait data is often collected in overground walking, using optoelectronic methods or electronic walkways, which limit data collection to a few strides. This increases within-subject variance and thus decreases statistical power to detect differences between groups and conditions. Data on larger numbers of strides can be collected in overground walking when using inertial sensors [35,36], but the number of consecutive strides is usually still limited by spatial constraints. Therefore, as an alternative to collecting a large number of consecutive strides, the number of trials can be increased [37,38]. It should be kept in mind that treadmill walking in itself affects gait variability and stability [39] and this may limit generalizability of the present results to overground walking, although statistical precision of stability estimates appears similar between overground [36,38] and treadmill walking [18]. The fixed gait speed may have affected the between- and within-subjects variance components. However, since we did not establish preferred gait speeds, and since there is no consensus on the nature of the relationship between gait speed on the one hand and gait variability [40][41][42][43][44] and LDE [40,41,[45][46][47] on the other hand, it is impossible to estimate the effect of gait speed on the results. Thus, generalization to studies using preferred speed should be done with care.
For VAR ST and LDE ml and LDE trunk , the between-days variance was higher than the within-day variance, but the between-days variance was also substantial for the other gait measures. Since subjects were exposed to similar conditions on both measurement days, the large between-day variances imply that other factors might influence the gait measures on a particular day. It could be that healthy subjects have a broad array of variability and LDE within which, for example, balance and agility are sufficient, and thus not further controlled. This could imply that a more challenging gait assessment, i.e., using mechanical and/or cognitive challenges to bring gait more toward the boundary of stable gait, is required to assess gait quality. The requirement to maintain global stability in such conditions might reduce the redundancy of gait performance and consequently reduce within-subject variance. In addition, more challenging test conditions, whether mechanical or cognitive, may increase effect sizes, much like these conditions often increase between-group differences in stability and variability [e.g., 48,49]. However, decreased between-group differences under more challenging conditions have also been described [e.g., 50] and consequently the effect of using more challenging test conditions on statistical power of measurement strategies requires further study.
Our analysis of the effects of changing the number of measurement days per subject and trials per day clearly demonstrated that the former is more effective in reducing the number of required subjects than the latter, but that both have an effect. The large increase in statistical power when measuring subjects on multiple days is an effect of the generally large between-days variance, while within-day variances were, in general, smaller. It should be noted, though, that it will always be more beneficial to allocate multiple measurements to different days than to collect them on the same day, since this will more effectively reduce the gross between-subject variance ("Appendix", Eq. 4).
Within-subject variance components as well as between-subject variance may depend on the subject group studied. The present study involved healthy and relatively young (mean age 65 years) older adults. Results can, thus, not be generalized to patient populations or to older and potentially frailer adults.
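The claim that spreading measurements over days is never worse than repeating them within a day can be checked with a short worked comparison. The sketch below reads Eq. (4) of the Appendix as $s^2_S = s^2_{BS} + s^2_{BD}/n_d + s^2_{WD}/(n_d n_t)$ and compares two ways of allocating the same total number $N$ of trials per subject and condition; $N$ is a symbol introduced here for illustration.

```latex
% All N trials on a single day (n_d = 1, n_t = N)
s^2_{S,\text{one day}} = s^2_{BS} + s^2_{BD} + \frac{s^2_{WD}}{N}

% One trial on each of N days (n_d = N, n_t = 1)
s^2_{S,\,N\text{ days}} = s^2_{BS} + \frac{s^2_{BD}}{N} + \frac{s^2_{WD}}{N}

% Difference between the two strategies
s^2_{S,\text{one day}} - s^2_{S,\,N\text{ days}}
  = s^2_{BD}\left(1 - \frac{1}{N}\right) \ge 0
```

Additional trials within a day only average out the within-day component, whereas additional days average out both components, so the advantage of the multi-day strategy grows with the between-days variance.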
Calculations of LDE allow for many different choices of the number of embedding dimensions and time delays when constructing the state space. While it is most common to use a fixed dimensionality (5D or 12D) of the state space, different approaches to estimate these parameters have also been used [51]. Furthermore, the region of the divergence curve used to estimate the slope also needs to be selected. We did not investigate the effects of these choices on statistical power of LDE in gait studies. However, a study on the effects of these choices on the reliability of LDEs demonstrated that a fixed state space reconstruction is generally more reliable than an individualized approach [36].
The prediction intervals of variance components (Table 1) and thus of the required number of subjects (Table 2) were wide, in the latter case particularly when investigating small differences between conditions. Wide prediction intervals of variance components are in line with reports from a few studies assessing postures and muscle activity in occupational settings [27,52]. These wide prediction intervals complicate the determination of the required numbers of subjects. It has been suggested to base the study size on the 80th percentile of the distribution of the required number of subjects (cf. Table 2) rather than on the point estimate, which is in general downward ("optimistically") biased [53]. The wide prediction intervals also imply that a pilot study with a small number of subjects is not likely to result in reliable data for power calculations. An unreliable power analysis could lead to underpowered studies and hence a waste of time, effort, and money in executing a study that will probably be inconclusive, but it could also result in overpowered studies, which would, indeed, have a high probability of resulting in statistically significant findings, but also consume unnecessarily large resources in reaching these results.

Conclusions
The results of the present study indicate that studies attempting to detect small changes in gait variability and stability between conditions measured in the same subjects (i.e., a repeated-measures design) need a large sample of subjects, generally well over 50, to obtain sufficient statistical power. To increase statistical power, increasing the number of measurement days is more effective than increasing the number of trials within a day. The presented results are important when interpreting studies that report small and non-significant effects.

Conflict of interest
On behalf of all co-authors, the corresponding author states that there is no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix
The variance in the parent data set was partitioned using a nested random model [22]:
$$GM_{sdt} = \mu + \alpha_s + \beta_{sd} + \varepsilon_{sdt} \tag{1}$$
where $GM_{sdt}$ is the value of the gait measure in trial $t$, collected on day $d$ in subject $s$; $\mu$ is the group mean; $\alpha_s$ the effect of subject, $s = 1, 2, \ldots, 16$; $\beta_{sd}$ the effect of day within subject, $d = 1, 2$; $\varepsilon_{sdt}$ the residual corresponding to trial within day and subject, $t = 1, 2$. Variance components were estimated by solving expected mean squares of the two-way nested ANOVA corresponding to Eq. (1). Thus, the parent data were used to estimate the overall mean ($m$, the estimate of $\mu$) and the three variance components: variance between subjects ($s^2_{BS}$, the estimated variance of $\alpha_s$), variance between days within subjects ($s^2_{BD}$, the estimated variance of $\beta_{sd}$), and variance between trials within days within subjects ($s^2_{WD}$, the estimated variance of $\varepsilon_{sdt}$).
The required number of subjects to obtain sufficient statistical power to detect a significant difference between two conditions within subjects by means of a paired t test is given by:
$$n_s = \frac{s^2_D \left(t_{n_s-1,\,1-\beta} + t_{n_s-1,\,1-\alpha/2}\right)^2}{\Delta^2} \tag{2}$$
where $n_s$ is the required number of subjects (each measured in both conditions); $\Delta$ the specified effect to be detected; $s^2_D$ the variance of the difference between conditions; $t_{df,p}$ the $p$ percentile of the t distribution with $df$ degrees of freedom; $1 - \beta$ the desired level of statistical power; and $\alpha$ the desired level of significance. $s^2_D$ depends on the gross between-subjects variance ($s^2_S$) and the adjusted correlation between conditions in the paired design ($\rho'$) as shown in Eq. (3):
$$s^2_D = 2 s^2_S \left(1 - \rho'\right) \tag{3}$$
where $s^2_S$ is the gross between-subjects variance, which in turn depends on the between-subjects variance ($s^2_{BS}$) and the variance associated with estimating a mean value of a gait measure in one subject according to:
$$s^2_S = s^2_{BS} + \frac{s^2_{BD}}{n_d} + \frac{s^2_{WD}}{n_d n_t} \tag{4}$$
where $n_d$ is the number of measured days per subject and $n_t$ is the number of trials per day and subject. In Eq. (3), $\rho'$ is the adjusted correlation between results obtained by a subject in the two compared conditions, i.e., an estimate of the predictability of the result in one condition from that in the other (e.g., the predictability of an intervention effect). $\rho'$ depends on the ratio of $s^2_{BS}$ to $s^2_S$:
$$\rho' = \rho \, \frac{s^2_{BS}}{s^2_S} \tag{5}$$
where $\rho$ is the "true" within-subject correlation between measurements in the two compared conditions in the ideal case of error-free measurements. Equation (2) has to be solved by iterative methods because $n_s$ occurs on both sides of the equal sign.
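The iterative solution can be sketched as below. The code is an illustration rather than the authors' R implementation: it reads Eqs. (2) to (5) in the standard paired t test form (restated in the comments), it increases the candidate number of subjects until Eq. (2) is satisfied, and all function and parameter names are hypothetical. It relies on `scipy.stats.t.ppf` for the t percentiles.

```python
from scipy.stats import t

def gross_between_subject_variance(s2_bs, s2_bd, s2_wd, n_days=1, n_trials=1):
    """Eq. (4) as read here: s2_S = s2_BS + s2_BD / n_d + s2_WD / (n_d * n_t)."""
    return s2_bs + s2_bd / n_days + s2_wd / (n_days * n_trials)

def required_subjects(mean, s2_bs, s2_bd, s2_wd, effect_fraction, rho,
                      n_days=1, n_trials=1, alpha=0.05, power=0.80, n_max=100000):
    """Smallest n_s with n_s >= s2_D * (t_{n_s-1,1-beta} + t_{n_s-1,1-alpha/2})^2 / delta^2."""
    delta = effect_fraction * mean                 # effect as a fraction of the reference mean
    s2_s = gross_between_subject_variance(s2_bs, s2_bd, s2_wd, n_days, n_trials)
    rho_adj = rho * s2_bs / s2_s                   # Eq. (5): adjusted correlation
    s2_d = 2.0 * s2_s * (1.0 - rho_adj)            # Eq. (3): variance of the difference
    for n in range(2, n_max):
        t_sum = t.ppf(power, n - 1) + t.ppf(1.0 - alpha / 2.0, n - 1)
        # The right-hand side shrinks as n grows, so the first n that passes is the answer
        if n >= s2_d * t_sum ** 2 / delta ** 2:
            return n
    raise ValueError("no solution below n_max")
```

Calling `required_subjects` with different `n_days` and `n_trials` for fixed variance components shows how a measurement strategy changes the required number of subjects.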