Introduction

After an amputation of the upper limb, return to work is often an important goal of rehabilitation, as employment is generally beneficial for the individual [1, 2]. However, functional capabilities may have altered due to the amputation and prosthesis use. Individuals who are born with a transversal reduction deficiency of the upper limb may also experience physical limitations due to one-handedness, which may influence their functional capacity. As no instrument was available to assess the functional capacity of individuals with upper limb absence (ULA) in a standardized environment, a functional capacity evaluation (FCE) for one-handed individuals was developed [3]. This instrument can be used to guide the decision making of rehabilitation professionals regarding suitable work, or to measure outcomes of vocational rehabilitation programs. Moreover, it helps to assess work limitations due to musculoskeletal complaints, which are a frequent problem in individuals with ULA [4, 5]. Test outcomes can be compared directly with workload or indirectly with reference values [6]. The FCE for one-handed individuals (FCE-OH) contains items that are adapted from an FCE for individuals with work-related upper limb disorders (WRULD) [7]. While all tests of the FCE-WRULD were considered reliable in healthy adults [8], this cannot automatically be assumed for the tests of the FCE-OH in its target patient group. Because the FCE-OH contains substantial alterations with regard to the FCE-WRULD, and was developed for a different patient group, its psychometric properties should be examined separately.

It has been demonstrated in healthy workers and in patient groups with different chronic pain syndromes that FCE does not lead to serious adverse effects, although a temporary pain increase is common [9, 10]. The safety of FCE application and the pain response during and after FCE application in patients with ULA have not been investigated. The aim of this study was to examine the repeatability and safety of the FCE-OH.

Method

Setting

Patients were recruited from, and FCE sessions were held at the Prosthetic Center of the Italian Workers’ Compensation Authority (INAIL) in Vigorso di Budrio, Italy.

Design

Test–retest design; two FCE sessions were held at least 24 h apart. A questionnaire on demographics was answered directly after session 1, while 24 h after session 1 the participant was asked to answer several questions about pain response. Stability of construct (e.g. the participant’s self-perceived physical and mental health status being unchanged) was assessed with a questionnaire prior to session 2. The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) checklist was followed [11].

Participants

Potential participants were inpatients from INAIL, who stayed at the center for several days for prosthetic fitting, repair or training. They were informed of the study by the prosthetic and therapeutic staff and received an information letter from the primary researcher, who invited them to participate in the study. It was made clear that participation was voluntary and that declining to participate would not influence their treatment at the center. Inclusion criteria were: age 18–62 years (official retirement age in Italy); presence of an upper limb reduction deficiency or amputation at or proximal to the carpal level; normal function of the unaffected hand; and all seven items of the Italian translation of the physical activity readiness questionnaire [12, 13] answered negatively, or, when the latter was not the case, participation considered safe by a medical doctor. Prosthesis use was not necessary for participation. All patients who met the inclusion criteria were invited. Determination of sample size was based on sample sizes of previous studies on repeatability of FCEs (ranging from 18 to 50; median of 30 participants [8, 10, 14–18]), and on availability of participants during the allocated study period.

Procedures

All sessions were administered by the same tester, who was trained in the standardized FCE-OH procedures. Participants and tester were blinded for the test results of session 1 until the second session was completed. After a general introduction of the sessions, the participant was verbally instructed how to perform each test. Each test was demonstrated by the tester. The participant was also instructed on the four termination criteria: (1) the participant wished to stop one or all tests for whatever reason; (2) the tester deemed it unsafe to continue; (3) the participant’s heart rate was above 85% of his or her age-related maximum [220-age]; or (4) a set time limit or number of repetitions was reached. Delayed onset (muscle) soreness as a result of the FCE-OH was expected and the participants were informed accordingly before signing the informed consent form. To provide safety during the FCE-OH tests, heart rate was monitored continuously with a heart rate monitor.
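Termination criterion (3) can be illustrated with a minimal sketch. It only encodes the 220 minus age rule and the 85% threshold stated above; the function names are hypothetical and are not part of the FCE-OH protocol.

```python
def max_heart_rate(age_years: float) -> float:
    """Age-related maximum heart rate, using the 220 - age rule from the text."""
    return 220.0 - age_years


def heart_rate_limit(age_years: float, fraction: float = 0.85) -> float:
    """Heart rate above which a test would be terminated (85% of the maximum)."""
    return fraction * max_heart_rate(age_years)


def should_terminate(heart_rate: float, age_years: float) -> bool:
    """True when the monitored heart rate exceeds the 85% limit."""
    return heart_rate > heart_rate_limit(age_years)
```

For a 40-year-old participant, for example, the limit is 0.85 × (220 − 40) = 153 beats per minute.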

Six tests were performed, of which two (repetitive reaching test and fingertip dexterity test) were performed with the unaffected limb and prosthetic limb separately, thus making a total of eight test items. The repetitive reaching test with the unaffected limb and prosthetic limb, the fingertip dexterity test with the unaffected limb and the prosthetic limb, and the handgrip strength test with the unaffected limb were each performed three times (referred to as three trials). The tests were performed in a set order: overhead lifting test with a receptacle, overhead lifting test with a 2.0 kg weight (with the unaffected limb), overhead working test, repetitive reaching test (alternating three trials with the unaffected limb and three trials with the prosthetic limb), fingertip dexterity test (alternating three trials with the unaffected limb and three trials with the prosthetic limb), and handgrip strength test (three trials with the unaffected limb). The overhead lifting test with a receptacle and the overhead working test were performed two-handed (unaffected and prosthesis hand), unless the participant had no prosthesis available or had a transhumeral amputation. In that case the test was performed with the unaffected limb only. The fingertip dexterity test could not be performed with a cosmetic prosthesis. Materials, objects and test procedures are presented in Appendix.

Pain response was assessed with self-reported questionnaires. After session 1 participants received an extended version of the pain response questionnaire (PRQ) [9] and were asked to answer the first three questions 24 h after finishing session 1. These three questions asked: (1) whether pain was perceived and, if so, the type of pain (muscular pain, other pain, or a combination of these), (2) whether the participant perceived this pain as being directly caused by the FCE session, and (3) whether the participant had experienced any other physical reaction after the first FCE session. The remaining 14 questions were answered prior to session 2, and assessed stability of construct and presence of pain in the 12 h prior to session 2. If pain was present, the location and the severity of the pain [on an 11-point numeric rating scale from 0 (no pain) to 10 (worst imaginable pain)] were asked. To control for stability of construct of measurement, which is a prerequisite for test–retest reliability analyses [19, 20], changes in mental and physical health (equal, better or worse) compared to the first session, and changes in medication use since the first measurement were recorded. Moreover, participants were asked whether they had received prosthesis training between sessions, and whether changes had been made to the prosthesis (e.g. repair) since session 1. In addition, a questionnaire about demographics was administered.

Data Analysis

All test scores consisted of continuous variables. For the items of the repetitive reaching test, the fingertip dexterity test, and the handgrip strength test, the average of three trials was calculated and used for further analyses. Skewness and kurtosis values divided by their standard errors were used to assess normality of the distribution of the differences between test outcomes of sessions 1 and 2. If both ratios were within ±1.96, a normal distribution was assumed, and a paired sample t test was performed to analyse whether test results of session 1 were significantly different from results of session 2. When the difference was not normally distributed, a Wilcoxon signed-rank test was used, and the median and interquartile range (IQR) were presented. Differences were considered statistically significant when p < 0.05.
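The normality check and the resulting choice of test can be sketched as follows. This is an illustration only: it assumes the bias-adjusted sample skewness and excess kurtosis with their usual standard errors, as commonly reported by statistical packages, since the exact estimators are not stated here. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def _se_skewness(n):
    # Standard error of the bias-adjusted sample skewness.
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def skewness_z(diffs):
    """Bias-adjusted skewness divided by its standard error (needs n >= 3)."""
    n = len(diffs)
    m, s = mean(diffs), stdev(diffs)
    g1 = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in diffs)
    return g1 / _se_skewness(n)


def kurtosis_z(diffs):
    """Bias-adjusted excess kurtosis divided by its standard error (needs n >= 4)."""
    n = len(diffs)
    m, s = mean(diffs), stdev(diffs)
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
          * sum(((x - m) / s) ** 4 for x in diffs)
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se = 2 * _se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return g2 / se


def choose_test(session1, session2, cutoff=1.96):
    """Pick the paired test based on the session 2 minus session 1 differences."""
    diffs = [b - a for a, b in zip(session1, session2)]
    normal = abs(skewness_z(diffs)) < cutoff and abs(kurtosis_z(diffs)) < cutoff
    return "paired t test" if normal else "Wilcoxon signed-rank test"
```

The sketch only selects the test; the t test or Wilcoxon signed-rank test itself would be run in a statistics package.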

Descriptives of the test results during the first and second session, one-way random intraclass correlation coefficients (ICC) for single measures, and 95% confidence intervals (95% CI) of the ICC-values were computed. ICCs of ≥0.90 were interpreted as excellent, ICCs between 0.75 and 0.90 as good, ICCs between 0.50 and 0.75 as moderate, and ICCs of ≤0.50 as poor [21]. ICC-values of ≥0.75 were considered acceptable [15, 16]. To assess repeatability further, 95% limits of agreement (LoA) were calculated (mean difference ± 1.96 × standard deviation of mean difference), which represent the size of difference between both measurements; approximately 95% of differences will lie between these LoA [22]. Only changes outside the LoA should be considered as real change [20]. In order to get a global impression of the width of the LoA, a ratio between the LoA and the mean score of sessions 1 and 2 was calculated [((1.96 × standard deviation of mean difference)/(mean session 1, 2)) × 100%]. When the difference between test outcomes of sessions 1 and 2 was not normally distributed, the value of 1.96 in the previous two calculations was replaced with the value 2 [22]. Interpretation of LoA is a clinical decision and not a statistical one [22]. Widths of LoA of the overhead lifting test, overhead working test, fingertip dexterity test and handgrip strength test were compared with the widths of LoA found in healthy adults [8]; differences of ≥10% were considered deviant. As it is unknown whether the LoAs found in healthy adults are acceptable, clinically relevant interpretation of (widths of) LoA is not possible.
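The LoA, their relative width, and a one-way random single-measures ICC can be sketched for two sessions as below. The sketch makes three assumptions: "mean session 1, 2" is taken as the grand mean of all scores from both sessions, the ICC is the standard one-way ANOVA estimator, and the class boundaries at exactly 0.50, 0.75 and 0.90 are resolved as shown. The function names are hypothetical, and confidence intervals are not computed.

```python
from statistics import mean, stdev


def limits_of_agreement(session1, session2, multiplier=1.96):
    """95% LoA of the session 2 minus session 1 differences, plus relative width."""
    diffs = [b - a for a, b in zip(session1, session2)]
    half_width = multiplier * stdev(diffs)
    grand_mean = mean(list(session1) + list(session2))  # assumed denominator
    return {
        "lower": mean(diffs) - half_width,
        "upper": mean(diffs) + half_width,
        "width_percent": half_width / grand_mean * 100.0,
    }


def icc_one_way_single(session1, session2):
    """One-way random, single-measures ICC for k = 2 sessions."""
    k = 2
    pairs = list(zip(session1, session2))
    n = len(pairs)
    subject_means = [mean(p) for p in pairs]
    grand_mean = mean(subject_means)
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for p, m in zip(pairs, subject_means) for x in p) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def interpret_icc(icc):
    """Verbal label for an ICC; handling of exact boundaries is an assumption."""
    if icc >= 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc > 0.50:
        return "moderate"
    return "poor"
```

Passing `multiplier=2` reproduces the adjustment described for differences that are not normally distributed.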

Test–retest reliability describes the extent to which scores for patients who have not changed are the same for repeated measurement over time [20]. Therefore, two analyses were performed: first an analysis including all participants, and second an analysis including only the participants with stability of construct of measurement (e.g. stable functional capacity, in this study determined as unchanged physical and mental health status, and no changes in medication use as measured with the PRQ). Furthermore, in order to assess inter-trial variation, a repeated measures one-way ANOVA was performed if a test consisting of multiple trials showed a significant difference in test results between sessions 1 and 2. If Mauchly’s test of sphericity was significant, the Greenhouse-Geisser estimate was reported. To examine whether observed trends of inter-trial variation were significant, post hoc Bonferroni analyses were performed.
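The inter-trial analysis can be sketched as the F statistic of a one-way repeated measures ANOVA, together with a Bonferroni-adjusted significance level for pairwise trial comparisons. This is an illustration under assumptions: it omits the p-value, Mauchly's test and the Greenhouse-Geisser correction, which require a statistics package, and the function names and input layout (one row per participant, one column per trial) are hypothetical.

```python
from statistics import mean


def repeated_measures_f(scores):
    """F statistic and degrees of freedom; scores[i][j] is participant i, trial j."""
    n, k = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    subject_means = [mean(row) for row in scores]
    trial_means = [mean([row[j] for row in scores]) for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_trials = n * sum((m - grand) ** 2 for m in trial_means)
    ss_error = ss_total - ss_subjects - ss_trials  # residual after both effects

    df_trials, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_trials / df_trials) / (ss_error / df_error)
    return f_value, df_trials, df_error


def bonferroni_alpha(n_trials, alpha=0.05):
    """Per-comparison alpha when all pairs of trials are compared."""
    n_comparisons = n_trials * (n_trials - 1) // 2
    return alpha / n_comparisons
```

With three trials there are three pairwise comparisons, so each would be tested at 0.05/3 under this correction.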

All analyses were performed in IBM SPSS Statistics 22 [23].

Results

Thirty-two individuals were invited to participate, of whom two declined. Therefore, 30 individuals participated, of whom 23 performed the FCE-OH protocol twice (completers), with a median time of 47.2 h (IQR: 43.7; 68.0) between sessions. Reasons for seven participants not performing session 2 (non-completers) were: logistic difficulty in scheduling a second session due to time constraints (n = 4), declining for unknown reason (n = 2), and no show (n = 1). Characteristics of the participants are presented in Table 1. Differences between completers and non-completers were small (<10% difference for each variable).

Table 1 Characteristics of the participants

Stability of Construct

Six individuals had prosthesis training between FCE-OH sessions 1 and 2, all but one participant used the same prosthesis during both sessions, and two participants had changes to the prosthesis. In total, six individuals mentioned changed health status at session 2: mental health was better (n = 3), or worse (n = 1), or physical health was better (n = 2), or worse (n = 1). No participants had alterations in medication use between sessions. One participant showed a difference in overhead lifting capacity of 16 kg. During session 2, issues with the prosthesis led to substantial difficulties in lifting performance, as observed by the tester and confirmed by the participant. Therefore, the results of this participant were omitted from the analyses of the overhead lifting test.

Repeatability

Both primary (with all participants; Table 2) and secondary (participants with changed health status excluded; Table 3) analyses showed acceptable reliability (ICC-values of ≥0.75) for five out of eight items of the FCE-OH, and one item close to the 0.75 threshold of acceptable reliability. Differences between primary and secondary analyses were small and did not influence interpretation of acceptability. Secondary analyses revealed widths of LoA ranging between 16 and 79%. Differences of the widths of LoA of the overhead lifting test, fingertip dexterity test, and handgrip strength test observed in this study and in healthy adults (23, 14, and 20%, respectively) [8] were ≤10%, and thus considered similar. The width of LoA of the overhead working test was wider in this study (79%) compared to the width of LoA in healthy adults (41%) [8].

Table 2 Test results of two FCE sessions, and limits of agreement and intraclass correlation coefficients between these test results (all participants; n = 9–23)
Table 3 Test results of two FCE sessions, and limits of agreement and intraclass correlation coefficients between these test results, of individuals with a stable construct (six individuals with changed self-perceived physical and mental health status excluded; n = 6–17)

Participants performed significantly better during the second session on the repetitive overhead lifting test with the 2.0 kg weight, the repetitive reaching test (both with the unaffected limb and the prosthetic limb), and on the fingertip dexterity test with the unaffected hand (Tables 2, 3). Analyses of intertrial variation are presented in Table 4. As the first trial of the repetitive reaching test was performed slower compared to all following trials an extra analysis was performed, omitting this trial. The ICC-value showed an evident increase (to 0.69, 95% CI 0.32; 0.88), however, the 0.75 level of acceptable reliability was still not reached. The width of LoA changed minimally.

Table 4 Intertrial effects of the repetitive reaching test and the fingertip dexterity test with the dominant hand

Safety

No tests were terminated due to surpassing 85% of the age-related maximum heart rate. During the first session the overhead lifting test was terminated by the tester three times, as it was deemed unsafe to continue (generally due to too much bodily swing while lifting, sometimes in combination with difficult grip with the prosthesis). During the second session this occurred only once. No serious adverse reactions occurred, but one individual reported a bruise on the unaffected forearm 1 day after the first session. This adverse reaction was most likely caused by pressure of the lower rim of the container on the forearm, while lifting the container with one hand during the overhead lifting test. After this event the container was padded with foam, and no such incident occurred again. Eight (30%) participants reported a physical response (pain or other) 24 h after the first FCE-OH session, which was partly or completely caused by the test procedure (Fig. 1). Five of these eight individuals performed both sessions; in all five participants the pain was still present at the start of session 2 (median pain grade: 4, range 2–5). Three of the eight individuals with a pain response did not perform session 2; one of these three individuals declined further participation due to the pain response.

Fig. 1

Pain response 24 h after FCE-OH session 1. MSC musculoskeletal complaints. Prior to session 2 participants answered a questionnaire regarding the locations of possible complaints. The five individuals with myalgia 24 h after session 1, who performed session 2, had myalgia of the shoulder of the nonaffected limb (n = 1), the shoulder of the affected limb (n = 3) and the forearm of the nonaffected limb and lower back (n = 1)

Discussion

Repeatability

Five of the eight items of the FCE-OH showed acceptable reliability. For the repetitive reaching test with the unaffected limb and the fingertip dexterity test with the prosthesis test–retest reliability was not acceptable. The overhead working test was close to reliable. However, the width of LoA of this test was much wider compared to the width of LoA in healthy adults. Three other tests showed similar widths of LoA, while for four remaining tests comparison of agreement was not possible.

The overhead working test showed a large width of LoA, meaning large within-individual differences between sessions. The long duration of the test increases the chance that an individual performs notably differently when repeating the test. However, this test showed considerably smaller widths of LoA when performed by healthy adults [8] and patients with whiplash associated disorders [10] (41 and 49%, respectively, versus 71% in this study), as well as higher ICC-values (0.90 and 0.83, respectively). The overhead working test is known to show variable ICC-values, ranging from 0.36 in patients with low back pain [16] to 0.90 in healthy adults [8]. Possibly, participants did not completely recover between sessions, and results of this endurance test were affected by muscle fatigue. Furthermore, it could be that the recollection of fatigue and possibly pain decreased motivation.

In contrast to the overhead working test, the repetitive reaching test with the unaffected limb was performed significantly better during session 2, which may be caused by learning effects. The upper bound of the LoA showed that a test had to be performed at least 19 s faster to be considered a real change [20]. The first trial of session 1 was performed significantly slower than all following trials, and therefore it was hypothesized that removing this trial from the analysis would improve reliability measures. The ICC-value increased evidently when this trial was removed; however, the 0.75 level of acceptable reliability was not reached. The test showed modest variability between subjects, which can substantially decrease reliability, as reliability demonstrates how well persons can be distinguished from each other, despite measurement error [19, 24]. In the FCE-OH, the repetitive reaching test has been substantially altered and therefore reliability measures cannot be compared with existing literature [3].

For the fingertip dexterity test with the prosthesis, the wide CI, possibly caused by the low number of participants, resulted in an uncertain estimate of the ICC-value. No definitive conclusions should be drawn until the reliability of this test is confirmed in a larger sample.

Secondary analyses were performed, including only individuals with predefined stability of construct. ICC-values are ratio measures of the between-subject variance and the total variance, the latter including within-subject variance (measurement error) [24, 25]. Excluding individuals without stability of construct decreased within-subject variance, but possibly also the between-subject variance, and definitely the number of participants, which may explain why differences between results of the primary and secondary analyses were small and did not change the interpretation of reliability. Significant differences in test results between sessions 1 and 2, and inter-trial variation, were present in several tests. Nevertheless, when within-participant differences are smaller than between-participant differences, acceptable reliability coefficients are possible [19, 24]. However, it is important to be aware of these effects when assessing a patient.
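The dependence of the ICC on between-subject variance can be made concrete with a small sketch. It treats the ICC as the ratio of between-subject variance to total variance, as described above; the variance values used are invented for illustration and are not study data, and the function name is hypothetical.

```python
def icc_from_variances(between_var: float, within_var: float) -> float:
    """ICC as between-subject variance divided by total variance."""
    return between_var / (between_var + within_var)


# Same measurement error (within-subject variance of 1.0) in both cases:
heterogeneous_group = icc_from_variances(between_var=9.0, within_var=1.0)
homogeneous_group = icc_from_variances(between_var=1.0, within_var=1.0)
```

With identical measurement error, the more homogeneous group yields the lower coefficient, which is why modest variability between subjects can depress the ICC even when the test itself is no noisier.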

Safety

When necessary precautions are taken, the FCE-OH seems to be safe in use, since no serious adverse events occurred, and heart rates of all participants fell within acceptable ranges. One patient experienced a bruise 1 day after FCE-testing, which may be classified as an adverse reaction, which is defined as “any untoward and unintended response to an investigational medical product” [26]. Furthermore, several individuals experienced a pain response after the first FCE-OH session. Pain was mostly denoted as muscular pain and located in the shoulder of the affected limb, which may be caused by the generally passive use of this limb in daily life. Moreover, exacerbation of musculoskeletal complaints may occur. The percentage of individuals with a pain response was much lower than found by Soer et al. [9] (30 vs. 82%, respectively). Reasons for this finding are still speculative and beyond the scope of this article.

Strengths and Weaknesses

A weakness of the study is that the COSMIN recommendation of 50 participants was not feasible. While the COSMIN guideline recommends 50 participants, the results of this study, and of other FCE reliability studies with a similar number of participants [14, 17, 18], show that a substantially smaller sample can be sufficient to establish reliability. However, the smaller sample may have produced wide CIs of ICC-values, reflecting a general uncertainty about the true ICC, and making it necessary to frame clinical interpretations at the individual level with care. Although results are promising, a study on a larger sample is called for. Most individuals eligible and available for the study were willing to participate; however, completion of both sessions was not always possible, mostly because of time constraints. It is unknown whether this caused any bias.

The interval between both sessions should be long enough to avoid recall bias and fatigue, but short enough to avoid changes in health status that cause genuine differences in performance. Following practical considerations, the interval in this study was variable, with a median of approximately 2 days. This is a shorter time interval compared to most studies, which had time intervals of 1 to 2–3 weeks [8, 10, 16–18], but similar to the time interval in the study of Gross and Battie, who had 2–4 days between sessions [14]. A short interval may cause recall bias of test results (especially for the overhead lifting test, as participants may have recalled the number of weights put in the container), leading to higher ICC-values; but it may simultaneously lead to lower ICC-values, as the interval might not allow for full recovery. With the exception of the overhead working test, we do not expect the short interval to have played a role, as participants typically performed equally well or better during session 2. Reliability measures of the overhead working test are preferably replicated in a study with a longer interval between sessions.

In this study, widths of LoA were compared with those of healthy adults [8]. Some tests of the FCE-OH were substantially altered, and therefore could not be compared. Interpretation of LoA is a clinical decision [22], and a possible way to interpret them is by using the minimal clinically important change. However, as the FCE-OH is new, the minimal clinically important change is still to be established. Therefore, further considerations on LoA will follow.

Conclusion

Good or excellent test–retest reliability was observed in five tests, while the remaining three tests showed poor or moderate test–retest reliability. Comparison of agreement was possible for four tests, of which three showed similar agreement. The FCE-OH was considered safe in use when the right precautions are taken. Large CIs of the ICC-values and LoA, as well as learning effects, make it necessary to frame clinical interpretations at the individual level with care.