Background

Musculoskeletal disorders (MSDs) are common complaints encountered by clinicians, including physical therapists [1]. Upper limb MSDs (UL-MSDs) affect both health care resources and quality of life [1,2,3]. In Saudi Arabia, the prevalence of UL-MSDs in the general population reaches up to 45.6% [2, 3].

Self-reported outcome measures are one type of evaluation tool. They are designed to capture a patient’s health status, functional level, and health-related quality of life [4, 5]. Furthermore, they measure people’s emotions, thoughts, behaviors, and circumstances associated with disability or impairment [6]. Several self-reported outcome measures have been developed for UL-MSDs, including the Neck and Upper Limb Index (NULI) [7], Upper Extremity Functional Scale (UEFS) [8], Upper Extremity Functional Index (UEFI) [9], Disabilities of the Arm, Shoulder, and Hand (DASH) [10], QuickDASH [11], and QuickDASH-9 [12]. Most of these tools have limitations in comprehensiveness, in the adequacy of the items for the instrument domains, and in generalization from a specific to a general population [13, 14]. Other limitations relate to practical characteristics or interpretability [15, 16] and psychometric properties [9, 15].

The upper limb functional index (ULFI), on the other hand, has successfully overcome the aforementioned limitations. The ULFI has been used in several countries and has been translated and validated in many languages, including Spanish [17], French-Canadian [18, 19], Turkish [20], Italian [21], Korean [22], Persian [23], Brazilian Portuguese [24], Greek [25], and Urdu [26]. Because cultural background may affect how the original questionnaire is understood, we recently translated and cross-culturally adapted the ULFI into Arabic (ULFI-Ar). The ULFI-Ar demonstrated excellent content validity (0.96) and high internal consistency (Cronbach’s α = 0.88) [27]. However, other psychometric properties of the ULFI-Ar have not been studied. Thus, the current study aimed to test the longitudinal psychometric properties of the ULFI-Ar by investigating further measures of validity and reliability, namely factorial validity, test–retest reliability, measurement error, minimal detectable change, and responsiveness. We hypothesized that the ULFI-Ar would have adequate construct validity, test–retest reliability, and responsiveness.

Methods

This observational cross-sectional study was conducted between March and September 28, 2021 at King Fahad Hospital for University in Al Khobar, Saudi Arabia. The Institutional Review Board of Imam Abdulrahman bin Faisal University approved the study (IRB-PGS-2021–03-063; date: 22/02/2021). The study followed the Strengthening the Reporting of OBservational studies in Epidemiology (STROBE) guidelines [28].

Participants

All participants were referred to the physical therapy department and recruited consecutively. The eligibility criteria were adults (18 to 60 years old) who were diagnosed with UL-MSDs of the shoulder, elbow, wrist, or hand joints and were able to read and understand Arabic. Participants with any recent upper limb surgery, cognitive impairment, infectious disease, neurological disease, tumor, or other systemic disease that could affect the function of the upper limb were excluded. Further studies are needed to examine the relationship between these factors (such as recent surgery) and specific questionnaire items that may be inappropriate to ask. Each participant completed a written consent form.

The recommended sample size is at least 5 times the number of questionnaire items, provided that the sample includes ≥ 100 participants [29]. Thus, 125 participants were required to achieve a statistical power of 80% for validation. To allow for a dropout rate of 10%, 139 participants were consecutively recruited to complete the following questionnaires: ULFI-Ar, DASH-Arabic, and numeric pain rating scale (NPRS). The minimum number of participants recruited in previous research was 30 for test–retest reliability [19, 23] and 20 for responsiveness [19, 30].
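The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, and the dropout adjustment is assumed to divide the required sample by (1 − dropout rate), which reproduces the 139 participants reported above.

```python
import math


def required_sample(n_items: int, per_item: int = 5, minimum: int = 100) -> int:
    """Participants needed under the 'per_item times the items, at least minimum' rule."""
    return max(n_items * per_item, minimum)


def inflate_for_dropout(n_required: int, dropout_rate: float) -> int:
    """Number to recruit so that n_required remain after the expected dropout.

    Assumption: the adjustment divides by (1 - dropout_rate) and rounds up.
    """
    return math.ceil(n_required / (1.0 - dropout_rate))


n_required = required_sample(25)                   # the ULFI has 25 items
n_recruit = inflate_for_dropout(n_required, 0.10)  # 10% dropout allowance
print(n_required, n_recruit)
```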

Measurement instruments

The ULFI is a single-page instrument with 25 items. It is a valid, reliable, and responsive measure for assessing people with UL-MSDs [22]. Each item has three response options: ‘Yes = 1’, ‘Partly = 0.5’, and ‘No = 0’ [31]. The total score ranges from 0 (maximum limitation) to 100 (full function) and is calculated with the following equation: \(\mathrm{ULFI}_{\mathrm{score}} = 100 - \{(\text{sum of the 25 item points}) \times 4\}\). Scoring remains valid with up to two missing responses. The ULFI-Ar is equivalent to the English ULFI; only a few items were adapted to fit the Arabic context. A more detailed description of the ULFI-Ar was previously reported [27]. The authors of the current study obtained permission from the authors of the original English ULFI to translate the ULFI into Arabic and validate it.
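A minimal sketch of the ULFI scoring may help here. So that the result runs from 0 (maximum limitation) to 100 (full function) as described, the scaled sum of item points is subtracted from 100. The function name and the response labels are hypothetical, and treating a missing response as contributing no points is an assumption; the text only states that up to two missing responses are permitted.

```python
from typing import Optional, Sequence

# Points per response option, as listed in the text.
POINTS = {"yes": 1.0, "partly": 0.5, "no": 0.0}


def ulfi_score(responses: Sequence[Optional[str]], max_missing: int = 2) -> Optional[float]:
    """Return the 0-100 ULFI score, or None if too many items are missing.

    `responses` holds 25 entries: "yes", "partly", "no", or None for a missing item.
    Assumption: a missing item adds no points to the sum.
    """
    if len(responses) != 25:
        raise ValueError("the ULFI has 25 items")
    answered = [r for r in responses if r is not None]
    if len(responses) - len(answered) > max_missing:
        return None  # more than two missing responses: not scorable
    total = sum(POINTS[r.lower()] for r in answered)
    return 100.0 - total * 4.0


# Example: 10 'yes', 5 'partly', 10 'no' gives 12.5 points in total.
example = ["yes"] * 10 + ["partly"] * 5 + ["no"] * 10
print(ulfi_score(example))
```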

The DASH-Arabic is divided into four sections: an introduction, the 30 main items, and two optional sections. The 30 main items address upper limb function, severity of symptoms, and psychosocial difficulties, whereas the optional sections address work and sport impairments. Each item has a five-point Likert response that ranges from 1 “without any difficulty or no symptoms exist” to 5 “unable to engage in activity or very severe symptoms”. At least 27 of the 30 main items must be answered for a valid score. The score is calculated with the following formula: \(\mathrm{DASH\text{-}Arabic}_{\mathrm{100\ score}} = \left\{\left(\frac{\text{sum of completed responses}}{\text{number of completed responses}}\right) - 1\right\} \times 25\). The higher the score, the greater the disability. The optional sections follow the same procedure but require all five items to be answered. The DASH-Arabic is reliable, valid, and responsive [32, 33].
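The DASH-Arabic formula for the 30 main items can be sketched in the same way. The function name is hypothetical; the 27-item minimum and the 1 to 5 response range come from the description above.

```python
from typing import Optional, Sequence


def dash_score(responses: Sequence[Optional[int]], min_answered: int = 27) -> Optional[float]:
    """Return the 0-100 DASH score for the 30 main items, or None if not scorable.

    `responses` holds 30 entries: an integer from 1 to 5, or None for a missing item.
    """
    if len(responses) != 30:
        raise ValueError("the main DASH section has 30 items")
    answered = [r for r in responses if r is not None]
    if len(answered) < min_answered:
        return None  # fewer than 27 completed items: not scorable
    if any(r < 1 or r > 5 for r in answered):
        raise ValueError("responses must be between 1 and 5")
    # ((sum of completed responses / number of completed responses) - 1) x 25
    return (sum(answered) / len(answered) - 1) * 25


print(dash_score([3] * 27 + [None] * 3))
```

A respondent who answers 3 on every completed item lands at the midpoint of the scale, and the score does not depend on how many of the 27 to 30 items were completed.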

The NPRS-Arabic consists of a horizontal numerical scale that runs from 0 ‘no pain’ to 10 ‘extreme pain’. The participant was asked to rate the current pain intensity. The NPRS-Arabic is a valid, reliable, and responsive tool for measuring pain intensity in UL-MSDs [34].

Data and statistical analysis

Data were analyzed using IBM SPSS Statistics for Macintosh, Version 26.0 (IBM Corp., Armonk, NY, USA). The level of significance was set at p < 0.05. The mean and standard deviation (SD) were calculated as descriptive statistics for the demographic variables. The Shapiro–Wilk test was used to test the normality of the ULFI-Ar, DASH-Arabic, and NPRS-Arabic data [35]. The ULFI-Ar data were normally distributed (p > 0.05). The DASH-Arabic and NPRS-Arabic data were approximately normally distributed for participants with elbow and wrist/hand disorders. However, for both the DASH-Arabic and NPRS-Arabic, the distribution was inconsistent for the pooled data and for participants with shoulder disorders. Paired t-tests were performed to compare the test–retest and responsiveness scores of the ULFI-Ar, DASH-Arabic, and NPRS-Arabic with baseline. A ceiling or floor effect was considered present if more than 15% of respondents obtained the highest or the lowest possible score, respectively [19].
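The 15% rule for floor and ceiling effects is simple to express in code. The sketch below is illustrative only: the function name is hypothetical and the scores are made-up values, not study data.

```python
from typing import Dict, Sequence


def floor_ceiling_effects(
    scores: Sequence[float],
    min_score: float = 0.0,
    max_score: float = 100.0,
    threshold: float = 0.15,
) -> Dict[str, bool]:
    """Flag a floor/ceiling effect when more than `threshold` of respondents
    obtain the lowest/highest possible score."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {"floor": floor_share > threshold, "ceiling": ceiling_share > threshold}


# Hypothetical scores: 10% at the minimum, 20% at the maximum.
hypothetical_scores = [0, 30, 40, 50, 60, 70, 80, 90, 100, 100]
effects = floor_ceiling_effects(hypothetical_scores)
print(effects)
```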

Factor analysis was performed to evaluate the construct validity of the ULFI-Ar. Two classes of factor analysis were applied: Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) [36]. Prior to factor extraction, the suitability of the respondent data was assessed with the Kaiser–Meyer–Olkin (KMO) test, with an acceptable value between 0.60 and 0.90, and with Bartlett’s test of sphericity, which was required to be significant [27]. The KMO result was 0.812 and Bartlett’s test of sphericity was significant (p < 0.001). Thus, the data were suitable for factor analysis with EFA and CFA. The EFA used maximum likelihood extraction (MLE) and varimax rotation [37]. Factor extraction had three a priori requirements: Eigenvalue > 1, accounting for > 10% of variance [38], and position before the ‘point of inflection’ on the scree plot [39]. The CFA was conducted with IBM SPSS Amos 26.0.0 for Windows (Amos Development Corporation, Wexford, USA) to clarify the item loadings and the model fit. The fit indices were chi-square (χ2)/degrees of freedom (DF), Root Mean Square Error of Approximation (RMSEA), Comparative Fit Index (CFI), and Tucker-Lewis Index (TLI). The fit was considered adequate when χ2/DF < 3, RMSEA < 0.10, CFI and TLI > 0.90, and factor loadings > 0.40 [40].
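The first two extraction requirements can be illustrated with a short sketch. It assumes that the share of variance for a factor equals its eigenvalue divided by the number of items, which holds for eigenvalues of a correlation matrix. The eigenvalues below are hypothetical placeholders, not the study's values, the function name is illustrative, and the scree-plot criterion is a visual judgment that is not covered.

```python
from typing import List, Sequence, Tuple


def retained_factors(
    eigenvalues: Sequence[float],
    n_items: int,
    min_eigenvalue: float = 1.0,
    min_variance: float = 0.10,
) -> List[Tuple[int, float]]:
    """Return (factor number, share of variance) for factors that have an
    eigenvalue above `min_eigenvalue` and explain more than `min_variance`.

    Assumption: share of variance = eigenvalue / number of items.
    """
    kept = []
    for number, eigenvalue in enumerate(eigenvalues, start=1):
        share = eigenvalue / n_items
        if eigenvalue > min_eigenvalue and share > min_variance:
            kept.append((number, share))
    return kept


# Hypothetical eigenvalues for a 25-item instrument (illustration only).
hypothetical_eigenvalues = [6.4, 2.0, 1.6, 1.4, 1.2, 1.1, 0.9]
kept = retained_factors(hypothetical_eigenvalues, n_items=25)
print(kept)
```

With 25 items, a factor must have an eigenvalue above 2.5 to exceed 10% of variance, so several factors can pass the eigenvalue criterion while only one passes both.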

Test–retest reliability was assessed with the intraclass correlation coefficient (ICC2,1) [41] in a subgroup of participants who completed the ULFI-Ar, DASH-Arabic, and NPRS-Arabic at two time points (baseline and 2–4 days later) during a non-treatment period. At the second time point, all participants were asked about their symptoms to make sure that they were stable. The minimum accepted ICC for test–retest reliability was 0.70 [42]. The measurement error was expressed as the standard error of measurement (SEM), calculated with the following formula: \(\mathrm{SEM} = \mathrm{SD}_{\mathrm{Baseline}} \times \sqrt{1-\mathrm{ICC}}\), where \(\mathrm{SD}_{\mathrm{Baseline}}\) is the standard deviation at baseline [35]. The minimal detectable change at the 90% confidence level (MDC90) was derived from the SEM using the equation: \(\mathrm{MDC}_{90} = \mathrm{SEM} \times \sqrt{2} \times 1.65\) [35].
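The reliability quantities above can be sketched in plain Python. The ICC function uses the standard two-way random-effects, absolute-agreement, single-measurement form of ICC(2,1) computed from ANOVA mean squares; the study itself used SPSS, so this is an illustration under that assumption rather than the authors' code. All function names are hypothetical, and the small score table is made up.

```python
import math
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for a table with one row per participant and one column per session."""
    n = len(scores)       # participants
    k = len(scores[0])    # sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def sem(sd_baseline: float, icc: float) -> float:
    """Standard error of measurement: SD at baseline x sqrt(1 - ICC)."""
    return sd_baseline * math.sqrt(1.0 - icc)


def mdc90(sem_value: float) -> float:
    """Minimal detectable change at the 90% level: SEM x sqrt(2) x 1.65."""
    return sem_value * math.sqrt(2.0) * 1.65


# Made-up test-retest scores: every retest is exactly 1 point higher.
shifted = [[1, 2], [2, 3], [3, 4]]
print(icc_2_1(shifted))   # a systematic shift lowers absolute agreement
print(mdc90(4.43))        # MDC90 for an SEM of 4.43
```

Because ICC(2,1) measures absolute agreement, a constant shift between sessions reduces it even when the ranking of participants is unchanged.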

For responsiveness, another subgroup of participants completed the three questionnaires twice: before treatment and after discharge, with a period of six weeks between the two tests. Responsiveness was determined by two methods. Internal responsiveness was assessed with the effect size (Cohen’s d) and the standardized response mean (SRM) [35]. Cohen’s d can be obtained either by dividing the mean difference between the pretest and posttest scores by the standard deviation of the baseline and post-treatment measurements (\(d=\frac{\mathrm{mean}}{\mathrm{SD}}\)) or by dividing the paired-sample t statistic by the square root of the sample size (\(d=\frac{t}{\sqrt{N}}\)). Both formulas give the same result. Cohen’s d is interpreted as a small (0.2), medium (0.5), or large (0.8) effect size [35]. The SRM was calculated by dividing the mean change between the baseline and follow-up measurements by the standard deviation of that change (\(\mathrm{SRM}=\frac{\bar{X}_{\mathrm{change}}}{\mathrm{SD}_{\mathrm{change}}}\)). External responsiveness was assessed by calculating the correlation between the ULFI-Ar, DASH-Arabic, and NPRS-Arabic using Pearson’s correlation coefficient (r). An r value of approximately 0.5 indicates moderate external responsiveness [29].
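The responsiveness statistics can be sketched as follows. Function names and the example scores are hypothetical. One point worth making explicit: the expression t/√N coincides with a mean-over-SD ratio when the SD is that of the paired change scores, which is also the definition of the SRM, and the sketch checks this identity numerically.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def change_scores(baseline: Sequence[float], follow_up: Sequence[float]) -> list:
    """Per-participant change from baseline to follow-up."""
    return [after - before for before, after in zip(baseline, follow_up)]


def srm(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """Standardized response mean: mean change / SD of the change scores."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / stdev(changes)


def paired_t(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """Paired-sample t statistic: mean change / (SD of change / sqrt(N))."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / (stdev(changes) / math.sqrt(len(changes)))


def cohen_d_from_t(t: float, n: int) -> float:
    """Effect size obtained from a paired t statistic: t / sqrt(N)."""
    return t / math.sqrt(n)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical pre/post scores for five participants (illustration only).
pre = [40, 55, 60, 35, 50]
post = [55, 60, 70, 50, 52]
print(srm(pre, post), cohen_d_from_t(paired_t(pre, post), len(pre)))
print(cohen_d_from_t(3.47, 27))  # t statistic and subgroup size given in the Results
```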

Results

A total of 146 participants with UL-MSDs were screened. Three participants were excluded because they did not fulfill the inclusion criteria, and four were excluded because of incomplete information. A total of 139 participants completed the ULFI-Ar, DASH-Arabic, and NPRS-Arabic. Of these, 46 participants completed the same questionnaires again for the test–retest study and 27 for responsiveness testing. Table 1 shows the demographic and clinical characteristics of the participants. On average, the participants were in their mid-thirties, and there were more male than female participants. The average pain duration was 10 months, and 57.6% of the participants had had pain for more than 14 days. The most commonly affected joint was the shoulder, with referral diagnoses of impingement and rotator cuff syndrome.

Table 1 Participants’ characteristics

Table 2 presents the mean and standard deviation of the scores from the three questionnaires, none of which showed floor or ceiling effects. There were no missing responses for the ULFI-Ar. The ‘Partly’ (half-point) response option was used by 95% of the participants, in 22% of their responses. The DASH-Arabic had missing responses on 26 different items from 84 (60.4%) participants. Six participants had ≥ 3 missing responses on the DASH-Arabic and were therefore excluded from the data analysis.

Table 2 The scores of the questionnaires for the baseline, test–retest, and responsiveness

For construct validity, the EFA revealed six factors with Eigenvalues > 1, of which only one exceeded 10% of variance (25.62%) and lay before the inflection point of the scree plot (Fig. 1). As only this factor met the three a priori criteria, the result indicated a unidimensional structure of the tool. Table 3 shows the item factor loadings for the one-factor solution and the average score for each item. For factor loading, eight items scored below 0.50 (lowest = 0.34), while no items scored > 0.80 (highest = 0.72), which indicated no item redundancy. The extraction component under the item average score showed that only three items had scores below 0.50 (lowest = 0.33), indicating a strong distinct component. The unidimensional model was analyzed with CFA, which showed factor loadings above 0.40 for all 25 items (Fig. 2). The CFA model fit was acceptable for χ2/df and RMSEA, whereas CFI and TLI were below the 0.90 threshold [df = 275, χ2 = 588.98 (p < 0.001), χ2/df = 2.14, CFI = 0.652, RMSEA = 0.091, and TLI = 0.620]. These results supported retaining the 25-item structure.
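As a quick check, the reported fit indices can be compared with the adequacy thresholds stated in the Data and statistical analysis section (χ2/df < 3, RMSEA < 0.10, CFI and TLI > 0.90). The function name below is hypothetical; the numbers are those reported in the preceding paragraph.

```python
from typing import Dict


def check_fit(chi2: float, df: int, rmsea: float, cfi: float, tli: float) -> Dict[str, bool]:
    """Compare CFA fit indices with the thresholds given in the Methods."""
    return {
        "chi2/df < 3": chi2 / df < 3,
        "RMSEA < 0.10": rmsea < 0.10,
        "CFI > 0.90": cfi > 0.90,
        "TLI > 0.90": tli > 0.90,
    }


fit = check_fit(chi2=588.98, df=275, rmsea=0.091, cfi=0.652, tli=0.620)
for criterion, met in fit.items():
    print(f"{criterion}: {'met' if met else 'not met'}")
```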

Fig. 1 Scree plot of the one factor of the upper limb functional index – Arabic

Table 3 Factor analysis loading for the upper limb functional index – Arabic

Fig. 2 Confirmatory factor analysis and standardized factor loading values of the upper limb functional index – Arabic

Paired t-tests showed no significant difference between the ULFI-Ar test and retest scores (t = 0.695; p = 0.49). The test–retest reliability of the ULFI-Ar was excellent (ICC2,1 = 0.95; 95% CI = 0.90 – 0.97). The SEM and MDC90 were 4.43% and 10.34%, respectively.

For internal responsiveness, the paired t-test showed a significant difference between the baseline and post-treatment ULFI-Ar scores (t = 3.47; p = 0.002). The effect size was medium (Cohen’s d = 0.67; 95% CI = 1.08 – 1.06) and the SRM was also medium (0.667; 95% CI = 0.24 – 0.98). The percentage difference between the SRM and effect size for the same change measurement in the same participants was 1%. For external responsiveness, the ULFI-Ar showed a strong negative correlation with the DASH-Arabic (r = −0.90) and with the NPRS-Arabic (r = −0.75, p < 0.001).

Table 4 summarizes the psychometric characteristics of the three questionnaires including reliability, validity, and responsiveness.

Table 4 Methodological characteristics of the upper limb functional index-Arabic, disabilities of the arm, shoulder, and hand, and numeric pain rating scale

Discussion

The psychometric testing produced adequate results that support the validity, reliability, and responsiveness of the ULFI-Ar. The construct validity of the ULFI-Ar in the current study was supported by the single-factor solution that emerged from the factor analysis. Although six factors had Eigenvalues > 1.0, only one factor accounted for > 10% of variance (29.4%). This result agrees with the studies of the English [31], Spanish [17], and Persian [23] versions. Conversely, the Turkish, Greek, and Urdu studies found six to seven factors with Eigenvalues > 1.0, two of which accounted for > 10% of variance [20, 25, 26]. The Brazilian study used parallel analysis as an alternative method together with confirmatory factor analysis (CFA), and both extracted only one factor [24]. Other studies did not report factor analysis results [18, 22]. The Italian study used a sample smaller than the required size [21]. In the current study, 8 items had factor loadings below 0.50, compared with the Spanish [5 items] [17], Greek [5 items] [25], Urdu [7 items] [26], Turkish [9 items] [20], Persian [10 items] [23], and English versions [14 items] [31]. This finding suggests that reducing the total number of items may reduce the respondent burden and improve the practicality of the tool [16]. In our study, no items scored > 0.80 (highest = 0.72), which confirms that there was no item redundancy. In the extraction component, only two items were below 0.50 (lowest = 0.34), suggesting a strong distinct component for an upper limb outcome measure. The CFA of the unidimensional model of the ULFI-Ar showed factor loadings above 0.40 for all 25 items, in agreement with the Brazilian study [24]. However, both the CFI (0.652) and TLI (0.620) were below the recommended level (> 0.90). These low values might be resolved by increasing the sample size to at least 200 participants, although a minimum of 100 participants is accepted for factor analysis [36]. The ULFI-Ar had greater values of χ2/df [2.14] and RMSEA [0.091] than the Brazilian version [1.75 and 0.063, respectively] [24]. However, the ULFI-Ar demonstrated lower values of CFI [0.652] and TLI [0.620] than the Brazilian version [0.918 and 0.910, respectively] [24].

The high test–retest reliability of the ULFI-Ar (ICC2,1 = 0.95) supports the instrument’s stability. This is comparable with the English [ICC2,1 = 0.98] [31], Greek [ICC2,1 = 0.97] [25], Italian [ICC2,1 = 0.94] [21], Spanish [ICC2,1 = 0.93] [17], Persian [ICC2,1 = 0.93] [23], French-Canadian [ICC2,1 = 0.92] [19], Urdu [ICC2,1 = 0.91] [26], Korean [ICC2,1 = 0.90] [22], and Brazilian versions [ICC2,1 = 0.90] [24], but higher than the Turkish version [ICC2,1 = 0.72] [20]. The authors of the Turkish version attributed the lower test–retest reliability in their study to the fact that all participants reported the ‘same’ on the ‘global rating of change’ [20]. We do not agree with this explanation: participants reporting “the same” indicates that their status was stable, and consequently the ICC value should be higher.

Measurement error and sensitivity, determined from the SEM and MDC90, were 4.43% and 10.34%, respectively. The small SEM in this study suggests good measurement precision [35]. This SEM is comparable to those of the Greek [3.34%] [25], English [3.41%] [31], Urdu [3.89%] [26], French-Canadian [4%] [18], Turkish [2.95%] [20], Persian [3.11%] [23], and Spanish [3.52%] [17] versions, but lower than that of the Brazilian version [6.11%] [24]. The MDC90 values in other versions were 5.53% (Turkish) [20], 7.25% (Persian) [23], 7.79% (Greek) [25], 7.93% (English) [31], 8.03% (Spanish) [17], 9.3% (French-Canadian) [19], 10.6% (Urdu) [26], 12% (Italian) [21], and 14.26% (Brazilian) [24].

Internal responsiveness measured by Cohen’s d effect size (0.67) and SRM (0.67) was moderate. Our finding is similar to the French-Canadian version [d = 0.62, SRM = 0.88] [19] but lower than the Greek and English versions [d = 1.19 and 0.93, SRM = 1.31 and 1.33, respectively] [25, 31]. External responsiveness of the ULFI-Ar was strong, as estimated by Pearson’s correlation coefficients with the DASH-Arabic (r = −0.90) and the NPRS-Arabic (r = −0.75). Of the other versions, only the French-Canadian study investigated this type of responsiveness, in relation to the DASH-FC (r = −0.64) [19]. In both the Arabic and French-Canadian studies, the time interval between the two measurements ranged from 2 to 6 weeks, and paired t-tests showed a significant difference between the baseline and follow-up scores. This is an optimal period for the clinician to detect the patients’ functional status in a short time and to evaluate the intervention outcome [19].

The main strength of this study is that we attempted to investigate all psychometric properties of the ULFI-Ar. Another strength is that our study recruited participants with acute, subacute, and chronic conditions [17]. A limitation may be that the sample was recruited from one clinical setting. Although standard Arabic was used in the translation of the ULFI, including Arabic-speaking participants from outside Saudi Arabia might yield different findings. Moreover, the sample size was not calculated for reliability and responsiveness, although we tried to recruit more participants than were used in previous similar research. In addition, the current study did not assess the psychometric properties of the ULFI-Ar in patients undergoing treatments other than physical therapy, which may limit the breadth of the study.

Conclusion

The study showed that the ULFI-Ar is unidimensional and has excellent test–retest reliability and medium to strong responsiveness. The ULFI-Ar can be used as an appropriate outcome measure in clinical and research settings for Arabic-speaking patients with UL-MSDs. Future research is recommended to assess the psychometric properties of the ULFI-Ar in patients undergoing treatments other than physical therapy.