Introduction

Traditional methods of reporting clinical outcomes include data on complication and mortality rates, and these data should be an integral part of any outcome study. Patient self-administered questionnaires acquire data on a patient’s experience of pain, functional disability, and general health status. The collection of data by patient reports removes the possibility of observer bias, as the questionnaires are self-administered by the patient at each assessment. These questionnaires do not replace traditional measures of clinical endpoints, but supplement them [1]. Patient questionnaires developed so far have focused on general health and quality of life, such as the 36-Item Short-Form Health Survey (SF-36) [2] or EuroQol (EQ-5D) [3]. Several measures for the evaluation of upper extremity function have also been developed [4–8]. Some of them are disease-specific [5], joint-specific [4] or region-specific measures [6–8]. For the wrist and hand region in particular, the most commonly used outcome measures described in the literature are the disabilities of the arm, shoulder, and hand (DASH) questionnaire [6] and the patient-rated wrist evaluation (PRWE) questionnaire [5]. The DASH score is the best instrument for evaluating patients with disorders involving multiple joints of the upper limb. In the Japanese Society for Surgery of the Hand (JSSH), the DASH questionnaire has been subjected to cross-cultural adaptation, and a Japanese version, the DASH–JSSH [9], has been developed. The PRWE questionnaire has also undergone cross-cultural adaptation, and a Japanese version of the PRWE (PRWE-J) has been developed by the Impairment Evaluation Committee of the JSSH. It has been reported that the PRWE-J has evaluation capacities equivalent to those of the original PRWE [10], and it is now used for patients with wrist and hand disorders.

For the elbow, MacDermid has reported the reliability and validity of the patient-rated elbow evaluation (PREE) questionnaire in patients with elbow pathology [11], and a German version of the PREE is available [12].

We, the Functional Evaluation Committee of the Japan Elbow Society, have completed cross-cultural adaptation and development of the Japanese version of the PREE (PREE-J). The purpose of the present study was to test the reliability, validity, and responsiveness of the PREE-J, with a view to making it available for use in Japan.

Materials and methods

Our aim was to adapt the PREE culturally for Japanese patients, as the Functional Evaluation Committee of the JSSH has done already for the PRWE-J.

Adaptation process

As in other transcultural adaptation projects, we followed a cycle of forward/backward translation [13]. The English version of the PREE [11] was translated into Japanese by two translators, one with and one without a medical background, whose first language was Japanese. The two forward translations were then synthesized into a single version after review and discussion by the committee. Parts of the PREE that also appear in the PRWE were translated identically, since the PRWE-J had already been published and consistency was important. This Japanese version was translated back into English by two other translators whose first language was English. After comparing the two back-translations with the original PREE, we developed a provisional version of the PREE-J. After pilot testing, the final PREE-J version was evaluated for reliability, validity, and responsiveness.

PREE questionnaire

The PREE questionnaire contains two subscales: a pain scale (PREE-P) and a functional scale (PREE-F). The PREE-P consists of five items with multiple-choice responses, each scored from 0 points (no pain) to 10 points (worst possible). The pain score (0 = best to 50 = worst) is calculated as the sum of the scores for the five individual items (4 items for intensity, 1 item for frequency).

The PREE-F consists of a specific functional scale (PREE-SF) and a usual functional scale (PREE-UF). The answers are rated from 0 points (no difficulty) to 10 points (unable to perform at all). The PREE-SF covers eleven specific elbow functional activities and is calculated as the sum of all eleven items (0 = best to 110 = worst). The PREE-UF covers four usual elbow functional activities and is calculated as the sum of all four items (0 = best to 40 = worst). The overall score for PREE-F (0 = best to 50 = worst) is calculated as the sum of PREE-SF and PREE-UF divided by three. The total PREE score is the sum of PREE-P and PREE-F, with a higher score indicating greater pain and disability (0 = best, 100 = worst).
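The scoring rules described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `pree_scores` and the input layout are assumptions made here, and missing items are not handled because the scoring description does not state how they should be treated.

```python
def pree_scores(pain_items, specific_items, usual_items):
    """Return PREE subscale and total scores from item ratings (each 0-10).

    Hypothetical helper: expects 5 pain items, 11 specific-function items,
    and 4 usual-function items, all answered.
    """
    if len(pain_items) != 5 or len(specific_items) != 11 or len(usual_items) != 4:
        raise ValueError("expected 5 pain, 11 specific-function, and 4 usual-function items")
    for rating in [*pain_items, *specific_items, *usual_items]:
        if not 0 <= rating <= 10:
            raise ValueError("each item must be rated from 0 to 10")

    pain = sum(pain_items)             # PREE-P, 0 (best) to 50 (worst)
    specific = sum(specific_items)     # PREE-SF, 0 to 110
    usual = sum(usual_items)           # PREE-UF, 0 to 40
    function = (specific + usual) / 3  # PREE-F, 0 to 50
    return {
        "P": pain,
        "SF": specific,
        "UF": usual,
        "F": function,
        "total": pain + function,      # PREE, 0 to 100
    }
```

Dividing the 150-point function sum by three gives pain and function equal weight (50 points each) in the total score.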

Patients and setting

The study was conducted on a consecutive series of 74 patients with elbow disorders seen on an outpatient or inpatient basis at six departments of orthopedic surgery in Japan (Table 1).

Table 1 Descriptive summary of subjects

Informed consent to participate in this study was obtained from each patient; for minors, parental consent was obtained instead. The patients answered the PREE-J questionnaire, the DASH–JSSH questionnaire, and the official Japanese version of the SF-36 (version 2.0). The data collected from the 74 patients were used as baseline values. Among the 74 patients, 53 completed the PREE-J questionnaire again 1 or 2 weeks later for assessment of test–retest reliability. Fifty-three patients who underwent surgery conducted by six elbow surgeons answered the PREE-J, the DASH–JSSH, and SF-36 questionnaires twice: preoperatively and 3 months after surgery.

Assessment of reliability, validity, and responsiveness

Reliability was investigated by assessing reproducibility, based on the test–retest method, and internal consistency. Validity was examined with the following analyses. Factor analysis (principal axis factoring) was conducted to examine the construct validity and unidimensionality of the PREE-P and PREE-F. Completeness of the item responses for the PREE-J was also examined.

Coefficients of correlation between the PREE-P or PREE-F and the DASH–JSSH were obtained. The following hypotheses were examined to investigate concurrent validity: (1) the PREE-P would exhibit moderate association with DASH–JSSH; (2) the PREE-F (SF and UF) would exhibit the strongest association with DASH–JSSH.

Coefficients of correlation between the PREE-P or PREE-F and the SF-36 were also obtained. The following hypotheses were examined to investigate concurrent validity: (1) the PREE-P would exhibit the strongest association with “bodily pain” (SF36-BP) among the SF-36 subscales; (2) the PREE-F (SF and UF) would exhibit the strongest association with “physical functioning” (SF36-PF) or “role-physical” (SF36-RP). Those three subscales of the SF-36 were chosen because the correlation between the DASH–JSSH and the three SF-36 subscales was more than moderate [7, 10].

The responsiveness of the PREE-J, DASH–JSSH, and SF-36 was examined by calculating the standardized response mean (SRM) (mean change/SD of the change) [14] and the effect size (mean change/SD of baseline value) [15] after elbow surgery in 53 patients.
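The two responsiveness indices can be illustrated with a short Python sketch. The function name `responsiveness` is hypothetical, and two conventions are assumed here rather than taken from the text: change is computed as preoperative minus postoperative score (so that improvement on the PREE, where lower is better, is positive), and the sample standard deviation (n − 1 denominator) is used.

```python
from statistics import mean, stdev


def responsiveness(pre, post):
    """Return (SRM, effect size) for paired pre/post scores.

    Assumption: change = pre - post, so a fall in a disability score
    after surgery yields positive values.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("pre and post must be paired and contain at least two patients")
    changes = [before - after for before, after in zip(pre, post)]
    srm = mean(changes) / stdev(changes)       # mean change / SD of the change
    effect_size = mean(changes) / stdev(pre)   # mean change / SD of baseline
    return srm, effect_size
```

Under this sign convention, a scale on which higher scores indicate better health, such as an SF-36 subscale, yields negative values when patients improve.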

The protocol of this study was reviewed and approved by the institutional review board of Nagaoka Red Cross Hospital prior to implementation.

Statistical analysis

Distribution of the PREE-J, DASH–JSSH, and SF-36, and ages of the subjects were assessed. The interval measurements (age, PREE-J except PREE-SF, DASH–JSSH, all subscales of the SF-36 except “physical functioning”, “social functioning”, and “role-emotional”) were normally distributed; the other interval measurements (PREE-SF, “physical functioning”, “social functioning”, and “role-emotional” of the SF-36) were not normally distributed. Cronbach’s α was then used to assess the internal consistency of the PREE-J (P and F). The instrument test–retest reliability of the PREE-J (P and F) was assessed with the intraclass correlation coefficient (ICC). All coefficients of correlation among the PREE-J (P and F), DASH–JSSH, and SF-36 were calculated using Spearman’s correlation (a nonparametric test) because some subscales of the SF-36 were not normally distributed. Changes in measurements after elbow surgery were assessed with a parametric test (paired t test). All statistical analyses were conducted using the Statistical Package for the Social Sciences (SPSS) version 19.0J software package. Differences at P < 0.05 were considered to be significant.
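For readers unfamiliar with Spearman’s correlation, the following Python sketch shows the usual computation: each variable is converted to ranks (ties receive the average rank) and Pearson’s correlation is then computed on the ranks. It is an illustration only, not the SPSS procedure used in this study, and the function names are assumptions made here.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

Because a higher PREE score indicates worse status whereas a higher SF-36 score indicates better status, correlations between the two are expected to be negative.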

Results

Completeness of item responses

No patients had difficulty completing the PREE-J questionnaire. Of the 74 patients, three did not answer one or two items (two left two items unanswered and one left one item unanswered). The unanswered items were as follows: item 5 of the PREE-P (one patient), item 7 of the PREE-SF and item 4 of the PREE-UF (one patient), and item 10 of the PREE-SF and item 4 of the PREE-UF (one patient).

The mean, median, standard deviation, and range of the PREE-J, DASH–JSSH, and SF-36 scores are shown in Table 2. Two, 7, 7, and 3 patients had the minimum disability score of zero (ceiling) on the PREE-P, PREE-SF, PREE-UF, and PREE-F, respectively. One patient each had the maximum disability score (floor) on the PREE-P, PREE-SF, PREE-UF, PREE-F, and PREE.

Table 2 Scores for PREE, DASH, and SF-36

Reliability

Internal consistency was assessed using Cronbach’s α coefficient (Table 3). The α coefficient for the five items in the PREE-P was 0.92 (n = 73); after eliminating each item one by one, the range was 0.88–0.93. The α coefficient for the eleven items in the PREE-SF was 0.96 (n = 72); after eliminating each item one by one, the range was 0.95–0.96. The α coefficient for the four items in the PREE-UF was 0.92 (n = 72); after eliminating each item one by one, the range was 0.88–0.92. The α coefficient for the fifteen items in the PREE-F was 0.97 (n = 72); after eliminating each item one by one, it remained 0.97. The α coefficient for the twenty items in the PREE was 0.97 (n = 71); after eliminating each item one by one, it remained 0.97. None of the items changed the internal consistency substantially when eliminated.
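The computation behind these figures can be sketched as follows. This is a minimal illustration of the standard formula for Cronbach’s α and of the item-elimination procedure, not the SPSS routine used in the study; the function names are hypothetical, and respondents with missing items are assumed to have been excluded beforehand.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: one list of item scores per respondent (no missing values).
    """
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in rows])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item removed in turn (one value per item)."""
    n_items = len(rows[0])
    return [
        cronbach_alpha([list(row[:i]) + list(row[i + 1:]) for row in rows])
        for i in range(n_items)
    ]
```

An item whose removal raises α noticeably above the full-scale value would be a candidate for revision; none of the PREE-J items showed this pattern.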

Table 3 Internal consistency of PREE-J

Instrument test–retest reliability was assessed with the intraclass correlation coefficient (ICC) (Table 4). Of the 53 patients assessed, 52 had no missing items, and the mean period between the first and second tests was 12 ± 3 days (range 5–19 days). The ICC for the PREE-P, PREE-SF, PREE-UF, PREE-F, and PREE in all cases (n = 53) was 0.92, 0.93, 0.92, 0.93, and 0.94, respectively. The ICC for the PREE-P, PREE-SF, PREE-UF, PREE-F, and PREE in patients with rheumatoid arthritis (RA) (n = 30) was 0.90, 0.86, 0.85, 0.86, and 0.90, respectively. All ICCs for the PREE-J subscales and total scale indicated sufficient reproducibility.
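As an illustration of how a test–retest ICC can be obtained, the sketch below computes one common form from a two-way analysis of variance. The text does not state which ICC model was used, so the choice here (two-way random effects, absolute agreement, single measurement, often written ICC(2,1)) is an assumption, as is the function name.

```python
def icc_2_1(rows):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    rows: one list per patient containing that patient's score at each
    administration (e.g. [test, retest]); no missing values.
    """
    n = len(rows)      # patients
    k = len(rows[0])   # administrations
    if n < 2 or k < 2:
        raise ValueError("need at least two patients and two administrations")
    grand_mean = sum(sum(row) for row in rows) / (n * k)
    row_means = [sum(row) / k for row in rows]
    col_means = [sum(row[j] for row in rows) / n for j in range(k)]

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-patient mean square
    ms_cols = ss_cols / (k - 1)               # between-administration mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

With this form, a systematic shift between the first and second administration lowers the ICC even when patients keep the same rank order, which is why it is often preferred for reproducibility studies.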

Table 4 Intraclass correlation coefficient of PREE-J

Validity

Factor analysis (principal axis factoring) was conducted to confirm the unidimensionality of the PREE-P and PREE-F. The first factor of the PREE-P had an eigenvalue (amount of variation in the total sample attributable to that factor) of 3.67, which accounted for 71 % of the total variance of the PREE-P scores (Fig. 1). The unidimensionality of the PREE-P was found to be strong as a result of the low eigenvalue of the second factor (0.54) (Fig. 1). Assessment of the first factor loading for each item showed that all items had a loading of 0.4 or higher (Table 5).

Fig. 1 Solid plot of the patient-rated elbow evaluation, pain scale (PREE-P) factors

Table 5 Component matrix of factor analysis for PREE-P

The first factor of the PREE-F had an eigenvalue of 10.50, which accounted for 70 % of the total variance of the PREE-F scores (Fig. 2). The second factor of the PREE-F had an eigenvalue of 1.03, which accounted for 7 % of the total variance of the PREE-F scores, bringing the cumulative variance explained to 77 % (Fig. 2). The third factor of the PREE-F had an eigenvalue of 0.65, which accounted for 4 % of the total variance of the PREE-F scores. Assessment of the first factor loading for each item showed that all items had a loading of 0.4 or higher (Table 6).

Fig. 2 Solid plot of the PREE-F (where F represents the functional scale) factors

Table 6 Component matrix of factor analysis for PREE-F

Factor analysis indicated two factors in the PREE-F, suggesting a bidimensional structure. On the second factor, all items pertaining to usual function had positive loadings. Although many specific-function items had negative loadings, three specific-function items had high positive loadings (Table 6). The two factors therefore did not separate clearly into the PREE-SF and PREE-UF.

The coefficients of correlation between the PREE-P and PREE-F or DASH–JSSH were 0.81 and 0.74, respectively (Table 7) (P < 0.01). These results indicated moderate correlations between the PREE-P and the PREE-F and between the PREE-P and the DASH–JSSH. The coefficient of correlation between the PREE-F and DASH–JSSH was 0.86 (Table 7) (P < 0.01), which indicated a strong correlation between them. The coefficients of correlation between the DASH–JSSH and PREE-SF or PREE-UF were 0.85 and 0.82, respectively (Table 7) (P < 0.01). These results demonstrated strong correlations between the PREE-SF and DASH–JSSH as well as between the PREE-UF and DASH–JSSH, thus supporting the proposed hypotheses (Table 7).

Table 7 Correlation of PREE, DASH, and SF-36

The correlations between the PREE-P score and the subscales of the SF-36 scale ranged from −0.38 to −0.65 (Table 7). A moderate correlation was observed between the SF36-BP and PREE-P (r = −0.65). These results supported the proposed hypotheses. The correlations between the PREE-P and SF36-PF or SF36-RP were somewhat weak, thus the proposed hypotheses were not supported.

The correlations between the PREE-SF score and the subscales of the SF-36 scale ranged from −0.43 to −0.69 (Table 7). A moderate correlation was observed between the SF36-PF and PREE-SF (r = −0.69), PREE-F (r = −0.69), or PREE (r = −0.68). These results supported the proposed hypotheses. The correlations between the PREE-SF and SF36-BP or SF36-RP were somewhat weak, and thus the hypotheses were not supported.

The correlations between the PREE-UF score and the subscales of the SF-36 scale ranged from −0.27 to −0.66 (Table 7). A moderate correlation was observed between the PREE-UF and SF36-PF (r = −0.66) or SF36-BP (r = −0.60), which supported the proposed hypotheses.

Multiple regression analysis of PREE adjusted for age and sex was conducted because a significant difference (P < 0.001) in mean age (±SD) was found between men (36 ± 19 years) and women (59 ± 15 years). The PREE scores were explained by the DASH scores and SF36-BP (Table 8).

Table 8 Multiple regression of PREE

Responsiveness

Fifty-three patients who underwent elbow surgery completed the PREE, DASH–JSSH, and SF-36 at 3 months (mean ± SD: 98 ± 23 days) after surgery. The mean age of those patients was 42 ± 20 years (range 13–77 years). There were 32 men and 21 women. The calculated SRMs and effect sizes of PREE-P, PREE-SF, PREE-UF, PREE-F, PREE, DASH–JSSH, and SF36-BP (n = 53) were 1.31/1.32, 0.94/0.81, 1.03/0.91, 1.02/0.86, 1.28/1.12, 0.99/0.85, and −1.15/−1.43, respectively (Table 9). There were significant differences between the mean values of preoperative and postoperative PREE-P, PREE-SF, PREE-UF, PREE-F, PREE, DASH–JSSH, and all subscales of the SF-36 except for the “mental health” subscale (n = 53).

Table 9 Standardized response means and effect size of PREE, DASH, and SF-36

Among the 53 patients who underwent surgery, 17 had rheumatoid arthritis and underwent total elbow arthroplasty. In these patients, the calculated SRMs and effect sizes of PREE-P, PREE-SF, PREE-UF, PREE-F, PREE, DASH–JSSH, and SF36-BP were 2.31/2.06, 1.34/1.32, 1.41/1.29, 1.44/1.35, 2.21/1.75, 1.12/0.94, and −1.12/−1.29, respectively (Table 9). There were significant differences between the mean values of preoperative and postoperative PREE-P, PREE-SF, PREE-UF, PREE-F, PREE, DASH–JSSH, SF36-PF, and SF36-BP (n = 17).

Discussion

We had previously developed a Japanese self-administered questionnaire based on an English version of the Western Ontario and McMaster Universities osteoarthritis index (WOMAC) to measure subjective function and pain status of patients undergoing total knee arthroplasty [16]. With the cross-cultural translation and adaptation of the original PREE, we produced a joint-specific instrument for subjective assessment of elbow pain and function in Japanese-speaking patients. The purpose of this study was to examine the psychometric properties of the PREE-J in terms of reliability, validity, and responsiveness.

Internal consistency was assessed using Cronbach’s α coefficient (Table 3). The α coefficients for pain, function, and total scores in PREE-J (0.92/0.97/0.97) were equivalent to those of the original version (data not shown) and the German version (–/–/0.96) [12].

Instrument test–retest reliability was assessed with the ICC (Table 4). The ICCs of the pain, function, and total scores in PREE-J (0.92/0.93/0.94) were equivalent to those of the original version (0.88/0.89/0.95) [11], and the German version (0.73/0.82/0.80) [12]. This indicated that the ICCs for the PREE-J subscales and total scale had sufficient reproducibility. Even in patients with RA, whose symptoms show diurnal and day-to-day variation, the ICCs were good (0.90/0.86/0.90) (Table 4).

The process of validation of the PREE-J questionnaire has shown that it has validity similar to that of the German version and the original PREE. The strong correlations between the PREE-J and DASH–JSSH supported this validity (Table 7). Although the correlations between the PREE-J and SF-36 were weak, bodily pain and physical functioning assessed by SF-36 showed moderate correlation with the German version as well as the original PREE. The DASH–JSSH had higher correlations with physical functioning assessed by SF-36 than the PREE-J did. This is thought to be because the DASH covers a broader region of the body than the PREE. These results demonstrated that the PREE-J measures only one area of health-related quality of life.

The pain scale of PREE-J exhibited high unidimensionality (Table 5; Fig. 1), and there was no low item-scale correlation. The loading of this scale was very high. These results indicated that the pain scale of PREE-J has high validity.

The functional scale of PREE-J exhibited bidimensionality (Table 6; Fig. 2), and there was no low item-scale correlation. However, the two factors could not be clearly separated into specific function and usual function in the PREE-J. Imaeda et al. [10] reported that the two factors in the PRWE-J were clearly separable into specific function (6 items) and usual function (4 items). Therefore, we would like to consider changing some of the items related to specific function in the PREE-J.

Cohen’s rule-of-thumb for interpreting the “effect size index” (a value of 0.2 is small, 0.5 moderate, and ≥0.8 large) can be applied to the SRM [14]. Imaeda et al. [10] reported that the SRMs/effect sizes of PRWE-P, PRWE-F or PRWE were excellent: 1.7/2.2, 1.2/1.3, and 1.6/1.9, respectively. In the present study, the responsiveness (SRM/ES) of the pain scale and functional scale of the PREE-J for the patients overall was excellent 3 months after surgery, and larger than that of the DASH–JSSH.

The Kinemax Outcome Group reported that measurement of the SRM showed the Knee Society knee score to be more responsive (SRM 2.2) than the WOMAC (SRM 2.0 for pain and 1.4 for function) and the SF-36 (SRM 1.0 for bodily pain and 1.1 for physical functioning). The Knee Society function score was the least responsive measure (SRM 0.8). The WOMAC and SF-36 have high internal consistency and are more responsive measures of the outcomes of total knee arthroplasty [17]. In this study, the responsiveness of the pain and functional scales of the PREE-J for patients undergoing total elbow arthroplasty, measured 3 months after surgery, was equivalent to those results.

Conclusions

We conclude that the Japanese version of PREE (PREE-J) has evaluation capacities comparable to those of the original PREE. We expect that use of this scale in Japan for self-assessment by patients receiving treatment will make a meaningful contribution to improving the outcomes of patients with elbow problems.