Psychometric properties of the Swedish version of the Treatment Outcome Satisfaction Questionnaire

Patient satisfaction is an outcome measure for low-back pain (LBP) interventions which allows clinicians to design patient-oriented treatments. The Treatment Outcome Satisfaction Questionnaire (TOSQ) is an English instrument constructed for such evaluations, and no equivalent instruments exist for the Swedish population. This study, therefore, translated TOSQ into Swedish and assessed the translated version’s psychometric properties for patients with LBP. A cross-cultural adaptation was used to translate TOSQ into Swedish. Subsequently, data from 131 patients with LBP who had undergone physiotherapy were consecutively collected and analyzed in a Rasch rating scale model with person measures standardized at 0–100 logits to evaluate the translated scale’s validity. Finally, test–retest reliability of the Swedish version of TOSQ (TOSQ-S) was quantified via an intraclass correlation coefficient (ICC) and the standard error of measurement (SEM) in 41 patients. TOSQ was successfully translated into Swedish; however, while some Rasch model indices supported the translated scale’s unidimensionality, one out of eight items and 12 out of 131 subjects misfitted the model. Scale optimization resulted in a six-item subconfiguration, for which all items fitted the model, person misfits were reduced to ten subjects, and the person separation index increased from 1.86 to 2.04. ICC and SEM estimates suggested acceptable reliability for the six-item TOSQ-S at 0.66 and 6.6 logits, respectively. A six-item TOSQ-S configuration showed acceptable psychometric properties and is suitable for measuring treatment outcome satisfaction of physiotherapy in patients with LBP.


Introduction
Low-back pain (LBP) is the single leading cause of disability globally [1,2] and, consequently, one of the most common reasons for seeking healthcare in the western world [3]. Physiotherapy is one of the main conservative LBP treatments, and intervention results are regularly evaluated by patient-reported outcome measures targeted at several domains, patient satisfaction being one [4,5].
Patient satisfaction is a multidimensional concept consisting of satisfaction with the physical environment, the patient-care provider interaction, and the treatment outcome [6,7]. Of these, the last in principle reflects the patient's attitude towards the health care intervention [8,9]. It is thus essential, as it incorporates the patient's view into everyday clinical practice [10] and allows the clinician to design patient-oriented interventions [5]. Hitherto, treatment outcome satisfaction of LBP interventions has typically been assessed either by global items [11-13] or by instruments intended for other purposes [14]. The absence of psychometrically sound instruments tuned for patients with LBP is a plausible explanation for this.
In 2010, the 'Treatment Outcome Satisfaction Questionnaire' (TOSQ) was developed to assess outcome satisfaction in patients with LBP following physiotherapeutic interventions [15]. TOSQ underwent a meticulous development process during which construct validity was examined based on a sample of physiotherapy outpatients in the New York metropolitan area [15]. To allow integration of TOSQ into Swedish primary health care, a Swedish translation of the instrument is necessary. Cross-cultural adaptation is suitable for this purpose, as it incorporates both language and cultural differences into the translation process. The translation nevertheless risks influencing the TOSQ properties considerably, which makes it necessary to validate the translated version [16].
Traditionally, approaches based on classical test theory were used to assess psychometric properties of single-construct instruments; however, such methods are subject to two major limitations: they do not consider item hierarchy and they assume additivity of rating scale data [17]. Rasch analysis provides an alternative that is not limited by these assumptions, since it transforms ordinal item scores to log-odds (logits) interval data, which are used to determine the relation between person ability and item requirement in the measured construct [18]. In addition, it is necessary to quantify the test-retest reliability of the translated TOSQ items to determine result precision. Hence, this study translated TOSQ into Swedish and assessed the psychometric properties of the Swedish version of TOSQ (TOSQ-S) in patients with LBP.

Design and participants
In a two-step design, TOSQ was cross-culturally adapted from English to Swedish, and TOSQ-S was subsequently psychometrically evaluated.
Following approval by the Regional Ethical Research Committee, data were consecutively collected from patients with LBP in conjunction with their rehabilitation at three specialist physiotherapy clinics in Stockholm, Sweden. Patients aged between 18 and 75 years seeking care for LBP as a primary complaint were included; those who did not understand Swedish were excluded. In total, 131 patients, 44 males and 87 females, with a mean (SD) age of 49 (14) years were enrolled. Rehabilitation comprised a median of seven occasions of physical training, manual therapy, and home exercises adapted to the patients' needs.
At rehabilitation completion, participants signed informed consent, completed TOSQ-S without interference in a private room, and left the completed questionnaire in a sealed, coded envelope. Forty-one subjects also completed the questionnaire at home 1 week post-rehabilitation and mailed the completed questionnaire in a coded envelope.

Treatment Outcome Satisfaction Questionnaire
TOSQ assesses treatment outcome satisfaction via eight aspects of the patient's experience at the end of physiotherapy treatment. It reveals information about patients' understanding of their LBP, how they feel about their response to treatment and the improvement achieved, their perception of their ability to cope and/or carry on with their lives, the effort they put into the rehabilitation process, and their ability to take care of themselves and control their LBP problem [15]. In total, TOSQ contains ten items scored on a seven-point Likert scale, of which two are global and eight are specific [15]. A summary score between 0 and 48 can be calculated for the eight specific items [15].

Cross-cultural adaptation
In accordance with published guidelines [16], the cross-cultural adaptation followed a forward-and-backward translation process in which individual items, response options, and questionnaire instructions were adapted to Swedish (Fig. 1). Two native Swedish speakers independently carried out translations from English to Swedish: one a physiotherapist and researcher and the other a researcher unfamiliar with physiotherapy. Next, the two translations were checked for inconsistencies, compared, adjusted, and pooled into one. Subsequently, two English-speaking translators, one with and one without clinical expertise and neither with knowledge of TOSQ, independently back-translated the Swedish translation of TOSQ to English. The two back-translations were then reviewed, compared to the original TOSQ to ascertain conceptual and semantic equivalence, combined into one final back-translated version, and used to construct the final TOSQ-S. Finally, TOSQ-S was pre-tested for accuracy of wording and comprehension in a cohort of 15 outpatients with LBP seeking physiotherapy.

Statistical analysis
A polytomous Rasch rating scale model (Winsteps® Rasch measurement program v3.92.1, John M Linacre, Oregon, USA) with person measures scaled at 0-100 logits was used to evaluate the psychometric properties of TOSQ-S. The eight specific TOSQ-S items were analyzed first, and scale structure optimization was subsequently attempted starting with all ten TOSQ-S items. Analyses were initiated with an assessment of the scale's categorical structure, for which the acceptability criteria were a minimum of ten observations per category, monotonic categorical advancement, and categorical fit to the model [19]. Next, person and item fit to the model was examined. Model fit was evaluated via the infit and outfit mean square (MNSQ) values, with 1.0 suggesting perfect fit and lower or higher values indicating overfit and underfit, respectively [18]. Significant MNSQ values smaller than 0.6 or larger than 1.4 were considered to misfit the model [20], and less than 5% of misfitted items and persons, respectively, was considered acceptable. Misfitted items were discarded one by one, starting from the highest MNSQ value, until an acceptable internal scale validity was achieved. Misfitted persons were investigated to elucidate possible explanations for misfit, but were not omitted, to maintain external validity. Scale unidimensionality was assessed via the proportion of total variance accounted for by the Rasch model and by a principal component analysis of the residuals (PCAR). Unidimensionality criteria were that the observed variance approximated the expected model variance [21] and that eigenvalues of any residual components were less than 2.0 [22], corresponding to the strength of fewer than two items in the latent residual dimension. The scale targeting, meaning the sample's satisfaction level relative to the items' satisfaction requirement, was assessed via person-item maps [23].
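The quantities named above can be illustrated with a minimal sketch, which is not the Winsteps implementation used in the study. It assumes the standard Andrich rating scale parameterization, in which the probability of category k is proportional to the exponential of the cumulative sum of (person measure - item measure - threshold). The function names, the threshold values in `TAUS`, and the example responses are hypothetical, and the sketch works in unscaled logits and ignores the significance test attached to the MNSQ criteria.

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Category probabilities (0..m) under the Andrich rating scale model."""
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum of 0.
    cum, total = [0.0], 0.0
    for tau in taus:
        total += theta - delta - tau
        cum.append(total)
    top = max(cum)  # subtract the maximum for numerical stability
    weights = [math.exp(c - top) for c in cum]
    norm = sum(weights)
    return [w / norm for w in weights]

def expected_and_variance(probs):
    """Model-expected score and its variance for one person-item encounter."""
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

def item_fit_mnsq(responses, thetas, delta, taus):
    """Infit and outfit mean squares for one item across persons."""
    sq_resid, variances = [], []
    for x, theta in zip(responses, thetas):
        mean, var = expected_and_variance(rsm_category_probs(theta, delta, taus))
        sq_resid.append((x - mean) ** 2)
        variances.append(var)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(sq_resid) / sum(variances)
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(sq_resid)
    return infit, outfit

def classify_fit(mnsq, low=0.6, high=1.4):
    """Apply the 0.6/1.4 cut-offs from the text (significance not checked here)."""
    if mnsq > high:
        return "underfit"
    if mnsq < low:
        return "overfit"
    return "acceptable"

# Hypothetical thresholds for a seven-point scale (six steps), unscaled logits.
TAUS = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

probs = rsm_category_probs(theta=1.0, delta=0.0, taus=TAUS)
infit, outfit = item_fit_mnsq([2, 3, 6, 4], [-1.0, 0.0, 0.5, 1.0], 0.0, TAUS)
print([round(p, 3) for p in probs])
print(round(infit, 2), classify_fit(infit), round(outfit, 2), classify_fit(outfit))
```

Infit weights each squared residual by its model variance and is therefore dominated by well-targeted person-item encounters, whereas outfit is more sensitive to unexpected responses far from a person's level.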
Person separation reliability, meaning the scale's reliability in distinguishing between persons according to their satisfaction level, was determined via the person separation index, with 1.50 considered acceptable and 2.00 good, as these values allow two and three satisfaction levels to be discerned, respectively [23].
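The link between the separation index and the number of discernible levels can be sketched as follows. This assumes the commonly used strata formula H = (4G + 1) / 3 and the relation R = G^2 / (1 + G^2) between the separation index G and separation reliability R. The source does not state which formula underlies its thresholds, so the function names and the choice of formula are assumptions.

```python
def separation_to_strata(g):
    """Number of statistically distinct levels for separation index g."""
    return (4.0 * g + 1.0) / 3.0

def separation_to_reliability(g):
    """Separation reliability implied by separation index g."""
    return g ** 2 / (1.0 + g ** 2)

# Thresholds from the text, followed by the two indices reported for TOSQ-S.
for g in (1.50, 2.00, 1.86, 2.04):
    strata = separation_to_strata(g)
    reliability = separation_to_reliability(g)
    print(f"G = {g:.2f}: strata = {strata:.2f}, reliability = {reliability:.2f}")
```

Under these assumptions, an index of 1.50 yields just over two strata and an index of 2.00 yields exactly three, which agrees with the interpretation given above.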
Test-retest reliability of individual items was estimated by the linearly weighted kappa and Krippendorff's alpha for ordinal scale data [24,25]. The overall scale reliability was calculated from the person logit measures, and quantified by both an intraclass correlation coefficient (ICC) based on a single-measurement two-way random effects model of absolute agreement, and the standard error of measurement (SEM, i.e., the within-subject standard deviation between trials) [26]. To ascertain that inter-trial differences were related to persons and not items, item measures were anchored with the estimates of the first trial in the retest Rasch model. Reliability estimates were interpreted as ≤0.40 = poor, 0.40-0.59 = fair, 0.60-0.74 = good, and ≥0.75 = excellent [27]. Reliability analyses were computed via the packages 'rel' (v1.

Results
Cross-cultural adaptation
Table 1 displays the final TOSQ-S. TOSQ was successfully forward-and-backward translated into Swedish with neither semantic nor language ambiguities, and all items were approved by both English and Swedish research members. Pre-testing of TOSQ-S revealed no difficulties.

Rasch analysis
Figure 2 shows the response distribution of the 131 patients across the TOSQ-S items. Whereas the responses were skewed towards high treatment outcome satisfaction, no patients had maximal or minimal scores. Nonetheless, the eight-item TOSQ-S Rasch model's lowest response category only had one observation, rendering that category's estimates unstable.
Moreover, while the average categorical measures advanced monotonically, the second and third lowest categories underfitted the model (MNSQ outfit ≤ 1.81), and collapsing them did not resolve the misfit (MNSQ outfit ≤ 1.70). In addition, whereas both the variance accounted for by the model (45%; eigenvalue = 6.6) and the PCAR (eigenvalues < 1.8) supported unidimensionality, nine persons (7%) and one item (item 8: MNSQ infit = 1.51; MNSQ outfit = 1.55) underfitted, and three persons (2%) overfitted the model. Finally, TOSQ-S mistargeted the sample, having a mean difference of 22 logits between person satisfaction and item satisfaction requirement; however, the person separation index of 1.86 indicated that it could nevertheless distinguish between patients of high and low satisfaction.
To improve TOSQ-S's psychometric properties, data were reanalyzed following stepwise modifications. First, all ten items were analyzed together. Whereas all categories had at least ten observations and their average measures advanced monotonically, the two lowest categories misfitted the Rasch model (MNSQ infit ≤ 1.66; MNSQ outfit ≤ 2.68), and collapsing them did not resolve the misfit (MNSQ infit = 1.80; MNSQ outfit = 2.83). Again, both the variance accounted for by the model (53%; eigenvalue = 11.2) and the PCAR (eigenvalues < 1.9) supported unidimensionality; however, 14 patients (11%) and two items (Table 2) underfitted, and nine subjects (7%) overfitted the model. The person-item map revealed an increase in the mistargeting with a mean difference of 28.4 logits between person satisfaction and item satisfaction requirement (Fig. 3a), but the person separation index remained at a similar magnitude of 1.89. Following exclusion of the misfitted items 8 and 9, nine (7%) and four subjects (3%) underfitted and overfitted the model, respectively. The additional exclusion of item 6 reduced the number of underfitted subjects to seven (5%) but increased the number of overfitted subjects to six (5%). Finally, discarding item 10 reduced the number of overfitted subjects to three (2%). The exclusion of these four items resulted in the three lowest response categories having fewer than ten observations, rendering estimates there unstable; however, their average measures advanced monotonically, and while the third lowest category overfitted the model (MNSQ infit = 0.55; MNSQ outfit = 0.37), collapsing the second and third lowest categories this time resolved the misfit. The reduced scale accounted for 60% of the total variance (eigenvalue = 8.9), and no substantive residual dimensions were identified during PCAR (eigenvalues < 1.7); both measures again suggested unidimensionality. The mistargeting remained at a similar magnitude of 26.3 logits (Fig. 3b); however, the person separation index at 2.04 suggested that three distinct subgroups could be reliably discerned by the six-item subconfiguration (Fig. 3b).

Test-retest reliability
Reliability point estimates suggested excellent, fair, and poor agreements for two, six, and two items, respectively (Table 2). For the six-item TOSQ-S summary score, mean (SD) person measures were 38.3 (20.8) logits the first trial    One patient decreased considerably more in satisfaction between trials than other patients. Excluding this patient from the analyses reduced the inter-trial mean discrepancy to 3.3 logits, decreased SEM to 5.5 (3.7, 7.4) logits, and increased ICC to 0.79 (0.64, 0.88).

Discussion
This study cross-culturally adapted TOSQ from English to Swedish and assessed the psychometric properties of the translated scale in patients with LBP following physiotherapy completion. Our results showed that a six-item TOSQ-S configuration had acceptable psychometric properties and is suitable for measuring treatment outcome satisfaction of physiotherapy in patients with LBP.
The cross-cultural adaptation process resulted in an equivalent version of TOSQ in Swedish. The Rasch model, however, revealed that TOSQ-S had a considerable number of misfitted items and respondents, but these limitations were largely remedied in the scale optimization procedure. A six-item subconfiguration was able to reliably distinguish between patients of high, average, and low treatment outcome satisfaction, met the criteria for an acceptable proportion of misfitted items, and had slightly more than 5% misfitted persons. The sample's average satisfaction level was high relative to the items' satisfaction requirement, suggesting that TOSQ-S may be mistargeted to the intended population; however, no ceiling effect was observed (Fig. 3). The skewness towards high sample satisfaction could be a consequence of treatments being conducted at specialist clinics, and the person-item relationship could, therefore, change considerably at general outpatient clinics. Consequently, although our sample size met the recommendation for polytomous Rasch models [28], the three lowest categories had very few observations, which can explain the misfit of one of these categories to the model. The categorical structure is, therefore, a possible area of improvement, but was kept unchanged pending further investigation. Unscaled item measure standard errors below 0.2 logits, however, supported that item measures had a precision of 0.5 logits with 95% confidence [28].
Fig. 3 A is based on a Rasch model of all ten TOSQ-S items and B on the final six-item TOSQ-S configuration. M mean, SD standard deviation
On average, the six-item TOSQ-S summary score had acceptable reliability, with ICC suggesting good reliability and SEM corresponding to 7% of the person measure range. Whereas equivalent numbers of patients increased and decreased in satisfaction between trials, one patient displayed a systematic decrease across most items, suggesting a true decrease in satisfaction in this instance. The considerably higher ICC estimate following exclusion of this patient was likely a result of ICC's dependence on sample variance, as SEM was only marginally affected [26]. Hence, ICC estimates excluding the patient may be more accurate.
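The dependence of ICC on sample variance can be illustrated with a minimal sketch, which is not the software used in the study. It assumes the usual mean-square formulation of the single-measurement, two-way random effects, absolute agreement ICC, and takes SEM as the square root of the within-subject mean square. The function name and both data sets are hypothetical: the two samples share identical retest errors and differ only in how far apart the persons are.

```python
import math

def icc_2_1_and_sem(scores):
    """scores: one row per person, one column per trial (person measures)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between persons
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between trials
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    # Within-subject SD: trial and residual variation pooled within persons.
    sem = math.sqrt((ss_cols + ss_err) / (n * (k - 1)))
    return icc, sem

# Hypothetical data: same retest errors, different spread between persons.
spread = [-3, -2, -1, 0, 1, 2, 3]
errors = [2, -2, 1, -1, 2, -2, 0]
narrow = [[40 + 2 * z + e, 40 + 2 * z - e] for z, e in zip(spread, errors)]
wide = [[40 + 8 * z + e, 40 + 8 * z - e] for z, e in zip(spread, errors)]

icc_narrow, sem_narrow = icc_2_1_and_sem(narrow)
icc_wide, sem_wide = icc_2_1_and_sem(wide)
print(f"narrow sample: ICC = {icc_narrow:.2f}, SEM = {sem_narrow:.2f}")
print(f"wide sample:   ICC = {icc_wide:.2f}, SEM = {sem_wide:.2f}")
```

In this constructed example SEM is identical in the two samples, while ICC is higher in the sample with the larger between-person spread. This is why a shift in ICC that is not accompanied by a comparable shift in SEM points to a change in variance composition rather than in measurement precision.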
Interestingly, item 1, which assessed overall treatment satisfaction, had the lowest satisfaction requirement, while item 9, which inquired about how satisfied patients would be living the rest of their lives with the current low-back symptoms, had the highest satisfaction requirement; each was separated by more than one standard deviation from adjacent items (Fig. 3a). This suggests that global items inquiring about current satisfaction overestimate the satisfaction level and thereby may not provide representative satisfaction estimates. In contrast, items inquiring about the long-term perspective appear to be conservative in their estimates. Thus, while item 9 was discarded in the final Rasch model, it may nevertheless be valuable as a proxy for treatment success from the patient's perspective, reflecting either complete recovery from symptoms or incomplete recovery paired with symptom acceptance.
Patient satisfaction has been recommended as an outcome measure for LBP interventions [4,5], and as no such instruments currently exist in Swedish, TOSQ-S can suitably fill the void, both as a measure of health care quality and of the patient's perception of treatment outcome. However, while the initial design of TOSQ intended for a summary score of the specific items [15], the Rasch analysis suggested that another configuration is more suitable for TOSQ-S. We, therefore, recommend that the TOSQ-S summary score be derived from the six-item configuration, and that the other items be retained for their descriptive value. The permissive inclusion criteria, combined with the indiscriminate consecutive patient recruitment via referral from health care clinics in areas of different socioeconomic status, facilitate extrapolation of the results. Nonetheless, extrapolation is limited to the Swedish-speaking urban population.

Conclusions
A six-item configuration of the Swedish version of the Treatment Outcome Satisfaction Questionnaire showed acceptable psychometric properties and is suitable for measuring treatment outcome satisfaction of physiotherapy in patients with LBP.