Background

In the general population, quality of life (QoL) is measured across countries to indicate the state and development of societies, for example in the annual Eurobarometer of the European Commission [1] or the World Values Survey [2]. National levels of QoL have been found to be related to wealth, human rights, individualism, and the fulfillment of basic biological needs in a given society [3, 4]. Measuring QoL of individuals with certain health conditions provides information about health states beyond diagnosis, about the impact of a disease and its treatment on different domains of daily life, and about the health experience from the "insider" perspective of the affected persons themselves [5, 6]. In relation to health, QoL is measured across countries to compare the burden of disease and disability in different populations. However, QoL is not restricted to health-related issues.

The notion of QoL in general covers various concepts, including health-related quality of life (HRQoL) and subjective well-being (SWB) [7]. HRQoL, on the one hand, describes difficulties caused by poor health in mental and physical functioning, task performance, participation in life areas, or "health status" [8, 9]. SWB, on the other hand, includes overall life satisfaction, satisfaction with life domains, and positive and negative affect [10]. Life satisfaction is traditionally viewed as a cognitive, needs-based approach towards QoL. It refers to the individual's personal evaluation of the gap between his or her aspirations and achievements. More recently, a cognitive-affective conceptualization of satisfaction has also been discussed [10, 11].

Essentially, life satisfaction is related to the subjective "insider" perspective and is increasingly considered as a meaningful and efficient way to collect information about QoL [12, 13]. Assessing QoL of individuals in health services provision and research complements measurement that is based on performance, and adds relevant information for treatment decision-making and outcome evaluation [6, 14].

QoL of persons who have sustained a spinal cord injury (SCI) seems to be diminished compared to the general population [15, 16]. QoL appears not to be directly related to the severity of SCI [16, 17], but it is related to perceived health, participation and integration, to social support and relationships, as well as to living circumstances, e.g. accessibility or income [15, 17].

Several reviews have summarized the application and metric properties of QoL measures in SCI [16, 18-20]. Among the various instruments with promising properties were also short scales, such as the Satisfaction with Life Scale (SWLS) [21], which is part of the United States SCI Model Systems [22], the Life Satisfaction Questionnaire (LISAT) [23], or the World Health Organization Quality of Life Assessment (WHOQOL-BREF) [24].

QoL in persons with SCI has been found to differ across countries [25, 26]. Such differences may be related to country level factors (e.g. culture and values), to internal and external individual level factors (e.g. personality, self-esteem or social support), as well as their interactions (e.g. social desirability) [27]. However, differences found in these studies may also reflect the properties of the measurement instruments used.

The comparability of measurement results between countries depends on the cross-cultural validity of the applied instruments [28]. Common steps in various guidelines for cross-cultural adaptation of QoL instruments include systematic translation procedures and cross-cultural testing of psychometric properties [29]. There have been efforts to develop and/or validate QoL instruments cross-culturally (e.g. the WHOQoL-development or the International Quality of Life Assessment project) [30, 31]. However, the cross-cultural validity and international comparability of QoL measurement are not well established in SCI.

Psychometric properties, such as reliability and validity, can be examined using different techniques. Currently, Rasch-based methods are becoming increasingly popular in the context of rehabilitation outcome measurement [32]. They are used to create interval scale measurement and can reveal metric difficulties of the measures, but they also provide techniques to account for these difficulties at a statistical level in certain circumstances, for example by item reduction, collapsing response scale options, or splitting items. Thus, Rasch-based methods have also been used to examine and account for cross-cultural bias in outcome measures [33, 34].

The objective of this study is to examine the cross-cultural validity of selected QoL scales across countries in a sample of persons with SCI using Rasch analysis. The specific aims are (1) to examine and compare measurement properties of the instruments, namely, dimensionality, response scale structure, and reliability; (2) to examine the validity of the instruments across countries; and (3) to examine possibilities to enhance the measurement properties and the cross-cultural validity of the instruments.

Methods

Design and setting

This cross-sectional multi-centre study was conducted as a nested project within the international collaborative development of the "ICF Core Sets for Spinal Cord Injury" [35, 36]. For the current analyses, data from participating study centers in Australia, Brazil, Canada, Israel, South Africa, and the United States are used.

Participants and data collection

Subjects were recruited through the six participating rehabilitation facilities. Patients were eligible if they had sustained a SCI with an acute onset and were at least 18 years old. Acute onset was defined as a trauma or non-traumatic event resulting in spinal cord dysfunction within 14 days of onset. Subjects with significant traumatic brain injury or diagnosed mental disorders prior to SCI were excluded. Prior to data collection, participants were informed about the purpose of the study and signed an informed consent form.

For the purpose of the analyses presented in this paper data from outpatients were selected. In four of the participating centers data were also collected for inpatients. Overall, 109 inpatient data sets were available; however, 76% of these were from one country only (Israel). Thus, to avoid confounding of country with care setting, and to obtain a more homogeneous data set for the cross-country comparisons, the inpatient data were omitted.

The data collection included, besides socio-demographic and injury related variables, four QoL measures: the Satisfaction with Life Scale (SWLS) [21], the Life Satisfaction Questionnaire-9 (LISAT-9) [23], the Personal Well-Being Index (PWI) [37] and five satisfaction items from the World Health Organization Quality of Life Assessment (WHOQOL-5) [24, 38]. For the data collection, instruments were selected that include fewer than 10 items, focus on the concepts of life and domain satisfaction, and contain items that are applicable and not offensive to people with SCI (i.e. do not contain items on walking, kneeling, bending, etc.). In addition, psychometric properties and the availability of different language versions were considered. Short questionnaires are more feasible and acceptable, and impose less burden on the patients compared to longer instruments. They can be more easily embedded into routine clinical assessments or larger scale data collection schemes. Instruments were chosen with a focus on the aspect of satisfaction within the broader notion of QoL, as satisfaction is not only conceptually well-defined, but has also been traditionally considered as a clinically relevant person-centered outcome in rehabilitation [39].

In Australia, Canada, South Africa, and the United States the English versions of the instruments were used. For the SWLS and the WHOQOL, Portuguese (Brazil) and Hebrew (Israel) versions also exist. However, for the LISAT and the PWI translations were not available in Brazil and Israel. In these cases, translations of the English version were prepared at the participating facilities.

Satisfaction with Life Scale

The SWLS was designed to assess global life satisfaction. It addresses the cognitive evaluation of one's own life in terms of ideal life, wish for change, and current and past satisfaction. The SWLS consists of five items with a 7-point Likert-scale from "strongly disagree" to "strongly agree". Reliability and validity of the scale have been examined in several studies [21, 40, 41], also for various translations and in different countries [42, 43]. The SWLS has been used in cross-country studies in the general and student populations [27] and is also widely used in SCI research, especially in the United States [22, 44-49]. Internal consistency coefficients range between .79 and .89 [40] and several studies confirmed the single factor structure of the SWLS [21, 41-43, 50]. However, studies in SCI have scarcely reported on the psychometric properties of the instrument [47]. Two studies comparing general population samples in the United States and Russia [51] and in Norway and Greenland [52], respectively, hinted at potential cross-cultural biases affecting the interpretation of the SWLS.

Life Satisfaction Questionnaire

The LISAT-9 is a measure of domain-specific life satisfaction. It consists of nine items including one on general life satisfaction and eight domain-specific items (self-care, vocational, financial, leisure situation, sexual life, partner relationship, family life, social contacts). Responses are rated along a 6-point scale from "very dissatisfying" to "very satisfying". Among the psychometric properties of the LISAT, internal consistency and factorial structure are reported in the literature [23, 53, 54]. A 3-factor structure has been shown for the LISAT-9 and a 4-factor structure for the LISAT-11, with internal consistency reliability of the factors between .57 and .79 (overall .85) [23, 53]. Thus, analyses using the LISAT are frequently done item-wise, but also using the mean or median of the scores. The instrument has been widely used in SCI research, mainly in Europe [25, 54-59]. However, little is known about the measurement properties of the LISAT in non-European countries, and only a few studies have addressed the psychometric properties of the LISAT in the SCI population [54]. The LISAT has also been used to compare SCI samples across countries (Sweden and Japan; China and UK; UK, Germany, Austria, and Switzerland), however, without considering potential cross-cultural validity issues [25, 26, 58].

Personal Well-Being Index

The PWI consists of 7 items about satisfaction with specific life domains (living standard, health, achievement, relationships, safety, community, future security) and one optional item about overall life satisfaction. Responses are provided on a 0-10 numeric rating scale with the end points "completely dissatisfied" and "completely satisfied". The PWI was developed in Australia for use in national surveys [60] and has been adapted for international use [37]. Validity and reliability of the PWI have been demonstrated in general population samples from different countries [37, 60-62]. The PWI has been designed as a unidimensional tool with internal consistencies between .70 and .85. Although already used in various countries (Australia, Hong Kong/China, Algeria), a rigorous examination of cross-cultural validity has not yet been conducted. The PWI has not been used with persons with SCI so far.

World Health Organization Quality of Life Assessment-5

The WHOQOL-5 is a selection of five satisfaction items out of the World Health Organization's short health-related quality of life measure, the WHOQOL-BREF. The 5 items cover overall quality of life, satisfaction with health, daily activities, relationships, and living conditions. The WHOQOL and WHOQOL-BREF were specifically developed for cross-cultural use and are currently available in 36 languages. Psychometric properties have been examined in 23 countries with samples of sick and healthy persons [24, 38, 63], with internal consistency coefficients lying between .75 and .87. The WHOQOL-BREF has also been applied in people with SCI [64, 65]. A selection of 8 items out of the WHOQOL-BREF (including the 5 items in this study) was used in the EUROHIS project across 10 European countries and showed satisfactory psychometric properties, unidimensionality and cross-cultural validity [66, 67]. The 5-item version has been used in different international WHO collaboration projects since 2002 [35, 68, 69], but has not been psychometrically tested previously in this format.

Ethics committee approval

The study was carried out in compliance with the Helsinki Declaration. The design and materials were approved by the Ethics Committee of the Ludwig-Maximilian University Munich, as well as by the respective Ethics Committees for the study centers in each world region.

Rasch Analyses

Rasch analyses were carried out using the RUMM software [70] and applying the partial credit Rasch model [71]. This model belongs to the family of one-parameter Rasch models and extends the dichotomous Rasch model to items with ordered polytomous response categories. In the field of Rasch-based or item response modeling further types of models exist, e.g. two- or three-parameter item response models, nonparametric Mokken analyses, or mixed Rasch models. The use of these models might result in better fit of the data, as they consider varying item difficulty curves, varying homogeneity or monotonicity of the data, or multiple latent classes within the sample populations. However, the one-parameter Rasch model is especially helpful for developing precise and accurate measurement instruments, as it imposes strict requirements on the items and is not data-driven. Through its mathematical formulation, it can ensure fundamental measurement in the tradition of Guttman's work within a probabilistic framework [72, 73].

In this type of Rasch analysis, three types of parameters are estimated: the person parameters (for the patients), the item parameters, and the parameters of the thresholds of the response scale (e.g. four threshold parameters for a 5-point Likert-scale). These parameters describe the position of the persons, items and thresholds on the unidimensional continuum of the measured latent trait (e.g., low to high quality of life).
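To make the role of the person and threshold parameters concrete, the following minimal sketch computes the response category probabilities of a single item under the partial credit model. It is an illustration only and not the estimation routine of the RUMM software; the function name, the example person location, and the threshold values are hypothetical, and the item location is assumed to be already absorbed into the threshold locations.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Return P(X = 0), ..., P(X = m) for one item under the partial credit model.

    theta      -- person location on the latent trait (logits)
    thresholds -- threshold locations of the item (logits), one per step;
                  assumption: the item location is already included in them
    """
    # Cumulative sums of (theta - threshold); the term for score 0 is 0 by convention.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 5-point item with four ordered thresholds and a person at 0 logits.
probs = pcm_category_probabilities(theta=0.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
for score, p in enumerate(probs):
    print(f"P(X = {score}) = {p:.3f}")
```

With a single threshold the function reduces to the dichotomous Rasch model, so a person located exactly at the threshold has a probability of 0.5 for either response.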

First, the unidimensionality of each instrument was examined. Unidimensionality describes the idea that items should contribute to the measurement of only one attribute at a time and should not be confounded by other attributes or dimensions [73]. This ensures the interpretability of the summary scores of the instrument. Unidimensionality can be checked by comparing the observed responses in a set of items to the expected values predicted by the unidimensional Rasch model [74]. The fit of each item is indicated by standardized residuals (z values) and Chi2 test results. Z values exceeding +/-2.5 are considered to indicate misfit to the Rasch model [74]. For the Chi2 significance tests, a Bonferroni-corrected critical p-value based on the 5% level [75] was applied.
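The item-level decision rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: the Bonferroni correction is assumed to divide the 5% level by the number of items in the instrument, and the item names and fit values are invented for the example rather than taken from the study.

```python
def bonferroni_alpha(alpha, n_tests):
    """Critical p-value after Bonferroni correction for n_tests comparisons."""
    return alpha / n_tests


def misfitting_items(fit_stats, alpha=0.05, z_limit=2.5):
    """Flag items whose fit residual or Chi2 p-value indicates misfit.

    fit_stats -- dict mapping item name to (standardized residual, Chi2 p-value)
    """
    # Assumption: one Chi2 test per item, so the number of tests equals the number of items.
    critical_p = bonferroni_alpha(alpha, len(fit_stats))
    flagged = []
    for item, (z_value, p_value) in fit_stats.items():
        if abs(z_value) > z_limit or p_value < critical_p:
            flagged.append(item)
    return flagged


# Hypothetical fit statistics for a 5-item instrument (not study data).
example_fit = {
    "item1": (0.8, 0.40),
    "item2": (3.1, 0.20),    # residual exceeds +/-2.5
    "item3": (-1.2, 0.004),  # p-value below the corrected level of 0.01
    "item4": (1.9, 0.03),    # significant at 5% but not after correction
    "item5": (-2.7, 0.50),   # residual exceeds +/-2.5
}
flagged = misfitting_items(example_fit)
print(flagged)
```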

To further examine unidimensionality, principal components analyses (PCA) of the residuals not explained by the Rasch-model were performed. The residuals should show a random pattern to indicate unidimensionality [76]. Given the sample size in this study, eigenvalues below 1.9 in the PCA results are indicative of random residual variation, whereas eigenvalues above 1.9 indicate some structure in the residuals [77]. In addition, the Rasch person parameters of each patient were estimated separately for the items with positive versus negative loadings on the first PCA factor, and then compared using independent t-tests. The percentage of significant t-tests (α = 0.05) should not exceed 5% [78, 79].
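The t-test approach can be illustrated with a short sketch. It assumes that, for every person, two estimates with their standard errors are available (one from each item subset) and that the two estimates are compared with a t statistic built from the pooled standard errors, using 1.96 as an approximate critical value for α = 0.05. The function name and the example values are hypothetical.

```python
import math


def share_of_significant_t_tests(estimates_a, estimates_b, critical_t=1.96):
    """Proportion of persons whose two subset-based estimates differ significantly.

    estimates_a, estimates_b -- lists of (person estimate, standard error) tuples,
                                one entry per person, in the same order
    """
    significant = 0
    for (theta_a, se_a), (theta_b, se_b) in zip(estimates_a, estimates_b):
        # Difference of the two estimates relative to their combined standard error.
        t_value = (theta_a - theta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
        if abs(t_value) > critical_t:
            significant += 1
    return significant / len(estimates_a)


# Hypothetical estimates for four persons (not study data).
positive_loading_items = [(0.5, 0.4), (1.0, 0.5), (-0.2, 0.4), (2.0, 0.3)]
negative_loading_items = [(0.4, 0.4), (0.8, 0.5), (0.0, 0.4), (0.5, 0.3)]
share = share_of_significant_t_tests(positive_loading_items, negative_loading_items)
print(f"{share:.0%} of the t-tests are significant")
```

In this invented example only the fourth person shows a significant difference, so the share is 25%, which would exceed the 5% criterion.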

The structure of the response scale for each instrument was studied based on the ordering of the threshold parameters. The threshold parameters should take increasing values, as they represent the successive transition points along the response scale from low to high quality of life. Reversed thresholds show that the scores do not differentiate as intended [80].
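Checking the ordering of thresholds amounts to verifying that each threshold estimate is larger than the preceding one. A minimal sketch, with hypothetical item names and threshold values:

```python
def disordered_items(item_thresholds):
    """Return the items whose threshold estimates are not strictly increasing.

    item_thresholds -- dict mapping item name to its list of threshold estimates,
                       given in the order of the response scale
    """
    flagged = []
    for item, taus in item_thresholds.items():
        # A threshold that is not larger than its predecessor is reversed.
        if any(later <= earlier for earlier, later in zip(taus, taus[1:])):
            flagged.append(item)
    return flagged


# Hypothetical threshold estimates in logits (not study data).
example_thresholds = {
    "item1": [-2.1, -0.7, 0.6, 1.9],   # ordered
    "item2": [-1.4, 0.3, -0.2, 1.5],   # second and third thresholds reversed
}
print(disordered_items(example_thresholds))
```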

Reliability is indicated by the person reliability index, which is the Rasch-based counterpart of Cronbach's alpha [71, 81]. The person reliability index is constructed using the person parameter estimates and the standard errors of measurement to calculate the ratio of true person ability variance to the observed variance [74, 82]. It ranges between 0 and 1, where the value of 1 indicates perfect reproducibility of person placements on the latent continuum.
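The ratio described above can be sketched as follows. The sketch assumes that the true variance is approximated by the observed variance of the person estimates minus the mean squared standard error; the exact computation in the software used may differ, and the example values are hypothetical.

```python
from statistics import mean, variance


def person_reliability_index(estimates, standard_errors):
    """Ratio of estimated true person variance to observed person variance.

    estimates       -- person parameter estimates (logits)
    standard_errors -- standard errors of those estimates, in the same order
    """
    observed_var = variance(estimates)  # sample variance of the estimates
    error_var = mean([se ** 2 for se in standard_errors])
    # Assumption: true variance = observed variance - error variance, floored at 0.
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var


# Hypothetical estimates for five persons (not study data).
index = person_reliability_index(
    estimates=[-2.0, -1.0, 0.0, 1.0, 2.0],
    standard_errors=[0.5, 0.5, 0.5, 0.5, 0.5],
)
print(f"Person reliability index: {index:.2f}")
```

Here the observed variance is 2.5 and the error variance is 0.25, which gives an index of 0.90.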

To examine the cross-cultural validity of the four instruments across countries, differential item functioning (DIF) analyses were conducted [33]. Potential DIF is ascertained for each item by comparing the standardized residuals between the countries and across the latent trait continuum of QoL using a two-way analysis of variance (ANOVA). A significant main effect of the country (uniform DIF) or a significant interaction effect in the ANOVA results (Country × QoL; non-uniform DIF) indicates problems with the cross-country comparability of the responses. If no DIF is apparent, the scores are comparable across countries. A respective Bonferroni-corrected type I error level was applied [75]. Tukey-Kramer post-hoc tests allowed identifying the countries that contribute to DIF in the data.
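The logic of the uniform DIF test can be illustrated with a deliberately reduced sketch. It computes only a one-way ANOVA F statistic for the country effect on the standardized residuals of a single item, and therefore ignores the second factor (the position on the latent trait) and the interaction term of the two-way design described above. Group labels and residual values are hypothetical.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """One-way ANOVA F statistic for a grouping factor.

    groups -- dict mapping group label (e.g. country) to the list of
              standardized residuals of one item in that group
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    n_groups = len(groups)
    n_total = len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Hypothetical residuals for one item in two countries (not study data).
residuals_by_country = {
    "country_A": [-1.0, 0.0, 1.0],
    "country_B": [1.0, 2.0, 3.0],
}
f_value = one_way_f_statistic(residuals_by_country)
print(f"F = {f_value:.2f}")
```

A large F value relative to the F distribution with the corresponding degrees of freedom would point to uniform DIF for that item; obtaining the p-value and comparing it with the Bonferroni-corrected level is left out of this sketch.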

Based on the results of Rasch analyses different approaches can be taken to account for weaknesses in the metric properties of the instruments post-hoc. To come up with suggestions to enhance the measurement properties and cross-cultural validity of the instruments across countries, four alternative strategies of handling the data set were tested and compared. As a result, for each instrument an optimal solution for handling the data could be identified, which allows for acceptable measurement properties with as little change to the instrument as possible. Figure 1 gives an overview of the four strategies implemented in the post-hoc analyses.

Figure 1 Overview of the four Rasch-based strategies applied to account for the weaknesses in the metric properties of the four quality of life instruments post-hoc.

In the first strategy, response scale disorder was addressed first. Disordered response categories were collapsed, i.e. adjacent response options were merged and the scores recoded for all items of the instrument if more than half of the items showed disorder [80]. In addition, items that still misfitted after the collapsing were deleted using a step-wise top-down deletion strategy until the remaining items fit the model [83].
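Collapsing response categories is essentially a recoding step. The sketch below recodes raw responses with a mapping in which adjacent options share a new score; the mapping shown (a 6-point scale reduced to four categories) and the helper names are hypothetical illustrations, not the recoding scheme of any specific instrument.

```python
def is_valid_collapse(mapping):
    """Check that a recoding only merges adjacent categories and keeps their order."""
    new_scores = [mapping[old] for old in sorted(mapping)]
    # Each step along the original scale may keep the new score or raise it by one.
    return all(0 <= later - earlier <= 1
               for earlier, later in zip(new_scores, new_scores[1:]))


def collapse_categories(responses, mapping):
    """Recode raw responses; missing responses (None) stay missing."""
    return [None if r is None else mapping[r] for r in responses]


# Hypothetical recoding of a 6-point scale: options 2, 3 and 4 are merged.
six_to_four = {1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 3}
raw = [1, 3, None, 4, 6, 5, 2]
recoded = collapse_categories(raw, six_to_four)
print(is_valid_collapse(six_to_four), recoded)
```

After recoding, the model would be re-estimated to check whether the thresholds of the collapsed scale are ordered.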

In the second strategy, item misfit was attended to first by using the step-wise top-down deletion strategy, and the remaining fitting items were then checked again for response scale disorder.

The third strategy focused on accounting for DIF. So-called subtest analyses were conducted, which were used to merge the scores of those items that display DIF for country. Thereby, if two items of an instrument show DIF but in opposite directions, they can be combined into one score in which the opposing biases cancel out across countries. The advantage of this strategy, if it is successful in ameliorating DIF, is that no changes to the items are necessary and the summary score of the instrument can be interpreted as comparable across countries.
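At the data level, a subtest can be thought of as a new "super-item" whose score is the sum of the merged items. The following sketch shows only this recoding step, assuming one dictionary of item scores per person; the item names are hypothetical, and the subsequent re-estimation of the Rasch model on the merged data is not shown.

```python
def merge_items(records, items_to_merge, merged_name):
    """Replace several items by one subtest whose score is their sum.

    records        -- list of dicts mapping item name to score (None = missing)
    items_to_merge -- names of the items to be combined
    merged_name    -- name of the resulting subtest
    """
    merged_records = []
    for record in records:
        parts = [record[item] for item in items_to_merge]
        new_record = {k: v for k, v in record.items() if k not in items_to_merge}
        # The subtest score is missing if any of its component items is missing.
        new_record[merged_name] = None if any(p is None for p in parts) else sum(parts)
        merged_records.append(new_record)
    return merged_records


# Hypothetical responses of two persons to three items (not study data).
data = [
    {"health": 3, "safety": 4, "community": 2},
    {"health": 1, "safety": None, "community": 5},
]
merged = merge_items(data, ["health", "safety"], "health_safety")
print(merged)
```

Because the subtest score is a plain sum, the total score of a person with complete data is unchanged by the merging, which is why the instrument itself does not need to be altered.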

The fourth strategy also addressed DIF, but applied the subtest analyses to either option one or option two, depending on which of the two represented the most effective strategy for the instrument so far (i.e. enhanced statistics with less change).

Strategies one to three were calculated for all four instruments (according to the properties in the basic analyses), and after each step, the overall and item fit, DIF, response scale ordering, and reliability were documented. The fourth strategy was only applied if the first three did not result in acceptable metric properties.

The efficiency of the different strategies was determined by the metric properties on the one hand and the modifications to the instrument on the other. The metric properties were considered hierarchically in terms of desirability: item and overall fit were considered the most important criteria to be fulfilled first, DIF the second, and response scale ordering the third criterion. Regarding the modifications to the instruments, the merging strategy was considered the least invasive strategy, as it does not require changes to the items or the response scale. Collapsing of response options was considered the second least invasive strategy, as it requires the recoding of responses, but no changes to the items. Deletion of items was considered an invasive strategy, as it alters the instrument from its original version.

Thus, if for example strategies one to three all resulted in acceptable metric properties in terms of fit, DIF, and response scale ordering, then the merging strategy three would be preferred as the optimum solution, as it is the least invasive.
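The preference rule in this example can be written down as a small decision function. This is a strong simplification made for illustration: it treats the metric criteria as a simple pass/fail filter rather than as a hierarchy, ranks a strategy by its most invasive modification, and uses invented field names and invasiveness ranks; the actual choice in a study may involve further judgment.

```python
# Assumed ranking of modifications, from least to most invasive.
INVASIVENESS = {"merge": 1, "collapse": 2, "delete": 3}


def preferred_strategy(candidates):
    """Pick the least invasive strategy among those with acceptable metric properties.

    candidates -- list of dicts with the keys "name", "fit", "no_dif", "ordered"
                  (booleans) and "modifications" (list of modification types)
    """
    acceptable = [c for c in candidates if c["fit"] and c["no_dif"] and c["ordered"]]
    if not acceptable:
        return None

    def invasiveness(candidate):
        ranks = [INVASIVENESS[m] for m in candidate["modifications"]]
        # Compare first by the most invasive modification, then by their number.
        return (max(ranks), len(ranks))

    return min(acceptable, key=invasiveness)["name"]


# Hypothetical outcome in which strategies one to three are all acceptable.
candidates = [
    {"name": "strategy 1", "fit": True, "no_dif": True, "ordered": True,
     "modifications": ["collapse", "delete"]},
    {"name": "strategy 2", "fit": True, "no_dif": True, "ordered": True,
     "modifications": ["delete"]},
    {"name": "strategy 3", "fit": True, "no_dif": True, "ordered": True,
     "modifications": ["merge"]},
]
print(preferred_strategy(candidates))
```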

Results

Overall, 243 outpatients with SCI from six countries and four different world regions were included in the study. Table 1 shows the socio-demographic and SCI-related characteristics of the study sample. Table 2 shows the mean raw scores, respective standard deviations, and the number of valid responses in the four instruments overall, per item, and per country.

Table 1 Socio-demographic and spinal cord injury related patient characteristics (N = 243)
Table 2 Raw scores for the four instruments overall and by country

Statistics for the examined measurement properties of the 4 instruments are documented in Table 3. The SWLS showed overall misfit to the Rasch model according to the significant Chi2 test and the PCA eigenvalue. At the item level, 3 out of 5 items showed misfit to the model. In terms of response scale structure, 3 out of 5 items had disordered thresholds. Reliability was high with a value of 0.88.

Table 3 Rasch-based fit statistics, ordering of the response scale thresholds, and reliability (n = 243)

For the LISAT-9, the overall fit statistics (i.e. Chi2 test, PCA eigenvalue, and independent t-test approach) consistently contradicted the assumption of unidimensionality. At item level, 3 items out of 9 showed misfit to the Rasch model. In 5 items the response scale thresholds were disordered. The person reliability index was high with a value of 0.86.

For the PWI the Chi2 statistics suggested unidimensionality overall as well as for the individual items. However, the eigenvalue and the t-test approach questioned the assumption of unidimensionality of the instrument. The response scale thresholds were all ordered with the exception of 1 item out of the 8. Reliability was high with a value of 0.92.

For the WHOQoL-5 all overall statistics confirmed unidimensionality, but one of the items misfitted the model according to the significant Chi2 test result. All response scale thresholds were ordered and reliability was within an acceptable range with a value of 0.78.

The results of the DIF analyses to examine the cross-cultural validity of the 4 instruments are displayed in Table 4. Uniform DIF across countries was found in two items each of the SWLS and the WHOQoL-5, three items of the LISAT-9 and four items of the PWI. Non-uniform DIF was found only in the item "Leisure situation" of the LISAT-9 (data not shown). For the SWLS and the LISAT-9, the data from Israel most frequently showed significant differences from the other countries. For the PWI, the data from Australia and Canada most frequently showed significant differences from the other countries. For the WHOQoL-5 this was the case for the data from Canada (data for post-hoc tests not shown).

Table 4 DIF across countries prior to and after applying the post-hoc strategies (n = 243)

Table 5 shows the statistics about instrument and item fit, response scale structure, and reliability for the 4 different strategies applied to enhance the measurement properties and the cross-cultural validity of the 4 instruments. Also, Table 4 contains the results of the final check for DIF after having identified the optimal option for handling the data.

Table 5 Rasch-based statistics for the different strategies applied to enhance the metric properties of the instruments (n = 243)

Strategy 2 was regarded as the optimum choice for the SWLS. Two misfitting items were deleted using the step-wise data purification procedure. With this handling of the data, item fit and response scale order were achieved, and no DIF was apparent.

Strategy 4 was regarded as the optimum choice for handling the data for the LISAT-9. Only after collapsing the response options, deleting two misfitting items and merging another two items with DIF did all the remaining items fit, were the response scale thresholds ordered (with one exception), and was DIF no longer present.

Strategy 3 appeared to be the optimum choice for the PWI. The scores of the four items that displayed DIF prior to applying any post-hoc strategies were merged into two items, which led to no item misfit and no response scale disorder. However, one of the merged items remained inconsistent across countries and displayed DIF.

Strategy 3 was also the optimum choice for the WHOQoL-5. After merging the scores of those two items which initially indicated DIF, all items fitted the Rasch model, the response scale thresholds were ordered, and no DIF was found.

Discussion

The study examined the metric properties of the Satisfaction with Life Scale (SWLS), the Life Satisfaction Questionnaire (LISAT), the Personal Well-Being Index (PWI) and the 5-item World Health Organization Quality of Life Assessment (WHOQoL-5) in a cross-country sample of persons with SCI based on Rasch analysis. Although all instruments displayed metric problems in the analyses and showed cross-country bias at first, it was possible to identify post-hoc strategies to ameliorate those problems. Such strategies can also be used in further studies to enhance the metric comparability of data across countries. The two instruments which performed best overall in this comparison in terms of reliability, dimensionality, response scale structure, and cross-cultural validity were the WHOQoL-5 and the PWI, before as well as after applying the post-hoc strategies.

Reliability

In the current study, high values of the person reliability index were found for all four instruments. For the WHOQoL-5 and the SWLS, the person reliability index in our study was similar to alpha coefficients reported in the literature for different samples and countries, including persons with spinal cord injuries [40, 41, 43, 63, 65-67]. However, for the PWI and the LISAT-9, the reliability index was higher than reliability measures reported earlier [37, 53, 54, 61, 62]. The person reliability index is the Rasch-based counterpart of Cronbach's alpha. In this study, alpha coefficients could not be calculated because of missing data. Rasch analysis, however, not only deals readily with missing data [84], but in general the person reliability index can also have the advantage of being a more conservative estimate of reliability under certain circumstances, e.g. when alpha may be inflated due to the number of items or the sample variance [85].

Dimensionality

In line with an earlier study using structural equation modeling [67], unidimensionality can be assumed for the WHOQoL-5. For the PWI, previous studies indicated unidimensionality, which is partially supported by the statistics in this analysis [60, 62]. Unlike previous authors [37], we included the first overall item in the analyses; in the item-wise examination, this overall item fitted the model along with the domain-specific items.

The assumption of unidimensionality was rejected for the LISAT-9 and the SWLS. Earlier studies, as well as the findings presented here, suggest that more than one dimension is assessed by the LISAT [23, 53]. In this study, unidimensionality of the remaining items was established after deleting the two items "partner relations" and "family life". The item "partner relations" had far more missing data than any of the other items (see Table 2), which might have caused metric irregularities. However, the standard error of the estimates was not larger compared to the other items, indicating acceptable precision of estimation. A potential explanation of how these two items differ from the others could lie in the specific meaning of the items in the context of SCI and in the specific experiences of the affected persons. While the other items may be related to the experienced difficulties and problems in body functions, activities and participation imposed by SCI (e.g. difficulties in sexuality, less contact with friends), the partner and family life items may be related to the more stable, positive, and support providing relationships [86]. Thus, the difference between the separate dimensions identified in the statistical analyses might be interpreted conceptually as negative versus positive experience, or problems in own functioning versus support by others.

The results regarding the unidimensionality of the SWLS contradict the findings of several earlier studies, which demonstrated a single underlying dimension [21, 40-43, 47, 50]. In this study the last two items ("If I could live my life over, I would change almost nothing" and "So far I have gotten the important things I want in life") had to be removed before unidimensionality was achieved for the remaining three. A study from France using structural equation modeling found no support for the unidimensionality of the SWLS in a general population sample and the authors proposed treating the last two items separately [87]. They suggest that the semantic structure of those two items, which relate to the past, may explain the inconsistency among the items. In the current study the sample consisted of persons who have experienced a major life event in the past, namely SCI. One thing that persons with SCI might want to change in the past and might be strongly dissatisfied with is the SCI itself [47]. In the context of SCI, it could be hypothesized that the first items of the SWLS (related to present life satisfaction) might be connected to acceptance, and the last two items to grief and regret. In line with the suggestion of Vautier et al. (2004), these different connotations might explain the observed inconsistency and disjunction among the items within the instrument.

Response scale structure

Considering the response scale structure of the instruments, the results suggest that the 5-step scale of the WHOQOL-5 ("very dissatisfied", "dissatisfied", "neither satisfied nor dissatisfied", "satisfied", "very satisfied") and the 11-step numeric rating scale of the PWI ("completely dissatisfied" to "neutral" to "completely satisfied") have the expected ordering and that persons with SCI could differentiate between the steps consistently when responding to the items.

For the SWLS and the LISAT the response scale structure showed disorder in several items. For the SWLS, after removing the last two items for misfit, only one disordered item ("ideal life") remained. For the LISAT-9 the original 6-step rating scale was reduced to a 4-step solution in this study. The optimal solution in the post-hoc analyses appeared to be the merging of the response options "dissatisfying", "rather dissatisfying" and "rather satisfying". This merging of the response options parallels the cut-off used by Fugl-Meyer to dichotomize item scores (1-4 = unsatisfied/5-6 = satisfied), which places the "rather satisfying" option in the unsatisfied category [53]. Accordingly, future studies could test the metric properties and usefulness of a modified 4-step scale for the LISAT with a suggested structure of "very dissatisfying", "dissatisfying", "satisfying", and "very satisfying".

Cross-cultural validity

The current findings hint at potential cross-country bias in all four examined instruments, largely in line with existing research. In the case of the SWLS, two earlier studies using different methodologies found indications that the comparability and interpretability of the scores across countries is not consistent [51, 52], a finding that is now supported in an SCI sample.

Lau et al. (2005) found cross-cultural differences in the performance of the PWI between an Australian and a Hong Kong Chinese population and suggested that cultural response bias would be a plausible explanation for the differences [61]. Our results in SCI showed DIF for 4 of the PWI items across the 6 countries, and Australia, alongside Canada, was among the countries that showed the strongest deviation from the others. However, by merging the scores of those items which had DIF, the deviations proved to be balanced out. Thus, at the level of the summary score, cross-country comparability may be possible.
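The merging of item scores mentioned above can be sketched as follows, under the assumption that merging means summing the scores of the DIF-affected items into one combined item before re-analysis. The function name merge_dif_items and the item keys in the usage note are hypothetical placeholders and do not identify the PWI items that showed DIF in this study.

```python
def merge_dif_items(person, dif_items, merged_name="merged_dif"):
    """Sum the DIF-affected items of one respondent into a combined item.

    person:    dict mapping item name -> score for one respondent
    dif_items: names of the items to be merged (hypothetical here)
    Returns a new dict; the input is left unchanged.
    """
    missing = [item for item in dif_items if item not in person]
    if missing:
        raise KeyError(f"items not found in response record: {missing}")
    # Keep all items without DIF as they are.
    merged = {item: score for item, score in person.items()
              if item not in dif_items}
    # Replace the DIF items by their summed score.
    merged[merged_name] = sum(person[item] for item in dif_items)
    return merged
```

For a respondent with scores {"a": 7, "b": 5, "c": 8}, merging "a" and "b" gives {"c": 8, "merged_dif": 12}. The raw summary score is unchanged by this step; what changes is the item structure that enters the subsequent analysis.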

Schmidt et al. (2006) examined DIF for the Eurohis-QoL-8 instrument, which is a selection of 8 items out of the WHOQOL-BREF and which includes the 5 items used in this study [67]. They found acceptable cross-cultural properties in their instrument, which is in line with the findings here for the reduced 5-item version. Again, the minor deviation in the first DIF analyses could be alleviated by merging the two items "health" and "quality of life" to establish cross-country comparability of the summary score.

Although the LISAT has been used in cross-country studies [25, 26], those did not examine potential bias between the different language versions of the instrument. In this study, the post-hoc analyses showed that acceptable metric properties could only be achieved for the LISAT by applying the whole range of modification strategies, including the collapsing of response options, the deletion of items and the merging of item scores.

Limitations

The study is subject to several methodological limitations. The major drawback of the study is the low sample size in the individual countries. For this reason, certain statistical techniques for assessing psychometric characteristics and handling DIF could not be applied, e.g. the item-splitting method suggested by Tennant et al. [33]. However, the overall sample size was sufficient to reliably sustain the performed analyses [88]. According to Linacre (1994), a sample size of n = 250 is sufficient to achieve stable item parameters. In the current analyses, the stability of the parameters was high, as is evident from the small standard errors of the item parameters (SE = 0.04-0.09, see Table 3). Second, as the study included a convenience sample of persons with SCI, selection bias cannot be ruled out and the generalizability of the results may be compromised. Third, the quality of the Portuguese and Hebrew language versions of the questionnaires was not tested prior to their use in these data collections. Fourth, as a more recent development, the PWI includes a further item on spirituality, which was not yet included in the data collections for this study. Fifth, in these analyses, only basic psychometric characteristics (i.e. reliability, unidimensionality) were considered, but features like stability or sensitivity to change were not examined. Sixth, the DIF analyses only focused on potential cross-country biases, but were not extended to other factors that might influence the participants' responses, e.g. sociodemographic factors or depression. Finally, the post-hoc solutions shown in this study can be considered "optimum" only in the current sample, and in other studies the results may look different. However, we have shown that by using these strategies, data can be handled in a way that increases the confidence in the metric quality and interpretability of the data.

Conclusions

The Rasch analyses of the four quality of life instruments showed that, in an international SCI sample, the raw scores were initially not consistently comparable across countries. However, by accounting for DIF across countries in a way that the requirements of the Rasch model are met, the scores can become comparable. Following the post-hoc procedures, the items of the WHOQOL-5 and the PWI worked in a consistent and expected way in all countries. Thus, the differences between countries assessed by these instruments could potentially show cross-culturally valid differences in the responses of the persons. In contrast, summary scores of the LISAT-9 and the SWLS have to be interpreted with caution. The findings of the current study can be especially helpful for selecting instruments for international research projects in spinal cord injury.