Findings

Classical test theory (CTT) and item response theory (IRT) are the two most common approaches for testing the reliability and validity of quality of life instruments. IRT models offer more advantages than CTT methods [1, 2]. Whereas the CTT approach gives equal weight to all items in an instrument and focuses on summated scale scores, IRT models analyze the properties of each item individually in terms of the amount of information it provides on the underlying construct [3].

However, researchers using IRT models face several problems. These models rest on two crucial assumptions, unidimensionality and local independence, which must hold for the model parameters to be estimated. Moreover, model fit indices depend on a variety of factors, including the number of response options and the spread of responses across categories. IRT models also need large samples to guarantee accurate item parameter estimates [1, 2, 4].

The KIDSCREEN is an international instrument for measuring health-related quality of life (HRQoL) in children and adolescents, which has been simultaneously applied and evaluated in several European countries [5–7]. The structural validity of the KIDSCREEN-27 has been assessed in 13 European countries using CTT and IRT methods [5, 8]. Although these studies revealed that all the items fit the data well, none of them, apart from the handbook of the KIDSCREEN questionnaires [9], discussed the optimal number of response categories. The main objective of the current study was therefore to determine whether the adjacent response categories for each item in the Persian version of the KIDSCREEN-27 were located in the expected order. The partial credit model (PCM) was used to report the item properties and rating scale structure of the KIDSCREEN-27.

Methods

The target population was Iranian school children aged 8–18 years and their parents, who were selected by two-stage cluster random sampling from the four educational districts of Shiraz, southern Iran. Written informed consent was obtained from the participants prior to enrollment in the study. The study was approved by the ethical committee of our institution, Shiraz University of Medical Sciences. The Persian version of the KIDSCREEN-27, which was previously translated by the KIDSCREEN group, was filled in by 1083 school children (55.4% boys, 44.6% girls) and 1070 of their parents. The mean (± standard deviation) age of boys and girls was 13.65±2.11 and 12.7±2.65 years, respectively. The KIDSCREEN-27 encompasses 27 items divided into five domains: physical well-being (5 items), psychological well-being (7 items), autonomy and parent relation (7 items), social support and peers (4 items), and school environment (4 items). The participants responded to the items on a 5-point Likert scale from 1=never to 5=always or from 1=not at all to 5=extremely. For ease of interpretation, the rating scale categories of negatively worded items were reversed so that higher scores indicated better HRQoL.

Internal consistency for each domain was assessed by Cronbach’s alpha coefficient.
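As an illustration of the internal consistency calculation, the following is a minimal sketch of Cronbach's alpha for one domain. The function name `cronbach_alpha` and the response matrix are hypothetical and are not taken from the study data.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    # Sample variance of each item and of the summated scale score.
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical responses of 5 children to a 4-item domain (1-5 Likert scale).
example = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]
alpha = cronbach_alpha(example)
```

A domain would then be judged against the conventional cut-off of 0.7 by comparing `alpha` with that value.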

A correlation coefficient greater than 0.40 between an item and its own domain was considered adequate evidence of convergent validity. Discriminant validity was supported whenever the correlation between an item and its hypothesized domain was higher than its correlations with the other domains [10].
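The two criteria above can be sketched as follows. This is only an illustration on simulated data: the function `scaling_success`, the domain labels "A" and "B", and the choice to remove an item from its own domain score before correlating (a common overlap correction) are assumptions, not details reported in the text.

```python
import numpy as np

def scaling_success(items, domain_of_item, threshold=0.40):
    """Flag convergent and discriminant validity for each item.

    items: respondents x items matrix.
    domain_of_item: list of domain labels, one per item column.
    """
    items = np.asarray(items, dtype=float)
    labels = np.asarray(domain_of_item)
    domains = list(dict.fromkeys(domain_of_item))  # unique labels, order kept
    results = []
    for j in range(items.shape[1]):
        own_domain = domain_of_item[j]
        correlations = {}
        for d in domains:
            members = labels == d
            if d == own_domain:
                # Assumption: exclude the item from its own domain score.
                members[j] = False
            score = items[:, members].sum(axis=1)
            correlations[d] = np.corrcoef(items[:, j], score)[0, 1]
        own = correlations[own_domain]
        others = [r for d, r in correlations.items() if d != own_domain]
        results.append({
            "item": j,
            "convergent": bool(own > threshold),
            "discriminant": all(own > r for r in others),
        })
    return results

# Simulated 1-5 responses: three items per hypothetical domain.
rng = np.random.default_rng(0)
n = 200
latent_a = rng.normal(size=n)
latent_b = rng.normal(size=n)

def likert(latent):
    raw = 3 + latent + 0.3 * rng.normal(size=n)
    return np.clip(np.rint(raw), 1, 5)

items = np.column_stack(
    [likert(latent_a) for _ in range(3)] + [likert(latent_b) for _ in range(3)]
)
results = scaling_success(items, ["A"] * 3 + ["B"] * 3)
```

The scaling success rate of a domain is the percentage of its items for which both flags are true.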

The essential assumption of IRT models, unidimensionality, was examined using the Rasch PCM. Moreover, the PCM was used to assess item statistics and response-category functioning [11, 12]. Parameters for this model were estimated using the program WINSTEPS [13]. Two key indicators, the infit and outfit statistics, were used to evaluate whether all the items contributed effectively to their own domain. The acceptable range for both infit and outfit item statistics was 0.7 to 1.3, and values close to 1 were ideal [3]. Items with fit statistics below this range were considered redundant, whereas fit statistics above it indicated that the item may not be sufficiently related to the rest of the scale and that unidimensionality may not hold [3, 11]. Average measures, step calibrations and fit statistics were used to test whether the response categories behaved sufficiently well [3, 13]. Categories were considered misfitting if their infit or outfit statistics were greater than 1.5 or less than 0.5 [13]. For five categories, there are four step calibrations, which correspond to the locations on the domain at which participants become more likely to choose the higher rather than the lower of two adjacent responses (2 over 1, 3 over 2, 4 over 3, and 5 over 4). Average measures and step calibrations are expected to increase with increasing response categories. A violation of this pattern indicates that the response categories are disordered. In addition to the average measure and step calibration estimates, category fit indices and category probability curves (CPC) provide further information about the functioning of response categories. According to Linacre’s criteria [14], categories with an outfit greater than 2 were considered misfitting.
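To make these quantities concrete, the sketch below computes PCM category probabilities and the infit and outfit mean squares for a single item. It is an illustration under stated assumptions and not the estimation procedure of the software used in the study: the person measures are treated as known, the step calibrations are hypothetical, responses are recoded from 1–5 to 0–4, and the function names are invented for this example.

```python
import numpy as np

def pcm_probabilities(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: person measure in logits.
    thresholds: step calibrations delta_1..delta_M in logits.
    Returns probabilities for categories 0..M.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Cumulative sums of (theta - delta_j); the empty sum for category 0 is 0.
    exponents = np.concatenate(([0.0], np.cumsum(theta - thresholds)))
    exponents -= exponents.max()  # numerical stability
    weights = np.exp(exponents)
    return weights / weights.sum()

def item_fit(responses, thetas, thresholds):
    """Infit and outfit mean squares for one item."""
    responses = np.asarray(responses, dtype=float)
    categories = np.arange(len(thresholds) + 1)
    expected = np.empty(len(thetas))
    variance = np.empty(len(thetas))
    for n, theta in enumerate(thetas):
        p = pcm_probabilities(theta, thresholds)
        expected[n] = (categories * p).sum()
        variance[n] = ((categories - expected[n]) ** 2 * p).sum()
    residual_sq = (responses - expected) ** 2
    outfit = np.mean(residual_sq / variance)     # unweighted mean square
    infit = residual_sq.sum() / variance.sum()   # information-weighted mean square
    return infit, outfit

# Hypothetical ordered step calibrations and simulated respondents.
rng = np.random.default_rng(1)
thresholds = [-1.5, -0.5, 0.5, 1.5]
thetas = rng.normal(size=2000)
responses = [rng.choice(5, p=pcm_probabilities(t, thresholds)) for t in thetas]
infit, outfit = item_fit(responses, thetas, thresholds)
```

Because the responses are simulated from the model itself, both mean squares should fall close to 1. At a person measure equal to a step calibration, the two adjacent categories are equally probable, which is where neighbouring category probability curves intersect.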

Results

Tables 1 and 2 present item difficulty, average measures, step calibrations, and item and category fit indices for the self- and proxy-reports. All of the items in the KIDSCREEN-27 demonstrated acceptable infit and outfit statistics (0.7–1.3). Hence, all domains in both self- and proxy-reports can be considered sufficiently unidimensional. Item difficulty estimates ranged from −0.77 to 0.50 and from −0.55 to 0.55 for self- and proxy-reports, respectively. Items 1 and 4 in the social support and peers domain for child self-report, and items 2 and 4 in the autonomy and parent relation domain for parent proxy-report, were the most and least difficult items, respectively. As shown in Tables 1 and 2, the infit and outfit statistics for all response categories, except for “never or not at all”, were within the acceptable range (0.5–1.5). In the child self-report, items 1 and 2 in the physical well-being, items 6 and 7 in the psychological well-being, items 3 and 4 in the autonomy and parent relation, item 3 in the social support and peers, and item 4 in the school environment domains had infit and/or outfit greater than 1.5. Moreover, in the parent proxy-report, items 1 and 2 in the physical well-being, items 3 and 6 in the psychological well-being, and item 7 in the autonomy and parent relation domains had infit and/or outfit greater than 1.5. Within each item, the average measures and step calibrations increased monotonically as the rating scale moved from lower to higher categories. These results are consistent with the intersections of the category probability curves (Figure 1).

Table 1 Item and category fit indices for child self-report in the KIDSCREEN-27
Table 2 Item and category fit indices for parent proxy-report in the KIDSCREEN-27
Figure 1 Category probability curves of five response categories for item 6 in the psychological well-being domain and item 2 in the autonomy and parent relation domain in the KIDSCREEN-27.

Table 3 shows that all the domains have adequate internal consistency (greater than 0.7). Moreover, scaling success rates for convergent and discriminant validity were 100% in all domains.

Table 3 Cronbach’s alpha coefficient, convergent and discriminant validity for the KIDSCREEN-27 and score domains for the Iranian school children

Discussion

In the current study, Cronbach's alpha coefficients for all five domains were consistent with those obtained in the combined sample from all European countries [8]. The Rasch PCM analysis of the self- and proxy-reports showed that no item was misfitting. These findings are in line with those of the previous study conducted in 13 European countries, indicating that each of the test items measures the underlying construct adequately [8]. Although average measures and step calibrations for all five response categories increased monotonically, 8 and 5 of the 27 items had category fit statistics greater than 1.5 in the self- and proxy-reports, respectively. According to Linacre [14], for a five-category scale, advances of at least 1.0 logits between step calibrations are needed in order to achieve the optimal number of response categories. As seen in Tables 1 and 2, the advance from the step calibration between ratings 1 and 2 to the step calibration between ratings 2 and 3 is less than 1.0 logits in almost all items. For example, in item 2 for child self-report, the step calibrations advance from −1.52 to −1.05, a distance of 0.47 logits. This is not sufficiently large to meet the criterion. These findings indicate that categories 1 (never or not at all) and 2 (seldom or slightly) should be combined in all items for self- and proxy-reports. Similar results were also observed in the Persian version of the PedsQL™ 4.0 Generic Core Scales [15].
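The check on step calibration advances can be sketched as follows. The function `narrow_advances` and the four step calibrations are illustrative assumptions; only the first gap of 0.47 logits is chosen to mirror the item 2 example, and the remaining values are invented.

```python
def narrow_advances(step_calibrations, minimum_advance=1.0):
    """Return (step, next step, advance) where the advance is below the minimum."""
    flagged = []
    for k in range(len(step_calibrations) - 1):
        advance = step_calibrations[k + 1] - step_calibrations[k]
        if advance < minimum_advance:
            # Steps are numbered from 1: step 1 is "2 over 1", step 2 is "3 over 2".
            flagged.append((k + 1, k + 2, advance))
    return flagged

# Illustrative step calibrations in logits for a five-category item.
steps = [-1.52, -1.05, 0.40, 2.17]
flagged = narrow_advances(steps)
```

With these values only the advance between the first and second step calibrations falls below 1.0 logits, which is the pattern that points to merging categories 1 and 2.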

As with the PedsQL™ 4.0 in Iranian children with chronic conditions [16, 17], this study showed that the Persian version of the KIDSCREEN-27 has good internal consistency and excellent convergent and discriminant validity. However, although the PCM showed that all the items contributed adequately to their own domain, Rasch analysis revealed that the number of response categories should be reduced from five to four in the Persian version of the KIDSCREEN-27. It is not clear whether this problem is due to the meaning of the response options in the Persian language or is an artifact of a sample of mostly healthy schoolchildren who did not use the full range of the response scale [15]. Therefore, the response categories should be evaluated in further validation studies, especially in large samples of chronically ill children.