Introduction

Television (TV) viewing has increased among youth in the United States [1] and is considered a cause of childhood obesity [2–5]. Parenting practices to reduce children’s TV viewing may therefore be important for preventing child obesity. Parenting practices (PP) are behaviors parents use to influence their child’s behaviors [6–8]. Few psychometric analyses of TV PP scales have been reported, and all have employed only classical test theory (CTT) [9]. CTT, however, is sample-dependent. In contrast, item response modeling (IRM) provides model-based measurement: trait level estimates obtained as a function of participants’ responses and the properties of the administered items [10, 11]. For example, a participant’s estimated level of TV PP depends both on that person’s responses to the items and on the item parameters.

Valid measures are needed both to understand how PP influence child behaviors and to measure mediating variables in parenting-change interventions. The PP that influence child TV viewing may differ by parental education level, child’s age, or parents’ understanding of the items, which may vary by language [12–14]. Such differences could pose serious problems for validity by making it difficult to compare parameter estimates across these groups or across studies. Multidimensional polytomous item response modeling (MPIRM) enables differential item functioning (DIF) analysis [15] for multidimensional scales.

The aim of this study was to use MPIRM and DIF to examine the item and person characteristics of TV PP scales across education, language, and age groups.

Methods

Participants

Children (n = 358) between 3 and 12 years old (yo) in Houston, Texas, were included in the present analyses; the data were assembled from three studies: a physical activity (PA) intervention using Wii Active Video Games (Wii, n = 78) [16], a first-line obesity treatment intervention, Helping HAND (Healthy Activity and Nutrition Directions) (HH, n = 40) [17], and a cross-sectional study, Niños Activos (NA, n = 240). The Wii study recruited 84 children from multiple sources to participate in a 13-week exergame intervention in 2010; the inclusion criteria targeted children 9–12 yo whose BMI fell within the 50th–99th percentile range. Details have been reported elsewhere [16]. HH recruited 40 children 5–8 yo whose BMI fell within the 85th–99th percentile range to participate in an obesity treatment study in pediatric primary care. Details have been reported elsewhere [17]. Niños Activos recruited 240 Hispanic children 3–5 yo from Houston, TX, with no restrictions on BMI, to participate in a study assessing influences on child PA. TV PP were assessed at baseline in all three studies.

The Institutional Review Board of Baylor College of Medicine approved all three study protocols. Signed informed consent and assent were obtained from each parent and child.

Instrument

All parents self-completed a TV PP questionnaire [9], in English or Spanish, that was originally developed to assess the TV mediation styles of 519 Dutch parents of children 5–12 years old. In the original study, the scale contained 15 items distributed across three subscales: restrictive mediation (5 items, α = 0.79), instructive mediation (5 items, α = 0.79), and social co-viewing (5 items, α = 0.80) [9]. Restrictive mediation was defined as the parent determining the duration of TV viewing and specifying appropriate programs; instructive mediation, as the parent explaining the meaning of TV programs and the acceptability of characters’ behaviors; and social co-viewing, as the parent watching TV together with his/her child [17–19].

Items in the Wii and HH studies featured the same four response options as in the original study (Never, Rarely, Sometimes, and Often). Items in the NA study featured five response options (Never, Rarely, Sometimes, Often, and Always). To facilitate analyses, category response curves (CRCs) were examined for the NA sample to determine which response categories to collapse. Parents provided demographic information at baseline in all three studies.

Analyses

Classical test theory

Item difficulty (item mean) and item discrimination (corrected item-total correlation, CITC) were first assessed for the TV PP scales, and Cronbach’s alpha assessed internal consistency reliability. Criteria for acceptable CITC and internal consistency reliability were values greater than 0.30 and 0.70, respectively [20]. All CTT analyses were conducted using the Statistical Analysis System (SAS) [21].
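As an illustration of these computations, the following is a minimal sketch in Python (rather than the SAS code actually used) of the three CTT statistics for one subscale; it assumes a complete respondent-by-item score matrix, and the function name is ours.

```python
import numpy as np

def ctt_stats(X):
    """CTT item statistics for a respondents-by-items score matrix X."""
    n_items = X.shape[1]
    item_means = X.mean(axis=0)          # item "difficulty" under CTT
    total = X.sum(axis=1)
    # Corrected item-total correlation: each item vs. the total of the
    # remaining items (criterion: > 0.30).
    citc = np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]
                     for j in range(n_items)])
    # Cronbach's alpha from item and total-score variances (criterion: > 0.70).
    item_vars = X.var(axis=0, ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_vars.sum() / total.var(ddof=1))
    return item_means, citc, alpha
```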

Item response modeling (IRM)

The primary assumption of IRM, unidimensionality, was tested for each subscale using exploratory factor analysis in SPSS [22]. Unidimensionality was considered satisfied if the scree plot showed one dominant factor, the first factor explained at least 20% of the variance, and the factor loadings were >0.30 [23]. After unidimensionality was confirmed for each subscale, the IRM model that best explained the data structure was selected.
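A rough sketch of such a screen is shown below; note that it uses a principal-component eigendecomposition as a stand-in for the SPSS exploratory factor analysis actually employed, so it only approximates the reported procedure.

```python
import numpy as np

def unidimensionality_screen(X):
    """Screen one subscale: eigenvalues of the inter-item correlation
    matrix (for the scree plot), share of variance on the first factor
    (criterion: >= 20%), and loadings on it (criterion: > 0.30)."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)            # returned in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    first_share = eigvals[0] / eigvals.sum()
    loadings = eigvecs[:, 0] * np.sqrt(eigvals[0])
    return eigvals, first_share, loadings
```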

Polytomous IRM models were used because the TV PP items presented multiple response options [24, 25]. Polytomous IRM models the probability of endorsing one response category over another; the boundary between adjacent categories is described by a threshold parameter, the point on the latent trait at which a respondent becomes more likely to respond at or above a given category than below it. For an item with four response options (e.g., never, rarely, sometimes, and often), three thresholds exist: (1) from “never” to “rarely”, (2) from “rarely” to “sometimes”, and (3) from “sometimes” to “often”. The item threshold locations were estimated along the latent trait continuum. The latent trait estimates from IRM can be related to the raw scores of the TV PP scale through a non-linear transformation.
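To make the threshold idea concrete, here is a minimal sketch of the category probabilities under the adjacent-category (partial credit) formulation adopted later in the analyses; the function and any threshold values supplied to it are illustrative, not estimates from this study.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial-credit-model category probabilities for one item.

    theta  : latent trait level (scalar, in logits)
    deltas : step/threshold difficulties, one per category boundary
             (three thresholds for a four-category item)
    """
    # Cumulative sums of (theta - delta_j); the lowest category has an empty sum.
    steps = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    exps = np.exp(steps - steps.max())   # shift by the max for numerical stability
    return exps / exps.sum()             # probabilities over the k+1 categories
```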

Category response curves (CRCs) show the probability of a response in a particular category at a given trait level. The number of CRCs equals the number of response options; in this study, every item has four CRCs, each showing the probability of endorsing a particular response at different levels of the latent trait. For example, the CRC for the response option “rarely” shows the latent trait levels at which participants are more likely to endorse “rarely” than any of the other three response categories. The sum of the response probabilities equals 1.0 at any location along the underlying trait continuum. CRCs can also be used to identify the most likely response at various levels of the latent trait.
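Building on the pcm_probs sketch above, the following hypothetical example traces the four CRCs of a single item across the latent trait; the threshold values are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

thetas = np.linspace(-4, 4, 200)
deltas = [-1.5, 0.0, 1.2]   # hypothetical thresholds for one four-category item
probs = np.array([pcm_probs(t, deltas) for t in thetas])

for k, label in enumerate(["never", "rarely", "sometimes", "often"]):
    plt.plot(thetas, probs[:, k], label=label)   # one CRC per response option
plt.xlabel("Latent trait (logits)")
plt.ylabel("Probability of response")
plt.legend()
plt.show()
```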

Item-person maps, often called Wright maps (with units referred to as logits, or log odds), depict the distribution of scale items against that of the respondents along the same latent trait scale. The dashed vertical line represents the latent trait in logits, with values given on the far left of the map; a logit of 0 implies a moderate amount of the latent trait. The location of a threshold on a Wright map marks the point at which the probability of scoring below category k equals the probability of scoring at category k or above. For example, the location of Threshold 1 shows the amount of the latent trait of the corresponding subscale (e.g., restrictive TV PP) a person must possess to have a 0.5 probability of selecting “rarely” over “never”. Large gaps along the difficulty continuum indicate ranges in which additional items would help distinguish among respondents.

Since the TV PP instrument contained three subscales, two multidimensional polytomous models were considered: the partial credit model (PCM) [26] and the rating scale model (RSM) [27, 28]. The RSM is a special case of the PCM in which the response scale is fixed across items, i.e., the response threshold parameters are assumed identical for all items. The relative fit of the RSM and PCM was evaluated with the deviance difference, with df equal to the difference in the number of estimated parameters between the two models.
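In notation (a sketch of the standard formulation, with symbols of our choosing rather than taken from the cited sources), the two models differ only in how the step parameters are structured, and nested-model fit is compared through the deviance:

```latex
% PCM: category probabilities with item-specific step parameters \delta_{ij},
% with the convention that the j = 0 term of each sum is zero
P(X_i = k \mid \theta) =
  \frac{\exp \sum_{j=0}^{k} (\theta - \delta_{ij})}
       {\sum_{m=0}^{M_i} \exp \sum_{j=0}^{m} (\theta - \delta_{ij})}

% RSM: each step splits into an item location plus a threshold shared across items
\delta_{ij} = \delta_i + \tau_j

% Nested-model comparison: the deviance difference is approximately chi-square
% distributed, with df equal to the difference in number of estimated parameters
\Delta G^2 = D_{\mathrm{RSM}} - D_{\mathrm{PCM}} \sim \chi^2_{\Delta \mathrm{df}}
```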

Item fit was assessed using the information-weighted fit statistic (infit) and the outlier-sensitive fit statistic (outfit) mean square index (MNSQ), which range from zero to infinity. Infit MNSQ is an information-weighted mean of squared standardized residuals; outfit MNSQ is an unweighted mean of squared standardized residuals [29]. An infit or outfit MNSQ value of one indicates that the observed variance equals the expected variance; values greater or smaller than one indicate that the observed variance is, respectively, greater or smaller than expected. Infit or outfit MNSQ values greater than 1.3 (for samples of n < 500) [30, 31], together with significant t-values, indicate poor item fit. Concerning thresholds, outfit MNSQ values greater than 2.0 indicate misfit and identify candidates for collapsing with a neighboring category [29, 32].
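A minimal sketch of the two indices, assuming the model-expected score and variance are already available for each respondent on a given item (the function name and inputs are ours):

```python
import numpy as np

def fit_mnsq(observed, expected, variance):
    """Infit and outfit MNSQ for one item across n respondents.

    observed, expected, variance : length-n arrays of observed scores,
    model-expected scores, and model score variances.
    """
    z2 = (observed - expected) ** 2 / variance        # squared standardized residuals
    outfit = z2.mean()                                # unweighted mean: outlier-sensitive
    infit = ((observed - expected) ** 2).sum() / variance.sum()  # information-weighted
    return infit, outfit
```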

Differential item functioning (DIF)

Participants with the same underlying trait level, but from different groups, may have different probabilities of endorsing an item. DIF was assessed with an item-by-group interaction term [33, 34], with a significant chi-square for the interaction term indicating DIF. An item displays DIF if the ratio of its item-by-group parameter estimate to the corresponding standard error exceeds 1.96 in absolute value. A finding of DIF by gender, for example, would mean that a male and a female with the same latent trait level responded differently to an item, suggesting that the item was interpreted differently by males and females.

The magnitude of DIF was determined by examining the differences among the item-by-group interaction parameter estimates. Because these parameters were constrained to sum to zero, when only two groups were compared the magnitude of DIF was twice the estimate for the focal group. When three or more groups were compared, the magnitude of DIF was the difference between the interaction estimates of the corresponding groups. Items with significant DIF were placed into one of three categories by effect size: small (difference < 0.426), intermediate (0.426 < difference < 0.638), and large (difference > 0.638) [35, 36]. ACER (Australian Council for Educational Research) ConQuest [37] was used for all IRM analyses.
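These decision rules can be summarized in a short sketch (hypothetical helper functions, not part of ConQuest):

```python
def dif_flag(estimate, se):
    """An item shows significant DIF if the interaction estimate is large
    relative to its standard error (|estimate / se| > 1.96)."""
    return abs(estimate / se) > 1.96

def dif_magnitude(difference):
    """Classify a significant DIF effect by the cut-points in the Methods.
    With two groups, pass difference = 2 * focal_group_estimate, because
    the interaction parameters are constrained to sum to zero."""
    d = abs(difference)
    if d < 0.426:
        return "small"
    elif d < 0.638:
        return "intermediate"
    else:
        return "large"
```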

Results

Descriptive statistics

Participant demographic characteristics are shown in Table 1 by study source. Parental education level was almost evenly split, with 50.7% of all participants reporting a high school education or less. Because the three original studies recruited children in different age ranges, the percentages by age group in the combined sample were proportional to each study’s sample size. A majority (57.8%) completed the English version, and most participants were Hispanic (79.1%).

Table 1 Demographic characteristics of respondents

Category response curves (CRCs)

For most of the 15 items in the NA sample, the CRC for the response category “often” never peaked, indicating that “often” was never the most likely response for those items. Therefore, the response categories “Often” and “Always” were collapsed in the NA sample. Figure 1 shows the CRCs for item 2 across the three original samples (Wii, HH, and NA). The curve for “rarely” never peaked in two of the samples, indicating that respondents were unlikely to choose “rarely”. The CRCs revealed that respondents did not use all response categories (usually only three) and that response category use differed by sample. (CRCs for the remaining items are available upon request.)

Figure 1. Category Response Curves for Item 2: “How often do you explain what something on TV really means?”

Classical test theory

The percentage of variance explained by the one-factor solution was 60%, 60%, and 48% for the social co-viewing, instructive mediation, and restrictive mediation subscales, respectively. The scree plots revealed one dominant factor, and factor loadings were >0.30 for all three subscales.

Item difficulties (item means) ranged from 3.16 (SD = 0.49) to 3.67 (SD = 0.62), indicating that on average respondents reported performing the PP frequently. Internal consistency was good for social co-viewing (α = 0.83) and instructive mediation (α = 0.83), and adequate for restrictive mediation (α = 0.72). CITCs were acceptable, ranging from 0.41 to 0.70.

IRM model fit

The chi-square (χ2) deviance statistic was calculated from the difference in model deviances (RSM: 8749.63; PCM: 8485.99) and the difference in numbers of parameters (RSM: 23; PCM: 51) for the nested models. The test showed that the RSM fit significantly worse than the PCM (Δ deviance = 263.64, Δ df = 28, p < 0.0001); thus, further analyses employed the multidimensional PCM.
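As a quick arithmetic check of this comparison (using SciPy; the deviances and parameter counts are those reported above):

```python
from scipy.stats import chi2

delta_dev = 8749.63 - 8485.99     # = 263.64
delta_df = 51 - 23                # = 28
p = chi2.sf(delta_dev, delta_df)  # far below 0.0001, favoring the PCM
print(f"delta deviance = {delta_dev:.2f}, delta df = {delta_df}, p = {p:.2e}")
```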

Item fit

Item difficulties are summarized in Table 2. Under the multidimensional PCM, only one item (item 2) exceeded the fit criterion (infit or outfit MNSQ > 1.3). Item 14 was flagged as the only misfitting item when taking into account differences in parental education level (infit/outfit MNSQ = 1.37), language (infit/outfit MNSQ = 1.32), or child’s age (infit MNSQ = 1.36; outfit MNSQ = 1.42). Because the misfit values were relatively small, all items were retained in the ensuing analyses.

Table 2 Item description, item difficulty, and misfit item(s)

Item-person fit Wright map

Figure 2 presents the multidimensional PCM item-person maps. Person, item, and threshold estimates were placed on the same map, where each “x” on the left side represents a person’s trait estimate, with parents scoring in the highest TV PP range placed at the top of the figure. Item and threshold difficulties are presented on the right side, with the more difficult-to-endorse items and thresholds at the top. The range of item difficulties was narrow (-1.02 to 0.72 logits), and the distribution of item difficulties did not match that of the respondents on any dimension. On each subscale, most parents found the items easy to endorse. The first threshold of many items (1, 2, 4–6, 10, and 12–15) did not align with participants at the lower end of the TV PP continuum.

Figure 2. Wright Map of the TV PP Scale (n = 358).

Differential item functioning (DIF)

Item difficulty differences between demographic groups are presented in Table 3. One, five, and nine items exhibited significant DIF across educational level, language, and child’s age groups, respectively (Table 3). Only item 2 showed significant DIF by educational level (0.67, a large effect): parents with higher education levels found item 2 easier to endorse. By language version, intermediate DIF was detected for item 2 and small DIF for items 5, 7, 8, and 9. Users of the Spanish version found it somewhat easier to endorse items 5, 8, and 9, but more difficult to endorse items 2 and 7. Intermediate DIF was detected for items 8 and 11 between children 3–5 yo and children 5–8 yo. Large DIF was indicated for item 2 between children 3–5 yo and children 9–12 yo, and between children 5–8 yo and children 9–12 yo. Parents of older children found items 2, 6, 8, and 12 easier to endorse, and parents of younger children found items 4, 5, 7, and 11 easier to endorse.

Table 3 Item description and estimates of DIF where significant

Discussion

This is the first study to present an analysis of a TV PP instrument using a multidimensional PCM. While the CTT analyses indicated that the scales yielded generally acceptable (good or adequate) reliability, the category response curves revealed that respondents used only three of the four response categories. Thus, it appears appropriate to simplify the response scale to three options in the future. The asymmetric distribution of items and item thresholds relative to individuals on the Wright map indicated that the items and thresholds did not cover the more difficult-to-endorse end of each of the three latent dimensions. This suggests that new items should be developed to cover the more difficult extreme of each dimension.

DIF analyses indicated that some items did not behave the same way across subgroups. Large DIF was identified for item 2 (i.e., “How often do you explain what something on TV really means?”) by parental education and child’s age; intermediate DIF was detected for item 2 by language version, and for items 8 and 11 by child’s age. Parents of 3–5 yo children tended to watch favorite programs together with the child, and were more likely to restrict the amount of TV viewing, than parents of older children. Parents of older children (9–12 yo) and parents with higher education levels reported more agreement with explaining to their child what something on TV really meant. Parents of 5–8 yo children were more likely than those in the other two age groups to specify in advance the programs the child may watch. Parents who used the English version tended to help their children understand the meaning of something on TV and to set specific TV viewing times, whereas parents who completed the Spanish version tended to agree that they watched favorite programs together with the child, pointed out why some things actors do are bad, and asked their child to turn off the TV when he/she was viewing an unsuitable program.

DIF by age group presents distinct issues. While the usual prescription for eliminating DIF is to rewrite items to enhance the clarity of meaning [38, 39], it may be that these items are reasonably clear and just not equally applicable across all ages of children. This suggests that responses for such scales can only be analyzed within rather narrow age groupings. The optimal age groupings await determination in future studies with larger samples. These subscale items all reflect frequency of performance, which is common among behavioral indicators. There may be benefit in introducing a value or normative aspect to these items: should parents do each of these practices?

Several limitations exist. The response scale for the NA items (a 5-point rating scale) differed from that for items in the Wii and HH studies (a 4-point scale). Collapsing one of the response categories based on infrequent use was a reasonable accommodation, but identical categories would have been preferable. The samples in the three studies reflected different inclusion/exclusion criteria and different recruitment procedures, with unknown effects on the findings. The Wii study of 9–12 yo children did not include any participants using the Spanish version; therefore, DIF by age may be confounded with language. In addition, the sample size was relatively small: while no clear standards for minimum sample size are available, Embretson and Reise [40] recommended a sample of 500, and Reeve and Fayers [41] recommended at least 350. Finally, an interaction term was used to detect DIF; further investigation should pursue other DIF-detection procedures (e.g., Mantel [42]; Shealy & Stout [43, 44]).

Conclusion

The TV PP subscales demonstrated factorial validity and acceptable internal consistency reliability. The multidimensional PCM demonstrated adequate fit to the data, but the items did not adequately cover the more difficult-to-endorse end of each dimension; respondents effectively used only three response categories; and several items showed differential item functioning, especially by age. While the scales can be used with caution, further formative work is necessary.