PROMIS® Parent Proxy Report Scales: an item response theory analysis of the parent proxy report item banks
- Cite this article as:
- Varni, J.W., Thissen, D., Stucky, B.D. et al. Qual Life Res (2012) 21: 1223. doi:10.1007/s11136-011-0025-2
The objective of the present study is to describe the item response theory (IRT) analysis of the National Institutes of Health (NIH) Patient Reported Outcomes Measurement Information System (PROMIS®) pediatric parent proxy-report item banks and the measurement properties of the new PROMIS® Parent Proxy Report Scales for ages 8–17 years.
Parent proxy-report items were written to parallel the pediatric self-report items. Test forms containing the items were completed by 1,548 parent–child pairs. Categorical confirmatory factor analysis (CCFA) and IRT analyses of scale dimensionality and item local dependence, and IRT analyses of differential item functioning, were conducted.
Parent proxy-report item banks were developed and IRT parameters are provided. The recommended unidimensional short forms for the PROMIS® Parent Proxy Report Scales are item sets that are subsets of the pediatric self-report short forms, setting aside items for which parent responses exhibit local dependence. Parent proxy-report demonstrated moderate to low agreement with pediatric self-report.
The study provides initial calibrations of the PROMIS® parent proxy-report item banks and the creation of the PROMIS® Parent Proxy-Report Scales. It is anticipated that these new scales will have application for pediatric populations in which pediatric self-report is not feasible.
Keywords: PROMIS®; Parent proxy report; Item response theory
Abbreviations
PROMIS®: Patient Reported Outcomes Measurement Information System
FDA: Food and Drug Administration
HRQOL: Health-related quality of life
NIH: National Institutes of Health
The Patient Reported Outcomes Measurement Information System (PROMIS®) is a National Institutes of Health (NIH) initiative created to advance the assessment of patient-reported outcomes (PROs) in chronic diseases. Items are evaluated using item response theory (IRT) to derive scales with scores that are maximally reliable and valid along the full spectrum of the latent trait. A primary objective is to develop item banks and computerized adaptive tests (CAT) across a variety of chronic disorders. During the past 7 years, the PROMIS® Pediatric Cooperative Group has developed pediatric self-report item banks for ages 8–17 years across five generic health domains (physical function, pain, fatigue, emotional health, and social health) consistent with the larger PROMIS® network. It was anticipated that measures of these five generic health domains would be applicable across pediatric chronic health conditions and hence were developed as generic or non-disease-specific scales [4–7]. Additionally, an asthma-specific item bank was developed and validated.
It has been well documented in both the adult and pediatric literature that information provided by proxy-respondents is not equivalent to that reported by the patient [9, 10]. Imperfect agreement between self-report and proxy-report, termed cross-informant variance, has been consistently documented in the health-related quality of life (HRQOL) measurement of children with chronic health conditions and healthy children. However, even as pediatric patient self-report is advocated, there remains a role for parent proxy-report in pediatric clinical trials and health services research.
While pediatric patient self-report should be considered the standard for measuring PROs, there may be circumstances when the child is too young, too cognitively impaired, or too ill or fatigued to complete a PRO instrument, and parent proxy-report may be needed in such cases. Further, it is typically parents’ perceptions of their children’s health and well-being that influence healthcare utilization [14–16]. Thus, instruments should be developed that measure the perspectives of both the child/adolescent and the parent, since these perspectives may be independently related to healthcare utilization, risk factors, and quality of care.
The majority of parent proxy-report scales, consistent with other clinical assessment instruments, have utilized classical test theory (CTT) and have rarely taken advantage of IRT analysis in the scale development process. By utilizing IRT analysis, the resulting item bank can be the basis of a more customizable measure for meeting a researcher’s or clinician’s needs. Depending on the desired level of precision, the evaluator can then select the number of items to administer and obtain scores on the same metric as all other users of this item bank.
The objective of the present study is to address this measurement gap in the parent proxy-report literature by describing the IRT analysis of the PROMIS® parent proxy-report item banks and the measurement properties of the new PROMIS® Parent Proxy Report Scales, including investigations of scale dimensionality and sources of local dependence and differential item functioning.
Participants were recruited between May 2008 and March 2009 in hospital-based outpatient general pediatrics and subspecialty clinics. Clinic participants were identified through a review of clinic appointment rosters or while in the clinic waiting rooms according to protocols approved by the institutional review boards (IRBs) of University of North Carolina (UNC), Duke University Medical Center, University of Washington (UW), Children’s Memorial Hospital, Chicago (CMH), and Children’s Hospital at Scott and White (S&W) in Texas. Pediatric patients within the appropriate age range who had clinic appointments and their caregivers were recruited while waiting for their clinic appointments. The UNC, Duke, UW, CMH and S&W general pediatric clinics were representative of health issues for which children have physician office visits (e.g., well child visits, acute illnesses, and some chronic illnesses). The specialty clinics included Pulmonology, Allergy, Gastroenterology, Rheumatology, Nephrology, Obesity, Rehabilitation, Dermatology, and Endocrinology. Children with asthma were oversampled during recruitment because asthma-specific items were tested.
To be eligible to participate in the large-scale testing survey, all participants were required to meet the following inclusion criteria: able to speak and read English; and able to see and interact with a computer screen, keyboard, and mouse. Children enrolled were between the ages of 8 and 17 years and with their parents/guardians formed a dyad (for convenience, we refer to the dyad as simply parent/child). Both members of the dyad were required to individually complete the items. Children completed the self-report version, and parents completed the proxy-report version.
Parents signed an informed consent document, and children signed an informed assent document that outlined the following: purpose of the study, participation requirements, potential benefits and risks of participation, and the measures implemented to protect participant privacy. Both the informed assent and the informed consent were administered in English, so parents were also required to read and speak English. Each participant received a $10 gift card in return for their time and effort. The study protocols were approved by the institutional review boards at each institution.
Pediatric self-report item bank development
The PROMIS® Pediatric item banks were developed using a strategic item generation methodology adopted by the PROMIS® Network. Six phases of item development were implemented: identification of existing items, item classification and selection, item review and revision, focus group input on domain coverage, cognitive interviews with individual items, and final revision before field testing. The final pediatric self-report item banks contained 165 items across the 5 generic health domains (physical function, pain, fatigue, emotional health, and social health) and asthma. Because physical function includes both upper extremity and mobility item banks, emotional distress includes separate anger, anxiety and depressive symptoms item banks, and fatigue includes both fatigue and lack of energy item banks, a total of 10 content domains were tested [4–8].
Parent proxy sampling plan and item distribution
The parent proxy-report items were developed from the 10 existing pediatric self-report content domains [4–8]. The items were revised to retain their meaning, while modifying the phrasing so that all items involved parents reporting on their child. For example, in the pediatric self-report pain interference domain, children responded to the item “I had trouble sleeping when I had pain,” while parents responded to the parent proxy-report equivalent of this item, “My child had trouble sleeping when he/she had pain.”
Proxy-report short form items were selected from items that were on the pediatric self-report short forms for each domain and did not include any items that were not already on the self-report short forms. This decision was made because most researchers prefer parent proxy-report item banks that have the same item content as the pediatric self-report item banks.
Parent and child demographics
N = 1,548 (% of complete data)
Mean = 41.1, SD = 7.8
Living with partner
Separated or divorced
Black or African-American
American Indian/Alaska Native
Native Hawaiian/Pacific Islander
Guardian’s relationship to child
Mother, stepmother, foster mother
Father, stepfather, foster father
Guardian or other
Guardian’s education level
Some high school
High school degree/GED
Some college/technical degree
Child’s age (years)
Means and standard deviations of the pediatric self-report measures in the matched parent–child samples, and the correlations between latent variables measured by the pediatric self-report and parent proxy-report scales
Depressive symptoms (14 items)
Anxiety (15 items)
Anger (5 items)
Lack of energy (11 items)
Tired (23 items)
Upper extremity/dexterity (29 items)
Mobility (23 items)
Pain interference (13 items)
Peer relations (15 items)
Asthma impact (17 items)
All items had a 7-day recall period and used standardized 5-point response options (e.g., never, almost never, sometimes, often, almost always; or, with no trouble, with a little trouble, with some trouble, with a lot of trouble, not able to do). Of the 293 items administered, 165 were retained for analysis, as these corresponded to the final items in the pediatric self-report item banks. A complete list of the 165 items may be found in the “Appendix”.
Statistical and psychometric methods
The purpose of the psychometric assessment of the parent proxy-report items was to develop parent content domain short forms that could serve as proxies for their child counterpart domain short forms. Based on the methods used in the development of the pediatric item banks, ten content domain-specific analyses were conducted which investigated scale dimensionality and sources of local dependence (LD).
LD occurs when a pair or cluster of items remains associated after the latent variable has been accounted for, that is, when the items share variance beyond what the latent variable explains. Researchers typically identify locally dependent item pairs or clusters either using categorical confirmatory factor analysis (CCFA) models that account for LD with correlated residuals and subfactors, or using IRT methods involving model-based pairwise residuals. We employed both methods (CCFA and IRT) to identify potential LD.
Initially, we conducted CCFAs of the inter-item polychoric correlation matrices using Mplus. Because this approach is conducted using complete case data, and our sampling design assigned only subsets of items from each content domain to any given parent, multiple CCFAs (usually 2) were conducted on independent subsets of items. This approach involved fitting 15 different CCFA models in which each analysis included at least 175 parents. We began with a single-factor model and estimated additional correlated residuals and subfactors using item content and statistical relevance for guidance. To quantify the degree to which each scale is approximately unidimensional, we used the value of “explained common variance” (ECV) attributable to the first (general) factor of the final bifactor model for each scale within each combined form.
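As an illustration of the ECV index, the sketch below uses the commonly cited definition: the sum of squared general-factor loadings divided by the sum of squared loadings on all factors of the bifactor model. The loadings are hypothetical values chosen for the example and are not taken from this study; the function name is likewise an assumption.

```python
import numpy as np

def explained_common_variance(general_loadings, group_loadings):
    """ECV = sum(general^2) / (sum(general^2) + sum(group^2)).

    general_loadings: standardized loadings on the general factor, one per item.
    group_loadings:   standardized loadings on the group factors
                      (items x group factors, zeros where an item does not load).
    """
    general = np.asarray(general_loadings, dtype=float)
    group = np.asarray(group_loadings, dtype=float)
    general_var = np.sum(general ** 2)
    group_var = np.sum(group ** 2)
    return general_var / (general_var + group_var)

# Hypothetical bifactor solution: six items, two group factors.
general = [0.80, 0.70, 0.75, 0.60, 0.65, 0.70]
group = np.array([
    [0.3, 0.0],
    [0.3, 0.0],
    [0.3, 0.0],
    [0.0, 0.4],
    [0.0, 0.4],
    [0.0, 0.4],
])
ecv = explained_common_variance(general, group)
print(f"ECV = {ecv:.3f}")
```

Values of ECV close to 1 indicate that most of the common variance is attributable to the general factor, which is the sense in which a scale is described as approximately unidimensional.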
To further investigate potential LD, we calibrated each content domain using Samejima’s graded response model (GRM) in an IRT framework using IRTPRO. Following these initial calibrations, we tested for LD between pairs and clusters of items using a standardized χ2 statistic as implemented in IRTPRO. The Chen-Thissen X2 is based on a comparison between the observed and IRT-expected response frequencies for pairs of items.
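For readers unfamiliar with the graded response model, the following minimal sketch computes category probabilities for one item in the usual logistic form, where the probability of responding in category k or higher is 1 / (1 + exp(−a(θ − b_k))). The discrimination and threshold values are hypothetical and are not estimates from this study; the function name is an assumption.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under the graded response model.

    theta: latent variable value(s).
    a:     item discrimination (slope).
    b:     increasing thresholds, length K - 1 for an item with K categories.
    Returns an array of shape (len(theta), K).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # P(X >= k | theta) for k = 1 .. K-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # Pad with P(X >= 0) = 1 on the left and P(X >= K) = 0 on the right,
    # then difference adjacent cumulative probabilities.
    upper = np.hstack([np.ones((theta.size, 1)), cum])
    lower = np.hstack([cum, np.zeros((theta.size, 1))])
    return upper - lower

# Hypothetical 5-category item (e.g., never ... almost always).
probs = grm_category_probs([-2.0, 0.0, 2.0], a=1.8, b=[-1.5, -0.5, 0.5, 1.5])
print(np.round(probs, 3))
```

Each row sums to 1, and the most likely category shifts from the lowest to the highest as θ increases.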
These analyses provided two different assessments of LD. Our initial assessment of the validity of these tests of LD was made by comparing the magnitude of the CCFA-based residual correlation to the magnitude of the IRT-based standardized χ2 statistic. In making these comparisons, we note that the two methods use slightly different respondents; the IRT analyses used the same individuals as the CCFA analyses, and additional parents. Where there was disagreement between the two methods, a team of six content experts evaluated the validity of the potential LD. The expert panel reviewed all pairs and triplets of locally dependent items that contained more than one item from the pediatric short form. If the LD χ2 statistic was greater than 10, item(s) were eliminated so that only one item from the LD pair or triplet remained. If the LD χ2 statistic was less than 10, each member of the panel voted on whether or not to delete item(s) and if so, which item to retain. Members made decisions about whether to eliminate item(s) by evaluating whether the item’s content was substantively different from others in the locally dependent group and offered unique information about the child’s experience. When members voted to delete item(s), decisions about which item to retain were based on a content analysis of which item best fit the conceptualization of the domain. Setting aside items from locally dependent pairs ensured that the IRT assumption of unidimensionality was maintained.
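The pairwise LD statistic referred to above can be sketched as a Pearson X2 computed on the two-way table of responses to an item pair, comparing observed counts with the counts expected under the fitted IRT model. The standardization shown, (X2 − df) / √(2·df), is how the IRTPRO statistic is generally described and is assumed here rather than taken from this article. The tables below are hypothetical; in practice the expected counts come from the calibrated model.

```python
import numpy as np

def ld_chi_square(observed, expected):
    """Pearson X2 for an item pair, plus an approximately standardized version.

    observed, expected: two-way tables (categories of item i x categories of item j).
    Returns (x2, df, standardized), with standardized = (x2 - df) / sqrt(2 * df).
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    x2 = np.sum((observed - expected) ** 2 / expected)
    rows, cols = observed.shape
    df = (rows - 1) * (cols - 1)
    standardized = (x2 - df) / np.sqrt(2.0 * df)
    return x2, df, standardized

# Hypothetical 2 x 2 example: responses agree more often than the model expects.
observed = [[30, 10], [10, 30]]
expected = [[20, 20], [20, 20]]
x2, df, z = ld_chi_square(observed, expected)
print(f"X2 = {x2:.1f}, df = {df}, standardized = {z:.2f}")
```

In this toy example the standardized value exceeds 10, so under the rule described above the pair would have been reduced to a single item.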
We tested for the presence of differential item functioning (DIF) between parents of children ages 8 through 12 and parents of adolescents ages 13 through 17, and separately between parents of male or female children. DIF was investigated using IRTPRO’s DIF module that implements Wald χ2 tests [25, 26]. In each case, the presence of DIF indicates that the relation between the item responses and latent variable differs between groups. Because many tests of DIF were conducted, we controlled for multiple comparisons using the Benjamini-Hochberg procedure. A significant χ2 statistic indicates the presence of DIF. To evaluate the influence of DIF on item scores, we used graphical methods suggested by Steinberg and Thissen. The goodness of fit of the IRT model to the data was examined using the S-X2 statistic as implemented in IRTPRO. Non-significant S-X2 values suggest adequate fit of the model to the data.
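The Benjamini-Hochberg step-up procedure mentioned above can be sketched as follows. The p-values are hypothetical, and the false discovery rate of 0.05 is an assumption, since the level used in the study is not stated here.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses are rejected.

    Sort the m p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject the k hypotheses with the smallest p-values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting the criterion
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values from ten item-level DIF tests.
p_values = [0.041, 0.001, 0.216, 0.008, 0.039, 0.060, 0.074, 0.205, 0.212, 0.042]
flagged = benjamini_hochberg(p_values, alpha=0.05)
print(flagged)
```

Only the tests with p = 0.001 and p = 0.008 are flagged in this example; the tests with p-values near 0.04 would be significant individually but not after the correction.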
In order to place the parent scales on the same metric as the child items, final calibrations of the unidimensional domains used, as the reference-group mean and variance, the values (in Table 2) obtained from the responses of children whose parents took the same domain items; parents whose children did not take the same domain items were included in the calibration as a separate group.
Parent proxy-report short forms were created after setting aside locally dependent items from the item subsets derived from the pediatric self-report short forms. The degree to which these short forms produce precise scores was evaluated using information functions. IRT conceptions of information allow score precision to vary across the range of the content domain. To ease interpretation, information can be expressed as reliability: on a standardized metric, reliability is approximately one minus the inverse of test information. Thus, when information is 10, reliability is approximately 0.90.
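The conversion between information and reliability described above can be written as a short sketch. It assumes the latent variable is scaled to unit variance, so that the error variance at a given point is the inverse of the test information; the function names are illustrative.

```python
import math

def reliability_from_information(information):
    """Approximate reliability at a point on the scale: 1 - 1 / information."""
    return 1.0 - 1.0 / information

def standard_error_from_information(information):
    """Standard error of measurement on the standardized metric: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(information)

for info in (5.0, 10.0, 20.0):
    rel = reliability_from_information(info)
    se = standard_error_from_information(info)
    print(f"information = {info:4.1f}  reliability = {rel:.2f}  SE = {se:.3f}")
```

Because information varies along the scale, a short form can reach reliability 0.90 in one range of the latent variable and fall below it elsewhere.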
We used two-dimensional IRT models to estimate the correlations between the latent variables measured by the pediatric self-report scales for ages 8–17 and the parent proxy-report scales for ages 8–17. In these models, one latent variable was used with the graded model to fit the pediatric self-report responses, and a second latent variable was used to fit the parent proxy-report data; the correlation between the two latent variables was estimated simultaneously with the item parameters. In terms used by traditional test theory, this correlation is an estimate of the “disattenuated” correlation between the two variables; that is, the correlation corrected for the presence of measurement error in scores. As a result, these correlations would be 1.0 if the pediatric self-report and parent proxy-report scales measured the same constructs.
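For comparison, the classical test theory correction for attenuation has the following form, where r_{XY} is the observed correlation between self-report and proxy-report scores and r_{XX'} and r_{YY'} are the reliabilities of the two scores. This is the textbook formula and not the computation used in this study, in which the latent correlation was estimated directly within the two-dimensional IRT model; the numbers in the worked line are hypothetical.

```latex
\[
  \hat{\rho}_{T_X T_Y} \;=\; \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}},
  \qquad \text{e.g.}\quad
  \frac{0.45}{\sqrt{0.90 \times 0.85}} \;\approx\; 0.51 .
\]
```

The corrected value is never smaller in magnitude than the observed correlation, which is why latent correlations that remain well below 1.0 indicate that the two informants are reporting on different constructs rather than merely providing noisy measures of the same one.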
Explained common variance (ECV) for the general factor for each of the parent proxy scales, for the form combinations that were administered to parents
Lack of energy
Tables 4, 5, 6, 7, 8 and 9 in the “Appendix” list the items (arranged by the magnitude of the IRT discrimination parameter), IRT parameter estimates, item fit, and DIF statistics for the 165 items retained for analysis and 10 content domains. The notes at the bottom of each table indicate with superscripts items that may exhibit LD and items that were set aside from each domain’s short form. Items marked with the same superscript letter within a domain remain in the item banks, but may exhibit LD, and so users are cautioned to use only one item from each such lettered set in any custom form or CAT. Because many CCFA and IRT models were used to assess the properties of the items, we summarize the results by content domain.
Emotional Distress: LD and DIF
Emotional distress (Table 4) comprises depressive symptoms, anxiety, and anger. No significant DIF by sex or age was identified for these content domains. Among the depressive symptoms short form items, three pairs of items were identified as being locally dependent. From the pairs “My child felt lonely” and “My child felt alone”, and “My child felt sad” and “My child felt unhappy”, the second item was set aside from each LD pair, respectively, for the proxy short form. From the pair “My child felt everything in his/her life went wrong” and “My child felt like he/she couldn’t do anything right”, neither item was set aside after content review. There were three additional pairs of potentially locally dependent items identified in the anxiety domain; however, statistical tests of LD were mixed regarding these pairs, and no further items were set aside from the short form after item content review suggested that significant LD may be spurious.
Fatigue: LD and DIF
Physical Functioning: LD and DIF
Results for physical functioning (upper extremity and mobility; Table 6) indicate a single instance of LD between the item pair “My child could do sports and exercise that other kids his/her age could do” and “My child has been physically able to do the activities he/she enjoys most”. The content review panel concluded that each item provides sufficient unique contribution to warrant inclusion in the final short form. Additionally, there was evidence of both age and sex DIF throughout both scales, but this was largely due to missing or sparsely endorsed response categories at the extreme ends of the distribution, and no items were set aside. The resulting 8-item short forms produce reliable scores from about one standard deviation to three standard deviations below the mean (see Fig. 2).
Pain Interference: LD and DIF
There were two potentially locally dependent pairs of items in the pain interference domain, marked with superscripts a and b in Table 7. All four of these items were considered to be substantively unique, and none was set aside from the final short form. Additionally, the item “It was hard for my child to pay attention when he/she had pain” exhibited evidence of DIF by sex; the item was not set aside because only 6 and 3 parents of male children endorsed the two most extreme response categories, as compared to 20 and 15 parents of female children, and hypothesis testing using such sparse cell counts is untrustworthy. The resulting 8-item short form produced scores with information greater than 10 from about one standard deviation below the mean to about two and a half standard deviations above the mean (see Fig. 2).
Peer Relationships: LD and DIF
There were two potentially locally dependent pairs of items in the Peer Relationships domain (Table 8). From the item pair “My child was able to count on his/her friends” and “My child was able to talk about everything with his/her friends”, the latter item was set aside. Neither item was set aside from the other LD pair after content review suggested a unique contribution from each item. There were no items with significant DIF. The remaining 7 items on the short form produce scores with reliability greater than 0.90 between the mean and about three standard deviations below the mean (see Fig. 2).
Asthma Impact: LD and DIF
The correlations between the latent variables measured by the pediatric self-report scales and the parent proxy-report scales are in Table 2.
For the convenience of users who prefer to use summed scores, scoring tables that translate summed scores on the parent proxy-report short forms into IRT scaled scores are given in Tables 10 and 11. Computation of these scales used the means and standard deviations in Table 2. Because those means and standard deviations refer back to the original calibration samples for the PROMIS® pediatric self-report scales, the effect is that it is as though these parent proxy-report scales had also been calibrated, or “normed”, on that same sample. Because the correlations (in Table 2) between the pediatric self-report scales and the parent proxy-report scales with the same names range from moderate to low, this does not mean that parent proxy-report scores are comparable with pediatric self-report scores. However, it does mean that the average “level” of the scores (on the PROMIS® T-score scales) has the same meaning with respect to “average” and “one standard deviation above average,” and so on.
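One common way to build a summed-score-to-scaled-score table is sketched below: a Lord-Wingersky-style recursion gives the likelihood of each summed score at each point of a quadrature grid, and the expected a posteriori (EAP) estimate for each summed score is then converted to a T-score. The article does not state which algorithm was used, so this approach is an assumption, as are the item parameters, the quadrature grid, the default population mean and standard deviation, and the T = 50 + 10θ convention.

```python
import numpy as np

def grm_probs(theta, a, b):
    """Graded response model category probabilities, shape (len(theta), K)."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(b, dtype=float)[None, :])))
    upper = np.hstack([np.ones((theta.size, 1)), cum])
    lower = np.hstack([cum, np.zeros((theta.size, 1))])
    return upper - lower

def summed_score_to_t(items, mean=0.0, sd=1.0, n_points=81):
    """EAP T-score for every possible summed score on a short form.

    items: list of (a, b) pairs; categories are scored 0 .. K-1.
    mean, sd: assumed population distribution of the latent variable.
    """
    theta = np.linspace(-4.0, 4.0, n_points)
    prior = np.exp(-0.5 * ((theta - mean) / sd) ** 2)
    prior /= prior.sum()
    # like[q, s] = P(summed score = s | theta_q), built up one item at a time.
    like = np.ones((n_points, 1))
    for a, b in items:
        p = grm_probs(theta, a, b)
        n_scores = like.shape[1]
        new = np.zeros((n_points, n_scores + p.shape[1] - 1))
        for cat in range(p.shape[1]):
            new[:, cat:cat + n_scores] += like * p[:, [cat]]
        like = new
    posterior = like * prior[:, None]
    eap = (theta[:, None] * posterior).sum(axis=0) / posterior.sum(axis=0)
    return 50.0 + 10.0 * eap

# Two hypothetical 5-category items with thresholds symmetric about zero.
items = [(1.5, [-1.5, -0.5, 0.5, 1.5]), (2.0, [-1.0, -0.25, 0.25, 1.0])]
t_scores = summed_score_to_t(items)
for score, t in enumerate(t_scores):
    print(f"summed score {score}: T = {t:.1f}")
```

With symmetric hypothetical items and a standard normal population, the middle summed score maps to T = 50 and the table is symmetric around it; real short forms would not generally show this symmetry.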
This study describes the development and calibrations of the NIH PROMIS® Parent Proxy Report Scales based on an iterative series of IRT analyses regarding scale dimensionality, item LD, and DIF. After determining scale dimensionality, items with LD and DIF were next identified, and some were removed from the recommended short forms.
The potential advantages of utilizing IRT analysis in scale development include greater flexibility in selecting items from the existing parent proxy-report item banks tailored to the objectives of a particular clinical research investigation. Further, scales that have been developed with CTT may have gaps in their ability to measure the full spectrum of the latent construct, while with IRT calibrated items, one can construct a measure that is useful across the full continuum of the latent variable. Thus, this analytic methodology provides clinical researchers the opportunity to select the most meaningful items for their study design and hypotheses. In this study, we proposed short forms measuring each of the content domains; however, a smaller subset of items from the item banks can also be used and scored on the same metric as the larger set using a more dynamic CAT algorithm.
Our finding that parent proxy-report demonstrated moderate to low agreement with pediatric self-report is consistent with the extant literature, suggesting that information provided by proxy-respondents is not equivalent to that reported by the patient. In the HRQOL literature, imperfect agreement between self-report and proxy-report has been consistently documented, typically demonstrating higher correlations for more observable domains (e.g., physical functioning) and lower correlations for less observable or internal symptoms such as emotional functioning, pain, and fatigue. Our findings are consistent with this larger literature.
Because the items were spread over several test forms, we were unable to perform factor analyses across the entire item bank for six of the ten content domains. For each of these six domains, factor analysis was conducted on two separate sets of items. It is possible that factor analyses would turn out differently if all the items within each content domain were analyzed as a single set. However, because the items were created to fill content from qualitative work and then were randomly allocated to each test form, the different test forms can be viewed as replications. When our impressions of multidimensionality were repeated across forms, these replicated factor analyses increased our confidence in the factor analytic results.
We recruited participants from clinics across five sites to achieve a sample with diverse health outcomes as well as diverse cultural and ethnic backgrounds. This study does not report on using the items in languages other than English or with children living in other countries; as such, we cannot assume that the scales would have the same test characteristics in those other populations.
Future research with other samples may reveal other sources of DIF for the items; an advantage of IRT as a method is that it can detect item-level DIF, and “flag” items to be used only with caution for comparisons across levels of a variable for which DIF exists. Although analysis of DIF led to smaller item banks, we believe this approach will ultimately yield a more broadly applicable measure for comparing results across populations.
In conclusion, this study provides initial IRT calibrations of the PROMIS® parent proxy-report item banks and the creation of the PROMIS® Parent Proxy Report Scales which address an important gap in the current literature. Further research is indicated on construct validity and tests of the responsiveness of these scales and item banks in larger samples of parents of pediatric patients with chronic health conditions.