Plain English summary

The patient-reported outcomes measurement information system (PROMIS) 16-item Profile (PROMIS-16) was developed to be minimally burdensome, clinically useful, and able to generate eight health-related quality of life domain-specific scores (physical function, ability to participate in social roles and activities, anxiety, depression, sleep disturbance, pain interference, cognitive function, and fatigue). The PROMIS-16 was developed in three phases. In the first phase, a thorough empirical evaluation of all candidate PROMIS items and item pairs was conducted using data from a sample of adults from Amazon’s Mechanical Turk (MTurk) panel. This included basic descriptive information and associations with the PROMIS-29 + 2 Profile. In the second phase, a stakeholder meeting was held to discuss the findings. Final item pairs were agreed upon for two domains, and the candidate sets for the remaining domains were reduced. In the third phase, a survey of the stakeholder panel and another sample of MTurk adults was conducted to solicit preferences for one of two remaining item pairs for each of the other six domains. Stakeholders and MTurk respondents had similar preferences among the remaining candidate item pairs, and final items were selected based on those preferences. The results of the development process showed that the PROMIS-16 has good psychometric properties. The PROMIS-16 is a promising new brief measure of eight distinct domains of health-related quality of life for clinical care and research, representing an ideal screening tool for clinical care, which can help clinicians quickly identify distinct areas of concern that may require further assessment and follow-up. Further research is needed to confirm these findings and to evaluate the PROMIS-16 Profile in real-world settings.

Introduction

The patient-reported outcomes measurement information system® (PROMIS®) [1, 2] includes an extensive portfolio of health-related quality of life (HRQOL) measures that are used around the world in research- and practice-based settings due to their psychometric soundness, flexibility of administration, and scoring normed to the United States general population. The PROMIS library has domain-specific (e.g., anxiety, pain) and global (e.g., general health) measures and offers a collection of pre-packaged multiple-domain measures called PROMIS Profiles (PROMIS-29, -43, -57) [3] that yield seven domain scores: anxiety, depression, fatigue, pain interference, physical function, sleep disturbance, and ability to participate in social roles and activities. The domain scores can be aggregated into physical and mental health summary scores [4] and six of them (anxiety is not included) plus a PROMIS measure of cognitive function can be used to calculate the PROMIS-based preference score (PROPr) [5, 6].

The PROMIS Profiles have seen a rapid uptake in health research settings given their accessibility, ability to describe in detail HRQOL domains that are specific and actionable, and summary scores. However, despite the push to implement clinically relevant patient-reported outcome data collection in clinical care, even the shortest PROMIS Profile, the PROMIS-29, may be considered too burdensome for routine clinical use and some research settings, leading many to decide not to measure HRQOL or opt for the more feasible PROMIS Global-10 [7]. The Global-10, a brief general measure that provides mental and physical health summary scores, is particularly useful for general surveillance and risk adjustment but it does not provide clinically actionable HRQOL domain-specific scores (e.g., pain interference score, depression score).

Thus, although PROMIS offers a host of measurement options, it does not provide an off-the-shelf domain profile option that is regarded as sufficiently brief for routine clinical use. In this article, we describe the development and provide evidence for the reliability and validity of a short PROMIS profile measure that represents eight HRQOL domains (physical function, ability to participate in social roles and activities, anxiety, depression, sleep disturbance, pain interference, cognitive function, and fatigue) with two items each: the PROMIS-16 Profile.

Methods

Participants

Amazon’s Mechanical Turk (MTurk) development sample

We collected demographic, clinical, and PROMIS item-level data (described further below) in 2021 as part of a larger survey of MTurk participants administered through the online platform CloudResearch (formerly TurkPrime) [8]. Eligible study participants were 18 years or older with an IP address in the USA and had to have completed a minimum of 500 previous MTurk “human intelligence tasks” (surveys, writing product descriptions, coding, or identifying content in images or videos) with a successful completion rate of at least 95%. The 95% threshold was selected because it is associated with better response quality [9]. Additional quality control measures included deploying small batches of surveys hourly over several weeks to reduce selection bias, screening for excessive speediness in completing the survey (< 1 s per item), and including two fake conditions in a list of chronic health conditions [10].

All MTurk participants provided electronic consent at the start of the survey and were paid $1.50, an amount based on the expected time needed to complete the survey and the US federal minimum wage. Of the 6997 respondents who enrolled in the survey, 247 were excluded because they did not complete the survey, and 975 were excluded based on endorsing a fake condition. The final analytic sample of 5775 respondents had a median age of 37 years, was predominantly White (82%), non-Hispanic (86%), male (53%), and well-educated (over 65% had a bachelor’s degree or higher). Rates of endorsement for chronic conditions ranged from 4% (stroke) to 40% (back pain; see Table 1).

Table 1 Demographic characteristics of participants in MTurk development (N = 5775) and preference (N = 124) samples

MTurk preference sample

We surveyed a second sample of MTurk respondents to elicit item pair preferences for measure finalization. The analytic sample included 124 respondents with demographic characteristics similar to the development sample: median age of 37 years, predominantly White (83%), non-Hispanic (95%), and male (63%). Rates of endorsement for chronic conditions ranged from 0% (heart attack) to 27% (allergies or sinus trouble). Nearly 75% of participants reported having seen a healthcare provider in the past two years (see Table 1).

Stakeholder panel

To ensure broad-based buy-in of the content of the new PROMIS profile measure, we consulted with a key stakeholder panel of individuals representing clinical care, PROMIS developers, researchers and adopters, and patient advocates (see Supplement Table S1).

All procedures were reviewed and approved by the research team’s institutional review board (RAND Human Subjects Research Committee FWA00003425; IRB00000051) and conform to the principles in the Declaration of Helsinki.

Measures

Participant demographics

Surveys administered to the MTurk development and preference samples included questions about demographic characteristics and 22 health conditions. The preference sample was also asked how long it had been since last seeing a doctor or other health professional and their number of emergency room visits and hospital stays in the past year.

Candidate items for the short PROMIS profile

The development sample survey included 50 PROMIS items from four overlapping sources (see Tables 2 and S2) as candidates for the short PROMIS profile. The four sources include items assessing eight PROMIS domains (physical function, fatigue, sleep disturbance, pain interference, anxiety, depression, ability to participate in social roles and activities [social roles] and cognitive function—abilities [cognitive function]) and were selected based on discussions among the project team and PROMIS developers. Item sources 2 and 3 (described below) meet some of the criteria for a short PROMIS profile (brief, measure multiple domains) and thus contain attractive candidate items. However, these are custom forms and the sources have not been officially adopted and made available by PROMIS as unique stand-alone measures.

Table 2 Number of candidate items and source for PROMIS-16 by domain

Item Source #1: PROMIS-29 + 2 Profile [3]. Four items each to assess domains of physical function, fatigue, sleep disturbance, pain interference, anxiety, depression, and social roles, and two items to assess cognitive function (30 items total, 17 unique to this source; as it is not scored with any of the eight target domains, the single pain intensity item was not a candidate for the profile composition).

Item Source #2: PROPr initial valuation items (PROPr-14) [11]. Two items each to assess domains of physical function, fatigue, sleep disturbance, pain interference, depression, social roles, and cognitive function (14 items total, 10 unique to this source).

Item Source #3: University of Pittsburgh Medical Center (UPMC) (UPMC16) [12]. Two PROMIS items each used in routine clinical data collection in specialty ambulatory care clinics at UPMC to assess domains of physical function, fatigue, sleep disturbance, pain interference, anxiety, depression, social roles, and cognitive function selected based on their strong psychometric properties and perceived clinical relevance (16 items total, 6 unique to this source).

Item Source #4: PROMIS items having high ‘signal’ and/or being likely to be administered in the PROMIS Computer-Adaptive Testing (CAT) algorithm (SIGNAL). One item each assessing fatigue, pain interference, depression, social roles, and cognitive function; two items assessing sleep (one sleep disturbance, one sleep-related impairment) (7 items total, 2 unique to this source).

PROMIS HRQOL domain scores were generated for all possible item pairs within the eight domains using established parameters from the PROMIS item banks (parameters for the sleep-related impairment item were generated based on calibration to the sleep disturbance items) and converted to the T-score metric (M = 50, SD = 10) per PROMIS convention. All domains except for sleep disturbance were centered on a general population mean of 50. The sleep disturbance domain used a combined general population and clinical sample for centering the T-score metric. Throughout the results section, item pairs are referred to as [domain abbreviation_item1item2] following the list in Table S2. As a gold standard, we generated the PROMIS-29 4-item domain scores and a 5-item cognitive function domain score using all the candidate cognitive function items. We use the term “gold standard” to evaluate how well the newly created short PROMIS Profile compares in psychometric properties to the longer established PROMIS-29 Profile measure.
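As a minimal illustration of the two operations described above, the following Python sketch enumerates every unordered item pair within one domain and converts a score on the standardized IRT metric (theta; mean 0, SD 1) to the T-score metric. The item numbers 1 to 5 and the function names are hypothetical placeholders: the actual candidate items are listed in Table S2, and this is not the scoring software used in the study.

```python
from itertools import combinations


def theta_to_t(theta: float) -> float:
    """Convert a theta score (mean 0, SD 1) to the T-score metric (M = 50, SD = 10)."""
    return 50.0 + 10.0 * theta


def candidate_pairs(item_ids):
    """Return every unordered pair of candidate items within a single domain."""
    return list(combinations(item_ids, 2))


# Hypothetical numbering for the five cognitive function candidate items.
cognitive_items = [1, 2, 3, 4, 5]
pairs = candidate_pairs(cognitive_items)

# Labels follow the [domain abbreviation_item1item2] convention, e.g. "CF_12".
labels = [f"CF_{first}{second}" for first, second in pairs]

print(len(pairs), labels)
print(theta_to_t(0.0), theta_to_t(-1.5))
```

A domain with n candidate items yields n(n − 1)/2 pairs, so the five cognitive function candidates give ten pairs to evaluate.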

Item pair preference questions

In addition to items assessing demographic characteristics, health conditions, and health utilization, preference sample respondents were presented with a choice between two candidate item pairs for each of six of the eight PROMIS Profile domains (sleep and fatigue items were selected without preference sample input). Respondents were asked to “read the question pairs and use the radio buttons to indicate which pair they liked the best” (see Supplement Fig. S1).

Approach

The goal of the developmental approach, conducted in three phases, was to select the best item pair to represent each domain. In the first phase, we conducted an empirical evaluation of all candidate PROMIS items and item pairs using data from the MTurk development sample (N = 5775) to identify item pairs with relatively poor performance. This included basic descriptive information and performance of domain-specific T-scores for all item pairs relative to the gold standard (correlations with the gold standard and standardized mean differences from the gold standard with Cohen’s d) [13]. We also asked the stakeholder panel to select, for each HRQOL domain, the two items that ‘taken together, best reflect the domain’ based on item content. Ten of the thirteen stakeholders contributed initial ratings. We used the results from phase 1 to rule out several candidate pairs per domain.
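The two comparisons with the gold standard can be sketched as follows. This is an illustrative example only: the T-scores are made-up values rather than study data, and the pooled-standard-deviation form of Cohen’s d is an assumption, since the exact variant used in the analysis is not specified here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cohens_d(x, y):
    """Mean difference (x minus y) divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    var_x = sum((a - mean_x) ** 2 for a in x) / (n1 - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var_x + (n2 - 1) * var_y) / (n1 + n2 - 2))
    return (mean_x - mean_y) / pooled_sd


# Made-up T-scores for five respondents (not study data).
pair_t = [42.0, 48.0, 50.0, 55.0, 61.0]   # two-item pair score
gold_t = [41.0, 47.5, 51.0, 54.0, 60.0]   # gold standard score

r = pearson_r(pair_t, gold_t)
d = cohens_d(pair_t, gold_t)
print(round(r, 3), round(d, 3))
```

A pair that tracks the gold standard closely shows a correlation near 1 and a standardized mean difference near 0.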

In phase 2, we held our first stakeholder meeting, in which we summarized the findings from phase 1, including the stakeholder preferences. We then discussed the remaining candidate pairs, considering their content and psychometric information relative to the gold standard, to agree on a reduced set of candidate pairs for further consideration. Item pair performance was presented graphically using item response theory (IRT)-based information curves [14, 15, 16]. These curves display information (presented on the y-axis) as a continuous function that varies according to the underlying domain score (presented on the x-axis). Estimates of precision (standard error and reliability) can be derived from information, and presenting multiple item pairs on a single plot effectively displays their relative performance. Higher information reflects higher reliability and lower standard error.
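The link between information and precision can be illustrated with a short Python sketch. It assumes a graded response model for each item, a theta metric with variance 1 (so that reliability is approximated by 1 − 1/information), and hypothetical item parameters chosen only for demonstration; none of these values are taken from the PROMIS item banks.

```python
import math


def grm_item_information(theta, a, thresholds):
    """Item information at theta under a graded response model.

    a: discrimination; thresholds: ordered category thresholds.
    """
    # Cumulative probabilities of responding in category k or higher.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]  # probability of category k
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            info += dp * dp / p
    return info


def reliability_from_information(info):
    """Approximate reliability, assuming theta has variance 1."""
    return 1.0 - 1.0 / info


def standard_error_t(info):
    """Standard error on the T-score metric (10 times the theta standard error)."""
    return 10.0 / math.sqrt(info)


# Hypothetical parameters for a two-item pair: (discrimination, thresholds).
pair = [(3.0, [-1.5, -0.5, 0.5, 1.5]), (2.5, [-1.0, 0.0, 1.0, 2.0])]

for theta in (-2.0, 0.0, 2.0):
    # Information for a pair is the sum of its item information functions.
    pair_info = sum(grm_item_information(theta, a, b) for a, b in pair)
    print(theta, round(pair_info, 2),
          round(reliability_from_information(pair_info), 2),
          round(standard_error_t(pair_info), 2))
```

Under this convention, information values of 10 and 5 correspond to reliabilities of 0.90 and 0.80, respectively.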

Phase 3 included a second survey of the stakeholder panel and the MTurk preference sample to solicit their preferences between the remaining candidate pairs for each of the six remaining domains. A total of nine stakeholders and 124 adult MTurk respondents provided preference ratings at this phase. We arrived at a proposed final PROMIS-16 item set, selected based on the preference ratings, and held a second stakeholder meeting to review the set’s basic descriptive statistics and obtain stakeholder approval for the final PROMIS-16 items.

Results

Phase 1

Across the eight HRQOL domains, empirical analyses revealed limited variability in the performance of the item pairs but did highlight some as performing better than others (see Table 3). In general, the T-score means and ranges showed values clustered around the population mean of 50, although anxiety and depression were slightly worse, whereas social role participation was slightly better. Correlations among items within each domain varied somewhat, with the largest ranges for the physical function and sleep disturbance domains. The average correlation among items was highest for pain interference and lowest for sleep disturbance. Item pair correlations with the gold standard were more consistent, although pairs composed of items from the PROMIS-29 + 2 were more highly correlated. A similar pattern was seen in the standardized mean differences of item pairs with the gold standard. Effect sizes for these mean differences tended to be small, although some exceeded 0.2 (small effect) within the physical function domain.

Table 3 Item pair performance summary by domain

Stakeholder preferences were quite varied for physical function, social roles, anxiety, and depression and somewhat more consistent for sleep disturbance, pain interference, cognitive function, and fatigue (see rightmost column of Supplement Table S2).

Phase 2

We considered empirical IRT information functions and stakeholder ratings from phase 1, as well as IRT item parameters (thresholds and discrimination), to exclude some item pairs and prioritize others, reducing the number of pairs in each domain for further discussion during the stakeholder meeting. Fig. 1a–h displays IRT information curves for the remaining pairs plotted together with the gold standard for each domain and reveals variable degrees of precision across the domain score continua among the remaining set of item pairs for each domain.

Fig. 1
Gold standard (GS) and item pair information curves by domains of the PROMIS-16 presented to stakeholders. a Physical function (PF); b ability to participate in social roles and activities (SOC); c anxiety (ANX); d depression (DEP); e sleep disturbance (SLP); f pain interference (PI); g cognitive function—abilities (CF); h fatigue (FTG). Numbers following the domain abbreviations in the figure legend identify the specific item pair as listed in Supplement Table S2. SOC_16 and SOC_46 (social roles domain) were not presented in the first stakeholder meeting but were added to the figure after discussion. In each domain, the selected item pair has a diamond marker. The dashed lines indicate the cut-offs for reliability, with reliability of 0.90 at the upper line; 0.80 at the middle line; and 0.71 at the bottom line

During the discussion of each HRQOL domain, stakeholders considered the relative merits of item pairs that provided reasonable precision (reliability > 0.8) [15, 16] across a wide range of the T-score continuum, rejecting some item pairs based on content preferences and others based on format. For example, in the physical function domain, stakeholders noted that the items in pair PF_26 use different formats and response options, mixing the item stems “Are you able…” and “Does your health now limit you…” and the response scales “Without any difficulty” to “Unable to do” and “Not at all” to “Cannot do.” In another example, stakeholders noted that the candidate items for social roles could be separated into two content groups, one representing more leisure or recreational roles (items 1, 4, and 5) and the other reflecting responsibility or work-related roles (items 2, 3, and 6), and recommended selecting an item pair representing these two aspects of social role participation. In this way, the stakeholders substantially narrowed the set of candidate pairs for all domains during the meeting, reaching consensus on the final item pairs for sleep disturbance and fatigue. Following the meeting, the study team synthesized the stakeholder discussion points with the empirical evidence and item parameters and narrowed the remaining pairs to two options for each of the other six domains.

Phase 3

Preference ratings from the stakeholders and MTurk preference sample respondents were remarkably consistent and the final item pair for each HRQOL domain was selected based on these ratings (see Table 4). Stakeholders had no objections to the selected item set when presented with the psychometric performance of the 16 items at the second stakeholder meeting.

Table 4 Preference ratings by stakeholders and MTurk preference sample

PROMIS-16: item content and psychometric properties of HRQOL domain scores

The final version of the PROMIS-16 Profile contains 16 items measuring eight HRQOL domains with two items per domain. The measurement precision of the two-item domain scores is displayed in Fig. 1a–h, wherein the line with a diamond marker depicts the selected item pair. The two-item sets provide acceptable precision across a moderate range of the score continuum for all domains. Although information curves for some domains fall below a reliability of 0.7 at the low or high ends of the score continua, this tendency is also evident in the 4-item scales.

The means of the domain T-scores ranged from 49.3 for fatigue to 54.8 for anxiety (see Table 5). Across the eight domains, there were moderate to strong correlations between the two items, ranging from 0.50 for cognitive function to 0.77 for anxiety and pain interference. All domains were highly correlated with the gold standard: the correlation exceeded 0.90 for all domains except sleep disturbance, which correlated at 0.80. This result is expected given that the PROMIS-16 sleep disturbance domain does not share any items with the PROMIS-29, whereas other domains have some degree of item overlap. The standardized mean differences of the final pairs from the gold standard were small, with five of the eight domains showing absolute effect sizes ≤ 0.060; absolute effect sizes for physical function (Cohen’s d = 0.11), ability to participate in social roles and activities (Cohen’s d = − 0.14), and cognitive function (Cohen’s d = − 0.16), while still considered trivial, exceeded 0.1.

Table 6 shows the intercorrelations among domain scores for the PROMIS-16 (above diagonal) and the PROMIS-29 + 2 (below diagonal). The pattern and magnitude of relationships look similar across the two sources. Table 7 contains the item content, response options, and response frequencies for the PROMIS-16 by domain. In most cases, item response frequencies are distributed across the five response options, although the more extreme response options tend to have low endorsement rates. The table format and layout reflect the suggested format for administration. A version for administration is provided as Supplement Table S3.

Because a pair of items with five response options each produces a limited number of response patterns (5 × 5 = 25 per domain), the domain scoring of the PROMIS-16 is straightforward to document. Supplement Table S4 provides a scoring look-up table for the PROMIS-16, listing T-scores by domain for each item-pair response pattern.
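To show why a look-up table is feasible, the sketch below builds one for a single hypothetical domain by scoring all 25 response patterns. It assumes a graded response model, expected a posteriori (EAP) estimation with a standard normal prior, and invented item parameters; it is not the procedure or the parameter set behind Supplement Table S4, whose values should be used for actual scoring.

```python
import math
from itertools import product


def grm_category_probs(theta, a, thresholds):
    """Category probabilities at theta under a graded response model."""
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def eap_t_score(responses, items, n_points=121):
    """EAP estimate for one response pattern, returned on the T-score metric.

    responses: category indices (0 to 4), one per item.
    items: list of (discrimination, thresholds) tuples.
    """
    grid = [-4.0 + 8.0 * i / (n_points - 1) for i in range(n_points)]
    numerator = denominator = 0.0
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for resp, (a, thresholds) in zip(responses, items):
            weight *= grm_category_probs(theta, a, thresholds)[resp]
        numerator += theta * weight
        denominator += weight
    return 50.0 + 10.0 * numerator / denominator


# Hypothetical parameters for the two items of one domain.
domain_items = [(3.0, [-1.2, -0.4, 0.4, 1.2]), (2.5, [-1.0, -0.3, 0.5, 1.3])]

# One T-score for each of the 5 x 5 response patterns.
lookup = {
    pattern: round(eap_t_score(pattern, domain_items), 1)
    for pattern in product(range(5), repeat=2)
}

for pattern in [(0, 0), (2, 2), (4, 4)]:
    print(pattern, lookup[pattern])
```

With eight domains and 25 patterns each, a complete table needs only 200 entries, which is what makes pattern-based scoring practical without IRT software.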

Table 5 Item pair performance by domain for the 16-item PROMIS-16
Table 6 Correlations of the PROMIS-16 (above diagonal) and PROMIS-29 + 2 (below diagonal) domains
Table 7 PROMIS-16 item content and response frequencies (N, %) from MTurk development sample (N = 5775)

Discussion

This paper describes the development of a short 16-item HRQOL PROMIS Profile measure, the PROMIS-16, for use in research and clinical care. Items in the PROMIS-16 were selected from among a set of 50 candidate PROMIS items through rigorous empirical evaluation and consideration of stakeholder preferences. Because the PROMIS-16 uses existing PROMIS items, it has face validity, is straightforward to interpret, has multiple accessible administration options, and, like other PROMIS scales, will be easy to relate to other widely used measures both within and outside the PROMIS library. The use of only two items for each domain also enables easy access to pattern-based IRT scoring through the T-score look-up table provided as Supplement Table S4.

As the push to implement PRO measures (PROMs) in clinical care grows, the sustainability of these efforts requires careful consideration [17]. Implementation of PRO data collection in clinical practice requires measures that are short, relevant to the patient population being treated, rigorously developed and evaluated, easy to use and interpret, minimally disruptive to the clinical workflow, and have provider and patient buy-in [18, 19]. The PROMIS-16’s strong psychometric performance and estimation of clinically actionable HRQOL domain scores will likely lead to increased adoption of PROs in clinical practice. The reduction in patient burden relative to the longer profile measures is also beneficial for use in research, especially in studies that require the measurement of multiple outcomes or in which these are not the primary outcomes but are of interest to include as covariates. However, when HRQOL domain scores are a primary study outcome, longer scales may be preferable to provide adequate precision, especially at the extreme ends of score distributions. The PROMIS-16 may also prove useful for population health measurement and monitoring. In clinical care, the PROMIS-16 represents an ideal screening tool, which can help clinicians quickly identify distinct areas of concern that may require further assessment with longer, more targeted measures, and follow-up.

Preliminary evidence presented in this paper suggests that the eight HRQOL domain T-scores generated from the PROMIS-16 have strong psychometric properties, comparable in large part to those of the PROMIS-29 + 2 across a wide range of the score continuum. However, as can be seen in Fig. 1, the 4-item domain scores from the PROMIS-29 have better performance at the extremes. Further evaluation of the measure is needed and should include a more extensive evaluation of domain scores as well as evaluation of physical and mental summary scores and the overall utility score (PROPr).

There is a strong precedent for the viability and attractiveness of an ultra-short HRQOL Profile measure. When shorter versions of PROMs are used, they can result in higher acceptance and response rates and less missing data while having minimal impact on the psychometric performance compared to the long-form version of the PROM [20]. We made considerable effort to select items that cover a wide range of relevant content with adequate precision across the largest score range possible. However, the precision/brevity trade-off is challenging to balance, and there is an impact on psychometric performance that should be considered when using these shorter PROMs.

Longer measures offer more precision, especially in the lower and upper ends of the score distribution, which is important for either discriminating among patients or examining change within an individual over time. In addition, more content from the HRQOL domain can be included with longer forms, strengthening content validity. Thus, although the two-item T-scores represent valid mean estimates, there are many situations where more precision may be needed. For example, a study focused on depression as a primary outcome should include more than two items to assess that construct. Similarly, if used as a screener, responses to the two items that reach a level of clinical concern should trigger the administration of additional items and clinician probes to determine the severity of the problem and identify appropriate next steps. These limitations are particularly salient in the measurement of physical function, where we elected to focus measurement with two items reflecting the mobility subdomain. PROMIS-16 users should be aware of the restricted content range of the physical function domain, as it may be problematic to compare clinical populations experiencing physical limitations in different body areas [21].

In sum, the PROMIS-16 was selected to optimally balance measurement length and precision trade-offs, making it possible to assess eight core domains covering a broad range of physical and mental health aspects of HRQOL with minimal burden to respondents. Its availability will increase the inclusion of domain-specific HRQOL outcomes in clinical care and clinical and health-related research.

The results of this study should be considered with several limitations in mind. First, our development work was based on data from a single online sample of experienced survey takers who were predominantly White and non-Hispanic and relatively highly educated; thus, this paper provides only preliminary evidence. However, our use of PROMIS items with established parameters mitigates this limitation considerably. Second, due to the homogeneity of empirical performance results, our selection was heavily influenced by item content, and we relied heavily on stakeholder input. Although this reliance on content preferences and stakeholder input may be seen as a limitation, it also points to the quality of the candidate PROMIS items. Further, our reliance on stakeholder input conveys their buy-in and will facilitate PROMIS-16 uptake.

This paper describes the development of the PROMIS-16 Profile, an ultra-short measure that generates eight domain-specific HRQOL scores for physical function, ability to participate in social roles and activities, anxiety, depression, sleep disturbance, pain interference, cognitive function, and fatigue with two items per domain. The inclusion of these eight domain scores in the PROMIS-16 makes it possible to generate physical and mental health summary scores following Hays et al. (2018) and the PROMIS-preference (PROPr) score, which are described in forthcoming manuscripts. The physical and mental health summary score derivation includes a sensitivity analysis to evaluate the impact of excluding the pain intensity item from the physical health summary score (because that item is not included in the PROMIS-16). Subsequent work will examine the correspondence of PROMIS-16 summary scores and global health (Global-10) summary scores and establish a crosswalk between these two sets of summary scores. Although future work remains to establish the summary and preference scores and validate the domain score findings in an independent sample, preliminary results presented here indicate that the PROMIS-16, a short, rigorous HRQOL profile measure, can be translated to domain-specific action and will be a useful tool for clinicians and researchers.