Introduction

Patient-reported outcome measures (PROMs) are questionnaires that assess the perspective of patients regarding their health. The patients’ perspectives have become increasingly important for clinical decision making, and in health research and policy making [1,2,3]. The use of PROMs enables monitoring symptoms and evaluating treatment effectiveness and can enhance communication between patients and clinicians to improve the engagement of patients in their care [4, 5].

The Patient-Reported Outcomes Measurement Information System (PROMIS®) is an initiative founded by a collaboration of eight US research institutes and the US National Institutes of Health. PROMIS aims to standardize the measurement of patient-reported outcomes by developing a standardized set of high-quality PROMs (called item banks), based on modern psychometric techniques, to assess core physical (e.g., pain, physical function), mental (e.g., depression, anxiety), and social (e.g., role functioning) domains of health [6,7,8]. PROMIS item banks can be administered using computerized-adaptive testing (CAT) or through fixed-length and custom-made short forms [9]. In addition, several PROMIS profile instruments are available containing a fixed number of items from seven PROMIS core health domains (physical function, pain interference, anxiety, depression, fatigue, sleep disturbance, and ability to participate in social roles and activities), measured on 5-point Likert scales, plus a 0–10 numeric rating item on pain intensity [10]. With 29 items, the PROMIS-29 v2.1 profile is the shortest profile. It consists of four items for each of the seven domains, equivalent to the standard 4-item short forms, plus the single pain intensity item [11]. The PROMIS-29 is broadly comparable to the Short-Form 36 Health Survey (SF-36) [12], one of the most widely used profile measures today. However, it measures slightly different domains and was developed using item response theory (IRT) [13, 14] rather than classical test theory (CTT). The PROMIS-29 is relatively short yet provides a wealth of health-related information because each domain is scored separately [11]. Moreover, Hays et al. have developed physical and mental health summary scores [15] analogous to the global physical and global mental health scores of the PROMIS Global Health Scale [16] and the physical and mental component scores of the SF-36 [17].
These bottom-line indicators can be of value [18] and allow the PROMIS-29 to be used in the same way as other, older instruments.

PROMIS item banks or their short forms have been translated into more than 60 languages, including Dutch [19]. Psychometric assessments of various Dutch item banks have been conducted [20,21,22,23,24,25], including the assessment of cross-cultural validity (absence of differential item functioning (DIF) for language), making them available for use in the Netherlands in research and clinical practice. Because PROMIS profiles combine short forms on the core domains of health [10], these profiles are particularly suitable for use in clinical trials, observational studies, and routine clinical practice. With PROMIS profiles, a broad overview of a person’s health status can be obtained, which is particularly useful for patients with multiple conditions or comorbidities impacting several health domains.

The applicability of the seven Dutch-Flemish PROMIS item banks on which the PROMIS-29 is based is supported so far by results of IRT analyses, including the absence of DIF for language [20,21,22,23,24,25,26,27]. However, there is no evidence yet for the seven-factor structure of the PROMIS-29 domains in the Netherlands, neither in the general population nor in persons with chronic diseases. It would also be important to know whether the physical and mental health summary scores and the associated factor scoring coefficients of Hays et al. [15] can be reproduced in another sample. Moreover, for most item banks included in the PROMIS-29, measurement invariance for persons with and without chronic diseases, as well as for other important sociodemographic characteristics (e.g., ethnicity, educational level), has not been assessed [28]. Therefore, the objective of this study was to investigate the structural validity of the PROMIS-29, including unidimensionality of each domain and its physical and mental health summary scores. Moreover, internal consistency, measurement invariance (no DIF for age, gender, mode of administration, educational level, ethnicity, and chronic diseases), and construct validity (hypotheses on known-groups validity and correlations between domains) were assessed for each domain of the PROMIS-29.

Methods

Participants

For this cross-sectional study, data were obtained from the Lifelines cohort study. Lifelines is a multi-disciplinary prospective population-based cohort study examining the health and health-related behaviors of 167,729 persons living in the North of the Netherlands in a unique three-generation design. It employs a broad range of investigative procedures in assessing the biomedical, sociodemographic, behavioral, physical, and psychological factors which contribute to the health and disease of the general population, with a special focus on multi-morbidity and complex genetics [29]. The study population is broadly representative of the people living in this region [30]. Detailed information about the cohort and participant selection can be found elsewhere [29, 31, 32]. Before participating in the cohort, all participants provided written informed consent. The Lifelines cohort study is approved by the medical ethics committee of the University Medical Center Groningen, the Netherlands, and is conducted in accordance with the ethical standards laid down in the Declaration of Helsinki. For the present study, adults of 18 years and older who completed the PROMIS-29 v2.1 profile were included. The PROMIS-29 was administered in Lifelines follow-up 2B during the period 2016–2020, for which 109,407 adults were invited.

Measures

Participants completed questions regarding their demographic characteristics (age, gender, educational level, and ethnicity) and the presence of chronic diseases (diabetes, cardiovascular disease, chronic obstructive pulmonary disease (COPD), high blood pressure, and other chronic diseases). Participants also completed the Dutch version of the PROMIS-29 v2.1 profile [19]. The PROMIS-29 v2.1 profile contains the standard 4-item short forms from seven PROMIS core health domains (physical function, pain interference, anxiety, depression, fatigue, sleep disturbance, and ability to participate in social roles and activities) and one separate item on pain intensity from the PROMIS Global Health scale. Each item has 5 response options, except for the pain intensity item, which has a 0–10 numeric rating scale. All items have a seven-day recall period, except for the items in the domains ‘physical function’ and ‘ability to participate in social roles and activities’, for which the recall period is not indicated [11] (PROMIS measures can be obtained through healthmeasures.net). Total scores for each domain are derived from the IRT model and expressed as T-scores with a mean of 50 and a standard deviation of 10 for the US reference population [33]. Higher T-scores indicate a higher level of the underlying construct. Because of the large sample size it was not possible to calculate T-scores by uploading item scores in the online HealthMeasures Scoring Service, provided by the US Assessment Center [34]. Therefore, T-scores were calculated by obtaining the official US item parameters used in the US Assessment Center through enquiry.
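The T-score metric described above can be illustrated with a minimal Python sketch. It assumes a graded response model and an expected a posteriori (EAP) estimate of theta with a standard normal prior; the text does not state which estimator was used, so EAP is an assumption of this sketch. The item parameters in `ITEMS` are hypothetical and are not the official US parameters obtained from the Assessment Center. Only the final rescaling, T = 50 + 10 × theta, follows directly from the stated mean of 50 and standard deviation of 10.

```python
import math

# Hypothetical graded response model parameters for a 4-item short form:
# (discrimination a, ordered thresholds b1..b4 for 5 response categories).
# These are NOT the official PROMIS item parameters.
ITEMS = [
    (3.0, [-0.5, 0.3, 1.0, 1.8]),
    (2.8, [-0.3, 0.5, 1.2, 2.0]),
    (3.2, [-0.4, 0.4, 1.1, 1.9]),
    (2.5, [-0.2, 0.6, 1.3, 2.1]),
]


def category_prob(theta, a, thresholds, response):
    """P(response = k | theta), k in 0..len(thresholds), under the GRM."""
    def p_at_least(k):
        if k <= 0:
            return 1.0
        if k > len(thresholds):
            return 0.0
        return 1.0 / (1.0 + math.exp(-a * (theta - thresholds[k - 1])))
    return p_at_least(response) - p_at_least(response + 1)


def eap_t_score(responses, items=ITEMS, lo=-4.0, hi=4.0, n_points=81):
    """EAP estimate of theta on an evenly spaced grid, returned as a T-score.

    Responses are coded 0..4, with higher codes meaning more of the construct.
    """
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for (a, b), r in zip(items, responses):
            weight *= category_prob(theta, a, b, r)
        num += theta * weight
        den += weight
    theta_hat = num / den
    return 50.0 + 10.0 * theta_hat  # mean 50, SD 10 in the reference metric
```

With these made-up parameters, a respondent choosing the lowest category on every item receives a T-score below 50 and one choosing the highest category on every item receives a T-score above 50.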

Statistical analyses

All analyses were conducted in RStudio or SPSS version 25. Descriptive statistics were used to analyze demographic and clinical characteristics of participants and the percentage of participants with the minimum or maximum score. Structural validity was investigated with confirmatory factor analyses (CFA) in the R package lavaan [35]. First, a seven-factor correlated CFA was fitted, examining the expected factor structure of the PROMIS-29 as a whole, both for the entire sample and separately for participants with and without chronic diseases. Next, the items from each domain were fitted separately to a single-factor CFA to assess the unidimensionality of each short form. This was also done for the entire sample and for participants with and without chronic diseases. Because of the ordinal response options, diagonally weighted least squares (DWLS) estimation with a mean- and variance-adjusted test statistic (weighted least squares mean and variance adjusted (WLSMV)) was used. Last, a two-factor correlated CFA with maximum likelihood estimation was fitted with domain z-scores to investigate the structural validity of the physical and mental health summary scores. As advised by Hays [15, 36], a pain composite was created by averaging z-scores for the pain intensity item and the pain interference domain to minimize local dependence. In addition, an emotional distress composite was created by averaging z-scores for the depression and anxiety domains. Similar to the model of Hays et al. [15], the factor physical health was represented by z-scores for physical function, pain (composite score), fatigue, and ability to participate in social roles and activities. The factor mental health was represented by z-scores for fatigue, pain (composite score), ability to participate in social roles and activities, emotional distress (composite score), and sleep disturbance (see also Fig. 1).
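The construction of the composites can be sketched in Python as follows. This is an illustration only: the analyses themselves were run in R and SPSS, the scores below are made up, and standardizing within the sample (rather than against a reference population) is an assumption of this sketch. The names `z_scores`, `composite`, `PHYSICAL`, and `MENTAL` are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise scores to mean 0 and SD 1 within the sample (assumption)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite(z_a, z_b):
    """Average two z-score lists respondent by respondent."""
    return [(a + b) / 2 for a, b in zip(z_a, z_b)]


# Made-up scores for five respondents (illustrative only).
pain_interference = [41.6, 55.6, 61.2, 41.6, 49.6]  # domain T-scores
pain_intensity = [0, 4, 7, 1, 3]                     # 0-10 numeric rating

# Pain composite: mean of the two z-scores, as described in the text.
pain = composite(z_scores(pain_interference), z_scores(pain_intensity))

# Indicators of each factor in the two-factor model, as listed in the text.
PHYSICAL = ["physical_function", "pain", "fatigue", "social_roles"]
MENTAL = ["fatigue", "pain", "social_roles", "emotional_distress",
          "sleep_disturbance"]
```

Note that fatigue, pain, and social roles appear under both factors, so the model allows these indicators to load on physical and mental health simultaneously.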
For all models, CFA model fit was evaluated using the following criteria [37]: Comparative Fit Index (CFI) ≥ 0.95, Tucker-Lewis Index (TLI) ≥ 0.95, root mean square error of approximation (RMSEA) ≤ 0.06, and standardized root mean square residual (SRMR) ≤ 0.08. Standardized factor loadings were compared to the loadings reported by Hays et al. [15] and Huang et al. [38]. Subsequently, factor scoring coefficients for the physical and mental health summary scores were estimated with linear regression models in which the factor scores were the dependent variable and the z-scores for each of the domains were the independent variables.
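The four fit criteria can be applied mechanically, as in the following minimal Python sketch. The cut-offs are those stated above; the function name `check_fit` and the example index values are hypothetical and do not come from the study.

```python
# Cut-offs as stated in the text: direction of the comparison and threshold.
CRITERIA = {
    "CFI": (">=", 0.95),
    "TLI": (">=", 0.95),
    "RMSEA": ("<=", 0.06),
    "SRMR": ("<=", 0.08),
}


def check_fit(indices):
    """Return, per fit index, whether the stated cut-off is met."""
    result = {}
    for name, (direction, cutoff) in CRITERIA.items():
        value = indices[name]
        result[name] = value >= cutoff if direction == ">=" else value <= cutoff
    return result


# Illustrative values only (not results from this study).
example = check_fit({"CFI": 0.97, "TLI": 0.96, "RMSEA": 0.07, "SRMR": 0.03})
```

Reporting the result per index, rather than as a single pass or fail, makes it visible when a model meets some cut-offs but not others.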

Fig. 1

Standardized CFA estimates for the physical and mental health summary scores. Numbers above the squares represent standardized factor loadings, numbers below the squares represent standardized error variances; Black: standardized factor loadings from this study, green: standardized factor loadings from the study of Hays et al. [15], red: standardized factor loadings from the study of Huang et al. [38]; Pain is the average of the pain interference domain and the pain intensity item, emotional distress is the average of anxiety and depression. (Color figure online)

To evaluate internal consistency, Cronbach’s alpha was calculated for each of the seven PROMIS-29 domains for the entire sample and for participants with and without chronic diseases. To assess measurement invariance, DIF analyses for each domain were conducted with an iterative hybrid of logistic regression and IRT with the R package lordif [39]. The likelihood-ratio χ2 test with detection criterion R2 was used to detect DIF. McFadden’s pseudo-R2 was used as a measure of DIF magnitude, with a change of 2% considered the critical threshold. DIF was assessed for age (median split: ≤ 53 years and ≥ 54 years), gender, mode of administration (digital vs. paper and pencil), educational level (high vs. medium/low), ethnicity (Dutch nationality vs. other), and chronic diseases (no vs. yes, and each of the chronic diseases vs. no chronic disease). No DIF was expected for any of these variables given the intended universal applicability of the PROMIS-29 [40]. With respect to construct validity, known-groups validity was assessed for groups that were expected to differ in score: groups differing in age (three age groups were compared), gender, and chronic diseases (yes/no) were evaluated. The expected direction and magnitude of the differences were based on previous research in other Dutch adults on the same domains [22, 25,26,27, 41]. Furthermore, Pearson correlations between each of the domains and the pain intensity item were calculated. The magnitude and direction of the expected correlations were based on previous knowledge of and experience with the measured constructs. In total, 88 a priori hypotheses were formulated (see Table 6). In line with the COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) methodology [42], the construct validity of the PROMIS-29 was considered sufficient if at least 75% of the hypotheses were confirmed.
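For reference, Cronbach’s alpha for a k-item scale is k/(k − 1) × (1 − sum of item variances / variance of the sum score). A minimal Python sketch of this formula follows; the function name `cronbach_alpha` and the response data are hypothetical and serve only to illustrate the calculation.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the raw sum score.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up responses of five persons to a 4-item short form (1-5 scale).
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 5, 4, 4],
    [5, 4, 5, 5],
]
alpha = cronbach_alpha(responses)
```

For these made-up data alpha is roughly 0.96; when all items are perfectly correlated with equal variances, the formula returns exactly 1.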

Results

A total of 63,602 respondents completed the PROMIS-29 (response rate 58%). Those who completed the PROMIS-29 had a higher mean age at baseline (47.8 vs. 42.4 years), were more often female (58.8% vs. 57.2%), more often had a low educational level at baseline (31.9% vs. 26.2%), and were more often native Dutch (94.9% vs. 94.0%) than those who did not. Table 1 presents the characteristics of the respondents. For each item, all response categories were endorsed. Missing responses on each of the items ranged from 0.2 to 1.3%. Depending on the direction of scoring of the domain, the number of respondents with the minimum or maximum raw sum score (i.e., the best score) was high, especially for physical function, depression, and pain interference (Table 2).

Table 1 Sociodemographic characteristics of participants
Table 2 PROMIS-29 mean T-scores per domain, and the percentage participants having the maximum and minimum score, for the complete sample and samples with and without chronic diseases

Satisfactory CFA model fit was found for the entire PROMIS-29, confirming its seven-factor structure both for the complete sample and for the samples with and without chronic diseases (Table 3). Each single-factor CFA for each domain separately also had acceptable model fit in all three samples, although the cut-off for RMSEA was not met for all domains. The measurement model thus seems to make conceptual sense for the assessment of the domains and the items included in the domains [43]. Factor loadings for the seven-factor model and each single-factor model can be found in Supplementary file 1.

Table 3 CFA model fit for the entire PROMIS-29 and all domains tested separately, and Cronbach’s alpha

Figure 1 shows the standardized estimates from the CFA of the physical and mental health summary scores with domain z-scores for the total population. Standardized factor loadings were similar to those found by Hays et al. [15] and Huang et al. [38], although the correlation between the two factors was notably lower (r = 0.40 in this study vs. r = 0.69 and r = 0.59 in the studies of Hays et al. [15] and Huang et al. [38], respectively). Model fit met the criteria for CFI and SRMR, whereas TLI fell marginally below its cut-off and RMSEA exceeded its cut-off: CFI = 0.982, TLI = 0.947, RMSEA = 0.080, SRMR = 0.025. Table 4 shows the scoring coefficients used to calculate the physical and mental health summary scores.

Table 4 Scoring coefficients for the physical and mental health summary scores from the CFA model (scoring coefficients found by Hays et al. [15] in parentheses)

The estimated physical and mental health summary scores are presented in Table 5, calculated with the scoring coefficients presented in Table 4 and with the scoring coefficients developed by Hays et al. [15]. On a population level, physical and mental health summary scores based on the Dutch scoring coefficients were approximately one T-score point higher than physical and mental health summary scores based on the US scoring coefficients. However, on an individual level, absolute differences between the two scoring approaches reached up to eight points for the mental health summary score and even 20 points for the physical health summary score.

Table 5 PROMIS physical and mental health summary T-scores based on Dutch and US scoring coefficients

Cronbach’s alpha for each of the seven PROMIS-29 domains ranged from 0.75 to 0.96 in the complete sample (Table 3), showing that the domains do not include items beyond their concept [43]. Cronbach’s alpha for each domain was higher in the sample with chronic diseases compared to the sample without chronic diseases.

No DIF for age, gender, mode of administration, educational level, ethnicity, or presence of chronic diseases was detected for any of the domains (McFadden’s pseudo-R2 all < 0.02; Supplementary file 2). Nor was DIF detected in each of the chronic diseases compared to no chronic disease for any of the domains (McFadden’s pseudo-R2 all < 0.02; Supplementary file 3). Differences in demographic backgrounds, thus, do not lead to substantially different interpretations of the items in each of the domains, nor do different modes of administration lead to substantially different scores. Also, the scoring rule does not create bias with respect to one group of patients versus another [43].
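The magnitude criterion can be sketched in Python as follows. McFadden’s pseudo-R2 is 1 minus the ratio of the model log-likelihood to the null-model log-likelihood, and DIF magnitude is taken here as the change in pseudo-R2 between a model with the trait only and a model that also includes the group variable and its interaction with the trait. The function names and the log-likelihood values are hypothetical; in the study these quantities were produced by lordif in R.

```python
def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo-R2: 1 - LL(model) / LL(null)."""
    return 1.0 - loglik_model / loglik_null


def dif_magnitude(ll_null, ll_trait_only, ll_with_group, threshold=0.02):
    """Change in pseudo-R2 when group terms are added, and whether it is flagged."""
    delta = mcfadden_r2(ll_with_group, ll_null) - mcfadden_r2(ll_trait_only, ll_null)
    return delta, delta >= threshold


# Illustrative log-likelihoods only (not values from this study).
delta, flagged = dif_magnitude(
    ll_null=-50000.0,        # intercept-only model
    ll_trait_only=-30000.0,  # item response ~ trait
    ll_with_group=-29900.0,  # item response ~ trait + group + trait x group
)
```

In this made-up case the change is 0.002, well below the 0.02 threshold, so the item would not be flagged. With very large samples the likelihood-ratio test is highly sensitive, which is presumably why a magnitude criterion is applied alongside it.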

Of the predefined hypotheses, 78% could be confirmed (64–100% per subscale) (Table 6). The hypotheses that were not confirmed mostly concerned the expected one-point difference between adjacent age groups in the first hypotheses. The domain sleep disturbance had the fewest confirmed hypotheses (64%). The large number of confirmed hypotheses shows that scores from most domains correspond to how persons actually feel or function in their daily lives, and that the scores are sensitive enough to reflect differences in the domains between persons [43]. The T-scores of the groups can be found in Supplementary file 4, whereas the Pearson correlations among PROMIS-29 domains and the pain intensity item are presented in Supplementary file 5.
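The 75% rule translates into a simple count. A minimal Python sketch, using integer arithmetic to avoid rounding issues, is given below; the function names are hypothetical, and only the total of 88 hypotheses and the 75% threshold come from the text.

```python
def min_confirmed(n_hypotheses):
    """Smallest number of confirmed hypotheses that reaches 75% (ceiling of 3n/4)."""
    return -(-3 * n_hypotheses // 4)


def is_sufficient(n_confirmed, n_hypotheses):
    """True if at least 75% of the hypotheses were confirmed."""
    return 4 * n_confirmed >= 3 * n_hypotheses


needed = min_confirmed(88)  # hypotheses that must be confirmed out of 88
```

For the 88 hypotheses in this study, at least 66 need to be confirmed to reach the threshold.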

Table 6 Confirmation of a priori hypotheses regarding the expected differences between groups and the correlation between domains of the PROMIS-29

Discussion

This study assessed several important measurement properties of the Dutch PROMIS-29 in a large cohort. We found sufficient evidence for structural validity, internal consistency, and measurement invariance in samples both with and without chronic diseases, whereas requirements for sufficient evidence for construct validity were (almost) met for most subscales. Therefore, the PROMIS-29 is considered a valid instrument to measure physical, mental, and social aspects of self-reported health in adults with and without chronic diseases for use in research and routine clinical practice.

We found a high proportion of participants obtaining the minimum or maximum score (i.e., the best score, depending on the direction of the domain) for most domains, in accordance with findings from previous studies in general population samples [44, 45]. In particular, over 50% of the population obtained the best score in the domains physical function, depression, and pain interference. Only the domain sleep disturbance seems to be an exception, with only a few participants obtaining the minimum score, which is also consistent with other studies [44, 45]. The number of participants with a minimum or maximum score was lower in the sample with chronic diseases. However, even within the sample with chronic diseases, more than 50% of participants had the maximum score for the domain physical function and the minimum score for the domain depression. This latter result was also found in a study with patients with rheumatic diseases [46]. Thus, there seems to be some mistargeting of the short-form items included in the PROMIS-29, even though these items were selected from the item banks following a mix of qualitative expert input and quantitative criteria [10]. Indeed, if we look at the item parameters (obtained from the US Assessment Center in order to calculate T-scores), the item parameters for physical function and ability to participate in social roles and activities are all on the lower side of the theta scale. This means that these short forms are more targeted towards persons with low levels of these constructs. For fatigue and sleep disturbance, the item parameters seem to be more evenly distributed over the theta scale, which possibly also explains the smaller proportion of extreme scores found on these scales. For pain interference, depression, and anxiety, the item parameters are on the higher side of the theta scale, and thus these short forms are more targeted towards persons with high levels of these constructs.
The use of CATs has been shown to result in a lower proportion of participants obtaining the minimum or maximum score, and CAT scores are accurate over a wider range of the measured construct while only a small number of items is administered [47]. Therefore, to obtain accurate scores that discriminate sufficiently between persons, administration of a CAT might be preferred over these 4-item short forms, both in persons with and without chronic diseases.

The seven-factor structure of the PROMIS-29 could be confirmed for the Dutch population, and model fit was acceptable for both the entire population and the samples with and without chronic diseases. Unidimensionality for each of the PROMIS domains was also demonstrated. To a certain extent, we were able to reproduce the correlated factor structure for the physical and mental health summary scores. Applying the same model as Hays et al. [15] is in line with the PROMIS convention to use the same factor structure for the same measures across the world, unless evidence is provided that this is not acceptable. Since the model fitted quite well and alternative models showed less adequate fit (data not shown), we decided to adhere to this factor structure, which contributes to the general applicability of the scoring system for PROMIS instruments. Although standardized factor loadings were comparable to those found in previous studies [15, 38], the correlation between the physical and mental component was considerably lower. An explanation for this might be that the samples in previous studies were less healthy. The sample of Hays et al. reported about half a standard deviation worse health compared to the general population [15, 48], whereas the sample of Huang et al. consisted of older adults with chronic conditions [38]. Less healthy populations usually show more variation in their responses, resulting in higher correlations. The impact of using the Dutch scoring coefficients versus the US scoring coefficients was small on a population level. Because our sample is broadly representative of the people living in the Northern part of the Netherlands and is over 20 times larger than the (less healthy) sample from the study of Hays et al. [15, 48], we think our scoring coefficients might be closer to the true values than the scoring coefficients presented by Hays et al. [15].
Therefore, we recommend using the Dutch scoring coefficients to calculate physical and mental health summary scores for the Dutch population and possibly also for other populations. However, more research is needed to better evaluate this scoring system and replicate the findings, preferably in large (n > 50,000) samples like ours.

Cronbach’s alpha values were all around 0.9 or higher, except for sleep disturbance (alpha = 0.75), thereby showing sufficient internal consistency. These results are in accordance with other studies that have also found high Cronbach’s alpha values for PROMIS profile domains [15, 38, 44, 46, 49], with the study of Hays et al. also finding a lower Cronbach’s alpha for sleep disturbance [15].

We assessed DIF for important sociodemographic and clinical characteristics, as DIF for language has already been investigated for most full item banks [22,23,24,25,26]. No DIF for age, gender, mode of administration, educational level, ethnicity, or the presence of chronic diseases was detected for any of the domains, nor for any of the chronic diseases separately compared to no chronic disease. The absence of DIF for chronic diseases is of particular importance because the PROMIS-29 is suitable for use in, for example, research or routine clinical practice in which persons with chronic diseases are over-represented.

Of our a priori defined hypotheses, 78% could be confirmed, thereby meeting the 75% required for sufficient construct validity according to the COSMIN criteria for good measurement properties [42]. For most domains, this criterion was also (almost) met. Although we based our hypotheses on analyses with other Dutch datasets [22, 25,26,27, 41] and previous experiences, one should note that a one-point difference, as used in some hypotheses, might not (always) be meaningful. It is not yet clear what a minimal important difference in scores between groups is for PROMIS measures, but most studies suggest a within-person change of at least three points to be meaningful [50,51,52,53,54]. However, expecting larger differences between, e.g., age groups would not have been realistic. Another way to formulate hypotheses in future studies is to state that differences smaller than, e.g., 2 points are expected between certain groups. Such hypotheses might be especially useful when small, non-meaningful differences are to be expected. Even though the magnitude of the differences between groups was sometimes smaller than expected, especially the differences between adjacent age groups, the direction of the differences was mostly in accordance with expectations. Altogether, we think our results add to the evidence for sufficient construct validity of the PROMIS-29 domains [15, 46, 49, 55].

A strength of this study is the very large sample size, enabling us to perform the analyses for subgroups with and without chronic diseases and to investigate DIF for important sociodemographic and clinical characteristics. A limitation of our study is the limited representativeness of the Lifelines cohort, in which males, younger persons, and persons with an immigration background are underrepresented compared with the general Dutch population. Furthermore, in our sample, 62% reported not having a chronic condition, whereas according to registries in 2019, 43% of the Dutch population had no chronic condition [56]. Thus, our sample was not representative of the Dutch population, and therefore, reported T-scores should not be interpreted as reference values for the Dutch population. Papers regarding reference values for the Dutch population on the domains included in the PROMIS-29 have recently been or will soon be published [25, 26, 41]. Finally, formulating challenging hypotheses that include both the direction and the magnitude of the difference or relationship is difficult. We based our hypotheses on findings of previous research to show that the PROMIS-29 functions in our population as expected.

Conclusion

This study provides evidence for sufficient structural validity, internal consistency, and measurement invariance of the PROMIS-29 profile in the Dutch population. Requirements for evidence for construct validity were (almost) met for most subscales, adding to the evidence for sufficient construct validity. That these measurement properties were sufficient in samples both with and without chronic diseases is important because the PROMIS-29 can be used in, for example, research or routine clinical practice, in which persons with chronic diseases are usually over-represented. The large proportion of participants obtaining the best score on the PROMIS-29 might hamper the ability to discriminate between persons. Therefore, administration of a CAT might be preferred. Future studies should also investigate the test–retest reliability, measurement error, and responsiveness of the PROMIS-29.