Introduction

In recent years, clinicians, health service providers, researchers, the pharmaceutical industry, reimbursement agencies and health policymakers have increasingly recognized the importance of measuring health-related quality of life (HRQoL) [1, 2]. Some HRQoL instruments are referred to as ‘generic measures’: they describe health in a general way, allowing the assessment of HRQoL and changes in HRQoL across a range of disease areas and populations, including members of the general public and patient groups. Such measures include the 36-Item Short Form Survey (SF-36), EQ-5D and Assessment of Quality of Life (AQoL) [3, 4]. More recently, the Patient-Reported Outcomes Measurement Information System (PROMIS) adult generic profiles (PROMIS-57, -43 and -29) [5] have been developed. They represent a new generation of such measures that rely on item response theory (IRT) calibrated item banks, thereby using a different approach than conventional measures [6].

The PROMIS initiative has so far developed item banks for over 100 key HRQoL domains, such as physical (e.g., pain, physical function, itch, sleep), mental (e.g., anxiety, depression) and social health (e.g., ability to participate in social roles and activities) [7]. Item banks enable computerized adaptive testing (CAT) tools for individual assessment of HRQoL. A major advantage of the three PROMIS generic profile measures is that they are able to produce comparable results to the complete item banks [5]. Although originating from the US, the item banks and the profile measures have been translated to several languages and have increasingly been used in European and Asian countries [8,9,10,11,12]. As standardised HRQoL measures are required to maintain their psychometric performance in different languages, the robustness of measurement properties needs to be confirmed for all language versions.

Among the three PROMIS adult profile measures, PROMIS-29 is the most widely used as a standalone, concise HRQoL measure [13]. By extending it with two items of cognitive function (PROMIS-29+2), it allows the estimation of quality-adjusted life years (QALYs) to assess benefits of treatments in economic analyses [14]. Psychometric performance of PROMIS-29, including validity, reliability and responsiveness, has already been tested in a broad range of health conditions and populations, such as cancer [15, 16], inflammatory bowel diseases [17], chronic kidney disease [18], burn [19], haemophilia [20], musculoskeletal diseases [21,22,23], systemic lupus erythematosus [24], aortic dissection [25], elderly with multiple chronic conditions [26] and general population [27,28,29,30]. Moreover, PROMIS-29 population reference values have also been established in many countries [28, 29] supporting the interpretation of scores by evaluating the relative burden of health conditions compared with reference values. The psychometric performance of the Hungarian PROMIS profile measures has not yet been tested and no reference scores are available for Hungary. This study therefore aims to (1) assess psychometric properties of the Hungarian PROMIS-29+2 profile measure and (2) provide general population reference values from a large representative sample in Hungary.

Methods

Study design and data collection

The study was approved by the Research Ethics Committee of the Corvinus University of Budapest (No. KRH/343/2020). The validation of PROMIS-29+2 formed part of a larger survey on health and well-being of the Hungarian general population [31, 32]. In November 2020, a web-based cross-sectional survey was undertaken in Hungary. We engaged a survey research company to conduct the data collection among members of an online panel. By contract, the company provided access to the responses of those respondents who had fully completed the questionnaire; access to partially completed questionnaires was not included in the contract. The survey company compensated respondents with survey points redeemable for rewards. We set ‘soft’ target quotas for age, gender, education, type of settlement and region to achieve a sample that approximates the composition of the Hungarian adult general population. Inclusion criteria were being aged ≥ 18 years and providing informed consent prior to starting the survey.

Respondents completed the official Hungarian-language version of PROMIS-29+2 v2.1 [33] as distributed by the PROMIS Health Organization. Other data collected included sociodemographic questions (age, gender, education, employment, marital status, income, household size, type of settlement, region), history of chronic health conditions and the 36-item Short Form Health Survey (SF-36v1). The order of the two instruments was fixed: respondents first completed the PROMIS-29+2, followed by the SF-36. There were no missing values in the data, as responding to all questions was mandatory in the online survey.

PROMIS-29+2

Our survey included PROMIS-29+2 v2.1 [33], which consists of PROMIS-29 and two items from Cognitive Function-Abilities v2.0 [34]. The PROMIS-29 profile comprises 29 items covering seven HRQoL domains [physical function, anxiety, depression, fatigue, sleep disturbance, ability to participate in social roles and activities (hereafter social roles) and pain interference] and an 11-point pain intensity numeric rating scale [5]. The Cognitive Function-Abilities items measure an eighth domain, cognitive function. Each PROMIS-29 domain has four five-level items. The format of the five-point response scale varies across items: difficulty (i.e., ‘without any difficulty’ to ‘unable to do’), frequency (‘never’ to ‘always’), severity (‘not at all’ to ‘very much’) and global rating (‘very poor’ to ‘very good’). The recall period is unspecified for physical function and social roles; all other domains refer to the past seven days. A total raw score ranging from 4 to 20 (2–10 for cognitive function) may be computed for each domain by adding up the responses to the items of the domain. The US item calibrations were used to derive T-scores from raw domain scores, where a mean T-score of 50 with an SD of 10 represents the US general population [7]. The only exception is the sleep disturbance domain, where a mixed general population and clinical sample with above-average sleep disturbance was used for the calibration of T-scores [35]. For scales of function (i.e., physical function, social roles and cognitive function), a higher score corresponds to better HRQoL, and for symptoms (i.e., anxiety, depression, fatigue, sleep disturbance and pain interference), a higher score corresponds to worse HRQoL [36].
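The raw-score step described above can be illustrated with a short Python sketch. It is not part of the study's analysis code; the function name and the assumption that items are already coded 1 to 5 are ours. The conversion of raw scores to T-scores requires the official PROMIS scoring tables and is deliberately not reproduced here.

```python
def raw_domain_score(item_responses, n_items=4, low=1, high=5):
    """Sum item responses into a raw domain score.

    Assumes (hypothetically) that every response is already coded
    on the low..high scale used for scoring.
    """
    if len(item_responses) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(item_responses)}")
    for response in item_responses:
        if not low <= response <= high:
            raise ValueError(f"response {response} outside {low}..{high}")
    return sum(item_responses)


# Four-item PROMIS-29 domain: raw scores span 4 to 20.
print(raw_domain_score([1, 1, 1, 1]), raw_domain_score([5, 5, 5, 5]))
# Two-item cognitive function domain: raw scores span 2 to 10.
print(raw_domain_score([1, 1], n_items=2), raw_domain_score([5, 5], n_items=2))
```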

36-item short form survey (SF-36)

SF-36 is one of the most extensively used and validated generic HRQoL instruments [37]. It assesses respondents’ HRQoL in 36 items covering eight domains with a four-week recall period: physical functioning (ten items), role limitations due to physical health problems (four items), bodily pain (two items), general health (five items), vitality (four items), social functioning (two items), role limitations due to emotional problems (three items) and mental health (five items). One item (2nd), which asks about health change, is not included in the scale or summary scores. Scores for items on each of the eight scales are summed up to give scale scores that are linearly transformed onto a 0–100 scale. Note that scores are not comparable across domains.
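The linear transformation onto the 0–100 scale can be sketched as follows. This is an illustration under our own assumptions rather than the scoring code used in the study: the function name is hypothetical, and the example assumes a ten-item scale with three response levels per item, so that raw scores range from 10 to 30.

```python
def transform_0_100(raw_score, lowest_possible, highest_possible):
    """Linearly rescale a raw scale score so that the lowest possible
    raw score maps to 0 and the highest possible raw score maps to 100."""
    possible_range = highest_possible - lowest_possible
    return (raw_score - lowest_possible) / possible_range * 100.0


# Assumed example: ten items coded 1..3, so raw scores span 10..30.
for raw in (10, 20, 30):
    print(raw, transform_0_100(raw, lowest_possible=10, highest_possible=30))
```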

Psychometric analyses

Data analysis was carried out with R version 4.1.1 (Vienna, Austria). We followed classical test theory and IRT methods previously used in testing psychometric properties of PROMIS item banks and profile measures [6, 20, 21, 27, 38, 39]. For the analyses, we considered PROMIS-29 as the core measure and we tested measurement properties of the additional cognitive function domain separately, wherever possible. Psychometric analyses were performed on the unweighted sample; however, for estimating population reference values, the sample was weighted for age group and gender. All the statistical tests were two-sided, and p < 0.05 was considered statistically significant.

Floor and ceiling effect

Floor (proportion of responses at the lowest score) and ceiling (proportion of responses at the highest score) effects were computed for the eight PROMIS-29+2 domains. If > 15% of respondents had the lowest or the highest score, we considered a floor or ceiling effect, respectively, to be present [40, 41].
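A minimal sketch of this rule is given below, assuming domain raw scores are held in a plain list; the function name and the example data are hypothetical and serve only to illustrate the 15% threshold.

```python
def floor_ceiling(scores, min_score, max_score, threshold_pct=15.0):
    """Return the percentage of respondents at the lowest and highest
    possible score, and whether each exceeds the threshold."""
    n = len(scores)
    floor_pct = 100.0 * sum(s == min_score for s in scores) / n
    ceiling_pct = 100.0 * sum(s == max_score for s in scores) / n
    return {
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        "floor_effect": floor_pct > threshold_pct,
        "ceiling_effect": ceiling_pct > threshold_pct,
    }


# Hypothetical raw scores for a four-item domain (possible range 4..20).
example_scores = [4] * 1 + [12] * 5 + [20] * 4
print(floor_ceiling(example_scores, min_score=4, max_score=20))
```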

Reliability analyses

Internal consistency reliability was assessed by computing Cronbach’s alpha and McDonald’s omega (total) for each domain (‘psych’ package [42]). Values of Cronbach’s alpha > 0.70 and McDonald’s omega total > 0.90 were considered signs of adequate internal consistency [43].
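For readers unfamiliar with the statistic, Cronbach’s alpha for a domain with k items is k/(k − 1) × (1 − sum of item variances / variance of the total score). The sketch below implements this textbook formula in plain Python; it is not the ‘psych’ implementation used in the study, and the response data are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha from a list of respondents, each a list of item scores.

    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(responses[0])  # number of items
    item_columns = list(zip(*responses))
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Invented data: three respondents, two items.
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3]]), 3))
```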

Item response theory assumptions

In accordance with previous PROMIS validation studies [6, 27, 30], the seven domains of PROMIS-29 were separately analysed with graded response models (GRM). Before modelling, the following three statistical assumptions were tested: unidimensionality, local independence and monotonicity. Unidimensionality was assessed using an exploratory bifactor model (‘psych’ package [42]) that allowed us to extract explained common variance (ECV) and McDonald’s omega (hierarchical) values. The following cut-off values were used: ECV > 0.60 and omega hierarchical > 0.70 [44]. The IRT-based standardized Chen and Thissen’s index (χ2) was used to detect local dependence (‘mirt’ package [45]). A χ2 of > 0.3 implied possible local dependence and > 1 definite local dependence [46]. Any violations of local independence were considered negligible if the ECV was ≥ 0.90 [46,47,48,49]. Monotonicity was tested by examining the graphs of item mean scores conditional on the total raw scale score minus the item score [6].
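To make the ECV criterion concrete: in a bifactor solution, ECV is the share of common variance attributable to the general factor, i.e., the sum of squared general-factor loadings divided by the sum of squared loadings on the general and all group factors. The following sketch applies this standard definition to invented loadings; the function name and the numbers are assumptions for illustration, not estimates from this study.

```python
def explained_common_variance(general_loadings, group_loadings):
    """ECV = general-factor common variance / total common variance.

    general_loadings: loadings of each item on the general factor.
    group_loadings:   loadings of each item on its group factor(s),
                      given as one flat list.
    """
    general_var = sum(l ** 2 for l in general_loadings)
    group_var = sum(l ** 2 for l in group_loadings)
    return general_var / (general_var + group_var)


# Invented loadings for a four-item domain.
ecv = explained_common_variance([0.8, 0.8, 0.8, 0.8], [0.4, 0.4, 0.4, 0.4])
print(round(ecv, 2), "meets ECV > 0.60:", ecv > 0.60)
```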

Item response theory analyses

After confirming the IRT assumptions, we fitted a GRM (‘mirt’ package [45]). We examined each item’s discrimination (i.e., item slope, a) and item thresholds (i.e., item difficulty, b). Model fit was assessed by root mean square error of approximation (RMSEA), Standardized Root Mean Square Residual (SRMR), Comparative Fit Index (CFI) and Tucker–Lewis Index (TLI), and was considered acceptable if CFI > 0.95, TLI > 0.95, RMSEA < 0.06 and SRMR < 0.08 [50]. Item fit was assessed by computing the differences between observed and expected responses under the GRM using S-χ2 statistic, where a p-value < 0.001 was considered indicative of item misfit [51]. Item characteristic curves (ICCs) were generated using GRM.
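The roles of the discrimination (a) and threshold (b) parameters can be illustrated with the standard GRM formulation, in which the probability of responding in category k or higher is a logistic function of a(θ − b_k), and category probabilities are differences between adjacent cumulative probabilities. The sketch below is a generic illustration in Python, not the ‘mirt’ estimation routine; the parameter values are invented.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category response probabilities under the graded response model.

    theta:      latent trait level of the respondent
    a:          item discrimination (slope)
    thresholds: ordered item thresholds b_1 < ... < b_(K-1)
    Returns a list of K probabilities, one per response category.
    """
    # Cumulative probabilities of responding in category k or higher,
    # padded with 1 (lowest category or higher) and 0 (above the highest).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Invented parameters for a five-category item.
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-1.0, 0.0, 1.0, 2.0])
print([round(p, 3) for p in probs])
```

Plotting these probabilities over a range of θ values yields the item characteristic curves; a higher a produces steeper, more peaked curves.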

Differential item functioning

To assess differential item functioning (DIF), a series of ordinal logistic regressions was fitted (‘lordif’ package [52]). In the first step, we performed an ordinal logistic regression without any anchor. The χ2 criterion was used to identify items with potential DIF. Once DIF was detected, we moved to the second step, where items within a domain that did not show any DIF were used as purified anchors. In this second step, three ordinal logistic regression models were estimated to compare the overall, uniform and non-uniform DIF for each item. Uniform DIF occurs when there is a constant systematic difference in item response between subgroups of respondents across the entire continuum of the latent trait, whereas non-uniform DIF occurs when the differences between groups vary across the continuum of the latent trait. Uniform, non-uniform and overall DIF were examined by comparing model 1 vs. model 2, model 2 vs. model 3 and model 1 vs. model 3, respectively. Items were flagged for DIF when the change in McFadden’s pseudo R2 was > 0.02 [33]. Test characteristic curves were used to visualize the aggregate impact of DIF on domain scores (i.e., differential test functioning). DIF was evaluated for age (median split at 47 years), gender (male vs. female), education (primary, secondary, university/college), employment (employed, retired, other), place of residence (capital, other town, village), geographical region (Central Hungary, Transdanubia, Great Plain and North), marital status (married or domestic partnership vs. any other) and household net monthly income per person (under or over the median of HUF 126,924 and do not know/want to answer).
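The flagging rule can be sketched as follows. McFadden’s pseudo R2 of a model is 1 − LL(model)/LL(null), and DIF is flagged from the change in this quantity between nested models. In the sketch we assume the usual nesting of such models (model 1: trait only; model 2: trait plus group; model 3: trait, group and their interaction); this nesting, the function names and the log-likelihood values are assumptions for illustration and are not output from ‘lordif’.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R2 from model and null log-likelihoods."""
    return 1.0 - ll_model / ll_null


def flag_dif(ll_null, ll_m1, ll_m2, ll_m3, threshold=0.02):
    """Flag uniform, non-uniform and overall DIF from pseudo R2 changes."""
    r2 = [mcfadden_r2(ll, ll_null) for ll in (ll_m1, ll_m2, ll_m3)]
    changes = {
        "uniform": r2[1] - r2[0],      # model 1 vs. model 2
        "non_uniform": r2[2] - r2[1],  # model 2 vs. model 3
        "overall": r2[2] - r2[0],      # model 1 vs. model 3
    }
    return {name: (delta, delta > threshold) for name, delta in changes.items()}


# Invented log-likelihoods for a single item.
for name, (delta, flagged) in flag_dif(-1000.0, -800.0, -770.0, -765.0).items():
    print(f"{name}: change = {delta:.3f}, flagged = {flagged}")
```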

Convergent validity

Convergent validity of PROMIS-29+2 was assessed against the SF-36v1 questionnaire. We used Spearman’s rank-order correlations to test the association between domains and summary scores of the two measures. Correlation coefficients were interpreted as very weak (< 0.20), weak (0.20–0.39), moderate (0.40–0.59), strong (0.60–0.79) and very strong (≥ 0.80) [53]. We hypothesized at least strong correlations between domains covering a similar construct (e.g., PROMIS physical function and SF-36 physical functioning). Weak or no correlations were assumed between the PROMIS cognitive function and SF-36 domains as this area of HRQoL is missing from the SF-36.
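The interpretation bands above translate directly into a small helper, shown here as an illustrative Python sketch (the function name is ours). The sign of the coefficient is ignored, since negative correlations are expected between function and symptom scales.

```python
def interpret_correlation(r):
    """Label the strength of a correlation coefficient using the
    bands stated in the text, based on its absolute value."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


for r in (0.15, -0.35, 0.55, -0.75, 0.85):
    print(r, interpret_correlation(r))
```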

Population reference values and cross-country comparisons

In estimating population reference values, the sample was weighted for age group and gender to account for small deviations from the reference population in Hungary [54]. To accommodate the effect of weighting on variances, Taylor linearization was used to calculate appropriate standard errors. Mean (SD) dimension and summary T-scores and their 95%CIs were computed by gender and age groups (18–24, 25–34, 35–44, 45–54, 55–64 and 65 + years). Bivariate ordinary least squares regressions were used to test the association between domain T-scores and pain intensity scores with age groups and gender. Weighted domain T-scores were compared to those of the general population in the US, the UK, Germany and France [28].
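As a simplified illustration of this approach, the sketch below computes a weighted mean and its Taylor-linearized standard error. It assumes a single-stage design without strata or clusters and a with-replacement approximation, which may differ from the settings of the survey software used in the study; the function name, the normal-approximation confidence interval and the example values are ours.

```python
import math


def weighted_mean_with_se(values, weights):
    """Weighted mean with a Taylor-linearized standard error.

    Assumes independent observations (no strata or clusters).
    Returns (mean, standard error, approximate 95% CI).
    """
    n = len(values)
    total_weight = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / total_weight
    # Linearized contributions of each observation to the ratio estimator.
    z = [w * (y - mean) / total_weight for w, y in zip(weights, values)]
    variance = n / (n - 1) * sum(zi ** 2 for zi in z)
    se = math.sqrt(variance)
    return mean, se, (mean - 1.96 * se, mean + 1.96 * se)


# Invented T-scores and post-stratification weights.
mean, se, ci = weighted_mean_with_se([40.0, 50.0, 60.0], [1.0, 1.0, 2.0])
print(round(mean, 2), round(se, 2), tuple(round(x, 2) for x in ci))
```

With equal weights, the estimator reduces to the ordinary sample mean and its usual standard error, which is a convenient sanity check.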

Results

Characteristics of the sample

Overall, 2502 online panel members initiated the survey. Of these, 2079 consented and 379 dropped out during the questionnaire. A total of 1700 respondents finished the survey. The median completion time of PROMIS-29+2 was 2 min 59 s (Q1: 2 min 9 s, Q3: 4 min 8 s). Table 1 shows the sociodemographic and health-related characteristics of the respondents in comparison to the general population in Hungary. The sample was generally representative of the Hungarian general population for age, gender, employment and marital status, type of settlement and geographical region. Secondary educated respondents were underrepresented in the sample. Overall, 47.4% had a self-reported, physician diagnosed health condition. Descriptive statistics of PROMIS-29+2 and SF-36 domain scores are presented in Table 2.

Table 1 Characteristics of the study population (n = 1700)
Table 2 Descriptive statistics of the outcome measures

Floor and ceiling effect

Among the eight PROMIS-29+2 domains, the highest floor effects were observed for pain interference (50.5%), followed by depression (44.1%), anxiety (35.4%) and fatigue (25.2%) (Table 2). Floors of the physical function, social roles, sleep and cognitive function domains were well below the threshold (0.3–6.2%). High ceiling effect was observed for physical function (60.7%), social roles (39.1%) and cognitive function (36.5%), while there were no apparent ceiling effects for the other domains (0.4–1.3%).

Reliability

Cronbach’s alpha and McDonald’s omega total values exceeded the thresholds of 0.70 and 0.90 for all PROMIS-29 domains with the exception of McDonald’s omega total (0.87) for the sleep disturbance domain (Table 3).

Table 3 Unidimensionality, IRT model fit and reliability estimates for the domains of the Hungarian PROMIS-29

IRT assumptions

Using bifactor models, the unidimensionality assumption was confirmed for all PROMIS-29 domains. For sleep disturbance, the ECV criterion was met (0.68); however, McDonald’s omega hierarchical was exactly at the threshold (0.70) (Table 3). Chen and Thissen’s local dependence indices were below 1 for nearly all item pairs of each domain (Online Resource 1). The exceptions were Sleep109 (‘sleep quality’) vs. Sleep20 (‘problem with sleep’) and PAININ9 (‘pain interfering with day to day activities’) vs. PAININ22 (‘pain interfering with work around the home’). However, for the latter pair, the ECV from the bifactor model was very high (0.94); therefore, the local dependence detected can be deemed negligible. In the sleep disturbance domain, three item pairs showed a Chen and Thissen’s index above 0.3 and one pair was above 1. Graphs of item mean scores conditional on total score minus item score supported the monotonicity assumption for all domains (Online Resource 2).

IRT analysis

For each of the seven PROMIS-29 domains, the three assumptions of IRT analysis were largely met. However, several items misfitted the GRM as indicated by the p-values for the S–χ2 statistics (Table 4). Misfitting items included two items of the anxiety domain [EDANX01 (‘fearful’) and EDANX53 (‘uneasy’)], two items of the depression domain [EDDEP04 (‘worthless’) and EDDEP41 (‘hopeless’)], all four items of the sleep disturbance domain and one item of the pain interference domain [PAININ31 (‘pain interfering with social activities’)].

Table 4 IRT parameters for the Hungarian PROMIS-29

For all domains but sleep disturbance, the fit indices of the GRMs met the established criteria for SRMR, CFI and TLI. However, out of the seven PROMIS-29 domains, only anxiety, depression and social roles met the RMSEA cut-off value. The sleep disturbance (0.06–0.97) and fatigue (0.81–0.99) domains had the lowest average item difficulty (b), while physical function (1.41–1.82) had the highest in absolute values. The following items had the highest discriminative ability (a): PAININ22 (‘pain interfering with work around the home’), PAININ34 (‘pain interfering with household chores’), FATEXP40 (‘fatigue on average’) and PAININ9 (‘pain interfering with day to day activities’). Three items of the sleep disturbance domain [Sleep116 (‘refreshing sleep’), Sleep44 (‘difficulty falling asleep’) and Sleep109 (‘sleep quality’)] had the lowest item discrimination.

The ICC plots shown in Online Resource 3 indicated that for most items, the five response options were monotonically ordered. The only exception was item Sleep116 (‘refreshing sleep’) (Fig. 1).

Fig. 1 Item characteristic curves for PROMIS-29+2 Sleep disturbance domain

Differential item functioning

No DIF was identified for any of the domains for the following sociodemographic characteristics: gender, education, employment, place of residence, geographical region, marital status and income. However, PFA21 (‘go up and down stairs at a normal pace’) and PFA53 (‘run errands at shop’) of the physical function domain showed uniform DIF for age (McFadden’s pseudo R2 changes between model 1 and 2: 0.030 and 0.023, respectively). The test characteristic curves for these two items showed a small overall impact of DIF (Online Resource 4).

Convergent validity

Table 5 presents the results of the convergent validity analyses. In line with our hypotheses, evidence of strong convergence between corresponding PROMIS-29+2 and SF-36 domains was identified. The strongest correlations were observed between PROMIS-29+2 physical function and SF-36 physical functioning (rs = 0.78), PROMIS-29+2 fatigue and SF-36 vitality (rs = −0.76), PROMIS-29+2 pain interference and SF-36 bodily pain (rs = −0.74) and PROMIS-29+2 depression and SF-36 mental health (rs = −0.70). The PROMIS-29+2 sleep disturbance domain correlated weakly or moderately with SF-36 domains and showed the strongest association with vitality (rs = −0.57). As expected, the PROMIS-29+2 cognitive function domain correlated moderately or weakly with all SF-36 domains (rs = 0.18–0.42). The correlations between the domains within the two questionnaires are presented in Online Resources 5 and 6.

Table 5 Spearman’s correlation matrix between PROMIS-29+2 and SF-36 domains

Population reference values and cross-country comparisons

Mean domain T-scores tended to worsen with age for physical function, pain interference and social roles, whereas they improved with age for depression, anxiety, fatigue and cognitive function (p < 0.01) (Table 6). No age gradient was present for sleep disturbance (p = 0.155). Self-reported HRQoL problems were generally more pronounced among females in all domains (p < 0.001), except for cognitive function (p = 0.348). Higher mean pain intensity scores were reported by older and female respondents (p < 0.001).

Table 6 Population reference values for Hungarian PROMIS-29+2 domain T-scores and pain intensity scale

Compared to the US calibration sample with a mean of 50 and the three European countries with existing reference values, mean PROMIS-29+2 domain T-scores in the Hungarian general population indicated similar or better HRQoL with the largest difference being seen for social roles (> 5 points from the US calibration sample) (Fig. 2). The lowest level of anxiety and sleep disturbance was found in Hungary, while for physical function it was similar to Germany and the UK and for depression, fatigue and pain interference to France. Cognitive function in Hungary was better compared to the US calibration sample.

Fig. 2 Comparison of domain T-scores in the general population across Hungary, the US, France, Germany and the UK. Note that the cognitive function domain is not presented in the figure due to the lack of data from general population samples in any of the Western European countries. For PROMIS-29+2 scales of function (i.e., physical function, social roles and cognitive function) a higher score corresponds to a better HRQoL and for symptoms (i.e., anxiety, depression, fatigue, sleep disturbance and pain interference) a higher score corresponds to worse HRQoL. HRQoL: health-related quality of life

Discussion

This study assessed the psychometric properties of the Hungarian version of PROMIS-29+2 and provided reference values in a large representative sample of the adult general population in Hungary. Our findings provide evidence of a satisfactory measurement performance of the Hungarian PROMIS-29+2. Floor or ceiling effects were observed for nearly all domains, depending on the scale orientation, which is comparable to the findings of previous studies in various patient samples [18, 20, 21, 25]. Acceptable reliability was confirmed for all domains. Favourable psychometric properties of the scale include an excellent convergent validity with SF-36 and no or minor DIF for main sociodemographic characteristics. Nevertheless, a few potential weaknesses of PROMIS-29+2 have also been identified, particularly the poor performance of the sleep disturbance domain.

While the GRM produced an acceptable fit for six PROMIS-29+2 domains, sleep disturbance failed to meet any of the fit criteria and showed item misfit for all four items of the domain as well as very low item discrimination. Sleep109 (‘sleep quality’) vs. Sleep20 (‘problem with sleep’) showed local dependence, suggesting redundancy between the two items. Furthermore, response categories of item Sleep116 (‘refreshing sleep’) were disordered and its discriminatory ability was also substantially lower than that of any other item. Similarly to our findings, the Norwegian and Dutch PROMIS-29 validation studies also reported problems with the performance of the sleep disturbance domain and the item characteristic curves of Sleep116 [27, 30]. The sleep disturbance domain of PROMIS-29 is unique in the sense that it includes two positively phrased, reverse coded items (Sleep109 and Sleep116). In questionnaires, reverse-worded items are typically intended to reduce response bias (e.g., pattern answering), disrupt nonsubstantive responding or provide a better coverage of the domain studied [55]. Yet, several studies reported that such items can lead to measurement problems, including low reliability and poor model fit, and some argue that they would prevent respondents from inattentive or acquiescent answering [56]. Further exploration of the issues with the sleep disturbance domain, as well as testing alternative combinations of items, could be the subject of future research that administers the full PROMIS sleep item bank.

HRQoL decreased with age for physical and social health domains, but not for the cognitive or mental ones. This finding corresponds to the general population reference values in neighbouring Slovenia that reported worse mental health among young adult respondents using the EQ-5D-5L [57] and to the European reference values for the European Organisation for Research and Treatment of Cancer (EORTC) CAT Core that reported an improving trend for cognitive and emotional functions with age [58]. The better HRQoL of the Hungarian population in some domains compared to Western Europe is an unexpected finding, as the average health status in Hungary was found to be below the EU average [59]. Comparisons across countries using different health status measures also reported mixed evidence. Using the EQ-5D-3L, the Hungarian general population had substantially worse HRQoL compared to other European countries [60]; however, the EQ-5D lacks domains for fatigue, sleep problems and social roles. By contrast, the EORTC CAT showed that in some HRQoL domains (e.g., physical functioning, social functioning, sleep problems), the Hungarian population, in fact, had a better health status than what was found in Germany or the UK [58].

In this study, we used the official US item parameters to compute T-scores. However, multiple approaches exist to score PROMIS items with each offering their own advantages and disadvantages [61]. Using the US item calibrations follows the PROMIS convention and has the advantage that it represents a common metric, which directly allows for international comparisons. On the other hand, if any item within a domain shows language-DIF, the parameter estimates may not be valid for the local population. Another option is using country-specific item calibrations that enable improved accuracy for comparisons with local patient groups and country-specific interpretation of scores. To benefit from the advantages of both methods, a hybrid approach may also be recommended that uses US item calibrations for items without language-DIF and country-specific item parameters for items with language-DIF [62].

There are a number of limitations to this study. First, the online mode of administration might be responsible for selection bias, and the quota sampling lacks known sampling probability. Second, data were collected during the second wave of the COVID-19 pandemic in Hungary, which could have had an effect on self-reported health, particularly on young adults’ mental health [62,63,64,65,66,67]. However, responses on self-perceived health status (SF-36 first question) were roughly identical to those reported in a similar large-scale general population survey in Hungary before the pandemic (2019) [68]. The third limitation is that we had no information on the total number of potential respondents contacted by the survey company or access to the data from partially completed questionnaires. Fourth, the reference values for the 65+ age group might not be fully representative of the general population, as there were relatively few respondents in the 75+ age group (3.4%). Fifth, it was not possible to fit a GRM for cognitive function because the domain has only two items in PROMIS-29+2. Finally, for each PROMIS-29 domain we fitted a GRM, as this modelling approach was used to develop the PROMIS item banks and is suggested in the PROMIS analytical recommendations [6]. However, it is possible that certain traits measured by PROMIS-29+2 domains, e.g., physical functioning, pain, fatigue, anxiety and depression, do not have an a priori normal distribution in the population because many respondents report no problems [69]. A few alternative model types exist that could be useful in future analyses to alleviate the skewness in the data, e.g., zero-inflated mixture IRT models or Davidian Curve IRT [70, 71].

In summary, our results provide support for the satisfactory psychometric properties of the Hungarian version of PROMIS-29+2, including internal consistency reliability, good convergent validity with SF-36 and no or only minor DIF. However, the large ceiling and floor effects may detract from the usefulness of the measure when the aim is to differentiate between HRQoL levels at the mild end of the scale. Measurement problems were found with regard to the sleep disturbance domain, which would require further refinement. Age- and gender-specific reference values were generated for the Hungarian PROMIS-29+2 that facilitate the interpretation of HRQoL outcomes in various patient populations.