Extending the PROMIS item bank “ability to participate in social roles and activities”: a psychometric evaluation using IRT

Objective: Our objective was to explore whether the extension of the PROMIS item bank Ability to Participate in Social Roles and Activities (APSRA) with new items would result in more effective targeting (i.e., selecting items that are appropriate for each individual's trait level), and more reliable measurements across all latent trait levels.
Methods: A sample of 1,022 Dutch adults completed all 35 items of the original item bank plus 17 new items (in Dutch). The new items presented in this publication have been translated provisionally from Dutch into English for presentation purposes. We evaluated the basic IRT assumptions of unidimensionality, local independence, and monotonicity. Furthermore, we examined the item parameters, and assessed differential item functioning (DIF) for sex, education, region, age, and ethnicity. In addition, we compared the test information functions, item parameters, and θ scores for the original and extended item bank in order to assess whether the measurement range had improved.
Results: We found that the extended item bank was compatible with the basic IRT assumptions and showed good reliability. Moreover, the extended item bank improved the measurement in the lower trait range, which is important for reliably assessing functioning in clinical populations (i.e., persons reporting lower levels of participation).
Conclusion: We extended the PROMIS-APSRA item bank and improved its psychometric quality. Our study contributes to PROMIS measurement innovation, which allows for the addition of new items to existing item banks, without changing the interpretation of the scores and while maintaining the comparability of the scores with other PROMIS instruments.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11136-024-03666-4.


Introduction
Participation in social roles and activities contributes strongly to good health throughout life [1][2][3] and could be considered one of the core outcomes of healthcare [4][5][6][7][8]. The ability to participate in social roles and activities (APSRA) reflects what is considered important for improving health and general wellbeing, besides the relief of symptom burden [4,9,10]. APSRA is also a key component of the International Classification of Functioning, Disability and Health (ICF), which is a universal conceptualization of health and disability by the World Health Organization (WHO) [11]. While the importance of APSRA seems clear, the definition and operationalization of this construct are complex. Therefore, it is important to develop valid and reliable measures of APSRA that capture its diversity and specificity across different groups and settings [12][13][14]. An important contribution toward this objective has been made by the Patient Reported Outcomes Measurement Information System (PROMIS). PROMIS aims to improve and harmonize the measurement of self-reported health outcomes by using Item Response Theory (IRT) [15][16][17][18][19] and applications such as Computerized Adaptive Tests (CAT) [20][21][22][23][24][25][26][27][28][29]. PROMIS has designed several IRT-based item banks, including the item bank APSRA for measuring participation, which allows for more efficient and reliable measurement of this construct using CAT [30].
Overall, the psychometric properties of the item bank APSRA were reported as adequate, according to the PROMIS standards [23,31]. However, a recent qualitative study suggested that the APSRA item bank could benefit from additional items to better capture the full breadth of the underlying construct for lower or higher functioning individuals [14]. Furthermore, it was suggested that the content validity could be improved, especially with regard to the ICF activity and participation subdomains acquisition of necessities (i.e., acquiring a place to live), education (i.e., gaining admission to school), managing finances (i.e., maintaining a bank account), community life (i.e., engaging in social or community associations), and religion and spirituality (i.e., engaging in religious or spiritual activities). Also, it was suggested that the item bank may lack a distinction between engagement in remunerative (i.e., compensated) and non-remunerative (i.e., uncompensated) employment, and domestic life activities. As a solution, van Leeuwen et al. [32] proposed to add 17 items to the PROMIS-APSRA item bank (see Table 2). These additional items were generated by means of a three-step approach: (1) Item generation for 16 ICF subdomains currently not covered by the item bank; (2) Evaluation of content validity through expert review and think-aloud interviews; and (3) Item revision in a consensus procedure, based on the results of step 2 [32]. Their research confirmed the relevance, comprehensibility, and comprehensiveness of the 17 proposed items. They recommended further studying the psychometric properties of these items using IRT analysis, and examining how the results affect the decision to add these new items to the current item bank.
The present study has two aims. First, we will evaluate whether the IRT assumptions of unidimensionality, local item independence, and monotonicity hold for the extended item bank; and whether the items are free from differential item functioning (DIF) and show adequate fit to the IRT model used. Second, we will investigate whether adding the new items leads to more effective targeting, i.e., covers a broader and more representative spectrum of the latent trait. Ideally, the item bank would contain items that cover the entire range of latent trait values, so that all latent trait levels can be measured with adequate levels of reliability. Evidence of improved targeting would support the added value of the new items.

Participants
A sample of 1022 Dutch people was drawn from the general population, aged 18 years and older, by a certified data collection agency (DESAN Research Solutions). The net sample was representative of the adult Dutch population (maximum deviation of 2.5%) regarding age (young 18-39 years; middle 40-64 years; old 65+ years), sex, education (low, middle, high), region (north, east, south, west), and ethnicity (native, first- and second-generation western immigrant, first- and second-generation non-western immigrant), when compared to reference data from Statistics Netherlands from 2019 [33] (Table 1).

Procedure
All participants were members of an existing internet panel. The internet panel consisted of members of the online PanelClix service that was commissioned by DESAN Research Solutions (a specialized Dutch agency for collecting, processing, and reporting data for market and opinion research) in order to put together the online panel. PanelClix issues points (Clix) which are managed and administered by EuroClix, which also ensures that PanelClix members can exchange their points for euros. The panel members received 100 points for participating in our research, worth approximately 1 euro. The participants received an invitation to voluntarily take part in an online survey through an internet browser. After being presented with an introductory text with a brief explanation of the purpose of the survey, participants were asked to provide information about their age, sex, education, zip code, and ethnicity. Next, the participants were asked to rate their general level of participation by answering the question "How would you describe your ability to participate in social role activities?" on a 4-point scale (1 = not limited, 2 = mildly limited, 3 = moderately limited, 4 = severely limited). Next, they were asked to complete all 52 items of the extended version of the PROMIS-APSRA item bank. The items were presented in the same order for all participants, starting with the 35 original items followed by the 17 new items. The items were displayed in blocks of 5 items. All items within a block had to be answered for the next block of items to be presented.

Measures
The original PROMIS item bank consists of 35 negatively worded items (e.g., "I have to limit social activities outside my home"; item code SRPPER_CaPS1). The 17 new items were written in the same grammatical style as the original items (e.g., "I have trouble acquiring my groceries"; item code PEXP_2; see Table 2 for the complete list of items). The items were scored on a 5-point Likert scale (5 = never, 4 = rarely, 3 = sometimes, 2 = usually, 1 = always), with higher scores indicating greater ability to participate (i.e., fewer limitations). The item bank does not specify a time frame (e.g., "Think back over the past 30 days").

IRT assumptions
In order to evaluate the incremental value of the new items, we first evaluated whether the items in the extended item bank adhered to the assumptions underlying most IRT models: unidimensionality, local item independence, and monotonicity; and whether the items were DIF-free [17,18,39,40].

Table 2 (excerpt as extracted; for each original item listed, all five parameter columns read "Fixed")
Item code | Item text
SRPPER07_CaPS | I have trouble doing all of the family activities that I feel I should do
SRPPER08_CaPS | I have trouble doing all of the family activities that are really important to me
SRPPER09_CaPS | I have trouble doing everything for work that I want to do (include work at home)
SRPPER11_CaPS | I have trouble doing all of my regular leisure activities with others
SRPPER13_CaPS | I have to limit social activities with groups of people
SRPPER14r1 | I have to limit my regular family activities
SRPPER15_CaPS | I have to limit the things I do for fun with others
SRPPER16r1 | I have to do my work for shorter periods of time than usual (include work at home)
SRPPER17r1 | I feel limited in the amount of time I have for my family
SRPPER18_CaPS | I have trouble doing all of the family activities that I want to do
SRPPER20_CaPS | I have trouble doing all of the activities with friends that are really important to me
SRPPER21_CaPS | I have trouble doing all the leisure activities with others that I want to do
SRPPER22_CaPS | I have trouble keeping up with my family responsibilities
SRPPER23_CaPS | I have trouble doing all my usual work (include work at home)
SRPPER26_CaPS | I have trouble doing all of the work that is really important to me (include work at home)

Dimensionality
Unidimensionality is a key assumption of the most frequently used IRT models. It means that the responses to a set of items can be sufficiently explained by a single latent trait, and it allows for the estimation of item parameters and latent person scores on a common scale. In keeping with the IRT framework employed throughout our psychometric analyses, an exploratory Mokken scale analysis (a nonparametric IRT based scaling technique) was performed to assess unidimensionality [41]. More specifically, the Automated Item Selection Procedure (AISP) was used. This procedure groups items into scales in an iterative manner. One of the aims in this procedure is to maximize the scalability coefficient H, which can be seen as an item-total correlation that has been corrected for the influence of item difficulty, i.e., item location. A Mokken scale is considered strong if H equals 0.5 or higher, moderate if 0.4 ≤ H < 0.5, and acceptable if 0.3 ≤ H < 0.4. Furthermore, the item scalability coefficients Hj should be ≥ 0.30 and the item-pair scalability coefficients Hij should be positive [37]. As an additional check, we examined the model fit and the percentage of explained variance of the unidimensional graded response model (GRM; see section item calibration for more information). Following the recommendations of Maydeu-Olivares (2014), we used the M2 fit statistic in conjunction with the RMSEA2 and SRMR to assess adequate model fit (M2 p > 0.05; RMSEA2 < 0.089; SRMR < 0.05) [42].
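As an illustration of the scalability coefficient and the cut-offs described above, the following minimal Python sketch computes a scale-level H as the ratio of summed observed item-pair covariances to the summed maximum covariances attainable given the item marginals. This is a simplified sketch and not the analysis code used in this study; the function names (scale_h, label_h), the toy data, and the label "below 0.3" are assumptions made for illustration only.

```python
from itertools import combinations


def covariance(x, y):
    """Population covariance of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def max_covariance(x, y):
    """Largest covariance attainable given both marginal distributions.

    Pairing the sorted scores of both items maximizes the covariance.
    """
    return covariance(sorted(x), sorted(y))


def scale_h(items):
    """Scale-level H: summed covariances over summed maximum covariances."""
    observed = 0.0
    maximum = 0.0
    for x, y in combinations(items, 2):
        observed += covariance(x, y)
        maximum += max_covariance(x, y)
    return observed / maximum


def label_h(h):
    """Apply the cut-offs quoted in the text (label for H < 0.3 is assumed)."""
    if h >= 0.5:
        return "strong"
    if h >= 0.4:
        return "moderate"
    if h >= 0.3:
        return "acceptable"
    return "below 0.3"


# Hypothetical toy data: three items answered by five respondents.
toy_items = [[1, 2, 3, 4, 5], [1, 1, 2, 3, 4], [2, 2, 3, 5, 5]]
print(scale_h(toy_items), label_h(0.56))
```

In the toy data all items order the respondents identically, so the observed covariances equal their maxima; real data would yield values below 1.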

Local item independence
Item pairs are locally independent when, controlling for the latent trait score, item responses show no association, i.e., the person parameter θ is not influenced by other factors than the trait level [16][17][18]. In order to test this assumption, we used Yen's Q3 statistic with a residual correlation ≥ |0.2| as a critical value for signaling local item dependency [31,43,44].
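The flagging rule can be sketched as follows. Q3 is the correlation between the residuals of two items, where a residual is the observed item score minus the score expected under the fitted IRT model. The sketch below assumes that these residuals have already been computed; the function name flag_q3_pairs and the toy residuals are hypothetical, and this is not the code used in the present analyses.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def flag_q3_pairs(residuals, cutoff=0.2):
    """Return (item_i, item_j, q3) for pairs with |q3| >= cutoff.

    residuals: dict mapping an item code to its list of residuals
    (observed score minus model-expected score), one per respondent.
    """
    flagged = []
    for name_i, name_j in combinations(sorted(residuals), 2):
        q3 = pearson(residuals[name_i], residuals[name_j])
        if abs(q3) >= cutoff:
            flagged.append((name_i, name_j, q3))
    return flagged


# Hypothetical residuals for three items and four respondents.
toy = {
    "A": [0.5, -0.5, 0.3, -0.3],
    "B": [0.4, -0.6, 0.2, -0.2],
    "C": [0.1, 0.1, -0.1, -0.1],
}
for pair in flag_q3_pairs(toy):
    print(pair)
```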

Monotonicity
Monotonicity implies that when the latent trait level is increasing, so will the probability of endorsing a higher response category [19,23]. We assessed monotonicity for the extended item bank by inspecting the category characteristic curves produced by confirmatory Mokken scale analysis. More specifically, the monotone homogeneity model (MHM) was estimated, which can be seen as a nonparametric counterpart of the GRM [36,37][45][46][47]. We evaluated the output for non-significant violations (#vi) and significant violations (#zsig). Additionally, violations of monotonicity were assessed by inspection of the critical values (CRIT) of the items. CRIT is a single statistic of several combined "goodness of fit" indicators [41] used in Mokken scaling. CRIT values should not exceed 80, while values below 40 are ideal, and values between 40 and 80 are considered acceptable violations [46,48].
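The idea behind a monotonicity check can be illustrated with a simplified sketch: respondents are grouped by their rest score (the sum score on all other items), and the mean item score should not decrease from one rest-score group to the next. The sketch below only counts decreases; it does not merge small groups, test significance, or compute CRIT, so it is an assumption-laden illustration rather than the procedure applied in this study. The function name and toy data are hypothetical.

```python
def count_monotonicity_violations(item_scores, rest_scores, min_decrease=0.0):
    """Count decreases in mean item score across increasing rest-score groups.

    item_scores: scores on the item under study, one per respondent.
    rest_scores: sum scores on the remaining items, one per respondent.
    min_decrease: decreases of at most this size are ignored.
    """
    groups = {}
    for score, rest in zip(item_scores, rest_scores):
        groups.setdefault(rest, []).append(score)

    # Mean item score per rest-score group, ordered by rest score.
    means = [sum(v) / len(v) for _, v in sorted(groups.items())]

    return sum(
        1 for lower, higher in zip(means, means[1:])
        if lower - higher > min_decrease
    )


# Hypothetical data: six respondents in three rest-score groups.
rest = [10, 10, 20, 20, 30, 30]
print(count_monotonicity_violations([1, 2, 2, 3, 4, 5], rest))
print(count_monotonicity_violations([3, 3, 1, 1, 5, 5], rest))
```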

Differential item functioning
Differential item functioning (DIF) assesses the degree to which an item in a questionnaire functions differently for different groups. DIF occurs when two groups of respondents with similar ability levels but differing in some characteristic (such as sex, ethnicity, or age) have different probabilities of endorsing a response category on an item. DIF analysis is used to identify items that are biased in favor of one group to the detriment of another, thereby affecting the validity of the questionnaire. We investigated whether the items were sufficiently DIF-free with respect to age, sex, education, region, and ethnicity. We performed uniform and non-uniform DIF analyses for age (median split: ≤ 49 years, > 49 years), sex (male, female), education (low, middle, high), region (north, east, south, west), and ethnicity (native, western immigrant, non-western immigrant) [31].
With uniform DIF, the probability of endorsing an item will on average always be lower for one group, for all levels of θ. The two item characteristic curves for these groups would not intersect, i.e., would run more or less parallel to each other. Non-uniform DIF occurs when the probability of a response to an item depends both on the level of θ and the group membership of the respondent, resulting in intersecting item characteristic curves. DIF was evaluated by applying ordinal logistic regression models, using a McFadden's pseudo R2 change of 2% as a criterion for DIF [38,49], and by inspecting the item characteristic curves (ICCs) of items that were flagged for DIF.
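The pseudo R2 change criterion can be sketched as below. The sketch assumes the common setup of three nested ordinal logistic models per item (θ only; θ plus group; θ plus group plus the θ × group interaction), with uniform DIF assessed by the first comparison and non-uniform DIF by the second. The log-likelihood values, the function names, and the use of "at least 0.02" as the flagging rule are illustrative assumptions, not results or code from this study.

```python
def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo R2: 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null


def dif_flags(ll_null, ll_theta, ll_group, ll_interaction, criterion=0.02):
    """Flag uniform and non-uniform DIF from pseudo R2 changes.

    ll_theta:       log-likelihood of item ~ theta
    ll_group:       log-likelihood of item ~ theta + group
    ll_interaction: log-likelihood of item ~ theta + group + theta:group
    """
    r2_theta = mcfadden_r2(ll_theta, ll_null)
    r2_group = mcfadden_r2(ll_group, ll_null)
    r2_interaction = mcfadden_r2(ll_interaction, ll_null)
    return {
        "uniform_change": r2_group - r2_theta,
        "nonuniform_change": r2_interaction - r2_group,
        "uniform": r2_group - r2_theta >= criterion,
        "nonuniform": r2_interaction - r2_group >= criterion,
    }


# Hypothetical log-likelihoods for one item.
print(dif_flags(ll_null=-1000.0, ll_theta=-700.0,
                ll_group=-695.0, ll_interaction=-660.0))
```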

Item calibration
In order to assess the item parameters of the extended item bank, we used a GRM [50] where the item parameters for the original PROMIS items were set to the fixed US calibration values (as per PROMIS convention), and only those of the new items were estimated. The official PROMIS US item parameters were obtained via enquiry at HealthMeasures (see footnote 2). The resulting estimated latent trait scores (i.e., θ) were scaled with a mean of 0 and a SD of 1, since this aligns best with keeping the scale as similar as possible to the original American PROMIS scale. Furthermore, a model with a freely estimated mean and SD showed negligible differences with a mean close to 0 and SD close to 1. Reliability was calculated for evaluating the quality of the test (i.e., scores are consistent and a good measure of the underlying trait). In order to examine item fit we calculated the generalized S-Χ2 statistic [51], which compares observed and expected response frequencies estimated by the IRT model, and quantifies differences between these frequencies. Items with a p-value smaller than 0.05 were considered indicative of poor fit (see footnote 3). In addition, we assessed whether the discrimination parameters were sufficiently large (a > 1.0). We used a Welch Two Sample t-test to test the difference between the item parameters of the new and old items, respectively. Effect sizes for the t-test were evaluated based on Cohen's (1988) recommendations [52]. Lastly, we visually examined the category response curves of the items, with the aim to gauge whether the item response categories were ordered as expected, and whether all item response categories had added value (i.e., were sufficiently nonoverlapping). This provides an indication to what extent the response categories are able to differentiate between levels of functioning.
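To make the role of the item parameters concrete, the sketch below computes category response probabilities under a GRM for a five-category item with discrimination a and ordered location parameters b1 to b4. The parameter values are hypothetical and chosen for illustration only; they are not the calibrated values of any item in the bank, and the function name is an assumption.

```python
from math import exp


def grm_category_probs(theta, a, bs):
    """Category probabilities under the graded response model.

    theta: latent trait value
    a:     discrimination parameter
    bs:    ordered location parameters (b1..b4 for a 5-category item)

    Returns a list ordered from the lowest to the highest response category.
    """
    # Cumulative probabilities of responding in category k or higher,
    # padded with 1 (lowest category or higher) and 0 (above the highest).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + exp(-a * (theta - b))) for b in bs]
    cumulative.append(0.0)

    # Each category probability is the difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(bs) + 1)]


# Hypothetical item: a = 2.0, b1..b4 = -2, -1, 0, 1.
probs = grm_category_probs(theta=0.0, a=2.0, bs=[-2.0, -1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Plotting these probabilities over a grid of θ values gives the category response curves that were inspected visually.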

Targeting
Targeting in IRT refers to the extent to which test items are appropriately matched to the latent trait level of the respondent. In order to achieve good targeting (i.e., to ensure accurate and meaningful measurements), it is important to use items that vary in location across the range of latent trait levels of the individuals completing the questionnaire. We evaluated the θ distribution of the extended item bank and examined whether the location (i.e., b1-b4) parameters of the new items covered a part of the latent trait range that had not yet been covered by the original items. For this we compared the test information functions and beta distributions for the original and extended item bank, in order to assess whether the new items broadened the range of θ values that can be measured. Furthermore, we compared the absolute differences between the individual θ score estimated with the original item bank to those estimated with the extended item bank.

Unidimensionality
The exploratory Mokken scale analysis indicated that the items in the extended item bank form a unidimensional scale. The total scale had an H-value of 0.56, which is indicative of a strong unidimensional scale with good item discriminatory power. All item scalability coefficients Hj exceeded 0.30 (range 0.31-0.65) and all the item-pair scalability coefficients Hij were positive (see Table S1 in the online supplement). The proportion of explained variance of 74% also supported unidimensionality. However, the overall fit of the model was unsatisfactory (M2 (df = 1293) = 19,088.86, p < 0.001; RMSEA2 = 0.12; SRMSR = 0.14).

Local item independence
Yen's Q3 statistic flagged 34 item pairs for local item dependence. However, it should be noted that most violations were minor, only just exceeding the cut-off value of |0.2|. An exception was the residual correlation of 0.70 between item PEXP_12 ("I have trouble keeping track of my finances (managing a bank account)") and PEXP_11 ("I have trouble doing things online like making payments"). Also, the residual correlations between item pair SRPPER23_CaPS ("I have trouble doing all my usual work (include work at home)") and SRPPER37_CaPS ("I have trouble doing all of the work that I feel I should do (include work at home)"), and between item pair SRPPER35_CaPS ("I have trouble doing everything for my friends that I want to do") and SRPPER36_CaPS ("I have trouble doing all of the activities with friends that I feel I should do") were relatively high (respectively 0.48 and 0.41).

Footnote 2: HealthMeasures is the dissemination and implementation hub for four state-of-the-science measurement systems: PROMIS®, NIH Toolbox®, Neuro-QoL™, and ASCQ-Me®. Originally funded by the National Institutes of Health (NIH), it is now supported under a cost-recovery business model with no profit motivation at Northwestern University's Feinberg School of Medicine (https://www.healthmeasures.net).
Footnote 3: Although it has been suggested that a p-value smaller than .05 may indicate poor fit [53,57], there is still no consensus on the optimal cut-off value for the p-value. A Benjamini and Hochberg adjusted alpha level could be more suitable for controlling a false discovery rate [58,59]. However, for individual testing, an alpha adjustment could be less appropriate [60]. An alpha adjustment might also emphasize the importance of statistical significance, whereas practical consequences should also be considered. After all, a statistical misfit does not automatically indicate that an item should be dropped, because it might still be clinically relevant; and at the same time keeping the item does not necessarily have a significant or practical impact on severity estimates [58]. After considering the aforementioned points, we decided to use an alpha level of .05 without adjustment as significance threshold for the S-X2 p-value.

Monotonicity
All items of the extended item bank had critical values below 40, and no violations of the assumption were observed when inspecting the monotonicity plots visually. Thus, we did not find evidence that this assumption was violated.

Differential item functioning
None of the items were flagged for DIF associated with sex, education, region, or ethnicity. For age, only item 17 was flagged for uniform DIF (SRPPER16r1 "I have to do my work for shorter periods of time than usual (include work at home)"). However, the degree of DIF was negligible (for more details, see supplemental material).

Item calibration
The reliability for the extended item bank was high (0.98). The generalized S-Χ2 statistic showed that 23 of the 52 items (44%) had a p-value smaller than 0.05, possibly indicating a poor fit. Interestingly, this concerned 21 original items (40%) and only 2 new items (4%). The (freely) estimated discrimination and location (difficulty) parameter estimates for the new items are shown in Table 2. The discrimination parameter a ranged from 0.97 to 3.04 and from 1.99 to 4.88 for the new and old items respectively, indicating overall sufficient discriminating power. Only one item showed a value just below 1.00 (PEXP_16 "I have trouble using digital and social media, such as WhatsApp, email, Facebook"; a = 0.97). In general, the discrimination parameters of the new items were lower in comparison to the old (original) items. The Welch Two Sample t-test suggested that the a parameters of the new items were significantly lower than the old items, with a large effect size (mean a of new items = 1.98; mean a of old items = 3.92; difference = −1.94, 95% CI [−2.31, −1.58], t(33.79) = −10.83, p < 0.001; Cohen's d = −3.17, 95% CI [−4.10, −2.21]) (see Table 3).
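The fractional degrees of freedom reported above are characteristic of the Welch test, which does not assume equal variances in the two groups of items. A minimal sketch of the test statistic and the Welch-Satterthwaite degrees of freedom is given below; the input values are hypothetical and do not correspond to the item parameters in Table 2, and the function name is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_x = variance(x) / len(x)
    se2_y = variance(y) / len(y)

    t = (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
    df = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (len(x) - 1) + se2_y ** 2 / (len(y) - 1)
    )
    return t, df


# Hypothetical discrimination parameters for two small sets of items.
new_items = [1.0, 2.0, 3.0]
old_items = [2.0, 4.0, 6.0, 8.0]
t_value, df_value = welch_t(new_items, old_items)
print(round(t_value, 3), round(df_value, 2))
```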
The location parameters (b1, b2, b3, and b4) ranged from −4.20 to 0.55 and from −2.49 to 0.73 for the new and old items, respectively. Figure 1 shows that targeting is improved substantially for the lower end of the scale by the new items (black bars) relative to the old items (grey bars). We refer to Figure S2 in the online supplement for density plots of the item parameters grouped by old and new items.
The Welch Two Sample t-test indicated that the mean b parameters of the new items were significantly lower than the old items, with a medium effect size (mean beta of new items = −1 [...]
An examination of the trace lines of the probability functions from the extended item bank (i.e., the category response curves) showed that for some items of the extended item bank, it is less clear what response option (i.e., scoring category) is the most likely given a certain trait level (see Fig. 2). This seems true for item 36 (PEXP_1), item 43 (PEXP_8), item 44 (PEXP_9), item 46 (PEXP_11), item 47 (PEXP_12), item 50 (PEXP_15), and item 51 (PEXP_16).
In sum, the extended item bank showed high reliability, but many original items showed poor fit according to the generalized S-Χ2 statistic. Although the new items had lower discrimination parameters, the lower location parameters of the new items showed that these items improved targeting of people who reported low levels of social participation. Some items in the extended item bank had disordered response categories, meaning that the response option that was most likely for a given trait level was not always obvious.

Targeting
The test information function in Fig. 3a visualizes where the original and extended item banks provide (the most) information relative to θ levels. It can be seen that the extended item bank covers a wider range of θ levels, especially at the lower range (i.e., persons reporting lower levels of participation). This is consistent with our finding that the new items had significantly lower location parameters than the old items (see Fig. 1), meaning that they are possibly more suitable for measuring lower levels of participation.
The individual θ scores based on the original item bank ranged from −2.76 to 1.75, whereas the individual θ scores based on the extended item bank ranged from −3.11 to 1.91. A comparison of the individual θ scores from the original item bank (old items with fixed parameters) and the individual θ scores from the extended item bank (old items with fixed parameters and new items with freely estimated parameters) showed a high correlation (r = 0.99) and an absolute mean difference of 0.06 with an SD of 0.06. However, the absolute mean difference in θ scores between the original and extended item bank was larger for individuals with lower θ scores (Fig. 3b). This shows that the new items broaden the measurement range especially at the lower end of the scale.

Discussion
This study applied IRT modeling to examine the psychometric properties of the extended PROMIS-APSRA item bank, including the basic IRT assumptions, differential item functioning, item fit, and whether the new items improved the targeting of lower/higher levels of participation. Overall, we found sufficient support for the IRT assumptions, and we did not find substantial item bias in terms of DIF. The discrimination parameters of the new items were lower than those of the old items. However, the inclusion of the new items in the item bank enhanced the information function at the lower levels of participation, leading to better targeting of the lower range of the latent trait scale. Together, these findings suggest that extension of the PROMIS-APSRA item bank resulted in a meaningful improvement of the psychometric quality. Although many items seemed locally dependent, most violations were minor, and possibly an artefact of the fact that the items were displayed in blocks of 5 items at the same time [43]. An exception was the high residual correlation between the new items PEXP_12 ("I have trouble keeping track of my finances (managing a bank account)") and PEXP_11 ("I have trouble doing things online like making payments"). This is likely due to the similarity in wording and content, making it harder for a respondent to distinguish the differences between these questions [23,43,44]. As a consequence, we advise against including both these items at the same time in a short form or CAT.
Our results indicated that item bias in terms of DIF was low. Only one item (item 17; SRPPER16r1 "I have to do my work for shorter periods of time than usual (include work at home)") was flagged for uniform DIF due to age. The impact, however, seemed negligible, and we therefore suggest keeping this item in the item bank. We conclude that different subgroups with the same level of participation do not have different probabilities of endorsing an item response (i.e., the item parameters are invariant across different populations), and that the items are unbiased for all respondents, regardless of their sex, education, region, or ethnic background.
The generalized S-Χ2 statistic indicated that 21 items from the original item bank and 2 new items from the extended item bank had a potential misfit with the model. Item misfit occurs when an item does not conform to the expectations of the model, and the observed responses deviate significantly from the expected responses based on the model. Several factors can contribute to misfitting items, such as multidimensionality, guessing, local dependence, or cultural bias [51,53,54]. We ruled out multidimensionality and guessing as possible sources of misfit, since our analysis confirmed the unidimensional structure of the scale, and the items did not have correct or incorrect responses. However, we considered local dependence and cultural bias as plausible explanations. Local dependence occurs when the responses to two or more items are highly correlated, and the response to one item can predict the response to another item. This can lead to an overestimation of the test reliability and an underestimation of the standard errors of the item parameters. We detected some minor effects of local dependence in our data, but they were not sufficient to explain the misfit identified by the S-Χ2 statistic.
Since the misfit mainly occurred in the original items, and the parameters for the old items were fixed on US parameters while the new items were estimated based on Dutch data, cultural bias could be the most likely cause of the misfitting items in our scale. This might partly clarify why the model's overall fit was not satisfactory, even though the expanded item bank constitutes a robust unidimensional scale.
The discrimination parameters (a) of the new items have significantly lower values, indicating that they are less able to differentiate between respondents with a high level of functioning and those with a low level of functioning compared to the old items. Nevertheless, the discrimination power for the new items is still sufficient. Only item PEXP_16 ("I have trouble using digital and social media, such as WhatsApp, email, Facebook") showed a discriminating power just below 1 and a marginal Mokken scalability coefficient (Hi = 0.306). Therefore, item PEXP_16 is a serious candidate for exclusion from the item bank despite its low threshold values (starting at b1 ≈ −4.20), which could make this item eligible for measuring the latent trait score of respondents with severe impairments (i.e., in a clinical population, who are expected to generally have a lower ability to participate in social roles and activities). We suggest a critical study of this item in a clinical sample. We also advise rephrasing this item by removing specific examples (i.e., "I have trouble using digital and social media" or "I have trouble using digital and social media due to certain health-related challenges.") to prevent outdated wording in the future.
Fig. 3b illustrates the absolute mean difference in individual θ scores between the original and extended item bank across various levels of θ. While the overall absolute mean difference for the entire group was 0.06, this discrepancy increased when comparing subgroups with different θ levels. For instance, when examining the absolute difference in individual θ scores between the item banks for subjects with an individual θ score of −2 or less, the absolute mean difference was 0.32. Notably, this effect is primarily observed at the lower end of the latent trait, i.e., in subjects with lower levels of participation. These findings indicate that the inclusion of new items in the extended item bank expands the measurement range, particularly at the lower end of the scale.
We also found that for 7 of the 17 new items, the category response curves were not peaked and adequately dispersed across all levels of the latent trait, making it less clear what response option (i.e., scoring category) was the most likely given a certain θ value (see Fig. 2). This meant that not all response options contributed meaningfully to the estimation of trait levels. A visual inspection of the operating characteristic curves for these items suggests that using three rather than five response options may have been more appropriate for these items. However, we advise against using different numbers of response options for a subset of items, since this might confuse respondents.

Strengths, limitations, and future research
This study has several strengths and limitations. One of the strengths is that we used a large [55], representative, stratified sample of the Dutch general population, which enhances the external validity and generalizability of the findings. Another strength is that this study built on a well-established item bank from a renowned system (PROMIS), and thus had a solid foundation for the development of a potentially more accurate measurement of participation in social roles and activities.
However, the study also has some limitations that may have affected the quality of the findings and the ability to answer the research questions. In order to ensure comparability with the original item bank, the parameters for the old items were fixed on US parameters, while the new items were estimated based on Dutch data. As argued by Terwee et al. [56], such an approach may have introduced some bias or inconsistency in the item calibration and scaling. Furthermore, it is crucial to recognize the intricacies associated with translating and culturally adapting new items. Notably, the newly proposed items were developed in Dutch, while the original items were developed in English. Therefore, further research is needed before the proposed items are incorporated into other language versions of the item bank. This study examined the psychometric properties in a non-clinical population. We strongly recommend that the item bank's applicability in clinical practice and for individuals with specific needs, such as those with low literacy, be examined in a future study. Moreover, this study did not test the predictive validity or responsiveness of the measure, which are important aspects for evaluating its usefulness in clinical practice and research. To address these topics, we plan to conduct further studies, preferably using CAT simulations, to examine the added value of the extended item bank in a clinical population, and to test its ability to detect changes over time and predict treatment outcomes.

Conclusion
In conclusion, we found that the extended item bank showed good reliability and validity in the Dutch general population. Moreover, the extended item bank improved the measurement in the lower trait range, which is important for reliably assessing functioning in clinical populations. Our study also contributes to further innovation of PROMIS measurements, which allow for dynamic and flexible addition of new items to item banks, without changing the interpretation of the scores, and while maintaining the comparability of the scores with other PROMIS instruments. We hope that this study will stimulate further research on social participation and its measurement in different populations and contexts.
of functioning in psychiatric patients. The Medical Ethics Review Committee (MERC) of VU University Medical Center (VUmc) confirmed that the Medical Research Involving Human Subjects Act (WMO) does not apply to this study, and an official approval by the MERC is not required. The MERC of VUmc is registered with the US Office for Human Research Protections (OHRP) as IRB00002991. The FWA number assigned to the VUmc is FWA00017598. The data were collected by DESAN Research Solutions (a specialized Dutch agency for collecting, processing, and reporting data for market and opinion research). All employees of DESAN have signed a confidentiality agreement, which is included in their employment contract, and are in possession of a Certificate of Conduct (VOG). The Computer Aided Instruction (CAI) framework within which the questionnaire is designed is owned by DESAN Research Solutions.
DESAN will host the application on its own server clusters in two identical environments. The data will be mirrored between both locations and the locations will act as fail-over for each other. DESAN and both data centers are fully ISO-27001 certified. The research was conducted and delivered completely anonymously. DESAN had no insight into the identity of the respondents during fieldwork, and the panel administrator had no insight into the answers of the respondents during fieldwork. The data that were provided to the researchers cannot be traced back to natural persons in any way. The results of the research (data and documents) and any other sensitive data were encrypted by DESAN and sent by email, via Cryptshare. Sending sensitive data via Cryptshare guarantees a secure transfer and also means that the relevant files do not end up in different locations (especially mail servers and their backups), neither at DESAN nor at the client. The research data will be stored by DESAN for at least five years.
Consent to participate Informed consent was obtained from all individual participants included in the study.
Software All analyses were performed in R, version 4.1.2 [34]. The main packages used for the IRT analysis were mirt (version 1.36.1) [35],
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Item | Item text | a | b1 | b2 | b3 | b4
SRPPER28r1 | I have to limit my regular activities with friends | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER31_CaPS | I have trouble taking care of my regular personal responsibilities | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER35_CaPS | I have trouble doing everything for my friends that I want to do | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER36_CaPS | I have trouble doing all of the activities with friends that I feel I should do | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER37_CaPS | I have trouble doing all of the work that I feel I should do (include work at home) | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER42r1 | I feel limited in my ability to visit friends | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER43r1 | I have trouble keeping in touch with others | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER46_CaPS | I have trouble doing all of the activities with friends that I want to do | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER47_CaPS | I have trouble keeping up with my work responsibilities (include work at home) | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER54_CaPS | I have trouble doing everything for my friends that I feel I should do | Fixed | Fixed | Fixed | Fixed | Fixed
SRPPER55r1 | I feel limited in the amount of time I have to visit friends | Fixed | Fixed | Fixed | Fixed | Fixed
PEXP_1 | I have trouble doing what is needed to acquire a place to live | 1.18 | −3.34 | −2.41 | −1.29 | −0.

Fig. 3
a Test information curves. b Absolute mean θ difference between item banks by θ score.
Fig. 3b illustrates the absolute mean difference in individual θ scores between the original and extended item bank across various levels of θ. While the overall absolute mean difference for the entire group was 0.06, this discrepancy increased when comparing subgroups with different θ levels. For instance, when examining the absolute difference in individual θ scores between the item banks for subjects with an individual θ score of −2 or less, the absolute mean difference was 0.32. Notably, this effect is primarily observed at the lower end of the latent trait, i.e., in subjects with lower levels of participation. These findings indicate that the inclusion of new items in the extended item bank expands the measurement range, particularly at the lower end of the scale.
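The subgroup comparison described for Fig. 3b can be sketched as below. This is an illustration only: it assumes that "absolute mean difference" refers to the mean of the per-person absolute differences, and that subgroups are defined by the θ score from the original item bank; neither choice is confirmed by the text. The θ values are made up, and the function name `mean_abs_theta_difference` is hypothetical.

```python
def mean_abs_theta_difference(theta_original, theta_extended, upper_bound=None):
    """Mean absolute difference between paired theta estimates.

    If upper_bound is given, only respondents whose original-bank theta is at
    or below that value are included (an assumption about subgroup definition).
    """
    pairs = list(zip(theta_original, theta_extended))
    if upper_bound is not None:
        pairs = [(orig, ext) for orig, ext in pairs if orig <= upper_bound]
    if not pairs:
        raise ValueError("no respondents in the selected theta range")
    return sum(abs(orig - ext) for orig, ext in pairs) / len(pairs)


# Made-up theta estimates for four respondents under each item bank.
theta_original = [-2.5, -2.0, 0.0, 1.0]
theta_extended = [-2.9, -2.2, 0.0, 1.1]

overall = mean_abs_theta_difference(theta_original, theta_extended)
low_range = mean_abs_theta_difference(theta_original, theta_extended, upper_bound=-2.0)
print(round(overall, 3), round(low_range, 3))
```

With these toy values the difference is larger in the low-θ subgroup than overall, which mirrors the pattern reported above (0.32 versus 0.06) without reproducing the actual data.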

Table 2
Discrimination and threshold parameter estimates for the extended patient-reported outcomes measurement information system item bank for the ability to participate in social roles and activities 2.0

Table 3
Item-fit statistics for the extended patient-reported outcomes measurement information system item bank for the ability to participate in social roles and activities 2.0