Introduction

The assessment of outcomes in health intervention and care related studies requires information not only about the identification and measurement of impacts, but also the consideration of how valuable these impacts are to patients or the general public [1]. Outcomes in the area of mental health often go beyond health and incorporate wider impacts [2]. The capability framework, introduced by Amartya Sen in the early 1980s as an alternative to standard utilitarian welfare economics, provides a richer evaluative space beyond health [3]. The core focus of the capability approach is on what individuals are able to be and do in their lives, in other words, what they are capable of [4]. One of the outcome assessment instruments specifically designed to capture different dimensions of well-being within the capability framework in the area of mental health is the self-reported Oxford Capability questionnaire-Mental Health (OxCAP-MH) instrument [2]. The development of the OxCAP-MH was based on the list of ten central capabilities endorsed by Martha Nussbaum in 2003 [5]. This list included life; bodily health; bodily integrity; senses, imagination and thought; emotions; practical reason; affiliation; other species; play; and control over one’s environment. The scoring of the OxCAP-MH currently relies on equal weights across items and the instrument has been used also as a non-preference based outcome measure in economic evaluations so far [6]. There are, however, suggestions in the literature that some capability items may be more important than others in determining someone’s well-being [5, 7]. The weighting of preferences may vary not only between different cultural settings (i.e. regions/countries) and main sociodemographic characteristics (i.e. age, gender), but also across different cohorts influenced by specific insight into or adaptation to an illness [8,9,10]. It has long been acknowledged in the preference elicitation and value weighting literature that the time spent in a health state may have an influence on the way that state is perceived too [11]. Beside the methods used for valuation, controversies exist concerning who should provide valuations. In principle, preferences or values could be provided by members of the general public, who most often pay for interventions via taxation or social insurance; or by patients, who have lived experience with the health state in question; or by experts, who have relevant scientific or clinical insights [12].

The idea of developing value sets for capability instruments has first appeared in 2008, during the development of the OCAP-18 [13]. However, there is still no consensus on the best method to elicit values and whether capability instruments should be anchored similarly to preference-based health-related quality of life (HRQoL) measures. A recent literature review identified that all studies aiming to develop value sets for capability instruments used the Best–Worst Scaling (BWS) method [14]. These include the UK value sets for the Adult Social Care Outcomes Toolkit (ASCOT), Investigating Choice Experiments Capability Measure for Adults (ICECAP-A), for older people (ICECAP-O) and Supportive Care Measure (ICECAP-SCM) instruments [7, 15,16,17]. Moreover, a recently published study aimed to establish Austrian preference weights for the ASCOT instrument also used the BWS method [18]. The main reason for this is that BWS elicits values rather than preferences because individuals are not asked to trade one thing for another, but rather only state which option they find most and least important within hypothetical scenarios [15]. Thus, the BWS approach may come closest from all the potentially available methods, which have been used to elicit value sets, that would satisfy Sen's interpretation of the capability approach [3]. Moreover, alternative methods, such as discrete choice experiments (DCEs), could not handle the number of attributes and levels in the OxCAP-MH and may pose problems disentangling independent effects because of inter-item correlations [19].

This study aimed to explore the relative preference weights of the 16 items of the German language version of the OxCAP-MH (Oxford Capability questionnaire-Mental Health) capability instrument and their differences across cohorts with alternative levels of mental ill-health experience.

Methods

OxCAP-MH

The study was conducted prospectively in Austria using the German language version of the OxCAP-MH capability wellbeing questionnaire. The OxCAP-MH is a 16-item instrument originally developed for capability wellbeing measurement in the mental health context. Items are rated on a 1–5 Likert scale. The total level sum score assumes equal weights between the different items and is converted using the formula: 100 × [OxCAP-MH total score – minimum score)/range to arrive to a standardised score between 0 and 100. In other words, the total raw score is calculated by multiplying the number of items with the number of levels: 16 × 5 = 80, the minimum score therefore equals 16, whilst the range is 64 (16–80). Higher scores indicate better capabilities [20]. The OxCAP-MH has been so far validated in English [20], German [21, 22], Luganda [23], Hungarian [24] and Chinese [25] languages and used in multiple studies [26,27,28]. A sample copy of the OxCAP-MH is available at https://healtheconomics.meduniwien.ac.at/downloads/oxcap-mh/, and the full list of items is included in the Additional file 1.

Best–worst scaling (BWS) experiment

Relative preference weights of the OxCAP-MH capability items were elicited by a Best–Worst Scaling (BWS) experiment where an individual's decision regarding the most and least preferred options in a set of attributes is elicited across repeated hypothetical scenarios that are systematically varied via an experimental design.

The survey questionnaire consisted of the BWS instrument and several sociodemographic questions, including age, gender, education, living environment, marital status and indirect experience with mental disorders. The attributes of the hypothetical scenarios for the BWS were developed from the 16 OxCAP-MH items, in a set of qualitative interviews following pilot testing with seven persons [29, 30]. Item wording adaptation was necessary so that the attribute wordings reflected the correct individual choices and was carried out similar to the attribute development process for DCE studies [29].

The valuation of item weights was based on the object case of the BWS method. The object case is designed to determine the relative importance of the listed attributes (adapted OxCAP-MH items), which have no level, and choice scenarios differ merely in the particular subset of attributes shown [31]. Each participant was presented with 16 hypothetical scenarios with six attributes. This was considered as the optimal, sufficient design based on recommendations from the literature [32] and established in the pilot phase. Respondents were asked to select the most and least important attributes for them as an individual. An example of the choice task is shown in Table 1, and an exemplary questionnaire can be seen in the Additional file 1.

Table 1 Exemplary BWS task

The 16 items of the German OxCAP-MH, each of them with 5 levels (5^16), would result in 152,587.890,625 potential combinations of items and levels, i.e. capability states. Currently available BWS methods for the weighting of instruments are not able to handle this many hypothetical states because respondents would be presented with a number of hypothetical scenarios that they are not able to cope with. One potential option to solve this problem would be to divide the potential options into blocks, and only present respondents with a manageable number of scenarios. However, this would have required a study participant sample size that was not feasible in practice. We, therefore, concentrated on the preference weights of the 16 items without including levels. Hypothetical scenarios were reduced through Balanced Incomplete Block Design (BIBD) in Sawtooth Software [30, 33]. BIBD means that each item appears the same number of times, and it also forces paired combinations of attributes to appear together [34]. Part of the analysis focused on three cohorts of the overall sample which reduced the sample size, hence, response efficiency was increased by developing three blocks of the BWS questionnaire with the Paper-and-Pencil module of Sawtooth Software’s Lighthouse Studio 9.13.1, based on 1000 iterations. All three versions were equally distributed within each cohort of respondents and randomly assigned to them.

Participants and data collection

The study included three cohorts of participants from Austria: (1) psychiatric patients (direct mental ill-health experience), (2) (mental) healthcare experts (indirect mental ill-health experience), and 3) primary care patients with no mental ill-health experience. All participants had to be between 18 and 80 years of age and able and willing to complete the questionnaire. Sufficient intellectual capacities and German language skills were judged by the recruiting team. The minimum sample size estimate of 50 in each cohort, i.e. 150 respondents in total, allowed differences between means of a magnitude at least 0.57 within-group standard deviations to be detected with 80% power at a two-sided significance level of 5% and was similar to previous relevant studies that had aimed to develop weights or value sets to capability wellbeing questionnaires [7, 15, 17, 35].

Similar to some other elicitation studies (e.g. [16, 36]), convenience sampling was used. Mental health patients were approached in the relevant health care facilities by their treating health care professionals in the Department of Psychiatry and Psychotherapy, Clinical Division of Social Psychiatry at the Medical University of Vienna. Persons attending primary care appointments were approached in the waiting areas of the collaborating Primary Care Centres in Vienna. (Mental) health experts were recruited through collaborating professional contacts at the Medical University of Vienna. Participants were recruited between January and May 2020. The study received approval from the Medical University of Vienna (EK Medizinische Universität Wien: 1779/2019). People were first given a participant information sheet and received the BWS questionnaire after they provided consent to participation in the study. The survey was paper-based and self-completed. Data were pseudo-anonymised for further analysis.

Statistical analysis

The statistical analysis aimed to estimate weights to the OxCAP-MH items across different cohorts. Sociodemographic characteristics of the respondents were summarised as mean values by cohorts. The weights for the items were calculated by Hierarchical Bayes (HB) estimation using Sawtooth Software’s analysis tool. HB estimation was employed because its calculation is based on individual respondents and consequently enables segmentation by cohorts. Instead of estimating each respondent’s utilities individually, the algorithm estimates how different the respondent’s utilities are from the other respondents in the study [37]. Hence, the HB estimation is based on the probability that a respondent selects a specific concept in a choice task given a specific set of preferences and the probability that the respondent’s preferences are consistent with the patterns of the preferences observed in all other respondents [37]. HB estimation can yield reliable individual best–worst values even when the number of responses per participant is small [38]. The mean relative importance score (RIS) was calculated for each item based on HB estimation [39]. An individual fit statistic (root likelihood) per respondent below 0.2 was used to identify inconsistent responders following guidance from Orme [40]. A lower value would be an indication of purely random responses to the choice task. Differences between the RIS scores calculated by HB estimation were explored in a graphical presentation and tables including rank orders of the items. The rank order analysis based on HB estimation was repeated for cohorts, testing the differences across groups by Kruskal–Wallis equality of populations rank tests. Furthermore, to quantify linear association between variables, Pearson correlation coefficients between the RIS scores of the items for the full cohort were calculated and visualised by means of a heatmap.

Finally, multivariable linear regression analyses were conducted to explore the relative adjusted importance of the items across cohorts. The continuous RIS of each item was regressed on binary group indicators, whilst adjusting for gender and cohort. Other demographic variables (e.g., age, profession) were not considered as adjustments as they were main attributes of the cohorts' characteristics. Robust standard errors to account for violations of model assumptions and the implicit correlation of the outcome variables were obtained using the Jackknife method. It means that each participant was omitted in turn, the HB scores were re-estimated and the regressions of the HB scores on the covariates were re-fitted. From the regression coefficients of these 16 (items) \(\times\) 158 (participants) analyses, robust variances of regression coefficients were obtained by the following formula:

$${\widehat{v}}_{kj}=\frac{n-1}{n}\sum_{i=1}^{n}({\stackrel{\sim }{\beta }}_{kj}^{\left(-i\right)}-{\widehat{\beta }}_{kj}{)}^{2},$$

where \({\widehat{\beta }}_{kj}\) denotes the regression coefficient of covariate \(j\) for item \(k\) (where j and k are indices for covariate and item) from the full sample; and \({\stackrel{\sim }{\beta }}_{kj}^{(-i)}\) denotes the corresponding estimated regression coefficients obtained from re-estimating the HB scores after omitting participant \(i\).

The standard errors were calculated as \({\widehat{s}}_{kj}=\sqrt{{\widehat{v}}_{kj}}\), and p-values for testing the hypothesis \({\beta }_{kj}=0\) were obtained by comparing \({\widehat{\beta }}_{kj}/{\widehat{s}}_{kj}\) to a t-distribution with \(n-J-1\) degrees of freedom, \(J\) being the number of regression coefficients, i.e. four.

The HB estimation and count analysis were performed in Sawtooth Software, the regression analyses were calculated using R version 4.0.2 [41] and all other analyses were conducted with STATA Version 16 [42]. A two-sided significance level of \(\alpha =0.05\) was considered to indicate statistical significance in all analyses, unless stated otherwise. P-values were not corrected for multiple testing, but all performed comparisons were reported.

Indicative preference weight set development

An indicative preference weight set for the German OxCAP-MH was developed for Austria by a linear transformation of values based on HB estimation across all three cohorts. Since each of the 16 OxCAP-MH items has five levels and the assumption was that they carry proportionately constant weights, i.e. their contribution to the overall preference weight was assumed to be equally distributed. Worst capability levels carried the multiplication factor of 0, best capability levels the multiplication factor of 1, and those in-between the multiplication factors of 0.25, 0.5, 0.75, respectively. Similar to other capability instruments, the proposed indicative preference weight set of the OxCAP-MH instrument is anchored on a scale of 0 (no capability) to 1 (full capability).

Results

Participant characteristics

Initially, 235 people were approached, with response rates of ca. 90% for psychiatric patients, nearly 100% for (mental) healthcare experts, and ca. 70% for primary care attendees. In total, 195 persons agreed to participate, out of whom 159 participants fully completed the BWS questionnaire and was initially included in the analysis as shown in Fig. 1.

Fig. 1
figure 1

Overview of recruitment strategy and inclusion of participants

In a subsequent step, one psychiatric patient had to be excluded because the calculated HB fit statistic was under 0.2, and it was therefore assumed that the responses of this participant were provided by chance. Table 2 provides further details on participant characteristics of the final included 158 participants.

Table 2 Characteristics of participants

Relative importance of items

Figure 2 illustrates the RIS of the items calculated by HB estimation and by means of box plots. The mean RIS scores for the full cohort ranged from 0.76 to 15.72. The RIS based on HB estimation indicate very strong relative preference weights for the items Self-determination and Limitation in daily activities. The RIS also suggest that Influencing local decisions and Losing sleep over worry are regarded as the least important items.

Fig. 2
figure 2

Relative Importance Scores of OxCAP-MH items calculated by Hierarchical Bayes estimation (n = 158). White vertical line = median; box = interquartile range; ‘whiskers’ extending to max 1.5 interquartile ranges outside the box; any values outside the whisker are depicted individually

More detailed information about the Mean Relative Importance Scores and Standard Deviations of OxCAP-MH items calculated by Hierarchical Bayes estimation can be found in the Additional file 1.

Table 3 provides a rank order of mean RIS scores attributed to the individual OxCAP-MH items calculated by both HB estimation and count analysis, with links to Nussbaum’s ten central human capabilities.

Table 3 Rank orders per item using different analysis methods

Figure 3 provides a heatmap of the correlation between the items, indicating that some items are strongly correlated including: Social networks with Enjoying friendship and support; Likelihood of assault with Likelihood of discrimination; and Likelihood of discrimination with Freedom of expression.

Fig. 3
figure 3

Heatmap of pairwise Pearson correlation coefficients between OxCAP-MH items

Analysis by cohorts with different mental ill-health experience and gender

The rank orders and observed mean RIS scores for each of the three cohorts revealed that the mean RIS per cohort significantly differs in eight out of the 16 OxCAP-MH items (Table 4). This is also visible on the graphical presentation of mean RIS estimates by cohort (see Additional file 1).

Table 4 Rank order (1 = most important, 16 = least important) and observed mean Relative Importance Scores for each cohort (n = 158)

The regression coefficients with robust standard errors and p-values presented in Table 5 described the gender-adjusted difference in expected RIS between the cohorts and the cohort-adjusted difference between males and females. This analysis revealed that Freedom of expression was rated significantly less important by psychiatric patients than by the other two cohorts, given that the adjusted difference in RIS was 3.25 units, which is approximately half of the overall mean RIS for this item. Moreover, Having suitable accommodation appeared significantly less important for the expert cohort than for primary care patients with no mental ill-health experience. The difference for this item between psychiatric patients and experts was even larger in terms of the regression coefficient, but with a larger standard error. There were no further significant differences in the relative preference weights of OxCAP-MH items between the cohorts or according to gender.

Table 5 Multivariate regression analysis per item with robust standard errors (n = 158)

Preference weight set

Table 6 provides an indicative preference weight set that could be used as current best proxy to calculate individual scores for different capability states in economic evaluations [43].

Table 6 Preference weights of the OxCAP-MH items and their levels (n = 158)

Discussion

This study is the first to explore the relative preference weights of the items of the OxCAP-MH, a capability well-being instrument that was originally developed for outcome measurement in this context, and the impact of mental ill-health experience on these. As opposed to other capability instruments with existing preference weight or value sets such as the ASCOT [17], ICECAP-A [7], ICECAP-O [15] and ICECAP-SCM [35], the high number of items posed a major challenge in case of the OxCAP-MH. Hence, this study only focused on the items without incorporating preferences related to levels. The overall study sample was balanced for gender and for differing mental ill-health experiences between the cohorts.

The results confirmed significantly different preference weights attributed to the 16 items of the OxCAP-MH. The RIS scores for the full cohort ranged from 0.76 to 15.72 and indicated very strong relative preference weights for Self-determination and Limitation in daily activities, suggesting that these concepts have large spans within the ‘capability space’ covered by the OxCAP-MH. The relatively high importance of these items also demonstrates that respondents placed a much higher emphasis on individual aspects of capabilities than on community-based aspects, including Influencing local decisions, in the Austrian context [18]. The most important items of the OxCAP-MH are those related to Nussbaum’s practical reasoning and overall health capabilities, whilst aspects related to emotions and control over one’s environment are regarded significantly less important. These results are in line with the findings of the BWS experiment eliciting preference weights for the ASCOT instrument for service users in Austria. Hajji et al. found that the most important choices were related to being meaningfully occupied during the day and having control over daily life, whilst the least important choices were associated with dignity and social participation [18]. These findings are also in line with the results of the UK preference weighting study of the ASCOT [17]. Conversely, the development of the initial UK value set of ICECAP-A in 2013 found that attachment and stability had slightly higher contribution to the capability space then the items of autonomy, achievement and enjoyment. This suggests that the different capability instruments either elicit different aspects of capabilities, or the actual wording of questions and the cultural context play a more significant role than previously thought.

With regard to differences across the cohorts, we could not find evidence that mental ill-health experience or gender were significantly associated with the relative importance of the OxCAP-MH items. While rank analysis revealed some differences across the cohorts, this was not to a significant extent. The magnitudes of the regression coefficients with respect to the mean values of the OxCAP-MH item scores were relatively small across cohorts, apart from the Having suitable accommodation and Freedom of expression items. These findings suggest that potential future value sets for the OxCAP-MH elicited from the general population are likely to represent well also the direct preferences of people with lived experience of mental ill-health. These results somewhat contradict the findings that patient preferences differ from preferences derived from the general population for the EQ-5D-5L items [10]. In the study by Ludwig et al., 2021, patients in the EQ-5D-5L study gave more importance to mobility, self-care, or usual activities and less importance to pain/discomfort and anxiety/depression compared to the general population [10]. The differential findings between the two studies could be explained by the specific perception of mental health itself and the difference between health-related quality of life and capabilities as evaluative spaces.

Our study is unique by explicitly exploring the impact of ill-health experience on preferences in the context of a capability wellbeing instrument. Moreover, the study is based on a thoroughly designed BWS experiment and sophisticated statistical analysis. This takes into account the multivariate distribution with implicit correlation of outcomes, and robustness against model misspecification and violation of assumptions of linear regression, which probably implied larger estimated standard errors.

The interpretation of the study results, however, require the consideration of its limitations. One of the main limitations is the relatively small sample size. A larger number of observations would have strengthened the robustness of the statistical analysis, and also it would have enabled the formation of bigger groups for subgroup analyses, including latent class estimation. Moreover, we adjusted for gender, but not for other demographic variables because they were intrinsic to the definition of the population, for instance, in terms of experts’ age. This meant that the analysis compared the items between population cohorts in a simple form and did not artificially equalise their characteristics, which would have caused over-adjustment and could have induced bias. A further limitation is that the sample is not representative of the Austrian population because it is generally younger and all respondents live in or around Vienna. In the future, the current methods could be expanded and be developed into a full preference weight set using a representative Austrian general population sample. Finally, the research focused only on the capability items of the OxCAP-MH themselves and not their levels due to the focus of our research and the high number of possible item-level combinations which would have proven challenging to incorporate. The assumption that levels carry proportionately constant weights may prove problematic for an actual value set development. Similar BWS studies developing weights for capability instruments, including the ICECAP-A and the Carer Experience Scale, found that greater value was placed on differences between the bottom and middle levels of items than between the middle and top levels [7, 44]. Future preference weight set developments should address the issue of preferences related to levels. The indicative preference weight set could be used as current best proxy to calculate individual scores for different capability states in economic evaluations [42].

Conclusion

This study found no evidence that preference weights for the items of the mental health-specific OxCAP-MH capability well-being instrument could not be elicited from alternative population cohorts with differing mental ill-health experience. Interim, the current preference weight set can be seen as the best proxy for calculating individual scores for different capability states as defined by the OxCAP-MH in economic evaluations. Our approach can also serve as an example for other BWS preference elicitation studies for more complex patient-reported outcome measures (PROMs) in order to be used for cost-effectiveness analyses.