Key Points for Decision Makers

Consensus on the definitions of “scientific quality” and “quality appraisal” is a prerequisite for a standardised framework for assessing studies that elicit health state utility values in systematic reviews. A standardised framework helps ensure consistent, transparent and reproducible quality appraisal outcomes

A comprehensive quality appraisal in systematic reviews of health state utility values must evaluate three quality appraisal dimensions—reporting, relevance and methodological quality—for a holistic and rigorous assessment

Our research provides a framework and groundwork for developing a quality appraisal tool designed to elevate the assessment process of studies eliciting health state utility values, thus reinforcing evidence-based healthcare decision making

1 Introduction

Health state utility values (HSUVs), also referred to as “health utilities”, “utility weights”, “utility values” or “preference-based health-related quality of life measures”, are essential quantitative metrics signifying the cardinal strength of an individual’s preference for specific health-related outcomes or health states [1,2,3]. Health state utility values are typically anchored between 0 (representing death) and 1 (representing full health), and are used to adjust the length of life lived in a specific health state for the quality of life perceived in that state. The quality-adjusted life-year (QALY) is a widely used generic measure of health outcomes in comparative cost-utility analyses. Quality-adjusted life-years are calculated by multiplying the years lived in each health state by its corresponding HSUV [4]. For example, living 1 year in full health yields 1 QALY, whereas living the same year in an impaired health state with an HSUV of 0.7 yields 0.7 QALYs (1 × 0.7). Given that the choice of HSUVs has a significant impact on the outcomes of cost-effectiveness analyses, it is essential to have reliable and unbiased HSUV estimates.
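Stated generally, and consistent with the worked example above, the QALYs accrued across a set of health states $i$ can be written as

$$\text{QALYs} \;=\; \sum_{i} t_i \, u_i ,$$

where $t_i$ is the number of years lived in health state $i$ and $u_i$ is the corresponding HSUV; the single-state example above is the case $t = 1$ and $u = 0.7$, giving $1 \times 0.7 = 0.7$ QALYs.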

Recent years have witnessed an exponential increase in primary studies eliciting HSUVs, with researchers employing direct methods (standard gamble, time trade-off or visual analogue scale) and indirect methods (generic preference-based measures such as the EuroQol five-dimension [EQ-5D], Short-Form Six-Dimension [SF-6D] or Health Utilities Index [HUI], or mapping algorithms) [1, 5, 6]. Consequently, systematic reviews and meta-analyses have become indispensable tools for synthesising these studies across various decision-making contexts. Irrespective of their source, HSUVs used in an economic evaluation must be free of recognisable sources of bias. The measurement methodology should be validated, aptly suited to the relevant condition and setting, and aligned with the decision-makers’ viewpoint [2]. Therefore, quality appraisal (QA) of HSUV elicitation studies, which aims to ensure the credibility and reliability of HSUV estimates, is central when conducting systematic reviews to inform new health technology assessments [7,8,9,10,11,12,13].

However, the conduct of QA in systematic reviews of studies that elicit HSUVs needs improvement. Recent reviews [5, 14] estimated that only 55% of systematic reviews of studies eliciting HSUVs appraised the quality of the individual studies they included, a far lower proportion than in other research fields [15,16,17,18]. The low prevalence of QA in this field could be partly attributed to the lack of a widely accepted and scientifically developed QA tool [5, 14] specific to this context. This gap may arise from the unique features of these studies, which include multiple applicable study designs and elicitation methods [5]. Consequently, identifying an appropriate tool as recommended by other scholars [8], combining multiple existing tools or developing a bespoke tool is a time-consuming endeavour and a significant challenge.

Previous reviews have highlighted several QA tools used to appraise the quality of “broader” health economic evaluation studies [6, 19]. Yet, only a few of these QA tools directly apply to evaluating the quality of studies that elicit HSUVs (Table S.1 of the Electronic Supplementary Material [ESM]). Moreover, these tools differ considerably in their QA items, QA dimensions and synthesis of QA results.

Our previous review [5] showed that reviewers typically assess three QA dimensions: reporting, methodological limitations and risk of bias (RoB), and relevance, albeit to varying extents. Additionally, the terminology used to describe the QA process varied widely across the systematic reviews analysed [5] and included QA or assessment [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37], critical appraisal [38], RoB assessment [39,40,41,42,43,44,45], relevancy and quality assessment [46, 47], assessment of quality and data appropriateness [48], methodological quality assessment [49,50,51,52,53], reporting quality [54, 55], and credibility checks and methodological review [56] (see Fig. S.1 in the ESM). This inconsistency largely stems from the absence of a standardised conceptual framework deconstructing scientific quality and the QA process; providing such a framework is the overarching aim of the current study.

Quality is a multidimensional concept that varies according to the context and field of study [57, 58]. The concept of ‘scientific quality’ has been introduced to differentiate research quality from other forms of quality, such as product or process quality [59]. Nonetheless, to our knowledge, only one study has attempted to establish an agreed-upon definition, without success [60]. Existing definitions range widely, from the likelihood of generating unbiased results in comparative clinical effectiveness research [61] to definitions encompassing dimensions such as relevance and applicability [62, 63], generalisability and imprecision in other research fields [13, 58, 64,65,66].

Stemming from these heterogeneous definitions of scientific quality are the considerable variations and inconsistencies in how QA has been and is currently applied in most systematic reviews (not only of HSUVs) [5, 67]. For example, Viswanathan and colleagues [65, 66], basing their argument on Cochrane’s recommendations, differ significantly from the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group framework [68]. On the one hand, Viswanathan and colleagues [65, 66] advocate assessing RoB rather than “quality” because the term quality is used inconsistently across many fields. On the other hand, the GRADE framework considers quality to be more than just RoB, as quality also includes imprecision, inconsistency, indirectness of study results and publication bias. The GRADE framework then uses these dimensions to make overall judgements regarding the strength of the body of evidence [68].

Often, systematic review authors assess reporting quality and treat it as the overall study quality [69], yet the quality of reporting may not reflect the quality of the study [70, 71]. In fact, a focus on reporting quality alone may overestimate the overall quality of a study [69]. We previously posited that a comprehensive QA should encompass all three QA dimensions, which together constitute necessary and sufficient components [5]. Evaluating relevance and methodological limitations is contingent on reporting quality; after all, appraisal can only be based on what is documented.

Given this context, there is a pressing need for available tools and guidelines to offer more explicit directives on the constructs, dimensions and items that are pivotal for rigorous QA. A widely accepted definition of scientific quality—one that identifies the dimensions and constructs pertinent to QA in systematic reviews of studies eliciting HSUVs—might present a viable solution to the existing conundrum [58].

Considering the importance placed on HSUVs in the health economic evaluation of new technologies and interventions [9], and the inconsistencies highlighted above, developing a QA tool specific to HSUVs is pertinent and timely. The present study aims to set the stage for future work on developing an evidence-based QA tool specific to studies eliciting HSUVs by conceptualising scientific quality and QA in systematic reviews of studies eliciting HSUVs. This entails:

  1. Establishing a working definition for scientific quality and QA.

  2. Establishing and defining the relevant dimensions for the QA of studies eliciting HSUVs.

  3. Defining the scope of a QA tool specific to studies eliciting HSUVs.

2 Methodology

This study builds on our previous rapid review of the current nature of QA in systematic reviews of studies eliciting HSUVs [5]. From this review, we discerned three QA dimensions frequently assessed in such systematic reviews, albeit to varying degrees. Additionally, we collated terminologies (see Fig. S.1 in the ESM) and definitions related to “quality” and “QA” (see Tables S.2 and S.3 in the ESM). For pragmatic reasons, we opted for a modified Delphi technique to facilitate a consensus among experts regarding the definitions of quality and QA, as well as the components that should be integral to QA in our specified context.

2.1 Study Design

The conventional Delphi method, developed by the RAND Corporation in the 1950s, is a structured technique for achieving formal consensus on specific issues among panel members [72,73,74]. The method avoids face-to-face interaction, ensuring anonymity or quasi-anonymity among participants throughout the process. Instead of direct discussion, communication hinges on a series of iterative questionnaires designed to capture insights and opinions from the participants [72, 74,75,76].

We identify our methodological approach as a modified Delphi technique [72, 75], sometimes referred to as a modified nominal group technique [75]. The modification was twofold: our panel comprised only seven experts, and after two rounds of online questionnaires, we conducted a virtual face-to-face meeting with the experts. Subsequent sections elaborate on the reasoning behind this choice and on critical components of the study, such as participant selection, the preservation of anonymity, the iterative process and group response (or consensus) [72].

2.1.1 Steering Committee

The steering committee comprised four members of the project team: MTM (who served as the project leader), KH, RE and MS. Their responsibilities encompassed conducting literature review(s), developing and piloting the questionnaires, analysing and reporting the responses at each stage, organising the virtual panel discussion and moderating the overall process. Importantly, all committee members refrained from expressing their personal opinions during the consensus-building exercises.

2.1.2 Selection of Experts

The steering committee sought to enlist experts seasoned in conducting systematic reviews of studies eliciting HSUVs who were also recognised for their knowledge and experience in health technology assessments, mapping studies, health-related quality of life and core health economic evaluations. Moreover, potential contributors had to have authored peer-reviewed publications in one or more of these domains.

Given the stringent inclusion criteria and the limited number of systematic reviews that appraised the quality of their incorporated studies (40 out of 73) [5], the eligible potential participant pool was considerably restricted. To ensure representation from the desired domains in the eligibility criteria, we set a minimum target of five experts, without capping the maximum.

To achieve our recruitment goal, we purposively constituted an international multidisciplinary expert panel. Personalised emails were sent to 23 experts between September and December 2021. These individuals were identified through the articles deemed eligible for the rapid review [5], the QA tools referenced in those articles, or reference searches from these articles.

2.2 Modified Delphi Rounds

The appeal of the traditional Delphi method stems from the principles of expert anonymity, controlled feedback and iterative discussions during the process [72,73,74,75]. Advocates of the Delphi technique believe that its structured and controlled nature helps counter the drawbacks often seen in face-to-face meetings, such as undue influence from other experts (dominance) and group conformity (groupthink) [72, 73, 75].

2.2.1 Expert Anonymity

True anonymity is ensured when no one (including the researcher) can trace a response back to a respondent [76]. In this study, the project leader managed all communication exclusively through individualised e-mail interactions, rather than group e-mails, and the analysis was conducted without identifiers that could be linked back to a particular expert. Because one researcher (MTM) knew all the respondents and their responses, however, true anonymity was unattainable; instead, we maintained quasi-anonymity [76].

2.2.2 First-Round Questionnaire

In the first-round questionnaire, Section A was designed to gather information on the characteristics of the contributing experts, including their background and prior experience in conducting systematic reviews of studies eliciting HSUVs.

Section B explored the experts’ perspectives on various conceptual issues related to quality and QA in systematic reviews of studies eliciting HSUVs. We deliberately refrained from providing predefined definitions of “quality” and “quality appraisal”; instead, the experts were asked to offer concise definitions of and comments on these terms. They were also asked to indicate whether they considered QA an integral part of systematic reviews of studies eliciting HSUVs.

The definitions of the reporting, methodological limitations and RoB, and relevance quality dimensions presented in the first-round questionnaire were crafted by the steering committee, drawing upon the prior literature [15,16,17, 67] and their theoretical understanding of the terms. The experts were then asked to rate their agreement on whether they considered these three dimensions to be fundamental aspects of systematic reviews of studies eliciting HSUVs.

To comprehensively capture the experts’ opinions and insights, we combined a five-point Likert rating scale (ranging from strongly disagree to strongly agree) with open-ended questions. The open-ended questions allowed the experts to express and discuss their views and opinions in greater detail. The first-round questionnaire can be found in the ESM.

2.2.3 Controlled Feedback

For controlled feedback, the first round’s responses and comments were summarised descriptively (quantitatively and qualitatively); any potential identifiers were removed to preserve anonymity. The input from the experts was qualitatively analysed, and key concepts and themes were extracted and presented in text boxes. Integrating these insights with themes from the existing literature [60,61,62,63, 77,78,79], we formulated working definitions of scientific quality and QA for the second-round questionnaire (see Tables S.2 and S.3 of the ESM). Similarly, based on the experts’ feedback and previous literature [15,16,17, 67], we fine-tuned the working definitions of reporting, methodological limitations (RoB) and relevance for the second-round questionnaire (see Table S.4 of the ESM).

In the quantitative analysis, we calculated frequencies for each Likert-scale level or for the “yes”/“no” responses, and presented these as bar charts. A comprehensive report of the first-round responses was shared with all the experts through separate e-mails.
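For illustration only, the following minimal sketch shows one way such a frequency tabulation and bar charts could be produced. The column names and response data are hypothetical (loosely echoing the first-round counts reported in the Results) and do not reproduce the study’s actual analysis code:

```python
# Hypothetical illustration of the descriptive tabulation described above;
# the item names and responses are invented for this sketch.
import pandas as pd
import matplotlib.pyplot as plt

LIKERT = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# One row per expert, one (hypothetical) questionnaire item per column.
responses = pd.DataFrame({
    "reporting_is_fundamental": ["Strongly agree"] * 5 + ["Agree"] * 2,
    "relevance_is_fundamental": ["Strongly agree"] * 3 + ["Agree"] * 3 + ["Neutral"],
})

for item in responses.columns:
    # Count responses per Likert level, keeping the full scale in a fixed order.
    counts = responses[item].value_counts().reindex(LIKERT, fill_value=0)
    counts.plot(kind="bar", title=item)
    plt.ylabel("Number of experts")
    plt.tight_layout()
    plt.show()
```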

2.2.4 Second-Round Questionnaire

The proposed working definitions for quality, QA and the three QA dimensions (i.e. reporting, methodological limitations and RoB, and relevance) from the first-round questionnaire were provided as part of the second-round questionnaire. A statement about the purpose and scope of a QA tool for studies eliciting HSUVs was also presented. Consensus was defined a priori as an item reaching ≥85% agreement (i.e. six out of seven experts). The consensus level was lowered to 83% (5/6) during the analysis because only six of the seven experts completed the second-round questionnaire.
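Concretely, the threshold arithmetic works out as

$$\tfrac{6}{7} \approx 85.7\% \;(\geq 85\%), \qquad \tfrac{5}{6} \approx 83.3\%,$$

so with six second-round respondents, 5/6 agreement was the closest attainable counterpart to the original six-out-of-seven (85%) criterion, hence the restated 83% level.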

Unlike in the first-round questionnaire, all responses were coded in a binary “yes”/“no” format and analysed descriptively (quantitatively and qualitatively). The second-round questionnaire is presented in the ESM. As in the first round, a report of the second-round responses was prepared and shared as controlled feedback with all the experts through individualised e-mails.

2.2.5 Virtual Panel Discussion

Following the second-round questionnaire, it became apparent that agreement on some of the remaining issues would be more effectively facilitated through an open virtual panel discussion than through a third anonymous questionnaire round. Thus, in consultation with the experts, the steering committee set aside anonymity and invited all experts to participate in a virtual panel discussion.

All experts were invited to join a 3-hour virtual consensus panel discussion held on 31 May, 2023 using the Zoom virtual conferencing platform. Real-time polling was used during the meeting to facilitate a consensus on each point of discussion. The session was recorded to enable a thorough review and post-discussion transcription.

The video and audio recordings were transcribed by a steering committee member (MTM) and deleted once the transcription had been verified for accuracy. The polling results and discussion content were analysed descriptively and anonymously.

3 Results

3.1 Description of the Expert Panel

Seven international experts, all health economists except for one biostatistician with a health economics background, consented to participate in the modified Delphi study. The panel members represented a range of countries: the UK (three experts), Italy (one expert), Australia (two experts, who later relocated to the UK and Switzerland, respectively) and Germany (one expert).

Importantly, the majority (6/7) had significant prior experience with systematic reviews of HSUVs: 5/7 rated themselves as very familiar with such reviews, whereas the remaining two considered themselves somewhat familiar. Among them, three had previously co-authored a systematic review of studies eliciting HSUVs, four had participated as contributing health economists and one as a contributing biostatistician.

One expert in the panel had not directly participated in a systematic or rapid review of studies eliciting HSUVs. However, this expert had conducted significant methodological research on HSUVs and contributed to the latest developments in both methodological and applied research in the field.

3.2 Response Rates to the Questionnaires and Virtual Panel Discussion

Response rates to the first-round and second-round questionnaires and the virtual consensus meeting were 100%, 86% and 71%, respectively. The first-round questionnaire was completed on 8 April, 2022, the second round on 23 September, 2022, and the virtual meeting was held on 31 May, 2023. The completion rate was 100% for all questions regarding the characteristics of the experts and conceptual considerations.

3.3 Definition of Scientific Quality and QA

Key constructs extracted from the definitions of quality suggested by the experts, combined with themes from the existing literature, include “using the term scientific quality instead of quality”, “the validity of results and questionnaire”, “attention to methods and methodology”, “accurate and comprehensive reporting”, “transparency”, “relevance”, “replicability”, “reproducibility” and “bias minimisation”.

Key constructs extracted from the definitions of QA suggested by the experts, combined with themes from the existing literature, include “independence”, “systematic”, “explicit”, “transparency” and “thorough (robust).” Some experts also highlighted the need to consider internal validity, RoB, reporting standards, and methodological or reporting quality in QAs (see Box S.1 of the ESM).

Figure 1 illustrates the consensus from the second questionnaire on the term “scientific quality”, with minor comments regarding word order and spelling (see Box S.2 of the ESM). The agreed working definition of scientific quality following the second round of questionnaires and minor revisions by the steering committee is reported in Table 1.

Fig. 1

Second-round questionnaire: experts’ opinions on the proposed definitions of “scientific quality” and “quality appraisal (QA)” and on how important QA is in systematic literature reviews (SLRs) of studies eliciting health state utility values (HSUVs). ≥ 83% (5/6) = agreement or consensus

Table 1 Key definitions of QA components for systematic literature reviews of studies eliciting health state utility values

Nevertheless, disagreement regarding the definition of QA was observed (Fig. 1). Some experts advocated for a more concise definition, while others suggested removing the word “robustness” from the definition (Box S.3 of the ESM). After substantial deliberation and revisions in the virtual meeting (for intermediary results, see Box S.4 of the ESM) and further refinement during manuscript proofreading and revision, the consensus definition is delineated in Table 1.

3.4 Terms That Can Be Considered Synonymous with QA

In the first-round questionnaire, the experts presented varied opinions on terms synonymous with QA. Most of the experts (5/7) initially suggested that the terms “quality assessment” and “critical appraisal” could be used synonymously with QA. However, terms like “credibility checks”, “reporting quality assessment” and “data appropriateness” were not viewed as direct synonyms. Instead, most experts regarded these terms as constituents or dimensions of a QA (see Box 1).


Box 1 First-round questionnaire: an extract of the experts’ comments on terms considered synonymous with quality appraisal (QA)

Contrary to the first round, in the subsequent round there was unanimous consensus that the term “quality appraisal” can be used synonymously with “quality assessment” (Fig. 2). However, consensus was not reached that it is synonymous with “critical appraisal” (50% disagreed), and none of the other presented terms were considered synonymous (100% disagreed).

Fig. 2

Second-round questionnaire: experts’ opinions on terms considered synonymous with quality appraisal. ≥ 83% (5/6) = agreement or consensus

3.5 QA Dimensions

In the first-round questionnaire, the majority (5/7) of the experts strongly agreed, and the remaining 2/7 agreed, that the reporting and the methodological limitations and RoB dimensions are essential for the QA of systematic reviews of HSUVs. However, opinions on the “relevance” dimension diverged considerably: one expert was neutral (neither agree nor disagree), three agreed and the remaining three strongly agreed on the significance of the relevance dimension to QA in systematic reviews of HSUVs. Some experts suggested that the definitions of these three dimensions required further refinement and elaboration. Notably, although supportive of including reporting quality in a QA, one expert pointed out cases in which it may not hold equal weight (see Box 2).


Box 2 First-round questionnaire: extract of the experts’ comments regarding the definitions of the three quality appraisal (QA) dimensions and whether to include these in a QA tool specific for studies eliciting health state utility values (HSUVs). RCTs randomised controlled trials, RoB risk of bias

After rephrasing the three QA dimensions’ definitions (see Table S.4 of the ESM), there was unanimous agreement among the experts to include them in a QA tool specific to HSUVs. Furthermore, as shown in Fig. 3, a consensus (yes ≥83%) was reached on the proposed definition of the three QA dimensions.

Fig. 3

Second-round questionnaire: consensus on defining the three quality appraisal (QA) dimensions for systematic literature reviews of health state utility values (HSUVs). RoB risk of bias

A remaining concern was the practical application of the “relevance” dimension. The experts wondered how the issue of study relevance related to the study perspective (experienced patient utilities vs general population utilities, or ex-ante vs ex-post utilities) and to the study population for HSUV valuation (see Box 3).


Box 3 Second-round questionnaire: extract of the experts’ comments regarding the definitions of the three quality appraisal dimensions. HSUVs health state utility values

3.6 Scope of a QA Tool for Systematic Reviews of Studies Eliciting HSUVs

A question introduced in the second-round questionnaire asked the experts what they would consider a plausible scope for a QA tool in systematic reviews of studies eliciting HSUVs. Figure 4 depicts the experts’ views on how broad the scope of the proposed QA tool under development should be.

Fig. 4

Second-round questionnaire: experts’ opinions on the proposed scope of a quality appraisal (QA) tool specific to studies eliciting health state utility values (HSUVs). CCSs case-control studies, CSSs cross-sectional studies, HTA health technology assessment, RCTs randomised controlled trials, RoB risk of bias. ≥ 83% (5/6) = agreement or consensus

Notably, the experts disagreed with the suggestion that the questionnaire should be reasonably short and should allow a consistent and reliable quality assessment by raters with different backgrounds (Box 4). One expert suggested that a QA tool specific to studies eliciting HSUVs should not exclude mapping studies from its scope a priori, because some items evaluated for direct methods, such as sample size, relevance and reporting, also apply to mapping studies. Another concern was whether randomised controlled trials are an applicable research design for studies eliciting HSUVs.


Box 4 Second-round questionnaire: an extract of the experts’ comments on the proposed scope of a quality appraisal tool specific to studies eliciting health state utility values (HSUVs). RCTs randomised controlled trials

4 Discussion

We elucidated the essential conceptual considerations for developing a QA tool specific to systematic reviews of studies eliciting HSUVs. Such systematic reviews may be conducted for publication, for health economic model development or for broader health technology assessment purposes. This study defined scientific quality, QA and three QA dimensions by synthesising insights from a comprehensive literature review with the opinions of seven international experts across three modified Delphi rounds. Furthermore, this study proposed a preliminary scope (boundaries) for future QA tools.

Scientific quality is a nuanced multidimensional concept highly specific to the context and field of study [13, 58, 64,65,66]. Contrary to previous unsuccessful attempts to define quality [60], a consensus on the working definition of scientific quality was reached during the second-round questionnaire. It took a further virtual panel discussion for the experts to agree on the working definition of QA, underscoring the existing ambiguity and heterogeneity in applying the terms.

In its simplest form, QA involves evaluating a study’s scientific quality. Our intention is not to dictate precisely how studies that elicit HSUVs should be conducted; however, we recognise that a universal scientific quality standard for these studies is crucial. To facilitate this, we propose a QA tool specific to HSUV elicitation studies that may promote adherence to high-quality standards. This proposition aligns with the sentiments of Wolowacz and colleagues [58], who advocated that HSUVs intended for policy and decision making should be derived from validated methods, should minimise bias and should be relevant to the condition, population and decision-maker’s perspective. Detailing all elements that constitute scientific quality and QA in systematic reviews of studies eliciting HSUVs is a first step toward realising this goal.

Evident from the definitions of scientific quality and QA are the primary constituents of QA in systematic reviews of studies eliciting HSUVs: reporting, relevance, and methodological limitations and RoB. The intricate relationship between these dimensions warrants special consideration. Methodological limitations or flaws, such as high attrition rates, incorrect use of utility assessment tools or improper statistical methods, can compromise the reliability of the derived HSUV estimates, underscoring the need to inspect all studies contributing to a systematic review for methodological flaws and the bias that may result from them.

Unlike the commonly assessed clinical outcomes in clinical effectiveness studies, which can be considered comparable across different settings and not affected by context, HSUVs are context sensitive [5]. Primary studies eliciting HSUVs have limited relevance to a research question, policy or decision framework if they differ significantly in attributes, such as populations included, specific health conditions, measurement instruments, health state descriptions, and valuations or settings. Thus, it is essential to assess the relevance of HSUVs to specific research questions or decision frameworks.

More importantly, a comprehensive study report is essential for assessing the other two dimensions, as systematic review authors rarely undertake the additional step of contacting the principal investigators of the primary studies eliciting HSUVs. Thus, all three QA dimensions are necessary and sufficient components of a robust QA tool. Achieving a consensus on the definitions of the three QA dimensions will further augment efforts to mitigate ambiguity among these three pillars of scientific quality regarding studies eliciting HSUVs.

Our endeavour to develop a QA tool specific to systematic reviews of studies eliciting HSUVs aligns with the efforts of other research groups [6, 80]. However, our study diverges from these efforts by proposing a unified QA tool designed to distinguish and evaluate reporting, relevance, methodological limitations and potential bias in primary studies that elicit HSUVs.

The three QA dimensions are not new to systematic reviews of studies eliciting HSUVs. Many systematic review authors have already considered these dimensions, albeit inconsistently [5]. For instance, only 18% of the systematic reviews of studies eliciting HSUVs published between 2015 and 2022 evaluated all three QA dimensions [5]. Furthermore, the need to differentiate these three QA dimensions has been underscored in various studies [62, 63, 69, 81, 82].

The opinions of the experts regarding study relevance and perspective require careful consideration and discussion. This study provides a succinct definition of study relevance concerning HSUVs. A generally accepted assumption in conventional normative health economics is that society is rational and that the primary goal of choosing certain health technologies over others is to maximise health [83]. In this regard, the health component is widely measured using QALYs as a generic outcome. What health economists often disagree on is whose utility values should be used in the QALY computation to maximise health, fuelling debates about the perspective of analysis [83, 84].

While the perspective of analysis in health economic evaluations primarily prescribes the breadth of costs to include, another essential perspective to consider is whose utilities (preferences or utility weights) are used when valuing health. The choice of appropriate HSUVs depends on the viewpoint (perspective) of the decision maker (i.e. from whose point of view is the decision on a healthcare intervention being made?) [85]. A related discussion borrowed from mainstream behavioural economics revolves around decision utilities (ex-ante preferences of decision makers or society as a whole for states they have not experienced) and experienced utilities (ex-post preferences of individuals who have experienced the health states) [84]. Different perspectives can include those of the general population, patients, clinicians or decision makers. Notably, utilities derived from patients can differ from other utilities owing to variations in the decisional context and other factors [83]. Each type of utility has its strengths and limitations, and the choice depends on the specific evaluation objectives and requirements. An explicit statement of the purpose of an economic evaluation and of a systematic review of studies eliciting HSUVs is crucial, as it determines the analysis perspective, relevant population, setting and health valuation techniques, all of which are integral components of the relevance dimension.

A fundamental first step in developing any scale or QA tool is a clearly defined boundary between what the tool can and cannot do [60, 62, 86]. A third of the current panel members disagreed that a QA tool specific to studies eliciting HSUVs should be reasonably short and allow reliable and consistent assessments by raters with different backgrounds. This result can be considered inconsistent with previous tool development exercises [62]. Nevertheless, a pertinent consideration when developing a QA tool is the anticipated burden on the raters: a lengthy questionnaire with too many items and signalling questions may be burdensome and may deter raters from completing the QA.

Furthermore, item and construct overlap is more likely to occur with an increasing number of items and, ultimately, a longer questionnaire. Indeed, one reason for revising QUADAS-1 into QUADAS-2 was that users reported problems rating certain items that seemed to overlap; the proposed solution was to limit the number of domains and signalling questions [62]. Unfortunately, there are no strict guidelines on how many items to include, how long a QA tool should be, or how long raters should take to complete the QA of a single study eliciting HSUVs.

It is also prudent to use simple language that is easily understandable by raters with different backgrounds. Using complex terms such as “construct-irrelevant variance” and “quality of construct representation”, as done by Eiring and colleagues [80], could be misleading and fuel the existing inconsistencies in QA in systematic reviews of studies eliciting HSUVs. In developing the ROBINS-I tool [86], the developers noted the challenges brought about by variations in the terminology used to describe different domains and items. For example, terms such as “selection bias” may be confused with related but different terms such as “confounding”. For the same reasons, the recent version of RoB 2 avoided terms such as “selection bias”, “performance bias”, “attrition bias” and “detection bias” [87]. While most systematic reviews of studies eliciting HSUVs will likely be performed by people with relevant knowledge of health economic evaluations, using easily understood terminology remains essential.

The major strength of this study is its ability to triangulate the findings of a previous literature review with the knowledge of experts in the field. Keeping the experts anonymous during the two rounds of questionnaires ensured that the contributing experts freely expressed their views without being influenced or dominated by others [72,73,74,75,76]. Akin to a focus group discussion, the virtual meeting afforded experts space to ask questions and explain their viewpoints, delving deeper into the subjects at hand—a depth often missing in individual-based questionnaires [88]. It is also widely accepted that a well-facilitated interactive meeting can bolster participants’ contributions by providing opportunities to clarify or rephrase questions, thus enhancing both comprehension and quality of response [89]. As a result, we captured the depth and complexity of experts’ views, particularly for the more complex topics of interest that were unresolved after the two rounds of questionnaires, and reached a consensus on the definition of QA.

Because of the limited number of researchers with the expertise required to contribute to this study, the generalisability of our findings may be limited. Nevertheless, the experts were drawn from different countries and had diverse competencies regarding HSUVs (i.e. systematic reviews of HSUVs, health technology assessments, mapping algorithms, health-related quality of life and mainstream health economic evaluations). Moreover, combining questionnaire-based approaches with a face-to-face meeting reduced the need for a larger sample size; the nominal group technique, for example, is typically effective when the group size is small [75, 90].

Another limitation that often affects qualitative research is the risk that the steering committee imposes its views throughout the study. To limit this bias, the steering committee sought to ensure high levels of transparency throughout all study phases, for example, by using text boxes to report the experts’ raw comments.

5 Conclusions

This study defined scientific quality and QA in the context of systematic reviews of studies eliciting HSUVs. In addition, a consensus was reached on the scope and boundaries of a QA tool specific to this context. Based on these definitions, the experts concurred that an effective QA should discern among reporting, relevance and methodological quality while being applicable to multiple design features and utility elicitation techniques. This consensus represents a fundamental step towards harmonising or standardising the QA process in this field. Future work should leverage this foundation to identify QA items, signalling questions and response options, and to develop a QA tool for systematic reviews of studies eliciting HSUVs.