Plain English summary

Instruments that measure health outcomes are important for making treatment decisions and understanding diseases. Systematic reviews are used to compare different instruments and help select the best one for a specific situation. Previous studies have shown that the quality of these reviews can vary and may not always meet scientific standards. Since then, new tools and methods have been developed to help systematic review authors in improving the quality of their work. This study looked into the quality of recent systematic reviews of instruments. The study identified important improvements over time. For example, risk of bias is more often evaluated, and the data is analyzed in a better way. However, the study also shows that there are still areas that need improvement. These include formulating a clear research question, and creating a comprehensive search strategy. Ongoing efforts are needed to improve the quality of systematic reviews of instruments. This can be achieved by developing new and accessible resources.

Introduction

Outcome measurement instruments (OMIs) are used to evaluate the impact of disease and treatment [1,2,3]. When many different OMIs that measure similar constructs are available [1, 4, 5], the choice for an OMI depends on various aspects, including its quality (i.e., the sufficiency of measurement properties) [6]. Systematic reviews in which the measurement properties of OMIs are critically evaluated and compared are important tools for the selection of an OMI [4], for example in core outcome sets used in research projects or clinical practice [7]. With these systematic reviews, gaps in knowledge about the measurement properties of OMIs can also be identified.

Only well-designed, well-conducted, and comprehensively reported systematic reviews can provide a complete and balanced overview of the measurement properties of OMIs [4]. High-quality systematic reviews have: a well-defined research question; a comprehensive search strategy in multiple databases; independent abstract and full-text article selection; a risk of bias assessment of included studies; a systematic evaluation and synthesis of the results; and a certainty assessment of the body of evidence [8].

Previous overviews appraising the quality of systematic reviews of OMIs identified major limitations in the search strategy, the risk of bias assessment, and the evaluation and synthesis of the measurement properties’ results [9, 10]. These limitations preclude systematic reviews from providing a complete and unbiased overview of the measurement properties of OMIs. This has consequences for knowledge users, who rely on the findings of these systematic reviews and might select suboptimal OMIs to use in their research or clinical practice [11]. This in turn affects the measurements conducted on patients, which might be invalid and unreliable, and could even lead to incorrect healthcare decisions.

Various methodologies and practical tools have been developed to guide authors in conducting high-quality systematic reviews of OMIs [4, 12, 13]. The methodology and tools developed by the COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) initiative are the most comprehensive and most widely used (Fig. 1) [14]. Since the most recent overview that assessed the quality of systematic reviews of OMIs, published in 2016 [10], the COSMIN guideline for systematic reviews has been developed [4] and the COSMIN risk of bias checklist has been updated [15, 16]. Other methodologies and tools for critical appraisal of OMIs have also been developed and updated since then [12, 17]. When reading or reviewing such systematic reviews, even those that claim to have used these guidelines, we often observe flaws in the design, conduct, and reporting. The aim of this overview of reviews was therefore to investigate whether the quality of recent systematic reviews of OMIs lives up to the current scientific standards. As a secondary aim, we explored which aspects have notably improved over time.

Fig. 1 COSMIN tools and methodology

Methods

The study protocol was registered in the PROSPERO database, number CRD42022320675 [18]. There were no important deviations from the protocol. The study was reported according to the preferred reporting items for overviews of reviews (PRIOR) statement [19]. Consistent with the previous overview [10], we randomly selected 100 out of 136 most recent systematic reviews from the COSMIN database of systematic reviews [20]. These reviews were identified while updating the COSMIN database through a systematic literature search performed on March 17, 2022 in MEDLINE (through PubMed) and EMBASE (through www.embase.com), and concerned systematic reviews of OMIs published from June 1, 2021 onwards. The search strategy consisted of search terms for systematic reviews, search terms for OMIs, and a validated search filter for measurement properties [21]. The full search strategy can be found in Supplementary File 1. Table 1 contains inclusion and exclusion criteria for the COSMIN database [20]. We defined systematic reviews of OMIs as peer-reviewed studies with a systematic search in at least one electronic database which aimed to summarize evidence on the measurement properties of all OMIs of interest to the review.
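The three-part structure of the search described above (terms for systematic reviews, terms for OMIs, and a filter for measurement properties) can be illustrated with a minimal sketch. Synonyms within a block are joined with OR and the blocks are joined with AND. The terms below are hypothetical placeholders, not the terms from Supplementary File 1 or the validated filter [21], and the function names are illustrative only.

```python
def or_block(terms):
    """Join the synonyms for one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(*blocks):
    """Join concept blocks with AND, so a record must match every block."""
    return " AND ".join(or_block(block) for block in blocks)


# Hypothetical placeholder terms for each block (assumptions, not the real strategy).
review_terms = ['"systematic review"', '"literature review"']
omi_terms = ["questionnaire", "instrument", "scale"]
property_terms = ["reliability", "validity", "responsiveness"]

query = build_query(review_terms, omi_terms, property_terms)
print(query)
```

Because the blocks are combined with AND, adding a block can only narrow the result set. This is why including a measurement-property block restricts the results to OMIs with published measurement-property evidence.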

Table 1 Inclusion and exclusion criteria for the COSMIN database [20]

Eligibility for inclusion in the COSMIN database was determined by one reviewer (IS). All reviewers confirmed that each review appraised in the current study complied with the inclusion and exclusion criteria. If a review was selected from the COSMIN database that should have been excluded (false-positive), this review was replaced by a randomly selected new review after confirming exclusion by a third reviewer (LM).
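The selection procedure can be sketched as follows, assuming a simple random draw of 100 reviews from the 136 candidates and replacement of a false-positive by a random draw from the remaining reviews. The function names, the identifiers, and the seed are hypothetical; the source does not describe the software used for randomization.

```python
import random


def draw_sample(candidate_ids, n, seed=None):
    """Shuffle the candidates and split them into a sample of n and a reserve."""
    rng = random.Random(seed)
    shuffled = list(candidate_ids)
    rng.shuffle(shuffled)
    return shuffled[:n], shuffled[n:]


def replace_false_positive(selected, reserve, review_id):
    """Swap an ineligible review for the next review in the shuffled reserve.

    The reserve is already in random order, so taking its first element
    is equivalent to a random draw from the remaining reviews.
    """
    position = selected.index(review_id)
    selected[position] = reserve.pop(0)
    return selected


# Hypothetical identifiers 1..136 for the most recent reviews.
selected, reserve = draw_sample(range(1, 137), 100, seed=2022)
excluded = selected[0]
replace_false_positive(selected, reserve, excluded)
print(len(selected), len(reserve))
```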

A study-specific data extraction form (Supplementary File 2) was developed to appraise the quality of systematic reviews of OMIs, which includes both methodological quality and reporting quality, two aspects that cannot be considered separately when appraising the quality of published OMI systematic reviews. The data extraction form was based on criteria used in previous studies [9, 10], which were updated for this study. The data extraction form contained items on the key elements of the review (i.e., construct, population, type of OMI, and measurement properties of interest), search strategy, eligibility criteria, article selection, data extraction, risk of bias assessment, evaluation of measurement properties, data synthesis, certainty assessment, presentation of results, instrument recommendation, and elements of open science. Specifically, the appropriateness of the search for the construct, population, type of OMI and measurement properties was based on published search filters [21, 23], search terms found at blocks.bmi-online.nl, and the reviewers’ own knowledge. For each item, two independent reviewers extracted information on whether this was done/reported in the included reviews. No attempts were made to verify information with study authors. Reviewers also noted any major methodological and reporting flaws for each of these aspects.

The data extraction form was pilot tested with six OMI systematic reviews [24,25,26,27,28,29] by two independent reviewers (different pairs of EE, CT, and LM). A subsequent update was done after training the other reviewers, who were instructed to extract data for one of these six reviews [25]. Discrepancies were discussed during two 90-min Zoom meetings intended to standardize the data extraction process. After these meetings, the data extraction form and instructions on how to appraise each systematic review were finalized, and five pairs of reviewers were formed (EE&JP/IS, LM&DO, CT&IA, KH&KM, AC&OA). Each reviewer pair subsequently appraised the quality of 18–19 systematic reviews independently. Reviews were not appraised by a reviewer who was a co-author or had a potential conflict of interest. Discrepancies between the pair of reviewers were resolved through discussion. Appraisals of reviewers were descriptively synthesized by review counts and a qualitative comparison of the results was made to the results of previous studies [9, 10], if possible.
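The two steps at the end of this process, flagging discrepancies within a reviewer pair and synthesizing consensus appraisals as review counts, can be sketched as below. The item names and answer categories are hypothetical placeholders rather than the actual fields of the data extraction form, and the functions are an illustration, not the procedure used in the study.

```python
from collections import Counter


def find_discrepancies(ratings_a, ratings_b):
    """Return the items on which two reviewers' independent ratings differ."""
    return sorted(
        item for item in ratings_a if ratings_a[item] != ratings_b.get(item)
    )


def summarize(consensus_by_review, item):
    """Percentage of reviews per answer category for one extraction item."""
    counts = Counter(review[item] for review in consensus_by_review)
    total = len(consensus_by_review)
    return {answer: round(100 * n / total) for answer, n in counts.items()}


# Hypothetical ratings of one review by the two members of a reviewer pair.
reviewer_1 = {"risk_of_bias": "yes", "data_synthesis": "no"}
reviewer_2 = {"risk_of_bias": "yes", "data_synthesis": "unclear"}
to_discuss = find_discrepancies(reviewer_1, reviewer_2)

# Hypothetical consensus ratings for four reviews.
consensus = [
    {"risk_of_bias": "yes"},
    {"risk_of_bias": "yes"},
    {"risk_of_bias": "no"},
    {"risk_of_bias": "yes"},
]
summary = summarize(consensus, "risk_of_bias")
print(to_discuss, summary)
```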

Results

Characteristics of the 100 systematic reviews are presented in Table 2. Half of the included reviews focused on patient-reported outcomes, 30% focused on non-patient-reported outcomes, and 20% on a combination of both. The aspect of health of the construct of interest in the reviews was mostly functional status (62%), symptom status (56%), and/or general health perceptions (36%). Reviews focused on a variety of populations, such as children and (older) adults with a variety of diseases and conditions. Questionnaires (77%), clinical rating scales (41%), and/or performance-based tests (24%) were the OMI types most often included.

Table 2 Characteristics of systematic reviews of outcome measurement instruments (n = 100)

Syntheses of the quality appraisal of the 100 systematic reviews of OMIs [24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123] are presented in Table 3. Supplementary File 2 contains the completed data extraction form, whereas Supplementary File 3 contains the data from Table 3 in comparison with the results of the two previous studies [9, 10].

Table 3 Quality appraisal of systematic reviews of outcome measurement instruments

Key elements

Only 11% of the reviews had a title that included all four key elements (i.e., construct, population, type of OMI, and measurement properties of interest) and the fact that it concerned a systematic review. The titles of the remaining reviews often made no reference to measurement property evaluation. In 47% of the reviews, the title omitted at least two key elements and/or the fact that it concerned a systematic review. The term ‘scoping review’ was used in 7% of the reviews. In 45% of the reviews all four key elements were included in the aim, whereas in 18% of the review aims at least two key elements were not reported. A major flaw often identified was that the aim was unclear or vague, for example by stating that the aim was “to discuss validity” [121], or “to provide information about frailty instruments” [94].

Search strategy

In 78% of the reviews the search strategy matched the research aim. When there was a mismatch between the aim and the search strategy, often the aim was to identify all available OMIs, whereas search terms for measurement properties were included. Hence, only OMIs with evidence for the measurement properties were identified.

Only 27% of the reviews had an appropriate search strategy with respect to the search terms used for all of the construct, population, OMI type, and measurement properties. Search terms for OMI type were not appropriate for 40% of the reviews because relevant synonyms or search terms were not included. Search terms for measurement properties were deemed inappropriate for 34% of the reviews.

The number of databases searched ranged from 1 to 14, with a median of 4. MEDLINE was searched in 98% of the included reviews, whereas EMBASE was searched in 56%. Only 66% of the reviews performed reference checking of included articles.

Eligibility criteria and article selection

In 75% of the reviews the eligibility criteria were clearly defined, and in 83% the eligibility criteria matched the research aim. Mismatches often concerned that the aim was to identify all available or used OMIs, whereas eligibility criteria included that the study should report on measurement properties, hence resulting in including only OMIs that were validated to at least some extent. In 42% of the reviews other notable eligibility criteria were used, such as only including OMIs that were reported in at least a certain number of articles, only including validation studies of original OMIs or certain (language) versions, excluding studies of low quality, or excluding OMIs that were described in previously published systematic reviews.

Abstract selection and full-text selection were (partly) done by at least 2 independent reviewers in 65% and 69% of the reviews, respectively, compared to 41% and 38% in 2014. Data extraction was (partly) done by at least 2 independent reviewers in 42% of the reviews, compared to 25% in both 2014 and 2007. In most other cases it was unclear whether two independent reviewers were involved.

Risk of bias assessment

The methodological quality (i.e., risk of bias) of the studies was evaluated in 63% of the reviews, compared to 41% in 2014 and 30% in 2007. In 62% of those reviews, the quality assessment was done by at least two reviewers independently. For 33% of the reviews this was unclear.

Measurement property evaluation

In 73% of the reviews (some) measurement properties of the included OMIs were evaluated, compared to 58% in 2014 and 55% in 2007. This means that in these reviews a judgement was made about the sufficiency of the measurement properties, rather than providing only the results of measurement properties. For those reviews in which (some) measurement properties were evaluated, (a reference to) criteria for measurement properties were provided in 81% of the reviews; in 19% of the reviews it was not clear on what criteria judgements were based. In those reviews in which measurement properties were evaluated and that included multidimensional OMIs, only 18% evaluated each subscale separately. In 22% of the reviews the evaluation of measurement properties was (partly) done by at least two independent reviewers.

Data synthesis and certainty assessment

Data synthesis, in which results from multiple studies on the same OMI were combined, was (partly) performed in 60% of the reviews, compared to 42% in 2014 and 7% in 2007. In those reviews in which data synthesis was performed and that included multidimensional OMIs, synthesis was performed for each subscale separately in only 13% of the cases. Methods for data synthesis were clearly described in 47% of the reviews. In 84% of the reviews data synthesis was performed for each measurement property separately. Data synthesis was performed by at least 2 independent reviewers for 18% of the reviews.

In 33% of the reviews, a certainty assessment was done in which the quality of the evidence was graded. Quality of the evidence was graded by at least 2 independent reviewers in 27% of the reviews.

Results and instrument recommendation

A flowchart was provided in 96% of the reviews, often with reasons for excluding full texts (85% vs. 55% in 2014). In 86% of the reviews, the included instruments were in accordance with the inclusion criteria. In 72% of the reviews, the results of (some) measurement properties were reported as raw data.

In almost half of the reviews (42%) recommendations on which instrument (not) to use were made. In 25% of the reviews, recommendations were made for each construct of interest. In 62% of the reviews the recommendations made were consistent with the evidence appraisal.

A summary of the main results with recommendations for future OMI systematic reviews is provided in Table 4.

Table 4 Overview of main findings and recommendations for future OMI systematic reviews

Discussion

This overview of reviews aimed to investigate whether the quality of recent systematic reviews of OMIs lives up to the current scientific standards and which aspects have notably improved over time. Compared to previous studies [9, 10], we found marked improvements in the conduct of risk of bias assessments, evaluation of measurement properties, and performance of formal data syntheses. Despite this, further improvements in these areas are necessary, as well as with respect to the research question and search strategy.

Over half of the reviews included in this study had an unclear research question or aim, for example with respect to the population of interest, the measurement properties that were evaluated, or the type of OMIs that were included. Including the four key elements, analogous to the PICO (population, intervention, comparison, outcome) format in systematic reviews of interventions [4, 8, 125], helps to formulate a well-defined research question and facilitates the development of an appropriate search strategy. Without a clear research question, it is not possible to assess the comprehensiveness of the search strategy.

Almost three-quarters of the reviews had an inappropriate or insufficiently comprehensive search strategy, often because inappropriate search terms for OMI type or measurement properties were included. It is preferable not to use search terms for OMI type, to avoid missing any studies; however, if such terms are needed because the search returns too many results, a search filter exists for PROMs [23]. A highly sensitive search filter also exists for measurement properties [21], but it was used in only 14 reviews. While searching both MEDLINE and EMBASE is recommended as a minimum by Cochrane [126], almost half of the reviews included in this study did not search EMBASE. Similarly, whilst reference checking is recommended [126], this was not reported by a third of the reviews. Through reference checking, one can also confirm the comprehensiveness of the search strategy: if many relevant articles were found through reference checking, the search was probably not comprehensive and important studies may have been missed [126].
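The use of reference checking as a check on search comprehensiveness can be made concrete with a small sketch: the share of all included articles that were found only through reference checking. The function name and the article identifiers are hypothetical, and no cut-off for "many" is implied, since none is given in the guidance cited above.

```python
def reference_checking_yield(found_by_search, found_by_reference_checking):
    """Share of included articles that the database search missed.

    Articles found by both routes count as found by the search.
    """
    search = set(found_by_search)
    extra = set(found_by_reference_checking) - search
    total = len(search | extra)
    return len(extra) / total if total else 0.0


# Hypothetical identifiers of included articles.
from_search = {"a", "b", "c"}
from_references = {"c", "d"}
share_missed = reference_checking_yield(from_search, from_references)
print(share_missed)
```

A high share suggests that the search terms or databases did not capture part of the relevant literature and that the search strategy should be revisited.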

In almost half of the reviews poorly justified eligibility criteria were used, e.g., only including OMIs in a certain language, or excluding OMIs that were included in previous systematic reviews. Such unintuitive eligibility criteria might negatively impact the inclusion of relevant studies or OMIs, hampering a complete synthesis of the body of available evidence. The number of reviews in which article selection and data extraction were conducted by at least 2 independent reviewers increased compared to previous overviews [9, 10].

Whilst a marked increase in the number of reviews that included a risk of bias assessment was found (63% currently compared to 41% in 2014 and 30% in 2007 [9, 10]), opportunities for improvement remain. Evaluating risk of bias in empirical studies on measurement properties is important, because results might not be valid if a study has bias. For example, relevant items might be missing in a PROM if patients were not involved in its development, or the reliability of an OMI might be underestimated if the time interval between test and retest is (too) long. The COSMIN risk of bias checklist [16] or tool [15] were specifically developed for this purpose and were used in 47 reviews. Other risk of bias tools reported in the reviews [43, 70, 82, 101, 120] included, for example, the QUADAS-2 [127], QAREL [128], ROBINS-I [129], and Newcastle–Ottawa quality assessment scale [130]. These tools are, however, not specifically developed to assess the methodological quality of empirical measurement property studies and may not identify important bias.

The number of reviews in which measurement properties were formally evaluated has notably increased since 2007 (73% currently compared to 58% in 2014 and 55% in 2007 [9, 10]). In 14 reviews, however, it was not clear which criteria were used. In several reviews, authors mistakenly used risk of bias or certainty assessment ratings as a measure of OMI quality. However, these ratings refer to the quality of the study and the quality of the evidence, respectively, and not to the quality of the OMI (i.e., its measurement properties).
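The figure of 14 reviews appears to correspond to the 19% reported in the Results, applied to the 73 reviews that evaluated measurement properties rather than to all 100. A minimal sketch of that conversion, assuming this is how the count was derived (the function name is illustrative):

```python
def share_to_count(percent, denominator):
    """Convert a percentage of a subset into a whole number of reviews."""
    return round(percent / 100 * denominator)


# 73 of the 100 reviews evaluated (some) measurement properties.
evaluated = 73
unclear_criteria = share_to_count(19, evaluated)
criteria_provided = share_to_count(81, evaluated)
print(unclear_criteria, criteria_provided)
```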

A clear increase in the number of reviews in which a data synthesis was performed was also observed (60% currently compared to 42% in 2014 and 7% in 2007 [9, 10]). However, the methods for data synthesis were often unclearly described and only in a third of the reviews a certainty assessment of the body of evidence was conducted. Potentially, the publication of the COSMIN guideline for systematic reviews of PROMs [4] in 2018 increased the number of reviews in which a data synthesis was performed. This guideline details how to synthesize multiple studies on the same measurement property for the same OMI, although more guidance might be necessary.

Each subscale in a multidimensional instrument should be considered a separate instrument as it represents a unique construct with measurement properties often varying between subscales [4]. However, we observed that few studies separately evaluated measurement properties or conducted an evidence synthesis at the subscale level. By not evaluating each subscale separately, a review therefore presents an incomplete picture of the measurement properties for the given scale.

Less than half of the reviews made recommendations about which OMI (not) to use. The conclusions of systematic reviews will be used by other researchers and clinicians who need to select an OMI for their purpose, although the selection of the most appropriate OMI may depend on the context and situation. Clear, evidence-based recommendations on which OMI (not) to use will help others in their OMI selection and contribute to the standardization of OMIs.

Although two-thirds of the reviews purport to include an evaluation of content validity, there is doubt over the thoroughness of these evaluations. Whilst 25 reviews reported application of the COSMIN guideline for evaluating content validity, only 13 appear to have applied it correctly. One of the steps in the assessment of content validity according to the COSMIN guideline is the evaluation of the content by reviewers themselves. This step was often lacking. Other flaws included not distinguishing between development and content validity studies, and only conducting a risk of bias assessment without evaluating the content validity of the OMI.

Other major flaws that we observed in some reviews were confusing the quality of the study (i.e., risk of bias) with the quality of the OMI (i.e., its measurement properties) or making recommendations based on certainty assessment rather than the sufficiency of measurement properties.

Towards high quality OMI systematic reviews

Systematic reviews of OMIs are difficult to conduct, and this study shows that the availability of methodology and tools that guide authors in the conduct of their systematic review does not translate automatically into high-quality systematic reviews. Besides more and better resources, behavioral change techniques [131], implementation strategies, and knowledge translation activities are needed to improve systematic review quality. Several of these have recently been developed or are being considered. First, the COSMIN guideline for systematic reviews has recently been updated and made more user-friendly to better facilitate reviewers [132]. Second, a newly developed animated video explains the key steps of conducting a systematic review of OMIs (available at https://www.cosmin.nl/). Third, a reporting guideline for OMI systematic reviews has recently been developed [133], and knowledge translation activities have been implemented to increase its uptake. Last, a course on how to conduct OMI systematic reviews is being developed to educate reviewers more thoroughly. To alert systematic review authors to the various tools available, an automated email could be sent to authors registering their review in PROSPERO. PROSPERO is a database for registering systematic reviews of health-related outcomes [134], and although less than half of the included reviews reported prospective registration, such an email alert might increase the uptake of tools and improve the quality of future OMI systematic reviews.

Limitations

An important limitation of the current study is the potential subjectivity in appraising the quality of systematic reviews. We attempted to use a rigorous and standardized data extraction process, in which we pilot tested and improved the data extraction form, provided training to reviewers who were already experts in systematic reviews of OMIs, and assigned systematic reviews to reviewer pairs who independently appraised their quality and reached consensus about any discrepancies. However, because of large variations in the systematic reviews included, some degree and variation of subjective judgement in appraising the quality of systematic reviews could not be avoided. Second, some of the included reviews might not have been systematic reviews by definition, as the inclusion criteria were not stringent in that respect. We decided to include a review if at least one measurement property was evaluated (i.e., some degree of judgement was made about the sufficiency of a measurement property, as opposed to only providing an overview of the measurement properties). Third, we were unable to compare all quality aspects historically, because not all aspects were rated in the studies conducted in 2014 and 2007 [9, 10]. Compared to the previous studies, the current appraisal is the most comprehensive, and new elements were added, such as inclusion of key elements in the title, specification of criteria for measurement properties, evaluation of subscales, and assessment of certainty. Fourth, we randomly selected 100 recent reviews that fulfilled the eligibility criteria, out of a set of 136 reviews that were identified while updating the COSMIN database [20]. Our aim was not to include all available systematic reviews but rather to appraise and compare the quality of a random sample of the most recently published reviews with a set of reviews published respectively 8 and 15 years ago. We believe that the inclusion of additional reviews would not have altered our findings. 
Lastly, the appraisal of the reviews’ quality was hampered by poor reporting, for example with respect to the process of data synthesis or the number of independent reviewers involved in each of the steps of the review process. The recently developed PRISMA-COSMIN for OMIs reporting guideline could improve the reporting of OMI systematic reviews [133]. Although the current study is not a one-to-one baseline assessment of reporting aspects required by PRISMA-COSMIN for OMIs, most reporting items have been included in the current quality appraisal. Because our aim was to assess whether the quality of recent systematic reviews lived up to the current scientific standards, including reporting quality, we have not contacted the authors of the included systematic reviews to provide additional information.

Conclusion

In conclusion, this overview of 100 reviews published after June 2021 found, compared to previous overviews of reviews, a clear improvement in the number of OMI systematic reviews that conducted a risk of bias assessment, evaluated the measurement properties of included OMIs, and conducted a data synthesis. However, room for improvement in these areas remains. Improvements regarding the research question and search strategy are urgently needed, as more than half of the reviews likely missed important studies. To ensure that systematic reviews of OMIs meet current scientific standards, more consistent conduct and reporting of systematic reviews of OMIs is needed.