New ordering of the boxes of the measurement properties
In the guideline for systematic reviews of PROMs, a new order for evaluating the measurement properties is proposed [3]. The ordering of the boxes in the COSMIN Risk of Bias checklist was therefore changed accordingly. This new ordering is shown in Table 1. Boxes 1 and 2 address content validity. Content validity is considered the most important measurement property, because it should first be clear that the items of the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and the target population [3].
Table 1 Boxes in the original COSMIN checklist (left) and the COSMIN Risk of Bias checklist (right)
Boxes 3–5 address structural validity, internal consistency, and cross-cultural validity\measurement invariance, respectively, which together reflect the internal structure of the PROM. Internal structure refers to how the different items in a PROM are related, which is important to know when deciding how items might be combined into scales or subscales. Evaluating the internal structure of the instrument is relevant for PROMs that are based on a reflective model. In a reflective model, the construct manifests itself in the items, i.e., the items are a reflection of the construct to be measured [11]. This step concerns an evaluation of (a) structural validity (including unidimensionality) using factor analyses or IRT or Rasch analyses, (b) internal consistency, and (c) cross-cultural validity and other forms of measurement invariance (MI) [using Differential Item Functioning (DIF) analyses or Multi-Group Confirmatory Factor Analyses (MGCFA)]. These three measurement properties focus on the quality of items and the relationships between items. We recommend evaluating them immediately after evaluating the content validity of a PROM. Because evidence for the unidimensionality or structural validity of a scale or subscale is a prerequisite for interpreting internal consistency analyses (e.g., Cronbach’s alpha), we recommend evaluating structural validity first, followed by internal consistency.
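To make this ordering concrete, the following minimal sketch (our own illustration, not part of the checklist; all names are hypothetical and the data are simulated) first screens a set of item responses for unidimensionality via the eigenvalues of the item correlation matrix, and only then computes Cronbach’s alpha:

```python
import numpy as np

def eigenvalue_screen(items: np.ndarray) -> np.ndarray:
    """Crude unidimensionality screen: sorted eigenvalues of the item
    correlation matrix. A dominant first eigenvalue is consistent with
    (but does not prove) unidimensionality; a full evaluation would use
    CFA or IRT/Rasch models, the checklist's preferred methods."""
    corr = np.corrcoef(items, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances /
    variance of the total score). Only interpretable once there is
    evidence that the (sub)scale is unidimensional."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated 5-item reflective scale: one latent trait drives all items.
rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))
items = theta + rng.normal(scale=0.8, size=(200, 5))

print(eigenvalue_screen(items))  # step 1: structural validity screen
print(cronbach_alpha(items))     # step 2: internal consistency
```

The eigenvalue screen here is only a crude stand-in; in an actual study, structural validity would be evaluated with factor analyses or IRT or Rasch models, as described above.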
We recommend evaluating cross-cultural validity for PROMs that are used in populations that are culturally different from the population for which the PROM was originally developed. We interpret ‘culturally different population’ broadly: we consider not only different ethnic or language groups to be culturally different populations, but also other groups, such as different gender or age groups, or different patient populations. Cross-cultural validity is evaluated by assessing whether the scale is measurement invariant or whether DIF occurs. MI and absence of DIF refer to whether respondents from different groups with the same latent trait level (allowing for group differences) respond similarly to a particular item. Although MI is the overarching term, we decided not to delete terms from the COSMIN taxonomy. Therefore, the box is now called cross-cultural validity\measurement invariance.
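As a hypothetical illustration of a DIF analysis, the sketch below applies the common logistic-regression approach to a single dichotomous item; for brevity it conditions on a simulated trait score directly, whereas in practice the (rest) total score is typically used as the matching variable. All names and data are ours:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def logistic_dif_test(item: np.ndarray, trait: np.ndarray,
                      group: np.ndarray) -> float:
    """Logistic-regression DIF test for one dichotomous item: compare a
    model conditioning on the trait score only against a model that adds
    group membership and a group-by-trait interaction. A small p-value
    suggests the item functions differently across groups at the same
    trait level (i.e., DIF)."""
    base = sm.Logit(item, sm.add_constant(trait)).fit(disp=0)
    full = sm.Logit(item, sm.add_constant(
        np.column_stack([trait, group, trait * group]))).fit(disp=0)
    lr = 2 * (full.llf - base.llf)   # likelihood-ratio statistic
    return stats.chi2.sf(lr, df=2)   # two extra parameters

# Hypothetical data: an item with uniform DIF favouring group 1,
# e.g., two language versions of the same PROM.
rng = np.random.default_rng(1)
trait = rng.normal(size=400)
group = rng.integers(0, 2, size=400).astype(float)
p = 1 / (1 + np.exp(-(trait + 0.8 * group)))
item = rng.binomial(1, p)
print(logistic_dif_test(item, trait, group))  # expected to be small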
Boxes 6–10 address the remaining measurement properties (i.e., reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness). We do not consider any one of these measurement properties more important than the others. These measurement properties mainly focus on the quality of the (sub)scale as a whole, rather than on the item level.
Removal of boxes
The boxes General requirements for studies that applied IRT models, Interpretability, and Generalizability have been removed from the checklist.
We removed the box General requirements for studies that applied IRT models. The first three standards of this box concerned reporting of the IRT model, the computer software package, and the method of estimation. These reporting items do not concern the quality of the studies in terms of risk of bias. The fourth standard concerned whether IRT assumptions such as unidimensionality and local independence were checked. These issues were removed because a lack of testing for these assumptions does not necessarily indicate poor quality of the study. That is, if the model fits, unidimensionality and local independence can be assumed and do not need to be checked. When a poor IRT or Rasch model fit is found, however, one may examine whether unmet assumptions can explain the misfit of the model. Furthermore, the quality of studies on the unidimensionality of scales or subscales is considered in the box Structural validity. The fourth standard also concerned whether DIF analyses were performed. We have moved this standard to the box Cross-cultural validity\measurement invariance, as it tests whether items behave similarly in different groups. Standards on preferred statistical methods based on IRT or Rasch analyses are included in the boxes Internal consistency, Structural validity, and Cross-cultural validity\measurement invariance, as in the original COSMIN checklist.
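For readers who want to examine whether unmet assumptions explain a poor model fit, the following sketch (our own illustration; it assumes that model-implied response probabilities are already available from a fitted IRT or Rasch model, which is not shown) computes Yen’s Q3 residual correlations as a local independence diagnostic:

```python
import numpy as np

def q3_residual_correlations(responses: np.ndarray,
                             expected: np.ndarray) -> np.ndarray:
    """Yen's Q3 diagnostic for local independence. `responses` holds
    the observed item scores (persons x items) and `expected` the
    model-implied response probabilities from an already fitted IRT or
    Rasch model. Markedly non-zero off-diagonal values flag item pairs
    that may violate local independence and could explain a poor fit."""
    residuals = responses - expected
    q3 = np.corrcoef(residuals, rowvar=False)
    np.fill_diagonal(q3, 0.0)  # self-correlations are uninformative
    return q3

# Example follow-up: flag pairs above an informal cut-off, e.g.
#   pairs = np.argwhere(np.abs(q3) > 0.2)
```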
The box Interpretability contained items referring to the reporting of information that facilitates the interpretation of (change) scores, rather than standards to assess the risk of bias of a study on interpretability. Moreover, interpretability is not a measurement property. Despite its importance, we decided to remove this box because the checklist focuses on risk of bias.
The box Generalizability contained items on whether the study population is adequately described in terms of age, gender, and important disease characteristics. These items also do not refer to risk of bias and were therefore removed.
Adaptations of individual standards
Table 2 provides an overview of the adaptations resulting in the COSMIN Risk of Bias checklist.
Table 2 Overview of changes in COSMIN standards
Removal of standards on missing data and handling missing data
In the original COSMIN checklist, each box except the content validity box contained standards about whether the percentage of missing items was reported, and how these missing items were handled. Although we consider information on missing items very important to report, we decided to remove these standards from all boxes, as it was agreed that a lack of reporting on the number of missing items, and on how they were handled, does not necessarily lead to biased study results. Furthermore, there is currently little evidence on the best way to handle missing items in studies on measurement properties.
Removal of standards on sample size
We decided to remove the standard about adequate sample size for single studies from the boxes where results can be pooled across studies (i.e., the boxes Internal consistency, Reliability, Measurement error, Criterion validity, Hypotheses testing for construct validity, and Responsiveness), and to consider sample size in a later phase of the review, i.e., when drawing conclusions across studies on the measurement properties of the PROM [3]. This was decided because several small high-quality studies can together provide good evidence for a measurement property. We therefore recommend taking the aggregated sample size of the available studies into account when assessing the overall quality of evidence on a measurement property in a systematic review, as described in detail elsewhere [3]. This is in line with Cochrane guidelines [10]. However, the standard about adequate sample size for single studies was maintained in the boxes Structural validity and Cross-cultural validity\measurement invariance, because the results of these studies cannot be pooled. In these boxes, factor analyses, or IRT or Rasch analyses, are included as preferred statistical methods, and these methods require sufficiently large sample sizes to obtain reliable results.
The suggested sample size requirements should be considered basic rules; in some situations, depending on the type of model and the number of factors or items, more nuanced criteria might be applied. For example, a smaller sample size might be acceptable when the individual study presents an argument explaining why a smaller sample size is adequate. In that case, the study can still be rated as very good or adequate, despite a smaller sample size than required by the standard.
Detailed information is provided in the ‘COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs)—user manual’ [9].
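Purely as an illustration of how such basic rules can be operationalized, the sketch below rates the sample size of a hypothetical factor-analysis study. The thresholds are placeholders of our own, not the official COSMIN criteria, which are given in the user manual [9]:

```python
def rate_factor_analysis_sample(n: int, n_items: int) -> str:
    """Illustrative rating of the sample size of a factor-analysis
    study. The thresholds below are hypothetical placeholders chosen
    by us, NOT the official COSMIN criteria; the actual standards, and
    the nuanced exceptions discussed above, are in the user manual [9]."""
    ratio = n / n_items
    if n >= 100 and ratio >= 7:
        return "very good"
    if n >= 100 and ratio >= 5:
        return "adequate"
    if ratio >= 5:
        return "doubtful"
    return "inadequate"

print(rate_factor_analysis_sample(n=250, n_items=20))  # 'very good'
```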
Removal of standards to determine which measurement property was assessed
In the original COSMIN checklist, several standards were included to determine whether or not a specific measurement property was evaluated. For example, in the boxes Reliability and Measurement error, it was asked whether at least two measurements were available; in the boxes Internal consistency and Structural validity, a standard asked whether the scale consists of effect indicators. If the answer was ‘no,’ the measurement property was not relevant. These questions do not refer to the risk of bias of a study but to its relevance, and they are therefore no longer considered standards. They were either deleted, or the item number was removed (to indicate that it is not a standard) and instructions were added (i.e., ‘if no, the study can be ignored’).
Removal of redundant standards from the box Internal consistency
Unidimensionality is a prerequisite for a proper interpretation of the internal consistency statistic. In the original version of the COSMIN checklist, two standards about checking this assumption were included in the box Internal consistency, i.e., “Was the unidimensionality checked, i.e., was factor analysis or IRT model applied?” and “Was the sample size included in the unidimensionality analysis adequate?” We removed these items from the box because, following the new order of evaluating measurement properties, evidence that a scale or subscale is unidimensional should first be sought using the box Structural validity, before internal consistency is evaluated.
Removal of standards about the translation process
In the original COSMIN checklist, the box Cross-cultural validity included both standards for assessing the quality of the translation process and standards for assessing the quality of a cross-cultural validity study. We decided to remove the standards for assessing the quality of the translation process because (a) the translation process itself is not a measurement property; (b) performing a pilot test after a translation is considered part of content validity (i.e., an evaluation of comprehensibility) and is now included in the box Content validity [8]; and (c) a poor translation process does not necessarily mean that the instrument has poor cross-cultural validity.
Changes in the boxes Criterion validity, Hypotheses testing for construct validity, and Responsiveness
We decided to delete the standard about a reasonable gold standard and all standards about formulating hypotheses a priori from these boxes. We consider it important to determine whether a ‘gold standard’ can indeed be considered a ‘gold standard.’ However, when conducting a systematic review of PROMs, we now recommend that the review team determines, before assessing the quality of the included studies, which outcome measurement instruments can indeed be considered a ‘gold standard.’ Similarly, although we consider it very important to define hypotheses in advance when assessing the construct validity or responsiveness of a PROM, results of studies without such hypotheses can in many cases still be used in a systematic review of PROMs, because the presented correlations or mean differences between (sub)groups are not necessarily biased. The authors’ conclusions, however, are often biased when a priori hypotheses are lacking. We recommend that the review team itself formulates hypotheses about the expected direction and magnitude of correlations between the PROM of interest and other PROMs, and of mean differences in scores between groups [12], and compares the results found in the included studies with these hypotheses. If construct validity studies do include hypotheses, the review team can adopt them if it considers them adequate. This way, the results of many studies can still be used in the systematic review, as studies without hypotheses will no longer receive an ‘inadequate’ (previously called ‘poor’) quality rating. An additional advantage of this approach is that the results of all included studies are compared against the same set of hypotheses. A detailed explanation for completing these boxes can be found in the manual of the checklist [9].
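A minimal sketch of the bookkeeping this recommendation implies (our own illustration with hypothetical comparators and correlations, not part of the COSMIN methodology itself) could look as follows: the review team encodes each a priori hypothesis once and checks every included study’s results against the same set:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """An a priori hypothesis formulated by the review team, e.g.
    'the PROM correlates positively, at least 0.50, with comparator A'."""
    comparator: str
    direction: str    # "positive" or "negative"
    min_abs_r: float  # expected minimum magnitude of the correlation

def hypothesis_confirmed(h: Hypothesis, observed_r: float) -> bool:
    """Check an observed correlation from an included study against the
    review team's hypothesis on both direction and magnitude."""
    right_direction = (observed_r > 0) == (h.direction == "positive")
    return right_direction and abs(observed_r) >= h.min_abs_r

# The same hypothetical set of hypotheses is applied to every study.
hypotheses = [Hypothesis("comparator PROM A", "positive", 0.50),
              Hypothesis("comparator PROM B", "negative", 0.30)]
study_results = {"comparator PROM A": 0.62, "comparator PROM B": -0.15}
confirmed = [hypothesis_confirmed(h, study_results[h.comparator])
             for h in hypotheses]
print(sum(confirmed) / len(confirmed))  # proportion of hypotheses confirmed
```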
To improve the comprehensibility of the boxes Hypotheses testing for construct validity and Responsiveness, these boxes now include separate sections for different study designs. These sections contain standards for testing hypotheses about comparing (changes in) scores on the outcome measurement instrument of interest with (changes in) scores on comparator outcome measurement instruments (e.g., convergent validity), or standards for comparing (changes in) scores between subgroups (discriminative or known-groups validity). We also included a separate section in the box Responsiveness containing standards for studies in which effect sizes and related parameters are used. In the sections on comparisons between subgroups, we added a standard asking whether an adequate description of important subgroup characteristics was provided.
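As a brief illustration of two responsiveness parameters commonly reported in such studies (a sketch with made-up data; the checklist itself does not prescribe these formulas), the effect size divides mean change by the baseline standard deviation, and the standardized response mean divides mean change by the standard deviation of the change scores:

```python
import numpy as np

def effect_size(baseline: np.ndarray, follow_up: np.ndarray) -> float:
    """Effect size (Cohen's-type): mean change divided by the
    standard deviation of the baseline scores."""
    change = follow_up - baseline
    return change.mean() / baseline.std(ddof=1)

def standardized_response_mean(baseline: np.ndarray,
                               follow_up: np.ndarray) -> float:
    """Standardized response mean (SRM): mean change divided by the
    standard deviation of the change scores."""
    change = follow_up - baseline
    return change.mean() / change.std(ddof=1)

# Made-up scores for five respondents, before and after treatment.
base = np.array([40.0, 42.0, 38.0, 45.0, 41.0])
post = np.array([43.0, 49.0, 39.0, 47.0, 44.0])
print(effect_size(base, post), standardized_response_mean(base, post))
```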
Finally, several standards were reformulated to change them from reporting standards into standards for risk of bias assessment. For example, we changed the original standard “Was an adequate description provided of the comparator instrument(s)?” into “Is it clear what the comparator instrument(s) measure(s)?” This standard can be answered based on information from the article, but also based on additional information from the literature.
New labels for the four-point rating system
It was argued that the original labels of the four-point rating scale (i.e., ‘excellent,’ ‘good,’ ‘fair,’ ‘poor’) did not appropriately reflect the judgments given, because the labels do not exactly match the descriptions used in the boxes. The descriptions for the category ‘fair’ often used the words doubtful and unclear; the label ‘doubtful’ was therefore considered more appropriate. The labels ‘good’ and ‘poor’ were not considered symmetrical and were changed into ‘adequate’ and ‘inadequate.’ Lastly, we wanted a category reflecting studies that performed very well. We changed ‘excellent’ into ‘very good’ because we considered the latter to reflect the distances between the response categories more appropriately. Also, by changing all labels, the difference between the original and new COSMIN checklists becomes clearer for users.
Availability
The COSMIN Risk of Bias checklist for systematic reviews of PROMs is presented in the Appendix and on the COSMIN website. The ‘COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs)—user manual’ is also published on the COSMIN website [9] with detailed instructions about how each standard should be rated.