General description of the scoring system
The specific criteria for excellent, good, fair, and poor quality per item of each COSMIN box are described in the COSMIN checklist with a 4-point scale (available from the web site www.cosmin.nl). As an example, the box “Reliability” (box B) with a 4-point scale is presented in Table 1. In general, an item is scored as excellent when there is evidence that the methodological quality aspect of the study to which the item is referring is adequate (this equals the original response option “yes”). For example, if evidence is provided (e.g., from a global rating scale) that patients remained stable between the test and retest (item 7, box B), this item is scored as excellent. An item is scored as good when relevant information is not reported in an article, but it can be assumed that the quality aspect is adequate. For example, if it can be assumed that patients were stable between the test and retest (e.g., based on the clinical characteristics of the patients and the time interval between the test and retest), the item is scored as good. An item is scored as fair if it is doubtful whether the methodological quality aspect is adequate. For example, when it is unclear whether the patients were stable in a reliability study, the item is scored as fair. Finally, an item is scored as poor when evidence is provided that the methodological quality aspect is not adequate, for example, if patients were treated between the test and retest.
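The four response options described above can be summarized as a simple lookup from the evidence about patient stability to a quality rating. The following sketch is purely illustrative; the function name and evidence labels are our own shorthand, not part of the COSMIN checklist:

```python
def rate_patient_stability(evidence: str) -> str:
    """Illustrative 4-point rating for item 7, box B (patient stability
    between test and retest). Labels are shorthand, not COSMIN terms."""
    ratings = {
        "evidence_stable": "excellent",    # e.g., a global rating scale shows stability
        "assumably_stable": "good",        # stability plausible from clinical context
        "unclear": "fair",                 # doubtful whether patients were stable
        "evidence_not_stable": "poor",     # e.g., patients were treated between tests
    }
    return ratings[evidence]

print(rate_patient_stability("assumably_stable"))  # -> good
```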
Table 1 Example of one COSMIN box with 4-point scale
In defining the response options, the “worst score counts” algorithm was taken into consideration. Only fatal flaws in the design or statistical analyses were regarded as poor methodological quality. For example, when in a construct validity study no hypotheses were formulated a priori regarding the relation of the instrument under study with other measures, and it was unclear what was expected, this was considered poor methodological quality. For some items, the worst possible response option was defined as good or fair instead of poor because we did not want these items to have too much impact on the methodological quality score per box. For example, item 1 in most boxes refers to whether the percentage of missing items is given. The only two possible answers are yes or no, which were rated as excellent and good, respectively. This does not mean, however, that we consider it good practice if this information is not reported. It rather means that, in our opinion, a study that did not report the number of missing items can still obtain an overall score of good methodological quality for a measurement property, if all other items are scored good or excellent. Item 2 in most boxes refers to whether it was described how missing items were handled. If this is not described, it is not necessarily a fatal flaw in the study. Therefore, it was decided to score this item as fair instead of poor if it was not described how missing items were handled. Finally, for some items, it was not possible to define four different response options. For these items, only two or three response options were defined. For example, item 9 in box E (structural validity) refers to whether exploratory or confirmatory factor analysis was performed. The only possible answers are (1) yes (excellent), (2) yes, but exploratory factor analysis was performed while confirmatory factor analysis would have been more appropriate (good), or (3) no (poor).
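The “worst score counts” algorithm amounts to taking the lowest item rating as the quality score for a box. A minimal sketch, assuming ratings are expressed as the four labels used in the text:

```python
# Quality levels from lowest to highest, as used in the COSMIN scoring system.
ORDER = ["poor", "fair", "good", "excellent"]

def box_score(item_scores: list[str]) -> str:
    """'Worst score counts': the methodological quality score of a box
    is the lowest rating given to any of its items."""
    return min(item_scores, key=ORDER.index)

print(box_score(["excellent", "good", "fair", "good"]))  # -> fair
```

One fair item thus caps the box at fair, regardless of how many items are rated good or excellent.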
In all boxes, a small sample size was considered poor methodological quality. As a rule of thumb, a sample size of at least 100 is considered excellent, 50 as good, 30 as fair, and less than 30 as poor [10]. For the assessment of some measurement properties, larger sample sizes are required; e.g., for factor analysis, the sample size should be at least five to seven times the number of items, with a minimum of 100 (item 6, box A, and item 4, box E) [11].
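The sample-size rule of thumb can be written as a simple threshold function. The cutoffs below follow the text; the function itself is illustrative:

```python
def sample_size_rating(n: int) -> str:
    """Rule-of-thumb sample-size rating from the text: >=100 excellent,
    50-99 good, 30-49 fair, <30 poor."""
    if n >= 100:
        return "excellent"
    if n >= 50:
        return "good"
    if n >= 30:
        return "fair"
    return "poor"

print(sample_size_rating(75))  # -> good
```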
Scoring the quality of IRT studies
If studies use IRT models, the COSMIN IRT box should be completed in addition to the specific boxes for the measurement properties that were evaluated in the IRT study. IRT models are most often used for assessing internal consistency and cross-cultural validity (Differential Item Functioning). If the IRT model, the computer software package, or the method of estimation was not adequately described (IRT box items 1–3), these items are scored good instead of excellent. If the assumptions for estimating parameters of the IRT model were not checked or this is unknown (item 4), this item is scored fair. To obtain a total score for the methodological quality of studies that use IRT methods, the “worst score counts” algorithm should be applied to the combination of the IRT box and the box of the measurement property that was evaluated in the IRT study. For example, if IRT methods are used to study internal consistency and item 4 in the IRT box is scored fair, while the items in the internal consistency box (box A) are all scored as good or excellent, the methodological quality score for the internal consistency study will be fair. However, if any of the items in box A is scored poor, the methodological quality score for the internal consistency study will be poor.
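Combining the IRT box with the box of the measurement property again follows the “worst score counts” rule, now applied across the items of both boxes. A sketch of the example from the text (function and input format are illustrative):

```python
# Quality levels from lowest to highest, as used in the COSMIN scoring system.
ORDER = ["poor", "fair", "good", "excellent"]

def combined_irt_score(irt_box: list[str], property_box: list[str]) -> str:
    """'Worst score counts' across the IRT box and the box of the
    measurement property evaluated with IRT methods."""
    return min(irt_box + property_box, key=ORDER.index)

# Item 4 of the IRT box is 'fair'; all items of the internal
# consistency box (box A) are good or excellent -> overall 'fair'.
print(combined_irt_score(["good", "good", "good", "fair"],
                         ["excellent", "good", "good"]))  # -> fair
```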
Adaptations made during testing
In comparing the initial COSMIN scoring with the rater’s overall judgement of the methodological quality of the study, we found a few discrepancies. In most cases, the rater’s overall judgement was more positive than the rating obtained with the COSMIN scoring system. For example, when rating the methodological quality of a construct validity study in which the expected direction of correlations or mean differences was not included in the hypotheses, this was originally rated as fair quality. However, the rater argued that it was often possible to deduce what was expected. We therefore changed the scoring of this response option to good.
Example of the application of the scoring system in systematic reviews of measurement properties
The scoring system was applied to a set of 46 articles from a systematic review on the measurement properties of 8 neck disability questionnaires [9]. The results are presented in Fig. 1. This figure shows how the scoring system can be used to provide an overview of the methodological quality of the included studies on measurement properties in a systematic review. For example, construct validity was evaluated in 41 of the 46 articles. Of these studies, 5 (11%) were rated as excellent, 8 (19%) as good, 16 (40%) as fair, and 12 (30%) as poor.
Subsequently, the methodological quality of the studies should be taken into account in the evaluation of the results of the included studies. In this phase of the review, the results from different studies are combined [12]. In this systematic review, levels of evidence were used to rate the quality of the instruments, as is done in reviews of randomized clinical trials [13, 14]. In applying levels of evidence, the methodological quality of the studies is taken into account, as well as the number of studies and their results. As the results of studies with poor methodological quality cannot be trusted, they do not contribute any evidence, while excellent studies provide strong evidence. The highest level of evidence was applied to the results of studies of excellent methodological quality, and the lowest level of evidence was applied to the results of studies of fair methodological quality [9].