We developed a highly sensitive search filter for finding studies on measurement properties in PubMed, which retrieved 97.4% of the relevant records in the gold standard. We also developed a more precise search filter, which reduced the number of records that need to be read to identify one study on measurement properties from 87 (10,000/116) without a filter to 11 (1,150/108) with the filter.
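The figures in parentheses can be read as the number of records that must be screened to find one relevant study (sometimes called the number needed to read); the reported values are consistent with rounding up to whole records, as in this brief restatement:

```latex
% Records screened per relevant study found, rounded up to whole records
\[
\text{NNR} \;=\; \left\lceil \frac{\text{records to screen}}{\text{relevant records among them}} \right\rceil
\]
\[
\text{no filter: } \left\lceil \tfrac{10\,000}{116} \right\rceil = 87,
\qquad
\text{precise filter: } \left\lceil \tfrac{1\,150}{108} \right\rceil = 11.
\]
```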
The sensitivity of both filters was slightly lower than we aimed for (97.4% instead of 98%, and 93.1% instead of 95%), which means there is a small risk of missing relevant studies when using these filters. However, the sensitivities are still acceptable when compared with other search filters (http://hiru.mcmaster.ca/hiru/HIRU_Hedges_MEDLINE_Strategies.aspx).
Both filters performed very well in the two validation sets. Performance was better in the set of records of studies on measurement properties of the WOMAC questionnaire than in the set of records of studies on measurement properties of physical activity questionnaires. This may be because studies on the same questionnaire are often performed according to a similar methodology, which could lead to more consistent use of terminology. Another reason might be that the methodology and reporting of studies on measurement properties have received more attention, and are further developed, in the field of health status and quality of life measurement (e.g. WOMAC) than in other fields (e.g. physical activity).
Although we always tested the performance of the filters in combination with the exclusion filter, we decided to present the exclusion filter separately, because this allows users to apply the measurement properties filter without the exclusion filter if they want to retrieve all publication types, or both human and animal studies. Moreover, information specialists recommend always applying exclusions (Boolean NOT) at the end of the search strategy.
This study has several methodological strengths. First, the gold standard, a random sample of PubMed, is representative of the literature in which the filters will be used, which increases the likelihood that the filters will perform well in future studies. We included not only high quality or recent studies (or studies from high quality journals) in our gold standard, but also poor and older studies, because the filters should be able to find these studies as well. For the same reason, we also included records that were not yet indexed by the NLM. Many published search filters, such as those developed by Haynes and Wilczynski et al. [8–11], are tested against recent high quality studies; the sensitivity of these filters in the “real world” is therefore likely to be overestimated.
Second, we analyzed the performance of the filters in a way that mimics their real use, e.g. in a systematic review. We therefore calculated sensitivity based on screening of the abstracts, not on screening of the full-text articles. Three abstracts were missed by our sensitive filter (PMID 11681521, 10747220, and 9650947) because they did not contain any terms for measurement properties. They were selected by hand search because of statements like “The results obtained using these techniques are compared” or “A comparison of organism recoveries and morphologies was undertaken with both … (WT) and (ES)”. When we read the full-text articles of these three records, it appeared that only two of them included some information on measurement properties. However, we still counted all three as false negatives, because in a real situation, e.g. when screening abstracts for a systematic review, we would have selected these abstracts and would therefore have wanted the filter to retrieve them. Had we calculated sensitivity based on the full-text articles, as has been done in many other studies [8–11], we would have overestimated the real sensitivity of the filter, because the one missed record that did not contain information on measurement properties would then not have been counted as a false negative.
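For concreteness, the effect of this counting choice can be illustrated with the counts reported above (116 relevant records in the gold standard, of which three were missed by the sensitive filter); the alternative figure assumes that, under full-text-based counting, the single record without information on measurement properties would simply be dropped from the gold standard:

```latex
% Sensitivity counted at the abstract level, as done in this study
\[
\text{sensitivity} = \frac{\text{relevant records retrieved}}{\text{all relevant records}}
 = \frac{116 - 3}{116} = \frac{113}{116} \approx 97.4\%
\]
% Full-text-based counting would drop one record from the gold standard,
% inflating the estimate to
\[
\frac{113}{115} \approx 98.3\%
\]
```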
Third, our filters have been validated in two very different settings, i.e. one set of records from a search for studies on measurement properties of a disease-specific health questionnaire and one set of records from a search for studies on measurement properties of physical activity questionnaires. The performance of the filters in these two settings is promising. Nevertheless, it would be worthwhile to validate the filters in new validation sets, especially in the field of (health-related) quality of life research, where many instruments with different measurement properties are available to measure the same construct. It would also be worthwhile to analyze whether the performance of the filters differs, e.g. for disease-specific versus generic instruments, or across medical fields.
This study also has some limitations. First, we did not hand search all records in the gold standard, because we used an exclusion filter, and we might therefore have missed studies on measurement properties. If so, the performance of the measurement properties filter might have been either overestimated or underestimated, depending on whether the filter would have retrieved these missed records.
Second, the gold standard contained only 116 studies on measurement properties, so the initial performance of the filter was based on only these 116 studies. However, the validation sets contained 100 and 242 studies on measurement properties, respectively, which means that in total the filter has been tested on 458 studies on measurement properties.
The performance of our sensitive filter is better than that of many other filters. For example, it has a higher sensitivity than 23 available search filters for finding diagnostic studies (highest sensitivity 86.9%) [12, 13]. This might be the result of the generalizability of our gold standard set of records, the use of multiple sources for search term selection, and the inclusion of over 150 search terms in the filter. Large search filters are easy to use in PubMed, because the whole filter can be copied and pasted into the search box at once.
The performance of the filters can be improved in the future when records on measurement properties are properly indexed or when indexing is corrected. This can be facilitated by reaching consensus among researchers on the terminology of measurement properties. For example, the search terms “reproducib*[tw]” and “reliab*[tiab]” retrieved a similar number of studies. In the COSMIN Delphi study, international consensus was reached on using the term “reliability” [14]. Such efforts will facilitate indexing by the NLM and improve retrieval of studies. In addition, standards for reporting studies on measurement properties should be developed; such standards do not yet exist. For randomized clinical trials, reporting standards, developed with considerable effort by the Cochrane Collaboration, have increased the performance of search filters to over 99% [15].
Practical recommendations for using the filters
To use the search filters, only a computer with internet access is required; PubMed is freely available worldwide. Users should choose the filter that suits the aim of their search. The sensitive search filter is especially suitable for researchers conducting systematic reviews of studies on measurement properties. The precise filter can be used by researchers or clinicians for a less extensive search, e.g. to obtain an overview of the measurement properties of one specific measurement instrument to be used as an outcome measure in a particular study or in clinical practice. In both cases, the filter should be combined with search terms for the construct of interest, search terms for the kind of measurement instrument of interest, and search terms for the population of interest. These terms should be defined by the users, preferably with the help of an information specialist. The exclusion filter can be used to exclude irrelevant study types; users who want to retrieve all publication types, or both human and animal studies, should not use it.
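As an illustrative sketch only (not part of the published filters), the example below shows one way to assemble such a combined query, with the exclusion filter applied last using Boolean NOT, and to run it against PubMed through the NCBI E-utilities esearch interface. The filter strings are abbreviated placeholders that should be replaced by the full published filters, and the construct, instrument, and population terms are hypothetical.

```python
# Illustrative sketch only -- not the authors' tool. Combines a measurement-
# properties filter with user-defined construct, instrument, and population
# terms, applies the exclusion filter last with Boolean NOT, and queries
# PubMed through the NCBI E-utilities esearch endpoint.
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Abbreviated placeholders: substitute the full published sensitive or precise
# filter and the published exclusion filter here.
MEASUREMENT_FILTER = "(reproducib*[tw] OR reliab*[tiab])"
EXCLUSION_FILTER = "(animals[mh] NOT humans[mh])"  # hypothetical placeholder


def build_query(construct: str, instrument: str, population: str,
                use_exclusion: bool = True) -> str:
    """Combine topic terms with the measurement-properties filter.

    The exclusion filter, if used, is applied last with Boolean NOT,
    as recommended in the text.
    """
    query = " AND ".join(
        [f"({construct})", f"({instrument})", f"({population})", MEASUREMENT_FILTER]
    )
    if use_exclusion:
        query = f"({query}) NOT {EXCLUSION_FILTER}"
    return query


def search_pubmed(query: str, retmax: int = 100) -> tuple[int, list[str]]:
    """Run the query via esearch and return the hit count and first PMIDs."""
    params = urllib.parse.urlencode(
        {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    )
    with urllib.request.urlopen(f"{ESEARCH_URL}?{params}") as response:
        result = json.load(response)["esearchresult"]
    return int(result["count"]), result["idlist"]


if __name__ == "__main__":
    # Hypothetical example: measurement properties of physical activity
    # questionnaires in adults.
    query = build_query(
        construct="physical activity[tiab] OR exercise[mh]",
        instrument="questionnaire[tiab] OR questionnaires[tiab]",
        population="adult[mh]",
    )
    count, pmids = search_pubmed(query)
    print(f"{count} records retrieved; first PMIDs: {pmids[:10]}")
```

The same combined query string can, of course, simply be copied and pasted into the PubMed search box instead of being submitted programmatically.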
If users think that the performance of the filters might be improved by adding terms, they are free to test and validate this. Additional terms might increase sensitivity, but at the cost of lower precision, because new terms will also retrieve new irrelevant records.