Introduction

It is essential to carry out a systematic and extensive search for any type of systematic review (SR). However, searches can often retrieve an overwhelming number of studies [1, 2]. To overcome this, methodological search filters have been developed to find articles of a specific study type relevant to a clinical question. A search filter is a pre-defined combination of search terms that is added to a subject search strategy using the Boolean operator "AND". Dozens of search filters exist for retrieving randomized controlled trials (RCTs) [3, 4]. These filters have been successful in reducing the number of references that need to be screened in SRs; however, this success is difficult to reproduce for prognostic factor studies, as the literature on non-interventional studies is more variable. Unlike RCTs, non-interventional investigations have heterogeneous, non-standardized study designs [5]. These studies are also indexed less consistently, making them more difficult to find in databases. Due to these limitations, the use of filters for diagnostic or prognostic studies is not widely recommended [6,7,8].

Prognosis research focuses on identifying variables that allow estimation of the likelihood of improvement or worsening of a given health problem. This area of clinical research is becoming increasingly important as, throughout the world, people are living longer but with more chronic health conditions and diseases. Prognosis research can be classified into four themes or areas of research: fundamental prognosis, prognostic models, stratified medicine, and prognostic factors [9,10,11,12]. A prognostic factor (PF) "is any measure that, among people with a given health condition (that is, a start point), is associated with a subsequent clinical outcome (an endpoint)" [11]. Generic filters exist for finding prediction and prognosis studies, such as the Haynes broad filter, the Ingui filter, and the Yale prognosis and natural history filter [13,14,15]. These published prognostic search filters have lower sensitivity and precision than other types of search filters, such as those for medical intervention studies [16]. While carrying out various PF systematic reviews we explored the possibility of using a PF filter [17, 18]; however, to the best of our knowledge, no filter exists for these studies. The aim of this paper is to develop and evaluate a search filter for prognostic factor studies to be used in SRs. The main objective of the filter is to achieve maximum sensitivity, so that no relevant studies are lost when using the filter, while maintaining enough specificity to make the search more efficient.

Methods

We developed a search filter partially based on methods described by Rietjens et al., Sampson et al., and also on the criteria of the filter appraisal tool developed by Glanville et al. [19,20,21]. The completed filter appraisal checklist is available as supplementary material. We completed the study in three phases as outlined below:

1. Identification of a reference set (relative recall)
2. Search term selection
3. Filter evaluation

Identification of a reference set (relative recall)

The first step of search filter development is to create the reference set, most often referred to as the gold standard [22]. The reference set is a known set of studies relevant to the general type of study under review; in our case, prognostic factor studies. We used the relative recall method, which involves replicating the searches of systematic reviews and using the studies included in those reviews as the reference standard [21]. Relative recall is useful because it allows a broader range of journals and publication years to be included than could practically be covered by manual searching [7, 21]. This approach is also more generalizable across topics, which is important for our filter, as the relevant literature is spread across a broad range of journals.

We searched for prognostic factor systematic reviews in PubMed by combining the filter for systematic reviews from the "National Library of Medicine: systematic reviews PubMed subset strategy [2018] [PubMed]" with "prognostic factors [title]". These reviews were then screened to determine whether they met the criteria for inclusion in the gold standard. The criteria were that the review carried out a search in Ovid MEDLINE, did not include a prognosis filter or prognosis terms in its search strategy, and used a search strategy that was publicly available and reproducible. Additionally, we made sure that the SRs covered different clinical topics to allow for generalizability.
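For readers who prefer a programmatic route, a minimal sketch using Biopython's Entrez module is shown below. The query string is a simplified, hypothetical stand-in for the NLM systematic review subset strategy named above (which is much longer and is not reproduced here), and the email address is a placeholder.

```python
from Bio import Entrez  # Biopython; illustration only -- our search was run in the PubMed web interface

Entrez.email = "searcher@example.org"  # NCBI asks for a contact address; placeholder value

# Hypothetical, simplified query combining a systematic-review restriction
# with the title restriction described in the text.
query = '"systematic review"[Publication Type] AND "prognostic factors"[Title]'

handle = Entrez.esearch(db="pubmed", term=query, retmax=200)
record = Entrez.read(handle)
handle.close()
print(record["Count"], "candidate systematic reviews")
```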

Search term selection

Frequency analysis

Search term selection was partially based on the objective method used by Rietjens 2019 [20]. A word frequency analysis of the titles and abstracts of PF articles was carried out using the free online Systematic Review Accelerator software [23]. We analyzed the included and excluded studies of the SRs used for relative recall separately, creating two distinct lists of terms.

Calculation of chi-square values

Chi-square values were calculated for the terms generated by the word frequency analysis. From these, we assessed whether a term's relative frequency differed significantly between positive studies (studies included in the reviews) and negative studies (studies not included in the reviews). As expected, given the small number of included studies, all terms showed non-significant results. We therefore complemented the frequency analysis with a Delphi panel of experts to reach a consensus on the terms selected for the filter.
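As an illustration of how such a comparison can be computed, the sketch below applies SciPy's chi-square test of independence to a 2 × 2 table of term occurrence. All counts and the example term frequencies are hypothetical and are not taken from our data.

```python
from scipy.stats import chi2_contingency

def term_chi_square(pos_with, pos_without, neg_with, neg_without):
    """Chi-square test for a difference in a term's relative frequency between
    positive (included) and negative (excluded) studies.

    pos_with / pos_without: included studies with / without the term in title or abstract
    neg_with / neg_without: excluded studies with / without the term
    """
    table = [[pos_with, pos_without],
             [neg_with, neg_without]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value

# Hypothetical counts: a term appearing in 9 of 12 included and 40 of 120 excluded records
chi2, p = term_chi_square(9, 3, 40, 80)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```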

Delphi panel

The Delphi panel consisted of 15 members from various specialties, including systematic reviewers, statisticians, clinicians and information retrieval specialists. Each panelist evaluated the appropriateness of including each term in the filter. We used the RAND definitions of agreement to classify the terms as appropriate, neutral or inappropriate for use in the filter, and to decide whether this classification was agreed on by a majority of panel members [24]. The Delphi method consisted of three rounds: the first two were individual ratings, and the last was a panel meeting at which the ratings given to each term were discussed. The most relevant methodological terms were extracted from the frequency analysis and compiled into a list of 80 terms. The panel rated each term on a scale of 1–9, with 1 being least appropriate for inclusion and 9 most appropriate. Terms scoring between 7 and 9 were defined as potentially eligible for inclusion in the filter [24]. Connecting the selected terms with the Boolean operator OR produced the final search strategy (filter).
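A simplified sketch of how one term's panel ratings could be classified under RAND-style definitions is given below. The actual RAND/UCLA rules for agreement and disagreement are more detailed; the disagreement rule, thresholds, and example ratings here are illustrative assumptions only.

```python
import statistics

def classify_term(ratings):
    """Simplified RAND-style classification of one Delphi term from 1-9 panel ratings.
    This is an illustrative approximation, not the published RAND definitions."""
    median = statistics.median(ratings)
    low = sum(1 for r in ratings if r <= 3)    # panelists in the 1-3 tertile
    high = sum(1 for r in ratings if r >= 7)   # panelists in the 7-9 tertile
    # crude disagreement rule: at least a third of the panel in each extreme tertile
    if low >= len(ratings) / 3 and high >= len(ratings) / 3:
        return "no agreement"
    if median >= 7:
        return "appropriate"
    if median >= 4:
        return "neutral"
    return "inappropriate"

# Hypothetical ratings from the 15 panelists for one candidate term
print(classify_term([8, 9, 7, 8, 6, 9, 7, 8, 7, 9, 8, 7, 6, 8, 9]))  # -> appropriate
```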

Filter evaluation

An essential component of the search filter development process is evaluating how well the filter performs in retrieving relevant records for a systematic review. To carry out the evaluation, the filter was combined, using the Boolean operator AND, with the broad Ovid MEDLINE search strategy used in each included SR.

During the evaluation we tested the sensitivity, specificity, precision, and number needed to read (NNR) of the filter. We used Table 1 below to guide us in the evaluation:

Table 1 Table to calculate sensitivity, specificity, precision and NNR of the filter

Sensitivity is the proportion of the references in the reference set that are retrieved by the filter [25,26,27]. A search with low sensitivity misses a large proportion of relevant articles; in contrast, a highly sensitive search is constructed so that it picks up most of the relevant articles. It was calculated as: (A/(A + C)) × 100.

Specificity is the proportion of the non-relevant references that are not retrieved by the filter [25, 27]. It was calculated as: (D/(B + D)) × 100.

Precision (or positive predictive value, PPV) is the number of relevant records retrieved as a proportion of the total number of records retrieved by the filter [25, 27]. It was calculated as: (A/(A + B)) × 100.

The number needed to read (NNR) is a measure of the usability of the filter, as it indicates how many records a searcher must screen for each relevant record retrieved [25,26,27]. In the context of searching, NNR refers to the number of references that must be screened to find one relevant article. A high NNR means that many references have to be screened, with important resource implications in terms of time and cost, whereas a low NNR means that relevant articles can be identified more quickly, without screening large numbers of titles and abstracts. It was calculated as: (1/precision) × 100.

Time saved is the percentage of records that would otherwise have to be screened but are avoided by using the filter. When the filter is used, fewer articles are retrieved than without it, saving time during the screening process. It was calculated as: ((C + D)/(A + B + C + D)) × 100.
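The formulas above can be expressed compactly as a short function over the four cells of Table 1. The counts in the example call are invented for illustration and do not correspond to any of the six reviews.

```python
def filter_performance(a, b, c, d):
    """Performance measures (in per cent) from the 2 x 2 table in Table 1.

    a: relevant records retrieved by the filter     b: non-relevant records retrieved
    c: relevant records missed by the filter        d: non-relevant records not retrieved
    """
    sensitivity = a / (a + c) * 100
    specificity = d / (b + d) * 100
    precision = a / (a + b) * 100
    nnr = 100 / precision                          # equivalent to (1 / precision) x 100
    time_saved = (c + d) / (a + b + c + d) * 100
    return sensitivity, specificity, precision, nnr, time_saved

# Hypothetical counts for one review's search
sens, spec, prec, nnr, saved = filter_performance(a=10, b=490, c=0, d=500)
print(f"sensitivity {sens:.0f}%, specificity {spec:.0f}%, precision {prec:.1f}%, "
      f"NNR {nnr:.0f}, time saved {saved:.0f}%")
```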

Table 2 provides a summary of the different performance measures and formulas used in our study.

Table 2 Summary of performance measures and formulas

We computed pooled averages of sensitivity and specificity over the six reviews used for evaluation by means of a random-effects meta-analysis of proportions using Stata v. 16.0 [28].
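For readers without Stata, the sketch below outlines one common approach to pooling proportions (DerSimonian–Laird random effects on logit-transformed proportions). It is an approximation for illustration only, not the exact model fitted in Stata, and the per-review counts in the example call are hypothetical.

```python
import numpy as np

def pool_proportions_dl(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.
    Returns the pooled proportion with an approximate 95% confidence interval."""
    e = np.asarray(events, dtype=float)
    n = np.asarray(totals, dtype=float)
    # continuity correction so 0% and 100% proportions have finite logits
    adjust = (e == 0) | (e == n)
    e = np.where(adjust, e + 0.5, e)
    n = np.where(adjust, n + 1.0, n)
    yi = np.log(e / (n - e))                 # logit of each proportion
    vi = 1 / e + 1 / (n - e)                 # approximate variance on the logit scale
    wi = 1 / vi
    theta_fe = np.sum(wi * yi) / np.sum(wi)
    q = np.sum(wi * (yi - theta_fe) ** 2)    # Cochran's Q
    c = np.sum(wi) - np.sum(wi ** 2) / np.sum(wi)
    tau2 = max(0.0, (q - (len(yi) - 1)) / c)  # between-study variance
    w_re = 1 / (vi + tau2)
    theta = np.sum(w_re * yi) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    inv_logit = lambda x: 1 / (1 + np.exp(-x))
    return inv_logit(theta), inv_logit(theta - 1.96 * se), inv_logit(theta + 1.96 * se)

# Hypothetical counts per review: reference-set studies retrieved / total reference-set studies
pooled, low, high = pool_proportions_dl(events=[4, 10, 22, 5, 18, 7],
                                        totals=[13, 10, 22, 5, 18, 7])
print(f"pooled sensitivity {pooled:.0%} (95% CI {low:.0%} to {high:.0%})")
```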

Results

Identification of a reference set (relative recall)

As outlined in Fig. 1, our search in PubMed yielded ninety-one SRs of prognostic factors on various topics. We excluded eighty-five SRs because they did not have a publicly available and reproducible search strategy, had not carried out a search in Ovid MEDLINE, or had used prognosis terms in their search strategy. Finally, we formed our reference set with the six SRs that met all of our criteria [29,30,31,32,33,34]. Each individual reference set included between 3 and 22 studies. The studies from the six individual reference sets were combined into one overall reference set with a total of 73 studies. The prognostic factors assessed in these reviews were the following: symptoms of depression, protease activity, sarcopenia, interstitial pneumonia, controlling nutritional status score, and interim PET results.

Fig. 1 Flow diagram of reference set search

Selection of search terms

After completing the word frequency analysis, we compiled a list of 80 of the most frequent methodological terms in the prognostic factor reference set. This list of terms was evaluated by the Delphi panel for inclusion in the filter. At the end of the last Delphi round we had a list of 8 terms that the panel deemed appropriate and agreed to include in the filter. We consulted the information retrieval specialists on the Delphi panel about the best way to combine them using MeSH and free-text title/abstract terms. We truncated the terms prognostic (prognos*) and predictive (predict*) to make the search as inclusive as possible. Connecting these terms with the Boolean operator OR produced the final search strategy (filter), which is shown below in Table 3.

Table 3 Terms included in prognostic factor filter
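A minimal sketch of how such a filter string can be assembled in Ovid syntax is shown below. Only prognos* and predict* are taken from the text above; the MeSH line is a placeholder and not necessarily part of the published filter, for which Table 3 is the authoritative source.

```python
# Illustrative assembly of an Ovid MEDLINE filter line from its components.
free_text_terms = ["prognos*", "predict*"]   # truncated title/abstract terms named in the text
mesh_lines = ["exp Prognosis/"]              # hypothetical MeSH component, for illustration only

ovid_lines = [f"{term}.ti,ab." for term in free_text_terms] + mesh_lines
pf_filter = " OR ".join(ovid_lines)
print(pf_filter)  # prognos*.ti,ab. OR predict*.ti,ab. OR exp Prognosis/
```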

Filter evaluation

We evaluated the filter using the relative recall method with the six systematic reviews in our reference set. The filter was added to the end of each SR's search strategy using the Boolean operator "AND". The complete search strategy was entered into Ovid MEDLINE, the number of references retrieved was recorded, and the references were downloaded into EndNote [35]. To measure the performance of the filter we compared the references retrieved by the original search in the review with those retrieved by our new search using the filter. The performance of the filter in each review is shown in Table 4.

Table 4 Results for sensitivity, specificity, precision, NNR and time saved of the filter evaluated in each review
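The comparison described above amounts to set operations on the records retrieved with and without the filter. The sketch below shows one way to derive the cells of Table 1 from sets of PubMed IDs; the variable names and example IDs are assumptions for illustration.

```python
def two_by_two(original_hits, filtered_hits, reference_set):
    """Derive the cells of Table 1 for one review.

    original_hits: PMIDs retrieved by the review's original (unfiltered) search
    filtered_hits: PMIDs retrieved when the PF filter is ANDed to that search
    reference_set: PMIDs of the studies included in the review (relative recall standard)
    """
    a = len(reference_set & filtered_hits)                    # relevant and retrieved
    b = len(filtered_hits - reference_set)                    # non-relevant but retrieved
    c = len(reference_set - filtered_hits)                    # relevant but missed
    d = len((original_hits - filtered_hits) - reference_set)  # non-relevant and not retrieved
    return a, b, c, d

# Hypothetical PMID sets for one review
a, b, c, d = two_by_two(original_hits={1, 2, 3, 4, 5, 6, 7, 8},
                        filtered_hits={1, 2, 3, 4},
                        reference_set={1, 2, 8})
print(a, b, c, d)  # 2 2 1 3
```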

The filter had a sensitivity of 100% in all reference sets except Westby 2018 [29], in which it had a low sensitivity of 31%. As can be seen below in Fig. 2, the filter's overall sensitivity was calculated to be 95% (95% CI 69–100%).

Fig. 2 Sensitivity of the filter in various systematic review searches

The specificity varied from 14 to 70%, being highest in Westby 2018 [29] and lowest in Takagi 2019 [31]. As seen below in Fig. 3, the overall specificity was calculated to be 41% (95% CI 29–43%). Precision also varied considerably, ranging from 0.4% [34] to 17% [33]. The NNR varied widely among reviews, ranging from 6 to 278. Time saving was substantial, ranging from 13% (Takagi [31]) to 70% (Westby [29]).

Fig. 3 Specificity of filter in various systematic review searches

Discussion

Main findings

We aimed to develop and test a search filter for finding studies about the role of PFs in Ovid MEDLINE. Overall, the obtained filter showed excellent sensitivity in retrieving studies from a reference set constructed from the studies included in relevant systematic reviews in the field. Specificity was much lower, with an overall combined specificity of 41%. Precision ranged from 0.36% to 17%, but it is important to note that efforts to optimise recall have a direct impact on the screening burden (the total number of references retrieved), so precision may not be an appropriate indicator of performance for approaches that focus on sensitivity. Accordingly, the number of references that had to be screened to retrieve one relevant article varied hugely, from 6 to 277. We calculated that, when using the filter, the time needed for screening would be reduced in all reference sets (by 13 to 70%).

Out of the six reviews in which we tested the filter, Westby 2018 [29] was the only one in which the filter did not retrieve all of the reference set studies. It was a Cochrane review on protease activity as a prognostic factor for wound healing [29]. After examining the studies that were not retrieved, we observed that they did not use any of the search terms attributable to prognosis, and their framing was not typical of prognostic factor studies. Those studies used terms such as influence or associated, which relate them only loosely to prognosis. Another possible explanation for the low sensitivity in Westby 2018 [29] is that the review authors were lenient about the studies they included, applying broad inclusion criteria such as accepting both prognostic factor studies and prediction model development studies, and studies with any period of follow-up. The flow diagram of Westby 2018 [29] shows that a large number of full texts were screened (10% of the titles and abstracts screened were passed on to the full-text stage). In comparison, most of the other reviews in the reference set passed only 2–3% of studies on to the full-text stage, so they were seemingly stricter with the prognostic factor study criteria. When we added our filter to the other systematic review strategies, the sensitivity was 100%, as all included studies were retrieved.

Comparison with available prognosis filters

There are a few published filters for prognosis studies, which focus on prognostic models and prediction rules. We compared our prognostic factor filter with the Haynes broad prognosis filter [14]: (incidence[MeSH:noexp] OR mortality[MeSH Terms] OR follow up studies[MeSH:noexp] OR prognos*[Text Word] OR predict*[Text Word] OR course*[Text Word]). We chose this filter as a comparison because it is the one most readily available to people who use PubMed. The filter is reported to have a sensitivity of 90% and a specificity of 80%. We evaluated this filter on our reference set. As can be seen below in Table 5, the Haynes filter was less sensitive overall than our PF filter (74%, 95% CI 45–96%), but it was more specific (63%, 95% CI 51–74%). In all of the SRs in our reference set, precision with the Haynes filter was similar to that with our filter. This is because the reference sets contained very low numbers of included studies, on which this statistic depends. More time can be saved using the Haynes filter, but at the risk of losing potentially relevant studies for inclusion in the review.

Table 5 Results from Haynes sensitive broad filter in our reference set

Strengths and limitations

Our relative recall references covered various topics, allowing us to evaluate the filter across many different clinical situations. If the references in the reference set come from only one area, this can lead to subject bias in the filter (working well for some subjects but not others). By using the relative recall method, we were also able to ensure that each study in the reference set was in fact a prognostic factor study. It can at times be difficult to distinguish prognostic factor studies from other studies; because we used studies that were included in prognostic factor systematic reviews, we could be assured that they truly were prognostic factor studies.

An important limitation to note is that the reference standard contained a low number of systematic reviews, which in turn contained a low number of studies (73). This is because prognostic factor studies, and thus prognostic factor systematic reviews, are a relatively new area of research. For example, the Cochrane Library contains 10 prognosis systematic reviews compared with 8,487 intervention systematic reviews.

When developing the protocol for this study, we realized that researchers have used many different methods in the past to create a search filter. We examined all the published methods and weighed up our options before deciding which methods to follow. With more resources, time, and personnel available, more robust methods could be employed in the future, such as creating a larger reference set of PF studies or creating a traditional gold standard by manually searching for studies. However, even though we had a small reference set of PF studies, the filter can still be considered a third-generation search filter. Jenkins et al. describe third-generation filters as the most objective, as "terms may be derived objectively through a frequency analysis of relevant records and combined on either the basis of their individual or overall performance or through statistical analysis" [22].

Implications for research

This filter has a high sensitivity, so we can be assured that the risk of missing a study is very low. However, as we noted with the studies in Westby 2018 [29], not all PF studies include typical prognostic words, so we still need to think carefully about what kind of studies we might be searching for and whether they will include the expected terms. Using the filter in search strategies could decrease the number of studies that need to be manually screened. Search strategies for PF systematic reviews often yield large numbers of records, for example 20,000–100,000 references, so it can take a great deal of time and resources to screen through them all, making the NNR an important statistic. This PF filter also needs to be evaluated in rapid reviews, as the time constraints in these reviews make efficient searches even more necessary.

Future research

Evaluating the performance of the search filter against a reference set different from the one used to identify the search terms can lead to a search filter of higher quality. As PF research increases, we expect many more studies to become available for use in the validation process. In the future, to improve the quality of the filter, we would like to validate it using a new reference set of PF studies.

Conclusions

To the best of our knowledge, no search filter exists for locating PF studies in Ovid MEDLINE, nor in any other online database. Our filter had a high overall sensitivity of 95% in the systematic reviews in which we tested it. Its specificity, on the other hand, was lower, at 41% overall. Our aim was to create a sensitive filter, as we believe the most important part of search filter development is not to lose any relevant studies in the search. Further research is needed to increase the specificity of the filter while keeping its high sensitivity.