Background

The relentless growth in research and the greater reliance placed on evidence for clinical and policy decisions have created a need for automation to support the process of systematic reviews [1, 2]. A core process in the creation of such reviews is the initial search for and selection of research articles for inclusion in the review. Often a literature search identifies hundreds to thousands of candidate documents, with the vast majority being screened and then excluded because they are not relevant to the review question.

To save time and resources on screening, usually only the title and abstract are appraised first; the more taxing task of appraising the full text is reserved for papers where a decision cannot be made from the abstract alone, where reviewers disagree on whether to include or exclude, and for those papers that appear relevant at title/abstract screening but may still be excluded at full-text screening [3]. However, screening on title and abstract is likely to be the most time-consuming of all systematic review tasks because of the large number of references typically retrieved by the high-sensitivity search strategies employed in systematic reviews [4].

Several supervised machine learning algorithms have been tested to automate screening. A recent review found 44 such algorithms [5], and this functionality is already available in several commercial systems. These algorithms use natural language processing to estimate the probability that a candidate paper should be excluded from a systematic review. Machine learning systems work by training on the inclusion and exclusion decisions made by human reviewers and building a classifier that models these observations. One limitation of such machine learning screening systems is the large number of human decisions needed before a reliable classifier can be developed. A second limitation is that the reliability of the resulting classifications appears relatively low. These algorithms are thus used to exclude only the most obvious cases (between 30 and 70% of candidate studies) and often only rank studies in decreasing order of likely relevance in preparation for human screening. For example, the systematic review tool Rayyan updates a machine learning classifier that promotes abstracts to the top of the screening queue when their wording is more similar to that of previously included abstracts [6].
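
As a rough illustration of how such similarity-based ranking can work, the following Python sketch orders candidate abstracts by their TF-IDF cosine similarity to previously included abstracts. This is a generic, minimal example and not Rayyan's actual model; the function name and the scoring choice (taking each candidate's highest similarity to any included abstract) are our own assumptions.

```python
# Minimal sketch of similarity-based ranking of candidate abstracts against
# previously included abstracts (illustrative only; not Rayyan's actual model).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_candidates(included_abstracts, candidate_abstracts):
    """Return candidate abstracts ordered by decreasing similarity to included ones."""
    vectoriser = TfidfVectorizer(stop_words="english")
    vectors = vectoriser.fit_transform(included_abstracts + candidate_abstracts)
    included_vecs = vectors[: len(included_abstracts)]
    candidate_vecs = vectors[len(included_abstracts):]
    # Score each candidate by its highest similarity to any included abstract.
    scores = cosine_similarity(candidate_vecs, included_vecs).max(axis=1)
    order = scores.argsort()[::-1]
    return [candidate_abstracts[i] for i in order]
```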

Whilst it makes sense to model automation on human processes [7], there is no requirement that each step of screening automation be identical in its human and machine versions. What is time-consuming for humans may be easy for a machine. For example, in the human process the extraction of key information about the characteristics of a study (such as the population, exposure, confounders and outcomes) from an abstract occurs only after screening, because such extraction is time-consuming for humans.

No such time costs, however, apply to the machine version of screening. Our hypothesis for this study is that automated extraction of study characteristics from abstracts can itself be used to make screening decisions. This effectively re-orders the tasks of a human-driven systematic review so that extraction of key information (step 3, see Table 1) occurs before screening (step 2). In this study, we test whether automated information extraction can lead to effective screening and thus reduce the overall screening load on humans.

Table 1 High-level steps of a systematic review

Information extraction techniques that combine supervised machine learning and heuristic methods appear to be a promising approach to this task [8]. This has been aided by the increased use of clinical trial registries and specialised databases (e.g. Epistemonikos [9]), which means that more structured, and thus more easily managed, information is available to screening algorithms.

Materials

Data preparation

Because of the limited data currently available, and the limited resources for generating such data, we made secondary use of data already available to us from a previous study conducted in our group [10]. The fact that the same data was used to develop and evaluate the semi-automated characteristic extraction method used here does not bias our results, as the extraction algorithm is only meant to simulate an automated extraction method whose development is not part of this study. We thus identified three recent systematic reviews of environmental observational studies (Table 2) that included a defined and repeatable search strategy.

Table 2 The three systematic reviews used in this study

For each review, we contacted the corresponding authors to confirm that the search strategy reported in the review was the one actually used and to request their original search results. When the original search results were not available from the authors, we repeated the search strategies using the databases specified in the original searches and the date limits of the original search. The original search queries and a comparison between the original and reproduced searches are given in Appendix 2.

The citations for articles identified in the searches were then collected using EndNote (EndNote X7.7.1; Bld 10036). Duplicate entries were removed using EndNote’s de-duplication function. Abstracts were retrieved automatically using EndNote’s “Find Reference Updates…” function and references without abstracts were removed. Five of the included studies in Thayer were removed as a result of this process. References with abstracts were then exported to text files in preparation for automated extraction.

Study characteristic extraction

We used a previously developed text mining algorithm to extract six study characteristics of observational studies: population, exposure, confounders, outcomes, (collectively PECO), country and study type. The algorithm was developed using the General Architecture for Text Engineering (GATE) [11]. The algorithm’s accuracy has been tested previously on the included studies in these systematic reviews with precision = 95% and recall = 81% (F score = 87%) [10].

The algorithm uses human-crafted grammatical rules designed to suit the abstracts of observational studies. Study characteristic recognition is performed by first identifying semantic elements using specific dictionaries and semantic rules. The rules and dictionaries were developed manually by inspecting the 17 articles included in Johnson 2014 (the training set). The rules were then tested on 34 articles, including the 17 articles included in Hamra 2014, and on 35 articles, including 11 of the 16 included in Thayer 2013. In each case, the exposure and outcome dictionaries were updated for the corresponding review question. The identification algorithm's dictionaries and rules are described in Appendix 1.

The output of the extraction algorithm is a set of text phrases matching the attributes of the six study characteristics. However, for the purpose of this study, we ignored this text output and only noted whether or not the algorithm had found candidate phrases for a given characteristic in an abstract. Whenever a required PECO item was identified, this was recorded as a “hit”; if no phrases could be identified, it was coded as a “miss”.
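
The following Python sketch illustrates this reduction from extracted phrases to hit/miss flags. The keyword dictionaries and matching logic are deliberately simplified stand-ins for the GATE-based rules and dictionaries described above, only the four PECO characteristics (of the six) are shown, and all terms are illustrative only.

```python
# Simplified stand-in for the GATE-based extraction step: the real algorithm
# uses grammatical rules and curated dictionaries; here plain keyword lookup
# illustrates how extraction output is reduced to hit/miss flags per PECO item.
ILLUSTRATIVE_DICTIONARIES = {
    "population": ["pregnant women", "children", "adults", "workers"],
    "exposure": ["pfoa", "pfos", "benzene", "air pollution"],
    "confounders": ["adjusted for", "confounder", "covariate"],
    "outcomes": ["birth weight", "cancer", "mortality", "asthma"],
}


def characteristic_hits(abstract: str) -> dict:
    """Return a hit (True) or miss (False) flag for each PECO characteristic."""
    text = abstract.lower()
    return {
        characteristic: any(term in text for term in terms)
        for characteristic, terms in ILLUSTRATIVE_DICTIONARIES.items()
    }
```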

Screening evaluation

We tested several variants of screening rules, which applied different thresholds for the number of study characteristics needed to include a study in a review (Table 3). These threshold rules ranged from a strict rule that required all four PECO items to be detectable to rules that required only a subset of elements to be detectable. We applied each of the six screening rules to the abstracts identified by the search strategies.

Table 3 Screening rule tests in this study and the rationale behind each
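
To make the rule variants concrete, the sketch below expresses the six screening rules as predicates over the hit/miss flags. The rule names follow those used in the text; the exact composition of the subset rules (PEO, PE, EO) is inferred from the characteristic initials and may differ in detail from Table 3.

```python
# Screening rule variants applied to the hit/miss flags. P, E, C and O denote
# population, exposure, confounders and outcomes; subset rules are inferred
# from their names rather than taken verbatim from Table 3.
PECO = ("population", "exposure", "confounders", "outcomes")

SCREENING_RULES = {
    "All 4 PECO Terms": lambda h: all(h[c] for c in PECO),
    "Any 3 PECO Terms": lambda h: sum(h[c] for c in PECO) >= 3,
    "Any 2 PECO Terms": lambda h: sum(h[c] for c in PECO) >= 2,
    "PEO": lambda h: h["population"] and h["exposure"] and h["outcomes"],
    "PE": lambda h: h["population"] and h["exposure"],
    "EO": lambda h: h["exposure"] and h["outcomes"],
}


def apply_rule(rule_name: str, hits: dict) -> bool:
    """True means the abstract passes the rule and is retained for manual screening."""
    return SCREENING_RULES[rule_name](hits)
```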

Included articles that met all the requirements of a screening rule were counted as true positives (TP). Articles that met the screening rule requirements but were excluded in the original systematic review were counted as false positives (FP). Articles included in the original systematic review that did not meet a screening rule's criteria were counted as false negatives (FN). The remaining references, which did not meet the screening rule and were excluded from the original systematic review, were counted as true negatives (TN). We calculated the precision, Pr = TP/(TP + FP), and recall, Re = TP/(TP + FN).

We assume that all articles not excluded by the screening algorithm (i.e. TP and FP) are to be manually screened. We thus estimated the “work saved” as 1 − (TP + FP)/N, where N is the total number of articles to be screened. This assumption also means that each FN article is a relevant study that will not be manually screened and would thus be wrongly excluded from the review. Hence, we consider perfect recall (i.e. Re = 1.0) to be of the highest importance and precision to be of secondary importance.

We calculated the maximum work saved as the proportion of manual screening that could be avoided by a theoretically perfect screening tool with perfect recall and perfect precision. The maximum work that can be saved for each review is therefore 1 − TP/N.
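
A minimal sketch of the evaluation metrics defined above, assuming the TP, FP, FN and TN counts have already been tallied for a given rule and review:

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, work saved and maximum work saved, as defined above."""
    n = tp + fp + fn + tn                 # total number of articles to be screened
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    work_saved = 1 - (tp + fp) / n        # abstracts no longer needing manual screening
    max_work_saved = 1 - tp / n           # achievable only by a perfect screening tool
    return {"precision": precision, "recall": recall,
            "work_saved": work_saved, "max_work_saved": max_work_saved}
```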

Results

There was a wide range of results among the six rules but consistent results for each rule across the three evaluation sets. Table 4 shows the screening results for each rule on each of the reviews. The table also shows the maximum work that could be saved with perfect recall for each review. Overall, the best screening rule was EO, as it provided the best recall with the highest precision and thus the most work saved.

Table 4 Summary of screening workload savings for three systematic reviews

The All 4 PECO Terms screening rule generally performed worst. Screening with Any 3 PECO Terms improved recall, but only screening on Any 2 PECO Terms provided the desired recall. Comparing the screening results of the All 4 PECO Terms rule with those of the PEO rule, about half the misses of the former appear to be due to confounders not being mentioned in the abstract. The PE rule improved precision compared with the Any 2 PECO Terms rule but had worse recall, whereas the EO rule had the same recall as Any 2 PECO Terms with higher precision in every case.

Using the best screening rule (EO), the screening load reduction was 86.7% (out of a maximum possible 97.2%) for Hamra 2014, 89% (out of a maximum possible 99.3%) for Thayer 2013, and over 98% (out of a maximum possible 99.4%) for Johnson 2014 (Table 4). We note, however, that Johnson 2014 did not achieve perfect recall, because one included study was missed (i.e. classified as a false negative). The more generalised Any 2 PECO Terms screening rule also missed the same paper; hence, the overall FN rate across all three datasets was 2%.

Discussion

Screening studies based on automatically extracted study characteristics is an effective way to address the need for automation of reference screening as part of a systematic review.

A recent systematic review of automatic screening methods [5] found that all 44 reviewed methods used machine learning and were observed to save between 30 and 70% of screening decisions while missing up to 5% of relevant studies (i.e. Re = 95%). A more recent review of these methods questions the confidence in these results because of the similarity of the methods and the wide range of results [12].

By comparison, the use of automatically detected study characteristics in abstracts appears to be even more effective and reliable. For example, screening by detection of exposure and outcome (EO) mentions in an abstract in this study saved 93.7% of the screening work while missing 2% of relevant studies (Re = 98%). In other words, in a systematic review with 10,000 articles to screen, automatically excluding 70% of articles leaves the reviewer 3000 articles to screen manually, whereas a system that screens out 93.7% would leave only 630 articles to review manually, almost five times fewer.

Whilst this method appears to dramatically reduce reviewer screening effort, it does generate some additional work in algorithm development. Specifically, the information extraction algorithm used here is not fully automated: some effort had to be put into developing two dictionaries specific to the topic of the reviews, and even more effort may be required when the approach is applied to other types of articles (e.g. RCTs). This work focuses on the potential of extraction algorithms to assist in screening and not on the extraction algorithm itself.

The approach of using study characteristics for screening depends on having a reliable method for identifying study characteristics. A recent review of text extraction methods for study characteristics [8] showed that most algorithms focused on identifying the sentences that hold key information, rather than automatically extracting the information. We anticipate that new extraction algorithms will greatly improve the extraction of trial characteristics from abstracts.

An alternative to extracting characteristics by text mining is to take advantage of trial registries. Registries typically report, in a structured format, the kind of study characteristics used in this study, and these could be identified automatically without the need for text mining or concept and word recognition, at least for screening with the approach used here. The number of clinical trials that can be found in online registries is increasing, and the quality of the data in these registries is improving. Other databases of clinical trials that contain trial characteristics (e.g. Epistemonikos [9]) are another potential source for screening, although in this study we did not assess how these would be used.

Error analysis

We analysed the abstract of the one study that was missed by the system [13]. Of the PECO elements, the abstract mentioned the exposure and population (and country) but not the outcome or any confounders. However, both exposure and population were described in terms that were missing from the extraction algorithm's dictionary. Specifically, the exposure of the study was given as “PFC”, and the exposure of interest (“PFOA” or “PFOS”) was mentioned only in the outcomes. The population was described as “maternal cord blood” rather than the population of interest (“pregnant women”). Therefore, all screening rules, including Any 2 PECO Terms and EO, failed to identify it as potentially relevant. This points to a limitation of the extraction algorithm that is also inherent in abstract and title screening in general and is not specific to screening by study characteristics.

Limitations and future work

This study used three systematic reviews, and our reported performance may not generalise to other systematic reviews, especially in other domains, because of the small sample size. As more data becomes available, for example from efforts of the International Collaboration for the Automation of Systematic Reviews (ICASR), we will repeat this study to provide more robust conclusions.

Our study demonstrates the susceptibility of all automated screening methods to false negatives, i.e. erroneously excluding relevant articles, as such decisions would not be revisited. Whilst some systems avoid these errors by ranking articles rather than excluding them from further analysis, the work saved by such systems is also limited. Further research is required on whether a confidence measure, rather than a binary classification, would avoid such errors or whether another method could be more effective.

We have not investigated the reasons for, or implications of, the EO screening rule performing best and hence do not make recommendations for immediate changes to the systematic review process. It may be possible to design search strategies that require exposure and outcome terms to be present in the title and abstract, thereby reducing result set sizes without missing important studies. Further research comparing such search strategies is still required.

Conclusion

We have demonstrated a novel method for accelerating screening for systematic reviews, in which study characteristic extraction is performed ahead of screening. Whilst such extraction is expensive without automation, it is nonetheless practical and effective when automated.