We conducted experiments to study the accuracy of the classification approach using different subsets of the features, as well as its performance at different splits of the training and testing datasets. For a system that is to be deployed for real use, we need to determine the minimum size of the training dataset required to produce satisfactory predictions. In the third experiment, we focus on the SR update process, where the decisions on an already published review are used to predict what should be included in a future update.
In the first experiment, we study the performance of our random-forest-based classifier using the different classes of features introduced earlier. Previous research (Cohen 2008) showed that unigrams and bigrams are optimal features for an SVM classifier. However, as we introduce new features (e.g., citations), bigrams may no longer be the best predictors.
For each dataset, we perform a \(5\times 2\) cross validation to be consistent with the related work we compare against. In \(5 \times 2\) cross validation, the dataset is first split in half, with one half used for training and the other for testing; then the roles of the two halves are switched. In these experiments, the split is carried out with stratification, as in previous work. This process is repeated 5 times, resulting in 10 estimates that are averaged at the end.
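As a concrete illustration of this protocol, the sketch below shows one way to run \(5 \times 2\) stratified cross validation with a random forest and average the ten recall estimates. It is a minimal sketch assuming a numeric feature matrix X and a binary label vector y (1 = include); the function name and classifier settings are illustrative and not the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

def five_by_two_cv(X, y, seed=0):
    """Run 5 repetitions of a stratified 2-fold split and average recall."""
    scores = []
    for rep in range(5):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in skf.split(X, y):
            clf = RandomForestClassifier(n_estimators=100, random_state=seed)
            clf.fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            scores.append(recall_score(y[test_idx], pred))
    return np.mean(scores)  # average of the 10 estimates
```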
In Table 2, we report the obtained recall along with the corresponding WSS for each SR. There are several experiments, each conducted with a different combination of features. The features are title and abstract unigrams (Uni), title and abstract bigrams (Bi), citation information (Cite), Huffman codes from Brown clustering (BC), and length-12 and length-16 prefixes of the Huffman codes, along with the full code, generated from Brown clustering (SubBC). While we experimented with all possible combinations, we report mostly on the unigram-based features, as they were the most competitive in terms of recall. All models included publication type and MeSH term features by default. From Table 2, we find that the combination of unigrams and Brown clustering codes (Uni+BC) achieved the highest recall in 7 out of the 15 systematic reviews, tied with unigrams and Brown clustering code prefixes (Uni+SubBC). However, the unigram and citation (Uni+Cite) model had recall values very close to those of the (Uni+BC) model; in some cases the difference was less than 0.001. A statistical significance test on these feature combinations showed that the differences between the models are not statistically significant, so we adopt the (Uni+Cite) model for the rest of the paper.
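For reference, WSS can be computed directly from the confusion matrix of a split, following the usual definition \(\mathrm{WSS} = (TN + FN)/N - (1 - \mathrm{recall})\) from Cohen et al. (2006). The sketch below assumes binary labels with 1 denoting an included study; the helper name is illustrative.

```python
from sklearn.metrics import confusion_matrix

def recall_and_wss(y_true, y_pred):
    """Compute recall of the positive (include) class and work saved
    over sampling: WSS = (TN + FN) / N - (1 - recall)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    n = tn + fp + fn + tp
    recall = tp / (tp + fn)
    wss = (tn + fn) / n - (1.0 - recall)
    return recall, wss
```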
We compare the obtained WSS against the values of WSS reported by other approaches in Table 3. The included approaches are the voting perceptron (VP) (Cohen et al. 2006), Complement Naive Bayes (CNB) (Matwin et al. 2010), and the citation + unigram based random forest (RF). The recall column reports the recall value for the positive class obtained by RF. For the VP approach, the authors report WSS at different values of recall; therefore, the comparison with VP is carried out by picking the value of WSS corresponding to the recall closest to that obtained by RF. Note that the WSS values for CNB are for a recall of 95 % and not for the same level of recall obtained by RF. Since our approach does not depend on tweaking variables to obtain different values of recall, we cannot always guarantee a recall of 95 %: sometimes it is higher (8 out of the 15 datasets had recall \(>\) 95 %) and sometimes slightly lower (4 out of 15). Generally, higher values of recall are correlated with lower values of WSS. Overall, not only is our approach capable of obtaining recall higher than 95 %, it also outperforms CNB in 5 out of the 15 datasets, while being comparable in 3 others. Note that the CNB values are those obtained with the weights yielding the highest recall, and these weights cannot be chosen beforehand. When compared against VP, our RF-based classifier outperforms it in 13 out of the 15 datasets.
To compare with the SVM approach reported in Cohen (2008), we compute AUC for our RF model because Cohen (2008) only reported AUC. We compare the RF classifier with different feature sets against the reported SVM approach, which uses title and abstract n-grams along with MeSH terms. We report the AUC values for our (Uni+Cite) random forest model in Table 3. When a statistical significance test is performed to compare our AUC values against the ones reported in Cohen (2008), our RF model falls in the first rank group, while the model based on Cohen (2008) falls in the third rank group when all possible combinations of features are tested. Therefore, a model based on our features will have higher AUC than that of Cohen (2008). We also notice that an SVM model with citations and/or Brown clustering features yields results that are not statistically significantly different from those of an RF model on all the datasets, despite having slightly lower AUC values. This suggests that, when optimizing for AUC, the contribution comes from the features more than from the classifier. Therefore, we compare the contribution of each feature group to the accuracy of the classifier. Table 4 lists the AUC of each possible combination of the feature groups after grouping statistically insignificant results together using Wilcoxon tests. All combinations were tried to find the best feature set; the number of non-empty combinations equals \(2^6 - 1 = 63\). The Antihistamines and SkeletalMuscleRelaxants reviews were excluded as outliers because of their small absolute number of included studies, as pointed out by Cohen (2011) and Matwin et al. (2010).
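As an illustration of how such rank groups are formed, two feature combinations can be compared with a paired Wilcoxon signed-rank test over their per-review AUC values; combinations whose differences are not significant fall into the same group. The sketch below assumes two equal-length lists of per-review AUC values; the function name and the 0.05 threshold are illustrative.

```python
from scipy.stats import wilcoxon

def same_rank_group(auc_a, auc_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test over per-review AUC values.
    Returns True when the difference is not statistically significant,
    i.e. the two feature combinations belong to the same rank group."""
    stat, p_value = wilcoxon(auc_a, auc_b)
    return p_value >= alpha
```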
Required training size
In the previous experiments, we examined the recall and WSS of the classifier based on \(5 \times 2\) cross validation, where 50 % of the data is used for training. However, we would like to find the smallest percentage of training data with which the classifier still makes accurate predictions. We vary the size of the training dataset to be 10, 20, 30, and 40 % of the total dataset size. For each training size percentage, five splits are created and the final result is the average over these 5 splits. While the approach we devised relies on the ratio of included to excluded articles being reasonably similar in the training and testing datasets, in this experiment we relax this constraint and study the performance when the training and testing datasets are split without stratification. This scenario arises in real life when SR preparers review articles that are sorted by some measure of similarity, or when a reviewer starts reviewing articles before compiling the entire list of candidates. Therefore, in this experiment we compare the performance of the classifier at various split percentages for both a stratified sample and a non-stratified sample. For each sampling approach, five splits are carried out at each level and the average performance over these five splits is reported. Note that in the case of a non-stratified split, the training data may have zero positive examples, making it impossible to calculate the ratio r that determines the misclassification penalty. When a split results in zero positive examples in the training data, the split is ignored; at most one such split was encountered per dataset.
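The sketch below shows one way such splits can be generated, assuming a numeric feature matrix X and a binary label array y (1 = include); passing or omitting the stratify argument of train_test_split mirrors the stratified and non-stratified settings, and splits with no positive training example are skipped as described above. The helper name and repetition scheme are illustrative.

```python
from sklearn.model_selection import train_test_split

def make_splits(X, y, train_fraction, stratified, n_repeats=5, seed=0):
    """Create repeated splits at a given training-set fraction,
    with or without stratification; splits whose training half has
    no positive example are skipped (the ratio r is undefined there)."""
    splits = []
    for rep in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y,
            train_size=train_fraction,
            stratify=y if stratified else None,
            random_state=seed + rep,
        )
        if y_tr.sum() == 0:  # no included studies in the training half
            continue
        splits.append((X_tr, X_te, y_tr, y_te))
    return splits
```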
In Figs. 1 and 2, values of recall and WSS for stratified and non-stratified splitting are plotted against different percentages of training data size, where each subplot represents a specific SR from our sample; S and R denote stratified and non-stratified sampling, respectively. In 12 out of the 15 SRs, the recall and WSS values start to converge at a training split of 30 %. While recall tends to converge at lower training percentages, increasing the training size to 30 % is more beneficial to WSS than to recall. That is, having more training data reduces the false positive rate and hence saves the time of SR authors. Interestingly, previous research that applied active learning to select which articles to tag reported needing up to 30–40 % of the entire dataset to achieve recall as high as 1 (Wallace et al. 2010b) (albeit on a different dataset containing only three SRs, their results about the required percentage of training data are informative here). In the web application we deployed (http://rayyan.qcri.org/), the application decides when it is confident enough to show predictions by running a cross validation on the already tagged articles.
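A hypothetical sketch of one way such a confidence check could be implemented is shown below: cross-validate a classifier on the articles the user has already tagged and surface predictions only when the estimated recall clears a threshold. The function name, threshold, and classifier settings are illustrative assumptions and do not describe Rayyan's actual internals.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def confident_enough(X_tagged, y_tagged, recall_threshold=0.95):
    """Decide whether to surface predictions by cross-validating a
    classifier on the articles the user has already tagged (y = 1
    for include, 0 for exclude, as numpy arrays)."""
    n_pos = y_tagged.sum()
    n_neg = len(y_tagged) - n_pos
    if n_pos < 2 or n_neg < 2:  # not enough decisions of each kind yet
        return False
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X_tagged, y_tagged, cv=cv, scoring="recall")
    return scores.mean() >= recall_threshold
```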
Updating systematic reviews
As SRs go out of date, it is crucial to find and include newly published studies that might change the recommendations of the review. SR authors are presented with a collection of studies that were published after the review went to print, and they sift through the titles and abstracts to identify potentially relevant studies. A classification algorithm can be used here to filter the list of newly published studies based on the inclusion and exclusion decisions made while creating the review in the first place. This, of course, depends on storing the decisions for the studies used in the initial review. We were, however, unable to obtain a dataset that records the list of studies used when creating a SR along with the list of studies considered in the update phase. Automatic classification techniques are more acceptable here because the reviewers are not asked to manually tag studies; instead, the tags generated while creating the original review are used to train the model.
To model this, we set up the following experiment. For each of the 15 SRs described earlier, we mimic an update scenario by assuming that all articles published on or before 2001 were used for creating the original review, and that articles published between 2002 and 2004 are used for a hypothetical update taking place in 2005. The dates were chosen such that there is a large enough body of work published after the original review goes online to mandate an update. Furthermore, a three-year period is a reasonable time span after which a SR might need an update, based on Shojania et al. (2007). The numbers of included and excluded studies published before 2002 and from 2002 onwards are reported in Table 5. Note that the dataset used here contains studies from 1991 to 2004 only.
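A minimal sketch of this temporal split is shown below, assuming each study record carries its publication year in a "year" field; the field name and helper are illustrative.

```python
def temporal_split(records, cutoff_year=2001):
    """Split study records into an 'original review' set and an 'update'
    set by publication year: studies published on or before the cutoff
    train the model, later studies form the hypothetical update."""
    original = [r for r in records if r["year"] <= cutoff_year]
    update = [r for r in records if r["year"] > cutoff_year]
    return original, update
```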
We use the random forest classifier with unigram and citation features. The results are presented in Table 5. In 11 out of the 15 SRs, we achieve a recall of 1 with WSS values ranging from 0.09 up to 0.52. By examining the datasets closely, we find that in SRs 2 and 14 the number of included studies published after 2001 is equal to or larger than the number of included studies published before 2002, meaning that the timeline split leaves relatively few positive examples for training in these reviews. This can explain the lower-than-1 recall incurred for these two reviews. Overall, the average WSS was 0.3671, meaning that with a recall of nearly 1, the classifier saves about 37 % of the SR preparer's time. This can translate into hours, if not days, worth of work.
Integration with Rayyan
Rayyan (http://rayyan.qcri.org/) is a web-enabled application that helps systematic review authors expedite their work. Authors upload a list of studies obtained from searches on different databases and start screening them for inclusion and exclusion. Using different facets, e.g., extracted MeSH terms, keywords for inclusion, keywords for exclusion, journals, authors, and year of publication, they navigate through their citations and filter them to focus on those they want to include or exclude. They can also browse a similarity graph of the studies based on different attributes such as titles and authors. As they browse and filter the studies, they select those to include or exclude. They can also label them for easy reference or to report the reason for inclusion or exclusion. Once they have included and excluded enough studies, the prediction module is run and suggestions on the undecided studies are returned to the users, who then make the actual decisions. When updating a review, users simply upload a new set of studies and Rayyan provides suggestions on these. Other features of Rayyan include the ability to have multiple collaborators work on the same review, to upload multiple files (lists of studies) to the same review, and to copy studies across reviews.