Introduction

Systematic reviews (SRs) are often placed at the top of the evidence pyramid due to their systematic methods and consideration of the entire body of evidence on a topic [1]. Unfortunately, the standard practice of relying on small teams of individuals to perform time-consuming tasks, such as screening thousands of abstracts, retrieving and reviewing hundreds of full-text articles, and data extraction, often leads to considerable delays between study initiation and completion [2]. This issue is further compounded by the recent exponential growth in scientific literature [3], resulting in a larger number of citations at each stage of the SR process and a higher workload for each team member. Consequently, SRs may not be initiated due to concerns about feasibility, may be abandoned partway through, may take years to finish, or may be out of date shortly after publication [4].

There is significant interest in methods that can increase the speed and efficiency of SR completion [5,6,7,8,9,10,11,12,13,14]. Approaches intended to increase efficiency include computer-assisted screening (natural language processing or machine learning) [10,11,12,13], screening by a single reviewer [9], and screening of titles without abstracts [8]. While these methods can reduce the workload per reviewer or decrease the time to SR completion, there are concerns that these methodological approaches may compromise SR quality [9, 15,16,17]. For example, some authors have found a 7-10% loss in sensitivity when single-reviewer screening is used [9].

One potential approach to accelerate the completion of large SRs, while adhering to the gold-standard methodology of two independent assessments per citation, would be to involve a larger team. This would reduce the workload per reviewer, facilitating faster and potentially more valid and reliable work [18, 19]. For example, our team of almost twenty reviewers completed four systematic reviews on mask decontamination during the COVID-19 pandemic, in an average of two weeks from protocol development to manuscript dissemination on the Open Science Framework [20,21,22,23]. This large-team, or crowdsourcing, approach can be helpful when SRs involve many tasks (e.g. large citation set sizes) or when investigators need to complete an SR at a much quicker pace.

Despite these promising methods, the extent to which investigators adapt team sizes or methods in response to larger citation set sizes is unknown. Since 1950, the number of authors on MEDLINE-indexed papers has tripled in response to the increasing complexity of research [24], which could suggest that SR teams are growing in relation to citation workload. The aim of this study was to understand how citation set size affects the approach to conducting SRs, specifically its effect on authorship, team size and screening methods. Our primary objective was to evaluate the extent to which SR screening workload (initial citation set size) influences team size. The secondary objective was to report on the influence of citation set size on the application of crowdsourcing, machine learning and non-standard (i.e. single-assessment) screening methods.

Methods

This was a cross-sectional study of SRs in the area of health/medicine indexed in MEDLINE in April 2019. The study protocol is available at https://osf.io/w57ts/ and adheres closely to the standard methodology for systematic reviews.

Eligibility criteria

Systematic and scoping reviews indexed in MEDLINE in April 2019 were included in this study if they met the PRISMA-P definition of an SR [25]: they demonstrated a knowledge synthesis effort, their stated objective was to summarise evidence from multiple studies, explicit methodologies were described, and the report represented an original study (i.e. not a protocol or a summary of published findings). Scoping reviews and Health Technology Assessments (HTAs) met this definition [26]. We included SRs on any health/medicine-related topic (i.e. biomedical, technological, social) that included any type of study (i.e. qualitative or quantitative). We excluded narrative reviews, non-systematic literature reviews, rapid reviews, and umbrella reviews. We excluded publications that reported two different studies when the SR was not the main objective (e.g. an SR used only to inform the main objective of the study). We also excluded SRs written in a language other than English. The inclusion and exclusion criteria are shown in Supplementary Material 1.

Search

We searched for SRs indexed in MEDLINE over one month (April 2019). We searched Ovid MEDLINE In-Process & Other Non-Indexed Citations and Ovid MEDLINE 1946 to May 14, 2019, using the search strategy reported by Moher et al. [27]: 1. 201904$.ed.; 2. limit 1 to English; 3. 2 and (cochrane database of systematic reviews.jn. or search.tw. or metaanalysis.pt. or medline.tw. or systematic review.tw. or ((metaanalysis.mp,pt. or review.pt. or search$.tw.) and methods.ab.)). From this search, we selected the articles indexed in April 2019. Records were downloaded and imported into Reference Manager.

Study selection

Citation screening was performed using insightScope (www.insightscope.ca), a web-based platform that allows the creation of a large, online team to facilitate rapid citation screening [22, 28]. After piloting screening on an initial random selection of 400 citations, we sought to recruit ≥ 10 additional team members. Before citation screening was initiated, reviewers read the study protocol and were required to achieve a sensitivity above 80% on a test set of 50 randomly selected citations (40% true positives) [28]. Ten individuals completed and passed the test set, with an average sensitivity and specificity of 94% and 69%, respectively. Kappa values were calculated for title/abstract and full-text screening using the Fleiss approach; values between 0.6 and 0.8 are considered moderate agreement, and values > 0.8 are considered strong agreement [29].
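To make the agreement calculation concrete, the following is a minimal sketch in R (the language used for the study's analyses) of a Fleiss kappa computation for two-reviewer include/exclude decisions; the data frame and its values are hypothetical, and the irr package is one common implementation rather than necessarily the one used here.

```r
# Minimal sketch: Fleiss kappa for citation screening agreement.
# 'screening_decisions' is hypothetical illustrative data, with one row
# per citation and one column per reviewer.
library(irr)

screening_decisions <- data.frame(
  reviewer_1 = c("include", "exclude", "exclude", "include", "exclude", "include"),
  reviewer_2 = c("include", "exclude", "include", "include", "exclude", "include")
)

# Fleiss kappa across the reviewer columns (works for two or more raters)
kappam.fleiss(screening_decisions)
```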

Citations at both title/abstract and full-text levels were assessed by two reviewers independently, with conflict resolution performed by one of the three study leads. For all SRs meeting our inclusion criteria, we recorded the initial citation set size, defined as the total number of citations screened at the first level of the SR after duplicate removal. SRs that did not report removing duplicates were included only if the number of citations screened at the first screening level was clearly reported. The included SRs were then sorted into five citation set size categories: < 1,000, 1,001–2,500, 2,501–5,000, 5,001–10,000, and > 10,000. These categories were chosen based on results from Page et al. [30] and were intended to approximately represent the 50th, 75th, 90th and 95th percentile cut-offs.

A random sample of 50 citations from each category was selected for further data extraction (no formal sample size calculation was performed). The citation list for each category was imported into Microsoft Excel, each study was assigned a random number using the Generate Random Number feature, citations were sorted by the random number, and the first 50 were selected for data extraction. SRs that did not clearly report the size of the initial citation set screened (i.e. the number of citations screened at the first level of screening, post-duplicate removal) were excluded from further analysis.
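For illustration, the following is a minimal sketch in R of an equivalent categorization and sampling step (the study itself used Excel's random number feature); the srs data frame, its columns, and the synthetic citation counts are purely hypothetical.

```r
# Sketch of an equivalent sampling procedure: categorize each eligible SR
# by initial citation set size, assign a random number, and keep the first
# 50 per category after sorting. All data here are synthetic.
set.seed(2019)

srs <- data.frame(
  id       = 1:1880,
  set_size = round(exp(runif(1880, log(100), log(50000))))  # synthetic citation counts
)

srs$size_category <- cut(
  srs$set_size,
  breaks = c(0, 1000, 2500, 5000, 10000, Inf),
  labels = c("<1,000", "1,001-2,500", "2,501-5,000", "5,001-10,000", ">10,000")
)

sampled <- do.call(rbind, lapply(split(srs, srs$size_category), function(d) {
  d$rand <- runif(nrow(d))              # analogue of Excel's random number column
  head(d[order(d$rand), ], 50)          # first 50 citations after sorting
}))

table(sampled$size_category)            # 50 SRs per category
```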

Data extraction and verification

A data extraction tool was developed by the study leads using KNACK software (knack.com) and piloted on 10 randomly selected eligible SRs (see Supplementary Material 2 for data extraction variables). The three study leads and seven reviewers contributed to data extraction. Prior to initiating data extraction, reviewers attended a one-hour interactive training webinar. The 250 SRs were then divided among the study team and extracted independently in duplicate. After extraction was completed for a given SR, both team members reviewed conflicting answers and corrected any errors in the data they had entered. Once the data extracted from each SR had been verified by both original extractors, one of the study leads resolved outstanding conflicts.

Outcome data are reported for the 50 randomly selected SRs in each of the five initial citation set size categories. The primary objective was to determine the relationship between citation set size and team size, which was evaluated using both author number and number of screeners (when stated). The number of authors was determined from the authors named on the paper and individuals listed as part of group authorship (if applicable). The number of screeners was defined as the number of individuals reported in the manuscript or acknowledgments to have contributed to citation screening (for a given screening level). The average workload per screener for a given screening level was calculated using only those SRs that reported using two assessments per citation, regardless of independence; an illustrative workload calculation is sketched below. Gold-standard methodology was defined as two (or more) independent assessments per citation.
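As a worked illustration of this definition, the sketch below assumes one plausible reading of the calculation: with two assessments per citation, the total number of assessments is twice the citation set size, split evenly across the reported screeners. The numbers and the even-split assumption are illustrative and are not taken from any individual SR.

```r
# Hedged sketch of a per-screener workload calculation (assumes an even
# split of assessments across screeners; values are illustrative only).
citations_screened <- 3585          # illustrative title/abstract citation count
n_screeners        <- 4             # illustrative screening team size

total_assessments     <- 2 * citations_screened          # two assessments per citation
workload_per_screener <- total_assessments / n_screeners
workload_per_screener                                    # ~1,793 citations assessed each
```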

Statistical analysis

Characteristics of the SRs were summarized using descriptive statistics. Continuous variables were described using mean and standard deviation (SD) if normally distributed, or median and interquartile range (IQR) for non-normal distributions. Categorical variables were described using frequencies and percentages. Pearson correlation was used to quantify the relationship between citation set size and listed author number. Pairwise t-tests, corrected for multiple comparisons using a Holm correction, were used to compare listed author numbers across citation set size categories. Logistic regression was used with the following dependent variables: screener number (> 2 screeners vs 1 or 2 screeners), gold-standard methodology at the title/abstract level, and gold-standard methodology at the full-text level. A Chi-squared test was performed to test for differences in proportions. The independent variable, citation set size, was log-transformed as it varied by several orders of magnitude. Because categorization by initial citation set size was potentially less applicable for evaluating relationships at the full-text level, we also evaluated these relationships by grouping SRs into terciles. All analyses were performed using R statistical software version 3.6.2 [31].
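The following is a minimal sketch, in R, of the analyses described above as we understand them; the variable names and the synthetic data frame are hypothetical, and the code is illustrative rather than the authors' actual analysis script.

```r
# Sketch of the main analyses on a synthetic data set (one row per SR).
set.seed(1)
dat <- data.frame(
  set_size       = round(exp(runif(259, log(200), log(50000)))),  # synthetic counts
  n_authors      = rpois(259, 6),
  many_screeners = rbinom(259, 1, 0.17)   # 1 = more than two screeners reported
)
dat$size_category <- cut(dat$set_size, c(0, 1000, 2500, 5000, 10000, Inf))

# Pearson correlation between citation set size and listed author number
cor.test(dat$set_size, dat$n_authors)

# Pairwise t-tests across size categories, Holm-corrected
pairwise.t.test(dat$n_authors, dat$size_category, p.adjust.method = "holm")

# Logistic regression with a log2-transformed predictor; exp(coefficient)
# is then the odds ratio per doubling of the citation set size
fit <- glm(many_screeners ~ log2(set_size), data = dat, family = binomial)
exp(cbind(OR = coef(fit), confint.default(fit)))   # Wald confidence intervals

# Chi-squared test comparing proportions above/below 2,500 citations
chisq.test(table(dat$set_size > 2500, dat$many_screeners))
```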

Changes from the original protocol

There were three major changes from the original study protocol. First, the original protocol included an objective proposing to determine whether there was a relationship between citation set size and the time required for SR completion. Due to a combination of significant missing data and concerns about misclassification, this objective was not pursued. Second, a decision was made to evaluate differences in study screening and extraction approaches using not only the five citation set size categories assigned based on the initial citation set sizes, but also terciles corresponding to the citation set sizes at full-text review and data extraction. Finally, given the small number of SRs that used non-standard screening methods (e.g. crowdsourcing or computer-assisted screening), no further analysis of these methods was possible.

Results

Search results

A total of 4,119 citations were retrieved from the MEDLINE search. Title and abstract screening excluded 1,840 citations, with the review team achieving a kappa of 0.81. Review of the remaining 2,279 citations at full text identified 1,880 eligible SRs, with an overall kappa of 0.67. An overview of the screening process, results and reasons for exclusion is shown in the PRISMA diagram (Fig. 1). Upon more detailed review during data extraction, nine of the 250 randomly selected SRs were reclassified into different citation set size categories. To achieve a minimum of 50 SRs in each size category, an additional nine citations were randomly selected, providing a final sample size of 259 SRs.

Fig. 1
figure 1

PRISMA Diagram

Epidemiology of the included systematic reviews

Table 1 provides a summary of the characteristics of the 259 systematic reviews (see Supplementary Material 3 for a summary of the number of SRs included in each subsection of the results). The overall median initial screening set size for the 259 SRs was 3,585 (IQR: 1,350 to 6,700), with a maximum of 215,744. The median (IQR) number of citations for each initial citation set size category is presented in Table 2. For the full cohort of 259 articles, the median author number was 6 (IQR: 4 to 8), with a maximum of 26. Three levels of screening (title, abstract, full text) were performed in 42 (16.2%) of the SRs. Authors of all 259 SRs completed full-text screening, and all but one completed data extraction (in that review, no eligible studies were identified, so no data collection occurred). Authors of two SRs used crowdsourcing during title/abstract screening [32, 33], and authors of one SR used computer-assisted screening during title-only screening [34].

Table 1 Epidemiology of the included systematic reviews
Table 2 Overall citation set size

Relationship between number of authors and citation set size

The median author number and IQR for each initial citation set size category are presented in Fig. 2. Confirming visual inspection, pairwise t-tests with Holm's correction identified that the only between-group comparison with a p-value under 0.05 was between the < 1,000 (median 5, IQR: 3.2, 6.0) and > 10,000 (median 6, IQR: 4.5, 8.0) citation set size categories (p = 0.049). Further, when examined through Pearson correlation using citation set size as a continuous value, there was little evidence of a relationship between initial citation set size and number of authors (r = 0.03, 95% CI: -0.09 to 0.15, p = 0.63); this finding was unchanged after removing three outliers (r = 0.02, 95% CI: -0.13 to 0.17, p = 0.78).

Fig. 2
figure 2

Listed author number by initial citation set size category (N = 259). The median (IQR) number of authors listed by citation set size category was as follows: < 1,000: 5.0 (3.2, 6.0); 1,000–2,500: 7.0 (5.0, 8.0); 2,500–5,000: 6.0 (4.0, 8.0); 5,000–10,000: 5.5 (4.0, 8.0); and > 10,000: 6.0 (4.5, 8.0). In pairwise t-tests, corrected for multiple comparisons using Holm's method, only three comparisons yielded a p-value less than 0.2: < 1,000 vs 1,000–2,500 (p = 0.18); < 1,000 vs 5,000–10,000 (p = 0.15); and < 1,000 vs > 10,000 (p = 0.046)

Relationship between number of screeners and citation set size

We identified 192 SRs (75%) reporting the total number of screeners contributing to the initial level of screening. The median number of screeners was 2 (IQR: 2 to 2), with a range of 1 to 24. In total, 21 (11%) SRs reported a single screener performing all of the initial level of screening, 138 (72%) reported a two-screener team, and 33 (17%) reported utilizing a team of more than two screeners. Figure 3 presents the percentage of SRs with teams of 1, 2 or > 2 screeners by initial citation set size category. There appeared to be a greater percentage of SRs that used > 2 screeners in the larger citation set size categories. The relationship between log initial citation set size and a screener number > 2 is illustrated in Supplementary Material 4. In a logistic regression model, the odds ratio between log2 citation set size and a screener number > 2 was 1.16 (95% CI: 0.95, 1.44), p = 0.15. An odds ratio of this magnitude would imply that for every doubling in citation set size, the odds of using more than two screeners would increase by 16% (a worked example of this interpretation follows). When initial citation set size was dichotomized into categories above or below 2,500, the proportion of SRs with > 2,500 citations using > 2 screeners was 21.9% (28/128), compared to 7.8% (5/64) in the smaller citation set size group (p = 0.03).
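To make the interpretation of the log2-scaled odds ratio concrete, the short sketch below (illustrative R, not the authors' code) shows how a per-doubling odds ratio propagates to larger multiples of the citation set size.

```r
# With log2(citation set size) as the predictor, exp(beta) is the odds
# ratio for each doubling of the set size; a k-fold increase in set size
# multiplies the odds by OR^log2(k).
or_per_doubling <- 1.16

or_per_doubling^log2(2)    # doubling:  1.16 (a 16% increase in odds)
or_per_doubling^log2(4)    # 4-fold:    1.16^2 ~= 1.35
or_per_doubling^log2(10)   # 10-fold:   1.16^3.32 ~= 1.64
```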

Fig. 3
figure 3

Total number of screeners who contributed to title/abstract screening by initial citation set size category (N = 192). The above figure presents the percentage of SRs with teams of 1 (white), 2 (light grey) or > 2 (dark grey) screeners by initial citation set size category

Average workload per screener

There were 156 SRs that reported using two assessments per citation during title/abstract screening and provided a citation set size at the title/abstract level. As shown in Table 3, there was a strong relationship between workload per screener at the title/abstract level and initial citation set size category. Full-text workload per screener by initial citation set size is presented in Table 4. Again, the workload per screener was higher in the larger citation set size categories. For example, at full text the largest citation set size category required a median of 117 (IQR: 53, 345) citations per screener, compared with a median of only 40 (IQR: 20, 58) in the smallest initial citation set size category.

Table 3 Workload per screener at title/abstract by initial citation set size category
Table 4 Full text workload per screener by initial citation set size category

Methodological approach to title and title/abstract screening

There were 116 (44.8%) SRs that adhered to the gold-standard methodology for both citation screening levels performed (i.e. two or more independent assessments per citation).

Title-only screening was performed in 42 (16.2%) of the 259 SRs. Of these, 45.2% (N = 19/42) reported using gold-standard methodology for title screening; 4.8% (N = 2/42) reported using two assessments per citation, but independence was not reported or was unclear; 19.0% (N = 8/42) reported using a single assessment per citation; and 2.4% (N = 1/42) reported using computer-assisted screening. Finally, for the remaining 28.6% (N = 12/42), the screening method used was not reported or was unclear.

During title/abstract screening, 163 (62.9%) SRs utilized gold-standard methodology. The screening approach used at the title/abstract level by initial citation set size category is presented in Table 5. There appeared to be a smaller proportion of SRs in the larger citation set size categories reporting standard methodology. In a logistic regression model examining the association between log2 initial citation set size and a gold-standard approach to screening, the odds ratio was 0.85 (95% CI: 0.71, 1.00), p = 0.05 (see Supplementary Material 5). An odds ratio of this magnitude implies that for every doubling of the initial citation set size, there is a 15% decrease in the odds of using the gold-standard approach. When dichotomizing SRs into large and small size categories, the proportion using gold-standard methodology was 98/136 (72.1%) when the initial citation set size was > 2,500, compared to 65/78 (83.3%) when it was ≤ 2,500 (p = 0.09).

Table 5 Distribution of methodology used during title/abstract screening by initial citation set size category (n = 214)a

Methodological approach to full-text screening

For full-text screening, the methodology used is presented in Table 6. Terciles were created for the 186 SRs based on the number of citations retained for full-text screening. When evaluating the relationship between screening methodology and full-text set size, there was a higher percentage of SRs using two (or more) reviewers in the smaller citation set size terciles (see Supplementary Material 6). In a logistic regression model examining the association between log2 full-text set size and a gold-standard approach to screening, the odds ratio was 0.80 (95% CI: 0.65, 0.99), implying that for every doubling of the set size at the full-text level there was a 20% decrease in the odds of using the gold-standard approach (see Supplementary Material 7).
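As an aside, the tercile grouping described here can be reproduced in a few lines of R; the vector of full-text counts below is hypothetical, and the cut-points are simply the empirical tertiles of those synthetic values.

```r
# Sketch of tercile construction for full-text citation set size
# (synthetic counts; the real analysis used the 186 SRs described above).
fulltext_n <- c(12, 45, 80, 150, 30, 60, 200, 25, 90, 55, 140, 18)

tercile <- cut(
  fulltext_n,
  breaks = quantile(fulltext_n, probs = c(0, 1/3, 2/3, 1)),
  include.lowest = TRUE,
  labels = c("small", "medium", "large")
)
table(tercile)
```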

Table 6 Distribution of methodology used during full text screening by initial citation set size category (n = 186)a

Methodological approach to data extraction

A total of 141 (54.4%) of the 259 SRs reported that the majority of the data was extracted in duplicate. Supplementary Material 8 summarizes the methodology used for data extraction by initial citation set size category. The number of articles in each SR at data extraction was then used to create terciles, with a median (IQR) number of articles of 40.0 (25.0, 72.0). No relationship was observed between the number of individuals who contributed to data extraction and initial citation set size, or between the number of individuals who contributed to data extraction and the number of articles for data extraction based on terciles (see Supplementary Material 9).

Discussion

Using a cross-sectional study design, we sought to identify a large, representative sample of health-related systematic reviews to evaluate the relationship between citation set size, team size and screening methods. We identified considerable variability in citation set size, suggesting that there are thousands to potentially tens of thousands of SRs published each year with initial citation set sizes exceeding 5,000. Despite this wide-ranging variability in initial citation set sizes, we did not observe a discernible relationship between citation set size and the number of authors or the average number of screeners. Further, few studies appeared to use or report adjunct screening methods, including crowdsourcing or computer-assisted screening. Finally, this work provides evidence that SR teams are more likely to deviate from accepted gold-standard screening approaches when faced with larger citation set sizes.

Our search of a single health database identified approximately 2,000 systematic reviews published in a one-month timeframe. This finding suggests that at least ~ 25,000 SRs are published each year, which is likely a considerable underestimate if one considers other publication databases, unpublished work, and reviews from other disciplines within the natural and applied sciences. In addition, comparison with findings from similar studies suggests that the rate of SR publication is rapidly increasing. For example, Page et al. [30], using very similar methodology and SR definitions, identified 682 systematic reviews indexed in MEDLINE during a single month in 2014; these numbers would suggest an approximately three-fold increase over a five-year period. This is further supported by Alabousi et al., who reported that publication rates for diagnostic imaging-related SRs increased more than ten-fold between 2000 and 2016 [35].

In addition to reporting on the current rate of SR publication, this work also provided valuable information on the distribution of citation set sizes. During full-text screening, the SRs were placed into five categories based on the post-deduplication citation set size. While just over half of the studies had citation set sizes at or below 1,000, we also found that a fair proportion (10%) had large citation set sizes (> 5,000). Given the number of SRs performed each year, many thousands of SRs are conducted annually with citation set sizes in excess of 5,000 or 10,000. As the growth rate of published scientific literature increases (recent estimates range from 3.5 to 9% per year [36, 37]), initial citation set sizes are likely to continue to grow unless countered by improvements in citation coding and search methodology. Investigative teams will need to find means of accommodating large citation sets in an era when timely medical and healthcare policy decisions are needed. Strategies to accommodate large SRs may include streamlining the topic and research questions, or focusing search strategies using validated filters that limit certain study types or topics [38].

A logical response to increased workload is expansion of team size, and we sought to explore the relationship between citation set size and authorship. Studies are now being authored by dozens, hundreds, and sometimes even more than 1,000 authors, a phenomenon known as “hyperauthorship” [39]. This trend appears to be most apparent in the physical sciences, where open data and the need for collaboration have prompted authors to combine resources [39]. The number of authors per publication in the biomedical literature also appears to be growing [40]. The need for a multidisciplinary approach and the desire to collaborate to increase the chances of publication may be driving factors behind longer author lists [41]. Despite these observations from the scientific literature, this study did not find evidence of a relationship between citation set size and author number. Performing the analysis using the reported number of screeners, as opposed to author number, again did not suggest a relationship between screening team size and citation set size category. However, a more detailed inspection demonstrated that SRs with larger citation set sizes were less likely to use two screeners, with an increase in the proportion of SRs using a single-screener approach.

In addition to assessing total screeners by citation set size, we also evaluated the relationship between citation set size and the use of adjunctive or non-standard screening approaches. While not statistically significant, our analysis did suggest that for every doubling of citation set size there was a decrease in the odds of using gold-standard methodology, implying that adherence to best practices declines with increasing workload. Importantly, our ability to evaluate differences was potentially limited by the observation that more SRs failed to report their screening methodology (16%) than indicated using a single assessment per citation (10%). Single-reviewer screening has been shown to miss substantially more studies than conventional double screening [42].

In addition to evaluating whether investigators performing large reviews adapt through team size, we also evaluated how frequently other methodologies intended to reduce workload were applied. Over the past two decades, there has been considerable interest in the ability of natural language processing to assist with abstract screening, and available evidence suggests that, with proper application, the human screening burden can be reduced and time saved [43, 44]. Yet, despite a significant number of publications on the topic and incorporation into a number of common screening platforms, only one of the 259 SRs reported its application. That SR used Cochrane’s RCT Classifier, a machine learning system that can distinguish randomized (and quasi-randomized) trials from non-RCTs [34]. The surprisingly low uptake of computer-automated screening may relate to a lack of trust in the technology and uncertainty over implementation [45], although our study was not designed to answer this question. Alternatively, SRs may not be reliably reporting the use of automation. Finally, given that the SRs in this sample had screening performed in 2019 or earlier, it is possible that a larger proportion of more recently completed or ongoing SRs are utilizing adjunct screening methods.

A second and newer methodological approach intended to reduce the burden on study leads and allow for faster systematic review completion is crowdsourcing. This approach has been validated by our group [7, 46] and others [6, 47,48,49], demonstrating that large groups of individuals are willing to assist with systematic review tasks. With minimal training, these groups can accurately perform citation screening with sensitivity approaching 100% and work saved in excess of 50%. While crowdsourcing was identified in only two of the 259 SRs (0.8%), it is important to acknowledge that the method has only recently been validated and only a few platforms are available (Cochrane Crowd, insightScope). Both identified studies were Cochrane reviews, one being an update of the original SR [32, 33]. The “crowd” in the original review was made up of students and provided an initial screen of search results to decide whether citations were described as randomized or quasi-randomized. This approach removed 81% of the citations retrieved (N = 4,742/5,832).

Interestingly, a number of studies have been published suggesting, and even validating, a hybrid of crowdsourcing and machine learning [50, 51]. As crowdsourcing, with or without machine learning, becomes more mainstream, investigators and organizations will have to consider what constitutes a reasonable workload for a crowd member, particularly one seeking to join the team who might eventually meet criteria for authorship. Our findings on average workload provide a starting point for these investigators, indicating that, on average, screeners in an SR assessed approximately 1,000 citations at title/abstract and 50 at full text. Dividing the work in this fashion would help ensure that screening and extraction occur in a timely manner and may avoid the deterioration in screening accuracy that is anticipated with excessive workloads.

A limitation of this study is that the number of authors of an SR may not accurately represent the number of people contributing to screening and data extraction; it is common for only a subset of the listed authors to contribute to this work. While we attempted to account for this through a secondary analysis using the number of stated screeners, misclassification is possible, as some study groups may not have described individuals who did not meet authorship criteria. Secondly, we did not assess the potential role of time (i.e. how long it took to screen through each level) as a factor in citation set size and screening methodology, given that most SRs did not report these data. This points to a further limitation, namely incomplete reporting of SR methodology [52], which could have led us to underestimate the number of SRs using gold-standard screening methods or alternative methods. Further, the small sample of SRs selected from one month in a single health-related database, and uncertainty about the strength of association between alternative screening methodologies and citation set size, may have influenced our findings. However, our aim was to map SR screening methods within a relevant scope of practice, and these findings are likely an accurate representation of the health-based SR literature.

Overall, the evidence points to thousands of SRs with large citation sets being published each year. Increasing workload appears to decrease the likelihood that teams will use gold-standard approaches to citation screening, which may lead to authors missing important studies in their included citation set. To help manage the rising volume of published literature, we recommend that teams performing large SRs consider increasing their team size to ensure that gold-standard methods are adhered to and results are disseminated in a timely manner.