Background

Systematic reviews (SRs) are evidence syntheses that serve as valuable support for decision-making in healthcare [1, 2] and social sciences [3, 4]. They can be defined as summaries of studies addressing a specific topic that use reproducible analytical methods to collect secondary data and analyse them, applying systematic and explicit methods to identify, select, and critically appraise relevant studies, and to extract and summarize the data [5]. SRs can be used to reduce biases and provide evidence to stakeholders such as policymakers, decision-makers, practitioners, researchers, academia, the public, and citizens [6].

Since the 1990s, when organizations like Cochrane, the Campbell Collaboration, and the Joanna Briggs Institute (JBI) emerged [1, 7, 8], there has been an increase in both the number of SRs and their use to inform policy and practice [9]. Because tremendous resources are needed to produce and update SRs, there is a need to automate the process as much as possible and to deploy the workforce and resources more efficiently [10, 11]. This study was conceptualized within the framework of EVidence-Based RESearch (EVBRES) [12], a 4-year (2018–2022) EU-funded COST Action (CA17117) with participants from over 40 countries, which aims to encourage researchers and other stakeholders to use an Evidence-Based Research (EBR) approach while carrying out and supporting clinical research, thus avoiding redundant research. The Action was extended until 16 April 2023. As part of the research agenda of a working group (WG3) focusing on improving efficiency in producing and updating systematic reviews [10, 13, 14], and based on the results of previous activities, we designed a Delphi study (DS) to reach agreement on prioritising the most promising areas and methods to improve the efficiency of producing and updating SRs. DSs offer a flexible approach to obtaining information on how best to allocate professional resources such as knowledge and expertise [15, 16]. This is especially important when agreement on statements is needed based on the best available evidence. In this study, we pursued expert consensus regarding the most promising areas and methods to improve the efficient production and updating of SRs.

Methods

A DS was employed to gain expert insight into which areas and methods need to be prioritized to improve efficiency in producing and updating SRs. The DS was conducted exclusively online, since face-to-face interaction was neither preferred [15] nor achievable due to Coronavirus disease (COVID-19) pandemic travel restrictions. The Recommendations for Conducting and Reporting Delphi Studies (CREDES) [17] were followed throughout the manuscript, excluding parts that were beyond the scope of this project (i.e., external validation and dissemination). Participants were provided with a description of the overall aim of EVBRES and the specific objective of the DS.

Usually, data from the first round of a DS are solely qualitative [18], but as this survey was informed by a scoping review [10] and a qualitative study [19], quantitative techniques could be applied as early as round 1.

Round 1 of the DS was launched on November 15th, 2021, and two reminders were sent two weeks apart; round 1 ended on December 14th, 2021. Round 2 was launched on April 11th, 2022, and three reminders were sent two weeks apart; the survey ended on May 25th, 2022. Since our study was thoroughly informed by previous EVBRES research [10, 11, 13], agreement was expected to be reached after two rounds. As agreement was indeed reached after round 2, there was no need for a round 3, and we closed the DS.

Participants

As the response characteristics of a small expert panel in a well-defined knowledge area are considered reliable in light of augmented sampling [20], 55 experienced authors of SRs were invited using a combination of two non-probabilistic sampling methods, both widely used in Delphi studies [21]: a purposive sampling technique followed by criterion sampling, in which members of EVBRES were asked to provide us with personal contacts who met the inclusion criteria for participation. Potential participants were contacted by email from the EVBRES member and informed of what they would be asked to do, how much time they would be expected to contribute and when, and what use would be made of the information they provided. The inclusion criteria were (a) having participated in at least 3 SRs as an author; and (b) having been a first/last/senior/mentor author on at least one SR. All suggested participants’ names were searched using Google Scholar [22] prior to sending invitations, in order to confirm that they satisfied the inclusion criteria.

Delphi survey design and procedures

Survey design

To design the surveys, seven formulation and review sessions were conducted with at least two research team members present (one permanent and two alternating research team members). LimeSurvey Professional (LimeSurvey Cloud 3.27.24, LimeSurvey GmbH, Hamburg, Germany) was used to design the online survey.

Round 1 of the DS included sections presented in Table 1.

Table 1 Organization scheme of DS questionnaire

The study was piloted among 10 EVBRES participants on June 5th, 2021, and some minor adjustments were made to make the questionnaire more user-friendly and time-efficient.

In round 2, areas where agreement had not been reached in round 1 were further explored by testing statements with required 5-point Likert answers. These included 2 statements on “snowballing” (“development of better tools is needed” and “automation can be helpful in this area”), 4 statements on “GRADE-ing” (“this step is methodologically well developed”, “potential for automation is low in this step”, “standardization of GRADE assessment may be helpful”, “this step is relatively low resource task”), and 2 statements on “deduplication” (“automation is advanced in this step”, “there is scope for improvement”). The theme “meta-analyzing” was not explored further, since participants commented that they considered it to refer to “synthesising data”.

Data analysis

For the prioritisation exercise, quantitative answers on a 5-point Likert scale were analysed. When analysing the prioritization questions, consensus was considered reached when two-thirds or more (66.7%) of the participants’ responses fell within a given score range (“agree” and “strongly agree”, or “disagree” and “strongly disagree”). Since our sample was small and skewed, the interquartile range (IQR) was chosen as the measure of variability and the median as the measure of central tendency, in order to identify the range within which most of the results lie: the smaller these values, the less dispersed the results.
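
As an illustration of this decision rule, the following minimal Python sketch (pandas assumed; the function name and the scores below are hypothetical, not our analysis code) computes the agreement proportions, median, and IQR for one Likert item:

```python
# Consensus is declared when >= 66.7% of responses fall in the top two
# ("agree"/"strongly agree") or bottom two ("disagree"/"strongly disagree")
# Likert categories, as described in the Methods above.
import pandas as pd

def consensus(responses: pd.Series, threshold: float = 2 / 3) -> dict:
    """responses: Likert scores coded 1 (strongly disagree) to 5 (strongly agree)."""
    n = len(responses)
    prop_agree = (responses >= 4).sum() / n        # "agree" + "strongly agree"
    prop_disagree = (responses <= 2).sum() / n     # "disagree" + "strongly disagree"
    return {
        "median": responses.median(),
        "IQR": responses.quantile(0.75) - responses.quantile(0.25),
        "consensus": prop_agree >= threshold or prop_disagree >= threshold,
    }

# Example: 30 of 33 participants scoring 4-5 exceeds the 66.7% threshold.
scores = pd.Series([5] * 17 + [4] * 13 + [3] * 2 + [2])
print(consensus(scores))
```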

A reflexive thematic analysis approach was used to analyse the qualitative data [24, 25], since the theoretical freedom of this approach allows flexibility. Following familiarization with the data through transcription, reading, and re-reading, initial codes were generated and gathered into potential themes. After reviewing the themes across the entire data set, a thematic map was developed, and clear definitions and names for each theme were identified and further refined through ongoing analysis. The data were coded by one author (VT) using an inductive approach, and themes were developed at a semantic level. The concepts of data or code saturation were not used in this study because they are not consistent with the values and assumptions of reflexive thematic analysis [24]. A taxonomy was developed manually and independently, in parallel, by another researcher (MMK), and the choice of taxonomy was finalized by a third researcher (AM). Minimal grammar and spelling corrections were made to participants’ answers by a researcher who is a native English speaker (MEE). In round 2 of the DS, a taxonomy based on the responses from the round 1 survey was presented to participants.

Data gathered with the online survey system were exported to Microsoft Excel- and SPSS 16+-compatible files and then analysed with NVivo 12 Plus for Windows (QSR International Pty Ltd., London, UK), Microsoft Excel 2013 (Microsoft Corporation, Redmond, WA, USA) and IBM SPSS Statistics for Windows, version 28.0 (IBM Corp., Armonk, NY, USA).

Results

A total of 55 participants were invited, and 39 agreed to participate. In round 1, 33 completed the survey in its entirety, 3 participants opted out, and 3 did not provide complete responses. These 33 participants were invited to round 2, and 24 of them completed it. Since consensus was reached after sending 3 reminders two weeks apart, the survey was ended.

Participants’ characteristics

Of the 33 participants who completed the questionnaire in round 1, 14 identified as male, 17 as female, and 2 preferred not to specify their gender. All age groups were represented, but most participants were aged 41–50 years (n = 13) or 31–40 years (n = 9). The average experience in conducting SRs was 13.33 years, indicating that our participants were experts in the field (Fig. 1).

Fig. 1
Demographic characteristics of participants. Since there were no extreme outlier values, the mean and the standard deviation were selected as the measures of central tendency and dispersion. *If a respondent's answer was “more than x”, “around x”, etc., the value was calculated as x

Descriptive data analysis

For the general Likert-scale questions in round 1, regarding which areas and methods of the SR process are most time/effort/resource-intensive and should be prioritized in future research on methods development and automation, consensus was reached (at least 66.7% of participants agreed or strongly agreed that the topic should be prioritized) for 13 out of the 19 topics, as shown in Fig. 2. Data extraction was prioritized by the majority as an area where research and automation can help reduce the intensity of resource use: 90% (30 out of 33) “strongly agreed” or “agreed”, which was also emphasized in the qualitative results. Screening abstracts received the most “strongly agree” answers, 51% (17 out of 33), and with “agree” added it ranked third, at 82% (27 out of 33). A great majority of participants (27 out of 33; 82%) also prioritized literature searching (12 “strongly agree” and 15 “agree”), obtaining full texts (11 “strongly agree” and 16 “agree”), and updating the SR (10 “strongly agree” and 17 “agree”).

Fig. 2
Results of round 1 of the DS. Participants’ views on the degree to which each step of SR production and updating should be prioritized concerning methods development and automation

As depicted in Fig. 3, during round 2 participants reached consensus on all 8 tested statements, with at least 66.7% either agreeing/strongly agreeing or disagreeing/strongly disagreeing. The most supported statements from round 2 concerned snowballing and GRADE-ing: 83% (20 out of 24) of participants reached consensus (10 “strongly agree”, 10 “agree”) that automation can be helpful in snowballing, and 79% (19 out of 24) reached consensus (9 “strongly agree”, 10 “agree”) that GRADE-ing is a complex task that requires human judgement and has low potential for automation. A similar consensus was reached on the statements that GRADE-ing is methodologically well developed (9 “strongly agree” and 9 “agree”) and that development of better tools for snowballing is needed (6 “strongly agree” and 13 “agree”).

Fig. 3
Results of round 2 of the DS. Participants’ views on snowballing (development of better tools, need for automation), GRADE-ing (complex task that requires human judgement, standardization, resource use, methodologically developed area), and deduplication (scope for improvement, advancement of automation)

Qualitative data analysis

The following three main themes were developed from the qualitative data gathered in the DS: (1) the most important tools and approaches, (2) different areas and methods require different levels of automation, and (3) prioritization concerning future research of particular methods is crucial to improve efficiency (Fig. 4).

Fig. 4
Thematic map of the most promising areas and methods to improve the efficient production of SRs. The reflexive thematic analysis of the qualitative data identified 3 main themes and 6 subthemes

The first theme (Table 2) developed in our qualitative study concerns what are considered the most important tools and approaches used to produce and update SRs. Participants mostly pointed out tools and approaches that are already available on the market but are often not known or used by most researchers. However, they also recommended some tools and approaches that need to be developed to improve the efficiency of SRs. As a result, two sub-themes were developed: “Existing tools and approaches that can improve efficiency” and “Tools and approaches that need to be developed”.

Table 2 Qualitative analysis—themes, subthemes, and participants’ statements

For areas such as “Formulating the review question” and “Writing the protocol”, the participants recommended SUMARI, the IEBHC Review Wizard, and RevMan. They also pointed out several other tools developed by the Institute for Evidence-Based Healthcare at Bond University in Australia: the Systematic Review Accelerator (SRA), PRISMA for Abstracts, TIDieR, shared decision making, EBM teaching resources, the CriSTAL Tool, and MASCoT.

Regarding de-duplicating and screening abstracts and full texts, the experts suggested that tools such as EPPI-Reviewer, Covidence, and DistillerSR can improve the efficiency of SRs.

Some participants pointed out that handy tools already exist, such as “Citation chaser” (https://estech.shinyapps.io/citationchaser/), which can be used for citation tracking.

The participants also pointed out that the quality of available tools for translating non-English studies has increased dramatically over the years, suggesting that Google Translate and DeepL seem to be the most useful tools in this area.

Additional training on how to use tools and training in general was seen as something that could improve the overall quality of SRs.

The participants suggested various tools and approaches that need to be developed to increase the efficiency of SRs. Some participants suggested the need for repositories for all registered and published reviews that can be used to find previous SRs and share translations or extracted data across reviews.

One possible approach to improving SRs’ efficiency would be a common search strategy for all databases. Automation of screening abstracts and full texts was considered one of the most needed tools in SRs. Development of tools for automating data extraction could be highly beneficial, so future research in that area is worthwhile. Some participants also mentioned the need for further development of machine learning to improve the efficiency of SRs, especially for finding full texts.
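
As a hedged illustration of the kind of screening automation participants called for (not a tool any participant named), the following Python sketch trains a classifier on already-screened records and ranks unscreened abstracts by predicted relevance; the scikit-learn pipeline and all toy data are our assumptions for illustration:

```python
# Semi-automated abstract screening: a classifier learns from reviewer
# include/exclude decisions and prioritizes the remaining records so that
# likely-relevant abstracts are screened first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

screened_abstracts = [
    "RCT of exercise therapy for chronic low back pain",
    "Protocol for a cohort study of dietary sodium",
    "Meta-analysis of statins for primary prevention",
    "Narrative essay on hospital architecture",
]
labels = [1, 0, 1, 0]  # 1 = include, 0 = exclude (prior reviewer decisions)
unscreened = [
    "Randomized trial of yoga for back pain",
    "History of the stethoscope",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(screened_abstracts)
model = LogisticRegression().fit(X, labels)

# Rank unscreened records by predicted probability of inclusion.
probs = model.predict_proba(vectorizer.transform(unscreened))[:, 1]
for text, p in sorted(zip(unscreened, probs), key=lambda t: -t[1]):
    print(f"{p:.2f}  {text}")
```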

Development of more advanced tools for the translation of non-English studies was proposed by a few participants.

Automation was one of the most frequently mentioned topics in round 1 of our DS. Some participants suggested that particular areas and methods of SRs have a limited potential for automation, and that some parts of the SR process are impossible to automate.

On the other hand, some participants mentioned areas and methods with a strong potential for automation. The components most frequently mentioned with a strong potential for automation are searching and screening.

Since the participants most frequently described some areas and methods as more suitable for automation than others, two additional sub-themes were also developed: “Full automation is not suitable for areas and methods that need complex human judgment” and “Repetitive tasks and searching have strong potential for automation”.

Some participants believed that the components of SRs that require complex human judgement are not suitable for automation. One participant emphasized that automating judgements would not be desirable even if artificial intelligence allowed it.

The participants mostly viewed areas and methods connected with writing as less suitable for automation, specifically writing up the protocol or manuscript and formulating the review question.

Although some experts considered activities connected with searching as most promising for automation, others suggested that humans still need to make decisions in those areas.

Screening full texts was identified as a part of searching that is not suitable for automation, especially in the social sciences, where inconsistency in how studies are reported is common.

GRADE-ing or going from evidence to decision is another process that was seen as dependent upon human judgements.

The participants also pointed out that synthesizing data is a complex process, which is why they remain sceptical about its automation.

However, other participants suggested that some areas and methods that need complex human judgments could benefit from semi-automation in which a combination of automation and human judgment will be utilized in the decision-making process.

The participants discussed areas and methods in which automation could be worthwhile. They concluded that repetitive tasks and activities connected with searching have the most substantial potential for automation.

Some participants mentioned in their comments that some areas are very time-consuming, so additional research and development of new tools would be beneficial for improving the efficiency of SR.

Others pointed out that some areas and methods are already developed or automated, which is why they are less important for future research.

Therefore, a third theme emerged: prioritization concerning future research of particular methods to improve the efficiency of SRs in general. Two sub-themes were developed within this theme: “Time-consuming methods require a higher priority in future research” and “Already available and reasonably developed methods should not be prioritized”.

Most participants believed that the SR process stands to benefit the most from future research aimed at improving the time-consuming components.

Extracting data was seen as one of the most time-consuming parts of SR, so the participants especially emphasised prioritising that area in future research.

The participants mentioned several methods that are reasonably developed and already available on the market.

They emphasized that already existing methods and areas should not be prioritized in future research since it seems unlikely to produce additional benefits for SR efficiency.

One participant mentioned that there is a need to increase the integration of different study designs into producing evidence.

Although many participants stated that good enough translation tools are available, the language barrier still represents a huge obstacle in SR production.

Many participants agreed that a lot of automation has been achieved in the area of SR production and updating, but there are still tasks that cannot and should not be automated, since they rely on complex human judgements.

Several participants emphasized that the methodology of SR production is fairly advanced.

Discussion

The main finding of this DS was that extracting data, literature searching, and screening abstracts are the most important areas to prioritize in future research when developing SR methods and tools.

There was consensus among participants that “snowballing” is a relatively low-resource task, that development of better tools is needed, and that automation can be helpful in this area. One participant (P32) mentioned an efficient tool with great potential to successfully automate this step: Citationchaser. Some software solutions that support basic forms of snowballing/citation chasing are already available. Citationchaser [26] is an open-source, easy-to-use tool for quick backward and forward citation chasing, developed by Haddaway et al. [27], and seems to be the most advanced in the field. It generates standardized output files that can be quickly and effectively combined with the results of bibliographic database searches to reduce duplication and increase the breadth of the pool of potentially pertinent records that can be screened within an evidence synthesis [27]. The fact that only one participant in our highly expert panel was aware of this tool suggests that it needs to be further developed and popularized.
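
To make the backward/forward distinction concrete, here is a hedged Python sketch of citation chasing against the public OpenAlex API; this is not Citationchaser’s own interface (which, to our understanding, queries Lens.org), and the work ID below is only an example:

```python
# Backward chasing = references cited BY a work; forward chasing = works
# that cite it. Requires the `requests` package.
import requests

BASE = "https://api.openalex.org"
work_id = "W2741809807"  # example OpenAlex work ID

# Backward: the work's own reference list.
work = requests.get(f"{BASE}/works/{work_id}").json()
backward = work.get("referenced_works", [])

# Forward: works whose reference lists include this work.
citing = requests.get(f"{BASE}/works", params={"filter": f"cites:{work_id}"}).json()
forward = [w["id"] for w in citing.get("results", [])]

print(f"{len(backward)} references, {len(forward)} citing works (first page)")
```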

Participants also agreed that the potential for automating “GRADE-ing” is low, since it is a complex task that requires human judgement; that this is methodologically a very well-developed area; and that standardization of GRADE assessments may be helpful. Regarding “deduplication”, experts agreed that automation is already fairly advanced, but there is room for improvement.
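
For illustration, the rule-based core of such deduplication might look like the following Python sketch (pandas assumed; the records, normalization rules, and matching key are simplified assumptions, whereas production tools typically add fuzzy matching on authors, year, and journal):

```python
# Rule-based deduplication of search results: normalize DOIs and titles,
# then keep the first record per normalized key.
import pandas as pd

records = pd.DataFrame({
    "title": ["Exercise for back pain.", "Exercise for Back Pain", "Statins review"],
    "doi": ["10.1000/abc", "10.1000/ABC", None],
})

records["doi_norm"] = records["doi"].str.lower().str.strip()
records["title_norm"] = (
    records["title"].str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True).str.strip()
)

# Match on normalized DOI when present, otherwise on normalized title.
key = records["doi_norm"].fillna(records["title_norm"])
deduplicated = records.loc[~key.duplicated()]
print(deduplicated[["title", "doi"]])
```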

Regarding possible areas of improvement in methodology, several participants emphasized that automation is not a panacea and has to be used to “prevent from having to do rote, complicated repetitive tasks”; as one put it, “automation is a tool…to have more difference…it is not a replacement for human judgement”.

Including non-English studies in SRs has been recognized as important to avoid bias, although reviewers commonly report that including them is costly and time-consuming and have previously been reluctant to grapple with the language barriers [28]. As one participant noted, “Many researchers have to restrict the search to English language because of resource constraints”. Participants in our DS showed awareness of, and willingness to incorporate, non-English evidence in their SRs. In their comments, participants emphasized the importance of this issue, especially in the qualitative area: “This would be a game changer…qualitative researchers could be confident the ‘meaning’ of text was captured”. Many participants emphasized that the quality of available language translation tools has increased dramatically over the years, specifically pointing out Google Translate and DeepL, which is promising and will hopefully extend into the qualitative field of research in the future.
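
As a small illustration of the kind of machine translation participants referred to, the following sketch uses the official DeepL Python client (`pip install deepl`); the authentication key and the sample sentence are placeholders:

```python
# Translate a non-English passage into English with DeepL's official client.
import deepl

translator = deepl.Translator("YOUR_AUTH_KEY")  # placeholder key, not a real credential
result = translator.translate_text(
    "Die Studie untersuchte die Wirksamkeit der Intervention.",
    target_lang="EN-US",
)
print(result.text)  # e.g. "The study examined the effectiveness of the intervention."
```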

Limitations

The limitations of the study stem from the very nature of the research method: one can always argue that there were other “expert” SR authors who could have answered the survey better. Efforts were made to select experts who were relatively impartial yet interested in the research topic and willing to spare their precious time. The EVBRES collective knowledge of the SR production landscape was an excellent basis for handpicking the best available sample, with members serving as effective gatekeepers [18]. In fact, participants demonstrated vast experience (totalling 440 years of conducting SRs) at a relatively young age (most panellists were 41–50 years old) (Fig. 1). Most of the participants (26/33) work in academia, which is understandable, as researchers from that sector are the most efficient producers of SRs. Participants had published SRs in various settings: all (n = 33) had published in various journals, and the majority (n = 28) had published Cochrane SRs. Another limitation is due to the nature of DSs: the Delphi method has been criticized because it does not allow participants to discuss the issues raised and gives them no opportunity to elaborate on their views, with the potential risk that greater reliance is placed on the results than might be warranted [18]. DSs also have additional limitations, such as not allowing the same level of interaction or fast turnaround that is possible, for example, in a focus group. However, this also presents a strength: because participants do not meet each other face to face, they can present and react to ideas unbiased by the identities and pressures of others [29].

Conclusions

The participants recommended tools and approaches that can improve the efficiency of SRs. Data extraction was prioritized by the majority of participants as the area that most needs research and methods development, and where research and automation can help reduce the intensity of resource use. They specified that some areas and methods are more suitable for automation than others, e.g., snowballing, where development of tools is needed. One open-source tool, Citationchaser, has high potential to deliver significant time savings in the SR production process. GRADE-ing was identified as an area that is methodologically well developed, and as a complex task with the lowest potential for automation, as it requires a high level of human judgement.

As expected, some areas of SR automation are already well developed and less critical for future research (e.g., GRADE-ing), while time-consuming tasks such as data extraction would benefit from additional research and the development of new tools.