Introduction

Laparoscopic anti-reflux surgery is considered the standard treatment for refractory gastro-oesophageal reflux disease (GORD), conferring a shorter length of hospital stay, fewer complications and quicker return to baseline function than the equivalent open technique [1,2,3]. Robot-assisted anti-reflux surgery (RA-ARS) is a recent surgical innovation which may offer improved ergonomics, dexterity, and three-dimensional vision, all of which are particularly useful around the oesophageal hiatus [4,5,6]. Whether these theoretical advantages translate into better patient outcomes remains unclear. The Cumberlege report (also known as the ‘First do no harm’ report) recommended that innovative procedures (such as RA-ARS) should be robustly evaluated before being widely adopted, with transparent communication with patients about the risks, benefits and alternatives [7]. Whilst RA-ARS has been adopted by certain NHS service providers, it remains unclear whether it has been robustly evaluated.

Evaluating surgical innovations is challenging due to idiosyncrasies related to surgical practice [8]. New procedures are modified case-to-case, and may continue to be modified even after becoming widely adopted, leading to uncertainty about the optimal timing of randomised controlled trials (RCTs). Operator learning curves and a lack of standard outcome measures create additional complexity [9]. To address these challenges, the IDEAL (Idea, Development, Exploration, Assessment, and Long-term follow-up) Collaboration proposed a staged approach for evaluating and reporting surgical innovations [10]. The IDEAL framework progresses from Stage 1 (first-in-human) to Stage 4 (Long-term follow-up of established techniques), and makes recommendations about methodology, governance factors, ethical factors, and the quality of reporting for each stage. The level of understanding of these 2009 recommendations was found to be low [11], leading to subsequent publication of practical guidance [9], updates [12] and reporting guideline checklists [13] to further improve the quality of reporting in surgical innovation.

High quality reporting of surgical innovation has important benefits. It supports shared learning between innovators, facilitating diffusion of successful techniques as well as timely elimination of poor techniques, thereby improving research efficiency and avoiding the repeated exposure of patients to harmful interventions [14]. High quality reporting also permits better comparison between studies [9], leading to meaningful meta-analyses that may inform clinical practice guidelines. RA-ARS was first described in 2001 [15], with subsequent increases in both volume of publications and technique adoption. Despite the quantity of research, the quality of reporting in these studies has never to our knowledge been evaluated. The aim of this study was therefore to evaluate the standard to which RA-ARS has been reported during its evolution, in relation to the IDEAL framework. The study did not aim to examine effectiveness or efficacy of RA-ARS.

Methods

The systematic review was informed by previously published studies [16, 17], and is summarised below.

Search strategy

Systematic searches using terms for ‘anti-reflux’ and ‘robotic surgery’ were performed in OVID (MEDLINE, Embase), Cochrane Library and Web of Science (SCI-EXPANDED, ESCI) from inception to June 2020 (Supplementary Table 1).

Study eligibility

All primary research studies pertaining to RA-ARS in adults with symptomatic GORD were included. Studies relating to paraoesophageal or giant hernias were excluded because of differences in operative techniques and disease and complication profiles. Studies where the indication for surgical intervention was not primarily for symptoms of GORD (e.g. Roux-en-Y gastric bypass with concomitant hiatus hernia repair) were excluded. Studies reporting combined interventions in which the outcomes could not be separated were excluded (e.g. those describing a centre’s experience of robotic surgery across multiple specialties and procedures). Conference abstracts and non-English language studies were excluded [18].

Study selection

After de-duplication, abstracts were screened by two independent reviewers (MH and HR). Full texts were retrieved and screened for eligibility in the same manner. Conflicts at both stages were resolved by discussion involving a third independent reviewer (NB), providing a final list of included papers.
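
The dual-screening step can be illustrated with a minimal sketch that flags the records on which two reviewers' decisions differ, so that they can be referred for discussion with the third reviewer. The function name, the record identifiers and the decisions are hypothetical and do not come from the review's own screening software.

```python
def find_conflicts(reviewer_a, reviewer_b):
    """Return record IDs on which two reviewers' include/exclude decisions differ.

    Each argument maps a record ID to True (include) or False (exclude).
    Only records screened by both reviewers are compared.
    """
    shared = reviewer_a.keys() & reviewer_b.keys()
    return sorted(rid for rid in shared if reviewer_a[rid] != reviewer_b[rid])


# Made-up decisions for illustration only.
decisions_a = {"rec1": True, "rec2": False, "rec3": True}
decisions_b = {"rec1": True, "rec2": True, "rec3": True}
conflicts = find_conflicts(decisions_a, decisions_b)
```

In this made-up example only `rec2` would be referred for discussion.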

Data extraction

Data extraction was undertaken by two independent reviewers using a purpose-built online tool. After appropriate training, primary data extraction was performed by a member of the ‘RoboSurg Collaborative’ (see participating investigators). Data were then verified by a senior member of the research team. Disagreements were discussed with a third independent reviewer where necessary (NB). Data categories were informed by IDEAL guidelines and a previously published study [17].

General study characteristics and IDEAL stage

Data were collected on the publication year, country of origin, study design, whether prospective or retrospective, number of participants, and number and type of included centres (e.g. specialist, general). Information about the intervention (and, where applicable, comparator group(s)) was noted. Where reported, the IDEAL stage of the study was recorded. Studies that did not provide this information were classified into IDEAL stages by two researchers (MH and HR), using an algorithm created by the IDEAL collaboration [19]. The first case report [4] was considered to be the stage 1 first-in-human study. Bias in the included RCTs was assessed using the Cochrane risk of bias (ROB2) tool [20].

Patient selection and demographics

All reported inclusion and exclusion criteria were extracted, as well as statements about what happened to patients not meeting the inclusion criteria. All reported demographics of the included patients (e.g. age, sex and co-morbidities) were also recorded.

Governance and ethical factors

Statements confirming institutional review board (IRB) or ethics committee approval were documented. Reports of patient consent, including those specifically relating to the innovative nature of RA-ARS, were recorded verbatim. Funding and conflicts of interest (COI) declarations were noted.

Surgeon expertise and training

Any prespecified requirements for study participation, such as experience or training courses, were recorded. The number of surgeons participating in each study was documented along with their respective grades (e.g. trainee/junior or consultant/attending). Information regarding the reporting or measurement of surgeons’ learning curves was extracted verbatim.

Outcome selection, measurement and reporting

Individual outcomes from each study were extracted verbatim and coded by two independent reviewers (MH and HR) into one of seven pre-determined domains (technical outcomes, complications, investigations, persistence of symptoms, patient-reported outcomes, surgeon-reported outcomes, and health economic outcomes—Supplementary Table 2). The total number of distinct outcomes across all studies and the total number of outcomes reported in each domain were recorded. Outcomes with the same meaning that were worded differently, for example ‘length of stay’ and ‘duration of hospitalisation’, were not counted as distinct. Where available, the follow-up period for each recorded outcome was documented. If a core outcome set (i.e. an agreed minimum set of outcomes that should be reported in all clinical trials of a specific disease [21]) was used, this was noted.
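
The counting rule described above can be sketched as follows: verbatim wordings are mapped to a single canonical label before distinct outcomes are counted and tallied by domain. This is an illustration only. The synonym map, the domain assignments, the example records and the function name are hypothetical and are not taken from the extraction tool used in this review.

```python
from collections import Counter

# Hypothetical synonym map: verbatim wording -> canonical outcome label.
SYNONYMS = {
    "duration of hospitalisation": "length of stay",
}


def summarise_outcomes(records):
    """Summarise (study_id, verbatim_outcome, domain) tuples.

    Returns the number of distinct outcomes, the number of distinct outcomes
    reported by only one study, and a count of distinct outcomes per domain.
    """
    studies_per_outcome = {}
    domain_of = {}
    for study_id, verbatim, domain in records:
        key = verbatim.strip().lower()
        label = SYNONYMS.get(key, key)  # differently worded synonyms collapse here
        studies_per_outcome.setdefault(label, set()).add(study_id)
        domain_of[label] = domain
    distinct = len(studies_per_outcome)
    reported_once = sum(1 for s in studies_per_outcome.values() if len(s) == 1)
    per_domain = Counter(domain_of.values())
    return distinct, reported_once, per_domain


# Made-up records for illustration only (domain assignments are assumptions).
example = [
    ("A", "Length of stay", "technical outcomes"),
    ("B", "Duration of hospitalisation", "technical outcomes"),
    ("B", "Dysphagia", "persistence of symptoms"),
]
distinct, reported_once, per_domain = summarise_outcomes(example)
```

Here the two wordings for hospital stay count as one outcome, so the made-up example contains two distinct outcomes, one of which is reported by a single study.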

Data synthesis

Data were summarised in a narrative synthesis and descriptive statistics were used where appropriate. Meta-analyses were not performed as we aimed to examine the reporting of RA-ARS, rather than its efficacy or effectiveness. Sequential progression of data categories through the IDEAL stages was displayed graphically where appropriate.
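
For illustration, the descriptive statistics used throughout the Results (counts with percentages, and ranges with medians) could be produced with helpers such as those sketched below. The function names are hypothetical, and rounding percentages to whole numbers is an assumption based on how percentages are presented in this paper.

```python
from statistics import median


def n_percent(count, total):
    """Format a count as 'n (p%)', with the percentage rounded to a whole number."""
    return f"{count} ({round(100 * count / total)}%)"


def range_median(values):
    """Return (minimum, maximum, median) for a list of numbers."""
    return min(values), max(values), median(values)


# 17 of 23 studies, as in 'Most studies (n = 17, 74%)' in the Results.
small_studies = n_percent(17, 23)

# Made-up sample sizes for illustration only.
low, high, mid = range_median([10, 25, 40])
```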

Results

Included studies

A total of 854 abstracts were screened and 56 full texts were assessed for eligibility. Of these, 23 studies were included in the analysis [4,5,6, 15, 22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40] (Fig. 1).

Fig. 1 PRISMA diagram showing selection of articles for review

General study characteristics and IDEAL stage

A total of 13,506 participants were included across the 23 studies (range, 1–12,079; median, 44; Table 1). Of these, 741 participants underwent RA-ARS, 10,597 underwent laparoscopic anti-reflux surgery (L-ARS) and 2,168 underwent an open procedure. Most studies (n = 17, 74%) included fewer than 100 participants. Studies were published between 2001 and 2019 and included two case reports (9%), five case series (22%), 10 comparative cohort studies (43%) and six randomised controlled trials (RCTs, 26%). Of the 18 (78%) studies reporting the temporality of data collection, 10 (43%) were prospective and 8 (35%) were retrospective. Studies were most commonly single-centre (n = 11, 48%) and undertaken in the USA (n = 7, 30%), although 9 (39%) did not provide this information. The type of participating centre was generally omitted (n = 18, 78%).

Table 1 Characteristics of included studies

No study reported an IDEAL stage. In addition to the first published case report [4], one further study was retrospectively classified as IDEAL stage 1 as it described the application of RA-ARS to a novel patient group (scleroderma oesophagus) [36]. Two studies were classified as IDEAL stage 2a (9%), 13 as stage 2b (57%) and six as stage 3 (26%). No studies were classified as stage 4. Sequential progression through the IDEAL stages over time was lacking: the first case report [4] and the first RCT [15] were published in the same year (2001), and all RCTs were published by 2009, with all subsequent studies either stage 2b (n = 7) or stage 1 (n = 1; Fig. 2).

Fig. 2 Bar chart showing the publication dates and IDEAL stages of included studies

All six included RCTs were single-centre comparisons of RA-ARS and laparoscopic techniques, involving between 20 and 50 participants. One was deemed to have an overall high risk of bias [5]. There were some concerns regarding reporting bias in all six papers due to the lack of previously published protocols, and a lack of clarity around sequence generation and allocation concealment in four [5, 15, 27, 32]. In the RCT with a high risk of bias, there were additional concerns regarding bias due to missing data and measurement of post-operative outcomes [5].

Patient selection and demographics

Eleven studies (48%) reported both inclusion and exclusion criteria, five (22%) reported solely inclusion or exclusion criteria, and seven (30%) studies did not report any patient selection criteria (Table 2). Among the studies reporting this information, there were 17 distinct patient selection criteria, none of which were reported as specifically relating to the robotic nature of the procedure. One study excluded the first 10 patients undergoing RA-ARS [30], stating that this was done to minimise learning curve effects. No studies included a statement about what happened to ineligible patients. There was no discernible pattern in terms of widening of inclusion criteria over time or with advancing IDEAL stage.

Table 2 Summary of patient selection criteria

A total of 28 demographic characteristics were identified across the included studies, including sex (n = 22, 96%), age (n = 21, 91%), body mass index (n = 12, 52%) and grade of oesophagitis (n = 7, 30%; Table 3). No demographic characteristics were reported across all studies, and 12 were reported only once. There was no evidence of widening of patient demographics (e.g. inclusion of older, more comorbid patients) over time or with advancing IDEAL stage.

Table 3 Patient demographic characteristics reported in the included studies

Governance and ethical factors

Fourteen articles (61%) reported institutional review board (IRB) or ethics committee approval. Three authors reported exemptions, although reasons were not provided. Fifteen studies (65%) reported obtaining consent from the included patients, of which one specifically documented the innovative nature of RA-ARS. Sixteen articles (70%) did not include statements regarding COI. Two conflicts of interest were declared: one author founded a robotics company [35], and another received honoraria for speaking on behalf of device companies [38]. Seventeen studies (74%) did not report whether funding was received and four (17%) stated that no funding was provided. One study received funding from the authors’ local department [36], and one from a medical device company [6].

Surgeon expertise and training

Four studies (17%) described pre-specified criteria for surgeons to be eligible to participate in the study; in all cases, these criteria pertained to prior surgical experience. Thresholds of 10 [25], 20 [5], and 30 cases [27] were used for prior robotic experience and 30 cases [5, 32] for prior laparoscopic experience. No study reported specific training for surgeons prior to their first RA-ARS procedure.

Although 14 studies (61%) mentioned the number of operating surgeons (range, 1–3; median, 1), the majority (n = 20, 87%) did not report their grade or the previous number of RA-ARSs they had performed. In 13 studies, phrases such as ‘experienced’ or ‘senior’ were included as general statements of the operating surgeon’s experience.

Five studies (22%) measured and displayed the surgeons’ learning curve graphically. All five plotted one or more surrogate markers for performance (operation time [n = 5], docking time [n = 2], complication rate [n = 1], length of stay [n = 1], console time [n = 1] and setup time [n = 1]) against the number of operations performed.

Outcome selection, measurement and reporting

There were 157 distinct outcomes across the 23 studies, of which 95 (61%) were reported only once (summarised in Table 4, detailed in full in Fig. 3). No single outcome was reported in all 23 studies. The most frequently reported outcome domain was ‘complications’: 22 (96%) studies reported outcomes from this domain, and a third of all reported outcomes were from this domain (n = 117). The most frequently reported outcome was ‘mean operative time’ (n = 18, 78%), although there were 19 other ways of reporting ‘time’, most of which were reported only once (n = 15, 79%). No study reported any surgeon-reported outcomes. The length of follow-up was reported in 15 (65%) studies (range, 1–85 months; median, 24 months), of which 7 studies (47%) reported a follow-up period of less than one year. Contrary to the IDEAL recommendations, there was a lack of progression in the type of outcomes included (i.e. from technical to patient-reported outcomes) between each subsequent IDEAL stage (Fig. 4).

Table 4 Summary of outcome selection by domains, across included studies

Fig. 3 Diagram displaying all reported outcomes across all included studies

Fig. 4 Bar chart showing papers’ reported outcomes by IDEAL stage

Discussion

To our knowledge, this is the first study to undertake a detailed examination of the reporting of the introduction and evaluation of robotic anti-reflux surgery. The overall quality and consistency of reporting was deficient across the included studies. Outcome reporting was heterogeneous, with over half of all outcomes used only once across the studies. Patient selection criteria were variable, inconsistent, and sometimes omitted. Most studies did not provide statements about conflicts of interest and many did not report obtaining consent or ethics committee approval. The evolution of RA-ARS differed significantly from the IDEAL model of surgical innovation with a lack of stepwise progression from IDEAL stage 1 (Idea) to stage 4 (Long-term follow-up). The first RCT was published in the same year as the first case report, and the other five RCTs followed soon after, with most subsequent studies classified as stage 2b. Collectively, these findings suggest that there has not been sequential and incremental building of evidence from one study to the next.

Four previous meta-analyses, published in 2010 [41,42,43] and 2012 [44], have summarised the efficacy of RA-ARS; all included the six RCTs identified in this review. The methodological limitations of the included RCTs were highlighted in all four reviews: small sample sizes from single centres and a lack of information about randomisation, raising questions about the validity and reliability of the findings. Although one review suggested that postoperative complications might be reduced with RA-ARS, the authors agreed with findings from two other reviews that the advantages of robotic surgery did not translate to improved patient outcomes, with higher costs and longer operating times. Despite reporting the same findings, the fourth review argued that longer operating times were related to lack of familiarity with RA-ARS and concluded that RA-ARS was safe, effective and should be the ‘future trend for treatment of GORD’ [43]. Another problem was that the RCTs did not report the same outcomes, reducing the number of studies available for meta-analysis. Moreover, all six RCTs were undertaken early in the emergence of RA-ARS as a promising technique, raising the possibility that the results were influenced by learning curve effects. Were a well-designed, multicentre RCT to be conducted in the present day, it is possible that the findings would differ, and further research is therefore warranted in this area.

Although this was a comprehensive literature review, there were some limitations. Non-English language studies were excluded, meaning that valuable data may have been missed, although this has been shown not to cause systematic bias [18]. We excluded studies related to giant or paraoesophageal hernias because of differences in technique, disease, and complication profiles. As reflux symptoms can also occur in these conditions, our study may therefore not represent the entirety of RA-ARS. A final limitation was that determining the stage of innovation of a report was sometimes difficult using the algorithm provided by the IDEAL Collaboration [19]. For example, distinguishing stages 2a from 2b was particularly problematic (mainly due to a lack of information about the technique of RA-ARS and whether this was evolving), and they may represent either end of a continuum [9]. Moreover, the algorithm did not encompass the expected differences in outcomes or patient selection criteria, which are key considerations in moving between the IDEAL stages.

This study found that the evolution of RA-ARS differed significantly from the model of surgical innovation proposed by the IDEAL Collaboration. This may be a consequence of slow adoption of the IDEAL framework. Despite the publication of practical guidance in 2016 [9], a subsequent systematic review found that the IDEAL framework was not widely implemented outside the membership of the IDEAL Collaboration [11]. Several factors hindering its adoption have been described, including lack of understanding of the recommendations or how to apply them, and difficulty in determining the stage of innovation, as was the case in our study. An updated framework and reporting guidance have been published to address the deficit in understanding [12], and the IDEAL Collaboration is designing a study to investigate barriers to implementing the framework [personal communication, IDEAL Collaboration, 24th May 2022]. Furthermore, our institution is developing a method for determining stage of innovation. Collectively, this work may inform future studies and prevent future innovations evolving in a similar manner to RA-ARS. Another possibility is that the IDEAL model may not be representative of real-world surgical innovation. The IDEAL model was partly derived from theories of diffusion of innovations in the social sciences rather than from any empirical study of innovation in practice [8], and therefore may not align well with real-world events. This could further explain the discrepancy between the IDEAL model and how RA-ARS has evolved. Real-world surgical innovation is being studied in depth at our institution using case study methods with multiple qualitative data sources [45].

The reporting of patient selection criteria and outcomes was highly heterogeneous. This caused difficulties in comparing studies and synthesising evidence in systematic reviews and meta-analyses [46], which could be remedied by using a core outcome set (COS). A COS is an agreed minimum set of outcomes that should be reported in all studies of a specific disease (such as GORD) [21], ensuring reporting consistency. The development of a COS for GORD in adults may therefore play an important role in improving the quality of future RA-ARS studies. While one has been developed for the paediatric population [47], a COS for GORD in adults has not yet been published. More generally, a COS for the standardised evaluation of surgical innovation has been developed [48], aiming to reduce outcome heterogeneity in the reporting of new surgical procedures or devices in the future.

In conclusion, the under-reporting of important aspects of study design and the high degree of outcome heterogeneity impede the ability to draw meaningful conclusions from the body of evidence. There is a need for further well-designed randomised trials, alongside agreement about outcome selection, measurement, and reporting for future RA-ARS studies. Furthermore, we support the development of a core outcome set for adult GORD, and the use of frameworks such as those published by the IDEAL Collaboration.