Key Summary Points
We aimed to identify novel biomarkers which had been validated and showed sufficient promise to warrant further evaluation in low-prevalence populations.
We identified 431 unique biomarkers, of which only 35 had been investigated in at least two studies reporting outcomes for that individual marker and the same tumour type; four of these were identified as the most promising for future studies.
This review highlights the need for more biomarker studies that consider primary care/community settings as their intended populations.
Findings also indicate we still need better reporting to facilitate knowledge translation; we also need more consistency in the use of biomarkers.
Research collaborations are vital to reduce duplication of effort and ensure appropriate sample sizes when studying low-prevalence populations.

Digital Features

This article is published with digital features, including a summary slide, to facilitate understanding of the article. To view digital features for this article go to https://doi.org/10.6084/m9.figshare.13214843.

Introduction

Gastrointestinal (GI) cancers represented more than 25% (4.8 million) of cancer cases and over a third (3.4 million) of cancer-related deaths worldwide in 2018 [1]. Upper GI cancers contribute an important proportion of these, with over 2.1 million new cases of cancers of the stomach, oesophagus, pancreas and biliary tract diagnosed worldwide in 2018 [1, 2]. Prognosis is often poor as upper GI cancers are generally not detected until the disease is advanced and less amenable to curative treatment [1].

Primary care plays a key role in the early detection of upper GI cancers, as more than 90% of patients present with symptoms [3,4,5], and screening tests for asymptomatic populations are not yet widely established. Early detection of upper GI cancers is challenging, as initial symptoms such as indigestion, abdominal discomfort or fatigue are common, often intermittent, and most patients presenting with them will not have cancer [6, 7].

There is growing demand to improve early cancer detection through better diagnostic and triage approaches, particularly for use in primary care or other community settings where cancer prevalence is low [5]. New diagnostic approaches, applied either among asymptomatic at-risk populations or to triage patients presenting with cancer symptoms, could be transformational. Electronic health records and large population-based surveys have been used to develop cancer risk prediction models to identify those requiring investigation for cancer [8]; diagnostic pathways have also been implemented in different countries in an effort to improve timely cancer diagnosis [5]. Innovative strategies applying artificial intelligence techniques to imaging and other medical data are also promising [5, 9]. For cancers with non-specific symptom signatures, like most upper GI cancers, we also need better biomarkers to support diagnostic assessment [10]. Biomarkers such as carcinoembryonic antigen (CEA) and CA19-9 are used in clinical practice predominantly for surveillance following treatment of upper GI cancers [9, 11]. Substantial investment has been made in developing new biomarkers for early cancer detection, but most such biomarker research has been conducted in laboratory and specialist clinical settings [12, 13], where cancer prevalence is higher than in community settings [14, 15].

The distinction between care settings is important, as the diagnostic performance characteristics of a test are strongly determined by the prevalence and severity of the target disease and of other diseases within the study population [14]. In populations in which the prevalence of the target disease is low (e.g. primary care), positive predictive values are lower than in high-prevalence populations seen in specialist cancer centres. Tests evaluated in high-prevalence populations tend to have lower sensitivity and higher specificity when used in low-prevalence populations [15, 16]. This is known as the spectrum effect or spectrum bias [14, 15] and has crucial implications for translating results from one care setting to another. To gain an accurate understanding of how a test will perform within a low-prevalence setting, it must ultimately be evaluated within that setting.
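The dependence of predictive values on prevalence can be illustrated with a minimal sketch based on Bayes' theorem. The accuracy figures and prevalences below are hypothetical and are not taken from any included study. The sketch also holds sensitivity and specificity fixed, which is a simplifying assumption: under the spectrum effect described above, these would themselves be expected to shift between settings.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test with fixed sensitivity and specificity."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test (90% sensitivity, 90% specificity) applied in a
# high-prevalence setting (30%) and a low-prevalence setting (1%).
for prevalence in (0.30, 0.01):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.0%}  PPV={ppv:.1%}  NPV={npv:.1%}")
```

With these illustrative inputs the PPV falls from roughly 79% to roughly 8% as prevalence drops from 30% to 1%, even though the test itself is unchanged.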

In recognition of this, the CanTest Framework has been developed, proposing a 5-phase translational pathway for diagnostic tests, from new test development to health system implementation in low-prevalence populations [15]. The framework highlights the importance of evaluating not only clinical performance but also the feasibility and acceptability of implementation, patient safety and quality of care, and cost-effectiveness in the chosen clinical setting. Understanding and addressing these issues is vital, as test performance alone, even if evaluated in the target populations, guarantees neither clinical utility nor improved patient outcomes [12].

This review set out to systematically identify novel biomarkers for the early detection of upper GI cancers which have been validated and show sufficient promise to warrant further evaluation in low-prevalence populations.

Methods

Search Strategy and Inclusion/Exclusion Criteria

This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [17], and the protocol was registered in PROSPERO (CRD42020165005). We searched MEDLINE, Embase, Emcare and Web of Science from 1 January 2000 to 31 October 2019 for primary studies published in English. The search strategy (Online supplementary file 1) was developed with the assistance of a medical librarian and refined until it identified all relevant core publications known to the senior authors. Reference lists of included studies were also screened. Articles that were not available online were ordered via the British Library.

Studies were included if they reported on at least one measure of diagnostic performance: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive, false negative or area under the curve (AUC) for biomarkers used to detect oesophageal, gastric, pancreatic or biliary tract cancers. We included adult populations (mean/median age ≥ 18); we accepted individuals aged < 18 if these were outliers in large samples. The search strategy also included terms for lower GI (colorectal and anal) cancers for the purposes of a parallel review of novel biomarkers for the early detection of lower GI cancers, to be reported separately. Non-specified GI cancers, neuroendocrine cancers and studies only reporting on familial populations at risk of hereditary cancers were excluded.
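As a reminder of how the count-based measures listed above relate to one another, the following minimal sketch derives them from a 2 × 2 table of biomarker result against reference diagnosis. The class name and the counts are hypothetical and purely illustrative. AUC is not covered, as it requires the underlying biomarker values rather than a single dichotomised table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from cross-tabulating a biomarker result against the reference diagnosis."""
    tp: int  # cancer cases with a positive biomarker result
    fn: int  # cancer cases with a negative biomarker result
    fp: int  # non-cancer controls with a positive biomarker result
    tn: int  # non-cancer controls with a negative biomarker result

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# Hypothetical study with 50 cases and 50 controls (illustrative counts only).
table = TwoByTwo(tp=40, fn=10, fp=5, tn=45)
print(f"sensitivity={table.sensitivity():.2f} specificity={table.specificity():.2f}")
print(f"PPV={table.ppv():.2f} NPV={table.npv():.2f}")
```

Note that PPV and NPV computed this way reflect the case-to-control ratio of the study sample, not the prevalence in the population where the test would eventually be used.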

Novel biomarkers were considered both individually and as part of a combination/panel test. Studies reporting only the performance of a single, established biomarker (i.e. CEA and CA19-9 for pancreatic cancer) were not eligible for inclusion [9]. We included studies reporting on performance for established biomarkers if these were in combination with additional novel biomarkers.

We aimed to identify studies situated within Phase 2 (measures of diagnostic accuracy in high-prevalence settings) and Phase 3 of the CanTest framework (measures of diagnostic accuracy or clinical utility, acceptability and feasibility in intended low-prevalence settings) (Fig. 1) [15]. We included studies if they reported more than preliminary measures of performance calculated in a discovery phase; this required additional measures of diagnostic performance in an independent cohort. If no references to previous studies evaluating performance were available and the study provided only one set of measures, the study was excluded. Panels with previously investigated biomarkers were included even if the biomarkers had not been investigated as part of a panel. As larger sample sizes are required beyond the biomarker discovery phase [13, 18], studies had to include at least 50 cancer cases and at least one group of 50 non-cancer controls with similar clinical characteristics (healthy, or with non-malignant or pre-malignant conditions). Similar criteria have been adopted by previous reviews that informed our study [13, 19].
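The sample-size criterion can be expressed as a simple rule. The sketch below is one reading of it, assuming that the 50 controls must come from a single control group rather than being pooled across groups; the function name and example figures are hypothetical.

```python
MIN_CASES = 50
MIN_CONTROLS_PER_GROUP = 50


def meets_sample_size(n_cases, control_group_sizes):
    """True if there are >= 50 cancer cases and at least one control group of >= 50.

    Assumption: control groups are not pooled, so two groups of 30 and 40
    controls would not satisfy the criterion.
    """
    has_enough_cases = n_cases >= MIN_CASES
    has_large_control_group = any(
        n >= MIN_CONTROLS_PER_GROUP for n in control_group_sizes
    )
    return has_enough_cases and has_large_control_group


# Hypothetical studies: (cancer cases, sizes of each control group)
print(meets_sample_size(120, [30, 60]))
print(meets_sample_size(120, [30, 40]))
print(meets_sample_size(45, [200]))
```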

Fig. 1 The CanTest Framework. Reproduced with permission from [15]

We only included biomarkers which are feasible to use in a community setting, i.e. blood (serum and plasma), urine, faecal, salivary or breath samples. Observational studies (cross-sectional or longitudinal, prospective or retrospective) and trials were eligible for inclusion. We included all recruitment settings, as we expected that very few studies would have been carried out in community settings.

We used the online tool Covidence [20] to facilitate title and abstract screening and study selection. Two reviewers (any two of NC, PED, CS, KMM, DB or RB) independently screened titles and abstracts. Then, two reviewers (any two of the above) independently evaluated full-text articles for inclusion. Titles and abstracts of reference lists of included studies were reviewed by one author (NC); full-text articles selected at this stage were independently assessed by two reviewers (any two of NC, PED, RB or DB). Disagreements were resolved by consensus; when this could not be reached, a senior third reviewer (FMW or JE) was consulted.

Data Extraction and Analysis

Data extraction was piloted to ensure consistency and was carried out by one of seven reviewers (NC, PED, RB, DB, JMG, JO and SS). We extracted information on: study characteristics (publication year, country of population of interest, recruitment setting, study aims and design); populations (numbers included, age, sex, tumour staging for cases and health status for controls); biomarkers (type of sample, biomarker name, biomarker category); and summary measures of diagnostic performance (sensitivity, specificity, PPV, NPV, false positives, false negatives and AUC, with 95% confidence intervals when available, for all comparisons). When studies reported on different phases of biomarker development, we only extracted data from the eligible phases (i.e. biomarkers and measures beyond the discovery phase). When studies had more than one eligible phase, we extracted data from all phases. Extracted data were collated and checked for consistency and inaccuracies (NC).

Biomarkers were categorised according to a modified version of Uttley et al.’s classification [19], which included: microRNAs and other RNAs, autoantibodies and other immunological markers, other proteins (that did not fit into other categories), metabolic markers, circulating tumour DNA, and other biomarkers. Controls were classified as normal/healthy, or as having non-malignant or pre-malignant conditions. Biomarkers and control populations were coded by one author (NC) and checked by other authors (PED, KMM and MM; and PED, FMW and JE, respectively). Controls described as being healthy were coded as such unless studies described underlying conditions. Patients with cancer were ineligible as controls. Full details of the classification of controls are available (online supplementary table S1). Microsoft Excel 2015 and SPSS v.23 (IBM) were used for data extraction and data analysis.

Quality Assessment and Risk of Bias

Following independent piloting, risk of bias [21] was not assessed as described in the original protocol. Appraisal was hindered by the use of diverse methods across studies and incomplete reporting, resulting in a large number of “unclear” assessments. Instead, a list of issues identified in the studies was prepared (Online supplementary file 2). As spectrum bias is a key issue when translating results from high- to low-prevalence populations, all included studies were classified as either single-gate or two-gate designs. In single-gate designs, cases and controls are recruited through a single route of entry and with the same inclusion criteria (e.g. all cases and controls presented with symptoms). In two-gate designs, participants are recruited through different routes and different inclusion criteria exist for cases and controls. In this situation, controls can be either normal/healthy or have an alternative diagnosis that can produce symptoms and signs similar to those of patients with cancer [16]. One author (NC) classified all studies and another (PED) checked the classification. A full description of this classification and how it approaches some of the issues covered by the critical appraisal tool is available (Online supplementary file 3).

Data Synthesis

Included studies were heterogeneous and rarely evaluated the same biomarkers in the same way, often using different cut-off points, populations and/or biomarker combinations in panels. Therefore, we were unable to undertake meta-analysis. Instead, we used narrative synthesis to summarise data across studies [22]. First, we developed an overview of the available evidence, describing key characteristics of included studies, their populations and biomarkers, and outcome measures. Then, we looked for similarities that would allow for subgroup analyses, namely the same biomarker, for the same tumour type, with similar designs, outcome measures and populations.

Compliance with Ethics Guidelines

This article is based on previously conducted studies and does not contain any studies with human participants or animals performed by any of the authors.

Results

Database searches identified 16,597 records; 9172 were retained after removing duplicates. During title and abstract screening, 8179 ineligible records were excluded. The full texts of the remaining 993 records were assessed for eligibility; 731 were excluded (Fig. 2). A total of 262 studies from database searches met inclusion criteria; 25 additional studies were identified in reference lists. Of these, 149 included studies referred to upper GI cancers and were included in our narrative synthesis.
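The study selection figures reported above can be checked with simple arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts as reported in the study selection paragraph.
records_identified = 16_597
records_after_deduplication = 9_172
excluded_at_title_abstract = 8_179
excluded_at_full_text = 731
identified_from_reference_lists = 25

# Derived counts at each stage of the flow.
duplicates_removed = records_identified - records_after_deduplication
full_texts_assessed = records_after_deduplication - excluded_at_title_abstract
included_from_databases = full_texts_assessed - excluded_at_full_text
total_meeting_criteria = included_from_databases + identified_from_reference_lists

print(f"duplicates removed:      {duplicates_removed}")
print(f"full texts assessed:     {full_texts_assessed}")
print(f"included from databases: {included_from_databases}")
print(f"total meeting criteria:  {total_meeting_criteria}")
```

The derived figures of 993 full texts and 262 database-identified studies match those reported; the 149 upper GI studies in the narrative synthesis are a subset of the 287 studies meeting the inclusion criteria.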

Fig. 2 Study selection

Characteristics of Included Studies

Key characteristics of included studies are described in Tables 1 and 2. Most studies recruited participants from a single country (n = 142). China was the most common country (n = 77), followed by Japan and South Korea (n = 15 each), the USA (n = 12) and Germany (n = 9). The most common recruitment settings were hospital or other secondary care institutions (n = 125), biobanks, reference sets, databases or archived samples (n = 20), general population cohorts or cohorts from population screening programmes (n = 11) and cohorts from previous trials or observational studies (n = 9). Several studies recruited from more than one setting. Gastric cancer was the most commonly investigated tumour type (n = 69), followed by pancreatic (n = 54), oesophageal (n = 24) and biliary tract cancers (n = 3). Four studies investigated more than one type of upper GI cancer (Table 1).

Table 1 Characteristics of included studies: country, setting and population
Table 2 Characteristics of included studies: biomarkers and study design

Characteristics of Cases and Controls

Overall, the included studies reported on 22,264 cancer cases (10,589 gastric, 7964 pancreatic, 3258 oesophageal and 290 biliary tract cancers, and 163 oesophago-gastric cancers, not distinguishing between oesophageal and gastric cancer). The minimum age for cases was 16 while the oldest patient was aged 93. Most cases were male (68%) across all tumour types. Over 50% of cancers had been diagnosed at stages III and IV (median 55.5%, interquartile range 47.0–68.1%; data available for 106 included studies). The included studies reported on 49,474 controls (38,955 normal/healthy, 9042 with non-malignant conditions, 1106 with pre-malignant conditions, and 371 with either normal or non-malignant conditions). Pancreatitis and gastritis were the most commonly reported non-malignant conditions (online supplementary Figure S1). Over half of the studies (n = 83) investigated more than one type of control population. Normal healthy controls were the majority across all tumour types, except for biliary tract cancers. The minimum age for controls was 16 while the maximum age was 94. Overall, most controls were male (74%); this was the case for all tumour types except for biliary tract cancers.

Types of Biomarkers

Biomarkers were most commonly sampled from blood (145 studies; 107 investigated serum, 33 plasma and 5 both); two studies analysed urine [28, 36], one breath [169] and another saliva [47]. Most studies (n = 128) investigated more than one biomarker. A total of 431 biomarkers were identified (online supplementary table S2). These were most often microRNA and other RNAs (n = 183), other proteins (n = 119) and autoantibodies and other immunological markers (n = 79). Fewer than a third of studies (n = 44) included biomarkers from different categories. This was often due to the use of established biomarkers (proteins CA19-9 and CEA) in combination with novel biomarkers. Studies of pancreatic cancer reported on over half of identified biomarkers (n = 231) (Fig. 3). Only about a fifth (n = 90) of all identified biomarkers were reported in more than one study; 72 of these were reported in more than one study for the same tumour type (Table 3).

Fig. 3 Types of biomarkers, overall and by tumour type. Footnotes: a, five proteins; b, these refer to volatile organic compounds and platelets. Abbreviations: autoab, autoantibodies; ctDNA, circulating tumour DNA; miRNA, microRNA

Table 3 Biomarkers investigated more than once, for the same tumour type (number of studies)

Measures of Diagnostic Performance

The most commonly reported measures of diagnostic performance were sensitivity (n = 136), specificity (n = 129) and AUC (n = 123). PPV and NPV were each reported by 40 studies, while false positives and false negatives were least often reported (11 studies each). Outcome data on individual biomarkers were available in most studies (n = 121); the remaining 28 studies only reported on performance for a combination/panel. Over half of the included studies (n = 83) reported on measures of performance for biomarkers both individually and in combinations. Outcome data were not available for all control populations; only 95 studies provided outcome data for cancers versus normal controls, 54 provided outcome data for cancers versus non-malignant controls, and 10 provided measures for cancers versus pre-malignant conditions (online supplementary table S3).

Individual measures of diagnostic performance were available for 35 biomarkers mentioned more than once, for the same tumour type (online supplementary table S4). We were not able to synthesise outcomes further due to heterogeneity in biomarker combinations, in control populations and subgroup analyses, and variations in reported cut-off points and diagnostic accuracy data (see online supplementary table S5 for a textual description of outcomes).

Only four novel biomarkers were reported in studies adopting a single-gate design (Table 4). Apolipoproteins AII-AT and AII-ATQ had poor sensitivity (range 4–25%) but good AUCs (range 52–94.6%) reported for pancreatic cancer in three studies (same first author for all) [104,105,106]. Their diagnostic accuracy increased when combined with CA19-9 (sensitivity range 7–95.4%, specificity range 96–98%, AUC range 56–78%). Pepsinogen I (PGI) and PGI/PGII ratio had a wide range of sensitivity and specificity (ranges 27–77.9% and 20.2–92%, respectively) and good AUC (range 70–76%) reported for gastric cancer across four studies [29, 40, 41, 76]. When evaluated with other novel biomarkers (including miR-1290, MIC-1, ULBP2 and CA125), one established biomarker, CA19-9, also showed some promise (sensitivity range 23.1–88%, specificity range 71.6–96.6%, AUC 92–98%) for pancreatic cancer [121, 132, 138]. There were also two studies reporting panels rather than individual biomarkers using a single-gate, reversed-flow design (Table 4) [89, 119].

Table 4 Biomarkers reported more than once for the same tumour type and panels adopting a single-gate (reversed-flow) design

Discussion

Our systematic review identified 149 studies reporting on 431 different biomarkers for gastric, pancreatic, oesophageal and biliary tract cancers. Only a fifth of biomarkers were reported by more than one study, and from these only four novel biomarkers, apoAII-AT and apoAII-ATQ (pancreatic cancer) and pepsinogen I and II (gastric cancer), plus one established biomarker (CA19-9 combined with other novel biomarkers), were reported with individual measures of diagnostic performance, adopting a recommended single-gate design. Heterogeneity in methods, populations, biomarkers, outcomes and comparisons precluded meta-analysis. The application of novel biomarkers to the early detection of upper GI cancers is therefore at an early stage of maturity: few have been extensively evaluated and evaluations have almost exclusively focussed on high-prevalence populations. Further evaluation of the most promising biomarkers in low-prevalence populations is needed before extensive adoption into routine clinical practice can be recommended.

While other reviews have investigated biomarkers used for early cancer detection [19, 172], few have considered the evidence in the context of future application of tests in low-prevalence populations, the likely target for clinical application [12, 13]. To our knowledge, this is the first review to do so for upper GI cancers. The four novel biomarkers and one established biomarker we highlight in this review were evaluated in a mix of high- and low-prevalence populations, including hospital patients, general population cohorts, screening populations (both high and average cancer risk), and patients presenting with symptoms. We did not identify any studies reporting outcomes relevant to feasibility, acceptability, benefits and harms, or health economics, as initially planned in the review protocol (i.e. phase 3 studies and beyond in the CanTest framework). The best performing biomarkers for pancreatic cancer, with an AUC between 56% and 94%, were ApoAII-ATQ/AT alone, CA19-9 plus miR-1290, MIC-1 and ULBP2, and Mellby et al.’s [119] 29-panel signature. These may be ready for trials and other phase 3 studies, singly or in combination, in low-prevalence populations. We did not identify any novel biomarkers with similar AUCs for gastric, biliary tract or oesophageal cancers.

A previous review investigating the role of pepsinogens in early detection of gastric cancers reported that they had only moderate capacity to detect gastric cancer [173]. Another review on early pancreatic cancer detection highlighted that no single biomarker has yet translated to clinical use and suggested the use of ‘robust panels of biomarkers’ [9]. This review confirms that more research is required before we have sufficient evidence about biomarkers for upper GI cancers to warrant their adoption into clinical practice.

We identified several important methodological limitations within the biomarker studies to date. These include large numbers of biomarkers analysed in parallel during discovery studies, increasing the risk of false-positive results; limited sample sizes; evaluation of “extreme” cases; limited external, independent validation; and selective reporting for validation (several alternative analyses and combinations, use of several cut-off points and over-optimistic interpretation of the data) [12]. Together with the use of two-gate rather than recommended single-gate designs, these could all lead to inflated measures of performance. Population characteristics were often provided as supplementary data, with little discussion of potential selection bias and other sources of uncertainty. We also excluded relevant studies when we could not obtain sufficient information on an individual tumour type; this was the case for the CancerSeek tool [174]. Adoption of reporting guidelines [175] and development of early cancer detection collaborations [15, 18] could be useful strategies to address these issues.

This review offers a comprehensive overview of the available evidence. It benefitted from having a multidisciplinary team of experts, a broad search strategy, independent screening, and classifications checked by senior team members. Since meta-analysis was neither feasible nor appropriate, we had to use text and tables to synthesise the evidence. We did not include studies investigating biomarkers as part of risk prediction models or risk assessment tools. These studies have strong potential to be used in the community and should be investigated in a separate systematic review. Recent reviews indicate that only including studies in English has minimal impact on review conclusions [176, 177]. We believe this is also the case for this review, particularly due to the overall lack of evidence on biomarkers ready to be evaluated in low-prevalence settings. Although we did not formally appraise risk of bias, we identified several quality and methodological issues, indicating that challenges already highlighted in the literature have persisted over time [12]. Finally, due to the large amount of evidence on biomarker development and evaluation, we believe the field could benefit from a “living systematic review”; this refers to high quality, up-to-date online summaries of evidence which can be constantly updated as new research becomes available [178].

The studies we identified focused on measures of diagnostic performance, which is reasonable given the phase of development for most of them. The CanTest Framework [15] can help guide studies aiming to build much needed evidence on later phases of biomarker development, focussing on impact on clinical decision-making, patient, health system and economic outcomes.

Conclusion

There is a large body of evidence on biomarkers being developed for the detection of upper GI cancers, but relatively few have yet demonstrated their validity or clinical utility in settings where cancer prevalence is low. Early detection of colorectal cancer already benefits from biomarkers that can be used across different populations. This is the case for the faecal immunochemical test (FIT), which is recommended for use in primary care in Spain, Australia and the United Kingdom, in addition to being effective in mass population screening programmes, using different cut-off points [179, 180]. It took several decades from FIT development to generate evidence for its cost-effectiveness as a screening test for colorectal cancer. Its role in the assessment of patients in primary care with lower GI symptoms is still being evaluated. Biomarkers for upper GI cancer remain in their infancy but there are a few which show promise and require further evaluation. Ultimately, they may be able to contribute to improving outcomes for upper GI cancers through earlier detection.