Key Summary Points

To our knowledge, this is the first systematic review to characterise the range of novel biomarkers being investigated for the early detection of lower GI cancers, with a focus on their readiness to progress to further evaluation in low-prevalence populations such as primary care.

We identified 378 unique biomarkers from the literature; a meta-analysis of diagnostic accuracy data indicated mSEPT9 and TuM2-PK have potential for further evaluation in low-prevalence populations.

We highlight the need for (1) further studies on mSEPT9 and TuM2-PK in low-prevalence populations; (2) better reporting to facilitate translation; and (3) more consistency in the use of biomarkers. Addressing these needs will allow promising biomarkers to progress to further steps in the evaluation process and, ultimately, allow their clinical benefits to be ascertained for the intended population. This will require going beyond test performance to investigate implementation (including feasibility and acceptability), safety and cost-effectiveness.

Digital Features

This article is published with digital features, including a summary slide, to facilitate understanding of the article. To view digital features for this article go to


Gastrointestinal (GI) cancers account for over 25% of global cancer incidence and 35% of all cancer-related deaths [1]. Lower GI cancers, particularly colorectal cancer (CRC), contribute the largest proportion, with over 1.8 million new cases in 2018 [1]. CRC is the most commonly diagnosed GI cancer and constitutes 1 in 10 cancer cases and deaths [1].

Around 90% of patients with cancer first present with symptoms in primary care, highlighting a key role for primary care providers in the early detection of GI cancers [2, 3]. Diagnosis of GI cancers can prove challenging in the community setting: while gastrointestinal symptoms are commonly encountered, they are usually due to benign or self-limiting conditions and rarely to GI cancer [3]. Initial symptoms are often non-specific, and more specific symptoms usually represent more advanced disease [3].

Increased demand for diagnostic services for lower GI cancer and pressure on waiting times have been seen internationally in countries like Australia, the UK and Canada, where primary care plays a ‘gatekeeping’ role to specialist care. In many countries implementation of faecal occult blood tests (FOBTs) or faecal immunochemical tests (FITs) for CRC screening and diagnostic triage adds further pressure on colonoscopy services. In some healthcare systems, over-screening via colonoscopy is also an issue [4]. New diagnostic approaches are needed to help reduce the burden on specialist care, particularly in the current context of COVID-19 and associated delays in access to cancer diagnostic and treatment services [5].

There is considerable interest in the potential of biomarkers to detect GI cancers [3]. To date, carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA19-9) have played an important role in clinical practice to detect recurrent disease, but their diagnostic performance is inadequate for the early detection of new disease [3, 6, 7]. Substantial investment has been made in developing new biomarkers for early detection, but most studies of these tests have occurred in specialist clinical settings [8], where cancer prevalence is higher than in the community settings where they would eventually be applied [9, 10].

The performance characteristics of a diagnostic test are strongly determined by the prevalence and severity of the target disease and of other diseases within the study population [9]. In populations where the prevalence of the target disease is low (e.g. GI cancer in primary care), the corresponding positive predictive values (PPV) are lower than in high-prevalence populations. Tests that are evaluated only in these high-prevalence populations tend to have lower sensitivity and higher specificity when translated to low-prevalence populations [10, 11]. This is known as the ‘spectrum effect’, and has crucial implications for comparing the performance of tests in different populations [9, 10].
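The dependence of predictive values on prevalence can be illustrated with a minimal sketch based on Bayes' theorem. The sensitivity, specificity and prevalence values below are hypothetical and are not taken from any included study; the function name is ours. The sketch holds sensitivity and specificity fixed across settings, which is a simplifying assumption: under the spectrum effect described above, these would also be expected to shift.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' theorem.

    All arguments are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test (80% sensitivity, 90% specificity) applied in settings
# with progressively lower disease prevalence.
for prevalence in (0.30, 0.05, 0.01):
    ppv, npv = predictive_values(0.80, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With identical test characteristics, the PPV falls steeply as prevalence decreases, while the NPV rises towards 1.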

In recognition of these issues, the CanTest Framework was developed (Fig. 1) [10]. This novel framework encompasses a translational pathway for diagnostic tests, from new test discovery to health system implementation in low-prevalence populations [11]. The framework highlights the importance of evaluating clinical performance, implementation, patient safety, quality of care, and cost-effectiveness in the intended setting. It is vital that these elements are investigated alongside test performance in order to ascertain clinical utility and improved outcomes for patients [8].

Fig. 1 The CanTest framework. Source: Walter et al. 2019 [10]

This review aimed to systematically identify novel biomarkers for the early detection of lower GI cancers that have measures of diagnostic performance and show sufficient promise to warrant further evaluation in low-prevalence populations.


Search Strategy and Inclusion/Exclusion Criteria

These have been reported elsewhere [12]. The protocol for this review was registered on PROSPERO (registration ID CRD42020165005) and the Preferred Reporting Items for Systematic Reviews and Meta-Analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) statement was followed [13]. MEDLINE, Embase, Emcare and Web of Science were electronically searched for primary studies published in English between 1 January 2000 and 31 October 2019. The search strategy was developed with the assistance of a medical librarian (Appendix 1 in the supplementary material).

Studies eligible for inclusion were situated within phase 2 (i.e. providing measures of diagnostic accuracy beyond discovery, even if carried out in high-prevalence settings) and phase 3 (i.e. examining diagnostic accuracy in intended low-prevalence settings, and providing measures of clinical utility, including feasibility and acceptability) of the CanTest framework [10] (Fig. 1). We included studies which reported measures of diagnostic performance in an independent population (i.e. beyond measures from the initial discovery phase). Studies were excluded if no references to previous research evaluating biomarker performance were available, and if the study provided only one set of performance measures (reflecting discovery phase only). As studies beyond the discovery phase require larger sample sizes [14, 15], we included those which reported data on at least 50 cancer cases and at least one group of homogeneous non-cancer controls (n ≥ 50) with similar clinical characteristics (e.g. healthy, or with non-malignant conditions), as in previous reviews [15, 16].
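The sample-size and validation criteria above can be expressed as a simple screening rule. The following is a minimal sketch only: the `Study` record, its field names and the function name are hypothetical illustrations and do not describe the tooling actually used for screening, which relied on independent assessment by reviewers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Study:
    # Hypothetical fields, chosen for illustration only.
    n_cancer_cases: int
    control_group_sizes: List[int]      # one entry per homogeneous control group
    has_independent_validation: bool    # performance reported beyond discovery


def meets_size_and_validation_criteria(study, minimum=50):
    """True if the study reports performance beyond the discovery phase,
    has at least `minimum` cancer cases, and has at least one control
    group with at least `minimum` participants."""
    return (
        study.has_independent_validation
        and study.n_cancer_cases >= minimum
        and any(n >= minimum for n in study.control_group_sizes)
    )


print(meets_size_and_validation_criteria(Study(60, [55, 20], True)))
print(meets_size_and_validation_criteria(Study(49, [100], True)))
```

The rule covers only the criteria stated in this paragraph; the criteria on sample type, outcome measures and cancer site described below would need to be checked separately.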

We included studies on non-invasive biomarkers feasible for use in the community setting: blood (serum and plasma), urine, faecal, salivary or breath samples. Both observational (cohort or case–control, cross-sectional or longitudinal, prospective or retrospective) and experimental designs were eligible for inclusion. Studies undertaken in all recruitment settings were included.

We included studies if they reported on at least one measure of diagnostic performance, namely sensitivity, specificity, PPV, negative predictive value (NPV), false positive, false negative or area under the curve (AUC) for biomarkers used to detect lower GI cancers, including colorectal (colon, rectum, caecum) or anal cancers, in adult populations (mean/median age 18 or older; studies including individuals aged less than 18 were accepted if these were outliers in large samples). Non-specified GI cancers, neuroendocrine cancers, and studies only reporting on familial populations at risk of hereditary cancers were excluded.

Novel biomarkers were considered individually, in combination or as part of a panel. Studies reporting only on a single, established biomarker (CEA, CA19-9, or FIT or FOBT) were excluded [16, 17]. Studies providing measures of diagnostic performance for combinations of established and novel biomarkers were included.

Covidence systematic review software [18] was used to facilitate article screening. Titles and abstracts were screened independently by two reviewers (any two of PD, NC, CS, KM, DB or RB). Full-text articles were also independently evaluated for inclusion by two reviewers (any two of the aforementioned). Reference lists of included studies were manually reviewed by one author to identify additional studies (NC). Full-text articles selected at this stage were also independently assessed by two reviewers (any two of PD, NC, RB or DB). Disagreements were resolved by consensus; when this could not be reached, a senior reviewer was consulted (JE or FMW).

Data Extraction and Analysis

Data extraction was piloted to ensure consistency and was performed independently (by SS, DB, RB, JMG, JO). Information on study characteristics, populations, biomarkers and measures of diagnostic performance was extracted. When studies reported on different phases of biomarker development, only data from the eligible phases were extracted. When studies had more than one eligible phase, data were extracted for all eligible phases. Extracted data were collated and checked for consistency and inaccuracies (PD).

Biomarkers were categorised according to a modified version of the classifications reported by Uttley et al. [15]: microRNAs and other RNAs, autoantibodies and other immunological markers, other proteins (i.e. proteins that did not fit into other categories), metabolic markers, DNA-related markers (protein-coding genes, gene mutations), circulating tumour DNA, DNA methylation markers and other biomarkers. Controls were classified as normal/healthy, having non-malignant conditions or those with adenomas/polyps. Controls described as healthy were coded as such unless studies described underlying conditions. Full details of the control population classification are available in Appendix 2 in the supplementary material.

Quality Assessment and Risk of Bias

Considering the key issue of spectrum bias, studies were classified as either single-gate or two-gate designs. Single-gate studies recruit participants before disease status is known, with a single route of entry, and with the same inclusion criteria. Two-gate studies recruit participants through different routes and use different criteria for cases and controls. This can lead to over-inflated measures of diagnostic performance, for example if individuals with advanced disease are over-represented within the study population and are compared with the ‘fittest of the fit’ healthy controls [11]. One author (PD) classified all studies and another (NC) checked the classification. A full description of this classification and how it approaches issues covered by the QUADAS-2 critical appraisal tool [19] is available in Appendix 3 in the supplementary material. Studies included in the meta-analysis were assessed using the QUADAS-2 [19] tool by two reviewers (PD and NC).

Data Synthesis

As significant heterogeneity was anticipated, we used narrative synthesis to summarise the data [20]. An overview of the evidence was developed to describe the key characteristics of the included studies, their populations, biomarkers and outcome measures. Data were examined for similarities that would allow for subgroup analyses, namely the same biomarker, with similar study design and appropriate accuracy performance measures. For meta-analysis to occur, biomarkers had to be investigated in more than two studies, with individual outcome measures provided, similar populations included and a single-gate study design. We focused the analysis on single-gate studies, as this design reduces spectrum bias, and is more likely to provide results that translate for use in low-prevalence populations. Meta-analysis of diagnostic test accuracy was performed using MetaDTA (version 1.43) [21] and RevMan (5.3) [22] software.

For meta-analysis, we used the random effects bivariate binomial model of Chu and Cole fitted as a generalised linear mixed effect model [21]. Sensitivity and specificity were jointly modelled and the estimates from each study were assumed to vary [21]. Hierarchical summary receiver operating characteristic (HSROC) parameters were estimated using the bivariate model parameters. Summary points of sensitivity and specificity were presented alongside forest plots and SROC curves. Heterogeneity and threshold effects were evaluated using the SROC plots and random effects correlation.
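For readers less familiar with this model, its general structure can be sketched as follows. This is the standard textbook formulation of a bivariate binomial random effects model as we understand it; the notation is ours and is not taken from the MetaDTA documentation. Here $y_{Ai}$ and $y_{Bi}$ denote the numbers of true positives and true negatives in study $i$, and $n_{Ai}$ and $n_{Bi}$ the numbers of diseased and non-diseased participants.

```latex
\begin{align}
y_{Ai} \mid Se_i &\sim \mathrm{Binomial}(n_{Ai},\, Se_i), \\
y_{Bi} \mid Sp_i &\sim \mathrm{Binomial}(n_{Bi},\, Sp_i), \\
\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
\begin{pmatrix}
\sigma_A^2 & \rho\,\sigma_A \sigma_B \\
\rho\,\sigma_A \sigma_B & \sigma_B^2
\end{pmatrix}
\right)
\end{align}
```

Under this formulation, the summary sensitivity and specificity are $\operatorname{logit}^{-1}(\mu_A)$ and $\operatorname{logit}^{-1}(\mu_B)$, and $\rho$ is the random effects correlation: a strongly negative value suggests that studies trade sensitivity against specificity, as would be expected under a threshold effect.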

Compliance with Ethics Guidelines

This article is based on previously conducted studies and does not contain any studies with human participants or animals performed by any of the authors.


A total of 16,597 records were identified in database searches; 9172 were retained after removing duplicates (Fig. 2). During title and abstract screening 8179 records were excluded. After assessing the full text of the remaining 993 records, 731 of them were excluded. Of the remaining studies, 142 are included in this review. The characteristics of included studies are described further in Table 1, and measures of diagnostic performance are described in supplementary Table 1.

Fig. 2 Study selection

Table 1 Characteristics of included studies: country, setting and population

Characteristics of Included Studies

Most papers (n = 124) recruited patients from a single country. China was the most common country (n = 62), followed by Japan and Germany (both n = 14), and the USA (n = 13). Most studies recruited from single settings with few studies recruiting from at least two different settings (n = 11). The most common recruitment settings were hospitals and other secondary care settings (n = 106). Only one study recruited controls from a primary care setting [57]. All included studies reported on CRC, with six studies specifically referring to colon cancer, and one study specifying rectal and caecum cancer cases separately to colon cancer. Some studies (n = 22) also referred to adenomas or polyps as cases, and five studies also included data on upper GI cancers (e.g. gastric, oesophageal and pancreatic cancers).

Characteristics of Cases and Controls

Overall, the included studies reported on 24,844 cases; the majority were diagnosed with CRC (80.2%) and a minority with adenomas/polyps (19.8%). Most cases had their age reported (79%), either as a range, mean or median. The overall mean age for CRC cases was 61.3 years, and 60.7 years for adenoma/polyp cases. Ages of cases ranged from 18 to 97 years. The majority (59%) of CRC and adenoma/polyp cases were male. Most studies provided data on tumour staging, mainly using the TNM system (n = 101), though some studies used Dukes’ classification (n = 22), with one study providing data for both. When combining TNM and Dukes’ staging data, over half of the cancers (54%) were diagnosed at early stages (I–II/A + B). Adenomas included as cases were most frequently defined by size, dysplasia, villous component and/or number of adenomas.

The included studies reported on a total of 45,374 controls (31,352 normal/healthy, 6414 with non-malignant conditions and 7608 with adenomas or polyps). A number of studies (n = 37) investigated more than one type of control population. The control populations of most studies (n = 108) were tested to rule out CRC, mainly using colonoscopy (n = 65). The majority of studies (n = 17) with adenomas or polyps as controls included those that were low risk (hyperplastic, non-neoplastic polyps or non-advanced adenomas), though some were high risk (advanced adenomas, those with villous histology or high-grade dysplasia). Age data were extractable for 47.1% of controls. Ages of controls ranged from 16 years (a healthy control) to 99 years. The majority of both healthy (50.6%) and non-malignant (58%) controls were male.

Types of Biomarkers

Most studies investigated more than one biomarker (79.6%), and these often reported on measures of performance for individual and combinations or panels of biomarkers (45.8%). The commonest sample source was blood (82.4%); these studies analysed serum (n = 62), plasma (n = 41) or whole blood (n = 14). Faeces was also a common sample source (24.6%); two studies analysed urine, and 13 studies analysed more than one type of sample.

A total of 378 unique biomarkers were identified across the 142 included studies (Appendix 4 in the supplementary material). The commonest biomarkers were microRNAs and other RNAs, followed by proteins, DNA markers, autoantibodies and other immunological markers, and metabolic markers. Proteins were further classified into subcategories, with the most common being novel proteins (Table 2).

Table 2 Classification of identified biomarkers

A total of 54 biomarkers were reported in more than one study (Appendix 5 in the supplementary material). Three biomarkers were investigated by more than 10 studies: CA19-9, CEA and mSEPT9 (methylated septin 9). Additionally, six other biomarkers were investigated in five or more studies: tumour pyruvate kinase isoenzyme type M2 (TuM2-PK), microRNA-21 (miR-21), FIT, microRNA-92a (miR-92a), cancer antigen 72-4 (CA72-4) and TIMP metallopeptidase inhibitor 1 (TIMP-1) (see Appendix 5 for references).

Measures of Diagnostic Performance

Individual measures of diagnostic performance (i.e. measures outside of combinations or panels) were available for 35 biomarkers evaluated more than once (Appendix 5 in the supplementary material). Heterogeneity of study design and included populations precluded meta-analysis for the majority of these biomarkers; however, three had individual measures from multiple studies adopting a classic single-gate design: CEA (n = 7 studies), mSEPT9 (n = 4 studies) and TuM2-PK (n = 3 studies). Differences in the sample sources and diagnostic performance measures provided across the studies precluded meta-analysis for any accuracy measures available for CEA, which was included as a comparator to the novel markers. Meta-analysis was performed for the markers mSEPT9 and TuM2-PK.

The estimated sensitivity and specificity of mSEPT9 were 80.6% (95% CI 76.6–84.0%) and 88.0% (95% CI 79.1–93.4%), respectively, and the diagnostic odds ratio was 30.3 (95% CI 17.8–51.4). TuM2-PK had an estimated sensitivity of 81.6% (95% CI 75.2–86.6%) and a specificity of 80.1% (95% CI 76.7–83.0%), and a diagnostic odds ratio of 17.8 (95% CI 11.6–27.2). Paired forest plots of the sensitivity and specificity for both mSEPT9 and TuM2-PK are shown in Figs. 3 and 4.
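The relationship between these summary estimates can be checked with a short sketch that derives likelihood ratios and the diagnostic odds ratio (DOR) from a sensitivity and specificity pair. The function names are ours. Because the inputs are the rounded summary points reported above rather than the fitted model parameters, the results are expected to agree only approximately with the reported DORs.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a binary test."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR = LR+ / LR-, i.e. the odds of a positive test in cases
    divided by the odds of a positive test in non-cases."""
    lr_positive, lr_negative = likelihood_ratios(sensitivity, specificity)
    return lr_positive / lr_negative


# Rounded summary points from the meta-analyses reported in the text.
summary_points = {"mSEPT9": (0.806, 0.880), "TuM2-PK": (0.816, 0.801)}
for marker, (se, sp) in summary_points.items():
    lr_pos, lr_neg = likelihood_ratios(se, sp)
    dor = diagnostic_odds_ratio(se, sp)
    print(f"{marker}: LR+={lr_pos:.2f}  LR-={lr_neg:.2f}  DOR={dor:.1f}")
```

The values obtained in this way are close to the reported DORs of 30.3 and 17.8; the confidence intervals cannot be reproduced from the point estimates alone.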

Fig. 3 Forest plots of sensitivity and specificity for mSEPT9 in plasma

Fig. 4 Forest plots of sensitivity and specificity for TuM2-PK in stool

The random effects correlation for mSEPT9 was −1, indicating a significant threshold effect. Heterogeneity and threshold effect were harder to evaluate statistically for the meta-analysis of TuM2-PK as the low number of included studies impeded accurate fitting of the HSROC curve and generation of a random effects correlation. A cut-off value of 4 U/ml was used for the TuM2-PK assays across all studies. The studies included in the meta-analyses were at low risk of bias across most domains, except for the domains related to patient selection and the index test. Full appraisal data can be found in Appendix 6 in the supplementary material. Summary plots including risk of bias and applicability ratings from QUADAS-2 are shown in Figs. 5 and 6.

Fig. 5 HSROC curve for mSEPT9 (with risk of bias and applicability ratings)

Fig. 6 HSROC plot for TuM2-PK (with risk of bias and applicability ratings)


This systematic review identified 142 studies reporting on 378 different biomarkers for CRC. The included papers were very heterogeneous, with differences in study design, control populations, sample sources, types of biomarkers, test thresholds and reported performance measures. Meta-analysis of diagnostic accuracy data was only possible for two novel markers: mSEPT9 and TuM2-PK. Both demonstrated high sensitivity, specificity and diagnostic odds ratios in hospital populations.

The most common biomarkers (both individually and in panels) were CEA, CA19-9, mSEPT9 and TuM2-PK. CEA and CA19-9 have a more established role in clinical practice for detecting recurrent disease [3, 7] so it is not surprising that these markers are prevalent throughout the literature. Most of the studies included CEA (42/53) and CA19-9 (20/21) in panels or used them as comparators for novel markers. Meta-analysis was not possible for these studies because of heterogeneity in sample sources and performance measures. Twenty studies reported on the performance of mSEPT9 for CRC detection, mostly as a blood-based biomarker sampled from plasma. While most measures of diagnostic performance were for mSEPT9 as an individual marker, it was also included in panels or combinations across seven studies. Fewer studies reported on the performance of TuM2-PK (nine overall, three included TuM2-PK in panels or combinations). Unlike mSEPT9, TuM2-PK was predominantly sampled from stool, though some studies also reported it as a blood-based biomarker. The studies that evaluated mSEPT9 and TuM2-PK included a number of two-gate studies or hybrid designs, and multiple instances where the study design was unclear. The meta-analyses for mSEPT9 and TuM2-PK included only those with a clear, classic single-gate design [11] to reduce heterogeneity and spectrum bias; consequently, both meta-analyses included a low number of studies, resulting in wide confidence intervals for the diagnostic odds ratios.

The meta-analysis for mSEPT9 synthesised diagnostic performance data on 899 CRC cases. Cancer cases were mostly diagnosed in stages II or III and diagnostic performance data were also provided for adenomas and polyps in most cases. This is important to note, as the diagnostic performance results for early stage cancers are more likely to translate for use in early detection, and the ability of biomarkers to also detect high-risk adenomas, polyps or dysplasia could provide additional clinical utility. Across all studies, the test sensitivity was higher when detecting advanced CRC cases. Conversely, test sensitivity decreased when used to detect either adenomas or degrees of dysplasia. As previously mentioned, several studies evaluated mSEPT9 within diagnostic panels or in combination with other markers. Three studies in particular [66, 151, 153] showed that the sensitivity of mSEPT9 to detect CRC increased when combined with more established markers such as FIT and CEA. The results from our review show a slightly higher sensitivity for mSEPT9 in comparison to a recent meta-analysis of 19 studies [165], though it should be noted that the analysis from that review included a mixture of study designs and focused on high-risk populations. Our results are comparable to previous analyses which estimated the sensitivity of mSEPT9 as up to 88% [165, 166].

The meta-analysis of diagnostic accuracy data for TuM2-PK as a stool marker included 183 CRC cases. Similarly to mSEPT9, the sensitivity of TuM2-PK was higher for more advanced cancers (Dukes’ stage C and D; stages III and IV) and lower when it was used to detect adenomas, polyps or dysplasia. All three studies included in the meta-analysis compared the diagnostic performance of TuM2-PK to the established stool marker FIT, and demonstrated that FIT was preferable to TuM2-PK as a faecal biomarker for screening populations [94, 116, 117]. Three studies [50, 67, 115] evaluated TuM2-PK as a blood-based biomarker in combination with other markers or in panels, and all found sensitivity to be higher for TuM2-PK in combination with other markers. TuM2-PK may therefore be more promising in blood-based diagnostic panels than as a stand-alone stool marker.

Two-gate and hybrid designs were used widely in the included studies. These types of study designs can lead to over-inflated measures of diagnostic performance due to an over-representation of individuals with advanced disease within the study population [11]. While many studies attempted single-gate designs and recruited participants through one route (usually screening populations where all participants attended for a colonoscopy), the low prevalence of CRC cases meant that extra cases were sourced from alternative routes. This study design issue highlights the importance of large-scale studies and trials that are adequately powered to evaluate diagnostic performance in truly low-prevalence populations.

Several other methodological limitations were identified across the studies. These included the parallel analysis of large numbers of biomarkers during discovery studies; limited external, independent validation of test performance; and selective reporting for validation including alternative analyses and combinations or use of several cut-off points. Insufficient reporting regarding population characteristics and recruitment was also an issue in many studies, with information often provided as supplementary data and with little detail. As a result of the large amount of evidence on biomarker development and evaluation, we believe the field could benefit from a “living systematic review”; this refers to high-quality, up-to-date online summaries of evidence which can be constantly updated as new research becomes available [167].

Although our search was restricted to studies published in English, recent reviews indicate that this has minimal impact on review conclusions [168, 169]. Further limitations of this review include the exclusion of studies that evaluated biomarkers within risk assessment tools or risk prediction models. These studies have strong potential to be used in the community; however, we believe they should be investigated in a separate systematic review. The heterogeneity of the published literature meant we could only conduct meta-analyses on a limited subset of included studies. Nonetheless, we believe the narrative synthesis of additional studies provides a useful summary of the current state of the science in this area. There were insufficient homogeneous data on biomarker panels to report summary estimates of their diagnostic performance. A study by Fung et al. [48] describes ColoSTAT, a novel blood-based diagnostic panel for CRC that includes TuM2-PK with two other biomarkers (IL-8 and DKK-3) and is currently being trialled in Australia. The ColoSTAT panel has reported sensitivity and specificity of 73% and 95%, respectively, for CRC, which is comparable to reported values for FIT (64–73% and 92–95%, respectively [170–173]) for the detection of CRC in screening populations. Previous trials using this panel have been conducted in high-prevalence settings, with two-gate designs. Once further data are available on ColoSTAT and its performance to detect early stage CRC, it may have applicability in low-prevalence settings as an alternative to FIT, either for screening or in symptomatic populations.


There is a large body of evidence on novel biomarkers being developed to aid with the early detection of lower GI cancers. Few of these markers have yet demonstrated their validity or clinical utility, but two show promise for further evaluation, mSEPT9 and TuM2-PK, and could contribute towards the early detection of CRC as part of blood-based diagnostic panels. Further, large-scale studies in low-prevalence populations are required to evaluate their potential role to support diagnostic assessment in primary care and community settings. This review offers a comprehensive overview of the current state of evidence, situates it within a translational framework for diagnostic tests and makes recommendations in order to build the evidence base for the early detection of lower GI cancers in low-prevalence settings.