Key Findings/Implications

Application of next-generation sequencing for the detection of somatic mutations in circulating cell-free DNA derived from solid tumour malignancies seeks to revolutionize precision medicine by using a simple blood draw to detect cancers, monitor response to therapies, and personalize treatment strategies.

Advancing cell-free DNA diagnostics to early-stage and non-metastatic cancers has been limited by challenges associated with distinguishing the true signal of low-frequency tumour-derived cell-free DNA from noise generated during next-generation sequencing.

Differences in fragment length between tumour-derived cell-free DNA (circulating tumour DNA [ctDNA]) and cell-free DNA originating from healthy cells are a biologic phenomenon that can be leveraged to improve detection of ctDNA.

Both laboratory-based (in vitro) and computer-based (in silico) methods are being developed to use the fragment length profile of ctDNA to improve cell-free DNA diagnostics in cancer.

Further investigations into using size-based analyses in difficult-to-detect cancers are necessary to expand the role of cell-free DNA in non-invasive applications of precision oncology.

1 Why is the Fragment Length of Circulating Tumour DNA (ctDNA) Important?

In 1948, the presence of DNA in plasma without a protective membrane, which has been subsequently termed circulating cell-free DNA (ccfDNA) (Fig. 1a), was described by Mandel and Métais [1]. In 1977, the first report of a difference in the ccfDNA concentration (ng/mL plasma) between cancer patients and healthy individuals hinted at the possibility of harnessing ccfDNA as a diagnostic tool in clinical oncology [2]. Unfortunately, applications based on ccfDNA concentration alone proved limited because of significant overlap between healthy individuals and patients with cancer, particularly patients with early-stage and non-metastatic disease [2,3,4,5]. In addition, factors such as age [6], weight [7], inflammation [8], infection [9], and exercise [10], among other clinical factors, have been shown to affect ccfDNA abundancy, likely undermining the use of ccfDNA concentration to discern and monitor malignancy [11]. Thus, the 1994 discovery of a KRAS p.G12D point mutation in ccfDNA derived from pancreatic cancer was a landmark occasion because the detection of tumour-derived ccfDNA (circulating tumour DNA [ctDNA]) provided incontrovertible molecular proof for the viability of the non-invasive detection of cancer [12].

Fig. 1

Acquisition and characteristics of circulating cell-free DNA (ccfDNA). Whole blood acquired through venipuncture is centrifuged to separate plasma from buffy coat and erythrocytes (a). ccfDNA derived from apoptosis is present in plasma as various multiples of nucleosomes—DNA wrapped around a histone core with a linker fragment of DNA (~10 bp) joining adjacent nucleosomes (a). The relative quantity and fragment length distribution of ccfDNA is shown in (b), where the most abundant fragment length corresponds to the length of the mononucleosome. The fragment lengths of circulating tumour DNA (ctDNA) tend to be shorter than ccfDNA (c); however, there is substantial overlap. Enrichment of ctDNA has generally focused on isolation of fragment lengths < 150 bp to improve the ratio between ctDNA and ccfDNA

Haematopoietic cells account for > 90% of the ccfDNA in healthy individuals [13,14,15,16]. The remaining portion of ccfDNA is principally derived from endothelial cells and hepatocytes [13, 14, 16]. In the presence of solid tumours, the proportion of ccfDNA composed of ctDNA is generally correlated to disease severity [3]. In advanced and metastatic disease, the somatic mutations present in ctDNA may overwhelm the wild-type alleles derived from haematopoietic cell DNA and appear with strikingly high allele frequencies (> 20% [17]). In turn, the allele frequency of ctDNA associated with early-stage or non-metastatic disease may be quite low (< 1%). Because a widely sought goal of ccfDNA diagnostics is detection of ctDNA before advanced disease occurs to potentially maximize therapeutic options, the field has been driven towards development of sensitive methods to identify ctDNA among the abundant ccfDNA from normal healthy cells—the proverbial “needle in a haystack”. From 1994 to 2012, research efforts were largely led by polymerase chain reaction (PCR)-based methods conducting narrow tumour-informed searches or analysing common tumourigenic loci (e.g. EGFR p.T790M) for evidence of ctDNA. During this period, ctDNA was detected from malignancies associated with a multitude of organs, such as the colon [3], breast [18], prostate [19], lung [20], skin [21], and pancreas [22] among others. In some instances, detection of ctDNA correlated with poor survival [23]. Other studies found that serial quantitative monitoring of ctDNA informed responses to therapy [24]. Overall, however, the successful detection of ctDNA based on mutation detection tended towards advanced disease where the ctDNA fraction was generally observed at > 1%.

In 2012, the first uses of next-generation sequencing (NGS) to detect ctDNA were detailed [25, 26]. Integration of NGS into ccfDNA diagnostics promised the broader characterization of ctDNA because capture-enrichment panels used during NGS can be customized to target a wide range of molecular foci depending on the desired applications. Thus, detection of ctDNA could move beyond a narrow search to a broad surveillance of a multitude of genes enabling a more extensive genotyping through perfusion sampling rather than focal tissue analysis. Theoretically, NGS should enable substantial advances in precision oncology via (1) detection of both common and patient-specific mutations, (2) monitoring of molecular evolution, and (3) detection of ctDNA derived from metastatic lesions with a potentially separate molecular profile. Each of these mutation-based features is key in the advancement of personalized medicine through the enabling of treatment optimization for each patient’s genotypically unique primary and possibly distal lesions. Although NGS is on par with PCR-based methods in ability to detect ctDNA from a wide array of human cancers [27], further attainment of NGS’s potential has been constrained by assay noise generated during NGS because a portion of artifacts can occur randomly with an allele frequency similar to low-frequency ctDNA [28]. An overview of publications from 2016 to 2020 highlights several key effects of NGS-associated noise on the mutation-based detection of ctDNA (Fig. 2, [29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142]). 
First, the sensitivity of a mutation-based ctDNA detection strategy to identify cancer reveals a high degree of variability across all disease severities (Figs. 2, 3a, b). Although detection of advanced and metastatic disease of some cancers has improved, others remain challenging, likely indicating a low ctDNA burden associated with some cancers regardless of clinical stage. As such, development of more sensitive techniques to discern ctDNA from noise may not only extend ctDNA applications to a broader range of cancer types and stages, but also improve characterization of genetic heterogeneity in advanced cancers. Second, a large portion of studies use a tumour-informed search to guide mutation-based ctDNA detection to distinguish true positives from false positives (Fig. 2). Thus, the ctDNA mutational landscape is constrained to somatic mutations present in a focal tissue sample that may not adequately represent genetic heterogeneity of the primary tumour and/or mutations unique to metastatic lesions. In addition, dependence on tumour DNA to detect ctDNA largely voids the concept of the non-invasive liquid biopsy, particularly applications pertaining to diagnosis and monitoring of molecular evolution. Finally, inclusion of healthy control ccfDNA is uncommon in studies searching for ctDNA with or without a priori knowledge of somatic mutations (Fig. 2). Because a principal component of NGS-associated error is stochastic [28, 143], investigations using a mutation-based detection of ctDNA without validating specificity pertinent to the experimental design may be challenging to interpret due to the confounding effects random noise may have on identification of low-frequency ctDNA, especially when tumour DNA is not used as a guide. Collectively, the data presented in Figs. 2 and 3 show positive trends in the overall progress achieved in the past 5 years, but also provide compelling evidence that the true potential of NGS-based ccfDNA diagnostics in precision oncology remains unrealized. As such, laboratory methods, technical innovations, or both to improve the signal-to-noise ratio during NGS are needed to translate mutation-based ctDNA detection into impactful clinical applications.

Fig. 2

Profile of publications from 2016 to 2020 using NGS for mutation-based ctDNA detection. A PubMed queryᵃ was used to initiate a search in each year for publications with a total sample size ≥ 10. Publications that sought to detect ctDNA associated with minimal residual disease were excluded. The list is first grouped by year (Yr), then disease severity, and finally by whether or not the mutation-based search for ctDNA was tumour informed. The reference (Ref) column identifies the citation. When feasible, data associated with different disease severities are presented separately and the reference number is non-black to support matching of data from the same source. In one study (Ref. [112]), data obtained with and without a tumour-informed search were merged and separation was not possible (marked ‘Both’). Sensitivity (Sens) represents a study value corresponding to the mutation-based detection of ctDNA to determine presence/absence of a malignancy. The associated sample size is presented in the adjacent column (Patients). Specificity (Spec) is reported only if obtained from healthy control data. Sensitivity and specificity values should be interpreted cautiously as calculations can vary substantially within and between publications. For example, in Ref. [131] the authors report detection of ctDNA in 55% of patients. However, no mutations identified in solid tumour DNA were present in ccfDNA, so sensitivity is also shown at 0% for a tumour-informed search. To gain adequate contextual understanding of values, reviewing the publication’s supplemental data may be necessary. ccfDNA circulating cell-free DNA, ctDNA circulating tumour DNA, mixed = cancers from different organs, NB neuroblastoma.
*Specificity not calculated, controls used to determine error rate (< 3.3 × 10⁻⁷ false positive mutation calls per base); controls used for error modelling; **specificity not calculated, median error rate of 0.03 non-silent single nucleotide variants per Mb; ††specificity not calculated, 307 of 342 targeted positions error free. ᵃPubMed query: (cell-free DNA[Title]) AND (cancer) AND (("YEAR/01/01"[Date - Publication] : "YEAR/12/31"[Date - Publication]) AND (next-generation sequencing) AND (circulating))

Fig. 3

Characteristics of publications from 2016 to 2020 using NGS for mutation-based ctDNA detection. In a, the violin plots show the sensitivity of mutation-based ctDNA detection relative to disease severity. In general, there was a trend towards increased detection with more severe disease. In b, sensitivity is relatively similar across years likely because more difficult to detect tumours are being included in later years, which may adversely affect sensitivity but also indicates detection strategies are working towards inclusion of more challenging cancers. In c, the total number of publications profiled for each year from Fig. 2 is shown (dark gray) and the number of those publications that include early/stage I–II disease is also depicted (light gray). In more recent years, a larger proportion of publications are seeking to detect ctDNA in early-stage and non-metastatic cancers. ctDNA circulating tumour DNA, NGS next-generation sequencing

One route to improving detection of low-frequency variants is using a larger DNA input so that the signal exceeds the level of noise. Unfortunately, the amount of ccfDNA in plasma is finite. In healthy individuals, the concentration of ccfDNA ranges between 5 and 15 ng/mL plasma and the quantity of ccfDNA in patients with early-stage or non-metastatic disease is similar or only marginally higher. As a reference point, a single human cell (diploid) contains 7 pg of DNA. In 1 ng of human DNA, each genomic position associated with the non-sex chromosomes is represented ~ 286 times (i.e. genomic equivalents). Thus, a 30-ng ccfDNA library input provides ~ 8580 unique copies of DNA at each base position assuming a lossless library preparation and absence of gene amplifications and deletions. Using NGS, a ctDNA variant with an allele frequency of 0.05% may be indistinguishable from noise because fewer than five unique copies of the mutation under ideal conditions are present in the sequencing data. Because of losses that occur during sample handling for NGS in combination with the low quantity of cell-free DNA in plasma, achieving the necessary boost in signal at such low allele frequencies to overcome NGS-associated noise requires large volumes of blood (i.e. > 100 mL [121]). However, a distinctive characteristic of ctDNA may help overcome these signal-to-noise obstacles. There is growing evidence that the predominance of ctDNA occurs with a fragment length shorter than ccfDNA from healthy cells. Given the challenges associated with detecting ctDNA derived from early-stage and non-metastatic cancers, differences in fragment length size between ctDNA and ccfDNA may represent an exploitable phenomenon to further move ccfDNA beyond research development and towards a non-invasive diagnostic tool in clinical oncology.
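The input arithmetic in the preceding paragraph can be reproduced with a short calculation. The sketch below is illustrative only: it takes the 7 pg per diploid cell figure and the lossless-library assumption directly from the text, and the function names are hypothetical. Note that the text rounds to ~286 copies per ng before multiplying (giving ~8580 for 30 ng), whereas the unrounded calculation gives ~8571; either way, a 0.05% variant corresponds to roughly 4.3 expected mutant molecules, i.e. fewer than five.

```python
# Values quoted in the text; treated here as fixed assumptions.
PG_PER_DIPLOID_CELL = 7.0      # pg of DNA in one diploid human cell
AUTOSOMAL_COPIES_PER_CELL = 2  # each autosomal position occurs twice per cell


def genome_equivalents(input_ng: float) -> float:
    """Copies of each autosomal position in `input_ng` of human DNA."""
    cells = input_ng * 1000.0 / PG_PER_DIPLOID_CELL  # 1 ng = 1000 pg
    return cells * AUTOSOMAL_COPIES_PER_CELL


def expected_mutant_copies(input_ng: float, allele_frequency: float) -> float:
    """Expected mutant molecules, assuming a lossless library preparation."""
    return genome_equivalents(input_ng) * allele_frequency


if __name__ == "__main__":
    print(round(genome_equivalents(1.0)))                  # copies per ng
    print(round(genome_equivalents(30.0)))                 # copies in a 30-ng input
    print(round(expected_mutant_copies(30.0, 0.0005), 1))  # mutant copies at 0.05%
```

Because real library preparations lose molecules, the expected count is an upper bound on what reaches the sequencer.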

2 The Fragment Length of ctDNA

The fragment length of ctDNA was originally used to discern the type of cell death driving the release of cell-free DNA into the circulation—apoptosis versus necrosis. DNA derived from apoptosis has a characteristic ladder-like distribution with the principal peak at the length of the mononucleosome (~167 bp) and subsequent peaks of decreasing magnitude corresponding to di- and trinucleosomes (Fig. 1b) [144]. Necrosis produces genomic DNA with lengths > 10,000 bp. Initially, there was evidence supporting the origins of ccfDNA from both pathways [145]. Terms such as apoptosis index and DNA integrity index subsequently emerged to characterize PCR-based analyses using ratios of short and long amplicon lengths [146,147,148]. By 2011, there was a preponderance of evidence that ccfDNA originated from apoptosis. The first application of NGS to better characterize ctDNA fragment length occurred in 2015, when Jiang et al. used copy number aberrations in hepatocellular carcinoma to infer the difference in size distributions between ccfDNA and ctDNA [149]. A positive correlation was identified between the tumour DNA fraction in plasma and the proportion of DNA fragments less than 150 bp, suggesting a shorter size length profile for ctDNA compared to the mononucleosome’s ~ 167-bp length associated with normal ccfDNA.

In 2016, two technical aspects of NGS were used to characterize the specific fragment length of ctDNA. First, during library preparation of ccfDNA, mechanical and enzymatic shearing is not used because the predominant component of ccfDNA is found in association with the mononucleosome, which has a median length of ~167 bp (Fig. 1b). Therefore, to prevent generation of short fragments that are difficult to sequence, ccfDNA NGS workflows do not include a DNA shearing step, which is commonly used during preparation of genomic DNA extracted from white blood cells and tumour cells. Second, NGS generally employs a constrained sequencing length (e.g. 100 bp, 125 bp, 150 bp), which means the entire length of the ccfDNA molecule may not be sequenced. Thus, many NGS applications use paired-end sequencing to read the sequence from both ends of the molecule. Once paired reads are aligned to a reference genome, the length of the original DNA molecule is measured. These two components of NGS were combined with rat xenograft models of a brain tumour and hepatocellular carcinoma to precisely measure fragment lengths of normal ccfDNA and ctDNA, which were parsed based on species (rat DNA vs. human DNA, respectively) [150]. Overall, the median fragment length of ctDNA was 134–144 bp, which was ~ 20–30 bp shorter than the median length of 167 bp for rat ccfDNA (Fig. 1c). In addition, smaller peaks along the overall fragment distribution profile of the mononucleosome were present at regular intervals (i.e. ~ 10-bp periodicity). Observations regarding differences in fragment length size and, to a lesser extent, periodicity translated to a patient with widely metastatic melanoma where the principal peak associated with the BRAF p.V600E mutation was ~20 bp shorter than the wild type distribution. Similar differences in fragment size were also observed in lung cancer patients harbouring the common EGFR p.T790M and p.L858R mutations. 
These findings not only solidified observations by Jiang et al. [149], but also echoed findings from several earlier reports. First, Giacona et al. applied a microspreading technique in 1998 to measure double-stranded DNA with transmission electron microscopy to compare strand length between three controls and three patients with adenocarcinoma of the pancreas [151]. The median ccfDNA strand length in cancer patients was 185 nm compared to 273 nm in controls, which is a ~ 30-bp difference (assuming 0.34 nm/bp). Second, in 2010, Lo et al. used NGS to describe characteristics of foetal ccfDNA in the maternal circulation [152]. The principal fragment length of foetal ccfDNA in the maternal circulation was 143 bp, shorter than the principal fragment length of maternal ccfDNA at 166 bp. In addition, the fragment distribution for both foetal and maternal ccfDNA exhibited a 10-bp periodicity. Finally, Zheng et al. showed in 2012 that ccfDNA derived from a transplanted liver exhibited a 10-bp periodicity consisting of significantly shorter fragments than that of the recipient-specific ones [153]. Collectively, these studies established and provided orthogonal support for the narrow difference in fragment size length between ctDNA and ccfDNA. Importantly, multiple studies using a mixture of xenograft models and point mutations have subsequently confirmed the discrete size disparity between ctDNA and ccfDNA [68, 112, 121, 154, 155], which has also been the topic of previous reviews [156]. In addition, the overall size profile in patients with metastatic disease has also been shown to be shorter compared to healthy controls [154], but substantial overlap between patients and controls limits size profiling alone as a clinical diagnostic tool, particularly for early-stage and non-metastatic cancers. 
Although the origins of the 10-bp periodicity seen in the fragmentation pattern of ccfDNA have been linked to both intra- and extracellular processes [157, 158], the origins of the size difference between ctDNA and ccfDNA are unclear. However, there is compelling evidence that chromatin structure or epigenetic changes may have a role [15, 159,160,161] (see [162] for a review of epigenetic modifications associated with ccfDNA). Regardless of aetiology, the focus of this review is exploring prior use of ctDNA fragment length to improve the non-invasive detection of somatic mutations in ccfDNA associated with malignant disease. Both in vitro and in silico methods to isolate subfractions of ccfDNA are reviewed to identify strengths and weaknesses associated with different approaches. In so doing, the goal of this review is to establish the current state of the art in ccfDNA size-selection technology to better understand the necessary steps forward to potentially bring ccfDNA diagnostics to earlier stages of disease and enable precision oncology through detection of patient-specific somatic mutations regardless of disease severity.
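The paired-end measurement described earlier in this section (aligning both mates to a reference genome and taking the span between their outer ends as the length of the original molecule) can be sketched as follows. This is a minimal illustration, not the pipeline used in any cited study: the `AlignedRead` record and both function names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and adapter sequence is assumed to have been trimmed before alignment.

```python
from dataclasses import dataclass
from statistics import median
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical minimal record for one aligned mate of a read pair."""
    chrom: str
    start: int        # 0-based leftmost aligned reference position
    end: int          # reference position one past the last aligned base
    is_reverse: bool  # True if the read aligned to the reverse strand


def fragment_length(mate1: AlignedRead, mate2: AlignedRead) -> Optional[int]:
    """Length of the original molecule, or None for an unusable pair."""
    # Mates must sit on the same chromosome and on opposite strands.
    if mate1.chrom != mate2.chrom or mate1.is_reverse == mate2.is_reverse:
        return None
    forward, reverse = (mate2, mate1) if mate1.is_reverse else (mate1, mate2)
    # The molecule spans from the forward mate's start to the reverse mate's end.
    length = reverse.end - forward.start
    return length if length > 0 else None


def median_fragment_length(
    pairs: Iterable[Tuple[AlignedRead, AlignedRead]]
) -> Optional[float]:
    """Median fragment length over all usable pairs."""
    lengths = [n for n in (fragment_length(a, b) for a, b in pairs) if n is not None]
    return median(lengths) if lengths else None
```

With 100-bp reads, a 167-bp molecule yields mates that overlap in the middle; a molecule shorter than the read length yields mates that cover the same interval, and the same subtraction still recovers its length.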

3 In Vitro Enrichment of ctDNA

In vitro refers to a mechanical or wet laboratory-based technique. Gel-based methods have been the principal in vitro approach for extracting a narrow size profile range. The first description of this occurred in 2016 and used polyacrylamide gel electrophoresis to achieve sufficient resolution to obtain six fractions from the ccfDNA mononucleosome peak [150]. In four lung cancer patients harbouring the EGFR p.T790M mutation, isolation of the ccfDNA fraction 20–50 bp shorter than the principal peak increased the mutant allele frequency (MAF) measured by droplet digital PCR by 2.8-fold to 9.1-fold in 75% of samples (Table 1). Although the size-selection methodology was not scalable to broader applications, the potential for using the size difference to enrich for ctDNA was successfully demonstrated.

Table 1. In vitro size selection for ctDNA detection

In 2018, in vitro size selection was applied to larger sample sizes using benchtop microfluidic devices. Mouliere et al. loaded 8–20 ng of pre-library ccfDNA (i.e. ccfDNA prior to library generation and PCR amplification) into a 3% agarose cassette (HTC3010, Sage Bioscience) and targeted the ccfDNA fraction between 90 and 150 bp for extraction on a PippinHT (Sage Bioscience; Table 1) [68]. Using shallow whole genome sequencing (sWGS) (read depth of 0.5 ×), the trimmed median absolute deviation from copy number neutrality (tMAD; a measure of copy number alterations) increased by a median of 2.1-fold in 47 of 48 plasma samples from 35 patients with high-grade serous ovarian cancer. In a subset of six patients with ctDNA variants detected by whole exome sequencing (WES) (median MAF 9.5%, range 0.5–100%), in vitro size selection increased MAF by a median factor of 4.27-fold. Three weeks after therapy when MAFs were reduced (median MAF 6.5%, range 1.2–48%), the median increase in MAF via in vitro size selection was less, at 1.5-fold, suggesting that in vitro size selection may have reduced benefits at lower MAFs.

Instead of using ccfDNA during size selection prior to library preparation, Hellwig et al. used 1 µg of ccfDNA obtained after library generation and PCR amplification as input into an automated liquid handler (NIMBUS Select, Hamilton, Reno, NV) that incorporated Ranger Technology (Coastal Genomics, Burnaby, BC) for the monitoring and real-time manipulation of electrophoretic mobilities through a 3.0% agarose matrix in a 12-channel cassette (Table 1) [155]. Three ccfDNA fractions centred at ~ 141 bp, ~ 155 bp, and ~ 167 bp were obtained from 13 patients harbouring mutations in a BRAF p.V600, KRAS p.G12, or KRAS p.G13 codon associated with colorectal cancer, melanoma, and pancreatic cancer for comparison to unselected ccfDNA. The median MAF in unselected ccfDNA was 1.1% (range 0.39–12.3%), with MAF ≤ 0.81% in 46.2% of the mutations. Using NGS, an average 2.0-fold increase in MAF was observed in the fraction centred at 141 bp compared to unselected ccfDNA, which was significantly higher than the 1.3-fold gain in the 155-bp fraction and the 0.7-fold reduction in the 167-bp fraction. All NGS results were cross-validated with droplet digital PCR, which similarly confirmed enrichment of ctDNA through in vitro size selection. In addition, the observed enrichment of ctDNA associated with a low MAF (< 1%) suggests that size selection conducted on ccfDNA libraries rather than pre-library ccfDNA may be advantageous by mitigating loss of rare unique molecules through PCR amplification to generate more copies for acquisition.

Most recently, Ishida et al. used solid-phase reversible immobilization (SPRI) beads (SPRIselect beads, Beckman Coulter, CA) to fractionate pre-library ccfDNA to obtain DNA sizes of 100–400 bp (Table 1) [116]. In so doing, the proportion of ccfDNA in the range of 90–150 bp increased from 3.0% in unselected ccfDNA to 25.9% in the size-selected ccfDNA. In 20 patients with colorectal cancer (63% with distant metastasis), the mean MAFs of driver genes as measured by digital PCR increased from 6.8% in unselected ccfDNA to 10.7% in size-selected ccfDNA. Using NGS, a similar increase in mean MAFs was observed. In addition, the average number of mutations with an MAF > 1% detected in size-selected ccfDNA compared to unselected ccfDNA increased from 1.0 to 1.8 per patient. In contrast to gel-based techniques that sought to isolate a subfraction of the mononucleosome, the bead-based approach yielded a size-selected fraction of ccfDNA inclusive of both the mono- and dinucleosomes. Although the extent of ctDNA enrichment was less than the gel-based methods, the study demonstrated that even a marginal shift in the overall size profile of ccfDNA towards shorter fragment lengths has the potential to improve ctDNA detection, which may have implications for high-throughput labs using bead-based methods as part of automated protocols.

Additional publications that use in vitro size selection to enrich for ctDNA were not identified in the literature. However, in vitro size selection has been more extensively investigated in prenatal diagnostics, where foetal ccfDNA in the maternal circulation has a similar fragment length shortening relative to maternal ccfDNA [152]. Thus, several papers are highlighted herein to substantiate the observations described above and to also provide a broader scope of in vitro methods associated with ccfDNA size selection that may have potential ctDNA applications. The in vitro methodologies to enrich for foetal ccfDNA from maternal ccfDNA broadly organize into two categories: (1) isolation of the entire mononucleosome or (2) extraction of the shorter fragments associated with the mononucleosome. Regarding the former category, Hahn et al. described a microsystem to preconcentrate ccfDNA using electrokinetic trapping followed by separation of fragments using ionic mobility to effectively parse ccfDNA with a length < 300 bp from fragments > 500 bp [163]. Yang et al. established a PCR-based enrichment protocol by modifying the amplification reaction conditions [164]. Using an optimized lower denaturing temperature that biased PCR towards amplification of shorter fragments, the foetal fraction was comparably increased to that of removing fragments longer than 300 bp from the maternal ccfDNA sample. A potential advantage of this method is that existing workflows do not require alterations beyond optimization of thermal cycling temperatures. To isolate the short fraction (< 160 bp) of cell-free DNA from the mononucleosome, Hu et al. used custom-designed magnetic beads with an average particle size of 1 µm (for reference, the average size of the industry standard SPRI beads is similarly 1 µm [165]) to enrich for the foetal fraction of ccfDNA [166]. 
Using a bead ratio of 1.5 × for size selection after end-repair and prior to adapter ligation increased the proportion of ccfDNA < 160 bp from ~ 25% in unfractionated ccfDNA to ~85% in fractionated ccfDNA. Foetal ccfDNA was enriched in the fractionated maternal ccfDNA by 1.5- to 4-fold, which improved detection of missed foetal trisomies by 45.5%. Compared to the approach by Ishida et al. described above [116], the method employed by Hu et al. appears to more strongly select for the shorter fraction of the mononucleosome, suggesting that a bead-based approach may be able to more strongly enrich for ctDNA once optimized. Most recently, Welker et al. used gel electrophoresis on 2% agarose cassettes (BluePippin, Sage Bioscience) to isolate short ccfDNA fragments from the mononucleosome with an average length of 140 bp (not including adapters) [167]. Of the 2401 maternal ccfDNA samples tested, 2395 (99.8%) showed an increase in the foetal fraction, with an average increase of 2.3-fold. Samples with a foetal fraction < 4% showed the largest gain, with an average increase of 3.9-fold. Because the general methodology was similar to that used by Mouliere et al. [68] and Hellwig et al. [155] for ctDNA enrichment, an important aspect to recognize in the Welker et al. study [167] is the feasibility of scaling gel-based techniques to meet high-throughput demands.

Overall, there is compelling evidence that in vitro size selection enriches for ctDNA. The brief overview of a few foetal ccfDNA studies supports this conjecture, while also identifying additional in vitro strategies that may facilitate ccfDNA diagnostics in cancer. A key caveat to recognize, however, is the limited enrichment data for MAF < 1%. Moreover, data have not been previously reported for in vitro enrichment of ctDNA associated with an MAF < 0.1%. Thus, additional studies are necessary to determine if the strengths of in vitro fragment size selection to enrich for ctDNA extend to the lower allele frequencies that are more representative of early-stage and non-metastatic disease.

4 In Silico Applications of ctDNA Fragment Length

In silico refers to the computer-based or bioinformatic curation of NGS data. Because paired-end sequencing associated with NGS enables strand length measurements, reads can be sorted based on the length of the original DNA molecules used during library preparations. In 2018, Mouliere et al. applied an in silico size-selection analysis to synthetically enrich for ctDNA by selecting for ccfDNA fragments with lengths of 90–150 bp (Table 2) [68]. In 189 cancer patients and 65 healthy controls, in silico size selection of sWGS data improved the area under the curve (AUC) for discerning cancer patients based on tMAD scores from 0.69 to 0.90. In silico size selection also increased MAF as measured by WES by 2.2-fold in six patients with high-grade serous ovarian carcinoma. In a separate cohort of 16 cancer patients that underwent WES, in silico size selection increased the MAF of ctDNA on average by 1.7-fold and additional mutations were detected in 13 of 16 patients. In 2020, Smith et al. used in silico size selection for ccfDNA fragments between 90 and 150 bp to improve detection of ctDNA associated with renal tumours (Table 2) [115]. In silico size profiling improved somatic copy number alteration (SCNA) detection in ctDNA from 6.3 to 22.9% in 48 patients with renal tumours ranging from benign tumours to metastatic malignancies. In a separate cohort of 43 patients with renal cell carcinoma (41 with metastatic disease) and most with the primary tumour removed (35/43 patients), in silico size selection identified SCNA-based ctDNA in six additional patients, increasing the detection rate from 18.6 to 32.6%. The original MAFs of SCNAs ranged from 4 to 17% and increased on average by 2.2-fold (range 0.9–5.7). Also in 2020, Nygard et al. similarly used in silico size selection for ccfDNA fragments between 90 and 150 bp to improve SCNA-based detection of ctDNA derived from stage III non-small cell lung cancer (Table 2) [168]. 
In 23 samples obtained from six patients undergoing radiation therapy, in silico size selection improved SCNA detection from 22 to 70% of samples. Although it is important to recognize that these three studies largely identified ctDNA associated with MAFs > 1% [68, 115, 168], this MAF range was governed by study design rather than technical limitations of in silico size selection. The studies were either targeting SCNA via sWGS or point mutations detected by WES. Both approaches are associated with read depths that are not amenable to detection of MAFs < 1%. Thus, the use of in silico size selection to improve detection of specific variants in ctDNA at low MAFs remains uncertain. However, these studies collectively provide compelling evidence that integration of size profiling into bioinformatic applications of ctDNA detection may be advantageous.
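The in silico size selection described above amounts to discarding fragments outside a length window and recomputing the metric of interest on what remains. The sketch below shows this for mutant allele frequency using the 90–150 bp window reported in the cited studies. The function names and the `Fragment` tuple are hypothetical, and the toy counts are invented purely to illustrate the mechanics; they are not data from any study.

```python
from typing import List, Optional, Tuple

# (fragment length in bp, whether the fragment carries the mutant allele)
Fragment = Tuple[int, bool]


def size_select(fragments: List[Fragment], min_bp: int = 90, max_bp: int = 150) -> List[Fragment]:
    """Keep only fragments whose length lies inside the inclusive window."""
    return [f for f in fragments if min_bp <= f[0] <= max_bp]


def mutant_allele_frequency(fragments: List[Fragment]) -> Optional[float]:
    """Fraction of fragments at a locus carrying the mutant allele."""
    if not fragments:
        return None
    return sum(1 for _, is_mutant in fragments if is_mutant) / len(fragments)


# Toy counts (illustrative assumption): mutant fragments skew short,
# wild-type fragments cluster near the 167-bp mononucleosome length.
toy_locus: List[Fragment] = (
    [(140, True)] * 6 + [(167, True)] * 4
    + [(140, False)] * 94 + [(167, False)] * 896
)

if __name__ == "__main__":
    unselected = mutant_allele_frequency(toy_locus)
    selected = mutant_allele_frequency(size_select(toy_locus))
    print(unselected, selected, selected / unselected)
```

The trade-off is visible even in the toy example: the mutant fraction rises, but the number of fragments (and therefore mutant molecules) available for analysis falls, which is one reason enrichment at very low allele frequencies remains uncertain.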

Table 2. In silico size-based filtering of insert size for ctDNA detection

As an alternative to the explicit exclusion of fragments outside of a size profile window, in silico approaches have also assigned a weight based on fragment length as part of a probability model for cancer detection. Using sWGS data and tMAD scores, Mouliere et al. applied a random forest algorithm that included the proportion of fragments in defined size ranges to enhance the detection of tumours [68] (Table 3). In 68 patients with high ctDNA cancers, the model was able to distinguish cancer patients from 26 healthy controls with an AUC of 0.994. In 57 patients with low ctDNA cancers, the AUC was 0.914. Notably, using fragmentation features without the tMAD score yielded AUCs of 0.989 and 0.891, respectively, suggesting that fragment size was the principal contributor to the performance of the original model. Wan et al. used fragment length as part of an in silico error reduction strategy named INtegration of VAriant Reads (INVAR), which weights mutant reads across all patient-specific mutation loci based on the empirical distribution of mutant fragments in all other samples in the cohort being studied to give a size range enriched in cancer greater weight (Table 3) [121]. Using a tumour-informed search to guide ctDNA detection, INVAR achieved a median AUC of 0.80 (range 0.64–0.92) in early-stage disease and previously difficult-to-detect cancers such as stage I–III non-small cell lung cancer, stage I–II breast cancer, and renal and brain tumours, and after surgery for stage II–III melanoma. INVAR has also been applied by Smith et al. to detect renal tumours (Table 3) [115]. Using a tumour-informed search, ctDNA was detected in 12 of 22 patients (54.5%) with renal tumours ranging from benign to metastatic. In a search without a priori knowledge of somatic tumour mutations, INVAR detected ctDNA in 18.6% of patients with largely advanced renal cell carcinoma.
Finally, Chabon et al. similarly integrated fragment length into a noise reduction strategy, an enhanced version of CAPP-Seq [37], to detect ctDNA derived from stage I–III non-small cell lung cancer (Table 3) [112]. Using a tumour-informed search, ctDNA was detected in 45% (38 of 85) of patients. In a subset of the 85-patient cohort where ctDNA was not detected, patient-specific panels derived from WES of the tumour were generated, which led to new ctDNA detection in 10 of 17 patients (59%). In a subsequent step, ccfDNA fragment length was integrated into a machine learning algorithm to screen for ctDNA indicative of non-small cell lung cancer in high-risk patients. At 98% specificity, sensitivity for detecting stage I–II non-small cell lung cancer via ctDNA in 41 patients was ~30%. Collectively, these studies demonstrate the promise of integrating ctDNA fragment length into probability-based models to detect early-stage and difficult-to-detect cancers. Importantly, there are two key aspects of these studies that merit additional consideration within the context of ccfDNA fragment length. First, ccfDNA fragment length was only one component in the overall ctDNA detection strategy. For example, Chabon et al. used a multi-tiered approach to estimate the probability that a ccfDNA mutation was tumour derived using background frequencies, ccfDNA fragment size, the gene affected, and likelihood of clonal haematopoiesis of indeterminate potential [112]. Second, the studies used fragment length to enhance tumour detection rather than mutation detection. For example, Wan et al. observed in nine of the 11 stage II–III melanoma patients where INVAR detected ctDNA that the integrated MAF determined from multiple loci was below the 95% limit of detection for a “perfect” single-locus assay based on the ccfDNA input [121], indicating that insufficient read depth and signal were available to confidently determine presence/absence of each individually targeted mutation in ctDNA.
As such, additional studies are needed to determine if machine-learning algorithms using fragment length profiling can be further harnessed to improve detection of individual somatic mutations in ctDNA.
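The general idea of weighting mutant reads by fragment length, rather than discarding reads outside a window, can be sketched as follows. This is not the INVAR algorithm or the model used by any cited study. It is a simplified illustration that assumes a per-bin weight equal to the smoothed ratio of how often mutant versus wild-type fragments fall in each size bin in a reference set of other samples. The bin width, pseudocount, function names, and toy lengths are all assumptions.

```python
from collections import Counter

BIN_WIDTH_BP = 10  # assumed bin width, chosen only for illustration


def size_bin(length_bp):
    """Map a fragment length to its size bin index."""
    return length_bp // BIN_WIDTH_BP


def size_weights(mutant_lengths, wildtype_lengths, pseudocount=1.0):
    """Weight per size bin = smoothed P(bin | mutant) / P(bin | wild-type).

    Both inputs are fragment lengths pooled from *other* samples, so the
    weights are not learned from the sample being tested.
    """
    mut = Counter(size_bin(x) for x in mutant_lengths)
    wt = Counter(size_bin(x) for x in wildtype_lengths)
    bins = set(mut) | set(wt)
    mut_total = sum(mut.values()) + pseudocount * len(bins)
    wt_total = sum(wt.values()) + pseudocount * len(bins)
    return {
        b: ((mut[b] + pseudocount) / mut_total) / ((wt[b] + pseudocount) / wt_total)
        for b in bins
    }


def weighted_mutant_signal(sample_mutant_lengths, weights, default_weight=1.0):
    """Sum the size-based weights of the mutant reads seen in one sample."""
    return sum(weights.get(size_bin(x), default_weight)
               for x in sample_mutant_lengths)


# Toy reference lengths (invented): mutant fragments skew shorter.
reference_mutant = [135, 138, 142, 145, 166]
reference_wildtype = [165, 166, 168, 170, 172, 141]
weights = size_weights(reference_mutant, reference_wildtype)

short_signal = weighted_mutant_signal([136, 144], weights)
long_signal = weighted_mutant_signal([166, 171], weights)
print(f"two short mutant reads: {short_signal:.2f}")
print(f"two long mutant reads:  {long_signal:.2f}")
```

Under these assumptions, two mutant reads in a size range enriched for ctDNA contribute more evidence than two mutant reads at sizes typical of normal ccfDNA, and no read is excluded outright.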

Table 3. In silico size-based weighting of potential somatic mutations for ctDNA and tumour detection

In silico size selection offers a multitude of opportunities to analyse ccfDNA using hard filters, probability modelling, or both. An advantage of in silico size analysis is the integration of fragment length with other features of ccfDNA and/or other biomarkers of cancer [169] in a lossless manner to optimize ctDNA and tumour detection. Relative to ctDNA enrichment for somatic mutation detection, however, the caveats associated with in silico size selection are identical to those introduced with in vitro enrichment: there are limited data available for MAFs < 1%. Although in silico size selection has been used during detection of foetal ccfDNA in the maternal circulation [170, 171], the approaches are similar to those described for ctDNA, and the foetal fraction is generally > 1%, with few exceptions. As such, additional insights are unlikely to be gleaned from reviewing the foetal ccfDNA literature. Accordingly, future studies devoted to using in silico size selection for enrichment of ctDNA associated with early-stage and non-metastatic cancers are needed to better understand the potential impact on mutation-based ctDNA detection and to further improve the non-invasive genotyping of tumours.

5 In Vitro Versus in Silico ccfDNA Size Selection, Which Is Better?

Both in vitro and in silico approaches to enrich for ctDNA present strengths and challenges. In a direct comparison between methods, Mouliere et al. found that in vitro size selection afforded a higher degree of ctDNA enrichment compared to in silico size selection (mean increase in MAF of 4.19-fold vs. 2.2-fold, respectively) [68]. However, several factors regarding ctDNA fragment size within the context of in vitro size selection warrant careful review. Although ctDNA generally aggregates around a shorter distribution of fragment lengths (< 150 bp), longer fragments outside the targeted range for enrichment may also be present. For example, Hellwig et al. found the fraction of isolated ccfDNA fragments with a median length of ~ 167 bp to be associated with a reduced MAF rather than absent ctDNA, indicating that ctDNA is not exclusive to fragments < 150 bp [155]. In addition, the size difference between ctDNA and normal ccfDNA has been shown to extend beyond the mononucleosome to the dinucleosome, where a second cluster of enriched ctDNA may occur at 230–310 bp [112, 121, 172]. Finally, Chabon et al. found that in silico hard filtering based on ctDNA-enriched size windows disproportionately favoured variants with higher pre-enrichment MAFs [112]. As such, using in vitro size selection to exclude molecules > 150 bp has the potential to adversely affect sensitivity [173], particularly for ctDNA with a low allele frequency where accrual of DNA losses during sample preparation for NGS may be compounded by the in vitro size-selection process.

Whereas in vitro size selection may lead to irretrievable loss of rare molecules, in silico size selection allows a lossless size-based analysis. In addition, a multitude of size analyses using hard filters and/or probability modelling can be conducted iteratively to optimize ctDNA detection that may be specific to each type and severity of cancer. In combination with bioinformatic strategies to reduce error [174], the additional potential boost in ctDNA signal via in vitro size selection to overcome NGS-associated noise may not be necessary. A potential weakness of in silico size selection is that application of algorithms may lead to fewer samples available for analysis and/or reduced read depth. For example, Smith et al. found that seven of 29 samples (24.1%) were excluded as technical failures because there were insufficient reads after size selection to meet the requirement of the INVAR algorithm [115]. However, these effects can likely be anticipated and mitigated by using a higher number of total paired reads for each sample during sequencing.

Heretofore, in vitro size selection has been discussed within the context of ctDNA enrichment. However, an in vitro approach may also have beneficial bioinformatic effects that merit consideration. Specifically, Hellwig et al. showed that in vitro size selection reduced sample complexity, leading to the generation of larger family sizes without compromising ctDNA detection [155]. During library preparation, amplicons of each unique DNA molecule are generated. Subsequently, molecular barcodes are used to identify amplicons derived from the same original DNA molecule. During in silico analysis of NGS data, amplicons with the same molecular barcode are grouped into a family and used to generate a single consensus sequence that represents the original unique DNA molecule. Family size is the number of amplicons used to generate a consensus sequence and larger family sizes correspond to larger error reduction. Thus, in vitro size selection has the potential to enrich for ctDNA, while concomitantly reducing NGS error.
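The family and consensus concept described above can be sketched in a few lines. This is a deliberately simplified illustration, not the pipeline of any cited study. It assumes reads arrive as `(barcode, sequence)` pairs, that sequences within a family are already aligned and of equal length, and that the consensus is a per-position majority vote. The function names and toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def build_families(reads):
    """Group (barcode, sequence) pairs into families keyed by molecular barcode."""
    families = defaultdict(list)
    for barcode, sequence in reads:
        families[barcode].append(sequence)
    return dict(families)


def consensus(sequences):
    """Per-position majority vote across one family.

    Assumes equal-length, already aligned sequences. Ties are resolved in
    favour of the base seen first, which is an arbitrary choice made here.
    """
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


# Toy reads (invented): the third "AAT" read carries an error at position 3.
reads = [
    ("AAT", "ACGT"),
    ("AAT", "ACGT"),
    ("AAT", "ACTT"),
    ("GCC", "ACGA"),
]

families = build_families(reads)
for barcode, members in families.items():
    print(barcode, "family size:", len(members), "consensus:", consensus(members))
```

In this sketch the family of size 3 outvotes the single discordant base, whereas the family of size 1 has no redundancy with which to correct an error, which is why larger family sizes are associated with greater error reduction.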

Currently, there is no strong evidence to support using one approach over the other. In vitro size selection may enable enrichment of ctDNA while also yielding fewer NGS-associated errors, but at the cost of losing rare ctDNA molecules, a loss that may offset the desired enrichment effect. In silico size selection allows a multitude of lossless bioinformatic analyses, but at the cost of potentially reduced ctDNA enrichment that may be insufficient to overcome NGS-associated noise. Finally, and perhaps most importantly, there are insufficient data at very low allele frequencies to adequately determine which approach will best position ccfDNA diagnostics to improve non-invasive precision oncology. Until further data become available, study goals will need to continue to dictate integration of in vitro or in silico approaches into experimental designs.

6 Additional Observations in ctDNA Fragmentation

Although this review has largely focused on the discrete size difference between ctDNA and ccfDNA relative to the mono- and dinucleosomes, other studies have identified additional size-based differences between ctDNA and ccfDNA that should be considered to provide a more complete overview of ctDNA fragment length. In 2016, studies using single-stranded DNA library preparation found that ccfDNA shorter than 100 nt was more accessible than with the more commonly used double-stranded DNA library generation approach, because the latter requires size-selective steps to remove adapter dimers, which also tend to eliminate shorter ccfDNA fragments [15, 175]. In 2018, Sanchez et al. used single-stranded DNA library preparation in a small cohort of cancer patients to show highly fragmented ccfDNA with a 10 nt periodicity and a size range of 41 to 167 nt. In contrast, double-stranded DNA library preparation yielded the same periodicity, but it was observed only from 81 to 166 bp [176]. Subsequently, Liu et al. used single-stranded library preparation to detect ctDNA present in stage II–III pancreatic ductal adenocarcinoma patients with paired tumour DNA and ccfDNA [177]. In 13 patients, they found evidence of corresponding ctDNA in ~ 69% of patients, with the detected variants being either KRAS p.G12D or p.G12C (median allele frequency of 0.12%; range 0.05–0.56%). Overall, ~ 30% of mutations present in tumour tissue with an MAF > 1% were present in ccfDNA, all of which were associated with common KRAS mutations. However, the proportion of patients with pancreatic cancer in whom ctDNA was detected was similar to previous reports [44, 45]. In addition, Moser et al. reported that single-stranded DNA library preparation enriched for smaller DNA fragments, but did not preferentially enrich for ctDNA [178].
As such, more direct comparisons between single- and double-stranded DNA library generation inclusive of different preparation methods for both are necessary to determine if the additional steps involved in single-stranded library generation are warranted.

In contrast to previous studies that have shown ctDNA is shorter than ccfDNA, a report in 2019 found that ctDNA fragment length may include both shorter and longer fragments relative to ccfDNA. Cristiano et al. used whole genome sequencing at 1–2× coverage in 208 cancer patients and analysed the ratio of small (100–150 bp) to large (151–220 bp) ccfDNA fragments to evaluate the fragmentation profile in 504 windows of 5 Mb, covering approximately 2.6 Gb of the genome [160]. They found that the ratio indicated both longer and shorter fractions associated with cancer. Moreover, the size profile indicated tissue of origin. Across a range of stages and cancer types, detection of cancer ranged between 68 and 79% at a 98% specificity. The origins of the variation of fragment lengths within the same patient are unclear. The authors hypothesized that the fragmentation pattern in patients with cancer likely resulted from mixtures of nucleosomal DNA from blood and neoplastic cells that reflected both epigenomic and genomic alterations. Although genomic monitoring may not identify patient-specific somatic mutations, broad surveillance of cancer across multiple types of malignancy may be feasible by analysing the fragmentation profile across the entire genome.
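The windowed short-to-long ratio described above can be sketched as follows. This is only an illustration of the ratio itself. It assumes fragments are available as `(chromosome, start, length)` tuples, uses the 5 Mb window and the 100–150 bp and 151–220 bp size ranges quoted in the text, and does not reproduce any normalization or classification steps from the cited study. The function name and toy fragments are hypothetical.

```python
from collections import defaultdict

WINDOW_BP = 5_000_000   # 5 Mb windows, as quoted in the text
SHORT_RANGE = (100, 150)
LONG_RANGE = (151, 220)


def short_long_ratio_by_window(fragments):
    """Return {(chrom, window_index): short_count / long_count}.

    `fragments` is an iterable of (chrom, start_bp, length_bp) tuples.
    Windows with no long fragments are omitted to avoid dividing by zero.
    """
    counts = defaultdict(lambda: [0, 0])  # [short, long] per window
    for chrom, start_bp, length_bp in fragments:
        key = (chrom, start_bp // WINDOW_BP)
        if SHORT_RANGE[0] <= length_bp <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length_bp <= LONG_RANGE[1]:
            counts[key][1] += 1
    return {
        key: short / long_
        for key, (short, long_) in counts.items()
        if long_ > 0
    }


# Toy fragments (invented): the 300 bp fragment falls in neither size range.
toy_fragments = [
    ("chr1", 1_000_000, 120),
    ("chr1", 2_000_000, 167),
    ("chr1", 3_000_000, 170),
    ("chr1", 6_000_000, 140),
    ("chr1", 7_000_000, 145),
    ("chr1", 8_000_000, 180),
    ("chr1", 9_000_000, 300),
]

ratios = short_long_ratio_by_window(toy_fragments)
for window, ratio in sorted(ratios.items()):
    print(window, ratio)
```

A genome-wide vector of such per-window ratios is the kind of profile that could then be compared between individuals, although how that comparison is performed is outside the scope of this sketch.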

Finally, viral genomes appear to have an alternate size profile after integration into somatic cells that may enable viral-driven malignancies to be discerned from an active or concurrent viral infection. Lam et al. studied the size profile of Epstein-Barr virus (EBV) with and without the occurrence of nasopharyngeal carcinoma (NPC) [179]. They observed that EBV DNA in plasma from non-NPC patients was distributed over shorter fragment lengths compared to NPC patients, leading to a higher proportion of EBV DNA molecules shorter than 110 bp in non-NPC patients. Subsequently, the proportion of EBV DNA within 80–110 bp relative to autosomal DNA within 80–110 bp was used to discern EBV-positive patients at a higher risk of harbouring NPC. These observations regarding the viral genome fragmentation pattern may have significant implications for detecting a variety of virus-associated cancers such as human papillomavirus-associated squamous cell carcinoma [180].
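One possible reading of the size-based EBV metric described above is sketched below. It assumes the metric is the fraction of EBV fragments within 80–110 bp divided by the fraction of autosomal fragments within the same range. The exact definition and any decision threshold used in the cited study are not given in the text, so the function names and toy lengths are assumptions for illustration only.

```python
def fraction_in_range(lengths_bp, min_bp=80, max_bp=110):
    """Fraction of fragments whose length lies in [min_bp, max_bp]."""
    lengths_bp = list(lengths_bp)
    if not lengths_bp:
        raise ValueError("no fragments supplied")
    in_range = sum(min_bp <= x <= max_bp for x in lengths_bp)
    return in_range / len(lengths_bp)


def ebv_size_ratio(ebv_lengths_bp, autosomal_lengths_bp):
    """EBV short-fragment fraction relative to the autosomal short-fragment fraction."""
    autosomal_fraction = fraction_in_range(autosomal_lengths_bp)
    if autosomal_fraction == 0:
        raise ValueError("no autosomal fragments within 80-110 bp")
    return fraction_in_range(ebv_lengths_bp) / autosomal_fraction


# Toy fragment lengths (invented for illustration).
autosomal = [100, 166, 167, 168, 170, 150, 160, 165, 172, 180]
ebv_non_npc = [85, 95, 100, 105, 150, 160]   # skewed towards short fragments
ebv_npc = [90, 150, 160, 165, 170, 175]      # fewer short fragments

ratio_non_npc = ebv_size_ratio(ebv_non_npc, autosomal)
ratio_npc = ebv_size_ratio(ebv_npc, autosomal)
print(f"non-NPC size ratio: {ratio_non_npc:.2f}")
print(f"NPC size ratio:     {ratio_npc:.2f}")
```

Normalizing to the autosomal fraction in the same sample is a design choice that would, in principle, reduce sensitivity to sample-to-sample differences in overall fragmentation.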

7 Conclusions

Over the past decade, tremendous growth in the development of ccfDNA as a diagnostic tool in oncology has occurred. Challenges associated with the amount of ccfDNA available in plasma (i.e. signal) and errors associated with NGS (i.e. noise) have driven significant advances in the state of the art. Leveraging differences in fragment size between ctDNA and healthy ccfDNA may be one avenue of innovation that further advances clinical applications in oncology. Both in vitro and in silico data for ctDNA enrichment and detection are encouraging. Overall, however, there is a paucity of data to support definitive conclusions regarding utility in clinical diagnostics, particularly for early-stage and non-metastatic cancers. Many key questions persist. For example, does in vitro enrichment for ctDNA adversely affect detection of ultra-low-frequency ctDNA because of losses that may occur during sample recovery? Can in silico probability estimates of ctDNA based on fragment size be used to discover patient-specific somatic mutations without a tumour-informed search? Is a combination of in vitro and in silico methods warranted? Future studies that address these questions, among others, would provide the necessary evidence to establish fragment length as a key component in ccfDNA diagnostics. Additionally, greater inclusion of healthy control data into experimental designs may provide better support for determination of assay specificity, especially as ccfDNA diagnostics advances towards harder-to-detect cancers. Finally, investigations into the mechanisms that govern the size difference between ctDNA and ccfDNA may not only prove meaningful for detection but also lead to new therapeutic strategies. Regardless, the fragmentation pattern of ctDNA is an exploitable biologic phenomenon with the potential to enhance ctDNA detection in clinical oncology, which may lead to an advanced era of non-invasive, personalized cancer diagnostics to improve patient outcomes.