Key Points

Long cfDNA exists in the plasma of healthy subjects, pregnant women, and cancer patients.

Efforts have been made to understand the fragmentomic, epigenetic, and tissue-of-origin information of long cfDNA, with demonstrated potential clinical utilities in prenatal testing and oncology.

The high cost and low throughput of long-read sequencing are limitations that need to be addressed before the clinical potential of long cfDNA analysis can be fully realized.

1 Introduction

The analysis of cell-free DNA (cfDNA) in bodily fluids for disease diagnosis and monitoring has been gaining importance. The detection of fetal chromosomal aneuploidies based on maternal plasma cfDNA analysis has revolutionized the field of prenatal testing [1,2,3,4]. The detection of circulating tumor DNA has been used to guide targeted therapy for lung cancer [5]. However, most of the existing research efforts have been focused on short cfDNA molecules (e.g., ≤ 600 bp). This can be explained by the fact that most of the studies utilized massively parallel short-read sequencing technologies, for example, the Illumina sequencing platform, which have limitations in detecting long DNA molecules (Fig. 1). For instance, the bridge amplification in the Illumina sequencing technology favors the amplification of short DNA molecules over that of longer DNA molecules [6]. Also, long DNA molecules would generate more diffuse clusters during bridge amplification, thereby lowering signal intensities [6, 7].

Fig. 1

Cell-free DNA molecules consist of both short and long molecules. In second-generation sequencing, e.g., Illumina sequencing, only the short DNA molecules ≤ 600 bp are analyzed. For epigenetic analysis using second-generation sequencing platforms, additional steps such as bisulfite treatment or enzymatic conversion are necessary before sequencing. In contrast, third-generation sequencing platforms, e.g., single-molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) and nanopore sequencing by Oxford Nanopore Technologies (ONT), can sequence long DNA molecules and perform direct methylation analysis

The development of third-generation sequencing platforms has sparked interest in long-read sequencing. Single-molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) and nanopore sequencing by Oxford Nanopore Technologies (ONT), which can generate significantly longer reads than next-generation sequencing [8], are key players in the field (Fig. 1). The Telomere-to-Telomere Consortium (T2T) utilized both SMRT sequencing and nanopore sequencing to fill in the gaps of the human reference genome [9]. Because of the potential of resolving long repeats and complex structural variants, long-read sequencing was named the “Method of the Year” by Nature Methods in 2022 [10].

Recently, using single-molecule sequencing, two studies have demonstrated the presence of long cfDNA in the plasma of pregnant women [11] and cancer patients [12], respectively. This has opened a new avenue of long cfDNA-based liquid biopsy. With the development of novel approaches, such as the holistic kinetic model [13], which allows direct DNA methylation analysis without a prior chemical or enzymatic conversion step, it is now possible to obtain the genetic and fragmentomic features, as well as the epigenetic characteristics [14], of long cfDNA. Compared to short cfDNA, long cfDNA generally contains more CpG and SNP sites (Fig. 2). As previously reported, among long cfDNA molecules in the size range of 1–3 kb, more than 80% carried at least seven CpG sites, whereas among shorter cfDNA molecules in the size range of 200–600 bp and below 200 bp, fewer than 20% and 5% carried at least seven CpG sites, respectively [12] (Fig. 2A). The increased number of CpG sites associated with long cfDNA molecules was found to enhance the performance of an approach for deducing the tissue-of-origin of individual cfDNA molecules based on an analysis of their methylation patterns at the single-molecule level (Fig. 2B). Moreover, based on computer simulations, long cfDNA molecules in the size range of 1–3 kb had a sixfold higher percentage of molecules with at least one informative SNP site for differentiating fetal- and maternal-derived cfDNA than short cfDNA molecules in the size range of 50–600 bp (Fig. 2C). As a result, a smaller number of cfDNA molecules would need to be sequenced to achieve the same coverage of the fetal genome (Fig. 2D). These properties have tremendous potential for clinical implementation in oncology and noninvasive prenatal testing (NIPT). Table 1 compares the characteristics, sequencing methods and limitations of short cfDNA and long cfDNA.
In this article, we review the current efforts that have been made in prenatal and cancer testing and probe into the future of long cfDNA analysis.

Fig. 2

Advantages of the analysis of long cfDNA. A Long cfDNA molecules contain a larger number of CpG sites than short cfDNA molecules. B The performance of a classifier for differentiating the tissue-of-origin (e.g., placenta vs. buffy coat) of individual cfDNA molecules based on the analysis of single-molecule methylation pattern of cfDNA improves when cfDNA molecules of longer sizes are used. C There is a higher percentage of molecules containing an informative SNP site for differentiating fetal- and maternal-derived cfDNA molecules among long cfDNA compared to short cfDNA. D A smaller number of long cfDNA molecules will need to be sequenced to attain the same coverage of the fetal genome compared to short cfDNA

Table 1 Comparison between short cell-free (cf) DNA and long cfDNA

2 Long Cell-Free DNA Fragmentomics

Recent studies have shown that, with the use of SMRT sequencing, about 10–20% of long cfDNA molecules of > 1 kb were analyzable in the plasma [11, 12]. In the high-resolution size profiles of cfDNA obtained from SMRT sequencing, besides the well-described mononucleosomal and dinucleosomal DNA peaks observable in previous studies using short-read sequencing, an additional series of peaks at multiples of nucleosomal units, extending to molecules of 2 kb in size, could be observed. Such ladder patterns suggest that apoptosis might be an important mechanism for the generation of long cfDNA.
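The ladder pattern described above can be illustrated with a minimal Python sketch that assigns fragment lengths to the nearest multiple of a nucleosomal unit and reports the fraction of long molecules. The unit length of 170 bp, the tolerance of 40 bp, and the function names are illustrative assumptions and are not values taken from the cited studies.

```python
from collections import Counter

NUCLEOSOME_UNIT_BP = 170  # assumed repeat length, for illustration only


def ladder_profile(fragment_sizes, unit=NUCLEOSOME_UNIT_BP, tolerance=40):
    """Count fragments lying within `tolerance` bp of each multiple of `unit`.

    Returns a dict mapping the nucleosome multiple (1, 2, 3, ...) to a count.
    """
    counts = Counter()
    for size in fragment_sizes:
        n = max(1, round(size / unit))  # nearest nucleosome multiple
        if abs(size - n * unit) <= tolerance:
            counts[n] += 1
    return dict(sorted(counts.items()))


def long_fraction(fragment_sizes, threshold_bp=1000):
    """Fraction of fragments longer than `threshold_bp`."""
    sizes = list(fragment_sizes)
    if not sizes:
        return 0.0
    return sum(s > threshold_bp for s in sizes) / len(sizes)
```

Applied to real data, the input would be the list of aligned fragment lengths from a sequencing run; a profile with counts at many consecutive multiples would be consistent with the nucleosomal ladder.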

It is well known that both the cell-free fetal DNA in maternal plasma and the tumor-derived DNA in plasma of cancer patients showed a shorter modal size of 143 bp compared to their background DNA (maternal-derived DNA and non-tumoral DNA) [15, 16]. With the analysis of long cfDNA, such a conclusion could now be generalized to a broader size spectrum of up to 3 kb [11, 12]. In other words, both fetal fraction and tumor fraction were negatively correlated with cfDNA size. Nevertheless, long cfDNA from fetal and tumoral origin could be detected in the respective plasma samples. Thus far, the longest detected fetal- and tumor-derived DNA molecules are 24 kb and 14 kb, respectively [11, 12]. Thus, the use of long-read sequencing for cfDNA analysis has extended the analyzable size spectrum of cfDNA.

The use of long-read sequencing also allows the DNA sequences at fragment ends to be deduced [11, 12]. While short cfDNA fragments of < 500 bp predominantly end with C, an increasing proportion of A ends has been observed in longer cfDNA fragments of > 500 bp. Previously, biological links between fragment end characteristics and activities of various nucleases have been established [17]. The enrichment of 5′ A end motifs in long cfDNA fragments suggests that their generation might be related to the nuclease DNA fragmentation factor subunit β (DFFB).
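The comparison of end nucleotides between short and long fragments can be sketched as follows. This is a simplified illustration, not the analysis pipeline of the cited studies: the function name is hypothetical, each fragment is assumed to be supplied as a sequence read 5′ to 3′, and fragments of exactly 500 bp are arbitrarily assigned to the long group.

```python
from collections import Counter, defaultdict


def end_base_proportions(fragments, size_cutoff=500):
    """Return {group: {base: proportion}} for the 5' end base of each fragment.

    `fragments` is an iterable of DNA sequences read 5' to 3'.
    Fragments are split into 'short' (< cutoff) and 'long' (>= cutoff).
    """
    counts = defaultdict(Counter)
    for seq in fragments:
        if not seq:
            continue  # ignore empty records
        group = "short" if len(seq) < size_cutoff else "long"
        counts[group][seq[0].upper()] += 1
    return {
        group: {base: n / sum(c.values()) for base, n in c.items()}
        for group, c in counts.items()
    }
```

A higher proportion of A in the long group than in the short group would reproduce the trend described above.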

3 Single-Molecule Methylation Pattern Analysis of Long cfDNA

Based on distinct methylation patterns among different tissues, numerous methods have been developed for tracing the tissue origins of cfDNA [18,19,20,21]. Most studies have employed an approach that uses combined methylation signals from bulk collections of cfDNA molecules that have been mapped to tissue-specific marker regions to deduce the corresponding tissue contribution to cfDNA in a sample [18, 19]. These studies have provided data demonstrating the relative contributions of different tissues to cfDNA in plasma in various physiologic or pathologic conditions.

Another approach is to use methylation signals on individual cfDNA molecules to deduce their respective tissues of origin [11, 12, 21]. In fact, both short-read sequencing [21] and long-read sequencing [11, 12] allow the methylation status of each CpG site on a cfDNA molecule to be determined. However, due to the size detection limit of short-read sequencing and the fact that there is on average only one CpG site in every 100 nucleotides, most cfDNA fragments that can be analyzed by a short-read sequencing platform (i.e., mostly in the size range of 150–400 bp) would probably contain only one to four CpG sites on average. The reference tissue methylomes against which the methylation patterns of individual cfDNA molecules are compared can take the form of either population-based average methylation levels at individual CpG sites or single molecule-based methylation patterns. While the former could be readily obtained from methylation data generated by short-read bisulfite sequencing or DNA methylation microarray analysis, the latter would require data generated from high-depth long-read sequencing of genomic DNA from different tissues.
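The arithmetic behind the CpG estimate above can be made explicit with a short Python sketch. It assumes the uniform average density of one CpG site per 100 nucleotides quoted in the text; real CpG density varies considerably across the genome, and the function names are illustrative.

```python
def expected_cpg_sites(fragment_length_bp, cpg_per_100_nt=1.0):
    """Expected number of CpG sites on a fragment at a uniform average density."""
    return fragment_length_bp * cpg_per_100_nt / 100.0


def count_cpg_sites(sequence):
    """Count CpG dinucleotides on one strand of an actual sequence."""
    return sequence.upper().count("CG")


# Short-read range (150-400 bp) versus long cfDNA (1-3 kb)
for length in (150, 400, 1000, 3000):
    print(length, "bp:", expected_cpg_sites(length), "expected CpG sites")
```

Under this assumption, fragments of 150–400 bp carry about 1.5 to 4 CpG sites, whereas fragments of 1–3 kb carry about 10 to 30.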

Taking advantage of the increased number of CpG sites associated with long cfDNA molecules, as well as the ability of a long-read sequencing platform to simultaneously analyze long cfDNA molecules and their associated methylation, Yu et al. developed an approach for classifying individual cfDNA molecules in maternal plasma samples as being derived from the placenta or the buffy coat [11]. The classification compares the single-molecule methylation pattern of each cfDNA molecule with the reference methylation profiles of placenta and buffy coat obtained from high-depth bisulfite sequencing. This approach achieved an area under the receiver operating characteristic curve (AUC) of 0.88. A similar approach was used by Choy et al. to deduce the tissue origin of individual cfDNA molecules and to quantify liver-derived cfDNA in plasma samples from healthy individuals, patients with chronic hepatitis B virus (HBV) infection, and patients with hepatocellular carcinoma (HCC) [12].

Based on a computer simulation analysis, it was found that the performance of such tissue-of-origin analysis would improve with an increasing number of informative CpG sites on individual cfDNA molecules [11]. Hence, single-molecule methylation pattern analysis enhances the resolution of the tissue-of-origin analysis, with the use of long cfDNA allowing a more accurate tissue-of-origin determination than the use of short cfDNA.
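The principle of comparing a single-molecule methylation pattern with two reference methylation profiles can be sketched as a log-likelihood ratio. This is a simplified illustration that treats CpG sites as independent; it is not the scoring scheme used in the cited studies [11, 12], and the function names, the data layout, and the clipping value `eps` are assumptions.

```python
import math


def placenta_log_likelihood_ratio(molecule, placenta_ref, buffy_ref, eps=0.01):
    """Log-likelihood ratio (placenta vs. buffy coat) for one cfDNA molecule.

    molecule     : dict mapping CpG position -> 1 (methylated) or 0 (unmethylated)
    placenta_ref : dict mapping CpG position -> reference methylation level (0-1)
    buffy_ref    : same layout, for buffy coat
    """
    llr = 0.0
    for pos, state in molecule.items():
        if pos not in placenta_ref or pos not in buffy_ref:
            continue  # skip CpG sites lacking reference data
        # Clip reference levels away from 0 and 1 to avoid log(0)
        p = min(max(placenta_ref[pos], eps), 1 - eps)
        b = min(max(buffy_ref[pos], eps), 1 - eps)
        llr += math.log(p / b) if state else math.log((1 - p) / (1 - b))
    return llr


def classify(molecule, placenta_ref, buffy_ref):
    """Assign a molecule to the tissue with the higher likelihood."""
    llr = placenta_log_likelihood_ratio(molecule, placenta_ref, buffy_ref)
    if llr > 0:
        return "placenta"
    if llr < 0:
        return "buffy coat"
    return "undetermined"
```

Because each informative CpG site contributes one term to the sum, molecules carrying more CpG sites tend to yield scores further from zero, which is consistent with the improvement reported for longer molecules.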

4 Analytical Platforms for Long cfDNA Analysis

There are two predominant long-read sequencing platforms, namely SMRT sequencing from PacBio and nanopore sequencing from ONT, which can be used for long cfDNA analysis [22]. Based on the analysis of artificial mixtures of sonicated human and mouse genomic DNA of different sizes (1500 bp vs. 200 bp) at a molar ratio of 1:1, researchers found that both the PacBio and ONT platforms showed a bias towards sequencing longer DNA fragments, with the bias being stronger for PacBio than for ONT (a fivefold vs. a twofold over-representation of long fragments) [22].

Table 2 provides a summary of the comparison between PacBio and ONT for long cfDNA analysis [22].

Table 2 Comparison between PacBio and ONT for long cell-free (cf) DNA analysis

5 Potential Clinical Utilities of Long cfDNA Analysis

5.1 Size-Based Detection of Pre-eclampsia

Pre-eclampsia is a pregnancy complication associated with increased maternal and neonatal morbidity and mortality. Both the absolute amounts of total cfDNA and fetal cfDNA have been reported to be elevated in pregnancies with pre-eclampsia [23, 24]. In addition to such quantitative differences, a recent study by Yu et al. revealed that there was a significant reduction in the proportion of long cfDNA molecules in pregnancies with pre-eclampsia [11]. Therefore, a classifier that was based on the percentage of long cfDNA in a maternal plasma sample for differentiating pregnancies with and without pre-eclampsia has been built. Interestingly, such a cfDNA size-based classifier was found to perform much better when a long-read sequencing platform was used (AUC: 1 vs. 0.7 when a short-read sequencing platform was used), possibly related to an extended spectrum of cfDNA size that was analyzable by a long-read sequencing platform. Nevertheless, future studies are needed to validate these findings and to investigate the predictive power of such a cfDNA size-based biomarker for pre-eclampsia before the onset of clinical symptoms.
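A size-based classifier of this kind can be sketched in Python as a per-sample percentage of long cfDNA, evaluated with a rank-based AUC. The 1 kb threshold for defining long molecules, the function names, and the example values are illustrative assumptions; the cited study [11] should be consulted for the actual definition and data.

```python
def percent_long(fragment_sizes, threshold_bp=1000):
    """Percentage of cfDNA molecules in a sample longer than `threshold_bp`."""
    sizes = list(fragment_sizes)
    if not sizes:
        raise ValueError("no fragments supplied")
    return 100.0 * sum(s > threshold_bp for s in sizes) / len(sizes)


def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as 0.5. Equivalent to the area under the ROC curve.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-sample percentages of long cfDNA (not real data)
pre_eclampsia = [5.0, 6.0, 4.0]
controls = [12.0, 15.0, 10.0]

# A lower percentage suggests pre-eclampsia, so the negated value is the score
print(auc([-x for x in pre_eclampsia], [-x for x in controls]))
```

Negating the percentage simply orients the score so that higher values indicate pre-eclampsia, in line with the reported reduction in long cfDNA.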

5.2 Single-Molecule Methylation-Based Analysis for the Noninvasive Prenatal Testing of Monogenic Diseases

A major technical challenge for the noninvasive prenatal testing (NIPT) of monogenic diseases has been the assessment of maternally inherited mutations of the fetus because of the mother’s own contribution of the mutant alleles to the maternal plasma. To overcome this challenge, strategies including relative mutation dosage analysis (RMD) [25] and relative haplotype dosage analysis (RHDO) [15], which determine the maternal inheritance of the fetus by detecting the slight allelic and haplotype dosage imbalance, respectively, in the maternal plasma, have been developed.

As an alternative to these quantitative approaches, Yu et al. recently described the principle of a qualitative approach for determining both maternal and paternal inheritance of the fetus based on the aforementioned single-molecule tissue-of-origin analysis of long cfDNA [11]. The principle is that if a cfDNA molecule that carried a paternal- or maternal-specific genetic alteration, such as an SNP or a disease-causing mutation, was identified as being derived from the placenta based on the methylation pattern analysis, such genetic alteration would be considered as being inherited by the fetus.

In a proof-of-concept study, Yu et al. demonstrated the feasibility of deducing the maternal inheritance of the fetus on a chromosome-arm level based on the single-molecule tissue-of-origin analysis of long cfDNA [11]. Of note, as the accuracy of the current version of the tissue-of-origin analysis and the sequencing output of the current long-read sequencing platforms are far from sufficient for the realization of such a qualitative approach, a quantitative approach comparing the number of cfDNA molecules being determined to be of placental origin between the two maternal haplotypes has been employed. Furthermore, this analysis was performed with cfDNA molecules that were pooled from 28 maternal plasma samples due to the limited number of cfDNA molecules that were associated with informative SNPs in each sample. Nevertheless, the approach was shown to achieve a classification rate of 90% and an accuracy of 100%. Finally, Yu et al. demonstrated the feasibility of NIPT for fragile X syndrome and the detection of a recombination event on chromosome X in a pregnancy involving a male fetus who was at risk of fragile X syndrome [11].
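The quantitative comparison between the two maternal haplotypes can be sketched as a count of placenta-classified molecules per haplotype. This is a minimal illustration only: the haplotype labels, the minimum count, and the ratio threshold are hypothetical and are not the decision rule used by Yu et al. [11].

```python
from collections import Counter


def infer_maternal_inheritance(molecules, min_ratio=2.0, min_count=5):
    """Deduce which maternal haplotype the fetus inherited in one region.

    molecules : iterable of (haplotype, tissue) tuples, where haplotype is
                "Hap I" or "Hap II" and tissue is the single-molecule
                tissue-of-origin call ("placenta" or "buffy coat").
    Returns "Hap I", "Hap II", or "unclassified".
    """
    placental = Counter(h for h, tissue in molecules if tissue == "placenta")
    a, b = placental["Hap I"], placental["Hap II"]
    if a + b < min_count:
        return "unclassified"  # too few informative placental molecules
    if a >= min_ratio * max(b, 1):
        return "Hap I"
    if b >= min_ratio * max(a, 1):
        return "Hap II"
    return "unclassified"  # no clear imbalance between haplotypes
```

Regions returned as unclassified lower the classification rate without affecting the accuracy of the regions that are called, which is why the two figures are reported separately.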

5.3 Single-Molecule Methylation-Based Analysis for Cancer Detection

In another proof-of-concept study, Choy et al. applied the single-molecule tissue-of-origin analysis to cancer detection [12]. A scoring system, which was named the HCC methylation score, was developed to assess the likelihood for a patient to have HCC, with a higher score indicating a higher likelihood. It was shown that HCC patients have significantly higher HCC methylation scores when compared to HBV carriers and healthy individuals. Choy et al. showed that the use of long cfDNA molecules with at least seven CpG sites further improved the discriminating power of the HCC methylation score when compared to the use of shorter cfDNA molecules with one to six CpG sites, with the AUC improved from 0.75 to 0.91.

In addition to the use of single-molecule methylation-based analysis alone for cancer detection, one can combine this methylation-based analysis of long cfDNA with the detection of cancer-associated somatic mutation to improve the diagnostic specificity for cancer detection. In the detection of cancer-associated somatic mutations from cfDNA, clonal hematopoiesis, which is an age-related phenomenon due to the clonal expansion of blood cells carrying somatic mutations, can lead to false-positive identification of cancer-associated mutations [26]. The analysis of long cfDNA in this scenario will be potentially useful. Due to its longer length, it is more likely for a long cfDNA molecule to carry a mutation together with multiple CpG sites. Combining the mutation information with the methylation pattern information offered by multiple CpG sites in long cfDNA can potentially reduce false-positive results. For example, if a cfDNA molecule contains both the mutation and the methylation pattern that is suggestive of tumoral origin instead of hematopoietic origin, then one can be more confident that the mutation identified is from the tumor, rather than the result of clonal hematopoiesis.

5.4 Detection of Structural Variants and Repeat Expansions

Structural variants (SVs), which are defined as DNA rearrangements of at least 50 bp [27], and repeat expansions are known causes of monogenic diseases [28]. They are also associated with complex disorders such as cancers [29, 30]. The detection of SVs and repeat expansions is valuable for disease diagnosis, monitoring, or treatment stratification [29, 30]. However, the detection of SVs and the accurate determination of the number of repeats from cfDNA remain technically challenging with limited sensitivity [31], which can be partly attributed to the analysis of only the short cfDNA using the short-read sequencing technology. Such short cfDNA molecules of ≤ 600 bp in length rarely span the entire length of the SV or the repeat regions, resulting in some ambiguity during sequence alignment [27]. We believe that the analysis of long cfDNA with the use of long-read sequencing technologies would likely reduce alignment ambiguity and improve the coverage of repetitive regions, thereby improving the detection of SVs and repeat expansions.

6 Conclusion and Future Perspectives

Currently, the Achilles’ heel of long cfDNA analysis is low throughput. The long-read sequencing platforms have lower throughput than second-generation sequencing platforms. Depending on the sequencing platform used, long-read sequencing platforms can generate up to 300 gigabases of data per flow cell, whereas the second-generation sequencing platforms can generate up to 3 terabases of data per flow cell. Such a relatively low throughput for current long-read sequencing platforms hinders the immediate clinical application of long cfDNA analysis. For instance, in the study by Yu et al., due to the inadequate number of plasma DNA molecules with informative SNPs in individual samples, a chromosome-arm-level analysis with pooled data from multiple maternal plasma samples had to be employed for the deduction of maternal inheritance of the fetus [11]. In addition to low throughput, the high cost of long-read sequencing also hinders its widespread use in resource-limited settings.

Despite the current limitations, there have been ongoing efforts to improve the hardware of long-read sequencing. Recently, PacBio launched a new sequencer called Revio, with the number of zero-mode waveguides in each SMRT cell increased from 8 million to 25 million [32], which may increase the throughput. Also, ONT has released a new sequencing kit (Kit 14), with raw-read accuracy at 99.6% [33].

In addition to hardware, efforts have been made in refining the software of long-read sequencing. Recently, partnering with Google, PacBio has introduced DeepConsensus, which utilized an alignment-based loss to train a gap-aware transformer-encoder for sequence correction [34]. DeepConsensus has been claimed to reduce read errors by 42% when compared to the standard approach based on a hidden Markov model [34].

In addition to single-molecule sequencing technologies (e.g., SMRT sequencing or nanopore sequencing), which sequence long DNA molecules directly without the need for PCR amplification, new players using the synthetic long-read sequencing approach are also emerging. Examples include the LoopSeq synthetic long read by Element Biosciences and the Complete Long-Read technology by Illumina. In general, the synthetic long-read sequencing approach involves labelling short fragments from individual long DNA molecules with unique barcode sequences, amplifying and then sequencing the barcoded short fragments with a short-read sequencing platform, and finally computationally reconstructing long reads from short sequencing reads. While the use of a short-read sequencing platform may potentially reduce the cost, long cfDNA analysis using the synthetic long-read sequencing approach may suffer from the potential loss of accurate fragmentomic information, such as the size and end motif signatures, as a result of the reassembling of long reads from barcoded short sequencing reads. Also, current synthetic long-read sequencing technologies do not support direct DNA methylation analysis, which may impede the tissue-of-origin analysis of cfDNA. Thus, these limitations may prevent the immediate application of the synthetic long-read sequencing approach for the analysis of long cfDNA. Nevertheless, the long-read sequencing field is actively working towards improving sequencing throughput with lower cost and higher accuracy, which may ultimately benefit cfDNA analysis in the future.

As we gradually learn more about long cfDNA, further studies into the preanalytical factors affecting long cfDNA analysis would be valuable. Certain preanalytical factors, such as specimen collection tubes, blood sample storage conditions, blood-processing protocols, and DNA extraction methods, might affect long and short DNA molecules to different extents. A detailed and thorough understanding of the preanalytical aspects would be necessary before clinical implementation of long cell-free DNA-based liquid biopsy to ensure consistent clinical result interpretation. With continuous efforts, it is hoped that long cfDNA analysis can enhance the spectrum of diagnostic applications of liquid biopsy.