Introduction

Since the beginning of the Human Genome Project in 1990, there has been a close pairing between technological innovation driving science and science demanding technological innovation. This drive led to next-generation, short-read sequencing methods dominating the field of nucleic acid sequencing (reviewed in ref. 1). However, short-read sequencing is fundamentally limited in read length (<1000 bp reported1) owing to cycle dephasing and the resulting drops in read quality over length2,3. By contrast, single-molecule sequencing methods, especially platforms from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), are not subject to this limitation and allow for the sequencing of long reads (>10 kb). Perhaps the most important difference between these platforms is that PacBio performs sequencing-by-synthesis whereas ONT uses a protein nanopore to characterize the molecule through electrolytic current modulation4. Though both technologies had initial issues with read accuracy (PacBio continuous long read accuracy 85–89%5; ONT R6 accuracy 67%6) and yield (PacBio RS II ~500–1000 Mb; ONT R6 yield ~250 Mb), these features have improved substantially over the past eight years. Both technologies can now achieve impressive accuracies — ~98% for ONT and 99% for PacBio4,7 — and an ONT PromethION device can generate in excess of 100 Gb per flow cell, whereas a PacBio Sequel II HiFi run can generate over 30 Gb4. These output levels put the cost per Gb of PacBio (US$65) and ONT (US$17) sequencing closer to that of short-read instruments such as the Illumina NovaSeq 6000 (US$6) (Supplementary Note).

Long reads have already changed the landscape of genomics, expanding our knowledge by exploring areas that were previously unattainable with short reads. Long reads allow for more complete genome assemblies8, highlighted by their use in the assembly of the first telomere-to-telomere human genome9. Many more structural variants and repetitive areas can be probed with long reads because of their ability to map through the variant10,11, leading to the use of long-read sequencing for surveying structural variants in human populations12,13. Single-molecule sequencing even allows for native measurement of DNA methylation14, including in previously inaccessible regions such as centromeres15,16. Aside from DNA, long reads have also been used to explore RNA, providing information about full-length transcript isoforms including allele-specific expression, poly(A) tail length and RNA modifications17,18,19.

The increasing accuracy and affordability of single-molecule, long-read sequencing have resulted in the accelerated development of methods that apply it to new problems in biology. Here, we review a selection of emerging methods and applications using commercially available single-molecule platforms. First, we review methods used for targeted sequencing of long reads, which harness the advantages of long-read sequencing without the need for whole-genome sequencing, thereby improving coverage and affordability. Next, we focus on assays for mapping protein–DNA interactions, which, in addition to ascertaining information already revealed by short reads, also provide previously unknown insights into genome organization. Last, we cover the sequencing of short reads with single-molecule platforms, a suite of methods that seek to increase the accessibility of sequencing and the amount of information that can be gained from a single sequencing run.

Insights without whole-genome sequencing

Costs for whole-genome sequencing have dropped substantially during the past decade, but even at the lower cost there are biological questions for which focused, high-depth sequencing is needed. For example, somatic variant calling and epigenetic sequencing of heterogeneous samples require high sequencing depth to enable low-frequency variants or rare epigenetic states to be measured with confidence. Alternatively, when sequencing large sample sets such as complex disease cohorts, cost per sample becomes an important factor. In these scenarios, depth or sample number may be more important than unbiased genome-wide analysis, so targeting specific regions can drive down cost. Specific regions of interest (for example, promoters or exons of protein-coding genes) can be selectively targeted for sequencing. Such targeted sequencing methods, including PCR amplicon sequencing and hybridization capture, have been extensively used in concert with short-read sequencing. These same methods have been adapted for long-read sequencing, and novel methods have emerged that take advantage of the PacBio and ONT platforms.

PCR enrichment

PCR enrichment, also known as amplicon sequencing, allows for targeted sequencing by simply designing primers flanking regions of interest. PCR enrichment is a mature method with low DNA input requirements and low hands-on time, which enables multiplexing of as many as 24,000 amplicons in one reaction with carefully designed commercial primer panels (Ion AmpliSeq assays20). Overlapping amplicons can be tiled across regions much longer than the amplicon length, with a recent example targeting genomic regions >40 kb21. PCR enrichment can be adapted to long-read sequencing (Fig. 1a) owing in part to the commercial availability of DNA polymerases that can amplify amplicons greater than 10 kb22,23. However, as the length of an amplicon increases, PCR becomes less efficient and requires optimization for each new reaction24. Amplicons greater than 7 kb and long amplicons with high GC content are difficult to consistently amplify25. PCR can also introduce errors (mainly substitutions)25, which can be an issue when probing rare mutations26. Amplifying DNA with PCR erases native DNA modifications, eliminating one of the key advantages of single-molecule platforms (Table 1). Notably, amplicon approaches often require sets of primers to be split into multiple pools owing to possible interactions between primer pairs, thus requiring multiple, optimized PCRs. This makes scaling PCR amplicons to multiple regions difficult. This is especially true for schemes that attempt to tile overlapping amplicons across large regions, as demonstrated in peer-reviewed and preprint studies21,27. Despite these caveats, amplicon sequencing has been used with ONT to detect structural variant frequency in genes frequently mutated in pancreatic cancer (CDKN2A and SMAD4)28 and with PacBio to identify disease-causing variants in a gene frequently mutated in autosomal-dominant polycystic kidney disease (PKD1)29. 
Outside human genetics, as demonstrated in both peer-reviewed30 and preprint27 articles, tiled amplicons have been used for low-cost, portable, infectious disease outbreak monitoring with ONT for a host of viruses including Zika30, Ebola31 and SARS-CoV-227, underscoring the utility of this method (Table 1).
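
The tiling scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `tile_amplicons`, its parameters and the alternating two-pool layout are assumptions made for this example and are not taken from the cited studies.

```python
def tile_amplicons(region_start, region_end, amplicon_len, overlap):
    """Return (start, end, pool) tuples for amplicons tiled across a region.

    Coordinates are 0-based and half-open. Adjacent amplicons overlap by
    `overlap` bp and alternate between two primer pools (an assumed design
    choice) so that overlapping amplicons are never in the same reaction.
    """
    if region_end <= region_start:
        raise ValueError("region_end must be greater than region_start")
    if not 0 <= overlap < amplicon_len:
        raise ValueError("overlap must be >= 0 and smaller than amplicon_len")

    step = amplicon_len - overlap
    tiles = []
    start = region_start
    index = 0
    while True:
        # The final amplicon is truncated at the region boundary.
        end = min(start + amplicon_len, region_end)
        tiles.append((start, end, 1 + index % 2))
        if end >= region_end:
            break
        start += step
        index += 1
    return tiles
```

For a hypothetical 10-kb region covered with 4-kb amplicons overlapping by 500 bp, the function returns three amplicons split across two pools. Real primer design would also need to account for primer interactions, GC content and amplification efficiency, none of which is modelled here.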

Fig. 1: Long-read targeted sequencing methods.

Long-read targeted enrichment methods fall within broad categories including PCR enrichment, hybridization capture, Cas-mediated enrichment and adaptive sampling. a, PCR enrichment uses specific primers to amplify regions of interest before library preparation. b, Hybridization capture uses biotinylated antisense probes designed against regions of interest to isolate DNA fragments containing the targets. PCR and hybridization capture enrichment methods are both commonly used with short-read sequencing and have been adapted to long-read sequencing. c, Cas-mediated enrichment uses Cas ribonucleoprotein complexes (most commonly Cas9) to cut on either side of regions of interest. Cut fragments are selectively sequenced owing to preferential adapter ligation to the freshly cut ends55. Targeted fragments can be further enriched through depletion of off-target fragments56,57,58. d, Enrichment using adaptive sampling is a nanopore sequencing method in which regions of interest are selectively sequenced by controlling the voltage at individual pores to eject unwanted fragments. ONT, Oxford Nanopore Technologies.

Table 1 Summary of long-read DNA enrichment methods

Hybridization capture sequencing

Hybridization capture sequencing uses tagged, antisense oligonucleotide probes against regions of interest. Genomic DNA is denatured using a combination of heat and chemical methods, probes are hybridized against it, probe-bound DNA is captured and unbound DNA is washed away32 (Fig. 1b). This method can be more easily scaled than PCR amplicons and often only requires one reaction, though probes are expensive and the resulting on-target rate tends to be lower (Table 1). Hybridization capture probes can also be used to enrich across large, contiguous target regions (for example, ~750,000 bp33) by tiling probes across the region in one reaction. Multiple separate locations are easily targeted — exemplified by a study targeting 4800 genes simultaneously with nanopore sequencing (Table 1), even though reads were only ~1,000 bp34. Though long-read hybridization capture methods have been applied successfully even in human cohorts to resolve complex structural variants leading to disease35,36,37,38, they have key limitations (Table 1). The lengths of sequenced fragments are typically shorter than those in the original library, suggesting bias towards shorter fragments38. This observation has been consistent across long-read hybridization capture experiments37,39,40,41,42 and is attributed to the hybridization capture step41. We and others have found large fragments more difficult to capture, with the most efficient capture size found to be about 5 kb43,44,45. As with PCR amplicons, amplification (pre-capture or post-capture) can lead to errors in reads; for example, errors in AT-rich regions led to gaps in assembled haplotypes of a complex genomic region containing the natural killer-cell immunoglobulin-like receptor (KIR) gene family46. Hybridization capture is often a lengthy protocol (often >3 days42) independent of the long-read platform used — though automation and high throughput (96 samples) are possible with liquid-handling robotics. 
Despite these limitations, hybridization capture can produce deep on-target coverage with one study reporting 1099-fold enrichment from a single run on an ONT MinION device37.

Cas-mediated enrichment

Though powerful, amplicon sequencing and hybridization capture have key limitations in read length and maintenance of modification state: to fully capitalize on the potential of single-molecule targeted sequencing, methods need to be designed from the ground up with this in mind. A bacterial defence system, clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (Cas) proteins, though primarily used for genome editing47, can be adapted to enrich long fragments (Table 1). In Cas-mediated enrichment, the CRISPR–Cas system is used to induce double-stranded breaks flanking the regions of interest, which produces long fragments with ends amenable to downstream applications. Initially used to clone large fragments48,49, Cas9-assisted targeting of chromosome segments (CATCH) was adapted so that the cut fragments were instead gel-isolated by size and sequenced on an ONT MinION flow cell, achieving ~25–70× mean coverage tiled across a 200-kb region encompassing the hereditary cancer gene BRCA1 (ref. 50). Unfortunately, so little DNA was recovered after gel isolation that amplification was required, removing native DNA modifications and resulting in read lengths less than 5 kb50.

Subsequent methods have instead used preferential ligation at freshly cut sites flanking the regions of interest to remove the size selection step and have been used with both PacBio51,52 and ONT sequencing53,54,55,56,57,58. Typically in these approaches, Cas cleavage occurs before library preparation and the first step is to passivate existing DNA ends by dephosphorylating them, which prevents random ligation. DNA is then cut by a Cas protein–guide RNA complex, either on one side or flanking a region of interest, to create 5′ phosphorylated ends. Sequencing adapters are then ligated to the freshly cut and phosphorylated sites to enable selective sequencing of fragments containing the area of interest (Fig. 1c). Exemplifying this strategy is nanopore Cas9-targeted sequencing (nCATS), which achieved up to 1,000× coverage at loci on an ONT MinION sequencer55. However, without multiplexing, only a fraction of the flow cell capacity is used in this method because of the low molarity of resulting library molecules55 (Table 1). Furthermore, this method seems to work best when two cut sites are generated. Additionally, obtaining read lengths greater than 50 kb was difficult, which may be attributed to the isolation of fragmented DNA during purification55. This affects the ability to obtain single reads that span larger regions.

Additional methods have been developed in an attempt to improve upon these caveats. For example, the affinity-based Cas9-mediated enrichment method (ACME) removes non-target fragments (increasing the molarity of library molecules) via bead-based pulldown of a His-tagged Cas9, which remains bound to non-target fragments after cutting56. Data presented in a preprint article demonstrated that ACME excelled in enriching for single reads spanning the entire length of large target regions (~100 kb)56. Cas-mediated enrichment has also been demonstrated on a completed PacBio sequencing library. As presented in a preprint article from 2017, a special capture adapter can be ligated to cut sites after Cas-mediated digestion51, allowing for a bead-based pull-down enrichment approach similar to ACME. This optimized PacBio approach was able to achieve 9% on-target reads, greater than reported with ACME (<1%)51,56. Alternatively, exonucleases can be used to digest off-target fragments, as in Cas9-based background elimination (CaBagE)57, Negative Enrichment58 and PacBio No-Amp52. These exonuclease-based methods can produce high coverage at target loci (~400× for small targets) with a high percentage of reads spanning the entire target region57. Furthermore, as shown in both published and preprint work, the size of target regions can be increased by tiling guide RNAs across a region59,60, similar to tiling methods used with PCR amplicons or hybridization capture. By using a pool of in vitro transcribed guide RNAs tiled across the region, a recent preprint study demonstrated the ability to enrich reads across a region as large as 9 Mb60.

Adaptive sampling

All the methods mentioned above include additional molecular biology steps involving targeted probes, primers or guide RNAs, which can add time and cost. An enrichment approach that does not include additional manipulations makes single-molecule targeted sequencing more accessible. Nanopore sequencing offers a unique opportunity in this regard — as the molecule is sequenced, a decision can be made to eject the molecule by flipping the voltage if the data do not match a database of targets, a process called adaptive sampling (Fig. 1d). Initially, adaptive sampling was implemented by matching the real-time electrical signal to a reference genome using dynamic time warping with the ‘Read Until’ approach61, but was limited to small reference genomes. As a result, improved algorithms for mapping electrical signal were developed62,63,64,65,66,67, exemplified by UNCALLED, which demonstrated real-time enrichment of 148 human cancer genes with an average coverage of ~30× (5.5-fold enrichment over non-enriched) using an ONT MinION flow cell62 (Fig. 1d). Alternatively, improvements to the speed of the basecaller enabled the development of tools that align basecalled reads against a reference to decide whether or not a molecule should be sequenced68,69,70. These tools are exemplified by readfish, which demonstrated enrichment of the genomic sequence of ~700 genes associated with human cancer (~30× mean coverage)68. A version of these sequence-based methods has been directly incorporated into the ONT sequencing software (MinKNOW), making it easy for end-users to employ.
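
The accept-or-eject decision at the heart of sequence-based adaptive sampling can be illustrated with a minimal sketch. It is not the Read Until API, readfish or MinKNOW code: `build_index` and `decide` are hypothetical names, the alignment of the read prefix is assumed to be supplied by an external basecaller and aligner, and target intervals are assumed to be non-overlapping.

```python
from bisect import bisect_right


def build_index(targets):
    """Sort target intervals per contig: {contig: [(start, end), ...]}.

    Intervals are 0-based, half-open and assumed not to overlap.
    """
    return {contig: sorted(intervals) for contig, intervals in targets.items()}


def decide(alignment, index):
    """Decide the fate of a molecule from the alignment of its first bases.

    alignment: (contig, position), or None if the prefix is not yet mapped.
    Returns "continue" (keep sequencing), "eject" (reverse the voltage)
    or "wait" (collect more data before deciding).
    """
    if alignment is None:
        return "wait"
    contig, pos = alignment
    intervals = index.get(contig, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
        return "continue"
    return "eject"
```

In practice the decision must be returned within a fraction of a second for each molecule, which is why the speed of the basecalling and mapping steps, rather than this lookup, determines whether enrichment is achieved.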

Compared to other methods, adaptive sampling can target large regions of interest without additional expense or optimization of primers, probes or guide RNAs. Even entire human chromosomes can be targeted68, which can be ideal for biological questions such as exploring putative X chromosome-linked disorders. However, in order to achieve enrichment, sequenced fragments must be a sufficient length (>5 kb)71; the longer the ‘rejected’ molecule, the more time is saved by not sequencing it and hence the higher the enrichment of ‘accepted’ sequences. Best results are typically achieved for fragment sizes >10 kb62,68,72. Samples with damaged DNA (for example, formalin-fixed, paraffin-embedded tissue) typically have DNA lengths below this threshold, which may hinder their use with adaptive sampling. Finally, targeting either too low a percentage (<1%) or too high a percentage (>10%) of the genome will also lead to less enrichment: if too much time or not enough time is spent rejecting molecules, the resulting on-target sequence yield will not be sufficient.
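
The dependence of enrichment on fragment length and target size can be made concrete with a toy model. Everything in it is an assumption made for illustration: each pore spends a fixed capture delay between molecules, on-target molecules are read in full, off-target molecules are read only for a fixed number of bases before ejection, and all molecules have the same length. The default sequencing speed, decision length and capture delay are placeholder values, not measurements from the cited studies.

```python
def expected_enrichment(read_len, target_fraction, speed=400.0,
                        decision_bases=500.0, capture_delay=1.0):
    """Fold increase in on-target bases per unit pore time (toy model).

    read_len:        fragment length in bases (all fragments equal)
    target_fraction: fraction of molecules that are on target (0-1)
    speed:           bases sequenced per second (assumed value)
    decision_bases:  bases read before an off-target molecule is ejected
    capture_delay:   seconds a pore waits between molecules (assumed value)
    """
    # Pore time spent on a molecule that is read to completion.
    on_time = capture_delay + read_len / speed
    # Pore time spent on a molecule that is ejected after the decision.
    off_time = capture_delay + min(decision_bases, read_len) / speed
    # Mean pore time per molecule when off-target molecules are ejected.
    mean_time = target_fraction * on_time + (1 - target_fraction) * off_time
    # Without ejection, every molecule costs on_time.
    return on_time / mean_time
```

Under these assumptions, longer fragments and smaller targets give higher fold enrichment, in line with the reasoning above. The model does not capture pore blockage or the loss of absolute on-target yield when very small targets are chosen, so it should not be used to predict real run performance.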

Though easy to use, adaptive sampling methods result in lower coverage and a lower percentage of on-target reads than other enrichment methods (Table 1). Encouragingly, data presented in a recent preprint article demonstrated that readfish multiplexed sequencing on the ONT PromethION flow cell yielded 25–50× coverage for three human samples (5–6× enrichment over theoretical whole-genome sequencing), further reducing cost and indicating that higher depth is achievable72. Currently, adaptive sampling requires relatively substantial computational resources, including access to graphics processing units (NVIDIA 2060 series or better with CUDA capability) or powerful central processing units to achieve the analysis speed needed for enrichment. Finally, pores become inactive more quickly during adaptive sampling than during standard nanopore sequencing runs, possibly owing to DNA blockages62. Maximum output can be achieved by performing a nuclease flush of the flow cells to remove blockages and a reload of the flow cell with fresh library62,68,72, but this increases the amount of DNA, reagents and hands-on time required for these experiments.

Additional methods

There are other approaches for long-read enrichment that do not fit into the above categories. For example, Xdrop partitions long DNA molecules into droplets with locus-specific primers, followed by droplet digital PCR. Droplets containing the loci of interest are isolated with flow sorting, and DNA is amplified73. This amplified DNA can then be sequenced with short-read or long-read platforms. This method requires a specialized microfluidic apparatus whereas the methods described above need only standard molecular biology tools.

Mapping protein–DNA interactions

For decades, researchers have tried to understand not just the sequence of DNA, but how DNA is organized within the nucleus and how that organization affects cellular function, development, gene regulation and disease (reviewed in ref. 74). State-of-the-art genomics methods including microarrays and next-generation sequencing have been leveraged to study chromatin state and protein–DNA binding (reviewed in refs. 75,76,77), even down to the single-cell level (reviewed in ref. 78). Most of these assays rely on PCR enrichment for states of interest (such as open chromatin or bound protein), requiring input controls to correct for PCR bias and thereby making quantification difficult. These methods also typically fragment the DNA to small sizes to provide resolution, making it impossible to study the coordination of chromatin states at adjacent loci on the same single molecule of DNA. Short reads also make it difficult to assign reads to haplotypes given the infrequency of variants on short fragments. As emphasized above, PCR erases native DNA modifications, making additional steps necessary in order to measure methylation and protein–DNA interactions or chromatin state simultaneously79,80,81.

Specific short-read methods using methyltransferase footprinting have set the stage for long-read approaches to explore protein–DNA binding. Emerging from the observation that methyltransferase enzymes preferentially label accessible DNA82, methyltransferase footprinting assays were developed to measure nucleosome positioning and protein–DNA interactions83,84,85,86. Such assays can even determine protein binding through the protection from labelling; though the identity of the protein is not known, it can be inferred from the size of the protected areas (nucleosomes) or motifs in the protected areas87. Chemical bisulfite conversion of unmethylated bases followed by next-generation sequencing allowed these footprinting assays to be applied to panels of promoters88, to genome-wide footprinting89, and down to single molecules with short reads90. These methods have now been combined with single-molecule platforms to begin to probe unknown aspects of gene regulation (Fig. 2).
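
The inference of protein binding from protection can be sketched as a simple footprint caller that operates on one read at a time. This is an illustrative sketch under stated assumptions: `call_footprints` is a hypothetical function, the input is assumed to be a per-read list of labellable sites with binary methylation calls, and the 120-bp size threshold for a nucleosome-sized footprint is a placeholder value rather than one taken from the cited assays.

```python
def call_footprints(sites, nucleosome_min=120):
    """Call protected footprints on a single read.

    sites: list of (position, is_methylated) for every labellable site on
           the read, sorted by position.
    Returns a list of (start, end, label), where start and end are the
    methylated positions that bound a run of unmethylated sites. Runs that
    touch either end of the read are skipped because their size is unknown.
    """
    footprints = []
    last_methylated = None  # position of the most recent methylated site
    run_open = False        # True while inside a run of unmethylated sites
    for pos, is_methylated in sites:
        if is_methylated:
            if run_open and last_methylated is not None:
                size = pos - last_methylated
                label = "nucleosome-sized" if size >= nucleosome_min else "small"
                footprints.append((last_methylated, pos, label))
            last_methylated = pos
            run_open = False
        else:
            run_open = True
    return footprints
```

Because labelling is never complete, a short run of unmethylated sites may reflect missed labelling rather than a bound protein, so a practical caller would need a statistical model of labelling efficiency in addition to this size-based rule.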

Fig. 2: Long-read, single-molecule methyltransferase footprinting methods can reveal heterogeneity and coordination of chromatin states.

a, In methyltransferase footprinting assays, a methyltransferase enzyme deposits exogenous methylation on accessible DNA, which may include linker DNA between histones, open chromatin regions or regions surrounding transcription factors bound to DNA. b, When this exogenous labelling is performed on long, single molecules, the heterogeneity of nucleosome positioning, open or closed chromatin and protein–DNA binding can be measured on single molecules. c, With long molecules that span multiple regulatory elements, the coordination between adjacent sites can be measured, potentially revealing unknown aspects of gene regulation. d, Antibody-directed methyltransferase labelling builds on methyltransferase footprinting by concentrating labelling around binding sites of specific proteins. The methyltransferase is fused to protein A, protein G or both, which bind to IgG antibodies.

Measuring chromatin accessibility with methyltransferase footprinting

Three methods have been developed that combine 5-methylcytosine (5mC) labelling with ONT sequencing to assay nucleosome positioning and open chromatin (Table 2). Two methods focused on yeast: one measured nucleosome positioning, with methyltransferase treatment followed by single-molecule long-read sequencing (MeSMLR-seq) using the GpC methyltransferase M.CviPI91; the other measured nucleosome occupancy via DNA methylation and high-throughput sequencing (ODM-seq) using both M.CviPI and the CpG methyltransferase M.SssI92. These methods were shown to correlate well with micrococcal nuclease (MNase) digestion sequencing (MNase-seq), a classic method for measuring nucleosome positioning. Using MeSMLR-seq data, over 300 inferred nucleosomes were phased on a single read and it was found that the number of molecules with open chromatin at a given promoter correlates with the expression of its corresponding gene91. ODM-seq estimated the number of nucleosomes across the entire genome in a yeast cell and quantified protein binding in nucleosome-free regions92. Methyltransferase footprinting has also been applied to human samples. Nanopore sequencing of nucleosome occupancy and methylome (nanoNOMe), adapted from NOMe-seq89, used M.CviPI to simultaneously call accessible chromatin (GC 5mC) and native CpG methylation, allowing for footprinting of proteins bound to DNA in bulk and on single reads93. NanoNOMe made use of the advantages of long reads by exploring chromatin state in repetitive elements and phasing reads to measure allele-specific chromatin accessibility and CpG methylation93. In particular, nanoNOMe was able to quantitatively examine protein binding at known motifs, such as CTCF sites, by examining the inferred footprint at these locations. Unsurprisingly, this revealed that traditional chromatin immunoprecipitation followed by sequencing (ChIP–seq) methods are semi-quantitative and that a ChIP–seq peak can represent a large range of fractional binding states. 
Later work combining nanoNOMe with Cas-mediated enrichment for higher depth found that different CTCF-binding sites have very different percentages of reads (5–70%) supporting CTCF binding94.
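
The fractional binding described above can be estimated by counting, among reads that cover a motif, those on which the motif is protected from labelling. The sketch below is a hypothetical illustration, not the analysis used in the cited studies: `fraction_bound`, the per-read dictionary format and the 10-bp flank are assumptions made for this example.

```python
def fraction_bound(reads, motif_start, motif_end, flank=10):
    """Estimate the fraction of reads with a protected motif.

    reads: list of dicts, one per read, mapping position -> is_methylated
           for each labellable site on that read.
    A read is informative if it has at least one labellable site within
    the motif plus `flank` bp on each side, and counts as bound if every
    such site is unmethylated (protected).
    Returns None when no read is informative.
    """
    lo, hi = motif_start - flank, motif_end + flank
    informative = 0
    bound = 0
    for calls in reads:
        window = [meth for pos, meth in calls.items() if lo <= pos < hi]
        if not window:
            continue  # no labellable site near the motif on this read
        informative += 1
        if not any(window):
            bound += 1
    return bound / informative if informative else None
```

Because each read is counted once and no enrichment step is involved, the result is a direct per-molecule fraction rather than a relative peak height, although a protected motif could also reflect occupancy by a nucleosome or another protein.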

Table 2 Summary of long-read footprinting assays

The absence of recognition motifs for these 5mC methyltransferases can limit their ability to label some parts of the genome, such as AT-rich regions. Thus, other methods have leveraged N6-methyladenine (m6dA, also known as 6mA) methyltransferases (Table 2) for labelling, as m6dA is either absent from or present only at low levels in the genomes of eukaryotes95. The single-molecule long-read accessible chromatin mapping sequencing assay (SMAC-seq) uses a combination of methyltransferases (including M.CviPI, M.SssI and EcoGII (m6dA on all adenines)) to achieve high-resolution (<5 bp) mapping in order to study chromatin states and the coordination of regulatory elements on single molecules using ONT nanopore sequencing96 (Fig. 2). Fiber-seq used the Hia5 methyltransferase (m6dA on all adenines) with readout from PacBio sequencing97. Both methods were developed using model organisms with small genomes: SMAC-seq was developed using yeast and Fiber-seq using the Drosophila melanogaster S2 cell line. Both showed high correlation with existing open-chromatin data and the ability to study the coordination of chromatin state between adjacent regulatory sites (Fig. 2). More recently, a preprint article has described the use of Fiber-seq in human samples, leveraging improvements in single-molecule yield to profile the chromatin state of telomeres98.

Methyltransferase labelling has been further extended by combining it with other methods that can reveal protein–DNA interactions (Table 2). The single-molecule adenine methylated oligonucleosome sequencing assay (SAMOSA) combines EcoGII-mediated m6dA labelling with MNase digestion99, which targets reads to accessible regions. Footprinting information can be obtained both from the molecule ends and from m6dA labelling. A recent preprint described tagmentation-assisted SAMOSA (SAMOSA-Tag)100 in which the MNase is replaced with Tn5 transposase, commonly used in the assay for transposase-accessible chromatin using sequencing (ATAC-seq) and cleavage under targets and tagmentation (CUT&Tag)101. Importantly, the authors demonstrate identification of m6dA labelling and native 5mC CpG modifications, showing that SAMOSA-Tag can assay protein–DNA interactions, epigenetic modifications and primary DNA sequence simultaneously with PacBio sequencing.

Directly mapping protein–DNA interactions

In an extension of footprinting, m6dA labelling has been used within the framework of cleavage under targets & release using nuclease (CUT&RUN) and CUT&Tag methods101,102 to directly measure interactions between specific proteins and DNA (Table 2). In these approaches, a protein of interest is bound by specific antibodies (Fig. 2d). These antibodies are bound by bacterial proteins that bind tightly to IgG (protein A, protein G or both)103 fused to methyltransferases, thereby concentrating methyltransferase activity (and m6dA labelling with S-adenosylmethionine) around protein binding sites (Fig. 2d). This approach has been implemented for Hia5 (ref. 104) and EcoGII105 and can map protein–DNA binding with a resolution of 100–200 bp. Directed methylation with long-read sequencing (DiMeLo-seq) uses Hia5 and is the most extensively tested and optimized approach: it has been used to measure protein–DNA interactions across repetitive regions of the genome, study the coordination and heterogeneity of adjacent binding sites and phase reads to study allele-specific protein–DNA binding104.

Although single-molecule approaches for measuring protein–DNA binding unlock the ability to explore previously intractable biological questions, the way the interactions are measured is fundamentally different from established short-read methods (such as ChIP-seq, CUT&RUN and CUT&Tag). Short-read methods enrich bound regions, producing peaks of enrichment that cover a small percentage of the genome (<10%106,107) but often contain >50% of sequenced reads (the so-called fraction of reads in peaks)108. By contrast, the single-molecule methods discussed above have no built-in enrichment step, and although this makes them more quantitative and removes bias, it also requires whole-genome sequencing in order to obtain the same genome-wide signal. Fortunately, recent efforts have shown that these labelling techniques can be combined with enrichment methods for long reads94,104, allowing cost-effective profiling.
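
The cost of omitting an enrichment step can be estimated with simple arithmetic. The sketch below assumes, for illustration only, that reads from an unenriched method fall in peaks in proportion to the fraction of the genome that the peaks cover; `peak_enrichment` is a hypothetical helper, not part of any cited pipeline.

```python
def peak_enrichment(reads_in_peaks, total_reads, peak_bp, genome_bp):
    """Return (FRiP, fold enrichment over uniform genome coverage).

    FRiP is the fraction of reads in peaks. Fold enrichment compares FRiP
    with the fraction of the genome covered by peaks, which is what a
    method without enrichment would yield under the uniform-coverage
    assumption stated above.
    """
    if total_reads <= 0 or genome_bp <= 0 or peak_bp <= 0:
        raise ValueError("read and base-pair totals must be positive")
    frip = reads_in_peaks / total_reads
    genome_fraction = peak_bp / genome_bp
    return frip, frip / genome_fraction
```

With the figures quoted above (more than half of reads falling in peaks that cover less than a tenth of the genome), the fold enrichment is at least fivefold, which under the same assumption is roughly how much more sequencing an unenriched method would need to place the same number of reads in those regions.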

Measuring chromosome conformation

Moving to a larger scale, there is an interplay between DNA methylation, chromatin state, protein–DNA interactions and DNA organization in the nucleus. The three-dimensional organization of the genome plays a critical role in gene regulation, development and human disease (reviewed in refs. 109,110). Primary methods used to measure three-dimensional organization rely on proximity ligation and are known as chromatin conformation capture (3C) assays (reviewed in ref. 111). Most of these methods measure pairwise interactions with short-read sequencing and fail to capture information about potential cooperation between multiple loci112. Although methods that do not rely on proximity ligation make it possible to measure multi-way contacts113, long-read sequencing platforms have the potential to read long fragments from 3C-based experiments that represent multi-way interactions and have been employed in a variety of methods. PacBio sequencing was initially employed by a method measuring chromosomal walks in which 3C DNA was directly sequenced114. However, the long-read data were mostly used to validate short-read data, the reads were not very long (<8 kb) and the data produced represented <0.5× coverage of the mouse and human genomes, limiting what information could be gleaned114. Multi-contact circular chromosome conformation capture (MC-4C) employed circular chromosome conformation capture combined with Cas9 targeting to measure all interactions at one locus (a so-called ‘one versus all’ approach) with ONT sequencing115,116. Again, the average sequenced read size was not very long (~2 kb), owing in part to the use of PCR, with most reads measuring three-way or four-way contacts and some measuring ten contacts115. Genome-wide methods such as multi-contact 3C (MC-3C)117 and Pore-C118 do not employ PCR and are ‘all versus all’ methods (that is, all contacts at all loci are measured) like Hi-C and chromosomal walks. MC-3C used PacBio, whereas Pore-C used ONT.
Of these two methods, the data from Pore-C best demonstrate the potential of these approaches owing to extremely deep sequencing (up to >132× genome coverage)118. With high-depth data, the authors were able to explore CpG methylation on haplotype-specific, multi-way interactions on single molecules. In a good example of how quickly this area is moving, Pore-C has already been modified to reduce cost and improve throughput with a method termed high-throughput Pore-C (HiPore-C)119.
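
To relate multi-way reads to conventional pairwise contact maps, a concatemer read can be decomposed into all pairs of its aligned fragments. The sketch below is a minimal illustration with hypothetical names and an assumed input format (an ordered list of aligned segments per read); it is not the processing pipeline of any cited method.

```python
from itertools import combinations


def pairwise_contacts(fragments):
    """Expand one multi-way read into all unordered pairwise contacts.

    fragments: list of (contig, start, end) tuples, one per aligned
               segment of a single concatemer read.
    A read with n fragments yields n * (n - 1) / 2 pairs.
    """
    return list(combinations(fragments, 2))
```

A read joining four fragments therefore contributes six pairwise contacts and a read joining ten fragments contributes 45, but the pairwise view discards the information that all fragments were observed together on one molecule.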

Short reads on single-molecule platforms

Although single-molecule sequencing typically emphasizes read length, both PacBio and ONT technologies can sequence short nucleic acid fragments. Despite Illumina (and other short-read sequencers) dominating the short-read sequencing field, approaches that sequence short reads on ONT and PacBio have gained traction. The portability, low physical footprint and ability to analyse sequencing data in real-time make ONT sequencing devices ripe for use with short reads directly at the bench or in the field, without the need for a sequencing core. Single-molecule sequencing can reduce cost as multiple types of -omics data (for example, methylation and genetic variation) can be gleaned from a single sequencing run. The increases in throughput and accuracy of these single-molecule platforms provide advantages that have made them even more attractive for short-read sequencing. These advantages fall into the ‘iron triangle’ of project management: fast, good or cheap.

Fast: portability and speed

Recent attempts to detect chromosomal abnormalities by optimizing short-read sequencing on ONT highlight the advantage of the low cost and small size of the ONT sequencing devices, especially the ONT Flongle flow cells and ONT MinION flow cells. These aspects could make sequencing more accessible for environments with limited resources and bring these assays from centralized cores to the laboratory benchtop. Additionally, real-time sequencing with ONT enables rapid turnaround times compared to waiting for a completed sequencing run120,121. Chromosomal abnormalities, including aneuploidies and copy number variants (CNVs), play a role in human disease and are commonly screened for during pregnancy and in cancer (reviewed in refs. 122,123). Multiple studies have shown that short-read sequencing can be optimized for the portable ONT MinION device to detect aneuploidies124,125 and CNVs126,127. These approaches showed that sequencing libraries could be multiplexed, detected abnormalities were concordant with Illumina sequencing, only 0.5–2 million reads were required and sufficient reads could be obtained in under 3 hours (Fig. 3a). Additionally, similar CNV estimates were observed on the same sequencing device with short or long reads, underscoring the flexibility of these devices126.
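Read-counting approaches of this kind generally compare the fraction of reads mapping to a chromosome against a set of reference samples. The following is a minimal sketch of that idea using a z-score, not the specific statistics used in the cited studies. The read counts are invented toy values, and the cutoff of 3 is an assumed threshold chosen only for illustration.

```python
from statistics import mean, stdev

def chromosome_fraction(counts, chrom):
    """Fraction of all mapped reads that fall on `chrom`."""
    return counts[chrom] / sum(counts.values())

def aneuploidy_z(sample, references, chrom):
    """Z-score of the sample's read fraction against euploid references."""
    ref_fracs = [chromosome_fraction(ref, chrom) for ref in references]
    return (chromosome_fraction(sample, chrom) - mean(ref_fracs)) / stdev(ref_fracs)

# Toy read counts (invented for illustration; 1 million reads per sample).
references = [
    {"chr21": 13_000, "other": 987_000},
    {"chr21": 13_100, "other": 986_900},
    {"chr21": 12_900, "other": 987_100},
]
sample = {"chr21": 14_500, "other": 985_500}

z = aneuploidy_z(sample, references, "chr21")
print(z > 3)  # True: chr21 is over-represented relative to the toy references
```

Because only the per-chromosome read counts matter, this type of calculation depends on read number rather than read length, which is why a modest number of short reads can suffice.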

Fig. 3: Applications of shorter-read sequencing (<5 kb) on single-molecule platforms.

a, Short reads can be quickly sequenced on portable Oxford Nanopore sequencing devices, returning real-time information about copy number variants and aneuploidy in 3 h or less. b, Primary sequence, fragment patterns and endogenous methylation can be measured simultaneously with single-molecule platforms, and that information can be used to assign reads to tissues of origin. c, Accuracy of short reads on single-molecule platforms can be improved by correcting for errors by reading the same molecule multiple times. d, The cost of sequencing short fragments on single-molecule platforms can be decreased by combining multiple different short molecules into a single, long molecule.

Good: multimodal measurements

An important advantage for single-molecule platforms is that base modification information is acquired for free (not counting computational requirements) alongside the primary sequence. Specifically, short-read single-molecule assays can take advantage of modification data to measure cell-free DNA (cfDNA), which is fragmented DNA found in plasma that is usually the same length as DNA wrapped around a nucleosome (~150 bp). cfDNA has become a popular diagnostic tool owing to the relative ease of collection (via blood draws or ‘liquid biopsies’) and has been used to analyse fetal DNA during pregnancy, circulating tumour DNA and donor-derived DNA in transplant patients (reviewed in refs. 128,129). As reported in both published and preprint articles, cfDNA has been sequenced with PacBio and ONT to detect fetal DNA in maternal blood130,131 and assay circulating tumour DNA132,133,134,135. The ability to measure native CpG methylation and patterns from fragment ends (known as ‘fragmentomics’129) has been used to classify placental and maternal DNA130, show that tumour-derived DNA had lower methylation than non-tumour-derived DNA132, estimate tissue-of-origin and cell-type proportions (Fig. 3b), footprint transcription factor binding sites and measure nucleosome positioning133. ONT and PacBio platforms can also capture any longer fragments in these liquid biopsies, revealing previously unknown biology. For example, long reads (>1 kb) can constitute a large proportion (up to ~41%) of cfDNA reads in maternal plasma and the percentage of long reads increases as pregnancy progresses130.

Though exogenous labelling methods are a focus of single-molecule chromatin assay development (see ‘Mapping protein–DNA interactions’), methods sequencing short fragments from chromatin assays have also emerged. For example, Array-seq simply sequences the typical MNase digestion ladder to measure nucleosome positioning with ONT136 and short fragments from native ChIP-seq without amplification have been sequenced with PacBio137, allowing for both protein binding and native DNA modifications to be measured simultaneously. Another example is DamID, which uses exogenous DNA adenine methyltransferase (Dam) labelling and methylation-sensitive restriction enzyme digestion to probe protein–DNA interactions138. DamID output has been directly sequenced with ONT both with amplification (RNA Pol DamID (RAPID))139 and without amplification (nanopore-DamID)140, the latter reported in a recent preprint. These approaches have been shown to benefit from the single-molecule platforms that can sequence longer reads, measuring binding sites in repetitive sequences and segmental duplications as well as simultaneously investigating protein–DNA binding and native methylation140.

Good: accuracy

Two primary methods have been used to improve the accuracy of reads on single-molecule platforms: consensus methods and molecular indexing methods. Consensus methods have received the most attention, with various approaches existing for both ONT and PacBio. PacBio sequencing natively supports consensus sequencing (‘circular consensus sequencing’ (CCS) with PacBio HiFi) and has been used on both short fragments (<1,000 bp)141 and long fragments (>13 kb)142 to generate highly accurate (99.8%)142 consensus reads. As ONT does not sequence circular molecules, a variety of methods have been developed using rolling circle amplification to generate linear molecules composed of concatemers of the original molecule (Fig. 3c). These methods usually begin with linear fragments of DNA that are circularized by intramolecular ligation143, molecular inversion probes144, ligation into a backbone145, or by using Gibson assembly and a common DNA splint146. The circular molecules are then amplified using the phi29 polymerase to create long concatemerized molecules. After sequencing, concatemers are identified and a consensus sequence of the original molecule is constructed (Fig. 3c). Although these methods could be applied to long reads, their development has focused on short reads (<1,000 bp), down to 52 bp144. All of these methods show increased accuracy (for example, improving from 74% to >95% accuracy144) when consensus molecules are constructed, with a recent publication reporting the added benefit of increasing the sequencing yield compared to sequencing the short fragments directly146.
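The principle behind consensus correction can be illustrated with a column-wise majority vote over repeated copies of the same molecule. This is a deliberately simplified sketch: it assumes the copies have already been extracted from the concatemer and aligned to equal length, whereas published tools must handle insertions and deletions through alignment. The function name and the toy sequences are illustrative assumptions.

```python
from collections import Counter

def majority_consensus(copies):
    """Column-wise majority vote over equal-length copies of one molecule."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be pre-aligned to the same length")
    # zip(*copies) yields one tuple of bases per column; ties go to the
    # base seen first in that column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))

# Five toy copies of one 8-bp molecule, two of which carry a single error.
copies = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACCTTGCA", "ACGTTGCA"]
print(majority_consensus(copies))  # ACGTTGCA
```

Random errors are outvoted because they rarely recur at the same position in independent copies. An error that recurs in most copies would survive the vote, which is why systematic errors are not removed by this kind of consensus.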

In addition to consensus sequencing, unique molecular identifiers (UMIs) have been developed for single-molecule platforms and incorporated into amplicon sequencing147. UMIs were shown to improve the error rate of both ONT and PacBio (all >99.5% accuracy) and remove PCR chimeras that may arise during amplification. Although the UMIs were shown to work with long amplicons (>4,000 bp), they have the potential to be used in short-read methodologies as well.

Regardless of the approach used to improve accuracy, systematic errors in sequencing reads from these single-molecule platforms will prevent all errors from being corrected. For example, nanopore sequencing is error-prone in low-complexity sequences148 and homopolymer sequences, even with the latest commercially available pores7. PacBio is more accurate than ONT in general, but also shows systematic errors in homopolymer regions147,149. That said, further improvement is possible as indicated by recent efforts combining PacBio CCS with UMIs that resulted in very few errors147 and the improvement of accuracy seen by retraining nanopore basecallers with troublesome sequences150.

Cheap: increasing throughput

Both PacBio and ONT typically produce fewer reads per sequencing run than an Illumina device, which raises the per-read cost of these platforms for read-counting applications such as assaying CNVs and RNA-seq. Because of this, a set of methods has been developed to increase the yield of short reads on single-molecule platforms. The methods are similar to approaches used to increase Sanger sequencing throughput in the 1990s151,152 and rely on concatenating short fragments into artificial, long fragments using either Gibson assembly153 or sticky-end ligation154,155,156 (Fig. 3d). For example, a method published in a recent preprint article, multiplexed arrays sequencing of isoforms (MAS-ISO-seq), shows a ~15–25× increase in throughput with PacBio156, and sampling molecules using re-ligated fragments (SMURF-seq) achieves a ~3× increase on ONT155. Based on the gain in sequencing output, both methods can reduce the cost per million reads or full-length transcripts from >US$883 (PacBio) and >US$415 (ONT) to <US$56 (PacBio) and <US$146 (ONT) (see Supplementary Note and Supplementary Data). These approaches have been used in a variety of ways including identifying cancer variants153,155, measuring CNVs154 and sequencing RNA isoforms156,157.
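As a quick consistency check, the cost bounds quoted above can be converted into an implied fold reduction. The sketch below uses only the four figures stated in the text. It does not reproduce the calculation in the Supplementary Note, which may include costs not considered here, and the function name is an illustrative assumption.

```python
def implied_fold_reduction(cost_before, cost_after):
    """Ratio of cost per million reads before and after concatenation."""
    return cost_before / cost_after

# Costs in US$ per million reads, taken from the text. The "before" values
# are lower bounds and the "after" values are upper bounds, so each ratio
# is itself a lower bound on the fold reduction.
pacbio = implied_fold_reduction(883, 56)
ont = implied_fold_reduction(415, 146)
print(f"PacBio: >{pacbio:.1f}x, ONT: >{ont:.1f}x")  # PacBio: >15.8x, ONT: >2.8x
```

These implied reductions are in line with the reported throughput gains of ~15–25× for PacBio and ~3× for ONT.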

It is currently unclear whether these concatemerization methods introduce biases and how any such biases may affect the resulting data. Two of the methods recently described in preprints, MAS-ISO-seq156 and HIT-scISOseq157, show relative depletion of longer spike-in RNA variants compared to shorter transcripts when benchmarked against PacBio Iso-Seq. This depletion could arise at any step in those protocols, including PCR, uracil digestion or ligation. Furthermore, the ligases used in these assays may have some GC bias, as was shown for serial analysis of gene expression (SAGE)151,158,159. Finally, these concatemerization methods rely on being able to accurately identify the junction sites between molecules in order to split them into individual fragments. Although most of these methods are paired with software for resolving concatemers, the base-pair accuracy of these methods has not been fully elucidated. For example, ConcatSeq showed a small distribution of fragments deviating from the expected fragment length153. We expect that benchmarking and further exploration of these data will elucidate any sources of bias.
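The dependence on junction identification can be seen in a toy example that splits a concatenated read at a known linker sequence. This sketch uses exact string matching and a made-up linker, so it is far simpler than the software accompanying the published methods, which would be expected to tolerate sequencing errors. All names and sequences here are hypothetical.

```python
def split_concatemer(read, linker):
    """Split a read at exact occurrences of `linker`, dropping empty pieces."""
    return [piece for piece in read.split(linker) if piece]

LINKER = "GGATCCGG"  # made-up junction sequence, for illustration only

# Three toy fragments joined by two linkers.
read = "ACGTACGT" + LINKER + "TTTTGGGGCC" + LINKER + "CATCAT"
print(split_concatemer(read, LINKER))  # ['ACGTACGT', 'TTTTGGGGCC', 'CATCAT']

# A single substitution in the first linker hides that junction from exact
# matching, so two fragments are reported as one chimeric fragment.
noisy = read.replace(LINKER, "GGATCAGG", 1)
print(len(split_concatemer(noisy, LINKER)))  # 2
```

The second case shows how an error at a junction can produce a fragment of unexpected length, one possible contributor to deviations from the expected fragment length.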

Conclusions and future perspectives

The increasing use of single-molecule sequencing platforms in genomics has led to an increase in applications beyond typical use cases. As they enter the mainstream, the number of creative uses of these platforms will increase and the methods detailed in this Review will be optimized, refined and expanded. If anything, development will be accelerated in coming years owing to the massive increase in the use of ONT sequencing to monitor the SARS-CoV-2 pandemic, as illustrated by ~50% of COVID-19 sequencing across the African continent being performed with ONT160. This increase will give an expanded population of researchers ready access to single-molecule sequencing technology.

Targeted sequencing methods will be improved to capture longer reads to take full advantage of these platforms. The optimization of these methods will lead to greater read depths and lengths, enabling applications that need ultra-high-depth sequencing such as identifying somatic mosaic variants or intratumoural heterogeneity. Further developments in combining methods, such as Cas-mediated enrichment with adaptive sampling161, will improve on-target rates and drive costs even lower. Targeted long reads are likely to generate new insights into the direct molecular impact of mutations and alterations as their single-molecule nature is a proxy for cellular heterogeneity in complex clinical samples.

Since their inception, short-read assays measuring protein–DNA binding have been developed to reduce input even to the single-cell level (reviewed in ref. 78) and to measure multiple protein–DNA interactions simultaneously162,163. We expect single-molecule methods to follow the same trajectory as they offer an appealing route to quantitative methods for measuring these interactions. Early work on the coordination of epigenetic marks on long, single reads — in some cases as long as 100 kb — offers tantalizing views into exploring epigenetic heterogeneity, such as examining the temporal dynamics of T cell activation94. However, determining whether exogenous labelling variation is biological or technical requires careful molecular controls. Potential confounding technical aspects include the extent to which both protein and antibody penetrate cells and/or nuclei and their binding efficiencies, fidelity of modification calling and enzyme labelling efficiencies.

Although the throughput of short reads on single-molecule platforms is improving, the cost per million reads remains relatively high for counting applications such as RNA-seq, CNV analysis and CUT&RUN. Improvements that increase the number of short reads obtained in a single sequencing run will enable sample multiplexing, driving down the cost of sequencing. With increasing throughput, we expect more short reads from a variety of assays to be sequenced on these long-read platforms owing to decreasing cost, increased speed and portability, and the ability to gain multimodal information.

Although we focus on DNA-based methods in this Review, we believe the ability to sequence RNA directly will also have an important role in a variety of methods going forward. However, at this time, direct RNA sequencing lags behind DNA sequencing and will require improvement in many aspects, including accuracy, to spur further use164. Similarly, we expect the young field of protein sequencing on nanopores to continue to advance165, eventually completing our ability to measure the central dogma in its entirety.

Finally, we imagine these advances could be combined with parallel advances in the portability and flexibility of sample collection166 and data analysis167,168. This is an especially exciting prospect when considering their use with portable ONT sequencing, which could lead to sequencing assays leaving core facilities for use directly at the bench or even the field. Improvements and future developments in these methods set the stage for a more flexible and accessible field of genomics, pushing it into a new and exciting era.