Initial sequencing technologies and short-read next-generation sequencing

Determining nucleic acid sequences has shaped our view of genome structure and function. In 1968, Wu and Kaiser used primer extension methods to identify a short sequence of bacteriophage lambda [62], and 5 years later, Maxam and Gilbert determined the sequence of the lactose-repressor binding site by chemical cleavage [21]. Subsequently, the widespread chain-termination method using dideoxynucleotides, developed by Frederick Sanger and colleagues, has driven sequencing since the mid-1970s [42, 51]. Sanger sequencing culminated in the sequencing of the human genome and is still relevant for targeted resequencing [27, 37, 61]. However, the advent of massively parallel sequencing (next-generation sequencing, NGS) turned out to be another game changer and revolutionized human genetics. Within 10 years, NGS led to a dramatic increase in knowledge of genetic variation and allowed fast and accurate diagnostics of clinically relevant germline and somatic mutations [45]. Different methods using semiconductors (Ion Torrent), pyrosequencing (Roche), sequencing by ligation (Applied Biosystems), and the widely used sequencing by synthesis with reversible terminators (Solexa, Illumina) enabled gene panel, whole-exome, and whole-genome sequencing within a few days at moderate cost [43]. However, both Sanger sequencing and NGS technologies deliver only short reads in the range of 50–1000 bases. Short reads hamper the analysis of complex genomic loci, repetitive elements, and variant phasing (haplotyping), and result in inefficient and incomplete genome assemblies. Moreover, PCR amplification of sequencing templates generates artefacts and precludes detection of native base modifications. Several of these shortcomings can be overcome by third-generation sequencing (TGS) technologies, referred to below as long-read sequencing.

Long-read next-generation sequencing methods

Nanopore sequencing

The idea of sequencing long fragments of DNA and RNA without PCR amplification and nucleotide labeling originated as early as the 1980s, but became feasible only recently, when a nanopore-based technology reached market maturity (Oxford Nanopore Technologies®, ONT, Oxford, UK) [14, 34]. In nanopore sequencing, a tiny protein pore (Mycobacterium smegmatis porin A, MspA, or Escherichia coli Curlin sigma S-dependent growth subunit G, CsgG) is embedded in an electrically resistant polymer membrane, and an ionic current is passed through this nanopore by applying a voltage across the membrane. When a helicase feeds DNA or RNA through the pore, the passing strand causes a characteristic change in the current, which provides information on the nucleotides currently inside the nanopore (Fig. 1a; Table 1). The technology does not depend on a polymerase and allows sequencing of native DNA and RNA as well as the detection of various chemical modifications (e.g., methylation) of nucleic acids [12]. The longest reads achieved with the current method exceed 2 million contiguous bases of DNA.

Fig. 1

Principle of nanopore and single-molecule real-time (SMRT) sequencing. a Nanopore sequencing: DNA is analyzed by threading it through a biological protein pore (e.g., Mycobacterium smegmatis porin A, MspA). The DNA is unzipped by a helicase to allow single-strand sequencing. Nucleotides inside the pore disrupt the ion flow through the channel. Each flow cell operates up to 50 (Flongle), 512 (MinION/GridION), or 3000 (PromethION) pores in parallel. DNA and RNA can be sequenced and processed in real time at a speed of 450 (or 70 for RNA) bases per second and pore. The resulting current traces are converted to DNA sequences. b SMRT sequencing: The polymerase is attached to the bottom of a zero-mode waveguide (ZMW) µ-well and incorporates fluorophore-labeled nucleotides. 100,000 (RSII), 1 million (Sequel), or 8 million (Sequel 2) of these µ-wells are combined on one flow cell. During elongation at a few bases per second, the fluorophore-labeled nucleotides are excited by a laser, and the emitted light is detected by four complementary metal oxide semiconductor cameras, one per color. The time-resolved fluorescence signals are converted to DNA sequences.
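To put the pore counts and translocation speed quoted for nanopore devices into perspective, the following minimal Python sketch estimates a theoretical upper bound on yield per flow cell. It assumes that every pore sequences continuously for the whole run, which does not hold in practice, so real yields are lower. The function name and the 48-hour run time are illustrative assumptions and are not taken from the text.

```python
def max_yield_gigabases(pores: int, bases_per_second: float, hours: float) -> float:
    """Theoretical upper bound on yield in gigabases.

    Assumes every pore translocates DNA without interruption for the whole run.
    """
    seconds = hours * 3600
    return pores * bases_per_second * seconds / 1e9


# Pore counts and the DNA speed of 450 bases/s come from the Fig. 1 caption;
# the 48 h run time is an assumption chosen only for illustration.
for device, pores in [("Flongle", 50), ("MinION/GridION", 512), ("PromethION", 3000)]:
    bound = max_yield_gigabases(pores, 450, 48)
    print(f"{device}: at most {bound:.1f} Gb")
```

The estimate scales linearly with pore count, speed, and run time, so it is mainly useful for comparing devices rather than for predicting the output of a real run.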

Table 1 Comparison of long-read sequencing methods

SMRT sequencing

In single-molecule real-time (SMRT) sequencing, a single DNA polymerase molecule is immobilized at the bottom of picoliter wells called zero-mode waveguides (ZMWs). These wells are small enough to allow real-time recording of individual fluorescence signals on excitation by a laser when labeled nucleotides are progressively incorporated by the polymerase during the replication process (Fig. 1b; Table 1; [54]). The technology, commercialized by Pacific Biosciences® (Pacific Biosciences of California, Inc., Menlo Park, CA, USA), produces an average read length of 10–30 kb, but reads can exceed 80 kb [60]. Circular DNAs serve as a sequencing template and can be sequenced multiple times to provide higher accuracy consensus sequences. Base modifications affect the speed of nucleotide incorporation, which enables SMRT sequencing to detect modified bases.
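The benefit of reading a circular template several times can be illustrated with a toy consensus. The sketch below is not the vendor's consensus algorithm: it assumes that the individual passes are already aligned to equal length and contain only substitution errors, and it simply takes a column-wise majority vote. The function name and the example sequences are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_consensus(passes: Sequence[str]) -> str:
    """Column-wise majority vote over equal-length, pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")
    consensus = []
    for column in zip(*passes):
        # most_common(1) returns the most frequent base in this column
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same molecule, each with at most one error.
passes = ["ACGTTAGC", "ACGATAGC", "ACGTTAGC", "TCGTTAGC", "ACGTTAGG"]
print(majority_consensus(passes))
```

Because the errors in the example fall at different positions in different passes, each one is outvoted. Errors that recur systematically at the same position would not be corrected by this approach.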

Other approaches

Currently, there are only a few alternatives for assessing long stretches of nucleic acids. Synthetic long-read (SLR) technologies are offered by Illumina® and, as an emulsion-based approach, by 10X Genomics®. However, both techniques build on classical Illumina short-read sequencing and are in fact not TGS technologies. BioNano Genomics® uses an optical mapping method that labels sequences in long DNA fragments (500 bases – megabases); the labeled fragments are imaged, which allows long-range genome mapping and detection of structural variants (Saphyr system).

Applications of long-read sequencing in human genetics

The first applications of long-read sequencing were restricted to the sequencing of smaller genomes such as those of bacteria. However, with improvements in chemistry, human genome sequencing became feasible [29]. In contrast to short reads, long reads can be mapped unambiguously, even in regions of high homology, low complexity, or in pseudogenes. In addition, the phasing of alleles (generation of haplotypes) is facilitated by long reads and is possible without information on the parental SNPs. This also makes it possible to distinguish whether genetic variants lie on the same allele (in cis) or on opposite alleles (in trans). Recent examples demonstrated that complete haplotyping of highly complex regions, including killer cell immunoglobulin-like receptor (KIR) and human leukocyte antigen (HLA) loci, can be performed using long-read technologies [1]. With improvements in read length, as yet unresolved regions of the human genome, such as low-copy repeats, telomeres, or centromeres (for sequencing of the Y-chromosome centromere see [30]), become accessible [39].
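The principle of read-based phasing can be sketched in a few lines of Python. The example assumes two heterozygous variants and reads that have already been genotyped at both sites. A read carrying both alternative alleles (or both reference alleles) supports a cis configuration, whereas a read carrying one of each supports trans. The data structure and function name are hypothetical, and real phasing tools additionally model sequencing errors.

```python
from collections import Counter
from typing import Dict, Iterable


def phase_two_variants(reads: Iterable[Dict[str, str]], site_a: str, site_b: str) -> str:
    """Classify two heterozygous variants as 'cis' or 'trans' by majority vote.

    Each read maps a site name to 'ref' or 'alt'. Reads that do not span
    both sites carry no phasing information and are skipped.
    """
    votes = Counter()
    for read in reads:
        if site_a not in read or site_b not in read:
            continue
        same_allele_state = read[site_a] == read[site_b]
        votes["cis" if same_allele_state else "trans"] += 1
    if votes["cis"] == votes["trans"]:
        return "undetermined"  # no spanning reads, or a tie
    return "cis" if votes["cis"] > votes["trans"] else "trans"


# Hypothetical genotyped reads; the last read covers only one site.
reads = [
    {"v1": "alt", "v2": "alt"},
    {"v1": "ref", "v2": "ref"},
    {"v1": "alt", "v2": "alt"},
    {"v1": "alt", "v2": "ref"},  # discordant read, e.g., a sequencing error
    {"v1": "alt"},
]
print(phase_two_variants(reads, "v1", "v2"))
```

The longer the reads, the more variant pairs are spanned by single molecules, which is why long reads allow phasing without parental genotypes.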

An obvious advantage of long-read sequencing is the detection of structural variations (SVs), including balanced chromosomal rearrangements. Several studies have demonstrated the successful identification of constitutive [50], complex “chromothripsis” [11], or somatic genomic rearrangements [16, 25]. Exact characterization of breakpoints for larger indels [36] and the detection of fusion gene products [32] are possible with long-read approaches. Long-read whole-genome sequencing can identify thousands of SVs that may escape NGS and allows otherwise missed disease-causative genomic aberrations to be discovered [8, 12, 53]. The identification of SVs from TGS data may also require lower coverage than with NGS [11].

Long-read sequencing also enables the study of larger repeat expansions that escape PCR-based approaches. Repetitive elements can be evaluated with high precision, for example, the FMR1-associated fragile X syndrome repeat, including determination of the AGG interruptions that are relevant for repeat stability [3]. Larger repeats such as the facioscapulohumeral muscular dystrophy (FSHD)-associated D4Z4 repeat array have also been fully sequenced by TGS [44]. Using long-read sequencing, novel expansions of intronic TTTCA and TTTTA repeats in SAMD12 have been reported in benign adult familial myoclonic epilepsy [28], and repeat expansions in NOTCH2NLC have recently been associated with neuronal intranuclear inclusion disease [57]. The highly similar sequences of the tandem repeats can be directly assessed from the raw signal (Fig. 2). Cas9-based enrichment, e.g., of disease-causing repetitive or other genomic regions, makes TGS more feasible for routine diagnostic applications and allows several genomic loci to be analyzed in one assay. Using the ONT Flongle for these targeted approaches further reduces the costs of TGS-based analysis.
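Once a read spans an entire repeat, counting repeat units and locating interruptions becomes a simple string operation. The following Python sketch illustrates this for a CGG repeat with AGG interruptions. It assumes an error-free read on the forward strand, which raw long reads are not, so it illustrates the idea only and is not a diagnostic method. The function name and the example read are hypothetical.

```python
import re
from typing import List, Tuple


def summarize_cgg_repeat(read: str) -> Tuple[int, List[int]]:
    """Return (number of repeat units, 1-based positions of AGG interruptions).

    Only the longest uninterrupted tract of CGG/AGG units in the read is used.
    """
    tracts = [m.group(0) for m in re.finditer(r"(?:CGG|AGG)+", read)]
    if not tracts:
        return 0, []
    tract = max(tracts, key=len)
    # Split the tract into its 3-base units and record which ones are AGG.
    units = [tract[i:i + 3] for i in range(0, len(tract), 3)]
    agg_positions = [i + 1 for i, unit in enumerate(units) if unit == "AGG"]
    return len(units), agg_positions


# Hypothetical read: flanking sequence, 30 repeat units, AGG at units 10 and 20.
read = "GCTCA" + "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 10 + "CTGGG"
print(summarize_cgg_repeat(read))
```

In practice, repeat lengths are estimated with error-tolerant alignment or directly from the raw signal, because sequencing errors within the repeat would break an exact pattern match like the one used here.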

Fig. 2

Detection of tandem repeat expansions from nanopore sequencing raw signal traces. a, d Sample plots showing the raw nanopore sequencing signal from a tandem repeat expansion on the positive (a) and negative (d) strands. The repeat consists of two distinct sequence motifs, which are indicated in red and blue. Adjacent sequences are shown in gray. b, c, e, f Current profile of a single repeat unit. Source: Institute of Human Genetics, RWTH Aachen

The feasibility of long-read sequencing to detect unusual mutation mechanisms was recently reported for the exonization of an intronic LINE-1 element inserted into the DMD gene in a patient with muscular dystrophy [24]. Another example of an unusual mutation is a SINE-VNTR-Alu (SVA) retrotransposition into intron 32 of the TAF1 locus, which causes an endemic type of X‑linked dystonia parkinsonism [2].

Previous sequencing technologies provided only limited access to the state of nucleic acid modifications. In principle, any base modification that affects the current in nanopore sequencing (Fig. 3) or the nucleotide incorporation time in SMRT sequencing is recorded in the raw signals. This allows, for example, discrimination between 5-methylcytosine and 5-hydroxymethylcytosine, or detection of N6-methyladenosine [48, 56]. This unique feature of TGS enables SVs, SNVs, and the methylation status of genomic loci to be analyzed in parallel and may improve molecular diagnostics, for example, of cancer and imprinting disorders. Not only can the landscape of alternative splicing be investigated by reading through entire isoforms [33], but the various base modifications present on native RNA molecules can also be detected using this PCR-free method [18]. Moreover, native CpG methylation and chromatin accessibility can be studied in parallel using long reads [38]. Table 2 provides an overview of current long-read sequencing applications.

Fig. 3

Detection of native epigenetic modifications by nanopore sequencing. Methylation of a cytosine (C) causes a change in the recorded current profile. The signal of the unmodified cytosine is marked by a blue box and 5‑methylcytosine is labelled by a red box. Source: Institute of Human Genetics, RWTH Aachen

Table 2 Examples of applications of long-read sequencing

Challenges of long-read sequencing

Library preparation for long-read sequencing is straightforward; however, there are several pitfalls in obtaining optimal sequencing libraries. A major drawback of SMRT sequencing is the fixed number of µ-wells per flow cell, which means that wells loaded with short templates or with no template reduce the overall output. In contrast, individual pores in nanopore sequencing can sequence up to several thousand molecules; however, very large DNA molecules tend to block the respective pores. A major challenge in TGS is the high sequencing error rate, but higher coverage and optimized filtering strategies can improve consensus accuracy [14]. The release of a new ONT “linear consensus sequencing” (LCS) chemistry is expected to provide better results, similar to the “circular consensus sequencing” (CCS) chemistry used by PacBio. Another issue is the relatively large raw data file size, which places high demands on data management and storage, especially for nanopore sequencing applications. PCR-free target enrichment strategies for nanopore sequencing are scarcely available, but interesting approaches using CRISPR/Cas9 are under development. Cas9 is used to cleave and directly capture genomic regions via hybridization and immobilization on beads before sequencing. Moreover, software applications for nanopore sequencing may be useful for in silico target enrichment. ‘ReadUntil’ is a software application that selects fragments of interest by reversing the voltage across individual nanopores to eject unwanted DNA on the fly [41]. Bioinformatics strategies for the processing of long-read sequencing data are rapidly evolving; however, it is currently unclear which applications are the most suitable [52]. Notably, base calling performance is lower for modified bases owing to the lack of suitable reference sequences and computational models. Table 3 provides an overview of some of the most commonly used bioinformatics tools in long-read sequencing.
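The decision logic behind in silico target enrichment can be sketched as follows. This is not the ReadUntil API. It is a hypothetical, self-contained Python function that assumes the first part of a read has already been base called and mapped elsewhere, and that only decides whether the pore should keep sequencing or eject the molecule. All names, coordinates, and the policy for unmapped reads are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def decide(prefix_hit: Optional[Tuple[str, int]], targets: List[Interval]) -> str:
    """Return 'sequence', 'eject', or 'undecided' for one read.

    prefix_hit is the (chromosome, position) to which the start of the read
    mapped, or None if it could not be mapped yet.
    """
    if prefix_hit is None:
        return "undecided"  # assumed policy: wait for more signal
    chrom, pos = prefix_hit
    for t_chrom, start, end in targets:
        if chrom == t_chrom and start <= pos < end:
            return "sequence"  # read starts inside a region of interest
    return "eject"  # off-target: reverse the voltage to free the pore


# Hypothetical target region and read starts.
targets = [("chr1", 1000, 2000)]
print(decide(("chr1", 1500), targets))
print(decide(("chr2", 1500), targets))
print(decide(None, targets))
```

Because the decision has to be made while the molecule is still in the pore, the mapping step that feeds such a function must be fast. The sketch deliberately leaves that step out.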

Table 3 Selected bioinformatics tools for analyzing nanopore (N) and/or PacBio (P) data

Outlook

Long-read sequencing has huge potential and will provide additional insight into genome biology and human genetics. Several disease-relevant genes and pathomechanisms that escape short-read sequencing technologies will be elucidated by long-read technologies. The technologies will soon become an integral part of molecular genetic diagnostics. An open question is whether the techniques will mature to the point of replacing short-read sequencing technologies, array-based analyses, and cytogenetics. Applications of TGS to detect SVs and tandem repeats are already superior to NGS and almost ready for use in routine molecular diagnostics. In contrast, the higher error rate of nanopore sequencing currently makes SNV detection suitable only in targeted sequencing approaches that generate high coverage (> 100×). The lack of commercially available kits for TGS enrichment and of gold-standard bioinformatics solutions is currently one of the bottlenecks for use in molecular diagnostics. Besides the aforementioned applications, the portability of small nanopore sequencers opens up additional opportunities for field applications in a nearly lab-free environment. This is illustrated by the surveillance of pathogens in disease epidemics, such as the real-time tracking of Ebola distribution [47] or the molecular mapping of Zika virus spread in Brazil [17]. Are we perhaps heading for times of “sequencing at home” or in outpatient clinics and medical practices, with direct data transfer to genetic specialists? Other open questions concern the speed of nanopore technologies, which can deliver the first sequencing results within minutes to a few hours of library preparation: can we tackle fast sepsis diagnostics or intraoperative molecular genotyping? Undoubtedly, genetics is becoming increasingly important in many fields of health care, and the possibilities for addressing the many open questions with TGS are rapidly evolving.

Conclusions for clinical practice

  • Different long-read sequencing platforms are available that either depend on an immobilized polymerase and fluorescently labelled nucleotides or on biological (nano)pores.

  • Long-read sequencing is mostly applied in research, but has the potential to be used in many fields of molecular genetic diagnostics.

  • Long-read sequencing has several advantages compared with short-read sequencing methods and is well suited to, for example, addressing structural variations, epigenetic modifications, and repetitive elements of the genome.