Long-read sequencing in human genetics

Sanger sequencing revolutionized molecular genetics 40 years ago. However, next-generation sequencing technologies became further game changers and shaped our current view on genome structure and function in health and disease. Although still at the very beginning, third-generation sequencing methods, also referred to as long-read sequencing technologies, provide exciting possibilities for studying structural variations, epigenetic modifications, or repetitive elements and complex regions of the genome. We discuss the advantages and pitfalls of current long-read sequencing methods with a focus on nanopore sequencing, summarize respective applications and provide an outlook on the potential of these novel methods.

Determining nucleic acid sequences has shaped our view of genome structure and function. Back in 1968, Wu and Kaiser used primer extension methods to identify a short sequence of bacteriophage lambda [62], and 5 years later, Maxam and Gilbert determined the sequence of the lactose-repressor binding site by chemical cleavage [21]. Subsequently, the widely adopted chain-termination method using dideoxynucleotides, developed by Frederick Sanger and colleagues, has driven sequencing since the mid-1970s [42,51]. Sanger sequencing culminated in the sequencing of the human genome and is still relevant for targeted resequencing [27,37,61]. However, the advent of massively parallel sequencing (next-generation sequencing, NGS) turned out to be another game changer and revolutionized human genetics. Within 10 years, NGS led to a dramatic increase in knowledge on genetic variation and allowed fast and  [45]. Different methods using semiconductors (Ion Torrent), pyrosequencing (Roche), sequencing by ligation (Applied Biosystems), and the widely used sequencing by synthesis with reversible terminators (Solexa, Illumina) enabled gene panel, whole-exome, and whole-genome sequencing within a few days at moderate costs [43]. However, both Sanger sequencing and NGS technologies deliver only short reads in the range of 50-1000 bases. Short reads prevent the analysis of complex genomic loci, repetitive elements, or variant phasing (haplotyping) and result in inefficient and incomplete genome assemblies. Moreover, PCR amplification of sequencing templates generates artefacts and precludes detection of native base modifications. Several of these shortcomings can be overcome by third-generation sequencing (TGS) technologies, referred to as long-read sequencing in the following.

Nanopore sequencing
The idea of sequencing long fragments of DNA and RNA without PCR amplification and nucleotide labeling originated as early as the 1980s, but only became feasible after a technology using nanopores recently reached market maturity (Oxford Nanopore Technologies®, ONT, Oxford, UK) [14,34]. In nanopore sequencing, a tiny protein pore (Mycobacterium smegmatis porin A, MspA, or Escherichia coli Curlin sigma S-dependent growth subunit G, CsgG) is embedded in an electrically resistant polymer membrane, and an ionic current is passed through this nanopore by applying a voltage across the membrane. When DNA or RNA passes through the pore, with a helicase controlling its passage, it creates a characteristic change in the current, which provides information on the respective nucleotides in the nanopore (Fig. 1a; Table 1). The technology does not depend on a polymerase and allows sequencing of native DNA and RNA and the detection of various chemical modifications (e.g., methylation) of nucleic acids [12]. The longest reads achieved with the current method exceed 2 million contiguous bases of DNA.
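The relationship between current changes and the molecule in the pore can be illustrated with a toy example. The following is a minimal sketch, not the method used by ONT software: it assumes an idealized, nearly noise-free current trace in picoamperes and splits it into "events" wherever the level jumps by at least a chosen threshold. The function name `segment_events`, the threshold `min_step`, and the trace values are hypothetical and chosen only for illustration; production basecallers are considerably more sophisticated.

```python
from statistics import mean


def segment_events(current_pa, min_step=5.0):
    """Split a current trace into events wherever the level jumps by >= min_step pA.

    Returns a list of (start_index, end_index_exclusive, mean_level) tuples.
    """
    if not current_pa:
        return []
    events = []
    start = 0
    for i in range(1, len(current_pa)):
        # Compare each sample with the mean of the event collected so far.
        if abs(current_pa[i] - mean(current_pa[start:i])) >= min_step:
            events.append((start, i, round(mean(current_pa[start:i]), 1)))
            start = i
    # Close the final event.
    events.append((start, len(current_pa), round(mean(current_pa[start:]), 1)))
    return events


# Hypothetical trace with three current levels (values in pA, illustration only).
trace = [80.1, 79.8, 80.3, 95.2, 94.9, 95.4, 95.0, 70.2, 69.9, 70.1]
events = segment_events(trace)
for start, end, level in events:
    print(f"samples {start}-{end - 1}: mean {level} pA")
```

In this sketch each event would correspond to a different stretch of nucleotides occupying the pore; translating such levels into bases is the actual basecalling step and is not attempted here.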

SMRT sequencing
In single-molecule real-time (SMRT) sequencing, a single DNA polymerase molecule is immobilized at the bottom of picoliter wells called zero-mode waveguides (ZMWs). These wells are small enough to allow real-time recording of individual fluorescence signals on excitation by a laser when labeled nucleotides are progressively incorporated by the polymerase during the replication process (Fig. 1b; Table 1) [54].

Applications of long-read sequencing in human genetics
The first applications of long-read sequencing were restricted to the sequencing of smaller genomes such as those of bacteria. However, with improvements in chemistry, human genome sequencing became feasible [29]. In contrast to short reads, long reads can be mapped unambiguously, for example, in regions of high homology or low complexity, or in pseudogenes. The phasing of alleles (generation of haplotypes) is also facilitated by long reads and is possible without information on the parental SNPs. It thereby becomes possible to distinguish whether genetic variants occur on the same allele (in cis) or on opposite alleles (in trans). Recent examples demonstrated that complete haplotyping of highly complex regions, including the killer cell immunoglobulin-like receptor (KIR) and human leukocyte antigen (HLA) loci, can be performed using long-read technologies [1]. With improvements in read length, as yet unresolved regions of the human genome, such as low-copy repeats, telomeres, or centromeres (for sequencing of the Y-chromosome centromere see [30]), become accessible [39]. An obvious advantage of long-read sequencing is the detection of structural variations (SVs), including the detection of balanced chromosomal rearrangements. Several studies have demonstrated the successful identification of constitutive [50], complex "chromothripsis" [11], or somatic genomic rearrangements [16,25]. Exact characterization of breakpoints for larger indels [36] and the detection of fusion gene products [32] are also possible with long-read approaches. Long-read whole-genome sequencing can identify thousands of SVs that may escape NGS and allows otherwise missed disease-causative genomic aberrations to be discovered [8,12,53]. The identification of SVs from TGS data may also require lower coverage than with NGS [11].
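The principle of read-based phasing mentioned above can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not a published phasing algorithm: each read is represented as a hypothetical dictionary mapping genomic positions to observed bases, only two heterozygous sites are considered, and any base other than the named alternative allele is treated as the reference allele. The function names, positions, and alleles are invented for illustration.

```python
from collections import Counter


def phase_two_sites(reads, site_a, site_b):
    """Count allele combinations on reads that cover both heterozygous sites."""
    combos = Counter()
    for read in reads:
        if site_a in read and site_b in read:
            combos[(read[site_a], read[site_b])] += 1
    return combos


def cis_or_trans(combos, alt_a, alt_b):
    """Decide whether the two alternative alleles lie on the same haplotype.

    Reads carrying both or neither alternative allele support 'cis';
    reads carrying exactly one of them support 'trans'.
    """
    cis = sum(n for (a, b), n in combos.items() if (a == alt_a) == (b == alt_b))
    trans = sum(n for (a, b), n in combos.items() if (a == alt_a) != (b == alt_b))
    if cis == trans:
        return "undetermined"
    return "cis" if cis > trans else "trans"


# Hypothetical long reads: position -> observed base.
reads = [
    {1000: "A", 9000: "T"},
    {1000: "A", 9000: "T"},
    {1000: "G", 9000: "C"},
    {1000: "G", 9000: "C"},
    {1000: "A", 9000: "C"},  # a single discordant read, e.g. a sequencing error
    {1000: "G"},             # does not span both sites, so it is not informative
]
combos = phase_two_sites(reads, 1000, 9000)
result = cis_or_trans(combos, alt_a="G", alt_b="C")
print(result)
```

The last read illustrates why read length matters: only reads that span both variant positions contribute to phasing, so sites that are far apart can be phased directly only by sufficiently long reads.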
F. Kraft · I. Kurth
medgen 2019 · 31:198-204 https://doi.org/10.1007/s11825-019-0249-z © The Author(s) 2019
Keywords: Third-generation sequencing · Long-read sequencing · Nanopore sequencing · Single-molecule real-time sequencing · Genomics

Long-read sequencing also enables the study of larger repeat expansions that escape PCR-based approaches. Repetitive elements can be evaluated with high precision, for example, the FMR1-associated fragile X syndrome repeat, including the determination of its AGG interruptions, which are relevant for repeat stability [3]. Larger repeats such as the facioscapulohumeral muscular dystrophy (FSHD)-associated D4Z4 repeat array have also been fully sequenced by TGS [44]. Using long-read sequencing, novel expansions of intronic TTTCA and TTTTA repeats of SAMD12 have been reported in benign adult familial myoclonic epilepsy [28], and repeat expansions in NOTCH2NLC have recently been associated with a neuronal intranuclear inclusion disease [57]. The highly similar sequences of the tandem repeats can be directly assessed from the raw signal (Fig. 2). Cas9-based enrichments, e.g., of disease-causing repetitive or other genomic regions, make TGS more feasible for routine diagnostic applications and allow several genomic loci to be analyzed in one assay. Utilizing the ONT Flongle for these targeted approaches enables the costs of TGS-based analysis to be further reduced.
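To illustrate the kind of repeat analysis described above, the following minimal sketch counts the triplets of an FMR1-like CGG repeat tract in a single read and reports the positions of AGG interruptions. It assumes an error-free (e.g., consensus) read sequence that spans the entire tract; raw long reads contain errors and would require alignment-based or signal-based methods instead of exact pattern matching. The function name, flanking sequences, and repeat structure are hypothetical.

```python
import re


def analyze_cgg_repeat(read_seq):
    """Return (number of triplets, 1-based positions of AGG interruptions)
    for the longest uninterrupted run of CGG/AGG triplets in the read."""
    best = max(
        re.finditer(r"(?:CGG|AGG)+", read_seq),
        key=lambda m: len(m.group()),
        default=None,
    )
    if best is None:
        return 0, []
    tract = best.group()
    triplets = [tract[i:i + 3] for i in range(0, len(tract), 3)]
    agg_positions = [i + 1 for i, t in enumerate(triplets) if t == "AGG"]
    return len(triplets), agg_positions


# Hypothetical read: flank + (CGG)9 AGG (CGG)9 AGG (CGG)10 + flank.
read = "TTACT" + "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 10 + "CTGAA"
n_triplets, interruptions = analyze_cgg_repeat(read)
print(n_triplets, interruptions)  # 30 [10, 20]
```

Because a single long read covers the whole tract together with its flanks, both the total repeat length and the position of each interruption are read from one molecule rather than inferred from fragment sizes.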
The feasibility of long-read sequencing to detect unusual mutation mechanisms was recently reported for the exonization of an intronic LINE-1 element inserted into the DMD gene in a patient with muscular dystrophy [24]. Another example of an unusual mutation is a SINE-VNTR-Alu (SVA) retrotransposition into intron 32 of the TAF1 locus, which causes an endemic type of X-linked dystonia parkinsonism [2].
Previous sequencing technologies provided only limited access to the state of nucleic acid modifications. In principle, any base modification that affects the current in nanopore sequencing (Fig. 3) or the nucleotide incorporation time in SMRT sequencing is recorded in the raw signals. This allows, for example, discrimination between 5-methylcytosine and 5-hydroxymethylcytosine, or detection of N6-methyladenosine [48,56]. This unique feature of TGS enables SVs, SNVs, and the methylation status of genomic loci to be analyzed in parallel and may improve the molecular diagnostics of, for example, cancer and imprinting disorders. Not only can the landscape of alternative splicing be investigated by reading through entire isoforms [33], but the various base modifications present on native RNA molecules can also be detected using this PCR-free method [18]. Moreover, native CpG methylation and chromatin accessibility can be studied in parallel using long reads [38].
Table 2 provides an overview of current long-read sequencing applications.

Challenges of long-read sequencing
Preparing libraries for long-read sequencing is straightforward; however, there are several pitfalls in terms of obtaining optimal sequencing libraries.
A major drawback of SMRT sequencing is the fixed number of μ-wells per flow cell, which means that wells containing shorter templates or no template at all reduce the overall output. In contrast, individual pores in nanopore sequencing can sequence up to several thousand molecules; however, very large DNA molecules tend to block the respective pores. A major challenge in TGS is the high sequencing error rate, but higher coverage and optimized filtering strategies can improve consensus accuracy [14]. The release of a new ONT "linear consensus sequencing" (LCS) chemistry is expected to provide better results, similar to the "circular consensus sequencing" (CCS) chemistry used by PacBio. Another issue is the relatively large raw data file size, which places high demands on data management and storage, especially for nanopore sequencing applications. PCR-free target enrichment strategies for nanopore sequencing are hardly available, but interesting approaches using CRISPR/Cas9 are under development.
Cas9 is used to cleave and directly capture genomic regions via hybridization and immobilization on beads before sequencing. Moreover, software applications for nanopore sequencing may be useful for in silico target enrichment. 'ReadUntil' is a software application that allows fragments of interest to be selected by reversing the voltage across individual nanopores and ejecting unwanted DNA on the fly [41]. Bioinformatics strategies for the processing of long-read sequencing data are rapidly evolving; however, it is cur…
Table 2 Examples of applications of long-read sequencing
Table 3 provides an overview of some of the most commonly used bioinformatics tools in long-read sequencing.
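The statement in this section that higher coverage can improve consensus accuracy can be made concrete with a simple binomial model. This is a rough sketch under strong assumptions, not a description of any vendor's consensus algorithm: errors are treated as independent between reads, all erroneous reads are assumed to agree on the same wrong base, and the per-read error rate of 10% is a hypothetical value chosen for illustration. Systematic, non-random errors, which do occur in practice, are not captured by this model.

```python
from math import comb


def consensus_error(per_read_error, coverage):
    """Probability that a strict majority of reads is wrong at one position.

    Assumes independent errors and that all erroneous reads show the same
    wrong base; odd coverage values avoid ties.
    """
    p = per_read_error
    need = coverage // 2 + 1  # reads required for a wrong strict majority
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(need, coverage + 1)
    )


# Hypothetical per-read error rate of 10% at increasing (odd) coverage.
for cov in (1, 5, 15, 31):
    print(f"{cov:>3}x coverage: consensus error {consensus_error(0.10, cov):.2e}")
```

Under these assumptions the majority-vote error drops from 10% with a single read to below 1% at fivefold coverage and continues to fall rapidly, which is the rationale behind consensus approaches such as CCS, where the same molecule is read repeatedly.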

Outlook
Long-read sequencing has huge potential and will provide additional insight into genome biology and human genetics. Several disease-relevant genes and pathomechanisms that escape short-read sequencing technologies will be elucidated by long-read technologies. The technologies will soon become an integral part of molecular genetic diagnostics. An open question is whether the techniques will mature to the point that they replace short-read sequencing technologies, array-based analyses, and cytogenetics. Applications of TGS to detect SVs and tandem repeats are already superior to NGS and almost ready for use in routine molecular diagnostics. In contrast, the higher error rate of nanopore sequencing currently makes SNV detection suitable only in targeted sequencing approaches that generate a high coverage (> 100×). The lack of commercially available kits for TGS enrichments and of gold-standard bioinformatics solutions is at the moment one of the bottlenecks for usage in molecular diagnostics. Besides the aforementioned applications, the portability of small nanopore sequencers opens up additional opportunities for field applications in a nearly lab-free environment. This is illustrated by the surveillance of pathogens in disease epidemics, such as the real-time tracking of Ebola distribution [47] or the molecular mapping of Zika virus spread in Brazil [17]. Are we perhaps heading for times of "sequencing at home" or in outpatient clinics and medical practices, with direct data transfer to genetic specialists? Other open questions concern the speed of nanopore technologies, which deliver the first sequencing results within minutes to a few hours after library preparation: Can we tackle fast sepsis diagnostics or intraoperative molecular genotyping? Undoubtedly, genetics is becoming increasingly important in many fields of health care, and the possibilities for addressing the plentiful questions by TGS are rapidly evolving.

Conclusions for clinical practice
- Different long-read sequencing platforms are available that depend either on an immobilized polymerase and fluorescently labelled nucleotides or on biological (nano)pores.
- Long-read sequencing is mostly applied in research, but has the potential to be used in many fields of molecular genetic diagnostics.
- Long-read sequencing has several advantages compared with short-read sequencing methods and is well suited to, for example, addressing structural variations, epigenetic modifications, and repetitive elements of the genome.

… outstanding papers in the field have not been cited owing to the limitations of space. We would like to point out that developments in the field of long-read sequencing illustrate changes in the current research practice toward rapid publication of results on preprint servers such as bioRxiv (https://www.biorxiv.org/), the nanopore community platform (https://nanoporetech.com/community), Twitter, or as blogs. In our opinion, this practice fosters lively discussion and speedy innovation, and may serve as a contemporary model to complement the often sluggish and protracted peer-review processes.

Compliance with ethical guidelines
Conflict of interest. F. Kraft and I. Kurth declare that they have no competing interests.
For this article no studies with human participants or animals were performed by any of the authors. All studies performed were in accordance with the ethical standards indicated in each case.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.