1 Precision Gene Editing: Design Guided by Gene Structural Features

Gene editing, e.g., by CRISPR/Cas, is widely used for plant functional genomics research and has huge potential for targeted improvement of desired heritable traits in crops [1]. The precision and specificity of CRISPR/Cas allow for dedicated screening for induced mutant alleles at target sites. Combining the mutant alleles detected by molecular screens with the central dogma of molecular biology, which explains the relationship between the primary DNA sequence, gene structural features, and expression of the encoded protein (Fig. 5.1), allows prediction of the effect of a mutation on protein functionality. Thus, a comprehensive gene editing screening workflow to characterize the generated mutants ideally spans the entire path from “gene structure based” CRISPR/Cas and gRNA design, through molecular detection of the mutant alleles, to prediction of the effect of the mutation on the encoded protein sequence, hence capturing the actual outcome of gene editing. Here, we review the different types of mutations that can be introduced by variants of CRISPR/Cas gene editing, highlight several molecular screening and detection techniques, and place them in this overarching perspective.

Fig. 5.1
A schematic representation of the flow of genetic information. Genetic information is passed from DNA to pre-mRNA by transcription and polyadenylation. Mature mRNA is formed by RNA processing, and the protein (amino acid sequence) is formed by translation.

The central dogma of molecular biology posits that the genetic information encoded in a genic DNA sequence is transcribed into messenger RNA (mRNA) and then translated into protein. Transcription factors bind to regulatory elements in the promoter of a gene, and the flanking genomic DNA sequence is transcribed into pre-messenger RNA (pre-mRNA) by an RNA polymerase. This pre-mRNA molecule matures through the removal of introns, resulting in the juxtaposition of exons to form mature mRNA. The mature mRNA is translated into a protein by ribosomes that start from the translation initiation (START) codon and follow the open reading frame (ORF) until the first translation termination (STOP) codon. Transcription and translation are tightly interrelated dynamic processes that are regulated at different levels and rely on various structural features in a gene [2]. Therefore, mutations in a DNA sequence may affect any of these processes, and gRNAs and CRISPR/Cas DNA modifiers may be designed to create specific changes in the expression, structure, stability, activity, and/or function of a protein.
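As an illustration of the flow described above, the following minimal Python sketch splices a toy pre-mRNA and translates the resulting ORF from the first START codon to the first in-frame STOP codon. The sequence, the exon coordinates, and the function names `splice` and `translate` are hypothetical and chosen only for illustration. Sequences are written in the DNA alphabet (T instead of U), and transcriptional regulation, polyadenylation, and splice-site recognition are not modeled.

```python
# Standard genetic code, built from the canonical T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) into a mature transcript."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first ATG to the first in-frame STOP codon."""
    start = mrna.find("ATG")
    if start == -1:
        return ""  # no START codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # STOP codon reached
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene (hypothetical): 5' UTR + exon 1, one 12-nt intron, exon 2 + 3' UTR.
PRE_MRNA = "GG" + "ATGGCT" + "GTAAGTTTTCAG" + "GAATGGTAA" + "CC"
EXONS = [(0, 8), (20, 31)]

mature = splice(PRE_MRNA, EXONS)
print(mature, translate(mature))
print(PRE_MRNA, translate(PRE_MRNA))  # same gene with the intron retained
```

In this toy case, retaining the intron changes the translated sequence, which is the kind of effect that a mutation in a splice donor or acceptor site may have.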

2 Types of Mutations Introduced by CRISPR/Cas Mediated Gene Editing

In its basic form, CRISPR/Cas mediated gene editing introduces a double-stranded DNA break at a specific genomic position defined by the gRNA. Subsequent imperfect repair via non-homologous end-joining (NHEJ) or via homology-directed repair (HDR) results in the introduction of mutations in the target DNA sequence [3]. Mutations induced by CRISPR/Cas occur within a short range flanking the protospacer adjacent motif (PAM) site (e.g., 3–4 bp upstream of the PAM site for Cas9). NHEJ creates allelic series of mutations, typically short deletions and insertions (one up to tens of nucleotides) or a few substitutions. Potential target sites, cleavage efficiencies, and induced scarring patterns may be predicted based on the gRNA sequence using machine learning, but accurate models for plants require further training on large-scale data sets [4, 5]. CRISPR/Cas is widely used to create targeted gene knockouts in several plant species. For example, Wang et al. [6] used CRISPR/Cas to knock out the susceptibility genes of the mildew-resistance locus (MLO) in wheat, generating wheat that is resistant to powdery mildew. Mutations in a gene that result in a non-functional protein encoded by that gene, as described for the mutations in the MLO genes, are called loss-of-function (LOF) mutations and can occur when the ORF downstream of the mutation is disrupted (out-of-frame indel or frameshift) [7].
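The relationship between PAM position, expected cut site, and frameshift potential can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a gRNA design tool: it scans only the forward strand for NGG PAMs, assumes a 20-nt protospacer and a cut exactly 3 bp upstream of the PAM, and ignores cleavage efficiency and off-targets. The function names `cas9_cut_sites` and `indel_class` are hypothetical.

```python
import re


def cas9_cut_sites(seq, spacer_len=20):
    """Return (protospacer, cut_index) for each NGG PAM on the forward strand.

    cut_index is the 0-based index of the first base 3' of the expected
    blunt cut, assumed here to lie 3 bp upstream of the PAM.
    """
    sites = []
    # Zero-width lookahead so that overlapping PAMs are all reported.
    for match in re.finditer(r"(?=[ACGT]GG)", seq):
        pam_start = match.start()
        if pam_start >= spacer_len:  # room for a full-length protospacer
            protospacer = seq[pam_start - spacer_len:pam_start]
            sites.append((protospacer, pam_start - 3))
    return sites


def indel_class(ref_len, alt_len):
    """Classify a length change in coding sequence by its effect on the frame."""
    delta = alt_len - ref_len
    if delta == 0:
        return "substitution or no change"
    # A length change that is not a multiple of 3 shifts the reading frame.
    return "in-frame indel" if delta % 3 == 0 else "frameshift"


# Hypothetical target: 20 nt of protospacer followed by a TGG PAM.
target = "A" * 20 + "TGG" + "CCC"
for protospacer, cut in cas9_cut_sites(target):
    print(protospacer, cut, target[:cut] + "|" + target[cut:])

print(indel_class(ref_len=10, alt_len=9))  # 1-bp deletion
print(indel_class(ref_len=10, alt_len=7))  # 3-bp deletion
```

A frameshift classification only applies when the indel lies within the coding sequence; indels in introns or untranslated regions need to be interpreted in their gene context.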

The CRISPR/gRNA complex may also be used as a location-specific vehicle to deliver DNA sequence modifiers to a given location and modify the primary sequence (like base-editing or prime-editing) or epigenetic state [8, 9]. In base-editing, a cytidine deaminase (C:G-to-T:A) or adenosine deaminase (A:T-to-G:C) is linked to the CRISPR/gRNA complex and is used for the conversion of a single nucleotide at a specific position [8]. Base-editing can be used to create a specific point mutation, which may result in a single amino acid change in the protein sequence, introduce a premature STOP codon, or alter a splice acceptor or donor site [10]. For example, in tomato and potato, base-editing was successfully used to convert a cytosine into a thymine in the acetolactate synthase (ALS) gene, conferring resistance to herbicides [11]. Mutations in a gene that result in an enhanced activity or functionality of the protein encoded by that gene, as described for the mutation in the ALS gene, are called gain-of-function (GOF) mutations.
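To make the premature STOP codon case concrete, the sketch below lists the codons in a coding sequence where a single C-to-T conversion on the coding strand would create a STOP codon (CAA, CAG, and CGA become TAA, TAG, and TGA). It is a minimal illustration under simplifying assumptions: the input is an in-frame coding sequence, and PAM availability, the editing window of the deaminase, and edits on the opposite strand are ignored. The function name `c_to_t_stop_candidates` and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def c_to_t_stop_candidates(cds):
    """Return (codon_index, original_codon, edited_codon) tuples for every
    single C-to-T change on the coding strand that creates a STOP codon."""
    hits = []
    usable_length = len(cds) - len(cds) % 3  # ignore an incomplete last codon
    for i in range(0, usable_length, 3):
        codon = cds[i:i + 3]
        for pos, base in enumerate(codon):
            if base != "C":
                continue
            edited = codon[:pos] + "T" + codon[pos + 1:]
            if edited in STOP_CODONS:
                hits.append((i // 3, codon, edited))
    return hits


# Hypothetical in-frame coding sequence: ATG CAA GGT CGA TTT
for codon_index, original, edited in c_to_t_stop_candidates("ATGCAAGGTCGATTT"):
    print(f"codon {codon_index}: {original} -> {edited}")
```

In practice, a candidate is only usable if the target cytosine falls within the editing window of a gRNA with a suitable PAM, which this sketch does not check.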

While base-editing can only be used for two types of nucleotide substitution, prime-editing can introduce all kinds of predefined mutations, including the deletion, insertion, and/or substitution of specific nucleotides [12]. A prime-editing system consists of a Cas enzyme with nickase activity, a reverse transcriptase, and a prime-editing guide RNA (pegRNA) that specifies the genomic target site and carries a primer binding site and an RNA template encoding the desired edit [13]. Prime-editing has already been used successfully in rice to insert a fragment of up to 15 bp in the OsCDC48-T1 gene [12] and to introduce a triple amino acid substitution in the EPSPS gene, conferring a higher level of glyphosate resistance [14].

CRISPR/Cas and its variants are also able to target DNA sequences at gene regulatory sites, e.g., transcription factor binding sites, splicing sites, and translation initiation and/or termination codons, thus changing gene structural features or coding potential. This will affect the different processes driving transcription, mRNA maturation, and translation (Fig. 5.1). In addition, the epigenetic state can be modulated by fusing the CRISPR protein with an epigenetic modifier that can affect the methylation state at DNA level or affect the methylation and/or acetylation state at nucleosome level (histone modification) [15, 16]. For example, Gallego-Bartolomé et al. [17] were able to reactivate the transcription of the FWA gene by demethylation of the FWA promoter using a dead Cas9 fused to the human demethylase TET2cd.

In short, CRISPR/Cas and its variants can be used to introduce a range of mutations into a DNA sequence that have different effects on the encoded protein. LOF or GOF mutations can be generated to study the role of certain proteins in biological processes, to confer resistance to pathogens or herbicides, to divert the metabolic flux of biosynthesis pathways towards valuable compounds, etc.

3 Screening Methods: Complexity, Resolution, and Scalability

Different screening methods are available to detect the outcome of CRISPR/Cas gene editing, to identify which plant material contains a desired gene edited sequence, and to evaluate the mutation efficiency. Screening methods differ in their detection principle (physical properties of an amplified allele vs sequencing-based), in their scope (targeted screening by local sequencing of the predicted edited site, e.g., amplification or capture of the gRNA binding site and flanking regions, or untargeted screening by global sequencing, e.g., WGS or RNA-Seq), and in their level of throughput and automation (via locus and/or sample multiplexing).

Simply put, any standard molecular detection technique that can discriminate DNA sequence variants (alleles) can also be used to detect CRISPR-induced mutations (Fig. 5.2). PCR-amplification of the target region, coupled to a detection method such as high-resolution melting (HRM) [18], fluorescent probe binding (qPCR or ddPCR [19]), or amplicon length polymorphism (agarose gel-electrophoresis, capillary fragment analysis, or mismatch detection assay [20] (a variant of Cleaved Amplified Polymorphic Sequences (CAPS) markers), or IDAA [20]) can be used to identify mutated alleles (Fig. 5.2). In addition, Kompetitive Allele-Specific PCR (KASP [21]) or primer–extension assays [22] may be used to screen for expected SNPs. These techniques are cheap, easy to implement, and allow for quick routine screening of gene edited mutant collections [23]. However, they only indirectly show the presence of a mutation, and not the actual, exact mutant DNA sequence at the nucleotide level, a prerequisite to interpret the effect of the mutation on the encoded protein.

Fig. 5.2
A schematic representation of the workflow in target mutation. The steps are gene editing, amplification of target gene by amplicon-based methods, sequencing of amplicon by sequencing-based methods, placement of allele into gene context, and translation of mutated sequence.

A targeted mutation screening workflow. Mutations can be introduced into a DNA sequence using CRISPR/Cas and the CRISPR/Cas-based variants base-editing and prime-editing. Screening methods are needed to evaluate if a mutation has occurred. The workflow illustrates the steps needed to (1) identify if a mutation occurred (differential detection based on physical attributes of amplicons); (2) identify which mutation (deletion, insertion, substitution) occurred at nucleotide level (sequencing-based methods); and (3) evaluate the effect of the mutation on the encoded protein sequence (Fig. 5.1), together with the expected outcome of each step.

Amplification and sequencing of target loci of mutants provides information on the specific nucleotides that are deleted, inserted and/or substituted (Fig. 5.2). Sanger (dideoxy-) sequencing generates electropherograms allowing for the determination of the DNA sequence and the identification of mutations [23, 24]. The interpretation of the electropherogram can be challenging, as multiple nucleotides can be called at the same position due to heterozygous insertions, deletions and/or substitutions. Therefore, several computational tools have been developed to deconvolute the electropherograms, such as Tracking of Insertions and Deletions (TIDE) [25], CRISP-ID [26], Deconvolution of Complex DNA Repair (DECODR) [27], and Inference of CRISPR Edits (ICE) [28]. These tools utilize distinct algorithms to analyze electropherograms from a wild-type and a gene edited sample, generating a list with predicted mutated sequences [24]. The sensitivity of Sanger sequencing for alternative alleles in a heterozygous or otherwise mixed sample is about 15% [29]. Consequently, low-efficiency editing is likely to be overlooked. Furthermore, these methods are typically performed with a separate amplification and detection reaction for each sample and each locus (simplex), limiting the scalability for mutation screens to large collections at multiple target loci.

Next-generation sequencing (NGS) allows for massively parallel sequencing and analysis of heterogeneous samples and substantially lowers the per-sample and per-locus costs in high-throughput mutation screens [3]. Because of its deep read coverage, NGS sensitivity for alternative alleles is 0.1–1% and thus enables screening of bulk samples (e.g., protoplasts after transfection), and efficient 1D, 2D or 3D pooling schemes [28]. In addition, multiplex amplicon sequencing combined with incorporation of sample-specific barcodes during library preparation facilitates parallel sequencing at hundreds of loci, in hundreds of samples per sequencing run. NGS yields targeted resequencing data that can be analyzed via bioinformatics tools such as CRISPResso2 [30] and SMAP haplotype-window [31] (Fig. 5.3). In SMAP haplotype-window, sequencing reads are mapped to a reference and the entire read sequence spanning the region between borders (typically the amplicon primer binding sites) is considered as an allele [31]. All the unique alleles are sorted and counted for the calculation of relative allele frequency per locus per sample. A region of interest (ROI) can be defined to focus the analysis on mutations introduced by the gene editing technique in a narrow nucleotide window and ignore additional sequence variants at distance from the edit site. Every allele is compared to the reference in its entirety, allowing for the detection of any combination of insertion, deletion, and/or substitution. SMAP haplotype-window will generate an integrated genotype call table with all the observed alleles per locus per sample. Since it is agnostic to the length of the deletion, insertion, or substitution, it can detect any mutation resulting from an edit in the primary DNA sequence in a given window, as long as the amplicon or read length spans the mutated allele.
SMAP haplotype-window can also process probe-capture enriched, WGS, and RNA-Seq read data from global resequencing screens, for a given list of target loci. PacBio sequencing [32] and nanopore MinION sequencing [33] can be used to detect long-range insertions and deletions, as well as epigenetic DNA modifications introduced by CRISPR/Cas.
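The principle of counting unique alleles between two borders and converting the counts into relative frequencies can be sketched as follows. This is not the SMAP haplotype-window implementation: SMAP works on reads mapped to a reference, whereas this simplified sketch locates the border sequences directly in each read by exact string matching. The function name `allele_frequencies`, the `min_fraction` parameter, and the example reads are hypothetical.

```python
from collections import Counter


def allele_frequencies(reads, left_border, right_border, min_fraction=0.0):
    """Extract the sequence between two border sequences in each read,
    count the unique alleles, and return their relative frequencies."""
    counts = Counter()
    for read in reads:
        start = read.find(left_border)
        end = read.rfind(right_border)
        if start == -1 or end == -1:
            continue  # read does not span both borders
        start += len(left_border)
        if end < start:
            continue  # borders overlap or are in the wrong order
        counts[read[start:end]] += 1
    total = sum(counts.values())
    # most_common() yields alleles sorted by decreasing read count.
    return {
        allele: n / total
        for allele, n in counts.most_common()
        if n / total >= min_fraction
    }


# Hypothetical bulk sample: three wild-type reads, one read with a 1-bp
# deletion, and one read that does not span the locus.
reads = ["ACGTCCCGGGTTGA"] * 3 + ["ACGTCCGGGTTGA", "GGGG"]
for allele, fraction in allele_frequencies(reads, "ACGT", "TTGA").items():
    print(allele, fraction)
```

Treating the whole sequence between the borders as one allele is what makes the approach agnostic to the type and length of the mutation; a frequency cutoff such as `min_fraction` is one possible way to suppress low-frequency sequencing errors.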

Fig. 5.3
A schematic representation of SMAP haplotype-window. The steps involved are sorting and counting of unique sequences and calculation of relative mutation frequency, filtering for mutations in the window around the expected target site, and substitution of the wild-type amplicon by alleles.

SMAP haplotype-window and SMAP effect-prediction can be used to analyze highly multiplexed amplicon sequencing data and to estimate the novel encoded protein sequence. SMAP haplotype-window is a module of the SMAP package that is used to analyze the sequencing reads obtained from NGS. It maps the sequencing reads (in this figure illustrated for a bulk sample) to the reference genome, groups all the alleles with the same mutations, determines a ROI, and calculates the mutation frequency. SMAP effect-prediction is used to provide biological interpretation of the different mutations that were introduced by substituting the wild-type allele with the mutant allele(s) in the reference genome and translating the novel genic sequence. Mutation types include a frameshift mutation, in-frame indel, missense mutation, deletion, silent mutation, and nonsense mutation.

4 Mutation Screening in a Broader Perspective: From Nucleotide to Protein

The current repertoire of CRISPR/Cas DNA modifiers, combined with gRNA specificity, generates a huge array of design possibilities, especially when based on the principles that predict how protein sequences may be altered by editing the genomic nucleotide sequence. A mutation screening workflow that draws on clever CRISPR design should, in turn, be able to consider detected mutated alleles in their respective gene context and classify the mutated alleles based on predefined desired alleles (e.g., a unique base-edit) or on percentage protein sequence similarity to the original wild-type allele [7].

The SMAP effect-prediction module from the SMAP package estimates the novel encoded protein sequence and is the final step in the mutation screening workflow (Fig. 5.3) [31]. SMAP haplotype-window generates a list of all observed haplotypes per locus per sample, which is directly used as input for SMAP effect-prediction, together with all positional information on gene structural features. SMAP effect-prediction replaces a segment of the original reference gene sequence by the observed mutated sequence and evaluates all the splicing sites, the translation initiation codon, open reading frame, and translation termination codon [31]. After translation of the most likely ORF in the mutated allele, the amino acid sequence is aligned to the original reference protein and the percentage protein sequence similarity is estimated as a quantitative score for the remaining protein functionality. Proteins may no longer resemble the original reference protein (frameshift mutation or nonsense mutation), proteins can be identical to the original reference protein (silent mutation), etc. (Fig. 5.3). A threshold value can be set for the percentage sequence similarity to the original reference protein still needed for a protein to perform its function. DNA mutations that result in a protein with a lower percentage of similarity than a given threshold value can be defined as loss-of-function mutations [34].
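The substitute, translate, compare, and threshold logic described above can be illustrated with a small Python sketch. It is not the SMAP effect-prediction implementation: it works on a toy coding sequence without introns, does not evaluate splice sites, and uses the matching blocks reported by Python's `difflib.SequenceMatcher` as a crude stand-in for a protein alignment. The function names, the toy reference, the edit coordinates, and the threshold of 80% are all hypothetical.

```python
from difflib import SequenceMatcher

# Standard genetic code, built from the canonical T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate from the first ATG to the first in-frame STOP codon
    (or to the end of the sequence if no STOP codon is found)."""
    start = cds.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def apply_edit(reference, start, end, replacement):
    """Replace reference[start:end] (0-based, end-exclusive) by the observed sequence."""
    return reference[:start] + replacement + reference[end:]


def percent_similarity(ref_protein, mut_protein):
    """Percentage of reference residues covered by matching blocks."""
    if not ref_protein:
        return 0.0
    matcher = SequenceMatcher(None, ref_protein, mut_protein, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ref_protein)


def classify(similarity, threshold):
    """Label an allele relative to a user-defined similarity threshold."""
    return "candidate LOF" if similarity < threshold else "above threshold"


REFERENCE = "ATGGCTGAATGGAAACCCTAA"  # toy CDS: ATG GCT GAA TGG AAA CCC TAA
REF_PROTEIN = translate(REFERENCE)
EDITS = {
    "silent (GCT>GCC)": (5, 6, "C"),
    "nonsense (TGG>TGA)": (11, 12, "A"),
    "frameshift (1-bp deletion)": (3, 4, ""),
}
for name, (start, end, replacement) in EDITS.items():
    mutant_protein = translate(apply_edit(REFERENCE, start, end, replacement))
    score = percent_similarity(REF_PROTEIN, mutant_protein)
    print(name, mutant_protein, round(score, 1), classify(score, threshold=80.0))
```

The appropriate threshold depends on the protein and on where the mutation falls; a short truncation near the C-terminus and an early frameshift can have very different functional consequences even though both reduce sequence similarity.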

5 Conclusions

CRISPR/Cas and CRISPR/Cas variants are widely used to introduce mutations into a DNA sequence. However, mutations can have different effects on the function of a gene and its encoded protein. Here, we describe a molecular screening workflow that covers the path from CRISPR/Cas and gRNA design, through screening for mutant alleles, to prediction of the effect of the DNA sequence mutation on the encoded protein, all implemented in modules of the SMAP package. By using SMAP haplotype-window and SMAP effect-prediction, the detected mutated alleles are placed in their respective gene context, and the mutated alleles can be classified based on percentage of protein sequence similarity to the original wild-type allele. This high-throughput screening workflow allows for the automated and streamlined screening of multiplex CRISPR experiments in large mutant collections (locus and/or sample multiplexing), and enables fast and easy interpretation of the effect of the mutant alleles on the protein sequence and automated routine identification of carriers of desired alleles.