Background

Carefully designed forward-genetic screens have been an integral part of research programs for decades and remain an important tool for resolving biological pathways. Many proteins contributing to plant immune signalling have been discovered through such screens. As one example, the receptor kinase FLAGELLIN SENSING 2 (FLS2) was identified from a mutagenized Arabidopsis thaliana (hereafter, Arabidopsis) population as the receptor for bacterial flagellin [1]. The discovery of FLS2 and other surface-localized immune receptors that detect conserved molecular features of microbes (known as pathogen-associated molecular patterns; PAMPs) revolutionized our understanding of plant immunity [2], and reinforces the importance of genetic screens in modern research.

Genetic screens in all systems are based on similar principles. Individuals containing a phenotype of interest are first isolated from a mutagenized or naturally polymorphic population. Marker-assisted linkage analysis is then performed to identify the genomic region containing the underlying mutation(s). Finally, mutations are identified by sequence analysis and the causative mutation is usually confirmed by complementation with a non-mutated (wild-type) copy of the gene.

The most commonly used mutagenesis strategies in Arabidopsis include the induction of guanine-to-adenine substitutions using ethyl methanesulfonate (EMS) or the insertion of transfer-DNA using Agrobacterium tumefaciens-mediated transformation [3]. The number of mutations identified in a mapped region depends primarily on the mutagenesis. Increasing the strength of the mutagen will likely result in the recovery of more mutants containing the phenotype of interest; however, this also results in more mutations in each mutant genome and can complicate correct gene identification. Mapping mutations by position classically involves out-crossing to a polymorphic ecotype and linking the phenotype of recombinant F2 individuals to molecular markers with known genomic positions, such as insertions/deletions (indels) or single nucleotide polymorphisms (SNPs). Rather than genotyping individual recombinants exhibiting the scored phenotype, linkage analysis can be performed on bulked recombinants. This process, referred to as bulk segregant analysis (BSA) [4], eases genetic analysis and is particularly effective when a large number of molecular markers are available. Although robust, these classical methods are time-consuming and labour-intensive, commonly taking more than a year (in the case of Arabidopsis) to correctly identify the causative mutation. Positional cloning can be particularly tedious in suppressor or modifier screens, where multiple loci are segregating in the mapping population. Correct identification of causative mutations depends greatly on the strength of the mutagenesis, the complexity of the cross, the penetrance of the phenotype, and the availability of molecular markers.

Recent advances in high-throughput sequencing (HTS) technologies have allowed for rapid identification of causative mutations and have been widely adopted in many fields [5]. HTS approaches have many advantages over classical mapping strategies, including a potential reduction in time and personnel needed to identify a causative mutation. Combining HTS with BSA has proven to be particularly useful [5]. Several expert-operator methods utilizing HTS of bulked segregants have been described to assist plant genetic screens, including SHOREmap [6],[7], Next-Generation Mapping (NGM) [8], MutMap [9]-[11], and others [5]. However, each of these is either restricted to certain types of crosses or difficult to use for non-experts, including bench-trained biologists.

The two major tools used by plant researchers start with a classical mapping approach, requiring data generated from out-crosses. SHOREmap [6],[7] uses a statistic that explicitly calculates the relative abundance of alleles identified in bulked out-crossed F2 populations and therefore relies on a priori knowledge of polymorphic allele positions in both the parental and out-crossed ecotypes. As a result, this powerful tool cannot be used when such crosses are not performed (for example, in back-crossed populations) or when genetic marker resources are not yet available. In contrast, the marker-independent, web-based method NGM [8] does not rely on previous knowledge of ecotype-specific polymorphisms, but rather uses the ratio of the expected allele frequency of the causative mutation in bulked out-cross F2 segregants relative to the background allele frequency of unrelated mutations. This two-step process relies first on identifying a coarse mapping interval of relative SNP paucity in which the mutation should lie; this region is usually on the order of megabases in length [8]. To further reduce the width of the coarse interval and ease identification of the causative mutation, the SNP frequencies in different bands of SNP allele frequencies (overlapping groupings of similar allele frequencies; e.g., 0.5-0.6, 0.51-0.61, etc.) are compared. The point that maximises the ratio between bands representing homozygous alleles and heterozygous alleles at 50% frequency is expected to be at or around the causative mutation. The MutMap [10],[11] and CloudMap [12] systems avoid the need for coarse mapping as an initial step but do not exist as user-friendly tools, so applying them and optimising their parameters requires extensive expertise in a command-line computing environment.

We recently conducted a forward-genetic modifier screen in the immune-deficient bak1-5 background to identify novel components involved in plant immune signalling [13]. BRASSINOSTEROID INSENSITIVE 1-ASSOCIATED KINASE 1 (BAK1) is a multi-functional co-receptor that interacts with and phosphorylates several surface-localized immune receptors including FLS2 [14]-[19]. Accordingly, loss-of-function bak1 alleles are strongly impaired in signalling triggered by several PAMPs [15],[17]-[19]. We mutagenized bak1-5 seeds (in the Columbia-0 (Col-0) ecotype) with EMS and screened the M2 generation for modifier of bak1-5 (mob) mutants that restored immune signalling. To uncover the causal mob mutation(s), we back-crossed bak1-5 mob mutants to the parental line (bak1-5) and Illumina sequenced bulked F2 mob segregants. Importantly, as the parent was itself generated through EMS-mutagenesis of a transgenic Arabidopsis line [15],[20],[21], we additionally sequenced bak1-5, which had been back-crossed for three generations prior to mutagenesis, as a reference.

We chose this approach over out-crossing to ease phenotyping and segregation analysis. First, selection of the mob mutant phenotype required scoring a quantitative response dependent on immune receptor activity, which varies in different ecotypes. For example, FLS2 in the Wassilewskija-0 (Ws-0) ecotype contains a deletion mutation resulting in a truncated and non-functional FLS2 receptor [22], while FLS2 in the Landsberg erecta-0 (Ler-0) ecotype contains polymorphisms that cause FLS2 to bind flagellin about three times more strongly than FLS2 in Col-0 [23]. Second, selection of the mob mutant phenotype was dependent on bak1-5, which would not be present in a polymorphic ecotype such as Ler-0 and would thus need to be genotyped prior to phenotype scoring. Similar considerations would likely arise in any screen involving second-site modifier or suppressor mutations.

While a back-cross simplifies genetic analysis, bulk segregant sequence analysis is complicated by far fewer segregating SNPs (1 SNP every 65,000 bp) compared to out-crosses (1 SNP every 900 bp) (Additional file 1). Although few in number, we found that simply plotting the position of SNPs with close-to-homozygous alternate allele frequencies along the chromosomes was a convenient and easy way of performing bulk segregant linkage analysis from a back-crossed population. We developed this method into a user-friendly web-based application, called CandiSNP, which generates density plots from SNP data obtained from HTS. We demonstrate the utility of CandiSNP by analysing sequence data generated from two allelic mob mutants, bak1-5 mob1 and bak1-5 mob2, which are caused by mutations in the gene encoding the calcium-dependent protein kinase CPK28 [13].

Implementation

CandiSNP is part of a straightforward and flexible workflow

To provide a publicly available easy-to-use tool for mapping mutations, we developed the CandiSNP web application (Figure 1). Prior to using CandiSNP, users must identify SNP positions. Typically this would be done in a workflow that maps quality-controlled (QC) reads to the appropriate reference genome and provides SNP position data as input (Figure 2A). The data must be provided in a simple comma-delimited format and must have the following column headers: ‘Chr’, ‘Pos’, ‘Ref’, ‘Alt’ and ‘Allele_Freq’ (meaning: Chromosome, Position, Reference base, Alternate base, and Allele Frequency, respectively). To create CandiSNP input files, we suggest using the pileups_to_snps.rb Ruby script (Additional file 2 and https://github.com/danmaclean/candisnp/blob/master/pileup_to_snps.rb). By using this generic and flexible input format, our system allows the user to take advantage of existing pipelines and data, including files generated by external service providers or even datasets originating from other technologies.
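The input format described above can be checked before upload with a few lines of code. The following is a minimal sketch in Python, not part of CandiSNP itself: the function name validate_snp_csv is hypothetical, the example rows are invented for illustration, and expressing Allele_Freq as a fraction between 0 and 1 is an assumption rather than something stated in the text.

```python
import csv
import io

# The five column headers required by the comma-delimited input format.
REQUIRED_HEADERS = ["Chr", "Pos", "Ref", "Alt", "Allele_Freq"]


def validate_snp_csv(handle):
    """Parse a comma-delimited SNP file and check the required headers.

    Hypothetical helper: returns a list of dicts with typed values, or
    raises ValueError if a required column header is missing.
    """
    reader = csv.DictReader(handle)
    missing = [h for h in REQUIRED_HEADERS if h not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    rows = []
    for row in reader:
        rows.append({
            "Chr": row["Chr"],
            "Pos": int(row["Pos"]),
            "Ref": row["Ref"],
            "Alt": row["Alt"],
            # Assumption: allele frequency is stored as a fraction (0 to 1).
            "Allele_Freq": float(row["Allele_Freq"]),
        })
    return rows


# Invented example rows, for illustration only.
example = io.StringIO(
    "Chr,Pos,Ref,Alt,Allele_Freq\n"
    "Chr5,1000,G,A,0.93\n"
    "Chr1,2000,C,T,0.48\n"
)
snps = validate_snp_csv(example)
print(len(snps), "SNPs parsed")
```

Any pipeline that can emit these five columns, whatever the variant caller, can therefore feed the application.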

Figure 1

Screen-shot of the CandiSNP web application. CandiSNP is openly accessible online at http://candisnp.tsl.ac.uk. The application is laid out so users can make their way through the application in numbered steps. Users choose which genome they would like to use for comparison (the program currently supports Arabidopsis, rice, tomato, grape, maize and soybean genomes). The option of filtering SNPs concentrated around centromeres is also provided. Users then upload their SNP data file, indicate their preferred allele frequency cut-off, and choose from a number of different palettes for SNP visualization.

Figure 2

Bioinformatics pipeline for sequence analysis. Pipeline indicating the preparatory steps required by the user (A) prior to running the CandiSNP web application (B).

CandiSNP analysis is a two-step process (Figure 2B). CandiSNP first uses snpEff [24] to categorise SNPs according to their position in genomic features. The SNP predictions are categorised as: (1) Causing a change in an intergenic (non-annotated) region, (2) Causing a synonymous change in an annotated protein-coding region, or (3) Causing a non-synonymous change in an annotated protein-coding region. CandiSNP then creates a chromosome map visualizing the position of all SNPs meeting a user-selected (and customizable) alternate allele frequency (AF) threshold, and renders this information by colouring SNPs according to category. SNPs in category (3) are represented in a colour that highlights their priority as putative causative SNPs. The density and distribution of SNPs are also visualized as a line graph below each chromosome. The user is provided with a downloadable list of genomic effects for all inputted SNPs and with a publication-quality Scalable Vector Graphics (SVG) figure that can be easily exported.
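The second step can be illustrated with a short sketch. It assumes each SNP already carries one of the three categories (the annotation itself is performed by snpEff and is not reproduced here). The class and function names, the colour choices, the inclusive threshold comparison and the example values are all assumptions made for illustration, not CandiSNP's own code.

```python
from dataclasses import dataclass

# Hypothetical palette: category (3) gets a highlight colour.
PALETTE = {
    "intergenic": "grey",
    "synonymous": "grey",
    "non_synonymous": "red",
}


@dataclass
class AnnotatedSnp:
    chrom: str
    pos: int
    allele_freq: float  # assumed to be a fraction between 0 and 1
    category: str       # "intergenic", "synonymous" or "non_synonymous"


def select_for_plot(snps, af_threshold):
    """Keep SNPs at or above the AF threshold and attach a plotting colour."""
    kept = [s for s in snps if s.allele_freq >= af_threshold]
    return [(s.chrom, s.pos, PALETTE[s.category]) for s in kept]


def candidates(snps, af_threshold):
    """Return non-synonymous SNPs that pass the threshold (priority candidates)."""
    return [
        s for s in snps
        if s.allele_freq >= af_threshold and s.category == "non_synonymous"
    ]


# Invented example data, for illustration only.
demo = [
    AnnotatedSnp("Chr5", 100, 0.95, "non_synonymous"),
    AnnotatedSnp("Chr5", 200, 0.80, "intergenic"),
    AnnotatedSnp("Chr1", 300, 0.50, "non_synonymous"),
]
plot_points = select_for_plot(demo, 0.75)
priority = candidates(demo, 0.75)
print(plot_points)
```

Separating the threshold filter from the colouring keeps the sketch close to the description: the threshold decides what is drawn, the category decides how.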

Currently, CandiSNP supports several plant genomes including Arabidopsis thaliana Col-0 TAIR9 and TAIR10 [25], Oryza sativa v7 [26], Solanum lycopersicum v2.40 [27], Glycine max 1.09v8 [28], Vitis vinifera v1 [29] and Zea mays B73 v5b [30].

Design and availability

CandiSNP is available in multiple formats for users with diverse security and confidentiality needs and differing access to computational infrastructure. Primarily, CandiSNP is provided as a web application, available at http://candisnp.tsl.ac.uk. The web application takes text files as input. Instructions are provided on-screen to assist users new to the tool. The web application requires no registration and does not collect user information. For laboratories with bioinformatics support wishing to use an internal and private version of the web application, we provide a package and source code in Perl/HTML/Javascript for free download and use under the GNU GPL3 Licence, from the dedicated code hosting website GitHub at https://github.com/danmaclean/candisnp. For those wishing to run the CandiSNP process on a command line as part of bioinformatics pipelines, a Perl module is also available as part of the source code.

Results and discussion

Case study

Bulk segregant analysis of two mob mutants using Illumina sequencing identifies thousands of polymorphisms

As a case study for CandiSNP we examined HTS data obtained for two allelic recessive mutants, bak1-5 mob1 and bak1-5 mob2, that were isolated from the modifier of bak1-5 (mob) screen [13]. Both bak1-5 mob1 and bak1-5 mob2 were back-crossed with bak1-5 and the F2 populations were screened for the mob phenotype (Figure 3A). F2 segregants that displayed the mob phenotype were bulked and genomic DNA was isolated. The bak1-5 parental genome was prepared by harvesting individuals from a homozygous back-crossed line. DNA samples were sent to the Beijing Genomics Institute (BGI, Hong Kong) for library construction and 90 bp paired-end sequencing on the Illumina HiSeq platform.

Figure 3

Pipeline for bulking segregants and identification of unique SNPs. (A) The recessive bak1-5 mob1 and bak1-5 mob2 mutants were back-crossed to the parent bak1-5, allowed to self-fertilize in the F1, and phenotypically scored in the F2 for the mob phenotype. Positive segregants were bulk harvested and genomic DNA was prepared and sequenced using the Illumina HiSeq platform. For comparison, the bak1-5 genome was also sequenced. A similar genetics pipeline could be employed for dominant mutants, but material would need to be bulked from segregants that were phenotypically verified as homozygous in the F3 generation. (B) A three-way comparison between the bak1-5, bak1-5 mob1, and bak1-5 mob2 genomes identified the total number of unique SNPs in each genome.

With the aid of FASTQC v 0.10.1 [31] in the Galaxy platform [32]-[34], all reads were quality controlled (QC) so that reads that contained undefined nucleotides, were not 90 bp long, or were full-length homopolymer runs were removed. Reads containing nucleotides with a Sanger-scaled PHRED quality score of less than 10 at the 3′ end were trimmed to this minimum using Sickle version 1.21.0 [35]. The QC pipeline is available as a Galaxy workflow at http://dx.doi.org/10.6084/m9.figshare.1248898.
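The read-level exclusion criteria above can be expressed as a simple predicate. This is a hedged sketch only: the function name passes_qc is hypothetical, any character other than A, C, G or T is assumed to count as an undefined nucleotide, and the 3′ quality trimming performed with Sickle is deliberately not reproduced.

```python
def passes_qc(seq, expected_len=90):
    """Return True if a read survives the three exclusion criteria.

    A read is rejected if it is not exactly expected_len bases long,
    contains an undefined nucleotide, or is a full-length homopolymer.
    """
    seq = seq.upper()
    if len(seq) != expected_len:
        return False
    # Assumption: anything other than A, C, G, T is "undefined".
    if set(seq) - set("ACGT"):
        return False
    # A full-length homopolymer uses a single distinct base.
    if len(set(seq)) == 1:
        return False
    return True


# Invented reads, for illustration only.
good_read = "ACGT" * 22 + "AC"        # 90 bases, mixed composition
homopolymer = "A" * 90                # 90 bases, single base
with_n = "N" + "ACGT" * 22 + "A"      # 90 bases, one undefined base
short_read = "ACGT" * 20              # 80 bases

for name, read in [("good", good_read), ("homopolymer", homopolymer),
                   ("with_n", with_n), ("short", short_read)]:
    print(name, passes_qc(read))
```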

QC reads were then mapped to the TAIR10 genome [25] using the BWA v 0.6.1 aligner [36]. For the bak1-5 genome, we aligned 47.9 million 90 bp paired-end reads, with a mean insert size of 467 bp, to the TAIR10 Arabidopsis reference sequence (98.5% of reads aligned). We similarly aligned 47.2 million 90 bp paired-end reads, with a mean insert size of 452 bp, to TAIR10 for the bak1-5 mob2 genome (99.6% of reads aligned). Average alignment depth over the nuclear chromosomes was 36 for bak1-5 and 59.3 for bak1-5 mob2. Details regarding the bak1-5 mob1 genome sequence have been previously described [13].

After alignment, SNPs were identified and allele frequencies were calculated using SAMtools v 0.1.8 [37]. Reads with mapping quality scores less than 20 and individual bases with sequence quality less than 20 were discarded. Genome positions where the reference base was unknown were excluded. Positions were considered SNPs if they had a minimum read coverage of 6 and a maximum of 250. The alignment and SNP calling workflow is available at http://dx.doi.org/10.6084/m9.figshare.1171109. In total, bak1-5 contained 2,639 SNPs compared to Col-0, while bak1-5 mob1 and bak1-5 mob2 contained 4,188 and 3,581 SNPs, respectively (Table 1).
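The position-level criteria can be sketched as follows. This is not the SAMtools algorithm: it assumes a hypothetical per-position record holding the reference base and the counts of each base observed among reads that already passed the mapping-quality and base-quality filters, and it treats any non-reference base as evidence for a SNP, which is a simplification. Only the coverage bounds of 6 and 250 are taken from the text.

```python
MIN_COVERAGE = 6     # minimum read coverage stated in the text
MAX_COVERAGE = 250   # maximum read coverage stated in the text


def call_snp(ref_base, base_counts):
    """Return (alt_base, allele_freq) for one position, or None.

    base_counts maps observed bases to counts among quality-filtered
    reads (hypothetical input structure, for illustration only).
    """
    ref = ref_base.upper()
    # Exclude positions where the reference base is unknown.
    if len(ref) != 1 or ref not in "ACGT":
        return None
    depth = sum(base_counts.values())
    if not MIN_COVERAGE <= depth <= MAX_COVERAGE:
        return None
    alt_counts = {b: n for b, n in base_counts.items() if b != ref and n > 0}
    if not alt_counts:
        return None
    # Simplification: report only the most frequent alternate base.
    alt_base = max(alt_counts, key=alt_counts.get)
    return alt_base, alt_counts[alt_base] / depth


# Invented counts, for illustration only.
print(call_snp("G", {"G": 2, "A": 18}))   # alternate allele in 18 of 20 reads
print(call_snp("N", {"A": 20}))           # unknown reference base
print(call_snp("G", {"A": 5}))            # coverage below the minimum
```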

Filtering of non-unique SNPs in the mutants reduces complexity

To reduce the complexity of the bak1-5 mob1 and bak1-5 mob2 datasets, we compared SNP calls from the different genome sequences and removed SNPs that either mob mutant shared with the other or with bak1-5. Comparing the bak1-5 mob genomes to the parental bak1-5 genome identified 2,111 and 2,132 SNPs shared between bak1-5 and either bak1-5 mob1 or bak1-5 mob2, respectively (Figure 3B). There were an additional 444 SNPs that were shared between the bak1-5 mob1 and bak1-5 mob2 genomes that were not identified in bak1-5. We reasoned that these shared SNPs were contributed by bak1-5 but were not identified due to low sequence coverage in those areas. These analyses identified 2,746 polymorphisms that were shared between at least two of the genomes. Discarding these allowed us to identify over 1,000 SNPs that were uniquely present in bak1-5 mob1 and bak1-5 mob2 (Figure 3B). The general value of comparing multiple mutant sequences to remove shared SNPs has been previously demonstrated [12] and the general case is discussed in Additional file 3. However, deleting all common SNPs in this way precludes identification of identical causative SNPs in different mutants, which, while extremely rare, is something to consider prior to performing such analysis.
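The three-way comparison amounts to set arithmetic on SNP calls. The sketch below is illustrative only: the function name unique_snps is hypothetical, identifying a SNP by (chromosome, position, alternate base) is an assumption, and the example calls are invented.

```python
def unique_snps(parent, mutant_a, mutant_b):
    """Split SNP calls into those unique to each mutant and those shared.

    Each argument is a set of (chrom, pos, alt) tuples. A SNP is 'shared'
    if it appears in at least two of the three genomes.
    """
    shared = (parent & mutant_a) | (parent & mutant_b) | (mutant_a & mutant_b)
    return mutant_a - shared, mutant_b - shared, shared


# Invented SNP calls, for illustration only.
parent = {("Chr1", 100, "A"), ("Chr2", 200, "T")}
mob1 = {("Chr1", 100, "A"), ("Chr3", 300, "A"), ("Chr5", 500, "A")}
mob2 = {("Chr2", 200, "T"), ("Chr3", 300, "A"), ("Chr5", 900, "T")}

only_mob1, only_mob2, shared = unique_snps(parent, mob1, mob2)
print(sorted(only_mob1), sorted(only_mob2), len(shared))
```

Note that the SNP at Chr3:300 in this toy example is removed even though it is absent from the parent, mirroring the treatment of SNPs shared only between the two mutants.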

CandiSNP enables easy visual assessment of SNP positions and finds genomic regions with low recombination linked to the phenotype of interest

Positional cloning is based on linking phenotypes to molecular markers with known genomic positions. If the mutation is recessive, F2 recombinants that display the phenotype of interest are homozygous for the unknown mutation. Molecular markers that are invariably homozygous for the mutant type are therefore linked to the scored phenotype and thus to the mutation. As we conducted a back-cross rather than an out-cross, the only molecular markers we could use were those resulting from the comparative analysis just described between the parental and mutant genomes.

To identify which of the over 1,000 unique SNPs in the mutant genomes are linked to the scored phenotype, it is necessary to determine which SNPs are homozygous or close-to-homozygous. Although bulking mutants increases the likelihood of identifying homozygous SNPs (in theory, with an allele frequency of 100%), some margin of error must be allowed to account for sequencing and phenotyping errors. CandiSNP facilitates the easy discovery of a useful frequency cut-off by allowing the user to iteratively refine the allele frequency and view a new plot concurrent with previous ones for comparison. As a further refinement of the CandiSNP web application, we included the option of removing SNPs concentrated around centromeres (in organisms where a centromere is defined in the genome assembly), as these are areas of low recombination frequency and tend to skew density analysis. After selecting an allele frequency threshold, CandiSNP plots the positions of retained SNPs as dots across the chromosomes and highlights SNPs of different classes according to a selected palette. The per-chromosome density and distribution of SNPs is rendered in a second plot to aid in cases with high numbers of SNPs. In our case study we chose 75% as an acceptable frequency cut-off, and used CandiSNP to identify 88 and 143 unique SNPs meeting this requirement in the bak1-5 mob1 (Additional file 4) and bak1-5 mob2 (Additional file 5) genomes, respectively (Table 1). CandiSNP further identified 9 and 31 candidate causative SNPs (those which cause non-synonymous changes in protein-coding regions) for bak1-5 mob1 (Table 2; Additional file 4) and bak1-5 mob2 (Table 3; Additional file 5). By choosing a palette to highlight candidate SNPs (shown in our case study as red dots) we observe putative map positions for both mutants at the bottom of chromosome 5 (Figure 4).
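The iterative refinement of the allele frequency cut-off, the optional centromere filter and the density summary can be sketched together. Everything named here is hypothetical: the functions retained and window_counts, the centromere interval, the 1 Mb window size and the example SNPs are assumptions chosen for illustration, and the code does not reproduce CandiSNP's own plotting.

```python
from collections import Counter


def retained(snps, af_threshold, centromeres=None):
    """Keep (chrom, pos, af) tuples passing the threshold and the mask.

    centromeres maps a chromosome name to a (start, end) interval to
    exclude; chromosomes without an entry are not masked.
    """
    centromeres = centromeres or {}
    kept = []
    for chrom, pos, af in snps:
        if af < af_threshold:
            continue
        if chrom in centromeres:
            start, end = centromeres[chrom]
            if start <= pos <= end:
                continue
        kept.append((chrom, pos, af))
    return kept


def window_counts(snps, window=1_000_000):
    """Count retained SNPs per fixed-size window along each chromosome."""
    return Counter((chrom, pos // window) for chrom, pos, _ in snps)


# Invented SNPs and a made-up centromere interval, for illustration only.
demo_snps = [
    ("Chr5", 26_100_000, 0.92),
    ("Chr5", 26_900_000, 0.81),
    ("Chr5", 11_500_000, 0.88),
    ("Chr1", 3_000_000, 0.55),
]
demo_centromeres = {"Chr5": (11_000_000, 13_000_000)}

# Try several cut-offs and compare how many SNPs survive each one.
for threshold in (0.50, 0.75, 0.90):
    kept = retained(demo_snps, threshold, demo_centromeres)
    print(threshold, len(kept), dict(window_counts(kept)))
```

In this toy data the two SNPs clustering in one window of chromosome 5 stand out once the cut-off reaches 0.75, which is the kind of signal the density plot is meant to expose.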

Table 1 Identification of unique and candidate SNPs in the parental and mutant genomes
Table 2 Candidate causative SNPs in bak1-5 mob1
Table 3 Candidate causative SNPs in bak1-5 mob2
Figure 4

Chromosome 5 SNP density plots for bak1-5 mob1 and bak1-5 mob2. All SNPs with allele frequencies >75% are plotted in grey, while candidate causative SNPs (defined as those causing non-synonymous changes in gene-coding regions) are plotted in red. The position of CPK28/At5g66210 is indicated.

Therefore, CandiSNP visualizes the location of SNPs linked to a phenotype of interest. Moreover, CandiSNP provides annotations describing the genomic feature in which each SNP is located. This function provides useful information for biologists who can make conceptual links between biological knowledge and molecular function of the genomic feature and refine candidate lists further. On its own this annotation function provides a fast and easy way of finding the effect of any mutation in the supported genomes, enhancing the usefulness of CandiSNP beyond that of mutant mapping.

Fine mapping and confirmation of causative SNPs

We confirmed the presence of candidate causative SNPs by Sanger sequencing polymerase chain reaction (PCR)-generated amplicons containing the predicted mutations prepared from individual back-crossed F3 bak1-5 mob1 and bak1-5 mob2 plants compared to the bak1-5 parent (Tables 2 and 3). Using these SNPs as molecular markers allowed us to further map the mutations by position and narrowed the list of candidate causative SNPs down to 3 in bak1-5 mob1 and 6 in bak1-5 mob2. Primers used for this analysis are available in Additional file 6. Alternative methods for such analysis could include other allele-specific genotyping methods such as designing cleaved amplified polymorphic sequence (CAPS) markers [38] or conducting high-resolution melting (HRM) analysis [39] on PCR amplicons. We previously reported genetic confirmation that the polymorphic CPK28 alleles contained within these lists of candidate SNPs were causative of the bak1-5 mob1 and bak1-5 mob2 mutant phenotypes [13]. Our analysis was simplified by knowledge of allelism between the two mutants, which clearly indicated CPK28 as the causative locus. In the absence of such knowledge, marker-assisted genotyping of additional homozygous F3 lines could further reduce the number of candidate mutations and ease genetic confirmation.

CandiSNP performance

As CandiSNP is a predictive classification method for determining whether a given SNP is a causative mutation, it is important that we estimate the accuracy of the classifications made. One useful post hoc approach for assessing the overall accuracy of the analysis, rather than each individual prediction, is to construct a receiver operating characteristic (ROC) curve [40]. In such an analysis, a set of independently verified ‘true positive’ results is compiled and the ability of the classifier to recall these at different parameters is plotted. To assess CandiSNP we used the bak1-5 mob1 and bak1-5 mob2 verified causative SNPs and varied the allele frequency parameter to carry out a standard ROC analysis [40] (Additional file 7). True positives were defined as verified causative SNPs and false positives were defined as any non-causative SNP identified by CandiSNP regardless of location and category. False negatives were defined as causative SNPs not included at that threshold and true negatives as any position in the genome where a SNP was not identified (i.e., genome size minus false positives). Sensitivity was calculated as the number of true positives divided by the total of true positives and false negatives. Specificity was calculated as the number of true negatives divided by the total of false positives and true negatives. Sensitivity assesses the ability of CandiSNP to recall the verified SNPs whilst specificity assesses the ability of CandiSNP to exclude non-causative SNPs.
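The definitions above translate directly into a calculation of one ROC point per allele frequency threshold. The following sketch is illustrative: the function name roc_point, the toy genome size and the example SNPs are assumptions, and the inclusive threshold comparison is a choice made here rather than one stated in the text.

```python
def roc_point(snps, causative, af_threshold, genome_size):
    """Return (sensitivity, specificity) at one allele frequency threshold.

    snps maps a SNP key to its allele frequency; causative is the set of
    verified causative SNP keys. True negatives follow the definition in
    the text: genome size minus false positives.
    """
    if not causative:
        raise ValueError("at least one verified causative SNP is required")
    kept = {key for key, af in snps.items() if af >= af_threshold}
    tp = len(kept & causative)
    fp = len(kept - causative)
    fn = len(causative - kept)
    tn = genome_size - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Invented SNPs and a toy genome size, for illustration only.
toy_snps = {("Chr5", 100): 0.90, ("Chr1", 50): 0.80, ("Chr2", 70): 0.50}
toy_causative = {("Chr5", 100)}
TOY_GENOME_SIZE = 1_000_000

for threshold in (0.50, 0.75, 0.95):
    sens, spec = roc_point(toy_snps, toy_causative, threshold, TOY_GENOME_SIZE)
    print(threshold, sens, spec)
```

Because true negatives are counted against the whole genome, specificity stays close to 1 at every threshold in this toy example, which illustrates the masking effect of a very high true negative count.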

For our test case, the sensitivity of CandiSNP drops completely for allele frequencies over 75% (Additional file 7), indicating that accounting for phenotype penetrance and sequencing errors is an important factor in the pipeline. Further, rather than reducing the number of errors, setting an overly stringent allele frequency causes the pipeline to fail by screening out real candidates. Specificity remains high across all possible allele frequencies, mostly due to the masking effect of a very high true negative count. The closely related false positive score shows a decrease to less than 25% of the candidate SNP list after an allele frequency of 62%. While the absolute optimum for our data is at 75% (i.e., the maximisation of sensitivity and minimisation of false positives), taken as a whole the ROC analysis indicates that allele frequencies of 60% to 75% represent a likely ‘best trade-off’ window for CandiSNP analysis.

Conclusions

Genetic screens have revealed important regulators of signal transduction pathways and remain an important tool in modern research. Although greatly accelerated with the advent of HTS technologies, correct identification of causative mutation(s) remains a bottleneck in forward-genetics. To increase the repertoire of programs available to plant geneticists, we developed the CandiSNP web application, which is particularly useful for datasets containing few SNPs. In our test case, CandiSNP successfully identified causative SNPs in two recessive mutants after bulking phenotypically homozygous F2 segregants generated from a back-cross. We propose that CandiSNP could additionally be used to identify causative SNPs in dominant mutants as long as they are verified to be phenotypically homozygous in the F3 prior to bulk segregant analysis. By plotting homozygous and close-to-homozygous SNPs identified from HTS along the chromosome arms, the program visualizes areas of linkage and easily narrows down candidate mutation positions. CandiSNP is both fast and accurate, producing high-quality editable graphics in a matter of minutes. CandiSNP is a user-friendly web application that will facilitate gene discovery in plant genetic screens.

Availability requirements

Project name: CandiSNP.

Project home page: http://candisnp.tsl.ac.uk

(Source code is available under the GPLv3 open-source license at https://github.com/danmaclean/candisnp).

Operating system(s): All systems capable of running a modern web-browser.

Other requirements: Internet connection.

License: GPL3 (http://www.gnu.org/licenses/licenses.html).

Any restrictions: None.

Additional files