Rapid fine mapping of causative mutations from sets of unordered, contig-sized fragments of genome sequence
Traditional map-based cloning approaches, used for the identification of desirable alleles, are extremely labour intensive, and years can elapse between mutagenesis and detection of the causal polymorphism. The high-throughput sequencing based mapping-by-sequencing approach requires an ordered genome assembly and cannot be used with fragmented, unscaffolded draft genomes, which limits its application to model species and precludes many important organisms.
We addressed this resource gap by developing a computational method and software implementation called CHERIPIC (Computing Homozygosity Enriched Regions In genomes to Prioritise Identification of Candidate variants). We validated CHERIPIC using three different types of bulk segregant sequence data, from Arabidopsis, maize and barley, respectively.
CHERIPIC allows users to rapidly analyse bulk segregant sequence data. It is available as a pre-packaged binary with all dependencies for Linux and MacOS and as a Galaxy tool.
Keywords: Mapping by sequencing; Next generation mapping; Bulk segregant analysis
Abbreviations
BSS: Bulk segregant sequencing
CHERIPIC: Computing homozygosity enriched regions in genomes to prioritise identification of candidate variants
HMES: Homozygosity enrichment score
MBC: Map based cloning
MBS: Mapping by sequencing
Forward genetic screens are an essential and widely used tool for the identification of alleles underlying desirable traits. The successful identification and cloning of genes from these screens depends on the availability of polymorphic molecular markers between two accessions and a physical map of those markers. A mapping population is created by crossing a mutated plant from one polymorphic accession with another. Map based Cloning (MBC) involves screening either individuals or pooled individuals from segregating populations using defined markers to identify regions of the genome with limited or no recombination due to linkage disequilibrium, and so to refine the mutant position [3, 4, 5, 6]. Traditional MBC is labour intensive, and years can elapse between mutagenesis and detection of the responsible polymorphism. Mapping-by-sequencing (MBS) is a high-throughput sequencing based mutation mapping approach that has shortened this process considerably by allowing the calculation of allele frequency from bulks and the identification of causal mutations at single-nucleotide resolution. However, MBS requires a chromosome-ordered genome assembly and cannot be used with fragmented, unscaffolded draft genomes, which limits its application to model species and precludes many important organisms. Most crops and their wild relatives have complex genomes that are difficult to order, so their genetic and genomic resources remain at the draft stage. Carrying out mapping studies without the necessary genetic resources is cumbersome; this limits the number of mapping studies in non-model organisms and leaves a store of potential disease resistances and other agronomically important traits unavailable.
We present a computational method and software implementations to address this problem, called CHERIPIC (Computing Homozygosity Enriched Regions In genomes to Prioritise Identification of Candidate variants). CHERIPIC uses short contig fragments (such as those from a first-pass assembly of Illumina data or a PacBio sequence run) from bulk segregant sequencing (BSS) experiments to call variants and to reduce the list of candidates to a few closely linked variants in the region harbouring the trait of interest; in some cases this list includes the candidate mutation itself. We successfully applied CHERIPIC to fragmented assemblies made from publicly available high-throughput BSS data for Arabidopsis, maize and barley. CHERIPIC improves on previous methods by being agnostic to input type (it works well on both genome-seq and RNA-seq data), having extremely low computational requirements, and being available for direct use through a web interface.
We developed a rapid HMES sorting algorithm to arrange unordered fragments into a sequence that reflects distance from the causative mutation, though not necessarily their original order in the genome (Algorithm 1). In broad terms, the algorithm places the fragment with the largest HMES at the centre, the second largest to its left, the third largest to its right, the fourth largest to the extreme left, and so on. The result is a rough ordering of the genomic fragments in which nearness to the centre increases the likelihood that a fragment carries the causative mutation (Additional file 1: Figure S2).
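The centre-out arrangement described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule as stated in the text, not the CHERIPIC implementation (which is written in Ruby); the function name `centre_out_order`, the fragment ids and the scores are all hypothetical, and HMES values are assumed to have been computed already.

```python
from collections import deque

def centre_out_order(hmes_by_fragment):
    """Arrange fragment ids so that the highest score sits in the middle.

    hmes_by_fragment: dict mapping a fragment id to its (precomputed) score.
    """
    # Rank fragments from highest to lowest score; ties are broken by id
    # only to keep the result deterministic (an assumption of this sketch).
    ranked = sorted(hmes_by_fragment,
                    key=lambda frag: (-hmes_by_fragment[frag], frag))
    ordering = deque()
    for rank, frag in enumerate(ranked):
        if rank % 2 == 1:
            ordering.appendleft(frag)  # 2nd, 4th, ... largest go to the left
        else:
            ordering.append(frag)      # 1st, 3rd, ... largest go to the right
    return list(ordering)

# Invented scores for five hypothetical contigs.
scores = {"contig_a": 0.5, "contig_b": 9.0, "contig_c": 3.0,
          "contig_d": 1.2, "contig_e": 4.1}
print(centre_out_order(scores))
# ['contig_d', 'contig_e', 'contig_b', 'contig_c', 'contig_a']
```

With an odd number of fragments the top-scoring fragment lands exactly in the middle; with an even number it sits just right of centre, because each new left-hand fragment is added before its right-hand partner.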
Output from CHERIPIC is a tab-delimited text file with the following information about each selected variant: HMES, allele frequency, length of contig, id of contig, variant position in contig, reference base, coverage, read bases, base qualities, left sequence to variant, variant allele, and right sequence to variant. The left and right flanking sequences are provided to make marker design easy, and their length can be adjusted by the user to retrieve enough sequence information.
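A minimal sketch of reading such a file is shown below. It assumes that each data row holds the twelve fields in the order listed above; the field names in `FIELDS`, the function `read_variants` and the example row are invented for illustration and are not taken from the CHERIPIC documentation. Rows that do not fit this layout (for example a header line, if one is present) are simply skipped.

```python
import csv
import io

# Hypothetical field names, one per column listed in the text, same order.
FIELDS = ["hmes", "allele_freq", "contig_length", "contig_id", "position",
          "ref_base", "coverage", "read_bases", "base_quals",
          "left_seq", "variant_allele", "right_seq"]

def read_variants(handle):
    """Yield one dict per data row of a tab-delimited variant table."""
    # QUOTE_NONE: base-quality strings may contain quote characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != len(FIELDS):
            continue  # not a data row in the assumed layout
        record = dict(zip(FIELDS, row))
        try:
            record["hmes"] = float(record["hmes"])
            record["allele_freq"] = float(record["allele_freq"])
            for name in ("contig_length", "position", "coverage"):
                record[name] = int(record[name])
        except ValueError:
            continue  # e.g. a header line with column titles
        yield record

# Invented example row, for demonstration only.
example = ("12.5\t1.0\t2500\tcontig_42\t1200\tG\t8\t........\tIIIIIIII"
           "\tACGTACGT\tA\tTTGACCGT\n")
records = list(read_variants(io.StringIO(example)))
print(records[0]["contig_id"], records[0]["position"])
```

The flanking sequences can then be passed directly to a primer or marker design tool of the user's choice.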
We applied our algorithm to three plant genome data sets from previously published experiments for which sequence data are publicly available: whole genome shotgun data from pooled bulks of the sup2 Arabidopsis mutant (a mutation in AT4G11260 at chromosome 4:6852405), pooled bulk RNA-seq data from the maize gl3 mutant (GRMZM2G162434 gene at Chr4:185827677-185831259) and exome capture data from the barley mnd mutant (MLOC_64838 gene at Chr5:468277462-468279844). All data are from bulked segregant analyses involving outcrosses. In all these experiments the allele in question is known to be recessive, so we focussed on tracking the linkage disequilibrium around the mutant allele. For a recessive candidate we expect an allele frequency close to 100% in the mutant bulk and around 33.3% in the background bulk (two thirds of the background individuals are heterozygotes, and half of the alleles they carry at the locus are the mutant allele). These expected percentages allow polymorphisms to be classified as homozygous or heterozygous according to their calculated allele frequency. For all three data sets, we show that HMES ordering can narrow the genome down to a fine interval in which to identify the causative mutation.
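The expected allele frequencies quoted above can be checked with a short calculation. The sketch below assumes a single recessive locus segregating 1:2:1 (AA : Aa : aa) in the mapping population and bulks that are separated perfectly by phenotype; the names used are illustrative only.

```python
from fractions import Fraction

# Assumed genotype proportions at the causal locus (AA : Aa : aa = 1 : 2 : 1),
# where "a" is the recessive mutant allele.
GENOTYPE_SHARE = {"AA": Fraction(1, 4), "Aa": Fraction(2, 4), "aa": Fraction(1, 4)}
MUTANT_ALLELES = {"AA": 0, "Aa": 1, "aa": 2}

def mutant_allele_frequency(genotypes_in_bulk):
    """Expected mutant allele frequency in a bulk made of the given genotypes."""
    individuals = sum(GENOTYPE_SHARE[g] for g in genotypes_in_bulk)
    mutant = sum(GENOTYPE_SHARE[g] * MUTANT_ALLELES[g] for g in genotypes_in_bulk)
    return mutant / (2 * individuals)  # every individual carries two alleles

mutant_bulk_freq = mutant_allele_frequency(["aa"])            # mutant phenotype
background_bulk_freq = mutant_allele_frequency(["AA", "Aa"])  # wild-type phenotype

print(mutant_bulk_freq)      # 1
print(background_bulk_freq)  # 1/3
```

Variants unlinked to the causal locus do not follow this pattern and are expected at roughly equal frequency in both bulks, which is why the contrast between the two bulks is informative.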
Bulk segregant data for the maize gl3 mutant were generated using RNA-seq of pooled RNA from the bulks. We first assembled transcripts from the RNA-seq data of non-gl3 phenotype bulks and used this assembly as the alignment reference sequence in the CHERIPIC analysis. CHERIPIC identified twelve variants with HMES greater than one from all ten chromosomes (Additional file 1: Figure S4), ten of which were on chromosome four (Fig. 2b) and co-located with the causative mutation. The distance between the causative mutation and the closest variant was about 1 Mbp (1006746 bp). A limitation of the bulked RNA-seq approach is that polymorphisms may cause expression changes, so transcripts from the gene carrying the causative mutation can be undetectable because of poor sequence coverage and therefore cannot be considered by CHERIPIC. Transcripts from gl3 were present at very low levels and were therefore missed.
Sequence data from the barley mnd bulks were generated using exome capture. Applying CHERIPIC to the barley data identified 1997 variants with HMES > 1 from all 7 chromosomes (Additional file 1: Figure S5a). The causative mutation for mnd is located on chromosome 5, and 63.99% of the selected variants (1278 out of 1997) were on that chromosome (Additional file 1: Figure S5a). Selecting the variants in the top 5% of HMES gave 90 variants from all 7 chromosomes (Additional file 1: Figure S5b), of which 78 (86.67%) were on chromosome 5 (Fig. 2c). The distance between the mnd causative mutation and the closest CHERIPIC variant was 11.7 Mbp (11722011 bp). The barley mnd mutant was generated by X-ray mutagenesis, which is known to create large deletions; this may have contributed to the greater distance between the causative mutation and the closest variant identified in this case.
To permit the easy application of our method and to allow users to rapidly analyse bulk segregant sequence data, we have produced a range of implementations of the CHERIPIC algorithm. As input, CHERIPIC takes a file of reference fragments and the variant files for both bulks. Variant calls from background bulks are not required but, if available, help reduce the background and enrich for high quality variants for downstream analysis. Variant files can be provided as pileup, BAM or VCF files. CHERIPIC is implemented in Ruby and is available as a pre-packaged binary with all dependencies for Linux and MacOS at https://github.com/TeamMacLean/cheripic. A Galaxy install script is provided to allow integration into Galaxy servers. An interactive web interface is provided at http://cheripic.tsl.ac.uk.
(1) A mapping population of an Arabidopsis leaf curl mutant (hasty) was generated by backcrossing it to the parent mir159a, a T-DNA insertion mutant. From the mapping population, a bulk of 110 individuals showing the mutant phenotype and a parent mir159a individual were sequenced to 50x coverage on an Illumina HiSeq 2000 using paired-end sequencing (2x100 bp).
(2) Two suppressor mutants (sup#1 and sup#2) of the hemizygous uni-1D transgene in the Arabidopsis Ws background were out-crossed to Col-T to generate mapping populations. Sequencing was done on (i) a pool of 80 individuals showing the suppression phenotype, (ii) wild type Col-T and (iii) wild type Ws, using an Illumina Genome Analyzer IIx to produce 75 bp single-end reads. We used the sequencing data from sup#2, Col-T and Ws in our analysis; these samples had 8.3 to 9.1x sequence coverage.
(3) A maize mapping population was generated by out-crossing a line carrying the gl3-ref allele, which shows the glossy phenotype, in a non-B73 genetic background to B73, an inbred reference line. Sequencing was carried out on pools of RNA from 32 mutant phenotype individuals and 31 non-mutant phenotype individuals, on an Illumina Genome Analyzer II as 75 bp single-end reads.
(4) An X-ray mutagenised mnd mutant in barley cv. Saale was out-crossed to cv. Barke to generate a mapping population. Exome capture was performed on two pools of DNA from 18 mutant and 30 wild type plants, respectively. Sequencing was carried out on an Illumina HiSeq 2000 as paired-end reads (2x100 bp).
For mapping by sequencing to be successful, an ordered reference sequence is needed to place variants from bulk segregant sequence data on chromosomes and so identify the region of the genome in linkage disequilibrium. We wanted to test the impact of the fragmented nature of de novo genome assemblies on identifying that region. To avoid variability resulting from the parameters of genome assembly, variant calling and sequencing depth, we used variant data from previously published BSS experiments in which the causative mutation had been identified using the ordered Arabidopsis reference genome. Arabidopsis BSS data for hasty (a backcross) and sup2 (an outcross) were used for variant calling against the Arabidopsis Col-0 TAIR10 genome. Sequencing reads from pooled samples of mutant bulks and parents were quality filtered and trimmed using Trimmomatic v0.33 (with options: ILLUMINACLIP:<Trimmomatic-provided Illumina adapter file>:2:30:10 HEADCROP:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:31), aligned using BWA mem (v0.7.12) with default settings, and converted to BAM files using samtools (v1.0). An mpileup file was generated using samtools (-q 20 -Q 15 -d 20xmean_depth), and variant calls were generated using VarScan (v2.3.9, options: mpileup2cns --variants 1 --output-vcf 1 --strand-filter 1). Homozygous variant calls from the background bulk data were subtracted from the mutant bulk data. The remaining mutant bulk variants from the backcross and outcross datasets were used in the respective simulations. We used paired-end reads of the miR159a parent to assemble the Arabidopsis genome. Reads were quality filtered and trimmed using Trimmomatic v0.33. Genome assembly was done using SOAPdenovo v2.40 with the multi-kmer method (SOAPdenovo-127mer all -K 25 -d 1 -R -M 1 -m 95 -E -F). Assembled scaffolds of ≥300 bp were selected, resulting in an assembly size of 117.6 Mb (n = 18,267) with an N50 length of 20.3 Kb.
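The background subtraction step mentioned above (removing homozygous background calls from the mutant bulk calls) can be illustrated as a simple set difference. The sketch below is not the pipeline used in the study: the `Variant` record, the function name and, in particular, the allele frequency cut-off of 0.9 used to call a background variant homozygous are assumptions made for the example, and the variant data are invented.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    seq_id: str         # chromosome or scaffold id
    position: int       # 1-based position on that sequence
    ref: str
    alt: str
    allele_freq: float  # fraction of reads supporting the alt allele

def subtract_background(mutant, background, hom_threshold=0.9):
    """Drop mutant-bulk variants that are homozygous in the background sample.

    hom_threshold is a hypothetical cut-off, not a value from the paper.
    """
    hom_sites = {(v.seq_id, v.position, v.alt)
                 for v in background if v.allele_freq >= hom_threshold}
    return [v for v in mutant
            if (v.seq_id, v.position, v.alt) not in hom_sites]

# Invented example data.
mutant_bulk = [Variant("scaffold_1", 120, "G", "A", 1.0),
               Variant("scaffold_1", 450, "C", "T", 0.95),
               Variant("scaffold_7", 33, "T", "C", 0.6)]
parent = [Variant("scaffold_1", 450, "C", "T", 1.0),   # homozygous: removed
          Variant("scaffold_7", 33, "T", "C", 0.5)]    # heterozygous: kept

kept = subtract_background(mutant_bulk, parent)
print([(v.seq_id, v.position) for v in kept])
# [('scaffold_1', 120), ('scaffold_7', 33)]
```

For an outcross with two sequenced parents, the same function could be applied once per parent, or the two background lists could be concatenated before the call.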
The resulting assembly scaffold lengths were modelled against normal, log-normal and exponential distributions and were found to follow a log-normal distribution (Additional file 1: Figure S1). The Arabidopsis genome was randomly fragmented using a log-normal distribution (mean: 7.88 and alpha: 1.56) to generate 1000 fragmented genome assemblies each for the outcross and backcross experiments. The position and chromosomal order of the individual fragments in each generated assembly are known. Background-subtracted variant data from the mutant bulk were assigned to the respective fragments to generate 1000 assemblies with variant data from bulk segregant sequencing.
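A minimal sketch of this kind of random fragmentation is given below. It assumes that the reported mean (7.88) and alpha (1.56) are the location and scale parameters of the log-normal on the natural-log scale, which the text does not state explicitly; the function name, the seed and the 1 Mb example length are arbitrary choices for illustration.

```python
import random

def lognormal_fragments(sequence_length, mu=7.88, sigma=1.56, rng=None):
    """Cut [0, sequence_length) into consecutive log-normal sized fragments.

    Returns (start, end) half-open coordinates in their true genomic order.
    mu and sigma are interpreted on the log scale (an assumption).
    """
    if rng is None:
        rng = random.Random()
    fragments = []
    start = 0
    while start < sequence_length:
        # Draw a length, force at least 1 bp, and clip at the sequence end.
        size = max(1, round(rng.lognormvariate(mu, sigma)))
        end = min(start + size, sequence_length)
        fragments.append((start, end))
        start = end
    return fragments

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
ordered = lognormal_fragments(1_000_000, rng=rng)

# Shuffling mimics an unordered assembly; the true positions remain
# recoverable from the coordinates stored in each tuple.
unordered = ordered[:]
rng.shuffle(unordered)
print(len(ordered), "fragments covering", ordered[-1][1], "bp")
```

Variants can then be assigned to fragments by locating the interval that contains each variant position, for example with a binary search over the fragment start coordinates.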
Arabidopsis: The de novo assembly made for the bulk segregant simulation analysis was used as a reference to call variants from single-end whole genome reads of the outcrossed bulk of individuals showing the sup2 phenotype and of the two parents, Col-T and Ws. Variant analysis was carried out as described in the “Simulations” section. The assembly and variant files were provided as inputs to CHERIPIC. CHERIPIC accepts multiple inputs for background variants, because outcrosses involve two polymorphic parents, as was the case for the sup2 experiment. Removing background variants from both parents helps to remove candidates that are not linked to the phenotype but arise from regions with suppressed recombination.
Maize: Single-end RNA-seq reads from the combined bulks of both gl3 mutant and non-mutant phenotype were quality filtered and trimmed using Trimmomatic v0.33 (options: ILLUMINACLIP:ilmn_adapters.fa:2:30:10 HEADCROP:13 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:25). Assembly was carried out using Trinity v2.0.6 (options: --seqType fq --max_memory 50G --CPU 64 --full_cleanup) on sequences from both bulks. Assembly with otherwise default parameters produced 33,563 transcript sequences, which were clustered using CD-HIT with an identity threshold of 0.975 (-c 0.975), resulting in 29,288 transcripts.
Barley: Paired-end exome sequence data from the wild type pool of the mnd bulk segregant experiment were quality filtered and trimmed using Trimmomatic v0.33 and assembled using SOAPdenovo v2.40 with the multi-kmer method (SOAPdenovo-127mer all -K 25 -d 1 -R -M 1 -m 95 -E -F). Selecting assembled scaffolds of ≥300 bp resulted in an assembly of 166.5 Mb (n = 253,137) with an N50 length of 0.72 Kb.
Availability and requirements
Project name: CHERIPIC
Project home page: https://github.com/TeamMacLean/cheripic
Operating system(s): Linux and Mac OS
Programming language: Ruby
Other requirements: CHERIPIC has light computational requirements. It will run on Linux or Mac OS operating system with 2 GHz CPU and minimally 8 GB RAM. Higher RAM may be required if input files are large.
License: MIT
Any restrictions to use by non-academics: None
We sincerely thank Carlos A Lugo, Matthew Moscou, Kamil Witek, Christian Schudoma, Cintia Goulart Kawashima, Peter van Esse, Burkhard Steuernagel and Cristobal Uauy for useful discussions. We thank NBI Computing infrastructure for Science (CiS) group for their support for high performance computing.
This work was supported by a Biotechnology and Biological Sciences Research Council (BBSRC) tools and resources development fund to G.R. and D.M. (BB/M019896/1). D.M. was supported by The Gatsby Charitable Foundation.
GR: Conceived, designed and performed the experiments. Developed and released CHERIPIC. PCM and EC: Developed and performed early tests on the first version of a sorting algorithm. MP: Developed components for CHERIPIC and set up the web server and website for CHERIPIC. DM: Conceived, designed and supervised the project. GR and DM wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1. Schneeberger K. Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nat Rev Genet. 2014; 15(10):662–76. https://doi.org/10.1038/nrg3745.
- 2. Peters JL, Cnudde F, Gerats T. Forward genetics and map-based cloning approaches. Trends Plant Sci. 2003; 8(10):484–91. https://doi.org/10.1016/j.tplants.2003.09.002.
- 3. Michelmore RW, Paran I, Kesseli RV. Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc Natl Acad Sci. 1991; 88(21):9828–32. https://doi.org/10.1073/pnas.88.21.9828.
- 8. Austin RS, Vidaurre D, Stamatiou G, Breit R, Provart NJ, Bonetta D, Zhang J, Fung P, Gong Y, Wang PW, McCourt P, Guttman DS. Next-generation mapping of Arabidopsis genes. Plant J. 2011; 67(4):715–25. https://doi.org/10.1111/j.1365-313X.2011.04619.x.
- 9. Uchida N, Sakamoto T, Kurata T, Tasaka M. Identification of EMS-induced causal mutations in a non-reference Arabidopsis thaliana accession by whole genome sequencing. Plant Cell Physiol. 2011; 52(4):716–22. https://doi.org/10.1093/pcp/pcr029.
- 10. Liu S, Yeh CT, Tang HM, Nettleton D, Schnable PS. Gene mapping via bulked segregant RNA-Seq (BSR-Seq). PLoS ONE. 2012; 7(5):1–8. https://doi.org/10.1371/journal.pone.0036406.
- 11. Mascher M, Jost M, Kuon J-E, Himmelbach A, Aßfalg A, Beier S, Scholz U, Graner A, Stein N. Mapping-by-sequencing accelerates forward genetics in barley. Genome Biol. 2014; 15(6):78. https://doi.org/10.1186/gb-2014-15-6-r78.
- 12. Allen R, Nakasugi K, Doran R, Millar T, Waterhouse P. Facile mutant identification via a single parental backcross method and application of whole genome sequencing based mapping pipelines. Frontiers Plant Sci. 2013; 4(362). https://doi.org/10.3389/fpls.2013.00362.
- 14. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25(14):1754–60. https://doi.org/10.1093/bioinformatics/btp324.
- 17. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu S-M, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam T-W, Wang J. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012; 1(1):18. https://doi.org/10.1186/2047-217X-1-18.
- 18. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011; 29(7):644–52.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.