Genome-wide single nucleotide polymorphism and Insertion-Deletion discovery through next-generation sequencing of reduced representation libraries in common bean
Single nucleotide polymorphisms (SNPs) and insertions-deletions (InDels) are valuable molecular markers for genomics and genetics studies and molecular breeding. The advent of next-generation sequencing techniques has enabled high-throughput and cost-effective SNP and InDel discovery on a genomic scale. In this report, 36 common bean genotypes grown in Canada were used to construct reduced representation libraries for next-generation sequencing. Using 76 million sequence reads generated by the Illumina HiSeq 2000 Sequencing System, we identified a total of 43,698 putative SNPs and 1,267 putative InDels. Of the SNPs, 43,504 were bi-allelic and 194 were tri-allelic, and the InDels comprised 574 insertions and 693 deletions. The putative bi-allelic SNPs were distributed across all 11 chromosomes, with the highest number observed on chromosome 2 (4,788) and the lowest on chromosome 10 (2,941). With the aid of the recently released first chromosome-scale assembly of the Phaseolus vulgaris genome, 24,907 bi-allelic SNPs, 79 tri-allelic SNPs, 315 insertions, and 377 deletions were located in 8,758, 77, 273, and 364 genes, respectively. Among these 24,907 bi-allelic SNPs, 7,168 were nonsynonymous and were located in 4,303 genes. A total of 113 putative SNPs were randomly chosen for validation using high-resolution melt analysis. Of the 113 candidate SNPs, 105 (92.9 %) contained the predicted SNPs.
Keywords: Single nucleotide polymorphism; Insertion; Deletion; Next-generation sequencing; Phaseolus vulgaris L.
Common bean (Phaseolus vulgaris L.) is one of the most important legume crops for human consumption worldwide. It is a diploid annual legume (2n = 2x = 22) with a relatively small genome size (580 Mb) (Bennett and Leitch 1995). Common bean has relatively low levels of duplication and repetitive regions in the genome compared to other plants, and molecular and genetic mapping experiments revealed that most loci are single copy (Freyre et al. 1998; McClean et al. 2002; Vallejos et al. 1992). Moreover, common bean gene families tend to be small, and the traditionally large families such as plant resistance genes and protein kinases are of moderate size (Gepts et al. 2008; Rivkin et al. 1999; Vallad et al. 2001). Common bean originated in the Americas, and then diverged into two major gene pools, the Mesoamerican and Andean (Gepts 1998; Gepts and Debouck 1991). The two gene pools underwent domestications independently, and each evolved into three races: Durango, Jalisco, and Mesoamerica in the Mesoamerican gene pool, and Chile, Nueva Granada, and Peru in the Andean gene pool. The former gene pool is represented by the small- and medium-seeded white, pinto, pink, black, and some snap beans, and the latter is represented by the large-seeded kidney, cranberry, and many snap beans (Kwak and Gepts 2009; Mamidi et al. 2011; McClean et al. 2004; Singh et al. 1991).
Single nucleotide polymorphisms (SNPs) and insertions-deletions (InDels) represent the most abundant DNA polymorphisms present in eukaryotic genomes (Galeano et al. 2009; Hillier et al. 2008; Hyten et al. 2010a, b; Lai et al. 2010; Salathia et al. 2007; Subbaiyan et al. 2012; Wang et al. 1998). SNP and InDel markers have become powerful tools for many research applications such as genetic mapping, association studies, diversity analysis, marker-assisted selection, and map-based cloning of genes (Blair et al. 2013; Galeano et al. 2012; Hayashi et al. 2006; Mammadov et al. 2012; Salathia et al. 2007; Shen et al. 2004; Shi et al. 2011; Varshney et al. 2009). Meanwhile, next-generation sequencing technologies provide rapid and cost-effective approaches for the discovery of DNA polymorphisms on a genomic scale (Davey et al. 2011; Depristo et al. 2011; Kumar et al. 2012; Shendure and Ji 2008; Varshney et al. 2009). With the ever-increasing throughput of next-generation sequencing and the development and improvement of bioinformatics tools, discovery of DNA sequence variations is readily accomplished by comparing the whole-genome sequences of individuals with reference genome sequences (Arai-Kichise et al. 2011; Hyten et al. 2010a; Ossowski et al. 2008). When a reference genome sequence is not available, a common practice for SNP and InDel discovery is to assemble sequence reads into contigs and then align all reads against them to call variants (Fu and Peterson 2012; Gaur et al. 2012; Hyten et al. 2010b; Lai et al. 2012; Oliver et al. 2011; Ratan et al. 2010; You et al. 2011). Plant genomes are usually large and complex with a great abundance of repetitive sequences. These features pose challenges for SNP and InDel discovery (Adams and Wendel 2005; Bennetzen et al. 2005; Davey et al. 2011; Deschamps and Campbell 2010). 
Several genome complexity reduction techniques have been applied for high-throughput genetic marker discovery, including reduced-representation libraries (RRLs) (Altshuler et al. 2000), complexity reduction of polymorphic sequences (CRoPS) (van Orsouw et al. 2007), restriction-site-associated DNA sequencing (RAD-Seq) (Baird et al. 2008), and genotyping by sequencing (GBS) (Davey et al. 2011; Elshire et al. 2011). RRLs were first used in the human genome, and this approach has been adapted for genome-wide SNP and InDel discovery in many animal and plant species (Altshuler et al. 2000; Davey et al. 2011; Depristo et al. 2011; Van Tassell et al. 2008). In general, sequencing is performed on RRLs which reduce the complexity of a pooled DNA sample from a population of interest using a restriction enzyme digestion followed by size selection; sequencing reads are then aligned to at least a draft genome sequence for identifying DNA variants.
To date, considerable effort has been made toward DNA polymorphism discovery in common bean. Several thousand SNPs and hundreds of InDels have been discovered through expressed sequence tag data mining or partial re-sequencing of certain genotypes (Gaitan-Solis et al. 2008; Galeano et al. 2009; McConnell et al. 2010; Ramírez et al. 2005; Souza et al. 2012). In addition, Hyten et al. (2010b) discovered 3,487 SNPs from Mesoamerican genotype BAT93 and Andean genotype JaloEEP558 through next-generation sequencing of a multi-tier RRL when the genome sequence was still unavailable.
The first chromosome-scale version of P. vulgaris sequence assembly has also been recently released to the public, and provides an invaluable resource for common bean genome mapping and marker development (Phaseolus vulgaris v1.0, DOE-JGI and USDA-NIFA, http://www.phytozome.net/commonbean). The objective of this study was to discover SNPs and InDels in common bean through next-generation sequencing of RRLs using the latest release of the common bean sequence assembly.
Materials and methods
Library construction and sequencing
Thirty-six common bean genotypes grown in Canada were used for SNP and InDel discovery (Supplementary Table S1). Six DNA pools, together containing the 36 common bean DNA samples, were prepared using the DNeasy Plant Mini kit (Qiagen Inc., Toronto, Canada). A 5.5-μg sample of DNA from each pool was digested with HaeIII (New England Biolabs Ltd, Whitby, ON, Canada) as suggested by the manufacturer. DNA fragments between 250 and 350 bp were selected, ligated with different adapters, and amplified using the TruSeq™ DNA Sample Preparation Kit (Illumina, Inc., San Diego, CA, USA) following the manufacturer’s instructions. Libraries were then sent to The Centre for Applied Genomics, The Hospital for Sick Children, Toronto for further preparation and Illumina sequencing. First, libraries were size-selected to between 400 and 600 bp using the E-Gel® SizeSelect™ 2 % Agarose system (Life Technologies Inc., Burlington, ON, Canada) and purified using the Qiagen MinElute PCR Purification Kit (Qiagen) following the recommended protocol. Libraries were then sequenced on a HiSeq 2000 instrument following Illumina’s recommended guidelines.
Discovery of SNPs and InDels
Reads that passed Illumina’s filter and were at least 25 bp long were retained. Quality-filtered reads were aligned to the P. vulgaris reference genome (Phaseolus vulgaris v1.0, DOE-JGI and USDA-NIFA, http://www.phytozome.net/commonbean) using BWA 0.6.2 (Li and Durbin 2009). Subsequently, base quality recalibration and local re-alignment were performed using GATK (GenomeAnalysisTK-1.6-6-g4bc04e2) (McKenna et al. 2010). After the alignment was refined, variants (SNVs and InDels) were called using FreeBayes 0.9.5 (arXiv:1207.3907 [q-bio.GN]), which is capable of calling tri-allelic sites. Variants with a FreeBayes QUAL score less than 10 or a read depth less than 3 were assumed to be false positives and filtered out. Finally, variants were annotated using Annovar (Wang et al. 2010). The raw sequencing reads have been deposited in the NCBI Sequence Read Archive [GenBank: SRP022760]. The SNP and InDel data are available at http://bioinfo.uwindsor.ca/cgi-bin/gb2/gbrowse/US_PVulgaris.
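The filtering step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes that a variant is discarded when either criterion fails (QUAL below 10 or depth below 3), that depth is taken from the standard `DP` tag of the VCF INFO column, and the names `Variant`, `parse_vcf_line`, `passes_filter`, and `filter_vcf`, as well as the toy records, are hypothetical.

```python
from dataclasses import dataclass

MIN_QUAL = 10.0   # QUAL threshold stated in the text
MIN_DEPTH = 3     # read-depth threshold stated in the text


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float
    depth: int


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line; depth is read from the INFO DP tag."""
    chrom, pos, _id, ref, alt, qual, _flt, info = line.rstrip("\n").split("\t")[:8]
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return Variant(chrom, int(pos), ref, alt, float(qual), int(tags.get("DP", 0)))


def passes_filter(variant, min_qual=MIN_QUAL, min_depth=MIN_DEPTH):
    """Keep a variant only if it meets both thresholds (assumed reading of the text)."""
    return variant.qual >= min_qual and variant.depth >= min_depth


def filter_vcf(lines):
    """Return the variants that survive filtering, skipping header and blank lines."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        variant = parse_vcf_line(line)
        if passes_filter(variant):
            kept.append(variant)
    return kept


# Toy records (made up for illustration, not data from the study).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t1050\t.\tC\tT\t45.2\t.\tDP=12",   # kept
    "Chr01\t2200\t.\tG\tA\t8.1\t.\tDP=20",    # dropped: QUAL < 10
    "Chr02\t310\t.\tA\tAT\t30.0\t.\tDP=2",    # dropped: depth < 3
]

kept_variants = filter_vcf(toy_vcf)
print([(v.chrom, v.pos) for v in kept_variants])
```

Keeping the thresholds as named constants makes it easy to see how the number of retained calls changes if a stricter cut-off is tried.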
A total of 113 putative Class I (C/T and G/A) and Class II (C/A and G/T) SNPs were randomly chosen for validation using high-resolution melt (HRM) analysis (Herrmann et al. 2006). HRM primers were designed using Beacon Designer 7.91 (Supplementary Table S2). HRM PCR was conducted on a Bio-Rad CFX96 qRT-PCR machine and the PCR conditions were the same as described by Wang et al. (2012). HRM data were analyzed by Precision Melt Analysis software according to its user manual (Bio-Rad).
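As a small illustration of how candidate SNPs could be restricted to the two classes named above before random selection, the following Python sketch classifies an allele pair and then samples from the eligible candidates. Only the Class I and Class II definitions come from the text; the function name `hrm_class`, the candidate list, and the sample size are hypothetical.

```python
import random

# Allele pairs as defined in the text.
CLASS_I = {frozenset("CT"), frozenset("GA")}
CLASS_II = {frozenset("CA"), frozenset("GT")}


def hrm_class(allele_a, allele_b):
    """Return 'Class I', 'Class II', or 'other' for a bi-allelic SNP."""
    pair = frozenset((allele_a.upper(), allele_b.upper()))
    if pair in CLASS_I:
        return "Class I"
    if pair in CLASS_II:
        return "Class II"
    return "other"


# Hypothetical candidates: (identifier, reference allele, alternate allele).
candidates = [
    ("snp_a", "C", "T"),
    ("snp_b", "G", "T"),
    ("snp_c", "C", "G"),
    ("snp_d", "A", "G"),
]

eligible = [c for c in candidates if hrm_class(c[1], c[2]) != "other"]

# Random selection without replacement; the seed only makes the sketch repeatable.
rng = random.Random(0)
chosen = rng.sample(eligible, k=2)
print(chosen)
```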
Results and discussion
DNA deep sequencing and alignment
Thirty-six common bean genotypes grown in Canada were used to construct reduced representation libraries for next-generation sequencing (Supplementary Table S1). They all belonged to race Mesoamerica of the Middle American Gene Pool. Two main clusters could be distinguished based on a cluster analysis (Supplementary Fig. S1). These 36 genotypes were randomly divided into six pools. RRLs were constructed from the six pools using the restriction enzyme HaeIII, which recognizes the sequence ‘GGCC’ and generates blunt-ended fragments starting with CC. DNA fragments between 250 and 350 bp long were selected and sequenced by HiSeq 2000. Short sequences from the six pools were aligned to the Phaseolus vulgaris v1.0 genome assembly, which is approximately 521.1 Mb in size. The in silico digestion of Phaseolus vulgaris v1.0 with HaeIII and selection for 250–350 bp fragments showed an expected sequence coverage of 3 % of the reference genome, that is, 15,829,160 bp uniquely aligned to the common bean genome.
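The in silico digestion described here can be sketched in a few lines of Python. This is an illustrative approximation rather than the authors' procedure: it assumes HaeIII cuts between GG and CC on a single strand representation, treats sequence ends as fragment ends, and uses a made-up toy sequence; the function names `digest` and `size_selected_fraction` are hypothetical.

```python
import re

HAEIII_SITE = "GGCC"
CUT_OFFSET = 2  # blunt cut between GG and CC


def digest(seq, site=HAEIII_SITE, cut_offset=CUT_OFFSET):
    """Split a sequence at every recognition site and return the fragments."""
    seq = seq.upper()
    cuts = [m.start() + cut_offset for m in re.finditer(site, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_selected_fraction(seq, min_len=250, max_len=350):
    """Fraction of the sequence contained in fragments of min_len..max_len bp."""
    selected = sum(len(f) for f in digest(seq) if min_len <= len(f) <= max_len)
    return selected / len(seq)


# Toy sequence with two HaeIII sites (not real genome data).
toy = "A" * 100 + "GGCC" + "T" * 296 + "GGCC" + "A" * 50

fragments = digest(toy)
print([len(f) for f in fragments])    # fragment lengths
print(fragments[1][:2])               # internal fragments start with CC
print(size_selected_fraction(toy))

# Arithmetic check of the figures quoted in the text.
reported_fraction = 15_829_160 / 521_100_000
print(f"{reported_fraction:.1%}")
```

Applied chromosome by chromosome to a real assembly, the same function would give the expected share of the genome captured by the 250–350 bp size window.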
Summary of sequence read alignment (table values not recovered; column headings: Reads aligned (%), Reads aligned bases, HQ aligned reads, HQ aligned bases, Mean of read length)
SNP and InDel discovery and SNP validation
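Summary of SNP and InDel discovery (columns: bi-allelic SNPs, tri-allelic SNPs, insertions, deletions; row labels not recovered)
Row 1 (counts match the putative variants reported in the abstract): 43,504 (22.7 %) bi-allelic SNPs; 194 (22.7 %) tri-allelic SNPs; 574 (37.2 %) insertions; 693 (31.8 %) deletions
Row 2: 147,866 (77.3 %) bi-allelic SNPs; 608 (77.3 %) tri-allelic SNPs; 971 (62.8 %) insertions; 1,484 (68.2 %) deletions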
In order to evaluate the quality of the SNPs identified, a total of 113 putative SNPs were randomly chosen for validation using HRM analysis. Of the 113 candidate SNPs, 105 (92.9 %) contained the predicted SNPs (Supplementary Table S2). This validation rate was higher than the 86 % reported in common bean and the 78 % reported in wheat, but similar to the 92.5 % obtained in soybean and the 91 % obtained in cattle through sequencing RRLs and predicting SNPs from a depth of greater than two reads (Hyten et al. 2010a, b; Lai et al. 2012; Van Tassell et al. 2008). The experimental validation rate was also in line with expectations based on the FreeBayes quality threshold: on the Phred scale, a QUAL score of 10 corresponds to an error probability of 0.1, so at least about 90 % of the retained calls would be expected to be real.
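The comparison between the observed validation rate and the quality threshold can be made explicit with a short calculation. The sketch below assumes that QUAL follows the usual Phred convention of the VCF format (QUAL = -10 log10 of the probability that the call is wrong); the function name `phred_to_error_prob` is hypothetical.

```python
def phred_to_error_prob(qual):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-qual / 10)


validated, tested = 105, 113          # counts reported in the text
observed_rate = validated / tested

qual_threshold = 10                   # filtering threshold used in this study
# A call sitting exactly at the threshold has this probability of being real;
# calls with higher QUAL are expected to do better.
min_expected_rate = 1 - phred_to_error_prob(qual_threshold)

print(f"observed validation rate: {observed_rate:.1%}")
print(f"minimum expected at QUAL {qual_threshold}: {min_expected_rate:.0%}")
```

Because the threshold only bounds the worst retained calls, an observed rate a few points above that floor is what one would anticipate.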
Analysis of SNPs and InDels
Annotation of SNPs and InDels
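Summary of annotated SNPs and InDels (columns: bi-allelic SNPs, tri-allelic SNPs, insertions, deletions; row labels not recovered; each column sums to the genic total reported in the abstract)
Row 1: 15,448 (62.0 %) bi-allelic SNPs; 41 (51.9 %) tri-allelic SNPs; 65 (20.6 %) insertions; 126 (33.4 %) deletions
Row 2: 7,666 (30.8 %) bi-allelic SNPs; 32 (40.5 %) tri-allelic SNPs; 189 (60.0 %) insertions; 196 (52.0 %) deletions
Row 3: 1,793 (7.2 %) bi-allelic SNPs; 6 (7.6 %) tri-allelic SNPs; 61 (19.4 %) insertions; 55 (14.6 %) deletions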
Using reduced representation libraries coupled with next-generation sequencing and the latest release of the common bean genome sequence assembly, we identified 43,698 putative SNPs, of which 24,986 were annotated in genic regions. Additionally, 1,267 putative InDels were identified, including 692 located in genic regions. The combination of the SNPs and InDels discovered in this study and the SNP and InDel resources already available will help to anchor and orient scaffolds arising from future whole-genome sequencing efforts in common bean. Furthermore, the variants identified will also be useful for genetic diversity analyses, QTL mapping, genome-wide association studies, and marker-assisted breeding in common bean.
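The totals quoted in this summary follow directly from the per-category counts given in the abstract, as the short arithmetic check below shows. The dictionary names are hypothetical, and the genic percentages are derived here rather than stated in the text.

```python
# Counts reported in the abstract.
all_variants = {"bi-allelic SNPs": 43_504, "tri-allelic SNPs": 194,
                "insertions": 574, "deletions": 693}
genic_variants = {"bi-allelic SNPs": 24_907, "tri-allelic SNPs": 79,
                  "insertions": 315, "deletions": 377}

total_snps = all_variants["bi-allelic SNPs"] + all_variants["tri-allelic SNPs"]
total_indels = all_variants["insertions"] + all_variants["deletions"]
genic_snps = genic_variants["bi-allelic SNPs"] + genic_variants["tri-allelic SNPs"]
genic_indels = genic_variants["insertions"] + genic_variants["deletions"]

print(total_snps, total_indels)     # totals of putative SNPs and InDels
print(genic_snps, genic_indels)     # totals located in genic regions

# Derived shares (not stated in the text): fraction of variants in genic regions.
print(f"genic SNP share: {genic_snps / total_snps:.1%}")
print(f"genic InDel share: {genic_indels / total_indels:.1%}")
```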
We thank Dr. Sergio Pereira and the bioinformatics team at The Centre for Applied Genomics, The Hospital for Sick Children for the next-generation sequencing and data analysis. Phaseolus vulgaris v1.0 data were produced by the US Department of Energy Joint Genome Institute http://www.jgi.doe.gov/ in collaboration with the user community. This work is supported by Ontario Research Fund (ORF), Ontario White Bean Producers’ Marketing Board (OWBPMB), Ontario Colored Bean Growers’ Association (OCBGA), and Agriculture and Agri-Food Canada (AAFC).
- Choi IY, Hyten DL, Matukumalli LK, Song Q, Chaky JM, Quigley CV, Chase K, Lark KG, Reiter RS, Yoon MS, Hwang EY, Yi SI, Young ND, Shoemaker RC, Van Tassell CP, Specht JE, Cregan PB (2007) A soybean transcript map: gene distribution, haplotype and single-nucleotide polymorphism analysis. Genetics 176:685–696
- Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D (2007) Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science 317:338–342
- Depristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–501
- Freyre R, Skroch PW, Geffroy V, Adam-Blondon AF, Shirmohamadali A, Johnson WC, Llaca V, Nodari RO, Pereira PA, Tsai SM, Tohme J, Dron M, Nienhuis J, Vallejos CE, Gepts P (1998) Towards an integrated linkage map of common bean. 4. Development of a core linkage map and alignment of RFLP maps. Theor Appl Genet 97:847–856
- Gepts P (1998) Origin and evolution of common bean: past events and recent trends. HortScience 33:1124–1130
- Gepts P, Debouck D (1991) Origin, domestication, and evolution of the common bean (Phaseolus vulgaris L.). In: van Schoonhaven A, Voysest O (eds) Common beans: research for crop improvement. C.A.B. International, Oxon, pp 7–53
- Gepts P, Aragão FL, Barros E, Blair M, Brondani R, Broughton W, Galasso I, Hernández G, Kami J, Lariguet P, McClean P, Melotto M, Miklas P, Pauls P, Pedrosa-Harand A, Porch T, Sánchez F, Sparvoli F, Yu K (2008) Genomics of Phaseolus beans, a major source of dietary protein and micronutrients in the tropics. In: Moore P, Ming R (eds) Genomics of Tropical Crop Plants. Springer, New York, pp 113–143
- Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER (2008) Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5:183–188
- Hufford MB, Xu X, Van Heerwaarden J, Pyhäjärvi T, Chia JM, Cartwright RA, Elshire RJ, Glaubitz JC, Guill KE, Kaeppler SM, Lai J, Morrell PL, Shannon LM, Song C, Springer NM, Swanson-Wagner RA, Tiffin P, Wang J, Zhang G, Doebley J, McMullen MD, Ware D, Buckler ES, Yang S, Ross-Ibarra J (2012) Comparative population genomics of maize domestication and improvement. Nat Genet 44:808–811
- Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer AD, May GD, Cregan PB (2010a) High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics 11:38
- Kumar S, Banks TW, Cloutier S (2012) SNP discovery through next-generation sequencing and its applications. Int J Plant Genomics. Article ID 831460. doi:10.1155/2012/831460
- Lai J, Li R, Xu X, Jin W, Xu M, Zhao H, Xiang Z, Song W, Ying K, Zhang M, Jiao Y, Ni P, Zhang J, Li D, Guo X, Ye K, Jian M, Wang B, Zheng H, Liang H, Zhang X, Wang S, Chen S, Li J, Fu Y, Springer NM, Yang H, Wang J, Dai J, Schnable PS, Wang J (2010) Genome-wide patterns of genetic variation among elite maize inbred lines. Nat Genet 42:1027–1030
- Mammadov J, Aggarwal R, Buyyarapu R, Kumpatla S (2012) SNP markers and their impact on plant breeding. Int J Plant Genomics. Article ID 728398. doi:10.1155/2012/728398
- McClean P, Kami J, Gepts P (2004) Genomics and genetic diversity in common bean. In: Wilson RF, Stalker HT, Brummer EC (eds) Legume crop genomics. AOCS Press, Champaign
- McNally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE, Stokowski R, Ballinger DG, Frazer KA, Cox DR, Padhukasahasram B, Bustamante CD, Weigel D, Mackill DJ, Bruskiewich RM, Rätsch G, Buell CR, Leung H, Leach JE (2009) Genomewide SNP variation reveals relationships among landraces and modern varieties of rice. Proc Natl Acad Sci USA 106:12273–12278
- Oliver RE, Lazo GR, Lutz JD, Rubenfield MJ, Tinker NA, Anderson JM, Wisniewski Morehead NH, Adhikary D, Jellen EN, Maughan PJ, Brown Guedira GL, Chao S, Beattie AD, Carson ML, Rines HW, Obert DE, Bonman JM, Jackson EW (2011) Model SNP development for complex genomes based on hexaploid oat using high-throughput 454 sequencing technology. BMC Genomics 12:77
- Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, Gill N, Joshi T, Libault M, Sethuraman A, Zhang XC, Shinozaki K, Nguyen HT, Wing RA, Cregan P, Specht J, Grimwood J, Rokhsar D, Stacey G, Shoemaker RC, Jackson SA (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178–183
- Shen YJ, Jiang H, Jin JP, Zhang ZB, Xi B, He YY, Wang G, Wang C, Qian L, Li X, Yu QB, Liu HJ, Chen DH, Gao JH, Huang H, Shi TL, Yang ZN (2004) Development of genome-wide DNA polymorphism database for map-based cloning of rice genes. Plant Physiol 135:1198–1205
- Treangen TJ, Salzberg SL (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13:36–46
- van Orsouw NJ, Hogers RCJ, Janssen A, Yalcin F, Snoeijers S, Verstege E, Schneiders H, van der Poel H, van Oeveren J, Verstegen H, van Eijk MJT (2007) Complexity reduction of polymorphic sequences (CRoPS™): a novel approach for large-scale polymorphism discovery in complex genomes. PLoS ONE 2:e1172
- Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester E, Spencer J, Kruglyak L, Stein L, Hsie L, Topaloglou T, Hubbell E, Robinson E, Mittmann M, Morris MS, Shen N, Kilburn D, Rioux J, Nusbaum C, Rozen S, Hudson TJ, Lipshutz R, Chee M, Lander ES (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280:1077–1082
- You FM, Huo N, Deal KR, Gu YQ, Luo MC, McGuire PE, Dvorak J, Anderson OD (2011) Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59