Use of VeraCode 384-plex assays for watermelon diversity analysis and integrated genetic map of watermelon with single nucleotide polymorphisms and simple sequence repeats
- Cite this article as:
- Nimmakayala, P., Abburi, V.L., Bhandary, A. et al. Mol Breeding (2014) 34: 537. doi:10.1007/s11032-014-0056-9
Watermelon (Citrullus lanatus var. lanatus) is one of the most important vegetable crops in the world. Molecular markers have become the tools of choice for resolving watermelon taxonomic relationships and evolution. Increased numbers of single nucleotide polymorphism (SNP) markers together with simple sequence repeat (SSR) markers would be useful for phylogenetic analyses of germplasm accessions and for linkage mapping for marker-assisted breeding with quantitative trait loci and single genes. We aimed to construct a genetic map based on SNPs (generated by Illumina Veracode multiplex assays for genotyping) and SSR markers and evaluate relationships inferred from SNP genotypes between 130 watermelon accessions collected throughout the world. We incorporated 282 markers (232 SNPs and 50 SSRs) into the linkage map. The genetic map consisted of 11 linkage groups spanning 924.72 cM with an average distance of 3.28 cM between markers. Because all of the SNP-containing sequences were assembled with the whole-genome sequence draft for watermelon, chromosome numbers could be readily assigned for all the linkage groups. We found that 134 SNPs were polymorphic in 130 watermelon accessions chosen for diversity studies. The current 384-plex SNP set is a powerful tool for characterizing genetic relatedness and for developing high-resolution genetic maps.
Keywords: SNPs; High throughput genotyping; SSRs; Genetic mapping; Watermelon; Cultivar diversity
Watermelon [Citrullus lanatus (Thunb.) Matsum. & Nakai] (2n = 2x = 22) belongs to the genus Citrullus Schrad. ex Eckl. & Zeyh (Robinson and Decker-Walters 1997) of the Cucurbitaceae family and has a genome size of 425 Mb (Guo et al. 2013). Citrullus lanatus includes two botanical varieties, lanatus and citroides (Bailey) Mansf. (Pitrat et al. 1999; Meeuse 1962). The variety lanatus is one of the most important vegetable crops in the world (Paris et al. 2013). The variety citroides is cultivated for pickles or medicinal purposes in South Africa and is also called Tsamma or citron melon (Whitaker and Davis 1962; Whitaker and Bemis 1976). Excavations in Egypt and Libya suggested that northern Africa was a primary center of origin and domestication of cultivated watermelon (Dane and Liu 2007). Citrullus lanatus subsp. mucosospermus, representing the “egusi” watermelon group, has large edible seeds and a fleshy pericarp; C. lanatus subsp. vulgaris represents the red sweet type (Paris et al. 2013). Although the genetic diversity within subsp. vulgaris is extremely narrow, its members are phenotypically diverse in fruit shape, flesh pressure, fruit weight, soluble solids and rind thickness (Zhang et al. 2012).
Molecular markers are the tools of choice for determining relationships between cultivars from different breeding programs and from plant introduction accessions (Deschamps and Campbell 2010; Mammadov et al. 2012). Few single nucleotide polymorphism (SNP) markers are available for watermelon genome studies. SNP markers have been useful in genetic mapping and identification of quantitative trait loci (QTL) in other crops (Semagn et al. 2014; Thomson et al. 2012; Yan et al. 2010). Apart from SNPs, other sequence-specific markers are insertion and deletions (InDels), structural variations and simple sequence repeats (SSRs). SNPs are useful because they are easy to identify, inexpensive to use, and higher in frequency than other DNA markers and can be identified with several technologies (Deschamps and Campbell 2010). The technologies can be used to genotype large sets of DNAs for mapping projects and for marker-assisted breeding (Wong et al. 2012; Li et al. 2012). SNPs are being used for several crops (Sim et al. 2012; Poland et al. 2012; Wong et al. 2012). Genotyping by sequencing (GBS) is an economical approach to finding SNPs and using them for mapping, diversity studies and association mapping (Aslam et al. 2012; Elshire et al. 2011; Ersoz et al. 2012). Once useful markers are identified from genome-based sequencing methods such as GBS, a suitable genotyping technique is required for routine use in multiple populations (Viquez-Zamora et al. 2013; Oliver et al. 2013).
SSRs are stretches of nucleotides consisting of a variable number of short tandem repeats that produce co-dominant, multi-allelic, reproducible bands on amplification (Stagel et al. 2008; Parida et al. 2009; Cavagnaro et al. 2010; Sun et al. 2013). Furthermore, SSR markers are transferable across various Citrullus spp. and can be used with distantly related taxa (Jarret et al. 1997). SSR markers have been used for the construction of genetic maps of many plant species and provide dependable landmarks throughout the genome (Cordoba et al. 2010; Cavagnaro et al. 2011; Ren et al. 2012).
Three SNP maps constructed for watermelon consist of 378, 357 and 338 SNP markers (Sandlin et al. 2012) spanning 1,438, 1,514 and 1,144 cM, respectively. Probably the best genome-wide map for watermelon involved 698 SSRs, 219 InDels and 36 structural variants that covered 800 cM with a mean marker interval of 0.8 cM (Ren et al. 2012). This map positioned 234 watermelon genome-sequence scaffolds and accounted for 93.5 % of the assembled 353-Mb genome size (Ren et al. 2012). This map elucidated novel genomic features for locating recombination cold spots and the distribution of segregation distortion. However, SNP markers were not included in that mapping study. A genetic map based on SNPs together with SSR markers representing different linkage regions of the watermelon genome has not been constructed. More SNP markers and SSR markers would be useful for applications such as linkage disequilibrium (LD) mapping, phylogenetic analysis of germplasm accessions, and marker-assisted breeding to select and incorporate QTL into elite breeding lines (Berlin et al. 2010; Neves et al. 2011). Here, we aimed to construct a genetic map based on SNPs generated by Illumina Veracode multiplex assays for genotyping as well as SSR markers. We hoped to evaluate relationships based on SNP genotypes between 130 watermelon cultivars chosen for their diversity of origin.
Materials and methods
We performed genetic mapping and cultivar diversity analysis using SNPs. For genetic mapping, we chose two plant introduction accessions as parents from the US Department of Agriculture germplasm collection (Griffin, GA). PI 244018 (C. lanatus var. citroides), a yellow-fleshed citron from Zimbabwe (Plant ID-TGR 98), was crossed with PI 270306 (C. lanatus var. lanatus), a white-fleshed watermelon from Zaire (Plant ID-Mangara). The F1 population was produced in summer 2006 at the West Virginia State University Agricultural Experiment Station. F2 populations of watermelon were generated from a single F1 plant that was self-pollinated. For the diversity study, we chose a set of 130 edible-type watermelon accessions from around the world (see Supplementary Information).
RAD library preparation
Genomic DNA from PI 244018 and PI 270306 was digested with the restriction endonuclease PstI and processed as RAD (restriction site associated DNA) libraries as described by Baird et al. (2008) with modification. Briefly, ~300 ng genomic DNA was digested for 60 min at 37 °C in a 50-μl reaction with 20 U PstI (New England Biolabs [NEB]). Samples were heat-inactivated for 20 min at 65 °C. PstI P1 adapters each contained a unique multiplex sequence index (barcode), which is read during the first 4 nucleotides of the Illumina sequence read. Then 100 nM P1 adaptor was added to each sample along with 1 μl of 10 mM rATP (Promega), 1 μl 10 × NEB Buffer 4, 1.0 μl (1,000 U) T4 DNA Ligase (high concentration; Enzymatics, Inc.) and 5 μl H2O, and incubated at room temperature for 20 min. Samples were then heat-inactivated for 20 min at 65 °C, pooled and randomly sheared with a Bioruptor (Diagenode) to an average size of 500 bp.
Samples were then run on 1.5 % agarose (Sigma) and 0.5 × TBE gel, and DNA, 300–800 bp, was isolated using a MinElute Gel Extraction Kit (Qiagen). End-blunting enzymes (Enzymatics, Inc.) were used to remove single-strand overhangs of the DNA. Samples were purified on a MinElute column (Qiagen), and 15 U Klenow fragment exo- (Enzymatics) was used to add adenosine (Fermentas) overhangs on the 3′ end of the DNA at 37 °C. After purification, 1 μl of 10 μM P2 adapter, a divergent modified Solexa adapter (2006 Illumina, Inc.), was ligated to the obtained DNA fragments at 18 °C. Samples were purified and eluted in 50 μl. The eluate was quantified with a Qubit fluorimeter, and 20 ng of the product was used in PCR amplification with 20 μl Phusion Master Mix (NEB), 5 μl of 10 μM modified Solexa Amplification primer mix (2006 Illumina, Inc., all rights reserved) and up to 100 μl H2O. Phusion PCR settings followed product guidelines (NEB) for a total of 18 cycles. Again, samples were gel-purified, excising DNA from 300 to 700 bp, and diluted to 1 nM. Two RAD libraries corresponding to PI 244018 and PI 270306 were run on an Illumina Genome Analyzer at the University of Oregon Sequencing Facility (Eugene, OR, USA). Illumina/Solexa protocols were followed for paired-end (2 × 54 bp) sequencing chemistry.
RAD long-read assembly and genotyping
To construct RAD long-read contigs, data from each accession were used to construct a reference assembly for SNP detection. First, sequences with >20 poor Illumina quality scores were discarded (typically < 5 % of all data). Remaining reads were then collapsed into RAD sequence “clusters,” which share 100 % sequence identity at the single-end Illumina read. We imposed a minimum of 25× and maximum of 500× sequence coverage at RAD single-end reads to maximize the efficient assembly of sequences contributed from low-copy, single-dose genome positions. Single loci with coverage <25× often display short and fragmented contig assemblies due to insufficient sequence coverage, whereas loci with >500 identical single-end reads are often contributed from high-copy contaminant DNA (plastids) or can be contributed by dosing from multiple genomic loci (e.g., repetitive class sequences). The variable paired-end sequences for each common single-end locus were extracted from these filtered sequences and passed to the Velvet sequence assembler for contig assembly (Zerbino and Birney 2008). SNP-containing sequences were mapped to the whole-genome sequence draft that is now publicly available (www.icugi.org), using Bowtie v1.0. For further analysis, we chose 522 SNPs. To annotate the sequences, we performed a similarity search against the non-redundant protein database with a BLASTx search (NCBI-BLAST v. 2.2.26) and an e-value cutoff of 10E−6 and maximum of 20 highest scoring pairs per subject. The BLAST hits were then analyzed by using the command-line version of BLAST2GO (BLAST2GO PIPE 2.5) to find the corresponding gene ontology (GO) terms in biological process, molecular function and cellular component classes. The number of sequences mapped to each of the observed GO terms was computed and plotted using the graphical Blast2GO results. Mapping was performed with and without allowing mismatches to locate various chromosome positions.
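The clustering and coverage filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function name `filter_rad_clusters` and the toy reads are hypothetical, the lower bound of 25 reads comes from the text, and the upper bound of 500 reads follows the ">500 identical single-end reads" figure given above.

```python
from collections import Counter

MIN_COVERAGE = 25   # below this, contigs tend to be short and fragmented
MAX_COVERAGE = 500  # above this, reads likely come from plastid or repetitive DNA

def filter_rad_clusters(single_end_reads, min_cov=MIN_COVERAGE, max_cov=MAX_COVERAGE):
    """Collapse identical single-end reads into clusters and keep those
    whose read count lies within [min_cov, max_cov]."""
    clusters = Counter(single_end_reads)  # 100 % identity = identical string
    return {seq: n for seq, n in clusters.items() if min_cov <= n <= max_cov}

# Toy data (hypothetical): one usable locus, one under-covered, one over-covered.
reads = ["ACGT"] * 30 + ["TTGA"] * 10 + ["GGCC"] * 600
kept = filter_rad_clusters(reads)
print(kept)
```

Only the clusters that pass this filter would have their paired-end reads handed to the assembler.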
For the chosen SNPs, flanking regions were extracted by using a custom Python script, and a primer designability rank score (0–1) was calculated with Illumina’s Veracode Assay Designer software. SNPs with the highest primer designability rank score, totaling 384, were selected for Illumina custom Oligo Pool analysis (OPA), a multiplexing procedure in a fluidics-based system that does not involve fixed arrays (for sequences, see Supplementary File). Parents of the mapping population and 94 F2 progenies, along with 130 watermelon accessions, were genotyped using Illumina VeraCode technology for GoldenGate assays with the BeadXpress system (Illumina, San Diego, CA, USA) (Lin et al. 2009), according to the manufacturer’s protocol. SNP data were analyzed with the Genotyping module (v1.6.3) of Illumina GenomeStudio (v2010.1) with a GenCall threshold of 0.25.
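The custom script itself is not given in the article, so the following is a hypothetical sketch of what flanking-sequence extraction and score-based selection might look like. The function names, the 60-bp default flank length and the bracketed `[ref/alt]` output layout are assumptions made for illustration; the designability scores would come from the assay design software, not from this code.

```python
def extract_flanks(contig, snp_pos, flank=60):
    """Return (left, right) flanking sequences around a 0-based SNP position."""
    left = contig[max(0, snp_pos - flank):snp_pos]
    right = contig[snp_pos + 1:snp_pos + 1 + flank]
    return left, right

def format_snp(contig, snp_pos, ref, alt, flank=60):
    """Write the SNP with its flanks as left[ref/alt]right (assumed layout)."""
    left, right = extract_flanks(contig, snp_pos, flank)
    return f"{left}[{ref}/{alt}]{right}"

def top_by_score(snps, n=384):
    """snps: list of (snp_id, designability_score); keep the n best."""
    return sorted(snps, key=lambda s: s[1], reverse=True)[:n]

# Toy example (hypothetical sequence and scores).
print(format_snp("AAACCCGTTT", 5, "C", "T", flank=3))
print(top_by_score([("snp_a", 0.9), ("snp_b", 0.4), ("snp_c", 0.7)], n=2))
```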
To identify problem samples, the call rate was plotted against sample number in a scatter plot, and samples with a poor call rate were eliminated. In addition, GenTrain Scores and GenCall Scores (GC Scores) calculated with the software were checked to refine the SNP calling. SNP clusters were manually adjusted when appropriate and recoded in the “AUX” module of the software according to the following key: 1 = robust; 2 = some manual editing; 3 = heterozygotes dispersed; 4 = heterozygotes similar to homozygous class; 5 = other not reliable; 6 = no amplification; 7 = monomorphic. These data were exported along with the genotypes to provide useful information on, for example, whether marker skewness was real or artificial because of a poorly performing assay.
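As a rough illustration of the sample-level filter, the call rate of a sample is the fraction of SNPs that received a genotype call. The sketch below assumes genotypes are stored as strings with `"--"` marking a no-call, and uses a cutoff of 0.9; both the encoding and the cutoff are assumptions, since the article does not state the threshold that was applied.

```python
def call_rate(genotypes, missing="--"):
    """Fraction of SNPs with a genotype call for one sample."""
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)

def drop_poor_samples(samples, threshold=0.9):
    """Keep samples whose call rate is at least `threshold` (assumed cutoff)."""
    return {name: g for name, g in samples.items() if call_rate(g) >= threshold}

# Toy data (hypothetical sample names and calls).
samples = {
    "S1": ["AA", "AB", "BB", "AA"],
    "S2": ["AA", "--", "--", "BB"],
}
print({name: call_rate(g) for name, g in samples.items()})
print(list(drop_poor_samples(samples)))
```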
SSR amplification and mapping
We obtained SSR primers (282 pairs) from the ‘Charleston Grey’ watermelon sequencing project (manuscript in preparation). In total, 22 bacterial artificial chromosome (BAC)-end sequences were provided by Dr. Hongbin Zhang (Texas A&M University, College Station, TX, USA). PCR was performed in a total volume of 10 μl containing 10 ng DNA template, 1 × Taq buffer, 2 mM MgCl2, 0.2 mM dNTPs, 1 U Taq DNA polymerase (Fermentas) and 0.5 μM each of forward and reverse primers. Amplification was performed in a GeneAmp PCR 9700 System thermocycler (Applied Biosystems) programmed at 94 °C for 2 min followed by 35 cycles of 94 °C for 30 s, 50–65 °C for 30 s and 72 °C for 1 min, and a final extension step at 72 °C for 10 min. Amplified products were separated on a high-throughput DNA fragment analyzer (Advance FS; Advanced Analytical Technologies, Inc. [AATI], Ames, IA, USA). Amplified PCR products were diluted 1:10; depending on the initial concentration of the products, the dilution and the injection voltage were adjusted to prevent overloading the fragment analyzer with PCR product. PCR product of 2 μl was pipetted into 22 μl of 1 × TE dilution buffer in respective wells of the sample plate. The samples were size-separated with a 96-capillary automated system with capillaries of 80 cm. Polymer and other required reagents were from a double-stranded DNA (dsDNA) DNF-900 kit (AATI). The DNF-900 dsDNA reagent kit can effectively separate amplicons ranging between 35 and 500 bp and can differentiate a 1-bp difference between various alleles. Following the capillary electrophoresis, the data were processed by PRO Size 2.0 (AATI). The data were normalized to the 35-bp lower marker and 500-bp upper marker and calibrated to the 75- to 400-bp range.
The resulting genotypes for the mapping population were matched for respective parental genotypes and transformed into a locus file. The genetic linkage map was constructed with JoinMap 4.1 (Van Ooijen 2011). Markers were grouped into linkage groups with a logarithm of odds (LOD) score of 10.0 as the initial threshold and groups were selected up to LOD 5.0. Default parameters were used with the maximum likelihood algorithm for map building with the exception of changing spatial sampling thresholds. The Kosambi map function was used to estimate map distances.
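For readers unfamiliar with the Kosambi map function, it converts a recombination fraction r between two markers into a map distance d = 25 ln[(1 + 2r)/(1 − 2r)] cM, which allows for crossover interference. The short sketch below shows only this conversion; it is not JoinMap's implementation, and the function name `kosambi_cm` is hypothetical.

```python
import math

def kosambi_cm(r):
    """Convert a recombination fraction r (0 <= r < 0.5) into a map
    distance in centimorgans using the Kosambi map function."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))

# For small r the distance is close to 100 * r; it grows faster as r nears 0.5.
for r in (0.01, 0.10, 0.30):
    print(f"r = {r:.2f} -> {kosambi_cm(r):.2f} cM")
```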
The P matrix for five principal components was calculated from all the SNP genotypes using genotype principal component analysis (PCA) with the SNP & Variation Suite (SVS) v7.7.6 (Golden Helix, Bozeman, MT, USA; www.goldenhelix.com). A PCA chart is presented with the first two eigenvectors. Allele calls were also used for genetic diversity analysis with Tassel v3.0 (www.maizegenetics.net), and a neighbor-joining tree was built with MEGA 5 (www.megasoftware.net).
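To make the idea of a genotype-based distance concrete, the sketch below computes a simple allele-sharing distance between two accessions from biallelic SNP calls. It is offered only as an illustration: the genotype encoding (two-letter strings, `"--"` for a no-call) is an assumption, and this is not necessarily the exact metric computed by the software packages named above. A matrix of such pairwise distances is the kind of input from which a neighbor-joining tree can be built.

```python
from collections import Counter

def allele_sharing_distance(g1, g2, missing="--"):
    """Mean of (1 - shared alleles / 2) over SNPs called in both accessions.
    Returns None when no SNP is called in both."""
    total, n = 0.0, 0
    for a, b in zip(g1, g2):
        if a == missing or b == missing:
            continue  # skip SNPs with a no-call in either accession
        shared = sum((Counter(a) & Counter(b)).values())  # 0, 1 or 2 alleles
        total += 1 - shared / 2
        n += 1
    return total / n if n else None

# Toy genotypes (hypothetical): identical, half-shared and fully different calls.
acc1 = ["AA", "AG", "GG"]
acc2 = ["AA", "AA", "AA"]
print(allele_sharing_distance(acc1, acc2))
```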
SNP validation and sequence annotation
A total of 22.8 million reads were obtained from PI 244018 and PI 270306, representing ~1.2 Gb of sequence data. All reads were first coalesced into contigs by using the Velvet assembler. Initial de novo assembly produced ~2.9 Mb of watermelon genome sequence distributed over 12,105 individual contigs. Contig lengths ranged from 320 to 750 bp. The contig length distribution was in line with the fragment size range selected during RAD-Seq library preparation. Contigs were then evaluated for the presence of repetitive elements by using the RepeatMasker web server with the Arabidopsis Repbase library. The percentage of the RAD-Seq (RHA 464) assembly classified as repetitive by RepeatMasker was 0.9 %. This finding is consistent with a genome assembly principally from low-copy regions, because the 450-Mb watermelon genome is expected to contain more than 60 % repetitive nucleotide content. The GC content of the assembly was 32.2 %, which is consistent with results from paired-end RAD-Seq studies in other plant genomes (Pegadaraju et al. 2013). SNP-containing sequences were mapped to the publicly available draft sequence of the watermelon genome (www.icugi.org) with Bowtie v1.0. Mapping was performed with and without mismatches allowed to locate various chromosome positions. A total of 8,234 SNPs from both mapping parents were identified after submission to the Assay Design Tool of Illumina (http://icom.illumina.com) with a final design score of 0.7. We selected 522 of these SNPs by their uniform distribution across the chromosomes and their importance in biological and molecular functions as inferred by GO analysis. A set of 384 was assayed; 50 of the SNPs could not be reliably scored, mainly because of incorrect automatic clustering by the GenomeStudio software. We obtained a final set of 334 successfully genotyped, polymorphic and monomorphic SNPs across the watermelon accessions and mapping populations.
The GO annotations for SNPs showed a fairly consistent sampling of functional classes, which indicates that they represent various genes with known molecular functions involving important biological processes. Cellular, metabolic, biosynthetic and developmental processes were evenly represented. The annotation files for 384 SNP sequences and figures (Figs. 1S, 2S and 3S) pertaining to GO annotations for biological processes, molecular functions and cellular components are in supplementary files.
Development of integrated genetic map
Because most of the markers in linkage groups were assembled by using the whole-genome sequence for watermelon, chromosome numbers could be readily assigned to the various linkage groups. Among the mapped SSRs, only two were from BAC-end sequences, which were mapped onto chromosomes (Chrs) 7 and 8. The number of markers per linkage group ranged from 16 to 47. SNPs ranged from 3 in Chr10 to 34 in Chr9. In all, 19 SNPs of Chr10 were present in the mapping data, but only three were placed on the map because of distorted segregation. Map distances of the 11 chromosomes were 51.24, 94.33, 68.91, 66.22, 56.68, 99.66, 174.28, 90.84, 77.21, 76.60 and 68.75 cM. Chromosome-wide physical locations of map positions for various SNPs are given in the Appendix table.
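The summary figures quoted for the map can be checked directly from the per-chromosome distances listed above. In the short sketch below, the marker count (232 SNPs plus 50 SSRs) is taken from the abstract; the variable names are illustrative only.

```python
# Per-chromosome map lengths in cM, as listed in the text.
chromosome_lengths_cm = [51.24, 94.33, 68.91, 66.22, 56.68, 99.66,
                         174.28, 90.84, 77.21, 76.60, 68.75]
n_markers = 232 + 50  # mapped SNPs + mapped SSRs

total_cm = sum(chromosome_lengths_cm)
avg_spacing_cm = total_cm / n_markers  # total length divided by marker count

print(f"total map length: {total_cm:.2f} cM")
print(f"average distance between markers: {avg_spacing_cm:.2f} cM")
```

The reported 3.28 cM corresponds to total length divided by the number of markers; dividing by the number of marker intervals (markers minus one per linkage group) would give a slightly larger value.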
Genetic diversity and relationships between the cultivars
Here we developed an SNP assay containing 384 markers that was suitable for genetic mapping and resolving genetic diversity among cultivated watermelon. Most SNP-containing sequences were found to have catalytic and binding activities and included a large number of hydrolases, kinases and transferases. Other abundant assignments were abiotic and biotic stress-response along with the other signal transduction, transport and transcriptional regulations. Platforms with the collections of functionally important SNPs and those with known annotations will be useful in future genetic studies (Esteras et al. 2012).
The genetic map we developed covered all 11 chromosomes and spanned 924.72 cM. Previously published SNP maps (Sandlin et al. 2012) are longer than the map in our study. Genetic maps with sizes <800 cM would agree well with the small watermelon genome size of 450 Mb (Ren et al. 2012). In addition, the three SNP maps generated by Sandlin et al. (2012) shared only 55 markers, which indicates that most of their markers were not polymorphic. Moreover, linkage groups in their maps were not assigned to any chromosomes. In our study, the average distance between pairs of markers was 3.28 cM across the map, which is comparable with previous SNP maps (3.8, 4.2 and 3.4 cM) (Sandlin et al. 2012).
Previous mapping studies have shown the presence of distorted segregation in the wide cross of lanatus × citroides (Levi et al. 2002). Reddy et al. (2013) reported the presence of wide chromosomal structural differences between lanatus and citroides, which could explain the distorted segregation in the mapping populations derived from these subspecies. When we compared the physical and genetic map positions of various SNPs, we noted many disagreements, which may have occurred because our mapping population was derived from a cross of C. lanatus var. lanatus × C. lanatus var. citroides, which is known to produce genome-wide distortions.
Previous studies indicated that the molecular diversity in cultivated watermelon ranged from 2 to 4 % (Levi et al. 2013; Nimmakayala et al. 2011; Romão 2000; Levi et al. 2001; Zhang et al. 2012). From our analysis of cultivated watermelons only, we support published findings of the narrow genetic diversity among American watermelon accessions. One explanation for the narrow genetic diversity in American and European germplasm could be the founder effect, whereby a small number of accessions are brought to a continent or region as people travel (Dane and Liu 2007; Tóth Zoltán et al. 2007). Watermelons may have entered Europe around 711 AD, when the Moors invaded the Iberian peninsula, or during the Crusades (Tóth Zoltán et al. 2007). In India and China, watermelon was introduced around 800 and 1100 AD, respectively (Paris et al. 2013). The introduction of watermelon cultivars into the Americas occurred after the second voyage of Columbus and during the slave trade and colonization (Romão 2000; Paris et al. 2013; Tóth Zoltán et al. 2007). In contrast, when we analyzed cultivated types with edible flesh from Africa alone, the diversity of accessions ranged widely. Most importantly, our study disagrees with results obtained by Zhang et al. (2012), especially in showing distinct clustering of American and Chinese or East Asian types. The discrepancy may be due to the smaller set of Chinese and United States cultivars included in the Zhang et al. study. In our research, we had 130 cultivated forms of watermelon from the entire world, especially the ancestral cultivar forms from Africa. Because we included several intermediate types between East Asian and American types in the analysis, the two ecotypes merged into a single group.
Most pioneering work on watermelon ecotypes was carried out by Fursa (1972). According to this work, African collections are highly diverse and possess foundation types for global collections. Fursa (1972) divided the rest of the world's watermelon collections into various ecotypes based on their fruit characteristics. The American ecotypes have oval-shaped fruits with red flesh and a sweet to highly sweet flavor. In contrast, East Asian types are round, with flesh ranging from yellow to orange or red. The Russian ecotype is round, with pink or crimson flesh. Most Transcaucasian watermelons are oval-shaped with red flesh and are less sweet. Afghanistan and Iran have a wide variety of watermelons, with fruit shapes ranging from round to oval and a wide array of flesh colors: white, yellow, light pinkish and red. Because our collection contained fruit shapes ranging from round to oval with a wide range of colors, our study did not agree with that of Zhang et al. (2012), which was focused on resolving differences between East Asian and American ecotypes.
Watermelon fruit can be round, oval, blocky, oblong or elongate (Poole 1944; Guner and Wehner 2004). Weetman (1937) concluded that fruit shape dissimilarities in watermelons are fixed in the primordial ovary. One of the major factors known to alter the shape of the ovary in cucurbits is sex expression. If the sex expression is andromonoecious, the ovary is round in shape, with some exceptions. In contrast, most monoecious flowers produce an oval-shaped ovary. Sex expression is known to be altered by drought or higher temperature. Cultivars with andromonoecious flowers can withstand drought or higher temperatures because the pollen is produced in the same bisexual flower. In monoecious flowers, pollen is produced in male flowers and could desiccate under stress. This situation may explain the altered genetic diversity across the geographical zones. However, further critical research is needed to understand the genes and genomic regions that are important for varietal differentiation and divergence, with special reference to watermelon.
The 384-plex SNP set is a powerful tool for characterizing genetic relatedness and for developing high-resolution genetic maps. Robust allele calling and low-cost genotyping allow for analysis of large numbers of families in breeding populations or accessions in germplasm collections. High-density SNP arrays, with thousands of SNPs for crops such as maize and rice (Ganal et al. 2011; Hansey et al. 2012; Zhai et al. 2013), show what is possible for the watermelon research community. The identification of SNPs for watermelon in this and previous studies will allow for genome-wide association mapping and marker-assisted selection to support breeding programs.
This project was supported by the USDA-NIFA (no. 2013-38821-21453), National Science Foundation under Grant No. NSF 09-570 - EPS-1003907, Gus R. Douglass Institute and NIH Grant P20RR016477 to the West Virginia IDeA Network for Biomedical Research Funding. The authors thank GRDI graduate assistantships for A. Bhandary and L. Abburi. The authors are grateful to R. Jarret, Plant Genetic Resources Conservation Unit, USDA-ARS (Griffin, GA) for providing the seeds of germplasm accessions.
Conflict of interest
We declare that we have no conflict of interest.