Plant materials and DNA extraction
Twenty-four yellow mustard accessions, including four cultivars, 10 landraces and 10 inbred breeding lines (Table 1 and Table S1), were used in the study. The landraces and cultivars are open-pollinated populations originating from different countries. The 10 inbred breeding lines were developed at the Agriculture and Agri-Food Canada Saskatoon Research Centre. Y1352-9, Y1476-1, Y1485-5, Y1495-2, Y1355-2, Y1487 and Y1492 are inbred lines produced by inbreeding different open-pollinated plants of the cultivar Andante (Table S1). Y1354-2 is the doubled haploid line SaMD3 produced by Bundrock (1998). Y1486-2 is derived from inbreeding of a Russian landrace. Y1354-7 was produced by seven generations of inbreeding of the F1 plant between the cultivar Sabre and the Svalöf high oil line (Todd Olson, personal communication with B. F. Cheng, 2010). About 10 seeds were randomly chosen from each selected accession. Plants were grown from seed for 3–4 weeks in a greenhouse at the Saskatoon Research Centre. Young leaf tissue from individual plants of each accession was collected, freeze-dried, and stored at −20 °C. DNA from one plant per accession was extracted from 15 mg of freeze-dried tissue with a DNeasy Plant Mini kit (Qiagen, Mississauga, ON, Canada) following the manufacturer’s instructions, quantified with a Thermo Scientific Nanodrop 8000 (Fisher Scientific, Ottawa, ON, Canada), and adjusted to 100 ng/μl in Qiagen AE buffer (10 mM Tris–HCl, 0.5 mM EDTA, pH 9.0).
Table 1 List of 24 yellow mustard accessions studied, 454 pyrosequencing information, and identified SNPs
Genome reduction and barcoding
Genomic reduction and multiplex-identifier (MID) barcoding of the yellow mustard samples were conducted following the method of Maughan et al. (2009), using reagents and supplies from the same sources where possible. EcoRI and BfaI adaptors and barcoded PCR primers were synthesized by Integrated DNA Technologies (Coralville, IA, USA). The 24 samples were divided into two pools (Table 1). All samples were digested with EcoRI and BfaI. BfaI-adaptors and biotin-modified EcoRI-adaptors were ligated onto the digested fragments. The ligation reactions were cleaned on Chroma Spin +TE-400 columns (Clontech, Mountain View, CA, USA) following the manufacturer’s instructions. Fragments carrying the biotin-modified EcoRI-adaptor were selected with streptavidin-coated paramagnetic beads (Dynabeads M-280; Invitrogen, Burlington, ON, Canada) according to the manufacturer’s instructions.
Twenty-four unique Roche 454 RLMID barcodes were selected and used to identify the 24 samples (Table 1). Paramagnetic beads with bound, digested DNA fragments were used as templates for PCR with primers specific to the EcoRI- and BfaI-adaptors, containing a specific MID barcode for each sample in each pool. The PCR method followed Maughan et al. (2009), using the Clontech HF2 chemistry and a C1000 thermocycler (BioRad, Mississauga, ON, Canada). Between four and six replicates of each PCR reaction were carried out, and a 3 μl sample from each was separated on a 1.5 % agarose gel to confirm amplification. Successful amplicons for each sample were bulked together and concentrated by evaporation in a vacuum centrifuge to approximately 35 μl. Individual samples were separated on a 1.5 % agarose gel for 5 h at 60 V. A gel fragment between 400 and 600 bp, sized against the New England Biolabs 2-Log ladder (Pickering, ON, Canada), was excised from each sample and cleaned by using the Qiaquick Gel Extraction kit (Qiagen, Mississauga, ON, Canada). Samples were eluted in 35 μl of one-third concentration Qiagen EB (3.33 mM Tris; pH 8.5) and quantified with the Thermo Scientific Nanodrop 8000. Individual samples were concentrated by evaporation in a vacuum centrifuge, re-quantified, and adjusted to 50 ng/μl with water and 1 mM EDTA pH 8.0, so that the final salt concentration did not exceed 10 mM Tris and 1 mM EDTA. Each pool was prepared with 200 ng of each of eight individual samples for a total of 1,600 ng at 50 ng/μl.
Pools were submitted to the DNA Technologies Laboratory at the Canadian National Research Council, Saskatoon, Saskatchewan, Canada, and sequenced on a full Roche 454 PicoTiterPlate (PTP) by using the Roche 454 GS FLX instrument with Titanium chemistry. An additional run was made on a half PTP (i.e., a quarter PTP for each pool) by using the Roche 454 GS FLX+ instrument with Titanium chemistry.
Generation of contigs and SNPs
DNA reads from the two 454 pyrosequencing runs were combined and separated into sample-specific SFF files according to MID barcode with the Roche Newbler SFF tools, followed by removal of the forward and reverse adaptor sequences. Contig generation and SNP detection were performed with the DIAL pipeline (Ratan et al. 2010). The pipeline takes the SFF file of each sample and calls SNPs fully automatically from all added SFF files on a Linux system. However, it requires as input both the expected length of the target genome, which it uses to identify contigs from all added SFF files, and the version of Roche Newbler, because it depends on Newbler’s gsAssembler to assemble the reads into the identified contigs for SNP identification. Thus, DIAL was trained with different versions of Newbler and with expected target genome lengths ranging from 50 kbp to 100 Mbp. The final analysis was made by using Newbler v2.0.01.14 and an expected genome size of 300 kbp, which generated the maximum number of contigs with SNPs for the 24 samples. All the training analyses generated an unrealistically low yield of 1–2 SNPs in the output file snps.txt because of the highly stringent filters used for SNP calling. However, the pipeline also generated an output file, report.txt, listing all the assembled contigs with their lengths and supporting reads, the positions of the variant alleles, the number of reads supporting each allele, and the quality values of the reads at each position. Several custom Perl scripts were written to extract contigs and SNPs from report.txt into separate files for validation, reporting, and analysis; these scripts are available upon request from the corresponding author.
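The barcode-based separation of reads described above can be illustrated with a minimal Python sketch. This is not the Roche SFF tooling used in the study: it assumes reads are already available as plain sequence strings, that each MID barcode sits at the very start of the read, and that no mismatches are tolerated. The function name and the barcode sequences in the usage note are hypothetical placeholders.

```python
def demultiplex(reads, barcodes):
    """Assign each read to the sample whose MID barcode it starts with.

    reads    : iterable of sequence strings (assumed already decoded from SFF)
    barcodes : dict mapping sample name -> MID barcode sequence (placeholders)
    Returns (bins, unassigned); the barcode is trimmed from each binned read.
    """
    bins = {sample: [] for sample in barcodes}
    unassigned = []
    for read in reads:
        for sample, mid in barcodes.items():
            if read.startswith(mid):              # exact match only (assumption)
                bins[sample].append(read[len(mid):])
                break
        else:
            unassigned.append(read)               # no barcode matched
    return bins, unassigned
```

For example, with the placeholder barcodes `{"S1": "ACGTACGTAA", "S2": "TGCATGCATT"}`, a read beginning with `ACGTACGTAA` is placed in the `S1` bin with the barcode removed, and a read matching neither barcode is returned in `unassigned`.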
Contig annotation
Basic local alignment search tool (BLAST) searches (Altschul et al. 1990) were made for all identified contigs by using two approaches to provide some level of validation and gene ontology (GO) annotation of the contig sequences. The first was to conduct BLAST searches directly against the NCBI nr/nt protein database on the NCBI website (http://www.ncbi.nlm.nih.gov/). The second was to employ the program Blast2GO (Conesa et al. 2005) against the NCBI nr protein database. Specifically, the applied annotation parameters were a pre-e-value-Hit-Filter (10⁻⁶), annotation cut-off threshold (55) and GO weight (5). Blast2GO uses BLAST to find similar sequences (potential homologs) for one or several input sequences, extracts all GO terms associated with each of the obtained hits, and returns an evaluated GO annotation for the query sequence(s).
Contig and SNP validation
A random set of 41 contigs was selected for validation with Sanger sequencing (SS) based on three randomly selected samples (SA44, SA115, Y1476-1). The contig selection considered only the variable SNP count and contig length, not the BLAST search results. The PCR primers for the 41 contigs were designed by using Primer3 (v.0.4.0) (Rozen and Skaletsky 2000). The conditions for PCR were: 1× KAPA 2G Buffer A containing 1.5 mM MgCl2 (KAPA Biosystems, Woburn, MA, USA), 1× KAPA Enhancer 1, 0.2 mM each dNTP, 0.4 pmol/μl each of forward and reverse primers, 100 ng of the same genomic DNA template samples as used above for NGS, and 0.5 U KAPA 2G Robust polymerase in a final volume of 25 μl; touchdown PCR cycled at 95 °C for 3 min followed by 10 cycles of 95 °C for 10 s, 60 °C decreasing 0.5 °C per cycle for 15 s, 72 °C for 30 s, followed by 25 cycles of 95 °C for 10 s, 55 °C for 15 s, 72 °C for 20 s, followed by a final extension of 72 °C for 30 s. A 3 μl sample of each PCR product was separated on 1.5 % agarose for 2 h at 120 V. Two primer sets yielded either no product or multiple products. The PCR products of the remaining 39 primer sets were cleaned following the method outlined by Rosenthal et al. (1993) and submitted to the DNA Technologies Laboratory at the Canadian National Research Council, Saskatoon, for Sanger sequencing.
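The touchdown cycling program above can be written out step by step, as in the following minimal Python sketch. It assumes that the first touchdown cycle anneals at 60 °C and that each later cycle is 0.5 °C lower, so the tenth cycle anneals at 55.5 °C before the fixed 55 °C phase begins; the function name and parameter names are illustrative only, and ramp times between steps are ignored.

```python
def touchdown_program(start_c=60.0, step_c=0.5, td_cycles=10,
                      final_anneal_c=55.0, fixed_cycles=25):
    """Return the cycling program as a list of (temperature_C, seconds) steps."""
    steps = [(95.0, 180)]                       # initial denaturation, 3 min
    for i in range(td_cycles):                  # touchdown phase
        anneal = start_c - step_c * i           # 60.0, 59.5, ..., 55.5 (assumed)
        steps += [(95.0, 10), (anneal, 15), (72.0, 30)]
    for _ in range(fixed_cycles):               # fixed-annealing phase
        steps += [(95.0, 10), (final_anneal_c, 15), (72.0, 20)]
    steps.append((72.0, 30))                    # final extension
    return steps


program = touchdown_program()
hold_seconds = sum(seconds for _, seconds in program)  # hold time only, no ramping
```

Under these assumptions the program comprises 35 amplification cycles in total, and `hold_seconds` gives the summed hold time of all steps.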
Forward and reverse Sanger sequences from each sample were assembled and aligned with Sequencher v.5.0 (GeneCodes, Ann Arbor, MI, USA), then aligned against the consensus sequence generated from 454 pyrosequencing for each contig by using Muscle v.3.6 (Edgar 2004), and proofread by hand. The putative 454 SNPs were checked with the Sanger sequences, where sample data were available, and additional SNPs and indels from the SS, if any, were also identified.
Comparative SNP identification by Roche Newbler
Roche Newbler GS Reference Mapper software (version 2.6p1, supplied by Roche in November 2011) was also run on all Roche 454 sequence reads generated for this study against the 39 contigs confirmed by SS. The software called all sequence differences between the 39 contig sequences and the assayed samples, including SNPs and indels, and stored them in the file 454AllDiffs.txt. A custom Perl script was written to extract genetic variants from 454AllDiffs.txt and to compare them with those identified by SS and the DIAL pipeline.
Diversity analysis
The 454 SNP data obtained from the DIAL pipeline were analyzed for each sample by counting the total putative SNPs, the heterozygous SNPs, and the SNPs that were undetected in the sample due to insufficient sequence reads. As the 454 SNP data are highly imbalanced, a random permutation test was made on the pairwise sample differences in SNP count. For each locus, the 454 SNPs (including missing ones) were randomly permuted over the 24 samples, and the permutation was repeated for all loci; the SNP count for each sample was then obtained from the permuted 454 SNP data, and the permuted pairwise sample differences in SNP count were calculated. This process was run 10,000 times to calculate the proportion of runs in which the permuted pairwise sample difference was larger or smaller than (depending on the sign of) the observed pairwise sample difference in SNP count, giving the significance level of the test for each pairwise sample difference in SNP count. The random permutation was performed with a custom R script within R version 2.15 (R Development Core Team 2011) that is available upon request.
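The permutation procedure can be sketched as follows. This is a minimal Python illustration, not the custom R script used in the study. It assumes the data are held as a loci × samples matrix in which a missing SNP is `None`, that the SNP count of a sample is its number of non-missing calls, and that ties with the observed difference are counted as at least as extreme; all function names are hypothetical.

```python
import random


def snp_counts(matrix):
    """Number of non-missing calls per sample (column) in a loci x samples matrix."""
    n_samples = len(matrix[0])
    return [sum(row[j] is not None for row in matrix) for j in range(n_samples)]


def permutation_pvalues(matrix, n_perm=10000, seed=0):
    """Proportion of permutations whose pairwise count difference is at least
    as extreme as the observed one, in the direction of the observed sign."""
    rng = random.Random(seed)
    n = len(matrix[0])
    obs = snp_counts(matrix)
    hits = {(i, j): 0 for i in range(n) for j in range(i + 1, n)}
    for _ in range(n_perm):
        # Shuffle the calls (missing ones included) across samples, locus by locus.
        permuted = [rng.sample(row, len(row)) for row in matrix]
        cnt = snp_counts(permuted)
        for (i, j) in hits:
            d_obs = obs[i] - obs[j]
            d_perm = cnt[i] - cnt[j]
            if (d_obs >= 0 and d_perm >= d_obs) or (d_obs < 0 and d_perm <= d_obs):
                hits[(i, j)] += 1
    return {pair: k / n_perm for pair, k in hits.items()}
```

Calling `permutation_pvalues(matrix)` returns one proportion per sample pair; a small value indicates that the observed difference in SNP count is unlikely under random allocation of available and missing SNPs among samples.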
An analysis of molecular variance (AMOVA) was performed with Arlequin version 3.01 (Excoffier et al. 2005) on the 454 SNP data to quantify the genetic variation present among various groups of samples (landrace, cultivar, and breeding line; yellow- and black-seeded groups). To assess the impact of missing SNPs on the variation partition, the original SNP data were re-coded with 1 for a missing SNP and 0 for an available SNP for each locus and sample (ignoring the nucleotide information), and AMOVA was performed on the re-coded data based on the above group structures.
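The re-coding of missing versus available SNPs is simple enough to show directly. The sketch below assumes the same loci × samples layout with `None` marking a missing SNP; the function names are illustrative and this is not the script used to prepare the Arlequin input.

```python
def recode_missing(matrix):
    """Re-code a loci x samples matrix: 1 for a missing SNP, 0 for an available
    one, ignoring the nucleotide information."""
    return [[1 if call is None else 0 for call in row] for row in matrix]


def missing_fraction_per_sample(matrix):
    """Fraction of loci with a missing SNP for each sample (column)."""
    coded = recode_missing(matrix)
    n_loci = len(coded)
    return [sum(row[j] for row in coded) / n_loci for j in range(len(coded[0]))]
```

The re-coded 0/1 matrix carries only the pattern of missing data, so any group structure detected from it reflects differences in sequence coverage and not differences in nucleotide composition.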
The genetic relationships of the 24 yellow-seeded mustard samples were determined with three different approaches for comparison. The first was to generate a neighbor-joining dendrogram by using NTSYSpc 2.01 (Rohlf 1997) based on the dissimilarity matrix of the available putative SNPs. The second was to generate a neighbor-joining tree with PAUP* (Swofford 1998) and display it by using MEGA5 (Tamura et al. 2011). The third was to generate a distance-based NeighborNet (Bryant and Moulton 2004) of the 24 samples by using SplitsTree4 (Huson and Bryant 2006). To assess the impact of missing SNPs on sample clustering, the re-coded data for missing versus existing SNPs were used to determine the sample genetic relationships by following the three approaches mentioned above.
Computer simulation
To understand the effects of missing SNPs on the genetic diversity analysis, a Monte Carlo computer simulation was performed based on the available 454 SNP data, in which an average of 73 % of SNPs were missing per sample. Ten scenarios of missing SNPs were considered: completely random missing SNPs at the missing levels of 5, 20, 35, 50, 65, 80, and 95 %, completely random with an equal missing SNP level of 73 % for each sample (73e), randomly matched individually with the existing missing SNP level of 73 % (73r), and randomly matched individually with the existing missing level and pattern of 73 % (73f). Each simulation started with the generation of a full SNP data set by randomly allocating four nucleotides (A, C, G, T) based on the observed nucleotide frequencies at each locus (available from the 454 SNP dataset) to the 24 samples and repeating for 828 loci. Then, a data set with missing SNPs was generated for each missing scenario by selecting randomly from, or (for the 73f scenario) matching observed patterns of missing data with, the simulated full SNP data set. Next, a diversity analysis was performed on both full and missing SNP data sets to estimate four diversity parameters (as described in the following paragraph). This process was repeated 5,000 times, and the mean and standard deviation of the parameter estimates were obtained.
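The data-generation step of the simulation can be sketched in Python as follows. This is an illustration under stated assumptions, not the custom R script used in the study: per-locus nucleotide frequencies are supplied as dictionaries, a completely random scenario is modeled by dropping each call independently with probability equal to the missing level, and the 73f scenario is modeled by copying the missing pattern of an observed matrix. All names are hypothetical.

```python
import random


def simulate_full(freqs, n_samples, rng):
    """Draw a full loci x samples SNP matrix from per-locus nucleotide frequencies.

    freqs : list with one dict per locus, mapping nucleotide -> frequency
    """
    data = []
    for locus_freq in freqs:
        alleles = list(locus_freq)
        weights = [locus_freq[a] for a in alleles]
        data.append(rng.choices(alleles, weights=weights, k=n_samples))
    return data


def mask_random(data, level, rng):
    """Completely random missingness: drop each call with probability `level`."""
    return [[None if rng.random() < level else call for call in row]
            for row in data]


def mask_pattern(data, observed):
    """Pattern-matched missingness: a call is missing wherever the observed
    matrix (same shape, None = missing) has a missing SNP."""
    return [[None if obs is None else call for call, obs in zip(row, obs_row)]
            for row, obs_row in zip(data, observed)]
```

A single simulation run would call `simulate_full` once, apply one masking function per scenario, and then estimate the diversity parameters on both the full and the masked matrices; repeating this 5,000 times yields the means and standard deviations.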
Our simulation examined four diversity parameters: allelic counts for two groups of alleles (a tail group of alleles with frequencies smaller than 0.1 and a middle group of alleles with frequencies from 0.45 to 0.55), the probability of detecting a population genetic structure under missing SNPs, and the congruency between two distance matrices representing the 24 samples with and without missing SNPs. The AMOVA algorithms (Excoffier et al. 1992) were used to estimate the sum of squared differences among (SSA) and within (SSW) three groups of samples (see also the AMOVA analysis above), and the number of simulation runs in which SSA was larger than SSW provided the estimate of the probability of detecting a population genetic structure. Pairwise sample SNP dissimilarities were calculated following Fu (2006) from the full or missing SNP data, and the two dissimilarity matrices were used to estimate the normalized Mantel correlation coefficient (Mantel 1967). The simulations were conducted with a custom R script within R version 2.15 (R Development Core Team 2011) that is available upon request.
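The congruency measure can be illustrated with a short Python sketch. The normalized Mantel statistic is computed here as the Pearson correlation between the off-diagonal entries of two symmetric matrices. The dissimilarity function is a simple stand-in (proportion of mismatching calls over loci where both samples have data) and is an assumption; it is not claimed to reproduce the measure of Fu (2006). Function names are hypothetical.

```python
import math


def dissimilarity(a, b):
    """Proportion of mismatching calls over loci where both samples have data.
    Returns None if the two samples share no locus (stand-in measure)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def mantel_r(d1, d2):
    """Normalized Mantel statistic: Pearson correlation between the upper
    triangles of two symmetric matrices of equal size."""
    n = len(d1)
    xs = [d1[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [d2[i][j] for i in range(n) for j in range(i + 1, n)]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

A value of `mantel_r` close to 1 indicates that the dissimilarities estimated with missing SNPs rank the sample pairs in nearly the same way as those estimated from the full data. The sketch does not handle sample pairs for which `dissimilarity` returns `None`; these would need to be excluded or imputed before the correlation is computed.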