SNP discovery and array design
Two approaches, both using next generation sequencing (NGS), were adopted to identify SNPs in the C. sativa genome. The first involved the development and sequencing of cDNA libraries that were targeted to capture the 3′ end of expressed transcripts and the second approach used reduced representation through restriction digestion and size selection to limit the regions of the genome that were being assayed.
The 3′ biased cDNA libraries were sequenced using Roche 454 and 956,538 high quality sequences were generated from line 33708-06 and 586,982 for line 31471-03. Since no reference genome sequence was available for C. sativa, a de novo assembly was generated for line 33708-06, which resulted in 582,229 reads (60.9 %) being assembled into 47,313 contigs with an average length of 425 bp. Seventy-four percent (435,016) of the reads from line 31471-03 were reference mapped to the assembled contigs with a fivefold average coverage. Nucleotide variation was identified using a depth cut-off of 3 and a variant percentage of 30, which identified 8,037 SNPs (2,683 contigs) and 21,537 insertion/deletions (6,509 contigs). Due to the anticipated polyploid nature of the genome and the desire to generate locus-specific SNPs, further filtering required both the reference and the alternate base to be represented in 100 % of the reads. This significantly reduced the potential number of useful SNPs to 426 (5 % of the observed variation). Screening for SNPs with sufficient flanking sequence that also passed Illumina’s quality check for probe design (ADT score >0.6) identified 252 SNP loci, which were submitted for Illumina GoldenGate array design.
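The successive filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the `Candidate` record and its field names are hypothetical, the depth and variant-percentage cut-offs are treated as inclusive, and the check for sufficient flanking sequence is omitted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one putative SNP (field names are assumptions)."""
    contig: str
    position: int
    depth: int               # mapped reads covering the position
    variant_pct: float       # % of mapped-line reads carrying the alternate base
    reference_pct: float     # % of reference-line reads carrying the reference base
    adt_score: float         # Illumina Assay Design Tool score


def passes_initial_call(c, min_depth=3, min_variant_pct=30.0):
    # Depth cut-off of 3 and variant percentage of 30 (assumed inclusive).
    return c.depth >= min_depth and c.variant_pct >= min_variant_pct


def is_fixed_between_lines(c):
    # Locus-specific filter: each base must be present in 100 % of the reads
    # of its own line, which excludes positions confounded by homeologues.
    return c.reference_pct == 100.0 and c.variant_pct == 100.0


def passes_design(c, min_adt=0.6):
    # The text states an ADT score >0.6, so the comparison is strict.
    return c.adt_score > min_adt


def select_for_array(candidates):
    return [c for c in candidates
            if passes_initial_call(c)
            and is_fixed_between_lines(c)
            and passes_design(c)]
```

Applying the three predicates in this order mirrors the reduction reported in the text (8,037 called SNPs, 426 fixed between the lines, 252 retained for design).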
The reduced representation genomic libraries were sequenced on the Illumina GAIIx platform and the resultant data for each line are shown in Supplementary Table 2. Eighty-two percent of the Lindo reads (84,331,454) were de novo assembled using CLCBio Genomics Workbench to generate 288,946 contigs (≥200 bp), with an average length of 511 bp covering 147.7 Mb of genome sequence. The data from Licalla were reference mapped to the Lindo contigs, resulting in alignment of 46,922,482 reads to 260,431 contigs. SNP detection using CLCBio identified 234,838 SNP positions with a single variant base in Licalla at a depth of at least 8 reads and a variant percentage greater than 35 %. In order to reduce the impact of duplicate loci, only SNPs where the reference and alternate base showed no variation were further processed. This reduced the number of available SNP positions to 48,421 (20.6 % of possible variation). In addition, SNPs were further restricted by selecting those with 100 bp of flanking sequence that contained no additional SNPs, reducing the available SNPs to 6,686 in 4,919 contigs. These SNPs were submitted to Illumina’s Assay Design Tool and only those with a score of >0.6 were considered further. In an attempt to select SNPs across the genome, inferred synteny with Arabidopsis thaliana was exploited. The sequence of each contig with potentially useful SNPs was aligned to the A. thaliana genome using BLASTN (E value cut-off of 1E−12). Approximately 50 % of the contigs (2,448) were homologous to 1,878 annotated A. thaliana genes. A subset of SNPs was selected for the array design from contigs that potentially covered the expanse of the A. thaliana genome. This represented 288 SNPs that were positioned in contigs with homology to 64, 58, 48, 47 and 61 A. thaliana genes on chromosomes one to five, respectively. Since genic SNPs can be less robust due to the influence of unidentified homologues, 228 SNPs were chosen randomly from those assumed to be intergenic.
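As an illustration of the BLASTN step, the sketch below keeps the best-scoring hit per contig at or below the E value cut-off of 1E−12. It assumes the alignments were exported in the standard 12-column tabular format (BLAST+ `-outfmt 6`), which the text does not state; the function name is hypothetical and the cut-off is treated as inclusive.

```python
def best_hits(blast_tab_lines, max_evalue=1e-12):
    """Return {query: (subject, evalue, bitscore)} for the top hit per query.

    Assumes the default tabular column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # weaker than the cut-off
        # Keep only the highest bit score seen so far for this query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Contigs absent from the returned dictionary would correspond to those without a qualifying A. thaliana match, that is, the candidates assumed to be intergenic.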
Including the SNPs designed from the 3′ cDNA analyses, a total of 768 SNPs were submitted for Illumina GoldenGate array design (Supplementary Table 3).
Genetic linkage map for Camelina sativa
A recombinant inbred (RI) population derived from a cross between Lindo and Licalla was used to develop a genetic map for C. sativa. The newly developed GoldenGate array was hybridized with DNA from the two parental lines and 180 RI lines. Eighteen of the probes on the array gave poor signals with normalized R values <0.2 for each sample. Two hundred and seven probes on the array showed no polymorphism between the parental lines. The majority of these monomorphic loci (189) were designed from the 3′ cDNA data, and only 18 of these loci had been designed to specifically target SNP variation between Lindo and Licalla. The cluster distribution for the remaining probes on the array varied in pattern and ease of scoring (Fig. 1). The majority of the SNP assays showed a pattern that was distinguished by three clearly defined clusters representing the three genotypes in the mapping population (Fig. 1a). In some instances, although three clusters were observed, one allele was far less tightly clustered than its counterpart, suggesting that additional SNP variation in the flanking DNA could perhaps be impacting the efficacy of the hybridization (Fig. 1b). In rare cases both alleles showed loose clustering, indicating poor hybridization. Such anomalies could in extreme cases suggest additional clusters; however, mapping of the loci showed that normal segregation was occurring. Differences in separation of the clusters were also observed and in some cases the variance in normalized theta value between the two alleles was extremely small, requiring manual cluster calling in the GenomeStudio software (Fig. 1c). A very small subset of SNP loci (7) appeared to be dominant in nature, with only one of the alleles showing significant fluorescence levels (normalized R values). For such loci, determination of heterozygous individuals was not possible (Fig. 1d).
After manual editing of the GenomeStudio cluster file it was possible to score and map 533 SNP loci. These were arranged over twenty linkage groups, representing the haploid chromosome number of C. sativa (Table 1; Fig. 2). Forty-six EST-SSR loci that had previously been mapped on 90 lines of the same population were added to give a final genetic map composed of 579 loci distributed over 1,808.7 cM. At least four significant gaps (>20 cM) were observed in the linkage map (Cas 4, 15, 17 and 18). These regions were not associated with the four regions where segregation ratios for multiple linked loci were significantly (p < 0.01) imbalanced (Cas 1, 6, 17 and 20).
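A segregation imbalance of the kind reported above is commonly assessed with a chi-square goodness-of-fit test against the 1:1 ratio of parental homozygotes expected in an RI population. The text does not name the test that was used, so the following is a sketch under that assumption; residual heterozygotes and missing calls are simply ignored, and the function name is hypothetical.

```python
import math


def segregation_chi2(n_parent_a, n_parent_b):
    """Chi-square test (1 d.f.) of a 1:1 segregation ratio.

    Returns (chi2, p_value). For one degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    total = n_parent_a + n_parent_b
    if total == 0:
        raise ValueError("no genotype calls for this locus")
    expected = total / 2
    chi2 = ((n_parent_a - expected) ** 2
            + (n_parent_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def is_distorted(n_parent_a, n_parent_b, alpha=0.01):
    # alpha = 0.01 follows the significance threshold quoted in the text.
    return segregation_chi2(n_parent_a, n_parent_b)[1] < alpha
```

For example, with 180 scored lines a locus split 120:60 gives a chi-square value of 20 and would be flagged, whereas a 100:80 split would not.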
Table 1 Genetic linkage map of Camelina sativa
Anchoring to the Camelina sativa genome and delineation of the Brassicaceae ancestral blocks
The 100 bp sequences flanking the SNP loci were aligned to the C. sativa genome sequence using BLAT (Kent 2002) with default parameters. In addition, sequences of the contigs from which each of the mapped SNP markers was derived were aligned to both the C. sativa and the A. thaliana genome using BLASTN (1E−12) (Supplementary Table 4). Similarly, the EST sequences used to design the SSR primer sequences were compared to the two genomes. There was a strong correlation between the genetic and physical maps of C. sativa (Supplementary Figure 1); however, in regions of reduced recombination there were minor discrepancies between the marker order of the genetic map and the genome sequence. On average the markers were distributed at 1 locus per 1 Mb of genome sequence. The regions with increased recombination or the larger gaps in the map corresponded to a paucity of loci selected for the particular genomic segment, with physical distances ranging from 3.7 to 6.2 Mb between the loci. Some of the centromeric regions also displayed a low density of SNP loci, which was not reflected in the genetic distance (Supplementary Table 4).
Comparative alignment of 413 loci with homology either to A. thaliana genes or adjacent genome sequence identified the Brassicaceae ancestral blocks (A–X) defined by Schranz et al. (2006) (Supplementary Table 5; Fig. 2). These alignments were subsequently confirmed through the comprehensive analyses offered by alignment of the C. sativa genome sequence with the A. thaliana genome in Kagale et al. (2014). The SNP loci allow delineation of shared ancestry across the Brassicaceae, which assists with the identification of candidate genes underlying genomic regions of interest, in particular providing access to the extensive annotation of the A. thaliana genome.
Genetic variation among C. sativa accessions
The newly developed C. sativa SNP array was used to genotype 178 C. sativa accessions; three lines had >20 % missing values and were excluded from further analyses. The cluster patterns observed for the SNP loci were similar to those observed for the mapping population, although further clusters were observed in some instances, presumably due to the presence of additional SNP variation in the DNA flanking the SNP position found among the diversity collection. Based on automated calling 232 of the 768 SNPs were uninformative, and 11 had >20 % missing genotype values; thus 493 SNP loci were used for further analyses. Basic information including PIC value (ranging from 0.006 to 0.375), gene diversity (0.006–0.5) and major allele frequency (0.5–0.99) for each SNP locus is provided in Supplementary Table 6. The gene diversity for the entire collection was 0.26, which is lower than a similar analysis of elite maize germplasm (Van Inghelandt et al. 2010). A recent study by Delourme et al. (2013), which assessed SNP variation among germplasm of the related allotetraploid Brassica napus, presented PIC values as a measure of gene diversity for each SNP locus. In comparing mean PIC values between the species, invariably lower PIC values were seen for C. sativa, where values for each linkage group ranged from 0.153 to 0.286 in C. sativa and from 0.292 to 0.330 in B. napus (Supplementary Table 7). A very high inbreeding coefficient (FIS value) of 0.96 was calculated from the C. sativa lines, which can be explained by the inbreeding nature of the species, whereas the overall fixation index (FST value) of 0.276, which provides a measure of population differentiation, indicates a similar level of differentiation among sub-populations as that found among winter and spring types of B. napus (Delourme et al. 2013).
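The per-locus statistics quoted above can be reproduced from allele frequencies alone. The sketch below uses the standard definitions of gene diversity (expected heterozygosity) and polymorphism information content; the text does not state which software or formulae were applied, so treating these definitions as the ones used is an assumption, and the function names are hypothetical.

```python
def gene_diversity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content:
    gene diversity minus the sum over allele pairs of 2 * p_i^2 * p_j^2."""
    n = len(freqs)
    cross = sum(2.0 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(n) for j in range(i + 1, n))
    return gene_diversity(freqs) - cross


# A biallelic SNP with both alleles at a frequency of 0.5 attains the maxima.
print(gene_diversity([0.5, 0.5]))  # 0.5
print(pic([0.5, 0.5]))             # 0.375
```

For a biallelic marker these maxima of 0.5 and 0.375 coincide with the upper limits of the ranges reported for the C. sativa SNP loci.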
Population structure analysis was completed using STRUCTURE (Pritchard et al. 2000) for 175 accessions. Since the estimated log-likelihood values appeared to be an increasing function of K for all examined values of K, inferring the exact value of K was not straightforward (Supplementary Figure 2a). Using the program Structure Harvester (Evanno et al. 2005), the maximal ∆K indicated that at a K value of 2 the accessions were clustered into two sub-populations (Supplementary Figure 2b). Using a minimum value of 70 % ancestry, 152 accessions were assigned to one of the two sub-populations: 61 accessions to Population I and 91 accessions to Population II (Fig. 3a). The remaining 23 accessions appeared to be admixtures or have ancestry from more than one population, with qK values <70 % for both populations (Supplementary Table 1). The population clusters did not group according to the available geographical information. A similar pattern was observed for the relationship as determined by the unweighted Neighbour-Joining method, which clustered accessions into two major groups. In Fig. 3b, the red and green branches on the tree represent Populations I and II, respectively, as determined by STRUCTURE; all accessions defined as admixtures are shown in black. Similar to the STRUCTURE analysis, the resultant phylogenetic tree did not cluster the accessions based on geographical origin, with the lines derived from each country being evenly distributed between the populations.
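The 70 % ancestry rule used to assign accessions can be written as a small function. This is a minimal sketch: the function name, the label strings and the representation of the membership coefficients as a list of fractions are assumptions for illustration, and the threshold is treated as inclusive, in line with the "minimum value of 70 %" wording.

```python
def assign_population(q_values, labels=("Population I", "Population II"),
                      threshold=0.70):
    """Assign an accession from its STRUCTURE membership coefficients.

    q_values holds one ancestry fraction per sub-population (summing to ~1).
    The accession is assigned to the sub-population with the largest
    coefficient if that coefficient reaches the threshold; otherwise it is
    treated as an admixture.
    """
    if len(q_values) != len(labels):
        raise ValueError("one label is required per membership coefficient")
    top = max(range(len(q_values)), key=lambda k: q_values[k])
    return labels[top] if q_values[top] >= threshold else "admixed"
```

With K = 2, an accession with coefficients of 0.85 and 0.15 would be placed in Population I, whereas one with 0.60 and 0.40 would be classed as admixed.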