An updated genetic linkage map for Jatropha curcas
We recently published the first intraspecies linkage map for J. curcas. The combined map, which was based on four F2 mapping populations, contained 502 markers spanning a total distance of 717 cM. To improve the density of individual maps and add candidate genes that may contribute to specific traits, we developed a number of additional SSR markers, which are detailed in Additional file 1: Table S1. The revised genetic linkage map, which now contains 587 markers spanning a total distance of 673 cM, is shown in Figs. 1 and 2. A summary of the markers, marker densities and genetic distances for each of the linkage groups is shown in Table 1. The increase in the number of markers, together with a small reduction in the overall calculated map length, has improved the mean marker density by a modest 0.3 cM: our latest map has a density of 1.2 cM per marker or 1.5 cM per unique locus, compared with 1.5 and 1.8 cM, respectively, in our previous map.
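As a rough check on the densities quoted above, mean marker spacing can be estimated as the map length divided by either the number of markers or the number of marker intervals. The sketch below is illustrative only: the function names are hypothetical, and the figure of 11 linkage groups is an assumption taken from the highest linkage group number referred to in the text. The per-interval estimate comes closest to the reported 1.5 and 1.2 cM per marker, but the published values may have been derived differently (for example from the per-group values in Table 1).

```python
def spacing_per_marker(map_length_cm, n_markers):
    """Mean spacing as total map length divided by the number of markers."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return map_length_cm / n_markers


def spacing_per_interval(map_length_cm, n_markers, n_groups):
    """Mean spacing as total map length divided by the number of intervals.

    Each linkage group with k markers has k - 1 intervals, so the whole map
    has n_markers - n_groups intervals.
    """
    n_intervals = n_markers - n_groups
    if n_intervals <= 0:
        raise ValueError("need more markers than linkage groups")
    return map_length_cm / n_intervals


N_GROUPS = 11  # assumption: number of linkage groups, not stated in this section

# (map length in cM, marker count) as reported in the text
maps = {"previous": (717, 502), "revised": (673, 587)}

for name, (length_cm, n_markers) in maps.items():
    per_marker = spacing_per_marker(length_cm, n_markers)
    per_interval = spacing_per_interval(length_cm, n_markers, N_GROUPS)
    print(f"{name}: {per_marker:.2f} cM/marker, {per_interval:.2f} cM/interval")
```

Either estimator gives an improvement of roughly 0.3 cM between the two maps, in line with the reported change.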
Previously, using the draft genome assembly released by the Kazusa DNA Research Institute [19, 20], we were able to physically map 17 Mbp (of 297 Mbp) of genome sequence against our genetic linkage map. Within this 17 Mbp were 3077 of the 39,277 predicted gene models. This represents 5.7 % of the genome and 7.8 % of the predicted genes for this version of the genome assembly. Mapping a greater proportion of the genome would make it possible to position more of the candidate genes likely to correspond to particular traits. Recently, the Chinese Academy of Sciences (CAS) has also released a J. curcas genome. This genome was sequenced to a depth of 189-fold and contains scaffolds with an N50 of 746,835 bp, compared with an N50 of 15,950 bp for version 4.5 of the Kazusa DNA Research Institute assembly. This improved genome assembly provided us with the opportunity to physically map a substantial amount of the genome against our genetic linkage map. After conducting BlastN searches of our molecular markers against this new version of the genome, we were able to map a total of 162 Mbp of the predicted 318 Mbp (i.e. 51 %) of the CAS Jatropha genome assembly (Table 2 and Additional file 2: Tables S2–S13). This is similar to the value obtained by Wu et al. using our previous generation of the map. In a few instances we observed that some scaffolds mapped to more than one linkage group. This may be due to misassemblies in the published genome sequence or to segmental chromosome duplications. In general, however, our mapping order was highly consistent with this draft genome sequence. The scaffolds that we were able to map contained 17,452 of the 27,172 predicted protein-encoding sequences (64 %) within the CAS Jatropha genome (Table 1 and Additional file 2: Table S2).
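For readers unfamiliar with the N50 statistic used to compare the two assemblies, the following minimal sketch shows how it is conventionally computed and reproduces the mapped fractions quoted above. The scaffold lengths in the example are made up for illustration, and the function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Toy scaffold lengths (hypothetical, arbitrary units)
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print("toy N50:", n50(toy_scaffolds))

# Fractions reported in the text
print(f"Kazusa sequence mapped: {percent(17, 297):.1f} %")
print(f"Kazusa gene models mapped: {percent(3077, 39277):.1f} %")
print(f"CAS sequence mapped: {percent(162, 318):.0f} %")
print(f"CAS protein-coding genes mapped: {percent(17452, 27172):.0f} %")
```

A higher N50 means that markers are more likely to land on long scaffolds, which is why a similar number of markers anchors a much larger share of the CAS assembly.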
Positioning markers for storage lipid biosynthesis candidate genes onto the linkage map
To locate the positions of lipid biosynthesis genes on our linkage map, we first identified the orthologues of Arabidopsis genes known or suspected to be involved in de novo plastidial lipid biosynthesis and in the pathway for the conversion of acyl-CoA into triglycerides, the principal storage lipid in seeds. A diagrammatic representation of these pathways is shown in Fig. 3. In addition to enzymes, we included a number of regulatory proteins. The candidate gene list was compiled from the Arabidopsis Acyl-Lipid Metabolism Website. The genes were identified using BlastP searches of the peptide sequence data for J. curcas contained on GenBank. In addition to a number of markers that we developed in close proximity to these candidate genes, we also used the combined genetic and physical map shown in Additional file 2, and the genetic or physical map produced for the interspecific crosses [18, 22], and thus were able to identify the positions of almost all of the lipid biosynthesis candidate genes. These genes could potentially be utilized for molecular breeding by the targeted development of additional SNP or SSR markers in their flanking regions (Additional file 3: Table S14). The few lipid biosynthesis genes that we were unable to map included one isoform of the plastidial enoyl-acyl carrier protein reductase (step 7 in Fig. 3), which resides on a scaffold we could not map, and a glycerol-3-phosphate acyltransferase isoform and a Wrinkled1 transcription factor isoform, which both mapped to part of a (possibly misassembled) scaffold that may be part of linkage group 3 or 8.
Both vegetative traits and seed weight contribute to the oil yield in mapping population G51 × CV
The F2 mapping population G51 × CV, which has one “wild” partially heterozygous parent (G51, heterozygous at 46 % of markers) and a fully homozygous “Cape Verde”-like parent, was created primarily for the identification of seed oil content QTL, based on contrasting phenotypes we observed for the parents of these plants (36.9 % oil in G51, 26.0 % oil in CV). However, we also collected data for various other traits in the field including plant height, stem diameter, canopy area, number of branches and number of seeds produced (see “Methods”). Normal, or near-normal distributions were observed for the majority of these traits (Additional file 4: Figure S1). To determine the relationship between these variables and the final calculated oil yields per plant, Pearson correlation coefficients were calculated (Table 2). For the final calculated oil yields, almost all of the traits produced significant positive correlations. Within the vegetative traits for example, the number of branches at 763 days (R = 0.474) and canopy area at 763 days (R = 0.431) produced the highest correlations for year 3 calculated oil yields. These correlations were very similar to those observed for total seeds per plant in year 3 (R = 0.457 and 0.446), suggesting that the yield correlations are most closely linked to a higher number of seeds produced in plants showing stronger vegetative growth. Unsurprisingly, the total number of seeds produced per plant was the most significant contributor to the final seed yield (R = 0.972 and R = 0.948 for years 2 and 3), indicating that for mapping population G51 × CV, the number of seeds per plant is more important than the amount of oil per seed. Nonetheless, 100-seed weights also produced significant correlations with the calculated oil yields (R = 0.205 to R = 0.489), as did seed oil content in the first harvest for year 3 (R = 0.402). 
Interestingly, for the year 3 data, the total number of seeds per plant also produced a weak but positive correlation with 100-seed weights, indicating that the plants producing more seed do not appear to allocate fewer resources to each seed. Similarly, oil content and seed number either had no correlation or a weak positive correlation (R = 0.190 for total seeds in year 3 and oil content in year 3, harvest 1), showing that producing more seeds does not reduce the amount of oil stored in the seed.
Overall, the data for this mapping population indicate that the final oil yield is a composite trait, and that the vigour of the plants contributes most significantly to oil yield by producing plants with increased number of seeds. However, 100-seed weights and oil content can also make significant contributions to final oil yield. This suggests that there should be significant potential for developing improved varieties of J. curcas through the pyramiding of desirable loci.
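The Pearson correlation coefficients discussed in this section can be computed as in the minimal sketch below. The per-plant values are hypothetical placeholders, not data from this study, and the function name is illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-plant measurements (illustration only)
seeds_per_plant = [120, 340, 210, 560, 90, 430]
oil_yield_g = [14.0, 41.5, 23.8, 66.0, 10.2, 52.1]

r = pearson_r(seeds_per_plant, oil_yield_g)
print(f"R = {r:.3f}")
```

Because calculated oil yield is itself a function of seed number, seed weight and oil content, a very high correlation with seed number is expected whenever seed number varies much more between plants than the other two components.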
Identification of QTL associated with vegetative growth characteristics in mapping population G51 × CV
After performing QTL analyses on the data collected from mapping population G51 × CV, we detected a number of QTL underlying vegetative traits (Table 3; Fig. 4; Additional file 5: Figure S2a–e and Additional file 6: Figure S3a–h). QTL for plant height were observed on both linkage group 4 and linkage group 8 (Table 3). The QTL on linkage group 4 was observed at both 567 and 763 days after transplantation from the nursery, with a phenotypic variance explained (PVE) of 9.2 and 7.0 %, respectively. The height QTL on linkage group 8 was only observed at 763 days, and also accounted for 7.0 % PVE. Both of these QTL were minor and only detected using a significance threshold of p = 0.10. The small effects of these height QTL are most likely related to the high level of complexity of this trait. Interestingly, ANOVA of the phenotypes at the height QTL locus on linkage group 4 indicated that this QTL was overdominant, i.e. the heterozygous phenotype was greater than either of the homozygous phenotypes. At the same position on linkage group 4 as the height QTL, we also observed an overdominant QTL corresponding to stem diameter. This accounted for 14.9 and 8.9 % PVE at 567 and 763 days, respectively. Further stem diameter QTL were detected on linkage group 5 at 567 days and on linkage group 7 at 763 days. The QTL on linkage group 7 was the larger of these, accounting for 10.2 % PVE. A single dominant QTL for branching was observed on linkage group 1, for which the CV allele had a positive effect. We were unable to detect significant QTL for canopy area, perhaps due to the high level of complexity of the trait. Given the significance of the correlations between the plant vegetative growth traits and the calculated seed and oil yields obtained from the Pearson correlation analysis, the QTL on linkage group 4 for height and stem diameter would be useful targets in a plant breeding programme.
The close proximity of these QTL and their similar overdominance indicate that this may be a single locus with a pleiotropic effect. However, finer mapping would be required to determine whether these are the same or separate loci. Use of overdominant QTL in plant breeding would require the production of F1 hybrid plants. Because J. curcas is monoecious and self-fertile, efficient production of F1 hybrid seed would require an alternative strategy such as a cytoplasmic male sterility and restorer system. Alternatively, F1 plants could be multiplied by vegetative propagation (i.e. from cuttings) or by micropropagation.
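The distinction between additive, dominant and overdominant QTL can be illustrated from the mean phenotypes of the three genotype classes at a marker. The sketch below is not the ANOVA procedure used in this study: it classifies gene action from the ratio of the dominance deviation (d) to the additive effect (a), using cut-offs of 0.2, 0.8 and 1.2 that are a commonly used convention and are assumed here. The function name and the plant heights are hypothetical.

```python
def gene_action(mean_aa, mean_ab, mean_bb):
    """Classify gene action at a locus from the three genotype class means.

    a = half the difference between the two homozygote means
    d = deviation of the heterozygote mean from the homozygote midpoint
    The |d/a| cut-offs are a convention assumed for this sketch.
    """
    a = (mean_bb - mean_aa) / 2.0
    d = mean_ab - (mean_aa + mean_bb) / 2.0
    if a == 0:
        # Homozygotes are identical: any heterozygote deviation is overdominance
        return "overdominant" if d != 0 else "no effect"
    ratio = abs(d / a)
    if ratio < 0.2:
        return "additive"
    if ratio < 0.8:
        return "partially dominant"
    if ratio <= 1.2:
        return "dominant"
    return "overdominant"


# Hypothetical mean plant heights (cm) for the two homozygotes and the heterozygote
print(gene_action(mean_aa=150.0, mean_ab=172.0, mean_bb=155.0))
```

With these placeholder values the heterozygote exceeds both homozygotes, which is the pattern described above for the height and stem diameter QTL on linkage group 4.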
Identification of QTL for seed number per plant, seed weight and oil content in mapping population G51 × CV
For the second harvest year after transplantation, although we observed a large variation in the number of seeds produced per plant (Additional file 4: Figure S1i), we did not observe any QTL associated with this trait. For the third harvest year, a single QTL was observed on linkage group 10, which accounted for an estimated 11.7 % of the phenotypic variance (Table 3; Fig. 4). This QTL was dominant, with the CV allele being beneficial compared to the G51 allele. Interestingly, an oil content QTL was also observed at a similar position on linkage group 10 for the second harvest year and the second harvest of year 3, accounting for between 11.8 and 12.1 % PVE. This QTL was dominant, with the beneficial allele being from the G51 parent (Additional file 6: Figures S3j, m). Although this may suggest that there is a potential reduction in oil content in response to a higher level of seed production, it should be noted that no correlation was observed for seed number and oil content in the second harvest year, and the correlation was weak but positive in the third harvest year (Table 2). A further QTL for oil content was observed in the second harvest year on linkage group 4. This locus was dominant and accounted for 13.3 % PVE. The beneficial allele was from the G51 parent. A QTL at a similar position was also identified for the first (but not second) harvest of year 3 (PVE = 10.8 %).
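For orientation, the PVE of a single QTL is related to its LOD score and the population size. The sketch below assumes a single-QTL model with normally distributed residuals, for which PVE = 1 - 10^(-2 LOD / n); the software and model actually used here are described in the Methods and may compute PVE differently. The LOD score and population size in the example are hypothetical.

```python
import math


def pve_from_lod(lod, n):
    """Proportion of phenotypic variance explained under a single-QTL model.

    Follows from LOD = (n / 2) * log10(RSS_null / RSS_qtl).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - 10.0 ** (-2.0 * lod / n)


def lod_from_pve(pve, n):
    """Inverse relation: LOD score corresponding to a given PVE and sample size."""
    if not 0.0 <= pve < 1.0:
        raise ValueError("pve must be in [0, 1)")
    return -(n / 2.0) * math.log10(1.0 - pve)


# Hypothetical example: LOD 5 in a population of 100 plants
print(f"PVE = {100 * pve_from_lod(5.0, 100):.1f} %")
```

The relation makes clear why QTL of around 10 % PVE, such as those reported here, sit close to the detection threshold in populations of modest size.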
QTL contributing to fatty acid composition in mapping population G51 × CV
In J. curcas, the two main fatty acids present in the storage oil are oleate and linoleate. For biodiesel production, monounsaturated fatty acids such as oleate are regarded as desirable, as they have greater oxidative stability than polyunsaturated fatty acids and do not have the poor cold-flow and cloud-point characteristics associated with saturated fatty acids [1, 25, 26]. It has been shown previously that plant growth temperature is likely to play a significant role in the proportion of these two fatty acids. Within this mapping population we also found a strong negative correlation between the percentages of oleate (42.6–50.5 %) and linoleate (26.6–35.3 %) within the seeds, suggesting that variation in these two fatty acids is both genetically and environmentally determined (Table 4 and Additional file 6: Figure S1). A number of QTL were observed for these two fatty acids (Table 5). On linkage group 6, QTL were observed at 2 cM (10.8 % PVE) and 3 cM (11.9 % PVE) for oleate and linoleate content, respectively. Given the strong negative correlation between these two fatty acids, it is probable that the same underlying gene is responsible. Two additional QTL for linoleate content were observed on linkage groups 4 (at 4 cM) and 8 (at 11.5 cM), with PVE of 11.1 and 9.9 %, respectively.
The two other main fatty acids present in the seeds of J. curcas are palmitate (10.7–13.9 %) and stearate (6.1–9.2 %). Although the variations in stearate content were minor, four QTL were detected for stearate (Table 5), accounting in total for 45.7 % PVE. One of these mapped to a similar position as the linoleate QTL on linkage group 8. Three QTL were observed for palmitate content, accounting for 28.3 % PVE in total (Table 5).
Identification of QTL for seed number per plant, seed weight and oil content in mapping population G33 × G43
Mapping population G33 × G43 was originally developed for the purpose of identifying a locus responsible for the biosynthesis of phorbol esters, the principal toxin in J. curcas seeds. However, we were also able to identify a number of QTL for seed traits using this population (Table 6; Additional file 7: Figure S4, Additional file 8: Figure S5 and Additional file 9: Figure S6). Pearson correlation analysis of the trait data (Table 7) revealed that for all 3 years, the calculated oil yields were mainly dependent on the number of seeds produced per plant (R ≥ 0.98 for all 3 years). Weak but significant correlations were observed between oil content and oil yields in years 1 and 3 (R = 0.333 and 0.123, respectively), but not in year 2. Interestingly, weak but significant correlations between 100-seed weight and oil yield were observed for all 3 years; these were positive in year 1 (R = 0.203) and year 2 (R = 0.316) but negative in year 3 (R = −0.142). Similarly, a negative correlation was observed between the 100-seed weight and the number of seeds produced per plant during year 3 (R = −0.273). This may indicate that in the third year for this mapping population, source strength rather than sink capacity is important (i.e. as the plants produce more seeds, they are able to allocate fewer resources per seed), or that there is greater competition between individual plants of the mapping population for light or nutrients as the size of the plants increases.
For the first year we did not detect any QTL relating to the number of seeds per plant. For the number of seeds produced per plant during the second year, a weak QTL was observed (p < 0.10) when non-parametric analysis was performed. It should be noted, however, that the average number of seeds harvested per plant declined between years 1 and 2, due to adverse weather conditions at the field site of the G33 × G43 mapping population (see “Methods” and Additional file 7: Figures S4a, f). In year 3, two QTL were found on linkage groups 4 and 7, accounting for 11.3 % PVE. The largest QTL detected for this population were for the 100-seed weights. In the first harvest year, three QTL were detected on linkage groups 2, 4 and 11, which accounted for 24.5 % PVE. In the second harvest year, three QTL at similar positions were also identified, alongside an additional QTL on linkage group 10. In total, these accounted for 42.9 % PVE. In the third year, six QTL for 100-seed weight were observed, although the total PVE declined to 29.9 %. The two additional QTL were on linkage group 9 and the upper arm of linkage group 11. The QTL on linkage group 4 and in the middle of linkage group 11 were additive, whereas those on linkage groups 2, 9 and 10 were dominant. The QTL on the upper arm of linkage group 11 (year 3 only) was recessive. With the exception of the QTL on linkage group 10, the allele from the G33 parent was beneficial in each case. Based on the confidence intervals, it does not appear that the QTL on linkage group 4 of this mapping population is co-located with the 100-seed weight QTL we observed in mapping population G51 × CV. For the second harvest year, four QTL accounting for a total of 25.6 % PVE were detected for seed oil content, on linkage groups 4, 5, 6 and 10. In the subsequent year, we only observed the QTL on linkage groups 5 and 6, which had a total PVE of 16.4 %.
The beneficial allele for the QTL on linkage groups 4 and 5 was from parent G33, whereas the beneficial allele for the other two QTL (linkage groups 6 and 10) was from parent G43. Two of these QTL, on linkage groups 4 and 10, may be related to the oil QTL observed in mapping population G51 × CV, though because the QTL intervals in that population are relatively large compared to those observed in the G33 × G43 population, this would require further experimental confirmation. Interestingly, the oil content QTL on linkage group 10 also maps to a similar position as the seed weight QTL on this linkage group and, in both instances, the G43 parent contributed the beneficial allele.
Comparison of QTL positions with mapped candidate genes for lipid biosynthesis
Where the positions of candidate genes are known, it is possible to compare them with QTL positions to determine whether a gene may potentially underlie a specific QTL. This approach is most effective when the confidence intervals for the QTL are narrow. Based on our successful mapping of the majority of the candidate genes we identified as involved in lipid biosynthesis (Fig. 3 and Additional file 3: Table S14), we compared the positions of these genes and the QTL. In mapping population G51 × CV the majority of the QTL had very large 95 % confidence intervals, but the main QTL for oleate and linoleate appeared to be located between 2.0 and 7.0 cM on linkage group 6 (Table 5).
A likely candidate gene for this QTL would be oleate desaturase (FAD2), an enzyme which converts an oleate group at the sn2-position of phospholipids to linoleate (Fig. 3, step 19). In J. curcas there are two FAD2 genes, both of which are expressed within developing seeds. We mapped these to linkage groups 1 and 6 (Additional file 3: Table S3). The Bayes 95 % confidence intervals for the QTL would indicate that it is unlikely that the FAD2 on linkage group 6 could be the locus underlying the main QTL for oleate. However, the 95 % confidence intervals indicated that this QTL mapped between two markers (SNP12983 and 1406628|12346310) which both resided on a single 3.37 Mbp scaffold (KK915213.1) of the J. curcas genome sequence released by the Chinese Academy of Sciences (Additional file 2: Table S8). This scaffold contains 560 predicted gene sequences, of which 134 are located within the 726 kb of sequence between these two markers. Further analysis of polymorphisms in this region should provide more insight into the genetic basis of the observed variation in oleate and linoleate content. The strongest QTL for stearate content, on linkage group 7, mapped in close proximity to the genes for both acyl-ACP thioesterase (Fig. 3, step 12) and an acyl-CoA synthetase. The acyl-ACP thioesterase gene on linkage group 7 encodes the FatA type of enzyme (Additional file 2: Table S14), which typically displays a preference for oleoyl-ACP, whereas the FatB type typically shows broader specificity, including activity with saturated acyl-ACPs. The long-chain acyl-CoA synthetases involved in the export and activation of fatty acids from the plastids also show broad specificity.
Although the colocalization of these two genes with the stearate QTL is interesting from a biological perspective, given the relatively minor importance and the small amount of absolute variation in stearate content, we do not think this QTL warrants further investigation from a plant breeding perspective.
In the G33 × G43 mapping population, the QTL with the smallest interval was for oil content in the second harvest year. The Bayes 95 % confidence interval for this QTL indicated that it resided within a 5 cM interval on linkage group 10, between markers Jcuint152 and 1403415|12338032 (Additional file 2: Table S12). Both of these markers reside on a single 3.63 Mbp scaffold (KK914240.1) which contains 394 genes. It should be noted, however, that in comparison to the composite interval map (Fig. 2), 5 cM of the upper arm of the linkage group for mapping population G33 × G43 was not mapped and the QTL may have resided within this region. Interestingly, however, one of the candidate gene markers that mapped to scaffold KK914240.1 was for the ABA Insensitive (ABI) 4 gene. The ABI gene family includes abscisic acid (ABA)-responsive transcription factors which have roles in the regulation of a number of biochemical and developmental processes. In Arabidopsis, the ABI4 protein is known to be a regulator of DGAT1 expression in seedlings. The role of ABI4 in oil accumulation during seed development is less clear, and ABI3 seems to play a more dominant role. The role of ABI genes in Jatropha has not been studied extensively, but ABI4 expression has been shown to correlate with the stages of seed development in which oil accumulation occurs. The oil content QTL on linkage group 5, which appeared in both years 2 and 3, produced a relatively short confidence interval of 11 cM (Table 6). Although this QTL interval could not be located to a single scaffold of the genome, analysis of the combined genetic/physical map (Additional file 2: Table S3) and the population-specific map for G33 × G43 (Fig. 5) revealed that 9 cM of this region corresponded to a single scaffold (GenBank KK914632.1, containing a predicted 133 genes). A pair of tandemly duplicated phosphatidate phosphatase (PAP) genes is located on this scaffold (Fig. 3, step 17 and Additional file 3: Table S14).
The PAP enzyme is part of the ER pathway and converts phosphatidic acid into diacylglycerol. In Arabidopsis, a PAP gene was also shown to underlie a QTL for oil content in a mapping population segregating for this trait. These two PAP genes in J. curcas therefore represent strong candidates for the causal gene underlying the oil content QTL on linkage group 5. One further oil content QTL on linkage group 4 also had a relatively short confidence interval of 10 cM. Comparison of the marker positions (Fig. 5) with the mapped scaffolds indicated that this QTL is likely to reside on scaffold KK914227, which is 2.74 Mbp and contains 274 predicted genes (Additional file 2: Table S6). Included within these genes was one of the mapped lipid biosynthesis genes, malonyl-CoA:ACP malonyl transferase (Fig. 3 and Additional file 3: Table S6). Our future work will involve characterization of these genes in the different parental populations, including upstream regions and gene expression levels, to determine whether there is any variation between the two parental lines.
Future approaches to QTL mapping in J. curcas
In addition to being able to identify a number of QTL, we were in some cases able to identify specific DNA scaffolds from the CAS Jatropha genome assembly underlying these QTL and even identify candidate genes that may be responsible for them. Nonetheless, in many instances, the QTL confidence intervals were too large to identify specific genome regions. The mapping resolution obtained by the family-based mapping approach is often limited, as QTL intervals are usually dependent on population size, QTL effect and marker density. Increasing the number of meioses within a mapping population by generating advanced-generation crosses can be used for finer mapping of QTL, but this approach is impractical with perennial plants because of the length of time required to produce and collect phenotypic data from each generation. An alternative approach that improves the ability to identify loci controlling traits is a genome-wide association study (GWAS). This approach permits a higher resolution than family-based mapping by exploiting historical recombination events and therefore does not rely on the creation of experimental populations. The use of germplasm collections rather than biparental crosses also permits the identification of beneficial alleles from a wider genetic background. We believe that the advances in combined genetic and physical mapping reported in the current study and elsewhere, together with improved knowledge of the availability of genetically diverse germplasm for this species within Mesoamerica [10, 12], make GWAS a feasible next step. In addition, it should also be possible to further improve and integrate the genetic and physical maps of J. curcas by developing molecular markers for unmapped scaffolds using an approach similar to the one we used previously to fine-map the phorbol ester biosynthesis locus in J. curcas.
These approaches should lead to the identification and characterization of a greater number of QTL from a wider genetic pool.