General statistics of DArTseq analysis
We obtained a total of 47,994 dominant silico-DArT (SD) markers scored as presence or absence and 20,046 co-dominant SNP markers. The genetic positions of 8822 SD and 6794 SNP markers were determined on the 21 wheat chromosomes.
Among the SD markers, the frequency of genotype A (absence or presence of the SD reference sequence) was ca. 50% in most of the individuals (Fig. S1a), but was 3.8% in line MSD291. Subsequent analysis revealed that more than 35% of the MSD291 genotypes were missing, whereas other individuals had only ca. 10% missing genotypes (Fig. S1b). Among the SNP markers, the frequency of homozygous SNPs peaked at about 60% (Fig. S2a). The frequency of missing genotypes varied greatly (up to 59.5%), with a peak at about 7% (Fig. S2b). Missing genotype frequency in LDN was 49.8%, which was expected because LDN has no D genome. Missing genotype frequency was higher in MSD242, MSD331, MSD257 and MSD291 than in LDN. Unlike MSD291, which already showed a large number of missing genotypes among the SD markers, the other three individuals had an average missing genotype ratio among the SD markers. Because the data of MSD291 were highly distorted, this line was discarded from further analyses.
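The per-individual missing-genotype screen described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the encoding of calls (1, 0 and None for missing), the function names and the 0.35 cut-off are assumptions chosen only to mirror the figures quoted in the text.

```python
from typing import Dict, List, Optional

# Assumed encoding: 1 = presence, 0 = absence, None = missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Return the fraction of markers without a genotype call."""
    if not calls:
        raise ValueError("no markers supplied")
    return sum(c is None for c in calls) / len(calls)


def flag_outliers(matrix: Dict[str, List[Genotype]],
                  threshold: float) -> List[str]:
    """Return the names of individuals whose missing rate exceeds threshold."""
    return sorted(name for name, calls in matrix.items()
                  if missing_rate(calls) > threshold)


# Toy data with hypothetical line names; real input would hold
# tens of thousands of markers per individual.
toy = {
    "line_a": [1, 0, 1, None],        # 25% missing
    "line_b": [None, None, 1, 0],     # 50% missing
}
flagged = flag_outliers(toy, threshold=0.35)  # illustrative cut-off
```

In practice the cut-off would be chosen from the observed distribution of missing rates (Fig. S1b, Fig. S2b) rather than fixed in advance.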
Genome structure of primary synthetic wheat (PS) lines
To validate the accuracy of DArTseq genotyping, we graphically genotyped LDN, CS, N61 and the 47 PS lines with the 6794 SNP markers. As expected, LDN lacked the D genome and the PS lines were similar to LDN in A and B genomes, but had diverse D genomes. The band patterns of CS, N61 and LDN were distinct from each other (Fig. 1). This result indicated that DArTseq analysis was accurate.
Among the PS lines, the band patterns on the A and B genomes were almost identical to each other and to that of LDN, except that the pattern of Syn45 differed from those of LDN, CS and N61 for all three genomes (Fig. 1). This indicates that the progenitor of Syn45 was not one of these three cultivars but an unknown contaminant. Eight of the 47 PS lines had missing chromosomes, of which six were from the D genome (Fig. 1): Syn54 lacked chromosome 1D, Syn43 and Syn46 lacked chromosome 3D, Syn27 and Syn30 lacked chromosome 4D, and Syn34 lacked chromosome 7D. Using genotyping data of the D genome, we found no genetic relationship among the eight lines, which indicates that the chromosome elimination events were random (Fig. S3). Apart from the nullisomy found in several of the PS lines used to produce the MSD population, all PS lines except Syn45 had complete and pure genomes.
The pedigree of the MSD lines
One of the purposes of DArTseq genotyping was to identify the pedigree of each of the selected MSD lines so that we could estimate the genetic drift in the population caused by selection from generation to generation. Because we had no diagnostic markers for the PS lines and the N61-originated genome fragments were distributed randomly in the MSD line genome, we developed a new method that calculates the global D genome homology between each MSD individual and each PS line, after discarding markers with the same genotype as that of N61. Matched and unmatched genotypes were scored positively and negatively, respectively, with equal weight. We assumed that the PS line with the highest homology score is the progenitor of the respective MSD individual. We used 2649 SD and 2403 SNP markers from the D genome; markers with a high rate of missing genotypes (call rate below 85%) were excluded.
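The scoring scheme can be illustrated with a short sketch. It rests on several assumptions that the text does not spell out: genotypes are encoded as integers with None for missing calls, "discarding markers with the same genotype as that of N61" is read as skipping markers at which the MSD individual carries the N61 genotype, markers with any missing call are skipped, and ties are resolved by taking the first best-scoring line. The function and line names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Call = Optional[int]  # assumed encoding; None = missing call


def homology_score(msd: List[Call], ps: List[Call], n61: List[Call]) -> int:
    """Score one MSD individual against one PS line over D genome markers.

    Markers with a missing call, or at which the MSD genotype equals the
    N61 genotype (assumed interpretation), are ignored. Each remaining
    match adds 1 and each mismatch subtracts 1 (equal weights).
    """
    score = 0
    for m, p, n in zip(msd, ps, n61):
        if m is None or p is None or n is None:
            continue
        if m == n:          # not informative about the PS parent
            continue
        score += 1 if m == p else -1
    return score


def infer_progenitor(msd: List[Call],
                     ps_lines: Dict[str, List[Call]],
                     n61: List[Call]) -> Tuple[str, int]:
    """Return the PS line with the highest homology score and that score."""
    scores = {name: homology_score(msd, calls, n61)
              for name, calls in ps_lines.items()}
    best = max(scores, key=scores.get)   # first maximum wins on ties
    return best, scores[best]


# Toy example with six markers and two hypothetical PS lines.
n61 = [0, 0, 0, 0, 0, 0]
ps_lines = {"SynA": [1, 1, 0, 1, 0, 1], "SynB": [0, 1, 1, 0, 1, 0]}
msd = [1, 0, 0, 1, 0, None]
best_line, best_score = infer_progenitor(msd, ps_lines, n61)
```

In the toy data only two markers are informative, and both match SynA, so SynA is returned with a score of 2 while SynB scores -2.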
Each of the 43 PS lines used to generate the MSD population was found to be a progenitor of at least one of the 399 MSD individuals, whereas none of the four unused PS lines (Syn41, Syn43, Syn46 and Syn70) was identified as an MSD line progenitor (Fig. 2), demonstrating that our method is reliable. Lines MSD108 and MSD254 were identified as progeny of the contaminated Syn45; thus, we discarded them from further analyses. Each PS progenitor produced from 1 to 33 offspring, which is wider than the expected range (5–17; single-tailed Fisher’s exact test, P ≥ 0.05) based on an equal contribution of each progenitor (Fig. 2). This result suggests that fitness in the MSD population differed depending on the D genome origin. Four of the six D-chromosome nullisomic PS lines (Syn27, Syn30, Syn34 and Syn54; Fig. 1) had 41 MSD progeny out of the 397 in total. The proportionate representation of these lines (4/43 in PS vs 41/397 in MSD; P = 1.0, two-tailed Fisher’s exact test) and the absence of nullisomy in MSD lines indicated that the nullisomy observed in the PS lines did not reduce their contribution to the MSD population and was not retained in it.
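The comparison of 4/43 with 41/397 can be reproduced with a small, dependency-free implementation of the two-tailed Fisher's exact test. This is a sketch under assumptions: the 2 × 2 table layout (nullisomic vs other lines, in PS vs MSD) is our reading of the counts in the text, and the two-tailed P value is computed by the common convention of summing all tables no more probable than the observed one. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Assumed table: rows = PS lines, MSD lines;
# columns = derived from the four nullisomic lines, others.
p_nullisomic = fisher_exact_two_tailed(4, 43 - 4, 41, 397 - 41)
```

With these counts the observed table is the most probable one given its margins, so the sketch returns a P value of 1.0 up to rounding, in line with the value reported above.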
To determine whether the Ae. tauschii D genome affects the crossing-over rate, we analyzed the crossing-over status of MSD individuals. We converted the genotypes of the chromosome-assigned SD and SNP markers to the N61-like (N) or progenitor PS-like (S) form and visualized the chromosomes by using different colors for the N and S genotypes. Representative results for the MSD subpopulations that originated from the PS lines Syn26 and Syn32 are shown in Fig. 3. Regions of N61 origin prevailed in the genome, consistent with the expected 75% genome occupancy after one backcross event. Among the PS genome fragments retained in the MSD genome, those in the A and B genomes originated from LDN and those in the D genome originated from Ae. tauschii. The random distribution and similar sizes of PS-derived fragments on the D chromosomes of MSD individuals indicated that the Ae. tauschii genome was successfully incorporated into those of MSD lines as a result of unbiased crossing over.
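The conversion to N and S forms can be sketched as below. The encoding and the handling of edge cases are assumptions for illustration: markers with a missing call, markers at which N61 and the PS progenitor share a genotype, and markers matching neither parent are all treated as uninformative and returned as None. Function names are hypothetical.

```python
from typing import List, Optional

Call = Optional[int]  # assumed encoding; None = missing call


def to_origin(msd: List[Call], n61: List[Call],
              ps: List[Call]) -> List[Optional[str]]:
    """Label each marker as 'N' (N61-like), 'S' (PS-like) or None."""
    labels: List[Optional[str]] = []
    for m, n, p in zip(msd, n61, ps):
        if m is None or n is None or p is None or n == p:
            labels.append(None)      # missing or non-polymorphic
        elif m == n:
            labels.append("N")
        elif m == p:
            labels.append("S")
        else:
            labels.append(None)      # matches neither parent
    return labels


def n61_fraction(labels: List[Optional[str]]) -> float:
    """Return the share of informative markers carrying the N61 form."""
    informative = [x for x in labels if x is not None]
    if not informative:
        raise ValueError("no informative markers")
    return informative.count("N") / len(informative)


# Toy chromosome with five markers.
labels = to_origin(msd=[0, 1, 0, None, 1],
                   n61=[0, 0, 0, 0, 1],
                   ps=[1, 1, 0, 1, 0])
share_n = n61_fraction(labels)
```

After one backcross to N61, the value returned by `n61_fraction` would be expected to lie near 0.75 on average across individuals; the toy data are too small to show this and only demonstrate the labeling.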
In the total MSD population (excluding the three discarded lines), the overall gene diversity (Ht, Nei 1987) was 0.4508, whereas the Ht of the D genome was 0.3633, indicating less crossing over within the D genome. Sohail et al. (2012) studied the genetic diversity and population structure of 81 Ae. tauschii accessions collected from different regions of its geographical distribution and classified these lines into three lineages or groups. We examined the genetic relatedness of the PS lines using the D genome markers. We found that Syn45 was placed in a separate group, confirming the conclusion of the graphical genotyping that this line is a contaminant (Fig. 1). The remaining PS lines separated into three groups or lineages (Fig. S3). According to Sohail et al. (2012), Syn27, Syn26 and Syn48 are in lineage 3, Syn64–Syn66 are in lineage 2 and Syn62–Syn59 are in lineage 1. This result indicates that the PS lines represent genetic diversity from most of the Ae. tauschii natural habitat.
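For readers unfamiliar with the statistic, gene diversity at a locus is commonly computed as one minus the sum of squared allele frequencies, averaged over loci. The sketch below shows that basic form only; it omits the sample-size correction of the unbiased estimator, and the allele encoding and function names are assumptions for illustration rather than the software used in the study.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def locus_diversity(alleles: Sequence[Optional[Hashable]]) -> float:
    """Return 1 - sum(p_i^2) over the alleles observed at one locus.

    None entries are treated as missing and ignored.
    """
    called = [a for a in alleles if a is not None]
    if not called:
        raise ValueError("locus has no called alleles")
    n = len(called)
    return 1.0 - sum((c / n) ** 2 for c in Counter(called).values())


def gene_diversity(loci: List[Sequence[Optional[Hashable]]]) -> float:
    """Return the mean per-locus diversity across all supplied loci."""
    if not loci:
        raise ValueError("no loci supplied")
    return sum(locus_diversity(locus) for locus in loci) / len(loci)


# Toy data: one locus with two equally frequent alleles, one fixed locus.
toy_loci = [["A", "A", "B", "B"], ["A", "A", "A", None]]
ht_toy = gene_diversity(toy_loci)
```

For a biallelic marker the per-locus value cannot exceed 0.5, which is reached when both alleles are equally frequent.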
Conversion of the DArTseq marker map positions into physical positions
DArTseq data provided linkage distance information for a substantial fraction of markers with relatively short (28–69 nt) sequences. We converted marker linkage distances to physical positions in the wheat reference genome. A total of 14,355 marker sequences perfectly matched unique positions in the genome (Table 2). Among 15,616 chromosome-assigned SD and SNP markers, 4513 (2510 SD and 2003 SNP) markers were anchored, but only 63 of them (ca. 1.3%) were anchored between different homoeologous chromosomes. The remaining chromosome-matched markers were evenly distributed on the chromosomes, and the order of markers was generally similar in both linkage and physical maps (Fig. 4).
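The anchoring criterion, a perfect match at a unique genomic position, can be illustrated with a naive string search. This is a sketch of the criterion only: the text does not name the alignment tool, real genomes require an indexed aligner rather than `str.find`, and searching the reverse complement as well as the forward strand is our assumption. All names and sequences are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(haystack: str, needle: str) -> List[int]:
    """Return the 0-based start of every (possibly overlapping) match."""
    positions, start = [], haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def anchor_marker(marker: str,
                  genome: Dict[str, str]) -> Optional[Tuple[str, int]]:
    """Return (chromosome, position) if the marker matches exactly once.

    Both strands are searched; markers with no hit or several hits are
    treated as unanchored and yield None.
    """
    hits: List[Tuple[str, int]] = []
    for chrom, seq in genome.items():
        for query in {marker, revcomp(marker)}:   # set avoids double-counting palindromes
            hits.extend((chrom, pos) for pos in find_all(seq, query))
    return hits[0] if len(hits) == 1 else None


# Toy genome with two short "chromosomes".
toy_genome = {"1D": "ACGTTTGACCA", "2D": "GGGACGTTTAAA"}
unique_hit = anchor_marker("TTGACC", toy_genome)
repeated = anchor_marker("ACGTTT", toy_genome)
```

Comparing the chromosome returned here with the chromosome assigned by linkage mapping is one way to detect markers anchored to a different homoeologous chromosome.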
Table 2 Numbers of DArTseq markers on maps described in this study
GWA analysis
To evaluate the versatility of the MSD population, elucidate D genome-derived agronomic traits, and verify the suitability of the population for QTL identification, we conducted GWA analysis for glume coloration as a qualitative trait and heading date as a quantitative trait.
Glume coloration is one of the well-studied traits in wheat and other Triticeae crops. Almost all modern wheat cultivars, including LDN and N61, have colorless glumes, whereas a substantial fraction of the MSD and PS lines had black glumes, indicating that this trait is controlled by allele(s) from the D genome in an epistatic manner. Among the 397 MSD individuals evaluated, 336 had no spike pigmentation (similar to LDN and N61), whereas 61 (15% of the MSD population) had black spikes. Ancestry estimation indicated that 30 PS lines contributed to the black glume trait; the 61 MSD lines correspond to almost one-quarter of the 292 progeny of these PS lines (P = 0.61; single-tailed Fisher’s exact test). GWA analysis using the linear mixed model with both the linkage and physical maps showed a single prominent peak at 22.564 cM on the short arm of chromosome 1D (Fig. 5a), corresponding to a sharp association peak at 2.07 Mb (range, 0.3–2.28 Mb) of the wheat reference chromosome 1D (Fig. 5b); this region harbored 64 protein-coding genes (Table S1). The black glume color suggests that the pigment is melanin. Melanin biosynthesis in plants is largely attributed to tyrosinase activity (Singh et al. 2013). Rice, a model monocot plant, contains six tyrosinase-related genes, and we identified five homologs in the current wheat genome (Table S2). However, although several of the 64 genes encoded putative enzymes, none of them was involved in melanin biosynthesis (Table S1).
Evaluation of heading date at Dongola showed two major peaks: the larger one at ~70 days (early-flowering individuals) and the smaller one at ~95 days (late-flowering individuals). The ratio of the early to the late genotypes was consistent with 3:1 (P = 1.0, two-tailed Fisher’s exact test), indicating that a single gene controls heading date in the MSD population at this site (Fig. 6d). At Wad Medani, three peaks (~60, ~85 and ~100 days) were observed (Fig. 6a), indicating that more than one gene controls heading date there.
At Dongola, GWA analysis based on either the genetic or the physical map revealed a single significant peak on the short arm of chromosome 2D (Fig. 6e, f). The peak was located between 47.521 and 84.625 cM, which corresponds to 11.98–13.26 Mb; this region included 55 protein-coding genes (Table S3). At Wad Medani, genetic map-based analysis detected five significant peaks, on the short arms of chromosomes 2A, 2B and 2D and on the long arms of chromosomes 5A and 5D (Fig. 6b), whereas physical map-based analysis detected only two significant peaks, on the short arm of chromosome 2D and the long arm of chromosome 5D (Fig. 6c).
In the two environments, a highly significant peak was detected on the short arm of chromosome 2D. The position of this peak matched that of Ppd-1D, a pentatricopeptide repeat (PPR) protein-coding gene, which strongly affects wheat response to photoperiod (Langer et al. 2014; Guedira et al. 2016). However, we did not find a PPR gene sequence within the peak range (Table S3). Our search with the previously reported Ppd-1D sequence (Guo et al. 2009) in the wheat reference genome used in this study detected no significant hits.