Patterns of linkage disequilibrium and association mapping in diploid alfalfa (M. sativa L.)
- Cite this article as:
- Sakiroglu, M., Sherman-Broyles, S., Story, A. et al. Theor Appl Genet (2012) 125: 577. doi:10.1007/s00122-012-1854-2
Association mapping enables the detection of marker-trait associations in unstructured populations by taking advantage of historical linkage disequilibrium (LD) that exists between a marker and the true causative polymorphism of the trait phenotype. Our first objective was to understand the pattern of LD decay in the diploid alfalfa genome. We used 89 highly polymorphic SSR loci in 374 unimproved diploid alfalfa (Medicago sativa L.) genotypes from 120 accessions to infer chromosome-wide patterns of LD. We also sequenced four lignin biosynthesis candidate genes (caffeoyl-CoA 3-O-methyltransferase (CCoAoMT), ferulate-5-hydroxylase (F5H), caffeic acid-O-methyltransferase (COMT), and phenylalanine ammonia-lyase (PAL 1)) to identify single nucleotide polymorphisms (SNPs) and infer within-gene estimates of LD. As the second objective of this study, we conducted association mapping for cell wall components and agronomic traits using the SSR markers and SNPs from the four candidate genes. We found very little LD among SSR markers, implying limited value for genome-wide association studies. In contrast, within-gene LD decayed below an r² of 0.2 within 300 bp in three of four candidate genes. We identified one SSR and two highly significant SNPs associated with biomass yield. Based on our results, focusing association mapping on candidate gene sequences will be necessary until a dense set of genome-wide markers is available for alfalfa.
Linking DNA polymorphism to trait phenotypic variation is an increasingly important tool for plant breeding programs (Lande and Thompson 1990). Historically, segregating populations of a particular cross have been used to identify marker-trait associations (e.g., Stuber et al. 1999). More recently, association mapping has shown promise for trait mapping due to the increased access to abundant molecular markers in many crops (Stich et al. 2005).
Association mapping takes advantage of the fact that historical recombination within a population has decreased linkage disequilibrium (LD) to short chromosomal intervals, enabling potentially statistically strong and robust marker-trait associations to be detected (Jannink and Walsh 2002). In association mapping, existing allele variation within an entire population can be more efficiently represented because mapping is conducted directly in breeding populations (Hirschhorn and Daly 2005; Remington et al. 2001). In general, the precision of locating a QTL is much higher in association panels compared to biparental mapping populations, provided sufficient markers are available to detect the QTL. If LD extends over long distances, however, the biparental mapping approach is more powerful to detect the existence of a QTL, particularly if marker numbers are limited (Mackay and Powell 2007).
Two major drawbacks exist in association mapping. First, false positive associations between markers and traits can be obtained due to the presence of population structure (Aranzana et al. 2005; Lander and Schork 1994). However, population structure can be assessed with marker information from genome-wide genetic markers (such as SSRs), and association tests can then be conditioned on the population structure to reduce the false positive rate (Aranzana et al. 2005; Pritchard et al. 2000). Second, the extent of LD plays a practical role in determining the number of markers needed to detect associations between genotype and phenotype (Rafalski and Morgante 2004). Limited LD in the population means that associations will only be detected between alleles at loci close together, requiring many markers to saturate the genome (Hagenblad and Nordborg 2002). When the limiting factor for association mapping is the absence of a sufficiently large number of markers evenly dispersed throughout the genome, an alternative strategy is to assay variation in candidate genes (Neale and Savolainen 2004). For both cases, the design and use of association studies require knowledge of the LD structure in the genome (Oraguzie et al. 2007).
Alfalfa is one of the most important forage legumes in the world (Quiros and Bauchan 1988; Michaud et al. 1988), and has been proposed as a bioenergy crop (Delong et al. 1995). Alfalfa has the potential to produce high yields, but genetic improvement for yield has not been as great as that realized for the major grain crops (Hill et al. 1988). Digestion of forage for animal nutrition or for cellulosic bioethanol production requires the effective hydrolysis of cellulose and solubilization of hemicellulose in the presence of lignin (U.S. DOE 2006). Reducing lignin content can increase the efficiency of sugar release from cell wall complexes up to twofold (Chen and Dixon 2007). Therefore, improving biomass yield and modifying the plant’s cell wall composition are two breeding targets important for both forage and biofeedstock (Ragauskas et al. 2006) applications. If QTL associated with yield and cell wall components could be identified, they could be incorporated into modern cultivars, enhancing the efficiency of alfalfa breeding.
Cultivated alfalfa is an autotetraploid (2n = 4x = 32) domesticated from the Medicago sativa–falcata complex. Autotetraploidy complicates genetic mapping, but diploid (2n = 2x = 16) relatives of alfalfa exist that share the same karyotype, have highly syntenic genetic linkage groups, and can be hybridized with tetraploid individuals (Diwan et al. 2000; McCoy and Bingham 1988; Quiros and Bauchan 1988). The diploid members of the complex include M. sativa subsp. falcata, M. sativa subsp. caerulea, and their natural hybrid, M. sativa subsp. hemicycla (Quiros and Bauchan 1988; Havananda et al. 2010).
The genome-wide extent of LD in the M. sativa–falcata complex has previously been estimated in one tetraploid breeding population using SSR markers (Li et al. 2011). Within-gene LD was estimated in a set of different tetraploid breeding populations using two regions of the alfalfa gene homologous to the M. truncatula CONSTANS-LIKE gene (Herrmann et al. 2010). However, both of these populations are expected to have had reduced recombination due to breeding efforts compared to a broad-based natural population. In this paper, we assess both chromosome-wide estimates of LD in a population consisting of 374 unimproved diploid alfalfa genotypes from 120 accessions using 89 polymorphic SSR loci distributed throughout the genome and within-gene estimates of LD in sequences of four candidate genes of the lignin biosynthesis pathway. In addition, we evaluated SSR and candidate gene SNP marker polymorphisms for associations with 23 traits relevant to biomass accumulation and cell wall components.
Materials and methods
Plant materials and phenotyping
We selected 374 individual genotypes from 120 accessions obtained from the USDA National Plant Germplasm System, representing the geographical distribution of the diploid M. sativa complex, including subsp. caerulea, falcata, and hemicycla (Supplemental Table 1) (Sakiroglu et al. 2010; Sakiroglu and Brummer 2011). These genotypes were planted in field experiments near Watkinsville and Eatonton, Georgia. The experimental design and procedures were reported previously (Sakiroglu et al. 2011). We evaluated neutral detergent fiber (NDF), acid detergent fiber (ADF), acid detergent lignin (ADL), and total nonstructural carbohydrate (TNC) composition, glucose, xylose, arabinose, total aboveground biomass yield, and regrowth after harvest in 2007 and 2008. Five other agronomic traits were measured in a single year: stem yield and stem/leaf ratio in 2007, and plant height, stem thickness, and spring regrowth in 2008 (Sakiroglu et al. 2011).
Genotyping and sequencing
Management of sequencing reads
Sequences were sorted based on their tags using software provided at http://bioinf.eva.mpg.de/pts/ as described in Meyer et al. (2008), resulting in 72 FASTA files, each representing the sequences from a single individual. Sequence reads were assembled for each individual using Lasergene’s SeqMan program (DNASTAR, Madison, WI, USA) to produce seven contigs for each individual (Fig. 2). M. truncatula or M. sativa sequences used to design the primers for this project were added to each contig as a reference sequence. Each contig was exported from SeqMan as a phrap (.ace) file for use with an in-house Perl script. The script was written to determine single nucleotide polymorphisms (SNPs) and insertion–deletion polymorphisms (indels) within each contig by tallying the number of reads containing the same series of SNPs. Both SeqMan and our Perl script eliminated bases at low frequency (<0.05), because such variation is likely attributable to sequencing errors, for example in homopolymer regions.
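The low-frequency filter described above can be illustrated with a minimal Python sketch. This is not the in-house Perl script, which is not reproduced here; the function names, the input layout (equal-length aligned read strings for one contig) and the handling of gap characters are all assumptions made for illustration. Only the 0.05 frequency threshold comes from the text.

```python
from collections import Counter

def filter_low_frequency_bases(column, min_freq=0.05):
    """Return {base: frequency} for bases in one alignment column
    whose frequency is at least min_freq (hypothetical helper)."""
    counts = Counter(b for b in column if b in "ACGT-")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {b: n / total for b, n in counts.items() if n / total >= min_freq}

def call_variable_sites(reads, min_freq=0.05):
    """reads: list of equal-length aligned read strings (an assumed input format).
    Returns {position: {base: frequency}} for columns that keep more than
    one base after the low-frequency filter."""
    sites = {}
    for pos in range(len(reads[0])):
        kept = filter_low_frequency_bases([r[pos] for r in reads], min_freq)
        if len(kept) > 1:
            sites[pos] = kept
    return sites

# Toy example: position 1 is a real C/T polymorphism,
# position 3 carries a single-read variant that is filtered out.
reads = ["ACGT"] * 30 + ["ATGT"] * 29 + ["ACGA"] * 1
sites = call_variable_sites(reads)
```

In this toy input the single read carrying A at the last position has a frequency of 1/60, below the threshold, so only the second position is retained as variable.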
Reamplification of PCR products led to a high rate of chimeric sequences due to recombination during the PCR. As a consequence, our script identified more than two haplotypes for each individual. The script identified the location of each SNP represented by an ambiguity code in the consensus sequence and the number of reads containing each SNP allele at each of those positions. We presumed that the SNP combinations with the highest frequencies were the non-chimeric sequences, and we used them to replace the IUPAC ambiguity codes in the consensus sequence, creating two likely true haplotypes for each individual for CCoAoMT and F5H. These manually determined haplotypes from CCoAoMT and F5H produced LD plots similar to those from LD analyses based on unphased SNP genotypes. Therefore, we used unphased SNP genotype data to estimate within-gene LD as well as to conduct association tests. Manually determined haplotypes had slightly elevated diversity statistics for CCoAoMT and slightly lower diversity statistics for F5H (Supplemental Table 3) compared to haplotypes determined by PHASE (Stephens and Donnelly 2003), as implemented in DnaSP v. 5.0 (Librado and Rozas 2009). Because the manually determined haplotypes did not show a systematic bias compared to inferred haplotypes based on PHASE, we used the inferred haplotypes for estimating diversity statistics for all genes. Sequences were aligned using MUSCLE (Edgar 2004). Alignments were manually edited using BioEdit (Hall 1999).
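The idea of taking the most frequent SNP combinations as the likely non-chimeric haplotypes can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the procedure as implemented in the study: the function name, the read format (aligned strings) and the rule for homozygous individuals are hypothetical.

```python
from collections import Counter

def top_two_haplotypes(reads, snp_positions):
    """Tally the SNP-allele combination carried by each read and return
    the two most frequent combinations, presumed here to be non-chimeric.
    reads: aligned read strings; snp_positions: 0-based SNP columns
    (both are assumed input formats)."""
    tally = Counter("".join(r[p] for p in snp_positions) for r in reads)
    ranked = tally.most_common(2)
    if len(ranked) == 1:
        # Only one combination observed: treat the individual as homozygous.
        ranked = ranked * 2
    return [hap for hap, _ in ranked], tally

# Toy example: two abundant parental combinations plus rarer chimeras.
reads = ["ACGT"] * 20 + ["ATGA"] * 18 + ["ACGA"] * 3 + ["ATGT"] * 2
haplotypes, tally = top_two_haplotypes(reads, [1, 3])
```

Here the combinations CT and TA dominate the tally, while CA and TT, which mix alleles from the two abundant combinations, appear only in a few reads, as would be expected for PCR recombinants.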
To estimate genetic diversity in the four candidate genes, we computed the average number of nucleotide differences between sequence pairs, heterozygosity per nucleotide site (π), Tajima’s D statistic (Nei 1987; Tajima 1989), and Watterson’s estimator of the population mutation rate (θ) (Watterson 1975) using the computer program DnaSP v5 (Librado and Rozas 2009). Because DnaSP does not recognize DNA ambiguity codes, each individual’s genotypic sequence was represented by two inferred haplotypic sequences.
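For readers unfamiliar with these statistics, the following Python sketch shows how per-site π and Watterson’s θ can be computed from a set of aligned haplotypes. It is a simplified illustration, not a reimplementation of DnaSP: it assumes gap-free, equal-length haplotype strings and at least two sequences, and the function names are hypothetical. Watterson’s estimator is taken as θ = S / (a1 · L), where S is the number of segregating sites, L the alignment length, and a1 the sum of 1/i for i from 1 to n − 1.

```python
from itertools import combinations

def segregating_sites(haps):
    """Number of alignment columns with more than one allele."""
    return sum(1 for col in zip(*haps) if len(set(col)) > 1)

def watterson_theta_per_site(haps):
    """Watterson's estimator per site: S / (a1 * L)."""
    n, length = len(haps), len(haps[0])
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(haps) / (a1 * length)

def nucleotide_diversity(haps):
    """Average pairwise differences per site (pi)."""
    n, length = len(haps), len(haps[0])
    total_diffs = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(haps, 2)
    )
    n_pairs = n * (n - 1) / 2
    return total_diffs / (n_pairs * length)

# Toy alignment of four haplotypes, four sites, two of them segregating.
haps = ["AAAA", "AAAT", "AATT", "AATT"]
theta_w = watterson_theta_per_site(haps)   # 2 / ((1 + 1/2 + 1/3) * 4) = 3/11
pi = nucleotide_diversity(haps)            # 7 differences / (6 pairs * 4 sites) = 7/24
```

Tajima’s D compares these two quantities after scaling by a variance term, so a negative D corresponds to π falling below θ, as happens when rare variants are in excess.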
Least squares means of 23 phenotypic traits were obtained as described previously (Sakiroglu et al. 2011). The software program TASSEL 2.1 (Bradbury et al. 2007) was used to detect associations between SSR markers and the phenotypic means. TASSEL 3.0 (Bradbury et al. 2007) was used to test for associations between candidate gene SNPs and the phenotypic means. A mixed linear model (MLM) was fitted for each single marker and trait (Yu et al. 2006). In addition to the population structure inference (Q matrix), this approach accounts for relatedness among individuals by including the pairwise kinship matrix in the mixed model. Correction for multiple testing was applied to P values obtained from MLM using the positive FDR method (Storey 2002; Storey and Tibshirani 2003) implemented in the software program Q Value (Storey 2002). We also constructed quantile–quantile (QQ) plots to visualize the observed MLM P value versus expected P value distribution for each of the candidate gene association tests. Deviations from the line of equality imply an association.
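The multiple-testing step can be illustrated with a short Python sketch. Note that this is a deliberate simplification and not the Storey procedure used in the study: it fixes the proportion of true null hypotheses (π0) at 1, which reduces the q value to a Benjamini-Hochberg-style adjusted P value and is therefore more conservative. The function name is hypothetical.

```python
def q_values_pi0_one(p_values):
    """FDR-adjusted values with pi0 assumed to be 1 (a simplification).
    For each test, the q value is the minimum of p * m / rank over all
    tests with an equal or larger P value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so the running
    # minimum enforces monotonicity of the q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Toy example with four marker-trait tests.
p = [0.01, 0.04, 0.03, 0.5]
q = q_values_pi0_one(p)
significant = [i for i, value in enumerate(q) if value < 0.05]
```

With these toy P values only the first test has a q value below 0.05, which shows how a nominally significant P value (0.03 or 0.04) can fail to pass once the number of tests is taken into account.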
Alignments for each gene region have been deposited at GenBank with the following accession numbers. F5H exon 1 JN705257–JN705321; F5H exon 2 JN714201–JN714257; PAL 1 exon 1 JN849691–JN849757; PAL 1 exon 2 JN849758–JN849828; COMT exon 1 JN849829–JN849897; COMT contig 2 JN849970–JN850038; CCoAoMT JN849898–JN849969.
Sequencing results and molecular diversity of subspecies
Table: Candidate gene amplicon composition, coverage and SNP (MAF > 0.05) distribution
Columns: Number of amplicons; Alignment length (bp); Coverage range (reads); No. of SNPs; No. of SNPs per bp; No. of NS SNPs; No. of indels
Rows: F5H exon 1; F5H exon 2; PAL1 exon 1; PAL1 exon 2; COMT exon 1; COMT contig 2
Table: Summary of DNA sequence variation from four candidate genes in three subspecies of diploid alfalfa from inferred haplotypes
Columns: Medicago sativa subspecies; No. of individuals; No. of polymorphic sites; No. of haplotypes; Haplotype diversity (SD)
Row groups: CCoAoMT (1,340 bp); F5H exon 1 (723 bp); F5H exon 2 (594 bp); PAL1 exon 1 (377 bp); PAL1 exon 2 (1,483 bp); COMT exon 1 (415 bp); COMT contig 2 (1,203 bp)
Physical locations of 58 of 89 SSR loci were identified using the M. truncatula genome sequence build (version 3.5.1), which covers about 66% of the gene space (Chris Town, pers. comm.). We could estimate LD from 199 locus pairs between markers known to be located on the same chromosome (Fig. 1). Markers on M. truncatula chromosomes 4 and 8 that are denoted in Fig. 1 by asterisks are most likely found on the other chromosome in M. sativa, because the sequenced M. truncatula accession has an unusual translocation between chromosomes 4 and 8 (Kamphuis et al. 2007). To investigate the evidence of the translocation in depth, we calculated LD between the SSR markers denoted by asterisks in Fig. 1 and the remaining markers of chromosomes 4 and 8 separately. We observed only two significant associations when all accessions were considered: SSR marker BE321117 showed significant LD with al367160 on chromosome 4 and with aw267840 on chromosome 8 (Fig. 1). However, when the five groups identified by Structure were analyzed separately, no significant LD was detected, suggesting the observed LD was created by family structure rather than by true physical proximity. We excluded a total of 20 pairwise LD calculations because we could not infer the accurate distance between markers in the above-mentioned situation.
Table: Number of SSR locus pairs showing linkage disequilibrium in five main populations of diploid alfalfa and over all genotypes based on a significance level of P = 0.0001 after control for the false discovery rate (FDR)
Columns: No. of genotypes; No. of locus pairs in LD; % of locus pairs in LD
Tests of association
Table: Significant marker-phenotype associations after correction for multiple testing using the positive FDR method (SNP FDR Q values <0.05)
Columns: FDR Q value
Rows: F5H exon 2/276
Candidate gene sequencing and sequence diversity
For sequencing in this project, we multiplexed four genes consisting of seven contigs derived from 16 overlapping tagged amplicons and aimed for 600× coverage using Titanium chemistry on the 454 FLX sequencer. Our amplicon lengths were approximately 500 bp, but we were not certain about the read lengths we could expect from the Titanium chemistry when we started the experiment. We were also concerned about whether our estimation of the concentrations of the 1,200 amplicon reactions was accurate. Taken together, these uncertainties supported our conservative approach. Our experience from this study indicates that accurate haplotype deduction can be achieved with as few as 35 reads from the 454 FLX because, at this coverage level, sequencing errors can still be easily distinguished from true SNPs. The advantages of 454 sequencing over Sanger and Illumina methods are that read lengths allow for the determination of phase over longer distances and that insertion–deletion mutations can be easily deduced, both of which are useful for distinguishing haplotypes. Careful control of template concentration for PCR, accurate amplicon quantification for pooling prior to sequencing, and methods to avoid PCR recombination, which we encountered and which has been seen by other groups (Griffin et al. 2011), are all needed to ensure even coverage of 35–100×.
A previous study investigating the history of domestication of alfalfa (Muller et al. 2005) reported higher levels of sequence diversity for diploid M. sativa subsp. caerulea than we report here. Sequence diversity at two genes sampled from eight individuals resulted in θ = 0.0376 and θ = 0.0272 while our values ranged from 0.0048 to 0.0143. The strategy used by Muller et al. (2005) was very different from ours in a few ways. First, only one allele per individual was used in an effort to sample more variation across a diverse collection of Medicago. Second, the plant material used in their study included diploid and tetraploid, domesticated, and wild populations. Finally, their sequences were from predominantly intron regions.
Comparing patterns of genetic diversity across species could be hampered due to differences in the mode of reproduction (self pollinated vs. cross pollinated) and the nature of genetic material used (breeding material vs. unimproved population). Alfalfa has a predominantly outcrossing breeding system and the plant material used in this experiment is unimproved germplasm collected from broad-based populations. Although we observed higher sequence diversity in the four candidate genes compared to other crop species and model plants (Tenaillon et al. 2001; Schmid et al. 2005; Liu and Burke 2006; Mather et al. 2007), we focused on comparing our results to the wild relatives or landraces of maize (Zea mays ssp. mays L.) and sunflower (Helianthus annuus L.), which are also outcrossing populations and could be assumed to be more similar to alfalfa than populations from autogamous species. The average θ value estimated in this study (0.0142) was comparable to the values from 21 genes in 15 maize landraces (θ = 0.0129; Tenaillon et al. 2001) and from nine genes in 16 wild populations of sunflower (θ = 0.0144; Liu and Burke 2006).
Tajima’s test of neutrality indicated that selection may be acting on F5H exon 2 within subspecies hemicycla, but not in the other subspecies. Sliding window analysis of F5H exon 2 did not detect significant values for Tajima’s D within subspecies caerulea or falcata, but Tajima’s D was consistently negative for hemicycla. Negative Tajima’s D values indicate an excess of rare variation, which may be due to hemicycla’s hybrid nature, consisting of genetic variation derived from both of the other subspecies. Tajima’s test of neutrality is negative for COMT contig 2 in subspecies falcata, suggesting that selection has acted on this sequence.
The extent of LD is crucial to determine the marker density necessary for association mapping analyses, with longer LD requiring fewer markers to saturate the genome, but resulting in lower resolution (Jorde 1995; Buckler and Thornsberry 2002; Ching et al. 2002; Rafalski and Morgante 2004). We observed very little LD between SSR marker pairs, and the estimates of the extent of LD in our study are lower than those reported in maize and barley (Remington et al. 2001; Liu et al. 2003; Stich et al. 2005; Malysheva-Otto et al. 2006). The small number of SSR locus pairs in LD could partially be due to the FDR calculations that we used to correct possible false positives arising from thousands of pairwise LD calculations. It could also be due to the nature of the plant material. In the above-mentioned studies, landraces or inbred lines that had resulted from human selection were used, which could create LD (Jannink and Walsh 2002), whereas our germplasm consisted entirely of wild accessions. The extent of LD in alfalfa was previously estimated in a breeding population using SSR markers, and the results revealed that 61.5% of SSR marker pairs separated by less than 1 Mbp were in LD (P < 0.001), implying extensive LD (Li et al. 2011). However, Li et al. (2011) used a synthetic tetraploid alfalfa population that was derived from 300 individuals of three cultivars (100 individuals from each cultivar). The larger estimate of the extent of LD obtained by Li et al. (2011) compared to LD in our study was probably due to artificial selection.
LD is considered to have decayed when r² values drop below 0.1 (Remington et al. 2001; Ersoz et al. 2007). LD in two of our gene sequences decayed below r² = 0.1 within 500 bp, and overall LD was below 0.2 within 500 bp in five of the seven contigs. Only F5H exon 2 is comparable to a previous report of within-gene LD in alfalfa. Estimates of within-gene LD in a CONSTANS-LIKE gene from 59 genotypes of a breeding variety in tetraploid alfalfa revealed that LD of r² = 0.2 could persist as long as 700 bp (Herrmann et al. 2010). The difference between the LD estimates in the two studies is probably attributable to the use of different genetic material. Herrmann et al. (2010) used cultivated material in which LD could persist over longer distances due to bottlenecks produced by artificial selection (Ching et al. 2002; Liu and Burke 2006; Kolkman et al. 2007), whereas we used broad-based wild germplasm. The difference in the extent of LD between different genetic materials has previously been reported in other crops (Caldwell et al. 2006; Liu and Burke 2006).
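To make the r² thresholds concrete, the sketch below computes r² between pairs of biallelic SNPs and pairs each value with the physical distance between the sites. It is an illustration only and differs from the analysis reported here in one important respect: it assumes phased haplotypes, whereas the within-gene LD estimates in this study were based on unphased SNP genotypes. The function names and input layout (one allele character per SNP, with base-pair coordinates supplied separately) are hypothetical.

```python
from itertools import combinations

def r_squared(site_a, site_b):
    """r^2 between two biallelic sites scored on the same phased haplotypes.
    site_a, site_b: sequences of alleles, one entry per haplotype."""
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else float("nan")

def pairwise_ld(haps, positions):
    """haps: strings holding one allele per SNP; positions: bp coordinate
    of each SNP. Returns a list of (distance_bp, r2) for all SNP pairs."""
    results = []
    for (i, pos_i), (j, pos_j) in combinations(enumerate(positions), 2):
        a = [h[i] for h in haps]
        b = [h[j] for h in haps]
        results.append((abs(pos_j - pos_i), r_squared(a, b)))
    return results

# Toy example: two SNPs 300 bp apart in complete LD.
ld = pairwise_ld(["AC", "AC", "GT", "GT"], [10, 310])
```

Plotting such (distance, r²) pairs for all SNP pairs in a contig and noting where the trend falls below 0.1 or 0.2 is one simple way to read off a decay distance.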
We identified one SSR marker and three SNPs associated with biomass yield, stem proportions, regrowth, and stem thickness, but no associations with cell wall composition traits. Given the relative paucity of SSR markers we examined and the limited number of individuals (and few genes) in our candidate gene analysis, this lack of association is not surprising. The CCoAoMT SNP at position 111, which is associated with several traits, is located in the first intron; although it is not in linkage disequilibrium with any SNPs in the first exon or downstream of this site, it may be linked to causative SNPs in the promoter region. The F5H exon 2 SNP associated with yield in 2008 is a synonymous change; however, LD does not decay within the region we sequenced, so this SNP may be linked to an unsampled causative SNP. The nature of each of these associations needs to be investigated further and validated in additional alfalfa populations.
Successful candidate gene association mapping studies have generally focused on genes from single pathways (Myles et al. 2009). Despite evidence that the lignin biosynthetic genes CCoAoMT and F5H directly impact the lignin content of alfalfa (Guo et al. 2001; Chen and Dixon 2007), our candidate gene approach did not detect any associations with the cell wall characteristics measured, which were based on fiber analysis. Weak associations between lignin genes and both yield traits and cell wall components have been reported previously from maize inbred lines by Chen et al. (2010), who concluded that qualitative trait polymorphisms for yield and cell wall characteristics segregate independently of one another. The phenylpropanoid pathway, of which lignin is one of many products, has several components that have been linked to plant growth.
In summary, in this paper we attempt to estimate both genome-wide SSR and within gene SNP variation to determine the extent of LD in diploid alfalfa. In terms of the potential to use the candidate gene approach for allele mining for alfalfa improvement, we have shown that although our sample size was small, two significant SNPs in two candidate genes that are associated with biomass yield and other traits were detected.
This research was supported by United States Department of Agriculture-Department of Energy Plant Feedstock Genomics for Bioenergy award 2006-35300-17224 to E. Charles Brummer.
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.