Patterns of polymorphism and linkage disequilibrium in cultivated barley
We carried out a genome-wide analysis of polymorphism (4,596 SNP loci across 190 elite cultivated accessions) chosen to represent the available genetic variation in current elite North West European and North American barley germplasm. Population sub-structure, patterns of diversity and linkage disequilibrium varied considerably across the seven barley chromosomes. Gene-rich and rarely recombining haplotype blocks that may represent up to 60% of the physical length of barley chromosomes extended across the ‘genetic centromeres’. When 2,132 bi-parentally mapped SNP markers with minor allele frequencies higher than 0.10 were positioned by association mapping, 87.3% were located to within 5 cM of their original genetic map position. We show that at the current marker density genetically diverse populations of relatively small size are sufficient to fine map simple traits, provided that the traits are not strongly stratified within the sample and fall outside the genetic centromeres, and that population sub-structure is effectively controlled in the analysis. Our results have important implications for association mapping, positional cloning, physical mapping and practical plant breeding in barley and other major world cereals including wheat and rye that exhibit comparable genome and genetic features.
Keywords: Linkage disequilibrium; Minor allele frequency; Association mapping; Barley chromosome; Brittle rachis
Genome-wide association studies are currently commonplace in the search for genes controlling human genetic disorders (Kruglyak 2008) but are in their infancy in plant genetic research, particularly in crops (Zamir 2008). This is despite the approach sparking considerable interest and having tremendous potential for identifying the genes underlying agriculturally important traits, particularly through exploiting the tremendous wealth of highly replicated historic phenotypic information that is potentially available to researchers through national plant trialing and registration schemes (Waugh et al. 2009). The exceptions are the model plants with fully sequenced genomes (The Arabidopsis Initiative 2000; Goff et al. 2002; Jun Yu et al. 2002) and maize in both private and public sectors (Chandler and Brendel 2002; Lawrence et al. 2008) where good progress has recently been made.
In the majority of crops, while interest in association mapping is considerable, implementation has so far been restricted. A common reason appears to be the lack of high-plex, low cost, robust and informative marker platforms that facilitate sufficiently dense genome-wide analysis of molecular polymorphism to assist subsequent association analysis (Zhu et al. 2008). Furthermore, much of the theory and practice of association mapping has been established in heterozygous outbreeding species (Ersoz et al. 2009), and there is limited detailed information on both the extent and patterns of polymorphism, and their consequent impact on genome-wide association scans, specifically in autogamous crop populations where linkage disequilibrium (LD) is predicted to be extensive (Caldwell et al. 2006).
Barley is a diploid inbreeding crop plant that, over the last 10,000 years (Badr et al. 2000), has gone through complex and rapid evolution imposed by the dual bottlenecks of domestication and breeding. Although it is a fairly strict inbreeder, since the late 19th century forced cross-pollination imposed by breeders, followed either by several rounds of inbreeding or, in the second half of the 20th century, by the generation of F1 doubled haploids, coupled with heavy selection for desirable phenotypes, has generated a unique population of homozygous ‘elite’ inbred cultivars with intricate and complex pedigrees. We recently proposed (Rostoks et al. 2006) that this pseudo-outbred ‘elite genepool’ could be used effectively for medium resolution genome-wide association scans for trait-gene identification using a manageable number (hundreds to thousands) of robust bi-allelic markers. From an applied standpoint this is particularly attractive for trait dissection as the elite genepool contains the majority of the genetic variation currently being manipulated by breeders in contemporary crop improvement schemes. Diagnostics for positive alleles for relevant traits in this germplasm will help to facilitate the realization of ‘predictive breeding’ often discussed as necessary to meet the world’s future demands for food and feed (FAO 2002). Understanding the patterns of polymorphism in this type of material and the possible limitations it has for genome-wide association scans is therefore an important step that will have significant fundamental and translational outcomes.
We assembled 190 elite cultivated accessions from two large association genetics programs in the US (BarleyCAP, http://www.barleycap.org/) and UK (AGOUEB, http://www.agoueb.org/) chosen to represent the available genetic variation in current elite North West European and North American barley germplasm. The sample encompasses elite lines from three major biotypes present in barley germplasm (Malysheva-Otto et al. 2006) that we genotyped with 4,596 SNP loci. We used this genotypic information to carry out a genome-wide analysis of population sub-structure, investigate patterns of diversity and LD and interpret the implications of our findings for association mapping. Ultimately, we demonstrated the validity of genome-wide association mapping in this germplasm by fine-mapping each of 2,132 SNP markers that exhibited minor allele frequencies (MAFs) higher than 0.10. The possibilities for identifying and validating candidate genes that underlie positive marker-trait associations are discussed.
Materials and methods
We assembled a total of 190 elite cultivated accessions from two large association genetics programs in the US (BarleyCAP, http://www.barleycap.org/) and UK (AGOUEB, http://www.agoueb.org/) chosen to represent the available genetic variation in current elite North West European and North American barley germplasm (Table S1).
A total of 4,596 high confidence SNPs were assayed across the association mapping panel. These SNPs were incorporated into three Illumina™ GoldenGate Pilot Oligo Pool Assays (POPA 1, 2, and 3) as described by Close et al. (2009). In the current experiment, 671 SNP assays were considered as ‘failures’ and omitted from the dataset. Of the remaining 3,925 SNPs, 2,943 had previously been incorporated into a combined genetic map, and 982 were unmapped. Ambiguous calls were coded as ‘missing data’ in all analyses. A sub-set of 2,709 mapped SNPs with ‘missing data’ <10% was collected for all 190 elite barley accessions (Table S1). Of these, 2,132 had MAF > 10%, providing data matrices of 2,943 × 190 and 2,132 × 190 loci, which were used to explore patterns of genetic diversity, population structure and linkage disequilibrium, and genome-wide association scans (GWASs), respectively. All genotyping assays were conducted at the Southern California Genotyping Consortium at the University of California, Los Angeles.
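The two marker filters described above (less than 10% missing data and MAF above 10%) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the one-call-per-accession coding for homozygous inbred lines, and the use of `None` for ambiguous calls are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumption: each inbred accession carries one allele call per SNP
# (e.g. "A" or "B"); None stands for an ambiguous call coded as missing.
Call = Optional[str]


def missing_rate(calls: List[Call]) -> float:
    """Fraction of accessions without a usable call at this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Call]) -> float:
    """Frequency of the rarer allele among non-missing calls (bi-allelic SNP)."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    counts: Dict[str, int] = {}
    for c in observed:
        counts[c] = counts.get(c, 0) + 1
    if len(counts) < 2:
        return 0.0  # monomorphic in this panel
    return min(counts.values()) / len(observed)


def passes_filters(calls: List[Call],
                   max_missing: float = 0.10,
                   min_maf: float = 0.10) -> bool:
    """Keep a SNP only if missing data < 10% and MAF > 10%."""
    return (missing_rate(calls) < max_missing
            and minor_allele_frequency(calls) > min_maf)
```

For example, `passes_filters(["A"] * 8 + ["B"] * 2)` keeps the SNP, whereas a SNP with 20% missing calls is dropped regardless of its allele frequencies.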
Population structure and patterns of diversity
We calculated a phylogenetic tree using the neighbour-joining (NJ) tree building and clustering algorithm implemented in the PHYLIP package (Felsenstein 1997). The resulting dendrogram was rooted using the Hordeum spontaneum line “Mehola”. Principal coordinate (PCO) analysis based on simple matching of SNP alleles was performed with Genstat 11 (Payne et al. 2008). Thirdly, Bayesian clustering, again using simple matching, was applied to identify clusters of genetically similar individuals using STRUCTURE 2.1 considering admixture (Pritchard et al. 2000b; Pritchard and Donnelly 2001), with population differentiation measured using the Fst estimator implemented in the STRUCTURE software.
Genome structure and linkage disequilibrium
Pair-wise measures of LD ($r^2$) were calculated for the selected 2,132 SNPs for each chromosome using Haploview 4.01 (Barrett et al. 2005). Only markers with MAF > 0.1 and pair-wise comparisons with p > 0.001 were considered. $r^2$ values were plotted as a function of genetic distance for each chromosome. Haploview was used to generate $r^2$ LD heat-map charts for each chromosome.
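As a reminder of what the pair-wise statistic measures, the sketch below computes $r^2 = D^2 / [p_A(1-p_A)\,p_B(1-p_B)]$ with $D = p_{AB} - p_A p_B$ for two bi-allelic loci. It assumes fully homozygous lines, so each accession contributes one known haplotype; this is a simplification for illustration and not a description of how Haploview estimates haplotype frequencies. The 0/1 allele coding and the function name are hypothetical.

```python
from typing import Optional, Sequence


def r_squared(locus1: Sequence[Optional[int]],
              locus2: Sequence[Optional[int]]) -> float:
    """r^2 between two bi-allelic loci coded 0/1 per inbred line (None = missing)."""
    # Use only accessions that have a call at both loci.
    pairs = [(a, b) for a, b in zip(locus1, locus2)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no accessions with calls at both loci")
    p_a = sum(a for a, _ in pairs) / n            # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in pairs) / n            # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n  # 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one locus is monomorphic")
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    return d * d / denom
```

Two loci with identical allele patterns give $r^2 = 1$, and two loci whose alleles are combined at random give a value near 0.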
To determine markers associated with a trait of interest, SNP data were modeled using a generalized linear mixed model so that random population structure estimates could be fitted to reduce type I errors. Mapped SNP marker scores were all considered as binomial traits and only SNP markers with MAF > 0.1 and ‘missing data’ <10% were used.
Mixed model methodology
We derived a relative kinship matrix (K) on the basis of simple matching coefficients from a set of random SNP data using Genstat software (Payne et al. 2008). Markers were fitted as fixed effects. Genotype was fitted as a random effect, which is assumed to be distributed as $N(0, 2K\sigma_g^2)$, where K is the kinship matrix. −log10(p value) scores were used as a measure of LD. For the STRUCTURE model, the resulting STRUCTURE output matrix (Q) for k = 7 was directly used as co-factor in the random term of the mixed model.
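The simple matching coefficient underlying K is the proportion of loci at which two accessions carry the same allele. The sketch below builds the raw pair-wise matrix under that definition. It is illustrative only: the analysis itself was run in Genstat, any rescaling applied there is not reproduced, and the function names and data layout (one list of allele calls per accession, `None` for missing) are assumptions.

```python
from typing import List, Optional, Sequence

Call = Optional[str]


def simple_matching(a: Sequence[Call], b: Sequence[Call]) -> float:
    """Proportion of loci, scored in both accessions, with identical alleles."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no loci scored in both accessions")
    return sum(x == y for x, y in shared) / len(shared)


def kinship_matrix(genotypes: Sequence[Sequence[Call]]) -> List[List[float]]:
    """Symmetric matrix of simple matching coefficients (diagonal set to 1)."""
    n = len(genotypes)
    k = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            k[i][j] = k[j][i] = simple_matching(genotypes[i], genotypes[j])
    return k
```

With 190 accessions this yields a 190 × 190 matrix, which would play the role of K in the random term described above.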
Population structure and substructure
We genotyped 190 accessions with barley POPA 1, 2 and 3. The accessions were chosen to represent the available diversity within the elite cultivated genepool from NW Europe and the USA, and included a small number of foundation genotypes and key cultivars that have featured strongly in the development of contemporary barley cultivars in these regions. The assembled dataset contained a total of 1,746,480 allele assignments ordered along each barley chromosome. After manual supervision and correction, 347,118 data points were either removed from the dataset (254,980, comprising all data from poor quality SNPs) or coded as missing (92,138) in all subsequent analyses.
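The data-point totals quoted above can be checked with simple arithmetic. The factor of two allele assignments per genotype call is an inference rather than something the text states; it is used here only because it reproduces the reported figures when combined with the 671 failed assays mentioned in Materials and methods.

```python
SNP_ASSAYS = 4596        # SNPs on POPA 1, 2 and 3
ACCESSIONS = 190         # elite cultivated accessions
ALLELES_PER_CALL = 2     # assumption: each diploid call counted as two alleles
FAILED_ASSAYS = 671      # assays treated as failures and omitted
CODED_MISSING = 92138    # data points coded as missing (as reported)

total = SNP_ASSAYS * ACCESSIONS * ALLELES_PER_CALL
removed = FAILED_ASSAYS * ACCESSIONS * ALLELES_PER_CALL
excluded = removed + CODED_MISSING
retained = total - excluded

print(total)     # 1746480
print(removed)   # 254980
print(excluded)  # 347118
print(retained)  # 1399362
```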
Patterns of polymorphism along barley linkage groups
We explored whole genome patterns of LD using classic LD algorithms ($r^2$ and D′) and using a mixed-model approach with population structure estimators as co-factors to account for most of the population structure effects on long-range LD. We considered it important to remove long-range LD effects because they may obscure our interpretation of both genome coverage and mapping resolution.
Classic LD algorithms
Inspection of the 2,943 genetically mapped SNPs scored across the germplasm set revealed a subset of 2,132 that exhibited a minor allele frequency (MAF) higher than 0.10 in this germplasm set with less than 0.10 of missing data. These were used in subsequent analyses. Plotting LD as a function of genetic distance revealed extensive intra-chromosomal LD along each barley chromosome (Figure S4). Heatmap charts of the distribution of intra-chromosomal $r^2$ values across each barley chromosome highlight the extended LD values across the genetic centromeres (Figure S5). High LD extends outwards from these regions along the spine of each chromosome, forming an axis of blocks of short-range LD. A background of long-range LD, which commonly results from population sub-structure and admixture within a germplasm set (Ersoz et al. 2008), was observed for all seven chromosomes (Figure S5).
Barley has a large (5,300 Mb), un-sequenced genome but extensive EST resources derived from nine cultivated lines (HarvEST, Barley v1.68) (Wanamaker et al. 2008). Using this EST information Close et al. (2009) previously developed three Illumina 1,536-plex gene-based SNP assay platforms from a combination of informatics analysis and by re-sequencing PCR-amplicons from a collection of eight diverse elite barley cultivars. They used these POPAs to genotype three doubled haploid barley mapping populations [Steptoe × Morex (Kleinhofs et al. 1993), Morex × Barke (Stein et al. unpublished) and OWB-D × OWB-R (Costa et al. 2001)] and generated genetic linkage maps of each population. Then, they used a directed acyclic graphing algorithm implemented in MergeMap (Wu et al. 2008) to derive a consensus map from the forced linear order of the 2,943 polymorphic SNPs segregating in the three populations. The consensus map coordinates from MergeMap were normalized to the arithmetic mean cM distance for each linkage group from the individual maps. We considered this consensus map to represent an approximate gene order along each of the seven barley chromosomes and used this as a template for GWAS.
We chose to remove rare SNPs from our GWAS datasets. While this is common practice, it results in a substantial loss of information and limits our ability to capture variation associated with rare alleles. Loci with a low MAF (<10%) have less power to detect weak genetic effects than loci with a high MAF (>40%) because of small sample size (Ardlie et al. 2002). Furthermore, previous studies have demonstrated that rare genotypes are more likely to result in spurious findings (Lam et al. 2007) because of a higher relatedness between individuals sharing rare alleles. While it has been shown, in large human GWAS, that including SNP loci with MAF > 5% does not result in inflated false positive rates (Tabangin et al. 2007), due to the complexity of the pedigrees linked to plant populations we decided to remove SNPs with MAF < 10% from our LD and GWAS analyses.
The patterns of genetic diversity along each of the seven barley genetic linkage maps varied amongst the sub-groups identified by both PCO and STRUCTURE analysis. However, the overall genetic diversity in the population remained high. We did observe a 2.9 cM region on barley chromosome 3H that exhibited a sharp decrease in genetic diversity across all germplasm groups. This interval would contain 585 gene models if we assumed absolute conservation of synteny between rice and barley. It may represent a strong signature of selection for non-brittle rachis, a trait involved in non-shattering of ears after ripening and that was important in barley domestication (Komatsuda and Mano 2002; Komatsuda et al. 2004). The position of this 2.9 cM interval on the short arm of chromosome 3H is consistent with that reported in previous studies for non-brittle rachis loci (Kandemir et al. 2004). The BCD706 and ABG396 RFLP markers delimiting the brittle rachis QTL interval (Kandemir et al. 2004) co-segregate with BOPA markers 11_10081 and 11_10137, respectively (Szucs et al. 2009), delimiting a 14.95 cM interval on the consensus map of Close et al. (2009). Brittle rachis in wild barley is controlled by two dominant complementary genes, Btr1 and Btr2, with mutations in either locus (btr1 or btr2) resulting in the non-brittle rachis of cultivated barley. The btr1 allele is present in most occidental cultivars whereas the btr2 allele is present in most oriental cultivars. Interestingly, we did not observe differential patterns of diversity in this region between the European and American Manchurian types. Only seven lines were polymorphic in the region: Mehola and OWB-R, which are both brittle rachis lines, and Dicktoo, Haruna Nijo, Morex, Steptoe and OWB-D, which count among the genetically “exotic” cultivars used as the parental lines of several mapping populations.
The extent of LD has long been of interest for population geneticists as its value determines the required genetic marker number and mapping resolution achievable in GWAS. It is commonly accepted that the extent of LD over short genetic distances is mainly affected by recombination while population structure (or possibly epistasis) largely accounts for long-range LD. Classic algorithms to measure LD ($r^2$ and D′) are useful to explore short and long-range patterns of LD. However, they fail to discriminate between LD caused by genetic linkage and that caused by population structure (Text S1). We therefore also used a mixed model approach with relative kinship estimators as co-factors to investigate LD patterns without most of the population structure effects (Yu et al. 2006). These results can be extrapolated to indicate the expected frequency of false positives and expected resolution in subsequent whole genome scans when using the same marker dataset and statistical model. While we observed high LD extending along the spine of each chromosome, as expected, the mixed model significantly reduced the long-range and inter-chromosomal (background) LD and we subsequently used this approach to assess the mapping resolution achievable by GWAS.
In total, 91.2% of SNPs were positioned by GWAS to within 10 cM of their position predicted on the consensus map of Close et al. (2009). Inspection of the patterns of LD for the 8.8% of SNP markers that did not map within 10 cM of their original genetic map position revealed 3.7 and 5.1% intra- and inter-chromosomal SNP:SNP associations, respectively. We did not find any that mapped into the centromeric regions. This set of markers and those that mapped within 5–10 cM of their original positions are useful for identifying genomic regions where current marker coverage is insufficient, either as a result of very fast decay of LD or the presence of non-tagged SNPs (which we define as those that segregate in only a subset of the germplasm and are not in LD with their flanking SNPs, despite being physically close). The three genetic clusters observed within our sample (two-row spring barleys, winter barleys and Manchurian types) have been in distinct breeding pools for a considerable period of time (Malysheva-Otto et al. 2006) and only a few individuals resulting from inter-cluster crosses, which could potentially introduce recombination events between clusters, are present in the sample. Thus, there is the possibility that SNPs present in only one cluster are not in LD with flanking SNPs at the whole sample level. Supporting this hypothesis, we did observe contrasting PIC values in closely linked SNPs. Further investigation of the SNPs that did not map close to their original genome position and the SNPs that map to more than one genome position should be pursued to determine which fractions relate to genetic map artifacts, to spurious associations due to population structure, and to gene duplication events and epistasis.
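The summary statistic used here (the share of test SNPs whose strongest association lies within a given distance of the SNP's consensus map position) can be sketched as follows. The data layout, the function name and the example positions are hypothetical; a SNP whose best association falls on a different chromosome is counted as outside the threshold.

```python
from typing import List, Tuple

Position = Tuple[str, float]  # (chromosome, consensus map position in cM)


def fraction_within(expected: List[Position],
                    observed: List[Position],
                    threshold_cm: float) -> float:
    """Share of SNPs mapped by association to within threshold_cm of their map position."""
    if len(expected) != len(observed) or not expected:
        raise ValueError("need two non-empty position lists of equal length")
    hits = 0
    for (chrom_e, cm_e), (chrom_o, cm_o) in zip(expected, observed):
        # Inter-chromosomal associations never count as correctly positioned.
        if chrom_e == chrom_o and abs(cm_e - cm_o) <= threshold_cm:
            hits += 1
    return hits / len(expected)


# Made-up positions for four SNPs, purely to show the call pattern.
map_positions = [("1H", 10.0), ("1H", 50.0), ("3H", 20.0), ("5H", 80.0)]
gwas_positions = [("1H", 12.0), ("1H", 58.0), ("3H", 20.5), ("2H", 80.0)]
print(fraction_within(map_positions, gwas_positions, 5.0))   # 0.5
print(fraction_within(map_positions, gwas_positions, 10.0))  # 0.75
```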
Our results have important implications for the design of association mapping studies in barley. We have shown that genetically diverse populations of relatively small size are adequate for fine-mapping simple traits, as long as the trait is segregating across the entire mapping population (MAF > 10% in our case), population sub-structure is effectively controlled in the analysis and the trait does not fall into recombinationally poor regions such as genetic centromeres. Despite these limitations, we show that 87.3% of the SNPs could be mapped by GWAS to within 5 cM (50% within 1 cM) of their position on the Close et al. (2009) consensus map. We also show that if the positive associations do not fall within centromeric regions the mapping resolution achieved was reasonably high (i.e. 50% of the test SNPs mapped to within 27 gene models of the GWAS framework markers). In practical terms, this means that SNPs associated with a simple trait or quantitative locus with strong additive effects are potentially close enough to be used with reasonable confidence in marker-assisted breeding.
The use of markers based on gene sequence data is of special interest because they facilitate exploitation of conservation of synteny with model, fully sequenced grass genomes (e.g. Brachypodium and rice). Consequently, considerable value can be attributed to the targeted identification of new markers that are even closer to a positive association, allowing the interval containing the causal gene to be better delimited. Conserved synteny also helps to predict the number and identity of possible candidate genes, forming the focus of further investigations that may include allele re-sequencing across the association panel to improve genetic resolution, screening an independent GWA panel containing different germplasm or, when available, re-sequencing an allelic series of mutants that affect the target trait. Clearly, the situation is different for traits where candidate genes are not obvious, where there is a breakdown in the conservation of synteny and/or where well-characterized mutant stocks are not available. In those cases, identification of the causal genes will most likely proceed in combination with large and relevant bi-parental mapping populations that can be used for validation and, if considering quantitative characters, after the generation of QTL-near isogenic lines containing alternative alleles. Here, a barley genome sequence will have a big role to play, cutting out the ‘middle men’ (rice/Brachypodium), and providing a true list of positional candidates for more detailed investigation (Schulte et al. 2009).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N et al (2009) Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics 10:582
- Ersoz ES, Yu J, Buckler ES (2008) Applications of linkage disequilibrium and association mapping in crop plants. In: Varshney RK, Tuberosa R (eds) Genomic assisted crop improvement: vol I: Genomic approaches, platforms. Springer Verlag, Germany
- Ersoz ES, Yu J, Buckler ES (2009) Applications of linkage disequilibrium and association mapping in maize. In: Molecular genetic approaches to maize improvement. Springer, Berlin Heidelberg, pp 173–195
- FAO (2002) World agriculture: towards 2015/2030. Summary report. Food and Agriculture Organization of the United Nations, Rome
- Lawrence CJ, Harper LC, Shaeffer ML, Sen TZ, Seigfried TE et al (2008) MaizeGDB: the maize model organism database for basic, translational, and applied research. Int J Plant Genomics
- Payne RW, Murray DA, Harding SA, Baird DB, Soutar DM (2008) GenStat for Windows Introduction, 11th edn. VSN International, Hemel Hempstead, UK
- Wanamaker SI, Close TJ, Roose ML, Lyon M (2008) HarvEST. http://harvest.ucr.edu